Web data integration

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by 2a01:4c8:3f:434f:28f6:f958:77bf:87c3 (talk) at 14:53, 14 June 2019 (Changing Web Integration to its know and real name Web Scraping and adding more sources of general interest on the legal implications.). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Web Scraping is the process of aggregating and managing data from different websites in a single, homogeneous workflow. The process covers access to data on third-party websites, followed by transformation, mapping, quality assurance and fusion of that data.
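The mapping and fusion steps can be illustrated with a minimal Python sketch. The source names (shop_a, shop_b), their field names, the shared schema and the functions map_record and fuse are all hypothetical, chosen only for illustration; real workflows typically need far more robust record matching than a normalized name.

```python
# Hypothetical per-source field mappings onto one shared schema (assumption).
FIELD_MAPS = {
    "shop_a": {"title": "name", "cost": "price"},
    "shop_b": {"product_name": "name", "price_usd": "price"},
}


def map_record(source, raw):
    """Rename a raw record's fields to the shared schema and tag its origin."""
    mapping = FIELD_MAPS[source]
    mapped = {target: raw[field] for field, target in mapping.items() if field in raw}
    mapped["source"] = source
    return mapped


def fuse(records):
    """Merge mapped records describing the same item, keyed by normalized name."""
    fused = {}
    for record in records:
        key = record["name"].strip().lower()
        entry = fused.setdefault(key, {"name": record["name"].strip(), "offers": []})
        entry["offers"].append(
            {"source": record["source"], "price": float(record["price"])}
        )
    return fused


# Made-up sample data standing in for records collected from two websites.
raw_inputs = [
    ("shop_a", {"title": "Example Widget", "cost": "19.99"}),
    ("shop_b", {"product_name": "example widget ", "price_usd": "18.50"}),
    ("shop_b", {"product_name": "Example Gadget", "price_usd": "5.00"}),
]

mapped = [map_record(source, raw) for source, raw in raw_inputs]
catalog = fuse(mapped)
```

Here the two "Example Widget" records from different sources end up under one key with two offers, which is the simplest form of fusion; keeping the source on each offer preserves provenance for later quality checks.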

Web Scraping has historically been seen as a legal grey area, but it is increasingly used, especially in the financial sector. Lawsuits are now casting uncertainty over web scraping companies such as California-based import.io, which has attracted millions in investment. https://www.courthousenews.com/linkedin-takes-data-scraping-fight-to-ninth-circuit/

http://blog.galkinlaw.com/weblaw-scout-blog/legality-of-data-scraping

https://venturebeat.com/2018/12/18/import-io-raises-15-5-million-for-ai-that-extracts-web-data/



Web Scraping techniques form the foundation for businesses taking advantage of the data available on the ever-increasing number of publicly accessible websites.[1] Corporate spending in this area amounted to about USD 2.5bn in 2017, and the market is expected to reach almost USD 7bn by 2020.[2]

Sources

Web Scraping extends and specializes data extraction by treating the web as a collection of views of databases that are accessible over web protocols, including, but not limited to:[3]

  • Open data catalogs
  • Government data catalogs
  • Web applications and sites
    • UI
    • API
  • The semantic web (SPARQL)
  • HTML embedded structured data
  • HTML data tables
  • Spreadsheets
  • PDFs
  • Online encyclopedias
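As an illustration of one source type from the list above, HTML embedded structured data is often published as JSON-LD inside script elements. The following is a minimal sketch using only the Python standard library; the class JsonLdExtractor, the function extract_json_ld and the sample page are hypothetical, and the sketch does not cover other embedding formats such as Microdata or RDFa.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than abort the whole page


def extract_json_ld(html):
    """Return a list of JSON-LD objects found in an HTML document."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Made-up page used only to exercise the sketch.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Example Widget",
 "offers": {"price": "19.99", "priceCurrency": "USD"}}
</script>
<script>var unrelated = 1;</script>
</head><body><p>Example Widget</p></body></html>
"""

products = extract_json_ld(SAMPLE_PAGE)
```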

Data access and transformation

Web Scraping poses technical challenges that differ from those of conventional data integration. Web data sources are often unstructured or semi-structured and offer no standard query mechanism, so both data access and data transformation require additional work.
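A small example of such a transformation is turning a semi-structured HTML data table into uniform records. The sketch below uses only the Python standard library and assumes a simple table whose first row holds the column headers, with no nested tables or merged cells; TableParser, table_to_records and the sample markup are hypothetical names and data introduced for illustration.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the text of each cell, row by row, from simple HTML tables."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_records(html):
    """Return one dict per body row, keyed by the header row's cell text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Made-up markup used only to exercise the sketch.
SAMPLE_TABLE = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td> 19.99 </td></tr>
  <tr><td>Gadget</td><td>5.00</td></tr>
</table>
"""

records = table_to_records(SAMPLE_TABLE)
```

The result is a list of dictionaries with the same keys for every row, which is a form that later mapping and fusion steps can query, unlike the original markup.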

Data quality

Understanding the quality and veracity of data is even more important in Web Scraping than in data integration, because web data is generally of lower quality and less implicitly trusted than data collected from a trusted source. There have been attempts to automate trust ratings for web data.[4]

In data integration, data quality can generally be assessed after data access and transformation. In Web Scraping, quality may need to be monitored while the data is being collected, because re-collecting the data costs both time and money.
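Monitoring quality during collection can be sketched as a set of checks applied to each record as it arrives, with the run stopped early once too many records fail. The sketch below is illustrative only: the functions check_record and collect_with_monitoring, the field names, and the thresholds max_error_rate and min_sample are assumptions, not part of any standard tool.

```python
def check_record(record, required_fields):
    """Return a list of problems found in one record (empty if it looks fine)."""
    problems = []
    for field in required_fields:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            problems.append(f"missing {field}")
    price = record.get("price")
    if price not in (None, ""):
        try:
            if float(price) < 0:
                problems.append("negative price")
        except (TypeError, ValueError):
            problems.append("non-numeric price")
    return problems


def collect_with_monitoring(records, required_fields, max_error_rate=0.5, min_sample=3):
    """Check records as they arrive; abort once the failure rate gets too high."""
    accepted, rejected = [], []
    for record in records:
        problems = check_record(record, required_fields)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
        seen = len(accepted) + len(rejected)
        # Stop early instead of finishing a run that would have to be repeated.
        if seen >= min_sample and len(rejected) / seen > max_error_rate:
            raise RuntimeError(
                f"aborting collection: {len(rejected)} of {seen} records failed checks"
            )
    return accepted, rejected


# Made-up records standing in for data arriving from a scraper.
incoming = [
    {"name": "Widget", "price": "19.99"},
    {"name": "", "price": "5.00"},
    {"name": "Gadget", "price": "7.25"},
    {"name": "Gizmo", "price": "abc"},
]

accepted, rejected = collect_with_monitoring(incoming, ("name", "price"))
```

Because records can be passed in as a generator, the checks run while collection is still in progress, so a source that has changed its layout is detected after a few records rather than after the whole crawl.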

Applications

Web Scraping has applications in many fields, including bioinformatics,[5] search engines,[6] price comparison,[7] and forensic search.[8]

References

  1. ^ "IE 670 Web Data Integration". www.uni-mannheim.de. 2019-01-24. Retrieved 2019-02-11.
  2. ^ "Opimas: The Web Data Extraction Market". Opimas: We begin with an understanding. Retrieved 2019-02-12.
  3. ^ "Introduction :: Web Data Integration". www.webdataintegration.io. Retrieved 2019-02-14.
  4. ^ Giménez-García, José M.; Thakkar, Harsh; Zimmermann, Antoine (2016). Sack, Harald; Rizzo, Giuseppe; Steinmetz, Nadine; Mladenić, Dunja; Auer, Sören; Lange, Christoph (eds.). "Assessing Trust with PageRank in the Web of Data". The Semantic Web. Lecture Notes in Computer Science. 9989. Springer International Publishing: 293–307. doi:10.1007/978-3-319-47602-5_45. ISBN 9783319476025.
  5. ^ "Web Data Integration". Database Group Leipzig.
  6. ^ "Web-scale Data Integration - You Can Only Afford to Pay as You Go". www.datascienceassn.org. Retrieved 2019-02-12.
  7. ^ Siegel, Michael D.; Madnick, Stuart E.; Zhu, Hongwei (2008). "Enabling global price comparison through semantic integration of web data". Retrieved 2019-02-12.
  8. ^ "PwC buys Kusiri, London-based fraud detection start-up". www.consultancy.uk. Retrieved 2019-02-12.