Scalable Data Extraction Via Program Synthesis

Adi Omari, Ph.D. Thesis Seminar
Thursday, 11.1.2018, 12:30
Taub 601
Prof. Eran Yahav

Web extraction is an important research topic that has been studied extensively. Large amounts of data are produced and consumed online at a continuously growing rate. The ability to collect and analyze these data has become essential for enabling a wide range of applications and improving the effectiveness of modern businesses. Web extraction methods facilitate the collection and analysis of these data by transforming the human-friendly data available online into structured information that can be automatically manipulated and analyzed. In this work we address the problem of data extraction from a software synthesis perspective. Our goal is not only to extract data from websites, but to synthesize the programs that extract the data. The popularity of data extraction query languages and their use in a wide variety of applications make them a natural target for automatic synthesis methods. Another motivation for using synthesis in web extraction applications is that web applications are often generated dynamically using template code. Reverse engineering web pages at the page and site level may facilitate, among other applications, unsupervised web extraction.
We first focus on the problem of automatic synthesis of web crawlers for a family of websites that contain the same kind of information but differ in layout and formatting. We propose a method that uses the data shared among sites from the same category to reduce or eliminate the manual tagging needed to generate extraction schemes for these sites. We use the data on one site to identify data on another site. The identified data is then used to learn the website structure and synthesize an appropriate extraction scheme. This process iterates, as synthesized extraction schemes yield additional data that is used to re-learn the website structure.
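The iterative process described above can be illustrated with a minimal Python sketch. It is not the method presented in the thesis: the page representation (a list of path/text pairs), the function names synthesize_extractor and bootstrap, and the sample sites are all hypothetical, and the "extraction scheme" is reduced to a single element path chosen by majority vote.

```python
from collections import Counter

def synthesize_extractor(page, known_values):
    """Pick the element path that most often holds an already-known value.

    `page` is a list of (path, text) pairs; returns None if nothing matches.
    """
    hits = Counter(path for path, text in page if text in known_values)
    return hits.most_common(1)[0][0] if hits else None

def bootstrap(sites, seed_values):
    """Repeat until no site yields a new path or new data (a fixpoint)."""
    known = set(seed_values)
    extractors = {}
    changed = True
    while changed:
        changed = False
        for name, page in sites.items():
            path = synthesize_extractor(page, known)
            if path is None:
                continue  # no shared data found on this site yet
            extracted = {text for p, text in page if p == path}
            if extractors.get(name) != path or not extracted <= known:
                extractors[name] = path
                known |= extracted  # new data feeds the next round
                changed = True
    return extractors, known

# Hypothetical book sites: same kind of data, different layouts.
sites = {
    "site_a": [("div.book/h2", "Dune"), ("div.book/h2", "Emma"),
               ("div.footer", "Contact")],
    "site_b": [("li/span.title", "Emma"), ("li/span.title", "Ulysses"),
               ("li/span.price", "$9")],
    "site_c": [("td.name", "Ulysses"), ("td.name", "Walden"),
               ("td.year", "1854")],
}

extractors, known = bootstrap(sites, {"Dune"})
print(extractors)
print(sorted(known))
```

Starting from a single seed value, each site's extractor is learned from data found on a previous site, and the titles it extracts become the evidence used on the next one.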
To address the problem of unsupervised extraction, we propose a solution to the more general problem of separating web pages into template code and data. Web pages are often served by running layout code on data, producing an HTML document that formats the data into an elegant, human-readable presentation. We consider the opposite task: separating a given web page into a data component and a layout program. This separation has several important applications: unsupervised data extraction, traffic compression, data migration, and template-code simplification. In our most recent work, we generalize the separation approach to address the problem of site-level separation. Finally, we address the problem of synthesizing robust data extractors from a family of websites that contain the same kind of information. We introduce and implement the idea of forgiving extractors that dynamically adjust their precision to handle structural changes, without the need to sacrifice precision upfront.
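One way to picture a forgiving extractor is the short Python sketch below. It is only an illustration under stated assumptions, not the technique developed in the thesis: pages are modeled as lists of (path, text) pairs, the function name forgiving_extract is hypothetical, and "adjusting precision" is approximated by dropping leading steps of a path until something matches.

```python
def forgiving_extract(page, path):
    """Try the full path first; relax it step by step only if it fails.

    `page` is a list of (path, text) pairs. Returns the matched texts and
    the (possibly relaxed) path that produced them, or ([], None).
    """
    steps = path.split("/")
    for start in range(len(steps)):
        suffix = steps[start:]  # drop `start` leading steps
        matches = [text for p, text in page
                   if p.split("/")[-len(suffix):] == suffix]
        if matches:
            return matches, "/".join(suffix)
    return [], None

# Hypothetical path learned on the original layout.
PATH = "html/body/div.main/ul/li/span.price"

old_page = [
    ("html/body/div.main/ul/li/span.price", "$9"),
    ("html/body/div.main/ul/li/span.price", "$12"),
    ("html/body/div.footer/span.note", "Prices in USD"),
]

# Same data after a redesign: div.main was replaced by section.
new_page = [
    ("html/body/section/ul/li/span.price", "$9"),
    ("html/body/section/ul/li/span.price", "$12"),
    ("html/body/footer/span.note", "Prices in USD"),
]

print(forgiving_extract(old_page, PATH))
print(forgiving_extract(new_page, PATH))
```

On the unchanged page the full, most precise path is used; only after the layout change does the extractor fall back to the shorter path ul/li/span.price, so precision is given up only when needed.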
