Many web sites contain large sets of pages generated using a common
template or layout. For example, Amazon lays out the author, title, comments,
etc. in the same way in all its book pages. The values used to generate the
pages (e.g., the author, title, ...) typically come from a database. In this
paper, we study the problem of automatically extracting the database values
from the web pages without any learning examples or other similar human input.
We formally define the notion of a template, and propose a model that describes
how values are encoded into pages using a template. We present an extraction
algorithm that uses sets of words that have similar occurrence patterns in the
input pages to construct the template. The constructed template is then used
to extract values from the pages. We show experimentally that the extracted
values make semantic sense in most cases.
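
The idea of grouping words by their occurrence pattern can be illustrated with a minimal sketch. It is not the extraction algorithm itself: it assumes whitespace tokenization, treats two words as similar only when their per-page counts match exactly, and uses a hypothetical helper name (`occurrence_classes`) and made-up sample pages. Words that occur with the same count in every page are likely to belong to the template, while words that vary from page to page are likely to be values.

```python
from collections import Counter, defaultdict


def occurrence_classes(pages, min_size=2):
    """Group tokens that have identical per-page occurrence counts.

    Illustrative assumption: tokens are whitespace-separated and only
    groups whose tokens appear in every page are kept as template candidates.
    """
    counts = [Counter(page.split()) for page in pages]
    vocabulary = set().union(*counts) if counts else set()

    groups = defaultdict(set)
    for token in vocabulary:
        # The occurrence vector records how often the token appears in each page.
        vector = tuple(c[token] for c in counts)
        groups[vector].add(token)

    return [
        sorted(tokens)
        for vector, tokens in groups.items()
        if len(tokens) >= min_size and all(n > 0 for n in vector)
    ]


# Hypothetical sample pages generated from one layout.
pages = [
    "Title: Dune Author: Herbert",
    "Title: Emma Author: Austen",
    "Title: Ulysses Author: Joyce",
]
classes = occurrence_classes(pages)
```

In this toy input, the tokens "Title:" and "Author:" share the same occurrence vector across all pages and end up in one group, whereas the book-specific words do not appear in every page and are filtered out.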