Problem
The unabated growth of the Web has resulted in a situation in which more
information is available to more people than ever in human history. Along with
this unprecedented growth has come the inevitable problem of information overload.
To counteract this information overload, users typically rely on search engines
(like Google and AllTheWeb) or on manually-created categorization hierarchies
(like Yahoo! and the Open Directory Project). Though excellent for accessing
Web pages on the so-called "crawlable" web, these approaches overlook
a much more massive and high-quality resource: the Deep Web.
The Deep Web
(or Hidden Web) comprises all information that resides in autonomous databases
behind portals and information providers' web front-ends. Web pages in the Deep
Web are dynamically generated in response to a query submitted through a web site's
search form and often contain rich content. A recent study has estimated the
size of the Deep Web to be more than 500 billion pages, whereas the size of the
"crawlable" web is only 1% of the Deep Web (i.e., less than 5 billion
pages). Even those web sites with some static links that are
"crawlable" by a search engine often have much more information
available only through a query interface. Unlocking this vast Deep Web content
presents a major research challenge.
By analogy with
search engines over the "crawlable" web, we argue that one way to
unlock the Deep Web is to employ a fully automated approach to extracting,
indexing, and searching the query-related information-rich regions from dynamic
web pages. For this miniproject, we focus on the first of these: extracting
data from the Deep Web.
Extracting the
interesting information from a Deep Web site requires several things, including
scalable and robust methods for analyzing dynamic web pages of a given web
site, discovering and locating the query-related information-rich content
regions, and extracting itemized objects within each region. By full
automation, we mean that the extraction algorithms should be designed
independently of the presentation features or specific content of the web
pages, such as the specific ways in which the query-related information is laid
out or the specific locations where the navigational links and advertisement
information are placed in the web pages.
There are many
possible 7001-miniprojects. Feel free to talk to either of us for more details.
Here are a few possibilities to consider:
1. Develop a
Web-based demo for clustering pages of a similar type from a single Deep Web
source. For example, AllMusic produces three types of pages in response to a
user query: a direct match page (e.g. for Elvis Presley), a list of links to
match pages (e.g. a list of all artists named Jackson), and a page with no
matches. As a first step toward extracting the relevant data from each page, you
may develop techniques to separate out the pages that contain query matches
from pages that contain no matches, and perhaps, rank each group based on some
metric of quality.
2. Design a
system for extracting interesting data from a collection of pages from a Deep
Web source. You might define a set of regular expressions that can identify
dates, prices, or names. Develop a small program that converts a page into a
type structure. For example, given a DOM model of a web page, identify all of
the types that you have defined, and replace the matching string tokens with XML
tags identifying the types. Replace all non-type tokens with a generic type, and
return the tree as a full type structure. Alternatively, you may suggest your
own approach for extracting data.
3. Develop a
system to recognize names in a page. Given a list of names and a web page,
identify possible matches in the page. Based on the structure of the page and
the distribution of recognized names, identify strings that may also be names
based on their location in the DOM tree hierarchy representing the page.
4. Write a
survey paper about current approaches for understanding and analyzing the Deep
Web. Be sure to include many of your own comments on the viability of the
approaches you review.
5. Or, feel
free to suggest a miniproject of your own.
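
To make option 2 above more concrete, here is a minimal Python sketch of one
way to turn a page into a type structure. It is only an illustration under
stated assumptions, not a prescribed design: the two patterns (a date written
as M/D/YYYY and a dollar price), the generic type name "text", and the names
TYPE_PATTERNS, token_type, TypeStructureBuilder, and type_structure are all
hypothetical. It uses the standard library's event-based HTML parser rather
than a full DOM, and it assumes well-formed markup in which every tag is
closed.

```python
import re
from html.parser import HTMLParser

# Hypothetical type patterns; real pages would need broader expressions.
TYPE_PATTERNS = [
    ("date", re.compile(r"\d{1,2}/\d{1,2}/\d{4}")),
    ("price", re.compile(r"\$\d+(?:\.\d{2})?")),
]


def token_type(token):
    """Return the first type whose pattern matches the whole token.

    Tokens that match no pattern get the generic type "text".
    """
    for name, pattern in TYPE_PATTERNS:
        if pattern.fullmatch(token):
            return name
    return "text"


class TypeStructureBuilder(HTMLParser):
    """Re-emits the tag tree of a page with text tokens replaced by type tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped: only the tag skeleton is kept.
        self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Split text on whitespace and replace each token with its type.
        for token in data.split():
            self.parts.append(f"<{token_type(token)}/>")


def type_structure(html):
    """Return the type structure of an HTML fragment as a string."""
    builder = TypeStructureBuilder()
    builder.feed(html)
    builder.close()
    return "".join(builder.parts)
```

For instance, type_structure("<ul><li>Blue Hawaii 10/20/1961 $9.99</li></ul>")
returns "<ul><li><text/><text/><date/><price/></li></ul>". Pages generated from
the same template should yield similar type structures, which is one possible
starting point for locating repeated, query-related regions.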
Background: Knowledge of Java or Python would
be helpful. Some knowledge of information retrieval and machine learning may be
useful but is not required.
Deliverables: You should submit a report that
clearly describes what you have learned and what you have accomplished. The
report should include useful references. You should also provide any source
code you may have written to validate your ideas.
Evaluation: You will be graded on the novelty
and quality of your report and implementation.