Web Database Integration: Wei Liu Xiaofeng Meng


In Proceedings of the Ph.D Workshop in conjunction with VLDB 06 (VLDB-PhD2006),
Seoul, Korea, September 11, 2006

Web Database Integration

Wei Liu Xiaofeng Meng


School of Information School of Information
Renmin University of China Renmin University of China
Beijing, 100872, China Beijing, 100872, China
[email protected] [email protected]

ABSTRACT
More and more accessible databases are available on the Web. To give people unified access to these Web databases and to retrieve information from them automatically, this paper proposes a comprehensive solution for Web database integration. After summarizing the research status in this area, the works which are the focus of my PhD thesis are presented.

(c) 2006 for the individual paper by the paper's authors. Copying permitted for private and scientific purposes. Re-publication of material on this page requires permission by the copyright owners.
Proceedings of the VLDB2006 Ph.D. Workshop
Seoul, Rep of Korea, 2006

1. INTRODUCTION
With the rapid development of the Web, more and more accessible databases are available on the Web. Researchers usually call such a database a Web database (WDB for short). From this angle, the Web can be divided into two parts: the Surface Web and the Deep Web. The Surface Web refers to the static Web pages which can be crawled and indexed by popular search engines, while the Deep Web refers to the contents stored in Web databases and published by dynamic Web pages.

The abundant information stored in Web databases is "hidden" behind the query interfaces in Web pages. This means that the main way people access Web databases is through their query interfaces. Figure 1 gives the query interface provided by Amazon, a very popular e-commerce Web site.

Figure 1: The query interface of Amazon

According to the survey[1] released by UIUC in 2004, there were more than 300,000 Web databases and 450,000 query interfaces available at that time, and the two figures are still increasing quickly. Besides their scale, the contents of Web databases span nearly all topics. Some Deep Web portal services provide Deep Web directories which classify Web databases in some taxonomies. For example, CompletePlanet[2], the biggest Deep Web directory, has collected more than 7,000 Web databases and classified them into 42 topics. Combining the above two aspects, we can conclude that these Web databases are just like a huge repository and provide people a great opportunity to get their desired information.

The proliferation of Web databases is not only an opportunity but also a challenge for people. At present, people access Web databases mainly by a manual approach, and this brings an overhead problem.

Here is an example to explain the problem. Suppose Jane wants to buy a book on Java. There are several tasks she has to complete. First, she must find the Web sites which sell books. If she wants to save money, she needs more Web sites to compare. Second, she fills in the query interfaces with an appropriate query (for example, filling the book title with "think in java") and submits them. Third, when the Web pages containing the query results are returned (these Web pages are generally called response pages), she browses them in turn and chooses the best book. The whole process is time-consuming; Jane may spend half a day on it. Therefore, the challenge of the manual approach is that people often have difficulties in first finding the right sources and then querying over them.

It is urgent and necessary to integrate Web databases, to provide people unified access to them, and to retrieve information automatically. Web database integration can be considered as heterogeneous data source integration in the Web context. Traditional heterogeneous data source integration generally focuses on the heterogeneity and autonomy of data sources. According to my investigation, Web databases also have four distinct characteristics which differ from other heterogeneous data sources:
• Scale: There are myriads of Web databases on the Web, and even under a specific topic the number of Web databases is still striking.

• Dynamic: First, Web databases are very sparsely distributed on the Web, and they appear and disappear endlessly. So searching for appropriate Web databases on the Web is really like looking for a few needles in a haystack. Second, the contents of Web databases are usually updated frequently. Especially in some topics, such as airlines and jobs, every day a batch of new contents is added to Web databases and the outdated part is removed. So the information in Web databases is "ever", not "forever", to you.

• Access through query interfaces: Due to this peculiar access approach, the schema of a Web database cannot be captured directly. We can only infer the schema from its query interfaces and response pages.

• Heterogeneity: The query interfaces and response pages are designed by different people and there are no design standards to follow. Even within the same topic, the query interfaces and response pages are often very dissimilar.

In a word, the research on Web database integration aims to help people make use of the abundant information in Web databases effectively and efficiently. But due to the distinct characteristics of Web databases, there are many challenging research issues in this area.

My PhD thesis focuses on building a Web database integration system and addressing several challenging issues in this area. This paper presents a comprehensive solution for Web database integration and indicates my current and future research works in this area.

There is a fact which should not be neglected. Some Web sites provide Web Services for their Web databases, and people can use a customized program to access Web databases. But this approach has two limitations: first, only a small portion of Web sites provide Web Services for their Web databases; second, this approach depends on a customized program, which is not easy for common users. So in this paper we focus on the popular approach of accessing Web databases through the query interfaces in Web pages.

The rest of this paper is organized as follows. Section 2 gives the solution for Web database integration; Section 3 summarizes the research status in this area; Section 4 presents the works we are focusing on now and will focus on in the future; Section 5 is the conclusion.

2. A SOLUTION FOR WDB INTEGRATION
In this section, a comprehensive solution for Web database integration is proposed, which is the pursuit of my PhD track. Figure 2 shows the architecture of the solution. This solution includes three primary modules: the integrated interface generation module, the query processing module and the result processing module.

[Figure 2 diagram: Interface Integration Module (WDB Discovery, Interface Schema Extraction, WDB Clustering, Interface Integration); Query Process Module (WDB Selection, Query Translation, Query Submission); Result Process Module (Results Extraction, Results Annotation, Data Merging)]
Figure 2: A comprehensive solution for Web database integration

Integrated interface generation module: Produce an integrated interface over the query interfaces of the Web databases to be integrated. There are four components in this module. Their functions are described as follows:

• Web database discovery: Search for Web sites which have Web databases behind them, and identify the query interfaces among the Web pages in these Web sites.

• Query interface schema extraction: Extract the attributes in query interfaces (such as "Title" and "Author" in Figure 1), and the meta-information about each attribute (such as value type, default value, etc).

• Web database clustering by topic: Cluster all discovered Web databases into different groups. The Web databases in each group belong to the same topic.

• Interface integration: Given the Web databases in the same topic, merge the attributes with the same semantics in different query interfaces into a global attribute, and finally form an integrated interface.

Query processing module: Process a user's query filled in on the integrated interface, and submit the query to each Web database. There are three components in this module. Their functions are described as follows:

• Web database selection: Select appropriate Web databases for a user's query in order to get satisfying results at minimal cost.

• Query translation: Try to translate the query on the integrated interface equivalently into a set of local queries on the query interfaces of Web databases.

• Query submission: Analyze the submission approaches of local query interfaces, and submit each local query automatically.

Result processing module: Extract the query results obtained from Web databases, and merge the results together under a global schema. There are three components in this module. Their functions are described as follows:

• Result extraction: Identify and extract the pure results from the response pages returned by Web databases.

• Result annotation: Append the proper semantics to the extracted results.

• Result merging: Merge the results extracted from different Web databases together under a global schema.

These components work together and make up a comprehensive solution for Web database integration. It is not difficult to find that there are dependency relationships between them, which Figure 2 discloses. For example, the query processing module depends on the integrated interface generation module (high level), and interface integration depends on Web database clustering (low level). So the quality of the implementation of a component will greatly affect the next component.

In fact, each component can be considered as a research issue itself. In order to build a practical Web database integration system, these issues must first be solved well in theory. In Section 3, the research status in this area will be discussed.

3. RESEARCH STATUS IN THIS AREA
Until now, a large number of efforts have been devoted to this area. Due to the space limit, the related works cannot be discussed comprehensively and in detail. We only discuss them briefly according to the issues they address, and we also give the representative works.

Unfortunately, the development of research in this area is very uneven despite the great efforts that have been made. Several issues have already been addressed well and are mature enough to rely on (developed issues), some issues are still developing and need deeper research (developing issues), and some issues have not been touched yet (undeveloped issues). We summarize the research status according to the development of these issues.

3.1 Developed Issues
Interface integration: This issue has received considerable attention, and several effective approaches[3][4][5][6] have been proposed to solve it. These approaches match attributes of query interfaces by exploiting the semantic similarity between labels as well as that between data instances.

Query interface schema extraction: In order to understand the query capabilities a query interface supports, [7] treats query interfaces as a visual language, and develops a 2P grammar and a best-effort parser to realize a parsing mechanism.

3.2 Developing Issues
Besides introducing the current approaches for the developing issues, we point out their shortcomings at the same time.

Web database discovery: [9] proposes a strategy that focuses the crawl on a given topic and chooses links to follow within the topic that are more likely to lead to pages containing query interfaces. It cannot guarantee the quantity of discovered Web databases. [10] uses automatic feature generation to describe candidates and C4.5 decision trees to detect query interfaces. It cannot differentiate the query interfaces of search engines from those of Web databases.

Web database clustering: [11] performs the clustering based on the features available on the interface page. [12] proposes an objective function, model-differentiation, to compute the probability that a query interface belongs to a topic. Their accuracy depends on the schema information of query interfaces, so they are not good at dealing with query interfaces with a simple schema.

Result extraction: Many approaches have been proposed to address this issue. Most of them[13][14][15] first transform the response page into an HTML tag tree, then identify and extract data records or data items by analyzing the tree structure and tag information. They can only deal with Web pages written in HTML, which is a latent shortcoming as the Web develops.

Result annotation: This problem is often solved during the process of result extraction. [17] finds the proper annotation of an extracted data item in the response page by some heuristic rules. These rules are very effective if a data item really has its annotation in the response page, but they cannot ensure that all data items get their annotations.

Entity identification: Entity identification is one of the key components of data merging. Several approaches have been proposed to solve this problem. For example, [16] applies a set of domain-independent string transformations to compare the entities' shared attributes in order to identify matching entities. All current approaches assume that they have
achieved a well-built schema match between Web databases, but schema matching in the Web context has not been solved yet.

3.3 Undeveloped Issues
The undeveloped issues include Web database selection, Query translation, and Data merging. These issues have been well studied in some contexts (such as data warehouses), but no approaches have yet been proposed to address them in the context of Web database integration, where they are indispensable.

Among these developing and undeveloped issues, Entity identification, Result extraction and Web database selection are on my PhD track at present and in the future; they are discussed in Section 4.

4. SEVERAL RESEARCH WORKS
In this section, several research works are proposed for discussion, which are being done at present and will be done in the future.

4.1 Entity Identification among Web Databases
Entity identification is a key operation in integrating data from multiple sources. This issue has been well studied for years. As discussed in Subsection 3.2, though several solutions have already been proposed for Web databases, all of them are based on the assumption that the schema match between Web databases has been built well. As is well known, due to the poor structure of Web pages, schema matching in the Web context is very hard, and there is still no automatic solution for it.

So we are trying to find a way to implement entity identification between Web databases without the help of schema matching. Our basic consideration is as follows. We do not try to analyze the structure (or schema) of data records in response pages. Instead, given two Web databases A and B, each data record from A or B is considered as a text document. We judge whether data record a (from A) and data record b (from B) refer to the same entity by comparing their text similarity. Obviously, it is very naive to compute the text similarity of two data records directly, and the accuracy is also not satisfying in our test. The reason is that the parts of a data record differ in importance, and there is much noise information in a data record (for example, the words "author" and "price" often appear in the book data records). In order to make the similarity of a and b more reasonable (ideally, if a and b refer to the same entity, and a and c do not, then the similarity of a and b must be bigger than that of a and c), our approach is implemented as follows:

1. filter the noise information from a and b as far as possible;

2. segment a into several blocks, and formulate each block of a into a query for b;

3. compute the similarity of each block and b;

4. assign an appropriate weight to the similarity of each block and b, and sum them up;

5. judge whether a and b refer to the same entity according to the whole similarity.

At present, we are working on an effective algorithm to train the weights and the threshold of the whole similarity from a small set of sample data record pairs. A data record pair consists of two data records from different Web databases that refer to the same entity. The algorithm is now being detailed. The preliminary experimental result is very satisfying under the book topic. Further experiments under other topics (car, estate, etc.) will be done.

4.2 Vision Based Result Extraction
Most current approaches extract the results from response pages based on the HTML language. But they have several inherent limitations. First, besides HTML, some other languages, such as XML and XHTML, have been introduced to design Web pages. Second, HTML is still evolving. New versions of HTML will be proposed in the future, and new tags may appear and be applied continuously. Third, as more and more Web pages use more complex JavaScript and CSS to influence the structure of Web pages, the applicability of the existing solutions will become lower. Fourth, if HTML is replaced by a new language in the future, then previous solutions will have to be revised greatly or even abandoned, and other approaches must be proposed to accommodate the new language.

Based on such motivations, it is important to find an approach which is vision based and language independent. In the current phase, we only aim at the response pages with multiple data records. Our basic idea is that, though the data records in a response page differ in content, they are similar in appearance. The following is the implementation we are working on:

1. obtain the vision information (such as the font of a text, the size of an image, and their location in the Web page) by accessing the program interface of the Web browser;

2. build a vision based block tree by the VIPS[18] algorithm. A data record is composed of one or more blocks in the vision based block tree. So result extraction here is to find these blocks and judge which blocks compose a data record;

3. locate the data region (the region that contains all data records in a response page) in the vision based block tree;

4. find the boundaries of all data records by computing the vision similarity of blocks in the vision based block tree.

The preliminary experiment has indicated that this approach is not only independent of the HTML language, but also very suitable for extracting information-rich data records.

4.3 Web Database Selection
There are myriads of Web databases on the Web. So a lot of Web databases may be integrated under a topic. If a user submits a query on the integrated interface and the query is dispatched to all the integrated Web databases, it will be time-consuming and costly to process all the returned results, especially for data cleaning and deduplication. In most
cases, we only need to select a few of them to get satisfying results. So Web database selection aims to select appropriate Web databases for a given user's query on the integrated interface, which can help users get their desired results at the lowest cost.

Figure 3: An example for Web Database Selection

In order to judge whether a Web database should be selected to answer a given query, there are two aspects that must be considered. One is the pertinence of the Web database to the given query; the other is the query capability of the query interface of the Web database. The following gives some of our considerations about the two aspects.

The prerequisite for selecting a Web database is that it is pertinent to the given query. In the extreme case, it is meaningless to query a Web database if it does not have any useful information for the query. Figure 3 gives an example to illustrate this. Suppose A, B, C, and D are four Web databases, and q is a query to them. The size of A, B, C and D is the number of data records in each, and the size of q is the number of data records that satisfy q. Intuitively, C does not satisfy q at all, B satisfies q partly, and A and D can satisfy q completely, but D is a better selection than A. So we need to obtain the features of Web databases in advance. The features of a Web database include the size, the update ratio, the distribution on each attribute, etc. Because we can only access a Web database through its query interface, it is impossible to understand a Web database directly. The challenge is how to obtain the features through the query interface only. In the future, we want to design a sample records retriever to address this problem. The sample records retriever is a tool that can obtain a small set of data records which are distributed evenly in the Web database. We can profile the Web database by analyzing the obtained data records. The sample records retriever should have two components: a query interface analyzer and a query generator. The query interface analyzer obtains the necessary information about each attribute; the query generator produces a set of smart queries according to the information obtained by the query interface analyzer.

Query interfaces often differ in query capability among Web databases, and this influences the accuracy of a query. For example, in the book topic, a query on the integrated interface is “title=java and price<20$”. If the query interface of a Web database contains both attributes, it can answer the query accurately. But if it only contains the attribute “title” or “price”, then the results returned from the Web database will contain quite a few data records which do not satisfy the query. So the challenging task is how to make the returned results satisfying (for example, the minimal superset or maximal subset of the query).

5. CONCLUSIONS
With the rapid increase of Web databases, it is urgent to integrate these Web databases, provide people unified access to them, and retrieve information automatically. In this paper, a comprehensive solution for Web database integration is proposed. There are a number of components in the solution, and each of them is also a research issue in this area. After summarizing the research status of the issues in this area, we introduce the issues which are being focused on now and will be addressed in the future. In conclusion, the focuses of my PhD thesis are building a Web database integration system and addressing several issues in this area.

6. REFERENCES
[1] K. C. Chang, B. He, C. Li, M. Patel, Z. Zhang. Structured Databases on the Web: Observations and Implications. SIGMOD Record 33(3): 61-70 (2004).
[2] http://www.completeplanet.com/.
[3] B. He, K. C. Chang. Statistical Schema Matching across Web Query Interfaces. SIGMOD Conference 2003: 217-228.
[4] H. He, W. Meng, C. T. Yu, Z. Wu. WISE-Integrator: An Automatic Integrator of Web Search Interfaces for E-Commerce. VLDB Conference 2003: 117-128.
[5] W. Wu, A. Doan, C. T. Yu. WebIQ: Learning from the Web to Match Deep-Web Query Interfaces. ICDE Conference 2006.
[6] E. Dragut, W. Wu, A. P. Sistla, C. T. Yu, W. Meng. Merging Source Query Interfaces on Web Databases. ICDE Conference 2006.
[7] Z. Zhang, B. He, K. C. Chang. Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax. SIGMOD Conference 2004: 107-118.
[8] H. He, W. Meng, C. T. Yu, Z. Wu. Automatic extraction of web search interfaces for interface schema integration. WWW Conference 2004: 414-415.
[9] L. Barbosa, J. Freire. Searching for Hidden-Web Databases. WebDB 2005: 1-6.
[10] J. Cope, N. Craswell, D. Hawking. Automated Discovery of Search Interfaces on the Web. ADC Conference 2003: 181-189.
[11] Q. Peng, W. Meng, H. He, C. T. Yu. WISE-cluster: clustering e-commerce search engines automatically. WIDM 2004: 104-111.
[12] B. He, T. Tao, K. C. Chang. Clustering Structured Web Sources: A Schema-Based, Model-Differentiation Approach. EDBT 2004: 536-546.
[13] B. Liu, R. L. Grossman, Y. Zhai. Mining data records in Web pages. KDD Conference 2003: 601-606.
[14] Y. Zhai, B. Liu. Web data extraction based on partial tree alignment. WWW Conference 2005: 76-85.
[15] H. Zhao, W. Meng, Z. Wu, V. Raghavan, C. T. Yu. Fully automatic wrapper generation for search engines. WWW Conference 2005: 66-75.
[16] S. Tejada, C. A. Knoblock, S. Minton. Learning domain-independent string transformation weights for high accuracy object identification. KDD Conference 2002: 350-359.
[17] J. Wang, F. H. Lochovsky. Data extraction and label assignment for web databases. WWW Conference 2003: 187-196.
[18] D. Cai, S. Yu, J. Wen, W. Ma. Extracting Content Structure for Web Pages Based on Visual Representation. APWeb Conference 2003: 406-417.
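
As a rough illustration of the five entity identification steps listed in Section 4.1, the following Python sketch may help. It is not the algorithm of the thesis: the noise-word list, the fixed-size block segmentation, the token-overlap similarity, the uniform weights and the threshold of 0.6 are all assumptions made for this example, whereas the paper intends to train the weights and threshold from sample data record pairs.

```python
import re

# Hypothetical noise words (step 1); the paper only names "author" and "price".
NOISE_WORDS = {"author", "price", "by", "the"}

def tokens(text):
    """Lowercase word tokens with noise words filtered out (step 1)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in NOISE_WORDS]

def blocks(text, size=3):
    """Segment a record into fixed-size token blocks (step 2, assumed scheme)."""
    toks = tokens(text)
    return [toks[i:i + size] for i in range(0, len(toks), size)]

def block_similarity(block, record_tokens):
    """Fraction of the block's tokens that occur in record b (step 3)."""
    if not block:
        return 0.0
    target = set(record_tokens)
    return sum(1 for t in block if t in target) / len(block)

def whole_similarity(a, b, weights=None, size=3):
    """Weighted sum of the block similarities (step 4); uniform weights by default."""
    a_blocks = blocks(a, size)
    if not a_blocks:
        return 0.0
    if weights is None:
        weights = [1.0 / len(a_blocks)] * len(a_blocks)
    b_tokens = tokens(b)
    return sum(w * block_similarity(blk, b_tokens)
               for w, blk in zip(weights, a_blocks))

def same_entity(a, b, threshold=0.6):
    """Decide whether a and b refer to the same entity (step 5)."""
    return whole_similarity(a, b) >= threshold

a = "Thinking in Java by Bruce Eckel price 35.00"
b = "Thinking in Java, Author: Bruce Eckel, Price: $35.00"
c = "Effective C++ Author: Scott Meyers Price: $40.00"
print(same_entity(a, b), same_entity(a, c))
```

With these toy records, a and b score higher than a and c, which is the ordering property Section 4.1 asks of the similarity measure.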

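
The two selection criteria of Section 4.3 (pertinence and query capability) can also be sketched in Python. Everything here is hypothetical: the `WebDB` structure, the idea of estimating pertinence from a small sample of records (standing in for the output of the proposed sample records retriever), and the ranking policy of capability first and pertinence second are assumptions for illustration, not results from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class WebDB:
    name: str
    attributes: set                              # attributes on its query interface
    sample: list = field(default_factory=list)   # sampled records (hypothetical)

def pertinence(db, predicate):
    """Estimated fraction of the database's records that satisfy the query."""
    if not db.sample:
        return 0.0
    return sum(1 for r in db.sample if predicate(r)) / len(db.sample)

def capability(db, query_attrs):
    """Fraction of the query's attributes the local interface can express."""
    return len(query_attrs & db.attributes) / len(query_attrs)

def select(dbs, query_attrs, predicate, k=2):
    """Drop non-pertinent databases, then rank by capability and pertinence."""
    scored = []
    for db in dbs:
        p = pertinence(db, predicate)
        if p == 0.0:
            continue  # nothing useful for this query, so skip the database
        scored.append((capability(db, query_attrs), p, db.name))
    scored.sort(reverse=True)
    return [name for _, _, name in scored[:k]]

# The query from Section 4.3: title=java and price<20$
query_attrs = {"title", "price"}
def q(record):
    return "java" in record["title"].lower() and record["price"] < 20

dbs = [
    WebDB("A", {"title", "price"}, [
        {"title": "Thinking in Java", "price": 15},
        {"title": "Java in Depth", "price": 30},
        {"title": "C++ Primer", "price": 10},
        {"title": "Python Basics", "price": 25},
    ]),
    WebDB("B", {"title"}, [
        {"title": "Java Basics", "price": 10},
        {"title": "Java Recipes", "price": 12},
    ]),
    WebDB("C", {"title", "price"}, [
        {"title": "Cooking at Home", "price": 5},
    ]),
]
print(select(dbs, query_attrs, q))
```

In this toy setup database C is dropped because none of its sampled records satisfy the query, and A is ranked ahead of B because its interface supports both query attributes.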