Data Integration Using Similarity Joins and A Word-Based Information Representation Language
The integration of distributed, heterogeneous databases, such as those available on the World
Wide Web, poses many problems. Here we consider the problem of integrating data from
sources that lack common object identifiers. A solution to this problem is proposed for
databases that contain informal, natural-language “names” for objects; most Web-based
databases satisfy this requirement, since they usually present their information to the
end-user through a veneer of text. We describe WHIRL, a “soft” database management system
which supports “similarity joins,” based on certain robust, general-purpose similarity metrics
for text. This enables fragments of text (e.g., informal names of objects) to be used as keys.
WHIRL includes textual objects as a built-in type, similarity reasoning as a built-in predicate,
and answers every query with a list of answer substitutions that are ranked according to an
overall score. Experiments show that WHIRL is much faster than naive inference methods,
even for short queries, and efficient on typical queries to real-world databases with tens of
thousands of tuples. Inferences made by WHIRL are also surprisingly accurate, equaling the
accuracy of hand-coded normalization routines on one benchmark problem, and outperforming
exact matching with a plausible global domain on a second.
Categories and Subject Descriptors: H.2.5 [Information Systems]: Database Management—
heterogeneous databases; H.2.3 [Information Systems]: Database Management—data
manipulation languages; query languages; H.3.3 [Information Storage and Retrieval]:
Information Search and Retrieval—retrieval models; performance evaluation
General Terms: Reliability
1. INTRODUCTION
Integration of distributed, heterogeneous databases, sometimes known as
data integration, is an active area of research in the database community
[Duschka and Genesereth 1997b; Levy et al. 1996b; Arens et al. 1996;
Garcia-Molina et al. 1995; Tomasic et al. 1997; Bayardo et al. 1997].
Largely inspired by the proliferation of database-like sources on the World
Wide Web, previous researchers have addressed a diverse set of problems,
Author’s address: WhizBang Labs, 4616 Henry Street, Pittsburgh, PA 15213.
© 2000 ACM 1046-8188/00/0700 –0288 $05.00
ACM Transactions on Information Systems, Vol. 18, No. 3, July 2000, Pages 288–321.
thus make the natural assumption that all data are stored in relations, but
that the primitive elements of each relation are fragments of text, rather
than character strings or numbers. We call this data model STIR (for
Storing Texts In Relations).
To represent text fragments, we adopt the widely used vector space model
[Salton 1989], which we will now briefly review. We assume a vocabulary T
of terms, which will be treated as atomic; terms might include words,
phrases, or word stems (morphologically derived word prefixes). A fragment
of text is represented as a document vector: a vector of real numbers
$\vec{v} \in \mathbb{R}^{|T|}$, each component of which corresponds to a term $t \in T$. We will denote
the component of $\vec{v}$ which corresponds to $t \in T$ by $v^t$.
A number of schemes have been proposed for assigning weights to terms.
We found it convenient to adopt the widely used TF-IDF weighting scheme
with unit length normalization. Assuming that the document represented
by $\vec{v}$ is a member of a document collection C, define $v^t$ to have the value
zero if t is not present in the document represented by $\vec{v}$, and otherwise the
value $v^t = \log(\mathit{TF}_{\vec{v},t} + 1) \cdot \log(\mathit{IDF}_t)$, where the “term frequency” $\mathit{TF}_{\vec{v},t}$ is
the number of times that term t occurs in the document represented by $\vec{v}$,
and the “inverse document frequency” $\mathit{IDF}_t$ is $|C| / n_t$, where $n_t$ is the
number of documents in C that contain the term t. The collection C
associated with a document vector $\vec{v}$ will (usually) be the set of text
fragments appearing in the same column of the same relation as $\vec{v}$.
The similarity of two document vectors $\vec{v}$ and $\vec{w}$ is given by the formula

$$\mathrm{sim}(\vec{v}, \vec{w}) = \sum_{t \in T} \frac{v^t \cdot w^t}{\|\vec{v}\| \cdot \|\vec{w}\|}$$
1
Notice that this representation ignores all information about word order; thus the two strings
“Cohen, William W.” and “William W. Cohen” would be mapped to identical vectors. The vector
space model can be extended to include some word-order information (e.g., Fagan [1989]);
however, in our experience, word order is seldom necessary to distinguish between object
names.
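The weighting scheme and similarity function above can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the WHIRL implementation: the tokenizer, the function names `tokenize`, `tfidf_vectors`, and `sim`, and the four sample names are all hypothetical, and vectors are stored as sparse dictionaries. Because the vectors are normalized to unit length, the similarity reduces to a dot product.

```python
import math
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase, treat punctuation as whitespace.
    return "".join(c if c.isalnum() else " " for c in text.lower()).split()

def tfidf_vectors(docs):
    """Map each text fragment to a unit-length TF-IDF vector (dict: term -> weight)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # n_t: number of documents in the collection that contain term t
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: math.log(tf[t] + 1) * math.log(n_docs / doc_freq[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # Unit-length normalization; a document with only zero weights maps to {}.
        vectors.append({t: w / norm for t, w in raw.items()} if norm > 0 else {})
    return vectors

def sim(v, w):
    """Similarity of two unit-length vectors: the sum over shared terms of v^t * w^t."""
    return sum(weight * w.get(t, 0.0) for t, weight in v.items())

names = ["Cohen, William W.", "William W. Cohen", "Armadillos, Inc", "Acme Inc"]
vectors = tfidf_vectors(names)
same_person = sim(vectors[0], vectors[1])   # word order is ignored
share_inc = sim(vectors[2], vectors[3])     # only the frequent term "inc" is shared
```

In this toy collection the two orderings of the same name receive identical vectors, while the two company names that share only the comparatively frequent term "inc" receive a much lower similarity.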
$Q_1$: p(Company1,Industry) ∧ q(Company2,WebSite) ∧ Company1 ~ Company2
$Q_2$: p(Company1,Industry) ∧ q(Company2,WebSite) ∧ Company1 ~ Company2 ∧ const1(IO) ∧ Industry ~ IO
2
Note that $R_Q$ is not always unique, since ties in score can be broken by any method.
3
These lists are the relations HooverWeb and Iontech from Table IV.
$$\{\langle A \leftarrow Q, \theta, s \rangle : \langle A \leftarrow Q, \theta, s \rangle \in \mathrm{support}(a) \text{ and } \theta \in R\}.$$
Also define the r-score for a from R by replacing support(a) in Eq. (1) with
the r-support set for a. Finally, define the r-materialization of $\mathcal{V}$ from R to
contain all tuples $\vec{x}_1, \ldots, \vec{x}_k$ with nonzero r-score, with the score of
$\vec{x}_1, \ldots, \vec{x}_k$ in p being its r-score from R. We define ε-support, ε-score, and
ε-materialization analogously, replacing the r-answers for the $Q_i$’s with
ε-answers.
Clearly, the r-materialization of a view can be constructed using only an
r-answer for each clause body involved in the view. As r is increased, the
r-answers will include more and more high-scoring substitutions, and the
r-materialization will become a better and better approximation to the full
materialized view. An analogous statement holds for an ε-materialization
as ε is decreased.
Thus given an efficient mechanism for computing r-answers (or ε-answers)
for conjunctive views, one can efficiently approximate the answers to
more complex queries.
4
Specifically, if one assumes that queries $B_1, \ldots, B_k$ are “irredundant” in the sense that
there is no ground substitution θ with nonzero score such that $B_i\theta = B_j\theta$ for $i \neq j$, and makes
the same independence assumptions made in Fuhr’s DatalogPID, then the score for a WHIRL
predicate is exactly the probability of the corresponding compound event, which is the same as
the probability computed by DatalogPID.
niques used in IR ranked retrieval [Turtle and Flood 1995]. However, using
search in query processing is unusual for database systems, which more
typically use search only in optimizing a query; in WHIRL, search is used
to generate each tuple in an answer.
To motivate our use of search, consider finding an r-answer to the
WHIRL query
insiderTip(X) ∧ publiclyTraded(Y) ∧ X ~ Y
where the relation publiclyTraded is very large, but the relation insiderTip is
very small. In processing the corresponding equijoin
insiderTip(X) ∧ publiclyTraded(Y) ∧ X = Y with a conventional database system, one would first
construct a query plan: for example, one might first find all bindings for X,
and then use an index to find all values Y in the first column of
publiclyTraded that are equivalent to some X. It is tempting to extend such a
query plan to WHIRL, by simply changing the second step to find all values
Y that are similar to some X.
However, this natural extension can be quite inefficient. Imagine that
insiderTip contains the vector $\vec{x}$, corresponding to the document
“Armadillos, Inc”. Due to the frequent term “Inc”, there will be many documents Y
that have nonzero similarity to $\vec{x}$, and it will be expensive to retrieve all of
these documents Y and compute their similarity to $\vec{x}$.
One way of avoiding this expense is to start by retrieving a small number
of documents Y that are likely to be highly similar to $\vec{x}$. In this case, one
might use an index to find all Y’s that contain the rare term “Armadillos”.
Since “Armadillos” is rare, this step will be inexpensive, and the Y’s
retrieved in this step must be somewhat similar to $\vec{x}$. (Recall that the
weight of a term depends inversely on its frequency, so rare terms have
high weight; and hence these Y’s will share at least one high-weight term
with X.) Conversely, any Y′ not retrieved in this step must be somewhat
dissimilar to $\vec{x}$, since such a Y′ cannot share with $\vec{x}$ the high-weight term
“Armadillos”. This suggests that if r is small, and an appropriate pruning
method is used, a subtask like “find the r documents Y that are most
similar to $\vec{x}$” might be accomplished efficiently by the subplan of “find all
Y’s containing the term ‘Armadillos’.”
Of course, this subplan depends on the vector $\vec{x}$. To find the Y’s most
similar to the document “The American Software Company” (in which every
term is somewhat frequent) a very different type of subplan might be
required. These observations suggest that query processing should proceed in
small steps, and that these steps should be scheduled dynamically, in a
manner that depends on the specific document vectors being processed.
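The subplan “find all Y’s containing the term ‘Armadillos’” can be sketched with an inverted index. The following Python fragment is a hypothetical illustration rather than WHIRL’s code: the function names and the hand-picked term weights are assumptions, and the heuristic of probing with the single highest-weight term is a simplification of the dynamic scheduling discussed above.

```python
from collections import defaultdict

def build_inverted_index(vectors):
    """Map each term to the (doc_id, weight) pairs of documents containing it."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for term, weight in vec.items():
            if weight > 0:
                index[term].append((doc_id, weight))
    return index

def candidates_by_best_term(x, index):
    """Probe the index with the highest-weight term of x that occurs in it."""
    for term, _ in sorted(x.items(), key=lambda kv: kv[1], reverse=True):
        if term in index:
            return term, [doc_id for doc_id, _ in index[term]]
    return None, []

# Illustrative (made-up) weights: rare terms are heavy, "inc" is light.
publicly_traded = [
    {"armadillos": 0.90, "inc": 0.44},
    {"acme": 0.90, "inc": 0.44},
    {"zebra": 0.90, "inc": 0.44},
]
x = {"armadillos": 0.90, "inc": 0.44}   # the document "Armadillos, Inc"
index = build_inverted_index(publicly_traded)
term, doc_ids = candidates_by_best_term(x, index)
```

Here the probe touches only the single posting for "armadillos", whereas a probe on "inc" would have retrieved every document in the relation.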
In the query-processing algorithm described below, we will use the A*
algorithm to search through a space of partial substitutions: for example,
one state in the search space for the query given above would correspond to
the substitution that maps X to $\vec{x}$ and leaves Y unbound. The steps we take
through this search space are small ones, as suggested by the discussion
above; for instance, one operation is to select a single term t and use an
inverted index to find plausible bindings for a single unbound variable.
298 • William W. Cohen
Fig. 1. A generic version of A* search, and an implementation of WHIRL based on A*. (In
lines marked *, X ~ Y is constraining in Q with generator $p(Y_1, \ldots, Y_k)$ and generation
index ℓ, and t is a term with nonzero weight in Xθ.)
3.2 A* search
A* search (summarized in Figure 1) is a graph search method which
attempts to find the highest-scoring path between a given start state $s_0$ and
a goal state [Nilsson 1987; Korf 1993]. Goal states are defined by a
goalState predicate. The graph being searched is defined by a function
children(s), which returns the set of states directly reachable from state s.
To conduct the search the A* algorithm maintains a set OPEN of states
that might lie on a path to some goal state. Initially OPEN contains only
the start state $s_0$. At each subsequent step of the algorithm, a single state
is removed from the OPEN set; in particular, the state s that is “best”
according to a heuristic function, f(s), is removed from OPEN. If s is a goal
state, then this state is output; otherwise, all children of s are added to the
OPEN set. The search continues until r goal states have been output, or all
states s in OPEN have f(s) < ε, or the search space is exhausted.
The procedure described above is a variant of the A* procedure normally
studied, but it has similar desirable properties, as shown in Section 4.
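The search loop just described can be sketched in a few lines of Python. This is a generic sketch of the procedure as summarized in the text, not a transcription of Figure 1: the name `a_star_top_r`, the use of a binary heap for OPEN, and the toy search space (choosing one weight from each of two lists, scored by their product) are assumptions made for illustration.

```python
import heapq
import itertools

def a_star_top_r(start, goal_state, children, f, r, epsilon=0.0):
    """Output up to r goal states, always expanding the OPEN state with the best f."""
    counter = itertools.count()                      # tie-breaker for equal f values
    open_heap = [(-f(start), next(counter), start)]  # max-heap via negated f
    results = []
    while open_heap and len(results) < r:
        neg_f, _, state = heapq.heappop(open_heap)
        if -neg_f < epsilon:       # the best remaining state is already below epsilon
            break
        if goal_state(state):
            results.append((state, -neg_f))
        else:
            for child in children(state):
                heapq.heappush(open_heap, (-f(child), next(counter), child))
    return results

# Toy search space: a state is a tuple of weights chosen so far.
choices = [[0.9, 0.5], [0.8, 0.1]]

def goal_state(state):
    return len(state) == len(choices)

def children(state):
    return [state + (w,) for w in choices[len(state)]]

def f(state):
    # Product of the chosen weights times the best possible remaining weights,
    # so f never underestimates the score of any goal state below this state.
    score = 1.0
    for w in state:
        score *= w
    for options in choices[len(state):]:
        score *= max(options)
    return score

top_two = a_star_top_r((), goal_state, children, f, r=2)
```

Because f is an optimistic bound in this toy space, the goal states are output in nonincreasing order of score, and the low-scoring combinations involving 0.1 are never expanded into answers.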
5
This restriction is made innocuous by an additional predicate eq(X, Y) which is true when X
and Y are bound to the same document vector. The implementation of the eq predicate is
relatively straightforward, and will be ignored in the discussion below.
$$f(\langle Q, \theta, E \rangle) \equiv \prod_{B_i\theta \text{ ground}} g(B_i, \theta, E) \cdot \prod_{B_i\theta \text{ not ground}} h(B_i, \theta, E)$$
6
Although it would have been preferable to implement both STIR and WHIRL using MIX
[Knuth 1975].
some additional ground substitutions. These will be added to the OPEN list,
along with a state $s''_1$ which has a larger exclusion set and thus a lower f
value.
This process will continue until r documents are generated. Note that it
is quite likely that low-weight terms such as “or” will not be used at all.
In a survey article, Turtle and Flood [1995] review a number of query
optimization methods for ranked retrieval IR systems. The most effective of
these was one they call the maxscore optimization. The behavior of WHIRL
on queries of the sort shown above is identical to the behavior of an IR
system using the maxscore optimization.
Example 3. Consider the query
p(Company1,Industry) ∧ q(Company2,WebSite) ∧ Company1 ~ Company2
In solving this query, the first step will be to explode the smaller of these
relations. Assume that this is p, and that p contains 1000 tuples. This will
add 1000 states $s_1, \ldots, s_{1000}$ to the OPEN list. In each of these states,
Company1 and Industry are bound, and Company1 ~ Company2 is a
constraining literal. Thus each of these 1000 states is analogous to the state $s_1$
in the preceding example.
However, the f values for the states $s_1, \ldots, s_{1000}$ will not be equal.
The value of the state $s_i$ associated with the substitution $\theta_i$ will depend on
the maximum possible score for the literal Company1 ~ Company2, and this
will be large only if the high-weight terms in the document Company1$\theta_i$
appear in the company field of q. As an example, a one-word document like
“3Com” will have a high f value if that term appears (infrequently) in the
company field of q, and a zero f value if it does not appear; similarly, a
document like “Agents, Inc” will have a low f value if the term “agents”
does not appear in the first column of q.
The result is that the next step of the algorithm will be to choose a
promising state $s_i$ from the OPEN list, that is, a state that could result in a good
final score. A term from the Company1 document in $s_i$ (say “3Com”) will
then be picked and used to generate bindings for Company2 and WebSite. If
any of these bindings results in a perfect match, then an answer can be
generated on the next iteration of the algorithm.
In short, the operation of WHIRL is somewhat similar to time-sharing
1000 simpler queries on a machine for which the basic unit of computation
is to access a single inverted index. However, WHIRL’s use of the f
function will schedule the computation of these queries in an intelligent
way: queries unlikely to produce good answers can be discarded, and
low-weight terms are unlikely to be used.
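The way the f value of a state in Example 3 depends on “the maximum possible score” of an unresolved similarity literal can be illustrated with a simple optimistic bound. The sketch below is an assumption for illustration only: the paper’s own bound (its Eq. (2)) is not reproduced in this excerpt, and the names `max_weights` and `optimistic_similarity`, as well as the term weights, are hypothetical. The bound multiplies each term weight of the bound document by the largest weight that term takes anywhere in the other column, which can never underestimate a dot product of nonnegative vectors.

```python
def max_weights(column_vectors):
    """For each term, the largest weight it takes in any document of the column."""
    best = {}
    for vec in column_vectors:
        for term, weight in vec.items():
            best[term] = max(best.get(term, 0.0), weight)
    return best

def optimistic_similarity(x, maxweight, excluded=frozenset()):
    """Upper bound on the similarity of x to any column document, skipping excluded terms."""
    bound = sum(w * maxweight.get(t, 0.0) for t, w in x.items() if t not in excluded)
    return min(1.0, bound)   # unit-length vectors never score above 1

# Made-up company column of q.
q_company = [{"3com": 1.0}, {"acme": 0.90, "inc": 0.44}]
mw = max_weights(q_company)

bound_3com = optimistic_similarity({"3com": 1.0}, mw)                    # term present in q
bound_agents = optimistic_similarity({"agents": 0.90, "inc": 0.44}, mw)  # only "inc" matches
bound_absent = optimistic_similarity({"zeta": 1.0}, mw)                  # term absent from q
```

With these weights the one-word document gets the highest bound, the document whose rare term is missing from q gets a low bound, and a document sharing no term with q gets a bound of zero, mirroring the behavior described in Example 3.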
Example 4. Consider the query
p(Company1,Industry) ∧ q(Company2,WebSite) ∧ Company1 ~ Company2
∧ const1(IO) ∧ Industry ~ IO
where the relation const1 contains the single document,
“telecommunications and/or equipment.” In solving this query, WHIRL will first explode
const1 and generate a binding for IO. The literal Industry ~ IO then becomes
constraining, so it will be used to pick bindings for Company1 and Industry
using some high-weight term, perhaps “telecommunications”.
At this point there will be two types of states on the OPEN list. There will
be one state s′ in which only IO is bound, and ⟨telecommunications, Industry⟩
is excluded. There will also be several states $s_1, \ldots, s_n$ in which IO,
Company1, and Industry are bound; in these states, the literal
Company1 ~ Company2 is constraining. If s′ has a higher score than any of
the $s_i$’s, then s′ will be removed from the OPEN list, and another term from
the literal Industry ~ IO will be used to generate additional variable
bindings.
However, if some $s_i$ has a high f value then it will be taken
ahead of s′. Note that this is possible when the bindings in $s_i$ lead to a good
actual similarity score for Industry ~ IO as well as a good potential similarity
score for Company1 ~ Company2 (as measured by the h function). If an $s_i$
is picked, then bindings for Company2 and WebSite will be produced,
resulting in a ground state. This ground state will be removed from the
OPEN list on the next iteration only if its f value is higher than that of s′
and all of the remaining $s_i$’s.
This example illustrates how bindings can be propagated through
similarity literals. The binding for IO is first used to generate bindings for
Company1 and Industry, and then the binding for Company1 is used to bind
Company2 and WebSite. Note that bindings are generated using
high-weight, low-frequency terms first, and low-weight, high-frequency terms
only when necessary.
$$\{\langle Q, \theta \cup \{Y_1 = \vec{v}_1, \ldots, Y_k = \vec{v}_k\}, E \rangle : \langle \vec{v}_1, \ldots, \vec{v}_k \rangle \in \mathrm{index}(t, p, \ell)\}.$$
Again, clearly desc($s_i$) and desc($s_j$) are disjoint. In the other case, exactly
one of $s_i$, $s_j$ is in $S_t$, and the other is not: let us assume without loss of
generality that $s_j = s' = \langle Q, \theta, E \cup \{\langle t, Y \rangle\} \rangle$ and $s_i \in S_t$. In this case, the
descendants of $s_j$ must be disjoint from the descendants of $s_i$, since (because
of the exclusion ⟨t, Y⟩) no descendants of $s_j$ can bind Y to a vector with
nonzero weight for t, and all of the descendants of $s_i$ bind Y to some vector
with nonzero weight for t. Thus the graph generated by the children
function is a tree.
To see that f is admissible, it is sufficient to note that Eq. (2) is an
upper bound on the score of $B_i\theta'$ relative to any ground superset θ′ of θ
associated with a valid state.
Finally, we wish to show that if f is admissible, and the graph G
defined by the children function is a bounded tree, then algorithm A* of
Figure 1 outputs in nonincreasing order the goal states with the largest f
values. This statement is a slight (and unsurprising) variant (see footnote 7) of the
correctness property traditionally associated with A* search [Nilsson 1987],
7
The principal differences are that A* usually is considered to minimize a sum of costs, rather
than maximizing a score that is a product, and that A* usually is only used to find a single
“best” goal state. The former difference is trivial; the latter, somewhat more important.
5. EXPERIMENTAL RESULTS
We evaluated our implementation of WHIRL along two dimensions. First,
we wished to measure the time needed to evaluate queries. Second, we
wished to measure the accuracy of the answers produced by WHIRL. In this
evaluation we used the measures of precision and recall traditionally used
in the statistical IR community. All experiments were performed using an
implementation of WHIRL that keeps all indices and document vectors in
main memory.
$p(X_1, \ldots, X_i, \ldots, X_k) \wedge q(Y_1, \ldots, Y_j, \ldots, Y_b) \wedge X_i \sim Y_j$.
An r-answer to this query will consist of the r tuples from p and q such
that $X_i$ and $Y_j$ are most similar. In these experiments we used the relations
described in Table I.
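For intuition about what an r-answer to a similarity join contains, a brute-force version can be written directly from this description. The Python sketch below is an assumed baseline for illustration, not WHIRL’s algorithm and not necessarily any of the methods timed in these experiments; the function name and the sample vectors are hypothetical, and documents are taken to be unit-length sparse vectors.

```python
import heapq

def naive_similarity_join(p_column, q_column, r):
    """Score every pair of documents and keep the r most similar (i, j, score) triples."""
    def dot(v, w):
        return sum(weight * w.get(t, 0.0) for t, weight in v.items())

    scored = ((i, j, dot(x, y))
              for i, x in enumerate(p_column)
              for j, y in enumerate(q_column))
    best = heapq.nlargest(r, scored, key=lambda triple: triple[2])
    return [triple for triple in best if triple[2] > 0]   # drop zero-score pairs

# Made-up columns X_i of p and Y_j of q.
p_names = [{"armadillos": 0.90, "inc": 0.44}, {"acme": 1.0}]
q_names = [{"armadillos": 1.0}, {"acme": 0.90, "inc": 0.44}]
answer = naive_similarity_join(p_names, q_names, r=2)
```

This baseline computes a score for all |p| · |q| pairs, which is exactly the cost that the search-based approach tries to avoid when r is small.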
Similarity join queries have several advantages for benchmarking purposes.
This query type is highly relevant to our research goals, since it is
directly related to the sort of data integration problem which led us to
develop WHIRL. This class of queries is also sufficiently constrained in
form so that it can be handled using simple algorithms built on top of
10
It does, however, greatly reduce the number of accesses to the inverted index, as Turtle and
Flood observed.
Fig. 3. Runtime in CPU seconds for similarity joins of company names and news stories.
tem. However, these simple queries are probably not typical of the sort of
queries that one would like to pose to a real data integration system; one
would expect that typical user queries would be more selective, and more
complex.
To better understand WHIRL’s behavior on “typical” queries, WHIRL
was embedded into a working, Web-based, data integration system [Cohen
1998b]. This system spiders a number of related Web sites and extracts a
WHIRL knowledge base, which can then be queried. The main additional
components of this system are an HTTP server interface to WHIRL, which
allows conjunctive queries (see footnote 11) to WHIRL to be easily formulated, and a
spider program, which downloads and extracts data from HTML pages.
Two moderately large domains were implemented for this system, one
integrating information on birds of North America, and one integrating
information about educational computer games.
The interface to the system in the game domain allows the user to ask a
question by filling out an HTML form—e.g., “help me find reviews of games
that are in the category ‘art’, are recommended by two or more sites, and
are designed for children six years old.” This question is then translated
into a conjunctive WHIRL query. The interface to the bird domain is
similar: an example of a question that might be posed in this domain, again
using a forms interface, is “help me find pictures of birds in the order
pelicaniforms that have been sighted in New Jersey and are endangered or
threatened.” In addition to a forms interface for constructing complex
questions, the bird domain interface also supports browsing the database,
and a “quick search” feature, in which a simple keyword query can be used
to search relevant portions of the database. Browsing and “quick search”
are implemented by translating browsing commands and simple keyword
searches into appropriate WHIRL queries.
11
The user’s queries are conjunctive, but not necessarily flat—they may involve WHIRL views,
which are used in this system to make the different data sources more compatible.
We made both domains available on the Web, and recorded each query
issued to the system. Later we took a “snapshot” of each domain (see footnote 12) and
measured the response time for a subset of these queries. In the game
domain, we took a random sample of 100 queries. In the bird domain, we
took all queries (over a period of several days) which used the “advanced
search” feature, thus excluding many of the simpler queries; there are 91
queries in this sample.
A comparative study of performance is inappropriate here, since arbitrary
WHIRL queries cannot be answered by any means other than the
algorithm of Section 3; thus, Table II simply summarizes the results. Note
that the average response time is well under a second.
Table III provides some additional detail on these results. For each
domain and for each number k, we show the number of queries that are
k-way joins, the average number of similarity literals used in k-way join
queries, and the average time to execute the k-way join queries (see footnote 13). In these
samples, many of the queries are relatively simple, but a substantial
fraction are moderately complex: in the bird domain, about a quarter are 4-,
5-, or 6-way joins, and in the game domain, about a quarter are 5- or 6-way
joins. WHIRL is still quite efficient, even on the longer queries.
12
Notice that since the sites indexed are not static, the size of the databases changes every
time the spiders are run.
13
The number k does not count “joins” with singleton relations like the const relations in the
example queries of Section 3.5.
Table III. A More Detailed Summary of the Performance of the WHIRL Interpreter on
Real-World Queries
14
Thanks to Alon Levy and Jaewoo Kang for providing me with the data and normalization
routines.
15
In the BirdCall relation, we manually cleaned the secondary keys (but not the primary
keys) by fixing spelling errors.
16
Thanks to Sheila Tejada for supplying the restaurant data.
Table VI. Pairs of Names from the Hoovers and Iontech Relations
17
On the other hand, normalization of these types of names would be extremely difficult.
18
The movie reviews usually contain a title naming the movie being reviewed, but also contain
a lot of additional text. The average length of a review is more than 400 words.
19
Punctuation has been added for clarity. The text in the original is also structured by
interleaved graphics, and by positioning on the page.
Table VII. Average Precision for Similarity Joins between Pairs of Relations with
Incompatible Schemas, or Incompletely Extracted Names
crest of the wavelike curve.” (Mathews) Limekiln Lake, August 1996. Song
in flight (62kB); Perching song (59kB); and another (48kB) from the same
bird. Long Island, July 1997.
Using a methodology similar to that used with the Demo relation, we joined
these descriptions to a list of bird names. Average precision for the data
with manually extracted names was 95.8%; without extraction, average
precision decreased to 83.0%. These results are summarized in Table VII.
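As a reminder of how a figure like the ones above can be computed from a ranked answer list, the following Python sketch implements one common (noninterpolated) definition of average precision. It is offered only as an assumption: the exact definition used in these experiments is not part of this excerpt, and the function name and the sample ranking are hypothetical.

```python
def average_precision(ranked_correct, total_correct):
    """Noninterpolated average precision of a ranking given as a list of booleans."""
    hits = 0
    precision_sum = 0.0
    for rank, is_correct in enumerate(ranked_correct, start=1):
        if is_correct:
            hits += 1
            precision_sum += hits / rank   # precision at the rank of each correct pairing
    return precision_sum / total_correct if total_correct else 0.0

# A ranking whose first, second, and fourth pairings are correct, out of 3 correct overall.
example_ap = average_precision([True, True, False, True], total_correct=3)
```

Under this definition a ranking that places every correct pairing ahead of every incorrect one scores 1.0, and correct pairings that appear late in the ranking lower the score.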
6. RELATED WORK
Chaudhuri et al. [1995] present efficient solutions to the problem of loosely
integrating Boolean text queries with database queries. In contrast, we
have considered a much tighter integration between databases and
statistical IR queries. The assumptions made by Chaudhuri et al. are not
particularly appropriate in the context of heterogeneous database
integration.
As noted above in Section 2.4, WHIRL is closely related to probabilistic
databases (e.g., Fuhr [1995] and Barbara et al. [1992]). To our knowledge
such database systems have not been used in data integration tasks.
Furthermore, the implementation of WHIRL is unique in generating only a
few “best” answers to a query; existing probabilistic database systems
typically find all tuples with nonzero probability. As we argued above in
Section 2.2, this would often be impractical for the problems encountered in
this sort of heterogeneous database integration, due to the prevalence of
weak matches between documents.
Fuzzy set theory [Zadeh 1965] has also been used as the basis for “soft”
database systems [Bosc and Prade 1997]. Fagin [1998] and others have
proposed algorithms that find the best few answers to a conjunctive query
in this model. However, unlike WHIRL, these algorithms make
assumptions about the independence of the atomic queries, and provide the best
answer only with high probability.
The WHIRL query algorithm borrows heavily from techniques previously
used to optimize ranked retrieval searches in statistical IR. To our
knowledge, these techniques have not been previously used for approximating the
join of lists of documents. More generally, the sort of approximate join
7. CONCLUSIONS
In an ideal world, one would like to integrate information from heterogeneous
autonomous databases with little or no human effort. In other words,
one would like data to be easily shared among databases. Unfortunately,
such data sharing is difficult with current data models. One fundamental
and critical problem is the lack of global domains: different databases are
likely to use different constants to refer to the same real-world entity,
making operations like joins across relations from different databases
impossible.
We believe the data model and query language presented in this paper
represent a significant advance toward the long-term goal of easily
sharable data. We have outlined an approach to the integration of structured
heterogeneous information sources, based on extending conventional
database query languages with standard IR methods for reasoning about
textual similarity. The approach is embodied in an implemented
“information representation language” called WHIRL. WHIRL is intended for
integration of relations that are semantically heterogeneous in the sense that
there is no common naming scheme for entities.
ACKNOWLEDGMENTS
The author is grateful to Alon Levy for numerous helpful discussions while
I was formulating this problem, and for comments on a draft of the paper;
to Jaewoo Kang and Sheila Tejada, for providing data; to Alex Borgida, Sal
Stolfo, and Mark Jones for comments on the paper; to Susan Cohen for
proofreading; and to Edith Cohen, David Lewis, Haym Hirsh, Fernando
Pereira, Divesh Srivastava, Dan Suciu, and many other colleagues for
helpful advice and discussions.
REFERENCES
ABITEBOUL, S. AND VIANU, V. 1997. Regular path queries with constraints. In Proceedings of
the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems
(PODS-97) (Tucson, AZ, May 1997).
ARENS, Y., KNOBLOCK, C. A., AND HSU, C.-N. 1996. Query processing in the SIMS information
mediator. In A. Tate Ed., Advanced Planning Technology. Menlo Park, CA: AAAI Press.
ATZENI, P., MECCA, G., AND MERIALDO, P. 1997. Semistructured and structured data on the
Web: going back and forth. In D. Suciu Ed., Proceedings of the Workshop on Management of
Semistructured Data (Tucson, Arizona, May 1997). Available on-line from
http://www.research.att.com/~suciu/workshop-papers.html.
BARBARA, D., GARCIA-MOLINA, H., AND PORTER, D. 1992. The management of probabilistic
data. IEEE Transactions on Knowledge and Data Engineering 4, 5 (October), 487–501.
BARTELL, B. T., COTTRELL, G. W., AND BELEW, R. K. 1994. Automatic combination of multiple
ranked retrieval systems. In Seventeenth Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval (1994).
BAYARDO, R. J., BOHRER, W., BRICE, R., CICHOCKI, A., FOWLER, J., HELAL, A., KASHYAP, V.,
KSIEZYK, T., MARTIN, G., NODINE, M., RASHID, M., RUSINKIEWICZ, M., SHEA, R., UNNIKRISHAN,
C., UNRUH, A., AND WOELK, D. 1997. Infosleuth: an agent-based semantic integration of
information in open and dynamic environments. In Proceedings of the 1997 ACM SIGMOD
(May 1997).
BORGMAN, C. L. AND SIEGFRIED, S. L. 1992. Getty’s Synoname and its cousins: a survey of
applications of personal name-matching algorithms. Journal of the American Society for
Information Science 43, 7, 459–476.
BOSC, P. AND PRADE, H. 1997. An introduction to the fuzzy set and possibility theory-based
treatment of queries and uncertain or imprecise databases. In Uncertainty management in
information systems. Kluwer Academic Publishers.
BOYAN, J., FREITAG, D., AND JOACHIMS, T. 1994. A machine learning architecture for optimizing
web search engines. Technical Report WS-96-05, American Association of Artificial
Intelligence.
CHAUDHURI, S., DAYAL, U., AND YAN, T. 1995. Join queries with external text sources:
execution and optimization techniques. In Proceedings of the 1995 ACM SIGMOD (May
1995).
COHEN, W. W. 1997. Knowledge integration for structured information sources containing
text (extended abstract). In The SIGIR-97 Workshop on Networked Information Retrieval
(1997).
COHEN, W. W. 1998a. Integration of heterogeneous databases without common domains
using queries based on textual similarity. In Proceedings of ACM SIGMOD-98 (Seattle, WA,
1998).
COHEN, W. W. 1998b. A Web-based information system that reasons with structured
collections of text. In Proceedings of Autonomous Agents-98 (St. Paul, MN, 1998).
COHEN, W. W. AND HIRSH, H. 1998. Joins that generalize: Text categorization using WHIRL.
In Proceedings of the Fourth International Conference on Knowledge Discovery and Data
Mining (New York, NY, 1998), pp. 169–173.
COHEN, W. W. AND SINGER, Y. 1996. Context-sensitive learning methods for text categorization.
In Proceedings of the 19th Annual International ACM Conference on Research and
Development in Information Retrieval (Zurich, Switzerland, 1996), pp. 307–315. ACM Press.
COHEN, W. W., SCHAPIRE, R. E., AND SINGER, Y. 1997. Learning to order things. In Advances
in Neural Information Processing Systems 10 (Denver, CO, 1997). MIT Press.
CRAVEN, M., DIPASQUO, D., FREITAG, D., MCCALLUM, A., MITCHELL, T., NIGAM, K., AND SLATTERY,
S. 1998. Learning to extract symbolic knowledge from the world wide web. In Proceedings
of the Fifteenth National Conference on Artificial Intelligence (AAAI-98) (Madison, WI,
1998).
DUSCHKA, O. M. AND GENESERETH, M. R. 1997a. Answering recursive queries using views. In
Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of
Database Systems (PODS-97) (Tucson, AZ, May 1997).
DUSCHKA, O. M. AND GENESERETH, M. R. 1997b. Query planning in Infomaster. In Proceedings
of the Twelfth Annual ACM Symposium on Applied Computing (SAC97) (San Jose, CA,
February 1997).
FAGAN, J. L. 1989. The effectiveness of a nonsyntactic approach to automatic phrase
indexing for document retrieval. Journal of the American Society for Information Science 40,
2, 115–132.
FAGIN, R. 1998. Fuzzy queries in multimedia database systems. In Proc. 1998 ACM
SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’98)
(1998).
FANG, D., HAMMER, J., AND MCLEOD, D. 1994. The identification and resolution of semantic
heterogeneity in multidatabase systems. In Multidatabase Systems: An Advanced Solution
for Global Information Sharing, pp. 52–60. IEEE Computer Society Press, Los Alamitos,
California.
FELLIGI, I. P. AND SUNTER, A. B. 1969. A theory for record linkage. Journal of the American
Statistical Association 64, 1183–1210.
FIEBIG, T., WEISS, J., AND MOERKOTTE, G. 1997. RAW: a relational algebra for the Web. In D.
Suciu Ed., Proceedings of the Workshop on Management of Semistructured Data (Tucson,
Arizona, May 1997). Available on-line from
http://www.research.att.com/~suciu/workshop-papers.html.
FUHR, N. 1995. Probabilistic Datalog—a logic for powerful retrieval methods. In Proceedings
of the 1995 ACM SIGIR conference on research in information retrieval (New York,
1995), pp. 282–290.
GARCIA-MOLINA, H., PAPAKONSTANTINOU, Y., QUASS, D., RAJARAMAN, A., SAGIV, Y., ULLMAN, J.,
AND WIDOM, J. 1995. The TSIMMIS approach to mediation: Data models and languages
(extended abstract). In Next Generation Information Technologies and Systems (NGITS-95)
(Naharia, Israel, November 1995).
HERNANDEZ, M. AND STOLFO, S. 1995. The merge/purge problem for large databases. In
Proceedings of the 1995 ACM SIGMOD (May 1995).
HUFFMAN, S. AND STEIER, D. 1995. Heuristic joins to integrate structured heterogeneous
data. In Working notes of the AAAI spring symposium on information gathering in
heterogeneous distributed environments (Palo Alto, CA, March 1995). AAAI Press.
KILSS, B. AND ALVEY, W. 1985. Record linkage techniques—1985. Statistics of Income
Division, Internal Revenue Service Publication 1299-2-96. Available from
http://www.bts.gov/fcsm/methodology/.
KNUTH, D. E. 1975. The Art of Computer Programming, Volume I: Fundamental Algorithms
(second edition). Addison-Wesley, Reading, MA.
KONOPNICKI, D. AND SCHMUELI, O. 1995. W3QS: a query system for the world wide web. In
Proceedings of the 21st International Conference on Very Large Databases (VLDB-95)
(Zurich, Switzerland, 1995).
KORF, R. 1993. Linear-space best-first search. Artificial Intelligence 62, 1 (July), 41–78.
LEVY, A. Y., RAJARAMAN, A., AND ORDILLE, J. J. 1996a. Query answering algorithms for
information agents. In Proceedings of the 13th National Conference on Artificial Intelligence
(AAAI-96) (Portland, Oregon, August 1996).
LEVY, A. Y., RAJARAMAN, A., AND ORDILLE, J. J. 1996b. Querying heterogeneous information
sources using source descriptions. In Proceedings of the 22nd International Conference on
Very Large Databases (VLDB-96) (Bombay, India, September 1996).
LEWIS, D. 1992. Representation and learning in information retrieval. Technical Report
91-93, Computer Science Dept., University of Massachusetts at Amherst. PhD Thesis.
MENDELZON, A. AND MILO, T. 1997. Formal models of Web queries. In Proceedings of the
Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems
(PODS-97) (Tucson, AZ, May 1997).
MONGE, A. AND ELKAN, C. 1996. The field-matching problem: algorithm and applications. In
Proceedings of the Second International Conference on Knowledge Discovery and Data
Mining (August 1996).
MONGE, A. AND ELKAN, C. 1997. An efficient domain-independent algorithm for detecting
approximately duplicate database records. In The proceedings of the SIGMOD 1997
workshop on data mining and knowledge discovery (May 1997).
NEWCOMBE, H. B., KENNEDY, J. M., AXFORD, S. J., AND JAMES, A. P. 1959. Automatic linkage
of vital records. Science 130, 954–959.
NILSSON, N. 1987. Principles of Artificial Intelligence. Morgan Kaufmann.
PEARL, J. 1984. Heuristics: Intelligent Search Strategies for Computer Problem Solving.
Addison-Wesley, Reading, MA.
PORTER, M. F. 1980. An algorithm for suffix stripping. Program 14, 3, 130–137.
QUINLAN, J. R. 1990. Learning logical definitions from relations. Machine Learning 5, 3,
239–266.
SALTON, G. ED. 1989. Automatic Text Processing. Addison-Wesley, Reading, MA.
SCHÄUBLE, P. 1993. SPIDER: A multiuser information retrieval system for semistructured
and dynamic data. In Proceedings of the 1993 ACM SIGIR conference on research in
information retrieval (Pittsburgh, PA, 1993), pp. 318–327.
SUCIU, D. 1996. Query decomposition and view maintenance for query languages for
unstructured data. In Proceedings of the 22nd International Conference on Very Large
Databases (VLDB-96) (Bombay, India, 1996).
SUCIU, D. ED. 1997. Proceedings of the Workshop on Management of Semistructured Data
(Tucson, Arizona). Available on-line from
http://www.research.att.com/suciu/workshop-papers.html.
TOMASIC, A., AMOUROUX, R., BONNET, P., AND KAPITSKAIA, O. 1997. The distributed
information search component (Disco) and the World Wide Web. In Proceedings of the 1997 ACM
SIGMOD (May 1997).
TURTLE, H. AND FLOOD, J. 1995. Query evaluation: strategies and optimizations.
Information processing and management 31, 6 (November), 831–850.
ZADEH, L. A. 1965. Fuzzy sets. Information and Control 8, 338–353.