Domain of Study - Data Mining; Platform - J2EE
PLATFORM – J2EE
ABSTRACT:
Record matching, which identifies the records that
represent the same real-world entity, is an important step
for data integration. Most state-of-the-art record matching
methods are supervised, which requires the user to provide
training data.
These methods are not applicable to the Web database
scenario, where the records to match are query results
dynamically generated on the fly. Such records are
query-dependent, and a prelearned method using training
examples from previous query results may fail on the
results of a new query.
To address the problem of record matching in the Web
database scenario, we present an unsupervised, online
record matching method, UDD, which, for a given query,
can effectively identify duplicates from the query result
records of multiple Web databases.
After removal of the same-source duplicates, the “presumed”
nonduplicate records from the same source can be used as training
examples, relieving users of the burden of manually labeling
training examples.
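The idea of using same-source records as training examples can be sketched
in Java as follows. This is only an illustrative sketch, not the UDD
implementation: it assumes a record is a list of field values and that
same-source duplicates are removed by exact equality, which the abstract does
not specify. The class and method names (NegativeExamples,
removeSameSourceDuplicates, presumedNonDuplicatePairs) are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class NegativeExamples {

    /** A pair of records presumed to describe different entities. */
    public static final class Pair {
        public final List<String> first;
        public final List<String> second;

        Pair(List<String> first, List<String> second) {
            this.first = first;
            this.second = second;
        }
    }

    /**
     * Removes duplicates within one source, keeping the original order.
     * Assumption: two records are duplicates only if all fields are equal.
     */
    public static List<List<String>> removeSameSourceDuplicates(
            List<List<String>> records) {
        return new ArrayList<>(new LinkedHashSet<>(records));
    }

    /**
     * Pairs every remaining record with every other record from the same
     * source. Each pair is a "presumed" nonduplicate training example.
     */
    public static List<Pair> presumedNonDuplicatePairs(
            List<List<String>> records) {
        List<List<String>> unique = removeSameSourceDuplicates(records);
        List<Pair> pairs = new ArrayList<>();
        for (int i = 0; i < unique.size(); i++) {
            for (int j = i + 1; j < unique.size(); j++) {
                pairs.add(new Pair(unique.get(i), unique.get(j)));
            }
        }
        return pairs;
    }
}
```

With four records from one source, one of which is an exact repeat, three
unique records remain and three presumed nonduplicate pairs are produced, with
no manual labeling involved.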
Experimental results show that UDD works well for the Web database
scenario where existing supervised methods do not apply.
EXISTING SYSTEM
Most previous work is based on predefined matching
rules hand-coded by domain experts or matching rules
learned offline by some learning method from a set of
training examples.
Such approaches work well in a traditional database
environment, where all instances of the target
databases can be readily accessed, as long as a set of
high-quality representative records can be examined
by experts or selected for the user to label.
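To make the notion of a hand-coded matching rule concrete, here is a minimal
sketch in Java. It is an assumed example of the kind of rule a domain expert
might write, not a rule taken from any existing system: the field names
"title" and "year" and the class HandCodedRule are hypothetical.

```java
import java.util.Map;

public class HandCodedRule {

    /**
     * Hypothetical expert rule: two records match when their titles are
     * equal ignoring case and surrounding whitespace, and their years are
     * identical. Records with a missing title or year never match.
     */
    public static boolean matches(Map<String, String> a, Map<String, String> b) {
        String titleA = a.getOrDefault("title", "").trim();
        String titleB = b.getOrDefault("title", "").trim();
        String yearA = a.getOrDefault("year", "").trim();
        String yearB = b.getOrDefault("year", "").trim();

        if (titleA.isEmpty() || yearA.isEmpty()) {
            return false;
        }
        return titleA.equalsIgnoreCase(titleB) && yearA.equals(yearB);
    }
}
```

A rule like this is fixed in advance, so it depends on experts having seen
representative records beforehand, which is the limitation described above.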
PROPOSED SYSTEM
We propose a new record matching method, Unsupervised
Duplicate Detection (UDD), for the specific record matching
problem of identifying duplicates among records in query
results from multiple Web databases. The key ideas of our
method are: