Domain of Study - Datamining Platform - J2Ee

This document discusses an unsupervised online method called UDD for detecting duplicate records across multiple web databases. UDD uses presumed non-duplicate records within a source to train classifiers without manual labeling. It iteratively identifies duplicates using a weighted component similarity classifier and SVM classifier. The method addresses limitations of supervised approaches for dynamic web query results that may not contain representative training data. Experimental results showed UDD effectively identifies duplicates from query results across web databases without requiring labeled training examples.

Uploaded by

gowrishankar


DOMAIN OF STUDY – DATA MINING

PLATFORM – J2EE
ABSTRACT:
Record matching, which identifies the records that
represent the same real-world entity, is an important step
in data integration. Most state-of-the-art record matching
methods are supervised and require the user to provide
training data.
These methods are not applicable to the Web database
scenario, where the records to match are query results
generated dynamically on the fly. Such records are
query-dependent, and a method pre-learned from the training
examples of previous query results may fail on the
results of a new query.
To address the problem of record matching in the Web
database scenario, we present an unsupervised, online
record matching method, UDD, which, for a given query,
can effectively identify duplicates from the query result
records of multiple Web databases.
After removal of the same-source duplicates, the “presumed”
nonduplicate records from the same source can be used as training
examples, relieving users of the burden of manually labeling
training examples.
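
A minimal sketch of how such a presumed nonduplicate set could be assembled, written in Python for brevity rather than in the project's J2EE platform. The record layout (dictionaries with a `source` key), the sample records, and the helper name `presumed_nonduplicate_pairs` are illustrative assumptions and do not come from the original work. The sketch assumes that same-source duplicates have already been removed.

```python
from collections import defaultdict
from itertools import combinations


def presumed_nonduplicate_pairs(records):
    """Pair up records that come from the same source.

    Once same-source duplicates are removed, two records from one
    source are presumed to describe different entities, so every such
    pair can serve as a negative (nonduplicate) training example.
    """
    by_source = defaultdict(list)
    for record in records:
        by_source[record["source"]].append(record)

    pairs = []
    for same_source in by_source.values():
        # All unordered pairs within a single source.
        pairs.extend(combinations(same_source, 2))
    return pairs


# Made-up records: three from source "A", one from source "B".
records = [
    {"source": "A", "title": "Data Mining"},
    {"source": "A", "title": "Machine Learning"},
    {"source": "A", "title": "Databases"},
    {"source": "B", "title": "Data Mining"},
]
negative_pairs = presumed_nonduplicate_pairs(records)
```

Cross-source pairs, such as the two "Data Mining" records here, are deliberately left out of this set, because they are the candidates that may turn out to be duplicates.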

Starting from the nonduplicate set, we use two cooperating classifiers,
a weighted component similarity summing classifier and an SVM
classifier, to iteratively identify duplicates in the query results from
multiple Web databases.

Experimental results show that UDD works well for the Web database
scenario where existing supervised methods do not apply.
EXISTING SYSTEM
Most previous work is based on predefined matching
rules hand-coded by domain experts or matching rules
learned offline by some learning method from a set of
training examples.
Such approaches work well in a traditional database
environment, where all instances of the target
databases can be readily accessed, as long as a set of
high-quality representative records can be examined
by experts or selected for the user to label.
PROPOSED SYSTEM
We propose a new record matching method, Unsupervised
Duplicate Detection (UDD), for the specific record matching
problem of identifying duplicates among records in query
results from multiple Web databases. The key ideas of our
method are:

1. We focus on techniques for adjusting the weights of the
record fields in calculating the similarity between two
records. Two records are considered duplicates if they are
“similar enough” on their fields. We believe different fields
may need to be assigned different importance weights in an
adaptive and dynamic manner.
2. Due to the absence of labeled training examples, we
approximate the negative training set with a sample of
universal data consisting of record pairs from different
data sources, together with the record pairs
from the same data source. We believe, and our
experimental results verify, that doing so is reasonable
since the proportion of duplicate records in the
universal set is usually much smaller than the
proportion of nonduplicates.
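
The effect of field weights described in item 1 can be sketched as follows, in Python for brevity. Everything named here is an assumption made for illustration: the token-based Jaccard measure used as the per-field similarity, the field names, the made-up records, the two weight settings, and the 0.75 threshold. The source does not specify any of them.

```python
def field_similarity(a, b):
    """Jaccard similarity over lowercase word tokens (an assumed measure)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 1.0  # two empty fields are treated as identical
    return len(tokens_a & tokens_b) / len(union)


def weighted_similarity(r1, r2, weights):
    """Sum of per-field similarities, each scaled by its field weight."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())


def is_duplicate(r1, r2, weights, threshold=0.75):
    """Two records count as duplicates if they are 'similar enough'."""
    return weighted_similarity(r1, r2, weights) >= threshold


# Two made-up records for the same book, as two sources might list it.
r1 = {"title": "Introduction to Data Mining", "author": "Jane Smith",
      "publisher": "Acme Press"}
r2 = {"title": "introduction to data mining", "author": "J. Smith",
      "publisher": "Acme"}

equal_weights = {"title": 1 / 3, "author": 1 / 3, "publisher": 1 / 3}
title_heavy = {"title": 0.7, "author": 0.2, "publisher": 0.1}

match_equal = is_duplicate(r1, r2, equal_weights)
match_title_heavy = is_duplicate(r1, r2, title_heavy)
```

With equal weights the pair falls below the threshold, because the author and publisher fields are formatted differently; once the title field carries more weight, the same pair is accepted. This is why the choice of weights matters for the final decision.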
Fig. 1. Example query results from two Web databases. (a) Query results from booksamillion.com. (b) Query results from abebooks.com.
3.1 SYSTEM REQUIREMENTS

3.1.1 Hardware Requirements

Processor : Pentium IV
Speed : 2.4 GHz
RAM : 256 MB and above
Hard disk : 10 GB and above

3.1.2 Software Requirements

Application Server : WebLogic Server 8.1, Tomcat Server 4.1
Java Environment : Java 2 Enterprise Edition

CONCLUSIONS
Duplicate detection is an important step in data
integration, and most state-of-the-art methods are
based on offline learning techniques, which require
training data.
In the Web database scenario, where the records to match
are highly query-dependent, a pre-trained approach is
not applicable because the set of records in each query’s
results is a biased subset of the full data set.
To overcome this problem, we presented an
unsupervised, online approach, UDD, for detecting
duplicates over the query results of multiple Web databases.
Two classifiers, WCSS (weighted component similarity summing)
and SVM, are used cooperatively in the convergence step of
record matching to iteratively identify the duplicate pairs
among all potential duplicate pairs.
Experimental results show that our approach is comparable
to previous work that requires training examples for
identifying duplicates from the query results of multiple
Web databases.
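
The cooperation between the two classifiers can be pictured with the following sketch, in Python for brevity. It is not the published UDD algorithm: the weight-update heuristic, the 0.8 threshold, the sample similarity vectors, and all function names are assumptions, and a nearest-centroid rule stands in for the SVM so that the example needs no external library. Each input is a tuple of per-field similarities for one record pair.

```python
def normalise(values):
    """Scale values to sum to 1; fall back to uniform if all are zero."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def estimate_weights(duplicates, negatives, n_fields):
    """Assumed heuristic: a field matters more when nonduplicates differ
    on it and when the duplicates found so far agree on it."""
    from_neg = normalise([sum(1.0 - v[i] for v in negatives)
                          for i in range(n_fields)])
    if not duplicates:
        return from_neg
    from_dup = normalise([sum(v[i] for v in duplicates)
                          for i in range(n_fields)])
    return [(d + n) / 2 for d, n in zip(from_dup, from_neg)]


def centroid(vectors):
    return [sum(v[i] for v in vectors) / len(vectors)
            for i in range(len(vectors[0]))]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def find_duplicates(candidates, negatives, threshold=0.8, max_rounds=10):
    """Alternate two classifiers until neither finds a new duplicate."""
    if not candidates:
        return [], []
    n_fields = len(candidates[0])
    duplicates, remaining = [], list(candidates)
    for _ in range(max_rounds):
        # Classifier 1: weighted similarity sum with re-estimated weights.
        weights = estimate_weights(duplicates, negatives, n_fields)
        by_sum = [v for v in remaining
                  if sum(w * s for w, s in zip(weights, v)) >= threshold]
        duplicates += by_sum
        remaining = [v for v in remaining if v not in by_sum]

        # Classifier 2: nearest centroid, a stand-in for the SVM, trained on
        # the duplicates found so far and the presumed nonduplicates.
        by_model = []
        if duplicates and remaining and negatives:
            pos, neg = centroid(duplicates), centroid(negatives)
            by_model = [v for v in remaining
                        if squared_distance(v, pos) < squared_distance(v, neg)]
            duplicates += by_model
            remaining = [v for v in remaining if v not in by_model]

        if not by_sum and not by_model:
            break  # converged: no new duplicates in this round
    return duplicates, remaining


# Made-up similarity vectors (title similarity, author similarity).
negatives = [(0.2, 0.0), (0.1, 0.3), (0.0, 0.1)]
candidates = [(1.0, 1.0), (0.9, 0.6), (0.1, 0.2)]
duplicates, remaining = find_duplicates(candidates, negatives)
```

In this run the weighted sum accepts only the exact match in the first round; the stand-in second classifier, trained on that match and the presumed nonduplicates, then picks up the near match that the threshold missed. The second round finds nothing new, so the loop stops.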