Normalization Of Duplicate Records From Multiple Sources

1. INTRODUCTION

1.1 PURPOSE

The Web has evolved into a data-rich repository containing a large amount of
structured content spread across millions of sources. The usefulness of Web data
increases exponentially (e.g., building knowledge bases, Web-scale data analytics)
when it is linked across numerous sources. Structured data on the Web resides in Web
databases and Web tables. Web data integration is an important component of many
applications collecting data from Web databases, such as Web data warehousing (e.g.,
Google and Bing Shopping; Google Scholar), data aggregation (e.g., product and
service reviews), and metasearching. Integration systems at Web scale need to
automatically match records from different sources that refer to the same real-world
entity, find the true matching records among them, and turn this set of records into a
standard record for the consumption of users or other applications. There is a large
body of work on the record matching problem and the truth discovery problem. The
record matching problem is also referred to as duplicate record detection, record
linkage, object identification, entity resolution, or deduplication, and the truth
discovery problem, also called truth finding or fact finding, is a key problem in
data fusion. In this paper, we assume that the tasks of record matching and truth
discovery have been performed and that the groups of true matching records have thus
been identified. Our goal is to generate a uniform, standard record for each group of
true matching records for end-user consumption. We call the generated record the
normalized record. We call the problem of computing the normalized record for a
group of matching records the record normalization problem (RNP), and it is the
focus of this work.

1.2 SCOPE

RNP is another specific, interesting problem in data fusion. Record normalization
is important in many application domains. For example, in the research publication
domain, although the integrator website, such as Citeseer or Google Scholar, contains
records gathered from a variety of sources using automated extraction techniques, it
must display a normalized record to users. Otherwise, it is unclear what can be
presented to users: (i) present the entire group of matching records or (ii) simply
present some random record from the group, to just name a couple of ad-hoc
approaches. Either of these choices can lead to a frustrating experience for a user,
because in (i) the user needs to sort/browse through a potentially large number of
duplicate records, and in (ii) we run the risk of presenting a record with missing or
incorrect pieces of data. Record normalization is a challenging problem because
different Web sources may represent the attribute values of an entity in different ways
or even provide conflicting data. Conflicting data may occur because of incomplete
data, different data representations, missing attribute values, and even erroneous data.
The records in our running example (Table 1) are extracted from different websites;
the record Rnorm is constructed by hand for illustration purposes.

1.3. NEED FOR SYSTEM

We identify three levels of normalization granularity: record, field, and
value-component. Record level assumes that the values of the fields within a record are
governed by some hidden criterion and that together they create a cohesive unit that is
user-friendly. As a consequence, this normalization favors building the normalized
record from entire records among the set of matching records rather than piecing it
together from field values of different records. Thus, any of the matching records
(ideally, one that has no missing values) can be the normalized record. Using our running
example in Table 1, the record Rc is a possible choice for the normalized record with
this level of normalization granularity. Field level assumes that record level is often
inadequate in practice because records contain fields with incomplete values. Recall
that these records are the products of automatic data extraction tools, which are not
perfect and thus may produce errors.

2. SOFTWARE REQUIREMENT ANALYSIS AND SPECIFICATION

2.1. RELATED WORK

2.1.1 A secure data privacy preservation for on-demand cloud service

This paper focuses on the privacy and security of data stored in the cloud.
Cloud computing is introduced to increase the efficiency, optimization, and
effectiveness of the cloud environment. The authors introduce a Privacy Preserving
Model to Prevent Digital Data Loss in the Cloud. This proposal helps cloud
requesters and users entrust their proprietary information and data to the cloud.

2.1.2 Privacy-preserving security solution for cloud services

This paper presents a privacy-preserving security solution for cloud services. It is
based on a signature scheme for non-bilinear groups that provides anonymous
access to the cloud server and the shared storage server. It offers anonymous
authentication for registered users. The user's personal information can be presented
without revealing the user's identity. However, if any illegal activity is found, the user's rights
in the cloud server can be revoked. The proposed work provides anonymous
access, unlinkability, and confidentiality of data transmission.

2.1.3 An efficient public auditing protocol with novel dynamic structure for
cloud data

This paper presents an efficient method for structuring cloud data.
The authors propose a public auditing scheme in which dynamic operations can be
performed. Hashing is used in this method: a Merkle Hash Tree supports the
dynamic data operations, and a ring signature stores the information of the
user.

2.1.4 On a relation between verifiable secret sharing schemes and a class of
error-correcting codes

This paper explains verifiable secret sharing schemes. Using a
metric, the authors form a set of codes known as error-correcting codes. They then
consider burst-error interleaving codes and introduce an efficient burst-error
correcting scheme. With these methods, error correction and secret sharing of files can
be performed.

2.1.5 Security and privacy of sensitive data in cloud computing: A survey of
recent developments

This paper presents the available technologies and a broad collection of
created and implemented projects on cloud confidentiality and security. The
paper is organized around the available works on cloud
architecture, resource management, and cloud work management layers, along
with a review of the developments available for preserving the privacy of
confidential data in cloud computing.

2.1.6 A survey on cloud security issues and techniques

This paper explains some of the security issues in the cloud from various
aspects, such as insider attacks, outsider attacks, loss of control, data loss,
multi-tenancy, network security, elasticity, and availability. It also covers the available
security schemes and methods for securing a cloud. The paper conveys
the different security issues and tools to researchers and professionals.

2.3. LITERATURE SURVEY


[1] G. Loy and A. Zelinsky, “Fast radial symmetry for detecting points of interest,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, Aug. 2003.

As sharing personal media online becomes easier and widely spread, new privacy
concerns emerge – especially when the persistent nature of the media and associated
context reveals details about the physical and social context in which the media items were
created. In a first-of-its-kind study, we use context-aware cameraphone devices to examine
privacy decisions in mobile and online photo sharing. Through data analysis on a corpus of
privacy decisions and associated context data from a real-world system, we identify
relationships between location of photo capture and photo privacy settings. Our data
analysis leads to further questions which we investigate through a set of interviews with 15
users. The interviews reveal common themes in privacy considerations: security, social
disclosure, identity and convenience. Finally, we highlight several implications and
opportunities for design of media sharing applications, including using past privacy patterns
to prevent oversights and errors.

[2] J. Bonneau, J. Anderson, and L. Church, “Privacy suites: Shared privacy for social
networks,” in Proc. Symp. Usable Privacy Security, 2009.

Creating privacy controls for social networks that are both expressive and usable is a
major challenge. Lack of user understanding of privacy settings can lead to unwanted
disclosure of private information and, in some cases, to material harm. We propose a new
paradigm which allows users to easily choose “suites” of privacy settings which have been
specified by friends or trusted experts, only modifying them if they wish. Given that most
users currently stick with their default, operator-chosen settings, such a system could
dramatically increase the privacy protection that most users experience with minimal time
investment.

[3] J. Bonneau, J. Anderson, and G. Danezis, “Prying data out of a social network,” in Proc.
Int. Conf. Adv. Soc. Netw. Anal. Mining., 2009, pp.249–254.

Online photo albums have been prevalent in recent years and have resulted in more
and more applications developed to provide convenient functionalities for photo sharing. In
this project, we propose a system named SheepDog to automatically add photos into
appropriate groups and recommend suitable tags for users on Flickr. We adopt concept
detection to predict relevant concepts of a photo and probe into the issue about training
data collection for concept classification. From the perspective of gathering training data by
web searching, we introduce two mechanisms and investigate their performances of
concept detection. Based on some existing information from Flickr, a ranking-based method
is applied not only to obtain reliable training data, but also to provide reasonable group/tag
recommendations for input photos. We evaluate this system with a rich set of photos and
the results demonstrate the effectiveness of our work.

[4] H.-M. Chen, M.-H. Chang, P.-C. Chang, M.-C. Tien, W. H. Hsu, and J.-L. Wu, “Sheepdog:
Group and tag recommendation for flickr photos by automatic search-based learning,” in
Proc. 16th ACM Int. Conf. Multimedia, 2008, pp. 737–740.

The social media site Flickr allows users to upload their photos, annotate them with
tags, submit them to groups, and also to form social networks by adding other users as
contacts. Flickr offers multiple ways of browsing or searching it. One option is tag search,
which returns all images tagged with a specific keyword. If the keyword is ambiguous, e.g.,
“beetle” could mean an insect or a car, tag search results will include many images that are
not relevant to the sense the user had in mind when executing the query. We claim that
users express their photography interests through the metadata they add in the form of
contacts and image annotations. We show how to exploit this metadata to personalize
search results for the user, thereby improving search performance.

2.2. PRODUCT ARCHITECTURE

Fig.1.1: System Architecture


3.1. EXISTING SYSTEM

The problem of normalization of database records was first described by
Culotta et al. They provided the first attempt to formalize the record normalization problem
and proposed three solutions. The first solution uses string edit distance to determine the
most central record. The second solution optimizes the edit distance parameters, and the
third one describes a feature-based solution to improve performance by means of a
knowledge base. Their approach is an instance of typical field value normalization. They did
not consider value-component-level normalization. In addition, their gold standard dataset
has many instances of unreasonable normalized records. Swoosh describes a record Merge
operator; however, the purpose of the operator is not to produce normalized records, but
rather to improve the ability to establish difficult record matchings. Wick et al. propose a
discriminatively-trained model to implement schema matching, coreference, and
normalization jointly, but the complexity of the model is greatly increased.
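
The first solution above (edit distance to find the most central record) can be illustrated with a minimal sketch. It assumes that each record is flattened to a single string and that "most central" means the record with the smallest summed Levenshtein distance to all other records in the group; the class and method names are hypothetical and are not taken from Culotta et al.

```java
import java.util.List;

/** Illustrative sketch only: most central record by summed edit distance. */
final class CentralRecordSelector {

    /** Levenshtein distance with unit costs, computed row by row. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int substitution = prev[j - 1] + cost;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the first record whose total distance to all records is smallest. */
    static String mostCentral(List<String> records) {
        if (records.isEmpty()) {
            throw new IllegalArgumentException("records must not be empty");
        }
        String best = null;
        int bestTotal = Integer.MAX_VALUE;
        for (String candidate : records) {
            int total = 0;
            for (String other : records) {
                total += editDistance(candidate, other);
            }
            if (total < bestTotal) {
                bestTotal = total;
                best = candidate;
            }
        }
        return best;
    }
}
```

For example, among "Jon Smith", "John Smith", and "John Smyth", the sketch selects "John Smith", because it is one edit away from each of the other two values.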

3.1.1. Disadvantages

 In the existing work, the system uses only field-level normalization.
 There is no Web-scale integration system that automatically matches
records from different sources that refer to the same real-world entity.

3.2. PROPOSED SYSTEM

In this paper, we assume that the tasks of record matching and truth discovery have
been performed and that the groups of true matching records have thus been identified.
Our goal is to generate a uniform, standard record for each group of true matching records
for end-user consumption. The system calls the generated record the normalized record. We
call the problem of computing the normalized record for a group of matching records the
record normalization problem (RNP), and it is the focus of this work. RNP is another specific,
interesting problem in data fusion. The system proposes three levels of granularity for
record normalization along with methods to construct normalized records according to
them.

3.2.1. Advantages
 The system is fast because it identifies three levels of normalization
granularity: record, field, and value-component.
 Exact detection of duplicate records through mining of template collocation and
sub-collocation pairs.

2.3. PRODUCT FUNCTIONS

3.4.1 Record-level Normalization

Record-level normalization assumes that each record is a cohesive unit whose field values
are governed by some hidden criterion. This assumption, while intuitively appealing and
useful for building the theoretical underpinnings for constructing
normalized records, needs to be taken with a grain of salt in practice. Re contains a mixture
of candidate normalized records and records with incomplete or arcane representations of
the entity e, which may be difficult for ordinary users to understand.
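
As a minimal sketch of record-level selection, the code below picks a whole record from the group and prefers the one with the fewest missing field values. Representing a record as a map from field name to value is an assumption made for illustration, as are the class, method, and field names.

```java
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: choose one whole record from the matching group. */
final class RecordLevelNormalizer {

    /** Counts the fields of a record whose value is absent or blank. */
    static int missingCount(Map<String, String> record, List<String> fields) {
        int missing = 0;
        for (String field : fields) {
            String value = record.get(field);
            if (value == null || value.trim().isEmpty()) {
                missing++;
            }
        }
        return missing;
    }

    /** Returns the first record in the group with the fewest missing fields. */
    static Map<String, String> pickMostComplete(List<Map<String, String>> records,
                                                List<String> fields) {
        if (records.isEmpty()) {
            throw new IllegalArgumentException("records must not be empty");
        }
        Map<String, String> best = records.get(0);
        int bestMissing = missingCount(best, fields);
        for (Map<String, String> candidate : records) {
            int missing = missingCount(candidate, fields);
            if (missing < bestMissing) {
                best = candidate;
                bestMissing = missing;
            }
        }
        return best;
    }
}
```

Completeness is only one possible selection criterion; a record-level ranker could equally score records on other properties.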

3.4.2 Field-level Normalization

Field-level normalization selects a normalized value for each field fi independently
and concatenates the selected values of all fields into a normalized record. The normalized
value for the field fi is one of the values that appear among the records in Re in the field fi
and it is selected according to some criteria (e.g., more descriptive). The normalized record
formed in this way may consist of field values from different records.
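
The per-field selection can be sketched as follows. The sketch assumes that "more descriptive" is approximated by the longest non-blank value, which is a stand-in criterion chosen for illustration; the record representation and all names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: pick a value for each field independently. */
final class FieldLevelNormalizer {

    /**
     * Builds a normalized record by taking, for every field, the longest
     * non-blank value found among the matching records (assumed criterion).
     */
    static Map<String, String> normalize(List<Map<String, String>> records,
                                         List<String> fields) {
        Map<String, String> normalized = new LinkedHashMap<>();
        for (String field : fields) {
            String best = "";
            for (Map<String, String> record : records) {
                String value = record.get(field);
                if (value == null) {
                    continue; // field missing in this record
                }
                value = value.trim();
                if (value.length() > best.length()) {
                    best = value;
                }
            }
            normalized.put(field, best);
        }
        return normalized;
    }
}
```

With one record holding author "J. Smith" and venue "Very Large Data Bases" and another holding author "John Smith" and venue "VLDB", the result combines "John Smith" with "Very Large Data Bases", so the normalized record draws its values from two different records.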

3.4.3 Typical Normalization Framework

The typical normalization framework has two paths: record-level and field-level. The
former works with whole records from Re. It includes a number of record-level
rankers (RL rankers) to rank the records in Re according to their fitness to represent
the normalized record for entity e. In the single-strategy approach, each ranker
recommends the top-1 candidate in its ranked list as the normalized record. In RL
TSNRi denotes the normalized record recommended by the ith ranker. If we instead
use the multistrategy approach, then we employ rank merging methodologies to
select the final normalized record. In the multistrategy approach each ranker acts as
a voter and the records in Re are the candidates (for the normalized record). Each
ranker ranks the records in descending order of preference.
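
The multistrategy voting described above can be sketched with a Borda count, which is one common rank merging method; the text does not say which method is used, so this choice is an assumption. Each ranker supplies a list of record identifiers in descending order of preference, and the identifiers Ra, Rb, and Rc in the usage note are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: merge ranker votes with a Borda count. */
final class BordaRankMerger {

    /**
     * Each ranking lists record ids from most to least preferred. A record at
     * position p in a ranking of n records receives n - 1 - p points.
     * Ties are resolved in favor of the record seen first.
     */
    static String selectWinner(List<List<String>> rankings) {
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (List<String> ranking : rankings) {
            int n = ranking.size();
            for (int position = 0; position < n; position++) {
                scores.merge(ranking.get(position), n - 1 - position, Integer::sum);
            }
        }
        if (scores.isEmpty()) {
            throw new IllegalArgumentException("no candidates to rank");
        }
        String winner = null;
        int bestScore = -1;
        for (Map.Entry<String, Integer> entry : scores.entrySet()) {
            if (entry.getValue() > bestScore) {
                bestScore = entry.getValue();
                winner = entry.getKey();
            }
        }
        return winner;
    }
}
```

With the rankings (Ra, Rb, Rc), (Rb, Rc, Ra), and (Rb, Ra, Rc), the scores are Ra = 3, Rb = 5, and Rc = 1, so Rb is selected as the normalized record.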

2.4. USER CONSTRAINTS

The user constraints for the project are analyzed in this phase, and a business proposal is
put forth with a very general plan for the project and some cost estimates. During
system analysis the feasibility study of the proposed system is to be carried out. This
is to ensure that the proposed system is not a burden to the company. For feasibility
analysis, some understanding of the major requirements for the system is essential.

 ECONOMICAL CONSTRAINTS
 TECHNICAL CONSTRAINTS
 SOCIAL CONSTRAINTS

ECONOMICAL CONSTRAINTS

This study is carried out to check the economic impact that the system will
have on the organization. The amount of funds that the company can pour into the
research and development of the system is limited. The expenditures must be justified.
The developed system is well within the budget, and this was achieved because
most of the technologies used are freely available. Only the customized products had
to be purchased.

TECHNICAL CONSTRAINTS

This study is carried out to check the technical feasibility, that is, the technical
requirements of the system. Any system developed must not place a high demand on
the available technical resources, because this would in turn place high
demands on the client. The
developed system must have modest requirements, as only minimal or no changes
are required for implementing this system.

SOCIAL CONSTRAINTS

This aspect of the study checks the level of acceptance of the system by the
user. This includes the process of training the user to use the system efficiently. The
user must not feel threatened by the system, but must instead accept it as a necessity. The
level of acceptance by the users depends solely on the methods that are employed to
educate the users about the system and to make them familiar with it. Their level of
confidence must be raised so that they are also able to offer constructive criticism,
which is welcomed, as they are the final users of the system.

Existing Algorithms

Several existing algorithms have been developed to handle duplicate record normalization.
These methods generally fall into the categories of rule-based, probabilistic, and machine
learning-based approaches.

1. Rule-Based Approaches

These methods rely on manually defined rules to identify duplicates.

 Exact Matching: Compares records based on exact matches of key attributes (e.g.,
name, address).
 Fuzzy Matching: Uses approximate string matching techniques like Levenshtein
distance and Jaro-Winkler.
 Custom Heuristics: Uses domain-specific rules for deduplication.

Example Algorithms:
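
As one illustration of the exact-matching rule, the sketch below groups records whose key attributes are identical after simple cleaning. The cleaning steps (trimming, lowercasing, collapsing whitespace), the choice of name and address as key attributes, and all identifiers are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch only: rule-based exact matching on key attributes. */
final class ExactMatchRule {

    /** Trims, lowercases, and collapses runs of whitespace to one space. */
    static String clean(String value) {
        return value.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    /** Builds the comparison key from the two key attributes. */
    static String key(String name, String address) {
        return clean(name) + "|" + clean(address);
    }

    /**
     * Groups record indexes by key. Each record is a two-element array
     * {name, address}; records sharing a key are treated as duplicates.
     */
    static Map<String, List<Integer>> group(List<String[]> records) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (int i = 0; i < records.size(); i++) {
            String[] record = records.get(i);
            groups.computeIfAbsent(key(record[0], record[1]), k -> new ArrayList<>()).add(i);
        }
        return groups;
    }
}
```

Exact matching misses spelling variants such as "Jon" and "John", which is the gap that the fuzzy matching techniques listed above are meant to close.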

Proposed Algorithms

Newer approaches aim to enhance deduplication efficiency and accuracy by leveraging
hybrid models, graph-based methods, and reinforcement learning.

1. Hybrid Approaches

 Combine rule-based and ML models for better accuracy.
 Use rule-based filtering to reduce candidate pairs and deep learning for final
matching.
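
The two-stage idea can be sketched as a cheap rule-based filter followed by a pluggable scorer. In this sketch the filter keeps only pairs that start with the same letter, and the scorer is a simple token-overlap (Jaccard) function that merely stands in for a learned model; it is not a deep learning component, and all names and the threshold are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.ToDoubleBiFunction;

/** Illustrative sketch only: rule-based filtering, then a pluggable scorer. */
final class HybridMatcher {

    /** Rule-based filter: compare only records that share a first letter. */
    static boolean sameBlock(String a, String b) {
        return !a.isEmpty() && !b.isEmpty()
                && Character.toLowerCase(a.charAt(0)) == Character.toLowerCase(b.charAt(0));
    }

    /** Stand-in scorer (not a learned model): Jaccard overlap of word tokens. */
    static double tokenJaccard(String a, String b) {
        Set<String> tokensA = new HashSet<>(Arrays.asList(a.toLowerCase(Locale.ROOT).split("\\s+")));
        Set<String> tokensB = new HashSet<>(Arrays.asList(b.toLowerCase(Locale.ROOT).split("\\s+")));
        Set<String> intersection = new HashSet<>(tokensA);
        intersection.retainAll(tokensB);
        Set<String> union = new HashSet<>(tokensA);
        union.addAll(tokensB);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    /** Returns index pairs {i, j} that pass the filter and reach the threshold. */
    static List<int[]> match(List<String> records,
                             ToDoubleBiFunction<String, String> scorer,
                             double threshold) {
        List<int[]> matches = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            for (int j = i + 1; j < records.size(); j++) {
                if (!sameBlock(records.get(i), records.get(j))) {
                    continue; // pair removed by the rule-based filter
                }
                if (scorer.applyAsDouble(records.get(i), records.get(j)) >= threshold) {
                    matches.add(new int[] {i, j});
                }
            }
        }
        return matches;
    }
}
```

Passing the scorer as a parameter keeps the filter independent of the final matcher, so a trained model could replace `tokenJaccard` without changing the filtering step.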

2.5. HARDWARE REQUIREMENTS

HARDWARE REQUIREMENTS

Processor : I3 or higher
Speed : 2.9 GHz
RAM : 4 GB (min)
Hard Disk : 160 GB

SOFTWARE REQUIREMENTS

 Operating system : Windows 7 Ultimate
 Coding Language : Java
 Designing : HTML, CSS, JavaScript
 Data Base : MySQL (WAMP Server)

Functional Requirements

Functional requirements describe what the system should do. The functional
requirements can be further categorized as follows:

 What inputs should the system accept?
 What outputs should the system produce?
 What data must the system store?
 What computations are to be done?

The input design is the link between the information system and the user. It
comprises developing the specifications and procedures for data preparation and the
steps necessary to put transaction data into a usable form for processing. This can
be achieved by instructing the computer to read data from a written or printed
document, or by having people key the data directly into the system.
The design of input focuses on controlling the amount of input required, controlling
the errors, avoiding delay, avoiding extra steps, and keeping the process simple. The
input is designed so that it provides security and ease of use while
retaining privacy. Input design considers the following things:

1. What data should be given as input?
2. How should the data be arranged or coded?
3. The dialog to guide the operating personnel in providing input.
4. Methods for preparing input validations and the steps to follow when errors occur.

Non-Functional Requirements

Non-functional requirements are the constraints that must be adhered to during
development. They limit what resources can be used and set bounds on aspects of the
software’s quality.

User Interfaces

The User Interface is a GUI developed using Java.

Software Interfaces

The main processing is done in Java, in a console application.

Manpower Requirements

Five members can complete the project in 2–4 months if they work full-time on it.

3. SYSTEM DESIGN

3.1. UML DIAGRAMS INTRODUCTION

The Unified Modeling Language (UML) allows the software engineer to express an
analysis model using a modeling notation that is governed by a set of syntactic,
semantic, and pragmatic rules. A UML system is represented using five different
views that describe the system from distinctly different perspectives.

UML modeling is divided into two domains:

 UML analysis modeling, which focuses on the user model and structural model
views of the system.

 UML design modeling, which focuses on the behavioral modeling,
implementation modeling, and environmental model views.

3.2. SYSTEM DESIGN ASPECTS

Once the analysis stage is completed, the next stage is to determine
in broad outline form how the problem might be solved. During system
design, we begin to move from the logical to the physical level.

System design involves architectural and detailed design of the
system. Architectural design involves identifying software components,
decomposing them into processing modules and conceptual data
structures, and specifying the interconnections among components.

Detailed design is concerned with how to package processing
modules and how to implement the processing algorithms, data structures,
and interconnections. It involves the use of standard algorithms, the invention
of new algorithms, the design of data representations, and the packaging of
software products.

Two kinds of approaches are available:

 Top-down approach
 Bottom-up approach

3.2.1. Design of Code


Since information systems projects are designed with space, time,
and cost savings in mind, coding methods in which conditions, words, or
ideas are expressed by a code are used to control errors and speed the entire process. The purpose of the
code is to facilitate the identification and retrieval of the information. A
code is an ordered collection of symbols designed to provide unique
identification of an entity or an attribute.

3.2.2. Design of Input

Design of input involves the following decisions


 Input data
 Input medium
 The way data should be arranged or coded
 Validation needed to detect errors, and the steps to follow when an error occurs

The input controls provide ways to ensure that only authorized users access
the system, guarantee valid transactions, validate the data for accuracy, and
determine whether any necessary data has been omitted. The primary input medium
chosen is the display. Screens have been developed for input of data using HTML. The
validations for all important inputs are handled through various events using JSP
controls.
3.2.3. Design of Output

Design of output involves the following decisions


 Information to present
 Output medium
 Output layout

The output of this system is given in an easily understandable, user-friendly manner.
The layout of the output is decided through discussions with the different users.

3.2.4 Design of Control

The system should offer the means of detecting and handling errors.

Input controls provide ways to ensure that:

 Only valid transactions are accepted

 The accuracy of data is validated

 All mandatory data have been captured

All entries to the system will be validated, and updating of tables is allowed
only for valid entries. Means have been provided for correction: if incorrect
entries have been entered into the system by chance, they can be edited.
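
A minimal sketch of such an input control is shown below. It checks that all mandatory fields have been captured and accepts an entry only when no errors are found; the field names and the class are hypothetical and do not describe the project's actual JSP validation code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: accept an entry only if mandatory data is present. */
final class InputControl {

    /** Returns one error message per mandatory field that is absent or blank. */
    static List<String> validate(Map<String, String> entry, List<String> mandatoryFields) {
        List<String> errors = new ArrayList<>();
        for (String field : mandatoryFields) {
            String value = entry.get(field);
            if (value == null || value.trim().isEmpty()) {
                errors.add("Missing mandatory field: " + field);
            }
        }
        return errors;
    }

    /** An entry may update the tables only when validation finds no errors. */
    static boolean isAcceptable(Map<String, String> entry, List<String> mandatoryFields) {
        return validate(entry, mandatoryFields).isEmpty();
    }
}
```

Returning the list of errors rather than a single flag lets the caller tell the user which entries need to be edited.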

3.3. UML DIAGRAMS

Why Do We Use UML in Projects?

As the strategic value of software increases for many companies, the industry
looks for techniques to automate the production of software and to improve quality
and reduce cost and time-to-market. These techniques include component technology,
visual programming, patterns and frameworks. Businesses also seek techniques to
manage the complexity of systems as they increase in scope and scale. In particular,
they recognize the need to solve recurring architectural problems, such as physical
distribution, concurrency, replication, security, load balancing and fault tolerance.
Additionally, the development for the World Wide Web, while making some things
simpler, has exacerbated these architectural problems. The Unified Modeling
Language (UML) was designed to respond to these needs. Simply put, systems design
refers to the process of defining the architecture, components, modules, interfaces,
and data for a system to satisfy specified requirements, which can be done easily
through UML diagrams.

The following UML diagrams are explained for this project:

 Class Diagram
 Use Case Diagram
 Sequence Diagram
 Activity Diagram
 Collaboration Diagram
 Deployment Diagram
 State Chart Diagram
 Component Diagram

Class Diagram

A Class diagram in the Unified Modeling Language (UML) is a type of static
structure diagram that describes the structure of a system by showing the system's
classes, their attributes, and the relationships between the classes.

This is one of the most important diagrams in development. The diagram
breaks the class into three layers: the first has the name, the second describes its attributes, and
the third its methods. A padlock to the left of the name represents a private attribute. The
relationships are drawn between the classes. Developers use the Class Diagram to develop
the classes. Analysts use it to show the details of the system.

Architects look at class diagrams to see whether any class has too many functions
and whether it needs to be split.
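
To relate the three layers to Java source, the sketch below marks where each compartment of a class box would come from. The class `MatchingRecord` and its members are hypothetical and are not taken from the project's class diagram.

```java
/** Illustrative sketch only: how a class box maps onto Java source. */
final class MatchingRecord {                 // layer 1: the class name

    // layer 2: attributes (private ones are drawn with a padlock)
    private final String title;
    private final String venue;

    MatchingRecord(String title, String venue) {
        this.title = title;
        this.venue = venue;
    }

    // layer 3: methods
    String getTitle() {
        return title;
    }

    String getVenue() {
        return venue;
    }

    /** True when at least one attribute is blank. */
    boolean hasMissingValues() {
        return title.trim().isEmpty() || venue.trim().isEmpty();
    }
}
```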

Fig.3.1: Class Diagram

Use Case Diagram

A Use Case diagram in the Unified Modeling Language (UML) is
a type of behavioral diagram defined by and created from a Use-case
analysis. Its purpose is to present a graphical overview of the
functionality provided by a system in terms of actors, their goals
(represented as use cases), and any dependencies between those use
cases. The main purpose of a use case diagram is to show what system
functions are performed for which actor. Roles of the actors in the system
can be depicted. Use cases are used during requirements elicitation and
analysis to represent the functionality of the system. Use cases focus on
the behavior of the system from the external point of view. The actors are
outside the boundary of the system, whereas the use cases are inside the
boundary of the system.

Fig.3.2: Use Case Diagram


Sequence Diagram
A Sequence diagram in the Unified Modeling Language (UML) is a kind of interaction
diagram that shows how processes operate with one another and in what order. It is a
construct of a Message Sequence Chart. Sequence diagrams are sometimes called
event-trace diagrams, event scenarios, or timing diagrams.

Fig.3.3: Sequence Diagram

Activity Diagram

Activity diagrams are a loosely defined diagram technique for showing
workflows of stepwise activities and actions, with support for choice, iteration, and
concurrency. In the Unified Modeling Language, activity diagrams can be used to
describe the business and operational step-by-step workflows of components in a
system. An activity diagram shows the overall flow of control.

Fig.3.4: Activity Diagram

Collaboration Diagram

A Collaboration diagram (called a Communication diagram in UML 2) models the
interactions between objects or parts in
terms of sequenced messages. Communication diagrams represent a combination of
information taken from Class, Sequence, and Use Case Diagrams, describing both the
static structure and dynamic behavior of a system.

Fig.3.5: Collaboration Diagram

Deployment Diagram

A Deployment diagram in the Unified Modeling Language models the
physical deployment of artifacts on nodes. To describe a web site, for example, a
deployment diagram would show what hardware components ("nodes") exist (e.g., a
web server, an application server, and a database server), what software components
("artifacts") run on each node (e.g., web application, database), and how the different
pieces are connected (e.g., JDBC, REST).

Fig.3.6: Deployment Diagram

State Chart Diagram

A State diagram is a type of diagram used in computer science and related
fields to describe the behavior of systems. State diagrams require that the system
described is composed of a finite number of states; sometimes this is indeed the case,
while at other times this is a reasonable abstraction. Many forms of state diagrams
exist, which differ slightly and have different semantics.

Fig.3.7: State Chart Diagram

Component Diagram

In the Unified Modeling Language, a component diagram depicts how
components are wired together to form larger components and/or software systems.
They are used to illustrate the structure of arbitrarily complex systems.

Fig.3.8: Component Diagram

DATA FLOW DIAGRAM

1. The DFD is also called a bubble chart. It is a simple graphical formalism that
can be used to represent a system in terms of the input data to the system, the various
processing carried out on this data, and the output data generated by this
system.
2. The data flow diagram (DFD) is one of the most important modeling tools. It is
used to model the system components. These components are the system process,
the data used by the process, an external entity that interacts with the system, and
the information flows in the system.
3. A DFD shows how the information moves through the system and how it is
modified by a series of transformations. It is a graphical technique that depicts
information flow and the transformations that are applied as data moves from
input to output.
4. A DFD may be used to represent a system at
any level of abstraction. A DFD may be partitioned into levels that represent
increasing information flow and functional detail.

DFD NOTATIONS

Define source and destination data.

Shows path of the data flow.

To represent a process that transforms or modifies the data.

To represent an attribute.

Data store.

[Diagram: Open Login Form; Enter Username and Password; Check Username and Password
(Login Master); yes: User Home Page; no: Validation Data]

Fig.3.9: Data Flow Diagram



UML DIAGRAMS

UML stands for Unified Modeling Language. UML is a standardized general-purpose
modeling language in the field of object-oriented software engineering. The standard
is managed, and was created, by the Object Management Group.

The goal is for UML to become a common language for creating models of
object-oriented computer software. In its current form UML comprises two major
components: a meta-model and a notation. In the future, some form of method or
process may also be added to, or associated with, UML.

The Unified Modeling Language is a standard language for specifying, visualizing,
constructing and documenting the artifacts of software systems, as well as for
business modeling and other non-software systems.

The UML represents a collection of best engineering practices that have proven
successful in the modeling of large and complex systems.

The UML is a very important part of developing object-oriented software and the
software development process. The UML uses mostly graphical notations to express
the design of software projects.

GOALS:

The primary goals in the design of the UML are as follows:

1. Provide users with a ready-to-use, expressive visual modeling language so that they
can develop and exchange meaningful models.
2. Provide extensibility and specialization mechanisms to extend the core concepts.
3. Be independent of particular programming languages and development processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Support higher-level development concepts such as collaborations, frameworks,
patterns and components.
7. Integrate best practices.

3.3.1. USE CASE DIAGRAM:

Actors: User, Public Cloud
Use cases: Register, Login, Upload File, Download File, Check for Duplicate,
List of Files, Logout

Fig.3.4.: Use case Diagram for overall project



3.3.2. CLASS DIAGRAM:

Class DataUser
Attributes: userid, password, files, fileid, fileblocks
Operations: login(), register(), upload(), duplicatecheck(), encrypt(), decrypt(),
downoad(), logout()

Class Public-Cloud
Attributes: userid, password, filestorage, files
Operations: login(), storefiles(), encrypt(), decrypt(), duplicate(), logout()

Class Private-Cloud
Attributes: userid, password, files, rights, ownername, permissions
Operations: login(), activiation(), permissions(), logout()

Multiplicities shown on the associations: 1 and *

Fig.3.11: Class Diagram for Overall Project

3.3.3. SEQUENCE DIAGRAM:

A Sequence diagram in Unified Modeling Language (UML) is a kind of
interaction diagram that shows how processes operate with one another and in what
order. It is a construct of a Message Sequence Chart. Sequence diagrams are
sometimes called event diagrams, event scenarios, and timing diagrams.

Lifelines: Owner, Login, Received Permission from admin, File Upload, View User
Details, Receive File, Attribute
Messages, in order: uid,pwd; verify; receive permission; file upload; view user
details; receive file from cloud; change key

Fig.3.12: Sequence Diagram for Overall Project


3.3.4. ACTIVITY DIAGRAM:

Fig.3.13: Activity Diagram

ER-DIAGRAM

Fig 3.14: Er Diagram

4. TESTING

4.1. SOFTWARE TESTING TECHNIQUES

Software testing is a critical element of software quality assurance and
represents the ultimate review of specification, design and coding. Testing presents an
interesting anomaly for the software engineer.

4.1.1. Testing Objectives

1. Testing is a process of executing a program with the intent of finding an error.
2. A good test case is one that has a high probability of finding an as-yet-undiscovered
error.
3. A successful test is one that uncovers an as-yet-undiscovered error.

These objectives imply a dramatic change in viewpoint.

Testing cannot show the absence of defects; it can only show that software
errors are present.

4..1.2. Test Case Design

Any engineering product can be tested in one of two ways:

White Box Testing

This testing is also called glass box testing. In this testing, by knowing the
internal workings of a product, tests can be conducted to ensure that “all gears mesh”,
that is, that the internal operation performs according to specification and all internal
components have been adequately exercised. It is a test case design method that uses
the control structure of the procedural design to derive test cases. Basis path testing
is a white box testing technique.

Basis Path Testing

 Flow graph notation
 Cyclomatic complexity
 Deriving test cases

Control Structure Testing

 Condition testing
 Data flow testing
 Loop testing
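As a rough illustration of the basis path items listed above, the following sketch computes the cyclomatic complexity V(G) = E - N + 2 of a connected flow graph given as a list of directed edges. The class name, the method name, and the example graph (a single if/else) are assumptions made for this illustration and are not taken from the project code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of the project: computes V(G) = E - N + 2
// for a connected flow graph described by directed edges {from, to}.
class FlowGraphMetrics {

    static int cyclomaticComplexity(int[][] edges) {
        Set<Integer> nodes = new HashSet<>();
        for (int[] edge : edges) {
            nodes.add(edge[0]);
            nodes.add(edge[1]);
        }
        // E = number of edges, N = number of distinct nodes
        return edges.length - nodes.size() + 2;
    }

    public static void main(String[] args) {
        // Flow graph of one if/else: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4
        int[][] ifElse = { {1, 2}, {1, 3}, {2, 4}, {3, 4} };
        System.out.println("V(G) = " + cyclomaticComplexity(ifElse));
    }
}
```

The value of V(G) gives the number of independent paths in the basis set, and therefore a lower bound on the number of test cases needed to execute every statement at least once.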

Black Box Testing

In this testing, by knowing the specified function that a product has been
designed to perform, tests can be conducted that demonstrate each function is fully
operational while at the same time searching for errors in each function. It
fundamentally focuses on the functional requirements of the software.

The techniques involved in black box test case design are:

 Graph based testing methods
 Equivalence partitioning
 Boundary value analysis
 Comparison testing
 Graph matrices
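Two of the techniques listed above, equivalence partitioning and boundary value analysis, can be sketched as follows. The validation rule (a password must be 6 to 12 characters long) and all names are hypothetical and chosen only to make the example concrete; the report does not state the actual rule used by the project.

```java
// Hypothetical validation rule, assumed only for illustration:
// a password is valid when its length is between 6 and 12 characters.
class PasswordRule {
    static final int MIN = 6;
    static final int MAX = 12;

    static boolean isValidLength(String password) {
        return password != null
                && password.length() >= MIN
                && password.length() <= MAX;
    }

    public static void main(String[] args) {
        // Equivalence partitioning: one representative per class
        // (too short, valid, too long).
        String[] partitions = { "abc", "abcdefgh", "abcdefghijklmnop" };
        for (String p : partitions) {
            System.out.println(p.length() + " chars -> " + isValidLength(p));
        }

        // Boundary value analysis: lengths at and just outside each boundary.
        int[] boundaryLengths = { MIN - 1, MIN, MAX, MAX + 1 };
        for (int n : boundaryLengths) {
            System.out.println(n + " chars -> " + isValidLength("x".repeat(n)));
        }
    }
}
```

Partitioning keeps the number of test inputs small, while the boundary values target the places where off-by-one errors typically occur.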

4.2. SOFTWARE TESTING STRATEGIES

A strategy for software testing integrates software test cases into a series of
well-planned steps that result in the successful construction of software. Software
testing is one element of a broader topic that is referred to as Verification and
Validation. Verification refers to the set of activities that ensure that the software
correctly implements a specific function. Validation refers to the set of activities that
ensure that the software that has been built is traceable to the customer’s requirements.

4.2.1. Unit Testing

Unit testing focuses verification effort on the smallest unit of software design,
that is, the module. Using the procedural design description as a guide, important
control paths are tested to uncover errors within the boundaries of the module. The
unit test is normally white box oriented, and the step can be conducted in parallel for
multiple modules.

4.2.2. Integration Testing

Integration testing is a systematic technique for constructing the program
structure while conducting tests to uncover errors associated with interfacing. The
objective is to take unit tested modules and build a program structure that has been
dictated by design.

Top-Down Integration

Top-down integration is an incremental approach to the construction of the program
structure. Modules are integrated by moving downward through the control
hierarchy, beginning with the main control program. Modules subordinate to the
main program are incorporated into the structure in either a breadth-first or a
depth-first manner.

Bottom-up Integration

This method, as the name suggests, begins construction and testing with atomic
modules, i.e., modules at the lowest level. Because the modules are integrated from the
bottom up, the processing required for the modules subordinate to a given
level is always available and the need for stubs is eliminated.
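The difference between stubs (used in top-down integration) and drivers (used in bottom-up integration) can be sketched as follows. All interface, class, and method names here are hypothetical; they only illustrate the idea and do not correspond to modules of the project.

```java
// All names below are hypothetical; they only illustrate stubs and drivers.
interface RecordStore {
    int countDuplicates(String title);
}

// Stub: stands in for a lower-level module that is not integrated yet,
// as needed in top-down integration.
class RecordStoreStub implements RecordStore {
    @Override
    public int countDuplicates(String title) {
        return 3; // canned answer, no database access
    }
}

// Module under test; it depends on a RecordStore.
class DuplicateReport {
    private final RecordStore store;

    DuplicateReport(RecordStore store) {
        this.store = store;
    }

    String summary(String title) {
        return title + ": " + store.countDuplicates(title) + " duplicate(s)";
    }
}

// Driver: a throwaway main program that calls the module under test.
// In bottom-up integration such drivers replace callers that do not exist yet.
class DuplicateReportDriver {
    public static void main(String[] args) {
        DuplicateReport report = new DuplicateReport(new RecordStoreStub());
        System.out.println(report.summary("Record linkage"));
    }
}
```

When the real lower-level module becomes available, it replaces the stub without any change to the module under test, because both implement the same interface.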

Regression Testing

In the context of an integration test strategy, regression testing is the
re-execution of some subset of tests that have already been conducted to ensure that
changes have not propagated unintended side effects.

4.2.3. Validation Testing

At the end of integration testing, software is completely assembled as a
package. Validation testing is the next stage; it can be defined as successful when
the software functions in the manner reasonably expected by the customer.

Reasonable expectations are defined in the software requirements specification,
a document that describes all user-visible attributes of the software. The specification
contains a section titled “Validation Criteria”. Information contained in that section
forms the basis for a validation testing approach.

Validation Test Criteria

Software validation is achieved through a series of black-box tests that
demonstrate conformity with requirements. A test plan outlines the classes of tests to
be conducted, and a test procedure defines specific test cases that will be used in an
attempt to uncover errors in conformity with requirements. Both the plan and the
procedure are designed to ensure that all functional requirements are satisfied, all
performance requirements are achieved, documentation is correct and human-engineered,
and other requirements are met.

After each validation test case has been conducted, one of two possible
conditions exists: (1) the function or performance characteristics conform to
specification and are accepted, or (2) a deviation from specification is uncovered and
a deficiency list is created. A deviation or error discovered at this stage in a project can
rarely be corrected prior to scheduled completion. It is often necessary to negotiate
with the customer to establish a method for resolving deficiencies.

Configuration Review

An important element of the validation process is a configuration review. The
intent of the review is to ensure that all elements of the software configuration have
been properly developed, are catalogued, and have the necessary detail to support the
maintenance phase of the software life cycle. The configuration review is sometimes
called an audit.

Alpha and Beta Testing

It is virtually impossible for a software developer to foresee how the customer
will really use a program. Instructions for use may be misinterpreted; strange
combinations of data may be regularly used; and output that seemed clear to the tester
may be unintelligible to a user in the field.

When custom software is built for one customer, a series of acceptance tests
is conducted to enable the customer to validate all requirements. Conducted by the
end user rather than the system developer, an acceptance test can range from an
informal “test drive” to a planned and systematically executed series of tests. In fact,
acceptance testing can be conducted over a period of weeks or months, thereby
uncovering cumulative errors that might degrade the system over time.

The alpha test is conducted at the developer’s site by a customer, in a controlled
environment with the developer present. The beta test is conducted at one or more
customer sites by the end user of the software. Unlike alpha testing, the developer is
generally not present. Therefore, the beta test is a “live” application of the software in
an environment that cannot be controlled by the developer. The customer records all
problems that are encountered during beta testing and reports these to the developer at
regular intervals. As a result of problems reported during the beta test, the software
developer makes modifications and then prepares for release of the software product to
the entire customer base.

4.2.4. System Testing

System testing is actually a series of different tests whose primary purpose is
to fully exercise the computer-based system. Although each test has a different
purpose, all work to verify that all system elements have been properly integrated to
perform allocated functions.

4.2.5. Security Testing

Security testing attempts to verify the protection mechanisms built into the system.

4.2.6. Performance Testing

This method is designed to test the runtime performance of software within the
context of an integrated system.

4.3. TEST CASES

Table 4.1: Test Case Results

S. No. | TEST CASES | INPUT | EXPECTED RESULT | ACTUAL RESULT | STATUS
1 | User Registration | Enter all fields | User gets registered | Registration is successful | pass
2 | User Registration | If user misses any field | User not registered | Registration is unsuccessful | fail
3 | Admin Login | Give the user name and password | Admin home page should be opened | Admin home page has been opened | pass
4 | User Login | Give username and password | User page should be opened | User page has been opened | pass
5 | User Login | Give username without password | User page should not be opened | User name and password is invalid | fail
6 | Upload file | Select the file to upload | Upload to the database | Post upload successful | pass
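Test cases 1 and 2 of the table can be expressed in code roughly as follows. This is only a sketch: the class, the method names, and the choice of three fields are assumptions, since the registration code itself is not listed in this report. Only the two result messages are taken from the table.

```java
import java.util.Arrays;

// Hypothetical stand-in for the registration check; field names and the
// number of fields are assumptions made for illustration.
class RegistrationCheck {

    // Returns true only when every field is present and non-blank.
    static boolean allFieldsPresent(String... fields) {
        return fields.length > 0
                && Arrays.stream(fields).allMatch(f -> f != null && !f.trim().isEmpty());
    }

    // Produces the result messages used in the test case table.
    static String register(String... fields) {
        return allFieldsPresent(fields)
                ? "Registration is successful"
                : "Registration is unsuccessful";
    }

    public static void main(String[] args) {
        // Test case 1: all fields entered.
        System.out.println(register("user1", "secret", "user1@example.com"));
        // Test case 2: one field left empty.
        System.out.println(register("user1", "", "user1@example.com"));
    }
}
```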
5. IMPLEMENTATION

Java Technology

Java technology is both a programming language and a platform.

The Java Programming Language:


The Java programming language is a high-level language that can be
characterized by all of the following buzzwords:

 Simple
 Architecture neutral
 Object oriented
 Portable
 Distributed
 High performance
 Interpreted
 Multithreaded
 Robust
 Dynamic
 Secure

With most programming languages, you either compile or interpret a program
so that you can run it on your computer. The Java programming language is unusual
in that a program is both compiled and interpreted. With the compiler, you first
translate a program into an intermediate language called Java byte codes, the
platform-independent codes interpreted by the interpreter on the Java platform. The
interpreter parses and runs each Java byte code instruction on the computer.
Compilation happens just once; interpretation occurs each time the program is
executed. The following figure illustrates how this works.

Fig 5.1: Working of Java Program

You can think of Java byte codes as the machine code instructions for the Java Virtual
Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web
browser that can run applets, is an implementation of the Java VM. Java byte codes help
make “write once, run anywhere” possible. You can compile your program into byte codes on
any platform that has a Java compiler. The byte codes can then be run on any implementation
of the Java VM. That means that as long as a computer has a Java VM, the same program
written in the Java programming language can run on Windows 2000, a Solaris workstation,
or an iMac.
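The compile-once, run-anywhere cycle described above can be tried with a minimal program such as the one below. The class name is an arbitrary choice for this illustration; the system properties it reads are standard Java properties.

```java
// Minimal program to illustrate the compile-once, run-anywhere cycle.
// Compile:  javac PlatformInfo.java   (produces PlatformInfo.class byte codes)
// Run:      java PlatformInfo         (the Java VM executes the byte codes)
class PlatformInfo {

    // Describes the Java VM and the operating system it is running on.
    static String describe() {
        return System.getProperty("java.vm.name")
                + " on " + System.getProperty("os.name");
    }

    public static void main(String[] args) {
        System.out.println("Running on " + describe());
    }
}
```

The same PlatformInfo.class file can be copied to another operating system and run there without recompiling; only the printed platform description changes.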

Fig 5.2: Implementation of Java Virtual Machine


The Java Platform

A platform is the hardware or software environment in which a program runs. We’ve
already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris,
and MacOS. Most platforms can be described as a combination of the operating system and
hardware. The Java platform differs from most other platforms in that it’s a software-only
platform that runs on top of other hardware-based platforms.

The Java platform has two components:

 The Java Virtual Machine (Java VM)

 The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform
and is ported onto various hardware-based platforms.

Fig 5.3: Program Running on the Java Platform

Native code is code that, once compiled, runs on a specific hardware platform.
As a platform-independent environment, the Java platform can be a bit slower than
native code. However, smart compilers, well-tuned interpreters, and just-in-time byte
code compilers can bring performance close to that of native code without threatening
portability.

Feasibility Study

Technical Feasibility

The GUI is developed using HTML to capture information from the customer and to
display content in the browser. HTML is a markup language interpreted by the browser,
and pages are delivered over HTTP, which runs on top of TCP/IP. It is easy to develop
a page or document using HTML, and RAD (Rapid Application Development) tools are
available to design and develop the application quickly. Objects such as buttons, text
fields, and text areas are provided to capture information from the customer.

Economical Feasibility
The economic issues that usually arise during the economic feasibility stage are
whether the system will be used if it is developed and implemented, and whether the
financial benefits equal or exceed the costs. The cost of developing the project includes
the cost of conducting a full system investigation and the cost of hardware and software
for the class of application being considered; the benefits take the form of reduced costs
or fewer costly errors. The project is economically feasible if it is developed and
installed, because it reduces the work load. Keeping the class of application in view, the
cost of hardware and software is considered to be economically feasible.

Operational Feasibility

In our application the front end is a GUI, so it is easy for the customer to
enter the necessary information. However, the customer must have some knowledge
of using web applications before using our application.

1. Installation of Java:
 Go to http://www.oracle.com/technetwork/java/javase/downloads/index.html.
 Click on the JDK DOWNLOAD button. Run the exe file and then follow the
instructions given in the wizard.
 To set up the path:
o Right-click on My PC and then go to Properties.

Fig: properties wizard

o Go to advanced settings and then click on Environment Variables.
o Create a class path variable and copy into it the path of the Java folder
located in Program Files.

Fig: path setting for java

2. Installation and setup of Apache Tomcat:

 Go to http://tomcat.apache.org/index.html and click on download latest version.
 Run the exe file, click on Next, and follow the wizard instructions.

Fig: Welcome Page of Tomcat


 Click on Install, using port number 8090 and aits as both the username and the
password.
 Enter 8090 as the connection port, click on Next, and finally click on Finish.

Fig: Tomcat Configuration Options Page

 Click on the I Agree button in the license agreement in order to accept the
terms and conditions.

Fig: Tomcat License Agreement
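After the installation, a quick way to confirm that Tomcat is listening on the chosen port is a small probe such as the sketch below. Port 8090 comes from the steps above; the host name localhost, the timeout, and all class and method names are assumptions made for this illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Sketch only: checks whether a local Tomcat answers on the chosen port.
// Port 8090 comes from the setup steps; host "localhost" is an assumption.
class TomcatCheck {

    static String baseUrl(int port) {
        return "http://localhost:" + port + "/";
    }

    // Returns the HTTP status code, or -1 when the server cannot be reached.
    static int probe(String url, int timeoutMillis) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode();
            conn.disconnect();
            return status;
        } catch (IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        String url = baseUrl(8090);
        int status = probe(url, 2000);
        System.out.println(status == -1
                ? "Tomcat is not reachable at " + url
                : "Tomcat answered with HTTP " + status);
    }
}
```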

3. Installation and setup of MySQL:

 Go to http://dev.mysql.com/downloads/ and click on the install button.
 After the download completes, run the exe file and then click on Next.
 Run the MySQL setup, click on Next, and follow the instructions in the
wizard.

Fig: Welcome wizard of MySQL

 Confirm the setup type as Typical, then click on Next and follow the
instructions.

Fig: SQL setup Wizard

 Now confirm the password as root in the system settings field and then
click on Finish.

Fig: Database Configuration Engine
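Once MySQL is configured, the connection from Java can be checked with a JDBC sketch like the one below. Several details are assumptions not stated in this report: the default MySQL port 3306, the schema name "normalization" (hypothetical), the user name root, and the presence of the MySQL Connector/J driver on the class path. Only the password root is taken from the setup step above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Sketch only. Assumptions: MySQL listens on its default port 3306, the
// schema is called "normalization" (hypothetical), the user is "root", and
// MySQL Connector/J is on the class path. The password "root" follows the
// setup step described in the text.
class DbCheck {

    static String jdbcUrl(String host, int port, String schema) {
        return "jdbc:mysql://" + host + ":" + port + "/" + schema;
    }

    public static void main(String[] args) {
        String url = jdbcUrl("localhost", 3306, "normalization");
        // try-with-resources closes the connection automatically
        try (Connection conn = DriverManager.getConnection(url, "root", "root")) {
            System.out.println("Connected: " + !conn.isClosed());
        } catch (SQLException e) {
            System.out.println("Connection failed: " + e.getMessage());
        }
    }
}
```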


5.1. Sample Screens

Home Page

Screen 1: Home Page of Project

Admin Menu

Screen 2: Admin Menu Page

View Duplicate Records

Screen 3: Report Showing List of Duplicated Publication Records

Normalized Records

Screen 4: Report showing Normalized Records

View Book Marks

Screen 5: Report Showing List of Book Marks

Graph of Publication Rank

Screen 6: Graph showing Publication Records

Publication Search History

Screen 7: Report Showing Publication Search History

User Login

Screen 8: Form for User Login

User Menu

Screen 9: Form for User Menu

6. CONCLUSION

In this paper, we studied the problem of record normalization over a set of matching
records that refer to the same real-world entity. We presented three levels of normalization
granularity (record level, field level and value-component level) and two forms of
normalization (typical normalization and complete normalization). For each form of
normalization, we proposed a computational framework that includes both single-strategy
and multi-strategy approaches. We proposed four single-strategy approaches (frequency,
length, centroid, and feature-based) to select the normalized record or the normalized field
value. For the multi-strategy approach, we used result merging models inspired by
metasearching to combine the results from a number of single strategies. We analyzed
record-level and field-level normalization in the typical normalization. In the complete
normalization, we focused on field values and proposed algorithms for acronym expansion
and value-component mining to produce much improved normalized field values. We
implemented a prototype and tested it on a real-world dataset. The experimental results
demonstrate the feasibility and effectiveness of our approach. Our method outperforms the
state of the art by a significant margin.
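To make the single-strategy and multi-strategy ideas more concrete, the sketch below ranks the candidate values of one field by frequency and by length and then merges the two rankings. It is an illustration only, not the implementation evaluated in this work: the class and method names are hypothetical, the sample values are invented, and Borda counting is assumed here as one possible result merging model from metasearching.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Illustrative sketch, not the project's implementation. Each strategy ranks
// the distinct values of one field; Borda counting (an assumed merging model)
// combines the rankings.
class FieldNormalizer {

    // Frequency strategy: values that occur more often rank higher.
    static List<String> rankByFrequency(List<String> values) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((a, b) -> Integer.compare(counts.get(b), counts.get(a)));
        return ranked;
    }

    // Length strategy: longer (usually less abbreviated) values rank higher.
    static List<String> rankByLength(List<String> values) {
        List<String> ranked = new ArrayList<>(new LinkedHashSet<>(values));
        ranked.sort((a, b) -> Integer.compare(b.length(), a.length()));
        return ranked;
    }

    // Borda merging: position i in an n-item ranking earns n - i points;
    // the value with the most points wins (ties go to the first one seen).
    static String mergeByBorda(List<List<String>> rankings) {
        Map<String, Integer> points = new LinkedHashMap<>();
        for (List<String> ranking : rankings) {
            int n = ranking.size();
            for (int i = 0; i < n; i++) {
                points.merge(ranking.get(i), n - i, Integer::sum);
            }
        }
        String best = null;
        for (Map.Entry<String, Integer> e : points.entrySet()) {
            if (best == null || e.getValue() > points.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Invented venue values from four duplicate records.
        List<String> venue = List.of("VLDB", "Very Large Data Bases", "VLDB", "Proc. VLDB");
        List<String> byFrequency = rankByFrequency(venue);
        List<String> byLength = rankByLength(venue);
        System.out.println("Frequency pick: " + byFrequency.get(0));
        System.out.println("Length pick:    " + byLength.get(0));
        System.out.println("Merged pick:    " + mergeByBorda(List.of(byFrequency, byLength)));
    }
}
```

On the sample values the frequency strategy prefers the abbreviation while the length strategy prefers the spelled-out name, which shows why combining several strategies can be useful.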

In the future, we plan to extend our research as follows. First, conduct additional
experiments using more diverse and larger datasets. The lack of appropriate datasets
has so far made this difficult. Second, investigate how to add an effective
human-in-the-loop component to the current solution, as automated solutions alone will
not be able to achieve perfect accuracy. Third, develop solutions that handle numeric or
more complex values.

BIBLIOGRAPHY
[1] K. C.-C. Chang and J. Cho, “Accessing the web: From search to integration,” in SIGMOD,
2006, pp. 804–805.

[2] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang, “Webtables: Exploring the
power of tables on the web,” PVLDB, vol. 1, no. 1, pp. 538–549, 2008.

[3] W. Meng and C. Yu, Advanced Metasearch Engine Technology. Morgan & Claypool
Publishers, 2010.

[4] A. Gruenheid, X. L. Dong, and D. Srivastava, “Incremental record linkage,” PVLDB, vol. 7,
no. 9, pp. 697–708, May 2014.

[5] E. K. Rezig, E. C. Dragut, M. Ouzzani, and A. K. Elmagarmid, “Query-time record linkage
and fusion over web databases,” in ICDE, 2015, pp. 42–53.

[6] W. Su, J. Wang, and F. Lochovsky, “Record matching over query results from multiple
web databases,” TKDE, vol. 22, no. 4, 2010.

[7] H. Köpcke and E. Rahm, “Frameworks for entity matching: A comparison,” DKE, vol. 69,
no. 2, pp. 197–210, 2010.

[8] X. Yin, J. Han, and S. Y. Philip, “Truth discovery with multiple conflicting information
providers on the web,” ICDE, 2008.

[9] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, “Duplicate record detection: A
survey,” TKDE, vol. 19, no. 1, pp. 1–16, 2007.

[10] P. Christen, “A survey of indexing techniques for scalable record linkage and
deduplication,” TKDE, vol. 24, no. 9, 2012.

[11] S. Tejada, C. A. Knoblock, and S. Minton, “Learning object identification rules for
information integration,” Inf. Sys., vol. 26, no. 8, pp. 607–633, 2001.

[12] L. Shu, A. Chen, M. Xiong, and W. Meng, “Efficient spectral neighborhood blocking for
entity resolution,” in ICDE, 2011.

[13] S. M. Almansob and S. S. Lomte, “Addressing challenges for intrusion detection
system using naive bayes and pca algorithm,” in Convergence in Technology
(I2CT), 2017 2nd International Conference for. IEEE, 2017, pp. 565–568.

APPENDIX – A
 URL LISTING
o www.google.co.in
o www.Java.org
o www.w3schools.com
o www.Javatutorial.com
 REFERENCE BOOKS
 Java Crash Course 2nd Edition - a basic-level book for beginners.
 Learning Java 5th Edition - a practical learning book covering basic to
advanced level.
 Java Cookbook - a book for advanced programmers interested in
learning about modern Java development tools.
 Automating Boring Stuff With Java - in this book you will learn to
write programs in Java.
 Head First Java - this book covers the fundamentals of Java.
 Think Java - covers the basics of programming concepts and advanced
topics like data structures and object-oriented design.


APPENDIX – B
 GLOSSARY
o GUI : Graphical User Interface

o UML : Unified Modeling Language

o API : Application Programming Interface

o HTML : Hyper Text Markup Language

o URL : Uniform Resource Locator

o ODBC : Open Database Connectivity


APPENDIX – C
 List of Figures

FIG. NO. FIGURE NAME PAGE NO.

1.1 System Architecture 2


5.1 System Architecture 10
5.2 Data Flow Diagram 11
5.3 Use case diagram for overall project 13
5.4 Class diagram for overall project 14
5.5 Sequence diagram for overall project 15
5.6 Activity Diagram for Client 16
7.1 Flow Chart for Home Page 27
7.2 Graph Showing the Flow of home page 28


 List of Screens

SCREEN NO. SCREEN NAME PAGE NO.

Screen 1: ‘Upload Crop Dataset’ button to upload dataset 39
Screen 2: Selecting and uploading ‘Dataset.csv’ file and then click on ‘Open’ button to load dataset and to get below screen 40
Screen 3: Dataset loaded and we can see dataset contains some non-numeric values and ML will not take non-numeric values so we need to preprocess dataset to convert non-numeric values to numeric values by assigning ID to each non-numeric value. 41
Screen 4: Non-numeric values converted to numeric format and in below lines 42
Screen 5: Screen ML is trained and we got prediction error rate 43
Screen 6: Screen selecting and uploading ‘test.csv’ file 44
Screen 7: Screen each test record is separated with newline 45

 List of Tables

TABLE NO. TABLE NAME PAGE NO.

7.1 Test Case Results 29


APPENDIX – D
 Coding
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Home Page</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link href="css/style.css" rel="stylesheet" type="text/css" />
<link rel="stylesheet" type="text/css" href="css/coin-slider.css" />
<script type="text/javascript" src="js/cufon-yui.js"></script>
<script type="text/javascript" src="js/cufon-titillium-250.js"></script>
<script type="text/javascript" src="js/jquery-1.4.2.min.js"></script>
<script type="text/javascript" src="js/script.js"></script>
<script type="text/javascript" src="js/coin-slider.min.js"></script>
<style type="text/css">
<!--
.style1 {font-size: 20px}
.style2 {
color: #FF0000;
font-size: 25px;
}
.style4 { color: #FF0000;
font-weight: bold;
}
-->
</style>
</head>
<body>
<div class="main">
<div class="header">


<div class="header_resize">
<div class="slider">
<div id="coin-slider"> <a href="#"><img src="images/slide1.jpg"
width="960" height="399" alt="" /> </a></div>
</div>
<div class="menu_nav">
<ul>
<li class="active"><a href="index.html"><span>Home
Page</span></a></li>
<li><a href="a_login.jsp"><span>Admin</span></a></li>
<li><a href="u_login.jsp"><span>User</span></a></li>
<li><a href="p_login.jsp"><span>Publisher</span></a></li>

</ul>
</div>
<div class="logo">
<h1 class="style1"><a href="index.html" class="style2">Normalization of
Duplicate Records <br />
from Multiple Sources</a></h1>
</div>
<div class="clr"></div>
</div>
</div>
<div class="content">
<div class="content_resize">
<div class="mainbar">
<div class="article">
<h2 align="center"><span> Welcome </span></h2>
<p align="center"><img src="images/Home.png" width="566"
height="190" /></p>
<p align="justify"><span class="style4">Data consolidation is a
challenging issue in data integration. The usefulness of data increases when it is
linked and fused with other data from numerous (Web) sources. The promise of
Big Data hinges upon addressing several big data integration challenges, such as


record linkage at scale, real-time data fusion, and integrating Deep Web. Although
much work has been conducted on these problems, there is limited work on
creating a uniform, standard record from a group of records corresponding to the
same real-world entity. We refer to this task as record normalization. Such a
record representation, coined normalized record, is important for both front-end
and back-end applications. In this paper, we formalize the record normalization
problem, present in-depth analysis of normalization granularity levels (e.g.,
record, field, and value-component) and of normalization forms (e.g., typical
versus complete). We propose a comprehensive framework for computing the
normalized record. The proposed framework includes a suite of record
normalization methods, from naive ones, which use only the information gathered
from records themselves, to complex strategies, which globally mine a group of
duplicate records before selecting a value for an attribute of a normalized record.
We conducted extensive empirical studies with all the proposed methods. We
indicate the weaknesses and strengths of each of them and recommend the ones to
be used in practice.</span></p>
<div class="clr"></div>
</div>
</div>
<div class="sidebar">
<div class="clr"></div>
<div class="gadget">
<h2 class="star"><span>Sidebar</span> Menu</h2>
<div class="clr">
<p>&nbsp;</p>
</div>
<ul class="sb_menu"><li><a href="index.html"><span>Home
Page</span></a></li>
<li class="active"><a href="a_login.jsp"><span>Admin</span></a></li>
<li><a href="u_login.jsp"><span>User</span></a></li>
</ul>
<p><img src="images/img2.jpg" width="180" height="229" /></p>
<p>&nbsp;</p>
<p>&nbsp;</p>


<p>&nbsp;</p>
</div>
</div>
<div class="clr"></div>
</div>
</div>
<div class="fbg"></div>
<div class="footer">
<div class="footer_resize">
<div style="clear:both;"></div>
</div>
</div>
</div>
<div align=center></div>
</body>
</html>
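The abstract embedded in the home page above describes naive normalization methods that use only the information in the duplicate records themselves. As a rough illustration, and not the method evaluated in this project, the sketch below builds a normalized record by picking the most frequent non-empty value for each field. The class name `NaiveNormalizer`, the map-based record representation, and the tie-breaking rule (the value seen first wins) are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: field-level normalization by majority vote.
class NaiveNormalizer {

    // Returns one record whose value for each field is the most frequent
    // non-empty value among the duplicates; ties go to the value seen first.
    static Map<String, String> normalize(List<Map<String, String>> duplicates) {
        // field -> (value -> count); insertion order keeps tie-breaking deterministic
        Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
        for (Map<String, String> record : duplicates) {
            for (Map.Entry<String, String> e : record.entrySet()) {
                String value = e.getValue();
                if (value == null || value.trim().isEmpty()) {
                    continue; // missing values do not vote
                }
                counts.computeIfAbsent(e.getKey(), k -> new LinkedHashMap<>())
                      .merge(value.trim(), 1, Integer::sum);
            }
        }

        Map<String, String> normalized = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Integer>> field : counts.entrySet()) {
            String best = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> candidate : field.getValue().entrySet()) {
                if (candidate.getValue() > bestCount) { // strict '>' keeps the first value on ties
                    best = candidate.getKey();
                    bestCount = candidate.getValue();
                }
            }
            normalized.put(field.getKey(), best);
        }
        return normalized;
    }
}
```

Calling `NaiveNormalizer.normalize(records)` on a group of matching records returns a single map with one chosen value per field. A field that is empty in every record is simply absent from the result.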
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title> Bookmark Details</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link href="css/style.css" rel="stylesheet" type="text/css" />
<link rel="stylesheet" type="text/css" href="css/coin-slider.css" />
<script type="text/javascript" src="js/cufon-yui.js"></script>
<script type="text/javascript" src="js/cufon-titillium-250.js"></script>
<script type="text/javascript" src="js/jquery-1.4.2.min.js"></script>
<script type="text/javascript" src="js/script.js"></script>
<script type="text/javascript" src="js/coin-slider.min.js"></script>
<script language="javascript" type="text/javascript">
</script>
<style type="text/css">
<!--
.style1 {font-size: 20px}

.style2 {
color: #FF0000;
font-size: 25px;
}
.style4 {font-family: "Times New Roman", Times, serif}
.style5 {color: #FF0000}
.style6 {font-size: 15px}
.style7 {font-weight: bold}
.style8 {color: #000000}
-->
</style>
</head>
<body>
<div class="main">
<div class="header">
<div class="header_resize">
<div class="slider">
<div id="coin-slider"> <a href="#"><img src="images/slide1.jpg"
width="960" height="399" alt="" /> </a></div>
</div>
<div class="menu_nav">
<ul>
<li><a href="index.html"><span>Home Page</span></a></li>
<li><a href="a_login.jsp"><span>Admin</span></a></li>
<li class="active"><a href="u_login.jsp"><span>User</span></a></li>

</ul>
</div>
<div class="logo">
<h1 class="style1"><a href="index.html" class="style2">Normalization of
Duplicate Records from Multiple Sources</a></h1>
</div>
<div class="clr"></div>
</div>

</div>
<div class="content">
<div class="content_resize">
<div class="mainbar">
<div class="article">
<h2 align="center"> Bookmark Details </h2>
<p>&nbsp;</p>

<%@ include file="connect.jsp" %>


<%@ page import="java.util.*"%>
<%@ page import="java.text.*"%>
<%@ page import="java.util.Date"%>
<%@ page import="java.sql.*"%>
<%@ page import="com.oreilly.servlet.*,java.lang.*,java.text.SimpleDateFormat,java.io.*,javax.servlet.*,javax.servlet.http.*" %>
<%@ page import="java.util.*,java.security.Key,java.util.Random,javax.crypto.Cipher,javax.crypto.spec.SecretKeySpec"%>
<%@ page import="org.bouncycastle.util.encoders.Base64"%>
<%@ page import="java.util.Random,java.io.PrintStream,
java.io.FileOutputStream, java.io.FileInputStream,
java.security.DigestInputStream, java.math.BigInteger,
java.security.MessageDigest, java.io.BufferedInputStream" %>

<%
String s1 = "", s2 = "", s3 = "", s4 = "", s5 = "", s6 = "", s7 = "", s8, s9 = "", s10,
s11, s12, s13,s14,s15,s16,s17,s33 = "", s44 = "", s55 = "", s66 = "";
String ss2 = "", ss3 = "", ss4 = "", ss5 = "", ss6 = "", ss7 = "", ss8, ss9 = "";
int i = 0, j = 0, k = 0,i2 = 0;
String bk=request.getParameter("bk");
String rk=request.getParameter("rank");
String rk2=request.getParameter("rank2");
String keyword=request.getParameter("key");

String user=(String)application.getAttribute("user");

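try
{

// Timestamp for the audit row, formatted as dd/MM/yyyy HH:mm:ss.
SimpleDateFormat sdfDate = new SimpleDateFormat("dd/MM/yyyy");
SimpleDateFormat sdfTime = new SimpleDateFormat("HH:mm:ss");
Date now = new Date();
String strDate = sdfDate.format(now);
String strTime = sdfTime.format(now);
String dt = strDate + " " + strTime;

// Log the search. Values are bound as parameters rather than
// concatenated into the SQL text, which was open to SQL injection.
String task="Searched";
String strQuery222 = "insert into transaction_bk(user,bname,task,dt) values(?,?,?,?)";
PreparedStatement ps222 = connection.prepareStatement(strQuery222);
ps222.setString(1, user);
ps222.setString(2, bk);
ps222.setString(3, task);
ps222.setString(4, dt);
ps222.executeUpdate();
ps222.close();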

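// Look up the rank this user has already stored for the bookmark.
String sql2="select rank from transaction3 where user=? and bname=?";
PreparedStatement st22=connection.prepareStatement(sql2);
st22.setString(1, user);
st22.setString(2, bk);
ResultSet rs22=st22.executeQuery();
if(rs22.next())
{

s10 = rs22.getString(1);

//int UpdateRank1=Integer.parseInt(s10)+1;

// A row exists: overwrite it with the value of the "rank2" request parameter.
String strQuery12 = "update transaction3 set rank=? where user=? and bname=?";
PreparedStatement ps12 = connection.prepareStatement(strQuery12);
ps12.setString(1, rk2);
ps12.setString(2, user);
ps12.setString(3, bk);
ps12.executeUpdate();
ps12.close();

}
else{

// First visit by this user: start the per-user rank at 1.
String rank="1";
String strQuery22 = "insert into transaction3(user,bname,rank) values(?,?,?)";
PreparedStatement ps22 = connection.prepareStatement(strQuery22);
ps22.setString(1, user);
ps22.setString(2, bk);
ps22.setString(3, rank);
ps22.executeUpdate();
ps22.close();
}
rs22.close();
st22.close();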

// Fetch the bookmark record itself; the name is bound as a parameter.
String sql="select * from bookmark where name=?";
PreparedStatement st=connection.prepareStatement(sql);
st.setString(1, bk);
ResultSet rs=st.executeQuery();
if(rs.next())
{

i = rs.getInt(1);
s2 = rs.getString(2);//uploader
s3 = rs.getString(3).toLowerCase();//bk name
s4 = rs.getString(4).toLowerCase();//url
s5 = rs.getString(5).toLowerCase();//tag
s6 = rs.getString(6);//descr
s7 = rs.getString(7);//img
s8 = rs.getString(8);//rank
s9 = rs.getString(9);//upload date

String keys="q2e34rrfgfgfgg2a";
byte[] keyValue1 = keys.getBytes();
Key key1 = new SecretKeySpec(keyValue1, "AES");
Cipher c1 = Cipher.getInstance("AES");
c1.init(Cipher.DECRYPT_MODE, key1);

// Note: c1 is initialized but never applied; the description is only Base64-decoded.
String decrys6 = new String(Base64.decode(s6.getBytes()));

//int UpdateRank=Integer.parseInt(s8)+1;

// Store the value of the "rank" request parameter on the bookmark.
String strQuery2 = "update bookmark set rank=? where name=?";
PreparedStatement ps2 = connection.prepareStatement(strQuery2);
ps2.setString(1, rk);
ps2.setString(2, s3);
ps2.executeUpdate();
ps2.close();

%>

<table width="515" border="1.5" align="center" cellpadding="0" cellspacing="0">

<tr>
<td width="139" height="40" valign="middle" bgcolor="#FFFF00"
style="color: #2c83b0;"><div align="left" class="style14 style15 style20 style9
style4 style6 style5" style="margin-left:20px;"><strong>Bookmark
Image</strong></div></td>
<td width="116" rowspan="1" ><div class="style7" style="margin:10px
13px 10px 13px;">
<input name="image" type="image" src="bk_Pic.jsp?id=<%=i%>"
style="width:90px; height:90px;">
</div></td>
</tr>

<tr>
<td width="139" height="40" valign="middle" bgcolor="#FFFF00"
style="color: #2c83b0;"><div align="left" class="style14 style15 style20 style9
style4 style6 style5" style="margin-left:20px;"><strong>Bookmark
Name</strong></div></td>
<td width="252" valign="middle" height="40"
style="color:#000000;"><div align="left" class="style23 style9 style10 style6
style4" style="margin-left:20px;">
<%out.println(s3);%>
</div></td>
</tr>

<tr>
<td width="139" height="40" valign="middle" bgcolor="#FFFF00"
style="color: #2c83b0;"><div align="left" class="style14 style15 style20 style9
style4 style6 style5"
style="margin-left:20px;"><strong>URL</strong></div></td>
<td width="252" valign="middle" height="40"><div align="left"
class="style23 style9 style10 style6 style4" style="margin-left:20px;">

<input type="button" value="<%=s4%>" onclick="window.open('<%=s4%>')">
</div></td>
</tr>

<tr>
<td width="139" height="40" valign="middle" bgcolor="#FFFF00"
style="color: #2c83b0;"><div align="left" class="style14 style15 style20 style9
style4 style6 style5" style="margin-left:20px;"><strong>
User(Uploader)</strong></div></td>
<td width="252" valign="middle" height="40"
style="color:#000000;"><div align="left" class="style23 style9 style10 style6
style4" style="margin-left:20px;">
<%out.println(s2);%>
</div></td>
</tr>

<tr>
<td width="139" height="40" valign="middle" bgcolor="#FFFF00"
style="color: #2c83b0;"><div align="left" class="style14 style15 style20 style9
style4 style6 style5" style="margin-left:20px;"><strong>
Date</strong></div></td>
<td width="252" valign="middle" height="40"
style="color:#000000;"><div align="left" class="style23 style9 style10 style6
style4" style="margin-left:20px;">
<%out.println(s9);%>

</div></td>
</tr>

<tr>
<td width="139" height="40" valign="middle" bgcolor="#FFFF00"
style="color: #2c83b0;"><div align="left" class="style14 style15 style20 style9
style4 style6 style5"
style="margin-left:20px;"><strong>Tag</strong></div></td>
<td width="252" valign="middle" height="40"><div align="left"
class="style23 style9 style10 style6 style4" style="margin-left:20px;">
<textarea name="text" cols="25" rows="7" readonly><%= s5
%></textarea>
</div></td>
</tr>

<tr>
<td width="139" height="40" align="left" valign="middle"
bgcolor="#FFFF00" style="color: #2c83b0;"><div align="left" class="style14
style15 style20 style9 style4 style6 style5" style="margin-
left:20px;"><strong>Description</strong></div></td>
<td width="252" valign="middle" height="40"><div align="left"
class="style23 style9 style10 style6 style4" style="margin-left:20px;">
<textarea name="textarea" cols="25" rows="7" readonly><%= decrys6
%></textarea>
</div></td>
</tr>

<tr>
<td width="139" height="40" valign="middle" bgcolor="#FFFF00"
style="color: #2c83b0;"><div align="left" class="style14 style15 style20 style9
style4 style6 style5" style="margin-left:20px;"><strong>
Rank</strong></div></td>

<td width="252" valign="middle" height="40"
style="color:#000000;"><div align="left" class="style23 style9 style10 style6
style4" style="margin-left:20px;">
<%out.println(rk);%>
</div></td>
</tr>

<tr>
<td width="139" height="40" valign="middle" bgcolor="#FFFF00"
style="color: #2c83b0;"><div align="left" class="style14 style15 style20 style9
style4 style6 style5" style="margin-left:20px;"><strong> Ratings
</strong></div></td>
<td><span class="style8">
<%
int rank=Integer.parseInt(s8);

if(rank==3)
{
%>
<input name="image2" type="image" src="Gallery/1.png" width="30"
height="30" />
<%
}
if(rank>3 && rank<=6)
{
%>
<input name="image2" type="image" src="Gallery/2.png" width="80"
height="30" />
<%
}
if(rank>6 && rank<=9)
{
%>

<input name="image2" type="image" src="Gallery/3.png"
width="100" height="30" />
<%
}
if(rank>9 && rank<=12)
{
%>
<input name="image2" type="image" src="Gallery/4.png"
width="120" height="30" />
<%
}
if(rank>12 && rank<=15)
{
%>
<input name="image2" type="image" src="Gallery/5.png"
width="140" height="30" />
<%
}
if(rank>15)
{
%>
<input name="image2" type="image" src="Gallery/6.png"
width="170" height="30" />
<%
}
%>
</span></td>
</tr>

</table>

<%
} // end if(rs.next())
} // end try
catch(Exception e)
{
out.println(e.getMessage());
}

%>
<p>&nbsp;</p>
<p align="right">&nbsp;</p>
<p align="right"><a href="u_search_bk.jsp">Back</a></p>
<p>&nbsp;</p>
</div>
</div>
<div class="sidebar">
<div class="clr"></div>
<div class="gadget">
<h2 class="star"><span>User</span> Menu</h2>
<div class="clr">
<p>&nbsp;</p>
</div>
<ul class="sb_menu">
<li><a href="u_main.jsp"><span>User Main </span></a></li>
<li><a href="u_login.jsp"><span>Log Out</span></a></li>
</ul>
</div>

</div>
<div class="clr"></div>
</div>
</div>
<div class="fbg"></div>
<div class="footer">
<div class="footer_resize">
<div style="clear:both;"></div>
</div>
</div>
</div>
<div align=center></div>
</body>
</html>
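The details page above picks a rating image from the stored rank with a chain of separate `if` tests, and the same chain recurs in the other pages. As a sketch of how that mapping could be factored out, the hypothetical helper `RatingImages.forRank` below returns the image path for a rank, or `null` where the listing shows no image (any rank below 3, since its first test is `rank==3`). The class and method names are assumptions; the thresholds and paths are copied from the listing.

```java
// Hypothetical helper mirroring the rating thresholds used in the JSP pages.
class RatingImages {

    // Returns the Gallery image path for a rank, or null when no image is shown.
    static String forRank(int rank) {
        if (rank < 3)   return null;            // the listing renders nothing below 3
        if (rank == 3)  return "Gallery/1.png";
        if (rank <= 6)  return "Gallery/2.png";
        if (rank <= 9)  return "Gallery/3.png";
        if (rank <= 12) return "Gallery/4.png";
        if (rank <= 15) return "Gallery/5.png";
        return "Gallery/6.png";                 // rank > 15
    }
}
```

A page could call `RatingImages.forRank(rank)` once and emit a single `<input type="image">` when the result is not `null`. The listing also gives each image a different width, so a real refactoring would need to carry that value as well.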
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title> Publication Details</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link href="css/style.css" rel="stylesheet" type="text/css" />
<link rel="stylesheet" type="text/css" href="css/coin-slider.css" />
<script type="text/javascript" src="js/cufon-yui.js"></script>
<script type="text/javascript" src="js/cufon-titillium-250.js"></script>
<script type="text/javascript" src="js/jquery-1.4.2.min.js"></script>
<script type="text/javascript" src="js/script.js"></script>
<script type="text/javascript" src="js/coin-slider.min.js"></script>
<script language="javascript" type="text/javascript">
</script>
<style type="text/css">
<!--
.style1 {font-size: 20px}
.style2 {
color: #FF0000;

font-size: 25px;
}
.style4 {font-family: "Times New Roman", Times, serif}
.style5 {color: #FF0000}
.style6 {font-size: 15px}
.style7 {font-weight: bold}
-->
</style>
</head>
<body>
<div class="main">
<div class="header">
<div class="header_resize">
<div class="slider">
<div id="coin-slider"> <a href="#"><img src="images/slide1.jpg"
width="960" height="399" alt="" /> </a></div>
</div>
<div class="menu_nav">
<ul>
<li><a href="index.html"><span>Home Page</span></a></li>
<li><a href="a_login.jsp"><span>Admin</span></a></li>
<li class="active"><a href="u_login.jsp"><span>User</span></a></li>

</ul>
</div>
<div class="logo">
<h1 class="style1"><a href="index.html" class="style2">Normalization of
Duplicate Records from Multiple Sources</a></h1>
</div>
<div class="clr"></div>
</div>
</div>
<div class="content">
<div class="content_resize">

<div class="mainbar">
<div class="article">
<h2 align="center" class="style5"> Publication Details !!! </h2>
<p>&nbsp;</p>

<%@ include file="connect.jsp" %>


<%@ page import="java.util.*"%>
<%@ page import="java.text.*"%>
<%@ page import="java.util.Date"%>
<%@ page import="java.sql.*"%>
<%@ page import="com.oreilly.servlet.*,java.lang.*,java.text.SimpleDateFormat,java.io.*,javax.servlet.*,javax.servlet.http.*" %>
<%@ page import="java.util.*,java.security.Key,java.util.Random,javax.crypto.Cipher,javax.crypto.spec.SecretKeySpec"%>
<%@ page import="org.bouncycastle.util.encoders.Base64"%>
<%@ page import="java.util.Random,java.io.PrintStream,
java.io.FileOutputStream, java.io.FileInputStream,
java.security.DigestInputStream, java.math.BigInteger,
java.security.MessageDigest, java.io.BufferedInputStream" %>

<%
String s1 = "", s2 = "", s3 = "", s4 = "", s5 = "", s6 = "", s7 = "", s8, s9 = "", s10,
s11, s12, s13,s14,s15,s16,s17,s33 = "", s44 = "", s55 = "", s66 = "";
String ss2 = "", ss3 = "", ss4 = "", ss5 = "", ss6 = "", ss7 = "", ss8, ss9 = "";
int i = 0, j = 0, k = 0,i2 = 0;
String pub=request.getParameter("pub");
String rk=request.getParameter("rank");
String rk2=request.getParameter("rank2");
String keyword=request.getParameter("key");
String user=(String)application.getAttribute("user");

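try
{

// Timestamp for the audit row, formatted as dd/MM/yyyy HH:mm:ss.
SimpleDateFormat sdfDate = new SimpleDateFormat("dd/MM/yyyy");
SimpleDateFormat sdfTime = new SimpleDateFormat("HH:mm:ss");
Date now = new Date();
String strDate = sdfDate.format(now);
String strTime = sdfTime.format(now);
String dt = strDate + " " + strTime;

// Log the search. Values are bound as parameters rather than
// concatenated into the SQL text, which was open to SQL injection.
String task="Searched";
String strQuery222 = "insert into transaction_pub(user,pname,task,dt) values(?,?,?,?)";
PreparedStatement ps222 = connection.prepareStatement(strQuery222);
ps222.setString(1, user);
ps222.setString(2, pub);
ps222.setString(3, task);
ps222.setString(4, dt);
ps222.executeUpdate();
ps222.close();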

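// Look up the rank this user has already stored for the publication.
String sql2="select rank from transaction4 where user=? and pname=?";
PreparedStatement st22=connection.prepareStatement(sql2);
st22.setString(1, user);
st22.setString(2, pub);
ResultSet rs22=st22.executeQuery();
if(rs22.next())
{

s10 = rs22.getString(1);

//int UpdateRank1=Integer.parseInt(s10)+1;

// A row exists: overwrite it with the value of the "rank2" request parameter.
String strQuery12 = "update transaction4 set rank=? where user=? and pname=?";
PreparedStatement ps12 = connection.prepareStatement(strQuery12);
ps12.setString(1, rk2);
ps12.setString(2, user);
ps12.setString(3, pub);
ps12.executeUpdate();
ps12.close();

}
else{

// First visit by this user: start the per-user rank at 1.
String rank="1";
String strQuery22 = "insert into transaction4(user,pname,rank) values(?,?,?)";
PreparedStatement ps22 = connection.prepareStatement(strQuery22);
ps22.setString(1, user);
ps22.setString(2, pub);
ps22.setString(3, rank);
ps22.executeUpdate();
ps22.close();
}
rs22.close();
st22.close();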

// Fetch the publication record itself; the name is bound as a parameter.
String sql="select * from publication where name=?";
PreparedStatement st=connection.prepareStatement(sql);
st.setString(1, pub);
ResultSet rs=st.executeQuery();
if(rs.next())
{

i = rs.getInt(1);
s2 = rs.getString(2);//uploader
s3 = rs.getString(3);//pub name
s4 = rs.getString(4);//title
s5 = rs.getString(5);//venue, pages
s6 = rs.getString(6);//release date (Base64-encoded)
s7 = rs.getString(7);//img
s8 = rs.getString(8);//rank
s9 = rs.getString(9);//upload date

String keys="q2e34rrfgfgfgg2a";
byte[] keyValue1 = keys.getBytes();
Key key1 = new SecretKeySpec(keyValue1, "AES");
Cipher c1 = Cipher.getInstance("AES");
c1.init(Cipher.DECRYPT_MODE, key1);

// Note: c1 is initialized but never applied; the value is only Base64-decoded.
String decrys6 = new String(Base64.decode(s6.getBytes()));

//int UpdateRank=Integer.parseInt(s8)+1;

// Store the value of the "rank" request parameter on the publication.
String strQuery2 = "update publication set rank=? where name=?";
PreparedStatement ps2 = connection.prepareStatement(strQuery2);
ps2.setString(1, rk);
ps2.setString(2, s3);
ps2.executeUpdate();
ps2.close();

%>

<table width="515" border="1.5" align="center" cellpadding="0" cellspacing="0">

<tr>
<td width="139" height="40" valign="middle" bgcolor="#FFFF00"
style="color: #2c83b0;"><div align="left" class="style14 style15 style20 style9
style4 style6 style5" style="margin-left:20px;"><strong>Title
Image</strong></div></td>
<td width="116"><div class="style7" style="margin:10px 13px 10px
13px;">
<input name="image" type="image" src="pub_Pic.jsp?id=<%=i%>"
style="width:90px; height:90px;">
</div></td>
</tr>

<tr>
<td width="139" height="40" valign="middle" bgcolor="#FFFF00"
style="color: #2c83b0;"><div align="left" class="style14 style15 style20 style9
style4 style6 style5" style="margin-left:20px;"><strong>Publication
Name</strong></div></td>
<td width="252" valign="middle" height="40"
style="color:#000000;"><div align="left" class="style23 style9 style10 style6"
style="margin-left:20px;">
<%out.println(s3);%>
</div></td>
</tr>

<tr>
<td width="139" height="40" valign="middle" bgcolor="#FFFF00"
style="color: #2c83b0;"><div align="left" class="style14 style15 style20 style9

style4 style6 style5"
style="margin-left:20px;"><strong>Title</strong></div></td>
<td width="252" valign="middle" height="40"><div align="left"
class="style23 style9 style10 style6" style="margin-left:20px;">
<%out.println(s4);%>
</div></td>
</tr>

<tr>
<td width="139" height="40" valign="middle" bgcolor="#FFFF00"
style="color: #2c83b0;"><div align="left" class="style14 style15 style20 style9
style4 style6 style5" style="margin-left:20px;"><strong>
User(Uploader)</strong></div></td>
<td width="252" valign="middle" height="40"
style="color:#000000;"><div align="left" class="style23 style9 style10 style6
style4" style="margin-left:20px;">
<%out.println(s2);%>
</div></td>
</tr>

<tr>
<td width="139" height="40" valign="middle" bgcolor="#FFFF00"
style="color: #2c83b0;"><div align="left" class="style14 style15 style20 style9
style4 style6 style5" style="margin-left:20px;"><strong>
Date</strong></div></td>
<td width="252" valign="middle" height="40"
style="color:#000000;"><div align="left" class="style23 style9 style10 style6
style4" style="margin-left:20px;">
<%out.println(s9);%>
</div></td>
</tr>

<tr>

<td width="139" height="40" valign="middle" bgcolor="#FFFF00"
style="color: #2c83b0;"><div align="left" class="style14 style15 style20 style9
style4 style6 style5"
style="margin-left:20px;"><strong>Venue,Pages</strong></div></td>
<td width="252" valign="middle" height="40"><div align="left"
class="style23 style9 style10 style6" style="margin-left:20px;">
<textarea name="text" cols="25" rows="7" readonly><%= s5
%></textarea>
</div></td>
</tr>

<tr>
<td width="139" height="40" align="left" valign="middle"
bgcolor="#FFFF00" style="color: #2c83b0;"><div align="left" class="style14
style15 style20 style9 style4 style6 style5" style="margin-
left:20px;"><strong>Release Date </strong></div></td>

<td width="252" valign="middle" height="40"><div align="left"
class="style23 style9 style10 style6" style="margin-left:20px;">
<input name="textarea" type="text" value="<%= decrys6 %>"
size="25" readonly="readonly" />
</div></td>
</tr>

<tr>
<td width="139" height="40" valign="middle" bgcolor="#FFFF00"
style="color: #2c83b0;"><div align="left" class="style14 style15 style20 style9
style4 style6 style5" style="margin-left:20px;"><strong>
Rank</strong></div></td>
<td width="252" valign="middle" height="40"
style="color:#000000;"><div align="left" class="style23 style9 style10 style6
style4" style="margin-left:20px;">
<%out.println(rk);%>
</div></td>

</tr>

<tr>
<td width="139" height="40" valign="middle" bgcolor="#FFFF00"
style="color: #2c83b0;"><div align="left" class="style14 style15 style20 style9
style4 style6 style5" style="margin-left:20px;"><strong> Ratings
</strong></div></td>
<td><span class="style8">
<%
int rank=Integer.parseInt(s8);

if(rank==3)
{
%>
<input name="image2" type="image" src="Gallery/1.png" width="30"
height="30" />
<%
}
if(rank>3 && rank<=6)
{
%>
<input name="image2" type="image" src="Gallery/2.png" width="80"
height="30" />
<%
}
if(rank>6 && rank<=9)
{
%>
<input name="image2" type="image" src="Gallery/3.png"
width="100" height="30" />
<%
}
if(rank>9 && rank<=12)
{

%>
<input name="image2" type="image" src="Gallery/4.png"
width="120" height="30" />
<%
}
if(rank>12 && rank<=15)
{
%>
<input name="image2" type="image" src="Gallery/5.png"
width="140" height="30" />
<%
}
if(rank>15)
{
%>
<input name="image2" type="image" src="Gallery/6.png"
width="170" height="30" />
<%
}
%>
</span></td>
</tr>

</table>

<%
} // end if(rs.next())
} // end try
catch(Exception e)
{
out.println(e.getMessage());
}

%>

<p>&nbsp;</p>
<p align="right">&nbsp;</p>
<p align="right"><a href="u_search_pub.jsp">Back</a></p>
<p>&nbsp;</p>
</div>
</div>
<div class="sidebar">
<div class="clr"></div>
<div class="gadget">
<h2 class="star"><span>User</span> Menu</h2>
<div class="clr">
<p>&nbsp;</p>
</div>
<ul class="sb_menu">
<li><a href="u_main.jsp"><span>User Main </span></a></li>
<li><a href="u_login.jsp"><span>Log Out</span></a></li>
</ul>
</div>
</div>
<div class="clr"></div>
</div>
</div>
<div class="fbg"></div>
<div class="footer">

<div class="footer_resize">
<div style="clear:both;"></div>
</div>
</div>
</div>
<div align=center></div>
</body>
</html>
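Both details pages above build an AES `Cipher` but never call it, so the stored value is recovered by Base64 decoding alone. A minimal sketch of that decoding step is shown below, using the JDK's own `java.util.Base64` (available since Java 8) in place of the Bouncy Castle encoder. The class name `DescriptionCodec` and the choice of UTF-8 are assumptions; the listing relies on the platform default charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for the Base64-encoded column read into s6.
class DescriptionCodec {

    // Encodes plain text the way the stored column is assumed to be written.
    static String encode(String plain) {
        return Base64.getEncoder().encodeToString(plain.getBytes(StandardCharsets.UTF_8));
    }

    // Decodes a stored Base64 value back to text with an explicit charset.
    static String decode(String stored) {
        return new String(Base64.getDecoder().decode(stored), StandardCharsets.UTF_8);
    }
}
```

Base64 is an encoding, not encryption, so anyone with read access to the table can recover the text. If confidentiality is the goal, the cipher would actually have to be applied before encoding, and the key would need to live outside the page source.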
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>All Bookmarks Cluster Format </title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link href="css/style.css" rel="stylesheet" type="text/css" />
<link rel="stylesheet" type="text/css" href="css/coin-slider.css" />
<script type="text/javascript" src="js/cufon-yui.js"></script>
<script type="text/javascript" src="js/cufon-titillium-250.js"></script>
<script type="text/javascript" src="js/jquery-1.4.2.min.js"></script>
<script type="text/javascript" src="js/script.js"></script>
<script type="text/javascript" src="js/coin-slider.min.js"></script>
<script language="javascript" type="text/javascript">
</script>
<style type="text/css">
<!--
.style1 {font-size: 20px}
.style2 {
color: #FF0000;
font-size: 25px;
}
.style4 {font-size: 15px}
.style5 {font-family: "Times New Roman", Times, serif}
.style6 {color: #FF0000}
.style12 {color: #000000}

.style13 {
font-family: "Times New Roman", Times, serif;
font-size: 20px;
color: #0000FF;
}
-->
</style>
</head>
<body>
<div class="main">
<div class="header">
<div class="header_resize">
<div class="slider">
<div id="coin-slider"> <a href="#"><img src="images/slide1.jpg"
width="960" height="399" alt="" /> </a></div>
</div>
<div class="menu_nav">
<ul>
<li><a href="index.html"><span>Home Page</span></a></li>
<li class="active"><a href="a_login.jsp"><span>Admin</span></a></li>
<li><a href="u_login.jsp"><span>User</span></a></li>

</ul>
</div>
<div class="logo">
<h1 class="style1"><a href="index.html" class="style2">Normalization of
Duplicate Records from Multiple Sources</a></h1>
</div>
<div class="clr"></div>
</div>
</div>
<div class="content">
<div class="content_resize">
<div class="mainbar">

<div class="article">
<h2 align="center">View Bookmark Cluster Format Based on Name
</h2>
<p>&nbsp;</p>

<%@page import="java.io.BufferedInputStream"%>
<%@page import="java.security.DigestInputStream"%>
<%@page import="java.io.FileInputStream"%>
<%@page import="java.io.PrintStream"%>
<%@page import="java.io.FileOutputStream"%>
<%@page import="java.math.BigInteger"%>
<%@ page import="java.security.Key,java.security.KeyPair,java.security.KeyPairGenerator,javax.crypto.Cipher"%>
<%@ include file="connect.jsp"%>
<%@page import="java.util.*,java.security.Key,java.util.Random,javax.crypto.Cipher,javax.crypto.spec.SecretKeySpec,org.bouncycastle.util.encoders.Base64"%>

<%@page import="java.security.MessageDigest"%>
<%@page import="java.sql.Statement"%>
<%@page import="java.sql.ResultSet"%>

<%@page import="java.text.SimpleDateFormat"%>
<%@page import="java.util.Date"%>

<%

String s1 = "", s2 = "", s3 = "", s4 = "", s5 = "", s6 = "", s7 = "", s8, s9 = "", s10,
s11, s12, s13,s14,s15,s16,s17;

int i = 0, j = 1, k = 0;

try {

String query2 = "select distinct user from bookmark ";


Statement st2 = connection.createStatement();
ResultSet rs2 = st2.executeQuery(query2);
while (rs2.next())
{
s1 = rs2.getString(1);//user

%>
<span class="style13">Name: <%=s1%></span>
<table width="846" border="1" align="center" cellspacing="0" cellpadding="5">
<tr>
<td width="17" bgcolor="#FFFF00"><div align="center" class="style3
style4 style9 style5 style6">Id</div></td>
<td width="65" bgcolor="#FFFF00"><div align="center"
class="style3 style4 style9 style5 style6">Uploader Name </div></td>
<td width="72" bgcolor="#FFFF00"><div align="center" class="style3
style4 style9 style5 style6">Bookmark Name </div></td>
<td width="92" bgcolor="#FFFF00"><div align="center" class="style3
style4 style9 style5 style6">Bookmark Image </div></td>
<td width="81" bgcolor="#FFFF00"><div align="center" class="style3
style4 style9 style5 style6">URL</div></td>
<td width="82" bgcolor="#FFFF00"><div align="center" class="style3
style4 style9 style5 style6">Tag</div></td>
<td width="83" bgcolor="#FFFF00"><div align="center" class="style3
style4 style9 style5 style6">Description</div></td>
<td width="58" bgcolor="#FFFF00"><div align="center" class="style3
style4 style9 style5 style6">Upload Date</div></td>
<td width="45" bgcolor="#FFFF00"><div align="center" class="style3
style4 style9 style5 style6">Rank</div></td>

<td width="170" bgcolor="#FFFF00"><div align="center" class="style3
style4 style5 style6">Rating</div></td>
</tr>
<%

// Fetch every bookmark uploaded by this user; the name is bound as a parameter.
String query = "select * from bookmark where user=?";
java.sql.PreparedStatement st = connection.prepareStatement(query);
st.setString(1, s1);
ResultSet rs = st.executeQuery();
while (rs.next())
{
i = rs.getInt(1);
s2 = rs.getString(2);//uploader
s3 = rs.getString(3);//bk name
s4 = rs.getString(4);//url
s5 = rs.getString(5);//tag
s6 = rs.getString(6);//descr
s7 = rs.getString(7);//img
s8 = rs.getString(8);//rank
s9 = rs.getString(9);//upload date

String keys="q2e34rrfgfgfgg2a";
byte[] keyValue1 = keys.getBytes();
Key key1 = new SecretKeySpec(keyValue1, "AES");
Cipher c1 = Cipher.getInstance("AES");
c1.init(Cipher.DECRYPT_MODE, key1);

// Note: c1 is initialized but never applied; the description is only Base64-decoded.
String decrys6 = new String(Base64.decode(s6.getBytes()));

%>

<tr>
<td><div align="center" class="style9 style10 style5 style4
style12"><%=j%></div></td>
<td><div align="center" class="style9 style10 style5 style4
style12"><%=s2%></div></td>
<td><div align="center" class="style9 style10 style5 style4 style12"><%=s3%></div></td>
<td><div align="center" class="style9 style10 style5 style4 style12">
<input name="image" type="image" src="bk_Pic.jsp?id=<%=i%>"
style="width:90px; height:90px;" />
</div></td>
<td><div align="center" class="style9 style10 style5 style4
style12"><input type="button" value="<%=s4%>" onClick="window.open('<%=s4%>')"></div></td>
<td><div align="center" class="style9 style10 style5 style4 style12">
<textarea name="text" cols="10" rows="5" readonly><%= s5
%></textarea>
</div></td>
<td><div align="center" class="style9 style10 style5 style4 style12">
<textarea name="text" cols="10" rows="5" readonly><%= decrys6
%></textarea>
</div></td>
<td><div align="center" class="style9 style10 style5 style4 style12"><%=s9%></div></td>
<td><div align="center" class="style8 style5 style4 style12"><%=s8%></div></td>
<td><span class="style8 style5 style4 style12">
<%
int rank=Integer.parseInt(s8);

if(rank==3)
{
%>
<input name="image2" type="image" src="Gallery/1.png" width="30"
height="30" />
<%
}
if(rank>3 && rank<=6)
{
%>
<input name="image2" type="image" src="Gallery/2.png" width="80"
height="30" />
<%
}
if(rank>6 && rank<=9)
{
%>
<input name="image2" type="image" src="Gallery/3.png"
width="100" height="30" />
<%
}
if(rank>9 && rank<=12)
{
%>
<input name="image2" type="image" src="Gallery/4.png"
width="120" height="30" />
<%
}

if(rank>12 && rank<=15)
{
%>
<input name="image2" type="image" src="Gallery/5.png"
width="140" height="30" />
<%
}
if(rank>15)
{
%>
<input name="image2" type="image" src="Gallery/6.png"
width="170" height="30" />
<%
}
%>
</span></td>
</tr>

<%

j=j+1;}

%>
</table>
<p>&nbsp;</p>
<%

j=1;}

connection.close();
}

catch (Exception e) {
// out.println(e.getMessage());
}

%>

<p>&nbsp;</p>
<p align="right"><a href="a_all_bk.jsp">Back</a></p>
<div class="clr"></div>
</div>
</div>
<div class="sidebar">
<div class="clr"></div>
<div class="gadget">
<h2 class="star"><span>Admin</span> Menu</h2>
<div class="clr"><p>&nbsp;</p>
</div>
<ul class="sb_menu">
<li><a href="a_main.jsp">Admin Main </a></li>

<li><a href="a_login.jsp">Log Out</a></li>


</ul>
</div>
</div>
<div class="clr"></div>
</div>
</div>
<div class="fbg"></div>
<div class="footer">
<div class="footer_resize">
<div style="clear:both;"></div>
</div>
</div>
</div>
<div align=center></div>
</body>
</html>
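The cluster page above first queries the distinct uploaders and then runs one more query per uploader. As an alternative sketch, the rows of a single `select * from bookmark` could be grouped in memory before rendering. The `Bookmark` class and the `groupByUploader` method below are hypothetical, and `Bookmark` holds only two of the nine columns to keep the example short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical row type: only the uploader and bookmark name are modelled.
class Bookmark {
    final String uploader;
    final String name;

    Bookmark(String uploader, String name) {
        this.uploader = uploader;
        this.name = name;
    }
}

// Hypothetical helper that clusters rows by uploader in one pass.
class BookmarkClusters {

    // Groups rows by uploader, preserving the order in which uploaders first appear.
    static Map<String, List<Bookmark>> groupByUploader(List<Bookmark> rows) {
        Map<String, List<Bookmark>> clusters = new LinkedHashMap<>();
        for (Bookmark b : rows) {
            clusters.computeIfAbsent(b.uploader, k -> new ArrayList<>()).add(b);
        }
        return clusters;
    }
}
```

Each map entry then corresponds to one per-uploader table on the page, and the position of a bookmark inside its list plays the role of the counter `j`.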
