0% found this document useful (0 votes)

64 views5 pages

Genetic Approach Deduplication

The document discusses an optimized approach for record deduplication using a modified bat algorithm (MBAT). It summarizes existing genetic programming approaches to record deduplication that identify duplicate records. It then proposes using a modified bat algorithm that applies concepts from bat echolocation behavior to optimize the identification of duplicate records. Key concepts of the MBAT approach include modeling "bats" that use echolocation to identify "prey" or duplicate records. The approach aims to combine advantages of existing algorithms to improve deduplication performance.

Uploaded by

letter2lal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

64 views5 pages

Genetic Approach Deduplication

Uploaded by

letter2lal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 5

www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 2 Issue 6 June, 2013 Page No.

1874-1878

AN OPTIMIZED APPROACH FOR RECORD DEDUPLICATION USING MBAT ALGORITHM Subi S, Thangam P
PG-Scholar, M.E-CSE Coimbatore Institute of Engineering and Technology
[email protected] Assistant Professor,CSE Department Coimbatore Institute of Engineering and Technology [email protected]
2

Abstract: -Record deduplication[1] is the task of identifying, in a data storage, records that refer to the same real entity or any object in
spite of spelling mistakes, typing errors, different writing styles or even different schema representations or data types. In the existing system aims at providing Unsupervised Duplication Detection method which can be used to identify and remove the duplicate records from different data storge. UDD, which for a given query, can effectively identify duplicates from the query result records of different web databases. After removing the same source duplicates, the supposed non duplicate records from the same data storage can be used as training examples alleviating the trouble of users having to manually labeled training examples. Starting from the non duplicate reocord set, the two different classifiers, a Weighted Component Similarity Summing Classifier (WCSS) is used to knowing the duplicate records from the non duplicate record and presently a genetic programming (GP) approach to record deduplication. The approach joins several different pieces of attribute with similarity function extracted from the data content to produce a deduplication function that is able to identify whether two or more entries in a repository are replicas or not. Since record deduplication is a time taking task even for small repositories, the aim is to foster a method that finds a proper combination of the proper pieces of attribute with similarity function, thus yielding a deduplication function that maximizes performance using a small representative portion of the corresponding data for training purposes. But the optimization of result is less . The proposed system has to develop new method, modified bat algorithm for record duplication. The aim behind is to create a flexible and effective method that uses Data Mining algorithms. The system shares many similarities function with generational computation techniques such as Genetic programming approach.

Keywords- Deduplication, Gp, Modified bat algorithm

1 INTRODUCTION
1.1 RECORD DEDUPLICATION Record deduplication[1] is the task of identifying, in a data storage, records that refer the same real entity or object in spite of spelling mistakes words, typing errors, different writing styles or even different schema representations or data types. In this Research, the existing a genetic programming (GP)[5] approach to record deduplication[2] joines several different pieces of evidence extracted from the data content to produce a deduplication function that is able to identify whether two or more entries in a repository are replicas or not. Since record deduplication is a time taking task even for small repositories, our motive is to foster a method t0 finds a proper combination of the proper pieces of attribute with similarity function, thus yielding a deduplication function that maximizes performance using a small representative part of the corresponding data for guidance purposes. Then, the function can be used on the left over data or even applied to other repositories with similar characteristics. Moreover, new additional web cora data can be treated similarly by the suggested function, as long as there are no abrupt changes in the data structure, something that is very improbable in large data storage .The existing genetic programming approach consider all the data to found the duplicate records. Deduplication[3] is a key operation in integrating data from multiple data sources. The main challenge in this task is designing a function that can resolve when a pair of records refers to the same entity in spite of various data inconsistencies. Record deduplication is the task of identifying, in a data storage, records that refer to the same real world entity or any object in spite of spelling mistakes, typing errorfs, different writing styles or even different schema representations or data types.

2 RELATED WORK
Record deduplication is a growing research topic,several existing methods are present for record deduplication 2.1 OVERVIEW OF THE GENETIC PROGRAMMING APPROACH IN RECORD DEDUPLICATION GP[4] evolves a population of length-free data structures, also called records, each one representing a single solution to a given problem. During the generating process, the records are handled and modified by genetic operations such as reproduction, crossover, and mutation, in an iterative way that is expected to spawn better records (solutions to the proposed problem) in the subsequent generations. Page

Subi S, IJECS Volume2 Issue6 june, 2013 Page No.1874-1878

1874

In this work, the GP[6] generation wise process is guided by a generational evolutionary algorithm. This means that there are well defined and distinct generation cycles. It can adopted this approach since it captures the basic idea behind several generation wise algorithms. The algorithm steps are the following: 1. Initialize the population (with random or user provided records). 2. Evaluate all records in the present population, assigning a numeric rating or fitnessfunction to each record. 3. If the termination criterion is satisfied, then execute the last step. Otherwise continue. 4. Select the best n individuals into the next generation population. 5. Select m individuals that will compose the next generation with the best parents. 6. Apply the genetic operations to all records selected. Their children will compose the next Population. Replace the existing generation by the generated population and go back to Step 2. 7. Present the best record(s) in the population as the output of the evolutionary process. 2.2 EDIT DISTANCE APPROACH The edit distance[10] between two strings 1 and 2 is the minimum number of edit operations of single characters needed to transform the string 1 into 2. There are three types of edit operations:

when the training data is little. The classifier can perhaps reduce its confusion by seeking predictions on these uncertain instances. This intuition forms the basis for one major criteria of active learning, namely, selecting instances about which the classifier(s) built on the current training set is most uncertain.For example to show how selecting instances based on uncertainty can help reduce a classifier's confusion.

3 MODIFIED BAT ALGORITHM BASIC CONCEPTS

Metaheuristic algorithms such as particle swarm optimization, firefly algorithm and harmony search are now becoming powerful methods for solving many tough optimization problems. In this paper, propose a new metaheuristic method, the Bat Algorithm, based on the echosound behaviour of bats(basic attribute). It can also intend to join the advantages of existing algorithms into the new bat algorithm. The vast majority of heuristic and metaheuristic algorithms have been derived from the behaviour of biological systems and/or physical systems in nature. For example, particle swarm optimization was developed based on the swarm behaviour of birds and fish while simulated annealing was based on the annealing process of metals. New algorithms are also emerging recently, including harmony search and the firefly algorithm. The former was inspired by the improvising process of composing a piece of music, while the latter was formulated based on the flashing behaviour of fireflies. Each of these algorithms has certain advantages and disadvantages. For example, simulating annealing can almost assurance to find the best solution if the cooling process is slow enough and the simulation is running long enough; however, the fine adjustment in parameters does affect the union rate of the optimization process. A natural question is whether it is possible to join major advantages of these algorithms and try to build up a potentially improved algorithm. 3.1 BEHAVIOR OF THE MBATS Most microbats(with basic attribute) are insectivores. Microbats use a produce of sonar, called, echolocation, to detect pre(record), avoid obstacles, and locate their roosting crevices in the dark. These bats emit a very loud sound pulse and listen for the echo that bounces back from the surrounding things. Their pulses change in properties and can be linked with their hunting strategies, depending on the type. Most bats use small, frequency-modulated signals to sweep through about an octave, while others more often use constantfrequency signals for echosound. Their signal bandwidth varies depends on the species, and often increased by using more harmonics. By idealizing some of the echolsound characteristics of microbats(small keys), we can develop various bat-inspired algorithms or bat algorithms. Here developed Modified Bat Algorithm with Doppler Effect. For simplicity, here some of the approximate or idealized rules: 1. All bats(with abasic key record) use echosound to identify distance, and they also know the difference between food/prey(records) and background barriers in some magical way; 2. Bats(with a basic key) fly randomly with velocity vi at position xi with a fixed frequency fmin, varying wavelength Page 1875

insert a any word into the string. delete a word from the string, and modify one word with a different character.
To employ learnable text distance operations for each database field, and demonstrate that such measures are capable of adapting to the specific notion of similarity that is appropriate for the fields domain.Different edit operations have varying significance in different domains. For example, a digit substitution makes a major difference in a street address since it effectively changes the house number,while a single letter substitution is semantically insignificant because it is more likely to be caused by a typo or an abbreviation. Therefore, adapting string edit distance to a particular domain requires assigning different weights to different edit operations.Edit distance[9] metrics are widely used not only for text processing but also for biological equence alignment 2.3 ACTIVE LEARNING APPROACH The success of this method criticallyhinges on being able to provide a covering and challenging set of training pairs that bring out the subtlety of the deduplication[9] function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in large lists An active learner[5] starts with a limited labeled and a large unlabeled pool of instances. The labeled set forms the training data for an initial preliminary classifier. The goal is to seek out from the unlabeled pool those instances which when labeled will help strengthen the classifier at the fastest possible rate. The initial classifier will be sure about its predictions on some unlabeled instances but unsure on most others. The unsure instances are those that fall in the classifier's confusion region. This confusion region is large

Subi S, IJECS Volume2 Issue6 June, 2013 Page No.1874-1878

and loudness A0 to search for prey9original records). They can automatically adjust the wavelength (or frequency) of their emitted pulses and adjust the rate of pulse emission r [0,1], depending on the proximity of their target; 3. Doppler Effect is the within frequency of a wave for an observer moving relative to the source to destiny of the wave. The received frequency is higher (compared to the emitted frequency) during the approach, it is the same at the instant of passing by, and it is lower during the recession. where vs is positive if the source is running away from the observer, and negative if the source is running towards the observer.

Equation for checking the fi metric.

CORADATASET 10 20 30 40

GENETIC 0.71 0.8 0.91 1.1

MBAT 0.78 0.83 0.96 1.5

Table 1 precision for cora dataset Table 1 lists the precision of the existing method and MBAT for cora data set.It shows mbat algorithm performs the good optimization result

f =

(1)

CORADATASET
(ii) where the similar convention applies: vr is positive if the observer is running towards the source, and negative if the

GENETIC 0.7 0.75 0.8 1.2

MBAT 0.8 0.83 0.95 1.6

f = f=

f0 f0

(2)

(III) Single equation with both the source and receiver moving.

10 20 30 40

Table 2 recall for cora dataset

(3)
Table 2 lists the recall of the existing method and MBAT for cora data set.It shows mbat algorithm performsthe good optimization result

C is the velocity of waves in the medium(air) Vr is the velocity of the receiver relative to the medium; if the receiver is moving towards the source. Vs positive is the velocity of the source relative to the medium; positive if the source is moving away from the receiver. 4. Although the loudness can vary in many ways, we assume that the loudness varies from a large (positive) A0 to a minimum constant value Amin

METHOD GENETIC MBAT

F ACCURACY (%) 75 80

4 EXPERIMENTAL DATASET
For experiment evaluation used real data sets commonly employed for evaluating record deduplication approaches which are based on real data gathered from the web. In addition The first real data set, the Cora data set, is a collection of 1,295 distinct citations to 59 computer science papers taken from the Cora research paper search engine.These citations were divided into different attributes (authornames, year, title, venue, and pages and other info) by an information extraction system. In which for experimental evaluation u F1 metric,presion and recal mesurments are used. The F1metric harmonically combines the traditional precision (P)and recall (R) metrics commonly used for evaluating accuracy F1- accuracy measured according to the precision and recall measurement. P=

Table 3 F-accuracy measures for cora dataset Table 3 lists the f-accuracy of the existing method and MBAT for cora data set.It shows mbat algorithm performsthe good optimization result

4.1 SYTEM FLOW DIAGRAM

INPUT DATASET COMPONENT WEIGHT

ASSIGNMENT ALGORITHM IDENTIFYING DUPLICATE RECORD WITHOUT DUPLICATION GENETIC ALGORITHM MBAT ALGORITHM PERFORMANCE EVALUATION

(4) R= (5)
F1= (6)

Subi S, IJECS Volume2 Issue6 June, 2013 Page No.1874-1878

Page 1876

Fig.1 System flow diagram 4..2 SAMPLE SCREENSHOTS

Fig.5 Comparing the precision of genetic and MBAT method in line

Fig.2 Home Page Fig.2 shows the main home page of the output

graph Fig.6 Comparing the recall of genetic and MBAT method in bar graph In Fig 3 it can notice that MBAT performs the best optimization result according the evaluation f=accuracy metrics.

Fig.3 the existing system output using genetic programming approach Fig.7 Comparing the f accuracy of genetic and MBAT method in bar graph

6 CONCLUSION
Duplicate detection is an important step in data integration and this method is based on offline learning techniques, which requires training data. In the Web cora database scenario, where records to match are greatly query dependent. The genetics programming approach combines several different pieces of attribute with similarity function extracted from the data content to produce a deduplication[10] function that is able to identify whether two or more entries in a repository are replicas or not. Their aim is to foster a method that finds a proper combination of the best pieces of attribute with similarity function, thus yielding a deduplication[11] function that maximizes performance using a small representative portion of the corresponding data for training purposes. In the proposed system increases the optimization of process and increases the most represented data samples are selected, it finds the best optimization solution to deduplication of the records. MBAT shares many similarities with evolutionary computation techniques such as Genetic Page 1877

Fig.4 the proposed system output using MBAT algorithm

4 RESULTS AND DISCUSSION

In Fig 1 and Fig 2 it can notice that MBAT performs the best compared to th existing system result.

Subi S, IJECS Volume2 Issue6 June, 2013 Page No.1874-1878

Algorithms[12]. The system is initialized with a population(set of records) of random solutions and searches for optima by updating generations. MBAT search the best optimal by updating generations. In MBAT takes and less error rate when comparing to the GP. It is a one -way information sharing mechanism. References [1] Bhattacharya I and L. Getoor, (2004) Iterative Record Linkage forCleaning and Integration, Proc. Ninth ACM SIGMOD WorkshopResearch Issues in Data Mining and KnowledgeDiscovery, pp. 11-18. [2] de Almeida H.M, M. Cristo M.A. Goncalves, and P. Calado, A Combined Component Approach for Finding Collection-AdaptedRanking Functions Based on GenetiProgramming, Proc. 30thAnn. Intl ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 399-406, 2007. [3] Laender A.H.F , A.S. daSilva , M.A. Goncalves, and, M.G. de Carvalho (2006) Learning to Deduplicate Proc. Sixth ACM/IEEE CS JointConf. Digital Libraries, pp. 41-50. [4] Moustakides G.V, M.G. Elfeky, and, V.S. Verykios Bayesian Decision Model for Cost Optimal Record Matching, The VeryLarge Databases J., vol. 12, no. 1, pp. 28-40, 2003. [5] Sunter A.Band, I.P. Fellegi A Theory for Record Linkage, J. Am.Statistical Assoc., vol. 66, no. 1, pp. 1183-1210, 1969. [6] Bilenko M , P. Ravikumar, R. Mooney, S. Fienberg and W. Cohen, and,Adaptive Name Matching in Information Integration, IEEEIntelligent Systems, vol. 18, no. 5, pp. 16-23, Sept./Oct. 2003. [7] Falcao A.X, B. Zhang, , E.A. Fox, J.P. Papa, M.A. Goncalves, R.d.S. Torres, and, W.Fan A Genetic Programming Framework for Content-Based Image Retrieval, PatternRecognition, vol. 42, no. 2,pp. 283292, 2009.Citation Indexing, Computer, vol. 32, no. 6, pp. 67-71, June 1999.

[8] Moises G. de Carvalho, Alberto H.F. Laender, Marcos Andre Goncalves, and Altigran S. da Silva A Genetic Programming Approach to Record Deduplication Ieee transactions on knowledge and data engineering, vol. 24, no. 3, march 2012. [9] Mikhail Bilenko and Raymond J. Mooney Department of Computer Sciences University of Texas at Austin Adaptive Duplicate Detection Using Learnable String Similarity Measures Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining(KDD-2003), Washington DC, pp.39-48, August, 2003. [10]. Elmagarmid A.K, P.G. Ipeirotis, and V.S. Verykios, Duplicate Record Detection: A Survey, IEEE Trans. Knowledge and Data Eng.,vol. 19, no. 1, pp. 1-16, Jan. 2007.

Author Profile
Ms.Subi.S is currently pursuing M.E Computer Science and Engineering at Coimbatore Institute of Engineering and Technology, Coimbatore, Tamil Nadu, (Anna University, Chennai). She completed her B.E in Information Technology from Hindusthan Institute of Technology, Coimbatore, Tamil Nadu, (Anna University, Coimbatore) in 2011. Her research interests include Data Mining.

Ms.P.Thangam received her B.E Degree in Computer Hardware and software Engineering from Avinashilingam University, Coimbatore in 2001. She has received her M.E degree in Computer Science and Engineering from Government College of Technology, Coimbatore in 2007. She is currently doing her PhD in the area of Medical Image Processing under Anna University, Chennai. Presently she is working as an Assistant Professor in the Department of Computer Science and Engineering at Coimbatore Institute of Engineering and Technology, Coimbatore. Her research interests are in Image Processing, Medical Image Analysis, Data Mining, Classification and Pattern Recognition.

Subi S, IJECS Volume2 Issue6 June, 2013 Page No.1874-1878

Page 1878

Heath Ledger
100% (1)
Heath Ledger
37 pages
Online Food Delivering Companies
67% (3)
Online Food Delivering Companies
62 pages
Conspiracy To Erase A Nation Mini-Booklet
100% (1)
Conspiracy To Erase A Nation Mini-Booklet
21 pages
JAVA Syllabus PDF
No ratings yet
JAVA Syllabus PDF
2 pages
F Test
No ratings yet
F Test
10 pages
Plant Breeding For Biotic Stress Resistance
No ratings yet
Plant Breeding For Biotic Stress Resistance
166 pages
Mpumalanga Geography - Desktop Research Task 2024
No ratings yet
Mpumalanga Geography - Desktop Research Task 2024
13 pages
A Study On Data Deduplication Techniques For Optimized Storage
No ratings yet
A Study On Data Deduplication Techniques For Optimized Storage
7 pages
Propeller Owner's Manual: and Logbook
No ratings yet
Propeller Owner's Manual: and Logbook
278 pages
Normalization of Duplicate Records From Multiple Sources: IEEE Transactions On Knowledge and Data Engineering June 2018
No ratings yet
Normalization of Duplicate Records From Multiple Sources: IEEE Transactions On Knowledge and Data Engineering June 2018
15 pages
ECT2000 English+MANUAL PDF
100% (1)
ECT2000 English+MANUAL PDF
27 pages
2024 V15i5016
No ratings yet
2024 V15i5016
12 pages
Black Powder
100% (1)
Black Powder
48 pages
Ayesha
No ratings yet
Ayesha
43 pages
Creating Photorealistic Images With AI V1
100% (1)
Creating Photorealistic Images With AI V1
182 pages
Indeterministic Handling of Uncertain Decisions in Deduplication
No ratings yet
Indeterministic Handling of Uncertain Decisions in Deduplication
25 pages
Paper73335 339
No ratings yet
Paper73335 339
5 pages
Cse 20
No ratings yet
Cse 20
6 pages
Secularism Project A Study of Compatibility of Anti Conversion Laws With Right To Freedom New
100% (1)
Secularism Project A Study of Compatibility of Anti Conversion Laws With Right To Freedom New
43 pages
5 2009 XML
No ratings yet
5 2009 XML
13 pages
Chemistry X
No ratings yet
Chemistry X
125 pages
Adolescents in India
No ratings yet
Adolescents in India
250 pages
BDA April-May 2024 Answers
No ratings yet
BDA April-May 2024 Answers
5 pages
Recap Through Exercise
No ratings yet
Recap Through Exercise
37 pages
Iso 8529-2
No ratings yet
Iso 8529-2
38 pages
Gokul........ A Study On Product Positioning at TVS Motors
No ratings yet
Gokul........ A Study On Product Positioning at TVS Motors
101 pages
Payroll HRMS Presentation PDF
100% (1)
Payroll HRMS Presentation PDF
34 pages
Duplicate Record Detection: A Survey
No ratings yet
Duplicate Record Detection: A Survey
40 pages
Documentation
No ratings yet
Documentation
12 pages
Exercise 2: Creating A Matching Policy: Creating The DQS Connection Manager
No ratings yet
Exercise 2: Creating A Matching Policy: Creating The DQS Connection Manager
15 pages
Venkatesh Ganti, Anish Das Sarma - Data Cleaning. A Practical Perspective-Morgan & Claypool (2013)
No ratings yet
Venkatesh Ganti, Anish Das Sarma - Data Cleaning. A Practical Perspective-Morgan & Claypool (2013)
72 pages
Multi Banking Transaction System
No ratings yet
Multi Banking Transaction System
4 pages
Machine Learning Applied To The Clerical Task Management Problem in Master Data Management Systems
No ratings yet
Machine Learning Applied To The Clerical Task Management Problem in Master Data Management Systems
13 pages
Overall Description: Advantage of Proposed System
No ratings yet
Overall Description: Advantage of Proposed System
34 pages
Intro To Duplicate Detection
No ratings yet
Intro To Duplicate Detection
87 pages
Record Matching Over Query Results From Multiple Web Databases
No ratings yet
Record Matching Over Query Results From Multiple Web Databases
93 pages
Entity Analysis Resolution
100% (1)
Entity Analysis Resolution
22 pages
Co PRIVACY PDF
No ratings yet
Co PRIVACY PDF
7 pages
GEN. PHYSICS 1 (Week 6)
No ratings yet
GEN. PHYSICS 1 (Week 6)
5 pages
Vide
No ratings yet
Vide
80 pages
Digital Payment
No ratings yet
Digital Payment
8 pages
Project Management - Chapter One..
100% (1)
Project Management - Chapter One..
46 pages
Multi-Assignment Clustering For Boolean Data: Mario Frank
No ratings yet
Multi-Assignment Clustering For Boolean Data: Mario Frank
31 pages
CG Lab Record
No ratings yet
CG Lab Record
77 pages
Volume 3 Industrial
No ratings yet
Volume 3 Industrial
280 pages
Lecture 1
No ratings yet
Lecture 1
32 pages
Joomla Training Syllabus
No ratings yet
Joomla Training Syllabus
2 pages
A General Self Organized Tree Based Energy Balance
No ratings yet
A General Self Organized Tree Based Energy Balance
4 pages
65 2 37 395 PDF
No ratings yet
65 2 37 395 PDF
6 pages
C20.0046: Database Management Systems Lecture #5: M.P. Johnson Stern School of Business, NYU Spring, 2008
No ratings yet
C20.0046: Database Management Systems Lecture #5: M.P. Johnson Stern School of Business, NYU Spring, 2008
39 pages
DWH Fall2010 Lecture Slides Week13
No ratings yet
DWH Fall2010 Lecture Slides Week13
32 pages
Multi Banking Transaction System
No ratings yet
Multi Banking Transaction System
4 pages
Duplicate Detection of Record Linkage in Real-World Data: K. M, P T
No ratings yet
Duplicate Detection of Record Linkage in Real-World Data: K. M, P T
10 pages
A New Efficient Data Cleansing Method
No ratings yet
A New Efficient Data Cleansing Method
11 pages
Cinta Teflon Cafe PTFE 5151
No ratings yet
Cinta Teflon Cafe PTFE 5151
2 pages
Record Matching Over Query Results From Multiple Web Databases
No ratings yet
Record Matching Over Query Results From Multiple Web Databases
27 pages
Real-World Data Is Dirty: Data Cleansing and The Merge/Purge Problem
No ratings yet
Real-World Data Is Dirty: Data Cleansing and The Merge/Purge Problem
29 pages
DQ Matching
No ratings yet
DQ Matching
6 pages
Airline Questionnaire
No ratings yet
Airline Questionnaire
9 pages
Final Srs
No ratings yet
Final Srs
18 pages
Glass Comp. Methods PDF
No ratings yet
Glass Comp. Methods PDF
14 pages
Physics For The Anaesthetic Viva Complete DOCX Download
100% (15)
Physics For The Anaesthetic Viva Complete DOCX Download
17 pages
Pointers To Profession by Navansha Lord of 10th House Lord
No ratings yet
Pointers To Profession by Navansha Lord of 10th House Lord
2 pages
Keys Smart Time 11
No ratings yet
Keys Smart Time 11
15 pages
Anomaly Detection Via Eliminating Data Redundancy and Rectifying Data Error in Uncertain Data Streams
No ratings yet
Anomaly Detection Via Eliminating Data Redundancy and Rectifying Data Error in Uncertain Data Streams
18 pages
Task D - Principles of Flight CFI FAA Pilots .
No ratings yet
Task D - Principles of Flight CFI FAA Pilots .
9 pages
Support Vector Machine Based Data Classification To Avoid Data Redundancy Removal Before Persist The Data in A DBMS
No ratings yet
Support Vector Machine Based Data Classification To Avoid Data Redundancy Removal Before Persist The Data in A DBMS
4 pages
Eliminating Fuzzy Duplicates in Data Warehouses
No ratings yet
Eliminating Fuzzy Duplicates in Data Warehouses
12 pages
Domain of Study - Datamining Platform - J2Ee
No ratings yet
Domain of Study - Datamining Platform - J2Ee
11 pages
LSJ1512 - Progressive Duplicate Detection
No ratings yet
LSJ1512 - Progressive Duplicate Detection
5 pages
A Domain-Independent Data Cleaning Algorithm For Detecting Similar-Duplicates
No ratings yet
A Domain-Independent Data Cleaning Algorithm For Detecting Similar-Duplicates
10 pages
Data Cleaning by Genetic Programming Technique: Rajnish Kumar, Pradeep Bhaskar Salve, Pritam Desale
No ratings yet
Data Cleaning by Genetic Programming Technique: Rajnish Kumar, Pradeep Bhaskar Salve, Pritam Desale
7 pages
Record Deduplication Using Genetic Programming Approach: Research Paper
No ratings yet
Record Deduplication Using Genetic Programming Approach: Research Paper
8 pages
Subway Perfect Order Template
No ratings yet
Subway Perfect Order Template
12 pages
AMC - EX - P001 - SOP - Field Data Naming Standard - V01
No ratings yet
AMC - EX - P001 - SOP - Field Data Naming Standard - V01
14 pages
Unit 2 and 3 (2 Part)
No ratings yet
Unit 2 and 3 (2 Part)
9 pages
Conference SVM Classifier
No ratings yet
Conference SVM Classifier
6 pages
A3 Alpha Meter With EA - NIC: Product Guide PG42-1017A
No ratings yet
A3 Alpha Meter With EA - NIC: Product Guide PG42-1017A
24 pages
Interactive Deduplication Using Active Learning: Bombay Bhamidipaty Bombay
No ratings yet
Interactive Deduplication Using Active Learning: Bombay Bhamidipaty Bombay
10 pages
01
No ratings yet
01
18 pages
Máquina Tridimensional Portatil Fulcrum - Datasheet
No ratings yet
Máquina Tridimensional Portatil Fulcrum - Datasheet
2 pages
A Survey On: Application-Aware Big Data Deduplication in Cloud Environment
No ratings yet
A Survey On: Application-Aware Big Data Deduplication in Cloud Environment
7 pages
Using The Scorpion Z6 Alarm Controller
No ratings yet
Using The Scorpion Z6 Alarm Controller
4 pages
Efficient Merging and Filtering Algorithms For Approximate String Searches
No ratings yet
Efficient Merging and Filtering Algorithms For Approximate String Searches
10 pages
Similarity Flooding: A Versatile Graph Matching Algorithm and Its Application To Schema Matching
No ratings yet
Similarity Flooding: A Versatile Graph Matching Algorithm and Its Application To Schema Matching
12 pages
3 - Solid State Ionics 176, 2005, 1601
No ratings yet
3 - Solid State Ionics 176, 2005, 1601
11 pages
Duplicate Record Detection - A Survey
No ratings yet
Duplicate Record Detection - A Survey
16 pages
An Efficient Framework and Techniques of Data Deduplication in Data Center
No ratings yet
An Efficient Framework and Techniques of Data Deduplication in Data Center
5 pages
Query Similarities in Various Web Databases
No ratings yet
Query Similarities in Various Web Databases
6 pages
Mining Frequent Patterns From Very High Dimensional Data: A Top-Down Row Enumeration Approach
No ratings yet
Mining Frequent Patterns From Very High Dimensional Data: A Top-Down Row Enumeration Approach
12 pages
4.on Demand Quality of Web Services Using Ranking by Multi Criteria-31-35
No ratings yet
4.on Demand Quality of Web Services Using Ranking by Multi Criteria-31-35
5 pages
p802 Koudastutorial
No ratings yet
p802 Koudastutorial
2 pages
File Sharing and Data Duplication Removal in Cloud Using File Checksum
No ratings yet
File Sharing and Data Duplication Removal in Cloud Using File Checksum
3 pages
Mining Frequent Patterns From Very High Dimensional Data: A Top-Down Row Enumeration Approach
No ratings yet
Mining Frequent Patterns From Very High Dimensional Data: A Top-Down Row Enumeration Approach
12 pages
Novel and Efficient Approach For Duplicate Record Detection
No ratings yet
Novel and Efficient Approach For Duplicate Record Detection
5 pages
A Genetic Programming Approach To Record Deduplication
No ratings yet
A Genetic Programming Approach To Record Deduplication
45 pages
Ijctt V3i3p108
No ratings yet
Ijctt V3i3p108
6 pages
A Survey: Enhanced Block Level Message Locked Encryption For Data Deduplication
No ratings yet
A Survey: Enhanced Block Level Message Locked Encryption For Data Deduplication
4 pages
A Study On Application-Aware Local-Global Source Deduplication For Cloud Backup Services of Personal Storage
No ratings yet
A Study On Application-Aware Local-Global Source Deduplication For Cloud Backup Services of Personal Storage
3 pages
Data Duplication Removal Using File Checksum
No ratings yet
Data Duplication Removal Using File Checksum
2 pages
Probabilistic Data Deduplication Using Modern Backup Operation
No ratings yet
Probabilistic Data Deduplication Using Modern Backup Operation
5 pages
Energy Efficient Data Aggregation Techniques in Wireless Sensor Networks
No ratings yet
Energy Efficient Data Aggregation Techniques in Wireless Sensor Networks
1 page
Duplicate Detection Using Algorithms
No ratings yet
Duplicate Detection Using Algorithms
3 pages
An Improving Genetic Programming Approach Based Deduplication Using KFINDMR
No ratings yet
An Improving Genetic Programming Approach Based Deduplication Using KFINDMR
8 pages
IBC Storage & Handling - Gulf EcoPro
No ratings yet
IBC Storage & Handling - Gulf EcoPro
2 pages
CCB 342 Soil Mechanics-Course Plan-Semester 2 205
No ratings yet
CCB 342 Soil Mechanics-Course Plan-Semester 2 205
2 pages

Genetic Approach Deduplication

Uploaded by

Genetic Approach Deduplication

Uploaded by

www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 2 Issue 6 June, 2013 Page No.

Keywords- Deduplication, Gp, Modified bat algorithm

Subi S, IJECS Volume2 Issue6 june, 2013 Page No.1874-1878

3 MODIFIED BAT ALGORITHM BASIC CONCEPTS

Subi S, IJECS Volume2 Issue6 June, 2013 Page No.1874-1878

Equation for checking the fi metric.

GENETIC 0.71 0.8 0.91 1.1

MBAT 0.78 0.83 0.96 1.5

GENETIC 0.7 0.75 0.8 1.2

MBAT 0.8 0.83 0.95 1.6

Table 2 recall for cora dataset

METHOD GENETIC MBAT

4.1 SYTEM FLOW DIAGRAM

Subi S, IJECS Volume2 Issue6 June, 2013 Page No.1874-1878

Fig.1 System flow diagram 4..2 SAMPLE SCREENSHOTS

Fig.5 Comparing the precision of genetic and MBAT method in line

Fig.4 the proposed system output using MBAT algorithm

4 RESULTS AND DISCUSSION

Subi S, IJECS Volume2 Issue6 June, 2013 Page No.1874-1878

Subi S, IJECS Volume2 Issue6 June, 2013 Page No.1874-1878

You might also like