Anomaly Detection Via Eliminating Data Redundancy and Rectifying Data Error in Uncertain Data Streams
Abstract
INTRODUCTION
Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior. These non-conforming patterns are often referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities or contaminants in different application domains. Of these, anomalies and outliers are the two terms used most commonly in the context of anomaly detection, and sometimes interchangeably. Anomaly detection finds extensive use in a wide variety of applications, such as fraud detection for credit cards, insurance or health care, intrusion detection for cyber-security, fault detection in safety-critical systems, and military surveillance for enemy activities.
The importance of anomaly detection stems from the fact that anomalies in data translate into significant (and often critical) actionable information in a wide variety of application domains. For example, an anomalous traffic pattern in a computer network could mean that a hacked computer is sending out sensitive data to an unauthorized destination. An anomalous MRI image may indicate the presence of malignant tumors. Anomalies in credit card transaction data could indicate credit card or identity theft, and anomalous readings from a spacecraft sensor could signify a fault in some component of the spacecraft. Detecting outliers or anomalies in data has been studied in the statistics community as early as the 19th century. Over time, a variety of anomaly detection techniques have been developed in several research communities. Many of these techniques have been developed specifically for certain application domains, while others are more generic.
The data taken into account in this paper is uncertain data, and its size is very large. A PSO based approach is applied to find and count duplicates: it combines various pieces of evidence extracted from the data content to produce a de-duplication method. This approach is able to identify whether any two entries in a repository are the same or not. In the same manner, more pieces of the data are combined, taken as evidence, and compared with the whole data as training data. This function is applied repeatedly on the whole data or on the repositories. Newly inserted data can also be compared in the same manner, avoiding replicas by comparison against the evidence. A method applied to record de-duplication must accomplish individual but contradictory objectives: it should effectively increase the identification of replicated records while avoiding marking distinct records as replicas. Genetic programming (GP) [15] is chosen as the basic approach, since it is suitable for finding accurate answers to a given problem without searching the whole data. For the record de-duplication problem, the existing approaches [14, 16] apply genetic programming to provide good solutions.
In this paper, the results of the existing system in [16], which does not use a PSO based approach, are taken for comparison; our approach is able to automatically find more effective de-duplication methods. Moreover, the PSO based approach can interoperate with the existing best de-duplication methods by adjusting the replica-identification limits used to classify a pair of records as a match or not. In our experiments, the real data sets contain scientific article citations and hotel index records. In addition, synthetically generated data sets are used to keep the experimental environment under control. Our approach can be applied in all of these scenarios.
On the whole, the contribution of this paper is a PSO based approach to find and count duplicates, as follows:
- A solution with low computational time for duplicate detection.
- Reduced individual comparisons, using the PSO approach to find the similarity values.
- Choosing the replicas by computing TPR and FPR among the data.
- Rectifying the errors in the data entries.
RELATED WORKS
In [3] the author proposed an approach to data reduction; data reduction functions are essential to machine learning and data mining, and an agent based population algorithm is used to solve the data reduction problem. Data reduction alone, however, is not the only way to improve the quality of databases. Databases of various sizes are used to provide high-quality classification of the data in order to find anomalies. In [4], two classes of algorithms, evolutionary and non-evolutionary, are applied and the results are compared to find the most suitable algorithm for anomaly detection. N-ary relations are computed to define the patterns in the dataset in [5], where they provide relations in one-dimensional data. DBLEARN and DBDISCOVER [6] are two systems developed to analyze relational DBMSs. The main objective of the data mining technique in [7] is to detect and classify data in a huge database without compromising the speed of the process; PCA is used for data reduction and SVM is used for data classification. In [8] the data redundancy method is explored using a mathematical representation. Software with safe, correct and reliable operations has been developed for avionics and automobile database systems [9]. A statistical QA (Question Answer) model is applied to develop a prototype that avoids web based data redundancy [10]. In Geographic Data Warehouses (GDW) [11], SOLAP (Spatial On-Line Analytical Processing) is applied to the GiST database and to other spatial database analysis, indexing, and generation of various reports without error. In [12], an effective method was proposed for P2P data sharing, in which data duplication is removed during sharing. Web entity data extraction associated with the attributes of the data [13] can be obtained using a novel approach that uses duplicated attribute-value pairs.
G. de Carvalho et al. [1] used a genetic algorithm to mark duplications and perform de-duplication in the data, concentrating mainly on identifying whether entries in a repository are replicas. This approach outperformed earlier approaches, providing 6.2% higher accuracy on the two data sets found in [2]. Our proposed approach can be extended to various benchmark and real-time data sets such as time series data, clinical data, the 20-20 news group data, etc.
PARTICLE OPERATIONS
PSO generates random particles representing individuals. In this paper, particles are modeled as trees representing arithmetic functions, as illustrated in Figure-1. When using this tree representation in the PSO based approach, the set of all inputs, variables, constants and methods should be defined [8]. The nodes terminating the trees are called leaves. The collection of operators, statements and methods is used in the PSO evolutionary process to manipulate the terminal values; all these methods are placed in the internal nodes of the tree, as shown in Figure-1. In general, PSO is based on an analysis of the social behavior of birds. In order to search for food, every bird in a flock moves with a velocity based on its personal experience and on information collected by interacting with the other birds in the flock; this is the basic idea of PSO. Each particle denotes a bird, and flying denotes searching the subspace of the optimization problem for the optimum solution. In PSO, the solutions within an iteration are called a swarm, and the swarm size equals the population size.
[Figure-1: Tree representation of an arithmetic function used as a particle, e.g., Tree(a, b, c) = a + (b + b)]
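To make the particle operations concrete, the following is a minimal Python sketch of the standard PSO velocity-and-position update described above. It is only an illustration: the inertia weight W and acceleration coefficients C1, C2 are assumed values, not parameters prescribed by this paper.

```python
import random

# Assumed PSO constants (illustrative only, not taken from the paper)
W, C1, C2 = 0.7, 1.5, 1.5

def update_particle(position, velocity, pbest, gbest):
    """Move one particle toward its personal best (pbest) and the
    global best (gbest), following the standard PSO update rule."""
    new_velocity, new_position = [], []
    for x, v, p, g in zip(position, velocity, pbest, gbest):
        r1, r2 = random.random(), random.random()
        nv = W * v + C1 * r1 * (p - x) + C2 * r2 * (g - x)
        new_velocity.append(nv)
        new_position.append(x + nv)
    return new_position, new_velocity
```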
PROPOSED APPROACH
The proposed approach utilizes the PSO optimization method to find the difference between the entities of each record in a database. This difference indicates the similarity index between two data entities and decides the duplication. If the distance between two data entities d_i and d_j is less than a threshold value α, then d_i and d_j are decided to be duplicates. The PSO algorithm applied in this paper is given here:
1. Generate a random population P representing the individual data entries.
2. Assume a random feasible solution from the particles.
3. For i = 1 to P
4. Evaluate all particles based on the objective function.
5. Objective function: dist(d_i, d_j) ≤ α.
6. Gbest = the particle with the best solution.
7. Compute the velocity of the Gbest particles.
8. Update the current position of the best solution.
9. Next i
P = {p_1, p_2, …, p_n} --- (1)

D = [d_ij], i = 1, …, m; j = 1, …, n --- (2)

where d_ij is the entity at row i and column j of the data; i indexes the rows and j indexes the columns. In this paper the threshold value α is a user-defined small value between 0 and 1.
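A minimal Python sketch of steps 1-9 is given below, under simplifying assumptions: records are encoded as numeric vectors, dist is a plain Euclidean distance, and the discrete "velocity" move of steps 7-8 is approximated by re-sampling candidate pairs near the current global best. The names detect_duplicates and ALPHA are hypothetical.

```python
import math
import random

ALPHA = 0.1  # user-defined threshold between 0 and 1 (would be tuned in practice)

def dist(a, b):
    """Distance between two encoded records (assumed numeric vectors)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def detect_duplicates(records, iterations=50, swarm_size=20):
    """Sketch of steps 1-9: each particle is a candidate pair (i, j) of
    records; the swarm searches for the pair with the smallest distance,
    and a pair with distance <= ALPHA is flagged as a duplicate."""
    n = len(records)
    swarm = [(random.randrange(n), random.randrange(n)) for _ in range(swarm_size)]  # steps 1-2
    gbest, gbest_val = None, float("inf")
    for _ in range(iterations):
        for k, (i, j) in enumerate(swarm):
            if i != j:
                val = dist(records[i], records[j])            # steps 4-5
                if val < gbest_val:
                    gbest, gbest_val = (i, j), val            # step 6
            if gbest is not None and random.random() < 0.5:   # steps 7-8 (crude move)
                swarm[k] = (gbest[0], random.randrange(n))
    return [gbest] if gbest is not None and gbest_val <= ALPHA else []
```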
[Figure: flowchart of the proposed approach with a Yes/No decision branch leading to anomaly detection]
PREPROCESSING
Let us consider, as an example, the employee data of an MNC company whose branches are located all over the world. The entire data are read from the database and inspected for '~', empty spaces, '#', '*' and other irrelevant characters placed as entities in the database. [For example, if an entity is numerical data, then it should contain only the digits 0 to 9; if it is a name, then it should contain only alphabets combined with '.', '_' or '-'.] If irrelevant characters are present in a dataset, the corresponding data entities are treated as error data and are corrected, removed or replaced by other relevant characters.
If the data type of the field is a string, then the preprocessing function assigns "NULL" to the corresponding entity; if the data type of the field is numeric, the preprocessing function assigns 0's [according to the length of the numeric data type] to the corresponding entity. Similarly, the preprocessing function replaces the entity with today's date if the data type is 'date', with '*' if the data type is 'character', and so on. Once the data are preprocessed, SQL queries return good results; otherwise errors are generated.
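The following Python sketch illustrates these preprocessing rules; the set of irrelevant characters follows the examples above, while the function name and the example values are hypothetical.

```python
import re
from datetime import date

IRRELEVANT = re.compile(r"[~#*]")   # characters treated as errors in the text

def clean_field(value, dtype):
    """Replace an entity containing irrelevant characters with the
    type-specific default described above."""
    if value is None or str(value).strip() == "" or IRRELEVANT.search(str(value)):
        if dtype == "string":
            return "NULL"                    # string fields get "NULL"
        if dtype == "numeric":
            return 0                         # numeric fields get 0's
        if dtype == "date":
            return date.today().isoformat()  # date fields get today's date
        return "*"                           # character fields get '*'
    return value

# Hypothetical examples of corrupted entities, as in Table-1
print(clean_field("Ci~ty", "string"))   # -> "NULL"
print(clean_field("##", "numeric"))     # -> 0
```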
For example, in the following Table-1, the first row gives the field names and the remaining rows contain the records. In Table-1, the fourth field of the first record contains the irrelevant character '~'; in the same way, the third field of the second record contains '##' instead of numbers. This produces an error when a query is passed on the table EMP [Table-1]. To avoid errors during query processing, the City and Age fields are corrected by verifying the original data sources. If this is not possible, then "NULL" is applied to alphanumeric fields and "0" to numeric fields to replace and correct the error. If a record cannot be corrected, its data are marked ['*'] and moved to a separate pool area.
The entire data can be divided into sub-windows for easy and fast processing. Let the dataset be DB; it is divided into the sub-windows DB1 and DB2 shown in Fig. 2, and each of DB1 and DB2 contains a number of windows W_1, W_2, …, W_n.
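A small sketch of this windowing step, assuming the data set is held as a Python list; the window size of 50 matches the value used later in the experiments, and the function name is hypothetical.

```python
def split_into_windows(db, window_size=50):
    """Divide a data set into fixed-size windows for faster comparison."""
    return [db[i:i + window_size] for i in range(0, len(db), window_size)]

DB = list(range(200))                             # placeholder data; the real DB holds records
db1, db2 = DB[:len(DB) // 2], DB[len(DB) // 2:]   # split into the sub-sets DB1 and DB2
windows1 = split_into_windows(db1)                # windows W1, W2, ... of DB1
windows2 = split_into_windows(db2)                # windows W1, W2, ... of DB2
```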
DATA NORMALIZING
In general, an uncertain data stream is considered for anomaly detection. The main problem addressed in this paper is anomaly detection for any kind of data stream. Since the size of the data stream is huge, in our approach the complete data are divided into subsets of data streams. A data stream DS is divided into two uncertain data streams, DS1 and DS2, which are taken for our problem; both data streams consist of a sequence of continuously arriving uncertain objects at various time intervals, denoted as DS1 = {x[t−cw+1], …, x[t]} and DS2 = {y[t−cw+1], …, y[t]} at the current time interval t. In other words, when a new uncertain object x[t+1] (y[t+1]) arrives at the next time interval (t+1), the new object x[t+1] (y[t+1]) is appended to DS1 (DS2). At that time the old object x[t−cw+1] (y[t−cw+1]) expires and is ejected from memory. Thus, USG at time interval (t+1) is conducted on a new compartment window {x[t−cw+2], …, x[t+1]} ({y[t−cw+2], …, y[t+1]}) of size cw.
[Figure: the uncertain streams DS1 and DS2 are processed by USG over compartment windows of size cw at time interval t, producing the grouping answers]
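The append-and-expire behavior of a compartment window can be sketched as follows: a deque with a fixed maximum length drops the oldest object automatically, mirroring the expiry of x[t−cw+1] described above. The class name is hypothetical and the window size cw is a free parameter.

```python
from collections import deque

class CompartmentWindow:
    """Keeps only the most recent cw objects of one uncertain stream."""
    def __init__(self, cw):
        self.window = deque(maxlen=cw)   # oldest object is ejected automatically

    def push(self, obj):
        """Append the new object x[t+1]; x[t-cw+1] expires if the window is full."""
        self.window.append(obj)
        return list(self.window)         # current window {x[t-cw+2], ..., x[t+1]}
```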
For grouping the uncertain data streams, the two data streams DS1 and DS2, a distance threshold, and a probabilistic threshold α ∈ [0, 1] are given. A group over uncertain data streams continuously monitors pairs of uncertain objects x[i] and y[i] within the compartment windows CW(DS1) and CW(DS2), respectively, of size cw at the current time stamp t. Here, the data streams DS1 and DS2 are compared to find the similarity distance, which is obtained using PSO, such that
sim(x[i], y[i]) = (x[i] − y[i])^2 ≤ 1 --- (9)

dist(X, Y) = √(∑_j (x_j − y_j)^2) --- (12)
where x_j is the jth component of the data set X. There are various methods used for data mining; numerous such methods, for example NN-classification techniques, cluster analysis, and multi-dimensional scaling methods, are based on measures of similarity between data. Instead of measuring similarity, measuring the dissimilarity among the entities gives the same results. For measuring dissimilarity, one of the parameters that can be used is distance. This category of measures is also known as separability, divergence or discrimination measures.
A distance metric is a real-valued function d such that for any data points x, y, and z:

d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y --- (13)

d(x, y) = d(y, x) --- (14)

d(x, z) ≤ d(x, y) + d(y, z) --- (15)
The first property (13), positive definiteness, assures that the distance is a non-negative value and is zero only when the two points are the same. The second property (14) indicates the symmetric nature of distance, and the third (15) is the triangle inequality. Various distance formulas are available, such as the Euclidean, Manhattan and Lp-norm distances and the similarity distance. In this paper the Similarity-Distance is taken as the main method to find the similarity distance between two data sets. The distance between a set of observed groups in m-dimensional space determined by m variables is known as the Similarity-Distance; a small distance value means the data in the groups are very close, and a large value means they are not. The mathematical formula of the Similarity-Distance for two data samples X and Y is written as:

D(X, Y) = √((X − Y)^T Σ^{-1} (X − Y)) --- (16)

where Σ is the covariance matrix of the m variables.
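Assuming Eq. (16) is the covariance-weighted form reconstructed above, a direct NumPy implementation is sketched below; the 2-variable covariance matrix in the example is made up for illustration.

```python
import numpy as np

def similarity_distance(x, y, cov):
    """Similarity-Distance of Eq. (16): sqrt((X - Y)^T Sigma^{-1} (X - Y))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Example with an assumed covariance matrix of two variables
S = np.array([[1.0, 0.2],
              [0.2, 2.0]])
print(similarity_distance([1.0, 2.0], [1.5, 1.0], S))
```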
sim[i] = 1   if 0 < dist[i] ≤ α
sim[i] = 0   if dist[i] = 0        ---- (18)
sim[i] = −1  if dist[i] > α

The first line in (18) says that the data available in the two windows W1 and W2 are more or less similar, the second line says that they are exactly the same, and the third line says that the data are different. Whenever the distance between the data satisfies dist[i] = 0 or dist[i] ≤ α, both data items are marked in the DB. The value of sim[i] gives two solutions:
TPR: if the similarity value lies above this boundary [within −1 to 1], the records are considered replicas;
TNR: if the similarity value lies below this boundary, the records are considered not to be replicas.
When the similarity value lies between the two boundaries, the records are classified as "possible matches"; in this case human judgment is also necessary to find the matching score. Usually, most existing approaches to replica identification depend on several choices to set their parameters, and these choices may not always be optimal. Setting these parameters requires accomplishing the following tasks:
- Selecting the best proof to use as evidence: it takes more time to find the duplication when more processes are applied to compute the similarity among the data.
- Deciding how to merge the best evidence: some evidence may be more effective for duplicate identification than others.
- Finding the best boundary values to be used: bad boundaries may increase the number of identification errors (e.g., false positives and false negatives), nullifying the whole process.

Window1 from DB1 is compared with Window1, Window2, Window3 and so on from DB2, which can be written as:
dist[i] = W_i(DB1) − ∑_j W_j(DB2) ---- (19)
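A sketch of the window-wise comparison of Eq. (19) combined with the three-way rule of Eq. (18) is shown below. Averaging the pairwise distances inside a window and the threshold value are assumptions made for illustration, since the paper leaves these details implicit.

```python
ALPHA = 0.1  # distance threshold (assumed value; user-defined in the paper)

def classify(dist_value):
    """Three-way rule of Eq. (18): 0 = exactly the same,
    1 = more or less similar (replica candidate), -1 = different."""
    if dist_value == 0:
        return 0
    return 1 if dist_value <= ALPHA else -1

def compare_windows(w1, windows2, dist_fn):
    """Eq. (19) sketch: one window of DB1 is compared against every
    window of DB2; each comparison is reduced to a single distance."""
    results = []
    for w2 in windows2:
        d = sum(dist_fn(a, b) for a, b in zip(w1, w2)) / max(len(w1), 1)
        results.append(classify(d))
    return results

# Example with simple numeric windows and an absolute-difference distance
print(compare_windows([1.0, 2.0], [[1.0, 2.0], [5.0, 9.0]], lambda a, b: abs(a - b)))
```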
DATA:
For experimenting with the proposed approach, two real data sets commonly employed for evaluating record de-duplication are used; they are based on data gathered from web indexes. Additionally, further data sets were created using a synthetic data set generator. The first, the Cora Bibliographic data set, is a collection of 1,295 distinct citations of 122 computer science papers taken from the Cora research paper search engine. These citations were separated into multiple attributes (author names, year, title, venue, pages and other info) by an information extraction system. The other real data set, the Restaurants data set, comprises 864 records of restaurant names and supplementary data, together with 112 replicas that were obtained by integrating records from Fodor's and Zagat's guidebooks. The following attributes from this data set are used: (restaurant) name, address, city, and specialty. The synthetic data sets were created using the Synthetic Data Set Generator (SDG) [32] available in the Febrl [26] package.
Since real data sets such as the time series data set, the 20-20 news data set and customer data from OLX.in are not sufficient and not easily accessible for the experiment, synthetic data are also generated. The generated data contain fields such as name, age, city, address, phone numbers, etc. (like a social security number). Using SDG, errors and duplications can also be introduced manually into the data, and some modifications can be applied at the record attribute level. The data taken for the experiments are:
DATA-1: This data set contains four files of 1000 records (600 originals and 400
duplicates) with a maximum of five duplicates based on one original record (using a
Poisson distribution of duplicate records) and with a maximum of two modifications
in a single attribute and in full record.
DATA-2: This data set contains four files of 1000 records (750 originals and 250
duplicates), with a maximum of five duplicates based on one original record (using a
Poisson distribution of duplicate records) and with a maximum of two modifications
in a single attribute and four in the full record.
DATA-3: This data set contains four files of 1000 records (800 originals and 200
duplicates) with a maximum of seven duplicates, based on one original record (using
a Poisson distribution of duplicate records) and with a maximum of four
modifications in a single attribute and five in the full record. The duplication can be applied to each attribute of the data in the form of ⟨attribute, similarity-function⟩ pairs [i.e., the evidence].
The experiment on the time series data is performed in MATLAB, and the time complexity is compared with that of the existing system. The elapsed time taken by the proposed approach is 5.482168 seconds. The results obtained for all the functionalities defined in Fig. 1 are depicted in Fig. 3 to Fig. 6.
Fig. 3 shows the original data as taken from the web; it has errors, redundancy and noise. The three lines show that the data DB is divided into DB1, DB2 and DB3. It is clear from the figure that DB1, DB2 and DB3 coincide and overlap in many places, which indicates data redundancy; the zigzag form also shows that the data are not preprocessed. In the time series data, 14 numerical entries are preprocessed [replaced with 0's], as verified from the database.
After normalization the data are divided into windows, as shown in Fig. 5, where the window size of 50 is defined by the developer; each window holds 50 data items for fast comparison. In order to confirm the behavior observed with the real data, we conducted additional experiments using our synthetic data sets. The user-selected evidence setup used in this experiment was built from a list of evidence using the PSO similarity function for free-text attributes and a string distance function for numeric attributes, since this required less processing time in our initial tuning tests.
Data set            Original Data   Good Data   Similar Data   Error Data
Time Series         1000            600         400            24%
Restaurant          1000            750         250            15%
Student Database    1000            800         200            12.4%
Cora                1000            700         300            19.2%
The de-duplication detects 395, 206, 146 and 244 duplicates for these data sets, respectively. Due to more complex data or errors in the data, 100% de-duplication is not obtained.
Some performance metrics can be calculated to assess the accuracy of our proposed approach:

TNR = TN / (TN + FP)
FNR = FN / (FN + TP)
Sensitivity = TP / (TP + FN) = 99%
Specificity = TN / (TN + FP) = 88.5%
Accuracy = (TP + TN) / (TP + TN + FP + FN) = 96.3%
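For reference, a small helper with the standard confusion-matrix definitions behind these measures is sketched below; the reported 99%, 88.5% and 96.3% come from the paper's experiment, not from this code.

```python
def metrics(tp, tn, fp, fn):
    """Standard definitions of the reported measures."""
    sensitivity = tp / (tp + fn)                # true positive rate
    specificity = tn / (tn + fp)                # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy
```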
CONCLUSION
In this paper the PSO based distance method is taken as the main method for finding similarity [redundancy] in any database; the similarity score is computed for various databases and the performance is compared. The accuracy obtained using the proposed approach is 96.3% over four different databases. The time series data is in the form of Excel, the Cora data is in the form of a table, the student data is in MS-Access, and the restaurant data is in an SQL table. It is concluded from the experimental results that, using our proposed approach, it is easy to perform anomaly detection and removal in terms of data redundancy and error. In future work, reliability and scalability will be investigated in terms of data size and data variation.
REFERENCES