A Fuzzy Proximity Relation Approach for Outlier Detection in Mixed Data
T. Sangeetha, G. Mary A
Soft Computing Letters 3 (2021) 100027, https://doi.org/10.1016/j.socl.2021.100027
Keywords: Data Mining, Entropy, Fuzzy Proximity, Mixed Data, Rough Sets, Weighted Density

Abstract: Data mining is an emerging technology in which researchers explore innovative ideas across different domains, particularly for detecting anomalies. Instances in a dataset which deviate considerably from the common patterns of the other instances are known as anomalies. Ambiguity and uncertainty are inherent in real-world data, and Rough Set Theory is a proven methodology for dealing with them. Research work done until this point has focused on data of numeric or categorical type and fails when the attributes are of mixed type. Here, by using fuzzy proximity and ordering relations, the numerical data is converted to categorical data. This article presents an approach for detecting outliers in mixed data in which the weighted density values of attributes and objects are calculated. The proposed approach is compared with existing outlier detection methods, taking the hiring dataset as an example, and is benchmarked on Harvard Dataverse datasets to prove its efficiency and performance.
treated as normal objects, and the remaining objects that are not labeled are identified as outliers. Each pattern is assigned an outlier score, and a threshold value is fixed to determine the degree of outlierness [18]. The similarity of data cannot be measured reliably when there is much noise, and similarity and density measures do not suit high-dimensional data [8]. Researchers are nowadays focusing on detecting outliers in high-dimensional data, because most earlier works detect outliers only for qualitative or quantitative data [23]. The proposed approach suits mixed data with a high level of significance. Different methodologies for outlier detection techniques are shown in Fig. 1.

2. Outlier detection method

2.1. Supervised method

This technique models both data uniformity and anomaly. Specialists label similar objects, and objects that do not match the model of ordinary objects are marked as exceptions or outliers [1]. Normal data objects appear far more often than outlier objects, so this method has two imbalanced classes (normal and outliers). A small amount of sample data taken for training will not adequately represent the outlier distribution. However, labeling a true object as an outlier should not be allowed; avoiding this is even more important than the outlier detection itself.

2.2. Unsupervised methods

In a few applications, objects are not labeled as "usual" or "exception"; consequently, an unsupervised learning technique must be utilized. Clustering can separate normal objects from outlier objects [9]: objects which deviate from normal behavior form one cluster, and the remaining objects fall into the normal category. One issue in unsupervised strategies is that data that does not belong to any group might be considered noise but not an outlier [35].

2.3. Semi-supervised methods

These can be viewed as the utilization of semi-supervised learning strategies. In particular, the available labeled objects can be utilized, or the unlabelled objects close to them, to prepare a layout for normal objects. The layout of the ordinary objects can then be utilized to identify outliers: the items which do not fit into the layout of normal objects are anomalies [4]. To enhance the quality of outlier detection, one can also take assistance from models of unsupervised strategies.

3. Rough set theory and fuzzy approximation space

During the 1980s, Zdzislaw Pawlak [27], a Polish mathematician, developed a mathematical tool called rough sets, built from crisp lower and upper approximation concepts. It does not need any prior or extra information about the data concerned. There is a strict association between vague and uncertain data, and the rough set approach demonstrates a clear association between these two ideas: vagueness is associated with sets, while uncertainty is associated with elements of sets. Data analysis with rough sets uses decision tables with structured rows and columns [12]. The columns of a table are attributes classified into two groups, condition attributes and decision attributes, and each row specifies an object which induces some decision or result. If some conditions are satisfied, then the decision rule is certain; otherwise, it is uncertain.

Rough set theory also involves the notion of similarity. Consider the information table IT = (W, X, Y, Z), where W is the universe, which should be nonempty, X is the set of attributes, and Y and Z are the conditional and decisional attributes [13]. The components of W are objects, entities, items, or investigations; attributes are also referred to as features, aspects, or characteristics.

Assume S = (V, RT), a subset Y ⊆ V, and an equivalence relation RT ∈ IND(S). The lower and upper approximations of Y are defined as follows:

R̲T(Y) = ∪{X ∈ V/RT : X ⊆ Y}

R̄T(Y) = ∪{X ∈ V/RT : X ∩ Y ≠ ∅}

From this, Boundary(Y) = R̄T(Y) − R̲T(Y) is called the RT-boundary of Y. The boundary sets are included in the upper approximation but not in the lower approximation. Rough sets are defined through the lower and upper approximations: a set is rough when its boundary region is nonempty (R̄T(Y) ≠ R̲T(Y)), and crisp when the boundary is empty.
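The constructions above are compact enough to sketch in code. The fragment below is an illustrative Python sketch, not the paper's implementation (the authors' experiments used C); the decision table, its attribute names, and the target set are hypothetical, and only the definitions of U/IND, the lower and upper approximations, and the boundary region are taken from this section.

from collections import defaultdict

# Hypothetical decision table: object -> attribute values (illustrative only).
TABLE = {
    "o1": {"colour": "red",  "size": "small"},
    "o2": {"colour": "red",  "size": "small"},
    "o3": {"colour": "blue", "size": "small"},
    "o4": {"colour": "blue", "size": "large"},
}

def indiscernibility(table, attrs):
    # U/IND(attrs): group objects that take identical values on attrs.
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())

def lower_approx(partition, y):
    # Union of equivalence classes wholly contained in Y (surely in Y).
    return set().union(*(c for c in partition if c <= y))

def upper_approx(partition, y):
    # Union of equivalence classes intersecting Y (possibly in Y).
    return set().union(*(c for c in partition if c & y))

partition = indiscernibility(TABLE, ["colour"])   # [{o1, o2}, {o3, o4}]
y = {"o1", "o2", "o3"}                            # a target concept Y
low, up = lower_approx(partition, y), upper_approx(partition, y)
print(low)       # {'o1', 'o2'}
print(up)        # {'o1', 'o2', 'o3', 'o4'}
print(up - low)  # boundary {'o3', 'o4'} is nonempty, so Y is rough here

A nonempty boundary is exactly the rough case described above; partitioning on both attributes instead would shrink the boundary and move Y toward a crisp set.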
3.1. Membership relation and approximation

The membership relation is derived from approximation spaces; both membership and set approximation are related to knowledge only [28]. The representation is shown below:

l ∈_T L if and only if l ∈ T̲(L)

l ∈^T L if and only if l ∈ T̄(L)

Here ∈_T reads "l surely belongs to L with respect to T" and ∈^T reads "l possibly belongs to L with respect to T"; these are the lower and upper membership relations, respectively. Fig. 2 depicts the set approximation.

3.2. Fuzzy approximation space with rough sets

In general, fuzzy sets are used to handle issues in the understandability of patterns, incomplete and noisy data, multimedia information, and intercommunication between persons, and to resolve them quickly within the determined time [14]. The minimal and maximal approximations of a fuzzy set B in Z are the fuzzy sets T↓B and T↑B in Z, given by

(T↓B)(r) = inf_{s∈S} I(R(s, r), B(s))

(T↑B)(r) = sup_{s∈S} T(R(s, r), B(s))

where T denotes a t-norm and I an implicator. T↓B and T↑B can also be read as the degree of inclusion of T_r in B and the degree of overlap of T_r and B, respectively [10]; this parallels the crisp conditions r ∈ T̲B only if [r]_T ⊆ B and r ∈ T̄B only if [r]_T ∩ B ≠ ∅.

4. Related work

Datasets form different clusters based on different labeling techniques; a data item that, compared with these formed clusters, does not belong to any cluster is identified as an outlier [7]. For single-class classification, a support vector data description (SVDD) method was used: it determines a hypersphere that includes all normal data within its space, and the objects which lie outside the hypersphere are termed outliers. In k-means clustering, objects that are found to be similar under a feature vector are formed into clusters, and any object that does not group under any cluster is an outlier. In the local outlier factor (LOF) method, the relative distance of an object to its neighborhood points is calculated; if the value shows a high deviation, the object is an outlier [34].

Multivariate Outlier Detection (MOD) is a traditional strategy for the detection of outliers. It typically flags those observations that are found a long way from the center of the data distribution; a few distance measures are employed for such detection [19]. The Mahalanobis distance is an outstanding rule which relies upon estimated parameters of the multivariate distribution.

The rough membership function is also used to detect outliers from real-world datasets. One of the most popular distance-based approaches is the Manhattan distance; when the threshold value increases, this technique outdoes the performance of statistical approaches and distance-based methods, and its efficiency can be improved by fixing a proper threshold value. The clustering technique provides more accuracy than the distance-based approach [37,38]. Small clusters can be constructed by using Partitioning Around Medoids (PAM) to detect outliers from the dataset.

In neural networks, the data is trained and tested; a network is used to clear the ambiguity in patterns and is also an effective tool to retrieve knowledge from large databases. The rough set method combined with a neural network is well suited to data mining problems, and a backpropagation algorithm has been employed using rough sets to avoid inconsistencies in the data. The neural-system learning model uses backpropagation; neurobiologists and psychologists initially ignited this field to create and test a computational analog of the neuron. The neural system is arranged so that input/output units are associated with weights [25]. Backpropagation learns by processing a training set of tuples iteratively and contrasting the system's prediction for an individual tuple with the known target. The target might be the class label known for the training tuple (classification problems) or a continuous value (prediction); for each training tuple, the weights are adjusted to minimize the mean squared error between the system's prediction and the real target value [20].

Rough entropy has been used to measure the uncertainty of data, with each object and attribute given a weighted density value to detect outliers, but clustering of the data had not been done [21]. The clustering approach can be improved by using rough K-means (RKM) with a preliminary centroid selection method [22]. A better cluster validity index is achieved by the improved entropy-based rough K-means (ERKM) method. In multi-granulation rough sets, the decision is made by "OR" instead of "AND" logic.
Table 1. Study on different outlier detection methods

1. Support Vector Data Description (SVDD)
   Advantages: Detects outliers well in smaller sample sizes and produces effective results for more intricate and scanty datasets.
   Disadvantages: If the sample sizes become larger, outlier detection is difficult.
2. k-means clustering
   Advantages: Even if the dataset is huge, outlier detection is possible.
   Disadvantages: Generally, outliers are to be discarded, but in this method outliers form a separate group.
3. Local Outlier Factor (LOF)
   Advantages: The point at the smallest distance to a denser cluster is considered an outlier, whereas in general outlier detection approaches the point at the smallest distance would not be considered an outlier.
   Disadvantages: A threshold value must be fixed to detect outliers, and its fixation depends on the problem and the user.
4. Multivariate Outlier Detection (MOD)
   Advantages: Detects outliers (of n features) in n-dimensional space.
   Disadvantages: Finding distributions of an n-dimensional space is difficult, so training of the dataset is required.
5. Partitioning Around Medoids (PAM)
   Advantages: Compared to other available partitioning algorithms, outliers are less noticeable in the PAM method.
   Disadvantages: Choosing the k medoids is random; it gives a different result for the same dataset.
6. Backpropagation Method
   Advantages: A deeper understanding of the data is not required.
   Disadvantages: Particularly sensitive to noisy data.
7. Rough k-Means (RKM)
   Advantages: The weighted density method uses the Gaussian function to detect outliers in a vague dataset.
   Disadvantages: The approach is susceptible when separating objects which overlap between clusters.
8. Entropy Rough k-Means (ERKM)
   Advantages: Outliers are removed effectively, which results in the formation of quality clusters.
   Disadvantages: Centroid selection is random, based on the rough k-means (RKM) method.
When the two attributes have contradictions and inconsistencies, multi-granulation with a rough set framework has been used [40], but it needs effective computation.

The traditional approach to outlier detection was the statistical method, which applies to single-dimensional datasets alone. The model suitably fits perceptible real-world datasets where the categorical data has been converted into numerical data for the processing of statistical methods [5], but this increases the processing time for tangled datasets. The simplest outlier detection method, which needs no prior information for processing the data, is the proximity-based technique. However, the calculation of distances between all objects grows exponentially; the time complexity is directly proportional to the number of objects n and the dimensionality m, so it is not suitable for high-dimensional data.

The parametric method is suitable for larger datasets because it has a built-in distribution model. If any model fits the prescribed dataset, then the outcome will be accurate. The data model grows with paradigmatic complexity, not with the size of the data; the only condition is that the predefined model should fit the available dataset. Nonparametric methods need prior information to process. In some cases, the prior knowledge will not be available, or the computation cost will be high [32]. Many datasets not only use a determined data model but also follow a random distribution model; this may be applicable to regression and principal component analysis methods. In the pre-processing stage, parameter settings are to be made, and later they should be processed.

An outer perception, or anomaly, seems to diverge extraordinarily from the other individuals among which it occurs; a perception (or a subset of perceptions) gives the impression of conflicting with the rest of the data [24]. Exceptions are defined as points lying outward from the cluster but at the same time isolated from the noise [30]. Patterns that do not conform to the well-defined notion of normal behavior are outliers, and the regions of network structure differ from what is expected under normal behavior [26].

Social network anomaly detection focuses on outlier detection techniques developed in the machine learning and statistical domains [31]. Intrusion detection with anomaly detection was proposed through system calls [33]. First, decision-makers' preferences for each choice are evaluated and the concept of pre-decisions is introduced, resulting in an incomplete fuzzy decision system [43]. Then, using the defined similarity relation, the weighted conditional probabilities are determined. The concept of relative utility functions is next introduced, followed by a method for determining relative utility function values. Then, in incomplete fuzzy decision systems, a three-way decision model is built and applied to the modeling of incomplete multi-attribute decision-making issues [44].

On intuitionistic fuzzy-valued information systems (IFVIS), three alternative sorting decision-making procedures involve subtracting intuitionistic fuzzy numbers, sorting functions, and intimacy coefficients [45]. The outranked set is created for each alternative, and a hybrid information table is presented that includes a multi-attribute decision-making matrix and a loss function table. Multi-attribute decision-making (MADM) is a crucial component of modern decision sciences [46]; it refers to a decision problem of selecting the best alternative, or ranking alternatives, based on numerous attributes. A three-way decision has also been included in a multi-scale decision information system, which offers a novel approach to addressing multi-attribute decision-making concerns in such systems [47]. In addition, a review has been made of outlier detection using data mining methods. The pros and cons of the different outlier detection methods are shown in Table 1.

5. Proposed model

Detecting outliers is a major data mining task that has received significant consideration across different research groups and application domains. Numerous methods have been created to identify outliers, but only on numerical data; those methods cannot be applied directly to categorical data. So the fuzzy proximity relation is introduced to convert numerical data to categorical data [36]. Then the density and uncertainty of every object and attribute are calculated. For a stable dataset, the threshold value is fixed high, and for an unstable dataset, a lower threshold value is fixed. In this way, outliers are removed so as to improve the execution of data mining algorithms. As shown in Fig. 3, at the pre-processing stage the mixed data is converted to categorical data using the fuzzy proximity relation; in post-processing, a rough set entropy-based weighted density outlier detection approach is applied to determine the outliers.

5.1. Rough set entropy-based weighted density outlier detection algorithm

A dataset may include missing data and some negative and null values, which are outliers; such a dataset is said to be vague and incomplete. To handle this scenario, a rough set weighted density-based outlier detection method is proposed. In the pre-processing stage, numerical data is converted to categorical data using the fuzzy proximity relation and is then ordered. In the post-processing stage, similar objects are identified with respect to attributes using the indiscernibility relation, and the complement entropy measure is used to calculate uncertainty values; the weighted density values are calculated from the number of indiscernible objects divided by the total number of objects for each attribute. Finally, the user fixes the threshold value; if the calculated value of an object is less than the threshold, it is treated as an outlier. The following definitions are used to detect outliers once the table has been converted from mixed to categorical type.
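Since the fuzzy proximity relation itself is given by the paper's Definition 1, the pre-processing step just described can only be sketched under an assumption: the Python fragment below substitutes a simple normalized proximity, R(x, y) = 1 - |x - y| / (max - min), as a hypothetical stand-in, chains objects whose pairwise proximity reaches the almost-indiscernibility level ω into groups, and assigns ordered categorical labels by group mean. The numeric experience values are made up for illustration.

def proximity_matrix(values):
    # Stand-in proximity R(x, y) = 1 - |x - y| / (max - min);
    # the paper's Definition 1 may differ.
    span = (max(values) - min(values)) or 1.0
    n = len(values)
    return [[1.0 - abs(values[i] - values[j]) / span for j in range(n)]
            for i in range(n)]

def omega_groups(r, omega):
    # Chain objects with proximity >= omega into groups
    # (connected components via a small union-find).
    parent = list(range(len(r)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(len(r)):
        for j in range(i + 1, len(r)):
            if r[i][j] >= omega:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(r)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

experience = [12.0, 11.5, 4.0, 3.0, 12.5, 3.5, 2.0, 4.5, 3.0, 2.5]  # made up
groups = omega_groups(proximity_matrix(experience), omega=0.90)
groups.sort(key=lambda g: sum(experience[i] for i in g) / len(g), reverse=True)
for label, g in zip(["High", "Low"], groups):
    print(label, ["E%d" % (i + 1) for i in sorted(g)])
# High ['E1', 'E2', 'E5'] and Low for the rest, mirroring the shape of the
# worked example that follows (Tables 2-4)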
Table 2. Hiring Dataset - Mixed Type (objects E1-E10 with the attributes Degree, Experience, French, and Reference; Table 3 gives the fuzzy proximity relation computed on Experience, and Table 4 the converted categorical values)

Table 3. Fuzzy Proximity Relation - Experience Attribute
R1    E1      E2      E3      E4      E5      E6      E7      E8      E9      E10
E1    1.0000  0.9053  0.7907  0.6494  0.9123  0.7470  0.5946  0.7620  0.6836  0.6316
E2    0.9053  1.0000  0.8832  0.7353  0.8191  0.8379  0.6770  0.8534  0.7715  0.7165
E3    0.7907  0.8832  1.0000  0.8475  0.7084  0.9539  0.7858  0.9697  0.8853  0.8276
E4    0.6494  0.7353  0.8475  1.0000  0.5748  0.8929  0.9362  0.8772  0.9616  0.9796
E5    0.9123  0.8191  0.7084  0.5748  1.0000  0.6667  0.5239  0.6809  0.6068  0.5582
E6    0.7470  0.8379  0.9539  0.8929  0.6667  1.0000  0.8302  0.9842  0.9311  0.8728
E7    0.5946  0.6770  0.7858  0.9362  0.5239  0.8302  1.0000  0.8149  0.8980  0.9566
E8    0.7620  0.8534  0.9697  0.8772  0.6809  0.9842  0.8149  1.0000  0.9153  0.8572
E9    0.6836  0.7715  0.8853  0.9616  0.6068  0.9311  0.8980  0.9153  1.0000  0.9412
E10   0.6316  0.7165  0.8276  0.9796  0.5582  0.8728  0.9566  0.8572  0.9412  1.0000

Let the almost indiscernibility be ω ≥ 90%. From Table 2, the objects E1, E2, E5 are ω-identical; similarly, E3, E4, E6, E7, E8, E9, E10 are ω-identical:

U/Rω1 = {{E1, E2, E5}, {E3, E4, E6, E7, E8, E9, E10}}

Based on the similarity value ω, the attribute Experience is ordered into two groups. The numerical Experience values of the objects {E1, E2, E5} are greater, so they are classified as High, and the remaining objects {E3, E4, E6, E7, E8, E9, E10} are classified as Low. The numeric Experience attribute is thus converted to a categorical one, as shown in Table 4.

Table 4. Converted Table - Mixed to Categoric Type
Objects  Degree  Experience  French  Reference
E1       MBA     High        Yes     Excellent
E2       MSc     High        Yes     Good
E3       MSc     Low         No      Neutral
E4       MBA     Low         No      Good
E5       MCA     High        Yes     Good
E6       MCA     Low         Yes     Neutral
E7       MBA     Low         Yes     Excellent
E8       MSc     Low         No      Excellent
E9       MCA     Low         No      Good
E10      MBA     Low         Yes     Neutral

Next, obtain the indiscernibility relation for each attribute. Objects that possess indiscernible values for an attribute are grouped as follows:

U/IND(Degree) = {{E1, E4, E7, E10}, {E2, E3, E8}, {E5, E6, E9}}

U/IND(Experience) = {{E1, E2, E5}, {E3, E4, E6, E7, E8, E9, E10}}

U/IND(French) = {{E1, E2, E5, E6, E7, E10}, {E3, E4, E8, E9}}

U/IND(Reference) = {{E1, E7, E8}, {E2, E4, E5, E9}, {E3, E6, E10}}

The complement entropy function is then calculated for each attribute from the obtained indiscernibility relation:

CE(Degree) = (4/10)(1 − 4/10) + (3/10)(1 − 3/10) + (3/10)(1 − 3/10) = 33/50

CE(Experience) = (3/10)(1 − 3/10) + (7/10)(1 − 7/10) = 21/50

CE(French) = 24/50; CE(Reference) = 33/50

Each attribute weight is then calculated from the attributes' complement entropy values:

Weight of Attribute(Degree) = 17/54; Weight of Attribute(Experience) = 29/54

The weighted density value of each object is computed in the same way; for example, W(E10) = 0.88. With the threshold fixed at θ = 0.7, the objects E1, E2, and E5 fall below it and are outliers. The normal and outlier objects are shown in Fig. 4.

7. Experimental results

The working model of the outlier detection algorithm on mixed datasets can be understood from experiments conducted on a hiring dataset that has 120 objects with four conditional attributes of numerical and categorical values. It was implemented on an Intel Pentium processor with 1 GB RAM under the Windows 10 operating system. Existing methods like the distance-based, density-based, local outlier factor, and class outlier factor methods were analyzed using RapidMiner 7.0. The concept of rough sets was implemented using C, a flexible language suited to implementing mathematical models. The proposed algorithm was run on the hiring dataset, which is of mixed type: the fuzzy proximity relation method was used to convert numerical values to categorical values, which were then ordered. A rough set entropy-based weighted density outlier detection method has been applied for effective outlier detection.
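The complement entropy values above can be checked mechanically. The sketch below reruns the post-processing stage on the categorical data of Table 4: the U/IND partitions and the CE values (33/50, 21/50, 24/50, 33/50) come out exactly as computed above, while the attribute weights and per-object weighted densities use an assumed normalization rather than the paper's Definitions 4 and 5, so the absolute scores differ from printed values such as W(E10) = 0.88, even though the same three objects, E1, E2, and E5, score lowest.

from collections import defaultdict

TABLE4 = {  # Table 4, Converted Table - Mixed to Categoric Type
    "E1":  ("MBA", "High", "Yes", "Excellent"),
    "E2":  ("MSc", "High", "Yes", "Good"),
    "E3":  ("MSc", "Low",  "No",  "Neutral"),
    "E4":  ("MBA", "Low",  "No",  "Good"),
    "E5":  ("MCA", "High", "Yes", "Good"),
    "E6":  ("MCA", "Low",  "Yes", "Neutral"),
    "E7":  ("MBA", "Low",  "Yes", "Excellent"),
    "E8":  ("MSc", "Low",  "No",  "Excellent"),
    "E9":  ("MCA", "Low",  "No",  "Good"),
    "E10": ("MBA", "Low",  "Yes", "Neutral"),
}
ATTRS = ("Degree", "Experience", "French", "Reference")
N = len(TABLE4)

def ind_classes(col):
    # U/IND for one attribute: objects grouped by equal values.
    classes = defaultdict(set)
    for obj, row in TABLE4.items():
        classes[row[col]].add(obj)
    return list(classes.values())

def complement_entropy(col):
    # CE = sum over classes X in U/IND of (|X|/|U|) * (1 - |X|/|U|).
    return sum(len(c) / N * (1 - len(c) / N) for c in ind_classes(col))

ce = {a: complement_entropy(i) for i, a in enumerate(ATTRS)}
print(ce)  # Degree ~0.66 (33/50), Experience 0.42, French 0.48, Reference 0.66

# Assumed attribute weighting: 1 - CE, normalized over all attributes
# (the paper's Definition 4 normalizes differently, cf. 17/54 and 29/54 above).
total = sum(1.0 - v for v in ce.values())
weight = {a: (1.0 - ce[a]) / total for a in ce}

def weighted_density(obj):
    # Weighted share of objects indiscernible from obj on each attribute,
    # per the textual description in Section 5.1 (assumed form).
    row = TABLE4[obj]
    return sum(weight[a] *
               sum(1 for r in TABLE4.values() if r[i] == row[i]) / N
               for i, a in enumerate(ATTRS))

print(sorted(TABLE4, key=weighted_density)[:3])  # ['E2', 'E5', 'E1'],
# the three outlier objects of the worked example above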
Fig. 5 shows the comparison chart for the existing and proposed outlier detection methods. In the distance-based outlier detection method, each data point is ranked based on the distance to its k-th nearest neighbor [39], and the top n data points are declared outliers; it detects ten outlier objects. In the density-based outlier detection method DensityBased(p, P), an object that deviates by at least distance P from a proportion p of all data objects is considered an outlier; this method does not detect any outlier objects. In the local outlier factor method, a local outlier factor is calculated for each object based on the local density measure and compared with its l nearest neighbors [41]; the objects having lower density values than their neighbors are termed outliers, and it detects seven outlier objects. In the class outlier factor method, each data point in the sample is ranked based on ClassOutlierFactor(S, N), where S represents the top-class outlier and N represents the number of nearest neighbors; this algorithm detects ten outlier objects. Further, our proposed rough set entropy-based weighted density outlier detection method detects outliers by computing the weighted density value of all objects and attributes; it detects 18 outlier objects [42]. Our proposed algorithm's performance and efficiency are high compared to the existing methods because it calculates weighted density values for every object and attribute, so that a true object will never be detected as an outlier. The comparison chart of the various outlier detection methods is shown in Fig. 5.

Also, benchmark datasets such as the annthyroid, breast cancer, and letter datasets have been taken from the Harvard Dataverse to show the proposed algorithm's efficiency; it has been compared with other existing outlier detection methodologies such as the local outlier factor (LOF), feature bagging (FB), isolation forest (IF), K-nearest neighbor (KNN), average KNN, and the histogram-based outlier score (HBOS).
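For reference, the distance-based baseline described above is easy to reproduce outside RapidMiner. The sketch below ranks each point by the distance to its k-th nearest neighbor and declares the top n points outliers, following the description attributed to [39]; the sample coordinates are hypothetical.

def kth_nn_distance(points, k):
    # Euclidean distance from each point to its k-th nearest neighbour.
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            for j, q in enumerate(points) if j != i
        )
        scores.append(dists[k - 1])
    return scores

def top_n_outliers(points, k, n):
    # Indices of the n points with the largest k-th NN distance.
    scores = kth_nn_distance(points, k)
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:n]

data = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (5.0, 5.2), (0.95, 1.05)]
print(top_n_outliers(data, k=2, n=1))  # [3]: the isolated point (5.0, 5.2)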
Fig. 6. Comparison of Proposed and Existing Outlier Detection Methods with Benchmark Datasets.

The local outlier factor determines the density of an object from the distances of its neighbors. Feature bagging selects the features of subsamples randomly and finally combines the values of all base detectors, using the local outlier factor. Isolation forest observes data by constructing a tree [15]; the isolation score singles out outliers and is well suited to high-dimensional data. In the histogram-based outlier score approach, outliers are detected by constructing histograms; it is an unsupervised learning method that generates scores by considering the features independently. KNN identifies the nearest neighbors of an object, calculates scores based on the distance, and identifies outliers from them. In the average KNN method, super samples are constructed for the individual classes; the test data is given as input, and average KNN searches for samples available in, or close to, the super samples, while the others are identified as outliers. The comparison chart of the proposed method with the existing outlier detection algorithms on the benchmark datasets is shown in Fig. 6.

Other approaches, like the fuzzy bipolar soft set and the Pythagorean fuzzy bipolar soft set, are compared with the proposed method to prove its efficiency. The fuzzy-based bipolar soft set is used to analyze patients with the help of membership degrees and to decide whether a patient has hypomania, depression, or bipolar disorder; it is mostly used in decision-making systems. The Pythagorean fuzzy bipolar soft set, on the other hand, is mostly used in group decision-making situations. Personalization of the acquired findings is avoided because a common idea is derived from the opinions of all doctors. The proposed method, by contrast, identifies indiscernible values, computes entropy, and then calculates the weighted density value of each object and attribute to detect outliers.

7.1. Measures for performance evaluation

The performance evaluation on the benchmark datasets is measured by calculating accuracy, specificity, sensitivity, precision, and F1 score. The accuracy of a classifier is calculated as the number of objects which are correctly classified over the total number of objects available. The formula to calculate accuracy is as follows:

Accuracy = (TP + TN) / (TP + FN + TN + FP)

where TP is True Positive, FP is False Positive, TN is True Negative, and FN is False Negative. Sensitivity, or recall, measures the proportion of true positive values which are correctly identified, whereas specificity measures the proportion of true negative values which are correctly detected. The values are obtained from the formulas shown below:

Specificity = TN / (TN + FP)

Sensitivity = TP / (TP + FN)

Recall = TP / (TP + FN)

Precision, or positive predictive value, measures the relevant objects among the retrieved objects. The formula to calculate precision is as follows:

Precision = TP / (TP + FP)

The F1 score provides the balance between precision and recall when the distribution of classes is not even; it is worst when its value is 0 and best when it is 1. The formula to calculate the F1 score is as follows:

F1 Score = (2 * Precision * Recall) / (Precision + Recall)
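The five measures translate directly into code; the helper below is a small sketch that computes them from confusion-matrix counts. The example counts are made up (they are not taken from the benchmark runs), although they happen to give scores close to the REBWDOD column of Table 5.

def evaluation_measures(tp, fp, tn, fn):
    # Accuracy, specificity, sensitivity/recall, precision and F1 score,
    # as defined in Section 7.1 (assumes no zero denominators).
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall and sensitivity share one formula
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "specificity": tn / (tn + fp),
        "sensitivity": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }

print(evaluation_measures(tp=440, fp=0, tn=22, fn=2))
# {'accuracy': 0.9957..., 'specificity': 1.0, 'sensitivity': 0.9955...,
#  'precision': 1.0, 'f1': 0.9977...}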
The performance evaluation on benchmark datasets such as the annthyroid, breast cancer, and letter datasets is shown in Table 5, Table 6, and Table 7.

Table 5. Performance Evaluation - Annthyroid Dataset
Sl.No  Measures     LOF      REBWDOD
1      Accuracy     98.16%   99.57%
2      Specificity  1.0      1.0
3      Sensitivity  0.9813   0.9955
4      Precision    1.0      1.0
5      F1 Score     0.9906   0.9978

Table 6. Performance Evaluation - Breast Cancer Dataset
Sl.No  Measures     LOF      REBWDOD
1      Accuracy     99.18%   99.46%
2      Specificity  1.0      1.0
3      Sensitivity  0.99     0.99
4      Precision    1.0      1.0
5      F1 Score     0.9958   0.9972

Table 7. Performance Evaluation - Letter Dataset
Sl.No  Measures     LOF      REBWDOD
1      Accuracy     97.56%   98.69%
2      Specificity  1.0      1.0
3      Sensitivity  0.97     0.98
4      Precision    1.0      1.0
5      F1 Score     0.9872   0.9930

7.2. Analysis of efficacy

Three sorts of tests have been conducted to see how each algorithm's performance changes as factors such as the size of the data set, the dimensionality of the data set, and the number of outliers change [48]. In comparison to the local outlier factor method, the WDOD approach takes less time with respect to data size, data dimensionality, and the number of outliers. Based on the results of these experiments, the WDOD technique appears to be particularly suitable for big data sets with high dimensionality and for data sets with a high number of outliers. The WDOD algorithm's execution time also grows substantially more slowly than that of the local outlier factor algorithm. As a result, when the data size is huge and the attributes are many, the suggested WDOD algorithm can ensure efficient execution in detecting outliers, as shown in Fig. 7, Fig. 8, and Fig. 9.

8. Conclusion

In this paper, outlier detection for a mixed dataset has been proposed. In the pre-processing stage, fuzzy proximity relations with order information rules convert numeric attributes to categorical ones. The rough set entropy-based weighted density outlier detection method is then carried out to detect outliers in the post-processing stage. Research works carried out so far detect outliers only for numeric or categorical data; mixed data was not considered. The proposed model detects outliers in the hiring dataset, which has mixed data, by calculating weighted density values so that a normal object will not be detected as an outlier. The proposed algorithm is further benchmarked on Harvard Dataverse datasets such as the annthyroid, breast cancer, and letter datasets and compared with the existing local outlier factor method to prove its efficiency and performance. As the number of objects and attributes increases, the proposed method ensures efficient execution in detecting outliers. Future work will focus on detecting outliers where the input is dynamic, and on multi-granulation sets. The proposed work has some limitations: the fixation of the threshold value sometimes results in regular objects becoming outliers and outliers becoming regular objects.

9. Acknowledgements

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors; no funding was received for this research.
Conflict of interest

The process of writing and the content of the article do not give grounds for raising the issue of a conflict of interest.

Availability of data and material

Not applicable.

Code availability

Not applicable.

Authors' contributions

The corresponding author claims the major contribution of the paper, including formulation, analysis, and editing. The second author provided guidance to verify the analysis results and edited the manuscript.

This article is a completely original work of its authors; it has not been published before and will not be sent to other publications until the journal's editorial board decides not to accept it for publication.
Algorithm 1. The algorithm for the proposed model is shown below:

Input: Dataset DS(W, α, β) and a threshold value θ.
Output: Set S holding the outlier objects.
Step 1: Start.
Step 2: Input the dataset of mixed type.
Step 3: Use the fuzzy proximity relation and ordering to convert numeric data into categorical data.
Step 4: Let S = ∅.
Step 5: For each attribute βi ∈ β:
Step 6: Calculate the indiscernibility function U/IND(βi) according to Definition 2;
Step 7: Calculate the complement entropy according to Definition 3;
Step 8: For each attribute βi ∈ β, compute the weighted density according to Definition 4;
Step 9: For each object αi ∈ W, calculate the weighted density according to Definition 5;
Step 10: If WeightedDensity(αi) < θ then
Step 11: S = S ∪ {αi}.
Step 12: Return S.
Step 13: Stop.

Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] E. Achtert, H.P. Kriegel, L. Reichert, E. Schubert, R. Wojdanowski, A. Zimek, Visual evaluation of outlier detection models, in: Proc. Int. Conf. on Database Systems for Advanced Applications (DASFAA), Tsukuba, Japan, 2010. https://link.springer.com/chapter/10.1007/978-3-642-12098-5_34.
[2] C.C. Aggarwal, P.S. Yu, Outlier detection for high dimensional data, in: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'01), Santa Barbara, CA, 2001, pp. 37–46. https://dl.acm.org/doi/abs/10.1145/375663.375668.
[3] A. Arning, R. Agrawal, P. Raghavan, A linear method for deviation detection in large databases, in: Proc. Int. Conf. on Knowledge Discovery and Data Mining (KDD), Portland, OR, 1996. https://www.aaai.org/Papers/KDD/1996/KDD96-027.pdf.
[4] P. Ashok, G.M.K. Nawaz, Outlier detection method on UCI repository dataset by entropy-based rough K-means, Defence Science Journal 11 (2016) 113–121.
[5] V. Barnett, T. Lewis, Outliers in Statistical Data, John Wiley and Sons, 1994.
[6] S.D. Bay, M. Schwabacher, Mining distance-based outliers in near-linear time with randomization and a simple pruning rule, in: Proc. Int. Conf. on Knowledge Discovery and Data Mining (KDD), Washington, DC, 2003. https://dl.acm.org/doi/abs/10.1145/956750.956758.
[7] R.J. Beckman, R.D. Cook, Outliers, Technometrics 25 (2) (1983) 119–149. https://doi.org/10.1080/00401706.1983.10487840.
[8] M.M. Breunig, H.P. Kriegel, R.T. Ng, J. Sander, LOF: identifying density-based local outliers, in: Proc. ACM SIGMOD Conference, 2000, pp. 93–104. https://dl.acm.org/doi/abs/10.1145/342009.335388.
[9] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: a survey, ACM Computing Surveys 41 (3) (2009). https://dl.acm.org/doi/abs/10.1145/1541880.1541882.
[10] A.G. Christy, M. Gandhi, S.V. Subramaniyan, Cluster based outlier detection for cluster data, 5 (5) (2012) 363–387.
[11] D. Dasgupta, F.A. Nino, A comparison of negative and positive selection algorithms in novel pattern detection, in: Proc. IEEE Int. Conf. on Systems, Man, and Cybernetics, Nashville, TN, vol. 1, 2000, pp. 125–130. https://ieeexplore.ieee.org/abstract/document/884976.
[12] M. Ester, H.P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proc. Int. Conf. on Knowledge Discovery and Data Mining (KDD), Portland, OR, 1996. https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf.
[13] F. Jiang, Y. Sui, C. Cao, Outlier detection based on rough membership function, in: Rough Sets and Current Trends in Computing, LNCS 4259, 2006, pp. 388–397. https://link.springer.com/chapter/10.1007/11908029_41.
[14] S. Forrest, C. Warrender, B. Pearlmutter, Detecting intrusions using system calls: alternate data models, in: Proc. IEEE Symposium on Security and Privacy, Washington, DC, 1999, pp. 133–145. https://ieeexplore.ieee.org/abstract/document/766910.
[15] A. Ghoting, S. Parthasarathy, M. Otey, Fast mining of distance-based outliers in high-dimensional spaces, in: Proc. SIAM Int. Conf. on Data Mining (SDM), Bethesda, MD, 2006. https://link.springer.com/article/10.1007/s10618-008-0093-2.
[16] F.E. Grubbs, Procedures for detecting outlying observations in samples, Technometrics 11 (1) (1969) 19–21. https://www.tandfonline.com/doi/abs/10.1080/00401706.1969.10490657.
[17] D. Hawkins, Identification of Outliers, Monographs on Applied Probability and Statistics, 1980.
[18] V. Hodge, J. Austin, A survey of outlier detection methodologies, Artificial Intelligence Review 22 (2) (2004) 85–126. https://link.springer.com/article/10.1023/B:AIRE.0000045502.10941.a9.
[19] J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 2012.
[20] E.M. Knorr, R.T. Ng, A unified approach for mining outliers, in: Proc. Conf. of the Centre for Advanced Studies on Collaborative Research (CASCON), Toronto, Canada, 1997. https://www.aaai.org/Papers/KDD/1997/KDD97-044.pdf.
[21] E.M. Knorr, R.T. Ng, Algorithms for mining distance-based outliers in large datasets, in: Proc. Int. Conf. on Very Large Data Bases (VLDB), New York, NY, 1998. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.8026&rep=rep1&type=pdf.
[22] E.M. Knorr, R.T. Ng, Finding intensional knowledge of distance-based outliers, in: Proc. Int. Conf. on Very Large Data Bases (VLDB), Edinburgh, Scotland, 1999. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.45.9005&rep=rep1&type=pdf.
[23] A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, J. Srivastava, A comparative study of anomaly detection schemes in network intrusion detection, in: Proc. Third SIAM International Conference on Data Mining, 2003. https://epubs.siam.org/doi/abs/10.1137/1.9781611972733.3.
[24] A. McCallum, K. Nigam, L.H. Ungar, Efficient clustering of high-dimensional data sets with application to reference matching, in: Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (SIGKDD), Boston, MA, 2000. https://dl.acm.org/doi/abs/10.1145/347090.347123.
[25] M. Markou, S. Singh, Novelty detection: a review, part 1: statistical approaches, Signal Processing 83 (12) (2003) 2481–2497. https://www.sciencedirect.com/science/article/abs/pii/S0165168403002020.
[26] M. Markou, S. Singh, Novelty detection: a review, part 2: neural network based approaches, Signal Processing 83 (12) (2003) 2499–2521. https://www.sciencedirect.com/science/article/abs/pii/S0165168403002032.
[27] Z. Pawlak, Rough sets, International Journal of Computer and Information Sciences 11 (1982) 341–356. https://link.springer.com/article/10.1007/BF01001956.
[28] M.I. Petrovskiy, Outlier detection algorithms in data mining systems, Programming and Computer Software 29 (4) (2003) 228–237. https://link.springer.com/article/10.1023/A:1024974810270.
[29] F. Preparata, M. Shamos, Computational Geometry: An Introduction, Springer-Verlag.
[30] R. Bansal, N. Gaur, S.N. Singh, Outlier detection: applications and techniques in data mining, in: IEEE Conference, 2016, pp. 2862–2870. https://ieeexplore.ieee.org/abstract/document/7508146.
[31] R. Kannan, H. Woo, C.C. Aggarwal, Outlier detection for text data: an extended version, 2017. https://arxiv.org/abs/1701.01325.
[32] R. Kaur, S. Singh, A review of social network-centric anomaly detection techniques, 23 (1) (2016) 61–69. https://www.inderscienceonline.com/doi/abs/10.1504/IJCNDS.2016.080582.
[33] P.J. Rousseeuw, A.M. Leroy, Robust Regression and Outlier Detection, John Wiley & Sons, Inc., New York, NY, USA.
[34] I. Ruts, P.J. Rousseeuw, Computing depth contours of bivariate point clouds, Computational Statistics and Data Analysis 23 (1996) 153–168. https://www.sciencedirect.com/science/article/abs/pii/S0167947396000278.
[35] D. Snyder, MS thesis, Department of Computer Science, Florida State University, 2001.
[36] S. Mitra, S.K. Pal, P. Mitra, Data mining in soft computing framework: a survey, IEEE Transactions on Neural Networks 13 (1) (2002) 3–14. https://ieeexplore.ieee.org/abstract/document/977258.
[37] E. Savage, G. Becker, R. Zamudio, A survey of link mining and anomaly detection, 34 (3) (2015) 645–654.
[38] J. Tang, Z. Chen, A.W. Fu, D.W. Cheung, Capabilities of outlier detection schemes in large datasets, framework and methodologies, Knowledge and Information Systems 11 (1) (2006) 45–84. https://link.springer.com/article/10.1007/s10115-005-0233-6.
[39] V. Kumar, S. Kumar, A.K. Singh, Outlier detection: a clustering based approach, IJISME 1 (7) (2013) 383–387. ISSN 2319-6386.
[40] X. Zhao, J. Liang, F. Cao, A simple and effective outlier detection algorithm for categorical data, International Journal of Machine Learning and Cybernetics 5 (2014) 469–477. https://link.springer.com/article/10.1007%2Fs13042-013-0202-4.
[41] Z.A. Bakar, R. Mohamed, A. Ahmad, M.M. Deris, A comparative study for outlier detection techniques in data mining, 10 (4) (2006) 371–395. https://ieeexplore.ieee.org/abstract/document/4017846.
[42] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, in: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD), Montreal, Canada, 1996. https://dl.acm.org/doi/abs/10.1145/235968.233324.
[43] J. Zhan, J. Ye, W. Ding, P. Liu, A novel three-way decision model based on utility theory in incomplete fuzzy decision systems, IEEE Transactions on Fuzzy Systems (2021).
[44] K. Zhang, J. Zhan, W.Z. Wu, On multi-criteria decision-making method based on a fuzzy rough set model with fuzzy α-neighborhoods, IEEE Transactions on Fuzzy Systems (2020).
[45] J. Zhan, H. Jiang, Y. Yao, Three-way multi-attribute decision-making based on outranking relations, IEEE Transactions on Fuzzy Systems (2020).
[46] J. Ye, J. Zhan, W. Ding, H. Fujita, A novel fuzzy rough set model with fuzzy neighborhood operators, Information Sciences 544 (2021) 266–297.
[47] W. Wang, J. Zhan, C. Zhang, Three-way decisions based multi-attribute decision making with probabilistic dominance relations, Information Sciences 559 (2021) 75–96.
[48] J. Deng, J. Zhan, W.Z. Wu, A three-way decision methodology to multi-attribute decision-making in multi-scale decision information systems, Information Sciences 568 (2021) 175–198.