A Study For The Discovery of Web Usage Patterns Using Soft Computing Based Data Clustering Techniques
Pages: 430-443
ABSTRACT:
Data mining and knowledge discovery aim to uncover hidden and valuable knowledge from data sources.
The traditional algorithms used for knowledge discovery are bottlenecked by the wide range of data sources
now available. Class imbalance is one such problem, arising when a data source provides unequal classes, i.e.,
examples of one class in a training data set vastly outnumber examples of the other class(es). Researchers have
rigorously studied several techniques to alleviate the class imbalance problem, including resampling
algorithms and feature selection approaches. In this paper, we present a new hybrid framework
dubbed Cluster Disjunct Minority Oversampling Technique (CDMOTE) and Naïve Bayes for Cluster
Disjunct (NBCD) for learning from skewed training data. These algorithms provide a simpler and faster
alternative by using the cluster disjunct concept. We conduct experiments using fifteen UCI data sets from various
application domains, comparing four algorithms on six evaluation metrics. The empirical study
suggests that CDMOTE and NBCD are effective in addressing the class imbalance
problem.
Keywords: Classification, class imbalance, cluster disjunct, CDMOTE, NBCD.
1. INTRODUCTION
A dataset is class imbalanced if the classification
categories are not approximately equally
represented. The level of imbalance (the ratio of
the size of the majority class to that of the minority
class) can be as huge as 1:99 [1]. It is noteworthy
that class imbalance is emerging as an important
issue in designing classifiers [2], [3], [4].
Furthermore, the class with the lowest number of
instances is usually the class of interest from the
point of view of the learning task [5]. This problem
is of great interest because it turns up in many
real-world classification problems, such as
remote sensing [6], pollution detection [7], risk
management [8], fraud detection [9], and especially
medical diagnosis [10]-[13].
There exist techniques to develop better performing
classifiers with imbalanced datasets, which are
generally called Class Imbalance Learning (CIL)
methods. These methods can be broadly divided
into two categories, namely, external methods and
internal methods. External methods involve
preprocessing of training datasets in order to make
them balanced, while internal methods deal with
modifications of the learning algorithms in order to
reduce their sensitivity to class imbalance [14].
The main advantage of external methods, as
previously pointed out, is that they are independent
of the underlying classifier.
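As an illustration of the external approach, the sketch below rebalances a training set by random oversampling before any classifier sees it. The function name and data layout are our own illustrative assumptions, not from the paper.

```python
import random

def random_oversample(majority, minority, seed=0):
    """Balance a two-class training set by duplicating minority examples.

    A minimal sketch of an 'external' CIL method: the data is rebalanced
    before learning, so any downstream classifier can be used unchanged.
    """
    rng = random.Random(seed)
    # Duplicate randomly chosen minority examples until the classes match.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

# A 99:1 imbalanced toy set becomes 99:99 after oversampling.
balanced = random_oversample([("a", 0)] * 99, [("b", 1)] * 1)
print(len(balanced))  # 198
```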
Whenever a class in a classification task is under-represented (i.e., has a lower prior probability)
COST-SENSITIVE LEARNING
Direct cost-sensitive learning methods
Cost-sensitive meta-learning:
  Thresholding methods
  MetaCost
  Cost-sensitive sampling methods
ALGORITHM MODIFICATION
Proposal for new splitting criterion (DKM)
Adjusting the distribution reference in the tree
Offset Entropy
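A minimal sketch of the thresholding idea from the taxonomy above: instead of resampling, the decision threshold of a probabilistic classifier is shifted according to the misclassification costs. The function name and cost values are illustrative assumptions, not from the paper.

```python
def cost_sensitive_label(p_positive, c_fp, c_fn):
    """Thresholding: predict the positive (minority) class whenever its
    estimated probability exceeds C_FP / (C_FP + C_FN), the threshold
    that minimizes the expected misclassification cost."""
    threshold = c_fp / (c_fp + c_fn)
    return 1 if p_positive > threshold else 0

# With false negatives 9x costlier than false positives, the threshold
# drops from 0.5 to 0.1, so even a weak positive score is accepted.
print(cost_sensitive_label(0.2, c_fp=1, c_fn=9))  # 1
print(cost_sensitive_label(0.2, c_fp=1, c_fn=1))  # 0
```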
In this section, we first review the major research
on clustering in class imbalance learning and
explain why we choose oversampling as our
technique in this paper.
Siti Khadijah Mohamad et al. [18] have conducted
a review of how data mining was tackled by
previous scholars and of the latest trends in data
mining in educational research. Hongzhou Sha et
al. [19] have proposed a method named
EPLogCleaner that can filter out plenty of
irrelevant items based on the common prefix of
their URLs.
M.S.B. PhridviRaj et al. [20] have proposed an
algorithm for finding frequent patterns in data
streams that performs only a one-time scan of the
database initially and uses this information to find
frequent patterns with a frequent-pattern generation
tree. Chumphol Bunkhumpornpat et al. [21] have
proposed a new over-sampling technique called
DBSMOTE. The DBSMOTE technique relies on a
density-based notion of clusters and is designed to
oversample an arbitrarily shaped cluster discovered
by DBSCAN. DBSMOTE generates synthetic
instances along a shortest path from each positive
instance to a pseudo centroid of a minority-class
cluster. Matías Di Martino et al. [22] have
presented a new classifier developed specially for
imbalanced problems, where maximum F-measure
instead of maximum accuracy guides the classifier
design.
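Since several of the surveyed methods (SMOTE, DBSMOTE) generate synthetic minority instances by interpolation, a simplified sketch may help. Unlike real SMOTE it interpolates toward a randomly chosen minority partner rather than one of the k nearest neighbours, and all names here are our own.

```python
import random

def smote_like(samples, n_new, seed=0):
    """Generate synthetic minority points by interpolating between a
    minority example and another randomly chosen minority example
    (a simplification of SMOTE, which uses k nearest neighbours)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)          # pick two minority points
        gap = rng.random()                     # random position between them
        synthetic.append(tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

# Each synthetic point lies on a segment between two real minority points.
new_points = smote_like([(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)], n_new=5)
```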
V. Garcia et al. [23] have investigated the influence
of both the imbalance ratio and the classifier on the
performance of several resampling strategies to
deal with imbalanced data sets. The study focuses
on evaluating how learning is affected when
different resampling algorithms transform the
originally imbalanced data into artificially balanced
class distributions. Table 2 presents recent
algorithmic advances in class imbalance learning
available in the literature. Obviously, there are
many other algorithms which are not included in
this table. A profound comparison of the above
algorithms and many others can be gathered from
the references list.
Table 2
Recent advances in Class Imbalance Learning
_______________________________________________________________
ALGORITHM    DESCRIPTION    REFERENCE
DCEID        (description and reference lost in extraction)
_______________________________________________________________
After applying CDMOTE: (a) Checking Status, (b) Duration, (c) Credit History, (d) Housing.
The value of j will change from one dataset to
another; depending on the unique properties of the
dataset, j can even be equal to one, i.e.,
no cluster disjunct attributes can be identified after
applying the visualization technique on the dataset.
Before applying CDMOTE: (a) Checking Status, (b) Duration, (c) Credit History, (d) Housing.
Algorithm 1 (CDMOTE and NBCD) can be
explained as follows. The inputs to the algorithm
are the majority subclass P and the minority class
N with the number of features. The output of the
algorithm is the average measures, such as
AUC, Precision, F-measure, TP rate and TN rate,
produced by the CDMOTE and NBCD methods.
The algorithm begins by initializing k = 1 and
j = 1, where j is the number of cluster disjuncts
identified by applying the visualization technique on
the subset N and k is the variable used to loop over
the j cluster disjuncts.
Phase I: Cluster Disjunct Identification Phase
2: k ← 1, j ← 1
3: Apply visualization technique on subset N
4: Identify cluster disjuncts Cj from N; j = number
   of cluster disjuncts identified in visualization
5: repeat
6:   k = k + 1
7:   Identify and remove the borderline and outlier
     instances from the cluster disjunct Cj
8: until k = j
Phase II: Oversampling Phase
9: Apply oversampling on cluster disjunct Cj
   from N
10: repeat
11:   k = k + 1
12:   Generate Cj's synthetic positive instances
13: until k = j
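The two phases above can be sketched as follows. This is an illustrative reading of the pseudocode, not the authors' implementation: we assume Phase I has already produced the cluster disjuncts with borderline and outlier instances removed, and we use simple linear interpolation for the synthetic positives.

```python
import random

def cdmote_sketch(minority_clusters, target_size, seed=0):
    """Phase II of the algorithm, sketched: oversample each cluster
    disjunct independently until it reaches `target_size` instances.
    `minority_clusters` stands in for the disjuncts C1..Cj of Phase I."""
    rng = random.Random(seed)
    oversampled = []
    for cluster in minority_clusters:          # loop k = 1 .. j
        out = list(cluster)
        while len(out) < target_size:          # generate synthetic positives
            a, b = rng.sample(cluster, 2)
            gap = rng.random()
            out.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
        oversampled.append(out)
    return oversampled

# Two toy disjuncts, each grown to ten instances.
clusters = [[(0.0, 0.0), (1.0, 1.0)], [(5.0, 5.0), (6.0, 6.0), (7.0, 7.0)]]
grown = cdmote_sketch(clusters, target_size=10)
```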
5 Evaluation Metrics

AUC = (1 + TPrate - FPrate) / 2    (1)

Precision = TP / (TP + FP)    (2)

F-measure = (2 x Precision x Recall) / (Precision + Recall)    (3)

TP rate (Recall) = TP / (TP + FN)    (4)

TN rate = TN / (TN + FP)    (5)
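A small helper showing how measures (1), (2), (4) and (5) follow from the confusion-matrix counts (the function name is ours):

```python
def metrics(tp, fp, tn, fn):
    """Compute the evaluation measures from confusion-matrix counts."""
    tp_rate = tp / (tp + fn)                   # recall / sensitivity, eq. (4)
    fp_rate = fp / (fp + tn)                   # false positive rate
    return {
        "AUC": (1 + tp_rate - fp_rate) / 2,    # eq. (1)
        "Precision": tp / (tp + fp),           # eq. (2)
        "TP rate": tp_rate,
        "TN rate": tn / (tn + fp),             # specificity, eq. (5)
    }

m = metrics(tp=40, fp=10, tn=90, fn=10)
print(m["AUC"])  # 0.85
```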
6 Experimental framework
In this study, CDMOTE and NBCD are applied to
fifteen binary data sets from the UCI repository
[34] with different imbalance ratios (IR). Table 3
summarizes the data sets selected in this study and
shows, for each data set, the number of examples
(#Ex.), the number of attributes (#Atts.), the class
name of each class (minority and majority) and the IR.
Table 3
_________________________________________
Data set        #Ex.  #Atts.  Class (-; +)                  IR
_________________________________________
1. Breast        268     9    (recurrence; no-recurrence)  2.37
2. Breast_w      699     9    (benign; malignant)          1.90
3. Colic         368    22    (yes; no)                    1.71
4. Credit-g     1000    21    (good; bad)                  2.33
6. Heart-c       303    14    (<50; >50_1)                 1.19
7. Heart-h       294    14    (<50; >50_1)                 1.77
8. Heart-stat      -     -    -                            1.25
9. Hepatitis       -     -    -                            3.85
10. Ionosphere   351    34    (b; g)                       1.79
11. Kr-vs-kp       -     -    -                            1.09
12. Labor         56    16    (bad; good)                  1.85
14. -              -     -    -                            1.08
15. Sonar          -     -    -                            1.15
_________________________________________
(cells marked "-" were lost in extraction)
7 Results
For all experiments, we use the existing prototypes
available in Weka [42]. We compare the proposed
methods CDMOTE and NBCD with the SVM, C4.5
[44], FT and SMOTE [41] state-of-the-art learning
algorithms. In all the experiments we estimate
AUC, Precision, F-measure, TP rate and TN rate
using 10-fold cross-validation. We experimented
with 15 standard datasets from the UCI repository;
these datasets are standard benchmarks used in the
context of high-dimensional imbalance learning.
The experiments on these datasets have two goals.
First, we study the class imbalance properties of
the datasets using the proposed CDMOTE and
NBCD learning algorithms. Second, we compare
the classification performance of our proposed
CDMOTE and NBCD algorithms with traditional
and class imbalance learning methods on all
datasets.
In the following, we analyze the performance of
the methods considering the original algorithms
SVM, C4.5 and FT without pre-processing; we
also analyze the pre-processing method SMOTE
for performance evaluation against CDMOTE and
NBCD. The complete results for all the algorithms
used in this study are shown in Tables 4 to 9,
where the reader can observe the full test results
of each approach with their associated standard
deviations. We must emphasize the good results
achieved by CDMOTE and NBCD, as they obtain
the highest values among all algorithms.
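Each table entry reports a measure as mean±standard deviation over the ten folds. A small helper showing that aggregation (the function name is ours):

```python
import statistics

def cv_summary(fold_scores):
    """Summarize per-fold scores (fractions in [0, 1]) in the
    'mean±std' percentage format used in the result tables."""
    mean = statistics.mean(fold_scores)
    std = statistics.stdev(fold_scores)       # sample standard deviation
    return f"{100 * mean:.2f}\u00b1{100 * std:.2f}"

print(cv_summary([0.8, 0.9]))  # 85.00±7.07
```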
__________________________________________________________________________________________
Datasets
SVM
C4.5
FT
SMOTE
CDMOTE
NBCD
__________________________________________________________________________________________
Breast
67.21±7.28
74.28±6.05
Breast_w
96.75±2.00
95.01±2.73
Colic
79.78±6.57
85.16±5.91
79.11±6.51
Credit-g
68.91±4.46
71.25±3.17
71.88±3.68
Diabetes
76.55±4.67
74.49±5.27
Heart-c
81.02±7.25
Heart-h
81.81±6.20
Heart-stat
82.07±6.88
69.83±7.77
70.23±5.91
96.16±2.06
96.58±1.79
97.971±1.503
87.92±4.70
90.641±4.640
76.50±3.38
75.06±3.89
76.844±4.494
70.62±4.67
76.08±4.04
81.75±4.08
79.333±4.137
76.94±6.59
76.06±6.84
82.99±4.98
80.57±6.55
83.052±6.371
80.22±7.95
78.33±7.54
85.65±5.46
83.56±5.81
85.178±5.143
83.89±5.05
80.31±7.75
81.872±7.342
78.15±7.42
68.58±7.52
95.45±2.52
76.15±8.46
88.53±4.10
73.356±6.603
81.90±8.38
79.22±9.57
Ionosphere 90.26±4.97
89.74±4.38
87.10±5.12
90.28±4.73
94.64±3.74
94.411±3.590
Kr-vs-kp
99.02±0.54
99.44±0.37
90.61±1.65
99.66±0.27
99.45±0.42
98.103±1.636
Labor
92.40±11.07
78.60±16.58
84.30±16.24
80.27±11.94
88.33±11.09
95.905±7.259
Mushroom 100.0±0.00
100.0±0.00
100.0±0.00
100.0±0.00
100.0±0.00
100.0±0.00
99.07±0.50
98.379±0.691
86.23±8.31
86.17±8.187
Sick
99.26±0.04
Sonar
75.46±9.92
81.40±8.55
98.72±0.55
96.10±0.92
73.61±9.34
86.17±8.45
78.35±9.09
83.59±9.65
97.61±0.68
82.42±7.25
89.529±8.001
__________________________________________________________________________________________
Table 4 Summary of tenfold cross validation performance for AUC on all the datasets
__________________________________________________________________________________________
Datasets
SVM
C4.5
FT
SMOTE
CDMOTE
NBCD
__________________________________________________________________________________________
Breast
Breast_w
Colic
0.586±0.102
0.977±0.017
0.802±0.073
0.606±0.087
0.604±0.082
0.957±0.034
0.843±0.070
0.717±0.084
0.949±0.030
0.705±0.082
0.967±0.025
0.973±0.018
0.799±0.074
0.991±0.009
0.777±0.072
0.908±0.040
0.900±0.042
0.958±0.029
Credit-g
0.650±0.075
0.647±0.062
0.655±0.044
0.778±0.041
0.788±0.041
0.847±0.043
Diabetes
0.793±0.072
0.751±0.070
0.668±0.051
0.791±0.041
0.836±0.046
0.849±0.040
Heart-c
0.843±0.084
0.769±0.082
0.757±0.069
0.830±0.077
0.822±0.077
0.913±0.052
Heart-h
0.852±0.078
0.775±0.089
0.763±0.082
0.904±0.054
0.869±0.065
0.923±0.043
Heart-stat
0.864±0.075
0.786±0.094
0.760±0.085
0.832±0.062
0.822±0.076
0.870±0.068
Hepatitis
0.757±0.195
0.668±0.184
0.678±0.139
0.798±0.112
0.848±0.136
0.952±0.056
Ionosphere 0.900±0.060
0.891±0.060
0.949±0.041
0.961±0.032
Kr-vs-kp
0.996±0.005
0.998±0.003
0.906±0.017
0.999±0.001
0.998±0.002
0.995±0.004
Labor
0.971±0.075
0.726±0.224
0.844±0.162
0.833±0.127
0.870±0.126
0.995±0.024
1.000±0.00
1.000±0.00
1.000±0.00
1.000±0.00
0.795±0.053
0.962±0.025
0.992±0.012
0.979±0.019
0.814±0.090
0.854±0.086
0.924±0.063
Mushroom 1.000±0.00
1.000±0.00
Sick
0.990±0.014
0.952±0.040
Sonar
0.771±0.103
0.753±0.113
0.831±0.067
0.904±0.053
0.859±0.086
__________________________________________________________________________________________
Fig. 1(a)-(d): Test results on AUC for SVM, C4.5, FT, SMOTE, CDMOTE and NBCD on the Sonar,
Ionosphere, Diabetes and Sick datasets.
Table 5 Summary of tenfold cross validation performance for Precision on all the datasets
__________________________________________________________________________________________
Datasets     SVM          C4.5         FT           SMOTE        CDMOTE       NBCD
__________________________________________________________________________________________
Breast       0.745±0.051  0.753±0.042  0.762±0.051  0.710±0.075  0.713±0.059  0.770±0.062
Breast_w     0.988±0.019  0.965±0.026  0.964±0.026  0.974±0.025  0.986±0.021  0.996±0.011
Colic        0.845±0.060  0.851±0.055  0.839±0.062  0.853±0.057  0.864±0.059  0.925±0.058
Credit-g     0.776±0.033  0.767±0.025  0.791±0.027  0.768±0.034  0.799±0.044  0.805±0.052
Diabetes     0.793±0.037  0.797±0.045  0.764±0.036  0.781±0.064  0.862±0.050  0.826±0.054
Heart-c      0.825±0.080  0.783±0.076  0.776±0.068  0.779±0.082  0.808±0.087  0.831±0.084
Heart-h      0.849±0.058  0.824±0.071  0.830±0.063  0.878±0.076  0.894±0.072  0.896±0.070
Heart-stat   0.833±0.078  0.799±0.051  0.796±0.085  0.791±0.081  0.821±0.094  0.828±0.084
Hepatitis    0.604±0.271  0.510±0.371  0.546±0.333  0.709±0.165  0.739±0.200  0.791±0.151
Ionosphere   0.906±0.080  0.895±0.084  0.938±0.073  0.934±0.049  0.945±0.047  0.944±0.051
Kr-vs-kp     0.991±0.008  0.994±0.006  0.905±0.021  0.996±0.005  0.994±0.005  0.978±0.023
Labor        0.915±0.197  0.696±0.359  0.802±0.250  0.871±0.151  0.921±0.148  0.938±0.122
Mushroom     1.000±0.000  1.000±0.000  1.000±0.000  1.000±0.000  1.000±0.000  1.000±0.000
Sick         0.997±0.003  0.992±0.005  0.975±0.007  0.983±0.007  0.996±0.004  0.990±0.005
Sonar        0.764±0.119  0.728±0.121  0.883±0.100  0.863±0.068  0.851±0.090  0.858±0.092
__________________________________________________________________________________________
Table 6 Summary of tenfold cross validation performance for F-measure on all the datasets
__________________________________________________________________________________________
Datasets
SVM
C4.5
FT
SMOTE
CDMOTE
NBCD
__________________________________________________________________________________________
Breast
0.781±0.059
0.838±0.040
0.776±0.057
0.730±0.076
0.775±0.049
0.782±0.056
Breast_w
0.965±0.019
0.962±0.021
0.975±0.016
0.960±0.022
0.967±0.018
0.980±0.015
Colic
0.833±0.055
0.888±0.044
0.838±0.054
0.880±0.042
0.887±0.043
0.908±0.045
Credit-g
0.802±0.027
0.805±0.022
0.779±0.034
0.787±0.034
Diabetes
0.778±0.037
0.806±0.044
0.827±0.038
0.741±0.046
Heart-c
0.782±0.064
0.792±0.059
0.827±0.069
0.772±0.070
0.800±0.069
0.827±0.065
Heart-h
0.830±0.063
0.851±0.061
0.859±0.052
0.841±0.061
0.829±0.066
0.850±0.054
Heart-stat
0.781±0.083
0.806±0.069
0.841±0.061
0.791±0.072
Hepatitis
0.469±0.265
0.409±0.272
0.557±0.207
Ionosphere 0.787±0.098
0.850±0.066
Kr-vs-kp
0.911±0.016
0.995±0.004
Labor
0.794±0.211
0.636±0.312
Mushroom 1.000±0.000
1.000±0.000
0.763±0.039
0.808±0.047
0.784±0.041
0.786±0.044
0.802±0.076
0.819±0.077
0.677±0.138
0.693±0.192
0.830±0.129
0.855±0.079
0.905±0.048
0.944±0.039
0.942±0.037
0.991±0.005
0.995±0.004
0.994±0.004
0.981±0.016
0.793±0.132
0.842±0.157
0.954±0.082
1.000±0.000
1.000±0.000
1.000±0.000
0.995±0.003
0.991±0.004
0.879±0.195
1.000±0.000
Sick
0.979±0.005
0.993±0.003
0.996±0.003 0.987±0.004
Sonar
0.844±0.099
0.716±0.105
0.753±0.102
0.861±0.061
0.867±0.082
0.866±0.080
__________________________________________________________________________________________
Table 7 Summary of tenfold cross validation performance for TP Rate (Recall) (Sensitivity) on all the
datasets
__________________________________________________________________________________________
Datasets     SVM          C4.5         FT           SMOTE        CDMOTE       NBCD
__________________________________________________________________________________________
Breast       0.806±0.091  0.947±0.060  0.815±0.095  0.763±0.117  0.861±0.101  0.800±0.085
Breast_w     0.967±0.025  0.959±0.033  0.962±0.029  0.947±0.035  0.950±0.033  0.965±0.026
Colic        0.832±0.075  0.931±0.053  0.835±0.077  0.913±0.058  0.915±0.058  0.896±0.063
Credit-g     0.815±0.041  0.847±0.036  0.783±0.052  0.810±0.058  0.733±0.057  0.767±0.051
Diabetes     0.795±0.054  0.821±0.073  0.868±0.065  0.712±0.089  0.763±0.070  0.753±0.061
Heart-c      0.795±0.095  0.808±0.085  0.837±0.100  0.777±0.110  0.802±0.102  0.831±0.092
Heart-h      0.835±0.093  0.885±0.081  0.876±0.089  0.815±0.084  0.783±0.107  0.816±0.088
Heart-stat   0.775±0.113  0.824±0.104  0.857±0.090  0.803±0.110  0.794±0.102  0.817±0.102
Hepatitis    0.448±0.273  0.374±0.256  0.573±0.248  0.681±0.188  0.700±0.247  0.892±0.149
Ionosphere   0.689±0.131  0.821±0.107  0.820±0.114  0.881±0.071  0.946±0.054  0.943±0.053
Kr-vs-kp     0.916±0.021  0.995±0.005  0.990±0.007  0.995±0.006  0.995±0.006  0.985±0.012
Labor        0.845±0.243  0.640±0.349  0.885±0.234  0.765±0.194  0.823±0.227  0.983±0.073
Mushroom     1.000±0.000  1.000±0.000  1.000±0.000  1.000±0.000  1.000±0.000  1.000±0.000
Sick         0.984±0.006  0.995±0.004  0.995±0.004  0.990±0.005  0.993±0.004  0.992±0.005
Sonar        0.820±0.131  0.721±0.140  0.757±0.136  0.865±0.090  0.893±0.109  0.883±0.105
__________________________________________________________________________________________
Table 8 Summary of tenfold cross validation performance for TN Rate (Specificity) on all the datasets
__________________________________________________________________________________________
Datasets
SVM
C4.5
FT
SMOTE
CDMOTE
NBCD
__________________________________________________________________________________________
Breast
0.260±0.141
0.335±0.166
0.151±0.164
Breast_w
0.932±0.052
0.977±0.037
0.931±0.060
Colic
0.717±0.119
0.734±0.118
0.731±0.121
Credit-g
0.398±0.085
0.469±0.098
0.371±0.105
Diabetes
0.603±0.111
0.464±0.169
0.634±0.128
0.984±0.025
0.996±0.012
0.862±0.063
0.841±0.080
0.918±0.069
0.713±0.056
0.772±0.063
0.771±0.073
0.873±0.054
0.834±0.063
Heart-c
0.723±0.119
0.779±0.117
0.717±0.119
0.861±0.068
0.809±0.099
0.830±0.097
Heart-h
0.655±0.158
0.714±0.131
0.636±0.152
0.894±0.074
0.893±0.079
0.891±0.085
Heart-stat
0.728±0.131
0.775±0.123
0.677±0.152
0.862±0.064
0.812±0.115
0.820±0.098
Hepatitis
0.900±0.097
0.882±0.092
0.942±0.093
0.837±0.109
0.888±0.097
0.896±0.090
Ionosphere
0.940±0.055
0.949±0.046
0.933±0.063
0.928±0.057
Kr-vs-kp
0.993±0.007
Labor
0.865±0.197
0.945±0.131
0.843±0.214
Mushroom
1.000±0.000
1.000±0.000
1.000±0.000
Sick
0.875±0.071
Sonar
0.749±0.134
0.574±0.095
0.990±0.009
0.974±0.026
0.752±0.148
0.567±0.105
0.987±0.010
0.846±0.080
0.762±0.145
0.622±0.137
0.975±0.024
0.807±0.077
0.998±0.003
0.947±0.047
0.945±0.054
0.994±0.006
0.977±0.025
0.928±0.138
0.946±0.106
1.000±0.000
1.000±0.000
1.000±0.000
0.872±0.053
0.970±0.031
0.919±0.045
0.831±0.113
0.839±0.120
0.847±0.187
0.752±0.113
__________________________________________________________________________________________
[Comparison table of AUC for CILIUS [22], CDMOTE and NBCD on the Breast, Labor, Sick and Sonar data sets; the individual values could not be recovered from the source.]
8. Conclusion
The class imbalance problem has given scope for a
new paradigm of algorithms in data mining. The
traditional and benchmark algorithms are
worthwhile for discovering hidden knowledge from
data sources; meanwhile, class imbalance learning
methods can improve the results, which is critical
in many real-world applications. In this paper we
present a class imbalance learning paradigm that
exploits the cluster disjunct concept in the
supervised learning research area, and implement it
with C4.5 and Naïve Bayes as its base learners.
Experimental results show that CDMOTE and
NBCD perform well on class imbalance datasets.
Furthermore,
http://www.ics.uci.edu/mlearn/MLRepository.html
J.R. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.
T. Jo and N. Japkowicz, "Class Imbalances versus Small Disjuncts," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 40-49, 2004.
N. Japkowicz, "Class Imbalances: Are We Focusing on the Right Issue?" Proc. Int'l Conf. Machine Learning, Workshop on Learning from Imbalanced Data Sets II, 2003.
R.C. Prati, G.E.A.P.A. Batista, and M.C. Monard, "Class Imbalances versus Class Overlapping: An Analysis of a Learning System Behavior," Proc. Mexican Int'l Conf. Artificial Intelligence, pp. 312-321, 2004.