Chapter 3
Clustering high dimensional data based on K-means++ clustering with feature weighting
High-dimensional data arise naturally in many domains and have regularly presented a
great challenge for traditional data mining techniques, both in terms of effectiveness and
efficiency. Clustering is the process of dividing data into groups according to similarity or
dissimilarity criteria between data points. Clustering algorithms are not only used for
classification but also for data compression, feature weighting, and data reduction.
The most commonly preferred clustering methods, in terms of frequency of use, are k-means
clustering, fuzzy c-means clustering, mountain clustering, and subtractive clustering. In this
study, a data weighting process is carried out together with the k-means++ clustering (KMC)
algorithm, which is the most widely preferred in the literature. A novel high dimensional data
clustering approach based on K-means++ clustering with feature weighting (KMCFW) is
proposed. Feature weighting and Silhouette index techniques can improve the quality of the
resulting clusters.
The flow diagram of the proposed approach is shown in Figure 1. Initially, the high dimensional
data are preprocessed to handle missing or null values. Then a feature weight is computed for
each dimension of the dataset. Based on the feature weights, the data are clustered using the
k-means++ algorithm. Finally, cluster validity indexes, namely the Silhouette index and the Dunn
index, are computed to measure the inter-cluster distances and the degree of overlap among the
clusters.
Figure 1. Flow diagram of the proposed approach

As shown in Figure 1, the proposed approach consists of four phases: data preprocessing, feature
weight computation, k-means++ clustering, and Silhouette and Dunn index computation. The
following section describes the data preprocessing phase in detail.

Data Preprocessing
Data preparation (Tidke et al 2012), such as missing data imputation and noise control, is a
key step in data mining and knowledge discovery. It is especially crucial in data mining
applications, as a minor improvement in data quality may lead to higher effectiveness and
significantly increase the validity and quality of the discovered knowledge. It is reported that
data preparation takes approximately 80% of the total data engineering effort. More effort may
be required when there are missing data in the data set, especially when the data are not
missing at random (NMAR). Therefore, data preparation is a crucial research topic in areas such
as machine learning, pattern recognition, and data mining. This chapter focuses on missing data
imputation.
In fact, real-world data in general have missing values. For example, people may fail to
respond to questionnaire items such as salary, there may be no registered history or record of
changes to the data, or equipment may malfunction. Taking the questionnaire as an example,
when conducting a survey, low income people are often reluctant to fill in their salary; in that
case, all the missing values are less than the maximal value of the same attribute. This work
focuses on this type of missing data, namely missing data on continuous variables whose missing
values are assumed not to exceed the maximal values of the attributes. Our motivation is to
adopt an imputation algorithm that handles this type of missing data appropriately.
Imputation methods (Wilson and Stefan 2015) involve replacing missing values
with estimated ones based on information available in the data set. There are many options,
varying from naive methods such as mean imputation to more robust methods based on
relationships among attributes. This section surveys some widely used imputation methods.
A. Case substitution
This method is typically used in sample surveys. One instance with missing data (for
example, a person that cannot be contacted) is replaced by another, non-sampled instance.
B. Mean and mode
This is one of the most frequently used methods. It consists of replacing the missing
data for a given attribute by the mean (for a quantitative attribute) or the mode (for a qualitative
attribute) of all known values of that attribute; a small illustration follows.
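As a small illustration of mean and mode imputation, the following sketch uses pandas on a made-up table; the column names and values are hypothetical and not taken from the datasets used in this work.

import pandas as pd

# Hypothetical example data: 'salary' is quantitative, 'gender' is qualitative.
df = pd.DataFrame({
    "salary": [32000, 45000, None, 51000, None],
    "gender": ["F", "M", "M", None, "F"],
})

# Quantitative attribute: replace missing values by the attribute mean.
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Qualitative attribute: replace missing values by the attribute mode.
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

print(df)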
C. Hot deck and cold deck
In the hot deck method, a missing attribute value is filled in with a value drawn from an
estimated distribution for the missing value based on the current data. Hot deck is typically
implemented in two stages. In the first stage, the data are partitioned into clusters. In the
second stage, each instance with missing data is associated with one cluster, and the complete
cases in that cluster are used to fill in the missing values, for example by taking the mean or
mode of the attribute within the cluster. Cold deck imputation is similar to hot deck, but the
values are drawn from a source other than the current data set.
D. Prediction model
Prediction models are sophisticated procedures for handling missing data. These
methods consist of creating a predictive model to estimate the values that will substitute for the
missing data. The attribute with missing data is used as the class attribute, and the remaining
attributes are used as input for the predictive model. An important argument in favor of this
approach is that attributes frequently have relationships (correlations) among themselves; these
correlations can be used to create a predictive model for classification or regression (depending
on whether the attribute with missing data is nominal or continuous, respectively).
Some of these relationships among the attributes may be preserved if they are
captured by the predictive model. An important drawback of this approach is that the model-
estimated values are usually better behaved than the true values would be: since the missing
values are predicted from a set of attributes, the predicted values are likely to be more
consistent with this set of attributes than the true (unknown) values would be. A second
drawback is the requirement for correlation among the attributes. If there are no relationships
between the attribute with missing data and the other attributes in the data set, then the model
will not be precise enough to estimate the missing values.
The KNN method (Malarvizhi and Antony 2012) is popular due to its simplicity and
proven effectiveness in many missing value imputation problems. For a missing value, the
method seeks its K nearest variables or subjects and imputes the value by a weighted average of
the observed values of the identified neighbors. The weight choice is adopted from the LSimpute
method used for microarray missing value imputation. LSimpute is an extension of KNN that
utilizes correlations between both genes and arrays, and the missing values are imputed by a
weighted average of the gene-based and array-based estimates. Specifically, the weight for the
kth neighbor is given by

w_k = \left( \frac{r_k^2}{1 - r_k^2 + \varepsilon} \right)^2    (1)

where r_k is the correlation between the kth neighbor and the variable or subject with the
missing value, and \varepsilon = 10^{-6}. As a result, this algorithm gives more weight to closer
neighbors. A minimal sketch of this weighted imputation is given after the list below. The
k-nearest neighbor approach has several attractive properties:
(i) k-nearest neighbor can predict both qualitative attributes (the most frequent value
among the k nearest neighbors) and quantitative attributes (the mean among the k
nearest neighbors).
(ii) It does not require building a predictive model for each attribute with missing
data; in fact, the k-nearest neighbor algorithm does not create explicit models.
(iii) It can easily treat instances with multiple missing values.
(iv) It takes into consideration the correlation structure of the data.
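The following is a minimal sketch of correlation-weighted KNN imputation in the spirit of equation (1). It treats columns as the variables whose correlations define the neighbors; the data matrix, the choice of K, and the use of column means to pre-fill values when computing correlations are illustrative assumptions rather than part of the proposed method.

import numpy as np

def knn_impute(X, K=3, eps=1e-6):
    """Impute missing entries (NaN) in each column using a weighted average of the
    K most correlated columns, with weights w_k = (r_k^2 / (1 - r_k^2 + eps))^2."""
    X = X.astype(float).copy()
    n, d = X.shape
    # Column means are used only to pre-fill values so that correlations can be computed.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(np.isnan(X), col_means, X)
    corr = np.corrcoef(filled, rowvar=False)          # column-to-column correlations

    for j in range(d):
        missing_rows = np.where(np.isnan(X[:, j]))[0]
        if missing_rows.size == 0:
            continue
        # K most correlated other columns (the neighbors) for column j.
        r = np.abs(corr[j])
        r[j] = -np.inf
        neighbors = np.argsort(r)[::-1][:K]
        w = (corr[j, neighbors] ** 2 / (1 - corr[j, neighbors] ** 2 + eps)) ** 2
        for i in missing_rows:
            # Weighted average of the neighbors' values in the same row.
            X[i, j] = np.average(filled[i, neighbors], weights=w)
    return X

# Illustrative usage on a small matrix with missing values.
data = np.array([[1.0, 1.1, 0.9],
                 [2.0, 2.1, 2.2],
                 [3.0, 2.9, 3.1],
                 [np.nan, 4.2, 3.8]])
print(knn_impute(data, K=2))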
Feature Weight Computation

Feature weighting, which calculates feature (term) values in documents, is one of the important
techniques in clustering. Four methods are included in this study, each of which uses tf to
express a feature's capacity for describing the document contents. The methods differ in how
they measure a feature's ability to discriminate between documents. According to the functions
used, the four feature weighting methods are tf*idf, tf*CRF, tf*OddsRatio, and tf*CHI; OddsRatio
and CHI are excellent methods for feature selection.
A. tf
Before describing the feature weighting methods, we give the definition of tf. Let freq_ij be the
number of times feature f_i is mentioned in the text of document d_j. Then the tf of feature f_i
in document d_j is given by

tf(f_i, d_j) = \frac{freq_{ij}}{\max_k freq_{kj}}    (2)

where the maximum is computed over all features mentioned in document d_j. For the sake of
brevity, tf(f_i, d_j) is also written as tf_ij.
B. Tf/idf
Tf/idf, which originated in information retrieval, is the best known feature weighting scheme in
clustering. This method uses idf to measure a feature's ability to discriminate between similar
documents. The motivation for idf is that features which appear in many documents are not very
useful for distinguishing a relevant document from a non-relevant one. Let N be the total number
of documents and n_i be the number of documents in which feature f_i appears. Then idf_i, the
inverse document frequency of f_i, is given by

idf_i = \log \frac{N}{n_i}    (3)

According to the tf/idf scheme, the value of feature f_i in the vector of document d_j is given by

v_{ij} = tf_{ij} \cdot idf_i = \frac{freq_{ij}}{\max_k freq_{kj}} \cdot \log \frac{N}{n_i}    (4)

Tf/idf is the simplest technique for feature weighting, and it easily scales to very large corpora,
with a computational cost roughly proportional to the number of documents. However, idf is a
global measure and ignores the fact that features may have different discriminating powers for
different document topics. For example, "football" is a highly valuable term in sports news but
has little value for identifying financial news, yet according to idf its value is the same whether
or not it appears in sports news. The following sections discuss methods that calculate a
feature's discriminating ability in terms of document categories.
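Before moving to the category-based schemes, the following sketch makes equations (2)-(4) concrete by computing tf and tf*idf weights for a tiny, invented term-frequency matrix; the counts are purely illustrative.

import numpy as np

# Rows = features (terms), columns = documents: freq[i, j] is the raw count of
# feature f_i in document d_j (illustrative counts only).
freq = np.array([[3, 0, 1],
                 [1, 2, 0],
                 [0, 1, 4]], dtype=float)

# Equation (2): tf_ij = freq_ij / max_k freq_kj (normalise by each document's
# most frequent feature).
tf = freq / freq.max(axis=0, keepdims=True)

# Equation (3): idf_i = log(N / n_i), where n_i is the number of documents
# containing feature f_i and N is the total number of documents.
N = freq.shape[1]
n_i = (freq > 0).sum(axis=1)
idf = np.log(N / n_i)

# Equation (4): v_ij = tf_ij * idf_i.
v = tf * idf[:, np.newaxis]
print(v)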
C. Tf/CRF
CRF (Category Relevance Factor) expresses the discriminating power of features with respect to
categories (such as document topics). Let C = {c_1, ..., c_m} be the set of predefined categories
and F = {f_1, ..., f_n} be the feature set. Let DOC = \bigcup_i D_i be the set of documents, where
D_i is the set of documents belonging to category c_i. The category relevance factor CRF of f_i
and c_j is given by

CRF(f_i, c_j) = \log \frac{X / Y}{U / V}    (5)

where X is the number of documents that contain feature f_i and belong to category c_j, Y is the
number of documents that belong to category c_j, U is the number of documents that contain
feature f_i and do not belong to category c_j, and V is the number of documents that do not
belong to category c_j. For a document d in D_j, let the feature vector of d be (v_1, v_2, ..., v_n),
where v_i is the value of feature f_i. Then, under the tf/CRF scheme, v_i is given by

v_i = tf(f_i, d) \cdot CRF(f_i, c_j)    (6)
D. Tf/OddsRatio
OddsRatio is commonly used in information retrieval, where the problem is to rank documents
according to their relevance to the positive class, using the occurrence of different words as
features. It was first used as a feature selection method by Mladenic, who compared six feature
scoring measures on real Web documents and found that OddsRatio showed the best performance.
This indicates that OddsRatio is well suited for feature scoring and may also be very suitable for
feature weighting. Consider the two-way contingency table of a feature f_i and a category c_j:
A is the number of times f_i and c_j co-occur, B is the number of times c_j occurs, C is the
number of times f_i occurs without c_j, and D is the number of times c_j does not occur. Writing
p(f_i | c_j) = A/B for the probability that f_i occurs in a document of category c_j and
p(f_i | \bar{c}_j) = C/D for the probability that f_i occurs in a document outside c_j, the
OddsRatio between f_i and c_j is defined as

OddsRatio(f_i, c_j) = \log \frac{p(f_i \mid c_j)\,(1 - p(f_i \mid \bar{c}_j))}{(1 - p(f_i \mid c_j))\,p(f_i \mid \bar{c}_j)}    (7)
E. Tf/CHI
The CHI statistic measures the lack of independence between a feature and a category and can be
compared to the \chi^2 distribution with one degree of freedom to judge extremeness. Given a
feature f_i and a category c_j, the CHI of f_i and c_j is given by

CHI(f_i, c_j) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}    (8)

where N is the total number of documents; A is the number of times f_i and c_j co-occur; B is the
number of times f_i occurs without c_j; C is the number of times c_j occurs without f_i; and D is
the number of times neither f_i nor c_j occurs. CHI_ij has a value of zero if f_i and c_j are
independent, and reaches its maximal value N if f_i and c_j either always co-occur or are always
jointly absent. The more correlated f_i and c_j are, the higher CHI_ij is, and vice versa. CHI is
one of the most effective feature selection methods.
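As an illustration of equation (8), the following sketch computes the CHI statistic directly from the A, B, C and D contingency counts of a feature and a category; the counts used in the example are invented.

def chi_statistic(A, B, C, D):
    """CHI(f_i, c_j) = N * (A*D - C*B)^2 / ((A+C)*(B+D)*(A+B)*(C+D)), where
    A = #docs with f_i and c_j, B = #docs with f_i but not c_j,
    C = #docs with c_j but not f_i, D = #docs with neither."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return 0.0 if denom == 0 else N * (A * D - C * B) ** 2 / denom

# Illustrative counts: the feature appears mostly inside the category.
print(chi_statistic(A=40, B=5, C=10, D=45))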
TF-IDF is the most generally used feature weighting method. Several improved methods based on
TF-IDF have been proposed, as discussed above, but no single method is able to combine the
advantages of each algorithm. Therefore, a feature weighting method based on a real-coded
genetic algorithm (GA) is adopted in this approach, in which the real-coded GA is used to
calculate the feature weights.
Weighting features is a relaxation of the assumption that each feature has the same
importance with respect to the target concept. Assigning a proper weight to each feature is a
process of estimating how important the feature is. There have been a number of methods for
feature weighting; among them, an information-theoretic filter method is used here for assigning
weights to features, information-theoretic measures being the most widely used in feature
weighting. In order to calculate the weight for each feature, we first assume that when a certain
feature value is observed, it provides a certain amount of information about the target feature.
In this work, the amount of information that a certain feature value contains is quantified by the
Kullback-Leibler measure described below.
In this study, the Kullback-Leibler (KL) measure is adopted for computing the weight of
each feature in the dataset. This measure has been widely used in many learning domains. The
weight of a feature can be defined as the weighted average of the KL measures across its feature
values. Therefore, the weight of feature i, denoted as wt_avg(i), is defined as

wt_{avg}(i) = \sum_{j} \frac{|e_{ij}|}{N_D} \cdot KLm(R_{e_{ij}})    (10)

            = \sum_{j} P(e_{ij}) \cdot KLm(R_{e_{ij}})    (11)

where |e_{ij}| represents the number of instances that have the value e_ij, N_D is the total number
of training instances in the dataset, and P(e_ij) is the probability that feature i takes the value
e_ij.
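The following is a minimal sketch of the weight computation in equations (10)-(11) for a single discrete feature. Since the definition of the KL measure itself is not reproduced above, the sketch assumes KLm(R_e_ij) to be the KL divergence between the class distribution conditioned on the feature value e_ij and the prior class distribution; the toy feature values and labels are invented.

import numpy as np
from collections import Counter

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as aligned arrays."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def feature_weight(feature_values, labels):
    """wt_avg(i) = sum_j P(e_ij) * KLm(R_e_ij), equations (10)-(11), with KLm taken
    here as KL(P(class | e_ij) || P(class))."""
    N_D = len(labels)
    classes = sorted(set(labels))
    prior = [labels.count(c) / N_D for c in classes]
    weight = 0.0
    for value, count in Counter(feature_values).items():
        # Class labels of the instances where the feature takes this value.
        rows = [l for v, l in zip(feature_values, labels) if v == value]
        cond = [rows.count(c) / len(rows) for c in classes]
        weight += (count / N_D) * kl_divergence(cond, prior)
    return weight

# Illustrative discrete feature and class labels.
feature = ["a", "a", "b", "b", "b", "c"]
target = [0, 0, 1, 1, 0, 1]
print(feature_weight(feature, target))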
Based on the calculated feature weights, the cluster formation is performed using the k-means++
clustering algorithm described in the following section.

K-means++ Clustering
The k-means++ algorithm (Bahmani et al 2012) is used in this approach for clustering the high
dimensional data. K-means is a numerical, unsupervised, non-deterministic, iterative method. It is
simple and very fast, and in many practical applications it has proved to be an effective way to
produce good clustering results. The standard k-means algorithm (Na, Liu and Guan 2010) is
effective in producing clusters for many practical applications, but its computational complexity
is very high for high dimensional data. Different variants of k-means have been proposed for high
dimensional data, but the accuracy of the k-means clusters depends heavily on the selection of the
initial centers.
If the initial partitions are not chosen carefully, the computation runs the risk of
converging to a local minimum rather than the global minimum. The initialization step is
therefore very important. To combat this problem, the algorithm can be run several times with
different initializations; if the results converge to the same partition, it is likely that a global
minimum has been reached. This, however, has the drawback of being very time consuming and
computationally expensive. In this work, the initial centers are determined using the k-means++
method before assigning the data points to clusters, which improves the accuracy and efficiency
of the k-means algorithm.
Definitions
This section defines the k-means problem formally, as well as the k-means and k-means++
algorithms. For the k-means problem, we are given an integer k and a set of n data points
X \subset \mathbb{R}^d. The goal is to choose k centers C so as to minimize the potential function

\phi = \sum_{x \in X} \min_{c \in C} \lVert x - c \rVert^2    (12)

From these centers, the clustering is defined by grouping the data points according to the center
to which each point is assigned. The basic k-means algorithm is a simple and fast procedure for
this problem, although it offers no approximation guarantees at all. The procedure is as follows:
1. Arbitrarily choose k initial centers C = {c_1, ..., c_k}.
2. For each l in {1, ..., k}, set the cluster C_l to be the set of points in X that are closer to c_l
   than to c_j for all j \neq l.
3. For each l in {1, ..., k}, set c_l to be the center of mass of all points in C_l.
4. Repeat steps 2 and 3 until C no longer changes.

Algorithm 1. K-means
The k-means++ algorithm prescribes a specific way of choosing the initial centers for the k-means
algorithm. In particular, let D(x) denote the shortest distance from a data point x to the closest
center already chosen. The first center is chosen uniformly at random from X, and each subsequent
center is chosen from X with probability proportional to D(x)^2; once k centers have been
selected, the iterations of Algorithm 1 proceed as usual. The k-means++ clustering provides an
efficient clustering result, which is further analyzed by computing the Silhouette index described
in the following section.
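As an illustration of the D(x)^2 seeding described above, the following sketch implements the k-means++ initialization step on synthetic data; the data generation and the seed value are illustrative assumptions (scikit-learn's KMeans also provides this seeding via init="k-means++").

import numpy as np

def kmeanspp_init(X, k, rng=None):
    """D^2-weighted seeding: pick the first center uniformly at random, then pick
    each next center with probability proportional to D(x)^2, where D(x) is the
    distance from x to the closest center chosen so far."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    for _ in range(1, k):
        # Squared distance of every point to its closest chosen center.
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)

# Illustrative usage: seed k = 3 centers for three synthetic point clouds.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ((0, 0), (3, 3), (0, 4))])
print(kmeanspp_init(X, k=3, rng=0))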
Silhouette Index
The Silhouette index (Rousseeuw 1987) refers to a method of interpretation and
validation of consistency within clusters of data. The technique provides a succinct graphical
representation of how well each object lies within its cluster. For a given cluster X_j
(j = 1, ..., c), this method assigns to each sample of X_j a quality measure s(i) (i = 1, ..., m),
known as the Silhouette width. The Silhouette width is a confidence indicator of the membership
of the ith sample in cluster X_j and is defined as

s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}    (13)

where a(i) is the average distance between the ith sample and all of the samples included in X_j,
and b(i) is the minimum average distance between the ith sample and all of the samples clustered
in X_k (k = 1, ..., c; k \neq j). From this formula it follows that s(i) takes a value between -1
and 1.
Thus, for a given cluster X_j, it is possible to calculate a cluster silhouette S_j, which
characterizes the heterogeneity and isolation properties of the cluster; it is calculated as the
average of the silhouette widths of all samples in X_j. Moreover, for any partition, a global
silhouette value, or silhouette index, GS_u, can be used as an effective validity index for the
partition:

GS_u = \frac{1}{c} \sum_{j=1}^{c} S_j    (14)

In this case, the partition with the maximum silhouette index value is taken as the optimal
partition.
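The following sketch computes the silhouette widths of equation (13), the cluster silhouettes S_j (taken here as the mean width per cluster) and the global index GS_u of equation (14) for a small, invented two-cluster example.

import numpy as np

def global_silhouette(X, labels):
    """Silhouette width s(i) (eq. 13) per sample, cluster silhouettes S_j as the
    mean width per cluster, and the global index GS_u (eq. 14) as their average."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        # a(i): average distance to the other members of the same cluster.
        a = dist[i, own & (np.arange(len(X)) != i)].mean() if own.sum() > 1 else 0.0
        # b(i): minimum average distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    S = [s[labels == c].mean() for c in clusters]   # cluster silhouettes S_j
    return sum(S) / len(S)                          # GS_u

# Illustrative usage on two small, well separated groups of points.
X = np.array([[0, 0], [0.2, 0.1], [0.1, 0.3], [5, 5], [5.1, 4.9], [4.8, 5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(global_silhouette(X, labels))   # close to 1 for well separated clusters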
Dunn index:
The Dunn index (DI) is a metric for evaluating clustering algorithms (Han and Kamber 2006). It
belongs to the group of validity indices that also includes the Davies-Bouldin index and the
Silhouette index, in that it is an internal evaluation scheme whose result is based on the
clustered data itself. The Dunn index identifies clusters which are well separated and compact;
the goal is therefore to maximize the inter-cluster distance while minimizing the intra-cluster
distance. The Dunn index for l clusters is defined as

DN_l = \min_{p=1,\dots,l} \left\{ \min_{q=p+1,\dots,l} \left( \frac{dis(c_p, c_q)}{\max_{m=1,\dots,l} dia(c_m)} \right) \right\}    (15)

where dis(c_p, c_q) = \min_{x \in c_p,\, y \in c_q} \lVert x - y \rVert is the dissimilarity between
clusters c_p and c_q, and dia(c_m) is the diameter of cluster c_m. If the Dunn index is large, it
means that compact and well separated clusters exist. Therefore, the maximum is observed for the
number of clusters equal to the most probable number of clusters in the data set.
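As an illustration of equation (15), the following sketch computes the Dunn index using single-linkage inter-cluster distances and the maximum pairwise distance within a cluster as its diameter (an assumption, since the diameter definition is not reproduced above); the toy data are invented.

import numpy as np

def dunn_index(X, labels):
    """Dunn index (eq. 15): minimum inter-cluster distance divided by the
    maximum cluster diameter."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = [X[labels == c] for c in np.unique(labels)]

    def pairwise_min(A, B):
        # Single-linkage distance between two clusters.
        return np.min(np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1))

    def diameter(A):
        # Maximum pairwise distance within a cluster.
        if len(A) < 2:
            return 0.0
        return np.max(np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1))

    min_inter = min(pairwise_min(clusters[p], clusters[q])
                    for p in range(len(clusters))
                    for q in range(p + 1, len(clusters)))
    max_diam = max(diameter(c) for c in clusters)
    return min_inter / max_diam

# Illustrative usage: larger values indicate compact, well separated clusters.
X = np.array([[0, 0], [0.2, 0.1], [0.1, 0.3], [5, 5], [5.1, 4.9], [4.8, 5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(dunn_index(X, labels))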
Summary:
This chapter has described the proposed methods and techniques for clustering high
dimensional data. Initially, the preprocessing procedure with the KNN imputation method for
missing value estimation was discussed. Then the feature weight computation process based on the
real-coded genetic algorithm was explained, followed by the clustering process based on the
k-means++ algorithm. Finally, the analysis of cluster quality by the Silhouette index and the
Dunn index was described. The proposed approach provides an efficient clustering result by
reducing the overlap among the clusters.