
Chapter 3

Clustering high dimensional data based on K-means++ clustering with feature weighting

High-dimensional data arise naturally in many domains and have regularly presented a great challenge for traditional data mining techniques, both in terms of effectiveness and efficiency. Clustering is the process of dividing data into groups according to similarity or dissimilarity criteria between data points. Clustering algorithms are used not only for classification but also for data compression, feature weighting, and data reduction.

The most commonly used clustering methods are k-means clustering, fuzzy c-means clustering, mountain clustering, and subtractive clustering. In this study, a data weighting process is carried out using the k-means++ clustering (KMC) algorithm, which is the most widely preferred in the literature. A novel high-dimensional data clustering approach based on k-means++ clustering with feature weighting (KMCFW) is proposed. Feature weighting and Silhouette Index techniques can improve both the efficiency and the accuracy of the analysis.

3.1 K-means++ clustering-based feature weighting (KMCFW):

The proposed approach flow diagram is shown in Figure 1. Initially, the high-dimensional data is preprocessed in order to remove missing or null values. Then a feature weight is computed for each dimension in the dataset. Based on the feature weights, the data is clustered using the k-means++ algorithm. Finally, the cluster indices, namely the silhouette and Dunn indices, are computed to measure cluster separation and the degree of overlap among the clusters.
[Figure 1 shows the four stages of the proposed approach as a flow diagram: Data Preprocessing, Feature Weight Computation, K-means++ Clustering, and Silhouette and Dunn Index Computation.]

Figure 1. Proposed Approach Flow Diagram.

Figure 1 gives the proposed approach flow diagram. Data preprocessing, feature weight computation, k-means++ clustering, and silhouette and Dunn index computation are the four phases of the proposed approach. The following section describes the data preprocessing techniques in detail.

3.1.1 Data preprocessing:

Data preparation (Tidke et al 2012), such as missing data imputation and noise control, is a key step in data mining and knowledge discovery. It is especially crucial in data mining applications, as a minor improvement in data quality may lead to higher effectiveness, which would significantly increase the validity and quality of the discovered knowledge. It is reported that data preparation takes approximately 80% of the total data engineering effort. More effort may be required when there are missing data in the data set, especially when the data are not missing at random (NMAR). Therefore, data preparation is a crucial research topic in areas such as machine learning, pattern recognition, and data mining. This work focuses on incomplete data whose missing values are not missing at random.

In fact, real-world data in general have missing values. For example, people may fail to respond to questionnaire items such as salary, there may be no registered history or record of changes in the data, equipment may malfunction, and so on. Taking the questionnaire as an example, when conducting a survey, low-income people are often reluctant to state their salary; that is to say, all the missing values are less than the maximal value of the same attribute. In this work we focus on this type of missing data, namely missing data on continuous variables whose missing values are assumed not to exceed the maximal values of the attributes. Our motivation is to introduce an algorithm that is able to estimate missing values through a KNN imputation method.

3.1.2 Imputation Methods

Imputation methods (Wilson and Stefan 2015) involve replacing missing values with estimated ones based on information available in the data set. There are many options, varying from naive methods such as mean imputation to more robust methods based on relationships among attributes. This section surveys some widely used imputation methods, although other forms of imputation are available.

A. Case substitution

This method is typically used in sample surveys. An instance with missing data (for example, a person who cannot be contacted) is replaced by another, non-sampled instance;
B. Mean and mode

This is one of the most frequently used methods. It consists of replacing the missing data for a given attribute by the mean (for a quantitative attribute) or mode (for a qualitative attribute) of all known values of that attribute;

C. Hot deck and cold deck

In the hot deck method, a missing attribute value is filled in with a value from an estimated distribution for the missing value derived from the current data. Hot deck is typically implemented in two stages. In the first stage, the data are partitioned into clusters. In the second stage, each instance with missing data is associated with one cluster, and the complete cases in that cluster are used to fill in the missing values. This can be done by calculating the mean or mode of the attribute within the cluster. Cold deck imputation is similar to hot deck, but the data source must be other than the current data source;

D. Prediction model

Prediction models are sophisticated procedures for handling missing data. These methods consist of creating a predictive model to estimate the values that will substitute for the missing data. The attribute with missing data is used as the class attribute, and the remaining attributes are used as input for the predictive model. An important argument in favor of this approach is that attributes frequently have relationships (correlations) among themselves, and those correlations can be used to create a predictive model for classification or regression (depending on whether the attribute with missing data is nominal or continuous, respectively).

Some of these relationships among the attributes may be maintained if they are captured by the predictive model. An important drawback of this approach is that the model-estimated values are usually better behaved than the true values would be; since the missing values are predicted from a set of attributes, the predicted values are likely to be more consistent with this set of attributes than the true (unknown) values would be. A second drawback is the requirement of correlation among the attributes. If there are no relationships between one or more attributes in the data set and the attribute with missing data, then the model will not be precise in estimating the missing values.

3.1.3 Imputation with k-Nearest Neighbor:

The KNN method (Malarvizhi and Antony 2012) is popular due to its simplicity and proven effectiveness in many missing value imputation problems. For a missing value, the method seeks its K nearest variables or subjects and imputes by a weighted average of the observed values of the identified neighbors. We adopted the weight choice from the LSimpute method used for microarray missing value imputation. LSimpute is an extension of KNN which utilizes correlations between both genes and arrays, and the missing values are imputed by a weighted average of the gene-based and array-based estimates. Specifically, the weight for the k-th neighbor of a missing variable or subject is given by

w_k = \left( \frac{r_k^2}{1 - r_k^2 + \varepsilon} \right)^2        (1)

where r_k is the correlation between the k-th neighbor and the missing variable or subject and \varepsilon = 10^{-6}. As a result, this algorithm gives more weight to closer neighbors.
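As an illustration, the sketch below applies this correlation-based weighting to impute one column of a small NumPy matrix. The helper name knn_impute_column, the column-per-variable layout, and the per-neighbor linear estimates are assumptions made for the example, not the exact procedure of the study.

```python
# Minimal sketch of correlation-weighted KNN imputation (illustrative only).
# Assumes at least one complete neighbour column exists.
import numpy as np

def knn_impute_column(X, col, k=5, eps=1e-6):
    """Fill NaNs in column `col` of X using its k most correlated complete columns."""
    X = X.astype(float).copy()
    target = X[:, col]
    missing = np.isnan(target)
    obs = ~missing
    # Candidate neighbours: other columns with no missing values.
    candidates = [j for j in range(X.shape[1])
                  if j != col and not np.isnan(X[:, j]).any()]
    corr = {j: np.corrcoef(target[obs], X[obs, j])[0, 1] for j in candidates}
    neighbours = sorted(corr, key=lambda j: abs(corr[j]), reverse=True)[:k]
    # LSimpute-style weights, Eq. (1): w_k = (r_k^2 / (1 - r_k^2 + eps))^2.
    weights = np.array([(corr[j] ** 2 / (1 - corr[j] ** 2 + eps)) ** 2
                        for j in neighbours])
    # One linear estimate per neighbour (assumed least-squares fit on observed rows),
    # combined as a weighted average of the neighbour-based estimates.
    estimates = [np.polyval(np.polyfit(X[obs, j], target[obs], 1), X[missing, j])
                 for j in neighbours]
    X[missing, col] = np.average(np.vstack(estimates), axis=0, weights=weights)
    return X
```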

The advantages of KNN imputation are:

(i) k-nearest neighbor can predict both qualitative attributes (the most frequent value among the k nearest neighbors) and quantitative attributes (the mean among the k nearest neighbors).
(ii) It does not require building a predictive model for each attribute with missing data; in fact, the k-nearest neighbor algorithm does not create explicit models.
(iii) It can easily treat instances with multiple missing values.
(iv) It takes into consideration the correlation structure of the data.

3.1.4 Feature Weight computation:

Feature weighting, which calculates feature (term) values in documents, is one of the important techniques in clustering.

3.1.4.1 Feature Weight Methods

Four methods are included in this study, each of which uses tf to capture a feature's capacity to describe the document contents. The methods differ in how they measure a feature's capacity to discriminate between similar documents via various statistical functions. According to the functions used, the four feature weight methods are tf*idf, tf*CRF, tf*OddsRatio, and tf*CHI. OddsRatio and CHI are excellent methods for feature selection.

A. tf
Before describing the feature weight methods, we give the definition of tf. Let freq_{ij} be the number of times feature f_i is mentioned in the text of document d_j. Then the tf of feature f_i in document d_j is given by

tf(f_i, d_j) = \frac{freq_{ij}}{\max_k freq_{kj}}        (2)

The maximum is computed over all features k that are mentioned in document d_j. For brevity, tf(f_i, d_j) is also written as tf_{ij}.

B. Tf/idf

Tf/idf, which originated in information retrieval, is the best-known feature weight scheme in clustering. This method uses idf to measure a feature's ability to discriminate between similar documents. The motivation for idf is that features which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one. Let N be the total number of documents and n_i be the number of documents in which feature f_i appears. Then idf_i, the inverse document frequency for f_i, is given by

idf_i = \log \frac{N}{n_i}        (3)

According to the tf/idf scheme, the value of feature f_i in the vector of document d_j is given by

v_{ij} = tf_{ij} \cdot idf_i = \frac{freq_{ij}}{\max_k freq_{kj}} \cdot \log \frac{N}{n_i}        (4)
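To make Eqs. (2)-(4) concrete, the short sketch below computes v_{ij} for a toy tokenized corpus; the function name and the toy documents are purely illustrative.

```python
# tf and tf*idf weights following Eqs. (2)-(4) (illustrative sketch).
import math
from collections import Counter

def tf_idf_weights(docs):
    """Return one {feature: v_ij} dict per tokenized document."""
    N = len(docs)
    doc_freq = Counter()                     # n_i: documents containing feature f_i
    for doc in docs:
        doc_freq.update(set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)                  # freq_ij
        max_freq = max(freq.values())        # max_k freq_kj
        weights.append({t: (c / max_freq) * math.log(N / doc_freq[t])   # Eq. (4)
                        for t, c in freq.items()})
    return weights

corpus = [["football", "goal", "league"],
          ["stocks", "market", "league"],
          ["football", "market", "goal"]]
print(tf_idf_weights(corpus))
```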

Tf/idf is the simplest technique for feature weighting. It easily scales to very large corpora, with a computational complexity approximately linear in the number of features and training documents. However, idf is a global measure and ignores the fact that features may have different discriminating powers for different document topics. For example, football is a most valuable term in sports news, while it has little value for indicating financial news; according to idf, whether football appears in sports news or not, its idf value is the same. The following sections discuss methods that calculate a feature's ability to discriminate between similar documents in terms of document categories.

C. Tf/CRF

CRF (Category Relevance Factor) stands for the discriminating power of features with respect to categories (such as document topics). Let C = {c_1, ..., c_m} be the set of predefined categories and F = {f_1, ..., f_n} be the feature set. Let DOC = \bigcup_i D_i be the set of documents, where D_i is the set of documents belonging to category c_i. The category relevance factor CRF of f_i and c_j is given by

CRF(f_i, c_j) = \log \frac{X / Y}{U / V}        (5)

where X is the number of documents that contain feature f_i and belong to category c_j, Y is the number of documents that belong to category c_j, U is the number of documents that contain feature f_i and do not belong to category c_j, and V is the number of documents that do not belong to category c_j. For a document d in D_j, let the feature vector V of d be (v_1, v_2, ..., v_n), where v_i is the value of feature f_i. Then, in terms of the tf/CRF scheme, v_i is given by

v_i = tf(f_i, d) \cdot CRF(f_i, c_j)        (6)

D. Tf/OddsRatio

OddsRatio is commonly used in information retrieval, where the problem is to rank documents according to their relevance for the positive class using the occurrence of different words as features. It was first used as a feature selection method by Mladenic, who compared six feature scoring measures on real Web documents and found that OddsRatio showed the best performance. This suggests that OddsRatio is well suited to feature scoring and may be very suitable for feature weighting. Consider the two-way contingency table of a feature f_i and a category c_j, where A is the number of times f_i and c_j co-occur, B is the number of times f_i occurs without c_j, C is the number of times c_j occurs without f_i, and D is the number of times neither occurs. The OddsRatio between f_i and c_j is then defined as

OddsRatio(f_i, c_j) = \log \frac{p(f_i \mid c_j)\,(1 - p(f_i \mid \bar{c}_j))}{(1 - p(f_i \mid c_j))\,p(f_i \mid \bar{c}_j)}        (7)

where p(f_i \mid c_j) and p(f_i \mid \bar{c}_j) are the probabilities of f_i occurring in documents that do and do not belong to c_j, estimated from the contingency counts.

E. Tf/CHI
The CHI statistic measures the lack of independence between a feature and a category and can be compared to the χ² distribution with one degree of freedom to judge extremeness. Given a feature f_i and a category c_j, the CHI of f_i and c_j is given by

CHI(f_i, c_j) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}        (8)

where N is the total number of documents, A is the number of times f_i and c_j co-occur, B is the number of times f_i occurs without c_j, C is the number of times c_j occurs without f_i, and D is the number of times neither f_i nor c_j occurs. CHI_{ij} has a value of zero if f_i and c_j are independent; on the other hand, it reaches its maximal value of N if f_i and c_j either always co-occur or are always co-absent. The more correlated f_i and c_j are, the higher CHI_{ij} is, and vice versa. CHI is one of the most effective feature selection methods.
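For reference, the sketch below evaluates the three category-based statistics of Eqs. (5), (7) and (8) from the contingency counts A, B, C, D defined above (for CRF, the mapping X = A, Y = A + C, U = B, V = B + D is assumed). The function names are illustrative, and the inputs are assumed to be non-degenerate counts.

```python
# CRF, OddsRatio and CHI from contingency counts (illustrative sketch).
import math

def crf(A, B, C, D):
    # Eq. (5): log((X/Y) / (U/V)) with X = A, Y = A + C, U = B, V = B + D.
    return math.log((A / (A + C)) / (B / (B + D)))

def odds_ratio(A, B, C, D):
    # Eq. (7): odds of f_i inside c_j against its odds outside c_j.
    p_pos = A / (A + C)          # p(f_i | c_j)
    p_neg = B / (B + D)          # p(f_i | not c_j)
    return math.log((p_pos * (1 - p_neg)) / ((1 - p_pos) * p_neg))

def chi(A, B, C, D):
    # Eq. (8): N (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)).
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))
```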

Feature weighting techniques can improve the accuracy of clustering, and TF-IDF is the most commonly used feature weighting method. Several improved methods based on TF-IDF have been proposed, as discussed above, but no single method combines the advantages of all of them. Therefore, a feature weighting method based on a real-coded genetic algorithm (GA) is adopted in this approach, where the real-coded GA is used to calculate the feature weights.

3.1.4.2 Feature Weight Computation:

Weighting features is a relaxation of the assumption that each feature has the same importance with respect to the target concept. Assigning a proper weight to each feature is a process of estimating how important the feature is. There are a number of methods for feature weighting; among them, an information-theoretic filter method, one of the most widely used approaches, is applied here to assign weights to features. To calculate the weight of each feature, we first assume that when a certain feature value is observed, it gives a certain amount of information about the target feature. In this work, the amount of information that a certain feature value contains is defined as the discrepancy between the prior and posterior distributions of the target feature.

In this study, the Kullback-Leibler measure is adopted for computing the weight of each feature in the dataset. This measure has been widely used in many learning domains. The Kullback-Leibler measure (denoted KLm) (Tumminello et al 2007) for a feature value e_{ij} is defined as

KLm(R \mid e_{ij}) = \sum_r P(r \mid e_{ij}) \log \frac{P(r \mid e_{ij})}{P(r)}        (9)

where e_{ij} denotes the j-th value of the i-th feature in the dataset and r ranges over the values of the target feature R. The weight of a feature can be defined as the weighted average of the KL measures across the feature values. Therefore, the weight of feature i, denoted wt_{avg}(i), is defined as

wt_{avg}(i) = \sum_j \frac{|e_{ij}|}{N_D} \cdot KLm(R \mid e_{ij})        (10)

= \sum_j P(e_{ij}) \cdot KLm(R \mid e_{ij})        (11)

where |e_{ij}| represents the number of instances that have the value e_{ij}, N_D is the total number of training instances in the dataset, and P(e_{ij}) is the probability that feature i takes the value e_{ij}.
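A small sketch of Eqs. (9)-(11) for categorical features is given below; X is assumed to be a 2-D array of feature values and y the target feature R, both hypothetical inputs.

```python
# Kullback-Leibler feature weights following Eqs. (9)-(11) (illustrative sketch).
import numpy as np

def kl_feature_weights(X, y):
    """Return wt_avg(i) for each feature column of X with respect to target y."""
    n, d = X.shape
    classes, counts = np.unique(y, return_counts=True)
    prior = counts / n                                    # P(r)
    weights = np.zeros(d)
    for i in range(d):
        for value in np.unique(X[:, i]):
            mask = X[:, i] == value
            p_value = mask.mean()                         # P(e_ij) = |e_ij| / N_D
            post = np.array([(y[mask] == c).mean() for c in classes])  # P(r | e_ij)
            nz = post > 0                                 # skip zero posterior terms
            klm = np.sum(post[nz] * np.log(post[nz] / prior[nz]))      # Eq. (9)
            weights[i] += p_value * klm                   # Eqs. (10)-(11)
    return weights
```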

Based on the calculated feature weights, cluster formation is performed using the k-means++ algorithm, which is described in the following section.

3.1.5 k-means++ Algorithm:

The k-means++ algorithm (Bahmani et al 2012) is the clustering algorithm utilized in this approach for clustering the high-dimensional data. K-means is a numerical, unsupervised, non-deterministic, iterative method. It is simple and very fast, and in many practical applications it has proved to be a very effective way to produce good clustering results. The standard k-means algorithm (Na et al 2010) is effective in producing clusters for many practical applications, but its computational complexity is very high for high-dimensional data. Different k-means variants have been proposed for high-dimensional data, yet the accuracy of the resulting clusters still depends heavily on the random choice of initial centroids.

If the initial partitions are not chosen carefully, the computation runs the risk of converging to a local minimum rather than the global minimum. The initialization step is therefore very important. To combat this problem, it may be a good idea to run the algorithm several times with different initializations; if the results converge to the same partition, it is likely that a global minimum has been reached. This, however, has the drawback of being very time consuming and computationally expensive.

In this work, the initial centers are determined using the k-means++ method before assigning data points to clusters. This approach improves accuracy and efficiency by reducing the dimensionality and initializing the clusters for the modified k-means.

Definitions

This section defines the k-means problem formally, as well as the k-means and k-means++ algorithms. For the k-means problem, given an integer k and a set of n data points X \subset R^d, k centers Cr are chosen to minimize the potential function

\phi = \sum_{x \in X} \min_{cr \in Cr} \lVert x - cr \rVert^2        (12)

From these centers, the clustering is defined by grouping the data points according to which center each point is assigned to. The basic k-means procedure is a simple and fast algorithm for this problem, although it offers no approximation guarantees at all. The procedure is:

1. Arbitrarily choose k initial centers Cr = {cr_1, cr_2, ..., cr_k}.
2. For each l \in {1, ..., k}, set the cluster Cr_l to be the set of points in X that are closer to cr_l than to any other center cr_j, j \neq l.
3. For each l \in {1, ..., k}, set cr_l to be the center of mass of the points in Cr_l.
4. Repeat Steps 2 and 3 until Cr no longer changes.

Algorithm 1. k-means

The k-means++ algorithm specifies a particular way of choosing the centers for the k-means algorithm. In particular, let D(x) denote the shortest distance from a data point x to the closest center already chosen. Then the following seeding procedure, called k-means++, is used:

1. Select a single center cr_1, chosen uniformly at random from X.
2. Select a new center cr_i, choosing x \in X with probability \frac{D(x)^2}{\sum_{x \in X} D(x)^2}.
3. Repeat Step 2 until k centers have been chosen, then proceed as in the standard k-means algorithm.

Algorithm 2. k-means++
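The NumPy sketch below implements the D(x)^2 seeding of Algorithm 2 followed by a basic Lloyd iteration in the spirit of Algorithm 1; it is an illustrative implementation under the stated assumptions, not the exact code of the study (scikit-learn's KMeans with init='k-means++' provides an equivalent, production-ready seeding).

```python
# k-means++ seeding (Algorithm 2) plus a basic k-means loop (illustrative sketch).
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """Choose k initial centers with probability proportional to D(x)^2."""
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(len(X))]]                 # cr_1 uniformly at random
    for _ in range(k - 1):
        diff = X[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = (diff ** 2).sum(-1).min(axis=1)            # D(x)^2 to nearest chosen center
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centers)

def kmeans(X, k, iters=100, rng=None):
    """Lloyd iterations (Algorithm 1) started from k-means++ centers."""
    centers = kmeans_pp_init(X, k, rng)
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        # Recompute each center as the mean of its cluster; keep it if the cluster is empty.
        new_centers = np.array([X[labels == l].mean(axis=0) if np.any(labels == l)
                                else centers[l] for l in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```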

The k-means++ clustering described above provides efficient cluster results, which are further analyzed by computing the Silhouette index described in the following section.

Silhouette Index

Silhouette index (Rousseeuw 1987) refers to a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object lies within its cluster.

For a given cluster X_j (j = 1, ..., c), this method assigns to each sample of X_j a quality measure s(i) (i = 1, ..., m), known as the Silhouette width. The Silhouette width is a confidence indicator of the membership of the i-th sample in cluster X_j, and is defined as

s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}        (13)

where a(i) is the average distance between the i-th sample and all of the samples included in X_j, and b(i) is the minimum average distance between the i-th sample and all of the samples clustered in X_k (k = 1, ..., c; k \neq j). From this formula it follows that s(i) has a value between -1 and 1.
Thus, for a given cluster X_j, it is possible to calculate a cluster silhouette S_j, which characterizes the heterogeneity and isolation properties of the cluster; it is calculated as the average of the silhouette widths of all samples in X_j. Moreover, for any partition, a global silhouette value, or silhouette index, GS_u, can be used as an effective validity index for the partition:

GS_u = \frac{1}{c} \sum_{j=1}^{c} S_j        (14)

where S_j is the cluster silhouette value and s(i) (i = 1, ..., m) is the silhouette width. The maximum silhouette index value is taken to indicate the optimal partition.
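The sketch below evaluates s(i) from Eq. (13) and the global index GS_u from Eq. (14) for a labeled data matrix; the inputs and the convention s(i) = 0 for singleton clusters are assumptions of the example.

```python
# Silhouette width s(i) and global silhouette index GS_u, Eqs. (13)-(14) (sketch).
import numpy as np

def global_silhouette(X, labels):
    """Return GS_u for data matrix X and cluster assignment vector labels."""
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    clusters = np.unique(labels)
    n = len(X)
    s = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i]) & (np.arange(n) != i)
        if not same.any():                        # singleton cluster convention: s(i) = 0
            continue
        a = dists[i, same].mean()                 # a(i): mean distance within X_j
        b = min(dists[i, labels == c].mean()      # b(i): min mean distance to another X_k
                for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)                # Eq. (13)
    cluster_s = [s[labels == c].mean() for c in clusters]   # S_j
    return sum(cluster_s) / len(clusters)                    # Eq. (14)
```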

Dunn index:

The Dunn index (DI) is a metric for evaluating clustering algorithms (Michelin and J. Han, Morgan Kaufmann 2006). It belongs to the group of internal validity indices, together with the Davies-Bouldin and Silhouette indices, in which the result is based on the clustered data itself. The Dunn index identifies clusters which are well separated and compact; the goal is therefore to maximize the inter-cluster distance while minimizing the intra-cluster distance. The Dunn index for l clusters is defined as

DN_l = \min_{p = 1, \ldots, l} \left\{ \min_{q = p + 1, \ldots, l} \left( \frac{dis(c_p, c_q)}{\max_{m = 1, \ldots, l} dia(c_m)} \right) \right\}        (15)

where dis(c_p, c_q) = \min_{x \in c_p,\, y \in c_q} \lVert x - y \rVert is the dissimilarity between clusters c_p and c_q, and dia(C) = \max_{x, y \in C} \lVert x - y \rVert is the intra-cluster diameter of cluster C. A large Dunn index indicates that compact and well-separated clusters exist; therefore, the maximum is observed when the number of clusters equals the most probable number of clusters in the data set.
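Finally, a compact sketch of Eq. (15): the single-linkage inter-cluster distance divided by the largest cluster diameter, computed for a labeled data matrix (illustrative inputs, at least two clusters assumed).

```python
# Dunn index DN_l of Eq. (15) for a labeled data matrix (illustrative sketch).
import numpy as np

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    def pairwise(a, b):
        return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
    # dia(c_m): largest distance between two points of the same cluster.
    max_diameter = max(pairwise(c, c).max() for c in clusters)
    # dis(c_p, c_q): smallest distance between points of two different clusters.
    min_separation = min(pairwise(clusters[p], clusters[q]).min()
                         for p in range(len(clusters))
                         for q in range(p + 1, len(clusters)))
    return min_separation / max_diameter
```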

Summary:

This chapter describes the proposed methods and techniques for clustering high-dimensional data. First, the preprocessing procedure with the KNN imputation method for missing value estimation is discussed. Then the feature weight computation process with the real-coded genetic algorithm is explained, followed by the clustering process based on the k-means++ algorithm. Finally, the analysis of cluster quality by means of the Silhouette index and the Dunn index is described. The proposed approach provides efficient cluster results by reducing the overlap among the clusters.
