Chapter 3
Clustering high dimensional data based on K-means++ clustering with feature weighting
High-dimensional data arise naturally in many domains and have regularly presented a
great challenge for traditional data mining techniques, both in terms of effectiveness and
efficiency. Clustering is the process of dividing data into groups according to similarity or
dissimilarity criteria between data points. Clustering algorithms are not only used for
classification but also for data compression, feature weighting, and data reduction.
The most commonly preferred clustering methods, in terms of frequency of use, are k-means
clustering, fuzzy c-means clustering, mountain clustering, and subtractive clustering. In this
study, a data weighting process is carried out together with the k-means++ clustering (KMC)
algorithm, which is the most widely preferred in the literature. A novel high dimensional data
clustering approach based on K-means++ clustering with feature weighting (KMCFW) is
proposed. Feature weighting and Silhouette index techniques can improve the quality of the
resulting clusters.
The flow diagram of the proposed approach is shown in Figure 1. Initially, the high dimensional
data are preprocessed to handle missing or null values. Then a feature weight is computed for
each dimension of the dataset. Based on the feature weights, the data are clustered using the
k-means++ algorithm. Finally, cluster validity indexes, namely the Silhouette index and the Dunn
index, are computed to measure the inter-cluster distances and the degree of overlap among the
clusters.
Figure 1. Flow diagram of the proposed approach

As shown in Figure 1, the proposed approach consists of four phases: data preprocessing, feature
weight computation, k-means++ clustering, and Silhouette and Dunn index computation. The
following section describes the data preprocessing phase in detail.

Data Preprocessing
Data preparation (Tidke et al 2012), such as missing data imputation and noise control, is a
key step in data mining and knowledge discovery. It is especially crucial in data mining
applications, as a minor improvement in data quality may lead to higher effectiveness and
significantly increase the validity and quality of the discovered knowledge. It is reported that
data preparation takes approximately 80% of the total data engineering effort. More effort may
be required when there are missing data in the data set, especially when the data are not
missing at random (NMAR). Therefore, data preparation is a crucial research topic in areas such
as machine learning, pattern recognition, and data mining. This chapter focuses on missing data
imputation.
In fact, real-world data in general have missing values. For example, people may fail to
respond to questionnaire items such as salary, there may be no registered history or record of
changes to the data, or equipment may malfunction. Taking the questionnaire as an example,
when conducting a survey, low income people are often reluctant to fill in their salary; in that
case, all the missing values are less than the maximal value of the same attribute. This work
focuses on this type of missing data, namely missing data on continuous variables whose missing
values are assumed not to exceed the maximal values of the attributes. Our motivation is to
adopt an imputation algorithm that handles this type of missing data appropriately.
Imputation methods (Wilson and Stefan 2015) involve replacing missing values
with estimated ones based on information available in the data set. There are many options,
varying from naive methods such as mean imputation to more robust methods based on
relationships among attributes. This section surveys some widely used imputation methods.
A. Case substitution
This method is typically used in sample surveys. One instance with missing data (for
example, a person that cannot be contacted) is replaced by another, non-sampled instance.
B. Mean and mode
This is one of the most frequently used methods. It consists of replacing the missing
data for a given attribute by the mean (for a quantitative attribute) or the mode (for a qualitative
attribute) of all known values of that attribute; a small illustration follows.
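As a small illustration of mean and mode imputation, the following sketch uses pandas on a made-up table; the column names and values are hypothetical and not taken from the datasets used in this work.

import pandas as pd

# Hypothetical example data: 'salary' is quantitative, 'gender' is qualitative.
df = pd.DataFrame({
    "salary": [32000, 45000, None, 51000, None],
    "gender": ["F", "M", "M", None, "F"],
})

# Quantitative attribute: replace missing values by the attribute mean.
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Qualitative attribute: replace missing values by the attribute mode.
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

print(df)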
C. Hot deck and cold deck
In the hot deck method, a missing attribute value is filled in with a value drawn from an
estimated distribution for the missing value based on the current data. Hot deck is typically
implemented in two stages. In the first stage, the data are partitioned into clusters. In the
second stage, each instance with missing data is associated with one cluster, and the complete
cases in that cluster are used to fill in the missing values, for example by taking the mean or
mode of the attribute within the cluster. Cold deck imputation is similar to hot deck, but the
values are drawn from a source other than the current data set.
D. Prediction model
Prediction models are sophisticated procedures for handling missing data. These
methods consist of creating a predictive model to estimate the values that will substitute for the
missing data. The attribute with missing data is used as the class attribute, and the remaining
attributes are used as input for the predictive model. An important argument in favor of this
approach is that attributes frequently have relationships (correlations) among themselves; these
correlations can be used to create a predictive model for classification or regression (depending
on whether the attribute with missing data is nominal or continuous, respectively).
Some of these relationships among the attributes may be preserved if they are
captured by the predictive model. An important drawback of this approach is that the model-
estimated values are usually better behaved than the true values would be: since the missing
values are predicted from a set of attributes, the predicted values are likely to be more
consistent with this set of attributes than the true (unknown) values would be. A second
drawback is the requirement for correlation among the attributes. If there are no relationships
between the attribute with missing data and the other attributes in the data set, then the model
will not be precise enough to estimate the missing values.
The KNN method (Malarvizhi and Antony 2012) is popular due to its simplicity and
proven effectiveness in many missing value imputation problems. For a missing value, the
method seeks its K nearest variables or subjects and imputes the value by a weighted average of
the observed values of the identified neighbors. The weight choice is adopted from the LSimpute
method used for microarray missing value imputation. LSimpute is an extension of KNN that
utilizes correlations between both genes and arrays, and the missing values are imputed by a
weighted average of the gene-based and array-based estimates. Specifically, the weight for the
kth neighbor is given by

w_k = \left( \frac{r_k^2}{1 - r_k^2 + \varepsilon} \right)^2    (1)

where r_k is the correlation between the kth neighbor and the variable or subject with the
missing value, and \varepsilon = 10^{-6}. As a result, this algorithm gives more weight to closer
neighbors. A minimal sketch of this weighted imputation is given after the list below. The
k-nearest neighbor approach has several attractive properties:
(i) k-nearest neighbor can predict both qualitative attributes (the most frequent value
among the k nearest neighbors) and quantitative attributes (the mean among the k
nearest neighbors).
(ii) It does not require building a predictive model for each attribute with missing
data; in fact, the k-nearest neighbor algorithm does not create explicit models.
(iii) It can easily treat instances with multiple missing values.
(iv) It takes into consideration the correlation structure of the data.
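The following is a minimal sketch of correlation-weighted KNN imputation in the spirit of equation (1). It treats columns as the variables whose correlations define the neighbors; the data matrix, the choice of K, and the use of column means to pre-fill values when computing correlations are illustrative assumptions rather than part of the proposed method.

import numpy as np

def knn_impute(X, K=3, eps=1e-6):
    """Impute missing entries (NaN) in each column using a weighted average of the
    K most correlated columns, with weights w_k = (r_k^2 / (1 - r_k^2 + eps))^2."""
    X = X.astype(float).copy()
    n, d = X.shape
    # Column means are used only to pre-fill values so that correlations can be computed.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(np.isnan(X), col_means, X)
    corr = np.corrcoef(filled, rowvar=False)          # column-to-column correlations

    for j in range(d):
        missing_rows = np.where(np.isnan(X[:, j]))[0]
        if missing_rows.size == 0:
            continue
        # K most correlated other columns (the neighbors) for column j.
        r = np.abs(corr[j])
        r[j] = -np.inf
        neighbors = np.argsort(r)[::-1][:K]
        w = (corr[j, neighbors] ** 2 / (1 - corr[j, neighbors] ** 2 + eps)) ** 2
        for i in missing_rows:
            # Weighted average of the neighbors' values in the same row.
            X[i, j] = np.average(filled[i, neighbors], weights=w)
    return X

# Illustrative usage on a small matrix with missing values.
data = np.array([[1.0, 1.1, 0.9],
                 [2.0, 2.1, 2.2],
                 [3.0, 2.9, 3.1],
                 [np.nan, 4.2, 3.8]])
print(knn_impute(data, K=2))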
Feature Weight Computation

Feature weighting, which calculates feature (term) values in documents, is one of the important
techniques in clustering. Four methods are included in this study, each of which uses tf to
express a feature's capacity for describing the document contents. The methods differ in how
they measure a feature's ability to discriminate between documents. According to the functions
used, the four feature weighting methods are tf*idf, tf*CRF, tf*OddsRatio, and tf*CHI; OddsRatio
and CHI are excellent methods for feature selection.
A. tf
Before describing the feature weighting methods, we give the definition of tf. Let freq_ij be the
number of times feature f_i is mentioned in the text of document d_j. Then the tf of feature f_i
in document d_j is given by

tf(f_i, d_j) = \frac{freq_{ij}}{\max_k freq_{kj}}    (2)

where the maximum is computed over all features mentioned in document d_j. For the sake of
brevity, tf(f_i, d_j) is also written as tf_ij.
B. Tf/idf
Tf/idf, which originated in information retrieval, is the best known feature weighting scheme in
clustering. This method uses idf to measure a feature's ability to discriminate between similar
documents. The motivation for idf is that features which appear in many documents are not very
useful for distinguishing a relevant document from a non-relevant one. Let N be the total number
of documents and n_i be the number of documents in which feature f_i appears. Then idf_i, the
inverse document frequency of f_i, is given by

idf_i = \log \frac{N}{n_i}    (3)

According to the tf/idf scheme, the value of feature f_i in the vector of document d_j is given by

v_{ij} = tf_{ij} \cdot idf_i = \frac{freq_{ij}}{\max_k freq_{kj}} \cdot \log \frac{N}{n_i}    (4)

Tf/idf is the simplest technique for feature weighting, and it easily scales to very large corpora,
with a computational cost roughly proportional to the number of documents. However, idf is a
global measure and ignores the fact that features may have different discriminating powers for
different document topics. For example, "football" is a highly valuable term in sports news but
has little value for identifying financial news, yet according to idf its value is the same whether
or not it appears in sports news. The following sections discuss methods that calculate a
feature's discriminating ability in terms of document categories.
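Before moving to the category-based schemes, the following sketch makes equations (2)-(4) concrete by computing tf and tf*idf weights for a tiny, invented term-frequency matrix; the counts are purely illustrative.

import numpy as np

# Rows = features (terms), columns = documents: freq[i, j] is the raw count of
# feature f_i in document d_j (illustrative counts only).
freq = np.array([[3, 0, 1],
                 [1, 2, 0],
                 [0, 1, 4]], dtype=float)

# Equation (2): tf_ij = freq_ij / max_k freq_kj (normalise by each document's
# most frequent feature).
tf = freq / freq.max(axis=0, keepdims=True)

# Equation (3): idf_i = log(N / n_i), where n_i is the number of documents
# containing feature f_i and N is the total number of documents.
N = freq.shape[1]
n_i = (freq > 0).sum(axis=1)
idf = np.log(N / n_i)

# Equation (4): v_ij = tf_ij * idf_i.
v = tf * idf[:, np.newaxis]
print(v)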
C. Tf/CRF
CRF (Category Relevance Factor) expresses the discriminating power of features with respect to
categories (such as document topics). Let C = {c_1, ..., c_m} be the set of predefined categories
and F = {f_1, ..., f_n} be the feature set. Let DOC = \bigcup_i D_i be the set of documents, where
D_i is the set of documents belonging to category c_i. The category relevance factor CRF of f_i
and c_j is given by

CRF(f_i, c_j) = \log \frac{X / Y}{U / V}    (5)

where X is the number of documents that contain feature f_i and belong to category c_j, Y is the
number of documents that belong to category c_j, U is the number of documents that contain
feature f_i and do not belong to category c_j, and V is the number of documents that do not
belong to category c_j. For a document d in D_j, let the feature vector of d be (v_1, v_2, ..., v_n),
where v_i is the value of feature f_i. Then, under the tf/CRF scheme, v_i is given by

v_i = tf(f_i, d) \cdot CRF(f_i, c_j)    (6)
D. Tf/OddsRatio
OddsRatio is commonly used in information retrieval, where the problem is to rank documents
according to their relevance to the positive class, using the occurrence of different words as
features. It was first used as a feature selection method by Mladenic, who compared six feature
scoring measures on real Web documents and found that OddsRatio showed the best performance.
This indicates that OddsRatio is well suited for feature scoring and may also be very suitable for
feature weighting. Consider the two-way contingency table of a feature f_i and a category c_j:
A is the number of times f_i and c_j co-occur, B is the number of times c_j occurs, C is the
number of times f_i occurs without c_j, and D is the number of times c_j does not occur. Writing
p(f_i | c_j) = A/B for the probability that f_i occurs in a document of category c_j and
p(f_i | \bar{c}_j) = C/D for the probability that f_i occurs in a document outside c_j, the
OddsRatio between f_i and c_j is defined as

OddsRatio(f_i, c_j) = \log \frac{p(f_i \mid c_j)\,(1 - p(f_i \mid \bar{c}_j))}{(1 - p(f_i \mid c_j))\,p(f_i \mid \bar{c}_j)}    (7)
E. Tf/CHI
The CHI statistic measures the lack of independence between a feature and a category and can be
compared to the \chi^2 distribution with one degree of freedom to judge extremeness. Given a
feature f_i and a category c_j, the CHI of f_i and c_j is given by

CHI(f_i, c_j) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}    (8)

where N is the total number of documents; A is the number of times f_i and c_j co-occur; B is the
number of times f_i occurs without c_j; C is the number of times c_j occurs without f_i; and D is
the number of times neither f_i nor c_j occurs. CHI_ij has a value of zero if f_i and c_j are
independent, and reaches its maximal value N if f_i and c_j either always co-occur or are always
jointly absent. The more correlated f_i and c_j are, the higher CHI_ij is, and vice versa. CHI is
one of the most effective feature selection methods.
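As an illustration of equation (8), the following sketch computes the CHI statistic directly from the A, B, C and D contingency counts of a feature and a category; the counts used in the example are invented.

def chi_statistic(A, B, C, D):
    """CHI(f_i, c_j) = N * (A*D - C*B)^2 / ((A+C)*(B+D)*(A+B)*(C+D)), where
    A = #docs with f_i and c_j, B = #docs with f_i but not c_j,
    C = #docs with c_j but not f_i, D = #docs with neither."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return 0.0 if denom == 0 else N * (A * D - C * B) ** 2 / denom

# Illustrative counts: the feature appears mostly inside the category.
print(chi_statistic(A=40, B=5, C=10, D=45))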
TF-IDF is the most generally used feature weighting method. Several improved methods based on
TF-IDF have been proposed, as discussed above, but no single method is able to combine the
advantages of each algorithm. Therefore, a feature weighting method based on a real-coded
genetic algorithm (GA) is adopted in this approach, in which the real-coded GA is used to
calculate the feature weights.
Weighting features is a relaxation of the assumption that each feature has the same
importance with respect to the target concept. Assigning a proper weight to each feature is a
process of estimating how important the feature is. There have been a number of methods for
feature weighting; among them, an information-theoretic filter method is used here for assigning
weights to features, information-theoretic measures being the most widely used in feature
weighting. In order to calculate the weight for each feature, we first assume that when a certain
feature value is observed, it provides a certain amount of information about the target feature.
In this work, the amount of information that a certain feature value contains is quantified by the
Kullback-Leibler measure described below.
In this study, the Kullback-Leibler (KL) measure is adopted for computing the weight of
each feature in the dataset. This measure has been widely used in many learning domains. The
weight of a feature can be defined as the weighted average of the KL measures across its feature
values. Therefore, the weight of feature i, denoted as wt_avg(i), is defined as

wt_{avg}(i) = \sum_{j} \frac{|e_{ij}|}{N_D} \cdot KLm(R_{e_{ij}})    (10)

            = \sum_{j} P(e_{ij}) \cdot KLm(R_{e_{ij}})    (11)

where |e_{ij}| represents the number of instances that have the value e_ij, N_D is the total number
of training instances in the dataset, and P(e_ij) is the probability that feature i takes the value
e_ij.
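The following is a minimal sketch of the weight computation in equations (10)-(11) for a single discrete feature. Since the definition of the KL measure itself is not reproduced above, the sketch assumes KLm(R_e_ij) to be the KL divergence between the class distribution conditioned on the feature value e_ij and the prior class distribution; the toy feature values and labels are invented.

import numpy as np
from collections import Counter

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as aligned arrays."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def feature_weight(feature_values, labels):
    """wt_avg(i) = sum_j P(e_ij) * KLm(R_e_ij), equations (10)-(11), with KLm taken
    here as KL(P(class | e_ij) || P(class))."""
    N_D = len(labels)
    classes = sorted(set(labels))
    prior = [labels.count(c) / N_D for c in classes]
    weight = 0.0
    for value, count in Counter(feature_values).items():
        # Class labels of the instances where the feature takes this value.
        rows = [l for v, l in zip(feature_values, labels) if v == value]
        cond = [rows.count(c) / len(rows) for c in classes]
        weight += (count / N_D) * kl_divergence(cond, prior)
    return weight

# Illustrative discrete feature and class labels.
feature = ["a", "a", "b", "b", "b", "c"]
target = [0, 0, 1, 1, 0, 1]
print(feature_weight(feature, target))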
Based on the calculated feature weights, the cluster formation is performed using the k-means++
clustering algorithm described in the following section.

K-means++ Clustering
The k-means++ algorithm (Bahmani et al 2012) is used in this approach for clustering the high
dimensional data. K-means is a numerical, unsupervised, non-deterministic, iterative method. It is
simple and very fast, and in many practical applications it has proved to be an effective way to
produce good clustering results. The standard k-means algorithm (Na, Liu and Guan 2010) is
effective in producing clusters for many practical applications, but its computational complexity
is very high for high dimensional data. Different variants of k-means have been proposed for high
dimensional data, but the accuracy of the k-means clusters depends heavily on the selection of the
initial centers.
If the initial partitions are not chosen carefully, the computation runs the risk of
converging to a local minimum rather than the global minimum. The initialization step is
therefore very important. To combat this problem, the algorithm can be run several times with
different initializations; if the results converge to the same partition, it is likely that a global
minimum has been reached. This, however, has the drawback of being very time consuming and
computationally expensive. In this work, the initial centers are determined using the k-means++
method before assigning the data points to clusters, which improves the accuracy and efficiency
of the k-means algorithm.
Definitions
This section defines the k-means problem formally, as well as the k-means and k-means++
algorithms. For the k-means problem, we are given an integer k and a set of n data points
X \subset \mathbb{R}^d. The goal is to choose k centers C so as to minimize the potential function

\phi = \sum_{x \in X} \min_{c \in C} \lVert x - c \rVert^2    (12)

From these centers, the clustering is defined by grouping the data points according to the center
to which each point is assigned. The basic k-means algorithm is a simple and fast procedure for
this problem, although it offers no approximation guarantees at all. The procedure is as follows:
1. Arbitrarily choose k initial centers C = {c_1, ..., c_k}.
2. For each l in {1, ..., k}, set the cluster C_l to be the set of points in X that are closer to c_l
   than to c_j for all j \neq l.
3. For each l in {1, ..., k}, set c_l to be the center of mass of all points in C_l.
4. Repeat steps 2 and 3 until C no longer changes.

Algorithm 1. K-means
The k-means++ algorithm prescribes a specific way of choosing the initial centers for the k-means
algorithm. In particular, let D(x) denote the shortest distance from a data point x to the closest
center already chosen. The first center is chosen uniformly at random from X, and each subsequent
center is chosen from X with probability proportional to D(x)^2; once k centers have been
selected, the iterations of Algorithm 1 proceed as usual. The k-means++ clustering provides an
efficient clustering result, which is further analyzed by computing the Silhouette index described
in the following section.
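As an illustration of the D(x)^2 seeding described above, the following sketch implements the k-means++ initialization step on synthetic data; the data generation and the seed value are illustrative assumptions (scikit-learn's KMeans also provides this seeding via init="k-means++").

import numpy as np

def kmeanspp_init(X, k, rng=None):
    """D^2-weighted seeding: pick the first center uniformly at random, then pick
    each next center with probability proportional to D(x)^2, where D(x) is the
    distance from x to the closest center chosen so far."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    for _ in range(1, k):
        # Squared distance of every point to its closest chosen center.
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)

# Illustrative usage: seed k = 3 centers for three synthetic point clouds.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ((0, 0), (3, 3), (0, 4))])
print(kmeanspp_init(X, k=3, rng=0))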
Silhouette Index
The Silhouette index (Rousseeuw 1987) refers to a method of interpretation and
validation of consistency within clusters of data. The technique provides a succinct graphical
representation of how well each object lies within its cluster. For a given cluster X_j
(j = 1, ..., c), this method assigns to each sample of X_j a quality measure s(i) (i = 1, ..., m),
known as the Silhouette width. The Silhouette width is a confidence indicator of the membership
of the ith sample in cluster X_j and is defined as

s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}    (13)

where a(i) is the average distance between the ith sample and all of the samples included in X_j,
and b(i) is the minimum average distance between the ith sample and all of the samples clustered
in X_k (k = 1, ..., c; k \neq j). From this formula it follows that s(i) takes a value between -1
and 1.
Thus, for a given cluster X_j, it is possible to calculate a cluster silhouette S_j, which
characterizes the heterogeneity and isolation properties of the cluster; it is calculated as the
average of the silhouette widths of all samples in X_j. Moreover, for any partition, a global
silhouette value, or silhouette index, GS_u, can be used as an effective validity index for the
partition:

GS_u = \frac{1}{c} \sum_{j=1}^{c} S_j    (14)

In this case, the partition with the maximum silhouette index value is taken as the optimal
partition.
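The following sketch computes the silhouette widths of equation (13), the cluster silhouettes S_j (taken here as the mean width per cluster) and the global index GS_u of equation (14) for a small, invented two-cluster example.

import numpy as np

def global_silhouette(X, labels):
    """Silhouette width s(i) (eq. 13) per sample, cluster silhouettes S_j as the
    mean width per cluster, and the global index GS_u (eq. 14) as their average."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        # a(i): average distance to the other members of the same cluster.
        a = dist[i, own & (np.arange(len(X)) != i)].mean() if own.sum() > 1 else 0.0
        # b(i): minimum average distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    S = [s[labels == c].mean() for c in clusters]   # cluster silhouettes S_j
    return sum(S) / len(S)                          # GS_u

# Illustrative usage on two small, well separated groups of points.
X = np.array([[0, 0], [0.2, 0.1], [0.1, 0.3], [5, 5], [5.1, 4.9], [4.8, 5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(global_silhouette(X, labels))   # close to 1 for well separated clusters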
Dunn index:
The Dunn index (DI) is a metric for evaluating clustering algorithms (Han and Kamber 2006). It
belongs to the group of validity indices that also includes the Davies-Bouldin index and the
Silhouette index, in that it is an internal evaluation scheme whose result is based on the
clustered data itself. The Dunn index identifies clusters which are well separated and compact;
the goal is therefore to maximize the inter-cluster distance while minimizing the intra-cluster
distance. The Dunn index for l clusters is defined as

DN_l = \min_{p=1,\dots,l} \left\{ \min_{q=p+1,\dots,l} \left( \frac{dis(c_p, c_q)}{\max_{m=1,\dots,l} dia(c_m)} \right) \right\}    (15)

where dis(c_p, c_q) = \min_{x \in c_p,\, y \in c_q} \lVert x - y \rVert is the dissimilarity between
clusters c_p and c_q, and dia(c_m) is the diameter of cluster c_m. If the Dunn index is large, it
means that compact and well separated clusters exist. Therefore, the maximum is observed for the
number of clusters equal to the most probable number of clusters in the data set.
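As an illustration of equation (15), the following sketch computes the Dunn index using single-linkage inter-cluster distances and the maximum pairwise distance within a cluster as its diameter (an assumption, since the diameter definition is not reproduced above); the toy data are invented.

import numpy as np

def dunn_index(X, labels):
    """Dunn index (eq. 15): minimum inter-cluster distance divided by the
    maximum cluster diameter."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = [X[labels == c] for c in np.unique(labels)]

    def pairwise_min(A, B):
        # Single-linkage distance between two clusters.
        return np.min(np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1))

    def diameter(A):
        # Maximum pairwise distance within a cluster.
        if len(A) < 2:
            return 0.0
        return np.max(np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1))

    min_inter = min(pairwise_min(clusters[p], clusters[q])
                    for p in range(len(clusters))
                    for q in range(p + 1, len(clusters)))
    max_diam = max(diameter(c) for c in clusters)
    return min_inter / max_diam

# Illustrative usage: larger values indicate compact, well separated clusters.
X = np.array([[0, 0], [0.2, 0.1], [0.1, 0.3], [5, 5], [5.1, 4.9], [4.8, 5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(dunn_index(X, labels))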
Summary:
This chapter has described the proposed methods and techniques for clustering high
dimensional data. Initially, the preprocessing procedure with the KNN imputation method for
missing value estimation was discussed. Then the feature weight computation process based on the
real-coded genetic algorithm was explained, followed by the clustering process based on the
k-means++ algorithm. Finally, the analysis of cluster quality by the Silhouette index and the
Dunn index was described. The proposed approach provides an efficient clustering result by
reducing the overlap among the clusters.