Investigating The Configurable Parameters of K-Means Unsupervised Learning

Research question: To what extent is the performance of the k-means clustering algorithm in unsupervised learning influenced by the initial placement algorithm, the number of features, and the number of clusters?
Contents

1 Introduction
2 Background Information
   2.1 Supervised and Unsupervised Learning
   2.2 K-Means Clustering
      2.2.1 Parameters
      2.2.2 First configurable parameter: initial placement
      2.2.3 Second configurable parameter: number of clusters
      2.2.4 Third configurable parameter: number of features
      2.2.5 Feature Scaling
      2.2.6 Dimensionality Reduction Through Principal Component Analysis (PCA)
3 Methodology
4 Experimental results
   4.1 Table of Synthetic Data Set Results
   4.2 Table of Wine Data Set Results
   4.3 Example of Programmed Outcome
   4.4 Graphical Presentation of Achieved Results
      4.4.1 Graphical presentation of synthetic data set results
      4.4.2 Graphical presentation of wine data set results
5 Data Analysis
6 Limitations
7 Conclusion
1 Introduction
As our world continues to advance technologically, machine learning has taken on a role of propelling the future. One fascinating machine learning technique is clustering (also known as cluster analysis). Clustering is a process of discovering patterns in unlabeled data (data that has not been tagged with identification [1]), and it aims to group individual objects based on their degree of similarity to one another [2]. Clustering can be applied to many aspects of the real world: from grouping customers based on their behavioral psychology to grouping different types of wine (a data set that will be explored in this investigation), clustering results are influential across disciplines.

Within clustering algorithms, there are many configurable parameters that affect overall performance. It is vital to understand the differences in performance when certain parameters are customized in order to maximize the effectiveness of the algorithm.

This paper seeks to evaluate performance, measured through silhouette score and the number of iterations, for three configurable properties of k-means clustering. Because k-means clustering serves to extract value from large unlabeled data sets, the results of this research have the potential to improve the efficiency of clustering applications. By understanding how certain parameters of k-means clustering can be configured to maximize effectiveness, industries such as business would benefit greatly from better client, product, and data clustering for their operations.
The following research question will be explored: To what extent is the performance of the k-means clustering algorithm in unsupervised learning influenced by the initial placement algorithm, the number of features, and the number of clusters? For this investigation, k-means clustering algorithms were programmed to group data from a synthetic data set and a public wine data set. For each data set, the initial placement, number of clusters, and number of features were altered at each rerun. Patterns were analyzed and performance was evaluated through calculated silhouette scores and the number of iterations needed to complete the process. This investigation will also determine whether the metric used, silhouette score, is a reliable determinant of accuracy for an unsupervised learning algorithm. Logical and mathematical explanations for the results obtained are discussed.
2 Background Information
2.1 Supervised and Unsupervised Learning

Machine learning algorithms have two main approaches: supervised and unsupervised learning. Supervised learning refers to working with labeled data sets to train and "supervise" algorithms in processing data. Since input and output data are labeled, a supervised learning model can easily measure accuracy. Classification and regression algorithms are the most common types trained by supervised learning.

Unsupervised learning discovers hidden patterns without the need for human interaction or labeled data sets. The main tasks associated with unsupervised learning are clustering, association, and dimensionality reduction. This paper will specifically explore the k-means clustering algorithm, a method of vector quantization that originally stems from signal processing.
2.2 K-Means Clustering

Given a set of observations, k-means aims to partition them into k clusters S = {S_1, ..., S_k} so as to minimize the within-cluster sum of squares, which is equivalent to minimizing the size-weighted variance of the clusters:

    \arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2 = \arg\min_{S} \sum_{i=1}^{k} \lvert S_i \rvert \, \mathrm{Var}(S_i)

where \mu_i is the mean (centroid) of the points in S_i. Each iteration alternates between an assignment (expectation) step, which attaches every point to its nearest centroid, and an update (maximization) step, which is computing the centroid of each cluster. Here is a rundown of how k-means operates:
1. Specify the number of clusters k.

2. Initialize the centroids by first shuffling the data set and then randomly selecting k data points as the initial centroids (without replacement).

3. Keep iterating until the stopping criteria are met. As k-means is an iterative process, it is crucial to understand when to stop the algorithm. Essentially, the three stopping criteria are when the centroids of newly formed clusters do not change, when points remain in the same cluster, and when the maximum number of iterations is reached [5].
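As a minimal illustration of these three steps, the sketch below writes the algorithm in NumPy. It is written for this explanation only: the actual experiments rely on scikit-learn's KMeans, and the function name, tolerance, and seed here are assumptions.

import numpy as np

def kmeans(X, k, max_iter=300, tol=1e-4, seed=0):
    """Plain k-means with random initialization (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for iteration in range(max_iter):
        # Assignment step: attach every point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its cluster
        # (empty clusters are simply kept in place for simplicity)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 3: stop once the centroids no longer move appreciably
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids, iteration + 1

Calling, for example, labels, centers, n_iter = kmeans(X, 4) returns the cluster assignments, the final centroids, and the number of iterations the run needed, which is the quantity this paper records via scikit-learn's n_iter_ attribute.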
2.2.1 Parameters
In this paper, the effects of three different parameters on the performance of k-means
processes will be investigated across two data sets - a synthetic data set and a real data
set containing the chemical properties of certain wine types.
2.2.2 First configurable parameter: initial placement

The first altered parameter will be the initial placement, which will be set to either random or k-means++. This refers to the initial placement of the cluster centers in the k-means clustering process. A random initial placement means that the center-points of the clusters are randomly chosen. K-means++ is a biased random sampling that chooses centers farther apart from one another, avoiding close points; it aims to achieve the optimal clustering result in a smaller number of iterations. The first centroid chosen by k-means++ is random, and the next centroids are chosen as the data points with the largest squared distance from their nearest existing centroid.

The figure above demonstrates the use of k-means++ to determine the third centroid of a set of data points. The squared distance of each data point from its closest centroid (green or red) is calculated, and the blue data point is selected as the third centroid since it has the largest squared distance from its nearest centroid in Figure 2a.
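In scikit-learn's implementation, each subsequent centroid is actually drawn at random with probability proportional to that squared distance (so-called D-squared sampling), of which "pick the farthest point" is the deterministic simplification. Below is a small NumPy sketch of that selection rule; the function name and seed are illustrative assumptions, and in the experiments this choice is simply requested by passing init="k-means++" or init="random" to scikit-learn's KMeans.

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Choose k initial centroids with the k-means++ (D^2 sampling) rule."""
    rng = np.random.default_rng(seed)
    # The first centroid is a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid
        diffs = X[:, None, :] - np.asarray(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2,
        # so points far from all existing centroids are strongly favoured
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centroids)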
2.2.3 Second configurable parameter: number of clusters

The second configurable parameter in this investigation is the number of clusters. There is no limit to how many clusters can be formed in k-means clustering. We will be determining the optimum combination of the three configured parameters with the synthetic and wine data sets in this investigation.

2.2.4 Third configurable parameter: number of features

The final parameter to be configured in this investigation is the number of features. The number of features can vary greatly for real-world data sets. In the context of students at a school, features include nationality, gender, grades, household income, etc. This is an interesting area of exploration, since on the surface it may seem that more features make it easier to find similarities and establish clusters. However, more features could also introduce redundant or noisy information that makes meaningful clusters harder to form.
2.2.5 Feature Scaling

Feature scaling is an important step to take prior to processing data for many machine learning algorithms. Standardization rescales the features to reflect the properties of a standard normal distribution. This is vital in many algorithms, as they may behave badly if individual features do not resemble normally distributed data. For example, if an investigation aims to describe the physical attributes of people and the data provided includes their heights in centimeters and weights in pounds, a five-pound difference cannot directly be compared to a five-centimeter difference in height.

Features are standardized by removing the mean and scaling to unit variance. For example, the standard score of a sample x is calculated as z = (x − u)/s, where u is the mean of the training samples and s is the standard deviation of the training samples.
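This is exactly the transformation performed by scikit-learn's StandardScaler; the height and weight numbers below are made up purely for illustration.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Heights in centimeters and weights in pounds sit on very different scales
data = np.array([[170.0, 150.0],
                 [160.0, 120.0],
                 [180.0, 200.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)  # z = (x - u) / s, column by column

print(scaler.mean_)   # per-feature mean u
print(scaler.scale_)  # per-feature standard deviation s
print(scaled)         # each column now has mean 0 and unit variance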
2.2.6 Dimensionality Reduction Through Principal Component Analysis (PCA)

One instance where feature scaling is used is during Principal Component Analysis, or PCA. PCA is a dimensionality reduction method typically used to reduce the feature dimensionality of large data sets. This is done by transforming a data set with many variables into a smaller one with fewer variables that is still able to capture most of the information of the original data set. Simply put, the goal of dimensionality reduction methods such as PCA is to decrease the number of variables of a data set while preserving as much information as possible [7].

To better understand PCA, refer to the graph below, Figure 3. There are 10 principal components shown, meaning the original data set is 10-dimensional, having 10 features/variables. Principal components are essentially crafted as combinations of all ten variables. They are mixed in such a way that most of the information in the variables is compressed into the first few principal components (as represented by the highest percentages of explained variance).
One obvious issue with lowering the number of variables in a data set is that accuracy will be negatively affected. However, the intent of dimensionality reduction methods is to sacrifice a little accuracy for simplicity. This is because smaller data sets without extraneous variables are easier to investigate, making the visualization and analysis stages of machine learning algorithms much easier, faster, and more streamlined.
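To make the idea concrete, the short example below generates a random 10-feature data set driven by only three underlying signals (this data is invented for the illustration and is not one of the paper's data sets) and inspects how much variance each principal component explains through scikit-learn's PCA.

import numpy as np
from sklearn.decomposition import PCA

# Synthetic 10-feature data driven by only 3 underlying signals, so most of
# the variance should concentrate in the first few principal components
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

pca = PCA()  # keep all 10 components to inspect the full variance profile
pca.fit(X)

# Percentage of the total variance explained by each component, largest first
print(np.round(pca.explained_variance_ratio_ * 100, 2))

# Keeping, say, the first 3 components preserves most of the information
X_reduced = PCA(n_components=3).fit_transform(X)
print(X_reduced.shape)  # (200, 3)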
3 Methodology
Primary experimental data is the main source of data in this paper. Two data sets (a synthetic data set and a wine data set) were used to complete a k-means clustering process (code in the appendix, adapted from an example from Scikit-learn [9]). The number of iterations taken to run each configured program was recorded, and accuracy was measured by silhouette score. This investigation took an experimental approach because there was limited secondary data to answer the research question. The chosen approach allows the parameters to be controlled directly; however, given the specific procedure taken, the results of the experiment are technically limited to the scope of the procedure.
The hardware configuration used was an Apple MacBook Air (M1, 2020) with 16GB of memory. The software used was Python 3.9.0 with scikit-learn 1.1.1.

The synthetic data set used in this investigation is generated by the make_blobs Python function. This particular setting produces one distinct cluster and 3 clusters placed close together. Below is the source code that generates the synthetic data set.
X, y = make_blobs(
    n_samples=1000,
    n_features=20,
    centers=4,
    cluster_std=1,
    center_box=(-10.0, 10.0),
    shuffle=True,
    random_state=1,
)
The wine data set includes 3 classes, containing 59, 71, and 48 samples respectively [10]. Each row has 13 real-valued, positive features: Alcohol, Malic acid, Ash, Alkalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline.
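The same wine data is distributed with scikit-learn as load_wine; whether the author loaded it this way is an assumption, but it is a convenient way to reproduce the data set described above.

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine = load_wine()
print(wine.data.shape)      # (178, 13): 178 samples, 13 features
print(wine.target_names)    # the three wine classes (59, 71 and 48 samples)
print(wine.feature_names)   # alcohol, malic_acid, ash, alcalinity_of_ash, ...

# The class labels are never shown to k-means; they only serve as the
# external Ground Truth when judging the chosen number of clusters.
X_wine = StandardScaler().fit_transform(wine.data)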
The metric of accuracy used in this paper is the silhouette score, mainly used to evaluate the quality of the clusters created. The silhouette score is calculated at each data point and requires the mean distance between the observation point and all other data points in the same cluster, known as the mean intra-cluster distance and denoted a in the equation below. It also requires the mean distance between the observation point and all points in the nearest neighboring cluster, the mean nearest-cluster distance, denoted b:

    S = \frac{b - a}{\max(a, b)}

The range of silhouette scores is between -1 and 1. A score of 1 means that the cluster is itself dense and well-separated from other clusters. A value of 0 represents overlapping clusters, with samples extremely close to the boundary of neighboring clusters. A negative score indicates inaccuracy, suggesting that data points may have been assigned to the wrong cluster [11].
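Both quantities are available directly from scikit-learn. The sketch below reproduces the paper's synthetic-data settings with make_blobs, while the KMeans seed and n_init values are assumptions made for illustration.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=1000, n_features=20, centers=4, cluster_std=1,
                  center_box=(-10.0, 10.0), shuffle=True, random_state=1)

km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=10)
labels = km.fit_predict(X)

print(silhouette_score(X, labels))        # mean score over all samples, in [-1, 1]
print(silhouette_samples(X, labels)[:5])  # per-sample scores used for the plots
print(km.n_iter_)                         # iterations needed by the (best) run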
The silhouette score values for 2 and 3 clusters look relatively optimal: the score for each cluster is above the average silhouette score and there is minimal fluctuation in cluster size. Silhouette plots in which the clusters have uniform thicknesses are an indication of the optimal number of clusters. The top-right plot with 3 clusters has the most uniform thicknesses of all four plots; thus, the optimal number of clusters in the figure above is 3.
4 Experimental results

4.1 Table of Synthetic Data Set Results

The following table displays the experimental results of k-means clustering using the synthetic data set. Silhouette scores are displayed to four significant figures in order to maintain high accuracy, as differences between some values were rather minimal.

[Table: synthetic data set results]
4.2 Table of Wine Data Set Results

The following table displays the experimental results of k-means clustering using the wine data set.

[Table: wine data set results]
4.3 Example of Programmed Outcome

In order to visualize some results of the code, the following two figures show the charts produced by the program. Figure 9 depicts the synthetic data set results with k-means++ initial placement, 5 features, and 4 clusters. The results depicted are optimal, since all four clusters in the left chart are almost of equal size and all cross the average silhouette score threshold indicated by the dotted red line. The right chart is a display of the actual data points being formed into clusters, represented in the four colors that correspond with the left chart. It is vital to note that the right chart has x and y axes that represent two of the five total features, as it is not easy to create a five-featured visual representation of the clusters. However, by comparing two features, the user can still clearly note the distinction between clusters. Figure 10 is not optimal, with no consistently sized clusters and only two clusters crossing the average silhouette score threshold.
Figure 9: Synthetic data set results of k-means++ initialization, 5 features, and 4 clusters (figure generated by author)

Figure 10: Synthetic data set results of k-means++ initialization, 5 features, and 6 clusters (figure generated by author)
4.4 Graphical Presentation of Achieved Results

For ease of visualization, the data has been displayed in the bar charts below. In all bar charts, the left bar (blue) uses random initial placement, and the right bar (orange or grey) uses k-means++ initial placement.

The first row of the blue-orange charts displays the silhouette score of the data sets with 5 features (a) and 15 features (b). Each bar represents the results of a particular number of clusters (x-axis) on the accuracy (indicated by silhouette score on the y-axis) of the k-means clustering process. The second row is the same, except the y-axis is now replaced by the number of iterations. The final row (of blue-grey charts) displays the silhouette scores (a) and number of iterations (b) of a data set with a set number of clusters when altering the number of features (x-axis). The aforementioned row descriptions apply to the 14 charts in Sections 4.4.1 and 4.4.2. All figures are generated by the author.
4.4.1 Graphical presentation of synthetic data set results

Figure 11: Silhouette scores of the synthetic data set for varying numbers of clusters: (a) with 5 features, (b) with 15 features

Figure 12: Number of iterations of the synthetic data set for varying numbers of clusters: (a) with 5 features, (b) with 15 features

Figure 13: Altering the number of features of the synthetic data set with 4 clusters: (a) silhouette score, (b) number of iterations
4.4.2 Graphical presentation of wine data set results

Figure 14: Silhouette scores of the wine data set for varying numbers of clusters: (a) with 5 features, (b) with 15 features

Figure 15: Number of iterations of the wine data set for varying numbers of clusters: (a) with 5 features, (b) with 15 features

Figure 16: Altering the number of features of the wine data set with 3 clusters: (a) silhouette score, (b) number of iterations
Figure 17: Wine data set results for 3 clusters after applying PCA: (a) silhouette score, (b) number of iterations
5 Data Analysis
5.1 Analyzing Number of Clusters Using Silhouette Score

First, this investigation has demonstrated that the silhouette score is a reliable indicator for choosing the optimal number of clusters for both synthetic data and real data. In supervised learning, there is a Ground Truth that the algorithms are aware of, meaning accuracy can be measured directly. An unsupervised algorithm, by contrast, learns as it runs with no Ground Truth to compare to, so accuracy is much harder to measure. For the sake of testing, it was already known that the optimal number of clusters for the synthetic data set was 4. The results shown in Figure 11 portray this, with a peak in silhouette score at 4 clusters for both random and k-means++ initial placements, whether the data was collected for 5 features or 15 features of the synthetic data set. This means that the silhouette score consistently classified 4 clusters as the optimal amount. Although the algorithm is not aware that this is the correct answer, it was already known externally, so this serves as evidence supporting the strength of the silhouette score as an indicator of accuracy.
For the real data collected with the wine data set, it was externally known that the
optimal number of clusters should be 3. To see whether the silhouette score was able to
capture this optimal cluster value, refer to Figure 14. Although Figure 14a of the wine
data set with 5 features shows a clear peak in silhouette score at 3 clusters, Figure 14b with 15 features shows that the best silhouette score is achieved with 2 clusters. This is a solid example of how real data does not behave like synthetic data, where the optimal number of clusters was the same across all numbers of features used; instead, the result depends on the features supplied.
As seen across all results, the initial placement (random or k-means++) makes a negligible difference to the silhouette score, but causes varying results for the number of iterations. This is especially true for a larger number of clusters when referring to Figure 12: the number of iterations stays relatively similar regardless of initial placement on both graphs of the synthetic data set with 1-3 clusters, but 4 and 5 clusters see a drastic difference between the left bar representing random initial placement and the right bar representing k-means++ initial placement. However, this pattern is not reflected when looking at data comparing the number of iterations of the synthetic data set with 4 clusters and varying features, as shown in Figure 13b. Instead, a difference is seen between initial placements for the middle two numbers of features (10 and 15), with 5 and 20 features achieving the same number of iterations for both initial placements, represented by the blue and grey bars. For the wine data set results, as seen in Figures 15 and 16, the initial placements give distinctly different results across most of the tested configurations.

Regardless of what pattern is seen in the differing results between initialization methods, all data collected across both the synthetic and wine data sets reflected that the silhouette score did not differ between experiments using different initial placements, but the number of iterations always varied at some point.
Since the initialization method does cause varying results for the number of iterations, its impact can be summarized as reducing the number of iterations: across nearly all data, k-means++ generally outperforms random initialization by requiring fewer iterations. This is especially true when the number of clusters is close to the Ground Truth known by the external experimenter. This saves computing power and a significant amount of time, especially for larger data sets.
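This effect can be probed directly through the n_iter_ attribute. The seeds and n_init below are illustrative assumptions and the exact iteration counts will vary, but the comparison mirrors the one reported here.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, n_features=20, centers=4, cluster_std=1,
                  center_box=(-10.0, 10.0), shuffle=True, random_state=1)

for init in ("random", "k-means++"):
    # n_iter_ reports the iteration count of the best of the n_init runs
    km = KMeans(n_clusters=4, init=init, n_init=10, random_state=10).fit(X)
    print(init, "->", km.n_iter_, "iterations")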
First, it is notable that feature scaling is necessary in clustering, as different features are being compared; otherwise, the results have the potential to be severely skewed. In this study, the features were already on a common scale for the synthetic data, but had to be scaled for the wine data. The wine data set includes features such as the percentage of alcohol and the alkalinity of ash, which cannot be compared directly. Once the features were standardized, they could contribute to the distance calculations on an equal footing.
A very significant result is that the number of features has no direct relationship with finding the optimal number of clusters. It is quite a common assumption that adding more features (including more information) will aid clustering processes. However, this is not necessarily the case. In Figures 11a and 11b, a varying number of clusters for the synthetic data set with 5 features and 15 features was compared. As seen in the nearly identical results, varying the number of features between 5 and 15 did not create a significant impact in determining the optimal number of clusters. There are slightly more visible differences between Figures 12a and 12b, comparing the synthetic data set with 5 features and 15 features, but the overall pattern is still similar. When referring to Figure 13, depicting the silhouette scores and number of iterations of the synthetic data set with a varying number of features, there is no specific pattern seen.
When looking at the wine data set, the effect of the number of features sometimes differs from that of the synthetic data set. As seen in Figures 14a and 14b, the wine data set with 5 features deemed 3 clusters optimal, but the data set with 15 features deemed 2 clusters optimal. As previously mentioned, it was already known that the true optimal number of clusters was 3 for the wine data set, which means that Figure 14a with 5 features was more effective at determining the true number of clusters than Figure 14b with 15 features. When comparing Figure 15a to 15b and Figure 16a to 16b, there was no specific pattern seen when altering just the number of features with a set number of clusters.

Therefore, simply adding more features does not necessarily improve the accuracy of the clusters generated by k-means clustering.
Dimensionality reduction using PCA was implemented to reduce the feature dimensionality of the wine data set. The results are shown in Figure 17: the highest silhouette score belongs to 2 clusters, which also has the lowest number of iterations when compared with 3 and 4 clusters. This experimental value of the optimal number of clusters matches the Ground Truth. Thus, dimensionality reduction is very helpful in k-means clustering, as this result was not consistently shown in the wine data set prior to dimensionality reduction.
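The paper does not state how many principal components were retained after PCA; assuming two components purely for illustration, the scaling, PCA, and k-means procedure described here can be sketched with a scikit-learn pipeline:

from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = load_wine().data

for n_clusters in (2, 3, 4):
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=2),  # assumed number of retained components
        KMeans(n_clusters=n_clusters, init="k-means++", n_init=10, random_state=10),
    )
    labels = model.fit_predict(X)
    X_reduced = model[:-1].transform(X)  # the scaled, PCA-reduced features
    print(n_clusters,
          round(silhouette_score(X_reduced, labels), 4),
          model[-1].n_iter_)

Computing the silhouette on the reduced features is one reasonable choice; the paper does not specify whether its post-PCA scores were computed before or after the projection.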
6 Limitations
The first limitation of this study is that only a certain range of cluster counts and feature counts was considered. Hence, these results can only be considered a local optimum, and cannot be generalized to a global optimum. Future research needs to be done in order to confirm or deny whether the local results hold globally.

This investigation only had a Ground Truth to compare experimental results against because the data sets were set up externally for experimental purposes. However, since k-means clustering is unsupervised, in general no one knows what the true number of clusters is, so multiple values must be tested (this experiment tested five cluster counts, from 2 to 6). Future research could explore a machine learning approach to deciding how many cluster counts should be tested in order to find the optimal number of clusters to use.
7 Conclusion
In this paper, the effects of changing the number of clusters, the number of features, and the initialization method of k-means clustering were analyzed. Logical and mathematical explanations for the patterns observed were also provided.

The results support silhouette score as a reliable indicator of accuracy, as there was a Ground Truth to compare the experimental results against. However, when k-means clustering is applied to genuinely unlabeled data, no such Ground Truth exists. Since it is therefore unknown in advance how many clusters should be tested, researchers currently need to test multiple cluster counts experimentally (as in this paper) to find the optimum. The number of cluster counts to test, and which ones, can be estimated based on the application of the algorithm. If the ultimate goal is to cluster students into different socioeconomic groups in a high school, it is reasonable to deduce from logical reasoning that the optimal number of clusters lies between 3 and 5, so a researcher should test the cluster counts within and around this range (i.e., test 2-6 clusters).

It was found that altering the initialization method had little effect on the silhouette scores, but using k-means++ generally improved computational running speed, with a lower number of iterations needed to determine the optimal number of clusters.

The effect of altering the number of features is less predictable, as it followed no clear relationship for either data set. However, when the features underwent dimensionality reduction using Principal Component Analysis, accuracy and speed improved, with a higher silhouette score and a lower number of iterations.

Hopefully, this paper will prove useful in guiding practitioners' choices as they utilize k-means clustering, leading to more effective training of unsupervised learning algorithms across all fields of study.
References

[1] Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, 3rd edition, 2011.

[2] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2nd edition, 2009.

[3] Online resource. [Accessed 23 June 2022].

[4] Sanjoy Dasgupta and Yoav Freund. Random projection trees for vector quantization. IEEE Transactions on Information Theory, 55(7):3229-3242, 2009.

[5] Online resource. [Accessed 21 June 2022].

[6] Jake VanderPlas. Python Data Science Handbook. O'Reilly Media, Inc., 2016.

[10] Michele Forina, Riccardo Leardi, C. Armanino, and Sergio Lanteri. PARVUS: An Extendable Package of Programs for Data Exploration. January 1998.
Appendix
The following program was used for this investigation. Different test trials of the k-means clustering algorithm were run, with silhouette score and number of iterations collected as results. Some insight needed to write this code was drawn from Scikit-learn [9].
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.cm as cm
import matplotlib.pyplot as plt
import numpy as np
import time

# Generating the sample data from make_blobs
X, y = make_blobs(
    n_samples=1000,
    n_features=20,
    centers=4,
    cluster_std=1,
    center_box=(-10.0, 10.0),
    shuffle=True,
    random_state=1,
)  # For reproducibility

# for wine

range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The silhouette coefficient can range from -1 to 1
    ax1.set_xlim([-0.1, 1])
    # Insert blank space between the silhouette plots of individual clusters
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters and a seed for reproducibility
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value over all samples
    silhouette_avg = silhouette_score(X, cluster_labels)
    print(
        "For n_clusters = ",
        n_clusters,
        " The average silhouette_score is : ",
        silhouette_avg,
        " The number of iterations is : ",
        # t_batch,
        clusterer.n_iter_,
    )

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(
            np.arange(y_lower, y_upper),
            0,
            ith_cluster_silhouette_values,
            facecolor=color,
            edgecolor=color,
            alpha=0.7,
        )

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for the next plot
        y_lower = y_upper + 10

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the y-axis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd plot showing the actual clusters formed (first two features only)
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker=".", s=30, lw=0, alpha=0.7, c=colors, edgecolor="k")

    # Labeling the clusters: draw white circles at the cluster centers
    centers = clusterer.cluster_centers_
    ax2.scatter(
        centers[:, 0],
        centers[:, 1],
        marker="o",
        c="white",
        alpha=1,
        s=200,
        edgecolor="k",
    )
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker="$%d$" % i, alpha=1, s=50, edgecolor="k")

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(
        "Silhouette analysis for KMeans clustering on sample data with n_clusters = %d"
        % n_clusters,
        fontsize=14,
        fontweight="bold",
    )

plt.show()