
7/1/24, 1:25 PM K-Means clustering

K-Means clustering
In [1]: import pandas as pd

In [2]: df=pd.read_csv("C:\\Users\\vaish\\Downloads\\SIC\\Mall_Customers.csv")

In [3]: df = df.drop(["CustomerID", "Gender", "Age"], axis=1)

In [4]: df

Out[4]:      Annual Income (k$)  Spending Score (1-100)
        0                    15                      39
        1                    15                      81
        2                    16                       6
        3                    16                      77
        4                    17                      40
        ..                  ...                     ...
        195                 120                      79
        196                 126                      28
        197                 126                      74
        198                 137                      18
        199                 137                      83

        [200 rows x 2 columns]

In [5]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 2 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Annual Income (k$)      200 non-null    int64
 1   Spending Score (1-100)  200 non-null    int64
dtypes: int64(2)
memory usage: 3.3 KB

In [6]: df.columns

Out[6]: Index(['Annual Income (k$)', 'Spending Score (1-100)'], dtype='object')

In [7]: df.isnull().sum()

Out[7]: Annual Income (k$)        0
        Spending Score (1-100)    0
        dtype: int64

In [8]: import matplotlib.pyplot as plt

localhost:8888/nbconvert/html/Downloads/SIC/K-Means clustering.ipynb?download=false 1/11


plt.scatter(df['Annual Income (k$)'], df['Spending Score (1-100)'])
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Annual Income vs Spending Score')
plt.show()
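Before clustering, it is worth remembering that K-Means relies on Euclidean distances, so features on very different scales can dominate the result. The two columns here happen to have comparable ranges, but standardising is a safe default. A minimal sketch (not part of the original notebook), using scikit-learn's `StandardScaler` on a small stand-in array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for df[['Annual Income (k$)', 'Spending Score (1-100)']].
X_raw = np.array([[15, 39], [15, 81], [16, 6], [16, 77], [17, 40]], dtype=float)

# StandardScaler gives each column zero mean and unit variance,
# so no single feature dominates the Euclidean distance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)

print(X_scaled.mean(axis=0))  # each column centred near 0
print(X_scaled.std(axis=0))   # each column has unit variance
```

The notebook below proceeds on the raw values, which is defensible here because both features span similar numeric ranges.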

In [9]: from sklearn.cluster import KMeans

In [ ]: vc = []
        for i in range(1, 11):
            km = KMeans(n_clusters=i)
            km.fit_predict(df)
            vc.append(km.inertia_)


C:\anaconda\Lib\site-packages\sklearn\cluster\_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(
C:\anaconda\Lib\site-packages\sklearn\cluster\_kmeans.py:1382: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=1.
  warnings.warn(
(the same pair of warnings is emitted once per fit in the loop; repeats omitted)
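Both warnings tell you exactly how to silence them: pass `n_init` explicitly to `KMeans`, and set `OMP_NUM_THREADS` before scikit-learn starts its thread pool. A small sketch (not in the original notebook) on a toy array; `random_state` is added here so the run is reproducible:

```python
import os
os.environ["OMP_NUM_THREADS"] = "1"  # avoids the MKL memory-leak warning on Windows

import numpy as np
from sklearn.cluster import KMeans

# Two obvious clusters of two points each.
X = np.array([[1, 2], [1, 4], [10, 2], [10, 4]], dtype=float)

# Passing n_init explicitly silences the FutureWarning; random_state
# makes the centroid initialisation (and hence inertia_) reproducible.
km = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = km.fit_predict(X)

print(sorted(np.unique(labels)))  # [0, 1]
print(km.inertia_)                # 4.0: each point is 1 unit from its centroid
```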

In [23]: vc

Out[23]: [269981.28,
          181363.59595959596,
          106348.37306211119,
          73679.78903948834,
          44448.45544793371,
          37239.835542456036,
          30241.343617936585,
          24995.969781135962,
          22119.99312141347,
          20065.21982804018]

In [12]: import matplotlib.pyplot as plt

plt.plot(range(1, 11), vc)


plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal Number of Clusters')
plt.show()
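Reading the elbow off the plot is a judgement call (here the bend sits around k = 5). A complementary numeric criterion, not used in this notebook, is the silhouette score: pick the k that maximises it. A sketch on synthetic blob data, where the true number of clusters is known:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: 3 well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 7):  # silhouette is undefined for k = 1
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 3 for these well-separated blobs
```

On the mall data the same loop can be run over `df` to cross-check the choice of 5 made below.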

In [13]: X = df.iloc[:,:].values
km = KMeans(n_clusters = 5)
y_means = km.fit_predict(X)

C:\anaconda\Lib\site-packages\sklearn\cluster\_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(
C:\anaconda\Lib\site-packages\sklearn\cluster\_kmeans.py:1382: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=1.
  warnings.warn(

In [14]: y_means

Out[14]: array([1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4,
1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 0,
1, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 2, 3, 0, 3, 2, 3, 2, 3,
0, 3, 2, 3, 2, 3, 2, 3, 2, 3, 0, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
2, 3])
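Rather than slicing out each cluster by eye, the label array above can be summarised in one call with `np.unique`. A small sketch on a stand-in label array (the real `y_means` has 200 entries):

```python
import numpy as np

# Small stand-in for the y_means label array produced by fit_predict above.
y_means = np.array([1, 4, 1, 4, 0, 0, 0, 2, 3, 3])

# return_counts=True gives the size of every cluster in one call.
labels, counts = np.unique(y_means, return_counts=True)
for lab, cnt in zip(labels, counts):
    print(f"cluster {lab}: {cnt} points")
```

Running the same two lines on the real `y_means` shows how the 200 customers are distributed over the 5 clusters.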

In [15]: selected_rows = X[y_means == 0]


selected_rows

Out[15]: array([[39, 61],
[40, 55],
[40, 47],
[40, 42],
[40, 42],
[42, 52],
[42, 60],
[43, 54],
[43, 60],
[43, 45],
[43, 41],
[44, 50],
[44, 46],
[46, 51],
[46, 46],
[46, 56],
[46, 55],
[47, 52],
[47, 59],
[48, 51],
[48, 59],
[48, 50],
[48, 48],
[48, 59],
[48, 47],
[49, 55],
[49, 42],
[50, 49],
[50, 56],
[54, 47],
[54, 54],
[54, 53],
[54, 48],
[54, 52],
[54, 42],
[54, 51],
[54, 55],
[54, 41],
[54, 44],
[54, 57],
[54, 46],
[57, 58],
[57, 55],
[58, 60],
[58, 46],
[59, 55],
[59, 41],
[60, 49],
[60, 40],
[60, 42],
[60, 52],
[60, 47],
[60, 50],
[61, 42],
[61, 49],
[62, 41],
[62, 48],
[62, 59],
[62, 55],
[62, 56],
[62, 42],
[63, 50],
[63, 46],
[63, 43],

[63, 48],
[63, 52],
[63, 54],
[64, 42],
[64, 46],
[65, 48],
[65, 50],
[65, 43],
[65, 59],
[67, 43],
[67, 57],
[67, 56],
[67, 40],
[69, 58],
[71, 35],
[72, 34],
[76, 40]], dtype=int64)

In [16]: selected_rows = X[y_means == 1]


selected_rows

Out[16]: array([[15, 39],
[16, 6],
[17, 40],
[18, 6],
[19, 3],
[19, 14],
[20, 15],
[20, 13],
[21, 35],
[23, 29],
[24, 35],
[25, 5],
[28, 14],
[28, 32],
[29, 31],
[30, 4],
[33, 4],
[33, 14],
[34, 17],
[37, 26],
[38, 35],
[39, 36],
[39, 28]], dtype=int64)

In [17]: selected_rows = X[y_means == 2]


selected_rows

Out[17]: array([[ 70, 29],
[ 71, 11],
[ 71, 9],
[ 73, 5],
[ 73, 7],
[ 74, 10],
[ 75, 5],
[ 77, 12],
[ 77, 36],
[ 78, 22],
[ 78, 17],
[ 78, 20],
[ 78, 16],
[ 78, 1],
[ 78, 1],
[ 79, 35],
[ 81, 5],
[ 85, 26],
[ 86, 20],
[ 87, 27],
[ 87, 13],
[ 87, 10],
[ 88, 13],
[ 88, 15],
[ 93, 14],
[ 97, 32],
[ 98, 15],
[ 99, 39],
[101, 24],
[103, 17],
[103, 23],
[113, 8],
[120, 16],
[126, 28],
[137, 18]], dtype=int64)

In [18]: selected_rows = X[y_means == 3]


selected_rows

Out[18]: array([[ 69, 91],
[ 70, 77],
[ 71, 95],
[ 71, 75],
[ 71, 75],
[ 72, 71],
[ 73, 88],
[ 73, 73],
[ 74, 72],
[ 75, 93],
[ 76, 87],
[ 77, 97],
[ 77, 74],
[ 78, 90],
[ 78, 88],
[ 78, 76],
[ 78, 89],
[ 78, 78],
[ 78, 73],
[ 79, 83],
[ 81, 93],
[ 85, 75],
[ 86, 95],
[ 87, 63],
[ 87, 75],
[ 87, 92],
[ 88, 86],
[ 88, 69],
[ 93, 90],
[ 97, 86],
[ 98, 88],
[ 99, 97],
[101, 68],
[103, 85],
[103, 69],
[113, 91],
[120, 79],
[126, 74],
[137, 83]], dtype=int64)

In [19]: selected_rows = X[y_means == 4]


selected_rows

Out[19]: array([[15, 81],
[16, 77],
[17, 76],
[18, 94],
[19, 72],
[19, 99],
[20, 77],
[20, 79],
[21, 66],
[23, 98],
[24, 73],
[25, 73],
[28, 82],
[28, 61],
[29, 87],
[30, 73],
[33, 92],
[33, 81],
[34, 73],
[37, 75],
[38, 92],
[39, 65]], dtype=int64)

In [20]: selected_rows = X[y_means == 0, 1]   # Spending Score (column 1) for cluster 0


selected_rows

Out[20]: array([61, 55, 47, 42, 42, 52, 60, 54, 60, 45, 41, 50, 46, 51, 46, 56, 55,
52, 59, 51, 59, 50, 48, 59, 47, 55, 42, 49, 56, 47, 54, 53, 48, 52,
42, 51, 55, 41, 44, 57, 46, 58, 55, 60, 46, 55, 41, 49, 40, 42, 52,
47, 50, 42, 49, 41, 48, 59, 55, 56, 42, 50, 46, 43, 48, 52, 54, 42,
46, 48, 50, 43, 59, 43, 57, 56, 40, 58, 35, 34, 40], dtype=int64)

In [21]: plt.scatter(X[y_means == 0, 0], X[y_means == 0, 1], color='blue', label='Cluster 0')
         plt.scatter(X[y_means == 1, 0], X[y_means == 1, 1], color='yellow', label='Cluster 1')
         plt.scatter(X[y_means == 2, 0], X[y_means == 2, 1], color='red', label='Cluster 2')
         plt.scatter(X[y_means == 3, 0], X[y_means == 3, 1], color='green', label='Cluster 3')
         plt.scatter(X[y_means == 4, 0], X[y_means == 4, 1], color='orange', label='Cluster 4')

         plt.xlabel('Annual Income (k$)')
         plt.ylabel('Spending Score (1-100)')
         plt.title('Clusters of data points')
         plt.legend()
         plt.show()
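The fitted model also exposes `km.cluster_centers_`, and overlaying the centroids makes the final plot easier to read. A self-contained sketch on synthetic data (the `Agg` backend is used so it runs headlessly; on the notebook's data the same overlay works with the `km` fitted above):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the (income, score) matrix X used above.
X, _ = make_blobs(n_samples=200, centers=5, random_state=1)
km = KMeans(n_clusters=5, n_init=10, random_state=1)
y = km.fit_predict(X)

# One scatter call per cluster, then the centroids on top as black crosses.
for c in range(5):
    plt.scatter(X[y == c, 0], X[y == c, 1], label=f"Cluster {c}")
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            color="black", marker="x", s=100, label="Centroids")
plt.legend()
plt.savefig("clusters.png")

print(km.cluster_centers_.shape)  # (5, 2): one centre per cluster, one column per feature
```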
