45B AIML Practical07 Clustering

Name of Student: Ahmed Mobin Ahmed Shaikh

Roll Number: 45    Lab Practical Number: 07

Title of Lab Assignment: Implementation and analysis of clustering algorithms like K-Means and K-Medoids.

DOP: 06/03/24    DOS: 06/03/24

CO Mapped: CO2    PO Mapped: PO3, PO5, PO6, PO7, PO11, PO12    Signature:

1. K-Means Clustering:


I. IMPORTING DATASET:

import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('F1Drivers_Dataset.csv')
dataset

Output (truncated): the drivers dataframe, 868 rows × 22 columns. Visible columns include Driver, Nationality, Seasons, Championships, Race_Entries, Race_Starts and a truncated Pole_P… column; the displayed rows run from Carlo Abate (Italy) to Ricardo Zunino (Argentina).
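Before clustering, a quick sanity check on the two numeric columns used below is worthwhile (an illustrative cell, not part of the original notebook):

# Summary statistics and missing-value count for the clustering features
print(dataset[['Race_Entries', 'Race_Starts']].describe())
print("Missing values:\n", dataset[['Race_Entries', 'Race_Starts']].isna().sum())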

EXTRACTING INDEPENDENT VARIABLES:


SOURCE CODE:

x = dataset[['Race_Entries', 'Race_Starts']]

# Finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans

wcss_list = []  # within-cluster sum of squares (WCSS) for each k
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)

mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('WCSS')
mtp.show()
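The elbow is normally read off the plot by eye. As a rough programmatic check (an illustrative sketch, not in the original notebook), one can pick the k where the WCSS curve bends most sharply:

# Largest second difference in WCSS as a crude curvature proxy
wcss = nm.array(wcss_list)
second_diff = wcss[:-2] - 2 * wcss[1:-1] + wcss[2:]
elbow_k = int(nm.argmax(second_diff)) + 2  # differences are centred on k = 2..9
print("Suggested elbow at k =", elbow_k)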


TRAINING K-MEANS MODEL ON A DATASET:


SOURCE CODE:

# Applying K-Means with 5 clusters
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)

# Visualizing the clusters
mtp.scatter(x[y_predict == 0]['Race_Entries'], x[y_predict == 0]['Race_Starts'], s=100, c='blue', label='Cluster 1')
mtp.scatter(x[y_predict == 1]['Race_Entries'], x[y_predict == 1]['Race_Starts'], s=100, c='green', label='Cluster 2')
mtp.scatter(x[y_predict == 2]['Race_Entries'], x[y_predict == 2]['Race_Starts'], s=100, c='red', label='Cluster 3')
mtp.scatter(x[y_predict == 3]['Race_Entries'], x[y_predict == 3]['Race_Starts'], s=100, c='cyan', label='Cluster 4')
mtp.scatter(x[y_predict == 4]['Race_Entries'], x[y_predict == 4]['Race_Starts'], s=100, c='magenta', label='Cluster 5')

# Plotting centroids
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroid')

mtp.title('Clusters of Driver data')
mtp.xlabel('Race_Entries')
mtp.ylabel('Race_Starts')
mtp.legend()
mtp.show()

Output: a truncated FutureWarning from scikit-learn's KMeans (the default value of n_init is changing in a future release; passing n_init explicitly silences it), followed by the cluster plot.
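The five clusters can also be scored numerically. A minimal sketch (an illustrative addition, reusing x and y_predict from the cell above) applies the same silhouette score used later in the K-Medoids section:

# Illustrative: silhouette score for the 5-cluster K-Means fit
from sklearn.metrics import silhouette_score

score = silhouette_score(x, y_predict)
print("Silhouette score for k=5:", score)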


2. K-Medoids:
I. Importing Packages & Loading Dataset:

SOURCE CODE:

!pip install scikit-learn-extra

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn_extra.cluster import KMedoids
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn import datasets, metrics

# Load the Wine dataset
wine = datasets.load_wine()
x = wine.data    # Features
y = wine.target  # Target labels
wine

Output (truncated): pip downloads and installs scikit-learn-extra 0.3.0 (numpy, scipy, scikit-learn and their dependencies already satisfied), then the cell prints the full Wine Bunch object: a 178 × 13 data array, a target array of labels 0/1/2, target_names ['class_0', 'class_1', 'class_2'], and the DESCR text (13 numeric attributes from Alcohol to Proline; class distribution 59/71/48; creator R.A. Fisher; a copy of the UCI ML Wine recognition dataset).

II. Scaling and Fitting KMedoids:

SOURCE CODE:

# Scaling the features
scaler = StandardScaler().fit(x)
x_scaled = scaler.transform(x)

# Fitting K-Medoids with 3 clusters (one per wine class)
kMedoids = KMedoids(n_clusters=3, random_state=0)
kMedoids.fit(x_scaled)
y_kmed = kMedoids.predict(x_scaled)
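Unlike K-Means centroids, K-Medoids cluster centers are actual data points. As a small illustrative check (assuming the fitted kMedoids object above; medoid_indices_ is part of the sklearn-extra API), the chosen medoids can be inspected directly:

# The medoids are real rows of x_scaled, not averaged points
print("Medoid row indices:", kMedoids.medoid_indices_)
print("Medoid feature vectors (scaled):")
print(kMedoids.cluster_centers_)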

III. Silhouette Method to evaluate clusters:


SOURCE CODE:

silhouette_avg = silhouette_score(x_scaled, y_kmed)
print("Silhouette Score:", silhouette_avg)

Silhouette Score: 0.26597740204536796

(Silhouette values range from -1 to 1; a score near 0.27 indicates the three clusters are only weakly separated.)

IV. Silhouette Width to find the number of clusters:


SOURCE CODE:

sw = []
for i in range(2, 11):
    # Note: this loop reuses the kMedoids / y_kmed names fitted above
    kMedoids = KMedoids(n_clusters=i, random_state=0)
    kMedoids.fit(x_scaled)
    y_kmed = kMedoids.predict(x_scaled)
    silhouette_avg = silhouette_score(x_scaled, y_kmed)
    sw.append(silhouette_avg)

plt.plot(range(2, 11), sw)
plt.title('Silhouette Score')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Width')
plt.show()
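The best k is again read off the plot; a short sketch (illustrative, assuming sw from the loop above) picks it programmatically:

# sw[0] corresponds to k=2, sw[8] to k=10
best_k = 2 + int(np.argmax(sw))
print("Best k by silhouette width:", best_k)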

V. Computing Purity:


SOURCE CODE:

def purity_score(y_true, y_pred):
    # Count, per cluster, the size of the majority true class,
    # then divide by the total number of samples
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)

print("Purity Score:", purity_score(y, y_kmed))


Purity Score: 0.898876404494382
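As a worked toy check of the function (illustrative, not in the original notebook): if cluster 0 holds true labels [0, 0, 1] and cluster 1 holds [1, 1, 1], the per-cluster majority counts are 2 and 3, so purity = (2 + 3) / 6 ≈ 0.83.

y_true_toy = np.array([0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 1, 1, 1])  # cluster 0 holds labels 0, 0, 1
print(purity_score(y_true_toy, y_pred_toy))  # (2 + 3) / 6 = 0.8333...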

VI. How extreme values affect K-Medoids compared to K-Means:

(K-Medoids purity on the clean data was 0.899 in section V; the cell below gives the K-Means baseline on the same data.)


SOURCE CODE:

# K-Means on the same (clean) scaled data, for comparison with K-Medoids
kmeans = KMeans(n_clusters=3, init='random', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(x_scaled)
print("Purity Score for K-Means:", purity_score(y, y_kmeans))

Purity Score for K-Means: 0.9662921348314607

VII. Plotting values:


SOURCE CODE:

plt.scatter(x_scaled[y_kmeans == 0, 0], x_scaled[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(x_scaled[y_kmeans == 1, 0], x_scaled[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(x_scaled[y_kmeans == 2, 0], x_scaled[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.legend()
plt.show()

VIII. Adding extreme values:


SOURCE CODE:

# Add three extreme observations (outliers) to the dataset
extreme_values = np.array([[10] * 13,
                           [15] * 13,
                           [12] * 13])
m = np.append(x, extreme_values, axis=0)
y_extreme = np.append(y, [2, 2, 2])
print(y_extreme)
print("Three outlier observations have been added:", m.shape)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
Three outlier observations have been added: (181, 13)

SOURCE CODE:


scaler = StandardScaler().fit(m)
x_scaled_extreme = scaler.transform(m)

# Perform K-Means clustering on the outlier-augmented data
kmeans_extreme = KMeans(n_clusters=3, init='random', max_iter=300, n_init=10, random_state=0)
y_kmeans_extreme = kmeans_extreme.fit_predict(x_scaled_extreme)

# Calculate purity score
purity_extreme = purity_score(y_extreme, y_kmeans_extreme)
print(purity_extreme)

0.7016574585635359
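Section VI names a comparison, so the K-Medoids side on the same outlier-augmented data is sketched below (an illustrative cell the original notebook does not show; its result is not reproduced here). Because medoids are actual data points, a few extreme values shift them far less than they shift K-Means centroids:

# Illustrative: K-Medoids on the outlier-augmented data
kmed_extreme = KMedoids(n_clusters=3, random_state=0)
y_kmed_extreme = kmed_extreme.fit_predict(x_scaled_extreme)
print("Purity Score for K-Medoids with outliers:", purity_score(y_extreme, y_kmed_extreme))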

IX. Plot:


SOURCE CODE:

plt.scatter(x_scaled_extreme[y_kmeans_extreme == 0, 0], x_scaled_extreme[y_kmeans_extreme == 0, 1], s=100, c='red', label='C1')
plt.scatter(x_scaled_extreme[y_kmeans_extreme == 1, 0], x_scaled_extreme[y_kmeans_extreme == 1, 1], s=100, c='blue', label='C2')
plt.scatter(x_scaled_extreme[y_kmeans_extreme == 2, 0], x_scaled_extreme[y_kmeans_extreme == 2, 1], s=100, c='green', label='C3')
plt.scatter(kmeans_extreme.cluster_centers_[:, 0], kmeans_extreme.cluster_centers_[:, 1], s=100, c='yellow', label='Centroids')
plt.legend()
plt.show()

Output: scatter plot of the K-Means clusters (first two scaled features) on the outlier-augmented data.

SOURCE CODE:

# Bar chart comparing purity across methods; three of the values are
# hard-coded rather than computed in this notebook
data = [['k-Means', 0.81], ['k-Means with Outliers', purity_extreme],
        ['k-Medoid', 0.84], ['k-Medoid with Outliers', 0.86]]
df = pd.DataFrame(data, columns=['Method', 'Purity'])
df.plot.bar(x='Method', y='Purity', title='Cluster Quality')
plt.show()
