Btech1010622 Lab4

The document outlines a data analysis process that uses Python libraries to cluster mall-customer records. It includes preprocessing steps such as label encoding and feature scaling, followed by Agglomerative Clustering under two configurations (single and complete linkage). The binned spending score is held out as a ground-truth label, and the purity of each clustering is calculated to evaluate its effectiveness.

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances

df = pd.read_csv("Live.csv")
df.head()

                          status_id status_type status_published  \
0  246675545449582_1649696485147474       video   4/22/2018 6:00
1  246675545449582_1649426988507757       photo  4/21/2018 22:45
2  246675545449582_1648730588577397       video   4/21/2018 6:17
3  246675545449582_1648576705259452       photo   4/21/2018 2:29
4  246675545449582_1645700502213739       photo   4/18/2018 3:22

   num_reactions  num_comments  num_shares  num_likes  num_loves  num_wows  \
0            529           512         262        432         92         3
1            150             0           0        150          0         0
2            227           236          57        204         21         1
3            111             0           0        111          0         0
4            213             0           0        204          9         0

   num_hahas  num_sads  num_angrys  Column1  Column2  Column3  Column4
0          1         1           0      NaN      NaN      NaN      NaN
1          0         0           0      NaN      NaN      NaN      NaN
2          1         0           0      NaN      NaN      NaN      NaN
3          0         0           0      NaN      NaN      NaN      NaN
4          0         0           0      NaN      NaN      NaN      NaN
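
The four trailing Column1-Column4 fields are NaN in every row shown above. Assuming they are empty throughout the file (they appear to be unnamed spillover columns), a minimal cleanup sketch would drop them before any further work with df:

# Drop the empty spillover columns (assumes they are all-NaN, as in the preview)
df = df.drop(columns=['Column1', 'Column2', 'Column3', 'Column4'])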

df1 = pd.read_csv("Mall_Customers.csv")
df1.head()

   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
df1['Spending Score (1-100)'] = pd.cut(df1['Spending Score (1-100)'],
                                       bins=[-1, 39, 64, 84, 100],
                                       labels=['L', 'M', 'H', 'VH'])
print(df1.head())

   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                       L
1           2    Male   21                  15                       H
2           3  Female   20                  16                       L
3           4  Female   23                  16                       H
4           5  Female   31                  17                       M
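
pd.cut uses right-closed intervals by default, which explains the labels above: a score of 39 falls in (-1, 39] and maps to 'L', while 40 falls in (39, 64] and maps to 'M'. A quick check of the boundary behaviour:

# Right-closed bins: (-1, 39] -> 'L', (39, 64] -> 'M', (64, 84] -> 'H', (84, 100] -> 'VH'
print(pd.cut([39, 40, 64, 65, 84, 85], bins=[-1, 39, 64, 84, 100],
             labels=['L', 'M', 'H', 'VH']).tolist())
# ['L', 'M', 'M', 'H', 'H', 'VH']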

from sklearn.preprocessing import LabelEncoder

# Encode the categorical Gender column as integers
label_encoder = LabelEncoder()
df1['Gender'] = label_encoder.fit_transform(df1['Gender'])
print(df1.head())

   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1       1   19                  15                       L
1           2       1   21                  15                       H
2           3       0   20                  16                       L
3           4       0   23                  16                       H
4           5       0   31                  17                       M
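
LabelEncoder assigns integer codes in sorted (alphabetical) order of the class names, so 'Female' becomes 0 and 'Male' becomes 1, matching the table above. The fitted encoder records this mapping:

print(label_encoder.classes_)  # ['Female' 'Male'] -> codes 0 and 1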

# Hold out the binned spending score as the ground-truth label; cluster on the rest
df1_cluster = df1.drop(columns=["Spending Score (1-100)"])

scaler = StandardScaler()
df1_scaled = scaler.fit_transform(df1_cluster)
print(df1_scaled[:5])

[[-1.7234121   1.12815215 -1.42456879 -1.73899919]
 [-1.70609137  1.12815215 -1.28103541 -1.73899919]
 [-1.68877065 -0.88640526 -1.3528021  -1.70082976]
 [-1.67144992 -0.88640526 -1.13750203 -1.70082976]
 [-1.6541292  -0.88640526 -0.56336851 -1.66266033]]
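
StandardScaler rescales every column to zero mean and unit variance, so all features contribute on the same scale. Note that the first scaled column is CustomerID, an arbitrary identifier; including it in the clustering is a questionable design choice and it would arguably be better dropped alongside the spending score. A quick sanity check on the result:

print(df1_scaled.mean(axis=0).round(6))  # approximately 0 for every column
print(df1_scaled.std(axis=0).round(6))   # approximately 1 for every column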

# Full Euclidean distance matrix between all pairs of scaled rows
distance_matrix = pairwise_distances(df1_scaled, metric='euclidean')
print(distance_matrix)

[[0.         0.14457468 2.01649422 ... 5.51941563 5.8580287  5.84708314]
 [0.14457468 0.         2.01627105 ... 5.48623951 5.82672938 5.81921472]
 [2.01649422 2.01627105 0.         ... 5.81691044 6.13642782 6.12756308]
 ...
 [5.51941563 5.48623951 5.81691044 ... 0.         0.42022084 0.44507011]
 [5.8580287  5.82672938 6.13642782 ... 0.42022084 0.         0.14457468]
 [5.84708314 5.81921472 6.12756308 ... 0.44507011 0.14457468 0.        ]]
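
pairwise_distances returns a full square matrix: for the 200 customers here it is 200 x 200, symmetric, with zeros on the diagonal (visible in the corners of the printout above). A quick structural check:

print(distance_matrix.shape)                            # (200, 200)
print(np.allclose(distance_matrix, distance_matrix.T))  # True: symmetric
print(np.allclose(np.diag(distance_matrix), 0))         # True: zero self-distance
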
from sklearn.cluster import AgglomerativeClustering

# 'affinity' was renamed to 'metric' in scikit-learn 1.2 (and removed in 1.4);
# on older versions, pass affinity='precomputed' instead.
agg_clustering_1 = AgglomerativeClustering(n_clusters=4,
                                           metric='precomputed',
                                           linkage='single')
agg_clustering_1.fit(distance_matrix)

agg_clustering_2 = AgglomerativeClustering(n_clusters=2,
                                           metric='precomputed',
                                           linkage='complete')
agg_clustering_2.fit(distance_matrix)

labels_1 = agg_clustering_1.labels_
labels_2 = agg_clustering_2.labels_
print("Cluster labels from the first run:")
print(labels_1)
print("Cluster labels from the second run:")
print(labels_2)

Cluster labels from the first run:
[0 0 1 1 1 1 1 1 0 1 0 1 1 1 0 0 1 0 0 1 0 0 1 0 1 0 1 0 1 1 0 1 0 0 1 1 1
 1 1 1 1 0 0 1 1 1 1 1 1 1 1 3 1 0 1 0 1 0 1 0 0 0 1 1 0 0 1 1 0 1 0 1 1 1
 0 0 1 0 1 1 0 0 0 1 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 1 0 0 0 1 1 0 0 0 0
 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 0 0 0 0 0 0 1 1 0 1 1 0 0 1 1 0 1 1 0 0 0 1
 1 0 0 0 1 1 1 1 0 1 0 1 1 1 0 1 0 1 0 1 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 1 1
 0 1 0 1 1 1 1 0 1 2 1 2 0 0 0]
Cluster labels from the second run:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 1 1 0 0 0 0 1 0 0 1 1 0 1 1 0 0 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
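
The two runs behave very differently. Single linkage merges clusters by their closest pair of points, which is prone to chaining: above, nearly everything collapses into clusters 0 and 1, while labels 2 and 3 mark only a handful of outlying points. Complete linkage merges by the farthest pair, yielding the more balanced two-way split in the second run. Cluster sizes make this easy to see:

# Count how many points land in each cluster
print(np.bincount(labels_1))  # single linkage: two near-empty clusters
print(np.bincount(labels_2))  # complete linkage: a more balanced split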

from sklearn.metrics import confusion_matrix

def calculate_purity(true_labels, predicted_labels):
    # Rows of cm are true classes, columns are predicted clusters
    cm = confusion_matrix(true_labels, predicted_labels)
    # axis=1 takes each true class's best-matching cluster (inverse purity);
    # textbook purity would use np.amax(cm, axis=0) to take the dominant
    # true class within each predicted cluster.
    purity = np.sum(np.amax(cm, axis=1)) / np.sum(cm)
    return purity

true_labels = df1['Spending Score (1-100)'].map({'L': 0, 'M': 1, 'H': 2, 'VH': 3})
purity_1 = calculate_purity(true_labels, labels_1)
purity_2 = calculate_purity(true_labels, labels_2)
print(f"Purity for Clustering 1: {purity_1}")
print(f"Purity for Clustering 2: {purity_2}")

Purity for Clustering 1: 0.555
Purity for Clustering 2: 0.72
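
Purity rewards clusterings whose clusters are dominated by a single true class, but it has a floor: assigning every point to one cluster already scores the share of the most common class. Reading the 0.555 and 0.72 above against that baseline gives a fairer picture of how much the clustering actually recovers:

# Baseline: the majority-class share among the true spending-score labels
print(true_labels.value_counts(normalize=True).max())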
