Practical 5

Name: Harshad Kamble

Roll No: 23

Aim: Assignment on Clustering Techniques

Download the customer dataset from the link below:
Data Set: https://www.kaggle.com/shwetabh123/mall-customers

This dataset records the income and spending of customers visiting a shopping mall. It contains Customer ID, Gender, Age, Annual Income, and Spending Score. As the mall owner, you need to find the groups of customers who are most profitable for the mall. Apply at least two clustering algorithms (based on Spending Score) to find the groups of customers.

a. Apply data pre-processing (Label Encoding, Data Transformation, ...) techniques if necessary.
b. Perform data preparation (Train-Test Split).
c. Apply a Machine Learning algorithm.
d. Evaluate the model.
e. Apply Cross-Validation and evaluate the model.
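
Step (b) above can be sketched as follows. A train-test split is optional for unsupervised clustering, but it can be used to check cluster stability on held-out customers. This is a minimal sketch using scikit-learn's `train_test_split` on a toy frame with the same column names (the toy values are invented for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the mall dataset (same column names, invented values)
df = pd.DataFrame({
    'Annual Income': [15, 16, 17, 18, 60, 61, 62, 63],
    'Spending Score': [39, 81, 6, 77, 50, 42, 55, 46],
})

# 80/20 split; random_state fixes the shuffle for reproducibility
train, test = train_test_split(df, test_size=0.2, random_state=1)
print(train.shape, test.shape)
```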

In [1]: import pandas as pd

In [2]: import matplotlib.pyplot as plt

In [3]: from matplotlib.lines import Line2D

In [4]: from sklearn.preprocessing import StandardScaler

In [5]: from sklearn.decomposition import PCA

In [8]: from sklearn.cluster import KMeans

In [9]: df=pd.read_csv("/home/student/TE52/Mall_Customers.csv")#read the specified csv file

In [15]: df.head()#show the first few rows of the dataframe

Out[15]:
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40

In [56]: df.rename(columns={'Annual Income (k$)':'Annual Income'}, inplace=True)

In [55]: df.rename(columns={'Spending Score (1-100)':'Spending Score'}, inplace=True)

In [57]: df.describe()#summary statistics of the dataframe

Out[57]:
       CustomerID         Age  Annual Income  Spending Score
count  200.000000  200.000000     200.000000      200.000000
mean   100.500000   38.850000      60.560000       50.200000
std     57.879185   13.969007      26.264721       25.823522
min      1.000000   18.000000      15.000000        1.000000
25%     50.750000   28.750000      41.500000       34.750000
50%    100.500000   36.000000      61.500000       50.000000
75%    150.250000   49.000000      78.000000       73.000000
max    200.000000   70.000000     137.000000       99.000000

In [60]: df.isnull().sum()#check for null values

Out[60]: CustomerID 0
Gender 0
Age 0
Annual Income 0
Spending Score 0
dtype: int64

In [47]: df.shape#number of rows and columns

Out[47]: (200, 5)

In [49]: df['Gender'].value_counts()#count per gender

Out[49]: Gender
Female 112
Male 88
Name: count, dtype: int64

In [61]: print(df.columns.tolist())#check column names

['CustomerID', 'Gender', 'Age', 'Annual Income', 'Spending Score']
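
Step (a) of the aim mentions Label Encoding; Gender is the only categorical column here. A minimal sketch with scikit-learn's `LabelEncoder` on a toy Gender series (values invented for illustration; the cells below cluster only the numeric features, so the encoded column is not used further):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy Gender column mirroring the dataset's values
gender = pd.Series(['Male', 'Male', 'Female', 'Female', 'Female'])

le = LabelEncoder()
encoded = le.fit_transform(gender)  # classes are sorted: Female -> 0, Male -> 1
print(list(le.classes_), encoded.tolist())
```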

In [79]: sc = StandardScaler()#create a StandardScaler instance

In [80]: numeric_features = df[['Age', 'Annual Income', 'Spending Score']]

In [81]: print(numeric_features.head())#printing first few values

   Age  Annual Income  Spending Score
0   19             15              39
1   21             15              81
2   20             16               6
3   23             16              77
4   31             17              40

In [82]: numeric_features_scaled = sc.fit_transform(numeric_features)#scale the values

In [83]: df_scaled = pd.DataFrame(numeric_features_scaled, columns=numeric_features.columns)

In [90]: print(df_scaled)#display scaled dataframe

          Age  Annual Income  Spending Score
0   -1.424569      -1.738999       -0.434801
1   -1.281035      -1.738999        1.195704
2   -1.352802      -1.700830       -1.715913
3   -1.137502      -1.700830        1.040418
4   -0.563369      -1.662660       -0.395980
..        ...            ...             ...
195 -0.276302       2.268791        1.118061
196  0.441365       2.497807       -0.861839
197 -0.491602       2.497807        0.923953
198 -0.491602       2.917671       -1.250054
199 -0.635135       2.917671        1.273347

[200 rows x 3 columns]

In [91]: pca = PCA(n_components = 2)#creating pca object

In [93]: df_pca = pca.fit_transform(df_scaled)#fitting and transforming data

In [95]: print("data shape after PCA :",df_pca.shape)#print shape of transformed data

data shape after PCA : (200, 2)

In [96]: print("data_pca is:",df_pca)#printing transformed data

data_pca is: [[-6.15720019e-01 -1.76348088e+00]
 [-1.66579271e+00 -1.82074695e+00]
[ 3.37861909e-01 -1.67479894e+00]
[-1.45657325e+00 -1.77242992e+00]
[-3.84652078e-02 -1.66274012e+00]
[-1.48168526e+00 -1.73500173e+00]
[ 1.09461665e+00 -1.56610230e+00]
[-1.92630736e+00 -1.72111049e+00]
[ 2.64517786e+00 -1.46084721e+00]
[-9.70130513e-01 -1.63558108e+00]
[ 2.49568861e+00 -1.47048914e+00]
[-1.45688256e+00 -1.66436050e+00]
[ 2.01018729e+00 -1.45329897e+00]
[-1.41321072e+00 -1.61776746e+00]
[ 1.00042965e+00 -1.49579176e+00]
[-1.56943170e+00 -1.62502669e+00]
[ 2.94060318e-01 -1.49425585e+00]
[-1.31624924e+00 -1.57216383e+00]
[ 1.31669910e+00 -1.37243404e+00]
[-1.43679899e+00 -1.51039469e+00]
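
How much variance the two components retain can be checked with the fitted PCA's `explained_variance_ratio_` attribute. A self-contained sketch on synthetic standardized data (the exact ratios for the mall data will differ):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # stand-in for the scaled features

pca = PCA(n_components=2)
pca.fit(X)
# Fraction of total variance captured by each principal component,
# in decreasing order
ratios = pca.explained_variance_ratio_
print(ratios, ratios.sum())
```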
In [97]: plt_font = {'family':'serif' , 'size':16}#font size

In [99]: wcss_list = []
         for i in range(1, 15):
             kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 1)
             kmeans.fit(df_pca)
             wcss_list.append(kmeans.inertia_)

In [5]: import matplotlib.pyplot as plt

# Example font properties (customize as needed)


plt_font = {
'fontsize': 12,
'fontweight': 'bold',
'family': 'serif'
}

# Example: Define wcss_list (replace this with your actual WCSS values)
wcss_list = [500, 300, 250, 200, 180, 175, 170, 160, 150, 145, 140, 135]

# Plotting
plt.plot(range(1, len(wcss_list) + 1), wcss_list)
plt.plot([4, 4], [0, max(wcss_list)], linestyle='--', alpha=0.7)  # dashed line marking the elbow at k = 4
plt.xlabel('K', fontdict=plt_font)
plt.ylabel('WCSS', fontdict=plt_font)
plt.title('Elbow Method for Optimal k', fontdict=plt_font)
plt.show()
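
Step (d) of the aim asks for model evaluation; for clustering, a common internal metric is the silhouette score (values near 1 indicate tight, well-separated clusters). A sketch on synthetic two-dimensional blobs standing in for `df_pca`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Two well-separated synthetic blobs standing in for the PCA-reduced data
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=1)
labels = km.fit_predict(X)
score = silhouette_score(X, labels)
print(round(score, 3))  # near 1 for well-separated blobs
```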

In [30]: import numpy as np

# Example: Creating a sample dataset
# Replace this with your actual data
data = pd.DataFrame({
    'feature1': np.random.rand(100),
    'feature2': np.random.rand(100),
    'feature3': np.random.rand(100),
})

# Step 1: Standardize the data


scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Step 2: Perform PCA


pca = PCA(n_components=2) # Adjust the number of components as needed
df_pca = pca.fit_transform(data_scaled)

# Step 3: Perform KMeans clustering


kmeans = KMeans(n_clusters=4, init='k-means++', random_state=1)
kmeans.fit(df_pca)
cluster_id = kmeans.predict(df_pca)

# Step 4: Creating a result DataFrame


result_data = pd.DataFrame()
result_data['PC1'] = df_pca[:, 0]
result_data['PC2'] = df_pca[:, 1]
result_data['Cluster'] = cluster_id # Use 'Cluster' as defined earlier

# Print the result data


print(result_data)

# Define colors for clusters


cluster_colors = {0: 'tab:red', 1: 'tab:green', 2: 'tab:blue', 3: 'tab:pink'}
cluster_dict = {
'Centroid': 'tab:orange',
'Cluster0': 'tab:red',
'Cluster1': 'tab:green',
'Cluster2': 'tab:blue',
'Cluster3': 'tab:pink'
}

# Scatter plot for the clusters


plt.scatter(x=result_data['PC1'],
y=result_data['PC2'],
c=result_data['Cluster'].map(cluster_colors))

# Create legend handles


handles = [Line2D([0], [0], marker='o', color='w', markerfacecolor=v,
markersize=8) for k, v in cluster_dict.items()]
plt.legend(title='Color', handles=handles, bbox_to_anchor=(1.05, 1))

# Scatter plot for centroids


plt.scatter(x=kmeans.cluster_centers_[:, 0],
y=kmeans.cluster_centers_[:, 1],
marker='o',
c='tab:orange',
s=150,
alpha=1)

# Plot settings
plt.title("Clustered by KMeans", fontdict={'fontsize': 14, 'fontweight': 'bold'})
plt.xlabel("PC1", fontdict={'fontsize': 12})
plt.ylabel("PC2", fontdict={'fontsize': 12})
plt.show()

         PC1       PC2  Cluster
0  -0.708627  0.160780        0
1  -1.604748  0.467547        0
2   2.637297  0.254471        2
3  -0.265911 -1.520968        3
4  -0.256524  1.008739        0
..       ...       ...      ...
95  1.335878 -1.062640        1
96 -0.139539 -1.098409        3
97 -0.077331  0.132332        0
98  0.473854 -0.749248        1
99 -0.749690  1.394149        0

[100 rows x 3 columns]
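
The aim asks for at least two clustering algorithms; agglomerative (hierarchical) clustering is a natural second choice alongside KMeans. A minimal sketch on synthetic blobs standing in for the PCA-reduced customer data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(2)
# Two separated blobs standing in for the PCA-reduced customer data
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

# Ward linkage merges the pair of clusters that least increases
# within-cluster variance at each step
agg = AgglomerativeClustering(n_clusters=2, linkage='ward')
labels = agg.fit_predict(X)
print(np.bincount(labels))  # cluster sizes
```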
