
Kmeans

The document outlines a data analysis process using Python, specifically focusing on income data. It includes steps for data visualization with Seaborn, data scaling with StandardScaler, and clustering using KMeans. Additionally, it analyzes the results of clustering by calculating mean and standard deviation for different clusters.

Uploaded by

kreeves75234

1yb74zs6n

January 2, 2025

[1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

# Load the income data and draw an age vs. income scatter plot (no regression line)
df = pd.read_csv("/content/drive/MyDrive/Data Set/Income Data.csv")
sn.lmplot(x="age", y="income", data=df, fit_reg=False)

[1]: <seaborn.axisgrid.FacetGrid at 0x7d4579fbaa40>

[2]: from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df[["age", "income"]])
scaled_df[0:5]

[2]: array([[ 1.3701637 ,  0.09718548],
            [-1.3791283 ,  0.90602749],
            [ 1.10388844,  0.51405021],
            [ 0.23849387, -1.27162408],
            [-0.35396857, -1.32762083]])
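The scaling step above can be checked by hand. The sketch below is only an illustration: the Income Data CSV is not available here, so X is a made-up stand-in for df[["age", "income"]]. It shows that StandardScaler subtracts each column's mean and divides by the population standard deviation (ddof=0), so age and income end up on comparable scales before clustering.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for df[["age", "income"]]; values are made up.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(40, 8, size=200),          # made-up "age" column
    rng.normal(40000, 15000, size=200),   # made-up "income" column
])

scaled = StandardScaler().fit_transform(X)

# Same computation by hand: subtract the column mean and divide by the
# population standard deviation (NumPy's default ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, each column has mean 0 and standard deviation 1, so the large numeric range of income no longer dominates the Euclidean distances that KMeans uses.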

[3]: from sklearn.cluster import KMeans

# Fit three clusters on the scaled features and label each row
clusters = KMeans(3)
clusters.fit(scaled_df)
df["clusterid"] = clusters.labels_

# Plot the original (unscaled) values, one marker per cluster
markers = ['+', '^', '*']
sn.lmplot(x="age", y="income", data=df, hue="clusterid", fit_reg=False,
          markers=markers)

[3]: <seaborn.axisgrid.FacetGrid at 0x7d45758555d0>
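KMeans starts from random centers, so fitting KMeans(3) again can assign different id numbers to the same groups. One way to make the labels repeatable is to fix random_state, as in the sketch below. The data are made up (three groups of points), and the values n_init=10 and random_state=42 are assumptions chosen for illustration, not settings from this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up (age, income) groups standing in for the real data.
rng = np.random.default_rng(0)
group_centers = np.array([[46.0, 44000.0], [29.0, 55000.0], [39.0, 18000.0]])
group_spreads = np.array([[2.0, 4000.0], [1.0, 2000.0], [3.5, 5000.0]])
X = np.vstack([
    rng.normal(c, s, size=(100, 2))
    for c, s in zip(group_centers, group_spreads)
])
X_scaled = StandardScaler().fit_transform(X)

# Two independent fits with the same seed produce identical labels.
first = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)
second = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

print(np.array_equal(first.labels_, second.labels_))
```

Without a fixed seed the grouping is usually the same on well-separated data, but the id attached to each group is arbitrary.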

[5]: clusters = KMeans(3)
clusters.fit(scaled_df)
df["new_clusterid"] = clusters.labels_

# Per-cluster mean and standard deviation of age and income
df.groupby("new_clusterid")[['age', 'income']].agg(["mean", 'std']).reset_index()  # Changed tuple to list

[5]:   new_clusterid        age                  income
                           mean       std          mean          std
     0             0  46.627184  2.151559  44308.737864  4390.321503
     1             1  29.384000  0.921458  55204.000000  1951.943864
     2             2  39.140206  3.558665  18321.649485  6924.747691
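The per-cluster means can also be read from the fitted model. The centers are stored in scaled units, and scaler.inverse_transform maps them back to years and income units. The sketch below uses made-up data and a hypothetical column layout matching the notebook's ("age", "income"); it is an illustration under those assumptions, not a rerun of the analysis above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up (age, income) groups standing in for the real data.
rng = np.random.default_rng(0)
group_centers = np.array([[46.0, 44000.0], [29.0, 55000.0], [39.0, 18000.0]])
group_spreads = np.array([[2.0, 4000.0], [1.0, 2000.0], [3.5, 5000.0]])
values = np.vstack([
    rng.normal(c, s, size=(100, 2))
    for c, s in zip(group_centers, group_spreads)
])
df = pd.DataFrame(values, columns=["age", "income"])

scaler = StandardScaler()
scaled = scaler.fit_transform(df[["age", "income"]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
df["clusterid"] = km.labels_

# Cluster centers converted from scaled units back to the original units.
centers_original = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_),
    columns=["age", "income"],
)

# Group means computed directly from the labelled rows, for comparison.
group_means = df.groupby("clusterid")[["age", "income"]].mean()

print(centers_original.round(2))
print(group_means.round(2))
```

The two tables should agree closely (up to the algorithm's convergence tolerance), because each KMeans center is the mean of the points assigned to it and the scaling is a per-column shift and rescale.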

[6]: cluster_range = range(1, 10)
cluster_errors = []

# Fit KMeans for k = 1..9 and record the within-cluster sum of squares
for num_clusters in cluster_range:
    clusters = KMeans(num_clusters)
    clusters.fit(scaled_df)
    cluster_errors.append(clusters.inertia_)

# Elbow plot: error against number of clusters
plt.figure(figsize=(6, 4))
plt.plot(cluster_range, cluster_errors, marker="*")
plt.xlabel("No. of clusters")
plt.ylabel("Sum of Squared Error")

[6]: Text(0, 0.5, 'Sum of Squared Error')
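The quantity on the y-axis, inertia_, is the sum of squared distances from each point to the center of its assigned cluster. The sketch below recomputes it by hand on made-up two-dimensional points (not the income data) to show what the elbow plot is measuring.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up points in three loose groups; not the notebook's data.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc, 0.3, size=(50, 2))
    for loc in ([0, 0], [3, 3], [0, 4])
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the center assigned to each point, then sum the squared distances.
assigned_centers = km.cluster_centers_[km.labels_]
manual_sse = ((X - assigned_centers) ** 2).sum()

print(np.isclose(manual_sse, km.inertia_))
```

This error tends to fall as clusters are added, so the plot is read for the "elbow" where further clusters stop reducing it much, rather than for its minimum.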
