K Means Clustering Project - Sample

The document discusses a K-Means clustering project to cluster universities into private and public groups without using provided labels. It imports libraries, loads university data, and performs exploratory data analysis including scatter plots of graduation rates vs room and board and enrollments vs tuition colored by private/public. It also creates stacked histograms of tuition and graduation rates to identify a school with over 100% graduation rate, which is set to 100.

11/16/22, 12:00 AM K Means Clustering Project

K Means Clustering Project


For this project we will attempt to use K Means clustering to cluster universities into two groups, private and public.

It is very important to note that we actually have the labels for this data set, but we will NOT use them for the
K Means clustering algorithm, since it is an unsupervised learning algorithm.

Under normal circumstances you use K Means precisely because you don't have labels. In this case we will use
the labels to get an idea of how well the algorithm performed, but you usually can't do this for K Means, so the
classification report and confusion matrix at the end of this project don't truly make sense in a real-world setting!

The Data
We will use a data frame with 777 observations on the following 18 variables.

Private A factor with levels No and Yes indicating private or public university
Apps Number of applications received
Accept Number of applications accepted
Enroll Number of new students enrolled
Top10perc Pct. new students from top 10% of H.S. class
Top25perc Pct. new students from top 25% of H.S. class
F.Undergrad Number of fulltime undergraduates
P.Undergrad Number of parttime undergraduates
Outstate Out-of-state tuition
Room.Board Room and board costs
Books Estimated book costs
Personal Estimated personal spending
PhD Pct. of faculty with Ph.D.’s
Terminal Pct. of faculty with terminal degree
S.F.Ratio Student/faculty ratio
perc.alumni Pct. alumni who donate
Expend Instructional expenditure per student
Grad.Rate Graduation rate

Import Libraries
Import the libraries you usually use for data analysis.

https://amete.github.io/DataSciencePortfolio/Udemy/Python-DS-and-ML-Bootcamp/K_Means_Clustering_Project.html

In [1]: import pandas as pd
        import matplotlib.pyplot as plt
        import seaborn as sns
        sns.set_style('whitegrid')
        %matplotlib inline

Get the Data

Read in the College_Data file using read_csv. Figure out how to set the first column as the index.

In [2]: df = pd.read_csv('College_Data', index_col = 0)
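To illustrate what index_col = 0 does, here is a minimal sketch that reads a tiny in-memory CSV instead of the real College_Data file. The school names and values are made up for the illustration.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the College_Data file (values are made up).
csv_text = """,Private,Apps
School A,Yes,1200
School B,No,5400
"""

# index_col=0 promotes the first column (the school names) to the row index,
# so it no longer appears among the data columns.
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(toy.index.tolist())
print(toy.columns.tolist())
```

With the school name as the index, rows can be looked up by name with toy.loc['School A'].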

Check the head of the data

In [3]: df.head()

Out[3]:
                              Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad
Abilene Christian University  Yes      1660  1232    721     23         52         2885         537
Adelphi University            Yes      2186  1924    512     16         29         2683         1227
Adrian College                Yes      1428  1097    336     22         50         1036         99
Agnes Scott College           Yes      417   349     137     60         89         510          63
Alaska Pacific University     Yes      193   146     55      16         44         249          869

Check the info() and describe() methods on the data.


In [4]: df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 777 entries, Abilene Christian University to York College of Pennsylvania
Data columns (total 18 columns):
Private 777 non-null object
Apps 777 non-null int64
Accept 777 non-null int64
Enroll 777 non-null int64
Top10perc 777 non-null int64
Top25perc 777 non-null int64
F.Undergrad 777 non-null int64
P.Undergrad 777 non-null int64
Outstate 777 non-null int64
Room.Board 777 non-null int64
Books 777 non-null int64
Personal 777 non-null int64
PhD 777 non-null int64
Terminal 777 non-null int64
S.F.Ratio 777 non-null float64
perc.alumni 777 non-null int64
Expend 777 non-null int64
Grad.Rate 777 non-null int64
dtypes: float64(1), int64(16), object(1)
memory usage: 115.3+ KB

In [5]: df.describe()

Out[5]:
               Apps        Accept      Enroll   Top10perc   Top25perc   F.Undergrad
count    777.000000    777.000000  777.000000  777.000000  777.000000    777.000000
mean    3001.638353   2018.804376  779.972973   27.558559   55.796654   3699.907336
std     3870.201484   2451.113971  929.176190   17.640364   19.804778   4850.420531
min       81.000000     72.000000   35.000000    1.000000    9.000000    139.000000
25%      776.000000    604.000000  242.000000   15.000000   41.000000    992.000000
50%     1558.000000   1110.000000  434.000000   23.000000   54.000000   1707.000000
75%     3624.000000   2424.000000  902.000000   35.000000   69.000000   4005.000000
max    48094.000000  26330.000000 6392.000000   96.000000  100.000000  31643.000000

EDA
It's time to create some data visualizations!

Create a scatterplot of Grad.Rate versus Room.Board where the points are colored by the Private
column.


In [6]: # 'height' replaces the 'size' argument used by older seaborn releases
        sns.lmplot(x = 'Room.Board', y = 'Grad.Rate', data = df, fit_reg = False,
                   hue = 'Private', height = 6, aspect = 1)

Out[6]: <seaborn.axisgrid.FacetGrid at 0x11619e2e8>

Create a scatterplot of F.Undergrad versus Outstate where the points are colored by the Private column.


In [7]: # 'height' replaces the 'size' argument used by older seaborn releases
        sns.lmplot(x = 'Outstate', y = 'F.Undergrad', data = df, fit_reg = False,
                   hue = 'Private', height = 6, aspect = 1)

Out[7]: <seaborn.axisgrid.FacetGrid at 0x11620afd0>

Create a stacked histogram showing Out of State Tuition based on the Private column. Try doing
this using sns.FacetGrid
(https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.FacetGrid.html). If that is too
tricky, see if you can do it just by using two instances of pandas.plot(kind='hist').


In [8]: g = sns.FacetGrid(df, hue = 'Private', size = 6, aspect = 2)
        g = g.map(plt.hist, 'Outst
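The fallback mentioned in the task, two pandas histogram calls on the same axes, can be sketched as follows. This is only an illustration on synthetic data: the frame toy, the tuition values, bins = 20 and alpha = 0.7 are all assumptions, not taken from the notebook. The two histograms are overlaid with transparency rather than strictly stacked, and each call picks its own bin edges.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for df (made-up tuition values, not the college data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Private': ['Yes'] * 200 + ['No'] * 100,
    'Outstate': np.concatenate([rng.normal(12000, 3000, 200),
                                rng.normal(7000, 2000, 100)]),
})

fig, ax = plt.subplots(figsize=(12, 6))

# One histogram per group on the same axes; alpha keeps the overlap visible.
for flag in ['Yes', 'No']:
    toy.loc[toy['Private'] == flag, 'Outstate'].plot(
        kind='hist', bins=20, alpha=0.7, ax=ax, label=f'Private = {flag}')

ax.set_xlabel('Outstate')
ax.legend()
```

Swapping 'Outstate' for 'Grad.Rate' gives the corresponding graduation-rate plot.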

Create a similar histogram for the Grad.Rate column.

In [9]: g = sns.FacetGrid(df, hue = 'Private', size = 6, aspect = 2)
        g = g.map(plt.hist, 'Grad.

Notice how there seems to be a private school with a graduation rate higher than 100%. What is the
name of that school?


In [10]: df[df['Grad.Rate'] > 100]

Out[10]:
                   Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad
Cazenovia College  Yes      3847  3433    527     9          35         1010         12

Set that school's graduation rate to 100 so it makes sense. You may get a warning (not an error) when
doing this operation, so use dataframe operations or just redo the histogram visualization to make sure
it actually went through.

In [11]: df.loc[df['Grad.Rate'] > 100, 'Grad.Rate'] = 100
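As a small sketch of why a single .loc assignment is a safe way to do this, the example below caps an out-of-range value in a toy frame and then checks the result. The frame toy and its values are invented for the illustration. Series.clip(upper=100) is shown as an equivalent alternative, not as something the notebook uses.

```python
import pandas as pd

# Toy frame with one impossible value (synthetic, not the college data).
toy = pd.DataFrame({'Grad.Rate': [60, 118, 95]},
                   index=['School A', 'School B', 'School C'])

# Alternative: clip returns a capped copy and leaves the original untouched.
capped = toy['Grad.Rate'].clip(upper=100)

# Row mask and column label in one .loc call: the assignment writes into
# toy itself, which is what chained indexing cannot guarantee.
toy.loc[toy['Grad.Rate'] > 100, 'Grad.Rate'] = 100

# Confirm that the change went through.
print(toy['Grad.Rate'].max())
print(capped.tolist())
```

The warning mentioned above typically comes from chained indexing such as df['Grad.Rate'][mask] = 100, where pandas may end up assigning to a temporary copy.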

In [12]: g = sns.FacetGrid(df, hue = 'Private', size = 6, aspect = 2)
         g = g.map(plt.hist, 'Grad.

K Means Cluster Creation


Now it is time to create the Cluster labels!

Import KMeans from SciKit Learn.

In [13]: from sklearn.cluster import KMeans

Create an instance of a K Means model with 2 clusters.

In [14]: myKMC = KMeans(n_clusters = 2)


Fit the model to all the data except for the Private label.

In [15]: myKMC.fit(df.drop('Private', axis = 1))

Out[15]: KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
                n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
                random_state=None, tol=0.0001, verbose=0)
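To give a rough idea of what fit does, here is a simplified NumPy sketch of the basic K Means iteration (Lloyd's algorithm) on synthetic data. It is not scikit-learn's implementation: it uses plain random initialization instead of 'k-means++' and a single run instead of n_init=10. The function name lloyd_kmeans and the toy blobs are assumptions made for this illustration.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Simplified K Means: alternate an assignment step and an update step."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random (simpler than k-means++).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every row.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned rows
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so the labels are stable
        centers = new_centers
    return labels, centers

# Toy data: two well-separated blobs (synthetic, not the college data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
labels, centers = lloyd_kmeans(X, k=2)
print(centers.round(2))
```

No labels enter the procedure at any point: the grouping is driven only by distances between feature vectors.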

What are the cluster center vectors?

In [16]: myKMC.cluster_centers_

Out[16]: array([[1.81323468e+03, 1.28716592e+03, 4.91044843e+02, 2.53094170e+01,
                 5.34708520e+01, 2.18854858e+03, 5.95458894e+02, 1.03957085e+04,
                 4.31136472e+03, 5.41982063e+02, 1.28033632e+03, 7.04424514e+01,
                 7.78251121e+01, 1.40997010e+01, 2.31748879e+01, 8.93204634e+03,
                 6.50926756e+01],
                [1.03631389e+04, 6.55089815e+03, 2.56972222e+03, 4.14907407e+01,
                 7.02037037e+01, 1.30619352e+04, 2.46486111e+03, 1.07191759e+04,
                 4.64347222e+03, 5.95212963e+02, 1.71420370e+03, 8.63981481e+01,
                 9.13333333e+01, 1.40277778e+01, 2.00740741e+01, 1.41705000e+04,
                 6.75925926e+01]])
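A raw array of centers is hard to read because the column names are lost. One possible way to make it readable, sketched here on a small synthetic frame rather than the real data, is to wrap cluster_centers_ in a DataFrame that reuses the feature names. The frame features and its values are assumptions for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Small synthetic frame standing in for df.drop('Private', axis = 1).
rng = np.random.default_rng(0)
features = pd.DataFrame({
    'Apps': np.concatenate([rng.normal(1800, 200, 30),
                            rng.normal(10000, 500, 30)]),
    'Outstate': np.concatenate([rng.normal(10400, 300, 30),
                                rng.normal(10700, 300, 30)]),
})

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# One row per cluster, one named column per feature.
centers = pd.DataFrame(model.cluster_centers_, columns=features.columns)
print(centers.round(1))
```

Read this way, the output above shows the two centers differing by a factor of almost six in the first entry (Apps) but by only about 3% in the eighth (Outstate).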

Evaluation
There is no perfect way to evaluate clustering if you don't have the labels. However, since this is just an exercise, we do have
the labels, so we take advantage of them to evaluate our clusters. Keep in mind that you usually won't have this luxury in the real
world.

Create a new column for df called 'Cluster', which is a 1 for a Private school, and a 0 for a public school.

In [17]: df['Cluster'] = df['Private'].apply(lambda x: 1 if x == 'Yes' else 0)


In [18]: df.head()

Out[18]:
                              Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad
Abilene Christian University  Yes      1660  1232    721     23         52         2885         537
Adelphi University            Yes      2186  1924    512     16         29         2683         1227
Adrian College                Yes      1428  1097    336     22         50         1036         99
Agnes Scott College           Yes      417   349     137     60         89         510          63
Alaska Pacific University     Yes      193   146     55      16         44         249          869

Create a confusion matrix and classification report to see how well the Kmeans clustering worked
without being given any labels.

In [19]: from sklearn.metrics import confusion_matrix, classification_report

In [20]: print(confusion_matrix(df['Cluster'], myKMC.labels_))

[[138 74]
[531 34]]

In [21]: print(classification_report(df['Cluster'], myKMC.labels_))

             precision    recall  f1-score   support

          0       0.21      0.65      0.31       212
          1       0.31      0.06      0.10       565

avg / total       0.29      0.22      0.16       777

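Not so bad considering the algorithm is purely using the features to cluster the universities into 2 distinct groups! Keep in
mind that cluster ids are arbitrary: cluster 0 holds 531 of the 565 private schools, so the precision and recall figures above,
which treat cluster 1 as "private", understate how well the grouping matches the labels. Hopefully you can begin to see how
K Means is useful for clustering unlabeled data!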
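Because K Means numbers its clusters arbitrarily, it can help to score the result under both possible namings of the two clusters. The sketch below does this directly from the confusion matrix printed above. The variable names are chosen for this illustration.

```python
import numpy as np

# Confusion matrix printed above: rows are the true labels
# (0 = public, 1 = private), columns are the K Means cluster ids.
cm = np.array([[138, 74],
               [531, 34]])

total = cm.sum()

# Agreement if cluster id k is read literally as label k.
acc_as_is = np.trace(cm) / total

# Agreement if the two cluster ids are swapped (columns reversed).
acc_swapped = np.trace(cm[:, ::-1]) / total

print(f"as-is:   {acc_as_is:.3f}")
print(f"swapped: {acc_swapped:.3f}")
```

Read literally, the cluster ids agree with the labels for 172 of the 777 schools (about 22%). With the ids swapped, the agreement is 605 of 777 (about 78%).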

Great Job!
