K Means Clustering Project - Sample
It is important to note that we actually have the labels for this data set, but we will NOT use them for the
KMeans clustering algorithm, since it is an unsupervised learning algorithm.
Under normal circumstances you use the KMeans algorithm precisely because you don't have labels. In this case we will use
the labels to get an idea of how well the algorithm performed, but you usually won't be able to do this for KMeans, so the
classification report and confusion matrix at the end of this project don't truly make sense in a real-world setting!
The Data
We will use a data frame with 777 observations on the following 18 variables.
Private A factor with levels No and Yes indicating private or public university
Apps Number of applications received
Accept Number of applications accepted
Enroll Number of new students enrolled
Top10perc Pct. new students from top 10% of H.S. class
Top25perc Pct. new students from top 25% of H.S. class
F.Undergrad Number of fulltime undergraduates
P.Undergrad Number of parttime undergraduates
Outstate Out-of-state tuition
Room.Board Room and board costs
Books Estimated book costs
Personal Estimated personal spending
PhD Pct. of faculty with Ph.D.’s
Terminal Pct. of faculty with terminal degree
S.F.Ratio Student/faculty ratio
perc.alumni Pct. alumni who donate
Expend Instructional expenditure per student
Grad.Rate Graduation rate
Import Libraries
Import the libraries you usually use for data analysis.
https://amete.github.io/DataSciencePortfolio/Udemy/Python-DS-and-ML-Bootcamp/K_Means_Clustering_Project.html 1
11/16/22, 12:00 AM K Means Clustering Project
Read in the College_Data file using read_csv. Figure out how to set the first column as the index.
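A minimal sketch of the read step. The in-memory CSV below is only a stand-in for the College_Data file: it holds two rows and three columns copied from the output shown in this project, and it assumes the school names sit in an unnamed first column. With the real file you would pass its path to read_csv instead.

```python
import io

import pandas as pd

# Stand-in for the College_Data file (assumed layout: school name in the first, unnamed column).
sample_csv = io.StringIO(
    ",Private,Apps,Accept\n"
    "Abilene Christian University,Yes,1660,1232\n"
    "Adelphi University,Yes,2186,1924\n"
)

# index_col=0 tells read_csv to use the first column as the row index.
df = pd.read_csv(sample_csv, index_col=0)
print(df.head())
```

Without index_col=0, the school names would be loaded as an ordinary column and the rows would get a default integer index.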
In [3]: df.head()
Out[3]:
                              Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad
Abilene Christian University  Yes      1660  1232    721     23         52         2885         537
Adelphi University            Yes      2186  1924    512     16         29         2683         1227
Adrian College                Yes      1428  1097    336     22         50         1036         99
Agnes Scott College           Yes      417   349     137     60         89         510          63
Alaska Pacific University     Yes      193   146     55      16         44         249          869
In [4]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 777 entries, Abilene Christian University to York College of Pennsylvania
Data columns (total 18 columns):
Private 777 non-null object
Apps 777 non-null int64
Accept 777 non-null int64
Enroll 777 non-null int64
Top10perc 777 non-null int64
Top25perc 777 non-null int64
F.Undergrad 777 non-null int64
P.Undergrad 777 non-null int64
Outstate 777 non-null int64
Room.Board 777 non-null int64
Books 777 non-null int64
Personal 777 non-null int64
PhD 777 non-null int64
Terminal 777 non-null int64
S.F.Ratio 777 non-null float64
perc.alumni 777 non-null int64
Expend 777 non-null int64
Grad.Rate 777 non-null int64
dtypes: float64(1), int64(16), object(1)
memory usage: 115.3+ KB
In [5]: df.describe()
Out[5]:
Apps Accept Enroll Top10perc Top25perc F.Undergrad
count 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000
EDA
It's time to create some data visualizations!
Create a scatterplot of Grad.Rate versus Room.Board where the points are colored by the Private
column.
Create a scatterplot of F.Undergrad versus Outstate where the points are colored by the Private column.
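One way to draw both scatterplots is sketched below using plain matplotlib, one scatter call per value of Private. The helper name scatter_by_private and the numbers in the toy frame are made up for illustration; with the project data you would pass df instead.

```python
import matplotlib.pyplot as plt
import pandas as pd


def scatter_by_private(data, x, y):
    """Scatter y against x, using one color per value of the Private column."""
    fig, ax = plt.subplots(figsize=(6, 6))
    for label, group in data.groupby("Private"):
        ax.scatter(group[x], group[y], label=f"Private = {label}", alpha=0.6)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.legend()
    return ax


# Made-up values, only so the sketch runs on its own.
toy = pd.DataFrame({
    "Private": ["Yes", "Yes", "No", "No"],
    "Room.Board": [4000, 5500, 3500, 3800],
    "Grad.Rate": [70, 85, 55, 60],
    "Outstate": [12000, 15000, 7000, 8000],
    "F.Undergrad": [1500, 2000, 9000, 12000],
})

ax1 = scatter_by_private(toy, x="Room.Board", y="Grad.Rate")
ax2 = scatter_by_private(toy, x="Outstate", y="F.Undergrad")
```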
Create a stacked histogram showing Out of State Tuition based on the Private column. Try doing
this using sns.FacetGrid
(https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.FacetGrid.html). If that is too
tricky, see if you can do it just by using two instances of pandas.plot(kind='hist').
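The fallback mentioned above, two calls to plot(kind='hist') on the same axes, might look like the sketch below. The toy values are made up, and the result is two semi-transparent overlaid histograms rather than strictly stacked bars.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up tuition values, only so the sketch runs on its own.
toy = pd.DataFrame({
    "Private": ["Yes"] * 5 + ["No"] * 5,
    "Outstate": [12000, 15000, 9000, 18000, 11000, 7000, 8000, 5000, 6500, 9000],
})

fig, ax = plt.subplots(figsize=(10, 4))

# One histogram per group, drawn on the same axes with some transparency.
toy.loc[toy["Private"] == "Yes", "Outstate"].plot(
    kind="hist", bins=5, alpha=0.6, ax=ax, label="Private = Yes"
)
toy.loc[toy["Private"] == "No", "Outstate"].plot(
    kind="hist", bins=5, alpha=0.6, ax=ax, label="Private = No"
)
ax.set_xlabel("Outstate")
ax.legend()
```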
Notice how there seems to be a private school with a graduation rate higher than 100%. What is the
name of that school?
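A boolean mask is one simple way to find such a row. The sketch below uses a made-up three-school frame in place of df; the school names and rates are hypothetical.

```python
import pandas as pd

# Hypothetical schools and rates, standing in for the real data frame.
toy = pd.DataFrame(
    {"Private": ["Yes", "Yes", "No"], "Grad.Rate": [72, 105, 64]},
    index=["School A", "School B", "School C"],
)

# Keep only the rows whose graduation rate exceeds 100.
too_high = toy[toy["Grad.Rate"] > 100]
print(too_high.index.tolist())
```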
Out[10]:
                   Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad
Cazenovia College  Yes      3847  3433    527     9          35         1010         12
Set that school's graduation rate to 100 so it makes sense. You may get a warning (not an error) when
doing this operation, so use dataframe operations or just re-do the histogram visualization to make sure
the change actually went through.
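A sketch of one way to make the change without the warning, again on a made-up frame with hypothetical school names: a single .loc[row, column] assignment writes directly into the data frame, whereas chained indexing is what typically triggers pandas' SettingWithCopyWarning.

```python
import pandas as pd

# Hypothetical schools and rates, standing in for the real data frame.
toy = pd.DataFrame(
    {"Private": ["Yes", "Yes", "No"], "Grad.Rate": [72, 105, 64]},
    index=["School A", "School B", "School C"],
)

# Select the row and the column in one .loc call, then assign.
toy.loc["School B", "Grad.Rate"] = 100

# Check that the change went through: no rate above 100 should remain.
print(toy[toy["Grad.Rate"] > 100])
```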
Fit the model to all the data except for the Private label.
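A minimal sketch of this step with scikit-learn, assuming two clusters and reusing the myKMC name from this project. The toy frame, n_init=10, and random_state=0 are assumptions made so the example is self-contained and repeatable; with the project data you would fit on df.drop('Private', axis=1).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Made-up stand-in for df: two well-separated groups of ten schools each.
toy = pd.DataFrame({
    "Private": ["Yes"] * 10 + ["No"] * 10,
    "Outstate": np.concatenate([rng.normal(12000, 500, 10), rng.normal(7000, 500, 10)]),
    "F.Undergrad": np.concatenate([rng.normal(1500, 200, 10), rng.normal(9000, 500, 10)]),
})

# Create the model with 2 clusters and fit it on every column except the label.
myKMC = KMeans(n_clusters=2, n_init=10, random_state=0)
myKMC.fit(toy.drop("Private", axis=1))

# One row per cluster, one column per feature.
print(myKMC.cluster_centers_)
```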
In [16]: myKMC.cluster_centers_
Evaluation
There is no perfect way to evaluate clustering if you don't have the labels. However, since this is just an exercise, we do have
the labels, so we take advantage of this to evaluate our clusters. Keep in mind that you usually won't have this luxury in the real
world.
Create a new column for df called 'Cluster', which is a 1 for a Private school, and a 0 for a public school.
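One possible way to build the column is to compare Private with 'Yes' and cast the result to an integer. The three-row frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical schools, standing in for the real data frame.
toy = pd.DataFrame(
    {"Private": ["Yes", "No", "Yes"]},
    index=["School A", "School B", "School C"],
)

# True/False from the comparison becomes 1/0 after the cast.
toy["Cluster"] = (toy["Private"] == "Yes").astype(int)
print(toy)
```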
In [18]: df.head()
Out[18]:
                              Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad
Abilene Christian University  Yes      1660  1232    721     23         52         2885         537
Adelphi University            Yes      2186  1924    512     16         29         2683         1227
Adrian College                Yes      1428  1097    336     22         50         1036         99
Agnes Scott College           Yes      417   349     137     60         89         510          63
Alaska Pacific University     Yes      193   146     55      16         44         249          869
Create a confusion matrix and classification report to see how well the Kmeans clustering worked
without being given any labels.
[[138 74]
[531 34]]
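Not so bad, considering the algorithm uses only the features to cluster the universities into 2 distinct groups! Keep in mind
that K Means numbers its clusters arbitrarily. Reading the rows of the matrix as the true labels, 531 of the 565 private schools
and 138 of the 212 public schools landed in cluster 0, so the cluster numbers are swapped relative to our 'Cluster' column, and
with the numbers swapped the clusters agree with the labels for (531 + 74) / 777, or about 78%, of the schools. Hopefully
you can begin to see how K Means is useful for clustering unlabeled data!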
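A sketch of the evaluation step with scikit-learn's confusion_matrix and classification_report. The short label lists are made up; with the project data you would pass df['Cluster'] and the fitted model's labels_. The second half redoes the arithmetic on the matrix printed above, assuming its rows are the true labels and its columns the cluster numbers.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Made-up example: true labels (the Cluster column) and K Means assignments.
true_labels = [1, 1, 1, 0, 0, 1]
kmeans_labels = [0, 0, 0, 1, 1, 1]  # cluster numbers are arbitrary and may be swapped

print(confusion_matrix(true_labels, kmeans_labels))
print(classification_report(true_labels, kmeans_labels))

# The matrix printed in this project (assumed: rows = true label, columns = cluster number).
cm = np.array([[138, 74], [531, 34]])
agree_as_is = np.trace(cm)             # 138 + 34
agree_swapped = np.trace(cm[:, ::-1])  # 74 + 531, after swapping the cluster numbers
best_agreement = max(agree_as_is, agree_swapped) / cm.sum()
print(agree_as_is, agree_swapped, round(best_agreement, 3))
```

Because the cluster numbers carry no meaning, comparing both assignments and keeping the better one is a reasonable way to read a two-cluster result.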
Great Job!