K Means Clustering Project - Sample

The document discusses a K-Means clustering project to cluster universities into private and public groups without using provided labels. It imports libraries, loads university data, and performs exploratory data analysis including scatter plots of graduation rates vs room and board and enrollments vs tuition colored by private/public. It also creates stacked histograms of tuition and graduation rates to identify a school with over 100% graduation rate, which is set to 100.

Uploaded by

harshita kasaundhan

11/16/22, 12:00 AM K Means Clustering Project

K Means Clustering Project

For this project we will attempt to use KMeans Clustering to cluster Universities into two groups, Private and Public.

It is very important to note that we actually have the labels for this data set, but we will NOT use them for the
KMeans clustering algorithm, since that is an unsupervised learning algorithm.

Under normal circumstances you use the KMeans algorithm because you don't have labels. In this case we will use
the labels to get an idea of how well the algorithm performed, but you won't usually be able to do this for KMeans, so the
classification report and confusion matrix at the end of this project don't truly make sense in a real-world setting!

The Data
We will use a data frame with 777 observations on the following 18 variables.

Private A factor with levels No and Yes indicating private or public university
Apps Number of applications received
Accept Number of applications accepted
Enroll Number of new students enrolled
Top10perc Pct. new students from top 10% of H.S. class
Top25perc Pct. new students from top 25% of H.S. class
F.Undergrad Number of fulltime undergraduates
P.Undergrad Number of parttime undergraduates
Outstate Out-of-state tuition
Room.Board Room and board costs
Books Estimated book costs
Personal Estimated personal spending
PhD Pct. of faculty with Ph.D.’s
Terminal Pct. of faculty with terminal degree
S.F.Ratio Student/faculty ratio
perc.alumni Pct. alumni who donate
Expend Instructional expenditure per student
Grad.Rate Graduation rate

Import Libraries
Import the libraries you usually use for data analysis.

https://amete.github.io/DataSciencePortfolio/Udemy/Python-DS-and-ML-Bootcamp/K_Means_Clustering_Project.html 1

In [1]: import pandas as pd
        import matplotlib.pyplot as plt
        import seaborn as sns
        sns.set_style('whitegrid')
        %matplotlib inline

Get the Data

Read in the College_Data file using read_csv. Figure out how to set the first column as the index.

In [2]: df = pd.read_csv('College_Data', index_col = 0)
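To see what `index_col = 0` does, here is a minimal sketch on a tiny in-memory CSV. The two rows and three columns are copied from the `head()` output in this notebook. The `csv_text` string and `io.StringIO` are assumptions that stand in for the real `College_Data` file, used only for illustration.

```python
import io
import pandas as pd

# Hypothetical miniature of College_Data: the first (unnamed) column holds school names.
csv_text = """,Private,Apps,Accept
Abilene Christian University,Yes,1660,1232
Adelphi University,Yes,2186,1924
"""

# index_col=0 turns the first column into the row index instead of a data column.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.index.tolist())
print(sample.columns.tolist())
print(sample.loc["Adelphi University", "Apps"])
```

With the school name as the index, rows can be looked up by name with `.loc`, and the name does not end up among the features passed to KMeans later on.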

Check the head of the data

In [3]: df.head()

Out[3]:
                              Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad
Abilene Christian University  Yes      1660  1232    721     23         52         2885         537
Adelphi University            Yes      2186  1924    512     16         29         2683         1227
Adrian College                Yes      1428  1097    336     22         50         1036         99
Agnes Scott College           Yes      417   349     137     60         89         510          63
Alaska Pacific University     Yes      193   146     55      16         44         249          869

Check the info() and describe() methods on the data.


In [4]: df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 777 entries, Abilene Christian University to York College of Pennsylvania
Data columns (total 18 columns):
Private 777 non-null object
Apps 777 non-null int64
Accept 777 non-null int64
Enroll 777 non-null int64
Top10perc 777 non-null int64
Top25perc 777 non-null int64
F.Undergrad 777 non-null int64
P.Undergrad 777 non-null int64
Outstate 777 non-null int64
Room.Board 777 non-null int64
Books 777 non-null int64
Personal 777 non-null int64
PhD 777 non-null int64
Terminal 777 non-null int64
S.F.Ratio 777 non-null float64
perc.alumni 777 non-null int64
Expend 777 non-null int64
Grad.Rate 777 non-null int64
dtypes: float64(1), int64(16), object(1)
memory usage: 115.3+ KB

In [5]: df.describe()

Out[5]:
               Apps        Accept       Enroll   Top10perc   Top25perc   F.Undergrad
count    777.000000    777.000000   777.000000  777.000000  777.000000    777.000000
mean    3001.638353   2018.804376   779.972973   27.558559   55.796654   3699.907336
std     3870.201484   2451.113971   929.176190   17.640364   19.804778   4850.420531
min       81.000000     72.000000    35.000000    1.000000    9.000000    139.000000
25%      776.000000    604.000000   242.000000   15.000000   41.000000    992.000000
50%     1558.000000   1110.000000   434.000000   23.000000   54.000000   1707.000000
75%     3624.000000   2424.000000   902.000000   35.000000   69.000000   4005.000000
max    48094.000000  26330.000000  6392.000000   96.000000  100.000000  31643.000000

EDA
It's time to create some data visualizations!

Create a scatterplot of Grad.Rate versus Room.Board where the points are colored by the Private
column.


In [6]: # 'height' replaces the older 'size' argument, which current seaborn releases no longer accept
        sns.lmplot(x = 'Room.Board', y = 'Grad.Rate', data = df, fit_reg = False,
                   hue = 'Private', height = 6, aspect = 1)

Out[6]: <seaborn.axisgrid.FacetGrid at 0x11619e2e8>

Create a scatterplot of F.Undergrad versus Outstate where the points are colored by the Private column.


In [7]: # 'height' replaces the older 'size' argument, which current seaborn releases no longer accept
        sns.lmplot(x = 'Outstate', y = 'F.Undergrad', data = df, fit_reg = False,
                   hue = 'Private', height = 6, aspect = 1)

Out[7]: <seaborn.axisgrid.FacetGrid at 0x11620afd0>

Create a stacked histogram showing Out of State Tuition based on the Private column. Try doing
this using sns.FacetGrid
(https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.FacetGrid.html). If that is too
tricky, see if you can do it just by using two instances of pandas.plot(kind='hist').


In [8]: g = sns.FacetGrid(df, hue = 'Private', height = 6, aspect = 2)
        g = g.map(plt.hist, 'Outst
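As a sketch of the fallback suggested above, the histogram can also be drawn with two pandas `plot(kind='hist')` calls on shared axes, one per `Private` value. The `toy` frame, the bin count, and the `alpha` value are all assumptions chosen for illustration, not the notebook's actual settings.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data standing in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "Private": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "Outstate": [12000, 15500, 9800, 6500, 7200, 8100],
})

fig, ax = plt.subplots(figsize=(12, 6))

# One histogram per group, drawn on the same axes so the two distributions overlap.
for label, group in toy.groupby("Private"):
    group["Outstate"].plot(kind="hist", bins=5, alpha=0.6, ax=ax,
                           label=f"Private = {label}")

ax.set_xlabel("Outstate")
ax.legend()
```

A partly transparent `alpha` keeps both groups visible where the bars overlap. Swapping `"Outstate"` for `"Grad.Rate"` should give the analogous graduation-rate plot.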

Create a similar histogram for the Grad.Rate column.

In [9]: g = sns.FacetGrid(df, hue = 'Private', height = 6, aspect = 2)
        g = g.map(plt.hist, 'Grad.

Notice how there seems to be a private school with a graduation rate higher than 100%. What is the
name of that school?


In [10]: df[df['Grad.Rate'] > 100]

Out[10]:
                   Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad
Cazenovia College  Yes      3847  3433    527     9          35         1010         12

Set that school's graduation rate to 100 so it makes sense. You may get a warning (not an error) when
doing this operation, so use dataframe operations or just re-do the histogram visualization to make sure
it actually went through.

In [11]: df.loc[df['Grad.Rate'] > 100, 'Grad.Rate'] = 100
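The warning mentioned above typically comes from chained indexing, which may write to a temporary copy instead of the frame itself. Selecting rows and column in a single `.loc` call avoids that. A minimal sketch on a toy frame (the school names and rates are invented for illustration) shows the assignment and a quick check that it took effect:

```python
import pandas as pd

# Hypothetical values; the real Grad.Rate figures are not reproduced here.
toy = pd.DataFrame(
    {"Grad.Rate": [118, 60, 100]},
    index=["School A", "School B", "School C"],
)

# Boolean mask + column label in one .loc call modifies toy itself, not a copy.
toy.loc[toy["Grad.Rate"] > 100, "Grad.Rate"] = 100

# Verify the fix: no row should exceed 100 any more.
print(toy[toy["Grad.Rate"] > 100].empty)
print(toy["Grad.Rate"].max())
```

For a simple cap like this, `toy["Grad.Rate"].clip(upper=100)` would return the same capped values as a new Series.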

In [12]: g = sns.FacetGrid(df, hue = 'Private', height = 6, aspect = 2)
         g = g.map(plt.hist, 'Grad.

K Means Cluster Creation

Now it is time to create the Cluster labels!

Import KMeans from SciKit Learn.

In [13]: from sklearn.cluster import KMeans

Create an instance of a K Means model with 2 clusters.

In [14]: myKMC = KMeans(n_clusters = 2)
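K-means starts from randomly chosen centers, so separate runs can number the clusters differently or occasionally settle on different solutions. The sketch below shows how fixing `random_state` makes a run repeatable. It assumes synthetic two-blob data in place of the college features and an arbitrary seed of 42.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: two well-separated blobs of 50 points each (not the college data).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=10.0, scale=1.0, size=(50, 2)),
])

# Same seed and same n_init give identical initializations, hence identical labels.
labels_a = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X).labels_
labels_b = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X).labels_

print((labels_a == labels_b).all())
```

Passing `n_init` explicitly also keeps the behavior stable across scikit-learn versions whose default for that parameter differs.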


Fit the model to all the data except for the Private label.

In [15]: myKMC.fit(df.drop('Private', axis = 1))

Out[15]: KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
                n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
                random_state=None, tol=0.0001, verbose=0)

What are the cluster center vectors?

In [16]: myKMC.cluster_centers_

Out[16]: array([[1.81323468e+03, 1.28716592e+03, 4.91044843e+02, 2.53094170e+01,
                 5.34708520e+01, 2.18854858e+03, 5.95458894e+02, 1.03957085e+04,
                 4.31136472e+03, 5.41982063e+02, 1.28033632e+03, 7.04424514e+01,
                 7.78251121e+01, 1.40997010e+01, 2.31748879e+01, 8.93204634e+03,
                 6.50926756e+01],
                [1.03631389e+04, 6.55089815e+03, 2.56972222e+03, 4.14907407e+01,
                 7.02037037e+01, 1.30619352e+04, 2.46486111e+03, 1.07191759e+04,
                 4.64347222e+03, 5.95212963e+02, 1.71420370e+03, 8.63981481e+01,
                 9.13333333e+01, 1.40277778e+01, 2.00740741e+01, 1.41705000e+04,
                 6.75925926e+01]])
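The raw array is hard to read because the column names are lost. One way to make it legible is to wrap `cluster_centers_` in a DataFrame that reuses the feature names. The sketch below does this on a three-column toy frame with made-up values. The same pattern should carry over to `df.drop('Private', axis = 1)`.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy features; the values are invented to form two obvious groups.
features = pd.DataFrame({
    "Apps": [400, 500, 450, 9000, 10000, 9500],
    "Enroll": [100, 120, 110, 2500, 2600, 2400],
    "S.F.Ratio": [11.0, 12.0, 10.5, 17.0, 18.0, 16.5],
})

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# One row per cluster, one labeled column per feature.
centers = pd.DataFrame(model.cluster_centers_, columns=features.columns)
print(centers.round(1))
```

Each row is the mean of the points assigned to that cluster, so the centers can be compared feature by feature.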

Evaluation
There is no perfect way to evaluate clustering if you don't have the labels. However, since this is just an exercise, we do have
the labels, so we take advantage of them to evaluate our clusters. Keep in mind that you usually won't have this luxury in the real
world.

Create a new column for df called 'Cluster', which is a 1 for a Private school, and a 0 for a public school.

In [17]: df['Cluster'] = df['Private'].apply(lambda x: 1 if x == 'Yes' else 0)
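The `apply` call works. An equivalent vectorized form is a boolean comparison cast to integers. A quick sketch on a toy Series (the three values are illustrative) confirms that the two agree:

```python
import pandas as pd

private = pd.Series(["Yes", "No", "Yes"], name="Private")  # illustrative values

via_apply = private.apply(lambda x: 1 if x == "Yes" else 0)
via_compare = (private == "Yes").astype(int)  # True -> 1, False -> 0

print(via_apply.tolist())
print((via_apply == via_compare).all())
```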


In [18]: df.head()

Out[18]:
                              Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad
Abilene Christian University  Yes      1660  1232    721     23         52         2885         537
Adelphi University            Yes      2186  1924    512     16         29         2683         1227
Adrian College                Yes      1428  1097    336     22         50         1036         99
Agnes Scott College           Yes      417   349     137     60         89         510          63
Alaska Pacific University     Yes      193   146     55      16         44         249          869

Create a confusion matrix and classification report to see how well the Kmeans clustering worked
without being given any labels.

In [19]: from sklearn.metrics import confusion_matrix, classification_report

In [20]: print(confusion_matrix(df['Cluster'], myKMC.labels_))

[[138  74]
 [531  34]]

In [21]: print(classification_report(df['Cluster'], myKMC.labels_))

             precision    recall  f1-score   support

          0       0.21      0.65      0.31       212
          1       0.31      0.06      0.10       565

avg / total       0.29      0.22      0.16       777

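Read the report with care: K Means numbers its clusters arbitrarily, and in this run cluster 0 holds most of the private schools (531 of 565), so the cluster ids are flipped relative to our 0/1 coding and the scores look worse than the clustering really is. With the ids swapped, 605 of the 777 schools (about 78%) land in the matching group. Not so bad considering the algorithm is purely using the features to cluster the universities into 2 distinct groups! Hopefully
you can begin to see how K Means is useful for clustering unlabeled data!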
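As a quick check of how the arbitrary cluster ids affect these scores, the sketch below takes the confusion matrix printed by In [20] and computes accuracy under both possible id-to-label mappings. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the In [20] output:
# rows = true label (0 public, 1 private), columns = K-means cluster id.
cm = np.array([[138, 74],
               [531, 34]])

total = cm.sum()

# Mapping A: cluster 0 -> public, cluster 1 -> private (what the report assumes).
acc_as_is = np.trace(cm) / total

# Mapping B: ids swapped, cluster 0 -> private, cluster 1 -> public.
acc_swapped = np.trace(cm[:, ::-1]) / total

print(f"as-is:   {acc_as_is:.3f}")    # 0.221
print(f"swapped: {acc_swapped:.3f}")  # 0.779
```

Taking the better of the two mappings is reasonable here only because the labels are available. With more than two clusters, the same idea would need a search over all id permutations.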

Great Job!
