PCA - Jupyter Notebook

This notebook performs principal component analysis (PCA) on university data to reduce its dimensionality. It loads and cleans the data, applies PCA to extract four principal components that explain over 95% of the variance, and groups the PCA-transformed data into three clusters using k-means and hierarchical clustering. The cluster labels are then merged back into the original data.


In [1]: import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale

In [2]: univ = pd.read_csv("D:\\Course PPTS\\R Codes\\7 PCA codes\\Universities.csv")


univ

...

In [3]: del univ['Univ']
univ

...

In [4]: univ_normal = scale(univ)


univ_normal.shape

Out[4]: (25, 6)

In [8]: univ_normal

...
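
A quick check, not in the original notebook, that scale() produced standardized columns (a minimal sketch, assuming the cells above have run):

In [ ]: # sanity check (added): each scaled column should have mean ~0 and standard deviation ~1
np.round(univ_normal.mean(axis=0), 3), np.round(univ_normal.std(axis=0), 3)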

In [15]: # import PCA from scikit-learn
from sklearn.decomposition import PCA

# create a PCA object that keeps four principal components
pca = PCA(n_components=4)

In [16]: pca

Out[16]: PCA(n_components=4)

In [17]: pca_values = pca.fit_transform(univ_normal)


pca_values

...

In [18]: #PCA Scores


principalDf = pd.DataFrame(pca_values)

...
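
To see how each original variable contributes to the retained components, the loadings in pca.components_ can be tabulated. This is an added sketch, not part of the original notebook; it assumes univ still holds the six numeric columns at this point, and the 'PC1'..'PC4' labels are added here only for readability:

In [ ]: # loadings (added sketch): rows = principal components, columns = original variables
pd.DataFrame(pca.components_,
             index=['PC1', 'PC2', 'PC3', 'PC4'],
             columns=univ.columns)
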
In [19]: # the amount of variance that each principal component explains
var = pca.explained_variance_ratio_

In [20]: var

Out[20]: array([0.76868084, 0.13113602, 0.04776031, 0.02729668])

In [21]: var1 = np.cumsum(np.round(var,decimals = 2)*100)


var1

Out[21]: array([77., 90., 95., 98.])
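
The cumulative figures above can also be read off a scree plot. A minimal sketch, not in the original notebook, using the matplotlib and numpy imports from In [1]:

In [ ]: # scree plot (added sketch): individual and cumulative explained variance per component
plt.plot(range(1, 5), var, marker='o', label='individual')
plt.plot(range(1, 5), np.cumsum(var), marker='s', label='cumulative')
plt.xlabel('Principal component')
plt.ylabel('Explained variance ratio')
plt.legend()
plt.show()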

In [22]: # principalDf = pd.DataFrame(data=pca_values,
#                          columns=['principal component 1', 'principal component 2',
#                                   'principal component 3', 'principal component 4'])

In [23]: principalDf

Out[23]: principal component 1 principal component 2 principal component 3 principal component 4
0 -1.009874 -1.064310 0.081066 0.056951
1 -2.822238 2.259045 0.836829 0.143845
2 1.112466 1.631209 -0.266787 1.075075
3 -0.741741 -0.042187 0.060501 -0.157208
4 -0.311912 -0.635244 0.010241 0.171364
5 -1.696691 -0.344363 -0.253408 0.012564
6 -1.246821 -0.490984 -0.032094 -0.205644
7 -0.338750 -0.785169 -0.493585 0.039856
8 -2.374150 -0.386539 0.116098 -0.453366
9 -1.403277 2.119515 -0.442827 -0.632543
10 -1.726103 0.088237 0.170404 0.260902
11 -0.450857 -0.011133 -0.175746 0.236166
12 0.040238 -1.009204 -0.496517 0.229299
13 3.233730 -0.374580 -0.495373 -0.521238
14 -2.236265 -0.371793 -0.398994 0.406966
15 5.172992 0.779915 -0.385912 -0.232212
16 -1.699644 -0.305597 0.318508 -0.297463
17 4.578146 -0.347591 1.499642 -0.454252
18 0.822603 -0.698906 1.427811 0.760779
19 -0.097762 0.650446 0.100508 -0.500097
20 1.963183 -0.224768 -0.255881 -0.048474
21 -0.542289 -0.079589 -0.305393 0.131699
22 0.532221 -1.017167 -0.423716 0.169536
23 3.548697 0.778462 -0.449363 0.323679
24 -2.305900 -0.117704 0.253989 -0.516183

In [30]: from sklearn.cluster import KMeans


from scipy.spatial.distance import cdist
model=KMeans(n_clusters=3)

In [31]: abc = model.fit(principalDf)

In [32]: abc

Out[32]: KMeans(n_clusters=3)
In [33]: y_kmeans = model.predict(principalDf)

In [34]: y_kmeans

Out[34]: array([0, 2, 0, 0, 0, 2, 2, 0, 2, 2, 2, 0, 0, 1, 2, 1, 2, 1, 0, 0, 1, 0,
       0, 1, 2])
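
cdist is imported above but never used; one common use is an elbow curve to support the choice of k = 3. A minimal sketch, not part of the original notebook:

In [ ]: # elbow curve (added sketch): mean distance to the nearest centroid for k = 1..8
distortions = []
for k in range(1, 9):
    km = KMeans(n_clusters=k).fit(principalDf)
    distortions.append(cdist(principalDf, km.cluster_centers_, 'euclidean').min(axis=1).mean())
plt.plot(range(1, 9), distortions, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Mean distance to nearest centroid')
plt.show()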

In [35]: univ['clusters']=pd.Series(y_kmeans)

In [36]: univ

...

In [27]: # Using the dendrogram to find the optimal number of clusters


import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(principalDf, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Universities')
plt.ylabel('Euclidean distances')
plt.show()

...

In [28]: from scipy.cluster.hierarchy import cophenet


import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist
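
cophenet and pdist are imported here but not used in the notebook; a typical follow-up is the cophenetic correlation, which measures how faithfully the linkage preserves the original pairwise distances. A minimal sketch, not part of the original notebook:

In [ ]: # cophenetic correlation (added sketch): values close to 1 mean the dendrogram preserves distances well
Z = sch.linkage(principalDf, method='ward')
c, coph_dists = cophenet(Z, pdist(principalDf))
c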

In [38]: # Fitting hierarchical clustering to the dataset
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='complete')
test = cluster.fit_predict(principalDf)

In [39]: test

Out[39]: array([0, 0, 2, 2, 2, 0, 0, 2, 0, 0, 0, 2, 2, 1, 0, 1, 0, 1, 2, 2, 1, 2,
       2, 1, 0], dtype=int64)
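
The k-means and hierarchical labels use arbitrary cluster numbers, so they are best compared by cross-tabulation rather than element-wise. A minimal sketch, not part of the original notebook:

In [ ]: # added sketch: how the two clusterings overlap (only the grouping matters, not the label values)
pd.crosstab(pd.Series(y_kmeans, name='kmeans'), pd.Series(test, name='hierarchical'))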

In [41]: univ['clusters']=pd.Series(test)
In [42]: univ

Out[42]: SAT Top10 Accept SFRatio Expenses GradRate clusters
0 1310 89 22 13 22704 94 0
1 1415 100 25 6 63575 81 0
2 1260 62 59 9 25026 72 2
3 1310 76 24 12 31510 88 2
4 1280 83 33 13 21864 90 2
5 1340 89 23 10 32162 95 0
6 1315 90 30 12 31585 95 0
7 1255 74 24 12 20126 92 2
8 1400 91 14 11 39525 97 0
9 1305 75 44 7 58691 87 0
10 1380 94 30 10 34870 91 0
11 1260 85 39 11 28052 89 2
12 1255 81 42 13 15122 94 2
13 1081 38 54 18 10185 80 1
14 1375 91 14 8 30220 95 0
15 1005 28 90 19 9066 69 1
16 1360 90 20 12 36450 93 0
17 1075 49 67 25 8704 67 1
18 1240 95 40 17 15140 78 2
19 1290 75 50 13 38380 87 2
20 1180 65 68 16 15470 85 1
21 1285 80 36 11 27553 90 2
22 1225 77 44 14 13349 92 2
23 1085 40 69 15 11857 71 1
24 1375 95 19 11 43514 96 0
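
With the hierarchical labels merged back, a per-cluster profile of the original variables summarizes what distinguishes the three groups. A minimal sketch, not part of the original notebook:

In [ ]: # added sketch: average of each original variable within each cluster
univ.groupby('clusters').mean()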

In [ ]:
