PCA Problem Statement With Answer
PCA Problem Statement With Answer
Instructions:
Please share your answers filled in-line in the word document. Submit code separately
wherever applicable.
Please ensure you update all the details:
Name: Shafiyana Sayyad
Batch ID: 15022022
Topic: Principal Component Analysis
Hints:
1. Business Problem
1.1. What is the business objective?
1.1. Are there any constraints?
2. Work on each feature of the dataset to create a data dictionary as displayed in the below
image:
2.1 Make a table as shown above and provide information about the features such as its data type
and its relevance to the model building. And if not relevant, provide reasons and a description of the
feature.
3. Data Pre-processing
3.1 Data Cleaning, Feature Engineering, etc.
4. Exploratory Data Analysis (EDA):
4.1. Summary.
4.2. Univariate analysis.
4.3. Bivariate analysis.
5. Model Building
5.1 Build the model on the scaled data (try multiple options).
5.2 Perform PCA analysis and get the maximum variance between components.
5.3 Perform clustering before and after applying PCA to cross the number of clusters
formed.
5.4 Briefly explain the model output in the documentation.
Problem Statement: -
Perform hierarchical and K-means clustering on the dataset. After that, perform PCA on the
dataset and extract the first 3 principal components and make a new dataset with these 3
principal components as the columns. Now, on this new dataset, perform hierarchical and K-
means clustering. Compare the results of clustering on the original dataset and clustering on
the principal components dataset (use the scree plot technique to obtain the optimum
number of clusters in K-means clustering and check if you’re getting similar results with and
without PCA).
Answer:
import pandas as pd
In [1]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import accuracy_score sns.set()
from sklearn.decomposition import PCA
dataset.tail()
In [3]:
Out[3]:
Type Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids Proanthoc
Data preprocessing
# not going to falloe EDA step here since it is already done in link1.(Above cell)
In [5]:
# as we know ID & award will not make much contribution during clutering. we will drop
Out[5]:
Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids Proanthocyanins Colo
0 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.6
1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.3
2 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.6
3 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.8
4 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.3
dataset.info()
In [6]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
# Column Non-Null Count Dtype
Out[7]:
sns.pairplot(dataset.iloc[:,0:12])
<seaborn
.axisgrid.
PairGrid
at
0x2641a5
50e80>
Norma
lizing
data
for any
type of
cluster
ing
def norm_func(i):
In [8]:
x = (i-i.min())/(i.max()-i.min())
return (x)
df_norm = norm_func(dataset.iloc[:,1:]) print(df_norm)
Proline
0 0.561341
1 0.550642
2 0.646933
3 0.857347
4 0.325963
.. ...
173 0.329529
174 0.336662
175 0.397290
176 0.400856
177 0.201141
print(pca_std_df)
In [12]:
[[ 3.31675081e+00 -1.44346263e+00 -1.65739045e-01] [ 2.20946492e+00 3.33392887e-01 -2.02645737e+00]
[ 2.51674015e+00 -1.03115130e+00 9.82818670e-01] [ 3.75706561e+00 -2.75637191e+00 -1.76191842e-01]
[ 1.00890849e+00 -8.69830821e-01 2.02668822e+00] [ 3.05025392e+00 -2.12240111e+00 -6.29395827e-01]
[ 2.44908967e+00 -1.17485013e+00 -9.77094891e-01] [ 2.05943687e+00 -1.60896307e+00 1.46281883e-01]
[ 2.51087430e+00 -9.18070957e-01 -1.77096903e+00] [ 2.75362819e+00 -7.89437674e-01 -9.84247490e-01]
[ 3.47973668e+00 -1.30233324e+00 -4.22735217e-01] [ 1.75475290e+00 -6.11977229e-01 -1.19087832e+00]
[ 2.11346234e+00 -6.75706339e-01 -8.65086426e-01] [ 3.45815682e+00 -1.13062988e+00 -1.20427635e+00]
[ 4.31278391e+00 -2.09597558e+00 -1.26391275e+00] [ 2.30518820e+00 -1.66255173e+00 2.17902616e-01]
[ 2.17195527e+00 -2.32730534e+00 8.31729866e-01] [ 1.89897118e+00 -1.63136888e+00 7.94913792e-01]
[ 3.54198508e+00 -2.51834367e+00 -4.85458508e-01]
In [13]: pca_std_df.shape
(178, 3)
Out[13]:
# eigenvalues..
In [14]:
print(pca_std.singular_values_)
[28.94203422 21.08225141 16.04371561]
# variance containing in each formed PCA
In [15]:
print(pca_std.explained_variance_ratio_*100)
Conclusion:
by applying PCA on standardized data with 95% variance it gives 3 PCA components
MODEL 1 - KMeans
import warnings
warnings.filterwarnings('ignore')
y_kmeans
Out[18]: array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 1, 1, 0, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1])
plt.figure(figsize=(8,6))
In [19]:
plt.scatter(pca_std_df[:,0], pca_std_df[:,1], cmap='tab10') plt.xlabel('First principal component')
plt.ylabel('Second Principal Component')
Heirarchial Clustering
print(pca_std_df)
In [25]:
[[ 3.31675081e+00 -1.44346263e+00 -1.65739045e-01] [ 2.20946492e+00 3.33392887e-01 -2.02645737e+00]
[ 2.51674015e+00 -1.03115130e+00 9.82818670e-01] [ 3.75706561e+00 -2.75637191e+00 -1.76191842e-01]
[ 1.00890849e+00 -8.69830821e-01 2.02668822e+00] [ 3.05025392e+00 -2.12240111e+00 -6.29395827e-01]
[ 2.44908967e+00 -1.17485013e+00 -9.77094891e-01] [ 2.05943687e+00 -1.60896307e+00 1.46281883e-01]
[ 2.51087430e+00 -9.18070957e-01 -1.77096903e+00] [ 2.75362819e+00 -7.89437674e-01 -9.84247490e-01]
[ 3.47973668e+00 -1.30233324e+00 -4.22735217e-01] [ 1.75475290e+00 -6.11977229e-01 -1.19087832e+00]
[ 2.11346234e+00 -6.75706339e-01 -8.65086426e-01] [ 3.45815682e+00 -1.13062988e+00 -1.20427635e+00]
[ 4.31278391e+00 -2.09597558e+00 -1.26391275e+00] [ 2.30518820e+00 -1.66255173e+00 2.17902616e-01]
[ 2.17195527e+00 -2.32730534e+00 8.31729866e-01] [ 1.89897118e+00 -1.63136888e+00 7.94913792e-01]
[ 3.54198508e+00 -2.51834367e+00 -4.85458508e-01]
# create clusters
In [36]:
hc = AgglomerativeClustering(n_clusters=3, affinity = 'euclidean', linkage = 'single') hc
Out[37]: array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2], dtype=int64)
Clusters=pd.DataFrame(y_hc,columns=['Clusters']) Clusters
In [38]:
Out[38]:
Clusters
0 2
1 2
2 2
3 2
4 2
... ...
173 2
174 2
175 2
176 2
177 2
In [ ]:
Problem Statement: -
Note: This is just a snapshot of the data. The datasets can be downloaded from AiSpry LMS in
the Hands-On Material section.