PCA Problem Statement With Answer

The document discusses performing dimension reduction using principal component analysis (PCA) on a wine dataset. It preprocesses the data by normalizing it, then performs PCA to extract the first 3 principal components. It clusters the original dataset and the reduced 3D PCA dataset using hierarchical and K-means clustering. It compares the results of clustering before and after PCA to see if similar clusters are obtained.


Topic: Dimension Reduction With PCA

Instructions:
Please share your answers filled in-line in the word document. Submit code separately
wherever applicable.
Please ensure you update all the details:
Name: Shafiyana Sayyad
Batch ID: 15022022
Topic: Principal Component Analysis

Hints:
1. Business Problem
1.1. What is the business objective?
1.2. Are there any constraints?

2. Work on each feature of the dataset to create a data dictionary as displayed in the below
image:

2.1 Make a table as shown above and provide information about each feature, such as its data type
and its relevance to model building. If a feature is not relevant, provide reasons and a description of the
feature.

3. Data Pre-processing
3.1 Data Cleaning, Feature Engineering, etc.
4. Exploratory Data Analysis (EDA):
4.1. Summary.
4.2. Univariate analysis.
4.3. Bivariate analysis.

5. Model Building
5.1 Build the model on the scaled data (try multiple options).
5.2 Perform PCA analysis and get the maximum variance between components.
5.3 Perform clustering before and after applying PCA to cross-check the number of clusters
formed.
5.4 Briefly explain the model output in the documentation.



6. Write about the benefits/impact of the solution - in what way does the business (client)
benefit from the solution provided?

Problem Statement:
Perform hierarchical and K-means clustering on the dataset. After that, perform PCA on the
dataset, extract the first 3 principal components, and make a new dataset with these 3
principal components as the columns. Now, on this new dataset, perform hierarchical and
K-means clustering. Compare the results of clustering on the original dataset and clustering on
the principal components dataset (use the scree plot technique to obtain the optimum
number of clusters in K-means clustering and check if you're getting similar results with and
without PCA).

Answer:
In [1]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import accuracy_score
sns.set()
from sklearn.decomposition import PCA
from matplotlib.colors import ListedColormap

dataset = pd.read_csv('wine.csv')
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values


In [2]:
dataset.head()



Out[2]:
Type Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids Proanthocya

0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28

1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26

2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30

3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24

4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39

In [3]:
dataset.tail()

Out[3]:
Type Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids Proanthoc

173 3 13.71 5.65 2.45 20.5 95 1.68 0.61 0.52

174 3 13.40 3.91 2.48 23.0 102 1.80 0.75 0.43

175 3 13.27 4.28 2.26 20.0 120 1.59 0.69 0.43

176 3 13.17 2.59 2.37 20.0 120 1.65 0.68 0.53

177 3 14.13 4.10 2.74 24.5 96 2.05 0.76 0.56



In [4]:
dataset.describe()

Out[4]:
Type Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids N

count 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000

mean 1.938202 13.000618 2.336348 2.366517 19.494944 99.741573 2.295112 2.029270

std 0.775035 0.811827 1.117146 0.274344 3.339564 14.282484 0.625851 0.998859

min 1.000000 11.030000 0.740000 1.360000 10.600000 70.000000 0.980000 0.340000

25% 1.000000 12.362500 1.602500 2.210000 17.200000 88.000000 1.742500 1.205000

50% 2.000000 13.050000 1.865000 2.360000 19.500000 98.000000 2.355000 2.135000

75% 3.000000 13.677500 3.082500 2.557500 21.500000 107.000000 2.800000 2.875000

max 3.000000 14.830000 5.800000 3.230000 30.000000 162.000000 3.880000 5.080000

Data preprocessing

In [5]:
# EDA details are already covered by the head/tail/describe/info output above.
# The 'Type' column is the class label and will not contribute during clustering, so we drop it.
dataset1 = dataset.drop(['Type'], axis=1)
dataset1.head()

Out[5]:
Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids Proanthocyanins Colo

0 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.6

1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.3

2 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.6

3 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.8

4 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.3
In [6]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
# Column Non-Null Count Dtype

0 Type 178 non-null int64


1 Alcohol 178 non-null float64
2 Malic 178 non-null float64
3 Ash 178 non-null float64
4 Alcalinity 178 non-null float64
5 Magnesium 178 non-null int64
6 Phenols 178 non-null float64
7 Flavanoids 178 non-null float64
8 Nonflavanoids 178 non-null float64
9 Proanthocyanins 178 non-null float64
10 Color 178 non-null float64
11 Hue 178 non-null float64
12 Dilution 178 non-null float64
13 Proline 178 non-null int64
dtypes: float64(11), int64(3)
memory usage: 19.6 KB
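
As part of data cleaning (hint 3.1), a quick check for missing values and duplicate rows could be added before clustering; this is a minimal sketch using the dataset DataFrame loaded above:

# quick data-quality checks on the wine data
print(dataset.isna().sum())        # per-column count of missing values
print(dataset.duplicated().sum())  # number of fully duplicated rows

The info() output above already shows 178 non-null entries in every column, so no imputation is expected to be needed.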
In [7]:
sns.pairplot(dataset.iloc[:,0:12])

Out[7]:
<seaborn.axisgrid.PairGrid at 0x2641a550e80>

Normalizing data for any type of clustering

In [8]:
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return (x)

df_norm = norm_func(dataset.iloc[:,1:])
print(df_norm)

Alcohol Malic Ash Alcalinity Magnesium Phenols \


0 0.842105 0.191700 0.572193 0.257732 0.619565 0.627586
1 0.571053 0.205534 0.417112 0.030928 0.326087 0.575862
2 0.560526 0.320158 0.700535 0.412371 0.336957 0.627586
3 0.878947 0.239130 0.609626 0.319588 0.467391 0.989655
4 0.581579 0.365613 0.807487 0.536082 0.521739 0.627586
.. ... ... ... ... ... ...
173 0.705263 0.970356 0.582888 0.510309 0.271739 0.241379
174 0.623684 0.626482 0.598930 0.639175 0.347826 0.282759
175 0.589474 0.699605 0.481283 0.484536 0.543478 0.210345
176 0.563158 0.365613 0.540107 0.484536 0.543478 0.231034
177 0.815789 0.664032 0.737968 0.716495 0.282609 0.368966

Flavanoids Nonflavanoids Proanthocyanins Color Hue Dilution \


0 0.573840 0.283019 0.593060 0.372014 0.455285 0.970696
1 0.510549 0.245283 0.274448 0.264505 0.463415 0.780220
2 0.611814 0.320755 0.757098 0.375427 0.447154 0.695971
3 0.664557 0.207547 0.558360 0.556314 0.308943 0.798535
4 0.495781 0.490566 0.444795 0.259386 0.455285 0.608059
.. ... ... ... ... ... ...
173 0.056962 0.735849 0.205047 0.547782 0.130081 0.172161
174 0.086498 0.566038 0.315457 0.513652 0.178862 0.106227
175 0.073840 0.566038 0.296530 0.761092 0.089431 0.106227
176 0.071730 0.754717 0.331230 0.684300 0.097561 0.128205
177 0.088608 0.811321 0.296530 0.675768 0.105691 0.120879

Proline
0 0.561341
1 0.550642
2 0.646933
3 0.857347
4 0.325963
.. ...
173 0.329529
174 0.336662
175 0.397290
176 0.400856
177 0.201141

[178 rows x 13 columns]
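
As an aside, the custom norm_func above performs min-max normalization; the MinMaxScaler imported earlier would produce an equivalent result. A small sketch (df_norm_alt is just an illustrative name):

# equivalent min-max scaling with scikit-learn (returns a NumPy array)
mm_scaled = MinMaxScaler().fit_transform(dataset.iloc[:, 1:])
df_norm_alt = pd.DataFrame(mm_scaled, columns=dataset.columns[1:])
print(df_norm_alt.head())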

In [9]:
from sklearn.preprocessing import StandardScaler

std_df = StandardScaler().fit_transform(dataset1)
std_df.shape  # this will be used for K-means

Out[9]: (178, 13)


In [10]:
print(std_df)

[[ 1.51861254 -0.5622498  ...  1.01300893]
 [ 0.24628963 -0.49941338 ...  0.96524152]
 ...
 [ 0.33275817  1.74474449 -0.38935541 ... -1.61212515 -1.48544548  0.28057537]
 [ 0.20923168  0.22769377  0.01273209 ... -1.56825176 -1.40069891  0.29649784]
 [ 1.39508604  1.58316512  1.36520822 ... -1.52437837 -1.42894777 -0.59516041]]

Running PCA on the standardized data.

In [11]:
# applying PCA on std_df
# the problem statement asks for the first 3 principal components, so n_components=3
from sklearn.decomposition import PCA

pca_std = PCA(random_state=14, n_components=3)
pca_std_df = pca_std.fit_transform(std_df)

In [12]:
print(pca_std_df)
[[ 3.31675081e+00 -1.44346263e+00 -1.65739045e-01]
 [ 2.20946492e+00  3.33392887e-01 -2.02645737e+00]
 [ 2.51674015e+00 -1.03115130e+00  9.82818670e-01]
 [ 3.75706561e+00 -2.75637191e+00 -1.76191842e-01]
 [ 1.00890849e+00 -8.69830821e-01  2.02668822e+00]
 [ 3.05025392e+00 -2.12240111e+00 -6.29395827e-01]
 [ 2.44908967e+00 -1.17485013e+00 -9.77094891e-01]
 [ 2.05943687e+00 -1.60896307e+00  1.46281883e-01]
 [ 2.51087430e+00 -9.18070957e-01 -1.77096903e+00]
 [ 2.75362819e+00 -7.89437674e-01 -9.84247490e-01]
 [ 3.47973668e+00 -1.30233324e+00 -4.22735217e-01]
 [ 1.75475290e+00 -6.11977229e-01 -1.19087832e+00]
 [ 2.11346234e+00 -6.75706339e-01 -8.65086426e-01]
 [ 3.45815682e+00 -1.13062988e+00 -1.20427635e+00]
 [ 4.31278391e+00 -2.09597558e+00 -1.26391275e+00]
 [ 2.30518820e+00 -1.66255173e+00  2.17902616e-01]
 [ 2.17195527e+00 -2.32730534e+00  8.31729866e-01]
 [ 1.89897118e+00 -1.63136888e+00  7.94913792e-01]
 [ 3.54198508e+00 -2.51834367e+00 -4.85458508e-01]
 ...]

In [13]:
pca_std_df.shape

Out[13]: (178, 3)

In [14]:
# singular values of each principal component
print(pca_std.singular_values_)

[28.94203422 21.08225141 16.04371561]

In [15]:
# percentage of variance explained by each principal component
print(pca_std.explained_variance_ratio_*100)

[36.1988481  19.20749026 11.12363054]

In [16]:
# cumulative variance ratio: how much total variance is captured
# as principal components are added one by one
cum_variance = np.cumsum(pca_std.explained_variance_ratio_*100)
cum_variance

Out[16]: array([36.1988481 , 55.40633836, 66.52996889])
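
As a side check, a scree plot of the cumulative explained variance over all 13 components shows how many components a higher variance target (e.g. 95%) would require; a sketch, assuming std_df from the standardization step (pca_full and cum_var_full are illustrative names):

# scree plot of cumulative explained variance over all 13 components
pca_full = PCA(random_state=14).fit(std_df)
cum_var_full = np.cumsum(pca_full.explained_variance_ratio_ * 100)
plt.plot(range(1, len(cum_var_full) + 1), cum_var_full, marker='o')
plt.axhline(95, linestyle='--', color='grey')   # 95% variance reference line
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative explained variance (%)')
plt.show()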

Conclusion:

Applying PCA on the standardized data and keeping the first 3 principal components captures about 66.5% of the total variance (36.2% + 19.2% + 11.1%).

MODEL 1 - KMeans

K-means clustering with PCA


In [29]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(10, 8))
wcss = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(df_norm)
    wcss.append(kmeans.inertia_)  # WCSS (inertia) is the criterion the elbow method is based on

plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method - K-means on normalized data')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

import warnings
warnings.filterwarnings('ignore')
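
The elbow curve above is computed on df_norm (the normalized original features). To cross-check the number of clusters after PCA, the same loop can be run on the 3-component scores; a sketch assuming pca_std_df from above (wcss_pca is an illustrative name):

# elbow method on the 3-component PCA scores
wcss_pca = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, init='k-means++', random_state=42)
    km.fit(pca_std_df)
    wcss_pca.append(km.inertia_)
plt.plot(range(1, 11), wcss_pca)
plt.title('The Elbow Method - K-means on PCA scores')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()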

In [18]:
# Fitting K-Means to the normalized dataset with 3 clusters (chosen from the elbow plot)
kmeans = KMeans(n_clusters = 3, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(df_norm)
y_kmeans

Out[18]: array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 1, 1, 0, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1])
In [19]:
plt.figure(figsize=(8,6))
plt.scatter(pca_std_df[:,0], pca_std_df[:,1], cmap='tab10')
plt.xlabel('First principal component')
plt.ylabel('Second Principal Component')

Out[19]: Text(0, 0.5, 'Second Principal Component')


In [20]:
pca_kmeans = pd.concat([dataset1.reset_index(drop = True), pd.DataFrame(df_norm)], axis=1)
pca_kmeans.columns.values[-3: ] = ['Component 1', 'Component 2', 'Component 3']
pca_kmeans['Segment Kmeans PCA'] = y_kmeans
print(pca_kmeans)

Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids \


0 14.23 1.71 2.43 15.6 127 2.80 3.06
1 13.20 1.78 2.14 11.2 100 2.65 2.76
2 13.16 2.36 2.67 18.6 101 2.80 3.24
3 14.37 1.95 2.50 16.8 113 3.85 3.49
4 13.24 2.59 2.87 21.0 118 2.80 2.69
.. ... ... ... ... ... ... ...
173 13.71 5.65 2.45 20.5 95 1.68 0.61
174 13.40 3.91 2.48 23.0 102 1.80 0.75
175 13.27 4.28 2.26 20.0 120 1.59 0.69
176 13.17 2.59 2.37 20.0 120 1.65 0.68
177 14.13 4.10 2.74 24.5 96 2.05 0.76

Nonflavanoids Proanthocyanins Color ... Magnesium Phenols \


0 0.28 2.29 5.64 ... 0.619565 0.627586
1 0.26 1.28 4.38 ... 0.326087 0.575862
2 0.30 2.81 5.68 ... 0.336957 0.627586
3 0.24 2.18 7.80 ... 0.467391 0.989655
4 0.39 1.82 4.32 ... 0.521739 0.627586
.. ... ... ... ... ... ...
173 0.52 1.06 7.70 ... 0.271739 0.241379
174 0.43 1.41 7.30 ... 0.347826 0.282759
175 0.43 1.35 10.20 ... 0.543478 0.210345
176 0.53 1.46 9.30 ... 0.543478 0.231034
177 0.56 1.35 9.20 ... 0.282609 0.368966

Flavanoids Nonflavanoids Proanthocyanins Color Component 1 \


0 0.573840 0.283019 0.593060 0.372014 0.455285
1 0.510549 0.245283 0.274448 0.264505 0.463415
2 0.611814 0.320755 0.757098 0.375427 0.447154
3 0.664557 0.207547 0.558360 0.556314 0.308943
4 0.495781 0.490566 0.444795 0.259386 0.455285
.. ... ... ... ... ...
173 0.056962 0.735849 0.205047 0.547782 0.130081
174 0.086498 0.566038 0.315457 0.513652 0.178862
175 0.073840 0.566038 0.296530 0.761092 0.089431
176 0.071730 0.754717 0.331230 0.684300 0.097561
177 0.088608 0.811321 0.296530 0.675768 0.105691

Component 2 Component 3 Segment Kmeans PCA


0 0.970696 0.561341 2
1 0.780220 0.550642 2
2 0.695971 0.646933 2
3 0.798535 0.857347 2
4 0.608059 0.325963 2
.. ... ... ...
173 0.172161 0.329529 1
174 0.106227 0.336662 1
175 0.106227 0.397290 1
176 0.128205 0.400856 1
177 0.120879 0.201141 1
[178 rows x 27 columns]

In [21]:
pca_kmeans['Segment'] = pca_kmeans['Segment Kmeans PCA'].map({0:'First', 1:'Second', 2:'Third'})

In [30]:
x_axis = pca_kmeans['Component 2']
y_axis = pca_kmeans['Component 1']
plt.figure(figsize=(10,8))
sns.scatterplot(x_axis, y_axis, hue=pca_kmeans['Segment'], palette=['g', 'r', 'c'])
plt.title('Clusters by PCA Components')
plt.show()

In [31]:
x_axis = pca_kmeans['Component 2']
y_axis = pca_kmeans['Component 3']
plt.figure(figsize=(10,8))
sns.scatterplot(x_axis, y_axis, hue=pca_kmeans['Segment'], palette=['g', 'r', 'c'])
plt.title('Clusters by PCA Components')
plt.show()

In [32]:
x_axis = pca_kmeans['Component 1']
y_axis = pca_kmeans['Component 3']
plt.figure(figsize=(10,8))
sns.scatterplot(x_axis, y_axis, hue=pca_kmeans['Segment'], palette=['g', 'r', 'c'])
plt.title('Clusters by PCA Components')
plt.show()
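
The plots above colour combinations of the renamed normalized columns; alternatively, K-means can be fitted directly on the 3 principal-component scores and visualised in the plane of the first two components. A sketch (kmeans_pca and y_kmeans_pca are illustrative names):

# K-means on the PCA scores themselves, visualised on the first two components
kmeans_pca = KMeans(n_clusters=3, init='k-means++', random_state=42)
y_kmeans_pca = kmeans_pca.fit_predict(pca_std_df)
plt.figure(figsize=(8, 6))
sns.scatterplot(x=pca_std_df[:, 0], y=pca_std_df[:, 1], hue=y_kmeans_pca, palette=['g', 'r', 'c'])
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.title('K-means clusters on PCA scores')
plt.show()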

Hierarchical Clustering

In [25]:
print(pca_std_df)

[[ 3.31675081e+00 -1.44346263e+00 -1.65739045e-01]
 [ 2.20946492e+00  3.33392887e-01 -2.02645737e+00]
 [ 2.51674015e+00 -1.03115130e+00  9.82818670e-01]
 ...]   (same 178 x 3 array shown above)

In [26]:
import scipy.cluster.hierarchy as shc

plt.figure(figsize=(10, 6))
plt.title("Wine Data Dendrogram - Ward")
dend = shc.dendrogram(shc.linkage(pca_std_df, method='ward'))

In [36]:
# create clusters
hc = AgglomerativeClustering(n_clusters=3, affinity = 'euclidean', linkage = 'single')
hc

Out[36]: AgglomerativeClustering(linkage='single', n_clusters=3)


In [37]:
# save clusters for chart
y_hc = hc.fit_predict(pca_std_df)
y_hc

Out[37]: array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2], dtype=int64)
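
The dendrogram above was drawn with ward linkage, while the model uses single linkage, which tends to chain most points into one cluster (as the labels above show). For consistency with the dendrogram, a ward-linkage fit could be used instead; a sketch (hc_ward and y_hc_ward are illustrative names):

# agglomerative clustering with ward linkage, consistent with the dendrogram
hc_ward = AgglomerativeClustering(n_clusters=3, linkage='ward')
y_hc_ward = hc_ward.fit_predict(pca_std_df)
print(np.bincount(y_hc_ward))   # cluster sizes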

In [38]:
Clusters = pd.DataFrame(y_hc, columns=['Clusters'])
Clusters

Out[38]:
Clusters

0 2

1 2

2 2

3 2

4 2

... ...

173 2

174 2

175 2

176 2

177 2

178 rows × 1 columns
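
Finally, to compare the clusters obtained with and without PCA (as the problem statement asks), the two label vectors can be cross-tabulated or scored with the adjusted Rand index; a sketch assuming y_kmeans from the K-means fit on df_norm above (kmeans_pca_labels is an illustrative name):

from sklearn.metrics import adjusted_rand_score

# K-means on the PCA scores, for comparison with the clusters on the original data
kmeans_pca_labels = KMeans(n_clusters=3, init='k-means++', random_state=42).fit_predict(pca_std_df)

# agreement between clusterings before and after PCA
print(pd.crosstab(y_kmeans, kmeans_pca_labels, rownames=['Without PCA'], colnames=['With PCA']))
print('Adjusted Rand index:', adjusted_rand_score(y_kmeans, kmeans_pca_labels))

A near-diagonal crosstab (up to label permutation) or an adjusted Rand index close to 1 would indicate that clustering with and without PCA gives essentially the same groups.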

Problem Statement:

A pharmaceuticals manufacturing company is conducting a study on a new medicine to treat
heart diseases. The company has gathered data from its secondary sources and would like you to
provide high-level analytical insights on the data. Its aim is to segregate patients depending on
their age group and other factors given in the data. Perform PCA and clustering algorithms on
the dataset and check if the clusters formed before and after PCA are the same, and provide a
brief report on your model. You can also explore more ways to improve your model.

Note: This is just a snapshot of the data. The datasets can be downloaded from AiSpry LMS in
the Hands-On Material section.
