0% found this document useful (0 votes)

91 views22 pages

PCA Problem Statement With Answer

The document discusses performing dimension reduction using principal component analysis (PCA) on a wine dataset. It preprocesses the data by normalizing it, then performs PCA to extract the first 3 principal components. It clusters the original dataset and the reduced 3D PCA dataset using hierarchical and K-means clustering. It compares the results of clustering before and after PCA to see if similar clusters are obtained.

Uploaded by

SBS Movies

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

91 views22 pages

PCA Problem Statement With Answer

Uploaded by

SBS Movies

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 22

Topic: Dimension Reduction With PCA

Instructions:
Please share your answers filled in-line in the word document. Submit code separately
wherever applicable.
Please ensure you update all the details:
Name: Shafiyana Sayyad
Batch ID: 15022022
Topic: Principal Component Analysis

Hints:
1. Business Problem
1.1. What is the business objective?
1.1. Are there any constraints?

2. Work on each feature of the dataset to create a data dictionary as displayed in the below
image:

2.1 Make a table as shown above and provide information about the features such as its data type
and its relevance to the model building. And if not relevant, provide reasons and a description of the
feature.

3. Data Pre-processing
3.1 Data Cleaning, Feature Engineering, etc.
4. Exploratory Data Analysis (EDA):
4.1. Summary.
4.2. Univariate analysis.
4.3. Bivariate analysis.

5. Model Building
5.1 Build the model on the scaled data (try multiple options).
5.2 Perform PCA analysis and get the maximum variance between components.
5.3 Perform clustering before and after applying PCA to cross the number of clusters
formed.
5.4 Briefly explain the model output in the documentation.

© 2013 - 2021 360DigiTMG. All Rights Reserved.

6. Write about the benefits/impact of the solution - in what way does the business (client)
benefit from the solution provided?

Problem Statement: -
Perform hierarchical and K-means clustering on the dataset. After that, perform PCA on the
dataset and extract the first 3 principal components and make a new dataset with these 3
principal components as the columns. Now, on this new dataset, perform hierarchical and K-
means clustering. Compare the results of clustering on the original dataset and clustering on
the principal components dataset (use the scree plot technique to obtain the optimum
number of clusters in K-means clustering and check if you’re getting similar results with and
without PCA).

Answer:
import pandas as pd
In [1]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import accuracy_score sns.set()
from sklearn.decomposition import PCA

from matplotlib.colors import ListedColormap

dataset = pd.read_csv('wine.csv') X = dataset.iloc[:, 0:13].values y = dataset.iloc[:, 13].values

In [2]:
dataset.head()

© 2013 - 2021 360DigiTMG. All Rights Reserved.

Out[2]:
Type Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids Proanthocya

0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28

1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26

2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30

3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24

4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39

dataset.tail()
In [3]:
Out[3]:
Type Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids Proanthoc

173 3 13.71 5.65 2.45 20.5 95 1.68 0.61 0.52

174 3 13.40 3.91 2.48 23.0 102 1.80 0.75 0.43

175 3 13.27 4.28 2.26 20.0 120 1.59 0.69 0.43

176 3 13.17 2.59 2.37 20.0 120 1.65 0.68 0.53

177 3 14.13 4.10 2.74 24.5 96 2.05 0.76 0.56

© 2013 - 2021 360DigiTMG. All Rights Reserved.

dataset.describe()
In [4]:
Out[4]:
Type Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids N

count 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000

mean 1.938202 13.000618 2.336348 2.366517 19.494944 99.741573 2.295112 2.029270

std 0.775035 0.811827 1.117146 0.274344 3.339564 14.282484 0.625851 0.998859

min 1.000000 11.030000 0.740000 1.360000 10.600000 70.000000 0.980000 0.340000

25% 1.000000 12.362500 1.602500 2.210000 17.200000 88.000000 1.742500 1.205000

50% 2.000000 13.050000 1.865000 2.360000 19.500000 98.000000 2.355000 2.135000

75% 3.000000 13.677500 3.082500 2.557500 21.500000 107.000000 2.800000 2.875000

max 3.000000 14.830000 5.800000 3.230000 30.000000 162.000000 3.880000 5.080000

Data preprocessing
# not going to falloe EDA step here since it is already done in link1.(Above cell)
In [5]:
# as we know ID & award will not make much contribution during clutering. we will drop

dataset1 = dataset.drop(['Type'], axis=1) dataset1.head()

Out[5]:
Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids Proanthocyanins Colo

0 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.6

1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.3

2 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.6

3 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.8

4 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.3
dataset.info()
In [6]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
# Column Non-Null Count Dtype

0 Type 178 non-null int64

1 Alcohol 178 non-null float64
2 Malic 178 non-null float64
3 Ash 178 non-null float64
4 Alcalinity 178 non-null float64
5 Magnesium 178 non-null int64
6 Phenols 178 non-null float64
7 Flavanoids 178 non-null float64
8 Nonflavanoids 178 non-null float64
9 Proanthocyanins 178 non-null float64
10 Color 178 non-null float64
11 Hue 178 non-null float64
12 Dilution 178 non-null float64
13 Proline 178 non-null int64
dtypes: float64(11), int64(3)
memory usage: 19.6 KB
In [7]:

Out[7]:
sns.pairplot(dataset.iloc[:,0:12])

<seaborn
.axisgrid.
PairGrid
at
0x2641a5
50e80>
Norma
lizing
data
for any
type of
cluster
ing
def norm_func(i):
In [8]:
x = (i-i.min())/(i.max()-i.min())
return (x)
df_norm = norm_func(dataset.iloc[:,1:]) print(df_norm)

Alcohol Malic Ash Alcalinity Magnesium Phenols \

0 0.842105 0.191700 0.572193 0.257732 0.619565 0.627586
1 0.571053 0.205534 0.417112 0.030928 0.326087 0.575862
2 0.560526 0.320158 0.700535 0.412371 0.336957 0.627586
3 0.878947 0.239130 0.609626 0.319588 0.467391 0.989655
4 0.581579 0.365613 0.807487 0.536082 0.521739 0.627586
.. ... ... ... ... ... ...
173 0.705263 0.970356 0.582888 0.510309 0.271739 0.241379
174 0.623684 0.626482 0.598930 0.639175 0.347826 0.282759
175 0.589474 0.699605 0.481283 0.484536 0.543478 0.210345
176 0.563158 0.365613 0.540107 0.484536 0.543478 0.231034
177 0.815789 0.664032 0.737968 0.716495 0.282609 0.368966

Flavanoids Nonflavanoids Proanthocyanins Color Hue Dilution \

0 0.573840 0.283019 0.593060 0.372014 0.455285 0.970696
1 0.510549 0.245283 0.274448 0.264505 0.463415 0.780220
2 0.611814 0.320755 0.757098 0.375427 0.447154 0.695971
3 0.664557 0.207547 0.558360 0.556314 0.308943 0.798535
4 0.495781 0.490566 0.444795 0.259386 0.455285 0.608059
.. ... ... ... ... ... ...
173 0.056962 0.735849 0.205047 0.547782 0.130081 0.172161
174 0.086498 0.566038 0.315457 0.513652 0.178862 0.106227
175 0.073840 0.566038 0.296530 0.761092 0.089431 0.106227
176 0.071730 0.754717 0.331230 0.684300 0.097561 0.128205
177 0.088608 0.811321 0.296530 0.675768 0.105691 0.120879

Proline
0 0.561341
1 0.550642
2 0.646933
3 0.857347
4 0.325963
.. ...
173 0.329529
174 0.336662
175 0.397290
176 0.400856
177 0.201141

[178 rows x 13 columns]

In [9]: from sklearn.preprocessing import StandardScaler

std_df = StandardScaler().fit_transform(dataset1) std_df.shape # this will used for kmeans

Out[9]: (178, 13)

In [10]: print(std_df)
[[ 1.51861254436
-0.5622498
... 0.31830389 0.78858745
1.01300893] 1.39514818]
[ 0.24628963 -0.49941338
...
0.96524152]
[ 0.1968 [ 0.33275817 1.74474449 -0.38935541 ... -1.61212515 -1.48544548
0.28057537]
7903
[ 0.20923168 0.22769377 0.01273209 ... -1.56825176 -1.40069891
0.02123 0.29649784]
125 [ 1.39508604 1.58316512 1.36520822 ... -1.52437837 -1.42894777
1.10933 -0.59516041]]

Running PCA of standardized data.

# applying PCA on std_df
In [11]:
# we are considering 95% variance in n_components to not loose any data.

from sklearn.decomposition import PCA

pca_std = PCA(random_state=14, n_components=3) pca_std_df= pca_std.fit_transform(std_df)

print(pca_std_df)
In [12]:
[[ 3.31675081e+00 -1.44346263e+00 -1.65739045e-01] [ 2.20946492e+00 3.33392887e-01 -2.02645737e+00]
[ 2.51674015e+00 -1.03115130e+00 9.82818670e-01] [ 3.75706561e+00 -2.75637191e+00 -1.76191842e-01]
[ 1.00890849e+00 -8.69830821e-01 2.02668822e+00] [ 3.05025392e+00 -2.12240111e+00 -6.29395827e-01]
[ 2.44908967e+00 -1.17485013e+00 -9.77094891e-01] [ 2.05943687e+00 -1.60896307e+00 1.46281883e-01]
[ 2.51087430e+00 -9.18070957e-01 -1.77096903e+00] [ 2.75362819e+00 -7.89437674e-01 -9.84247490e-01]
[ 3.47973668e+00 -1.30233324e+00 -4.22735217e-01] [ 1.75475290e+00 -6.11977229e-01 -1.19087832e+00]
[ 2.11346234e+00 -6.75706339e-01 -8.65086426e-01] [ 3.45815682e+00 -1.13062988e+00 -1.20427635e+00]
[ 4.31278391e+00 -2.09597558e+00 -1.26391275e+00] [ 2.30518820e+00 -1.66255173e+00 2.17902616e-01]
[ 2.17195527e+00 -2.32730534e+00 8.31729866e-01] [ 1.89897118e+00 -1.63136888e+00 7.94913792e-01]
[ 3.54198508e+00 -2.51834367e+00 -4.85458508e-01]

In [13]: pca_std_df.shape
(178, 3)
Out[13]:

# eigenvalues..
In [14]:
print(pca_std.singular_values_)
[28.94203422 21.08225141 16.04371561]
# variance containing in each formed PCA
In [15]:
print(pca_std.explained_variance_ratio_*100)

[36.1988481 19.20749026 11.12363054]

# Cummulative variance ratio..

In [16]:
# this will give an idea of, at how many no. of PCAs, the cummulative addition of #........variance will give much
information..

cum_variance = np.cumsum(pca_std.explained_variance_ratio_*100) cum_variance

Out[16]: array([36.1988481 , 55.40633836, 66.52996889])

Conclusion:

by applying PCA on standardized data with 95% variance it gives 3 PCA components

MODEL 1 - KMeans

K-means clustering with PCA

from sklearn.cluster import KMeans
In [29]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(10, 8)) wcss = []

for i in range(1, 11):
kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42) kmeans.fit(df_norm)
wcss.append(kmeans.inertia_) #criterion based on which K-means clustering works
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method - K-means with PCA Clustering') plt.xlabel('Number of clusters')
plt.ylabel('WCSS') plt.show()

import warnings
warnings.filterwarnings('ignore')

# Fitting K-Means to the dataset

In [18]:
kmeans = KMeans(n_clusters = 3, init = 'k-means++', random_state = 42) y_kmeans = kmeans.fit_predict(df_norm)

y_kmeans

Out[18]: array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 1, 1, 0, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1])
plt.figure(figsize=(8,6))
In [19]:
plt.scatter(pca_std_df[:,0], pca_std_df[:,1], cmap='tab10') plt.xlabel('First principal component')
plt.ylabel('Second Principal Component')

Out[19]: Text(0, 0.5, 'Second Principal Component')

pca_kmeans =pd.concat([dataset1.reset_index(drop = True), pd.DataFrame(df_norm)], axis=
In [20]:
pca_kmeans.columns.values[-3: ] = ['Component 1', 'Component 2', 'Component 3']
pca_kmeans['Segment Kmeans PCA'] = y_kmeans print(pca_kmeans)

Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids \

0 14.23 1.71 2.43 15.6 127 2.80 3.06
1 13.20 1.78 2.14 11.2 100 2.65 2.76
2 13.16 2.36 2.67 18.6 101 2.80 3.24
3 14.37 1.95 2.50 16.8 113 3.85 3.49
4 13.24 2.59 2.87 21.0 118 2.80 2.69
.. ... ... ... ... ... ... ...
173 13.71 5.65 2.45 20.5 95 1.68 0.61
174 13.40 3.91 2.48 23.0 102 1.80 0.75
175 13.27 4.28 2.26 20.0 120 1.59 0.69
176 13.17 2.59 2.37 20.0 120 1.65 0.68
177 14.13 4.10 2.74 24.5 96 2.05 0.76

Nonflavanoids Proanthocyanins Color ... Magnesium Phenols \

0 0.28 2.29 5.64 ... 0.619565 0.627586
1 0.26 1.28 4.38 ... 0.326087 0.575862
2 0.30 2.81 5.68 ... 0.336957 0.627586
3 0.24 2.18 7.80 ... 0.467391 0.989655
4 0.39 1.82 4.32 ... 0.521739 0.627586
.. ... ... ... ... ... ...
173 0.52 1.06 7.70 ... 0.271739 0.241379
174 0.43 1.41 7.30 ... 0.347826 0.282759
175 0.43 1.35 10.20 ... 0.543478 0.210345
176 0.53 1.46 9.30 ... 0.543478 0.231034
177 0.56 1.35 9.20 ... 0.282609 0.368966

Flavanoids Nonflavanoids Proanthocyanins Color Component 1 \

0 0.573840 0.283019 0.593060 0.372014 0.455285
1 0.510549 0.245283 0.274448 0.264505 0.463415
2 0.611814 0.320755 0.757098 0.375427 0.447154
3 0.664557 0.207547 0.558360 0.556314 0.308943
4 0.495781 0.490566 0.444795 0.259386 0.455285
.. ... ... ... ... ...
173 0.056962 0.735849 0.205047 0.547782 0.130081
174 0.086498 0.566038 0.315457 0.513652 0.178862
175 0.073840 0.566038 0.296530 0.761092 0.089431
176 0.071730 0.754717 0.331230 0.684300 0.097561
177 0.088608 0.811321 0.296530 0.675768 0.105691

Component 2 Component 3 Segment Kmeans PCA

0 0.970696 0.561341 2
1 0.780220 0.550642 2
2 0.695971 0.646933 2
3 0.798535 0.857347 2
4 0.608059 0.325963 2
.. ... ... ...
173 0.172161 0.329529 1
174 0.106227 0.336662 1
175 0.106227 0.397290 1
176 0.128205 0.400856 1
177 0.120879 0.201141 1
[178 rows x 27 columns]

pca_kmeans['Segment'] = pca_kmeans['Segment Kmeans PCA'].map({0:'First', 1:'Second', 2:

In [21]:
x_axis = pca_kmeans['Component 2'] y_axis = pca_kmeans['Component 1'] plt.figure(figsize= (10,8))
In [30]:
sns.scatterplot(x_axis, y_axis, hue=pca_kmeans['Segment'],palette=['g', 'r','c'] ) plt.title('Clusters by PCA Componets')
plt.show()
x_axis = pca_kmeans['Component 2'] y_axis = pca_kmeans['Component 3'] plt.figure(figsize= (10,8))
In [31]:
sns.scatterplot(x_axis, y_axis, hue=pca_kmeans['Segment'],palette=['g', 'r','c'] ) plt.title('Clusters by PCA Componets')
plt.show()
x_axis = pca_kmeans['Component 1'] y_axis = pca_kmeans['Component 3'] plt.figure(figsize= (10,8))
In [32]:
sns.scatterplot(x_axis, y_axis, hue=pca_kmeans['Segment'],palette=['g', 'r','c'] ) plt.title('Clusters by PCA Componets')
plt.show()

Heirarchial Clustering
print(pca_std_df)
In [25]:
[[ 3.31675081e+00 -1.44346263e+00 -1.65739045e-01] [ 2.20946492e+00 3.33392887e-01 -2.02645737e+00]
[ 2.51674015e+00 -1.03115130e+00 9.82818670e-01] [ 3.75706561e+00 -2.75637191e+00 -1.76191842e-01]
[ 1.00890849e+00 -8.69830821e-01 2.02668822e+00] [ 3.05025392e+00 -2.12240111e+00 -6.29395827e-01]
[ 2.44908967e+00 -1.17485013e+00 -9.77094891e-01] [ 2.05943687e+00 -1.60896307e+00 1.46281883e-01]
[ 2.51087430e+00 -9.18070957e-01 -1.77096903e+00] [ 2.75362819e+00 -7.89437674e-01 -9.84247490e-01]
[ 3.47973668e+00 -1.30233324e+00 -4.22735217e-01] [ 1.75475290e+00 -6.11977229e-01 -1.19087832e+00]
[ 2.11346234e+00 -6.75706339e-01 -8.65086426e-01] [ 3.45815682e+00 -1.13062988e+00 -1.20427635e+00]
[ 4.31278391e+00 -2.09597558e+00 -1.26391275e+00] [ 2.30518820e+00 -1.66255173e+00 2.17902616e-01]
[ 2.17195527e+00 -2.32730534e+00 8.31729866e-01] [ 1.89897118e+00 -1.63136888e+00 7.94913792e-01]
[ 3.54198508e+00 -2.51834367e+00 -4.85458508e-01]

import scipy.cluster.hierarchy as shc plt.figure(figsize=(10, 6))

In [26]:
plt.title("Wine Data Dendogram - Ward")
dend = shc.dendrogram(shc.linkage(pca_std_df, method='ward'))

# create clusters
In [36]:
hc = AgglomerativeClustering(n_clusters=3, affinity = 'euclidean', linkage = 'single') hc

Out[36]: AgglomerativeClustering(linkage='single', n_clusters=3)

In [37]: # save clusters for chart
y_hc = hc.fit_predict(pca_std_df) y_hc

Out[37]: array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2], dtype=int64)

Clusters=pd.DataFrame(y_hc,columns=['Clusters']) Clusters
In [38]:

Out[38]:
Clusters

0 2

1 2

2 2

3 2

4 2

... ...

173 2

174 2

175 2

176 2

177 2

178 rows × 1 columns

In [ ]:
Problem Statement: -

A pharmaceuticals manufacturing company is conducting a study on a new medicine to treat

heart diseases. The company has gathered data from its secondary sources and would like you to
provide high level analytical insights on the data. Its aim is to segregate patients depending on
their age group and other factors given in the data. Perform PCA and clustering algorithms on
the dataset and check if the clusters formed before and after PCA are the same and provide a
brief report on your model. You can also explore more ways to improve your model.

Note: This is just a snapshot of the data. The datasets can be downloaded from AiSpry LMS in
the Hands-On Material section.

Pca SPSS
100% (1)
Pca SPSS
52 pages
Statistics Assignment 05
50% (2)
Statistics Assignment 05
14 pages
Import As From Import From Import Import As
No ratings yet
Import As From Import From Import Import As
5 pages
Dvpd11 Merged Merged 27 83
No ratings yet
Dvpd11 Merged Merged 27 83
57 pages
ML Journal
No ratings yet
ML Journal
29 pages
Empirical Crop Suitability Model 1694688954
No ratings yet
Empirical Crop Suitability Model 1694688954
24 pages
K Means Wine Clustering
No ratings yet
K Means Wine Clustering
3 pages
PCA Problem Statement
No ratings yet
PCA Problem Statement
25 pages
GIS320 Lecture6 Principal Components Analysis
No ratings yet
GIS320 Lecture6 Principal Components Analysis
16 pages
Wine
No ratings yet
Wine
22 pages
45B AIML Practical07 Clustering
No ratings yet
45B AIML Practical07 Clustering
8 pages
CODE
No ratings yet
CODE
7 pages
02 Pca
No ratings yet
02 Pca
14 pages
Experiment 4 AIML
No ratings yet
Experiment 4 AIML
4 pages
Chapter 2 Principal Components Analysis: Math 3210
No ratings yet
Chapter 2 Principal Components Analysis: Math 3210
30 pages
PCA Example - Toothpaste
No ratings yet
PCA Example - Toothpaste
7 pages
Apple Data
No ratings yet
Apple Data
8 pages
Rlab SS
No ratings yet
Rlab SS
25 pages
Almejas 4
No ratings yet
Almejas 4
9 pages
SUBQUERIES
No ratings yet
SUBQUERIES
8 pages
Project Report Data Mining
No ratings yet
Project Report Data Mining
26 pages
AIML Hon. Practical 4
No ratings yet
AIML Hon. Practical 4
5 pages
PCA Example - Toothpaste
No ratings yet
PCA Example - Toothpaste
5 pages
Chapter 4 - Dimension Reduction: Data Mining For Business Intelligence
No ratings yet
Chapter 4 - Dimension Reduction: Data Mining For Business Intelligence
24 pages
Data Mining - Wine Classification Assignment
No ratings yet
Data Mining - Wine Classification Assignment
66 pages
Final Assessment Introductory Data Science Part 2
No ratings yet
Final Assessment Introductory Data Science Part 2
6 pages
ML Week 5
No ratings yet
ML Week 5
2 pages
K Nearest Neighbor
No ratings yet
K Nearest Neighbor
6 pages
UNIT 3 4 Feature Relevance Marginal Entropy
No ratings yet
UNIT 3 4 Feature Relevance Marginal Entropy
4 pages
Use of Principal Component Analysis (PCA) and Hierarchical Cluster Analysis 2017
No ratings yet
Use of Principal Component Analysis (PCA) and Hierarchical Cluster Analysis 2017
8 pages
Cluster Analysis
No ratings yet
Cluster Analysis
9 pages
Data Mining 1 Practical File-1
No ratings yet
Data Mining 1 Practical File-1
24 pages
20BCE2126 ML Da 5
No ratings yet
20BCE2126 ML Da 5
3 pages
Exp 15
No ratings yet
Exp 15
1 page
Assignment 1 A
No ratings yet
Assignment 1 A
12 pages
Code R
No ratings yet
Code R
3 pages
Granato 2018
No ratings yet
Granato 2018
41 pages
Lab Assignment 10: Web Mining
No ratings yet
Lab Assignment 10: Web Mining
5 pages
Lab Assignment 10: Web Mining
No ratings yet
Lab Assignment 10: Web Mining
5 pages
Principal Component Analysis
No ratings yet
Principal Component Analysis
34 pages
Name:Silpa Batch Id: Analysis: WDEO 171220 Topic: Principal Component
100% (1)
Name:Silpa Batch Id: Analysis: WDEO 171220 Topic: Principal Component
7 pages
Principal Component Analysis: Analytical Methods May 2014
No ratings yet
Principal Component Analysis: Analytical Methods May 2014
21 pages
Data Method Nonorm CCC Pseudo RMSSTD Rsquare RSQ Id Var: Proc Cluster
No ratings yet
Data Method Nonorm CCC Pseudo RMSSTD Rsquare RSQ Id Var: Proc Cluster
4 pages
Principal Components Analysis and Redundancy Analysis
No ratings yet
Principal Components Analysis and Redundancy Analysis
18 pages
Chapter 04 Dimension Reduction (R)
No ratings yet
Chapter 04 Dimension Reduction (R)
27 pages
Scikit Learn1
No ratings yet
Scikit Learn1
4 pages
Principal Component Analysis
No ratings yet
Principal Component Analysis
7 pages
PCA Analysis
No ratings yet
PCA Analysis
7 pages
Separations 09 00087 1
No ratings yet
Separations 09 00087 1
12 pages
Exercise#9 Instructions 2021
No ratings yet
Exercise#9 Instructions 2021
5 pages
Wine DS
No ratings yet
Wine DS
14 pages
Artigo Do Chá
No ratings yet
Artigo Do Chá
10 pages
R Console
No ratings yet
R Console
1 page
Water Portability Sunig R
No ratings yet
Water Portability Sunig R
4 pages
Unnamed: 0 Sample Rock - Type Sio2 Tio2 Al2O3 Fe2O3 Mno Mgo Cao Na2O K2O P2O5 0 0 1 1 2 2 3 3 4 4
No ratings yet
Unnamed: 0 Sample Rock - Type Sio2 Tio2 Al2O3 Fe2O3 Mno Mgo Cao Na2O K2O P2O5 0 0 1 1 2 2 3 3 4 4
1 page
PCA For Removal of Noise PCA For Removal of Noise: GC/MS Example
No ratings yet
PCA For Removal of Noise PCA For Removal of Noise: GC/MS Example
9 pages
AS Notebook - PCA - Wine Data-4
100% (1)
AS Notebook - PCA - Wine Data-4
1 page
Cluster Past
No ratings yet
Cluster Past
5 pages
Practical Research Long Quiz Grade 12
No ratings yet
Practical Research Long Quiz Grade 12
11 pages
Descriptive Sensory Analysis in Different Classes of Orange Juice by A Robust Free-Choice Profile Method
No ratings yet
Descriptive Sensory Analysis in Different Classes of Orange Juice by A Robust Free-Choice Profile Method
10 pages
Capstone Project - Final Submission
No ratings yet
Capstone Project - Final Submission
36 pages
Cluster
No ratings yet
Cluster
3 pages
Factor Analysis: Dr. R. Ravanan
No ratings yet
Factor Analysis: Dr. R. Ravanan
22 pages
The Fibonacci Number Series
From Everand
The Fibonacci Number Series
Michael Husted
5/5 (1)
CQE Academy Equation Cheat Sheet - D
No ratings yet
CQE Academy Equation Cheat Sheet - D
15 pages
Business Statistics,: 9e, GE (Groebner/Shannon/Fry) Chapter 3 Describing Data Using Numerical Measures
No ratings yet
Business Statistics,: 9e, GE (Groebner/Shannon/Fry) Chapter 3 Describing Data Using Numerical Measures
43 pages
Mathematical Foundation
100% (1)
Mathematical Foundation
6 pages
Parametric and Nonparametric
No ratings yet
Parametric and Nonparametric
2 pages
Day12 Hierarchical Clustering
No ratings yet
Day12 Hierarchical Clustering
9 pages
JASP Manual: Seton Hall University Department of Psychology 2018
No ratings yet
JASP Manual: Seton Hall University Department of Psychology 2018
48 pages
Reliability Analysis, Failure Rate, MTBF: Maxon Motor Control
No ratings yet
Reliability Analysis, Failure Rate, MTBF: Maxon Motor Control
4 pages
ch3 SEM Methods of Estimation - 105548
No ratings yet
ch3 SEM Methods of Estimation - 105548
17 pages
All Units Advanced Java MCQ Question Bank
No ratings yet
All Units Advanced Java MCQ Question Bank
29 pages
Problem (Objective 17-5) in Auditing The Valuation of Inventory, The Auditor, Claire Butler, Decided To Use
No ratings yet
Problem (Objective 17-5) in Auditing The Valuation of Inventory, The Auditor, Claire Butler, Decided To Use
3 pages
Multilinear ProblemStatement
No ratings yet
Multilinear ProblemStatement
132 pages
Data Handling and Manipulation
No ratings yet
Data Handling and Manipulation
18 pages
Conditional Probability and Table
No ratings yet
Conditional Probability and Table
11 pages
Fix Data
No ratings yet
Fix Data
148 pages
Panel GMM Commands
No ratings yet
Panel GMM Commands
13 pages
XGBoost WM
No ratings yet
XGBoost WM
39 pages
STA1203 Final AA-Apr2022-Question Paper
No ratings yet
STA1203 Final AA-Apr2022-Question Paper
6 pages
2004 JQT Woodall Et Al
No ratings yet
2004 JQT Woodall Et Al
12 pages
Sample Forecast Report From Forecast Pro Unlimited
No ratings yet
Sample Forecast Report From Forecast Pro Unlimited
34 pages
Business Statistics - Spring 2021 (Sec A)
No ratings yet
Business Statistics - Spring 2021 (Sec A)
3 pages
Bec 4102 Econometrics I
No ratings yet
Bec 4102 Econometrics I
2 pages
Batch Effects Correction For Metabolomics: Andrés G. Camacho-Bonet and Wandaliz Torres-García, PH.D
No ratings yet
Batch Effects Correction For Metabolomics: Andrés G. Camacho-Bonet and Wandaliz Torres-García, PH.D
20 pages
Pengaruh Media Bantu Ban Terhadap Hasil Shooting Futsal Pada Peserta Didik Mts Negeri 1 Pontianak
No ratings yet
Pengaruh Media Bantu Ban Terhadap Hasil Shooting Futsal Pada Peserta Didik Mts Negeri 1 Pontianak
12 pages
20 Scenario Q&A For Data Analyst
No ratings yet
20 Scenario Q&A For Data Analyst
4 pages
Thong Kam 2008
No ratings yet
Thong Kam 2008
8 pages
Activity # 1 - Statistics and Data Handling in Analytical Chemistry Treatment of Data
No ratings yet
Activity # 1 - Statistics and Data Handling in Analytical Chemistry Treatment of Data
4 pages
Chapter 6 Exercises
No ratings yet
Chapter 6 Exercises
4 pages
Kategorisasi Variabel: Dep - Tal
No ratings yet
Kategorisasi Variabel: Dep - Tal
5 pages
Business Analytics Chapter 5
No ratings yet
Business Analytics Chapter 5
2 pages
T M N D: HE Ultivariate Ormal Istribution
No ratings yet
T M N D: HE Ultivariate Ormal Istribution
19 pages

PCA Problem Statement With Answer

Uploaded by

PCA Problem Statement With Answer

Uploaded by

Topic: Dimension Reduction With PCA

© 2013 - 2021 360DigiTMG. All Rights Reserved.

from matplotlib.colors import ListedColormap

dataset = pd.read_csv('wine.csv') X = dataset.iloc[:, 0:13].values y = dataset.iloc[:, 13].values

© 2013 - 2021 360DigiTMG. All Rights Reserved.

0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28

1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26

2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30

3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24

4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39

173 3 13.71 5.65 2.45 20.5 95 1.68 0.61 0.52

174 3 13.40 3.91 2.48 23.0 102 1.80 0.75 0.43

175 3 13.27 4.28 2.26 20.0 120 1.59 0.69 0.43

176 3 13.17 2.59 2.37 20.0 120 1.65 0.68 0.53

177 3 14.13 4.10 2.74 24.5 96 2.05 0.76 0.56

© 2013 - 2021 360DigiTMG. All Rights Reserved.

count 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000

mean 1.938202 13.000618 2.336348 2.366517 19.494944 99.741573 2.295112 2.029270

std 0.775035 0.811827 1.117146 0.274344 3.339564 14.282484 0.625851 0.998859

min 1.000000 11.030000 0.740000 1.360000 10.600000 70.000000 0.980000 0.340000

25% 1.000000 12.362500 1.602500 2.210000 17.200000 88.000000 1.742500 1.205000

50% 2.000000 13.050000 1.865000 2.360000 19.500000 98.000000 2.355000 2.135000

75% 3.000000 13.677500 3.082500 2.557500 21.500000 107.000000 2.800000 2.875000

max 3.000000 14.830000 5.800000 3.230000 30.000000 162.000000 3.880000 5.080000

dataset1 = dataset.drop(['Type'], axis=1) dataset1.head()

0 Type 178 non-null int64

Alcohol Malic Ash Alcalinity Magnesium Phenols \

Flavanoids Nonflavanoids Proanthocyanins Color Hue Dilution \

[178 rows x 13 columns]

In [9]: from sklearn.preprocessing import StandardScaler

std_df = StandardScaler().fit_transform(dataset1) std_df.shape # this will used for kmeans

Out[9]: (178, 13)

Running PCA of standardized data.

from sklearn.decomposition import PCA

[36.1988481 19.20749026 11.12363054]

# Cummulative variance ratio..

cum_variance = np.cumsum(pca_std.explained_variance_ratio_*100) cum_variance

Out[16]: array([36.1988481 , 55.40633836, 66.52996889])

K-means clustering with PCA

plt.figure(figsize=(10, 8)) wcss = []

# Fitting K-Means to the dataset

Out[19]: Text(0, 0.5, 'Second Principal Component')

Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids \

Nonflavanoids Proanthocyanins Color ... Magnesium Phenols \

Flavanoids Nonflavanoids Proanthocyanins Color Component 1 \

Component 2 Component 3 Segment Kmeans PCA

pca_kmeans['Segment'] = pca_kmeans['Segment Kmeans PCA'].map({0:'First', 1:'Second', 2:

import scipy.cluster.hierarchy as shc plt.figure(figsize=(10, 6))

Out[36]: AgglomerativeClustering(linkage='single', n_clusters=3)

178 rows × 1 columns

A pharmaceuticals manufacturing company is conducting a study on a new medicine to treat

You might also like