ML Chapter 4 Part3
In [4]: mean_marks
Out[4]: array([66., 60., 60.])
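For reference, a minimal sketch of how marks and mean_marks were presumably built in Step 1 (the three rows are assumed to be the Math, English, and Art scores, matching the first three rows of the marks1 array defined later in this chapter):
import numpy as np

# Rows: Math, English, Art scores of five students (assumed from marks1 below)
marks = np.array([[90, 90, 60, 60, 30],
                  [60, 90, 60, 60, 30],
                  [90, 30, 60, 90, 30]])

mean_marks = np.mean(marks, axis=1)   # mean score per test -> array([66., 60., 60.])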
Step 2 – Covariance Matrix Computation
• The covariance matrix is
In [5]: CovMat=np.cov(marks, bias=True)
In [6]: CovMat
Out[6]:
array([[504., 360., 180.],
       [360., 360.,   0.],
       [180.,   0., 720.]])
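The biased covariance matrix returned by np.cov(marks, bias=True) can also be verified by hand from the centred scores; a short check (a sketch, assuming the marks and mean_marks arrays from Step 1):
# Manual biased covariance: average outer product of the centred score vectors
centred = marks - mean_marks.reshape(-1, 1)
manual_cov = centred @ centred.T / marks.shape[1]
print(np.allclose(manual_cov, CovMat))   # True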
Step 2 – Covariance Matrix Computation
• The variance of each test's scores is shown along the diagonal.
• The Art test has the highest variance (720), whereas English has the smallest (360).
• Hence, the Art scores have more variability than the English scores.
• The covariance between Math and English is positive (360), and the covariance between Math and Art is also positive (180).
• The covariance between English and Art is zero, which indicates no linear relationship between English and Art.
Step 3 – Compute Eigenvalues and Eigenvectors
• In order to determine the principal components of the data, we need to compute the eigenvalues and eigenvectors of the covariance matrix.
In [7]: eig_val, eig_vec = np.linalg.eig(CovMat)
In [8]: eig_val
Out[8]: array([ 44.81966028, 910.06995304, 629.11038668])
In [9]: eig_vec
Out[9]:
array([[ 0.6487899 , -0.65580225, -0.3859988 ],
       [-0.74104991, -0.4291978 , -0.51636642],
       [-0.17296443, -0.62105769,  0.7644414 ]])
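Each column of eig_vec is the eigenvector paired with the eigenvalue at the same position in eig_val. A quick check (a sketch, not from the slides) confirms that CovMat @ v equals the corresponding eigenvalue times v for every pair:
# Verify the decomposition column by column: CovMat @ v == lambda * v
for lam, v in zip(eig_val, eig_vec.T):
    print(np.allclose(CovMat @ v, lam * v))   # True for each pair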
Step 4 – Sort Eigenvalues and Choose k Eigenvectors
• The eigenvectors will form the basis of the new feature space, but they only define directions; all of them have unit length.
• To decide which eigenvector(s) we can drop to obtain a lower-dimensional subspace, we should review the corresponding eigenvalues.
• The eigenvector corresponding to the lowest eigenvalue carries the least information about the distribution of the data, so we can drop it.
Step 4 – Sort Eigenvalues and Choose k Eigenvectors
• We need to rank the eigenvalues from the highest to the lowest and choose the top k
eigenvalues and eigenvectors.
In [22]: eig_val[::-1].sort()
In [23]: eig_val
Out[23]: array([910.06995304, 629.11038668, 44.81966028])
In [24]: eig_val/eig_val.sum()
Out[24]: array([0.57453911, 0.39716565, 0.02829524])
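The code above sorts only the eigenvalue array in place; to actually choose the top k eigenvectors and project the data, the eigenvalues and eigenvectors must stay paired. A minimal sketch under that assumption (re-computing the eigenpairs so the pairing is explicit; variable names are illustrative):
# Re-compute the eigenpairs so values and vectors stay paired, then sort together
vals, vecs = np.linalg.eig(CovMat)
order = np.argsort(vals)[::-1]               # indices of eigenvalues, largest first
vals, vecs = vals[order], vecs[:, order]

# Keep the top k = 2 eigenvectors and project the centred scores onto them
k = 2
W = vecs[:, :k]                              # 3 x 2 projection matrix
centred = marks - mean_marks.reshape(-1, 1)  # centre each test's scores
pc_scores = W.T @ centred                    # 2 x 5 principal component scores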
We can also find the proportion of variance explained using sklearn for the top 2 principal components
In [25]: sklearn_pca.explained_variance_ratio_
Out[25]: array([0.57453911, 0.39716565])
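sklearn_pca is carried over from an earlier part of the chapter; a minimal sketch of how it could be constructed so that In [25] reproduces the ratios above (an assumption: PCA fitted on the transposed marks array, students as rows, keeping the top 2 components):
from sklearn.decomposition import PCA

# Fit PCA on the 5 students x 3 tests matrix, keeping the top 2 components
sklearn_pca = PCA(n_components=2)
sklearn_pca.fit(marks.T)
print(sklearn_pca.explained_variance_ratio_)   # approx. [0.5745, 0.3972]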
Proportion of Variance Explained
The cumulative proportion of variance explained is
In [26]: sklearn_pca.explained_variance_ratio_.cumsum()
Out[26]: array([0.57453911, 0.97170476])
We can see that in this case, the first 2 components are able to explain 97.17% of the variance.
Deciding the number of components
In [27]: marks1 = np.array([[90,90,60,60,30],[60,90,60,60,30],[90,30,60,90,30],[90,60,30,60,90]])
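The cell that refits PCA on marks1 is not shown on the slide. A plausible sketch (an assumption: PCA fitted on the transposed marks1 array, students as rows, keeping all four components) so that In [32] below runs:
from sklearn.decomposition import PCA

# Refit PCA on the 5 students x 4 tests matrix, keeping all four components
sklearn_pca = PCA(n_components=4)
sklearn_pca.fit(marks1.T)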
In [32]: plt.figure(figsize=(7,5))
    ...: plt.plot([1,2,3,4], sklearn_pca.explained_variance_ratio_, '-o', label='Individual component')
    ...: plt.plot([1,2,3,4], np.cumsum(sklearn_pca.explained_variance_ratio_), '-s', label='Cumulative')
    ...: plt.ylabel('Proportion of Variance Explained')
    ...: plt.xlabel('Principal Component')
    ...: plt.xlim(0.75,4.25)
    ...: plt.ylim(0,1.05)
    ...: plt.xticks([1,2,3,4])
    ...: plt.legend(loc=2)
    ...: plt.show()
Deciding the number of components
• We generally decide on the number of principal components required by examining a scree plot such as the one illustrated in Figure – 4.
• We need the smallest number of principal components that can explain a sizable amount of the variation in the data.
• We can do so by eyeballing the scree plot and looking for a point at which the proportion of variance explained by each subsequent principal component drops off.
• This is often referred to as an elbow in the scree plot.
• By inspecting Figure – 4, one might conclude that a fair amount of variance has been explained by the first three principal components and that there is an elbow after the third component.
• The fourth principal component explains a very small amount of variance. Hence, it can be dropped.
Independent Component
Analysis
Independent Component Analysis
• Independent Component Analysis (ICA) is based on information theory. It is also a dimensionality reduction technique.
• The difference between Principal Component Analysis and Independent Component Analysis is that PCA looks for uncorrelated factors whereas ICA looks for independent factors.
• Two variables are uncorrelated if there is no linear relationship between them.
• On the other hand, two variables are independent if the value of one provides no information about the value of the other.
• For example, the age of an individual is independent of their food preferences.
Independent Component Analysis
• On various occasions, it is useful to process the data in order to
extract uncorrelated and independent components.
• For example, suppose we record two people while they sing different songs; each recording is a noisy mixture of both voices.
• Our goal is to separate one source from another.
• This problem cannot be solved using PCA because in PCA there is no
constraint on the independence of the components.
• We can solve this using ICA.
• In layman's terms, PCA helps to compress the data and ICA helps to separate the data.
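A minimal sketch of the "two singers" idea using scikit-learn's FastICA (the toy signals, mixing matrix, and variable names below are illustrative assumptions, not from the slides):
import numpy as np
from sklearn.decomposition import FastICA

# Two independent source signals, standing in for the two voices
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                 # voice 1
s2 = np.sign(np.sin(3 * t))        # voice 2
S = np.c_[s1, s2]

# Each recording is a different mixture of the two voices
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])         # mixing matrix
X = S @ A.T

# ICA recovers the independent components (up to order and scale)
ica = FastICA(n_components=2, random_state=0)
S_estimated = ica.fit_transform(X)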
Clustering Methods
Clustering
• The techniques used for finding subgroups, or clusters, in a data set
are known as clustering.
• By using clustering techniques, we can group the observations so that observations within each group are quite similar to each other, whereas observations in different groups are different from each other.
• However, we need to define what makes two or more observations similar or different.
• This requires domain-specific knowledge and knowledge about the
data that is being studied.
Clustering
• The objective of both clustering and PCA is to simplify the data via a
small number of summaries even though their mechanisms are
different.
• The objective of PCA is to find a low-dimensional representation of
the observations that can explain a good fraction of the variance.
• The objective of clustering is to find homogeneous subgroups among the observations.
K-Means Clustering
K-Means Clustering Python Example
from sklearn.cluster import KMeans
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(2)
X = np.random.standard_normal((100,2))
X[:50,0] = X[:50,0]+3
X[:50,1] = X[:50,1]-4
km1 = KMeans(n_clusters=2, n_init=20)
km1.fit(X)
np.random.seed(4)
km2 = KMeans(n_clusters=3, n_init=20)
km2.fit(X)
K-Means Clustering Python Example
np.random.seed(6)
km3 = KMeans(n_clusters=4, n_init=20)
km3.fit(X)
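To see the clusters that were just fitted, one can colour the points by their assigned labels; a short sketch (the plotting choices are assumptions, not from the slides):
# Visualise the two-cluster solution: points coloured by label, centroids marked
plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c=km1.labels_, s=40)
plt.scatter(km1.cluster_centers_[:, 0], km1.cluster_centers_[:, 1],
            c='red', marker='x', s=100, label='Centroids')
plt.title('K-Means with K = 2')
plt.legend()
plt.show()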
Linkage
# Hierarchical clustering with three linkage methods on the same data.
# Reconstruction: the original slide showed only the seed, the data and the
# three panel titles; the dendrogram calls below are assumed.
import scipy.cluster.hierarchy as sch
np.random.seed(2)
X = np.random.standard_normal((100,2))
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15,5))
for ax, method in zip((ax1, ax2, ax3), ('complete', 'average', 'single')):
    sch.dendrogram(sch.linkage(X, method=method), ax=ax)
ax1.set_title('Complete Linkage')
ax2.set_title('Average Linkage')
ax3.set_title('Single Linkage');
Latent Semantic Indexing
Latent Semantic Indexing
• Latent Semantic Analysis (LSA), also known as Latent Semantic
Indexing (LSI), is an application of unsupervised dimensionality
reduction techniques to textual data.
• The problems that LSA tries to solve are the problems of:
• Synonymy: This means multiple words having the same meaning
• Polysemy: This means one word having multiple meanings
• For example, consider the following two sentences:
• I liked his last novel quite a lot.
• We would like to go for a novel marketing campaign.
• In the first sentence, the word ‘novel’ refers to a book, and in the
second sentence it means new or fresh.
Latent Semantic Indexing
• We can easily distinguish between these two uses of the word because we understand the context behind them.
• However, a machine would not be able to capture this concept as it cannot
understand the context in which the words have been used.
• This is where Latent Semantic Analysis (LSA) comes into play as it attempts
to leverage the context around the words to capture the hidden concepts,
also known as topics.
• So, simply mapping words to documents won’t really help.
• What we really need is to figure out the hidden concepts or topics behind
the words.
• LSA is one such technique that can find these hidden topics.
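A minimal sketch of LSA with scikit-learn, using TF-IDF followed by truncated SVD (the toy documents and parameter choices are assumptions for illustration; the first two documents echo the 'novel' sentences above):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "I liked his last novel quite a lot",
    "We would like to go for a novel marketing campaign",
    "The author will publish a new novel next year",
    "The marketing campaign for the product was a success",
]

# Term-document representation weighted by TF-IDF
tfidf = TfidfVectorizer(stop_words='english')
X_tfidf = tfidf.fit_transform(docs)

# LSA: reduce the term space to 2 latent topics via truncated SVD
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(X_tfidf)
print(doc_topics.shape)   # (4, 2): each document expressed in the 2-topic space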
Thanks
Samatrix Consulting Pvt Ltd