
Machine Learning

Samatrix Consulting Pvt Ltd


Unsupervised Learning
Unsupervised Learning - Challenges

Unsupervised Learning – Use Cases
• Unsupervised learning techniques have been gaining importance in a number of fields.
• Online shopping sites use recommender systems that identify groups of customers with similar browsing and purchase histories.
• The recommender system also identifies the items that are of particular interest to the shoppers within each group.
• Based on the purchase histories of the customers in a particular group, the recommender system can suggest items to an individual customer.
• A search engine can show different search results to a particular person based on the click histories of other people who have similar search patterns.
Principal Component Analysis
• We have already studied principal component analysis in the context of principal component regression.
• When our data set contains a large number of correlated variables, we use principal components to summarize the data set with a smaller number of representative variables that collectively explain most of the variability in the original set.
• The principal components are the directions in the feature space along which the variability in the data is high.
Principal Components
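As a brief recap in standard notation, the first principal component of a set of features X1, X2, …, Xp is the normalized linear combination of the features that has the largest variance:

$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p, \qquad \sum_{j=1}^{p} \phi_{j1}^2 = 1$$

The elements $\phi_{11}, \ldots, \phi_{p1}$ are the loadings of the first principal component; the normalization constraint prevents the variance from being inflated arbitrarily.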

Steps for PCA
Step 1 – Standardization
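As a brief recap, standardization rescales each variable to zero mean and unit variance so that variables measured on larger scales do not dominate the principal components:

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the mean and the standard deviation of the variable. In the worked example below the raw marks are used directly, since all three subjects are scored on the same scale.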

Step 2 – Covariance Matrix Computation
Student   Math   English   Art
   1       90      60       90
   2       90      90       30
   3       60      60       60
   4       60      60       90
   5       30      30       30
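NumPy's np.cov with bias=True (used in the next step) computes the population covariance between each pair of subjects:

$$\operatorname{cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)$$

with the variance of each individual subject appearing on the diagonal of the resulting matrix.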
Step 2 – Covariance Matrix Computation
In [1]: import numpy as np

In [2]: marks = np.array([[90,90,60,60,30],[60,90,60,60,30],[90,30,60,90,30]])

• Each row of marks holds one subject (Math, English, Art) and each column holds one student. The mean vector would be

In [3]: mean_marks = np.mean(marks, axis=1)

In [4]: mean_marks
Out[4]: array([66., 60., 60.])
Step 2 – Covariance Matrix Computation
• The covariance matrix would be
In [5]: CovMat=np.cov(marks, bias=True)

In [6]: CovMat
Out[6]:
array([[504., 360., 180.],
[360., 360., 0.],
[180., 0., 720.]])
Step 2 – Covariance Matrix Computation
• The variance of each test appears along the diagonal.
• The Art test has the highest variance (720) whereas English has the smallest (360), so the Art scores show more variability than the English scores.
• The covariance between Math and English is positive (360), and the covariance between Math and Art is also positive (180).
• The covariance between English and Art is zero, which indicates no linear relationship between the English and Art scores.
Step 3 – Compute Eigenvalues and Eigenvectors
• To determine the principal components of the data, we compute the eigenvalues and eigenvectors of the covariance matrix.
In [7]: eig_val, eig_vec = np.linalg.eig(CovMat)

In [8]: eig_val
Out[8]: array([ 44.81966028, 910.06995304, 629.11038668])

In [9]: eig_vec
Out[9]:
array([[ 0.6487899 , -0.65580225, -0.3859988 ],
[-0.74104991, -0.4291978 , -0.51636642],
[-0.17296443, -0.62105769, 0.7644414 ]])
Step 4 – Sort Eigenvalues and Choose k Eigenvectors
• The eigenvectors form the basis of the new feature space, but they only define directions and all of them have unit length.
• To decide which eigenvector(s) we can drop to obtain a lower-dimensional subspace, we review the corresponding eigenvalues.
• The eigenvector corresponding to the lowest eigenvalue carries the least information about the distribution of the data, so we can drop it.
Step 4 – Sort Eigenvalues and Choose k Eigenvectors
• We rank the eigenvalues from the highest to the lowest and choose the top k eigenvalues and their eigenvectors.

In [10]: eig_pairs = [(np.abs(eig_val[i]), eig_vec[:,i]) for i in range(len(eig_val))]

In [11]: eig_pairs.sort(key=lambda x: x[0], reverse=True)

In [12]: for i in eig_pairs:
    ...:     print(i[0])
910.0699530410367
629.1103866763253
44.81966028263878
Step 4 – Sort Eigenvalues and Choose k Eigenvectors
• The corresponding eigenvectors are used as the weights of the projection matrix

In [13]: matrix_w = np.hstack((eig_pairs[0][1].reshape(3,1), eig_pairs[1][1].reshape(3,1)))

In [14]: print('Matrix W:\n', matrix_w)
Matrix W:
 [[-0.65580225 -0.3859988 ]
 [-0.4291978  -0.51636642]
 [-0.62105769  0.7644414 ]]
Step 5 – Transform the Values into the New Subspace
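The transformation itself is a matrix multiplication: project the centred data onto the chosen eigenvectors. A minimal sketch, continuing with the marks, mean_marks and matrix_w computed above (the sign of the second column may differ from the sklearn output shown later):

# Sketch of Step 5: project the centred data onto the top-2 eigenvectors.
# Assumes marks, mean_marks and matrix_w from the previous steps.
centered = marks - mean_marks.reshape(3, 1)   # centre each subject (row)
transformed = matrix_w.T.dot(centered).T      # shape (5, 2): one row per student
print(transformed)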

Calculate using sklearn
Alternatively, we can directly use sklearn to calculate the values

In [17]: from sklearn.decomposition import PCA as sklearnPCA

In [18]: sklearn_pca = sklearnPCA(n_components=2)

In [19]: sklearn_transf = sklearn_pca.fit_transform(marks.T)


Calculate using sklearn
In [20]: sklearn_transf
Out[20]:
array([[-34.37098481, -13.66927088],
[ -9.98345733, 47.68820559],
[ 3.93481353, -2.31599277],
[-14.69691716, -25.24923474],
[ 55.11654576, -6.45370719]])
Uniqueness of Principal Components
• If we compare the eigenvectors obtained from the two approaches, the first eigenvector is the same in both value and sign, but the second eigenvector is the same in value with its sign flipped.
• Each principal component is unique only up to a sign flip, so both approaches describe the same principal components.
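A quick way to confirm this numerically, assuming the transformed array from the manual projection sketch above and sklearn_transf from the sklearn run, is to compare absolute values:

# The manual and sklearn scores should agree up to a per-column sign flip
print(np.allclose(np.abs(transformed), np.abs(sklearn_transf)))   # expected: True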
Proportion of Variance Explained
• In the previous section, we performed PCA on a three-dimensional data set and projected the data onto the first two principal component vectors to obtain a two-dimensional view of the data.
• This two-dimensional representation of the three-dimensional data captures the major patterns in the data.
• The question arises: how much of the information in a given data set is lost by projecting the observations onto the first few principal components?
• In other words, how much information is not contained in the first few principal components?
• To answer this, we look at the proportion of variance explained (PVE) by each principal component.
Proportion of Variance Explained
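In terms of the eigenvalues $\lambda_1, \ldots, \lambda_p$ of the covariance matrix, the proportion of variance explained by the m-th principal component is

$$\text{PVE}_m = \frac{\lambda_m}{\sum_{j=1}^{p} \lambda_j}$$

which is exactly the ratio eig_val / eig_val.sum() computed below.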

Proportion of Variance Explained
We can find the variance explained using the eigenvalues computed earlier (In [8])

In [22]: eig_val[::-1].sort()

In [23]: eig_val
Out[23]: array([910.06995304, 629.11038668, 44.81966028])

In [24]: eig_val/eig_val.sum()
Out[24]: array([0.57453911, 0.39716565, 0.02829524])

We can also find the variance explained using sklearn for the top 2 principal components

In [25]: sklearn_pca.explained_variance_ratio_
Out[25]: array([0.57453911, 0.39716565])
Proportion of Variance Explained
Cumulative variance explained is

In [26]: sklearn_pca.explained_variance_ratio_.cumsum()
Out[26]: array([0.57453911, 0.97170476])

We can see that in this case, the first 2 components are able to explain
97.17% of the variance.
Deciding the number of components
In [27]: marks1 = np.array([[90,90,60,60,30],[60,90,60,60,30],
    ...:                    [90,30,60,90,30],[90,60,30,60,90]])

In [28]: sklearn_pca = sklearnPCA(n_components=4)

In [29]: sklearn_transf = sklearn_pca.fit_transform(marks1.T)


Deciding the number of components
In [30]: from matplotlib import pyplot as plt

In [32]: plt.figure(figsize=(7,5))
    ...: plt.plot([1,2,3,4], sklearn_pca.explained_variance_ratio_, '-o',
    ...:          label='Individual component')
    ...: plt.plot([1,2,3,4], np.cumsum(sklearn_pca.explained_variance_ratio_),
    ...:          '-s', label='Cumulative')
    ...: plt.ylabel('Proportion of Variance Explained')
    ...: plt.xlabel('Principal Component')
    ...: plt.xlim(0.75, 4.25)
    ...: plt.ylim(0, 1.05)
    ...: plt.xticks([1,2,3,4])
    ...: plt.legend(loc=2)
    ...: plt.show()
Deciding the number of components
• We generally decide on the number of principal components required by examining a scree plot such as the one illustrated in Figure – 4.
• We want the smallest number of principal components that explains a sizable amount of the variation in the data.
• We can do so by eyeballing the scree plot and looking for a point at which the proportion of variance explained by each subsequent principal component drops off.
• This point is often referred to as an elbow in the scree plot.
• By inspection of Figure – 4, one might conclude that a fair amount of the variance has been explained by the first three principal components and that there is an elbow after the third component.
• The fourth principal component explains very little additional variance, so it can be dropped.
Independent Component
Analysis
Independent Component Analysis
• Independent Component Analysis is based on information theory. It is also a dimensionality reduction technique.
• The difference between Principal Component Analysis and Independent Component Analysis is that PCA looks for uncorrelated factors whereas ICA looks for independent factors.
• Two variables are uncorrelated if there is no linear relationship between them.
• Two variables are independent if they do not depend on each other, that is, knowing one tells us nothing about the other.
• For example, the age of an individual is independent of his food preferences.
Independent Component Analysis
• On various occasions, it is useful to process the data in order to extract uncorrelated and independent components.
• For example, suppose we record two people while they sing different songs; each recording is a noisy mixture of both voices.
• Our goal is to separate one source from the other.
• This problem cannot be solved using PCA because PCA places no constraint on the independence of the components.
• We can solve it using ICA, as sketched below.
• In layman's terms, PCA helps to compress the data and ICA helps to separate the data.
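A minimal sketch of this idea with scikit-learn's FastICA; the source signals and mixing matrix below are made-up illustrations standing in for the two recordings:

import numpy as np
from sklearn.decomposition import FastICA

# Two made-up source signals (stand-ins for the two singers)
rng = np.random.RandomState(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]
S += 0.1 * rng.standard_normal(S.shape)        # add a little noise

# Mix the sources with an assumed mixing matrix (two "microphones")
A = np.array([[1.0, 0.5], [0.5, 1.0]])
X = S.dot(A.T)

# Recover statistically independent components (up to order and scale)
ica = FastICA(n_components=2, random_state=0)
S_estimated = ica.fit_transform(X)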
Clustering Methods
Clustering
• The techniques used for finding subgroups, or clusters, in a data set are known as clustering.
• By using clustering techniques, we can group the observations so that the observations within each group are quite similar to each other, whereas the observations in different groups are different from each other.
• However, we need to define what it means for two or more observations to be similar or different.
• This requires domain-specific knowledge and knowledge of the data being studied.
Clustering
• The objective of both clustering and PCA is to simplify the data via a small number of summaries, even though their mechanisms are different.
• The objective of PCA is to find a low-dimensional representation of the observations that explains a good fraction of the variance.
• The objective of clustering is to find homogeneous subgroups among the observations.
K-Means Clustering
K-Means Clustering Python Example
from sklearn.cluster import KMeans
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Simulate 100 two-dimensional observations with two true groups:
# shift the first 50 points so that they form a separate cluster
np.random.seed(2)
X = np.random.standard_normal((100,2))
X[:50,0] = X[:50,0]+3
X[:50,1] = X[:50,1]-4

# K-means with K=2; n_init=20 runs the algorithm from 20 random starts
km1 = KMeans(n_clusters=2, n_init=20)
km1.fit(X)

# K-means with K=3
np.random.seed(4)
km2 = KMeans(n_clusters=3, n_init=20)
km2.fit(X)
K-Means Clustering Python Example
# K-means with K=4
np.random.seed(6)
km3 = KMeans(n_clusters=4, n_init=20)
km3.fit(X)

# Plot the three clustering results side by side, marking centroids with '+'
fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(18,5))

ax1.scatter(X[:,0], X[:,1], s=40, c=km1.labels_)
ax1.set_title('K-Means Clustering Results K=2')
ax1.scatter(km1.cluster_centers_[:,0], km1.cluster_centers_[:,1], marker='+', s=100, c='k', linewidth=2)

ax2.scatter(X[:,0], X[:,1], s=40, c=km2.labels_)
ax2.set_title('K-Means Clustering Results K=3')
ax2.scatter(km2.cluster_centers_[:,0], km2.cluster_centers_[:,1], marker='+', s=100, c='k', linewidth=2)

ax3.scatter(X[:,0], X[:,1], s=40, c=km3.labels_)
ax3.set_title('K-Means Clustering Results K=4')
ax3.scatter(km3.cluster_centers_[:,0], km3.cluster_centers_[:,1], marker='+', s=100, c='k', linewidth=2);
K-Means Clustering Algorithm
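The standard K-means iteration can be sketched in a few lines of NumPy: start from a random assignment of observations to clusters, then alternate between computing cluster centroids and reassigning each observation to its nearest centroid until the assignments stop changing. This is an illustrative implementation (the function name kmeans_sketch is ours), not sklearn's optimized version:

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal K-means: alternate centroid computation and reassignment."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))            # random initial assignment
    for _ in range(n_iter):
        # centroid of each cluster (ignores the empty-cluster edge case)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)                # reassign to nearest centroid
        if np.array_equal(new_labels, labels):           # stop when assignments are stable
            break
        labels = new_labels
    return labels, centroids

# e.g. labels, centers = kmeans_sketch(X, k=2) mirrors the idea behind km1 above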
Hierarchical Clustering
Interpreting a Dendrogram
• We have plotted the hierarchical clustering of 45 observations in a two-dimensional space.
• The results are shown in Figure – 6. In the left-hand panel of Figure – 6, each leaf of the dendrogram represents one of the 45 observations.
• Following a bottom-up approach, as we move up the tree, leaves that are similar to each other start to fuse into branches.
• As we go higher up the tree, branches fuse with other branches or with leaves.
• During the fusion process, similar groups of observations fuse with each other at an early stage (lower in the tree).
Interpreting a Dendrogram
• The observations that fuse at a later stage (near the top of the tree) can be quite different from each other.
• To judge how similar two observations are, we look for the point in the tree where the branches containing those two observations are first fused.
• We can measure the height of that fusion on the vertical axis.
• The observations that fuse with each other at the bottom of the tree are very similar to each other.
• On the other hand, the observations that fuse towards the top of the tree will be quite different from each other.
Linkage
• An important concept is the dissimilarity between pairs of observations and between pairs of groups of observations. The term linkage defines the dissimilarity between two groups of observations. The three most common types of linkage are as follows:
• Complete – maximal inter-cluster dissimilarity
• Single – minimal inter-cluster dissimilarity
• Average – mean inter-cluster dissimilarity
• The Python code for plotting dendrograms based on the three linkages is given below.
Linkage
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy

np.random.seed(2)
X = np.random.standard_normal((100,2))

fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(15,18))

# Build the three linkage trees and draw a dendrogram for each on its own axis
for linkage, cluster, ax in zip([hierarchy.complete(X), hierarchy.average(X), hierarchy.single(X)],
                                ['c1', 'c2', 'c3'],
                                [ax1, ax2, ax3]):
    cluster = hierarchy.dendrogram(linkage, ax=ax, color_threshold=0)

ax1.set_title('Complete Linkage')
ax2.set_title('Average Linkage')
ax3.set_title('Single Linkage');
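To go from a dendrogram to actual cluster labels, the tree can be cut at a chosen height. A minimal sketch using scipy's fcluster on the complete-linkage tree for the same data; the cut height of 4 is an arbitrary illustrative choice:

from scipy.cluster.hierarchy import fcluster

Z = hierarchy.complete(X)                        # complete-linkage tree for the same data
labels = fcluster(Z, t=4, criterion='distance')  # cut the dendrogram at height 4
print(np.unique(labels))                         # cluster ids assigned to the 100 points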
Latent Semantic Indexing
• Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), is an application of unsupervised dimensionality reduction techniques to textual data.
• The problems that LSA tries to solve are:
• Synonymy: multiple words having the same meaning
• Polysemy: one word having multiple meanings
• For example, consider the following two sentences:
• I liked his last novel quite a lot.
• We would like to go for a novel marketing campaign.
• In the first sentence, the word 'novel' refers to a book, and in the second sentence it means new or fresh.
Latent Semantic Indexing
• We can easily distinguish between these words because we are able to
understand the context behind these words.
• However, a machine would not be able to capture this concept as it cannot
understand the context in which the words have been used.
• This is where Latent Semantic Analysis (LSA) comes into play as it attempts
to leverage the context around the words to capture the hidden concepts,
also known as topics.
• So, simply mapping words to documents won’t really help.
• What we really need is to figure out the hidden concepts or topics behind
the words.
• LSA is one such technique that can find these hidden topics; a short sketch follows.
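A minimal sketch of LSA with scikit-learn: build a TF-IDF term-document matrix and apply a truncated SVD to it. The documents below are made-up examples around the two senses of 'novel', and n_components=2 is an illustrative choice:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "I liked his last novel quite a lot",
    "We would like to go for a novel marketing campaign",
    "The novel tells the story of a young painter",
    "A novel idea for the advertising campaign",
]

# Term-document matrix weighted by TF-IDF
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(docs)

# LSA / LSI: truncated SVD of the term-document matrix uncovers latent topics
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(X)   # one row per document, one column per topic
print(doc_topics)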
Thanks
Samatrix Consulting Pvt Ltd
