
CS 229 – Machine Learning    https://stanford.edu/~shervine

VIP Cheatsheet: Unsupervised Learning

Afshine Amidi and Shervine Amidi

September 9, 2018

Introduction to Unsupervised Learning


❒ Motivation – The goal of unsupervised learning is to find hidden patterns in unlabeled data {x^{(1)}, ..., x^{(m)}}.

❒ Jensen's inequality – Let f be a convex function and X a random variable. We have the following inequality:

    E[f(X)] \geq f(E[X])
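As a quick numerical sanity check (our own toy example, not part of the cheatsheet), the inequality can be verified for the convex function f(x) = x^2:

```python
import numpy as np

# Numerical check of Jensen's inequality with the convex function f(x) = x^2.
rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=2.0, size=100_000)  # samples of a random variable X

lhs = np.mean(X ** 2)      # E[f(X)]
rhs = np.mean(X) ** 2      # f(E[X])
print(lhs >= rhs)          # True: E[X^2] >= (E[X])^2
```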
Expectation-Maximization

❒ Latent variables – Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:

    Setting                   Latent variable z    x|z               Comments
    Mixture of k Gaussians    Multinomial(φ)       N(µ_j, Σ_j)       µ_j ∈ R^n, φ ∈ R^k
    Factor analysis           N(0, I)              N(µ + Λz, ψ)      µ_j ∈ R^n

❒ Algorithm – The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:

• E-step: Evaluate the posterior probability Q_i(z^{(i)}) that each data point x^{(i)} came from a particular cluster z^{(i)}, as follows:

    Q_i(z^{(i)}) = P(z^{(i)} \mid x^{(i)}; \theta)

• M-step: Use the posterior probabilities Q_i(z^{(i)}) as cluster-specific weights on data points x^{(i)} to separately re-estimate each cluster model, as follows:

    \theta_i = \underset{\theta}{\operatorname{argmax}} \sum_i \int_{z^{(i)}} Q_i(z^{(i)}) \log\left(\frac{P(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\right) dz^{(i)}
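For illustration, here is a minimal NumPy/SciPy sketch of these two steps for a mixture of k Gaussians; the function and variable names (em_gmm, phi, mu, sigma) are ours, and the small regularization term is an assumption for numerical stability, not part of the cheatsheet.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=50, seed=0):
    """Illustrative EM for a mixture of k Gaussians (not the authors' code)."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    phi = np.full(k, 1.0 / k)                        # mixture weights
    mu = X[rng.choice(m, k, replace=False)]          # random data points as initial means
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(n) for _ in range(k)])

    for _ in range(n_iter):
        # E-step: Q_i(z) = P(z | x ; theta), posterior over clusters for each point.
        q = np.column_stack([
            phi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=sigma[j])
            for j in range(k)
        ])
        q /= q.sum(axis=1, keepdims=True)            # shape (m, k)

        # M-step: re-estimate each cluster with the posteriors as weights.
        nk = q.sum(axis=0)                           # effective number of points per cluster
        phi = nk / m
        mu = (q.T @ X) / nk[:, None]
        for j in range(k):
            d = X - mu[j]
            sigma[j] = (q[:, j, None] * d).T @ d / nk[j] + 1e-6 * np.eye(n)
    return phi, mu, sigma

# Toy usage on made-up 2-D data:
X = np.vstack([np.random.default_rng(1).normal(0, 1, (100, 2)),
               np.random.default_rng(2).normal(4, 1, (100, 2))])
phi, mu, sigma = em_gmm(X, k=2)
print(phi, mu)
```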

k-means clustering

We note c^{(i)} the cluster of data point i and µ_j the center of cluster j.

❒ Algorithm – After randomly initializing the cluster centroids µ_1, µ_2, ..., µ_k ∈ R^n, the k-means algorithm repeats the following step until convergence:

    c^{(i)} = \underset{j}{\operatorname{argmin}} \|x^{(i)} - \mu_j\|^2 \quad\text{and}\quad \mu_j = \frac{\sum_{i=1}^m 1_{\{c^{(i)}=j\}} \, x^{(i)}}{\sum_{i=1}^m 1_{\{c^{(i)}=j\}}}

❒ Distortion function – In order to see if the algorithm converges, we look at the distortion function defined as follows:

    J(c, \mu) = \sum_{i=1}^m \|x^{(i)} - \mu_{c^{(i)}}\|^2
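A short NumPy sketch of the alternating assignment/update steps and of the distortion J is given below; the kmeans function and its defaults are illustrative, not the authors' code.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Illustrative k-means (names and defaults are ours, not from the cheatsheet)."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(m, k, replace=False)]          # random initial centroids

    for _ in range(n_iter):
        # Assignment step: c_i = argmin_j ||x_i - mu_j||^2.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (m, k) squared distances
        c = d2.argmin(axis=1)
        # Update step: mu_j = mean of the points assigned to cluster j.
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):                  # converged
            break
        mu = new_mu

    distortion = ((X - mu[c]) ** 2).sum()            # J(c, mu)
    return c, mu, distortion

# Toy usage:
c, mu, J = kmeans(np.random.default_rng(0).normal(size=(300, 2)), k=3)
print(J)
```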


Hierarchical clustering

❒ Algorithm – It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.

❒ Types – There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, which are summed up in the table below:

    Ward linkage                        Average linkage                                     Complete linkage
    Minimize within-cluster distance    Minimize average distance between cluster pairs    Minimize maximum distance between cluster pairs
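These linkages are implemented in standard libraries; for instance, a brief SciPy sketch under toy data (the data matrix X and the choice of 3 clusters are placeholders):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(100, 2))    # toy data, illustrative only

# Agglomerative clustering with the three linkages from the table above.
for method in ("ward", "average", "complete"):
    Z = linkage(X, method=method)                      # encodes the successive merges
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut the hierarchy into 3 clusters
    print(method, np.bincount(labels))
```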
Clustering assessment metrics
In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels, as was the case in the supervised learning setting.

❒ Silhouette coefficient – By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:

    s = \frac{b - a}{\max(a, b)}
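For reference, scikit-learn exposes this metric directly; a minimal usage sketch with toy data and k-means labels (both placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(200, 2))    # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette coefficient s = (b - a) / max(a, b), averaged over all samples.
print(silhouette_score(X, labels))
```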

❒ Calinski-Harabaz index – By noting k the number of clusters, B_k and W_k the between- and within-clustering dispersion matrices respectively defined as

    B_k = \sum_{j=1}^k n_j (\mu_j - \mu)(\mu_j - \mu)^T, \qquad W_k = \sum_{i=1}^m (x^{(i)} - \mu_{c^{(i)}})(x^{(i)} - \mu_{c^{(i)}})^T

the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:

    s(k) = \frac{\operatorname{Tr}(B_k)}{\operatorname{Tr}(W_k)} \times \frac{N - k}{k - 1}

where n_j is the number of points in cluster j and N is the total number of data points.
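The index can be computed directly from this definition; the sketch below does so and, assuming a recent scikit-learn where the metric is exposed as calinski_harabasz_score, compares against the library value:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def calinski_harabasz(X, labels):
    """Direct computation of s(k) = Tr(B_k)/Tr(W_k) * (N - k)/(k - 1)."""
    N = X.shape[0]
    mu = X.mean(axis=0)
    clusters = np.unique(labels)
    k = len(clusters)
    # Tr(B_k) = sum_j n_j ||mu_j - mu||^2 ; Tr(W_k) = sum_i ||x_i - mu_{c_i}||^2.
    tr_b = sum((labels == j).sum() * np.sum((X[labels == j].mean(axis=0) - mu) ** 2)
               for j in clusters)
    tr_w = sum(np.sum((X[labels == j] - X[labels == j].mean(axis=0)) ** 2)
               for j in clusters)
    return tr_b / tr_w * (N - k) / (k - 1)

X = np.random.default_rng(0).normal(size=(200, 2))    # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(calinski_harabasz(X, labels), calinski_harabasz_score(X, labels))  # should agree
```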
Principal component analysis

It is a dimension reduction technique that finds the variance-maximizing directions onto which to project the data.

❒ Eigenvalue, eigenvector – Given a matrix A ∈ R^{n×n}, λ is said to be an eigenvalue of A if there exists a vector z ∈ R^n \ {0}, called eigenvector, such that we have:

    Az = \lambda z

❒ Spectral theorem – Let A ∈ R^{n×n}. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U ∈ R^{n×n}. By noting Λ = diag(λ_1, ..., λ_n), we have:

    \exists \Lambda \text{ diagonal}, \quad A = U \Lambda U^T

Remark: the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A.
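As a quick NumPy illustration of the spectral theorem and of the principal eigenvector (the symmetric matrix below is arbitrary):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])                       # symmetric matrix

lam, U = np.linalg.eigh(A)                       # eigenvalues (ascending) and orthogonal eigenvectors
print(np.allclose(U @ np.diag(lam) @ U.T, A))    # A = U Lambda U^T
principal_eigenvector = U[:, -1]                 # eigenvector of the largest eigenvalue
print(lam[-1], principal_eigenvector)
```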
❒ Algorithm – The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:

• Step 1: Normalize the data to have a mean of 0 and a standard deviation of 1:

    x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{\sigma_j} \quad\text{where}\quad \mu_j = \frac{1}{m}\sum_{i=1}^m x_j^{(i)} \quad\text{and}\quad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^m \left(x_j^{(i)} - \mu_j\right)^2

• Step 2: Compute \Sigma = \frac{1}{m}\sum_{i=1}^m x^{(i)} x^{(i)T} ∈ R^{n×n}, which is symmetric with real eigenvalues.

• Step 3: Compute u_1, ..., u_k ∈ R^n, the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.

• Step 4: Project the data on span_R(u_1, ..., u_k). This procedure maximizes the variance among all k-dimensional spaces.
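A compact NumPy sketch of these four steps (the pca function, toy data, and k = 2 are illustrative choices, not the authors' implementation; it assumes every feature has non-zero variance):

```python
import numpy as np

def pca(X, k):
    """Project X (shape (m, n)) onto its top-k principal directions."""
    # Step 1: normalize each feature to zero mean and unit standard deviation.
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    Xn = (X - mu) / sigma
    # Step 2: Sigma = (1/m) sum_i x_i x_i^T, symmetric with real eigenvalues.
    m = Xn.shape[0]
    S = Xn.T @ Xn / m
    # Step 3: the k orthogonal eigenvectors with the largest eigenvalues.
    lam, U = np.linalg.eigh(S)                   # ascending eigenvalue order
    Uk = U[:, ::-1][:, :k]
    # Step 4: project the data on span(u_1, ..., u_k).
    return Xn @ Uk

# Toy usage:
Z = pca(np.random.default_rng(0).normal(size=(500, 5)), k=2)
print(Z.shape)                                   # (500, 2)
```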


Independent component analysis

It is a technique meant to find the underlying generating sources.

❒ Assumptions – We assume that our data x has been generated by the n-dimensional source vector s = (s_1, ..., s_n), where the s_i are independent random variables, via a mixing and non-singular matrix A as follows:

    x = As

The goal is to find the unmixing matrix W = A^{-1} by an update rule.

❒ Bell and Sejnowski ICA algorithm – This algorithm finds the unmixing matrix W by following the steps below:

• Write the probability of x = As = W^{-1}s as:

    p(x) = \prod_{i=1}^n p_s(w_i^T x) \cdot |W|

• Write the log likelihood given our training data {x^{(i)}, i ∈ [[1, m]]} and by noting g the sigmoid function as:

    l(W) = \sum_{i=1}^m \left( \sum_{j=1}^n \log\left( g'(w_j^T x^{(i)}) \right) + \log |W| \right)

• Therefore, the stochastic gradient ascent learning rule is such that for each training example x^{(i)}, we update W as follows:

    W \longleftarrow W + \alpha \left( \begin{pmatrix} 1 - 2g(w_1^T x^{(i)}) \\ 1 - 2g(w_2^T x^{(i)}) \\ \vdots \\ 1 - 2g(w_n^T x^{(i)}) \end{pmatrix} x^{(i)T} + (W^T)^{-1} \right)
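A minimal NumPy sketch of this update rule on made-up mixed sources; the learning rate, epoch count, Laplace sources, and mixing matrix are all illustrative assumptions, not values from the cheatsheet.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def bell_sejnowski_ica(X, alpha=0.01, n_epochs=20, seed=0):
    """Stochastic gradient ascent on l(W) for the unmixing matrix W (illustrative)."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):
            x = X[i]                              # one training example x^(i)
            g = sigmoid(W @ x)                    # g(w_j^T x) for every row w_j of W
            # W <- W + alpha * ( (1 - 2 g(Wx)) x^T + (W^T)^{-1} )
            W += alpha * (np.outer(1.0 - 2.0 * g, x) + np.linalg.inv(W.T))
    return W

# Toy usage: unmix two artificially mixed sources.
rng = np.random.default_rng(0)
S = rng.laplace(size=(2000, 2))                   # independent non-Gaussian sources
A = np.array([[1.0, 0.5], [0.5, 1.0]])            # made-up mixing matrix
X = S @ A.T                                       # observed data x = A s
W = bell_sejnowski_ica(X)
print(W @ A)  # up to scaling and permutation of the sources, close to a (permuted) diagonal matrix
```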

