VIP Cheatsheet: Unsupervised Learning
Afshine Amidi and Shervine Amidi
stanford.edu/~shervine
September 9, 2018
• M-step: Use the posterior probabilities $Q_i(z^{(i)})$ as cluster-specific weights on data points $x^{(i)}$ to separately re-estimate each cluster model as follows:

\[ \theta \leftarrow \underset{\theta}{\textrm{argmax}} \sum_i \int_{z^{(i)}} Q_i(z^{(i)}) \log\left(\frac{P(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\right) dz^{(i)} \]
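As a concrete illustration of the two steps, here is a minimal NumPy/SciPy sketch of one EM iteration for a Gaussian mixture; the function name and variable layout are illustrative choices, not from the cheatsheet.

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, phi, mu, Sigma):
    """One EM iteration for a Gaussian mixture (illustrative sketch).

    X: (m, n) data, phi: (k,) mixture weights, mu: (k, n) means, Sigma: (k, n, n) covariances.
    """
    k = len(phi)
    # E-step: posterior probabilities Q_i(z^(i) = j) for every point and cluster
    Q = np.stack([phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j]) for j in range(k)], axis=1)
    Q /= Q.sum(axis=1, keepdims=True)
    # M-step: re-estimate each cluster model using Q as data-point weights
    for j in range(k):
        w = Q[:, j]
        phi[j] = w.mean()
        mu[j] = w @ X / w.sum()
        d = X - mu[j]
        Sigma[j] = (w[:, None] * d).T @ d / w.sum()
    return phi, mu, Sigma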
Hierarchical clustering

❒ Types – Different hierarchical clustering algorithms optimize different objective functions, summed up below (a SciPy sketch follows):

  Ward linkage: minimize within-cluster distance
  Average linkage: minimize average distance between cluster pairs
  Complete linkage: minimize maximum distance between cluster pairs
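These three linkages map directly onto SciPy's agglomerative clustering routine; a minimal sketch, with illustrative data and cluster count:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(100, 2)                           # illustrative data
for method in ("ward", "average", "complete"):
    Z = linkage(X, method=method)                    # build the nested cluster tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 flat clusters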
Clustering assessment metrics

In an unsupervised learning setting, it is often hard to assess the performance of a model, since we don't have the ground truth labels as was the case in the supervised learning setting.
❒ Silhouette coefficient – By noting $a$ the mean distance between a sample and all other points in the same class, and $b$ the mean distance between a sample and all other points in the next nearest cluster, the silhouette coefficient $s$ for a single sample is defined as follows:

\[ s = \frac{b-a}{\max(a,b)} \]
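scikit-learn exposes this coefficient directly, both per sample and averaged; a minimal usage sketch on illustrative data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X = np.random.rand(200, 2)                                # illustrative data
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
s_mean = silhouette_score(X, labels)                      # mean of s over all samples
s_each = silhouette_samples(X, labels)                    # per-sample s = (b - a) / max(a, b)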
❒ Calinski-Harabaz index – By noting $k$ the number of clusters, and $B_k$, $W_k$ the between- and within-clustering dispersion matrices respectively defined as

\[ B_k = \sum_{j=1}^k n_j (\mu_j - \mu)(\mu_j - \mu)^T, \quad W_k = \sum_{i=1}^m (x^{(i)} - \mu_{c^{(i)}})(x^{(i)} - \mu_{c^{(i)}})^T \]

where $n_j$ and $\mu_j$ are the size and centroid of cluster $j$, $\mu$ the overall mean, and $c^{(i)}$ the cluster of $x^{(i)}$, the Calinski-Harabaz index $s(k)$ indicates how well a clustering model defines its clusters: the higher the score, the denser and better separated the clusters are. It is defined as follows:

\[ s(k) = \frac{\textrm{Tr}(B_k)}{\textrm{Tr}(W_k)} \times \frac{N-k}{k-1} \]

with $N$ the total number of data points.
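To make the definition concrete, here is a NumPy sketch that computes $s(k)$ straight from $\textrm{Tr}(B_k)$ and $\textrm{Tr}(W_k)$; the function name is mine, and sklearn.metrics.calinski_harabasz_score computes the same quantity:

import numpy as np

def calinski_harabasz(X, labels):
    # s(k) = (Tr(B_k) / Tr(W_k)) * (N - k) / (k - 1)
    labels = np.asarray(labels)
    N = len(X)
    clusters = np.unique(labels)
    k = len(clusters)
    mu = X.mean(axis=0)                              # overall mean
    tr_B, tr_W = 0.0, 0.0
    for j in clusters:
        Xj = X[labels == j]
        mu_j = Xj.mean(axis=0)                       # centroid of cluster j
        tr_B += len(Xj) * np.sum((mu_j - mu) ** 2)   # trace of a rank-1 outer product = squared norm
        tr_W += np.sum((Xj - mu_j) ** 2)
    return (tr_B / tr_W) * (N - k) / (k - 1)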
Principal component analysis

It is a dimension reduction technique that finds the variance-maximizing directions onto which to project the data.

❒ Eigenvalue, eigenvector – Given a matrix $A \in \mathbb{R}^{n \times n}$, $\lambda$ is said to be an eigenvalue of $A$ if there exists a vector $z \in \mathbb{R}^n \backslash \{0\}$, called an eigenvector, such that:

\[ Az = \lambda z \]

❒ Algorithm – The PCA procedure projects the data on $k$ dimensions by maximizing the variance, as follows (see the sketch after this list):

• Step 1: Normalize the data to have a mean of 0 and standard deviation of 1:

\[ x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{\sigma_j} \quad \textrm{where} \quad \mu_j = \frac{1}{m}\sum_{i=1}^m x_j^{(i)} \quad \textrm{and} \quad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^m \left(x_j^{(i)} - \mu_j\right)^2 \]

• Step 2: Compute $\Sigma = \frac{1}{m}\sum_{i=1}^m x^{(i)} {x^{(i)}}^T \in \mathbb{R}^{n \times n}$, which is symmetric with real eigenvalues.

• Step 3: Compute $u_1, ..., u_k \in \mathbb{R}^n$, the $k$ orthogonal principal eigenvectors of $\Sigma$, i.e. the orthogonal eigenvectors of the $k$ largest eigenvalues.

• Step 4: Project the data on $\textrm{span}_\mathbb{R}(u_1, ..., u_k)$. This procedure maximizes the variance among all $k$-dimensional spaces.
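A minimal NumPy sketch of the four steps (the function name is mine); np.linalg.eigh returns eigenvalues in ascending order, so the last $k$ columns are the principal eigenvectors:

import numpy as np

def pca(X, k):
    # Step 1: normalize the data to zero mean and unit standard deviation
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: Sigma = (1/m) sum_i x^(i) x^(i)^T, symmetric with real eigenvalues
    Sigma = X.T @ X / len(X)
    # Step 3: orthogonal eigenvectors of the k largest eigenvalues
    _, U = np.linalg.eigh(Sigma)
    U_k = U[:, -k:]
    # Step 4: project the data on span(u_1, ..., u_k)
    return X @ U_k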
Independent component analysis

It is a technique meant to find the underlying generating sources.

❒ Assumptions – We assume that our data $x$ has been generated by the $n$-dimensional source vector $s = (s_1,...,s_n)$, where the $s_i$ are independent random variables, via a mixing and non-singular matrix $A$ as follows:

\[ x = As \]

The goal is to find the unmixing matrix $W = A^{-1}$ by an update rule.

❒ Bell and Sejnowski ICA algorithm – This algorithm finds the unmixing matrix $W$ by following the steps below:

• Write the probability of $x = As = W^{-1}s$ as:

\[ p(x) = \prod_{i=1}^n p_s(w_i^T x) \cdot |W| \]

• Write the log likelihood given our training data $\{x^{(i)}, i \in [\![1,m]\!]\}$ and by noting $g$ the sigmoid function as:

\[ l(W) = \sum_{i=1}^m \left( \sum_{j=1}^n \log\left(g'(w_j^T x^{(i)})\right) + \log|W| \right) \]

Therefore, the stochastic gradient ascent learning rule is such that for each training example $x^{(i)}$, we update $W$ as follows:

\[ W \longleftarrow W + \alpha \left( \begin{pmatrix} 1 - 2g(w_1^T x^{(i)}) \\ 1 - 2g(w_2^T x^{(i)}) \\ \vdots \\ 1 - 2g(w_n^T x^{(i)}) \end{pmatrix} {x^{(i)}}^T + (W^T)^{-1} \right) \]
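A minimal NumPy sketch of this update rule; the initialization, learning rate, and loop structure are illustrative choices, not prescribed by the cheatsheet:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bell_sejnowski(X, alpha=0.01, epochs=10):
    m, n = X.shape
    W = np.eye(n)                            # illustrative initialization of the unmixing matrix
    for _ in range(epochs):
        for x in np.random.permutation(X):   # stochastic: one training example at a time
            g = sigmoid(W @ x)               # g(w_j^T x) for every row w_j of W
            W += alpha * (np.outer(1 - 2 * g, x) + np.linalg.inv(W.T))
    return W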