Lecture 3

Clustering

• What is Unsupervised learning?
• K-means clustering
• Hierarchical clustering
• Gaussian mixture model

What is Unsupervised Learning?

Unsupervised learning, also called Descriptive analytics, describes a family of methods for uncovering latent structure in data.
In Supervised learning, aka Predictive analytics, our data consisted of observations (xi, yi), xi ∈ R^p, i = 1, . . . , n. Such data is called labelled, and the yi are thought of as the labels for the data.
In Unsupervised learning, we just look at data xi, i = 1, . . . , n. This is called
unlabelled data.
Even if we have labels yi, we may still wish to temporarily ignore the yi and conduct unsupervised learning on the inputs xi.
Examples of clustering tasks:
• Identify similar groups of online shoppers based on their browsing and purchasing history.
• Identify similar groups of music listeners or movie viewers based on their ratings or recent listening/viewing patterns.
• Cluster input variables based on their correlations to remove redundant predictors from consideration.
• Cluster hospital patients based on their medical histories.
• Cluster labeled data to see how classes are separated by features.

Left: Data. Right: One possible way to cluster the data.


Here's a less clear example. How should we partition it?
Here's one reasonable clustering.
A clustering is a partition {C1, . . . , CK}, where each Ck denotes a subset of the observations.
Each observation belongs to one and only one of the clusters.
To denote that the i-th observation is in the k-th cluster, we write i ∈ Ck.
Method: K-means clustering

Main idea: A good clustering is one for which the within-cluster variation is as
small as possible.
The within-cluster variation for cluster Ck is some measure of the amount by which the observations within that cluster differ from one another.
We'll denote it by WCV(Ck).
Goal: Find C1, . . . , CK that minimize

    Σ_{k=1..K} WCV(Ck).
This says: Partition the observations into K clusters such that the WCV summed
up over all K clusters is as small as possible.
How to define within-cluster variation?
Goal: Find C1, . . . , CK that minimize Σ_{k=1..K} WCV(Ck).
Typically, we use squared Euclidean distance:

    WCV(Ck) = (1 / |Ck|) Σ_{i, i' ∈ Ck} Σ_{j=1..p} (xij − xi'j)²,

where |Ck| denotes the number of observations in cluster k.


To be clear: we're treating K as fixed ahead of time. We are not optimizing K as
part of this objective.
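As a concrete illustration, here is a small sketch (not from the slides) that evaluates this objective directly from the pairwise definition above; it assumes X is an (n, p) NumPy array and labels assigns each row to a cluster 0, . . . , K−1:

import numpy as np

def total_wcv(X, labels, K):
    """Sum of WCV(Ck) over k: for each cluster, add up the squared Euclidean
    distances between all (ordered) pairs of its points, divided by |Ck|."""
    total = 0.0
    for k in range(K):
        cluster = X[labels == k]
        if len(cluster) == 0:
            continue  # skip empty clusters in this simple sketch
        diffs = cluster[:, None, :] - cluster[None, :, :]   # all pairwise differences
        total += (diffs ** 2).sum() / len(cluster)
    return total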
Simple example
How do we minimize WCV?

It's computationally infeasible to actually minimize this criterion.
We essentially have to try all possible partitions of n points into K sets.
When n = 10, K = 4, there are 34,105 possible partitions.
When n = 25, K = 4, there are about 5 × 10^13…
We're going to have to settle for an approximate solution.
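For the curious, these counts are the number of ways to split n points into K non-empty groups (Stirling numbers of the second kind); a quick sketch, not from the slides, to reproduce them:

def num_partitions(n, k):
    """Number of ways to partition n points into k non-empty clusters,
    via the recurrence S(n, k) = k * S(n-1, k) + S(n-1, k-1)."""
    if k == 0:
        return 1 if n == 0 else 0
    if n == 0 or k > n:
        return 0
    return k * num_partitions(n - 1, k) + num_partitions(n - 1, k - 1)

print(num_partitions(10, 4))   # 34105
print(num_partitions(25, 4))   # roughly 5e13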
K-means algorithm

It turns out that we can rewrite WCV(Ck) more conveniently:

    WCV(Ck) = 2 Σ_{i ∈ Ck} Σ_{j=1..p} (xij − x̄kj)²,

where x̄k = (x̄k1, . . . , x̄kp) is just the average (centroid) of all the points in cluster Ck.
So, let's try the following (a code sketch follows the steps):
K-means algorithm:
1. Start by randomly partitioning the observations into K clusters.
2. Until the clusters stop changing, repeat:
a. For each cluster, compute the cluster centroid x¯k,
b. Assign each observation to the cluster whose centroid is the closest.
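A minimal NumPy sketch of these two steps (not from the slides); it assumes X is an (n, p) array and, for simplicity, does not handle clusters that become empty:

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means: random initial partition, then alternate the two steps."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=X.shape[0])   # step 1: random partition
    for _ in range(n_iter):
        # step 2a: compute each cluster's centroid
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # step 2b: assign each observation to the nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):     # clusters stopped changing
            break
        labels = new_labels
    return labels, centroids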
K-means demo with K = 3
Summary of K-means
We'd need to minimize Σ_{k=1..K} WCV(Ck).
It's infeasible to actually optimize this in practice, but K-means at least gives us
a so-called local optimum of this objective.
The result we get depends both on K, and also on the random initialization that
we wind up with.
It's a good idea to try different random starts and pick the best result among
them.
There's a method called K-means++ that improves how the clusters are
initialized.
A related method, called K-medoids, clusters based on distances to a representative point (the medoid), which is chosen to be one of the observations in each cluster.
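For reference, both ideas are available in scikit-learn's KMeans; a usage sketch with placeholder data (not from the slides): init="k-means++" uses the K-means++ initialization, and n_init controls the number of random starts, with the best run kept.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))   # placeholder data

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)

print(km.inertia_)           # within-cluster sum of squares of the best run
print(km.cluster_centers_)   # the K centroids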
Hierarchical clustering
K-means is an objective-based approach that requires us to pre-specify the number
of clusters K.
The answer it gives is somewhat random: it depends on the random initialization
we started with.
Hierarchical clustering is an alternative approach that does not require a pre-
specified choice of K, and which provides a deterministic answer (no
randomness).
We'll focus on bottom-up or agglomerative hierarchical clustering.
Top-down or divisive clustering is also good to know about, but we won't directly
cover it here.
Dendrogram
Left: Dendrogram obtained from complete linkage clustering
Center: Dendrogram cut at height 9, resulting in K = 2 clusters
Right: Dendrogram cut at height 5, resulting in K = 3 clusters
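With SciPy this workflow looks roughly as follows (a sketch with random placeholder data, not the slides' data; the cut heights are illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(9, 2))   # placeholder data

Z = linkage(X, method="complete")    # agglomerative clustering, complete linkage
dendrogram(Z)                        # draw the dendrogram (requires matplotlib)

labels_a = fcluster(Z, t=9, criterion="distance")   # cut the tree at height 9
labels_b = fcluster(Z, t=5, criterion="distance")   # cut the tree at height 5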
Interpreting dendrograms

Observations 5 and 7 are similar to each other, as are observations 1 and 6.
Observation 9 is no more similar to observation 2 than it is to observations 8, 5 and 7.
This is because observations {2, 8, 5, 7} all fuse with 9 at height 1.8.
Linkages
Let dij = d(xi, xj) denote the dissimilarity (distance) between observations xi and xj.
At our first step, each cluster is a single point, so we start by merging the two
observations that have the lowest dissimilarity.
But after that…we need to think about distances not between points, but between
sets (clusters).
The dissimilarity between two clusters is called the linkage.
That is, given two sets of points, G and H, a linkage is a dissimilarity measure d(G, H) telling us how different the points in these sets are.
Let's look at some examples.
Common linkage types
Complete. Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B and record the largest of these dissimilarities.
Single. Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B and record the smallest of these dissimilarities.
Average. Mean inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B and record the average of these dissimilarities.
Centroid. Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.
Ward. Merges the pair of clusters whose fusion gives the smallest increase in total within-cluster variance.
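All of these linkages are available through SciPy's linkage function; a sketch (not from the slides) with random placeholder data, cutting each tree into three clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(1).normal(size=(30, 2))   # placeholder data

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                     # build the dendrogram
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut it into 3 clusters
    print(method, np.bincount(labels)[1:])            # resulting cluster sizes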


Single linkage
dij = d(xi, xj) is the pairwise distance; the single linkage score dsingle(G, H) = min_{i ∈ G, j ∈ H} dij is the distance of the closest pair.
Complete linkage
The complete linkage score dcomplete(G, H) = max_{i ∈ G, j ∈ H} dij is the distance of the farthest pair.
Average linkage
The average linkage score daverage(G, H) = (1 / (|G||H|)) Σ_{i ∈ G, j ∈ H} dij is the average of all pairwise distances.
Shortcomings of Single and Complete linkage

Single and complete linkage have some practical problems:
• Single linkage suffers from chaining. In order to merge two groups, we only need one pair of points to be close, irrespective of all others. Therefore clusters can be too spread out, and not compact enough.
• Complete linkage avoids chaining but suffers from crowding. Because its score is based on the worst-case dissimilarity between pairs, a point can be closer to points in other clusters than to points in its own cluster. Clusters are compact, but not far enough apart.
• Average linkage tries to strike a balance. It uses average pairwise dissimilarity, so clusters tend to be relatively compact and relatively far apart.
CHAINING versus CROWDING
Shortcomings of average linkage
Average linkage has its own problems:
• Unlike single and complete linkage, average linkage doesn't give us a nice interpretation when we cut the dendrogram.
• Results of average linkage clustering can change if we simply apply a monotone increasing transformation to our dissimilarity measure (for example, replacing dij with dij²).
This can be a big problem if we're not sure precisely what dissimilarity measure we want to use.
Single and complete linkage do not have this problem: they depend only on the ordering of the dissimilarities, which a monotone transformation preserves. A small experiment is sketched below.
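A sketch of that experiment (not from the slides; SciPy, random placeholder data): squaring the distances is a monotone increasing transformation, and we check whether the merge order recorded in the linkage matrix changes.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(0).normal(size=(20, 2))
d = pdist(X)         # condensed pairwise distance matrix
d2 = d ** 2          # a monotone increasing transformation of the dissimilarities

for method in ["single", "complete", "average"]:
    same = np.array_equal(
        linkage(d, method=method)[:, :2],    # which clusters merge, and in what order
        linkage(d2, method=method)[:, :2],
    )
    print(method, "same merge order:", same)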
Gaussian Mixture Model (GMM)

Multivariate Gaussian distribution: a random vector x ∈ R^p has the MVN(µ, Σ) distribution if its density is

    f(x) = (2π)^(−p/2) |Σ|^(−1/2) exp( −(1/2) (x − µ)' Σ^(−1) (x − µ) ).


Gaussian Mixture Model:
• Assume each observation has probability πk of coming from cluster k.
• Assume that all observations from cluster k are drawn randomly from a MVN(µk, Σk) distribution.
• In other words, we are assuming that there are latent class labels that we do not observe.
Expectation – Maximization Algorithm
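scikit-learn's GaussianMixture fits this model by the EM algorithm; a minimal usage sketch with placeholder data (not from the slides):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)                       # parameters are estimated by EM

print(gmm.weights_)              # estimated mixing proportions pi_k
print(gmm.means_)                # estimated component means mu_k
resp = gmm.predict_proba(X)      # soft assignments: P(cluster k | observation i)
labels = gmm.predict(X)          # hard assignments: most probable cluster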
PRINCIPAL COMPONENT ANALYSIS (PCA)
PCA using scikit-learn

import numpy as np
from sklearn.decomposition import PCA

# Placeholder for reading the data: X should be an (n, p) NumPy array.
X = np.random.default_rng(0).normal(size=(100, 5))

pca = PCA()
pca.fit(X)

print(pca.explained_variance_ratio_)   # fraction of variance explained by each component
print(pca.mean_)                       # per-feature means used to center the data
C = pca.components_                    # principal directions (one component per row)
Y = pca.transform(X)                   # scores: the data projected onto the components
