Unsupervised learning
Announcements
▪ Last year exam posted on Moodle front page
▪ Quiz 1 grades posted.
▪ Next week: 22.11.2023, Quiz 2 - same format as last week
• Covers the topics from 23.10 to 06.11
Outline
▪ Unsupervised learning
• Dimensionality reduction: Principal Component Analysis (PCA)
• Clustering: k-Means
[Course overview: Introduction, Linear regression, Logistic regression, Feature engineering, Data statistics, Naive Bayes, KNN, Clustering, Dimensionality reduction, Neural networks, Convolutional neural networks, Decision trees]
Review of data statistics, ex: Quiz 1 grades
Mean: 7.63, Mode: 8, Median: 8, Standard deviation: 1.85
Brief review of last lecture
Introduction
Unsupervised learning
Unsupervised learning is a type of machine learning that looks for previously
undetected patterns in a data set with no pre-existing labels and with a minimum of
human supervision.
In the techniques that follow, we don't use labels anymore!
Dimensionality reduction
Motivation
Intuition
We have samples described by a series of features. We want to find a smaller set of new features that explain our samples, because:
• Fewer features are easier to visualize
• Some of the current features can be redundant
• Some of the current features are not very useful to describe our samples
[Figure: data points plotted against Feature 1 (x1) and Feature 2 (x2)]
Principal component analysis (PCA)
Approach to dimensionality reduction
How to find this smaller set of new features?
PCA: find the best linear combination of features to create new features that explain our samples better.
[Figure: data points plotted against Feature 1 (x1) and Feature 2 (x2)]
PCA
Projection of points onto a lower dimensional subspace
How to find this smaller set of new features?
The new feature w1 x1 + w2 x2 is a mix of length and weight that describes our samples better.
[Figure: data points plotted against Weight (x1) and Length (x2), with the direction of the new feature w1 x1 + w2 x2]
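As a sketch in standard notation (the unit vector w = (w1, w2) is my own symbol): projecting a point x onto the direction w gives the one-dimensional new feature

$$z = w^\top x = w_1 x_1 + w_2 x_2, \qquad \lVert w \rVert_2 = 1,$$

and the corresponding projected point in the original space is $z\,w$.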
Projection of a point onto a subspace
[Figure: a subspace, the distance of a point to the subspace, and the projection of the point onto the subspace]
Find a subspace to minimize the average distance of the data to it:
PCA chooses a subspace that minimizes the sum of distances of all data points to the subspace.
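Written out (in my own notation, with x_i the data points and P the orthogonal projection onto the candidate subspace, and using squared distances as is standard), this objective reads

$$\min_{P} \; \sum_{i=1}^{N} \operatorname{dist}(x_i, \text{subspace})^2 \;=\; \min_{P} \; \sum_{i=1}^{N} \lVert x_i - P x_i \rVert_2^2 .$$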
Formulation of PCA objective using the data matrix
Standardize data:
Frobenius norm of a matrix:
PCA objective:
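The formulas for this slide were lost in extraction; a standard reconstruction, with X the N × d data matrix and W a d × k matrix with orthonormal columns (notation mine), is:

Standardize the data: replace each entry by $x_{ij} \leftarrow (x_{ij} - \mu_j)/\sigma_j$, where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature j.

Frobenius norm of a matrix: $\lVert A \rVert_F = \sqrt{\sum_{i,j} A_{ij}^2}$.

PCA objective: $\min_{W \,:\, W^\top W = I_k} \; \lVert X - X W W^\top \rVert_F^2$.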
Eigenvalue decomposition of a symmetric matrix
Solution to PCA using eigenvalue decomposition
Principal components
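Sketching the standard solution (notation mine): form the sample covariance $\Sigma = \tfrac{1}{N} X^\top X$ of the standardized data. Since $\Sigma$ is symmetric, it has an eigenvalue decomposition $\Sigma = Q \Lambda Q^\top$ with real eigenvalues $\lambda_1 \ge \dots \ge \lambda_d$ and orthonormal eigenvectors. The optimal W stacks the k eigenvectors with the largest eigenvalues; these columns are the principal components, and the reduced data is $Z = X W$ (an N × k matrix).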
Example: projection of data onto 2 principal components
PCA
Pseudo-code
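The pseudo-code itself did not survive extraction; below is a minimal NumPy sketch of the steps described above (standardize, eigendecompose the covariance, keep the top-k eigenvectors, project). Function and variable names are my own.

```python
import numpy as np

def pca(X, k):
    """Project the N x d data matrix X onto its top-k principal components."""
    # 1. Standardize: zero mean and unit variance per feature
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Sample covariance matrix (d x d, symmetric)
    cov = X.T @ X / X.shape[0]
    # 3. Eigenvalue decomposition (eigh returns eigenvalues in ascending order)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Keep the k eigenvectors with the largest eigenvalues = principal components
    W = eigvecs[:, ::-1][:, :k]
    # 5. Project the standardized data onto the principal components
    return X @ W

Z = pca(np.random.rand(100, 5), k=2)   # toy usage: 100 samples, 5 features -> 2
```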
PCA example - distinguishing texts
Defining features
Each data sample is a document
There are d unique words in all the documents
A feature is positive for a document if the corresponding word appears in that document
TFIDF feature definition for documents
Term frequency of word … in document …
Document frequency of word ….
TFIDF for each word ….
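The slide's formulas are missing from the extracted text; the standard TF-IDF definitions (notation mine: word j, document i, N documents in total; several common variants exist) are roughly:

Term frequency: $\mathrm{tf}_{ij}$ = number of occurrences of word j in document i (often normalized by the document length).

Document frequency: $\mathrm{df}_j$ = number of documents that contain word j.

TF-IDF feature: $x_{ij} = \mathrm{tf}_{ij} \cdot \log\frac{N}{\mathrm{df}_j}$.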
PCA example
Based on “Principal Component Analysis” lecture of Stanford EE104
Distinguishing texts: The Critique of Pure Reason by Immanuel Kant and The Problems of Philosophy by Bertrand Russell
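A rough sketch of how such an experiment could be set up with scikit-learn (how the two books are split into short text chunks is an assumption, and the strings below are placeholders, not the lecture's actual pipeline):

```python
# Sketch: TF-IDF features for text chunks from two books, projected onto
# 2 principal components.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

chunks = ["text of one passage from Kant ...",      # placeholder passages,
          "text of one passage from Russell ..."]   # one string per chunk
X = TfidfVectorizer().fit_transform(chunks).toarray()  # documents x words
Z = PCA(n_components=2).fit_transform(X)               # documents x 2
print(Z)   # each chunk is now a point in 2D; chunks from the two books tend to separate
```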
Dimensionality reduction
Other techniques
There are several other techniques for dimensionality reduction:
Linear discriminant analysis (LDA)
Generalized discriminant analysis (GDA)
t-distributed Stochastic Neighbor Embedding (t-SNE)
Autoencoders
Autoencoder
Introduction
▪ An autoencoder is a type of neural network often used for dimensionality
reduction.
▪ Autoencoders are trained in an unsupervised manner, by minimizing the reconstruction error / loss $\sum_{i=1}^{N} L(x_i, \hat{x}_i)$
▪ Example: squared error, $L(x_i, \hat{x}_i) = \lVert x_i - \hat{x}_i \rVert_2^2$
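A minimal sketch of such an autoencoder in PyTorch (layer sizes, optimizer, and the random placeholder data are illustrative assumptions; the MNIST autoencoder shown on the next slide uses convolutional layers instead of the fully connected ones here):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, d_in=784, d_latent=8):
        super().__init__()
        # Encoder: compress the input into a low-dimensional latent code
        self.encoder = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(),
                                     nn.Linear(128, d_latent))
        # Decoder: reconstruct the input from the latent code
        self.decoder = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(),
                                     nn.Linear(128, d_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                # squared reconstruction error

X = torch.rand(256, 784)              # placeholder data (e.g., flattened images)
for epoch in range(10):
    x_hat = model(X)                  # reconstruction x̂
    loss = loss_fn(x_hat, X)          # L(x, x̂) averaged over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```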
Autoencoder
Autoencoder vs. PCA
Top: some examples of the original MNIST test samples.
Middle: reconstructed output from an autoencoder with a latent space of 8 dimensions; this autoencoder uses convolutional layers and was trained on the MNIST training set.
Bottom: reconstructed output from PCA with 8 reduced dimensions.
Image credit: F. Fleuret, Deep Learning (EPFL)
Summary - dimensionality reduction
Used for
• Exploratory data analysis
• Visualizing data
• Reducing overfitting by reducing the feature dimension
PCA: an approach to dimensionality reduction
▪ Projects data onto a linear subspace
▪ Useful in case there is approximately linear dependence
between different features
▪ Easy to compute
▪ Connection to singular value decomposition (see Problem set 2)
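As a brief reminder of that connection (standard result, notation mine): if the standardized data matrix has singular value decomposition $X = U S V^\top$, the principal components are the columns of $V$ (the right singular vectors) ordered by singular value, and taking $W$ as the first k columns of $V$ makes $X W W^\top$ the best rank-k approximation of $X$ in Frobenius norm.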
Clustering
[Whiteboard sketch: machine learning splits into supervised learning (classification and regression; examples: linear regression, logistic regression, k-NN, Naive Bayes) and unsupervised learning (dimensionality reduction and clustering; examples: PCA, k-means).]
Clustering - definition and motivation
Goal: group data points as a first step to understand the data set
Examples (see 4.4 and 4.5 in the LinAlgebra book):
- Clustering music genres: grouping pieces of music based on the similarities in their audio features (beats per minute, duration, loudness, number of words, etc.)
Clustering introduction
▪ Toy example:
• Each data sample is a point in 2D
[Figure: 2D data points grouped into cluster 1, cluster 2, cluster 3, and cluster 4]
k-means approach to clustering
How to cluster data without labels?
Try to find similarity between groups of points
k-means: Group points based on their proximity
(in terms of distance in the feature space)
k-means
Introduction
▪ Given a set of unlabelled input samples {x_i}, group the samples into k clusters (k ∈ ℕ)
▪ k-Means idea:
• Identify k clusters of data points given N samples.
• Find prototype (representative) points μ1, μ2, . . . , μk representing the center of each cluster, and assign the other data points to the nearest cluster center.
k-Means
Preliminaries
▪ A single representative point for data:
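The formula for this representative is missing from the extracted slide; a standard choice (a basic fact about the mean, notation mine) is the point $z$ that minimizes the sum of squared distances to the data, which is the sample mean:

$$\arg\min_{z} \sum_{i=1}^{N} \lVert x_i - z \rVert_2^2 \;=\; \frac{1}{N} \sum_{i=1}^{N} x_i .$$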
k-Means
▪ Choose k clusters to represent data
▪ Determining the cluster a single point … belongs to
▪ Determining the cluster centres to minimize the distance of each point to its
assigned cluster
k-Means objective function
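The objective itself did not survive extraction; the standard k-means objective, in my own notation (with $c_i \in \{1, \dots, k\}$ the cluster assigned to point $x_i$), is

$$J(\mu_1, \dots, \mu_k, c_1, \dots, c_N) \;=\; \sum_{i=1}^{N} \lVert x_i - \mu_{c_i} \rVert_2^2 ,$$

minimized jointly over the cluster centers and the assignments.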
Algorithm - heuristic
1. Initialize {μ1, μ2, . . . , μk} (e.g., randomly)
2. While not converged
1. Assign each point …. to the nearest center
2. Update each center μj based on the points assigned to it
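A minimal NumPy sketch of this heuristic (function name, random initialization scheme, and stopping rule are my own choices):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Cluster the rows of X into k groups with the heuristic above."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the centers with k randomly chosen data points
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    assign = None
    for _ in range(max_iter):
        # Step 2.1: assign each point to the nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                                   # assignments stopped changing
        assign = new_assign
        # Step 2.2: recompute each center as the mean of its assigned points
        for j in range(k):
            if np.any(assign == j):                 # keep old center if cluster is empty
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

centers, labels = kmeans(np.random.rand(200, 2), k=4)   # toy usage
```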
k-Means
Algorithm - Details
▪ Step 2.1: Assign each point…. to the nearest center
• For each point …, compute the Euclidean distance to every center
{μ1, μ2, . . . , μk}
• Find the smallest distance
• The point is said to be assigned to the corresponding cluster (note that each
point is assigned to a single cluster)
▪ Step 2.2: Update each center μj based on the points assigned to it
• Recompute each center μj as the mean of the points that were assigned to it
k-Means
Algorithm - Convergence
▪ Step 2 is repeated while k-Means has not converged
• What criterion should we use to stop iterating?
• A fixed number of iterations? It is arbitrary, and a number that is too small can lead to bad results
• The change in assignments or in center locations between two iterations can be used as a stopping criterion for the algorithm
▪ k-Means does not always converge to the best solution
k-Means
Example
▪ Use the Palmer Penguins dataset
▪ With Flipper length against Bill length
[Figure: flipper length vs. bill length, shown with labels and without labels]
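A sketch of how this example could be set up (using seaborn's copy of the Palmer Penguins data and an assumed choice of k = 3; both are my assumptions, not taken from the slides):

```python
# Sketch: k-means on two penguin features, ignoring the species labels.
import seaborn as sns
from sklearn.cluster import KMeans

df = sns.load_dataset("penguins").dropna()                  # Palmer Penguins data
X = df[["flipper_length_mm", "bill_length_mm"]].to_numpy()
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])   # cluster index for the first few penguins
```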
k-Means
Example
▪ Centroid initialisation
k-Means
Example
▪ First assignment
k-Means
Example
▪ Next assignment
k-Means
Example
▪ Next assignment
k-Means
Example
▪ Final assignment
Summary - Clustering
Used for understanding data
Examples:
Topic discovery in a large set of documents
Recommendation engines
Guessing missing entries
k-means: an approach to clustering
Easy to implement and to interpret
The k-means algorithm converges. However, the optimization is non-convex: the solution depends on the initial conditions
Further reading
▪ PCA: importance of standardisation; PCA on StatQuest
▪ k-means: Chapter 4 of the LinAlgebra book
▪ For coding PCA/k-means, see scikit-learn