Section 06: Unsupervised Learning
Machine Learning II: Unsupervised & Semi-Supervised Learning
Sebastian Peitz
Unsupervised learning
Feature selection
• In the following data set $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}\} = \boldsymbol{X}$, how many features do we need to (approximately) describe the data?
→ If we perform a coordinate transform, one direction is clearly more important in characterizing the structure of $\boldsymbol{X}$ than the second one!
→ This is nothing other than representing the same data in a different coordinate system: instead of using the standard Euclidean basis
$$\boldsymbol{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = x_1 \boldsymbol{e}_1 + x_2 \boldsymbol{e}_2 = x_1 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + x_2 \begin{pmatrix} 0 \\ 1 \end{pmatrix},$$
we can use a basis $\boldsymbol{U}$ tailored to the data:
$$\boldsymbol{x} = a_1 \boldsymbol{u}_1 + a_2 \boldsymbol{u}_2$$
• Which properties should such a new basis $\boldsymbol{U}$ have?
  • It should be orthonormal: $\boldsymbol{u}_i^\top \boldsymbol{u}_j = \delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{else} \end{cases}$
  • For every dimension $r$, it should yield the smallest approximation error:
$$\min_{\boldsymbol{u}_1, \dots, \boldsymbol{u}_r} \sum_{i=1}^{N} \Big\| \boldsymbol{x}^{(i)} - \sum_{j=1}^{r} \big(\boldsymbol{u}_j^\top \boldsymbol{x}^{(i)}\big)\, \boldsymbol{u}_j \Big\|_2^2$$
• Due to the famous Eckart-Young theorem, we know that the solution to this optimization problem can be obtained using a very efficient tool from linear algebra: the Singular Value Decomposition (SVD)
$$\boldsymbol{X} = \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{V}^*$$
• This is a product of three matrices. If $\boldsymbol{X} \in \mathbb{C}^{N \times n}$, then $\boldsymbol{U} \in \mathbb{C}^{N \times N}$, $\boldsymbol{\Sigma} \in \mathbb{R}^{N \times n}$ and $\boldsymbol{V} \in \mathbb{C}^{n \times n}$ ($\boldsymbol{V}^*$ is the conjugate transpose of $\boldsymbol{V}$)
• The matrices have many favorable properties:
  • $\boldsymbol{U} = (\boldsymbol{u}_1, \dots, \boldsymbol{u}_N)$ and $\boldsymbol{V} = (\boldsymbol{v}_1, \dots, \boldsymbol{v}_n)$ are unitary matrices (column-wise orthonormal): $\boldsymbol{u}_i^\top \boldsymbol{u}_j = \delta_{ij}$ and $\boldsymbol{v}_i^\top \boldsymbol{v}_j = \delta_{ij}$
  • $\boldsymbol{\Sigma} = \begin{pmatrix} \hat{\boldsymbol{\Sigma}} \\ \boldsymbol{0} \end{pmatrix}$, where $\hat{\boldsymbol{\Sigma}}$ is a diagonal matrix with diagonal entries $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n \ge 0$: the singular values
  • Since the last $N - n$ rows (assuming that $N > n$) are zero, we have the following economy version:
$$\boldsymbol{X} = \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{V}^* = \begin{pmatrix} \hat{\boldsymbol{U}} & \hat{\boldsymbol{U}}^{\perp} \end{pmatrix} \begin{pmatrix} \hat{\boldsymbol{\Sigma}} \\ \boldsymbol{0} \end{pmatrix} \boldsymbol{V}^* = \hat{\boldsymbol{U}} \hat{\boldsymbol{\Sigma}} \boldsymbol{V}^*$$
  • Since the columns of $\boldsymbol{U}$ and $\boldsymbol{V}$ all have unit length, the relative importance of a particular column $\boldsymbol{u}_j$ of $\boldsymbol{U}$ is encoded in the singular values, e.g., as the share $\sigma_j \big/ \sum_{k=1}^{n} \sigma_k$
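• As a minimal numerical sketch (assuming NumPy; the data matrix and the rank are made up for illustration), the economy SVD and a rank-r reconstruction can be computed as follows:
```python
import numpy as np

# Illustrative data matrix with N samples (rows) and n features (columns);
# any real data set of this shape would work the same way.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))  # N=500, n=10

# Economy SVD: U_hat (N x n), singular values (n,), V* (n x n)
U_hat, s, Vt = np.linalg.svd(X, full_matrices=False)

# Relative importance of each mode, encoded in the singular values
importance = s / s.sum()
print("relative importance of the modes:", np.round(importance, 3))

# Best rank-r approximation (Eckart-Young): keep the r leading modes
r = 3
X_r = U_hat[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
print("rank-%d approximation error: %.3e" % (r, np.linalg.norm(X - X_r)))
```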
[Figure: eigenfaces example. Each face image is flattened/reshaped into a column of $\boldsymbol{X}$; the economy SVD $\boldsymbol{X} = \hat{\boldsymbol{U}} \hat{\boldsymbol{\Sigma}} \boldsymbol{V}^*$ yields the eigenfaces as the columns of $\hat{\boldsymbol{U}}$. Shown: the first 16 eigenfaces ($\tilde{\boldsymbol{U}} \in \mathbb{R}^{32256 \times 16}$) and a low-rank reconstruction.]
• Now let's try to distinguish two individuals from the database by projecting onto two modes:
$$\mathrm{PC}_5(\boldsymbol{x}_i) = \boldsymbol{x}_i^\top \boldsymbol{u}_5, \qquad \mathrm{PC}_6(\boldsymbol{x}_i) = \boldsymbol{x}_i^\top \boldsymbol{u}_6$$
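• A small sketch of this projection step (only NumPy is assumed; the face data is replaced by a random stand-in, and the column-wise data layout is an illustrative choice):
```python
import numpy as np

def project_on_modes(X, mode_a=4, mode_b=5):
    """Project the columns of X (flattened images) onto two SVD modes.

    Modes are 0-indexed, so indices 4 and 5 correspond to PC5 and PC6.
    Subtracting the mean image beforehand is a common (optional) extra step.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    pc_a = X.T @ U[:, mode_a]   # PC5(x_i) = x_i^T u_5 for every column x_i
    pc_b = X.T @ U[:, mode_b]   # PC6(x_i) = x_i^T u_6
    return pc_a, pc_b

# Usage with a made-up stand-in for the face data (32256 pixels, 100 images):
X = np.random.rand(32256, 100)
pc5, pc6 = project_on_modes(X)
# Scatter-plotting (pc5, pc6) and coloring by person would reveal the two individuals.
```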
• Consider a function $f(x)$ that is piecewise smooth and $2\pi$-periodic. Any function of this class can be expressed in terms of its Fourier series:
$$f(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty} \big( a_k \cos(kx) + b_k \sin(kx) \big) = \sum_{k=-\infty}^{\infty} c_k e^{ikx} = \sum_{k=-\infty}^{\infty} (a_k + i b_k)\big(\cos(kx) + i \sin(kx)\big)$$
• The (real) coefficients are given by
$$a_k = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \cos(kx)\, dx, \qquad b_k = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \sin(kx)\, dx$$
• This is nothing other than representing $f(x)$ in terms of an orthogonal basis: the Fourier modes $\cos(kx)$ and $\sin(kx)$
• Closely related to the SVD basis transform, except that $f(x)$ is not a vector but an infinite-dimensional function
• In contrast to point-wise data, the Fourier modes carry global information over the entire domain.
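• As a small numerical sketch (NumPy only; the square-wave example and the number of modes are made up for illustration), the coefficients can be approximated by numerical integration and used in a truncated series:
```python
import numpy as np

# Example function: a 2*pi-periodic square wave (illustrative choice)
n = 4000
x = np.linspace(-np.pi, np.pi, n, endpoint=False)
dx = x[1] - x[0]
f = np.sign(np.sin(x))

K = 20                      # number of Fourier modes to keep
a = np.zeros(K + 1)
b = np.zeros(K + 1)
for k in range(K + 1):
    a[k] = np.sum(f * np.cos(k * x)) * dx / np.pi   # a_k = (1/pi) * int f(x) cos(kx) dx
    b[k] = np.sum(f * np.sin(k * x)) * dx / np.pi   # b_k = (1/pi) * int f(x) sin(kx) dx

# Truncated Fourier series reconstruction
f_hat = a[0] / 2 + sum(a[k] * np.cos(k * x) + b[k] * np.sin(k * x) for k in range(1, K + 1))
print("max deviation:", np.max(np.abs(f - f_hat)))  # largest near the jumps (Gibbs phenomenon)
```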
• The Fourier transform can be adapted to vectors using the Discrete Fourier Transform (DFT) or its highly efficient implementation: the Fast Fourier Transform (FFT)
$$\boldsymbol{x} \in \mathbb{R}^n \;\rightarrow\; \boldsymbol{c} \in \mathbb{C}^n$$
• The entries of $\boldsymbol{c}$ are the complex Fourier coefficients of increasing frequency ($\omega_k = \frac{k\pi}{L}$)
• In 2D: first in one direction, then in the second direction
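• A minimal sketch with NumPy's FFT routines (the signal and image are made up; only `numpy.fft` is assumed):
```python
import numpy as np

# 1D: FFT of a sampled signal
n = 256
t = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
x = np.sin(3 * t) + 0.5 * np.cos(7 * t)          # illustrative signal

c = np.fft.fft(x)                                # complex Fourier coefficients, c in C^n
dominant = np.argsort(-np.abs(c))[:4]            # indices of the 4 largest |c_k|
print("dominant frequency bins:", np.sort(dominant))   # expected: 3, 7, 249, 253

# 2D: transform one direction first, then the other (equivalent to fft2)
img = np.random.rand(64, 64)
C2 = np.fft.fft(np.fft.fft(img, axis=0), axis=1)
assert np.allclose(C2, np.fft.fft2(img))
```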
• Very powerful compression technique: keeping only the largest transform coefficients is the idea behind the classical JPEG standard (which uses the closely related discrete cosine transform; its successor JPEG 2000 uses wavelets instead)
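• A sketch of the idea (assuming NumPy and an arbitrary grayscale image array; the 5% threshold is an illustrative choice): transform the image, keep only the largest coefficients, transform back:
```python
import numpy as np

def compress_image_fft(img, keep=0.05):
    """Keep only the largest `keep` fraction of Fourier coefficients of `img`."""
    C = np.fft.fft2(img)                          # 2D Fourier coefficients
    threshold = np.quantile(np.abs(C), 1 - keep)  # magnitude cut-off
    C_sparse = np.where(np.abs(C) >= threshold, C, 0)
    img_rec = np.real(np.fft.ifft2(C_sparse))     # back-transform; imaginary part ~ 0
    return img_rec, np.mean(np.abs(C_sparse) > 0)

# Usage with a made-up "image" (any 2D grayscale array works):
img = np.outer(np.sin(np.linspace(0, 8, 256)), np.cos(np.linspace(0, 8, 256)))
img_rec, fraction = compress_image_fft(img, keep=0.05)
print("kept %.1f%% of the coefficients, max error %.3e"
      % (100 * fraction, np.max(np.abs(img - img_rec))))
```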
[Figure: example images of dogs and cats]
• Now let's assume that we only have unlabeled data $\mathcal{D} = (\boldsymbol{x}^{(i)})_{i=1}^{N}$, $\boldsymbol{x}^{(i)} \in \mathbb{R}^n$
• We would like to separate the data into $K$ clusters in an optimal way, where each cluster is represented by a prototype vector from the set $\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_K \in \mathbb{R}^n$
• Which parameters do we have to optimize?
→ The prototypes as well as the assignment of data to the clusters:
$$\min_{\boldsymbol{\mu}, \boldsymbol{r}} E = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik}\, \| \boldsymbol{x}_i - \boldsymbol{\mu}_k \|^2$$
• $\boldsymbol{r}$ is a matrix of binary variables ($r_{ik} \in \{0,1\}$), where the first index refers to the data point and the second to the cluster – exactly one entry per row is one: $\sum_{k=1}^{K} r_{ik} = 1$ for all $i \in \{1, \dots, N\}$
→ We assign each data point to precisely one cluster and then seek to minimize the distance of all points within a cluster $k$ to their prototype $\boldsymbol{\mu}_k$
• Which norm for the distance? → It depends! (a short numerical sketch follows the list)
  • Euclidean: $\| \boldsymbol{x}_i - \boldsymbol{\mu}_k \|_2$
  • Squared Euclidean: $\| \boldsymbol{x}_i - \boldsymbol{\mu}_k \|_2^2$
  • Manhattan: $\| \boldsymbol{x}_i - \boldsymbol{\mu}_k \|_1$
  • Maximum distance: $\| \boldsymbol{x}_i - \boldsymbol{\mu}_k \|_\infty$
  • Mahalanobis distance: $(\boldsymbol{x}_i - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_i - \boldsymbol{\mu}_k)$ with the covariance matrix $\boldsymbol{\Sigma}$
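• A compact sketch of these distance measures (NumPy only; the vectors and covariance matrix are made up):
```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])          # data point
mu = np.array([0.0, 1.0, 5.0])         # cluster prototype
Sigma = np.diag([1.0, 2.0, 0.5])       # illustrative covariance matrix

d = x - mu
euclidean   = np.linalg.norm(d, ord=2)
sq_euclid   = euclidean ** 2
manhattan   = np.linalg.norm(d, ord=1)
maximum     = np.linalg.norm(d, ord=np.inf)
mahalanobis = d @ np.linalg.inv(Sigma) @ d
print(euclidean, sq_euclid, manhattan, maximum, mahalanobis)
```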
• Assignments $r_{ik}$: with $\boldsymbol{\mu}$ fixed, assign each data point to its closest prototype:
$$r_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_j \| \boldsymbol{x}_i - \boldsymbol{\mu}_j \|_2^2 \\ 0 & \text{otherwise} \end{cases}$$
• Prototypes $\boldsymbol{\mu}_k$: with $\boldsymbol{r}$ fixed, this is a weighted least squares regression problem:
$$2 \sum_{i=1}^{N} r_{ik} \left( \boldsymbol{x}_i - \boldsymbol{\mu}_k \right) = 0 \;\Leftrightarrow\; \boldsymbol{\mu}_k = \frac{\sum_{i=1}^{N} r_{ik}\, \boldsymbol{x}_i}{\sum_{i=1}^{N} r_{ik}}$$
→ This is the mean over all $\boldsymbol{x}_i$ belonging to cluster $k$
• Repeat the two steps until there are no re-assignments (a sketch of the full algorithm follows below)
• Does this algorithm converge?
→ Yes: neither step can increase the objective function, and there are only finitely many possible assignments, so the algorithm terminates (typically in a local optimum)
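• A minimal NumPy sketch of this alternating scheme (Lloyd's algorithm); the random initialization and the toy data are illustrative choices:
```python
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    """Basic k-means: X is (N, n); returns prototypes (K, n) and assignments (N,)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initialize with random data points
    assign = np.zeros(len(X), dtype=int)
    for it in range(n_iter):
        # Assignment step: closest prototype in squared Euclidean distance
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K)
        new_assign = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_assign, assign):
            break                                        # no re-assignments -> converged
        assign = new_assign
        # Update step: prototype = mean of the points assigned to it
        for k in range(K):
            if np.any(assign == k):
                mu[k] = X[assign == k].mean(axis=0)
    return mu, assign

# Usage on a made-up 2D data set with two blobs:
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(3, 0.5, (100, 2))])
mu, assign = k_means(X, K=2)
print("prototypes:\n", mu)
```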
• Another way to identify clusters is a hierarchical, tree-based approach → the dendrogram
• A cloud of points is split or merged step by step until some threshold is reached
• Divisive approach (top-down):
  • Initially, all points are contained in a single cluster
  • The data is then recursively split into smaller and smaller clusters
  • The splitting continues until the algorithm stops according to a user-specified objective
  • The divisive method can split the data until each data point is its own node
• Agglomerative approach (bottom-up):
  • Initially, each data point $\boldsymbol{x}_j$ is its own cluster
  • The data points are merged in pairs, creating a hierarchy of clusters
  • The merging eventually stops once all the data has been merged into a single cluster
• How can we do this? → Greedy approach!
• Algorithm (a sketch follows below):
  1. Compute the distance (Euclidean, Manhattan, …) between all points: $d(\boldsymbol{x}_i, \boldsymbol{x}_j)$, $i, j \in \{1, \dots, N\}$
  2. Merge the closest two data points into a single new data point midway between their original locations
  3. Repeat the calculation with the new $N - 1$ points
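• A minimal NumPy sketch of this greedy merging (Euclidean distances, recording each merge; purely illustrative):
```python
import numpy as np

def greedy_agglomerative(X):
    """Repeatedly merge the two closest active points into their midpoint.

    Returns the list of merges as (index_a, index_b, distance) tuples, which is
    the information a dendrogram is built from.
    """
    points = [x.astype(float) for x in X]     # original and merged "points"
    active = list(range(len(points)))
    merges = []
    while len(active) > 1:
        # Find the closest pair among the active points (greedy step)
        best = None
        for a in range(len(active)):
            for b in range(a + 1, len(active)):
                d = np.linalg.norm(points[active[a]] - points[active[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        ia, ib = active[a], active[b]
        points.append((points[ia] + points[ib]) / 2)     # new point midway between the two
        merges.append((ia, ib, d))
        active = [i for k, i in enumerate(active) if k not in (a, b)] + [len(points) - 1]
    return merges

# Usage on a tiny made-up data set:
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
for ia, ib, d in greedy_agglomerative(X):
    print("merge", ia, "and", ib, "at distance", round(d, 3))
```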
• Can we also try to find a probabilistic model for our data? This seems to be natural, as noise is often
present in measurements.
• Consider the Old Faithful dataset once more.
• Can we model this using, say, a Gaussian distribution?
• What about a superposition of multiple Gaussians?
• This leads to mixture models (or – if we consider Gaussians – Gaussian Mixture Models (GMMs))
$$p(\boldsymbol{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
• Such mixtures can in general approximate highly complex densities to arbitrary precision
[Figure: Old Faithful data set; $x_2$ = duration of the eruption in minutes]
• The coefficients $\pi_k$ are called mixing coefficients. If both $p(\boldsymbol{x})$ and the individual Gaussians are normalized, then a simple integration yields $\sum_{k=1}^{K} \pi_k = 1$
• In addition, the requirement $p(\boldsymbol{x}) \ge 0$ implies $\pi_k \ge 0$ for all $k$ → $0 \le \pi_k \le 1$
• Using the sum and product rules, we can also write the mixture by interpreting the mixing coefficients as prior probabilities:
$$p(\boldsymbol{x}) = \sum_{k=1}^{K} \underbrace{p(k)}_{\pi_k}\, \underbrace{p(\boldsymbol{x} \mid k)}_{\mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}$$
• Via Bayes' theorem, we thus get access to the posterior probability of $k$ given $\boldsymbol{x}$, a.k.a. the responsibility:
$$\gamma_k(\boldsymbol{x}) = p(k \mid \boldsymbol{x}) = \frac{p(k)\, p(\boldsymbol{x} \mid k)}{p(\boldsymbol{x})}$$
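• A small sketch of evaluating a GMM density and the responsibilities (assuming NumPy and SciPy's `multivariate_normal`; all parameter values are made up):
```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up 2-component GMM in 2D
pi = np.array([0.4, 0.6])                               # mixing coefficients, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.3], [0.3, 0.5]])]

def gmm_density(x):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)"""
    return sum(pi[k] * multivariate_normal.pdf(x, mus[k], Sigmas[k]) for k in range(len(pi)))

def responsibilities(x):
    """gamma_k(x) = pi_k * N(x | mu_k, Sigma_k) / p(x)"""
    comps = np.array([pi[k] * multivariate_normal.pdf(x, mus[k], Sigmas[k]) for k in range(len(pi))])
    return comps / comps.sum()

x = np.array([2.0, 2.5])
print("p(x) =", gmm_density(x))
print("responsibilities:", responsibilities(x))   # sums to 1
```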
• In an entirely Bayesian approach, we thus need to learn the parameters $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ of the individual Gaussian distributions as well as the mixing coefficients $\pi_k$.
• As these can themselves be seen as random variables, let us introduce a corresponding $K$-dimensional latent state $\boldsymbol{z}$ in the form of a 1-of-$K$ representation: $z_k \in \{0, 1\}$, $\sum_{k=1}^{K} z_k = 1$.
→ $\boldsymbol{z}$ can be in $K$ different states.
→ $p(z_k = 1) = \pi_k$ and $\sum_{k=1}^{K} p(z_k = 1) = 1$
• The probability of a specific latent variable $\boldsymbol{z}$ and of a specific sample $\boldsymbol{x}$ given $\boldsymbol{z}$ is thus
$$p(\boldsymbol{z}) = \prod_{k=1}^{K} \pi_k^{z_k} \qquad \text{and} \qquad p(\boldsymbol{x} \mid \boldsymbol{z}) = \prod_{k=1}^{K} \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k}$$
• As a consequence, the Gaussian mixture model can be expressed as before, but using the latent variable $\boldsymbol{z}$ (we'll see in the next chapter how this is beneficial for learning → Expectation Maximization):
$$p(\boldsymbol{x}) = \sum_{\boldsymbol{z}} p(\boldsymbol{z})\, p(\boldsymbol{x} \mid \boldsymbol{z}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
• Responsibility ($\pi_k$ = prior; $\gamma(z_k)$ = posterior):
$$\gamma(z_k) = p(z_k = 1 \mid \boldsymbol{x}) = \frac{p(z_k = 1)\, p(\boldsymbol{x} \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(\boldsymbol{x} \mid z_j = 1)} = \frac{\pi_k\, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
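• A quick sketch of what the latent-variable view means generatively (NumPy only; the parameters are made up): first draw $\boldsymbol{z}$ from $p(\boldsymbol{z})$, then draw $\boldsymbol{x}$ from $p(\boldsymbol{x} \mid \boldsymbol{z})$:
```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up GMM parameters (K = 2 components in 2D)
pi = np.array([0.4, 0.6])
mus = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])

def sample_gmm(n_samples):
    """Ancestral sampling: z ~ p(z), then x ~ p(x | z)."""
    ks = rng.choice(len(pi), size=n_samples, p=pi)          # component index = which z_k is 1
    xs = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])
    return xs, ks

X, z = sample_gmm(1000)
print("empirical mixing proportions:", np.bincount(z) / len(z))   # close to pi
```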
• Maximum likelihood: the parameters are fitted to the data $\boldsymbol{X} = \{\boldsymbol{x}_1, \dots, \boldsymbol{x}_N\}$ by maximizing the log-likelihood
$$\log p(\boldsymbol{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\boldsymbol{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)$$
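• A numerically stable sketch of evaluating this log-likelihood (assuming NumPy/SciPy; parameters and data are made up):
```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """log p(X | pi, mu, Sigma) = sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k)."""
    # log_probs[i, k] = log pi_k + log N(x_i | mu_k, Sigma_k)
    log_probs = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
        for k in range(len(pi))
    ])
    return logsumexp(log_probs, axis=1).sum()

# Usage with made-up parameters and data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
pi = np.array([0.5, 0.5])
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), np.eye(2)]
print("log-likelihood:", gmm_log_likelihood(X, pi, mus, Sigmas))
```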
• From now on, let's again assume that we have both labeled and unlabeled data:
$$\mathcal{D}_L = \big(\boldsymbol{x}^{(i)}, y^{(i)}\big)_{i=1}^{L}, \qquad \mathcal{D}_U = \big(\boldsymbol{x}^{(i)}\big)_{i=L+1}^{N}$$
• Consider the situation where the number of labeled data is much smaller: 𝐿 ≪ 𝑁 − 𝐿, maybe due to the
fact that the labeling has to be done by hand and is very expensive.
• Central goal: improve the learning performance by taking the additional unlabeled (and likely much
cheaper) data into account.
• In some situations, this can help to significantly improve the performance.
• However, this is very hard (or impossible) to prove formally.
• The idea of multi-view learning is to look at an object (e.g., a website) from two (or more) different viewpoints (e.g., the pictures and the text on the website).
• Formally, suppose the instance space $\mathcal{X}$ is split into two parts → an instance is represented in the form $\boldsymbol{x}^{(i)} = \big(\boldsymbol{x}^{(i,1)}, \boldsymbol{x}^{(i,2)}\big)$
• Co-training proceeds from the assumption that each view alone is sufficient to train a good classifier and, moreover, that $\boldsymbol{x}^{(i,1)}$ and $\boldsymbol{x}^{(i,2)}$ are conditionally independent given the class.
• Co-training algorithms repeat the following steps (a sketch follows below):
  • Train two classifiers $h^{(1)}$ and $h^{(2)}$ from $\mathcal{D}_L^{(1)}$ and $\mathcal{D}_L^{(2)}$, respectively.
  • Classify $\mathcal{D}_U$ separately with $h^{(1)}$ and $h^{(2)}$.
  • Add the $k$ most confident examples of $h^{(1)}$ to the labeled training data of $h^{(2)}$.
  • Add the $k$ most confident examples of $h^{(2)}$ to the labeled training data of $h^{(1)}$.
• Advantages:
  • Co-training is a simple wrapper method that applies to all existing classifiers.
  • Co-training tends to be less sensitive to mistakes than self-training.
• Disadvantages:
  • A natural split of the features does not always exist (the feature subsets do not necessarily need to be disjoint).
  • Models using both views simultaneously may often perform better.
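• A minimal sketch of such a co-training loop (assuming scikit-learn-style classifiers with `fit`/`predict`/`predict_proba`; the choice of logistic regression and of `k`/`n_rounds` is illustrative):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_training(X1_l, X2_l, y_l, X1_u, X2_u, k=5, n_rounds=10):
    """Sketch of co-training with two feature views (X1, X2) of the same instances."""
    # Each classifier keeps its own growing labeled set (features of "its" view)
    X1_train, y1_train = X1_l.copy(), y_l.copy()   # for h1
    X2_train, y2_train = X2_l.copy(), y_l.copy()   # for h2
    h1 = h2 = None
    for _ in range(n_rounds):
        h1 = LogisticRegression(max_iter=1000).fit(X1_train, y1_train)
        h2 = LogisticRegression(max_iter=1000).fit(X2_train, y2_train)
        if len(X1_u) == 0:
            break
        # Classify the unlabeled pool separately with h1 and h2
        p1, p2 = h1.predict_proba(X1_u), h2.predict_proba(X2_u)
        top1 = np.argsort(-p1.max(axis=1))[:k]     # h1's k most confident examples
        top2 = np.argsort(-p2.max(axis=1))[:k]     # h2's k most confident examples
        # h1's confident examples (with h1's labels) extend h2's training set, and vice versa
        X2_train = np.vstack([X2_train, X2_u[top1]])
        y2_train = np.concatenate([y2_train, h1.predict(X1_u[top1])])
        X1_train = np.vstack([X1_train, X1_u[top2]])
        y1_train = np.concatenate([y1_train, h2.predict(X2_u[top2])])
        # Remove the transferred examples from the unlabeled pool
        used = np.unique(np.concatenate([top1, top2]))
        keep = np.setdiff1d(np.arange(len(X1_u)), used)
        X1_u, X2_u = X1_u[keep], X2_u[keep]
    return h1, h2
```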
• There are many variants of multi-view learning via combination with other techniques (majority voting, weighting, …)
• Multi-view learning (with $m$ learners) can also be realized via regularization:
$$\min_{h \in \mathcal{H}} \; \sum_{v=1}^{m} \left( \sum_{i=1}^{L} e\big(y_i, h_v(\boldsymbol{x}_i)\big) + \lambda_1 \| h_v \|^2 \right) + \lambda_2 \sum_{u,v=1}^{m} \sum_{j=L+1}^{N} \big( h_u(\boldsymbol{x}_j) - h_v(\boldsymbol{x}_j) \big)^2$$
• Minimizing a (joint) loss function of this kind encourages the learners to agree on the unlabeled data to some extent.
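• A toy sketch of evaluating such a joint loss for $m$ linear learners (NumPy only; the squared error, the regularizer, and the assumption that all learners share the same features are illustrative simplifications):
```python
import numpy as np

def multiview_loss(W, X_l, y_l, X_u, lam1=0.1, lam2=1.0):
    """Joint loss for m linear learners h_v(x) = w_v^T x (all using the same features).

    W: (m, n) weights, one row per learner; X_l: (L, n) labeled inputs,
    y_l: (L,) labels; X_u: (N-L, n) unlabeled inputs.
    """
    preds_l = X_l @ W.T                     # (L, m): h_v(x_i) for all learners
    preds_u = X_u @ W.T                     # (N-L, m)
    data_term = ((preds_l - y_l[:, None]) ** 2).sum()      # sum_v sum_i e(y_i, h_v(x_i))
    reg_term = lam1 * (W ** 2).sum()                        # sum_v ||h_v||^2
    # Disagreement on unlabeled data: sum_{u,v} sum_j (h_u(x_j) - h_v(x_j))^2
    diff = preds_u[:, :, None] - preds_u[:, None, :]        # (N-L, m, m)
    agree_term = lam2 * (diff ** 2).sum()
    return data_term + reg_term + agree_term

# Usage with made-up data and m = 2 learners
rng = np.random.default_rng(0)
X_l, y_l = rng.normal(size=(20, 5)), rng.normal(size=20)
X_u = rng.normal(size=(100, 5))
W = rng.normal(size=(2, 5))
print("joint loss:", multiview_loss(W, X_l, y_l, X_u))
```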
• Another approach: generative models, where the likelihood of the labeled and the unlabeled data is modeled jointly, with the unknown labels marginalized out:
$$P(\mathcal{D}_L, \mathcal{D}_U \mid \boldsymbol{\theta}) = \prod_{i=1}^{L} P(\boldsymbol{x}_i, y_i \mid \boldsymbol{\theta}) \cdot \prod_{j=L+1}^{N} P(\boldsymbol{x}_j \mid \boldsymbol{\theta}) = \prod_{i=1}^{L} P(\boldsymbol{x}_i, y_i \mid \boldsymbol{\theta}) \cdot \prod_{j=L+1}^{N} \sum_{y \in \mathcal{Y}} P(\boldsymbol{x}_j, y \mid \boldsymbol{\theta})$$
• Solution of this problem: maximum likelihood estimation:
$$\boldsymbol{\theta}^* = \operatorname*{argmax}_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} P(\mathcal{D}_L, \mathcal{D}_U \mid \boldsymbol{\theta})$$
• Advantage: theoretically well-grounded, often effective
• Disadvantage: computationally complex, and $P(\mathcal{D}_L, \mathcal{D}_U \mid \boldsymbol{\theta})$ may have multiple local maxima
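• A compact sketch of evaluating this likelihood (in log space) for a simple class-conditional Gaussian model (NumPy/SciPy assumed; the model choice and all parameter values are illustrative):
```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def semi_supervised_log_likelihood(X_l, y_l, X_u, priors, mus, Sigmas):
    """log P(D_L, D_U | theta) for a class-conditional Gaussian model.

    P(x, y | theta) = priors[y] * N(x | mus[y], Sigmas[y]); for the unlabeled
    data the label is summed (marginalized) out.
    """
    classes = range(len(priors))
    # Labeled part: sum_i log P(x_i, y_i | theta)
    ll_labeled = sum(np.log(priors[y]) + multivariate_normal.logpdf(x, mus[y], Sigmas[y])
                     for x, y in zip(X_l, y_l))
    # Unlabeled part: sum_j log sum_y P(x_j, y | theta)
    log_joint_u = np.column_stack([
        np.log(priors[c]) + multivariate_normal.logpdf(X_u, mus[c], Sigmas[c])
        for c in classes
    ])
    ll_unlabeled = logsumexp(log_joint_u, axis=1).sum()
    return ll_labeled + ll_unlabeled

# Usage with made-up data and parameters (two classes in 2D)
rng = np.random.default_rng(0)
X_l, y_l = rng.normal(size=(10, 2)), rng.integers(0, 2, size=10)
X_u = rng.normal(size=(50, 2))
priors = [0.5, 0.5]
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), np.eye(2)]
print("log P(D_L, D_U | theta) =",
      semi_supervised_log_likelihood(X_l, y_l, X_u, priors, mus, Sigmas))
```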