
MACHINE LEARNING II

UNSUPERVISED AND SEMI-SUPERVISED LEARNING


JUN.-PROF. DR. SEBASTIAN PEITZ
Summer Term 2022

Unsupervised and semi-supervised learning

• Until now, our data has always been labeled:

  $\mathcal{D} = \{(\boldsymbol{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$
• Everything until now has been supervised learning, as – during training – we can tell our learning algorithm $\mathcal{A}$ for each sample what the outcome should be and whether the prediction $h(\boldsymbol{x}^{(i)})$ was correct or not
• However, in many situations, we do not necessarily have labels…
• … or maybe just for some of the samples
• Think about the effort of an expert having to label a gigantic number of images / documents / …
• Side note: The ImageNet library for visual recognition (> 14 million images) has had a massive impact on the advances of modern ML techniques!
• In unsupervised learning, our training data is $\mathcal{D}_U = \{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$, where the index $U$ indicates the absence of labels.
• In semi-supervised learning, our training data consists of both labeled and unlabeled data:

  $\mathcal{D}_L = \{(\boldsymbol{x}^{(i)}, y^{(i)})\}_{i=1}^{L}, \qquad \mathcal{D}_U = \{\boldsymbol{x}^{(i)}\}_{i=L+1}^{N}$


Unsupervised learning

• If we have no labels, what can we hope to find / learn?
  → Patterns in the data! E.g., clusters.
• These patterns allow for the classification of new samples
• Example: the identification of customer preferences in social networks
• If we have previously labeled some elements from a cluster, then we can easily label new samples as well:
  • Categorization/labeling of movies

  [Figure: Old Faithful eruption data – $x_1$: time to next eruption [min], $x_2$: duration of the eruption [min]]

• Central question: What are the important features in unsupervised learning?

Feature selection

• What distinguishes two classes in a high-dimensional feature space?
• Example: “Cat vs. dog” images (32 × 32 = 1024 pixels → $\boldsymbol{x} \in \mathbb{R}^{1024}$)
• Remember: All real-world data possesses a massive amount of structure!
  → There should be some lower-dimensional latent variable which allows us to distinguish between the two classes just as well
  → Principal components, Fourier modes, …


Feature selection – Principal Component Analysis / Singular Value Decomposition (1/5)

• In the following data set $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}\} = \boldsymbol{X}$, how many features do we need to (approximately) describe the data?
  → If we perform a coordinate transform, one direction is clearly more important in characterizing the structure of $\boldsymbol{X}$ than the second one!
  → This is nothing else but representing the same data in a different coordinate system: instead of using the standard Euclidean basis

  $\boldsymbol{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = x_1 \boldsymbol{e}_1 + x_2 \boldsymbol{e}_2 = x_1 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + x_2 \begin{pmatrix} 0 \\ 1 \end{pmatrix},$

  we can use a basis $\boldsymbol{U}$ tailored to the data:

  $\boldsymbol{x} = a_1 \boldsymbol{u}_1 + a_2 \boldsymbol{u}_2$
• Which properties should such a new basis $\boldsymbol{U}$ have?
  • It should be orthonormal: $\boldsymbol{u}_i^\top \boldsymbol{u}_j = \delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{else} \end{cases}$
  • For every dimension $r$, it should have the smallest approximation error:

  $\boldsymbol{U} = \underset{\widehat{\boldsymbol{U}} \text{ s.t. } \operatorname{rank}(\widehat{\boldsymbol{U}}) = r}{\arg\min} \; \sum_{i=1}^{N} \Big\| \boldsymbol{x}_i - \sum_{j=1}^{r} \big(\boldsymbol{u}_j^\top \boldsymbol{x}_i\big)\, \boldsymbol{u}_j \Big\|^2 \quad\Leftrightarrow\quad \boldsymbol{X}_r = \underset{\widehat{\boldsymbol{X}} \text{ s.t. } \operatorname{rank}(\widehat{\boldsymbol{X}}) = r}{\arg\min} \; \big\| \widehat{\boldsymbol{X}} - \boldsymbol{X} \big\|_F$

Feature selection – Principal Component Analysis / Singular Value Decomposition (2/5)

• Due to the famous Eckart-Young theorem, we know that the solution to this optimization problem can be obtained using a very efficient tool from linear algebra: the Singular Value Decomposition (SVD)

  $\boldsymbol{X} = \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{V}^*$

• This is a product of three matrices. If $\boldsymbol{X} \in \mathbb{C}^{n \times N}$, then $\boldsymbol{U} \in \mathbb{C}^{n \times n}$, $\boldsymbol{\Sigma} \in \mathbb{R}^{n \times N}$ and $\boldsymbol{V} \in \mathbb{C}^{N \times N}$ ($\boldsymbol{V}^*$ is the conjugate transpose of $\boldsymbol{V}$)
• The matrices have many favorable properties:
  • $\boldsymbol{U} = (\boldsymbol{u}_1, \dots, \boldsymbol{u}_n)$ and $\boldsymbol{V} = (\boldsymbol{v}_1, \dots, \boldsymbol{v}_N)$ are unitary matrices (column-wise orthonormal): $\boldsymbol{u}_i^\top \boldsymbol{u}_j = \delta_{ij}$ and $\boldsymbol{v}_i^\top \boldsymbol{v}_j = \delta_{ij}$
  • $\boldsymbol{\Sigma} = \begin{pmatrix} \widehat{\boldsymbol{\Sigma}} \\ \boldsymbol{0} \end{pmatrix}$, where $\widehat{\boldsymbol{\Sigma}}$ is a diagonal matrix with diagonal entries $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_N \ge 0$: the singular values
  • Since the last $n - N$ rows of $\boldsymbol{\Sigma}$ (assuming that $n > N$) are zero, we have the following economy version:

  $\boldsymbol{X} = \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{V}^* = \begin{pmatrix} \widehat{\boldsymbol{U}} & \widehat{\boldsymbol{U}}_\perp \end{pmatrix} \begin{pmatrix} \widehat{\boldsymbol{\Sigma}} \\ \boldsymbol{0} \end{pmatrix} \boldsymbol{V}^* = \widehat{\boldsymbol{U}} \widehat{\boldsymbol{\Sigma}} \boldsymbol{V}^*$

• Since the columns of $\boldsymbol{U}$ and $\boldsymbol{V}$ all have unit length, the relative importance of a particular column of $\boldsymbol{U}$ is encoded in the singular values:

  $\boldsymbol{X} = \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{V}^* = \sum_{i=1}^{N} \sigma_i \boldsymbol{u}_i \boldsymbol{v}_i^* = \sigma_1 \boldsymbol{u}_1 \boldsymbol{v}_1^* + \dots + \sigma_N \boldsymbol{u}_N \boldsymbol{v}_N^*$
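As a concrete illustration (not part of the original slides; the matrix sizes are made up), a minimal NumPy sketch that checks the stated properties of the full and economy SVD:

```python
import numpy as np

# Toy data matrix with n features (rows) and N samples (columns), n > N.
rng = np.random.default_rng(0)
n, N = 50, 10
X = rng.standard_normal((n, N))

# Full SVD: U is n x n, s contains the singular values, Vh = V*.
U, s, Vh = np.linalg.svd(X, full_matrices=True)

# Economy SVD: U_hat is n x N, diag(s_hat) is N x N.
U_hat, s_hat, Vh_hat = np.linalg.svd(X, full_matrices=False)

# The columns are orthonormal, the product reconstructs X,
# and the singular values are sorted in decreasing order.
assert np.allclose(U_hat.T @ U_hat, np.eye(N))
assert np.allclose(U_hat @ np.diag(s_hat) @ Vh_hat, X)
assert np.all(np.diff(s_hat) <= 0)
```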

Feature selection – Principal Component Analysis / Singular Value Decomposition (3/5)

• Do we need all columns of $\boldsymbol{U}$ to reconstruct the matrix $\boldsymbol{X}$?
• What if we are willing to accept a certain error?
  → Truncate $\boldsymbol{U}$ after $r$ columns!

  $\widetilde{\boldsymbol{X}} = \widetilde{\boldsymbol{U}} \widetilde{\boldsymbol{\Sigma}} \widetilde{\boldsymbol{V}}^* \approx \boldsymbol{X}$

  → The Eckart-Young theorem says that this is the best rank-$r$ matrix we can find
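A minimal sketch (with a made-up matrix) of this rank-$r$ truncation; by the Eckart-Young theorem, the Frobenius error of the best rank-$r$ approximation equals the norm of the discarded singular values:

```python
import numpy as np

def truncated_svd(X, r):
    """Best rank-r approximation of X in the Frobenius norm (Eckart-Young)."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vh[:r, :], s

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 30))
X_r, s = truncated_svd(X, r=5)

# The approximation error is determined by the discarded singular values.
err = np.linalg.norm(X - X_r, ord="fro")
assert np.isclose(err, np.sqrt(np.sum(s[5:] ** 2)))
```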


Feature selection – Principal Component Analysis / Singular Value Decomposition (4/5)

• The same can be done with high-dimensional data
• Example: Yale Faces B database (192 × 168 = 32256 pixels, 2414 images → $\boldsymbol{X} \in \mathbb{R}^{32256 \times 2414}$)
  • Flatten / reshape each image into one column of $\boldsymbol{X}$ (the figure shows only the first 1000 rows)
  • Economy SVD: $\boldsymbol{X} = \widehat{\boldsymbol{U}} \widehat{\boldsymbol{\Sigma}} \boldsymbol{V}^*$
  • First 16 eigenfaces: $\widetilde{\boldsymbol{U}} \in \mathbb{R}^{32256 \times 16}$ → low-rank reconstruction of the faces


Feature selection – Principal Component Analysis / Singular Value Decomposition (5/5)

• Now let’s try to distinguish two individuals from the data base by projecting onto two modes:
𝐏𝐂𝟓 𝒙𝑖 = 𝒙⊤ 𝑖 𝒖5 , 𝐏𝐂𝟔(𝒙𝑖 ) = 𝒙⊤ 𝑖 𝒖6
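A hedged sketch of such a projection onto two modes; the data matrix below is a random stand-in for the face images (the actual database is not loaded here), and the column indices 4 and 5 correspond to the modes $\boldsymbol{u}_5$ and $\boldsymbol{u}_6$:

```python
import numpy as np

# Random stand-in for the face data: columns of X are flattened images.
rng = np.random.default_rng(2)
n, N = 1024, 200                       # illustrative sizes only
X = rng.standard_normal((n, N))

# Economy SVD; the columns of U play the role of the eigenface modes.
U, s, Vh = np.linalg.svd(X, full_matrices=False)

# PC5(x_i) = x_i^T u_5 and PC6(x_i) = x_i^T u_6 for every sample x_i.
pc5 = X.T @ U[:, 4]
pc6 = X.T @ U[:, 5]
coords = np.stack([pc5, pc6], axis=1)  # N x 2 coordinates for a scatter plot
```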


Feature selection – Fourier transform / Fast Fourier Transform (1/3)

• Consider a function $f(x)$ that is piecewise smooth and $2\pi$-periodic. Any function of this class can be expressed in terms of its Fourier series:

  $f(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty} \big( a_k \cos(kx) + b_k \sin(kx) \big) = \sum_{k=-\infty}^{\infty} c_k e^{ikx} = \sum_{k=-\infty}^{\infty} c_k \big( \cos(kx) + i \sin(kx) \big)$
• The (real) coefficients are given by

  $a_k = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \cos(kx)\, dx, \qquad b_k = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \sin(kx)\, dx$
• This is nothing else but representing 𝑓 𝑥 in terms of an
orthogonal basis: the Fourier modes cos 𝑘𝑥 and sin 𝑘𝑥
• Closely related to the SVD basis transform, only that 𝑓 𝑥
is not a vector, but an infinite-dimensional function
• Instead of point-wise data, the Fourier modes contain
global information over the entire domain.
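A small numerical sketch (not from the slides): the Fourier coefficients of a $2\pi$-periodic square wave, approximated with the trapezoidal rule:

```python
import numpy as np

# f(x) = sign(sin(x)) is piecewise smooth and 2*pi-periodic.
x = np.linspace(-np.pi, np.pi, 4001)
f = np.sign(np.sin(x))

def a(k):
    return np.trapz(f * np.cos(k * x), x) / np.pi

def b(k):
    return np.trapz(f * np.sin(k * x), x) / np.pi

# For the square wave, a_k is (numerically) zero and b_k is close to
# 4 / (pi * k) for odd k and 0 for even k.
print([round(b(k), 3) for k in range(1, 6)])
```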


Feature selection – Fourier transform / Fast Fourier Transform (2/3)

• The Fourier transform can be adapted to vectors using the Discrete Fourier Transform (DFT) or its highly efficient implementation: the Fast Fourier Transform (FFT)

  $\boldsymbol{x} \in \mathbb{R}^n \;\rightarrow\; \boldsymbol{c} \in \mathbb{C}^n$

• The entries of $\boldsymbol{c}$ are the complex Fourier coefficients of increasing frequency ($\omega_k = \frac{k\pi}{L}$)
• In 2D: first in one direction, then in the second direction
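A minimal NumPy sketch of the DFT/FFT mapping $\boldsymbol{x} \in \mathbb{R}^n$ to $\boldsymbol{c} \in \mathbb{C}^n$ and of the 2D case (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(64)          # x in R^n
c = np.fft.fft(x)                    # c in C^n: complex Fourier coefficients

# The transform is invertible.
assert np.allclose(np.fft.ifft(c), x)

# In 2D the FFT is applied along one direction and then along the other;
# np.fft.fft2 does exactly that.
img = rng.standard_normal((32, 32))
C = np.fft.fft2(img)
assert np.allclose(C, np.fft.fft(np.fft.fft(img, axis=0), axis=1))
```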


Feature selection – Fourier transform / Fast Fourier Transform (3/3)

• Very powerful compression technique (the closely related discrete cosine transform is the core of the classical JPEG algorithm)
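A hedged sketch of the compression idea on a made-up smooth “image”: keep only the largest Fourier coefficients and transform back (the actual JPEG pipeline is considerably more involved):

```python
import numpy as np

def fft_compress(img, keep=0.05):
    """Keep only the largest `keep` fraction of FFT coefficients (by magnitude)."""
    C = np.fft.fft2(img)
    thresh = np.quantile(np.abs(C), 1.0 - keep)
    C_sparse = np.where(np.abs(C) >= thresh, C, 0.0)
    return np.real(np.fft.ifft2(C_sparse))

# A smooth 2D signal compresses very well in the Fourier basis.
x = np.linspace(0, 2 * np.pi, 128)
img = np.outer(np.sin(3 * x), np.cos(5 * x))
rec = fft_compress(img, keep=0.05)
print("relative error:", np.linalg.norm(rec - img) / np.linalg.norm(img))
```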


Feature selection in latent variables (1/3)

• Let’s consider the cats and dogs example once more:
• These are the first four SVD modes
• Alternative feature identification method: first four modes according to the following procedure (sketched in code below)
  • Transform images $\boldsymbol{X}$ to the wavelet domain $\boldsymbol{C}$ (think of this as a hierarchical version of the Fourier transform)
    → The wavelet transform is the basis of today’s JPEG 2000 compression
  • Perform an SVD on the wavelet/Fourier coefficients → basis $\boldsymbol{U}_{\boldsymbol{C}}$ for the space of wavelet/Fourier coefficients
  • Inverse wavelet/Fourier transform of the basis back to the original space → $\boldsymbol{U}_{\boldsymbol{X}}$
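A hedged sketch of this procedure with a single-level Haar transform; PyWavelets (`pywt`) is assumed to be available, and the random images below are stand-ins for the cat/dog data:

```python
import numpy as np
import pywt  # PyWavelets, assumed to be installed

def to_wavelet_vec(img, wavelet="haar"):
    """Single-level 2D DWT, flattened into one coefficient vector."""
    cA, (cH, cV, cD) = pywt.dwt2(img, wavelet)
    return np.concatenate([c.ravel() for c in (cA, cH, cV, cD)]), cA.shape

def from_wavelet_vec(vec, band_shape, wavelet="haar"):
    """Rebuild the coefficient bands and apply the inverse DWT."""
    m = band_shape[0] * band_shape[1]
    cA, cH, cV, cD = (vec[i * m:(i + 1) * m].reshape(band_shape) for i in range(4))
    return pywt.idwt2((cA, (cH, cV, cD)), wavelet)

# Random stand-in for the images (one 64x64 image per entry).
rng = np.random.default_rng(4)
images = rng.standard_normal((100, 64, 64))

# 1) Transform all images to the wavelet domain and stack them as columns of C.
transformed = [to_wavelet_vec(img) for img in images]
C = np.stack([vec for vec, _ in transformed], axis=1)
band_shape = transformed[0][1]

# 2) SVD in the wavelet domain -> basis U_C.
U_C, s, Vh = np.linalg.svd(C, full_matrices=False)

# 3) Inverse wavelet transform of the first four modes -> U_X in image space.
modes_in_image_space = [from_wavelet_vec(U_C[:, j], band_shape) for j in range(4)]
```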


Feature selection in latent variables (2/3)

[Figure: the resulting modes shown in the original space and in the wavelet space]


Feature selection in latent variables (3/3)

[Figure: projection of the cat and dog images onto the selected latent features (points labeled “Dogs” and “Cats”)]

Unsupervised learning – K-means clustering (1/4)

• Now let’s assume that we only have unlabeled data $\mathcal{D} = \{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$, $\boldsymbol{x}^{(i)} \in \mathbb{R}^n$
• We would like to separate the data into $K$ clusters in an optimal way, represented by a set of $K$ prototype vectors $\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_K \in \mathbb{R}^n$
• Which parameters do we have to optimize?
  → The prototypes as well as the assignment of the data to the clusters:

  $\min_{\boldsymbol{\mu}, \boldsymbol{r}} \; E = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \, \big\| \boldsymbol{x}_i - \boldsymbol{\mu}_k \big\|^2$

• $\boldsymbol{r}$ is a matrix of binary variables ($r_{ik} \in \{0,1\}$), where the first index refers to the data point and the second to the cluster – exactly one entry per row is one: $\sum_{k=1}^{K} r_{ik} = 1$ for all $i \in \{1, \dots, N\}$
  → We assign each data point to precisely one cluster and then seek to minimize the distance of all points within a cluster $k$ to their prototype $\boldsymbol{\mu}_k$
• Which norm for the distance? → It depends! Common choices (see the sketch below):
  • Euclidean: $\| \boldsymbol{x}_i - \boldsymbol{\mu}_k \|_2$
  • Squared Euclidean: $\| \boldsymbol{x}_i - \boldsymbol{\mu}_k \|_2^2$
  • Manhattan: $\| \boldsymbol{x}_i - \boldsymbol{\mu}_k \|_1$
  • Maximum distance: $\| \boldsymbol{x}_i - \boldsymbol{\mu}_k \|_\infty$
  • Mahalanobis distance: $(\boldsymbol{x}_i - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_i - \boldsymbol{\mu}_k)$ with the covariance matrix $\boldsymbol{\Sigma}$
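A small NumPy illustration of the listed distance measures (the vectors and the covariance matrix are made up for the example):

```python
import numpy as np

x  = np.array([1.0, 2.0, 3.0])
mu = np.array([0.0, 0.0, 1.0])
d  = x - mu

euclidean   = np.linalg.norm(d, ord=2)
sq_euclid   = euclidean ** 2
manhattan   = np.linalg.norm(d, ord=1)
maximum     = np.linalg.norm(d, ord=np.inf)

Sigma = np.diag([1.0, 4.0, 9.0])             # example covariance matrix
mahalanobis = d @ np.linalg.inv(Sigma) @ d   # (x - mu)^T Sigma^{-1} (x - mu)
```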

Unsupervised learning – K-means clustering (2/4)

• How do we solve the optimization problem to minimize $E = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \, \| \boldsymbol{x}_i - \boldsymbol{\mu}_k \|_2^2$?

• Alternate between $\boldsymbol{r}$ and $\boldsymbol{\mu}$
• Assignment $\boldsymbol{r}$: with $\boldsymbol{\mu}_k$ fixed, $E$ can be decomposed into the individual contribution of each data point:

  $r_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_j \| \boldsymbol{x}_i - \boldsymbol{\mu}_j \|_2^2 \\ 0 & \text{otherwise} \end{cases}$

• Prototypes $\boldsymbol{\mu}_k$: with $\boldsymbol{r}$ fixed, this is a weighted least squares regression problem:

  $2 \sum_{i=1}^{N} r_{ik} \big( \boldsymbol{x}_i - \boldsymbol{\mu}_k \big) = 0 \;\Leftrightarrow\; \boldsymbol{\mu}_k = \frac{\sum_{i=1}^{N} r_{ik} \boldsymbol{x}_i}{\sum_{i=1}^{N} r_{ik}}$

  → This is the mean over all $\boldsymbol{x}_i$ belonging to cluster $k$
→ This is the mean over all 𝒙𝑖 belonging to cluster 𝑘
• Repeat the two steps until there are no re-assignments
• Does this algorithm converge?
→ Yes, because a reduction of the objective function is guaranteed by design

• However, we have to be aware that the solution can be a local minimum (a minimal implementation is sketched below)
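A minimal K-means sketch in NumPy following the two alternating steps above (random initialization; in practice one would add multiple restarts to guard against poor local minima):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means with the squared Euclidean distance.

    X: (N, n) data matrix, K: number of clusters.
    Returns the prototypes mu (K, n) and the hard assignments r (N,)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    r = None
    for _ in range(n_iter):
        # Assignment step: each point is assigned to its closest prototype.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        r_new = np.argmin(dists, axis=1)
        if r is not None and np.array_equal(r_new, r):
            break                                  # no re-assignments -> converged
        r = r_new
        # Update step: each prototype becomes the mean of its assigned points.
        for k in range(K):
            if np.any(r == k):
                mu[k] = X[r == k].mean(axis=0)
    return mu, r

# Example: two well-separated Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
mu, r = kmeans(X, K=2)
```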


Unsupervised learning – K-means clustering (3/4)

• Example: Old Faithful


Unsupervised learning – K-means clustering (4/4)

• How to choose the number of clusters?
  → It depends on the data! Oftentimes, multiple runs with varying $K$ are required
• Example: Image segmentation by clustering the RGB pixels of an image ($N$ pixels/points, $\boldsymbol{x}_i \in [0,1]^3$, $i = 1, \dots, N$); see the sketch below
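A hedged sketch of the segmentation example using scikit-learn's KMeans (the input image here is random; with a real photo, $K$ controls the number of colors in the result):

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_image(img, K):
    """Segment an RGB image (H, W, 3) with values in [0, 1] by clustering its pixels."""
    H, W, _ = img.shape
    pixels = img.reshape(-1, 3)                    # N = H*W points, x_i in [0, 1]^3
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(pixels)
    centers = np.array([pixels[labels == k].mean(axis=0) for k in range(K)])
    return centers[labels].reshape(H, W, 3)        # replace each pixel by its prototype

img = np.random.default_rng(5).random((60, 80, 3))  # random stand-in image
segmented = segment_image(img, K=4)
```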


Unsupervised learning – Dendrogram

• Another approach to identify clusters is via a hierarchical, tree-based approach → the Dendrogram
• A cloud of points is clustered / separated one by one, until some threshold is achieved
• Divisive approach (top-down):
• All points are contained in a single cluster
• The data is then recursively split into smaller and smaller clusters
• The splitting continues until the algorithm stops according to a user-specified objective
• The divisive method can split the data until each data point is its own node
• Agglomerative approach (bottom-up):
• Each data point 𝑥𝑗 is its own cluster initially.
• The data is merged in pairs as one creates a hierarchy of clusters.
• The merging of data eventually stops once all the data has been merged into a single cluster
• How can we do this? → Greedy approach!


Unsupervised learning – Dendrogram

• Algorithm (a minimal implementation is sketched below):
  1. Compute the distance (Euclidean, Manhattan, …) between all points: $d(\boldsymbol{x}_i, \boldsymbol{x}_j)$, $i, j \in \{1, \dots, N\}$
  2. Merge the closest two data points into a single new data point midway between their original locations
  3. Repeat the calculation with the new $N - 1$ points
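A minimal sketch of this greedy agglomerative procedure (for real data, `scipy.cluster.hierarchy.linkage` and `dendrogram` would typically be used to build and plot the tree):

```python
import numpy as np

def greedy_agglomerate(X, stop_at=1):
    """Repeatedly merge the two closest points into their midpoint.

    Returns the merge history as a list of (index_i, index_j, distance) tuples."""
    pts = [x.astype(float) for x in X]
    history = []
    while len(pts) > stop_at:
        # 1. Distances between all current points.
        best = (np.inf, None, None)
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                d = np.linalg.norm(pts[i] - pts[j])
                if d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((i, j, d))
        # 2. Merge the closest pair into a new point midway between them.
        merged = 0.5 * (pts[i] + pts[j])
        # 3. Repeat with the remaining N - 1 points.
        pts = [p for k, p in enumerate(pts) if k not in (i, j)] + [merged]
    return history
```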


Unsupervised learning – Dendrogram

• Example: Two Gaussian distributions with 50 points each, Euclidean distance: $d(\boldsymbol{x}_i, \boldsymbol{x}_j) = \| \boldsymbol{x}_i - \boldsymbol{x}_j \|_2$


Unsupervised learning – Mixture models (1/4)

• Can we also try to find a probabilistic model for our data? This seems natural, as noise is often present in measurements.
• Consider the Old Faithful dataset once more.
  • Can we model this using, say, a single Gaussian distribution?
  • What about a superposition of multiple Gaussians?
• This leads to mixture models (or – if we consider Gaussians – Gaussian Mixture Models (GMMs)):

  $p(\boldsymbol{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\big( \boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k \big)$

• These can in general represent highly complex densities to arbitrary precision (see the sketch below)
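A small SciPy sketch of evaluating such a mixture density; the two-component parameters below are made up to roughly mimic the two Old Faithful clusters:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=S)
               for pi, mu, S in zip(pis, mus, Sigmas))

# Made-up two-component mixture in 2D.
pis    = [0.4, 0.6]
mus    = [np.array([55.0, 2.0]), np.array([80.0, 4.3])]
Sigmas = [np.diag([30.0, 0.1]), np.diag([30.0, 0.1])]

print(gmm_density(np.array([78.0, 4.0]), pis, mus, Sigmas))
```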

Unsupervised learning – Mixture models (2/4)

• The coefficients $\pi_k$ are called mixing coefficients. If both $p(\boldsymbol{x})$ and the individual Gaussians are normalized, then a simple integration yields $\sum_{k=1}^{K} \pi_k = 1$
• In addition, the requirement $p(\boldsymbol{x}) \ge 0$ implies $\pi_k \ge 0$ for all $k$ → $0 \le \pi_k \le 1$
• Using the sum and product rule, we can also write the mixture in terms of the mixing coefficients as follows:

  $p(\boldsymbol{x}) = \sum_{k=1}^{K} p(k)\, p(\boldsymbol{x} \mid k), \qquad \text{with } p(k) = \pi_k \text{ and } p(\boldsymbol{x} \mid k) = \mathcal{N}\big( \boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k \big)$

• Via Bayes’ theorem, we thus get access to the posterior probability of $k$ given $\boldsymbol{x}$, a.k.a. the responsibility:

  $\gamma_k(\boldsymbol{x}) = p(k \mid \boldsymbol{x}) = \frac{p(k)\, p(\boldsymbol{x} \mid k)}{p(\boldsymbol{x})}$

• This responsibility $\gamma_k$ can be used to infer a cluster membership: given a new sample $\boldsymbol{x}$, which cluster has the highest responsibility for this sample?


Unsupervised learning – Mixture models (3/4)

• In an entirely Bayesian approach, we thus need to learn the parameters $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ of the individual Gaussian distributions as well as the mixing coefficients $\pi_k$.
• As these can themselves be seen as random variables, let us introduce a corresponding $K$-dimensional latent state $\boldsymbol{z}$ in the form of a 1-of-$K$ representation: $z_k \in \{0,1\}$, $\sum_{k=1}^{K} z_k = 1$.
  → $\boldsymbol{z}$ can be in $K$ different states.
  → $p(z_k = 1) = \pi_k$ and $\sum_{k=1}^{K} p(z_k = 1) = 1$
• The probability of a specific latent variable $\boldsymbol{z}$, and of a specific sample $\boldsymbol{x}$ given $\boldsymbol{z}$, are thus

  $p(\boldsymbol{z}) = \prod_{k=1}^{K} \pi_k^{z_k} \qquad \text{and} \qquad p(\boldsymbol{x} \mid \boldsymbol{z}) = \prod_{k=1}^{K} \mathcal{N}\big( \boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k \big)^{z_k}$

• As a consequence, the Gaussian mixture model can be expressed as before, but using the latent variable $\boldsymbol{z}$ (we’ll see in the next chapter how this is beneficial for learning → Expectation Maximization):

  $p(\boldsymbol{x}) = \sum_{\boldsymbol{z}} p(\boldsymbol{z})\, p(\boldsymbol{x} \mid \boldsymbol{z}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\big( \boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k \big)$
• Responsibility ($\pi_k$ = prior; $\gamma(z_k)$ = posterior):

  $\gamma(z_k) = p(z_k = 1 \mid \boldsymbol{x}) = \frac{p(z_k = 1)\, p(\boldsymbol{x} \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(\boldsymbol{x} \mid z_j = 1)} = \frac{\pi_k \, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$
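A minimal sketch of this responsibility computation, reusing the made-up mixture parameters from the earlier sketch; the component with the largest responsibility gives the inferred cluster membership:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(x, pis, mus, Sigmas):
    """gamma(z_k) = pi_k N(x|mu_k,Sigma_k) / sum_j pi_j N(x|mu_j,Sigma_j)."""
    weighted = np.array([pi * multivariate_normal.pdf(x, mean=mu, cov=S)
                         for pi, mu, S in zip(pis, mus, Sigmas)])
    return weighted / weighted.sum()

gamma = responsibilities(np.array([78.0, 4.0]),
                         [0.4, 0.6],
                         [np.array([55.0, 2.0]), np.array([80.0, 4.3])],
                         [np.diag([30.0, 0.1]), np.diag([30.0, 0.1])])
print(gamma, gamma.argmax())   # responsibilities sum to one; argmax = cluster membership
```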

Unsupervised learning – Mixture models (4/4)

• Responsibility in the previous example:

  [Figure: panels showing $p(\boldsymbol{z})$, $p(\boldsymbol{x} \mid \boldsymbol{z})$ and $p(\boldsymbol{x})$, with the data points colored by averaging the cluster colors using $\gamma_k$]
• How can we train this model given a data matrix $\boldsymbol{X} \in \mathbb{R}^{n \times N}$?
  → Likelihood maximization over $\boldsymbol{z}$ and the parameters of the distribution, i.e., $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$:

  $\log p(\boldsymbol{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}\big( \boldsymbol{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k \big) \right)$
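A hedged sketch of evaluating this log-likelihood for given parameters (the optimization itself, e.g. via Expectation Maximization, is the topic of the next chapter); `logsumexp` is used for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """log p(X | pi, mu, Sigma) = sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k).

    X contains one sample per row."""
    log_terms = np.stack([np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=S)
                          for pi, mu, S in zip(pis, mus, Sigmas)], axis=1)   # (N, K)
    return logsumexp(log_terms, axis=1).sum()

X = np.random.default_rng(6).normal(size=(100, 2))     # made-up data
print(gmm_log_likelihood(X, [0.5, 0.5],
                         [np.zeros(2), np.ones(2)],
                         [np.eye(2), np.eye(2)]))
```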


Semi-supervised learning – Self-training

• From now on, let’s again assume that we have both labeled and unlabeled data:

  $\mathcal{D}_L = \{(\boldsymbol{x}^{(i)}, y^{(i)})\}_{i=1}^{L}, \qquad \mathcal{D}_U = \{\boldsymbol{x}^{(i)}\}_{i=L+1}^{N}$
• Consider the situation where the number of labeled data is much smaller: 𝐿 ≪ 𝑁 − 𝐿, maybe due to the
fact that the labeling has to be done by hand and is very expensive.
• Central goal: improve the learning performance by taking the additional unlabeled (and likely much
cheaper) data into account.
• In some situations, this can help to significantly improve the performance.
• However, this is very hard (or impossible) to prove formally.

• The simplest thing we can do: Self-training
  → Train a classifier $g(\boldsymbol{x})$ on $\mathcal{D}_L$ and then label the samples in $\mathcal{D}_U$ according to the prediction of $g$:

  $y^{(i)} = g\big( \boldsymbol{x}^{(i)} \big), \qquad i \in \{L+1, \dots, N\}$

• Advantage: easily usable as a wrapper around arbitrary classifiers (frequently used in natural language processing)
• Disadvantage: errors can get amplified (see the sketch below)
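A minimal self-training sketch; `clf` is assumed to follow the scikit-learn fit/predict interface, and no confidence filtering is applied here:

```python
import numpy as np

def self_training(clf, X_labeled, y_labeled, X_unlabeled):
    """Train on D_L, pseudo-label D_U with the classifier's own predictions,
    then retrain on the union of labeled and pseudo-labeled data."""
    clf.fit(X_labeled, y_labeled)
    pseudo_labels = clf.predict(X_unlabeled)
    X_all = np.vstack([X_labeled, X_unlabeled])
    y_all = np.concatenate([y_labeled, pseudo_labels])
    clf.fit(X_all, y_all)
    return clf, pseudo_labels
```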


Semi-supervised learning – Co-training / multi-view learning (1/2)

• The idea of multi-view learning is to look at an object (e.g., a website) from two (or more) different
viewpoints (e.g., the pictures and the text on the website).
• Formally, suppose the instance space $\mathcal{X}$ to be split into two parts → an instance is represented in the form $\boldsymbol{x}^{(i)} = \big( \boldsymbol{x}^{(i,1)}, \boldsymbol{x}^{(i,2)} \big)$
• Co-training proceeds from the assumption that each view alone is sufficient to train a good classifier and, moreover, that $\boldsymbol{x}^{(i,1)}$ and $\boldsymbol{x}^{(i,2)}$ are conditionally independent given the class.
• Co-training algorithms repeat the following steps (a sketch is given below):
  • Train two classifiers $h^{(1)}$ and $h^{(2)}$ from $\mathcal{D}_L^{(1)}$ and $\mathcal{D}_L^{(2)}$, respectively.
  • Classify $\mathcal{D}_U$ separately with $h^{(1)}$ and $h^{(2)}$.
  • Add the $k$ most confident examples of $h^{(1)}$ to the labeled training data of $h^{(2)}$.
  • Add the $k$ most confident examples of $h^{(2)}$ to the labeled training data of $h^{(1)}$.
• Advantages:
• Co-training is a simple wrapper method that applies to all existing classifiers.
• Co-training tends to be less sensitive to mistakes than self-training.
• Disadvantages:
  • A natural split of the features does not always exist (the feature subsets do not necessarily need to be disjoint).
  • Models using both views simultaneously may often perform better.
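A hedged sketch of one co-training variant; `h1` and `h2` are assumed to follow the scikit-learn fit/predict/predict_proba interface, and the two views are given as separate feature matrices:

```python
import numpy as np

def co_training(h1, h2, X1_L, X2_L, y_L, X1_U, X2_U, k=5, rounds=10):
    """Each round, each classifier hands its k most confident pseudo-labeled
    unlabeled examples to the training set of the other classifier."""
    L1 = (X1_L.copy(), y_L.copy())        # labeled data of h1 (view 1)
    L2 = (X2_L.copy(), y_L.copy())        # labeled data of h2 (view 2)
    remaining = np.arange(len(X1_U))      # indices of still-unlabeled samples

    for _ in range(rounds):
        if len(remaining) == 0:
            break
        h1.fit(*L1)
        h2.fit(*L2)

        # h1's k most confident examples (view-2 features) go to h2's training data.
        conf1 = h1.predict_proba(X1_U[remaining]).max(axis=1)
        top1 = remaining[np.argsort(conf1)[-k:]]
        L2 = (np.vstack([L2[0], X2_U[top1]]),
              np.concatenate([L2[1], h1.predict(X1_U[top1])]))
        remaining = np.setdiff1d(remaining, top1)
        if len(remaining) == 0:
            break

        # Symmetrically, h2's most confident examples go to h1's training data.
        conf2 = h2.predict_proba(X2_U[remaining]).max(axis=1)
        top2 = remaining[np.argsort(conf2)[-k:]]
        L1 = (np.vstack([L1[0], X1_U[top2]]),
              np.concatenate([L1[1], h2.predict(X2_U[top2])]))
        remaining = np.setdiff1d(remaining, top2)

    return h1, h2
```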

Semi-supervised learning – Co-training / multi-view learning (2/2)

• There are many variants of multi-view learning via combination with other techniques (majority voting, weighting, …)
• Multi-view learning (with $m$ learners) can also be realized via regularization:

  $\min_{h \in \mathcal{H}} \; \sum_{v=1}^{m} \left( \sum_{i=1}^{L} e\big( y_i, h_v(\boldsymbol{x}_i) \big) + \lambda_1 \| h_v \|^2 \right) + \lambda_2 \sum_{u,v=1}^{m} \sum_{j=L+1}^{N} \big( h_u(\boldsymbol{x}_j) - h_v(\boldsymbol{x}_j) \big)^2$

• Minimizing a (joint) loss function of this kind encourages the learners to agree on the unlabeled data to some extent (see the sketch below).
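A hedged NumPy sketch of this joint loss for $m$ hypothetical linear learners $h_v(\boldsymbol{x}) = \boldsymbol{x}^\top \boldsymbol{w}_v$ that all operate on the same feature space (the formulation above allows more general learners and per-view features):

```python
import numpy as np

def multiview_loss(W, X_L, y_L, X_U, lam1, lam2):
    """Labeled squared error + L2 regularization + disagreement penalty.

    W: (m, n) weight matrix, one linear learner h_v(x) = x @ W[v] per row."""
    pred_L = X_L @ W.T        # (L, m): predictions of all learners on labeled data
    pred_U = X_U @ W.T        # (N - L, m): predictions on unlabeled data

    data_term    = np.sum((pred_L - y_L[:, None]) ** 2)      # sum_v sum_i e(y_i, h_v(x_i))
    reg_term     = lam1 * np.sum(W ** 2)                     # lambda_1 * sum_v ||h_v||^2
    disagreement = lam2 * np.sum(                            # lambda_2 * sum_{u,v} sum_j (h_u - h_v)^2
        (pred_U[:, :, None] - pred_U[:, None, :]) ** 2)
    return data_term + reg_term + disagreement
```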


Semi-supervised learning – Generative models

• Generative methods first estimate a joint distribution $P$ on $\mathcal{X} \times \mathcal{Y}$
• Predictions can then be derived by conditioning on a given query $\boldsymbol{x}$:

  $P(y \mid \boldsymbol{x}) = \frac{P(\boldsymbol{x}, y)}{P(\boldsymbol{x})} = \frac{P(\boldsymbol{x}, y)}{\sum_{y \in \mathcal{Y}} P(\boldsymbol{x}, y)} \propto P(\boldsymbol{x}, y)$

• Generative methods can be applied in the semi-supervised context in a quite natural way, because they can model the probability of observing an instance $\boldsymbol{x}_j$ as a marginal probability:

  $P(\boldsymbol{x}) = \sum_{y \in \mathcal{Y}} P(\boldsymbol{x}, y)$
• Suppose the (joint) probability $P$ to be parametrized by $\boldsymbol{\theta} \in \boldsymbol{\Theta}$. Then, assuming i.i.d. observations,

  $P(\mathcal{D}_L, \mathcal{D}_U \mid \boldsymbol{\theta}) = \prod_{i=1}^{L} P(\boldsymbol{x}_i, y_i \mid \boldsymbol{\theta}) \cdot \prod_{j=L+1}^{N} P(\boldsymbol{x}_j \mid \boldsymbol{\theta}) = \prod_{i=1}^{L} P(\boldsymbol{x}_i, y_i \mid \boldsymbol{\theta}) \cdot \prod_{j=L+1}^{N} \sum_{y \in \mathcal{Y}} P(\boldsymbol{x}_j, y \mid \boldsymbol{\theta})$

• Solution of this problem: maximum likelihood estimation:

  $\boldsymbol{\theta}^* = \underset{\boldsymbol{\theta} \in \boldsymbol{\Theta}}{\arg\max} \; P(\mathcal{D}_L, \mathcal{D}_U \mid \boldsymbol{\theta})$
• Advantage: Theoretically well-grounded, often effective
• Disadvantage: Computationally complex, and $P(\mathcal{D}_L, \mathcal{D}_U \mid \boldsymbol{\theta})$ may have multiple local optima
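A hedged sketch of the likelihood above for one concrete choice of generative model (Gaussian class conditionals with class priors; all parameters would be supplied by the user). Maximizing this quantity over the parameters, e.g. with Expectation Maximization or a generic optimizer, yields the estimate $\boldsymbol{\theta}^*$:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def semi_supervised_log_likelihood(XL, yL, XU, priors, mus, Sigmas):
    """log P(D_L, D_U | theta) for P(x, y | theta) = P(y) N(x | mu_y, Sigma_y)."""
    # Labeled part: sum_i log P(x_i, y_i | theta)
    ll_labeled = sum(np.log(priors[y]) +
                     multivariate_normal.logpdf(x, mean=mus[y], cov=Sigmas[y])
                     for x, y in zip(XL, yL))
    # Unlabeled part: sum_j log sum_y P(x_j, y | theta)
    log_joint = np.stack([np.log(priors[y]) +
                          multivariate_normal.logpdf(XU, mean=mus[y], cov=Sigmas[y])
                          for y in range(len(priors))], axis=1)
    return ll_labeled + logsumexp(log_joint, axis=1).sum()
```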