Section 06: Unsupervised Learning
Machine Learning II: Unsupervised & Semi-Supervised Learning
Sebastian Peitz
Unsupervised learning
Feature selection
• In the following data set $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}\} = \boldsymbol{X}$, how many features do we need to (approximately) describe the data?
→ If we perform a coordinate transform, one direction is clearly more important in characterizing the structure of $\boldsymbol{X}$ than the second one!
→ This is nothing other than representing the same data in a different coordinate system: instead of using the standard Euclidean basis
$$\boldsymbol{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = x_1 \boldsymbol{e}_1 + x_2 \boldsymbol{e}_2 = x_1 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + x_2 \begin{pmatrix} 0 \\ 1 \end{pmatrix},$$
we can use a basis $\boldsymbol{U}$ tailored to the data:
$$\boldsymbol{x} = a_1 \boldsymbol{u}_1 + a_2 \boldsymbol{u}_2$$
• Which properties should such a new basis $\boldsymbol{U}$ have?
  • It should be orthonormal: $\boldsymbol{u}_i^\top \boldsymbol{u}_j = \delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{else} \end{cases}$
  • For every dimension $r$, it should yield the smallest approximation error:
$$\min_{\boldsymbol{u}_1, \dots, \boldsymbol{u}_r} \sum_{i=1}^{N} \Big\| \boldsymbol{x}^{(i)} - \sum_{j=1}^{r} \big(\boldsymbol{u}_j^\top \boldsymbol{x}^{(i)}\big)\, \boldsymbol{u}_j \Big\|_2^2$$
• Due to the famous Eckart-Young theorem, we know that the solution to this optimization problem can be obtained using a very efficient tool from linear algebra: the Singular Value Decomposition (SVD)
$$\boldsymbol{X} = \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{V}^*$$
• This is a product of three matrices. If $\boldsymbol{X} \in \mathbb{C}^{N \times n}$, then $\boldsymbol{U} \in \mathbb{C}^{N \times N}$, $\boldsymbol{\Sigma} \in \mathbb{R}^{N \times n}$ and $\boldsymbol{V} \in \mathbb{C}^{n \times n}$ ($\boldsymbol{V}^*$ is the conjugate transpose of $\boldsymbol{V}$)
• The matrices have many favorable properties:
  • $\boldsymbol{U} = (\boldsymbol{u}_1, \dots, \boldsymbol{u}_N)$ and $\boldsymbol{V} = (\boldsymbol{v}_1, \dots, \boldsymbol{v}_n)$ are unitary matrices (column-wise orthonormal): $\boldsymbol{u}_i^\top \boldsymbol{u}_j = \delta_{ij}$ and $\boldsymbol{v}_i^\top \boldsymbol{v}_j = \delta_{ij}$
  • $\boldsymbol{\Sigma} = \begin{pmatrix} \hat{\boldsymbol{\Sigma}} \\ \boldsymbol{0} \end{pmatrix}$, where $\hat{\boldsymbol{\Sigma}}$ is a diagonal matrix with diagonal entries $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n \ge 0$: the singular values
  • Since the last $N - n$ rows (assuming that $N > n$) are zero, we have the following economy version:
$$\boldsymbol{X} = \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{V}^* = \begin{pmatrix} \hat{\boldsymbol{U}} & \hat{\boldsymbol{U}}^{\perp} \end{pmatrix} \begin{pmatrix} \hat{\boldsymbol{\Sigma}} \\ \boldsymbol{0} \end{pmatrix} \boldsymbol{V}^* = \hat{\boldsymbol{U}} \hat{\boldsymbol{\Sigma}} \boldsymbol{V}^*$$
  • Since the columns of $\boldsymbol{U}$ and $\boldsymbol{V}$ all have unit length, the relative importance of a particular column $\boldsymbol{u}_j$ of $\boldsymbol{U}$ is encoded in the singular values, e.g., as the share $\sigma_j \big/ \sum_{k=1}^{n} \sigma_k$
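• As a minimal numerical sketch (assuming NumPy; the data matrix and the rank are made up for illustration), the economy SVD and a rank-r reconstruction can be computed as follows:
```python
import numpy as np

# Illustrative data matrix with N samples (rows) and n features (columns);
# any real data set of this shape would work the same way.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))  # N=500, n=10

# Economy SVD: U_hat (N x n), singular values (n,), V* (n x n)
U_hat, s, Vt = np.linalg.svd(X, full_matrices=False)

# Relative importance of each mode, encoded in the singular values
importance = s / s.sum()
print("relative importance of the modes:", np.round(importance, 3))

# Best rank-r approximation (Eckart-Young): keep the r leading modes
r = 3
X_r = U_hat[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
print("rank-%d approximation error: %.3e" % (r, np.linalg.norm(X - X_r)))
```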
[Figure: eigenfaces example. Each face image is flattened/reshaped into a column of $\boldsymbol{X}$; the economy SVD $\boldsymbol{X} = \hat{\boldsymbol{U}} \hat{\boldsymbol{\Sigma}} \boldsymbol{V}^*$ yields the eigenfaces as the columns of $\hat{\boldsymbol{U}}$. Shown: the first 16 eigenfaces ($\tilde{\boldsymbol{U}} \in \mathbb{R}^{32256 \times 16}$) and a low-rank reconstruction.]
• Now let's try to distinguish two individuals from the database by projecting onto two modes:
$$\mathrm{PC}_5(\boldsymbol{x}_i) = \boldsymbol{x}_i^\top \boldsymbol{u}_5, \qquad \mathrm{PC}_6(\boldsymbol{x}_i) = \boldsymbol{x}_i^\top \boldsymbol{u}_6$$
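• A small sketch of this projection step (only NumPy is assumed; the face data is replaced by a random stand-in, and the column-wise data layout is an illustrative choice):
```python
import numpy as np

def project_on_modes(X, mode_a=4, mode_b=5):
    """Project the columns of X (flattened images) onto two SVD modes.

    Modes are 0-indexed, so indices 4 and 5 correspond to PC5 and PC6.
    Subtracting the mean image beforehand is a common (optional) extra step.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    pc_a = X.T @ U[:, mode_a]   # PC5(x_i) = x_i^T u_5 for every column x_i
    pc_b = X.T @ U[:, mode_b]   # PC6(x_i) = x_i^T u_6
    return pc_a, pc_b

# Usage with a made-up stand-in for the face data (32256 pixels, 100 images):
X = np.random.rand(32256, 100)
pc5, pc6 = project_on_modes(X)
# Scatter-plotting (pc5, pc6) and coloring by person would reveal the two individuals.
```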
• Consider a function $f(x)$ that is piecewise smooth and $2\pi$-periodic. Any function of this class can be expressed in terms of its Fourier series:
$$f(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty} \big( a_k \cos(kx) + b_k \sin(kx) \big) = \sum_{k=-\infty}^{\infty} c_k e^{ikx} = \sum_{k=-\infty}^{\infty} (a_k + i b_k)\big(\cos(kx) + i \sin(kx)\big)$$
• The (real) coefficients are given by
$$a_k = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \cos(kx)\, dx, \qquad b_k = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \sin(kx)\, dx$$
• This is nothing other than representing $f(x)$ in terms of an orthogonal basis: the Fourier modes $\cos(kx)$ and $\sin(kx)$
• Closely related to the SVD basis transform, except that $f(x)$ is not a vector but an infinite-dimensional function
• In contrast to point-wise data, the Fourier modes carry global information over the entire domain.
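• As a small numerical sketch (NumPy only; the square-wave example and the number of modes are made up for illustration), the coefficients can be approximated by numerical integration and used in a truncated series:
```python
import numpy as np

# Example function: a 2*pi-periodic square wave (illustrative choice)
n = 4000
x = np.linspace(-np.pi, np.pi, n, endpoint=False)
dx = x[1] - x[0]
f = np.sign(np.sin(x))

K = 20                      # number of Fourier modes to keep
a = np.zeros(K + 1)
b = np.zeros(K + 1)
for k in range(K + 1):
    a[k] = np.sum(f * np.cos(k * x)) * dx / np.pi   # a_k = (1/pi) * int f(x) cos(kx) dx
    b[k] = np.sum(f * np.sin(k * x)) * dx / np.pi   # b_k = (1/pi) * int f(x) sin(kx) dx

# Truncated Fourier series reconstruction
f_hat = a[0] / 2 + sum(a[k] * np.cos(k * x) + b[k] * np.sin(k * x) for k in range(1, K + 1))
print("max deviation:", np.max(np.abs(f - f_hat)))  # largest near the jumps (Gibbs phenomenon)
```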
• The Fourier transform can be adapted to vectors using the Discrete Fourier Transform (DFT) or its highly efficient implementation: the Fast Fourier Transform (FFT)
$$\boldsymbol{x} \in \mathbb{R}^n \;\rightarrow\; \boldsymbol{c} \in \mathbb{C}^n$$
• The entries of $\boldsymbol{c}$ are the complex Fourier coefficients of increasing frequency ($\omega_k = \frac{k\pi}{L}$)
• In 2D: first in one direction, then in the second direction
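• A minimal sketch with NumPy's FFT routines (the signal and image are made up; only `numpy.fft` is assumed):
```python
import numpy as np

# 1D: FFT of a sampled signal
n = 256
t = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
x = np.sin(3 * t) + 0.5 * np.cos(7 * t)          # illustrative signal

c = np.fft.fft(x)                                # complex Fourier coefficients, c in C^n
dominant = np.argsort(-np.abs(c))[:4]            # indices of the 4 largest |c_k|
print("dominant frequency bins:", np.sort(dominant))   # expected: 3, 7, 249, 253

# 2D: transform one direction first, then the other (equivalent to fft2)
img = np.random.rand(64, 64)
C2 = np.fft.fft(np.fft.fft(img, axis=0), axis=1)
assert np.allclose(C2, np.fft.fft2(img))
```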
• Very powerful compression technique: keeping only the largest transform coefficients is the idea behind the classical JPEG standard (which uses the closely related discrete cosine transform; its successor JPEG 2000 uses wavelets instead)
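• A sketch of the idea (assuming NumPy and an arbitrary grayscale image array; the 5% threshold is an illustrative choice): transform the image, keep only the largest coefficients, transform back:
```python
import numpy as np

def compress_image_fft(img, keep=0.05):
    """Keep only the largest `keep` fraction of Fourier coefficients of `img`."""
    C = np.fft.fft2(img)                          # 2D Fourier coefficients
    threshold = np.quantile(np.abs(C), 1 - keep)  # magnitude cut-off
    C_sparse = np.where(np.abs(C) >= threshold, C, 0)
    img_rec = np.real(np.fft.ifft2(C_sparse))     # back-transform; imaginary part ~ 0
    return img_rec, np.mean(np.abs(C_sparse) > 0)

# Usage with a made-up "image" (any 2D grayscale array works):
img = np.outer(np.sin(np.linspace(0, 8, 256)), np.cos(np.linspace(0, 8, 256)))
img_rec, fraction = compress_image_fft(img, keep=0.05)
print("kept %.1f%% of the coefficients, max error %.3e"
      % (100 * fraction, np.max(np.abs(img - img_rec))))
```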
[Figure: example images of dogs and cats]
• Now let's assume that we only have unlabeled data $\mathcal{D} = (\boldsymbol{x}^{(i)})_{i=1}^{N}$, $\boldsymbol{x}^{(i)} \in \mathbb{R}^n$
• We would like to separate the data into $K$ clusters in an optimal way, where each cluster is represented by a prototype vector from the set $\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_K \in \mathbb{R}^n$
• Which parameters do we have to optimize?
→ The prototypes as well as the assignment of data to the clusters:
$$\min_{\boldsymbol{\mu}, \boldsymbol{r}} E = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik}\, \| \boldsymbol{x}_i - \boldsymbol{\mu}_k \|^2$$
• $\boldsymbol{r}$ is a matrix of binary variables ($r_{ik} \in \{0,1\}$), where the first index refers to the data point and the second to the cluster – exactly one entry per row is one: $\sum_{k=1}^{K} r_{ik} = 1$ for all $i \in \{1, \dots, N\}$
→ We assign each data point to precisely one cluster and then seek to minimize the distance of all points within a cluster $k$ to their prototype $\boldsymbol{\mu}_k$
• Which norm for the distance? → It depends! (a short numerical sketch follows the list)
  • Euclidean: $\| \boldsymbol{x}_i - \boldsymbol{\mu}_k \|_2$
  • Squared Euclidean: $\| \boldsymbol{x}_i - \boldsymbol{\mu}_k \|_2^2$
  • Manhattan: $\| \boldsymbol{x}_i - \boldsymbol{\mu}_k \|_1$
  • Maximum distance: $\| \boldsymbol{x}_i - \boldsymbol{\mu}_k \|_\infty$
  • Mahalanobis distance: $(\boldsymbol{x}_i - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_i - \boldsymbol{\mu}_k)$ with the covariance matrix $\boldsymbol{\Sigma}$
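• A compact sketch of these distance measures (NumPy only; the vectors and covariance matrix are made up):
```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])          # data point
mu = np.array([0.0, 1.0, 5.0])         # cluster prototype
Sigma = np.diag([1.0, 2.0, 0.5])       # illustrative covariance matrix

d = x - mu
euclidean   = np.linalg.norm(d, ord=2)
sq_euclid   = euclidean ** 2
manhattan   = np.linalg.norm(d, ord=1)
maximum     = np.linalg.norm(d, ord=np.inf)
mahalanobis = d @ np.linalg.inv(Sigma) @ d
print(euclidean, sq_euclid, manhattan, maximum, mahalanobis)
```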
• Assignments $r_{ik}$: with $\boldsymbol{\mu}$ fixed, assign each data point to its closest prototype:
$$r_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_j \| \boldsymbol{x}_i - \boldsymbol{\mu}_j \|_2^2 \\ 0 & \text{otherwise} \end{cases}$$
• Prototypes $\boldsymbol{\mu}_k$: with $\boldsymbol{r}$ fixed, this is a weighted least squares regression problem:
$$2 \sum_{i=1}^{N} r_{ik} \left( \boldsymbol{x}_i - \boldsymbol{\mu}_k \right) = 0 \;\Leftrightarrow\; \boldsymbol{\mu}_k = \frac{\sum_{i=1}^{N} r_{ik}\, \boldsymbol{x}_i}{\sum_{i=1}^{N} r_{ik}}$$
→ This is the mean over all $\boldsymbol{x}_i$ belonging to cluster $k$
• Repeat the two steps until there are no re-assignments (a sketch of the full algorithm follows below)
• Does this algorithm converge?
→ Yes: neither step can increase the objective function, and there are only finitely many possible assignments, so the algorithm terminates (typically in a local optimum)
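• A minimal NumPy sketch of this alternating scheme (Lloyd's algorithm); the random initialization and the toy data are illustrative choices:
```python
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    """Basic k-means: X is (N, n); returns prototypes (K, n) and assignments (N,)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initialize with random data points
    assign = np.zeros(len(X), dtype=int)
    for it in range(n_iter):
        # Assignment step: closest prototype in squared Euclidean distance
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K)
        new_assign = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_assign, assign):
            break                                        # no re-assignments -> converged
        assign = new_assign
        # Update step: prototype = mean of the points assigned to it
        for k in range(K):
            if np.any(assign == k):
                mu[k] = X[assign == k].mean(axis=0)
    return mu, assign

# Usage on a made-up 2D data set with two blobs:
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(3, 0.5, (100, 2))])
mu, assign = k_means(X, K=2)
print("prototypes:\n", mu)
```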
• Another way to identify clusters is a hierarchical, tree-based approach → the dendrogram
• A cloud of points is split or merged step by step until some threshold is reached
• Divisive approach (top-down):
  • Initially, all points are contained in a single cluster
  • The data is then recursively split into smaller and smaller clusters
  • The splitting continues until the algorithm stops according to a user-specified objective
  • The divisive method can split the data until each data point is its own node
• Agglomerative approach (bottom-up):
  • Initially, each data point $\boldsymbol{x}_j$ is its own cluster
  • The data points are merged in pairs, creating a hierarchy of clusters
  • The merging eventually stops once all the data has been merged into a single cluster
• How can we do this? → Greedy approach!
• Algorithm (a sketch follows below):
  1. Compute the distance (Euclidean, Manhattan, …) between all points: $d(\boldsymbol{x}_i, \boldsymbol{x}_j)$, $i, j \in \{1, \dots, N\}$
  2. Merge the closest two data points into a single new data point midway between their original locations
  3. Repeat the calculation with the new $N - 1$ points
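• A minimal NumPy sketch of this greedy merging (Euclidean distances, recording each merge; purely illustrative):
```python
import numpy as np

def greedy_agglomerative(X):
    """Repeatedly merge the two closest active points into their midpoint.

    Returns the list of merges as (index_a, index_b, distance) tuples, which is
    the information a dendrogram is built from.
    """
    points = [x.astype(float) for x in X]     # original and merged "points"
    active = list(range(len(points)))
    merges = []
    while len(active) > 1:
        # Find the closest pair among the active points (greedy step)
        best = None
        for a in range(len(active)):
            for b in range(a + 1, len(active)):
                d = np.linalg.norm(points[active[a]] - points[active[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        ia, ib = active[a], active[b]
        points.append((points[ia] + points[ib]) / 2)     # new point midway between the two
        merges.append((ia, ib, d))
        active = [i for k, i in enumerate(active) if k not in (a, b)] + [len(points) - 1]
    return merges

# Usage on a tiny made-up data set:
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
for ia, ib, d in greedy_agglomerative(X):
    print("merge", ia, "and", ib, "at distance", round(d, 3))
```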
• Can we also try to find a probabilistic model for our data? This seems to be natural, as noise is often
present in measurements.
• Consider the Old Faithful dataset once more.
• Can we model this using, say, a Gaussian distribution?
• What about a superposition of multiple Gaussians?
• This leads to mixture models (or – if we consider Gaussians – Gaussian Mixture Models (GMMs))
$$p(\boldsymbol{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
• Such mixtures can in general approximate highly complex densities to arbitrary precision
[Figure: Old Faithful data set; $x_2$ = duration of the eruption in minutes]
• The coefficients $\pi_k$ are called mixing coefficients. If both $p(\boldsymbol{x})$ and the individual Gaussians are normalized, then a simple integration yields $\sum_{k=1}^{K} \pi_k = 1$
• In addition, the requirement $p(\boldsymbol{x}) \ge 0$ implies $\pi_k \ge 0$ for all $k$ → $0 \le \pi_k \le 1$
• Using the sum and product rules, we can also write the mixture by interpreting the mixing coefficients as prior probabilities:
$$p(\boldsymbol{x}) = \sum_{k=1}^{K} \underbrace{p(k)}_{\pi_k}\, \underbrace{p(\boldsymbol{x} \mid k)}_{\mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}$$
• Via Bayes' theorem, we thus get access to the posterior probability of $k$ given $\boldsymbol{x}$, a.k.a. the responsibility:
$$\gamma_k(\boldsymbol{x}) = p(k \mid \boldsymbol{x}) = \frac{p(k)\, p(\boldsymbol{x} \mid k)}{p(\boldsymbol{x})}$$
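• A small sketch of evaluating a GMM density and the responsibilities (assuming NumPy and SciPy's `multivariate_normal`; all parameter values are made up):
```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up 2-component GMM in 2D
pi = np.array([0.4, 0.6])                               # mixing coefficients, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.3], [0.3, 0.5]])]

def gmm_density(x):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)"""
    return sum(pi[k] * multivariate_normal.pdf(x, mus[k], Sigmas[k]) for k in range(len(pi)))

def responsibilities(x):
    """gamma_k(x) = pi_k * N(x | mu_k, Sigma_k) / p(x)"""
    comps = np.array([pi[k] * multivariate_normal.pdf(x, mus[k], Sigmas[k]) for k in range(len(pi))])
    return comps / comps.sum()

x = np.array([2.0, 2.5])
print("p(x) =", gmm_density(x))
print("responsibilities:", responsibilities(x))   # sums to 1
```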
• In an entirely Bayesian approach, we thus need to learn the parameters $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ of the individual Gaussian distributions as well as the mixing coefficients $\pi_k$.
• As these can themselves be seen as random variables, let us introduce a corresponding $K$-dimensional latent state $\boldsymbol{z}$ in the form of a 1-of-$K$ representation: $z_k \in \{0, 1\}$, $\sum_{k=1}^{K} z_k = 1$.
→ $\boldsymbol{z}$ can be in $K$ different states.
→ $p(z_k = 1) = \pi_k$ and $\sum_{k=1}^{K} p(z_k = 1) = 1$
• The probability of a specific latent variable $\boldsymbol{z}$ and of a specific sample $\boldsymbol{x}$ given $\boldsymbol{z}$ is thus
$$p(\boldsymbol{z}) = \prod_{k=1}^{K} \pi_k^{z_k} \qquad \text{and} \qquad p(\boldsymbol{x} \mid \boldsymbol{z}) = \prod_{k=1}^{K} \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k}$$
• As a consequence, the Gaussian mixture model can be expressed as before, but using the latent variable $\boldsymbol{z}$ (we'll see in the next chapter how this is beneficial for learning → Expectation Maximization):
$$p(\boldsymbol{x}) = \sum_{\boldsymbol{z}} p(\boldsymbol{z})\, p(\boldsymbol{x} \mid \boldsymbol{z}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
• Responsibility ($\pi_k$ = prior; $\gamma(z_k)$ = posterior):
$$\gamma(z_k) = p(z_k = 1 \mid \boldsymbol{x}) = \frac{p(z_k = 1)\, p(\boldsymbol{x} \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(\boldsymbol{x} \mid z_j = 1)} = \frac{\pi_k\, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
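• A quick sketch of what the latent-variable view means generatively (NumPy only; the parameters are made up): first draw $\boldsymbol{z}$ from $p(\boldsymbol{z})$, then draw $\boldsymbol{x}$ from $p(\boldsymbol{x} \mid \boldsymbol{z})$:
```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up GMM parameters (K = 2 components in 2D)
pi = np.array([0.4, 0.6])
mus = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])

def sample_gmm(n_samples):
    """Ancestral sampling: z ~ p(z), then x ~ p(x | z)."""
    ks = rng.choice(len(pi), size=n_samples, p=pi)          # component index = which z_k is 1
    xs = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])
    return xs, ks

X, z = sample_gmm(1000)
print("empirical mixing proportions:", np.bincount(z) / len(z))   # close to pi
```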
• Maximum likelihood: the parameters are fitted to the data $\boldsymbol{X} = \{\boldsymbol{x}_1, \dots, \boldsymbol{x}_N\}$ by maximizing the log-likelihood
$$\log p(\boldsymbol{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\boldsymbol{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)$$
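• A numerically stable sketch of evaluating this log-likelihood (assuming NumPy/SciPy; parameters and data are made up):
```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """log p(X | pi, mu, Sigma) = sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k)."""
    # log_probs[i, k] = log pi_k + log N(x_i | mu_k, Sigma_k)
    log_probs = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
        for k in range(len(pi))
    ])
    return logsumexp(log_probs, axis=1).sum()

# Usage with made-up parameters and data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
pi = np.array([0.5, 0.5])
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), np.eye(2)]
print("log-likelihood:", gmm_log_likelihood(X, pi, mus, Sigmas))
```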
• From now on, let's again assume that we have both labeled and unlabeled data:
$$\mathcal{D}_L = \big(\boldsymbol{x}^{(i)}, y^{(i)}\big)_{i=1}^{L}, \qquad \mathcal{D}_U = \big(\boldsymbol{x}^{(i)}\big)_{i=L+1}^{N}$$
• Consider the situation where the number of labeled data is much smaller: 𝐿 ≪ 𝑁 − 𝐿, maybe due to the
fact that the labeling has to be done by hand and is very expensive.
• Central goal: improve the learning performance by taking the additional unlabeled (and likely much
cheaper) data into account.
• In some situations, this can help to significantly improve the performance.
• However, this is very hard (or impossible) to prove formally.
• The idea of multi-view learning is to look at an object (e.g., a website) from two (or more) different viewpoints (e.g., the pictures and the text on the website).
• Formally, suppose the instance space $\mathcal{X}$ is split into two parts → an instance is represented in the form $\boldsymbol{x}^{(i)} = \big(\boldsymbol{x}^{(i,1)}, \boldsymbol{x}^{(i,2)}\big)$
• Co-training proceeds from the assumption that each view alone is sufficient to train a good classifier and, moreover, that $\boldsymbol{x}^{(i,1)}$ and $\boldsymbol{x}^{(i,2)}$ are conditionally independent given the class.
• Co-training algorithms repeat the following steps (a sketch follows below):
  • Train two classifiers $h^{(1)}$ and $h^{(2)}$ from $\mathcal{D}_L^{(1)}$ and $\mathcal{D}_L^{(2)}$, respectively.
  • Classify $\mathcal{D}_U$ separately with $h^{(1)}$ and $h^{(2)}$.
  • Add the $k$ most confident examples of $h^{(1)}$ to the labeled training data of $h^{(2)}$.
  • Add the $k$ most confident examples of $h^{(2)}$ to the labeled training data of $h^{(1)}$.
• Advantages:
  • Co-training is a simple wrapper method that applies to all existing classifiers.
  • Co-training tends to be less sensitive to mistakes than self-training.
• Disadvantages:
  • A natural split of the features does not always exist (the feature subsets do not necessarily need to be disjoint).
  • Models using both views simultaneously may often perform better.
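• A minimal sketch of such a co-training loop (assuming scikit-learn-style classifiers with `fit`/`predict`/`predict_proba`; the choice of logistic regression and of `k`/`n_rounds` is illustrative):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_training(X1_l, X2_l, y_l, X1_u, X2_u, k=5, n_rounds=10):
    """Sketch of co-training with two feature views (X1, X2) of the same instances."""
    # Each classifier keeps its own growing labeled set (features of "its" view)
    X1_train, y1_train = X1_l.copy(), y_l.copy()   # for h1
    X2_train, y2_train = X2_l.copy(), y_l.copy()   # for h2
    h1 = h2 = None
    for _ in range(n_rounds):
        h1 = LogisticRegression(max_iter=1000).fit(X1_train, y1_train)
        h2 = LogisticRegression(max_iter=1000).fit(X2_train, y2_train)
        if len(X1_u) == 0:
            break
        # Classify the unlabeled pool separately with h1 and h2
        p1, p2 = h1.predict_proba(X1_u), h2.predict_proba(X2_u)
        top1 = np.argsort(-p1.max(axis=1))[:k]     # h1's k most confident examples
        top2 = np.argsort(-p2.max(axis=1))[:k]     # h2's k most confident examples
        # h1's confident examples (with h1's labels) extend h2's training set, and vice versa
        X2_train = np.vstack([X2_train, X2_u[top1]])
        y2_train = np.concatenate([y2_train, h1.predict(X1_u[top1])])
        X1_train = np.vstack([X1_train, X1_u[top2]])
        y1_train = np.concatenate([y1_train, h2.predict(X2_u[top2])])
        # Remove the transferred examples from the unlabeled pool
        used = np.unique(np.concatenate([top1, top2]))
        keep = np.setdiff1d(np.arange(len(X1_u)), used)
        X1_u, X2_u = X1_u[keep], X2_u[keep]
    return h1, h2
```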
• There are many variants of multi-view learning via combination with other techniques (majority voting, weighting, …)
• Multi-view learning (with $m$ learners) can also be realized via regularization:
$$\min_{h \in \mathcal{H}} \; \sum_{v=1}^{m} \left( \sum_{i=1}^{L} e\big(y_i, h_v(\boldsymbol{x}_i)\big) + \lambda_1 \| h_v \|^2 \right) + \lambda_2 \sum_{u,v=1}^{m} \sum_{j=L+1}^{N} \big( h_u(\boldsymbol{x}_j) - h_v(\boldsymbol{x}_j) \big)^2$$
• Minimizing a (joint) loss function of this kind encourages the learners to agree on the unlabeled data to some extent.
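• A toy sketch of evaluating such a joint loss for $m$ linear learners (NumPy only; the squared error, the regularizer, and the assumption that all learners share the same features are illustrative simplifications):
```python
import numpy as np

def multiview_loss(W, X_l, y_l, X_u, lam1=0.1, lam2=1.0):
    """Joint loss for m linear learners h_v(x) = w_v^T x (all using the same features).

    W: (m, n) weights, one row per learner; X_l: (L, n) labeled inputs,
    y_l: (L,) labels; X_u: (N-L, n) unlabeled inputs.
    """
    preds_l = X_l @ W.T                     # (L, m): h_v(x_i) for all learners
    preds_u = X_u @ W.T                     # (N-L, m)
    data_term = ((preds_l - y_l[:, None]) ** 2).sum()      # sum_v sum_i e(y_i, h_v(x_i))
    reg_term = lam1 * (W ** 2).sum()                        # sum_v ||h_v||^2
    # Disagreement on unlabeled data: sum_{u,v} sum_j (h_u(x_j) - h_v(x_j))^2
    diff = preds_u[:, :, None] - preds_u[:, None, :]        # (N-L, m, m)
    agree_term = lam2 * (diff ** 2).sum()
    return data_term + reg_term + agree_term

# Usage with made-up data and m = 2 learners
rng = np.random.default_rng(0)
X_l, y_l = rng.normal(size=(20, 5)), rng.normal(size=20)
X_u = rng.normal(size=(100, 5))
W = rng.normal(size=(2, 5))
print("joint loss:", multiview_loss(W, X_l, y_l, X_u))
```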
• Another approach: generative models, where the likelihood of the labeled and the unlabeled data is modeled jointly, with the unknown labels marginalized out:
$$P(\mathcal{D}_L, \mathcal{D}_U \mid \boldsymbol{\theta}) = \prod_{i=1}^{L} P(\boldsymbol{x}_i, y_i \mid \boldsymbol{\theta}) \cdot \prod_{j=L+1}^{N} P(\boldsymbol{x}_j \mid \boldsymbol{\theta}) = \prod_{i=1}^{L} P(\boldsymbol{x}_i, y_i \mid \boldsymbol{\theta}) \cdot \prod_{j=L+1}^{N} \sum_{y \in \mathcal{Y}} P(\boldsymbol{x}_j, y \mid \boldsymbol{\theta})$$
• Solution of this problem: maximum likelihood estimation:
$$\boldsymbol{\theta}^* = \operatorname*{argmax}_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} P(\mathcal{D}_L, \mathcal{D}_U \mid \boldsymbol{\theta})$$
• Advantage: theoretically well-grounded, often effective
• Disadvantage: computationally complex, and $P(\mathcal{D}_L, \mathcal{D}_U \mid \boldsymbol{\theta})$ may have multiple local maxima
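• A compact sketch of evaluating this likelihood (in log space) for a simple class-conditional Gaussian model (NumPy/SciPy assumed; the model choice and all parameter values are illustrative):
```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def semi_supervised_log_likelihood(X_l, y_l, X_u, priors, mus, Sigmas):
    """log P(D_L, D_U | theta) for a class-conditional Gaussian model.

    P(x, y | theta) = priors[y] * N(x | mus[y], Sigmas[y]); for the unlabeled
    data the label is summed (marginalized) out.
    """
    classes = range(len(priors))
    # Labeled part: sum_i log P(x_i, y_i | theta)
    ll_labeled = sum(np.log(priors[y]) + multivariate_normal.logpdf(x, mus[y], Sigmas[y])
                     for x, y in zip(X_l, y_l))
    # Unlabeled part: sum_j log sum_y P(x_j, y | theta)
    log_joint_u = np.column_stack([
        np.log(priors[c]) + multivariate_normal.logpdf(X_u, mus[c], Sigmas[c])
        for c in classes
    ])
    ll_unlabeled = logsumexp(log_joint_u, axis=1).sum()
    return ll_labeled + ll_unlabeled

# Usage with made-up data and parameters (two classes in 2D)
rng = np.random.default_rng(0)
X_l, y_l = rng.normal(size=(10, 2)), rng.integers(0, 2, size=10)
X_u = rng.normal(size=(50, 2))
priors = [0.5, 0.5]
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), np.eye(2)]
print("log P(D_L, D_U | theta) =",
      semi_supervised_log_likelihood(X_l, y_l, X_u, priors, mus, Sigmas))
```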