
Unsupervised Learning (I)

Dr. Mohamed Elshenawy


[email protected]
Zewail University of Science and Technology
Outline
• Word Embeddings – a Recap
• Unsupervised Learning Algorithms
• Autoencoders (Chapter 14)
• Overcomplete and Undercomplete Autoencoders
• Regularized Autoencoders
• Sparse Autoencoders
• Denoising Autoencoders (DAE)
• Contractive Autoencoders (CAE)
• Deep Autoencoders
Word Embeddings – a Recap
Timeline of the Key Models
• Key challenges
• Out-of-vocabulary (OOV) words: the datasets used for training do not include every word.
• Context-dependent word representations: understanding the context in which a word appears is necessary for downstream tasks.
• Word embeddings for different languages: a specific word embedding model has to be designed for each language.

A survey of word embeddings based on deep learning, Shirui Wang · Wenan Zhou · Chao Jiang
Classical Models – The language model

• A method of modeling (and generating) text that estimates the likelihood of a sequence of n words:

p(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_{i-n+1}, \ldots, w_{i-1})

• The occurrence of the i-th word depends only on the previous n - 1 words.
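A minimal counting-based bigram (n = 2) sketch of this factorization; the toy corpus, start token, and lack of smoothing are illustrative assumptions, not from the slides:

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate p(w_i | w_{i-1}) by maximum-likelihood counting."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens[:-1])                    # context counts
        bigrams.update(zip(tokens[:-1], tokens[1:]))    # (previous word, word) counts
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def sequence_prob(prob, sentence):
    """p(w_1, ..., w_n) as the product of the conditional bigram probabilities."""
    tokens = ["<s>"] + sentence.split()
    p = 1.0
    for prev, w in zip(tokens[:-1], tokens[1:]):
        p *= prob(prev, w)
    return p

corpus = ["the cat sat", "the dog sat", "the cat ran"]
prob = train_bigram(corpus)
print(sequence_prob(prob, "the cat sat"))   # p(the|<s>) * p(cat|the) * p(sat|cat)
```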

A survey of word embeddings based on deep learning, Shirui Wang · Wenan Zhou · Chao Jiang
Neural network language model (NNLM)
• Proposed by Bengio et al.
• Similar to the traditional language model, NNLM uses the previous n - 1 words to predict the n-th word as its overall structure.
• The neural network learns distributed word representations (embeddings) jointly with the language-model objective.

A survey of word embeddings based on deep learning, Shirui Wang · Wenan Zhou · Chao Jiang
Word2Vec

A survey of word embeddings based on deep learning, Shirui Wang · Wenan Zhou · Chao Jiang
Glove

• Word2vec focuses only on information from a local context window, while global statistical information is not used well.
• GloVe addresses this problem using a global co-occurrence matrix.
• Each element X_ij in the matrix counts how often word w_i and word w_j co-occur within a particular context window.
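A minimal sketch of building such a co-occurrence matrix; the toy corpus and window size are illustrative assumptions (GloVe itself additionally weights counts by the inverse distance between the two words, which is omitted here):

```python
import numpy as np

def cooccurrence_matrix(corpus, window=2):
    """Count how often each pair of words co-occurs within a symmetric context window."""
    vocab = sorted({w for sent in corpus for w in sent.split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        tokens = sent.split()
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    X[index[w], index[tokens[j]]] += 1   # X_ij: w_i and w_j seen together
    return X, vocab

X, vocab = cooccurrence_matrix(["the cat sat on the mat", "the dog sat on the rug"])
print(vocab)
print(X)
```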

A survey of word embeddings based on deep learning, Shirui Wang · Wenan Zhou · Chao Jiang
Fasttext
• Models such as Word2vec are simple, efficient, and can learn semantic representations of words on large datasets, but they cannot produce embeddings for out-of-vocabulary (OOV) words.
• Most existing word representation approaches assign a distinct vector to each word, treating words as atomic tokens.
• This is a limitation, especially for languages with rich sub-word-level information. fastText uses sub-word n-gram information, which captures the order relationships between characters and the internal semantics of words, to solve this problem.
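A minimal sketch of the sub-word idea: a word is represented by its character n-grams with boundary markers, so an OOV word still shares n-grams with in-vocabulary words (the 3-to-6 n-gram range follows the fastText paper; the hashing trick used in the real implementation is omitted):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, with boundary markers, plus the word itself."""
    marked = f"<{word}>"
    grams = {marked[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(marked) - n + 1)}
    grams.add(marked)          # the full word is kept as one extra "n-gram"
    return grams

# "where" and an unseen word like "whereabouts" share n-grams such as "<whe" and "here",
# so an OOV embedding can be composed as the sum of its n-gram vectors.
print(sorted(char_ngrams("where")))
```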

A survey of word embeddings based on deep learning, Shirui Wang · Wenan Zhou · Chao Jiang
ELMo (Embeddings from Language Models)
• Another challenge facing word embeddings is producing context-specific representations.
• ELMo is a deep contextualized word representation method designed to solve this problem.
• It uses a bidirectional LSTM model trained on a large corpus.
OpenAI-GPT (Generative Pre-Training)
• Unlike ELMo, GPT uses Transformer for feature extraction
BERT (Bidirectional Encoder Representations from
Transformers)
• For many downstream tasks such as
machine reading comprehension, it
is important to be able to extract
context information from both
directions at the same time.
• Uses the bi-Transformer technique
which can effectively exploit the
deep semantic information of a
sentence.
Unsupervised Learning
Algorithms
Unsupervised Learning Algorithms

• Unsupervised learning algorithms learn useful properties about the structure of a dataset. For example, they can learn the probability distribution that generated the dataset (density estimation).
• Can be used for dimensionality reduction.
• Can act as a pre-processing step before applying supervised learning techniques
(e.g. denoising).
• Can perform other tasks such as clustering.
Autoencoders (Chapter 14)
• An autoencoder is a neural network that is
trained to attempt to copy its input to its
output.
• Internally, it has a hidden layer h that
describes a code used to represent the input.
• The network may be viewed as consisting of
two parts: an encoder function h = f (x) and a
decoder that produces a reconstruction r =
g(h).
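A minimal PyTorch sketch of this encoder/decoder structure; the layer sizes and the 784-dimensional input are illustrative assumptions, not from the slides:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_inputs=784, n_code=32):
        super().__init__()
        # Encoder h = f(x): maps the input to the code layer h.
        self.encoder = nn.Sequential(nn.Linear(n_inputs, 128), nn.ReLU(),
                                     nn.Linear(128, n_code))
        # Decoder r = g(h): produces the reconstruction from the code.
        self.decoder = nn.Sequential(nn.Linear(n_code, 128), nn.ReLU(),
                                     nn.Linear(128, n_inputs))

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(h)

model = Autoencoder()
x = torch.randn(16, 784)                  # a toy batch standing in for real data
loss = nn.MSELoss()(model(x), x)          # reconstruction error between output and input
loss.backward()
```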

Goodfellow, Bengio, Courville 2016


Autoencoders
• Autoencoders are restricted in ways that allow them to copy only approximately,
and to copy only input that resembles the training data.
• Typically, we would like to prioritize learning some useful aspects of the data (e.g., if your input is noisy, you would like your autoencoder to learn how to recover the original data).
• If an autoencoder succeeds in simply learning to set g(f (x)) = x everywhere, then
it is not especially useful.
• Traditionally, autoencoders were used for dimensionality reduction or feature
learning. Recently, theoretical connections between autoencoders and latent
variable models have brought autoencoders to the forefront of generative
modeling.

Goodfellow, Bengio, Courville 2016


Autoencoders and PCA
• The simplest kind of autoencoder has one hidden layer, linear activations, and squared error loss:

L(x, \hat{x}) = \|x - \hat{x}\|^2, \qquad \hat{x} = WVx (a linear function)

• If K >= N, then we can choose W and V such that WV is the identity function.
• If K < N, the encoder V maps x to a K-dimensional space, so it is doing dimensionality reduction.
• The autoencoder should learn to choose the subspace which minimizes the squared distance from the data to the projections.
• Thus, it is equivalent to PCA, which maximizes the variance of the projections.

[Figure: a linear autoencoder with N input units x, a K-unit code layer (encoder weights V), and N output units for the reconstruction (decoder weights W).]
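A small NumPy sketch of this equivalence on toy data (the data, N = 5, and K = 2 are illustrative assumptions): PCA's rank-K reconstruction attains the minimum squared error that an optimal linear autoencoder with a K-unit code can reach.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # toy data with correlated features
Xc = X - X.mean(axis=0)                                    # center the data
K = 2

# PCA via SVD: project onto the top-K principal directions and reconstruct.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:K]                        # K x N matrix of principal directions (encoder analogue)
X_hat = (Xc @ P.T) @ P            # reconstruction from the K-dimensional subspace

# An optimal linear autoencoder (squared error, K-unit code) ends up with WV equal to this
# same rank-K projection, so its reconstruction error matches the PCA error printed below.
print("PCA reconstruction error:", np.mean((Xc - X_hat) ** 2))
```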

Goodfellow, Bengio, Courville 2016


Difference between autoencoders and PCA
• In PCA, transformations are linear.
• When the decoder is linear and the loss function (L) is the mean squared error, an
undercomplete autoencoder learns to span the same subspace as PCA.
• Autoencoders with nonlinear encoder functions f and nonlinear decoder
functions g can learn a more powerful nonlinear generalization of PCA.
• If the encoder and decoder are allowed too much capacity, the autoencoder can
learn to perform the copying task without extracting useful information about
the distribution of the data.
• If the capacity of the autoencoder is allowed to become too great, an
autoencoder can fail to learn anything useful about the dataset.
• Thus, in undercomplete autoencoders, f or g typically has low capacity.

Goodfellow, Bengio, Courville 2016


Overcomplete and
Undercomplete Autoencoders
Undercomplete Autoencoders
• One way to obtain useful features from the
autoencoder is to constrain h to have smaller
dimension than x.
• An autoencoder whose code dimension is less
than the input dimension is called
undercomplete.
• Learning an undercomplete representation
forces the autoencoder to capture the most
salient (important) features of the training
data.
• The learning process is described simply as
minimizing a loss function
L(x, g(f(x)))
• where L is a loss function penalizing g(f(x)) for being dissimilar from x, g is the decoder, and f is the encoder function.

[Figure: x → h → x̂]

Goodfellow, Bengio, Courville 2016


Overcomplete Autoencoders
• An autoencoder whose code dimension is
greater than the input dimension is called
overcomplete.
• In the case of an overcomplete autoencoder, even a
linear encoder and linear decoder may learn
to copy the input to the output without
learning anything useful about the data
distribution.
• Must be regularized

[Figure: x → h → x̂]
Regularized Autoencoders
Regularized Autoencoders
• Regularized autoencoders provide the ability to choose the code dimension and the capacity of the encoder and decoder based on the complexity of the distribution to be modeled.
• Rather than limiting the model capacity by keeping the encoder and decoder
shallow and the code size small, regularized autoencoders use a loss function that
encourages the model to have other properties besides the ability to copy its
input to its output.
• These other properties include sparsity of the representation (sparse
autoencoders), robustness to noise or to missing inputs (denoising
autoencoders), and smallness of the derivative of the representation (Contractive
autoencoders).
• A regularized autoencoder can be nonlinear and overcomplete but still learn
something useful about the data distribution even if the model capacity is great
enough to learn a trivial identity function.

Goodfellow, Bengio, Courville 2016


Sparse Autoencoders
• Typically used to learn features as a pre-processing for another task such as
classification.
• A sparse autoencoder is simply an autoencoder whose training criterion involves a
sparsity penalty Ω(ℎ) on the code layer h, in addition to the reconstruction error:
L(x, g(f(x))) + \Omega(h), \qquad \Omega(h) = \lambda \sum_i |h_i|
• We can think of the penalty Ω(h) simply as a regularizer term added to a feedforward
network whose primary task is to copy the input to the output (unsupervised learning
objective) and possibly also perform some supervised task (with a supervised learning
objective) that depends on these sparse features.
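A minimal sketch of adding such a sparsity penalty to the reconstruction loss; the layer sizes, λ, and the toy batch are illustrative assumptions:

```python
import torch
import torch.nn as nn

encoder = nn.Linear(784, 64)                  # f(x): produces the code layer h
decoder = nn.Linear(64, 784)                  # g(h): produces the reconstruction
lam = 1e-3                                    # sparsity weight λ (illustrative)

x = torch.randn(16, 784)                      # toy batch
h = encoder(x)
x_hat = decoder(h)

reconstruction = nn.MSELoss()(x_hat, x)       # L(x, g(f(x)))
sparsity = lam * h.abs().sum(dim=1).mean()    # Ω(h) = λ Σ_i |h_i|, averaged over the batch
(reconstruction + sparsity).backward()
```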
https://web.stanford.edu/class/cs294a/sparseAutoencoder_2011new.pdf

Goodfellow, Bengio, Courville 2016


Denoising Autoencoders (DAE)
• Instead of feeding the original input x, we feed a corrupted (noisy) version of it, x̃.
• A denoising autoencoder (DAE) minimizes
L(x, g(f(x̃)))
• The reconstruction x̂ is computed from the corrupted input x̃.
• The loss function compares the reconstruction x̂ with the noiseless input x.

[Figure: x → x̃ → h → x̂]
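A minimal sketch of one DAE training step with Gaussian corruption; the noise level, layer sizes, and toy batch are illustrative assumptions (any corruption process C(x̃ | x) can be substituted):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())    # f
decoder = nn.Linear(64, 784)                               # g

x = torch.randn(16, 784)                      # clean input (toy batch)
x_tilde = x + 0.3 * torch.randn_like(x)       # corrupted input x̃ drawn from C(x̃ | x)

x_hat = decoder(encoder(x_tilde))             # reconstruction computed from the corrupted input
loss = nn.MSELoss()(x_hat, x)                 # compared against the *noiseless* x
loss.backward()
```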

Goodfellow, Bengio, Courville 2016


Denoising Autoencoders (Cont.)

• Red crosses are the training examples x.
• The black line is the low-dimensional manifold learned by the autoencoder.
• The corruption process C(x̃ | x) is illustrated by a gray circle of equiprobable corruptions.
Goodfellow, Bengio, Courville 2016
Contractive Autoencoders (CAE)
• A contractive autoencoder forces the model to learn a function that does not change much when x changes slightly:

L(x, g(f(x))) + \Omega(h, x), \qquad \Omega(h, x) = \lambda \sum_i \|\nabla_x h_i\|^2
• Features that are sensitive to small changes in the inputs are penalized.
• The name contractive arises from the way that the CAE warps space.
• Specifically, because the CAE is trained to resist perturbations of its input, it is
encouraged to map a neighborhood of input points to a smaller neighborhood of
output points.
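A minimal sketch of the contractive penalty for a single example, using autograd to get the encoder's Jacobian; the layer sizes and λ are illustrative assumptions (the original CAE computes this penalty analytically for a one-layer sigmoid encoder, which is cheaper):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(20, 8), nn.Tanh())   # f: maps x to the code h
decoder = nn.Linear(8, 20)                              # g
lam = 0.1                                               # penalty weight λ (illustrative)

x = torch.randn(20)                                     # a single example, for clarity

# Ω(h, x) = λ Σ_i ||∇_x h_i||²: squared Frobenius norm of the encoder's Jacobian at x.
J = torch.autograd.functional.jacobian(encoder, x, create_graph=True)   # shape (8, 20)
penalty = lam * (J ** 2).sum()

loss = nn.MSELoss()(decoder(encoder(x)), x) + penalty
loss.backward()
```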

Goodfellow, Bengio, Courville 2016


Deep Autoencoders
• “Reducing the Dimensionality of
Data with Neural Networks” by
G. E. Hinton* and R. R.
Salakhutdinov
Deep Autoencoders

Top to bottom:
1) Random samples from the test data set;
2) reconstructions by the 30-dimensional autoencoder;
3) reconstructions by 30-dimensional PCA.
The average squared errors are 126 and 135.
