Autoencoders
Piyush Rai
IIT Kanpur
Introduction to Autoencoders
Autoencoder Variants and Extensions
Some Applications of Autoencoders
Defined by two (possibly nonlinear) mapping functions: Encoding function f , Decoding function g
h = f (x) denotes an encoding (possibly nonlinear) for the input x
x̂ = g (h) = g (f (x)) denotes the reconstruction (or the “decoding”) for the input x
For an Autoencoder, f and g are learned with the goal of minimizing the difference between x̂ and x
The learned code h = f (x) can be used as a new feature representation of the input x
Note: If we learn f, g to minimize the squared error ‖x̂ − x‖², then the linear autoencoder with W∗ = W⊤ is optimal, and is equivalent to Principal Component Analysis (PCA)
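To make the encode/decode maps concrete, here is a minimal NumPy sketch of a linear autoencoder with tied weights (W∗ = W⊤); the toy data, dimensions, and the use of SVD to obtain the PCA-optimal W are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Minimal sketch of a linear autoencoder with tied weights (W* = W^T), which
# (under squared-error training) spans the same subspace as PCA.
rng = np.random.default_rng(0)
D, K, N = 20, 5, 100          # input dim, code dim, number of examples (assumed)
X = rng.normal(size=(N, D))   # toy data matrix (rows are inputs x)
X = X - X.mean(axis=0)        # center the data, as PCA does

# PCA gives the optimal K-dimensional linear encoder for squared error
U, S, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:K]                    # K x D encoding matrix (rows = principal directions)

def encode(x):                # h = f(x) = W x
    return W @ x

def decode(h):                # x_hat = g(h) = W^T h  (tied weights)
    return W.T @ h

x = X[0]
x_hat = decode(encode(x))
print(np.sum((x_hat - x) ** 2))   # reconstruction error of the first example
```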
The hidden nodes can also be nonlinear transforms of the inputs, e.g.,
Can define h as a linear transform of x followed by a nonlinearity (e.g., sigmoid, ReLU)
h = sigmoid(Wx + b)
where the nonlinearity sigmoid(z) = 1/(1 + exp(−z)) squashes the real-valued z to lie between 0 and 1
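A small sketch of this nonlinear encoding; the dimensions and the randomly initialized W and b are illustrative assumptions.

```python
import numpy as np

# Nonlinear encoder h = sigmoid(W x + b)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # squashes z into (0, 1)

D, K = 784, 64                          # e.g., flattened 28x28 image -> 64-dim code
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(K, D)) # K x D encoding weights
b = np.zeros(K)                         # encoder bias

x = rng.random(D)                       # a toy input
h = sigmoid(W @ x + b)                  # nonlinear encoding of x, entries in (0, 1)
```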
Figure below: The K × D matrix W learned on digits data. Each tiny block visualizes a row of W
Thus W captures the possible “patterns” in the training data (akin to the K basis vectors in PCA)
For any input x, the encoding h tells us how much each of these K features is present in x
The loss function (a function of the parameters W, b, W∗, c) can be defined in various ways
In general, it is defined in terms of the difference between x̂ and x (reconstruction error)
For a single input x = [x1, . . . , xD] and its reconstruction x̂ = [x̂1, . . . , x̂D]
ℓ(x̂, x) = Σ_{d=1}^{D} (x̂_d − x_d)²   (squared loss; used if inputs are real-valued)
ℓ(x̂, x) = −Σ_{d=1}^{D} [x_d log(x̂_d) + (1 − x_d) log(1 − x̂_d)]   (cross-entropy loss; used if inputs are binary)
We find (W, b, W∗, c) by minimizing the reconstruction error (summed over all training data)
This can be done using backpropagation
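As a sketch of how such a model can be trained by backpropagation, here is a minimal PyTorch version of a one-hidden-layer autoencoder; the layer sizes, optimizer, number of steps, and the toy data are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

# Minimal sketch of a one-hidden-layer autoencoder trained with backpropagation.
D, K = 784, 64

encoder = nn.Sequential(nn.Linear(D, K), nn.Sigmoid())   # h = sigmoid(W x + b)
decoder = nn.Sequential(nn.Linear(K, D), nn.Sigmoid())   # x_hat = sigmoid(W* h + c)
params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

X = torch.rand(256, D)                 # toy batch of inputs in [0, 1]
loss_fn = nn.BCELoss(reduction="sum")  # cross-entropy loss, for binary-like inputs
# (use nn.MSELoss() instead if the inputs are real-valued)

for step in range(100):
    x_hat = decoder(encoder(X))        # reconstruction of the batch
    loss = loss_fn(x_hat, X)           # reconstruction error
    opt.zero_grad()
    loss.backward()                    # backpropagation
    opt.step()
```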
Undercomplete, Overcomplete, and Need for Regularization
Make the model robust against small changes in the input (Contractive Autoencoders)
Make the learned code sparse (Sparse Autoencoders). Done by adding a sparsity penalty on h
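For the sparse autoencoder, one common choice of penalty is an L1 penalty on h; the slides do not fix a particular penalty, so the following sketch is just one possibility (lam is a hypothetical penalty weight).

```python
import torch

# One possible sparsity-penalized loss for a sparse autoencoder (assumption:
# an L1 penalty on the code h; lam controls the strength of the penalty).
def sparse_ae_loss(x_hat, x, h, lam=1e-3):
    recon = torch.sum((x_hat - x) ** 2)        # reconstruction error
    sparsity = lam * torch.sum(torch.abs(h))   # pushes many entries of h toward 0
    return recon + sparsity
```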
First add some noise (e.g., Gaussian noise) to the original input x
Let’s denote x̃ as the corrupted version of x
The encoder f operates on x̃, i.e., h = f (x̃)
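A minimal sketch of the denoising step, assuming Gaussian corruption with a hypothetical noise level sigma; the encoder/decoder are as in the earlier training sketch.

```python
import torch

# Corrupt x with Gaussian noise and reconstruct the *clean* x from the
# corrupted version (sigma is a hypothetical noise level).
def corrupt(x, sigma=0.1):
    return x + sigma * torch.randn_like(x)   # x_tilde = x + Gaussian noise

# inside a training step:
#   x_tilde = corrupt(x)
#   h = encoder(x_tilde)              # encoder operates on the corrupted input
#   loss = loss_fn(decoder(h), x)     # but the target is the original, clean x
```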
Most autoencoders can be extended to have more than one hidden layer
Can also define the encoder and decoder functions using probability distributions
p_encoder(h|x)
p_decoder(x|h)
The choice of distributions depends on the type of data being modeled and on the type of the encodings
Variational Autoencoders (VAE)
Unlike standard AE, a VAE model learns to generate plausible data from random encodings
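A minimal PyTorch sketch of the VAE training objective (negative ELBO) for binary inputs; the Gaussian code distribution, Bernoulli output distribution, network sizes, and toy data are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, Bernoulli, kl_divergence

D, K = 784, 32
enc_mu, enc_logvar = nn.Linear(D, K), nn.Linear(D, K)   # parameters of p_encoder(h|x)
dec_logits = nn.Linear(K, D)                            # parameters of p_decoder(x|h)

def vae_loss(x):                                        # x: batch of binary (0/1) inputs
    q = Normal(enc_mu(x), torch.exp(0.5 * enc_logvar(x)))       # Gaussian encoder distribution
    h = q.rsample()                                             # reparameterized sample (keeps gradients)
    recon = -Bernoulli(logits=dec_logits(h)).log_prob(x).sum()  # reconstruction term
    kl = kl_divergence(q, Normal(torch.zeros(K), torch.ones(K))).sum()  # keep codes near the prior
    return recon + kl

x = (torch.rand(16, D) > 0.5).float()                   # toy binary batch
vae_loss(x).backward()                                  # trained by backpropagation as usual
```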
Some Applications of Autoencoders
Example: A deep AE for low-dim feature learning for 784-dimensional MNIST images
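A deep encoder/decoder pair for such an example might look as follows; the intermediate layer widths and the 32-dimensional code are illustrative assumptions, not the architecture from the slide's figure.

```python
import torch.nn as nn

# A possible deep autoencoder for 784-dimensional (28x28) inputs
encoder = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64),  nn.ReLU(),
    nn.Linear(64, 32),                 # 32-dimensional code h used as the new features
)
decoder = nn.Sequential(
    nn.Linear(32, 64),   nn.ReLU(),
    nn.Linear(64, 256),  nn.ReLU(),
    nn.Linear(256, 784), nn.Sigmoid(), # reconstruction of the 784-dimensional input
)
```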
Assume we are given a partially known N × M ratings matrix R of N users on M items (movies)
An idea: If the predicted value of a user’s rating for a movie is high, then we should ideally recommend this movie to the user
Thus if we can “reconstruct” the missing entries in R, we can use this method to recommend movies to users. Using an autoencoder can help us do this!
Note: During backprop, only update weights in W that are connected to the observed ratings
Once learned, the model can predict (reconstruct) the missing ratings
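A minimal sketch of this masked training idea: reconstruct a user's rating vector, but compute the loss (and hence the gradient updates) only over the observed entries. The model shape, the specific observed indices, and the ratings are hypothetical.

```python
import torch
import torch.nn as nn

# Autoencoder on a ratings row with missing entries: the loss only uses
# observed ratings, so only weights tied to those entries get updated.
M = 500                                   # number of items (assumed)
encoder = nn.Linear(M, 64)
decoder = nn.Linear(64, M)

R_row = torch.zeros(1, M)                 # one user's (mostly empty) rating vector
observed = torch.zeros(1, M, dtype=torch.bool)
observed[0, [3, 17, 42]] = True           # hypothetical indices of rated items
R_row[0, [3, 17, 42]] = torch.tensor([4.0, 2.0, 5.0])

R_hat = decoder(torch.sigmoid(encoder(R_row)))    # reconstructed (predicted) ratings
loss = ((R_hat - R_row)[observed] ** 2).sum()     # error only on observed entries
loss.backward()                                   # gradients flow only from those entries
```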
Another Autoencoder based Approach
Idea: The rating of a user u on an item i can be defined using the inner-product based similarity of their features learned via an autoencoder: R_ui = f((h^(u))⊤ h^(i)), where f is some compatibility function
Reference: Deep Collaborative Filtering via Marginalized Denoising Auto-encoder (Li et al, CIKM 2015)
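A sketch of this rating model, with hypothetical user/item codes h(u), h(i) and a hypothetical compatibility function f (a sigmoid rescaled to the rating range).

```python
import torch

# R_ui = f(h(u)^T h(i)): compare user and item codes (e.g., produced by
# autoencoders over user-side and item-side data) via an inner product.
K = 64
h_u = torch.randn(K)                # code for user u (from a user-side encoder)
h_i = torch.randn(K)                # code for item i (from an item-side encoder)

def f(score, max_rating=5.0):       # hypothetical compatibility function
    return max_rating * torch.sigmoid(score)

R_ui = f(h_u @ h_i)                 # predicted rating of user u for item i
```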
Other Approaches on Autoencoders for Recommender Systems
Also possible to incorporate side information about the users and/or items (Wang et al, KDD 2015)