Part 15 MD
Ji Hui
Before Transformers: RNN
The Transformer was motivated by the problem of machine translation.
For a source sequence $x = (x_1, \dots, x_T)$, where each $x_t$ denotes a word (e.g. in English), we seek to predict a translation of $x$ into a different language, a.k.a. a target sequence $y = (y_1, \dots, y_{T'})$ (e.g. in French).
Probabilistically, machine translation is about modeling the conditional distribution $p(y \mid x)$:
$$p(y \mid x) = \prod_{t=1}^{T'} p(y_t \mid y_{<t}, x)$$
Encoder-decoder structure for most Seq2Seq models
Encoder: at each step, accepts an input vector $x_t$ and the previous hidden state $h_{t-1}$, producing the next hidden state $h_t = f_{\text{enc}}(x_t, h_{t-1})$
Decoder: feed the final encoder hidden state $h_T$ as the initial hidden state, along with the first element of the sequence to decode
The decoder outputs a categorical distribution over the target vocabulary.
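A minimal sketch of such an RNN encoder-decoder in PyTorch. The GRU cell, embedding sizes, and module names are assumptions for illustration, not the lecture's exact model:

```python
import torch
import torch.nn as nn

# Minimal RNN-based Seq2Seq sketch (illustrative only).
class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)   # logits over the target vocabulary

    def forward(self, src_tokens, tgt_tokens):
        # Encode: the final hidden state h_T summarizes the whole source sequence.
        _, h_T = self.encoder(self.src_emb(src_tokens))
        # Decode: initialize the decoder with h_T and feed the target prefix.
        dec_states, _ = self.decoder(self.tgt_emb(tgt_tokens), h_T)
        # Categorical distribution over the target vocabulary at each step.
        return self.out(dec_states).log_softmax(dim=-1)
```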
Attention is all you need
One drawback of the RNN-based encoder-decoder is that a single hidden state vector can only carry so much information:
the longer the input sequence, the noisier the information from its earlier parts becomes, which limits the model's ability to make good predictions over long sequences.
Attention: exploring the possibility of using all of the hidden state vectors generated during
encoding to decode the target sequence
Attention condenses the set of hidden states into a weighted sum of its constituent vectors
The weighting is based in part on the contents of the vectors
Attention
The decoder receives a context vector $c_t$ at each time step $t$, which is computed by attending to the inputs.
Attention
The context vector is computed as a weighted average of the encoder's hidden states:
$$c_t = \sum_i \alpha_{ti} \, h_i$$
The attention weights are computed as a softmax, whose input depends on the encoder states and the decoder's state:
$$\alpha_{ti} = \frac{\exp\big(\mathrm{score}(s_t, h_i)\big)}{\sum_j \exp\big(\mathrm{score}(s_t, h_j)\big)}$$
The score measures the similarity between two states, e.g. dot attention: $\mathrm{score}(s_t, h_i) = s_t^\top h_i$
Note that the attention here does not depend on the position in the sequence.
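A toy NumPy illustration of dot attention over a set of encoder hidden states; the shapes and array names are made up for the example:

```python
import numpy as np

H = np.random.randn(5, 8)       # 5 encoder hidden states, dimension 8
s_t = np.random.randn(8)        # current decoder state

scores = H @ s_t                            # score(s_t, h_i) = s_t . h_i
weights = np.exp(scores - scores.max())     # numerically stable softmax
alpha = weights / weights.sum()             # attention weights alpha_ti
c_t = alpha @ H                             # context vector: weighted average of h_i
```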
More details on Attention
In general, attention mappings can be described as a function of a query $q$ and a set of key-value pairs $\{(k_i, v_i)\}$.
Context vector: $c = \sum_i \alpha_i v_i$, with weights $\alpha_i = \mathrm{softmax}_i\big(\mathrm{score}(q, k_i)\big)$
Attention: $\mathrm{Attention}\big(q, \{(k_i, v_i)\}\big) = \sum_i \mathrm{softmax}_i\big(\mathrm{score}(q, k_i)\big)\, v_i$
Scaled dot-product attention
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
where the queries, keys and values are stacked into matrices $Q$, $K$, $V$ and $d_k$ is the key dimension.
Multi-head attention
Scaled dot-product attention: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$
Multi-head attention: $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$, where $\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$
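A compact PyTorch sketch of these two operations; masking and dropout are omitted, and the dimension choices are assumptions:

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, heads, seq_len, d_k); no masking or dropout for simplicity.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (batch, heads, seq, seq)
    weights = scores.softmax(dim=-1)                     # attention weights
    return weights @ V                                   # weighted sum of the values

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        # Learned projections W^Q, W^K, W^V and the output projection W^O.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def _split(self, x):
        # (batch, seq, d_model) -> (batch, heads, seq, d_k)
        b, t, _ = x.shape
        return x.view(b, t, self.h, self.d_k).transpose(1, 2)

    def forward(self, query, key, value):
        heads = scaled_dot_product_attention(
            self._split(self.w_q(query)),
            self._split(self.w_k(key)),
            self._split(self.w_v(value)),
        )
        b, _, t, _ = heads.shape
        # Concatenate the heads and apply W^O.
        return self.w_o(heads.transpose(1, 2).reshape(b, t, self.h * self.d_k))
```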
Transformer (Vaswani et al. (2017))
The Transformer has an encoder-decoder architecture.
All the recurrent connections are replaced by attention modules.
The Transformer model uses stacked self-attention layers.
Positional Encoding
The attention encoder outputs do not depend on the order of the inputs.
The order of the sequence conveys important information for language modeling.
The solution: add the positional information of an input token in the sequence into the input embedding vectors.
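A sketch of the sinusoidal positional encoding used in Vaswani et al. (2017), assuming an even model dimension and that the encoding is simply added to the token embeddings:

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    angles = pos / torch.pow(10000.0, i / d_model)                  # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Added to the input embeddings before the first attention layer:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```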
ViT (Vision Transformer): Model an image as a sequence of words
1. Split an image into patches and flatten the patches
2. Produce lower-dimensional linear embeddings from the flattened patches
3. Add positional embeddings (see the patch-embedding sketch below)
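A rough patch-embedding sketch covering steps 1-3; the image size, patch size, and embedding dimension are illustrative defaults, and learned positional embeddings are assumed:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_ch=3, d_model=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.patch_size = patch_size
        self.proj = nn.Linear(in_ch * patch_size * patch_size, d_model)     # step 2
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, d_model))  # step 3

    def forward(self, images):                               # (batch, 3, H, W)
        b, c, h, w = images.shape
        p = self.patch_size
        # Step 1: split into p x p patches and flatten each patch.
        patches = images.unfold(2, p, p).unfold(3, p, p)     # (b, c, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        # Steps 2-3: linear embedding plus positional embedding.
        return self.proj(patches) + self.pos                 # sequence of patch tokens
```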
Generative model
Given training data, generate new samples from the same distribution
Why generative models
Generate realistic samples, e.g. synthesizing realistic images for graphics and entertainment
Model the data distribution in order to tell which of several candidate outputs is more likely
Train a generative model to learn high-level features that are useful for other tasks.
Auto-encoder
Encoder: in which the model learns how to reduce the input dimensions and compress the input data into an encoded representation.
Bottleneck: the layer that contains the compressed representation of the input data. This is the lowest-dimensional representation of the input data.
Decoder: in which the model learns how to reconstruct the data from the encoded representation to be as close to the original input as possible.
Reconstruction loss: the method that measures how well the decoder is performing and how close the output is to the original input.
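A minimal fully-connected autoencoder sketch in PyTorch; the layer sizes and the MSE reconstruction loss are assumptions:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, bottleneck_dim=20):
        super().__init__()
        self.encoder = nn.Sequential(            # compress the input to the bottleneck
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, bottleneck_dim),
        )
        self.decoder = nn.Sequential(            # reconstruct the input from the code
            nn.Linear(bottleneck_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.rand(32, 784)                          # e.g. a batch of flattened images
recon_loss = nn.functional.mse_loss(model(x), x) # reconstruction loss
```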
Illustration of auto-encoder
Principal component analysis (PCA)
PCA produces a low-dimensional approximation of a dataset $X \in \mathbb{R}^{n \times p}$ by finding linear combinations of the variables with maximal variance and mutual un-correlation.
Procedure of PCA
Stack the $n$ observations (with zero mean) into the rows of a matrix $X \in \mathbb{R}^{n \times p}$.
Construct the Singular Value Decomposition (SVD): $X = U \Sigma V^\top$,
where $U$ has orthonormal columns, $\Sigma$ is diagonal with singular values $\sigma_1 \ge \dots \ge \sigma_p \ge 0$, and $V$ is orthogonal.
The columns of $V$ are called the principal components.
Demo of PCA on digits
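A sketch of such a demo, assuming scikit-learn's 8x8 digits dataset and PCA computed via the SVD as above (the lecture's exact figure is not reproduced here):

```python
import numpy as np
from sklearn.datasets import load_digits

X = load_digits().data                       # (1797, 64) flattened 8x8 digit images
X = X - X.mean(axis=0)                       # zero-mean the observations
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
Z = X @ Vt[:k].T                             # project onto the top-k principal components
X_hat = Z @ Vt[:k]                           # rank-k reconstruction of the data
print("explained variance ratio:", (S[:k] ** 2).sum() / (S ** 2).sum())
```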
From PCA to Autoencoder
Consider the truncated SVD $X \approx X V_k V_k^\top$, where $V_k$ contains the first $k$ principal components: we have a low-dimensional approximation with a linear encoder $z = V_k^\top x$ and a linear decoder $\hat{x} = V_k z$.
Autoencoder
Auto-encoder: a generalization of PCA with non-linearity.
For linear activation functions, the network is equivalent to a composition of linear layers, which is equivalent to the encoder of PCA.
For non-linear activations, the depth of the network matters; many layers might be needed to build a good representation.
An autoencoder provides a way to represent an input vector by a nonlinear projection onto a lower-dimensional space, given by the neuron activations of the inner-most layer with few neurons.
Autoencoder in action
An autoencoder with 20 latent variables
From decoder to generative model
The decoder of an auto-encoder provides a generator network, which maps a low-dimensional code $z$ to a high-dimensional image $x$.
The decoder above is not a generative model yet, as it does not define a distribution over $x$.
Consider training a generator network for the distribution: sample $z$ from a fixed prior, e.g. $z \sim \mathcal{N}(0, I)$, and decode it with $p_\theta(x \mid z)$, which induces a distribution $p_\theta(x)$ over the data.
Training a VAE
Train the encoder distribution $q_\phi(z \mid x) = \mathcal{N}\big(\mu_\phi(x), \Sigma_\phi(x)\big)$ to return the mean and the covariance matrix of a Gaussian.
The loss function:
$$\mathcal{L}(\theta, \phi; x) = -\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, \mathcal{N}(0, I)\big)$$
A "reconstruction term" (on the final layer), that makes the encoding-decoding scheme a good approximation.
A "regularisation term" (on the latent layer), that makes the distributions returned by the encoder close to a standard normal distribution, via the Kullback-Leibler (KL) divergence.
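A sketch of this loss for a diagonal Gaussian encoder, together with the reparameterization trick; the MSE reconstruction term is an assumption (a Bernoulli/BCE term is equally common):

```python
import torch

def vae_loss(x, x_recon, mu, log_var):
    # Reconstruction term: how well the decoder output matches the input.
    recon = torch.nn.functional.mse_loss(x_recon, x, reduction="sum")
    # Regularisation term: KL(q(z|x) || N(0, I)) for a diagonal Gaussian encoder.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl

def reparameterize(mu, log_var):
    # Sample z = mu + sigma * eps so gradients flow through the encoder.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps
```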
VAE vs AE
Instead of encoding an input as a single point, we encode it as a distribution over the latent
space
Generative Adversarial Network (GAN)
The idea behind the GAN: if the generator is doing a good job of modeling the data distribution, then the generated samples should be indistinguishable from the true data.
Training GAN: a two-player game
Generator network:
Try to fool the discriminator by generating real-looking images.
Discriminator network:
Try to distinguish between real and fake images.
The discriminator is trained as a logistic regression classifier, with the cost function being the cross-entropy for classifying real vs. fake:
$$\mathcal{L}_D = -\,\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] - \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]$$
Training GANs: Minimax Game
The generator and discriminator play a two-player minimax game:
$$\min_G \max_D \;\; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]$$
Training GANs
Minimax game on the logistic loss function: the discriminator maximizes the objective above, while the generator minimizes it, with the two networks updated in alternation.
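A minimal alternating training step for this game; `generator`, `discriminator`, and `latent_dim` are hypothetical stand-ins for models and settings defined elsewhere:

```python
import torch
import torch.nn as nn

def gan_step(generator, discriminator, opt_g, opt_d, real, latent_dim=100):
    bce = nn.BCEWithLogitsLoss()
    b = real.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # Discriminator step: classify real vs. fake (cross-entropy).
    z = torch.randn(b, latent_dim)
    fake = generator(z).detach()
    d_loss = bce(discriminator(real), ones) + bce(discriminator(fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to fool the discriminator.
    z = torch.randn(b, latent_dim)
    g_loss = bce(discriminator(generator(z)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```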
Generated samples by GAN
Boxed images are the nearest images in the training data.
Diffusion model
Diffusion models are inspired by non-equilibrium thermodynamics: they define a Markov chain of diffusion steps that slowly adds random noise to the data, and then learn to reverse the diffusion process to construct desired data samples from the noise.
Different from a VAE, diffusion models are learned with a fixed (forward) procedure, and the latent variable has high dimensionality (the same as the original data).
Forward diffusion process
Given a data point $x_0$ sampled from the data distribution $q(x)$, a forward diffusion process adds small Gaussian noise to the sample in $T$ steps, producing a sequence of noisy samples $x_1, \dots, x_T$, where
$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I\big)$$
The step sizes are controlled by a variance schedule $\{\beta_t \in (0, 1)\}_{t=1}^{T}$.
The data sample $x_0$ gradually loses its distinguishable features as the step $t$ becomes larger.
Eventually, when $T \to \infty$, $x_T$ is equivalent to an isotropic Gaussian distribution.
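Because the Gaussian steps compose, $x_t$ can be sampled from $x_0$ in closed form: $q(x_t \mid x_0) = \mathcal{N}\big(\sqrt{\bar{\alpha}_t}\, x_0,\, (1 - \bar{\alpha}_t) I\big)$, with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$. A sketch, assuming a linear variance schedule:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # variance schedule beta_t
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # alpha_bar_t = prod_s (1 - beta_s)

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)."""
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.rand(1, 3, 32, 32)                    # e.g. an image with values in [0, 1]
x_noisy = q_sample(x0, t=500)                    # heavily noised sample
```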
Reverse diffusion process
If we can reverse the forward process and sample from $q(x_{t-1} \mid x_t)$, we will be able to recreate the true sample from a Gaussian noise input $x_T \sim \mathcal{N}(0, I)$.
Bad news: we cannot easily estimate $q(x_{t-1} \mid x_t)$, because it needs the whole data distribution.
Approximation: we learn a model $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\big)$ to approximate these conditional probabilities in order to run the reverse diffusion process.
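A sketch of one reverse step under the DDPM parameterization of Ho et al. (2020), in which a network predicts the added noise; this particular parameterization is an assumption, and `eps_model` is a hypothetical noise-prediction network:

```python
import torch

# One reverse diffusion step x_t -> x_{t-1}, using betas / alphas_bar
# as defined for the forward process above.
@torch.no_grad()
def p_sample(eps_model, x_t, t, betas, alphas_bar):
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    # Mean of p_theta(x_{t-1} | x_t) under the noise-prediction parameterization.
    mean = (x_t - beta_t / (1.0 - alphas_bar[t]).sqrt() * eps_model(x_t, t)) / alpha_t.sqrt()
    if t == 0:
        return mean                              # no noise is added at the final step
    noise = torch.randn_like(x_t)
    return mean + beta_t.sqrt() * noise          # simple choice: sigma_t^2 = beta_t
```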
Illustration