Part 15 MD


Visual Information Interpretation

Beyond CNN (Optional): Transformer and Generative models

Ji Hui

1
Before Transformers: RNN
Transformers were motivated by the problem of machine translation.
For a source sequence $x = (x_1, \dots, x_n)$, where each $x_i$ denotes a word (e.g.
English), we seek to predict a translation of $x$ into a different language, a.k.a. a target
sequence $y = (y_1, \dots, y_m)$ (e.g. French).
Probabilistically, machine translation is about modeling the conditional distribution $p(y \mid x)$:
$$p(y \mid x) = \prod_{t=1}^{m} p(y_t \mid y_1, \dots, y_{t-1}, x)$$

A recurrent neural network (RNN) for Seq2Seq model


SOS: start of sequence; EOS: end of sequence

2
Encoder-decoder structure for most Seq2Seq models
Encoder: accepts an input vector $x_t$ and a hidden state $h_{t-1}$, and returns the updated hidden state $h_t$

Decoder: is fed the final encoder state $h_n$ as its initial hidden state, along with the first element
of the sequence to decode (SOS)
The decoder outputs a categorical distribution over the target vocabulary.
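
The encoder-decoder structure above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the vocabulary sizes, the token ids for SOS and EOS, the plain tanh recurrence and all weight names are hypothetical, and the weights are random rather than trained, so the decoded tokens carry no meaning.

```python
import numpy as np

rng = np.random.default_rng(0)
SRC_VOCAB, TGT_VOCAB, HIDDEN = 6, 5, 8   # hypothetical sizes
SOS, EOS = 0, 1                           # hypothetical ids in the target vocabulary

def init(*shape):
    # Small random weights; untrained, for shape illustration only.
    return rng.normal(scale=0.1, size=shape)

E_src, E_tgt = init(SRC_VOCAB, HIDDEN), init(TGT_VOCAB, HIDDEN)   # word embeddings
W_enc, U_enc = init(HIDDEN, HIDDEN), init(HIDDEN, HIDDEN)
W_dec, U_dec = init(HIDDEN, HIDDEN), init(HIDDEN, HIDDEN)
W_out = init(HIDDEN, TGT_VOCAB)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(src_ids):
    """Run the encoder RNN over the source and return its final hidden state."""
    h = np.zeros(HIDDEN)
    for i in src_ids:
        h = np.tanh(E_src[i] @ W_enc + h @ U_enc)
    return h

def decode_greedy(h, max_len=10):
    """Start from SOS and emit the most likely token until EOS or max_len."""
    tokens, prev = [], SOS
    for _ in range(max_len):
        h = np.tanh(E_tgt[prev] @ W_dec + h @ U_dec)
        probs = softmax(h @ W_out)   # categorical distribution over the target vocabulary
        prev = int(np.argmax(probs))
        if prev == EOS:
            break
        tokens.append(prev)
    return tokens

h_final = encode([2, 4, 3])            # a toy source sentence of three word ids
translation = decode_greedy(h_final)   # decoder starts from the final encoder state
```

Note that the whole source sentence reaches the decoder only through `h_final`, a single vector of fixed size.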

3
Attention is all you need

One drawback of the RNN-based encoder-decoder is that a single hidden state vector can
only carry so much information.
The longer an input sequence, the noisier the information from earlier parts becomes,
which limits the ability of the model to make good predictions over long sequences.
Attention: explore the possibility of using all of the hidden state vectors generated during
encoding to decode the target sequence.
Attention condenses the set of hidden states into a weighted sum of its constituent vectors.
The weighting is based in part on the contents of the vectors.

4
Attention
The decoder receives a context vector at each time step, which is computed by
attending to the inputs.

5
Attention
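The context vector $c_i$ is computed as a weighted average of the encoder's hidden states $h_j$:
$$c_i = \sum_j \alpha_{ij} h_j$$
The attention weights are computed as a softmax, whose input depends on the encoder's hidden state
and the decoder's state $s_{i-1}$:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}, \qquad e_{ij} = \mathrm{score}(s_{i-1}, h_j)$$
The score measures the similarity between two states, e.g. dot attention:
$$\mathrm{score}(s, h) = s^\top h$$
Note that the attention here depends on the content of the hidden states, not on their position in the sequence.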
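
A minimal NumPy sketch of dot attention, assuming the decoder state and the encoder hidden states share the same dimension. The names `s_prev` and `H` are chosen for this example. The last lines illustrate the note above: shuffling the encoder states permutes the weights but leaves the context vector unchanged.

```python
import numpy as np

def dot_attention(s_prev, H):
    """s_prev: decoder state, shape (d,). H: encoder hidden states, shape (n, d)."""
    scores = H @ s_prev                       # dot score between the decoder state and each h_j
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    context = weights @ H                     # weighted average of the hidden states
    return context, weights

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
s_prev = rng.normal(size=3)
context, weights = dot_attention(s_prev, H)

# Same hidden states in a different order: the weights follow the states,
# and the context vector stays the same.
perm = np.array([2, 0, 3, 1])
context_perm, weights_perm = dot_attention(s_prev, H[perm])
```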

6
More details on Attention
In general, attention mappings can be described as a function of a query and a set of key-value pairs
Context vector: $c_i = \sum_j \alpha_{ij} v_j$
Attention: $\alpha_{ij} = \mathrm{softmax}_j\big(\mathrm{score}(q_i, k_j)\big)$

The query $q_i$: represents the context we want


Inter-attention: For example, something computed from the previous layer or the $(i-1)$-th
timestep of the decoder
Self-attention: For example, something computed from the $i$-th timestep of the encoder.
The key $k_j$: tells whether or not $v_j$ is useful context
For example, something computed from the $j$-th timestep of the encoder
The value $v_j$: provides the actual context
For example, something computed from the $j$-th timestep of the encoder.

7
Scaled dot-product attention

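Scaled dot-product attention, with queries $Q$, keys $K$ of dimension $d_k$, and values $V$:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
The context summary vector for query $q_i$:
$$c_i = \sum_j \alpha_{ij} v_j, \qquad \alpha_{ij} = \mathrm{softmax}_j\left(\frac{q_i^\top k_j}{\sqrt{d_k}}\right)$$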
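
A short NumPy sketch of scaled dot-product attention in matrix form, following the formulation of Vaswani et al. (2017). The shapes and random inputs are assumptions made for illustration; each row of `Q` is one query and each row of `K` and `V` is one key-value pair.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (m, d_k), K: (n, d_k), V: (n, d_v). Returns (m, d_v) contexts and (m, n) weights."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)   # one softmax per query
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries
K = rng.normal(size=(5, 4))   # 5 keys
V = rng.normal(size=(5, 3))   # 5 values
out, weights = scaled_dot_product_attention(Q, K, V)
```

Dividing by the square root of the key dimension keeps the softmax inputs at a moderate scale as that dimension grows.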

8
Multi-head attention
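Scaled dot-product attention:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
Multi-head attention:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

where $W_i^Q, W_i^K, W_i^V, W^O$ are learnable weighting matrices.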
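
A compact sketch of multi-head attention in NumPy, assuming two heads and a model dimension of 8. The per-head projection lists `W_q`, `W_k`, `W_v` and the output matrix `W_o` are random stand-ins for the learnable weighting matrices; passing the same input as query and key-value source gives self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for 2-D arrays."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X_q, X_kv, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: lists with one projection matrix per head. W_o: output projection."""
    heads = [attention(X_q @ wq, X_kv @ wk, X_kv @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
d_model, n_heads = 8, 2                  # hypothetical sizes
d_head = d_model // n_heads
W_q = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_k = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_v = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))

X = rng.normal(size=(5, d_model))        # a sequence of 5 token vectors
Y = multi_head_attention(X, X, W_q, W_k, W_v, W_o)   # self-attention
```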

9
Transformer (Vaswani et al. (2017))
The Transformer has an encoder-decoder architecture.
All the recurrent connections are replaced by attention modules.
The transformer model uses stacked self-attention layers.
Positional Encoding
The attention encoder outputs do not depend on the order of the inputs.
The order of the sequence conveys important information for language modeling.
The solution: add positional information of an input token in the sequence into the input
embedding vectors.
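
One common choice of positional information is the fixed sinusoidal encoding proposed by Vaswani et al. (2017); the slide does not say which scheme is meant, so the sketch below is an assumption. The function name and the toy sizes are illustrative, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model / 2)
    angles = pos / (10000.0 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions
    return pe

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 16))            # toy token embeddings
pe = sinusoidal_positional_encoding(50, 16)
inputs = embeddings + pe                          # position is added to the embedding
```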

10
ViT (Vision Transformer): model an image as a sequence of words
1. Split an image into patches and flatten the patches
2. Produce lower-dimensional linear embeddings from the flattened patches
3. Add positional embeddings
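
The three steps above can be sketched in NumPy as follows. The image size, patch size and embedding dimension are hypothetical, and the random matrices `W_embed` and `pos_embed` stand in for parameters that would be learned in a real model.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patch x patch blocks."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image size must be a multiple of patch size"
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)             # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * C)    # (num_patches, patch * patch * C)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
patch, d_model = 8, 16                         # hypothetical sizes

patches = patchify(image, patch)               # step 1: 16 patches of length 192
W_embed = rng.normal(scale=0.02, size=(patches.shape[1], d_model))
pos_embed = rng.normal(scale=0.02, size=(patches.shape[0], d_model))
tokens = patches @ W_embed + pos_embed         # steps 2 and 3: one token per patch
```

The resulting `tokens` array plays the role of the word sequence fed to a transformer encoder.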

11
Generative model
Given training data, generate new samples from the same distribution

Learn a model distribution $p_{\text{model}}(x)$ which is close to the data distribution $p_{\text{data}}(x)$


12
Generative models
Density estimation
Density estimation is to estimate a continuous density function from a discretely
sampled set of points drawn from that density function
Two main flavors
Explicit density estimation: explicitly define and solve for $p_{\text{model}}(x)$
Implicit density estimation (Generative models): Given training data, generate new
samples from the same distribution

13
Why generative model
Generate realistic samples, e.g. synthesizing realistic images for graphics and entertainment
Model the data distribution in order to tell which of several candidate outputs is more likely
Train a generative model to learn high-level features that are useful to other tasks.

Figures: auto-encoder; generative adversarial network (GAN)

14
Auto-encoder

Encoder: the part in which the model learns how to reduce the input dimensions and compress the
input data into an encoded representation.
Bottleneck: the layer that contains the compressed representation of the input data.
This is the lowest-dimensional representation of the input data in the network.
Decoder: the part in which the model learns how to reconstruct the data from the encoded
representation to be as close to the original input as possible.
Reconstruction loss: measures how well the decoder is performing, i.e. how close the
output is to the original input.

15
Illustration of auto-encoder

16
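Principal component analysis (PCA)
PCA produces a low-dimensional approximation of a dataset $\{x_i\}_{i=1}^{n} \subset \mathbb{R}^d$, by finding
linear combinations of the variables that have maximal variance and are mutually uncorrelated.
Procedure of PCA
Stack the observations (with zero mean) into the rows of a matrix $X \in \mathbb{R}^{n \times d}$.
Construct the Singular Value Decomposition (SVD):
$$X = U \Sigma V^\top$$
For a $k$-dimensional PCA, $\Sigma_k$ is the matrix with a diagonal upper sub-matrix:
$$\Sigma_k = \mathrm{diag}(\sigma_1, \dots, \sigma_k, 0, \dots, 0)$$
where $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k \ge 0$
The columns of $V$ are called the principal components.
17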
Demo of PCA on digits

18
From PCA to Autoencoder
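Consider a sample $x \in \mathbb{R}^d$; we have a low-dimensional approximation
$$\hat{x} = V_k V_k^\top x$$
where $V_k$ holds the first $k$ principal components as columns.
$z = V_k^\top x$ can be viewed as the code vector, containing only $k$ non-zero entries


PCA is a linear autoencoder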
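
A small NumPy sketch of PCA used as a linear autoencoder, assuming observations are stacked as rows and that the principal components are the right singular vectors. The toy data, which lies on a 2-dimensional subspace of a 5-dimensional space, and the helper names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 observations in R^5 that lie on a 2-dimensional subspace.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
X = X - X.mean(axis=0)                     # zero-mean observations, stacked as rows

U, S, Vt = np.linalg.svd(X, full_matrices=False)   # rows of Vt are principal components

def encode(X, k):
    """Linear encoder: project onto the first k principal components."""
    return X @ Vt[:k].T

def decode(Z):
    """Linear decoder: map the k-dimensional code back to the input space."""
    k = Z.shape[1]
    return Z @ Vt[:k]

def relative_error(k):
    X_hat = decode(encode(X, k))
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)

err_1, err_2 = relative_error(1), relative_error(2)
```

Because the toy data has rank 2, a 2-dimensional code reconstructs it up to rounding error, while a 1-dimensional code loses information.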

19
Autoencoder
Auto-encoder: generalized PCA with non-linearity

With linear activation functions, the network is equivalent to a stack of linear layers, which is
equivalent to the encoder of PCA.
With non-linear activations, the depth of the network matters: many layers might be needed to
build a good representation.
An autoencoder provides a way to represent an input vector by a nonlinear projection onto a
lower-dimensional space of neuron activations in the innermost layer, which has few neurons.
20
Autoencoder in action
An autoencoder with 20 latent variables

The reconstructed digits from the autoencoder with 20 latent variables

21
From decoder to generative model
The decoder of an auto-encoder provides a generator network, which maps a low-
dimensional code $z$ to a high-dimensional image $x$.

The decoder above is not a generative model yet, as it does not define a distribution over $x$
Consider training a generator network for the distribution
$$p(x) = \int p(z)\, p(x \mid z)\, dz$$
If $z$ is lower-dimensional and the decoder is deterministic, $p(x) = 0$ almost everywhere.
22


VAE (Variational auto-encoder)
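Generalizing the decoder to a noisy generative model
$$p(x \mid z) = \mathcal{N}\big(x;\, G_\theta(z),\, \eta I\big)$$
where $G_\theta(z)$ denotes the reconstruction of the decoder.


The code $z$ is defined as a sample from some distribution, which is approximated by a
mean-field model
$$q(z) = \mathcal{N}(z;\, \mu, \Sigma)$$
where $\mu = (\mu_1, \dots, \mu_K)$ and $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_K^2)$, i.e.,
$$q(z) = \prod_{i=1}^{K} \mathcal{N}(z_i;\, \mu_i, \sigma_i^2)$$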

23
Training a VAE
Train the encoder to return the mean and the covariance matrix of a Gaussian distribution
The loss function has two terms:
A “reconstruction term” (on the final layer), that makes the encoding-decoding scheme a
good approximation
A “regularisation term” (on the latent layer), that makes the distributions returned by the
encoder close to a standard normal distribution, via the Kullback-Leibler (KL) divergence
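
The two loss terms can be sketched in NumPy as below, assuming a diagonal Gaussian encoder that outputs a mean `mu` and a log-variance `log_var`, and a squared-error reconstruction term (which matches a Gaussian decoder up to scaling and constants). The function names and toy values are illustrative, and the usual relative weighting between the two terms is left out.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, diag(exp(log_var))) as mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed form of KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0, axis=-1)

def vae_loss(x, x_recon, mu, log_var):
    reconstruction = np.sum((x - x_recon) ** 2, axis=-1)   # reconstruction term
    regularisation = kl_to_standard_normal(mu, log_var)    # regularisation term
    return reconstruction + regularisation

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.0])
log_var = np.zeros(2)                      # unit variance in each latent dimension
z = reparameterize(mu, log_var, rng)       # code that would be passed to the decoder

x = np.array([1.0, 2.0, 3.0])
x_recon = np.array([1.0, 2.0, 2.0])        # stand-in for a decoder output
loss = vae_loss(x, x_recon, mu, log_var)
```

The KL term vanishes exactly when the encoder returns a standard normal distribution, which is what the regularisation pushes towards.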

24
VAE vs AE

Instead of encoding an input as a single point, we encode it as a distribution over the latent
space

25
Generative Adversarial Network (GAN)
The idea behind the GAN: if the generator is doing a good job of modeling the data
distribution, then the generated samples should be indistinguishable from the true data.

Train a discriminator whose job is to classify whether an observation (e.g. an image) is
from the training set or whether it was produced by the generator.
The generator is evaluated based on the discriminator's failure to tell generated samples from real data.
26
Generative Adversarial Network (GAN)

27
Training GAN: a two-player game
Generator network
Try to fool the discriminator by generating real-looking images
Discriminator network
Try to distinguish between real and fake images
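Trained as a logistic regression classifier, with cross-entropy as the cost function for
classifying real vs. fake:
$$J_D = -\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] - \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]$$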

28
Training GANs: Minimax Game

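$$\min_{\theta_g} \max_{\theta_d} \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D_{\theta_d}(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)\big]$$
Discriminator maximizes the objective such that $D(x)$ is close to 1 (real) and
$D(G(z))$ is close to 0 (fake).
Generator minimizes the objective such that $D(G(z))$ is close to 1 (discriminator is fooled into
thinking the generated $G(z)$ is real).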

29
Training GANs
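Minimax game on the logistic loss function:
$$\min_{\theta_g} \max_{\theta_d} \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D_{\theta_d}(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)\big]$$
Alternate between:

Gradient ascent on the discriminator:
$$\max_{\theta_d} \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D_{\theta_d}(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)\big]$$
Gradient descent on the generator:
$$\min_{\theta_g} \; \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)\big]$$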
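
To make the two-player objective concrete, the sketch below evaluates the standard GAN value function (expected log-probability on real data plus expected log of one minus the probability on generated samples) from discriminator outputs. The probability values are made up for illustration and no network is trained; `gan_value` and the variable names are hypothetical.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs on real data, probabilities in (0, 1)
    d_fake: discriminator outputs on generated samples, probabilities in (0, 1)
    """
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

# Discriminator cannot tell real from fake: every output is 0.5.
confused = gan_value(np.full(4, 0.5), np.full(4, 0.5))

# Discriminator is nearly perfect: the value approaches its maximum of 0.
sharp = gan_value(np.full(4, 0.99), np.full(4, 0.01))

# Generator fools the discriminator: fakes are scored as real, the value drops.
fooled = gan_value(np.full(4, 0.99), np.full(4, 0.99))
```

The discriminator's ascent step pushes the value towards `sharp`, while the generator's descent step pushes it towards `fooled`.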

30
Generated samples by GAN
boxed images are nearest images in training data

31
Diffusion model
Diffusion models are inspired by non-equilibrium thermodynamics. They define a Markov
chain of diffusion steps that slowly add random noise to data, and then learn to reverse the
diffusion process to construct desired data samples from the noise.

Unlike a VAE, diffusion models are learned with a fixed procedure, and the latent
variable has high dimensionality (same as the original data).

32
Forward diffusion process
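Given a data point $x_0$ sampled from a distribution $q(x)$, a forward diffusion process adds
small Gaussian noise to the sample in $T$ steps, producing a sequence of noisy samples $x_1, \dots, x_T$:
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$
where $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I\big)$
The step sizes are controlled by a variance schedule $\{\beta_t \in (0, 1)\}_{t=1}^{T}$

The data sample $x_0$ gradually loses its distinguishable features as the step $t$ becomes larger.
Eventually when $T \to \infty$, $x_T$ is equivalent to an isotropic Gaussian distribution.
33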
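
A minimal simulation of the forward diffusion process in NumPy, assuming each step scales the previous sample by the square root of one minus the step variance and adds Gaussian noise with that variance. The linear variance schedule from 1e-4 to 0.02 over 1000 steps is an assumed example, not something specified in the slides, and the constant starting point is a toy stand-in for a data sample.

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Return [x_0, x_1, ..., x_T] with x_t ~ N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    xs = [x0]
    for beta in betas:
        noise = rng.standard_normal(x0.shape)
        xs.append(np.sqrt(1.0 - beta) * xs[-1] + np.sqrt(beta) * noise)
    return xs

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)    # assumed linear variance schedule
x0 = np.full(5000, 5.0)               # toy "data": every coordinate equals 5
xs = forward_diffusion(x0, betas, rng)
x_T = xs[-1]

# After many steps the sample has lost the structure of x0 and its
# coordinates have mean close to 0 and standard deviation close to 1.
mean_T, std_T = x_T.mean(), x_T.std()
```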
Reverse diffusion process
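If we can reverse the forward process and sample from $q(x_{t-1} \mid x_t)$, we will be able to
recreate the true sample from a Gaussian noise input, $x_T \sim \mathcal{N}(0, I)$
Bad news: We cannot easily estimate $q(x_{t-1} \mid x_t)$ because it needs the data distribution
Approximation: We learn a model $p_\theta$ to approximate these conditional probabilities in order
to run the reverse diffusion process.
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\big)$$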

34
Illustration

(Image source: Sohl-Dickstein et al., 2015)
35


Summary of three generative models

36
