Lecture 2: Autoregressive Models
Pieter Abbeel, Xi (Peter) Chen, Jonathan Ho, Aravind Srinivas, Alex Li, Wilson Yan
UC Berkeley
Outline
- Motivation
- Simple generative models: histograms
- Modern neural autoregressive models
- Parameterized distributions and maximum likelihood
- Autoregressive Models
- Recurrent Neural Nets
- Masking-based Models
  - MADE
  - Masked Convolutions
  - Wavenet
  - PixelCNN (+variations)
Likelihood-based models
Problems we’d like to solve:
- Generating data: synthesizing images, videos, speech, text
- Compressing data: constructing efficient codes
- Anomaly detection
Likelihood-based models: estimate p_data from samples x^(1), …, x^(n) ~ p_data(x)
We learn a distribution p that allows:
- Computing p(x) for arbitrary x
- Sampling x ~ p(x)
Today: discrete data
Desiderata
We want to estimate distributions of complex, high-dimensional data
- A 128x128x3 image lies in a ~50,000-dimensional space
We also want computational and statistical efficiency
- Efficient training and model representation
- Expressiveness and generalization
- Sampling quality and speed
- Compression rate and speed
Learning: Estimate frequencies by counting
Recall: the goal is to estimate p_data from samples x^(1), …, x^(n) ~ p_data(x)
Suppose the samples take on values in a finite set {1, …, k}
The model: a histogram
- (Redundantly) described by k nonnegative numbers: p_1, …, p_k
- To train this model: count frequencies
  p_i = (# times i appears in the dataset) / (# points in the dataset)
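A minimal sketch of this counting procedure in NumPy (the toy dataset and k are illustrative; values here live in {0, …, k-1} rather than {1, …, k}):

import numpy as np

def train_histogram(data, k):
    # p_i = (# times i appears in the dataset) / (# points in the dataset)
    counts = np.bincount(data, minlength=k)
    return counts / len(data)

data = np.random.randint(0, 10, size=1000)  # toy dataset, k = 10
p = train_histogram(data, k=10)
assert np.isclose(p.sum(), 1.0)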
Inference and Sampling
Inference (querying p_i for arbitrary i): simply a lookup into the array p_1, …, p_k
Sampling x ~ p: draw u ~ Uniform[0, 1] and return the first i whose cumulative probability p_1 + … + p_i exceeds u (inverse-CDF sampling)
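The same idea as a NumPy sketch (p is the trained histogram from above):

import numpy as np

def sample_histogram(p, num_samples, rng=np.random.default_rng()):
    # inverse-CDF sampling: find the first index whose cumulative sum exceeds u
    cdf = np.cumsum(p)
    u = rng.random(num_samples)
    return np.searchsorted(cdf, u)

samples = sample_histogram(p, num_samples=5)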
Are we done?
Failure in high dimensions
No, because of the curse of dimensionality. Counting fails when there are too many bins.
- (Binary) MNIST: 28x28 images, each pixel in {0, 1}
- There are 2^784 ≈ 10^236 probabilities to estimate
- Any reasonable training set covers only a tiny fraction of this
- Each image influences only one parameter. No generalization whatsoever!
Problematic even for a single variable
Parameterized Distributions
Status
- Issues with histograms
- High dimensions: won’t work
- Even 1-d: if many values in the domain, prone to overfitting
Likelihood-based generative models
Recall: the goal is to estimate p_data from x^(1), …, x^(n) ~ p_data(x)
Fitting distributions
- Given data x^(1), …, x^(n) sampled from a “true” distribution p_data
- Set up a model class: a set of parameterized distributions p_θ
- Pose a search problem over parameters
Maximum likelihood
- Maximum likelihood: given a dataset x^(1), …, x^(n), find θ by solving the optimization problem
  arg max_θ (1/n) Σ_i log p_θ(x^(i))
- Statistics tells us that if the model family is expressive enough and enough data is given, then solving the maximum likelihood problem will recover the parameters of the data-generating distribution
- Equivalent to minimizing the KL divergence between the empirical data distribution and the model
Stochastic gradient descent
Maximum likelihood is an optimization problem. How do we solve it?
Stochastic gradient descent (SGD).
- SGD minimizes expectations: for f a differentiable function of θ, it solves min_θ E[f(θ)]
- Why maximum likelihood + SGD? It scales to large datasets and is compatible with neural networks.
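A minimal sketch of maximum likelihood training with SGD, fitting the logits of a single categorical distribution in NumPy (dataset, sizes, and learning rate are illustrative):

import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 10, size=1000)   # toy dataset over {0, ..., 9}
theta = np.zeros(10)                    # logits; p_theta = softmax(theta)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr, batch_size = 0.5, 100
for step in range(1000):
    batch = rng.choice(data, size=batch_size)
    freq = np.bincount(batch, minlength=10) / batch_size
    # gradient of the mean negative log-likelihood w.r.t. the logits
    grad = softmax(theta) - freq
    theta -= lr * grad                  # one SGD step on a minibatch

# softmax(theta) now approximates the empirical distribution of the data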
Designing the model
- Key requirement for maximum likelihood + SGD: efficiently compute log p(x) and its gradient
- We will choose the models p_θ to be deep neural networks, which work in the regime of high expressiveness and efficient computation (assuming specialized hardware)
- How exactly do we design these networks?
- Any setting of θ must define a valid probability distribution over x: p_θ(x) ≥ 0 and Σ_x p_θ(x) = 1
Bayes nets and neural nets
Main idea: place a Bayes net structure (a directed acyclic graph) over the variables in the data, and model the conditional distributions with neural networks.
This reduces the problem to designing conditional likelihood-based models for single variables. We know how to do this: the neural net takes the variables being conditioned on as input, and outputs the distribution for the variable being predicted.
Autoregressive models
- First, given a Bayes net structure, setting the conditional distributions to neural networks will yield a tractable log likelihood and gradient. Great for maximum likelihood training!
- But is it expressive enough? Yes, assuming a fully expressive Bayes net structure: any joint distribution can be written as a product of conditionals
  p(x) = Π_i p(x_i | x_1, …, x_{i-1})
- This is called an autoregressive model. So, an expressive Bayes net structure with neural network conditional distributions yields an expressive model for p(x) with tractable maximum likelihood training.
A toy autoregressive model
Two variables: x1, x2
Model: p(x1, x2) = p(x1) p(x2|x1)
- p(x1) is a histogram
- p(x2|x1) is a multilayer perceptron
- Input is x1
- Output is a distribution over x2 (logits, followed by softmax)
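A sketch of this toy model in PyTorch (sizes, k, and the training step are illustrative; assume x1 and x2 each take values in {0, …, k-1}):

import torch
import torch.nn as nn
import torch.nn.functional as F

k = 10  # number of discrete values per variable

logits_x1 = nn.Parameter(torch.zeros(k))  # p(x1): a histogram via learnable logits
mlp = nn.Sequential(nn.Linear(k, 64), nn.ReLU(), nn.Linear(64, k))  # p(x2 | x1)

def log_prob(x1, x2):
    # log p(x1, x2) = log p(x1) + log p(x2 | x1), for batches of integer indices
    log_p_x1 = F.log_softmax(logits_x1, dim=0)[x1]
    logits_x2 = mlp(F.one_hot(x1, k).float())  # MLP input is a one-hot x1
    log_p_x2 = F.log_softmax(logits_x2, dim=1).gather(1, x2[:, None]).squeeze(1)
    return log_p_x1 + log_p_x2

opt = torch.optim.Adam([logits_x1, *mlp.parameters()], lr=1e-3)
x1 = torch.randint(0, k, (128,)); x2 = torch.randint(0, k, (128,))  # toy batch
opt.zero_grad()
loss = -log_prob(x1, x2).mean()  # maximum likelihood = minimize mean NLL
loss.backward()
opt.step()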
One function approximator per conditional
Does this extend to high dimensions?
- Somewhat. For d-dimensional data, O(d) parameters
  - Much better than O(exp(d)) in the tabular case
  - But what about text generation, where d can be arbitrarily large?
- Limited generalization
  - No information sharing among different conditionals
Solution: share parameters among conditional distributions. Two approaches:
- Recurrent neural networks
- Masking
RNN autoregressive models - char-rnn
[Figure: an RNN reads a sequence of characters and, at each position, outputs a distribution over the character at the next position.]
[Karpathy, 2015]
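A minimal char-rnn-style sketch in PyTorch (vocabulary and layer sizes are illustrative, not Karpathy's exact architecture):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CharRNN(nn.Module):
    def __init__(self, vocab_size=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x):        # x: (batch, seq) of character ids
        h, _ = self.rnn(self.embed(x))
        return self.head(h)      # logits over the character at each next position

model = CharRNN()
x = torch.randint(0, 128, (8, 32))  # toy batch of character ids
logits = model(x)
# train by predicting x[:, 1:] from logits[:, :-1] with cross-entropy
loss = F.cross_entropy(logits[:, :-1].reshape(-1, 128), x[:, 1:].reshape(-1))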
MNIST
■ Handwritten digits
■ 28x28
■ 60,000 train
■ 10,000 test
■ Original: greyscale
■ “Binarized MNIST” -- 0/1 (black/white)
RNN on MNIST
RNN with Pixel Location Appended on MNIST
■ Append (x,y) coordinates of pixel in the image as input to RNN
Masking-based autoregressive models
Second major branch of neural AR models
■ Key property: parallelized computation of all conditionals
■ Masked MLP (MADE)
■ Masked convolutions & self-attention
■ Also share parameters across time
Masked Autoencoder for Distribution Estimation (MADE)
Masked Autoencoder for Distribution Estimation (MADE)
General principle: assign each unit a position in the variable ordering, and mask out any weight that would let an output see its own input position or a later one
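A NumPy sketch of the mask construction from the MADE paper, for a single hidden layer (variable names are ours; degree assignments follow the paper):

import numpy as np

rng = np.random.default_rng(0)
D, H = 5, 16  # input dimension, number of hidden units

# degrees: inputs get 1..D (the ordering); hidden units get degrees in {1, ..., D-1}
m_in = np.arange(1, D + 1)
m_hid = rng.integers(1, D, size=H)

# hidden mask: hidden unit k may see input d iff m_hid[k] >= m_in[d]
M1 = (m_hid[:, None] >= m_in[None, :]).astype(np.float32)   # shape (H, D)
# output mask: output d may see hidden unit k iff m_in[d] > m_hid[k]  (strict)
M2 = (m_in[:, None] > m_hid[None, :]).astype(np.float32)    # shape (D, H)

# apply as W1 * M1 and W2 * M2: output d then depends only on inputs < d,
# so one forward pass computes all conditionals p(x_d | x_<d) in parallel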
MADE on MNIST
MADE results
MADE -- Different Orderings
All orderings achieve roughly the same bits per dim, but samples are different
[Sample grids for different orderings: random permutation; even then odd indices; rows (raster scan); columns; top-to-middle then bottom-to-middle.]
MADE: Multiple Orderings
Masked Temporal (1D) Convolution
[Figure: a stack of masked 1D convolutions maps inputs x_i to p(x_{i+1} | x_{<=i}) at every position in parallel.]
However:
● Limited receptive field, linear in the number of layers
WaveNet
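WaveNet's key architectural idea is stacking causal convolutions with exponentially increasing dilation, so the receptive field grows exponentially with depth rather than linearly. A PyTorch sketch (channel counts are illustrative; the paper's gated activations and skip connections are omitted):

import torch
import torch.nn as nn

class CausalDilatedConv(nn.Module):
    # 1D convolution that only looks at past samples, with a given dilation
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left-pad so output i sees only inputs <= i
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

# dilations 1, 2, 4, ..., 512: receptive field ~1024 samples with only 10 layers
layers = nn.Sequential(*[CausalDilatedConv(64, dilation=2**i) for i in range(10)])
y = layers(torch.randn(1, 64, 1000))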
WaveNet on MNIST
WaveNet with Pixel Location Appended on MNIST
■ Append (x,y) coordinates of pixel in the image as input to WaveNet
Masked Temporal (1D) Convolution
# More efficient implementation possible by
# padding instead of masking kernels
# k: size of the temporal kernel
# kernel: convolution weights of shape (k, 1, c_in, c_out)
# x: input of shape (batch, time, 1, c_in)  (NHWC, with time on the height axis)
import tensorflow as tf

padded_x = tf.pad(x, [
    (0, 0), (k - 1, 0),  # left-pad the time axis so outputs never see the future
    (0, 0), (0, 0)
])
y = tf.nn.conv2d(padded_x, kernel, strides=[1, 1, 1, 1], padding='VALID')
Masked Spatial (2D) Convolution - PixelCNN
■ Images can be flattened into 1D vectors, but they are fundamentally 2D
■ We can use a masked variant of a ConvNet to exploit this knowledge
■ First, we impose an autoregressive ordering on 2D images (e.g., the raster-scan ordering: row by row, left to right)
PixelCNN
■ Design question: how do we design a masking method that obeys that ordering?
■ One possibility: PixelCNN (2016)
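A NumPy sketch of the kernel masks from the PixelCNN paper, under the raster-scan ordering (type A, used in the first layer, hides the center pixel; type B, used in later layers, keeps it):

import numpy as np

def pixelcnn_mask(kernel_size, mask_type):
    # mask for a (kernel_size x kernel_size) conv kernel under raster-scan order
    m = np.zeros((kernel_size, kernel_size), dtype=np.float32)
    c = kernel_size // 2
    m[:c, :] = 1.0          # all rows above the center pixel
    m[c, :c] = 1.0          # pixels to the left in the center row
    if mask_type == 'B':
        m[c, c] = 1.0       # type B: the center pixel itself is visible
    return m

# multiply the conv weights elementwise by the mask before every forward pass
print(pixelcnn_mask(5, 'A'))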
Softmax Sampling
[Figure: each pixel is sampled from a softmax distribution over the 256 intensity values 0–255.]
Gated PixelCNN
■ Gated PixelCNN (2016) fixed the blind spot of stacked masked convolutions by combining two streams of convolutions
How?
Gated PixelCNN
■ Vertical stack: through padding, activations at the ith row only depend on input before the ith row
Gated PixelCNN
■ Improved ConvNet architecture: Gated ResNet Block
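The gated activation unit used inside these blocks is y = tanh(a) ⊙ σ(b), where a and b are the two halves of a masked convolution's output. A PyTorch sketch (the optional bias argument anticipates the class-conditional variant discussed later; names are ours):

import torch

def gated_activation(features, cond_bias=None):
    # features: (batch, 2 * channels, H, W), the output of a masked convolution
    a, b = features.chunk(2, dim=1)
    if cond_bias is not None:               # conditional PixelCNN adds a
        ca, cb = cond_bias.chunk(2, dim=1)  # class-dependent bias inside the gate
        a, b = a + ca, b + cb
    return torch.tanh(a) * torch.sigmoid(b)

y = gated_activation(torch.randn(1, 128, 28, 28))  # -> (1, 64, 28, 28)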
Gated PixelCNN
■ Better receptive field + more expressive architecture = better performance
PixelCNN++
■ Moving away from softmax: we know nearby pixel values are likely to co-occur!
Recap: Logistic distribution
The logistic distribution has cdf F(x) = sigmoid((x - mu) / scale); the pdf is its derivative.
Mixture of Logistics -- Discrete Distribution
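A NumPy sketch of the discretized logistic likelihood from PixelCNN++: the probability of an integer pixel value x is the logistic CDF mass on the bin [x - 0.5, x + 0.5], with the edge bins at 0 and 255 extended to ±∞ (the parameters below are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discretized_logistic_pmf(x, mu, scale):
    # P(pixel = x) for integer x in {0, ..., 255} under one logistic component
    cdf_plus = np.where(x == 255, 1.0, sigmoid((x + 0.5 - mu) / scale))
    cdf_minus = np.where(x == 0, 0.0, sigmoid((x - 0.5 - mu) / scale))
    return cdf_plus - cdf_minus

# a mixture: weighted sum over components (weights sum to 1)
x = np.arange(256)
pmf = 0.7 * discretized_logistic_pmf(x, 100.0, 10.0) \
    + 0.3 * discretized_logistic_pmf(x, 200.0, 5.0)
assert np.isclose(pmf.sum(), 1.0)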
Ex. Training Mixture of Logistics
PixelCNN++
■ Capture long-range dependencies efficiently by downsampling
Masked Attention
■ A recurring problem for convolution: limited receptive field -> hard to capture long-range dependencies
■ (Self-)Attention: an alternative that has
  ■ unlimited receptive field!!
  ■ also O(1) parameter scaling w.r.t. data dimension
  ■ parallelized computation (versus RNN)
Attention
Self-Attention
[Figure: receptive field of convolution (local neighborhood) versus self-attention (all previous positions).]
Masked Attention
Masking trick: subtract a large constant from the attention logits of invalid positions, score(q, k_i) - (1 - mask_i) * 10^10, so the softmax assigns them probability ≈ 0.
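A sketch of causal masked self-attention in NumPy (single head, no learned projections; just the masking mechanics):

import numpy as np

def causal_self_attention(q, k, v):
    # q, k, v: (seq, dim); position i attends only to positions <= i
    seq, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)                  # (seq, seq) attention logits
    mask = np.tril(np.ones((seq, seq)))              # 1 where attention is allowed
    scores = scores - (1.0 - mask) * 1e10            # push masked logits toward -inf
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ v

x = np.random.randn(16, 32)
y = causal_self_attention(x, x, x)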
Masked Attention
■ Much more flexible than masked convolution: we can design any autoregressive ordering we want
■ An example: a zigzag ordering
  - How to implement it with a masked conv?
  - Trivial to do with masked attention!
Masked Attention + Convolution
Multi-Head Self-Attention on MNIST
Sample Quality
AR models can have good samples
■ Good samples can be achieved by conditioning on a carefully selected subset of bits
■ Grayscale PixelCNN
■ Subscale Pixel Network
Class-Conditional PixelCNN
How to condition? In Gated PixelCNN, a class embedding adds a class-dependent bias inside each gated activation (cf. the gated-activation sketch above).
Hierarchical Autoregressive Models with Auxiliary Decoders
De Fauw, Jeffrey, Sander Dieleman, and Karen Simonyan. "Hierarchical autoregressive image models with auxiliary decoders." arXiv preprint arXiv:1903.04933 (2019).
Image Super-Resolution with PixelCNN
■ A PixelCNN is conditioned on 7x7 subsampled MNIST images to generate the corresponding 28x28 images
Pixel Recursive Super Resolution
Hierarchy: Grayscale PixelCNN
■ Design an autoregressive model architecture that takes advantage of the structure of the data
■ Learn a PixelCNN on binary images, and a PixelCNN conditioned on binary images to generate colored images
PixelCNN Models with Auxiliary Variables for Natural Image Modeling
Neural autoregressive models: the good
Best-in-class modeling performance:
■ expressivity -- the autoregressive factorization is fully general
Masked autoregressive models: the bad
● Sampling each pixel = 1 forward pass!
● 11 minutes to generate 16 32-by-32 images on a Tesla K40 GPU
Speedup by caching activations
https://github.com/PrajitR/fast-pixel-cnn
Speedup by breaking autoregressive pattern
■ O(d) -> O(log d) sampling steps by generating groups of pixels in parallel
■ Cannot capture dependencies within each group: this is a fine assumption if all pixels in a group are conditionally independent given the previous groups
■ Most often they are not; you then trade expressivity for sampling speed
Multiscale PixelCNN
Scaling Autoregressive Video Models
[Dirk Weissenborn, Oscar Tackstrom, Jakob Uszkoreit. “Scaling Autoregressive Video Models.” arXiv 1906.02634 (2019)]
Scaling Autoregressive Video Models -- BAIR Robot Pushing
[Videos: large versus small spatiotemporal subscaling.]
[Dirk Weissenborn, Oscar Tackstrom, Jakob Uszkoreit. “Scaling Autoregressive Video Models.” arXiv 1906.02634 (2019)]
Scaling Autoregressive Video Models -- Kinetics
[Samples: cooking videos and full Kinetics, each panel ordered left-to-right by likelihood.]
[Dirk Weissenborn, Oscar Tackstrom, Jakob Uszkoreit. “Scaling Autoregressive Video Models.” arXiv 1906.02634 (2019)]
Natural Image Manipulation for Autoregressive Models using Fisher Scores
■ Main challenge: how to get a latent representation from a PixelCNN?
■ Why is this hard? Randomness enters on a per-pixel basis during sampling, so there is no single latent vector to work with
■ Proposed solution: use the Fisher score, ∇_θ log p_θ(x)
Natural Image Manipulation for Autoregressive Models using Fisher Scores
[Wilson Yan, Jonathan Ho, Pieter Abbeel. "Natural Image Manipulation for Autoregressive Models using Fisher Scores." arXiv 1912.05015]
Bibliography
char-rnn: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
MADE: Germain, Mathieu, et al. "MADE: Masked autoencoder for distribution estimation." International Conference on Machine Learning. 2015.
WaveNet: Oord, Aaron van den, et al. "WaveNet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).
PixelCNN: Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759 (2016).
Gated PixelCNN: Van den Oord, Aaron, et al. "Conditional image generation with PixelCNN decoders." Advances in Neural Information Processing Systems. 2016.
PixelCNN++: Salimans, Tim, et al. "PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications." arXiv preprint arXiv:1701.05517 (2017).
Self-attention: Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
PixelSNAIL: Chen, Xi, et al. "PixelSNAIL: An improved autoregressive generative model." arXiv preprint arXiv:1712.09763 (2017).
Fast PixelCNN++: Ramachandran, Prajit, et al. "Fast generation for convolutional autoregressive models." arXiv preprint arXiv:1704.06001 (2017).
Multiscale PixelCNN: Reed, Scott, et al. "Parallel multiscale autoregressive density estimation." Proceedings of the 34th International Conference on Machine Learning. JMLR.org, 2017.
Grayscale PixelCNN: Kolesnikov, Alexander, and Christoph H. Lampert. "PixelCNN models with auxiliary variables for natural image modeling." Proceedings of the 34th International Conference on Machine Learning. JMLR.org, 2017.
Subscale Pixel Network: Menick, Jacob, and Nal Kalchbrenner. "Generating high fidelity images with subscale pixel networks and multidimensional upscaling." arXiv preprint arXiv:1812.01608 (2018).
Scaling Autoregressive Video Models: Weissenborn, Dirk, Oscar Tackstrom, and Jakob Uszkoreit. "Scaling autoregressive video models." arXiv preprint arXiv:1906.02634 (2019).
Sparse Attention: Child, Rewon, Scott Gray, Alec Radford, and Ilya Sutskever. "Generating long sequences with sparse transformers." arXiv preprint arXiv:1904.10509 (2019).
Fisher Scores: Yan, Wilson, Jonathan Ho, and Pieter Abbeel. "Natural image manipulation for autoregressive models using Fisher scores." arXiv preprint arXiv:1912.05015 (2019).
PixelCNN Super Resolution: Dahl, Ryan, Mohammad Norouzi, and Jonathon Shlens. "Pixel recursive super resolution." Proceedings of the IEEE International Conference on Computer Vision. 2017.
Colab