Lecture 2 Autoregressive Models

This document outlines a lecture on autoregressive models for deep unsupervised learning. It discusses the limitations of simple generative models like histograms in high dimensions and introduces parameterized distributions and maximum likelihood training as solutions. Autoregressive models are presented as modern neural network approaches that decompose complex joint distributions into conditional distributions. Recurrent neural networks and masking-based models are given as examples of autoregressive models.


CS294-158 Deep Unsupervised Learning

Lecture 2 Likelihood Models: Autoregressive Models

Pieter Abbeel, Xi (Peter) Chen, Jonathan Ho, Aravind Srinivas, Alex Li, Wilson Yan
UC Berkeley
Outline
- Motivation
- Simple generative models: histograms
- Modern neural autoregressive models
- Parameterized distributions and maximum likelihood
- Autoregressive Models
- Recurrent Neural Nets
- Masking-based Models

Likelihood-based models
Problems we’d like to solve:
- Generating data: synthesizing images, videos, speech, text
- Compressing data: constructing efficient codes
- Anomaly detection
Likelihood-based models: estimate pdata from samples x(1), …, x(n) ~ pdata(x)
The goal is to learn a distribution p that allows:
- Computing p(x) for arbitrary x
- Sampling x ~ p(x)
Today: discrete data

Desiderata
We want to estimate distributions of complex, high-dimensional data
- A 128x128x3 image lies in a ~50,000-dimensional space
We also want computational and statistical efficiency
- Efficient training and model representation
- Expressiveness and generalization
- Sampling quality and speed
- Compression rate and speed

Learning: Estimate frequencies by counting
Recall: the goal is to estimate pdata from samples
x(1), …, x(n) ~ pdata(x)
Suppose the samples take on values in a finite set
{1, …, k}
The model: a histogram
- (Redundantly) described by k nonnegative
numbers: p1, …, pk
- To train this model: count frequencies
pi = (# times i appears in the dataset) /
(# points in the dataset)

Inference and Sampling
Inference (querying pi for arbitrary i): simply a lookup into the array p1, …, pk

Sampling (lookup into the inverse cumulative distribution function)


1. From the model probabilities p1, …, pk, compute the cumulative
distribution
Fi = p1 + ⋯ + pi for all i ∈ {1, …, k}
2. Draw a uniform random number u ~ [0, 1]
3. Return the smallest i such that u ≤ Fi
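
A minimal sketch of this counting-and-sampling recipe (my own illustration in NumPy, not from the slides; the function names are mine):

import numpy as np

def fit_histogram(data, k):
    # data: array of integers in {0, ..., k-1}; counting frequencies is the MLE
    counts = np.bincount(data, minlength=k)
    return counts / counts.sum()                 # p_1, ..., p_k

def sample(p, num_samples, rng=np.random.default_rng()):
    F = np.cumsum(p)                             # F_i = p_1 + ... + p_i
    u = rng.uniform(size=num_samples)            # u ~ Uniform[0, 1]
    return np.searchsorted(F, u)                 # smallest i with u <= F_i

p = fit_histogram(np.array([0, 1, 1, 3, 2, 1]), k=4)
print(p, sample(p, 5))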

Are we done?

Failure in high dimensions
No, because of the curse of dimensionality. Counting fails when
there are too many bins.
- (Binary) MNIST: 28x28 images, each pixel in {0, 1}
- There are 2^784 ≈ 10^236 probabilities to estimate
- Any reasonable training set covers only a tiny fraction of this
- Each image influences only one parameter. No generalization
whatsoever!

Problematic even for single variable

learned histogram = training data distribution

→ often poor generalization

Parameterized Distributions

Fitting a parameterized distribution often generalizes better

Status
- Issues with histograms
- High dimensions: won’t work
- Even 1-d: if many values in the domain, prone to overfitting

- Solution: function approximation. Instead of storing each probability, store a parameterized function pθ(x)

Likelihood-based generative models
Recall: the goal is to estimate pdata from x(1), …, x(n) ~ pdata(x)

Now we introduce function approximation: learn θ so that pθ(x) ≈ pdata(x).


- How do we design function approximators to effectively represent
complex joint distributions over x, yet remain easy to train?
- There will be many choices for model design, each with different tradeoffs
and different compatibility criteria.

Designing the model and the training procedure go hand-in-hand.

Fitting distributions
- Given data x(1), …, x(n) sampled from a “true” distribution pdata
- Set up a model class: a set of parameterized distributions pθ
- Pose a search problem over parameters: arg minθ loss(θ, x(1), …, x(n))

- Want the loss function + search procedure to:


- Work with large datasets (n is large, say millions of training examples)
- Yield θ such that pθ matches pdata — i.e. the training algorithm works. Think of
the loss as a distance between distributions.
- Note that the training procedure can only see the empirical data distribution,
not the true data distribution: we want the model to generalize.

Maximum likelihood
- Maximum likelihood: given a dataset x(1), …, x(n), find θ by solving the optimization
problem
      arg minθ (1/n) Σi −log pθ(x(i))

- Statistics tells us that if the model family is expressive enough and enough data
is given, then solving the maximum likelihood problem will recover the parameters
that generated the data
- Equivalent to minimizing KL divergence between the empirical data distribution
and the model
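
A one-line check of that equivalence (standard derivation, not spelled out on the slide): writing p̂data for the empirical data distribution,

      KL(p̂data ‖ pθ) = Ex~p̂data[ log p̂data(x) − log pθ(x) ] = −H(p̂data) − Ex~p̂data[ log pθ(x) ]

and the entropy term does not depend on θ, so minimizing the KL divergence is the same as maximizing the expected log-likelihood.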

Stochastic gradient descent
Maximum likelihood is an optimization problem. How do we solve it?
Stochastic gradient descent (SGD).
- SGD minimizes expectations: for f a differentiable function of θ, it solves
      minθ Ex[ f(θ, x) ]

- With maximum likelihood, the optimization problem is
      minθ Ex~pdata[ −log pθ(x) ]

- Why maximum likelihood + SGD? It works with large datasets and is compatible
with neural networks.
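
A toy sketch of what maximum likelihood + SGD looks like for the simplest parameterized model, a categorical distribution over k values with logits θ (my own example in NumPy; the target probabilities are made up, and the NLL gradient softmax(θ) − onehot(x) is written out by hand rather than computed by autodiff):

import numpy as np

rng = np.random.default_rng(0)
k, lr = 4, 0.1
theta = np.zeros(k)                                  # logits parameterizing p_theta
data = rng.choice(k, size=10000, p=[0.1, 0.5, 0.2, 0.2])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(1000):
    batch = rng.choice(data, size=64)                # minibatch from the dataset
    p = softmax(theta)
    onehot_mean = np.bincount(batch, minlength=k) / len(batch)
    grad = p - onehot_mean                           # gradient of the average NLL w.r.t. theta
    theta -= lr * grad                               # SGD step
print(softmax(theta))                                # approaches [0.1, 0.5, 0.2, 0.2]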

Designing the model
- Key requirement for maximum likelihood + SGD: efficiently compute log p(x) and
its gradient
- We will choose models pθ to be deep neural networks, which work in the regime
of high expressiveness and efficient computation (assuming specialized hardware)
- How exactly do we design these networks?
- Any setting of θ must define a valid probability distribution over x: pθ(x) ≥ 0 and Σx pθ(x) = 1

- log pθ(x) should be easy to evaluate and differentiate with respect to θ


- This can be tricky to set up!

Bayes nets and neural nets
Main idea: place a Bayes net structure (a directed acyclic graph) over the variables in
the data, and model the conditional distributions with neural networks.
Reduces the problem to designing conditional likelihood-based models for single
variables. We know how to do this: the neural net takes variables being conditioned on
as input, and outputs the distribution for the variable being predicted.

Autoregressive models
- First, given a Bayes net structure, setting the conditional distributions to neural
networks will yield a tractable log likelihood and gradient. Great for maximum
likelihood training!

- But is it expressive enough? Yes, assuming a fully expressive Bayes net structure:
any joint distribution can be written as a product of conditionals
      p(x) = ∏i p(xi | x1, …, xi−1)

- This is called an autoregressive model. So, an expressive Bayes net structure with
neural network conditional distributions yields an expressive model for p(x) with
tractable maximum likelihood training.

A toy autoregressive model
Two variables: x1, x2
Model: p(x1, x2) = p(x1) p(x2|x1)
- p(x1) is a histogram
- p(x2|x1) is a multilayer perceptron
- Input is x1
- Output is a distribution over x2 (logits, followed by softmax)
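
A forward-pass sketch of this two-variable model (my own illustration in NumPy; the weights are random here and would be fit by maximum likelihood + SGD as above):

import numpy as np
rng = np.random.default_rng(0)
K, H = 4, 16                                     # values per variable, hidden units

p_x1 = np.full(K, 1.0 / K)                       # p(x1): a histogram (uniform to start)
W1, b1 = rng.normal(size=(K, H)), np.zeros(H)    # MLP parameters for p(x2 | x1)
W2, b2 = rng.normal(size=(H, K)), np.zeros(K)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def p_x2_given_x1(x1):
    h = np.tanh(np.eye(K)[x1] @ W1 + b1)         # one-hot input -> hidden
    return softmax(h @ W2 + b2)                  # logits -> distribution over x2

def log_p(x1, x2):
    return np.log(p_x1[x1]) + np.log(p_x2_given_x1(x1)[x2])

def sample():
    x1 = rng.choice(K, p=p_x1)
    x2 = rng.choice(K, p=p_x2_given_x1(x1))
    return x1, x2

print(log_p(2, 1), sample())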

One function approximator per conditional
Does this extend to high dimensions?
- Somewhat. For d-dimensional data, O(d) parameters
- Much better than O(exp(d)) in tabular case
- What about text generation where d can be arbitrarily large?
- Limited generalization
- No information sharing among different conditionals
Solution: share parameters among conditional distributions. Two
approaches:
- Recurrent neural networks

- Masking

RNN autoregressive models - char-rnn

[Figure: the input is the sequence of characters seen so far; the output is a distribution over the character at the ith position]

[Karpathy, 2015]
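
A bare-bones sketch of the idea (not Karpathy's code; a single-layer RNN forward pass in NumPy with made-up sizes) that maps a character prefix to a distribution over the next character:

import numpy as np
rng = np.random.default_rng(0)
V, H = 65, 128                                   # vocabulary size, hidden size

Wxh = rng.normal(0, 0.01, (V, H))                # input-to-hidden
Whh = rng.normal(0, 0.01, (H, H))                # hidden-to-hidden, shared across time steps
Why = rng.normal(0, 0.01, (H, V))                # hidden-to-output logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_char_distribution(prefix_ids):
    # p(x_{i+1} | x_1, ..., x_i): run the RNN over the prefix, read out the logits
    h = np.zeros(H)
    for c in prefix_ids:
        h = np.tanh(np.eye(V)[c] @ Wxh + h @ Whh)
    return softmax(h @ Why)

print(next_char_distribution([4, 8, 15, 16]).shape)   # (65,)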

MNIST
■ Handwritten digits
■ 28x28
■ 60,000 train
■ 10,000 test

■ Original: greyscale
■ “Binarized MNIST” -- 0/1 (black/white)

RNN on MNIST

RNN with Pixel Location Appended on MNIST
■ Append (x,y) coordinates of pixel in the image as input to RNN

Outline
- Motivation
- Simple generative models: histograms
- Modern neural autoregressive models
- Parameterized distributions and maximum likelihood
- Autoregressive Models
- Recurrent Neural Nets
- Masking-based Models
  - MADE
  - Masked Convolutions
  - WaveNet
  - PixelCNN (+variations)

Masking-based autoregressive models
Second major branch of neural AR models
■ Key property: parallelized computation of all conditionals
■ Masked MLP (MADE)
■ Masked convolutions & self-attention
■ Also share parameters across time

Masked Autoencoder for Distribution Estimation (MADE)

Masked Autoencoder for Distribution Estimation (MADE)
General principle

MADE on MNIST

Masked Autoencoder for Distribution Estimation (MADE)

# param: an ordinary fully connected weight matrix, shape [in_size, out_size]
# x: layer input, shape [batch, in_size]
# y: autoregressive pre-activations
mask = get_linear_ar_mask(in_size, out_size)
# creates a strictly upper-triangular 0/1 mask, e.g. for 3x3:
# array([[0., 1., 1.],
#        [0., 0., 1.],
#        [0., 0., 0.]], dtype=float32)
# so output column j only receives input from dimensions i < j
y = tf.matmul(x, param * mask)
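
For completeness, one possible implementation of the masking helper used above (my own sketch assuming NumPy and TensorFlow; the course's actual helper may differ, and MADE's hidden-layer masks additionally use per-unit degree assignments):

import numpy as np
import tensorflow as tf

def get_linear_ar_mask(in_size, out_size):
    # Entry (i, j) is 1 iff j > i, so output j can only see inputs i < j.
    mask = np.triu(np.ones((in_size, out_size), dtype=np.float32), k=1)
    return tf.constant(mask)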

MADE results

MADE -- Different Orderings
All orderings achieve roughly the same bits per dim, but samples are different

[Samples under different orderings: random permutation; even then odd indices; rows (raster scan); columns; top to middle, then bottom to middle]

MADE: Multiple Orderings

Masked Temporal (1D) Convolution
p(xi+1| x<=i)

● Easy to implement, masking part of the conv kernel
● Constant parameter count for
variable-length distribution!
● Efficient to compute, convolution has
hyper-optimized implementations on
all hardware

However
● Limited receptive field, linear in
number of layers


WaveNet

● Improved receptive field: dilated convolution, with exponential dilation
● Better expressivity: Gated Residual blocks, Skip connections
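
A rough sketch of the dilation pattern (my own illustration, assuming a recent TensorFlow with tf.nn.conv1d and left padding; gating, residual and skip connections are omitted):

import tensorflow as tf

def causal_dilated_conv(x, kernel, dilation):
    # x: [batch, time, channels]; kernel: [width, in_channels, out_channels]
    k = kernel.shape[0]
    pad = (k - 1) * dilation                     # left-pad so output t only sees inputs <= t
    x = tf.pad(x, [(0, 0), (pad, 0), (0, 0)])
    return tf.nn.conv1d(x, kernel, stride=1, padding='VALID', dilations=dilation)

x = tf.random.normal([1, 32, 8])
for d in [1, 2, 4, 8]:                           # receptive field grows exponentially with depth
    x = tf.nn.relu(causal_dilated_conv(x, tf.random.normal([2, 8, 8]), d))
print(x.shape)                                   # (1, 32, 8)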

WaveNet on MNIST

WaveNet with Pixel Location Appended on MNIST
■ Append (x,y) coordinates of pixel in the image as input to
WaveNet

Masked Temporal (1D) Convolution
# More efficient implementation possible by padding instead of masking kernels
# k: size of kernel along the time dimension
# kernel: convolution weights
padded_x = tf.pad(x, [
    (0, 0), (k - 1, 0),   # left-pad time with k-1 zeros -> causal ("masked") convolution
    (0, 0), (0, 0)
])
y = tf.nn.conv2d(padded_x, kernel, strides=1, padding='VALID')

Masked Spatial (2D) Convolution - PixelCNN
■ Images can be flattened into 1D vectors, but they are fundamentally 2D
■ We can use a masked variant of ConvNet to exploit this knowledge
■ First, we impose an autoregressive ordering on 2D images:

This is called raster scan ordering. (Different orderings are possible; more on this later.)

PixelCNN
■ Design question: how to design a masking method to obey
that ordering?
■ One possibility: PixelCNN (2016)
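
To make the idea concrete, here is one way to build such a mask for a 2D kernel (my own NumPy sketch following the raster-scan rule; PixelCNN distinguishes 'type A' masks, which exclude the current pixel, from 'type B' masks, which include it):

import numpy as np

def pixelcnn_mask(kh, kw, mask_type='A'):
    # kh x kw kernel; 1 = connection allowed, 0 = masked out
    mask = np.ones((kh, kw), dtype=np.float32)
    ch, cw = kh // 2, kw // 2                    # kernel center = current pixel
    mask[ch, cw + (1 if mask_type == 'B' else 0):] = 0.0   # pixels to the right (and center, for type A)
    mask[ch + 1:, :] = 0.0                       # all rows below the current one
    return mask

print(pixelcnn_mask(3, 3, 'A'))
# [[1. 1. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]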

Softmax Sampling

[Animation, slides by Aaron van den Oord: the network outputs a softmax distribution over the 256 pixel values (0 to 255) for the current pixel; a value is sampled from that distribution, fed back into the model, and the process repeats for the next pixel.]

PixelCNN
■ PixelCNN-style masking has one problem: blind spot in
receptive field

Gated PixelCNN
■ Gated PixelCNN (2016) introduced a fix by combining two
streams of convolutions
How?

This is easy, we know how to do 1D masked conv

Gated PixelCNN
■ Vertical stack: through padding, activations at ith row only
depend on input before ith row

Gated PixelCNN
■ Improved ConvNet architecture: Gated ResNet Block

Gated PixelCNN
■ Better receptive field + more expressive architecture = better
performance

PixelCNN++
■ Moving away from the 256-way softmax: nearby pixel values (e.g. 127 and 128) should get
similar probability, but a softmax treats them as unrelated categories

Recap: Logistic distribution

[pdf and cdf of the logistic distribution; cdf(x) = sigmoid((x - mu) / scale)]

Mixture of Logistics -- Discrete Distribution
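
For reference, the discretized mixture-of-logistics likelihood used by PixelCNN++ assigns each integer pixel value the probability mass of a unit-length bin under a mixture of K logistics (standard form from the paper; the edge values 0 and 255 absorb the remaining tails):

      P(x | π, μ, s) = Σi πi [ σ((x + 0.5 − μi) / si) − σ((x − 0.5 − μi) / si) ],   where σ is the sigmoid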

Ex. Training Mixture of Logistics

PixelCNN++
■ Capture long dependencies efficiently by downsampling

PixelCNN++

Masked Attention
■ A recurring problem for convolution: limited receptive field ->
hard to capture long-range dependencies
■ (Self-)Attention: an alternative that has
■ unlimited receptive field!!
■ also O(1) parameter scaling w.r.t. data dimension
■ parallelized computation (versus RNN)

Attention

Self-attention when qi also generated from x

Self-Attention

[Figure: convolution vs. self-attention]

Masked Attention

Attention logits are computed as usual, then a large constant is subtracted from positions that must be hidden: score(q, ki) − masked(ki, q) · 10^10, so masked positions receive essentially zero weight after the softmax.
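
A minimal sketch of causal (masked) self-attention over a 1D sequence using exactly this trick (my own NumPy illustration; a single head, with no output projection):

import numpy as np
rng = np.random.default_rng(0)
T, D = 6, 8                                      # sequence length, model dimension
x = rng.normal(size=(T, D))

Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

scores = Q @ K.T / np.sqrt(D)                    # [T, T] attention logits
causal = np.tril(np.ones((T, T)))                # 1 where attending is allowed (j <= i)
scores = scores - (1.0 - causal) * 1e10          # push future positions toward -infinity

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
out = weights @ V                                # position i only mixes values from j <= i
print(out.shape)                                 # (6, 8)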

Masked Attention
■ Much more flexible than masked convolution. We can design
any autoregressive ordering we want
■ An example:

Zigzag ordering
- How to implement with masked
conv?
- Trivial to do with masked attention!

Masked Attention + Convolution

Masked Attention + Convolution

Gated PixelCNN PixelCNN++ PixelSNAIL

Multi-Head Self-Attention on MNIST

Masked Attention + Convolution

Sample Quality

Which set of samples are generated by a GAN versus an AR model?

AR models can have good samples
■ Good samples can be achieved by conditioning generation on a selectively chosen subset of bits
■ Grayscale PixelCNN
■ Subscale Pixel Network

Class-Conditional PixelCNN

How to condition?

IN: a one-hot encoding of the class label

THEN: in each convolutional layer it is multiplied by a different learned weight
matrix and added as a channel-wise bias, broadcast spatially over all positions
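
Schematically, the conditioning looks like this (my own sketch assuming a TF-style conv layer; V is the learned class-embedding matrix for one layer):

import tensorflow as tf

def conditional_conv(x, kernel, V, class_onehot):
    # x: [batch, H, W, C_in], kernel: [kh, kw, C_in, C_out]
    # V: [num_classes, C_out], class_onehot: [batch, num_classes]
    y = tf.nn.conv2d(x, kernel, strides=1, padding='SAME')
    bias = class_onehot @ V                      # one bias vector per example, [batch, C_out]
    return y + bias[:, None, None, :]            # broadcast spatially over H and W

x = tf.random.normal([2, 8, 8, 16])
kernel = tf.random.normal([3, 3, 16, 32])
V = tf.random.normal([10, 32])
labels = tf.one_hot([3, 7], depth=10)
print(conditional_conv(x, kernel, V, labels).shape)   # (2, 8, 8, 32)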

Hierarchical Autoregressive Models with Auxiliary Decoders

De Fauw, Jeffrey, Sander Dieleman, and Karen Simonyan. "Hierarchical autoregressive image models with auxiliary decoders." arXiv
preprint arXiv:1903.04933 (2019).
Image Super-Resolution with PixelCNN
■ A PixelCNN is conditioned on 7 x 7 subsampled MNIST images to generate the corresponding 28 x 28 image

Pixel Recursive Super Resolution

Hierarchy: Grayscale PixelCNN
■ Design an autoregressive model
architecture that takes
advantage of the structure of
data
■ Learn a PixelCNN on binary
images, and a PixelCNN
conditioned on binary images to
generate colored images

PixelCNN Models with Auxiliary Variables for Natural Image Modeling

Neural autoregressive models: the good
Best in class modelling performance:
■ expressivity - autoregressive factorization is general

■ generalization - meaningful parameter sharing has good inductive bias

-> State of the art models on multiple datasets, modalities

Masked autoregressive models: the bad
● Sampling each pixel = 1 forward pass!
● 11 minutes to generate 16 32-by-32 images on a Tesla K40 GPU
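
To see where the cost comes from, here is the shape of the sampling loop (schematic Python of my own; `model` stands in for any masked AR model that returns per-pixel softmax probabilities for the whole image):

import numpy as np

def sample_image(model, H=32, W=32):
    x = np.zeros((H, W), dtype=np.int64)
    for i in range(H):
        for j in range(W):
            probs = model(x)[i, j]               # one full forward pass per pixel
            x[i, j] = np.random.choice(256, p=probs)
    return x                                     # H*W = 1024 forward passes for a 32x32 image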

Speedup by caching activations
https://github.com/PrajitR/fast-pixel-cnn

Speedup by breaking autoregressive pattern
■ O(d) -> O(log(d)) sampling steps by parallelizing within groups {2, 3, 4}
■ Cannot capture dependencies within each group: this is fine if all pixels in a group
really are conditionally independent
■ Most often they are not; then you trade expressivity for sampling speed

Multiscale PixelCNN

Improved sampling speed

More limited modelling capacity

Scaling Autoregressive Video Models

[Dirk Weissenborn, Oscar Tackstrom, Jakob Uszkoreit. “Scaling Autoregressive Video Models.” arXiv 1906.02634 (2019)]

Scaling Autoregressive Video Models -- BAIR Robot Pushing
Large Spatiotemporal Subscaling Small Spatiotemporal Subscaling

[Dirk Weissenborn, Oscar Tackstrom, Jakob Uszkoreit. “Scaling Autoregressive Video Models.” arXiv 1906.02634 (2019)]

Scaling Autoregressive Video Models -- Kinetics
Cooking (left-to-right by likelihood) Full Kinetics (left-to-right by likelihood)

[Dirk Weissenborn, Oscar Tackstrom, Jakob Uszkoreit. “Scaling Autoregressive Video Models.” arXiv 1906.02634 (2019)]

Natural Image Manipulation for Autoregressive Models using Fisher Scores

■ Main challenge:
■ How to get a latent representation from PixelCNN?
■ Why hard? The randomness enters per pixel at sampling time, so there is no single latent vector to read off

■ Proposed solution
■ Use Fisher score

Note: applicable to any likelihood model
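
For reference, the Fisher score of a data point is the gradient of its log-likelihood with respect to the model parameters,

      s(x) = ∇θ log pθ(x),

which is defined for any likelihood-based model and is used here as a latent-like representation of x.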


[Wilson Yan, Jonathan Ho, Pieter Abbeel. "Natural Image Manipulation for Autoregressive Models using Fisher Scores." arXiv 1912.05015 (2019)]

Bibliography
char-rnn: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
MADE: Germain, Mathieu, et al. "Made: Masked autoencoder for distribution estimation." International Conference on Machine Learning. 2015.
WaveNet: Oord, Aaron van den, et al. "Wavenet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).
PixelCNN: Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759 (2016).
Gated PixelCNN: Van den Oord, Aaron, et al. "Conditional image generation with pixelcnn decoders." Advances in Neural Information Processing Systems. 2016.
PixelCNN++: Salimans, Tim, et al. "Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications." arXiv preprint arXiv:1701.05517 (2017)
Self-attention: Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
PixelSNAIL: Chen, Xi, et al. "Pixelsnail: An improved autoregressive generative model." arXiv preprint arXiv:1712.09763 (2017)
Fast PixelCNN++: Ramachandran, Prajit, et al. "Fast generation for convolutional autoregressive models." arXiv preprint arXiv:1704.06001(2017).
Multiscale PixelCNN: Reed, Scott, et al. "Parallel multiscale autoregressive density estimation." Proceedings of the 34th International Conference on Machine Learning-Volume 70.
JMLR. org, 2017.
Grayscale PixelCNN: Kolesnikov, Alexander, and Christoph H. Lampert. "PixelCNN models with auxiliary variables for natural image modeling." Proceedings of the 34th International
Conference on Machine Learning-Volume 70. JMLR. org, 2017.
Subscale Pixel Network: Menick, Jacob, and Nal Kalchbrenner. "Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling." arXiv preprint
arXiv:1812.01608(2018)
Dirk Weissenborn, Oscar Tackstrom, Jakob Uszkoreit. “Scaling Autoregressive Video Models.” arXiv 1906.02634 (2019)
Sparse Attention: Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever. “Generating Long Sequences with Sparse Transformers.” arXiv 1904.10509
Wilson Yan, Jonathan Ho, Pieter Abbeel. “Natural Image Manipulation for Autoregressive Models using Fisher Scores.” arXiv 1912.05015
PixelCNN Super Resolution: Dahl, Ryan, Mohammad Norouzi, and Jonathon Shlens. "Pixel recursive super resolution." Proceedings of the IEEE International Conference on
Computer Vision. 2017.

Colab

