
Autoregressive Models

Stefano Ermon

Stanford University

Lecture 3



Learning a generative model
We are given a training set of examples, e.g., images of dogs

We want to learn a probability distribution p(x) over images x such that


1 Generation: If we sample x_new ∼ p(x), x_new should look like a dog (sampling)
2 Density estimation: p(x) should be high if x looks like a dog, and low otherwise (anomaly detection)
3 Unsupervised representation learning: We should be able to learn what these images have in common, e.g., ears, tail, etc. (features)
First question: how to represent p(x). Second question: how to learn it.
Recap: Bayesian networks vs neural models

Using Chain Rule

p(x1 , x2 , x3 , x4 ) = p(x1 )p(x2 | x1 )p(x3 | x1 , x2 )p(x4 | x1 , x2 , x3 )

Fully General, no assumptions needed (exponential size, no free lunch)


Bayes Net

p(x1, x2, x3, x4) ≈ pCPT(x1) pCPT(x2 | x1) pCPT(x3 | x1, x2) pCPT(x4 | x1, x2, x3)


Assumes conditional independencies; tabular representations via conditional probability tables (CPT)
Neural Models

p(x1 , x2 , x3 , x4 ) ≈ p(x1 )p(x2 | x1 )pNeural (x3 | x1 , x2 )pNeural (x4 | x1 , x2 , x3 )

Assumes specific functional form for the conditionals. A sufficiently deep neural net can approximate any function.



Neural Models for classification
Setting: binary classification of Y ∈ {0, 1} given input features X ∈ {0, 1}^n
For classification, we care about p(Y | x), and assume that p(Y = 1 | x; α) = f(x, α)
Logistic regression: let z(α, x) = α0 + Σ_{i=1}^n αi xi.
plogit(Y = 1 | x; α) = σ(z(α, x)), where σ(z) = 1/(1 + e^{−z})
Non-linear dependence: let h(A, b, x) be a non-linear transformation of the input features.
pNeural(Y = 1 | x; α, A, b) = σ(α0 + Σ_{i=1}^h αi hi)
More flexible
More parameters: A, b, α
Repeat multiple times to get a multilayer perceptron (neural network)
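
To make the two parameterizations concrete, here is a minimal numpy sketch of both conditionals; the shapes and the random parameters are purely illustrative, not trained values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_logit(x, alpha0, alpha):
    # Logistic regression: p(Y = 1 | x) = sigma(alpha0 + sum_i alpha_i x_i)
    return sigmoid(alpha0 + alpha @ x)

def p_neural(x, A, b, alpha0, alpha):
    # One hidden layer: h = sigma(A x + b), then p(Y = 1 | x) = sigma(alpha0 + alpha . h)
    h = sigmoid(A @ x + b)
    return sigmoid(alpha0 + alpha @ h)

# Example with n = 5 binary features and a hidden layer of size 3
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=5)
print(p_logit(x, 0.1, rng.normal(size=5)))
print(p_neural(x, rng.normal(size=(3, 5)), rng.normal(size=3), 0.1, rng.normal(size=3)))
```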



Motivating Example: MNIST

Given: a dataset D of handwritten digits (binarized MNIST)

Each image has n = 28 × 28 = 784 pixels. Each pixel can either be black (0) or white (1).
Goal: Learn a probability distribution p(x) = p(x1, · · · , x784) over x ∈ {0, 1}^784 such that when x ∼ p(x), x looks like a digit
Two step process:
1 Parameterize a model family {pθ (x), θ ∈ Θ} [This lecture]
2 Search for model parameters θ based on training data D [Next lecture]



Autoregressive Models
We can pick an ordering of all the random variables, i.e., raster scan
ordering of pixels from top-left (X1 ) to bottom-right (Xn=784 )
Without loss of generality, we can use chain rule for factorization
p(x1 , · · · , x784 ) = p(x1 )p(x2 | x1 )p(x3 | x1 , x2 ) · · · p(xn | x1 , · · · , xn−1 )
Some conditionals are too complex to be stored in tabular form. Instead, we
assume
p(x1, · · · , x784) = pCPT(x1; α1) plogit(x2 | x1; α2) plogit(x3 | x1, x2; α3) · · · plogit(xn | x1, · · · , xn−1; αn)
More explicitly
pCPT(X1 = 1; α1) = α1, p(X1 = 0) = 1 − α1
plogit(X2 = 1 | x1; α2) = σ(α0^(2) + α1^(2) x1)
plogit(X3 = 1 | x1, x2; α3) = σ(α0^(3) + α1^(3) x1 + α2^(3) x2)
Note: This is a modeling assumption. We are using parameterized
functions (e.g., logistic regression above) to predict next pixel given all the
previous ones. Called autoregressive model.
Fully Visible Sigmoid Belief Network (FVSBN)

The conditional variables Xi | X1, · · · , Xi−1 are Bernoulli with parameters

x̂i = p(Xi = 1 | x1, · · · , xi−1; αi) = p(Xi = 1 | x<i; αi) = σ(α0^(i) + Σ_{j=1}^{i−1} αj^(i) xj)
How to evaluate p(x1 , · · · , x784 )? Multiply all the conditionals (factors)
In the above example:
p(X1 = 0, X2 = 1, X3 = 1, X4 = 0) = (1 − x̂1 ) × x̂2 × x̂3 × (1 − x̂4 )
= (1 − x̂1 ) × x̂2 (X1 = 0) × x̂3 (X1 = 0, X2 = 1) × (1 − x̂4 (X1 = 0, X2 = 1, X3 = 1))

How to sample from p(x1 , · · · , x784 )?


1 Sample x̄1 ∼ p(x1) (np.random.choice([1,0], p=[x̂1, 1 − x̂1]))
2 Sample x̄2 ∼ p(x2 | x1 = x̄1)
3 Sample x̄3 ∼ p(x3 | x1 = x̄1, x2 = x̄2) · · ·

How many parameters (in the αi vectors)? 1 + 2 + 3 + · · · + n ≈ n²/2
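
A minimal numpy sketch of this sampling procedure, assuming the αi vectors are given (here they are random, so the samples will not look like digits):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fvsbn_sample(alphas, rng):
    """Sample x_1, ..., x_n one variable at a time.

    alphas[i] is the parameter vector of the (i+1)-th conditional:
    alphas[i][0] is the bias, alphas[i][1:] multiplies the previous pixels.
    """
    n = len(alphas)
    x = np.zeros(n, dtype=int)
    for i in range(n):
        x_hat_i = sigmoid(alphas[i][0] + alphas[i][1:] @ x[:i])  # p(X_i = 1 | x_<i)
        x[i] = rng.choice([1, 0], p=[x_hat_i, 1 - x_hat_i])
    return x

rng = np.random.default_rng(0)
alphas = [rng.normal(size=i + 1) for i in range(784)]  # ~n^2/2 parameters in total
sample = fvsbn_sample(alphas, rng)
```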


FVSBN Results

Training data on the left (Caltech 101 Silhouettes). Samples from the
model on the right.
Figure from Learning Deep Sigmoid Belief Networks with Data
Augmentation, 2015.



NADE: Neural Autoregressive Density Estimation

To improve model: use one layer neural network instead of logistic regression

hi = σ(Ai x<i + ci)
x̂i = p(xi | x1, · · · , xi−1; Ai, ci, αi, bi) = σ(αi hi + bi), where Ai, ci, αi, bi are the parameters

   
      x  
For example h2 = σ  . .   .. . 
 .. x1 + ..  h3 = σ  .. .. ( x2 ) + .. 
1

|{z} |{z} | {z } |{z}


A2 c2 A3 c3



NADE: Neural Autoregressive Density Estimation

Tie weights to reduce the number of parameters and speed up computation (see blue dots in the figure):
hi = σ(W·,<i x<i + c)
x̂i = p(xi | x1, · · · , xi−1) = σ(αi hi + bi)
For example, h2 = σ(w1 x1 + c), h3 = σ((w1 w2)(x1, x2)⊤ + c), h4 = σ((w1 w2 w3)(x1, x2, x3)⊤ + c), where W·,<i = (w1, · · · , wi−1) consists of the first i − 1 columns of the shared weight matrix W.

If hi ∈ R^d, how many total parameters? Linear in n: weights W ∈ R^{d×n}, biases c ∈ R^d, and n logistic regression coefficient vectors αi, bi ∈ R^{d+1}.
Probability is evaluated in O(nd).
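
A minimal numpy sketch of the tied-weight forward pass; it reuses the running activation W·,<i x<i so that all n conditionals cost O(nd) in total. Shapes and random parameters are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nade_conditionals(x, W, c, alpha, b):
    """Return x_hat_i = p(x_i = 1 | x_<i) for i = 1..n in O(n d) total,
    by reusing the running activation a_i = W[:, :i-1] @ x[:i-1] + c."""
    n = x.shape[0]
    a = c.copy()                   # activation for h_1 (no inputs seen yet)
    x_hat = np.zeros(n)
    for i in range(n):
        h = sigmoid(a)             # h_i
        x_hat[i] = sigmoid(alpha[i] @ h + b[i])
        a = a + W[:, i] * x[i]     # fold in x_i for the next conditional
    return x_hat

d, n = 50, 784
rng = np.random.default_rng(0)
W, c = rng.normal(size=(d, n)), np.zeros(d)
alpha, b = rng.normal(size=(n, d)), np.zeros(n)
x = rng.integers(0, 2, size=n)
x_hat = nade_conditionals(x, W, c, alpha, b)
log_p = np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))
```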
NADE results

Samples from a model trained on MNIST on the left. Conditional probabilities x̂i on the right.
Figure from The Neural Autoregressive Distribution Estimator, 2011.



General discrete distributions

How to model non-binary discrete random variables Xi ∈ {1, · · · , K}? E.g., pixel intensities varying from 0 to 255
One solution: Let x̂ i parameterize a categorical distribution
hi = σ(W·,<i x<i + c)
p(xi |x1 , · · · , xi−1 ) = Cat(pi1 , · · · , piK )
x̂ i = (pi1 , · · · , piK ) = softmax(Ai hi + bi )
Softmax generalizes the sigmoid/logistic function σ(·) and transforms a vector of
K numbers into a vector of K probabilities (non-negative, sum to 1).
softmax(a) = softmax(a^1, · · · , a^K) = ( exp(a^1) / Σ_i exp(a^i), · · · , exp(a^K) / Σ_i exp(a^i) )

In numpy: np.exp(a)/np.sum(np.exp(a))
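
A small sketch of the categorical conditional; the max-subtraction inside softmax is a standard numerical-stability tweak not shown on the slide, and the shapes of Ai, bi are assumptions.

```python
import numpy as np

def softmax(a):
    a = a - np.max(a)               # for numerical stability; does not change the result
    e = np.exp(a)
    return e / np.sum(e)

# Hypothetical shapes: h_i in R^d, A_i in R^{K x d}, b_i in R^K, with K = 256 pixel values
d, K = 50, 256
rng = np.random.default_rng(0)
h_i = rng.normal(size=d)
A_i, b_i = rng.normal(size=(K, d)), np.zeros(K)

p_i = softmax(A_i @ h_i + b_i)      # x_hat_i = (p_i^1, ..., p_i^K)
x_i = rng.choice(K, p=p_i)          # sample a pixel intensity in {0, ..., 255}
```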
RNADE

How to model continuous random variables Xi ∈ R? E.g., speech signals
Solution: let x̂i parameterize a continuous distribution
E.g., uniform mixture of K Gaussians



RNADE

How to model continuous random variables Xi ∈ R? E.g., speech signals
Solution: let x̂i parameterize a continuous distribution
E.g., in a uniform mixture of K Gaussians,
p(xi | x1, · · · , xi−1) = Σ_{j=1}^K (1/K) N(xi; µi^j, σi^j)
hi = σ(W·,<i x<i + c)
x̂i = (µi^1, · · · , µi^K, σi^1, · · · , σi^K) = f(hi)

x̂i defines the mean and standard deviation of each of the K Gaussians (µi^j, σi^j). Can use an exponential exp(·) to ensure the standard deviations are non-negative
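
A minimal sketch of evaluating this mixture density for one conditional; the output-layer parameters V_mu, V_s, b_mu, b_s standing in for f(·) are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def rnade_conditional_density(x_i, h_i, V_mu, V_s, b_mu, b_s):
    """p(x_i | x_<i) as a uniform mixture of K Gaussians whose parameters
    are produced from the hidden state h_i by a linear output layer."""
    mu = V_mu @ h_i + b_mu                 # K means
    sigma = np.exp(V_s @ h_i + b_s)        # exp(.) keeps the std devs positive
    K = mu.shape[0]
    return np.sum(gaussian_pdf(x_i, mu, sigma)) / K

d, K = 50, 10
rng = np.random.default_rng(0)
h_i = rng.normal(size=d)
p = rnade_conditional_density(0.3, h_i, rng.normal(size=(K, d)),
                              0.1 * rng.normal(size=(K, d)), np.zeros(K), np.zeros(K))
```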
Autoregressive models vs. autoencoders

On the surface, FVSBN and NADE look similar to an autoencoder:
an encoder e(·). E.g., e(x) = σ(W^2(W^1 x + b^1) + b^2)
a decoder such that d(e(x)) ≈ x. E.g., d(h) = σ(V h + c).
Loss function for dataset D:
Binary r.v.: min over W^1, W^2, b^1, b^2, V, c of Σ_{x∈D} Σ_i [−xi log x̂i − (1 − xi) log(1 − x̂i)]
Continuous r.v.: min over W^1, W^2, b^1, b^2, V, c of Σ_{x∈D} Σ_i (xi − x̂i)²

e and d are constrained so that we don’t learn identity mappings. Hope that
e(x) is a meaningful, compressed representation of x (feature learning)
A vanilla autoencoder is not a generative model: it does not define a
distribution over x we can sample from to generate new data points.
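
A minimal sketch of the two reconstruction losses above for a single-hidden-layer autoencoder, using the parameter names from the slide (random, untrained weights):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W1, b1, W2, b2):
    return sigmoid(W2 @ (W1 @ x + b1) + b2)

def decode(h, V, c):
    return sigmoid(V @ h + c)

def binary_loss(x, x_hat, eps=1e-12):
    # cross-entropy reconstruction loss for binary pixels
    return np.sum(-x * np.log(x_hat + eps) - (1 - x) * np.log(1 - x_hat + eps))

def continuous_loss(x, x_hat):
    # squared-error reconstruction loss for continuous inputs
    return np.sum((x - x_hat) ** 2)

rng = np.random.default_rng(0)
n, d = 784, 50
W1, b1 = 0.01 * rng.normal(size=(d, n)), np.zeros(d)
W2, b2 = 0.01 * rng.normal(size=(d, d)), np.zeros(d)
V, c = 0.01 * rng.normal(size=(n, d)), np.zeros(n)
x = rng.integers(0, 2, size=n).astype(float)
x_hat = decode(encode(x, W1, b1, W2, b2), V, c)
print(binary_loss(x, x_hat))
```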
Autoregressive autoencoders

On the surface, FVSBN and NADE look similar to an autoencoder. Can we get a generative model from an autoencoder?
We need to make sure it corresponds to a valid Bayesian Network (DAG
structure), i.e., we need an ordering for chain rule. If ordering is 1, 2, 3, then:
x̂1 cannot depend on any input x = (x1 , x2 , x3 ). Then at generation
time we don’t need any input to get started
x̂2 can only depend on x1
···
Bonus: we can use a single neural network (with n inputs and outputs) to
produce all the parameters x̂ in a single pass. In contrast, NADE requires n
passes. Much more efficient on modern hardware.
MADE: Masked Autoencoder for Distribution Estimation

1 Challenge: An autoencoder that is autoregressive (DAG structure)
2 Solution: use masks to disallow certain paths (Germain et al., 2015).
Suppose ordering is x2 , x3 , x1 , so p(x1 , x2 , x3 ) = p(x2 )p(x3 | x2 )p(x1 | x2 , x3 ).
1 The unit producing the parameters for x̂2 = p(x2 ) is not allowed to
depend on any input. Unit for p(x3 |x2 ) only on x2 . And so on...
2 For each unit in a hidden layer, pick a random integer i in [1, n − 1].
That unit is allowed to depend only on the first i inputs (according to
the chosen ordering).
3 Add mask to preserve this invariant: connect to all units in previous
layer with smaller or equal assigned number (strictly < in final layer)
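
A minimal sketch of the mask construction for a single hidden layer, following the recipe above; the layer sizes are arbitrary.

```python
import numpy as np

def made_masks(n, hidden_size, ordering, rng):
    """Build masks for one hidden layer.

    ordering[j] is the position (1-based) of input j in the chosen
    autoregressive ordering.  Returns (input->hidden mask, hidden->output
    mask), to be applied elementwise to the weight matrices.
    """
    m_in = np.asarray(ordering)                       # degrees of the inputs
    m_hid = rng.integers(1, n, size=hidden_size)      # random degrees in [1, n-1]
    # hidden unit k may see input j iff its degree is >= the input's degree
    M_hidden = (m_hid[:, None] >= m_in[None, :]).astype(float)   # (hidden, n)
    # the output for x_j may see hidden unit k only if m_in[j] > m_hid[k] (strictly)
    M_output = (m_in[:, None] > m_hid[None, :]).astype(float)    # (n, hidden)
    return M_hidden, M_output

rng = np.random.default_rng(0)
# ordering x2, x3, x1 from the example: x1 is third, x2 first, x3 second
M_h, M_o = made_masks(n=3, hidden_size=8, ordering=[3, 1, 2], rng=rng)
# The row of M_o for x2 is all zeros: p(x2) depends on no input, as required.
print(M_o[1])
```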
RNN: Recurrent Neural Nets
Challenge: model p(xt |x1:t−1 ; αt ). “History” x1:t−1 keeps getting longer.
Idea: keep a summary and recursively update it

Summary update rule: ht+1 = tanh(Whh ht + Wxh xt+1)
Prediction: ot+1 = Why ht+1
Summary initialization: h0 = b0

1 Hidden layer ht is a summary of the inputs seen till time t
2 Output layer ot−1 specifies parameters for conditional p(xt | x1:t−1)
3 Parameterized by b0 (initialization) and matrices Whh, Wxh, Why. Constant number of parameters w.r.t. n!
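
A minimal numpy sketch of the update rule; the dimensions and random weights are placeholders.

```python
import numpy as np

def rnn_step(h_t, x_t1, W_hh, W_xh, W_hy):
    """One step of the summary update and prediction above."""
    h_t1 = np.tanh(W_hh @ h_t + W_xh @ x_t1)   # h_{t+1}
    o_t1 = W_hy @ h_t1                          # parameters for the next conditional
    return h_t1, o_t1

d, vocab = 16, 4
rng = np.random.default_rng(0)
W_hh, W_xh, W_hy = rng.normal(size=(d, d)), rng.normal(size=(d, vocab)), rng.normal(size=(vocab, d))
h = np.zeros(d)                                # h_0 = b_0 (here simply zeros)
for x_onehot in np.eye(vocab)[[0, 1, 2, 2]]:   # a length-4 one-hot input sequence
    h, o = rnn_step(h, x_onehot, W_hh, W_xh, W_hy)
```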
Example: Character RNN (from Andrej Karpathy)

1 Suppose xi ∈ {h, e, l, o}. Use one-hot encoding: h encoded as [1, 0, 0, 0], e encoded as [0, 1, 0, 0], etc.
2 Autoregressive: p(x = hello) = p(x1 = h) p(x2 = e | x1 = h) p(x3 = l | x1 = h, x2 = e) · · · p(x5 = o | x1 = h, x2 = e, x3 = l, x4 = l)
3 For example,
p(x2 = e | x1 = h) = [softmax(o1)]_e = exp(2.2) / (exp(1.0) + · · · + exp(4.1))
o1 = Why h1
h1 = tanh(Whh h0 + Wxh x1)
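
A toy sampling loop with this one-hot encoding; since the weights here are random rather than trained, the generated string is gibberish.

```python
import numpy as np

chars = ['h', 'e', 'l', 'o']
one_hot = np.eye(4)                     # h -> [1,0,0,0], e -> [0,1,0,0], ...

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

rng = np.random.default_rng(0)
d = 16
W_hh, W_xh, W_hy = rng.normal(size=(d, d)), rng.normal(size=(d, 4)), rng.normal(size=(4, d))

h = np.zeros(d)
x = one_hot[chars.index('h')]           # start the sequence with 'h'
out = ['h']
for _ in range(4):
    h = np.tanh(W_hh @ h + W_xh @ x)    # h_t
    p = softmax(W_hy @ h)               # p(x_{t+1} | x_{1:t}) = softmax(o_t)
    idx = rng.choice(4, p=p)
    out.append(chars[idx])
    x = one_hot[idx]
print(''.join(out))
```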
RNN: Recurrent Neural Nets

Pros:
1 Can be applied to sequences of arbitrary length.
2 Very general: For every computable function, there exists a finite
RNN that can compute it
Cons:
1 Still requires an ordering
2 Sequential likelihood evaluation (very slow for training)
3 Sequential generation (unavoidable in an autoregressive model)



Example: Character RNN (from Andrej Karpathy)

Train 3-layer RNN with 512 hidden nodes on all the works of Shakespeare.
Then sample from the model:

KING LEAR: O, if you were a feeble sight, the courtesy of your law,
Your sight and several breath, will wear the gods
With his heads, and my hands are wonder’d at the deeds,
So drop upon your lordship’s head, and your opinion
Shall be against your honour.

Note: generation happens character by character. Needs to learn valid words, grammar, punctuation, etc.



Example: Character RNN (from Andrej Karpathy)

Train on Wikipedia. Then sample from the model:

Naturalism and decision for the majority of Arab countries’ capitalide was
grounded by the Irish language by [[John Clair]], [[An Imperial Japanese
Revolt]], associated with Guangzham’s sovereignty. His generals were
the powerful ruler of the Portugal in the [[Protestant Immineners]], which
could be said to be directly in Cantonese Communication, which followed
a ceremony and set inspired prison, training. The emperor travelled
back to [[Antioch, Perth, October 25—21]] to note, the Kingdom of
Costa Rica, unsuccessful fashioned the [[Thrales]], [[Cynth’s Dajoard]],
known in western [[Scotland]], near Italy to the conquest of India with
the conflict.
Note: correct Markdown syntax. Opening and closing of brackets [[·]]



Example: Character RNN (from Andrej Karpathy)
Train on Wikipedia. Then sample from the model:

{ { cite journal — id=Cerling Nonforest Department—format=Newlymeslated—none } }
”www.e-complete”.
”’See also”’: [[List of ethical consent processing]]

== See also ==
*[[Iender dome of the ED]]
*[[Anti-autism]]

== External links==
* [http://www.biblegateway.nih.gov/entrepre/ Website of the World
Festival. The labour of India-county defeats at the Ripper of California
Road.]



Example: Character RNN (from Andrej Karpathy)

Train on data set of baby names. Then sample from the model:

Rudi Levette Berice Lussa Hany Mareanne Chrestina Carissy Marylen Hammine Janye Marlise Jacacrie Hendred Romand Charienna Nenotto Ette Dorane Wallen Marly Darine Salina Elvyn Ersia Maralena Minoria Ellia Charmin Antley Nerille Chelon Walmor Evena Jeryly Stachon Charisa Allisa Anatha Cathanie Geetra Alexie Jerin Cassen Herbett Cossie Velen Daurenge Robester Shermond Terisa Licia Roselen Ferine Jayn Lusine Charyanne Sales Sanny Resa Wallon Martine Merus Jelen Candica Wallin Tel Rachene Tarine Ozila Ketia Shanne Arnande Karella Roselina Alessia Chasty Deland Berther Geamar Jackein Mellisand Sagdy Nenc Lessie Rasemy Guen Gavi Milea Anneda Margoris Janin Rodelin Zeanna Elyne Janah Ferzina Susta Pey Castina



Issues with RNNs

Issues with RNN models:
A single hidden vector needs to summarize all the (growing) history. For example, h^(4) needs to summarize the meaning of “My friend opened the”.
Sequential evaluation, cannot be parallelized
Exploding/vanishing gradients when accessing information from many
steps back



Attention based models

Attention mechanism to compare a query vector to a set of key vectors:
1 Compare current hidden state (query) to all past hidden states (keys), e.g., by taking a dot product
2 Construct attention distribution to figure out what parts of the history are relevant, e.g., via a softmax
3 Construct a summary of the history, e.g., by weighted sum
4 Use summary and current hidden state to predict next token/word
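
A minimal sketch of the four steps above with dot-product attention, using past hidden states as both keys and values (one common choice; the slide leaves this open):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def attention_summary(query, keys, values):
    """Steps 1-3: dot-product scores, softmax weights, weighted sum."""
    scores = keys @ query                  # 1. compare query to each key
    weights = softmax(scores)              # 2. attention distribution over the history
    return weights @ values                # 3. summary of the history

d, T = 16, 10
rng = np.random.default_rng(0)
past_hidden = rng.normal(size=(T, d))      # past hidden states (keys and values here)
current = rng.normal(size=d)               # current hidden state (query)
summary = attention_summary(current, past_hidden, past_hidden)
# 4. [summary, current] would then be fed to a layer that predicts the next token
```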



Generative Transformers

Current state of the art (GPTs): replace RNN with Transformer
Attention mechanisms to adaptively focus only on relevant context
Avoid recursive computation. Use only self-attention to enable parallelization
Needs masked self-attention to preserve autoregressive structure
Demo: https://transformer.huggingface.co/doc/gpt2-large
Demo: https://huggingface.co/spaces/
huggingface-projects/llama-2-13b-chat



Pixel RNN (Oord et al., 2016)

1 Model images pixel by pixel using raster scan order
2 Each pixel conditional p(xt | x1:t−1) needs to specify 3 colors:
p(xt | x1:t−1) = p(xt^red | x1:t−1) p(xt^green | x1:t−1, xt^red) p(xt^blue | x1:t−1, xt^red, xt^green)
and each conditional is a categorical random variable with 256 possible values
3 Conditionals modeled using RNN variants: LSTMs + masking (like MADE)



Pixel RNN

Results on downsampled ImageNet. Very slow: sequential likelihood evaluation.



Convolutional Architectures

Convolutions are natural for image data and easy to parallelize on modern
hardware.



PixelCNN (Oord et al., 2016)

Idea: Use convolutional architecture to predict next pixel given context (a neighborhood of pixels).
Challenge: Has to be autoregressive. Masked convolutions preserve raster scan order. Additional masking for color order.



PixelCNN

Samples from the model trained on Imagenet (32 × 32 pixels). Similar performance to PixelRNN, but much faster.



Application in Adversarial Attacks and Anomaly detection

Machine learning methods are vulnerable to adversarial examples

Can we detect them?



PixelDefend (Song et al., 2018)

Train a generative model p(x) on clean inputs (PixelCNN)
Given a new input x, evaluate p(x)
Adversarial examples are significantly less likely under p(x)



WaveNet (Oord et al., 2016)

Very effective model for speech:

Dilated convolutions increase the receptive field: the kernel only touches the signal at every 2^d entries.
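
A minimal sketch of a dilated causal convolution and of stacking layers with doubling dilation; the kernel size and depth are arbitrary choices, not the WaveNet configuration.

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    """y[t] = sum_k w[k] * x[t - k*dilation], with zero padding on the left,
    so y[t] depends only on current and past samples."""
    K = len(w)
    pad = (K - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[k] * xp[pad + t - k * dilation] for k in range(K))
                     for t in range(len(x))])

rng = np.random.default_rng(0)
x = rng.normal(size=32)
h = x
for d in range(4):                      # dilations 1, 2, 4, 8
    h = np.tanh(dilated_causal_conv(h, rng.normal(size=2), dilation=2 ** d))
# With kernel size 2 and L such layers, the receptive field grows to 2^L samples.
```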



Summary of Autoregressive Models
Easy to sample from
1 Sample x̄0 ∼ p(x0)
2 Sample x̄1 ∼ p(x1 | x0 = x̄0)
3 · · ·
Easy to compute probability p(x = x̄)
1 Compute p(x0 = x̄0)
2 Compute p(x1 = x̄1 | x0 = x̄0)
3 Multiply together (sum their logarithms)
4 · · ·
5 Ideally, can compute all these terms in parallel for fast training
Easy to extend to continuous variables. For example, can choose Gaussian conditionals p(xt | x<t) = N(µθ(x<t), Σθ(x<t)) or mixture of logistics
No natural way to get features, cluster points, do unsupervised learning
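
A minimal end-to-end sketch of these two operations for a toy binary autoregressive model with hand-picked conditionals of the FVSBN form:

```python
import numpy as np

def sample(conditionals, rng):
    """Draw x_0, x_1, ... one at a time; conditionals[i](x[:i]) returns
    p(x_i = 1 | x_<i) for a binary autoregressive model."""
    x = np.zeros(len(conditionals), dtype=int)
    for i, cond in enumerate(conditionals):
        p1 = cond(x[:i])
        x[i] = rng.choice([1, 0], p=[p1, 1 - p1])
    return x

def log_prob(conditionals, x):
    """log p(x) = sum_i log p(x_i | x_<i); given an observed x, each term
    could in principle be computed in parallel."""
    return sum(np.log(c(x[:i]) if x[i] == 1 else 1 - c(x[:i]))
               for i, c in enumerate(conditionals))

sigmoid = lambda z: 1 / (1 + np.exp(-z))
conds = [lambda x_lt: 0.7,
         lambda x_lt: sigmoid(0.1 + 1.5 * x_lt[0]),
         lambda x_lt: sigmoid(-0.2 + x_lt[0] - 0.5 * x_lt[1])]
rng = np.random.default_rng(0)
x = sample(conds, rng)
print(x, log_prob(conds, x))
```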
Next: learning
