
CS7015 (Deep Learning) : Lecture 22

Autoregressive Models (NADE, MADE)

Mitesh M. Khapra

Department of Computer Science and Engineering


Indian Institute of Technology Madras

Module 22.1: Neural Autoregressive Density Estimator
(NADE)

[Figure: an RBM with visible units v ∈ {0,1}^m (biases b), hidden units h ∈ {0,1}^n (biases c) and weights W ∈ R^{m×n}, alongside a VAE with encoder Q_θ(z|x) producing µ, Σ and decoder P_φ(x|z) producing x̂.]

So far we have seen a few latent variable generative models such as RBMs and VAEs. Latent variable models make certain independence assumptions, which reduce the number of factors and, in turn, the number of parameters in the model.

For example, in RBMs we assumed that the visible variables were independent given the hidden variables, which allowed us to do Block Gibbs Sampling. Similarly, in VAEs we assumed P(x|z) = N(µ(z), Σ) with Σ = I, which effectively means that given the latent variables the x's are independent of each other (since Σ = I).

We will now look at Autoregressive (AR) models, which do not contain any latent variables. The aim, of course, is to learn a joint distribution over x. As usual, for ease of illustration we will assume x ∈ {0, 1}^n.

AR models do not make any independence assumptions but use the default factorization of p(x) given by the chain rule:

p(x) = ∏_{i=1}^{n} p(x_i | x_{<i})

The above factorization contains n factors, and some of these factors contain many parameters (O(2^n) in total).

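To make the O(2^n) concrete (a standard counting argument, not spelled out on the slide): with binary variables, a full conditional probability table for the last factor p(x_n | x_1, ..., x_{n−1}) alone needs one parameter per configuration of x_{<n}, i.e. 2^{n−1} parameters, and summing over all n factors gives 1 + 2 + ... + 2^{n−1} = 2^n − 1 parameters.
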
Obviously, it is infeasible to learn such an exponential number of parameters. AR models work around this by using a neural network to parameterize these factors and then learning the parameters of this neural network. What does this mean? Let us see!

[Figure: a feed-forward network mapping inputs x1, x2, x3, x4 to the outputs p(x1), p(x2|x1), p(x3|x2,x1), p(x4|x3,x2,x1).]

At the output layer we want to predict n conditional probability distributions (each corresponding to one of the factors in our joint distribution). At the input layer we are given the n input variables.

Now the catch is that the k-th output should only be connected to the previous k − 1 inputs. In particular, when we are computing p(x3|x2,x1), the only inputs that we should consider are x1 and x2, because these are the only variables given to us while computing this conditional.


The Neural Autoregressive Density Estimator (NADE) proposes a simple solution for this.

First, for every output unit, we compute a hidden representation using only the relevant input units. For example, for the k-th output unit, the hidden representation will be computed using

h_k = σ(W_{·,<k} x_{<k} + b)

where h_k ∈ R^d, W ∈ R^{d×n}, and W_{·,<k} denotes the first k − 1 columns of W (those corresponding to x_{<k}).

We then compute the output p(x_k | x_1, ..., x_{k−1}) as

y_k = p(x_k | x_1, ..., x_{k−1}) = σ(V_k h_k + c_k)

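The two equations above translate almost directly into code. Below is a minimal sketch of the NADE forward pass in NumPy; the sizes (n = 4, d = 3) and the random parameter values are hypothetical, chosen only to make the shapes concrete, and the initial state is folded into the bias (h_1 = σ(b)) rather than kept as a separate parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3                          # n binary inputs, d hidden units (toy sizes)

W = rng.normal(size=(d, n))          # shared W ∈ R^{d×n}
b = rng.normal(size=d)               # shared bias b ∈ R^d
V = rng.normal(size=(n, d))          # row k plays the role of V_k
c = rng.normal(size=n)               # one scalar c_k per output

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def nade_forward(x):
    """Return y_k = p(x_k = 1 | x_<k) for k = 1..n."""
    y = np.zeros(n)
    for k in range(n):
        h_k = sigmoid(W[:, :k] @ x[:k] + b)   # h_k = σ(W_{·,<k} x_{<k} + b)
        y[k] = sigmoid(V[k] @ h_k + c[k])     # y_k = σ(V_k h_k + c_k)
    return y

y = nade_forward(np.array([1, 0, 1, 1]))
```
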
Let us look at the equations carefully:

h_k = σ(W_{·,<k} x_{<k} + b)
y_k = p(x_k | x_1, ..., x_{k−1}) = σ(V_k h_k + c_k)

How many parameters does this model have? Note that W ∈ R^{d×n} and b ∈ R^d are shared parameters: the same W and b are used for computing h_k for all the n factors (of course, only the relevant columns of W are used for each k), which accounts for nd + d parameters.

In addition, we have V_k ∈ R^d and a scalar c_k for each of the n factors, which accounts for a further nd + n parameters.

There is also an additional parameter h_1 ∈ R^d (similar to the initial state in RNNs/LSTMs).

The total number of parameters in the model is thus 2nd + n + 2d, which is linear in n. In other words, the model does not have the exponential number of parameters that the default factorization

p(x) = ∏_{i=1}^{n} p(x_i | x_{<i})

would typically require. Why? Because we are sharing the parameters across the factors: the same W and b contribute to all the factors.

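As a sanity check on the 2nd + n + 2d count, here is the arithmetic for hypothetical sizes (say n = 784 binary pixels and d = 500 hidden units, values not taken from the lecture):

```python
n, d = 784, 500
shared     = n * d + d          # shared W ∈ R^{d×n} and b ∈ R^d
per_factor = n * d + n          # V_k ∈ R^d and a scalar c_k for each of the n factors
initial_h  = d                  # the extra parameter h_1 ∈ R^d
total = shared + per_factor + initial_h
assert total == 2 * n * d + n + 2 * d   # 785,784 parameters -- linear in n, not O(2^n)
```
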
How will you train such a network? Backpropagation: it's a neural network after all.

What is the loss function that you will choose? For every output node we know the true probability distribution. For example, for a given training instance, if x3 = 1 then the true distribution is given by p(x3 = 1 | x2, x1) = 1 and p(x3 = 0 | x2, x1) = 0, i.e. p = [0, 1]. If the predicted distribution is q = [0.7, 0.3] then we can simply take the cross entropy between p and q as the loss function.

The total loss will be the sum of this cross-entropy loss over all n output nodes.

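As a sketch (reusing the hypothetical nade_forward from the earlier example), the per-instance loss is simply the sum of the n Bernoulli cross-entropies between the observed x_k and the predicted y_k:

```python
def nade_loss(x, eps=1e-12):
    """Negative log-likelihood of one binary instance x under the model,
    i.e. the sum over the n outputs of the cross entropy described on the slide."""
    y = nade_forward(x)
    return -np.sum(x * np.log(y + eps) + (1 - x) * np.log(1 - y + eps))
```
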
Now let's ask a couple of questions about the model (assume training is done).

Can the model be used for abstraction? That is, if we give it a test instance x, can the model give us a hidden abstract representation for x? Well, you will get a sequence of hidden representations h1, h2, ..., hn, but these are not really the kind of abstract representations that we are interested in. For example, hn only captures the information required to reconstruct xn given x1 to x_{n−1} (compare this with an autoencoder, whose hidden representation can reconstruct all of x1, x2, ..., xn).

These are not latent variable models and are, by design, not meant for abstraction.

Can the model do generation? How?

Well, we first compute p(x1 = 1) as y1 = σ(V1 h1 + c1). Note that V1, h1, c1 are all parameters of the model which will be learned during training. We then sample a value for x1 from the distribution Bernoulli(y1).

We now use the sampled value of x1 and compute h2 as

h_2 = σ(W_{·,<2} x_{<2} + b)

Using h2 we compute p(x2 = 1 | x1) as y2 = σ(V2 h2 + c2), and then sample a value for x2 from the distribution Bernoulli(y2). We continue this process up to xn, generating the value of one random variable at a time.

If x is an image, this is equivalent to generating the image one pixel at a time (very slow).

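A minimal sketch of this ancestral sampling loop (assuming trained parameters W, b, V, c with the same hypothetical shapes as in the earlier forward-pass sketch):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def nade_sample(W, b, V, c, rng=np.random.default_rng(0)):
    """Generate one binary vector x, one variable at a time."""
    n = W.shape[1]
    x = np.zeros(n)
    for k in range(n):
        h_k = sigmoid(W[:, :k] @ x[:k] + b)      # uses only the already-sampled x_<k
        y_k = sigmoid(V[k] @ h_k + c[k])         # p(x_k = 1 | x_<k)
        x[k] = rng.binomial(1, y_k)              # x_k ~ Bernoulli(y_k)
    return x
```
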
Of course, the model requires a lot of computation because for generating each pixel we need to compute

h_k = σ(W_{·,<k} x_{<k} + b)
y_k = p(x_k | x_1, ..., x_{k−1}) = σ(V_k h_k + c_k)

However, notice that

W_{·,<k+1} x_{<k+1} + b = (W_{·,<k} x_{<k} + b) + W_{·,k} x_k

Thus we can reuse some of the computation done for pixel k while predicting pixel k + 1 (this can be done even at training time).

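The same sampling loop with this reuse trick: maintain a running pre-activation a = W_{·,<k} x_{<k} + b and add one column of W per step, so computing each h_k costs O(d) instead of O(kd) (a sketch, same hypothetical parameter shapes as above):

```python
import numpy as np

def nade_sample_fast(W, b, V, c, rng=np.random.default_rng(0)):
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    n = W.shape[1]
    x = np.zeros(n)
    a = b.copy()                     # a = W_{·,<1} x_{<1} + b = b   (x_{<1} is empty)
    for k in range(n):
        h_k = sigmoid(a)             # h_k = σ(W_{·,<k} x_{<k} + b)
        y_k = sigmoid(V[k] @ h_k + c[k])
        x[k] = rng.binomial(1, y_k)
        a += W[:, k] * x[k]          # now a = W_{·,<k+1} x_{<k+1} + b for the next step
    return x
```
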
Things to remember about NADE
Uses the explicit representation of the joint distribution p(x) = ∏_{i=1}^{n} p(x_i | x_{<i})
Each node in the output layer corresponds to one factor in this explicit representation
Reduces the number of parameters by sharing weights in the neural network
Not designed for abstraction
Generation is slow because the model generates one pixel (or one random variable) at a time
Possible to speed up the computation by reusing some previous computations

Module 22.2: Masked Autoencoder Density Estimator
(MADE)

[Figure: an autoencoder over x1…x4 with two hidden layers (weights W1, W2) and output weights V; the desired outputs are p(x1), p(x2|x1), p(x3|x2,x1), p(x4|x3,x2,x1).]

Suppose the input x ∈ {0, 1}^n; then the output layer of an autoencoder also contains n units. Notice that the explicit factorization of the joint distribution p(x) also contains n factors:

p(x) = ∏_{k=1}^{n} p(x_k | x_{<k})

Question: Can we tweak an autoencoder so that its output units predict the n conditional distributions instead of reconstructing the n inputs?

Note that this is not straightforward, because we need to make sure that the k-th output unit only depends on the previous k − 1 inputs.

In a standard autoencoder with fully connected layers, the k-th output unit obviously depends on all the input units; in simple words, there is a path from each of the input units to each of the output units.

We cannot allow this if we want to predict the conditional distributions p(x_k | x_{<k}): we need to ensure that we are only seeing the given variables x_{<k} and nothing else.

[Figure: the same autoencoder with masks M^{W1}, M^{W2}, M^{V} applied elementwise to W1, W2, V; the inputs x1…x4 are numbered 1–4, the first hidden layer's units are numbered 1 2 1 2 3, the second hidden layer's units 1 1 2 1 3, and the outputs 1–4.]

We could ensure this by masking some of the connections in the network so that y_k only depends on x_{<k}.

We will start by assuming some ordering on the inputs and simply number them from 1 to n. We then randomly assign each hidden unit a number between 1 and n − 1, which indicates the number of inputs it will be connected to. For example, if we assign a node the number 2 then it will be connected to the first two inputs. We do a similar assignment for all the hidden layers.

Let us see what this means. For the first hidden layer this numbering is clear: it simply indicates the number of ordered inputs to which the node will be connected.

Let us now focus on the highlighted node in the second layer which has the number 2. This node is only allowed to depend on the inputs x1 and x2 (since it is numbered 2). This means that it should only be connected to those nodes in the previous hidden layer which have seen only x1 and x2; in other words, it should only have connections from those nodes which have been assigned a number ≤ 2.

Now consider the node labeled 3 in the output layer. This node is only allowed to see the inputs x1 and x2, because it predicts p(x3 | x2, x1) (and hence the given variables should only be x1 and x2). By the same argument that we made on the previous slide, this means that it should only be connected to those nodes in the previous hidden layer which have seen only x1 and x2.

We can implement this by taking the weight matrices W1, W2 and V and applying an appropriate mask to each of them so that the disallowed connections are dropped.

For example, given the numbering above, we can apply the following mask at layer 2, i.e. replace W2 by its elementwise product with

M^{W2} =
1 0 1 0 0
1 0 1 0 0
1 1 1 1 0
1 0 1 0 0
1 1 1 1 1

Entry (i, j) of the mask is 1 exactly when the number assigned to unit i of the second hidden layer is ≥ the number assigned to unit j of the first hidden layer, so every disallowed entry of W2 is zeroed out.

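The mask construction described on the last few slides can be written down directly. The sketch below uses the unit numbers shown in the figure (inputs 1–4, hidden layers numbered 1 2 1 2 3 and 1 1 2 1 3, outputs 1–4); the M_W2 computed here reproduces the 5×5 mask above, and the forward pass simply multiplies each weight matrix elementwise by its mask (biases omitted for brevity, random weights purely illustrative).

```python
import numpy as np

# Unit numbers, fixed here to match the figure (in general the hidden ones are
# sampled uniformly from {1, ..., n-1}).
m_in  = np.array([1, 2, 3, 4])
m_h1  = np.array([1, 2, 1, 2, 3])
m_h2  = np.array([1, 1, 2, 1, 3])
m_out = np.array([1, 2, 3, 4])

def hidden_mask(m_prev, m_cur):
    # A unit numbered m may see previous-layer units numbered <= m.
    return (m_cur[:, None] >= m_prev[None, :]).astype(int)

def output_mask(m_prev, m_out):
    # Output k predicts p(x_k | x_<k), so it may only see units numbered <= k - 1.
    return (m_out[:, None] > m_prev[None, :]).astype(int)

M_W1 = hidden_mask(m_in, m_h1)    # 5 x 4 mask for W1
M_W2 = hidden_mask(m_h1, m_h2)    # 5 x 5 mask for W2 (matches the matrix above)
M_V  = output_mask(m_h2, m_out)   # 4 x 5 mask for V

# Masked forward pass (sigmoid activations, biases omitted):
rng = np.random.default_rng(0)
W1, W2, V = rng.normal(size=(5, 4)), rng.normal(size=(5, 5)), rng.normal(size=(4, 5))
sigmoid = lambda z: 1 / (1 + np.exp(-z))

x  = np.array([1, 0, 1, 1])
h1 = sigmoid((W1 * M_W1) @ x)
h2 = sigmoid((W2 * M_W2) @ h1)
y  = sigmoid((V  * M_V ) @ h2)    # y_k = p(x_k = 1 | x_<k)
```
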
The objective function for this network would again be a sum of cross entropies. The network can be trained using backpropagation, such that the errors are only propagated along the active (unmasked) connections (similar to what happens in dropout).

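A small check of this claim (a sketch using PyTorch autograd, not part of the lecture): because the forward pass multiplies W elementwise by the binary mask M, every masked entry of W receives exactly zero gradient.

```python
import torch

M = torch.tensor([[1., 0., 1., 0.],
                  [1., 1., 1., 1.],
                  [1., 0., 1., 0.]])          # an arbitrary 0/1 mask
W = torch.randn(3, 4, requires_grad=True)
x = torch.tensor([1., 0., 1., 1.])

h = torch.sigmoid((W * M) @ x)
h.sum().backward()
print(W.grad * (1 - M))                       # all zeros: no error flows through masked connections
```
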
Similar to NADE, this model is not designed for abstraction but for generation.

How will you do generation in this model? Using the same iterative process that we used with NADE: first sample a value of x1; then feed this value of x1 to the network and compute y2; then sample x2 from Bernoulli(y2), and repeat the process until you have generated all the variables up to xn.

