
CS7015 (Deep Learning) : Lecture 20

Markov Chains, Gibbs Sampling for Training RBMs, Contrastive Divergence for training RBMs

Mitesh M. Khapra

Department of Computer Science and Engineering

Indian Institute of Technology Madras
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
Module 20.1 : Markov Chains

Let us first begin by restating our goals
Goal 1: Given a random variable X ∈ R^n, we are interested in drawing samples from the joint distribution P(X)
Goal 2: Given a function f(X) defined over the random variable X, we are interested in computing the expectation E_P(X)[f(X)]
[Figure: an example image, X ∈ R^1024, and the quantity E_P(X)[f(X)]]
We will use Gibbs Sampling (a member of the Metropolis-Hastings class of algorithms) to achieve these goals
We will first understand the intuition behind Gibbs Sampling and then the math behind it
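Goal 2 hints at why sampling matters: once Goal 1 is solved and we can draw samples from P(X), the expectation reduces to a simple average. A minimal sketch (not from the lecture; the Bernoulli stand-in for P(X) and the function f are illustrative choices so the estimate can be checked):

```python
import numpy as np

# Monte Carlo estimate: E_P(X)[f(X)] ≈ (1/N) Σ f(x_i) over samples x_i ~ P(X)
rng = np.random.default_rng(0)

# Stand-in for "samples from P(X)": here P is deliberately a known
# Bernoulli(0.7) over a single bit, so the estimate can be checked by hand.
samples = (rng.random(100_000) < 0.7).astype(float)

def f(x):
    # an arbitrary function of the random variable
    return 3.0 * x + 1.0

estimate = f(samples).mean()   # ≈ E[f(X)] = 3 * 0.7 + 1 = 3.1
```

The hard part, and the subject of this lecture, is producing the samples in the first place when P(X) is complex.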
Suppose instead of a single random variable X ∈ R^n, we have a chain of random variables X1, X2, ..., XK, each Xi ∈ R^n
The i here corresponds to a time step
For example, Xi could be an n-dimensional vector containing the number of customers in a given set of n restaurants on day i
In our case, Xi could be a 1024-dimensional image sent by our friend on day i
For ease of illustration we will stick to the restaurant example and assume that instead of actual counts we are interested only in binary counts (high = 1, low = 0)
Thus Xi ∈ {0, 1}^n
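Since each Xi ∈ {0, 1}^n, the chain lives on a finite set of 2^n states. A small sketch (illustrative; `state_to_index` is our own hypothetical helper) of enumerating the binary count vectors and mapping each to a state index:

```python
import itertools

n = 3  # number of restaurants (kept tiny; l = 2**n grows very fast)

# All l = 2**n binary count vectors (high = 1, low = 0)
states = list(itertools.product([0, 1], repeat=n))
l = len(states)  # 2**3 = 8

def state_to_index(x):
    # read the bits as a base-2 number, e.g. (1, 0, 1) -> 5
    return int("".join(map(str, x)), 2)

# itertools.product enumerates in exactly this base-2 order, so the
# mapping is a bijection between vectors and indices 0..l-1
assert state_to_index((1, 0, 1)) == 5
assert states[5] == (1, 0, 1)
```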
On day 1, let X1 take on the value x1 (x1 is one of the possible 2^n vectors)
On day 2, let X2 take on the value x2 (x2 is again one of the possible 2^n vectors)
One way of looking at this is that the state has transitioned from x1 to x2
Similarly, on day 3, if X3 takes on the value x3, then we can say that the state has transitioned from x1 to x2 to x3
Finally, on day k, we can say that the state has transitioned from x1 to x2 to x3 to ... to xk
We may now be interested in knowing what is the most likely value that the state will take on day i given the states on day 1 to day i − 1
More formally, we may be interested in the following distribution
P(Xi = xi | X1 = x1, X2 = x2, ..., Xi−1 = xi−1)
Now suppose the chain exhibits the following Markov property
P(Xi = xi | X1 = x1, X2 = x2, ..., Xi−1 = xi−1) = P(Xi = xi | Xi−1 = xi−1)
In other words, given the previous state Xi−1, Xi is independent of all preceding states
Can we draw a graphical model to encode this independence assumption?
In this graphical model, the random variables are X1, X2, ..., Xk
We will have a node corresponding to each of these random variables
What will be the edges in the graph?
Well, each node only depends on its predecessor, so we will just have an edge between successive nodes
X1 → X2 → ··· → Xk
This property (Xi ⊥ {X1, ..., Xi−2} | Xi−1) is called the Markov property
And the resulting chain X1, X2, ..., Xk is called a Markov chain
Further, since we are considering discrete time steps, this is called a discrete time Markov chain
Further, since the Xi's take on discrete values, this is called a discrete time, discrete space Markov chain
Okay, but why are we interested in Markov chains? (we will get there soon! for now let us just focus on these definitions)
Let us delve a bit deeper into Markov chains and define a few more quantities
Recall that each Xi ∈ {0, 1}^n
Let us assume 2^n = l (i.e., Xi can take l values)
How many values do we need to specify the distribution P(Xi = xi | Xi−1 = xi−1)? (l^2)
We can represent this as a matrix T ∈ R^{l × l} where the entry Ta,b of the matrix denotes the probability of transitioning to state b from state a (i.e., P(Xi = b | Xi−1 = a))
The matrix T is called the transition matrix

Xi−1   Xi    Tab
1      1     0.05
1      2     0.06
...    ...   ...
1      l     0.02
2      1     0.03
2      2     0.07
...    ...   ...
2      l     0.01
...    ...   ...
l      1     0.1
l      2     0.09
...    ...   ...
l      l     0.21
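Each row of T is a conditional distribution over next states, so every row must be non-negative and sum to 1. A quick numpy sketch, using the values of the 3-state example that appears in this lecture:

```python
import numpy as np

# Transition matrix for l = 3 states: entry T[a, b] is P(X_i = b | X_{i-1} = a),
# so each row of T is a probability distribution over the next state.
T = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.6, 0.1],
              [0.4, 0.2, 0.4]])

assert np.all(T >= 0)
assert np.allclose(T.sum(axis=1), 1.0)  # every row sums to 1

# A transition probability is then just a table lookup
p_0_to_2 = T[0, 2]  # P(X_i = 2 | X_{i-1} = 0) = 0.3
```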
We need to define this transition matrix Tab, i.e.,
P(Xi = b | Xi−1 = a) ∀a, b ∀i
Why do we need to define this ∀i? Well, because these transition probabilities may be different for different time steps
For example, the transition in the number of customers may be different from Friday to Saturday (weekend) as compared to from Sunday to Monday (weekday)
Thus, for a Markov chain X1, X2, ..., Xk we will have k such transition matrices T1, T2, ..., Tk
However, for this discussion we will assume that the Markov chain is time homogeneous
What does that mean? It means that
T1 = T2 = ··· = Tk = T
In other words
P(Xi = b | Xi−1 = a) = Tab ∀a, b ∀i
The transition matrix does not depend on the time i and hence such a Markov chain is called time homogeneous
Now suppose the starting distribution at time step 0 is given by µ0
Just to be clear, µ0 is a 2^n-dimensional vector such that
µ0_a = P(X0 = a)
µ0_a is the probability that the random variable takes on the value a among all the possible 2^n values
Given µ0 and T, how will you compute µk where
µk_a = P(Xk = a)?
µk is again a 2^n-dimensional vector whose a-th entry tells us the probability that Xk will take on the value a among all the possible 2^n values
Let us consider P(X1 = b)
P(X1 = b) = Σ_a P(X0 = a, X1 = b)
The above sum essentially captures all the paths of reaching X1 = b irrespective of the value of X0
P(X1 = b) = Σ_a P(X0 = a, X1 = b)
          = Σ_a P(X0 = a) P(X1 = b | X0 = a)
          = Σ_a µ0_a Tab
Let us see if there is a more compact way of writing the distribution P(X1) (i.e., of specifying P(X1 = b) ∀b)
Let us consider a simple case when l = 3 (as opposed to 2^n)
Thus, µ0 ∈ R^3 and T ∈ R^{3×3}
[Figure: a 3-state transition diagram whose edge weights are the entries of T]
What does the product µ0 T give us?

µ0 T = [0.3 0.4 0.3] [0.2 0.5 0.3]
                     [0.3 0.6 0.1]
                     [0.4 0.2 0.4]
     = [0.3 0.45 0.25]

It gives us the distribution µ1! (the b-th entry of this vector is Σ_a µ0_a Tab, which is P(X1 = b))
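This computation can be checked directly in numpy: with µ0 as a row vector, the b-th entry of `mu0 @ T` is Σ_a µ0_a Tab = P(X1 = b).

```python
import numpy as np

# The 3-state example: mu0 is the starting distribution (a row vector),
# T the transition matrix; mu0 @ T marginalizes out X0.
mu0 = np.array([0.3, 0.4, 0.3])
T = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.6, 0.1],
              [0.4, 0.2, 0.4]])

mu1 = mu0 @ T  # b-th entry is sum_a mu0[a] * T[a, b] = P(X1 = b)
# mu1 is [0.3, 0.45, 0.25] (up to float rounding), matching the slide
```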
Let us consider P (X2 = b)
X0 X1 X2

2 b

.. .. ..
. . .

15/61
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
Let us consider P (XX
2 = b)
X0 X1 X2 P (X2 = b) = P (X1 = a, X2 = b)
a
1

2 b

.. .. ..
. . .

15/61
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
Let us consider P (XX
2 = b)
X0 X1 X2 P (X2 = b) = P (X1 = a, X2 = b)
a
1 The above sum essentially captures all the paths
of reaching X2 = b irrespective of the value of X1
2 b

.. .. ..
. . .

15/61
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
Let us consider P (XX
2 = b)
X0 X1 X2 P (X2 = b) = P (X1 = a, X2 = b)
a
1 The above sum essentially captures all the paths
of reaching X2 = b irrespective of the value of X1
2 b X
P (X2 = b) = P (X1 = a, X2 = b)
a
.. .. ..
. . .

15/61
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
Let us consider P (XX
2 = b)
X0 X1 X2 P (X2 = b) = P (X1 = a, X2 = b)
a
1 The above sum essentially captures all the paths
of reaching X2 = b irrespective of the value of X1
2 b X
P (X2 = b) = P (X1 = a, X2 = b)
a
.. .. .. X
. . . = P (X1 = a)P (X2 = b|X1 = a)
a
l

15/61
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
Let us consider P (XX
2 = b)
X0 X1 X2 P (X2 = b) = P (X1 = a, X2 = b)
a
1 The above sum essentially captures all the paths
of reaching X2 = b irrespective of the value of X1
2 b X
P (X2 = b) = P (X1 = a, X2 = b)
a
.. .. .. X
. . . = P (X1 = a)P (X2 = b|X1 = a)
a
X
l = µ1a Tab
a

15/61
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
Once again we can write P(X2) compactly as
P(X2) = µ1 T = (µ0 T) T = µ0 T^2
In general,
P(Xk) = µ0 T^k
Thus the distribution at any time step can be computed by finding the appropriate element from the following series
µ0 T^1, µ0 T^2, µ0 T^3, ..., µ0 T^k, ...
Note that this is still computationally expensive because it involves a product of µ0 (2^n) and T^k (2^n × 2^n) (but later on we will see that we do not need this full product)
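In the small l = 3 case the series is cheap to compute. A sketch (`distribution_at` is our own illustrative helper) that builds µ0 T^k one vector-matrix product at a time, so the matrix power T^k is never formed explicitly:

```python
import numpy as np

mu0 = np.array([0.3, 0.4, 0.3])
T = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.6, 0.1],
              [0.4, 0.2, 0.4]])

def distribution_at(k, mu0, T):
    # mu0 T^k via k successive vector-matrix products
    mu = mu0.copy()
    for _ in range(k):
        mu = mu @ T
    return mu

# Sanity check against the explicit matrix power
assert np.allclose(distribution_at(10, mu0, T),
                   mu0 @ np.linalg.matrix_power(T, 10))
```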
If at a certain time step t, µt reaches a distribution π such that πT = π
Then for all subsequent time steps µj = π (j ≥ t)
π is then called the stationary distribution of the Markov chain
Xt, Xt+1, Xt+2, ... will all follow the same distribution π
In other words, if we have Xt = xt, Xt+1 = xt+1, Xt+2 = xt+2 and so on, then we can think of xt, xt+1, xt+2 as samples drawn from the same distribution π (this is a crucial property and we will return to it soon)
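For the 3-state example, a π with πT = π can be found by simply iterating µ ↦ µT until it stops changing. This is a sketch, not a general-purpose solver; the iteration converges here because every entry of this particular T is positive:

```python
import numpy as np

T = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.6, 0.1],
              [0.4, 0.2, 0.4]])

# Power iteration: repeatedly apply T to some starting distribution
pi = np.array([1.0, 0.0, 0.0])  # any starting distribution works for this T
for _ in range(1000):
    pi = pi @ T

assert np.allclose(pi @ T, pi)  # pi T = pi, so pi is stationary
```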
Important: If we run a Markov chain for a large number of time steps, then after a point we start getting samples xt, xt+1, xt+2, ... which are essentially being drawn from the stationary distribution (Spoiler Alert: one of our goals was to draw samples from a very complex distribution)
What do we mean by running a Markov chain for a large number of time steps?
It means we start by drawing a sample X0 ∼ µ0 and then continue drawing samples
X1 ∼ µ0 T, X2 ∼ µ0 T^2, X3 ∼ µ0 T^3, ..., Xt ∼ π, Xt+1 ∼ π, Xt+2 ∼ π, ...
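Running the chain can be sketched as follows (an illustrative simulation, again on the 3-state example): draw X0 ∼ µ0, then repeatedly draw the next state from the row of T indexed by the current state. After many steps, the empirical distribution of the visited states approaches the stationary distribution π:

```python
import numpy as np

rng = np.random.default_rng(0)
mu0 = np.array([0.3, 0.4, 0.3])
T = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.6, 0.1],
              [0.4, 0.2, 0.4]])

x = rng.choice(3, p=mu0)       # X0 ~ mu0
visits = []
for _ in range(50_000):
    x = rng.choice(3, p=T[x])  # X_i ~ row of T for the current state
    visits.append(x)

# Empirical distribution of visited states ≈ pi (where pi T = pi)
empirical = np.bincount(visits, minlength=3) / len(visits)
```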
Is it always easy to draw these samples? No
|µk| = 2^n, which means that we need to compute the probability of each of the possible 2^n values that Xk can take
In other words, the joint distribution µk has 2^n parameters, which is prohibitively large
I wonder what I can do to reduce the number of parameters in a joint distribution (I hope you already know what to do, but we will return to it later)
The story so far...
We have seen what a discrete space, discrete time, time homogeneous Markov chain is
We have also defined µ0 (initial distribution), T (transition matrix) and π (stationary distribution)
So far so good! But why do we care about Markov chains and their properties?
How does this discussion tie back to our goals?
We will first see an intuitive explanation of how all this ties back to our goals and then get into a more formal discussion
Module 20.2 : Why do we care about Markov Chains?

Recall our goals
Goal 1: Sample from P(X)
Goal 2: Compute E_{P(X)}[f(X)]
[Figure: X ∈ R^1024]
Now suppose we set up a Markov Chain X1, X2, . . . such that
It is easy to draw samples from this chain and
This Markov Chain's stationary distribution is P(X)
Then it would mean that if we run the Markov Chain for long enough, we will start getting samples from P(X)
And once we have a large number of such samples we can empirically estimate E_{P(X)}[f(X)] as

    (1/n) Σ_{i=l}^{l+n} f(X_i)
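The estimate above can be sketched in code. This is a toy example of our own (a hand-picked 2-state chain with f(x) = x, burn-in length l, then an average over the next n samples), not the RBM chain itself:

```python
import numpy as np

# Run a 2-state chain, discard the first l "burn-in" steps, then compute
# (1/n) * sum_{i=l}^{l+n} f(X_i) with f(X) = X.
T = np.array([[0.9, 0.1],           # T[x, y] = P(next = y | current = x)
              [0.2, 0.8]])
rng = np.random.default_rng(0)

x = 0                               # arbitrary initial state
l, n = 500, 50000                   # burn-in length, number of kept samples
total = 0.0
for t in range(l + n):
    x = rng.choice(2, p=T[x])       # one step of the chain
    if t >= l:
        total += x                  # f(X) = X
estimate = total / n

# For this chain the stationary distribution is pi = (2/3, 1/3), so the
# exact value is E_pi[f(X)] = pi[1] = 1/3; the empirical average approaches it.
exact = 1.0 / 3.0
```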
We will now get into a formal discussion to concretize the above intuition

Theorem: If X0, X1, . . . , Xt is an irreducible time homogeneous discrete Markov Chain with stationary distribution π, then

    (1/t) Σ_{i=1}^{t} f(X_i) → Eπ[f(X)] almost surely as t → ∞, where X ∈ X (the state space) and X ∼ π

for any function f : X → R

If, further, the Markov Chain is aperiodic, then P(Xt = x | X0 = x0) → π(x) as t → ∞, ∀x, x0 ∈ X

So Part A of the theorem essentially tells us that if we can set up the chain X0, X1, . . . , Xt such that it is tractable, then using samples from this chain we can compute Eπ[f(X)] (which we know is otherwise intractable)
Similarly, Part B of the theorem says that if we can set up the chain X0, X1, . . . , Xt such that it is tractable, then we can essentially get samples as if they were drawn from π(X) (which was otherwise intractable)
Of course, Part A and Part B are related!
Further, note that it doesn't matter what the initial state was (the theorem holds ∀x, x0 ∈ X)
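Part B can be checked numerically on a toy chain (our own 3-state example; it is irreducible and aperiodic because every entry of T is positive): µ0 T^t converges to the same π regardless of which µ0 we start from.

```python
import numpy as np

# Every row of T sums to 1 and all entries are positive, so the chain is
# irreducible and aperiodic.
T = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.1, 0.5]])

def run(mu0, steps=200):
    """Evolve mu_{k+1} = mu_k T for the given number of steps."""
    mu = mu0.copy()
    for _ in range(steps):
        mu = mu @ T
    return mu

mu_a = run(np.array([1.0, 0.0, 0.0]))   # start deterministically in state 0
mu_b = run(np.array([0.0, 0.0, 1.0]))   # start deterministically in state 2

# Both runs land on the same distribution, which is a fixed point: pi T = pi.
pi = mu_a
```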
So our task is cut out now
Define what our Markov Chain is?
Define the transition matrix T for our Markov Chain
Show how it is easy to sample from this chain
Show that the stationary distribution of this chain is the distribution P(X) (i.e., the distribution that we care about)
Show that the chain is irreducible and aperiodic (because the theorem only holds for such chains)
For ease of notation, instead of X = V1, V2, . . . , Vm, H1, H2, . . . , Hn, we will use X = X1, X2, . . . , Xn+m
Module 20.3 : Setting up a Markov Chain for RBMs

We begin by defining our Markov Chain
Recall that X = {V, H} ∈ {0, 1}^{n+m}, so at time step 0 we create a random vector X ∈ {0, 1}^{n+m}
At time-step 1, we transition to a new value of X
What does this mean? How do we do this transition? Let us see

             V1  V2  V3  ...  Vm  H1  H2  ...  Hn
             X1  X2  X3  ...          ...    Xn+m
    t = 0:    1   1   0  ...          ...      1
    t = 1:    1   0   0  ...          ...      1

We need to transition from a state X = x ∈ {0, 1}^{n+m} to y ∈ {0, 1}^{n+m}
This is how we will do it
Sample a value i ∈ {1 to n + m} using a distribution q(i) (say, uniform distribution)
Fix the value of all variables except Xi
Sample a new value for Xi (could be a V or a H) using the following conditional distribution

    P(Xi = yi | X−i = x−i)

Repeat the above process for many many time steps (each time step corresponds to 1 step of the chain)

             V1  V2  V3  ...  Vm  H1  H2  ...  Hn
             X1  X2  X3  ...          ...    Xn+m
    t = 0:    1   1   0  ...          ...      1
    t = 1:    1   0   0  ...          ...      1
    t = 2:    1   0   1  ...          ...      1
    t = 3:    1   0   1  ...          ...      1
    t = 4:    1   0   1  ...          ...      0
      ...
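One transition of this chain can be sketched as follows. This is our own illustration, not the lecture's code: we keep a hypothetical tiny joint p as an explicit table so that the conditional P(Xi = 1 | X−i = x−i) can be read off by restriction (in the RBM case this table is never materialized, as the next slides show).

```python
import numpy as np

rng = np.random.default_rng(0)
n_vars = 4
p = rng.random(2 ** n_vars)
p /= p.sum()                         # explicit joint, feasible only for tiny n_vars

def idx(x):
    """State vector (list of 0/1) -> index into the flat table p."""
    return int(sum(b << j for j, b in enumerate(x)))

def gibbs_step(x):
    """One transition: pick i ~ q (uniform), resample X_i, keep X_{-i} fixed."""
    i = rng.integers(len(x))         # q(i) = uniform over the variables
    x0, x1 = x.copy(), x.copy()
    x0[i], x1[i] = 0, 1
    z = p[idx(x1)] / (p[idx(x0)] + p[idx(x1)])   # P(X_i = 1 | X_{-i} = x_{-i})
    y = x.copy()
    y[i] = int(rng.random() < z)     # sample the new X_i ~ Bernoulli(z)
    return y

x = [0, 1, 1, 0]                     # arbitrary starting state
chain = [x]
for _ in range(10):
    chain.append(gibbs_step(chain[-1]))
```

Note that, by construction, successive states of the chain differ in at most one coordinate — exactly the restriction the upcoming transition matrix T encodes.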
What are we doing here? How is this related to our goals?
More specifically, we have defined a Markov Chain, but where is our Transition Matrix T?
How is it easy to create this chain (or creating samples x0, x1, . . . , xN)?
How do we show that the stationary distribution is P(X) (where X = V, H)? [We haven't even defined T, then how can we talk about the stationary distribution for T]
Let us answer these questions one by one
First, let us talk about the transition matrix
We have actually defined T although we did not explicitly mention it
What would T contain? The probability of transitioning from any state x to any state y
So T ∈ R^{2^{m+n} × 2^{m+n}} (when did we define such a matrix?)
Actually, we defined a very simple T which allowed only certain types of transitions
In particular, under this T, transitioning from a state x to a state y was possible only if x and y differ in the value of only one of the n + m variables
More formally, we defined T such that

    pxy = q(i) P(yi | x−i),  if ∃i ∈ {1, . . . , n + m} so that ∀v ≠ i, xv = yv
    pxy = 0,                 otherwise

where q(i) is the probability that Xi is the random variable whose value transitions while the value of X−i remains the same
The second term P(Xi = yi | X−i) essentially tells us that, given the values of the remaining random variables, what is the probability of Xi taking on a certain value
With that we have answered the first question "What is the transition matrix T?" (It is a very sparse matrix allowing only certain transitions)
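For a small enough state space this T can be built explicitly and sanity-checked. A sketch under our own assumptions (a toy joint over k = 3 binary variables; for the diagonal entry y = x, every choice of i satisfies the condition, so p_xx collects a contribution from each i):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3
pi = rng.random(2 ** k)
pi /= pi.sum()                          # the target distribution P(X)

states = [[(s >> j) & 1 for j in range(k)] for s in range(2 ** k)]

def cond(i, x, yi):
    """P(X_i = yi | X_{-i} = x_{-i}), read off the explicit joint table."""
    def idx(v):
        return int(sum(b << j for j, b in enumerate(v)))
    x0, x1 = x.copy(), x.copy()
    x0[i], x1[i] = 0, 1
    num = pi[idx(x1)] if yi == 1 else pi[idx(x0)]
    return num / (pi[idx(x0)] + pi[idx(x1)])

# p_xy = sum_i q(i) P(y_i | x_{-i}) over the i's where x and y agree off i.
T = np.zeros((2 ** k, 2 ** k))
q = 1.0 / k                             # uniform q(i)
for a, x in enumerate(states):
    for b, y in enumerate(states):
        for i in range(k):
            if all(x[j] == y[j] for j in range(k) if j != i):
                T[a, b] += q * cond(i, x, y[i])
```

Each row of T sums to 1, T only connects states differing in at most one bit (the promised sparsity), and π satisfies π T = π.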
We now look at the second question: How is it easy to create this chain (or creating samples x0, x1, . . . , xl)?
At each step we are changing only one of the n + m random variables using the following probability

    P(Xi = yi | X−i = x−i) = P(X) / P(X−i)

But how is computing this probability easy? Doesn't the joint distribution on the RHS also have 2^{n+m} parameters?
Well, not really!
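Before specializing to the RBM, note what the ratio buys us. A tiny check with an explicit joint (our own toy example): the marginal P(X−i) needs only a two-term sum over the values of Xi, so the conditional is a Bernoulli distribution, not a full joint.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3
p = rng.random(2 ** k)
p /= p.sum()                         # explicit joint over 2^k states

def idx(x):
    """State vector (list of 0/1) -> index into the flat table p."""
    return int(sum(b << j for j, b in enumerate(x)))

x = [1, 0, 1]                        # current state
i = 1                                # variable chosen for resampling
x0, x1 = x.copy(), x.copy()
x0[i], x1[i] = 0, 1

marginal = p[idx(x0)] + p[idx(x1)]   # P(X_-i = x_-i): only two terms
cond1 = p[idx(x1)] / marginal        # P(X_i = 1 | X_-i = x_-i)
cond0 = p[idx(x0)] / marginal        # P(X_i = 0 | X_-i = x_-i)
```

In the RBM, even these two joint probabilities are never computed — the conditional collapses to a sigmoid, as the next slides show.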
Consider the case when i <= m (i.e., we have decided to transition the value of one of the visible variables V1 to Vm)
Then P(Xi = yi | X−i = x−i) is essentially

    P(Vi = yi | V−i, H) = P(Vi = yi | H) = z,      if yi = 1
                                         = 1 − z,  if yi = 0

    where z = σ(Σ_{j=1}^{n} w_ij h_j + b_i)

The above probability is very easy to compute (just a sigmoid function)
Once you compute the above probability, with probability z you will set the value of Vi to 1 and with probability 1 − z you will set it to 0
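In code, this step is just a sigmoid followed by a Bernoulli draw. The shapes and values below are hypothetical placeholders of ours; W and b play the role of the weights w_ij and visible biases:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3                        # number of visible and hidden units
W = rng.normal(size=(m, n))        # weights w_ij
b = rng.normal(size=m)             # visible biases

h = np.array([1, 0, 1])            # current hidden configuration H
i = 2                              # the visible unit chosen by q(i)

z = 1.0 / (1.0 + np.exp(-(W[i] @ h + b[i])))   # P(V_i = 1 | H)
v_i = int(rng.random() < z)                    # sample V_i ~ Bernoulli(z)
```

By symmetry, resampling a hidden unit H_j given V works the same way with the corresponding column of W and the hidden bias.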
So essentially at every time step you sample an i from a uniform distribution q(i)
And then sample a value of Vi ∈ {0, 1} using the distribution Bernoulli(z)
Both these computations are easy
Hence it is easy to create this chain starting from any x0
Okay, finally let's look at the third question: How do we show that the stationary distribution is P(X) (where X = V, H)?
To prove this we will refer to the following Theorem:
Detailed Balance Condition
To show that a distribution π is a stationary distribution for a Markov Chain described by the transition probabilities pxy, x, y ∈ Ω, it is sufficient to show that ∀x, y ∈ Ω, the following condition holds:

    π(x) pxy = π(y) pyx

Let us revisit what pxy is and what π is
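A quick sanity check of why detailed balance suffices (on a hand-picked 2-state chain of ours, not the RBM chain): summing π(x) p_xy = π(y) p_yx over x gives Σ_x π(x) p_xy = π(y) Σ_x p_yx = π(y), i.e. π T = π.

```python
import numpy as np

T = np.array([[0.9, 0.1],          # T[x, y] = p_xy
              [0.2, 0.8]])
pi = np.array([2.0 / 3.0, 1.0 / 3.0])

# The only pair with x != y: pi(0) p_01 = pi(1) p_10 (= 1/15 here).
balanced = np.isclose(pi[0] * T[0, 1], pi[1] * T[1, 0])

# Detailed balance implies stationarity: pi T = pi.
stationary = np.allclose(pi @ T, pi)
```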
Recall that pxy is given by

pxy = q(i) P(Xi = yi | X−i = x−i),  if ∃i ∈ {1, 2, . . . , n + m} such that ∀j ∈ {1, 2, . . . , n + m} \ {i}: xj = yj
pxy = 0,  otherwise

For consistency of notation we will denote P(X), i.e., P(V, H), as π(X). Further, as shorthand, we will refer to π(X = x) as π(x).

Thus, to prove that P(X), i.e., π(X), is the stationary distribution for our Markov Chain, we need to prove that

π(x)pxy = π(y)pyx  ∀x, y ∈ {0, 1}^(m+n)
To prove: π(x)pxy = π(y)pyx. There are 3 cases that we need to consider.

Case 1: x and y differ in the state of more than one random variable. In this case, by definition,

π(x)pxy = π(x) · 0 = 0
π(y)pyx = π(y) · 0 = 0

Hence the detailed balance condition holds trivially.
Case 2: x and y are equal (i.e., they do not differ in the state of any random variable). In this case, by definition,

π(x)pxy = π(x)pxx
π(y)pyx = π(x)pxx

Hence the detailed balance condition holds trivially.
Case 3: x and y differ in the state of only one random variable, say Xi. In this case, by definition,

π(x)pxy = π(x) q(i) π(yi|x−i)
        = π(xi, x−i) q(i) π(yi, x−i)/π(x−i)
        = π(yi, x−i) q(i) π(xi, x−i)/π(x−i)
        = π(y) q(i) π(xi|x−i)
        = π(y)pyx

Hence the detailed balance condition holds.
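The three cases can also be checked numerically. The sketch below (assumptions: a toy chain over 3 binary variables, an arbitrary strictly positive joint π, uniform q(i)) builds the full Gibbs transition matrix and verifies π(x)pxy = π(y)pyx entrywise:

```python
import itertools
import numpy as np

n = 3
states = list(itertools.product([0, 1], repeat=n))
idx = {s: k for k, s in enumerate(states)}

rng = np.random.default_rng(1)
pi = rng.random(2 ** n)
pi /= pi.sum()                     # arbitrary strictly positive joint pi(x)

def cond(i, x):
    """pi(X_i = 1 | x_{-i}), computed by brute force from the joint."""
    x1, x0 = list(x), list(x)
    x1[i], x0[i] = 1, 0
    p1, p0 = pi[idx[tuple(x1)]], pi[idx[tuple(x0)]]
    return p1 / (p1 + p0)

# Gibbs kernel: p_xy accumulates q(i) * pi(X_i = y_i | x_{-i}) over coordinates i,
# restricted to pairs x, y that agree outside coordinate i.
P = np.zeros((2 ** n, 2 ** n))
for x in states:
    for i in range(n):
        for yi in (0, 1):
            y = list(x); y[i] = yi
            p_yi = cond(i, x) if yi == 1 else 1.0 - cond(i, x)
            P[idx[x], idx[tuple(y)]] += (1.0 / n) * p_yi

D = pi[:, None] * P                # D[x, y] = pi(x) * p_xy
```

Symmetry of D is exactly the detailed balance condition, and it implies π is stationary (πP = π).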
Thus we have proved that the detailed balance condition π(x)pxy = π(y)pyx holds in all 3 cases:

Case 1: x and y differ in the state of more than one random variable.
Case 2: x and y are equal (i.e., they do not differ in the state of any random variable).
Case 3: x and y differ in the state of only one random variable.
So our task is cut out now
Define what our Markov Chain is? (done)
Define the transition matrix T for our Markov Chain (done)
Show how it is easy to sample from this chain (done)
Show that the stationary distribution of this chain is the distribution P (X)
(i.e., the distribution that we care about) (done)
Show that the chain is irreducible and aperiodic (let us see)

A Markov chain is irreducible if one can get from any state in Ω to any other state in a finite number of transitions, or more formally:

∀i, j ∈ Ω ∃k > 0 with P(X^(k) = j | X^(0) = i) > 0

Intuitively, we can see that our chain is irreducible. For example, notice that we can reach the state containing all 1's from the state containing all 0's after some finite number of time steps. We can prove this more formally, but for now we will just rely on the intuition.
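The intuition can be made concrete with a small reachability check: under the Gibbs kernel (with strictly positive conditionals), every state has positive-probability transitions to itself and to every state at Hamming distance one, so a breadth-first search over this graph should visit the whole state space. A sketch for a toy case of 4 binary variables:

```python
from itertools import product
from collections import deque

n = 4
states = list(product([0, 1], repeat=n))

def neighbors(x):
    """States reachable from x in one Gibbs step with positive probability:
    x itself, plus every state differing from x in exactly one coordinate."""
    yield x
    for i in range(n):
        y = list(x)
        y[i] = 1 - y[i]
        yield tuple(y)

# BFS from the all-zeros state
start = states[0]
seen = {start}
queue = deque([start])
while queue:
    x = queue.popleft()
    for y in neighbors(x):
        if y not in seen:
            seen.add(y)
            queue.append(y)
```

Since the search reaches every state (including all 1's), the chain is irreducible.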
A chain is called aperiodic if ∀i ∈ Ω the greatest common divisor of {k | P(X^(k) = i | X^(0) = i) > 0 ∧ k ∈ N0} is 1.

The set defined above contains all the time steps at which we can return to state i starting from state i. Suppose the chain were periodic; then this set would contain only multiples of a certain number, for example {3, 6, 9, 12, . . .}, and hence the greatest common divisor would be 3 (and the Markov Chain would be periodic with a period of 3). However, if the chain is not periodic, then the set contains arbitrary numbers whose GCD is just 1 (hence the above definition).
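The definition can be illustrated numerically on two toy 2-state chains (these are illustrative examples, not our RBM chain): a deterministic flip chain, which returns to a state only at even times, versus a "lazy" chain with self-loops:

```python
from functools import reduce
import math
import numpy as np

def return_time_gcd(P, state, k_max=12):
    """gcd of {k : P(X^(k) = state | X^(0) = state) > 0, 1 <= k <= k_max}."""
    Pk = np.eye(len(P))
    times = []
    for k in range(1, k_max + 1):
        Pk = Pk @ P                       # k-step transition probabilities
        if Pk[state, state] > 1e-12:
            times.append(k)
    return reduce(math.gcd, times)

flip = np.array([[0.0, 1.0],
                 [1.0, 0.0]])             # deterministic flip: returns only at k = 2, 4, 6, ...
lazy = np.array([[0.5, 0.5],
                 [0.5, 0.5]])             # self-loops possible: returns at every k >= 1
```

The flip chain has period 2; the lazy chain has gcd 1 and is aperiodic, just like our Gibbs chain, whose states can be revisited at consecutive time steps.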
Again, intuitively it should be clear that our chain is aperiodic. Once again, we can prove this formally, but we will just rely on the intuition for now.
So our task is cut out now
Define what our Markov Chain is? (done)
Define the transition matrix T for our Markov Chain (done)
Show how it is easy to sample from this chain (done)
Show that the stationary distribution of this chain is the distribution P (X)
(i.e., the distribution that we care about) (done)
Show that the chain is irreducible and aperiodic (done)

Module 20.4 : Training RBMs using Gibbs Sampling

Okay, so we are now ready to write the full algorithm for training RBMs using
Gibbs Sampling
We will first quickly revisit the expectations that we wanted to compute and
write a simplified expression for them

[Figure: RBM with visible units v1, . . . , vm (V ∈ {0, 1}^m, biases b1, . . . , bm), hidden units h1, . . . , hn (H ∈ {0, 1}^n, biases c1, . . . , cn), and weights W ∈ R^(m×n)]

E(V, H) = −Σi Σj wij vi hj − Σi bi vi − Σj cj hj

∂L(θ)/∂wij = −Σ_H p(H|V) ∂E(V, H)/∂wij + Σ_(V,H) p(V, H) ∂E(V, H)/∂wij
           = Σ_H p(H|V) hi vj − Σ_(V,H) p(V, H) hi vj
           = E_p(H|V)[hi vj] − E_p(V,H)[hi vj]

We were interested in computing the partial derivative of the log-likelihood w.r.t. one of the parameters (wij). We saw that this partial derivative is actually the difference of two expectation terms. We will now simplify the expression for these two expectations.
∂L(θ)/∂wij = E_p(H|V)[hi vj] − E_p(V,H)[hi vj]
           = Σh p(h|v) hi vj − Σ(v,h) p(v, h) hi vj
           = Σh p(h|v) hi vj − Σv p(v) Σh p(h|v) hi vj

We will first focus on Σh p(h|v) hi vj:

Σh p(h|v) hi vj = Σ_hi Σ_h−i p(hi|v) p(h−i|v) hi vj
                = Σ_hi p(hi|v) hi vj Σ_h−i p(h−i|v)
                = p(Hi = 1|v) vj
                = σ(Σ_{j=1}^m wij vj + ci) vj

∂L(θ)/∂wij = σ(Σ_{j=1}^m wij vj + ci) vj − Σv p(v) σ(Σ_{j=1}^m wij vj + ci) vj
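The last step above (the sum over h collapsing to σ(·)vj) can be sanity-checked by brute-force enumeration on a toy RBM (the sizes and parameters below are arbitrary, chosen only for the check):

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
m, n = 4, 3                          # toy numbers of visible / hidden units
W = rng.normal(size=(n, m))          # W[i, j] connects hidden unit i to visible unit j
c = rng.normal(size=n)               # hidden biases
v = np.array([1, 0, 1, 1])           # an arbitrary visible configuration

p_h1 = sigmoid(W @ v + c)            # p(H_i = 1 | v) for each hidden unit

# Brute force: sum over all h of p(h|v) * h_i * v_j, using p(h|v) = prod_i p(h_i|v)
i, j = 0, 0
brute = 0.0
for h in itertools.product([0, 1], repeat=n):
    h = np.array(h)
    p_h = np.prod(np.where(h == 1, p_h1, 1.0 - p_h1))
    brute += p_h * h[i] * v[j]

closed_form = p_h1[i] * v[j]         # sigma(w_i . v + c_i) * v_j
```

The enumerated sum and the closed form agree to numerical precision.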
∂L(θ)/∂wij = σ(Σ_{j=1}^m wij vj + ci) vj − Σv p(v) σ(Σ_{j=1}^m wij vj + ci) vj
           = σ(wi v + ci) vj − Σv p(v) σ(wi v + ci) vj    (wi is the i-th row of W)

∇W L(θ) = σ(Wv + c) v^T − Σv p(v) σ(Wv + c) v^T
        = σ(Wv + c) v^T − Ev[σ(Wv + c) v^T]
∂L(θ)/∂bj = E_p(H|V)[vj] − E_p(V,H)[vj]
          = Σh p(h|v) vj − Σ(v,h) p(v, h) vj
          = Σh p(h|v) vj − Σv p(v) Σh p(h|v) vj
          = vj Σh p(h|v) − Σv p(v) vj Σh p(h|v)
          = vj − Σv p(v) vj

∇b L(θ) = v − Σv p(v) v
        = v − Ev[v]
∂L(θ)/∂ci = E_p(H|V)[hi] − E_p(V,H)[hi]
          = Σh p(h|v) hi − Σ(v,h) p(v, h) hi
          = Σh p(h|v) hi − Σv p(v) Σh p(h|v) hi
          = p(Hi = 1|v) − Σv p(v) p(Hi = 1|v)
          = σ(Σ_{j=1}^m wij vj + ci) − Σv p(v) σ(Σ_{j=1}^m wij vj + ci)

∇c L(θ) = σ(Wv + c) − Σv p(v) σ(Wv + c)
        = σ(Wv + c) − Ev[σ(Wv + c)]
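The first (data-dependent) term of each of the three gradients is cheap to compute for a given visible vector v. A minimal NumPy sketch (names and sizes are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def data_dependent_grads(v, W, b, c):
    """First term of each gradient for one visible sample v:
    grad_W = sigma(Wv + c) v^T,  grad_b = v,  grad_c = sigma(Wv + c)."""
    s = sigmoid(W @ v + c)          # p(H_i = 1 | v) for every hidden unit
    return np.outer(s, v), v, s

rng = np.random.default_rng(0)
m, n = 6, 4                         # toy numbers of visible / hidden units
W = 0.1 * rng.normal(size=(n, m))
b, c = np.zeros(m), np.zeros(n)
v = rng.integers(0, 2, size=m).astype(float)
gW, gb, gc = data_dependent_grads(v, W, b, c)
```

Only the second terms (the expectations under p(V, H)) remain hard, which is what the next step addresses.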
Notice that all the 3 gradient
k
1X expressions have an expectation
Ev [σ(Wv + c)v T ] ≈ σ(Wv (k) + c)v (k)T term
k
i=1
k These expectations are intractable.
1 X
(k)
Ev [v] ≈ v Solution? Estimation with the help
k
i=1 of sampling
k
1 X Specifically, we will use Gibbs
Ev [σ(Wv + c)] ≈ σ(Wv (k) + c) Sampling to estimate the
k
i=1
expectation

53/61
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
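The sample-average idea can be sketched as follows: given samples v^(1), ..., v^(k), each expectation is replaced by the mean of the corresponding function over the samples. In this toy check the samples come from a known independent Bernoulli distribution rather than from an RBM's Gibbs chain (an assumption purely for illustration), so we can verify the estimator against the known mean.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 4, 3, 200_000
W = rng.normal(size=(n, m))
c = rng.normal(size=n)
p_true = np.array([0.1, 0.4, 0.6, 0.9])  # true visible-unit probabilities

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Draw k samples v^(1), ..., v^(k) from the stand-in distribution
samples = (rng.random((k, m)) < p_true).astype(float)

# Monte Carlo estimates of the two vector-valued expectations
E_v_hat = samples.mean(axis=0)                       # ≈ E_v[v]
E_sig_hat = sigmoid(samples @ W.T + c).mean(axis=0)  # ≈ E_v[σ(Wv + c)]

# The sample mean of v converges to the true mean p_true
assert np.allclose(E_v_hat, p_true, atol=0.01)
print(E_v_hat, E_sig_hat)
```

With k = 200,000 samples the standard error per coordinate is on the order of 10^-3, so the estimate lands well within the asserted tolerance.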
Algorithm: RBM Training with Block Gibbs Sampling

Input: RBM (V_1, ..., V_m, H_1, ..., H_n), training batch D
Output: Learned parameters W, b, c

init W, b, c
forall v_d ∈ D do
    Randomly initialize v^(0)
    for t = 0, ..., k, k+1, ..., k+r do
        for i = 1, ..., n do
            sample h_i^(t) ∼ p(h_i | v^(t))
        end
        for j = 1, ..., m do
            sample v_j^(t+1) ∼ p(v_j | h^(t))
        end
    end
    W ← W + η[σ(Wv_d + c)v_d^T − (1/r) Σ_{t=k+1}^{k+r} σ(Wv^(t) + c)v^(t)T]
    b ← b + η[v_d − (1/r) Σ_{t=k+1}^{k+r} v^(t)]
    c ← c + η[σ(Wv_d + c) − (1/r) Σ_{t=k+1}^{k+r} σ(Wv^(t) + c)]
end

54/61
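The training loop above can be sketched in code roughly as follows. This is a minimal NumPy version for binary units; the layer sizes, burn-in length k, sample count r, and learning rate η are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_gibbs(D, n_hidden, k=10, r=5, eta=0.01):
    """One sweep of block-Gibbs RBM training over batch D (rows = examples)."""
    m = D.shape[1]
    W = rng.normal(scale=0.01, size=(n_hidden, m))
    b = np.zeros(m)         # visible biases
    c = np.zeros(n_hidden)  # hidden biases
    for v_d in D:
        v = (rng.random(m) < 0.5).astype(float)  # random v^(0)
        model_vs = []
        for t in range(k + r):
            # Block-sample all hidden units, then all visible units
            h = (rng.random(n_hidden) < sigmoid(W @ v + c)).astype(float)
            v = (rng.random(m) < sigmoid(W.T @ h + b)).astype(float)
            if t >= k:                 # keep the r samples after burn-in
                model_vs.append(v)
        model_vs = np.array(model_vs)
        # Data-dependent (positive) term minus model-dependent (negative) term
        pos_W = np.outer(sigmoid(W @ v_d + c), v_d)
        neg_W = np.mean([np.outer(sigmoid(W @ vm + c), vm) for vm in model_vs],
                        axis=0)
        W += eta * (pos_W - neg_W)
        b += eta * (v_d - model_vs.mean(axis=0))
        c += eta * (sigmoid(W @ v_d + c) - sigmoid(model_vs @ W.T + c).mean(axis=0))
    return W, b, c

# Tiny usage example on random binary data
D = (rng.random((8, 6)) < 0.5).astype(float)
W, b, c = train_rbm_gibbs(D, n_hidden=4)
print(W.shape, b.shape, c.shape)
```

Note how expensive this is: every training example requires its own k + r Gibbs steps, which motivates the contrastive divergence shortcut introduced next in the lecture.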
Module 20.5 : Training RBMs using Contrastive Divergence

55/61
In practice, Gibbs Sampling can be very inefficient because for every step of stochastic gradient descent we need to run the Markov chain for many, many steps and then compute the expectation using the samples drawn from this chain.

We will now see a more efficient algorithm, called k-step contrastive divergence, which is used in practice for training RBMs.

56/61
Just to reiterate, our goal is to compute the two expectations efficiently:

E_{p(H|V)}[v_j h_i] = σ(w_i v + c_i) v_j
E_{p(V,H)}[v_j h_i] = Σ_v p(v) σ(w_i v + c_i) v_j

We already have a simplified formula for the first expectation.

Furthermore, note that the first expectation depends only on the seen training example (v).

The second expectation depends on the samples drawn from the Markov chain (v^(1), v^(2), ..., v^(n)).

The first expectation thus depends on the empirical samples, whereas the second expectation depends on the model samples (because the samples are generated based on P(V|H) and P(H|V) output by the model).

57/61
[Figure: Gibbs chain  v_s → (sample h ∼ p(h|v)) → (sample v ∼ p(v|h)) → v^(1) → ... → v^(k) = ṽ]

Contrastive divergence uses the following idea:

Instead of starting the Markov chain at a random point (V = v^(0)), start from v^(t), where v^(t) is the current training instance.

Run Gibbs Sampling for k steps and denote the sample at the k-th step by ṽ.

Replace the expectation by a point estimate:

E_{p(V,H)}[v_j h_i] = Σ_v p(v) σ(w_i v + c_i) v_j ≈ σ(w_i ṽ + c_i) ṽ_j

58/61
Over time, as our model becomes better and better, ṽ should start looking more and more like our training (empirical) samples.

Once that starts happening, what will happen to the gradient?

We consider the derivative w.r.t. w_ij again:

∂L(θ)/∂w_ij = σ(w_i v + c_i) v_j − Σ_v p(v) σ(w_i v + c_i) v_j

We have two terms here.

The first term can be thought of as a point estimate computed from a single training example v.

Similarly, in the second term, the summation over v is replaced by a point estimate computed from the model sample ṽ.

As training progresses and ṽ (the model sample) starts looking more and more like our training (empirical) samples, the difference between the two terms becomes small and the parameters of the model stabilize (convergence).

59/61
Algorithm: k-step Contrastive Divergence

Input: RBM (V_1, ..., V_m, H_1, ..., H_n), training batch D
Output: Learned parameters W, b, c

init W = b = c = 0
forall v ∈ D do
    Initialize v^(0) ← v
    for t = 0, ..., k do
        for i = 1, ..., n do
            sample h_i^(t) ∼ p(h_i | v^(t))
        end
        for j = 1, ..., m do
            sample v_j^(t+1) ∼ p(v_j | h^(t))
        end
    end
    ṽ ← v^(k)
    W ← W + η[σ(Wv + c)v^T − σ(Wṽ + c)ṽ^T]
    b ← b + η[v − ṽ]
    c ← c + η[σ(Wv + c) − σ(Wṽ + c)]
end

60/61
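The k-step procedure above might be sketched as follows. This is a minimal NumPy version for binary units; the layer sizes and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v, W, b, c, k=1, eta=0.1):
    """One CD-k parameter update from a single training example v."""
    v_tilde = v.copy()
    # Gibbs chain started at the training example, not at a random point
    for _ in range(k):
        h = (rng.random(c.shape) < sigmoid(W @ v_tilde + c)).astype(float)
        v_tilde = (rng.random(b.shape) < sigmoid(W.T @ h + b)).astype(float)
    # Point estimates replace both expectations: data term minus model term
    W = W + eta * (np.outer(sigmoid(W @ v + c), v)
                   - np.outer(sigmoid(W @ v_tilde + c), v_tilde))
    b = b + eta * (v - v_tilde)
    c = c + eta * (sigmoid(W @ v + c) - sigmoid(W @ v_tilde + c))
    return W, b, c

# Usage: a few CD-1 sweeps over a toy batch of random binary vectors
m, n = 6, 4
W = rng.normal(scale=0.01, size=(n, m))
b, c = np.zeros(m), np.zeros(n)
D = (rng.random((10, m)) < 0.5).astype(float)
for _ in range(5):
    for v in D:
        W, b, c = cd_k_update(v, W, b, c, k=1)
print(W.shape)
```

Unlike the block-Gibbs loop, there is no burn-in phase and no averaging over many chain samples: a single short chain per example gives the point estimate, which is what makes CD-k cheap enough for practical training.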
[Figure: Gibbs chain  v_s → (sample h ∼ p(h|v)) → (sample v ∼ p(v|h)) → v^(1) → ... → v^(k) = ṽ]

In practice, k = 1 also works well.

The higher the value of k, the less biased the estimate of the gradient will be.

61/61
