
CS7015 (Deep Learning) : Lecture 20

Markov Chains, Gibbs Sampling for Training RBMs, Contrastive Divergence for training RBMs

Mitesh M. Khapra

Department of Computer Science and Engineering

Indian Institute of Technology Madras
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
Module 20.1 : Markov Chains

Let us first begin by restating our goals
Goal 1: Given a random variable X ∈ R^n, we are interested in drawing samples from the joint distribution P(X)
Goal 2: Given a function f(X) defined over the random variable X, we are interested in computing the expectation E_P(X)[f(X)]
[Figure: an example image, X ∈ R^1024, and the quantity E_P(X)[f(X)]]
We will use Gibbs Sampling (a member of the Metropolis-Hastings class of algorithms) to achieve these goals
We will first understand the intuition behind Gibbs Sampling and then the math behind it
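Goal 2 hints at why sampling matters: once Goal 1 is solved and we can draw samples from P(X), the expectation reduces to a simple average. A minimal sketch (not from the lecture; the Bernoulli stand-in for P(X) and the function f are illustrative choices so the estimate can be checked):

```python
import numpy as np

# Monte Carlo estimate: E_P(X)[f(X)] ≈ (1/N) Σ f(x_i) over samples x_i ~ P(X)
rng = np.random.default_rng(0)

# Stand-in for "samples from P(X)": here P is deliberately a known
# Bernoulli(0.7) over a single bit, so the estimate can be checked by hand.
samples = (rng.random(100_000) < 0.7).astype(float)

def f(x):
    # an arbitrary function of the random variable
    return 3.0 * x + 1.0

estimate = f(samples).mean()   # ≈ E[f(X)] = 3 * 0.7 + 1 = 3.1
```

The hard part, and the subject of this lecture, is producing the samples in the first place when P(X) is complex.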
Suppose instead of a single random variable X ∈ R^n, we have a chain of random variables X1, X2, ..., XK, each Xi ∈ R^n
The i here corresponds to a time step
For example, Xi could be an n-dimensional vector containing the number of customers in a given set of n restaurants on day i
In our case, Xi could be a 1024-dimensional image sent by our friend on day i
For ease of illustration we will stick to the restaurant example and assume that instead of actual counts we are interested only in binary counts (high = 1, low = 0)
Thus Xi ∈ {0, 1}^n
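Since each Xi ∈ {0, 1}^n, the chain lives on a finite set of 2^n states. A small sketch (illustrative; `state_to_index` is our own hypothetical helper) of enumerating the binary count vectors and mapping each to a state index:

```python
import itertools

n = 3  # number of restaurants (kept tiny; l = 2**n grows very fast)

# All l = 2**n binary count vectors (high = 1, low = 0)
states = list(itertools.product([0, 1], repeat=n))
l = len(states)  # 2**3 = 8

def state_to_index(x):
    # read the bits as a base-2 number, e.g. (1, 0, 1) -> 5
    return int("".join(map(str, x)), 2)

# itertools.product enumerates in exactly this base-2 order, so the
# mapping is a bijection between vectors and indices 0..l-1
assert state_to_index((1, 0, 1)) == 5
assert states[5] == (1, 0, 1)
```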
On day 1, let X1 take on the value x1 (x1 is one of the possible 2^n vectors)
On day 2, let X2 take on the value x2 (x2 is again one of the possible 2^n vectors)
One way of looking at this is that the state has transitioned from x1 to x2
Similarly, on day 3, if X3 takes on the value x3, then we can say that the state has transitioned from x1 to x2 to x3
Finally, on day k, we can say that the state has transitioned from x1 to x2 to x3 to ... to xk
We may now be interested in knowing what is the most likely value that the state will take on day i given the states on day 1 to day i − 1
More formally, we may be interested in the following distribution
P(Xi = xi | X1 = x1, X2 = x2, ..., Xi−1 = xi−1)
Now suppose the chain exhibits the following Markov property
P(Xi = xi | X1 = x1, X2 = x2, ..., Xi−1 = xi−1) = P(Xi = xi | Xi−1 = xi−1)
In other words, given the previous state Xi−1, Xi is independent of all preceding states
Can we draw a graphical model to encode this independence assumption?
In this graphical model, the random variables are X1, X2, ..., Xk
We will have a node corresponding to each of these random variables
What will be the edges in the graph?
Well, each node only depends on its predecessor, so we will just have an edge between successive nodes
X1 → X2 → ··· → Xk
This property (Xi ⊥ {X1, ..., Xi−2} | Xi−1) is called the Markov property
And the resulting chain X1, X2, ..., Xk is called a Markov chain
Further, since we are considering discrete time steps, this is called a discrete time Markov chain
Further, since the Xi's take on discrete values, this is called a discrete time, discrete space Markov chain
Okay, but why are we interested in Markov chains? (we will get there soon! for now let us just focus on these definitions)
Let us delve a bit deeper into Markov chains and define a few more quantities
Recall that each Xi ∈ {0, 1}^n
Let us assume 2^n = l (i.e., Xi can take l values)
How many values do we need to specify the distribution P(Xi = xi | Xi−1 = xi−1)? (l^2)
We can represent this as a matrix T ∈ R^{l × l} where the entry Ta,b of the matrix denotes the probability of transitioning to state b from state a (i.e., P(Xi = b | Xi−1 = a))
The matrix T is called the transition matrix

Xi−1   Xi    Tab
1      1     0.05
1      2     0.06
...    ...   ...
1      l     0.02
2      1     0.03
2      2     0.07
...    ...   ...
2      l     0.01
...    ...   ...
l      1     0.1
l      2     0.09
...    ...   ...
l      l     0.21
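Each row of T is a conditional distribution over next states, so every row must be non-negative and sum to 1. A quick numpy sketch, using the values of the 3-state example that appears in this lecture:

```python
import numpy as np

# Transition matrix for l = 3 states: entry T[a, b] is P(X_i = b | X_{i-1} = a),
# so each row of T is a probability distribution over the next state.
T = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.6, 0.1],
              [0.4, 0.2, 0.4]])

assert np.all(T >= 0)
assert np.allclose(T.sum(axis=1), 1.0)  # every row sums to 1

# A transition probability is then just a table lookup
p_0_to_2 = T[0, 2]  # P(X_i = 2 | X_{i-1} = 0) = 0.3
```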
We need to define this transition matrix Tab, i.e.,
P(Xi = b | Xi−1 = a) ∀a, b ∀i
Why do we need to define this ∀i? Well, because these transition probabilities may be different for different time steps
For example, the transition in the number of customers may be different from Friday to Saturday (weekend) as compared to from Sunday to Monday (weekday)
Thus, for a Markov chain X1, X2, ..., Xk we will have k such transition matrices T1, T2, ..., Tk
However, for this discussion we will assume that the Markov chain is time homogeneous
What does that mean? It means that
T1 = T2 = ··· = Tk = T
In other words
P(Xi = b | Xi−1 = a) = Tab ∀a, b ∀i
The transition matrix does not depend on the time i and hence such a Markov chain is called time homogeneous
Now suppose the starting distribution at time step 0 is given by µ0
Just to be clear, µ0 is a 2^n-dimensional vector such that
µ0_a = P(X0 = a)
µ0_a is the probability that the random variable takes on the value a among all the possible 2^n values
Given µ0 and T, how will you compute µk where
µk_a = P(Xk = a)?
µk is again a 2^n-dimensional vector whose a-th entry tells us the probability that Xk will take on the value a among all the possible 2^n values
Let us consider P(X1 = b)
P(X1 = b) = Σ_a P(X0 = a, X1 = b)
The above sum essentially captures all the paths of reaching X1 = b irrespective of the value of X0
P(X1 = b) = Σ_a P(X0 = a, X1 = b)
          = Σ_a P(X0 = a) P(X1 = b | X0 = a)
          = Σ_a µ0_a Tab
Let us see if there is a more compact way of writing the distribution P(X1) (i.e., of specifying P(X1 = b) ∀b)
Let us consider a simple case when l = 3 (as opposed to 2^n)
Thus, µ0 ∈ R^3 and T ∈ R^{3×3}
[Figure: a 3-state transition diagram whose edge weights are the entries of T]
What does the product µ0 T give us?

µ0 T = [0.3 0.4 0.3] [0.2 0.5 0.3]
                     [0.3 0.6 0.1]
                     [0.4 0.2 0.4]
     = [0.3 0.45 0.25]

It gives us the distribution µ1! (the b-th entry of this vector is Σ_a µ0_a Tab, which is P(X1 = b))
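This computation can be checked directly in numpy: with µ0 as a row vector, the b-th entry of `mu0 @ T` is Σ_a µ0_a Tab = P(X1 = b).

```python
import numpy as np

# The 3-state example: mu0 is the starting distribution (a row vector),
# T the transition matrix; mu0 @ T marginalizes out X0.
mu0 = np.array([0.3, 0.4, 0.3])
T = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.6, 0.1],
              [0.4, 0.2, 0.4]])

mu1 = mu0 @ T  # b-th entry is sum_a mu0[a] * T[a, b] = P(X1 = b)
# mu1 is [0.3, 0.45, 0.25] (up to float rounding), matching the slide
```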
Let us consider P (X2 = b)
X0 X1 X2

2 b

.. .. ..
. . .

15/61
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
Let us consider P (XX
2 = b)
X0 X1 X2 P (X2 = b) = P (X1 = a, X2 = b)
a
1

2 b

.. .. ..
. . .

15/61
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
Let us consider P (XX
2 = b)
X0 X1 X2 P (X2 = b) = P (X1 = a, X2 = b)
a
1 The above sum essentially captures all the paths
of reaching X2 = b irrespective of the value of X1
2 b

.. .. ..
. . .

15/61
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
Let us consider P (XX
2 = b)
X0 X1 X2 P (X2 = b) = P (X1 = a, X2 = b)
a
1 The above sum essentially captures all the paths
of reaching X2 = b irrespective of the value of X1
2 b X
P (X2 = b) = P (X1 = a, X2 = b)
a
.. .. ..
. . .

15/61
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
Let us consider P (XX
2 = b)
X0 X1 X2 P (X2 = b) = P (X1 = a, X2 = b)
a
1 The above sum essentially captures all the paths
of reaching X2 = b irrespective of the value of X1
2 b X
P (X2 = b) = P (X1 = a, X2 = b)
a
.. .. .. X
. . . = P (X1 = a)P (X2 = b|X1 = a)
a
l

15/61
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
Let us consider P (XX
2 = b)
X0 X1 X2 P (X2 = b) = P (X1 = a, X2 = b)
a
1 The above sum essentially captures all the paths
of reaching X2 = b irrespective of the value of X1
2 b X
P (X2 = b) = P (X1 = a, X2 = b)
a
.. .. .. X
. . . = P (X1 = a)P (X2 = b|X1 = a)
a
X
l = µ1a Tab
a

15/61
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
Once again we can write P(X2) compactly as
P(X2) = µ1 T = (µ0 T) T = µ0 T^2
In general,
P(Xk) = µ0 T^k
Thus the distribution at any time step can be computed by finding the appropriate element from the following series
µ0 T^1, µ0 T^2, µ0 T^3, ..., µ0 T^k, ...
Note that this is still computationally expensive because it involves a product of µ0 (2^n) and T^k (2^n × 2^n) (but later on we will see that we do not need this full product)
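In the small l = 3 case the series is cheap to compute. A sketch (`distribution_at` is our own illustrative helper) that builds µ0 T^k one vector-matrix product at a time, so the matrix power T^k is never formed explicitly:

```python
import numpy as np

mu0 = np.array([0.3, 0.4, 0.3])
T = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.6, 0.1],
              [0.4, 0.2, 0.4]])

def distribution_at(k, mu0, T):
    # mu0 T^k via k successive vector-matrix products
    mu = mu0.copy()
    for _ in range(k):
        mu = mu @ T
    return mu

# Sanity check against the explicit matrix power
assert np.allclose(distribution_at(10, mu0, T),
                   mu0 @ np.linalg.matrix_power(T, 10))
```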
If at a certain time step t, µt reaches a distribution π such that πT = π
Then for all subsequent time steps µj = π (j ≥ t)
π is then called the stationary distribution of the Markov chain
Xt, Xt+1, Xt+2, ... will all follow the same distribution π
In other words, if we have Xt = xt, Xt+1 = xt+1, Xt+2 = xt+2 and so on, then we can think of xt, xt+1, xt+2 as samples drawn from the same distribution π (this is a crucial property and we will return to it soon)
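For the 3-state example, a π with πT = π can be found by simply iterating µ ↦ µT until it stops changing. This is a sketch, not a general-purpose solver; the iteration converges here because every entry of this particular T is positive:

```python
import numpy as np

T = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.6, 0.1],
              [0.4, 0.2, 0.4]])

# Power iteration: repeatedly apply T to some starting distribution
pi = np.array([1.0, 0.0, 0.0])  # any starting distribution works for this T
for _ in range(1000):
    pi = pi @ T

assert np.allclose(pi @ T, pi)  # pi T = pi, so pi is stationary
```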
Important: If we run a Markov chain for a large number of time steps, then after a point we start getting samples xt, xt+1, xt+2, ... which are essentially being drawn from the stationary distribution (Spoiler Alert: one of our goals was to draw samples from a very complex distribution)
What do we mean by running a Markov chain for a large number of time steps?
It means we start by drawing a sample X0 ∼ µ0 and then continue drawing samples
X1 ∼ µ0 T, X2 ∼ µ0 T^2, X3 ∼ µ0 T^3, ..., Xt ∼ π, Xt+1 ∼ π, Xt+2 ∼ π, ...
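Running the chain can be sketched as follows (an illustrative simulation, again on the 3-state example): draw X0 ∼ µ0, then repeatedly draw the next state from the row of T indexed by the current state. After many steps, the empirical distribution of the visited states approaches the stationary distribution π:

```python
import numpy as np

rng = np.random.default_rng(0)
mu0 = np.array([0.3, 0.4, 0.3])
T = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.6, 0.1],
              [0.4, 0.2, 0.4]])

x = rng.choice(3, p=mu0)       # X0 ~ mu0
visits = []
for _ in range(50_000):
    x = rng.choice(3, p=T[x])  # X_i ~ row of T for the current state
    visits.append(x)

# Empirical distribution of visited states ≈ pi (where pi T = pi)
empirical = np.bincount(visits, minlength=3) / len(visits)
```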
Is it always easy to draw these samples? No
|µk| = 2^n, which means that we need to compute the probability of each of the possible 2^n values that Xk can take
In other words, the joint distribution µk has 2^n parameters, which is prohibitively large
I wonder what I can do to reduce the number of parameters in a joint distribution (I hope you already know what to do, but we will return to it later)
The story so far...
We have seen what a discrete space, discrete time, time homogeneous Markov chain is
We have also defined µ0 (initial distribution), T (transition matrix) and π (stationary distribution)
So far so good! But why do we care about Markov chains and their properties?
How does this discussion tie back to our goals?
We will first see an intuitive explanation of how all this ties back to our goals and then get into a more formal discussion
Module 20.2 : Why do we care about Markov Chains?

Recall our goals
Goal 1: Sample from P(X)
Goal 2: Compute E_{P(X)}[f(X)]
[Figure: X ∈ R^1024]
Now suppose we set up a Markov Chain X1, X2, . . . such that
It is easy to draw samples from this chain and
This Markov Chain's stationary distribution is P(X)
Then it would mean that if we run the Markov Chain for long enough, we will start getting samples from P(X)
And once we have a large number of such samples we can empirically estimate E_{P(X)}[f(X)] as

    (1/n) Σ_{i=l}^{l+n} f(X_i)
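The estimate above can be sketched in code. This is a toy example of our own (a hand-picked 2-state chain with f(x) = x, burn-in length l, then an average over the next n samples), not the RBM chain itself:

```python
import numpy as np

# Run a 2-state chain, discard the first l "burn-in" steps, then compute
# (1/n) * sum_{i=l}^{l+n} f(X_i) with f(X) = X.
T = np.array([[0.9, 0.1],           # T[x, y] = P(next = y | current = x)
              [0.2, 0.8]])
rng = np.random.default_rng(0)

x = 0                               # arbitrary initial state
l, n = 500, 50000                   # burn-in length, number of kept samples
total = 0.0
for t in range(l + n):
    x = rng.choice(2, p=T[x])       # one step of the chain
    if t >= l:
        total += x                  # f(X) = X
estimate = total / n

# For this chain the stationary distribution is pi = (2/3, 1/3), so the
# exact value is E_pi[f(X)] = pi[1] = 1/3; the empirical average approaches it.
exact = 1.0 / 3.0
```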
We will now get into a formal discussion to concretize the above intuition

Theorem: If X0, X1, . . . , Xt is an irreducible time homogeneous discrete Markov Chain with stationary distribution π, then

    (1/t) Σ_{i=1}^{t} f(X_i) → Eπ[f(X)] almost surely as t → ∞, where X ∈ X (the state space) and X ∼ π

for any function f : X → R

If, further, the Markov Chain is aperiodic, then P(Xt = x | X0 = x0) → π(x) as t → ∞, ∀x, x0 ∈ X

So Part A of the theorem essentially tells us that if we can set up the chain X0, X1, . . . , Xt such that it is tractable, then using samples from this chain we can compute Eπ[f(X)] (which we know is otherwise intractable)
Similarly, Part B of the theorem says that if we can set up the chain X0, X1, . . . , Xt such that it is tractable, then we can essentially get samples as if they were drawn from π(X) (which was otherwise intractable)
Of course, Part A and Part B are related!
Further, note that it doesn't matter what the initial state was (the theorem holds ∀x, x0 ∈ X)
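Part B can be checked numerically on a toy chain (our own 3-state example; it is irreducible and aperiodic because every entry of T is positive): µ0 T^t converges to the same π regardless of which µ0 we start from.

```python
import numpy as np

# Every row of T sums to 1 and all entries are positive, so the chain is
# irreducible and aperiodic.
T = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.1, 0.5]])

def run(mu0, steps=200):
    """Evolve mu_{k+1} = mu_k T for the given number of steps."""
    mu = mu0.copy()
    for _ in range(steps):
        mu = mu @ T
    return mu

mu_a = run(np.array([1.0, 0.0, 0.0]))   # start deterministically in state 0
mu_b = run(np.array([0.0, 0.0, 1.0]))   # start deterministically in state 2

# Both runs land on the same distribution, which is a fixed point: pi T = pi.
pi = mu_a
```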
So our task is cut out now
Define what our Markov Chain is?
Define the transition matrix T for our Markov Chain
Show how it is easy to sample from this chain
Show that the stationary distribution of this chain is the distribution P(X) (i.e., the distribution that we care about)
Show that the chain is irreducible and aperiodic (because the theorem only holds for such chains)
For ease of notation, instead of X = V1, V2, . . . , Vm, H1, H2, . . . , Hn, we will use X = X1, X2, . . . , Xn+m
Module 20.3 : Setting up a Markov Chain for RBMs

We begin by defining our Markov Chain
Recall that X = {V, H} ∈ {0, 1}^{n+m}, so at time step 0 we create a random vector X ∈ {0, 1}^{n+m}
At time-step 1, we transition to a new value of X
What does this mean? How do we do this transition? Let us see

             V1  V2  V3  ...  Vm  H1  H2  ...  Hn
             X1  X2  X3  ...          ...    Xn+m
    t = 0:    1   1   0  ...          ...      1
    t = 1:    1   0   0  ...          ...      1

We need to transition from a state X = x ∈ {0, 1}^{n+m} to y ∈ {0, 1}^{n+m}
This is how we will do it
Sample a value i ∈ {1 to n + m} using a distribution q(i) (say, uniform distribution)
Fix the value of all variables except Xi
Sample a new value for Xi (could be a V or a H) using the following conditional distribution

    P(Xi = yi | X−i = x−i)

Repeat the above process for many many time steps (each time step corresponds to 1 step of the chain)

             V1  V2  V3  ...  Vm  H1  H2  ...  Hn
             X1  X2  X3  ...          ...    Xn+m
    t = 0:    1   1   0  ...          ...      1
    t = 1:    1   0   0  ...          ...      1
    t = 2:    1   0   1  ...          ...      1
    t = 3:    1   0   1  ...          ...      1
    t = 4:    1   0   1  ...          ...      0
      ...
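One transition of this chain can be sketched as follows. This is our own illustration, not the lecture's code: we keep a hypothetical tiny joint p as an explicit table so that the conditional P(Xi = 1 | X−i = x−i) can be read off by restriction (in the RBM case this table is never materialized, as the next slides show).

```python
import numpy as np

rng = np.random.default_rng(0)
n_vars = 4
p = rng.random(2 ** n_vars)
p /= p.sum()                         # explicit joint, feasible only for tiny n_vars

def idx(x):
    """State vector (list of 0/1) -> index into the flat table p."""
    return int(sum(b << j for j, b in enumerate(x)))

def gibbs_step(x):
    """One transition: pick i ~ q (uniform), resample X_i, keep X_{-i} fixed."""
    i = rng.integers(len(x))         # q(i) = uniform over the variables
    x0, x1 = x.copy(), x.copy()
    x0[i], x1[i] = 0, 1
    z = p[idx(x1)] / (p[idx(x0)] + p[idx(x1)])   # P(X_i = 1 | X_{-i} = x_{-i})
    y = x.copy()
    y[i] = int(rng.random() < z)     # sample the new X_i ~ Bernoulli(z)
    return y

x = [0, 1, 1, 0]                     # arbitrary starting state
chain = [x]
for _ in range(10):
    chain.append(gibbs_step(chain[-1]))
```

Note that, by construction, successive states of the chain differ in at most one coordinate — exactly the restriction the upcoming transition matrix T encodes.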
What are we doing here? How is this related to our goals?
More specifically, we have defined a Markov Chain, but where is our Transition Matrix T?
How is it easy to create this chain (or creating samples x0, x1, . . . , xN)?
How do we show that the stationary distribution is P(X) (where X = V, H)? [We haven't even defined T, then how can we talk about the stationary distribution for T]
Let us answer these questions one by one
First, let us talk about the transition matrix
We have actually defined T although we did not explicitly mention it
What would T contain? The probability of transitioning from any state x to any state y
So T ∈ R^{2^{m+n} × 2^{m+n}} (when did we define such a matrix?)
Actually, we defined a very simple T which allowed only certain types of transitions
In particular, under this T, transitioning from a state x to a state y was possible only if x and y differ in the value of only one of the n + m variables
More formally, we defined T such that

    pxy = q(i) P(yi | x−i),  if ∃i ∈ {1, . . . , n + m} so that ∀v ≠ i, xv = yv
    pxy = 0,                 otherwise

where q(i) is the probability that Xi is the random variable whose value transitions while the value of X−i remains the same
The second term P(Xi = yi | X−i) essentially tells us that, given the values of the remaining random variables, what is the probability of Xi taking on a certain value
With that we have answered the first question "What is the transition matrix T?" (It is a very sparse matrix allowing only certain transitions)
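For a small enough state space this T can be built explicitly and sanity-checked. A sketch under our own assumptions (a toy joint over k = 3 binary variables; for the diagonal entry y = x, every choice of i satisfies the condition, so p_xx collects a contribution from each i):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3
pi = rng.random(2 ** k)
pi /= pi.sum()                          # the target distribution P(X)

states = [[(s >> j) & 1 for j in range(k)] for s in range(2 ** k)]

def cond(i, x, yi):
    """P(X_i = yi | X_{-i} = x_{-i}), read off the explicit joint table."""
    def idx(v):
        return int(sum(b << j for j, b in enumerate(v)))
    x0, x1 = x.copy(), x.copy()
    x0[i], x1[i] = 0, 1
    num = pi[idx(x1)] if yi == 1 else pi[idx(x0)]
    return num / (pi[idx(x0)] + pi[idx(x1)])

# p_xy = sum_i q(i) P(y_i | x_{-i}) over the i's where x and y agree off i.
T = np.zeros((2 ** k, 2 ** k))
q = 1.0 / k                             # uniform q(i)
for a, x in enumerate(states):
    for b, y in enumerate(states):
        for i in range(k):
            if all(x[j] == y[j] for j in range(k) if j != i):
                T[a, b] += q * cond(i, x, y[i])
```

Each row of T sums to 1, T only connects states differing in at most one bit (the promised sparsity), and π satisfies π T = π.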
We now look at the second question: How is it easy to create this chain (or creating samples x0, x1, . . . , xl)?
At each step we are changing only one of the n + m random variables using the following probability

    P(Xi = yi | X−i = x−i) = P(X) / P(X−i)

But how is computing this probability easy? Doesn't the joint distribution on the RHS also have 2^{n+m} parameters?
Well, not really!
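Before specializing to the RBM, note what the ratio buys us. A tiny check with an explicit joint (our own toy example): the marginal P(X−i) needs only a two-term sum over the values of Xi, so the conditional is a Bernoulli distribution, not a full joint.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3
p = rng.random(2 ** k)
p /= p.sum()                         # explicit joint over 2^k states

def idx(x):
    """State vector (list of 0/1) -> index into the flat table p."""
    return int(sum(b << j for j, b in enumerate(x)))

x = [1, 0, 1]                        # current state
i = 1                                # variable chosen for resampling
x0, x1 = x.copy(), x.copy()
x0[i], x1[i] = 0, 1

marginal = p[idx(x0)] + p[idx(x1)]   # P(X_-i = x_-i): only two terms
cond1 = p[idx(x1)] / marginal        # P(X_i = 1 | X_-i = x_-i)
cond0 = p[idx(x0)] / marginal        # P(X_i = 0 | X_-i = x_-i)
```

In the RBM, even these two joint probabilities are never computed — the conditional collapses to a sigmoid, as the next slides show.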
Consider the case when i <= m (i.e., we have decided to transition the value of one of the visible variables V1 to Vm)
Then P(Xi = yi | X−i = x−i) is essentially

    P(Vi = yi | V−i, H) = P(Vi = yi | H) = z,      if yi = 1
                                         = 1 − z,  if yi = 0

    where z = σ(Σ_{j=1}^{n} w_ij h_j + b_i)

The above probability is very easy to compute (just a sigmoid function)
Once you compute the above probability, with probability z you will set the value of Vi to 1 and with probability 1 − z you will set it to 0
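In code, this step is just a sigmoid followed by a Bernoulli draw. The shapes and values below are hypothetical placeholders of ours; W and b play the role of the weights w_ij and visible biases:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3                        # number of visible and hidden units
W = rng.normal(size=(m, n))        # weights w_ij
b = rng.normal(size=m)             # visible biases

h = np.array([1, 0, 1])            # current hidden configuration H
i = 2                              # the visible unit chosen by q(i)

z = 1.0 / (1.0 + np.exp(-(W[i] @ h + b[i])))   # P(V_i = 1 | H)
v_i = int(rng.random() < z)                    # sample V_i ~ Bernoulli(z)
```

By symmetry, resampling a hidden unit H_j given V works the same way with the corresponding column of W and the hidden bias.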
So essentially at every time step you sample an i from a uniform distribution q(i)
And then sample a value of Vi ∈ {0, 1} using the distribution Bernoulli(z)
Both these computations are easy
Hence it is easy to create this chain starting from any x0
Okay, finally let's look at the third question: How do we show that the stationary distribution is P(X) (where X = V, H)?
To prove this we will refer to the following Theorem:
Detailed Balance Condition
To show that a distribution π is a stationary distribution for a Markov Chain described by the transition probabilities pxy, x, y ∈ Ω, it is sufficient to show that ∀x, y ∈ Ω, the following condition holds:

    π(x) pxy = π(y) pyx

Let us revisit what pxy is and what π is
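A quick sanity check of why detailed balance suffices (on a hand-picked 2-state chain of ours, not the RBM chain): summing π(x) p_xy = π(y) p_yx over x gives Σ_x π(x) p_xy = π(y) Σ_x p_yx = π(y), i.e. π T = π.

```python
import numpy as np

T = np.array([[0.9, 0.1],          # T[x, y] = p_xy
              [0.2, 0.8]])
pi = np.array([2.0 / 3.0, 1.0 / 3.0])

# The only pair with x != y: pi(0) p_01 = pi(1) p_10 (= 1/15 here).
balanced = np.isclose(pi[0] * T[0, 1], pi[1] * T[1, 0])

# Detailed balance implies stationarity: pi T = pi.
stationary = np.allclose(pi @ T, pi)
```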
Recall that pxy is given by

pxy = q(i) P(Xi = yi | X−i = x−i),  if ∃i ∈ {1, 2, . . . , n + m} such that ∀j ∈ {1, 2, . . . , n + m} \ {i}: xj = yj
pxy = 0,  otherwise

For consistency of notation we will denote P(X), i.e., P(V, H), as π(X). Further, as shorthand, we will refer to π(X = x) as π(x).

Thus, to prove that P(X), i.e., π(X), is the stationary distribution for our Markov Chain, we need to prove that

π(x)pxy = π(y)pyx  ∀x, y ∈ {0, 1}^(m+n)
To prove: π(x)pxy = π(y)pyx. There are 3 cases that we need to consider.

Case 1: x and y differ in the state of more than one random variable. In this case, by definition,

π(x)pxy = π(x) · 0 = 0
π(y)pyx = π(y) · 0 = 0

Hence the detailed balance condition holds trivially.
Case 2: x and y are equal (i.e., they do not differ in the state of any random variable). In this case, by definition,

π(x)pxy = π(x)pxx
π(y)pyx = π(x)pxx

Hence the detailed balance condition holds trivially.
Case 3: x and y differ in the state of only one random variable, say Xi. In this case, by definition,

π(x)pxy = π(x) q(i) π(yi|x−i)
        = π(xi, x−i) q(i) π(yi, x−i)/π(x−i)
        = π(yi, x−i) q(i) π(xi, x−i)/π(x−i)
        = π(y) q(i) π(xi|x−i)
        = π(y)pyx

Hence the detailed balance condition holds.
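The three cases can also be checked numerically. The sketch below (assumptions: a toy chain over 3 binary variables, an arbitrary strictly positive joint π, uniform q(i)) builds the full Gibbs transition matrix and verifies π(x)pxy = π(y)pyx entrywise:

```python
import itertools
import numpy as np

n = 3
states = list(itertools.product([0, 1], repeat=n))
idx = {s: k for k, s in enumerate(states)}

rng = np.random.default_rng(1)
pi = rng.random(2 ** n)
pi /= pi.sum()                     # arbitrary strictly positive joint pi(x)

def cond(i, x):
    """pi(X_i = 1 | x_{-i}), computed by brute force from the joint."""
    x1, x0 = list(x), list(x)
    x1[i], x0[i] = 1, 0
    p1, p0 = pi[idx[tuple(x1)]], pi[idx[tuple(x0)]]
    return p1 / (p1 + p0)

# Gibbs kernel: p_xy accumulates q(i) * pi(X_i = y_i | x_{-i}) over coordinates i,
# restricted to pairs x, y that agree outside coordinate i.
P = np.zeros((2 ** n, 2 ** n))
for x in states:
    for i in range(n):
        for yi in (0, 1):
            y = list(x); y[i] = yi
            p_yi = cond(i, x) if yi == 1 else 1.0 - cond(i, x)
            P[idx[x], idx[tuple(y)]] += (1.0 / n) * p_yi

D = pi[:, None] * P                # D[x, y] = pi(x) * p_xy
```

Symmetry of D is exactly the detailed balance condition, and it implies π is stationary (πP = π).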
Thus we have proved that the detailed balance condition π(x)pxy = π(y)pyx holds in all 3 cases:

Case 1: x and y differ in the state of more than one random variable.
Case 2: x and y are equal (i.e., they do not differ in the state of any random variable).
Case 3: x and y differ in the state of only one random variable.
So our task is cut out now
Define what our Markov Chain is? (done)
Define the transition matrix T for our Markov Chain (done)
Show how it is easy to sample from this chain (done)
Show that the stationary distribution of this chain is the distribution P (X)
(i.e., the distribution that we care about) (done)
Show that the chain is irreducible and aperiodic (let us see)

A Markov chain is irreducible if one can get from any state in Ω to any other state in a finite number of transitions, or more formally:

∀i, j ∈ Ω ∃k > 0 with P(X^(k) = j | X^(0) = i) > 0

Intuitively, we can see that our chain is irreducible. For example, notice that we can reach the state containing all 1's from the state containing all 0's after some finite number of time steps. We can prove this more formally, but for now we will just rely on the intuition.
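The intuition can be made concrete with a small reachability check: under the Gibbs kernel (with strictly positive conditionals), every state has positive-probability transitions to itself and to every state at Hamming distance one, so a breadth-first search over this graph should visit the whole state space. A sketch for a toy case of 4 binary variables:

```python
from itertools import product
from collections import deque

n = 4
states = list(product([0, 1], repeat=n))

def neighbors(x):
    """States reachable from x in one Gibbs step with positive probability:
    x itself, plus every state differing from x in exactly one coordinate."""
    yield x
    for i in range(n):
        y = list(x)
        y[i] = 1 - y[i]
        yield tuple(y)

# BFS from the all-zeros state
start = states[0]
seen = {start}
queue = deque([start])
while queue:
    x = queue.popleft()
    for y in neighbors(x):
        if y not in seen:
            seen.add(y)
            queue.append(y)
```

Since the search reaches every state (including all 1's), the chain is irreducible.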
A chain is called aperiodic if ∀i ∈ Ω the greatest common divisor of {k | P(X^(k) = i | X^(0) = i) > 0 ∧ k ∈ N0} is 1.

The set defined above contains all the time steps at which we can return to state i starting from state i. Suppose the chain were periodic; then this set would contain only multiples of a certain number, for example {3, 6, 9, 12, . . .}, and hence the greatest common divisor would be 3 (and the Markov Chain would be periodic with a period of 3). However, if the chain is not periodic, then the set contains arbitrary numbers whose GCD is just 1 (hence the above definition).
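The definition can be illustrated numerically on two toy 2-state chains (these are illustrative examples, not our RBM chain): a deterministic flip chain, which returns to a state only at even times, versus a "lazy" chain with self-loops:

```python
from functools import reduce
import math
import numpy as np

def return_time_gcd(P, state, k_max=12):
    """gcd of {k : P(X^(k) = state | X^(0) = state) > 0, 1 <= k <= k_max}."""
    Pk = np.eye(len(P))
    times = []
    for k in range(1, k_max + 1):
        Pk = Pk @ P                       # k-step transition probabilities
        if Pk[state, state] > 1e-12:
            times.append(k)
    return reduce(math.gcd, times)

flip = np.array([[0.0, 1.0],
                 [1.0, 0.0]])             # deterministic flip: returns only at k = 2, 4, 6, ...
lazy = np.array([[0.5, 0.5],
                 [0.5, 0.5]])             # self-loops possible: returns at every k >= 1
```

The flip chain has period 2; the lazy chain has gcd 1 and is aperiodic, just like our Gibbs chain, whose states can be revisited at consecutive time steps.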
Again, intuitively it should be clear that our chain is aperiodic. Once again, we can prove this formally, but we will just rely on the intuition for now.
So our task is cut out now
Define what our Markov Chain is? (done)
Define the transition matrix T for our Markov Chain (done)
Show how it is easy to sample from this chain (done)
Show that the stationary distribution of this chain is the distribution P (X)
(i.e., the distribution that we care about) (done)
Show that the chain is irreducible and aperiodic (done)

Module 20.4 : Training RBMs using Gibbs Sampling

Okay, so we are now ready to write the full algorithm for training RBMs using
Gibbs Sampling
We will first quickly revisit the expectations that we wanted to compute and
write a simplified expression for them

[Figure: RBM with visible units v1, . . . , vm (V ∈ {0, 1}^m, biases b1, . . . , bm), hidden units h1, . . . , hn (H ∈ {0, 1}^n, biases c1, . . . , cn), and weights W ∈ R^(m×n)]

E(V, H) = −Σi Σj wij vi hj − Σi bi vi − Σj cj hj

∂L(θ)/∂wij = −Σ_H p(H|V) ∂E(V, H)/∂wij + Σ_(V,H) p(V, H) ∂E(V, H)/∂wij
           = Σ_H p(H|V) hi vj − Σ_(V,H) p(V, H) hi vj
           = E_p(H|V)[hi vj] − E_p(V,H)[hi vj]

We were interested in computing the partial derivative of the log-likelihood w.r.t. one of the parameters (wij). We saw that this partial derivative is actually the difference of two expectation terms. We will now simplify the expression for these two expectations.
∂L(θ)/∂wij = E_p(H|V)[hi vj] − E_p(V,H)[hi vj]
           = Σh p(h|v) hi vj − Σ(v,h) p(v, h) hi vj
           = Σh p(h|v) hi vj − Σv p(v) Σh p(h|v) hi vj

We will first focus on Σh p(h|v) hi vj:

Σh p(h|v) hi vj = Σ_hi Σ_h−i p(hi|v) p(h−i|v) hi vj
                = Σ_hi p(hi|v) hi vj Σ_h−i p(h−i|v)
                = p(Hi = 1|v) vj
                = σ(Σ_{j=1}^m wij vj + ci) vj

∂L(θ)/∂wij = σ(Σ_{j=1}^m wij vj + ci) vj − Σv p(v) σ(Σ_{j=1}^m wij vj + ci) vj
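The last step above (the sum over h collapsing to σ(·)vj) can be sanity-checked by brute-force enumeration on a toy RBM (the sizes and parameters below are arbitrary, chosen only for the check):

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
m, n = 4, 3                          # toy numbers of visible / hidden units
W = rng.normal(size=(n, m))          # W[i, j] connects hidden unit i to visible unit j
c = rng.normal(size=n)               # hidden biases
v = np.array([1, 0, 1, 1])           # an arbitrary visible configuration

p_h1 = sigmoid(W @ v + c)            # p(H_i = 1 | v) for each hidden unit

# Brute force: sum over all h of p(h|v) * h_i * v_j, using p(h|v) = prod_i p(h_i|v)
i, j = 0, 0
brute = 0.0
for h in itertools.product([0, 1], repeat=n):
    h = np.array(h)
    p_h = np.prod(np.where(h == 1, p_h1, 1.0 - p_h1))
    brute += p_h * h[i] * v[j]

closed_form = p_h1[i] * v[j]         # sigma(w_i . v + c_i) * v_j
```

The enumerated sum and the closed form agree to numerical precision.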
∂L(θ)/∂wij = σ(Σ_{j=1}^m wij vj + ci) vj − Σv p(v) σ(Σ_{j=1}^m wij vj + ci) vj
           = σ(wi v + ci) vj − Σv p(v) σ(wi v + ci) vj    (wi is the i-th row of W)

∇W L(θ) = σ(Wv + c) v^T − Σv p(v) σ(Wv + c) v^T
        = σ(Wv + c) v^T − Ev[σ(Wv + c) v^T]
∂L(θ)/∂bj = E_p(H|V)[vj] − E_p(V,H)[vj]
          = Σh p(h|v) vj − Σ(v,h) p(v, h) vj
          = Σh p(h|v) vj − Σv p(v) Σh p(h|v) vj
          = vj Σh p(h|v) − Σv p(v) vj Σh p(h|v)
          = vj − Σv p(v) vj

∇b L(θ) = v − Σv p(v) v
        = v − Ev[v]
∂L(θ)/∂ci = E_p(H|V)[hi] − E_p(V,H)[hi]
          = Σh p(h|v) hi − Σ(v,h) p(v, h) hi
          = Σh p(h|v) hi − Σv p(v) Σh p(h|v) hi
          = p(Hi = 1|v) − Σv p(v) p(Hi = 1|v)
          = σ(Σ_{j=1}^m wij vj + ci) − Σv p(v) σ(Σ_{j=1}^m wij vj + ci)

∇c L(θ) = σ(Wv + c) − Σv p(v) σ(Wv + c)
        = σ(Wv + c) − Ev[σ(Wv + c)]
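The first (data-dependent) term of each of the three gradients is cheap to compute for a given visible vector v. A minimal NumPy sketch (names and sizes are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def data_dependent_grads(v, W, b, c):
    """First term of each gradient for one visible sample v:
    grad_W = sigma(Wv + c) v^T,  grad_b = v,  grad_c = sigma(Wv + c)."""
    s = sigmoid(W @ v + c)          # p(H_i = 1 | v) for every hidden unit
    return np.outer(s, v), v, s

rng = np.random.default_rng(0)
m, n = 6, 4                         # toy numbers of visible / hidden units
W = 0.1 * rng.normal(size=(n, m))
b, c = np.zeros(m), np.zeros(n)
v = rng.integers(0, 2, size=m).astype(float)
gW, gb, gc = data_dependent_grads(v, W, b, c)
```

Only the second terms (the expectations under p(V, H)) remain hard, which is what the next step addresses.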
Notice that all the 3 gradient
k
1X expressions have an expectation
Ev [σ(Wv + c)v T ] ≈ σ(Wv (k) + c)v (k)T term
k
i=1
k These expectations are intractable.
1 X
(k)
Ev [v] ≈ v Solution? Estimation with the help
k
i=1 of sampling
k
1 X Specifically, we will use Gibbs
Ev [σ(Wv + c)] ≈ σ(Wv (k) + c) Sampling to estimate the
k
i=1
expectation

53/61
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
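The sample-average idea can be sketched as follows: given samples v^(1), ..., v^(k), each expectation is replaced by the mean of the corresponding function over the samples. In this toy check the samples come from a known independent Bernoulli distribution rather than from an RBM's Gibbs chain (an assumption purely for illustration), so we can verify the estimator against the known mean.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 4, 3, 200_000
W = rng.normal(size=(n, m))
c = rng.normal(size=n)
p_true = np.array([0.1, 0.4, 0.6, 0.9])  # true visible-unit probabilities

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Draw k samples v^(1), ..., v^(k) from the stand-in distribution
samples = (rng.random((k, m)) < p_true).astype(float)

# Monte Carlo estimates of the two vector-valued expectations
E_v_hat = samples.mean(axis=0)                       # ≈ E_v[v]
E_sig_hat = sigmoid(samples @ W.T + c).mean(axis=0)  # ≈ E_v[σ(Wv + c)]

# The sample mean of v converges to the true mean p_true
assert np.allclose(E_v_hat, p_true, atol=0.01)
print(E_v_hat, E_sig_hat)
```

With k = 200,000 samples the standard error per coordinate is on the order of 10^-3, so the estimate lands well within the asserted tolerance.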
Algorithm: RBM Training with Block Gibbs Sampling

Input: RBM (V_1, ..., V_m, H_1, ..., H_n), training batch D
Output: Learned parameters W, b, c

init W, b, c
forall v_d ∈ D do
    Randomly initialize v^(0)
    for t = 0, ..., k, k+1, ..., k+r do
        for i = 1, ..., n do
            sample h_i^(t) ∼ p(h_i | v^(t))
        end
        for j = 1, ..., m do
            sample v_j^(t+1) ∼ p(v_j | h^(t))
        end
    end
    W ← W + η[σ(Wv_d + c)v_d^T − (1/r) Σ_{t=k+1}^{k+r} σ(Wv^(t) + c)v^(t)T]
    b ← b + η[v_d − (1/r) Σ_{t=k+1}^{k+r} v^(t)]
    c ← c + η[σ(Wv_d + c) − (1/r) Σ_{t=k+1}^{k+r} σ(Wv^(t) + c)]
end

54/61
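The training loop above can be sketched in code roughly as follows. This is a minimal NumPy version for binary units; the layer sizes, burn-in length k, sample count r, and learning rate η are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_gibbs(D, n_hidden, k=10, r=5, eta=0.01):
    """One sweep of block-Gibbs RBM training over batch D (rows = examples)."""
    m = D.shape[1]
    W = rng.normal(scale=0.01, size=(n_hidden, m))
    b = np.zeros(m)         # visible biases
    c = np.zeros(n_hidden)  # hidden biases
    for v_d in D:
        v = (rng.random(m) < 0.5).astype(float)  # random v^(0)
        model_vs = []
        for t in range(k + r):
            # Block-sample all hidden units, then all visible units
            h = (rng.random(n_hidden) < sigmoid(W @ v + c)).astype(float)
            v = (rng.random(m) < sigmoid(W.T @ h + b)).astype(float)
            if t >= k:                 # keep the r samples after burn-in
                model_vs.append(v)
        model_vs = np.array(model_vs)
        # Data-dependent (positive) term minus model-dependent (negative) term
        pos_W = np.outer(sigmoid(W @ v_d + c), v_d)
        neg_W = np.mean([np.outer(sigmoid(W @ vm + c), vm) for vm in model_vs],
                        axis=0)
        W += eta * (pos_W - neg_W)
        b += eta * (v_d - model_vs.mean(axis=0))
        c += eta * (sigmoid(W @ v_d + c) - sigmoid(model_vs @ W.T + c).mean(axis=0))
    return W, b, c

# Tiny usage example on random binary data
D = (rng.random((8, 6)) < 0.5).astype(float)
W, b, c = train_rbm_gibbs(D, n_hidden=4)
print(W.shape, b.shape, c.shape)
```

Note how expensive this is: every training example requires its own k + r Gibbs steps, which motivates the contrastive divergence shortcut introduced next in the lecture.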
Module 20.5 : Training RBMs using Contrastive Divergence

55/61
In practice, Gibbs Sampling can be very inefficient because for every step of stochastic gradient descent we need to run the Markov chain for many, many steps and then compute the expectation using the samples drawn from this chain.

We will now see a more efficient algorithm, called k-step contrastive divergence, which is used in practice for training RBMs.

56/61
Just to reiterate, our goal is to compute the two expectations efficiently:

E_{p(H|V)}[v_j h_i] = σ(w_i v + c_i) v_j
E_{p(V,H)}[v_j h_i] = Σ_v p(v) σ(w_i v + c_i) v_j

We already have a simplified formula for the first expectation.

Furthermore, note that the first expectation depends only on the seen training example (v).

The second expectation depends on the samples drawn from the Markov chain (v^(1), v^(2), ..., v^(n)).

The first expectation thus depends on the empirical samples, whereas the second expectation depends on the model samples (because the samples are generated based on P(V|H) and P(H|V) output by the model).

57/61
[Figure: Gibbs chain  v_s → (sample h ∼ p(h|v)) → (sample v ∼ p(v|h)) → v^(1) → ... → v^(k) = ṽ]

Contrastive divergence uses the following idea:

Instead of starting the Markov chain at a random point (V = v^(0)), start from v^(t), where v^(t) is the current training instance.

Run Gibbs Sampling for k steps and denote the sample at the k-th step by ṽ.

Replace the expectation by a point estimate:

E_{p(V,H)}[v_j h_i] = Σ_v p(v) σ(w_i v + c_i) v_j ≈ σ(w_i ṽ + c_i) ṽ_j

58/61
Over time, as our model becomes better and better, ṽ should start looking more and more like our training (empirical) samples.

Once that starts happening, what will happen to the gradient?

We consider the derivative w.r.t. w_ij again:

∂L(θ)/∂w_ij = σ(w_i v + c_i) v_j − Σ_v p(v) σ(w_i v + c_i) v_j

We have two terms here.

The first term can be thought of as a point estimate computed from a single training example v.

Similarly, in the second term, the summation over v is replaced by a point estimate computed from the model sample ṽ.

As training progresses and ṽ (the model sample) starts looking more and more like our training (empirical) samples, the difference between the two terms becomes small and the parameters of the model stabilize (convergence).

59/61
Algorithm: k-step Contrastive Divergence

Input: RBM (V_1, ..., V_m, H_1, ..., H_n), training batch D
Output: Learned parameters W, b, c

init W = b = c = 0
forall v ∈ D do
    Initialize v^(0) ← v
    for t = 0, ..., k do
        for i = 1, ..., n do
            sample h_i^(t) ∼ p(h_i | v^(t))
        end
        for j = 1, ..., m do
            sample v_j^(t+1) ∼ p(v_j | h^(t))
        end
    end
    ṽ ← v^(k)
    W ← W + η[σ(Wv + c)v^T − σ(Wṽ + c)ṽ^T]
    b ← b + η[v − ṽ]
    c ← c + η[σ(Wv + c) − σ(Wṽ + c)]
end

60/61
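The k-step procedure above might be sketched as follows. This is a minimal NumPy version for binary units; the layer sizes and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v, W, b, c, k=1, eta=0.1):
    """One CD-k parameter update from a single training example v."""
    v_tilde = v.copy()
    # Gibbs chain started at the training example, not at a random point
    for _ in range(k):
        h = (rng.random(c.shape) < sigmoid(W @ v_tilde + c)).astype(float)
        v_tilde = (rng.random(b.shape) < sigmoid(W.T @ h + b)).astype(float)
    # Point estimates replace both expectations: data term minus model term
    W = W + eta * (np.outer(sigmoid(W @ v + c), v)
                   - np.outer(sigmoid(W @ v_tilde + c), v_tilde))
    b = b + eta * (v - v_tilde)
    c = c + eta * (sigmoid(W @ v + c) - sigmoid(W @ v_tilde + c))
    return W, b, c

# Usage: a few CD-1 sweeps over a toy batch of random binary vectors
m, n = 6, 4
W = rng.normal(scale=0.01, size=(n, m))
b, c = np.zeros(m), np.zeros(n)
D = (rng.random((10, m)) < 0.5).astype(float)
for _ in range(5):
    for v in D:
        W, b, c = cd_k_update(v, W, b, c, k=1)
print(W.shape)
```

Unlike the block-Gibbs loop, there is no burn-in phase and no averaging over many chain samples: a single short chain per example gives the point estimate, which is what makes CD-k cheap enough for practical training.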
[Figure: Gibbs chain  v_s → (sample h ∼ p(h|v)) → (sample v ∼ p(v|h)) → v^(1) → ... → v^(k) = ṽ]

In practice, k = 1 also works well.

The higher the value of k, the less biased the estimate of the gradient will be.

61/61
