Lecture 20
Mitesh M. Khapra
CS7015 (Deep Learning)
Module 20.1 : Markov Chains
Let us first begin by restating our goals
Goal 1: Given a random variable X ∈ R^n, we are interested in drawing samples from the joint distribution P(X)
Goal 2: Given a function f(X) defined over the random variable X, we are interested in computing the expectation E_{P(X)}[f(X)]
We will use Gibbs Sampling (a member of the class of Metropolis-Hastings algorithms) to achieve these goals
We will first understand the intuition behind Gibbs Sampling and then understand the math behind it
[Figure: the running example, X ∈ R^1024, and the quantity E_{P(X)}[f(X)]]
Suppose instead of a single random variable X ∈ R^n, we have a chain of random variables X_1, X_2, . . . , X_K, each X_i ∈ R^n
The i here corresponds to a time step
For example, X_i could be an n-dimensional vector containing the number of customers in a given set of n restaurants on day i
In our case, X_i could be the 1024-dimensional image sent by our friend on day i
For ease of illustration we will stick to the restaurant example and assume that instead of actual counts we are interested only in binary counts (high = 1, low = 0)
Thus X_i ∈ {0, 1}^n
On day 1, let X_1 take on the value x_1 (x_1 is one of the possible 2^n vectors)
On day 2, let X_2 take on the value x_2 (x_2 is again one of the possible 2^n vectors)
One way of looking at this is that the state has transitioned from x_1 to x_2
Similarly, on day 3, if X_3 takes on the value x_3 then we can say that the state has transitioned from x_1 to x_2 to x_3
Finally, on day K, we can say that the state has transitioned from x_1 to x_2 to x_3 to . . . to x_K
We may now be interested in knowing what is the most likely value that the state will take on day i given the states on day 1 to day i − 1
More formally, we may be interested in the following distribution
P(X_i = x_i | X_1 = x_1, X_2 = x_2, . . . , X_{i−1} = x_{i−1})
In this graphical model, the random variables are X_1, X_2, . . . , X_k
We will have a node corresponding to each of these random variables
What will be the edges in the graph?
Well, each node only depends on its predecessor, so we will just have an edge between successive nodes
[Figure: the chain X_1 → X_2 → · · · → X_k]
This property (X_i ⊥ X_1, . . . , X_{i−2} | X_{i−1}) is called the Markov property
And the resulting chain X_1, X_2, . . . , X_k is called a Markov chain
Further, since we are considering discrete time steps, this is called a discrete time Markov chain
Further, since the X_i's take on discrete values, this is called a discrete time discrete space Markov chain
Okay, but why are we interested in Markov chains? (we will get there soon! for now let us just focus on these definitions)
Let us delve a bit deeper into Markov chains and define a few more quantities
Recall that each X_i ∈ {0, 1}^n
Let us assume 2^n = l (i.e., X_i can take l values)
How many values do we need to specify the distribution P(X_i = x_i | X_{i−1} = x_{i−1})? (l^2)
We can represent this as a matrix T ∈ R^{l×l} where the entry T_ab of the matrix denotes the probability of transitioning to state b from state a (i.e., T_ab = P(X_i = b | X_{i−1} = a))
The matrix T is called the transition matrix

  X_{i−1}  X_i   T_ab
  1        1     0.05
  1        2     0.06
  ...      ...   ...
  1        l     0.02
  2        1     0.03
  2        2     0.07
  ...      ...   ...
  2        l     0.01
  ...      ...   ...
  l        1     0.1
  l        2     0.09
  ...      ...   ...
  l        l     0.21
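To make the transition matrix concrete, here is a minimal sketch in Python/NumPy (an illustration added to these notes, not part of the original slides): a small row-stochastic matrix whose row a is the distribution P(X_i = · | X_{i−1} = a), so every row must sum to 1, and sampling the next state simply means drawing from the row of the current state.

```python
import numpy as np

rng = np.random.default_rng(0)

l = 3                                   # number of states (2^n in the lecture)
T = rng.random((l, l))
T = T / T.sum(axis=1, keepdims=True)    # row a is P(X_i = . | X_{i-1} = a)
assert np.allclose(T.sum(axis=1), 1.0)  # every row of a transition matrix sums to 1

a = 0                                   # current state
next_state = rng.choice(l, p=T[a])      # draw X_i given X_{i-1} = a
print(T, next_state)
```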
We need to define this transition matrix T_ab, i.e., in general, a transition matrix T_i for every time step i of the chain (T_1 between X_1 and X_2, T_2 between X_2 and X_3, and so on up to T_k)
However, for this discussion we will assume that the Markov chain is time homogeneous
What does that mean? It means that
T_1 = T_2 = · · · = T_k = T
In other words
P(X_i = b | X_{i−1} = a) = T_ab  ∀a, b  ∀i
The transition matrix does not depend on the time step i and hence such a Markov chain is called time homogeneous
Now suppose the starting distribution at time step 0 is given by µ^0
Just to be clear, µ^0 is a 2^n dimensional vector such that
µ^0_a = P(X_0 = a)
Let us consider P(X_1 = b)
P(X_1 = b) = Σ_a P(X_0 = a, X_1 = b)
The above sum essentially captures all the paths of reaching X_1 = b irrespective of the value of X_0
P(X_1 = b) = Σ_a P(X_0 = a, X_1 = b)
           = Σ_a P(X_0 = a) P(X_1 = b | X_0 = a)
           = Σ_a µ^0_a T_ab
[Figure: every state a = 1, 2, . . . , l at time step 0 has an edge into state b at time step 1]
Let us see if there is a more compact way of writing the distribution P(X_1) (i.e., of specifying P(X_1 = b) ∀b)
Let us consider a simple case when l = 3 (as opposed to 2^n)
Thus, µ^0 ∈ R^3 and T ∈ R^{3×3}
What does the product µ^0 T give us?
It gives us the distribution µ^1! (the b-th entry of this vector is Σ_a µ^0_a T_ab, which is P(X_1 = b))
µ^0 T = [0.3 0.4 0.3] [[0.2 0.5 0.3], [0.3 0.6 0.1], [0.4 0.2 0.4]] = [0.3 0.45 0.25]
[Figure: 3-state transition diagram between X_0 and X_1, with the edge probabilities out of state a given by row a of T]
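The same computation as a short sketch in Python/NumPy (added for illustration; the numbers are exactly the ones in the example above):

```python
import numpy as np

mu0 = np.array([0.3, 0.4, 0.3])      # starting distribution over the 3 states
T = np.array([[0.2, 0.5, 0.3],       # row a is P(X_1 = . | X_0 = a)
              [0.3, 0.6, 0.1],
              [0.4, 0.2, 0.4]])

mu1 = mu0 @ T                        # mu1[b] = sum_a mu0[a] * T[a, b] = P(X_1 = b)
print(mu1)                           # [0.3  0.45 0.25]
```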
Let us consider P(X_2 = b)
P(X_2 = b) = Σ_a P(X_1 = a, X_2 = b)
The above sum essentially captures all the paths of reaching X_2 = b irrespective of the value of X_1
P(X_2 = b) = Σ_a P(X_1 = a, X_2 = b)
           = Σ_a P(X_1 = a) P(X_2 = b | X_1 = a)
           = Σ_a µ^1_a T_ab
Once again we can write P(X_2) compactly as
P(X_2) = µ^1 T = (µ^0 T) T = µ^0 T^2
In general,
P(X_k) = µ^0 T^k
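A small sketch of the k-step formula (illustration only, reusing the 3-state µ^0 and T from above): the distribution at step k is µ^0 times the k-th power of T, and one can observe that for this T it stops changing as k grows.

```python
import numpy as np

mu0 = np.array([0.3, 0.4, 0.3])
T = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.6, 0.1],
              [0.4, 0.2, 0.4]])

for k in [1, 2, 5, 20, 50]:
    mu_k = mu0 @ np.linalg.matrix_power(T, k)   # P(X_k) = mu0 T^k
    print(k, mu_k)
# for large k the printed vector stops changing -- a preview of the stationary distribution
```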
If at a certain time step t, µ^t reaches a distribution π such that πT = π, then for all subsequent time steps µ^j = π (j ≥ t)
Such a π is called the stationary distribution of the chain
Important: If we run a Markov chain for a large number of time steps then after a point we start getting samples x_t, x_{t+1}, x_{t+2}, . . . which are essentially being drawn from the stationary distribution (Spoiler Alert: one of our goals was to draw samples from a very complex distribution)
What do we mean by run a Markov chain for a large number of time steps?
It means we start by drawing a sample X_0 ∼ µ^0 and then continue drawing samples
X_1 ∼ µ^0 T, X_2 ∼ µ^0 T^2, X_3 ∼ µ^0 T^3, . . . , X_t ∼ π, X_{t+1} ∼ π, X_{t+2} ∼ π, . . .
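What "running the chain" looks like in code (a toy simulation added for illustration, again using the 3-state T from above): draw X_0 from µ^0, then repeatedly draw the next state from the row of T indexed by the current state; the empirical distribution of the late samples should match π.

```python
import numpy as np

rng = np.random.default_rng(0)

mu0 = np.array([0.3, 0.4, 0.3])
T = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.6, 0.1],
              [0.4, 0.2, 0.4]])

x = rng.choice(3, p=mu0)                 # X_0 ~ mu0
samples = []
for step in range(50000):                # run the chain for many time steps
    x = rng.choice(3, p=T[x])            # X_i drawn given X_{i-1} = x
    if step > 1000:                      # ignore the early (pre-convergence) part
        samples.append(x)

print(np.bincount(samples) / len(samples))    # empirical distribution of late samples
print(mu0 @ np.linalg.matrix_power(T, 100))   # ~ stationary distribution pi
```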
Is it always easy to draw these samples? No
|µ^k| = 2^n, which means that we need to compute the probability of each of the possible 2^n values that X_k can take
In other words, the joint distribution µ^k has 2^n parameters, which is prohibitively large
I wonder what I can do to reduce the number of parameters in a joint distribution (I hope you already know what to do, but we will return to it later)
The story so far...
We have seen what a discrete space, discrete time, time homogeneous Markov chain is
We have also defined µ^0 (initial distribution), T (transition matrix) and π (stationary distribution)
So far so good! But why do we care about Markov chains and their properties?
How does this discussion tie back to our goals?
We will first see an intuitive explanation for how all this ties back to our goals and then get into a more formal discussion
Module 20.2 : Why do we care about Markov Chains?
Recall our goals
Goal 1: Sample from P(X)
Goal 2: Compute E_{P(X)}[f(X)]
Now suppose we set up a Markov chain X_1, X_2, . . . such that
it is easy to draw samples from this chain, and
this Markov chain's stationary distribution is P(X)
Then it would mean that if we run the Markov chain for long enough, we will start getting samples from P(X)
And once we have a large number of such samples we can empirically estimate E_{P(X)}[f(X)] as
(1/n) Σ_{i=l}^{l+n} f(X_i)
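A toy sketch of this empirical estimate (illustration only; it reuses the 3-state chain from Module 20.1 and an arbitrary f, not the intractable high-dimensional case the lecture cares about):

```python
import numpy as np

rng = np.random.default_rng(0)

T = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.6, 0.1],
              [0.4, 0.2, 0.4]])
f = np.array([1.0, 10.0, 100.0])     # some function f(X), one value per state

x, burn_in, n = 0, 1000, 50000       # discard the first burn_in samples, keep the next n
total = 0.0
for step in range(burn_in + n):
    x = rng.choice(3, p=T[x])        # one more step of the chain
    if step >= burn_in:
        total += f[x]
print("empirical estimate:", total / n)

pi = (np.ones(3) / 3) @ np.linalg.matrix_power(T, 200)   # ~ stationary distribution
print("E_pi[f(X)]        :", pi @ f)                      # the two numbers should be close
```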
We will now get into a formal discussion to concretize the above intuition
Theorem: If X_0, X_1, . . . , X_t is an irreducible time homogeneous discrete Markov chain with stationary distribution π, then
(Part A) (1/t) Σ_{i=1}^{t} f(X_i) → E_π[f(X)] as t → ∞ (the convergence is almost sure), where X ∈ X (the state space) and X ∼ π
(Part B) if the chain is additionally aperiodic, then P(X_t = x | X_0 = x_0) → π(x) as t → ∞, ∀x, x_0 ∈ X
So Part A of the theorem essentially tells us that if we can set up the chain X_0, X_1, . . . , X_t such that it is tractable, then using samples from this chain we can compute E_π[f(X)] (which we know is otherwise intractable)
Similarly, Part B of the theorem says that if we can set up the chain X_0, X_1, . . . , X_t such that it is tractable, then we can essentially get samples as if they were drawn from π(X) (which was otherwise intractable)
Of course Part A and Part B are related!
Further note that it doesn't matter what the initial state was (the theorem holds ∀x, x_0 ∈ X)
So our task is cut out now:
Define what our Markov chain is
Define the transition matrix T for our Markov chain
Show how it is easy to sample from this chain
Show that the stationary distribution of this chain is the distribution P(X) (i.e., the distribution that we care about)
Show that the chain is irreducible and aperiodic (because the theorem only holds for such chains)
For ease of notation, instead of X = V_1, V_2, . . . , V_m, H_1, H_2, . . . , H_n, we will use X = X_1, X_2, . . . , X_{n+m}
Module 20.3 : Setting up a Markov Chain for RBMs
We begin by defining our Markov chain
Recall that X = {V, H} ∈ {0, 1}^{n+m}, so at time step 0 we create a random vector X ∈ {0, 1}^{n+m}
At time step 1, we transition to a new value of X
What does this mean? How do we do this transition? Let us see

  step   X_1  X_2  X_3  . . .  X_{n+m}
         (V_1, V_2, . . . , V_m, H_1, H_2, . . . , H_n)
  0      1    1    0    . . .  1
  1      1    0    0    . . .  1
We need to transition from a state X = x ∈ {0, 1}^{n+m} to y ∈ {0, 1}^{n+m}
This is how we will do it
Sample a value i ∈ {1, . . . , n + m} using a distribution q(i) (say, the uniform distribution)
Fix the value of all variables except X_i
Sample a new value for X_i (could be a V or an H) using the following conditional distribution
P(X_i = y_i | X_{−i} = x_{−i})
Repeat the above process for many many time steps (each time step corresponds to 1 step of the chain)

  step   X_1  X_2  X_3  . . .  X_{n+m}
  0      1    1    0    . . .  1
  1      1    0    0    . . .  1
  2      1    0    1    . . .  1
  3      1    0    1    . . .  1
  4      1    0    1    . . .  0
  ...
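One such transition, sketched in Python (added for illustration; `conditional_prob` is a hypothetical placeholder for P(X_i = 1 | X_{−i} = x_{−i}), whose concrete RBM form is worked out on the following slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_step(x, conditional_prob):
    """One transition x -> y: resample a single, randomly chosen coordinate of x."""
    y = x.copy()
    i = rng.integers(len(x))            # sample i from q(i), here uniform over the n+m variables
    p1 = conditional_prob(i, x)         # P(X_i = 1 | X_{-i} = x_{-i})
    y[i] = int(rng.random() < p1)       # all other coordinates stay fixed
    return y

# dummy usage: a state in {0,1}^{n+m} and a (meaningless) conditional that always returns 0.5
x = rng.integers(0, 2, size=8)
x = gibbs_step(x, lambda i, state: 0.5)
```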
What are we doing here? How is this related to our goals?
More specifically, we have defined a Markov chain, but where is our transition matrix T?
How is it easy to create this chain (i.e., to create the samples x_0, x_1, . . . , x_N)?
How do we show that the stationary distribution is P(X) (where X = {V, H})? [We haven't even defined T, so how can we talk about the stationary distribution of T?]
Let us answer these questions one by one
First, let us talk about the transition matrix
We have actually defined T although we did not explicitly mention it
What would T contain? The probability of transitioning from any state x to any state y
So T ∈ R^{2^{m+n} × 2^{m+n}} (when did we define such a matrix?)
Actually, we defined a very simple T which allows only certain types of transitions
In particular, under this T, transitioning from a state x to a state y is possible only if x and y differ in the value of at most one of the n + m variables
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
More formally, we defined T such that
(
q(i)P (yi |x−i ), if ∃i ∈ X so that ∀v ∈ X with v 6= i, xv = yv
pxy =
0, otherwise
where q(i) is the probability that Xi is the random variable whose value trans-
itions while the value of X−i remains the same
The second term P (Xi = yi |X−i ) essentially tells us that given the value of the
remaining random variable what is the probability of Xi taking on a certain
value
With that we have answered the first question “What is the transition matrix
T ?” (It is a very sparse matrix allowing only certain transitions)
31/61
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
We now look at the second question: How is it easy to create this chain (i.e., to create the samples x_0, x_1, . . . , x_l)?
At each step we are changing only one of the n + m random variables using the following probability
P(X_i = y_i | X_{−i} = x_{−i}) = P(X) / P(X_{−i})
But how is computing this probability easy? Doesn't the joint distribution P(X) in the numerator also have 2^{n+m} parameters?
Well, not really!
Consider the case when i ≤ m (i.e., we have decided to transition the value of one of the visible variables V_1 to V_m)
Then P(X_i = y_i | X_{−i} = x_{−i}) is essentially
P(V_i = y_i | V_{−i}, H) = P(V_i = y_i | H) = z if y_i = 1, and 1 − z if y_i = 0
where z = σ(Σ_{j=1}^{n} w_ij h_j + b_i) and b_i is the bias of the visible unit V_i
The above probability is very easy to compute (just a sigmoid function)
Once you compute the above probability, with probability z you will set the value of V_i to 1 and with probability 1 − z you will set it to 0
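This conditional in code (a sketch; W and b stand for the RBM weights and visible biases from the earlier lectures, filled with random values here just to make the snippet runnable):

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 6, 4                        # number of visible and hidden units
W = rng.normal(size=(m, n))        # weights w_ij
b = rng.normal(size=m)             # visible biases
h = rng.integers(0, 2, size=n)     # current values of the hidden variables H

i = 2                                           # the visible unit chosen for resampling
z = 1.0 / (1.0 + np.exp(-(W[i] @ h + b[i])))    # z = sigma(sum_j w_ij h_j + b_i)
v_i = int(rng.random() < z)                     # V_i = 1 with probability z, else 0
print(z, v_i)
```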
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
V1 V2 ... Vm H1 H2 ... Hn
So essentially at every time step you sample a
X1 X2 X3 ... ... Xn+m
i from a uniform distribution (qi )
0 1 1 0 ... ... 1
1 1 0 0 ... ... 1 And then sample a value of Vi ∈ {0, 1} using
2 1 0 1 ... ... 1 the distribution Bernoulli(z)
3 1 0 1 ... ... 1 Both these computations are easy
4 1 0 1 ... ... 0
.. ..
Hence it is easy to create this chain starting
. .
.. ..
from any x0
. .
34/61
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
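Putting the pieces together, a minimal sketch of the full chain for a toy RBM (illustration only; the hidden-unit update P(H_j = 1 | V) = σ(Σ_i w_ij v_i + c_j) is the analogous case i > m and is stated here as an assumption consistent with the earlier lectures):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

m, n = 6, 4
W = rng.normal(size=(m, n))        # weights w_ij
b = rng.normal(size=m)             # visible biases
c = rng.normal(size=n)             # hidden biases

v = rng.integers(0, 2, size=m)     # x_0: an arbitrary starting state {V, H}
h = rng.integers(0, 2, size=n)

for step in range(10000):          # each iteration is one step of the Markov chain
    i = rng.integers(m + n)        # q(i): pick one of the n + m variables uniformly
    if i < m:                      # resample a visible unit given H
        z = sigmoid(W[i] @ h + b[i])
        v[i] = int(rng.random() < z)
    else:                          # resample a hidden unit given V
        j = i - m
        z = sigmoid(v @ W[:, j] + c[j])
        h[j] = int(rng.random() < z)
# after many steps, (v, h) behaves like a sample from the chain's stationary distribution
```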
Okay, finally let's look at the third question: How do we show that the stationary distribution is P(X) (where X = {V, H})?
To prove this we will refer to the following theorem:
Detailed Balance Condition
To show that a distribution π is a stationary distribution for a Markov chain described by the transition probabilities p_xy, x, y ∈ Ω, it is sufficient to show that ∀x, y ∈ Ω the following condition holds:
π(x) p_xy = π(y) p_yx
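A tiny numerical illustration of the condition (added here, not from the lecture): for a 2-state chain, detailed balance pins down π up to normalisation, and such a π is indeed stationary.

```python
import numpy as np

# a 2-state chain with p_01 = 0.3 and p_10 = 0.6
P = np.array([[0.7, 0.3],
              [0.6, 0.4]])

# detailed balance pi(0) p_01 = pi(1) p_10 gives pi(1)/pi(0) = p_01/p_10 = 0.5
pi = np.array([0.6, 0.3])
pi = pi / pi.sum()                                     # pi = [2/3, 1/3]

assert np.isclose(pi[0] * P[0, 1], pi[1] * P[1, 0])    # detailed balance holds
assert np.allclose(pi @ P, pi)                         # hence pi P = pi (stationary)
```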
Recall that pxy is given by
(
q(i)P (Xi = yi |X−i x−i ), if ∃i ∈ {1, 2, . . . , n + m} such that ∀j ∈ {1, 2, . . . , n + m}i
pxy =
0, otherwise
36/61
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 20
To prove: π(x)pxy = π(y)pyx
There are 3 cases that we need to consider.
Case 1: x and y differ in the state of more than one random variable.
In this case, by definition,
π(x)pxy = π(x) · 0 = 0
π(y)pyx = π(y) · 0 = 0
Hence the detailed balance condition holds trivially.
Case 2: x and y are equal (i.e., they do not differ in the state of any random variable).
In this case, by definition,
π(x)pxy = π(x)pxx
π(y)pyx = π(x)pxx
Hence the detailed balance condition holds trivially.
Case 3: x and y differ in the state of only one random variable, say the i-th one (so y = (yi, x−i)).
In this case, by definition,
π(x)pxy = π(x) q(i) π(yi|x−i)
        = π(xi, x−i) q(i) π(yi, x−i) / π(x−i)
        = π(yi, x−i) q(i) π(xi, x−i) / π(x−i)
        = π(y) q(i) π(xi|x−i)
        = π(y)pyx
Hence the detailed balance condition holds in all three cases, and P (X) is indeed the stationary distribution of our chain.
So our task is cut out now:
Define what our Markov Chain is? (done)
Define the transition matrix T for our Markov Chain (done)
Show how it is easy to sample from this chain (done)
Show that the stationary distribution of this chain is the distribution P (X) (i.e., the distribution that we care about) (done)
Show that the chain is irreducible and aperiodic (let us see)
A Markov chain is irreducible if one can get from any state in Ω to any other state in a finite number of transitions, or more formally,
∀ i, j ∈ Ω ∃ k > 0 with P(X^(k) = j | X^(0) = i) > 0
Intuitively, we can see that our chain is irreducible.
For example, notice that we can reach from the state containing all 0’s to the state containing all 1’s in some finite number of time steps.
We can prove this more formally, but for now we will just rely on the intuition.
A chain is called aperiodic if ∀ i ∈ Ω the greatest common divisor of {k | P(X^(k) = i | X^(0) = i) > 0 ∧ k ∈ N0} is 1.
The set we have defined above contains all the time steps at which we can return to state i starting from state i.
Suppose the chain was periodic; then this set would contain only multiples of a certain number.
For example, if the set were {3, 6, 9, 12, . . . }, the greatest common divisor would be 3 (and the Markov Chain would be periodic with a period of 3).
However, if the chain is not periodic then the set would contain arbitrary numbers and their GCD would just be 1 (hence the above definition).
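As a quick illustration of the definition (my own example, not from the lecture), the GCD test can be checked directly on a set of possible return times:

```python
from functools import reduce
from math import gcd

def period(return_times):
    """GCD of the set of time steps at which a return to the state has positive probability."""
    return reduce(gcd, return_times)

print(period([3, 6, 9, 12]))   # 3 -> periodic with period 3
print(period([2, 3, 5, 7]))    # 1 -> aperiodic
```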
Again, intuitively it should be clear that our chain is aperiodic.
Once again, we can formally prove this, but we will just rely on the intuition for now.
So our task is cut out now:
Define what our Markov Chain is? (done)
Define the transition matrix T for our Markov Chain (done)
Show how it is easy to sample from this chain (done)
Show that the stationary distribution of this chain is the distribution P (X) (i.e., the distribution that we care about) (done)
Show that the chain is irreducible and aperiodic (done)
Module 20.4 : Training RBMs using Gibbs Sampling
Okay, so we are now ready to write the full algorithm for training RBMs using Gibbs Sampling.
We will first quickly revisit the expectations that we wanted to compute and write a simplified expression for them.
(Figure: an RBM with hidden units h1, . . . , hn ∈ {0, 1} with biases c1, . . . , cn, visible units v1, . . . , vm ∈ {0, 1} with biases b1, . . . , bm, and connection weights wij collected in the matrix W. With i indexing hidden units and j indexing visible units, as in the gradient expressions below, the energy is E(V, H) = −Σi Σj wij hi vj − Σj bj vj − Σi ci hi.)

We were interested in computing the partial derivative of the log likelihood w.r.t. one of the parameters (wij):

∂L(θ)/∂wij = −Σ_H p(H|V) ∂E(V,H)/∂wij + Σ_{V,H} p(V,H) ∂E(V,H)/∂wij
           = Σ_H p(H|V) hi vj − Σ_{V,H} p(V,H) hi vj

We saw that this partial derivative is actually the difference of two expectations.
We will now simplify the expression for these two expectations.
∂L(θ)/∂wij = E_{p(H|V)}[vj hi] − E_{p(V,H)}[vj hi]
           = Σ_h p(h|v) hi vj − Σ_{v,h} p(v,h) hi vj
           = Σ_h p(h|v) hi vj − Σ_v p(v) Σ_h p(h|v) hi vj

We will first focus on Σ_h p(h|v) hi vj:

Σ_h p(h|v) hi vj = Σ_{hi} Σ_{h−i} p(hi|v) p(h−i|v) hi vj
                 = Σ_{hi} p(hi|v) hi vj Σ_{h−i} p(h−i|v)
                 = p(Hi = 1|v) vj
                 = σ(Σ_{j=1}^{m} wij vj + ci) vj

∂L(θ)/∂wij = σ(Σ_{j=1}^{m} wij vj + ci) vj − Σ_v p(v) σ(Σ_{j=1}^{m} wij vj + ci) vj
∂L(θ)/∂wij = σ(Σ_{j=1}^{m} wij vj + ci) vj − Σ_v p(v) σ(Σ_{j=1}^{m} wij vj + ci) vj
           = σ(wi v + ci) vj − Σ_v p(v) σ(wi v + ci) vj

In matrix form (wi denotes the i-th row of W),

∇_W L(θ) = σ(Wv + c) v^T − Σ_v p(v) σ(Wv + c) v^T
         = σ(Wv + c) v^T − E_v[σ(Wv + c) v^T]
∂L(θ)/∂bj = E_{p(H|V)}[vj] − E_{p(V,H)}[vj]
          = Σ_h p(h|v) vj − Σ_{v,h} p(v,h) vj
          = Σ_h p(h|v) vj − Σ_v p(v) Σ_h p(h|v) vj
          = vj Σ_h p(h|v) − Σ_v p(v) vj Σ_h p(h|v)
          = vj − Σ_v p(v) vj

∇_b L(θ) = v − Σ_v p(v) v
         = v − E_v[v]
∂L(θ)/∂ci = E_{p(H|V)}[hi] − E_{p(V,H)}[hi]
          = Σ_h p(h|v) hi − Σ_{v,h} p(v,h) hi
          = Σ_h p(h|v) hi − Σ_v p(v) Σ_h p(h|v) hi
          = p(Hi = 1|v) − Σ_v p(v) p(Hi = 1|v)
          = σ(Σ_{j=1}^{m} wij vj + ci) − Σ_v p(v) σ(Σ_{j=1}^{m} wij vj + ci)

∇_c L(θ) = σ(Wv + c) − Σ_v p(v) σ(Wv + c)
         = σ(Wv + c) − E_v[σ(Wv + c)]
Notice that all the 3 gradient expressions have an expectation term:

E_v[σ(Wv + c) v^T],  E_v[v],  E_v[σ(Wv + c)]

These expectations are intractable (they require a sum over all possible visible configurations v). Solution? Estimation with the help of sampling.
Specifically, we will use Gibbs Sampling to estimate the expectations using k samples v^(1), . . . , v^(k) drawn from the chain:

E_v[σ(Wv + c) v^T] ≈ (1/k) Σ_{t=1}^{k} σ(Wv^(t) + c) v^(t)T
E_v[v] ≈ (1/k) Σ_{t=1}^{k} v^(t)
E_v[σ(Wv + c)] ≈ (1/k) Σ_{t=1}^{k} σ(Wv^(t) + c)
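For instance, given samples v^(1), . . . , v^(k) from the Gibbs chain, the three expectations reduce to simple averages. A small sketch, assuming a sigmoid helper and W of shape (n, m), i.e. hidden × visible (these shape choices are my assumption, not stated in the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def estimate_expectations(samples, W, c):
    """Monte Carlo estimates of E_v[sigma(Wv+c) v^T], E_v[v], E_v[sigma(Wv+c)].

    samples : list of visible vectors v^(1), ..., v^(k) drawn via Gibbs sampling
    W, c    : RBM weights (n x m) and hidden biases (length n)
    """
    E_outer = np.mean([np.outer(sigmoid(W @ v + c), v) for v in samples], axis=0)
    E_v     = np.mean(samples, axis=0)
    E_sig   = np.mean([sigmoid(W @ v + c) for v in samples], axis=0)
    return E_outer, E_v, E_sig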
Algorithm: RBM Training with Block Gibbs Sampling
Input: RBM (V1 , ..., Vm , H1 , ..., Hn ), training batch D
Output: Learned parameters W, b, c
init W, b, c
forall v ∈ D do
    Randomly initialize v^(0)
    for t = 0, ..., k, k + 1, ..., k + r do
        for i = 1, ..., n do
            sample hi^(t) ∼ p(hi | v^(t))
        end
        for j = 1, ..., m do
            sample vj^(t+1) ∼ p(vj | h^(t))
        end
    end
    W ← W + η [σ(Wv + c) v^T − (1/r) Σ_{t=k+1}^{k+r} σ(Wv^(t) + c) v^(t)T]
    b ← b + η [v − (1/r) Σ_{t=k+1}^{k+r} v^(t)]
    c ← c + η [σ(Wv + c) − (1/r) Σ_{t=k+1}^{k+r} σ(Wv^(t) + c)]
end
(Here v denotes the current training example from D; the first k Gibbs steps serve as burn-in and the remaining r samples are averaged to estimate the model expectation.)
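Below is a minimal NumPy sketch of this training loop under stated assumptions: binary units, W of shape (n, m) so that σ(Wv + c) gives the hidden probabilities, k burn-in steps and r samples for the expectation, and block updates (all hidden units sampled together given v, all visible units together given h). It is an illustration of the algorithm above, not production code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_block_gibbs(D, n, m, k=10, r=5, eta=0.01, rng=np.random.default_rng(0)):
    """Sketch of RBM training with block Gibbs sampling.

    D : iterable of binary visible vectors of length m
    n : number of hidden units, m : number of visible units
    """
    W = 0.01 * rng.standard_normal((n, m))    # weights, hidden x visible
    b = np.zeros(m)                           # visible biases
    c = np.zeros(n)                           # hidden biases

    for v in D:                               # one update per training example
        v_t = (rng.random(m) < 0.5).astype(float)     # randomly initialize v^(0)
        samples = []
        for t in range(k + r):
            h_t = (rng.random(n) < sigmoid(W @ v_t + c)).astype(float)    # h ~ p(h|v)
            v_t = (rng.random(m) < sigmoid(W.T @ h_t + b)).astype(float)  # v ~ p(v|h)
            if t >= k:                        # keep the last r samples (after burn-in)
                samples.append(v_t)

        # Monte Carlo estimates of the model expectations
        neg_W = np.mean([np.outer(sigmoid(W @ s + c), s) for s in samples], axis=0)
        neg_b = np.mean(samples, axis=0)
        neg_c = np.mean([sigmoid(W @ s + c) for s in samples], axis=0)

        # Gradient ascent on the log-likelihood
        W += eta * (np.outer(sigmoid(W @ v + c), v) - neg_W)
        b += eta * (v - neg_b)
        c += eta * (sigmoid(W @ v + c) - neg_c)
    return W, b, c
```

Here the conditionals p(hi = 1|v) = σ((Wv + c)i) and p(vj = 1|h) = σ((W^T h + b)j) are the standard RBM conditionals implied by the energy function above.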
Module 20.5 : Training RBMs using Contrastive Divergence
In practice, Gibbs Sampling can be very inefficient because for every step of stochastic gradient descent we need to run the Markov chain for many steps and then compute the expectation using the samples drawn from this chain.
We will now see a more efficient algorithm called k-step contrastive divergence which is used in practice for training RBMs.
Just to reiterate, our goal is to compute the two expectations efficiently.
We already have a simplified formula for the first expectation:

E_{p(H|V)}[vj hi] = σ(wi v + ci) vj
E_{p(V,H)}[vj hi] = Σ_v p(v) σ(wi v + ci) vj

Furthermore, note that the first expectation depends only on the seen training example (v).
The second expectation depends on the samples drawn from the Markov chain (v^(1), v^(2), . . .).
The first expectation thus depends on the empirical samples, whereas the second expectation depends on the model samples (because the samples are generated based on P(V|H) and P(H|V) output by the model).
Contrastive divergence uses the following idea:
Instead of starting the Markov Chain at a random point (V = v^(0)), start from v^(t), where v^(t) is the current training instance.
Run Gibbs Sampling for k steps and denote the sample at the k-th step by ṽ.

(Figure: the Gibbs chain started at the training sample, V_s → (∼ p(h|v)) → (∼ p(v|h)) → V^(1) → . . . → V^(k) = Ṽ.)
Over time, as our model becomes better and better, ṽ should start looking more and more like our training (empirical) samples.
Once that starts happening, what will happen to the gradient?
We consider the derivative w.r.t. wij again:

∂L(θ)/∂wij = σ(wi v + ci) vj − Σ_v p(v) σ(wi v + ci) vj

(When ṽ resembles the training sample v, the sampled estimate of the second (model) term approaches the first (data) term, so the gradient shrinks towards zero.)
Algorithm: k-step Contrastive Divergence
Input: RBM (V1 , ..., Vm , H1 , ..., Hn ), training batch D
Output: Learned parameters W, b, c
init W = b = c = 0
forall v ∈ D do
    Initialize v^(0) ← v
    for t = 0, ..., k − 1 do
        for i = 1, ..., n do
            sample hi^(t) ∼ p(hi | v^(t))
        end
        for j = 1, ..., m do
            sample vj^(t+1) ∼ p(vj | h^(t))
        end
    end
    ṽ ← v^(k)
    W ← W + η [σ(Wv + c) v^T − σ(Wṽ + c) ṽ^T]
    b ← b + η [v − ṽ]
    c ← c + η [σ(Wv + c) − σ(Wṽ + c)]
end
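A minimal NumPy sketch of CD-k under the same assumptions as the earlier block Gibbs sketch (binary units, W of shape (n, m), the same sigmoid helper): the only changes are that the chain is initialized at the training example and that the single sample ṽ after k steps replaces the averaged model expectation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd_k(D, n, m, k=1, eta=0.01, rng=np.random.default_rng(0)):
    """Sketch of k-step contrastive divergence training for a binary RBM."""
    W = np.zeros((n, m))   # init W = b = c = 0, as in the algorithm above
    b = np.zeros(m)        # (small random initial weights are also common in practice)
    c = np.zeros(n)

    for v in D:
        v_tilde = v.astype(float)          # start the chain at the training example
        for _ in range(k):                 # run Gibbs sampling for k steps
            h = (rng.random(n) < sigmoid(W @ v_tilde + c)).astype(float)
            v_tilde = (rng.random(m) < sigmoid(W.T @ h + b)).astype(float)

        # positive (data) phase minus negative (model) phase, using the single sample v_tilde
        W += eta * (np.outer(sigmoid(W @ v + c), v)
                    - np.outer(sigmoid(W @ v_tilde + c), v_tilde))
        b += eta * (v - v_tilde)
        c += eta * (sigmoid(W @ v + c) - sigmoid(W @ v_tilde + c))
    return W, b, c
```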