Advanced Machine Learning: CS 281
Contents
Discrete Models
Linear Regression
Linear Classification
Exponential Families
Neural Networks
Information Theory
Mixture Models
Mean Field
CS281: Advanced ML September 6, 2017
• Discrete models take values from a countable set, e.g. {0, 1} or {cold, flu, asthma}, and are simpler than
continuous models.
• We will use simple discrete models to develop our tactics, such as marginalization and conditioning.
• Today, we will focus on coins as a real-world example.
Easy Prior
Assume we know the coin came from one of 3 unknown manufacturers (later, we’ll have mixture model
estimation, but for now assume these probabilities come from an oracle).
1. θ = 0.4 with probability .1
2. θ = 0.5 with probability .8
3. θ = 0.6 with probability .1
p(θ ) = 0.1 · δ(θ = 0.4) + 0.8 · δ(θ = 0.5) + 0.1 · δ(θ = 0.6)
Likelihood
Likelihood = p(data|parameters). For the coin example,
p(coin flips | θ) = Bin(N_1 | N, θ) = \binom{N}{N_1} θ^{N_1} (1 − θ)^{N − N_1},   where N = N_0 + N_1 = number of flips
Note that the last two factors, the "score", are our focus since they are the only terms that depend on θ. The
first factor (the binomial coefficient) normalizes the distribution.
2.2 Inference
Inference 1: p(θ | x), where the data x consists of the counts (N_0, N_1). How can we estimate θ?
Note that Inference ≠ Decision Making. If we asked you to make a bet on the coin, based on this you could
either
1. Always take heads if θ > .5. In this case, p(win) = θ
2. Take heads with probability θ. In this case, p(win) = θ² + (1 − θ)² [p(is heads) · p(choose heads) +
p(is tails) · p(choose tails)]
If θ = 0.6, for option 1, p(win) = θ = 0.6. For option 2, p(win) = θ² + (1 − θ)² = 0.52. In this case, the additional
information used in the calculation of option 2 does not result in a better decision.
• Prior: p(θ )
θ MAP = argmaxθ p(θ |x) = argmaxθ log [ p(x|θ ) p(θ )] from Bayes’ Rule: p(θ |x) ∝ p(x|θ ) p(θ )
Consider an example:
p(θ = 0.4 | N_0, N_1) ∝ \binom{N}{N_1} (0.4)^{N_1} (1 − 0.4)^{N_0} · (0.1)
p(θ = 0.45 | N_0, N_1) = 0, due to the sparsity of the prior; analogous expressions hold for θ = 0.5 and 0.6
θ MAP = θ MLE when we have a uniform prior since the MLE calculation does not explicitly factor a prior
into its calculation.
Full Posterior
Partition or Marginal Likelihood: p(N_0, N_1) = ∫_θ p(N_0, N_1, θ) dθ.

p(θ | N_0, N_1) = p(x | θ) p(θ) / p(N_0, N_1).   Note that p(N_0, N_1) is a very difficult term to compute.
Beta Prior
p(θ | α_0, α_1) = [Γ(α_0 + α_1) / (Γ(α_0) Γ(α_1))] θ^{α_1 − 1} (1 − θ)^{α_0 − 1},   support ∈ [0, 1]

From the plot of the Beta density for different parameters, we can see that it can either be balanced,
skewed to one side, or tend toward infinity on one side.
With a beta prior:
p(θ | N_0, N_1) = [Γ(α_0 + α_1) / (Γ(α_0) Γ(α_1))] θ^{α_1 − 1} (1 − θ)^{α_0 − 1} θ^{N_1} (1 − θ)^{N_0} · (constant normalization term w.r.t. θ)
The key insight is that we get additive terms in the exponent and the resulting distribution looks like an-
other beta. The prior ”counts” (pseudocounts) from the hyperparameters can be interpreted as counts we
have beforehand.
p(θ | N_0, N_1) = [Γ(α_0 + α_1) / (Γ(α_0) Γ(α_1))] θ^{N_1 + α_1 − 1} (1 − θ)^{N_0 + α_0 − 1} · (constant normalization term w.r.t. θ)
Figure 2.1: Beta Params
p(θ | N_0, N_1) = [Γ(α_0 + α_1 + N_0 + N_1) / (Γ(N_0 + α_0) Γ(N_1 + α_1))] θ^{N_1 + α_1 − 1} (1 − θ)^{N_0 + α_0 − 1} = Beta(θ | N_0 + α_0, N_1 + α_1)   (posterior)
The mode of the Beta gives us back θ MAP , but with additional information about the shape of the distribu-
tion. What does the prior that tends to infinity at 1 imply? That in the absence of other information, the
coin is definitely heads.
Predictive Distribution
p(x̂ | N_0, N_1) = ∫_θ p(x̂ | θ, N_0, N_1) p(θ | N_0, N_1) dθ
               = ∫_θ θ p(θ | N_0, N_1) dθ        (for x̂ = heads)
               = E_{θ ∼ p(θ | N_0, N_1)}[θ]
This is the expectation under the posterior of θ which is the mean of the Beta distribution. Feel free to prove
this as an exercise.
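As a quick numerical check of these formulas, here is a minimal sketch (using scipy.stats, with made-up counts and hyperparameters) of the Beta posterior, its mode (the MAP estimate), and the posterior-predictive probability of heads:

```python
import numpy as np
from scipy import stats

# Hypothetical data and hyperparameters (not from the lecture).
N1, N0 = 7, 3          # observed heads and tails
a1, a0 = 2.0, 2.0      # Beta pseudocounts for heads and tails

# Posterior: theta ~ Beta(N1 + a1, N0 + a0) in scipy's convention
# (first argument multiplies theta's exponent, i.e. heads).
post = stats.beta(N1 + a1, N0 + a0)

# MAP estimate = mode of the Beta (valid when both parameters exceed 1).
theta_map = (N1 + a1 - 1) / (N1 + a1 + N0 + a0 - 2)

# Posterior predictive p(next flip = heads) = posterior mean of theta.
p_heads = post.mean()   # equals (N1 + a1) / (N1 + a1 + N0 + a0)

print(theta_map, p_heads)
```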
Marginal Likelihood
p(N_0, N_1) = ∫_θ p(x_1, . . . , x_N | θ) p(θ) dθ

            = ∫_θ [Γ(α_0 + α_1) / (Γ(α_0) Γ(α_1))] θ^{α_1 + N_1 − 1} (1 − θ)^{α_0 + N_0 − 1} dθ

The first term can be moved outside, as it does not depend on θ. After introducing our normalization term
and making the distribution inside integrate to 1,

p(N_0, N_1) = [Γ(α_0 + α_1) / (Γ(α_0) Γ(α_1))] · [Γ(N_0 + α_0) Γ(N_1 + α_1) / Γ(N_0 + N_1 + α_0 + α_1)]
2.3 Extensions on the Coin Flip Model: Super Coins
• Many correlated coins: models of binary data, important for discrete graphical models
Bernoulli(x | θ) = θ^x (1 − θ)^{1 − x}

Categorical(x | θ) = ∏_k θ_k^{x_k}                                    generalization of Bernoulli

Multinomial(x | θ) = [(∑_k x_k)! / ∏_k x_k!] ∏_k θ_k^{x_k}            generalization of Binomial

Dirichlet(θ | α) = [Γ(∑_k α_k) / ∏_k Γ(α_k)] ∏_k θ_k^{α_k − 1}        generalization of Beta, often used as a prior
Note that the Dirichlet distribution is the conjugate prior of the Categorical and Multinomial distributions.
CS281: Advanced ML September 11, 2017
3.1 Examples
Multivariate gaussians are used for modeling in various applications, where knowing mean and variance
is useful:
where for many problems we focus on the quadratic form ( x − µ) T Σ−1 ( x − µ) (which geometrically can be
thought of as distance) and ignore the normalization factor (2π )− D/2 |Σ|−1/2 . The figure below plots the
contours of a bivariate Normal for various µ and Σ (in the figure, ρ denotes the off-diagonal elements of Σ,
given by the covariance of x1 and x2 ).
Note that we can decompose the quadratic form using the eigendecomposition Σ^{-1} = U^T Λ^{-1} U:

(x − µ)^T Σ^{-1} (x − µ) = (x − µ)^T U^T Λ^{-1} U (x − µ)
                        = (x − µ)^T ( ∑_d (1/λ_d) U_d U_d^T ) (x − µ)
                        = ∑_d (1/λ_d) (x − µ)^T U_d U_d^T (x − µ),

where (x − µ)^T U_d can be interpreted as the projection of (x − µ) onto U_d (each of which can be thought of as
a univariate Gaussian direction), the eigenvector corresponding to the eigenvalue λ_d. Since the quadratic form is the weighted sum of
the squares of such projections (with weights 1/λ_d, which play the role of 1/σ²), we can describe the MVN as a tiling of univariate Gaussians.
• ’Overkill’: We can perform a change of variables¹. Here, we have x = A^{-1}(y − b) and |dx/dy| = |A^{-1}|,
leading to

p(y) = N(A^{-1}(y − b) | 0, I) |A^{-1}|
     = (1/z) exp[ −(1/2) (A^{-1}(y − b))^T (A^{-1}(y − b)) ]
     = (1/z) exp[ −(1/2) (y − b)^T (A^{-1})^T (A^{-1}) (y − b) ]
     = N(y | b, A A^T),

where z is the normalizing constant.
• Using the properties of MVN, we know that y is also MVN, so is completely specified by its mean and
covariance matrix which can easily be derived,
Thus, we can generate an MVN from N(0, I) via the transformation y = Ax + b, where we set A = UΛ^{1/2} so that
Σ_Y = AA^T = Σ. Then shifts are represented by b, stretches by Λ^{1/2}, and rotations by U.
1A change of variables can be done in the following way: Let y = f ( x ) and assume f is invertible so that x = f −1 (y). Then
p(y) = p( x )|dx/dy|. This is a technique which will be used often in this course
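A minimal numerical sketch of this sampling recipe (with an example µ and Σ assumed here, using the eigendecomposition as the matrix square root):

```python
import numpy as np

rng = np.random.default_rng(0)

# Example (assumed) mean and covariance for a bivariate Normal.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Eigendecomposition Sigma = U diag(lam) U^T; take A = U diag(sqrt(lam)).
lam, U = np.linalg.eigh(Sigma)
A = U @ np.diag(np.sqrt(lam))

# y = A x + mu with x ~ N(0, I) is distributed N(mu, A A^T) = N(mu, Sigma).
x = rng.standard_normal((100000, 2))
y = x @ A.T + mu

print(y.mean(axis=0))           # ~ mu
print(np.cov(y, rowvar=False))  # ~ Sigma
```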
Thus, it is not only expected that x lies on the boundary but as D increases most of its realizations will in
fact fall on the boundary2 .
µ_{1|2} = µ_1 + Σ_{12} Σ_{22}^{-1} (x_2 − µ_2),        Σ_{1|2} = Σ_{11} − Σ_{12} Σ_{22}^{-1} Σ_{21}.
2 It is left as an exercise to show that this formula holds. Hint: Use the fact that we assumed no covariance.
CS281: Advanced ML September 13, 2017
µ_MLE = arg max_µ − ∑_n (x_n − µ)^T Σ^{-1} (x_n − µ)

dL/dµ = ∑_n Σ^{-1} (x_n − µ) = 0
  ⇔ µ*_MLE = (1/N) ∑_n x_n
Similarly,
dL/dΣ = (exercise) = 0
  ⇔ Σ*_MLE = (1/N) ∑_n x_n x_n^T = (1/N) X^T X   (with the x_n centered at µ_MLE)
For calculating dL/dΣ as an exercise, the following identities might be helpful:

• (d/dA) ln |A| = A^{-1}
• (d/dA) tr(BA) = B^T
• tr(ABC) = tr(BCA)
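A short numpy sketch of these MLE formulas on synthetic data (the "true" parameters below are made up just to exercise the formulas; the centering at the sample mean is made explicit):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from an assumed "true" MVN.
true_mu = np.array([0.5, -1.0, 2.0])
true_Sigma = np.array([[1.0, 0.3, 0.0],
                       [0.3, 2.0, 0.5],
                       [0.0, 0.5, 1.5]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)
N = X.shape[0]

# mu_MLE = (1/N) sum_n x_n
mu_mle = X.mean(axis=0)

# Sigma_MLE = (1/N) sum_n (x_n - mu)(x_n - mu)^T = (1/N) Xc^T Xc
Xc = X - mu_mle
Sigma_mle = (Xc.T @ Xc) / N

print(mu_mle)
print(Sigma_mle)
```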
4.3 Linear-Gaussian Models
Let x be a vector of affine, noisy observations with a prior distribution:
x ∼ N ( m 0 , S0 )
4.3.1 p( x |y)
We are interested in calculating the posterior distribution: p( x |y).
p( x |y) ∝ p( x ) p(y| x )
p(x | y) ∝ p(x) p(y | x)
        = exp{ −(1/2) [ (x − m_0)^T S_0^{-1} (x − m_0) + (y − (Ax + b))^T Σ_y^{-1} (y − (Ax + b)) ] }
        = exp{ −(1/2) [ x^T S_0^{-1} x − 2 x^T S_0^{-1} m_0
                        + x^T (A^T Σ_y^{-1} A) x − 2 x^T (A^T Σ_y^{-1}) y + 2 x^T (A^T Σ_y^{-1}) b + . . . ] }

The terms containing x are what matter: x^T S_0^{-1} x and x^T (A^T Σ_y^{-1} A) x are quadratic in x, while the other
displayed terms are linear in x. The remaining terms are constants that are swallowed up by the proportional-
ity. By Gaussian-Gaussian conjugacy, we know the resulting distribution should be Gaussian. To find the
parameters, we'll modify p(x | y) to fit the form of a Normal. This requires completing the square!
• "a" is S_N^{-1} = S_0^{-1} + A^T Σ_y^{-1} A
• "h" is m_N = S_N [ S_0^{-1} m_0 + A^T Σ_y^{-1} (y − b) ]
In this more "intuitive" representation, we find that p(x | y) has the form of N(m_N, S_N). Murphy also has a
more explicit representation:

• Σ_{x|y}^{-1} = Σ_x^{-1} + A^T Σ_y^{-1} A
• µ_{x|y} = Σ_{x|y} [ Σ_x^{-1} µ_x + A^T Σ_y^{-1} (y − b) ]

(A small numerical sketch of these updates follows below.)
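Here is a minimal sketch of the Gaussian posterior update for p(x | y); all of A, b, Σ_y, m_0, S_0 and the observed y are chosen arbitrarily for illustration:

```python
import numpy as np

# Assumed prior x ~ N(m0, S0) and likelihood y | x ~ N(Ax + b, Sigma_y).
m0 = np.array([0.0, 0.0])
S0 = np.eye(2)
A = np.array([[1.0, 0.5],
              [0.0, 1.0],
              [2.0, -1.0]])
b = np.array([0.1, -0.2, 0.0])
Sigma_y = 0.5 * np.eye(3)

y = np.array([1.0, 0.3, -0.5])   # an observed y (made up)

# Posterior precision and mean:
# S_N^{-1} = S0^{-1} + A^T Sigma_y^{-1} A
# m_N = S_N [ S0^{-1} m0 + A^T Sigma_y^{-1} (y - b) ]
Sy_inv = np.linalg.inv(Sigma_y)
SN_inv = np.linalg.inv(S0) + A.T @ Sy_inv @ A
SN = np.linalg.inv(SN_inv)
mN = SN @ (np.linalg.inv(S0) @ m0 + A.T @ Sy_inv @ (y - b))

print(mN)
print(SN)
```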
4.3.3 p(y)
We now calculate the normalizer term p(y), obtained by integrating x out of the linear model
y = Ax + b + e.
The result is that y also follows a Normal distribution:
p(y) = N(y | A m_0 + b, Σ_y + A S_0 A^T)
4.3.4 Prior (just for µ)
p ( µ ) = N ( µ | m 0 , S0 )
where m_0, S_0 are the pseudo mean and pseudo (co)variance. p(µ) is chosen Gaussian because the Gaussian is the
conjugate prior of itself (for the mean). A prior is called a conjugate prior if the posterior it induces lies in the same
family of distributions as the prior.
Applying the general result above ("a" is S_N^{-1} = S_0^{-1} + A^T Σ_y^{-1} A, "h" is m_N = S_N [S_0^{-1} m_0 + A^T Σ_y^{-1} (y − b)])
with A = I and b = 0 gives

S_N^{-1} = S_0^{-1} + Σ^{-1}
m_N = S_N [ S_0^{-1} m_0 + Σ^{-1} X ]

Hence,

p(µ | X) = N(µ | m_N, S_N)
Here, we define the problem as attempting to compute p(y | x, θ). Consider the following example. We
assume that our data is generated as follows:
y = w T x + noise
Further, we assume that the noise (denoted by e) is distributed as Gaussian with mean 0; that is:
e ∼ N (0, σ2 )
Then, we have:
p(y | x, θ ) = N (y | w T x, σ2 )
4.4.1 Log Likelihood
Consider a dataset that looks like {(x_i, y_i)}_{i=1}^N. The log-likelihood L(θ) is given by:
Note that data here refers to just the yi ’s. The yn ’s are called the target; the w represents the weights; and
the xn ’s are the observations. The term (yn − w T xn )2 is essentially just the residual sum of squares.
There is an analytical solution to this, and we obtain it by simply computing the gradient and setting it to
0.
∂_w [ w^T X^T X w − 2 w^T X^T y ] = 2 X^T X w − 2 X^T y

w_MLE = (X^T X)^{-1} X^T y
As we will see in homework 1, ( X T X )−1 X T y can be viewed as the projection of y onto the column space of
X.
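A quick sketch of this closed-form MLE on synthetic data (a numerically stable solve rather than an explicit inverse; all values below are made up):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic regression data y = X w_true + noise.
N, D = 200, 3
X = rng.standard_normal((N, D))
w_true = np.array([1.5, -2.0, 0.7])
y = X @ w_true + 0.1 * rng.standard_normal(N)

# w_MLE = (X^T X)^{-1} X^T y, computed via a linear solve.
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalently, np.linalg.lstsq projects y onto the column space of X.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_mle, w_lstsq)
```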
p(w) = N (w | m0 , S0 )
Thus, we have:
p(y | X, w, µ, σ2 ) = N (y | µ + X T w, σ2 I )
We assume that µ = 0.
The posterior then is of the form:
p(w | . . .) ∝ N (w | m0 , S0 )N (y | X T w, σ2 I )
Applying the results obtained above with the linear Gaussian results, with:
b=0
A = XT
Σy = σ2 I
Thus, we have:
S_N^{-1} = S_0^{-1} + (1/σ²) X^T X

m_N = S_N [ S_0^{-1} m_0 + (1/σ²) X^T y ]

p(ŷ | x, data) = N(ŷ | x^T m_N, σ² + x^T S_N x)
The variance term is particularly interesting because now the variance depends on the actual data;
the Bayesian method has thus produced a different result. The mean, however, is the same as the
MAP estimate (x^T m_N).
x → φ( x )
Examples include:
• φ1 ( x ) = sin( x )
• φ2 ( x ) = sin(λx )
• φ3 ( x ) = max(0, x )
• φ(x; w) = max(0, w^T x)
The last example is the core of neural networks and deep learning, where the weights w are learned at each
layer.
CS281: Advanced ML September 18, 2017
where y is the class label and comes from a categorical distribution, and x_j is a dimension of the input x.
In Naïve Bayes, the form of the class distribution is fixed and parametrized independently from the class
conditional distribution. The "Naïve" in "Naïve Bayes" refers precisely to the assumption that the features
x_j are conditionally independent of one another given y. Depending on what the data looks like, we can choose a different form for the
class conditional distribution.
Here we present three possible choices for the class conditional distribution:
• Multivariate Bernoulli Naïve Bayes:
x_j | y ∼ Bern(µ_jc) if y = c
Here y takes values in a set of classes, and µ_jc is a parameter associated with a specific feature (or
dimension) in the input and a specific class. We use the multivariate Bernoulli when we only allow two
possible values for each feature, therefore x_j | y follows a Bernoulli distribution.
We can think of x as living in a hypercube, with each dimension j having an associated µ for each
class c. From here we get the name multivariate Bernoulli distribution.
• Categorical Naı̈ve Bayes:
x j |y ∼ Cat(µ jc ) if y = c
We use the Categorical Naı̈ve Bayes when we allow different classes for each feature j, so x j |y follows
a Categorical distribution.
• Multivariate Normal Naı̈ve Bayes
x | y ∼ N(µ_c, Σ_c) with Σ_c diagonal
Note that here we use x vector and not a specific feature. Since we impose that Σc is a diagonal matrix,
we have no covariance between features, so this comes down to having an independent normal for
each feature (or dimension) of the input. This is also required by the "Naïve" assumption of conditional
independence. We would use MVN Naïve Bayes when the features take continuous values in R^n.
where in equation (5.2) we assume conditional independence (the ”Naı̈ve” assumption). The term p( xnj |yn )
depends on the generative model used for x and also on the class yn .
We can then solve for the parameters maximizing the likelihood, which is equivalent to maximizing the log
likelihood.
(π_MLE, µ_MLE) = argmax_{(π, µ)} ∑_n log p(x_n, y_n | param)        (5.3)
where p(π ) represents the prior on class distribution and ∏ j ∏c p(µ jc ) represents prior on class conditional
distribution.
Now, what prior should we use?
1. π: Dirichlet (goes with Categorical)
2. µ jc :
(a) Beta (goes with Bernoulli)
(b) Dirichlet (goes with Categorical)
(c) Normal (goes with Normal)
(d) Inverse-Wishart (Iw) (goes with Normal)
Here, what distribution we choose depends on our choice of class conditional distribution.
Recall that we want to use conjugate priors to have a natural update (that’s why we pair them up!). By
using conjugate priors, we will have:
5.4.1 Intuition
You can think of the αi above as initial pseudocounts. Those pseudocounts give nonzero probability to fea-
tures we haven’t seen before, which is crucial for NLP. For unseen features, you could have a pseudocount
of 1 or 0.5 (Laplace term) or something.
Because of this property, a Bayesian model helps prevent overfitting by introducing such priors: consider
the spam email classification problem mentioned before. Say the word "subject" (call it feature j) always
occurs in both classes ("spam" and "not spam"), so we estimate θ̂_jc = 1 (we overfit!). What will happen if
we encounter a new email which does not have this word in it? Our algorithm will crash and burn! This is
another manifestation of the black swan paradox discussed in Book Section 3.3.4.1. Note that this will not
happen if we introduce pseudocounts to all features!
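A small sketch of this smoothing idea for Bernoulli Naïve Bayes (binary word features, with an assumed pseudocount α added to every feature/class count so that no estimate is exactly 0 or 1; the data below are made up):

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """X: (N, D) binary feature matrix, y: (N,) integer class labels.

    Returns class labels, class priors pi, and smoothed feature probabilities
    mu[c, j], using pseudocount alpha (Laplace smoothing) in every count.
    """
    classes = np.unique(y)
    N, D = X.shape
    pi = np.zeros(len(classes))
    mu = np.zeros((len(classes), D))
    for i, c in enumerate(classes):
        Xc = X[y == c]
        pi[i] = len(Xc) / N
        # Smoothed estimate: (count of feature j in class c + alpha) / (N_c + 2*alpha)
        mu[i] = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
    return classes, pi, mu

# Tiny made-up example: 2 classes, 3 binary features.
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([0, 0, 1, 1])
print(fit_bernoulli_nb(X, y))
```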
Figure 5.2: The Sigmoid Function
where the first two terms, log π_c + ∑_j log(1 − µ_jc), form a constant (we call it b_c, the bias), and the last term,
∑_j x_j log( µ_jc / (1 − µ_jc) ), is linear in x (we call its coefficient vector θ_c).
So, we have:
p(y = c | x ) ∝ exp(θcT x + bc )
Thus, in order to determine which class ("spam" or "not spam"), for each class we simply compute a linear
function of x, and compare the two. Our θ^T x + b is going to be associated with a linear separator
of the data. Even better, for prediction, we can simply compute θ and b (as shown above) in closed form
for both MAP and MLE cases.

p(y = c | x) = (1/Z) exp(θ_c^T x + b_c)
Figure 5.3: The Softmax Function
In general, we can compute the normalizer by summing over all our classes.

Z(θ) = ∑_{c'} exp(θ_{c'}^T x + b_{c'})
In practice, this summation is often computationally expensive. However, it is not necessary to compute
this sum if we are only interested in the most likely class label given an input.
We call the resulting probability density function the softmax:
softmax(z)_i = exp(z_i) / ∑_{i'} exp(z_{i'})
This function generalizes the sigmoid function to multiple classes/dimensions. We call it the ”softmax”
because we may think of it as a smooth, differentiable version of the function which simply returns 1 for
the most likely class (or argmax).
What are the advantages and disadvantages of this approach? The primary disadvantage compared to
methods we have seen earlier is that this maximum likelihood estimate has no closed form. It is also not
clear how we might incorporate our prior (although there is recent work in this area). On the other hand,
the objective (the negative log-likelihood) is convex and it is easy (at least mathematically, not necessarily computationally) to compute
gradients, so we may use gradient descent.

d(·)/dθ_c = ∑_n x_n · ( 1[y_n = c] − softmax(θ^T x_n)_c )        (5.8)

(i.e., the term is x_n (1 − softmax(θ^T x_n)_c) when y_n = c, and −x_n softmax(θ^T x_n)_c otherwise)
This model is known as logistic regression (even though it is used for classification, not regression) and is
widely used in practice.
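As a concrete sketch of this gradient-based fitting (a plain batch gradient-ascent loop on made-up data, using the gradient (5.8) above; the step size and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(10)

N, D, C = 300, 4, 3
X = rng.standard_normal((N, D))
y = rng.integers(0, C, size=N)          # made-up class labels

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # numerically stable
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

theta = np.zeros((C, D))
Y = np.eye(C)[y]                         # one-hot indicators 1[y_n = c]

for _ in range(500):
    P = softmax(X @ theta.T)             # (N, C) predicted probabilities
    grad = (Y - P).T @ X                 # sum_n x_n (1[y_n=c] - p_nc), per class
    theta += 0.01 * grad / N             # gradient ascent on the log-likelihood

print(theta)
```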
More Resources on Optimization
• Convex Optimization by Lieven Vandenberghe and Stephen P. Boyd
CS281: Advanced ML September 20, 2017
6.1 Introduction
(The Wainwright and Jordan textbook presents a more detailed coverage of the material in this lecture.)
This lecture, we will unify all of the fundamentals presented so far:
where

• µ — mean parameters
• θ(µ) — natural / canonical / exponential parameters
• Z(θ), A(θ) — also written as Z(θ(µ)) or Z(µ); the partition function and log-partition function
• φ(x) — sufficient statistics of x, potential functions, "features"
• h(x) — scaling term; in most cases, we have h(x) = 1
Note that there is “minimal form” and “overcomplete form”.
For the minimal form, we have

h(x) = 1
φ(x) = x
θ(µ) = log( µ / (1 − µ) )   ("log odds")
µ = σ(θ)
A(µ) = − log(1 − µ)
A(θ) = − log(1 − σ(θ)) = θ + log(1 + e^{−θ})
For the overcomplete form, we have

φ(x) = [ x, 1 − x ]^T
θ = [ log µ, log(1 − µ) ]^T
For the Categorical/Multinoulli distribution, we have

θ = [ log µ_1, . . . , log µ_K ]^T

where ∑_c µ_c = 1.
Side note: Writing out in overcomplete form usually comes with some constraints.
N(x | µ, σ²) = (2πσ²)^{−1/2} exp{ −(x − µ)² / (2σ²) }
             = (2πσ²)^{−1/2} exp{ −(1/(2σ²)) x² + (µ/σ²) x − (1/(2σ²)) µ² }

so that, reading off θ^T φ(x) and A(µ, σ²),

φ(x) = [ x, x² ]^T
θ = [ µ/σ², −1/(2σ²) ]^T
A(µ, σ²) = (1/2) log(2πσ²) + µ²/(2σ²)

and, inverting,

µ = −θ_1 / (2θ_2)
σ² = −1 / (2θ_2)
A(θ) = −(1/2) log(−2θ_2) − θ_1² / (4θ_2)
6.4 Properties of Exponential Families
Most inference problems involve a mapping between natural parameters and mean parameters, so this is a
natural framework.
Here are three properties of exponential families:
Property 1 Derivatives of A(θ ) provide us the cumulants of the distribution E(φ( x )), var(φ( x )):
Proof. For the univariate, first-order case:

dA/dθ = d/dθ ( log Z(θ) )
      = d/dθ log ∫ exp{θ φ(x)} h(x) dx        (Z(θ) is what makes p(x) integrate to 1)
      = ∫ φ(x) exp{θ φ(x)} h(x) dx / ∫ exp{θ φ(x)} h(x) dx
      = ∫ φ(x) exp{θ φ(x)} h(x) dx / exp(A(θ))
      = ∫ φ(x) exp{θ φ(x) − A(θ)} h(x) dx        (the integrand is φ(x) p(x))
      = ∫ φ(x) p(x) dx
      = E(φ(x))
The same property holds for multivariates (refer to textbook for proof).
Bernoulli:

A(θ) = θ + log(1 + e^{−θ})

dA/dθ = 1 − e^{−θ} / (1 + e^{−θ}) = 1 / (1 + e^{−θ}) = σ(θ) = µ        (the sigmoid)
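A tiny numerical sketch of this property for the Bernoulli case: a finite-difference derivative of A(θ) matches σ(θ), the mean parameter.

```python
import numpy as np

def A(theta):
    # Log-partition of the Bernoulli in natural parameterization:
    # A(theta) = theta + log(1 + e^{-theta}) = log(1 + e^{theta}).
    return np.log(1.0 + np.exp(theta))

theta = 0.7
eps = 1e-6

# Finite-difference derivative of A at theta.
dA = (A(theta + eps) - A(theta - eps)) / (2 * eps)

# Mean parameter mu = sigma(theta) = E[phi(x)] = E[x].
mu = 1.0 / (1.0 + np.exp(-theta))

print(dA, mu)   # the two values agree to high precision
```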
E(φ(x)) = ∑_d φ(x_d) / N

Setting the mean parameter equal to the sample mean of the sufficient statistics gives us the MLE.

With natural parameters η and s̄ = ∑_d φ(x_d) / N,

p(data | η) ∝ exp[ (N s̄) η − N A(η) ]
p(η | N_0, s̄_0) ∝ exp[ (N_0 s̄_0) η − N_0 A(η) ]

(the N_0 A(η) term in the prior is not a log partition, which would have to be a function strictly of the prior's own parameters).

The above two distributions have the same sufficient statistics – so we have a conjugate prior. It also tells
us that it is not a coincidence that we kept obtaining pseudocounts. (More references will be put up to
describe this.)
x −→ µ = g^{−1}(w^T x + b) −→ θ(µ) −→ p(y | x).

Example 1 Exponential family: Normal distribution with σ² = 1, and g^{−1} is the identity function, g^{−1} : R → R. This
gives us linear regression:
µ = w^T x + b.

Example 2 Exponential family: Bernoulli distribution, and g^{−1} is the sigmoid function σ : R → (0, 1).
Now, µ = σ(w^T x + b) and θ = log( µ / (1 − µ) ). This is how we define logistic regression.

Example 3 Exponential family: Categorical distribution with g^{−1} as the softmax function.
µ_c = softmax(w_c^T x + b_c)_c
θ_c = log µ_c
CS281: Advanced ML September 25, 2017
7.1 Introduction
Neural networks have been a hot topic recently in machine learning. But everything we will cover today has
essentially been known since the '70s and '80s. Since then, there has been an increased focus on this subject due
to its successes following improved computing power, larger datasets, better neural network architectures, and
more careful study in academia. Neural networks have also seen wide adoption in industry in recent years.
Lately, there has also been work trying to integrate other methods of inference into neural networks—we
will take a look at this topic later in this course. We cover neural networks now as a tangentially-related
introduction to graphical models and as an example of combining traditional inference with deep models.
Example 2 (Linear Classification). In this case, g−1 is the sigmoid function σ, so that µ = σ (w> x + b) with
y | x ∼ Bern(σ (w> x + b)). We can think of the sigmoid function as a smooth approximation to an indicator
variable, so that σ (w> x + b) is simply an estimation of the class of w> x + b.
Example 3 (Softmax Classification). Here g^{−1} is the softmax function, so that µ_c = softmax(Wx + b)_c where
W is some matrix rather than a vector. Remember that the softmax is defined by softmax(z)_c = exp(z_c) / ∑_{c'} exp(z_{c'}).
Think of the softmax function as the multi-class (multi-dimensional) generalization of the sigmoid.
Example 4. (Basis Function in Speech Recognition) A snippet of speech might be a waveform, and one way
to extract features is to chunk the waveform by time, for each chunk applying a Fourier transform. Then
we would take as features some values of each transformed chunk in the frequency domain. Typically this
process gives 13 features per chunk. These features are then passed to a learning model.
We now consider generalized linear models in combination with basis functions. Suppose y | x ∼ N(w^T φ(x) + b, σ²),
or y | x ∼ Bern(σ(w^T φ(x) + b)), where φ gives rise to a basis. We can do MLE just as before, i.e.,
compute argmaxw ∑n log p(yn | xn , w). In general, these will be solvable just as before, e.g., with numerical
optimization—iterative gradient calculation and updates. The form of the MLE depends on the distribution
of y | x. When it is normal, the optimization becomes over sum of squares, when it is Bernoulli, the
optimization becomes over cross-entropy (as discussed in previous classes).
Figure 7.4: Fourier transform in speech recognition in Example 4
y| x ∼ N (w> tanh(w0 x + b0 ) + b, σ2 )
When we do MLE, we have to take the same argmax over the parameters w, w0 , b0 , b. All that’s changing
is that the function we are optimizing is non-linear, with many parameters, and non-convex. So when we
optimize such functions, we might end up at a local optimum instead of the global optimum. We will see
many techniques for combating the complexities of non-convex optimization.
7.4 Demo
See iPython notebook for demo.
In the literature, the circles are called “neurons,” matrices are “fully connected,” each column is a “layer”
with implied squashing, each line is a parameter. The goal of these networks is to find µ.
“Personally, I find this part—the ‘it’s like a brain!’—pretty silly. It’s just linear algebra separated by nonlin-
ear transformations.” - S. Rush
7.5.1 Application Architectures for Neural Networks
In a typical neural network, we have x → Layer 1 → Layer 2 → · · · → Output, where each arrow is a
linear map, and in each layer is a non-linear function. In the class before, we talked about classifying the
MNIST data set—for this simple model, we had 8000 parameters(!). “But that’s nothing—just yesterday I
was working with a model with 1.2 billion parameters.”
Although some of the power of neural networks comes from this flexibility in parameters, much of the
interesting work is done in trying to find better neural network architectures that capture more of the
essence of the data with fewer parameters. For example, the modern approaches to digit classification are
done by convolutional neural networks, where the architecture captures some of the “local” information of
images.
Example 5. [Speech Recognition] Suppose we want to map sounds into classes of saying the digits “one,”
“two,” and so on. Recall that the typical approach is to split speech into chunks and perform Fourier trans-
forms to extract features from each chunk. The problem here is that individual chunks don’t necessarily
map to single digits, since there’s no guarantee the chunk even corresponds to an entire word in speech!
Instead, what is typically done in this case is convolution using a kernel (equivalently, a single weight vector
called wtile) that spans several chunks. Rather than applying learning on the full R^{n·13} data set (where n is
the number of chunks), we multiply each k-chunk stretch of speech by the kernel to obtain ≈ n − k chunks.
Based on our choice of kernel, we can take advantage of sparsity to improve structure in the data set. This
is known as a one-dimensional convolution between the kernel, wtile , and the input φ( x ).
Example 6 (Image Classification). For the case of images, we can do the same as above with two-dimensional
convolution, where have blocks in the image instead of tiles. This lets us pick up on information that is very
local—e.g., edges or corners in images—information which can then be recombined in later layers with
spectacular success.
Example 7 (Language Classification). Suppose we want to determine whether a movie review was good or
bad. Consider the review “The movie was not very good.” One way to do this to convert words to vector
representations (e.g., via word2vec or glove), since discrete words are difficult to deal with, but vectors let
us have a more continuous approach while taking into account the meaning. We can do things like add
these vectors up over the course of the review (e.g., a bag-of-words approach). An alternative is to take
blocks of words and use a one-dimensional convolution. One advantage of the latter is that it allows you
to pick up on structures such as “not very good,” which wouldn’t be observed in a bag-of-words model,
which may pick up on the words “very” and “good” instead.
Remark. All of these convolution methods exist in PyTorch under nn.conv.
CS281: Advanced ML September 27, 2017
µ = σ(w T ReLU(Wx ))
Here, µ parameterizes a Bernoulli distribution, Ber(µ). Suppose we want to find µ such that it maximizes
the likelihood of a single data example ( x, y). Then we compute
(Figure: the computation graph for this example — inputs x_1, x_2, x_3 feed into the hidden units ReLU(Wx)_1, . . . , ReLU(Wx)_5, which feed into the loss L.)
In order to generate this graph, we must perform the following computational operations in order:
We would like to get the gradient terms v̇^{(i)} ≡ dL/dv^{(i)} for every i, which tell us how each part of the neural
network affects our loss. We can do this by applying the chain rule (of calculus) to get a recursive solution
(by convention, the derivative of a scalar with respect to a vector is represented as a column vector):

dL/dv^{(i)} = ( dv^{(i+1)}/dv^{(i)} )^T dL/dv^{(i+1)}

or, elementwise,

∂L/∂v_k^{(i)} = ∑_j ( ∂L/∂v_j^{(i+1)} ) ( ∂v_j^{(i+1)}/∂v_k^{(i)} )
Since the gradient of each term depends on the gradient of the subsequent term, we can compute the
gradients in reverse while applying the chain rule. This method is known as backpropagation. For each
backward step, we need to remember everything that was computed in the corresponding forward step,
namely v^{(i)}, v̇^{(i+1)}, and dv^{(i+1)}/dv^{(i)}:

(Figure: a backpropagation "block" — given the input v^{(i)} and the grad output v̇^{(i+1)}, it uses the stored Jacobian dv^{(i+1)}/dv^{(i)} to produce the output v^{(i+1)} and the grad input v̇^{(i)}.)
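As a concrete sketch of these forward/backward steps, here is a manual forward and backward pass for the small network µ = σ(w^T ReLU(Wx)) with a Bernoulli (cross-entropy) loss; all parameter values and the data point are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up parameters and a single (x, y) example.
W = rng.standard_normal((5, 3))   # hidden layer weights
w = rng.standard_normal(5)        # output layer weights
x = np.array([0.5, -1.0, 2.0])
y = 1.0

# Forward pass (store intermediates for the backward pass).
z1 = W @ x                        # pre-activation
h = np.maximum(0.0, z1)           # ReLU(Wx)
z2 = w @ h
mu = 1.0 / (1.0 + np.exp(-z2))    # sigma(w^T ReLU(Wx))
loss = -(y * np.log(mu) + (1 - y) * np.log(1 - mu))

# Backward pass: apply the chain rule in reverse.
dz2 = mu - y                      # dL/dz2 for sigmoid + cross-entropy
dw = dz2 * h                      # dL/dw
dh = dz2 * w                      # dL/dh
dz1 = dh * (z1 > 0)               # ReLU gradient
dW = np.outer(dz1, x)             # dL/dW

print(loss, dw, dW)
```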
We can also augment the black boxes. For instance, if we let f take in parameter W, we can also
dL
compute dW within this function.
• computational graph (e.g. Theano, TensorFlow)
Everything is implemented in terms of primitives, so there are no black boxes:
+, ×, ...
This allows us to optimize the neural network once and run it on many examples.
• imperative/autograd systems
These are tape-based systems in which the computational graph can look different for different exam-
ples, but we can still compute gradients using backpropagation. Torch is built on an autograd core,
but higher level functions like the Linear module take on a “blocks” style approach.
In the case of Naive Bayes (see Figure 8.6), we know from a previous lecture that:
• p(y) is probably categorical.
• p( x j |y) could be one of many different distributions, including Categorical, Gaussian, Bernoulli, etc.
We are interested in the following distributions from the underlying data that we have:
• p(y, x): joint distribution
• p( x j ): marginal distribution, or p(y | x ): conditional distribution
The structure of the model will often determine the difficulty of inference. This is the motivation of why we want
to draw these graphs.
On a high level, given p( A, B, C ), we can always apply the chain rule (in probability):
p( A, B, C ) = p( A | B, C ) p( B | C ) p(C )
However, if we write p( A, B, C ) in the way above, we basically assume that all variables depend on each
other. In some cases, this is not necessarily true, and we want to find a factorization as below.
8.2.2 Factorization
If we have the case presented in Figure 8.7, we can rewrite p( A | B, C ) → p( A | B). Having A only depend
on one variable (B) is better than having it depend on two variables (B and C).
Figure 8.7: A graph where factorization is possible.
Since we have n samples of ( x, y), we can use the plate notation, as shown in Figure 8.9.
Figure 8.9: A graph to illustrate the use of plate notation.
Figure 8.10: A model with probability tables. The CPT of C is a three-dimensional table.
Example 9 (Second Order Markov Chain). Figure 8.12 shows an example of a second order Markov chain
graphical model.
Figure 8.12: Second Order Markov Chain Graphical Model
Example 10 (Hidden Markov Model). Figure 8.13 shows an example of a hidden Markov graphical model.
Example 11 (Naive Bayes). Figure 8.14 shows an example of a Naive Bayes graphical model.
As we have seen in Figure 8.9, we can also incorporate parameters in the DGMs, as it is illustrated in the
next example.
Example 12. In this case (Figure 8.15), we use the same Naive Bayes example with parameters. Here, we
incorporate parameters α ∼ Dirichlet. This is interesting because it combines two types of distributions:
some of them are discrete, but in this example α and π are drawn from continuous distributions, as marked
in the figure below.
Figure 8.15: Naive Bayes Graphical Model with Parameters
Figure 8.15 corresponds to a single example. If we have multiple examples, we can use a plate-in-plate
representation as in Figure 8.16.
In the above equation, we transform each of the µi based on the starting mean plus a linear transformation
of their parents (and to simplify things, we subtract the mean of each parent). We have an underlying
generative process where each one of our random variables is a draw from a Gaussian and its children are
a linear transformations of that draw.
This means that we can rewrite x_i as:

x_i = µ_i + ∑_j W_ij (x_j − µ_j) + σ_i z_i,   z_i ∼ N(0, 1) ∀ i
Figure 8.16: Naive Bayes Graphical Model with Parameters and Plate-in-Plate Notation
This tells us how the x random variable differs from the mean at each of the different positions.
We know that the Σ term, our covariance matrix, is defined as:

Σ ≡ cov[x − µ] = cov[ (I − W)^{−1} S z ]
              = (I − W)^{−1} S cov[z] S ((I − W)^{−1})^T
              = (I − W)^{−1} S² ((I − W)^{−1})^T
CS281: Advanced ML October 2, 2017
Shading B denotes observing B, which informs our understanding of whether A and C are conditionally independent
or not. Our cases are then:

A → B → C ,   A ⊥̸ C                                          (9.10)
A → B → C ,   A ⊥ C | B                                       (9.11)
A ← B → C ,   A ⊥̸ C                                          (9.12)
A ← B → C ,   A ⊥ C | B                                       (9.13)
A → B ← C ,   A ⊥ C                                           (9.14)

Information is being blocked in cases 2, 4, and 5 but flows freely in all other cases. It's useful to think
about the concept of "explaining away" to understand what is going on in the last case. '"Explaining away"
is a common pattern of reasoning in which the confirmation of one cause of an observed or believed event
reduces the need to invoke alternative causes.'
(http://strategicreasoning.org/wp-content/uploads/2010/03/pami93.pdf)
How would a directed graphical model be interesting in practice? One example is probabilistic program-
ming: demonstrated in the IPython notebook “DGM.ipynb”. The demonstration shows how we can con-
vert programs from a directed to an undirected form. We can also specify priors on our features and visu-
alize the flow of the model.
The purpose of creating a DGM is to specify the relationship between variables of interest, in order to facilitate
understanding of the independence properties.
9.2.1 Independence properties
As stated above, we have conditional independence if the conditional joint probability factors in the following
way:

p(A, B | C) = p(A | C) p(B | C)  ⇒  A ⊥ B | C                 (9.16)

A − B − C ,   A ⊥̸ C                                          (9.17)
A − B − C ,   A ⊥ C | B                                       (9.18)
We can say that we have “conditional independence” between two nodes given some third node if all paths
between the two nodes are blocked. For our simple example with three nodes, this is when the third node
is in the evidence, that is, when the third node is “seen”.
Here is an example of how independence works in the UGM.
Fundamental consequence: Imagine we have a graph with a node A and some complicated connection of
nodes around A. We can make A conditionally independent from all other nodes by conditioning on the
"Markov blanket" of A. The Markov blanket is defined to be the neighbors of A. We will see later in this
class that it is very nice if we can establish these independence properties. We will be able to look at a point
in a graphical model and, if we can condition on the Markov blanket of the node of interest, we can ignore
the rest of the graph.
In the example below, conditioning on the Markov blanket of A means conditioning on B, C, E, and F.
(Figure: a UGM in which node A's neighbors — its Markov blanket — are B, C, E, and F; nodes D, G, and H lie outside the blanket.)
(Figure: a five-node DGM over nodes 1–5 and its moralized UGM, in which the parents 2 and 3 of node 4 become connected.)
To carry out this process, we take all the directed edges and make them undirected. We then create addi-
tional edges by “marrying” the parents of a node. In this case we gain an extra edge between 2 and 3, which
comes from marrying the parents of 4.
(Figure: the Naive Bayes DGM with class y and features x_1, . . . , x_V, and the corresponding UGM.)
For this case, the UGM and DGM are the same. However, if we condition the other way, i.e. the features
x1 , ..., x D are on top of the DGM with directed arrows towards y at the bottom, we would need to add
connections in the UGM between all of the features. In the illustrated example, the joint probability having
seen y is
(Figure: a Venn diagram of the families of distributions representable by DGMs and UGMs within the set of all distributions, and the four-node diamond UGM with edges A–B, A–C, B–D, C–D.)
Here we have

A ⊥ D | B, C                                                  (9.23)
B ⊥ C | A, D                                                  (9.24)
(Figure: a candidate DGM for the diamond, with arrows A → B, A → C, B → D, C → D.)

In contrast to the UGM, the DGM has the following independence properties.

A ⊥ D | B, C                                                  (9.26)
B ⊥̸ C | A, D                                                 (9.27)
(Figure: a second candidate DGM over A, B, C, D.)

A ⊥̸ D | B, C                                                 (9.28)
B ⊥ C | A, D                                                  (9.29)

So there is no directed graphical model structure that gives us the same properties that the original UGM had.
How about another example of a DGM?

(Figure: the v-structure DGM X → Z ← Y.)

(1) X ⊥̸ Y | Z                                                 (9.30)
(2) X ⊥ Y                                                      (9.31)

(Figure: candidate UGMs over X, Y, and Z.)
9.4 Parametrization of UGMs
The following section is some of the harder material that we will cover. However, this is the last time we
are introducing a new model. Thereafter, we will do inference on the models we have discussed.
(Figure: a DGM over nodes A, B, C, D, E; the nodes circled in red ("normal") have local normal distributions, and the node circled in blue ("cat") has a local categorical distribution.)

Suppose that in this example the nodes within the red circle follow a local normal distribution, but that
the nodes within the blue circle follow a local categorical distribution. The notation relating a node x to
its parents is p(x | pa(x)), where pa(x) refers to the parents, and the conditional probability table is "locally
normalized" (sums to one) and is non-negative.
For UGMs we use “global normalization”. All is fine locally as long as the whole global probability sums
up to make whole joint distribution normalized. For this, we treat everything as an exponential family.
p(x_1, . . . , x_D) = multivariate exp. fam. = exp{ θ^T φ(x_1, . . . , x_D) − A(θ) }        (9.32)
Here, θ are the parameters and φ( x1 , ..., x D ) the sufficient statistics. For every “clique” in the graph, we
associate a set of sufficient statistics. A “clique” is defined as a set of nodes that are all connected to each
other. Note: a set of one or two nodes forms a trivial clique.
In the discrete case:
φ_{c,v}(x) = 1 if clique c has x_c = v, and 0 otherwise        (9.34)
Suppose we have the following UGM in which each node can take on one of K possible discrete outcomes:

(Figure: a three-node chain, each node taking one of K values.)

This means v ranges over K³ joint assignments in this example. We also have a θ associated with each of these values. Since φ(x) is big
(and awkward), we use the following notation:
Now we write out the whole thing:

p(x_1, . . . , x_D) = exp{ ∑_c θ_c(x_c) − A(θ) }        (9.36)

where the sum ∑_c θ_c(x_c) is the (negative) energy.
The sum over “everything” is our nemesis because it is over something big. And in general, this is NP-
Hard or even #P-Hard. But in practice, the structure of the graph determines the difficulty. Examples of the
“everything” include: all possible images, social network graphs... (How we can exchange the sums over
x 0 and c depends on the structure of the graph.)
9.5 Examples
9.5.1 Example 1: Naive Bayes model
Any distribution obeying this structure has an exponential family parametrization. We convert the local
distribution
p ( y 1 ) ∏ p ( y i | y i −1 ) p ( x i | y i ) (9.41)
i
For this example, the shown grid could connect to other such grids. As an example, suppose we want to
detect the foreground and background of an image. For this, we perform a binary classification of each pixel
(0=pixel in background; 1=pixel in foreground). We want to obtain a probability that a pixel is in either class.
The class should depend on the neighboring pixel so that there is consistency among neighboring pixels -
it would be weird if every other pixel is in a different class.
This results in a binary model with neighbor scores. In order to force this to be a probability distribution,
we treat the whole thing as an exponential family:
p(x) = exp{ ∑_{ij} [ θ^↑_{ij}(x_{ij}, x_{i−1,j}) + θ^←_{ij}(x_{ij}, x_{i,j−1}) + θ_{ij}(x_{ij}) ] − A(θ) },        (9.43)

where A(θ) = log( ∑_{x'} exp ∑_c θ_c(x'_c) ). Note that A is again very hard to calculate. Also note that the
"missing" θ^↓_{22} and θ^→_{22} in the diagram are given by other θ^↑ and θ^← terms, so that when we sum over all i and j we
are not double counting any of the connections.
CS281: Advanced ML September 11, 2017
10.1 Prelude
10.1.1 Notation
Recall from Lecture 9 our notation for the joint probability distributions of Undirected Graphical Models
(UGMs). In particular, we have
p(x_1, . . . , x_T) = exp{ ∑_c θ_c(x_c) − A(θ) }
Here, θc ( xc ) is the score associated with some clique c, and xc is some value assignment on the clique. This
notation is great because it corresponds to exponential families! However, there are a few other notations
you may see around:
p(x_1, . . . , x_T) ∝ ∏_c exp(θ_c(x_c))
                   = ∏_c ψ_c(x_c)

This latter notation is used by Murphy. The ψ_c(x_c) are the potentials, and they are simply exp(θ_c(x_c)). The score
functions θ_c are then known as the log potentials.
10.1.2 Moralization
Recall from last lecture the process of conversion between a Directed Graphical Model (DGM) and UGM,
known as moralization. Here we consider two diagrams:
P( A, B, C, D ) = P( A) P( B| A) P(C | A) P( D | B, C )
We see that the UGM has two cliques, c_0 = {A, B, C} and c_1 = {B, C, D}. Using the notation above,
this gives us
P( A, B, C, D ) = ψc0 ( A, B, C )ψc1 ( B, C, D )
This form gives a particularly nice parallel structure: the first three terms group into

P(A) P(B|A) P(C|A) = ψ_{c_0}(A, B, C)

and the last gives

P(D|B, C) = ψ_{c_1}(B, C, D)
In Lecture 5, we discussed all kinds of Naive Bayes (NB), including Bernoulli, MVN, Categorical, and more.
All of these follow the diagram above, which specifies the factorization of the joint distribution

p(x, y) = p(y) ∏_i p(x_i | y)
Here we are only interested in p(y| x1 , . . . , x T ). Similarly to NB, the parametrizations of this model can take
many forms. For instance, one could use logistic or linear regression, most GLM’s, Neural Networks, or
even convolutional Neural Networks. In fact this is the model on which Alphago functions, where features
x are the tile placements on the board, and y is the output move!
The main point here is that we can stick arbitrary parametrizations into graphical models. These diagrams
just specify the conditional dependence–not the distributions. Further, as long as we specify the graphical
model structure, we will be able to tell how hard inference will be.
y = {t,h,e, ,d,o,g}
Example 14 (NLP). Here we are given discrete words as input variables x, and we wish to predict discrete
y, their parts of speech
x: the dog ate a carrot
y: DT NOUN VERB DT NOUN
Example 15 (Speech Recognition). Recall from previous lectures that our signal may be divided up into
time steps and translated to continuous vectors in R^13; these are our continuous input variables x. Our
output for each vector is the phoneme corresponding to the sound.
y = {d,d,d,o,o}
Example 16 (Tracking). Here we are tracking the position of some object with presumed Gaussian noise. In
this case both our input and output variables are continuous. The inputs are our position vectors, and the
output is the predicted correction for noise.
Example 17 (Education). This example is based upon the presentation at the beginning of the class. Our
inputs are given by a kinect tracker and are continuous body positions at snapshots in time. The discrete
output y is whether the body position corresponds to attentive or bored.
y = {s,t,o,p}
We discussed several Markov Models in previous lectures. We begin with the Hidden Markov Model (HMM),
which actually predates DGMs. HMMs assume discrete y, though DGMs with the same structure may
have continuous y. This has a fully joint parametrization p(y1:T , x1:T ), where in most cases p(yt |yt−1 ) is
categorical. We can choose the distribution of p( xi |yi ) to fit our given circumstances:
1. p( xi |yi ) is categorical [e.g. parts of speech]
2. p( xi |yi ) is MVN [typing, speech (this uses a mixed gaussian)]
3. p(x_i^1, . . . , x_i^V | y_i) = ∏_v p(x_i^v | y_i) [OCR]
The last example here has an embedded Naive Bayes model for each feature x.
HMMs such as the above are often parametrized in the following manner:
Thus we are interested in p(y1 , . . . , y T | x1 , . . . , x T ), and p(yt | xt , yt−1 ) is a GLM such as logistic or softmax
regression.
The MEMM comes with the distinct advantage that we no longer have to assume our features are independent.
This is particularly useful in, say, tagging parts of speech, where we can pick an arbitrary feature basis without
worrying about independence. However, the model comes with the downside that there is no closed form,
so we must use SGD, i.e. we isolate each x_i and predict y_i with logistic regression.
Picking p(yt | xt , yt−1 ) here as an arbitrary neural network is called a Neural Network Markov Model or
NN-Markov Model.
10.4 Conditional Random Field Markov Model
While CRF is still a Markov Model, it gets its own subsection due to its importance. The CRF is simply the
UGM of all the above Markov Models:
Here h and o refer to the labeled arrows in the diagram. This is the general form of any Markov Model.
Note that because we have observed the features x_{1:T}, we may rewrite the θ^o_t(y_t, x_t) terms as θ^o_t(y_t; x_t). If
we are trying to do inference, we can think of the above, after conditioning, as simply
This is a simple Markov Chain UGM. In fact after converting to UGM and conditioning on our features, all
Markov models become the above!
Now while we can compute the closed form of the MLE on HMMs, CRF is a bit tricker. However, it is a
member of the exponential family model, and we know the MLE for these families in general.
E(φ(x)) = ∑_d φ(x_d) / N
Here, φ(y) are the sufficient statistics–but what do these look like? In fact we went over this in a previous
lecture, it is simply a vector of indicator functions for every clique assignment (see Lecture 9). For inference
in particular, we care about

E(1(x_c = v))

Then the clique marginals are given by

∑_{x' : x'_c = v} p(x')  /  ∑_{x''} p(x'')

However, this may be computationally intractable, as the x'' we sum over is the entire universe! Fortunately,
this can be computed efficiently for some models, such as chain models.
tracked at 1-30Hz), and even body posing and coordination (using a Kinect; tracked at 30Hz). Schnieder
pointed out that these sorts of studies provide massive amounts of data. During the lecture, he highlighted
that there are a number of cool CS 281 projects that could arise from datasets like this one. Some projects he
proposed are:
1. Unsupervised machine learning: modeling collaborative learning processes with probabilistic graph-
ical models.
2. Supervised machine learning: training models to make predictions. For instance, training a model to
predict the joint visual attention of two people based on their gestures, head orientation, and speech.
Another example is training a model to predict physiological activity based on features like pupil size
and body posture. With supervised machine learning, however, Schnieder noted that overfitting can
be a dangerous pitfall.
3. Design your own!
Schnieder emphasized that he is open to ideas, and that he highly encourages anyone interested in working
with this data in some way or form to email him at bertrand [email protected].
CS281: Advanced ML October 11, 2017
y1 y2 y3 ... yS ... yT
bel⁻_t(y_t) ∝ ψ_t(y_t) m⁻_{t−1→t}(y_t)

m⁻_{t−1→t}(y_t) = ∑_{y_{t−1}} ψ_{t−1}(y_{t−1}, y_t) bel⁻_{t−1}(y_{t−1})

m⁺_{t+1→t}(y_t) = ∑_{y_{t+1}} ψ_{t+1}(y_{t+1}, y_t) ψ_{t+1}(y_{t+1}) m⁺_{t+2→t+1}(y_{t+1})
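A minimal numpy sketch of these forward/backward messages on a small chain with made-up node and pairwise potentials; the resulting beliefs are the normalized node marginals:

```python
import numpy as np

rng = np.random.default_rng(4)

T, K = 4, 3                            # chain length and number of states
psi_node = rng.random((T, K))          # node potentials psi_t(y_t)
psi_pair = rng.random((T - 1, K, K))   # pairwise potentials psi(y_t, y_{t+1})

# Forward messages: message into node t from the left.
m_fwd = np.ones((T, K))
for t in range(1, T):
    m_fwd[t] = psi_pair[t - 1].T @ (psi_node[t - 1] * m_fwd[t - 1])

# Backward messages: message into node t from the right.
m_bwd = np.ones((T, K))
for t in range(T - 2, -1, -1):
    m_bwd[t] = psi_pair[t] @ (psi_node[t + 1] * m_bwd[t + 1])

# Beliefs (unnormalized marginals), then normalize per node.
bel = psi_node * m_fwd * m_bwd
bel /= bel.sum(axis=1, keepdims=True)
print(bel)
```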
11.2 Other Graphs
p(y_s = v) = ∑_{y' : y'_s = v} ∏_c ψ_c(y'_c)
Computing this sum-product might be hard. Here we define the treewidth of the graph as (the minimum over
elimination orderings of) the size of the largest clique induced, minus 1.
11.2.1 Computing Marginals for a treewidth = 1 Graph

3. Downward pass:

bel_s(x_s) ∝ bel⁻_s(x_s) ∏_{t ∈ pa(s)} m⁺_{t→s}(x_s)

m⁺_{t→s}(x_s) = ∑_{x_t} ψ_{s,t}(x_s, x_t) ψ_t(x_t) ∏_{c ∈ ch(t), c ≠ s} m⁻_{c→t}(x_t)

More generally, for belief propagation on a tree:

bel_s(x_s) ∝ ψ_s(x_s) ∏_{t ∈ nbr(s)} m_{t→s}(x_s)

m_{s→t}(x_t) = ∑_{x_s} ψ_s(x_s) ψ_{s,t}(x_s, x_t) ∏_{u ∈ nbr(s), u ≠ t} m_{u→s}(x_s)
11.4.1 Commutative Semi-ring

(⊕, ⊗) = (+, ×)   : marginal
(⊕, ⊗) = (max, ×) : argmax (MAP assignment)
(⊕, ⊗) = (∨, ∧)   : satisfying assignment
CS281: Advanced ML October 16, 2017
12.1 Introduction
Recall the basic structure of a time series model:
y1 y2 y3 ... y t −1 yt
In previous lectures we discussed interpreting such a model as a UGM, with log-potentials given by:
θ ( y t ) + θ ( y t −1 , y t )
What did we gain from this abstraction? Because all of these models have the same conditional inde-
pendence structure, we are able to run the same algorithm for inference on all of these parameterizations
(sum-product). Thus, conditioned on observed data, we may find the exact marginals p(y_s = v).
Perhaps these structures are not necessary? Is exact inference required in cases where there is a lot of data?
12.2 RNNs
12.2.1 What is an RNN?
Recall our discussion of using neural networks for classification. The UGM describing this setting is as
follows:
(Figure: the inputs x_1, . . . , x_T all feeding into a single classification output.)
However, the problem with this type of approach is that it is "time invariant," that is, the same weights are
shared by all of the x_t. To see why this is problematic, consider using a bag-of-words representation for the
x_t and encountering the two sentences "The man ate the hot dog." and "The hot dog ate the man." While
these two sentences are saying completely different things, they result in the same value generated by φ.
Recurrent neural networks get around this problem by implementing the following choice of φ:

φ(x_{1:t}; θ) = tanh( w^{(1)} x_t + w^{(2)} φ(x_{1:t−1}; θ) + b )

where w^{(1)} x_t incorporates the current positional input, w^{(2)} φ(x_{1:t−1}; θ) carries information from the previ-
ous inputs, b is the bias and tanh is the chosen nonlinear transformation.
Representing this RNN as a computational graph:
(Figure: the unrolled RNN computational graph, with hidden states h_0, h_1, h_2, . . . and inputs x_1, x_2, . . . .)
where we call ht the RNN hidden state. As you can see, each ht is a nonlinear function of x1:t .
h_1 = tanh(w^{(1)} x_1 + w^{(2)} h_0 + b)
h_2 = tanh(w^{(1)} x_2 + w^{(2)} h_1 + b)
. . .
h_t = tanh(w^{(1)} x_t + w^{(2)} h_{t−1} + b)
Notice that the only parameters to optimize in these expressions are w(1) and w(2) , which are shared among
all of the equations. In our normal representation of backpropagation we have:
(Figure: the unrolled graph used for backpropagation — hidden states h_1, h_2, . . . , h_t feeding into the loss, with inputs x_1, x_2, x_3, . . . , x_t below.)
Thus the size of the computational graph and the amount of backpropagation necessary will scale with the
length of the input, T. Now, thinking of this situation like a simple feed-forward NN, how many layers does
this network have?
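A minimal numpy sketch of this unrolled forward computation (dimensions and inputs are assumed for illustration; the same w^{(1)}, w^{(2)}, b are reused at every time step):

```python
import numpy as np

rng = np.random.default_rng(5)

T, D, H = 6, 4, 8                  # sequence length, input dim, hidden dim
xs = rng.standard_normal((T, D))   # made-up input sequence x_1, ..., x_T

W1 = rng.standard_normal((H, D))   # w^(1): input-to-hidden
W2 = rng.standard_normal((H, H))   # w^(2): hidden-to-hidden (shared across time)
b = np.zeros(H)

h = np.zeros(H)                    # h_0
hs = []
for t in range(T):
    # h_t = tanh(W1 x_t + W2 h_{t-1} + b), same parameters at every step.
    h = np.tanh(W1 @ xs[t] + W2 @ h + b)
    hs.append(h)

print(np.stack(hs).shape)          # (T, H): one hidden state per time step
```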
12.2.4 Main idea/trick/hack for vanishing gradients
In order to deal with vanishing gradients, we want to try to pass on more info from low layers while taking
gradients, so we add connections variously called:
• residual connections
• gated connections
• highway connections
• adaptive connections
The idea of residual connections is that we let
h_t = h_{t−1} + tanh( w^{(1)} x_t + w^{(2)} h_{t−1} + b )

so that, taking the gradient at layer t, we get more information passed on from the linear term h_{t−1}
outside of the tanh.
In fact, we can adaptively learn how much the gradient at each time step should be taken from the previous
time step directly from the data. Thus, we weight the contributions of h_{t−1} and tanh( w^{(1)} x_t + w^{(2)} h_{t−1} + b ) by a
factor λ that is also learned from the data:

h_t = λ h_{t−1} + (1 − λ) tanh( w^{(1)} x_t + w^{(2)} h_{t−1} + b )
λ = σ( w^{(4)} h_{t−1} + w^{(3)} x_t + b )
By passing on information directly from previous timesteps, we can prevent vanishing gradients, since the
linear terms pass on more information from previous timesteps. In this sense, the λs function as “memory”
of the previous timesteps. Important RNN variants using this idea are:
• LSTM (long short-term memory networks)
• GRU
• ResNet
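A small sketch of one such gated update step (a GRU-like simplification of the λ update above, with assumed shapes and made-up parameters; the gate acts elementwise):

```python
import numpy as np

rng = np.random.default_rng(6)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D, H = 4, 8
x_t = rng.standard_normal(D)
h_prev = rng.standard_normal(H)

W1 = rng.standard_normal((H, D))   # w^(1)
W2 = rng.standard_normal((H, H))   # w^(2)
W3 = rng.standard_normal((H, D))   # w^(3), for the gate
W4 = rng.standard_normal((H, H))   # w^(4), for the gate
b = np.zeros(H)
b_gate = np.zeros(H)

# Gate lambda decides, per coordinate, how much of h_{t-1} to carry forward.
lam = sigmoid(W4 @ h_prev + W3 @ x_t + b_gate)

# Gated update: a convex combination of the old state and the new candidate.
candidate = np.tanh(W1 @ x_t + W2 @ h_prev + b)
h_t = lam * h_prev + (1 - lam) * candidate

print(h_t.shape)
```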
Critically, this means that we do not attempt to model the relationship between the yi at all, and thus we do
not need to assume any distribution over y!
Let's compare this to our alternative approach, which requires full generation. Imagine that any y_i is condi-
tioned on all of y_{1:i−1}. Then our DGM (for five nodes) is a fully connected K_5:
y0 y1 y2 y3 y4
How can we then compute p(ys = v)? A naive approach would be to literally enumerate all possible
sequences and sum across all possibilities. However, this is very computationally expensive (exponential
in T). Instead, we can speed up this approach by employing a greedy search. Let
12.3.2 Applications
RNNs are commonly used for machine language translation and speech recognition. A simple machine
translation model is that of the Encoder-Decoder model.
(Figure: the encoder-decoder model — the encoder RNN reads x_1, x_2, x_3 into hidden states h_1, h_2, h_3 and a context C; the decoder RNN runs over h_4, h_5, h_6 and emits the translated words y_1, y_2, . . . .)
In this model, a sentence from the first language is fed in word by word (each xi ) into the Encoder. This
then runs through a normal RNN setup before being fed into C, which stores the final result of the encoder.
Now in the decoder, another RNN is run in reverse that spits out words in the second, translated, language.
Each translated word yi depends on the current layer in the decoder RNN, hi , C and the last translated
word (to prevent the same word from being generated multiple times).
Remark. We will talk about Information Theory in the next lecture.
CS281: Advanced ML October 18, 2017
13.1 Announcements
The Midterm is next Monday. The list of topics is on the website. It’s open note but not open computer. You
can bring your textbook. They’ll try to bring copies of the textbook for people who don’t have it so they
don’t have to print out copies.
13.2 Introduction
Interestingly, almost all of information theory is laid out in a single paper, written by Claude Shannon in
1948. We often quote Alan Turing as the ’father’ of CS, but for the area we are discussing, that ’father’ is
Shannon. (Aside: Video of device Shannon built called the Ultimate Machine)
Today, we’ll cover the core aspects of information theory and discuss how we’ll use it in this class.
Earlier in this class, we’ve focused on exact methods for inference: MLE, MAP, etc. The one exception was
neural networks, which are not convex.
Exact inference is only tractable in a small subset of models. Given most models are intractable, what do we
do? We use approximate inference. Here, there are two main types of approximation: optimization-based
(which includes coordinate ascent, SGD, and Linear Programming methods) and sample-based. You can
use them together, but they have different histories, so we’ll discuss them separately.
Information theory brings together many of the expectation-maximization methods we’ve seen earlier into
a unified language. In particular, information theory lets us study relative entropy.
The Information Source spits out bits (call this W), which is passed into the encoder (giving X(W)). The en-
coded bits pass through a noisy channel giving Y, which is passed into a decoder that gives us our target
message.
We don't know how the noisy channel will act in a deterministic way. But we have a probability distribution
p(Y | X) describing its behavior. The target message also has a distribution p(x | y) ∝ p(x) p(y | x), where p(x)
represents the source model, and p(y | x) represents the channel model. We have a simple graphical model
here:
Several fields of research are contained in the first diagram. One area of information theory is called channel
coding (Noisy Channel in diagram). It studies how to best encode the data so it is most robust to the noisy
channel. A second area is called source coding - data compression (Source in diagram). It discusses how
we can exploit the way information is naturally represented in the world. For example, compression falls
here. (Aside: Hutter prize).
Example: people in speech recognition use this as an analogy for what they are trying to do. The model of
what a person wants to say is W. The encoding path is the sound the person makes X (w). The decoder is
what we (or the microphone) hear. And the goal, of course, is to uncover the target message.
We can write this analogy:
1. Source: Language
2. Encoder: Talking
3. Noisy Channel: Air
4. Decoder: ears/recognizer
Then, p(X) represents the source model. If you know a person well, you can anticipate what is going on in
their head, and you know what they are going to say. This makes your p(X | Y) stronger. We multiply this
quantity by p(Y | X), which represents the noisiness of the channel.
13.4 Definitions
Entropy:
H(X) ≜ − ∑_{k=1}^K p(x = k) log₂ p(x = k) = −E_k[ log₂ p(x = k) ]
For a given random variable X, entropy measures the ”uncertainty” in the distribution. Entropy maxes out
when we have full uncertainty in the distribution.
This comes back to the idea of source coding. If there is full certainty in what can and is sent, then it does
not matter what the encoder/decoder does. The answer is trivial. In contrast, if there is full uncertainty, we
have to work much harder.
The unit of measure is called "bits" or "shannons" (log₂), or "nats" (log_e). The average number of bits
needed to represent a message is less than

H(x) + 1
We won’t prove this, but this is the fundamental link between information and coding.
Shannon Game In his paper, Shannon proposes the ”Shannon Game,” which tries to quantify human
source coding: given a sequence of text, give a probability distribution over the next letter/word.
THE ROOM WAS NOT VERY LIGHT A SMALL OBLONG READING LAMP ON THE DESK SHED GLOW
ON POLISHED .
In English, it takes roughly 80 guesses to get the right answer. We write that the perplexity or ”how confus-
ing the next prediction is” is
2 H (x)
Minimizing perplexity is where the ’power’/advances of RNNs comes from.
Cross Entropy:
H(p, q) = − ∑_k p(x = k) log q(x = k) = −E_p[ log q(x = k) ]
Cross entropy tells us expected number of bits to encode true distribution p with q.
We can sample from p to approximate it:

x̃_1, . . . , x̃_N ∼ p(x)

Then, we compute the minimization of the cross entropy:

min_q − (1/N) ∑_n log q(x̃_n)
But this is just the negative log likelihood for categoricals. Last class, we talked about RNNs. We learn q
and compare to p. If we can make it more and more like p, this gives us the ability to do compression and
source modeling better.
Relative entropy (KL-Divergence):

KL(p || q) ≜ E_p[ log( p(x = k) / q(x = k) ) ] = ∑_k p_k log( p_k / q_k ) = ∑_k p_k log p_k − ∑_k p_k log q_k = −H(p) + H(p, q)

So the relative entropy is the negative entropy of p by itself plus the cross entropy of p and q. It's a way
of comparing two distributions, but it is not a metric: KL(p || q) ≠ KL(q || p).
Theorem. KL(p || q) ≥ 0.

Proof. We have that

−KL(p || q) = E_p[ log(q_k / p_k) ] ≤ log E_p[ q_k / p_k ] = log ∑_k q_k = log 1 = 0

by Jensen's inequality (E[f(X)] ≥ f(E[X]) for convex f; here f = log is concave, so the inequality reverses).
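A short numerical sketch of these quantities for two made-up categorical distributions, checking that KL(p || q) = H(p, q) − H(p) ≥ 0 and that KL is not symmetric:

```python
import numpy as np

# Two assumed categorical distributions over 4 outcomes.
p = np.array([0.5, 0.25, 0.15, 0.10])
q = np.array([0.25, 0.25, 0.25, 0.25])

H_p = -np.sum(p * np.log2(p))            # entropy of p (in bits)
H_pq = -np.sum(p * np.log2(q))           # cross entropy of p with q
kl_pq = np.sum(p * np.log2(p / q))       # KL(p || q)
kl_qp = np.sum(q * np.log2(q / p))       # KL(q || p)

print(H_p, H_pq, kl_pq, H_pq - H_p)      # kl_pq == H_pq - H_p >= 0
print(kl_pq, kl_qp)                      # generally unequal: KL is not symmetric
```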
Information Geometry (Working with asymmetrical divergence).
We have some p and want q ∈ Q (set of distributions). There are two options:
1. Forward (moment projection): min_q KL(p || q), whose q-dependent part is −E_p[ log q_k ].
   Issues arise when q_k → 0 while p_k > 0, so forward projection will try to avoid zeros.
2. Reverse (information projection): min_q KL(q || p).
13.5 Demo
We did an example iPython notebook (KL.ipynb) in class.
Next class, we will use the reverse projection to get Q. This is called ”variational inference”.
CS281: Advanced ML October 25, 2017
(Figure: the Naive Bayes plate model — parameters π and µ_k, observed class y_n, and features x_n.)
where the priors could be Categorical on p(y|π ) and Bernoulli on each p( x j |y, µ).
In this setting, with data { xi , yi }, we can perform log-likelihood maximization with p({ x }, {y}) to infer the
parameters µ and π.
In a new setting, instead suppose we have the same graphical plate model, but we now do not observe the
categories; instead of yn , we denote these latent variables as zn .
[Plate diagram: the same model with latent z_n in place of y_n, and parameters π and µ_k.]
which also can be represented with the following undirected graphical model:
[Undirected graphical model over z_n and µ_k.]
What does this graphical model look like, and what dependencies do we need to consider?
All of these Naive Bayes models without observed yn are used for different types of clustering. These
models are referred to as mixture models, and can handle multimodality well.
Figure 14.17: Multimodality of Multiple Gaussians vs Unimodality of One Gaussian
Although we won’t limit ourselves to only using continuous mixture models, we’ll be primarily working
with mixtures of Gaussians. Below are examples of mixture models we’ll see more of.
p(x | z = k) ∼ N(µ_k, Σ_k)
An application example is speech recognition, which must deal with different accents (e.g. British vs Amer-
ican) for pronunciations of the same word. In this case, Gaussian mixture models are able to consolidate
the seemingly different pronunciations.
Although we pushed the sum inside the product, the log-likelihood involves a log of a sum over all possible values of z, which is still hard to work with. Our main alternative strategy is thus to utilize the complete data likelihood.
Below is a heuristic algorithm:
1. Initialize π and µ randomly.
1. Initialize π and µ randomly.
2. Compute the complete-data log-likelihood log p(x, z | π, µ) = ∑_n ∑_k [z_{nk} log π_k + z_{nk} log p(x_n | µ_k)] using fixed π and µ.
3. Denote q_{nk} = p(z_n = k | x_n; . . .) as a "hallucinated" distribution over z_{nk}; the origin of q_{nk} is unclear for now, but it satisfies the given requirements. We can use this distribution z ∼ q_{nk} to compute the expectation of the log-likelihood as:
E_{z∼q_{nk}}[log p(x_n, z_n | π, µ)] = ∑_n ∑_k [q_{nk} log π_k + q_{nk} log p(x_n | µ_k)].
We can then "hallucinate" parameters through the distribution of z_{nk}! Returning to the graphical model, recall that it was difficult to actually compute the parameters. However, given some parameters π, µ, we can compute a marginal distribution over the z: p(z | x, π, µ).
4. We thus can perform a coordinate ascent algorithm; that is, we optimize one set of parameters, then
the other, then continue to alternate optimization.
This heuristic is summarized below:
1: procedure EM NB(x, z)
2: Initialize π and µ randomly.
3: while π and µ not converged do
4: (Expectation) Compute qnk = p(z| x, π, µ) using fixed π and µ
5: (Maximization) Compute MLE of π and µ using q through Ez∼q [log p( x, z)]
The above is called the Expectation Maximization (EM) Algorithm, where step 4 computes the expectation (E-step) and step 5 performs the maximization via the MLE (M-step). Here π will be Categorical and µ will be the means of the Gaussians themselves.
1: procedure EM(x,z)
2: Initialize π and µ randomly.
3: while π and µ not converged do
4: (Expectation) q_{nk} ← π_k p(x_n | µ_k) / ∑_{k'} π_{k'} p(x_n | µ_{k'})
5: (Maximization) maximize E_{z∼q}[log p(x, z | π, µ)] = ∑_n ∑_k [q_{nk} log π_k + q_{nk} log p(x_n | µ_k)]
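A minimal numpy sketch of these two steps for a 1-D mixture of unit-variance Gaussians; the data, K = 2, and the fixed number of iterations are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D data drawn from two well-separated clusters.
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])
N, K = len(x), 2

# Random initialization of pi and mu.
pi = np.ones(K) / K
mu = rng.normal(0, 1, K)

for _ in range(50):
    # E-step: q_{nk} proportional to pi_k * N(x_n | mu_k, 1).
    logp = -0.5 * (x[:, None] - mu[None, :]) ** 2 + np.log(pi)[None, :]
    q = np.exp(logp - logp.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)

    # M-step: MLE of pi and mu from the expected counts q_{nk}.
    Nk = q.sum(axis=0)
    pi = Nk / N
    mu = (q * x[:, None]).sum(axis=0) / Nk

print(pi, mu)  # should be roughly [0.4, 0.6] and [-2, 3] (in some order)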
We'll think about this algorithm in the context of information theory. In the maximization step, the first term ∑_n ∑_k q_{nk} log π_k can be thought of as maximizing over the π_k under the model, with the assumption that the counts were sampled from the distribution q_{nk}:
∑_n ∑_k q_{nk} log π_k, where the q_{nk} came from our guess and the π_k are what we want to find.
So we can think of q_{nk} as "expected counts" from a distribution, and the MLE looks like the standard MLE from counts of observations. In particular,
π_k = ∑_n q_{nk} / ∑_{n,k} q_{nk}.
Alternatively, we could also use the MAP estimate. The main point is that the prior is embedded in the max
step.
A similar update happens for µ, but will be specific to the actual form of the model p( x |µ).
14.5.1 Assumptions on EM
One key point is that the model must yield an easily computable p(z | x, π, µ). If we imagine that p(x | z) is an arbitrary neural net with continuous output, this could be very difficult. A specific example is the Variational Autoencoder (VAE), which can be found in Kingma and Welling.
14.6 Demo on EM
We did a demonstration on Expectation Maximization.
In general, we saw it was pretty hard to even infer the clusters of data especially as the number of true
clusters increased.
Q: How do we infer the number of true clusters? It's actually a hard problem; there is no really good standout way to do this.
Q: What is the difference between k-means and EM here? k-means chooses the single highest-probability cluster for each point (a hard assignment), instead of assigning a full distribution q_{nk} as EM does.
" #
log p( x ) = ∑ log ∑ p(xn , zn , |π, µ)
n zn
" #
p( xn , zn |π, µ)
= ∑ log ∑ q(zn = k)
n zn q(zn = k )
p( xn , zn |π, µ)
= ∑ log Eq
n q(zn = k )
p( xn , zn |π, µk )
≥ ∑ ∑ q(znk ) log
n k
q(zn )
" # " #
= ∑ ∑ q(znk ) log p(xn , zn |π, µk ) −∑ ∑ q(znk ) log q(zn )
n k n k
= ∑ ∑ qnk log p( xn , zn = k|π, µk ) + ∑ H [q(zn )]
n k n
In the second line we only multiplied by 1 = q(z_n = k)/q(z_n = k). In the third line, we use the trick of applying Jensen's inequality when we see the log of an expectation.
Now analyzing the M-step, we are maximizing the expected complete-data log-likelihood above, where ∑_k q_{nk} log π_k is the cross-entropy between the parameters and the expected counts.
In the E-step, we find q_{nk}:
arg min_q ∑_n ∑_k q(z) log [q(z) / (p(z|x) p(x))] = arg min_q ∑_n ∑_k q(z) log [q(z) / p(z|x)] + const
= arg min_q ∑_n KL(q(z) ‖ p(z|x)),
and we can minimize the KL divergence term by setting q_{nk} = p(z_n | x_n, π, µ), i.e. the same distribution.
CS281: Advanced ML October 30, 2017
Recall the Ising model from a few lectures ago: a grid of nodes, each connected to its adjacent nodes (4 neighbors in the interior, 3 on the edges, and 2 at the corners). Things in which we may be interested include the marginal distributions p(y_{ij} = 1) and the partition function. Unlike on a tree, where belief propagation is exact, loopy belief propagation may not converge well on this graph with cycles, so we will look for a different method.
• Gaussian Mixture Model
π ∼ Dir(α)
Θ_k = (µ_k, Σ_k)
Z_n ∼ Cat(π)
X_n | (Z_n = k, Θ) ∼ N(µ_k, Σ_k)
Recall the Gaussian Mixture Model from last lecture, an unsupervised learning method that attempts
to categorize data points into clusters. This is nearly the same graphical model as Naive Bayes classi-
fication, except that the classes are not observed, but rather inferred. Again, computation of p( xn ) is
rather difficult because zn are not observed. Therefore, one must find a method that infers the zn in
order to infer p( xn ).
15.3 Variational Idea
[Diagram: reverse KL, KL(q‖p), the I-projection, vs. forward KL, KL(p‖q), the M-projection.]
min_q d(p, q),  where  d(p, q) = KL(q‖p) = ∫ q(z) log [q(z) / p(z | D)] dz
In trying to find a gap function d(p, q) that measures our gap from p to q, one good choice is the KL divergence, but because it is asymmetric, we essentially have 2 options, both of which lead to valid methods:
• KL(qk p)
The reverse KL, I-projection
• KL( pkq)
The forward KL, M-projection
The forward KL method is also known as the Expectation Propagation (EP) method.
15.4 Relationship to EM
Recall the Expectation-Maximization (EM) method from last lecture. Much of the mathematics we will use
resembles that which we have already studied.
log p(D) = log ∫_z p(D, z) dz    (D is the observed data set)
= log ∫_z q(z) p(D, z)/q(z) dz
= log E_{z∼q}[p(D, z)/q(z)]
≥ E_{z∼q}[log p(D, z)/q(z)]    (the lower bound, by Jensen's inequality)

log p(D) − E_q[log p(D, z)/q(z)] = E_q[log p(D) − log p(D, z)/q(z)]
= E_q[log q(z)/p(z | D)]
= KL(q‖p)
One more remaining issue is that
p(z | D) = p(z, D) / p(D),
where p( D ) is generally very difficult to compute and the main reason why computing p exactly becomes
an intractable problem. However, for the purposes of finding q, we can ignore p( D ) as just a constant, since
it is independent of z. Thus,
min_q KL(q‖p) = min_q E_q[log q(z)/p(z, D)]
Comparison to EM
• In variational inference, we have coordinate ascent just like in EM, except we try to optimize a lower
bound.
• In variational inference, we pick q from the EASY set. As a result, we don’t have to select point
estimates, but rather we can use entire distributions.
• This is very useful in Bayesian setups.
• We can also combine this with EM and sampling to obtain more sophisticated techniques as well.
[Diagram: fully factorized mean-field approximation, with one factor per latent variable, e.g. q_3(z_3), q_4(z_4), q_5(z_5).]
This is an algorithm that optimizes one particular qi , via an expectation over the adjacent variables, while
fixing all of the other q and then iterates this procedure over all of the q. It proceeds as follows:
q_i(z_i) ∝ exp(E_{−q_i}[log p(z, x)]),
where E_{−q_i} denotes an expectation taken over all the variables except z_i.
We now apply Mean Field Variational Inference to the Ising Model. Note that θ_{v_i} denotes the log-potential of vertex i, while θ_{E_{i−n}} denotes the log-potential of the edge from vertex i to vertex n. We compute q_i as an expectation over the 4 neighboring nodes, the Markov blanket. Therefore,
q_i(z_i) ∝ exp[z_i θ_{v_i} + ∑_{n∈nbr(i)} θ_{E_{i−n}} z_i q_n(1)].
We can then repeat this procedure over all nodes to update the entire graph. Then we can repeat that over
several epochs until we find good convergence of q.
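As an illustration (not code shown in class), here is a minimal numpy sketch of these coordinate updates on a grid Ising model with binary states z_i ∈ {0, 1}; the grid size and log-potentials are made up.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

H = W = 20
theta_v = np.full((H, W), 0.1)   # node log-potentials
theta_e = 1.0                    # shared edge log-potential

# q[i, j] stores q_ij(z_ij = 1); initialize at 0.5 (maximum uncertainty).
q = np.full((H, W), 0.5)

for _ in range(50):  # epochs over the grid
    for i in range(H):
        for j in range(W):
            # Sum of E[z] over the (up to 4) neighbors: the Markov blanket.
            nbrs = 0.0
            if i > 0:     nbrs += q[i - 1, j]
            if i < H - 1: nbrs += q[i + 1, j]
            if j > 0:     nbrs += q[i, j - 1]
            if j < W - 1: nbrs += q[i, j + 1]
            # q_i(z_i = 1) proportional to exp[theta_v + theta_e * sum_n q_n(1)]
            q[i, j] = sigmoid(theta_v[i, j] + theta_e * nbrs)

print(q.mean())  # approximate marginal p(z_ij = 1), averaged over the grid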
CS281: Advanced ML November 1, 2017
16.1 Announcements
• T4 out, due 9/13 at 5pm
• Exams available in office
• OH - today 2:30-4pm (Wednesdays)
• Follow formatting for the abstract of the final project. There were many inconsistencies with the
formatting requirements for the initial proposal.
16.2 Introduction
Last class, we talked about variational inference. This class, we are going to talk about a very different flavor of VI. We will also mention some other types of VI, but will not go too much into the details.
Murphy's book, especially Chapter 22, covers many details on the theory side. The other text, which Murphy refers to as "The Monster," is a textbook written by Michael Jordan that we have put online.
[Plate diagram: mixture model with parameters π and µ_k, latent z_n, and observations x_n.]
We assume:
µ_k ∼ N(0, σ²) ∀ k
z_n ∼ Cat(1/k, . . . , 1/k) ∀ n
x_n | z_n, µ ∼ N(µ_{z_n}, 1) ∀ n.
Then we write:
p({x_n}, {z_n}, µ) = p(µ) ∏_n p(z_n) p(x_n | z_n, µ).
And we get:
p(x) = ∑_z ∫_µ p(x, z, µ) dµ = ∫ p(µ) ∏_n ∑_{z_n} p(z_n) p(x_n | z_n, µ) dµ.
Variational setup. Goal:
min_{q ∈ EASY} KL(q‖p),
the reverse KL. We pick EASY to be the mean-field family.
[Diagrams: the model over z_n, x_n, and µ_k, and the corresponding mean-field factorization over z_n and µ_k.]
Variational parametrization:
q(µ, z) = ∏_k q_k(µ_k) ∏_n q_n(z_n).
As an example, deriving the math for the GMM is useful for understanding how mean-field updates work. Starting from the above, we set up the problem:
µ_k ∼ N(0, σ²) ∀ k
z_n ∼ Cat(1/k, . . . , 1/k) ∀ n
x_n | z_n, µ ∼ N(µ_{z_n}, 1) ∀ n,
where we also have variational parameters λ^z_n for the hidden switch variables, and λ^m_k, λ^s_k for the Gaussians.
q_n(z_n; λ^z_n) ∝ exp[E_{−q_n} log p(µ, z, x)]
∝ exp[E_{−q_n} log p(x_n | z_n, µ_{z_n})]
∝ exp[E_{−q_n}[−(x_n − µ_{z_n})²/2]]
∝ exp[E_{−q_n}[x_n µ_{z_n} − µ_{z_n}²/2]]
∝ exp[x_n E_{−q_n}[µ_{z_n}] − E_{−q_n}[µ_{z_n}²]/2]
[Diagram: z_n with its observation x_n and the selected component mean µ_{z_n}.]
So we identify E_{−q_n}[µ_{z_n}] with λ^m_{k=z_n} and E_{−q_n}[µ_{z_n}²] with λ^s_{k=z_n}, and then we can write q_n(z_n) in terms of these parameters.
Then:
q_k(µ_k) = exp[θ^T φ − A(θ) + . . .],
where θ_1 = ∑_n λ^z_{nk} x_n, θ_2 = −(1/(2σ²) + ∑_n λ^z_{nk}/2), and φ_1 = µ_k, φ_2 = µ_k², as in a GLM.
For the normal distribution, we have:
λ^m_k = ∑_n λ^z_{nk} x_n / (1/σ² + ∑_n λ^z_{nk}),   and   λ^s_k = 1 / (1/σ² + ∑_n λ^z_{nk}).
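A minimal numpy sketch of these coordinate-ascent (CAVI) updates for the unit-variance GMM above; the toy data and prior variance are made up, and the sketch uses the exact second moment E[µ_k²] = λ^s_k + (λ^m_k)² under the Gaussian q_k.

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])
N, K, sigma2 = len(x), 2, 10.0   # sigma2 is the prior variance on mu_k

# Variational parameters: lam_z[n, k], lam_m[k] (means), lam_s[k] (variances).
lam_z = rng.dirichlet(np.ones(K), size=N)
lam_m = rng.normal(0, 1, K)
lam_s = np.ones(K)

for _ in range(100):
    # Update q_n(z_n): proportional to exp(x_n E[mu_k] - E[mu_k^2] / 2).
    e_mu2 = lam_s + lam_m ** 2
    log_rho = x[:, None] * lam_m[None, :] - e_mu2[None, :] / 2
    lam_z = np.exp(log_rho - log_rho.max(axis=1, keepdims=True))
    lam_z /= lam_z.sum(axis=1, keepdims=True)

    # Update q_k(mu_k): Gaussian with the closed-form mean and variance above.
    Nk = lam_z.sum(axis=0)
    lam_m = (lam_z * x[:, None]).sum(axis=0) / (1.0 / sigma2 + Nk)
    lam_s = 1.0 / (1.0 / sigma2 + Nk)

print(lam_m)  # should be close to the true cluster means (-2 and 3, in some order)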
[Diagram: a node z_j and its Markov blanket.]
q(z) = ∏_j q(z_j)
λ_j = E[θ(z_{−j}, x)]
[Plate diagram: topic model with topics β_k, per-document topic proportions π_n, topic assignments z_ni, and words w_ni.]
– z_ni: the topic selected for word i of document n
– w_ni: the word at position i of document n
– λ^z_ni: the variational distribution over the topic of word i in document n
16.6 Demo
We did an example iPython notebook (TopicModeling.ipynb) in class.
CS281: Advanced ML November 6, 2017
17.1 Loopy BP
17.1.1 History
The history of Loopy Belief Propagation (Loopy BP) began in 1988 with Judea Pearl, who tried to analyze the behavior of the BP algorithm (which gives exact marginal inference on trees) on graphs that are not trees, like the Ising model. Remember the message passing update is
bel_s(x_s) ∝ ψ_s(x_s) ∏_{t∈nbr(s)} m_{t→s}(x_s)
Note that this algorithm doesn’t require an ordering on the nodes, so it can naturally extend to cyclic graphs.
Around 1998, a decade after Pearl raised the question of the BP algorithm on general graphs, papers study-
ing Turbo coding or Low density parity check (LDPC) codes used the BP algorithm on cyclical graphs with
empirical success, motivating more research. This research drew parallels between loopy-BP and varia-
tional inference.
17.1.2 Implementations
We can implement loopy belief propagation either synchronously or asynchronously
• Synchronous Updates: All of the nodes are updated together, so that every node at iteration t depends on its Markov blanket at t − 1
– parallelizable
– usually takes more updates
• Asynchronous Updates: The nodes are given an order and updated sequentially
– sequential
– usually converges in fewer updates
These updates look similar to mean field variational inference (VI) with some differences in performance.
• Loopy BP is exact on trees, mean-field variational inference is not.
• Loopy BP is not guaranteed to converge, whereas mean-field VI will.
• Even if Loopy BP converges, it is not guaranteed to converge to a local optimum.
Loopy BP is generally more costly than mean field VI, because it stores node beliefs in addition to edge
messages, whereas mean field VI only stores node parameters. If the graph is large and dense, then Loopy
BP might be very costly.
In practice, because Loopy BP retains more information about the problem than mean field VI, it tends to work better.
x̃i ∼ P( xi | x̃−i )
Naturally, we only care about the Markov blanket when estimating a particular node. Gibbs sampling is not a parallel algorithm; the samples have to be generated sequentially. However, we can go through our graph in any particular order and update each node accordingly.
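As an illustration (not from class), here is a minimal numpy sketch of Gibbs sampling on the same kind of grid Ising model used earlier, with made-up potentials.

import numpy as np

rng = np.random.default_rng(0)
H = W = 20
theta_v, theta_e = 0.1, 1.0
z = rng.integers(0, 2, size=(H, W))   # z_ij in {0, 1}

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

samples = []
for sweep in range(200):
    for i in range(H):
        for j in range(W):
            # p(z_ij = 1 | Markov blanket): only the 4 neighbors matter.
            nbrs = 0
            if i > 0:     nbrs += z[i - 1, j]
            if i < H - 1: nbrs += z[i + 1, j]
            if j > 0:     nbrs += z[i, j - 1]
            if j < W - 1: nbrs += z[i, j + 1]
            p1 = sigmoid(theta_v + theta_e * nbrs)
            z[i, j] = rng.random() < p1
    if sweep >= 100:              # discard burn-in sweeps
        samples.append(z.mean())

print(np.mean(samples))  # Monte Carlo estimate of the average marginal p(z_ij = 1)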
β_k ∼ Dir(η)    (k topics)
π_n ∼ Dir(α)    (n documents)
z_ni ∼ Cat(π_n)    (draw a topic for each word in each document)
w_ni ∼ Cat(β_{z_ni})    (draw a word from that topic)
The last two terms are essentially the full posterior that we saw in lecture 2. Although these updates are rather simple, they are used in many applications, since this approach works with a wide variety of models. Typically, one would draw one sample per update. It is not clear whether drawing multiple samples or running more epochs helps the algorithm converge faster.
Figure 17.19: Graphical model associated to topic modeling example
We can approximate this expectation by sampling some N number of z̃, with each sample represented by
z̃n , which results in the following expression for the expectation:
E_{z∼p}[f(z)] ≈ (1/N) ∑_{n=1}^N f(z̃_n)
This is much simpler to compute, and it can be shown that the resulting µ̃ satisfies (µ̃ − µ) ∼ N(0, σ²/N), which implies that the variance of the approximation with respect to the true mean decreases with the number of samples N.
17.3.4 Deriving the Variational Objective
Letting θ = {z_n, µ_k}, we would like p(x) for the GMM mean-field configuration previously developed. We aim to maximize the following lower bound (also from the previous lecture) in order to approach p(x), where q_λ is any q with variational parameters λ:
log p(x) ≥ E_q[log (p(x, θ)/q_λ(θ))]
max_q E_q[log (p(x, θ)/q_λ(θ))]
Expressing p(x, θ) = p(θ) p(x|θ) and distributing the log, we find the variational objective, and we can interpret each of the three terms:
E_q[log (p(x, θ)/q_λ(θ))] = −E_q[log q_λ(θ)] + E_q[log p(θ)] + E_q[log p(x|θ)]
In the expression above, we can recognize that the first term is just an entropy term, the second term includes the prior to effectively find a q close to p(θ) based on cross-entropy, and the third term contains the likelihood to reflect how well θ predicts the data. Furthermore, the first two terms together form the negative reverse KL divergence between q and the prior, so we further simplify to
−KL(q_λ(θ) ‖ p(θ)) + E_q[log p(x|θ)]
In the expression above, the KL divergence will act to pull q toward the prior p(θ), and the expectation term will give weight in q to parameters that tend to explain x.
∇_λ E_{q_λ}[p(x | θ)] ≈ (1/N) ∑_{n=1}^N ∇_λ p(x | f_λ(z̃_n))
CS281: Advanced ML November 8, 2017
Today we are going to go a step further in variational inference: we are going to combine variational inference and neural networks.
18.1 Autoencoder
Suppose we receive a signal x from a source. Then we encode x with encoding function Enc
z = Enc( x )
so that z is simpler than x; for example, z can be of lower dimension than x, or the learned encoding function itself can be a simple function. The goal is to limit the capacity of the model by parameterizing Enc and the decoder p(·), and to learn them from the data. This leads to the model learning the optimal compression for the data, which saves only the most important information and discards the rest. This encoding is often nonlinear in practice; for example, people often use neural networks. Then z might be changed to z′ by noise. After that, we want to recover x by inference.
This is basically what an autoencoder does. For a variational autoencoder, we need to parameterize the function Enc and Pr(x = x′ | z′), usually with neural networks. We might want to learn these functions from data.
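A minimal PyTorch sketch of a plain (non-variational) autoencoder in this spirit; the layer sizes and the random stand-in data are made up, and this is only a sketch, not the model discussed in class.

import torch
import torch.nn as nn

# Encoder maps x (784-dim) to a lower-dimensional code z (32-dim); decoder maps back.
enc = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 32))
dec = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784))

opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.rand(64, 784)          # a stand-in batch of data
for _ in range(100):
    z = enc(x)                   # z = Enc(x): the compressed code
    x_hat = dec(z)               # reconstruction of x from the code
    loss = ((x_hat - x) ** 2).mean()   # reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()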
Recall the variational objective −KL(q(θ)‖p(θ)) + E_q[log p(x | θ)], where the first term makes q close to the prior p and the second term is the likelihood under q. There are several approaches to optimizing this expression:
• Coordinate Ascent (CAVI)
• We can make the mean field assumption (which requires the full Markov blanket), so that q(θ) = ∏_i q_i(θ_i)
• We can also do structured mean field. Basically, we can divide the graph into several simple subgraphs (they can overlap each other). Inside each subgraph we can do belief propagation, using for example the forward-backward algorithm, and between subgraphs we use the mean field approximation. This q is closer to the original and gives a better bound.
• More generally we can use neural networks, where λ are neural network parameters and we use qλ
to approximate the posterior.
[Plate diagram: parameters µ and latent z generating observations x, repeated over N data points.]
Since the parameters are not fixed, EM fails in this case. Using this as a motivating example, let's examine Variational Autoencoders. In our previous examples of inference, we always considered conditional distributions that were discrete or linear-Gaussian, so that we could calculate the posterior in a straightforward manner. But in general p(Data | Label) can be nonlinear, for example a neural network. Then p(Label | Data) becomes intractable. In this case, we should make the q-function a neural network. Suppose z are labels, x are data, and µ are parameters, and we assume
Pr( x |z, µ) = softmax(wφ( x; µ))
where φ is a neural network. We use neural networks to compute the q functions. After we obtain the optimized parameters λ from a large amount of data x, we obtain q_λ(x) as an encoding function for all x, and we can use it for future encoding and recovery, and even to generate new x from the encoding and decoding processes.
18.4 Sampling from other distribution
One typical problem when we want to do optimization is that we don't know how to compute
∇_λ E_{q_λ}[log p(x | θ)].
We use the identity
∇_λ log q_λ(θ) = ∇_λ q_λ(θ) / q_λ(θ),   i.e.   ∇_λ q_λ(θ) = q_λ(θ) ∇_λ log q_λ(θ).
Then we have
∇_λ E_{q_λ}[log p(x | θ)] = ∫ ∇_λ q_λ(θ) log p(x | θ) dθ    (18.48)
= ∫ q_λ(θ) ∇_λ log q_λ(θ) log p(x | θ) dθ    (18.49)
= E_{q_λ}[log p(x | θ) ∇_λ log q_λ(θ)]    (18.50)
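A minimal numpy sketch of this score-function (log-derivative) estimator, using a q_λ given by a softmax over K = 3 discrete values of θ; the log p(x | θ) values are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
K = 3
log_p = np.array([-3.0, -1.0, -2.0])   # made-up values of log p(x | theta = k)
lam = np.zeros(K)                      # variational parameters: q_lam = softmax(lam)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

for _ in range(2000):
    q = softmax(lam)
    thetas = rng.choice(K, size=50, p=q)         # sample theta ~ q_lam
    # grad_lam log q_lam(theta) for a softmax is one-hot(theta) - q.
    score = np.eye(K)[thetas] - q
    grad = (log_p[thetas, None] * score).mean(axis=0)   # E_q[log p * grad log q]
    lam += 0.1 * grad                                   # gradient ascent on E_q[log p(x|theta)]

print(softmax(lam))  # mass concentrates on the theta with the largest log p(x|theta)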
CS281: Advanced ML November 13, 2017
19.1 History
We started with Monte Carlo in the past few lectures. The main method is to draw samples from a proposal distribution and take a sample average to approximate expectations:
E_{y∼p(y|x)}[f(y)] = ∫ p(y|x) f(y) dy ≈ (1/N) ∑_{n=1}^N f(ỹ^(n)),
where ỹ^(n) ∼ p(y|x). This approach requires the ability to sample y ∼ p(y|x).
Box-Muller Sample z1 , z2 ∼ Unif[−1, 1]. Discard the points outside of the unit circle, so our sample of
{(z1 , z2 )} is uniform on the unit disk. We would like to transform each (z1 , z2 ) into some ( x1 , x2 ) ∼ N (0, I ).
We want to find the right transform such that the Jacobian makes the following hold:
p(x_1, x_2) = p(z_1, z_2) |∂(z_1, z_2)/∂(x_1, x_2)|,
where the left-hand side is the normal PDF and the factor p(z_1, z_2) is the uniform-disk PDF. We may check that
x_1 = z_1 (−2 log r² / r²)^{1/2}
x_2 = z_2 (−2 log r² / r²)^{1/2},
where r² = z_1² + z_2², is the desired transform.
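A minimal numpy sketch of this polar form of Box-Muller:

import numpy as np

rng = np.random.default_rng(0)

def polar_box_muller(n):
    out = []
    while len(out) < n:
        z1, z2 = rng.uniform(-1, 1, 2)
        r2 = z1 ** 2 + z2 ** 2
        if r2 >= 1 or r2 == 0:        # discard points outside the unit disk
            continue
        scale = np.sqrt(-2 * np.log(r2) / r2)
        out.extend([z1 * scale, z2 * scale])   # two independent N(0, 1) draws
    return np.array(out[:n])

x = polar_box_muller(100_000)
print(x.mean(), x.std())   # should be close to 0 and 1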
19.4 Rejection Sampling
Assume that we have access to the PDF p( x ) or the unnormalized PDF p̃( x ). The idea is to pick a guide
function (valid PDF) q( x ) that is similar to p and easy to compute. We also pick a scale M. We require that
Mq( x ) > p( x )
for all x and that we have access to p(x) or p̃(x). The algorithm is as follows:
1. Sample x_n ∼ q(x)
2. Draw u ∼ Unif[0, 1]
3. If u < p(x_n)/(M q(x_n)), then keep x_n; otherwise, rerun from step 1.
The interpretation is simple. The algorithm "graphs" p(x) and the bounding M q(x) on a board, then proceeds to throw darts at the board, accepting those darts that land below p(x). The same algorithm works even if p̃ is unnormalized, since we have the degree of freedom to choose M and can thus absorb the normalizing constant.
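A minimal numpy sketch of this "dart-throwing" procedure, with a made-up unnormalized target p̃ and a standard normal guide q; the scale M = 3 was chosen so that M q(x) ≥ p̃(x) everywhere for this particular p̃.

import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target (made up) and a standard normal guide q, with scale M.
p_tilde = lambda x: np.exp(-x ** 4)
q_pdf = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
M = 3.0

samples = []
while len(samples) < 10_000:
    xn = rng.standard_normal()              # 1. sample from the guide q
    u = rng.uniform()                       # 2. uniform height
    if u < p_tilde(xn) / (M * q_pdf(xn)):   # 3. keep if the dart lands under p_tilde
        samples.append(xn)

samples = np.array(samples)
print(samples.mean(), samples.std())        # symmetric target, so the mean is near 0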
This method works with p̃(x)/Z = p(x). But why should it work?
ANS: This works because for whatever guide function we pick, we can absorb the normalizer into the scale, M̃ = Z M. Then
P(x ≤ x_0 | x is accepted) = [∫_{−∞}^{x_0} ∫_0^1 q(x) 1(u ≤ p̃(x)/(M q(x))) du dx] / [∫_{−∞}^{∞} ∫_0^1 q(x) 1(u ≤ p̃(x)/(M q(x))) du dx]
= [(1/M) ∫_{−∞}^{x_0} p̃(x) dx] / [(1/M) ∫_{−∞}^{∞} p̃(x) dx]    (the denominator is the acceptance probability)
= ∫_{−∞}^{x_0} p(x) dx
= P(x ≤ x_0)
2. Let p = N(0, σ_p² I) and q = N(0, σ_q² I) where σ_q² > σ_p². Pick
M = (σ_q/σ_p)^D,
where D is the dimension of the multivariate normal. Note that M becomes very large when D becomes large, so rejection sampling may be inefficient in high dimensions. If we think of M as the factor by which we must inflate q to cover p, then by the known geometry of Gaussian distributions, as D increases there is more and more "space" between p and q to fill, making rejection sampling quite difficult.
19.6 Importance Sampling
We want to approximate the expectation
E_{x∼p}[f(x)] = ∫ f(x) p(x) dx.
So far, we can sample a bunch of points, via, say, rejection sampling, from p and calculate a sample mean to approximate the true expectation. If the structure of p and the structure of f are very different, the Monte Carlo methods so far might be inefficient, since they sample from high-density areas of p, which may have very low values of f, and they may miss areas with high values of f but low probability of happening.
Consider the integral
∫ q(x) [p(x)/q(x)] f(x) dx = E_q[(p(x)/q(x)) f(x)] = E_p[f(x)].
We may now apply the same Monte Carlo trick: sample from q and take the sample mean of f(x) p(x)/q(x). What is the benefit of using q? Since we can choose q to be closer to f, more of the samples we draw will be around high values of f. Here we don't need to wait for some low-probability tail event in p to happen in order to get reliable estimates of E_p[f(x)]. Instead, we can directly look at the tail events via q and weight the data appropriately using p/q to still maintain asymptotic convergence.
How exactly do we choose q? We want to minimize the variance of f ( x ) p( x )/q( x ) when x ∼ q, since this
allows for faster convergence. Then
" # 2
f ( x ) p( x ) f ( x ) p( x ) 2 f ( x ) p( x )
Var =E − E
q( x ) q( x ) q( x )
| {z }
constant eventually
2 2
f ( x ) p( x ) p( x )| f ( x )|
E ≥ Eq (Jensen’s inequality)
q( x ) q( x )
Z 2
= p( x )| f ( x )| dx
We minimize the lower bound via Jensen’s inequality (similar to what we did in variational inference). The
optimal q is chosen via
| f ( x )| p( x )
q∗ = R .
| f ( x )| p( x ) dx
It may be difficult to normalize q∗ in practice, however.
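As an illustration, here is a minimal numpy sketch estimating a tail probability E_p[f(x)] = P(x > 4) under p = N(0, 1), with a proposal q = N(4, 1) shifted toward the tail; this q is a convenient made-up choice, not the optimal q*.

import numpy as np

rng = np.random.default_rng(0)
N = 100_000

def log_normal_pdf(x, mu, sig):
    return -0.5 * ((x - mu) / sig) ** 2 - np.log(sig * np.sqrt(2 * np.pi))

f = lambda x: (x > 4.0).astype(float)    # rare event under p = N(0, 1)

# Naive Monte Carlo: sample from p directly (almost never sees the event).
x_p = rng.standard_normal(N)
naive = f(x_p).mean()

# Importance sampling: sample from q = N(4, 1) and weight by p(x)/q(x).
x_q = rng.normal(4.0, 1.0, N)
w = np.exp(log_normal_pdf(x_q, 0, 1) - log_normal_pdf(x_q, 4, 1))
is_est = (w * f(x_q)).mean()

print(naive, is_est)   # the true value is about 3.2e-5; the IS estimate is far more stable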
E_p[f(x)] = ∫ p(x) f(x) dx ≈ (1/N) ∑_{n=1}^N f(x̃^(n)),
where x̃^(n) ∼ p(x).
CS281: Advanced ML November 15, 2017
[Diagrams: state-space model with a latent chain z_1, z_2, z_3, . . . , z_t and observations x_1, x_2, x_3, . . . , x_t, shown as two graphical models.]
We can marginalize, obtaining a tree, on which we can apply exact belief propagation (BP).
[Diagram: the resulting chain over z_1, z_2, z_3, . . . , z_t.]
To approximate integrals of the form:
E_p[f(z)] = ∫ q(z) [p(z)/q(z)] f(z) dz ≈ (1/S) ∑_{s=1}^S [p(z^s)/q(z^s)] f(z^s) = ∑_{s=1}^S w^s f(z^s),
where the
w^s = (1/S) p(z^s)/q(z^s)
are importance weights associated with each sample.
Note: This form of importance sampling is convenient as it lets us work with unnormalized distributions. However, in many cases it is much less convenient than, for instance, rejection sampling or plain Monte Carlo sampling, because it is not fully clear how to use the weights.
Now consider getting samples from p. We need to define p( x ) in terms of E p [ f ( x )] using a δ function that
returns 1 if it’s the value we want, 0 otherwise:
p(x = x′) = E_p[δ_{x′}(x)]
Then, our estimate in terms of importance sampling is:
p(x) ≈ ∑_{s=1}^S w^s δ_{x^s}(x)
But we actually want unweighted samples, which can be obtained through re-sampling: pick x s with
probability ws . In other words, we create a discrete set that approximates the original distribution and then
we draw samples from that discrete set based on a categorical distribution weighted by the ws .
20.2.3 Example
Recall from last class, we have:
p(θ | Data) = p(Data | θ) p(θ) / p(Data)
p̃(θ | Data) = p(Data | θ) p(θ)
Notice we drop the normalization term for p̃. Then take q(θ) = p(θ), which means we importance-sample based on the prior. This gives normalized weights
w^s = p(Data | θ^s) / ∑_{s'} p(Data | θ^{s'}).
We update the belief state using importance sampling, with the following importance weights:
w_t^s ∝ p(z_{1:t} | x_{1:t}) / q(z_{1:t} | x_{1:t})
p(z_{1:t} | x_{1:t}) ∝ p(z_{1:t} | x_{1:t−1}) p(x_t | z_{1:t}, x_{1:t−1}) ∝ p(x_t | z_t) p(z_t | z_{t−1}) p(z_{1:t−1} | x_{1:t−1}),
where p(x_t | z_t), p(z_t | z_{t−1}), and p(z_{1:t−1} | x_{1:t−1}) correspond to the observation, transition, and recursion of the sequence, respectively.
Using this expression, we have the following algorithm to approximate the belief state.
1. Start with particles (w_{t−1}, z_{t−1})^{(s)} from the previous time step.
2. At each time step t, for each particle s, sample z_t^s ∼ q(z_t | z_{1:t−1}^s, x_{1:t}) and calculate
w_t^s = [p(x_t | z_t^s) p(z_t^s | z_{t−1}^s) / q(z_t^s | z_{1:t−1}^s, x_{1:t})] w_{t−1}^s.
3. Compute p(z_t, x_{1:t}).
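A minimal numpy sketch of this update for a made-up linear-Gaussian state-space model, taking the transition p(z_t | z_{t−1}) as the proposal q (a "bootstrap" filter) and adding a resampling step of the kind described earlier; the model constants are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Toy state-space model: z_t = 0.9 z_{t-1} + noise, x_t = z_t + noise.
T, S = 50, 1000
z_true = np.zeros(T)
x = np.zeros(T)
for t in range(1, T):
    z_true[t] = 0.9 * z_true[t - 1] + rng.normal(0, 1)
    x[t] = z_true[t] + rng.normal(0, 0.5)

particles = rng.normal(0, 1, S)
weights = np.ones(S) / S
z_est = np.zeros(T)

for t in range(1, T):
    # Proposal = transition, so the weight update is just the observation likelihood.
    particles = 0.9 * particles + rng.normal(0, 1, S)         # sample z_t^s ~ p(z_t | z_{t-1}^s)
    weights *= np.exp(-0.5 * ((x[t] - particles) / 0.5) ** 2)  # w_t^s prop. to p(x_t | z_t^s) w_{t-1}^s
    weights /= weights.sum()
    z_est[t] = np.sum(weights * particles)                     # posterior mean estimate

    # Resample to get unweighted particles (avoids weight degeneracy).
    idx = rng.choice(S, size=S, p=weights)
    particles = particles[idx]
    weights = np.ones(S) / S

print(np.mean(np.abs(z_est - z_true)))   # tracking error of the filtered estimate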