CSC 446 Lecture Notes
Contents
1 What Is Machine Learning? 1
2 Probability Theory 1
3 Concentration Bounds 3
5 Entropy 5
5.1 Bounds on Entropy for a Discrete Random Variable . . . . . . . . . . . . . . . . . . . . . . . . 5
5.2 Further Entropy Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
6 Mutual Information 6
6.1 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
6.2 KL divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
6.3 Lower Bound for KL divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
6.4 L1 norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
8 Linear Regression 9
9 Smoothing 9
9.1 Prior Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
9.2 Dirichlet Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
9.3 Gamma Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
9.4 Justifying the Dirichlet Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
11 Perceptrons 12
11.1 Proof of Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
11.2 Perceptron in Stochastic Gradient Descent perspective . . . . . . . . . . . . . . . . . . . . . . . 15
13 Support Vector Machines 17
13.1 Training Linear SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
13.2 Convex Optimization Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
13.3 Karush-Kuhn-Tucker (KKT) Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
14 Kernel Functions 20
14.1 Review Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
14.2 Kernel Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
14.3 Proof that φ exists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
14.4 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
15 Graphical Models 24
15.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
15.2 Factor Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
15.3 Message Passing (Belief Propagation) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
15.4 Running Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
16 Junction Tree 27
16.1 Max-Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
16.2 Tree Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
17 Expectation Maximization 31
17.1 Parameter Setting: An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
17.2 Expectation-Maximization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
17.3 EM Algorithm in General . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
17.4 Gradient Ascent (∂L/∂θ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
17.5 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
17.6 Variational Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
17.7 Mixture of Gaussians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
18 Sampling 38
18.1 How to Sample a Continuous Variable: Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
18.2 The Metropolis-Hastings Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
18.3 Proof of the method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
18.4 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
18.5 Gibbs sampling with Continuous Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
18.6 EM with Gibbs Sampling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
18.6.1 Some problems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
18.6.2 Some advantages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
19 Error Bounds 44
19.1 Leave One Out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
19.2 One Way to Compute an Error Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
19.3 Another Way to Compute an Error Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
22 LBFGS 51
22.1 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
22.2 The BFGS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
22.3 Proof of the method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
22.4 L-BFGS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
23 Reinforcement Learning 54
23.1 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
23.2 Value Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
23.3 Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
23.4 Temporal Difference Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
24 Game Theory 56
24.1 Definition of the Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
24.2 Some Simplification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
24.3 Nash Equilibrium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
24.4 Proof of the LP Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
• Regression is the problem of learning a function from datapoints to numbers; fitting a line or a curve
to the data is an example of a regression problem.
One example of a classification problem would be: given heights and weights, classify people by sex. We
are given a number of training datapoints, whose heights, weights, and sexes we know; we can plot these
datapoints in a two-dimensional space. Our goal is to learn a rule from these datapoints that will allow us
to classify other people whose heights and weights are known but whose sex is unknown. This rule will
take the form of a curve in our two-dimensional space: then, when confronted with future datapoints, we
will classify those datapoints which fall below the curve as female and those which fall above the curve as male.
So how should we draw this curve?
One idea would be to draw a winding curve which carefully separates the datapoints, assuring that all
males are on one side and all females are on the other. But this is a very complicated rule, and it’s likely to
match our training data too closely and not generalize well to new data. Another choice would be to draw
a straight line; this is a much simpler rule which is likely to do better on new data, but it does not classify
all of the training datapoints correctly. This is an example of a fundamental tradeoff in machine learning,
that of overfitting vs. generalization. We will return to this tradeoff many times during this class, as we
learn methods of preventing overfitting.
An example of a regression problem would be: given weights of people, predict their heights. We
can apply the nearest neighbor model to solve this problem. The nearest neighbor model remembers the
weights and corresponding heights of the people in the training data. Then for a new test weight, it looks up
the person with the closest weight in the training data, and returns the corresponding height. This results
in a piece-wise constant function that may be affected by outliers and may result in overfitting. Another
choice would be to fit a straight line.
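To make the contrast concrete, here is a minimal Python sketch (the weight/height numbers are made up for illustration, not from the course) comparing a nearest-neighbor predictor with a straight-line fit:

    import numpy as np

    # Hypothetical training data: weights (kg) and heights (cm).
    weights = np.array([50.0, 62.0, 70.0, 81.0, 95.0])
    heights = np.array([155.0, 168.0, 172.0, 178.0, 185.0])

    def nearest_neighbor_height(w):
        """Return the height of the training person with the closest weight."""
        idx = np.argmin(np.abs(weights - w))
        return heights[idx]

    # Straight-line fit (least squares): height ~ a * weight + b
    a, b = np.polyfit(weights, heights, deg=1)

    for w in [55.0, 75.0, 100.0]:
        print(w, nearest_neighbor_height(w), a * w + b)

The nearest-neighbor prediction is piece-wise constant and tracks individual training points (including outliers); the line smooths over them.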
2 Probability Theory
This section contains a quick review of basic concepts from probability theory.
Let X be a random variable, i.e., a variable that can take on various values, each with a certain proba-
bility. Let x be one of those values. Then we denote the probability that X = x as P (X = x). (We will often
write this less formally, as just P (x), leaving it implicit which random variable we are discussing. We will
also use P (X) to refer to the entire probability distribution over possible values of X.)
In order for P (X) to be a valid probability distribution, it must satisfy the following properties:
• For all x, P (X = x) ≥ 0.
• Σ_x P(X = x) = 1 or ∫ P(x) dx = 1, depending on whether the probability distribution is discrete or
continuous.
If we have two random variables, X and Y , we can define the joint distribution over X and Y , denoted
P (X = x, Y = y). The comma is like a logical “and”; this is the probability that both X = x and Y = y.
Analogously to the probability distributions for a single random variable, the joint distribution must obey
the properties that for all x and for all y, P(X = x, Y = y) ≥ 0 and either Σ_{x,y} P(X = x, Y = y) = 1 or
∫∫ P(x, y) dy dx = 1, depending on whether the distribution is discrete or continuous.
From the joint distribution P(X, Y), we can marginalize to get the distribution P(X): namely,
P(X = x) = Σ_y P(X = x, Y = y). We can also define the conditional probability P(X = x|Y = y), the probability
that X = x given that we already know Y = y. This is P(X = x|Y = y) = P(X = x, Y = y) / P(Y = y), which is known as
the product rule. Through two applications of the product rule, we can derive Bayes' rule:

P(X = x|Y = y) = P(Y = y|X = x) P(X = x) / P(Y = y)
Two random variables X and Y are independent if knowing the value of one of the variables does not
give us any further clues as to the value of the other variable. Thus, for X and Y independent, P (X =
x|Y = y) = P (X = x), or, written another way, P (X = x, Y = y) = P (X = x)P (Y = y).
The expectation of a random variable X with respect to the probability distribution P(X) is defined as
E_P[X] = Σ_x P(X = x) x or E_P[X] = ∫ P(x) x dx, depending on whether the random variable is discrete
or continuous. The expectation is a weighted average of the values that a random variable can take on. Of
course, this only makes sense for random variables which take on numerical values; this would not work
in the example from earlier where the two possible values of the “sex” random variable were “male” and
“female”.
We can also define the conditional expectation, the expectation of a random variable with respect to a
conditional distribution: E_{P(X|Y)}[X] = Σ_x P(X = x|Y = y) x. This is also sometimes written as E_P[X|Y].
Lastly, we are not restricted to taking the expectations of random variables only; we can also take the
expectation of functions of random variables: E_P[f(X)] = Σ_x P(X = x) f(x).
the probability distribution implicit and write the expectation simply as E[X].
Expectation is linear, which means that the expectation of the sum is the sum of the expectations, i.e.,
E[X + Y] = E[X] + E[Y], or, more generally, E[Σ_{i=1}^{N} X_i] = Σ_{i=1}^{N} E[X_i]. For the two-variable case, this can
be proven as follows:

E[X + Y] = Σ_{x,y} P(X = x, Y = y)(x + y)
         = Σ_{x,y} P(X = x, Y = y) x + Σ_{x,y} P(X = x, Y = y) y
         = Σ_x P(X = x) x + Σ_y P(Y = y) y
         = E[X] + E[Y]
3 Concentration Bounds
Markov’s Inequality For a non-negative random variable X ≥ 0 and for any δ > 0,

P(X ≥ δ E[X]) ≤ 1/δ

or equivalently, for any a > 0,

P(X ≥ a) ≤ E[X]/a
Proof: Your homework.
Chebyshev’s Inequality For any random variable with finite variance σ², and for any k > 0,

P(|X − E[X]| ≥ kσ) ≤ 1/k²
Proof: Your homework.
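The proofs are left as homework, but here is a quick empirical sanity check of both inequalities in Python, using an exponential distribution as an arbitrary non-negative example (the distribution and constants are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.exponential(scale=2.0, size=100_000)   # non-negative samples, E[X] = 2

    a = 6.0
    print((X >= a).mean(), X.mean() / a)           # Markov: empirical frequency <= E[X]/a

    k = 2.0
    sigma = X.std()
    print((np.abs(X - X.mean()) >= k * sigma).mean(), 1 / k**2)   # Chebyshev bound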
θ* = argmax_θ P(X₁, . . . , X_N; θ)
This method of estimating θ is called maximum likelihood estimation, and we will call the optimal setting
of the parameters θMLE . It is a constrained optimization problem that we can solve using the tools of vector
calculus, though first we will introduce some more convenient notation. For each k, let c(k) = Σ_{n=1}^{N} I(X_n = k)
be the number of datapoints with value k. Here, I is an indicator function which is 1 when the statement
in the parentheses is true, and 0 when it is false.
Using these counts, we can rewrite the probability of our data as follows:
P(X₁, . . . , X_N | θ) = Π_{n=1}^{N} θ_{x_n}
                      = Π_{k=1}^{K} θ_k^{c(k)}
This switch in notation is very important, and we will do it quite frequently. Here we have grouped our
data according to outcome rather than ordering our datapoints sequentially.
Now we can proceed with the optimization. Our goal is to find argmax_θ Π_{k=1}^{K} P(X = k)^{c(k)} such
that Σ_{k=1}^{K} θ_k = 1. (We need to add this constraint to assure that whichever θ we get describes a valid
probability distribution.) If this were an unconstrained optimization problem, we would solve it by setting
the derivative to 0 and then solving for θ. But since this is a constrained optimization problem, we must
use a Lagrange multiplier.
In general, we might want to solve a constrained optimization problem of the form max_x f(x) such
that g(x) = c. Here, f(x) is called the objective function and g(x) is called the constraint. We form the
Lagrangian condition

∇f(x) + λ∇g(x) = 0

and then solve for both x and λ.
Now we have all the tools required to solve this problem. First, however, we will transform the objective
function a bit to make it easier to work with, using the convenient fact that the logarithm is monotonic
increasing, and thus does not affect the solution.
max_θ Π_{k=1}^{K} P(X = k)^{c(k)} = max_θ log( Π_{k=1}^{K} θ_k^{c(k)} )
                                  = max_θ Σ_{k=1}^{K} log( θ_k^{c(k)} )
                                  = max_θ Σ_{k=1}^{K} c(k) log(θ_k)
We get the gradient of this objective function by, for each θk , taking the partial derivative with respect
to θk :
∂/∂θ_k Σ_{j=1}^{K} c(j) log(θ_j) = c(k)/θ_k
(To get this derivative, observe that all of the terms in the sum are constant with respect to θk except for the
one term containing θk ; taking the derivative of that term gives the result, and the other terms’ derivatives
are 0.)
Thus, we get that
∇f = ∂/∂θ Σ_{j=1}^{K} c(j) log(θ_j) = ( c(1)/θ₁, . . . , c(K)/θ_K )
In a similar vein,
∇g = ∂/∂θ Σ_{j=1}^{K} θ_j = (1, . . . , 1)
Now we substitute these results into the Lagrangian condition ∇f + λ∇g = 0. Solving this equation, we discover that
for each k, c(k)/θ_k = −λ, or θ_k = −c(k)/λ. To solve for λ, we substitute this back into our constraint, and discover
that Σ_{k=1}^{K} θ_k = −(1/λ) Σ_{k=1}^{K} c(k) = 1, and thus −λ = Σ_{k=1}^{K} c(k). This is thus our normalization
constant, giving θ_k = c(k) / Σ_{k'} c(k').
In retrospect, this formula seems completely obvious. The probability of outcome k is the fraction of
times outcome k occurred in our data. The math accords perfectly with our intuitions; why would we ever
want to do anything else? The problem is that this formula overfits our data, like the curve separating the
male datapoints from the female datapoints at the beginning of class. For instance, suppose we never see
outcome k in our data. This formula would have us set θk = 0. But we probably don’t want to assume that
outcome k will never, ever happen. In the next lecture, we will look at how to avoid this issue.
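A minimal Python sketch of maximum likelihood estimation by counting for a categorical variable, with made-up data, showing the zero-count problem described above:

    import numpy as np

    K = 4                                      # number of outcomes
    data = np.array([0, 1, 1, 2, 2, 2, 1, 0])  # observed outcomes; outcome 3 never occurs

    counts = np.bincount(data, minlength=K)    # c(k)
    theta_mle = counts / counts.sum()          # theta_k = c(k) / sum_k' c(k')
    print(theta_mle)                           # theta_3 = 0: MLE says outcome 3 is impossible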
5 Entropy
Entropy is:

H(X) = Σ_x P(x) log (1/P(x))

for a discrete random variable, or

H(X) = ∫ P(x) log (1/P(x)) dx

for a continuous one.
We can think of this as a measure of information content. An example of this idea of information content
is seen in Huffman coding. High frequency letters have short encodings while rarer letters have longer
encodings. This forms a binary tree where the letters are at the leaves and edges to the left are 0 bits and
edges to the right are 1 bits. If the probabilities for the letters are all equal then this tree is balanced.
In the case of entropy we notice that log(1/P(x)) is generally a non-integer, so it is like an expanded Huffman coding.
For a uniform distribution over K outcomes,

H(X) = Σ_{i=1}^{K} (1/K) log K = log K

If K is 2ⁿ then H(X) = log 2ⁿ = n bits (using log base 2). Part of Homework 2 will be to prove that entropy of a
discrete random variable is maximized by a uniform distribution (max_θ H(X) subject to Σ_n θ_n = 1, using the
Lagrange equation). To minimize H(X) we want P(x_i) = 1 for some i (with all other P(x_j) being zero¹), giving
H(X) = Σ_{1≤j≤K, j≠i} 0 log(1/0) + 1 log 1 = 0. We see then that:
0 ≤ H(X) ≤ log K
If we consider some distribution, we can see that if we cut up the “shape” of the distribution and add in
gaps, the gaps that are added do not contribute to P(x) log(1/P(x)).
H(X, Y) = Σ_{x,y} P(x, y) log (1/P(x, y))

H(X|Y) = Σ_{x,y} P(x|y) P(y) log (1/P(x|y))
       = E_{XY} [ log (1/P(x|y)) ]
       = Σ_{x,y} P(x, y) log (1/P(x|y))
¹ What about 0 · log(1/0)? It is standard to define this as equal to zero (justified by the limit being zero).
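A small Python sketch computing H(X), H(X, Y), and H(X|Y) from a made-up joint table (log base 2, so the answers are in bits); the identity H(X|Y) = H(X, Y) − H(Y) used at the end follows from the formulas above:

    import numpy as np

    P_xy = np.array([[0.25, 0.25],
                     [0.40, 0.10]])            # joint P(x, y)
    P_x = P_xy.sum(axis=1)
    P_y = P_xy.sum(axis=0)

    def H(p):
        """Entropy of a distribution given as an array of probabilities."""
        p = p[p > 0]                           # 0 log(1/0) is defined as 0
        return np.sum(p * np.log2(1.0 / p))

    H_X = H(P_x)
    H_XY = H(P_xy.ravel())
    H_X_given_Y = H_XY - H(P_y)                # H(X|Y) = H(X, Y) - H(Y)
    print(H_X, H_XY, H_X_given_Y)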
6 Mutual Information
Mutual information attempts to measure how correlated two variables are with each other:
I(X; Y) = Σ_{x,y} P(x, y) log [ P(x, y) / (P(x) P(y)) ]
Consider communicating the values of two variables. The mutual information of these two variables
is the difference between the entropy of communicating these variables individually and the entropy if
we can send them together. For example, if X and Y are the same then H(X) + H(Y) = 2H(X) while
H(X, Y) = H(X) (since we know Y if we are given X). So I(X; Y) = 2H(X) − H(X) = H(X).
6.1 Covariance
A numerical analogue of mutual information is covariance:

Cov[X, Y] = Σ_{x,y} P(x, y)(x − X̄)(y − Ȳ)

Cov[X, X] = Var[X]
Covariance indicates the high-level trend: if both X and Y are generally increasing, or both generally
decreasing, then the covariance will be positive. If one is generally increasing but the other is generally
decreasing, then the covariance will be negative. Two variables can have a high amount of mutual information
but no general related trend, in which case the covariance will not indicate much (it will probably be around zero).
6.2 KL divergence
Kullback–Leibler (KL) divergence compares two distributions over some variable:
D(P ∥ Q) = Σ_x P(x) log [ P(x) / Q(x) ]
         = E_P [ log (1/Q(x)) − log (1/P(x)) ]
         = H_P(Q) − H(P)

where H_P(Q) is the cross entropy and H(P) is the entropy.
If we have the same distribution then there is no divergence: D(P ∥ P) = 0. In general the KL
divergence is not symmetric: D(P ∥ Q) ≠ D(Q ∥ P). If neither distribution is “special”, the average ½[D(P ∥
Q) + D(Q ∥ P)] is sometimes used and is symmetric. The units of KL divergence are log probability.
The cross entropy has an information interpretation quantifying how many bits are wasted by using the
wrong code:
H_P(Q) = Σ_x P(x) log (1/Q(x))

where the expectation over P corresponds to sending symbols distributed according to P, and log(1/Q(x)) is the
length of the code designed for Q.
A function f is convex if, for any x₁, x₂ and any θ ∈ [0, 1], f(θx₁ + (1 − θ)x₂) ≤ θf(x₁) + (1 − θ)f(x₂). This is
saying that any chord on the function is above the function itself on the same interval.
Some examples of convex functions include a straight line and f(x) = x². If the Hessian exists for a function, then
∇²f ⪰ 0 (the Hessian is positive semidefinite) indicates that f is convex. This test works for a line, but not for
something like f(x) = |x|, which is convex but not differentiable at 0.
Jensen’s inequality states that if f is convex then E[f (X)] ≥ f (E[X]).
Proof.

D(P ∥ Q) = E_P [ log ( P(x)/Q(x) ) ]
         = E_P [ −log ( Q(x)/P(x) ) ]

To apply Jensen’s inequality we will let −log be our convex function and the ratio Q(x)/P(x) be our random
variable (note that this ratio is a number, so we can push the E_P inside).

E_P [ −log ( Q(x)/P(x) ) ] ≥ −log E_P [ Q(x)/P(x) ]
                           = −log Σ_x P(x) · Q(x)/P(x)
                           = −log 1 = 0
Thinking of our information interpretation, we see that we always pay some cost for using the wrong
code. Also note that log [P(x)/Q(x)] is sometimes positive and sometimes negative (P and Q both sum to one), yet
D(P ∥ Q) ≥ 0.
6.4 L1 norm
The L1 norm is defined as:
∥P − Q∥₁ = Σ_x |P(x) − Q(x)|
It can be thought of as “how much earth has to be moved” to make the distributions match.
Because P and Q sum to one, we quickly see that 0 ≤ ∥P − Q∥₁ ≤ 2. This property can be advantageous
when bounds are needed.
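A minimal Python sketch computing the cross entropy, KL divergence, and L1 distance for two made-up distributions, checking that D(P ∥ Q) ≥ 0 and that 0 ≤ ∥P − Q∥₁ ≤ 2:

    import numpy as np

    P = np.array([0.5, 0.3, 0.2])
    Q = np.array([0.2, 0.3, 0.5])

    cross_entropy = np.sum(P * np.log(1.0 / Q))    # H_P(Q)
    entropy = np.sum(P * np.log(1.0 / P))          # H(P)
    kl = cross_entropy - entropy                   # D(P || Q) >= 0
    l1 = np.abs(P - Q).sum()                       # 0 <= ||P - Q||_1 <= 2
    print(kl, np.sum(P * np.log(P / Q)), l1)       # both KL expressions agree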
7.1 Maximum Likelihood Estimation
μ = (1/N) Σ_{n=1}^{N} x^{(n)}

Σ_ij = (1/N) Σ_{n=1}^{N} (x_i^{(n)} − μ_i)(x_j^{(n)} − μ_j)

p(x) = (1/Z) exp( λ(x − μ)² )    (1)
7.3 Marginalization
If x = (x_a, x_b) and x ∼ N(x; μ, Σ) with

μ = (μ_a, μ_b),    Σ = [ Σ_aa  Σ_ab ; Σ_ba  Σ_bb ],

then the marginal distribution of x_a is

x_a ∼ N(x_a; μ_a, Σ_aa)
7.4 Conditioning
With x_a, x_b as above, the conditional distribution P(x_a | x_b) is Gaussian:

x_a | x_b ∼ N( x_a; μ_a + Σ_ab Σ_bb^{−1}(x_b − μ_b),  Σ_aa − Σ_ab Σ_bb^{−1} Σ_ba )
8 Linear Regression
Let our prediction ŷ = wT x.
min_w Σ_n (ŷⁿ − yⁿ)²

min_w Σ_n (wᵀx^{(n)} − yⁿ)²

min_w ∥Xw − y∥²

0 = ∂/∂w ∥Xw − y∥² = 2XᵀXw − 2Xᵀy

w = (XᵀX)^{−1} Xᵀy

with regularization

w = (XᵀX + λI)^{−1} Xᵀy
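A minimal numpy sketch of the closed-form solutions above on synthetic data (the data generation and the value of λ are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 100, 3
    X = rng.normal(size=(N, D))
    true_w = np.array([1.5, -2.0, 0.5])
    y = X @ true_w + 0.1 * rng.normal(size=N)

    w_ols = np.linalg.solve(X.T @ X, X.T @ y)                     # w = (X^T X)^{-1} X^T y
    lam = 0.1
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y) # regularized solution
    print(w_ols, w_ridge)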
9 Smoothing
If we use MLE to train a classifier, all of our probabilities are based on counts; any unseen combination of a
single feature x and the class label y results in

P(x|y) = c(x, y) / c(y) = 0.
These zeros can ruin the entire classifier. For example, say there’s one bill where all the Republicans we
know about voted “no”. Now, say we are trying to classify an unknown politician who followed the Re-
publican line on every other bill, but voted “yes” on this bill. The classifier will say that there is zero
probability of this person being a Republican, since it has never seen the combination (Republican, voted
yes) for that bill. It gives that single feature way too much power. To get rid of that, we can use a technique
called smoothing, and modify the probabilities a little:

P(x = k|y) = ( c(x = k, y) + α ) / ( c(y) + Kα ),    k ∈ {1, ..., K}
Basically we are taking a little bit of the probability mass from things with high probability and giving
it to things with otherwise zero probability. (Republicans might veto this technique, since it’s like redistri-
bution of wealth!) Note that these probabilities must still sum to 1. This seems great - we’ve gotten rid of
things with zero probability. But doesn’t this contradict what we proved earlier? That is, last week we said
that we can best infer the probability distribution by solving
argmax_θ Π_{n=1}^{N} P_θ(xₙ)

s.t. Σ_{k=1}^{K} θ_k = 1
which results in the count-based distribution θ_k* = c(k)/N.
How then can we mathematically justify our smoothed probabilities?
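As a concrete illustration of the smoothed estimate (made-up counts, a hypothetical choice of α), before we justify it:

    import numpy as np

    K = 3
    counts = np.array([7, 3, 0])               # c(x=k, y) for one class y; the third outcome is unseen
    alpha = 1.0

    p_mle = counts / counts.sum()              # c(x=k, y) / c(y): a hard zero for the unseen outcome
    p_smooth = (counts + alpha) / (counts.sum() + K * alpha)
    print(p_mle, p_smooth, p_smooth.sum())     # the smoothed version still sums to 1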
To do so, we treat the parameter vector

θ = [θ₁, θ₂, . . . , θ_K]ᵀ,   with Σ_{k=1}^{K} θ_k = 1,

as a random variable itself.
Suppose that we have a coin with two outcomes, heads or tails (K=2). We can picture the θ1 and θ2
which we could pick for the probability distribution of these two outcomes. A fair coin has θ1 = 1/2 and
θ₂ = 1/2. A weighted coin might have θ₁ = 2/3 and θ₂ = 1/3. Since we are treating θ as a random variable,
its probability P (θ) is describing the probability that it takes on these values. P (θ) is called a prior, since
it’s what we believe about θ before we even have any observations. For example, we might tend to believe
that the coin will be pretty fair, so we could have P (θ) be a normal curve with the peak where θ1 = 1/2 and
θ2 = 1/2.
A convenient prior over θ is the Dirichlet distribution:

P(θ) = [ Γ(Σ_{k=1}^{K} α_k) / Π_{k=1}^{K} Γ(α_k) ] Π_{k=1}^{K} θ_k^{α_k − 1}
     = (1/Z) Π_{k=1}^{K} θ_k^{α_k − 1}
This is also written as P (θ; α), Pα (θ), or P (θ|α). α is a vector with the same size as θ, and it is known as a
“hyperparameter”. The choice of α determines the shape of θ’s distribution, which you can see by varying
it. If α is simply a vector of ones, we just get a uniform distribution; all θs are equally probable. In the case
of two variables, we can have α₁ = 100 and α₂ = 50 and we see a sharp peak around 2/3. The larger α₁, the
more sharply peaked the distribution gets around α₁/(α₁ + α₂).
At this point, we are tactfully ignoring that Γ in the Dirichlet distribution. What is that function, and
what does it do?
This function occurs often in difficult, nasty integrals. However, it has the nice property that for positive
integers n it reproduces the factorial function:

Γ(n) = (n − 1)!
We can prove this using integration by parts:
Γ(x) = ∫₀^∞ e^{−t} t^{x−1} dt
     = [ −t^{x−1} e^{−t} ]₀^∞ + ∫₀^∞ e^{−t} (x − 1) t^{x−2} dt
     = 0 + (x − 1) ∫₀^∞ e^{−t} t^{x−2} dt
     = (x − 1) Γ(x − 1)
Further noting that Γ(1) = 1, we can conclude that Γ(n) = (n−1)!. This function is used in the normalization
constant of our Dirichlet prior in order to guarantee that:
∫_{Σ_k θ_k = 1} P(θ) dθ = 1.
P(x = k | θ) = θ_k

P(x = k) = ∫_{Σ_k θ_k = 1} P(x | θ) P(θ) dθ
         = ∫ θ_k [ Γ(Σ_{k=1}^{K} α_k) / Π_{k=1}^{K} Γ(α_k) ] Π_{k=1}^{K} θ_k^{α_k − 1} dθ
         = [ Γ(Σ_{k=1}^{K} α_k) / Π_{k=1}^{K} Γ(α_k) ] ∫ Π_{k'=1}^{K} θ_{k'}^{α_{k'} − 1 + I(k'=k)} dθ
         = [ Γ(Σ_{k=1}^{K} α_k) / Π_{k=1}^{K} Γ(α_k) ] · [ Π_{k'} Γ(α_{k'} + I(k'=k)) / Γ(Σ_{k'} (α_{k'} + I(k'=k))) ]
         = [ Γ(Σ_{k} α_k) / Γ(Σ_{k} α_k + 1) ] · [ Γ(α_k + 1) / Γ(α_k) ]
         = α_k / Σ_{k'} α_{k'}
Most of the time, all of the α_k’s are set to the same number. So, we just showed that

P(x = k) = α_k / Σ_{k'} α_{k'}
But what about the predictive distribution for the next datapoint?

P(X_{N+1} | X₁^N) = ∫ P(X_{N+1}, θ | X₁^N) dθ
                  = ∫ P(X_{N+1} | θ, X₁^N) P(θ | X₁^N) dθ

One alternative is to use the single most probable value of θ given the data rather than integrating over θ;
this is simpler since it does not require an integral. Using the same Lagrange multipliers technique as we
did before:

argmax_θ (1/Z) Π_k θ_k^{α_k − 1} Π_k θ_k^{c(k)}
s.t. Σ_k θ_k = 1
11 Perceptrons
A Perceptron is a linear classifier that determines a decision boundary through successive changes to the
slope of a line according to a binary feature. The classifier finds values for the weight vector wT to solve
the equation
0 = wT x + b
We define our classification function sign as
sign(x) = −1 if x < 0,   0 if x = 0,   1 if x > 0
We can remove b from the equation by adding it as an element of w and adding a 1 to x in the same spot.
w′ = [w₁, . . . , w_N, b]ᵀ    x′ = [x₁, . . . , x_N, 1]ᵀ

w′ᵀx′ = wᵀx + b
The next question is, how do we pick w? We have xⁿ as the nth data point, and tⁿ as the classifier output
(1 or −1) for the nth data point:

tⁿ = sign(wᵀxⁿ)
To solve this equation, we want to add up all of the positive points to get a vector in that direction. This
gives us the Perceptron algorithm:
repeat
  for n = 1 . . . N do
    if tⁿ ≠ yⁿ then
      w ← w + yⁿxⁿ
    end if
  end for
until ∀n tⁿ = yⁿ or maxiters
While this algorithm will completely separate linearly separable data, it may not be the best separation
(it may not accurately represent the separating axis of the data).
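A runnable Python sketch of the perceptron algorithm above, on synthetic linearly separable data, with the bias folded into w as described earlier:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200
    X = rng.normal(size=(N, 2))
    y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)   # linearly separable labels
    X = np.hstack([X, np.ones((N, 1))])                # append a 1 so b is part of w

    w = np.zeros(3)
    for _ in range(100):                               # maxiters
        updates = 0
        for n in range(N):
            if np.sign(w @ X[n]) != y[n]:              # t_n != y_n
                w = w + y[n] * X[n]                    # w <- w + y_n x_n
                updates += 1
        if updates == 0:                               # all points classified correctly
            break
    print(w)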
We can solve this problem by replacing our original error function,

E = Σ_n I(yⁿ ≠ tⁿ)

Now, the function E w.r.t. w is not convex, so we have to use an iterative method. We use Gradient Descent
to solve for w, which is the successive application of

w ← w − ∂E/∂w
Figure 1: Linear Separability
∂E/∂w = Σ_n ∂Eⁿ/∂w = Σ_n (∂Eⁿ/∂tⁿ)(∂tⁿ/∂w)
However, our sign function is not differentiable at 0, so we replace it with an activation function g. A good
candidate for g is tanh:

tⁿ = g(a) = tanh(a) = (eᵃ − e⁻ᵃ) / (eᵃ + e⁻ᵃ)

where

a = wᵀx
Plugging in, we get

∂E/∂w = Σ_n (tⁿ − yⁿ) (∂t/∂a)(∂a/∂w)    (2)
      = Σ_n (tⁿ − yⁿ) g′(wᵀx) x          (3)
An alternative solution, which begins making small adjustments immediately, is Stochastic Gradient De-
scent. Having defined a learning rate η, the algorithm is only slightly different:
repeat
  for n = 1 . . . N do
    w ← w − η ∂Eⁿ/∂w
  end for
until maxiters
Theorem: Suppose each x^{(n)} is bounded by R, i.e., ∀n, ∥x^{(n)}∥ ≤ R, and that the data is linearly separable
with margin δ, i.e., there is an oracle vector u with ∥u∥ = 1 such that ∀n, yⁿ uᵀx^{(n)} ≥ δ. Then the perceptron
algorithm makes at most R²/δ² updates. (For a vector v, ∥v∥ denotes the Euclidean norm of v, i.e., ∥v∥ = √(Σ_i v_i²).)
Proof: As we keep updating the weights w in the algorithm, a sequence of w(k) are generated.
Let w(1) = 0.
Each time we encounter a misclassification, we use it to update the weights w. Suppose we use data
point (x, y) to update w^{(k)}, which means Equation (4) holds:

w^{(k+1)} = w^{(k)} + yx    (4)

Bear in mind that since we are using (x, y) to update w^{(k)}, w^{(k)} misclassified (x, y), which means Equation (5) holds:

y (w^{(k)})ᵀ x < 0    (5)
Now we can use Equations (4), (5) and w^{(1)} = 0 to prove that k ≤ R²/δ². Hence, convergence holds.
From Equation (6), we get a lower bound on ∥w^{(k+1)}∥, and from Equation (7), we get an upper bound.
w^{(k+1)} = w^{(k)} + yx
⇒ uᵀw^{(k+1)} = uᵀw^{(k)} + y uᵀx        (multiply both sides by the oracle u)
⇒ uᵀw^{(k+1)} ≥ uᵀw^{(k)} + δ            (definition of (u, δ))
⇒ uᵀw^{(k+1)} ≥ kδ                       (induction and w^{(1)} = 0)    (6)
⇒ (uᵀw^{(k+1)})² ≥ k²δ²                  (both sides positive)
⇒ ∥u∥² ∥w^{(k+1)}∥² ≥ k²δ²               (Cauchy–Schwarz: ∥a∥ ∥b∥ ≥ aᵀb)
⇒ ∥w^{(k+1)}∥² ≥ k²δ²                    (∥u∥ = 1)
w^{(k+1)} = w^{(k)} + yx
⇒ ∥w^{(k+1)}∥² = ∥w^{(k)} + yx∥²                        (apply the Euclidean norm to both sides)
⇒ ∥w^{(k+1)}∥² = ∥w^{(k)}∥² + y²∥x∥² + 2y xᵀw^{(k)}     (expansion)    (7)
⇒ ∥w^{(k+1)}∥² ≤ ∥w^{(k)}∥² + y²∥x∥²                    (Equation (5))
⇒ ∥w^{(k+1)}∥² ≤ ∥w^{(k)}∥² + R²                        (y² = 1, ∥x∥ ≤ R)
⇒ ∥w^{(k+1)}∥² ≤ kR²                                    (induction and w^{(1)} = 0)
Combining the results of Equations (6) and (7), we get k²δ² ≤ ∥w^{(k+1)}∥² ≤ kR². Thus, k²δ² ≤ kR² and
k ≤ R²/δ².
Since the number of updates is bounded by R²/δ², the perceptron algorithm will eventually converge to
a point where no updates are needed.
The perceptron can also be viewed as stochastic gradient descent on the objective

argmin_w f(w) ≜ (1/N) Σ_k f_k(w) ≜ (1/N) Σ_k [ −y_k wᵀX^k ]₊    (9)
One way to optimize the convex function in Equation (9) is called Gradient Descent. Essentially, Gradient
Descent keeps updating the weights w, w ← w − α∇w f (w), in which α is called learning rate. The gradient
∇_w f(w) can be computed as ∇_w f(w) = (1/N) Σ_k ∇_w f_k(w), and ∇_w f_k(w) is given by Equation (10):

∇_w f_k(w) = −y_k X^k   if y_k wᵀX^k < 0
           = 0           if y_k wᵀX^k ≥ 0    (10)
Because the gradient ∇_w f(w) is a summation of local gradients ∇_w f_k(w), we can also do Stochastic
Gradient Descent by using one data instance at a time.
1. Randomly pick a data instance, (X k , yk )
2. Compute local gradient on it, ∇w fk (w) as Equation (10).
3. Update weights using the local gradient, w ← w − α∇w fk (w). This is exactly the update in perceptron
algorithm.
where

z_j = g( Σ_{i=0} w_ij x_i )

are known as the hidden units, w_ij are the weights, x_i are the input variables, and σ and g are activation
functions. Non-linear functions such as tanh and the sigmoid are usually chosen as activation functions.
The response function tⁿ_i takes the form of (11), but to keep notation uncluttered we will omit the input
parameters. To minimize the error function of a no-hidden-layer network, we apply the chain rule for the
partial derivatives with respect to (w.r.t.) the weights w_ij to get

∂Eⁿ/∂w_ij = Σ_{j'} (∂Eⁿ/∂tⁿ_{j'}) (∂tⁿ_{j'}/∂w_ij)
          = (tⁿ_j − yⁿ_j) (∂tⁿ_j/∂a_j)(∂a_j/∂w_ij)
          = (tⁿ_j − yⁿ_j) g′(a_j) x_i
          = δ_j x_i    (13)
where

a_j = Σ_i w_ij x_i

is the sum of the weighted inputs. The derivative with respect to the matrix W is

∂Eⁿ/∂W = [ ∂Eⁿ/∂w_ij ]_ij = [ δ_j x_i ]_ij = x δᵀ.
We now focus on training a neural network with a hidden layer (multi-layer) that takes the same
form as (12), and minimize the error w.r.t. the first-layer weights w_ij:

∂Eⁿ/∂w_ij = Σ_k (∂Eⁿ/∂tⁿ_k)(∂tⁿ_k/∂w_ij) = Σ_k (∂Eⁿ/∂tⁿ_k)(∂tⁿ_k/∂a_k)(∂a_k/∂w_ij)
          = Σ_k (t_k − y_k) g′(a_k) ∂/∂w_ij [ Σ_{j'} w_{j'k} g(a_{j'}) ]
          = Σ_k (t_k − y_k) g′(a_k) w_jk ∂/∂w_ij g(a_j)
          = Σ_k (t_k − y_k) g′(a_k) w_jk g′(a_j) x_i
          = [ Σ_k (t_k − y_k) g′(a_k) w_jk g′(a_j) ] x_i
          = δ_j x_i

where the bracketed sum defines δ_j.
The derivatives with respect to the first and second layer weights are given by

∂Eⁿ/∂w_ij = δ_j x_i   and   ∂Eⁿ/∂w_jk = δ_k z_j.
and yⁿ ∈ {−1, 1}. Initially we assume that the two classes are linearly separable. The hyperplane separating
the two classes can be represented as

wᵀx + b = 0,

such that

wᵀxⁿ + b ≥ 1 for yⁿ = +1,
wᵀxⁿ + b ≤ −1 for yⁿ = −1.
Figure 2: A linear SVM classifier for two linearly separable classes. The hyperplane wᵀx + b = 0 is the solid
line between H1 and H2, and the margin is M.
Let H1 and H2 be the two hyperplanes (Figure 2) separating the classes such that there is no other data point
between them. Our goal is to maximize the margin M between the two classes. The objective function:
max_w M
s.t. yⁿ(wᵀxⁿ + b) ≥ M,
     wᵀw = 1.
The margin M is equal to 2/∥w∥. We can rewrite the objective function as:

min_w (1/2) wᵀw
s.t. yⁿ(wᵀxⁿ + b) ≥ 1
Now, let’s consider the case when the two classes are not linearly separable. We introduce slack variables
{ξ_n}_{n=1}^{N} and allow a few points to be on the wrong side of the hyperplane at some cost. The modified
objective function:

min_w (1/2) wᵀw + C Σ_{n=1}^{N} ξ_n
s.t. yⁿ(wᵀxⁿ + b) + ξ_n ≥ 1,
     ξ_n ≥ 0, ∀n.
The parameter C can be tuned using a development set. This is the primal optimization problem for the SVM.
Differentiating the Lagrangian with respect to the variables:

∂L(w, b, ξ, α, μ)/∂w = w − Σ_n α_n y_n x_n = 0

∂L(w, b, ξ, α, μ)/∂b = −Σ_n α_n y_n = 0

∂L(w, b, ξ, α, μ)/∂ξ_n = C − α_n − μ_n = 0

α_n = C − μ_n    (15)
We now plug in these values to get the dual function, cancelling out some terms:

g(α, μ) = (1/2) Σ_n Σ_m α_n α_m y_n y_m x_nᵀx_m − Σ_n Σ_m α_n α_m y_n y_m x_nᵀx_m + Σ_n α_n
        = Σ_n α_n − (1/2) Σ_n Σ_m α_n α_m y_n y_m x_nᵀx_m    (16)
Using equations (15) and (16) and the KKT conditions, we obtain the dual optimization problem:

max_α Σ_n α_n − (1/2) Σ_n Σ_m α_n α_m y_n y_m x_nᵀx_m
s.t. 0 ≤ α_n ≤ C.
The dual optimization problem is concave and easy to solve. The dual variables (αn ) lie within a box with
side C. We usually vary two values αi and αj at a time and numerically optimize the dual function. Finally,
we plug in the values of the αn∗ ’s to the equations (14) to obtain the primal solution w∗ .
min_x f₀(x)
s.t. f_i(x) ≤ 0, for i ∈ {1, 2, . . . , K},
where f0 and fi (i ∈ {1, 2, . . . , K}) are convex functions. We call this optimization problem the ‘primal’
problem.
The Lagrangian is L(x, λ) = f₀(x) + Σ_i λ_i f_i(x), and the Lagrange dual function is

g(λ) = min_x L(x, λ)

The dual function g(λ) is concave and hence easy to maximize. We can obtain the minimum of a convex primal
optimization problem by maximizing the dual function g(λ). The dual optimization problem:

max_λ g(λ)
s.t. λ_i ≥ 0, for i ∈ {1, 2, . . . , K}.
The Karush–Kuhn–Tucker (KKT) conditions at the optimum (x*, λ*) are:

f_i(x*) ≤ 0
λ_i* ≥ 0, ∀i ∈ {1, . . . , K}
∂L(x*, λ₁*, . . . , λ_K*)/∂x = 0
λ_i* f_i(x*) = 0
14 Kernel Functions
14.1 Review Support Vector Machines
Goal: to solve the optimization problem

min_w (1/2)∥w∥² + C Σ_n ξ_n
s.t. yⁿ(wᵀxⁿ + b) + ξ_n ≥ 1
     ξ_n ≥ 0
where

xⁿ = [x₁, x₂, ..., x_K]ᵀ, n ∈ {1, ..., N}

This is a K-dimensional problem, which means the more features the data has, the more complicated this
problem is to solve.
Meanwhile, this problem is equivalent to its dual:

max_α −(1/2) Σ_n Σ_m α_n α_m yⁿ yᵐ xⁿᵀxᵐ + Σ_n α_n
s.t. α_n ≥ 0
     α_n ≤ C

This is an N-dimensional problem, which means the more data points we include in the training set, the
more time it takes to find the optimal classifier. Training the classifier amounts to solving this problem
inside the box constraints on α.
According to the KKT conditions,

λ_i f_i(x) = 0
λ_i ≥ 0
f_i(x) ≤ 0

and

w = Σ_n α_n yⁿ xⁿ
Points on the correct side but not on the margin contribute nothing because α = 0 (the green point in the
figure). For points on the wrong side (the red point), α = C and ξ_n > 0, so they, along with points on the
margin, contribute to the weight vector, but no point is allowed to contribute more than C.
An SVM usually trains a better classifier than naive Bayes, but since it is still a linear binary classifier, it
is not able to deal with a situation like the one below:
We can now introduce the kernel function K; the simplest one is:

K(xⁿ, x) = xⁿᵀx

Now the problem is transformed into:

max_α −(1/2) Σ_n Σ_m α_n α_m yⁿ yᵐ K(xⁿ, xᵐ) + Σ_n α_n

where

K(x, y) = φ(x)ᵀφ(y)

for some φ.
The most commonly seen kernel functions are:

K(x, y) = xᵀy
K(x, y) = (xᵀy)^m
K(x, y) = e^{−c∥x−y∥²}

Generally, a kernel function is a measure of how similar x and y are; when they are the same, it has its peak
output. Basically, φ transforms x from the original space to a higher-dimensional feature space, as shown below:
Because we have

eˣ = 1 + x + x²/2 + . . . ,

the exponential (RBF) kernel transforms the features into an infinite-dimensional space. Generally, kernel
functions increase the dimensionality of w (which was K-dimensional in the original feature space), so
solving the N-dimensional dual is more practical.
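A small sketch checking, for the quadratic kernel K(x, y) = (xᵀy)², that an explicit feature map φ(x) consisting of all pairwise products x_i x_j gives exactly φ(x)ᵀφ(y) = K(x, y); this particular φ is a standard construction stated here as an assumption:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)
    y = rng.normal(size=3)

    def phi(v):
        """Explicit feature map for the quadratic kernel: all pairwise products v_i v_j."""
        return np.outer(v, v).ravel()

    K_kernel = (x @ y) ** 2            # K(x, y) = (x^T y)^2, computed in 3 dimensions
    K_explicit = phi(x) @ phi(y)       # same value, computed in 9 dimensions
    print(K_kernel, K_explicit)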
For the perceptron, the relationship between its error term and w is as drawn below (not convex), so we can't
14.4 Regression
When we are predicting a continuous output we have a space like this: the line is the prediction line, and the
points are the data.

ŷ = wᵀx

w = (XᵀX)^{−1} Xᵀy
15 Graphical Models
A directed graphical model, also known as a Bayes net or a belief net, is a joint distribution over several
variables specified in terms of a conditional distribution for each variable:

P(X₁, X₂, . . . , X_N) = Π_i P(X_i | Parents(X_i))
We draw a Bayes net as a graph with a node for each variable, and edges to each node from its parents. This
graph expresses the independence relations implicit in the choice of parents for each node. The parent-child
edges must form an acyclic graph.
An undirected graphical model is also a distribution specified in terms of a set of functions of specific
variables:

P(X₁, X₂, . . . , X_N) = (1/Z) Π_m f_m(X_m)

where each X_m is a subset of {X₁, X₂, . . . , X_N}. There are no normalization constraints on the individual
factors f_m, and the normalization constant Z ensures that the entire joint distribution sums to one:

Z = Σ_{X₁,...,X_N} Π_m f_m(X_m)
Suppose that we wish to find the marginal probability of a variable X_i in a directed graphical model:

P(X_i) = Σ_{X₁,...,X_{i−1},X_{i+1},...,X_N} Π_j P(X_j | Parents(X_j))

For binary variables, there are 2^{N−1} terms in this sum. Our goal in this section is to compute this probability
more efficiently by using the structure of the network, thus taking advantage of the independence assumptions
of the network. The techniques apply to both directed and undirected graphical models. They also ap-
ply to the problem of computing conditional probabilities where some variables are known, and we must
marginalize over the others.
15.1 Example
To compute P(X₇ | X₂), we have

P(x₇ | x₂) = (1/Z) Σ_{X₃} Σ_{X₄} Σ_{X₅} Σ_{X₆} P(X₃|x₂) P(X₄|X₃) P(X₅|X₄) P(X₆|X₅) P(x₇|X₆)
Suppose every variable X_i is binary; then the summation has 2⁴ = 16 terms. On the other hand, we can use
the same trick as in dynamic programming by recording the probabilities we have computed for reuse. For
example, in the above example, if we define

f₅(x₅) = Σ_{X₆} P(X₆|X₅ = x₅) P(x₇|X₆)    (17)

f₄(x₄) = Σ_{X₅} P(X₅|X₄ = x₄) f₅(X₅)    (18)

f₃(x₃) = Σ_{X₄} P(X₄|X₃ = x₃) f₄(X₄)    (19)

f₂(x₂) = Σ_{X₃} P(X₃|X₂ = x₂) f₃(X₃)    (20)
then

P(X₇ = x₇ | X₂ = x₂) = (1/Z) Σ_{X₃} Σ_{X₄} Σ_{X₅} Σ_{X₆} P(X₃|X₂ = x₂) P(X₄|X₃) P(X₅|X₄) P(X₆|X₅) P(X₇ = x₇|X₆)    (21)
                     = (1/Z) Σ_{X₃} Σ_{X₄} Σ_{X₅} P(X₃|x₂) P(X₄|X₃) P(X₅|X₄) f₅(X₅)    (22)
                     = (1/Z) Σ_{X₃} Σ_{X₄} P(X₃|x₂) P(X₄|X₃) f₄(X₄)    (23)
                     = (1/Z) Σ_{X₃} P(X₃|x₂) f₃(X₃)    (24)
                     = (1/Z) f₂(x₂)    (25)
There are 4 sums and each sum needs to compute 2x2 probabilities, so a total of 16 steps.
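A Python sketch of this computation for binary variables with made-up conditional probability tables, comparing the brute-force sum with the f₅, . . . , f₂ recursion of equations (17)-(20):

    import numpy as np

    rng = np.random.default_rng(0)

    def random_cpt():
        """Random table P(child | parent): rows indexed by parent, columns by child."""
        t = rng.uniform(size=(2, 2))
        return t / t.sum(axis=1, keepdims=True)

    # P(X3|X2), P(X4|X3), P(X5|X4), P(X6|X5), P(X7|X6) for the chain in the example
    P32, P43, P54, P65, P76 = [random_cpt() for _ in range(5)]

    x2, x7 = 0, 1

    # Brute force: sum over X3, X4, X5, X6 (16 terms)
    brute = sum(P32[x2, x3] * P43[x3, x4] * P54[x4, x5] * P65[x5, x6] * P76[x6, x7]
                for x3 in range(2) for x4 in range(2)
                for x5 in range(2) for x6 in range(2))

    # Dynamic programming following equations (17)-(20)
    f5 = np.array([sum(P65[x5, x6] * P76[x6, x7] for x6 in range(2)) for x5 in range(2)])
    f4 = np.array([sum(P54[x4, x5] * f5[x5] for x5 in range(2)) for x4 in range(2)])
    f3 = np.array([sum(P43[x3, x4] * f4[x4] for x4 in range(2)) for x3 in range(2)])
    f2 = sum(P32[x2, x3] * f3[x3] for x3 in range(2))

    print(brute, f2)    # both give the same value for P(X7 = x7 | X2 = x2)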
Note that in the figures above, the factor graphs illustrate that the shadowed variable nodes block the in-
formation flow from one variable node to another, except in the last example. In the last example, the two parent
nodes are independent, although this cannot be seen from the graph structure. However, the blockage
can be read from the table of the factor node in the center. Also note that the last two graphs have the same
undirected shape, but their factor graphs are different.
where M(n) is the set of factors touching X_n. This table contains the information propagated from vari-
able n to its neighboring factor vertex f_m. For each factor vertex f_m and its neighboring variable n, the
information propagated from f_m to n is

r_{m→n}(X_n) = Σ_{X⃗_m \ X_n} f_m(X⃗_m) Π_{n' ∈ N(m)\{n}} q_{n'→m}(X_{n'})

where N(m) is the set of variables touching f_m, and the sum Σ_{X⃗_m \ X_n} is over all variables connected to f_m
except X_n. This table contains the information propagated from factor f_m to its neighboring variable n. Note
that if variable vertex n is a leaf, q_{n→m} = 1, and if factor vertex m is a leaf, r_{m→n} = f_m(X_n).
The procedure of message passing or belief propagation is first to propagate the information from leaf
vertices to the center (i.e., from leaves to internal nodes) by filling in the tables for each message. Once all
the messages into variable X_n have been computed, the marginal probability of X_n is computed by combining
the incoming messages:

P(X_n) = (1/Z) Π_{m∈M(n)} r_{m→n}(X_n)

To compute marginal probabilities for all variables, the information is then propagated from the center back to
the leaves.
16 Junction Tree
From last class, we know that
q_{n→m}(x_n) = Π_{m' ∈ M(n)\{m}} r_{m'→n}(x_n)

r_{m→n}(x_n) = Σ_{x⃗_m \ x_n} f_m(x⃗_m) Π_{n' ∈ N(m)\{n}} q_{n'→m}(x_{n'})
Figure 3: An example of max-product
q_{n→m}(x_n) is the information propagated from variable node n to factor node f_m; r_{m→n}(x_n) is the
information propagated from factor node f_m to variable node n. Our goal is to compute the marginal
probability of each variable x_n:

P(x_n) = (1/Z) Π_{m∈M(n)} r_{m→n}(x_n).
The joint distribution of two variables can be found by, for each joint assignment to both variables,
performing message passing to marginalize out all other variables, and then renormalizing the result:

P(x_i, x_j) = (1/Z_{i,j}) Π_{m∈M(i)} r_{m→i}(x_i) Π_{m∈M(j)} r_{m→j}(x_j)
In the original problem, the marginal probability of variable x_n is obtained by summing the joint distri-
bution over all the variables except x_n:

P(x_n) = (1/Z) Σ_{x₁} · · · Σ_{x_{n−1}} Σ_{x_{n+1}} · · · Σ_{x_N} Π_m f_m(x⃗_m).

And by pushing summations inside the products, we obtain the efficient algorithm above.
16.1 Max-Sum
In practice, sometimes we wish to find the set of variables that maximizes the joint distribution
P(x₁, . . . , x_N) = (1/Z) Π_m f_m(x⃗_m). Removing the constant factor, this can be expressed as

max_{x₁,...,x_N} Π_m f_m(x⃗_m) = max_{x₁} . . . max_{x_N} Π_m f_m(x⃗_m)
Figure 3 shows an example, in which the shadowed variables xj , xk , and xl block the outside informa-
tion flow. So to compute P (xi |xj , xk , xl ), we can forget everything outside them, and just find assignments
for inside variables:
max_{inside variables} Π_m f_m(x⃗_m).
Figure 4: An example of tree decomposition
Like the sum-product algorithm, we can also make use of the distributive law for multiplication and
push the max operations inside the products to obtain an efficient algorithm. We can put a max wherever we see
a Σ in the sum-product algorithm, which gives what is at this point actually the max-product (Viterbi) algorithm.
fm ( −
x→
Y
rm→n (xn ) = −max
→ m) qn0 →m (xn0 )
xm \xn
n0 ∈N (m)\{n}
Since products of many small probabilities may lead to numerical underflow, we take the logarithm
of the joint distribution, replacing the products in the max-product algorithm with sums, so we obtain the
max-sum algorithm.
max Π_m f_m −→ max log Π_m f_m −→ max Σ_m log f_m
Consider the related constraint satisfaction problem: find x₁, . . . , x_N such that ⋀_m f_m(x⃗_m) holds.
We need to find some assignment that makes it 1, which can be seen as a reduction from the 3-SAT problem
(constraint satisfaction). So the problem is NP-complete in general.
To solve the problem, we force the graph to look like a tree, which is tree decomposition. Figure 4 shows
an example.
Given a Factor Graph, we first need to make a new graph (Dependency Graph) by replacing each factor
with a clique, shown in Figure 5. Then we apply the tree decomposition.
Tree decomposition can be explained as follows: given a graph G = (V, E), we want to find ({X_i}, T), with X_i ⊆ V
and T a tree over {X_i}. It should satisfy 3 conditions:
Figure 5: Dependency graph for tree decomposition (vertex for each variables)
Figure 6: The procedure of tree decomposition on a directed graphical model (we can directly get Dependency
Graph by moralization)
1. ⋃_i X_i = V, which means the new graph should cover all the vertices;
2. For (u, v) ∈ E, ∃Xi such that u, v ∈ Xi ;
3. If j is on the path from i to k in T , then (Xi ∩ Xk ) ⊆ Xj (running intersection property).
Using this method, we can get the new graph in Figure 5 with X₁ = {A, B}, X₂ = {B, C, D}, and
X₃ = {D, E}. The complexity of the original problem is O((N + M)(k + l)2^{l−1}), with l = max_m |N(m)|. With
tree decomposition, we instead obtain l = max_i |X_i|. Figure 6 shows the procedure for tree decomposition on a
directed graphical model.
A new concept is the treewidth of a graph: the minimum, over all tree decompositions, of max_i |X_i| − 1.
For example, treewidth(tree) = 1, treewidth(cycle) = 2, and in the worst case, treewidth(K_n) = n − 1 (K_n
is a complete graph with n vertices). If the treewidth of the original graph is high, the tree decomposition
becomes impractical.
Actually, finding the best tree decomposition is N P-complete. One practical way is Vertex Elimination:
1. choose a vertex v (heuristically, choose the v with fewest neighbors);
2. create Xi for v and its neighbors;
3. remove v;
4. connect v’s neighbors;
5. repeat the first four steps until no new vertex.
Vertex Elimination is not guaranteed to find the optimal solution. Figure 7 shows an example of this
method on a single cycle.
Figure 7: An example of Vertex Elimination on a single cycle
17 Expectation Maximization
In the last lecture, we introduced tree decomposition. So far, we have covered a lot regarding how to do
inference in a graphical model. In this lecture, we will move back to the learning part: we will consider
how to set the parameters for the variables.
Our example is the model for probabilistic Latent Semantic Analysis (pLSA); the hidden variable’s value is called
an aspect. pLSA has been widely used in information retrieval. In this example, let x₁ and x₂ respectively denote
the document ID and word ID. Then, there is a sequence of pairs (x₁, x₂), e.g., (1, “the”), (1, “green”), ..., (1000, “the”). In this
context, we may have various tasks, e.g., to find the words which co-occur, to find the words on the same
topic, or to find the documents containing the same words.
Now, we will introduce the term cluster. The reasons why we need the cluster representation are as
follows: (1) There are a large number (e.g., 10,000) of documents, each of which is formed of a large
number (e.g., 10,000) of words. Without a cluster representation, we have to handle a huge query table with
too many (say, 10,000²) entries. That makes any query difficult. (2) If we still set the parameters just by
counting how often each variable occurs, then there is a risk of over-fitting the individuals (i.e., the pairs
of (document, word)). Because of these issues, we need to do something smart: clustering. Recall the graphical
model displayed above; the hidden (latent) variable z is just the cluster ID.
Note that x₁ ⊥ x₂ | z. Therefore, the joint distribution is p(x₁, x₂) = Σ_z p(z) p(x₁|z) p(x₂|z). As shown in
the figure below, we will not directly compute each entry to obtain the 10,000 × 10,000 query table
P(x₁, x₂). Instead, we maintain low-rank matrices P(z), P(x₁|z) and P(x₂|z).
Now, we have a set of observed variables X = {(x₁¹, x₂¹), (x₁², x₂²), ...}, a set of hidden variables Z =
{z₁, z₂, ...} and a set of parameters θ = {θ_z, θ_{x₁|z}, θ_{x₂|z}}. Note that the x₁ⁿ are i.i.d. variables, and the same for
the x₂ⁿ. To choose θ, we maximize the likelihood (MLE): max_θ P(X; θ).
θ = argmax_θ Π_n P_θ(x₁ⁿ, x₂ⁿ) = argmax_θ Π_n Σ_z p(z) p(x₁ⁿ|z) p(x₂ⁿ|z) = argmax_θ Σ_n log Σ_z p(z) p(x₁ⁿ|z) p(x₂ⁿ|z)    (26)
If there is no hidden variable z, we will just count, instead of summing over z. However now, we need
to sum over z and find the maximum of the above objective function, which is not a closed-form expression.
Thus, it is not feasible to directly set the derivative to zero. To solve this tough optimization problem, we
will introduce the Expectation-Maximization (EM) algorithm.
E-step:
Guess z
M-step:
MLE to fit θ to X, Z
REPEAT
  E-step:
    for n = 1 . . . N
      for z = 1 . . . K
        p(z, n) = θ_z(z) · θ_{x₁|z}(x₁ⁿ|z) · θ_{x₂|z}(x₂ⁿ|z)
        sum += p(z, n)
      for z
        p(z, n) ← p(z, n) / sum
    An alternative: accumulate expected counts as we go:
        ec(z) += p(z, n)
        ec(z, x₁ⁿ) += p(z, n)
        ec(z, x₂ⁿ) += p(z, n)
  M-step:
    for z
      ec(z) = Σ_{n=1}^{N} p(z, n)
      θ_z ← ec(z) / N
    for z, x₁
      ec(z, x₁) = Σ_{n=1}^{N} I(x₁ⁿ = x₁) p(z, n)
      θ_{x₁|z} = ec(z, x₁) / ec(z)
    for z, x₂
      ec(z, x₂) = Σ_{n=1}^{N} I(x₂ⁿ = x₂) p(z, n)
      θ_{x₂|z} = ec(z, x₂) / ec(z)
UNTIL convergence
where sum is for normalization and ec(·) denotes the expected count, which is not a real count but an
average over what we think z is. Namely, this count is probabilistic. The intuition is to assign some credit to
each possible value. Also note that I(·) is an indicator function (returning 1/0 if the condition is true/false). In
the following, we will derive how to approximate the maximum of the likelihood by maximizing the joint
probability’s log likelihood iteratively through E-M steps. For the example presented above, let us go
further using the same formulation as Eqn. (26).
Therefore, θ_z = (1/sum₀) ec(z), θ_{x₁|z} = (1/sum₁) ec(x₁, z), and θ_{x₂|z} = (1/sum₂) ec(x₂, z), where the sums
make sure that normalization is done (each distribution sums to 1). Notably, c(·) denotes a count and ec(·) denotes
an expected count. Also note that E[c(z = k)] = ec(z), which we mentioned in the EM algorithm flow, and
similarly for ec(x₁, z) and ec(x₂, z).
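A compact numpy rendering of the E and M steps above for the pLSA model; the data, the sizes, and the initialization are assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    N, N_docs, N_words, K = 200, 5, 6, 2
    docs = rng.integers(0, N_docs, size=N)      # made-up document ids x1^n
    words = rng.integers(0, N_words, size=N)    # made-up word ids x2^n

    theta_z = np.full(K, 1.0 / K)
    theta_x1 = rng.dirichlet(np.ones(N_docs), size=K)    # P(x1 | z), shape (K, N_docs)
    theta_x2 = rng.dirichlet(np.ones(N_words), size=K)   # P(x2 | z), shape (K, N_words)

    for it in range(50):
        # E-step: p(z, n) proportional to theta_z(z) theta_{x1|z}(x1^n|z) theta_{x2|z}(x2^n|z)
        p = theta_z[None, :] * theta_x1[:, docs].T * theta_x2[:, words].T   # shape (N, K)
        p /= p.sum(axis=1, keepdims=True)
        # M-step: expected counts ec(.) and renormalized parameters
        ec_z = p.sum(axis=0)
        theta_z = ec_z / N
        for z in range(K):
            theta_x1[z] = np.bincount(docs, weights=p[:, z], minlength=N_docs) / ec_z[z]
            theta_x2[z] = np.bincount(words, weights=p[:, z], minlength=N_words) / ec_z[z]
    print(theta_z)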
EM works with the following auxiliary objective function Q:

Q(θ; θ^old) = E_{p(Z|X, θ^old)} [ log p(X, Z) ]
            = E_{p(Z|X, θ^old)} [ log ( p(Z|X) · p(X) ) ]
            = E_{p(Z|X, θ^old)} [ log p(Z|X) ] + log p(X)
            = E [ log ( p(Z|X) / p_{θ^old}(Z|X) · p_{θ^old}(Z|X) ) ] + log p(X)    (28)
              (rewritten to make it look like a KL divergence)
            = −E [ log ( p_{θ^old}(Z|X) / p(Z|X) ) ] − E [ −log p_{θ^old}(Z|X) ] + log p(X)
            = −D( Z|X, θ^old ∥ Z|X, θ ) − H( Z|X, θ^old ) + L(θ)

where D, H, L are respectively the KL divergence, the entropy, and the log likelihood. Note that our objective
is to maximize the likelihood log p(X). It does not have Z inside, so it can be pulled out of E(·). Now, we write
down L(θ) with simplified notation:

L(θ) = Q(θ; θ^old) + H(θ^old) + D(θ^old ∥ θ)
where Q, H and D are all dynamic functions (they change from iteration to iteration). The approximation can be
illustrated in the space of the parameter θ, as shown schematically in the figure below. Here the red curve depicts
L(θ) (incomplete data) which we wish to maximize. We start with some initial parameter value θ^old, and in the
first E step we evaluate Q(θ; θ^old) + H(θ^old), as shown by the blue curve. Since the KL divergence D(θ^old ∥ θ) is
always non-negative, the blue curve gives a lower bound to the red curve L(θ), and D(θ^old ∥ θ) gives the
gap between the two curves. Note that the bound makes tangent contact with L(θ) at θ^old, so that both
curves have the same gradient and D(θ^old ∥ θ^old) = 0. Thus, L(θ^old) = Q(θ^old; θ^old) + H(θ^old). Besides, the
bound is a concave function with a unique maximum at θ^new = argmax_θ Q(θ; θ^old). In the M step, the
bound is maximized, giving the value θ^new, which also gives a larger value of L(θ) than θ^old: L(θ^new) =
Q(θ^new; θ^old) + H(θ^old) + D(θ^old ∥ θ^new). In practice, during the first iterations, this point is usually
still far away from the maximum of L(θ). However, if we run one more iteration, the result will get better.
The subsequent E step then constructs a bound that is tangential at θ^new, as shown by the green curve.
Iteratively, the maximum is reached in the end, in a manner somewhat similar to gradient ascent. In
short, there are two properties of the EM algorithm: (1) the performance gets better step by step, and (2) it will
converge. Lastly, it should also be emphasized that EM is not guaranteed to find the global maximum,
for there will generally be multiple local maxima. Being sensitive to the initialization, EM may not find the
largest of these maxima.
In this lecture, we quickly go through the details of the EM algorithm, which maximizes L through
maximizing Q at each iteration step. Then, the remaining problems are how to compute Q, and exactly
how to compute p(Z|X, θold ).
17.4 Gradient Ascent (∂L/∂θ)
Here ∂L/∂θ is the gradient of the likelihood function w.r.t. the parameters. Using gradient ascent, our new
parameter θ^new is given by:

θ^new = θ^old + η ∂L/∂θ
EM (expectation maximization) generally finds a local optimum of θ more quickly, because in this
method we optimize the q-step before making a jump. It is similar to saying that we make a big jump
first and then take smaller steps accordingly. But for Gradient Ascent method, the step size varies with the
likelihood gradient, and so it requires more steps when the gradient is not that steep.
We are required to find E_{P(Z|X,θ)}[Z|X]. Our parameter vector is

θ = ( P(Z), P(X₁|Z), P(X₂|Z) )

which is a long list of all the possible probabilities, after unfolding each of them.
We should make sure that the parameters (θ = [θ₁, θ₂] in the two-parameter case) always stay on the
simplex (the straight line shown below); they should never go off it.
One way to enforce this is to reparameterize with a softmax:

P(Z = k) = θ_k = e^{λ_k} / Σ_{k'} e^{λ_{k'}}

In this case, gradient ascent is used to find the new value of λ:

λ^new = λ^old + η ∂L/∂λ
1. Newton’s Method takes a lot of time, because we need to calculate the Hessian Matrix, which is the
2nd derivative.
2. Since there is no KL divergence in Newton’s Method, there is always a chance that we take a big jump,
and arrive at a point far away from the global optimum and in fact worse than where we started.
where μ and Σ are the mean vector and covariance matrix respectively:

N(X | μ, Σ) = exp( −0.5 (X − μ)ᵀ Σ^{−1} (X − μ) ) / ( (2π)^{D/2} |Σ|^{0.5} )    (30)

Here X and μ are vectors, Σ is a matrix, and D is the dimensionality of the data X. The left side of
equation (30) refers to

N( [X₁, X₂, . . . , X_N]ᵀ | μ, Σ )

For 1-D data,

N(X | μ, σ) = exp( −0.5 (X − μ)²/σ² ) / ( (2π)^{0.5} σ )
Equation (30) is similar to writing f(X) = XᵀAX, wherein we are stretching the vector space.
With continuous variables, the message-passing update changes to

r_{m→n} = ∫ f(X⃗_m) Π_{n'∈N(m)\{n}} q_{n'→m} d(X⃗_m \ X_n)
P(Z | Xⁿ) = P(Z) P(Xⁿ | Z) / P(Xⁿ)
          = (λ_k / z_k) exp[ −½ (Xⁿ − μ_k)ᵀ Σ^{−1} (Xⁿ − μ_k) ] / Σ_{k'} (λ_{k'} / z_{k'}) exp[ −½ (Xⁿ − μ_{k'})ᵀ Σ^{−1} (Xⁿ − μ_{k'}) ]
M-step
for k = 1, 2, . . . , K (the number of values of the hidden variable):

λ_k = ( Σ_{n=1}^{N} P(Z = k | Xⁿ) ) / N

μ_k = ( Σ_n P(Z = k | Xⁿ) Xⁿ ) / ( Σ_n P(Z = k | Xⁿ) )
18 Sampling
We have already studied how to calculate the probability of a variable or variables using the message
passing method. However, there are times when the structure of the graph is too complicated for this. The
relation between diseases and symptoms is a good example, where the variables are all mixed together,
giving the graph a high treewidth. Another case is that of continuous variables, where during message passing,

r_{m→n} = ∫ f(x⃗_m) Π_{n'} q_{n'→m}(x_{n'}) d(x⃗_m \ x_n).

If this integral cannot be calculated, what can we do to evaluate the probability of the variables? This is
what sampling is used for.
For example, if we want to sample from a variable with a standard normal distribution, the points we
pick are calculated from

X = erf^{−1}(x),

where x is drawn from a uniform distribution, and

erf(x) = ∫₀ˣ N(t; 0, 1) dt.
We could play the same trick for many other distributions. However, there are some distributions which
do not have a closed-form integral to calculate their CDF, which makes the above method fail. Under such
conditions, we could turn to a framework called Markov chain Monte Carlo (MCMC).
A Markov chain generates a sequence of states, each depending only on the previous one: x_t ∼ P(x_t | x_{t−1}).
Our goal is to find a Markov chain whose distribution matches a given distribution which we
want to sample from, so that by running the Markov chain, we get results as if we were sampling from
the original distribution. In other words, we want the Markov chain to eventually be able to 1)
explore the entire space of the original distribution, and 2) reflect the original PDF.
The general algorithm for generating the samples is called the Metropolis-Hastings algorithm. The
algorithm draws a candidate

x′ ∼ Q(x′; x_t),

and then accepts it with probability

min( 1, [P(x′) Q(x_t; x′)] / [P(x_t) Q(x′; x_t)] ).

The key here is the function Q, called the proposal distribution, which is used to reduce the complexity of the
original distribution. Therefore, we have to select a Q that is easy to sample from, for instance, a Gaussian.
Note that there is a trade-off in choosing the variance of the Gaussian, which determines the step
size of the Markov chain. If it is too small, it will take a long time, or even be impossible, for the states
of the variable to explore the entire space. However, if the variance is too large, the probability of accepting
the new candidate will become small, and thus the variable may stay in the same state for a very long time.
Both extremes will make the chain fail to simulate the original PDF.
If we sample from P directly, that is Q(x′; x_t) = P(x′), we have

[P(x′) Q(x_t; x′)] / [P(x_t) Q(x′; x_t)] = 1,

which means that the candidate we draw will always be accepted. This tells us that Q should approximate P.
By the way, how do we calculate P(x)? There are two cases.
• Although we cannot integrate P(x), P(x) itself is easy to compute.
• P(x) = f(x)/Z, where Z = ∫ f(x) dx is what we do not know. But since we know f(x) = Z P(x), we
can just substitute f(x) for P(x) in calculating the acceptance probability of a candidate (the Z cancels).
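To make the procedure concrete, here is a minimal Python sketch of Metropolis-Hastings with a Gaussian proposal, targeting an unnormalized density f(x); both f and the proposal width are assumptions for illustration. Because the Gaussian proposal is symmetric, the Q terms cancel in the acceptance ratio:

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):
        """Unnormalized target density: a bimodal mixture; Z is never needed."""
        return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

    x = 0.0
    samples = []
    for t in range(50_000):
        x_new = x + rng.normal(scale=1.0)         # candidate x' ~ Q(x'; x_t), symmetric
        # acceptance probability min(1, f(x')/f(x)); Q terms cancel for a symmetric proposal
        if rng.uniform() < min(1.0, f(x_new) / f(x)):
            x = x_new
        samples.append(x)
    print(np.mean(samples), np.std(samples))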
1. Stationary distribution
A distribution with respect to a Markov chain is said to be stationary if the distribution remains the
same before and after taking one step in the chain, which can be written as

Π_t = T Π_{t−1} = Π_{t−1},

or

Π_i = Σ_j T_ij Π_j, ∀i,

where Π is a vector containing the stationary distribution over the states of the variable, with element
Π_i = P(x = i), and T is the transition probability matrix whose element T_ij = P(x_t = i | x_{t−1} = j) denotes
the probability that the variable transitions from state j to state i. For example, the two Markov chains in
Figures 8a and 8b both have the stationary distribution Π = (0.5, 0.5).
Figure 8: Two-state Markov chains over states q1 and q2 (panels a–e).
The stationary distribution of a Markov chain could be calculated by solving the equation
TΠ = Π
1.
X
Πi =
i
Note that there might be more than one stationary distribution for a Markov chain. A simple example is a chain with an identity transition matrix, shown in Figure 8c. If a Markov chain has a unique stationary distribution, it is guaranteed to converge eventually to that distribution regardless of the chain's initial state. A small numerical check of the $T\Pi = \Pi$ computation is sketched below.
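The following sketch (our own, using NumPy) finds a stationary distribution as the eigenvector of $T$ with eigenvalue 1, following the notes' convention that the columns of $T$ sum to one:

import numpy as np

def stationary_distribution(T):
    """Solve T @ pi = pi with sum(pi) = 1, where T[i, j] = P(x_t = i | x_{t-1} = j)."""
    eigvals, eigvecs = np.linalg.eig(T)
    k = np.argmin(np.abs(eigvals - 1.0))   # eigenvector whose eigenvalue is (numerically) 1
    pi = np.real(eigvecs[:, k])
    return pi / pi.sum()                   # normalize so the entries sum to one

T = np.array([[0.5, 0.5],
              [0.5, 0.5]])                 # a two-state chain like those in Figure 8
print(stationary_distribution(T))          # -> [0.5, 0.5]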
2. Detailed balance: a property that ensures a distribution is stationary
Once we know a Markov chain has a unique stationary distribution, we can use it to sample from a given distribution. We now give a sufficient (but not necessary) condition for a distribution $\Pi$ to be stationary, a property of the transition matrix called detailed balance:
$$T_{ij}\,\Pi_j = T_{ji}\,\Pi_i \quad \forall i, j,$$
which means $P_{i \to j} = P_{j \to i}$; this is also called reversibility because of the symmetry of the condition. Starting from this definition, we have
$$\forall i, \quad \sum_j T_{ij}\,\Pi_j = \sum_j T_{ji}\,\Pi_i = \Pi_i \sum_j T_{ji}.$$
Noting that $\sum_j T_{ji} = 1$, we obtain
$$\forall i, \quad \sum_j T_{ij}\,\Pi_j = \Pi_i \cdot 1 = \Pi_i,$$
which is exactly the second definition of a stationary distribution discussed above. Therefore, if a distribution satisfies detailed balance with the transition matrix of a Markov chain, that distribution is a stationary distribution of the chain. Note that although a periodic Markov chain like the one in Figure 8d satisfies detailed balance, we do not call it stationary: it never truly converges, and thus is not guaranteed to approximate the original PDF. In practice we often add a self-loop probability, as in Figure 8e, to avoid such periodic behavior.
Note that detailed balance does not ensure the uniqueness of the stationary distribution of a Markov chain. Such uniqueness is necessary, however, or the Markov chain may not converge to the PDF we want. What we can do is construct the chain, from the very beginning, so that 1) every state is reachable from every other state and 2) the chain is aperiodic. Under those conditions, the stationary distribution is unique.
3. Final proof
Now let us return to the Metropolis-Hastings algorithm and prove that the transition matrix of its Markov chain satisfies detailed balance with respect to $P$. If we can prove that, then $P$ is a stationary distribution of the chain.
According to the Metropolis-Hastings algorithm, the transition probability of the chain is
$$T(x'; x) = Q(x'; x)\cdot \min\left(1,\; \frac{P(x')\,Q(x; x')}{P(x)\,Q(x'; x)}\right).$$
If $x' = x$, detailed balance holds trivially. Otherwise,
$$T(x'; x)\,P(x) = \min\bigl(Q(x'; x)\,P(x),\; P(x')\,Q(x; x')\bigr),$$
which is symmetric under exchanging $x$ and $x'$, so
$$T(x'; x)\,P(x) = T(x; x')\,P(x').$$
Therefore the transition matrix satisfies detailed balance with respect to $P$, so $P$ is a stationary distribution of the Metropolis-Hastings chain, which means we can use the chain to simulate the original PDF.
Now suppose we want to sample from a factor graph distribution $P(x) = \frac{1}{Z}\prod_m f(\vec{x}_m)$ without knowing $Z$. We can use Gibbs sampling, shown in Algorithm 1, where $x_{\neg k}$ means all the variables $x$ except $x_k$. Note that Gibbs sampling is actually a particular instance of the Metropolis-Hastings algorithm in which the new candidate is always accepted. This is proved as follows.
Algorithm 1: Gibbs Sampling
for $k = 1 \ldots K$ do
    $x_k \sim P(x_k \mid x_{\neg k}) = \frac{1}{Z'} \prod_{m \in M(k)} f(\vec{x}_m)$;
Substitute
$$P(x) = P(x_{\neg k})\,P(x_k \mid x_{\neg k}), \qquad P(x') = P(x'_{\neg k})\,P(x'_k \mid x'_{\neg k}),$$
$$Q(x'; x) = P(x'_k \mid x_{\neg k}), \qquad Q(x; x') = P(x_k \mid x'_{\neg k}), \qquad x'_{\neg k} = x_{\neg k}$$
into the Metropolis-Hastings acceptance ratio:
$$\frac{P(x')\,Q(x; x')}{P(x)\,Q(x'; x)} = \frac{P(x_{\neg k})\,P(x'_k \mid x_{\neg k})\,P(x_k \mid x_{\neg k})}{P(x_{\neg k})\,P(x_k \mid x_{\neg k})\,P(x'_k \mid x_{\neg k})} = 1.$$
Therefore, the Gibbs sampling algorithm will always accept the candidate; a small sampler in this style is sketched below.
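As a small illustration of Gibbs sampling, here is a sketch (our own example, not from the notes) that samples from a zero-mean bivariate Gaussian with correlation rho by alternately resampling each coordinate from its conditional distribution:

import numpy as np

def gibbs_bivariate_gaussian(n_steps, rho=0.8, seed=None):
    """Gibbs sampler for a zero-mean, unit-variance bivariate Gaussian with correlation rho."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    cond_std = np.sqrt(1.0 - rho ** 2)                  # std of x1 | x2 (and of x2 | x1)
    samples = np.empty((n_steps, 2))
    for t in range(n_steps):
        x1 = rng.normal(loc=rho * x2, scale=cond_std)   # x1 ~ P(x1 | x2)
        x2 = rng.normal(loc=rho * x1, scale=cond_std)   # x2 ~ P(x2 | x1)
        samples[t] = (x1, x2)
    return samples

samples = gibbs_bivariate_gaussian(20000)
print(np.corrcoef(samples[5000:].T))                    # off-diagonal entries close to 0.8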
M–step: Use hard assignments (the fixed sampled values) from the sampling step in:
$$\lambda_k = \frac{\sum_n I(z^n = k)}{N} = \frac{N_k}{N} \qquad (31)$$
Here $I$ counts how many points are assigned to component $k$, and we denote this sum by $N_k$.
$$\mu_k = \frac{\sum_n I(z^n = k)\,\vec{x}^n}{\sum_n I(z^n = k)} = \frac{\sum_n I(z^n = k)\,\vec{x}^n}{N_k} \qquad (32)$$
The computation of $\Sigma$ is similar.
18.6.2 Some advantages.
With sampling we can apply EM to complicated models, for example a factor graph with cycles or highly connected components. The estimate can be improved by sampling L times in the E-step.
In practice a shortcut is taken: we combine the E and M steps and take advantage of each sample immediately:
for $n = 1 \ldots N$:
    sample $z^n \sim \dfrac{\lambda_k\, \mathcal{N}(\vec{x}^n; \mu_k, \Sigma_k)}{Z}$
    $\lambda_k \leftarrow \lambda_k + \frac{1}{N} I(z^n = k) - \frac{1}{N} I(z^n_{\text{old}} = k)$, or equivalently $\lambda_k \leftarrow \frac{N_k}{N}$
When we have seen $N$ assignments, the predictive probability of the next one is
$$P(z^{N+1} \mid z^1 \ldots z^N) = \frac{c(k) + \alpha}{N + K\alpha}.$$
Applying this, we have
$$z^n \sim \frac{\frac{N_k + \alpha}{N + K\alpha}\, \mathcal{N}(\vec{x}; \mu_k, \Sigma_k)}{Z}, \qquad \lambda_k = \frac{N_k + \alpha}{N + K\alpha}.$$
Now $\lambda_k$ can never go all the way to zero, and if we run long enough we will converge to the true distribution. We have a legitimate Gibbs sampler (states that differ only by relabeling the components have equal probability). We are effectively sampling with
$$\frac{1}{Z}\int P(z^1 \ldots z^N \mid \lambda)\, P(\lambda)\, d\lambda,$$
so $\lambda$ is integrated out, giving what is called a collapsed Gibbs sampler.
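The following sketch shows the flavor of this collapsed resampling step for a K-component Gaussian mixture with a symmetric Dirichlet(alpha) prior on lambda. It is a simplified illustration with fixed, spherical component means (our own assumption), not the full sampler from the notes.

import numpy as np
from scipy.stats import multivariate_normal

def resample_assignment(n, X, z, mu, alpha, K, rng):
    """Resample z[n] with lambda integrated out, holding the component means fixed."""
    N = len(z)
    counts = np.bincount(np.delete(z, n), minlength=K)      # N_k counted without point n
    weights = (counts + alpha) / (N - 1 + K * alpha)         # collapsed mixture weights
    likes = np.array([multivariate_normal.pdf(X[n], mean=mu[k]) for k in range(K)])
    p = weights * likes
    p /= p.sum()                                             # normalize (the Z above)
    z[n] = rng.choice(K, p=p)
    return z

rng = np.random.default_rng(0)
K, alpha = 2, 1.0
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
mu = np.array([[-2.0, -2.0], [2.0, 2.0]])
z = rng.integers(K, size=len(X))
for sweep in range(20):
    for n in range(len(X)):
        z = resample_assignment(n, X, z, mu, alpha, K, rng)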
A remaining question is: when should we stop? If we plot $P(x)$ as we iterate, we should see a general upward trend with some small dips before the curve levels off. However, it is always possible that a few more steps would reach a better state and continue the upward trend.
The real reason for doing Gibbs sampling is to handle a complicated model. One example is a factor graph of diseases and symptoms, because of its high tree width. If we try to use EM directly
we have exponential computation for the expected values of the hidden variables. With Gibbs we avoid
that problem.
19 Error Bounds
19.1 Leave One Out
Assume the data points are linearly separable and i.i.d., and that the training and test data have $D$ dimensions, where $D$ is unknown; that is, our setting is distribution free. Define the loss
$$Q(x, a) = \begin{cases} 1 & y \neq \hat{y}(x) \\ 0 & \text{otherwise.} \end{cases}$$
We want to bound the expectation of the risk over possible training sets, i.e., to find a bound $E(R(a)) \leq\ ?$. From the earlier lectures we know $W = \phi = \sum_i \alpha_i X_i Y_i$, where $X_i$ is the $i$th data point and $Y_i \in \{-1, +1\}$.
To estimate the risk of the function $Q(z, a)$, we use the following statistic: exclude the first vector $z_1$ from the sequence $z_1, z_2, \ldots, z_l$ and obtain the function that minimizes the empirical risk on the remaining $l - 1$ elements.
Let that function be $Q(z_1, a_{l-1}|z_1)$. In this notation we indicate that the vector $z_1$ was excluded from the sequence; we use the excluded vector to compute the value $Q(z_1, a_{l-1}|z_1)$.
Next we exclude the second vector $z_2$ (while the first vector is retained) and compute the value $Q(z_2, a_{l-1}|z_2)$.
Proceeding in this manner, we compute the values for all the vectors and count the number of errors in the leave-one-out procedure:
$$L(z_1, z_2, \ldots, z_l) = \sum_{i=1}^{l} Q(z_i, a_{l-1}|z_i).$$
$$\frac{E\,L(z_1, z_2, \ldots, z_{l+1})}{l+1} = \frac{1}{l+1}\int \sum_{i=1}^{l+1} Q(z_i, a_l|z_i)\, dP(z_1)\cdots dP(z_{l+1}) = \frac{1}{l+1}\sum_{i=1}^{l+1}\left(\int Q(z_i, a_l|z_i)\, dP(z_i)\right)$$
$$= E\!\left(\frac{1}{l+1}\sum_{i=1}^{l+1} R(a_l|z_i)\right) = E\,R(a_l).$$
We now introduce essential support vectors: those vectors that, if removed, would result in learning a different SVM. Indeed, if the vector $x_i$ is not an essential support vector, then there exists an expansion of the vector $\phi$ defining the optimal hyperplane that does not contain $x_i$.
Since the optimal hyperplane is unique, removing this vector from the training set doesn’t change it.
Therefore in the leave one out method it will be recognized correctly.
Thus the leave-one-out method recognizes all such vectors correctly. Therefore the number $L(z_1, \ldots, z_{l+1})$ of errors in the leave-one-out method does not exceed $K_{l+1}$, the number of essential support vectors; that is,
$$L(z_1, \ldots, z_{l+1}) \leq K_{l+1}.$$
We now derive a bound on this count. The maximum of $W(a)$ in the region $a \geq 0$ is achieved at the vector $a^0 = (a_1^0, \ldots, a_l^0)$. Let the vector
$$\phi_0 = \sum_{i=1}^{a} a_i^0 x_i y_i$$
define the optimal hyperplane passing through the origin, where we enumerate the support vectors with $i = 1, \ldots, a$.
Let us denote by $a^p$ the vector providing the maximum of the functional $W(a)$ under the constraints
$$a_p = 0, \qquad a_i \geq 0 \ \ (i \neq p),$$
and let the vector
$$\phi_p = \sum_{i=1}^{a} a_i^p x_i y_i$$
define the coefficients of the corresponding separating hyperplane passing through the origin.
Now denote by $W_p^0$ the value of $W(a)$ at
$$a_i = a_i^0 \ \ (i \neq p), \qquad a_p = 0,$$
and consider the vector $a^p$ that maximizes $W(a)$ under this constraint. The following inequality is obvious:
$$W_p^0 \leq W(a^p).$$
On the other hand, the following inequality is also true:
$$W(a^p) \leq W(a^0).$$
Taking into account that $x_p$ is a support vector, suppose the optimal hyperplane passing through the origin classifies the vector $x_p$ incorrectly. This means that the inequality
$$y_p (x_p \cdot \phi_0) \leq 0$$
holds, which is possible only if $x_p$ is an essential support vector. Now let us make one step of maximizing the function $W(a)$ by fixing $a_i$, $i \neq p$, and changing only the parameter $a_p > 0$. We obtain
$$a_p = \frac{1 - y_p (x_p \cdot \phi_p)}{|x_p|^2}.$$
Since the resulting increment $\Delta W_p$ does not exceed the increment of $W(a)$ under complete maximization, combining these relations shows that if the optimal hyperplane misclassifies the vector $x_p$ in the leave-one-out procedure, then
$$\sum_{i=1}^{a} a_i^0 \geq \frac{L_{l+1}}{D_{l+1}^2},$$
where $L((x_1, y_1), \ldots, (x_{l+1}, y_{l+1}))$ is the number of errors of the leave-one-out procedure on the sample $(x_1, y_1), \ldots, (x_{l+1}, y_{l+1})$.
Now let us recall the properties of the optimal hyperplane:
$$(\phi_0 \cdot \phi_0) = \sum_{i=1}^{a} a_i^0 \qquad \text{and} \qquad (\phi_0 \cdot \phi_0) = \frac{1}{\rho_{l+1}^2}.$$
Combining the above equations, we conclude that
$$L_{l+1} \leq \frac{D_{l+1}^2}{\rho_{l+1}^2}.$$
20 Logistic Regression a.k.a. Maximum Entropy
$$P(y|x) = \frac{1}{Z_x}\, e^{\sum_i \lambda_i f_i(x, y)}$$
$$Z_x = \sum_y e^{\sum_i \lambda_i f_i(x, y)}$$
$$f_{100}(x, y) = \begin{cases} 1 & \text{if } x = \text{a word that ends with `tion' and } y = \text{Noun}, \\ 0 & \text{otherwise.} \end{cases}$$
This is more general than binary classification, where we only decide whether an example belongs to a class or not: here, features can contribute to multiple classes according to their weights. For binary classification the decision boundary is linear, as with the perceptron or an SVM. A major difference from SVMs is that during training every example contributes to the objective function, whereas in SVMs only the examples close to the decision boundary matter.
If we plot this function, we get a sigmoid-like graph. We can draw an analogy between maximum entropy and a neural network, viewing the features as the input nodes of the network.
If we take the log of the model above, we get a linear function
$$\log P = \sum_i \lambda_i f_i + c.$$
In the log-likelihood below, the linear term is easy to handle, but the $\log Z_x$ term is not. To maximize over $\lambda_i$, we use the fact that the objective is concave and find the point where the derivative with respect to $\lambda$ is zero (hill climbing):
$$L = \frac{1}{N}\sum_n \left[ \sum_i \lambda_i f_i(x_n, y_n) - \log \sum_y e^{\sum_i \lambda_i f_i(x_n, y)} \right] \qquad (33)$$
$$\frac{\partial L}{\partial \lambda_j} = \frac{1}{N}\sum_n \left[ f_j - \frac{1}{Z_x}\,\frac{\partial}{\partial \lambda_j}\sum_y e^{\sum_i \lambda_i f_i} \right] \qquad (34)$$
$$= \frac{1}{N}\sum_n \left[ f_j - \frac{1}{Z_x}\sum_y f_j\, e^{\sum_i \lambda_i f_i(x_n, y)} \right] \qquad (35)$$
$$= \frac{1}{N}\sum_n \left[ f_j - \sum_y f_j\, P(y|x_n) \right] \qquad (36)$$
$$= \frac{1}{N}\sum_n \left[ f_j(x_n, y_n) - \sum_y f_j(x_n, y)\, P(y|x_n) \right] \qquad (37)$$
We can write eq. (37) in expectation form using the empirical distribution $\tilde{P}$:
$$\frac{\partial L}{\partial \lambda_j} = E_{\tilde{P}}[f_j] - E_P[f_j], \qquad (41)$$
where the first term (before the minus) is a constant, and the cost of computing the second term depends on the number of classes in the problem. We then update the weights by gradient ascent:
$$\lambda \leftarrow \lambda + \eta\, \frac{\partial L}{\partial \lambda}. \qquad (42)$$
We will now justify why we chose the log-linear form rather than something else. Assume we want to find the maximum-entropy distribution subject to constraints on the feature expectations:
$$\sum_{x,y} \tilde{P}(x)\,P(y|x)\,f_i(x, y) = \sum_{x,y} \tilde{P}(x)\,\tilde{P}(y|x)\,f_i(x, y) \quad \forall i.$$
In words, we want to build a model such that, for each feature, the model's expected feature value matches the training data. We have
$$H(y|x) = \sum_{x,y} \tilde{P}(x)\,P(y|x)\,\log\frac{1}{P(y|x)} \qquad (47)$$
$$L(P, \lambda, \mu) = f_0 + \sum_j \lambda_j f_j \qquad (48)$$
$$= \sum_{x,y} \tilde{P}(x)\,P(y|x)\,\log P(y|x) \qquad (49)$$
$$+ \sum_i \lambda_i \left( \sum_{x,y} \tilde{P}(x)\,\tilde{P}(y|x)\,f_i - \tilde{P}(x)\,P(y|x)\,f_i \right) \qquad (50)$$
$$+ \sum_x \tilde{P}(x)\,\mu_x \left( \sum_y P(y|x) - 1 \right) \qquad (51)$$
$$\frac{\partial L}{\partial P(y|x)} = \tilde{P}(x)\left(\log P(y|x) + 1\right) - \sum_i \lambda_i \tilde{P}(x)\, f_i + \tilde{P}(x)\,\mu_x = 0 \qquad (52)$$
$$\log P(y|x) = -1 + \sum_i \lambda_i f_i - \mu_x \qquad (53)$$
$$P(y|x) = e^{-1 - \mu_x}\, e^{\sum_i \lambda_i f_i} \qquad (54)$$
$$= \frac{1}{Z_x}\, e^{\sum_i \lambda_i f_i} \qquad (55)$$
The above result shows that the maximum-entropy solution has log-linear form. If we solve the dual of the problem, substituting $\mu_x$ into $g$, we obtain:
$$g(\lambda, \mu) = -\sum_{x,y} \tilde{P}(x)\,\frac{e^{\sum_i \lambda_i f_i}}{\sum_y e^{\sum_i \lambda_i f_i}} + E_{\tilde{P}}\!\left[\sum_i \lambda_i f_i\right] - \sum_x \tilde{P}(x)\left(\log \sum_y e^{\sum_i \lambda_i f_i} - 1\right) \qquad (63)$$
$$= -1 + E_{\tilde{P}}\!\left[\sum_i \lambda_i f_i\right] - \sum_x \tilde{P}(x)\log \sum_y e^{\sum_i \lambda_i f_i} + 1 \qquad (64)$$
$$= E_{\tilde{P}}\!\left[\sum_i \lambda_i f_i\right] - \sum_x \tilde{P}(x)\log \sum_y e^{\sum_i \lambda_i f_i} \qquad (65)$$
$$= E_{\tilde{P}}\!\left[\sum_i \lambda_i f_i - \log \sum_y e^{\sum_i \lambda_i f_i}\right] \qquad (66)$$
$$= E_{\tilde{P}}\!\left[\sum_i \lambda_i f_i - \log Z_x\right] \qquad (67)$$
$$= E_{\tilde{P}}\left[\log P(y|x)\right] \qquad (68)$$
$$= L \qquad (69)$$
Thus solving the dual of the entropy maximization problem consists of maximizing the likelihood of the
training data with a log-linear functional form for P (y|x).
21 Hidden Markov Models
A Hidden Markov Model (HMM) is a Markov Chain (a series of states with probabilities of transitioning
from one state to another) where the states are hidden (latent) and each state has an emission as a random
variable. The model is described as follows:
• Ω : the set of states, with yi ∈ Ω denoting a particular state
• Σ : the set of possible emissions with xi ∈ Σ denoting a particular emission
• $P \in [0,1]^{\Omega \times \Omega}$ : the matrix whose elements give the transition probabilities
• $Q \in [0,1]^{\Omega \times \Sigma}$ : the matrix whose elements give the emission probabilities
• $\Pi$ : the vector whose elements give the probability of starting in each state
The probability distribution of an HMM can be decomposed as follows:
$$P(x_1, \ldots, x_n, y_1, \ldots, y_n) = \Pi(y_1) \prod_{i=1}^{n-1} P(y_i, y_{i+1}) \prod_{i=1}^{n} Q(y_i, x_i)$$
For example, the hidden state sequence 1 2 2 1 1 2 1 1 2 might emit the observation sequence a b c a a a a a b (each state emits the symbol below it).
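A small sketch of the factorization above, computed in log space (the example parameters are made up for illustration):

import numpy as np

def hmm_log_joint(states, emissions, Pi, P, Q):
    """log P(x_1..x_n, y_1..y_n) = log[ Pi(y1) * prod_i P(y_i, y_{i+1}) * prod_i Q(y_i, x_i) ]."""
    logp = np.log(Pi[states[0]])
    for i in range(len(states) - 1):
        logp += np.log(P[states[i], states[i + 1]])   # transition terms
    for y, x in zip(states, emissions):
        logp += np.log(Q[y, x])                        # emission terms
    return logp

# Two hidden states, three possible emissions (a=0, b=1, c=2).
Pi = np.array([0.6, 0.4])
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
Q = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.1, 0.8]])
print(hmm_log_joint([0, 1, 1, 0], [0, 2, 2, 1], Pi, P, Q))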
We can consider multiple problems relating to HMMs.
1. Decoding I: Given x1 , . . . , xn , P, Q, Π, determine the sequence y1 . . . yn that maximizes P (Y1 , . . . Yn |X1 , . . . Xn ).
2. Decoding II: Given $x_1, \ldots, x_n$ and $t$, determine the distribution of $y_t$, that is, for all values $a$ of $y_t$, $P(y_t = a \mid X_1, \ldots, X_n)$.
3. Evaluation: Given x1 , . . . xn , determine P (X1 , . . . Xn ).
4. Learning: Given sequences of observations $x_1^{(1)}, \ldots, x_n^{(1)}, \ldots, x_1^{(k)}, \ldots, x_n^{(k)}$, learn $P, Q, \Pi$ that maximize the likelihood of the observed data.
We define two functions, $\alpha$ and $\beta$, where $\alpha_t(a) = P(x_1, \ldots, x_t, Y_t = a)$ and $\beta_t(a) = P(x_{t+1}, \ldots, x_n \mid Y_t = a)$. The backward function satisfies the recurrence
$$\beta_{t-1}(a) = \sum_{c \in \Omega} Q(c, \hat{x}_t)\, \beta_t(c)\, P(a, c).$$
We return to the Decoding II problem: given $x_1, \ldots, x_n$ and $t$, determine the distribution of $Y_t$, that is, for all values $a$ of $Y_t$, $P(Y_t = a \mid X_1, \ldots, X_n)$. To do this, we rewrite the quantity as
$$P(Y_t = a \mid X_1, \ldots, X_n) = \frac{P(X_1, \ldots, X_n, Y_t = a)}{P(X_1, \ldots, X_n)},$$
where the numerator is $\alpha_t(a)\,\beta_t(a)$ and the denominator is obtained by summing the numerator over $a$.
The Decoding I problem (given $x_1, \ldots, x_n, P, Q, \Pi$, determine the sequence $y_1 \ldots y_n$ that maximizes $P(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n)$) can be solved with dynamic programming. We fill in a table with the values
$$T[t, a] = \max_{y_1 \ldots y_t,\ y_t = a} P(y_1, \ldots, y_t \mid X_1, \ldots, X_t),$$
so each entry is the probability of the most likely state sequence at time $t$ whose last state is $a$. Each entry can be computed from the entries at time $t - 1$ (as in the sketch below), and the probability of the best complete sequence is
$$\max_{a \in \Omega} T[n, a].$$
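A sketch of this dynamic program in log space (a Viterbi-style recurrence; the recurrence form is the standard one, assumed here since it is not spelled out above):

import numpy as np

def viterbi(x, Pi, P, Q):
    """Most likely hidden state sequence for observations x, via the table T[t, a]."""
    n, S = len(x), len(Pi)
    T = np.full((n, S), -np.inf)              # T[t, a]: best log-prob of a sequence ending in state a
    back = np.zeros((n, S), dtype=int)        # best predecessor of state a at time t
    T[0] = np.log(Pi) + np.log(Q[:, x[0]])
    for t in range(1, n):
        for a in range(S):
            cand = T[t - 1] + np.log(P[:, a]) + np.log(Q[a, x[t]])
            back[t, a] = np.argmax(cand)
            T[t, a] = cand[back[t, a]]
    path = [int(np.argmax(T[-1]))]            # start from max_a T[n, a]
    for t in range(n - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

Pi = np.array([0.6, 0.4])
P = np.array([[0.7, 0.3], [0.4, 0.6]])
Q = np.array([[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]])
print(viterbi([0, 2, 2, 1], Pi, P, Q))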
The learning problem can be solved using EM. Given the number of internal states and $x_1, \ldots, x_n$, we want to figure out $P$, $Q$, and $\Pi$. In the E-step, we want to compute an expectation over the hidden variables:
$$L(\theta, q) = \sum_y q(y|x) \log \frac{P(X, Y \mid \theta)}{q(Y \mid X)}.$$
For HMMs the number of possible hidden state sequences is exponential, so we use dynamic programming to compute expected counts of individual transitions and emissions:
$$P(a, b) \propto \sum_{i=1}^{n-1} q(Y_i = a, Y_{i+1} = b \mid X_1 \ldots X_n) \qquad (70)$$
$$Q(a, b) \propto \sum_{i=1}^{n} q(Y_i = a \mid X_1 \ldots X_n)\, I(x_i = b) \qquad (71)$$
22 LBFGS
To train maximum entropy (logistic regression) models, we maximized the probability of the training data over possible feature weights $\lambda$:
$$\max_\lambda \prod_n P(Y_n \mid X_n).$$
Equivalently, we maximize
$$L = \log \prod_n \frac{1}{Z_{X_n}}\, e^{\sum_i \lambda_i f_i}.$$
Of course we could solve this by gradient ascent, but today we will talk about using an approximation of Newton's method.
22.1 Preliminary
For a quadratic objective function
$$f(x) = \frac{1}{2} x^\top A x + b^\top x + c,$$
Newton's iteration is given by
$$x_{k+1} = x_k - (\nabla^2 f(x_k))^{-1}\, \nabla f(x_k).$$
Because the Hessian can be seen as the second-order derivative of $f$, we wish to choose an approximation $B_k \approx \nabla^2 f(x_k)$ such that $B_k(x_{k+1} - x_k) \approx \nabla f(x_{k+1}) - \nabla f(x_k)$.
Let
$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k);$$
then we require
$$B_k s_k = y_k.$$
That is, our approximation must be a solution of the above equation. Consider
$$B_k = \frac{y_k y_k^\top}{s_k^\top y_k}.$$
This works because
$$B_k s_k = \frac{y_k y_k^\top s_k}{s_k^\top y_k} = \frac{y_k\,(y_k^\top s_k)}{s_k^\top y_k} = y_k.$$
Further, let $H_k$ be our approximation of $(\nabla^2 f(x_k))^{-1}$, the inverse of the Hessian. We will then have
$$s_k = H_k y_k,$$
and $H_k$ must be a solution of this equation. One such $H_k$ is given by
$$H_k = \frac{s_k s_k^\top}{s_k^\top y_k}.$$
So, we want a direct formula for computing a symmetric $H_{k+1}$ from $H_k$. That is, we want to fill in the ? term in the following equation:
$$H_{k+1} = H_k + \frac{s_k s_k^\top}{s_k^\top y_k} + \;?$$
Algorithm 2: L-BFGS
Require: $H_{k-m}$, $s_i$, $y_i$
$q \leftarrow \nabla f_k$
for $i = k-1, \ldots, k-m$ do
    $\alpha_i \leftarrow \rho_i\, s_i^\top q$
    $q \leftarrow q - \alpha_i y_i$
end for
$r \leftarrow H_{k-m}\, q$
for $i = k-m, \ldots, k-1$ do
    $\beta \leftarrow \rho_i\, y_i^\top r$
    $r \leftarrow r + s_i(\alpha_i - \beta)$
end for
return $r$
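A direct Python transcription of the two-loop recursion above (our own sketch; the initial matrix H_{k-m} is taken to be a scaled identity, a common choice rather than something specified in the notes):

import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Approximate H_k @ grad from the last m curvature pairs (oldest first in the lists)."""
    q = grad.copy()
    rho = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = [0.0] * len(s_list)
    for i in reversed(range(len(s_list))):      # i = k-1, ..., k-m
        alphas[i] = rho[i] * (s_list[i] @ q)
        q = q - alphas[i] * y_list[i]
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    r = gamma * q                               # r <- H_{k-m} q, with H_{k-m} = gamma * I
    for i in range(len(s_list)):                # i = k-m, ..., k-1
        beta = rho[i] * (y_list[i] @ r)
        r = r + s_list[i] * (alphas[i] - beta)
    return r                                    # approx. H_k @ grad; step along -r to minimize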
23 Reinforcement Learning
23.1 Markov Decision Processes
A Markov Decision Process (MDP) is an extension of the standard (unhidden) Markov model [1]. Each state has a collection of actions that can be performed in that particular state; these actions move the system into a new state. More formally, the MDP's state transitions are described by the transition function $T(s, a, s')$, where $a$ is an action performable in the current state $s$, and $s'$ is some new state. As the name implies, all MDPs obey the Markov property, which holds that the probability of finding the system in a given state depends only on the previous state. Thus, the system's state at any given time is determined solely by the transition function and the action taken during the previous timestep:
$$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t) = T(s_t, a_t, s_{t+1}).$$
Here, we assume that t = 0 is the current time. Since 0 < γ < 1, greater values of t (indicating rewards
farther in the future) are given smaller weight than rewards in the nearer future.
Let $V^\Pi(s)$ be the value function for the policy $\Pi$. This function $V^\Pi : S \mapsto \mathbb{R}$ maps each state $s \in S$ to the expected reward of following $\Pi$ from that state. Assuming the system starts in state $s$, we expect the system to have the value
$$V^\Pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \;\middle|\; s_0 = s, \Pi\right].$$
Since the probability of the system being in a given state $s' \in S$ is determined by the transition function $T(s, a, s')$, we can rewrite the formula above for an arbitrary state $s \in S$ as
$$V^\Pi(s) = R(s) + \sum_{s'} T(s, a, s')\, \gamma\, V^\Pi(s'),$$
where $a = \Pi(s)$ is the action selected by the policy for the given state $s$.
Our goal here is to determine the optimal policy $\Pi^*(s)$. Examining the formula above, we see that $R(s)$ is unaffected by the choice of policy. This makes sense because at any given state $s$, the local reward term $R(s)$ is received simply by virtue of the fact that the system is in state $s$. Thus, if we wish to find the maximum policy value function (and therefore find the optimal policy) we must find the action $a$ that maximizes the summation term above:
$$V^{\Pi^*}(s) = R(s) + \max_a \sum_{s'} T(s, a, s')\, \gamma\, V^{\Pi^*}(s').$$
Note that this formulation assumes that the number of states is finite.
The formula above forms the basis of the value iteration algorithm, which starts from some initial guess of the value function and iteratively refines $V(s)$ until acceptable convergence is reached. Each pass of value iteration maximizes $V(s)$ and assigns to $\Pi^*(s)$ the action $a$ that achieves the maximum. The function $Q(s, a)$ represents the value obtainable for $V(s)$ by taking action $a \in A$; a small sketch of value iteration is given below.
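A minimal sketch of value iteration for a finite MDP stored as arrays (the array layout is our own choice):

import numpy as np

def value_iteration(R, T, gamma=0.9, tol=1e-8):
    """R[s] is the state reward, T[s, a, s'] the transition probability.

    Iterates V(s) = R(s) + max_a gamma * sum_s' T(s, a, s') V(s') until convergence,
    returning the value function and the greedy policy."""
    S = R.shape[0]
    V = np.zeros(S)
    while True:
        QV = R[:, None] + gamma * (T @ V)    # shape (S, A): value of each action in each state
        V_new = QV.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, QV.argmax(axis=1)  # optimal values and the maximizing actions
        V = V_new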
23.3 Q-Learning
The example of value iteration above presumes that the transition function T (s, a, s0 ) is known. If the
transition function is not known, then the function Q(s, a) can be obtained through a similar process of
iterative learning, the aptly-named Q-learning.
The naive guess for a Q-learning formula would be one that closely resembles the policy value function, such as
$$Q(s, a) = R(s) + \gamma \max_{a'} Q_{\text{old}}(s', a'),$$
where $s'$ is the state reached after taking action $a$.
This formula aggressively replaces old values of $Q$, though, which is not always desirable. For better results [1], use a weighted-average learning rule:
$$Q(s, a) = \eta\, Q_{\text{old}}(s, a) + (1 - \eta)\, \gamma \max_{a'} Q_{\text{old}}(s', a').$$
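A sketch of tabular Q-learning with this weighted-average update is shown below. The environment interface (reset/step) is hypothetical, and we include the immediate reward in the bootstrapped target, as in standard one-step Q-learning.

import numpy as np

def q_learning_episode(env, Q, eta=0.9, gamma=0.9, epsilon=0.1, seed=None):
    """Run one episode, updating Q(s,a) <- eta * Q_old(s,a) + (1 - eta) * target."""
    rng = np.random.default_rng(seed)
    s = env.reset()                              # hypothetical environment interface
    done = False
    while not done:
        if rng.uniform() < epsilon:              # explore
            a = int(rng.integers(Q.shape[1]))
        else:                                    # exploit the current estimates
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)            # hypothetical: returns (next state, reward, done)
        target = r + gamma * np.max(Q[s_next])   # bootstrapped estimate of future value
        Q[s, a] = eta * Q[s, a] + (1 - eta) * target
        s = s_next
    return Q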
The Q-learning algorithm below is from Ballard's textbook [1]:
Algorithm 4: One-Step Q-Learning
1. Initialize $\Pi(s)$ to $\operatorname{argmax}_a Q(s, a)$.
Here $0 < \lambda < 1$ and $V(s)$ is presumed to be a function of the weights $w$. The temporal difference learning algorithm is adapted from [1].
24 Game Theory
Game theory differs from the reinforcement learning we have just discussed: in game theory, we consider the situation in which one agent plays against another. Furthermore, game theory is not about learning; it does not learn from data or previous instances, but tries to figure out what to do given a specific set of game rules.
A game can be described by a game matrix. The two players are called the row player and the column player, and each element of the matrix gives the reward to the row player for the corresponding pair of choices. Such a matrix model applies to many situations, e.g., board games and bets. A good example is the rock-paper-scissors game, whose matrix $M$ is
Rock Paper Scissors
Rock 0 -1 1
Paper 1 0 -1
Scissors -1 1 0
Note that here the game has the same set of choices for both players, which is not necessary in general. Moreover, the above matrix satisfies
$$M = -M^\top,$$
which makes the game symmetric. This is also a zero-sum game: one player gets his/her reward from the other, so the total amount never changes.
If we consider the problem from the perspective of the row player (the analysis is the same for the column player), our goal is to maximize the row player's expected reward $E$, i.e., to find a strategy $p$ that achieves the row value
$$v_r = \max_p \min_q E(p, q),$$
where the minimization means that our rival (the column player) does whatever is worst for us, and the maximization chooses the strategy that is best for us given the column player's response. Similarly, the column value is
$$v_c = \min_q \max_p E(p, q),$$
which reflects the fact that the row player does whatever maximizes his/her reward, and the column player chooses the strategy that minimizes it.
For a fixed strategy $p$, the minimum over mixed strategies $q$ is attained at a pure strategy:
$$\min_q E(p, q) = \min_j E(p, j),$$
which means that instead of considering all possibilities for $q$, we may consider only one move at a time. In the context of rock-paper-scissors, this says $q = (1, 0, 0)$, $(0, 1, 0)$, or $(0, 0, 1)$.
Using this equality, the row value $v_r$ that we want to maximize simplifies to
$$\max_p \min_q E(p, q) = \max_p \min_j E(p, j) = \max_p \min_j \sum_i p_i m_{ij}.$$
These optimizations can be written as a pair of linear programs:
$$\max_{p, v}\ v \quad \text{s.t.}\quad \sum_i p_i m_{ij} \geq v \ \ \forall j, \qquad \sum_i p_i = 1, \qquad p_i \geq 0 \ \ \forall i,$$
$$\min_{q, v}\ v \quad \text{s.t.}\quad \sum_j q_j m_{ij} \leq v \ \ \forall i, \qquad \sum_j q_j = 1, \qquad q_j \geq 0 \ \ \forall j,$$
where the first program is for $v_r$ and the second for $v_c$. For the first one we can simplify a little further by adding a constant to the game matrix so that all of its entries are positive (which does not change the optimal strategies), and defining
$$y_i = p_i / v \quad \forall i,$$
which gives
$$\min_y \sum_i y_i \quad \text{s.t.}\quad \sum_i y_i m_{ij} \geq 1 \ \ \forall j, \qquad y_i \geq 0 \ \ \forall i.$$
The corresponding model for $v_c$ is obtained by introducing $x_j = q_j / v$ for all $j$:
$$\max_x \sum_j x_j \quad \text{s.t.}\quad \sum_j m_{ij} x_j \leq 1 \ \ \forall i, \qquad x_j \geq 0 \ \ \forall j.$$
What we just came up with are special cases of the following general models:
$$\min_y\ b^\top y \quad \text{s.t.}\quad A^\top y \geq c, \ y \geq 0, \qquad\qquad \max_x\ c^\top x \quad \text{s.t.}\quad A x \leq b, \ x \geq 0,$$
which are a pair of duals in linear programming (LP) theory, and thus have the same optimal value. The strategy pair $(p, q)$ achieving this value is called a Nash Equilibrium, which can be described mathematically as
a pair $(p, q)$ such that
$$E(p, q) = \max_{p'} E(p', q), \qquad E(p, q) = \min_{q'} E(p, q').$$
When the Nash equilibrium has been reached, both players know each other's strategy, and each obtains the best possible reward given the other's strategy; hence neither has any incentive to change the current strategy. A small sketch of computing the optimal strategies by LP is given below.
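As an illustration, the row player's LP above can be solved with an off-the-shelf solver; the sketch below (our own, using scipy.optimize.linprog with variables (p, v)) recovers the uniform equilibrium strategy for rock-paper-scissors.

import numpy as np
from scipy.optimize import linprog

def optimal_row_strategy(M):
    """Solve max_{p,v} v  s.t.  sum_i p_i m_ij >= v for all j,  sum_i p_i = 1,  p >= 0."""
    n, m = M.shape
    c = np.zeros(n + 1)
    c[-1] = -1.0                                # maximize v == minimize -v
    A_ub = np.hstack([-M.T, np.ones((m, 1))])   # v - sum_i p_i m_ij <= 0 for every column j
    b_ub = np.zeros(m)
    A_eq = np.zeros((1, n + 1))
    A_eq[0, :n] = 1.0                           # probabilities sum to one
    bounds = [(0, None)] * n + [(None, None)]   # p >= 0, v unconstrained
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.x[:n], res.x[-1]

M = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]])                      # rock-paper-scissors
p, v = optimal_row_strategy(M)
print(p, v)                                     # -> roughly [1/3, 1/3, 1/3] with value 0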
Then define
$$g(\lambda, \mu) = \min_y L(y, \lambda, \mu) = \begin{cases} -\infty, & \text{if } \exists i,\ 1 - \sum_j \lambda_j m_{ij} - \mu_i \neq 0, \\ \sum_j \lambda_j, & \text{otherwise,} \end{cases}$$
with
$$\lambda_j \geq 0 \ \ \forall j, \qquad \mu_i \geq 0 \ \ \forall i.$$
Due to the constraint $\mu_i \geq 0\ \forall i$, we can eliminate $\mu_i$ from the first condition and rewrite it as
$$1 - \sum_j \lambda_j m_{ij} \geq 0 \ \Rightarrow\ \sum_j \lambda_j m_{ij} \leq 1.$$
If we replace $\lambda$ with $x$, this dual is exactly the linear programming model for $v_c$, which shows that the previous two LP models are duals of each other. Note that we did not give the proof for the general case, but it is easy to obtain by carrying the variables $b$ and $c$ through the argument; here both were set to 1.
Recall that we also applied primal-dual reasoning for SVMs. The difference is that for SVMs we move to the dual domain to make the problem easier to solve, while here we use duality to prove that the two models are duals, which gives us the Nash equilibrium.