Notes on Andrew Ng's CS 229 Machine Learning Course
Tyler Neylon
331.2016
These are notes I’m taking as I review material from Andrew Ng’s CS 229 course
on machine learning. Specifically, I’m watching these videos and looking at the
written notes and assignments posted here. These notes are available in two
formats: html and pdf.
I’ll organize these notes to correspond with the written notes from the class.
As I write these notes, I’m also putting together some homework solutions.
1 On lecture notes 1
The notes in this section are based on lecture notes 1.
1.1 Gradient descent
The general gradient descent update rule, for a cost function $J(\theta)$, is
$$\theta_j := \theta_j - \alpha\,\frac{\partial J}{\partial \theta_j}.$$
1.2 Gradient descent on linear regression
I realize this is a toy problem because linear regression in practice is not solved
iteratively, but it seems worth understanding well. The general update equation
is, for a single example i,
$$\theta_j := \theta_j + \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)}.$$
In the simplest case of a single example $(x, y)$ with a single feature and no intercept, choosing the learning rate $\alpha = 1/x^2$ solves the problem in one step:
$$\theta_1 := \theta_0 + (y - \theta_0 x)/x,$$
so that
$$h_\theta(x) = \theta_1 x = \theta_0 x + y - \theta_0 x = y.$$
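To see the update in action, here's a minimal numeric sketch; the toy data, learning rate, and epoch count are made up for illustration:

```python
import numpy as np

def lms_step(theta, x, y, alpha):
    """One stochastic update: theta_j := theta_j + alpha * (y - h_theta(x)) * x_j."""
    return theta + alpha * (y - theta @ x) * x

# Toy data generated by y = 1 + 2*x1, with x0 = 1 as an intercept feature.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

theta = np.zeros(2)
for _ in range(1000):                  # epochs
    for x_i, y_i in zip(X, y):
        theta = lms_step(theta, x_i, y_i, alpha=0.1)
print(theta)                           # approaches [1., 2.]
```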
Ng reviews some facts about the trace to support matrix derivatives. One goal is to show that
$$\operatorname{tr} AB = \operatorname{tr} BA. \qquad (1)$$
This follows from writing
$$\operatorname{tr} AB = a_{ij}b_{ji} = b_{ji}a_{ij} = \operatorname{tr} BA,$$
where I’m using the informal shorthand notation that a variable repeated within
a single product implies that the sum is taken over all relevant values of that
variable. Specifically,
$$a_{ij}b_{ji} \quad\text{means}\quad \sum_{i,j} a_{ij}b_{ji}.$$
Since $\operatorname{tr} AB = a_{ij}b_{ji}$, we have that
$$(\nabla_{A^T} f(A))_{ij} = \frac{\partial f}{\partial a_{ji}} = (\nabla_A f(A))_{ji}.$$
• Goal: $\nabla_A \operatorname{tr}(ABA^T C) = CAB + C^T AB^T$.
I’ll use some nonstandard index variable names below because I think it will
help avoid possible confusion. Start with $\alpha = \operatorname{tr}(ABA^T C)$; then
$$(\nabla_A \alpha)_{ij} = b_{jw}a_{vw}c_{vi} + a_{xz}b_{zj}c_{ix} = c_{vi}a_{vw}b_{jw} + c_{ix}a_{xz}b_{zj}.$$
(The second equality above is based on the fact that we’re free to rearrange terms
within products in the repeated-index notation being used. Such rearrangement relies on the commutativity of numbers, not of matrices.)
This last expression is exactly the $ij^{\text{th}}$ entry of the matrix $CAB + C^T AB^T$,
which was the goal.
Ng starts with
$$\nabla_\theta J(\theta) = \nabla_\theta \frac{1}{2}(X\theta - y)^T(X\theta - y),$$
and uses some trace tricks to get to
$$X^T X\theta - X^T y.$$
I thought that the trace tricks were not great in the sense that if I were faced
with this problem it would feel like a clever trick out of thin air to use the trace
(perhaps due to my own lack of experience with matrix derivatives?), and in the
sense that the connection between the definition of J(θ) and the result is not
clear.
Next is another approach.
Start with $\nabla_\theta J(\theta) = \nabla_\theta \frac{1}{2}(\theta^T Z\theta - 2y^T X\theta)$, where $Z = X^T X$, and the doubled
term is a valid combination of the two similar terms since they’re both real
numbers, so we can safely take the transpose of one of them to add them together.
The left term, $\theta^T Z\theta$, can be expressed as $w = \theta_{i1} Z_{ij} \theta_{j1}$, treating $\theta$ as an $(n+1)\times 1$ matrix. Then $(\nabla_\theta w)_{i1} = Z_{ij}\theta_{j1} + \theta_{j1}Z_{ji}$, using the product rule; so $\nabla_\theta w = 2Z\theta$, using that $Z$ is symmetric.
The right term, $v = y^T X\theta = y_{i1}X_{ij}\theta_{j1}$, has $(\nabla_\theta v)_{i1} = y_{j1}X_{ji}$, so that $\nabla_\theta v = X^T y$.
These all lead to $\nabla_\theta J(\theta) = X^T X\theta - X^T y$ as before. I think it's clearer, though, once you grok the sense that the product rule
$$\nabla(fg) = (\nabla f)g + f(\nabla g)$$
holds, even in cases where the product $fg$ is not a scalar; e.g., that it is vector- or matrix-valued. But I'm not going to dive into that right now.
Also note that $X^T X$ can easily be singular. A simple example is $X = 0$, the scalar value. However, if $X$ is $m \times n$ with rank $n$, then $X^T X e_i = X^T x^{(i)} \ne 0$ since $\langle x^{(i)}, x^{(i)}\rangle \ne 0$. (If $\langle x^{(i)}, x^{(i)}\rangle = 0$ then $X$ could not have rank $n$.) So $X^T X$ is nonsingular iff $X$ has rank $n$.
Ng says something in a lecture (either 2 or 3) that implied that $(X^T X)^{-1}X^T$ is not the pseudoinverse of $X$, but for any real-valued full-rank matrix, it is.
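Here's a quick numeric check of that claim, assuming $X$ has full column rank so that $X^T X$ is invertible; the matrix values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # m = 5, n = 3; almost surely rank 3

# (X^T X)^{-1} X^T should match the Moore-Penrose pseudoinverse of X.
lhs = np.linalg.inv(X.T @ X) @ X.T
rhs = np.linalg.pinv(X)
print(np.allclose(lhs, rhs))  # True
```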
This motivation for least squares makes sense when the error is given by i.i.d.
normal curves, and often this may seem like a natural assumption based on the
central limit theorem.
However, this same line of argument could be used to justify any cost function
of the form
$$J(\theta) = \sum_i f(\theta, x^{(i)}, y^{(i)}),$$
where $f$ is intuitively some measure of distance between $h_\theta(x^{(i)})$ and $y^{(i)}$. Specifically, model the error term $\varepsilon^{(i)}$ as having the probability density function $e^{-f(\theta, x^{(i)}, y^{(i)})}$. This is intuitively reasonable when $f$ has the aforementioned distance-like property, since error values are likely near zero and unlikely far from zero. Then the log likelihood function is
$$\ell(\theta) = \log L(\theta) = \log \prod_i e^{-f(\theta, x^{(i)}, y^{(i)})} = \sum_i -f(\theta, x^{(i)}, y^{(i)}),$$
so that maximizing $L(\theta)$ is the same as minimizing the $J(\theta)$ defined in terms of this arbitrary function $f$. To be perfectly clear, the motivation Ng provides
only works when you have good reason to believe the error terms are indeed
normal. At the same time, using a nice simple algorithm like least squares is
quite practical even if you don’t have a great model for your error terms.
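As one quick instance of this pattern (my example, not one from the course): taking $f(\theta, x^{(i)}, y^{(i)}) = |y^{(i)} - \theta^T x^{(i)}|$ corresponds to modeling the errors with a Laplace density $\propto e^{-|\varepsilon|}$, and the resulting $J(\theta)$ is the cost function of least absolute deviations regression.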
The idea of locally weighted linear regression (LWR) is that, given a value $x$, we can choose $\theta$ to minimize the modified cost function
$$\sum_i w^{(i)}\left(y^{(i)} - \theta^T x^{(i)}\right)^2,$$
where
$$w^{(i)} = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right).$$
A natural question: why not simply predict the weighted average
$$y = \frac{\sum_i w^{(i)} y^{(i)}}{\sum_i w^{(i)}}$$
instead?
I don’t have a strong intuition for which approach would be better, although the
linear interpolation is more work (unless I’m missing a clever way to implement
LWR that wouldn’t also work for the simpler equation immediately above). This
also reminds me of the $k$-nearest neighbors algorithm, though I've seen that
presented as a classification approach while LWR is regression. Nonetheless,
perhaps one could apply a locality-sensitive hash to quickly approximately find
the k nearest neighbors and then build a regression model using that.
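Here's a sketch of both options in code, using the weight function above; the data, bandwidth $\tau$, and query point are made up:

```python
import numpy as np

def lwr_predict(x, X, y, tau):
    """Locally weighted regression: weighted least squares around the query x."""
    w = np.exp(-(X - x) ** 2 / (2 * tau ** 2))         # the weights w^{(i)}
    A = np.stack([np.ones_like(X), X], axis=1)         # design matrix [1, x^{(i)}]
    W = np.diag(w)
    theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)  # weighted normal equations
    return theta @ np.array([1.0, x])

def weighted_average(x, X, y, tau):
    """The simpler weighted-average alternative discussed above."""
    w = np.exp(-(X - x) ** 2 / (2 * tau ** 2))
    return (w @ y) / w.sum()

X = np.linspace(0, 6, 50)
y = np.sin(X)
print(lwr_predict(3.0, X, y, tau=0.5))      # both near sin(3) ~= 0.141
print(weighted_average(3.0, X, y, tau=0.5))
```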
Next up is logistic regression, where $h_\theta(x) = g(\theta^T x)$ for the logistic function $g(z) = 1/(1 + e^{-z})$.
The update equation from gradient descent turns out to be nice for this setup.
However, besides “niceness,” it's not obvious to me why the logistic function $g$ is significantly better than any other sigmoid function.
In general, suppose $h_\theta(x) = \tau(\theta^T x)$ for some differentiable sigmoid-like function $\tau$; then the log likelihood has partial derivatives
$$\frac{\partial}{\partial\theta_j}\ell(\theta) = \sum_i \left(\frac{y^{(i)}}{h^{(i)}} - \frac{1 - y^{(i)}}{1 - h^{(i)}}\right)\frac{\partial \tau^{(i)}}{\partial\theta_j}.$$
Here, $\tau^{(i)}$ indicates the function $\tau$ evaluated on $\theta$ and $x^{(i)}$, and $h^{(i)} = \tau^{(i)}$ is the hypothesis value. We can split up the term inside the sum like so:
$$a^{(i)} = \frac{y^{(i)}}{h^{(i)}} - \frac{1 - y^{(i)}}{1 - h^{(i)}} = \frac{y^{(i)} - h^{(i)}}{h^{(i)}(1 - h^{(i)})},$$
and
$$b^{(i)} = \frac{\partial \tau^{(i)}}{\partial \theta_j}.$$
Intuitively, the value of $a^{(i)}$ makes sense as it's the error of $h^{(i)}$ weighted more heavily when a wrong output value of $h$ is given with higher confidence. The value $b^{(i)}$ makes sense as the direction in which we'd want to move $\theta_j$ in order to increase $\tau^{(i)}$. Multiplied by the sign of $a^{(i)}$, the end result is a movement of $\theta_j$ that decreases the error.
It’s not obvious to me which choice of function τ is best. Consider, for example,
the alternative function
$$\tau(z) = \frac{\arctan(z)}{\pi} + \frac{1}{2}.$$
In this case,
$$b^{(i)} = \frac{\partial \tau^{(i)}}{\partial \theta_j} = \frac{x_j^{(i)}}{\pi\left(1 + (\theta^T x^{(i)})^2\right)}.$$
In general, sigmoid functions will tend to give a value of $b^{(i)}$ that is small when $|\theta^T x|$ is large, symmetric around $\theta^T x = 0$, and has the same sign as $x_j$, since $\tau - 1/2$ will be an increasing odd function; the derivative of an increasing odd function is even and positive; and this positive even function will be multiplied by $x_j$ to derive the final value of $b^{(i)}$. The values of $b^{(i)}$ will be small for large $|\theta^T x|$ because $\tau(\theta^T x)$ necessarily flattens out for large input values (since it's monotonic and bounded in $[0, 1]$).
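As a quick comparison of the two sigmoids (my own check, not from the course): the logistic function's derivative decays exponentially in $|z|$, while the arctan-based $\tau$ above has a derivative decaying only like $1/z^2$, so the arctan version keeps assigning nontrivial weight $b^{(i)}$ to confidently classified examples for much longer:

```python
import numpy as np

z = np.linspace(-5, 5, 11)
logistic = 1 / (1 + np.exp(-z))
d_logistic = logistic * (1 - logistic)    # g'(z) = g(z)(1 - g(z)); exponential tails
d_arctan = 1 / (np.pi * (1 + z ** 2))     # tau'(z) = 1/(pi (1 + z^2)); polynomial tails

for zi, dl, da in zip(z, d_logistic, d_arctan):
    print(f"z = {zi:+.1f}   logistic' = {dl:.5f}   arctan' = {da:.5f}")
```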
Ng states that the assumption of a Bernoulli distribution for $p(y \mid x; \theta)$ results in the traditional gradient update rule for logistic regression, but there is another choice available in the argument. The definition of a Bernoulli distribution means that we must have
$$p(y \mid x; \theta) = \begin{cases} h_\theta(x) & \text{if } y = 1\text{, and} \\ 1 - h_\theta(x) & \text{otherwise.} \end{cases}$$
There are a number of expressions that also capture this same behavior for the
two values y = 0 and y = 1. The one used in the course is
$$p(y \mid x; \theta) = h^y(1 - h)^{1-y},$$
which is convenient since we're working with the product $p(y) = \prod_i p(y^{(i)})$.
However, we could have also worked with, for example,
$$p(y \mid x; \theta) = y\,h + (1 - y)(1 - h),$$
which also gives exactly the needed values for the cases $y = 1$ and $y = 0$. Although the traditional expression is easier to work with, it seems worth noting that there is a choice to be made.
Ng gives the Newton's method update rule
$$\theta := \theta - H^{-1}\nabla_\theta \ell(\theta) \qquad (2)$$
without justifying it beyond explaining that it's the vector version of what one would do in a non-vector setting to find a zero of $\ell'(\theta)$, namely,
$$\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}.$$
Here $H$ is the Hessian,
$$H = (h_{ij}) = \left(\frac{\partial^2 \ell}{\partial\theta_i\,\partial\theta_j}\right).$$
One way to make sense of (2) is to consider the first-order approximation of $\nabla\ell$ around $\theta$:
$$\left(\nabla\ell(\theta + \alpha)\right)_i \approx \left(\nabla\ell(\theta)\right)_i + \sum_j \frac{\partial}{\partial\theta_j}\left(\nabla\ell(\theta)\right)_i\,\alpha_j. \qquad (3)$$
In matrix form, (3) becomes
$$\nabla\ell(\theta + \alpha) \approx \nabla\ell(\theta) + H\alpha. \qquad (4)$$
Recall that our goal is to iterate on $\theta$ toward a solution to $\nabla\ell(\theta) = 0$. Use (4) toward this end by looking for $\alpha$ which solves $\nabla\ell(\theta + \alpha) \approx 0$. If $H$ is invertible, then $\alpha \approx -H^{-1}\nabla\ell(\theta)$, so the update
$$\theta := \theta + \alpha = \theta - H^{-1}\nabla\ell(\theta)$$
matches the rule (2).
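Here's a small sketch of the update rule (2) applied to logistic regression; the synthetic data and iteration count are my own choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Synthetic data: x0 = 1 is an intercept feature; the true theta is [-0.5, 1.5].
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = (rng.random(200) < sigmoid(X @ np.array([-0.5, 1.5]))).astype(float)

theta = np.zeros(2)
for _ in range(8):
    h = sigmoid(X @ theta)
    grad = X.T @ (y - h)                        # gradient of the log likelihood
    H = -(X * (h * (1 - h))[:, None]).T @ X     # Hessian of the log likelihood
    theta = theta - np.linalg.solve(H, grad)    # theta := theta - H^{-1} grad
print(theta)  # roughly recovers [-0.5, 1.5]
```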
A class of distributions fits within the exponential family when there are functions $a$, $b$, and $T$ such that
$$p(y; \eta) = b(y)\exp\left(\eta^T T(y) - a(\eta)\right).$$
i
Ng doesn’t spell this out, but I think the following line of argument is why
generalized linear models are considered useful.
Suppose $\eta = \theta^T x$ and that we're working with a distribution parametrized by $\eta$ that can be expressed in this exponential-family form. Then the log likelihood is
$$\ell(\theta) = \sum_i \log p(y^{(i)}; \eta^{(i)}) = \sum_i \left(\log(b(y^{(i)})) + \eta^{(i)T}T(y^{(i)}) - a(\eta^{(i)})\right).$$
Its gradient is
$$\nabla\ell(\theta) = \sum_i x^{(i)}\left(T(y^{(i)}) - a'(\eta^{(i)})\right),$$
which suggests, for a single example, the gradient ascent update rule
$$\theta := \theta + \alpha\left(T(y^{(i)}) - a'(\eta^{(i)})\right)x^{(i)}. \qquad (6)$$
Let’s see how this nice general update rule looks in a couple special cases. For
simplicity, I’ll stick to special cases where η is a scalar value.
In the case of the normal distribution (linear regression), $a(\eta) = \eta^2/2$ so $a'(\eta) = \eta$; and $T(y) = y$. Then the update rule (6) becomes
$$\theta := \theta + \alpha(y - \theta^T x)x,$$
matching the update rule we saw earlier for linear regression.
In the case of the Bernoulli distribution (logistic regression), $a(\eta) = \log(1 + e^\eta)$ and $T(y) = y$. So $a'(\eta) = 1/(1 + e^{-\eta})$, and we can write the update rule (6) as
$$\theta := \theta + \alpha(y - g(\theta^T x))x,$$
where g is the logistic function. This matches the earlier update equation we
saw for logistic regression, suggesting that our general approach isn’t completely
insane. It may even be technically correct.
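A tiny sketch of why (6) is handy in code: the same generic update covers both special cases once $a'$ is supplied (the function names here are my own):

```python
import numpy as np

def glm_sgd_step(theta, x, y, alpha, a_prime):
    """Generic update (6), with T(y) = y: theta += alpha * (y - a'(theta^T x)) * x."""
    return theta + alpha * (y - a_prime(theta @ x)) * x

# Linear regression: a'(eta) = eta.
linear_step = lambda th, x, y, al: glm_sgd_step(th, x, y, al, lambda e: e)
# Logistic regression: a'(eta) = 1/(1 + e^{-eta}), the logistic function g.
logistic_step = lambda th, x, y, al: glm_sgd_step(
    th, x, y, al, lambda e: 1 / (1 + np.exp(-e)))
```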
2 On lecture notes 2
The notes in this section are based on lecture notes 2.
Ng mentions this fact in the lecture and in the notes, but he doesn’t go into the
details of justifying it, so let’s do that.
The goal is to show that
$$p(y = 1 \mid x) = \frac{1}{1 + e^{-\theta^T x}}, \qquad (7)$$
assuming that
$$p(y) = \phi^y(1 - \phi)^{1-y}, \qquad p(x \mid y = 0) = N(\mu_0, \Sigma), \quad\text{and}\quad p(x \mid y = 1) = N(\mu_1, \Sigma),$$
where $N(\mu, \Sigma)$ indicates the multivariate normal distribution. From this and Bayes' rule we can derive
$$p(y = 1 \mid x) = \frac{\phi A_1}{(1 - \phi)A_0 + \phi A_1} = \frac{1}{1 + \frac{1 - \phi}{\phi}\cdot\frac{A_0}{A_1}}, \qquad (9)$$
where $A_j$ denotes the value of the density $p(x \mid y = j)$.
It will be convenient to define the inner product
$$\langle x, y\rangle := x^T \Sigma^{-1} y,$$
noting that this is linear in both terms just as the usual inner product is. It will also be convenient to define $B_j = \langle x - \mu_j,\, x - \mu_j\rangle$, so that $A_j$ is a shared constant times $\exp(-\frac{1}{2}B_j)$.
Our ultimate goal (7) can now be reduced to showing that the denominator of (9) is of the form $1 + \exp(a\langle b, x\rangle + c)$ for some constants $a$, $b$, and $c$.
Notice that
$$\frac{A_0}{A_1} = \frac{\exp(-\frac{1}{2}B_0)}{\exp(-\frac{1}{2}B_1)} = \exp\left(\frac{1}{2}(B_1 - B_0)\right) = \exp\left(\frac{1}{2}\langle 2\mu_0 - 2\mu_1,\, x\rangle + C\right),$$
where $C$ is based on the terms of $B_j$ that use $\mu_j$ but not $x$. The scalar factor $\frac{1-\phi}{\phi}$ multiplying $A_0/A_1$ in (9) can also be absorbed into the constant $c$ in the expression $\exp(a\langle b, x\rangle + c)$. This confirms that the denominator of (9) is indeed of the needed form, confirming the goal (7).
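Here's a quick numeric spot check of the result (the parameter values are arbitrary): the posterior computed directly from Bayes' rule matches a logistic function with $\theta = \Sigma^{-1}(\mu_1 - \mu_0)$ and a matching intercept:

```python
import numpy as np
from scipy.stats import multivariate_normal

phi = 0.3
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.7, -0.2])

# Posterior via Bayes' rule.
A0 = multivariate_normal.pdf(x, mu0, Sigma)
A1 = multivariate_normal.pdf(x, mu1, Sigma)
posterior = phi * A1 / ((1 - phi) * A0 + phi * A1)

# The logistic form, with the constants absorbed into an intercept b.
Si = np.linalg.inv(Sigma)
theta = Si @ (mu1 - mu0)
b = -0.5 * (mu1 @ Si @ mu1 - mu0 @ Si @ mu0) + np.log(phi / (1 - phi))
print(posterior, 1 / (1 + np.exp(-(theta @ x + b))))  # the two values agree
```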
I’ve heard that, in practice, naive Bayes works well for spam classification. If its
assumption is so blatantly false, then why does it work so well? I don’t have an
answer, but I wanted to mention that as an interesting question.
Also, independence of random variables is not just a pair-wise property. You could
have three random variables x1 , x2 , and x3 which are all pair-wise independent,
but not independent as a group. As an example of this, here are three random
variables defined as indicator functions over the domain $\{a, b, c, d\}$, taken with the uniform distribution; for example:
$$x_1 = 1_{\{a,d\}}, \qquad x_2 = 1_{\{b,d\}}, \qquad x_3 = 1_{\{c,d\}}.$$
If we know the value of $x_i$ for any $i$, then we have no extra information about the value of $x_j$ for any $j \ne i$. However, if we know any two values $x_i$ and $x_j$, then we've completely determined the value of $x_k$.
This is all leading up to the point that the naive Bayes assumption, stated
carefully, is that the entire set of events {xi | y} is independent, as opposed to
just pair-wise independent.
I’m curious about any theoretical justification for Laplace smoothing. As pre-
sented so far, it mainly seems like a hacky way to avoid a potential implementation
issue.
3 On lecture notes 3
The notes in this section are based on lecture notes 3.
At one point, Ng says that any constraint on $w$ will work to avoid the problem that maximizing the functional margin simply requires making both $w$ and $b$ larger. However, I think many constraints make less sense than $\|w\| = 1$. The only case that doesn't fit well with $\|w\| = 1$ is the case that the best weights are actually $w = 0$, but this vector can only ever give a single output, so it's useless as long as we have more than one label, which is the only interesting case. On the other hand, a constraint like $|w_3| = 1$ would avoid the exploding $(w, b)$ problem, but it may be the case that the best weight vector has $w_3 = 0$, which would fail to be found under such a constraint. In general, constraints tend to assume that some subset of values in $(w, b)$ are nonzero, and anything more specific than $\|w\| = 1$ would add an assumption that may end up missing the truly optimal weights.
Ng also uses the fact that a max-min is at most a min-max; here's a quick justification. For any point $(x_0, y_0)$,
$$\min_y f(x_0, y) \le f(x_0, y_0) \le \max_x f(x, y_0),$$
and taking the max over $x_0$ on the left and the min over $y_0$ on the right gives
$$\max_x \min_y f(x, y) \le \min_y \max_x f(x, y).$$
If, moreover, $(x_0, y_0)$ is a saddle point, meaning
$$f(x, y_0) \le f(x_0, y_0) \le f(x_0, y) \quad \forall x, y,$$
then $\max_x f(x, y_0) = f(x_0, y_0) = \min_y f(x_0, y)$, so that
$$\min_y \max_x f(x, y) \le f(x_0, y_0) \le \max_x \min_y f(x, y),$$
which combines with the inequality above to force equality throughout.
Let’s confirm that the Gaussian kernel is indeed a kernel. As a side note, this
kernel is also referred to as the radial basis function kernel, or the RBF kernel.
I had some difficulty proving this myself, so I found this answer by Douglas Zare
that I’ll use as the basis for this proof.
The goal is to show that the function
$$k(x, y) = \exp\left(\frac{-\|x - y\|^2}{2\sigma^2}\right)$$
is a kernel.
The proof will proceed by providing a function $\phi_w$ such that the inner product $\langle\phi_x, \phi_y\rangle = k(x, y)$ for any vectors $x, y \in \mathbb{R}^n$. I'll be using an inner product defined on functions of $\mathbb{R}^n$ as in
$$\langle f, g\rangle = \int_{\mathbb{R}^n} f(z)g(z)\,dz.$$
Begin by letting $c_\sigma = \left(\frac{\sqrt{2}}{\sigma\sqrt{\pi}}\right)^{n/2}$. Define
$$\phi_w(z) = c_\sigma \exp\left(\frac{-\|z - w\|^2}{\sigma^2}\right).$$
Then
$$\langle\phi_x, \phi_y\rangle = c_\sigma^2 \int_{\mathbb{R}^n} \exp\left(\frac{-\|z - x\|^2 - \|z - y\|^2}{\sigma^2}\right) dz.$$
Let
$$a = \frac{x + y}{2}, \qquad b = \frac{x - y}{2}, \qquad v = z - a,$$
so that
$$z - x = z - a - b = v - b, \qquad z - y = z - a + b = v + b.$$
Then
$$\langle\phi_x, \phi_y\rangle = c_\sigma^2 \int_{\mathbb{R}^n} \exp\left(\frac{-\|v - b\|^2 - \|v + b\|^2}{\sigma^2}\right) dv = c_\sigma^2 \int_{\mathbb{R}^n} \exp\left(\frac{-2\|v\|^2 - 2\|b\|^2}{\sigma^2}\right) dv.$$
The last equation uses the facts that $\|v - b\|^2 = \langle v - b,\, v - b\rangle = \langle v, v\rangle - 2\langle v, b\rangle + \langle b, b\rangle$ and that, similarly, $\|v + b\|^2 = \|v\|^2 + 2\langle v, b\rangle + \|b\|^2$.
Continuing,
$$\langle\phi_x, \phi_y\rangle = c_\sigma^2 \exp\left(\frac{-2\|b\|^2}{\sigma^2}\right)\int_{\mathbb{R}^n} \exp\left(\frac{-2\|v\|^2}{\sigma^2}\right) dv.$$
Now substitute
$$u = \frac{\sqrt{2}\,v}{\sigma}, \qquad dv = \left(\frac{\sigma}{\sqrt{2}}\right)^n du.$$
We get
$$\begin{aligned}
\langle\phi_x, \phi_y\rangle &= c_\sigma^2 \exp\left(\frac{-2\|b\|^2}{\sigma^2}\right)\left(\frac{\sigma}{\sqrt{2}}\right)^n \int_{\mathbb{R}^n} \exp(-\|u\|^2)\,du \\
&= c_\sigma^2 \exp\left(\frac{-2\|b\|^2}{\sigma^2}\right)\left(\frac{\sigma\sqrt{\pi}}{\sqrt{2}}\right)^n \\
&= \exp\left(\frac{-2\|x - y\|^2}{4\sigma^2}\right) \\
&= \exp\left(\frac{-\|x - y\|^2}{2\sigma^2}\right) \\
&= k(x, y).
\end{aligned}$$
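As a sanity check, here's a numeric version of this computation with $n = 1$ (the values of $\sigma$, $x$, and $y$ are arbitrary), approximating the integral on a wide grid:

```python
import numpy as np

sigma, x, y = 1.3, 0.4, -0.9
c = (np.sqrt(2) / (sigma * np.sqrt(np.pi))) ** 0.5  # c_sigma for n = 1

z = np.linspace(-30.0, 30.0, 200001)
dz = z[1] - z[0]
phi_x = c * np.exp(-(z - x) ** 2 / sigma ** 2)
phi_y = c * np.exp(-(z - y) ** 2 / sigma ** 2)

inner = np.sum(phi_x * phi_y) * dz                  # <phi_x, phi_y>
k = np.exp(-(x - y) ** 2 / (2 * sigma ** 2))        # k(x, y)
print(inner, k)                                     # these agree closely
```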
SMO can be used to solve the dual optimization problem for support vector
machines. I wonder: Can an SMO approach be used to solve a general linear or
quadratic programming problem?
The original SMO paper by John Platt specifically applied the algorithm to
support vector machines. Based on a quick search, I don’t see any applications of
SMO to other fields or problems. I could be missing something, but my answer
so far is that SMO either hasn’t been tried or simply doesn’t offer new value as
a more general approach.
TODO any remaining notes from lecture notes 3
4 On lecture notes 4
The notes in this section are based on lecture notes 4.
In the case of finite H, I was confused for a moment about the need to include k
in the following line of thinking:
$$\varepsilon(\hat h) \le \hat\varepsilon(\hat h) + \gamma \le \hat\varepsilon(h^*) + \gamma \le \varepsilon(h^*) + 2\gamma.$$
In this part of the notes, Ng argues that we may informally consider a theoretically
infinite hypothesis space with p parameters as something like a finite space with
log(k) = O(p) since it can be represented with a single floating-point number
per parameter. However, he also proposed the counterargument that we could
have inefficiently used twice as many variables in a program to represent the
same hypothesis space. I disagree with this counterargument on the basis that
the actual number of hypothesis functions represented is identical, and we can
count k based on this number. To put it slightly more formally, suppose we have
two sets of functions,