Class06 SGD
Sasha Rakhlin
Much (but not all) of Machine Learning: write down an objective function involving data and parameters, then find good (or optimal) parameters through optimization.
Example (parabola):
$$f(w) = aw^2 + bw + c$$
Start with any $w_1$. Then Newton's Method gives
$$w_2 = w_1 - (2a)^{-1}(2aw_1 + b) = -\frac{b}{2a},$$
the minimizer of $f$, in a single step.
Verify: if $f(w) = \|Y - Xw\|^2 + \lambda \|w\|^2$, then $(X^\top X)$ becomes $(X^\top X + \lambda I)$.
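A quick numerical check in Python (numpy; the data, dimensions, and $\lambda$ below are synthetic placeholders): one Newton step on the regularized objective lands on $(X^\top X + \lambda I)^{-1} X^\top Y$ from any starting point.

import numpy as np

# Synthetic data for illustration: n samples, d features.
rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)

# Ridge objective f(w) = ||Y - Xw||^2 + lam ||w||^2.
# Gradient: 2(X^T X + lam I) w - 2 X^T Y;  Hessian: 2(X^T X + lam I).
H = 2 * (X.T @ X + lam * np.eye(d))
w1 = rng.standard_normal(d)             # arbitrary starting point
grad = H @ w1 - 2 * X.T @ Y
w2 = w1 - np.linalg.solve(H, grad)      # one Newton step

w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
print(np.allclose(w2, w_ridge))         # True: one step reaches the minimizer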
−→ Online Learning
By the Sherman–Morrison formula,
$$(A + uv^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}.$$
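A numerical sanity check of this rank-one update identity (Python/numpy sketch; the matrix and vectors are random illustrative data):

import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.standard_normal((d, d)) + d * np.eye(d)   # a well-conditioned matrix
u = rng.standard_normal(d)
v = rng.standard_normal(d)

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(u, v))
rhs = Ainv - (Ainv @ np.outer(u, v) @ Ainv) / (1 + v @ Ainv @ u)
print(np.allclose(lhs, rhs))   # True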
Hence, for $C_t = C_{t-1} + x_t x_t^\top$,
$$C_t^{-1} = C_{t-1}^{-1} - \frac{C_{t-1}^{-1} x_t x_t^\top C_{t-1}^{-1}}{1 + x_t^\top C_{t-1}^{-1} x_t},$$
and (do the calculation)
$$C_t^{-1} x_t = C_{t-1}^{-1} x_t \cdot \frac{1}{1 + x_t^\top C_{t-1}^{-1} x_t}.$$
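The calculation, spelled out: multiply the recursion for $C_t^{-1}$ by $x_t$ and combine terms:
$$C_t^{-1} x_t = C_{t-1}^{-1} x_t - \frac{C_{t-1}^{-1} x_t \,\big(x_t^\top C_{t-1}^{-1} x_t\big)}{1 + x_t^\top C_{t-1}^{-1} x_t} = C_{t-1}^{-1} x_t \left(1 - \frac{x_t^\top C_{t-1}^{-1} x_t}{1 + x_t^\top C_{t-1}^{-1} x_t}\right) = C_{t-1}^{-1} x_t \cdot \frac{1}{1 + x_t^\top C_{t-1}^{-1} x_t}.$$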
The algorithm
$$w_{t+1} = w_t + \eta_t x_t (y_t - x_t^\top w_t).$$
- is recursive;
- does not require storing the matrix $C_{t-1}$;
- does not require updating the inverse, but only vector/vector multiplication (see the sketch below).
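A minimal Python sketch of this update on a synthetic data stream (the step-size schedule $\eta_t = 0.1/\sqrt{t}$, the noise level, and the data distribution are illustrative assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(2)
d, T = 5, 10_000
w_star = rng.standard_normal(d)        # unknown target parameters

w = np.zeros(d)
for t in range(1, T + 1):
    x = rng.standard_normal(d)                      # receive x_t
    y = x @ w_star + 0.1 * rng.standard_normal()    # observe noisy y_t
    eta = 0.1 / np.sqrt(t)                          # step size (assumed schedule)
    w = w + eta * x * (y - x @ w)                   # w_{t+1} = w_t + eta_t x_t (y_t - x_t^T w_t)

print(np.linalg.norm(w - w_star))      # should be small: w_t approaches w*

Only the current vector $w_t$ and the incoming pair $(x_t, y_t)$ are kept, matching the points above.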
However, we are not guaranteed convergence in one step. How many steps? How to choose $\eta_t$?
For convex $f$ with $\|\nabla f(w_t)\| \le G$, $\|w_1 - w^*\| \le B$, step size $\eta = \frac{B}{G\sqrt{T}}$, and averaged iterate $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$, gradient descent guarantees
$$f(\bar{w}_T) - f(w^*) \le \frac{BG}{\sqrt{T}}.$$
Proof:
$$\|w_{t+1} - w^*\|^2 = \|w_t - \eta \nabla f(w_t) - w^*\|^2$$
$$= \|w_t - w^*\|^2 + \eta^2 \|\nabla f(w_t)\|^2 - 2\eta \nabla f(w_t)^\top (w_t - w^*).$$
Rearrange:
$$2\eta \nabla f(w_t)^\top (w_t - w^*) = \|w_t - w^*\|^2 - \|w_{t+1} - w^*\|^2 + \eta^2 \|\nabla f(w_t)\|^2.$$
Convexity of $f$ means
$$f(w_t) - f(w^*) \le \nabla f(w_t)^\top (w_t - w^*),$$
and so
$$\frac{1}{T}\sum_{t=1}^{T} f(w_t) - f(w^*) \;\le\; \frac{1}{T}\sum_{t=1}^{T} \nabla f(w_t)^\top (w_t - w^*) \;\le\; \frac{BG}{\sqrt{T}}.$$
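The last inequality, spelled out: sum the rearranged identity over $t = 1, \ldots, T$, telescope, and use $\|w_1 - w^*\| \le B$, $\|\nabla f(w_t)\| \le G$, and the step size $\eta = \frac{B}{G\sqrt{T}}$ assumed above:
$$2\eta \sum_{t=1}^{T} \nabla f(w_t)^\top (w_t - w^*) = \|w_1 - w^*\|^2 - \|w_{T+1} - w^*\|^2 + \eta^2 \sum_{t=1}^{T} \|\nabla f(w_t)\|^2 \le B^2 + \eta^2 T G^2,$$
so
$$\frac{1}{T}\sum_{t=1}^{T} \nabla f(w_t)^\top (w_t - w^*) \le \frac{B^2}{2\eta T} + \frac{\eta G^2}{2} = \frac{BG}{\sqrt{T}}.$$
Finally, $f(\bar{w}_T) \le \frac{1}{T}\sum_{t=1}^{T} f(w_t)$ by Jensen's inequality (convexity of $f$), which yields the stated bound for the averaged iterate.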
Setting #1:
Setting #2:
$$f(w) = \mathbb{E}\,\ell(Y, w^\top X)$$
A function $f: \mathbb{R}^d \to \mathbb{R}$ is convex if
$$f(u) \ge f(v) + \nabla f(v)^\top (u - v) \quad \text{for all } u, v,$$
where
$$\nabla f(u) = \left( \frac{\partial f(u)}{\partial u_1}, \ldots, \frac{\partial f(u)}{\partial u_d} \right).$$
$$\nabla f_i(v) \in \partial f(v).$$
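A sketch of SGD in the stochastic setting above (Python/numpy; the hinge loss, the data distribution, and the $1/\sqrt{t}$ step size are illustrative assumptions, not from the slides). Each step samples a fresh $(x_t, y_t)$ and moves along a stochastic (sub)gradient of the loss, picking an element of the subdifferential at the loss's kink.

import numpy as np

rng = np.random.default_rng(3)
d, T = 5, 20_000
w_true = rng.standard_normal(d)

def hinge_subgrad(y, score, x):
    # a subgradient (w.r.t. w) of the hinge loss max(0, 1 - y * score), score = w^T x
    return -y * x if y * score < 1 else np.zeros_like(x)

w = np.zeros(d)
for t in range(1, T + 1):
    x = rng.standard_normal(d)
    y = 1.0 if x @ w_true >= 0 else -1.0   # synthetic labels from a linear rule
    g = hinge_subgrad(y, x @ w, x)         # stochastic (sub)gradient
    w -= (1.0 / np.sqrt(t)) * g            # SGD step with assumed 1/sqrt(t) schedule

# w should point in roughly the same direction as w_true
print(w @ w_true / (np.linalg.norm(w) * np.linalg.norm(w_true)))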