CH 11-Regression
Regression
Regression Problem
Training data: sample drawn i.i.d. from $X$ according to some distribution $D$,
$$S = \big((x_1, y_1), \ldots, (x_m, y_m)\big) \in X \times Y,$$
with $Y \subseteq \mathbb{R}$ a measurable subset.
Loss function: $L \colon Y \times Y \to \mathbb{R}_+$, a measure of closeness, typically $L(y, y') = (y' - y)^2$ or $L(y, y') = |y' - y|^p$ for some $p \geq 1$.
Problem: find a hypothesis $h \colon X \to \mathbb{R}$ in $H$ with small generalization error with respect to the target $f$,
$$R_D(h) = \mathop{\mathbb{E}}_{x \sim D}\big[L\big(h(x), f(x)\big)\big].$$
Empirical error:
$$\widehat{R}_S(h) = \frac{1}{m} \sum_{i=1}^{m} L\big(h(x_i), y_i\big).$$
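As a small check of the definition above, here is a minimal numpy sketch (not from the lecture) that computes the empirical error of a hypothesis under the squared loss; the function name empirical_error and the toy sample are illustrative only.

```python
import numpy as np

def empirical_error(h, xs, ys):
    """Empirical error (1/m) * sum_i L(h(x_i), y_i) with L(y, y') = (y' - y)^2."""
    preds = np.array([h(x) for x in xs])
    return np.mean((preds - np.asarray(ys, dtype=float)) ** 2)

# Toy example: the zero hypothesis on three labeled points.
xs = [0.0, 1.0, 2.0]
ys = [0.1, 0.9, 2.2]
print(empirical_error(lambda x: 0.0, xs, ys))   # mean of squared residuals
```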
[Figure: points x1, x2 and thresholds t1, t2.]
Pseudo-Dimension
(Pollard, 1984)
Definition: Let $G$ be a family of functions mapping from $X$ to $\mathbb{R}$. The pseudo-dimension of $G$, $\mathrm{Pdim}(G)$, is the size of the largest set shattered by $G$.
Definition (equivalent, see also (Vapnik, 1995)):
$$\mathrm{Pdim}(G) = \mathrm{VCdim}\big(\big\{(x, t) \mapsto 1_{(g(x) - t) > 0} \colon g \in G\big\}\big).$$
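As a worked instance of this definition (standard, but not stated on the slide), the pseudo-dimension of affine functions on $\mathbb{R}^N$ follows directly from the VC-dimension characterization:

```latex
% Pseudo-dimension of affine functions on R^N (a standard example).
\[
  G = \{\, x \mapsto w \cdot x + b \;:\; w \in \mathbb{R}^N,\ b \in \mathbb{R} \,\},
  \qquad
  \mathrm{Pdim}(G)
  = \mathrm{VCdim}\big(\{(x, t) \mapsto 1_{(w \cdot x + b - t) > 0}\}\big)
  = N + 1.
\]
% Upper bound: the functions (x, t) -> w.x + b - t form a shift of an
% (N+1)-dimensional vector space (spanned by x_1, ..., x_N and 1), so the
% VC-dimension of their sign patterns is at most N+1.
% Lower bound: N+1 affinely independent points, all paired with threshold
% t = 0, are shattered by affine functions.
```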
If the loss $L$ is bounded by $M$, the uniform deviations for regression reduce to those of the thresholded indicator losses:
$$\Pr\Big[\sup_{h \in H} \big|R(h) - \widehat{R}(h)\big| > \epsilon\Big] \;\leq\; \Pr\Big[\sup_{\substack{h \in H \\ t \in [0, M]}} \big|R\big(1_{L(h, f) > t}\big) - \widehat{R}\big(1_{L(h, f) > t}\big)\big| > \tfrac{\epsilon}{M}\Big].$$
Linear Regression - Solution
Rewrite the objective function as $F(W) = \frac{1}{m}\,\big\| X^\top W - Y \big\|^2$, with
$$X = \begin{pmatrix} \Phi(x_1) & \cdots & \Phi(x_m) \\ 1 & \cdots & 1 \end{pmatrix} \in \mathbb{R}^{(N+1) \times m},
\qquad
W = \begin{pmatrix} w_1 \\ \vdots \\ w_N \\ b \end{pmatrix},
\qquad
Y = \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix}.$$
$F$ is a convex and differentiable function:
$$\nabla F(W) = \frac{2}{m}\, X\big(X^\top W - Y\big).$$
$$\nabla F(W) = 0 \iff X\big(X^\top W - Y\big) = 0 \iff X X^\top W = X Y.$$
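A minimal numpy sketch (not from the slides) of solving the normal equations $X X^\top W = XY$ above, following the same shape convention (X of shape (N+1) × m with a constant last row); np.linalg.lstsq is used so the code also works when $X X^\top$ is singular.

```python
import numpy as np

# Toy data: m points with N features each (values are arbitrary).
m, N = 5, 2
rng = np.random.default_rng(0)
Phi = rng.normal(size=(N, m))            # columns Phi(x_1), ..., Phi(x_m)
X = np.vstack([Phi, np.ones((1, m))])    # append the constant feature row; shape (N+1, m)
Y = rng.normal(size=m)

# Normal equations X X^T W = X Y; lstsq returns a minimum-norm solution
# even when X X^T is singular (pseudo-inverse solution).
W, *_ = np.linalg.lstsq(X @ X.T, X @ Y, rcond=None)
print(W)   # (w_1, ..., w_N, b)
```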
Ridge Regression - Guarantee
For kernel-based hypotheses $H = \{x \mapsto w \cdot \Phi(x) \colon \|w\| \leq \Lambda\}$ with $K(x, x) \leq R^2$, the empirical Rademacher complexity satisfies
$$\widehat{\mathfrak{R}}_S(H) \leq \sqrt{\frac{\mathrm{Tr}[K]\,\Lambda^2}{m^2}} \leq \sqrt{\frac{R^2 \Lambda^2}{m}},$$
which yields, with probability at least $1 - \delta$, for all $h \in H$,
$$R(h) \leq \widehat{R}(h) + \frac{8 R^2 \Lambda^2}{\sqrt{m}}\left(1 + \frac{1}{2}\sqrt{\frac{\log\frac{1}{\delta}}{2}}\right).$$
Ridge Regression - Solution (primal)
$$W = \big(X X^\top + \lambda I\big)^{-1} X Y,$$
where $X X^\top + \lambda I$ is always invertible.
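A corresponding numpy sketch of the primal ridge solution $W = (X X^\top + \lambda I)^{-1} XY$; lam stands for the regularization parameter $\lambda$, and the shape conventions are the same as above.

```python
import numpy as np

def ridge_primal(X, Y, lam):
    """Primal ridge regression: W = (X X^T + lam * I)^{-1} X Y.

    X has shape (N+1, m) with columns (Phi(x_i), 1); Y has shape (m,).
    For lam > 0 the matrix X X^T + lam * I is positive definite, so solve() is safe.
    """
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ Y)
```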
Ridge Regression - Equivalent Formulations
Optimization problem:
$$\min_{w, b} \; \sum_{i=1}^{m} \big(w \cdot \Phi(x_i) + b - y_i\big)^2
\quad \text{subject to: } \|w\|^2 \leq \Lambda^2.$$
Optimization problem:
$$\min_{w, b} \; \sum_{i=1}^{m} \xi_i^2
\quad \text{subject to: } \xi_i = w \cdot \Phi(x_i) + b - y_i, \;\; \|w\|^2 \leq \Lambda^2.$$
Ridge Regression Equations
Lagrangian: assume $b = 0$. For all $\xi$, $w$, $\alpha'$, and $\lambda \geq 0$,
$$\mathcal{L}(\xi, w, \alpha', \lambda) = \sum_{i=1}^{m} \xi_i^2 + \sum_{i=1}^{m} \alpha'_i \big(y_i - \xi_i - w \cdot \Phi(x_i)\big) + \lambda \big(\|w\|^2 - \Lambda^2\big).$$
KKT conditions:
$$\nabla_w \mathcal{L} = -\sum_{i=1}^{m} \alpha'_i \Phi(x_i) + 2\lambda w = 0 \iff w = \frac{1}{2\lambda} \sum_{i=1}^{m} \alpha'_i \Phi(x_i),$$
$$\nabla_{\xi_i} \mathcal{L} = 2\xi_i - \alpha'_i = 0 \iff \xi_i = \alpha'_i / 2.$$
Thus,
$$\mathcal{L} = -\frac{1}{4} \sum_{i=1}^{m} \alpha_i'^2 + \sum_{i=1}^{m} \alpha'_i y_i - \frac{1}{4\lambda} \sum_{i,j=1}^{m} \alpha'_i \alpha'_j \,\Phi(x_i) \cdot \Phi(x_j) - \lambda \Lambda^2$$
$$= -\lambda^2 \sum_{i=1}^{m} \alpha_i^2 + 2\lambda \sum_{i=1}^{m} \alpha_i y_i - \lambda \sum_{i,j=1}^{m} \alpha_i \alpha_j \,\Phi(x_i) \cdot \Phi(x_j) - \lambda \Lambda^2,$$
with $\alpha'_i = 2\lambda \alpha_i$. Equivalently, the dual problem is
$$\max_{\alpha \in \mathbb{R}^m} \; -\alpha^\top \big(X^\top X + \lambda I\big)\alpha + 2\alpha^\top y.$$
Solution:
$$h(x) = \sum_{i=1}^{m} \alpha_i \,\Phi(x_i) \cdot \Phi(x),
\qquad \text{with } \alpha = \big(X^\top X + \lambda I\big)^{-1} y.$$
Computational complexity of the dual: $O(\kappa m^2 + m^3)$ to compute the solution and $O(\kappa m)$ per prediction, where $\kappa$ denotes the cost of one kernel (inner product) evaluation.
Equivalently, in terms of the kernel matrix $K$,
$$\max_{\alpha \in \mathbb{R}^m} \; -\alpha^\top (K + \lambda I)\alpha + 2\alpha^\top y.$$
Solution:
$$h(x) = \sum_{i=1}^{m} \alpha_i K(x_i, x),
\qquad \text{with } \alpha = (K + \lambda I)^{-1} y.$$
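A hedged numpy sketch of kernel ridge regression in this dual form: the fit step computes $\alpha = (K + \lambda I)^{-1} y$ and prediction evaluates $h(x) = \sum_i \alpha_i K(x_i, x)$. The Gaussian kernel and the parameter names (lam, sigma) are illustrative choices, not fixed by the slide.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2)); rows of A, B are points."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

def krr_fit(X_train, y, lam, sigma=1.0):
    """Dual coefficients alpha = (K + lam * I)^{-1} y."""
    K = gaussian_kernel(X_train, X_train, sigma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def krr_predict(X_train, alpha, X_test, sigma=1.0):
    """h(x) = sum_i alpha_i K(x_i, x) for each row x of X_test."""
    return gaussian_kernel(X_test, X_train, sigma) @ alpha
```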
SVR - Equivalent Formulation
Optimization problem:
$$\min_{w, b, \xi, \xi'} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi'_i)$$
subject to:
$$\big(w \cdot \Phi(x_i) + b\big) - y_i \leq \epsilon + \xi_i,$$
$$y_i - \big(w \cdot \Phi(x_i) + b\big) \leq \epsilon + \xi'_i,$$
$$\xi_i \geq 0, \quad \xi'_i \geq 0.$$
Solution:
$$h(x) = \sum_{i=1}^{m} (\alpha'_i - \alpha_i) K(x_i, x) + b,$$
with
$$b = -\sum_{j=1}^{m} (\alpha'_j - \alpha_j) K(x_j, x_i) + y_i + \epsilon \qquad \text{when } 0 < \alpha_i < C.$$
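In practice, this $\epsilon$-insensitive formulation is what scikit-learn's SVR estimator solves; a brief usage sketch (assuming scikit-learn is available; the kernel and the values of C and epsilon are arbitrary, not taken from the lecture):

```python
import numpy as np
from sklearn.svm import SVR

# Toy 1-D regression data.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

# epsilon is the tube width, C the trade-off constant from the objective above.
model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X, y)
print(model.predict(X[:5]))
```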
[Figure: the Huber loss and the $\epsilon$-insensitive loss as functions of $x$.]
Huber loss:
$$x \mapsto \begin{cases} x^2 & \text{if } |x| \leq c \\ 2c|x| - c^2 & \text{otherwise.} \end{cases}$$
$\epsilon$-insensitive loss:
$$x \mapsto \max(0, |x| - \epsilon).$$
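A minimal numpy sketch of the two losses in the figure, exactly as written on the slide (note this Huber variant equals twice the more common scaling x^2/2, c|x| - c^2/2):

```python
import numpy as np

def huber_loss(x, c=1.0):
    """Slide's Huber-type loss: x^2 if |x| <= c, else 2c|x| - c^2 (continuous at |x| = c)."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= c, x**2, 2.0 * c * np.abs(x) - c**2)

def eps_insensitive_loss(x, eps=1.0):
    """epsilon-insensitive loss: max(0, |x| - eps)."""
    x = np.asarray(x, dtype=float)
    return np.maximum(0.0, np.abs(x) - eps)
```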
SVR - Quadratic Loss
Optimization problem:
$$\max_{\alpha, \alpha'} \; -\epsilon (\alpha' + \alpha)^\top \mathbf{1} + (\alpha' - \alpha)^\top y - \frac{1}{2} (\alpha' - \alpha)^\top \Big(K + \frac{1}{C} I\Big) (\alpha' - \alpha)$$
subject to: $(\alpha \geq 0) \wedge (\alpha' \geq 0) \wedge \big((\alpha' - \alpha)^\top \mathbf{1} = 0\big)$.
Solution:
$$h(x) = \sum_{i=1}^{m} (\alpha'_i - \alpha_i) K(x_i, x) + b,$$
with
$$b = -\sum_{j=1}^{m} (\alpha'_j - \alpha_j) K(x_j, x_i) + y_i + \epsilon \qquad \text{when } 0 < \alpha_i.$$
WidrowHoff(w0)
   w1 ← w0   (typically w0 = 0)
   for t ← 1 to T do
      Receive(x_t)
      ŷ_t ← w_t · x_t
      Receive(y_t)
      w_{t+1} ← w_t − 2η (w_t · x_t − y_t) x_t   (learning rate η > 0)
   return w_{T+1}
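A direct numpy transcription (a sketch, not the lecture's code) of the Widrow-Hoff / LMS update as reconstructed above; eta is the learning rate $\eta$, and the on-line stream is simulated by iterating over arrays.

```python
import numpy as np

def widrow_hoff(xs, ys, eta=0.01, w0=None):
    """Widrow-Hoff (LMS): after each example, w <- w - 2*eta*(w . x - y) * x."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    w = np.zeros(xs.shape[1]) if w0 is None else np.asarray(w0, dtype=float)
    for x, y in zip(xs, ys):
        y_hat = w @ x                        # prediction before the label is revealed
        w = w - 2.0 * eta * (y_hat - y) * x  # gradient step on the squared loss (w.x - y)^2
    return w
```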
DualSVR()
   α ← 0
   α' ← 0
   for t ← 1 to T do
      Receive(x_t)
      ŷ_t ← Σ_{s=1}^{T} (α'_s − α_s) K(x_s, x_t)
      Receive(y_t)
      α'_t ← α'_t + min(max(η (y_t − ŷ_t), −α'_t), C − α'_t)
      α_t ← α_t + min(max(η (ŷ_t − y_t), −α_t), C − α_t)
   return Σ_{t=1}^{T} (α'_t − α_t) K(x_t, ·)
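A numpy sketch of the on-line dual SVR updates as reconstructed above; the clipping keeps each dual variable in [0, C]. The kernel, the learning rate eta, and C are placeholders, and the sign convention matches h(x) = Σ_t (α'_t − α_t) K(x_t, x).

```python
import numpy as np

def dual_svr_online(xs, ys, kernel, eta=0.1, C=1.0):
    """On-line dual SVR sketch: one pass over the data, updating the t-th dual pair at round t."""
    T = len(xs)
    alpha = np.zeros(T)      # alpha_t
    alpha_p = np.zeros(T)    # alpha'_t
    K = np.array([[kernel(xi, xj) for xj in xs] for xi in xs])
    for t in range(T):
        y_hat = (alpha_p - alpha) @ K[:, t]
        # Clipped updates keeping alpha'_t and alpha_t within [0, C].
        alpha_p[t] += np.clip(eta * (ys[t] - y_hat), -alpha_p[t], C - alpha_p[t])
        alpha[t] += np.clip(eta * (y_hat - ys[t]), -alpha[t], C - alpha[t])
    # Resulting hypothesis h(x) = sum_t (alpha'_t - alpha_t) K(x_t, x).
    return lambda x: float(sum((alpha_p[t] - alpha[t]) * kernel(xs[t], x) for t in range(T)))
```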
[Figure: comparison of L1 regularization and L2 regularization.]
References
• Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32(2):407-499, 2004.
• Arthur Hoerl and Robert Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55-67, 1970.
• Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge regression learning algorithm in dual variables. In Proceedings of ICML 1998, pages 515-521, 1998.
• Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.
• Bernard Widrow and Ted Hoff. Adaptive switching circuits. In Neurocomputing: Foundations of Research, pages 123-134. MIT Press, 1988.