
Foundations of Machine Learning

Regression
Regression Problem
Training data: sample drawn i.i.d. from the set X according to some distribution D,
$$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m,$$
where $Y \subseteq \mathbb{R}$ is a measurable subset.
Loss function: $L \colon Y \times Y \to \mathbb{R}_+$, a measure of closeness, typically
$L(y, y') = (y' - y)^2$ or $L(y, y') = |y' - y|^p$ for some $p \geq 1$.
Problem: find a hypothesis $h \colon X \to \mathbb{R}$ in $H$ with small
generalization error with respect to the target $f$:
$$R_D(h) = \mathop{\mathrm{E}}_{x \sim D}\big[L\big(h(x), f(x)\big)\big].$$
Notes
Empirical error:
$$\widehat{R}_S(h) = \frac{1}{m} \sum_{i=1}^m L\big(h(x_i), y_i\big).$$

In much of what follows:

• $Y = \mathbb{R}$ or $Y = [-M, M]$ for some $M > 0$.

• $L(y, y') = (y' - y)^2$, the mean squared error.
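As a small concrete illustration, the empirical error with the squared loss is just a sample average; the data and the hypothesis in this sketch are placeholders, not part of the lecture:

import numpy as np

# Placeholder sample S = ((x_1, y_1), ..., (x_m, y_m)) and hypothesis h.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=50)
h = lambda X: X @ np.array([0.9, 2.1, -1.0])   # some fixed hypothesis h

# Empirical error with L(y', y) = (y' - y)^2: mean of the squared residuals.
R_hat = np.mean((h(X) - y) ** 2)
print(round(R_hat, 4))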


This Lecture
Generalization bounds
Linear regression
Kernel ridge regression
Support vector regression
Lasso



Generalization Bound - Finite H
Theorem: let H be a finite hypothesis set, and assume that L is bounded
by M. Then, for any $\delta > 0$, with probability at least $1 - \delta$,
$$\forall h \in H, \quad R(h) \leq \widehat{R}(h) + M\sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}}.$$
Proof: By the union bound,
$$\Pr\Big[\sup_{h \in H} \big|R(h) - \widehat{R}(h)\big| > \epsilon\Big] \leq \sum_{h \in H} \Pr\Big[\big|R(h) - \widehat{R}(h)\big| > \epsilon\Big].$$
By Hoeffding's bound, for a fixed h,
$$\Pr\Big[\big|R(h) - \widehat{R}(h)\big| > \epsilon\Big] \leq 2\, e^{-\frac{2m\epsilon^2}{M^2}}.$$
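For completeness, the remaining step of the argument (a standard calculation not spelled out on the slide) combines the two displays and sets the right-hand side equal to $\delta$:
$$\Pr\Big[\sup_{h \in H} \big|R(h) - \widehat{R}(h)\big| > \epsilon\Big] \leq 2|H|\, e^{-\frac{2m\epsilon^2}{M^2}} = \delta
\iff \epsilon = M\sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}},$$
which yields the stated bound.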
Rademacher Complexity of Lp Loss
Theorem: Let $p \geq 1$ and $H_p = \{x \mapsto |h(x) - f(x)|^p : h \in H\}$.
Assume that $\sup_{x \in X, h \in H} |h(x) - f(x)| \leq M$. Then, for
any sample S of size m,
$$\widehat{\mathfrak{R}}_S(H_p) \leq p M^{p-1}\, \widehat{\mathfrak{R}}_S(H).$$


Proof
Proof: Let $H' = \{x \mapsto h(x) - f(x) : h \in H\}$. Then, observe that
$H_p = \{\phi \circ h : h \in H'\}$ with $\phi \colon x \mapsto |x|^p$.
• $\phi$ is $pM^{p-1}$-Lipschitz over $[-M, M]$, thus, by the contraction lemma,
$$\widehat{\mathfrak{R}}_S(H_p) \leq p M^{p-1}\, \widehat{\mathfrak{R}}_S(H').$$

• Next, observe that, since $\mathrm{E}_\sigma[\sigma_i] = 0$ for all i,
$$\widehat{\mathfrak{R}}_S(H') = \frac{1}{m} \mathop{\mathrm{E}}_\sigma\Big[\sup_{h \in H} \sum_{i=1}^m \sigma_i h(x_i) - \sum_{i=1}^m \sigma_i f(x_i)\Big]
= \frac{1}{m} \mathop{\mathrm{E}}_\sigma\Big[\sup_{h \in H} \sum_{i=1}^m \sigma_i h(x_i)\Big] + \frac{1}{m} \mathop{\mathrm{E}}_\sigma\Big[\sum_{i=1}^m -\sigma_i f(x_i)\Big]
= \widehat{\mathfrak{R}}_S(H).$$


Rad. Complexity Regression Bound
Theorem: Let $p \geq 1$ and assume that $\|h - f\|_\infty \leq M$ for all
$h \in H$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,
for all $h \in H$,
$$\mathrm{E}\big[|h(x) - f(x)|^p\big] \leq \frac{1}{m}\sum_{i=1}^m |h(x_i) - f(x_i)|^p + 2pM^{p-1}\, \mathfrak{R}_m(H) + M^p \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$$
$$\mathrm{E}\big[|h(x) - f(x)|^p\big] \leq \frac{1}{m}\sum_{i=1}^m |h(x_i) - f(x_i)|^p + 2pM^{p-1}\, \widehat{\mathfrak{R}}_S(H) + 3M^p \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$

Proof: Follows directly from the bound on the Rademacher complexity of the
$L_p$ loss and the general Rademacher complexity bound.


Notes
As discussed for binary classification:
• estimating the Rademacher complexity can be
computationally hard for some Hs.
• can we come up instead with a combinatorial
measure that is easier to compute?



Shattering
Definition: Let G be a family of functions mapping
from X to $\mathbb{R}$. $A = \{x_1, \ldots, x_m\}$ is shattered by G if
there exist $t_1, \ldots, t_m \in \mathbb{R}$ such that
$$\left|\left\{ \begin{pmatrix} \mathrm{sgn}\big(g(x_1) - t_1\big) \\ \vdots \\ \mathrm{sgn}\big(g(x_m) - t_m\big) \end{pmatrix} : g \in G \right\}\right| = 2^m.$$

(Figure: two points $x_1, x_2$ with thresholds $t_1, t_2$ illustrating shattering.)
Pseudo-Dimension
(Pollard, 1984)
Definition: Let G be a family of functions mapping
from X to $\mathbb{R}$. The pseudo-dimension of G, $\mathrm{Pdim}(G)$,
is the size of the largest set shattered by G.
Definition (equivalent, see also (Vapnik, 1995)):
$$\mathrm{Pdim}(G) = \mathrm{VCdim}\Big(\big\{(x, t) \mapsto 1_{(g(x) - t) > 0} : g \in G\big\}\Big).$$


Pseudo-Dimension - Properties
Theorem: Pseudo-dimension of hyperplanes.
$$\mathrm{Pdim}\big(\{x \mapsto w \cdot x + b : w \in \mathbb{R}^N, b \in \mathbb{R}\}\big) = N + 1.$$
Theorem: Pseudo-dimension of a vector space of real-valued functions H:
$$\mathrm{Pdim}(H) = \dim(H).$$


Generalization Bounds
Classification → Regression.
Lemma (Lebesgue integral): for $f \geq 0$ measurable,
$$\mathop{\mathrm{E}}_{D}[f(x)] = \int_0^{+\infty} \Pr_{D}[f(x) > t]\, dt.$$

Assume that the loss function L is bounded by M. Then,
$$\big|R(h) - \widehat{R}(h)\big| = \left|\int_0^M \Big( \Pr_{x \sim D}\big[L(h(x), f(x)) > t\big] - \Pr_{x \sim S}\big[L(h(x), f(x)) > t\big] \Big)\, dt \right|$$
$$\leq M \sup_{t \in [0, M]} \Big| \Pr_{x \sim D}\big[L(h(x), f(x)) > t\big] - \Pr_{x \sim S}\big[L(h(x), f(x)) > t\big] \Big|$$
$$= M \sup_{t \in [0, M]} \Big| \mathop{\mathrm{E}}_{x \sim D}\big[1_{L(h(x), f(x)) > t}\big] - \mathop{\mathrm{E}}_{x \sim S}\big[1_{L(h(x), f(x)) > t}\big] \Big|.$$

Thus,
$$\Pr\Big[\sup_{h \in H} \big|R(h) - \widehat{R}(h)\big| > \epsilon\Big] \leq \Pr\Big[\sup_{h \in H,\, t \in [0, M]} \big| R(1_{L(h, f) > t}) - \widehat{R}(1_{L(h, f) > t}) \big| > \frac{\epsilon}{M}\Big],$$
which reduces the problem to a standard classification generalization bound.
Generalization Bound - Pdim
Theorem: Let H be a family of real-valued functions.
Assume that $\mathrm{Pdim}(\{L(h, f) : h \in H\}) = d < \infty$ and that
the loss L is bounded by M. Then, for any $\delta > 0$, with
probability at least $1 - \delta$, for any $h \in H$,
$$R(h) \leq \widehat{R}(h) + M\sqrt{\frac{2d \log\frac{em}{d}}{m}} + M\sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$

Proof: follows from the observation of the previous slide and the
VCdim bound for indicator functions of lecture 3.


Notes
Pdim bounds also hold in the unbounded case, modulo additional
assumptions: existence of an envelope function, or moment assumptions.
Other relevant capacity measures:
• covering numbers.
• packing numbers.
• fat-shattering dimension.


This Lecture
Generalization bounds
Linear regression
Kernel ridge regression
Support vector regression
Lasso



Linear Regression
Feature mapping $\Phi \colon X \to \mathbb{R}^N$.
Hypothesis set: linear functions,
$$\{x \mapsto w \cdot \Phi(x) + b : w \in \mathbb{R}^N, b \in \mathbb{R}\}.$$
Optimization problem: empirical risk minimization,
$$\min_{w, b} F(w, b) = \frac{1}{m} \sum_{i=1}^m \big(w \cdot \Phi(x_i) + b - y_i\big)^2.$$

(Figure: sample points and fitted regression line in the $(\Phi(x), y)$ plane.)
Linear Regression - Solution
Rewrite the objective function as
$$F(W) = \frac{1}{m}\big\|\mathbf{X}^\top W - Y\big\|^2,$$
with
$$\mathbf{X} = \begin{pmatrix} \Phi(x_1) & \cdots & \Phi(x_m) \\ 1 & \cdots & 1 \end{pmatrix} \in \mathbb{R}^{(N+1) \times m}, \quad
W = \begin{pmatrix} w_1 \\ \vdots \\ w_N \\ b \end{pmatrix}, \quad
Y = \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix}.$$
F is a convex and differentiable function.
$$\nabla F(W) = \frac{2}{m}\, \mathbf{X}\big(\mathbf{X}^\top W - Y\big).$$
$$\nabla F(W) = 0 \iff \mathbf{X}\big(\mathbf{X}^\top W - Y\big) = 0 \iff \mathbf{X}\mathbf{X}^\top W = \mathbf{X} Y.$$


Linear Regression - Solution
Solution:
$$W = \begin{cases} (\mathbf{X}\mathbf{X}^\top)^{-1}\,\mathbf{X} Y & \text{if } \mathbf{X}\mathbf{X}^\top \text{ is invertible,} \\ (\mathbf{X}\mathbf{X}^\top)^{\dagger}\,\mathbf{X} Y & \text{in general.} \end{cases}$$

• Computational complexity: $O(mN^2 + N^3)$ if matrix inversion is in $O(N^3)$.

• Poor guarantees in general, no regularization.

• For output labels in $\mathbb{R}^p$, $p > 1$, solve p distinct linear regression problems.
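A minimal NumPy sketch of the closed-form solution above, using the pseudo-inverse to cover the non-invertible case; the identity feature map and the synthetic data are assumptions made only for this illustration:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: m points in N dimensions, linear target plus noise.
m, N = 200, 5
Phi = rng.normal(size=(m, N))                  # rows are Phi(x_i)
w_true, b_true = rng.normal(size=N), 0.5
y = Phi @ w_true + b_true + 0.1 * rng.normal(size=m)

# Augment with a constant feature: X has shape (N+1) x m, as on the slide.
X = np.vstack([Phi.T, np.ones((1, m))])

# W = (X X^T)^+ X Y.
W = np.linalg.pinv(X @ X.T) @ (X @ y)
w_hat, b_hat = W[:-1], W[-1]
print("recovered w:", np.round(w_hat, 2), " b:", round(b_hat, 2))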


This Lecture
Generalization bounds
Linear regression
Kernel ridge regression
Support vector regression
Lasso



Mean Square Bound - Kernel-Based Hypotheses

Theorem: Let $K \colon X \times X \to \mathbb{R}$ be a PDS kernel and
let $\Phi \colon X \to \mathbb{H}$ be a feature mapping associated to K.
Let $H = \{x \mapsto w \cdot \Phi(x) : \|w\|_{\mathbb{H}} \leq \Lambda\}$. Assume $K(x, x) \leq R^2$
and $|f(x)| \leq \Lambda R$ for all $x \in X$. Then, for any $\delta > 0$, with
probability at least $1 - \delta$, for any $h \in H$,
$$R(h) \leq \widehat{R}(h) + \frac{8 R^2 \Lambda^2}{\sqrt{m}}\left(1 + \frac{1}{2}\sqrt{\frac{\log\frac{1}{\delta}}{2}}\right),$$
$$R(h) \leq \widehat{R}(h) + \frac{8 R^2 \Lambda^2}{\sqrt{m}}\left(\sqrt{\frac{\mathrm{Tr}[\mathbf{K}]}{m R^2}} + \frac{3}{4}\sqrt{2\log\frac{2}{\delta}}\right).$$


Mean Square Bound - Kernel-Based Hypotheses

Proof: direct application of the Rademacher complexity regression bound
(this lecture) and of the bound on the Rademacher complexity of
kernel-based hypotheses (lecture 5):
$$\widehat{\mathfrak{R}}_S(H) \leq \frac{\Lambda\sqrt{\mathrm{Tr}[\mathbf{K}]}}{m} \leq \sqrt{\frac{R^2\Lambda^2}{m}}.$$


Ridge Regression
(Hoerl and Kennard, 1970)
Optimization problem:
$$\min_{w, b} F(w, b) = \lambda \|w\|^2 + \sum_{i=1}^m \big(w \cdot \Phi(x_i) + b - y_i\big)^2,$$
where $\lambda \geq 0$ is a (regularization) parameter.

• directly based on generalization bound.

• generalization of linear regression.
• closed-form solution.
• can be used with kernels.
Ridge Regression - Solution
Assume $b = 0$: often a constant feature is used instead (but this is not
equivalent to the use of the original offset!).
Rewrite the objective function as
$$F(W) = \lambda \|W\|^2 + \big\|\mathbf{X}^\top W - Y\big\|^2.$$
F is a convex and differentiable function.
$$\nabla F(W) = 2\lambda W + 2\mathbf{X}\big(\mathbf{X}^\top W - Y\big).$$
$$\nabla F(W) = 0 \iff \big(\mathbf{X}\mathbf{X}^\top + \lambda I\big) W = \mathbf{X} Y.$$

Solution: $W = \big(\mathbf{X}\mathbf{X}^\top + \lambda I\big)^{-1} \mathbf{X} Y$;
the matrix $\mathbf{X}\mathbf{X}^\top + \lambda I$ is always invertible (for $\lambda > 0$).
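A minimal NumPy sketch of this closed-form primal solution; the function name and the random data are assumptions for illustration:

import numpy as np

def ridge_primal(Phi, y, lam):
    """W = (X X^T + lam*I)^(-1) X Y, with X's columns the feature vectors.

    Phi: (m, N) array whose rows are Phi(x_i); y: (m,) labels;
    lam: regularization parameter (> 0). Assumes b = 0, as on the slide.
    """
    X = Phi.T                                   # shape (N, m)
    N = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(N), X @ y)

# Hypothetical usage on random data.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(100, 3))
y = Phi @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=100)
print(np.round(ridge_primal(Phi, y, lam=0.1), 2))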
Ridge Regression - Equivalent Formulations
Optimization problem:
$$\min_{w, b} \sum_{i=1}^m \big(w \cdot \Phi(x_i) + b - y_i\big)^2 \quad \text{subject to: } \|w\|^2 \leq \Lambda^2.$$

Optimization problem:
$$\min_{w, b} \sum_{i=1}^m \xi_i^2 \quad \text{subject to: } \xi_i = y_i - \big(w \cdot \Phi(x_i) + b\big),\ \ \|w\|^2 \leq \Lambda^2.$$
Ridge Regression Equations
Lagrangian: assume $b = 0$. For all $\xi, w, \alpha, \lambda \geq 0$,
$$\mathcal{L}(\xi, w, \alpha, \lambda) = \sum_{i=1}^m \xi_i^2 + \sum_{i=1}^m \alpha_i\big(y_i - \xi_i - w \cdot \Phi(x_i)\big) + \lambda\big(\|w\|^2 - \Lambda^2\big).$$

KKT conditions:
$$\nabla_w \mathcal{L} = -\sum_{i=1}^m \alpha_i \Phi(x_i) + 2\lambda w = 0 \iff w = \frac{1}{2\lambda}\sum_{i=1}^m \alpha_i \Phi(x_i).$$
$$\nabla_{\xi_i} \mathcal{L} = 2\xi_i - \alpha_i = 0 \iff \xi_i = \alpha_i / 2.$$
$$\forall i \in [1, m], \quad \alpha_i\big(y_i - \xi_i - w \cdot \Phi(x_i)\big) = 0,$$
$$\lambda\big(\|w\|^2 - \Lambda^2\big) = 0.$$


Moving to The Dual
Plugging in the expressions of w and the $\xi_i$'s gives
$$\mathcal{L} = \sum_{i=1}^m \frac{\alpha_i^2}{4} + \sum_{i=1}^m \alpha_i y_i - \sum_{i=1}^m \frac{\alpha_i^2}{2} - \frac{1}{2\lambda}\sum_{i,j=1}^m \alpha_i \alpha_j\, \Phi(x_i) \cdot \Phi(x_j) + \lambda\Big\|\frac{1}{2\lambda}\sum_{i=1}^m \alpha_i \Phi(x_i)\Big\|^2 - \lambda\Lambda^2.$$

Thus,
$$\mathcal{L} = -\frac{1}{4}\sum_{i=1}^m \alpha_i^2 + \sum_{i=1}^m \alpha_i y_i - \frac{1}{4\lambda}\sum_{i,j=1}^m \alpha_i \alpha_j\, \Phi(x_i) \cdot \Phi(x_j) - \lambda\Lambda^2$$
$$= \lambda\Big[-\lambda\sum_{i=1}^m \alpha_i'^2 + 2\sum_{i=1}^m \alpha_i' y_i - \sum_{i,j=1}^m \alpha_i' \alpha_j'\, \Phi(x_i) \cdot \Phi(x_j)\Big] - \lambda\Lambda^2,$$

with $\alpha_i = 2\lambda \alpha_i'$.


RR - Dual Optimization Problem
Optimization problem:
$$\max_{\alpha' \in \mathbb{R}^m} -\lambda \alpha'^\top \alpha' + 2\alpha'^\top y - \alpha'^\top \big(\mathbf{X}^\top \mathbf{X}\big)\alpha'
\quad \text{or} \quad
\max_{\alpha' \in \mathbb{R}^m} -\alpha'^\top \big(\mathbf{X}^\top \mathbf{X} + \lambda I\big)\alpha' + 2\alpha'^\top y.$$

Solution:
$$h(x) = \sum_{i=1}^m \alpha_i'\, \Phi(x_i) \cdot \Phi(x),
\quad \text{with } \alpha' = \big(\mathbf{X}^\top \mathbf{X} + \lambda I\big)^{-1} y.$$


Direct Dual Solution
Lemma: The following matrix identity always holds:
$$\big(\mathbf{X}\mathbf{X}^\top + \lambda I\big)^{-1}\mathbf{X} = \mathbf{X}\big(\mathbf{X}^\top\mathbf{X} + \lambda I\big)^{-1}.$$

Proof: Observe that $(\mathbf{X}\mathbf{X}^\top + \lambda I)\mathbf{X} = \mathbf{X}(\mathbf{X}^\top\mathbf{X} + \lambda I)$.
Left-multiplying by $(\mathbf{X}\mathbf{X}^\top + \lambda I)^{-1}$ and right-multiplying by
$(\mathbf{X}^\top\mathbf{X} + \lambda I)^{-1}$ yields the statement.
Dual solution: $\alpha'$ such that $h = \sum_{i=1}^m \alpha_i' K(x_i, \cdot)$, i.e.,
$$W = \sum_{i=1}^m \alpha_i'\, \Phi(x_i) = \mathbf{X}\alpha'.$$
By the lemma, $W = (\mathbf{X}\mathbf{X}^\top + \lambda I)^{-1}\mathbf{X}Y = \mathbf{X}(\mathbf{X}^\top\mathbf{X} + \lambda I)^{-1}Y$.
This gives $\alpha' = (\mathbf{X}^\top\mathbf{X} + \lambda I)^{-1}Y$.
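The matrix identity in the lemma is easy to check numerically; the following small NumPy sketch (random data and λ = 0.1 are assumptions) confirms that the primal and dual ridge solutions coincide:

import numpy as np

rng = np.random.default_rng(2)
N, m, lam = 4, 50, 0.1
X = rng.normal(size=(N, m))      # columns are the feature vectors Phi(x_i)
Y = rng.normal(size=m)

# Primal solution: W = (X X^T + lam*I)^(-1) X Y.
W_primal = np.linalg.solve(X @ X.T + lam * np.eye(N), X @ Y)

# Dual solution: alpha' = (X^T X + lam*I)^(-1) Y, then W = X alpha'.
alpha = np.linalg.solve(X.T @ X + lam * np.eye(m), Y)
W_dual = X @ alpha

print(np.allclose(W_primal, W_dual))   # True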
Computational Complexity

            Solution               Prediction
Primal      $O(mN^2 + N^3)$        $O(N)$
Dual        $O(\kappa m^2 + m^3)$  $O(\kappa m)$

($\kappa$ denotes the cost of one kernel evaluation.)


Kernel Ridge Regression
(Saunders et al., 1998)
Optimization problem:
$$\max_{\alpha' \in \mathbb{R}^m} -\lambda \alpha'^\top \alpha' + 2\alpha'^\top y - \alpha'^\top \mathbf{K} \alpha'
\quad \text{or} \quad
\max_{\alpha' \in \mathbb{R}^m} -\alpha'^\top \big(\mathbf{K} + \lambda I\big)\alpha' + 2\alpha'^\top y.$$

Solution:
$$h(x) = \sum_{i=1}^m \alpha_i' K(x_i, x),
\quad \text{with } \alpha' = \big(\mathbf{K} + \lambda I\big)^{-1} y.$$
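A minimal NumPy sketch of kernel ridge regression; the Gaussian kernel, its width gamma, and the toy data are assumptions made for this illustration:

import numpy as np

def gaussian_kernel_matrix(A, B, gamma=0.5):
    # K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam=0.1, gamma=0.5):
    # alpha' = (K + lam*I)^(-1) y
    K = gaussian_kernel_matrix(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_test, gamma=0.5):
    # h(x) = sum_i alpha'_i K(x_i, x)
    return gaussian_kernel_matrix(X_test, X_train, gamma) @ alpha

# Hypothetical usage: fit a noisy sine curve.
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)
alpha = krr_fit(X, y)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(np.round(krr_predict(X, alpha, X_test), 2))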


Notes
Advantages:
• strong theoretical guarantees.
• generalization to outputs in $\mathbb{R}^p$: single matrix inversion (Cortes et al., 2007).
• use of kernels.
Disadvantages:
• solution not sparse.
• training time for large matrices: use low-rank approximations of the
kernel matrix, e.g., Nyström approximation or partial Cholesky decomposition.
This Lecture
Generalization bounds
Linear regression
Kernel ridge regression
Support vector regression
Lasso



Support Vector Regression
(Vapnik, 1995)
Hypothesis set:
$$\{x \mapsto w \cdot \Phi(x) + b : w \in \mathbb{R}^N, b \in \mathbb{R}\}.$$
Loss function: $\epsilon$-insensitive loss,
$$L(y, y') = |y' - y|_\epsilon = \max\big(0, |y' - y| - \epsilon\big).$$
Fit a 'tube' of width $\epsilon$ to the data.

(Figure: regression tube of width $\epsilon$ around $w \cdot \Phi(x) + b$ in the $(\Phi(x), y)$ plane.)


Support Vector Regression (SVR)
(Vapnik, 1995)
Optimization problem: similar to that of SVM,
$$\min_{w, b} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \big| y_i - (w \cdot \Phi(x_i) + b) \big|_\epsilon.$$

Equivalent formulation:
$$\min_{w, \xi, \xi'} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m (\xi_i + \xi_i')$$
$$\text{subject to } \quad (w \cdot \Phi(x_i) + b) - y_i \leq \epsilon + \xi_i,$$
$$\qquad\qquad\quad y_i - (w \cdot \Phi(x_i) + b) \leq \epsilon + \xi_i',$$
$$\qquad\qquad\quad \xi_i \geq 0, \ \xi_i' \geq 0.$$


SVR - Dual Optimization Problem
Optimization problem:
$$\max_{\alpha, \alpha'} -\epsilon (\alpha' + \alpha)^\top \mathbf{1} + (\alpha' - \alpha)^\top y - \frac{1}{2}(\alpha' - \alpha)^\top \mathbf{K} (\alpha' - \alpha)$$
$$\text{subject to: } (0 \leq \alpha \leq C) \wedge (0 \leq \alpha' \leq C) \wedge \big((\alpha' - \alpha)^\top \mathbf{1} = 0\big).$$

Solution:
$$h(x) = \sum_{i=1}^m (\alpha_i' - \alpha_i) K(x_i, x) + b,$$
$$\text{with } b = \begin{cases} -\sum_{j=1}^m (\alpha_j' - \alpha_j) K(x_j, x_i) + y_i + \epsilon & \text{when } 0 < \alpha_i < C, \\ -\sum_{j=1}^m (\alpha_j' - \alpha_j) K(x_j, x_i) + y_i - \epsilon & \text{when } 0 < \alpha_i' < C. \end{cases}$$

Support vectors: points strictly outside the tube.
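In practice the dual QP is handled by off-the-shelf solvers; a brief scikit-learn sketch is shown below. The library call, the RBF kernel, and the parameter values are assumptions for illustration, not part of the lecture.

import numpy as np
from sklearn.svm import SVR

# Toy data: noisy sine curve.
rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

# epsilon controls the tube width, C the trade-off with the slack penalties.
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=0.5)
svr.fit(X, y)

print("number of support vectors:", len(svr.support_))
print(np.round(svr.predict([[0.0], [1.5]]), 2))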


Notes
Advantages:
• strong theoretical guarantees (for that loss).
• sparser solution.
• use of kernels.
Disadvantages:
• selection of two parameters: C and $\epsilon$. Heuristics: search for C
near the maximum value of $|y|$, for $\epsilon$ near the average difference
between the y's, or based on a measure of the number of SVs.
• large matrices: low-rank approximations of the kernel matrix.
Alternative Loss Functions
Quadratic $\epsilon$-insensitive: $x \mapsto \max(0, |x| - \epsilon)^2$.

Huber: $x \mapsto \begin{cases} x^2 & \text{if } |x| \leq c, \\ 2c|x| - c^2 & \text{otherwise.} \end{cases}$

$\epsilon$-insensitive: $x \mapsto \max(0, |x| - \epsilon)$.

(Figure: the three loss functions plotted as a function of x.)
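For reference, a small NumPy sketch of the three losses above; the values $\epsilon = 1$ and $c = 1$ are arbitrary choices for the example:

import numpy as np

def eps_insensitive(x, eps=1.0):
    return np.maximum(0.0, np.abs(x) - eps)

def quadratic_eps_insensitive(x, eps=1.0):
    return np.maximum(0.0, np.abs(x) - eps) ** 2

def huber(x, c=1.0):
    return np.where(np.abs(x) <= c, x ** 2, 2 * c * np.abs(x) - c ** 2)

x = np.linspace(-4, 4, 9)
print(np.round(eps_insensitive(x), 2))
print(np.round(quadratic_eps_insensitive(x), 2))
print(np.round(huber(x), 2))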
SVR - Quadratic Loss
Optimization problem:
$$\max_{\alpha, \alpha'} -\epsilon (\alpha' + \alpha)^\top \mathbf{1} + (\alpha' - \alpha)^\top y - \frac{1}{2}(\alpha' - \alpha)^\top \Big(\mathbf{K} + \frac{1}{C} I\Big)(\alpha' - \alpha)$$
$$\text{subject to: } (\alpha \geq 0) \wedge (\alpha' \geq 0) \wedge \big((\alpha' - \alpha)^\top \mathbf{1} = 0\big).$$

Solution:
$$h(x) = \sum_{i=1}^m (\alpha_i' - \alpha_i) K(x_i, x) + b,$$
$$\text{with } b = \begin{cases} -\sum_{j=1}^m (\alpha_j' - \alpha_j) K(x_j, x_i) + y_i + \epsilon & \text{when } \alpha_i > 0 \wedge \alpha_i' = 0, \\ -\sum_{j=1}^m (\alpha_j' - \alpha_j) K(x_j, x_i) + y_i - \epsilon & \text{when } \alpha_i' > 0 \wedge \alpha_i = 0. \end{cases}$$

Support vectors: points strictly outside the tube.

For $\epsilon = 0$, coincides with KRR.
ε-Insensitive Bound - Kernel-Based Hypotheses

Theorem: Let $K \colon X \times X \to \mathbb{R}$ be a PDS kernel and
let $\Phi \colon X \to \mathbb{H}$ be a feature mapping associated to K.
Let $H = \{x \mapsto w \cdot \Phi(x) : \|w\|_{\mathbb{H}} \leq \Lambda\}$. Assume $K(x, x) \leq R^2$
and $|f(x)| \leq \Lambda R$ for all $x \in X$. Then, for any $\delta > 0$, with
probability at least $1 - \delta$, for any $h \in H$,
$$\mathrm{E}\big[|h(x) - f(x)|_\epsilon\big] \leq \frac{1}{m}\sum_{i=1}^m |h(x_i) - f(x_i)|_\epsilon + \frac{2\Lambda R}{\sqrt{m}}\left(1 + \sqrt{\frac{\log\frac{1}{\delta}}{2}}\right),$$
$$\mathrm{E}\big[|h(x) - f(x)|_\epsilon\big] \leq \frac{1}{m}\sum_{i=1}^m |h(x_i) - f(x_i)|_\epsilon + \frac{2\Lambda R}{\sqrt{m}}\left(\sqrt{\frac{\mathrm{Tr}[\mathbf{K}]}{m R^2}} + 3\sqrt{\frac{\log\frac{2}{\delta}}{2}}\right).$$


ε-Insensitive Bound - Kernel-Based Hypotheses

Proof: Let $\widetilde{H} = \{x \mapsto |h(x) - f(x)|_\epsilon : h \in H\}$ and let $H'$ be
defined by $H' = \{x \mapsto h(x) - f(x) : h \in H\}$.
• The function $\Phi_\epsilon \colon x \mapsto |x|_\epsilon$ is 1-Lipschitz and $\Phi_\epsilon(0) = 0$.
Thus, by the contraction lemma, $\widehat{\mathfrak{R}}_S(\widetilde{H}) \leq \widehat{\mathfrak{R}}_S(H')$.

• Since $\widehat{\mathfrak{R}}_S(H') = \widehat{\mathfrak{R}}_S(H)$ (see the proof for the
Rademacher complexity of the $L_p$ loss), this shows that
$\widehat{\mathfrak{R}}_S(\widetilde{H}) \leq \widehat{\mathfrak{R}}_S(H)$.
• The rest is a direct application of the Rademacher complexity
regression bound (this lecture).
On-line Regression
On-line version of batch algorithms:

• stochastic gradient descent.


• primal or dual.
Examples:

• Mean squared error function: Widrow-Hoff (or LMS) algorithm
(Widrow and Hoff, 1988).

• SVR $\epsilon$-insensitive (dual) linear or quadratic function: on-line SVR.


Widrow-Hoff
(Widrow and Hoff, 1988)

WidrowHoff(w0)
1   w1 ← w0                                  (typically w0 = 0)
2   for t ← 1 to T do
3       Receive(xt)
4       ŷt ← wt · xt
5       Receive(yt)
6       wt+1 ← wt − 2η(wt · xt − yt) xt      (η > 0)
7   return wT+1
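A runnable Python sketch of the update rule above; the learning rate, the dimension, and the synthetic data stream are placeholders chosen for this illustration:

import numpy as np

def widrow_hoff(stream, dim, w0=None, eta=0.01):
    """Widrow-Hoff / LMS: one gradient step on the squared error per round.

    stream: iterable of (x_t, y_t) pairs with x_t of shape (dim,);
    eta:    learning rate (> 0).
    """
    w = np.zeros(dim) if w0 is None else w0.copy()
    for x_t, y_t in stream:
        w = w - 2 * eta * (w @ x_t - y_t) * x_t
    return w

# Hypothetical usage on a synthetic stream.
rng = np.random.default_rng(5)
w_true = np.array([2.0, -1.0, 0.5])
stream = [(x, x @ w_true + 0.05 * rng.normal())
          for x in rng.normal(size=(500, 3))]
print(np.round(widrow_hoff(stream, dim=3, eta=0.05), 2))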


Dual On-Line SVR
(Vijayakumar and Wu, 1999)
(b = 0)

DualSVR()
1   α ← 0
2   α′ ← 0
3   for t ← 1 to T do
4       Receive(xt)
5       ŷt ← Σs=1..T (α′s − αs) K(xs, xt)
6       Receive(yt)
7       α′t+1 ← α′t + min(max(η(yt − ŷt), −α′t), C − α′t)
8       αt+1 ← αt + min(max(η(ŷt − yt), −αt), C − αt)
9   return Σt=1..T (α′t − αt) K(xt, ·)
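A Python sketch of one natural reading of this pseudocode, in which the dual variables of the example received at round t get a single clipped update; the Gaussian kernel and all parameter values are assumptions for illustration:

import numpy as np

def gaussian_k(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def dual_online_svr(points, labels, C=1.0, eta=0.1, gamma=0.5):
    """On-line dual SVR sketch (b = 0): predict with the current expansion,
    then clip-update the dual variables of the current example to [0, C]."""
    T = len(points)
    alpha = np.zeros(T)       # alpha_t
    alpha_p = np.zeros(T)     # alpha'_t
    for t in range(T):
        x_t, y_t = points[t], labels[t]
        y_hat = sum((alpha_p[s] - alpha[s]) * gaussian_k(points[s], x_t)
                    for s in range(T))
        alpha_p[t] += min(max(eta * (y_t - y_hat), -alpha_p[t]), C - alpha_p[t])
        alpha[t] += min(max(eta * (y_hat - y_t), -alpha[t]), C - alpha[t])
    # Learned hypothesis h(x) = sum_t (alpha'_t - alpha_t) K(x_t, x).
    return lambda x: sum((alpha_p[t] - alpha[t]) * gaussian_k(points[t], x)
                         for t in range(T))

# Hypothetical usage.
rng = np.random.default_rng(6)
X = rng.uniform(-2, 2, size=(60, 1))
y = np.sin(X[:, 0])
h = dual_online_svr(X, y)
print(round(h(np.array([0.5])), 2))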


This Lecture
Generalization bounds
Linear regression
Kernel ridge regression
Support vector regression
Lasso



LASSO
(Tibshirani, 1996)
Optimization problem: ‘least absolute shrinkage
and selection operator’.
$$\min_{w, b} F(w, b) = \lambda \|w\|_1 + \sum_{i=1}^m (w \cdot x_i + b - y_i)^2,$$
where $\lambda \geq 0$ is a (regularization) parameter.

Solution: equivalent to a convex quadratic program (QP).

• general: standard QP solvers.
• specific algorithm: LARS (least angle regression procedure),
entire path of solutions.
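A library-free sketch of one simple solver for this objective, proximal gradient descent (ISTA) with soft-thresholding; it assumes b = 0, and the data, λ, and iteration count are chosen arbitrarily for the example (LARS or coordinate descent would be the usual choices in practice):

import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for min_w lam*||w||_1 + ||Xw - y||^2."""
    L = 2 * np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the smooth part
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ w - y)
        w = soft_threshold(w - grad / L, lam / L)
    return w

# Hypothetical usage: sparse ground truth, many irrelevant features.
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.1 * rng.normal(size=100)
w_hat = lasso_ista(X, y, lam=5.0)
print("nonzero coefficients:", np.flatnonzero(np.abs(w_hat) > 1e-3))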


Sparsity of L1 regularization

(Figure: two panels comparing solutions under L1 regularization and L2 regularization; the L1 constraint induces sparse solutions.)


Sparsity Guarantee
Rademacher complexity of L1-norm bounded
linear hypotheses:
$$\widehat{\mathfrak{R}}_S(H) = \frac{1}{m} \mathop{\mathrm{E}}_\sigma\Big[\sup_{\|w\|_1 \leq \Lambda_1} \sum_{i=1}^m \sigma_i\, w \cdot x_i\Big]$$
$$= \frac{\Lambda_1}{m} \mathop{\mathrm{E}}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i x_i\Big\|_\infty\Big] \qquad \text{(by definition of the dual norm)}$$
$$= \frac{\Lambda_1}{m} \mathop{\mathrm{E}}_\sigma\Big[\max_{j \in [1, N]} \Big|\sum_{i=1}^m \sigma_i x_{ij}\Big|\Big] \qquad \text{(by definition of } \|\cdot\|_\infty)$$
$$= \frac{\Lambda_1}{m} \mathop{\mathrm{E}}_\sigma\Big[\max_{j \in [1, N]} \max_{s \in \{-1, +1\}} s \sum_{i=1}^m \sigma_i x_{ij}\Big] \qquad \text{(by definition of } |\cdot|)$$
$$= \frac{\Lambda_1}{m} \mathop{\mathrm{E}}_\sigma\Big[\sup_{z \in A} \sum_{i=1}^m \sigma_i z_i\Big] \leq \Lambda_1 r_\infty \sqrt{\frac{2\log(2N)}{m}}, \qquad \text{(Massart's lemma)}$$
where $A = \{\pm(x_{1j}, \ldots, x_{mj})^\top : j \in [1, N]\}$ and $r_\infty = \max_{i,j} |x_{ij}|$.


Notes
Advantages:
• theoretical guarantees.
• sparse solution.
• feature selection.
Drawbacks:
• no natural use of kernels.
• no closed-form solution (not necessary, but can
be convenient for theoretical analysis).



Regression
Many other families of algorithms, including:
• neural networks.
• decision trees (see next lecture).
• boosting trees for regression.


References
• Corinna Cortes, Mehryar Mohri, and Jason Weston. A General Regression Framework for
Learning String-to-String Mappings. In Predicting Structured Data. The MIT Press, 2007.

• Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least Angle Regression. Annals of
Statistics, 2004.

• Arthur Hoerl and Robert Kennard. Ridge Regression: biased estimation of nonorthogonal
problems. Technometrics, 12:55-67, 1970.

• C. Saunders, A. Gammerman, and V. Vovk. Ridge Regression Learning Algorithm in Dual
Variables. In ICML '98, pages 515-521, 1998.

• Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society, Series B, 58:267-288, 1996.

• David Pollard. Convergence of Stochastic Processes. Springer, New York, 1984.

• David Pollard. Empirical Processes: Theory and Applications. Institute of Mathematical
Statistics, 1990.


References
• Sethu Vijayakumar and Si Wu. Sequential support vector classifiers and regression. In
Proceedings of the International Conference on Soft Computing (SOCO’99), 1999.

• Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, Berlin, 1982.

• Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

• Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.

• Bernard Widrow and Ted Hoff. Adaptive Switching Circuits. Neurocomputing: foundations of
research, pages 123-134, MIT Press, 1988.
