
Foundations of Machine Learning

Regression
Regression Problem
Training data: sample drawn i.i.d. from the set X according to some distribution D,
$$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m,$$
where $Y \subseteq \mathbb{R}$ is a measurable subset.
Loss function: $L \colon Y \times Y \to \mathbb{R}_+$, a measure of closeness, typically
$L(y, y') = (y' - y)^2$ or $L(y, y') = |y' - y|^p$ for some $p \geq 1$.
Problem: find a hypothesis $h \colon X \to \mathbb{R}$ in $H$ with small
generalization error with respect to the target $f$:
$$R_D(h) = \mathop{\mathrm{E}}_{x \sim D}\big[L\big(h(x), f(x)\big)\big].$$
Notes
Empirical error:
$$\widehat{R}_S(h) = \frac{1}{m} \sum_{i=1}^m L\big(h(x_i), y_i\big).$$

In much of what follows:

• $Y = \mathbb{R}$ or $Y = [-M, M]$ for some $M > 0$.

• $L(y, y') = (y' - y)^2$, the mean squared error.
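As a small concrete illustration, the empirical error with the squared loss is just a sample average; the data and the hypothesis in this sketch are placeholders, not part of the lecture:

import numpy as np

# Placeholder sample S = ((x_1, y_1), ..., (x_m, y_m)) and hypothesis h.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=50)
h = lambda X: X @ np.array([0.9, 2.1, -1.0])   # some fixed hypothesis h

# Empirical error with L(y', y) = (y' - y)^2: mean of the squared residuals.
R_hat = np.mean((h(X) - y) ** 2)
print(round(R_hat, 4))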


This Lecture
Generalization bounds
Linear regression
Kernel ridge regression
Support vector regression
Lasso



Generalization Bound - Finite H
Theorem: let H be a finite hypothesis set, and assume that L is bounded
by M. Then, for any $\delta > 0$, with probability at least $1 - \delta$,
$$\forall h \in H, \quad R(h) \leq \widehat{R}(h) + M\sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}}.$$
Proof: By the union bound,
$$\Pr\Big[\sup_{h \in H} \big|R(h) - \widehat{R}(h)\big| > \epsilon\Big] \leq \sum_{h \in H} \Pr\Big[\big|R(h) - \widehat{R}(h)\big| > \epsilon\Big].$$
By Hoeffding's bound, for a fixed h,
$$\Pr\Big[\big|R(h) - \widehat{R}(h)\big| > \epsilon\Big] \leq 2\, e^{-\frac{2m\epsilon^2}{M^2}}.$$
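For completeness, the remaining step of the argument (a standard calculation not spelled out on the slide) combines the two displays and sets the right-hand side equal to $\delta$:
$$\Pr\Big[\sup_{h \in H} \big|R(h) - \widehat{R}(h)\big| > \epsilon\Big] \leq 2|H|\, e^{-\frac{2m\epsilon^2}{M^2}} = \delta
\iff \epsilon = M\sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}},$$
which yields the stated bound.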
Rademacher Complexity of Lp Loss
Theorem: Let $p \geq 1$ and $H_p = \{x \mapsto |h(x) - f(x)|^p : h \in H\}$.
Assume that $\sup_{x \in X, h \in H} |h(x) - f(x)| \leq M$. Then, for
any sample S of size m,
$$\widehat{\mathfrak{R}}_S(H_p) \leq p M^{p-1}\, \widehat{\mathfrak{R}}_S(H).$$


Proof
Proof: Let $H' = \{x \mapsto h(x) - f(x) : h \in H\}$. Then, observe that
$H_p = \{\phi \circ h : h \in H'\}$ with $\phi \colon x \mapsto |x|^p$.
• $\phi$ is $pM^{p-1}$-Lipschitz over $[-M, M]$, thus, by the contraction lemma,
$$\widehat{\mathfrak{R}}_S(H_p) \leq p M^{p-1}\, \widehat{\mathfrak{R}}_S(H').$$

• Next, observe that, since $\mathrm{E}_\sigma[\sigma_i] = 0$ for all i,
$$\widehat{\mathfrak{R}}_S(H') = \frac{1}{m} \mathop{\mathrm{E}}_\sigma\Big[\sup_{h \in H} \sum_{i=1}^m \sigma_i h(x_i) - \sum_{i=1}^m \sigma_i f(x_i)\Big]
= \frac{1}{m} \mathop{\mathrm{E}}_\sigma\Big[\sup_{h \in H} \sum_{i=1}^m \sigma_i h(x_i)\Big] + \frac{1}{m} \mathop{\mathrm{E}}_\sigma\Big[\sum_{i=1}^m -\sigma_i f(x_i)\Big]
= \widehat{\mathfrak{R}}_S(H).$$


Rad. Complexity Regression Bound
Theorem: Let $p \geq 1$ and assume that $\|h - f\|_\infty \leq M$ for all
$h \in H$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,
for all $h \in H$,
$$\mathrm{E}\big[|h(x) - f(x)|^p\big] \leq \frac{1}{m}\sum_{i=1}^m |h(x_i) - f(x_i)|^p + 2pM^{p-1}\, \mathfrak{R}_m(H) + M^p \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$$
$$\mathrm{E}\big[|h(x) - f(x)|^p\big] \leq \frac{1}{m}\sum_{i=1}^m |h(x_i) - f(x_i)|^p + 2pM^{p-1}\, \widehat{\mathfrak{R}}_S(H) + 3M^p \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$

Proof: Follows directly from the bound on the Rademacher complexity of the
$L_p$ loss and the general Rademacher complexity bound.


Notes
As discussed for binary classification:
• estimating the Rademacher complexity can be
computationally hard for some Hs.
• can we come up instead with a combinatorial
measure that is easier to compute?



Shattering
Definition: Let G be a family of functions mapping
from X to $\mathbb{R}$. $A = \{x_1, \ldots, x_m\}$ is shattered by G if
there exist $t_1, \ldots, t_m \in \mathbb{R}$ such that
$$\left|\left\{ \begin{pmatrix} \mathrm{sgn}\big(g(x_1) - t_1\big) \\ \vdots \\ \mathrm{sgn}\big(g(x_m) - t_m\big) \end{pmatrix} : g \in G \right\}\right| = 2^m.$$

(Figure: two points $x_1, x_2$ with thresholds $t_1, t_2$ illustrating shattering.)
Pseudo-Dimension
(Pollard, 1984)
Definition: Let G be a family of functions mapping
from X to $\mathbb{R}$. The pseudo-dimension of G, $\mathrm{Pdim}(G)$,
is the size of the largest set shattered by G.
Definition (equivalent, see also (Vapnik, 1995)):
$$\mathrm{Pdim}(G) = \mathrm{VCdim}\Big(\big\{(x, t) \mapsto 1_{(g(x) - t) > 0} : g \in G\big\}\Big).$$


Pseudo-Dimension - Properties
Theorem: Pseudo-dimension of hyperplanes.
$$\mathrm{Pdim}\big(\{x \mapsto w \cdot x + b : w \in \mathbb{R}^N, b \in \mathbb{R}\}\big) = N + 1.$$
Theorem: Pseudo-dimension of a vector space of real-valued functions H:
$$\mathrm{Pdim}(H) = \dim(H).$$


Generalization Bounds
Classification → Regression.
Lemma (Lebesgue integral): for $f \geq 0$ measurable,
$$\mathop{\mathrm{E}}_{D}[f(x)] = \int_0^{+\infty} \Pr_{D}[f(x) > t]\, dt.$$

Assume that the loss function L is bounded by M. Then,
$$\big|R(h) - \widehat{R}(h)\big| = \left|\int_0^M \Big( \Pr_{x \sim D}\big[L(h(x), f(x)) > t\big] - \Pr_{x \sim S}\big[L(h(x), f(x)) > t\big] \Big)\, dt \right|$$
$$\leq M \sup_{t \in [0, M]} \Big| \Pr_{x \sim D}\big[L(h(x), f(x)) > t\big] - \Pr_{x \sim S}\big[L(h(x), f(x)) > t\big] \Big|$$
$$= M \sup_{t \in [0, M]} \Big| \mathop{\mathrm{E}}_{x \sim D}\big[1_{L(h(x), f(x)) > t}\big] - \mathop{\mathrm{E}}_{x \sim S}\big[1_{L(h(x), f(x)) > t}\big] \Big|.$$

Thus,
$$\Pr\Big[\sup_{h \in H} \big|R(h) - \widehat{R}(h)\big| > \epsilon\Big] \leq \Pr\Big[\sup_{h \in H,\, t \in [0, M]} \big| R(1_{L(h, f) > t}) - \widehat{R}(1_{L(h, f) > t}) \big| > \frac{\epsilon}{M}\Big],$$
which reduces the problem to a standard classification generalization bound.
Generalization Bound - Pdim
Theorem: Let H be a family of real-valued functions.
Assume that $\mathrm{Pdim}(\{L(h, f) : h \in H\}) = d < \infty$ and that
the loss L is bounded by M. Then, for any $\delta > 0$, with
probability at least $1 - \delta$, for any $h \in H$,
$$R(h) \leq \widehat{R}(h) + M\sqrt{\frac{2d \log\frac{em}{d}}{m}} + M\sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$

Proof: follows from the observation of the previous slide and the
VCdim bound for indicator functions of lecture 3.


Notes
Pdim bounds also hold in the unbounded case, modulo additional
assumptions: existence of an envelope function, or moment assumptions.
Other relevant capacity measures:
• covering numbers.
• packing numbers.
• fat-shattering dimension.


This Lecture
Generalization bounds
Linear regression
Kernel ridge regression
Support vector regression
Lasso



Linear Regression
Feature mapping $\Phi \colon X \to \mathbb{R}^N$.
Hypothesis set: linear functions,
$$\{x \mapsto w \cdot \Phi(x) + b : w \in \mathbb{R}^N, b \in \mathbb{R}\}.$$
Optimization problem: empirical risk minimization,
$$\min_{w, b} F(w, b) = \frac{1}{m} \sum_{i=1}^m \big(w \cdot \Phi(x_i) + b - y_i\big)^2.$$

(Figure: sample points and fitted regression line in the $(\Phi(x), y)$ plane.)
Linear Regression - Solution
Rewrite the objective function as
$$F(W) = \frac{1}{m}\big\|\mathbf{X}^\top W - Y\big\|^2,$$
with
$$\mathbf{X} = \begin{pmatrix} \Phi(x_1) & \cdots & \Phi(x_m) \\ 1 & \cdots & 1 \end{pmatrix} \in \mathbb{R}^{(N+1) \times m}, \quad
W = \begin{pmatrix} w_1 \\ \vdots \\ w_N \\ b \end{pmatrix}, \quad
Y = \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix}.$$
F is a convex and differentiable function.
$$\nabla F(W) = \frac{2}{m}\, \mathbf{X}\big(\mathbf{X}^\top W - Y\big).$$
$$\nabla F(W) = 0 \iff \mathbf{X}\big(\mathbf{X}^\top W - Y\big) = 0 \iff \mathbf{X}\mathbf{X}^\top W = \mathbf{X} Y.$$


Linear Regression - Solution
Solution:
$$W = \begin{cases} (\mathbf{X}\mathbf{X}^\top)^{-1}\,\mathbf{X} Y & \text{if } \mathbf{X}\mathbf{X}^\top \text{ is invertible,} \\ (\mathbf{X}\mathbf{X}^\top)^{\dagger}\,\mathbf{X} Y & \text{in general.} \end{cases}$$

• Computational complexity: $O(mN^2 + N^3)$ if matrix inversion is in $O(N^3)$.

• Poor guarantees in general, no regularization.

• For output labels in $\mathbb{R}^p$, $p > 1$, solve p distinct linear regression problems.
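A minimal NumPy sketch of the closed-form solution above, using the pseudo-inverse to cover the non-invertible case; the identity feature map and the synthetic data are assumptions made only for this illustration:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: m points in N dimensions, linear target plus noise.
m, N = 200, 5
Phi = rng.normal(size=(m, N))                  # rows are Phi(x_i)
w_true, b_true = rng.normal(size=N), 0.5
y = Phi @ w_true + b_true + 0.1 * rng.normal(size=m)

# Augment with a constant feature: X has shape (N+1) x m, as on the slide.
X = np.vstack([Phi.T, np.ones((1, m))])

# W = (X X^T)^+ X Y.
W = np.linalg.pinv(X @ X.T) @ (X @ y)
w_hat, b_hat = W[:-1], W[-1]
print("recovered w:", np.round(w_hat, 2), " b:", round(b_hat, 2))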


This Lecture
Generalization bounds
Linear regression
Kernel ridge regression
Support vector regression
Lasso



Mean Square Bound - Kernel-Based Hypotheses

Theorem: Let $K \colon X \times X \to \mathbb{R}$ be a PDS kernel and
let $\Phi \colon X \to \mathbb{H}$ be a feature mapping associated to K.
Let $H = \{x \mapsto w \cdot \Phi(x) : \|w\|_{\mathbb{H}} \leq \Lambda\}$. Assume $K(x, x) \leq R^2$
and $|f(x)| \leq \Lambda R$ for all $x \in X$. Then, for any $\delta > 0$, with
probability at least $1 - \delta$, for any $h \in H$,
$$R(h) \leq \widehat{R}(h) + \frac{8 R^2 \Lambda^2}{\sqrt{m}}\left(1 + \frac{1}{2}\sqrt{\frac{\log\frac{1}{\delta}}{2}}\right),$$
$$R(h) \leq \widehat{R}(h) + \frac{8 R^2 \Lambda^2}{\sqrt{m}}\left(\sqrt{\frac{\mathrm{Tr}[\mathbf{K}]}{m R^2}} + \frac{3}{4}\sqrt{2\log\frac{2}{\delta}}\right).$$


Mean Square Bound - Kernel-Based Hypotheses

Proof: direct application of the Rademacher complexity regression bound
(this lecture) and of the bound on the Rademacher complexity of
kernel-based hypotheses (lecture 5):
$$\widehat{\mathfrak{R}}_S(H) \leq \frac{\Lambda\sqrt{\mathrm{Tr}[\mathbf{K}]}}{m} \leq \sqrt{\frac{R^2\Lambda^2}{m}}.$$


Ridge Regression
(Hoerl and Kennard, 1970)
Optimization problem:
$$\min_{w, b} F(w, b) = \lambda \|w\|^2 + \sum_{i=1}^m \big(w \cdot \Phi(x_i) + b - y_i\big)^2,$$
where $\lambda \geq 0$ is a (regularization) parameter.

• directly based on generalization bound.

• generalization of linear regression.
• closed-form solution.
• can be used with kernels.
Ridge Regression - Solution
Assume $b = 0$: often a constant feature is used instead (but this is not
equivalent to the use of the original offset!).
Rewrite the objective function as
$$F(W) = \lambda \|W\|^2 + \big\|\mathbf{X}^\top W - Y\big\|^2.$$
F is a convex and differentiable function.
$$\nabla F(W) = 2\lambda W + 2\mathbf{X}\big(\mathbf{X}^\top W - Y\big).$$
$$\nabla F(W) = 0 \iff \big(\mathbf{X}\mathbf{X}^\top + \lambda I\big) W = \mathbf{X} Y.$$

Solution: $W = \big(\mathbf{X}\mathbf{X}^\top + \lambda I\big)^{-1} \mathbf{X} Y$;
the matrix $\mathbf{X}\mathbf{X}^\top + \lambda I$ is always invertible (for $\lambda > 0$).
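A minimal NumPy sketch of this closed-form primal solution; the function name and the random data are assumptions for illustration:

import numpy as np

def ridge_primal(Phi, y, lam):
    """W = (X X^T + lam*I)^(-1) X Y, with X's columns the feature vectors.

    Phi: (m, N) array whose rows are Phi(x_i); y: (m,) labels;
    lam: regularization parameter (> 0). Assumes b = 0, as on the slide.
    """
    X = Phi.T                                   # shape (N, m)
    N = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(N), X @ y)

# Hypothetical usage on random data.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(100, 3))
y = Phi @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=100)
print(np.round(ridge_primal(Phi, y, lam=0.1), 2))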
Ridge Regression - Equivalent Formulations
Optimization problem:
$$\min_{w, b} \sum_{i=1}^m \big(w \cdot \Phi(x_i) + b - y_i\big)^2 \quad \text{subject to: } \|w\|^2 \leq \Lambda^2.$$

Optimization problem:
$$\min_{w, b} \sum_{i=1}^m \xi_i^2 \quad \text{subject to: } \xi_i = y_i - \big(w \cdot \Phi(x_i) + b\big),\ \ \|w\|^2 \leq \Lambda^2.$$
Ridge Regression Equations
Lagrangian: assume $b = 0$. For all $\xi, w, \alpha, \lambda \geq 0$,
$$\mathcal{L}(\xi, w, \alpha, \lambda) = \sum_{i=1}^m \xi_i^2 + \sum_{i=1}^m \alpha_i\big(y_i - \xi_i - w \cdot \Phi(x_i)\big) + \lambda\big(\|w\|^2 - \Lambda^2\big).$$

KKT conditions:
$$\nabla_w \mathcal{L} = -\sum_{i=1}^m \alpha_i \Phi(x_i) + 2\lambda w = 0 \iff w = \frac{1}{2\lambda}\sum_{i=1}^m \alpha_i \Phi(x_i).$$
$$\nabla_{\xi_i} \mathcal{L} = 2\xi_i - \alpha_i = 0 \iff \xi_i = \alpha_i / 2.$$
$$\forall i \in [1, m], \quad \alpha_i\big(y_i - \xi_i - w \cdot \Phi(x_i)\big) = 0,$$
$$\lambda\big(\|w\|^2 - \Lambda^2\big) = 0.$$


Moving to The Dual
Plugging in the expressions of w and the $\xi_i$'s gives
$$\mathcal{L} = \sum_{i=1}^m \frac{\alpha_i^2}{4} + \sum_{i=1}^m \alpha_i y_i - \sum_{i=1}^m \frac{\alpha_i^2}{2} - \frac{1}{2\lambda}\sum_{i,j=1}^m \alpha_i \alpha_j\, \Phi(x_i) \cdot \Phi(x_j) + \lambda\Big\|\frac{1}{2\lambda}\sum_{i=1}^m \alpha_i \Phi(x_i)\Big\|^2 - \lambda\Lambda^2.$$

Thus,
$$\mathcal{L} = -\frac{1}{4}\sum_{i=1}^m \alpha_i^2 + \sum_{i=1}^m \alpha_i y_i - \frac{1}{4\lambda}\sum_{i,j=1}^m \alpha_i \alpha_j\, \Phi(x_i) \cdot \Phi(x_j) - \lambda\Lambda^2$$
$$= \lambda\Big[-\lambda\sum_{i=1}^m \alpha_i'^2 + 2\sum_{i=1}^m \alpha_i' y_i - \sum_{i,j=1}^m \alpha_i' \alpha_j'\, \Phi(x_i) \cdot \Phi(x_j)\Big] - \lambda\Lambda^2,$$

with $\alpha_i = 2\lambda \alpha_i'$.


RR - Dual Optimization Problem
Optimization problem:
$$\max_{\alpha' \in \mathbb{R}^m} -\lambda \alpha'^\top \alpha' + 2\alpha'^\top y - \alpha'^\top \big(\mathbf{X}^\top \mathbf{X}\big)\alpha'
\quad \text{or} \quad
\max_{\alpha' \in \mathbb{R}^m} -\alpha'^\top \big(\mathbf{X}^\top \mathbf{X} + \lambda I\big)\alpha' + 2\alpha'^\top y.$$

Solution:
$$h(x) = \sum_{i=1}^m \alpha_i'\, \Phi(x_i) \cdot \Phi(x),
\quad \text{with } \alpha' = \big(\mathbf{X}^\top \mathbf{X} + \lambda I\big)^{-1} y.$$


Direct Dual Solution
Lemma: The following matrix identity always holds:
$$\big(\mathbf{X}\mathbf{X}^\top + \lambda I\big)^{-1}\mathbf{X} = \mathbf{X}\big(\mathbf{X}^\top\mathbf{X} + \lambda I\big)^{-1}.$$

Proof: Observe that $(\mathbf{X}\mathbf{X}^\top + \lambda I)\mathbf{X} = \mathbf{X}(\mathbf{X}^\top\mathbf{X} + \lambda I)$.
Left-multiplying by $(\mathbf{X}\mathbf{X}^\top + \lambda I)^{-1}$ and right-multiplying by
$(\mathbf{X}^\top\mathbf{X} + \lambda I)^{-1}$ yields the statement.
Dual solution: $\alpha'$ such that $h = \sum_{i=1}^m \alpha_i' K(x_i, \cdot)$, i.e.,
$$W = \sum_{i=1}^m \alpha_i'\, \Phi(x_i) = \mathbf{X}\alpha'.$$
By the lemma, $W = (\mathbf{X}\mathbf{X}^\top + \lambda I)^{-1}\mathbf{X}Y = \mathbf{X}(\mathbf{X}^\top\mathbf{X} + \lambda I)^{-1}Y$.
This gives $\alpha' = (\mathbf{X}^\top\mathbf{X} + \lambda I)^{-1}Y$.
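The matrix identity in the lemma is easy to check numerically; the following small NumPy sketch (random data and λ = 0.1 are assumptions) confirms that the primal and dual ridge solutions coincide:

import numpy as np

rng = np.random.default_rng(2)
N, m, lam = 4, 50, 0.1
X = rng.normal(size=(N, m))      # columns are the feature vectors Phi(x_i)
Y = rng.normal(size=m)

# Primal solution: W = (X X^T + lam*I)^(-1) X Y.
W_primal = np.linalg.solve(X @ X.T + lam * np.eye(N), X @ Y)

# Dual solution: alpha' = (X^T X + lam*I)^(-1) Y, then W = X alpha'.
alpha = np.linalg.solve(X.T @ X + lam * np.eye(m), Y)
W_dual = X @ alpha

print(np.allclose(W_primal, W_dual))   # True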
Computational Complexity

            Solution               Prediction
Primal      $O(mN^2 + N^3)$        $O(N)$
Dual        $O(\kappa m^2 + m^3)$  $O(\kappa m)$

($\kappa$ denotes the cost of one kernel evaluation.)


Kernel Ridge Regression
(Saunders et al., 1998)
Optimization problem:
$$\max_{\alpha' \in \mathbb{R}^m} -\lambda \alpha'^\top \alpha' + 2\alpha'^\top y - \alpha'^\top \mathbf{K} \alpha'
\quad \text{or} \quad
\max_{\alpha' \in \mathbb{R}^m} -\alpha'^\top \big(\mathbf{K} + \lambda I\big)\alpha' + 2\alpha'^\top y.$$

Solution:
$$h(x) = \sum_{i=1}^m \alpha_i' K(x_i, x),
\quad \text{with } \alpha' = \big(\mathbf{K} + \lambda I\big)^{-1} y.$$
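A minimal NumPy sketch of kernel ridge regression; the Gaussian kernel, its width gamma, and the toy data are assumptions made for this illustration:

import numpy as np

def gaussian_kernel_matrix(A, B, gamma=0.5):
    # K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam=0.1, gamma=0.5):
    # alpha' = (K + lam*I)^(-1) y
    K = gaussian_kernel_matrix(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_test, gamma=0.5):
    # h(x) = sum_i alpha'_i K(x_i, x)
    return gaussian_kernel_matrix(X_test, X_train, gamma) @ alpha

# Hypothetical usage: fit a noisy sine curve.
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)
alpha = krr_fit(X, y)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(np.round(krr_predict(X, alpha, X_test), 2))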


Notes
Advantages:
• strong theoretical guarantees.
• generalization to outputs in $\mathbb{R}^p$: single matrix inversion (Cortes et al., 2007).
• use of kernels.
Disadvantages:
• solution not sparse.
• training time for large matrices: use low-rank approximations of the
kernel matrix, e.g., Nyström approximation or partial Cholesky decomposition.
This Lecture
Generalization bounds
Linear regression
Kernel ridge regression
Support vector regression
Lasso



Support Vector Regression
(Vapnik, 1995)
Hypothesis set:
$$\{x \mapsto w \cdot \Phi(x) + b : w \in \mathbb{R}^N, b \in \mathbb{R}\}.$$
Loss function: $\epsilon$-insensitive loss,
$$L(y, y') = |y' - y|_\epsilon = \max\big(0, |y' - y| - \epsilon\big).$$
Fit a 'tube' of width $\epsilon$ to the data.

(Figure: regression tube of width $\epsilon$ around $w \cdot \Phi(x) + b$ in the $(\Phi(x), y)$ plane.)


Support Vector Regression (SVR)
(Vapnik, 1995)
Optimization problem: similar to that of SVM,
$$\min_{w, b} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \big| y_i - (w \cdot \Phi(x_i) + b) \big|_\epsilon.$$

Equivalent formulation:
$$\min_{w, \xi, \xi'} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m (\xi_i + \xi_i')$$
$$\text{subject to } \quad (w \cdot \Phi(x_i) + b) - y_i \leq \epsilon + \xi_i,$$
$$\qquad\qquad\quad y_i - (w \cdot \Phi(x_i) + b) \leq \epsilon + \xi_i',$$
$$\qquad\qquad\quad \xi_i \geq 0, \ \xi_i' \geq 0.$$


SVR - Dual Optimization Problem
Optimization problem:
$$\max_{\alpha, \alpha'} -\epsilon (\alpha' + \alpha)^\top \mathbf{1} + (\alpha' - \alpha)^\top y - \frac{1}{2}(\alpha' - \alpha)^\top \mathbf{K} (\alpha' - \alpha)$$
$$\text{subject to: } (0 \leq \alpha \leq C) \wedge (0 \leq \alpha' \leq C) \wedge \big((\alpha' - \alpha)^\top \mathbf{1} = 0\big).$$

Solution:
$$h(x) = \sum_{i=1}^m (\alpha_i' - \alpha_i) K(x_i, x) + b,$$
$$\text{with } b = \begin{cases} -\sum_{j=1}^m (\alpha_j' - \alpha_j) K(x_j, x_i) + y_i + \epsilon & \text{when } 0 < \alpha_i < C, \\ -\sum_{j=1}^m (\alpha_j' - \alpha_j) K(x_j, x_i) + y_i - \epsilon & \text{when } 0 < \alpha_i' < C. \end{cases}$$

Support vectors: points strictly outside the tube.
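In practice the dual QP is handled by off-the-shelf solvers; a brief scikit-learn sketch is shown below. The library call, the RBF kernel, and the parameter values are assumptions for illustration, not part of the lecture.

import numpy as np
from sklearn.svm import SVR

# Toy data: noisy sine curve.
rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

# epsilon controls the tube width, C the trade-off with the slack penalties.
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=0.5)
svr.fit(X, y)

print("number of support vectors:", len(svr.support_))
print(np.round(svr.predict([[0.0], [1.5]]), 2))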


Notes
Advantages:
• strong theoretical guarantees (for that loss).
• sparser solution.
• use of kernels.
Disadvantages:
• selection of two parameters: C and $\epsilon$. Heuristics: search for C
near the maximum value of $|y|$, for $\epsilon$ near the average difference
between the y's, or based on a measure of the number of SVs.
• large matrices: low-rank approximations of the kernel matrix.
Alternative Loss Functions
Quadratic $\epsilon$-insensitive: $x \mapsto \max(0, |x| - \epsilon)^2$.

Huber: $x \mapsto \begin{cases} x^2 & \text{if } |x| \leq c, \\ 2c|x| - c^2 & \text{otherwise.} \end{cases}$

$\epsilon$-insensitive: $x \mapsto \max(0, |x| - \epsilon)$.

(Figure: the three loss functions plotted as a function of x.)
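For reference, a small NumPy sketch of the three losses above; the values $\epsilon = 1$ and $c = 1$ are arbitrary choices for the example:

import numpy as np

def eps_insensitive(x, eps=1.0):
    return np.maximum(0.0, np.abs(x) - eps)

def quadratic_eps_insensitive(x, eps=1.0):
    return np.maximum(0.0, np.abs(x) - eps) ** 2

def huber(x, c=1.0):
    return np.where(np.abs(x) <= c, x ** 2, 2 * c * np.abs(x) - c ** 2)

x = np.linspace(-4, 4, 9)
print(np.round(eps_insensitive(x), 2))
print(np.round(quadratic_eps_insensitive(x), 2))
print(np.round(huber(x), 2))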
SVR - Quadratic Loss
Optimization problem:
$$\max_{\alpha, \alpha'} -\epsilon (\alpha' + \alpha)^\top \mathbf{1} + (\alpha' - \alpha)^\top y - \frac{1}{2}(\alpha' - \alpha)^\top \Big(\mathbf{K} + \frac{1}{C} I\Big)(\alpha' - \alpha)$$
$$\text{subject to: } (\alpha \geq 0) \wedge (\alpha' \geq 0) \wedge \big((\alpha' - \alpha)^\top \mathbf{1} = 0\big).$$

Solution:
$$h(x) = \sum_{i=1}^m (\alpha_i' - \alpha_i) K(x_i, x) + b,$$
$$\text{with } b = \begin{cases} -\sum_{j=1}^m (\alpha_j' - \alpha_j) K(x_j, x_i) + y_i + \epsilon & \text{when } \alpha_i > 0 \wedge \alpha_i' = 0, \\ -\sum_{j=1}^m (\alpha_j' - \alpha_j) K(x_j, x_i) + y_i - \epsilon & \text{when } \alpha_i' > 0 \wedge \alpha_i = 0. \end{cases}$$

Support vectors: points strictly outside the tube.

For $\epsilon = 0$, coincides with KRR.
ε-Insensitive Bound - Kernel-Based Hypotheses

Theorem: Let $K \colon X \times X \to \mathbb{R}$ be a PDS kernel and
let $\Phi \colon X \to \mathbb{H}$ be a feature mapping associated to K.
Let $H = \{x \mapsto w \cdot \Phi(x) : \|w\|_{\mathbb{H}} \leq \Lambda\}$. Assume $K(x, x) \leq R^2$
and $|f(x)| \leq \Lambda R$ for all $x \in X$. Then, for any $\delta > 0$, with
probability at least $1 - \delta$, for any $h \in H$,
$$\mathrm{E}\big[|h(x) - f(x)|_\epsilon\big] \leq \frac{1}{m}\sum_{i=1}^m |h(x_i) - f(x_i)|_\epsilon + \frac{2\Lambda R}{\sqrt{m}}\left(1 + \sqrt{\frac{\log\frac{1}{\delta}}{2}}\right),$$
$$\mathrm{E}\big[|h(x) - f(x)|_\epsilon\big] \leq \frac{1}{m}\sum_{i=1}^m |h(x_i) - f(x_i)|_\epsilon + \frac{2\Lambda R}{\sqrt{m}}\left(\sqrt{\frac{\mathrm{Tr}[\mathbf{K}]}{m R^2}} + 3\sqrt{\frac{\log\frac{2}{\delta}}{2}}\right).$$


ε-Insensitive Bound - Kernel-Based Hypotheses

Proof: Let $\widetilde{H} = \{x \mapsto |h(x) - f(x)|_\epsilon : h \in H\}$ and let $H'$ be
defined by $H' = \{x \mapsto h(x) - f(x) : h \in H\}$.
• The function $\Phi_\epsilon \colon x \mapsto |x|_\epsilon$ is 1-Lipschitz and $\Phi_\epsilon(0) = 0$.
Thus, by the contraction lemma, $\widehat{\mathfrak{R}}_S(\widetilde{H}) \leq \widehat{\mathfrak{R}}_S(H')$.

• Since $\widehat{\mathfrak{R}}_S(H') = \widehat{\mathfrak{R}}_S(H)$ (see the proof for the
Rademacher complexity of the $L_p$ loss), this shows that
$\widehat{\mathfrak{R}}_S(\widetilde{H}) \leq \widehat{\mathfrak{R}}_S(H)$.
• The rest is a direct application of the Rademacher complexity
regression bound (this lecture).
On-line Regression
On-line version of batch algorithms:

• stochastic gradient descent.


• primal or dual.
Examples:

• Mean squared error function: Widrow-Hoff (or LMS) algorithm
(Widrow and Hoff, 1988).

• SVR $\epsilon$-insensitive (dual) linear or quadratic function: on-line SVR.


Widrow-Hoff
(Widrow and Hoff, 1988)

WidrowHoff(w0)
1   w1 ← w0                                  (typically w0 = 0)
2   for t ← 1 to T do
3       Receive(xt)
4       ŷt ← wt · xt
5       Receive(yt)
6       wt+1 ← wt − 2η(wt · xt − yt) xt      (η > 0)
7   return wT+1
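A runnable Python sketch of the update rule above; the learning rate, the dimension, and the synthetic data stream are placeholders chosen for this illustration:

import numpy as np

def widrow_hoff(stream, dim, w0=None, eta=0.01):
    """Widrow-Hoff / LMS: one gradient step on the squared error per round.

    stream: iterable of (x_t, y_t) pairs with x_t of shape (dim,);
    eta:    learning rate (> 0).
    """
    w = np.zeros(dim) if w0 is None else w0.copy()
    for x_t, y_t in stream:
        w = w - 2 * eta * (w @ x_t - y_t) * x_t
    return w

# Hypothetical usage on a synthetic stream.
rng = np.random.default_rng(5)
w_true = np.array([2.0, -1.0, 0.5])
stream = [(x, x @ w_true + 0.05 * rng.normal())
          for x in rng.normal(size=(500, 3))]
print(np.round(widrow_hoff(stream, dim=3, eta=0.05), 2))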


Dual On-Line SVR
(Vijayakumar and Wu, 1999)
(b = 0)

DualSVR()
1   α ← 0
2   α′ ← 0
3   for t ← 1 to T do
4       Receive(xt)
5       ŷt ← Σs=1..T (α′s − αs) K(xs, xt)
6       Receive(yt)
7       α′t+1 ← α′t + min(max(η(yt − ŷt), −α′t), C − α′t)
8       αt+1 ← αt + min(max(η(ŷt − yt), −αt), C − αt)
9   return Σt=1..T (α′t − αt) K(xt, ·)
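A Python sketch of one natural reading of this pseudocode, in which the dual variables of the example received at round t get a single clipped update; the Gaussian kernel and all parameter values are assumptions for illustration:

import numpy as np

def gaussian_k(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def dual_online_svr(points, labels, C=1.0, eta=0.1, gamma=0.5):
    """On-line dual SVR sketch (b = 0): predict with the current expansion,
    then clip-update the dual variables of the current example to [0, C]."""
    T = len(points)
    alpha = np.zeros(T)       # alpha_t
    alpha_p = np.zeros(T)     # alpha'_t
    for t in range(T):
        x_t, y_t = points[t], labels[t]
        y_hat = sum((alpha_p[s] - alpha[s]) * gaussian_k(points[s], x_t)
                    for s in range(T))
        alpha_p[t] += min(max(eta * (y_t - y_hat), -alpha_p[t]), C - alpha_p[t])
        alpha[t] += min(max(eta * (y_hat - y_t), -alpha[t]), C - alpha[t])
    # Learned hypothesis h(x) = sum_t (alpha'_t - alpha_t) K(x_t, x).
    return lambda x: sum((alpha_p[t] - alpha[t]) * gaussian_k(points[t], x)
                         for t in range(T))

# Hypothetical usage.
rng = np.random.default_rng(6)
X = rng.uniform(-2, 2, size=(60, 1))
y = np.sin(X[:, 0])
h = dual_online_svr(X, y)
print(round(h(np.array([0.5])), 2))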


This Lecture
Generalization bounds
Linear regression
Kernel ridge regression
Support vector regression
Lasso



LASSO
(Tibshirani, 1996)
Optimization problem: ‘least absolute shrinkage
and selection operator’.
$$\min_{w, b} F(w, b) = \lambda \|w\|_1 + \sum_{i=1}^m (w \cdot x_i + b - y_i)^2,$$
where $\lambda \geq 0$ is a (regularization) parameter.

Solution: equivalent to a convex quadratic program (QP).

• general: standard QP solvers.
• specific algorithm: LARS (least angle regression procedure),
entire path of solutions.
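A library-free sketch of one simple solver for this objective, proximal gradient descent (ISTA) with soft-thresholding; it assumes b = 0, and the data, λ, and iteration count are chosen arbitrarily for the example (LARS or coordinate descent would be the usual choices in practice):

import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for min_w lam*||w||_1 + ||Xw - y||^2."""
    L = 2 * np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the smooth part
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ w - y)
        w = soft_threshold(w - grad / L, lam / L)
    return w

# Hypothetical usage: sparse ground truth, many irrelevant features.
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.1 * rng.normal(size=100)
w_hat = lasso_ista(X, y, lam=5.0)
print("nonzero coefficients:", np.flatnonzero(np.abs(w_hat) > 1e-3))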


Sparsity of L1 regularization

(Figure: two panels comparing solutions under L1 regularization and L2 regularization; the L1 constraint induces sparse solutions.)


Sparsity Guarantee
Rademacher complexity of L1-norm bounded
linear hypotheses:
$$\widehat{\mathfrak{R}}_S(H) = \frac{1}{m} \mathop{\mathrm{E}}_\sigma\Big[\sup_{\|w\|_1 \leq \Lambda_1} \sum_{i=1}^m \sigma_i\, w \cdot x_i\Big]$$
$$= \frac{\Lambda_1}{m} \mathop{\mathrm{E}}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i x_i\Big\|_\infty\Big] \qquad \text{(by definition of the dual norm)}$$
$$= \frac{\Lambda_1}{m} \mathop{\mathrm{E}}_\sigma\Big[\max_{j \in [1, N]} \Big|\sum_{i=1}^m \sigma_i x_{ij}\Big|\Big] \qquad \text{(by definition of } \|\cdot\|_\infty)$$
$$= \frac{\Lambda_1}{m} \mathop{\mathrm{E}}_\sigma\Big[\max_{j \in [1, N]} \max_{s \in \{-1, +1\}} s \sum_{i=1}^m \sigma_i x_{ij}\Big] \qquad \text{(by definition of } |\cdot|)$$
$$= \frac{\Lambda_1}{m} \mathop{\mathrm{E}}_\sigma\Big[\sup_{z \in A} \sum_{i=1}^m \sigma_i z_i\Big] \leq \Lambda_1 r_\infty \sqrt{\frac{2\log(2N)}{m}}, \qquad \text{(Massart's lemma)}$$
where $A = \{\pm(x_{1j}, \ldots, x_{mj})^\top : j \in [1, N]\}$ and $r_\infty = \max_{i,j} |x_{ij}|$.


Notes
Advantages:
• theoretical guarantees.
• sparse solution.
• feature selection.
Drawbacks:
• no natural use of kernels.
• no closed-form solution (not necessary, but can
be convenient for theoretical analysis).



Regression
Many other families of algorithms, including:
• neural networks.
• decision trees (see next lecture).
• boosting trees for regression.


References
• Corinna Cortes, Mehryar Mohri, and Jason Weston. A General Regression Framework for
Learning String-to-String Mappings. In Predicting Structured Data. The MIT Press, 2007.

• Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least Angle Regression. Annals of
Statistics, 2004.

• Arthur Hoerl and Robert Kennard. Ridge Regression: biased estimation of nonorthogonal
problems. Technometrics, 12:55-67, 1970.

• C. Saunders, A. Gammerman, and V. Vovk. Ridge Regression Learning Algorithm in Dual
Variables. In ICML '98, pages 515-521, 1998.

• Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society, Series B, 58:267-288, 1996.

• David Pollard. Convergence of Stochastic Processes. Springer, New York, 1984.

• David Pollard. Empirical Processes: Theory and Applications. Institute of Mathematical
Statistics, 1990.


References
• Sethu Vijayakumar and Si Wu. Sequential support vector classifiers and regression. In
Proceedings of the International Conference on Soft Computing (SOCO’99), 1999.

• Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, Berlin, 1982.

• Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

• Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.

• Bernard Widrow and Ted Hoff. Adaptive Switching Circuits. Neurocomputing: foundations of
research, pages 123-134, MIT Press, 1988.
