
Lecture 6: Regression continued

C4B Machine Learning Hilary 2011 A. Zisserman

• Lasso
• L1 regularization
• other regularizers

• SVM regression
• epsilon-insensitive loss

• More loss functions

Regression

• Suppose we are given a training set of N observations

  $(x_1, y_1), \ldots, (x_N, y_N)$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

• The regression problem is to estimate $f(x)$ from this data such that

  $y_i = f(x_i)$
Regression cost functions

Minimize with respect to w:

$$\sum_{i=1}^{N} \ell\left(f(x_i, w), y_i\right) + \lambda R(w)$$

(loss function + regularization)

• There is a choice of both the loss function and the regularizer


• So far we have seen "ridge" regression:

  • squared loss: $\sum_{i=1}^{N} \left(y_i - f(x_i, w)\right)^2$

  • squared regularizer: $\lambda \|w\|^2$
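
As a concrete reference point, here is a minimal numpy sketch of ridge regression with the closed-form solution; the data sizes, noise level, and λ value below are assumptions for illustration only, not from the lecture.

```python
import numpy as np

# Ridge regression sketch: minimize sum_i (y_i - w.Phi(x_i))^2 + lam * ||w||^2.
# The closed-form solution is w = (Phi^T Phi + lam * I)^{-1} Phi^T y.

rng = np.random.default_rng(0)
N, d = 20, 3                       # illustrative sizes (assumed)
Phi = rng.normal(size=(N, d))      # design matrix: one row Phi(x_i) per observation
w_true = np.array([1.0, -2.0, 0.5])
y = Phi @ w_true + 0.1 * rng.normal(size=N)

lam = 0.1                          # regularization weight lambda (assumed)
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)
print(w_ridge)                     # shrunk towards zero, but generally not exactly zero
```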


• Now, consider other losses and regularizers

The "Lasso" or L1 norm regularization

• LASSO = Least Absolute Shrinkage and Selection Operator

Minimize with respect to $w \in \mathbb{R}^d$:

$$\sum_{i=1}^{N} \left(y_i - f(x_i, w)\right)^2 + \lambda \sum_{j=1}^{d} |w_j|$$

(loss function + regularization)

• This is a quadratic optimization problem

• There is a unique solution

• p-norm definition: $\|w\|_p = \left( \sum_{j=1}^{d} |w_j|^p \right)^{1/p}$
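
For reference, a minimal sketch of minimizing this objective numerically with scikit-learn's Lasso and Ridge estimators; the data are synthetic and the regularization strength is arbitrary. Note that scikit-learn's Lasso scales the squared loss by 1/(2N), so its alpha plays the role of λ only up to that scaling.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumed): only 2 of 10 features are actually relevant.
rng = np.random.default_rng(1)
N, d = 50, 10
X = rng.normal(size=(N, d))
w_true = np.zeros(d)
w_true[[0, 3]] = [2.0, -1.5]
y = X @ w_true + 0.1 * rng.normal(size=N)

# scikit-learn's Lasso minimizes (1/(2N)) * ||y - Xw||^2 + alpha * ||w||_1.
lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("lasso coefficients:", np.round(lasso.coef_, 3))  # many entries exactly zero
print("ridge coefficients:", np.round(ridge.coef_, 3))  # small but non-zero entries
```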
Sparsity property of the Lasso

• Contour plots for d = 2: the loss $\sum_{i=1}^{N} (y_i - f(x_i, w))^2$ together with the regularizer, $\lambda \|w\|^2$ for ridge regression and $\lambda \sum_{j=1}^{d} |w_j|$ for the lasso

• The minimum occurs where the loss contours are tangent to the regularizer's contours

• For the lasso case, minima occur at the "corners" of the regularizer's contours

• Consequently one of the weights is zero

• In high dimensions many weights can be zero
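
The exact zeros can also be seen in one dimension, where both problems have closed-form solutions; the sketch below (with made-up observations and λ) contrasts the lasso's soft-thresholding with ridge's proportional shrinkage.

```python
import numpy as np

# One-dimensional illustration of why the lasso produces exact zeros.
# For min_w (y - w)^2 + lam * |w|  the solution is soft-thresholding:
#     w* = sign(y) * max(|y| - lam/2, 0)   (exactly zero when |y| <= lam/2)
# For min_w (y - w)^2 + lam * w^2  the solution is proportional shrinkage:
#     w* = y / (1 + lam)                   (never exactly zero for y != 0)

def lasso_1d(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

def ridge_1d(y, lam):
    return y / (1.0 + lam)

y = np.array([-2.0, -0.3, 0.1, 0.5, 3.0])   # illustrative "observations" (assumed)
lam = 1.0
print("lasso:", lasso_1d(y, lam))   # small entries are set exactly to zero
print("ridge:", ridge_1d(y, lam))   # all entries shrunk, none exactly zero
```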
Example: Lasso for polynomial basis function regression

• The red curve is the true function (which is not a polynomial)

• The data points are samples from the curve with added noise in y

• N = 9, M = 7

$$f(x, w) = \sum_{j=0}^{M} w_j x^j = w^\top \Phi(x)$$

where w is an (M+1)-dimensional vector

[Figure: "ideal fit": sample points and the ideal fit curve, y against x]
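
A minimal sketch of how this experiment could be set up; the true curve, noise level, and regularization strength below are assumptions, since the lecture specifies only N = 9 and M = 7.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Polynomial basis regression with a lasso penalty, mirroring the N = 9, M = 7 setup.
rng = np.random.default_rng(2)
N, M = 9, 7
x = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=N)   # assumed "true" curve + noise

# Design matrix Phi with columns 1, x, x^2, ..., x^M (so w has M + 1 entries).
Phi = np.vander(x, M + 1, increasing=True)

lasso = Lasso(alpha=1e-3, fit_intercept=False, max_iter=100000).fit(Phi, y)
print("fitted weights:", np.round(lasso.coef_, 3))      # several weights driven to zero
```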

[Figures: "Variation of weights with log λ" for ridge regression and for the lasso, showing the weights $w_j$ plotted against log λ, with detail views at small λ]
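
One way such weight-vs-λ paths can be generated is to refit over a grid of λ values; a minimal sketch with synthetic data follows (the λ grid and data are assumptions).

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Trace how the fitted weights vary as the regularization weight lambda changes.
rng = np.random.default_rng(3)
N, d = 40, 6
X = rng.normal(size=(N, d))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=N)

lambdas = np.logspace(-4, 1, 30)
lasso_path = np.array([Lasso(alpha=lam, max_iter=100000).fit(X, y).coef_ for lam in lambdas])
ridge_path = np.array([Ridge(alpha=lam).fit(X, y).coef_ for lam in lambdas])

# The lasso weights become exactly zero one by one as lambda grows,
# whereas the ridge weights shrink smoothly without reaching zero.
print(lasso_path[-1], ridge_path[-1])
```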
Second example – lasso in action

[Figure: fitted weights plotted against the regularization parameter λ, for 0 ≤ λ ≤ 1.5]

Sparse weight vectors

• Weights being zero is a method of "feature selection": zeroing out the unimportant features

• The SVM classifier also has this property (sparse α in the dual representation)

• Ridge regression does not

• AdaBoost achieves feature selection by a different, greedy approach
Other regularizers

$$\sum_{i=1}^{N} \left(y_i - f(x_i, w)\right)^2 + \lambda \sum_{j=1}^{d} |w_j|^q$$

• For q ≥ 1, the cost function is convex and has a unique minimum. The solution can be obtained by quadratic optimization.

• For q < 1, the problem is not convex, and obtaining the global minimum is more difficult.
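
A small sketch of this objective as a function of q (purely illustrative; the data and parameter values are made up):

```python
import numpy as np

def regularized_cost(w, X, y, lam, q):
    """Squared loss plus an L_q penalty: sum_i (y_i - X_i.w)^2 + lam * sum_j |w_j|^q.

    Convex in w for q >= 1; for q < 1 the penalty is non-convex, so a
    gradient-based optimizer may only find a local minimum.
    """
    residuals = y - X @ w
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(w) ** q)

# Tiny usage example with made-up numbers.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 2.0])
print(regularized_cost(w, X, y, lam=0.5, q=0.5))
```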
SVMs for Regression

Use the ε-insensitive error measure

$$V_\varepsilon(r) = \begin{cases} 0 & \text{if } |r| \le \varepsilon \\ |r| - \varepsilon & \text{otherwise} \end{cases}$$

This can also be written as

$$V_\varepsilon(r) = \left(|r| - \varepsilon\right)_+$$

where $(\cdot)_+$ denotes the positive part, or equivalently as

$$V_\varepsilon(r) = \max\left(|r| - \varepsilon, 0\right)$$

[Figure: $V_\varepsilon(r)$ plotted against r, compared with the square loss; the cost is zero inside the ε "tube"]
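
A one-line numpy version of this loss, for reference (the residuals and ε value are arbitrary):

```python
import numpy as np

def eps_insensitive(r, eps):
    # V_eps(r) = max(|r| - eps, 0): zero inside the eps-"tube", linear outside it.
    return np.maximum(np.abs(r) - eps, 0.0)

r = np.linspace(-2.0, 2.0, 9)        # residuals y - f(x), values chosen for illustration
print(eps_insensitive(r, eps=0.5))
```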

• As before, introduce slack variables for points that violate the ε-insensitive error

• For each data point $x_i$, two slack variables, $\xi_i, \hat{\xi}_i$, are required (depending on whether $f(x_i)$ is above or below the tube)

• Learning is by the optimization

$$\min_{w \in \mathbb{R}^d,\, \xi_i,\, \hat{\xi}_i} \; C \sum_{i=1}^{N} \left( \xi_i + \hat{\xi}_i \right) + \frac{1}{2} \|w\|^2$$

(loss function + regularization)

subject to

$$y_i \le f(x_i, w) + \varepsilon + \xi_i, \quad y_i \ge f(x_i, w) - \varepsilon - \hat{\xi}_i, \quad \xi_i \ge 0, \quad \hat{\xi}_i \ge 0 \quad \text{for } i = 1, \ldots, N$$

[Figure: the ε-tube around f(x); the cost is zero inside the tube]

• Again, this is a quadratic programming problem


• It can be dualized
• Some of the data points will become support vectors
• It can be kernelized
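
In practice the dual, kernelized problem is what library implementations solve; a minimal sketch with scikit-learn's SVR follows (the data and the C, ε, γ values are assumptions for illustration).

```python
import numpy as np
from sklearn.svm import SVR

# Support vector regression with an RBF kernel; C, epsilon and gamma are illustrative.
rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0.0, 1.0, size=30))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)   # assumed target curve + noise

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=10.0).fit(x[:, None], y)

print("number of support vectors:", len(svr.support_))   # points outside (or on) the tube
print("prediction at x = 0.5:", svr.predict(np.array([[0.5]])))
```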
Example: SV regression with Gaussian basis functions

• The red curve is the true function (which is not a polynomial)

• Regression function: Gaussians centred on the data points

• Parameters are: C, ε, σ

$$f(x, w) = \sum_{i=1}^{N} w_i \, e^{-(x - x_i)^2/\sigma^2} = w^\top \Phi(x)$$

$\Phi : x \to \Phi(x)$, $\mathbb{R} \to \mathbb{R}^N$, and w is an N-vector

[Figure: "ideal fit": sample points and the ideal fit curve, y against x]
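
A short sketch of this explicit feature map (the inputs and σ are placeholders):

```python
import numpy as np

def gaussian_design(x, centres, sigma):
    """Design matrix for Gaussian basis functions centred on the data points.

    Row i is Phi(x_i), with Phi(x)[j] = exp(-(x - centre_j)^2 / sigma^2),
    so each input in R is mapped to a vector in R^N (one component per centre)."""
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / sigma ** 2)

x = np.linspace(0.0, 1.0, 9)           # illustrative 1-D inputs, also used as the centres
Phi = gaussian_design(x, centres=x, sigma=0.1)
print(Phi.shape)                        # (N, N): one basis function per data point
```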

epsilon = 0.01

[Figures: sample points with the ideal fit, and sample points with the validation set fit and the support vectors, y against x]

• The validation set fit is a search over both C and σ
epsilon = 0.5 and epsilon = 0.8

[Figures: sample points, the validation set fit, and the support vectors for ε = 0.5 and ε = 0.8]

As ε increases:
• the fit becomes looser
• fewer data points are support vectors
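
This behaviour can be checked numerically, for example by refitting the hypothetical scikit-learn setup sketched earlier over a few ε values:

```python
import numpy as np
from sklearn.svm import SVR

# Count support vectors as epsilon grows; C, gamma and the data are illustrative.
rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0.0, 1.0, size=30))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)

for eps in [0.01, 0.5, 0.8]:
    svr = SVR(kernel="rbf", C=10.0, epsilon=eps, gamma=10.0).fit(x[:, None], y)
    print(f"epsilon = {eps}: {len(svr.support_)} support vectors")
# The count typically drops as epsilon increases: more points fall inside the tube.
```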

Loss functions for regression

• quadratic (square) loss: $\ell(y, f(x)) = \frac{1}{2}\left(y - f(x)\right)^2$

• ε-insensitive loss: $\ell(y, f(x)) = \max\left(|r| - \varepsilon, 0\right)$, where $r = y - f(x)$

• Huber loss (mixed quadratic/linear), for robustness to outliers: $\ell(y, f(x)) = h(y - f(x))$, with

$$h(r) = \begin{cases} r^2 & \text{if } |r| \le c \\ 2c|r| - c^2 & \text{otherwise} \end{cases}$$

• all of these are convex

[Figure: the square, ε-insensitive, and Huber losses plotted against y − f(x)]
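
For reference, a small sketch implementing the three losses as defined above, as one might for such a comparison plot (the ε and c values are placeholders):

```python
import numpy as np

def square_loss(r):
    return 0.5 * r ** 2

def eps_insensitive_loss(r, eps=0.5):
    return np.maximum(np.abs(r) - eps, 0.0)

def huber_loss(r, c=1.0):
    # Quadratic near zero, linear in the tails: r^2 if |r| <= c, else 2c|r| - c^2.
    return np.where(np.abs(r) <= c, r ** 2, 2 * c * np.abs(r) - c ** 2)

r = np.linspace(-3.0, 3.0, 7)   # residuals y - f(x), chosen for illustration
print(square_loss(r), eps_insensitive_loss(r), huber_loss(r), sep="\n")
```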
Final notes on cost functions

Regressors and classifiers can be constructed by a "mix 'n' match" of loss functions and regularizers to obtain a learning machine suited to a particular application, e.g. for a classifier $f(x) = w^\top x + b$:

• L1 logistic regression

$$\min_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \log\left(1 + e^{-y_i f(x_i)}\right) + \lambda \|w\|_1$$

• L1-SVM

$$\min_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \max\left(0, 1 - y_i f(x_i)\right) + \lambda \|w\|_1$$

• Least squares SVM

$$\min_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \left[\max\left(0, 1 - y_i f(x_i)\right)\right]^2 + \lambda \|w\|^2$$
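
As one concrete instance, L1-regularized logistic regression is available off the shelf; a minimal sketch follows (synthetic data, arbitrary regularization strength; note that scikit-learn parameterizes the penalty by C, which behaves like 1/λ).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# L1-regularized logistic regression: sparse classifier weights.
rng = np.random.default_rng(6)
N, d = 100, 10
X = rng.normal(size=(N, d))
w_true = np.zeros(d)
w_true[[1, 4]] = [2.0, -3.0]                     # only two informative features (assumed)
yc = np.sign(X @ w_true + 0.1 * rng.normal(size=N))

# Smaller C means stronger regularization (C plays the role of 1/lambda).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, yc)
print(np.round(clf.coef_, 3))                    # most weights exactly zero
```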

Background reading

• Bishop, chapters 3.1 & 7.1.4

• Hastie et al, chapters 3.4 & 12.3.5

• More on the web page: http://www.robots.ox.ac.uk/~az/lectures/ml
