
EE2211 Introduction to Machine Learning
Lecture 6
Semester 2, 2021/2022

Helen Juan Zhou
[email protected]
Electrical and Computer Engineering Department
National University of Singapore

Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Chen Khong, Helen, Robby and Haizhou)

© Copyright EE, NUS. All Rights Reserved.


Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
(Mid-term: Feb. 25th, 12-1pm)
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Haizhou)
– Performance Issues
– K-means Clustering
– Neural Networks
Ridge Regression & Polynomial Regression

Module II Contents
• Notations, Vectors, Matrices
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Functions, Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression

Review: Linear Regression
Learning of Scalar Function (Single Output)

For one sample: a linear model $f_{\mathbf{w}}(\mathbf{x}) = \mathbf{x}^\top\mathbf{w}$ (scalar function).
For m samples: $\mathbf{f}_{\mathbf{w}}(\mathbf{X}) = \mathbf{X}\mathbf{w} = \mathbf{y}$, where
$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^\top \\ \vdots \\ \mathbf{x}_m^\top \end{bmatrix} = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,d} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m,1} & \cdots & x_{m,d} \end{bmatrix}$, $\mathbf{x}_i^\top = [1, x_{i,1}, \ldots, x_{i,d}]$, $\mathbf{w} = \begin{bmatrix} w_0 \\ \vdots \\ w_d \end{bmatrix}$, $\mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}$

Objective: $\sum_{i=1}^{m}(f_{\mathbf{w}}(\mathbf{x}_i) - y_i)^2 = \mathbf{e}^\top\mathbf{e} = (\mathbf{X}\mathbf{w} - \mathbf{y})^\top(\mathbf{X}\mathbf{w} - \mathbf{y})$

Learning/training (when $\mathbf{X}^\top\mathbf{X}$ is invertible), the least-squares solution via the left inverse:
$\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$
Prediction/testing: $\hat{\mathbf{y}}_{\text{new}} = \mathbf{f}_{\hat{\mathbf{w}}}(\mathbf{X}_{\text{new}}) = \mathbf{X}_{\text{new}}\hat{\mathbf{w}}$
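As a quick illustration, a minimal NumPy sketch of this least-squares fit (the data values below are made up for illustration; they are not from the slides):

```python
import numpy as np

# Illustrative data: m = 4 samples, d = 1 feature (values are made up)
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([[2.1], [3.9], [6.2], [8.1]])

# Augment with a bias column so X is m x (d+1)
X = np.hstack([np.ones((x.shape[0], 1)), x])

# Least-squares solution w_hat = (X^T X)^{-1} X^T y (X^T X assumed invertible)
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Prediction on new inputs: y_new = X_new w_hat
X_new = np.hstack([np.ones((2, 1)), np.array([[5.0], [6.0]])])
y_new = X_new @ w_hat
print(w_hat.ravel(), y_new.ravel())
```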
Review: Linear Regression
Learning of Vectored Function (Multiple Outputs)

$\mathbf{F}_{\mathbf{W}}(\mathbf{X}) = \mathbf{X}\mathbf{W} = \mathbf{Y}$

$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^\top \\ \vdots \\ \mathbf{x}_m^\top \end{bmatrix} = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,d} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m,1} & \cdots & x_{m,d} \end{bmatrix}$ (row i is sample i),
$\mathbf{W} = \begin{bmatrix} w_{0,1} & \cdots & w_{0,h} \\ w_{1,1} & \cdots & w_{1,h} \\ \vdots & \ddots & \vdots \\ w_{d,1} & \cdots & w_{d,h} \end{bmatrix}$,
$\mathbf{Y} = \begin{bmatrix} y_{1,1} & \cdots & y_{1,h} \\ \vdots & \ddots & \vdots \\ y_{m,1} & \cdots & y_{m,h} \end{bmatrix}$ (row i is sample i's output),
where h denotes the number of outputs.

Least Squares Regression: if $\mathbf{X}^\top\mathbf{X}$ is invertible, then
Learning/training: $\hat{\mathbf{W}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y}$
Prediction/testing: $\hat{\mathbf{F}}_{\hat{\mathbf{W}}}(\mathbf{X}_{\text{new}}) = \mathbf{X}_{\text{new}}\hat{\mathbf{W}}$

$\mathbf{X} \in \mathbb{R}^{m\times(d+1)}$, $\mathbf{W} \in \mathbb{R}^{(d+1)\times h}$, $\mathbf{Y} \in \mathbb{R}^{m\times h}$
Linear Regression (for classification)
Linear Methods for Classification

• We have a collection of labeled examples
• m is the size of the collection
• $\mathbf{x}_i$ is the d-dimensional feature vector of example $i = 1, \ldots, m$
• $y_i$ is a discrete target label (e.g., $y_i \in \{-1, +1\}$, thresholded at 0, or $y_i \in \{0, 1\}$, thresholded at 0.5, for binary classification problems)

• Note:
• when $y_i$ is continuous valued → a regression problem
• when $y_i$ is discrete valued → a classification problem
• Linear model: $f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{x}^\top\mathbf{w} + b$, or in compact form $f_{\mathbf{w}}(\mathbf{x}) = \mathbf{x}^\top\mathbf{w}$ (having the offset term absorbed into the inner product)

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, "Introduction to Applied Linear Algebra", Cambridge University Press, 2018 (chp. 14)

Linear Regression (for classification)
Linear Methods for Classification

Binary Classification:
If $\mathbf{X}^\top\mathbf{X}$ is invertible, then
Learning: $\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$, with $y_i \in \{-1, +1\}$, $i = 1, \ldots, m$
Prediction: $\hat{f}_{\hat{\mathbf{w}}}(\mathbf{x}_{\text{new}}) = \text{sign}(\mathbf{x}_{\text{new}}^\top\hat{\mathbf{w}})$ for each row $\mathbf{x}_{\text{new}}^\top$ of $\mathbf{X}_{\text{new}}$

(The slide's figure shows the sign function: $\text{sign}(a) = +1$ for $a > 0$ and $-1$ for $a < 0$.)

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, "Introduction to Applied Linear Algebra", Cambridge University Press, 2018 (chp. 14)

Linear Regression (for classification)
Example 1

Training set $\{x_i, y_i\}_{i=1}^{6}$:
{x = −9} → {y = −1}
{x = −7} → {y = −1}
{x = −5} → {y = −1}
{x = 1} → {y = +1}
{x = 5} → {y = +1}
{x = 9} → {y = +1}

With a bias column, $\mathbf{X}\mathbf{w} = \mathbf{y}$:
$\begin{bmatrix} 1 & -9 \\ 1 & -7 \\ 1 & -5 \\ 1 & 1 \\ 1 & 5 \\ 1 & 9 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = \begin{bmatrix} -1 \\ -1 \\ -1 \\ 1 \\ 1 \\ 1 \end{bmatrix}$ (an over-determined system)

This set of linear equations has NO exact solution.

Least-squares approximation ($\mathbf{X}^\top\mathbf{X}$ is invertible):
$\hat{\mathbf{w}} = \mathbf{X}^{\dagger}\mathbf{y} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} = \begin{bmatrix} 6 & -6 \\ -6 & 262 \end{bmatrix}^{-1} \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ -9 & -7 & -5 & 1 & 5 & 9 \end{bmatrix} \begin{bmatrix} -1 \\ -1 \\ -1 \\ 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0.1406 \\ 0.1406 \end{bmatrix}$
Linear Regression (for classification)
Example 1 (cont'd)

The fitted model is $\hat{y} = f_{\hat{\mathbf{w}}}(x) = 0.1406 + 0.1406\,x$.

On the training set, $\hat{y} = \text{sign}(\mathbf{X}\hat{\mathbf{w}})$ recovers the labels:
{x = −9, −7, −5} → {ŷ = −1},  {x = 1, 5, 9} → {ŷ = +1}.

Prediction:
Test set {x = −2} → {y = ?}

$\hat{y}_{\text{new}} = \hat{f}_{\hat{\mathbf{w}}}(\mathbf{x}_{\text{new}}) = \text{sign}(\mathbf{x}_{\text{new}}^\top\hat{\mathbf{w}}) = \text{sign}\left([1 \;\; -2]\begin{bmatrix} 0.1406 \\ 0.1406 \end{bmatrix}\right) = \text{sign}(-0.1406) = -1$

Python demo 1: linear regression for one-dimensional classification.
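A minimal sketch of demo 1, reproducing the numbers above (only NumPy is assumed; variable names are illustrative):

```python
import numpy as np

# Training data from Example 1: 1-D inputs with a bias column
x = np.array([-9, -7, -5, 1, 5, 9], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])

# Least-squares learning: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(np.round(w, 4))            # approximately [0.1406 0.1406]

# Prediction for a new point x = -2
X_new = np.array([[1.0, -2.0]])
print(np.sign(X_new @ w))        # [-1.]
```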
Linear Regression (for classification)
Linear Methods for Classification

Multi-Category Classification:
If $\mathbf{X}^\top\mathbf{X}$ is invertible, then
Learning: $\hat{\mathbf{W}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y}$, with $\mathbf{Y} \in \mathbb{R}^{m \times C}$ and $\hat{\mathbf{W}} \in \mathbb{R}^{(d+1) \times C}$
Prediction: $\hat{f}_{\hat{\mathbf{W}}}(\mathbf{x}_{\text{new}}) = \arg\max_{k=1,\ldots,C}\,(\mathbf{x}_{\text{new}}^\top\hat{\mathbf{W}})_{(:,k)}$ for each row $\mathbf{x}_{\text{new}}^\top$ of $\mathbf{X}_{\text{new}}$

Each row ($i = 1, \ldots, m$) of $\mathbf{Y}$ has a one-hot encoding/assignment:
e.g., the target for class 1 is labelled as $\mathbf{y}_i^\top = [1, 0, 0, \ldots, 0]$ for the i-th sample,
the target for class 2 is labelled as $\mathbf{y}_j^\top = [0, 1, 0, \ldots, 0]$ for the j-th sample,
the target for class C is labelled as $\mathbf{y}_m^\top = [0, 0, \ldots, 0, 1]$ for the m-th sample.

Ref: Hastie, Tibshirani, Friedman, "The Elements of Statistical Learning", (2nd ed., 12th printing) 2017 (chp. 4)
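The later demo annotations hint at sklearn's OneHotEncoder for building the one-hot target matrix Y; a minimal sketch (the class labels below are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative class labels for m = 4 samples (values are made up)
y = np.array([[1], [2], [1], [3]])

# One-hot encode: class k becomes a row with a 1 in column k
encoder = OneHotEncoder()
Y_onehot = encoder.fit_transform(y).toarray()
print(Y_onehot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]
```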

Linear Regression (for classification)
Example 2: Three-class classification

Training set $\{\mathbf{x}_i, \mathbf{y}_i\}_{i=1}^{4}$:
{x1 = 1, x2 = 1} → {y1 = 1, y2 = 0, y3 = 0}  Class 1
{x1 = −1, x2 = 1} → {y1 = 0, y2 = 1, y3 = 0}  Class 2
{x1 = 1, x2 = 3} → {y1 = 1, y2 = 0, y3 = 0}  Class 1
{x1 = 1, x2 = 0} → {y1 = 0, y2 = 0, y3 = 1}  Class 3

With a bias column, $\mathbf{X}\mathbf{W} = \mathbf{Y}$ (one-hot encoded targets):
$\begin{bmatrix} 1 & 1 & 1 \\ 1 & -1 & 1 \\ 1 & 1 & 3 \\ 1 & 1 & 0 \end{bmatrix} \begin{bmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \\ w_{3,1} & w_{3,2} & w_{3,3} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}$

This set of linear equations has NO exact solution. Least-squares approximation ($\mathbf{X}^\top\mathbf{X}$ is invertible):

$\hat{\mathbf{W}} = \mathbf{X}^{\dagger}\mathbf{Y} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y} = \begin{bmatrix} 4 & 2 & 5 \\ 2 & 4 & 3 \\ 5 & 3 & 11 \end{bmatrix}^{-1} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & 1 \\ 1 & 1 & 3 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0.5 & 0.5 \\ 0.2857 & -0.5 & 0.2143 \\ 0.2857 & 0 & -0.2857 \end{bmatrix}$
Linear Regression (for classification)
Example 2: Prediction

Test set:
{x1 = 6, x2 = 8} → {class 1, 2, or 3?}
{x1 = 0, x2 = −1} → {class 1, 2, or 3?}

$\hat{\mathbf{Y}} = \mathbf{X}_{\text{new}}\hat{\mathbf{W}} = \begin{bmatrix} 1 & 6 & 8 \\ 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} 0 & 0.5 & 0.5 \\ 0.2857 & -0.5 & 0.2143 \\ 0.2857 & 0 & -0.2857 \end{bmatrix} = \begin{bmatrix} 4 & -2.50 & -0.50 \\ -0.2857 & 0.50 & 0.7857 \end{bmatrix}$

Category prediction:
$\hat{f}_{\hat{\mathbf{W}}}(\mathbf{X}_{\text{new}}) = \arg\max_{k=1,\ldots,C}\,(\hat{\mathbf{Y}})_{(:,k)}$ row by row: Row 1 → Class 1, Row 2 → Class 3.

For each row of $\hat{\mathbf{Y}}$, the column position of the largest number (across all columns for that row) determines the class label. E.g., in the first row the maximum is 4, which is in column 1, so the predicted class is 1; in the second row the maximum is 0.7857, in column 3, so the predicted class is 3.

Python demo 2.
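A minimal sketch of demo 2, following the slide's hint to use np.argmax along each row (variable names are illustrative):

```python
import numpy as np

# Training data from Example 2 (bias column included)
X = np.array([[1, 1, 1],
              [1, -1, 1],
              [1, 1, 3],
              [1, 1, 0]], dtype=float)
# One-hot targets: class 1, 2, 1, 3
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]], dtype=float)

# Least-squares learning: W_hat = (X^T X)^{-1} X^T Y
W_hat = np.linalg.inv(X.T @ X) @ X.T @ Y

# Test points (6, 8) and (0, -1), with bias column
X_test = np.array([[1, 6, 8],
                   [1, 0, -1]], dtype=float)
Y_test = X_test @ W_hat

# Predicted class = column index of the row-wise maximum (+1 for 1-based labels)
y_class = np.argmax(Y_test, axis=1) + 1
print(np.round(Y_test, 4))
print(y_class)   # expected: [1 3]
```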
Ridge Regression (to reduce over-fitting)

Recall linear regression.
Objective: $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_{i=1}^{m}(f_{\mathbf{w}}(\mathbf{x}_i) - y_i)^2 = \arg\min_{\mathbf{w}} (\mathbf{X}\mathbf{w} - \mathbf{y})^\top(\mathbf{X}\mathbf{w} - \mathbf{y})$
The learning computation: $\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$
We cannot guarantee that the matrix $\mathbf{X}^\top\mathbf{X}$ is invertible (for example, when the feature dimension is larger than the number of samples). In practice, unregularized least squares can also over-fit the training data, and shrinking the coefficients helps accuracy on unseen test data.

Ridge regression shrinks the regression coefficients $\mathbf{w}$ by imposing a penalty on their size:

Objective: $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_{i=1}^{m}(f_{\mathbf{w}}(\mathbf{x}_i) - y_i)^2 + \lambda\sum_{j=1}^{d} w_j^2 = \arg\min_{\mathbf{w}} (\mathbf{X}\mathbf{w} - \mathbf{y})^\top(\mathbf{X}\mathbf{w} - \mathbf{y}) + \lambda\,\mathbf{w}^\top\mathbf{w}$

The penalty term controls the model complexity: it shrinks the weights towards zero, so less useful features contribute less.

Here $\lambda \ge 0$ is a complexity parameter that controls the amount of shrinkage: the larger the value of $\lambda$, the greater the amount of shrinkage.

Note: m samples & d parameters.


Ridge Regression

Using a linear model:
$\min_{\mathbf{w}} \; (\mathbf{X}\mathbf{w} - \mathbf{y})^\top(\mathbf{X}\mathbf{w} - \mathbf{y}) + \lambda\,\mathbf{w}^\top\mathbf{w}$

Solution: set the gradient with respect to $\mathbf{w}$ to zero:
$\frac{\partial}{\partial \mathbf{w}}\left[(\mathbf{X}\mathbf{w} - \mathbf{y})^\top(\mathbf{X}\mathbf{w} - \mathbf{y}) + \lambda\,\mathbf{w}^\top\mathbf{w}\right] = \mathbf{0}$
$\Rightarrow 2\mathbf{X}^\top\mathbf{X}\mathbf{w} - 2\mathbf{X}^\top\mathbf{y} + 2\lambda\mathbf{w} = \mathbf{0}$
$\Rightarrow \mathbf{X}^\top\mathbf{X}\mathbf{w} + \lambda\mathbf{w} = \mathbf{X}^\top\mathbf{y}$
$\Rightarrow (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})\mathbf{w} = \mathbf{X}^\top\mathbf{y}$
where $\mathbf{I}$ is the $d \times d$ identity matrix.

From here on, we focus on a single column of output $\mathbf{y}$ in the derivations.

Learning: $\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$

Ref: Hastie, Tibshirani, Friedman, "The Elements of Statistical Learning", (2nd ed., 12th printing) 2017 (chp. 3)
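A minimal sketch of the shrinkage effect using this closed-form solution (the data and λ values are chosen only for illustration; as in the formula above, all coefficients including the bias are penalized here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])  # bias + 3 features
w_true = np.array([1.0, 3.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=20)

d = X.shape[1]
for lam in [0.0, 1.0, 10.0, 100.0]:
    # Ridge solution: w = (X^T X + lambda I)^{-1} X^T y
    w_ridge = np.linalg.inv(X.T @ X + lam * np.eye(d)) @ X.T @ y
    print(lam, np.round(w_ridge, 3), "||w|| =", round(np.linalg.norm(w_ridge), 3))
# As lambda grows, the coefficient norm shrinks towards zero.
```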

Ridge Regression

Ridge Regression in Primal Form (when m > d)

$(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})$ is invertible for $\lambda > 0$ (here $\mathbf{I}$ is the $d \times d$ identity matrix).

Learning: $\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$
Prediction: $\hat{f}_{\hat{\mathbf{w}}}(\mathbf{X}_{\text{new}}) = \mathbf{X}_{\text{new}}\hat{\mathbf{w}}$

Ref: Hastie, Tibshirani, Friedman, "The Elements of Statistical Learning", (2nd ed., 12th printing) 2017 (chp. 3)

Ridge Regression

Ridge Regression in Dual Form (when m < d, an under-determined system)

$(\mathbf{X}\mathbf{X}^\top + \lambda\mathbf{I})$ is invertible for $\lambda > 0$ (here $\mathbf{I}$ is the $m \times m$ identity matrix).

Learning: $\hat{\mathbf{w}} = \mathbf{X}^\top(\mathbf{X}\mathbf{X}^\top + \lambda\mathbf{I})^{-1}\mathbf{y}$
Prediction: $\hat{f}_{\hat{\mathbf{w}}}(\mathbf{X}_{\text{new}}) = \mathbf{X}_{\text{new}}\hat{\mathbf{w}}$

Derivation as homework (see Tutorial 6).
Hint: start off with $(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})\mathbf{w} = \mathbf{X}^\top\mathbf{y}$ and make use of $\mathbf{w} = \mathbf{X}^\top\mathbf{a}$ and $\mathbf{a} = \lambda^{-1}(\mathbf{y} - \mathbf{X}\mathbf{w})$, with $\lambda > 0$.

Note (from the slide annotation): in practice set $\lambda$ to a small positive value (e.g., 0.0001) rather than exactly 0; the primal and dual forms then give very similar answers.
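A minimal numerical check that the two forms agree (random data; the λ value is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 5, 20          # fewer samples than features (under-determined case)
X = rng.normal(size=(m, d))
y = rng.normal(size=(m, 1))
lam = 0.1

# Primal form: (X^T X + lambda I_d)^{-1} X^T y
w_primal = np.linalg.inv(X.T @ X + lam * np.eye(d)) @ X.T @ y

# Dual form: X^T (X X^T + lambda I_m)^{-1} y
w_dual = X.T @ np.linalg.inv(X @ X.T + lam * np.eye(m)) @ y

# The two solutions coincide up to numerical precision; the dual form only
# inverts an m x m matrix, which is cheaper when m < d.
print(np.allclose(w_primal, w_dual))   # True
```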

Polynomial Regression

Motivation: nonlinear decision surface
• Based on the sum of products of the variables
• E.g., when the input dimension is d = 2, a polynomial function of degree 2 is:
$f_{\mathbf{w}}(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + w_{12} x_1 x_2 + w_{11} x_1^2 + w_{22} x_2^2$

The XOR problem (figure omitted: red and blue points at opposite corners of a square in the $(x_1, x_2)$ plane): no single line or plane can separate the two classes, but the single product feature
$f_{\mathbf{w}}(\mathbf{x}) = x_1 x_2$
separates them, since it distinguishes the corners where $x_1$ and $x_2$ take the same value from the corners where they differ.
Polynomial Regression

Polynomial Expansion
• The linear model $f_{\mathbf{w}}(\mathbf{x}) = \mathbf{x}^\top\mathbf{w}$ can be written as
$f_{\mathbf{w}}(\mathbf{x}) = \mathbf{x}^\top\mathbf{w} = \sum_{i=0}^{d} x_i w_i = w_0 + \sum_{i=1}^{d} x_i w_i$, with $x_0 = 1$ (the offset).
• By including additional terms involving the products of pairs of components of $\mathbf{x}$, we obtain a quadratic model:
$f_{\mathbf{w}}(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij} x_i x_j$

E.g., for d = 2:
2nd order: $f_{\mathbf{w}}(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + w_{12} x_1 x_2 + w_{11} x_1^2 + w_{22} x_2^2$
3rd order: $f_{\mathbf{w}}(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + w_{12} x_1 x_2 + w_{11} x_1^2 + w_{22} x_2^2 + \sum_{i=1}^{d}\sum_{j=1}^{d}\sum_{k=1}^{d} w_{ijk} x_i x_j x_k$, $d = 2$
(the additional 3rd-order monomials for d = 2 are $x_1^3$, $x_1^2 x_2$, $x_1 x_2^2$, $x_2^3$)

Ref: Duda, Hart, and Stork, "Pattern Classification", 2001 (Chp. 5)


Polynomial Regression

Generalized Linear Discriminant Function
• In general (offset, then linear r = 1, quadratic r = 2, cubic r = 3, ... terms):
$f_{\mathbf{w}}(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij} x_i x_j + \sum_{i=1}^{d}\sum_{j=1}^{d}\sum_{k=1}^{d} w_{ijk} x_i x_j x_k + \cdots$

Weierstrass Approximation Theorem: Every continuous function defined on a closed interval [a, b] can be uniformly approximated as closely as desired by a polynomial function.
- Suppose f is a continuous real-valued function defined on the real interval [a, b].
- For every ε > 0, there exists a polynomial p such that for all x in [a, b], we have |f(x) − p(x)| < ε.
(Ref: https://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theorem)

Notes:
• For high-dimensional input features (large d) and high polynomial order, the number of polynomial terms becomes explosive (i.e., grows exponentially).
• For high-dimensional problems, polynomials of order larger than 3 are seldom used.

Ref: Duda, Hart, and Stork, "Pattern Classification", 2001 (Chp. 5)
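A minimal sketch counting the number of polynomial terms with sklearn's PolynomialFeatures; the closed-form count C(d+r, r) is quoted as a standard combinatorial fact, not from the slides:

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

for d in [2, 5, 10]:
    for r in [1, 2, 3, 4]:
        # Number of columns produced by a degree-r polynomial expansion of d features
        n_terms = PolynomialFeatures(degree=r).fit(np.zeros((1, d))).n_output_features_
        print(f"d={d}, order={r}: {n_terms} terms (= C(d+r, r) = {comb(d + r, r)})")
```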

Polynomial Regression

Generalized Linear Discriminant Function
$f_{\mathbf{w}}(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij} x_i x_j + \sum_{i=1}^{d}\sum_{j=1}^{d}\sum_{k=1}^{d} w_{ijk} x_i x_j x_k + \cdots$

In matrix form: $\mathbf{f}_{\mathbf{w}}(\mathbf{X}) = \mathbf{P}\mathbf{w}$ (Note: $\mathbf{P} \triangleq \mathbf{P}(\mathbf{X})$ for symbol simplicity)

$\mathbf{P} = \begin{bmatrix} \mathbf{p}_1^\top \\ \vdots \\ \mathbf{p}_m^\top \end{bmatrix}$, where $\mathbf{p}_i^\top = [1,\, x_{i,1},\, \ldots,\, x_{i,d},\, \ldots,\, x_{i,j}x_{i,k},\, \ldots,\, x_{i,j}x_{i,k}x_{i,l},\, \ldots]$ for $i = 1, \ldots, m$, and
$\mathbf{w} = [w_0,\, w_1,\, \ldots,\, w_d,\, \ldots,\, w_{jk},\, \ldots,\, w_{jkl},\, \ldots]^\top$.

P is the polynomial expansion of X: the polynomial function of X is a linear function of P. Here d denotes the dimension of the input features and m denotes the number of samples.

Ref: Duda, Hart, and Stork, "Pattern Classification", 2001 (Chp. 5)

Polynomial Regression

Example 3
Training set $\{\mathbf{x}_i, y_i\}_{i=1}^{4}$:
{x1 = 0, x2 = 0} → {y = −1}
{x1 = 1, x2 = 1} → {y = −1}
{x1 = 1, x2 = 0} → {y = +1}
{x1 = 0, x2 = 1} → {y = +1}

2nd order polynomial model:
$f_{\mathbf{w}}(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + w_{12} x_1 x_2 + w_{11} x_1^2 + w_{22} x_2^2 = [1 \;\; x_1 \;\; x_2 \;\; x_1 x_2 \;\; x_1^2 \;\; x_2^2]\,[w_0 \;\; w_1 \;\; w_2 \;\; w_{12} \;\; w_{11} \;\; w_{22}]^\top$

Stack the 4 training samples as a matrix P:
$\mathbf{P} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & x_{1,1}x_{1,2} & x_{1,1}^2 & x_{1,2}^2 \\ 1 & x_{2,1} & x_{2,2} & x_{2,1}x_{2,2} & x_{2,1}^2 & x_{2,2}^2 \\ 1 & x_{3,1} & x_{3,2} & x_{3,1}x_{3,2} & x_{3,1}^2 & x_{3,2}^2 \\ 1 & x_{4,1} & x_{4,2} & x_{4,1}x_{4,2} & x_{4,1}^2 & x_{4,2}^2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 0 & 1 \end{bmatrix}$

With 4 equations and 6 unknowns, $\mathbf{P}\mathbf{w} = \mathbf{y}$ is an under-determined system; a unique constrained solution is obtained with the dual form, as applied in the continuation of this example.
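A minimal sketch that builds P with sklearn's PolynomialFeatures, as the slide annotations suggest (note: PolynomialFeatures orders the columns as 1, x1, x2, x1², x1·x2, x2², which differs from the slide's ordering but does not affect the fit):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Training inputs from Example 3 (the XOR pattern)
X = np.array([[0, 0],
              [1, 1],
              [1, 0],
              [0, 1]], dtype=float)
y = np.array([-1, -1, 1, 1], dtype=float)

# 2nd-order polynomial expansion
poly = PolynomialFeatures(degree=2)
P = poly.fit_transform(X)
print(P.shape)                      # (4, 6): 4 samples, 6 polynomial features
print(np.linalg.matrix_rank(P))     # rank 4 < 6 columns: under-determined system
```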
Polynomial Regression

Summary

Ridge Regression in Primal Form (m > d, where d is the number of columns of P)
For λ > 0,
Learning: $\hat{\mathbf{w}} = (\mathbf{P}^\top\mathbf{P} + \lambda\mathbf{I})^{-1}\mathbf{P}^\top\mathbf{y}$ (the shrinkage controls model complexity)
Prediction: $\hat{f}_{\hat{\mathbf{w}}}(\mathbf{P}(\mathbf{X}_{\text{new}})) = \mathbf{P}_{\text{new}}\hat{\mathbf{w}}$

Ridge Regression in Dual Form (m < d, where m is the number of rows of P)
For λ > 0,
Learning: $\hat{\mathbf{w}} = \mathbf{P}^\top(\mathbf{P}\mathbf{P}^\top + \lambda\mathbf{I})^{-1}\mathbf{y}$
Prediction: $\hat{f}_{\hat{\mathbf{w}}}(\mathbf{P}(\mathbf{X}_{\text{new}})) = \mathbf{P}_{\text{new}}\hat{\mathbf{w}}$

The primal and dual solutions are essentially the same; choosing the form that inverts the smaller matrix saves computation time.

Note: these are the primal/dual forms from the earlier ridge regression slides with X replaced by P; m & d here refer to the size of P (not X).
Polynomial Regression

Summary

For Regression Applications
• Learn continuous-valued $\mathbf{y}$ using either the primal form or the dual form
• Prediction: $\hat{f}_{\hat{\mathbf{w}}}(\mathbf{P}(\mathbf{X}_{\text{new}})) = \mathbf{P}_{\text{new}}\hat{\mathbf{w}}$

For Classification Applications
• Learn discrete-valued $\mathbf{y}$ ($y \in \{-1, +1\}$) or one-hot $\mathbf{Y}$ using either the primal form or the dual form
• Binary Prediction: $\hat{f}_{\hat{\mathbf{w}}}(\mathbf{P}(\mathbf{X}_{\text{new}})) = \text{sign}(\mathbf{P}_{\text{new}}\hat{\mathbf{w}})$
• Multi-Category Prediction: $\hat{f}_{\hat{\mathbf{W}}}(\mathbf{P}(\mathbf{X}_{\text{new}})) = \arg\max_{k=1,\ldots,C}\,(\mathbf{P}_{\text{new}}\hat{\mathbf{W}})_{(:,k)}$
Example 3 (cont'd)
Training set:
{x1 = 0, x2 = 0} → {y = −1}
{x1 = 1, x2 = 1} → {y = −1}
{x1 = 1, x2 = 0} → {y = +1}
{x1 = 0, x2 = 1} → {y = +1}

2nd order polynomial model, with m = 4 samples and d = 6 polynomial features (m < d):

$\mathbf{P} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 0 & 1 \end{bmatrix}$

Use the dual form (here with the ridge term omitted, since $\mathbf{P}\mathbf{P}^\top$ is invertible):

$\hat{\mathbf{w}} = \mathbf{P}^\top(\mathbf{P}\mathbf{P}^\top)^{-1}\mathbf{y} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 6 & 3 & 3 \\ 1 & 3 & 3 & 1 \\ 1 & 3 & 1 & 3 \end{bmatrix}^{-1} \begin{bmatrix} -1 \\ -1 \\ +1 \\ +1 \end{bmatrix} = \begin{bmatrix} -1 \\ 1 \\ 1 \\ -4 \\ 1 \\ 1 \end{bmatrix}$

Python demo 3.
Example 3 (cont'd): Prediction
Test set:
Test point 1: {x1 = 0.1, x2 = 0.1} → {y = class −1 or +1?}
Test point 2: {x1 = 0.9, x2 = 0.9} → {y = class −1 or +1?}
Test point 3: {x1 = 0.1, x2 = 0.9} → {y = class −1 or +1?}
Test point 4: {x1 = 0.9, x2 = 0.1} → {y = class −1 or +1?}

$\hat{\mathbf{y}} = \mathbf{P}_{\text{new}}\hat{\mathbf{w}} = \begin{bmatrix} 1 & 0.1 & 0.1 & 0.01 & 0.01 & 0.01 \\ 1 & 0.9 & 0.9 & 0.81 & 0.81 & 0.81 \\ 1 & 0.1 & 0.9 & 0.09 & 0.01 & 0.81 \\ 1 & 0.9 & 0.1 & 0.09 & 0.81 & 0.01 \end{bmatrix} \begin{bmatrix} -1 \\ 1 \\ 1 \\ -4 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} -0.82 \\ -0.82 \\ 0.46 \\ 0.46 \end{bmatrix}$
(columns of $\mathbf{P}_{\text{new}}$: $[1 \;\; x_1 \;\; x_2 \;\; x_1 x_2 \;\; x_1^2 \;\; x_2^2]$)

$\hat{f}_{\hat{\mathbf{w}}}(\mathbf{P}(\mathbf{X}_{\text{new}})) = \text{sign}(\hat{\mathbf{y}}) = \text{sign}\!\left(\begin{bmatrix} -0.82 \\ -0.82 \\ 0.46 \\ 0.46 \end{bmatrix}\right) = \begin{bmatrix} -1 \\ -1 \\ +1 \\ +1 \end{bmatrix}$: Class −1, Class −1, Class +1, Class +1
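A minimal sketch of demo 3 along the lines of the slide annotations, using the dual form with a small λ for numerical stability (the λ value and variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Example 3 training data (XOR pattern) and labels
X = np.array([[0, 0], [1, 1], [1, 0], [0, 1]], dtype=float)
y = np.array([-1, -1, 1, 1], dtype=float)

# 2nd-order polynomial expansion
poly = PolynomialFeatures(degree=2)
P = poly.fit_transform(X)

# Dual-form ridge: w = P^T (P P^T + lambda I_m)^{-1} y  (small lambda for stability)
lam = 1e-4
reg = lam * np.identity(P.shape[0])
w_dual = P.T @ np.linalg.inv(P @ P.T + reg) @ y

# Predict the four test points
X_new = np.array([[0.1, 0.1], [0.9, 0.9], [0.1, 0.9], [0.9, 0.1]])
P_new = poly.transform(X_new)
y_new = P_new @ w_dual
print(np.round(y_new, 2))      # approximately [-0.82 -0.82  0.46  0.46]
print(np.sign(y_new))          # [-1. -1.  1.  1.]
```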

Summary
• Notations, Vectors, Matrices
• Operations on Vectors and Matrices
• Systems of Linear Equations: $\mathbf{f}_{\mathbf{w}}(\mathbf{X}) = \mathbf{X}\mathbf{w} = \mathbf{y}$
• Functions, Derivative and Gradient
• Least Squares, Linear Regression with Single and Multiple Outputs
• Learning of vectored functions, binary and multi-category classification
• Ridge Regression: penalty term, primal and dual forms
• Polynomial Regression: nonlinear decision boundary

Primal form
Learning: $\hat{\mathbf{w}} = (\mathbf{P}^\top\mathbf{P} + \lambda\mathbf{I})^{-1}\mathbf{P}^\top\mathbf{y}$
Prediction: $\hat{f}_{\hat{\mathbf{w}}}(\mathbf{P}(\mathbf{X}_{\text{new}})) = \mathbf{P}_{\text{new}}\hat{\mathbf{w}}$

Dual form
Learning: $\hat{\mathbf{w}} = \mathbf{P}^\top(\mathbf{P}\mathbf{P}^\top + \lambda\mathbf{I})^{-1}\mathbf{y}$
Prediction: $\hat{f}_{\hat{\mathbf{w}}}(\mathbf{P}(\mathbf{X}_{\text{new}})) = \mathbf{P}_{\text{new}}\hat{\mathbf{w}}$

Hint: Python packages: sklearn.preprocessing (PolynomialFeatures, OneHotEncoder), sklearn.model_selection (train_test_split), np.sign
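A minimal end-to-end sketch combining the hinted packages on synthetic data (everything below, including the data-generating rule and the λ value, is illustrative and not from the slides):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data: label = sign of a quadratic function of x
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=200))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Polynomial expansion of order 2
poly = PolynomialFeatures(degree=2)
P_train = poly.fit_transform(X_train)
P_test = poly.transform(X_test)

# Primal-form ridge regression on the +/-1 labels
lam = 0.1
w = np.linalg.inv(P_train.T @ P_train + lam * np.identity(P_train.shape[1])) @ P_train.T @ y_train

# Classification accuracy of sign(P_test w)
y_pred = np.sign(P_test @ w)
print("test accuracy:", np.mean(y_pred == y_test))
```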
