Regularization and Feature Selection
Learning Objectives
• Explain cost functions, regularization, feature selection, and hyperparameters
• Summarize complex statistical optimization algorithms like gradient descent and its application to linear regression
• Apply Intel® Extension for Scikit-learn* to leverage underlying compute capabilities of hardware
Motivation
If two or more independent variables are highly correlated:
[Figure: three panels (Y vs. X) comparing the Model, the True Function, and the Samples]

$$J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{\mathrm{obs}}^{(i)} - y_{\mathrm{obs}}^{(i)} \right)^2$$
Ridge Regression
Why does this help?
🞑 Smaller coefficients make the model less sensitive to changes in the input variables.
Lagrange Multiplier
🞑 A strategy for finding the local maxima or minima of a function subject to equality/inequality constraints.

Minimizing $f(x)$ subject to $g(x) \le 0$ is equivalent to minimizing
$$f(x) + \lambda g(x),$$
where $\lambda$ is positive.
Ridge Regression Model

$$\frac{\partial H(\mathbf{b}, \lambda)}{\partial \mathbf{b}} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\mathbf{b} + 2\lambda\mathbf{b} = \mathbf{0}$$
$$(\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})\mathbf{b} = \mathbf{X}'\mathbf{y}$$
$$\mathbf{b} = (\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}'\mathbf{y}$$
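The closed-form solution above can be checked numerically. A minimal NumPy sketch (the synthetic data and variable names are illustrative, not from the slides):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    # b = (X'X + lambda*I)^(-1) X'y  -- the closed-form ridge solution
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Synthetic data: 50 observations, 3 features, known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

b_ols = ridge_closed_form(X, y, 0.0)      # lambda = 0 recovers ordinary least squares
b_ridge = ridge_closed_form(X, y, 100.0)  # a large lambda shrinks the coefficients
```

Increasing λ shrinks the coefficient vector toward zero, which is exactly the effect of the L2 penalty.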
[Figure: three panels (Y vs. X) comparing the Model, the True Function, and the Samples]

$$J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{\mathrm{obs}}^{(i)} - y_{\mathrm{obs}}^{(i)} \right)^2 + \lambda \sum_{j=1}^{k} \beta_j^2$$
Effect of Ridge Regression on Parameters

[Figure: fits for Poly=9 with λ=0.0, λ=1e-5, and λ=0.1 (Model, True Function, Samples), each with a bar chart of abs(coefficient) on a log scale (10⁰ to 10⁸) for coefficients 1 through 9]

$$J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{\mathrm{obs}}^{(i)} - y_{\mathrm{obs}}^{(i)} \right)^2 + \lambda \sum_{j=1}^{k} \beta_j^2$$
Ridge Regression (L2)

$$J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{\mathrm{obs}}^{(i)} - y_{\mathrm{obs}}^{(i)} \right)^2 + \lambda \sum_{j=1}^{k} \beta_j^2$$

• Penalty shrinks the magnitude of all coefficients
• Larger coefficients are strongly penalized because of the squaring

A ridge solution can be hard to interpret because it is not sparse (no β's are set exactly to 0). What if we constrain the L1 norm instead of the Euclidean (L2) norm?
Ridge L2 Example:
https://www.youtube.com/watch?v=Q81RR3yKn30&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=23
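As a scikit-learn sketch of this shrinkage effect (synthetic data; the estimator name RR follows the slides, and the alpha values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with a known linear relationship
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# alpha is scikit-learn's name for the regularization strength lambda
RR_weak = Ridge(alpha=1e-5).fit(X_train, y_train)
RR_strong = Ridge(alpha=100.0).fit(X_train, y_train)
y_predict = RR_strong.predict(X_test)
```

A larger alpha shrinks every coefficient, so the total coefficient magnitude of RR_strong ends up smaller than that of RR_weak.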
Lasso Regression (L1)

$$J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{\mathrm{obs}}^{(i)} - y_{\mathrm{obs}}^{(i)} \right)^2 + \lambda \sum_{j=1}^{k} \left| \beta_j \right|$$
[Figure: bar charts of abs(coefficient) on a log scale (10⁰ to 10⁸) for coefficients 1 through 9]

$$J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{\mathrm{obs}}^{(i)} - y_{\mathrm{obs}}^{(i)} \right)^2 + \lambda \sum_{j=1}^{k} \left| \beta_j \right|$$
• Lasso L1 Example: https://www.youtube.com/watch?v=NGf0voTMlcs&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=24
• L1 vs. L2: https://www.youtube.com/watch?v=Xm2C_gTAl8c&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=26
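A short scikit-learn sketch of the sparsity the L1 penalty produces (synthetic data; alpha chosen for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Only the first two of eight features actually influence y
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# The L1 penalty drives the coefficients of irrelevant features exactly to zero
lasso = Lasso(alpha=0.5).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0.0))
```

Unlike ridge, the resulting coefficient vector is sparse: the six irrelevant features get coefficients of exactly 0, which makes the model easier to interpret.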
Elastic Net Regularization

$$J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{\mathrm{obs}}^{(i)} - y_{\mathrm{obs}}^{(i)} \right)^2 + \lambda_1 \sum_{j=1}^{k} \left| \beta_j \right| + \lambda_2 \sum_{j=1}^{k} \beta_j^2$$

[Figure: three panels (Y vs. X) comparing the Model, the True Function, and the Samples]
Example:
https://www.youtube.com/watch?v=1dKRdX9bfIo&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=27
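A scikit-learn sketch (note that scikit-learn parameterizes the two penalties as a single alpha plus an l1_ratio mixing weight rather than separate λ₁ and λ₂; the data and values here are illustrative):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 6))
y = X @ np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0]) + 0.1 * rng.normal(size=120)
X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]

# l1_ratio=0.5 mixes the L1 (lasso) and L2 (ridge) penalties equally
EN = ElasticNet(alpha=0.1, l1_ratio=0.5)
EN = EN.fit(X_train, y_train)
y_predict = EN.predict(X_test)
score = EN.score(X_test, y_test)  # R^2 on held-out data
```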
Hyperparameters and Their Optimization
• Regularization coefficients (λ₁ and λ₂) are empirically determined
• Want a value that generalizes: do NOT use test data for tuning
• Create an additional split of the data to tune hyperparameters: the validation set
• Cross validation can also be used on the training data

[Diagram: tune λ with cross validation; the data is split into Training Data, Validation Data, and Test Data]
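As a sketch, tuning λ (alpha) on a validation split rather than the test set (synthetic data; the alpha grid is illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=200)

# First hold out the test set, then carve a validation set from the rest
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=0)

# Pick the alpha that scores best on the validation set (never the test set)
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = [Ridge(alpha=a).fit(X_train, y_train).score(X_val, y_val) for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]

# Only now evaluate once on the untouched test set
test_r2 = Ridge(alpha=best_alpha).fit(X_dev, y_dev).score(X_test, y_test)
```

Cross validation (e.g. sklearn.model_selection.GridSearchCV, or RidgeCV for this estimator) automates the same idea by averaging over several validation folds of the training data.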
The same pattern applies to each estimator: fit the instance on the data and then predict the expected value.

RR = RR.fit(X_train, y_train)
y_predict = RR.predict(X_test)

LR = LR.fit(X_train, y_train)
y_predict = LR.predict(X_test)

EN = EN.fit(X_train, y_train)
y_predict = EN.predict(X_test)

rfeMod = rfeMod.fit(X_train, y_train)
y_predict = rfeMod.predict(X_test)
$y = f(x_1, x_2, \ldots, x_p)$

Gradient Descent
Start with a cost function J(β):

[Figure: J(β) plotted against β, with the global minimum marked]

Then gradually move towards the minimum.

Convex function
The line segment connecting any two points on the graph must lie on or above the function.
The general idea
We have k parameters θ₁, θ₂, …, θₖ we'd like to train for a model, with respect to some error/loss function J(θ₁, …, θₖ) to be minimized.

Gradient descent is one way to iteratively determine the optimal set of parameter values:
1. Initialize parameters
2. Keep changing values to reduce J(θ₁, …, θₖ)
🞑 The gradient tells us which direction increases J the most
🞑 We go in the opposite direction of the gradient
To actually descend…
https://www.geeksforgeeks.org/difference-between-gradient-descent-and-normal-equation/
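The update loop can be sketched in plain NumPy for simple linear regression (the data, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

# Synthetic data from y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 + 3.0 * x + 0.1 * rng.normal(size=100)

alpha = 0.1      # learning rate
b0, b1 = 0.0, 0.0
m = len(x)
for _ in range(2000):
    resid = b0 + b1 * x - y
    # Partial derivatives of J = (1/2m) * sum(resid^2)
    grad_b0 = resid.sum() / m
    grad_b1 = (resid * x).sum() / m
    # Step in the direction opposite the gradient
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1
```

After enough iterations, (b0, b1) converges near the true parameters (2, 3).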
After each iteration:

[Figure sequence: successive gradient descent steps on the cost curve. Picture credit: Andrew Ng, Stanford University, Coursera Machine Learning, Lecture 2 Slides]
Gradient Descent with Linear Regression
• Now imagine there are two parameters (β₀, β₁)
• This is a more complicated surface on which the minimum must be found
• How can we do this without knowing what J(β₀, β₁) looks like?

[Figure: the cost surface J(β₀, β₁) plotted over the (β₀, β₁) plane]
Gradient Descent with Linear Regression
• Then use the gradient and the cost function to calculate the next point (ω₁) from the current one (ω₀)
• The learning rate (α) is a tunable parameter that determines step size

[Figure: a step from ω₀ to ω₁ on the contour plot of J(β₀, β₁)]
Gradient Descent with Linear Regression

$$\omega_3 = \omega_2 - \alpha \nabla \left[ \frac{1}{2m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{\mathrm{obs}}^{(i)} - y_{\mathrm{obs}}^{(i)} \right)^2 \right]$$

[Figure: successive steps on the contour plot of J(β₀, β₁)]
Issues
A convex objective function guarantees convergence to the global minimum.
Non-convexity brings the possibility of getting stuck in a local minimum.
🞑 Different, randomized starting values can fight this
Issues cont.
Convergence can be slow.
🞑 A larger learning rate α can speed things up, but with too large an α, optima can be 'jumped' or skipped over, requiring more iterations
🞑 Too small a step size will keep convergence slow
Alternatives designed to address these issues include:
🞑 Gauss–Newton algorithm
🞑 Levenberg–Marquardt algorithm
🞑 Trust-region methods
$$\omega_1 = \omega_0 - \alpha \nabla \left[ \frac{1}{2m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{\mathrm{obs}}^{(i)} - y_{\mathrm{obs}}^{(i)} \right)^2 \right]$$

[Figure: a gradient descent step on the contour plot of J(β₀, β₁)]
Stochastic Gradient Descent
• Use a single data point to determine the gradient and cost function instead of all the data
• Path is less direct due to noise in the single data point, hence "stochastic"

[Figure: noisy steps ω₀, ω₁, ω₂, ω₃, ω₄ on the contour plot of J(β₀, β₁)]
SGD Solved Example: https://www.youtube.com/watch?v=vMh0zPT0tLI
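A minimal NumPy sketch of SGD for the same simple linear regression (data and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=500)
y = 2.0 + 3.0 * x + 0.1 * rng.normal(size=500)

alpha = 0.05     # learning rate
b0, b1 = 0.0, 0.0
for epoch in range(20):
    # Visit the data in a fresh random order each epoch
    for i in rng.permutation(len(x)):
        # Gradient estimated from the single observation i
        resid = b0 + b1 * x[i] - y[i]
        b0 -= alpha * resid
        b1 -= alpha * resid * x[i]
```

Each update is cheap, but because each gradient comes from one noisy observation, the parameters jitter around the minimum rather than approaching it directly.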
Mini Batch Gradient Descent
• Perform an update for every n training examples

[Figure: steps ω₀, ω₁ on the contour plot of J(β₀, β₁)]

Best of both worlds:
• Reduced memory relative to "vanilla" gradient descent
• Less noisy than stochastic gradient descent
• Mini batch implementation typically used for neural nets
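The batching above can be sketched in NumPy (batch size n and the other values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=500)
y = 2.0 + 3.0 * x + 0.1 * rng.normal(size=500)

alpha, n = 0.1, 32   # learning rate and mini-batch size
b0, b1 = 0.0, 0.0
for epoch in range(50):
    order = rng.permutation(len(x))
    for start in range(0, len(x), n):
        idx = order[start:start + n]           # one mini-batch of up to n examples
        resid = b0 + b1 * x[idx] - y[idx]
        b0 -= alpha * resid.mean()             # gradient averaged over the batch
        b1 -= alpha * (resid * x[idx]).mean()
```

Averaging the gradient over n examples smooths out the single-point noise of SGD while still touching only a small slice of the data per update.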