Regularization and Feature Selection
Learning Objectives
• Explain cost functions, regularization, feature selection, and hyperparameters
• Summarize complex statistical optimization algorithms like gradient descent and its application to linear regression
• Apply Intel® Extension for Scikit-learn* to leverage underlying compute capabilities of hardware
Motivation
If two or more independent variables are highly correlated:
[Figure: three panels (Y vs. X) comparing the Model, the True Function, and the Samples]

$$J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{\mathrm{obs}}^{(i)} - y_{\mathrm{obs}}^{(i)} \right)^2$$
Ridge Regression
Why does this help?
🞑 Smaller coefficients make the model less sensitive to changes in the input variables.
Lagrange Multiplier
🞑 A strategy for finding the local maxima or minima of a function subject to equality/inequality constraints.

Minimizing $f(x)$ subject to $g(x) \le 0$ is equivalent to minimizing
$$f(x) + \lambda g(x),$$
where $\lambda$ is positive.
Ridge Regression Model

$$\frac{\partial H(\mathbf{b}, \lambda)}{\partial \mathbf{b}} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\mathbf{b} + 2\lambda\mathbf{b} = \mathbf{0}$$
$$(\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})\mathbf{b} = \mathbf{X}'\mathbf{y}$$
$$\mathbf{b} = (\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}'\mathbf{y}$$
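The closed-form solution above can be checked numerically. A minimal NumPy sketch (the synthetic data and variable names are illustrative, not from the slides):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    # b = (X'X + lambda*I)^(-1) X'y  -- the closed-form ridge solution
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Synthetic data: 50 observations, 3 features, known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

b_ols = ridge_closed_form(X, y, 0.0)      # lambda = 0 recovers ordinary least squares
b_ridge = ridge_closed_form(X, y, 100.0)  # a large lambda shrinks the coefficients
```

Increasing λ shrinks the coefficient vector toward zero, which is exactly the effect of the L2 penalty.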
[Figure: three panels (Y vs. X) comparing the Model, the True Function, and the Samples]

$$J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{\mathrm{obs}}^{(i)} - y_{\mathrm{obs}}^{(i)} \right)^2 + \lambda \sum_{j=1}^{k} \beta_j^2$$
Effect of Ridge Regression on Parameters

[Figure: fits for Poly=9 with λ=0.0, λ=1e-5, and λ=0.1 (Model, True Function, Samples), each with a bar chart of abs(coefficient) on a log scale (10⁰ to 10⁸) for coefficients 1 through 9]

$$J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{\mathrm{obs}}^{(i)} - y_{\mathrm{obs}}^{(i)} \right)^2 + \lambda \sum_{j=1}^{k} \beta_j^2$$
Ridge Regression (L2)

$$J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{\mathrm{obs}}^{(i)} - y_{\mathrm{obs}}^{(i)} \right)^2 + \lambda \sum_{j=1}^{k} \beta_j^2$$

• Penalty shrinks the magnitude of all coefficients
• Larger coefficients are strongly penalized because of the squaring

A ridge solution can be hard to interpret because it is not sparse (no β's are set exactly to 0). What if we constrain the L1 norm instead of the Euclidean (L2) norm?
Ridge L2 Example:
https://www.youtube.com/watch?v=Q81RR3yKn30&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=23
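As a scikit-learn sketch of this shrinkage effect (synthetic data; the estimator name RR follows the slides, and the alpha values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with a known linear relationship
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# alpha is scikit-learn's name for the regularization strength lambda
RR_weak = Ridge(alpha=1e-5).fit(X_train, y_train)
RR_strong = Ridge(alpha=100.0).fit(X_train, y_train)
y_predict = RR_strong.predict(X_test)
```

A larger alpha shrinks every coefficient, so the total coefficient magnitude of RR_strong ends up smaller than that of RR_weak.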
Lasso Regression (L1)

$$J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{\mathrm{obs}}^{(i)} - y_{\mathrm{obs}}^{(i)} \right)^2 + \lambda \sum_{j=1}^{k} \left| \beta_j \right|$$
[Figure: bar charts of abs(coefficient) on a log scale (10⁰ to 10⁸) for coefficients 1 through 9]

$$J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{\mathrm{obs}}^{(i)} - y_{\mathrm{obs}}^{(i)} \right)^2 + \lambda \sum_{j=1}^{k} \left| \beta_j \right|$$
• Lasso L1 Example: https://www.youtube.com/watch?v=NGf0voTMlcs&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=24
• L1 vs. L2: https://www.youtube.com/watch?v=Xm2C_gTAl8c&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=26
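A short scikit-learn sketch of the sparsity the L1 penalty produces (synthetic data; alpha chosen for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Only the first two of eight features actually influence y
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# The L1 penalty drives the coefficients of irrelevant features exactly to zero
lasso = Lasso(alpha=0.5).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0.0))
```

Unlike ridge, the resulting coefficient vector is sparse: the six irrelevant features get coefficients of exactly 0, which makes the model easier to interpret.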
Elastic Net Regularization

$$J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{\mathrm{obs}}^{(i)} - y_{\mathrm{obs}}^{(i)} \right)^2 + \lambda_1 \sum_{j=1}^{k} \left| \beta_j \right| + \lambda_2 \sum_{j=1}^{k} \beta_j^2$$

[Figure: three panels (Y vs. X) comparing the Model, the True Function, and the Samples]
Example:
https://www.youtube.com/watch?v=1dKRdX9bfIo&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=27
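A scikit-learn sketch (note that scikit-learn parameterizes the two penalties as a single alpha plus an l1_ratio mixing weight rather than separate λ₁ and λ₂; the data and values here are illustrative):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 6))
y = X @ np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0]) + 0.1 * rng.normal(size=120)
X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]

# l1_ratio=0.5 mixes the L1 (lasso) and L2 (ridge) penalties equally
EN = ElasticNet(alpha=0.1, l1_ratio=0.5)
EN = EN.fit(X_train, y_train)
y_predict = EN.predict(X_test)
score = EN.score(X_test, y_test)  # R^2 on held-out data
```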
Hyperparameters and Their Optimization
• Regularization coefficients (λ₁ and λ₂) are empirically determined
• Want a value that generalizes: do NOT use test data for tuning
• Create an additional split of the data to tune hyperparameters: the validation set
• Cross validation can also be used on the training data

[Diagram: tune λ with cross validation; the data is split into Training Data, Validation Data, and Test Data]
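As a sketch, tuning λ (alpha) on a validation split rather than the test set (synthetic data; the alpha grid is illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=200)

# First hold out the test set, then carve a validation set from the rest
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=0)

# Pick the alpha that scores best on the validation set (never the test set)
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = [Ridge(alpha=a).fit(X_train, y_train).score(X_val, y_val) for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]

# Only now evaluate once on the untouched test set
test_r2 = Ridge(alpha=best_alpha).fit(X_dev, y_dev).score(X_test, y_test)
```

Cross validation (e.g. sklearn.model_selection.GridSearchCV, or RidgeCV for this estimator) automates the same idea by averaging over several validation folds of the training data.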
The same pattern applies to each estimator: fit the instance on the data and then predict the expected value.

RR = RR.fit(X_train, y_train)
y_predict = RR.predict(X_test)

LR = LR.fit(X_train, y_train)
y_predict = LR.predict(X_test)

EN = EN.fit(X_train, y_train)
y_predict = EN.predict(X_test)

rfeMod = rfeMod.fit(X_train, y_train)
y_predict = rfeMod.predict(X_test)
$y = f(x_1, x_2, \ldots, x_p)$

Gradient Descent
Start with a cost function J(β):

[Figure: J(β) plotted against β, with the global minimum marked]

Then gradually move towards the minimum.

Convex function
The line segment connecting any two points on the graph must lie on or above the function.
The general idea
We have k parameters θ₁, θ₂, …, θₖ we'd like to train for a model, with respect to some error/loss function J(θ₁, …, θₖ) to be minimized.

Gradient descent is one way to iteratively determine the optimal set of parameter values:
1. Initialize parameters
2. Keep changing values to reduce J(θ₁, …, θₖ)
🞑 The gradient tells us which direction increases J the most
🞑 We go in the opposite direction of the gradient
To actually descend…
https://www.geeksforgeeks.org/difference-between-gradient-descent-and-normal-equation/
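The update loop can be sketched in plain NumPy for simple linear regression (the data, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

# Synthetic data from y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 + 3.0 * x + 0.1 * rng.normal(size=100)

alpha = 0.1      # learning rate
b0, b1 = 0.0, 0.0
m = len(x)
for _ in range(2000):
    resid = b0 + b1 * x - y
    # Partial derivatives of J = (1/2m) * sum(resid^2)
    grad_b0 = resid.sum() / m
    grad_b1 = (resid * x).sum() / m
    # Step in the direction opposite the gradient
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1
```

After enough iterations, (b0, b1) converges near the true parameters (2, 3).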
After each iteration:

[Figure sequence: successive gradient descent steps on the cost curve. Picture credit: Andrew Ng, Stanford University, Coursera Machine Learning, Lecture 2 Slides]
Gradient Descent with Linear Regression
• Now imagine there are two parameters (β₀, β₁)
• This is a more complicated surface on which the minimum must be found
• How can we do this without knowing what J(β₀, β₁) looks like?

[Figure: the cost surface J(β₀, β₁) plotted over the (β₀, β₁) plane]
Gradient Descent with Linear Regression
• Then use the gradient and the cost function to calculate the next point (ω₁) from the current one (ω₀)
• The learning rate (α) is a tunable parameter that determines step size

[Figure: a step from ω₀ to ω₁ on the contour plot of J(β₀, β₁)]
Gradient Descent with Linear Regression

$$\omega_3 = \omega_2 - \alpha \nabla \left[ \frac{1}{2m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{\mathrm{obs}}^{(i)} - y_{\mathrm{obs}}^{(i)} \right)^2 \right]$$

[Figure: successive steps on the contour plot of J(β₀, β₁)]
Issues
A convex objective function guarantees convergence to the global minimum.
Non-convexity brings the possibility of getting stuck in a local minimum.
🞑 Different, randomized starting values can fight this
Issues cont.
Convergence can be slow.
🞑 A larger learning rate α can speed things up, but with too large an α, optima can be 'jumped' or skipped over, requiring more iterations
🞑 Too small a step size will keep convergence slow
Alternatives designed to address these issues include:
🞑 Gauss–Newton algorithm
🞑 Levenberg–Marquardt algorithm
🞑 Trust-region methods
$$\omega_1 = \omega_0 - \alpha \nabla \left[ \frac{1}{2m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{\mathrm{obs}}^{(i)} - y_{\mathrm{obs}}^{(i)} \right)^2 \right]$$

[Figure: a gradient descent step on the contour plot of J(β₀, β₁)]
Stochastic Gradient Descent
• Use a single data point to determine the gradient and cost function instead of all the data
• Path is less direct due to noise in the single data point, hence "stochastic"

[Figure: noisy steps ω₀, ω₁, ω₂, ω₃, ω₄ on the contour plot of J(β₀, β₁)]
SGD Solved Example: https://www.youtube.com/watch?v=vMh0zPT0tLI
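A minimal NumPy sketch of SGD for the same simple linear regression (data and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=500)
y = 2.0 + 3.0 * x + 0.1 * rng.normal(size=500)

alpha = 0.05     # learning rate
b0, b1 = 0.0, 0.0
for epoch in range(20):
    # Visit the data in a fresh random order each epoch
    for i in rng.permutation(len(x)):
        # Gradient estimated from the single observation i
        resid = b0 + b1 * x[i] - y[i]
        b0 -= alpha * resid
        b1 -= alpha * resid * x[i]
```

Each update is cheap, but because each gradient comes from one noisy observation, the parameters jitter around the minimum rather than approaching it directly.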
Mini Batch Gradient Descent
• Perform an update for every n training examples

[Figure: steps ω₀, ω₁ on the contour plot of J(β₀, β₁)]

Best of both worlds:
• Reduced memory relative to "vanilla" gradient descent
• Less noisy than stochastic gradient descent
• Mini batch implementation typically used for neural nets
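The batching above can be sketched in NumPy (batch size n and the other values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=500)
y = 2.0 + 3.0 * x + 0.1 * rng.normal(size=500)

alpha, n = 0.1, 32   # learning rate and mini-batch size
b0, b1 = 0.0, 0.0
for epoch in range(50):
    order = rng.permutation(len(x))
    for start in range(0, len(x), n):
        idx = order[start:start + n]           # one mini-batch of up to n examples
        resid = b0 + b1 * x[idx] - y[idx]
        b0 -= alpha * resid.mean()             # gradient averaged over the batch
        b1 -= alpha * (resid * x[idx]).mean()
```

Averaging the gradient over n examples smooths out the single-point noise of SGD while still touching only a small slice of the data per update.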