
Regularization and Feature Selection
Learning Objectives
• Explain cost functions, regularization, feature selection, and hyper-parameters
• Summarize complex statistical optimization algorithms like gradient descent and its application to linear regression
• Apply Intel® Extension for Scikit-learn* to leverage the underlying compute capabilities of the hardware


Motivation
 If two or more independent variables are highly correlated:
 The intercept is estimated well, but what about the coefficients?
Motivation
 This happens because x1 and x2 are highly correlated.
🞑 RSS(40, -38) = 21.7 (our estimate) is very close to RSS(1, 1) = 22.6 (the truth)
 An effective way of dealing with this problem is penalization:
🞑 Instead of minimizing the RSS only, we add a penalty term to the regression objective…
Preventing Under- and Overfitting
[Plots: polynomial fits of degree 1, 3, and 9 (model vs. true function vs. samples)]
• How can we use a degree-9 polynomial and still prevent overfitting?
Preventing Under- and Overfitting
[Plots: polynomial fits of degree 1, 3, and 9 (model vs. true function vs. samples)]

J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(\beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)}\right)^2
Ridge Regression
 Ridge Regression Model
Ridge Regression
 Why does this help?
🞑 Smaller coefficients make the model less sensitive to changes in the variables.
Ridge Regression
 Lagrange Multiplier
🞑 A strategy for finding the local maxima or minima of a function subject to equality/inequality constraints

Minimizing \sum_{i=1}^{n} f(x) subject to a constraint on g(x)
is equivalent to minimizing \sum_{i=1}^{n} f(x) + \lambda g(x),
where λ is positive.

Ridge Regression
 Ridge Regression Model

Vector norms: https://www.youtube.com/watch?v=5fN2J8wYnfw

Optimization
H(b, λ) = (y − Xb)′(y − Xb) + λb′b
        = y′y − 2b′X′y + b′X′Xb + λb′b

∂H(b, λ)/∂b = −2X′y + 2X′Xb + 2λb = 0
(X′X + λI) b = X′y
b = (X′X + λI)⁻¹ X′y

X′X + λI is always invertible, so ridge regression always gives a unique solution.
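As an illustration of the closed-form solution above, here is a minimal NumPy sketch (the data and variable names are illustrative, not from the slides) that computes b = (X′X + λI)⁻¹X′y for two highly correlated predictors:

import numpy as np

def ridge_closed_form(X, y, lam):
    # Closed-form ridge solution b = (X'X + lam*I)^(-1) X'y
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)   # X'X + lambda*I (invertible for lam > 0)
    return np.linalg.solve(A, X.T @ y)       # solve is more stable than forming the inverse

# Tiny synthetic example with two nearly collinear predictors (as in the Motivation slides)
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)   # almost identical to x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=100)

print(ridge_closed_form(X, y, lam=0.0))      # unpenalized: coefficients can be wildly unstable
print(ridge_closed_form(X, y, lam=1.0))      # penalized: coefficients shrink toward the truth (1, 1)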
Ridge Regression
 Similar to the ordinary least squares solution, but with the addition of a “ridge” regularization term
 Applying the ridge penalty has the effect of shrinking the estimates toward zero
 This introduces bias but reduces the variance of the estimates
Ridge Regression Regularization
[Plots: degree-9 polynomial fits with λ = 0.0, 1e-5, and 0.1 (model vs. true function vs. samples)]

J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(\beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)}\right)^2 + \lambda \sum_{j=1}^{k} \beta_j^2
Effect of Ridge Regression on Parameters
[Plots: degree-9 polynomial fits with λ = 0.0, 1e-5, and 0.1, with bar charts of abs(coefficient) for each of the 9 parameters; larger λ shrinks the coefficient magnitudes]

J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(\beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)}\right)^2 + \lambda \sum_{j=1}^{k} \beta_j^2
Ridge Regression (L2)

J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(\beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)}\right)^2 + \lambda \sum_{j=1}^{k} \beta_j^2

• The penalty shrinks the magnitude of all coefficients
• Larger coefficients are penalized more strongly because of the squaring
A ridge solution can be hard to interpret because it is not sparse (no β's are set exactly to 0). What if we constrain the L1 norm instead of the Euclidean (L2) norm?
Ridge L2 Example:
https://www.youtube.com/watch?v=Q81RR3yKn30&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=23
Lasso Regression (L1)

J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(\beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)}\right)^2 + \lambda \sum_{j=1}^{k} |\beta_j|

• The penalty selectively shrinks some coefficients exactly to zero (see the sketch below)
• Can be used for feature selection
• Slower to converge than Ridge regression
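To make the sparsity difference concrete, here is a small hedged sketch on synthetic data (not from the slides) that fits Ridge and Lasso with comparable penalties and counts the coefficients set exactly to zero:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)  # only 2 informative features

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically 0: shrunk but nonzero
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # several driven exactly to 0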
Effect of Lasso Regression on Parameters
[Plots: degree-9 polynomial fits with λ = 0.0, 1e-5, and 0.1, with bar charts of abs(coefficient) for each of the 9 parameters; larger λ drives several coefficients to zero]

J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(\beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)}\right)^2 + \lambda \sum_{j=1}^{k} |\beta_j|

• Lasso L1 Example:
https://www.youtube.com/watch?v=NGf0voTMlcs&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=24

• L1 vs L2:
https://www.youtube.com/watch?v=Xm2C_gTAl8c&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=26
Elastic Net Regularization

J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(\beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)}\right)^2 + \lambda_1 \sum_{j=1}^{k} |\beta_j| + \lambda_2 \sum_{j=1}^{k} \beta_j^2

• A compromise between Ridge and Lasso regression
• Requires tuning an additional parameter that distributes the regularization penalty between L1 and L2
Elastic Net Regularization
[Plots: degree-9 polynomial fits with λ₁ = λ₂ = 0.0, 1e-5, and 0.1 (model vs. true function vs. samples)]

J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(\beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)}\right)^2 + \lambda_1 \sum_{j=1}^{k} |\beta_j| + \lambda_2 \sum_{j=1}^{k} \beta_j^2

Example:
https://www.youtube.com/watch?v=1dKRdX9bfIo&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=27
Hyperparameters and Their Optimization
• Regularization coefficients (λ₁ and λ₂) are empirically determined
• We want values that generalize—do not use test data for tuning (use test data to tune λ? NO!)
• Create an additional split of the data to tune hyperparameters—the validation set
• Cross validation can also be used on the training data

[Diagram: data split into Training Data, Validation Data, and Test Data; tune λ with cross validation]
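As a sketch of the workflow above (assuming X and y already hold the full dataset; the alpha grid is illustrative), λ can be tuned on a validation split or with cross validation on the training data, never on the test data:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, cross_val_score

# Split off a test set first, then carve a validation set out of the remainder
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

alphas = [1e-3, 1e-2, 1e-1, 1, 10]

# Option 1: pick the alpha that scores best on the validation set
val_scores = {a: Ridge(alpha=a).fit(X_train, y_train).score(X_val, y_val) for a in alphas}
best_alpha = max(val_scores, key=val_scores.get)

# Option 2: 5-fold cross validation on the training data only
cv_scores = {a: cross_val_score(Ridge(alpha=a), X_trainval, y_trainval, cv=5).mean() for a in alphas}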
Ridge Regression: Get λ using CV
Ridge Regression: The Syntax
Import the class containing the regression method
from sklearn.linear_model import Ridge

To use the Intel® Extension for Scikit-learn* variant of this algorithm:

• Install Intel® oneAPI AI Analytics Toolkit (AI Kit)
• Add the following two lines of code before importing the scikit-learn estimator:
from sklearnex import patch_sklearn
patch_sklearn()
Ridge Regression: The Syntax
Import the class containing the regression method
from sklearn.linear_model import Ridge

Create an instance of the class (alpha is the regularization parameter)
RR = Ridge(alpha=1.0)

Fit the instance on the data and then predict the expected value
RR = RR.fit(X_train, y_train)
y_predict = RR.predict(X_test)

The RidgeCV class will perform cross validation on a set of values for alpha.
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html
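A minimal RidgeCV sketch (the alpha grid is illustrative; X_train, y_train, X_test are assumed to exist as in the slides):

from sklearn.linear_model import RidgeCV

RRcv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5)  # try each candidate alpha with 5-fold CV
RRcv = RRcv.fit(X_train, y_train)
print(RRcv.alpha_)                                    # alpha selected by cross validation
y_predict = RRcv.predict(X_test)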
Lasso Regression: The Syntax
Import the class containing the regression method
from sklearn.linear_model import Lasso

Create an instance of the class (alpha is the regularization parameter)
LR = Lasso(alpha=1.0)

Fit the instance on the data and then predict the expected value
LR = LR.fit(X_train, y_train)
y_predict = LR.predict(X_test)

The LassoCV class will perform cross validation on a set of values for alpha.
Elastic Net Regression: The Syntax
Import the class containing the regression method
from sklearn.linear_model import ElasticNet

Create an instance of the class (alpha is the regularization parameter; l1_ratio distributes alpha between the L1 and L2 penalties)
EN = ElasticNet(alpha=1.0, l1_ratio=0.5)

Fit the instance on the data and then predict the expected value
EN = EN.fit(X_train, y_train)
y_predict = EN.predict(X_test)

The ElasticNetCV class will perform cross validation on a set of values for l1_ratio and alpha.
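A minimal ElasticNetCV sketch (the grids are illustrative; the train/test variables are assumed to exist as in the slides):

from sklearn.linear_model import ElasticNetCV

ENcv = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], alphas=[0.01, 0.1, 1.0], cv=5)
ENcv = ENcv.fit(X_train, y_train)
print(ENcv.alpha_, ENcv.l1_ratio_)    # values selected by cross validation
y_predict = ENcv.predict(X_test)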
Feature Selection

• Regularization performs feature selection by shrinking the contribution of features
• For L1 regularization, this is accomplished by driving some coefficients to zero
• Feature selection can also be performed by explicitly removing features
Why is Feature Selection Important?

• Reducing the number of features is another way to prevent overfitting (similar to regularization)
• For some models, fewer features can improve fitting time and/or results
• Identifying the most critical features can improve model interpretability
Recursive Feature Elimination: The Syntax
Import the class containing the feature selection method
from sklearn.feature_selection import RFE

Create an instance of the class (est is an instance of the model to use; n_features_to_select is the final number of features)
rfeMod = RFE(est, n_features_to_select=5)

Fit the instance on the data and then predict the expected value
rfeMod = rfeMod.fit(X_train, y_train)
y_predict = rfeMod.predict(X_test)

The RFECV class will perform feature elimination using cross validation.
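A minimal end-to-end sketch (using Ridge as the estimator is my choice for illustration; any estimator that exposes coef_ or feature_importances_ works):

from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import Ridge

est = Ridge(alpha=1.0)                       # model whose coefficients rank the features

# Keep a fixed number of features
rfeMod = RFE(est, n_features_to_select=5)
rfeMod = rfeMod.fit(X_train, y_train)
print(rfeMod.support_)                       # boolean mask of the selected features
y_predict = rfeMod.predict(X_test)

# Or let cross validation choose how many features to keep
rfecvMod = RFECV(est, cv=5).fit(X_train, y_train)
print(rfecvMod.n_features_)                  # number of features selected by CV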
Gradient Descent
Optimization
 Consider a function f(·) of p variables:

y = f(x₁, x₂, …, x_p)

 Find x₁, x₂, …, x_p that maximize or minimize y
 Usually, we minimize a cost/loss function or maximize a profit/likelihood function.
Global/Local Optimization
Gradient
 Single variable:
🞑 The derivative: slope of the tangent line at a point 𝑥0
Gradient (vector)
 Multivariable:
🞑 A vector of partial derivatives with respect to each of the independent variables
 The gradient ∇f points in the direction of the greatest rate of change, or “steepest ascent”
 The magnitude (or length) of ∇f is the greatest rate of change
Gradient Descent
Start with a cost function J(β):
[Plot: J(β) versus β, with the global minimum marked]
Then gradually move towards the minimum.
Convex function

The line segment connecting any two points on the graph must lie on or above the function.
The general idea
 We have k parameters θ₁, θ₂, …, θ_k we’d like to train for a model, with respect to some error/loss function J(θ₁, …, θ_k) to be minimized
 Gradient descent is one way to iteratively determine the optimal set of parameter values:
1. Initialize the parameters
2. Keep changing the values to reduce J(θ₁, …, θ_k)
🞑 The gradient ∇J tells us which direction increases J the most
🞑 We go in the opposite direction of the gradient to actually descend (see the sketch below)
https://www.geeksforgeeks.org/difference-between-gradient-descent-and-normal-equation/
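Concretely, each step moves the parameters against the gradient: θ ← θ − α∇J(θ). A minimal sketch of that update rule (the toy function and step size are illustrative):

import numpy as np

def gradient_descent(grad_J, theta0, alpha=0.1, n_iters=1000):
    # Repeatedly step opposite the gradient, the direction of steepest descent
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - alpha * grad_J(theta)
    return theta

# Example: minimize J(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3)
print(gradient_descent(lambda t: 2 * (t - 3), theta0=[0.0]))  # converges to ~3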
After each iteration:
[Animation frames: the parameter estimate moves downhill step by step on the cost surface]
Picture credit: Andrew Ng, Stanford University, Coursera Machine Learning, Lecture 2 Slides
Gradient Descent with Linear Regression
• Now imagine there are two parameters (β₀, β₁)
• This gives a more complicated surface J(β₀, β₁) on which the minimum must be found
• How can we do this without knowing what J(β₀, β₁) looks like?
[Plot: cost surface J(β₀, β₁) over the (β₀, β₁) plane]
Gradient Descent with Linear Regression
• Compute the gradient ∇J, which points in the direction of the biggest increase!
• −∇J (the negative gradient) points toward the biggest decrease at that point!
• The gradient is a vector whose coordinates are the partial derivatives of the cost with respect to the parameters
[Plot: cost surface over the (β₀, β₁) plane]
Gradient Descent with Linear Regression
• Then use the gradient and the cost function to calculate the next point (ω₁) from the current one (ω₀)
• The learning rate (α) is a tunable parameter that determines the step size
[Plot: successive points ω₀, ω₁ descending the cost surface over the (β₀, β₁) plane]
Gradient Descent with Linear Regression
• Each point can be iteratively calculated from the previous one
[Plot: successive points ω₀, ω₁, ω₂, ω₃ descending the cost surface]

\omega_3 = \omega_2 - \alpha \, \nabla \left[ \frac{1}{2m} \sum_{i=1}^{m} \left(\beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)}\right)^2 \right]
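A minimal NumPy sketch of this iteration for simple linear regression (synthetic data and hyperparameters are illustrative, not from the slides):

import numpy as np

def gd_linear_regression(x, y, alpha=0.1, n_iters=5000):
    # Batch gradient descent on J = 1/(2m) * sum((beta0 + beta1*x_i - y_i)^2)
    beta0, beta1 = 0.0, 0.0
    m = len(x)
    for _ in range(n_iters):
        residual = beta0 + beta1 * x - y            # beta0 + beta1*x_i - y_i for every sample
        beta0 -= alpha * residual.sum() / m         # step along -dJ/dbeta0
        beta1 -= alpha * (residual * x).sum() / m   # step along -dJ/dbeta1
    return beta0, beta1

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=100)
print(gd_linear_regression(x, y))                   # approximately (2.0, 3.0)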
Issues
 A convex objective function guarantees convergence to the global minimum
 Non-convexity brings the possibility of getting stuck in a local minimum
🞑 Different, randomized starting values can fight this
Issues cont.
 Convergence can be slow
🞑 A larger learning rate α can speed things up, but if α is too large, optima can be ‘jumped’ or skipped over, requiring more iterations
🞑 Too small a step size will keep convergence slow
🞑 Gradient descent can be combined with learning rate decay (a simple schedule is sketched below)

• Learning rate decay is a technique used for training modern neural networks.
• It starts training with a large learning rate and then slowly reduces/decays it until a local minimum is reached.
• It is empirically observed to help both optimization and generalization.
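One simple decay schedule, shown as a hedged sketch (inverse-time decay; the constants are illustrative):

def decayed_learning_rate(alpha0, decay_rate, iteration):
    # Inverse-time decay: start with a large step size and shrink it as training proceeds
    return alpha0 / (1.0 + decay_rate * iteration)

for it in [0, 10, 100, 1000]:
    print(it, decayed_learning_rate(alpha0=0.5, decay_rate=0.01, iteration=it))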
Numerical Optimization
 Numerical optimization methods include:
🞑 Gradient descent
🞑 Newton's method
🞑 Gauss–Newton algorithm
🞑 Levenberg–Marquardt algorithm
🞑 Line search methods
🞑 Trust-region methods

GD solved example: https://www.youtube.com/watch?v=sDv4f4s2SB8
Stochastic Gradient Descent
• Use a single data point to determine the gradient and cost function instead of all the data
[Plot: update step from ω₀ to ω₁ on the cost surface over the (β₀, β₁) plane]

\omega_1 = \omega_0 - \alpha \, \nabla \left[ \frac{1}{2} \left(\beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)}\right)^2 \right] \quad \text{for a single randomly chosen observation } i
Stochastic Gradient Descent
• Use a single data point to determine the gradient and cost function instead of all the data
[Plot: successive points ω₀, ω₁, ω₂, ω₃, ω₄ zig-zagging down the cost surface]
• The path is less direct due to the noise in a single data point—hence "stochastic"
SGD Solved Example: https://www.youtube.com/watch?v=vMh0zPT0tLI
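A minimal NumPy sketch of the stochastic update, one randomly chosen sample at a time (hyperparameters are illustrative):

import numpy as np

def sgd_linear_regression(x, y, alpha=0.01, n_epochs=50, seed=0):
    # Update (beta0, beta1) from a single sample at a time, in a random order each epoch
    rng = np.random.default_rng(seed)
    beta0, beta1 = 0.0, 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(len(x)):
            residual = beta0 + beta1 * x[i] - y[i]   # error on this single point
            beta0 -= alpha * residual                # gradient of 0.5*residual^2 w.r.t. beta0
            beta1 -= alpha * residual * x[i]         # gradient w.r.t. beta1
    return beta0, beta1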
Mini Batch Gradient Descent
• Perform an update for every n training examples
[Plot: successive points ω₀, ω₁ descending the cost surface]
Best of both worlds:
• Reduced memory relative to "vanilla" (batch) gradient descent
• Less noisy than stochastic gradient descent
Mini Batch Gradient Descent
• Mini-batch implementation is typically used for neural nets
• Batch sizes range from 50–256 points
• Trade-off between batch size and learning rate (α)
• Tailor the learning rate schedule: gradually reduce the learning rate during a given epoch
Stochastic Gradient Descent Regression: The Syntax
Import the class containing the regression model
from sklearn.linear_model import SGDRegressor

Create an instance of the class
SGDreg = SGDRegressor(loss='squared_loss', alpha=0.1, penalty='l2')
• squared_loss = linear regression
• alpha and penalty are the regularization parameters

Fit the instance on the data and then predict the expected value
SGDreg = SGDreg.fit(X_train, y_train)
y_pred = SGDreg.predict(X_test)

For the mini-batch version, use partial_fit instead of fit:
SGDreg = SGDreg.partial_fit(X_train, y_train)

Other loss methods exist: epsilon_insensitive, huber, etc.
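A hedged sketch of how the mini-batch (partial_fit) version might be driven in a loop (the batch size is illustrative; X_train, y_train, X_test are assumed to be arrays; note that newer scikit-learn versions spell the loss 'squared_error'):

from sklearn.linear_model import SGDRegressor

SGDreg = SGDRegressor(loss='squared_loss', alpha=0.1, penalty='l2')

batch_size = 64
for start in range(0, len(X_train), batch_size):
    X_batch = X_train[start:start + batch_size]
    y_batch = y_train[start:start + batch_size]
    SGDreg.partial_fit(X_batch, y_batch)      # one incremental update per mini-batch

y_pred = SGDreg.predict(X_test)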
Stochastic Gradient Descent Classification: The Syntax
Import the class containing the classification model
from sklearn.linear_model import SGDClassifier

Create an instance of the class
SGDclass = SGDClassifier(loss='log', alpha=0.1, penalty='l2')
• log loss = logistic regression (newer scikit-learn versions spell this 'log_loss')
• alpha and penalty are the regularization parameters

Fit the instance on the data and then predict the expected value
SGDclass = SGDclass.fit(X_train, y_train)
y_pred = SGDclass.predict(X_test)

For the mini-batch version, use partial_fit instead of fit (the first call to partial_fit also needs the classes argument):
SGDclass = SGDclass.partial_fit(X_train, y_train)

Other loss methods exist: hinge, squared_hinge, etc. (see the SVM lecture, week 7)
