
CLL788-Process Data Analytics

Regression and its derivatives


Hariprasad Kodamana
IIT DELHI



Overview of Presentation

1 Introduction

2 Regression and least squares

3 Logistic regression

4 Generalized linear models

5 Regularization and regression



A motivating example
Table: Portland housing data
Living area (ft²)    APT or HOUSE    Price ($1000s)
2104                 APT             400
1600                 APT             330
2400                 HOUSE           369
1416                 APT             232
3000                 HOUSE           540
...                  ...             ...

Given data like this: (i) can we predict the prices of other houses as a function of the size of their living areas? (ii) Based on the area, can we say whether a property is an APT or a HOUSE?


Some notations

x^(i) ∈ X ⊂ R denotes the "input" variable (input feature), e.g., living area
y^(i) ∈ Y ⊂ R denotes the "output" (target) variable that we are trying to predict, e.g., price
A pair (x^(i), y^(i)) is a training example (training sample)
A list of m training examples {(x^(i), y^(i)); i = 1, . . . , m} is a training set


Supervised learning
(Diagram: training set → learning algorithm → model h; new input x → h → predicted y)

The goal is, given a training set, to learn a function h : X → Y such that h(x) is a "good" predictor for the corresponding value of y.
When the target variable y to be predicted is continuous, such as price, the learning problem is a regression problem.
When the target variable y can take on only a small number of discrete values (such as house or apartment), it is a classification problem.
Overview of Presentation

1 Introduction

2 Regression and least squares

3 Logistic regression

4 Generalized linear models

5 Regularization and regression



Linear regression
Data for regression
i      y        x_1       x_2       ...    x_n
1      y^(1)    x^(11)    x^(12)    ...    x^(1n)
...    ...      ...       ...       ...    ...
m      y^(m)    x^(m1)    x^(m2)    ...    x^(mn)

Model: y = hθ(x) + ϵ = θ0 + θ1 x1 + · · · + θn xn + ϵ = xᵀθ + ϵ, where θ are the parameters and x the inputs
How to compute θ? One option is to minimize the difference between hθ(x) and y
Minimize the least-squares cost:
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 \tag{1}$$


Least squares-closed form solution

Data
i      y        x_1       x_2       ...    x_n
1      y^(1)    x^(11)    x^(12)    ...    x^(1n)
...    ...      ...       ...       ...    ...
m      y^(m)    x^(m1)    x^(m2)    ...    x^(mn)
Let
$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}, \qquad x^{(i)} = \begin{bmatrix} x^{(i1)} \\ x^{(i2)} \\ \vdots \\ x^{(in)} \end{bmatrix}, \qquad Y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \tag{2}$$


Linear regression assumptions

There should be a linear and additive relationship between the dependent (response) variable and the independent (predictor) variable(s).
There should be no correlation between the residual (error) terms (no autocorrelation).
The independent variables should not be correlated (no multicollinearity).
The error terms must have constant variance (homoskedasticity); the presence of non-constant variance is referred to as heteroskedasticity.
The error terms must be normally distributed.


Least squares-closed form solution
For linear regression, hθ(x^(i)) = (x^(i))ᵀθ, so

$$X\theta - Y = \begin{bmatrix} (x^{(1)})^T\theta \\ (x^{(2)})^T\theta \\ \vdots \\ (x^{(m)})^T\theta \end{bmatrix} - \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} = \begin{bmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ h_\theta(x^{(2)}) - y^{(2)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{bmatrix} \tag{3}$$

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 = \frac{1}{2}(X\theta - Y)^T(X\theta - Y)$$

For minimizing J(θ), set ∇θ J(θ) = 0:

$$\nabla_\theta J(\theta) = \frac{1}{2}\nabla_\theta\big(\theta^T X^T X\theta - \theta^T X^T Y - Y^T X\theta + Y^T Y\big) = 0 \tag{4}$$
$$= \frac{1}{2}\big(X^T X\theta + X^T X\theta - 2X^T Y\big) = 0 \tag{5}$$
$$\implies X^T X\theta = X^T Y \implies \theta = (X^T X)^{-1}X^T Y \tag{6}$$
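As a quick illustration of Eq. (6), here is a minimal NumPy sketch (not from the slides) that fits an intercept-plus-slope model to the Portland housing rows shown earlier; the variable names are mine.

```python
import numpy as np

# Rows from the Portland housing table: living area (ft^2) and price (in $1000s)
x = np.array([2104, 1600, 2400, 1416, 3000], dtype=float)
y = np.array([400, 330, 369, 232, 540], dtype=float)

# Design matrix with a column of ones for the intercept theta_0
X = np.column_stack([np.ones_like(x), x])

# Closed-form least squares, Eq. (6): theta = (X^T X)^{-1} X^T Y
# (np.linalg.solve is used instead of an explicit matrix inverse for stability)
theta = np.linalg.solve(X.T @ X, X.T @ y)
print("theta_0, theta_1:", theta)

# Predicted price for a 2000 ft^2 living area
print("prediction:", np.array([1.0, 2000.0]) @ theta)
```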
Gradient descent optimizer
A search algorithm that starts with an initial guess for θ and repeatedly updates the guess so that J(θ) decreases
The search direction is given by the rate of change of J(θ)
Gradient descent update:
$$\theta_j := \theta_j - \alpha\,\frac{\partial}{\partial\theta_j}J(\theta), \qquad j = 1, \dots, n \tag{7}$$
The update is along the negative gradient, for moving toward a minimum
α is called the learning rate and can be tuned
Also called steepest descent
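A minimal sketch (my own, not from the slides) of the batch update (7) applied to the least-squares cost J(θ); the learning rate and iteration count are illustrative and would need tuning for real data.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=1e-3, n_iters=1000):
    """Minimize J(theta) = 0.5 * sum_i (x_i^T theta - y_i)^2 by steepest descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y)   # gradient of J over the full training set
        theta -= alpha * grad          # step against the gradient
    return theta
```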





Gradient descent

Vanilla / batch gradient descent: calculates the gradient averaged over all training samples.
Stochastic gradient descent: calculates the gradient from a single training sample. It is memory efficient but unstable (the updates fluctuate) and may overshoot even after getting close to the global optimum.
Mini-batch gradient descent: the data is divided into mini-batches, and the gradient is calculated as the average over all samples in one mini-batch.




Nesterov Accelerated Gradient (NAG)

When the iterate approaches the minimum there is a need to slow down; otherwise, like a rolling ball, it will not stop on the flat surface at the minimum and will continue to move up the other side.
NAG provides the ability to slow down as the algorithm gets close to a minimum.
It works better than the conventional momentum algorithm.
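A minimal sketch of the Nesterov look-ahead update; the slides do not give the equations, so the formulation, momentum coefficient, and learning rate below are my assumptions.

```python
import numpy as np

def nag_step(theta, velocity, grad_fn, alpha=0.01, mu=0.9):
    """One Nesterov accelerated gradient step; grad_fn(theta) returns the cost gradient."""
    lookahead = theta + mu * velocity                      # peek ahead along the momentum direction
    velocity = mu * velocity - alpha * grad_fn(lookahead)
    return theta + velocity, velocity
```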





AdaGrad
AdaGrad adapts different learning rates for different parameters.
The parameters associated with frequently occurring features are updated with smaller learning rates, whereas the parameters associated with less frequent features are updated with larger learning rates.
The effective learning rate decreases as the number of steps increases.
It does not require manual tuning of the learning rate; α is typically kept at 0.01.


AdaDelta and RMS-Prop

AdaGrad is extended to AdaDelta and RMSProp, in which its aggressively decreasing learning rate is moderated.
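A minimal RMSProp-style sketch of the per-parameter adaptive step; the decay rate and epsilon are common default choices, assumed here since the slides do not list them.

```python
import numpy as np

def rmsprop_step(theta, cache, grad, alpha=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update: scale each coordinate by a running average of squared gradients."""
    cache = rho * cache + (1 - rho) * grad**2
    theta = theta - alpha * grad / (np.sqrt(cache) + eps)
    return theta, cache
```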



Adaptive Moment Estimation (Adam)
It adapts different learning rates for different parameters.
In addition to storing the square of the gradient for scaling the learning rate (similar to RMSProp), it also adapts the gradient using the concept of momentum.
Typical values used are β1 = 0.9, β2 = 0.999, and ϵ = 10⁻⁸.
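A minimal sketch of the Adam update using the constants quoted above; the bias-correction terms follow the standard formulation and are my addition.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration counter."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2        # second moment (RMSProp-like)
    m_hat = m / (1 - beta1**t)                   # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```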



Least Mean Square (LMS) update: linear
regression
Also known as Widrow-Hoff learning rule
Let us suppose we have only one training example (x , y )
$$\frac{\partial}{\partial\theta_j}J(\theta) = \frac{\partial}{\partial\theta_j}\,\frac{1}{2}\big(h_\theta(x) - y\big)^2, \qquad j = 1, \dots, n \tag{8}$$
$$= 2\cdot\frac{1}{2}\big(h_\theta(x) - y\big)\,\frac{\partial}{\partial\theta_j}\big(h_\theta(x) - y\big) \tag{9}$$
$$= \big(h_\theta(x) - y\big)\,\frac{\partial}{\partial\theta_j}\Big(\sum_{i=1}^{n}\theta_i x_i - y\Big) \tag{10}$$
$$= \big(h_\theta(x) - y\big)\,x_j \tag{11}$$

For the i-th training example:
$$\theta_j := \theta_j + \alpha\,\big(y^{(i)} - h_\theta(x^{(i)})\big)\,x_j^{(i)}, \qquad j = 1, \dots, n \tag{12}$$
LMS: linear regression (contd.)

How to generalize this method for more than one training example?

Data: m training examples
Result: θ_j, j = 1, . . . , n
initialize θ_j;
while not converged do
    θ_j := θ_j + α Σ_{i=1}^{m} (y^(i) − h_θ(x^(i))) x_j^(i),   j = 1, . . . , n
end
Algorithm 1: Batch LMS


LMS: linear regression (contd.)

How to generalize this method for more than one training example?

Data: m training examples
Result: θ_j, j = 1, . . . , n
initialize θ_j;
while (optionally) until the desired minimum is reached do
    for i = 1, . . . , m do
        θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i),   j = 1, . . . , n
    end
end
Algorithm 2: Stochastic LMS
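A minimal NumPy sketch of Algorithm 2 for linear regression; the learning rate and number of passes are illustrative assumptions, not values from the slides.

```python
import numpy as np

def stochastic_lms(X, y, alpha=0.01, n_epochs=10):
    """Stochastic LMS (Widrow-Hoff): update theta one training example at a time."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in range(m):
            error = y[i] - X[i] @ theta     # y^(i) - h_theta(x^(i))
            theta += alpha * error * X[i]   # per-example update, Eq. (12)
    return theta
```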



Properties of LMS

The magnitude of the update is proportional to the error term (y^(i) − hθ(x^(i)))
For a training example on which the prediction nearly matches the actual value of y^(i), the parameter change is minimal
A larger change to the parameters is made when hθ(x^(i)) deviates strongly from y^(i)
Note that LMS can be susceptible to local minima in general
In the linear regression case, gradient descent (with a suitable learning rate) always converges, as the problem has only one global solution


Batch and stochastic LMS
Batch LMS: looks at every example in the entire training set on every step
Stochastic LMS: each time a training example is encountered, the parameters are updated according to that single training example only
Batch LMS has to scan through the entire training set before taking a single step, a costly operation if the data set is large
Stochastic LMS can start updating right away when a training example arrives
Stochastic LMS moves θ toward the minimum much faster than batch LMS; however, it may never converge to the minimum, because each update is based on a single example
The parameters will keep J(θ) oscillating around the minimum
When the training set is large, stochastic LMS is often preferred over batch LMS
Least squares-a probabilistic interpretation

Let us assume that the outputs and the inputs are related via
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)} \tag{13}$$
ϵ^(i) is an error term that captures either unmodeled effects or random noise; the ϵ^(i) are assumed to be distributed IID (independently and identically distributed) according to a Gaussian distribution
So, ϵ^(i) ∼ N(0, σ²), that is,
$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\Big)$$
Hence,
$$p(y^{(i)}\mid x^{(i)};\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\Big)$$


Least squares-a probabilistic interpretation
Given X (which contains all the x^(i)'s) and θ, what is the distribution of the y^(i)'s? This is given by p(Y | X; θ)
Likelihood function: an explicit representation of this quantity as a function of θ:
$$L(\theta) = L(\theta; X, Y) = p(Y\mid X;\theta) = \prod_{i=1}^{m} p(y^{(i)}\mid x^{(i)};\theta) \tag{14}$$
Using the independence assumption on the ϵ^(i):
$$= \prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\Big) \tag{15}$$
Maximum likelihood estimate of θ: maximize L(θ)
The logarithm is a monotonically increasing function, so we work with the log-likelihood ℓ(θ) = ln L(θ) for simplicity and tractability
Least squares-a probabilistic interpretation

MLE:
$$\ell(\theta) = \ln\prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\Big) \tag{16}$$
$$= \underbrace{m\ln\frac{1}{\sqrt{2\pi}\,\sigma}}_{\text{const}} - \underbrace{\frac{1}{\sigma^2}}_{\text{scaling}}\cdot\frac{1}{2}\sum_{i=1}^{m}\big(y^{(i)} - \theta^T x^{(i)}\big)^2 \tag{17}$$

Hence, maximizing ℓ(θ) is equivalent to minimizing ½ Σ_{i=1}^{m} (y^(i) − θᵀx^(i))², the least-squares objective
In other words, under the probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of θ


Locally Weighted Linear Regression (LWR)
Linear regression
1 Training: fit θ to minimize Σ_i (y^(i) − θᵀx^(i))²
2 Prediction: output θᵀx

LWR
1 An online algorithm
2 Training: fit θ to minimize Σ_i w^(i) (y^(i) − θᵀx^(i))²
3 Prediction: output θᵀx

w^(i) are non-negative weights
If w^(i) is large, the penalization of (y^(i) − θᵀx^(i))² is high; if w^(i) is small, the penalization of (y^(i) − θᵀx^(i))² is small
A standard choice, at a particular query point x:
$$w^{(i)} = \exp\Big(-\frac{(x^{(i)} - x)^2}{2\tau^2}\Big)$$
If |x^(i) − x| is small, w^(i) ≈ 1; if |x^(i) − x| is large, w^(i) ≈ 0
τ is the bandwidth parameter, which decides how fast the weight of a training example falls off with the distance of its x^(i) from the query point x
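A minimal sketch of an LWR prediction at a single query point, solving the weighted least-squares problem in closed form; the helper name and the bandwidth value are illustrative assumptions.

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=1.0):
    """Locally weighted linear regression prediction at one query point.
    X: m x n design matrix, y: length-m targets, x_query: length-n query vector."""
    # Gaussian weights based on squared distance from the query point
    d2 = np.sum((X - x_query)**2, axis=1)
    w = np.exp(-d2 / (2.0 * tau**2))
    W = np.diag(w)
    # Weighted normal equations: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta
```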
Parametric and non-parametric models

Parametric model:
Fixed, finite number of parameters (θ's) in the model
Once θ is fixed, it can be used directly for prediction
The training set is not used during future predictions
Eg: Linear regression

Non-parametric model:
The model parameters (θ) are not fixed in number
The training set has to be stored and used during future predictions
Eg: Locally weighted linear regression


Overview of Presentation

1 Introduction

2 Regression and least squares

3 Logistic regression

4 Generalized linear models

5 Regularization and regression



Logistic regression

Consider the binary classification problem in which y can take on only two values, 0 and 1.
If y is discrete-valued, using linear regression to predict y given x gives a poor response.
Logistic regression model:
$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} \tag{18}$$
g(z) = 1/(1 + e^{−z}) is called the logistic or sigmoid function.




Logistic regression
g(z) = 1/(1 + e^{−z}) is called the logistic or sigmoid function
g(z) → 1 when z → ∞, and g(z) → 0 when z → −∞
An interesting property of g(z):
$$g'(z) = \frac{d}{dz}\,\frac{1}{1+e^{-z}} = -\frac{1}{(1+e^{-z})^2}\,(-e^{-z}) \tag{20}$$
$$= \frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}} = \frac{1}{1+e^{-z}}\Big(1 - \frac{1}{1+e^{-z}}\Big) \tag{21}$$
$$= g(z)\big(1 - g(z)\big) \tag{22}$$
Logistic regression: parameter estimation
Let us assume that:
$$p(y = 1\mid x;\theta) = h_\theta(x) \tag{23}$$
$$p(y = 0\mid x;\theta) = 1 - h_\theta(x) \tag{24}$$
Combining these:
$$p(y\mid x;\theta) = \big(h_\theta(x)\big)^{y}\big(1 - h_\theta(x)\big)^{1-y} \tag{25}$$
MLE: if there are m independent training examples:
$$L(\theta) = p(Y\mid X;\theta) = \prod_{i=1}^{m} p(y^{(i)}\mid x^{(i)};\theta) \tag{26}$$
$$= \prod_{i=1}^{m}\big(h_\theta(x^{(i)})\big)^{y^{(i)}}\big(1 - h_\theta(x^{(i)})\big)^{1-y^{(i)}} \tag{27}$$




Logistic regression: parameter estimation
Log-likelihood:
$$\ell(\theta) = \ln L(\theta) \tag{33}$$
$$= \sum_{i=1}^{m} y^{(i)}\ln h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big)\ln\big(1 - h_\theta(x^{(i)})\big) \tag{34}$$

Gradient ascent (as we are maximizing the likelihood), written for a single example (x, y):
$$\frac{\partial}{\partial\theta_j}\ell(\theta) = \Big(y\,\frac{1}{g(\theta^T x)} - (1-y)\,\frac{1}{1 - g(\theta^T x)}\Big)\frac{\partial}{\partial\theta_j}g(\theta^T x) \tag{35}$$
$$= \Big(y\,\frac{1}{g(\theta^T x)} - (1-y)\,\frac{1}{1 - g(\theta^T x)}\Big)\,g(\theta^T x)\big(1 - g(\theta^T x)\big)\,\frac{\partial}{\partial\theta_j}(\theta^T x)$$
$$= \big(y\,(1 - g(\theta^T x)) - (1-y)\,g(\theta^T x)\big)\,x_j \tag{36}$$
$$= \big(y - h_\theta(x)\big)\,x_j \tag{37}$$

Update rule: θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)
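A minimal NumPy sketch (my own) of logistic regression trained with the stochastic gradient-ascent update above; the learning rate and number of passes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.1, n_epochs=100):
    """Fit logistic regression by stochastic gradient ascent on the log-likelihood."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in range(m):
            h = sigmoid(X[i] @ theta)
            theta += alpha * (y[i] - h) * X[i]   # theta_j += alpha * (y - h_theta(x)) * x_j
    return theta
```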
Overview of Presentation

1 Introduction

2 Regression and least squares

3 Logistic regression

4 Generalized linear models

5 Regularization and regression



Generalized Linear Models

Exponential family distributions:
$$p(y;\eta) = b(y)\exp\big(\eta^T T(y) - a(\eta)\big)$$
η is the natural parameter of the distribution; T(y) is the sufficient statistic, mostly chosen to be y
A fixed choice of T, a and b defines a family (or set) of distributions that is parametrized by η; as we vary η, we then get different distributions within this family
Let η = µ, T(y) = y, a(η) = η²/2 = µ²/2, b(y) = (1/√(2π)) exp(−y²/2). Then
$$p(y;\mu) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{y^2}{2}\Big)\exp\Big(\mu y - \frac{\mu^2}{2}\Big) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}(y-\mu)^2\Big),$$
recovering the Gaussian distribution (with unit variance)


Generalized Linear Models (GLM) (contd...)

GLM
y | x; θ ∼ exponential family
Given x, our goal is to predict the expected value of T(y) given x, that is, hθ(x) = E[y | x]. Eg. logistic regression: p(y | x; θ) = (hθ(x))^y (1 − hθ(x))^(1−y)
The natural parameter η and the inputs x are related linearly: η = θᵀx


Special cases of GLM

Linear regression: y = θᵀx + ϵ
y | x; θ ∼ N(µ = θᵀx, σ²), which belongs to the exponential family
hθ(x) = E[y | x] = θᵀx
In θᵀx, the inputs enter linearly with respect to the parameters
Logistic regression: hθ(x) = g(θᵀx) = 1 / (1 + exp(−θᵀx))
Success or failure: the Bernoulli distribution belongs to the exponential family
p(y | x; θ) = (hθ(x))^y (1 − hθ(x))^(1−y)
In θᵀx, the inputs enter linearly with respect to the parameters
There are many other examples


Overview of Presentation

1 Introduction

2 Regression and least squares

3 Logistic regression

4 Generalized linear models

5 Regularization and regression



Subset selection

There are two reasons why we are often not satisfied with the
least squares estimates
1 prediction accuracy: the least squares estimates often have low
bias but large variance. Prediction accuracy can sometimes be
improved by shrinking or setting some coefficients to zero. By
doing so we sacrifice a little bit of bias to reduce the variance of
the predicted values, and hence may improve the overall
prediction accuracy.
2 interpretation: with a large number of predictors, we often
would like to determine a smaller subset that exhibit the
strongest effects. In order to get the ”big picture” we are willing
to sacrifice some of the small details.



Best-subset Selection
Linear regression model: y = β₀ + Σ_{i=1}^{k} β_i x_i + ϵ
Best-subset regression finds, for each k ∈ {0, 1, 2, . . . , p}, the subset of size k that gives the smallest residual sum of squares
Forward-stepwise regression: starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit
Backward-stepwise regression: starts with the full model, and sequentially deletes the predictors
Best-subset Selection
Stagewise regression
1 It starts like forward-stepwise regression, with an intercept equal to ȳ, and centered predictors with coefficients initially all 0.
2 At each step the algorithm identifies the variable most correlated with the current residual.
3 It then computes the simple linear regression coefficient of the residual on this chosen variable, and adds it to the current coefficient for that variable.
4 This is continued until none of the variables have correlation with the residuals.
Unlike forward-stepwise regression, none of the other variables are adjusted when a term is added to the model.
As a consequence, forward stagewise regression can be slow in fitting.


Least Angle Regression

A "democratic" version of forward stepwise/stagewise regression
1 Standardize the predictors to have mean zero and unit norm. Start with the residual r = y − ȳ and β₁ = · · · = β_p = 0.
2 Find the regressor x_j most correlated with r.
3 Move β_j from 0 towards its least-squares coefficient on (r, x_j) until some other competitor x_k has as much correlation with the current residual as does x_j: β_j := β_j + α (x_jᵀx_j)⁻¹ x_jᵀ r.
4 Move β_j and β_k in the direction defined by their joint least-squares coefficient of the current residual on (x_j, x_k), until some other competitor x_l has as much correlation with the current residual.
5 Continue in this way until all p predictors have been entered.




Fitting (figure slide)


Advantage of regularized models

One wants to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data
A less complex model may underfit
A more complex model tends to overfit =⇒ more variance error
Law of parsimony: Occam's razor
Regularization = shrinkage = sparsity
Advantage of regularized models

Models with a larger number of variables are difficult to interpret physically.
Eg. Let us say you are modeling the concentration of a product (y) as a function of temperature (x1), flow rate (x2), pressure (x3), reactor volume (x4), and reactor temperature (x5), i.e.,
y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5
If all the coefficients β0, . . . , β5 are active, we do not have any clue which is the dominant variable that controls the composition.
If only one or two variables remain active after modeling, say x1 and x3, we conclude that temperature and pressure play a key role in the concentration: y = β1 x1 + β3 x3
That is, a lower number of variables improves interpretability and decision making.


Ridge regression
Ridge regression shrinks the regression coefficients by imposing a penalty on their size.
The ridge coefficients minimize a penalized residual sum of squares:
$$\hat\beta^{\text{ridge}} = \arg\min_\beta \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$
Here λ ≥ 0 is a complexity parameter that controls the amount of shrinkage.
The larger the value of λ, the greater the amount of shrinkage. The coefficients are shrunk toward zero (and toward each other).
Equivalent formulation which makes explicit the size constraint on the parameters:
$$\hat\beta^{\text{ridge}} = \arg\min_\beta \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \le t$$
(L2 penalty)
Ridge regression
When there are many correlated variables in a linear regression model, their coefficients can become poorly determined and exhibit high variance.
A wildly large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated variable. By imposing a size constraint on the coefficients, this problem is alleviated.
Notice that the intercept β0 has been left out of the penalty term, to avoid making the procedure depend on the origin chosen for y.
β̂^ridge = (XᵀX + λI)⁻¹ Xᵀy, where I is the p × p identity matrix.
The solution adds a positive constant to the diagonal of XᵀX before inversion. This makes the problem nonsingular, even if XᵀX is not of full rank, and was the main motivation for ridge regression.
λI lifts the problem away from singularity by adding a "ridge" along the diagonal, hence the name.
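A minimal NumPy sketch of the ridge solution; following the slide, the intercept is kept out of the penalty by centering the data first, and the λ value is illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Ridge regression: beta = (X^T X + lam * I)^{-1} X^T y on centered data."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean               # centering leaves the intercept unpenalized
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta
```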
Least Absolute Shrinkage and Selection
Operator (LASSO)
The lasso is a shrinkage method like ridge, with subtle but important differences.
Lasso regression:
$$\hat\beta^{\text{lasso}} = \arg\min_\beta \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \le t$$
Equivalent Lagrangian form:
$$\hat\beta^{\text{lasso}} = \arg\min_\beta \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$$
With the L1 penalty the shrinkage is much stronger (coefficients can be set exactly to zero).
This latter constraint makes the solutions nonlinear in the y_i, and there is no closed-form expression as in ridge regression.
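Because there is no closed form, the lasso is solved iteratively; below is a minimal proximal-gradient (ISTA-style) sketch, which is one possible solver and not necessarily the method implied by the slides. The step size and iteration count are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, thresh):
    """Proximal operator of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def lasso_ista(X, y, lam=1.0, n_iters=500):
    """Lasso on centered data: minimizes 0.5*||Xc b - yc||^2 + lam*||b||_1 (intercept unpenalized)."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    step = 1.0 / np.linalg.norm(Xc, ord=2)**2     # 1 / Lipschitz constant of the smooth part
    beta = np.zeros(Xc.shape[1])
    for _ in range(n_iters):
        grad = Xc.T @ (Xc @ beta - yc)            # gradient of the smooth least-squares term
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta
```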
Discussion: Ridge Regression and the Lasso

(Figure: constraint regions for the Lasso and Ridge)


Discussion: Ridge and Lasso regressions

Ridge and Lasso: the residual sum of squares has elliptical contours, centered at the full least-squares estimate.
The constraint region for ridge regression is the disk β₁² + β₂² ≤ t.
The constraint region for the Lasso is the diamond |β₁| + |β₂| ≤ t.
In constrained optimization, the solution lies where the constraint is active (on the boundary of the constraint surface).
Unlike the disk, the diamond has corners; if the solution occurs at a corner, it makes one parameter βj equal to zero.
Thus, shrinkage in the Lasso is stronger than in Ridge, and it can produce exactly zero coefficients.


Generalization of Lasso
Lasso, generalized:
$$\tilde\beta = \arg\min_\beta \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|^q$$
The value q = 0 corresponds to variable subset selection, as the penalty simply counts the number of nonzero parameters
q = 1 corresponds to the Lasso
q = 2 corresponds to ridge regression
We can also optimize q: hyper-parameter optimization


Elastic-net regression

The elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the Lasso and Ridge methods.
Elastic net:
$$\hat\beta^{\text{elastic}} = \arg\min_\beta \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda_1\sum_{j=1}^{p}|\beta_j| + \lambda_2\sum_{j=1}^{p}\beta_j^2$$
Often it is chosen such that λ2 = 1 − λ1
The elastic net method includes LASSO and ridge regression as special cases:
λ1 = λ and λ2 = 0: Lasso
λ2 = λ and λ1 = 0: Ridge
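For practical use, scikit-learn's ElasticNet implements this objective through an alpha / l1_ratio parameterization (which differs from the (λ1, λ2) form above by constant scalings); a brief, illustrative sketch with made-up data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=100)

# l1_ratio blends the penalties: l1_ratio=1 is the Lasso limit, l1_ratio=0 is ridge-like
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)
print(model.intercept_, model.coef_)
```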



Robust regression models

(Figure: data with an outlier and the corresponding regression fit)


Robust regression
Robust regression:
$$\hat\beta = \arg\min_\beta \sum_{i=1}^{N} g\Big(\frac{y_i - x_i^T\beta}{\hat\sigma}\Big),$$
where g is a robust function
Choice of g(·): if g(t) = ½ t², it is the OLS solution
Huber's regression: for a threshold M (which the user can fix based on how much tolerance is needed for outliers):
$$g(t) = \begin{cases} \tfrac{1}{2}t^2 & \text{if } |t| \le M \\ M|t| - \tfrac{1}{2}M^2 & \text{if } |t| > M \end{cases}$$
This function is identical to the least-squares penalty for small residuals, but on large residuals its penalty is lower and increases linearly rather than quadratically. It is thus more forgiving of outliers.
Least Absolute Deviation (LAD): g(t) = |t|, that is,
$$\hat\beta = \arg\min_\beta \sum_{i=1}^{N}\Big|\frac{y_i - x_i^T\beta}{\hat\sigma}\Big|,$$
also called median regression
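A minimal sketch of the Huber function g(t) defined above, e.g. for use inside a generic optimizer; the default threshold M = 1.345 is a common illustrative choice, not a value from the slides.

```python
import numpy as np

def huber(t, M=1.345):
    """Huber robust loss: quadratic for |t| <= M, linear beyond the threshold."""
    t = np.asarray(t, dtype=float)
    quadratic = 0.5 * t**2
    linear = M * np.abs(t) - 0.5 * M**2
    return np.where(np.abs(t) <= M, quadratic, linear)
```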
Thank you!

