
CLL788-Process Data Analytics

Regression and its derivatives


Hariprasad Kodamana
IIT DELHI



Overview of Presentation

1 Introduction

2 Regression and least squares

3 Logistic regression

4 Generalized linear models

5 Regularization and regression



A motivating example
Table: Portland housing data
Living area (ft²)    APT or HOUSE    Price ($1000s)
2104                 APT             400
1600                 APT             330
2400                 HOUSE           369
1416                 APT             232
3000                 HOUSE           540
...                  ...             ...

Given data like this: (i) can we predict the prices of other houses as a function of the size of their living areas? (ii) Based on the area, can we say whether a property is an APT or a HOUSE?


Some notations

x^(i) ∈ X ⊂ R denotes the "input" variable (input feature), e.g., living area
y^(i) ∈ Y ⊂ R denotes the "output" (target) variable that we are trying to predict, e.g., price
A pair (x^(i), y^(i)) is a training example (training sample)
A list of m training examples {(x^(i), y^(i)); i = 1, . . . , m} is a training set


Supervised learning
(Diagram: training set → learning algorithm → model h; new input x → h → predicted y)

The goal is, given a training set, to learn a function h : X → Y such that h(x) is a "good" predictor for the corresponding value of y.
When the target variable y to be predicted is continuous, such as price, the learning problem is a regression problem.
When the target variable y can take on only a small number of discrete values (such as house or apartment), it is a classification problem.
Overview of Presentation

1 Introduction

2 Regression and least squares

3 Logistic regression

4 Generalized linear models

5 Regularization and regression



Linear regression
Data for regression
i      y        x_1       x_2       ...    x_n
1      y^(1)    x^(11)    x^(12)    ...    x^(1n)
...    ...      ...       ...       ...    ...
m      y^(m)    x^(m1)    x^(m2)    ...    x^(mn)

Model: y = hθ(x) + ϵ = θ0 + θ1 x1 + · · · + θn xn + ϵ = xᵀθ + ϵ, where θ are the parameters and x the inputs
How to compute θ? One option is to minimize the difference between hθ(x) and y
Minimize the least-squares cost:
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 \tag{1}$$


Least squares-closed form solution

Data
i      y        x_1       x_2       ...    x_n
1      y^(1)    x^(11)    x^(12)    ...    x^(1n)
...    ...      ...       ...       ...    ...
m      y^(m)    x^(m1)    x^(m2)    ...    x^(mn)
Let
$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}, \qquad x^{(i)} = \begin{bmatrix} x^{(i1)} \\ x^{(i2)} \\ \vdots \\ x^{(in)} \end{bmatrix}, \qquad Y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \tag{2}$$


Linear regression assumptions

There should be a linear and additive relationship between the dependent (response) variable and the independent (predictor) variable(s).
There should be no correlation between the residual (error) terms (no autocorrelation).
The independent variables should not be correlated (no multicollinearity).
The error terms must have constant variance (homoskedasticity); the presence of non-constant variance is referred to as heteroskedasticity.
The error terms must be normally distributed.


Least squares-closed form solution
For linear regression, hθ(x^(i)) = (x^(i))ᵀθ, so

$$X\theta - Y = \begin{bmatrix} (x^{(1)})^T\theta \\ (x^{(2)})^T\theta \\ \vdots \\ (x^{(m)})^T\theta \end{bmatrix} - \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} = \begin{bmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ h_\theta(x^{(2)}) - y^{(2)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{bmatrix} \tag{3}$$

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 = \frac{1}{2}(X\theta - Y)^T(X\theta - Y)$$

For minimizing J(θ), set ∇θ J(θ) = 0:

$$\nabla_\theta J(\theta) = \frac{1}{2}\nabla_\theta\big(\theta^T X^T X\theta - \theta^T X^T Y - Y^T X\theta + Y^T Y\big) = 0 \tag{4}$$
$$= \frac{1}{2}\big(X^T X\theta + X^T X\theta - 2X^T Y\big) = 0 \tag{5}$$
$$\implies X^T X\theta = X^T Y \implies \theta = (X^T X)^{-1}X^T Y \tag{6}$$
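As a quick illustration of Eq. (6), here is a minimal NumPy sketch (not from the slides) that fits an intercept-plus-slope model to the Portland housing rows shown earlier; the variable names are mine.

```python
import numpy as np

# Rows from the Portland housing table: living area (ft^2) and price (in $1000s)
x = np.array([2104, 1600, 2400, 1416, 3000], dtype=float)
y = np.array([400, 330, 369, 232, 540], dtype=float)

# Design matrix with a column of ones for the intercept theta_0
X = np.column_stack([np.ones_like(x), x])

# Closed-form least squares, Eq. (6): theta = (X^T X)^{-1} X^T Y
# (np.linalg.solve is used instead of an explicit matrix inverse for stability)
theta = np.linalg.solve(X.T @ X, X.T @ y)
print("theta_0, theta_1:", theta)

# Predicted price for a 2000 ft^2 living area
print("prediction:", np.array([1.0, 2000.0]) @ theta)
```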
Gradient descent optimizer
A search algorithm that starts with an initial guess for θ and repeatedly updates the guess so that J(θ) decreases
The search direction is given by the rate of change of J(θ)
Gradient descent update:
$$\theta_j := \theta_j - \alpha\,\frac{\partial}{\partial\theta_j}J(\theta), \qquad j = 1, \dots, n \tag{7}$$
The update is along the negative gradient, for moving toward a minimum
α is called the learning rate and can be tuned
Also called steepest descent
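A minimal sketch (my own, not from the slides) of the batch update (7) applied to the least-squares cost J(θ); the learning rate and iteration count are illustrative and would need tuning for real data.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=1e-3, n_iters=1000):
    """Minimize J(theta) = 0.5 * sum_i (x_i^T theta - y_i)^2 by steepest descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y)   # gradient of J over the full training set
        theta -= alpha * grad          # step against the gradient
    return theta
```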





Gradient descent

Vanilla / batch gradient descent: calculates the gradient averaged over all training samples.
Stochastic gradient descent: calculates the gradient from a single training sample. It is memory efficient but unstable (the updates fluctuate) and may overshoot even after getting close to the global optimum.
Mini-batch gradient descent: the data is divided into mini-batches, and the gradient is calculated as the average over all samples in one mini-batch.




Nesterov Accelerated Gradient (NAG)

When the iterate approaches the minimum there is a need to slow down; otherwise, like a rolling ball, it will not stop on the flat surface at the minimum and will continue to move up the other side.
NAG provides the ability to slow down as the algorithm gets close to a minimum.
It works better than the conventional momentum algorithm.
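A minimal sketch of the Nesterov look-ahead update; the slides do not give the equations, so the formulation, momentum coefficient, and learning rate below are my assumptions.

```python
import numpy as np

def nag_step(theta, velocity, grad_fn, alpha=0.01, mu=0.9):
    """One Nesterov accelerated gradient step; grad_fn(theta) returns the cost gradient."""
    lookahead = theta + mu * velocity                      # peek ahead along the momentum direction
    velocity = mu * velocity - alpha * grad_fn(lookahead)
    return theta + velocity, velocity
```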





AdaGrad
AdaGrad adapts different learning rates for different parameters.
The parameters associated with frequently occurring features are updated with smaller learning rates, whereas the parameters associated with less frequent features are updated with larger learning rates.
The effective learning rate decreases as the number of steps increases.
It does not require manual tuning of the learning rate; α is typically kept at 0.01.


AdaDelta and RMS-Prop

AdaGrad is extended to AdaDelta and RMSProp, in which its aggressively decreasing learning rate is moderated.
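A minimal RMSProp-style sketch of the per-parameter adaptive step; the decay rate and epsilon are common default choices, assumed here since the slides do not list them.

```python
import numpy as np

def rmsprop_step(theta, cache, grad, alpha=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update: scale each coordinate by a running average of squared gradients."""
    cache = rho * cache + (1 - rho) * grad**2
    theta = theta - alpha * grad / (np.sqrt(cache) + eps)
    return theta, cache
```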



Adaptive Moment Estimation (Adam)
It adapts different learning rates for different parameters.
In addition to storing the square of the gradient for scaling the learning rate (similar to RMSProp), it also adapts the gradient using the concept of momentum.
Typical values used are β1 = 0.9, β2 = 0.999, and ϵ = 10⁻⁸.
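A minimal sketch of the Adam update using the constants quoted above; the bias-correction terms follow the standard formulation and are my addition.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration counter."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2        # second moment (RMSProp-like)
    m_hat = m / (1 - beta1**t)                   # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```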



Least Mean Square (LMS) update: linear
regression
Also known as Widrow-Hoff learning rule
Let us suppose we have only one training example (x , y )
$$\frac{\partial}{\partial\theta_j}J(\theta) = \frac{\partial}{\partial\theta_j}\,\frac{1}{2}\big(h_\theta(x) - y\big)^2, \qquad j = 1, \dots, n \tag{8}$$
$$= 2\cdot\frac{1}{2}\big(h_\theta(x) - y\big)\,\frac{\partial}{\partial\theta_j}\big(h_\theta(x) - y\big) \tag{9}$$
$$= \big(h_\theta(x) - y\big)\,\frac{\partial}{\partial\theta_j}\Big(\sum_{i=1}^{n}\theta_i x_i - y\Big) \tag{10}$$
$$= \big(h_\theta(x) - y\big)\,x_j \tag{11}$$

For the i-th training example:
$$\theta_j := \theta_j + \alpha\,\big(y^{(i)} - h_\theta(x^{(i)})\big)\,x_j^{(i)}, \qquad j = 1, \dots, n \tag{12}$$
LMS: linear regression (contd.)

How to generalize this method for more than one training example?

Data: m training examples
Result: θ_j, j = 1, . . . , n
initialize θ_j;
while not converged do
    θ_j := θ_j + α Σ_{i=1}^{m} (y^(i) − h_θ(x^(i))) x_j^(i),   j = 1, . . . , n
end
Algorithm 1: Batch LMS


LMS: linear regression (contd.)

How to generalize this method for more than one training example?

Data: m training examples
Result: θ_j, j = 1, . . . , n
initialize θ_j;
while (optionally) until the desired minimum is reached do
    for i = 1, . . . , m do
        θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i),   j = 1, . . . , n
    end
end
Algorithm 2: Stochastic LMS
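A minimal NumPy sketch of Algorithm 2 for linear regression; the learning rate and number of passes are illustrative assumptions, not values from the slides.

```python
import numpy as np

def stochastic_lms(X, y, alpha=0.01, n_epochs=10):
    """Stochastic LMS (Widrow-Hoff): update theta one training example at a time."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in range(m):
            error = y[i] - X[i] @ theta     # y^(i) - h_theta(x^(i))
            theta += alpha * error * X[i]   # per-example update, Eq. (12)
    return theta
```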



Properties of LMS

The magnitude of the update is proportional to the error term (y^(i) − hθ(x^(i)))
For a training example on which the prediction nearly matches the actual value of y^(i), the parameter change is minimal
A larger change to the parameters is made when hθ(x^(i)) deviates strongly from y^(i)
Note that LMS can be susceptible to local minima in general
In the linear regression case, gradient descent (with a suitable learning rate) always converges, as the problem has only one global solution


Batch and stochastic LMS
Batch LMS: looks at every example in the entire training set on every step
Stochastic LMS: each time a training example is encountered, the parameters are updated according to that single training example only
Batch LMS has to scan through the entire training set before taking a single step, a costly operation if the data set is large
Stochastic LMS can start updating right away when a training example arrives
Stochastic LMS moves θ toward the minimum much faster than batch LMS; however, it may never converge to the minimum, because each update is based on a single example
The parameters will keep J(θ) oscillating around the minimum
When the training set is large, stochastic LMS is often preferred over batch LMS
Least squares-a probabilistic interpretation

Let us assume that the outputs and the inputs are related via
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)} \tag{13}$$
ϵ^(i) is an error term that captures either unmodeled effects or random noise; the ϵ^(i) are assumed to be distributed IID (independently and identically distributed) according to a Gaussian distribution
So, ϵ^(i) ∼ N(0, σ²), that is,
$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\Big)$$
Hence,
$$p(y^{(i)}\mid x^{(i)};\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\Big)$$


Least squares-a probabilistic interpretation
Given X (which contains all the x^(i)'s) and θ, what is the distribution of the y^(i)'s? This is given by p(Y | X; θ)
Likelihood function: an explicit representation of this quantity as a function of θ:
$$L(\theta) = L(\theta; X, Y) = p(Y\mid X;\theta) = \prod_{i=1}^{m} p(y^{(i)}\mid x^{(i)};\theta) \tag{14}$$
Using the independence assumption on the ϵ^(i):
$$= \prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\Big) \tag{15}$$
Maximum likelihood estimate of θ: maximize L(θ)
The logarithm is a monotonically increasing function, so we work with the log-likelihood ℓ(θ) = ln L(θ) for simplicity and tractability
Least squares-a probabilistic interpretation

MLE:
$$\ell(\theta) = \ln\prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\Big) \tag{16}$$
$$= \underbrace{m\ln\frac{1}{\sqrt{2\pi}\,\sigma}}_{\text{const}} - \underbrace{\frac{1}{\sigma^2}}_{\text{scaling}}\cdot\frac{1}{2}\sum_{i=1}^{m}\big(y^{(i)} - \theta^T x^{(i)}\big)^2 \tag{17}$$

Hence, maximizing ℓ(θ) is equivalent to minimizing ½ Σ_{i=1}^{m} (y^(i) − θᵀx^(i))², the least-squares objective
In other words, under the probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of θ


Locally Weighted Linear Regression (LWR)
Linear regression
1 Training: fit θ to minimize Σ_i (y^(i) − θᵀx^(i))²
2 Prediction: output θᵀx

LWR
1 An online algorithm
2 Training: fit θ to minimize Σ_i w^(i) (y^(i) − θᵀx^(i))²
3 Prediction: output θᵀx

w^(i) are non-negative weights
If w^(i) is large, the penalization of (y^(i) − θᵀx^(i))² is high; if w^(i) is small, the penalization of (y^(i) − θᵀx^(i))² is small
A standard choice, at a particular query point x:
$$w^{(i)} = \exp\Big(-\frac{(x^{(i)} - x)^2}{2\tau^2}\Big)$$
If |x^(i) − x| is small, w^(i) ≈ 1; if |x^(i) − x| is large, w^(i) ≈ 0
τ is the bandwidth parameter, which decides how fast the weight of a training example falls off with the distance of its x^(i) from the query point x
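A minimal sketch of an LWR prediction at a single query point, solving the weighted least-squares problem in closed form; the helper name and the bandwidth value are illustrative assumptions.

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=1.0):
    """Locally weighted linear regression prediction at one query point.
    X: m x n design matrix, y: length-m targets, x_query: length-n query vector."""
    # Gaussian weights based on squared distance from the query point
    d2 = np.sum((X - x_query)**2, axis=1)
    w = np.exp(-d2 / (2.0 * tau**2))
    W = np.diag(w)
    # Weighted normal equations: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta
```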
Parametric and non-parametric models

Parametric model:
Fixed, finite number of parameters (θ's) in the model
Once θ is fixed, it can be used directly for prediction
The training set is not used during future predictions
Eg: Linear regression

Non-parametric model:
The model parameters (θ) are not fixed in number
The training set has to be stored and used during future predictions
Eg: Locally weighted linear regression


Overview of Presentation

1 Introduction

2 Regression and least squares

3 Logistic regression

4 Generalized linear models

5 Regularization and regression



Logistic regression

Consider the binary classification problem in which y can take on only two values, 0 and 1.
If y is discrete-valued, using linear regression to predict y given x gives a poor response.
Logistic regression model:
$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} \tag{18}$$
g(z) = 1/(1 + e^{−z}) is called the logistic or sigmoid function.




Logistic regression
g(z) = 1/(1 + e^{−z}) is called the logistic or sigmoid function
g(z) → 1 when z → ∞, and g(z) → 0 when z → −∞
An interesting property of g(z):
$$g'(z) = \frac{d}{dz}\,\frac{1}{1+e^{-z}} = -\frac{1}{(1+e^{-z})^2}\,(-e^{-z}) \tag{20}$$
$$= \frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}} = \frac{1}{1+e^{-z}}\Big(1 - \frac{1}{1+e^{-z}}\Big) \tag{21}$$
$$= g(z)\big(1 - g(z)\big) \tag{22}$$
Logistic regression: parameter estimation
Let us assume that:
$$p(y = 1\mid x;\theta) = h_\theta(x) \tag{23}$$
$$p(y = 0\mid x;\theta) = 1 - h_\theta(x) \tag{24}$$
Combining these:
$$p(y\mid x;\theta) = \big(h_\theta(x)\big)^{y}\big(1 - h_\theta(x)\big)^{1-y} \tag{25}$$
MLE: if there are m independent training examples:
$$L(\theta) = p(Y\mid X;\theta) = \prod_{i=1}^{m} p(y^{(i)}\mid x^{(i)};\theta) \tag{26}$$
$$= \prod_{i=1}^{m}\big(h_\theta(x^{(i)})\big)^{y^{(i)}}\big(1 - h_\theta(x^{(i)})\big)^{1-y^{(i)}} \tag{27}$$




Logistic regression: parameter estimation
Log-likelihood:
$$\ell(\theta) = \ln L(\theta) \tag{33}$$
$$= \sum_{i=1}^{m} y^{(i)}\ln h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big)\ln\big(1 - h_\theta(x^{(i)})\big) \tag{34}$$

Gradient ascent (as we are maximizing the likelihood), written for a single example (x, y):
$$\frac{\partial}{\partial\theta_j}\ell(\theta) = \Big(y\,\frac{1}{g(\theta^T x)} - (1-y)\,\frac{1}{1 - g(\theta^T x)}\Big)\frac{\partial}{\partial\theta_j}g(\theta^T x) \tag{35}$$
$$= \Big(y\,\frac{1}{g(\theta^T x)} - (1-y)\,\frac{1}{1 - g(\theta^T x)}\Big)\,g(\theta^T x)\big(1 - g(\theta^T x)\big)\,\frac{\partial}{\partial\theta_j}(\theta^T x)$$
$$= \big(y\,(1 - g(\theta^T x)) - (1-y)\,g(\theta^T x)\big)\,x_j \tag{36}$$
$$= \big(y - h_\theta(x)\big)\,x_j \tag{37}$$

Update rule: θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)
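A minimal NumPy sketch (my own) of logistic regression trained with the stochastic gradient-ascent update above; the learning rate and number of passes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.1, n_epochs=100):
    """Fit logistic regression by stochastic gradient ascent on the log-likelihood."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in range(m):
            h = sigmoid(X[i] @ theta)
            theta += alpha * (y[i] - h) * X[i]   # theta_j += alpha * (y - h_theta(x)) * x_j
    return theta
```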
Overview of Presentation

1 Introduction

2 Regression and least squares

3 Logistic regression

4 Generalized linear models

5 Regularization and regression



Generalized Linear Models

Exponential family distributions:
$$p(y;\eta) = b(y)\exp\big(\eta^T T(y) - a(\eta)\big)$$
η is the natural parameter of the distribution; T(y) is the sufficient statistic, mostly chosen to be y
A fixed choice of T, a and b defines a family (or set) of distributions that is parametrized by η; as we vary η, we then get different distributions within this family
Let η = µ, T(y) = y, a(η) = η²/2 = µ²/2, b(y) = (1/√(2π)) exp(−y²/2). Then
$$p(y;\mu) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{y^2}{2}\Big)\exp\Big(\mu y - \frac{\mu^2}{2}\Big) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}(y-\mu)^2\Big),$$
recovering the Gaussian distribution (with unit variance)


Generalized Linear Models (GLM) (contd...)

GLM
y | x; θ ∼ exponential family
Given x, our goal is to predict the expected value of T(y) given x, that is, hθ(x) = E[y | x]. Eg. logistic regression: p(y | x; θ) = (hθ(x))^y (1 − hθ(x))^(1−y)
The natural parameter η and the inputs x are related linearly: η = θᵀx


Special cases of GLM

Linear regression: y = θᵀx + ϵ
y | x; θ ∼ N(µ = θᵀx, σ²), which belongs to the exponential family
hθ(x) = E[y | x] = θᵀx
In θᵀx, the inputs enter linearly with respect to the parameters
Logistic regression: hθ(x) = g(θᵀx) = 1 / (1 + exp(−θᵀx))
Success or failure: the Bernoulli distribution belongs to the exponential family
p(y | x; θ) = (hθ(x))^y (1 − hθ(x))^(1−y)
In θᵀx, the inputs enter linearly with respect to the parameters
There are many other examples


Overview of Presentation

1 Introduction

2 Regression and least squares

3 Logistic regression

4 Generalized linear models

5 Regularization and regression



Subset selection

There are two reasons why we are often not satisfied with the
least squares estimates
1 prediction accuracy: the least squares estimates often have low
bias but large variance. Prediction accuracy can sometimes be
improved by shrinking or setting some coefficients to zero. By
doing so we sacrifice a little bit of bias to reduce the variance of
the predicted values, and hence may improve the overall
prediction accuracy.
2 interpretation: with a large number of predictors, we often
would like to determine a smaller subset that exhibit the
strongest effects. In order to get the ”big picture” we are willing
to sacrifice some of the small details.



Best-subset Selection
Linear regression model: y = β₀ + Σ_{i=1}^{k} β_i x_i + ϵ
Best-subset regression finds, for each k ∈ {0, 1, 2, . . . , p}, the subset of size k that gives the smallest residual sum of squares
Forward-stepwise regression: starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit
Backward-stepwise regression: starts with the full model, and sequentially deletes the predictors
Best-subset Selection
Stagewise regression
1 It starts like forward-stepwise regression, with an intercept equal to ȳ, and centered predictors with coefficients initially all 0.
2 At each step the algorithm identifies the variable most correlated with the current residual.
3 It then computes the simple linear regression coefficient of the residual on this chosen variable, and adds it to the current coefficient for that variable.
4 This is continued until none of the variables have correlation with the residuals.
Unlike forward-stepwise regression, none of the other variables are adjusted when a term is added to the model.
As a consequence, forward stagewise regression can be slow in fitting.


Least Angle Regression

A "democratic" version of forward stepwise/stagewise regression
1 Standardize the predictors to have mean zero and unit norm. Start with the residual r = y − ȳ and β₁ = · · · = β_p = 0.
2 Find the regressor x_j most correlated with r.
3 Move β_j from 0 towards its least-squares coefficient on (r, x_j) until some other competitor x_k has as much correlation with the current residual as does x_j: β_j := β_j + α (x_jᵀx_j)⁻¹ x_jᵀ r.
4 Move β_j and β_k in the direction defined by their joint least-squares coefficient of the current residual on (x_j, x_k), until some other competitor x_l has as much correlation with the current residual.
5 Continue in this way until all p predictors have been entered.




Fitting (figure slide)


Advantage of regularized models

One wants to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data
A less complex model may underfit
A more complex model tends to overfit =⇒ more variance error
Law of parsimony: Occam's razor
Regularization = shrinkage = sparsity
Advantage of regularized models

Models with a larger number of variables are difficult to interpret physically.
Eg. Let us say you are modeling the concentration of a product (y) as a function of temperature (x1), flow rate (x2), pressure (x3), reactor volume (x4), and reactor temperature (x5), i.e.,
y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5
If all the coefficients β0, . . . , β5 are active, we do not have any clue which is the dominant variable that controls the composition.
If only one or two variables remain active after modeling, say x1 and x3, we conclude that temperature and pressure play a key role in the concentration: y = β1 x1 + β3 x3
That is, a lower number of variables improves interpretability and decision making.


Ridge regression
Ridge regression shrinks the regression coefficients by imposing a penalty on their size.
The ridge coefficients minimize a penalized residual sum of squares:
$$\hat\beta^{\text{ridge}} = \arg\min_\beta \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$
Here λ ≥ 0 is a complexity parameter that controls the amount of shrinkage.
The larger the value of λ, the greater the amount of shrinkage. The coefficients are shrunk toward zero (and toward each other).
Equivalent formulation which makes explicit the size constraint on the parameters:
$$\hat\beta^{\text{ridge}} = \arg\min_\beta \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \le t$$
(L2 penalty)
Ridge regression
When there are many correlated variables in a linear regression model, their coefficients can become poorly determined and exhibit high variance.
A wildly large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated variable. By imposing a size constraint on the coefficients, this problem is alleviated.
Notice that the intercept β0 has been left out of the penalty term, to avoid making the procedure depend on the origin chosen for y.
β̂^ridge = (XᵀX + λI)⁻¹ Xᵀy, where I is the p × p identity matrix.
The solution adds a positive constant to the diagonal of XᵀX before inversion. This makes the problem nonsingular, even if XᵀX is not of full rank, and was the main motivation for ridge regression.
λI lifts the problem away from singularity by adding a "ridge" along the diagonal, hence the name.
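A minimal NumPy sketch of the ridge solution; following the slide, the intercept is kept out of the penalty by centering the data first, and the λ value is illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Ridge regression: beta = (X^T X + lam * I)^{-1} X^T y on centered data."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean               # centering leaves the intercept unpenalized
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta
```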
Least Absolute Shrinkage and Selection
Operator (LASSO)
The lasso is a shrinkage method like ridge, with subtle but important differences.
Lasso regression:
$$\hat\beta^{\text{lasso}} = \arg\min_\beta \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \le t$$
Equivalent Lagrangian form:
$$\hat\beta^{\text{lasso}} = \arg\min_\beta \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$$
With the L1 penalty the shrinkage is much stronger (coefficients can be set exactly to zero).
This latter constraint makes the solutions nonlinear in the y_i, and there is no closed-form expression as in ridge regression.
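Because there is no closed form, the lasso is solved iteratively; below is a minimal proximal-gradient (ISTA-style) sketch, which is one possible solver and not necessarily the method implied by the slides. The step size and iteration count are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, thresh):
    """Proximal operator of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def lasso_ista(X, y, lam=1.0, n_iters=500):
    """Lasso on centered data: minimizes 0.5*||Xc b - yc||^2 + lam*||b||_1 (intercept unpenalized)."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    step = 1.0 / np.linalg.norm(Xc, ord=2)**2     # 1 / Lipschitz constant of the smooth part
    beta = np.zeros(Xc.shape[1])
    for _ in range(n_iters):
        grad = Xc.T @ (Xc @ beta - yc)            # gradient of the smooth least-squares term
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta
```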
Discussion: Ridge Regression and the Lasso

(Figure: constraint regions for the Lasso and Ridge)


Discussion: Ridge and Lasso regressions

Ridge and Lasso: the residual sum of squares has elliptical contours, centered at the full least-squares estimate.
The constraint region for ridge regression is the disk β₁² + β₂² ≤ t.
The constraint region for the Lasso is the diamond |β₁| + |β₂| ≤ t.
In constrained optimization, the solution lies where the constraint is active (on the boundary of the constraint surface).
Unlike the disk, the diamond has corners; if the solution occurs at a corner, it makes one parameter βj equal to zero.
Thus, shrinkage in the Lasso is stronger than in Ridge, and it can produce exactly zero coefficients.


Generalization of Lasso
Lasso, generalized:
$$\tilde\beta = \arg\min_\beta \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|^q$$
The value q = 0 corresponds to variable subset selection, as the penalty simply counts the number of nonzero parameters
q = 1 corresponds to the Lasso
q = 2 corresponds to ridge regression
We can also optimize q: hyper-parameter optimization


Elastic-net regression

The elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the Lasso and Ridge methods.
Elastic net:
$$\hat\beta^{\text{elastic}} = \arg\min_\beta \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda_1\sum_{j=1}^{p}|\beta_j| + \lambda_2\sum_{j=1}^{p}\beta_j^2$$
Often it is chosen such that λ2 = 1 − λ1
The elastic net method includes LASSO and ridge regression as special cases:
λ1 = λ and λ2 = 0: Lasso
λ2 = λ and λ1 = 0: Ridge
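For practical use, scikit-learn's ElasticNet implements this objective through an alpha / l1_ratio parameterization (which differs from the (λ1, λ2) form above by constant scalings); a brief, illustrative sketch with made-up data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=100)

# l1_ratio blends the penalties: l1_ratio=1 is the Lasso limit, l1_ratio=0 is ridge-like
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)
print(model.intercept_, model.coef_)
```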



Robust regression models

(Figure: data with an outlier and the corresponding regression fit)


Robust regression
Robust regression:
$$\hat\beta = \arg\min_\beta \sum_{i=1}^{N} g\Big(\frac{y_i - x_i^T\beta}{\hat\sigma}\Big),$$
where g is a robust function
Choice of g(·): if g(t) = ½ t², it is the OLS solution
Huber's regression: for a threshold M (which the user can fix based on how much tolerance is needed for outliers):
$$g(t) = \begin{cases} \tfrac{1}{2}t^2 & \text{if } |t| \le M \\ M|t| - \tfrac{1}{2}M^2 & \text{if } |t| > M \end{cases}$$
This function is identical to the least-squares penalty for small residuals, but on large residuals its penalty is lower and increases linearly rather than quadratically. It is thus more forgiving of outliers.
Least Absolute Deviation (LAD): g(t) = |t|, that is,
$$\hat\beta = \arg\min_\beta \sum_{i=1}^{N}\Big|\frac{y_i - x_i^T\beta}{\hat\sigma}\Big|,$$
also called median regression
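A minimal sketch of the Huber function g(t) defined above, e.g. for use inside a generic optimizer; the default threshold M = 1.345 is a common illustrative choice, not a value from the slides.

```python
import numpy as np

def huber(t, M=1.345):
    """Huber robust loss: quadratic for |t| <= M, linear beyond the threshold."""
    t = np.asarray(t, dtype=float)
    quadratic = 0.5 * t**2
    linear = M * np.abs(t) - 0.5 * M**2
    return np.where(np.abs(t) <= M, quadratic, linear)
```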
Thank you!

