
Linear Regression

(Problems & Solutions)

Dr. Jasmeet Singh
Assistant Professor, CSED
TIET, Patiala
Problem 1: Multicollinearity
 In regression, "multicollinearity" refers to predictors that are correlated with other predictors.
 Multicollinearity occurs when our model includes multiple factors that are correlated not just to the response variable, but also to each other.
 In other words, it results when we have factors that are somewhat redundant.
 Multicollinearity increases the standard errors of the coefficients.
 Increased standard errors in turn mean that the coefficients of some independent variables may be found not to be significantly different from 0.
 In other words, by inflating the standard errors, multicollinearity can make variables appear statistically insignificant when they should be significant.
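 A quick way to check for multicollinearity is to compute the variance inflation factor (VIF) of each predictor. The sketch below is not part of the original slides; it is a minimal numpy illustration (the helper name vif and the synthetic data are assumptions), where VIF_j = 1 / (1 − R_j²) and R_j² comes from regressing predictor j on the remaining predictors.

```python
# Minimal sketch (assumed, not from the slides): detect multicollinearity by
# computing the variance inflation factor (VIF) of each predictor with numpy.
# VIF_j = 1 / (1 - R_j^2); values far above ~5-10 usually signal trouble.
import numpy as np

def vif(X):
    """Return the VIF of every column of the predictor matrix X (n x k)."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        y_j = X[:, j]                             # treat column j as the response
        X_j = np.delete(X, j, axis=1)             # remaining predictors
        X_j = np.column_stack([np.ones(n), X_j])  # add an intercept
        beta, *_ = np.linalg.lstsq(X_j, y_j, rcond=None)
        resid = y_j - X_j @ beta
        r2 = 1 - resid.var() / y_j.var()          # R^2 of the auxiliary regression
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)             # x2 is almost a copy of x1
x3 = rng.normal(size=200)                         # x3 is independent
X = np.column_stack([x1, x2, x3])
print(vif(X))                                     # large VIFs for x1, x2; ~1 for x3
```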
Problem 2 & 3: Underfitting & Overfitting
 In order to understand the concepts of overfitting and underfitting, consider the following example (shown in Figure 1(a)), where we have to fit a regression function (hypothesis) that can predict the value of the output variable (y) given the input variable (x).
 Let us fit three regression hypotheses on the data, as shown in Figures 1(b), 1(c), and 1(d).

Figure 1(a): training data; Figure 1(b): $\hat{y} = \beta_0 + \beta_1 x$ (Underfit); Figure 1(c): $\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2$ (Good Fit); Figure 1(d): $\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4$ (Overfit)
Underfitting
 Figure 1(b) shows the result of fitting the linear hypothesis $\hat{y} = \beta_0 + \beta_1 x$.
 We can see that the data does not really lie on a straight line, so the fit is not very good.
 The figure shows an instance of underfitting, in which the data clearly shows structure not captured by the model.
 Underfitting, or high bias, is when the form of our
hypothesis function h maps poorly to the trend of the data.
 It is usually caused by a function that is too simple or uses
too few features.
 Underfitting can be solved by increasing the number of
features in our training data.
Overfitting
 Figure 1(d) shows the result of fitting the high-order polynomial hypothesis $\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4$.
 The figure is an example of overfitting.
 Overfitting, or high variance, is caused by a hypothesis
function that fits the available data but does not
generalize well to predict new data.
 It is usually caused by a complicated function that
creates a lot of unnecessary curves and angles unrelated
to the data.
 For example, the quadratic fit shown in Figure 1(c) (slide number 3) is a good fit, as it fits the data well and generalizes well.
Bias and Variance Tradeoff
 If we have a small number of features, i.e., low model complexity, the fit is an underfit with high bias error.
 Bias is an error arising from faulty assumptions in the learning algorithm.
 When the bias is too large, the algorithm is not able to correctly model the relationship between the features and the target outputs.
 As we keep on increasing the model complexity (number of features), bias decreases but variance increases (overfit).
 Variance is an error resulting from sensitivity to fluctuations in the training dataset.
 A high variance causes the algorithm to capture most of the training data points but not generalize well enough to new data points.
 The tradeoff between bias and variance can be managed by controlling the complexity of the model.
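 The tradeoff can be seen numerically by fitting polynomials of increasing degree and comparing training and test error, as in the minimal sketch below (the synthetic quadratic data and the degrees 1, 2, and 9 are illustrative assumptions, not from the slides).

```python
# Sketch of the bias-variance tradeoff: training error keeps falling as model
# complexity grows, while test error is typically U-shaped.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=60)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)

x_tr, y_tr = x[:40], y[:40]          # training split
x_te, y_te = x[40:], y[40:]          # held-out test split

for degree in (1, 2, 9):
    coeffs = np.polyfit(x_tr, y_tr, deg=degree)        # least-squares polynomial fit
    mse_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    mse_te = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree {degree}: train MSE {mse_tr:.2f}, test MSE {mse_te:.2f}")
# Typically: degree 1 underfits (both errors high), degree 2 fits well,
# degree 9 overfits (train error lowest, test error rises again).
```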
Solutions to Regression Problems
 The problems of multicollinearity and overfitting can be controlled by handling the complexity of the model.
 The model complexity can be controlled using any of the following approaches:
1. Remove highly correlated predictors from the model
 Manually select which features to keep.
 Use feature selection methods to choose features that maximize relevance or minimize redundancy.
 Use feature extraction techniques such as PCA, SVD, and LDA.
2. Regularization
 Keep all the features, but reduce the magnitude of the parameters.
 Regularization works well when we have a lot of slightly useful features.
Regularization/ Shrinkage
 Regularization is a technique used for tuning the function by adding an additional penalty term to the error function that reduces the magnitude of the parameters.
 The additional term controls the excessively fluctuating function so that the coefficients do not take extreme values.
 This technique of keeping a check on, or reducing, the values of the coefficients is called a shrinkage method.
 If our hypothesis function overfits (as shown in Figure 1(d)), we can reduce the weight that some of the terms in our function carry by increasing their cost.
Regularization/ Shrinkage Contd…
 Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function:
$\text{Cost Function} = J(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4) + 5000\,\beta_3^2 + 5000\,\beta_4^2$
 We have added two extra terms at the end to inflate the cost of $\beta_3$ and $\beta_4$.
 Now, in order for the cost function to get close to zero, we will have to reduce the values of $\beta_3$ and $\beta_4$ to near zero.
 This will in turn greatly reduce the values of the terms $\beta_3 x^3$ and $\beta_4 x^4$ in our hypothesis function.
 As a result, we see that the new hypothesis (depicted by the pink curve) looks like a quadratic function but fits the data better due to the extra small terms $\beta_3 x^3 + \beta_4 x^4$.
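 The sketch below illustrates this idea under stated assumptions (synthetic quadratic data, a degree-4 polynomial, and the penalty weight 5000 taken from the slide). Instead of gradient descent it uses the equivalent closed-form solve of the penalized least-squares problem, and shows that $\beta_3$ and $\beta_4$ are driven close to zero.

```python
# Sketch (assumed implementation): fit a degree-4 polynomial, but add the
# penalty 5000*b3^2 + 5000*b4^2 to the squared-error cost.  Minimizing
# (1/2n)||Xb - y||^2 + b^T D b in closed form gives
# b = (X^T X + 2n D)^{-1} X^T y, with D = diag(0, 0, 0, 5000, 5000).
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=x.size)  # quadratic data

n = x.size
X = np.column_stack([x**p for p in range(5)])    # columns 1, x, x^2, x^3, x^4
D = np.diag([0.0, 0.0, 0.0, 5000.0, 5000.0])     # penalize only b3 and b4

b_plain = np.linalg.solve(X.T @ X, X.T @ y)             # ordinary least squares
b_pen = np.linalg.solve(X.T @ X + 2 * n * D, X.T @ y)   # penalized fit

print("unpenalized:", np.round(b_plain, 3))
print("penalized:  ", np.round(b_pen, 3))    # b3, b4 driven close to zero
```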
Regularization in Linear Regression
 In regression models, we do not know in advance which regression coefficients should be shrunk by adding their penalty to the cost function.
 So, the general approach when applying regularization in regression is to shrink the weights (regression coefficients) of all the input variables.
 The two most commonly used regularization techniques in linear regression are:
1. Ridge Regression (L2 Regularization)
2. Least Absolute Shrinkage and Selection Operator (LASSO) Regression (L1 Regularization)
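 As a hedged illustration of the difference between the two (using scikit-learn, whose alpha parameter plays the role of $\lambda$ but is scaled differently from the $1/2n$ convention used in these slides): ridge shrinks all coefficients smoothly, while LASSO can set some of them exactly to zero.

```python
# Quick comparison (assumes scikit-learn is available; the data and alpha
# values are illustrative): Ridge shrinks coefficients smoothly, while Lasso
# can drive some of them exactly to zero.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=100)          # two correlated predictors
y = 3 * X[:, 0] + 2 * X[:, 2] + rng.normal(scale=0.5, size=100)  # last two columns irrelevant

for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=5.0)),
                    ("Lasso", Lasso(alpha=0.1))]:
    model.fit(X, y)
    print(f"{name:6s} coefficients:", np.round(model.coef_, 3))
# Ridge keeps all five coefficients but with smaller magnitudes; Lasso tends to
# zero out the redundant / irrelevant ones, acting as a feature selector.
```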
Ridge Regression
 Ridge regression performs 'L2 regularization', i.e., it adds a penalty proportional to the sum of the squares of the coefficients to the optimization objective.
 Thus, ridge regression optimizes the following:
$\text{Objective} = \text{Mean Squared Error} + \lambda\,(\text{sum of squares of coefficients})$
i.e., $J(\beta) = \frac{1}{2n}\left[\sum_{i=1}^{n}\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_k x_{ik} - y_i\right)^2 + \lambda \sum_{j=0}^{k} \beta_j^2\right]$
where $n$ is the total number of training examples, $k$ is the number of features, $\beta_j$ represents the regression coefficient of the $j$th input variable, and $\lambda$ is the regularization parameter.
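 The objective above can be transcribed directly into code; the sketch below is an assumed numpy helper (named ridge_cost here), following the slides in including the intercept $\beta_0$ in the penalty term.

```python
# Direct transcription of the ridge objective above (a sketch; the intercept is
# included in the penalty through the j = 0 coefficient, as in the slides).
import numpy as np

def ridge_cost(beta, X, y, lam):
    """J(beta) = 1/(2n) * [ sum_i (b0 + b1*x_i1 + ... + bk*x_ik - y_i)^2
                            + lam * sum_j b_j^2 ]."""
    n = X.shape[0]
    X1 = np.column_stack([np.ones(n), X])    # prepend the column of ones for b0
    residuals = X1 @ beta - y                # b0 + b1*x_i1 + ... + bk*x_ik - y_i
    return (np.sum(residuals ** 2) + lam * np.sum(beta ** 2)) / (2 * n)

# Tiny usage example with k = 2 features and n = 3 training examples.
beta = np.array([0.5, 1.0, -2.0])
X = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
y = np.array([1.0, 0.5, 2.0])
print(ridge_cost(beta, X, y, lam=0.1))
```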
Ridge Regression Contd…
$\lambda$ can take various values:
1. $\lambda = 0$:
 The objective becomes the same as in simple linear regression.
 We will get the same coefficients as simple linear regression.
2. $\lambda = \infty$:
 The coefficients will be zero.
 Because of the infinite weightage on the square of the coefficients, anything other than zero will make the objective infinite.
3. $0 < \lambda < \infty$:
 The magnitude of $\lambda$ will decide the weightage given to the different parts of the objective.
 The coefficients will lie somewhere between 0 and those of simple linear regression.
Ridge Regression using Gradient Descent
 We know that in gradient descent optimization, we update the regression coefficients as follows:
$\beta_j = \beta_j - \alpha\,\frac{\partial J(\beta)}{\partial \beta_j}$
 For ridge regression, the cost function is given by:
$J(\beta) = \frac{1}{2n}\left[\sum_{i=1}^{n}\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} - y_i\right)^2 + \lambda \sum_{j=0}^{k} \beta_j^2\right]$
 The gradient of the cost function w.r.t. $\beta_0$ is given by:
$\frac{\partial J}{\partial \beta_0} = \frac{1}{n}\left[\sum_{i=1}^{n}\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} - y_i\right) + \lambda \beta_0\right]$
Ridge Regression using Gradient Descent
Contd…
Similarly,
$\frac{\partial J}{\partial \beta_1} = \frac{1}{n}\left[\sum_{i=1}^{n}\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_k x_{ik} - y_i\right) x_{i1} + \lambda \beta_1\right]$
$\frac{\partial J}{\partial \beta_2} = \frac{1}{n}\left[\sum_{i=1}^{n}\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_k x_{ik} - y_i\right) x_{i2} + \lambda \beta_2\right]$
$\frac{\partial J}{\partial \beta_3} = \frac{1}{n}\left[\sum_{i=1}^{n}\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_k x_{ik} - y_i\right) x_{i3} + \lambda \beta_3\right]$
In general,
$\frac{\partial J}{\partial \beta_j} = \frac{1}{n}\left[\sum_{i=1}^{n}\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_k x_{ik} - y_i\right) x_{ij} + \lambda \beta_j\right]$
Ridge Regression using Gradient Descent
Contd…
 Therefore, the regression coefficients are updated as:
$\beta_j = \beta_j - \frac{\alpha}{n}\left[\sum_{i=1}^{n}\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_k x_{ik} - y_i\right) x_{ij} + \lambda \beta_j\right]$
or
$\beta_j = \beta_j\left(1 - \frac{\alpha\lambda}{n}\right) - \frac{\alpha}{n}\sum_{i=1}^{n}\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_k x_{ik} - y_i\right) x_{ij}$
 Since the factor $\left(1 - \frac{\alpha\lambda}{n}\right)$ is less than 1, the algorithm will keep shrinking the values of the regression coefficients, which handles the problem of overfitting.
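 A minimal numpy implementation of this update rule is sketched below; the learning rate, iteration count, and synthetic data are illustrative assumptions rather than part of the slides.

```python
# Sketch of the ridge gradient-descent update above (learning rate, iteration
# count and the synthetic data are assumptions, not part of the slides).
import numpy as np

def ridge_gradient_descent(X, y, lam=1.0, alpha=0.1, iters=2000):
    """Minimize J(beta) = 1/(2n) [ ||X1 beta - y||^2 + lam ||beta||^2 ]
    using beta_j <- beta_j (1 - alpha*lam/n) - (alpha/n) sum_i e_i x_ij."""
    n = X.shape[0]
    X1 = np.column_stack([np.ones(n), X])   # x_i0 = 1 for the intercept b0
    beta = np.zeros(X1.shape[1])
    for _ in range(iters):
        errors = X1 @ beta - y              # e_i = b0 + b1 x_i1 + ... + bk x_ik - y_i
        grad = (X1.T @ errors + lam * beta) / n
        beta = beta - alpha * grad          # same as beta*(1 - alpha*lam/n) - (alpha/n)*X1'e
    return beta

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
y = 4.0 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)
print(np.round(ridge_gradient_descent(X, y, lam=1.0), 3))  # coefficients pulled slightly toward 0
```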


Ridge Regression using Least Square
Error Fit
In the least-squares error fit, we represent the error in matrix form as:
$\epsilon = y - X\beta$
where
$\epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$; $y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$; $\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}$
and $X = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1k} \\ 1 & x_{21} & x_{22} & \dots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{nk} \end{bmatrix}$

$\text{Cost (Objective) Function} = \text{Mean Squared Error} + \lambda\,(\text{sum of squares of coefficients})$

Ridge Regression using Least Square
Error Fit Contd….
$\text{Cost (Objective) Function} = \text{Mean Squared Error} + \lambda\,(\text{sum of squares of coefficients})$
$J(\beta) = \frac{1}{2n}\left(\sum_i \epsilon_i^2 + \lambda \sum_j \beta_j^2\right) = \frac{1}{2n}\left(\epsilon^T\epsilon + \lambda\beta^T\beta\right)$
$= \frac{1}{2n}\left((y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta\right)$
$= \frac{1}{2n}\left((y^T - \beta^T X^T)(y - X\beta) + \lambda\beta^T\beta\right)$
$J(\beta) = \frac{1}{2n}\left(y^Ty - \beta^T X^T y - y^T X\beta + \beta^T X^T X\beta + \lambda\beta^T\beta\right)$
$J(\beta) = \frac{1}{2n}\left(y^Ty - 2y^T X\beta + \beta^T X^T X\beta + \lambda\beta^T\beta\right)$
[Because $y^T X\beta$ and $\beta^T X^T y$ are $1 \times 1$ matrices (scalars) and are always equal. The squared-error function is minimized using the second-derivative test.]
Ridge Regression using Least Square
Error Fit Contd….
Step 1: Compute the partial derivative of $J(\beta)$ w.r.t. $\beta$
$\frac{\partial J(\beta)}{\partial \beta} = \frac{1}{2n}\,\frac{\partial\left(y^Ty - 2y^T X\beta + \beta^T X^T X\beta + \lambda\beta^T\beta\right)}{\partial \beta}$
$= \frac{1}{2n}\left[\frac{\partial\, y^Ty}{\partial \beta} - \frac{\partial\, 2y^T X\beta}{\partial \beta} + \frac{\partial\, \beta^T X^T X\beta}{\partial \beta} + \frac{\partial\, \lambda\beta^T\beta}{\partial \beta}\right]$
$= \frac{1}{2n}\left[0 - 2X^Ty + \frac{\partial\, \beta^T X^T X\beta}{\partial \beta} + \frac{\partial\, \lambda\beta^T\beta}{\partial \beta}\right]$  [Because $\frac{\partial\, A\beta}{\partial \beta} = A^T$]
$= \frac{1}{2n}\left[-2X^Ty + 2X^TX\beta + 2\lambda I\beta\right] = \frac{1}{n}\left(-X^Ty + X^TX\beta + \lambda I\beta\right)$  [Because $\frac{\partial\, \beta^T A\beta}{\partial \beta} = 2A\beta$ for symmetric $A$]
Ridge Regression using Least Square
Error Fit Contd….
Step 2: Compute $\hat{\beta}$, the value of $\beta$ for which $\frac{\partial J(\beta)}{\partial \beta} = 0$
$-X^Ty + X^TX\hat{\beta} + \lambda I\hat{\beta} = 0$
$(X^TX + \lambda I)\hat{\beta} = X^Ty$
$\hat{\beta} = (X^TX + \lambda I)^{-1}X^Ty$

 Step 3: Compute $\frac{\partial^2 J(\beta)}{\partial \beta^2}$ and show that $\hat{\beta}$ is a minimum
$\frac{\partial^2 J(\beta)}{\partial \beta^2} = \frac{1}{n}\,\frac{\partial\left(-X^Ty + X^TX\beta + \lambda I\beta\right)}{\partial \beta} = \frac{1}{n}\left(X^TX + \lambda I\right)$, a positive (semi-)definite matrix, so $\hat{\beta}$ minimizes $J(\beta)$.

 Thus, $\beta$ is estimated as $\hat{\beta} = (X^TX + \lambda I)^{-1}X^Ty$. This solves the problems of overfitting and multicollinearity, since $|X^TX + \lambda I|$ will not be zero even for correlated features.
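 The closed-form estimator can be sketched in a few lines of numpy (the data, the $\lambda$ values, and the helper name ridge_closed_form are assumptions): $\lambda = 0$ reproduces the ill-conditioned least-squares fit, while larger $\lambda$ shrinks the coefficients and keeps $X^TX + \lambda I$ well conditioned even with nearly duplicate predictors.

```python
# Sketch of the closed-form ridge solution just derived, with a small lambda
# sweep (synthetic data and lambda values are illustrative).  Even though two
# predictors are almost identical, X^T X + lambda*I stays invertible for lambda > 0.
import numpy as np

def ridge_closed_form(X, y, lam):
    """beta_hat = (X^T X + lambda I)^{-1} X^T y, with a leading column of ones."""
    n = X.shape[0]
    X1 = np.column_stack([np.ones(n), X])
    k1 = X1.shape[1]
    return np.linalg.solve(X1.T @ X1 + lam * np.eye(k1), X1.T @ y)

rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = x1 + 1e-4 * rng.normal(size=100)          # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 2.0 + 3.0 * x1 + rng.normal(scale=0.2, size=100)

for lam in (0.0, 1.0, 100.0, 1e6):
    print(f"lambda = {lam:>9}:", np.round(ridge_closed_form(X, y, lam), 3))
# lambda = 0 gives the ill-conditioned least-squares fit; as lambda grows the
# coefficients shrink toward zero, and the nearly duplicate x1, x2 share the weight.
```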
