Machine Learning
Dr. JASMEET SINGH
ASSISTANT PROFESSOR, CSED
TIET, PATIALA
Problem 1: Multicollinearity
In regression, "multicollinearity" refers to predictors that are correlated with other
predictors.
Multicollinearity occurs when our model includes multiple factors that are correlated
not just with the response variable, but also with each other.
In other words, it results when we have factors that are somewhat redundant.
Multicollinearity increases the standard errors of the coefficients.
Increased standard errors, in turn, mean that the coefficients of some independent variables
may be found not to be significantly different from 0.
In other words, by overinflating the standard errors, multicollinearity makes some
variables statistically insignificant when they should be significant.
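A minimal NumPy sketch of this effect (the two-predictor design, noise level, and sample size below are illustrative assumptions, not taken from the slides): it fits ordinary least squares twice, once with nearly independent predictors and once with highly correlated ones, and compares the estimated standard errors of the coefficients.

import numpy as np

def ols_std_errors(X, y):
    """Return OLS coefficient estimates and their standard errors."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - p)            # unbiased estimate of the noise variance
    return beta, np.sqrt(np.diag(sigma2 * XtX_inv))     # SE_j = sqrt( sigma^2 [ (X'X)^-1 ]_jj )

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)

for noise, label in [(1.0, "nearly independent"), (0.01, "highly correlated")]:
    x2 = x1 + noise * rng.normal(size=n)                # small noise => x2 is almost a copy of x1
    X = np.column_stack([np.ones(n), x1, x2])           # intercept + two predictors
    y = 3 + 2 * x1 + 0.5 * x2 + rng.normal(size=n)
    _, se = ols_std_errors(X, y)
    print(f"{label}: SE(beta_1) = {se[1]:.3f}, SE(beta_2) = {se[2]:.3f}")

With correlated predictors the standard errors blow up, which is exactly why otherwise meaningful variables can appear statistically insignificant.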
Problem 2 & 3: Underfitting & Overfitting
In order to understand the concepts of overfitting and underfitting, consider the following
example (shown in Figure 1(a)), where we have to fit a regression function (hypothesis)
that can predict the value of the output variable (y) given the input variable (x).
Let us fit three regression hypotheses on the data, as shown in Figures 1(b), 1(c), and 1(d).
The simple (linear) hypothesis underfits the data, while the highest-order hypothesis overfits it: it follows the training points closely but generalizes poorly.
For example, suppose the overfitted hypothesis in Figure 1(d) is the fourth-order polynomial
h(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4.
We've added two extra terms at the end of the cost function (e.g., 1000\,\beta_3^2 + 1000\,\beta_4^2) to inflate the cost of \beta_3 and \beta_4.
Now, in order for the cost function to get close to zero, we will have to
reduce the values of \beta_3 and \beta_4 to near zero.
This will in turn greatly reduce the contribution of \beta_3 and \beta_4 in our hypothesis
function.
As a result, we see that the new hypothesis (depicted by the pink curve)
looks like a quadratic function but fits the data better due to the extra small
terms \beta_3 x^3 + \beta_4 x^4.
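A small NumPy sketch of this idea, under illustrative assumptions (a noisy quadratic ground truth, a degree-4 polynomial fit, and a penalty weight of 1000 on the two highest-order coefficients only): solving the penalized least-squares problem in closed form drives \beta_3 and \beta_4 toward zero, so the fitted curve behaves almost like a quadratic.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.3 * rng.normal(size=x.size)   # quadratic ground truth + noise

# Degree-4 polynomial design matrix: columns 1, x, x^2, x^3, x^4
X = np.vander(x, N=5, increasing=True)

# Penalty only on beta_3 and beta_4 (illustrative weight 1000, mirroring the idea above)
penalty = np.diag([0.0, 0.0, 0.0, 1000.0, 1000.0])

beta_plain = np.linalg.solve(X.T @ X, X.T @ y)             # unpenalized least squares
beta_pen   = np.linalg.solve(X.T @ X + penalty, X.T @ y)   # penalized least squares

print("unpenalized:", np.round(beta_plain, 3))
print("penalized:  ", np.round(beta_pen, 3))               # beta_3, beta_4 shrink toward zero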
Regularization in Linear Regression
In regression models, we do not know in advance which regression coefficients we should
shrink by adding their penalty to the cost function.
So, the general approach when applying regularization in regression is to shrink the
weights (regression coefficients) of all the input variables.
The two most commonly used regularization techniques in linear regression (both sketched briefly below) are:
1. Ridge Regression (L2 Regularization)
2. Least Absolute Shrinkage and Selection Operator (LASSO) Regression (L1
Regularization)
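As a point of reference, both penalties are available off the shelf in scikit-learn; the sketch below (the synthetic data and the alpha values are illustrative assumptions) shows the basic usage. Note that scikit-learn calls the regularization parameter alpha rather than \lambda and does not penalize the intercept.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can set some coefficients exactly to zero

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))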
Ridge Regression
Ridge regression performs L2 regularization, i.e., it adds a penalty equal to \lambda times the sum of
the squares of the coefficients to the optimization objective.
Thus, ridge regression optimizes the following:
Objective = least-squares cost + \lambda \times (sum of squares of coefficients)
i.e.,
J(\beta) = \frac{1}{2n}\Big[\sum_{i=1}^{n}\big(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_k x_{ik} - y_i\big)^2 + \lambda \sum_{j=0}^{k}\beta_j^2\Big]
where n is the total number of training examples, k is the number of features, \beta_j is the
regression coefficient of the jth input variable, and \lambda is the regularization parameter.
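A direct transcription of this objective into NumPy (a sketch; note that, following the slide's formula, the penalty here also covers \beta_0, whereas many implementations leave the intercept unpenalized):

import numpy as np

def ridge_cost(beta, X, y, lam):
    """J(beta) = (1/2n) * [ sum of squared residuals + lam * sum of squared coefficients ]."""
    n = len(y)
    residuals = X @ beta - y            # X is assumed to carry a leading column of ones for beta_0
    return (residuals @ residuals + lam * beta @ beta) / (2 * n)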
Ridge Regression Contd…
𝜆 can take various values:
1. 𝝀 = 0:
The objective becomes the same as that of simple linear regression.
We’ll get the same coefficients as simple linear regression.
2. 𝝀 = ∞:
The coefficients will be zero.
Because of the infinite weightage on the squares of the coefficients, any non-zero coefficient will make
the objective infinite.
3. 0 < 𝝀 < ∞:
The magnitude of 𝝀 will decide the weightage given to the different parts of the objective.
The coefficients will lie somewhere between 0 and those of simple linear regression, as illustrated in the sketch below.
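A small scikit-learn sketch of this behaviour (the synthetic data and the grid of 𝜆 values are illustrative assumptions; scikit-learn's Ridge uses alpha for the regularization parameter): as alpha grows, the coefficient magnitudes shrink from the ordinary least-squares values toward zero.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + 0.2 * rng.normal(size=100)

for alpha in [1e-4, 1.0, 10.0, 100.0, 1e6]:      # alpha -> 0 recovers plain least squares
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"alpha = {alpha:g}: ||beta|| = {np.linalg.norm(coefs):.4f}")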
Ridge Regression using Gradient Descent
We know that in gradient descent optimization, we update the regression coefficients as follows:
\beta_j = \beta_j - \alpha \frac{\partial J(\beta)}{\partial \beta_j}
where \alpha is the learning rate. For the ridge objective,
\frac{\partial J}{\partial \beta_0} = \frac{1}{n}\Big[\sum_{i=1}^{n}\big(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_k x_{ik} - y_i\big) + \lambda \beta_0\Big]
Ridge Regression using Gradient Descent Contd…
Similarly,
\frac{\partial J}{\partial \beta_1} = \frac{1}{n}\Big[\sum_{i=1}^{n}\big(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_k x_{ik} - y_i\big) x_{i1} + \lambda \beta_1\Big]
\frac{\partial J}{\partial \beta_2} = \frac{1}{n}\Big[\sum_{i=1}^{n}\big(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_k x_{ik} - y_i\big) x_{i2} + \lambda \beta_2\Big]
\frac{\partial J}{\partial \beta_3} = \frac{1}{n}\Big[\sum_{i=1}^{n}\big(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_k x_{ik} - y_i\big) x_{i3} + \lambda \beta_3\Big]
⋮
In general,
\frac{\partial J}{\partial \beta_j} = \frac{1}{n}\Big[\sum_{i=1}^{n}\big(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_k x_{ik} - y_i\big) x_{ij} + \lambda \beta_j\Big]
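The general formula above translates directly into a vectorized NumPy gradient (a sketch; X is assumed to carry a leading column of ones, and, following the slides' formula, the penalty term also covers \beta_0):

import numpy as np

def ridge_gradient(beta, X, y, lam):
    """Vector of partial derivatives dJ/dbeta_j for the ridge cost J(beta)."""
    n = len(y)
    residuals = X @ beta - y                       # (beta_0 + beta_1 x_i1 + ... + beta_k x_ik - y_i) for each i
    return (X.T @ residuals + lam * beta) / n      # (1/n) * [ sum_i residual_i * x_ij + lam * beta_j ]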
Ridge Regression using Gradient Descent Contd…
Therefore, the regression coefficients are updated as:
\beta_j = \beta_j - \frac{\alpha}{n}\Big[\sum_{i=1}^{n}\big(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_k x_{ik} - y_i\big) x_{ij} + \lambda \beta_j\Big]
or
\beta_j = \beta_j\Big(1 - \frac{\alpha\lambda}{n}\Big) - \frac{\alpha}{n}\sum_{i=1}^{n}\big(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_k x_{ik} - y_i\big) x_{ij}
Since the factor \big(1 - \frac{\alpha\lambda}{n}\big) is less than 1, the algorithm will keep on shrinking the values of
the regression coefficients at every update.
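Putting the update rule together into a complete loop (a minimal sketch under illustrative assumptions: synthetic data, a fixed learning rate \alpha = 0.1, \lambda = 1, and 1000 iterations; as in the slides, the intercept is shrunk along with the other coefficients):

import numpy as np

def ridge_gradient_descent(X, y, lam=1.0, alpha=0.1, iterations=1000):
    """Fit ridge regression coefficients with batch gradient descent."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(iterations):
        residuals = X @ beta - y
        # beta_j <- beta_j * (1 - alpha*lam/n) - (alpha/n) * sum_i residual_i * x_ij
        beta = beta * (1 - alpha * lam / n) - (alpha / n) * (X.T @ residuals)
    return beta

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 3))
y = 4.0 + X_raw @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
X = np.column_stack([np.ones(200), X_raw])       # prepend the column of ones for beta_0

print(np.round(ridge_gradient_descent(X, y), 3))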
Ridge Regression using Least Square Error Fit
In matrix notation, let
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \quad \text{and} \quad X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}
Then the ridge cost function can be written as
J(\beta) = \frac{1}{2n}\big((y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta\big)
= \frac{1}{2n}\big((y^T - \beta^T X^T)(y - X\beta) + \lambda \beta^T \beta\big)
= \frac{1}{2n}\big(y^T y - \beta^T X^T y - y^T X\beta + \beta^T X^T X\beta + \lambda \beta^T \beta\big)
J(\beta) = \frac{1}{2n}\big(y^T y - 2 y^T X\beta + \beta^T X^T X\beta + \lambda \beta^T \beta\big)
[Because y^T X\beta and \beta^T X^T y are 1 \times 1 matrices (scalars) and hence always equal. The squared-error function is minimized using the second derivative test.]
Ridge Regression using Least Square Error Fit Contd….
Step 1: Compute the partial derivative of J(\beta) w.r.t. \beta
\frac{\partial J(\beta)}{\partial \beta} = \frac{1}{2n}\,\frac{\partial\big(y^T y - 2 y^T X\beta + \beta^T X^T X\beta + \lambda \beta^T \beta\big)}{\partial \beta}
= \frac{1}{2n}\Big(\frac{\partial\, y^T y}{\partial \beta} - \frac{\partial\, 2 y^T X\beta}{\partial \beta} + \frac{\partial\, \beta^T X^T X\beta}{\partial \beta} + \frac{\partial\, \lambda \beta^T \beta}{\partial \beta}\Big)
= \frac{1}{2n}\Big(0 - 2X^T y + \frac{\partial\, \beta^T X^T X\beta}{\partial \beta} + \frac{\partial\, \lambda \beta^T \beta}{\partial \beta}\Big) \quad \Big[\text{Because } \frac{\partial (AX)}{\partial X} = A^T\Big]
= \frac{1}{2n}\big(-2X^T y + 2X^T X\beta + 2\lambda I \beta\big) = \frac{1}{n}\big(-X^T y + X^T X\beta + \lambda I \beta\big)
\Big[\text{Because } \frac{\partial (X^T A X)}{\partial X} = 2AX \text{ for symmetric } A\Big]
Ridge Regression using Least Square Error Fit Contd….
Step 2: Compute \hat{\beta}, the value of \beta for which \frac{\partial J(\beta)}{\partial \beta} = 0
-X^T y + X^T X\hat{\beta} + \lambda I \hat{\beta} = 0
(X^T X + \lambda I)\hat{\beta} = X^T y
\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y
Step 3: Compute \frac{\partial^2 J(\beta)}{\partial \beta^2} and prove that \hat{\beta} is a minimum
\frac{\partial^2 J(\beta)}{\partial \beta^2} = \frac{1}{n}\big(X^T X + \lambda I\big), which is positive definite for \lambda > 0, so J(\beta) attains its minimum at \hat{\beta}.
Thus, \beta is estimated as \hat{\beta} = (X^T X + \lambda I)^{-1} X^T y. This solves the problems of overfitting and
multicollinearity, since |X^T X + \lambda I| is non-zero even when the features are correlated.
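This closed-form estimator is a one-liner in NumPy (a sketch on synthetic data with deliberately correlated features, to show that the solve succeeds even where plain least squares would be ill-conditioned; \lambda = 1 is an illustrative choice):

import numpy as np

def ridge_closed_form(X, y, lam):
    """beta_hat = (X^T X + lam * I)^(-1) X^T y"""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)               # x2 is almost identical to x1 -> X^T X nearly singular
X = np.column_stack([np.ones(n), x1, x2])
y = 2 + 3 * x1 + 0.5 * rng.normal(size=n)

print(np.round(ridge_closed_form(X, y, lam=1.0), 3))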