Linear - Regression & Evaluation Metrics

Linear regression is a machine learning algorithm used to predict continuous variables. Its assumptions are linearity, normality, no multicollinearity, and homoscedasticity. There are two kinds: simple and multiple linear regression; multiple linear regression uses more than one independent variable to predict the dependent variable. Key steps are checking the data, analyzing relationships, building the model, and interpreting outputs such as R-squared and p-values. Regularization helps address overfitting and multicollinearity: ridge regression performs L2 regularization while lasso performs L1, and the two affect coefficients differently.


Regression

 
Linear Regression
Assumptions of Linear Regression
Linear relationship between the independent variables (IV) and the dependent variable (DV)
Multivariate normality
No or little multicollinearity (multicollinearity is the occurrence of high intercorrelations among two or more independent variables in a multiple regression model)
No or little autocorrelation (autocorrelation measures the relationship between a variable's current value and its past values)
Homoscedasticity (equal error variance)
A Box-Cox transformation converts a non-normal dependent variable into an approximately normal shape. Normality is an important assumption for many statistical techniques; if your data isn't normal, applying a Box-Cox transformation lets you run a broader range of tests.
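The sketch below shows one way to apply a Box-Cox transformation with SciPy; it assumes a strictly positive, right-skewed target column named price in a pandas DataFrame, and the data is made up purely for illustration.

import numpy as np
import pandas as pd
from scipy import stats

# Illustrative data: a right-skewed, strictly positive target variable.
rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.lognormal(mean=3.0, sigma=0.8, size=500)})

# Box-Cox requires strictly positive values; it returns the transformed
# series and the fitted lambda parameter.
transformed, fitted_lambda = stats.boxcox(df["price"])

print(f"fitted lambda: {fitted_lambda:.3f}")
print(f"skewness before: {stats.skew(df['price']):.3f}")
print(f"skewness after:  {stats.skew(transformed):.3f}")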
Linear Regression is a linear approach to modeling the relationship between a dependent
variable and one independent variable. An independent variable is a variable that is
controlled in a scientific experiment to test the effects on the dependent variable.
A dependent variable is a variable being measured in a scientific experiment.

Linear Regression formula: Y = mX + c, where m is the slope and c is the intercept.

Multiple Linear Regression is a linear approach to modeling the relationship between a
dependent variable and two or more independent variables.

Multiple Linear Regression formula: Y = w1x1 + w2x2 + … + wnxn + c.


Steps for Running the Linear Regression
Step 1: Understand the model description, causality, and directionality
Step 2: Check the data, categorical data, missing data, and outliers
An outlier is a data point that differs significantly from other observations. We can detect outliers with the standard deviation method or the interquartile range (IQR) method.
A dummy variable takes only the value 0 or 1 and is used to encode the effect of categorical variables.
Step 3: Simple Analysis — check the relationships between the dependent variable and each independent variable, and between pairs of independent variables
Use scatter plots to check the correlations
Multicollinearity occurs when two or more independent variables are highly correlated. We can measure it with the Variance Inflation Factor (VIF): a VIF > 5 indicates high correlation, and a VIF > 10 indicates that multicollinearity is almost certainly present among the variables.
An interaction term allows the slope with respect to one variable to change with the value of another variable.
Step 4: Multiple Linear Regression — Check the model and the correct variables
Step 5: Residual Analysis (a short sketch of these checks follows below)
Check that the residuals are approximately normally distributed.
Homoscedasticity describes a situation in which the error term is the same across all values of the independent variables, meaning that the residuals are spread evenly along the regression line.
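A minimal sketch of these Step 5 residual checks, assuming statsmodels and SciPy are available; the data and the specific tests chosen (Shapiro-Wilk for normality, a simple correlation check for constant variance) are illustrative rather than prescriptive.

import numpy as np
import statsmodels.api as sm
from scipy import stats

# Made-up data: fit a simple OLS model and inspect its residuals.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=200)

X = sm.add_constant(x)                # add an intercept column
residuals = sm.OLS(y, X).fit().resid

# Normality check: Shapiro-Wilk test (p > 0.05 suggests the residuals look normal).
print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)

# Homoscedasticity check: the correlation between |residuals| and x should be
# near zero if the error variance is roughly constant across the predictor's range.
print("corr(|residuals|, x):", np.corrcoef(np.abs(residuals), x)[0, 1])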
Step 6: Interpretation of Regression Output
R-Squared is a statistical measure of fit that indicates how much of the variation in the dependent variable is explained by the independent variables. A higher R-squared value represents smaller differences between the observed data and the fitted values.
 P-value
 Regression Equation

Detailed explanation:
Linear regression is a machine learning algorithm used to predict the value of a continuous response variable. The predictive analytics problems solved with linear regression models are called supervised learning problems, as they require that the values of the response / target variable be present and used for training the models. Also, recall that "continuous" means the response variable is numerical in nature and can take infinitely many different values. Linear regression models belong to the class of parametric models.
Linear regression models work well for data which are linear in nature: the predictor / independent variables in the data set have a linear relationship with the target / response / dependent variable.
Linear regression models are of two kinds: simple linear regression and multiple linear regression.

Simple linear regression: When there is just one independent or predictor variable, as in Y = mX + c, the linear regression is termed simple linear regression.
Multiple linear regression: When there is more than one independent or predictor variable, as in Y = w1x1 + w2x2 + … + wnxn, the linear regression is called multiple linear regression.
Residual Error: The residual error is the difference between the actual value and the predicted value.
Sum of Squares Total (SST): Sum of Squares Total is the sum of squared differences between the actual values of the response variable and the mean of the actual values. It is also called the variance of the response.
Recall how you calculate variance: the sum of squared differences between observations and their mean. SST is also termed the Total Sum of Squares (TSS).
Sum of Squares Error (SSE): Sum of Squares Error, or Sum of Squared Residual Errors, is the sum of squared differences between the actual values and the predicted values of the response variable. It is also termed the Residual Sum of Squares.
Sum of Squares Regression (SSR): Sum of Squares Regression is the sum of squared differences between the predicted values and the mean of the actual values. It is also termed the Explained Sum of Squares (ESS).
       SST = SSR + SSE
R-Squared: R-squared is a measure of how good the regression (best fit) line is. It is also termed the coefficient of determination. Mathematically, it is the ratio of the Sum of Squares Regression (SSR) to the Sum of Squares Total (SST).
R-Squared = SSR / SST = 1 – (SSE / SST)

The greater the value of R-squared, the better the regression line, because a higher value means more of the variance is explained by the regression line.

The value of R-squared is a statistical measure of goodness of fit for a linear regression model. Alternatively, R-squared represents how close the predictions are to the actual values.
If you look closely at the R-squared formula, it does not account for the number of features used: there is no component that changes with the number of features in the regression model. The R-squared value will therefore stay the same or increase if we include more features in the model.
The adjusted R-squared method tells us whether adding a new feature improves the performance of the model or not.
Adjusted R-Squared Formula

Adjusted R-Squared = 1 – [(1 – R2) × (N – 1) / (N – p – 1)]

Where:
R2 is the R-squared value
N is the number of samples
p is the number of features used (independent variables)

Always consider the adjusted R-squared as the evaluation metric unless we build a model with a single feature.

Error % = RMSE / y-bar (the mean of the actual values)
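A minimal sketch that computes SSE, SST, R-squared, adjusted R-squared, RMSE, and the error percentage above from actual and predicted values; the function name and the sample numbers are made up for illustration.

import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Compute R^2, adjusted R^2, RMSE, and error % from actual vs predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)

    sse = np.sum((y_true - y_pred) ** 2)          # Sum of Squared Errors (residuals)
    sst = np.sum((y_true - y_true.mean()) ** 2)   # Total Sum of Squares
    r2 = 1 - sse / sst                            # R-Squared = 1 - SSE/SST
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    rmse = np.sqrt(sse / n)                       # Root Mean Squared Error
    error_pct = rmse / y_true.mean()              # Error % = RMSE / y-bar

    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse, "error_pct": error_pct}

# Illustrative usage with made-up values:
y_actual = [3.1, 4.5, 5.0, 6.2, 7.8]
y_hat = [3.0, 4.7, 5.1, 6.0, 7.5]
print(regression_metrics(y_actual, y_hat, n_features=1))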

Regularization
Regularization is an important concept used to avoid overfitting, especially when performance on the training and test data differs widely.

Regularization is implemented by adding a "penalty" term to the best fit derived from the training data, in order to achieve lower variance on the test data; it also restricts the influence of the predictor variables on the output variable by compressing their coefficients.
In regularization, we normally keep the same number of features but reduce the magnitude of the coefficients. We can reduce the magnitude of the coefficients by using regression techniques that apply regularization to overcome this problem.

There are two main regularization techniques, namely Ridge Regression and Lasso
Regression. They both differ in the way they assign a penalty to the coefficients.

Some of the regularization techniques used to address overfitting and feature selection are L1 regularization and L2 regularization.

A regression model that uses the L1 regularization technique is called Lasso Regression, and a model which uses L2 is called Ridge Regression.

L1 Regularization

L1 regularization adds a penalty that is equal to the absolute value of the magnitude of the
coefficient. This regularization type can result in sparse models with few coefficients.
Some coefficients might become zero and get eliminated from the model. Larger penalties
result in coefficient values that are closer to zero (ideal for producing simpler models). 

L2 Regularization

L2 regularization can deal with multicollinearity (independent variables being highly correlated) by shrinking the coefficients while keeping all the variables.

L1 regularization tends to drive some feature weights to exactly zero, effectively selecting a subset of features, and is therefore used to decrease the number of features in a high-dimensional dataset.
L2 regularization spreads the penalty across all the weights, shrinking them smoothly, which often leads to more accurate final models.

Ridge regression

Ridge regression is a model tuning method used to analyze data that suffers from multicollinearity. This method performs L2 regularization. When multicollinearity occurs, the least-squares estimates are unbiased but their variances are large, so the predicted values may be far from the actual values.
The cost function for ridge regression:

Min(||Y – X(theta)||^2 + λ||theta||^2)

Lambda is the penalty term. The λ given here is denoted by the alpha parameter in the ridge function, so by changing the value of alpha we control the penalty term. The higher the value of alpha, the bigger the penalty, and therefore the more the magnitude of the coefficients is reduced.

 It shrinks the parameters. Therefore, it is used to prevent multicollinearity


 It reduces the model complexity by coefficient shrinkage

Assumptions of Ridge Regressions

The assumptions of ridge regression are the same as those of linear regression: linearity, constant variance, and independence. However, as ridge regression does not provide confidence limits, normality of the errors need not be assumed.
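A minimal sketch of ridge regression with scikit-learn on made-up data with two deliberately correlated predictors; alpha is scikit-learn's name for the λ penalty described above.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data with two highly correlated predictors (multicollinearity).
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)    # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 1.5 * x2 + rng.normal(scale=0.5, size=200)

X_scaled = StandardScaler().fit_transform(X)  # scale features before penalizing

# Larger alpha -> larger penalty -> coefficients shrink toward (but not to) zero.
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_scaled, y)
    print(alpha, model.coef_)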

LASSO Regression

Lasso meaning: The word "LASSO" stands for Least Absolute Shrinkage and Selection Operator. It is a statistical method for the regularization of data models and feature selection.

Lasso regression is a regularization technique. It is used over plain regression methods for a more accurate prediction. This model uses shrinkage.

Shrinkage is where data values are shrunk towards a central point, such as the mean. The lasso
procedure encourages simple, sparse models (i.e., models with fewer parameters). This
particular type of regression is well-suited for models showing high levels of
multicollinearity or when we want to automate certain parts of model selection, like
variable selection/parameter elimination.

Lasso regression uses the L1 regularization technique. It is preferred when there are many features because it automatically performs feature selection.

The lasso model coefficients are:

All coefficients are zero except the 'carat' column, because of the use of L1 regularization.

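A minimal sketch of lasso regression with scikit-learn, using made-up data (not the diamonds data referenced above), showing how a sufficiently large penalty drives the coefficients of irrelevant features to exactly zero.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
# Five features, but only the first one actually drives the target.
X = rng.normal(size=(n, 5))
y = 5.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

X_scaled = StandardScaler().fit_transform(X)

model = Lasso(alpha=0.5).fit(X_scaled, y)
print(model.coef_)   # the irrelevant features are shrunk to exactly 0.0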

Linear Regression using statsmodels

Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration. An extensive list of result statistics is available for each estimator.
The statsmodels library has more advanced statistical tools than scikit-learn, and its regression analysis tools can give more detailed results.
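A minimal sketch of fitting an OLS model with statsmodels; the column names Taxes and SellingPrice mirror the example interpreted below, but the numbers are made up.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up data: selling price roughly proportional to taxes, plus noise.
rng = np.random.default_rng(1)
taxes = rng.uniform(500, 3000, size=100)
selling_price = -500.0 + 5.0 * taxes + rng.normal(scale=300.0, size=100)
df = pd.DataFrame({"Taxes": taxes, "SellingPrice": selling_price})

X = sm.add_constant(df[["Taxes"]])   # statsmodels does not add an intercept by default
model = sm.OLS(df["SellingPrice"], X).fit()

# The summary lists R-squared, adjusted R-squared, coefficients,
# standard errors, and P>|t| values, as interpreted in the next section.
print(model.summary())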

Understanding the Results

R-squared value: This is a statistical measure of how well the regression line fits with the
real data points. The higher the value, the better the fit.

Adj. R-squared: This is the R-squared value corrected for the number of input features. Ideally, it should be close to the R-squared value.

Coefficient: This gives the ‘M’ value for the regression line. It tells how much the Selling
price changes with a unit change in Taxes. A positive value means that the two variables are
directly proportional. A negative value, however, would have meant that the two variables are
inversely proportional to each other.

Std error:  This tells us how accurate our coefficient value is. The lower the standard error,
the higher the accuracy.

P >|t|: This is the p-value. It tells us how statistically significant the Tax values are to the Selling price. A value less than 0.05 usually means that the variable is quite significant.

Both R-squared and adjusted R-squared have the same value, 0.941.
The intercept is a negative value: -499.09.
The z variable is not significant, as its p-value is greater than 0.05.

 Evaluation metrics:
Mean Squared Error (MSE)

MSE is a widely used and very simple metric, obtained with a small change from the mean absolute error. Mean squared error is the mean of the squared differences between the actual and predicted values.
MSE represents the squared distance between actual and predicted values. We square the differences to avoid the cancellation of negative terms, which is the benefit of MSE.

Advantages of MSE

The graph of MSE is differentiable, so you can easily use it as a loss function.

Disadvantages of MSE

The value you get after calculating MSE is in squared units of the output. For example, if the output variable is in metres (m), then the MSE is in metres squared.
If there are outliers in the dataset, MSE penalizes them the most, and the calculated MSE becomes large.

Root Mean Squared Error (RMSE)

As the name suggests, RMSE is simply the square root of the mean squared error.
Advantages of RMSE

  The output value you get is in the same unit as the required output variable which
makes interpretation of loss easy.

Disadvantages of RMSE

 It is not that robust to outliers as compared to MAE.

R Squared (R2)

The R2 score is a metric that describes the performance of your model rather than the loss in an absolute sense; it tells how well the model performed.

Essentially, R2 calculates how much better the regression line is than a simple mean line.

Hence, R2 is also known as the Coefficient of Determination, or sometimes the Goodness of Fit.

If the R2 score is zero, the ratio of the regression line's error to the mean line's error is 1, so 1 - 1 is zero. In this case the two lines overlap: the model's performance is at its worst, and it makes no use of the input features to predict the output column.

The second case is when the R2 score is 1, which happens when the ratio term is zero, i.e. when the regression line makes no mistakes at all and is perfect. In the real world this is not possible.

The normal case is when the R2 score is between zero and one, such as 0.8, which means the model is able to explain 80 per cent of the variance in the data.
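A minimal sketch of computing MAE, MSE, RMSE, and the R2 score with scikit-learn's metric functions, on made-up actual and predicted values.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([10.0, 12.5, 15.0, 20.0, 25.0])
y_pred = np.array([11.0, 12.0, 14.0, 21.0, 24.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)   # squared units of the target
rmse = np.sqrt(mse)                        # back in the original units
r2 = r2_score(y_true, y_pred)

print(f"MAE: {mae:.3f}  MSE: {mse:.3f}  RMSE: {rmse:.3f}  R2: {r2:.3f}")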

Adjusted R Squared

The disadvantage of the R2 score is that when new features are added to the data, the R2 score increases or stays constant but never decreases, because it assumes that adding more features can only increase the explained variance.
The problem is that even when an irrelevant feature is added to the dataset, R2 can still increase, which is misleading.

Hence, to control this situation Adjusted R Squared came into existence.

Now, as the number of features p increases, the denominator (N - p - 1) decreases while (N - 1) stays constant. If the R2 score stays constant or increases only slightly, the whole fraction increases, and when we subtract it from one, the resulting score decreases.

If a relevant feature is added, the R2 score increases and (1 - R2) decreases sharply; the denominator also decreases, but the complete term still decreases, and on subtracting it from one the score increases.

What does VIF mean?


Variance inflation factor (VIF) is a measure of the amount of multicollinearity in a set of multiple regression variables. The ratio is calculated for each independent variable. A high VIF indicates that the associated independent variable is highly collinear with the other variables in the model.

Variance inflation factor (VIF) is used to detect the severity of multicollinearity in ordinary least squares (OLS) regression analysis.
Multicollinearity inflates the variance and the type II error. It leaves the coefficient of a variable consistent but unreliable.
VIF measures how much the variance of a coefficient is inflated because of multicollinearity.
VIF can be calculated by the formula below:

VIF_i = 1 / (1 - R_i^2)

where R_i^2 is the unadjusted coefficient of determination obtained by regressing the ith independent variable on the remaining ones. The reciprocal of VIF is known as tolerance. Either VIF or tolerance can be used to detect multicollinearity.
 Generally, a VIF above 4 or tolerance below 0.25 indicates that multicollinearity
might exist.
 When VIF is higher than 10 or tolerance is lower than 0.1, there is significant
multicollinearity that needs to be corrected.
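A minimal sketch of computing VIF for each predictor with statsmodels, on made-up data in which two columns are deliberately correlated; all column names are illustrative.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # strongly correlated with x1
x3 = rng.normal(size=n)                   # independent of the others
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# A constant column is added so the VIFs are computed against a model with an intercept.
X_const = sm.add_constant(X)

vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
# The 'const' row can be ignored; x1 and x2 should show VIF >> 10, x3 near 1.
print(vif)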
Skill test Questions and Answers

1) True-False: Linear Regression is a supervised machine learning algorithm.

A) TRUE
B) FALSE

Solution: (A)

Yes, linear regression is a supervised learning algorithm because it uses true labels for training. A supervised learning algorithm requires an input variable (x) and an output variable (Y) for each example.

2) True-False: Linear Regression is mainly used for Regression.

A) TRUE
B) FALSE

Solution: (A)

Linear Regression has dependent variables that have continuous values.

3) True-False: It is possible to design a Linear regression algorithm using a


neural network?
A) TRUE
B) FALSE

Solution: (A)

True. A Neural network can be used as a universal approximator, so it can


definitely implement a linear regression algorithm.

4) Which of the following methods do we use to find the best fit line for data in
Linear Regression?

A) Least Square Error


B) Maximum Likelihood
C) Logarithmic Loss
D) Both A and B

Solution: (A)
In linear regression, we try to minimize the least square errors of the model to
identify the line of best fit.

5) Which of the following evaluation metrics can be used to evaluate a model


while modeling a continuous output variable?

A) AUC-ROC
B) Accuracy
C) Logloss
D) Mean-Squared-Error

Solution: (D)
Since linear regression outputs continuous values, we use the mean squared error metric to evaluate model performance.

AUC-ROC, Accuracy, and Log loss are used for classification problems.

6) True-False: Lasso Regularization can be used for variable selection in


Linear Regression.

A) TRUE
B) FALSE

Solution: (A)

True. In lasso regression we apply an absolute-value penalty, which drives some of the coefficients to zero.

7) Which of the following is true about Residuals ?

A) Lower is better
B) Higher is better
C) A or B depend on the situation
D) None of these

Solution: (A)

Residuals refer to the error values of the model. Therefore lower residuals are
desired.
8) Suppose that we have N independent variables (X1,X2… Xn) and dependent
variable is Y. Now Imagine that you are applying linear regression by fitting the
best fit line using least square error on this data.

You found that the correlation coefficient for one of its variables (say X1) with Y is -0.95.

Which of the following is true for X1?

A) Relation between the X1 and Y is weak


B) Relation between the X1 and Y is strong
C) Relation between the X1 and Y is neutral
D) Correlation can’t judge the relationship

Solution: (B)

The absolute value of the correlation coefficient denotes the strength of the relationship. Since the absolute correlation is very high, the relationship between X1 and Y is strong.

9) You are given two variables V1 and V2 that follow the two characteristics below. Which of the following options is correct for the Pearson correlation between V1 and V2?

1. If V1 increases then V2 also increases
2. If V1 decreases then V2's behavior is unknown

A) Pearson correlation will be close to 1


B) Pearson correlation will be close to -1
C) Pearson correlation will be close to 0
D) None of these

Solution: (D)

We cannot comment on the correlation coefficient using statement 1 alone; we need to consider both statements. For example, consider V1 as x and V2 as |x|: the correlation coefficient would not be close to 1 in such a case.

10) Suppose Pearson correlation between V1 and V2 is zero. In such case, is it


right to conclude that V1 and V2 do not have any relation between them?

A) TRUE
B) FALSE

Solution: (B)

The Pearson correlation coefficient between two variables can be zero even when they have a relationship between them. A correlation coefficient of zero just means that they do not move together linearly. Examples include y = |x| or y = x^2.

11) Which of the following offsets do we use in linear regression's least squares line fit? Suppose the horizontal axis is the independent variable and the vertical axis is the dependent variable.

 
A) Vertical offset
B) Perpendicular offset
C) Both, depending on the situation
D) None of above

Solution: (A)

We always consider residuals as vertical offsets: we calculate the direct differences between the actual values and the predicted values along the Y axis. Perpendicular offsets are useful in the case of PCA.

12) True-False: Overfitting is more likely when you have a huge amount of data to train on?

A) TRUE
B) FALSE

Solution: (B)

With a small training dataset, it's easier to find a hypothesis that fits the training data exactly, i.e. to overfit; with a huge amount of data this is less likely.

 
13) We can also compute the coefficients of linear regression with an analytical method called the "Normal Equation". Which of the following is/are true about the Normal Equation?

1. We don’t have to choose the learning rate


2. It becomes slow when the number of features is very large
3. There is no need to iterate

A) 1 and 2
B) 1 and 3
C) 2 and 3
D) 1,2 and 3

Solution: (D)

Instead of gradient descent, the Normal Equation can also be used to find the coefficients directly in closed form.
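A minimal sketch of the normal equation, theta = (X^T X)^(-1) X^T y, in NumPy on made-up data, as a closed-form alternative to gradient descent.

import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.uniform(0, 10, size=n)
y = 2.5 * x + 7.0 + rng.normal(scale=1.0, size=n)   # true slope 2.5, intercept 7.0

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones(n), x])

# Normal equation: theta = (X^T X)^-1 X^T y
# (np.linalg.lstsq or pinv is preferred numerically, but this is the closed form.)
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print("intercept, slope:", theta)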

14) Which of the following statement is true about sum of residuals of A and
B?

Below graphs show two fitted regression lines (A & B) on randomly generated
data. Now, I want to find the sum of residuals in both cases A and B.

Note:

1. Scale is same in both graphs for both axis.


2. X axis is independent variable and Y-axis is dependent variable.
A) A has higher sum of residuals than B
B) A has lower sum of residual than B
C) Both have same sum of residuals
D) None of these

Solution: (C)

The sum of residuals of a least-squares fit (with an intercept) is always zero, therefore both have the same sum of residuals.

Question Context 15-17:

Suppose you have fitted a complex regression model on a dataset. Now, you are
using Ridge regression with penalty x.

15) Choose the option which describes the bias in the best manner.


A) In case of very large x; bias is low
B) In case of very large x; bias is high
C) We can’t say about bias
D) None of these

Solution: (B)
If the penalty is very large, the model becomes less complex, and therefore the bias will be high.

16) What will happen when you apply a very large penalty?

A) Some of the coefficients will become absolutely zero
B) Some of the coefficients will approach zero but not absolute zero
C) Both A and B depending on the situation
D) None of these

Solution: (B)

In lasso, some of the coefficient values become zero, but in the case of ridge, the coefficients only approach zero without becoming exactly zero.

17) What will happen when you apply a very large penalty in the case of Lasso?
A) Some of the coefficients will become zero
B) Some of the coefficients will approach zero but not absolute zero
C) Both A and B depending on the situation
D) None of these

Solution: (A)

As already discussed, lasso applies an absolute-value penalty, so some of the coefficients will become zero.

 
18) Which of the following statements is true about outliers in Linear regression?

A) Linear regression is sensitive to outliers


B) Linear regression is not sensitive to outliers
C) Can’t say
D) None of these

Solution: (A)

The slope of the regression line will change due to outliers in most of the cases. So
Linear Regression is sensitive to outliers.

19) Suppose you plotted a scatter plot between the residuals and predicted
values in linear regression and you found that there is a relationship between
them. Which of the following conclusion do you make about this situation?

A) Since there is a relationship, it means our model is not good
B) Since there is a relationship, it means our model is good
C) Can’t say
D) None of these

Solution: (A)

There should not be any relationship between the predicted values and the residuals. If any relationship exists between them, it means that the model has not perfectly captured the information in the data.
 

Question Context 20-22:

Suppose that you have a dataset D1 and you design a linear regression model with a degree 3 polynomial, and you find that the training and testing error is "0", or in other terms it perfectly fits the data.

20) What will happen when you fit a degree 4 polynomial in linear regression?
A) There is a high chance that the degree 4 polynomial will overfit the data
B) There is a high chance that the degree 4 polynomial will underfit the data
C) Can’t say
D) None of these

Solution: (A)

Since a degree 4 polynomial is more complex than the degree 3 model, it will again fit the training data perfectly, so the training error will be zero, but the test error may not be zero; there is a high chance that it will overfit the data.

21) What will happen when you fit a degree 2 polynomial in linear regression?
A) There is a high chance that the degree 2 polynomial will overfit the data
B) There is a high chance that the degree 2 polynomial will underfit the data
C) Can’t say
D) None of these

Solution: (B)

If a degree 3 polynomial fits the data perfectly, it's highly likely that a simpler model (degree 2 polynomial) will underfit the data.
 

22) In terms of bias and variance, which of the following is true when you fit a degree 2 polynomial?

A) Bias will be high, variance will be high


B) Bias will be low, variance will be high
C) Bias will be high, variance will be low
D) Bias will be low, variance will be low

Solution: (C)

Since a degree 2 polynomial will be less complex as compared to degree 3, the bias
will be high and variance will be low.

Question Context 23:

Which of the following is true about the graphs below (A, B, C, left to right) of the cost function versus the number of iterations?

23) Suppose l1, l2 and l3 are the three learning rates for A, B, C respectively. Which of the following is true about l1, l2 and l3?

 
A) l2 < l1 < l3

B) l1 > l2 > l3


C) l1 = l2 = l3
D) None of these

Solution: (A)

With a high learning rate the steps are large: the objective function decreases quickly at first, but it fails to find the global minimum and starts increasing after a few iterations.

With a low learning rate the steps are small, so the objective function decreases slowly.

Question Context 24-25:

We have been given a dataset with n records in which we have an input attribute x and an output attribute y. Suppose we use a linear regression method to model this data. To test our linear regressor, we split the data into training and test sets randomly.

24) Now we increase the training set size gradually. As the training set size
increases, what do you expect will happen with the mean training error?

A) Increase
B) Decrease
C) Remain constant
D) Can’t Say

Solution: (D)

Training error may increase or decrease depending on the values that are used to fit the model. If the values added to the training set gradually contain more outliers, the error might well increase.

25) What do you expect will happen with bias and variance as you increase
the size of training data?

A) Bias increases and Variance increases


B) Bias decreases and Variance increases
C) Bias decreases and Variance decreases
D) Bias increases and Variance decreases
E) Can't Say

Solution: (D)

As we increase the size of the training data, the bias would increase while the
variance would decrease.

Question Context 26:

Consider the following data, where one input (X) and one output (Y) are given.
26) What would be the root mean square training error for this data if you
run a Linear Regression model of the form (Y = A0+A1X)?

A) Less than 0
B) Greater than zero
C) Equal to 0
D) None of these

Solution: (C)

We can fit the line perfectly on the given data, so the mean error will be zero.

Question Context 27-28:

Suppose you have been given the following scenario for training and validation
error for Linear Regression.

Scenario   Learning Rate   Number of Iterations   Training Error   Validation Error
1          0.1             1000                   100              110
2          0.2             600                    90               105
3          0.3             400                    110              110
4          0.4             300                    120              130
5          0.4             250                    130              150

27) Which of the following scenarios would give you the right hyperparameters?

A) 1
B) 2
C) 3
D) 4

Solution: (B)

Option B would be the better option because it leads to lower training as well as validation error.

28) Suppose you got the tuned hyperparameters from the previous question. Now, imagine you want to add a variable to the variable space such that this added feature is important. Which of the following would you observe in such a case?

A) Training Error will decrease and Validation error will increase

B) Training Error will increase and Validation error will increase


C) Training Error will increase and Validation error will decrease
D) Training Error will decrease and Validation error will decrease
E) None of the above

Solution: (D)

If the added feature is important, the training and validation error would decrease.
Question Context 29-30:

Suppose you are in a situation where you find that your linear regression model is underfitting the data.

29) In such situation which of the following options would you consider?

1. Add more variables


2. Start introducing polynomial degree variables
3. Remove some variables

A) 1 and 2
B) 2 and 3
C) 1 and 3
D) 1, 2 and 3

Solution: (A)

In case of underfitting, you need to introduce more variables into the variable space, or add some polynomial-degree variables, to make the model complex enough to fit the data better.

30) Now the situation is the same as in the previous question (underfitting). Which of the following regularization algorithms would you prefer?

A) L1
B) L2
C) Any
D) None of these
Solution: (D)

I won't use any regularization method because regularization is used in case of overfitting.

Elastic net is basically a combination of both L1 and L2 regularization. So if you know elastic net, you can implement both ridge and lasso by tuning the parameters. Since it uses both the L1 and L2 penalty terms, its cost function has the form:

Min(||Y – X(theta)||^2 + λ1||theta||_1 + λ2||theta||^2)

So how do we adjust the lambdas in order to control the L1 and L2 penalty terms? Let us understand with an example. You are trying to catch fish from a pond, and you only have a net; what would you do? Will you throw your net randomly? No, you will wait until you see one fish swimming around, then throw the net in that direction to collect the entire group of fish. Therefore, even if the variables are correlated, we still want to look at their entire group.

Elastic net regression works in a similar way. Let's say we have a bunch of correlated independent variables in a dataset; elastic net will simply form a group consisting of these correlated variables. Now, if any one variable of this group is a strong predictor (meaning it has a strong relationship with the dependent variable), then we include the entire group in the model building, because omitting the other variables (as we did in lasso) might mean losing some information in terms of interpretation ability, leading to poorer model performance.
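A minimal sketch of elastic net with scikit-learn on made-up data containing a group of correlated predictors; alpha controls the overall penalty strength and l1_ratio controls the mix between the L1 and L2 terms.

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(5)
n = 300
# A group of three correlated predictors plus two pure-noise features.
base = rng.normal(size=n)
X = np.column_stack([
    base + rng.normal(scale=0.1, size=n),
    base + rng.normal(scale=0.1, size=n),
    base + rng.normal(scale=0.1, size=n),
    rng.normal(size=n),
    rng.normal(size=n),
])
y = 4.0 * base + rng.normal(scale=0.5, size=n)

# l1_ratio=1.0 is pure lasso, l1_ratio=0.0 is pure ridge; values in between mix both.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)   # the correlated group tends to share the weight rather than one feature taking it all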
