Linear Regression & Evaluation Metrics
Linear Regression
Assumptions of Linear Regression
Linear Relationship between IV and DV
Multivariate Normality
No or Little Multicollinearity (multicollinearity is the occurrence of high intercorrelations among two or more independent variables in a multiple regression model)
No or Little Autocorrelation (autocorrelation measures the relationship between a variable's current value and its past values)
Homoscedasticity (equal error variance)
A Box-Cox transformation converts a non-normal dependent variable into an approximately normal shape. Normality is an important assumption for many statistical techniques; if your data isn't normal, applying a Box-Cox transformation lets you run a broader range of tests.
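As a minimal sketch (with a synthetic, right-skewed variable standing in for a non-normal dependent variable), SciPy's boxcox can be used to apply the transformation:

```python
# Minimal sketch: applying a Box-Cox transformation with SciPy.
# Assumes the dependent variable is strictly positive (a Box-Cox requirement);
# the data below is synthetic, purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=500)        # right-skewed, positive values

y_transformed, fitted_lambda = stats.boxcox(y)  # returns transformed data and the chosen lambda

print("fitted lambda:", round(fitted_lambda, 3))
print("skewness before:", round(stats.skew(y), 3))
print("skewness after: ", round(stats.skew(y_transformed), 3))
```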
Linear Regression is a linear approach to modeling the relationship between a dependent variable and one or more independent variables. An independent variable is a variable that is controlled in a scientific experiment to test its effect on the dependent variable. A dependent variable is the variable being measured in a scientific experiment.
Detailed explanation:
Linear regression is a machine learning algorithm used to predict the value of a continuous response variable. The predictive analytics problems solved using linear regression models are supervised learning problems, because the values of the response / target variable must be present and used for training the model. Also, recall that "continuous" means the response variable is numerical in nature and can take infinitely many different values. Linear regression models belong to the class of parametric models.
Linear regression models work well for data that are linear in nature, in other words, when the predictor / independent variables in the data set have a linear relationship with the target / response / dependent variable.
Linear regression comes in two kinds: simple linear regression and multiple linear regression.
Simple linear regression: When there is just one independent or predictor variable, as in Y = mX + c, the linear regression is termed simple linear regression.
Multiple linear regression: When there is more than one independent or predictor variable, as in Y = w1x1 + w2x2 + … + wnxn, the linear regression is called multiple linear regression.
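As a small sketch of both cases (synthetic data, made-up coefficients), scikit-learn's LinearRegression handles one or several predictors in the same way:

```python
# Sketch of simple vs. multiple linear regression with scikit-learn.
# The data and coefficients here are synthetic, purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))                     # two predictors x1, x2
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=100)

# Simple linear regression: one predictor (Y = mX + c)
simple = LinearRegression().fit(X[:, [0]], y)
print("simple:   m =", simple.coef_, " c =", simple.intercept_)

# Multiple linear regression: several predictors (Y = w1*x1 + w2*x2 + c)
multiple = LinearRegression().fit(X, y)
print("multiple: w =", multiple.coef_, " c =", multiple.intercept_)
```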
Residual Error: Residual error is the difference between the actual value and the predicted value.
Sum of Squares Total (SST): Sum of Squares Total is the sum of the squared differences between the actual values of the response variable and the mean of those actual values. It is also called the variance of the response.
Recall how you calculate variance: the sum of squared differences between the observations and the mean of all observations. It is also termed the Total Sum of Squares (TSS).
Sum of Squares Error (SSE): Sum of Squares Error, or Sum of Squared Residual Errors, is the sum of the squared differences between the actual and predicted values of the response variable. It is also termed the Residual Sum of Squares (RSS).
Sum of Squares Regression (SSR): Sum of Squares Regression is the sum of the squared differences between the predicted values and the mean of the actual values. It is also termed the Explained Sum of Squares (ESS).
SST = SSR + SSE
R-Squared: R-squared is a measure of how good the regression (best fit) line is. It is also termed the coefficient of determination. Mathematically, it is the ratio of the Sum of Squares Regression (SSR) to the Sum of Squares Total (SST):
R-Squared = SSR / SST = 1 – (SSE / SST)
The greater the value of R-squared, the better the regression line, since more of the variance is explained by the regression line.
The value of R-squared is a statistical measure of goodness of fit for a linear regression model. Alternatively, R-squared represents how close the predictions are to the actual values.
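A short worked sketch (a small made-up dataset, with the line fitted by least squares via NumPy) illustrates these quantities and the identity SST = SSR + SSE:

```python
# Worked sketch of SST, SSE, SSR and R-squared using the definitions above.
# The data points are illustrative only; the identity SST = SSR + SSE holds
# because the line is fitted by ordinary least squares with an intercept.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 12.0])

# Fit Y = A1*x + A0 by ordinary least squares
a1, a0 = np.polyfit(x, y, deg=1)
y_pred = a1 * x + a0

sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
sse = np.sum((y - y_pred) ** 2)         # residual / error sum of squares
ssr = np.sum((y_pred - y.mean()) ** 2)  # regression / explained sum of squares

print(f"SST={sst:.3f}  SSR={ssr:.3f}  SSE={sse:.3f}  SSR+SSE={ssr + sse:.3f}")
print(f"R^2 = SSR/SST = {ssr / sst:.3f} = 1 - SSE/SST = {1 - sse / sst:.3f}")
```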
If you look closely at the R-squared formula, it does not account for the number of features used: there is no term that changes with the number of features in the regression model. The R-squared value will stay the same or increase whenever we include more features in the model.
The adjusted R-squared tells us whether adding a new feature actually improves the performance of the model or not.
Adjusted R-Squared formula:
Adjusted R-Squared = 1 – [(1 – R²) × (n – 1) / (n – k – 1)]
Where:
n is the number of observations and k is the number of independent variables (features).
Regularization
Regularization is an important concept used to avoid overfitting the data, especially when the training and test data differ substantially.
Regularization is implemented by adding a "penalty" term to the best fit derived from the training data, in order to achieve lower variance on the test data; it also restricts the influence of the predictor variables on the output variable by shrinking their coefficients.
In regularization, we normally keep the same number of features but reduce the magnitude of the coefficients. We can reduce the magnitude of the coefficients by using regression techniques that apply regularization to overcome this problem.
There are two main regularization techniques, namely Ridge Regression and Lasso Regression. They differ in the way they assign a penalty to the coefficients.
Some of the regularization techniques used to address overfitting and feature selection are:
L1 Regularization, L2 Regularization
L1 Regularization
L1 regularization adds a penalty that is equal to the absolute value of the magnitude of the
coefficient. This regularization type can result in sparse models with few coefficients.
Some coefficients might become zero and get eliminated from the model. Larger penalties
result in coefficient values that are closer to zero (ideal for producing simpler models).
L2 Regularization
L2 regularization can deal with multicollinearity (independent variables being highly correlated) by constricting the coefficients while keeping all the variables.
L1 regularization tends to drive the weights of many features to exactly zero and is therefore adopted for decreasing the number of features in a high-dimensional dataset.
L2 regularization disperses the penalty across all the weights, shrinking them toward zero without eliminating any, which leads to smaller but non-zero coefficients for every feature.
Ridge regression
Ridge regression is a model tuning method used to analyse data that suffers from multicollinearity. It performs L2 regularization. When multicollinearity occurs, least-squares estimates are unbiased but their variances are large, which can result in predicted values that are far from the actual values.
The cost function for ridge regression adds an L2 penalty to the least-squares objective:
Cost = Σ(yi – ŷi)² + λ Σ βj² (the residual sum of squares plus λ times the sum of the squared coefficients)
Lambda is the penalty term. The λ given here is denoted by the alpha parameter in the ridge function. So, by changing the value of alpha, we control the penalty term: the higher the value of alpha, the bigger the penalty, and the more the magnitude of the coefficients is reduced.
The assumptions of ridge regression are the same as those of linear regression: linearity, constant variance, and independence. However, since ridge regression does not provide confidence limits, the errors need not be assumed to be normally distributed.
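As a minimal sketch (synthetic, nearly collinear predictors; the alpha values are arbitrary), scikit-learn's Ridge shows how increasing alpha shrinks the coefficient magnitudes relative to plain least squares:

```python
# Minimal ridge regression sketch with scikit-learn. The alpha parameter
# plays the role of lambda in the cost function above; data is synthetic,
# with x2 built to be nearly collinear with x1.
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)        # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 0.5 * x2 + rng.normal(scale=0.5, size=200)

print("OLS coefficients:   ", LinearRegression().fit(X, y).coef_)
for alpha in (0.1, 1.0, 10.0):
    print(f"Ridge(alpha={alpha}):", Ridge(alpha=alpha).fit(X, y).coef_)
```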
LASSO Regression
Shrinkage is where data values are shrunk towards a central point, such as the mean. The lasso procedure encourages simple, sparse models (i.e., models with fewer parameters). This particular type of regression is well suited for models showing high levels of multicollinearity, or when we want to automate certain parts of model selection, like variable selection / parameter elimination.
Lasso Regression uses the L1 regularization technique. It is useful when there are many features because it automatically performs feature selection.
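A minimal sketch of this feature-selection effect (synthetic data with only two informative features out of ten; the alpha value is chosen purely for illustration), using scikit-learn's Lasso:

```python
# Minimal lasso sketch with scikit-learn: with a large enough alpha,
# some coefficients are driven exactly to zero (automatic feature selection).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=300)  # only 2 features matter

lasso = Lasso(alpha=0.1).fit(X, y)
print("coefficients: ", np.round(lasso.coef_, 3))
print("features kept:", np.flatnonzero(lasso.coef_ != 0))
```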
Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration. An extensive list of result statistics is available for each estimator.
The statsmodels library has more advanced statistical tools than scikit-learn. Moreover, its regression analysis tools can give more detailed results.
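As a minimal sketch (the Taxes / Selling price data below is synthetic, not the dataset whose results are quoted later), an OLS fit in statsmodels produces a summary containing the fields described next:

```python
# Sketch of an OLS fit with statsmodels; the Taxes / Selling price data
# is made up, only the summary fields it produces match the description below.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
taxes = rng.uniform(2, 10, size=50)
selling_price = 20 + 5.5 * taxes + rng.normal(scale=2.0, size=50)

X = sm.add_constant(pd.DataFrame({"Taxes": taxes}))  # adds the intercept column
model = sm.OLS(selling_price, X).fit()
print(model.summary())   # R-squared, Adj. R-squared, coef, std err, P>|t|, ...
```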
R-squared value: This is a statistical measure of how well the regression line fits with the
real data points. The higher the value, the better the fit.
Adj. R-squared: This is the R-squared value corrected for the number of input features. Ideally, it should be close to the R-squared value.
Coefficient: This gives the 'm' value (slope) for the regression line. It tells how much the Selling price changes with a unit change in Taxes. A positive value means the two variables are directly proportional; a negative value would have meant that the two variables are inversely proportional to each other.
Std error: This tells us how accurate our coefficient estimate is. The lower the standard error, the higher the accuracy.
P>|t|: This is the p-value. It tells us how statistically significant the Tax values are for the Selling price. A value less than 0.05 usually means that the effect is quite significant.
Both the R-squared and adjusted R-squared values are the same: 0.941.
The intercept is a negative value: -499.09.
The z variable is not significant, as its p-value is greater than 0.05.
Evaluation metrics:
Mean Squared Error (MSE)
MSE is one of the most commonly used and simplest metrics, differing only slightly from the mean absolute error. Mean squared error is the mean of the squared differences between the actual and predicted values.
MSE represents the squared distance between the actual and predicted values. We square the differences to avoid the cancellation of positive and negative errors, which is the benefit of MSE.
Advantages of MSE
The graph of MSE is differentiable, so you can easily use it as a loss function.
Disadvantages of MSE
The value you get after calculating MSE is in squared units of the output. For example, if the output variable is measured in metres (m), then the MSE is in metres squared.
If you have outliers in the dataset, MSE penalizes them heavily and the calculated MSE becomes large.
As the name suggests, RMSE (root mean squared error) is simply the square root of the mean squared error.
Advantages of RMSE
The output value you get is in the same unit as the required output variable which
makes interpretation of loss easy.
Disadvantages of RMSE
It is still not as robust to outliers as the mean absolute error (MAE).
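As a minimal sketch (illustrative values only), MSE and RMSE can be computed as follows; note the unit difference described above:

```python
# Illustrative sketch: computing MSE and RMSE for a handful of predictions.
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([10.0, 12.0, 15.0, 20.0])   # e.g. values in metres
y_pred = np.array([11.0, 11.5, 16.0, 18.5])

mse = mean_squared_error(y_true, y_pred)      # in squared units (m^2)
rmse = np.sqrt(mse)                           # back in the units of the target (m)
print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}")
```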
R Squared (R2)
The R2 score is a metric that tells you how well your model performs; it is not a loss in an absolute sense but a relative measure of performance.
In essence, the R2 score calculates how much better the regression line is than a simple mean line, which is why it is also known as the goodness of fit.
If the R2 score is zero, then SSE equals SST, so the ratio SSE/SST is 1 and 1 – 1 = 0. In this case the regression line and the mean line overlap, which means the model performs no better than predicting the mean and is not able to use the inputs to explain the output column.
The second case is when the R2 score is 1. This happens when the error term (SSE) is zero, i.e. the regression line makes no mistakes at all: a perfect fit. In the real world this is practically never the case.
The normal case is when the R2 score is between zero and one, for example 0.8, which means the model is able to explain 80 per cent of the variance in the data.
Adjusted R Squared
The disadvantage of the R2 score is that when new features are added to the data, the R2 score increases or remains constant; it never decreases, because adding a feature can never reduce the amount of variance a least-squares fit explains.
The problem is that even when an irrelevant feature is added to the dataset, R2 can still increase, which is misleading.
In the adjusted R2 formula above, as k increases when features are added, the denominator (n – k – 1) decreases while n – 1 remains constant. The R2 score will stay constant or increase only slightly, so the whole term (1 – R2)(n – 1)/(n – k – 1) increases, and when we subtract it from one the resulting score decreases.
If, instead, a relevant feature is added, the R2 score increases, so 1 – R2 decreases sharply; even though the denominator also decreases, the whole term decreases, and on subtracting it from one the adjusted score increases.
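As a small sketch of this behaviour (synthetic data, with an intentionally irrelevant random feature), the snippet below compares R-squared and adjusted R-squared before and after adding the noise feature; the adjusted value typically drops while plain R-squared creeps up:

```python
# Sketch of R-squared vs. adjusted R-squared on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def adjusted_r2(r2, n, k):
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

rng = np.random.default_rng(3)
n = 200
x_useful = rng.normal(size=(n, 1))
x_noise = rng.normal(size=(n, 1))                 # irrelevant feature
y = 2.0 * x_useful[:, 0] + rng.normal(size=n)

for name, X in [("useful only", x_useful),
                ("useful + noise", np.hstack([x_useful, x_noise]))]:
    r2 = r2_score(y, LinearRegression().fit(X, y).predict(X))
    print(f"{name}: R^2={r2:.4f}, adjusted R^2={adjusted_r2(r2, n, X.shape[1]):.4f}")
```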
Variance inflation factor (VIF) is used to detect the severity of multicollinearity in the
ordinary least square (OLS) regression analysis.
Multicollinearity inflates the variance and type II error. It makes the coefficient of a
variable consistent but unreliable.
VIF measures how much the variance of a regression coefficient is inflated because of multicollinearity.
VIF can be calculated by the formula below:
VIFi = 1 / (1 – Ri²)
where Ri² represents the unadjusted coefficient of determination obtained by regressing the i-th independent variable on the remaining independent variables.
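As a minimal sketch (made-up feature names x1, x2, x3, with x2 constructed to be nearly collinear with x1), VIF values can be computed with statsmodels:

```python
# Sketch of computing VIF with statsmodels; feature names and data are made up.
# A common rule of thumb flags VIF values above 5-10 as problematic.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=200),   # highly correlated with x1
    "x3": rng.normal(size=200),                   # independent
})

X = add_constant(df)                              # include the intercept for correct VIFs
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(X.values, i), 2))
```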
A) TRUE
B) FALSE
Solution: (A)
Yes, linear regression is a supervised learning algorithm because it uses true labels for training. A supervised learning algorithm should have an input variable (x) and an output variable (Y) for each example.
A) TRUE
B) FALSE
Solution: (A)
Solution: (A)
4) Which of the following methods do we use to find the best fit line for data in
Linear Regression?
Solution: (A)
In linear regression, we try to minimize the least square errors of the model to
identify the line of best fit.
A) AUC-ROC
B) Accuracy
C) Logloss
D) Mean-Squared-Error
Solution: (D)
Since linear regression outputs continuous values, we use the mean squared error metric to evaluate model performance.
A) TRUE
B) FALSE
Solution: (A)
True. In lasso regression we apply an absolute-value penalty, which drives some of the coefficients to zero.
A) Lower is better
B) Higher is better
C) A or B depend on the situation
D) None of these
Solution: (A)
Residuals refer to the error values of the model. Therefore lower residuals are
desired.
8) Suppose that we have N independent variables (X1, X2, … Xn) and the dependent variable is Y. Now imagine that you are applying linear regression by fitting the best fit line using least square error on this data.
You found that the correlation coefficient for one of its variables (say X1) with Y is -0.95.
Solution: (B)
The absolute value of the correlation coefficient denotes the strength of the relationship. Since the absolute correlation is very high, the relationship between X1 and Y is strong.
If you are given two variables V1 and V2 that follow the two characteristics below.
Solution: (D)
A) TRUE
B) FALSE
Solution: (B)
The Pearson correlation coefficient between two variables might be zero even when they have a relationship between them. If the correlation coefficient is zero, it just means that they don't move together linearly. We can take examples like y = |x| or y = x^2.
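A quick numerical sketch of this point (using y = x^2 on a symmetric range):

```python
# y = x^2 on a symmetric range has a clear relationship with x,
# yet its Pearson correlation with x is approximately zero.
import numpy as np

x = np.linspace(-5, 5, 101)
y = x ** 2
print(round(np.corrcoef(x, y)[0, 1], 6))   # ~0
```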
11) Which of the following offsets do we use in linear regression's least square line fit? Suppose the horizontal axis is the independent variable and the vertical axis is the dependent variable.
A) Vertical offset
B) Perpendicular offset
C) Both, depending on the situation
D) None of above
Solution: (A)
In ordinary least squares, residuals are measured as vertical offsets: the differences between the observed and predicted values of the dependent variable.
12) True-False: Overfitting is more likely when you have a huge amount of data to train on?
A) TRUE
B) FALSE
Solution: (B)
With a small training dataset, it’s easier to find a hypothesis to fit the training data
exactly i.e. overfitting.
13) We can also compute the coefficient of linear regression with the help of
an analytical method called “Normal Equation”. Which of the following is/are
true about Normal Equation?
A) 1 and 2
B) 1 and 3
C) 2 and 3
D) 1,2 and 3
Solution: (D)
Instead of gradient descent, the Normal Equation can also be used to find the coefficients. Refer to this article to read more about the Normal Equation.
14) Which of the following statement is true about sum of residuals of A and
B?
Below graphs show two fitted regression lines (A & B) on randomly generated
data. Now, I want to find the sum of residuals in both cases A and B.
Note:
Solution: (C)
The sum of residuals will always be zero for a least-squares fit with an intercept, therefore both lines have the same sum of residuals.
Suppose you have fitted a complex regression model on a dataset. Now, you are using Ridge regression with penalty x.
Solution: (B)
If the penalty is very large, the model becomes less complex, therefore the bias would be high.
16) What will happen when you apply a very large penalty?
Solution: (B)
In lasso, some of the coefficient values become zero, but in the case of ridge, the coefficients become close to zero but not exactly zero.
17) What will happen when you apply a very large penalty in the case of Lasso?
A) Some of the coefficients will become zero
B) Some of the coefficients will approach zero but not become exactly zero
C) Both A and B, depending on the situation
D) None of these
Solution: (A)
18) Which of the following statement is true about outliers in Linear
regression?
Solution: (A)
The slope of the regression line will change due to outliers in most of the cases. So
Linear Regression is sensitive to outliers.
19) Suppose you plotted a scatter plot between the residuals and predicted values in linear regression and you found that there is a relationship between them. Which of the following conclusions do you make about this situation?
Solution: (A)
There should not be any relationship between predicted values and residuals. If there exists any relationship between them, it means that the model has not perfectly captured the information in the data.
Suppose that you have a dataset D1 and you design a linear regression model of degree 3 polynomial and you find that the training and testing error is "0", or in other terms, it perfectly fits the data.
20) What will happen when you fit degree 4 polynomial in linear regression?
A) There are high chances that degree 4 polynomial will over fit the data
B) There are high chances that degree 4 polynomial will under fit the data
C) Can’t say
D) None of these
Solution: (A)
Since a degree 4 polynomial is more complex than the degree 3 model (and will overfit the data), it will again perfectly fit the training data. In such a case the training error will be zero, but the test error may not be zero.
21) What will happen when you fit degree 2 polynomial in linear regression?
A) There are high chances that degree 2 polynomial will under fit the data
B) There are high chances that degree 2 polynomial will over fit the data
C) Can’t say
D) None of these
Solution: (B)
If a degree 3 polynomial fits the data perfectly, it's highly likely that a simpler model (a degree 2 polynomial) might underfit the data.
22) In terms of bias and variance, which of the following is true when you fit a degree 2 polynomial?
Solution: (C)
Since a degree 2 polynomial will be less complex as compared to degree 3, the bias
will be high and variance will be low.
Which of the following is true about the graphs below (A, B, C, left to right) showing the cost function against the number of iterations?
23) Suppose l1, l2 and l3 are the three learning rates for A,B,C respectively.
Which of the following is true about l1,l2 and l3?
A) l2 < l1 < l3
Solution: (A)
In the case of a high learning rate, the step size will be large: the objective function will decrease quickly at first, but it will not find the global minimum and the objective function starts increasing after a few iterations.
In the case of a low learning rate, the step size will be small, so the objective function will decrease slowly.
We have been given a dataset with n records in which we have an input attribute x and an output attribute y. Suppose we use a linear regression method to model this data. To test our linear regressor, we split the data into a training set and a test set randomly.
24) Now we increase the training set size gradually. As the training set size
increases, what do you expect will happen with the mean training error?
A) Increase
B) Decrease
C) Remain constant
D) Can’t Say
Solution: (D)
Training error may increase or decrease depending on the values that are used to fit the model. If the values used for training gradually include more outliers, then the error might increase.
25) What do you expect will happen with bias and variance as you increase
the size of training data?
Solution: (D)
As we increase the size of the training data, the bias would increase while the
variance would decrease.
Consider the following data where one input (X) and one output (Y) are given.
26) What would be the root mean square training error for this data if you
run a Linear Regression model of the form (Y = A0+A1X)?
A) Less than 0
B) Greater than zero
C) Equal to 0
D) None of these
Solution: (C)
We can perfectly fit a line to the given data, so the mean error will be zero.
Suppose you have been given the following scenario for training and validation
error for Linear Regression.
27) Which of the following scenarios would give you the right hyperparameter?
A) 1
B) 2
C) 3
D) 4
Solution: (B)
Option B would be the better option because it leads to lower training as well as validation error.
28) Suppose you got the tuned hyperparameters from the previous question. Now, imagine you want to add a variable to the variable space such that this added feature is important. Which of the following would you observe in such a case?
Solution: (D)
If the added feature is important, the training and validation error would decrease.
Question Context 29-30:
Suppose you are in a situation where you find that your linear regression model is underfitting the data.
29) In such a situation, which of the following options would you consider?
A) 1 and 2
B) 2 and 3
C) 1 and 3
D) 1, 2 and 3
Solution: (A)
In case of underfitting, you need to introduce more variables into the variable space, or add some polynomial-degree variables to make the model more complex so that it fits the data better.
A) L1
B) L2
C) Any
D) None of these
Solution: (D)
Since the model is already underfitting the data, applying an L1 or L2 penalty would restrict it further, so no regularization should be used.