
Unit 4: Regression

INTRODUCTION TO REGRESSION
• Regression is a well-known statistical technique to model the
predictive relationship between several independent variables
and one dependent variable.
• The objective is to find the best-fitting curve for a dependent
variable in a multidimensional space, with each independent
variable being a dimension.
• The curve could be a straight line, or it could be a nonlinear
curve.
• The quality of fit of the curve to the data can be measured by the coefficient of correlation (r), whose square (r²) is the proportion of variance explained by the curve.
POINT TO PONDER?
● “Imagine you have made plans with friends after a long time and you wish to go out, but you are not sure whether it will rain or not. It’s the monsoon season, but your mom says the air feels dry today, and therefore the probability of rain today is low. On the contrary, your sister believes that because it rained yesterday, it is likely to rain today. Considering you have no control over the weather, how will you decide whose opinion to take more seriously, keeping in mind the fact that you are impartial towards both?”

Source: https://www.dezyre.com/article/types-of-regression-analysis-in-machine-learning/410
[Figure: Rainfall/precipitation as the dependent variable, linearly correlated with independent variables such as geographical location, humidity, and wind speed.]
KEY STEPS

The key steps for regression are simple; a short Python sketch follows the list.


1. List all the variables available for making
the model.
2. Establish a Dependent Variable (DV) of
interest.
3. Examine visual (if possible) relationships
between variables of interest.
4. Find a way to predict DV using other
variables.
INDEPENDENT AND DEPENDENT VARIABLES
• In our example, what we are trying to predict is today’s precipitation level, which depends on the level of humidity and the rain received yesterday; hence it is called the dependent variable.
• The variables on which it depends are called independent variables.
• What we try to do with regression analysis is to model or quantify the relationship between these two kinds of variables, and hence predict one with the help of the other with a level of certainty.
• To solve our problem with a simple linear regression, we would collect the humidity level and precipitation level for the previous month and plot them.
REGRESSION ANALYSIS
• Regression analysis is a predictive modelling technique that analyzes the relation between the target or dependent variable and the independent variables in a dataset.
• The different types of regression analysis techniques are used when the target and independent variables show a linear or non-linear relationship with each other, and the target variable contains continuous values.
• The regression technique is used mainly to determine predictor strength, forecast trends, model time series, and examine cause-and-effect relations.
• Regression analysis is the primary technique for solving regression problems in machine learning using data modelling.
• It involves determining the best-fit line: a line that passes as close as possible to all the data points, chosen so that the distance of the line from each data point is minimized.

Original Source: https://www.upgrad.com/blog/types-of-regression-models-in-machine-learning/


Univariate vs Multivariate vs Time-series Regression

● Univariate: an input vector is fed to a regression model, which produces a single continuous output value.
● Multivariate: an input vector is fed to a regression model, which produces several continuous output values (continuous output value 1 … continuous output value n).
● Time-series: previous values (Xt-1, Xt) are fed to a regression model used as a prediction model, which produces the future value (Xt+1).
EVALUATING REGRESSION MODELS

ACCURACY IS NOT A METRIC FOR REGRESSION!

• A common question by beginners to regression predictive modeling projects is:

How do I calculate accuracy for my regression model?


• Accuracy (e.g. classification accuracy) is a measure for classification, not regression.
• We cannot calculate accuracy for a regression model.
• The skill or performance of a regression model must be reported as an error in those predictions.
• This makes sense if you think about it. If you are predicting a numeric value like a height or a
dollar amount, you don’t want to know if the model predicted the value exactly (this might be
intractably difficult in practice); instead, we want to know how close the predictions were to the
expected values.
• Error addresses exactly this and summarizes on average how close predictions were to their
expected values.
ERROR METRICS
• There are three error metrics that are commonly used for evaluating and
reporting the performance of a regression model; they are:
• Mean Squared Error (MSE).
• Root Mean Squared Error (RMSE).
• Mean Absolute Error (MAE)

• There are many other metrics for regression, although these are the most
commonly used. You can see the full list of regression metrics supported by
the scikit-learn Python machine learning library here:
• Scikit-Learn API: Regression Metrics.
Original Source: https://machinelearningmastery.com/regression-metrics-for-machine-learning/
1. MEAN SQUARED ERROR
• Mean Squared Error, or MSE for short, is a popular error metric for
regression problems.
• It is also an important loss function for algorithms fit or optimized using
the least squares framing of a regression problem. Here “least squares”
refers to minimizing the mean squared error between predictions and
expected values.
• The MSE is calculated as the mean or average of the squared differences
between predicted and expected target values in a dataset.
• The squaring also has the effect of inflating or magnifying large errors.
That is, the larger the difference between the predicted and expected
values, the larger the resulting squared positive error. This has the effect
of “punishing” models more for larger errors when MSE is used as a loss
function. It also has the effect of “punishing” models by inflating the
average error score when used as a metric.
• The mean squared error between your expected and predicted values
can be calculated using the mean_squared_error() function from the
scikit-learn library.
• The function takes a one-dimensional array or list of expected values and
predicted values and returns the mean squared error value.
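For instance, a minimal sketch (with made-up numbers) of computing the MSE with scikit-learn:

from sklearn.metrics import mean_squared_error

expected  = [3.0, -0.5, 2.0, 7.0]   # hypothetical true target values
predicted = [2.5,  0.0, 2.0, 8.0]   # hypothetical model predictions

# MSE is the mean of the squared differences between the two lists.
mse = mean_squared_error(expected, predicted)
print(mse)  # 0.375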
2. ROOT MEAN SQUARED ERROR
• The Root Mean Squared Error, or RMSE, is an extension of
the mean squared error.
• Importantly, the square root of the error is calculated,
which means that the units of the RMSE are the same as
the original units of the target value that is being predicted.
• As such, it may be common to use MSE loss to train a
regression predictive model, and to use RMSE to evaluate
and report its performance.
• MSE uses the square operation to remove the sign of each
error value and to punish large errors. The square root
reverses this operation, although it ensures that the result
remains positive.
• The root mean squared error between your expected and predicted values can be calculated by taking the square root of the value returned by the mean_squared_error() function from the scikit-learn library.
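Using the same made-up numbers as above, a minimal sketch of obtaining the RMSE from the MSE:

from math import sqrt
from sklearn.metrics import mean_squared_error

expected  = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5,  0.0, 2.0, 8.0]

# RMSE is the square root of the MSE, in the same units as the target.
rmse = sqrt(mean_squared_error(expected, predicted))
print(rmse)  # ~0.612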
3. MEAN ABSOLUTE ERROR
• Mean Absolute Error, or MAE, is a popular metric because, like RMSE,
the units of the error score match the units of the target value that is
being predicted.
• MSE and RMSE punish larger errors more than smaller errors, inflating
or magnifying the mean error score. This is due to the square of the
error value. The MAE does not give more or less weight to different
types of errors and instead the scores increase linearly with increases
in error.
• As its name suggests, the MAE score is calculated as the average of
the absolute error values. Absolute or abs() is a mathematical function
that simply makes a number positive. Therefore, the difference
between an expected and predicted value may be positive or negative
and is forced to be positive when calculating the MAE.
• The mean absolute error between your expected and predicted values
can be calculated using the mean_absolute_error() function from the
scikit-learn library.
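And a matching sketch (same made-up numbers) for the MAE:

from sklearn.metrics import mean_absolute_error

expected  = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5,  0.0, 2.0, 8.0]

# MAE is the mean of the absolute differences; errors grow the score linearly.
mae = mean_absolute_error(expected, predicted)
print(mae)  # 0.5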
REGRESSION ANALYSIS: COMMON ISSUES

[Figure: Four common issues in regression analysis: outliers, underfitting, overfitting, and heteroscedasticity.]
Outliers

● Outliers are values or data points that lie far away from the general population or distribution of the data.
● Outliers have the ability to skew the results of any ML model towards themselves.
● Therefore, it is necessary to detect them early on, or to use algorithms that are resistant to outliers.

Image Source: https://datascience.foundation/

Underfitting and Overfitting
● Bias: Assumptions made by a model to make a function easier to learn. It is, in effect, the error rate on the training data.
● Variance: The difference between the error rate on the training data and on the testing data is called variance.
● Underfitting: A statistical model or a machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data, i.e., it performs poorly even on the training data, and consequently on the testing data as well.
● Overfitting: A statistical model is said to be overfitted when it does not make accurate predictions on testing data. When a model is trained too much, it starts learning from the noise and inaccurate entries in our data set, so it fits the training data closely but fails to generalize. (A sketch contrasting the two follows below.)

Image Source: https://datascience.foundation/
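A small sketch on synthetic data (not from the slides) that makes the distinction visible: a degree-1 polynomial underfits a curved signal, while a degree-15 polynomial overfits it (low training error, high testing error):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (40, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)  # noisy curved signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_err = mean_squared_error(y_tr, model.predict(X_tr))
    test_err = mean_squared_error(y_te, model.predict(X_te))
    print(degree, train_err, test_err)  # large train/test gap = overfitting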
Heteroskedasticity

● Heteroskedasticity refers to situations where the variance of the residuals is unequal over a range of measured values.
● When running a regression analysis, heteroskedasticity results in an unequal scatter of the residuals (also known as the error term).
● If there is an unequal scatter of residuals, the population used in the regression contains unequal variance, and therefore the analysis results may be invalid.
● In our example, humidity predicts rainfall or precipitation. As humidity increases, the amount by which precipitation increases or decreases is variable, not fixed.

Image Source: https://datascience.foundation/
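As an illustrative sketch (on synthetic, deliberately heteroskedastic data), one way to spot the problem is to fit a regression and compare the spread of the residuals across the range of the predictor:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
humidity = rng.uniform(40, 100, 200).reshape(-1, 1)
# Synthetic rainfall whose noise grows with humidity (heteroskedastic).
rainfall = 0.3 * humidity.ravel() + rng.normal(0, 0.05 * humidity.ravel())

model = LinearRegression().fit(humidity, rainfall)
residuals = rainfall - model.predict(humidity)

# Unequal residual spread (a "fan" shape in a residual plot) at low vs
# high humidity suggests heteroskedasticity.
low = humidity.ravel() < 70
print(residuals[low].std(), residuals[~low].std())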
TYPES OF REGRESSION
LINEAR REGRESSION
● Linear regression is one of the most basic types of regression in machine learning.
● The linear regression model consists of a predictor (independent) variable and a dependent variable related linearly to each other.
● Linear regression with one predictor or independent variable is called Simple Linear Regression.
● In case the data involves more than one independent variable, linear regression is called Multiple Linear Regression.
● The equation below denotes the linear regression model:
● y = mx + c + e
● where m is the slope of the line, c is the intercept, and e represents the error in the model.
● The best-fit line is determined by varying the values of m and c.
● The prediction error is the difference between the observed value and the predicted value.
● The values of m and c are selected so as to give the minimum prediction error. It is important to note that a simple linear regression model is susceptible to outliers.
Source: https://medium.com/machine-learning-id/simple-linear-regression-teori-d4abebd1ade2
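A minimal sketch of fitting y = mx + c with scikit-learn, on hypothetical humidity/precipitation numbers:

import numpy as np
from sklearn.linear_model import LinearRegression

humidity = np.array([[60], [65], [70], [75], [80], [85]])  # one predictor
precip   = np.array([2.0, 2.6, 3.1, 3.9, 4.4, 5.2])        # dependent variable

model = LinearRegression().fit(humidity, precip)           # picks m and c
print("m (slope):", model.coef_[0])
print("c (intercept):", model.intercept_)
print("prediction at humidity 90:", model.predict([[90]])[0])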
Multiple Linear Regression
The equation for a multiple linear regression is shown below:

y = b0 + b1X1 + b2X2 + … + bnXn + e

n stands for the number of independent variables.

More variables are added as features increase!
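Extending the same sketch to multiple linear regression, one coefficient b1..bn is learned per independent variable (the columns below are hypothetical):

import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: humidity, wind speed, yesterday's rainfall (all made up).
X = np.array([[60, 10, 0.0],
              [70, 12, 1.5],
              [80,  8, 2.0],
              [85, 14, 3.5],
              [90,  9, 4.0]])
y = np.array([2.0, 3.2, 4.1, 5.0, 5.8])  # today's rainfall

model = LinearRegression().fit(X, y)
print("b1..bn:", model.coef_)      # one coefficient per variable
print("b0:", model.intercept_)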
LOGISTIC REGRESSION
• Logistic regression is one of the types of regression analysis techniques, and is used when the dependent variable is discrete.
• Example: 0 or 1, true or false, etc. This means the target variable can have only two values, and a sigmoid curve denotes the relation between the target variable and the independent variable.
• The logit function is used in logistic regression to measure the relationship between the target variable and the independent variables. Below is the equation that denotes logistic regression:
logit(p) = ln(p/(1-p)) = b0 + b1X1 + b2X2 + b3X3 + … + bkXk
where p is the probability of occurrence of the feature.
LOGISTIC REGRESSION (CONTD.)

● When logistic regression is applied to real-world problems, like detecting cancer in people, p here would tell the probability of whether the person has cancer or not.
● A p of less than 0.5 would be mapped to "no cancer", and anything greater would map to "cancer".
● Logistic regression is a linear method, but the predictions are transformed using the logistic function.
● Its curve follows the curve of the logistic (sigmoid) function.
Original Source: https://www.upgrad.com/blog/types-of-regression-models-in-machine-learning/
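A sketch of logistic regression on a made-up binary problem; predict_proba returns the probability p that the logit equation above models:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # hypothetical feature
y = np.array([0, 0, 0, 1, 1, 1])                          # discrete target

model = LogisticRegression().fit(X, y)
p = model.predict_proba([[3.5]])[0, 1]  # probability of class 1
print(p, "-> class", int(p >= 0.5))     # p >= 0.5 maps to class 1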
Sources:
1. https://www.statisticshowto.com/regularized-regression/
2. https://algoritmaonline.com/
3. https://www.statology.org/lasso-regression/
[Figure: Bull’s Eye Example, contrasting a simple linear regression fit with an overfitted model.]
Regularization to reduce Overfitting

● Regularized regression is a type of regression where the coefficient estimates are constrained (shrunk) towards zero. The magnitude (size) of the coefficients, as well as the magnitude of the error term, is penalized. Complex models are discouraged, primarily to avoid overfitting.
● There are two types of regression that are quite familiar and use this regularization technique, namely:
○ Ridge Regression
○ Lasso Regression
Ridge Regression

● Ridge regression is a variation of linear regression. We use ridge regression to tackle the multicollinearity problem, which inflates the variance of the coefficient estimates.
● So, to reduce this variance, a degree of bias is added to the regression estimates.
● It can be seen that the main idea of ridge regression is to add a little bias in order to reduce the variance of the estimator.
Ridge Regression

● It can be seen that the greater the value of λ (lambda), the more horizontal the regression line becomes, and so the coefficient values approach 0.
● If λ = 0, the output is similar to simple linear regression.
● If λ is very large, the coefficient values approach 0. (A sketch of this behaviour follows below.)
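A sketch of this behaviour on synthetic data (scikit-learn exposes λ as the alpha parameter of Ridge):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(0, 0.1, 50)

# Larger lambda (alpha) shrinks the coefficients towards 0;
# alpha = 0 behaves like plain linear regression.
for alpha in (0.0, 1.0, 100.0):
    print(alpha, Ridge(alpha=alpha).fit(X, y).coef_)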
Lasso Regression

● LASSO (Least Absolute Shrinkage and Selection Operator) is another variation of linear regression, like ridge regression. We use lasso regression when we have a large number of predictor variables.
● Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean.
● This type is very useful when you have high levels of multicollinearity, or when you want to automate certain parts of model selection, like variable selection/parameter elimination.
Difference between Ridge and Lasso Regression

● Lasso regression and ridge regression are both known as regularization methods because they both attempt to minimize the sum of squared residuals (RSS) along with some penalty term.
● In other words, they constrain or regularize the coefficient estimates of the model.
● The main difference is that ridge regression shrinks coefficients close to 0, so all predictor variables are retained, whereas LASSO can shrink a coefficient to exactly 0, and can therefore select and discard predictor variables by assigning them a coefficient of exactly 0.
● When we use ridge regression, the coefficients of each predictor are shrunken towards zero, but none of them can go completely to zero.
● Conversely, when we use lasso regression, it is possible for some of the coefficients to go completely to zero when λ gets sufficiently large. (A small comparison sketch follows below.)
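An illustrative sketch of that difference on synthetic data, where only the first two of five predictors actually matter (the alpha values are arbitrary):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, 100)

print("ridge:", Ridge(alpha=10).fit(X, y).coef_)   # all small but nonzero
print("lasso:", Lasso(alpha=0.5).fit(X, y).coef_)  # irrelevant ones exactly 0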
Which is better: Ridge or Lasso Regression?

● In cases where only a small number of predictor variables are significant, lasso regression tends to perform better, because it is able to shrink insignificant variables completely to zero and remove them from the model.
● However, when many predictor variables are significant in the model and their coefficients are roughly equal, ridge regression tends to perform better, because it keeps all of the predictors in the model.
● To determine which model is better at making predictions, we perform k-fold cross-validation: whichever model produces the lowest test mean squared error (MSE) is the preferred model to use, as sketched below.
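A sketch of that comparison with scikit-learn's cross_val_score (synthetic data, arbitrary alpha values):

import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, 100)

for name, model in [("ridge", Ridge(alpha=1.0)), ("lasso", Lasso(alpha=0.1))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(name, -scores.mean())  # mean test MSE over 5 folds; lower is better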
