DA Notes 3
Linear regression is a supervised learning algorithm that models the relationship between an input variable (X) and an output variable (Y) using labeled data. It's used for finding the relationship between the two variables and predicting future results based on past relationships.
For example, a data science student could build a model to predict the grades earned in a class based on the hours that individual students study. The student inputs a portion of a set of known results as training data, then trains the algorithm by refining its parameters until it delivers results that correspond to the known dataset. The result should be a linear regression equation that can predict future students' results based on the hours they study.
The equation creates a line, hence the term linear, that best fits the X and Y variables provided. The distance between a point on the graph and the regression line is known as the prediction error. The goal is to create a line for which the total prediction error is as small as possible.
You may also hear the term "logistic regression." It's another type of machine learning algorithm, used for binary classification problems on a dataset that's presented in a linear format. It is used when the dependent variable has two categorical options, which must be mutually exclusive. There are usually multiple independent variables, which makes it useful for analyzing complex questions with an "either-or" construction.
Completing a simple linear regression on a set of data results in a line on a plot representing the relationship
between the independent variable X and the dependent variable Y. The simple linear regression predicts the
value of the dependent variable based on the independent variable.
For example, compare the time of day and temperature. The temperature will increase as the sun rises and
decline during sunset. This can be depicted as a straight line on the graph showing how the variables relate over
time.
Linear regression is a type of supervised learning algorithm in which the data scientist trains the algorithm
using a set of training data with correct outputs. You continue to refine the algorithm until it returns results that
meet your expectations. The training data allows you to adjust the equation to return results that fit with the
known outcomes.
The goal of the linear equation is to end up with the line that best fits the data. That means the total prediction
error is as small as possible, depicted on the graph as the shortest distance between each data point and the
regression line.
The linear regression equation is the same as the slope formula you may have learned previously in algebra or
AP statistics.
To begin, determine if there is a relationship between the two variables. Look at the data in x-y format (i.e., two
columns of data: independent and dependent variables). Create a scatterplot with the data. Then you can judge
if the data roughly fits a line before you attempt the linear regression equation. The equation will help you find
the best-fitting line through the data points on the scatterplot.
In simple linear regression, the predictions of Y when plotted as a function of X form a straight line. If the data are not linear, the points will curve away from any straight line, and a linear fit will describe them poorly.
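As a minimal sketch of this exploratory step in R, assuming a small, made-up set of hours-studied and grade values (the numbers are purely illustrative), you could plot the data and judge by eye whether a straight line is a reasonable fit:

hours  <- c(1, 2, 3, 4, 5, 6, 7, 8)          # hours studied (hypothetical X)
grades <- c(52, 55, 61, 64, 70, 74, 79, 85)  # grades earned (hypothetical Y)
plot(hours, grades,
     xlab = "Hours studied", ylab = "Grade",
     main = "Grades vs. hours studied")
# If the points fall roughly along a straight line, a linear regression is worth fitting.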
The basic formula for a regression line is Y’ = bX + A, where Y’ is the predicted score, b is the slope of the line,
and A is the Y-intercept.
The coefficient of determination, or R-squared value, will guide you in determining how well the model fits. The R-squared value ranges from 0 to 1.0, denoting no explained variation at the low end (0) and fully explained variation at the high end (1.0). In simple linear regression, R-squared is the square of the correlation coefficient between X and Y.
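As a hedged sketch in R, using the same hypothetical hours/grades values as the previous sketch (redefined here so the snippet stands alone), the slope, intercept, and R-squared can be computed from sample statistics and checked against R's built-in lm() function:

hours  <- c(1, 2, 3, 4, 5, 6, 7, 8)          # hypothetical X values
grades <- c(52, 55, 61, 64, 70, 74, 79, 85)  # hypothetical Y values
r <- cor(hours, grades)                       # correlation coefficient
b <- r * sd(grades) / sd(hours)               # slope of the regression line
A <- mean(grades) - b * mean(hours)           # Y-intercept
r_squared <- r^2                              # coefficient of determination
fit <- lm(grades ~ hours)                     # the same model via built-in OLS
coef(fit)                                     # intercept and slope; should match A and b
summary(fit)$r.squared                        # should match r_squared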
In a statistics class, you may learn how to calculate linear regressions by hand. In the professional world, linear regression is typically done using software. One of the most common tools is Microsoft Excel. Stephanie Glen offers basic steps for finding the regression equation and R-squared value in a simple regression analysis in Excel 2013.
Excel also offers statistical calculations for linear regression analyses via its free Analysis Toolpak. A tutorial on how to enable that in Excel can be found on the Microsoft 365 support page.
Linear regression is a useful tool for determining which variables have an impact on factors of interest to an
organization.
For a real-world example, let's look at a dataset of high school and college GPAs for a set of 105 computer science majors from the Online Stat Book. We can start with the assumption that higher high school GPA scores would correlate with higher university GPA performance. With a linear regression equation, we could predict students' university GPA based on their high school results.
As we suspected, the scatterplot created with the data shows a strong positive relationship between the two
scores. The R-squared value is 0.78, a strong indicator of correlation. This relationship is confirmed visually on
the chart as the data points for university GPAs are clustered tightly to the linear regression line based on high
school GPA.
Using the linear equation derived from this dataset, we can predict a student with a high school GPA of 3 would
have a university GPA of 3.12.
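The Online Stat Book dataset itself is not reproduced here, so the sketch below uses hypothetical stand-in GPA values purely to illustrate how such a prediction is made in R:

hs_gpa   <- c(2.1, 2.5, 2.8, 3.0, 3.2, 3.5, 3.7, 3.9)  # hypothetical high school GPAs
univ_gpa <- c(2.0, 2.6, 2.7, 3.1, 3.1, 3.4, 3.6, 3.8)  # hypothetical university GPAs
fit <- lm(univ_gpa ~ hs_gpa)                            # fit the simple linear regression
summary(fit)$r.squared                                  # strength of fit (0.78 in the text's dataset)
predict(fit, newdata = data.frame(hs_gpa = 3))          # predicted university GPA for a 3.0 HS GPA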
In the world of business, managers use regression analysis of past performance to predict future events. For
example, a company’s sales manager believes they sell more products when it rains. Managers gather sales
data and rainfall numbers for the past three years. The y-axis is the number of sales or the dependent variable.
The x-axis is the total rainfall. The data does show that sales increase when it rains. The regression line
represents the relationship between sales data and rainfall data. The data shows that for every inch of rain, the
company has experienced five additional sales. However, further analysis is likely necessary to determine the
actual factors that increase sales for the company with a high degree of certainty.
It is important for data scientists who use linear regression to understand some of the underlying assumptions
of the method. Otherwise, they may draw incorrect conclusions and create faulty predictions that don’t reflect
real-world performance.
Four principal assumptions about the data justify the use of linear regression models for prediction or inference of the outcome:
1. Linearity and additivity: The expected value of the dependent variable is a straight-line function of the
independent variable. The effects of different independent variables are additive to the expected value of
the dependent variable.
2. Statistical independence: There is no correlation between consecutive errors when using time series
data. The observations are independent of each other.
3. Homoscedasticity: The errors have constant variance over time, across the predicted values, and across any independent variable.
4. Normality: For a fixed value of X, Y values are distributed normally.
Data scientists use these assumptions to evaluate models and determine if any data observations will cause
problems with the analysis. If the data violate any of these assumptions, then forecasts rendered from the model may be biased, misleading or, at the very least, inefficient.
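A quick, informal way to eyeball these assumptions in R is with the diagnostic plots of a fitted model; the data below are hypothetical and only serve to make the sketch self-contained:

x <- c(1, 2, 3, 4, 5, 6, 7, 8)                      # hypothetical independent variable
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)  # hypothetical dependent variable
fit <- lm(y ~ x)
plot(fit, which = 1)  # residuals vs. fitted: look for no pattern (linearity/additivity)
plot(fit, which = 2)  # normal Q-Q plot: residuals near the line suggest normality
plot(fit, which = 3)  # scale-location: a roughly flat spread suggests homoscedasticity
# Statistical independence is usually judged from how the data were collected
# (e.g. checking for autocorrelation in time series), not from these plots alone.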
Linear regression models are simple to understand and useful for smaller datasets that aren’t overly complex.
For small datasets, they can be calculated by hand.
Simple linear regression is useful for finding a relationship between two continuous variables. The formula
reveals a statistical relationship but not a deterministic relationship. In other words, it can express correlation
but not causation. It shows how closely the two values are linked but not if one variable caused the other. For
example, there's a high correlation between hours studied and grades on a test, but the regression can't explain why students study a given number of hours or why a certain outcome occurs.
Linear regression models also have some disadvantages. They don’t work efficiently with complicated datasets
and are difficult to design for nonlinear data. That’s why data scientists recommend starting with exploratory
data analysis to examine the data for linear distribution. If there is not an apparent linear distribution in the
chart, other methods should be used.
So far, we have focused on becoming familiar with simple linear regression. A simple linear regression relies
on a single input variable and its relationship with an output variable. However, a more accurate model might
consider multiple inputs rather than one.
Take the GPA example from above. To determine a college student’s GPA, the student’s high school GPA was
used as the sole input variable. What if we considered using the number of credits a student takes as another
input? Or their age? Or financial assistance?
A combination of multiple inputs like this would lend itself to a multiple linear regression model. The multiple or multivariable linear regression algorithm determines the relationship between multiple input variables and an output variable.
Multiple linear regression is subject to assumptions similar to those of simple linear regression, as well as additional ones, such as the absence of severe multicollinearity among the input variables.
Simple linear regression is used to find the best relationship between a single input variable (predictor, independent variable, input feature, input parameter) and an output variable (predicted, dependent variable, output feature, output parameter), provided that both variables are continuous in nature. This relationship describes how the input variable is related to the output variable and is represented by a straight line.
To understand this concept, let us have a look at scatter plots. A scatter diagram or plot provides a graphical representation of the relationship between two continuous variables.
After looking at a scatter plot we can understand:
1. The direction
2. The strength
3. The linearity
of the relationship between variable Y and variable X. Suppose the scatter plot shows that variable Y and variable X possess a strong positive linear relationship. We can then project a straight line that describes the data as accurately as possible.
If the relationship between variable X and variable Y is strong and linear, then we conclude that the independent variable X is an effective input variable for predicting the dependent variable Y.
To check the correlation between variable X and variable Y, we use the correlation coefficient (r), which gives a numerical value for the correlation between the two variables. The correlation between two variables can be strong, moderate, or weak. The higher the absolute value of r, the higher the preference given to a particular input variable X for predicting the output variable Y. A few properties of r are listed as follows:
1. Range of r: -1 to +1
2. Perfect positive relationship: +1
3. Perfect negative relationship: -1
4. No Linear relationship: 0
5. Strong correlation: r > 0.85 (depends on business scenario)
The command used to calculate r in RStudio is:
> cor(X, Y)
where X is the independent variable and Y is the dependent variable. If the result of this command is greater than 0.85, then proceed with simple linear regression.
If r < 0.85, then use a transformation of the data to increase the value of r, and then build a simple linear regression model on the transformed data.
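As a hedged sketch of that workflow in R, with made-up values and a log transform chosen purely as an example of a possible transformation:

X <- c(1, 2, 4, 8, 16, 32, 64, 128)             # hypothetical input values
Y <- c(3.1, 4.0, 5.2, 5.9, 7.1, 8.0, 8.8, 10.1) # hypothetical output values
cor(X, Y)               # r on the raw data
cor(log(X), Y)          # r after a log transform of X; keep whichever form is more linear
fit <- lm(Y ~ log(X))   # simple linear regression on the transformed data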
The regression equation, Y' = bX + A as introduced earlier, represents how an independent variable X is related to a dependent variable Y.
Example:
Let us understand simple linear regression by considering an example. Suppose we want to predict weight gain based only on the calories consumed, using the given data.
Now suppose we want to predict the weight gain when 2,500 calories are consumed. First, we visualize the data by drawing a scatter plot to confirm that calories consumed is a suitable independent variable X for predicting the dependent variable Y.
Since r = 0.9910422, which is greater than 0.85, we take calories consumed as the independent variable (X) and weight gain (Y) as the dependent variable to be predicted.
Now, imagine a straight line drawn so that it is as close as possible to every data point in the scatter diagram.
To predict the weight gain for a consumption of 2,500 calories, simply extend the straight line to the point where the x-axis value is 2,500 and read off the corresponding value on the y-axis. This projected y-axis value gives you the approximate weight gain. This straight line is the regression line.
So, the weight gain predicted by our simple linear regression model after consuming 2,500 calories is 4.49 kg.
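The calorie/weight-gain table from this example is not reproduced here, so the vectors in the sketch below are hypothetical stand-ins that only illustrate the mechanics of fitting the line and making the prediction in R:

calories    <- c(1500, 1800, 2000, 2200, 2400, 2600, 2800, 3000)  # hypothetical calories consumed
weight_gain <- c(1.0, 1.8, 2.5, 3.1, 3.8, 4.6, 5.2, 6.0)          # hypothetical weight gain in kg
cor(calories, weight_gain)                           # should exceed 0.85 before proceeding
fit <- lm(weight_gain ~ calories)                    # fit the regression line
predict(fit, newdata = data.frame(calories = 2500))  # predicted weight gain at 2,500 calories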
--------------------------------------------------------------------------------
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses
several explanatory variables to predict the outcome of a response variable. The goal of multiple linear
regression is to model the linear relationship between the explanatory (independent) variables and response
(dependent) variables. In essence, multiple regression is the extension of ordinary least-squares (OLS)
regression because it involves more than one explanatory variable.
● Multiple regression is an extension of linear (OLS) regression, which uses just one explanatory variable.
● MLR is used extensively in econometrics and financial inference.
The MLR model takes the form:
y_i = β0 + β1·x_i1 + β2·x_i2 + ... + βp·x_ip + ϵ
where, for i = 1, ..., n observations:
y_i = dependent variable
x_i1, ..., x_ip = explanatory variables
β0 = y-intercept (constant term)
β1, ..., βp = slope coefficients for each explanatory variable
ϵ = the model's error term (residual)
Simple linear regression is a technique that allows an analyst or statistician to make predictions about one variable based on the information that is known about another variable. Linear regression can only be used
when one has two continuous variables—an independent variable and a dependent variable. The independent
variable is the parameter that is used to calculate the dependent variable or outcome. A multiple regression
model extends to several explanatory variables.
A multiple linear regression model is based on the following assumptions:
● There is a linear relationship between the dependent variable and the independent variables
● The independent variables are not too highly correlated with each other (a quick check is sketched below)
● yi observations are selected independently and randomly from the population
● Residuals should be normally distributed with a mean of 0 and a constant variance σ²
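As a minimal sketch, assuming three hypothetical predictors, one way to check the "not too highly correlated" assumption in R is to inspect the pairwise correlation matrix of the predictors:

set.seed(1)
x1 <- rnorm(50)                          # hypothetical predictor
x2 <- rnorm(50)                          # hypothetical predictor
x3 <- 0.9 * x1 + 0.1 * rnorm(50)         # deliberately collinear with x1
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(50)
cor(cbind(x1, x2, x3))                   # large off-diagonal values flag collinearity
fit <- lm(y ~ x1 + x2 + x3)
# Variance inflation factors (for example via car::vif(fit)) are another common check.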
The coefficient of determination (R-squared) is a statistical metric that is used to measure how much of the
variation in outcome can be explained by the variation in the independent variables. R2 always increases as
more predictors are added to the MLR model, even though the predictors may not be related to the outcome
variable.
Thus, R2 by itself can't be used to identify which predictors should be included in a model and which should be
excluded. R2 can only be between 0 and 1, where 0 indicates that the outcome cannot be predicted by any of the
independent variables and 1 indicates that the outcome can be predicted without error from the independent
variables.
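A small simulated illustration of this point in R (the data are simulated, so the exact numbers are arbitrary): adding a predictor that is pure noise still nudges R-squared upward, which is why adjusted R-squared is often reported alongside it:

set.seed(42)
x1    <- rnorm(100)                         # a predictor that genuinely drives y
noise <- rnorm(100)                         # a predictor unrelated to y
y     <- 3 + 2 * x1 + rnorm(100)
summary(lm(y ~ x1))$r.squared               # baseline R-squared
summary(lm(y ~ x1 + noise))$r.squared       # never lower, even though noise is irrelevant
summary(lm(y ~ x1 + noise))$adj.r.squared   # adjusted R-squared penalizes the extra term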
When interpreting the results of multiple regression, beta coefficients are valid while holding all other variables
constant ("all else equal"). The output from a multiple regression can be displayed horizontally as an equation,
or vertically in table form.
As an example, an analyst may want to know how the movement of the market affects the price of ExxonMobil
(XOM). In this case, their linear equation will have the value of the S&P 500 index as the independent
variable, or predictor, and the price of XOM as the dependent variable.
In reality, multiple factors predict the outcome of an event. The price movement of ExxonMobil, for example,
depends on more than just the performance of the overall market. Other predictors such as the price of oil,
interest rates, and the price movement of oil futures can affect the price of XOM and stock prices of other oil
companies. To understand a relationship in which more than two variables are present, multiple linear
regression is used.
Multiple linear regression (MLR) is used to determine a mathematical relationship among several random variables. In other terms, MLR examines how multiple independent variables are related to one dependent
variable. Once each of the independent factors has been determined to predict the dependent variable, the
information on the multiple variables can be used to create an accurate prediction on the level of effect they have
on the outcome variable. The model creates a relationship in the form of a straight line (linear) that best
approximates all the individual data points.
The least-squares estimates B0, B1, B2, ..., Bp are usually computed by statistical software. Any number of variables can be included in the regression model, with each independent variable distinguished by a number: 1, 2, 3, 4, ..., p. The multiple regression model allows an analyst to predict an outcome based on information provided on multiple explanatory variables.
Still, the model is not always perfectly accurate as each data point can differ slightly from the outcome
predicted by the model. The residual value, E, which is the difference between the actual outcome and the
predicted outcome, is included in the model to account for such slight variations.
Assume we run our XOM price regression model through statistical software, which returns the regression output (coefficient estimates and R2) interpreted below.
An analyst would interpret this output to mean if other variables are held constant, the price of XOM will
increase by 7.8% if the price of oil in the markets increases by 1%. The model also shows that the price of XOM
will decrease by 1.5% following a 1% rise in interest rates. R2 indicates that 86.5% of the variations in the stock
price of Exxon Mobil can be explained by changes in the interest rate, oil price, oil futures, and S&P 500 index.
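The actual regression output referred to above is not reproduced here. The sketch below, on simulated placeholder data, only shows how such a model could be specified and read in R; the variable names and coefficient sizes mirror the example rather than any real market data:

set.seed(7)
n <- 120
oil_price      <- rnorm(n)                  # simulated placeholder predictors
interest_rates <- rnorm(n)
oil_futures    <- rnorm(n)
sp500          <- rnorm(n)
xom <- 0.078 * oil_price - 0.015 * interest_rates +
  0.02 * oil_futures + 0.05 * sp500 + rnorm(n, sd = 0.05)  # simulated response
fit <- lm(xom ~ oil_price + interest_rates + oil_futures + sp500)
summary(fit)              # coefficient table: one beta per explanatory variable
summary(fit)$r.squared    # share of variation in xom explained by the predictors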
Ordinary least squares (OLS) regression compares the response of a dependent variable given a change in
some explanatory variables. However, a dependent variable is rarely explained by only one variable. In this
case, an analyst uses multiple regression, which attempts to explain a dependent variable using more than one
independent variable. Multiple regressions can be linear and nonlinear.
Multiple regressions are based on the assumption that there is a linear relationship between both the dependent
and independent variables. It also assumes no major correlation between the independent variables.
A multiple regression considers the effect of more than one explanatory variable on some outcome of interest. It
evaluates the relative effect of these explanatory, or independent, variables on the dependent variable when
holding all the other variables in the model constant.
Why Would One Use a Multiple Regression Over a Simple OLS Regression?
A dependent variable is rarely explained by only one variable. In such cases, an analyst uses multiple
regression, which attempts to explain a dependent variable using more than one independent variable. The
model, however, assumes that there are no major correlations between the independent variables.
Multiple regressions are generally not calculated by hand, as the models are complex and become even more so when there are more variables included in the model or when the amount of data to analyze grows. To run a multiple regression, you will likely need to use specialized statistical software or functions within programs like Excel.
In multiple linear regression, the model calculates the line of best fit that minimizes the sum of the squared differences between the observed and predicted values of the dependent variable. Because it fits a line, it is a linear model. There are also nonlinear regression models involving multiple variables, such as logistic regression, quadratic regression, and probit models.
Any econometric model that looks at more than one variable may be a multiple regression model. Factor models compare two or more factors to analyze relationships between variables and the resulting performance. The Fama and French Three-Factor Model is one such model: it expands on the capital asset pricing model (CAPM) by adding size risk and value risk factors to the market risk factor in CAPM (which is itself a regression model). By including these two additional factors, the model adjusts for the observed tendency of small-cap and value stocks to outperform, which is thought to make it a better tool for evaluating manager performance.