1 - Multiple Regression
1 - Multiple Regression
“In principle, more analytic power can be achieved by varying multiple things at once in an
uncorrelated (random) way, and doing standard analysis, such as multiple linear regression. In
practice, though, A/B testing is widely used, because A/B tests are easy to deploy, easy to understand,
and easy to explain to management.”
— Christopher D. Manning
Multiple linear regression or also known as multiple regression is an extension of simple linear
regression.
It is used when we want to predict the value of a variable based on the value of two or more
other variables. The variable we want to predict is called the dependent variable (or sometimes,
the outcome, target or criterion variable).
Simple linear regression is a function that allows an analyst or statistician to make predictions
about one variable based on the information that is known about another variable.
Linear regression can only be used when one has two continuous variables—an independent
variable and a dependent variable.
The independent variable is the parameter that is used to calculate the dependent variable or
outcome. A multiple regression model extends to several explanatory variables.
Before we go further to the topic, here is a video that tells us about the multiple regression
https://www.youtube.com/watch?v=zITIFTsivN8&t=130s
Objective:
The objective of multiple regression analysis is to use the independent variables whose values
are known to predict the value of the single dependent (Y) value.
Terminologies:
R, is the measure of association between the observed value and the predicted value of the
Dependent variable.
R Square (or the coefficient determination) - the square of the measure of association which
indicates the percent of overlap between the predictor variables and the Dependent variable.
Adjusted R2 is an estimate of the R2 if you used this model with a new data set.
Independent variable (p – value) - is the parameter that is used to calculate the dependent
variable or outcome.
Dependent variable (criterion value) - in an equation, the variable whose value depends on one
or more variables in the equation.
observations:
yi=dependent variable
xi=Independent variables here we have “p” predictor variables and “p+1” as total regression
parameters.
β0= y-intercept (constant term)
βp= slope coefficients for each explanatory variable
ϵ= the model’s error term (also known as the residuals)
The multiple regression model is based on the following assumptions:
There is a linear relationship between the dependent variables and the independent variables
The independent variables are not too highly correlated with each other
yi observations are selected independently and randomly from the population
Residuals should be normally distributed with a mean of 0 and variance σ
Following is a list of 7 steps that could be used to perform multiple regression analysis
Check the relationship between each predictor variable and the response variable. This could
be done using scatterplots and correlations.
Check the relationship amoung the predictor variables. This could be done using scatterplots
and correlations. It is also termed as multi-collinearity test.
Try and analyze the simple linear regression between the predictor and response variable.
Use the non-redundant predictor variables in the analysis. This is based on checking the
multicollinearity between each of the predictor variables. If the correlation exists, one may
want to one of these variable.
t-statistics of one or more parameters: This is used to test the null hypothesis
whether the parameter’s value is equal to zero.
p-value: This is used to test the null hypothesis whether there exists a
relationship between the dependent and independent variable. Lesser the p-
value, greater is the statistical significance of the parameter. This could, in
turn, imply that there exists a relationship between the dependent and
independent variable
f-value: Tests how fit is the model
R2 (R squared) or adjusted R2: Tests the fitness of the regression model
Use the best fitting model to make prediction based on the predictor (independent variables).
This is done based on the statistical analysis of some of the above mentioned statistics such as
t-score, p-value, R squared, F-value etc.
Techniques used in Multiple Regression Analysis
Following are some of the key techniques that could be used for multiple regression analysis:
Scatterplots: Scatterplots could be used to visualize the relationship between two variables.
Correlation analysis (also includes multicollinearity test): Correlation tests could be used to
find out following:
o Whether the dependent and independent variables are related
o Whether the independent variables are related among each other. This is also termed
as multicollinearity.
whether two variables are correlated or not.
As an example, an analyst may want to know how the movement of the market affects the price of
ExxonMobil (XOM). In this case, their linear equation will have the value of the S&P 500 index as the
independent variable, and the price of XOM as the dependent variable.
In reality, multiple factors predict the outcome of an event. The price movement of ExxonMobil, for
example, depends on more than just the performance of the overall market. Other predictors such as
the price of oil, interest rates, and the price movement of oil futures can affect the price of XOM and
stock prices of other oil companies. To understand a relationship in which more than two variables are
present, multiple linear regression is used.
B1 = regression coefficient that measures a unit change in the dependent variable when xi1 changes -
the change in XOM price when interest rates change
B2 = coefficient value that measures a unit change in the dependent variable when xi2 changes—the
change in XOM price when oil prices change
In order to get the information needed, usually statistical software is use. In this case we will use the
Microsoft excell spreasheet.
X0M price = 1.5 + 0.18 Interest rate + 1.15 oil price – 0.4 value of S&P 500 index – 0.09 price of oil
futures
An analyst would interpret this output to mean if other variables are held constant, the price of XOM
will increase by 1.15% if the price of oil in the markets increases by n%(I just use random data points).
The model also shows that the price of XOM will increase by 1.15% following a n% rise in interest rates.
R2 indicates that 47% of the variations in the stock price of Exxon Mobil can be explained by changes in
the interest rate, oil price, oil futures, and S&P 500 index.
Additional:
https://stats.blue/Stats_Suite/multiple_linear_regression_calculator.html
The ability to determine the relative influence of one or more independent variables to the
depedent value.
For example. The real estate agent could find that the size of the homes and
the number of bedrooms have a strong correlation to the price of a home,
while the proximity to schools has no correlation at all, or even a negative
correlation if it is primarily a retirement community.
Just as with simple regression, multiple regression will not be good at explaining the
relationship of the independent variables to the dependent variables if those relationships
are not linear.
Any disadvantage of using a multiple regression model usually comes down to the data being
used. Using incomplete data risk and falsely concluding that a correlation is a causation.
When reviewing the price of homes, for example, suppose the real estate
agent looked at only 10 homes, seven of which were purchased by young
parents. In this case, the relationship between the proximity of schools may
lead her to believe that this had an effect on the sale price for all homes being
sold in the community. This illustrates the pitfalls of incomplete data. Had she
used a larger sample, she could have found that, out of 100 homes sold, only
ten percent of the home values were related to a school's proximity. If she had
used the buyers' ages as a predictor value, she could have found that younger
buyers were willing to pay more for homes in the community than older
buyers.
Impact on the industry\Application\ latest statistical report of multiple regression
It can be applicable while predicting the expected crop yield with the consideration of climate
factors such as a certain rainfall, temperature and fertilizer level, etc.
In order to find the connection between the GPA of a class of students and the number of
study-hours and their height. Here the dependent variable is GPA and the number of study-
hours and student’s heights is explanatory variables.
For determining the salary of a batch of executives in a company and the number of years of
experience and the age of executives, regression analysis can be used. Here, the dependent
variable for this regression is the salary of executives, and the experience and age of the
executives are independent variables.
It is highly used in anticipating trends and future values/events. For example, rain forecast in
coming days, or price of gold/silver in the coming months from the present time.
An example of identifying the relationship between the distance covered (dependent variable)
by the cab driver and the age of the driver and years of experience (independent variables)
https://www.youtube.com/watch?v=3EokKw3eg78&t=107s
Sources:
https://www.investopedia.com/terms/m/mlr.asp
https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php
https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/multiple-
regression/
https://us.sagepub.com/sites/default/files/upm-assets/78103_book_item_78103.pdf
https://sciencing.com/calculate-odds-ratio-contingency-table-8782587.html
https://www.analyticssteps.com/blogs/multiple-linear-regression
https://vitalflux.com/data-science-8-steps-to-multiple-regression-analysis/