Statistics
Lecture 10
regression analysis
Regression - helps to determine the functional relationship between 2 or more variables
- is usually conducted:
• when we want to know whether any relationship between variables actually exists
• when we want to understand the nature of the relationship between variables (i.e. strength, direction)
• when we want to predict a variable given the values of others

Common names for the two variables:

X            Y
Independent  Dependent
Predictor    Response
Explanatory  Explained
Exogenous    Endogenous
Regressor    Regressand

E.g. height depends on weight.

What is the difference between correlation and regression?
Regression vs. correlation:
• regression is able to assess the significance of the independent variable in explaining the variability or behaviour of the dependent variable; correlation is not
• regression is able to predict and optimize a variable given the values of others; correlation is not
• regression is able to show cause and effect; correlation shows only a relationship
• in regression the data are represented by a line; in correlation by a single point (one summary number)
regression analysis
Linear regression models
• the most important and most widely used class of regression models in applications
• attempt to explain the relationship between two or more variables using a straight line
• can often lead to very useful results
• even if the relationship between the dependent variable and one or more of the independent variables is nonlinear and described by a nonlinear model, we can change it to a linear model by using an appropriate transformation.
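For example (a standard textbook case, not specific to these slides): the exponential model Y = α·e^(βX) is nonlinear in X, but taking logarithms of both sides gives ln Y = ln α + βX, a linear model with intercept ln α and slope β, so linear regression can be run on (X, ln Y).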
If we have only one independent variable, the model is called "simple" - e.g. dependent: sales; independent: advertising.
If we have many independent variables, the model is called "multiple" - e.g. dependent: sales; independent: advertising, size of the shop, number of shops, number of customers.
Looking for the best solution (model) we usually start from many independent variables
- from multiple linear regression...
...but the multiple linear regression approach is based on the simple linear regression approach...
...so we limit our attention to this class of regression models: simple linear regression.
regression analysis
y = a + bx   (algebra)
Simple linear regression - is a statistical model used to study the relationship
between 𝒚 and 𝒙 if they are related LINEARLY.
The population simple linear regression model
- the type I regression model (for the population):

Y = α₀ + α₁X + ε   (1)

E.g. data on advertising (X) and sales (Y).

where:
Y - the dependent variable, the variable we wish to explain or predict
X - the independent variable
ε - the error term, the only random component in the model
α₀ - the population intercept of the line
α₁ - the population slope of the line
The line α₀ + α₁X is the nonrandom component of the model.

[Figure: scatterplot of advertising (x) vs sales (y) with the population regression line Y = α₀ + α₁X; α₀ is the intercept, α₁ the slope.]

Regression
• a parametric approach
'Parametric' means we estimate the parameters of the regression equation, such as the slope and intercept, and we have to make assumptions about the data for the purpose of analysis.
• is restrictive in nature - NO good results with datasets which do not fulfill the regression assumptions (check with a scatterplot and statistical tests)
Aczel A. D., Sounderpandian J., (2017, 2005), Statystyka w zarządzaniu (Complete Business Statistics)
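As a minimal sketch of what model (1) says (Python with numpy; all numbers here are invented for illustration), the data are a fixed nonrandom line plus random noise:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha0, alpha1 = 2.0, 0.5          # assumed population intercept and slope
X = rng.uniform(0, 100, size=50)   # independent variable, e.g. advertising
eps = rng.normal(0, 3, size=50)    # error term - the only random component

# model (1): nonrandom line alpha0 + alpha1*X plus the random error
Y = alpha0 + alpha1 * X + eps
```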
regression analysis
Model assumptions
1. The true relationship between the dependent variable and the independent variable(s) is linear -
the change in the dependent variable due to a one-unit change in an independent variable is constant, regardless of the value of that independent variable.
(If we fit a linear model to a nonlinear data set, the regression cannot capture all the structure in the data - this leads to an inefficient model and wrong predictions.)
How to check: we can look at residuals vs fitted values plots.
[Figure: a scatterplot showing the perfect setup for a linear regression.]
https://itfeature.com/time-series-analysis-and-forecasting/autocorrelation/residuals-plot-for-detection-of-autocorrelation
https://www.andrew.cmu.edu/user/achoulde/94842/homework/regression_diagnostics.html https://data.library.virginia.edu/diagnostic-plots/ https://data.library.virginia.edu/normality-assumption/
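A minimal sketch of this check (Python with statsmodels and matplotlib; the data set is invented): fit the line and plot residuals against fitted values.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3 + 2 * x + rng.normal(0, 1, 100)      # hypothetical linear data

fit = sm.OLS(y, sm.add_constant(x)).fit()  # estimate y = a0 + a1*x

# A shapeless cloud around 0 supports linearity;
# a curved pattern suggests the true relationship is nonlinear.
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```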
regression analysis
Model assumptions
2. The observations are independent of one another → NO autocorrelation = no correlation between the residual (error) terms.
This is mainly an issue with time series data, i.e. data with a natural time ordering (the next instant depends on the previous instant).
Autocorrelation - observations are sequentially correlated.
Autocorrelation strongly reduces the model's accuracy.
How to check: we can use the Durbin-Watson (DW) test, and we can look at residuals vs time plots.
[Figure: five residuals vs time plots; four show some pattern (autocorrelation present), one shows no pattern (no autocorrelation). ONLY a random pattern of residuals indicates the absence of autocorrelation.]
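A minimal sketch of the Durbin-Watson check (Python with statsmodels; the series is simulated with AR(1) errors). DW is close to 2 when there is no autocorrelation; values toward 0 indicate positive, toward 4 negative, autocorrelation.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
t = np.arange(100.0)

e = np.zeros(100)
for i in range(1, 100):
    e[i] = 0.8 * e[i - 1] + rng.normal()   # sequentially correlated errors

y = 5 + 0.3 * t + e
resid = sm.OLS(y, sm.add_constant(t)).fit().resid
print(durbin_watson(resid))   # well below 2 here -> positive autocorrelation
```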
regression analysis
Model assumptions
3. The independent variables should not be correlated → NO multicollinearity.
In a model with correlated variables it becomes difficult to find out which variable actually contributes to predicting the dependent variable.
How to check: we can use a scatterplot matrix, the variance inflation factor (VIF), or a correlation table.
[Figure: Scatter Plot Matrix. The variables are: Age (years), Weight (kg), Oxygen intake rate (ml per kg body weight per min.), RunTime - time to run 1.5 miles (minutes).]
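A minimal sketch of the VIF check (Python with statsmodels; the predictors are invented, with weight deliberately built from age so the collinearity shows up). A common rule of thumb treats VIF values above roughly 5-10 as problematic.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
age = rng.uniform(20, 60, 50)
weight = 0.5 * age + rng.normal(0, 2, 50)   # correlated with age on purpose
runtime = rng.uniform(8, 14, 50)

X = sm.add_constant(np.column_stack([age, weight, runtime]))
# VIF for each predictor (column 0 is the constant, so start at 1)
for i, name in enumerate(["age", "weight", "runtime"], start=1):
    print(name, variance_inflation_factor(X, i))   # high for age and weight
```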
regression analysis
Model assumptions
4. The error terms must have constant variance - homoskedasticity → NO heteroskedasticity (non-constant variance, which often arises in the presence of outliers).
How to check: we can look at the residuals vs fitted values plot, and we can use the Breusch-Pagan (Cook-Weisberg) test or White's general test.
If heteroskedasticity exists, the residuals vs fitted values plot shows a funnel shape pattern.
[Figure: residuals vs fitted values plot with a funnel shape - non-constant variance arising in the presence of outliers.]
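A minimal sketch of the Breusch-Pagan test (Python with statsmodels; data invented so that the error spread grows with x). A small p-value rejects homoskedasticity.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, 200)
y = 2 + 0.5 * x + rng.normal(0, 0.3 * x)   # error variance grows with x

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, X)
print(lm_pvalue)   # small -> evidence of heteroskedasticity (funnel shape)
```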
regression analysis
Model assumptions
5. The error terms must be normally distributed: ε ~ N(0, σ²)
How to check: we can look at a Q-Q plot, and we can use the Kolmogorov-Smirnov test or the Shapiro-Wilk test.
Normal Q-Q Plot (quantile-quantile)
A Q-Q plot helps validate the assumption of a normal distribution in a data set:
- If the data come from a normal distribution, the plot shows a fairly straight line.
- Absence of normality in the errors shows up as deviations from the straight line.
https://tobeneo.files.wordpress.com/2013/12/plot.jpg
https://www.analyticsvidhya.com/blog/2016/07/deeper-regression-analysis-assumptions-plots-solutions/
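A minimal sketch of these normality checks (Python with scipy and statsmodels; here the "residuals" are just simulated normal draws):

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
resid = rng.normal(0, 1, 100)   # stand-in for regression residuals

# Shapiro-Wilk: a small p-value rejects normality
print(stats.shapiro(resid))

# Kolmogorov-Smirnov against the standard normal
print(stats.kstest(resid, "norm"))

# Q-Q plot: points near the reference line support normality
sm.qqplot(resid, line="45")
plt.show()
```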
regression analysis
Regression Plots

Scatter plot - Residuals vs Fitted Values (Predicted Values)

We can check linearity, autocorrelation, multicollinearity, homoskedasticity:
• is there any pattern in this plot?
• is there any funnel shape (non-constant variance, which arises in the presence of outliers)?
• If the residuals show problems with heteroscedasticity or non-normality, we could try transforming the raw data.
• If we don't meet the linearity assumption, we can check whether we can do logistic regression instead.
Even if the assumptions are not met, we could still use the model to draw conclusions - but
about the sample, NOT the POPULATION.
To generalize the model to the POPULATION, the assumptions must be met.
regression analysis
Simple linear regression
The population simple linear regression model
- the type I regression model (for the population):

Y = α₀ + α₁X + ε

The estimated regression equation
- the type II regression model:

y = a₀ + a₁x + e   (1)

For each particular data point:

yᵢ = a₀ + a₁xᵢ + eᵢ   (2),   where i = 1, 2, …, n

where:
a₀ estimates α₀
a₁ estimates α₁
e - the observed errors, the residuals from fitting the line a₀ + a₁x to the data set of n points
eᵢ = yᵢ − ŷᵢ; e₁ = y₁ − ŷ₁ is the first residual, the distance from the 1st data point to the fitted regression line; eₙ is the n-th error

The errors eᵢ are viewed as estimates of the true population errors εᵢ.

The regression line:

ŷ = a₀ + a₁x   (3)

where:
ŷ - the predicted value of y, i.e. the y value lying on the fitted regression line for a given x

[Figure: scatterplot of advertising (x) vs sales (y) with the fitted regression line ŷ = a₀ + a₁x.]
https://www.jmp.com/en_us/statistics-knowledge-portal/what-is-regression/the-method-of-least-squares.html
Aczel A. D., Sounderpandian J., (2017), Statystyka w zarządzaniu (Complete Business Statistics)
regression analysis
Estimates of the unknown population parameters α₀ and α₁ are obtained by the method of least squares.

E.g. data on advertising and sales:

[Figure: scatterplot of the raw data with several different lines passing through the dataset; some lines give very large errors; the least-squares line ŷ = a₀ + a₁x is the one that minimizes the sum of the squared errors (SSE).]

Some errors are positive and others are negative, so if we want to minimize all the errors, we should minimize the sum of the squared errors (SSE).
regression analysis
Simple linear regression - we have to find the least-squares line: the line that minimizes SSE, the sum of the squared errors.

The regression function of y with respect to x in the random sample:
the estimated regression equation: y = a₀ + a₁x + e
the regression line: ŷ = a₀ + a₁x

Least-squares regression estimators:

the slope:   a₁ = SS_xy / SS_x = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²   (1)   (sums over i = 1, …, n; the formula for detailed data)

equivalently:   a₁ = r · S_y / S_x   (2),   where r is Pearson's correlation coefficient

the intercept:   a₀ = ȳ − a₁x̄

a₁ indicates the average change of the y variable when the x value increases by 1.
a₀ informs what would be the hypothetical value of y when x = 0
(that is only the mathematical interpretation and often it doesn't make any economic sense).

! r and a₁ always have the same sign
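A minimal sketch of formulas (1) and (2) in code (Python with numpy; x and y here are just the first six rows of the worked example on the next slide):

```python
import numpy as np

def least_squares(x, y):
    """Slope by formula (1), intercept a0 = ybar - a1*xbar."""
    xbar, ybar = x.mean(), y.mean()
    a1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)  # (1)
    a0 = ybar - a1 * xbar
    return a0, a1

x = np.array([0.0, 20, 40, 60, 80, 100])
y = np.array([18.0, 20.3, 20.5, 20.4, 21.2, 21.7])
a0, a1 = least_squares(x, y)

# formula (2) gives the same slope: a1 = r * Sy / Sx
r = np.corrcoef(x, y)[0, 1]
assert np.isclose(a1, r * y.std(ddof=1) / x.std(ddof=1))
print(a0, a1)
```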
regression analysis
Simple linear regression - Excel LINEST function
E.g. data on sales and cost of sales. We want to estimate the regression equation (parameters a₁, a₀).
x = sales ($), y = cost ($)

x      y       xᵢ−x̄    yᵢ−ȳ    (xᵢ−x̄)(yᵢ−ȳ)   (xᵢ−x̄)²   (yᵢ−ȳ)²
0      18.0    −90     −2.91       261.9        8100      8.4681
20     20.3    −70     −0.61        42.7        4900      0.3721
40     20.5    −50     −0.41        20.5        2500      0.1681
60     20.4    −30     −0.51        15.3         900      0.2601
80     21.2    −10      0.29        −2.9         100      0.0841
100    21.7     10      0.79         7.9         100      0.6241
120    21.3     30      0.39        11.7         900      0.1521
140    21.6     50      0.69        34.5        2500      0.4761
160    22.2     70      1.29        90.3        4900      1.6641
180    21.9     90      0.99        89.1        8100      0.9801
Σ 900  209.1                       571.0       33000     13.249

x̄ = 900/10 = 90
ȳ = 209.1/10 = 20.91
a₁ = SS_xy / SS_x = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = 571 / 33000 ≈ 0.0173

a₀ = ȳ − a₁x̄ = 20.91 − 0.0173 · 90 ≈ 19.35

ŷ = 19.35 + 0.0173x
The average increase in cost (y) when sales (x) increase by 1 ($) equals about 0.017 ($).
The hypothetical value of cost (y) when sales (x) = 0 equals about 19.35 ($).
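The result is easy to verify (a sketch in Python with scipy; Excel's LINEST(y_range, x_range) returns the same slope and intercept):

```python
from scipy.stats import linregress

x = [0, 20, 40, 60, 80, 100, 120, 140, 160, 180]                  # sales
y = [18.0, 20.3, 20.5, 20.4, 21.2, 21.7, 21.3, 21.6, 22.2, 21.9]  # cost

res = linregress(x, y)
print(res.slope, res.intercept)   # ~0.0173 and ~19.35
```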