Regression Analysis
Introduction
• Statistics is the science of data.
• Data are numerical values.
• Data carry a great deal of information.
• A popular objective in the applied sciences is to use data to find the relationship among different variables.
• Important objective: Modelling
What is a Model?
• Representation of a relationship
• All salient features of the process are incorporated
Process generates data: correct.
Data generates process: wrong.
Definition
Regress: to move in a backward direction, i.e., use the data to get back the statistical relationship.
Model -> 2 components: variables and parameters
Difference between mathematical and statistical
modelling?
y = 3x + 2: an equation, and also a regression model.
y: yield per hectare, x: quantity of fertilizer (in kg)

x | 1 | 2 | 5 | 10 | 50 | 100 | 1000
y | 5 | 8 | 17 | 32 | 152 | 302 | 3002
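The table can be reproduced directly from the equation, a reminder that the mathematical model puts no upper limit on the yield:

```python
# The mathematical model y = 3x + 2 maps every fertilizer quantity x
# to a yield y, growing without bound.
xs = [1, 2, 5, 10, 50, 100, 1000]
ys = [3 * x + 2 for x in xs]
print(ys)  # [5, 8, 17, 32, 152, 302, 3002]
```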
Interpretation:
Increase the quantity of fertilizer and obtain a higher yield: the mathematical opinion.
Increase the quantity of fertilizer and obtain a higher yield only up to a certain level; after that, the crop will be destroyed: the statistical opinion.
Regression Model
Regression model: 2 types of variables
• Input variables, or independent variables
• Output variables, or dependent variables.
The statistician’s role:
• has no right to change or alter the process;
• works only on the basis of a small sample.
Linearity in parameters
• Dependent variable – y, Independent variable – x
Regression Model
• It is easier to use mathematical tools on linear functions than on nonlinear
functions.
• So a linear regression model is preferred over a nonlinear regression model.
• Linear function: dependent variable y, independent variable x
  y = α + βx
• α: intercept term; β: slope, the rate of change in y with respect to x
• The relationship between y and x is linear.
How to verify whether a linear relationship exists?
• Use scatter diagram
In the scatter diagram, three distances can be marked for each point: A from the observation y_i to the mean ȳ, B from the fitted value ŷ_i to the mean ȳ, and C from the observation y_i to the fitted value ŷ_i.

SS_total = Σ (y_i − ȳ)²  (A²): total squared distance of the observations from the naive mean of y; the total variation.
SS_reg = Σ (ŷ_i − ȳ)²  (B²): distance from the regression line to the naive mean of y; the variability due to x (regression).
SS_residual = Σ (ŷ_i − y_i)²  (C²): variance around the regression line; additional variability not explained by x, and what the least squares method aims to minimize.

These satisfy SS_total = SS_reg + SS_residual, and

R² = SS_reg / SS_total
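A quick numerical check of this decomposition, sketched here by fitting a least squares line to the small (x, y) data set used in the worked example later in these notes:

```python
# Fit y = a + b*x by least squares, then decompose the total variation.
x = [0, 1, 2, 3, 4, 5, 6]
y = [2, 1, 3, 2, 4, 3, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Slope and intercept from the closed-form least squares formulas.
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
yhat = [a + b * xi for xi in x]

ss_total = sum((yi - ybar) ** 2 for yi in y)                 # A^2 terms
ss_reg = sum((yh - ybar) ** 2 for yh in yhat)                # B^2 terms
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))      # C^2 terms
r2 = ss_reg / ss_total
# For a least squares fit, ss_total == ss_reg + ss_res holds exactly.
```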
Lines of Regression
• Line of regression of X on Y:
  X − X̄ = r (σx/σy) (Y − Ȳ),  regression coefficient bxy = r σx/σy
• Line of regression of Y on X:
  Y − Ȳ = r (σy/σx) (X − X̄),  regression coefficient byx = r σy/σx
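A sketch of both coefficients computed from raw data, using the (X, Y) values from the examples later in these notes; note that byx · bxy = r², which is used in a later problem:

```python
import statistics

# The two regression coefficients bxy = r*sx/sy and byx = r*sy/sx.
X = [6, 2, 10, 4, 8]
Y = [9, 11, 5, 8, 7]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n

sx = statistics.pstdev(X)               # population standard deviations
sy = statistics.pstdev(Y)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / n  # covariance
r = sxy / (sx * sy)                     # correlation coefficient

bxy = r * sx / sy                       # coefficient of X on Y
byx = r * sy / sx                       # coefficient of Y on X
# byx * bxy == r**2, so r = ±sqrt(byx * bxy), taking the common sign
# of the two coefficients.
```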
Proof
The normal equations for y = a + bx are
  Σy = na + bΣx            ...(1)
  Σxy = aΣx + bΣx²         ...(2)

Dividing equation (1) by n:
  ȳ = a + b x̄,  so  a = ȳ − b x̄
Substituting into y = a + bx:
  y = (ȳ − b x̄) + bx
  y − ȳ = b(x − x̄)         ...(3)

Multiplying (1) by Σx:
  Σx Σy = na Σx + b (Σx)²  ...(4)
Multiplying (2) by n:
  n Σxy = na Σx + nb Σx²   ...(5)
Subtracting (4) from (5):
  n Σxy − Σx Σy = b [n Σx² − (Σx)²]
  b = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)
Example
x | y | x² | xy
0 | 2 | 0 | 0
1 | 1 | 1 | 1
2 | 3 | 4 | 6
3 | 2 | 9 | 6
4 | 4 | 16 | 16
5 | 3 | 25 | 15
6 | 5 | 36 | 30
Sum: 21 | 20 | 91 | 74

Normal equations:
  Σy = Na + bΣx   →  20 = 7a + 21b
  Σxy = aΣx + bΣx²  →  74 = 21a + 91b
Solving: a = 1.357, b = 0.5, so y = 1.357 + 0.5x
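The same coefficients follow from the closed-form solution of the normal equations derived in the proof above:

```python
# Solve the normal equations 20 = 7a + 21b and 74 = 21a + 91b using
# b = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2) and a = (Sy - b*Sx) / n.
x = [0, 1, 2, 3, 4, 5, 6]
y = [2, 1, 3, 2, 4, 3, 5]
n = len(x)

b = (n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)) / \
    (n * sum(xi ** 2 for xi in x) - sum(x) ** 2)
a = (sum(y) - b * sum(x)) / n
# b == 0.5 and a == 9.5/7 ≈ 1.357, giving y = 1.357 + 0.5x
```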
Exercise
x | y
0 | 1
1 | 1.8
2 | 3.3
3 | 4.5
4 | 6.3
Answer: y = 0.72 + 1.33x
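The exercise can be checked with the same normal-equation formulas:

```python
# Fit y = a + b*x to the exercise data via the normal equations.
x = [0, 1, 2, 3, 4]
y = [1, 1.8, 3.3, 4.5, 6.3]
n = len(x)

b = (n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)) / \
    (n * sum(xi ** 2 for xi in x) - sum(x) ** 2)
a = (sum(y) - b * sum(x)) / n
# b == 1.33 and a == 0.72, giving y = 0.72 + 1.33x
```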
Normal Equations
• The system of equations that must be solved to obtain the values of the constants is known as the normal equations.
For the line x = a + by:
  Σx = Na + bΣy
  Σxy = aΣy + bΣy²
From the following data obtain the two regression equations:
x | 6 | 2 | 10 | 4 | 8
y | 9 | 11 | 5 | 8 | 7

x | y | x² | y² | xy
6 | 9 | 36 | 81 | 54
2 | 11 | 4 | 121 | 22
10 | 5 | 100 | 25 | 50
4 | 8 | 16 | 64 | 32
8 | 7 | 64 | 49 | 56
Sum: 30 | 40 | 220 | 340 | 214
Mean: 6 | 8 | 44 | 68 | 42.8

Regression of Y on X (Σy = Na + bΣx, Σxy = aΣx + bΣx²):
  40 = 5a + 30b
  214 = 30a + 220b
  a = 11.9, b = −0.65, so Y = 11.9 − 0.65X

Regression of X on Y (Σx = Na + bΣy, Σxy = aΣy + bΣy²):
  30 = 5a + 40b
  214 = 40a + 340b
  a = 16.4, b = −1.3, so X = 16.4 − 1.3Y
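Both lines come from the same closed-form coefficient formulas, with the roles of the two variables swapped; a small sketch:

```python
# Fit both regression lines by solving each pair of normal equations.
X = [6, 2, 10, 4, 8]
Y = [9, 11, 5, 8, 7]

def fit(u, v):
    """Least squares line v = a + b*u from the normal equations."""
    n = len(u)
    b = (n * sum(ui * vi for ui, vi in zip(u, v)) - sum(u) * sum(v)) / \
        (n * sum(ui ** 2 for ui in u) - sum(u) ** 2)
    a = (sum(v) - b * sum(u)) / n
    return a, b

a_yx, b_yx = fit(X, Y)  # Y on X: expect Y = 11.9 - 0.65 X
a_xy, b_xy = fit(Y, X)  # X on Y: expect X = 16.4 - 1.3 Y
```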
From the following data obtain the regression equations, taking deviations of the items from the means of x and y:
x | 6 | 2 | 10 | 4 | 8
y | 9 | 11 | 5 | 8 | 7

X | x = X − X̄ | x² | Y | y = Y − Ȳ | y² | xy
6 | 0 | 0 | 9 | 1 | 1 | 0
2 | −4 | 16 | 11 | 3 | 9 | −12
10 | 4 | 16 | 5 | −3 | 9 | −12
4 | −2 | 4 | 8 | 0 | 0 | 0
8 | 2 | 4 | 7 | −1 | 1 | −2
Sum: 30 | 0 | 40 | 40 | 0 | 20 | −26
Mean: 6 | 0 | 8 | 8 | 0 | 4 | −5.2

Line of X on Y: X − X̄ = r (σx/σy)(Y − Ȳ), with
  r σx/σy = Σxy / Σy² = −26/20 = −1.3
  X − 6 = −1.3 (Y − 8), i.e. X = −1.3Y + 16.4

Line of Y on X: Y − Ȳ = r (σy/σx)(X − X̄), with
  r σy/σx = Σxy / Σx² = −26/40 = −0.65
  Y − 8 = −0.65 (X − 6), i.e. Y = −0.65X + 11.9
From the following data obtain the regression equations, taking deviations of the X series from 5 and of the Y series from 7:
x | 6 | 2 | 10 | 4 | 8
y | 9 | 11 | 5 | 8 | 7

X | dx = X − 5 | dx² | Y | dy = Y − 7 | dy² | dxdy
6 | 1 | 1 | 9 | 2 | 4 | 2
2 | −3 | 9 | 11 | 4 | 16 | −12
10 | 5 | 25 | 5 | −2 | 4 | −10
4 | −1 | 1 | 8 | 1 | 1 | −1
8 | 3 | 9 | 7 | 0 | 0 | 0
Sum: 30 | 5 | 45 | 40 | 5 | 25 | −21
Mean: 6 | 1 | 9 | 8 | 1 | 5 | −4.2

Line of X on Y: X − X̄ = r (σx/σy)(Y − Ȳ), with
  r σx/σy = [N Σdxdy − (Σdx)(Σdy)] / [N Σdy² − (Σdy)²]
          = (5(−21) − 5·5) / (5·25 − 25) = −130/100 = −1.3
  X − 6 = −1.3 (Y − 8), i.e. X = −1.3Y + 16.4

Likewise r σy/σx = −130 / (5·45 − 25) = −0.65, giving Y = −0.65X + 11.9 as before.
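The assumed-mean shortcut can be sketched directly: the deviations are taken from arbitrary values (5 for X, 7 for Y), and the correction terms (Σdx)(Σdy) and (Σdy)² absorb the difference from the true means:

```python
# Regression coefficients from deviations about assumed means 5 and 7.
X = [6, 2, 10, 4, 8]
Y = [9, 11, 5, 8, 7]
n = len(X)

dx = [x - 5 for x in X]   # deviations from the assumed mean of X
dy = [y - 7 for y in Y]   # deviations from the assumed mean of Y

num = n * sum(a * b for a, b in zip(dx, dy)) - sum(dx) * sum(dy)
bxy = num / (n * sum(d ** 2 for d in dy) - sum(dy) ** 2)  # -130/100 = -1.3
byx = num / (n * sum(d ** 2 for d in dx) - sum(dx) ** 2)  # -130/200 = -0.65
```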
Example: with X̄ = 65, Ȳ = 67, σx = 2.5, σy = 3.5 and r = 0.8, the two regression lines are:

X on Y: X − 65 = 0.8 (2.5/3.5)(Y − 67)
        X − 65 = 0.57 (Y − 67)
        X = 0.57Y + 26.81

Y on X: Y − 67 = 0.8 (3.5/2.5)(X − 65)
        Y − 67 = 1.12 (X − 65)
        Y = 1.12X − 5.8
After the 9/11 attack, a company could only partially recover its records on correlation:
  Variance of X = 9
  Equations of regression: 8X − 10Y + 66 = 0 and 40X − 18Y = 214
On the basis of this information, find
  1) the mean values of X and Y;
  2) the coefficient of correlation;
  3) the standard deviation of Y.

Solution:
Both regression lines pass through (X̄, Ȳ), so solving 8X − 10Y = −66 and 40X − 18Y = 214 simultaneously gives X̄ = 13, Ȳ = 17.
Reading 8X − 10Y + 66 = 0 as the line of Y on X:
  Y = (8/10)X + 66/10, so byx = 8/10 = 0.8
Reading 40X − 18Y = 214 as the line of X on Y:
  X = (18/40)Y + 214/40, so bxy = 18/40 = 0.45
  r = sqrt(byx · bxy) = sqrt(0.8 × 0.45) = 0.6
Since byx = r σy/σx and σx = √9 = 3:
  0.8 = 0.6 σy/3, so σy = 4.
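The arithmetic of this problem can be sketched step by step:

```python
from math import sqrt

# Recovered facts: 8X - 10Y + 66 = 0, 40X - 18Y = 214, Var(X) = 9.
# The means satisfy both regression lines; eliminate X:
# 5*(8X - 10Y) = -330 gives 40X - 50Y = -330; subtract from 40X - 18Y = 214.
ybar = (214 + 330) / 32        # 32Y = 544, so Y = 17
xbar = (10 * ybar - 66) / 8    # back-substitute: X = 13

byx = 8 / 10                   # from Y = 0.8X + 6.6 (line of Y on X)
bxy = 18 / 40                  # from X = 0.45Y + 5.35 (line of X on Y)
r = sqrt(byx * bxy)            # 0.6

sigma_x = sqrt(9)
sigma_y = byx * sigma_x / r    # byx = r*sy/sx, so sy = byx*sx/r = 4
```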
For 50 students in a class, the regression equation of marks in statistics (X) on marks in accountancy (Y) is 3Y − 5X + 180 = 0. The mean mark in accountancy is 44, and the variance of marks in statistics (X) is (9/16)th of the variance of marks in accountancy (Y). Find the mean marks in statistics and the coefficient of correlation between the marks in the two subjects.

Solution:
The regression line passes through the means, so 3(44) − 5X̄ + 180 = 0, giving X̄ = 312/5 = 62.4.
Writing the line as X on Y: X = (3/5)Y + 36, so bxy = 0.6.
Since σx/σy = √(9/16) = 3/4 and bxy = r σx/σy:
  0.6 = r (3/4), so r = (0.6 × 4)/3 = 0.8
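A quick check of both answers:

```python
from math import sqrt

# 3Y - 5X + 180 = 0, mean of Y = 44, Var(X) = (9/16) * Var(Y).
# The mean point lies on the regression line:
xbar = (3 * 44 + 180) / 5          # 62.4

# Read the line as X on Y: X = 0.6Y + 36, so bxy = 0.6.
# bxy = r * (sx/sy) with sx/sy = sqrt(9/16) = 3/4.
r = 0.6 / (sqrt(9) / sqrt(16))     # 0.8
```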
Errors in regression
• Residual: the difference between the actual value and the model’s estimate.
• If our collection of residuals is small, the model does a good job of predicting the output of interest.
• Conversely, if these residuals are generally large, the model is a poor estimator.
• Statisticians have developed summary measurements that take our
collection of residuals and condense them into a single value that represents
the predictive ability of our model.
Mean Absolute Error
Mean Square Error
Mean Absolute Percentage Error
Mean Percentage Error
Epsilon (ϵ)
• ϵ represents error that comes from sources out of our control, causing
the data to deviate slightly from their true position.
• Error metrics can judge the difference between predicted and actual values, but they cannot tell us how much of that discrepancy was contributed by ϵ.
• While we cannot ever completely eliminate epsilon, it is useful to retain a
term for it in a linear model.
• Population parameter
Mean Absolute Error
  MAE = (1/n) Σ |y_i − ŷ_i|
• MAE is easily interpretable and the simplest of these measures.
• Because we use the absolute value of the residual, the MAE does not
indicate underperformance or overperformance of the model.
• Each residual contributes proportionally to the total amount of error,
meaning that larger errors will contribute linearly to the overall error.
• A small MAE suggests the model is great at prediction, while a large
MAE suggests that your model may have trouble in certain areas. A
MAE of 0 means that your model is a perfect predictor of the outputs.
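A minimal implementation matching the bullets above:

```python
# Mean Absolute Error: the average absolute residual.
def mean_absolute_error(y_true, y_pred):
    """0 means a perfect predictor; each residual contributes linearly."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

For example, `mean_absolute_error([2, 1, 3], [2.5, 1, 2])` gives (0.5 + 0 + 1)/3 = 0.5.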
Mean Square Error
  MSE = (1/n) Σ (y_i − ŷ_i)²
• Because we are squaring the difference, the MSE will almost always be bigger
than the MAE.
• The effect of the square term in the MSE equation is most apparent with the
presence of outliers in our data.
• While each residual in MAE contributes proportionally to the total error, the
error grows quadratically in MSE.
• This means that outliers in our data will contribute to much higher total error
in the MSE than they would the MAE.
• Similarly, our model will be penalized more for making predictions that differ
greatly from the corresponding actual value.
Root mean squared error (RMSE)
• Square root of the MSE.
• Because the MSE is squared, its units do not match that of the original output.
• Researchers will often use RMSE to convert the error metric back into similar
units, making interpretation easier.
• Since the MSE and RMSE both square the residual, they are similarly affected by
outliers.
• The RMSE is analogous to the standard deviation (as the MSE is to the variance) and is a measure of how spread out your residuals are.
• Both MAE and MSE can range from 0 to positive infinity, so as both of these
measures get higher, it becomes harder to interpret how well your model is
performing.
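The outlier sensitivity described above is easy to demonstrate: one large residual barely moves the MAE but dominates the squared metrics.

```python
from math import sqrt

# MSE and RMSE; a single outlier inflates them far more than the MAE.
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return sqrt(mse(y_true, y_pred))

y_true  = [10, 12, 14, 16]
clean   = [11, 11, 15, 15]   # every residual is 1
outlier = [11, 11, 15, 26]   # one residual of 10
# mse(y_true, clean) == 1.0, while mse(y_true, outlier) == (1+1+1+100)/4 = 25.75
```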
• Another way we can summarize our collection of residuals is by using
percentages so that each prediction is scaled against the value it’s supposed to
estimate.
Mean Absolute Percentage Error
• MAPE scales each absolute residual against the actual value it is supposed to estimate, expressing the average error as a percentage.

Mean Percentage Error
• MPE keeps the sign of each percentage error. Since positive and negative errors will cancel out, we cannot use it to make statements about how well the model predictions perform overall.
• However, if there are more negative or more positive errors, this bias will show up in the MPE.
• Unlike MAE and MAPE, MPE is useful to us because it allows us to see if our model systematically underestimates (more negative error) or overestimates (positive error).
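Both percentage metrics can be sketched together; note the sign convention assumed here (error = predicted minus actual, matching the bullets above, where overestimation gives a positive error):

```python
# MAPE uses absolute percentage errors; MPE keeps their sign, exposing
# systematic over- or under-estimation. Results are in percent.
def mape(y_true, y_pred):
    return 100 / len(y_true) * sum(abs((p - t) / t)
                                   for t, p in zip(y_true, y_pred))

def mpe(y_true, y_pred):
    return 100 / len(y_true) * sum((p - t) / t
                                   for t, p in zip(y_true, y_pred))

y_true = [100, 200, 400]
y_pred = [110, 220, 440]   # 10% too high everywhere: a biased model
# mape == 10.0 and mpe == +10.0; for an unbiased model the signed
# errors would largely cancel and mpe would sit near 0.
```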
Summary

Acronym | Full Name | Residual Operation | Robust to Outliers?
MAE | Mean Absolute Error | Absolute value | Yes
MSE | Mean Square Error | Square | No
RMSE | Root Mean Square Error | Square | No
MAPE | Mean Absolute Percentage Error | Absolute value (scaled) | Yes
MPE | Mean Percentage Error | None (sign kept) | Yes