Lecture Note - Regression
Research Interests
• Whether socio-economic status has a larger
effect on educational achievement
• The importance of education on earnings
• How exercise habits affect weight
• Whether advertising has an impact on the sales
volume of cookies
Regression Analysis
Regression analysis is a statistical technique for
investigating and modeling the relationship
between variables.
• Example
Production line   Amount of production (y)   Man-hours (x)
1                 10                         300
2                 12                         250
3                 6                          200
4                 18                         450
5                 20                         500
Amount of production (y) is the dependent variable (outcome); man-hours (x) is the independent variable (input).
Y = f(X)
Y: dependent variable
X: independent variable (explanatory variable)
Functional relationship
Observations fall directly on the curve of the
relationship.
Statistical relationship
In general, observations derived from
real-life scenarios do not fall directly
on the curve of the relationship.
Notation
Y = f(X)
X, Y = variables
x, y = data points
Y    X
y1   x1
y2   x2
y3   x3
y4   x4
True regression line: $Y = \beta_0 + \beta_1 X$
$\beta_0$ = intercept
$\beta_1$ = gradient or slope
The estimates of $\beta_0$ and $\beta_1$ are written $\hat{\beta}_0$ and $\hat{\beta}_1$.
Residual
The difference between the observed value $y_i$ and the
corresponding fitted value $\hat{y}_i$ is a residual.
The residual value for the i-th data point is $e_i = y_i - \hat{y}_i$.
[Figure: the residual $e_2$ at $x_2$, shown between $y_2$ and $\hat{y}_2$]
The residuals from the least-squares line have a special property: the mean of the
least-squares residuals is always zero.
We want to minimize the total error (residual), because a small total error means
the data points are collectively as close as possible to the model's fitted values.
The estimates of β0 and β1 are calculated so that the sum of squared errors
(residuals) is minimized.
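$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$,   $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
Using the sample data and the two formulas above, it is possible to
calculate $\hat{\beta}_0$ and $\hat{\beta}_1$. Thereby, we get the fitted model as follows.
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$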
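The least-squares calculation can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: it takes the slope as the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept as the mean of y minus the slope times the mean of x. It assumes the production-line example maps man-hours to x and amount of production to y; the function name `least_squares` is hypothetical.

```python
# Production-line example data (assumed mapping: x = man-hours, y = production).
x = [300, 250, 200, 450, 500]
y = [10, 12, 6, 18, 20]


def least_squares(x, y):
    """Return (b0, b1): least-squares intercept and slope for y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx          # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1


b0, b1 = least_squares(x, y)
print(f"fitted model: y_hat = {b0:.4f} + {b1:.4f} x")
```

For these five points the slope comes out small and positive, which matches the visible trend: production lines with more man-hours produce more.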
Source: online.stat.psu.edu
ETC 1052 /BTC 1052/ ITC 1032 Principles of Statistics 18
STATISTICS
Assumptions on Regression
1. The mean of 𝑦𝑖 is a linear function of 𝑥𝑖 .
2. The errors 𝑒𝑖 are normally distributed.
3. The errors have zero mean and constant
variance (homoscedasticity).
4. The errors ( 𝑒𝑖 ) are independent.
[Figure: normal distribution curve, annotated with the mean value and the standard deviation]
Testing Assumptions
Draw a scatter plot of the residuals versus observation order.
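As a rough sketch of how the points for such a plot could be prepared, the Python below fits the least-squares line to the production-line example and pairs each residual with its observation order. It assumes the production line number is the order in which the observations were collected, which the notes do not state. The pairs can be passed to any plotting tool; a scatter with no visible pattern around zero is consistent with independent errors.

```python
# Production-line example data (assumed mapping: x = man-hours, y = production).
x = [300, 250, 200, 450, 500]
y = [10, 12, 6, 18, 20]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

# Residual = observed value minus fitted value.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Assumption: the row number is the observation order.
order = list(range(1, n + 1))
for k, e in zip(order, residuals):
    print(f"observation {k}: residual = {e:+.3f}")

# The least-squares residuals average to zero (up to rounding error).
mean_residual = sum(residuals) / n
print(f"mean residual = {mean_residual:.2e}")
```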
Prediction
• Exercise
The regression equation below relates the scores that students in an advanced
statistics course received for completed homework (x) and the subsequent
midterm exam (y). Homework scores (x) are based on assignments that
preceded the exam. The maximum homework score a student could obtain
was 500 and the maximum midterm score was 350. The regression line that
was obtained is given by
$\hat{y} = -84.4 + 0.91x$
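Prediction with a fitted line is a direct substitution of x into the equation. The sketch below applies the stated line, with intercept -84.4 and slope 0.91, to a homework score of 400. That score is a hypothetical value chosen for illustration and does not come from the exercise; the function name is also hypothetical.

```python
def predict_midterm(homework_score):
    """Predicted midterm score from the fitted line y_hat = -84.4 + 0.91 x."""
    return -84.4 + 0.91 * homework_score


hypothetical_score = 400  # illustrative input, not part of the exercise
predicted = predict_midterm(hypothetical_score)
print(f"predicted midterm score: {predicted:.1f}")  # prints 279.6
```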
[Figure: height (cm) against age (months), for ages 30 to 65 months]
$\hat{y} = 71.95 + 0.383x$
[Figure: height (cm) against age (months), with the age axis extended from 30 to 390 months]
The fitted line should not be used for extrapolation purposes. It may not provide reliable
predictions.
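A short numerical check makes the extrapolation problem concrete. The sketch below evaluates the height line (intercept 71.95, slope 0.383) at one age inside the 30 to 65 month range of the first plot and at one age far outside it. Treating 30 to 65 months as the observed data range is an assumption read off the axis, and the function name is hypothetical.

```python
def predict_height_cm(age_months):
    """Predicted height from the fitted line y_hat = 71.95 + 0.383 x."""
    return 71.95 + 0.383 * age_months


# Assumed observed range of the data: 30 to 65 months.
inside = predict_height_cm(48)    # interpolation, within the assumed range
outside = predict_height_cm(390)  # extrapolation, 32.5 years old

print(f"age 48 months:  {inside:.1f} cm")
print(f"age 390 months: {outside:.1f} cm")
```

The second prediction is above 220 cm, which is implausible as an average adult height: growth does not continue at the childhood rate, so the line stops describing the relationship once x leaves the range it was fitted on.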
Exercise
The data and the graph below show the scores students in an advanced statistics course
received for homework (Hw) completed and the subsequent midterm exam. Homework
scores are based on assignments that preceded the exam. The maximum homework
score a student could obtain was 500 and the maximum midterm score was 350. The
residual for the student whose homework score was 395 is
a. negative.
b. positive.
c. zero.
d. undetermined.
Hw score     387  275  280  459  395  314  428  366  421  234
Exam score   190  200  108  323  315  256  341  236  285  125
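One way to check an answer to this exercise is to fit the least-squares line to the ten data points and compute the residual directly. The sketch below does that in plain Python; the variable names are illustrative. The sign of the printed residual corresponds to one of the options a to c.

```python
hw = [387, 275, 280, 459, 395, 314, 428, 366, 421, 234]
exam = [190, 200, 108, 323, 315, 256, 341, 236, 285, 125]

n = len(hw)
x_bar = sum(hw) / n
y_bar = sum(exam) / n

# Least-squares slope and intercept for exam score on homework score.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(hw, exam)) / sum(
    (x - x_bar) ** 2 for x in hw
)
b0 = y_bar - b1 * x_bar

# Residual for the student with homework score 395: observed minus fitted.
i = hw.index(395)
fitted = b0 + b1 * hw[i]
residual = exam[i] - fitted

print(f"fitted line: y_hat = {b0:.1f} + {b1:.2f} x")
print(f"observed = {exam[i]}, fitted = {fitted:.1f}, residual = {residual:+.1f}")
```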
There is a close connection between correlation and the slope of the least-squares
line. We can get
$\hat{\beta}_1 = r \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$
The slope and the correlation coefficient always have the same sign.
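The relationship between the slope and the correlation coefficient can be verified numerically. The sketch below uses the homework and exam scores from the exercise, computes Pearson's r and the least-squares slope separately, and then rebuilds the slope as r times the square root of the ratio of the y and x sums of squared deviations. Variable names are illustrative.

```python
import math

hw = [387, 275, 280, 459, 395, 314, 428, 366, 421, 234]
exam = [190, 200, 108, 323, 315, 256, 341, 236, 285, 125]

n = len(hw)
x_bar = sum(hw) / n
y_bar = sum(exam) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hw, exam))
s_xx = sum((x - x_bar) ** 2 for x in hw)
s_yy = sum((y - y_bar) ** 2 for y in exam)

r = s_xy / math.sqrt(s_xx * s_yy)          # Pearson correlation coefficient
slope = s_xy / s_xx                        # least-squares slope
slope_from_r = r * math.sqrt(s_yy / s_xx)  # slope rebuilt from r

print(f"r = {r:.3f}")
print(f"slope = {slope:.4f}, r * sqrt(Syy / Sxx) = {slope_from_r:.4f}")
```

Because the square-root factor is always positive, the slope takes its sign from r, which is why the two always agree in sign.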
Measure of Association
• Correlation Coefficient (r)
(Pearson’s product moment correlation coefficient)
-1 ≤ r ≤ 1
$R^2 = \frac{SSR}{SST}$;   $0 \le R^2 \le 1$
[Figure: two scatter plots, one where the $R^2$ value is high and one where it is low]
[Figure: fitted curve with $R^2 = 0.84$]
SST = SSR + SSE
• SSE (Sum of Squares of Error) = $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
• SSR (Sum of Squares of Regression) = $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
• SST (Sum of Squares of Total) = $\sum_{i=1}^{n} (y_i - \bar{y})^2$
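The decomposition can be checked on a small data set. The sketch below fits the least-squares line to the production-line example (assuming x is man-hours and y is amount of production), computes the three sums of squares from their definitions, and forms R squared as SSR divided by SST. Variable names are illustrative.

```python
# Production-line example data (assumed mapping: x = man-hours, y = production).
x = [300, 250, 200, 450, 500]
y = [10, 12, 6, 18, 20]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares fit.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # explained variation
sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation

r_squared = ssr / sst

print(f"SSE = {sse:.3f}, SSR = {ssr:.3f}, SST = {sst:.3f}")
print(f"SSR + SSE = {ssr + sse:.3f}")
print(f"R^2 = {r_squared:.3f}")
```

The identity SST = SSR + SSE holds here because the line is the least-squares fit with an intercept; for an arbitrary line the two sides generally differ.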