Lecture Note - Regression


STATISTICS

Simple Linear Regression

ETC 1052 /BTC 1052/ ITC 1032 Principles of Statistics 1



Research Interests
• Whether socio-economic status has a large effect on educational achievement
• The importance of education for earnings
• How exercise habits affect weight
• Whether advertising has an impact on sales volumes of cookies


Regression Analysis
Regression analysis is a statistical technique for
investigating and modeling the relationship
between variables.


• Example

Production line   Amount of production (y)   Man-hours (x)
1                 10                         300
2                 12                         250
3                 6                          200
4                 18                         450
5                 20                         500

Amount of production (y) is the outcome, or dependent variable.
Man-hours (x) is the input, or independent variable.


Y = f(X)

Names for Y: dependent variable, response variable, outcome variable
Names for X: independent variable, predictor variable, regressor variable, explanatory variable


• Functional relationship and statistical relationship

Functional relationship
The observations fall directly on the curve of relationship.

Statistical relationship
In general, observations derived from real-life scenarios do not fall directly on the curve of relationship.


• Examples: functional relationships


• Y=f(X)=

Notation
Y = f(X)
X, Y = variables
x, y = data points

Y    X
y1   x1
y2   x2
y3   x3
y4   x4


• Linear function: Y = mX + C

Y = β0 + β1X
β0 = intercept
β1 = gradient or slope

• Simple linear regression

Y = β0 + β1X
Y = dependent variable
β0 = intercept
β1 = slope
X = independent variable


• Simple linear regression

Y = β0 + β1X
Y = dependent variable
β0 = intercept
β1 = slope
X = independent variable

• Multiple linear regression

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4
Y = dependent variable
β0, β1, β2, β3, … = regression coefficients
X1, X2, X3, … = regressor variables


We want to know the true regression line, i.e. β0 and β1.

True regression line: Y = β0 + β1X

But we get a different line from our limited number of data points, so we find estimates for β0 and β1.

Parameter   Estimate
β0          $\hat{\beta}_0$
β1          $\hat{\beta}_1$

[Figure: our data points and the true regression line, which passes through the mean values of Y for each x]


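Residual
The difference between the observed value $y_i$ and the corresponding fitted value $\hat{y}_i$ is a residual.

Residual value for the 2nd data point: $e_2 = y_2 - \hat{y}_2$
Residual value for the i-th data point: $e_i = y_i - \hat{y}_i$

[Figure: fitted line, with the residual $e_2$ drawn as the vertical distance between the observed value $y_2$ and the fitted value $\hat{y}_2$ at $x_2$]

The residuals from the least-squares line have a special property: the mean of the least-squares residuals is always zero.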
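
The zero-mean property can be checked numerically. The sketch below is only an illustration: it reuses the production-line data from the earlier example and assumes the standard closed-form least-squares estimates for the slope and intercept. The variable names are made up for this example.

```python
# Production-line example: man-hours (x) and amount of production (y).
x = [300, 250, 200, 450, 500]
y = [10, 12, 6, 18, 20]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Assumed: standard closed-form least-squares estimates of slope and intercept.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Fitted values and residuals e_i = y_i - y_hat_i.
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

for i, (yi, fi, ei) in enumerate(zip(y, fitted, residuals), start=1):
    print(f"point {i}: observed={yi}, fitted={fi:.3f}, residual={ei:.3f}")

# The mean of the residuals is zero up to floating-point rounding.
mean_residual = sum(residuals) / n
print(f"mean residual = {mean_residual:.10f}")
```

Individual residuals are positive or negative, but they cancel out on average because the fitted line includes an intercept.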


Least Squares Method

Residual: $e_i = y_i - \hat{y}_i$

Sum of squared residuals: $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

We want to minimize the total error (or residual) because it means that the data points are collectively as close to the model's values as possible.

Estimates for β0 and β1 are calculated so that the sum of squares of the errors (or residuals) is minimized.

This method of estimating β0 and β1 is called the least-squares method.
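
To make the idea of minimizing concrete, the following sketch evaluates the sum of squared residuals for the least-squares line and for a few slightly shifted lines. It is an illustration only: the data are the production-line example from earlier, the function name `sse` and the shift sizes are arbitrary choices, and the closed-form estimates are assumed to be the standard ones.

```python
# Production-line example: man-hours (x) and amount of production (y).
x = [300, 250, 200, 450, 500]
y = [10, 12, 6, 18, 20]
n = len(x)

def sse(b0, b1):
    """Sum of squared residuals for the candidate line y = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Assumed: standard closed-form least-squares estimates.
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

best = sse(b0, b1)
print(f"least-squares line: SSE = {best:.4f}")

# Shifting the intercept or the slope away from the estimates increases the SSE.
candidates = [(b0 + 1.0, b1), (b0 - 1.0, b1), (b0, b1 + 0.005), (b0, b1 - 0.005)]
for c0, c1 in candidates:
    print(f"b0={c0:.4f}, b1={c1:.4f}: SSE = {sse(c0, c1):.4f}")
```

Every shifted line gives a larger sum of squared residuals than the least-squares line, which is what "minimized" means here.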


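Least Squares Estimators

$\hat{\beta}_0$ and $\hat{\beta}_1$ are the estimators of the intercept and slope using the least-squares method.

$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

Using the sample data and the above two formulas, it is possible to calculate $\hat{\beta}_0$ and $\hat{\beta}_1$. Thereby, we get the fitted model as follows.

Fitted regression model:

$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$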
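
As a rough illustration of the calculation, the sketch below applies the standard least-squares formulas for the slope and intercept to the production-line data from the earlier example. The function name `least_squares` is a hypothetical helper, not something defined in these notes.

```python
def least_squares(x, y):
    """Return (b0, b1), the least-squares intercept and slope for y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope estimate
    b0 = y_bar - b1 * x_bar    # intercept estimate
    return b0, b1

# Production-line example: man-hours (x) and amount of production (y).
man_hours = [300, 250, 200, 450, 500]
production = [10, 12, 6, 18, 20]

b0, b1 = least_squares(man_hours, production)
print(f"fitted model: y_hat = {b0:.4f} + {b1:.4f} x")
```

For these five production lines the estimated slope is positive, so more man-hours go with more production in this sample.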


Regression Analysis on Minitab


• Investigating the relationship between quiz
averages and final exam scores.

Source: online.stat.psu.edu


• What is the independent variable?
• What is the dependent variable?
• What is $\hat{\beta}_0$ of the above regression equation?
• Interpret $\hat{\beta}_0$.
• What is $\hat{\beta}_1$ of the above regression equation?
• Interpret $\hat{\beta}_1$.
• What is the predicted final score when the quiz average is 20?
• Is this regression significant?


Assumptions on Regression
1. The mean of $y_i$ is a linear function of $x_i$.
2. The errors $e_i$ are normally distributed.
3. The errors have zero mean and constant variance (homoscedasticity).
4. The errors $e_i$ are independent.

[Figure: normal distribution of the errors, marked with its mean value and standard deviation]


Checking the Assumptions of the Model


Assumption 1:
The mean of $y_i$ is a linear function of $x_i$.
Testing the assumption:
Draw a scatter plot of y against x to check whether the relationship is linear.



Assumption 2:
The errors $e_i$ are normally distributed.
Testing the assumption:
1. Draw a histogram of the residuals.
2. Draw a normal quantile plot (or normal probability plot) of the residuals. When the residuals are normally distributed, the points in this plot lie close to a straight line.
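
The coordinates behind a normal quantile plot can be computed without plotting software. The sketch below is one possible way to do it, under stated assumptions: it borrows the homework and exam scores from the exercise later in these notes, fits the least-squares line, and pairs the sorted residuals with standard-normal quantiles at the plotting positions (i - 0.5)/n, which is only one of several common conventions. The helper `correlation` is a hypothetical name.

```python
from statistics import NormalDist

# Homework (x) and exam (y) scores from the exercise in these notes.
hw = [387, 275, 280, 459, 395, 314, 428, 366, 421, 234]
exam = [190, 200, 108, 323, 315, 256, 341, 236, 285, 125]
n = len(hw)

# Least-squares fit (standard closed-form estimates).
x_bar = sum(hw) / n
y_bar = sum(exam) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(hw, exam)) / sum(
    (x - x_bar) ** 2 for x in hw
)
b0 = y_bar - b1 * x_bar

# Sorted residuals go on one axis of the normal quantile plot.
residuals = sorted(y - (b0 + b1 * x) for x, y in zip(hw, exam))

# Theoretical standard-normal quantiles go on the other axis.
# Assumed plotting positions: (i - 0.5) / n.
quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

for q, e in zip(quantiles, residuals):
    print(f"normal quantile {q:6.3f}   residual {e:8.2f}")

def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    s_aa = sum((ai - a_bar) ** 2 for ai in a)
    s_bb = sum((bi - b_bar) ** 2 for bi in b)
    return s_ab / (s_aa * s_bb) ** 0.5

# A value near 1 means the plotted points are close to a straight line.
alignment = correlation(quantiles, residuals)
print(f"correlation between quantiles and sorted residuals: {alignment:.3f}")
```

The correlation is only a rough numeric summary of how well the points line up. With ten observations, looking at the plot itself is still the better check.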



Assumption 3:
The errors have zero mean and constant variance (homoscedasticity).
Testing the assumption:
Draw a scatter plot of the residuals versus the fitted values.

The important thing to check here is that there is no special pattern. For instance, the residuals might increase or decrease in a regular way, or form distinct clusters. Any such pattern suggests that the assumption does not hold.



Assumption 4:
The errors ( 𝑒𝑖) are independent.

Testing Assumption
Draw a scatter plot of the residuals versus the observation order.


Prediction


Fitted regression model

• Practical interpretation: using the regression model, we would predict an oxygen purity of 89.23% when the hydrocarbon level is x = 1.00%.


• Exercise
The regression equation below relates the scores that students in an advanced statistics course received for completed homework (x) and for the subsequent midterm exam (y). Homework scores (x) are based on assignments that preceded the exam. The maximum homework score a student could obtain was 500 and the maximum midterm score was 350. The regression line that was obtained is given by

$\hat{y} = -84.4 + 0.91x$

If a student had a homework score of 420, the midterm score would be predicted to be (rounded to an integer)
a. 298.
b. 336.
c. 378.
d. None of the choices are correct.
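
The arithmetic for this exercise can be checked with a few lines of code. This is a sketch only; the function name `predict_midterm` is a hypothetical helper that plugs a homework score into the regression line given above.

```python
def predict_midterm(homework_score):
    """Predicted midterm score from the fitted line y_hat = -84.4 + 0.91 x."""
    return -84.4 + 0.91 * homework_score

prediction = predict_midterm(420)
print(f"predicted midterm score: {prediction:.1f}")
print(f"rounded to an integer: {round(prediction)}")
```

The calculation is -84.4 + 0.91 × 420 = -84.4 + 382.2 = 297.8, which rounds to 298 and stays below the maximum midterm score of 350.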


Interpolation and Extrapolation


Interpolation
• Regression relationships are valid only for values of the regressor variable within
the range of the original data.

E.g.: Sarah's height was plotted against her age.

[Figure: scatter plot of height (cm) against age (months), with the age axis running from 30 to 65 months and the height axis from 80 to 100 cm]

$\hat{y} = 71.95 + 0.383x$

Can you predict her height at age 42 months? $\hat{y}$ = 88 cm



Extrapolation

$\hat{y} = 71.95 + 0.383x$

Can we predict Sarah's height at age 30 years (360 months)? $\hat{y}$ = 209.8 cm = 6'10.5"

The above prediction is an extrapolation. However, regression models are not necessarily valid for extrapolation purposes, and they may not provide reliable predictions.

[Figure: the fitted line extended over ages 30 to 390 months, with the height axis running from 70 to 210 cm]
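
The difference between the two predictions can be made explicit in code. The sketch below is illustrative: the fitted line is the one given for Sarah, but the observed age range is an assumption read off the axis of the first plot (30 to 65 months), since the raw data are not listed. The names `AGE_RANGE_MONTHS`, `predict_height_cm`, and `is_interpolation` are hypothetical.

```python
# Assumed observed range, taken from the plot's age axis rather than the raw data.
AGE_RANGE_MONTHS = (30, 65)

def predict_height_cm(age_months):
    """Predicted height from the fitted line y_hat = 71.95 + 0.383 x."""
    return 71.95 + 0.383 * age_months

def is_interpolation(age_months, observed_range=AGE_RANGE_MONTHS):
    """True when the age lies inside the range covered by the original data."""
    low, high = observed_range
    return low <= age_months <= high

for age in (42, 360):
    kind = "interpolation" if is_interpolation(age) else "extrapolation"
    print(f"age {age} months: predicted height {predict_height_cm(age):.1f} cm ({kind})")
```

Flagging extrapolations this way does not make them reliable. It only marks the predictions that fall outside the range where the fitted line was estimated.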


Exercise
The data and the graph below show the scores that students in an advanced statistics course received for completed homework (Hw) and for the subsequent midterm exam. Homework scores are based on assignments that preceded the exam. The maximum homework score a student could obtain was 500 and the maximum midterm score was 350. The residual for the student whose homework score was 395 is
a. negative.
b. positive.
c. zero.
d. undetermined.

Hw score     387  275  280  459  395  314  428  366  421  234
Exam score   190  200  108  323  315  256  341  236  285  125
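
One way to check the sign of this residual is to fit the least-squares line to the ten data pairs and compare the observed exam score with the fitted value. The sketch below assumes the standard closed-form estimates. The variable names are illustrative.

```python
# Homework (x) and exam (y) scores from the table above.
hw = [387, 275, 280, 459, 395, 314, 428, 366, 421, 234]
exam = [190, 200, 108, 323, 315, 256, 341, 236, 285, 125]
n = len(hw)

# Least-squares fit (standard closed-form estimates).
x_bar = sum(hw) / n
y_bar = sum(exam) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(hw, exam)) / sum(
    (x - x_bar) ** 2 for x in hw
)
b0 = y_bar - b1 * x_bar
print(f"fitted line: y_hat = {b0:.1f} + {b1:.2f} x")

# Residual for the student with a homework score of 395.
index = hw.index(395)
observed = exam[index]
fitted = b0 + b1 * hw[index]
residual = observed - fitted
print(f"observed={observed}, fitted={fitted:.1f}, residual={residual:.1f}")
```

The observed exam score of 315 lies above the fitted line, so the residual for this student is positive.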


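$\hat{\beta}_1$

There is a close connection between correlation and the slope of the least-squares line. We can get

$\hat{\beta}_1 = r \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$

The slope and the correlation coefficient always have the same sign.

The least-squares regression line always passes through $(\bar{x}, \bar{y})$, where

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ and $\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$

[Figure: least-squares line on X and Y axes, passing through the point $(\bar{x}, \bar{y})$]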
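
Both properties can be verified numerically. The sketch below uses the production-line data from the earlier example and the usual definition of Pearson's correlation coefficient. It is an illustration under those assumptions, not a general proof.

```python
from math import sqrt

# Production-line example: man-hours (x) and amount of production (y).
x = [300, 250, 200, 450, 500]
y = [10, 12, 6, 18, 20]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Least-squares estimates and Pearson's correlation coefficient.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
r = s_xy / sqrt(s_xx * s_yy)

# Slope recovered from the correlation: b1 = r * sqrt(S_yy / S_xx).
slope_from_r = r * sqrt(s_yy / s_xx)
print(f"b1 = {b1:.6f}, r * sqrt(Syy / Sxx) = {slope_from_r:.6f}")

# The fitted line evaluated at x_bar returns y_bar.
print(f"fitted value at x_bar = {b0 + b1 * x_bar:.4f}, y_bar = {y_bar:.4f}")
```

Because the square-root factor is always positive, the sign of the slope is decided entirely by the sign of r.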


Measure of Association
• Correlation Coefficient (r)
(Pearson’s product moment correlation coefficient)

Both X and Y are quantitative (Interval or ratio)

-1≤ r ≤ 1

• Coefficient of Determination (R²)

$R^2 = \frac{SSR}{SST}$; $0 \le R^2 \le 1$

Proportion of the variance of Y explained by the regressor variable X



• Values of R² that are close to 1 imply that most of the variability in Y is explained by the regression model.

[Figure: two scatter plots with fitted lines, one where the R² value is high and one where the R² value is low]


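We often refer loosely to R² as the amount of variability in the data explained or accounted for by the regression model.

[Figure: a straight line fitted to data that follow a curve, with R² = 0.84]

A straight line fitted to curved data can still give a high R². Therefore, R² is not always a good measure of how well a linear model fits.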
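
A small numeric experiment illustrates the point. The data below are hypothetical: y is exactly x squared for x = 0, 1, ..., 10, so the true relationship is a curve. The sketch fits a straight line by least squares and computes R² as SSR/SST.

```python
# Hypothetical curved data: y = x^2 exactly, so no straight line is correct.
x = list(range(11))
y = [xi ** 2 for xi in x]
n = len(x)

# Least-squares straight-line fit (standard closed-form estimates).
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# R^2 = SSR / SST.
ssr = sum((fi - y_bar) ** 2 for fi in fitted)
sst = sum((yi - y_bar) ** 2 for yi in y)
r_squared = ssr / sst
print(f"R^2 = {r_squared:.3f}")

# The residuals reveal the curvature: positive, then negative, then positive.
print([round(e, 1) for e in residuals])
```

R² is above 0.9 even though the line is the wrong model. The systematic pattern in the residuals is what exposes the curvature, which is why the residual plots matter alongside R².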


SST = SSR + SSE

SST explains the total variation of Y.
SSR explains the variation of Y associated with the regression model.
SSE explains the variation of Y associated with error.

• SSE (sum of squares of error) = $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

• SSR (sum of squares of regression) = $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$

• SST (sum of squares of total) = $\sum_{i=1}^{n} (y_i - \bar{y})^2$
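
The three sums of squares can be computed directly from their definitions. The sketch below does this for the production-line data from the earlier example, assuming the standard least-squares estimates, and checks that SST = SSR + SSE. It also reports R² = SSR/SST.

```python
# Production-line example: man-hours (x) and amount of production (y).
x = [300, 250, 200, 450, 500]
y = [10, 12, 6, 18, 20]
n = len(x)

# Least-squares fit (standard closed-form estimates).
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

# Sums of squares, each computed from its own definition.
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error
ssr = sum((fi - y_bar) ** 2 for fi in fitted)           # regression
sst = sum((yi - y_bar) ** 2 for yi in y)                # total

print(f"SSE = {sse:.4f}, SSR = {ssr:.4f}, SST = {sst:.4f}")
print(f"SSR + SSE = {ssr + sse:.4f}")
print(f"R^2 = SSR / SST = {ssr / sst:.4f}")
```

The decomposition holds for the least-squares line. For an arbitrary line through the data, SSR + SSE would generally not equal SST.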


Testing the Significance of the Regression (Using ANOVA)

Source of variation   SS                                        Degrees of freedom   MS                    P
Regression            SSR = $\sum (\hat{y}_i - \bar{y})^2$      1                    MSR = SSR / 1         p-value
Error                 SSE = $\sum (y_i - \hat{y}_i)^2$          n - 2                MSE = SSE / (n - 2)
Total                 SST = $\sum (y_i - \bar{y})^2$            n - 1

n = sample size (number of data points)

Hypotheses
H0: β1 = 0 (the gradient is zero, so there is no significant linear relationship)
H1: β1 ≠ 0

If p-value ≤ 0.05, then reject H0. Therefore β1 ≠ 0, and the linear relationship between X and Y is significant.
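
The entries of the ANOVA table can be computed by hand for a small data set. The sketch below uses the production-line data from the earlier example. It also forms the ratio F = MSR/MSE, the usual ANOVA test statistic, which is an addition here because the table above does not show it. The p-value comes from the F distribution with 1 and n - 2 degrees of freedom and is left to statistical software such as Minitab.

```python
# Production-line example: man-hours (x) and amount of production (y).
x = [300, 250, 200, 450, 500]
y = [10, 12, 6, 18, 20]
n = len(x)

# Least-squares fit (standard closed-form estimates).
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

# Sums of squares and their degrees of freedom.
ssr = sum((fi - y_bar) ** 2 for fi in fitted)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
sst = sum((yi - y_bar) ** 2 for yi in y)
df_regression, df_error, df_total = 1, n - 2, n - 1

# Mean squares and the F ratio (assumed test statistic, not shown in the table).
msr = ssr / df_regression
mse = sse / df_error
f_statistic = msr / mse

print(f"Regression: SS = {ssr:.3f}, df = {df_regression}, MS = {msr:.3f}")
print(f"Error:      SS = {sse:.3f}, df = {df_error}, MS = {mse:.3f}")
print(f"Total:      SS = {sst:.3f}, df = {df_total}")
print(f"F = MSR / MSE = {f_statistic:.2f}")
```

A large F ratio means the regression explains much more variation per degree of freedom than is left in the error, which corresponds to a small p-value.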

• A strong observed association between variables does not necessarily imply that a causal relationship exists between those variables.

Social Relationships and Health

House, J., Landis, K., and Umberson, D. "Social Relationships and Health," Science, Vol. 241 (1988), pp. 540-545.
Does lack of social relationships cause people to become ill? (There was a strong correlation.)
  • Or, are unhealthy people less likely to establish and maintain social relationships? (reversed relationship)
  • Or, is there some other factor that predisposes people both to have lower social activity and to become ill?

• Designed experiments are the only way to determine cause-and-effect relationships.