
Lecture 8 Regression


REGRESSION ANALYSIS

Reading materials: Chap 16, 17 (Keller)

Lecture 9 Outline

• Simple Regression:
– Form of the general model
– Procedure in SPSS
– Interpretation of SPSS output
– Testing significance of a slope/intercept
– Assumption checking
• Multiple Regression:
– As above

Regression analysis

• Regression analysis investigates whether and how variables are related to each other. More specifically, regression analysis can be used to:
– Determine whether the value of one variable has any effect on the values of another;
– Determine whether, as one variable changes, another tends to increase or decrease;
– Predict the values of one variable based on the values of one or more other variables.

Regression analysis: example

E.g.:
• How is price related to product demand => if the price changes, how will product demand change?
• How do staff salaries depend on their education and experience?
• Does blood pressure level predict life expectancy?
• Do university entrance exam scores predict student performance?
• Does reading statistics books make you a better person?

Association b/w scores of uni entrance exam and salary

• Sample: 1,532 graduates from NEU, two years after graduation;
• Scores of uni entrance exam range from 17.5 to 29.5
• Salaries vary from VND 1 to 35 million per month.
Source: fb Nguyễn Việt Cường

Types of relationships

[Figure: four panels: Positive linear relationship; Negative linear relationship; Non-linear relationship; No relationship]
Simple linear relationship

• In a simple linear relationship, we want to see whether a linear relationship exists b/w one dependent variable (Y) and one independent variable (X).
• Example: we want to see whether the time persons have lived in a city (in years) affects their attitude towards that city in a linear manner. Attitude towards the city is measured on an 11-point scale (1 = do not like, 11 = like very much).

Simple linear relationship: example

Respondent   Duration of   Quality of       Attitude
Number       Residence     infrastructure   Towards City
 1           10             3                6
 2           12            11                9
 3           12             4                8
 4            4             1                3
 5           12            11               10
 6            6             1                4
 7            8             7                5
 8            2             4                2
 9           18             8               11
10            9            10                9
11           17             8               10
12            2             5                2

Steps in regression analysis

1. Analyse the nature of the relationship b/w independent and dependent variables
2. Make a scatterplot
3. Formulate the mathematical model that describes the relationship b/w the independent and dependent variables
4. Estimate and interpret the coefficients of the model
5. Test the model
6. Evaluate the strength of the relationship (fitness) and prediction accuracy

Simple linear regression: notation

• Simple regression – one predictor
• We have n observations.
• $X_i$ = value of the independent variable on the ith obs.
• $Y_i$ = value of the dependent variable on the ith obs.
• $s_x$ = sample standard deviation of the independent variable
• $s_y$ = sample standard deviation of the dependent variable
• $\bar{Y}$ is the sample average of the dependent variable
• $\bar{X}$ is the sample average of the independent variable

What is most dangerous to old people?

• Some observations by teachers in the US show that the proportion of students telling their teachers that a grandfather or grandmother has died rises sharply before the students' exams. For example, according to Adams (1999), the death rate of grandmothers increases about 10-fold in the week before a midterm assignment, and nearly 20-fold before the final exam, compared with other weeks of the year. More terrifying still, grandmothers of weak students have a pre-exam death rate more than 50 times higher than at other times of the year. Strangely, although men have a lower life expectancy, the death rate of grandfathers before students' exams is about 20 times lower than that of grandmothers.
• So it is not traffic accidents, cardiovascular disease or cancer, but the grandchild's term exam that is the fearsome enemy of the elderly. One wonders whether teachers at universities in our country see this phenomenon too.
Source: fb ‘Cường Nguyễn Việt’

Step 1: Analyse the nature of the relationship

• Determined (causal) relationship => whether changes in one variable lead to changes in the other variable.
– E.g.: changes in advertising expense cause changes in sales volume. Is there any inverse relationship?
• Association relationship: some other factor causes the change in both dependent and independent variables.
– E.g.: sales of sunglasses and ice-cream both increase because of hot weather
Are good students managers?

• Study on graduates of NEU in 2015, 2016.
• Source: fb ‘Cường Nguyễn Việt’

Simple linear regression: scatterplot

• Step 2: Make a Scatterplot

Example – city attitudes vs duration of residence

[Figure: Scatterplot of Attitude Towards City vs Duration of Residence; x-axis: Duration of Residence (0 to 20), y-axis: Attitude Towards City]

Simple linear regression: Model

• Step 3: Formulate the General Model
– Fit a straight line to the data, fitting the following model:

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

where $\beta_0$ is the intercept, $\beta_1$ is the slope and $\varepsilon_i$ are the error terms (residual).
– Slope and intercept are estimated by the ordinary least squares (OLS) method.

The line of best fit

• Example – city attitudes vs duration of residence

OLS method (1)

Want: $\sum_i \varepsilon_i^2$ minimum

OLS method (2)

[Figure: on X-Y axes, an observed value $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ and the line $Y_X = \beta_0 + \beta_1 X_i$; $\varepsilon_i$ = error terms]
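As a brief sketch of how minimising the sum of squared errors leads to the estimates, the derivation below writes the sum as a function $S$ (a symbol introduced here only for illustration; it does not appear in the slides) and sets its partial derivatives to zero:

```latex
S(\beta_0, \beta_1) = \sum_{i=1}^{n} \varepsilon_i^2
                    = \sum_{i=1}^{n} \left( Y_i - \beta_0 - \beta_1 X_i \right)^2

\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} \left( Y_i - \beta_0 - \beta_1 X_i \right) = 0
\qquad
\frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} X_i \left( Y_i - \beta_0 - \beta_1 X_i \right) = 0

\hat{\beta}_1 = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - \left( \sum X_i \right)^2},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
```

The first condition says the residuals sum to zero; the second says they are uncorrelated with X in the sample. Solving the two equations together gives the last line.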

OLS method (3)

Gauss-Markov assumptions

– Assumption on linear relation
A0: linear model
– Assumption on the factor
A5: Exogeneity assumption: $\mathrm{Cov}(X, \varepsilon) = 0$
– Assumption on the error terms:
A1: $E(\varepsilon_i) = 0, \quad i = 1, \dots, n$
A2: Normality of error terms: $\varepsilon \sim N$
A3: Non-autocorrelation of error terms: $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0, \quad i \neq j$
A4: Homoskedasticity: $\mathrm{Var}(\varepsilon_i) = \sigma^2, \quad i = 1, \dots, n$

Estimate the parameters

• Step 4: Estimate the parameters (slope and intercept)

$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$

• Can calculate estimates of slope and intercept using formulae, which are derived from the OLS

$\hat{\beta}_1 = \dfrac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n \sum_{i=1}^{n} X_i^2 - \left( \sum_{i=1}^{n} X_i \right)^2}$

$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$

Applying this to example

Check yourself:
• Slope = 16.333/27.697 = 0.5897
• Intercept = 6.5833 - 0.5897*9.333 = 1.0796

Fitted Equation: $\hat{Y}_i = 1.0796 + 0.5897 X_i$
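The "check yourself" calculation can also be reproduced in a few lines of Python. This is only a sketch: the function name `ols_simple` is made up for illustration, and the data are the duration and attitude columns of the example table.

```python
# Data from the example table: duration of residence (X), attitude towards city (Y)
duration = [10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2]
attitude = [6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2]


def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # Slope formula from the slide: (n*SumXY - SumX*SumY) / (n*SumX^2 - (SumX)^2)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: Y-bar minus slope times X-bar
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


b0, b1 = ols_simple(duration, attitude)
print(f"slope = {b1:.4f}, intercept = {b0:.4f}")
```

At full precision the intercept comes out at about 1.079; the hand calculation above differs slightly in the last decimal because it uses the rounded slope 0.5897 and mean 9.333.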

Applying this to example

Check yourself:

Interpreting the coefficients

• $\hat{\beta}_1$ = 0.5897 means that for each additional year of staying in the city, the attitude towards the city increases by an average of 0.5897 points.
• $\hat{\beta}_0$ = 1.0796 is the value of Y when X=0. This means that other reasons unrelated to the duration of residence make the attitude towards the city equal to 1.0796 points.
• Note: sometimes $\hat{\beta}_0$ makes no sense when X=0; in that case we don’t interpret the meaning of this coefficient.
Step 5: Testing for significance of estimated parameters

• Can test significance of linear relationship
• H0: β1 = 0
• HA: β1 ≠ 0
• Test Statistic:

$T = \dfrac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}}$; where $s_{\hat{\beta}_1}$ is the standard error of $\hat{\beta}_1$.

• Decision Rule: Compare to a t-distribution with n-2 degrees of freedom.

Applying this to example

• H0: β1 = 0
• HA: β1 ≠ 0
• Test Statistic:

$t = \dfrac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}} = \dfrac{0.5897 - 0}{0.0701} = 8.412$

• Compare this t-value with the t-distribution to make the decision.
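A minimal sketch of the same test in Python follows. The slides take the standard error 0.0701 from SPSS; here it is computed with the usual formula $s_{\hat{\beta}_1} = s_\varepsilon / \sqrt{\sum (X_i - \bar{X})^2}$, which is an addition not shown on the slides. The critical value 2.2281 (5% two-sided, df = 10) is the one quoted in the decision rule.

```python
import math

# Data from the example table
duration = [10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2]
attitude = [6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2]

n = len(duration)
x_bar = sum(duration) / n
y_bar = sum(attitude) / n

# Sums of squares and cross-products about the means
s_xx = sum((x - x_bar) ** 2 for x in duration)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(duration, attitude))

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residual sum of squares and standard error of the estimate (n - 2 d.f.)
ss_res = sum((y - (intercept + slope * x)) ** 2
             for x, y in zip(duration, attitude))
s_eps = math.sqrt(ss_res / (n - 2))

# Standard error of the slope and t statistic for H0: beta1 = 0
se_slope = s_eps / math.sqrt(s_xx)
t_stat = (slope - 0) / se_slope

T_CRIT = 2.2281  # 5% two-sided critical value for df = 10, quoted in the slides
reject_h0 = abs(t_stat) > T_CRIT
print(f"se = {se_slope:.4f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
```

The t statistic comes out close to the slide's 8.412; small differences in later decimals come from the slide dividing rounded values.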

Decision rule

• So, the rejection region will be t > 2.2281 or t < -2.2281 for 5% significance (use df=10)
• OR from SPSS, p-value = 0.000 (i.e. less than 0.001).
• Conclusion: Reject the null hypothesis. There is a significant linear relationship between duration of residence and attitude to the city.

Step 6: Determine the strength or fitness of the relationship

• Measured by $r^2$ – coefficient of determination.
• $r^2$ measures the proportion of total variation (Y) explained by the variation in X, i.e.

$r^2 = \dfrac{\text{explained variation}}{\text{total variation}} = \dfrac{SS_{reg}}{SS_y}$
Applying this to example

• Here are the outputs from SPSS

S = 1.22329   R-Sq = 87.6%   R-Sq(adj) = 86.4%

• So, 87.6% of variation in Y is explained by the variation in X.

Step 6: Check prediction accuracy

• Can use standard error of the estimate, $s_\varepsilon$.

$s_\varepsilon = \sqrt{\dfrac{SS_{res}}{n - k - 1}}$

• Interpretation: average residual; average error in predicting Y from the regression equation.
• Used to construct confidence intervals
– for mean value of Y for given X
– for all values of Y for given X
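To see where the two reported figures (R-Sq and S) come from, the sketch below recomputes them from the example data. The helper name `fit_summary` is hypothetical, and k = 1 because there is one predictor.

```python
import math

# Data from the example table
duration = [10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2]
attitude = [6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2]


def fit_summary(x, y, k=1):
    """Return (r2, s_eps) for the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    ss_y = sum((yi - y_bar) ** 2 for yi in y)            # total variation
    ss_res = sum((yi - (intercept + slope * xi)) ** 2    # unexplained variation
                 for xi, yi in zip(x, y))
    ss_reg = ss_y - ss_res                               # explained variation

    r2 = ss_reg / ss_y
    s_eps = math.sqrt(ss_res / (n - k - 1))
    return r2, s_eps


r2, s_eps = fit_summary(duration, attitude)
print(f"r2 = {r2:.3f}, s_eps = {s_eps:.5f}")
```

Both values agree with the output above (R-Sq = 87.6%, S = 1.22329) up to rounding.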

Inferential statistics

Checking assumption

• Regression analysis makes several assumptions:
• Error terms normally distributed
• Error terms have mean 0, constant variance
• Error terms are independent
• These should be checked with plots (see multiple regression section)
Example using SPSS

• Use the cntry15.sav data file for SPSS practice.
We want to see how birth-rate/1000 population influences female life expectancy.

Multiple Regression

• Data:
– one dependent variable
– two or more independent variables
• Example: Are consumers’ perceptions of quality determined by the perceptions of prices, brand image and brand attributes?

Why do we need multiple regression?

• Simple linear regression sometimes violates A5 (see GM assumptions)
– Relationship b/w income and expenditure of households
– Salary and education level of individuals
• Other advantages of multiple regression
– Provide more information and thus improve forecasting quality
– More model forms can be used

Model – general form

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$

which is estimated by

$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_k X_k$

$\hat{\beta}_0$ = estimated intercept
$\hat{\beta}_i$ = estimated partial regression coefficient

• As before, use least squares method to estimate parameters, minimise the error (residual) sum of squares.

Dr. Trần Thị Bích - Faculty of Statistics

Interpreting a Partial Regression Coefficient

• Imagine a case with two predictors

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$

$\beta_1$ represents the expected average change in Y when X1 is increased by one unit, but X2 is held constant or otherwise controlled

Example 2

• Attitude to city now being explained by
– Duration of residence
– Quality of infrastructure
General Model

• Let
– Y = attitude to city
– X1 = duration of residence
– X2 = quality of infrastructure

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$

Estimation (SPSS)

The regression equation is
Attitude Towards City = 0.337 + 0.481 Duration of Residence + 0.289 quality of infrastructure

Coefficients(a)
                  Unstandardized           Standardized
                  Coefficients             Coefficients
Model             B        Std. Error      Beta            t        Sig.
1   (Constant)    .337     .567                            .595     .567
    duration      .481     .059            .764            8.160    .000
    quality       .289     .086            .314            3.353    .008
a. Dependent Variable: attitude
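The coefficients in the B column can be reproduced outside SPSS by solving the least-squares normal equations $(X'X)b = X'y$. The sketch below does this in plain Python with a small Gauss-Jordan solver; the function names are made up for illustration, and this is not a description of how SPSS computes its estimates internally.

```python
# Data from the example table
duration = [10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2]   # X1
quality = [3, 11, 4, 1, 11, 1, 7, 4, 8, 10, 8, 5]        # X2
attitude = [6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2]       # Y


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Bring the row with the largest entry in this column to the pivot position
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every other row
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_multiple(predictors, y):
    """Return [b0, b1, ..., bk] from the normal equations (X'X) b = X'y."""
    rows = [[1.0] + list(vals) for vals in zip(*predictors)]  # leading 1 = intercept
    size = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)]
           for i in range(size)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(size)]
    return solve_linear_system(xtx, xty)


b0, b1, b2 = ols_multiple([duration, quality], attitude)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

The three estimates agree with the SPSS output to three decimals (.337, .481, .289).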

Strength of relationship ($R^2$)

• As before, is the proportion of variation explained by the model.

$R^2 = \dfrac{\text{explained variation}}{\text{total variation}} = \dfrac{SS_{reg}}{SS_y}$

• In the example, 94.5% of variation in Y can be explained by the variation in X1 and X2

Points about $R^2$

• Now called coefficient of multiple determination
• Will go up as we add more explanatory terms to the model whether they are “important” or not.
• Often we use “adjusted $R^2$” – compensates for adding more variables, so is lower than $R^2$ when variables are not “important”
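The slides do not give the formula for adjusted $R^2$. A common definition, assumed here, is $1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$. The sketch below applies it to the two $R^2$ values reported for the example with n = 12 observations.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k predictors (common textbook form)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# R^2 values reported in the slides
adj_simple = adjusted_r2(0.876, n=12, k=1)    # one predictor: duration
adj_multiple = adjusted_r2(0.945, n=12, k=2)  # two predictors: duration, quality

print(f"simple:   {adj_simple:.3f}")
print(f"multiple: {adj_multiple:.3f}")
```

For the simple regression this reproduces the R-Sq(adj) = 86.4% in the earlier output. In both cases the adjusted value is below the unadjusted one, because the factor (n - 1)/(n - k - 1) is greater than 1.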

Significance Testing

• Can test two different things
1. Significance of the overall regression
2. Significance of specific partial regression coefficients.

1. Significance of the overall regression

• H0: β1 = β2 = β3 = … = βk = 0
• HA: not all slopes = 0
• Test Statistic:

$F = \dfrac{SS_{reg} / k}{SS_{res} / (n - k - 1)} = \dfrac{R^2 / k}{(1 - R^2) / (n - k - 1)}$

• Decision Rule: Compared to an F-distribution with k, (n-k-1) degrees of freedom.
• If H0 is rejected, one or more slopes are not zero. Additional tests are needed to determine which slopes are significant.
Applying this to example–SPSS output

• This is the test done in the ANOVA section of the output.
• In this case, we reject the null hypothesis – at least one of the slopes is significantly different from zero.

2. Significance of specific partial regression coefficients.

• H0: βi = 0
• HA: βi ≠ 0
• Test Statistic:

$t = \dfrac{\hat{\beta}_i - \beta_i}{s_{\hat{\beta}_i}} = \dfrac{\hat{\beta}_i}{s_{\hat{\beta}_i}}$

• Decision Rule: Compared to a t-distribution with (n-k-1) degrees of freedom (i.e. residual d.f.)
• If H0 is rejected, the slope of the ith variable is significantly different from zero. That is, once the other variables are considered, the ith predictor has a significant linear relationship with the response.

Applying this to example

Coefficients(a)
                  Unstandardized           Standardized
                  Coefficients             Coefficients
Model             B        Std. Error      Beta            t        Sig.
1   (Constant)    .337     .567                            .595     .567
    duration      .481     .059            .764            8.160    .000
    quality       .289     .086            .314            3.353    .008
a. Dependent Variable: attitude

• Once the quality of infrastructure is considered, the duration of residence still has a significant linear relationship with the attitude to a city.

Check residuals

• Assumptions made:
• Error terms normally distributed
• Error terms have mean 0, constant variance
• Error terms are independent
• Definition: A residual (also called error term) is the difference between the observed response value $Y_i$ and the value predicted by the regression equation, $\hat{Y}_i$
• (Vertical distance between point and line.)

Error terms normally distributed

• Can be checked by looking at a histogram of the residuals - look for bell-shaped distribution.
• Also normal probability plot – look for straight line.
• For preference, use standardised residuals – have a std dev of 1.

Error terms have mean 0, constant variance

• Checked by using plots of residuals vs predicted values; residuals vs independent variables.
• Look for random scatter of points around zero.
• If not, may indicate linear regression is not appropriate – may need to transform data
Error terms are independent

• Check in previous plots; also in residuals vs time/order.
• Look for random scatter of residuals.

Example

[Figure: Residual Plots for Attitude Towards City, four panels: Normal Probability Plot of the Residuals (Percent vs Standardized Residual); Residuals Versus the Fitted Values (Standardized Residual vs Fitted Value); Histogram of the Residuals (Frequency vs Standardized Residual); Residuals Versus the Order of the Data (Standardized Residual vs Observation Order)]

Example using SPSS

• Use the cntry15.sav data file for SPSS practice.
We want to see how birth-rate/1000 persons and the number of doctors/10,000 people influence female life expectancy.
