
Simple Linear Regression

• YILMA CHISHA
• ASSISTANT PROFESSOR, BHI
• AMU, SPH, CMHS
Correlation

Finding the relationship between two quantitative variables without being able to infer causal relationships.

Correlation is a statistical technique used to determine the degree to which two variables are related.
Scatter diagram
• Rectangular coordinates
• Two quantitative variables
• One variable is called independent (X) and the second is called dependent (Y)
• Points are not joined
• No frequency table
Example

Wt. (kg)     67   69   85   83   74   81   97   92  114   85
SBP (mmHg)  120  125  140  160  130  180  150  140  200  130
[Figure: Scatter diagram of weight (kg, x-axis) against systolic blood pressure (mmHg, y-axis) for the data above]
Scatter plots

The pattern of the data indicates the type of relationship between your two variables:
 positive relationship
 negative relationship
 no relationship
Positive relationship

[Figure: Scatter plot of height in cm against age in weeks, showing a positive relationship]


Negative relationship

[Figure: Scatter plot of reliability against age of car, showing a negative relationship]

No relationship

[Figure: Scatter plot showing no relationship between the variables]
Correlation Coefficient

A statistic showing the degree of relation between two variables.
Simple Correlation Coefficient (r)

 It is also called Pearson's correlation or the product-moment correlation coefficient.
 It measures the nature and strength of the relationship between two quantitative variables.
The sign of r denotes the nature of the association, while the value of r denotes the strength of the association.
 If the sign is +ve, the relation is direct: an increase in one variable is associated with an increase in the other variable, and a decrease in one variable is associated with a decrease in the other variable.

 If the sign is -ve, the relationship is inverse or indirect: an increase in one variable is associated with a decrease in the other.
How to compute the simple correlation coefficient (r)

r = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left[\sum x^2 - \frac{(\sum x)^2}{n}\right]\left[\sum y^2 - \frac{(\sum y)^2}{n}\right]}}
Example:

A sample of 6 children was selected; data about their age in years and weight in kilograms were recorded as shown in the following table. It is required to find the correlation between age and weight.

Serial No   Age (years)   Weight (kg)
1           7             12
2           6              8
3           8             12
4           5             10
5           6             11
6           9             13
These two variables are of the quantitative type: one variable (age) is the independent variable, denoted X, and the other (weight) is the dependent variable, denoted Y. To find the relation between age and weight, compute the simple correlation coefficient using the formula above.
n       Age (x)   Weight (y)    xy      x²      y²
1       7         12            84      49      144
2       6          8            48      36       64
3       8         12            96      64      144
4       5         10            50      25      100
5       6         11            66      36      121
6       9         13           117      81      169
Total   Σx = 41   Σy = 66    Σxy = 461  Σx² = 291  Σy² = 742
r = \frac{461 - \frac{41 \times 66}{6}}{\sqrt{\left[291 - \frac{(41)^2}{6}\right]\left[742 - \frac{(66)^2}{6}\right]}}

r = 0.759: a strong direct correlation.
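To make the computation reproducible, here is a minimal Python sketch of the computational formula above (the helper name pearson_r is ours, not from any library); it recovers the hand-computed value:

```python
import math

def pearson_r(x, y):
    """Pearson's r via the computational (sums-of-products) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    num = sum_xy - sum_x * sum_y / n
    den = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return num / den

age = [7, 6, 8, 5, 6, 9]           # x
weight = [12, 8, 12, 10, 11, 13]   # y
print(round(pearson_r(age, weight), 3))  # ~0.76, matching the slide's r = 0.759
```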
EXAMPLE: Relationship between Anxiety and Test Scores

Anxiety (X)   Test score (Y)   X²     Y²    XY
10             2               100     4    20
 8             3                64     9    24
 2             9                 4    81    18
 1             7                 1    49     7
 5             6                25    36    30
 6             5                36    25    30
ΣX = 32       ΣY = 32        ΣX² = 230   ΣY² = 204   ΣXY = 129
Calculating the Correlation Coefficient

r = \frac{(6)(129) - (32)(32)}{\sqrt{[6(230) - 32^2][6(204) - 32^2]}} = \frac{774 - 1024}{\sqrt{(356)(200)}} = -0.94

r = -0.94: an indirect strong correlation.
Spearman Rank Correlation Coefficient (rs)

It is a non-parametric measure of correlation. This procedure makes use of the two sets of ranks that may be assigned to the sample values of X and Y.

The Spearman rank correlation coefficient can be computed in the following cases:
 Both variables are quantitative.
 Both variables are qualitative ordinal.
 One variable is quantitative and the other is qualitative ordinal.
Procedure:

1. Rank the values of X from 1 to n, where n is the number of pairs of values of X and Y in the sample.
2. Rank the values of Y from 1 to n.
3. Compute the value of di for each pair of observations by subtracting the rank of Yi from the rank of Xi.
4. Square each di and compute Σdi², the sum of the squared values.
5. Apply the following formula:

r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}

The value of rs denotes the magnitude and nature of the association, with the same interpretation as the simple r.
Example

In a study of the relationship between level of injury and income, the following data were obtained. Find the relationship between them and comment.

Sample   Level of injury (X)   Income (Y)
A        Moderate              25
B        Mild                  10
C        Fatal                  8
D        Severe                10
E        Severe                15
F        Normal                50
G        Fatal                 60
Answer:

Sample   X          Y    Rank (X)   Rank (Y)   di     di²
A        Moderate   25   5          3           2      4
B        Mild       10   6          5.5         0.5    0.25
C        Fatal       8   1.5        7          -5.5   30.25
D        Severe     10   3.5        5.5        -2      4
E        Severe     15   3.5        4          -0.5    0.25
F        Normal     50   7          2           5     25
G        Fatal      60   1.5        1           0.5    0.25

Σdi² = 64
r_s = 1 - \frac{6 \times 64}{7(7^2 - 1)} = 1 - \frac{384}{336} = -0.14

Comment:
There is an indirect weak correlation between level of injury and income.
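A minimal Python sketch of the d² formula (helper names are ours). Two caveats: with ties present, mid-ranks are used and the d² formula is only an approximation to Pearson's r computed on the ranks; and the sign of rs depends on the direction of ranking, so the ranks below are taken exactly as assigned on the slide (injury ranked from fatal upward, income ranked from highest downward):

```python
def spearman_rs(rank_x, rank_y):
    """Spearman's rs from pre-assigned ranks via the d^2 formula.
    With ties (mid-ranks) this approximates Pearson's r on the ranks."""
    n = len(rank_x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Ranks exactly as assigned on the slide (ties share mid-ranks):
rank_injury = [5, 6, 1.5, 3.5, 3.5, 7, 1.5]  # X: fatal = 1.5, severe = 3.5, ..., normal = 7
rank_income = [3, 5.5, 7, 5.5, 4, 2, 1]      # Y: highest income ranked 1
print(round(spearman_rs(rank_injury, rank_income), 2))  # -0.14
```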
Exercise
What is regression analysis?
• An extension of correlation
• A way of measuring the relationship between two or more variables
• Used to calculate the extent to which one variable (the DV) changes when the other variable(s) (the IVs) change
• Used to help understand possible causal effects of one variable on another
What is linear regression (LR)?
• Involves:
  – one predictor (IV) and
  – one outcome (DV)
• Explains a relationship using a straight line fit to the data.
Least squares criterion

[Figure: Illustration of the least squares criterion]
Least-Squares Regression

The most common method for fitting a regression line is the method of least squares.

This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line (if a point lies on the fitted line exactly, its vertical deviation is 0). Because the deviations are first squared, then summed, there are no cancellations between positive and negative values.
Linear Regression - Model

[Figure: Fitted line Y = b0 + b1X with an observed point (Xi, Yi) and its error ei]

Regression coefficients for a . . .

Population:    Y_i = \alpha + \beta X_i + \epsilon_i

Sample:        Y_i = b_0 + b_1 X_i + e_i

Fitted line:   \hat{Y} = b_0 + b_1 X_i
Simple Linear Regression Model

• The population simple linear regression model:

    y = \alpha + \beta x + \epsilon,    so that    \mu_{y|x} = \alpha + \beta x

where \alpha + \beta x is the nonrandom or systematic component and \epsilon is the random component.

• Where
  • y is the dependent (response) variable, the variable we wish to explain or predict;
  • x is the independent (explanatory) variable, also called the predictor variable; and
  • \epsilon is the error term, the only random component in the model, and thus the only source of randomness in y.
Cont…
• \mu_{y|x} is the mean of y when x is specified, also called the conditional mean of Y.

• \alpha is the intercept of the systematic component of the regression relationship.

• \beta is the slope of the systematic component.
Picturing the Simple Linear Regression Model

• The simple linear regression model posits an exact linear relationship between the expected or average value of Y, the dependent variable, and X, the independent or predictor variable:

    \mu_{y|x} = \alpha + \beta x

• Actual observed values of Y (y) differ from the expected value (\mu_{y|x}) by an unexplained or random error (\epsilon):

    y = \mu_{y|x} + \epsilon = \alpha + \beta x + \epsilon

[Figure: Regression plot showing intercept \alpha, slope \beta (rise per unit run), and an observed y lying a distance \epsilon from the population line]
Assumptions of the Simple Linear Regression Model

• The relationship between X and Y is a straight-Line (linear) relationship.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term \epsilon.
• The errors \epsilon are uncorrelated (i.e. Independent) in successive observations.
• The errors \epsilon are Normally distributed with mean 0 and variance \sigma^2 (Equal variance). That is: \epsilon \sim N(0, \sigma^2).

[Figure: LINE assumptions of the Simple Linear Regression Model: identical normal distributions of errors, N(\mu_{y|x}, \sigma_{y|x}^2), all centered on the regression line \mu_{y|x} = \alpha + \beta x]
Fitting a Regression Line

[Figure: Four panels: the data; three errors from a fitted line; three errors from the least squares regression line; errors from the least squares regression line are minimized]
Errors in Regression

[Figure: The fitted regression line \hat{y} = a + bx; for an observation (x_i, y_i), \hat{y}_i is the predicted value of Y at x_i, and the error is e_i = y_i - \hat{y}_i]
Sums of Squares, Cross Products, and Least Squares Estimators

Sums of squares and cross products:

l_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}

l_{yy} = \sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}

l_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n}

Least-squares regression estimators:

b = \frac{l_{xy}}{l_{xx}},    a = \bar{y} - b\bar{x},    \hat{y} = a + bx
Example

Patient    x      y        x²        y²        x·y
1         22.4   134.0    501.76   17956.0    3001.60
4         25.1    80.2    630.01    6432.0    2013.02
8         32.4    97.2   1049.76    9447.8    3149.28
2         51.6   167.0   2662.56   27889.0    8617.20
3         58.1   132.3   3375.61   17503.3    7686.63
5         65.9   100.0   4342.81   10000.0    6590.00
7         75.3   187.2   5670.09   35043.8   14096.16
6         79.7   139.1   6352.09   19348.8   11086.27
10        85.7   199.4   7344.49   39760.4   17088.58
9         96.4   192.3   9292.96   36979.3   18537.72
Total    592.6  1428.7  41222.14  220360.5   91866.46

l_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 41222.14 - \frac{592.6^2}{10} = 6104.66

l_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 220360.47 - \frac{1428.70^2}{10} = 16242.10

l_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 91866.46 - \frac{592.6 \times 1428.70}{10} = 7201.70

b = \frac{l_{xy}}{l_{xx}} = \frac{7201.70}{6104.66} = 1.18

a = \bar{y} - b\bar{x} = \frac{1428.7}{10} - 1.18 \times \frac{592.6}{10} = 72.96

Regression equation:  \hat{y} = 72.96 + 1.18x
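As a cross-check on the hand computation, here is a minimal hand-rolled least-squares sketch in Python (no statistics library; the helper name is ours). On the ten patients above it recovers a ≈ 72.96 and b ≈ 1.18:

```python
def least_squares_fit(x, y):
    """Least-squares intercept a and slope b via b = l_xy / l_xx."""
    n = len(x)
    lxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    lxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    b = lxy / lxx                       # slope
    a = sum(y) / n - b * sum(x) / n     # intercept: a = y-bar - b * x-bar
    return a, b

x = [22.4, 25.1, 32.4, 51.6, 58.1, 65.9, 75.3, 79.7, 85.7, 96.4]
y = [134.0, 80.2, 97.2, 167.0, 132.3, 100.0, 187.2, 139.1, 199.4, 192.3]
a, b = least_squares_fit(x, y)
print(round(a, 2), round(b, 2))  # 72.96 1.18, matching the slide
```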
Linear Regression - Variation

SST = SSR + SSE

SSR: variation due to the regression (explained).
SSE: random/unexplained variation.
Linear Regression - Variation

SST = \sum (Y_i - \bar{Y})^2
SSR = \sum (\hat{Y}_i - \bar{Y})^2
SSE = \sum (Y_i - \hat{Y}_i)^2

[Figure: Decomposition of the deviation of Y_i from \bar{Y} into the explained part (\hat{Y}_i - \bar{Y}) and the residual (Y_i - \hat{Y}_i)]
Contents of correlation and linear regression
• Correlation
• Introduction to simple linear regression
• Least-squares estimation of the parameters
Introduction
• Correlation and regression – for quantitative variables
  – Correlation: assessing the association between quantitative variables
  – Simple linear regression: description and prediction of one quantitative variable from another
• Only considering linear relationships
• When considering correlation or carrying out a regression analysis between two variables, always plot the data on a scatter plot first
Scatter plot

[Figure: example scatter plots]

Pearson Correlation Coefficient

[Figure: the Pearson correlation coefficient formula and illustrations]

Correlation – Linear Relationship

[Figures: linear relationships of varying direction and strength]
Correlation Does Not Imply Causation
• Correlation does not mean causation
• If we observe high correlation between two variables, this does
not necessarily imply that because one variable has a high
value it causes the other to have a high value
• There may be a third variable causing a simultaneous change in
both variables.
• Example:
– Suppose we measured children’s shoe size and reading skills
– There would be a high correlation between these two variables, as
the shoe size increases so too do the child’s reading abilities
– But one does not cause the other, the underlying variable is age
– As age increases so too does shoes size and reading ability

Example: Percentage of children immunized against DPT and under-five mortality rate for 20 countries, 1992

Nation        % immunized   Mortality/1000   Nation           % immunized   Mortality/1000
Bolivia       77            118              Greece           54              9
Brazil        69             65              India            89            124
Cambodia      32            184              Italy            95             10
Canada        85              8              Japan            87              6
China         94             43              Mexico           91             33
Czech Rep.    99             12              Poland           98             16
Egypt         89             55              Russian Fed.     73             32
Ethiopia      13            208              Senegal          47            145
Finland       95              7              Turkey           76             87
France        95              9              United Kingdom   90              9
Nation     xi    yi    xi - x̄   yi - ȳ     Nation      xi    yi    xi - x̄   yi - ȳ
Bolivia    77    118   -0.4      59         Greece      54      9
Brazil     69     65                        India       89    124
Cambodia   32    184                        Italy       95     10
Canada     85      8                        Japan       87      6
China      94     43                        Mexico      91     33
Czech      99     12                        Poland      98     16
Egypt      89     55                        Russia      73     32
Ethiopia   13    208                        Senegal     47    145
Finland    95      7                        Turkey      76     87
France     95      9                        United K.   90      9
Mean       77.4   59
Non-Parametric Correlation
• Rank correlation may be used whatever type of pattern is seen in the scatter diagram; it doesn't specifically assess linear association, but more general association
• Spearman's rank correlation rho
  – Non-parametric measure of correlation – doesn't make any assumptions about the particular nature of the relationship between the variables, doesn't assume a linear relationship
  – rho is a special case of Pearson's r in which the two sets of data are converted to rankings
  – can test the null hypothesis that the correlation is zero and calculate confidence intervals
Formula

r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
Linear regression
• Is used to explore the nature of the relationship between two "continuous" normally distributed random variables.
• Enables us to investigate the change in the response variable that corresponds to a given change in the explanatory variable.
• The ultimate objective of regression analysis is to predict or estimate the value of the response that is associated with a fixed value of the explanatory variable.
Example: Cigarettes & coronary heart disease

IV = Cigarette consumption;  DV = Coronary Heart Disease (CHD)

• IV = X = Average no. of cigarettes per adult per day
• DV = Y = Coronary Heart Disease mortality (rate of deaths per 10,000 per year due to CHD)
• Unit of analysis = Country
• How fast does CHD mortality rise with a one-unit increase in smoking?
Data

Cigarettes   CHD   Cigarettes   CHD
11           26    5             4
9            21    5            18
9            24    5            12
9            21    5             3
8            19    4            11
8            13    4            15
8            19    4             6
6            11    3            13
6            23    3             4
5            15    3            14
5            13

[Figure: Scatterplot of CHD mortality against cigarette consumption, with the line of best fit]
Simple linear regression
• It is a model with a single regressor x that has a linear relationship with a response y.
• The simple linear regression model is

    y = \beta_0 + \beta_1 x + \varepsilon

• Where:
  – y = response variable        – \beta_1 = slope
  – x = regressor variable       – \varepsilon = random error component
  – \beta_0 = intercept
• X is
  – a controlled variable, not a random variable
  – a deterministic or mathematical variable
• Y
  – is a random variable and can't be controlled
  – depends on the regressor variable
Basic assumptions on the model

    y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,   i = 1, ..., n

1. \varepsilon_i is a random variable with zero mean and (unknown) variance \sigma^2, i.e. E(\varepsilon_i) = 0 and V(\varepsilon_i) = \sigma^2.
2. \varepsilon_i and \varepsilon_j are uncorrelated for i \neq j, so Cov(\varepsilon_i, \varepsilon_j) = 0.
3. \varepsilon_i is a normally distributed random variable with mean zero and variance \sigma^2: \varepsilon_i \sim N(0, \sigma^2).

The part of the model without the error is not a random variable; it is the true population mean value, \beta_0 + \beta_1 x_i (not random because it contains no error). Adding the error to it gives the actual or observed value. Thus the conditional mean is E(y|x) = \beta_0 + \beta_1 x_i, while the observed value is y_i = \beta_0 + \beta_1 x_i + \varepsilon_i; the two agree on average because E(\varepsilon) = 0.
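To make assumptions 1 to 3 concrete, here is a small simulation sketch in Python with NumPy; the parameter values β0 = 2, β1 = 1.5, σ = 3 are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1, sigma = 2.0, 1.5, 3.0        # hypothetical values, for illustration only

x = np.linspace(0, 10, 50)                 # X is held fixed, not random
eps = rng.normal(0.0, sigma, size=x.size)  # independent N(0, sigma^2) errors
y = beta0 + beta1 * x + eps                # observed y = systematic part + error

# beta0 + beta1 * x is the (non-random) conditional mean E(y|x);
# all of the randomness in y enters through eps.
```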
Estimation: The Method of Least Squares

Estimation of a simple linear regression relationship involves finding estimated or predicted values of the intercept and slope of the linear regression line.

The estimated regression equation:

    y = a + bx + e

where a estimates the intercept of the population regression line, \alpha; b estimates the slope of the population regression line, \beta; and e stands for the observed errors: the residuals from fitting the estimated regression line a + bx to a set of n points.

The estimated regression line:

    \hat{y} = a + bx

where \hat{y} (y-hat) is the value of Y lying on the fitted regression line for a given value of X.
Fitting a Regression Line

[Figure: Four panels: the data; three errors from a fitted line; three errors from the least squares regression line; errors from the least squares regression line are minimized]

The parameters \beta_0 and \beta_1 are unknown and must be estimated using sample data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n).

The line fitted by least squares is the one that makes the sum of squares of all vertical discrepancies as small as possible.
[Figure: Scatterplot of CHD mortality per 10,000 against cigarette consumption per adult per day, built up over several slides: the fitted line \hat{Y} = \beta_0 + \beta_1 x, the observed point (x_9, y_9), its prediction (x_9, \hat{y}_9), and the residual \varepsilon_9 = y_9 - \hat{y}_9]
General

[Figure: Observation y_i, fitted value \hat{y}_i, and error e_i = y_i - \hat{y}_i at x_i]

Least squares estimation minimizes

    SS_{residuals} = \sum_{i=1}^{n} \varepsilon_i^2
Now we can estimate the parameters (\beta_0 and \beta_1), because the sum of squares of all the differences between the observations y_i and the fitted line is a minimum.
• The least squares estimators of \beta_0 and \beta_1 must satisfy the following two normal equations:

    \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0

    \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0

• We have two normal equations and two unknowns, and they are independent; therefore we can uniquely fit \beta_0 and \beta_1.

• So the estimators are the solution of these equations:

    b_1 = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}},    b_0 = \bar{y} - b_1 \bar{x}
Regression Statistics

SST = \sum (Y - \bar{Y})^2
SSR = \sum (\hat{Y} - \bar{Y})^2
SSE = \sum (Y - \hat{Y})^2

SST = SSR + SSE

[Figure: Venn diagram: the variance to be explained by predictors (SST) splits into the variance explained by X1 (SSR) and the variance NOT explained by X1 (SSE)]
Regression Statistics

R^2 = \frac{SSR}{SST}

Coefficient of Determination: used to judge the adequacy of the regression model.

R = \sqrt{R^2} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}

Correlation measures the strength of the linear association between two variables.
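A short Python sketch of this decomposition (the helper name r_squared is ours), applied to the cigarettes/CHD data from the Data slide; it confirms SST = SSR + SSE and gives R² ≈ 0.51, matching the explained-variance slide later on:

```python
def r_squared(x, y):
    """R^2 = SSR / SST for a simple least-squares fit (hand-rolled sketch)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    a = ybar - b * xbar
    y_hat = [a + b * xi for xi in x]
    sst = sum((yi - ybar) ** 2 for yi in y)
    ssr = sum((yh - ybar) ** 2 for yh in y_hat)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    assert abs(sst - (ssr + sse)) < 1e-9 * sst  # SST = SSR + SSE
    return ssr / sst

# Cigarettes/CHD data from the earlier Data slide:
x = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
y = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]
print(round(r_squared(x, y), 2))  # ~0.51
```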
Regression Statistics

Standard error for the regression model:

S_e = \sqrt{S_e^2} = \sqrt{\hat{\sigma}_e^2}

S_e^2 = \frac{SSE}{n - 2},   where   SSE = \sum (Y - \hat{Y})^2

S_e^2 = MSE
ANOVA

H_0: \beta_1 = 0
H_A: \beta_1 \neq 0

Source        df              SS     MS         F_cal       P-value
Regression    k - 1 = 1       SSR    SSR / df   MSR / MSE   P(F)
Residual      n - k = n - 2   SSE    SSE / df
Total         n - 1           SST

(k = 2 estimated parameters in the simple linear model)

If P(F) < \alpha, then we know that we get significantly better prediction of Y from the regression model than by just predicting the mean of Y. The ANOVA is used to test the significance of the regression model.
Hypothesis Tests for Regression Coefficients

H_0: \beta_i = 0
H_1: \beta_i \neq 0

t_{(n-k-1)} = \frac{b_i - \beta_i}{S_{b_i}}
Hypothesis Tests for Regression Coefficients

H_0: \beta_1 = 0
H_A: \beta_1 \neq 0

t_{(n-k-1)} = \frac{b_1 - \beta_1}{S_e(b_1)} = \frac{b_1 - \beta_1}{\sqrt{\frac{S_e^2}{S_{xx}}}}
Confidence Interval on Regression Coefficients of the Linear Model

b_1 - t_{\alpha/2,(n-k-1)} \sqrt{\frac{S_e^2}{S_{xx}}} \le \beta_1 \le b_1 + t_{\alpha/2,(n-k-1)} \sqrt{\frac{S_e^2}{S_{xx}}}

Confidence interval for \beta_1.
Hypothesis Tests on Regression Coefficients

H_0: \beta_0 = 0
H_A: \beta_0 \neq 0

t_{(n-k-1)} = \frac{b_0 - \beta_0}{S_e(b_0)} = \frac{b_0 - \beta_0}{\sqrt{S_e^2 \left(\frac{1}{n} + \frac{\bar{X}^2}{S_{xx}}\right)}}
Confidence Interval on Regression Coefficients

b_0 - t_{\alpha/2,(n-k-1)} \sqrt{S_e^2\left(\frac{1}{n} + \frac{\bar{X}^2}{S_{xx}}\right)} \le \beta_0 \le b_0 + t_{\alpha/2,(n-k-1)} \sqrt{S_e^2\left(\frac{1}{n} + \frac{\bar{X}^2}{S_{xx}}\right)}

Confidence interval for the intercept.
Hypothesis Test on the Correlation Coefficient

Using a t-test:

H_0: \rho = 0
H_A: \rho \neq 0

T_0 = \frac{R\sqrt{n - 2}}{\sqrt{1 - R^2}}

We would reject the null hypothesis if |t_0| > t_{\alpha/2, n-2}.
Diagnostic Tests For Regressions

[Figure: Expected distribution of residuals (\varepsilon_i against \hat{Y}_i) for a linear model with a normal distribution of residuals (errors)]

[Figure: Residuals for a non-linear fit]

[Figure: Residuals for a quadratic function or polynomial]

[Figure: Residuals that are not homogeneous (increasing in variance)]
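Such diagnostic plots are easy to produce. A minimal sketch with matplotlib (assumed available; the helper name is ours) that fits by least squares and plots residuals against fitted values; look for curvature (non-linearity) or a funnel shape (unequal variance):

```python
import matplotlib.pyplot as plt  # assumed available

def residual_plot(x, y):
    """Plot residuals against fitted values from a least-squares fit."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    a = ybar - b * xbar
    fitted = [a + b * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    plt.scatter(fitted, residuals)
    plt.axhline(0, linestyle="--")  # a patternless band around 0 is what we hope to see
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()
```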
Regression – important points

1. Ensure that the range of values sampled for the predictor variable is large enough to capture the full range of responses by the response variable. This means that the range of the sampled variable should be wide enough to accommodate all values of the response variable.

[Figure: Two scatter plots contrasting a narrow sampled range of X with a sufficiently wide one]
Regression – important points

2. Ensure that the distribution of predictor values is approximately uniform within the sampled range.

[Figure: Two scatter plots contrasting uneven and approximately uniform coverage of X]
Readings

• Howell (2004) – Fundamentals – Regression (Ch 10)
• Howell (2007) – Methods – Correlation & Regression (Ch 9)
• Francis (2004) – Relationships Between Metric Variables – Section 3.1
Linear Regression Assumptions
• The values of the dependent variable Y should be Normally distributed for each value of the independent variable X (needed for hypothesis testing and confidence intervals)
• The variability of Y (variance or standard deviation) should be the same for each value of X (homoscedasticity)
• The relationship between the two variables should be linear
• The observations should be independent or uncorrelated
• Both variables do not have to be random: the values of X do not have to be random, and they don't have to be Normally distributed
Cigarettes (xi)  CHD (yi)  xi - x̄   yi - ȳ     Cigarettes (xi)  CHD (yi)  xi - x̄   yi - ȳ
11               26         5.05     11.48      5                 4        -0.95    -10.52
9                21         3.05      6.48      5                18        -0.95      3.48
9                24         3.05      9.48      5                12        -0.95     -2.52
9                21         3.05      6.48      5                 3        -0.95    -11.52
8                19         2.05      4.48      4                11        -1.95     -3.52
8                13         2.05     -1.52      4                15        -1.95      0.48
8                19         2.05      4.48      4                 6        -1.95     -8.52
6                11         0.05     -3.52      3                13        -2.95     -1.52
6                23         0.05      8.48      3                 4        -2.95    -10.52
5                15        -0.95      0.48      3                14        -2.95     -0.52
5                13        -0.95     -1.52
Mean: x̄ = 5.95, ȳ = 14.52
Making a prediction
• Assume that we want to predict CHD mortality when cigarette consumption is 6.

    \hat{Y} = bX + a = 2.04X + 2.37

    \hat{Y} = 2.04 \times 6 + 2.37 = 14.61

• We predict that 14.61 people per 10,000 in that country will die of coronary heart disease.
Accuracy of prediction
• Finnish smokers smoke 6 cigarettes/adult/day
• We predict 14.61 deaths/10,000
• They actually observed 23 deaths/10,000
• Our error ("residual") = 23 - 14.61 = 8.39

[Figure: Scatterplot of CHD mortality per 10,000 against cigarette consumption per adult per day, marking the prediction at X = 6 and the residual for Finland]
Errors of prediction
• Residual variance
  – The variability of the observed values around the predicted values

    s^2_{Y \cdot \hat{Y}} = \frac{\sum (Y - \hat{Y})^2}{N - 2}

• Standard error of estimate
  – The standard deviation of the observed values around the predicted values

    s_{Y \cdot \hat{Y}} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N - 2}}

• A common measure of the accuracy of our predictions
  – We want it to be as small as possible
  – It has an inverse relationship to r^2 (i.e., when r^2 is large, the standard error of the estimate will be small, and vice versa)
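A minimal sketch of the standard error of estimate in Python (the helper name is ours), following the formula above:

```python
import math

def standard_error_of_estimate(x, y):
    """s = sqrt(sum((Y - Yhat)^2) / (N - 2)): the typical size of a residual."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    a = ybar - b * xbar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (n - 2))
```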
Explained variance
• r = .71
• r^2 = .71^2 = .51
• Approximately 50% of the variability in the incidence of CHD mortality is associated with variability in smoking.
Residuals
Regression Coefficient
• Regression coefficient:
  – this is the slope of the regression line
  – indicates the strength of the relationship between the two variables
  – interpreted as the expected change in y for a one-unit change in x
  – can calculate a standard error for the regression coefficient
  – can calculate a confidence interval for the coefficient
  – can test the hypothesis that b = 0, i.e., that there is no relationship between the two variables
Intercept
• Intercept:
  – the estimated intercept a gives the value of y that is expected when x = 0
  – often not very useful, as in many situations it may not be realistic or relevant to consider x = 0
  – it is possible to get a confidence interval and to test the null hypothesis that the intercept is zero, and most statistical packages will report these
Coefficient of Determination, R-Squared
• The coefficient of determination or R-squared is the amount of variability in the data set that is explained by the statistical model
  – Often expressed as a percentage
  – A high R-squared says that the majority of the variability in the data is explained by the model (good!)
• Used as a measure of how good predictions from the model will be
• In linear regression, R-squared is the square of the correlation coefficient
• The regression analysis can be displayed as an ANOVA table; many statistical packages present the regression analysis in this format
Adjusted R-Squared
• Adjusted R-squared
  – Sometimes an adjusted R-squared will be presented in the output as well as the R-squared
  – Adjusted R-squared is a modification to the R-squared to compensate for the number of explanatory or predictor variables in the model (more relevant when considering multiple regression)
  – The adjusted R-squared will only increase if the addition of the new predictor improves the model more than would be expected by chance
Interpolation and Extrapolation
• Interpolation
  – Making a prediction for Y within the range of values of the predictor X in the sample used in the analysis
  – Generally this is fine
• Extrapolation
  – Making a prediction for Y outside the range of values of the predictor X in the sample used in the analysis
  – No way to check linearity outside the range of values sampled; not a good idea to predict outside this range
