Topic 08. Simple Linear Regression
So far we have been concerned with analyses of differences, and in doing so we have considered measuring n subjects on a single outcome variable (or two groups of n subjects on one variable). Such measurements yield a univariate frequency distribution, and the analysis is often referred to as univariate analysis. Now we consider n subjects where each subject has two measures available; in other words, we have two variables per subject, say x and y. Our interest in this kind of data is obviously to measure the relationship between the two variables. We can plot the value of y against the value of x in a scatter diagram and assess whether the value of y varies systematically with variation in the value of x. But we still want a single summary measure of the strength of the relationship between x and y.
In his book "Natural Inheritance", Francis Galton wrote: "each peculiarity in a man is shared by his kinsman, but on the average, in a less degree. For example, while tall fathers would tend to have tall sons, the sons would be on the average shorter than their fathers, and sons of short fathers, though having heights below the average for the entire population, would tend to be taller than their fathers." He then formulated a phenomenon called the "law of universal regression", which was the origin of the topic we are studying now. Today, the characteristic of returning from extreme values toward the average of the full population is well recognised and is termed "regression toward the mean".
In a previous topic, we stated that if X and Y are independent variables, then the variance of the sum (or difference) of X and Y is equal to the variance of X plus the variance of Y. Analogously to the single-variable case, if we have two random variables X and Y, where X may be the height of a father and Y the height of his daughter, their variances can be estimated by:

s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2     [1]

and   s_y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2

respectively. If X and Y are independent, then (just as c^2 = a^2 + b^2 in a right-angled triangle):

s_{X+Y}^2 = s_X^2 + s_Y^2     [2]

What happens if X and Y are not independent? Before discussing this problem, we introduce the concepts of covariance and correlation.
Let us now discuss X and Y in the context of genetics. Let X be the bone mineral density (BMD) of a father and Y the BMD of his daughter. We can construct another measure of the relationship between X and Y by multiplying each father's deviation from the mean, (x_i - \bar{x}), by the corresponding deviation of his daughter, (y_i - \bar{y}), instead of squaring the father's or the daughter's deviation, before summation. We refer to this quantity as the covariance between X and Y, denoted by Cov(X, Y); that is:

\mathrm{cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})     [3]

By definition, and analogous to the cosine law for a general triangle, if X and Y are not independent, then:

\sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2 + 2\,\mathrm{Cov}(X, Y)     [4]
Several points are worth noting:
(a) Variances as defined in [1] are always positive, since they are derived from sums of squares, whereas covariances as defined in [3] are derived from sums of cross-products of deviations and so may be either positive or negative.
(b) A positive covariance indicates that deviations from the mean in one distribution, say the fathers' BMDs, are preponderantly accompanied by deviations in the other, say the daughters' BMDs, in the same direction (positive or negative).
(c) A negative covariance, on the other hand, indicates that deviations in the two distributions are preponderantly in opposite directions.
(d) When a deviation in one of the distributions is equally likely to be accompanied by a deviation of like or opposite sign in the other, the covariance, apart from errors of random sampling, will be zero.
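As a quick numerical check of the covariance formula [3] and the identity [4], the following minimal Python sketch uses made-up paired values (purely illustrative, not the BMD data discussed here):

import numpy as np

# Hypothetical paired measurements (illustrative values only, not real BMD data)
x = np.array([0.82, 0.91, 0.75, 0.88, 0.95, 0.79, 0.86, 0.90])
y = np.array([0.80, 0.93, 0.74, 0.85, 0.97, 0.77, 0.83, 0.92])

n = len(x)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # equation [3]

# Sample version of identity [4]: var(X + Y) = var(X) + var(Y) + 2 cov(X, Y)
lhs = np.var(x + y, ddof=1)
rhs = np.var(x, ddof=1) + np.var(y, ddof=1) + 2 * cov_xy
print(cov_xy, lhs, rhs)   # lhs and rhs agree exactly for these sample quantities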
If BMD is under genetic control, we should expect fathers with high BMD generally to have daughters with high BMD, and fathers with low BMD generally to have daughters with low BMD. In other words, we should expect the two measurements to have a positive covariance. Lack of genetic control would produce a covariance of zero. It was by this means that Galton first showed stature in man to be under genetic control: he found that the covariance of parent and offspring, and also that of pairs of siblings, was positive.
The size of the covariance relative to some standard gives a measure of the strength of the association between the relatives. The standard taken is that afforded by the variances of the two separate distributions, in our case the fathers' BMD and the daughters' BMD. We may compare the covariance with these variances separately, and we do this by calculating the regression coefficients, which have the forms:

\frac{\mathrm{Cov}(X, Y)}{\mathrm{var}(X)}   (regression of daughters on fathers)

or

\frac{\mathrm{Cov}(X, Y)}{\mathrm{var}(Y)}   (regression of fathers on daughters)

We can also compare the covariance with the two variances at once:

r = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{var}(X) \times \mathrm{var}(Y)}} = \frac{\mathrm{Cov}(X, Y)}{s_x \times s_y}     [5]
With some algebraic manipulation, we can show that [5] can be written in another
way:
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{(n-1)\, s_x s_y}     [6]
One obvious question is whether an observed coefficient of correlation (r) is significantly different from zero. Under the null hypothesis that there is no association in the population (ρ = 0), it can be shown that the statistic:

t = r \sqrt{\frac{n-2}{1-r^2}}

follows a t distribution with n − 2 degrees of freedom, and it can therefore be used to test this hypothesis.
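The calculation of r from [3] and [5] and the t test just described can be sketched in Python as follows; the data below are made up for illustration and are not the age-cholesterol data of Example 1:

import math
import numpy as np
from scipy import stats

# Made-up paired data for illustration (not the data of Example 1)
x = np.array([24, 31, 38, 45, 52, 58, 63, 47, 36, 29], dtype=float)
y = np.array([2.3, 2.6, 3.1, 3.4, 3.9, 4.1, 4.4, 3.5, 2.9, 2.7])

n = len(x)
sx, sy = x.std(ddof=1), y.std(ddof=1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # equation [3]
r = cov_xy / (sx * sy)                                       # equation [5]

# Test of H0: rho = 0 using t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 df
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(f"r = {r:.3f}, t = {t_stat:.2f}, p = {p_value:.4f}")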
On the other hand, for a moderate or large sample size, we can set up a 95% confidence interval for r by using the theoretical distribution of r. It can be shown that the sampling distribution of r is not Normal. We can, however, transform r to an approximately Normally distributed quantity by using the so-called Fisher's transformation, in which:

z = \frac{1}{2} \ln\frac{1+r}{1-r}     [7]

with standard error

SE(z) = \frac{1}{\sqrt{n-3}}     [8]
Thus, an approximate 95% confidence interval on the z scale is:

z - \frac{1.96}{\sqrt{n-3}}   to   z + \frac{1.96}{\sqrt{n-3}}

Of course, we can back-transform these limits to obtain a 95% confidence interval for r (this is left as an exercise).
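A short sketch of the Fisher transformation [7]-[8] and of the back-transformation of the limits (the inverse of [7] is r = tanh(z)); the values r = 0.70 and n = 50 are arbitrary illustrative inputs:

import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation via Fisher's z, [7]-[8]."""
    z = 0.5 * math.log((1 + r) / (1 - r))   # equation [7]
    se = 1.0 / math.sqrt(n - 3)             # equation [8]
    lo, hi = z - z_crit * se, z + z_crit * se
    # back-transform the limits from the z scale to the r scale: r = tanh(z)
    return math.tanh(lo), math.tanh(hi)

print(fisher_ci(0.70, 50))   # roughly (0.52, 0.82)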
For the age (x) and cholesterol (y) data of Example 1, the covariance is:

\mathrm{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\right] = 10.68

so that

r = \frac{\mathrm{Cov}(X, Y)}{s_x s_y} = \frac{10.68}{13.596 \times 0.838} = 0.937.
To test the significance of r, we convert it to the z score given in [7]:

z = \frac{1}{2}\ln\frac{1+r}{1-r} = \frac{1}{2}\ln\frac{1 + 0.937}{1 - 0.937} = 1.713

with

SE(z) = \frac{1}{\sqrt{n-3}} = \frac{1}{\sqrt{18-3}} = 0.2582

The ratio is then 1.713 / 0.2582 = 6.63, which far exceeds the 5% critical value of 1.96 of the standard Normal distribution; we conclude that there is an association between age and cholesterol in this sample of subjects. //
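The arithmetic above can be reproduced directly from r = 0.937 and n = 18:

import math

r, n = 0.937, 18
z = 0.5 * math.log((1 + r) / (1 - r))   # Fisher's z, equation [7]
se = 1.0 / math.sqrt(n - 3)             # equation [8]
print(round(z, 3), round(se, 4), round(z / se, 2))   # 1.713, 0.2582, 6.63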
Suppose that we have two sample coefficients of correlation, r1 and r2, which are estimates of two unknown population coefficients ρ1 and ρ2, respectively. Suppose further that r1 and r2 were derived from two independent samples of n1 and n2 subjects, respectively. To test the hypothesis that ρ1 = ρ2 versus the alternative hypothesis that ρ1 ≠ ρ2, we first convert these sample coefficients into z scores:

z_1 = \frac{1}{2}\ln\frac{1+r_1}{1-r_1}   and   z_2 = \frac{1}{2}\ln\frac{1+r_2}{1-r_2}

By theory, the statistic z_1 - z_2 is distributed about the mean

\mathrm{Mean}(z_1 - z_2) = \frac{\rho}{2(n_1-1)} - \frac{\rho}{2(n_2-1)}

with variance

\mathrm{Var}(z_1 - z_2) = \frac{1}{n_1-3} + \frac{1}{n_2-3}

If the samples are not small, or if n1 and n2 are not very different, the statistic

t = \frac{z_1 - z_2}{\sqrt{\dfrac{1}{n_1-3} + \dfrac{1}{n_2-3}}}

can be treated as an approximately standard Normal deviate under the null hypothesis and used to test ρ1 = ρ2.
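A sketch of this two-sample comparison, treating the statistic as an approximately standard Normal deviate under the null hypothesis; the sample values passed to the function are arbitrary illustrative numbers:

import math
from scipy import stats

def compare_correlations(r1, n1, r2, n2):
    """Test H0: rho1 = rho2 for two independent samples using Fisher's z."""
    z1 = 0.5 * math.log((1 + r1) / (1 - r1))
    z2 = 0.5 * math.log((1 + r2) / (1 - r2))
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    stat = (z1 - z2) / se                  # approximately N(0, 1) under H0
    p = 2 * stats.norm.sf(abs(stat))
    return stat, p

print(compare_correlations(0.60, 80, 0.35, 95))   # illustrative inputs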
II. SIMPLE LINEAR REGRESSION ANALYSIS
We now extend the idea of correlation into a rather more mechanical concept called regression analysis. Before doing so, let us briefly recall this idea in its historical context. As mentioned earlier, in 1885 Francis Galton introduced the concept of "regression" in a study demonstrating that offspring do not tend toward the size of their parents, but rather toward the average of the population. The method of regression has, however, a longer history. In fact, the legendary French mathematician Adrien Marie Legendre published the first work on regression (although he did not use the word) in 1805. Still, the credit for the discovery of the method of least squares is generally given to Carl Friedrich Gauss (another legendary mathematician), who used the procedure in the early part of the 19th century.

The much-used (and perhaps overused) cliches of data analysis, "garbage in - garbage out" and "the results are only as good as the data that produced them", apply to the building of regression models. If the data do not reflect a trend involving the variables, there will be no success in model development or in drawing inferences about the system. Even when some type of relationship does exist, this does not imply that the data will reveal it in a clearly detectable fashion.

Many of the ideas and principles used in fitting linear models to data are best illustrated using simple linear regression. These ideas can be extended to more complex modelling techniques once the basic concepts necessary for model development, fitting and assessment have been discussed.
Example 1 (continued): The plot of cholesterol (y-axis) versus age (x-axis) yields
the following relationship:
[Scatter plot of cholesterol (vertical axis, scale 0 to 5) against age (horizontal axis, 20 to 70 years).]
From this graph, we can see that cholesterol level seems to vary systematically with age (which was confirmed earlier by the correlation analysis); moreover, the data points seem to scatter around the line connecting the two points (20, 2.2) and (65, 4.5). We learned earlier (in Topic 1) that for any two given points in a two-dimensional space we can construct a straight line through them. The same principle applies here, although the technique of estimation is slightly more complicated. We describe the relationship between y (cholesterol) and x (age) by the simple linear regression model:
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i     [8]
In this model, β0 and β1 are unknown parameters to be estimated from the observed data, and ε is a random error (or departure) term representing the level of inconsistency present in repeated observations under similar experimental conditions. To proceed with the parameter estimation, we have to make some assumptions. The values of x are assumed to be fixed (measured without error), and on the random error ε we assume that the ε's are:
- independent of one another;
- Normally distributed with mean zero; and
- of constant variance σ² for all values of x.

Because β0 and β1 are parameters (hence constants) and the value of x is fixed, we can obtain the expected value of [8] as:

E(y_i) = \beta_0 + \beta_1 x_i     [9]

and, since the ε's have constant variance,

\mathrm{var}(y_i) = \sigma^2     [10]
The parameters are estimated by the method of least squares, that is, by choosing b0 and b1 (the estimates of β0 and β1) to minimise the sum of squared departures:

Q = \sum_{i=1}^{n} \left[ y_i - (b_0 + b_1 x_i) \right]^2

Setting the partial derivatives of Q with respect to b0 and b1 equal to zero yields the normal equations:

\sum y_i = n b_0 + b_1 \sum x_i

\sum x_i y_i = b_0 \sum x_i + b_1 \sum x_i^2
Solving these two equations gives:

b_1 = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}     [11]

and

b_0 = \bar{y} - b_1 \bar{x}     [12]
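The estimates [11] and [12] can be computed directly from the data, as in the following sketch (again with made-up illustrative data rather than the data of Example 1); np.polyfit is used only as a cross-check:

import numpy as np

# Made-up illustrative (x, y) pairs
x = np.array([24, 31, 38, 45, 52, 58, 63, 47, 36, 29], dtype=float)
y = np.array([2.3, 2.6, 3.1, 3.4, 3.9, 4.1, 4.4, 3.5, 2.9, 2.7])

sxx = np.sum((x - x.mean()) ** 2)               # corrected sum of squares of x
sxy = np.sum((x - x.mean()) * (y - y.mean()))   # corrected sum of cross-products

b1 = sxy / sxx                 # slope, equation [11]
b0 = y.mean() - b1 * x.mean()  # intercept, equation [12]

y_hat = b0 + b1 * x            # predicted (fitted) values
e = y - y_hat                  # residuals

print(b0, b1)
print(np.polyfit(x, y, 1))     # cross-check: returns [slope, intercept]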
Example 1 (continued):

b_1 = \frac{\mathrm{Cov}(x, y)}{\mathrm{var}(x)} = \frac{10.68}{184.85} = 0.0578

and, from [12], b_0 = \bar{y} - b_1\bar{x} = 1.089, so that the fitted regression line is:

y = 1.089 + 0.0578x

That is, for any individual, his/her cholesterol is given by the equation:

y = 1.089 + 0.0578x + e

where e is the specific error (including measurement error) associated with the subject which is not accounted for by the equation. For instance, for subject 1 (46 years old), the expected cholesterol is 1.089 + 0.0578 × 46 ≈ 3.7475; compared with the actual value of 3.5, the residual is e = 3.5 − 3.7475 = −0.2475. Similarly, the expected cholesterol value for subject 2 (20 years old) is 1.089 + 0.0578 × 20 = 2.245, which is higher than his/her actual level by 0.3450.

The predicted values calculated using the above equation, together with the residuals (e), are tabulated in the following table.
I.D.   Observed (O)   Predicted (P)   Residual (e = O − P)

[The full table of observed values, predicted values and residuals is not reproduced here.]
To a large extent, interest will lie in the value of the slope. Interpretation of this parameter is meaningless without a knowledge of its distribution. Therefore, having calculated the estimates b1 and b0, we need to determine the standard errors of these parameters so that we can make inferences regarding their significance in the model. Before doing this, let us have a brief look at the role of the term e.

We learned in an earlier topic that if \bar{y} is the sample mean of a variable Y, then the variance of Y is estimated by \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2. In the regression case, the mean of y_i is no longer estimated by \bar{y} but by the fitted value \hat{y}_i = b_0 + b_1 x_i. Hence, it is reasonable that the sample variance of the residuals e should provide an estimator of σ² in [10]. It is from this reasoning that the unbiased estimate of σ² is defined as:

s^2 = \frac{1}{n-2}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n-2}\sum_{i=1}^{n} e_i^2     [13]
It can be shown that the expected values of b1 and b0 are β1 and β0 (the true parameters), respectively. Furthermore, from [13], it can be shown that the variances of b1 and b0 are:

\mathrm{var}(b_1) = \frac{s^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}     [14]

\mathrm{var}(b_0) = s^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right]     [15]
That is, b1 is Normally distributed with mean β1 and variance given in [14], and b0 is Normally distributed with mean β0 and variance given in [15]. It follows that a test for the significance of b1 is the ratio:

t = \frac{b_1}{SE(b_1)} = \frac{b_1}{\sqrt{s^2 / \sum (x_i - \bar{x})^2}}

and

t = \frac{b_0}{SE(b_0)} = \frac{b_0}{\sqrt{s^2\left[1/n + \bar{x}^2/\sum (x_i - \bar{x})^2\right]}}

is a test for b0; each is distributed according to the t distribution with n − 2 df.
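The residual variance [13], the standard errors implied by [14] and [15], and the corresponding t ratios can be computed as follows (same illustrative data as in the previous sketch, repeated so the block is self-contained):

import numpy as np
from scipy import stats

x = np.array([24, 31, 38, 45, 52, 58, 63, 47, 36, 29], dtype=float)
y = np.array([2.3, 2.6, 3.1, 3.4, 3.9, 4.1, 4.4, 3.5, 2.9, 2.7])

n = len(x)
sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx      # [11]
b0 = y.mean() - b1 * x.mean()                           # [12]
e = y - (b0 + b1 * x)

s2 = np.sum(e ** 2) / (n - 2)                           # residual variance, [13]
se_b1 = np.sqrt(s2 / sxx)                               # standard error from [14]
se_b0 = np.sqrt(s2 * (1.0 / n + x.mean() ** 2 / sxx))   # standard error from [15]

t_b1 = b1 / se_b1
t_b0 = b0 / se_b0
p_b1 = 2 * stats.t.sf(abs(t_b1), df=n - 2)              # two-sided p value, n - 2 df
print(t_b1, p_b1, t_b0)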
Example 1 (continued):

We can calculate the corrected sum of squares of AGE, \sum_{i=1}^{n}(x_i - \bar{x})^2, by working out:

\sum_{i=1}^{n}(x_i - \bar{x})^2 = s_x^2 (n - 1) = 184.85 \times 17 = 3142.45

With s² = 0.0916 (the residual mean square obtained in the analysis of variance below), this gives

SE(b_1) = \sqrt{\frac{0.0916}{3142.45}} = 0.0054

and

t = b_1 / SE(b_1) = 0.0578 / 0.0054 = 10.70
Similarly,

\mathrm{var}(b_0) = s^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}\right] = 0.0916 \times \left[\frac{1}{18} + \frac{(38.83)^2}{3142.45}\right] = 0.049

so that SE(b_0) = \sqrt{0.049} = 0.221 and

t = b_0 / SE(b_0) = 1.089 / 0.221 = 4.92
Alternatively, the significance of the regression can be assessed through an analysis of variance, in which the total variation in y is partitioned as:

\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

that is, the total sum of squares (SSTO) is equal to the regression sum of squares (SSR) plus the residual (error) sum of squares (SSE).
Now, SSTO is associated with n − 1 df. For SSR, there are two parameters (b0 and b1) in the model, but the constraint \sum_{i=1}^{n} (\hat{y}_i - \bar{y}) = 0 takes away 1 df, so it has 1 df. For SSE, there are n residuals (e_i); however, 2 df are lost because of the two constraints on the e_i's associated with estimating the parameters β0 and β1 by the two normal equations (see section 2.1). The analysis of variance table is therefore:
Source             df       SS                               MS
Regression         1        SSR = Σ(ŷ_i − ȳ)²                MSR = SSR / 1
Residual error     n − 2    SSE = Σ(y_i − ŷ_i)²              MSE = SSE / (n − 2)
Total              n − 1    SSTO = Σ(y_i − ȳ)²
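The sums of squares in this table, together with the R² and F statistics discussed in the next two sections (equations [17] and [18]), can be assembled as follows; the data are the same illustrative values used in the earlier sketches:

import numpy as np
from scipy import stats

x = np.array([24, 31, 38, 45, 52, 58, 63, 47, 36, 29], dtype=float)
y = np.array([2.3, 2.6, 3.1, 3.4, 3.9, 4.1, 4.4, 3.5, 2.9, 2.7])

n = len(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ssr = np.sum((y_hat - y.mean()) ** 2)    # regression SS, 1 df
sse = np.sum((y - y_hat) ** 2)           # residual SS, n - 2 df
ssto = np.sum((y - y.mean()) ** 2)       # total SS, n - 1 df (= SSR + SSE)

msr = ssr / 1
mse = sse / (n - 2)
f_ratio = msr / mse                      # equation [18]
p_value = stats.f.sf(f_ratio, 1, n - 2)  # upper-tail F probability
r_square = ssr / ssto                    # equation [17]
print(f_ratio, p_value, r_square)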
R-SQUARE
From this table, it seems sensible to obtain a "global" statistic to indicate how well the model fits the data. If we divide the regression sum of squares (the variation due to the regression model, SSR) by the total variation of Y (SSTO), we obtain what statisticians call the coefficient of determination, denoted by R²:

R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}     [17]

In fact, it can be shown that the square of the coefficient of correlation r defined in [5] is equal to R².
Obviously, R² is restricted to 0 ≤ R² ≤ 1. An R² = 0 indicates that there is no linear association between X and Y, whereas an R² = 1 indicates that Y is completely determined by X. However, there are pitfalls in this statistic. A value of R² = 0.75 is likely to be viewed with some satisfaction by experimenters, but it is often more appropriate to recognise that there is still another 25% of the total variation unexplained by the model. We must ask why this could be, and whether a more complex model and/or the inclusion of additional independent variables could explain much of this apparently residual variation.

A large R² value does not necessarily mean a good model. Indeed, R² can be artificially high when either the slope of the equation is large or the spread of the independent variable is large. A large R² can also be obtained when straight lines are fitted to data that display non-linear relationships. Additional methods for assessing the fit of a model are therefore needed and will be described later.
F STATISTIC
An assessment of the significance of the regression (or a test of the hypothesis that β1 = 0) is made from the ratio of the regression mean square (MSR) to the residual mean square MSE (s²), which is an F ratio with 1 and n − 2 degrees of freedom. This calculation is usually exhibited in an analysis of variance table produced by most computer programs.

F = \frac{MSR}{MSE}     [18]

It is important that a highly significant F ratio should not seduce the experimenter into believing that the straight line fits the data superbly. The F test is simply an assessment of the extent to which the fitted line has a slope different from zero. If the slope of the line is near zero, the scatter of the data points about the line would need to be small in order to obtain a significant F ratio; conversely, a slope very different from zero can give a highly significant F ratio even with considerable scatter of points about the line.
The F test as defined in [18] is actually equivalent to the t test for b1 given earlier (in fact, F = t²). The F test can therefore be used for testing β1 = 0 versus β1 ≠ 0, but not for testing one-sided alternatives.
Example 1 (continued): For the age and cholesterol data, SSR = \sum (\hat{y}_i - \bar{y})^2 = 10.4944 and SSE = \sum (y_i - \hat{y}_i)^2 = 1.4656, giving the following analysis of variance table:

Source             df    SS        MS        F-test
Regression         1     10.4944   10.4944   114.565
Residual error     16    1.4656    0.0916
Total              17    11.960
A residual is defined as the difference between the observed and predicted y values, e_i = y_i - \hat{y}_i, the part of y_i which is not accounted for by the regression equation. Hence, an examination of this term should reveal how appropriate the equation is. However, these residuals do not have constant variance. In fact, var(e_i) = (1 - h_i)s², where h_i is the ith diagonal element of the matrix H, which is such that \hat{y} = Hy. H is called the "hat matrix", since it defines the transformation that puts the "hat" on y! In view of this, it is preferable to work with the standardised residuals. In the simple linear regression case, the standardised residual r_i is defined as:

r_i = \frac{e_i}{\sqrt{MSE\,(1 - h_i)}}     [19]

These standardised residuals have mean 0 and variance (approximately) 1. We can use the r_i to verify the assumptions of the regression model which we made in section 2.1, namely independence, constant variance and Normality of the error terms, and the adequacy of the straight-line model.
Useful graphical methods for examining the standard assumptions of constant variance, Normality of the error terms and appropriateness of the fitted model include (a sketch of these diagnostics is given below):
- A plot of the residuals (or standardised residuals) against the fitted values, to detect non-constant variance or an inappropriate model form.
- A Normal probability plot (or histogram) of the residuals, to check the Normality assumption.
- A plot of residuals in the order in which the observations were taken, to detect non-independence.
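A sketch of these diagnostic plots for simple linear regression. In this case the leverages have the closed form h_i = 1/n + (x_i − x̄)²/Σ(x_j − x̄)², from which the standardised residuals [19] follow; matplotlib is assumed for plotting, and the data are the same illustrative values as before:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([24, 31, 38, 45, 52, 58, 63, 47, 36, 29], dtype=float)
y = np.array([2.3, 2.6, 3.1, 3.4, 3.9, 4.1, 4.4, 3.5, 2.9, 2.7])

n = len(x)
sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
e = y - y_hat

mse = np.sum(e ** 2) / (n - 2)
h = 1.0 / n + (x - x.mean()) ** 2 / sxx   # leverages (diagonal of the hat matrix)
r_std = e / np.sqrt(mse * (1 - h))        # standardised residuals, equation [19]

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
axes[0].scatter(y_hat, r_std)                  # check constant variance / model form
axes[0].axhline(0, linestyle="--")
axes[0].set(xlabel="fitted value", ylabel="standardised residual")
axes[1].scatter(np.arange(1, n + 1), r_std)    # check independence over observation order
axes[1].axhline(0, linestyle="--")
axes[1].set(xlabel="observation order", ylabel="standardised residual")
# a Normal probability plot of r_std (e.g. scipy.stats.probplot) would check Normality
plt.tight_layout()
plt.show()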
OUTLIERS
Outliers in regression are observations that are not well fitted by the assumed model. Such observations will have large residuals. A crude rule of thumb is that an observation with a standardised residual greater than 2.5 in absolute value is an outlier, and the source of that data point should be investigated if possible. More often than not, the only evidence that something has gone wrong in the data-generating process is provided by the outliers themselves! A sensible way of proceeding with the analysis is to determine whether such values have a substantial effect on the inferences to be drawn from the regression analysis, that is, whether they are influential.
INFLUENTIAL OBSERVATIONS
An observation with a large leverage h_i (a large diagonal element of the hat matrix) has the potential to influence the fit strongly. There is, however, a drawback to using the leverage to identify influential values: it does not contain any information about the value of the Y variable, only the values of the X variable(s). To detect an influential observation, a natural statistic to use is a scaled version of \sum_j (\hat{y}_j - \hat{y}_{j(i)})^2, where \hat{y}_{j(i)} is the fitted value for the jth observation when the ith observation is omitted from the fit. This leads to the so-called Cook's statistic. Fortunately, to obtain the value of this statistic we do not need to carry out a regression fit omitting each point in turn, since the statistic is given by:

D_i = \frac{r_i^2 \, h_i}{p(1 - h_i)}

where p is the number of parameters in the model (p = 2 for simple linear regression).
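Cook's statistic can be computed from the standardised residuals and leverages without refitting the model, as in this sketch (the 4/n cut-off in the last line is only a common rough guideline, not part of the text above):

import numpy as np

x = np.array([24, 31, 38, 45, 52, 58, 63, 47, 36, 29], dtype=float)
y = np.array([2.3, 2.6, 3.1, 3.4, 3.9, 4.1, 4.4, 3.5, 2.9, 2.7])

n, p = len(x), 2                           # p = number of parameters (b0 and b1)
sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)

mse = np.sum(e ** 2) / (n - p)
h = 1.0 / n + (x - x.mean()) ** 2 / sxx    # leverages
r_std = e / np.sqrt(mse * (1 - h))         # standardised residuals [19]
cooks_d = r_std ** 2 * h / (p * (1 - h))   # Cook's statistic D_i

# Flag observations with relatively large D_i (4/n is a rough, commonly used cut-off)
print(np.where(cooks_d > 4.0 / n)[0])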
[Figure: plot of standardised residuals (vertical axis, approximately −1.5 to 2.5) against values on a 0 to 5 horizontal scale, presumably the predicted cholesterol values of Example 1.]
2.4. SOME FINAL COMMENTS
"Correlation coefficients lie within the ranged -1 to +1, with the midpoint of zero
indicating no linear association between the two variables. A very small correlation does
not necessarily indicate that two variables are not associated, however. To be sure of this,
we should study a plot of the data, because it is possible that the two variables display a
peculiar (i.e. non-linear) relationship. For example, we should not observe much, if any,
correlation between the average midday temperature and calendar moth because there is a
cyclic pattern. More common is the situation of a curved relationship between two
variables, such as between birthweight and length of gestation. In this case, Pearson's r will
underestimate the association as it is a measure of linear association. The rank correlation
coefficient is better here as it assesses in a more general way whether the variables tend to
rise together (or move in opposite direction).
One way of looking at the correlation that helps to modify the over-enthusiasm is to
calculate the R-square value, which is the percentage of the variability of the data that is
"explained" by the association between the two variables. So, a correlation of 0.7 implies
that just 49% of the variability may be put down to the observed association.
When two variables are found to be correlated, there are several possible explanations:
X causes (influences) Y;
Y influences X;
both X and Y are influenced by one or more other variables.
Another common problem of interpretation occurs when we know that each of two variables is associated with a third variable. For example, if X is positively correlated with Y and Y is positively correlated with Z, it is tempting to say that X and Z must be positively correlated. Although this may indeed be true, such an inference is unjustified; we cannot say anything about the correlation between X and Z. The same is true when one has observed no association. For example, in the study of Mazess et al (1984) the correlation between age and height was 0.05 and between weight and %fat was 0.03. This does not imply that the correlation between age and %fat was also near zero; in fact, this correlation was 0.79. Correlation cannot be inferred from indirect associations."
As with other sample estimates (such as a sample mean), there will be uncertainty associated with the estimated slope and intercept. The confidence interval for the whole line and the prediction interval for individual subjects show other aspects of this variability. The latter is especially useful, as regression is often used to make predictions about individuals.

It should be remembered that the regression line should not be used to make predictions for X values outside the range of values in the observed data. Such extrapolation is unjustified, as we have no evidence about the relationship beyond the observed data. A statistical model is only an approximation. One rarely believes, for example, that the true relationship is exactly linear, but the linear regression equation is taken as a reasonable approximation for the observed data. Outside the range of the observed data one cannot safely use the same equation; thus, we should not use the regression equation to predict values beyond what we have observed.
III. EXERCISES
1. Consider the simple regression equation [8] and the least-squares residuals, given by y_i - \hat{y}_i for i = 1, 2, 3, ..., n. Show that:

(a) \frac{1}{n}\sum_{i=1}^{n} \hat{y}_i = \bar{y}     (b) \sum_{i=1}^{n} (\hat{y}_i - \bar{y}) = 0
2. The following data represent diastolic blood pressures taken during rest. The x values
denote the length of time in minutes since rest began, and the y values denote
diastolic blood pressures.
x: 0 5 10 15 20
y: 72 66 70 64 66
3. The following table shows resting metabolic rate (RMR) (kcal/24 hr) and body
weight (kg) of 44 women (Owen et al 1986).
Wt: 49.9 50.8 51.8 52.6 57.6 61.4 62.3 64.9 43.1 48.1 52.2
RMR: 1079 1146 1115 1161 1325 1351 1402 1365 870 1372 1132
Wt: 53.5 55.0 55.0 56.0 57.8 59.0 59.0 59.2 59.5 60.0 62.1
RMR: 1172 1034 1155 1392 1090 982 1178 1342 1027 1316 1574
Wt: 64.9 66.0 66.4 72.8 74.8 77.1 82.0 82.0 83.4 86.2 88.6
RMR: 1526 1268 1205 1382 1273 1439 1536 1151 1248 1466 1323
Wt: 89.3 91.6 99.8 103 104.5 107.7 110.2 122.0 123.1 125.2 143.3
RMR: 1300 1519 1639 1382 1414 1473 2074 1777 1640 1630 1708.
(c) Obtain a 95% confidence interval for the slope of the line.
(d) Is it possible to use an individual's weight to predict their RMR to within 250
kcal/24 hr ?
23 82.7 31.8 0.76
24 87.9 55.4 1.06
25 101.5 110.6 1.38
26 105.0 114.4 1.85
27 110.5 69.3 2.25
28 114.2 84.8 1.76
29 117.8 63.9 1.60
30 122.6 76.1 0.88
31 127.9 112.8 1.70
32 135.6 82.2 0.98
33 136.0 46.8 0.94
34 153.5 137.7 1.76
35 201.1 76.1 0.87
5. Consider the following four data sets. Note that X1 = X2 = X4. Fit a linear regression equation with Y as the dependent variable and X as the independent variable for each data set. What is the most striking feature of these data sets? Produce a residual plot for each data set and comment on the results.

X1 Y1 X2 Y2 X3 Y3 X4 Y4