Simple Regression
1 INTRODUCTION
2 SIMPLE REGRESSION
In one such experiment, the results were recorded in Table 2.1 below.
Temperature (°C)   13    50    63    58    20    78    39    55    29    62
Length (cm)        5.10  5.68  5.85  5.74  5.25  5.98  5.59  5.73  5.46  5.81
Table 2.1
It is obvious from the experiment that the length of the metal rod depends
on its temperature. Thus, temperature is the independent variable and length is
the dependent variable. These will be respectively denoted by X and Y. The first
step in the determination of the equation Y = f(X) is to plot the above results in
the form of points (x, y) on a graph. The resulting figure is known as a scatter
diagram and its importance is that it gives us an idea of whether the points
describe a linear or non-linear relationship between X and Y.
Figs. 2.2 and 2.3 below display possibilities of linear and exponential
relationships respectively. However, for the purpose of this course, we will only
consider linear relationships.
[Fig. 2.2 Linear relationship: scatter diagram of Y against X]
[Fig. 2.3 Exponential relationship: scatter diagram of Y against X]
[Fig. 4.1: scatter diagram of Y against X with the line of best fit, marking the observed value y1, its expected value and the residual e1 at x = x1]
The points indicated by crosses in Fig. 4.1 above are values obtained by
observation (in an experiment) whereas those indicated by black dots are their
expected counterparts. For example, one of the observed pairs was the point
$(x_1, y_1)$; it can be seen that, for that same x-value, there exists an expected
y-value, denoted by $\hat{y}_1$. We could interpret $\hat{y}_1$ as the value which
should have been obtained theoretically, but for experimental or systematic errors.
The difference between the ith observed value and its corresponding expected value
is known as a residual or, simply, an error. It is denoted by $e_i$ and is
mathematically equal to $y_i - \hat{y}_i$ ($e_1$ is indicated in Fig. 4.1).
The principle of the least-squares method is based on minimising the sum
of the squares of the residuals, $\sum e_i^2$. The intricate mathematical
manipulations being beyond the scope of this course, it will simply be mentioned
that, for the regression line of Y on X, the method of least squares yields two
normal equations from which the regression coefficients can be determined.
These equations are given by
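$$\sum y = na + b\sum x$$
$$\sum xy = a\sum x + b\sum x^2$$
Solving these for a and b gives
$$a = \frac{\sum y - b\sum x}{n} \quad\text{and}\quad b = \frac{n\sum xy - \sum x\sum y}{n\sum x^2 - \left(\sum x\right)^2}.$$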
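The two normal equations form a pair of simultaneous linear equations in a and b, so they can be solved directly. The sketch below solves them by Cramer's rule and compares the result with the closed-form expressions for a and b. The function names and the four-point data set (which lies exactly on y = 1 + 2x) are hypothetical, used only to make the check easy to follow.

```python
def solve_normal_equations(xs, ys):
    """Solve sum(y) = n*a + b*sum(x) and sum(xy) = a*sum(x) + b*sum(x^2)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    det = n * sxx - sx * sx  # determinant of the 2x2 system
    if det == 0:
        raise ValueError("all x-values are equal; the line is not determined")
    a = (sy * sxx - sx * sxy) / det  # Cramer's rule for the intercept
    b = (n * sxy - sx * sy) / det    # Cramer's rule for the gradient
    return a, b


def closed_form(xs, ys):
    """Use the formulas for b and then a obtained from the normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = (sy - b * sx) / n
    return a, b


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
print(solve_normal_equations(xs, ys))  # (1.0, 2.0)
print(closed_form(xs, ys))             # (1.0, 2.0)
```

Both routes return the intercept 1 and the gradient 2, as expected for points that lie exactly on a straight line.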
Using the data from our rod-heating experiment, with the x-values rearranged in
ascending order for better understanding, we have Table 4.2:
Temperature (°C)   13    20    29    39    50    55    58    62    63    78
Length (cm)        5.10  5.25  5.46  5.59  5.68  5.73  5.74  5.81  5.85  5.98
Table 4.2
$$b = \frac{n\sum xy - \sum x\sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{(10)(2674.93) - (467)(56.19)}{(10)(25717) - (467)^2} = 0.013$$
$$a = \frac{\sum y - b\sum x}{n} = \frac{56.19 - 467b}{10} = 5.011$$
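The totals used in this calculation can be reproduced from the rod-heating data. The short sketch below computes them and then the two coefficients; the variable names are illustrative only.

```python
# Rod-heating data (temperature in deg C, length in cm), in ascending x order.
xs = [13, 20, 29, 39, 50, 55, 58, 62, 63, 78]
ys = [5.10, 5.25, 5.46, 5.59, 5.68, 5.73, 5.74, 5.81, 5.85, 5.98]

n = len(xs)
sum_x = sum(xs)                                # 467
sum_y = sum(ys)                                # 56.19
sum_xy = sum(x * y for x, y in zip(xs, ys))    # 2674.93
sum_x2 = sum(x * x for x in xs)                # 25717

# Gradient first, then the intercept, which depends on the unrounded gradient.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print(f"b = {b:.3f}, a = {a:.3f}")  # b = 0.013, a = 5.011
```

Note that a is computed from the unrounded value of b; substituting the rounded 0.013 would give a slightly different intercept.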
The constant a, the value of y when x = 0, represents the length of the rod
at 0 °C, that is 5.011 cm, whereas b (the gradient) is the rate of change of length
with temperature, that is, for every rise of 1 °C, the length of the rod increases by
0.013 cm.
5 PREDICTION
The ultimate aim of regression being prediction, once the equation of the
regression line of Y on X has been determined, we can use it to find values of Y
for given values of X.
5.1 Interpolation
If prediction is made for those x-values lying within the range of values of
X given in the table, the process is known as interpolation. Note that the value of
Y can also be read off a graph of the regression line or obtained from a calculator.
5.2 Extrapolation
If prediction is made for x-values lying outside the range of values of X
given in the table, the process is known as extrapolation, and the forecast is in
general unreliable. However, concessions may be made for values of X lying very
near the least and greatest values in the table, on the assumption that it is very
probable that, near those regions, Y will follow the same pattern as described by
the regression equation. We should bear in mind that the further the given value
of X is from the table values, the less reliable the forecast will be.
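The distinction between the two kinds of prediction can be made explicit in a small helper. The sketch below is an illustration only: it uses the rounded rod-heating coefficients a = 5.011 and b = 0.013 and the observed temperature range of 13 to 78 °C, and the function name `predict_length` is hypothetical.

```python
def predict_length(temp_c, a=5.011, b=0.013, x_min=13, x_max=78):
    """Return (predicted length in cm, kind of prediction).

    The prediction is labelled "interpolation" when temp_c lies within the
    observed range [x_min, x_max] and "extrapolation" otherwise.
    """
    kind = "interpolation" if x_min <= temp_c <= x_max else "extrapolation"
    return a + b * temp_c, kind


for temp in (40, 80, 200):
    length, kind = predict_length(temp)
    print(f"{temp} deg C: {length:.3f} cm ({kind})")
```

A value such as 80 °C lies just outside the table and may be tolerated, whereas 200 °C is far from the observed values and the forecast there should not be trusted.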
Example 1
A large field of maize was divided into six plots of equal area and each
plot fertilised with a different concentration of fertiliser. The yield of maize from
each plot is shown below.
(a) Obtain the equation of the regression line for yield on concentration,
giving the values of the coefficients to 2 decimal places.
(c) Use the regression line to obtain the yield when the concentration is 3 oz
m⁻². State precisely what is being estimated by this value.
(d) State any reservations you would have about making an estimate from the
regression equation of the expected yield per plot if 7 oz m⁻² of fertiliser is
applied.
Solution
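(a)
$$b = \frac{n\sum xy - \sum x\sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{(6)(666) - (15)(210)}{(6)(55) - (15)^2} = 8.06$$
$$a = \frac{\sum y - b\sum x}{n} = \frac{210 - 15b}{6} = 14.86$$
The regression line of yield (y) on concentration (x) is therefore y = 14.86 + 8.06x.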
(b) The estimated yield per plot without any fertiliser is 14.86 tonnes.
For every extra oz m⁻² of fertiliser applied, there is an additional yield of
8.06 tonnes per plot.
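The arithmetic in this solution can be checked from the totals alone. The sketch below assumes only the totals that appear in the working (n = 6, Σx = 15, Σy = 210, Σxy = 666, Σx² = 55), since the data table itself is not reproduced here; it also evaluates the fitted line at a concentration of 3 oz m⁻², as asked in part (c). The variable names are illustrative.

```python
# Totals taken from the working of Example 1.
n, sum_x, sum_y, sum_xy, sum_x2 = 6, 15, 210, 666, 55

# Regression coefficients of yield on concentration.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n
print(f"y = {a:.2f} + {b:.2f}x")  # y = 14.86 + 8.06x

# Estimated mean yield per plot at a concentration of 3 (unrounded coefficients).
yield_at_3 = a + b * 3
print(f"estimated yield at x = 3: {yield_at_3:.2f}")  # 39.03
```

Using the coefficients rounded to 2 decimal places instead gives 14.86 + 3(8.06) = 39.04, so the last digit depends on when the rounding is done.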
Example 2
(d) State, giving a reason, whether or not you would use the line to find the
expected expenditure of a trip lasting 2 months.
Solution
(d) Two months is approximately 60 days. Since 60 is well outside the given
range of values of X, it is not advisable to use the line to make any forecast. There
is no guarantee that the same policy of allocating 9.89 per day still applies; for
example, there could be special package deals for longer trips.