Linear Regression

25.1 Introduction
In Lesson 23 we established that if two variables are closely related, we may be interested in
estimating the value of one variable from the value of the other. For example, if we know that
the total solids content and the fat level of milk are correlated, we may want to find the
expected total solids for a given fat level. Similarly, if spoilage of milk (in %) and the storage
temperature (°C) in a dairy plant are closely related, we may want to find the temperature at
which spoilage starts. It is often of interest to determine how changes in the values of some
variables influence changes in the values of others. Regression analysis reveals the average
relationship between two variables, and this makes estimation or prediction possible.

The literal or dictionary meaning of the word “regression” is ‘stepping back or returning to the
average value’. The term was first used by the British biometrician Sir Francis Galton in the
later part of the 19th century, in connection with studies on the extent to which the stature of
the children of tall parents reverts to the mean stature of the population. To regress means to
step back, to fall back or to return to a former state, so regression means returning or
retrogression. Falconer (1936) conducted an experiment with two groups of parents, one taller
and the other shorter than normal height. The children of the first group tended to fall back
towards normal height, while the children of the second group tended to rise towards it.

In the general sense, regression analysis means the estimation or prediction of the unknown
value of one variable from the known value of another. It is one of the most important
statistical tools and is used extensively in almost all branches of science: natural, social and
physical. It is used especially in business and economics to study the relationship between two
or more causally related variables and to estimate demand and supply curves, cost functions,
production and consumption functions, etc.
25.2 Definition of Regression
Regression analysis is a scientific technique for making predictions. In the words of
M.M. Blair, “Regression analysis is a mathematical measure of the average relationship between
two or more variables in terms of the original units of the data”.

According to Morris Hamburg, the term “regression analysis” refers to the methods by which
estimates are made of the values of a variable from a knowledge of the values of one or more
other variables, and to the measurement of the errors involved in this estimation.

Ya-Lun Chou defined it as follows: “Regression analysis attempts to establish the nature of the
relationship between variables, that is, to study the functional relationship between the
variables and thereby provide a mechanism for prediction or forecasting”.

It is clear from the above definitions that regression analysis is a statistical device with the
help of which we can estimate (predict) the unknown values of one variable from the known
values of another variable. In regression analysis there are two types of variables. The variable
whose value is influenced or is to be predicted is called the dependent variable, and the
variable which influences the values or is used for prediction is called the independent
variable. The independent variable is also known as the regressor, predictor or explanatory
variable, while the dependent variable is also known as the regressed or explained variable.
25.3 Types of Regression Analysis
The main types of regression analysis are as follows:
a) Simple and Multiple
b) Linear and Non-Linear
25.3.1 Simple and multiple
Regression analysis confined to the study of only two variables at a time is termed simple
regression. In simple regression analysis one variable is dependent and the other is
independent. The functional relationship between total solids and fat content in milk samples
is an example of simple regression. Quite often, however, the values of a particular phenomenon
are affected by a multiplicity of factors. Regression analysis in which we study more than two
variables at a time is known as multiple regression. Examples are the study of the effect of fat
and SNF contents on total solids in milk samples, and the study of Total Quality as affected by
Methylene Blue Reduction time (MBR) and Standard Plate Counts (SPC).
25.3.2 Linear and non-linear
If the given bivariate data are plotted on a graph, the points so obtained on the scatter diagram
will more or less concentrate round a curve, called the “curve of regression”. Often such a
curve is not distinct and is quite confusing and sometimes complicated too. The mathematical
equation of the regression curve, usually called the regression equation, enables us to study
the average change in the value of the dependent variable for any given value of the
independent variable. If the regression curve is a straight line, we say that there is linear
regression between the variables under study. The equation of such a curve is the equation of
a straight line, i.e., a first degree equation in the variables X and Y. In the case of linear
regression, the value of the dependent variable changes by a constant absolute amount for a
unit change in the value of the independent variable. However, if the curve of regression is
not a straight line, the regression is termed curvilinear or non-linear regression. In that case
the regression equation is a functional relation between X and Y involving transformed values
of X and Y, i.e., involving terms of the type X², Y², XY, log X, log Y etc. In this chapter we
shall confine our discussion to linear regression between two variables only.
25.4 Simple Linear Regression
In practice, simple linear regression is used very often. Its main elements, namely regression
lines, regression equations and regression coefficients, are discussed in the subsequent
sections.

25.5 Regression Lines
The regression line shows the average relationship between two variables. It is the line of best
fit, i.e., the line which gives the best estimate of one variable for a given value of the other
variable. The term “best fit” is interpreted in accordance with the Principle of Least Squares,
which consists in minimising the sum of the squares of the residuals or errors of estimate,
i.e., the deviations between the observed values of the variable and the corresponding
estimated values given by the line of best fit. In the case of two variables X and Y, we have
two lines of regression: one of Y on X and the other of X on Y.
25.5.1 Regression line of Y on X
Line of regression of Y on X is the line which gives the best estimate for the value of Y for
any specified value of X and is obtained by minimising the sum of squares of the errors
parallel to Y-axis.
25.5.2 Regression line of X on Y
Line of regression of X on Y is the line which gives the best estimate for the value of X for
any specified value of Y and is obtained by minimising the sum of squares of the errors
parallel to X-axis.
25.6 Derivation of Line of Regression of Y on X
Let (X1, Y1), (X2, Y2), ..., (Xn, Yn) be n pairs of observations on two variables under study,
and let
Y = a + bX (Eq. 25.1)
be the line of regression (best fit) of Y on X. For a given point Pi (Xi, Yi) in the scatter
diagram, the error of estimate or residual given by the line of best fit (Eq. 25.1) is PiHi, as
shown in Fig. 25.1. The point Hi lies on the line (25.1) and has the same X-coordinate as Pi,
namely Xi, so the Y-coordinate of Hi, i.e., HiM, is (a + bXi). Hence, the error of estimate
for Pi is given by
PiHi = PiM - HiM = Yi - (a + bXi) (Eq. 25.2)
This error is parallel to the Y-axis for the i-th point, and we compute such an error for every
point of the scatter diagram. The error PiHi is positive for points which lie above the line and
negative for points which lie below it. Many lines can be drawn through this scatter of points,
and we have to find the particular line of best fit for which the sum of squares of the
residuals is minimum.

Fig. 25.1 Scatter diagram with an estimating line

According to the principle of least squares, we have to determine the constants a and b in
equation (25.1) such that the sum of squares of the errors is minimum. In other words, we have
to minimise the residual sum of squares due to error, E:

$$E = \sum_{i=1}^{n} (Y_i - a - bX_i)^2 \qquad \text{(Eq. 25.3)}$$
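
As a minimal sketch of the quantity being minimised, the Python snippet below computes the residuals Yi - (a + bXi) and their sum of squares E for a candidate line Y = a + bX. The function names and the four data points are illustrative assumptions, not part of the lesson.

```python
def residuals(a, b, xs, ys):
    """Signed errors Yi - (a + b*Xi), measured parallel to the Y-axis."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def residual_sum_of_squares(a, b, xs, ys):
    """E: the sum of squared residuals for the candidate line Y = a + bX."""
    return sum(e ** 2 for e in residuals(a, b, xs, ys))


# Hypothetical data, used only to illustrate the calculation.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]

# Two candidate lines; the one with the smaller E fits these points better.
e_line_1 = residual_sum_of_squares(0.0, 2.0, xs, ys)  # Y = 2X
e_line_2 = residual_sum_of_squares(1.0, 1.0, xs, ys)  # Y = 1 + X
print(e_line_1, e_line_2)
```

A point below the candidate line gives a negative residual and a point above it gives a positive one; squaring stops these from cancelling in the total.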

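Differentiating E partially with respect to a and b and equating the derivatives to zero, we get

$$\frac{\partial E}{\partial a} = -2\sum_{i=1}^{n} (Y_i - a - bX_i) = 0 \;\Rightarrow\; \sum_{i=1}^{n} Y_i = na + b\sum_{i=1}^{n} X_i \qquad \text{(Eq. 25.4)}$$

$$\frac{\partial E}{\partial b} = -2\sum_{i=1}^{n} X_i (Y_i - a - bX_i) = 0 \;\Rightarrow\; \sum_{i=1}^{n} X_iY_i = a\sum_{i=1}^{n} X_i + b\sum_{i=1}^{n} X_i^2 \qquad \text{(Eq. 25.5)}$$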
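Equations (25.4) and (25.5) are known as the two normal equations. Solving these two normal
equations for b, we get

$$b = \frac{n\sum X_iY_i - \sum X_i \sum Y_i}{n\sum X_i^2 - \left(\sum X_i\right)^2} \qquad \text{(Eq. 25.6)}$$

Putting the value of b in either of the normal equations, we get

$$a = \frac{\sum Y_i - b\sum X_i}{n} = \bar{Y} - b\bar{X} \qquad \text{(Eq. 25.7)}$$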
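
The solution of the two normal equations can be sketched in Python as follows. This is an illustrative sketch only: the function name `fit_y_on_x` and the sample data are assumptions. The slope is computed as (nΣXY - ΣXΣY) / (nΣX² - (ΣX)²) and the intercept as (ΣY - bΣX)/n, the closed forms obtained by solving the normal equations for b and a.

```python
def fit_y_on_x(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all X values are equal; the slope is undefined")

    b = (n * sum_xy - sum_x * sum_y) / denom  # slope from the normal equations
    a = (sum_y - b * sum_x) / n               # intercept, equal to mean(Y) - b*mean(X)
    return a, b


# Hypothetical data lying exactly on the line Y = 1 + 2X.
a, b = fit_y_on_x([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(a, b)
```

Because the sample points lie exactly on a straight line, the fitted constants recover that line's intercept and slope.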

Substituting these values of a and b from equations (25.7) and (25.6) respectively in
equation (25.1), we get the required equation of the line of regression of Y on X.
Dividing both sides of equation (25.4) by the number of pairs of observations n, we get

$$\bar{Y} = a + b\bar{X} \qquad \text{(Eq. 25.8)}$$
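This implies that the line of best fit passes through the point $(\bar{X}, \bar{Y})$; in other
words, the point $(\bar{X}, \bar{Y})$ lies on the line of regression of Y on X. Writing the slope
b as $b_{YX}$, the regression coefficient of Y on X, the required equation of the line of
regression of Y on X can be written as:

$$Y - \bar{Y} = b_{YX}\,(X - \bar{X}), \qquad b_{YX} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X^2} \qquad \text{(Eq. 25.9)}$$

But we know that

$$r = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}, \quad \text{i.e.,} \quad \mathrm{Cov}(X,Y) = r\,\sigma_X \sigma_Y$$

Substituting the value of Cov(X,Y) in equation (25.9), we get

$$Y - \bar{Y} = r\,\frac{\sigma_Y}{\sigma_X}\,(X - \bar{X}) \qquad \text{(Eq. 25.10)}$$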
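
The covariance form of the regression coefficient can be checked numerically. The sketch below assumes population (divide by n) definitions of covariance and standard deviation and uses made-up data. It computes the slope both as Cov(X,Y)/σX² and as r·σY/σX and confirms that the fitted line passes through the point of means. All names are illustrative.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def regression_y_on_x(xs, ys):
    """Return (x_bar, y_bar, slope via covariance, slope via r)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    sd_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    r = cov_xy / (sd_x * sd_y)       # correlation coefficient
    b_from_cov = cov_xy / sd_x ** 2  # slope as Cov(X,Y) / variance of X
    b_from_r = r * sd_y / sd_x       # slope as r * sd_y / sd_x
    return x_bar, y_bar, b_from_cov, b_from_r


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
x_bar, y_bar, b_from_cov, b_from_r = regression_y_on_x(xs, ys)


def predict_y(x):
    """Estimate Y from the line Y - y_bar = b * (X - x_bar)."""
    return y_bar + b_from_cov * (x - x_bar)


print(b_from_cov, b_from_r, predict_y(x_bar))
```

The two slope expressions agree up to floating-point rounding, and the prediction at the mean of X is the mean of Y.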
25.6.1 Line of Regression of X on Y
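Similarly, we can obtain the line of regression of X on Y, i.e., X = a + bY, by minimising the
sum of squares of the errors parallel to the X-axis. The required equation of the line of
regression of X on Y can be written as:

$$X - \bar{X} = b_{XY}\,(Y - \bar{Y}) = r\,\frac{\sigma_X}{\sigma_Y}\,(Y - \bar{Y}) \qquad \text{(Eq. 25.11)}$$

From equations (25.9) and (25.11) it is evident that both lines of regression, X on Y and Y
on X, pass through the point $(\bar{X}, \bar{Y})$. Hence $(\bar{X}, \bar{Y})$ is the point of
intersection of the lines of Y on X and X on Y. The above procedure of fitting regression
equations is illustrated through the following example.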
Example 1: The following data pertain to spoilage of milk (in %) (X) and the temperature
(°C) (Y) of storage of milk in a dairy plant.

| Spoilage of milk, % (X) | 27.3 | 29.5 | 26.8 | 29.5 | 30.5 | 29.7 | 25.6 | 25.4 | 24.6 | 23.6 |
| Temperature, °C (Y)     | 33.9 | 34.6 | 34.5 | 36.9 | 37.1 | 37.3 | 28.8 | 29.6 | 31.2 | 30.7 |

Fit a linear regression of spoilage of milk (in %) (X) on temperature (°C) (Y) and vice
versa. Also predict the spoilage of milk when the temperature is 40 °C and the temperature
when spoilage of milk is 35 %.

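| Spoilage of milk ($X_i$) | Temperature, °C ($Y_i$) | $X_i - \bar{X}$ | $Y_i - \bar{Y}$ | $(X_i - \bar{X})^2$ | $(Y_i - \bar{Y})^2$ | $(X_i - \bar{X})(Y_i - \bar{Y})$ |
|---|---|---|---|---|---|---|
| 27.3 | 33.9 | 0.05 | 0.44 | 0.0025 | 0.1936 | 0.022 |
| 29.5 | 34.6 | 2.25 | 1.14 | 5.0625 | 1.2996 | 2.565 |
| 26.8 | 34.5 | -0.45 | 1.04 | 0.2025 | 1.0816 | -0.468 |
| 29.5 | 36.9 | 2.25 | 3.44 | 5.0625 | 11.8336 | 7.740 |
| 30.5 | 37.1 | 3.25 | 3.64 | 10.5625 | 13.2496 | 11.830 |
| 29.7 | 37.3 | 2.45 | 3.84 | 6.0025 | 14.7456 | 9.408 |
| 25.6 | 28.8 | -1.65 | -4.66 | 2.7225 | 21.7156 | 7.689 |
| 25.4 | 29.6 | -1.85 | -3.86 | 3.4225 | 14.8996 | 7.141 |
| 24.6 | 31.2 | -2.65 | -2.26 | 7.0225 | 5.1076 | 5.989 |
| 23.6 | 30.7 | -3.65 | -2.76 | 13.3225 | 7.6176 | 10.074 |
| Total: 272.5 | 334.6 | | | 53.3850 | 91.7440 | 61.990 |

Here n = 10, so $\bar{X}$ = 272.5/10 = 27.25 and $\bar{Y}$ = 334.6/10 = 33.46.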
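
The column totals of the working table can be reproduced with a few lines of Python. This is only a sketch of the arithmetic for the data of Example 1; the variable names are illustrative.

```python
# Data of Example 1
spoilage = [27.3, 29.5, 26.8, 29.5, 30.5, 29.7, 25.6, 25.4, 24.6, 23.6]      # X, in %
temperature = [33.9, 34.6, 34.5, 36.9, 37.1, 37.3, 28.8, 29.6, 31.2, 30.7]  # Y, in deg C

n = len(spoilage)
x_bar = sum(spoilage) / n
y_bar = sum(temperature) / n

# Deviations of each observation from its mean
dx = [x - x_bar for x in spoilage]
dy = [y - y_bar for y in temperature]

sum_dx2 = sum(d * d for d in dx)               # sum of squared X deviations
sum_dy2 = sum(d * d for d in dy)               # sum of squared Y deviations
sum_dxdy = sum(p * q for p, q in zip(dx, dy))  # sum of products of deviations

print(round(x_bar, 2), round(y_bar, 2))
print(round(sum_dx2, 4), round(sum_dy2, 4), round(sum_dxdy, 4))
```

The three sums correspond to the last three column totals of the table.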

Regression coefficient of Y on X (bYX):

$$b_{YX} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{61.990}{53.385} = 1.1612$$

Regression coefficient of X on Y (bXY):

$$b_{XY} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (Y_i - \bar{Y})^2} = \frac{61.990}{91.744} = 0.6757$$

Regression equation of Y on X, i.e., the regression line of temperature on spoilage of milk, is

$$Y - 33.46 = 1.1612\,(X - 27.25)$$

The required equation of the line of regression of Y on X, i.e., the regression equation of
temperature on spoilage of milk, is Y = 1.8173 + 1.1612X. To predict the temperature when
spoilage of milk is 35 %, we put X = 35 in the above equation and get Y = 42.4593. It means
that when spoilage of milk is 35 %, the temperature will be about 42.46 °C.

Regression equation of X on Y, i.e., the regression line of spoilage of milk on temperature, is

$$X - 27.25 = 0.6757\,(Y - 33.46)$$

The required equation of the line of regression of X on Y, i.e., the regression equation of
spoilage of milk on temperature, is X = 4.6416 + 0.6757Y (the intercept is computed with the
unrounded slope 61.990/91.744). To predict the spoilage of milk when the temperature is 40 °C,
we put Y = 40 in the above equation and get X = 31.6696. It means that when the temperature is
40 °C, the spoilage of milk will be about 31.67 %.
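
The whole of Example 1 can be re-worked in a short Python sketch. The helper name `fit_line` is an assumption made for illustration. Because the code keeps full floating-point precision instead of rounding the slope to four decimals first, its intercepts and predictions may differ from the hand-worked figures in the third or fourth decimal place.

```python
def fit_line(u, v):
    """Least-squares regression of v on u: returns (a, b) for v = a + b*u."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    s_uv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    b = s_uv / s_uu        # regression coefficient of v on u
    a = v_bar - b * u_bar  # the line passes through (u_bar, v_bar)
    return a, b


# Data of Example 1
spoilage = [27.3, 29.5, 26.8, 29.5, 30.5, 29.7, 25.6, 25.4, 24.6, 23.6]      # X, in %
temperature = [33.9, 34.6, 34.5, 36.9, 37.1, 37.3, 28.8, 29.6, 31.2, 30.7]  # Y, in deg C

a_yx, b_yx = fit_line(spoilage, temperature)  # Y on X: temperature from spoilage
a_xy, b_xy = fit_line(temperature, spoilage)  # X on Y: spoilage from temperature

temp_at_35 = a_yx + b_yx * 35.0   # temperature when spoilage is 35 %
spoil_at_40 = a_xy + b_xy * 40.0  # spoilage when temperature is 40 deg C

print(round(b_yx, 4), round(b_xy, 4))
print(round(temp_at_35, 2), round(spoil_at_40, 2))
```

Note that each prediction uses its own regression line: the Y on X line for temperature and the X on Y line for spoilage.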
25.6.2 Why there are two regression lines
The line of regression of Y on X (Y = a + bYX X) is used to estimate or predict the value of Y
for any given value of X, i.e., Y is the dependent variable and X is the independent
(explanatory) variable. The estimates so obtained are best in the sense that they have the
minimum possible error as defined by the principle of least squares. To predict or estimate X
for any given value of Y, we use the regression equation of X on Y (X = a + bXY Y), which is
obtained by minimising the sum of squares of the errors of estimate in X. Here X is the
dependent variable and Y is the independent (explanatory) variable. The two regression
equations are not reversible or interchangeable: the regression equation of Y on X is obtained
by minimising the sum of squares of errors parallel to the Y-axis, while the regression equation
of X on Y is obtained by minimising the sum of squares of errors parallel to the X-axis. In the
particular case of perfect correlation, positive or negative, i.e., r = ±1, the equation of the
line of regression of Y on X becomes:

$$Y - \bar{Y} = \pm\frac{\sigma_Y}{\sigma_X}\,(X - \bar{X}), \quad \text{i.e.,} \quad \frac{Y - \bar{Y}}{\sigma_Y} = \pm\frac{X - \bar{X}}{\sigma_X}$$

Similarly, the equation of the line of regression of X on Y becomes:

$$X - \bar{X} = \pm\frac{\sigma_X}{\sigma_Y}\,(Y - \bar{Y}), \quad \text{i.e.,} \quad \frac{X - \bar{X}}{\sigma_X} = \pm\frac{Y - \bar{Y}}{\sigma_Y}$$

The above two equations are the same. Hence, in the case of perfect correlation (r = ±1) both
lines coincide. Therefore, in general we always have two lines of regression, except in the
particular case of perfect correlation, when both lines coincide and we get only one line.
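
A small numerical check of this statement is sketched below. It relies on the fact, which follows from the two regression equations written in terms of r, that the product of the two regression coefficients equals r². The two lines coincide exactly when this product is 1. The function name and the perfectly linear data set are illustrative assumptions.

```python
def slope(u, v):
    """Regression coefficient of v on u: sum(du*dv) / sum(du^2)."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    s_uv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    return s_uv / s_uu


# Hypothetical perfectly correlated data: every point lies on Y = 3 + 2X.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0 + 2.0 * x for x in xs]
product_perfect = slope(xs, ys) * slope(ys, xs)  # b_YX * b_XY

# Data of Example 1, where the correlation is strong but not perfect.
spoilage = [27.3, 29.5, 26.8, 29.5, 30.5, 29.7, 25.6, 25.4, 24.6, 23.6]
temperature = [33.9, 34.6, 34.5, 36.9, 37.1, 37.3, 28.8, 29.6, 31.2, 30.7]
product_example = slope(spoilage, temperature) * slope(temperature, spoilage)

# A product of 1 means the two lines coincide; below 1 they are distinct.
print(product_perfect, round(product_example, 4))
```

For the perfectly linear data the product is 1, so only one line exists. For the milk data it is below 1, so the Y on X and X on Y lines are distinct and meet only at the point of means.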

