Block 2
Structure
5.0 Objectives
5.1 Introduction
5.2 Scatter Diagram
5.3 Covariance
5.4 Correlation Coefficient
5.5 Interpretation of Correlation Coefficient
5.6 Rank Correlation Coefficient
5.7 The Concept of Regression
5.8 Linear Relationship: Two-Variables Case
5.9 Minimisation of Errors
5.10 Method of Least Squares
5.11 Prediction
5.12 Relationship between Regression and Correlation
5.13 Multiple Regressions
5.14 Non-Linear Regression
5.15 Let Us Sum Up
5.16 Answers/Hints to Check Your Progress Exercises
5.0 OBJECTIVES
After going through this unit you will be in a position to
plot scatter diagram;
compute correlation coefficient and state its properties;
compute rank correlation;
explain the concept of regression;
explain the method of least squares;
identify the limitations of linear regression;
apply linear regression models to given data; and
use the regression equation for prediction.
5.1 INTRODUCTION
The word ‘bivariate’ is used to describe situations in which two characters are
measured on each individual or item, the characters being represented by two
variables. An example is the measurement of height ($X_i$) and weight ($Y_i$) of
students in a school. The subscript i in this case represents the student concerned.
*
Prof. Kaustuva Barik, School of Social Sciences, Indira Gandhi National Open University.
Summarisation of Bivariate and Multi-variate Data
Thus, for example, $X_5$, $Y_5$ represent the height and weight of the fifth student.
Statistical data relating to simultaneous measurement of two variables are called
bivariate data. The observations on each individual are paired, one for each
variable: $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$.
In statistical studies with several variables, there are generally two types of
problems. In some problems it is of interest to study how the variables are
interrelated; such problems are tackled by using correlation technique. For
instance, an economist may be interested in studying the relationship between the
stock prices of various companies; for this he may use correlation techniques. In
other problems there is a variable Y of basic interest and the problem is to find out
what information the other variables provide on Y; such problems are tackled
using regression techniques. For instance, an economist may be interested in
studying what factors determine the pay of an employed person and in particular,
he may be interested in exploring what role the factors such as education,
experience, market demand, etc. play in determining the pay. In the above
situation he may use regression techniques to set up a prediction formula for pay
based on education, experience, etc.
A representation of data of this type on a graph is a useful device which will help
us to understand the nature and form of the relationship between the two
variables, whether there is a discernible relationship or not and, if so, whether it is
linear or not. For this let us denote the score in Economics by X and the score in
Statistics by Y and plot the data of Table 5.1 on the x-y plane. It does not matter
which is called X and which Y for this purpose. Such a plot is called a Scatter Plot
or Scatter Diagram. For the data of Table 5.1 the scatter diagram is given in Fig. 5.1.
Correlation and
Regression
An inspection of Table 5.1 and Fig. 5.1 shows that there is a positive relationship
between x and y. This means that larger values of x are associated with larger values
of y and smaller values of x with smaller values of y. Further, the points seem to lie
scattered around both sides of a straight line. Thus, it appears that a linear relationship
exists between x and y. This relationship, however, is not perfect in the sense that there are
deviations from such a relationship in the case of certain observations. It would
indeed be useful to get a measure of the strength of this linear relationship.
5.3 COVARIANCE
In the case of a single variable we have learnt the concept of variance, which is
defined as
$$\sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2 \qquad \ldots (5.1)$$
You may recall that standard deviation is always positive since it is defined as the
positive square root of variance. In the case of covariance there are two terms,
$(X_i - \bar{X})$ and $(Y_i - \bar{Y})$, which represent the deviations of $X_i$ from $\bar{X}$ and of $Y_i$ from $\bar{Y}$.
Moreover, $(X_i - \bar{X})$ can be positive or negative depending on whether $X_i$ is greater
than or less than $\bar{X}$. Similarly, $(Y_i - \bar{Y})$ can be positive or negative. It is not
necessary that whenever $(X_i - \bar{X})$ is positive $(Y_i - \bar{Y})$ will also be positive.
Therefore, the product $(X_i - \bar{X})(Y_i - \bar{Y})$ can be either positive or negative. A
positive value for $(X_i - \bar{X})(Y_i - \bar{Y})$ means that the two deviations have the same sign:
whenever $X_i > \bar{X}$ we have $Y_i > \bar{Y}$, and whenever $X_i < \bar{X}$ we have $Y_i < \bar{Y}$.
Thus a higher value of $X_i$ is associated with a relatively higher value of $Y_i$.
On the other hand, $(X_i - \bar{X})(Y_i - \bar{Y}) < 0$ implies that a lower value of $X_i$ is
associated with a relatively higher value of $Y_i$, or the other way round. When we sum
these products over all the observations and divide by the number of observations, we may
obtain a negative or positive value. Therefore, covariance can assume both positive and
negative values.
When covariance between x and y is negative ($\sigma_{xy} < 0$) we can say that the
relationship could be inverse. Similarly, $\sigma_{xy} > 0$ implies a positive relationship
between x and y. A major limitation of covariance is that it is not independent of
the unit of measurement. It means that if we change the unit of measurement of the
variables we will get a different value for $\sigma_{xy}$.
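$$\sigma_{xy} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) = \frac{1}{n}\sum_{i=1}^{n}\left(X_iY_i - \bar{X}Y_i - X_i\bar{Y} + \bar{X}\bar{Y}\right)$$
$$\sigma_{xy} = \frac{1}{n}\sum_{i=1}^{n}X_iY_i - \frac{1}{n}\sum_{i=1}^{n}\bar{X}Y_i - \frac{1}{n}\sum_{i=1}^{n}X_i\bar{Y} + \frac{1}{n}\sum_{i=1}^{n}\bar{X}\bar{Y}$$
Since $\frac{1}{n}\sum_{i=1}^{n}\bar{X}Y_i = \frac{1}{n}\sum_{i=1}^{n}X_i\bar{Y} = \bar{X}\bar{Y}$ we have
$$\sigma_{xy} = \frac{1}{n}\sum_{i=1}^{n}X_iY_i - \bar{X}\bar{Y} \qquad \ldots (5.3)$$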
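As a numerical check on (5.3), the following Python sketch computes the covariance of the X and Y columns of Table 5.2 from the definition and from the short-cut form, and then rescales X to illustrate the dependence on the unit of measurement. The function names and the rescaling factor of 100 are illustrative assumptions and are not part of this unit.

```python
# X and Y columns of Table 5.2 (scores in Economics and Statistics).
X = [82, 70, 34, 80, 66, 84, 74, 84, 60, 86, 76, 76, 92, 72, 64, 86, 84, 60, 82, 90]
Y = [64, 40, 35, 48, 54, 56, 62, 66, 52, 82, 58, 66, 72, 46, 44, 76, 52, 40, 60, 60]


def covariance_definition(x, y):
    """Mean of the products of deviations from the two means."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n


def covariance_shortcut(x, y):
    """Equation (5.3): mean of X*Y minus the product of the means."""
    n = len(x)
    return sum(xi * yi for xi, yi in zip(x, y)) / n - (sum(x) / n) * (sum(y) / n)


cov_def = covariance_definition(X, Y)
cov_short = covariance_shortcut(X, Y)

# Hypothetical change of unit: every X value multiplied by 100.
cov_rescaled = covariance_definition([100 * xi for xi in X], Y)

print(cov_def, cov_short, cov_rescaled)
```

Both functions give 118.085 (up to floating-point error), which rounds to the 118.09 used in the worked calculation for Table 5.2, while the rescaled covariance is 100 times larger although the relationship between the scores has not changed.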
This can be achieved by standardizing each variable, that is, by considering
$\frac{X - \bar{X}}{\sigma_x}$ and $\frac{Y - \bar{Y}}{\sigma_y}$, where $\bar{X}$ and $\bar{Y}$ are the means of X and Y respectively and $\sigma_x$
and $\sigma_y$ are their standard deviations.
Let us denote these standardised variables by u and v respectively. Let us also use
the notation $(X_i, Y_i)$ to denote the scores of the ith student in Economics and Statistics
respectively, i ranging from 1 to n, the number of students, n being 20 in our
example. Similarly, let $(u_i, v_i)$ denote the standardised scores of the ith student. Then
recall the following formulae for mean and standard deviation:
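$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n}X_i; \qquad \sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2;$$
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n}Y_i; \qquad \sigma_y^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$$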
Fig. 5.2 is the scatter diagram in terms of standardised variables u and v. Let us
observe that in this example there is a positive association between the two
scores. The larger one score is, the larger the other score also is; the smaller one
score is the smaller the other score is, on the whole. In view of this, most of the
points are either in the first quadrant or in the third quadrant. The first quadrant
represents the cases where both scores are above their respective means and third
quadrant represents the cases where both scores are below their respective means.
There are only a very few points in second and fourth quadrants, which represent
the cases where one score is above its mean and the other is below its mean. Thus
the product of the u, v values is a suitable indicator of the strength of the
relationship; this product is positive in the first and third quadrants and negative
in the second and fourth. Thus the product of u and v, averaged over all the points,
may be considered to be a suitable measure of the strength of the linear relationship
between X and Y.
This measure is called the correlation coefficient between X and Y and is usually
denoted by $r_{xy}$ or simply by r, when it is clear from the context what x and y are.
This is also called Pearson’s Product-Moment Correlation Coefficient to
distinguish it from other types of correlation coefficients.
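The idea of averaging the product of standardised scores can be tried out directly. The sketch below, in Python, standardises the X and Y columns of Table 5.2, averages the products $u_i v_i$, and counts how many points fall in the first or third quadrant against the second or fourth. The helper name `standardise` is an assumption made for illustration.

```python
from math import sqrt

# X and Y columns of Table 5.2 (scores in Economics and Statistics).
X = [82, 70, 34, 80, 66, 84, 74, 84, 60, 86, 76, 76, 92, 72, 64, 86, 84, 60, 82, 90]
Y = [64, 40, 35, 48, 54, 56, 62, 66, 52, 82, 58, 66, 72, 46, 44, 76, 52, 40, 60, 60]


def standardise(values):
    """Subtract the mean and divide by the standard deviation (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


u = standardise(X)
v = standardise(Y)

# Correlation coefficient as the average product of standardised scores.
r = sum(ui * vi for ui, vi in zip(u, v)) / len(u)

# Points in the first or third quadrant have u*v > 0; second or fourth, u*v < 0.
same_sign = sum(1 for ui, vi in zip(u, v) if ui * vi > 0)
opposite_sign = sum(1 for ui, vi in zip(u, v) if ui * vi < 0)

print(round(r, 3), same_sign, opposite_sign)
```

For these data 16 of the 20 points lie in the first or third quadrant and only 4 in the other two, which is why the average product is clearly positive (about 0.735).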
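$$r = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2}\,\sqrt{\frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2}} = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2 \cdot \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2}} \qquad \ldots (5.6)$$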
Or, alternatively
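$$r = \frac{n\sum_{i=1}^{n}X_iY_i - \sum_{i=1}^{n}X_i\sum_{i=1}^{n}Y_i}{\sqrt{n\sum_{i=1}^{n}X_i^2 - \left(\sum_{i=1}^{n}X_i\right)^2}\,\sqrt{n\sum_{i=1}^{n}Y_i^2 - \left(\sum_{i=1}^{n}Y_i\right)^2}} \qquad \ldots (5.7)$$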
Let us go back to the data given in Table 5.1 and work out the value of r. You can
use any of the formulae (5.4), (5.5) or (5.7) to get the value of r. Since all the
formulae are derived from the same concept we obtain the same value for r
whichever formula we use. For the data set in Table 5.1 we have calculated it by
using (5.4) and (5.7). We construct Table 5.2 for this purpose.
Table 5.2: Calculation of Correlation Coefficient
Observation No. X Y X² Y² XY
1 82 64 6724 4096 5248
2 70 40 4900 1600 2800
3 34 35 1156 1225 1190
4 80 48 6400 2304 3840
5 66 54 4356 2916 3564
6 84 56 7056 3136 4704
7 74 62 5476 3844 4588
8 84 66 7056 4356 5544
9 60 52 3600 2704 3120
10 86 82 7396 6724 7052
11 76 58 5776 3364 4408
12 76 66 5776 4356 5016
13 92 72 8464 5184 6624
14 72 46 5184 2116 3312
15 64 44 4096 1936 2816
16 86 76 7396 5776 6536
17 84 52 7056 2704 4368
18 60 40 3600 1600 2400
19 82 60 6724 3600 4920
20 90 60 8100 3600 5400
Total 1502 1133 116292 67141 87450
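$\sum_{i=1}^{20} X_i = 1502; \quad \bar{X} = 75.1;$
$\sum_{i=1}^{20} Y_i = 1133; \quad \bar{Y} = 56.65;$
$\sum_{i=1}^{20} X_i^2 = 116292; \quad \sigma_x^2 = \frac{1}{20}\left(116292 - \frac{1502^2}{20}\right) = 174.59; \quad \sigma_x = 13.21;$
$\sum_{i=1}^{20} Y_i^2 = 67141; \quad \sigma_y^2 = \frac{1}{20}\left(67141 - \frac{1133^2}{20}\right) = 147.83; \quad \sigma_y = 12.16;$
$\sum_{i=1}^{20} X_iY_i = 87450; \quad \sigma_{xy} = \frac{1}{20}\left(87450 - \frac{1502 \times 1133}{20}\right) = 118.09$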
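The arithmetic above can be reproduced from the column totals of Table 5.2 alone. The Python sketch below computes r once from the variances and covariance and once from formula (5.7); the variable names are illustrative assumptions.

```python
from math import sqrt

# Column totals taken from the last row of Table 5.2.
n = 20
sum_x, sum_y = 1502, 1133
sum_x2, sum_y2, sum_xy = 116292, 67141, 87450

# Variances and covariance from the totals, as in the worked calculation.
var_x = (sum_x2 - sum_x**2 / n) / n
var_y = (sum_y2 - sum_y**2 / n) / n
cov_xy = (sum_xy - sum_x * sum_y / n) / n
r_from_moments = cov_xy / (sqrt(var_x) * sqrt(var_y))

# Formula (5.7), which uses the totals directly.
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt(n * sum_x2 - sum_x**2) * sqrt(n * sum_y2 - sum_y**2)
r_from_totals = numerator / denominator

print(var_x, var_y, cov_xy)
print(round(r_from_moments, 3), round(r_from_totals, 3))
```

The two routes agree, and with these totals r comes out at about 0.735.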
Thus we see that both the formulae provide the same value of the correlation
coefficient r. You can check for yourself that the same value of r is obtained by using
the formula (5.5). For this purpose you will need values on
4)
Find the correlation coefficient between toughness and nickel content and
comment on the result.
4) Determine the correlation coefficient between x and y.
x : 5 7 9 11 13 15
$\sum D_i = 0 \qquad \sum D_i^2 = 87.50$
Let us consider the data of Table 5.3. Here there are some ties; the tied cases are
given the same rank in such a way that their total is the same as when there is no tie.
For example, when there are two cases with rank 6, each is given a rank of 6.5
and there is no case with rank either 6 or 7. Similarly, if there are three cases with
rank 5, then each is given a rank of 6 and there is no case with rank 5 or 7.
Spearman’s rank correlation coefficient, called Spearman’s Rho and denoted by $\rho$, is
based on the difference $D_i$ (i for ith observation) between the two rankings. If the
two rankings completely coincide, then $D_i$ is zero for every case. The larger the
magnitude of $D_i$, the greater is the difference between the two rankings and the smaller is
the association. Thus, the association can be measured by considering the
magnitudes of $D_i$. Since the sum of $D_i$ is always zero, to find a single index on the
basis of $D_i$ values, we should remove the sign of $D_i$ and consider only the
magnitude. In Spearman’s $\rho$, this is done by taking $D_i^2$.
However, the largeness or smallness of $\sum_{i=1}^{n} D_i^2$, where n is the number of cases,
will depend on n. Thus, in order to be able to interpret this value, we could create a
ratio by dividing this sum by $\frac{n(n^2-1)}{6}$, which depends only on n and is half of the
largest possible value of the sum, $\frac{n(n^2-1)}{3}$. However, the resulting ratio
$\frac{6\sum_{i=1}^{n} D_i^2}{n(n^2-1)}$ is zero for perfect association and 2 for
complete disagreement, i.e., perfect negative association, while we would like it to be
the other way around. So we subtract this ratio from 1. Thus
$$\rho = 1 - \frac{6\sum_{i=1}^{n} D_i^2}{n(n^2-1)} \qquad \ldots (5.8)$$
is defined as Spearman’s rank correlation.
Let us calculate the value of $\rho$ from the data given in Table 5.3.
$$\rho = 1 - \frac{6 \times 87.5}{10(10^2 - 1)} = 1 - \frac{525}{990} = 1 - 0.53 = 0.47.$$
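The ranking rule for ties and formula (5.8) can be sketched in Python as follows. The function names and the four-score tie example are hypothetical illustrations; only the figures 87.5 and n = 10 come from the calculation above. Formula (5.8) is applied to tied ranks here in the same way as in the text.

```python
def average_ranks(values):
    """Rank values from largest (rank 1) to smallest.

    Tied values share the mean of the ranks they would occupy, so the
    total of the ranks is the same as when there is no tie.
    """
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = ((pos + 1) + (end + 1)) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman_rho_from_sum(sum_d2, n):
    """Formula (5.8) given the sum of squared rank differences."""
    return 1 - 6 * sum_d2 / (n * (n**2 - 1))


def spearman_rho(rank_x, rank_y):
    """Formula (5.8) computed from two lists of ranks."""
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return spearman_rho_from_sum(sum_d2, len(rank_x))


# Hypothetical scores with one tie: the two 80s share ranks 2 and 3.
print(average_ranks([90, 80, 80, 70]))

# Figures from the calculation above: sum of D squared = 87.5, n = 10.
print(round(spearman_rho_from_sum(87.5, 10), 2))

# Identical and completely reversed rankings give the two extremes.
print(spearman_rho([1, 2, 3, 4], [1, 2, 3, 4]), spearman_rho([1, 2, 3, 4], [4, 3, 2, 1]))
```

The last line illustrates the two limiting cases: identical rankings give +1 and completely reversed rankings give -1.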
Like Karl Pearson’s coefficient of correlation, Spearman’s rank correlation has
a value of +1 for perfect matching of ranks, –1 for perfect mismatching of ranks and
0 for the lack of relation between the ranks.
There are other measures of association suitable for use when the variables are of
nominal, ordinal and other types. We do not discuss them here.
Check Your Progress 2
1) In a contest, two judges ranked eight candidates A, B, C, D, E, F, G and H in
order of their preference, as shown in the following table. Find the rank
correlation coefficient.
A B C D E F G H
First Judge 5 2 8 1 4 6 3 7
Second Judge 4 5 7 3 2 8 1 6
Roll Nos. 1 2 3 4 5 6 7 8 9 10
Rank in B. Com. Exam. 1 5 8 6 7 4 2 3 9 10
Ranks by A : 1 6 5 10 3 2 4 9 7 8
Ranks by B : 3 5 8 4 7 10 2 1 6 9
Ranks by C : 6 4 9 8 1 2 3 10 5 7
Using the rank correlation method, discuss which pair of judges has the
nearest approach to common liking in music.
5.7 THE CONCEPT OF REGRESSION
In the previous section we noted that the correlation coefficient does not reflect a cause
and effect relationship between two variables. Thus we cannot predict the value
of one variable for a given value of the other variable. This limitation is removed
by regression analysis. In regression analysis, the relationship between variables
is expressed in the form of a mathematical equation. It is assumed that one
variable is the cause and the other is the effect. You should remember that
regression is a statistical tool which helps us understand the relationship between
variables and predict the unknown values of the dependent variable from known
values of the independent variable.
In regression analysis we have two types of variables: i) dependent (or explained)
variable, and ii) independent (or explanatory) variable. As the name (explained
and explanatory) suggests the dependent variable is explained by the independent
variable.
In the simplest case of regression analysis there is one dependent variable and one
independent variable. Let us assume that consumption expenditure of a household
is related to the household income. For example, it can be postulated that as
household income increases, expenditure also increases. Here consumption
expenditure is the dependent variable and household income is the independent
variable.
Usually we denote the dependent variable as Y and the independent variable as X.
Suppose we took up a household survey and collected n pairs of observations on X
and Y. The next step is to find out the nature of relationship between X and Y.
The relationship between X and Y can take many forms. The general practice is
to express the relationship in terms of some mathematical equation. The simplest
of these equations is the linear equation. This means that the relationship between
X and Y is in the form of a straight line and is termed linear regression. When the
equation represents curves (not a straight line) the regression is called non-linear
or curvilinear.
Now the question arises, ‘How do we identify the equation form?’ There is no
hard and fast rule as such. The form of the equation depends upon the reasoning
and assumptions made by us. However, we may plot the X and Y variables on a
graph paper to prepare a scatter diagram. From the scatter diagram, the location of
the points on the graph paper helps in identifying the type of equation to be fitted.
If the points are more or less in a straight line, then a linear equation is assumed. On
the other hand, if the points are not in a straight line and are in the form of a
curve, a suitable non-linear equation (which resembles the scatter) is assumed.
We have to take another decision, that is, the identification of dependent and
independent variables. This again depends on the logic put forth and purpose of
analysis: whether ‘Y depends on X’ or ‘X depends on Y’. Thus there can be two
regression equations from the same set of data. These are i) Y is assumed to be
dependent on X (this is termed ‘Y on X’ line), and ii) X is assumed to be
dependent on Y (this is termed ‘X on Y’ line).
Regression analysis can be extended to cases where one dependent variable is
explained by a number of independent variables. Such a case is termed multiple
regression. In advanced regression models there can be a number of both
dependent as well as independent variables.
You may by now be wondering why the term ‘regression’, which means ‘going back’, is used.
This name is associated with a phenomenon that was observed in a study on the
relationship between the stature of father (x) and son (y). It was observed that the
average stature of sons of the tallest fathers has a tendency to be less than the
average stature of these fathers. On the other hand, the average stature of sons of
the shortest fathers has a tendency to be more than the average stature of these
fathers. This phenomenon was called regression towards the mean. Although this
appeared somewhat strange at that time, it was found later that this is due to
natural variation within subgroups of a group and the same phenomenon occurred
in most problems and data sets. The explanation is that many tall men come from
families with average stature due to vagaries of natural variation and they produce
sons who are shorter than them on the whole. A similar phenomenon takes place
at the lower end of the scale.
For the months when your income is the same, does your consumption remain the
same? The point we are trying to make is that an economic relationship involves
a certain randomness.
Therefore, we assume the relationship between Y and X to be stochastic and add
an error term to (5.9). Thus our stochastic model is
$$Y_i = a + bX_i + e_i \qquad \ldots (5.10)$$
where $e_i$ is the error term. In real life situations $e_i$ represents randomness in
human behaviour and excluded variables, if any, in the model. Remember that
the right hand side of (5.10) has two parts, viz., i) a deterministic part (that is,
$a + bX_i$), and ii) a stochastic or random part (that is, $e_i$). Equation (5.10)
implies that even if $X_i$ remains the same for two observations, $Y_i$ need not be the
same because of different $e_i$. Thus, if we plot (5.10) on a graph paper the
observations will not lie on a straight line.
Example 5.1
The amount of rainfall and agricultural production for ten years are given in Table
5.4.
Table 5.4: Rainfall and Agricultural Production
Rainfall (in mm.)    Agricultural production (in tonne)
60 33
62 37
65 38
71 42
73 42
75 45
81 49
85 52
88 55
90 57
We plot the data on a graph paper. The scatter diagram looks something like Fig.
5.4. We observe from Fig. 5.4 that the points do not lie strictly on a straight line.
But they show an upward rising tendency where a straight line can be fitted. Let
us draw the regression line along with the scatter plot.
where b is the slope and a is the intercept on the y-axis. The location of a straight line
depends on the values of a and b, called parameters. Therefore, the task before us
is to estimate these parameters from the collected data. (You will learn more
about the concept of estimation in Block 4.) In order to obtain the line of best fit
to the data we should find estimates of a and b in such a way that the errors $e_i$ are
as small as possible.
In Fig. 5.4 these differences between observed and predicted values of Y are
marked with straight lines from the observed points, parallel to y-axis, meeting
the regression line. The lengths of these segments are the errors at the observed
points.
Let us denote the n observations as before by ( X i , Yi ), i = 1, 2, ....., n. In Example
5.1 on agricultural production and rainfall, n=10.
Let us denote the predicted value of $Y_i$ at $X_i$ by $\hat{Y}_i$ (the notation $\hat{Y}_i$ is pronounced
as ‘$Y_i$-cap’ or ‘$Y_i$-hat’). Thus
$$\hat{Y}_i = a + bX_i, \qquad i = 1, 2, \ldots, n.$$
$$e_i = Y_i - \hat{Y}_i \qquad \ldots (5.11)$$
It would be nice if we could determine a and b in such a way that each of the $e_i$,
$i = 1, 2, \ldots, n$, is zero. But this is impossible unless it so happens that all the n points
lie on a straight line, which is very unlikely. Thus we have to be content with
minimising a combination of the $e_i$, $i = 1, 2, \ldots, n$. What are the options before us?
It is tempting to think that the total of all the $e_i$, that is, $\sum_{i=1}^{n} e_i$,
is a suitable choice. But it is not, because the $e_i$ for points above the line are
positive and those for points below the line are negative. Thus, by having a combination of
large positive and large negative errors, it is possible for $\sum_{i=1}^{n} e_i$ to be very
small.
A second possibility is that if we take $a = \bar{Y}$ (the arithmetic mean of the $Y_i$’s)
and b = 0, $\sum_{i=1}^{n} e_i$ could be made zero. In this case, however, we do not need the
value of X at all for prediction! The predicted value is the same irrespective of
the observed value of X. This evidently is wrong.
What then is wrong with the criterion $\sum_{i=1}^{n} e_i$? It takes into account the sign of
$e_i$. What matters is the magnitude of the error and whether the error is on the
positive side or negative side is really immaterial. Thus, the criterion $\sum_{i=1}^{n} e_i$ is
5.10 METHOD OF LEAST SQUARES
In the least squares method we minimise the sum of squares of the error terms,
that is, $\sum_{i=1}^{n} e_i^2$.
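To see why the sum of squared errors is a better criterion than the plain sum of errors, the Python sketch below compares two candidate lines for the rainfall data of Table 5.4: a horizontal line through the mean of Y, and an upward-sloping line whose coefficients are the rounded estimates reported in equation (5.20). The function name `errors` is an illustrative assumption.

```python
# Rainfall (X) and agricultural production (Y) from Table 5.4.
X = [60, 62, 65, 71, 73, 75, 81, 85, 88, 90]
Y = [33, 37, 38, 42, 42, 45, 49, 52, 55, 57]


def errors(a, b):
    """Errors e_i = Y_i - (a + b*X_i) for a candidate line."""
    return [y - (a + b * x) for x, y in zip(X, Y)]


y_bar = sum(Y) / len(Y)

# Candidate 1: horizontal line through the mean of Y (a = Y-bar, b = 0).
flat = errors(y_bar, 0.0)

# Candidate 2: the rounded estimates of equation (5.20), used only for comparison.
sloped = errors(-10.73, 0.743)

sum_flat, sse_flat = sum(flat), sum(e**2 for e in flat)
sum_sloped, sse_sloped = sum(sloped), sum(e**2 for e in sloped)

print("flat:  ", round(sum_flat, 2), round(sse_flat, 2))
print("sloped:", round(sum_sloped, 2), round(sse_sloped, 2))
```

Both lines have a sum of errors at or very near zero, so that criterion cannot tell them apart. The sum of squared errors separates them sharply: 584 for the horizontal line against roughly 7 for the sloping line.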
The next question is: How do we obtain the values of a and b to minimise (5.12)?
Those of you who are familiar with the concept of differentiation will
remember that the value of a function is minimum when the first derivative of
the function is zero and the second derivative is positive. Here we have to choose
the values of a and b. Hence, $\sum_{i=1}^{n} e_i^2$ will be minimum when its partial derivatives
with respect to a and b are zero. The partial derivatives of $\sum_{i=1}^{n} e_i^2$ are obtained as
follows:
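$$\frac{\partial \sum_i e_i^2}{\partial a} = \frac{\partial \sum_i (Y_i - a - bX_i)^2}{\partial a} = 2\sum_i (-1)(Y_i - a - bX_i) \qquad \ldots (5.13)$$
$$\frac{\partial \sum_i e_i^2}{\partial b} = \frac{\partial \sum_i (Y_i - a - bX_i)^2}{\partial b} = 2\sum_i (-X_i)(Y_i - a - bX_i) \qquad \ldots (5.14)$$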
By equating (5.13) and (5.14) to zero and re-arranging the terms we get the
following two equations:
$$\sum_{i=1}^{n} Y_i = na + b\sum_{i=1}^{n} X_i \qquad \ldots (5.15)$$
$$\sum_{i=1}^{n} X_iY_i = a\sum_{i=1}^{n} X_i + b\sum_{i=1}^{n} X_i^2 \qquad \ldots (5.16)$$
These two equations, (5.15) and (5.16), are called the normal equations of
least squares. These are two simultaneous linear equations in two unknowns.
These can be solved to obtain the values of a and b.
Those of you who are not familiar with the concept of differentiation can use
a rule of thumb (we suggest that you learn the concept of
differentiation, which is very useful in Economics). We can say that the
normal equations given at (5.15) and (5.16) are derived by multiplying the
linear equation by the coefficients of a and b in turn and summing over all
observations. Here the linear equation is $Y_i = a + bX_i$. The first normal
equation is simply the linear equation $Y_i = a + bX_i$ summed over all
observations (since the coefficient of a is 1):
$\sum Y_i = \sum a + \sum bX_i$ or $\sum Y_i = na + b\sum X_i$
The second normal equation is the linear equation multiplied by $X_i$ (since the
coefficient of b is $X_i$) and summed over all observations:
$\sum X_iY_i = \sum aX_i + \sum bX_i^2$ or $\sum X_iY_i = a\sum X_i + b\sum X_i^2$
After obtaining the normal equations we calculate the values of a and b from the
set of data we have.
Example 5.2: Assume that quantity of agricultural production depends on the
amount of rainfall and fit a linear regression to the data given in Example 5.1.
In this case dependent variable (Y) is quantity of agricultural production and
independent variable (X) is amount of rainfall. The regression equation to be
fitted is
$$Y_i = a + bX_i + e_i$$
For the above equation we find out the normal equations by the method of least
squares. These equations are given at (5.15) and (5.16). Next we construct a table
as follows:
Table 5.5: Computation of Regression Line
$X_i$    $Y_i$    $X_i^2$    $X_iY_i$    $\hat{Y}_i$    $e_i$
By substituting values from Table 5.5 in the normal equations (5.15) and (5.16)
we get the following:
450 = 10a + 750b
34526 = 750a + 57294b
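The two normal equations can be formed and solved mechanically. The Python sketch below builds the required sums from the data of Table 5.4 and solves for a and b by Cramer's rule, which is one of several equivalent ways to solve two simultaneous equations and is an assumption of this sketch rather than the method prescribed in the unit.

```python
# Rainfall (X) and agricultural production (Y) from Table 5.4.
X = [60, 62, 65, 71, 73, 75, 81, 85, 88, 90]
Y = [33, 37, 38, 42, 42, 45, 49, 52, 55, 57]

n = len(X)
sum_x, sum_y = sum(X), sum(Y)
sum_x2 = sum(x * x for x in X)
sum_xy = sum(x * y for x, y in zip(X, Y))

# Normal equations:  sum_y  = n*a     + b*sum_x     ... (5.15)
#                    sum_xy = a*sum_x + b*sum_x2    ... (5.16)
det = n * sum_x2 - sum_x**2
a = (sum_y * sum_x2 - sum_x * sum_xy) / det
b = (n * sum_xy - sum_x * sum_y) / det

# Errors of the fitted line; their sum should vanish up to rounding.
residuals = [y - (a + b * x) for x, y in zip(X, Y)]

print(sum_x, sum_y, sum_x2, sum_xy)
print(round(a, 3), round(b, 3))
```

The unrounded solution is b of about 0.7433 and a of about -10.747. The value -10.73 quoted in this unit is what results when b is first rounded to 0.743 and then multiplied by the mean of X.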
Notice that the sum of errors $\sum_i e_i$ for the estimated regression equation is zero.
The computation given in Table 5.5 often involves large numbers and poses
difficulty. Hence we have a short-cut method for calculating the values of a and b
from the normal equations.
Let us take
$x = X - \bar{X}$ and $y = Y - \bar{Y}$, where $\bar{X}$ and $\bar{Y}$ are the arithmetic means of X and Y
respectively.
Hence $xy = (X - \bar{X})(Y - \bar{Y})$
$$a = \bar{Y} - b\bar{X} \qquad \ldots (5.18)$$
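You may recall that covariance is given by $\sigma_{xy} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) = \frac{1}{n}\sum_{i=1}^{n} x_iy_i$.
Moreover, variance of X is given by $\sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2$.
Since $b = \frac{\sum_{i=1}^{n} x_iy_i}{\sum_{i=1}^{n} x_i^2}$ we can say that
$$b = \frac{\sigma_{xy}}{\sigma_x^2} \qquad \ldots (5.19)$$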
Since these formulae are derived from the normal equations we get the same
values for a and b in this method also. For the data given in Table 5.4 we compute
the values of a and b by this method. For this purpose we construct Table 5.6.
Table 5.6: Computation of Regression Line (short-cut method)
$X_i$    $Y_i$    $x_i$    $y_i$    $x_i^2$    $x_iy_i$
60 33 -15 -12 225 180
62 37 -13 -8 169 104
65 38 -10 -7 100 70
71 42 -4 -3 16 12
73 42 -2 -3 4 6
75 45 0 0 0 0
81 49 6 4 36 24
85 52 10 7 100 70
88 55 13 10 169 130
90 57 15 12 225 180
Total = 750 450 0 0 1044 776
$a = \bar{Y} - b\bar{X} = 45 - 0.743 \times 75 = -10.73$
Thus the regression line by this method is also $\hat{Y}_i = -10.73 + 0.743X_i$ …(5.20)
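The short-cut method works with deviations from the means. The Python sketch below reproduces the column totals of Table 5.6 and then obtains b and a from them; it also confirms that the slope equals the covariance divided by the variance of X, as in (5.19). Variable names are illustrative assumptions.

```python
# Rainfall (X) and agricultural production (Y) from Table 5.4.
X = [60, 62, 65, 71, 73, 75, 81, 85, 88, 90]
Y = [33, 37, 38, 42, 42, 45, 49, 52, 55, 57]

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

# Deviations from the means, the x_i and y_i columns of Table 5.6.
x = [xi - x_bar for xi in X]
y = [yi - y_bar for yi in Y]

sum_x2 = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

b = sum_xy / sum_x2          # slope from the deviation totals
a = y_bar - b * x_bar        # intercept, equation (5.18)

# Equation (5.19): the same slope from covariance and variance.
b_from_moments = (sum_xy / n) / (sum_x2 / n)

print(sum_x2, sum_xy, round(b, 3), round(a, 3))
```

The totals 1044 and 776 match Table 5.6, and b rounds to 0.743. Carrying b at full precision gives a of about -10.747 rather than the -10.73 obtained with the rounded slope.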
5.11 PREDICTION
A major interest in studying regression lies in its ability to forecast. In Example
5.1 we assumed that the quantity of agricultural production is dependent on the
amount of rainfall. We fitted a linear equation to the observed data and got the
relationship
$$\hat{Y}_i = -10.73 + 0.743X_i$$
From this equation we can predict the quantity of agricultural output given the
amount of rainfall. Thus when rainfall is 60 mm. agricultural production is
$(-10.73 + 0.743 \times 60) = 33.85$ thousand tonnes. This figure is the predicted value on
the basis of the regression equation. In a similar manner we can find the predicted
values of Y for different values of X.
Let us compare the predicted value with the observed value. From Table 5.4,
where observed values are given, we find that when rainfall is 60 mm.
agricultural production is 33 thousand tonnes. In fact, the predicted values $\hat{Y}_i$ for
observed values of X are given in the fifth column of Table 5.5. Thus when
rainfall is 60 mm. the predicted value is 33.85 thousand tonnes. Thus the error value
$e_i$ is –0.85 thousand tonne.
Now a question arises, ‘Which one, between observed and predicted values,
should we believe?’ In other words, what will be the quantity of agricultural
production if there is a rainfall of 60 mm. in future? On the basis of our regression
line it is given to be 33.85 tonnes. And we accept this value because it is based on
the overall data. The error of –0.85 is considered as a random fluctuation which
may not be repeated.
The second question that comes to our mind is, ‘Is the prediction valid for any
value of X?’ For example, we find from the regression equation that when rainfall
is zero, agricultural production is –10.73 thousand tonne. But common sense tells
us that agricultural production cannot be negative! Is there anything wrong with
our regression equation? In fact, the regression equation here is estimated on the
basis of rainfall data in the range of 60-90 mm. Thus prediction is valid only in this
range of X. Our prediction should not be made for far-off values of X.
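The warning about far-off values of X can be built into a small prediction helper. The Python sketch below uses the rounded coefficients of equation (5.20) and flags any rainfall value outside the 60-90 mm. range observed in Table 5.4. The function name `predict` and the constant names are hypothetical.

```python
A, B = -10.73, 0.743      # rounded estimates from equation (5.20)
X_MIN, X_MAX = 60, 90     # range of rainfall observed in Table 5.4


def predict(rainfall_mm):
    """Return (predicted production, whether X lies in the observed range)."""
    predicted = A + B * rainfall_mm
    inside = X_MIN <= rainfall_mm <= X_MAX
    return predicted, inside


for rainfall in (60, 80, 0):
    value, inside = predict(rainfall)
    note = "within observed range" if inside else "extrapolation, treat with caution"
    print(rainfall, round(value, 2), note)
```

At 60 mm. the helper returns 33.85, as in the text, while at zero rainfall it returns the meaningless -10.73 and marks it as an extrapolation.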
A third question that arises here is, ‘Will the predicted value come true?’ This
depends upon the coefficient of determination. If the coefficient of determination
is closer to one, there is greater likelihood that the prediction will be realised.
However, the predicted value is constrained by elements of randomness involved
with human behaviour and other unforeseen factors.
a) Y on X line, $Y_i = a + bX_i$
b) X on Y line, $X_i = \alpha + \beta Y_i$
You may ask, ‘What is the need for having two different lines?’ By rearrangement
of terms of the Y on X line we obtain $X_i = -\frac{a}{b} + \frac{1}{b}Y_i$. Thus we should have
$\alpha = -\frac{a}{b}$ and $\beta = \frac{1}{b}$. However, the observations are not on a straight line and the
relation between X and Y is not a mathematical one. You may recall that
estimates of the parameters are obtained by the method of least squares. Thus the
regression line $\hat{Y}_i = a + bX_i$ is obtained by minimising $\sum_i (Y_i - a - bX_i)^2$ whereas
Thus $b\beta = \frac{\sigma_{xy}^2}{\sigma_x^2\sigma_y^2}$, which is the same as $r^2$.
This $r^2$ is called the coefficient of determination. Thus the product of the two
regression coefficients of Y on X and X on Y is the square of the correlation
coefficient. This gives a relationship between correlation and regression. Notice,
however, that the coefficient of determination of either regression is the same,
i.e., $r^2$; this means that although the two regression lines are different, their
predictive powers are the same. Note that the coefficient of determination $r^2$
ranges between 0 and 1, i.e., the maximum value it can assume is unity and the
minimum value is zero; it cannot be negative.
From the previous discussions, two points emerge clearly:
1) If the points in the scatter lie close to a straight line, then there is a strong
relationship between X and Y and the correlation coefficient is high.
2) If the points in the scatter diagram lie close to a straight line, then the observed
values and predicted values of Y by least squares are very close and the
prediction errors $(Y_i - \hat{Y}_i)$ are small.
Thus, the prediction errors by least squares seem to be related to the correlation
coefficient. We explain this relationship here. The sum of squares of errors at the
various points upon using the least squares linear regression is $\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$.
On the other hand, if we had not used the value of observed X to predict Y, then
the prediction would be a constant, say, a. The best value of a by the least squares
criterion is the a that minimises $\sum_{i=1}^{n}(Y_i - a)^2$; the solution for this a is seen to be
$\bar{Y}$. Thus the sum of squares of errors of prediction at various points without using
X is $\sum_{i=1}^{n}(Y_i - \bar{Y})^2$.
The ratio $\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 \Big/ \sum_{i=1}^{n}(Y_i - \bar{Y})^2$ can then be used as an index of how much has
been gained by the use of X. In fact, one minus this ratio is the coefficient of determination,
the same as $r^2$ mentioned above. Since both the numerator and denominator of
this ratio are non-negative, the ratio is greater than or equal to zero.
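The links between the two regression slopes, the two sums of squared errors and $r^2$ can be verified numerically. The Python sketch below uses the rainfall data of Table 5.4; it computes the slope b of the Y on X line, the slope $\beta$ of the X on Y line, and the ratio of the two sums of squares, and compares them with $r^2$. Variable names are illustrative assumptions.

```python
# Rainfall (X) and agricultural production (Y) from Table 5.4.
X = [60, 62, 65, 71, 73, 75, 81, 85, 88, 90]
Y = [33, 37, 38, 42, 42, 45, 49, 52, 55, 57]

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

sxx = sum((x - x_bar) ** 2 for x in X)
syy = sum((y - y_bar) ** 2 for y in Y)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

b = sxy / sxx              # slope of the Y on X line
beta = sxy / syy           # slope of the X on Y line
a = y_bar - b * x_bar

# Sum of squared errors with X (least squares line) and without X (mean only).
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
sst = syy

r_squared = sxy**2 / (sxx * syy)

print(round(b * beta, 4), round(1 - sse / sst, 4), round(r_squared, 4))
```

All three printed values coincide: the product of the two slopes equals $r^2$, and so does one minus the ratio of the error sum of squares to the total sum of squares. For these data $r^2$ is close to 0.99, so very little of the variation in Y is left unexplained by X.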
5) Obtain the equation of the line of regression of yield of rice (y) on water (x)
from the data given in the following table :
Water in inches (x) 12 18 24 30 36 42 48
Yield in tons (y) 5.27 5.68 6.25 7.21 8.02 8.71 8.42
Estimate the most probable yield of rice for 40 inches of water.
By solving the above equations we obtain estimates for α, β and γ. The regression
equation that we obtain is
$$\hat{Y} = \alpha + \beta X_1 + \gamma X_2 \qquad \ldots (5.23)$$
In the bivariate case (Y, X) we could plot the regression line on a graph paper.
However, it is quite complex to plot the three-variable case $(Y, X_1, X_2)$ on graph
paper because it requires three dimensions. The intuitive idea
remains the same, however, and we have to minimise the sum of squared errors. In fact, when we
add all the error terms $(e_1, e_2, \ldots, e_n)$ they sum to zero.
In many cases the number of explanatory variables may be more than two. In
such cases we have to follow the basic principle of least squares: minimise $\sum e^2$.
Thus if $Y = a_0 + a_1X_1 + a_2X_2 + \cdots + a_nX_n + e$ then we have to minimise
$$\sum e^2 = \sum (Y - a_0 - a_1X_1 - a_2X_2 - \cdots - a_nX_n)^2$$
The steps involved in estimation of regression line are:
i) Find out the regression equation to be estimated. In this case it is given by
$Y = \alpha + \beta X_1 + \gamma X_2 + e$.
ii) Find out the normal equations for the regression equation to be estimated.
In this case the normal equations are
$\sum Y = n\alpha + \beta \sum X_1 + \gamma \sum X_2$
$\sum X_1 Y = \alpha \sum X_1 + \beta \sum X_1^2 + \gamma \sum X_1 X_2$
$\sum X_2 Y = \alpha \sum X_2 + \beta \sum X_1 X_2 + \gamma \sum X_2^2$
iv) Put the values from the table in the normal equations.
v) Solve for the estimates of α, β and γ.
$Y$   $X_1$   $X_2$   $X_1Y$   $X_2Y$   $X_1^2$   $X_2^2$   $X_1X_2$   $\hat{Y}$   $e_i$
By applying the above mentioned steps we obtain the estimated regression line as
Yˆ 4.80 0.45 X 1 0.09 X 2 .
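The steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are hypothetical (generated exactly from Y = 2 + 3X1 − X2, not taken from the table of this Unit), and the helper names `solve` and `dot` are our own. The sketch builds the three normal equations and solves them by elimination.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Bring the largest remaining entry of this column to the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - known) / aug[i][i]
    return solution


def dot(u, v):
    """Sum of products, e.g. dot(x1, y) is the sum of X1*Y."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical data, generated exactly from Y = 2 + 3*X1 - X2 (no error term).
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [3, 7, 7, 11, 11]
n = len(y)

# Normal equations: one row per parameter (alpha, beta, gamma).
lhs = [
    [n,       sum(x1),     sum(x2)],
    [sum(x1), dot(x1, x1), dot(x1, x2)],
    [sum(x2), dot(x1, x2), dot(x2, x2)],
]
rhs = [sum(y), dot(x1, y), dot(x2, y)]
alpha, beta, gamma = solve(lhs, rhs)

# Fitted values and residuals; the residuals of a least squares fit sum to zero.
fitted = [alpha + beta * a + gamma * b for a, b in zip(x1, x2)]
residuals = [obs - fit for obs, fit in zip(y, fitted)]
print(round(alpha, 6), round(beta, 6), round(gamma, 6))
```

Because the hypothetical Y values contain no error, the solution reproduces the generating coefficients; with real data the residuals would be non-zero individually but would still add up to zero.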
5.14 NON-LINEAR REGRESSION
The equation fitted in regression can also be non-linear or curvilinear. In fact, it
can take numerous forms. A simple form involving two variables is the
quadratic form. The equation is
$Y = a + bX + cX^2$
There are three parameters here, viz., a, b and c, and the normal equations are:
$\sum Y = na + b\sum X + c\sum X^2$
$\sum XY = a\sum X + b\sum X^2 + c\sum X^3$
$\sum X^2Y = a\sum X^2 + b\sum X^3 + c\sum X^4$
By solving these equations we obtain the values of a, b and c.
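As an illustration, the three normal equations of the quadratic form can be set up and solved with Cramer's rule. The sketch below is a minimal one under stated assumptions: the data are hypothetical (generated exactly from Y = 1 + 2X + 0.5X²), the function names `det3` and `fit_quadratic` are our own, and exact fractions are used only to avoid rounding.

```python
from fractions import Fraction


def det3(m):
    """Determinant of a 3 x 3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_quadratic(xs, ys):
    """Solve the normal equations of Y = a + bX + cX^2 by Cramer's rule."""
    xs = [Fraction(x) for x in xs]
    ys = [Fraction(y) for y in ys]
    # sx[k] is the sum of X^k (so sx[0] = n); sxy[k] is the sum of X^k * Y.
    sx = [sum(x ** k for x in xs) for k in range(5)]
    sxy = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(3)]
    lhs = [
        [sx[0], sx[1], sx[2]],
        [sx[1], sx[2], sx[3]],
        [sx[2], sx[3], sx[4]],
    ]
    d = det3(lhs)
    coefficients = []
    for col in range(3):
        # Replace one column by the right-hand side, as Cramer's rule requires.
        m = [row[:] for row in lhs]
        for r in range(3):
            m[r][col] = sxy[r]
        coefficients.append(det3(m) / d)
    return tuple(coefficients)


# Hypothetical data generated exactly from Y = 1 + 2X + 0.5X^2.
x_values = [1, 2, 3, 4, 5]
y_values = ["3.5", 7, "11.5", 17, "23.5"]
a, b, c = fit_quadratic(x_values, y_values)
print(a, b, c)
```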
Certain non-linear equations can be transformed into linear equations by taking
logarithms. Finding out the optimum values of the parameters from the
transformed linear equations is the same as the process discussed in the previous
section. We give below some of the frequently used non-linear equations and the
respective transformed linear equations.
1) $Y = ae^{bX}$
By taking natural log (ln), it can be written as
$\ln Y = \ln a + bX$
or $Y' = \alpha + \beta X'$
where $Y' = \ln Y$, $\alpha = \ln a$, $X' = X$ and $\beta = b$
2) $Y = aX^b$
By taking logarithm (log), the equation can be transformed into
$\log Y = \log a + b \log X$
or $Y' = \alpha + \beta X'$
where $Y' = \log Y$, $\alpha = \log a$, $\beta = b$ and $X' = \log X$
3) $Y = \frac{1}{a + bX}$
If we take $Y' = \frac{1}{Y}$ then
$Y' = a + bX$
4) $Y = a + b\sqrt{X}$
If we take $X' = \sqrt{X}$ then
$Y = a + bX'$
Once the non-linear equation is transformed, the fitting of a regression line is as
per the method discussed in the beginning of this Unit.
We derive the normal equations and substitute the values calculated from the
observed data. From the transformed parameters, the actual parameters can be
obtained by making the reverse transformation.
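The transform, fit and reverse-transform sequence can be sketched for the exponential form. The example is illustrative only: the data are hypothetical (generated from a = 2 and b = 0.3 without error) and `fit_line` is our own helper that solves the two normal equations of a straight line.

```python
import math


def fit_line(xs, ys):
    """Least squares intercept and slope from the two normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical observations generated from Y = 2 * exp(0.3 * X).
xs = [1, 2, 3, 4, 5]
ys = [2.0 * math.exp(0.3 * x) for x in xs]

# Step 1: transform. ln Y = ln a + bX is linear in X.
log_ys = [math.log(y) for y in ys]

# Step 2: fit the transformed (linear) equation.
alpha, beta = fit_line(xs, log_ys)

# Step 3: reverse the transformation to recover the original parameters.
a = math.exp(alpha)
b = beta
print(round(a, 6), round(b, 6))
```

Note that least squares is applied to ln Y here, so with noisy data the fitted curve minimises squared errors in the logarithms rather than in Y itself.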
5.16 ANSWERS/HINTS TO CHECK YOUR PROGRESS EXERCISES
Check Your Progress 1
1) + 0.47
2) + 0.996
3) + 0.98
4) + 0.995
5) – 0.84
1) 2/3
2) + 0.64
4) + 0.82
UNIT 6 INDEX NUMBERS
Structure
6.0 Objectives
6.1 Introduction
6.2 Steps in Construction of Index Numbers
6.2.1 Selection of Base Period
6.2.2 Choice of a Suitable Average
6.2.3 Selection of Items and their Numbers
6.2.4 Collection of Data
6.3 Method of Construction of Index Number
6.3.1 Relative Methods
6.3.2 Aggregative Methods
6.3.3 Quantity or Volume Index Numbers
6.6 Cost of Living Index Number (CLI) or Consumer Price Index Number (CPI)
6.7 Worked-Out Examples
6.8 Let Us Sum Up
6.9 Answers or Hints to Check Your Progress Exercises
6.0 OBJECTIVES
After going through this Unit, you will be able to:
define index numbers; and
construct and calculate them.
6.1 INTRODUCTION
An “index” in the common sense of the word is an “indicator” and not anything
more than that. “Index numbers” or “indices” are forms of the plural, but they all
mean the same thing.
An index number represents the general level of magnitude of the changes
between two (or more) periods of time or places, in a number of variables taken
as a whole. In this definition, the word “variable” refers to numerical variables
which can be measured in quantity, such as the prices of commodities.
Adapted from IGNOU study material of EEC 13: Elementary Statistical Methods and Survey
Techniques, Unit 10 written by J Roy with modifications by Kaustuva Barik
For example, we may like to compare the price level of an article between 2010
and 2020 or between Mumbai and Kolkata. Let us consider the yield of rice in
2015 and in 2020 as 50,000 and 60,000 tons respectively. The year 2015 is taken
as base for comparison of yields, that is, 2015 = 100. The corresponding figure for
2020 will be $\frac{60{,}000}{50{,}000} \times 100 = 120$. This is a single-commodity index number in its
simplest form, being just a relative number. In practice, however, we usually deal
with a number of commodities for the construction of an index.
Index numbers are ratios that are usually expressed as percentages in order to
avoid awkward decimals. Thus if one commodity costs Rs. 45 in 2019 and Rs.
150 in 2020, the ratio would be $\frac{150}{45}$ or 3.33. If instead we express the ratio
as a percentage, $100 \times 3.33$, we say that the index is 333, based on 2019, which
is 100.
6.2.3 Collection of Data
As prices often vary from market to market, they should be collected at regular
intervals from various representative markets. It is desirable to select shops
which are visited by a cross section of customers. The reliability of the index
depends greatly on the accuracy of the quotations given for each constituent item.
1) Relative methods
2) Aggregative methods
i) Laspeyres’ index
a) Simple Average of Relatives
index = 100 ∑
You should note that the base year weighting preserves continuity, but it loses
“up-to-dateness” in the course of time.
Example 6.1: The table below presents the average fares per railway journey.
Using 2010 average = 100, calculations are made according to base year weights.
Example 6.2: The table below shows the average fares per railway journey.
Using 2010 average = 100, calculations are made according to current year
weights.
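$= \frac{P_{n1} + P_{n2} + \dots + P_{nk}}{P_{01} + P_{02} + \dots + P_{0k}} \times 100 = \frac{\sum_{i=1}^{k} P_{ni}}{\sum_{i=1}^{k} P_{0i}} \times 100 = \frac{\sum P_n}{\sum P_0} \times 100$ … (6.4)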
where the summation symbol ∑ extends over all the selected commodities
numbering k. On the other hand, in the case of weighted aggregative index we
have,
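General index $= \frac{p_{n1}q_1 + p_{n2}q_2 + \dots + p_{nk}q_k}{p_{01}q_1 + p_{02}q_2 + \dots + p_{0k}q_k} \times 100 = \frac{\sum_{i=1}^{k} p_{ni}q_i}{\sum_{i=1}^{k} p_{0i}q_i} \times 100$
or simply $= \frac{\sum p_n q}{\sum p_0 q} \times 100$ … (6.5)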
The weights used should be actual quantities bought or sold, and these are kept
unchanged until such time as the index requires to be revised.
There are many formulae for weighted aggregative index, but depending on the
type of weights used, we discuss four indices which are commonly used.
a) Laspeyres’ index
If we use base period quantities ($q_0$) as the weights in the general weighted
aggregative index formula (6.5), we get what is known as Laspeyres’ formula
(L).
$L = \frac{\sum p_n q_0}{\sum p_0 q_0} \times 100$ … (6.6)
It can be seen that this index has fixed base year quantities ($q_0$) as weights and is
equivalent to the weighted arithmetic mean of price relatives given at formula (6.2),
with the base year values ($p_0 q_0$) as weights. Thus, we can also write (6.6) as
$L = \frac{\sum \frac{p_n}{p_0} \times p_0 q_0}{\sum p_0 q_0} \times 100$
b) Paasche’s index
If we use current year quantities ($q_n$) as weights in the general aggregative index
formula (6.5), we get what is known as Paasche’s formula (P).
$P = \frac{\sum p_n q_n}{\sum p_0 q_n} \times 100$ … (6.7)
where $q_n$ (actually $q_{n1}, q_{n2}, \dots, q_{nk}$) are the quantities bought or sold in the
current period.
c) Fisher’s Ideal Index
An index number obtained as the geometric mean (i.e., square root of the product) of
the indices given by Laspeyres’ and Paasche’s formulae satisfies certain
important properties (to be discussed later) and is known as Fisher’s ideal
index.
$F = \sqrt{L \times P} = \sqrt{\frac{\sum p_n q_0}{\sum p_0 q_0} \times \frac{\sum p_n q_n}{\sum p_0 q_n}} \times 100$ … (6.8)
d) Edgeworth-Marshall Index
If the mean of the base period and the current period quantities is used as weight,
i.e., $w = \frac{1}{2}(q_0 + q_n)$, we get a compromise formula, the ‘Edgeworth-Marshall index’.
$I = \frac{\sum p_n (q_0 + q_n)/2}{\sum p_0 (q_0 + q_n)/2} \times 100 = \frac{\sum p_n (q_0 + q_n)}{\sum p_0 (q_0 + q_n)} \times 100$ … (6.9)
We take some hypothetical data and calculate the above indices from it (see
Table 6.1).
Table 6.1: Illustrative Calculation of Laspeyres’, Edgeworth-Marshall’s and Fisher’s Indices
Price ($p_0$)   Quantity ($q_0$)   Price ($p_n$)   Quantity ($q_n$)   ($p_0 q_0$)   ($p_n q_0$)   ($p_0 q_n$)   ($p_n q_n$)
104.72 ≈ 105
2) Paasche’s price index $= \frac{\sum P_n q_n}{\sum P_0 q_n} \times 100 = 110.97 \approx 111$
3) Edgeworth-Marshall’s price index $= \frac{\sum P_n q_0 + \sum P_n q_n}{\sum P_0 q_0 + \sum P_0 q_n} \times 100$
4) Fisher’s ideal index $= \sqrt{\frac{\sum P_n q_0}{\sum P_0 q_0} \times \frac{\sum P_n q_n}{\sum P_0 q_n}} \times 100 = \sqrt{L \times P} = \sqrt{104.72 \times 110.97} = 107.8 \approx 108$
Note that for the same price change, different formulae provide different values.
In this illustration, Laspeyres’ index gives the lowest value while Paasche’s index
gives the highest value. This ordering depends on the data. When buyers shift
towards goods whose prices have risen less, Laspeyres’ index tends to overstate
and Paasche’s index tends to understate the true price change (see Section 6.4).
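The four formulae (6.6) to (6.9) can be computed side by side in a short Python sketch. The prices and quantities below are hypothetical and are not those of Table 6.1; the helper name `agg` is our own.

```python
from math import sqrt

# Hypothetical data for three commodities (not the figures of Table 6.1).
p0 = [10, 8, 5]     # base period prices
q0 = [30, 15, 20]   # base period quantities
pn = [12, 10, 6]    # current period prices
qn = [25, 18, 22]   # current period quantities


def agg(prices, quantities):
    """Aggregate value: the sum of price times quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))


# (6.6) Laspeyres: base period quantities as weights.
laspeyres = agg(pn, q0) / agg(p0, q0) * 100
# (6.7) Paasche: current period quantities as weights.
paasche = agg(pn, qn) / agg(p0, qn) * 100
# (6.8) Fisher: geometric mean of the two (both are already multiplied by 100).
fisher = sqrt(laspeyres * paasche)
# (6.9) Edgeworth-Marshall: the factor 1/2 cancels, so q0 + qn serves as weight.
q_sum = [a + b for a, b in zip(q0, qn)]
edgeworth_marshall = agg(pn, q_sum) / agg(p0, q_sum) * 100

for name, value in [("Laspeyres", laspeyres), ("Paasche", paasche),
                    ("Fisher", fisher), ("Edgeworth-Marshall", edgeworth_marshall)]:
    print(f"{name}: {value:.2f}")
```

Fisher's and the Edgeworth-Marshall indices always lie between the Laspeyres' and Paasche's values, which is why they are described as compromise formulae.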
6.3.3 Quantity or Volume Index Numbers
We can get a quantity or volume index number, which measures and permits
comparison of quantities of goods, from the corresponding price index number
formulae simply by replacing p by q and q by p.
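1) Quantity relative $= \frac{q_n}{q_0} \times 100$
7) Fisher’s ideal index $= \sqrt{\frac{\sum q_n P_0}{\sum q_0 P_0} \times \frac{\sum q_n P_n}{\sum q_0 P_n}} \times 100$
8) Edgeworth-Marshall’s index $= \frac{\sum q_n (P_0 + P_n)}{\sum q_0 (P_0 + P_n)} \times 100$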
3) The following are the prices of six different commodities for 2020 and 2021.
Compute the price index by (a) aggregative method, and (b) average of
price relatives method by using arithmetic mean.
Commodities   Price in 2020 (Rs.)   Price in 2021 (Rs.)
A 40 50
B 50 60
C 20 30
D 50 70
E 80 80
F 100 110
………………………………………………………………………………....…
…………………………………………………………………………....………
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
4) Calculate Fisher’s Ideal Index Number from the following group of items.
Base Year Current Year
Item No. Price Quantity Price Quantity
(in Rs.) (in Kg.) (in Rs.) (in Kg.)
1 4 1.0 3 4
2 8 1.5 7 5
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
5) Calculate Laspeyres’ and Paasche’s Index Number from the following data.
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
6.4 MERITS OF THE VARIOUS AGGREGATIVE MEASURES
The different index numbers serve different purposes and, therefore, the
appropriateness of a particular index number depends on the purpose at hand.
Laspeyres’ index is simpler to calculate, since it uses the base period
quantities as weights, which are not difficult to get, and the denominator needs
calculating only once. But in this index a rise in prices tends to be overstated, since
it does not take into account corresponding falls in demand or changes in output.
Indices such as Paasche’s, on the other hand, use current period quantities as
weights, which are difficult to get, and the weights need to be constructed afresh
for every year. Moreover, Paasche’s index tends to understate the rise in prices
because it uses current weights. Laspeyres’ index is probably more
commonly used, since it is convenient to employ fixed weights. But with the
passage of time the weights are rendered out of date. For example, in 1995 the
number of mobile phones in Odisha was nil. In 2020 there are more mobile
phones than land line connections. Paasche’s index uses the
preferable current weights, but since up-to-date information on the quantity of goods
produced, consumed, marketed or distributed is not readily obtained,
Laspeyres’ index has a great advantage.
Symbolically,
$I_{0n} \times I_{n0} = 1$
where $I_{0n}$ = index number for period n with the base period 0, and $I_{n0}$ = index
number for period 0 with the base period n.
There are five methods which do satisfy the time reversal test. These are:
1) Simple geometric mean of price relatives
2) Aggregative indices with fixed weights
3) Edgeworth-Marshall formula
4) Weighted geometric mean of price relatives if fixed weights are used
5) Fisher’s ideal index
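The time reversal test can be checked numerically. The sketch below uses hypothetical prices and quantities and our own function names; indices are kept as ratios (not multiplied by 100) so that the product of the forward and backward indices can be compared with 1.

```python
from math import sqrt

# Hypothetical data: period 0 and period n prices and quantities.
p0, q0 = [10, 8, 5], [30, 15, 20]
pn, qn = [12, 10, 6], [25, 18, 22]


def agg(prices, quantities):
    return sum(p * q for p, q in zip(prices, quantities))


def laspeyres(p_base, q_base, p_cur, q_cur):
    """Laspeyres' price index as a ratio; q_cur is unused by this formula."""
    return agg(p_cur, q_base) / agg(p_base, q_base)


def paasche(p_base, q_base, p_cur, q_cur):
    """Paasche's price index as a ratio; q_base is unused by this formula."""
    return agg(p_cur, q_cur) / agg(p_base, q_cur)


def fisher(p_base, q_base, p_cur, q_cur):
    args = (p_base, q_base, p_cur, q_cur)
    return sqrt(laspeyres(*args) * paasche(*args))


# Forward index (base 0, current n) times backward index (base n, current 0).
fisher_product = fisher(p0, q0, pn, qn) * fisher(pn, qn, p0, q0)
laspeyres_product = laspeyres(p0, q0, pn, qn) * laspeyres(pn, qn, p0, q0)

print(round(fisher_product, 6))     # equals 1 up to rounding error
print(round(laspeyres_product, 6))  # differs from 1 for these data
```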
Fisher’s ideal index $F = \sqrt{\frac{\sum P_n q_0}{\sum P_0 q_0} \times \frac{\sum P_n q_n}{\sum P_0 q_n}}$
Now, for example, Laspeyres’ indices for prices and quantities are given
respectively by
$I_p = \frac{\sum P_n q_0}{\sum P_0 q_0}$ and $I_q = \frac{\sum q_n P_0}{\sum q_0 P_0}$
$I_p \cdot I_q = \frac{\sum P_n q_0 \times \sum q_n P_0}{(\sum P_0 q_0)^2} \neq I_v$
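On the other hand, Fisher’s ideal index satisfies this test, as shown below.
$I_p = \sqrt{\frac{\sum P_n q_0}{\sum P_0 q_0} \times \frac{\sum P_n q_n}{\sum P_0 q_n}}$
$I_q = \sqrt{\frac{\sum q_n P_0}{\sum q_0 P_0} \times \frac{\sum q_n P_n}{\sum q_0 P_n}}$
$I_p \cdot I_q = \sqrt{\frac{\sum P_n q_0}{\sum P_0 q_0} \times \frac{\sum P_n q_n}{\sum P_0 q_n} \times \frac{\sum q_n P_0}{\sum q_0 P_0} \times \frac{\sum q_n P_n}{\sum q_0 P_n}}$
$= \sqrt{\frac{\sum P_n q_n}{\sum P_0 q_0} \times \frac{\sum P_n q_n}{\sum q_0 P_0}} = \frac{\sum P_n q_n}{\sum P_0 q_0} = I_v$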
To understand this principle further, we take the following example.
If the price per unit and the quantity of an item changed in 2020, as compared to
2010, from Rs. 16 to Rs. 32 and from 100 units to 200 units respectively, then the
price and quantity in 2020 would both be 200% or 2.00 times the price and
quantity in 2010. The values (product of price and quantity) would be Rs. 1600 in
2010 and Rs. 6400 in 2020, so that the value ratio is 6400/1600 = 4.00. Thus, we
verify that 2.00 × 2.00 = 4.00, that is, the product of price ratio and quantity ratio
is equal to the value ratio.
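With several commodities the same check can be made by computing a price index, the matching quantity index (prices and quantities exchange roles) and the value ratio. The following sketch uses hypothetical data and our own function names; it is an illustration of the factor reversal test, not a worked example from this Unit.

```python
from math import sqrt


def total(a, b):
    return sum(x * y for x, y in zip(a, b))


def laspeyres_ratio(v0, w0, vn, wn):
    """Laspeyres' index of variable v with weights w (wn is not used)."""
    return total(vn, w0) / total(v0, w0)


def fisher_ratio(v0, w0, vn, wn):
    """Fisher's ideal index of variable v with weights w, as a ratio."""
    paasche = total(vn, wn) / total(v0, wn)
    return sqrt(laspeyres_ratio(v0, w0, vn, wn) * paasche)


# Hypothetical prices (p) and quantities (q) in periods 0 and n.
p0, q0 = [10, 8, 5], [30, 15, 20]
pn, qn = [12, 10, 6], [25, 18, 22]

value_ratio = total(pn, qn) / total(p0, q0)

# Price index: v = prices, w = quantities. Quantity index: roles exchanged.
fisher_product = fisher_ratio(p0, q0, pn, qn) * fisher_ratio(q0, p0, qn, pn)
laspeyres_product = laspeyres_ratio(p0, q0, pn, qn) * laspeyres_ratio(q0, p0, qn, pn)

print(round(value_ratio, 6), round(fisher_product, 6), round(laspeyres_product, 6))
```

For these data the Fisher product coincides with the value ratio, while the Laspeyres product does not.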
Example 6.3: We show with the following data that Fisher’s ideal index
satisfies the factor reversal test.
Let us calculate the following from the data given in the above table.
Price ratio: $I_p = \sqrt{\frac{\sum P_n q_0}{\sum P_0 q_0} \times \frac{\sum P_n q_n}{\sum P_0 q_n}}$
Quantity ratio: $I_q = \sqrt{\frac{\sum q_n P_0}{\sum q_0 P_0} \times \frac{\sum q_n P_n}{\sum q_0 P_n}}$
Value ratio: $I_v = \frac{\sum P_n q_n}{\sum P_0 q_0}$
$I_p \cdot I_q = I_v$
6.5.3 Chain Index Number and Circular Test
Two types of base periods are used for the construction of index numbers,
namely, (a) fixed base, (b) chain base. Most commonly used indices use fixed
base method. This method cannot take into account any changes in price or
quantity in any other year. It fails to include new commodities gaining
importance at a later date or exclude commodities losing significance in course of
time. These problems can be overcome by chain index numbers.
Using a suitable index number formula (say, Laspeyres’ index), link indices,
defined as follows, are first calculated: Link index = Index number with previous
period as base. The chain index is obtained by multiplying link indices
progressively. Thus, the chain index number $I_{0n}$ for period n with base period 0 is
given by
$I_{01} = I_{01}$
$I_{02} = I_{01} \times I_{12}$
……………………………..
$I_{0n} = I_{01} \times I_{12} \times \dots \times I_{(n-1)n}$
Example 6.4 The calculation of chain index numbers is illustrated with reference
to the following data:
Year Link index Chain index (Base 2010 =100)
Thus, the chain index numbers for the years 2011 to 2013 with 2010 as the base
are 80, 96 and 72 respectively.
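The progressive multiplication of link indices can be sketched as follows. The link index values used here (80, 120 and 75) are an assumption made for illustration: they are chosen so that the resulting chain indices agree with the figures 80, 96 and 72 quoted above, and they should not be read as the data of the example.

```python
# Assumed link indices (previous year = 100); illustrative values only.
link_indices = {2011: 80.0, 2012: 120.0, 2013: 75.0}

chain_indices = {}
previous_chain = 100.0  # base year 2010 = 100
for year in sorted(link_indices):
    # Chain index = previous chain index x current link index / 100.
    previous_chain = previous_chain * link_indices[year] / 100.0
    chain_indices[year] = previous_chain

for year, value in chain_indices.items():
    print(year, value)
```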
Circular Test: The circular test is an extension of time reversal test over a
number of years. It states that the chain index for the year 2013, calculated above,
starting from the base year 2010 will be same as the index number directly
calculated with fixed base period of 2010. In symbols,
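$I_{01} \times I_{12} \times \dots \times I_{(n-1)n} \times I_{n0} = 1$. (Notice that $I_{0n}$ = …
With base period 0, we can trace the above formula from 1 to 3 years:
$\frac{\sum P_1 q}{\sum P_0 q} \times \frac{\sum P_2 q}{\sum P_1 q} \times \frac{\sum P_3 q}{\sum P_2 q} \times \frac{\sum P_0 q}{\sum P_3 q} = 1$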
Fisher’s ideal index does not satisfy this test. It has been proved that no index
satisfies both the factor reversal and the circular test.
A 20 16 28 35 21
B 25 30 24 36 45
C 20 25 30 24 30
…………………………………………………………………………...…
…………………………………………………………………………...…
…………………………………………………………………………...…
…………………………………………………………………………...…
……………………………………………………………………………..
…………………………………………………………..…....................…
2) Construct Fisher’s ideal index number from the following data and
show that it satisfies Factor and Time Reversal Tests.
                 Base Year                          Current Year
Commodities   Price per unit (Rs.)   Expenditure   Price per unit (Rs.)   Expenditure
A 2 40 5 75
B 4 16 8 40
C 1 10 2 24
D 5 25 10 60
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
1) Food
3) Clothing
4) House rent
5) Miscellaneous
The common method for obtaining the consumption basket is to conduct a family
living survey among the population group for which the index is to be
constructed. Prices of selected items are also collected from various retail
markets used by consumers in question. It may be noted that each of the above
broad groups contains several sub groups. Thus, ‘food’ includes cereals, pulses,
oils, meat, fish, egg, spices, vegetables, fruits, non-alcoholic beverages, etc.
‘Miscellaneous’ includes such items as medical care, education, transport,
recreation, gifts and many others. When more than one price quotation is
collected for a single commodity, a simple average is taken. An index number is
constructed for each of the five groups using a weighted average of the price
relatives in the group; the weights used are proportional to the expenditure on the
consumed item by an average family. Next, the overall index (CLI) is computed as a weighted
average of group indices, the weights being again the proportional expenditure on
each group.
The CLI or consumer price index (CPI) numbers have significant practical
implications and extensive public use. Their use as a wage regulator is the most
important. The dearness allowance (DA) of employees is primarily determined
by this index. When wages or incomes are divided by the corresponding CLI, the
effect of changes in prices (inflation) is eliminated. This is known as the process
of deflation, which is used to find ‘real wages’ or ‘real income’. As mentioned
earlier, the reciprocal of CLI gives us the purchasing power of money.
Example 6.5: Construction of an index for food
Item      Prices           P = (Pn/P0) × 100    Weights (w)    P × w
          Pn       P0
Rice      50       40      125.0                30             3750.0
Wheat     45       30      150.0                20             3000.0
Pulses    60       40      150.0                10             1500.0
Sugar     40       20      200.0                5              1000.0
Oil       75       60      125.0                15             1875.0
Potato    60       50      120.0                15             1800.0
Fish      200      150     133.3                5              666.5
Total                                           100            13591.5
Index (food) $= \frac{\sum \left(\frac{P_n}{P_0} \times 100\right) w}{\sum w} = \frac{\sum P w}{\sum w} = \frac{13591.5}{100} = 135.915 \approx 135.92$
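The group index of Example 6.5 can be reproduced in Python as a weighted average of price relatives. The sketch uses the prices and weights of the example; the variable names are our own. Unrounded relatives are used, so the weighted sum differs slightly from the tabulated 13591.5 (which rounds the relative for fish to 133.3), but the index agrees to two decimal places.

```python
# (current price Pn, base price P0, weight w) for each item of Example 6.5.
food_items = {
    "Rice":   (50, 40, 30),
    "Wheat":  (45, 30, 20),
    "Pulses": (60, 40, 10),
    "Sugar":  (40, 20, 5),
    "Oil":    (75, 60, 15),
    "Potato": (60, 50, 15),
    "Fish":   (200, 150, 5),
}

# Price relative P = Pn / P0 x 100, weighted by w and summed over the items.
weighted_sum = sum(pn / p0 * 100 * w for pn, p0, w in food_items.values())
total_weight = sum(w for _, _, w in food_items.values())

food_index = weighted_sum / total_weight
print(round(food_index, 2))
```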
Check Your Progress 3
1) Calculate a number which will indicate the percentage change in volume
of traffic from October 2019 to October 2020, when account is taken of
the relative values of the different types of traffic.
Type of traffic Tons(‘000) Receipts(Rs.’000)
Oct. 2019 Oct. 2020 Oct. 2019
Merchandise 1246 1206 776
Fuel 4794 4229 562
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
………………………………………………….................………..……
2) Compute Paasche’s price index number for 2020 with 2015 as the base
from the following data:
Commodity Unit Price(Rs.) per unit Quantities sold
2015 2020 2015 2020
A kg. 4 5 95 120
B kg. 60 70 118 13
C kg. 35 40 50 70
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
…………………………………………………………………………...
3) From the following data, compute the Laspeyres’ price index number for
2021 with 2019 as base:
Item Price(Rs.) Total Value(Rs.)
2019 2021 2019
A 12.50 14.00 112.50
B 10.50 12.00 126.00
C 15.00 14.00 105.00
D 9.40 11.20 47.00
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
= … × 100 = 213.84
b) Method of Price Relatives
Index number for 2020 (base 2010 = 100)
$= \frac{\sum \frac{p_n}{p_0} \times 100}{k} = \frac{952}{5} = 190.4$
Example 6.8: Calculate price index numbers from the following information,
using (a) weighted aggregative formula, and (b) weighted arithmetic mean of
price relatives:
Item Unit Price per unit(Rs.)
Base year Current year Weight
A quintal 85 115 19
B kg. 15 15 25
C dozen 45 61 40
D litre 55 100 20
E Lb 17 23 21
Example 6.9: Given below are the data on prices of some consumer goods and
the weights attached to the various commodities. Calculate price index numbers
for the year 2021 (base 2020 = 100), using (a) simple average, and (b) weighted
average of price relatives.
Price (Rs.)
Commodity Unit 2020 2021 Weights
Wheat Kg. 0.50 0.75 2
Milk Litre 0.60 0.75 5
Egg Dozen 2.00 2.40 4
Sugar Kg. 1.80 2.10 8
Shoes Pair 8.00 10.00 1
Calculations for price relative index.
a) Simple average of price relative index $= \frac{\sum \frac{p_n}{p_0} \times 100}{k} = 127.4$
b) Weighted average of price relative index $= \frac{\sum \left(\frac{p_n}{p_0} \times 100\right) w}{\sum w} = 123.3$
Example 6.10: On the basis of the following data, calculate the wholesale price
index number for the five groups combined (Base: 2015-16 = 100).
Group Weight Index no. for
the week ending 31.01.2021
Food 50 241
Group   Weight (w)   Group Index (I)   I × w
Example 6.11: Annual production (in million tons) of four commodities are
given below:
Commodity Production Weight
2015 2019 2020
Calculate quantity index numbers for the years 2019 and 2020 with 2015 as base
year, using (a) simple arithmetic mean, and (b) weighted arithmetic mean of the
relatives.
Quantity relatives for 2019 with base year 2015 (=100)
$I = \frac{q_n}{q_0} \times 100 = \frac{q_{2019}}{q_{2015}} \times 100$
Commodity A: × 100 = 125
Example 6.12: From the following price(p) and quantity (q) data, compute
Fisher’s ideal index number.
Commodity 2015 (Base Year) 2020 (Current Year)
Price Quantity Price Quantity
A 12 10 17 10
B 14 9 16 11
C 11 12 13 10
Paasche’s price index $= \frac{\sum P_n q_n}{\sum P_0 q_n} \times 100 = \frac{476}{384} \times 100 = 123.96 \approx 124$
∑ × ×
Quantity index = ∑
= = 92
Commodity   P0   Pn   q0   qn   P0 qn   Pn qn
3) We are given the base price ($P_0$), current price ($P_n$) and value in the base year
($P_0 q_0$). To find base year quantity ($q_0$), we can use the relation
$q_0 = \frac{P_0 q_0}{P_0}$
Using $P_0$, $P_n$ and $q_0$, we can find Laspeyres’ index as
$L = \frac{\sum P_n q_0}{\sum P_0 q_0} \times 100$
Calculation for Laspeyres’ price index
Item P0 Pn P0 q0 Pn Pn q0
4) Marshall-Edgeworth index $= \frac{\sum P_n (q_0 + q_n)}{\sum P_0 (q_0 + q_n)} \times 100 = \frac{\sum P_n q_0 + \sum P_n q_n}{\sum P_0 q_0 + \sum P_0 q_n} \times 100$
UNIT 7 DETERMINISTIC TIME SERIES AND
FORECASTING
Structure
7.0 Objectives
7.1 Introduction
7.2 Problems and Objects of Study of Time Series Data
7.2.1 Components of Time Series
7.2.2 Construction of Time Series: An Example
7.3 Measurement of Trend
7.3.1 Moving Average Method
7.3.2 Suitability of Moving Averages
7.3.3 Examples of Moving Averages
7.4 Method of Fitting Polynomials
7.4.1 Suitability of Least Squares Method
7.4.2 Examples of Least Squares Method
7.5 Monthly or Quarterly Trend Values from Annual Data
7.6 Measurement of Seasonal Variations
7.6.1 Method of Simple Average
7.6.2 Ratio to Trend Method
7.6.3 Ratio to Moving Average Method
7.7 Let Us Sum Up
7.8 Answers or Hints to Check Your Progress Exercises
7.0 OBJECTIVES
After going through this Unit, you will be able to
construct a trend line for time series data;
compute moving averages; and
calculate various measures of seasonal variations.
7.1 INTRODUCTION
A time series is a set of observations on a variable measured at successive points
of time. Usually the variable values are recorded over equal time intervals: yearly,
quarterly, monthly, etc. Generally the term ‘time series’ refers to economic data,
but it applies equally to quantitative data collected in other fields. The time
series of National Income, Agricultural Income, and Agricultural Production are
based on annual observations.
Adapted from IGNOU study material of EEC 13: Elementary Statistical Methods and Survey Techniques,
Unit 11 written by S Bandopadhyay with modifications by K. Barik
Other examples of time series are yield of a crop in different years, population of
a country over different points of time, sales of a departmental store during
different seasons of the year, quarterly exports of tea, etc. For these types of data
one of the variables is time, denoted by t, and the other, which is dependent on
time (such as yield, population, sale or export), is represented by $y_t$. We will
analyse some of these series with the help of the methodology to be developed in
this Unit.
Although the additive model facilitates easier calculation, the multiplicative
model has been most widely used in analysis of time series.
a) Secular Trend
By secular trend we mean the smooth, regular, long-term changes in the series
when observed over a period of time. Some series may exhibit an upward trend,
some series a downward trend while some others may remain more or less
constant over time. The upward trend of a series may be caused by factors such
as increase in population and improvement in techniques of production. For
example, the pattern of growth of many industries follows closely that of
population growth of the country. Again, advances in technology may give
rise to upward movement of most economic time series. But not all time
series will exhibit growth. Some may show decline while some others may show
fluctuations. The time series of crude death rates of a country is likely to show a
declining trend.
b) Seasonal Variations
The graphs of most of the time series reveal that a large number of fluctuations
are imposed on the trend. By seasonal variation we mean the periodic movement
in a time series where the period is not longer than one year. A periodic
movement is that which repeats at regular intervals or periods of time. For
example, the sales of cold drinks increase during summer and decrease during
winter, sales of garments are maximum during some seasons of the year, say
during May or festivals, the number of passengers carried by buses has a peak
during office hours, the number of books borrowed from a library has a peak
during some days of the week, etc. The factors which contribute to this type of
fluctuations are the climatic changes of different seasons, customs and habits
which people follow at different times.
c) Cyclical Fluctuations
By cyclical fluctuations we mean oscillatory movements of a time series, where
the period of oscillation, called the cycle, is more than a year. It includes those
factors leading to alternating periods of expansion and contraction that
characterize most economic and business series. Sometimes these fluctuations are
highly irregular with respect to their shape, amplitude and direction. But the phenomena
they reflect (the periods of depression, recovery, boom and collapse) have been
observed in virtually all time series dealing with business and economic data.
d) Irregular Movement
The irregular movement component includes all factors not classifiable
elsewhere. Thus factors such as work stoppages, elections, wars and fires may affect a
particular time series; this category of movement includes all types of variations
not accounted for by secular trend, seasonal or cyclical fluctuations.
Unfortunately, factors of these kinds are frequently indistinguishable from
cyclical factors and as such in some discussions cyclical and irregular
components are combined together.
7.2.2 Construction of Time Series: An Example
As an illustration we prepare a time series according to the multiplicative model.
Table 7.1 presents trend, seasonal and cyclical-irregular components of a
hypothetical series.
Table 7.1: Hypothetical Time Series and its Components (Quarterly)
                                 Components
Year   Quarter   Series (yt)     Trend (T)   Seasonal (100S)   Cyclical-Irregular (100CI)
1      I         79              80          120               82
       II        58              85          80                85
       III       84              90          92                102
       IV        107             95          108               105
Thus the observation 79 (of I quarter of 1st year) $= 80 \times \frac{120}{100} \times \frac{82}{100}$ (that is, 78.72, rounded to 79).
Thus, each quarterly figure ($y_t$) is the product of the secular trend (T), the
seasonal index (S), and the cyclical-irregular component (CI). Such a synthetic
composition looks very much like an actual time series and has encouraged use
of the model as the basis for the analysis of time series data.
1 y1 - -
2 y2 - -
y1 + y2 + y3 + y4 = T1   T1 / 4
y2 + y3 + y4 + y5 = T2   T2 / 4
y3 + y4 + y5 + y6 = T3   T3 / 4
y4 + y5 + y6 + y7 = T4   T4 / 4
6 y6 - -
7 y7 - -
In the above illustration, the period of moving averages is 4 years. Both in the
direct and in the short-cut method col. 3 shows the 4-year moving totals. The first
value (T1) is placed between the second and the third year, the second moving
total (T2) is placed between the third and the fourth year and so on. The centered
4-year moving averages are placed at the third year, fourth year, by taking a
further 2 item moving average in the direct method (Table 7.2.) In the short cut
method (Table 7.3), the calculation of the 4-year moving average is omitted (as
shown in col. 4 of Table 7.2 in the direct method). Instead, the 2-item moving
totals of the 4-year moving averages are obtained (col. 4 and 5).
You should note that for a 4-year moving average, the procedure for centering
leaves out 4/2 = 2 years at each end of the series.
Table 7.3 Calculation of centered 4-year moving averages (Short Method)
Year                     2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
Production (‘000 tons)   18   19   20   22   20   19   22   24   25   24   25   26
Steps of Calculation
1) In Table 7.3.1 the figures in col. 3 are obtained as the sum of three
consecutive values of col. 2. Thus the first moving total (M.T.) is 57 = 18
+ 19 + 20 and is placed against 2001. The second moving total 61 = 19 +
20 + 22 is placed against 2002.
3) The five-year moving totals in col. 5 are obtained as the sum of five
consecutive values in col. 2. Thus the first moving total against the year
2002 is 99 = 18 + 19 + 20 + 22 + 20.
4) The five-year moving averages are obtained by dividing the five-year moving
totals in col. 5 by 5. Thus the moving average for 2005
is 107 ÷ 5 = 21.4.
Table 7.3.1: Calculation of (I) 3-year Average (II) 5-year Moving Average
2000 18 - - - -
2001 19 57 19.0 - -
2010 25 75 25.0 - -
2011 26 - - - -
Note that for 3-year centered moving averages (3 − 1)/2 = 1 year, and for 5-year
centered moving averages (5 − 1)/2 = 2 years, respectively, are left out both at the
beginning and the end of the series.
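The moving totals and moving averages for an odd period can be computed with a short routine. The sketch below uses the production figures for 2000 to 2011 given above; the function name `moving_average` is our own, and entries for which no average exists are left as `None`.

```python
years = list(range(2000, 2012))
production = [18, 19, 20, 22, 20, 19, 22, 24, 25, 24, 25, 26]  # '000 tons


def moving_average(values, period):
    """Moving totals and averages for an odd period, placed at the middle year."""
    half = period // 2
    totals = [None] * len(values)
    averages = [None] * len(values)
    for i in range(half, len(values) - half):
        window = values[i - half:i + half + 1]
        totals[i] = sum(window)
        averages[i] = totals[i] / period
    return totals, averages


totals3, averages3 = moving_average(production, 3)
totals5, averages5 = moving_average(production, 5)

for year, t3, a3, t5, a5 in zip(years, totals3, averages3, totals5, averages5):
    print(year, t3, a3, t5, a5)
```

The first and last year have no 3-year average, and the first two and last two years have no 5-year average, in line with the note above.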
Example 7.3.2: Compute trend values for the following time series using 4-yearly
moving averages.
Year             2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
Yield (qntls.)   52   54   55   57   58   61   63   66   67   70
Solution:
Table 7.3.2(a): Calculation of 4-year moving average (Direct Method)
2009 52 - - - -
2017 67 - - - -
2018 70 - - - -
Table 7.3.2(b): Calculation of 4-year Moving Average (Shortcut Method)
Year   Yield   4-year M.T.   2-item M.A   (Centered) 4-year M.A
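The shortcut method for an even period can be sketched with the yield data of Example 7.3.2. The variable names are our own; the sketch forms 4-year moving totals, adds adjacent pairs of totals, and divides by 8 to obtain the centered 4-year moving averages.

```python
years = list(range(2009, 2019))
yields = [52, 54, 55, 57, 58, 61, 63, 66, 67, 70]  # quintals
period = 4

# 4-year moving totals; each falls between two years.
moving_totals = [sum(yields[i:i + period]) for i in range(len(yields) - period + 1)]

# 2-item moving totals of the 4-year totals; each now falls on a year.
two_item_totals = [a + b for a, b in zip(moving_totals, moving_totals[1:])]

# Dividing by 2 x 4 = 8 gives the centered 4-year moving average.
centered_average = {
    years[i + period // 2]: total / (2 * period)
    for i, total in enumerate(two_item_totals)
}

for year, value in centered_average.items():
    print(year, value)
```

Trend values are obtained for 2011 to 2016 only; two years at each end of the series are left out.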
Example 7.3.3
Find trend values for the following series using a 3-year weighted moving
average with weights 1, 2, 1.
Year 1 2 3 4 5 6
Value 2 3 5 6 8 11
Solution:
Table 7.3.3: Calculation of 3-year weighted moving average
1 2 - -
2 3 13 3.25
3 5 19 4.75
4 6 25 6.25
5 8 33 8.25
6 11 - -
Steps of calculation
1) Col. 3 figures are the weighted moving totals of col.2 figures with weights 1,
2, 1.
Thus 1 × 2 + 2 ×3+ 1×5 =13
1 × 3 + 2 × 5 + 1×6 =19
2) Col.4 = col.3 ÷ (sum of weights, i.e., 4)
Thus 13 ÷ 4 = 3.25, 19 ÷ 4 = 4.75
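The same calculation can be written as a short Python sketch using the series and the weights 1, 2, 1 of Example 7.3.3; the variable names are our own.

```python
values = [2, 3, 5, 6, 8, 11]   # years 1 to 6
weights = [1, 2, 1]

half = len(weights) // 2
weighted_average = {}
for i in range(half, len(values) - half):
    window = values[i - half:i + half + 1]
    # Weighted moving total (col. 3), then divide by the sum of weights (col. 4).
    weighted_total = sum(w * v for w, v in zip(weights, window))
    weighted_average[i + 1] = weighted_total / sum(weights)

print(weighted_average)
```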
Example 7.3.4: Calculate the 4-quarter moving average for the following time
series data
Quarter / Year
2015 2016 2017 2018
1 62 66 72 79
2 58 60 67 74
3 72 74 80 88
4 60 64 69 77
Solution:
Year   Quarter   Value   4-quarter (M.T.)   Centered (M.T.)   4-quarter (M.A.)
Check Your Progress 1
1) Given below is data on index of production for the period 2011 to 2020.
Year                  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020
Index of Production   109.2 119.8 129.7 140.8 153.8 152.2 152.6 163.0 175.3 184.3
1) Fit the trend line and predict the index of production for the year 2012 by
3-year moving averages method.
…………………………………………………………………………...…
………....………………………………………………………………..….
……..…………………………………………………………………….…
……………………...………………………………………………………
………………………………...……………………………………………
$Y = a + bx + cx^2 + dx^3$ (third degree polynomial)
The constants appearing in the above equations (such as a, b, c, …) are obtained by
applying the principle of “least squares”, as in regression (see Unit 5). This
states that the values of the constants will be such as to make the sum of squares
of the deviations
$\sum (y - Y)^2$ a minimum, where y is the observed value and Y is the trend value given by
$Y = a + bx$
or
$Y = a + bx + cx^2$ etc., and the summation is taken over all the observations.
In the case of straight line fitted by the method of “least squares”, the constants a
and b are determined from the following normal equations:
∑ 𝑦 = 𝑛𝑎 + 𝑏 ∑ 𝑥 and
∑ 𝑥𝑦 = 𝑎 ∑ 𝑥 + 𝑏 ∑ 𝑥
∑ 𝑥𝑦 = 𝑎 ∑ 𝑥 + 𝑏 ∑ 𝑥 + 𝑐 ∑ 𝑥
Thus for straight line y = a + bx, as the coefficient of a is 1, the first normal
equation is ∑ 𝑦 = 𝑛𝑎 + 𝑏 ∑ 𝑥.
For the second normal equation, multiply each observation by the coefficient of b
in that equation and take sum over all the n observations. In the case of straight
line, coefficient of b is x. So, the second normal equation is ∑ 𝑥𝑦 = 𝑎 ∑ 𝑥 +
𝑏∑𝑥 .
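The solution of the two normal equations for a straight line can be sketched in Python as follows. The function name `fit_line` and the five data values are hypothetical and serve only as an illustration; the formulas for a and b are obtained by solving the two normal equations simultaneously.

```python
def fit_line(xs, ys):
    """Solve the normal equations of the straight line y = a + b x.

    sum(y)  = n*a      + b*sum(x)
    sum(xy) = a*sum(x) + b*sum(x^2)
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Hypothetical series for five years, with x measured from the middle year
xs = [-2, -1, 0, 1, 2]
ys = [10, 12, 15, 17, 21]
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))
```

When the origin is placed at the middle of the period, the sum of x is zero and the equations reduce to a = (sum of y)/n and b = (sum of xy)/(sum of x squared), which is the simplification used in the worked examples.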
Now we will consider trend fitting for periods covering odd (Table 7.4) and even
(Table 7.5) number of years taking the first degree polynomial.
Case I: Odd number of years (n = 5)
Table 7.4
Year   y   x   x²   xy
$\sum xy = 70b$
Here the origin (x = 0) will be in the middle of the 3rd and 4th year and the unit of x
= 6 months.
7.4.1 Suitability of Least Squares Method
Trend lines are used for description of the growth or decline of the time series
and as an aid to the study of the long-term trend of the economy. The method of
fitting polynomials completely eliminates personal bias and trend values for all
the given periods can be obtained. This is, however, not possible with moving
average method. But the choice of the type of the polynomial curve is arbitrary
and one cannot be sure whether a linear or parabolic curve will represent the
trend best. The choice of the trend equation may itself lead to a bias. It is,
however, possible to get some idea of the pattern of trend from the scatter
diagram of the data.
7.4.2 Examples of Least Squares Method
Example 7.4.1
Fit a straight line trend by the method of least squares to the following data:
1) The data given below give the index of industrial production from 1961 to
1970
Here the number of years is odd (n = 7). Let y = a + bx be the equation of the
straight line trend with origin (x = 0) at 2008 and one unit of x = 1 year. The least
squares normal equations are (see Unit 5)
$\sum y = na + b\sum x$
$\sum xy = a\sum x + b\sum x^2$
7a = 736, so a = 105.1
2005 81 -3 9 -243
2006 92 -2 4 -184
2008 105 0 0 0
Hence, using the trend equation the estimate for 2012 is Y = 105.1 + 4 × 7.21 =
133.94.
Example: 7.4.2
Fit a straight line trend to the following time series data:
Solution:
Here the number of years is even (n = 6). Let y = a + bx be the trend equation
with origin (x = 0) at the middle of 2012 and 2013 and unit of x = 6 months. The
normal equations are
$\sum y = na + b\sum x$
$\sum xy = a\sum x + b\sum x^2$
Table 7.4.2: Fitting Straight Line Trend
Year   Profit (y) (Rs. lakhs)   x   x²   xy
So, substituting the values of $\sum y$, $\sum xy$, $\sum x$ and $\sum x^2$ from the above table in
the normal equations, we get
6a = 20.8, or a = 3.47
70b = 4.8, or b = 0.07
The trend equation is
Y = 3.47 + 0.07x, with origin at the middle of 2012 and 2013 and unit of x = 6
months.
For 2016, x would be 7.
So, estimate for 2016 is
Y = 3.47 + 0.07 × 7 = 3.47 + 0.49 =3.96
Hence the estimated profit for 2016 is Rs. 3.96 lakhs.
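The coding of x for an even number of years can be sketched in Python. The sketch assumes that the six years are 2010 to 2015, which is inferred from the stated origin (middle of 2012 and 2013) and is not given explicitly here; the helper name `code_even` is hypothetical. The sums 20.8 and 4.8 are those used in the solution above.

```python
def code_even(year, origin=2012.5):
    """Code a year in units of 6 months from the mid-point origin."""
    return int(round(2 * (year - origin)))


years = [2010, 2011, 2012, 2013, 2014, 2015]   # assumed period, n = 6
x = [code_even(y) for y in years]               # -5, -3, -1, 1, 3, 5
sum_x2 = sum(v * v for v in x)                  # 70

sum_y, sum_xy = 20.8, 4.8                       # sums used in the solution
a = sum_y / len(x)                              # since sum of x is zero
b = sum_xy / sum_x2
estimate_2016 = a + b * code_even(2016)
print(x, sum_x2, round(a, 2), round(b, 2), round(estimate_2016, 2))
```

With unrounded constants the estimate is about 3.95; the value 3.96 in the text results from rounding a and b to two decimal places before substitution.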
Example: 7.4.3
The sales of a company (in thousands of rupees) for the years 2010 to 2016 are
given in the following table. Fit an exponential trend ($Y = AB^x$) and estimate the
sales for 2017.
Solution:
Here the number of years is odd (n = 7). Taking log of both sides of the given
equation, we can write log Y = log A + x log B. Let a = log A and b = log B. Thus
we can write
log Y = a + bx.
Further, we take origin (x = 0) at 2013 and one unit of x = 1 year. The least
squares normal equations are:
$\sum \log y = na + b\sum x$
$\sum x \log y = a\sum x + b\sum x^2$
Table 7.4.3: Fitting Straight Line Trend
2013 92 0 0 1.9638 0
So, substituting the values of $\sum \log y$, $\sum x \log y$, $\sum x$ and $\sum x^2$ from the above
table in the normal equations, we get
7a = 13.7931, or a = 1.97
28b = 4.3237, or b = 0.154
Thus, the fitted function is logy = 1.97 + 0.154x or Y = antilog(1.97 + 0.154x).
For 2017, x would be 4.
Thus, the estimated value for 2017 is
Y = antilog (1.97+0.154×4) = antilog 2.586 = 385.48.
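The last steps of this example can be checked with a short Python sketch. It assumes common (base 10) logarithms, so that the antilog of z is 10 raised to the power z; the function name `exponential_trend` is hypothetical. The constants are rounded as in the text before substitution.

```python
# Constants from the normal equations, rounded as in the text
a = round(13.7931 / 7, 2)    # 1.97
b = round(4.3237 / 28, 3)    # 0.154


def exponential_trend(x):
    """Trend value Y = antilog(a + b x), using base 10 logarithms."""
    return 10 ** (a + b * x)


estimate_2017 = exponential_trend(4)   # x = 4 for 2017 (origin at 2013)
print(round(estimate_2017, 2))
```

Since a = log A and b = log B, the fitted curve can also be written as Y = A·B^x with A = 10^a and B = 10^b.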
A case with even number of years can be attempted as in the fitting of a straight
line (see Example 7.4.2).
Example: 7.4.4
The following table shows the production of cement in India during 2012 to
2018.
Fit a second degree polynomial to the data.
Year 2012 2013 2014 2015 2016 2017 2018
Solution:
Here the number of years is odd (n = 7). Let $y = a + bx + cx^2$ be the trend
equation with origin (x = 0) at 2015 and unit of x = 1 year. The normal equations
are:
$\sum y = na + b\sum x + c\sum x^2$
$\sum xy = a\sum x + b\sum x^2 + c\sum x^3$
$\sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4$
Table 7.4.4: Fitting Second Degree Polynomial
Year   y   x   x²   x³   x⁴   xy   x²y
$Y = 33 + 3.37x + 0.134x^2$,
with origin (x = 0) at 2015 and unit of x = 1 year.
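The three normal equations of the second degree polynomial become easy to solve when the origin is at the middle of the period, because the sums of x and x³ are then zero. The following Python sketch shows this simplification. The function name `fit_parabola_centred` and the seven y values are hypothetical (they are not the cement data of this example); they were generated from y = 33 + 3x + 0.5x² so that the result can be verified.

```python
def fit_parabola_centred(xs, ys):
    """Fit y = a + b x + c x^2 when sum(x) = sum(x^3) = 0.

    The normal equations then reduce to
        sum(y)     = n*a        + c*sum(x^2)
        sum(xy)    = b*sum(x^2)
        sum(x^2 y) = a*sum(x^2) + c*sum(x^4)
    """
    n = len(xs)
    sx2 = sum(x ** 2 for x in xs)
    sx4 = sum(x ** 4 for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x * x * y for x, y in zip(xs, ys))
    b = sxy / sx2
    # Solve the remaining two equations for a and c (Cramer's rule)
    det = n * sx4 - sx2 * sx2
    a = (sy * sx4 - sx2 * sx2y) / det
    c = (n * sx2y - sx2 * sy) / det
    return a, b, c


xs = [-3, -2, -1, 0, 1, 2, 3]                      # origin at the middle year
ys = [28.5, 29.0, 30.5, 33.0, 36.5, 41.0, 46.5]    # hypothetical data
a, b, c = fit_parabola_centred(xs, ys)
print(round(a, 3), round(b, 3), round(c, 3))
```

The same function applies to an even number of years when x is coded in units of 6 months (…, -3, -1, 1, 3, …), since the sums of x and x³ are again zero.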
Example: 7.4.5
Fit a second degree polynomial to the following data. Estimate the trend value for
2012.
Solution:
Here the number of years is even (n = 6). Let $y = a + bx + cx^2$ be the trend
equation with origin (x = 0) mid-way between 2008 and 2009 and unit of x = 6
months. The normal equations are
$\sum y = na + b\sum x + c\sum x^2$
$\sum xy = a\sum x + b\sum x^2 + c\sum x^3$
$\sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4$
Table 7.4.5: Fitting Second Degree Polynomial
Years   y   x   x²   x³   x⁴   xy   x²y
$Y = 829.2 + 92.31x + 4.924x^2$,
with origin (x = 0) mid-way between 2008 and 2009 and unit of x = 6 months.
For 2012, x would be 7.
Therefore, the estimate for 2012 is
$Y = 829.2 + 92.31 \times 7 + 4.924 \times (7)^2$
= 829.2 + 646.17 + 241.28 = 1716.65.
Example 7.5.1
The trend equation for certain production data is Y = 150 + 24x (Y = annual
production in thousand tons and x = time with origin at 2008, unit of x = 1 year).
Estimate the trend value for May 2013.
Solution: The monthly trend equation is
$Y = \frac{150}{12} + \frac{24}{144}x = 12.5 + 0.167x$,
where Y = monthly production, unit of x = 1 month and origin at the middle of 2008, i.e., 30th
June 2008.
The middle of May 2013 is 58.5 months after 30th June 2008. To estimate the trend
for May 2013, we therefore substitute x = 58.5 in the above equation. Using the
unrounded slope 24/144, we get Y = 12.5 + (24/144) × 58.5 = 22.25 (‘000 tons)
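The conversion of an annual trend equation into a monthly one can be sketched in Python. The function name `annual_to_monthly` is hypothetical. The sketch assumes the rule used in this example: the constant is divided by 12 (annual total to monthly figure) and the slope by 12 × 12 = 144 (once for the monthly figure, once for the change of the unit of x from a year to a month).

```python
def annual_to_monthly(a, b):
    """Convert an annual-total trend y = a + b x (x in years)
    into a monthly trend (x in months)."""
    return a / 12, b / 144


a_m, b_m = annual_to_monthly(150, 24)

# Months from the origin (30th June 2008) to the middle of May 2013:
# 4 full years up to 30th June 2012, then 10.5 months
months = 4 * 12 + 10.5
estimate = a_m + b_m * months
print(a_m, round(b_m, 3), months, round(estimate, 2))
```

The unrounded slope 24/144 gives exactly 22.25; substituting the rounded slope 0.167 gives about 22.27.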
Example 7.5.2
The trend equation fitted to quarterly average sales for 7 years is given by y =
250 + 20x (unit of x = 1 year, origin = 30th June 2010). Estimate the trend value
for the first quarter of 2013 (January-March).
Solution: Here the quarterly average refers to the average per quarter for each year.
The quarterly trend equation is given by $Y = 250 + \frac{20}{4}x = 250 + 5x$, where Y = quarterly
sales, unit of x = 1 quarter and origin at 30th June 2010.
The interval between 30th June 2010 and the middle of the 1st quarter of 2013 is 10.5 quarters.
Thus, to obtain the trend for the 1st quarter of 2013, we substitute x = 10.5 in the
above equation.
Hence, the required trend is Y = 250 + 5 × 10.5 = 302.5.
Check Your Progress 2
1) Fit a straight line trend to the following data and show how to obtain the
monthly trend values from the trend line fitted to the given time series.
Obtain two such monthly values.
Monthly Production: 38 40 41 45 47
(in ‘000 tons)
…………………………………………………………………………...…
………....………………………………………………………………..….
……..…………………………………………………………………….…
……………………...………………………………………………………
……………………...………………………………………………………
…………………...…………………………………………………………
……………………………...………………………………………………
2) The trend equation for certain production data is y = 240 + 48x (y = annual
production in tons, x = time with origin at 2010, unit of x = 1 year).
Estimate the trend for October 2016.
…………………………………………………………………………...…
………....………………………………………………………………..….
……..…………………………………………………………………….…
……………………...………………………………………………………
………………………………...……………………………………………
3) The trend equation fitted to quarterly average sales data is given by y =
60 + 8x (unit of x = 1year, origin = 30th June, 2018). Estimate the trend
value for first quarter (Jan-March.) of 2020.
…………………………………………………………………………...…
………....………………………………………………………………..….
……..…………………………………………………………………….…
……………………...………………………………………………………
……………………...………………………………………………………
………………………………...……………………………………………
Years             Quarters
                  I     II    III   IV
1                 y1    y2    y3    y4
2                 y5    y6    y7    y8
Total             T1    T2    T3    T4
Average           A1    A2    A3    A4
S.I.              s1    s2    s3    s4
S.I. (adjusted)   S1    S2    S3    S4
Explanatory notes:
a) T1 = y1 + y5 + y9 + y13 + y17 is the total of y values of the first quarter of each year.
Similarly, T2, T3 and T4 are the totals of the second, third and fourth quarters of
each year respectively.
b) $A_i$ is the ith quarter average $= \frac{T_i}{n}$, where i = 1, 2, 3, 4, and n denotes the number
of years.
c) G is defined as the grand average $= \frac{\sum A_i}{4}$.
d) $s_i = \frac{A_i}{G} \times 100$, i = 1, 2, 3, 4.
e) s = s1 + s2 + s3 + s4
f) S1, S2, S3 and S4 are the seasonal indices for the first, second, third and the
fourth quarters respectively, where $S_i = \frac{s_i}{s} \times 400$, i = 1, 2, 3, 4. Note that the
sum of these 4 index numbers must be equal to 400. Further, $S_i = s_i$ if s =
400.
g) For a time series with monthly data, the sum of 12 seasonal indices, one for
each month, must be equal to 1200.
Example: 7.6.1
Compute seasonal indices for the following data by the Method of Simple
Average.
Years Quarters
I II III IV
1992 72 68 80 70
1993 76 70 82 74
1994 74 66 84 80
1995 76 74 84 78
1996 78 74 86 82
Solution:
Table 7.6.1: Calculation of Seasonal Indices
Years Quarters
I II III IV
1992 72 68 80 70
1993 76 70 82 74
1994 74 66 84 80
1995 76 74 84 78
1996 78 74 86 82
Explanatory Notes:
Grand Average $G = \frac{75.2 + 70.4 + 83.2 + 76.8}{4} = \frac{305.6}{4} = 76.4$
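The Method of Simple Averages for the data of Example 7.6.1 can be sketched in Python as follows. The function name `seasonal_indices` is hypothetical; the steps are the quarterly averages, the grand average, the ratios to the grand average expressed as percentages, and the adjustment that makes the four indices add up to 400.

```python
def seasonal_indices(rows):
    """Seasonal indices by the method of simple averages.

    rows: one list of quarterly values per year.
    Returns (quarterly averages, grand average, adjusted indices).
    """
    n = len(rows)
    averages = [sum(col) / n for col in zip(*rows)]       # A_i = T_i / n
    grand = sum(averages) / len(averages)                 # G
    raw = [a / grand * 100 for a in averages]             # s_i
    s = sum(raw)
    adjusted = [v / s * 100 * len(raw) for v in raw]      # S_i, total 400
    return averages, grand, adjusted


rows = [
    [72, 68, 80, 70],   # 1992
    [76, 70, 82, 74],   # 1993
    [74, 66, 84, 80],   # 1994
    [76, 74, 84, 78],   # 1995
    [78, 74, 86, 82],   # 1996
]
averages, grand, adjusted = seasonal_indices(rows)
print([round(a, 1) for a in averages], round(grand, 1))
print([round(v, 2) for v in adjusted])
```

For these data the quarterly averages are 75.2, 70.4, 83.2 and 76.8, the grand average is 76.4, and the seasonal indices are about 98.43, 92.15, 108.90 and 100.52.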
From this, the irregular component can be eliminated by the use of Simple
Average Method.
Example: 7.6.2
The following table shows the sales (in ’000 Rs.) in a departmental store for five
different years. Obtain the seasonal indices by Ratio to Trend Method.
Years Quarters
I II III IV
Solution:
Let us fit a straight line trend to the data on quarterly averages. The trend
equation fitted to quarterly averages is y = a + bx, where y denotes the quarterly
average of the year and the unit of x = 1 year. The table below has been
constructed from the given data by computing the averages of the 4 quarters of each
year.
Table 7.6.2(a): Fitting Linear Trend
Years Quarters
y x x2 xy
2002 894 0 0 0
Year   Quarter   x    T = 915.525 + 18.25x   y     (y ÷ T)×100
       IV        -7   787.8                  362   46
       IV        -3   860.8                  390   45
       IV        1    933.8                  422   45
       IV        5    1006.8                 464   46
       IV        9    1079.8                 515   48
The trend ratios are now arranged by quarters and the seasonal indices are
calculated by the method of simple averages.
Years Quarters
I II III IV
2000 68 217 79 46
2001 65 206 81 45
2002 63 203 85 45
2003 62 201 90 46
2004 62 202 94 48
Total 320 1029 429 230
Average 64.0 205.8 85.8 46.0
S.I. 63.74 204.98 85.46 45.82
× 100 = = 𝑆𝐼"
′
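The final adjustment of the quarterly averages of trend ratios can be sketched in Python. The sketch assumes that each average is multiplied by 400 and divided by the sum of the four averages, so that the seasonal indices add up to 400; the variable names are hypothetical.

```python
# Averages of the trend ratios for quarters I to IV (from the table)
averages = [64.0, 205.8, 85.8, 46.0]

total = sum(averages)                             # 401.6, not 400
adjusted = [a * 400 / total for a in averages]    # scaled to add up to 400
print(round(total, 1))
print([round(v, 2) for v in adjusted])
```

For the second quarter the calculation is 205.8 × 400 ÷ 401.6, which is about 204.98.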
Example: 7.6.3
Use the Ratio to Moving Average Method to calculate seasonal indices for the
following data.
Solution:
Table 7.6.3: Calculation of Seasonal Indices by Ratio to Moving Average Method
Year Quarter y 4-period M.T. Centered Total 4-period M.A. (M) ( y ÷ M)×100
2009 Sum 30 - - - -
Mon 81 - - - -
292 - - -
Aut 62 587 73.38 84.50
295
Win 119 613 76.63 155.30
318
2010 Sum 33 660 82.50 40.00
342
Mon 104 736 92.00 113.04
394
Aut 86 797 99.63 86.32
403
Win 171 855 106.88 160.00
452
2011 Sum 42 917 114.63 36.64
465
Mon 153 980 122.50 124.90
515
Aut 99 1044 130.50 75.86
529
Win 221 1077 134.63 164.16
548
2012 Sum 56 1126 140.75 39.79
578
Mon 172 1170 146.25 117.61
592
Aut 129 1195 149.38 86.36
603
Win 235 1235 154.38 152.23
632
2013 Sum 67 1271 158.88 42.17
639
Mon 201 1345 168.13 119.55
706 - - -
Aut 136 - - - -
Win 302 - - - -
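The columns of Table 7.6.3 can be reproduced with a short Python sketch. The variable names are hypothetical; the steps are the 4-period moving totals, the centred totals (sum of two adjacent moving totals), the 4-period moving average (centred total ÷ 8), and the ratio of each observation to its moving average.

```python
# Observations in time order: Sum, Mon, Aut, Win for 2009 to 2013
y = [30, 81, 62, 119,
     33, 104, 86, 171,
     42, 153, 99, 221,
     56, 172, 129, 235,
     67, 201, 136, 302]
seasons = ["Sum", "Mon", "Aut", "Win"]

# 4-period moving totals and centred totals
totals = [sum(y[i:i + 4]) for i in range(len(y) - 3)]
centred = [totals[i] + totals[i + 1] for i in range(len(totals) - 1)]

# 4-period moving average M and the ratios (y / M) x 100;
# the first ratio belongs to the third observation (Aut 2009)
moving_avg = [c / 8 for c in centred]
ratios = [y[i + 2] / m * 100 for i, m in enumerate(moving_avg)]

# Arrange the ratios by season and average them
by_season = {s: [] for s in seasons}
for i, r in enumerate(ratios):
    by_season[seasons[(i + 2) % 4]].append(r)
season_avg = {s: sum(v) / len(v) for s, v in by_season.items()}

print(centred[0], round(moving_avg[0], 2), round(ratios[0], 2))
print({s: round(v, 2) for s, v in season_avg.items()})
```

The first centred total is 292 + 295 = 587, which gives the moving average 73.38 and the ratio of about 84.50 shown against Aut 2009. Each season has four ratios, whose averages are then adjusted to add up to 400.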
The moving ratios are now arranged by quarters and the seasonal indices are
calculated by the method of simple averages.
Year Quarters
Year Jan. Feb. Mar Apr. May. Jun. Jul. Aug. Sep. Oct. Nov. Dec.
1992 420 414 502 365 368 332 390 396 429 417 422 496
1993 491 466 516 337 342 360 409 402 372 391 394 446
1994 463 465 478 310 325 406 415 437 438 445 430 416
1995 502 487 536 404 418 429 489 492 475 456 476 476
Obtain seasonal indices by the method of Ratio to Trend, assuming linear trend.
…………………………………………………………………………...…………
....………………………………………………………………..….……..….….…
…………………………………………………………….………………………..
.………………………………………………………………………………..……
…...……………………………………………………………………………....…
………………………………………..................................................................…
3) Given the following quarterly sales figures in thousands of rupees for the
years 2006 to 2009. Find the specific seasonal by the method of moving
averages.
Years Quarters
I II III IV
4) The seasonal indices for the sales of garments of a particular type in a certain
shop are given below:
Quarter Seasonal Index
Jan-Mar 97
Apr-Jun 85
Jul-Sep 83
Oct-Dec 135
If the total sales in the first quarter of a year are Rs. 15,000, determine how
much worth of garments of this type should be kept in stock by the shop
owner to meet the demand for each of the other three quarters of the year.
…………………………………………………………………………...…………
....………………………………………………………………..….……..….….…
…………………………………………………………….………………………..
.………………………………………………………………………………..……
…………………………………………………………………………...…………
1) 109.57, 107.00, 118.69, 82.71, 84.87, 89.19, 99.47, 100.87, 100.11, 99.82,
100.58, 107.12
UNIT 8 VITAL STATISTICS*
Structure
8.0 Objectives
8.1 Introduction
8.2 Data Sources
8.3 Uses of Vital Statistics
8.4 Measurement of Population
8.4.1 Linear Interpolation Method
8.4.2 Using Compound Growth Rate Formula
8.4.3 Natural Increase and Net Migration Method
8.5 Vital Rates
8.5.1 Crude Birth Rate
8.5.2 Crude Death Rate
8.5.3 Crude Rate of Natural Increase
8.5.4 Rate of Net Migration
8.5.5 Rate of Total Increase
8.5.6 Infant Mortality Rate
8.0 OBJECTIVES
After going through this Unit, you will be able to
explain the sources of data in vital statistics;
calculate various vital rates;
explain the procedure of construction of life tables; and
appreciate the application and limitations of life tables.
*
Adapted from IGNOU study material of EEC 13: Elementary Statistical Methods and Survey Techniques,
Unit 12 written by C G Naidu with modifications by Kaustuva Barik.
8.1 INTRODUCTION
Vital statistics is mainly concerned with the factors contributing to population
growth. Some of these factors are birth rates, death rates, expectancy of life, and
migration. As you go through this Unit you will be in a position to appreciate the
importance and applications of vital statistics in economics. The main objectives
of this Unit are to introduce some of the basic concepts of vital statistics, the data
sources, how to measure various ratios, and what are the applications of these
ratios in projecting the population, calculating life expectancy, uses in actuarial
science, etc.
has been periodically increased. The frame was recently updated in 2014
comprising 8861 sample units.
The above method provides us a good estimate at a constant rate between the
inter-censal years.
Example 8.1: The total population of India in the 1991 census was 846 million and
in the 2001 census was 1027 million. Calculate the total population of India in 1996.
Here, $P_0 = 846$, $P_1 = 1027$, $N = 10$, $n = 5$
Therefore, $P_{1996} = 846 + \frac{5}{10}(1027 - 846) = 936.5$ million.
The limitation of the above method is that we can estimate the population
only for the years between two census years. We cannot have the estimates for
the future years.
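The calculation of Example 8.1 can be written as a small Python function. The function name `interpolate_population` is hypothetical; the parameters follow the symbols of the example (P0 and P1 are the two census populations, N is the number of years between the censuses and n is the number of years from the first census to the required year).

```python
def interpolate_population(p0, p1, N, n):
    """Linear interpolation between two census populations."""
    return p0 + (n / N) * (p1 - p0)


# Example 8.1: censuses of 1991 and 2001, estimate for 1996
p_1996 = interpolate_population(846, 1027, N=10, n=5)
print(p_1996)   # in millions
```

The function adds the same amount, (P1 − P0)/N, for every year after the first census, which is why the method is reliable only for years between the two censuses.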
Example 8.2: The population of a small town in 2001 was 50500. The compound
growth rate of the population of that town between 2001 and 2011 was 0.025.
Estimate the population of the town for the year 2015 (assuming that the
population growth rate will be the same beyond 2011).
Here, we are given $P_0 = 50500$, $r = 0.025$, and n = 14 (since 2015 – 2001 = 14)
Therefore, $P_{2015} = 50500(1 + 0.025)^{14} = 71355$
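The compound growth calculation of Example 8.2 can be checked with a short Python sketch. The function name `project_population` is hypothetical; the formula is the one used in the example, with r expressed as a proportion and not as a percentage.

```python
def project_population(p0, r, n):
    """Population after n years at compound growth rate r (a proportion)."""
    return p0 * (1 + r) ** n


# Example 8.2: base year 2001, projection for 2015
p_2015 = project_population(50500, 0.025, 2015 - 2001)
print(round(p_2015))
```

A growth rate quoted as a percentage, such as 1.9%, has to be divided by 100 (giving 0.019) before it is passed to the function.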
P0 is the population in the base year (usually the previous census year)
B and D are the total number of births and deaths respectively from
the base year to the year t.
I and E are the total number of immigrants and emigrants respectively
from the base year to the year t.
Example 8.3: The population of a small town in 2011 census was 22000. From
2011 to 2013 the number of births, deaths, immigrants and emigrants are 800,
150, 2500 and 1500 respectively. Find the total population of the town in 2013.
Here, $P_0 = 22000$, B = 800, D = 150, I = 2500, E = 1500
Therefore, $P_{2013} = 22000 + (800 - 150) + (2500 - 1500)$
= 23650
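The calculation of Example 8.3 can be sketched in Python. The function name `population_at_t` is hypothetical; it adds the natural increase (births minus deaths) and the net migration (immigrants minus emigrants) to the base year population.

```python
def population_at_t(p0, births, deaths, immigrants, emigrants):
    """Base population plus natural increase plus net migration."""
    natural_increase = births - deaths
    net_migration = immigrants - emigrants
    return p0 + natural_increase + net_migration


# Example 8.3: town population in 2013 from the 2011 census figure
p_2013 = population_at_t(22000, births=800, deaths=150,
                         immigrants=2500, emigrants=1500)
print(p_2013)
```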
Check Your Progress I
The following table gives information on mid-year total population of India and
annual compound growth rates of population.
Note that the compound growth rates are in terms of percentage. Divide it by 100
to get the required r. For example, for the period 1950-60 the compound growth
rate is 1.9%. Therefore, r = 1.9/100 = 0.019.
On the basis of the above table answer the questions below:
1) Find the mid-year population for the following years using linear
interpolation method.
Year Mid-year population
1954
1966
1973
1985
1998
2005
2018
2) Find the mid-year population for the following years using compound growth
rate method.
Year Mid-year population
1954
1966
1973
1985
1998
2005
2018
problem, vital rates are expressed on the basis of per thousand persons. In this
section you will learn some important vital rates.
The crude birth rate tells us at what rate the births are occurring in a region or
community.
Example 8.4: The mid-year population and the number of births in a tribal
community in Madhya Pradesh in 2020 are 40,000 and 1200 respectively. Find
the crude birth rate.
Here, we have 2020 mid-year population = 40000 and the 2020 number of births
= 1200
$\text{Crude birth rate} = \frac{1200}{40000} \times 1000$
= 30 per 1000 persons per annum
The crude death rate tells us at what rate the deaths are happening in an age group,
sex group, region or community.
Example 8.5: The mid-year population and the number of deaths registered in
2020 for a town in Maharashtra among females are 25000 and 245 respectively.
Find the crude death rate.
Here, we have 2020 mid-year female population = 25000 and the number of
deaths in 2020 = 245.
$\text{Crude death rate (females)} = \frac{245}{25000} \times 1000$
= 9.8 per 1000 females per annum
The annual natural increase is measured as: annual number of births − annual
number of deaths.
The formula for calculating the crude rate of natural increase is
$\text{Crude rate of natural increase} = \frac{\text{Annual natural increase}}{\text{Annual mid-year population}} \times 1000$
The annual rate of net migration tells us at what rate the net migration has added
to the population over the course of the year.
Example 8.7: In 2020 for a region the annual number of immigrants, emigrants,
and mid-year population are given as 6500, 5200 and 66700 respectively. Find
the annual rate of net migration.
Here, we have the number of immigrants = 6500
the number of emigrants = 5200
mid-year population = 66700
Annual net migration = 6500 – 5200 = 1300
$\text{Annual rate of net migration} = \frac{1300}{66700} \times 1000$ = 19.5 per 1000 per annum
$\text{Rate of total increase} = \frac{\text{Annual total increase}}{\text{Annual mid-year population}} \times 1000$
The rate of total increase for a given year tells us the rate at which the population
has increased over the year.
Example 8.8: The annual natural increase, annual net migration, and annual mid-
year population in 2018 for a region are recorded as 1500, 500 and 50000
respectively. Find the rate of total increase.
Here, we have
Annual natural increase = 1500
Annual net migration = 500
Mid-year population = 50000
Annual total increase = 1500 + 500 = 2000
$\text{Rate of total increase} = \frac{2000}{50000} \times 1000$
= 40 per 1000 per annum.
8.5.6 Infant Mortality Rate
The infant mortality rate is defined as the number of deaths of infants (less than
one year old) per 1000 live births in a given year. The formula to calculate the
infant mortality rate is given as:
$\text{Infant mortality rate} = \frac{\text{Annual infant deaths (of males or females or total)}}{\text{Annual live births (of males or females or total)}} \times 1000$
The infant mortality rate tells us for a given year the chances of a birth failing to
survive one year life. The infant mortality rates can be calculated separately for
males and females.
Example 8.9: In 2019 for a small town the total number of live births and infant
deaths among females are recorded as 3000 and 25 respectively. Find the infant
mortality rate among females.
Here, we have
Annual live female births = 3000
Annual infant deaths = 25
$\text{Infant mortality rate} = \frac{25}{3000} \times 1000$
= 8.33 per 1000 live births
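All the vital rates in this section have the same form: a count of events divided by a base, multiplied by 1000. A minimal Python sketch, with a hypothetical helper `rate_per_thousand`, reproduces the figures of Examples 8.4, 8.5, 8.8 and 8.9 (the last with 25 infant deaths among 3000 live births, as stated in the example).

```python
def rate_per_thousand(events, base):
    """Number of events per 1000 of the base (population or live births)."""
    return events * 1000 / base


crude_birth_rate = rate_per_thousand(1200, 40000)       # Example 8.4
crude_death_rate = rate_per_thousand(245, 25000)        # Example 8.5
total_increase_rate = rate_per_thousand(1500 + 500, 50000)   # Example 8.8
infant_mortality_rate = rate_per_thousand(25, 3000)     # Example 8.9

print(crude_birth_rate, crude_death_rate,
      total_increase_rate, round(infant_mortality_rate, 2))
```

Note that the base is the mid-year population for the crude rates but the number of live births for the infant mortality rate.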
Observe that all the vital rates are higher in rural areas than in urban areas. Write
one most significant reason for each of the following:
1) The birth rate in rural areas is high because:
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
2) The death rate in urban areas is low because:
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
3) The infant mortality rate in rural areas is high because:
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
(i.e., 1-4) and the remaining values are 5 (namely, 5-9, 10-14, … 95-99).
The last value is an exception, which is again taken as 1 (100+).
iii) Number of deaths recorded in the age interval (dx): This column
presents the number of persons dying in that age group during the year
corresponding to the life table.
iv) Number of persons in the age interval (Px): This column indicates the
number of persons in the age interval during the year corresponding to the
life table.
v) Separation factor (${}_na_x$): This represents the average number of years
lived by those who die between age x and (x+n). Although, it is necessary
in calculations, this factor is not typically presented as a column of the life
table. Each person living in the interval (x to x+n). In a complete life table,
a value 0.5 (that is, half of one year) is valid from the age of 5 years. For a
simpler calculation, it is assumed that those who die in the 5 year age
intervals of a life table live on average 2.5 years. However, remember that
the value of the fraction depends on the mortality pattern over the entire
interval and not the mortality rate for any single year. In addition, since a
large portion of infant deaths occur in the first few weeks of life, this value
is much smaller in the <1 and 1-4 age groups.
Similarly, the death rates in the last three groups (namely, 90-94, 95-99, and
100+) are very high. Therefore, the value of the separation factor is small in the
age groups 90-94 and 95-99. In the last age group (100+), since death is certain,
we have taken the separation factor as 1.
Calculation of the separation factor is easy if the date of birth and the date of
death are available. For the purpose of constructing a life table the separation
factors will be given in the table. When they are not, values from model life
tables, such as those tabulated by Coale and Demeny shown in Table 8.1, can be
utilised for the ages < 1 and 1-4, and the rest are taken as 0.5 years for every year
in the group interval (that is, 2.5 in a 5-year interval).
Table 8.1: Separation Factors for Ages < 1 and 1-4
                                    Separation factor for age < 1    Separation factor for ages 1-4
Infant Mortality Rate   Zones       Men      Women   Both sexes      Men     Women   Both sexes
> 0.100                 North (1)   0.33     0.35    0.3500          1.558   1.570   1.5700
                        East (2)    0.29     0.31    0.3100          1.313   1.324   1.3240
                        South (3)   0.33     0.35    0.3500          1.240   1.239   1.2390
                        West (4)    0.33     0.35    0.3500          1.352   1.361   1.3610
< 0.100                 North (1)   0.0425   0.05    0.05            1.859   1.733   1.7330
                        East (2)    0.0025   0.01    0.01            1.614   1.487   1.4870
                        South (3)   0.0425   0.05    0.05            1.541   1.402   1.4020
                        West (4)    0.0425   0.05    0.05            1.653   1.524   1.5240
Source: Coale, Ansley J. and Demeny P. (1966) Regional Model Life Tables and Stable
Population, Princeton University Press.
Notes: (1) Iceland, Norway and Switzerland; (2) Austria, Czechoslovakia, North-central Italy,
Poland and Hungary; (3) South Italy, Portugal and Spain; (4) Rest of the World.
vi) Central Mortality (${}_nM_x$): This column results from dividing the number
of deaths in the age interval x to x+n (column dx) by the number of people
in this age group (column Px).
${}_nM_x = \frac{d_x}{P_x}$
${}_np_x = 1 - {}_nq_x$
${}_nd_x = {}_nl_x \times {}_nq_x$
xi) Number of years lived by the total of the cohort of 100,000 births in
the interval x to x + n (${}_nL_x$): Each member of the cohort who survives
the interval x to x+n contributes n years to L, while each member who dies
in the interval x to x+n contributes the average number of years lived by
those who die in this period (that is, the separation factor of deaths ${}_na_x$).
The ${}_nL_x$ is calculated using the following formula.
${}_nL_x = n \times {}_nl_{x+n} + {}_na_x \times {}_nd_x$
where, ${}_nl_{x+n} = {}_nl_x \times {}_np_x$
or,
${}_nl_{x+n} = {}_nl_x - {}_nd_x$
xii) Total years lived after exact age x (${}_nT_x$): This number is essential for the
calculation of life expectancy. It indicates the total number of years lived by
the survivors ${}_nl_x$ between the anniversary x and the extinction of the whole
generation. The value of the first row of ${}_nT_x$ is the total number of years
lived by the cohort until the death of its last member.
${}_nT_x$ = Sum of ${}_nL_x$ (from the last row of ${}_nL_x$ to the current row of ${}_nL_x$)
xiii) Life expectancy at age x (${}_ne_x$): Among all the indicators provided by the
life table, the most widely used is the life expectancy (${}_ne_x$) which
represents the average number of years lived by a generation of newborns
under given mortality conditions.
Table 8.2 below provides the basic information required for construction of a life
table. The data pertains to Indian females in 2000. Let us construct the life table.
Table 8.3: Life Table
Age   n   ${}_na_x$   ${}_nM_x$   ${}_nq_x$   ${}_np_x$   ${}_nl_x$   ${}_nd_x$   ${}_nL_x$   ${}_nT_x$   ${}_ne_x$
<1    1   0.1   0.06765   0.06377   0.93623   100000   6376.52   94261.1   6268416   62.6842
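The entries of the first row of the life table can be cross-checked with a short Python sketch. It starts from the tabulated survival probability 0.93623 and separation factor 0.1, so the probability of dying is 1 − 0.93623 = 0.06377. The last line assumes the standard definition of life expectancy as total years lived divided by the number of survivors at age x, which agrees with the tabulated values; the variable names are hypothetical.

```python
n = 1            # width of the age interval (<1)
a = 0.1          # separation factor for the interval
p = 0.93623      # probability of surviving the interval (from the table)
l_x = 100000     # cohort alive at the start of the interval

q = 1 - p                    # probability of dying in the interval
d = l_x * q                  # deaths in the interval
l_next = l_x - d             # survivors at the end of the interval
L = n * l_next + a * d       # years lived in the interval

T_0 = 6268416                # total years lived after birth (from the table)
e_0 = T_0 / l_x              # life expectancy at birth (assumed definition)

print(round(q, 5), round(d, 1), round(L, 1), round(e_0, 4))
```

The small differences from the tabulated 6376.52 and 94261.1 arise because the table was computed from unrounded probabilities.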
Life expectancy generally decreases from the first row of the table to the last row.
The exceptions are the second row and sometimes the third row (age group 5-9),
where it can be greater than in the first row (age group <1) in countries with high
infant mortality. It is generally observed that, for a given population, life
expectancy is greater among women than among men, and overall life expectancy
lies between the two. However, in countries where maternal mortality is high and
the general living conditions of women are worse, life expectancy among women
is lower than among men.
8.7.1 Calculation of Probability of Surviving and Dying
While constructing the life table you have learnt that ${}_nq_x$ is the probability of dying
between the two ages (x, x+n) for a person who has survived up to age x. For
example, let us consider the row corresponding to the age group 30-34 years in Table
8.3. The probability of dying (females) between 30 and 34 years of age, for those
who have survived up to 30 years of age, is 0.01506 (${}_nq_x$). It means that out of
every 100,000 Indian females who have survived to the age of 30 years, 1506 ( =
100,000 × 0.01506) will die between the ages of 30 and 34 years. Secondly, ${}_np_x$ tells
us the probability of surviving between the two ages (x, x + n). For the age group
30-34 years, the probability of survival is (1 – 0.01506) = 0.98494 (${}_np_x$). That
means out of every 100,000 Indian females who have survived to the age of 30 years,
98494 will survive the age group 30-34 years.
Thirdly, we can calculate the probability at birth of a person dying between ages
0-4 years. This is given by the number of original births dying (${}_nd_x$) between the
ages 0-4 years, divided by the number of original births (usually 100000). In our
example, ${}_nd_x = 1281$ and the probability is 0.01281 (= 1281/100000). This
probability tells us that on an average out of every 100,000 female births in
India (subject to mortality in 2000), 1281 females will die between the ages 0-4
years.
8.7.2 Uses in Actuarial Science
Life tables have important applications in actuarial science, especially in the field
of life assurance. Life tables form the basis for determining the rates of premium
necessary for various amounts of life assurance. Life tables provide actuarial
science with a sound foundation, converting the insurance business from mere
gambling on human lives to the ability to offer a well calculated safeguard in
the event of death.
Actually, the calculations involved in the fixation of premium amounts in Life
assurance are very complex, but the underlying principles are simple. Let us
consider a few examples.
Example 8.10: According to mortality conditions in India for the year 2000,
what annual premium would an Indian female have to pay on a whole life policy
worth Rs. 100,000 if this life was assured at birth, assuming that the assurance
office earns no income on its funds?
Let the premium be Rs. x per annum. Since a female on the average can be
expected to live 62.7 years, over her life time she will have paid Rs. x × 62.7 in
premiums. This will have to be equal to the value of the policy, Rs. 100000.
Therefore, Rs. x × 62.7 = 100000 and x = 100000/62.7 = Rs. 1594.90.
Example 8.11: In the above example if the policy was taken at the age of 25
years, then find the annual premium.
If the policy was taken at age 25 then the total premium paid will be Rs. x × 46.9
for 46.9 years expectation of life at 25 years age. Then the annual premium must
be x = 100000/46.9 = Rs. 2132.20.
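The premium calculations of Examples 8.10 and 8.11 can be sketched in Python. The function name `annual_premium` is hypothetical; as in the examples, it assumes that the assurance office earns no income on its funds, so the policy value is simply spread over the expected remaining years of life.

```python
def annual_premium(policy_value, life_expectancy):
    """Annual premium when no interest is earned on the funds."""
    return policy_value / life_expectancy


premium_at_birth = annual_premium(100000, 62.7)   # Example 8.10
premium_at_25 = annual_premium(100000, 46.9)      # Example 8.11
print(round(premium_at_birth, 2), round(premium_at_25, 2))
```

The premium is higher when the policy is taken at age 25 because the same policy value is spread over fewer expected years of payment.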
Example 8.12: In Example 8.10, if the policy is an endowment policy taken at 30
years of age and payable up to 50 years of age or on prior death, what is the annual
premium to be paid?
above explains the life expectancy for males and females in some selected
countries.
iii) Population projections: Life tables have also been used in preparation of
population projections by age and sex. That is, in estimating what the size
of the population will be at some future date.
4) What is the mortality rate between 15 and 20 years of age?
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
5) What is the probability that a female reaching 15 years of age reaches 20?
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
6) How many additional years is a female between 15 and 20 years of age in
2000 in India expected to live?
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
2)
1) The birth rate in rural areas is high because of the lack of awareness among
the people on the family planning methods and its need.
2) The death rate in urban areas is low because of the improved health facilities
in towns and cities.
3) The infant mortality rate in rural areas is high because of the lack of health
facilities in rural areas and malnutrition among mothers.