
What Is Statistics

Statistics is concerned with collecting, organizing, summarizing, presenting, and analyzing numerical data to derive valid conclusions. It involves systematically collecting data and interpreting it. There are four main stages: collection of data, presentation of data (e.g. in tables or graphs), analysis of data (e.g. measures of central tendency and dispersion), and interpretation of conclusions from the data. Statistics uses diagrams and graphs to visually present relationships in the data, including line diagrams, bar diagrams, and multiple or subdivided bar diagrams to compare different data groups. It has many applications, including in education for research, policymaking, and testing past knowledge.

Uploaded by

Swami Gurunand
Copyright
© All Rights Reserved

STATISTICS

Statistics is concerned with scientific methods for
collecting, organising, summarising, presenting and analysing data
as well as deriving valid conclusions and making reasonable
decisions on the basis of this analysis. Statistics is concerned with
the systematic collection of numerical data and its interpretation.
The word statistics is used to refer to
1. Numerical facts, such as the number of people living in
a particular area.
2. The study of ways of collecting, analysing and interpreting
the facts.
Definition by Croxton and Cowden:
Statistics may be defined as the science of collection,
presentation, analysis and interpretation of numerical data from the
logical analysis.
According to this definition there are four stages:
1. Collection of Data: It is the first step and the foundation
upon which the entire analysis rests. Careful planning is essential before
collecting the data. There are different methods of collection of
data such as census, sampling, primary, secondary, etc., and the
investigator should make use of the correct method.
2. Presentation of data: The mass data collected should be
presented in a suitable, concise form for further analysis. The
collected data may be presented in the form of tabular or
diagrammatic or graphic form.
3. Analysis of data: The data presented should be carefully
analysed for making inferences from the presented data, using
measures of central tendency, dispersion, correlation, regression,
etc.
4. Interpretation of data: The final step is drawing conclusions
from the data collected. A valid conclusion must be drawn on the
basis of analysis. A high degree of skill and experience is necessary
for the interpretation.

The following are some basic technical terms used when a
continuous frequency distribution is formed or data are classified
according to class intervals.
a) Class limits:
The class limits are the lowest and the highest values that
can be included in the class. For example, take the class 30-40.
The lowest value of the class is 30 and the highest value is 40. The two
boundaries of a class are known as the lower limit and the upper
limit of the class. The lower limit of a class is the value below
which there can be no item in the class. The upper limit of a class
is the value above which there can be no item in that class. Of the
class 60-79, 60 is the lower limit and 79 is the upper limit, i.e. in
this case there can be no value which is less than 60 or more than
79. The way in which class limits are stated depends upon the
nature of the data. In statistical calculations, the lower class limit is
denoted by L and the upper class limit by U.
b) Class Interval:
The class interval may be defined as the size of each
grouping of data. For example, 50-75, 75-100, 100-125 are class
intervals. Each grouping begins with the lower limit of a class
interval and ends at the lower limit of the next succeeding class
interval.

c) Width or size of the class interval:
The difference between the lower and upper class limits is
called the width or size of the class interval and is denoted by C.
d) Range:
The difference between the largest and smallest values of the
observations is called the range and is denoted by R, i.e.
R = Largest value - Smallest value
R = L - S
e) Mid-value or mid-point:
The central point of a class interval is called the mid-value
or mid-point. It is found by adding the upper and lower limits
of a class and dividing the sum by 2,
(i.e.) Mid-value = $\frac{L + U}{2}$
For example, if the class interval is 20-30 then the mid-value is
$\frac{20 + 30}{2} = 25$
f) Frequency:
The number of observations falling within a particular class
interval is called the frequency of that class.
Let us consider the frequency distribution of weights of
persons working in a company.

Weight (in kg)    Number of persons
30-40             25
40-50             53
50-60             77
60-70             95
70-80             80
80-90             60
90-100            30
Total             420

The total frequency indicates the total number of observations
considered in a frequency distribution.
g) Number of class intervals:
The number of class intervals in a frequency distribution is a matter of
importance. The number of class intervals should not be too large.
For an ideal frequency distribution, the number of class intervals
can vary from 5 to 15. To decide the number of class intervals for
the frequency distribution of the whole data, we choose the lowest
and the highest of the values. The difference between them will
enable us to decide the class intervals.
h) Size of the class interval:
The size of the class interval is inversely
proportional to the number of class intervals in a given distribution.
The approximate value of the size (or width or magnitude) of the
class interval C is obtained by using Sturges' rule as
$C = \frac{\text{Range}}{1 + 3.322 \log_{10} N}$
where Range = Largest value - Smallest value and N = total number of observations.
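Sturges' rule is commonly stated as C = Range / (1 + 3.322 log10 N). The following is a minimal sketch of that calculation in Python; the function name and the sample values are illustrative assumptions, not part of the original notes.

```python
import math

def sturges_class_width(values):
    """Approximate class width C = Range / (1 + 3.322 * log10(N))."""
    n = len(values)                            # N: total number of observations
    value_range = max(values) - min(values)    # Range = largest - smallest
    return value_range / (1 + 3.322 * math.log10(n))

# Illustrative sample of 10 observations (an assumption for this sketch)
sample = [5, 8, 10, 15, 18, 25, 30, 35, 40, 45]
width = sturges_class_width(sample)
print(round(width, 2))
```

In practice the computed width would usually be rounded to a convenient value, such as 5 or 10, before the classes are formed.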

Cumulative frequency table:
A cumulative frequency distribution has a running total of the
values. It is constructed by adding the frequency of the first class
interval to the frequency of the second class interval, then adding
that total to the frequency in the third class interval, and continuing
until the final total, appearing opposite the last class interval, is
the total of all frequencies. The cumulation may be
downward or upward. A downward cumulation results in a list
presenting the number of frequencies "less than" any given amount,
as revealed by the lower limit of the succeeding class interval, and the
upward cumulation results in a list presenting the number of
frequencies "more than" any given amount, as revealed by the upper
limit of the preceding class interval.
Example 3:

Age group     Number      Less than       More than
(in years)    of women    cumulative      cumulative
                          frequency       frequency
15-20         3           3               64
20-25         7           10              61
25-30         15          25              54
30-35         21          46              39
35-40         12          58              18
40-45         6           64              6
(a) Less than cumulative frequency distribution table

End values (upper limit)    Less than cumulative frequency
Less than 20                3
Less than 25                10
Less than 30                25
Less than 35                46
Less than 40                58
Less than 45                64
(b) More than cumulative frequency distribution table

End values (lower limit)    More than cumulative frequency
15 and above                64
20 and above                61
25 and above                54
30 and above                39
35 and above                18
40 and above                6
Statistics and Education:
Statistics is widely used in education. Research has become
a common feature in all branches of activity. Statistics is
necessary for the formulation of policies to start new courses,
consideration of facilities available for new courses, etc. There are
many people engaged in research work to test past knowledge
and evolve new knowledge. These are possible only through
statistics.

DIAGRAMMATIC AND GRAPHICAL REPRESENTATION
Diagrams:
A diagram is a visual form for presentation of statistical
data, highlighting their basic facts and relationships. If we draw
diagrams on the basis of the data collected they will easily be
understood and appreciated by all. Diagrams are readily intelligible and save
a considerable amount of time and energy.
Types of diagrams:
In practice, a very large variety of diagrams are in use and
new ones are constantly being added. For the sake of convenience
and simplicity, they may be divided under the following heads:
1. One-dimensional diagrams
2. Two-dimensional diagrams
3. Three-dimensional diagrams
4. Pictograms and Cartograms
One-dimensional diagrams:
In such diagrams, only one-dimensional measurement, i.e.
height, is used and the width is not considered. These diagrams are
in the form of bar or line charts and can be classified as
1. Line Diagram
2. Simple Bar Diagram
3. Multiple Bar Diagram
4. Sub-divided Bar Diagram
5. Percentage Bar Diagram

Line Diagram:
Line diagram is used in case where there are many items to be
shown and there is not much of difference in their values. Such
diagram is prepared by drawing a vertical line for each item
according to the scale. The distance between lines is kept uniform.
Line diagram makes comparison easy, but it is less attractive.
Example 1:
Show the following data by a line chart:

No. of children    0     1     2     3     4     5
Frequency          10    14    9     6     4     2

[Figure: Line Diagram. X-axis: No. of Children; Y-axis: Frequency]
Simple Bar Diagram:
A simple bar diagram can be drawn either on a horizontal or
vertical base, but bars on a horizontal base are more common. Bars must
be of uniform width and the intervening space between bars must be
equal. While constructing a simple bar diagram, the scale is
determined on the basis of the highest value in the series.
To make the diagram attractive, the bars can be coloured.
Bar diagrams are used in business and economics. However, an
important limitation of such diagrams is that they can present only
one classification or one category of data. For example, while
presenting the population for the last five decades, one can only
depict the total population in the simple bar diagrams, and not its
sex-wise distribution.

Example 2:
Represent the following data by a bar diagram.

Year    Production (in tonnes)
1991    45
1992    40
1993    42
1994    55
1995    50

Solution:
[Figure: Simple Bar Diagram. X-axis: Year; Y-axis: Production (in tonnes)]

Multiple Bar Diagram:
A multiple bar diagram is used for comparing two or more
sets of statistical data. Bars are constructed side by side to
represent the sets of values for comparison. In order to distinguish
the bars, they may be either differently coloured or marked with
different types of crossings or dottings, etc. An index is also
prepared to identify the meaning of the different colours or dottings.

Example 3:
Draw a multiple bar diagram for the following data.

Year    Profit before tax         Profit after tax
        (in lakhs of rupees)      (in lakhs of rupees)
1998    195                       80
1999    200                       87
2000    165                       45
2001    140                       32

Solution:
[Figure: Multiple Bar Diagram. X-axis: Year; Y-axis: Profit (in Rs); legend: Profit before tax, Profit after tax]

Sub-divided Bar Diagram:
In a sub-divided bar diagram, the bar is sub-divided into
various parts in proportion to the values given in the data and the
whole bar represents the total. Such diagrams are also called
Component Bar diagrams. The sub-divisions are distinguished by
different colours or crossings or dottings.
The main defect of such a diagram is that all the parts do
not have a common base to enable one to compare accurately the
various components of the data.
Example 4:
Represent the following data by a sub-divided bar diagram.

Expenditure items    Monthly expenditure (in Rs.)
                     Family A    Family B
Food                 75          95
Clothing             20          25
Education            15          10
Housing Rent         40          65
Miscellaneous        25          35

Solution:
[Figure: Sub-divided Bar Diagram. One bar each for Family A and Family B; Y-axis: Monthly expenditure (in Rs); legend: Food, Clothing, Education, Housing Rent, Miscellaneous]

Percentage bar diagram:
This is another form of component bar diagram. Here the
components are not the actual values but percentages of the whole.
The main difference between the sub-divided bar diagram and
percentage bar diagram is that in the former the bars are of different
heights since their totals may be different whereas in the latter the
bars are of equal height since each bar represents 100 percent. In
the case of data having sub-divisions, a percentage bar diagram will
be more appealing than a sub-divided bar diagram.

Example 5:
Represent the following data by a percentage bar diagram.

Particulars      Factory A    Factory B
Selling Price    400          650
Quantity Sold    240          365
Wages            3500         5000
Materials        2100         3500
Miscellaneous    1400         2100

Solution:
Convert the given values into percentages as follows:

Particulars      Factory A          Factory B
                 Rs.      %         Rs.      %
Selling Price    400      5         650      6
Quantity Sold    240      3         365      3
Wages            3500     46        5000     43
Materials        2100     28        3500     30
Miscellaneous    1400     18        2100     18
Total            7640     100       11615    100

[Figure: Sub-divided Percentage Bar Diagram. One bar each for Factory A and Factory B; Y-axis: Percentages; legend: Selling price, Quantity sold, Wages, Materials, Miscellaneous]

Two-dimensional Diagrams:
In one-dimensional diagrams, only length is taken into
account. But in two-dimensional diagrams the area represents the
data and so both the length and the breadth have to be taken into
account. Such diagrams are also called area or surface diagrams.
Pie Diagram or Circular Diagram:
Another way of preparing a two-dimensional diagram is in
the form of circles. In such diagrams, both the total and the
component parts or sectors can be shown. The area of a circle is
proportional to the square of its radius.
While making comparisons, pie diagrams should be used on a
percentage basis and not on an absolute basis. In constructing a pie
diagram the first step is to prepare the data so that various
components values can be transposed into corresponding degrees
on the circle.
The second step is to draw a circle of appropriate size with a
compass. The size of the radius depends upon the available space
and other factors of presentation. The third step is to measure
points on the circle and mark off the size of each sector with the
help of a protractor.
Example
Draw a Pie diagram for the following data of production of sugar in
quintals of various countries.

Country      Production of Sugar
             In Quintals    In Degrees
Cuba         62             134
Australia    47             102
India        35             76
Japan        16             35
Egypt        6              13
Total        166            360

[Figure: Pie Diagram with sectors for Cuba, Australia, India, Japan and Egypt]
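The "In Degrees" column is obtained by giving each component its share of the full circle: degrees = (component / total) x 360. A minimal Python sketch of that conversion, using the sugar production figures from the example (the variable names are illustrative assumptions):

```python
# Production of sugar in quintals, from the example above
production = {"Cuba": 62, "Australia": 47, "India": 35, "Japan": 16, "Egypt": 6}

total = sum(production.values())   # 166 quintals in all

# Each sector's angle is its share of the total, scaled to 360 degrees
degrees = {country: round(q / total * 360) for country, q in production.items()}

for country, angle in degrees.items():
    print(f"{country:10s} {angle:4d}")
print(f"{'Total':10s} {sum(degrees.values()):4d}")
```

For this data the rounded angles happen to add up to exactly 360; with other data, rounding may leave the total a degree or so off, and one sector is then adjusted by hand.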

5.6 Graphs:
A graph is a visual form of presentation of statistical data.
A graph is more attractive than a table of figures. Even a common
man can understand the message of data from the graph.
Comparisons can be made between two or more phenomena very
easily with the help of a graph.
Here we shall discuss only some important types of
graphs which are more popular. They are
1. Histogram
2. Frequency Polygon
3. Frequency Curve
4. Ogive
5. Lorenz Curve

5.6.1 Histogram:
A histogram is a bar chart or graph showing the frequency of
occurrence of each value of the variable being analysed. In a
histogram, data are plotted as a series of rectangles. Class
intervals are shown on the X-axis and the frequencies on the
Y-axis.
The height of each rectangle represents the frequency of the
class interval. Each rectangle adjoins the next so as to give
a continuous picture. Such a graph is also called a staircase or block
diagram.
However, we cannot construct a histogram for distribution
with open-end classes. It is also quite misleading if the distribution
has unequal intervals and suitable adjustments in frequencies are
not made.
Example 10:
Draw a histogram for the following data.

Daily Wages    Number of Workers
0-50           8
50-100         16
100-150        27
150-200        19
200-250        10
250-300        6

Solution:
[Figure: Histogram. X-axis: Daily Wages (in Rs.); Y-axis: Number of Workers]
Example 11:
For the following data, draw a histogram.

Marks    Number of Students
21-30    6
31-40    15
41-50    22
51-60    31
61-70    17
71-80    9

Solution:
For drawing a histogram, the frequency distribution should be
continuous. If it is not continuous, then first make it continuous as
follows.

Marks        Number of Students
20.5-30.5    6
30.5-40.5    15
40.5-50.5    22
50.5-60.5    31
60.5-70.5    17
70.5-80.5    9

[Figure: Histogram. X-axis: Marks (20.5 to 80.5); Y-axis: Number of Students]
5.6.2 Frequency Polygon:
If we mark the midpoints of the top horizontal sides of the
rectangles in a histogram and join them by straight lines, the figure
so formed is called a Frequency Polygon. This is done under the
assumption that the frequencies in a class interval are evenly
distributed throughout the class. The area of the polygon is equal
to the area of the histogram, because the area left outside is just
equal to the area included in it.

Example 13:
Draw a frequency polygon for the following data.

Weight (in kg)    Number of Students
30-35             4
35-40             7
40-45             10
45-50             18
50-55             14
55-60             8
60-65             3

[Figure: Frequency Polygon. X-axis: Weight (in kgs); Y-axis: Number of Students]
5.6.3 Frequency Curve:
If the middle points of the upper boundaries of the rectangles
of a histogram are connected by a smooth freehand curve, then that
diagram is called a frequency curve. The curve should begin and end
at the base line.

Example 14:
Draw a frequency curve for the following data.

Monthly Wages (in Rs.)    No. of family
0-1000                    21
1000-2000                 35
2000-3000                 56
3000-4000                 74
4000-5000                 63
5000-6000                 40
6000-7000                 29
7000-8000                 14

Solution:
[Figure: Frequency Curve. X-axis: Monthly Wages (in Rs.); Y-axis: No. of Family]
5.6.4 Ogives:
For a set of observations, we know how to construct a
frequency distribution. In some cases we may require the number
of observations less than a given value or more than a given value.
This is obtained by accumulating (adding) the frequencies up to
(or above) the given value. This accumulated frequency is called
cumulative frequency.
These cumulative frequencies, listed in a table, form a
cumulative frequency table. The curve obtained by
plotting cumulative frequencies is called a cumulative frequency
curve or an ogive.
There are two methods of constructing an ogive, namely:
1. The less than ogive method
2. The more than ogive method.
In the less than ogive method we start with the upper limits of the
classes and go on adding the frequencies. When these frequencies are
plotted, we get a rising curve. In the more than ogive method, we start
with the lower limits of the classes and from the total frequency
we subtract the frequency of each class. When these frequencies
are plotted we get a declining curve.
Example 15:
Draw the Ogives for the following data.

Class interval    Frequency
20-30             4
30-40             6
40-50             13
50-60             25
60-70             32
70-80             19
80-90             8
90-100            3

Solution:

Class limit    Less than ogive    More than ogive
20             0                  110
30             4                  106
40             10                 100
50             23                 87
60             48                 62
70             80                 30
80             99                 11
90             107                3
100            110                0

[Figure: Ogives. X-axis: Class limit (1 cm = 10 units); Y-axis: Cumulative frequency (1 cm = 10 units)]
MEASURES OF CENTRAL TENDENCY

Measures of Central Tendency:
In the study of a population with respect to one characteristic in which we
are interested we may get a large number of observations. It is not
possible to grasp any idea about the characteristic when we look at
all the observations. So it is better to get one number for one group.
That number must be a good representative one for all the
observations to give a clear picture of that characteristic. Such a
representative number can be a central value for all these
observations. This central value is called a measure of central
tendency or an average or a measure of location. There are five
averages. Among them mean, median and mode are called simple
averages and the other two averages, geometric mean and harmonic
mean, are called special averages.
The meaning of average is nicely given in the following definitions.
"A measure of central tendency is a typical value around which
other figures congregate."
"An average stands for the whole group of which it forms a part
yet represents the whole."
"One of the most widely used set of summary figures is known
as measures of location."
Characteristics of a good or an ideal average :
An ideal average should possess the following properties.
1. It should be rigidly defined.
2. It should be easy to understand and compute.
3. It should be based on all items in the data.
4. Its definition shall be in the form of a mathematical
formula.
5. It should be capable of further algebraic treatment.
6. It should have sampling stability.
7. It should be capable of being used in further statistical
computations or processing.

Besides the above requisites, a good average should
represent the maximum characteristics of the data; its value should be
nearest to most items of the given series.
Arithmetic mean or mean :
Arithmetic mean or simply the mean of a variable is defined
as the sum of the observations divided by the number of
observations. If the variable x assumes n values $x_1, x_2, \dots, x_n$ then the
mean, $\bar{x}$, is given by
$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
This formula is for the ungrouped or raw data.
Example 1 :
Calculate the mean for 2, 4, 6, 8, 10
Solution:
$\bar{x} = \frac{2 + 4 + 6 + 8 + 10}{5} = \frac{30}{5} = 6$
Short-Cut method :
Under this method an assumed or an arbitrary average
(indicated by A) is used as the basis of calculation of deviations
from individual values. The formula is
$\bar{x} = A + \frac{\sum d}{n}$
where, A = the assumed mean or any value in x
d = the deviation of each value from the assumed mean
Example 2 :
A student's marks in 5 subjects are 75, 68, 80, 92, 56. Find his
average mark.

Solution:

X         d = x - A
75        7
68 (A)    0
80        12
92        24
56        -12
Total     31

$\bar{x} = A + \frac{\sum d}{n} = 68 + \frac{31}{5} = 68 + 6.2 = 74.2$
Grouped Data :
The mean for grouped data is obtained from the following formula:
$\bar{x} = \frac{\sum f x}{N}$
where x = the mid-point of individual class
f = the frequency of individual class
N = the sum of the frequencies or total frequencies.
Short-cut method :
$\bar{x} = A + \frac{\sum f d}{N} \times c$
where $d = \frac{x - A}{c}$
A = any value in x
N = total frequency
c = width of the class interval
Example 3:
Given the following frequency distribution, calculate the
arithmetic mean

Marks                 : 64    63    62    61    60    59
Number of Students    : 8     18    12    9     7     6

Solution:

X         f     fx      d = x - A    fd
64        8     512     2            16
63        18    1134    1            18
62 (A)    12    744     0            0
61        9     549     -1           -9
60        7     420     -2           -14
59        6     354     -3           -18
Total     60    3713                 -7

Direct method
$\bar{x} = \frac{\sum f x}{N} = \frac{3713}{60} = 61.88$
Short-cut method
$\bar{x} = A + \frac{\sum f d}{N} = 62 - \frac{7}{60} = 61.88$
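Both methods of Example 3 can be reproduced in Python as a check that they agree. The sketch below uses the marks and frequencies from the example; the variable names are illustrative assumptions.

```python
marks = [64, 63, 62, 61, 60, 59]
students = [8, 18, 12, 9, 7, 6]     # frequencies f
n_total = sum(students)             # N = 60

# Direct method: sum(f * x) / N
sum_fx = sum(f * x for x, f in zip(marks, students))
mean_direct = sum_fx / n_total

# Short-cut method with assumed mean A = 62: A + sum(f * d) / N, d = x - A
assumed = 62
sum_fd = sum(f * (x - assumed) for x, f in zip(marks, students))
mean_shortcut = assumed + sum_fd / n_total

print(sum_fx, sum_fd, round(mean_direct, 2), round(mean_shortcut, 2))
```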
Example 4 :
Following is the distribution of persons according to
different income groups. Calculate arithmetic mean.

Income Rs(100)       : 0-10    10-20    20-30    30-40    40-50    50-60    60-70
Number of persons    : 6       8        10       12       7        4        3

Solution:

Income C.I    Number of      Mid X     d = (x - A)/c    fd
              Persons (f)
0-10          6              5         -3               -18
10-20         8              15        -2               -16
20-30         10             25        -1               -10
30-40         12             35 (A)    0                0
40-50         7              45        1                7
50-60         4              55        2                8
60-70         3              65        3                9
Total         50                                        -20

$\text{Mean} = \bar{x} = A + \frac{\sum f d}{N} \times c = 35 - \frac{20}{50} \times 10 = 35 - 4 = 31$
Merits and demerits of Arithmetic mean :


Merits:
1. It is rigidly defined.
2. It is easy to understand and easy to calculate.
3. If the number of items is sufficiently large, it is more
accurate and more reliable.
4. It is a calculated value and is not based on its position in the
series.
5. It is possible to calculate even if some of the details of the
data are lacking.
6. Of all averages, it is affected least by fluctuations of
sampling.
7. It provides a good basis for comparison.
Demerits:
1. It cannot be obtained by inspection nor located through a
frequency graph.
2. It cannot be used in the study of qualitative phenomena not
capable of numerical measurement, e.g. intelligence, beauty,
honesty, etc.
3. It can ignore any single item only at the risk of losing its
accuracy.
4. It is affected very much by extreme values.
5. It cannot be calculated for open-end classes.
6. It may lead to fallacious conclusions, if the details of the
data from which it is computed are not given.
Positional Averages:
These averages are based on the position of the given
observation in a series, arranged in an ascending or descending
order. The magnitude or the size of the values does not matter as it did in
the case of the arithmetic mean. It is because of this basic difference
that the median and mode are called the positional measures of an
average.
Median :
The median is that value of the variate which divides the
group into two equal parts, one part comprising all values greater,
and the other, all values less than the median.
Ungrouped or Raw data :
Arrange the given values in increasing or decreasing
order. If the number of values is odd, the median is the middle value.
If the number of values is even, the median is the mean of the middle
two values.
By formula
$\text{Median} = Md = \left(\frac{n + 1}{2}\right)^{\text{th}}$ item.
Grouped Data:
In a grouped distribution, values are associated with frequencies.
Grouping can be in the form of a discrete frequency distribution or
a continuous frequency distribution. Whatever may be the type of
distribution , cumulative frequencies have to be calculated to know
the total number of items.
Cumulative frequency : (cf)
The cumulative frequency of each class is the sum of the frequency of
the class and the frequencies of the previous classes, i.e. adding the
frequencies successively, so that the last cumulative frequency
gives the total number of items.
Discrete Series:
Step1: Find cumulative frequencies.
Step2: Find $\frac{N + 1}{2}$
Step3: See in the cumulative frequencies the value just greater than $\frac{N + 1}{2}$
Step4: Then the corresponding value of x is the median.

Example 14:
The following data pertain to the number of members in
a family. Find the median size of the family.

Number of members (x)    : 1    2    3    4    5     6     7    8    9    10    11    12
Frequency (f)            : 1    3    5    6    10    13    9    5    3    2     2     1

Solution:

X        f     cf
1        1     1
2        3     4
3        5     9
4        6     15
5        10    25
6        13    38
7        9     47
8        5     52
9        3     55
10       2     57
11       2     59
12       1     60
Total    60

Median = size of $\left(\frac{N + 1}{2}\right)^{\text{th}}$ item
= size of $\left(\frac{60 + 1}{2}\right)^{\text{th}}$ item
= 30.5th item
The cumulative frequency just greater than 30.5 is 38, and the
value of x corresponding to 38 is 6. Hence the median size is 6
members per family.
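The four steps for a discrete series can be sketched in Python as follows, using the family-size data of this example. The function name is an assumption; the comparison uses "greater than or equal to" so that the case where a cumulative frequency coincides exactly with (N + 1)/2 is also covered.

```python
from itertools import accumulate

def discrete_median(values, freqs):
    """Median of a discrete frequency distribution (values sorted ascending)."""
    position = (sum(freqs) + 1) / 2          # Step 2: (N + 1) / 2
    # Steps 1 and 3: walk the cumulative frequencies until the position is reached
    for x, cf in zip(values, accumulate(freqs)):
        if cf >= position:
            return x                         # Step 4: corresponding value of x
    raise ValueError("empty distribution")

members = list(range(1, 13))                          # x = 1, 2, ..., 12
families = [1, 3, 5, 6, 10, 13, 9, 5, 3, 2, 2, 1]     # frequencies f

median_size = discrete_median(members, families)
print(median_size)
```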

Continuous Series:
The steps given below are followed for the calculation of
median in continuous series.
Step1: Find cumulative frequencies.
Step2: Find $\frac{N}{2}$
Step3: See in the cumulative frequencies the value first greater than
$\frac{N}{2}$. The corresponding class interval is called the Median
class. Then apply the formula
$\text{Median} = l + \frac{\frac{N}{2} - m}{f} \times c$
Where
l = Lower limit of the median class
m = cumulative frequency preceding the median class
c = width of the median class
f = frequency in the median class.
N = Total frequency.
Note :
If the class intervals are given in inclusive type, convert
them into exclusive type, call these the true class intervals and
take the lower limit from them.

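Example :
Calculate median from the following data

Value    f     True class interval    c.f
0-4      5     -0.5 - 4.5             5
5-9      8     4.5 - 9.5              13
10-14    10    9.5 - 14.5             23
15-19    12    14.5 - 19.5            35
20-24    7     19.5 - 24.5            42
25-29    6     24.5 - 29.5            48
30-34    3     29.5 - 34.5            51
35-39    2     34.5 - 39.5            53
Total    53

$\frac{N}{2} = \frac{53}{2} = 26.5$
The first cumulative frequency greater than 26.5 is 35, so the median class is 14.5 - 19.5.
$Md = l + \frac{\frac{N}{2} - m}{f} \times c = 14.5 + \frac{26.5 - 23}{12} \times 5 = 14.5 + 1.46 = 15.96$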

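For another distribution, with l = 400, N/2 = 37.5, m = 32, f = 8 and c = 100, the same formula gives
$Md = l + \frac{\frac{N}{2} - m}{f} \times c = 400 + \frac{37.5 - 32}{8} \times 100 = 400 + 68.75 = 468.75$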
Merits of Median :
1. Median is not influenced by extreme values because it is a
positional average.
2. Median can be calculated in case of distribution with open-end intervals.
3. Median can be located even if the data are incomplete.
4. Median can be located even for qualitative factors such as
ability, honesty etc.
Demerits of Median :
1. A slight change in the series may bring drastic change in
median value.
2. In case of even number of items or continuous series,
median is an estimated value other than any value in the
series.
3. It is not suitable for further mathematical treatment except
its use in mean deviation.
4. It does not take into account all the observations.
Quartiles :
The quartiles divide the distribution into four parts. There are
three quartiles. The second quartile divides the distribution into two
halves and therefore is the same as the median. The first (lower)
quartile (Q1) marks off the first one-fourth, the third (upper)
quartile (Q3) marks off the three-fourth.
Raw or ungrouped data:
First arrange the given data in increasing order and use the
formula for Q1 and Q3; then the quartile deviation, Q.D, is given by
$Q.D = \frac{Q_3 - Q_1}{2}$
where $Q_1 = \left(\frac{n + 1}{4}\right)^{\text{th}}$ item and $Q_3 = 3\left(\frac{n + 1}{4}\right)^{\text{th}}$ item
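Example 22 :
Compute quartiles for the data given below: 25, 18, 30, 8, 15,
5, 10, 35, 40, 45
Solution :
5, 8, 10, 15, 18, 25, 30, 35, 40, 45

$Q_1 = \left(\frac{n + 1}{4}\right)^{\text{th}}$ item
$= \left(\frac{10 + 1}{4}\right)^{\text{th}}$ item
= (2.75)th item
$= \text{2nd item} + \frac{3}{4}(\text{3rd item} - \text{2nd item})$
$= 8 + \frac{3}{4}(10 - 8)$
$= 8 + \frac{3}{4} \times 2$
= 8 + 1.5
= 9.5

$Q_3 = 3\left(\frac{n + 1}{4}\right)^{\text{th}}$ item
= 3 x (2.75)th item
= (8.25)th item
$= \text{8th item} + \frac{1}{4}(\text{9th item} - \text{8th item})$
$= 35 + \frac{1}{4}(40 - 35)$
= 35 + 1.25 = 36.25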
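The interpolation used for the (2.75)th and (8.25)th items can be written as one small helper in Python. This is a sketch of the textbook's (n + 1) convention only; the helper name `item_at_position` is an assumption, and statistical libraries often use different quartile conventions that give slightly different values.

```python
def item_at_position(sorted_values, position):
    """Value at a 1-based, possibly fractional, position by linear interpolation."""
    whole = int(position)                  # e.g. 2 for the 2.75th item
    fraction = position - whole            # e.g. 0.75
    if whole < 1:
        return sorted_values[0]
    if whole >= len(sorted_values):
        return sorted_values[-1]
    lower = sorted_values[whole - 1]       # the "whole"-th item
    upper = sorted_values[whole]           # the next item
    return lower + fraction * (upper - lower)

data = sorted([25, 18, 30, 8, 15, 5, 10, 35, 40, 45])
n = len(data)

q1 = item_at_position(data, (n + 1) / 4)        # (2.75)th item
q3 = item_at_position(data, 3 * (n + 1) / 4)    # (8.25)th item
qd = (q3 - q1) / 2                              # quartile deviation
print(q1, q3, qd)
```

The same helper gives a percentile Pk for raw data when called with the position k(n + 1)/100.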

Percentiles :
The percentile values divide the distribution into 100 parts
each containing 1 percent of the cases. The percentile (Pk) is that
value of the variable up to which lie exactly k% of the total number
of observations.
Relationship :
P25 = Q1 ; P50 = D5 = Q2 = Median and P75 = Q3

Percentile for Raw Data or Ungrouped Data :
Example
Calculate P15 for the data given below:
5, 24, 36, 12, 20, 8
Arranging the given values in increasing order:
5, 8, 12, 20, 24, 36
$P_{15} = \left(\frac{15(n + 1)}{100}\right)^{\text{th}}$ item
$= \left(\frac{15 \times 7}{100}\right)^{\text{th}}$ item
= (1.05)th item
= 1st item + 0.05 (2nd item - 1st item)
= 5 + 0.05 (8 - 5) = 5 + 0.15 = 5.15
Percentile for grouped data :
Example
Find P53 for the following frequency distribution.

Class interval    : 0-5    5-10    10-15    15-20    20-25    25-30    30-35    35-40
Frequency         : 5      8       12       16       20       10       4        3

Solution:

Class Interval    Frequency    C.f
0-5               5            5
5-10              8            13
10-15             12           25
15-20             16           41
20-25             20           61
25-30             10           71
30-35             4            75
35-40             3            78
Total             78

$P_{53} = l + \frac{\frac{53N}{100} - m}{f} \times c$
$= 20 + \frac{41.34 - 41}{20} \times 5$
= 20 + 0.085 = 20.085.
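The grouped percentile formula P_k = l + ((kN/100 - m) / f) x c has the same shape as the median formula, with kN/100 in place of N/2. A Python sketch under the assumption of equal class widths (the function name is hypothetical):

```python
from itertools import accumulate

def grouped_percentile(k, lower_bounds, width, freqs):
    """k-th percentile of a continuous series with equal class width."""
    target = k * sum(freqs) / 100              # kN / 100
    cum = list(accumulate(freqs))              # cumulative frequencies
    for i, cf in enumerate(cum):
        if cf >= target:                       # percentile class found
            m = cum[i - 1] if i > 0 else 0     # cf preceding that class
            return lower_bounds[i] + (target - m) / freqs[i] * width
    raise ValueError("k must lie between 0 and 100")

# Classes 0-5, 5-10, ..., 35-40 from the example
lower_bounds = [0, 5, 10, 15, 20, 25, 30, 35]
freqs = [5, 8, 12, 16, 20, 10, 4, 3]

p53 = grouped_percentile(53, lower_bounds, width=5, freqs=freqs)
print(round(p53, 3))
```

With k = 50 the function returns the median, and with k = 25 or k = 75 the quartiles Q1 and Q3, in line with the relationship P25 = Q1, P50 = Median, P75 = Q3.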

Mode :
The mode refers to that value in a distribution which
occurs most frequently. It is an actual value, which has the highest
concentration of items in and around it.
According to Croxton and Cowden, "The mode of a
distribution is the value at the point around which the items tend to
be most heavily concentrated. It may be regarded as the most
typical of a series of values."
It shows the centre of concentration of the frequency in and around a
given value. Therefore, where the purpose is to know the point of
the highest concentration, it is preferred. It is, thus, a positional
measure.
Its importance is very great in marketing studies where a
manager is interested in knowing about the size which has the
highest concentration of items. For example, in placing an order for
shoes or ready-made garments the modal size helps because this
size and the other sizes around it are in common demand.
Computation of the mode:
Ungrouped or Raw Data:
For ungrouped data or a series of individual observations,
mode is often found by mere inspection.
Example :
2 , 7, 10, 15, 10, 17, 8, 10, 2
Mode = M0 =10
In some cases the mode may be absent while in some cases
there may be more than one mode.
Example
1. 12, 10, 15, 24, 30 (no mode)
2. 7, 10, 15, 12, 7, 14, 24, 10, 7, 20, 10
the modes are 7 and 10

Grouped Data:
For a discrete distribution, find the highest frequency; the
corresponding value of X is the mode.
Continuous distribution :
Find the highest frequency; the corresponding class
interval is called the modal class. Then apply the formula
$\text{Mode} = M_0 = l + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times c$
where
l = Lower limit of the modal class
$\Delta_1 = f_1 - f_0$
$\Delta_2 = f_1 - f_2$
f1 = frequency of the modal class
f0 = frequency of the class preceding the modal class
f2 = frequency of the class succeeding the modal class
c = width of the modal class
The above formula can also be written as
$\text{Mode} = l + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times c$
Remarks :
1. If (2f1 - f0 - f2) comes out to be zero, then the mode is obtained
by the following formula, taking absolute differences
within vertical lines:
$M_0 = l + \frac{f_1 - f_0}{|f_1 - f_0| + |f_1 - f_2|} \times c$
2. If the mode lies in the first class interval, then f0 is taken as
zero.
3. The computation of mode poses no problem in
distributions with open-end classes, unless the modal
value lies in the open-end class.
Example 31:
Calculate the mode for the following:
C-I             f
0-50            5
50-100          14
100-150         40
150-200         91
200-250         150
250-300         87
300-350         60
350-400         38
400 and above   15
Solution:
The highest frequency is 150 and the corresponding class interval is
200-250, which is the modal class.
Here l = 200, f1 = 150, f0 = 91, f2 = 87, c = 50
$\text{Mode} = M_0 = l + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times c$
$= 200 + \frac{150 - 91}{2 \times 150 - 91 - 87} \times 50$
$= 200 + \frac{2950}{122}$
= 200 + 24.18 = 224.18
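The modal-class formula can be checked with a short Python sketch. The function name `grouped_mode` is hypothetical. The code takes f0 = 0 when the modal class is the first class, as in the remarks; taking f2 = 0 when it is the last class is an assumption of this sketch. It does not handle the case where 2f1 - f0 - f2 is zero.

```python
def grouped_mode(lower_limits, width, freqs):
    """Mode = l + (f1 - f0) / (2*f1 - f0 - f2) * c for the modal class."""
    i = max(range(len(freqs)), key=lambda j: freqs[j])   # modal class index
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0                    # preceding class
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0       # assumed 0 at the end
    denom = 2 * f1 - f0 - f2
    if denom == 0:
        raise ValueError("2*f1 - f0 - f2 is zero; formula not applicable")
    return lower_limits[i] + (f1 - f0) / denom * width


lowers = [0, 50, 100, 150, 200, 250, 300, 350, 400]
freqs = [5, 14, 40, 91, 150, 87, 60, 38, 15]
mode = grouped_mode(lowers, 50, freqs)
print(round(mode, 2))  # 224.18
```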

Merits of Mode:
1. It is easy to calculate and in some cases it can be located
by mere inspection.
2. Mode is not at all affected by extreme values.
3. It can be calculated for open-end classes.
4. It is usually an actual value of an important part of the
series.
5. In some circumstances it is the best representative of data.
Demerits of mode:
1. It is not based on all observations.
2. It is not capable of further mathematical treatment.
3. Mode is often ill-defined; in some cases it is not possible
to find the mode.
4. As compared with the mean, the mode is affected to a great extent
by sampling fluctuations.
5. It is unsuitable in cases where relative importance of items
has to be considered.
EMPIRICAL RELATIONSHIP BETWEEN AVERAGES
In a symmetrical distribution the three simple averages coincide:
mean = median = mode. For a moderately asymmetrical
distribution, the relationship between them was given by Prof.
Karl Pearson as mode = 3 median - 2 mean.
Example 34:
If the mean and median of a moderately asymmetrical series
are 26.8 and 27.9 respectively, what would be its most probable
mode?
Solution:
Using the empirical formula
Mode = 3 median - 2 mean
= 3 x 27.9 - 2 x 26.8
= 83.7 - 53.6
= 30.1

7. MEASURES OF DISPERSION,
SKEWNESS AND KURTOSIS
7.1 Introduction:
The measures of central tendency serve to locate the
center of the distribution, but they do not reveal how the items
are spread out on either side of the center. This characteristic
of a frequency distribution is commonly referred to as
dispersion. In a series all the items are not equal. There is
difference or variation among the values. The degree of
variation is evaluated by various measures of dispersion.
Small dispersion indicates high uniformity of the items, while
large dispersion indicates less uniformity.
For example, consider the following marks of two students.
Student I    Student II
68           85
75           90
65           80
67           25
70           65
Both have got a total of 345 and an average of 69 each.
The fact is that the second student has failed in one paper.
When the averages alone are considered, the two students are
equal. But first student has less variation than second student.
Less variation is a desirable characteristic.
Characteristics of a good measure of dispersion:
An ideal measure of dispersion is expected to possess
the following properties
1. It should be rigidly defined.
2. It should be based on all the items.
3. It should not be unduly affected by extreme items.
4. It should lend itself to algebraic manipulation.
5. It should be simple to understand and easy to
calculate.
7.2 Absolute and Relative Measures :
There are two kinds of measures of dispersion, namely
1.Absolute measure of dispersion
2.Relative measure of dispersion.
Absolute measure of dispersion indicates the amount of
variation in a set of values in terms of units of observations.
For example, when rainfalls on different days are available in
mm, any absolute measure of dispersion gives the variation in
rainfall in mm. On the other hand relative measures of
dispersion are free from the units of measurements of the
observations. They are pure numbers. They are used to
compare the variation in two or more sets, which are having
different units of measurements of observations.
The various absolute and relative measures of
dispersion are listed below.
Absolute measure          Relative measure
1. Range                  1. Co-efficient of Range
2. Quartile deviation     2. Co-efficient of Quartile deviation
3. Mean deviation         3. Co-efficient of Mean deviation
4. Standard deviation     4. Co-efficient of variation
7.3 Range and coefficient of Range:
7.3.1 Range:
This is the simplest possible measure of dispersion and
is defined as the difference between the largest and smallest
values of the variable.
In symbols, Range = L - S,
where
L = Largest value.
S = Smallest value.
In individual observations and discrete series, L and S
are easily identified. In continuous series, the following two
methods are followed.
Method 1:
L = Upper boundary of the highest class
S = Lower boundary of the lowest class.
Method 2:
L = Mid value of the highest class.
S = Mid value of the lowest class.
7.3.2 Co-efficient of Range:
$\text{Co-efficient of Range} = \frac{L - S}{L + S}$
Example 1:
Find the value of the range and its co-efficient for the following
data.
7, 9, 6, 8, 11, 10, 4
Solution:
L = 11, S = 4.
Range = L - S = 11 - 4 = 7
$\text{Co-efficient of Range} = \frac{L - S}{L + S} = \frac{11 - 4}{11 + 4} = \frac{7}{15} = 0.4667$
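The range and its co-efficient for individual observations can be computed directly. A minimal sketch follows; the function name `range_and_coefficient` is hypothetical.

```python
def range_and_coefficient(values):
    """Return (L - S, (L - S) / (L + S)) for a list of observations."""
    largest, smallest = max(values), min(values)
    rng = largest - smallest
    return rng, rng / (largest + smallest)


rng, coeff = range_and_coefficient([7, 9, 6, 8, 11, 10, 4])
print(rng, round(coeff, 4))  # 7 0.4667
```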
Example 2:
Calculate the range and its co-efficient from the following
distribution.
Size:     60-63   63-66   66-69   69-72   72-75
Number:   5       18      42      27      8
Solution:
L = Upper boundary of the highest class = 75
S = Lower boundary of the lowest class = 60
Range = L - S = 75 - 60 = 15
$\text{Co-efficient of Range} = \frac{L - S}{L + S} = \frac{75 - 60}{75 + 60} = \frac{15}{135} = 0.1111$
7.3.3 Merits and Demerits of Range :
Merits:
1. It is simple to understand.
2. It is easy to calculate.
3. In certain types of problems like quality control, weather
forecasts, share price analysis, etc., range is most widely
used.
Demerits:
1. It is very much affected by the extreme items.
2. It is based on only two extreme observations.
3. It cannot be calculated from open-end class intervals.
4. It is not suitable for mathematical treatment.
5. It is a very rarely used measure.
7.4 Quartile Deviation and Co-efficient of Quartile
Deviation:
7.4.1 Quartile Deviation (Q.D.):
Definition: Quartile Deviation is half of the difference
between the first and third quartiles. Hence, it is called the Semi
Inter Quartile Range.
In symbols, $\text{Q.D.} = \frac{Q_3 - Q_1}{2}$. Among the quartiles Q1, Q2
and Q3, the range Q3 - Q1 is called the inter quartile range and
$\frac{Q_3 - Q_1}{2}$ the semi inter quartile range.
7.4.2 Co-efficient of Quartile Deviation:
$\text{Co-efficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$
Example 3:
Find the Quartile Deviation for the following data:
391, 384, 591, 407, 672, 522, 777, 733, 1490, 2488
Solution:
Arrange the given values in ascending order.
384, 391, 407, 522, 591, 672, 733, 777, 1490, 2488.
Position of Q1 is $\frac{n + 1}{4} = \frac{10 + 1}{4}$ = 2.75th item
Q1 = 2nd value + 0.75 (3rd value - 2nd value)
= 391 + 0.75 (407 - 391)
= 391 + 0.75 x 16
= 391 + 12
= 403
Position of Q3 is $3\left(\frac{n + 1}{4}\right)$ = 3 x 2.75 = 8.25th item
Q3 = 8th value + 0.25 (9th value - 8th value)
= 777 + 0.25 (1490 - 777)
= 777 + 0.25 (713)
= 777 + 178.25 = 955.25
$\text{Q.D.} = \frac{Q_3 - Q_1}{2} = \frac{955.25 - 403}{2} = \frac{552.25}{2} = 276.125$
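The positional rule used in this example, the q(n + 1)/4 th item with linear interpolation between neighbouring values, can be sketched as follows. The helper names are hypothetical. Library routines such as `statistics.quantiles` may use a different interpolation convention, so the rule is written out by hand here.

```python
def quartile(values, q):
    """q-th quartile (q = 1, 2, 3) at position q*(n + 1)/4 of the sorted data."""
    data = sorted(values)
    pos = q * (len(data) + 1) / 4
    whole = int(pos)              # item number just below the position
    frac = pos - whole            # fractional part used for interpolation
    if whole < 1:
        return data[0]
    if whole >= len(data):
        return data[-1]
    return data[whole - 1] + frac * (data[whole] - data[whole - 1])


def quartile_deviation(values):
    return (quartile(values, 3) - quartile(values, 1)) / 2


data = [391, 384, 591, 407, 672, 522, 777, 733, 1490, 2488]
q1, q3 = quartile(data, 1), quartile(data, 3)
qd = quartile_deviation(data)
print(q1, q3, qd)  # 403.0 955.25 276.125
```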

Example 5:
For the data given below, give the quartile deviation and
coefficient of quartile deviation.
X:   351-500   501-650   651-800   801-950   951-1100
f:   48        189       88        47        28
Solution:
x            f         True class intervals   Cumulative frequency
351-500      48        350.5-500.5            48
501-650      189       500.5-650.5            237
651-800      88        650.5-800.5            325
801-950      47        800.5-950.5            372
951-1100     28        950.5-1100.5           400
Total        N = 400
$Q_1 = l_1 + \frac{\frac{N}{4} - m_1}{f_1} \times c_1$
$\frac{N}{4} = \frac{400}{4} = 100$
Q1 class is 500.5-650.5
l1 = 500.5, m1 = 48, f1 = 189, c1 = 150
$Q_1 = 500.5 + \frac{100 - 48}{189} \times 150$
$= 500.5 + \frac{52 \times 150}{189}$
= 500.5 + 41.27
= 541.77
$Q_3 = l_3 + \frac{3\left(\frac{N}{4}\right) - m_3}{f_3} \times c_3$
$3\left(\frac{N}{4}\right) = 3 \times 100 = 300$
Q3 class is 650.5-800.5
l3 = 650.5, m3 = 237, f3 = 88, c3 = 150
$Q_3 = 650.5 + \frac{300 - 237}{88} \times 150$
$= 650.5 + \frac{63 \times 150}{88}$
= 650.5 + 107.39
= 757.89
$\text{Q.D.} = \frac{Q_3 - Q_1}{2} = \frac{757.89 - 541.77}{2} = \frac{216.12}{2} = 108.06$
$\text{Coefficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{757.89 - 541.77}{757.89 + 541.77} = \frac{216.12}{1299.66} = 0.1663$
7.4.3 Merits and Demerits of Quartile Deviation
Merits :
1. It is Simple to understand and easy to calculate
2. It is not affected by extreme values.
3. It can be calculated for data with open end classes also.
Demerits:
1. It is not based on all the items. It is based on two
positional values Q1 and Q3 and ignores the extreme
50% of the items.
2. It is not amenable to further mathematical treatment.
3. It is affected by sampling fluctuations.
7.5 Mean Deviation and Coefficient of Mean Deviation:
7.5.1 Mean Deviation:
The range and quartile deviation are not based on all
observations. They are positional measures of dispersion. They
do not show any scatter of the observations from an average.
The mean deviation is a measure of dispersion based on all
items in a distribution.
Definition:
Mean deviation is the arithmetic mean of the deviations
of a series computed from any measure of central tendency,
i.e., the mean, median or mode, with all the deviations taken as
positive, i.e., signs are ignored. According to Clark and
Schekade,
"Average deviation is the average amount of scatter of the
items in a distribution from either the mean or the median,
ignoring the signs of the deviations."
We usually compute the mean deviation about any one of
the three averages: mean, median or mode. Sometimes the mode
may be ill-defined, and as such the mean deviation is computed
from the mean or the median. The median is preferred as a choice
between mean and median. But in general practice, and due to the
wide application of the mean, the mean deviation is generally
computed from the mean. M.D. can be used to denote mean
deviation.
7.5.2 Coefficient of mean deviation:
Mean deviation calculated by any measure of central
tendency is an absolute measure. For the purpose of comparing
variation among different series, a relative mean deviation is
required. The relative mean deviation is obtained by dividing
the mean deviation by the average used for calculating mean
deviation.
$\text{Coefficient of mean deviation} = \frac{\text{Mean deviation}}{\text{Mean or Median or Mode}}$
If the result is desired in percentage,
$\text{Coefficient of mean deviation} = \frac{\text{Mean deviation}}{\text{Mean or Median or Mode}} \times 100$
7.5.3 Computation of mean deviation - Individual Series:
1. Calculate the average (mean, median or mode) of the
series.
2. Take the deviations of items from the average, ignoring
signs, and denote these deviations by |D|.
3. Compute the total of these deviations, i.e., $\sum |D|$.
4. Divide this total by the number of items.
Symbolically: $\text{M.D.} = \frac{\sum |D|}{n}$
Example 6:
Calculate the mean deviation from the mean and from the median for the
following data:
100, 150, 200, 250, 360, 490, 500, 600, 671
Also calculate the coefficients of M.D.
Solution:
$\text{Mean} = \bar{x} = \frac{\sum x}{n} = \frac{3321}{9} = 369$
Now arrange the data in ascending order:
100, 150, 200, 250, 360, 490, 500, 600, 671
Median = Value of $\left(\frac{n + 1}{2}\right)$th item
= Value of $\left(\frac{9 + 1}{2}\right)$th item
= Value of 5th item
= 360
x        |D| = |x - mean|    |D| = |x - Md|
100      269                 260
150      219                 210
200      169                 160
250      119                 110
360      9                   0
490      121                 130
500      131                 140
600      231                 240
671      302                 311
3321     1570                1561
$\text{M.D. from mean} = \frac{\sum |D|}{n} = \frac{1570}{9} = 174.44$
$\text{Co-efficient of M.D.} = \frac{\text{M.D.}}{\bar{x}} = \frac{174.44}{369} = 0.47$
$\text{M.D. from median} = \frac{\sum |D|}{n} = \frac{1561}{9} = 173.44$
$\text{Co-efficient of M.D.} = \frac{\text{M.D.}}{\text{Median}} = \frac{173.44}{360} = 0.48$
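The steps for an individual series can be sketched in Python using the standard `statistics` module for the two averages. The helper name `mean_deviation` is hypothetical.

```python
from statistics import mean, median


def mean_deviation(values, center):
    """Average of absolute deviations |x - center|."""
    return sum(abs(x - center) for x in values) / len(values)


data = [100, 150, 200, 250, 360, 490, 500, 600, 671]
md_mean = mean_deviation(data, mean(data))        # about the mean (369)
md_median = mean_deviation(data, median(data))    # about the median (360)

print(round(md_mean, 2), round(md_mean / mean(data), 2))        # 174.44 0.47
print(round(md_median, 2), round(md_median / median(data), 2))  # 173.44 0.48
```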
7.5.4 Mean Deviation - Discrete series:
Steps:
1. Find out an average (mean, median or mode).
2. Find out the deviations of the variable values from the
average, ignoring signs, and denote them by |D|.
3. Multiply the deviation of each value by its respective
frequency and find out the total $\sum f|D|$.
4. Divide $\sum f|D|$ by the total frequency.
Symbolically, $\text{M.D.} = \frac{\sum f|D|}{N}$

Example 7:
Compute the mean deviation from the mean and from the median for the
following data:
Height in cms:    158  159  160  161  162  163  164  165  166
No. of persons:   15   20   32   35   33   22   20   10   8
Also compute the coefficient of mean deviation.
Solution:
Height   No. of persons   d = x - A    fd     |D| = |X - mean|   f|D|
X        f                (A = 162)
158      15               -4           -60    3.51               52.65
159      20               -3           -60    2.51               50.20
160      32               -2           -64    1.51               48.32
161      35               -1           -35    0.51               17.85
162      33               0            0      0.49               16.17
163      22               1            22     1.49               32.78
164      20               2            40     2.49               49.80
165      10               3            30     3.49               34.90
166      8                4            32     4.49               35.92
Total    195                           -95                       338.59
$\bar{x} = A + \frac{\sum fd}{N} = 162 + \frac{-95}{195} = 162 - 0.49 = 161.51$
$\text{M.D.} = \frac{\sum f|D|}{N} = \frac{338.59}{195} = 1.74$
$\text{Coefficient of M.D.} = \frac{\text{M.D.}}{\bar{x}} = \frac{1.74}{161.51} = 0.0108$
Height   No. of persons   c.f.   |D| = |X - Median|   f|D|
x        f
158      15               15     3                    45
159      20               35     2                    40
160      32               67     1                    32
161      35               102    0                    0
162      33               135    1                    33
163      22               157    2                    44
164      20               177    3                    60
165      10               187    4                    40
166      8                195    5                    40
Total    195                                          334
Median = Size of $\left(\frac{N + 1}{2}\right)$th item
= Size of $\left(\frac{195 + 1}{2}\right)$th item
= Size of 98th item
= 161
$\text{M.D.} = \frac{\sum f|D|}{N} = \frac{334}{195} = 1.71$
$\text{Coefficient of M.D.} = \frac{\text{M.D.}}{\text{Median}} = \frac{1.71}{161} = 0.0106$
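For a discrete frequency distribution the same calculation weights each deviation by its frequency. The sketch below uses hypothetical helper names. It works with the unrounded mean, so intermediate figures differ slightly from the hand calculation, which rounds the mean to 161.51, although the results agree to two decimals.

```python
def freq_mean(xs, fs):
    return sum(f * x for x, f in zip(xs, fs)) / sum(fs)


def freq_median(xs, fs):
    """Size of the ((N + 1)/2)-th item, located through cumulative frequencies."""
    target = (sum(fs) + 1) / 2
    cum = 0
    for x, f in zip(xs, fs):
        cum += f
        if cum >= target:
            return x


def freq_mean_deviation(xs, fs, center):
    """M.D. = sum(f * |x - center|) / N."""
    return sum(f * abs(x - center) for x, f in zip(xs, fs)) / sum(fs)


heights = [158, 159, 160, 161, 162, 163, 164, 165, 166]
persons = [15, 20, 32, 35, 33, 22, 20, 10, 8]

md_about_mean = freq_mean_deviation(heights, persons, freq_mean(heights, persons))
md_about_median = freq_mean_deviation(heights, persons, freq_median(heights, persons))
print(round(md_about_mean, 2), round(md_about_median, 2))  # 1.74 1.71
```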
7.5.5 Mean deviation - Continuous series:
The method of calculating mean deviation in a continuous
series is the same as in the discrete series. In a continuous series we
have to find the mid-points of the various classes and take the
deviations of these points from the average selected. Thus
$\text{M.D.} = \frac{\sum f|D|}{N}$
where
D = m - average
m = mid-point
Example 8:
Find out the mean deviation from the mean and from the median for the
following series.
Age in years:     0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80
No. of persons:   20    25     32     40     42     35     10     8
Also compute the co-efficient of mean deviation.
Solution:
Age      m     f     d = (m - A)/c       fd     |D| = |m - mean|   f|D|
                     (A = 35, c = 10)
0-10     5     20    -3                  -60    31.5               630.0
10-20    15    25    -2                  -50    21.5               537.5
20-30    25    32    -1                  -32    11.5               368.0
30-40    35    40    0                   0      1.5                60.0
40-50    45    42    1                   42     8.5                357.0
50-60    55    35    2                   70     18.5               647.5
60-70    65    10    3                   30     28.5               285.0
70-80    75    8     4                   32     38.5               308.0
Total          212                       32                        3193.0
$\bar{x} = A + \frac{\sum fd}{N} \times c = 35 + \frac{32}{212} \times 10 = 35 + \frac{320}{212} = 35 + 1.5 = 36.5$
$\text{M.D.} = \frac{\sum f|D|}{N} = \frac{3193}{212} = 15.06$
Calculation of median and M.D. from median
Age      m     f     c.f    |D| = |m - Md|   f|D|
0-10     5     20    20     32.25            645.00
10-20    15    25    45     22.25            556.25
20-30    25    32    77     12.25            392.00
30-40    35    40    117    2.25             90.00
40-50    45    42    159    7.75             325.50
50-60    55    35    194    17.75            621.25
60-70    65    10    204    27.75            277.50
70-80    75    8     212    37.75            302.00
Total                                        3209.50
$\frac{N}{2} = \frac{212}{2} = 106$
l = 30, m = 77, f = 40, c = 10
$\text{Median} = l + \frac{\frac{N}{2} - m}{f} \times c = 30 + \frac{106 - 77}{40} \times 10 = 30 + \frac{29}{4}$
= 30 + 7.25 = 37.25
$\text{M.D.} = \frac{\sum f|D|}{N} = \frac{3209.5}{212} = 15.14$
$\text{Coefficient of M.D.} = \frac{\text{M.D.}}{\text{Median}} = \frac{15.14}{37.25} = 0.41$

7.5.6 Merits and Demerits of M.D :


Merits:
1. It is simple to understand and easy to compute.
2. It is rigidly defined.
3. It is based on all items of the series.
4. It is not much affected by the fluctuations of sampling.
5. It is less affected by the extreme items.
6. It is flexible, because it can be calculated from any
average.
7. It is a better measure for comparison.
Demerits:
1. It is not a very accurate measure of dispersion.
2. It is not suitable for further mathematical calculation.
3. It is rarely used. It is not as popular as standard deviation.
4. Algebraic positive and negative signs are ignored. It is
mathematically unsound and illogical.
7.6 Standard Deviation and Coefficient of variation:
7.6.1 Standard Deviation :
Karl Pearson introduced the concept of standard deviation
in 1893. It is the most important measure of dispersion and is
widely used in many statistical formulae. Standard deviation is also
called Root-Mean-Square Deviation, because it is the
square root of the mean of the squared deviations from the
arithmetic mean. The square of the standard
deviation is called the Variance.
Definition:
It is defined as the positive square root of the arithmetic
mean of the squares of the deviations of the given observations from
their arithmetic mean.
The standard deviation is denoted by the Greek letter $\sigma$ (sigma).
7.6.2 Calculation of Standard deviation-Individual Series :
There are two methods of calculating Standard deviation in
an individual series.
a) Deviations taken from Actual mean
b) Deviation taken from Assumed mean
a) Deviations taken from actual mean:
This method is adopted when the mean is a whole number.
Steps:
1. Find out the actual mean of the series ($\bar{x}$).
2. Find out the deviation of each value from the mean.
3. Square the deviations and take the total of the squared
deviations, $\sum x^2$ (here x denotes a deviation from the mean).
4. Divide the total ($\sum x^2$) by the number of observations, giving $\frac{\sum x^2}{n}$.
The square root of $\frac{\sum x^2}{n}$ is the standard deviation.
Thus $\sigma = \sqrt{\frac{\sum x^2}{n}}$ or $\sqrt{\frac{\sum (x - \bar{x})^2}{n}}$
b) Deviations taken from assumed mean:
This method is adopted when the arithmetic mean is a
fractional value.
Taking deviations from a fractional value would be a very
difficult and tedious task. To save time and labour, we apply the
short-cut method; deviations are taken from an assumed mean. The
formula is:
$\sigma = \sqrt{\frac{\sum d^2}{N} - \left(\frac{\sum d}{N}\right)^2}$
where d stands for the deviation from the assumed mean = (X - A).
Steps:
1. Assume any one of the items in the series as an average (A).
2. Find out the deviations from the assumed mean, i.e., X - A,
denoted by d, and also the total of the deviations $\sum d$.
3. Square the deviations, i.e., $d^2$, and add up the squares of the
deviations, i.e., $\sum d^2$.
4. Then substitute the values in the following formula:
$\sigma = \sqrt{\frac{\sum d^2}{n} - \left(\frac{\sum d}{n}\right)^2}$
Note: We can also use the simplified formula for standard
deviation.
$\sigma = \frac{1}{n}\sqrt{n\sum d^2 - \left(\sum d\right)^2}$
For the frequency distribution
$\sigma = \frac{c}{N}\sqrt{N\sum fd^2 - \left(\sum fd\right)^2}$
Example 9:
Calculate the standard deviation from the following data.
14, 22, 9, 15, 20, 17, 12, 11
Solution:
Deviations from actual mean.
Values (X)   $x - \bar{x}$   $(x - \bar{x})^2$
14           -1              1
22           7               49
9            -6              36
15           0               0
20           5               25
17           2               4
12           -3              9
11           -4              16
120                          140
$\bar{X} = \frac{120}{8} = 15$
$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} = \sqrt{\frac{140}{8}} = \sqrt{17.5} = 4.18$
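The two methods for an individual series, actual mean and assumed mean, give the same result; the assumed-mean formula only rearranges the arithmetic. A minimal sketch with hypothetical function names follows. The second call uses the marks 43, 48, 65, 57, 31, 60, 37, 48, 78, 59 with assumed mean A = 57.

```python
from math import sqrt


def sd_actual_mean(values):
    """sigma = sqrt(sum((x - mean)^2) / n)."""
    n = len(values)
    m = sum(values) / n
    return sqrt(sum((x - m) ** 2 for x in values) / n)


def sd_assumed_mean(values, a):
    """sigma = sqrt(sum(d^2)/n - (sum(d)/n)^2), with d = x - a."""
    n = len(values)
    d = [x - a for x in values]
    return sqrt(sum(v * v for v in d) / n - (sum(d) / n) ** 2)


values = [14, 22, 9, 15, 20, 17, 12, 11]
marks = [43, 48, 65, 57, 31, 60, 37, 48, 78, 59]
print(round(sd_actual_mean(values), 2))      # 4.18
print(round(sd_assumed_mean(marks, 57), 2))  # 13.26
```

Any value of the assumed mean gives the same standard deviation, which can be checked by comparing `sd_assumed_mean(marks, 57)` with `sd_actual_mean(marks)`.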

Example 10:
The table below gives the marks obtained by 10 students in
statistics. Calculate the standard deviation.
Student Nos:  1   2   3   4   5   6   7   8   9   10
Marks:        43  48  65  57  31  60  37  48  78  59
Solution: (Deviations from assumed mean)
Nos.      Marks (x)   d = X - A (A = 57)   $d^2$
1         43          -14                  196
2         48          -9                   81
3         65          8                    64
4         57          0                    0
5         31          -26                  676
6         60          3                    9
7         37          -20                  400
8         48          -9                   81
9         78          21                   441
10        59          2                    4
n = 10                $\sum d = -44$       $\sum d^2 = 1952$
$\sigma = \sqrt{\frac{\sum d^2}{n} - \left(\frac{\sum d}{n}\right)^2}$
$= \sqrt{\frac{1952}{10} - \left(\frac{-44}{10}\right)^2}$
$= \sqrt{195.2 - 19.36}$
$= \sqrt{175.84} = 13.26$
7.6.3 Calculation of standard deviation:
Discrete Series:
There are three methods for calculating standard deviation
in discrete series:
(a) Actual mean method
(b) Assumed mean method
(c) Step-deviation method.
(a) Actual mean method:
Steps:
1. Calculate the mean of the series.
2. Find the deviations of the various items from the mean, i.e.,
$x - \bar{x} = d$.
3. Square the deviations ($= d^2$) and multiply by the respective
frequencies (f) to get $fd^2$.
4. Total the products ($\sum fd^2$). Then apply the formula:
$\sigma = \sqrt{\frac{\sum fd^2}{\sum f}}$
If the actual mean is a fraction, the calculation takes a lot of
time and labour; as such, this method is rarely used in practice.
(b) Assumed mean method:
Here deviations are taken not from the actual mean but from
an assumed mean. This method is also used if the given variable
values are not in equal intervals.
Steps:
1. Assume any one of the items in the series as an assumed
mean and denote it by A.
2. Find out the deviations from the assumed mean, i.e., X - A, and
denote them by d.
3. Multiply these deviations by the respective frequencies and
get $\sum fd$.
4. Square the deviations ($d^2$).
5. Multiply the squared deviations ($d^2$) by the respective
frequencies (f) and get $\sum fd^2$.
6. Substitute the values in the following formula:
$\sigma = \sqrt{\frac{\sum fd^2}{\sum f} - \left(\frac{\sum fd}{\sum f}\right)^2}$
where d = X - A, N = $\sum f$.
Example 11:
Calculate the standard deviation from the following data.
X:   20   22   25   31   35   40   42   45
f:   5    12   15   20   25   14   10   6
Solution:
Deviations from assumed mean
x     f         d = x - A    $d^2$    fd                fd^2
                (A = 31)
20    5         -11          121      -55               605
22    12        -9           81       -108              972
25    15        -6           36       -90               540
31    20        0            0        0                 0
35    25        4            16       100               400
40    14        9            81       126               1134
42    10        11           121      110               1210
45    6         14           196      84                1176
      N = 107                         $\sum fd = 167$   $\sum fd^2 = 6037$
$\sigma = \sqrt{\frac{\sum fd^2}{\sum f} - \left(\frac{\sum fd}{\sum f}\right)^2}$
$= \sqrt{\frac{6037}{107} - \left(\frac{167}{107}\right)^2}$
$= \sqrt{56.42 - 2.44}$
$= \sqrt{53.98} = 7.35$

(c) Step-deviation method:
If the variable values are in equal intervals, then we adopt
this method.
Steps:
1. Assume the center value of the series as the assumed mean A.
2. Find out $d = \frac{x - A}{C}$, where C is the interval between each
value.
3. Multiply these deviations d by the respective frequencies
and get $\sum fd$.
4. Square the deviations and get $d^2$.
5. Multiply the squared deviations ($d^2$) by the respective
frequencies (f) and obtain the total $\sum fd^2$.
6. Substitute the values in the following formula to get the
standard deviation.
$\sigma = \sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} \times C$
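Example 12:
Compute the standard deviation from the following data.
Marks:             10   20   30   40   50   60
No. of students:   8    12   20   10   7    3
Solution:
Marks x   f        d = (x - 30)/10   fd              fd^2
10        8        -2                -16             32
20        12       -1                -12             12
30        20       0                 0               0
40        10       1                 10              10
50        7        2                 14              28
60        3        3                 9               27
          N = 60                     $\sum fd = 5$   $\sum fd^2 = 109$
$\sigma = \sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} \times C$
$= \sqrt{\frac{109}{60} - \left(\frac{5}{60}\right)^2} \times 10$
$= \sqrt{1.817 - 0.0069} \times 10$
$= \sqrt{1.8101} \times 10$
= 1.345 x 10
= 13.45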
7.6.4 Calculation of Standard Deviation - Continuous series:
In a continuous series the method of calculating standard
deviation is almost the same as in a discrete series. But in a
continuous series, the mid-values of the class intervals are to be found
out. The step-deviation method is widely used.
The formula is
$\sigma = \sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} \times C$
where $d = \frac{m - A}{C}$ and C = class interval.
Steps:
1. Find out the mid-value of each class.
2. Assume the center value as an assumed mean and denote
it by A.
3. Find out $d = \frac{m - A}{C}$.
4. Multiply the deviations d by the respective frequencies and
get $\sum fd$.
5. Square the deviations and get $d^2$.
6. Multiply the squared deviations ($d^2$) by the respective
frequencies and get $\sum fd^2$.
7. Substitute the values in the formula above to get the
standard deviation.
Example 13:
The daily temperature recorded in a city in Russia in a year
is given below.
Temperature (°C)   No. of days
-40 to -30         10
-30 to -20         18
-20 to -10         30
-10 to 0           42
0 to 10            65
10 to 20           180
20 to 30           20
Total              365
Calculate the standard deviation.
Solution:
Temperature   Mid value   No. of days   d = (m - (-5))/10   fd                fd^2
              (m)         f
-40 to -30    -35         10            -3                  -30               90
-30 to -20    -25         18            -2                  -36               72
-20 to -10    -15         30            -1                  -30               30
-10 to 0      -5          42            0                   0                 0
0 to 10       5           65            1                   65                65
10 to 20      15          180           2                   360               720
20 to 30      25          20            3                   60                180
                          N = 365                           $\sum fd = 389$   $\sum fd^2 = 1157$
$\sigma = \sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} \times C$
$= \sqrt{\frac{1157}{365} - \left(\frac{389}{365}\right)^2} \times 10$
$= \sqrt{3.1699 - 1.1358} \times 10$
$= \sqrt{2.0341} \times 10$
= 1.4262 x 10
= 14.26 °C
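The step-deviation method can be sketched as a small Python function that takes the values (or class mid-values), the frequencies, the assumed mean A and the common interval C. The function name `sd_step_deviation` is hypothetical. The first call uses the temperature data, the second the marks 10 to 60 with frequencies 8, 12, 20, 10, 7, 3.

```python
from math import sqrt


def sd_step_deviation(mids, freqs, a, c):
    """sigma = sqrt(sum(f d^2)/N - (sum(f d)/N)^2) * c, with d = (m - a)/c."""
    n = sum(freqs)
    d = [(m - a) / c for m in mids]
    sum_fd = sum(f * v for f, v in zip(freqs, d))
    sum_fd2 = sum(f * v * v for f, v in zip(freqs, d))
    return sqrt(sum_fd2 / n - (sum_fd / n) ** 2) * c


temps = sd_step_deviation([-35, -25, -15, -5, 5, 15, 25],
                          [10, 18, 30, 42, 65, 180, 20], a=-5, c=10)
marks = sd_step_deviation([10, 20, 30, 40, 50, 60],
                          [8, 12, 20, 10, 7, 3], a=30, c=10)
print(round(temps, 2), round(marks, 2))  # 14.26 13.45
```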

7.6.6 Merits and Demerits of Standard Deviation:


Merits:
1. It is rigidly defined and its value is always definite and
based on all the observations and the actual signs of
deviations are used.
2. As it is based on arithmetic mean, it has all the merits of
arithmetic mean.
3. It is the most important and widely used measure of
dispersion.
4. It is possible for further algebraic treatment.
5. It is less affected by the fluctuations of sampling and hence
stable.
6. It is the basis for measuring the coefficient of correlation
and sampling.
Demerits:
1. It is not easy to understand and it is difficult to calculate.
2. It gives more weight to extreme values because the values
are squared up.
3. As it is an absolute measure of variability, it cannot be used
for the purpose of comparison.

7.6.7 Coefficient of Variation :


The Standard deviation is an absolute measure of
dispersion. It is expressed in terms of units in which the original
figures are collected and stated. The standard deviation of heights
of students cannot be compared with the standard deviation of
weights of students, as both are expressed in different units, i.e
heights in centimeter and weights in kilograms. Therefore the
standard deviation must be converted into a relative measure of
dispersion for the purpose of comparison. The relative measure is
known as the coefficient of variation.
The coefficient of variation is obtained by dividing the
standard deviation by the mean and multiplying it by 100.
Symbolically,
$\text{Coefficient of variation (C.V.)} = \frac{\sigma}{\bar{X}} \times 100$
If we want to compare the variability of two or more series,
we can use C.V. The series or groups of data for which the C.V. is
greater indicate that the group is more variable, less stable, less
uniform, less consistent or less homogeneous. If the C.V. is less, it
indicates that the group is less variable, more stable, more uniform,
more consistent or more homogeneous.
Example 15:
In two factories A and B located in the same industrial area,
the average weekly wages (in rupees) and the standard deviations
are as follows:
Factory   Average   Standard Deviation   No. of workers
A         34.5      5                    476
B         28.5      4.5                  524
1. Which factory, A or B, pays out a larger amount as weekly
wages?
2. Which factory, A or B, has greater variability in individual
wages?
Solution:
Given N1 = 476, $\bar{X}_1$ = 34.5, $\sigma_1$ = 5
N2 = 524, $\bar{X}_2$ = 28.5, $\sigma_2$ = 4.5
1. Total wages paid by factory A
= 34.5 x 476
= Rs. 16,422
Total wages paid by factory B
= 28.5 x 524
= Rs. 14,934.
Therefore factory A pays out a larger amount as weekly wages.
2. The C.V. of the distribution of weekly wages of factories A and B are
$\text{C.V.(A)} = \frac{\sigma_1}{\bar{X}_1} \times 100 = \frac{5}{34.5} \times 100 = 14.49$
$\text{C.V.(B)} = \frac{\sigma_2}{\bar{X}_2} \times 100 = \frac{4.5}{28.5} \times 100 = 15.79$
Factory B has greater variability in individual wages, since the
C.V. of factory B is greater than the C.V. of factory A.
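The comparison in this example can be reproduced with a few lines of Python. The function name `coefficient_of_variation` and the dictionary layout are illustrative choices, not part of the original text.

```python
def coefficient_of_variation(sd, mean):
    """C.V. = (sigma / mean) * 100."""
    return sd / mean * 100


factories = {
    "A": {"mean": 34.5, "sd": 5.0, "workers": 476},
    "B": {"mean": 28.5, "sd": 4.5, "workers": 524},
}

for name, row in factories.items():
    total = row["mean"] * row["workers"]                  # total weekly wages
    cv = coefficient_of_variation(row["sd"], row["mean"])
    print(name, total, round(cv, 2))
# A 16422.0 14.49
# B 14934.0 15.79
```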

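Example 16:
Prices of a particular commodity in five years in two cities are
given below:
Price in city A:   20   22   19   23   16
Price in city B:   10   20   18   12   15
Which city has more stable prices?
Solution:
Actual mean method
City A                                          City B
Prices   Deviations from          $dx^2$        Prices   Deviations from          $dy^2$
(X)      $\bar{X}$ = 20 (dx)                    (Y)      $\bar{Y}$ = 15 (dy)
20       0                        0             10       -5                       25
22       2                        4             20       5                        25
19       -1                       1             18       3                        9
23       3                        9             12       -3                       9
16       -4                       16            15       0                        0
$\sum x = 100$   $\sum dx = 0$   $\sum dx^2 = 30$      $\sum y = 75$   $\sum dy = 0$   $\sum dy^2 = 68$
City A: $\bar{X} = \frac{\sum x}{n} = \frac{100}{5} = 20$
$\sigma_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} = \sqrt{\frac{\sum dx^2}{n}} = \sqrt{\frac{30}{5}} = \sqrt{6} = 2.45$
$\text{C.V.}(x) = \frac{\sigma_x}{\bar{x}} \times 100 = \frac{2.45}{20} \times 100 = 12.25\%$
City B: $\bar{Y} = \frac{\sum y}{n} = \frac{75}{5} = 15$
$\sigma_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n}} = \sqrt{\frac{\sum dy^2}{n}} = \sqrt{\frac{68}{5}} = \sqrt{13.6} = 3.69$
$\text{C.V.}(y) = \frac{\sigma_y}{\bar{y}} \times 100 = \frac{3.69}{15} \times 100 = 24.6\%$
City A had more stable prices than City B, because the
coefficient of variation is less in City A.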

7.9 Skewness:
7.9.1 Meaning:
Skewness means lack of symmetry. We study skewness to
have an idea about the shape of the curve which we can draw with
the help of the given data. If in a distribution mean = median =
mode, then that distribution is known as a symmetrical distribution.
If in a distribution mean ≠ median ≠ mode, then it is not a
symmetrical distribution; it is called a skewed distribution, and
such a distribution could be either positively skewed or negatively
skewed.
a) Symmetrical distribution:
[Figure: symmetrical curve, Mean = Median = Mode]
It is clear from the above diagram that in a symmetrical
distribution the values of mean, median and mode coincide. The
spread of the frequencies is the same on both sides of the center
point of the curve.
b) Positively skewed distribution:
[Figure: positively skewed curve, Mode < Median < Mean]
It is clear from the above diagram that in a positively skewed
distribution the value of the mean is maximum and that of the
mode is least; the median lies in between the two. In the positively
skewed distribution the frequencies are spread out over a greater
range of values on the right hand side than they are on the left hand
side.
c) Negatively skewed distribution:
[Figure: negatively skewed curve, Mean < Median < Mode]
It is clear from the above diagram that in a negatively skewed
distribution the value of the mode is maximum and that of the
mean is least. The median lies in between the two. In the negatively
skewed distribution the frequencies are spread out over a greater
range of values on the left hand side than they are on the right hand
side.
7.10 Measures of skewness:
The important measures of skewness are
(i) Karl Pearson's coefficient of skewness
(ii) Bowley's coefficient of skewness
(iii) Measure of skewness based on moments
7.10.1 Karl Pearson's Coefficient of skewness:
According to Karl Pearson, the absolute measure of
skewness = mean - mode. This measure is not suitable for making
valid comparisons of the skewness in two or more distributions,
because the unit of measurement may be different in different
series. To avoid this difficulty we use a relative measure of skewness
called Karl Pearson's coefficient of skewness, given by:
$\text{Karl Pearson's Coefficient of Skewness} = \frac{\text{Mean} - \text{Mode}}{\text{S.D.}}$
In case the mode is ill-defined, the coefficient can be determined
by the formula:
$\text{Coefficient of skewness} = \frac{3(\text{Mean} - \text{Median})}{\text{S.D.}}$
Example 18:
Calculate Karl Pearson's coefficient of skewness for the
following data.
25, 15, 23, 40, 27, 25, 23, 25, 20
Solution:
Computation of Mean and Standard deviation:
Short cut method.
Size     Deviation from A = 25   $d^2$
         d
25       0                       0
15       -10                     100
23       -2                      4
40       15                      225
27       2                       4
25       0                       0
23       -2                      4
25       0                       0
20       -5                      25
N = 9    $\sum d = -2$           $\sum d^2 = 362$
$\text{Mean} = A + \frac{\sum d}{n} = 25 + \frac{-2}{9} = 25 - 0.22 = 24.78$
$\sigma = \sqrt{\frac{\sum d^2}{n} - \left(\frac{\sum d}{n}\right)^2} = \sqrt{\frac{362}{9} - \left(\frac{-2}{9}\right)^2}$
$= \sqrt{40.22 - 0.05} = \sqrt{40.17} = 6.3$
Mode = 25, as this size of item repeats 3 times.
Karl Pearson's coefficient of skewness
$= \frac{\text{Mean} - \text{Mode}}{\text{S.D.}} = \frac{24.78 - 25}{6.3} = \frac{-0.22}{6.3} = -0.03$
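Karl Pearson's coefficient for raw data can be sketched with the standard `statistics` module (`multimode` needs Python 3.8 or later). The function name `pearson_skewness` is hypothetical, and raising an error when the mode is not unique is a choice made for this sketch. Because the code does not round intermediate values, it returns about -0.035, whereas the hand calculation, which rounds the mean and S.D. first, gives -0.03.

```python
from statistics import mean, multimode, pstdev


def pearson_skewness(values):
    """(Mean - Mode) / S.D., using the population standard deviation."""
    modes = multimode(values)
    if len(modes) != 1:
        raise ValueError("mode is ill-defined; use 3 * (mean - median) / S.D.")
    return (mean(values) - modes[0]) / pstdev(values)


data = [25, 15, 23, 40, 27, 25, 23, 25, 20]
sk = pearson_skewness(data)
print(round(mean(data), 2), round(pstdev(data), 2))  # 24.78 6.34
```

The small negative value indicates a slight negative skew.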

Example 19:
Find the coefficient of skewness from the data given below.
Size:        3    4    5    6    7     8     9    10
Frequency:   7    10   14   35   102   136   43   8
Solution:
Size   Frequency   Deviation from   $d^2$   fd                fd^2
       (f)         A = 6 (d)
3      7           -3               9       -21               63
4      10          -2               4       -20               40
5      14          -1               1       -14               14
6      35          0                0       0                 0
7      102         1                1       102               102
8      136         2                4       272               544
9      43          3                9       129               387
10     8           4                16      32                128
       N = 355                              $\sum fd = 480$   $\sum fd^2 = 1278$
$\text{Mean} = A + \frac{\sum fd}{N} = 6 + \frac{480}{355} = 6 + 1.35 = 7.35$
$\sigma = \sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} = \sqrt{\frac{1278}{355} - \left(\frac{480}{355}\right)^2}$
$= \sqrt{3.6 - 1.82} = \sqrt{1.78} = 1.33$
Mode = 8
$\text{Coefficient of skewness} = \frac{\text{Mean} - \text{Mode}}{\text{S.D.}} = \frac{7.35 - 8}{1.33} = \frac{-0.65}{1.33} \approx -0.5$

Example 23:
Calculate the value of Bowley's coefficient of skewness from
the following series.
Wages (Rs):       10-20   20-30   30-40   40-50   50-60   60-70   70-80
No. of Persons:   1       3       11      21      43      32      9
Solution:
Wages (Rs)   f         c.f
10-20        1         1
20-30        3         4
30-40        11        15
40-50        21        36
50-60        43        79
60-70        32        111
70-80        9         120
             N = 120
$Q_1 = l_1 + \frac{\frac{N}{4} - m_1}{f_1} \times c_1$
$\frac{N}{4} = \frac{120}{4} = 30$
Q1 class = 40-50
l1 = 40, m1 = 15, f1 = 21, c1 = 10
$Q_1 = 40 + \frac{30 - 15}{21} \times 10 = 40 + \frac{150}{21}$
= 40 + 7.14
= 47.14
$Q_2 = \text{Median} = l + \frac{\frac{N}{2} - m}{f} \times c$
$\frac{N}{2} = \frac{120}{2} = 60$
Median class = 50-60
l = 50, m = 36, f = 43, c = 10
$\text{Median} = 50 + \frac{60 - 36}{43} \times 10 = 50 + \frac{240}{43}$
= 50 + 5.58
= 55.58
$Q_3 = l_3 + \frac{3\left(\frac{N}{4}\right) - m_3}{f_3} \times c_3$
$3\left(\frac{N}{4}\right) = 3 \times \frac{120}{4} = 90$
Q3 class = 60-70
l3 = 60, m3 = 79, f3 = 32, c3 = 10
$Q_3 = 60 + \frac{90 - 79}{32} \times 10 = 60 + \frac{110}{32}$
= 60 + 3.44
= 63.44
Bowley's Coefficient of skewness
$= \frac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$
$= \frac{63.44 + 47.14 - 2 \times 55.58}{63.44 - 47.14}$
$= \frac{110.58 - 111.16}{16.30}$
$= \frac{-0.58}{16.30}$
= -0.0356
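Bowley's coefficient for a grouped series needs only the three quartiles, each located by the same interpolation formula. The sketch below uses hypothetical helper names and assumes equal class widths. It keeps full precision, so the result differs from the hand value in the fourth decimal place.

```python
def grouped_quantile(lower_limits, width, freqs, fraction):
    """Value below which `fraction` of the items lie: l + ((target - m) / f) * c."""
    target = fraction * sum(freqs)
    cum = 0                                   # m: c.f. before the class
    for low, f in zip(lower_limits, freqs):
        if cum + f >= target:
            return low + (target - cum) / f * width
        cum += f
    raise ValueError("fraction must lie between 0 and 1")


def bowley_skewness(lower_limits, width, freqs):
    q1 = grouped_quantile(lower_limits, width, freqs, 0.25)
    q2 = grouped_quantile(lower_limits, width, freqs, 0.50)
    q3 = grouped_quantile(lower_limits, width, freqs, 0.75)
    return (q3 + q1 - 2 * q2) / (q3 - q1)


lowers = [10, 20, 30, 40, 50, 60, 70]
freqs = [1, 3, 11, 21, 43, 32, 9]
q1 = grouped_quantile(lowers, 10, freqs, 0.25)
q3 = grouped_quantile(lowers, 10, freqs, 0.75)
bowley = bowley_skewness(lowers, 10, freqs)
print(round(q1, 2), round(q3, 2))  # 47.14 63.44
```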

Kurtosis:
The expression "Kurtosis" is used to describe the
peakedness of a curve.
The three measures, central tendency, dispersion and
skewness, describe the characteristics of frequency distributions.
But these studies will not give us a clear picture of the
characteristics of a distribution.
As far as the measurement of shape is concerned, we have
two characteristics: skewness, which refers to the asymmetry of a
series, and kurtosis, which measures the peakedness of a curve
relative to the normal curve. All frequency curves show different
degrees of flatness or peakedness. This characteristic of a frequency
curve is termed kurtosis. Measures of kurtosis denote the shape of the
top of a frequency curve. They tell us the extent to which a
distribution is more peaked or more flat-topped than the normal
curve. The normal curve, which is symmetrical and bell-shaped, is
designated as Mesokurtic. If a curve is relatively more narrow and
peaked at the top, it is designated as Leptokurtic. If the frequency
curve is more flat than the normal curve, it is designated as Platykurtic.
[Figure: frequency curves; L = Leptokurtic, M = Mesokurtic, P = Platykurtic]
7.11.1 Measure of Kurtosis:
The measure of kurtosis of a frequency distribution based on
moments is denoted by $\beta_2$ and is given by
$\beta_2 = \frac{\mu_4}{\mu_2^2}$
If $\beta_2 = 3$, the distribution is said to be normal and the curve
is mesokurtic.
If $\beta_2 > 3$, the distribution is said to be more peaked and the
curve is leptokurtic.
If $\beta_2 < 3$, the distribution is said to be flat-topped and the
curve is platykurtic.
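Example 24:
Calculate $\beta_1$ and $\beta_2$ for the following data.
X:   0   1    2    3    4    5    6    7    8
F:   5   10   15   20   25   20   15   10   5
Solution:
[Hint: Refer Example of page 172 and get the values of the first four
central moments and then proceed to find $\beta_1$ and $\beta_2$]
$\mu_1 = 0$, $\mu_2 = \frac{\sum fd^2}{N} = \frac{500}{125} = 4$
$\mu_3 = \frac{\sum fd^3}{N} = 0$, $\mu_4 = \frac{\sum fd^4}{N} = \frac{4700}{125} = 37.6$
$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{0}{64} = 0$
$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{37.6}{4^2} = \frac{37.6}{16} = 2.35$
The value of $\beta_2$ is less than 3, hence the curve is platykurtic.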
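The moment-based measures can be verified numerically. The sketch below computes the first four central moments of a frequency distribution directly from their definition and then $\beta_1$ and $\beta_2$. The function name `central_moments` is hypothetical.

```python
def central_moments(xs, fs):
    """Return [mu1, mu2, mu3, mu4], the central moments of a frequency distribution."""
    n = sum(fs)
    m = sum(f * x for x, f in zip(xs, fs)) / n           # arithmetic mean
    return [sum(f * (x - m) ** r for x, f in zip(xs, fs)) / n
            for r in (1, 2, 3, 4)]


xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
fs = [5, 10, 15, 20, 25, 20, 15, 10, 5]
mu1, mu2, mu3, mu4 = central_moments(xs, fs)
beta1 = mu3 ** 2 / mu2 ** 3
beta2 = mu4 / mu2 ** 2
print(mu2, mu4, beta1, beta2)  # 4.0 37.6 0.0 2.35
```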

Example 25:
From the data given below, calculate the first four moments
about an arbitrary origin and then calculate the first four central
moments.
X:   30-33   33-36   36-39   39-42   42-45   45-48
f:   2       4       26      47      15      6
Solution:
[Hint: Refer Example 18 of page 172 and get the values of the first
four moments about the origin and the first four moments about the
mean. Then using these values find the values of $\beta_1$ and $\beta_2$.]
$\mu_1 = 0$, $\mu_2 = 8.76$, $\mu_3 = -2.91$, $\mu_4 = 281.454$
$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{(-2.91)^2}{(8.76)^3} = \frac{8.47}{672.24} = 0.0126$
$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{281.454}{(8.76)^2} = 3.67$
Since $\beta_2 > 3$, the curve is leptokurtic.

8. CORRELATION
Introduction:
The idea of correlation is often used in everyday life without
the term itself being mentioned. For example, when parents advise
their children to work hard so that they may get good marks, they
are correlating good marks with hard work.
The study of the characteristics of a single variable, such
as height, weight, age, marks or wages, is known as univariate
analysis. The statistical analysis of the relationship between two
variables is known as bivariate analysis. Sometimes the variables
may be inter-related. In health sciences we study the relationship
between blood pressure and age, consumption level of some nutrient
and weight gain, total income and medical expenditure, etc. The
nature and strength of the relationship may be examined by
correlation and regression analysis.
Thus correlation refers to the relationship between two or more
variables, e.g. the relation between height of father and son, yield
and rainfall, wage and price index, shares and debentures, etc.
Correlation is a statistical analysis which measures and
analyses the degree or extent to which two variables fluctuate
with reference to each other. The word relationship is important:
it indicates that there is some connection between the variables,
and correlation measures the closeness of that relationship.
Correlation does not by itself indicate a cause and effect
relationship. Price and supply, income and expenditure are
correlated.
Definitions:
1. Correlation analysis attempts to determine the degree of
relationship between variables. - Ya-Kun-Chou.
2. Correlation is an analysis of the covariation between two
or more variables. - A.M.Tuttle.
Correlation expresses the inter-dependence of two sets of
variables upon each other. One variable may be called the subject
(independent) variable and the other the relative (dependent)
variable. The relative variable is measured in terms of the subject.
Uses of correlation:
1. It is used in physical and social sciences.
2. It is useful for economists to study the relationship between
variables like price, quantity etc. Businessmen estimate
costs, sales, price etc. using correlation.
3. It is helpful in measuring the degree of relationship
between the variables like income and expenditure, price
and supply, supply and demand etc.
4. Sampling error can be calculated.
5. It is the basis for the concept of regression.
Scatter Diagram:
It is the simplest method of studying the relationship
between two variables diagrammatically. One variable is
represented along the horizontal axis and the second variable along
the vertical axis. For each pair of observations of two variables, we
put a dot in the plane. There are as many dots in the plane as the
number of paired observations of two variables. The direction of
dots shows the scatter or concentration of various points. This will
show the type of correlation.
1. If all the plotted points form a straight line from the lower
left-hand corner to the upper right-hand corner, then there is
perfect positive correlation. We denote this as r = +1.
2. If all the plotted dots lie on a straight line falling from the
upper left-hand corner to the lower right-hand corner, there is
perfect negative correlation between the two variables. In this
case the coefficient of correlation takes the value r = -1.
[Figure: scatter diagrams showing perfect positive correlation (r = +1) and perfect negative correlation (r = -1)]
3. If the plotted points in the plane form a band and they show
a rising trend from the lower left-hand corner to the upper
right-hand corner, the two variables are highly positively
correlated.
4. If the points fall in a narrow band from the upper left-hand
corner to the lower right-hand corner, there is a high degree
of negative correlation.
[Figure: scatter diagrams showing high positive and high negative correlation]
5. If the plotted points in the plane are spread all over the
diagram, there is no correlation between the two variables.
[Figure: scatter diagram showing no correlation (r = 0)]

Merits:
1. It is the simplest and an attractive method of finding the
nature of correlation between the two variables.
2. It is a non-mathematical method of studying correlation. It
is easy to understand.
3. It is not affected by extreme items.
4. It is the first step in finding out the relation between the two
variables.
5. We can have a rough idea at a glance whether it is a positive
correlation or negative correlation.
Demerits:
By this method we cannot get the exact degree of
correlation between the two variables.
Types of Correlation:
Correlation is classified into various types. The most
important ones are
i) Positive and negative.
ii) Linear and non-linear.
iii) Partial and total.
iv) Simple and Multiple.
Positive and Negative Correlation:
It depends upon the direction of change of the variables. If
the two variables tend to move together in the same direction (ie)
an increase in the value of one variable is accompanied by an
increase in the value of the other, (or) a decrease in the value of one
variable is accompanied by a decrease in the value of other, then
the correlation is called positive or direct correlation. Price and
supply, height and weight, yield and rainfall, are some examples of
positive correlation.
If the two variables tend to move together in opposite
directions so that increase (or) decrease in the value of one variable
is accompanied by a decrease or increase in the value of the other
variable, then the correlation is called negative (or) inverse
correlation. Price and demand, yield of crop and price, are
examples of negative correlation.

Linear and Non-linear correlation:
If the ratio of change between the two variables is a
constant then there will be linear correlation between them.
Consider the following.

X:   2    4    6    8   10   12
Y:   3    6    9   12   15   18

Here the ratio of change between the two variables is the
same: Y increases by 3 for every increase of 2 in X. If we plot
these points on a graph we get a straight line.
If the amount of change in one variable does not bear a
constant ratio to the amount of change in the other, then the
relation is called curvilinear (or) non-linear correlation. The
graph will be a curve.
Simple and Multiple correlation:
When we study only two variables, the relationship is
simple correlation. For example, quantity of money and price level,
demand and price. But in a multiple correlation we study more
than two variables simultaneously. The relationship of price,
demand and supply of a commodity are an example for multiple
correlation.
Partial and total correlation:
The study of two variables excluding some other variable is
called Partial correlation. For example, we study price and
demand eliminating supply side. In total correlation all facts are
taken into account.
Computation of correlation:
When there exists some relationship between two
variables, we have to measure the degree of relationship. This
measure is called the measure of correlation (or) correlation
coefficient and it is denoted by r .
Co-variation:
The covariation between the variables x and y is defined as

$$\mathrm{Cov}(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n}$$

where $\bar{x}$, $\bar{y}$ are respectively the means of
x and y and n is the number of pairs of observations.

Karl Pearson's coefficient of correlation:

Karl Pearson, a great biometrician and statistician,
suggested a mathematical method for measuring the magnitude of
linear relationship between two variables. It is the most widely
used method in practice and is known as the Pearsonian coefficient
of correlation. It is denoted by r. The formula for calculating r is

(i) $r = \dfrac{\mathrm{Cov}(x, y)}{\sigma_x \, \sigma_y}$, where $\sigma_x$, $\sigma_y$ are the S.D. of x and y respectively.

(ii) $r = \dfrac{\sum XY}{n \, \sigma_x \, \sigma_y}$

(iii) $r = \dfrac{\sum XY}{\sqrt{\sum X^2 \cdot \sum Y^2}}$

where $X = x - \bar{x}$, $Y = y - \bar{y}$.

When the deviations are taken from the actual mean we can apply
any one of these formulas. The third formula is the simplest to
calculate, since it is not necessary to calculate the standard
deviations of the x and y series.
Steps:
1. Find the mean of the two series x and y.
2. Take deviations of the two series from $\bar{x}$ and $\bar{y}$:
$X = x - \bar{x}$, $Y = y - \bar{y}$.
3. Square the deviations and get the totals of the respective
squares of deviations of x and y, denoted by $\sum X^2$ and
$\sum Y^2$ respectively.
4. Multiply the deviations of x and y, get the total and
divide by n. This is the covariance.
5. Substitute the values in the formula.

$$r = \frac{\mathrm{cov}(x, y)}{\sigma_x \, \sigma_y} = \frac{\sum (x - \bar{x})(y - \bar{y}) / n}{\sqrt{\dfrac{\sum (x - \bar{x})^2}{n}} \cdot \sqrt{\dfrac{\sum (y - \bar{y})^2}{n}}}$$

The above formula is simplified as follows

$$r = \frac{\sum XY}{\sqrt{\sum X^2 \cdot \sum Y^2}}, \qquad X = x - \bar{x}, \quad Y = y - \bar{y}$$
Example 1:
Find Karl Pearson's coefficient of correlation from the following
data between height of father (x) and son (y).

X:   64   65   66   67   68   69   70
Y:   66   67   65   68   70   68   72

Comment on the result.
Solution:

   x     y    X = x - 67   X^2   Y = y - 68   Y^2    XY
  64    66       -3         9       -2         4      6
  65    67       -2         4       -1         1      2
  66    65       -1         1       -3         9      3
  67    68        0         0        0         0      0
  68    70        1         1        2         4      2
  69    68        2         4        0         0      0
  70    72        3         9        4        16     12
 469   476        0        28        0        34     25

$$\bar{x} = \frac{469}{7} = 67; \qquad \bar{y} = \frac{476}{7} = 68$$

$$r = \frac{\sum XY}{\sqrt{\sum X^2 \cdot \sum Y^2}} = \frac{25}{\sqrt{28 \times 34}} = \frac{25}{\sqrt{952}} = \frac{25}{30.85} = 0.81$$

Since r = +0.81, the variables are highly positively correlated, i.e.
tall fathers tend to have tall sons.
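The deviation-from-mean steps used in Example 1 can be sketched in a few lines of Python. The function name `pearson_r` is illustrative only; the code simply applies formula (iii) to the father and son heights.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from deviations about the means (formula iii)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [x - mean_x for x in xs]   # X = x - mean of x
    dev_y = [y - mean_y for y in ys]   # Y = y - mean of y
    sum_xy = sum(a * b for a, b in zip(dev_x, dev_y))
    sum_x2 = sum(a * a for a in dev_x)
    sum_y2 = sum(b * b for b in dev_y)
    return sum_xy / sqrt(sum_x2 * sum_y2)


father = [64, 65, 66, 67, 68, 69, 70]
son = [66, 67, 65, 68, 70, 68, 72]

r = pearson_r(father, son)
print(round(r, 2))
```

The intermediate sums inside the function correspond to the column totals 25, 28 and 34 of the worked table.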
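Working rule (i)
We can also find r with the following formula.
We have $r = \dfrac{\mathrm{Cov}(x, y)}{\sigma_x \, \sigma_y}$, where

$$\mathrm{Cov}(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n} = \frac{\sum (xy + \bar{x}\bar{y} - \bar{y}x - \bar{x}y)}{n}$$

$$= \frac{\sum xy}{n} + \frac{n\,\bar{x}\bar{y}}{n} - \bar{y}\,\frac{\sum x}{n} - \bar{x}\,\frac{\sum y}{n}$$

$$= \frac{\sum xy}{n} + \bar{x}\bar{y} - \bar{y}\bar{x} - \bar{x}\bar{y}$$

$$\mathrm{Cov}(x, y) = \frac{\sum xy}{n} - \bar{x}\bar{y}$$

$$\sigma_x^2 = \frac{\sum x^2}{n} - \bar{x}^2, \qquad \sigma_y^2 = \frac{\sum y^2}{n} - \bar{y}^2$$

Now

$$r = \frac{\mathrm{Cov}(x, y)}{\sigma_x \, \sigma_y} = \frac{\dfrac{\sum xy}{n} - \bar{x}\bar{y}}{\sqrt{\dfrac{\sum x^2}{n} - \bar{x}^2} \cdot \sqrt{\dfrac{\sum y^2}{n} - \bar{y}^2}}$$

$$r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{[\,n \sum x^2 - (\sum x)^2\,][\,n \sum y^2 - (\sum y)^2\,]}}$$

Note: In the above method we need not find the mean or standard
deviation of the variables separately.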
Example 2:
Calculate coefficient of correlation from the following data.

X:   1    2    3    4    5    6    7    8    9
Y:   9    8   10   12   11   13   14   16   15

     x      y     x^2    y^2     xy
     1      9      1      81      9
     2      8      4      64     16
     3     10      9     100     30
     4     12     16     144     48
     5     11     25     121     55
     6     13     36     169     78
     7     14     49     196     98
     8     16     64     256    128
     9     15     81     225    135
    45    108    285    1356    597

$$r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{[\,n \sum x^2 - (\sum x)^2\,][\,n \sum y^2 - (\sum y)^2\,]}}$$

$$r = \frac{9 \times 597 - 45 \times 108}{\sqrt{(9 \times 285 - 45^2)(9 \times 1356 - 108^2)}}$$

$$r = \frac{5373 - 4860}{\sqrt{(2565 - 2025)(12204 - 11664)}} = \frac{513}{\sqrt{540 \times 540}} = \frac{513}{540} = 0.95$$
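Working rule (i) needs only the raw sums, so it translates directly into code. The following is a minimal sketch using the nine pairs of Example 2; the function name `pearson_r_raw` is hypothetical.

```python
from math import sqrt


def pearson_r_raw(xs, ys):
    """Pearson's r from raw sums, without computing means or S.D.s."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [9, 8, 10, 12, 11, 13, 14, 16, 15]

r = pearson_r_raw(x, y)
print(round(r, 2))
```

For these data the numerator is 513 and both bracketed terms equal 540, matching the hand calculation.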

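Working rule (ii) (shortcut method)

We have $r = \dfrac{\mathrm{Cov}(x, y)}{\sigma_x \, \sigma_y}$, where

$$\mathrm{Cov}(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n}$$

Take the deviation from x as x - A and the deviation from y as
y - B.

$$\mathrm{Cov}(x, y) = \frac{\sum [(x - A) - (\bar{x} - A)]\,[(y - B) - (\bar{y} - B)]}{n}$$

$$= \frac{1}{n} \sum [(x - A)(y - B) - (x - A)(\bar{y} - B) - (\bar{x} - A)(y - B) + (\bar{x} - A)(\bar{y} - B)]$$

$$= \frac{\sum (x - A)(y - B)}{n} - (\bar{y} - B)\,\frac{\sum (x - A)}{n} - (\bar{x} - A)\,\frac{\sum (y - B)}{n} + \frac{n\,(\bar{x} - A)(\bar{y} - B)}{n}$$

$$= \frac{\sum (x - A)(y - B)}{n} - (\bar{y} - B)\left(\bar{x} - \frac{nA}{n}\right) - (\bar{x} - A)\left(\bar{y} - \frac{nB}{n}\right) + (\bar{x} - A)(\bar{y} - B)$$

$$= \frac{\sum (x - A)(y - B)}{n} - (\bar{y} - B)(\bar{x} - A) - (\bar{x} - A)(\bar{y} - B) + (\bar{x} - A)(\bar{y} - B)$$

$$= \frac{\sum (x - A)(y - B)}{n} - (\bar{x} - A)(\bar{y} - B)$$

Let $u = x - A$, $v = y - B$, so that $\bar{u} = \bar{x} - A$ and $\bar{v} = \bar{y} - B$. Then

$$\mathrm{Cov}(x, y) = \frac{\sum uv}{n} - \bar{u}\,\bar{v}$$

$$\sigma_x^2 = \frac{\sum u^2}{n} - \bar{u}^2 = \sigma_u^2, \qquad \sigma_y^2 = \frac{\sum v^2}{n} - \bar{v}^2 = \sigma_v^2$$

$$r = \frac{n \sum uv - (\sum u)(\sum v)}{\sqrt{n \sum u^2 - (\sum u)^2} \cdot \sqrt{n \sum v^2 - (\sum v)^2}}$$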
Example 3:
Calculate Pearson's coefficient of correlation.

X:  45   55   56   58   60   65   68   70   75   80   85
Y:  56   50   48   60   62   64   65   70   74   82   90

Taking A = 65 and B = 70:

    X      Y    u = x-A   v = y-B    u^2     v^2      uv
   45     56     -20       -14       400     196     280
   55     50     -10       -20       100     400     200
   56     48      -9       -22        81     484     198
   58     60      -7       -10        49     100      70
   60     62      -5        -8        25      64      40
   65     64       0        -6         0      36       0
   68     65       3        -5         9      25     -15
   70     70       5         0        25       0       0
   75     74      10         4       100      16      40
   80     82      15        12       225     144     180
   85     90      20        20       400     400     400
 Total             2       -49      1414    1865    1393

$$r = \frac{n \sum uv - (\sum u)(\sum v)}{\sqrt{[\,n \sum u^2 - (\sum u)^2\,][\,n \sum v^2 - (\sum v)^2\,]}}$$

$$r = \frac{11 \times 1393 - 2 \times (-49)}{\sqrt{(1414 \times 11 - 2^2)(1865 \times 11 - (-49)^2)}}$$

$$= \frac{15421}{\sqrt{15550 \times 18114}} = \frac{15421}{16783.11} = +0.92$$

Correlation of grouped bi-variate data:

When the number of observations is very large, the data is
classified into a two-way frequency distribution or correlation
table. The class intervals for y are in the column headings and
for x in the stubs. The order can also be reversed. The
frequencies for each cell of the table are obtained. The formula
for calculation of the correlation coefficient r is

$$r = \frac{\mathrm{cov}(x, y)}{\sigma_x \, \sigma_y}$$

where

$$\mathrm{cov}(x, y) = \frac{\sum f (x - \bar{x})(y - \bar{y})}{N} = \frac{\sum f xy}{N} - \bar{x}\,\bar{y}$$

$$\sigma_x^2 = \frac{\sum f x^2}{N} - \bar{x}^2; \qquad \sigma_y^2 = \frac{\sum f y^2}{N} - \bar{y}^2$$

and N is the total frequency. Hence

$$r = \frac{N \sum f xy - (\sum f x)(\sum f y)}{\sqrt{[\,N \sum f x^2 - (\sum f x)^2\,] \cdot [\,N \sum f y^2 - (\sum f y)^2\,]}}$$
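Theorem:
The correlation coefficient is not affected by change
of origin and scale.
If $u = \dfrac{x - A}{c}$ and $v = \dfrac{y - B}{d}$ (with c, d > 0), then $r_{xy} = r_{uv}$.
Proof:

$$u = \frac{x - A}{c} \;\Rightarrow\; cu = x - A \;\Rightarrow\; x = cu + A \;\Rightarrow\; \bar{x} = c\bar{u} + A$$

$$v = \frac{y - B}{d} \;\Rightarrow\; vd = y - B \;\Rightarrow\; y = B + vd \;\Rightarrow\; \bar{y} = B + d\bar{v}$$

$$\sigma_x = c\,\sigma_u; \qquad \sigma_y = d\,\sigma_v$$

$$r_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x \, \sigma_y}$$

$$\mathrm{cov}(x, y) = \frac{1}{N} \sum f (x - \bar{x})(y - \bar{y})$$

$$= \frac{1}{N} \sum f [(cu + A) - (c\bar{u} + A)]\,[(dv + B) - (d\bar{v} + B)]$$

$$= \frac{1}{N} \sum f (cu - c\bar{u})(dv - d\bar{v})$$

$$= \frac{1}{N} \sum f \, c\,(u - \bar{u}) \, d\,(v - \bar{v})$$

$$= cd \cdot \frac{\sum f (u - \bar{u})(v - \bar{v})}{N} = cd \; \mathrm{cov}(u, v)$$

$$r_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x \, \sigma_y} = \frac{cd \; \mathrm{cov}(u, v)}{c\,\sigma_u \cdot d\,\sigma_v} = \frac{\mathrm{cov}(u, v)}{\sigma_u \, \sigma_v} = r_{uv}$$

Hence $r_{xy} = r_{uv}$.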
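The theorem can be checked numerically. The sketch below reuses the data of Example 3 with the origins A = 65 and B = 70; the scale factors c = 5 and d = 2 are arbitrary positive values chosen only for illustration, and `pearson_r` is a hypothetical helper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r computed from raw sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


x = [45, 55, 56, 58, 60, 65, 68, 70, 75, 80, 85]
y = [56, 50, 48, 60, 62, 64, 65, 70, 74, 82, 90]

A, B = 65, 70   # assumed origins, as in Example 3
c, d = 5, 2     # arbitrary positive scale factors (illustrative)

u = [(xi - A) / c for xi in x]
v = [(yi - B) / d for yi in y]

r_xy = pearson_r(x, y)
r_uv = pearson_r(u, v)
print(round(r_xy, 4), round(r_uv, 4))
```

Both printed values agree, which is why step deviations can be used freely when the raw values are inconvenient to work with.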

Steps:
1. Take the step deviations of the variable x and denote these
deviations by u.
2. Take the step deviations of the variable y and denote these
deviations by v.
3. Multiply uv by the respective frequency of each cell and
write the figure obtained in the right-hand bottom corner of
each cell.
4. Add all the cornered values calculated in step 3 and obtain
the total $\sum fuv$.
5. Multiply the frequencies of the variable x by the deviations
of x and obtain the total $\sum fu$.
6. Take the squares of the step deviations of the variable x and
multiply them by the respective frequencies and obtain
$\sum fu^2$.
Similarly get $\sum fv$ and $\sum fv^2$. Then substitute these values in the
formula for r given above and get the value of r.
Example 4:
The following are the marks obtained by 132 students in two tests.

Test-2 \ Test-1   30-40   40-50   50-60   60-70   70-80   Total
20-30               2       5       3       -       -       10
30-40               1       8      12       6       -       27
40-50               -       5      22      14       1       42
50-60               -       2      16       9       2       29
60-70               -       1       8       6       1       16
70-80               -       -       2       4       2        8
Total               3      21      63      39       6      132

Calculate the correlation coefficient.
Let x denote Test 1 marks.
Let y denote Test 2 marks.

$$u = \frac{x - 55}{10}, \qquad v = \frac{y - 45}{10}$$

In the table below each cell shows the frequency f, with the product
fuv in brackets.

mid y \ mid x     35       45        55        65        75       f     v    fv   fv^2   fuv
25              2 (8)    5 (10)    3 (0)       -         -       10    -2   -20    40     18
35              1 (2)    8 (8)    12 (0)    6 (-6)       -       27    -1   -27    27      4
45                -      5 (0)    22 (0)   14 (0)     1 (0)      42     0     0     0      0
55                -      2 (-2)   16 (0)    9 (9)     2 (4)      29     1    29    29     11
65                -      1 (-2)    8 (0)    6 (12)    1 (4)      16     2    32    64     14
75                -        -       2 (0)    4 (12)    2 (12)      8     3    24    72     24
f                 3       21       63        39         6       132         38   232     71
u                -2       -1        0         1         2
fu               -6      -21        0        39        12        24
fu^2             12       21        0        39        24        96
fuv              10       14        0        27        20        71

$$r = \frac{N \sum fuv - (\sum fu)(\sum fv)}{\sqrt{[\,N \sum fu^2 - (\sum fu)^2\,] \cdot [\,N \sum fv^2 - (\sum fv)^2\,]}}$$

$$= \frac{132 \times 71 - 24 \times 38}{\sqrt{[132 \times 96 - (24)^2]\,[132 \times 232 - (38)^2]}}$$

$$= \frac{9372 - 912}{\sqrt{(12672 - 576)(30624 - 1444)}}$$

$$= \frac{8460}{109.96 \times 170.82} = \frac{8460}{18786.78} = 0.4503$$
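The grouped-data procedure lends itself to a short script. The sketch below applies the step-deviation formula to the two-way table of Example 4; the placement of the cell frequencies is an assumption, inferred from the row and column totals, and the function name `grouped_r` is hypothetical.

```python
from math import sqrt


def grouped_r(table, u, v):
    """Correlation from a two-way frequency table.

    table[i][j] is the frequency in row i (step deviation v[i])
    and column j (step deviation u[j]).
    """
    n_total = sum(sum(row) for row in table)
    sum_fu = sum_fv = sum_fu2 = sum_fv2 = sum_fuv = 0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            sum_fu += f * u[j]
            sum_fv += f * v[i]
            sum_fu2 += f * u[j] ** 2
            sum_fv2 += f * v[i] ** 2
            sum_fuv += f * u[j] * v[i]
    numerator = n_total * sum_fuv - sum_fu * sum_fv
    denominator = sqrt((n_total * sum_fu2 - sum_fu ** 2)
                       * (n_total * sum_fv2 - sum_fv ** 2))
    return numerator / denominator


# Rows: Test 2 classes 20-30 ... 70-80; columns: Test 1 classes 30-40 ... 70-80.
# Cell positions are assumed (reconstructed from the marginal totals).
table = [
    [2, 5, 3, 0, 0],
    [1, 8, 12, 6, 0],
    [0, 5, 22, 14, 1],
    [0, 2, 16, 9, 2],
    [0, 1, 8, 6, 1],
    [0, 0, 2, 4, 2],
]
u = [-2, -1, 0, 1, 2]        # (mid x - 55) / 10
v = [-2, -1, 0, 1, 2, 3]     # (mid y - 45) / 10

r = grouped_r(table, u, v)
print(round(r, 4))
```

Looping over the cells replaces the manual bookkeeping of the "fuv in the corner of each cell" step.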


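Example 5:
Calculate Karl Pearson's coefficient of correlation from the data
given below:

Marks \ Age in years    18    19    20    21    22
0-5                      -     -     -     3     1
5-10                     -     -     -     3     2
10-15                    -     -     7    10     -
15-20                    -     5     4     -     -
20-25                    3     2     -     -     -

Let x denote marks and y denote age.

$$u = \frac{x - 12.5}{5}, \qquad v = \frac{y - 20}{1}$$

In the table below each cell shows the frequency f, with the product
fuv in brackets.

mid x \ y      18        19        20       21        22       f     u    fu   fu^2   fuv
2.5             -         -         -     3 (-6)    1 (-4)      4    -2    -8    16    -10
7.5             -         -         -     3 (-3)    2 (-4)      5    -1    -5     5     -7
12.5            -         -       7 (0)  10 (0)       -        17     0     0     0      0
17.5            -       5 (-5)    4 (0)     -         -         9     1     9     9     -5
22.5         3 (-12)    2 (-4)      -       -         -         5     2    10    20    -16
f               3         7        11      16         3        40          6    50    -38
v              -2        -1         0       1         2
fv             -6        -7         0      16         6         9
fv^2           12         7         0      16        12        47
fuv           -12        -9         0      -9        -8       -38

$$r = \frac{N \sum fuv - (\sum fu)(\sum fv)}{\sqrt{[\,N \sum fu^2 - (\sum fu)^2\,] \cdot [\,N \sum fv^2 - (\sum fv)^2\,]}}$$

$$= \frac{40 \times (-38) - 6 \times 9}{\sqrt{[40 \times 50 - 6^2]\,[40 \times 47 - 9^2]}}$$

$$= \frac{-1520 - 54}{\sqrt{(2000 - 36)(1880 - 81)}} = \frac{-1574}{\sqrt{1964 \times 1799}} = -0.8373$$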

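Properties of Correlation:
Property 1: The correlation coefficient lies between -1 and +1,
i.e. $-1 \le r \le +1$.

Let $x' = \dfrac{x - \bar{x}}{\sigma_x}$ and $y' = \dfrac{y - \bar{y}}{\sigma_y}$.

Since $\sum (x' + y')^2$, being a sum of squares, is always non-negative,

$$\sum (x' + y')^2 \ge 0$$

$$\sum x'^2 + \sum y'^2 + 2 \sum x' y' \ge 0$$

$$\sum \left(\frac{x - \bar{x}}{\sigma_x}\right)^2 + \sum \left(\frac{y - \bar{y}}{\sigma_y}\right)^2 + 2 \sum \left(\frac{x - \bar{x}}{\sigma_x}\right)\left(\frac{y - \bar{y}}{\sigma_y}\right) \ge 0$$

$$\frac{\sum (x - \bar{x})^2}{\sigma_x^2} + \frac{\sum (y - \bar{y})^2}{\sigma_y^2} + \frac{2 \sum (x - \bar{x})(y - \bar{y})}{\sigma_x \, \sigma_y} \ge 0$$

Dividing by n we get

$$\frac{1}{\sigma_x^2} \cdot \frac{1}{n} \sum (x - \bar{x})^2 + \frac{1}{\sigma_y^2} \cdot \frac{1}{n} \sum (y - \bar{y})^2 + \frac{2}{\sigma_x \, \sigma_y} \cdot \frac{1}{n} \sum (x - \bar{x})(y - \bar{y}) \ge 0$$

$$\frac{1}{\sigma_x^2} \cdot \sigma_x^2 + \frac{1}{\sigma_y^2} \cdot \sigma_y^2 + \frac{2}{\sigma_x \, \sigma_y} \cdot \mathrm{cov}(x, y) \ge 0$$

$$1 + 1 + 2r \ge 0$$
$$2 + 2r \ge 0$$
$$2(1 + r) \ge 0$$
$$(1 + r) \ge 0$$
$$-1 \le r \qquad \text{-------------(1)}$$

Similarly, $\sum (x' - y')^2 \ge 0$ gives

$$2(1 - r) \ge 0$$
$$1 - r \ge 0$$
$$r \le +1 \qquad \text{--------------(2)}$$

(1) and (2) together give $-1 \le r \le 1$.
Note: r = +1 indicates perfect +ve correlation and
r = -1 indicates perfect -ve correlation between the variables.

Property 2: r is independent of change of origin and scale.
Property 3: It is a pure number independent of units of
measurement.
Property 4: Independent variables are uncorrelated but the
converse is not true.
Property 5: Correlation coefficient is the geometric mean of two
regression coefficients.
Property 6: The correlation coefficient of x and y is symmetric:
$r_{xy} = r_{yx}$.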

Limitations:
1. The correlation coefficient assumes a linear relationship,
regardless of whether that assumption is correct or not.
2. The correlation coefficient is unduly affected by extreme
items of the variables.
3. Existence of correlation does not necessarily indicate a
cause-effect relation.
Interpretation:
The following rules help in interpreting the value of r.
1. When r = 1, there is perfect +ve relationship between the
variables.
2. When r = -1, there is perfect -ve relationship between the
variables.
3. When r = 0, there is no linear relationship between the
variables.
4. If the correlation is close to +1 or -1, it signifies that there
is a high degree of correlation (+ve or -ve) between the two
variables. If r is near to zero, i.e. 0.1, -0.1 or 0.2, there is
little correlation.

Rank Correlation:
It is studied when no assumption about the parameters of
the population is made. This method is based on ranks. It is useful
for studying qualitative attributes like honesty, colour,
beauty, intelligence, character, morality etc. The individuals in
the group can be arranged in order, thereby obtaining for each
individual a number showing his/her rank in the group. This
method was developed by Charles Edward Spearman in 1904. It is
defined as

$$r = 1 - \frac{6 \sum D^2}{n^3 - n}$$

where r = rank correlation coefficient,
$\sum D^2$ = sum of squares of differences between the pairs of ranks,
n = number of pairs of observations.
Note: Some authors use the symbol $\rho$ for rank correlation.
The value of r lies between -1 and +1. If r = +1, there is
complete agreement in order of ranks and the direction of ranks is
also the same. If r = -1, then there is complete disagreement in
order of ranks and they are in opposite directions.
Computation for tied observations: There may be two or more
items having equal values. In such a case the same rank is to be
given and the ranking is said to be tied. In such circumstances an
average rank is given to each individual item. For example, if
a value is repeated twice at the 5th rank, the common rank to
be assigned to each item is $\dfrac{5 + 6}{2} = 5.5$, which is the
average of 5 and 6, and 5.5 appears twice.
If the ranks are tied, it is required to apply a correction
factor, which is $\dfrac{1}{12}(m^3 - m)$. A slightly different
formula is used when there is more than one item having the same
value. The formula is

$$r = 1 - \frac{6\left[\sum D^2 + \dfrac{1}{12}(m^3 - m) + \dfrac{1}{12}(m^3 - m) + \dots\right]}{n^3 - n}$$

where m is the number of items whose ranks are common, and the
correction term should be repeated as many times as there are
groups of tied observations.
Example 6:
In a marketing survey the prices of tea and coffee in a town, based
on quality, were found as shown below. Could you find any relation
between the tea and coffee prices?

Price of tea:      88    90    95    70    60    75    50
Price of coffee:  120   134   150   115   110   140   100

Price of tea   Rank   Price of coffee   Rank    D    D^2
     88          3          120           4     1     1
     90          2          134           3     1     1
     95          1          150           1     0     0
     70          5          115           5     0     0
     60          6          110           6     0     0
     75          4          140           2     2     4
     50          7          100           7     0     0
                                               Sum D^2 = 6

$$r = 1 - \frac{6 \sum D^2}{n^3 - n} = 1 - \frac{6 \times 6}{7^3 - 7} = 1 - \frac{36}{336} = 1 - 0.1071 = 0.8929$$

The relation between price of tea and coffee is positive at
0.89. Based on quality, the association between price of tea and
price of coffee is highly positive.
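Ranking and the rank-difference formula are easy to automate when there are no ties. The following sketch reproduces Example 6; the helper names `ranks_desc` and `spearman` are illustrative, and the ranking helper assumes all values are distinct.

```python
def ranks_desc(values):
    """Rank 1 = largest value. Assumes there are no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman(xs, ys):
    """Rank correlation: 1 - 6 * sum(D^2) / (n^3 - n)."""
    n = len(xs)
    rank_x = ranks_desc(xs)
    rank_y = ranks_desc(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


tea = [88, 90, 95, 70, 60, 75, 50]
coffee = [120, 134, 150, 115, 110, 140, 100]

rho = spearman(tea, coffee)
print(ranks_desc(tea), ranks_desc(coffee), round(rho, 4))
```

The printed ranks match the two rank columns of the worked table, and the coefficient rounds to 0.8929.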

Example 7:
In an evaluation of answer scripts the following marks are awarded
by the examiners.

1st:   88   95   70   60   50   80   75   85
2nd:   84   90   88   55   48   85   82   72

Do you agree the evaluation by the two examiners is fair?

    x    R1     y    R2    D    D^2
   88     2    84     4    2     4
   95     1    90     1    0     0
   70     6    88     2    4    16
   60     7    55     7    0     0
   50     8    48     8    0     0
   80     4    85     3    1     1
   75     5    82     5    0     0
   85     3    72     6    3     9
                          Total  30

$$r = 1 - \frac{6 \sum D^2}{n^3 - n} = 1 - \frac{6 \times 30}{8^3 - 8} = 1 - \frac{180}{504} = 1 - 0.357 = 0.643$$

r = 0.643 shows that the awarding of marks is fair, in the sense
that there is reasonable uniformity between the two examiners in
evaluating the answer scripts.
Example 8:
Rank Correlation for tied observations. Following are the marks
obtained by 10 students in a class in two tests.

Students    A    B    C    D    E    F    G    H    I    J
Test 1     70   68   67   55   60   60   75   63   60   72
Test 2     65   65   80   60   68   58   75   62   60   70

Calculate the rank correlation coefficient between the marks of two tests.

Student   Test 1    R1    Test 2     R2       D       D^2
   A        70       3      65       5.5    -2.5     6.25
   B        68       4      65       5.5    -1.5     2.25
   C        67       5      80       1.0     4.0    16.00
   D        55      10      60       8.5     1.5     2.25
   E        60       8      68       4.0     4.0    16.00
   F        60       8      58      10.0    -2.0     4.00
   G        75       1      75       2.0    -1.0     1.00
   H        63       6      62       7.0    -1.0     1.00
   I        60       8      60       8.5    -0.5     0.25
   J        72       2      70       3.0    -1.0     1.00
                                            Total   50.00

60 is repeated 3 times in test 1.
60 and 65 are each repeated twice in test 2.
m = 3; m = 2; m = 2

$$r = 1 - \frac{6\left[\sum D^2 + \dfrac{1}{12}(m^3 - m) + \dfrac{1}{12}(m^3 - m) + \dfrac{1}{12}(m^3 - m)\right]}{n^3 - n}$$

$$= 1 - \frac{6\left[50 + \dfrac{1}{12}(3^3 - 3) + \dfrac{1}{12}(2^3 - 2) + \dfrac{1}{12}(2^3 - 2)\right]}{10^3 - 10}$$

$$= 1 - \frac{6\,[50 + 2 + 0.5 + 0.5]}{990} = 1 - \frac{6 \times 53}{990} = \frac{672}{990} = 0.68$$

Interpretation: There is uniformity in the performance of students
in the two tests.
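The tie handling of Example 8 can be sketched as follows. The helper names `average_ranks` and `spearman_tied` are hypothetical; tied values receive the average of the ranks they occupy, and one correction term (m^3 - m)/12 is added for every group of m tied values in either series. The Test 2 mark of student H is taken as 62, as in the worked table.

```python
from collections import Counter


def average_ranks(values):
    """Rank 1 = largest value; tied values share the average rank."""
    ordered = sorted(values, reverse=True)
    counts = Counter(values)
    first_position = {}
    for position, value in enumerate(ordered, start=1):
        first_position.setdefault(value, position)
    # A group of m ties starting at position p occupies p .. p+m-1,
    # whose average is p + (m - 1) / 2.
    return [first_position[v] + (counts[v] - 1) / 2 for v in values]


def spearman_tied(xs, ys):
    """Rank correlation with the correction factor for tied ranks."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    correction = sum(
        (m ** 3 - m) / 12
        for series in (xs, ys)
        for m in Counter(series).values()
        if m > 1
    )
    return 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)


test1 = [70, 68, 67, 55, 60, 60, 75, 63, 60, 72]
test2 = [65, 65, 80, 60, 68, 58, 75, 62, 60, 70]

rho = spearman_tied(test1, test2)
print(average_ranks(test1), average_ranks(test2), round(rho, 2))
```

Here the three tied marks of 60 in Test 1 share rank 8, and the tied pairs in Test 2 share ranks 5.5 and 8.5, as in the hand calculation.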
Exercise 8
I. Choose the correct answer:
1. Limits for correlation coefficient.
(a) $-1 \le r \le 1$   (b) $0 \le r \le 1$
(c) $-1 \le r \le 0$   (d) $1 \le r \le 2$
2. The coefficient of correlation
(a) cannot be negative   (b) cannot be positive
(c) always positive   (d) can either be positive or negative
3. The product moment correlation coefficient is obtained by
(a) $r = \dfrac{\sum XY}{\sigma_x \sigma_y}$   (b) $r = \dfrac{\sum XY}{n \, \sigma_x \sigma_y}$
(c) $r = \dfrac{\sum XY}{n \, \sigma_x}$   (d) none of these
4. If cov(x,y) = 0 then
(a) x and y are correlated   (b) x and y are uncorrelated
(c) none   (d) x and y are linearly related
5. If r = 0 then cov(x,y) is
(a) 0   (b) -1   (c) 1   (d) 0.2
6. Rank correlation coefficient is given by
(a) $1 + \dfrac{6 \sum D^2}{n^3 - n}$   (b) $1 - \dfrac{6 \sum D^2}{n^2 - n}$
(c) $1 - \dfrac{6 \sum D^2}{n^3 - n}$   (d) $1 - \dfrac{6 \sum D^2}{n^3 + n}$
7. If cov(x,y) = $\sigma_x \sigma_y$ then
(a) r = +1   (b) r = 0   (c) r = 2   (d) r = -1
8. If $\sum D^2 = 0$, rank correlation is
(a) 0   (b) 1   (c) 0.5   (d) -1
9. Correlation coefficient is independent of change of
(a) Origin   (b) Scale   (c) Origin and Scale   (d) None
10. Rank correlation was found by
(a) Pearson   (b) Spearman   (c) Galton   (d) Fisher
II. Fill in the blanks:
11 Correlation coefficient is free from _________.
12 The diagrammatic representation of two variables
is called _________
13 The relationship between three or more variables is studied
with the help of _________ correlation.
14 Product moment correlation was found by _________
15 When r = +1, there is _________ correlation.
16 If rxy = ryx, correlation between x and y is _________
17 Rank Correlation is useful to study ______characteristics.
18 The nature of correlation for shoe size and IQ is _________
III. Answer the following :
19 What is correlation?
20 Distinguish between positive and negative correlation.
21 Define Karl Pearson s coefficient of correlation. Interpret r,
when r = 1, -1 and 0.
22 What is a scatter diagram? How is it useful in the study of
Correlation?
23. Distinguish between linear and non-linear correlation.
24. Mention important properties of correlation coefficient.
25. Prove that correlation coefficient lies between -1 and +1.
26. Show that correlation coefficient is independent of change of
origin and scale.
27. What is Rank correlation? What are its merits and demerits?
28. Explain different types of correlation with examples.
29. Distinguish between Karl Pearson's coefficient of correlation
and Spearman's correlation coefficient.
30. For 10 observations $\sum x = 130$; $\sum y = 220$; $\sum x^2 = 2290$;
$\sum y^2 = 5510$; $\sum xy = 3467$. Find r.
31. Cov(x,y) = 18.6; var(x) = 20.2; var(y) = 23.7. Find r.
32. Given that r = 0.42, cov(x,y) = 10.5 and v(x) = 16, find the
standard deviation of y.
33. Rank correlation coefficient r = 0.8, $\sum D^2 = 33$. Find n.

Karl Pearson Correlation:

34. Compute the coefficient of correlation of the following scores of
A and B.
A:   5   10    5   11   12    4    3    2    7    1
B:   1    6    2    8    5    1    4    6    5    2
35. Calculate coefficient of correlation between price and supply.
Interpret the value of correlation coefficient.
Price:    8   10   15   17   20   22   24   25
Supply:  25   30   32   35   37   40   42   45
36. Find out Karl Pearson's coefficient of correlation in the
following series relating to prices and supply of a commodity.
Price (Rs.):   11   12   13   14   15   16   17   18   19   20
Supply (Rs.):  30   29   29   25   24   24   24   21   18   15
37. Find the correlation coefficient between the marks obtained by
ten students in economics and statistics.
Marks (in economics):   70   68   67   55   60   60   75   63   60   72
Marks (in statistics):  65   65   80   60   68   58   75   62   60   70
38. Compute the coefficient of correlation from the following data.
Age of workers:   40    34    22    28    36    32    24    46    26    30
Days absent:      2.5   3     5     4     2.5   3     4.5   2.5   4     3.5
39. Find out correlation coefficient between height of father and
son from the following data.
Height of father:   65   66   67   67   68   69   70   72
Height of son:      67   68   65   68   72   72   69   71
BI-VARIATE CORRELATION:
40. Calculate Karl Pearson s coefficient of correlation.for the
following data.
Class
0 1 2 3 4 5 6 7 8 Total
Interval
20-29
2 1 2 2 - 1 - 1 1 10
30-39
- 2 - 1 - 2 - 1 2 8
40-49
- 2 - 2 - - 1 - 1 6
50-59
1 - 2 - - - - 1 - 4
60-69
- - - - - 1 - 1 - 2
41. Calculate the coefficient of correlation and comment upon
your result.
Age of wives
Age of
Husband
15-25
25-35
35-45
45-55
55-65
65-75
Total

15-25
1
2
3

25-35
1
12
4
17

35-45
1
10
3
14

45-55
1
6
2
9

214

55-65
1
4
1
6

65-75
2
2
4

Total
2
15
15
10
8
3
53

42. The following table gives class frequency distribution of 45


clerks in a business office according to age and pay. Find
correlation between age and pay if any.
Pay
Age
20-30
30-40
40-50
50-60
60-70
Total

60-70 70-80
4
3
2
5
1
2
1
7
11

80-90
1
2
3
3
1
10

90-100
1
2
5
1
9

100-110
1
2
5
8

Total
8
10
9
11
7
45

43. Find the correlation coefficient between two subjects marks


scored by 60 candidates.
Marks in Statistics
Marks in 5-15 15-25 25-35 35-45
Total
economics
0-10
1
1
2
10-20
3
6
5
1
15
20-30
1
8
9
2
20
30-40
3
9
3
15
40-50
4
4
8
Total
5
18
27
10
60
44. Compute the correlation coefficient for the following data.
Advertisement Expenditure( 000)
Sales
Revenue 5-15 15-25 25-35 35-45
Total
(Rs. 000)
75-125
4
1
5
125-175
7
6
2
1
16
175-225
1
3
4
2
10
225-275
1
1
3
4
9
Total
13
11
9
7
40

215

45. The following table gives the no. of students having different
heights and weights. Do you find any relation between height
and weight.
Weights in Kg
Height in
cms
150-155
155-160
160-165
165-170
Total

55-60
1
2
1
4

60-65
3
4
5
3
15

65-70
7
10
12
8
37

70-75
5
7
10
6
28

75-80
2
4
7
3
16

Total
18
27
35
20
100

RANK CORRELATION:
46. Two judges gave the following ranks to eight competitors in a
beauty contest. Examine the relationship between their
judgements.
Judge A:   4   5   1   2   3   6   7   8
Judge B:   8   6   2   3   1   4   5   7
47. From the following data, calculate the coefficient of rank
correlation.
X:   36   56   20   65   42   33   44   50   15   60
Y:   50   35   70   25   58   75   60   45   80   38
48. Calculate Spearman's coefficient of rank correlation for the
following data.
X:   53   98   95   81   75   71   59   55
Y:   47   25   32   37   30   40   39   45
49. Apply Spearman's rank difference method and calculate the
coefficient of correlation between x and y from the data given
below.
X:   22   28   31   23   29   31   27   22   31   18
Y:   18   25   25   37   31   35   31   29   18   20
50. Find the rank correlation coefficient.
Marks in Test I:    70   68   67   55   60   60   75   63   60   72
Marks in Test II:   65   65   80   60   68   58   75   62   60   70
51. Calculate Spearman's rank correlation coefficient for the
following table of marks of students in two subjects.
First subject:    80   64   54   49   48   35   32   29   20   18   15   10
Second subject:   36   38   39   41   27   43   45   52   51   42   40   52
IV. Suggested Activities
Select any ten students from your class and find their heights
and weights. Find the correlation between their heights and
weights
Answers:
I.
1. (a)   2. (d)   3. (b)   4. (b)   5. (a)
6. (c)   7. (a)   8. (b)   9. (c)   10. (b)
II.
11. Units   12. Scatter diagram   13. Multiple
14. Pearson   15. Positive perfect   16. Symmetric
17. Qualitative   18. No correlation
III.
30. r = 0.9574   31. r = 0.85   32. $\sigma_y$ = 6.25
33. n = 10   34. r = +0.58   35. r = +0.98
36. r = -0.96   37. r = +0.68   38. r = -0.92
39. r = +0.64   40. r = +0.1   41. r = +0.98
42. r = +0.746   43. r = +0.533   44. r = +0.596
45. r = +0.0945   46. r = +0.62   47. r = -0.93
48. r = -0.905   49. r = 0.34   50. r = 0.679
51. r = 0.685

9. REGRESSION
9.1 Introduction:
After knowing the relationship between two variables we
may be interested in estimating (predicting) the value of one
variable given the value of another. The variable predicted on the
basis of other variables is called the dependent or the explained
variable and the other the independent or the predicting variable.
The prediction is based on average relationship derived statistically
by regression analysis. The equation, linear or otherwise, is called
the regression equation or the explaining equation.
For example, if we know that advertising and sales are
correlated we may find out expected amount of sales for a given
advertising expenditure or the required amount of expenditure for
attaining a given amount of sales.
The relationship between two variables can be considered
between, say, rainfall and agricultural production, price of an input
and the overall cost of product, consumer expenditure and
disposable income. Thus, regression analysis reveals average
relationship between two variables and this makes possible
estimation or prediction.
9.1.1 Definition:
Regression is the measure of the average relationship
between two or more variables in terms of the original units of the
data.
9.2 Types Of Regression:
The regression analysis can be classified into:
a) Simple and Multiple
b) Linear and Non Linear
c) Total and Partial
a) Simple and Multiple:
In the case of a simple relationship only two variables are
considered, for example, the influence of advertising expenditure
on sales turnover. In the case of a multiple relationship, more
than two variables are involved. Here, while one variable is the
dependent variable, the remaining variables are independent ones.
For example, the turnover (y) may depend on advertising
expenditure (x) and the income of the people (z). Then the
functional relationship can be expressed as y = f (x,z).
b) Linear and Non-linear:
Linear relationships are based on a straight-line trend, the
equation of which has no power higher than one. Remember that a
linear relationship can be either simple or multiple. Normally a
linear relationship is taken into account because, besides its
simplicity, it has a better predictive value: a linear trend can be
easily projected into the future. In the case of a non-linear
relationship, curved trend lines are derived. The equations of
these are, for example, parabolic.
c) Total and Partial:
In the case of total relationships all the important variables
are considered. Normally, they take the form of multiple
relationships because most economic and business phenomena are
affected by a multiplicity of causes. In the case of a partial
relationship one or more variables are considered, but not all, thus
excluding the influence of those not found relevant for a given
purpose.
9.3

Linear Regression Equation:


If two variables have a linear relationship then, as the
independent variable (X) changes, the dependent variable (Y) also
changes. If the different values of X and Y are plotted, then two
straight lines of best fit can be made to pass through the plotted
points. These two lines are known as regression lines. These
regression lines are based on two equations known as regression
equations. These equations show the best estimate of one variable
for a known value of the other. The equations are linear.
Linear regression equation of Y on X is
Y = a + bX .....(1)
And X on Y is
X = a + bY .....(2)
where a, b are constants (in general they take different values in
the two equations).
From (1) we can estimate Y for a known value of X.
From (2) we can estimate X for a known value of Y.
9.3.1 Regression Lines:
For regression analysis of two variables there are two
regression lines, namely Y on X and X on Y. The two regression
lines show the average relationship between the two variables.
For perfect correlation, positive or negative, i.e. r = ±1,
the two lines coincide, i.e. we will find only one straight line.
If r = 0, i.e. the variables are uncorrelated, then the two lines
will cut each other at right angles. In this case the two lines
will be parallel to the X and Y axes.
[Figure: regression lines coinciding for r = -1 and for r = +1]
Lastly, the two lines intersect at the point of means of X and
Y. From this point of intersection, if a perpendicular is drawn to
the X-axis, it will touch at the mean value of X. Similarly, a
perpendicular drawn from the point of intersection of the two
regression lines to the Y-axis will touch at the mean value of Y.
[Figure: regression lines for r = 0, intersecting at right angles at the point ($\bar{x}$, $\bar{y}$)]

9.3.2 Principle of Least Squares :


Regression shows an average relationship between two
variables, which is expressed by a line of regression drawn by the
method of least squares. This line of regression can be derived
graphically or algebraically. Before we discuss the various methods
let us understand the meaning of least squares.
A line fitted by the method of least squares is known as the
line of best fit. The line adapts to the following rules:
(i) The algebraic sum of the deviations of the individual observations from the regression line is zero, i.e.,
$\sum (X - X_c) = 0$ or $\sum (Y - Y_c) = 0$
where $X_c$ and $Y_c$ are the values obtained by regression analysis.
(ii) The sum of the squares of these deviations is less than the sum of squares of deviations from any other line, i.e.,
$\sum (Y - Y_c)^2 < \sum (Y - A_i)^2$
where $A_i$ = corresponding values of any other straight line.
(iii) The lines of regression (best fit) intersect at the mean values of the variables X and Y, i.e., the intersecting point is $(\bar{X}, \bar{Y})$.
9.4 Methods of Regression Analysis:
The various methods can be represented in the form of chart
given below:
Regression methods
  1. Graphic (through regression lines)
       - Scatter diagram
  2. Algebraic (through regression equations)
       - Regression equations (through normal equations)
       - Regression equations (through regression coefficients)

9.4.1 Graphic Method:


Scatter Diagram:
Under this method the points are plotted on graph paper,
each representing a pair of values of the concerned variables.
These points give a picture of a scatter diagram with several points
spread over. A regression line may be drawn in between these
points either by free hand or by a scale rule in such a way that the
sum of the squares of the vertical or the horizontal distances (as the
case may be) between the points and the line of regression so drawn is the
least. In other words, it should be drawn faithfully as the line of
best fit, leaving an equal number of points on both sides in such a
manner that the sum of the squares of the distances is the least.
9.4.2 Algebraic Methods:
(i) Regression Equations:
The two regression equations are
for X on Y: X = a + bY
and for Y on X: Y = a + bX
where X, Y are variables, and a, b are constants whose
values are to be determined.
For the equation X = a + bY, the normal equations are
$\sum X = na + b\sum Y$ and
$\sum XY = a\sum Y + b\sum Y^2$
For the equation Y = a + bX, the normal equations are
$\sum Y = na + b\sum X$ and
$\sum XY = a\sum X + b\sum X^2$
From these normal equations the values of a and b can be
determined.
Example 1:
Find the two regression equations from the following data:
X:   6    2   10    4    8
Y:   9   11    5    8    7

Solution:
          X      Y     X^2    Y^2     XY
          6      9      36     81     54
          2     11       4    121     22
         10      5     100     25     50
          4      8      16     64     32
          8      7      64     49     56
Total    30     40     220    340    214

Regression equation of Y on X is Y = a + bX and the
normal equations are
$\sum Y = na + b\sum X$
$\sum XY = a\sum X + b\sum X^2$
Substituting the values, we get
40 = 5a + 30b          ... (1)
214 = 30a + 220b       ... (2)
Multiplying (1) by 6
240 = 30a + 180b       ... (3)
(2) - (3) gives
-26 = 40b
or b = -26/40 = -0.65
Now, substituting the value of b in equation (1)
40 = 5a - 19.5
5a = 59.5
a = 59.5/5 = 11.9
Hence, the required regression line of Y on X is Y = 11.9 - 0.65X.
Again, the regression equation of X on Y is
X = a + bY and
the normal equations are
$\sum X = na + b\sum Y$ and
$\sum XY = a\sum Y + b\sum Y^2$
Now, substituting the corresponding values from the above table,
we get
30 = 5a + 40b          ... (4)
214 = 40a + 340b       ... (5)
Multiplying (4) by 8, we get
240 = 40a + 320b       ... (6)
(5) - (6) gives
-26 = 20b
b = -26/20 = -1.3
Substituting b = -1.3 in equation (4) gives
30 = 5a - 52
5a = 82
a = 82/5 = 16.4
Hence, the required regression line of X on Y is
X = 16.4 - 1.3Y
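The elimination above can be checked numerically. The following is a minimal sketch, not part of the original text: the helper name `fit_line` is hypothetical, and it solves the two normal equations in closed form for the data of Example 1.

```python
def fit_line(xs, ys):
    """Solve the normal equations  sum(y) = n*a + b*sum(x)  and
    sum(x*y) = a*sum(x) + b*sum(x^2)  for the line y = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # eliminate a
    a = (sy - b * sx) / n                          # back-substitute
    return a, b

X = [6, 2, 10, 4, 8]
Y = [9, 11, 5, 8, 7]

a_yx, b_yx = fit_line(X, Y)  # regression of Y on X
a_xy, b_xy = fit_line(Y, X)  # regression of X on Y (roles swapped)

print(f"Y on X: Y = {a_yx:.2f} + ({b_yx:.2f}) X")
print(f"X on Y: X = {a_xy:.2f} + ({b_xy:.2f}) Y")
```

Swapping the roles of the two lists is all that is needed to obtain the second line, which mirrors the way the two sets of normal equations differ only in which variable is treated as dependent.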
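(ii) Regression Coefficients:
The regression equation of Y on X is
$Y_e = \bar{Y} + r\frac{\sigma_y}{\sigma_x}(X - \bar{X})$
Here, the regression coefficient of Y on X is
$b_1 = b_{yx} = r\frac{\sigma_y}{\sigma_x}$
so that $Y_e = \bar{Y} + b_1 (X - \bar{X})$
The regression equation of X on Y is
$X_e = \bar{X} + r\frac{\sigma_x}{\sigma_y}(Y - \bar{Y})$
Here, the regression coefficient of X on Y is
$b_2 = b_{xy} = r\frac{\sigma_x}{\sigma_y}$
so that $X_e = \bar{X} + b_2 (Y - \bar{Y})$
If the deviations are taken from the respective means of X and Y,
$b_1 = b_{yx} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{\sum xy}{\sum x^2}$ and
$b_2 = b_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (Y - \bar{Y})^2} = \frac{\sum xy}{\sum y^2}$
where $x = X - \bar{X}$, $y = Y - \bar{Y}$.
If the deviations are taken from any arbitrary values of X and Y
(short-cut method),
$b_1 = b_{yx} = \frac{n\sum uv - \sum u \sum v}{n\sum u^2 - (\sum u)^2}$
$b_2 = b_{xy} = \frac{n\sum uv - \sum u \sum v}{n\sum v^2 - (\sum v)^2}$
where u = X - A, v = Y - B
A = any value in X
B = any value in Y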
9.5 Properties of Regression Co-efficient:
1. Both regression coefficients must have the same sign, i.e., either
both are positive or both are negative.
2. The correlation coefficient is the geometric mean of the regression
coefficients, i.e., $r = \pm\sqrt{b_1 b_2}$
3. The correlation coefficient will have the same sign as that of the
regression coefficients.
4. If one regression coefficient is greater than unity, then the other
regression coefficient must be less than unity.
5. Regression coefficients are independent of origin but not of
scale.
6. The arithmetic mean of b1 and b2 is equal to or greater than the
coefficient of correlation. Symbolically, $\frac{b_1 + b_2}{2} \geq r$
7. If r = 0, the variables are uncorrelated and the lines of regression
become perpendicular to each other.
8. If r = ±1, the two lines of regression coincide.
9. The angle between the two regression lines is
$\theta = \tan^{-1}\left(\frac{m_1 - m_2}{1 + m_1 m_2}\right)$
where m1 and m2 are the slopes of the regression lines X on Y
and Y on X respectively.
10. The angle between the regression lines indicates the degree of
dependence between the variables.
Example 2:
If the two regression coefficients are b1 = 4/5 and b2 = 9/20, what would be
the value of r?
Solution:
The correlation coefficient, $r = \sqrt{b_1 b_2}$
$= \sqrt{\frac{4}{5} \times \frac{9}{20}} = \sqrt{\frac{36}{100}} = \frac{6}{10} = 0.6$

Example 3:
Given b1 = 15/8 and b2 = 3/5, find r.
Solution:
$r = \sqrt{b_1 b_2} = \sqrt{\frac{15}{8} \times \frac{3}{5}} = \sqrt{\frac{9}{8}} = 1.06$
This is not possible, since r cannot be greater than one. So the given
values are wrong.
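Properties 1 to 3 and the check used in Examples 2 and 3 can be sketched in a few lines. This is an illustrative helper only; the function name `correlation_from_coefficients` is an assumption and does not come from the text.

```python
import math

def correlation_from_coefficients(b_yx, b_xy):
    """Return r = +/- sqrt(b_yx * b_xy), with the sign of the coefficients.
    Raises ValueError when the pair cannot be regression coefficients."""
    if b_yx * b_xy < 0:
        raise ValueError("regression coefficients must have the same sign")
    r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)
    if abs(r) > 1:
        raise ValueError("|r| would exceed 1, so the given values are inconsistent")
    return r

r_example_2 = correlation_from_coefficients(4 / 5, 9 / 20)

try:
    correlation_from_coefficients(15 / 8, 3 / 5)  # Example 3
    example_3_valid = True
except ValueError:
    example_3_valid = False

print(round(r_example_2, 2), example_3_valid)
```

Raising an error rather than returning a value above one keeps an inconsistent pair of coefficients from silently propagating into later calculations.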

9.6 Why are there two regression equations?

The regression equation of Y on X is
$Y_e = \bar{Y} + r\frac{\sigma_y}{\sigma_x}(X - \bar{X})$
(or) $Y_e = \bar{Y} + b_1 (X - \bar{X})$          ... (1)
The regression equation of X on Y is
$X_e = \bar{X} + r\frac{\sigma_x}{\sigma_y}(Y - \bar{Y})$
(or) $X_e = \bar{X} + b_2 (Y - \bar{Y})$          ... (2)
These two regression equations represent two entirely
different lines. In other words, equation (1) is a function of X,
which can be written as Ye = F(X), and equation (2) is a function of
Y, which can be written as Xe = F(Y).
The variables X and Y are not interchangeable. This is mainly
due to the fact that in equation (1) Y is the dependent variable and X is
the independent variable. That is to say, for given values of X
we can find the estimates Ye of Y only from equation (1).
Similarly, the estimates Xe of X for given values of Y can be obtained
only from equation (2).
Example 4:
Compute the two regression equations from the following data.
X:   1   2   3   4   5
Y:   2   3   5   4   6
If X = 2.5, what will be the value of Y?
Solution:
          X     Y    x = X - X̄   y = Y - Ȳ   x^2   y^2    xy
          1     2       -2          -2         4     4     4
          2     3       -1          -1         1     1     1
          3     5        0           1         0     1     0
          4     4        1           0         1     0     0
          5     6        2           2         4     4     4
Total    15    20        0           0        10    10     9

$\bar{X} = \frac{\sum X}{n} = \frac{15}{5} = 3$
$\bar{Y} = \frac{\sum Y}{n} = \frac{20}{5} = 4$
Regression coefficient of Y on X:
$b_{yx} = \frac{\sum xy}{\sum x^2} = \frac{9}{10} = 0.9$
Hence the regression equation of Y on X is
$Y = \bar{Y} + b_{yx}(X - \bar{X})$
= 4 + 0.9 (X - 3)
= 4 + 0.9X - 2.7
= 1.3 + 0.9X
When X = 2.5,
Y = 1.3 + 0.9 × 2.5
= 3.55
Regression coefficient of X on Y:
$b_{xy} = \frac{\sum xy}{\sum y^2} = \frac{9}{10} = 0.9$
So, the regression equation of X on Y is
$X = \bar{X} + b_{xy}(Y - \bar{Y})$
= 3 + 0.9 (Y - 4)
= 3 + 0.9Y - 3.6
= 0.9Y - 0.6
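The deviation method of Example 4 can be reproduced as follows. This is a sketch under the assumption that the deviations are taken from the exact means; the name `regression_by_deviations` is hypothetical.

```python
def regression_by_deviations(xs, ys):
    """Return (x_bar, y_bar, b_yx, b_xy) using deviations from the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    dx = [x - x_bar for x in xs]          # x = X - mean(X)
    dy = [y - y_bar for y in ys]          # y = Y - mean(Y)
    s_xy = sum(p * q for p, q in zip(dx, dy))
    b_yx = s_xy / sum(p * p for p in dx)  # sum(xy) / sum(x^2)
    b_xy = s_xy / sum(q * q for q in dy)  # sum(xy) / sum(y^2)
    return x_bar, y_bar, b_yx, b_xy

x_bar, y_bar, b_yx, b_xy = regression_by_deviations([1, 2, 3, 4, 5], [2, 3, 5, 4, 6])

# Estimate Y at X = 2.5 from the Y-on-X line: Y = y_bar + b_yx * (X - x_bar)
y_at_2_5 = y_bar + b_yx * (2.5 - x_bar)
print(b_yx, b_xy, round(y_at_2_5, 2))
```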
Short-cut method
Example 5:
Obtain the equations of the two lines of regression for the data
given below:
X:   46   42   44   43   41   45   43   40
Y:   40   38   36   35   38   39   37   41

Solution:
Take A = 43 and B = 38 as the arbitrary origins.
          X     Y    u = X - A   v = Y - B   u^2   v^2    uv
         46    40        3           2         9     4     6
         42    38       -1           0         1     0     0
         44    36        1          -2         1     4    -2
         43    35        0          -3         0     9     0
         41    38       -2           0         4     0     0
         45    39        2           1         4     1     2
         43    37        0          -1         0     1     0
         40    41       -3           3         9     9    -9
Total                    0           0        28    28    -3

$\bar{X} = A + \frac{\sum u}{n} = 43 + \frac{0}{8} = 43$
$\bar{Y} = B + \frac{\sum v}{n} = 38 + \frac{0}{8} = 38$
The regression coefficient of Y on X is
$b_1 = b_{yx} = \frac{n\sum uv - \sum u \sum v}{n\sum u^2 - (\sum u)^2} = \frac{8(-3) - (0)(0)}{8(28) - (0)^2} = \frac{-24}{224} = -0.11$
The regression coefficient of X on Y is
$b_2 = b_{xy} = \frac{n\sum uv - \sum u \sum v}{n\sum v^2 - (\sum v)^2} = \frac{8(-3) - (0)(0)}{8(28) - (0)^2} = \frac{-24}{224} = -0.11$
Hence the regression equation of Y on X is
$Y_e = \bar{Y} + b_1 (X - \bar{X})$
= 38 - 0.11 (X - 43)
= 38 - 0.11X + 4.73
= 42.73 - 0.11X
The regression equation of X on Y is
$X_e = \bar{X} + b_2 (Y - \bar{Y})$
= 43 - 0.11 (Y - 38)
= 43 - 0.11Y + 4.18
= 47.18 - 0.11Y
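The short-cut method, and the property that regression coefficients do not depend on the choice of origin, can be illustrated with a small sketch. The function name `shortcut_coefficients` is an assumption, and the data are the X and Y columns of the solution table of Example 5.

```python
def shortcut_coefficients(xs, ys, A, B):
    """Short-cut method: deviations u = X - A, v = Y - B from arbitrary origins."""
    n = len(xs)
    u = [x - A for x in xs]
    v = [y - B for y in ys]
    su, sv = sum(u), sum(v)
    suv = sum(p * q for p, q in zip(u, v))
    suu = sum(p * p for p in u)
    svv = sum(q * q for q in v)
    b_yx = (n * suv - su * sv) / (n * suu - su ** 2)
    b_xy = (n * suv - su * sv) / (n * svv - sv ** 2)
    x_bar = A + su / n
    y_bar = B + sv / n
    return x_bar, y_bar, b_yx, b_xy

X = [46, 42, 44, 43, 41, 45, 43, 40]
Y = [40, 38, 36, 35, 38, 39, 37, 41]

with_assumed_origin = shortcut_coefficients(X, Y, A=43, B=38)
with_zero_origin = shortcut_coefficients(X, Y, A=0, B=0)  # plain raw-score formula

print(with_assumed_origin)
print(with_zero_origin)
```

Both calls give the same means and coefficients; a convenient origin only keeps the intermediate sums small, which matters for hand calculation rather than for the result.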
Example 6:
In a correlation study, the following values are obtained:
            X      Y
Mean       65     67
S.D.      2.5    3.5
Coefficient of correlation = 0.8
Find the two regression equations that are associated with the
above values.
Solution:
Given,
$\bar{X} = 65$, $\bar{Y} = 67$, $\sigma_x = 2.5$, $\sigma_y = 3.5$, r = 0.8
The regression coefficient of Y on X is
$b_{yx} = b_1 = r\frac{\sigma_y}{\sigma_x} = 0.8 \times \frac{3.5}{2.5} = 1.12$
The regression coefficient of X on Y is
$b_{xy} = b_2 = r\frac{\sigma_x}{\sigma_y} = 0.8 \times \frac{2.5}{3.5} = 0.57$
Hence, the regression equation of Y on X is
$Y_e = \bar{Y} + b_1 (X - \bar{X})$
= 67 + 1.12 (X - 65)
= 67 + 1.12X - 72.8
= 1.12X - 5.8
The regression equation of X on Y is
$X_e = \bar{X} + b_2 (Y - \bar{Y})$
= 65 + 0.57 (Y - 67)
= 65 + 0.57Y - 38.19
= 26.81 + 0.57Y
Note:
Suppose we are given two regression equations without being
told which is the regression equation of Y on X and which is X
on Y. To identify them, assume that the first equation is Y on X,
then calculate the regression coefficients byx = b1 and bxy = b2. If
these two satisfy the properties of regression coefficients, our
assumption is correct; otherwise interchange the two equations.

Example 7:
Given 8X - 10Y + 66 = 0 and 40X - 18Y = 214, find the
correlation coefficient, r.
Solution:
Assume that the regression equation of Y on X is
8X - 10Y + 66 = 0.
-10Y = -66 - 8X
10Y = 66 + 8X
$Y = \frac{66}{10} + \frac{8}{10}X$
Now the coefficient attached to X is byx,
i.e., $b_{yx} = \frac{8}{10} = \frac{4}{5}$
The regression equation of X on Y is
40X - 18Y = 214
Keeping X on the left side and moving the other terms to the right side,
i.e., 40X = 214 + 18Y
i.e., $X = \frac{214}{40} + \frac{18}{40}Y$
Now, the coefficient attached to Y is bxy,
i.e., $b_{xy} = \frac{18}{40} = \frac{9}{20}$
Here byx and bxy satisfy the properties of regression
coefficients, so our assumption is correct.
Correlation coefficient, $r = \sqrt{b_{yx} b_{xy}} = \sqrt{\frac{4}{5} \times \frac{9}{20}} = \sqrt{\frac{36}{100}} = \frac{6}{10} = 0.6$

Example 8:
Regression equations of two correlated variables X and Y
are 5X - 6Y + 90 = 0 and 15X - 8Y - 130 = 0. Find the correlation
coefficient.
Solution:
Let 5X - 6Y + 90 = 0 represent the regression equation of X
on Y and the other that of Y on X.
Now $X = \frac{6}{5}Y - \frac{90}{5}$
$b_{xy} = b_2 = \frac{6}{5}$
For 15X - 8Y - 130 = 0,
$Y = \frac{15}{8}X - \frac{130}{8}$
$b_{yx} = b_1 = \frac{15}{8}$
$r = \sqrt{b_1 b_2} = \sqrt{\frac{15}{8} \times \frac{6}{5}} = \sqrt{2.25} = 1.5 > 1$
This is not possible, so our assumption is wrong. So let us take the
first equation as Y on X and the second equation as X on Y.
From the equation 5X - 6Y + 90 = 0,
$Y = \frac{5}{6}X + \frac{90}{6}$
$b_{yx} = \frac{5}{6}$
From the equation 15X - 8Y - 130 = 0,
$X = \frac{8}{15}Y + \frac{130}{15}$
$b_{xy} = \frac{8}{15}$
Correlation coefficient, $r = \sqrt{b_1 b_2} = \sqrt{\frac{5}{6} \times \frac{8}{15}} = \sqrt{\frac{40}{90}} = \frac{2}{3} = 0.67$
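The trial-and-check procedure of Examples 7 and 8 can be sketched as below. Each line is written as aX + bY + c = 0 and only the pair (a, b) is needed; the function name `identify` and this representation are assumptions made for illustration.

```python
from fractions import Fraction
from math import sqrt

def identify(eq1, eq2):
    """Each equation is (a, b) from a*X + b*Y + c = 0.
    Try eq1 as Y-on-X first; swap if the coefficients are inadmissible.
    Returns (b_yx, b_xy, r)."""
    for y_on_x, x_on_y in ((eq1, eq2), (eq2, eq1)):
        b_yx = Fraction(-y_on_x[0], y_on_x[1])  # solve for Y: slope of X
        b_xy = Fraction(-x_on_y[1], x_on_y[0])  # solve for X: slope of Y
        product = b_yx * b_xy
        if 0 <= product <= 1:                   # same sign and |r| <= 1
            r = sqrt(product)
            return b_yx, b_xy, (-r if b_yx < 0 else r)
    raise ValueError("the two lines cannot be a pair of regression lines")

# Example 7: 8X - 10Y + 66 = 0 and 40X - 18Y - 214 = 0
ex7 = identify((8, -10), (40, -18))
# Example 8, deliberately passed in the "wrong" order to force the swap
ex8 = identify((15, -8), (5, -6))

print(ex7)
print(ex8)
```

Using `Fraction` keeps the coefficients exact (4/5, 9/20, 5/6, 8/15), so the admissibility test is not affected by rounding.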
Example 9:
The lines of regression of Y on X and X on Y are,
respectively, Y = X + 5 and 16X = 9Y - 94. Find the variance of X
if the variance of Y is 16. Also find the covariance of X and Y.
Solution:
From the regression line of Y on X,
Y = X + 5
we get $b_{yx} = 1$
From the regression line of X on Y,
16X = 9Y - 94
$X = \frac{9}{16}Y - \frac{94}{16}$
we get $b_{xy} = \frac{9}{16}$
$r = \sqrt{b_1 b_2} = \sqrt{1 \times \frac{9}{16}} = \frac{3}{4}$
Again, $b_{yx} = r\frac{\sigma_y}{\sigma_x}$
i.e., $1 = \frac{3}{4} \times \frac{4}{\sigma_x}$   (since $\sigma_y^2 = 16$, $\sigma_y = 4$)
$\sigma_x = 3$
Variance of X = $\sigma_x^2 = 9$
Again, $b_{yx} = \frac{\mathrm{cov}(x, y)}{\sigma_x^2}$
i.e., $1 = \frac{\mathrm{cov}(x, y)}{9}$
or cov(x, y) = 9.

Example 10:
Is it possible for two regression lines to be as follows:
Y = -1.5X + 7 , X = 0.6Y + 9 ? Give reasons.
Solution:
The regression coefficient of Y on X is b1 = byx = -1.5
The regression coefficient of X on Y is b2 = bxy = 0.6
The two regression coefficients are of different signs, which
contradicts property 1. So the given equations cannot be regression lines.
Example 11:
In the estimation of regression equations of two variables X
and Y the following results were obtained.
$\bar{X} = 90$, $\bar{Y} = 70$, n = 10, $\sum x^2 = 6360$, $\sum y^2 = 2860$,
$\sum xy = 3900$. Obtain the two regression equations.
Solution:
Here, x, y are the deviations from the arithmetic means.
$b_1 = b_{yx} = \frac{\sum xy}{\sum x^2} = \frac{3900}{6360} = 0.61$
$b_2 = b_{xy} = \frac{\sum xy}{\sum y^2} = \frac{3900}{2860} = 1.36$
Regression equation of Y on X is
$Y_e = \bar{Y} + b_1 (X - \bar{X})$
= 70 + 0.61 (X - 90)
= 70 + 0.61X - 54.90
= 15.1 + 0.61X
Regression equation of X on Y is
$X_e = \bar{X} + b_2 (Y - \bar{Y})$
= 90 + 1.36 (Y - 70)
= 90 + 1.36Y - 95.2 = 1.36Y - 5.2
9.7 Uses of Regression Analysis:
1. Regression analysis helps in establishing a functional
relationship between two or more variables.
2. Since most of the problems of economic analysis are based on
cause and effect relationships, the regression analysis is a highly
valuable tool in economic and business research.
3. Regression analysis predicts the values of dependent variables
from the values of independent variables.
4. We can calculate coefficient of correlation ( r) and coefficient of
determination ( r2) with the help of regression coefficients.
5. In statistical analysis of demand curves, supply curves,
production function, cost function, consumption function etc.,
regression analysis is widely used.
9.8 Difference between Correlation and Regression:
1. Correlation: Correlation is the relationship between two or more variables, which vary in sympathy with the other in the same or the opposite direction.
   Regression: Regression means going back and it is a mathematical measure showing the average relationship between two variables.
2. Correlation: Both the variables X and Y are random variables.
   Regression: Here X is a random variable and Y is a fixed variable. Sometimes both the variables may be random variables.
3. Correlation: It finds out the degree of relationship between two variables and not the cause and effect of the variables.
   Regression: It indicates the cause and effect relationship between the variables and establishes a functional relationship.
4. Correlation: It is used for testing and verifying the relation between two variables and gives limited information.
   Regression: Besides verification it is used for the prediction of one value, in relationship to the other given value.
5. Correlation: The coefficient of correlation is a relative measure. The range of relationship lies between -1 and +1.
   Regression: The regression coefficient is an absolute figure. If we know the value of the independent variable, we can find the value of the dependent variable.
6. Correlation: There may be spurious correlation between two variables.
   Regression: In regression there is no such spurious regression.
7. Correlation: It has limited application, because it is confined only to linear relationship between the variables.
   Regression: It has wider application, as it studies linear and nonlinear relationship between the variables.
8. Correlation: It is not very useful for further mathematical treatment.
   Regression: It is widely used for further mathematical treatment.
9. Correlation: If the coefficient of correlation is positive, then the two variables are positively correlated and vice-versa.
   Regression: The regression coefficient explains that the decrease in one variable is associated with the increase in the other variable.

Exercise 9
I. Choose the correct answer:
1. When the correlation coefficient r = +1, then the two regression
lines
a) are perpendicular to each other
b) coincide
c) are parallel to each other
d) none of these
2. If one regression coefficient is greater than unity then the
other must be
a) greater than unity
b) equal to unity
c) less than unity
d) none of these
3. Regression equation is also named as
a) prediction equation
b) estimating equation
c) line of average relationship
d) all the above
4. The lines of regression intersect at the point
a) (X, Y)
b) $(\bar{X}, \bar{Y})$
c) (0, 0)
d) (1, 1)
5. If r = 0, the lines of regression
a) coincide
b) are perpendicular to each other
c) are parallel to each other
d) none of the above
6. Regression coefficient is independent of
a) origin
b) scale
c) both origin and scale
d) neither origin nor scale
7. The geometric mean of the two regression coefficients byx
and bxy is equal to
a) r
b) r2
c) 1
d) r
8. Given the two lines of regression as 3X - 4Y + 8 = 0 and
4X - 3Y = 1, the means of X and Y are
a) $\bar{X}$ = 4, $\bar{Y}$ = 5
b) $\bar{X}$ = 3, $\bar{Y}$ = 4
c) $\bar{X}$ = 2, $\bar{Y}$ = 2
d) $\bar{X}$ = 4/3, $\bar{Y}$ = 5/3
9. If the two lines of regression are X + 2Y - 5 = 0 and
2X + 3Y - 8 = 0, the means of X and Y are
a) $\bar{X}$ = -3, $\bar{Y}$ = 4
b) $\bar{X}$ = 2, $\bar{Y}$ = 4
c) $\bar{X}$ = 1, $\bar{Y}$ = 2
d) $\bar{X}$ = -1, $\bar{Y}$ = 2
10. If byx = -3/2, bxy = -3/2 then the correlation coefficient, r is
a) 3/2
b) -3/2
c) 9/4
d) -9/4
II. Fill in the blanks:
11. The regression analysis measures ________________
between X and Y.
12. The purpose of regression is to study ________ between
variables.
13. If one of the regression coefficients is ________ unity, the other
must be _______ unity.
14. The farther the two regression lines cut each other, the _____
be the degree of correlation.
15. When one regression coefficient is positive, the other would
also be _____.
16. The sign of regression coefficient is ____ as that of correlation
coefficient.
III. Answer the following:
17. Define regression and write down the two regression
equations
18. Describe different types of regression.
19. Explain principle of least squares.
20. Explain (i) graphic method, (ii) Algebraic method.
21. What are regression co-efficient?
22. State the properties of regression coefficients.
23. Why there are two regression equations?
24. What are the uses of regression analysis?
25. Distinguish between correlation and regression.
26. What do you mean by regression line of Y on X and
regression line of X on Y?
27. From the following data, find the regression equation:
$\sum X = 21$, $\sum Y = 20$, $\sum X^2 = 91$, $\sum XY = 74$, n = 7
28. From the following data find the regression equation of Y on
X. If X = 15, find Y.
X:   8   11    7   10   12    5    4    6
Y:  11   30   25   44   38   25   20   27
29. Find the two regression equations from the following data.
X:  25   22   28   26   35   20   22   40   20   18
Y:  18   15   20   17   22   14   16   21   15   14
30. Find S.D.(Y), given that variance of X = 36, bxy = 0.8,
r = 0.5
31. In a correlation study, the following values are obtained:
            X      Y
Mean       68     60
S.D.      2.5    3.5
Coefficient of correlation, r = 0.6. Find the two regression
equations.
32. In a correlation study, the following values are obtained:
            X      Y
Mean       12     15
S.D.        2      3
r = 0.5. Find the two regression equations.
33. The correlation coefficient of bivariate X and Y is r = 0.6,
the variances of X and Y are respectively 2.25 and 4.00, $\bar{X}$ = 10,
$\bar{Y}$ = 20. From the above data, find the two regression lines.
34. For the following lines of regression find the mean values of
X and Y and the two regression coefficients:
8X - 10Y + 66 = 0
40X - 18Y = 214
35. Given $\bar{X}$ = 90, $\bar{Y}$ = 70, bxy = 1.36, byx = 0.61,
find (i) the most probable value of X when Y = 50 and
(ii) the coefficient of correlation between X and Y.
36. You are supplied with the following data:
4X - 5Y + 33 = 0 and 20X - 9Y - 107 = 0
variance of Y = 4. Calculate
(I) Mean values of X and Y
(II) S.D. of X
(III) Correlation coefficient between X and Y.
Answers:
I. 1. b  2. c  3. d  4. b  5. b  6. a  7. a  8. a  9. c  10. b
II. 11. dependence  12. dependence  13. more than, less than
14. lesser  15. positive  16. same
III.
27. Y = 0.498X + 1.366
28. Y = 1.98X + 12.9; Y = 42.6
30. 3.75
31. Y = 2.88 + 0.84X, X = 42.2 + 0.43Y
32. Y = 6 + 0.75X; X = 7 + 0.33Y
33. Y = 0.8X + 12, X = 0.45Y + 1
34. $\bar{X}$ = 13, $\bar{Y}$ = 17, byx = 4/5, bxy = 9/20
35. (i) 62.8, (ii) 0.91
36. $\bar{X}$ = 13, $\bar{Y}$ = 17, S.D.(X) = 1.5, r = 0.6

10. INDEX NUMBERS

10.1 Introduction:
An index number is a statistical device for comparing the
general level of magnitude of a group of related variables in two or
more situations. If we want to compare the price level of 2000 with
what it was in 1990, we shall have to consider a group of variables
such as the price of wheat, rice, vegetables, cloth, house rent, etc. If
the changes are in the same ratio and the same direction, we face no
difficulty in finding the general price level. But in practice the
changes in different variables are different, some upward and some
downward, and the prices are quoted in different units, e.g.,
milk per litre, rice or wheat per kilogram, rent per square foot, etc.
We want one figure to indicate the changes of different
commodities as a whole. This is called an index number. An index
number is a number which indicates the changes in magnitudes.
M. Spiegel says, "An index number is a statistical measure designed
to show changes in a variable or a group of related variables with
respect to time, geographic location or other characteristic". In
general, index numbers are used to measure changes over time in
magnitudes which are not capable of direct measurement.
On the basis of study and analysis of the definition given
above, the following characteristics of index numbers are apparent.
1. Index numbers are specialised averages.
2. Index numbers are expressed in percentage.
3. Index numbers measure changes not capable of direct
measurement.
4. Index numbers are for comparison.
10.2 Uses of Index numbers

Index numbers are indispensable tools of economic and
business analysis. They are particularly useful in measuring relative
changes. Their uses can be appreciated by the following points.
1. They measure the relative change.
2. They allow better comparison.
3. They are good guides.
4. They are economic barometers.
5. They are the pulse of the economy.
6. They serve as wage adjusters.
7. They compare the standard of living.
8. They are a special type of averages.
9. They provide guidelines to policy.
10. They measure the purchasing power of money.
10.3 Types of Index numbers:

There are various types of index numbers, but in brief, we
shall take three kinds and they are
(a) Price Index, (b) Quantity Index and (c) Value Index
(a) Price Index:
For measuring the value of money, in general, the price index is
used. It is an index number which compares the prices for a group
of commodities at a certain time or at a certain place with the prices of a base
period. There are two price index numbers, namely wholesale price
index numbers and retail price index numbers. The wholesale price
index reveals the changes in the general price level of a country, but
the retail price index reveals the changes in the retail price of
commodities such as consumption of goods, bank deposits, etc.
(b) Quantity Index:
Quantity index numbers measure the changes in the volume of
goods produced or consumed. They are useful and helpful in studying
the output in an economy.
(c) Value Index
Value index numbers compare the total value of a certain
period with total value in the base period. Here total value is equal
to the price of commodity multiplied by the quantity consumed.
Notation: For any index number, two time periods are needed for
comparison. These are called the Base period and the Current
period. The period of the year which is used as a basis for
comparison is called the base year and the other is the current year.
The various notations used are as given below:
P1 = Price of current year
P0 = Price of base year
q1 = Quantity of current year
q0 = Quantity of base year
10.4 Problems in the construction of index numbers


No index number is an all purpose index number. Hence,
there are many problems involved in the construction of index
numbers, which are to be tackled by an economist or statistician.
They are
1. Purpose of the index numbers
2. Selection of base period
3. Selection of items
4. Selection of source of data
5. Collection of data
6. Selection of average
7. System of weighting
10.5 Method of construction of index numbers:
Index numbers may be constructed by various methods as
shown below:
INDEX NUMBERS
  Unweighted
    - Simple aggregate index numbers
    - Simple average of price relative
  Weighted
    - Weighted aggregate index number
    - Weighted average of price relative
10.5.1 Simple Aggregate Index Number
This is the simplest method of construction of index
numbers. The prices of the different commodities in the current year
are added and the sum is divided by the sum of the prices of those
commodities in the base year and multiplied by 100. Symbolically,
Simple aggregate price index = $P_{01} = \frac{\sum p_1}{\sum p_0} \times 100$
where $\sum p_1$ = total of prices for the current year
$\sum p_0$ = total of prices for the base year
Example 1:
Calculate index numbers from the following data by simple
aggregate method taking prices of 2000 as base.
Commodity    Price per unit (in Rupees)
               2000      2004
A               80        95
B               50        60
C               90       100
D               30        45
Solution:
Commodity    Price per unit (in Rupees)
             2000 (P0)   2004 (P1)
A               80          95
B               50          60
C               90         100
D               30          45
Total          250         300

Simple aggregate price index = $P_{01} = \frac{\sum p_1}{\sum p_0} \times 100 = \frac{300}{250} \times 100 = 120$
10.5.2 Simple Average Price Relative index:
In this method, first calculate the price relatives for the
various commodities; the average of these relatives is then obtained
by using either the arithmetic mean or the geometric mean. When the arithmetic
mean is used for the average of price relatives, the formula for
computing the index is
Simple average of price relative by arithmetic mean:
$P_{01} = \frac{\sum \left(\frac{p_1}{p_0} \times 100\right)}{n}$
where p1 = prices of current year
p0 = prices of base year
n = number of items or commodities
When the geometric mean is used for the average of price relatives, the
formula for obtaining the index is
Simple average of price relative by geometric mean:
$P_{01} = \mathrm{Antilog}\left[\frac{\sum \log\left(\frac{p_1}{p_0} \times 100\right)}{n}\right]$

Example 2:
From the following data, construct an index for 1998 taking 1997
as base by the average of price relative using (a) arithmetic mean
and (b) geometric mean
Commodity    Price in 1997    Price in 1998
A                 50               70
B                 40               60
C                 80              100
D                 20               30
Solution:
(a) Price relative index number using arithmetic mean
Commodity    Price in 1997 (P0)    Price in 1998 (P1)    (p1/p0) × 100
A                  50                    70                  140
B                  40                    60                  150
C                  80                   100                  125
D                  20                    30                  150
Total                                                        565

Simple average of price relative index = $P_{01} = \frac{\sum \left(\frac{p_1}{p_0} \times 100\right)}{4} = \frac{565}{4} = 141.25$
(b) Price relative index number using geometric mean
Commodity    Price in 1997 (P0)    Price in 1998 (P1)    (p1/p0) × 100    log((p1/p0) × 100)
A                  50                    70                  140              2.1461
B                  40                    60                  150              2.1761
C                  80                   100                  125              2.0969
D                  20                    30                  150              2.1761
Total                                                                         8.5952

Simple average of price relative index
$P_{01} = \mathrm{Antilog}\left[\frac{\sum \log\left(\frac{p_1}{p_0} \times 100\right)}{n}\right] = \mathrm{Antilog}\left[\frac{8.5952}{4}\right]$
= Antilog [2.1488] = 140.9
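Both averages of price relatives can be reproduced without log tables. The sketch below uses base-10 logarithms to mirror the antilog calculation; the function names are hypothetical.

```python
import math

def price_relatives(base_prices, current_prices):
    """Price relative of each commodity: (p1 / p0) * 100."""
    return [100 * p1 / p0 for p0, p1 in zip(base_prices, current_prices)]

def arithmetic_mean_index(relatives):
    return sum(relatives) / len(relatives)

def geometric_mean_index(relatives):
    # Antilog of the mean of the base-10 logarithms
    mean_log = sum(math.log10(r) for r in relatives) / len(relatives)
    return 10 ** mean_log

relatives = price_relatives([50, 40, 80, 20], [70, 60, 100, 30])
am_index = arithmetic_mean_index(relatives)
gm_index = geometric_mean_index(relatives)
print(relatives, am_index, round(gm_index, 1))
```

The geometric mean comes out slightly below the arithmetic mean, as it must whenever the relatives are not all equal.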
10.5.3 Weighted aggregate index numbers
In order to attribute appropriate importance to each of the
items used in an aggregate index number some reasonable weights
must be used. There are various methods of assigning weights and
consequently a large number of formulae for constructing index
numbers have been devised of which some of the most important
ones are
1. Laspeyre's method
2. Paasche's method
3. Fisher's ideal method
4. Bowley's method
5. Marshall-Edgeworth method
6. Kelly's method

1. Laspeyre's method:
The Laspeyre's price index is a weighted aggregate price
index, where the weights are determined by quantities in the base
period. It is given by
Laspeyre's price index = $P_{01}^{L} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$
2. Paasche's method:
The Paasche's price index is a weighted aggregate price
index in which the weights are determined by the quantities in the
current year. The formula for constructing the index is
Paasche's price index number = $P_{01}^{P} = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100$
where
p0 = price for the base year
p1 = price for the current year
q0 = quantity for the base year
q1 = quantity for the current year
3. Fisher's ideal method:
Fisher's price index number is the geometric mean of the
Laspeyre's and Paasche's indices. Symbolically,
Fisher's ideal index number = $P_{01}^{F} = \sqrt{L \times P} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100$
It is known as the ideal index number because
(a) It is based on the geometric mean
(b) It is based on the current year as well as the base year
(c) It conforms to certain tests of consistency
(d) It is free from bias.
4. Bowley's method:
Bowley's price index number is the arithmetic mean of
Laspeyre's and Paasche's indices. Symbolically,
Bowley's price index number = $P_{01}^{B} = \frac{L + P}{2} = \frac{1}{2}\left[\frac{\sum p_1 q_0}{\sum p_0 q_0} + \frac{\sum p_1 q_1}{\sum p_0 q_1}\right] \times 100$

5. Marshall-Edgeworth method:
In this method also, both the current year and base year
prices and quantities are considered. The formula for constructing
the index is
Marshall-Edgeworth price index = $P_{01}^{ME} = \frac{\sum (q_0 + q_1) p_1}{\sum (q_0 + q_1) p_0} \times 100 = \frac{\sum p_1 q_0 + \sum p_1 q_1}{\sum p_0 q_0 + \sum p_0 q_1} \times 100$
6. Kelly's method:
Kelly has suggested the following formula for constructing
the index number:
Kelly's price index number = $P_{01}^{K} = \frac{\sum p_1 q}{\sum p_0 q} \times 100$
where $q = \frac{q_0 + q_1}{2}$
Here the average of the quantities of the two years is used as weights.
Example 3:
Construct price index numbers from the following data by applying
1. Laspeyre's method
2. Paasche's method
3. Fisher's ideal method
Commodity        2000              2001
              Price    Qty      Price    Qty
A               2       8         4       5
B               5      12         6      10
C               4      15         5      12
D               2      18         4      20
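Solution:
Commodity    p0    q0    p1    q1    p0q0    p0q1    p1q0    p1q1
A             2     8     4     5     16      10      32      20
B             5    12     6    10     60      50      72      60
C             4    15     5    12     60      48      75      60
D             2    18     4    20     36      40      72      80
Total                                172     148     251     220

Laspeyre's price index = $P_{01}^{L} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100 = \frac{251}{172} \times 100 = 145.93$
Paasche's price index number = $P_{01}^{P} = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100 = \frac{220}{148} \times 100 = 148.65$
Fisher's ideal index number = $\sqrt{L \times P} = \sqrt{145.93 \times 148.65} = \sqrt{21692.49} = 147.3$
Or
Fisher's ideal index number
$= \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100$
$= \sqrt{\frac{251}{172} \times \frac{220}{148}} \times 100$
$= \sqrt{1.4593 \times 1.4865} \times 100$
$= \sqrt{2.1692} \times 100$
= 1.473 × 100 = 147.3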
Interpretation:
The results can be interpreted as follows:
If 100 rupees were used in the base year to buy the given
commodities, we have to use Rs 145.93 in the current year to buy
the same amount of the commodities as per Laspeyre's
formula. The other values have a similar meaning.
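The three weighted aggregate indices of Example 3 can be computed from the price and quantity lists directly. This is a minimal sketch; the helper names are assumptions made for illustration.

```python
import math

def sum_pq(prices, quantities):
    """Sum of price * quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))

def laspeyres(p0, q0, p1, q1):
    return 100 * sum_pq(p1, q0) / sum_pq(p0, q0)   # base-year weights

def paasche(p0, q0, p1, q1):
    return 100 * sum_pq(p1, q1) / sum_pq(p0, q1)   # current-year weights

def fisher(p0, q0, p1, q1):
    return math.sqrt(laspeyres(p0, q0, p1, q1) * paasche(p0, q0, p1, q1))

p0, q0 = [2, 5, 4, 2], [8, 12, 15, 18]    # year 2000
p1, q1 = [4, 6, 5, 4], [5, 10, 12, 20]    # year 2001

L = laspeyres(p0, q0, p1, q1)
P = paasche(p0, q0, p1, q1)
F = fisher(p0, q0, p1, q1)
print(round(L, 2), round(P, 2), round(F, 2))
```

Fisher's index always lies between the other two, since it is their geometric mean.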
Example 4:
Calculate the index number from the following data by applying
(a) Bowley's price index
(b) Marshall-Edgeworth price index
Commodity        Base year            Current year
              Quantity   Price      Quantity   Price
A                10         3           8         4
B                20        15          15        20
C                 2        25           3        30

Solution:
Commodity    q0    p0    q1    p1    p0q0    p0q1    p1q0    p1q1
A            10     3     8     4     30      24      40      32
B            20    15    15    20    300     225     400     300
C             2    25     3    30     50      75      60      90
Total                                380     324     500     422

(a) Bowley's price index number
$= \frac{1}{2}\left[\frac{\sum p_1 q_0}{\sum p_0 q_0} + \frac{\sum p_1 q_1}{\sum p_0 q_1}\right] \times 100$
$= \frac{1}{2}\left[\frac{500}{380} + \frac{422}{324}\right] \times 100$
$= \frac{1}{2}[1.316 + 1.302] \times 100$
$= \frac{1}{2}[2.618] \times 100$
= 1.309 × 100
= 130.9
(b) Marshall-Edgeworth price index number
$= P_{01}^{ME} = \frac{\sum (q_0 + q_1) p_1}{\sum (q_0 + q_1) p_0} \times 100$
$= \frac{500 + 422}{380 + 324} \times 100$
$= \frac{922}{704} \times 100$
= 131.0
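A short sketch of the Bowley and Marshall-Edgeworth formulas applied to the data of Example 4 follows. The function names are hypothetical and the code simply transcribes the two formulas given earlier in this section.

```python
def sum_pq(prices, quantities):
    return sum(p * q for p, q in zip(prices, quantities))

def bowley(p0, q0, p1, q1):
    """Arithmetic mean of the Laspeyres and Paasche ratios, times 100."""
    laspeyres_ratio = sum_pq(p1, q0) / sum_pq(p0, q0)
    paasche_ratio = sum_pq(p1, q1) / sum_pq(p0, q1)
    return 100 * (laspeyres_ratio + paasche_ratio) / 2

def marshall_edgeworth(p0, q0, p1, q1):
    """Weights are the combined quantities q0 + q1 of the two years."""
    combined = [a + b for a, b in zip(q0, q1)]
    return 100 * sum_pq(p1, combined) / sum_pq(p0, combined)

q0, p0 = [10, 20, 2], [3, 15, 25]   # base year
q1, p1 = [8, 15, 3], [4, 20, 30]    # current year

bowley_index = bowley(p0, q0, p1, q1)
me_index = marshall_edgeworth(p0, q0, p1, q1)
print(round(bowley_index, 1), round(me_index, 1))
```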
Example 5:
Calculate a suitable price index from the following data
Commodity    Quantity    Price in 1996    Price in 1997
A               20             2                4
B               15             5                6
C                8             3                2

Solution:
Here the quantities are given in common, so we can use Kelly's
price index number, which is given by
Kelly's price index number = $P_{01}^{K} = \frac{\sum p_1 q}{\sum p_0 q} \times 100$
Commodity     q     p0    p1    p0q    p1q
A            20      2     4     40     80
B            15      5     6     75     90
C             8      3     2     24     16
Total                           139    186

Kelly's price index number = $P_{01}^{K} = \frac{\sum p_1 q}{\sum p_0 q} \times 100 = \frac{186}{139} \times 100 = 133.81$
IV. Weighted Average of Price Relative index.
When specific weights are given for each commodity, the
weighted index number is calculated by the formula
Weighted average of price relative index = $\frac{\sum pw}{\sum w}$
where w = the weight of the commodity
p = the price relative index = $\frac{p_1}{p_0} \times 100$
When the base year value p0q0 is taken as the weight, i.e., w = p0q0,
then the formula is
Weighted average of price relative index = $\frac{\sum \left(\frac{p_1}{p_0} \times 100\right) p_0 q_0}{\sum p_0 q_0} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$
This is nothing but Laspeyre's formula.
When the weights are taken as w = p0q1, the formula is
Weighted average of price relative index = $\frac{\sum \left(\frac{p_1}{p_0} \times 100\right) p_0 q_1}{\sum p_0 q_1} = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100$
This is nothing but Paasche's formula.

Example 6:
Compute the weighted index number for the following data.
Commodity          Price               Weight
             Current year  Base year
A                 5            4          60
B                 3            2          50
C                 2            1          30
Solution:
Commodity    p1    p0     w     p = (p1/p0) × 100     pw
A             5     4    60          125             7500
B             3     2    50          150             7500
C             2     1    30          200             6000
Total                   140                         21000

Weighted average of price relative index = $\frac{\sum pw}{\sum w} = \frac{21000}{140} = 150$
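The weighted average of price relatives in Example 6 can be verified with a few lines. The function name `weighted_price_relative_index` is an assumption for illustration.

```python
def weighted_price_relative_index(base_prices, current_prices, weights):
    """sum(p * w) / sum(w), where p = (p1 / p0) * 100 is the price relative."""
    relatives = [100 * p1 / p0 for p0, p1 in zip(base_prices, current_prices)]
    weighted_sum = sum(p * w for p, w in zip(relatives, weights))
    return weighted_sum / sum(weights)

p1 = [5, 3, 2]       # current-year prices
p0 = [4, 2, 1]       # base-year prices
w = [60, 50, 30]     # given weights

weighted_index = weighted_price_relative_index(p0, p1, w)
print(weighted_index)
```

Passing the base-year values p0q0 as the weights reproduces Laspeyre's index, and passing p0q1 reproduces Paasche's index, as shown algebraically above.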
10.6 Quantity or Volume index number:
Price index numbers measure and permit comparison of the
price of certain goods. On the other hand, the quantity index
numbers measure the physical volume of production, employment,
etc. The most common type of quantity index is that of
quantity produced.
Laspeyre's quantity index number = $Q_{01}^{L} = \frac{\sum q_1 p_0}{\sum q_0 p_0} \times 100$
Paasche's quantity index number = $Q_{01}^{P} = \frac{\sum q_1 p_1}{\sum q_0 p_1} \times 100$
Fisher's quantity index number = $Q_{01}^{F} = \sqrt{L \times P} = \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}} \times 100$
These formulae represent the quantity index in which the
quantities of the different commodities are weighted by their prices.
Example 7:
From the following data compute quantity indices by
(i) Laspeyre's method, (ii) Paasche's method and (iii) Fisher's
method.
Commodity          2000                    2002
              Price   Total value     Price   Total value
A               10       100            12       180
B               12       240            15       450
C               15       225            17       340

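Solution:
Here, instead of quantities, total values are given. Hence first find the quantities of the base year and the current year, i.e.
$$\text{Quantity} = \frac{\text{total value}}{\text{price}}$$

Commodity   p0   q0   p1   q1   p0q0   p0q1   p1q0   p1q1
A           10   10   12   15   100    150    120    180
B           12   20   15   30   240    360    300    450
C           15   15   17   20   225    300    255    340
Total                           565    810    675    970

$$\text{Laspeyre's quantity index number} = Q_{01}^{L} = \frac{\sum q_1 p_0}{\sum q_0 p_0} \times 100 = \frac{810}{565} \times 100 = 143.4$$
$$\text{Paasche's quantity index number} = Q_{01}^{P} = \frac{\sum q_1 p_1}{\sum q_0 p_1} \times 100 = \frac{970}{675} \times 100 = 143.7$$
$$\text{Fisher's quantity index number} = Q_{01}^{F} = \sqrt{L \times P} = \sqrt{143.4 \times 143.7} = 143.5$$
(or)
$$Q_{01}^{F} = \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}} \times 100 = \sqrt{\frac{810}{565} \times \frac{970}{675}} \times 100 = \sqrt{1.4336 \times 1.4370} \times 100 = 1.4353 \times 100 = 143.5$$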
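The steps of Example 7 (recover quantities from total values, then apply the three quantity index formulae) can be sketched in Python as follows. The variable and helper names are illustrative assumptions. Carried at full precision, the Fisher index comes out at about 143.53, so the last digit of a hand calculation depends on how the intermediate values are rounded.

```python
from math import sqrt

# Example 7 data: (price, total value) for commodities A, B, C
base = [(10, 100), (12, 240), (15, 225)]      # year 2000
current = [(12, 180), (15, 450), (17, 340)]   # year 2002

p0 = [price for price, _ in base]
p1 = [price for price, _ in current]
q0 = [value / price for price, value in base]      # quantity = total value / price
q1 = [value / price for price, value in current]

def total(prices, quantities):
    """Sum of price * quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))

laspeyres_q = total(p0, q1) / total(p0, q0) * 100   # quantities weighted by base prices
paasche_q = total(p1, q1) / total(p1, q0) * 100     # quantities weighted by current prices
fisher_q = sqrt(laspeyres_q * paasche_q)            # geometric mean of the two

print(round(laspeyres_q, 1), round(paasche_q, 1), round(fisher_q, 1))
```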
10.7 Tests of Consistency of index numbers:
Several formulae have been studied for the construction of
index numbers. The question arises as to which formula is
appropriate to a given problem. A number of tests have been developed
and the important among these are
1. Unit test
2. Time Reversal test
3. Factor Reversal test
1. Unit test:
The unit test requires that the formula for constructing an
index should be independent of the units in which prices and
quantities are quoted. Except for the simple aggregate index
(unweighted), all other formulae discussed in this chapter satisfy
this test.
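A small numerical check makes the unit test concrete. The sketch below uses hypothetical prices and quantities (not taken from the text) and illustrative function names. It quotes the first commodity once per kg and once per 100 kg: the simple aggregate index changes, while Laspeyre's index does not.

```python
def simple_aggregate_index(p0, p1):
    """Unweighted index: sum of current prices over sum of base prices, times 100."""
    return sum(p1) / sum(p0) * 100

def laspeyres_index(p0, p1, q0):
    """Base-year weighted index: sum(p1*q0) / sum(p0*q0) * 100."""
    num = sum(p * q for p, q in zip(p1, q0))
    den = sum(p * q for p, q in zip(p0, q0))
    return num / den * 100

# Hypothetical data: prices per kg, quantities in kg
p0, p1, q0 = [10.0, 4.0], [12.0, 6.0], [5.0, 20.0]

# Same goods, but the first commodity is quoted per 100 kg
p0_alt, p1_alt, q0_alt = [1000.0, 4.0], [1200.0, 6.0], [0.05, 20.0]

# The simple aggregate index depends on the unit chosen
print(simple_aggregate_index(p0, p1), simple_aggregate_index(p0_alt, p1_alt))
# Laspeyre's index is the same in both cases (up to floating-point rounding)
print(laspeyres_index(p0, p1, q0), laspeyres_index(p0_alt, p1_alt, q0_alt))
```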
2. Time Reversal test:
Time Reversal test is a test to determine whether a given
method will work both ways in time, forward and backward. In the
words of Fisher, the formula for calculating the index number
should be such that it gives the same ratio between one point of
comparison and the other, no matter which of the two is taken as
base. Symbolically, the following relation should be satisfied.
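$$P_{01} \times P_{10} = 1$$
where P01 is the index for time 1 with time 0 as base and P10 is the index for time 0 with time 1 as base. If the product is not unity, there is said to be a time bias in the method. Fisher's ideal index satisfies the time reversal test.
$$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}}$$
$$P_{10} = \sqrt{\frac{\sum p_0 q_1}{\sum p_1 q_1} \times \frac{\sum p_0 q_0}{\sum p_1 q_0}}$$
Then
$$P_{01} \times P_{10} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1} \times \frac{\sum p_0 q_1}{\sum p_1 q_1} \times \frac{\sum p_0 q_0}{\sum p_1 q_0}} = \sqrt{1} = 1$$
Therefore Fisher's ideal index satisfies the time reversal test.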
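The time reversal test can also be tried numerically. The sketch below uses hypothetical two-period data (not from the text) and illustrative function names, with the indices in ratio form (without the factor 100). It computes P01 × P10 for Laspeyre's and Fisher's formulae: the Fisher product is 1 up to floating-point rounding, while the Laspeyre product for this data is not, which is what a time bias looks like.

```python
from math import sqrt

def agg(prices, quantities):
    """Sum of price * quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))

def laspeyres(p_base, q_base, p_cur, q_cur):
    return agg(p_cur, q_base) / agg(p_base, q_base)

def paasche(p_base, q_base, p_cur, q_cur):
    return agg(p_cur, q_cur) / agg(p_base, q_cur)

def fisher(p_base, q_base, p_cur, q_cur):
    return sqrt(laspeyres(p_base, q_base, p_cur, q_cur)
                * paasche(p_base, q_base, p_cur, q_cur))

# Hypothetical data for three commodities in two periods
p0, q0 = [4.0, 6.0, 10.0], [20.0, 10.0, 5.0]
p1, q1 = [5.0, 9.0, 8.0], [18.0, 12.0, 7.0]

for name, formula in [("Laspeyres", laspeyres), ("Fisher", fisher)]:
    forward = formula(p0, q0, p1, q1)    # P01: time 0 as base
    backward = formula(p1, q1, p0, q0)   # P10: the two periods swap roles
    print(name, round(forward * backward, 4))
```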
3. Factor Reversal test:
Another test suggested by Fisher is known as the factor reversal test. It holds that the product of a price index and the quantity index should be equal to the corresponding value index. In the words of Fisher, "Just as each formula should permit the interchange of the two times without giving inconsistent results, so it ought to permit interchanging the prices and quantities without giving inconsistent result, i.e. the two results multiplied together should give the true value ratio."
In other words, if P01 represents the change in price in the current year and Q01 represents the change in quantity in the current year, then
$$P_{01} \times Q_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_0}$$
Thus, based on this test, if the product is not equal to the value ratio, there is an error in one or both of the index numbers. The factor reversal test is satisfied by Fisher's ideal index, i.e.
$$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}}$$
$$Q_{01} = \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}}$$
Then
$$P_{01} \times Q_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1} \times \frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}} = \sqrt{\left(\frac{\sum p_1 q_1}{\sum p_0 q_0}\right)^2} = \frac{\sum p_1 q_1}{\sum p_0 q_0}$$
Since $P_{01} \times Q_{01} = \dfrac{\sum p_1 q_1}{\sum p_0 q_0}$, the factor reversal test is satisfied by Fisher's ideal index.
Example 8:
Construct Fisher's ideal index for the following data. Test whether it satisfies the time reversal test and the factor reversal test.

            Base year           Current year
Commodity   Quantity   Price    Quantity   Price
A           12         10       15         12
B           15         7        20         5
C           5          5        8          9
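Solution:

Commodity   q0   p0   q1   p1   p0q0   p0q1   p1q0   p1q1
A           12   10   15   12   120    150    144    180
B           15   7    20   5    105    140    75     100
C           5    5    8    9    25     40     45     72
Total                           250    330    264    352

Fisher's ideal index number:
$$\begin{aligned}
P_{01}^{F} &= \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100 \\
&= \sqrt{\frac{264}{250} \times \frac{352}{330}} \times 100 \\
&= \sqrt{(1.056)(1.0667)} \times 100 \\
&= \sqrt{1.1264} \times 100 \\
&= 1.0613 \times 100 = 106.1
\end{aligned}$$

Time Reversal test:
The time reversal test is satisfied when P01 × P10 = 1.
$$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} = \sqrt{\frac{264}{250} \times \frac{352}{330}}$$
$$P_{10} = \sqrt{\frac{\sum p_0 q_1}{\sum p_1 q_1} \times \frac{\sum p_0 q_0}{\sum p_1 q_0}} = \sqrt{\frac{330}{352} \times \frac{250}{264}}$$
Now
$$P_{01} \times P_{10} = \sqrt{\frac{264}{250} \times \frac{352}{330} \times \frac{330}{352} \times \frac{250}{264}} = \sqrt{1} = 1$$
Hence Fisher's ideal index satisfies the time reversal test.

Factor Reversal test:
The factor reversal test is satisfied when $P_{01} \times Q_{01} = \dfrac{\sum p_1 q_1}{\sum p_0 q_0}$.
Now
$$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} = \sqrt{\frac{264}{250} \times \frac{352}{330}}$$
$$Q_{01} = \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}} = \sqrt{\frac{330}{250} \times \frac{352}{264}}$$
Then
$$P_{01} \times Q_{01} = \sqrt{\frac{264}{250} \times \frac{352}{330} \times \frac{330}{250} \times \frac{352}{264}} = \sqrt{\left(\frac{352}{250}\right)^2} = \frac{352}{250} = \frac{\sum p_1 q_1}{\sum p_0 q_0}$$
Hence Fisher's ideal index number satisfies the factor reversal test.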
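The whole of Example 8 can be reproduced with a short Python sketch. The variable names are illustrative assumptions, and the two tests are checked up to a small floating-point tolerance. At full precision the index is about 106.13.

```python
from math import sqrt

# Example 8 data: (q0, p0, q1, p1) for commodities A, B, C
rows = [(12, 10, 15, 12), (15, 7, 20, 5), (5, 5, 8, 9)]

s_p0q0 = sum(q0 * p0 for q0, p0, q1, p1 in rows)   # 250
s_p0q1 = sum(q1 * p0 for q0, p0, q1, p1 in rows)   # 330
s_p1q0 = sum(q0 * p1 for q0, p0, q1, p1 in rows)   # 264
s_p1q1 = sum(q1 * p1 for q0, p0, q1, p1 in rows)   # 352

p01 = sqrt(s_p1q0 / s_p0q0 * s_p1q1 / s_p0q1)   # Fisher price index, ratio form
p10 = sqrt(s_p0q1 / s_p1q1 * s_p0q0 / s_p1q0)   # base and current year interchanged
q01 = sqrt(s_p0q1 / s_p0q0 * s_p1q1 / s_p1q0)   # Fisher quantity index, ratio form

print(round(p01 * 100, 1))                          # Fisher's ideal index
print(abs(p01 * p10 - 1) < 1e-12)                   # time reversal test
print(abs(p01 * q01 - s_p1q1 / s_p0q0) < 1e-12)     # factor reversal test
```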
10.8 Consumer Price Index
Consumer Price index is also called the cost of living index. It represents the average change over time in the prices paid by the ultimate consumer for a specified basket of goods and services. A change in the price level affects the cost of living of different classes of people differently. The general index number fails to reveal this. So there is a need to construct a consumer price index. People consume different types of commodities. People's consumption habits also differ from person to person, place to place and class to class, i.e. richer class, middle class and poor class.
The scope of the consumer price index must be specified in terms of the population group covered (for example working class, poor class, middle class, richer class, etc.) and the geographical area covered (urban, rural, town, city, etc.).

Use of Consumer Price index
The consumer price indices are of great significance. Their uses are given below.
1. They are very useful in wage negotiations, wage contracts and dearness allowance adjustment in many countries.
2. At government level, the index numbers are used for wage policy, price policy, rent control, taxation and general economic policies.
3. Changes in the purchasing power of money and in real income can be measured.
4. Index numbers are also used for analysing market prices for particular kinds of goods and services.
Method of Constructing Consumer price index:
There are two methods of constructing the consumer price index. They are
1. Aggregate Expenditure method (or) Aggregate method.
2. Family Budget method (or) Method of Weighted Relatives.

1. Aggregate Expenditure method:
This method is based upon Laspeyre's method. It is widely used. The quantities of commodities consumed by a particular group in the base year are the weights. The formula is
$$\text{Consumer Price Index number} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$$
2. Family Budget method or Method of Weighted Relatives:
This method estimates the aggregate expenditure of an average family on various items, and this expenditure is used as the weight. The formula is
$$\text{Consumer Price index number} = \frac{\sum PW}{\sum W}$$
where $P = \dfrac{p_1}{p_0} \times 100$ for each item and W = value weight, i.e. p0q0.
The weighted average of price relatives method, which we have studied before, and the Family Budget method are the same for finding out the consumer price index.
Example 9:
Construct the consumer price index number for 1996 on the basis of 1993 from the following data using the Aggregate expenditure method.

Commodity   Quantity consumed   Price in 1993   Price in 1996
A           100                 8               12
B           25                  6               7
C           10                  5               8
D           20                  15              18
Solution:

Commodity   q0    p0   p1   p0q0   p1q0
A           100   8    12   800    1200
B           25    6    7    150    175
C           10    5    8    50     80
D           20    15   18   300    360
Total                       1300   1815

Consumer price index by Aggregate expenditure method
$$= \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100 = \frac{1815}{1300} \times 100 = 139.6$$
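The calculation in Example 9 can be sketched in Python, and the same data can be used to check the earlier statement that the Family Budget method with value weights W = p0q0 gives the same result as the Aggregate Expenditure method. The dictionary layout and variable names are illustrative assumptions.

```python
# Example 9 data: commodity -> (quantity consumed q0, price in 1993 p0, price in 1996 p1)
basket = {"A": (100, 8, 12), "B": (25, 6, 7), "C": (10, 5, 8), "D": (20, 15, 18)}

# Aggregate expenditure method: sum(p1*q0) / sum(p0*q0) * 100
sum_p0q0 = sum(q0 * p0 for q0, p0, p1 in basket.values())
sum_p1q0 = sum(q0 * p1 for q0, p0, p1 in basket.values())
cpi_aggregate = sum_p1q0 / sum_p0q0 * 100

# Family budget method with value weights W = p0*q0: sum(P*W) / sum(W)
sum_pw = sum((p1 / p0 * 100) * (p0 * q0) for q0, p0, p1 in basket.values())
cpi_family = sum_pw / sum_p0q0

print(round(cpi_aggregate, 1), round(cpi_family, 1))  # 139.6 139.6
```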

Example 10:
Calculate the consumer price index by using the Family Budget method for the year 1993 with 1990 as base year from the following data.

Items               Weights   Price in 1990 (Rs.)   Price in 1993 (Rs.)
Food                35        150                   140
Rent                20        75                    90
Clothing            10        25                    30
Fuel and lighting   15        50                    60
Miscellaneous       20        60                    80

Solution:

Items               W     p0    p1    P = (p1/p0) × 100   PW
Food                35    150   140   93.33               3266.55
Rent                20    75    90    120.00              2400.00
Clothing            10    25    30    120.00              1200.00
Fuel and lighting   15    50    60    120.00              1800.00
Miscellaneous       20    60    80    133.33              2666.60
Total               100                                   11333.15

Consumer price index by Family Budget method
$$= \frac{\sum PW}{\sum W} = \frac{11333.15}{100} = 113.33$$
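Example 10 can be verified in the same way. In this minimal sketch (names are illustrative), the price relatives are not rounded before weighting, so the total ΣPW differs slightly from the tabulated 11333.15, but the index still rounds to 113.33.

```python
# Example 10 data: item -> (weight, price in 1990, price in 1993)
budget = {
    "Food": (35, 150, 140),
    "Rent": (20, 75, 90),
    "Clothing": (10, 25, 30),
    "Fuel and lighting": (15, 50, 60),
    "Miscellaneous": (20, 60, 80),
}

sum_w = sum(w for w, p0, p1 in budget.values())
sum_pw = sum(w * (p1 / p0 * 100) for w, p0, p1 in budget.values())  # P = p1/p0 * 100
cpi = sum_pw / sum_w

print(round(cpi, 2))  # 113.33
```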

Exercise 10
I. Choose the correct answer:
1. Index number is a
(a) measure of relative changes
(b) a special type of an average
(c) a percentage relative
(d) all the above
2. Most preferred type of average for index number is
(a) arithmetic mean
(b) geometric mean
(c) harmonic mean
(d) none of the above
3. Laspeyre's index formula uses the weights of the
(a) base year
(b) current year
(c) average of the weights of a number of years
(d) none of the above
4. The geometric mean of Laspeyre's and Paasche's price
indices is also known as
(a) Fisher's price index
(b) Kelly's price index
(c) Marshall-Edgeworth index number
(d) Bowley's price index
5. The condition for the time reversal test to hold good with
usual notations is
(a) P01 P10 = 1
(b) P10 P01 = 0
(c) P01 / P10 = 1
(d) P01 + P10 = 1
6. An appropriate method for working out consumer price index
is
(a) weighted aggregate expenditure method
(b) family budget method
(c) price relative method
(d) none of the above
7. The weights used in Paasche's formula belong to
(a) The base period
(b) The given period
(c) Any arbitrarily chosen period
(d) None of the above
II. Fill in the blanks in the following
8. Index numbers help in framing of ____________
9. Fisher's ideal index number is the __________ of Laspeyre's
and Paasche's index numbers
10. Index numbers are expressed in __________
11. __________ is known as Ideal index number
12. In family budget method, the cost of living index number is
_________
III. Answer the following
13. What is an index number? What are the uses of index numbers?
14. Explain Time Reversal Test and Factor Reversal test.
15. What is meant by consumer price index number? What are its uses?
16. Calculate price index number by
(i) Laspeyre's method
(ii) Paasche's method
(iii) Fisher's ideal index method.

            1990               1995
Commodity   Price   Quantity   Price   Quantity
A           20      15         30      20
B           15      10         20      15
C           30      20         25      10
D           10      5          12      10
17. Calculate Fisher's ideal index for the following data. Also test whether it satisfies the time reversal test and the factor reversal test.

            Price           Quantity
Commodity   2000    2002    2000    2002
A           6       35      10      40
B           10      25      12      30
C           12      15      8       20
18. Calculate the cost of living index number from the following data.

Items           Price (base year)   Price (current year)   Weight
Food            30                  45                     4
Fuel            10                  15                     2
Clothing        15                  20                     1
House Rent      20                  15                     3
Miscellaneous   25                  20                     2
Answers
I.
1. (d)   2. (b)   3. (a)   4. (a)   5. (a)   6. (b)   7. (b)
II.
8. Policies
9. Geometric mean
10. Percentage
11. Fisher's index number
12. $\dfrac{\sum PW}{\sum W}$
III.
16. (i) L = 110   (ii) P = 123.9   (iii) F = 116.7
17. 296
18. 118.2
