UNIT 4 MEASURES OF CENTRAL TENDENCY
Structure
Objectives
Introduction
Measures of Central Tendency
4.2.1 Arithmetic Mean
4.2.2 Median
4.2.3 Mode
Other Measures of Central Tendency
4.3.1 Geometric Mean and Harmonic Mean
4.3.2 Weighted Mean
4.3.3 Pooled Mean
4.3.4 Choosing a Measure of Central Tendency
Percentiles
4.4.1 Percentiles: Definition and Computation
4.4.2 Quartiles and Deciles
Let Us Sum Up
Key Words
Some Useful Books
Answers or Hints to Check Your Progress Exercises
4.0 OBJECTIVES
After going through this unit, you will be able to:
compute numerical quantities that measure the central tendency of a set of data
such as mean, median, mode, geometric mean and harmonic mean, and
use these measures.
4.1 INTRODUCTION
In the previous Unit we discussed the condensation of raw data by grouping
them into a few class intervals and presenting them in the form of a table or diagram.
Such tables or diagrams provide a rough idea of the distribution of observations.
Often we need to compare distributions. In such situations it is difficult
to compare tables or diagrams simply by looking at them. Comparison is much more
convenient if we can find a single numerical value
that describes the data.
Measures of Central Tendency (or Location) constitute one of the major statistics
designed for this purpose. There are five main measures of central tendency. These
are Arithmetic Mean, Geometric Mean, Harmonic Mean, Median and Mode. You
will learn about each one of these measures below.
Summarisation of Univariate Data
4.2 MEASURES OF CENTRAL TENDENCY
In frequency distributions of observations discussed in Unit 3 we notice that
the observations tend to cluster around a central value. This phenomenon of
clustering around a central value in a frequency distribution is called 'Central
Tendency'. Thus, it is of interest to locate such a value around which clustering
of observations takes place. There are several measures of central tendency (or
location) of a frequency distribution. These measures produce numbers that
summarise a frequency distribution in terms of one of its properties, namely, central
tendency.
4.2.1 Arithmetic Mean
The average or the arithmetic mean, or simply the mean when there is no
ambiguity, is the most common measure of central tendency. It is defined as the
sum total of all values in the sample divided by the number of observations. It is
denoted by a bar above the symbol of the variable being averaged. Thus $\bar{X}$ stands
for the mean of X-values in the sample. If in a sample a particular X-value, say
$X_i$, occurs with frequency $f_i$ (i = 1, 2, ..., n), its contribution to the total of X-values
is $f_i X_i$. Thus, one can compute the mean of X-values by
$$\bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i}$$
When observations are classified into class intervals, as for continuous variables,
individual observations falling into a class interval are not separately identifiable
and the contribution of an individual observation from a class interval to the total
cannot be calculated. To avoid this difficulty, it is assumed that every observation
falling into a class interval has a value equal to the mid-point of that class
interval. Such a procedure will not give the exact mean that one would have
computed from the raw data, and may require what are called corrections for
grouping.
Example 4.1: Compute the mean for discrete frequency distribution of Table 4.1.
Table 4.1
Frequency distribution of 100 households by size
Household size    Frequency
3                 25
4                 33
5                 12
6                 7
7                 2
8                 2
Total             100
Let us compute the arithmetic mean of the data given in the above table.
Table 4.2
Frequency distribution of 100 households by average monthly household
expenditure on food
Expenditure class (Rs.)    Frequency
262.5 - 286.5              1
286.5 - 310.5              14
310.5 - 334.5              16
334.5 - 358.5              28
358.5 - 382.5              26
382.5 - 406.5              15
Total                      100
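The mid-point assumption described earlier for class intervals can be tried out on Table 4.2. The following is only a minimal sketch: the function name `grouped_mean` and the tuple layout (lower boundary, upper boundary, frequency) are illustrative choices, not part of the unit.

```python
# Class boundaries and frequencies copied from Table 4.2.
# Each tuple is (lower boundary, upper boundary, frequency).
classes = [
    (262.5, 286.5, 1),
    (286.5, 310.5, 14),
    (310.5, 334.5, 16),
    (334.5, 358.5, 28),
    (358.5, 382.5, 26),
    (382.5, 406.5, 15),
]


def grouped_mean(classes):
    """Mean of grouped data, treating every observation as its class mid-point."""
    total_frequency = sum(f for _, _, f in classes)
    weighted_total = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
    return weighted_total / total_frequency


mean_expenditure = grouped_mean(classes)
print(round(mean_expenditure, 2))  # 348.66
```

Because the mid-points stand in for the unknown individual values, the result is an approximation to the mean of the raw data.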
One may note from the above example that to find column (3) one needs to multiply
the corresponding values of columns (1) and (2), and such multiplications are often
long when done by hand. These computations can be simplified, particularly when
successive column (1) values are equidistant (though the method applies otherwise also), by
making the following simple transformation.
For i = 1, 2, ..., n, let
$$u_i = \frac{X_i - A}{h}, \quad \text{i.e.,} \quad X_i = A + h u_i, \quad \text{and so} \quad \bar{X} = A + h\bar{u}.$$
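The step from the transformation to the result for the mean can be written out as a short derivation. This is a sketch in which $N = \sum f_i$ denotes the total frequency and $\bar{u}$ the mean of the transformed values; the symbol $N$ is introduced here only for brevity.

```latex
\begin{align*}
\bar{X} &= \frac{1}{N}\sum_{i=1}^{n} f_i X_i
         = \frac{1}{N}\sum_{i=1}^{n} f_i \,(A + h u_i) \\
        &= A \cdot \frac{1}{N}\sum_{i=1}^{n} f_i
           + h \cdot \frac{1}{N}\sum_{i=1}^{n} f_i u_i \\
        &= A + h\bar{u},
\qquad \text{since } \frac{1}{N}\sum_{i=1}^{n} f_i = 1
\text{ and } \bar{u} = \frac{1}{N}\sum_{i=1}^{n} f_i u_i .
\end{align*}
```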
Often A is called the 'assumed mean' and $h\bar{u}$ its correction to get $\bar{X}$. The choice
of A and h is made so that computation of $\bar{u}$ becomes simple. Usually A is taken
as that X-value for which the frequency is largest. For equidistant successive
X-values in column (1), h may be taken as the difference between two successive
X-values. For class intervals of equal length, the difference between successive
mid-points is the same as the length of each class interval.
We will explain this method by re-computing the mean of the monthly average
household food expenditure data given in Table 4.2. We construct Table 4.3 by
using A and h as explained below.
Table 4.3
Computation of mean of data of Table 4.2
Class interval (Rs.)   Mid-point (X_i)   u_i = (X_i - 346.5)/24   Frequency (f_i)   f_i u_i
262.5 - 286.5          274.5             -3                       1                 -3
286.5 - 310.5          298.5             -2                       14                -28
310.5 - 334.5          322.5             -1                       16                -16
334.5 - 358.5          346.5              0                       28                  0
358.5 - 382.5          370.5              1                       26                 26
382.5 - 406.5          394.5              2                       15                 30
Total                                                             100                 9
We find that $\bar{u} = 9/100 = 0.09$, and hence $\bar{X} = A + h\bar{u} = 346.5 + 24 \times 0.09 = 348.66$.
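The assumed-mean computation of Table 4.3 can be reproduced in a few lines. This is a sketch only; the variable names are illustrative, while A = 346.5 and h = 24 are the values used in the table.

```python
# Mid-points and frequencies from Table 4.3.
midpoints = [274.5, 298.5, 322.5, 346.5, 370.5, 394.5]
frequencies = [1, 14, 16, 28, 26, 15]

A = 346.5  # assumed mean: mid-point of the class with the largest frequency
h = 24.0   # common class width

# Transformed values u_i = (X_i - A) / h, which come out as -3, -2, -1, 0, 1, 2.
u = [(x - A) / h for x in midpoints]

sum_fu = sum(f * ui for f, ui in zip(frequencies, u))  # total of the f_i * u_i column
u_bar = sum_fu / sum(frequencies)                      # mean of the u-values
mean_shortcut = A + h * u_bar                          # back-transform to the X scale

print(sum_fu, round(mean_shortcut, 2))
```

The small integers in the u column are what make the hand computation easier than multiplying the original mid-points by their frequencies.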
To prove this property, we note that the magnitude of S will depend upon the
selected value of A. Thus, we can say that S is a function of A. We want to find
that value of A for which S is minimum. Using calculus, this value is given by the
solution of $dS/dA = 0$.
(Remember that the value of a function is minimum when its first derivative is zero
and its second derivative is positive.)
Further, it can be shown that $\frac{d^2 S}{dA^2} > 0$ when $A = \bar{X}$.
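The calculus argument can be sketched as follows, under the assumption that S denotes the frequency-weighted sum of squared deviations of the X-values from A. That definition is not visible in the text above, so it is stated here explicitly as an assumption; $N = \sum f_i$ is the total frequency.

```latex
\begin{align*}
S(A) &= \sum_{i=1}^{n} f_i \,(X_i - A)^2
      && \text{(assumed definition of } S\text{)} \\
\frac{dS}{dA} &= -2 \sum_{i=1}^{n} f_i \,(X_i - A) = 0
      \;\Longrightarrow\;
      A = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i} = \bar{X} \\
\frac{d^2 S}{dA^2} &= 2 \sum_{i=1}^{n} f_i = 2N > 0
\end{align*}
```

Under this assumption the second derivative is positive for every A, so the stationary point at $A = \bar{X}$ is a minimum.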
4.2.2 Median
Median of a distribution locates a central point which divides a distribution into
two equal halves, i.e., it is the middle most value among a set of observations.
Let us start with examples in a discrete case. Consider a data set having 5 distinct
observations: 2, 4, 9, 12, 19 (arranged in ascending order). Here 9 is the middle
most value since an equal number of observations are to its left and to its right.
Thus, 9 is the median of the above observations. Consider another data set having
6 distinct observations: 3, 8, 15, 25, 35, 43. Here any point between 15 and 25
has the property that an equal number of observations are to its left and to its right.
Any point in the interval 15 to 25 may be used as a median. Conventionally we
take the middle point of such an interval to define the median uniquely. Thus 20 is
the median of 3, 8, 15, 25, 35, 43.
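The convention just described (middle value for an odd number of observations, mid-point of the two middle values for an even number) can be sketched as a small function. The name `sample_median` is illustrative.

```python
def sample_median(values):
    """Median of raw data: the middle value, or the mid-point of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: mid-point convention


median_odd = sample_median([2, 4, 9, 12, 19])        # 9
median_even = sample_median([3, 8, 15, 25, 35, 43])  # 20.0
print(median_odd, median_even)
```

Python's standard library offers the same convention as `statistics.median`.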
When a data set has non-distinct observations, a situation more common in
practice, difficulties may arise. In such situations, it may not always be
possible to locate the middle most value or the central point that divides the
distribution into two equal halves. For example, in the case of the data set having
5 observations 2, 9, 9, 12, 19 the value 9 occurs twice. Thus, a formal
definition of median is needed to overcome such difficulties.
Let us find the median household size from the frequency distribution in Table
4.1. We notice that 77 (out of 100) households have a family size of less than or
equal to 4 and 56 households have a family size of more than or equal to 4. Thus
the median family size in this case is 4.
Fig. 4.1: Histogram
Area up to the class boundary 334.5 is 31 and up to 358.5 is 59. Hence the median
lies in the class 334.5 - 358.5. We now want to find a point in this class so that
the area from 334.5 to the point is (50 - 31) = 19, where area up to 334.5 is
31. Since the rectangle over the interval 334.5 - 358.5 has an area of 28, and
is of length 24, to get an area of 19 we need $\frac{19}{28}$th part of 24. This works out
to be $\frac{19}{28} \times 24 = 16.3$. Thus the median is 334.5 + 16.3 = 350.8. Note also
that the area in the class 350.8 to 358.5 is 28 - 19 = 9 and to the right of 350.8
is 9 + 41 = 50, as it should be.
Based on the above procedure, we can write a formula for the computation of
median:
$$\text{Median} = l_m + \frac{\frac{N}{2} - C}{f_m} \times h$$
where
$l_m$ is the lower limit of the median class, i.e., the class in which the median lies,
N is the total frequency,
C is the cumulative frequency of classes preceding the median class (note that
C = 31 in the above example),
$f_m$ is the frequency of the median class, and
h is the width of the median class.
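The interpolation within the median class can be sketched in code using the Table 4.2 data. The function name `grouped_median` and the tuple layout are illustrative; the logic follows the area argument given for the histogram.

```python
# (lower boundary, upper boundary, frequency) for each class of Table 4.2.
classes = [
    (262.5, 286.5, 1),
    (286.5, 310.5, 14),
    (310.5, 334.5, 16),
    (334.5, 358.5, 28),
    (358.5, 382.5, 26),
    (382.5, 406.5, 15),
]


def grouped_median(classes):
    """Median of grouped data by linear interpolation inside the median class."""
    total = sum(f for _, _, f in classes)
    if total == 0:
        raise ValueError("distribution has no observations")
    half = total / 2
    cumulative = 0  # cumulative frequency of classes preceding the current one (C)
    for lower, upper, f in classes:
        if f > 0 and cumulative + f >= half:
            # The median class is reached: take the (half - C) / f part of its width.
            return lower + (half - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("median class not found")


median_expenditure = grouped_median(classes)
print(round(median_expenditure, 2))
```

The unrounded result is about 350.79, which agrees with the hand computation once the increment is rounded to 16.3.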
4.2.3 Mode
As has been pointed out earlier, often observations tend to cluster around a central
value. A simple measure of this phenomenon is called mode.
From Table 4.1 we find that the mode or modal value of household size is 4 as
this value occurs with the largest frequency of 33 among 100 households.
There are, however, data sets for which the mode cannot be defined uniquely, i.e., the
distribution has multiple modes. Raw data with 7 hypothetical observations with
values 4, 3, 4, 1, 2, 5, 3 have two modes, 3 and 4. Distributions having two modes
are called bimodal distributions, though the frequently encountered distributions
have only one mode, i.e., are unimodal.
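The bimodal example can be checked by counting how often each value occurs. The sketch below uses `collections.Counter` from the Python standard library; the variable names are illustrative.

```python
from collections import Counter

observations = [4, 3, 4, 1, 2, 5, 3]

counts = Counter(observations)          # value -> number of occurrences
highest = max(counts.values())          # largest frequency in the data
modes = sorted(value for value, count in counts.items() if count == highest)

print(modes)  # [3, 4]
```

Both 3 and 4 occur twice and every other value occurs once, so the data set is bimodal.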
For grouped data, the mode is computed as
$$\text{Mode} = l_m + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times h$$
where
$l_m$ is the lower limit of the modal class, i.e., the class in which the mode lies,
$\Delta_1 (= f_m - f_{m-1})$ is the difference of the frequencies of the modal class
and its preceding class,
$\Delta_2 (= f_m - f_{m+1})$ is the difference of the frequencies of the modal class
and its following class, and
h is the width of the modal class.
Let us look back to Table 4.2. Here modal class is 334.5 - 358.5 as it has the
highest frequency, 28.
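The mode of Table 4.2 can be worked out in code. This sketch assumes the usual interpolation formula Mode = l + d1 / (d1 + d2) x h, where d1 and d2 are the differences between the modal-class frequency and the frequencies of the preceding and following classes; the function name `grouped_mode` is illustrative.

```python
# (lower boundary, upper boundary, frequency) for each class of Table 4.2.
classes = [
    (262.5, 286.5, 1),
    (286.5, 310.5, 14),
    (310.5, 334.5, 16),
    (334.5, 358.5, 28),
    (358.5, 382.5, 26),
    (382.5, 406.5, 15),
]


def grouped_mode(classes):
    """Mode of grouped data by interpolation inside the (first) modal class."""
    freqs = [f for _, _, f in classes]
    m = freqs.index(max(freqs))                 # index of the modal class
    lower, upper, f_m = classes[m]
    f_prev = freqs[m - 1] if m > 0 else 0       # no preceding class: treat as 0
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0
    d1 = f_m - f_prev
    d2 = f_m - f_next
    if d1 + d2 == 0:
        raise ValueError("modal class is not distinguishable from its neighbours")
    return lower + d1 / (d1 + d2) * (upper - lower)


mode_expenditure = grouped_mode(classes)
print(round(mode_expenditure, 2))
```

Here d1 = 28 - 16 = 12 and d2 = 28 - 26 = 2, so the mode is 334.5 + (12/14) x 24, roughly 355.07.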
Total 250
Find the mean, median and mode.
2) Compute the mean, median and mode for the following frequency distribution.
I.Q.         Frequency
160 - 169    2
150 - 159    3
140 - 149    7
130 - 139    19
120 - 129    37
110 - 119    79
100 - 109    69
90 - 99      65
80 - 89      17
70 - 79      5
60 - 69      3
50 - 59      2
40 - 49      1
Total        309
Often we see that all the observations do not have equal importance. In such cases
we need to give differential importance to different items. Here we use weighted
means (arithmetic, geometric or harmonic) instead of simple means. This we
will discuss in Section 4.3.2.
4.3.1 Geometric Mean and Harmonic Mean
Often we have to deal with data that are time dependent, i.e., time series data,
which are unlike the one-time data of Tables 4.1 and 4.2. For time dependent data,
it is often of interest to find the pattern of change over time. Consider the following
two data sets.
The first set looks like the basic salary (in Rs.) of an employee for 7 years with an annual
increment of Rs. 100 per year.
The second set looks more like his gross salary (in Rs.). Annual increases in the two
sets are given below.
The arithmetic mean of the annual increase is 100 for Set I and 141.5 for Set II. On the
basis of these average annual increases, if one works out figures for the two sets,
starting from the initial values, one would get the following.
The use of the arithmetic mean has worked well for Set I and not for Set II
because the progressions of the original numbers in the two sets are different. In Set
I, the increment has been a fixed quantum whereas in Set II, figures have increased
at a fixed rate. A fixed quantum of increase is called arithmetic progression and the
arithmetic mean is appropriate to describe the increase. A fixed rate of increase is
called geometric progression and the geometric mean is most appropriate to describe
the increase.
For n numbers $X_1, X_2, \ldots, X_n$, the geometric mean (GM) is defined as the nth root
of the product of these n numbers, i.e.,
$$GM = (X_1 X_2 \cdots X_n)^{1/n}$$
Clearly, GM is not defined unless all the n numbers are positive. By taking the logarithm
of GM, one has
$$\log GM = \frac{1}{n} \sum_{i=1}^{n} \log X_i$$
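The logarithmic form suggests a convenient way to compute the geometric mean: average the logarithms and take the antilog. The sketch below applies it to a hypothetical series that grows by 10% per period; the series and the function name are illustrative and are not the data sets of the unit.

```python
import math


def geometric_mean(values):
    """Geometric mean computed as the antilog of the mean of the logarithms."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty set of positive numbers")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical figures increasing at a fixed rate of 10% per period.
series = [1000.0, 1100.0, 1210.0, 1331.0]
growth_factors = [later / earlier for earlier, later in zip(series, series[1:])]

average_growth = geometric_mean(growth_factors)
print(round(average_growth, 4))
```

Every period-to-period ratio is 1.1, so the geometric mean of the ratios recovers the fixed rate of increase.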
The last expression is the reciprocal of the arithmetic mean of reciprocals and
is called harmonic mean (HM). For a set of n values $X_1, X_2, \ldots, X_n$, HM is
defined as
$$HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{X_i}}$$
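A small sketch of the harmonic mean follows. The three rates (10%, 12% and 15%) are borrowed from exercise 5 of this unit, where equal amounts of interest are earned at each rate; the function name is illustrative.

```python
def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs a non-empty set of positive numbers")
    return len(values) / sum(1 / v for v in values)


rates = [10, 12, 15]  # per cent per annum
average_yield = harmonic_mean(rates)
print(round(average_yield, 2))
```

The reciprocals 1/10, 1/12 and 1/15 add up to 1/4, so the harmonic mean is 3 divided by 1/4, i.e., 12 per cent, in line with the answer listed for that exercise.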
If the stockist, instead of stocking Rs. 5000 worth of items, stocks 3000 items
at the beginning of every month at the given prices, the appropriate average would
be the arithmetic mean. To verify this, we can write
$$\text{Weighted AM} = \frac{\sum_{i} w_i X_i}{\sum_{i} w_i}$$
$$\text{Weighted HM} = \frac{\sum_{i} w_i}{\sum_{i} \frac{w_i}{X_i}}$$
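The weighted arithmetic and harmonic means can be sketched as follows, taking the weighted AM as the sum of w_i X_i divided by the sum of w_i, and the weighted HM as the sum of w_i divided by the sum of w_i / X_i. The prices and weights are hypothetical and serve only to illustrate the arithmetic.

```python
def weighted_arithmetic_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def weighted_harmonic_mean(values, weights):
    """Sum of the weights divided by the sum of w_i / x_i."""
    return sum(weights) / sum(w / x for x, w in zip(values, weights))


# Hypothetical prices with hypothetical weights.
prices = [10.0, 20.0, 40.0]
weights = [1, 2, 1]

wam = weighted_arithmetic_mean(prices, weights)  # (10 + 40 + 40) / 4
whm = weighted_harmonic_mean(prices, weights)    # 4 / (0.1 + 0.1 + 0.025)
print(wam, round(whm, 2))
```

With equal weights both functions reduce to the simple arithmetic and harmonic means.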
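Let $m_1, m_2, \ldots, m_r$ be r arithmetic (or geometric or harmonic) means, computed
on the basis of $n_1, n_2, \ldots, n_r$ observations respectively. Then
$$\text{Pooled arithmetic mean} = \frac{1}{n} \sum_{i=1}^{r} m_i n_i,$$
$$\text{Pooled geometric mean} = \left( \prod_{i=1}^{r} m_i^{n_i} \right)^{1/n}, \text{ and}$$
$$\text{Pooled harmonic mean} = \frac{n}{\sum_{i=1}^{r} \frac{n_i}{m_i}},$$
where $n = n_1 + n_2 + \cdots + n_r$.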
Note that the above expressions are similar to the expressions for weighted
means.
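Pooling can be sketched by treating the group sizes as weights. The group means and sizes below are hypothetical. The pooled harmonic mean is written here as n divided by the sum of n_i / m_i, which is an assumption made by analogy with the weighted harmonic mean.

```python
import math


def pooled_arithmetic_mean(means, sizes):
    """(1/n) * sum of m_i * n_i, with n the total number of observations."""
    return sum(m * k for m, k in zip(means, sizes)) / sum(sizes)


def pooled_geometric_mean(means, sizes):
    """nth root of the product of m_i ** n_i, computed through logarithms."""
    return math.exp(sum(k * math.log(m) for m, k in zip(means, sizes)) / sum(sizes))


def pooled_harmonic_mean(means, sizes):
    """n divided by the sum of n_i / m_i (assumed form, see lead-in)."""
    return sum(sizes) / sum(k / m for m, k in zip(means, sizes))


# Hypothetical group means and group sizes.
group_means = [50.0, 60.0]
group_sizes = [20, 30]

pooled_am = pooled_arithmetic_mean(group_means, group_sizes)  # (1000 + 1800) / 50
pooled_gm = pooled_geometric_mean(group_means, group_sizes)
pooled_hm = pooled_harmonic_mean(group_means, group_sizes)    # 50 / (0.4 + 0.5)
print(pooled_am, round(pooled_gm, 2), round(pooled_hm, 2))
```

The larger group pulls each pooled value towards its own mean, which is why all three results lie closer to 60 than to 50.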
4.3.4 Choosing a Measure of Central Tendency
Since graphical representation of data is more appealing, median or mode are more
useful in such a situation because their crude values can be obtained easily without
having to go through any computations. Also, median and mode are simple concepts
for communication and comparison between graphs. It has, however, been
observed that the median is less stable than the arithmetic mean in repeated sampling and
one needs to be careful when comparing graphs.
For data that has a distribution close in shape to what is called normal with one
peak and going down symmetrically on either side, one may use one of mean,
median or mode because for a normal shape distribution, these measures have the
same value.
4.4 PERCENTILES
The concept of percentiles will be explained mainly by using the Table 4.2 data on average
monthly household expenditure. Percentiles are used in two directions, depending
on the question to be answered. One question may be: what per cent
of households have monthly average food expenditure up to Rs. 350.80? Or it may
be: what is the maximum monthly average food expenditure of the lower 50% of
the households? Note, from our earlier computation of the median of the Table 4.2
distribution, that the answer to one question is the figure in the other, i.e., 50%
of the households have Rs. 350.80 as maximum average monthly food expenditure.
Depending on interest, the percentage below a cut-off point may be called for: when
a poverty line is decided, it is of interest to know the percentage below the poverty
line. In the other direction, it may also be of interest to find the status of the lower
10% or upper 5% of the population. These are answered by using what are called
percentiles.
4.4.1 Percentile: Definition and Computation
For any given percentage v, the vth percentile is $P_v$, a value of the variable being
studied, such that at least v percent of the observations are less than or equal to
$P_v$ and at least (100 - v) percent of the observations are greater than or equal
to $P_v$.
For example, for Table 4.1, distribution of household size, $P_v = 5$ for any v
from 78 to 89.
For grouped data, percentiles are more clearly understood when one looks at the
cumulative distribution function. Let F(X) be the proportion of observations less
than or equal to X. Any given value $X_0$ is then the $100 F(X_0)$th percentile. For
the class boundaries of Table 4.2, one has F(286.5) = 0.01, F(310.5) = 0.15, F(334.5)
= 0.31, F(358.5) = 0.59 and F(382.5) = 0.85, and consequently Rs. 286.5 =
$P_1$, Rs. 310.5 = $P_{15}$, Rs. 334.5 = $P_{31}$, Rs. 358.5 = $P_{59}$, and Rs. 382.5 = $P_{85}$.
Note that any amount less than Rs. 262.5 (lower boundary of the first class interval)
is the zero-th percentile and any amount more than Rs. 406.5 (upper boundary of the last
class interval) is the 100th percentile.
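The cumulative distribution function of Table 4.2 can be sketched in code. Between class boundaries the sketch interpolates linearly, i.e., it assumes observations are spread evenly within each class, as in the histogram argument used for the median. The function name `cumulative_proportion` is illustrative.

```python
# Class boundaries and frequencies from Table 4.2.
boundaries = [262.5, 286.5, 310.5, 334.5, 358.5, 382.5, 406.5]
frequencies = [1, 14, 16, 28, 26, 15]


def cumulative_proportion(x):
    """Proportion of observations less than or equal to x, F(x)."""
    total = sum(frequencies)
    if x <= boundaries[0]:
        return 0.0
    if x >= boundaries[-1]:
        return 1.0
    below = 0.0
    for lower, upper, f in zip(boundaries, boundaries[1:], frequencies):
        if x >= upper:
            below += f  # the whole class lies at or below x
        else:
            # x falls inside this class: count the matching share of its frequency
            below += f * (x - lower) / (upper - lower)
            break
    return below / total


print([round(cumulative_proportion(b), 2) for b in boundaries[1:-1]])
print(round(100 * cumulative_proportion(350.8), 1))  # per cent of households up to Rs. 350.80
```

At the class boundaries the function reproduces the values 0.01, 0.15, 0.31, 0.59 and 0.85 quoted above, and it gives about 50 per cent at Rs. 350.80.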
4.4.2 Quartiles and Deciles
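Depending on its use, some specific percentiles go by different names. Every 25th
percentile is called a quartile, and every 10th percentile is called a decile. For
example, the 25th, 50th and 75th percentiles are the first, second and third quartiles
($Q_1$, $Q_2$ and $Q_3$), and the 10th percentile is the first decile.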
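The formulae for $Q_1$ and $Q_3$ are similar to the formula for the median. These
can be directly written as given below.
$$Q_1 = l_{Q_1} + \frac{\frac{N}{4} - C}{f_{Q_1}} \times h,$$
$$Q_3 = l_{Q_3} + \frac{\frac{3N}{4} - C}{f_{Q_3}} \times h,$$
where l and f with the corresponding subscript are the lower limit and the frequency
of the first (or third) quartile class,
C denotes the cumulative frequency of classes preceding the first (or third)
quartile class and h is the corresponding class width.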
Using similar notations, it is possible to write the formula for any partition value.
For example, the formula for the 40th percentile can be written as
$$P_{40} = l_{P_{40}} + \frac{\frac{40N}{100} - C}{f_{P_{40}}} \times h.$$
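A single function can cover all such partition values by replacing N/2 in the median computation with vN/100. The sketch below applies it to Table 4.2; the function name `grouped_percentile` is illustrative, and the interpolation inside a class assumes observations are spread evenly over the class.

```python
# (lower boundary, upper boundary, frequency) for each class of Table 4.2.
classes = [
    (262.5, 286.5, 1),
    (286.5, 310.5, 14),
    (310.5, 334.5, 16),
    (334.5, 358.5, 28),
    (358.5, 382.5, 26),
    (382.5, 406.5, 15),
]


def grouped_percentile(classes, v):
    """vth percentile of grouped data by interpolation inside the relevant class."""
    if not 0 <= v <= 100:
        raise ValueError("v must lie between 0 and 100")
    total = sum(f for _, _, f in classes)
    target = v * total / 100  # cumulative frequency to be reached
    cumulative = 0            # cumulative frequency of the preceding classes (C)
    for lower, upper, f in classes:
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("distribution has no observations")


q1 = grouped_percentile(classes, 25)   # first quartile
q3 = grouped_percentile(classes, 75)   # third quartile
p40 = grouped_percentile(classes, 40)  # 40th percentile (fourth decile)
print(round(q1, 2), round(q3, 2), round(p40, 2))
```

For these data the first quartile falls in the class 310.5 - 334.5, the third quartile in 358.5 - 382.5 and the 40th percentile in 334.5 - 358.5; calling the function with v = 50 gives the median.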
Just as one does not get a complete picture of a distribution by looking at a measure
of location, too many percentiles may be needed to describe the spread or
dispersion of a distribution. It is felt that there should be some simple measures
of dispersion. This is the topic of discussion of the next unit.
Check Your Progress 2
1) Given below are the prices in ratios for five commodities with the
corresponding weights. Calculate the Weighted Arithmetic Mean and
Geometric Mean.
Find at the time of marriage (i) the average age, (ii) modal age, (iii) the median
age, (iv) third quartile, (v) sixth decile, (vi) nineteenth percentile.
4) In a factory, a mechanic takes 15 days to fabricate a machine, the second
mechanic takes 18 days, the third mechanic takes 30 days and the fourth
mechanic takes 90 days. Find the average number of days taken by
the workers to fabricate the machine. Which average would you use, and
why?
5) The amount of interest paid on each of the three different sums of money
yielding 10%, 12% and 15% simple interest per annum are equal. What
is the average yield percent on the total sum invested?
4.6 KEY WORDS
Arithmetic Mean: Sum of observed values of a set divided by the number of
observations in the set is called a mean or an average.
Geometric Mean: It is the mean of n values of a variable computed as the nth
root of their product.
Harmonic Mean: It is the inverse of the arithmetic mean of the reciprocals of
the values of a variable.
3) (i) 25.83 years (ii) 24.82 years (iii) 24.86 years (iv) 27.30 years
(v) 25.59 years (vi) 28.79 years
4) Arithmetic Mean, 38.25 days
5) Harmonic Mean, 12%.