4.sheet - Lecture On Central Tendency
Measures of Central Tendency: One number that represents a data set and gives an idea
of the middle quality of the data.
(Measures of Central tendency show the tendency of some central value around which data tends
to cluster)
1. Arithmetic Mean: Arithmetic Mean is the sum of a set of observations (Positive, negative,
zero) divided by the number of observations.
(a) $\bar{x} = \dfrac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \dfrac{\sum_{i=1}^{n} x_i}{n}$, where $x_i$ are the individual values of the items
Solution: A.M. $= \dfrac{10 + 20 + 30 + 50 + 60}{5} = \dfrac{170}{5} = 34$
(b) Discrete series: when the actual sizes of the items (individual measurements) and the
corresponding frequencies are given,
$\bar{x} = \dfrac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$
Size of item (in inches), x_i:   2    3    6    7    10
Frequency, f_i:                  3    8    5    10   14
Solution:
$\sum_{i=1}^{n} f_i x_i = 270$ and $\sum_{i=1}^{n} f_i = 40$, so
A.M. $= \dfrac{270}{40} = 6.75$ inches
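The discrete-series calculation above can be checked with a short Python sketch. The function name `weighted_mean` is an illustrative choice and does not come from the lecture.

```python
def weighted_mean(values, freqs):
    """Arithmetic mean of a discrete series: sum(f * x) / sum(f)."""
    total_fx = sum(f * x for x, f in zip(values, freqs))
    total_f = sum(freqs)
    return total_fx / total_f


sizes = [2, 3, 6, 7, 10]     # size of item (inches)
freqs = [3, 8, 5, 10, 14]    # corresponding frequencies

am = weighted_mean(sizes, freqs)
print(am)  # 6.75
```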
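(c) Grouped data: i.e. when class intervals and frequencies are given,
$\bar{x} = \dfrac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$, where $x_i$ is the mid-value of the $i$-th class
With $\sum_{i=1}^{n} f_i x_i = 369$ and $\sum_{i=1}^{n} f_i = 50$,
A.M. $= \dfrac{369}{50} = 7.38$ cwt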
H.W. 6: The following table gives the distribution of weekly wages of workers in a factory.
Calculate the arithmetic mean of the distribution.
Weekly wages:     240-269   270-299   300-329   330-359   360-389
No. of workers:   7         19        27        15        18
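2. Geometric Mean: The G.M. ($G$) of $n$ positive values $x_1, x_2, x_3, \dots, x_n$ is defined as the positive $n$-th root of the
product of the $n$ numbers:
$G = (x_1 \cdot x_2 \cdot x_3 \cdots x_n)^{1/n} = \left(\prod_{i=1}^{n} x_i\right)^{1/n} = \sqrt[n]{x_1 x_2 x_3 \cdots x_n}$
For frequency, $G = \sqrt[N]{x_1^{f_1} x_2^{f_2} x_3^{f_3} \cdots x_n^{f_n}}$, where $N = \sum f_i$
Taking logarithms, $\log G = \dfrac{1}{n} \sum_{i=1}^{n} \log x_i$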
The latter expression states that the log of the geometric mean is the arithmetic mean of the logs
of the numbers.
[Practical definition: The average of the logarithmic values of a data set, converted back
to a base 10 number.]
Because the geometric mean is based on log values and the log transformation tends to draw
extreme values toward the center of the data, the geometric mean is more "robust" than the
arithmetic mean. "Robust" here means less influenced by outliers.
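The practical definition (average the logs, then convert back) and the robustness remark can be illustrated with a minimal Python sketch. The data values and the function name `geometric_mean_via_logs` are hypothetical, chosen only to show the effect of one extreme value.

```python
import math


def geometric_mean_via_logs(values):
    """Average the base-10 logs, then convert back with 10 ** mean."""
    logs = [math.log10(v) for v in values]   # requires all values > 0
    return 10 ** (sum(logs) / len(logs))


data = [10, 12, 11, 13, 1000]   # hypothetical values; 1000 is the outlier
am = sum(data) / len(data)
gm = geometric_mean_via_logs(data)

# The outlier pulls the arithmetic mean far above the bulk of the data,
# while the geometric mean stays much closer to it.
print(round(am, 2), round(gm, 2))
```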
Grouped data: $G = \sqrt[N]{x_1^{f_1} x_2^{f_2} x_3^{f_3} \cdots x_n^{f_n}}$
Where $x_i$ = the central value of the $i$-th class interval,
$f_i$ = the frequency of the respective class, and $N = \sum f_i$
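3. Harmonic Mean: H.M. $= \dfrac{1}{\dfrac{1}{n}\left(\dfrac{1}{x_1} + \dfrac{1}{x_2} + \dots + \dfrac{1}{x_n}\right)} = \dfrac{n}{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \dots + \dfrac{1}{x_n}}$
For frequency,
$H = \dfrac{N}{\dfrac{f_1}{x_1} + \dfrac{f_2}{x_2} + \dfrac{f_3}{x_3} + \dots} = \dfrac{N}{\sum \dfrac{f}{x}}$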
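AM ≥ GM ≥ HM
For the values 3, 9, 11:
AM $= \dfrac{3 + 9 + 11}{3} = 7.67$
GM $= \sqrt[3]{3 \cdot 9 \cdot 11} = 6.67$
HM $= \dfrac{3}{\dfrac{1}{3} + \dfrac{1}{9} + \dfrac{1}{11}} = 5.60$
So AM ≥ GM ≥ HM is verified for these values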
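The three means for the values 3, 9, 11 can be reproduced with Python's standard `statistics` module (`geometric_mean` needs Python 3.8 or later). This is only a numerical check of the worked example, not a proof of the inequality.

```python
from statistics import mean, geometric_mean, harmonic_mean

values = [3, 9, 11]

am = mean(values)             # arithmetic mean
gm = geometric_mean(values)   # cube root of 3 * 9 * 11
hm = harmonic_mean(values)    # 3 / (1/3 + 1/9 + 1/11)

print(round(am, 2), round(gm, 2), round(hm, 2))  # 7.67 6.67 5.6
print(am >= gm >= hm)                            # True
```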
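Median: Consider the following wages (in Taka), arranged in order:
1580, 1600, 1606, 1640, 1650, 1660, 1690
Median = size of the $\dfrac{n+1}{2}$-th observation = $\dfrac{7+1}{2}$-th observation = 4th observation
Here median wage = Taka 1640.
For an even number of observations, e.g. $n = 8$,
Median = size of the $\dfrac{8+1}{2}$-th observation = 4.5th observation, i.e. the average of the 4th and 5th observations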
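A short Python sketch of both cases, using `statistics.median`. The odd case uses the seven wages from the example; for the even case an eighth wage of 1700 is appended, which is a hypothetical value added only for illustration.

```python
from statistics import median

wages = [1580, 1600, 1606, 1640, 1650, 1660, 1690]
median_odd = median(wages)         # the 4th of 7 ordered values
print(median_odd)                  # 1640

wages_even = wages + [1700]        # 1700 is hypothetical
median_even = median(wages_even)   # average of the 4th and 5th values
print(median_even)                 # 1645.0
```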
Grouped data:
Median $= L + \dfrac{\dfrac{N}{2} - \text{p.c.f.}}{f} \times i$
Where,
L = lower limit of the median class
p.c.f. = cumulative frequency of the class preceding the median class
f = frequency of the median class
i = class interval (width) of the median class
Median class = the class for which CF (cumulative frequency) $\geq \dfrac{N}{2}$
Ex. 1500 workers are working in an industrial establishment. Their ages are classified as follows:
Age (yrs)   No. of workers   Age (yrs)   No. of workers
18-22       120              38-42       184
22-26       125              42-46       162
26-30       280              46-50       86
30-34       260              50-54       75
34-38       155              54-58       53
Solution:
Age group Frequency (f i ) Cumulative frequency(c.f)
18-22 120 120
22-26 125 245
26-30 280 525
30-34 260 785
34-38 155 940
38-42 184 1124
42-46 162 1286
46-50 86 1372
50-54 75 1447
54-58 53 1500
So, Median = size of the $\dfrac{N}{2}$-th observation = size of the $\dfrac{1500}{2}$-th observation = 750th observation
Hence the median lies in the class 30-34.
Median $= L + \dfrac{\dfrac{N}{2} - \text{p.c.f.}}{f} \times i = 30 + \dfrac{750 - 525}{260} \times 4 = 30 + 3.46 = 33.46$ (Answer)
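The grouped-data median formula can be sketched in Python as follows. The function name `grouped_median` and its argument layout are assumptions made for this illustration; it assumes equal class widths and takes the median class to be the first class whose cumulative frequency reaches N/2.

```python
def grouped_median(lower_limits, width, freqs):
    """Median = L + (N/2 - p.c.f.) / f * i for equal-width classes."""
    half = sum(freqs) / 2
    pcf = 0                                  # cumulative frequency so far
    for lower, f in zip(lower_limits, freqs):
        if pcf + f >= half:                  # first class with c.f. >= N/2
            return lower + (half - pcf) / f * width
        pcf += f
    raise ValueError("frequencies must not be empty")


lower_limits = [18, 22, 26, 30, 34, 38, 42, 46, 50, 54]
freqs = [120, 125, 280, 260, 155, 184, 162, 86, 75, 53]

med = grouped_median(lower_limits, width=4, freqs=freqs)
print(round(med, 2))  # 33.46
```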
Mode: Mode is the most frequently occurring value in a frequency distribution.
Ex. To find the mode of 11,3,5,11,7,3,11
Here, mode=11
For example, consider the ages (in years) of some children surveyed in a small locality:
5, 2, 2, 8, 7, 6, 5, 4, 3, 4, 5, 2, 2
In the above example, the age 2 years is recorded 4 times (the maximum). So 2 years is the mode
of the distribution of the children's ages.
90.00, 90.00, 95.00, 100.00, 100.00, 90.00, 100.00, 110.00, 115.00, 125.00
Distribution of the students according to their grade point average in an examination:
2.5, 3.0, 3.0, 3.5, 3.8, 3.0, 3.0, 2.5, 3.8, 4.0, 4.0, 4.0, 4.0, 2.5, 2.5
In this distribution Mo = 2.5, Mo = 3.0 and Mo = 4.0, since each of the values 2.5, 3.0 and 4.0 occurs
4 times in the distribution. Such a distribution is known as a multimodal distribution.
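The modes in the examples above can be found with `statistics.multimode` (Python 3.8 or later), which returns every value that attains the highest frequency. This is a small sketch using the ages and grade point averages listed in the examples.

```python
from statistics import multimode

ages = [5, 2, 2, 8, 7, 6, 5, 4, 3, 4, 5, 2, 2]
age_modes = multimode(ages)
print(age_modes)   # [2]

gpa = [2.5, 3.0, 3.0, 3.5, 3.8, 3.0, 3.0, 2.5, 3.8,
       4.0, 4.0, 4.0, 4.0, 2.5, 2.5]
gpa_modes = multimode(gpa)
print(gpa_modes)   # [2.5, 3.0, 4.0]
```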
Example: The following data represent the distribution of female workers in different garment
industries according to their monthly salary (in taka).
Solution: (i) The largest group of female workers is the 250 workers whose salary lies in the range 800-900.
Their salary, on average, is given by the mode:
$M_0 = l + \dfrac{h(f_1 - f_0)}{2f_1 - f_0 - f_2} = 800 + \dfrac{100(250 - 65)}{2 \times 250 - 65 - 175} = 871.15$ tk,
where $l = 800$, $h = 100$, $f_1 = 250$, $f_0 = 65$, $f_2 = 175$
(ii) From the c.f. it is observed that the salary of 540 female workers is less than 1000.00 tk.
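The grouped-data mode formula used in part (i) can be written as a small Python function. The name `grouped_mode` is hypothetical; the parameter names follow the symbols in the solution (l, h, f1, f0, f2).

```python
def grouped_mode(l, h, f1, f0, f2):
    """Mode = l + h * (f1 - f0) / (2*f1 - f0 - f2).

    l  : lower limit of the modal class
    h  : width of the modal class
    f1 : frequency of the modal class
    f0 : frequency of the class before it
    f2 : frequency of the class after it
    """
    return l + h * (f1 - f0) / (2 * f1 - f0 - f2)


mo = grouped_mode(l=800, h=100, f1=250, f0=65, f2=175)
print(round(mo, 2))  # 871.15
```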
Example: The following data represent the distribution of chicken by weight after 7
weeks of birth:
Class interval of    No. of chicken,   c.f.   Mid-value,   f_i X_i
weight (in kg)       f_i                      X_i
0.6-0.8              25                25     0.7          17.5
0.8-1.0              22                47     0.9          19.8
1.0-1.2              18                65     1.1          19.8
1.2-1.4              12                77     1.3          15.6
1.4-1.6              12                89     1.5          18.0
1.6-1.8              11                100    1.7          18.7
Total                100                                   109.4
Find the Weight, on average, of the maximum number of the chicken.
Solution: Since the maximum frequency is in the first class, the mode is ill defined.
However, Mean $= \dfrac{1}{N} \sum f_i X_i = \dfrac{109.4}{100} = 1.094$
$Me = l + \dfrac{h}{f}\left(\dfrac{N}{2} - c\right) = 1.0 + \dfrac{0.2}{18}(50 - 47) = 1.03$
The weight, on average, of the maximum number of chicken is the mode of the distribution.
The mode is given by Mo = 3 Median - 2 Mean $= 3(1.03) - 2(1.094) = 0.902$
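The chicken-weight calculation can be reproduced with the Python sketch below; the variable names are illustrative. Note that the solution rounds the median to 1.03 before substituting it, which gives 0.902; with the unrounded median the same relation gives about 0.912.

```python
lower = [0.6, 0.8, 1.0, 1.2, 1.4, 1.6]   # lower class limits (kg)
mids = [0.7, 0.9, 1.1, 1.3, 1.5, 1.7]    # mid-values X_i
freqs = [25, 22, 18, 12, 12, 11]         # frequencies f_i
h = 0.2                                  # class width

N = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, mids)) / N

# Median class: first class whose cumulative frequency reaches N/2.
c = 0
for l, f in zip(lower, freqs):
    if c + f >= N / 2:
        median = l + h / f * (N / 2 - c)
        break
    c += f

mode_as_in_notes = 3 * round(median, 2) - 2 * mean   # median rounded to 1.03
mode_unrounded = 3 * median - 2 * mean

print(round(mean, 3), round(median, 2))
print(round(mode_as_in_notes, 3), round(mode_unrounded, 3))
```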
Outlier:
A number that is very far away from the rest of the data set. I.e., a value that is much greater
or much less than most of the other numbers
An outlier moves the average from the middle of the cluster of the rest of the data points.
X: -2, 4, 8
$G = (-2 \cdot 4 \cdot 8)^{1/3}$
If we have an even number of negative values, e.g. X: -2, 4, -8, then
$G = (-2 \cdot 4 \cdot (-8))^{1/3} = 4$ and $\bar{x} = -2$
But here $G > \bar{x}$, which distorts the relationship AM ≥ GM. So for negative values we should not
calculate the geometric mean.
X: 3, 4, 0, 10
$G = (3 \cdot 4 \cdot 0 \cdot 10)^{1/4} = 0$
$H = \dfrac{4}{\dfrac{1}{3} + \dfrac{1}{4} + \dfrac{1}{0} + \dfrac{1}{10}} = \dfrac{4}{\infty} = 0$
Disadvantage of Geometric & Harmonic Mean:
Geometric & harmonic means cannot be calculated if any observation in the array is zero or negative.
Geometric & harmonic means are used when the observations are given as rates & ratios.
X: 2, 3, 3, 5, 8, 9, 15, 15
Mode = 3 & 15
𝑓2 = ?
Disadvantage of Mode :
1. There may be many modes
2. Mode may not fall at the centre of the array.
3. Mode is not well defined if the modal class is the first class or the last class.
Example:-
Find an appropriate measure of central tendency of the following observations with justification.
Ans :-
1. Here arithmetic mean is not suitable as there is an extreme value 150.
2. Geometric & harmonic means cannot be calculated as there are zero & negative observations.
3. Each observation occurs once. So there is no mode.
4. Therefore median is an appropriate measure of central tendency in this case.
0+5 −10+10
Me = 2 = 2.5 Me = 2 = -5
# Measures of Location
Let us, consider the array as follows
X: 5, 7, 8, 10, 13, 15, 16, 18, 19, 20, 25, 27, 29, 31, 33, 36, 37, 38, 40, 45
N = 20
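$Me = \dfrac{13 + 15}{2} = 14$ (first half of the array)
$Me = \dfrac{20 + 25}{2} = 22.5$ (whole array)
$Me = \dfrac{33 + 36}{2} = 34.5$ (second half of the array)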
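The three values above are the medians of the lower half, the whole array, and the upper half. A minimal Python sketch of this "median of halves" approach is given below; it assumes the array is already sorted and has an even number of observations, as in this example.

```python
from statistics import median

x = [5, 7, 8, 10, 13, 15, 16, 18, 19, 20,
     25, 27, 29, 31, 33, 36, 37, 38, 40, 45]

half = len(x) // 2
lower_median = median(x[:half])   # median of the first 10 values
overall_median = median(x)        # median of all 20 values
upper_median = median(x[half:])   # median of the last 10 values

print(lower_median, overall_median, upper_median)  # 14.0 22.5 34.5
```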
# A measure which is located in different place in the array is called the measure of location.
Measure of central tendency falls at the centre of the array but measure of location
falls in different places in the array.
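$Me = l + \dfrac{h}{f}\left(\dfrac{N}{2} - c\right) = Q_2 = D_5 = P_{50}$
$= l + \dfrac{h}{f}\left(\dfrac{2N}{4} - c\right)$
$= l + \dfrac{h}{f}\left(\dfrac{5N}{10} - c\right)$
$= l + \dfrac{h}{f}\left(\dfrac{50N}{100} - c\right)$
$Q_i = l + \dfrac{h}{f}\left(\dfrac{iN}{4} - c\right)$; $i = 1, 2, 3$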
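Here,
l = lower limit of the $Q_i$ class
h = width of the $Q_i$ class
f = frequency of the $Q_i$ class
c = c.f. of the class before the $Q_i$ class
N = total frequency
* The $Q_i$ class is the first class for which c.f. $\geq \dfrac{iN}{4}$
$D_i = l + \dfrac{h}{f}\left(\dfrac{iN}{10} - c\right)$; $i = 1, 2, \dots, 9$
* The $D_i$ class is the first class for which c.f. $\geq \dfrac{iN}{10}$
$P_i = l + \dfrac{h}{f}\left(\dfrac{iN}{100} - c\right)$; $i = 1, 2, \dots, 99$
* The $P_i$ class is the first class for which c.f. $\geq \dfrac{iN}{100}$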
Let us consider the following frequency distribution:
Class interval   Frequency, f   c.f.
10-12            8              8
12-14            12             20
14-16            20             40
16-18            22             62
18-20            18             80
20-22            12             92
22-24            8              100
N = 100
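1.
$Q_1 = l + \dfrac{h}{f}\left(\dfrac{N}{4} - c\right) = 14 + \dfrac{2}{20}\left(\dfrac{100}{4} - 20\right) = 14 + \dfrac{1}{10} \times 5 = 14.5$
$Q_2 = l + \dfrac{h}{f}\left(\dfrac{2N}{4} - c\right) = 16 + \dfrac{2}{22}\left(\dfrac{2 \times 100}{4} - 40\right) = 16 + \dfrac{1}{11} \times 10 = 16.91$
$Q_3 = l + \dfrac{h}{f}\left(\dfrac{3N}{4} - c\right) = 18 + \dfrac{2}{18}\left(\dfrac{3 \times 100}{4} - 62\right) = 18 + \dfrac{1}{9} \times 13 = 19.44$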
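2.
$D_3 = l + \dfrac{h}{f}\left(\dfrac{3N}{10} - c\right) = 14 + \dfrac{2}{20}\left(\dfrac{3 \times 100}{10} - 20\right) = 14 + \dfrac{1}{10} \times 10 = 15$
$D_7 = l + \dfrac{h}{f}\left(\dfrac{7N}{10} - c\right) = 18 + \dfrac{2}{18}\left(\dfrac{7 \times 100}{10} - 62\right) = 18 + \dfrac{1}{9} \times 8 = 18.89$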
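Quartiles, deciles and percentiles all use the same formula with a different divisor (4, 10 or 100), so one function can cover them. The sketch below is an illustration under that reading; the name `grouped_quantile` and the `parts` parameter are hypothetical, and equal class widths are assumed.

```python
def grouped_quantile(lower_limits, width, freqs, i, parts):
    """l + (h / f) * (i*N/parts - c); parts = 4, 10 or 100."""
    target = i * sum(freqs) / parts
    c = 0                                    # c.f. of the preceding classes
    for l, f in zip(lower_limits, freqs):
        if c + f >= target:                  # first class with c.f. >= i*N/parts
            return l + width / f * (target - c)
        c += f
    raise ValueError("i must not exceed parts")


lower_limits = [10, 12, 14, 16, 18, 20, 22]
freqs = [8, 12, 20, 22, 18, 12, 8]

q1, q2, q3 = (grouped_quantile(lower_limits, 2, freqs, i, 4) for i in (1, 2, 3))
d3 = grouped_quantile(lower_limits, 2, freqs, 3, 10)
d7 = grouped_quantile(lower_limits, 2, freqs, 7, 10)
p50 = grouped_quantile(lower_limits, 2, freqs, 50, 100)

print(round(q1, 2), round(q2, 2), round(q3, 2))  # 14.5 16.91 19.44
print(round(d3, 2), round(d7, 2))                # 15.0 18.89
```

Here `p50` equals `q2`, in line with the identity Me = Q2 = D5 = P50.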
Box Plot
Outlier: An outlier is defined as a value that lies more than 1.5 times the interquartile range below
Q1 or above Q3.
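The 1.5 × IQR rule can be sketched in Python as follows. The function name `outlier_fences` is hypothetical; the data and the quartile values 14 and 34.5 are taken from the 20-value array used earlier under Measures of Location.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


x = [5, 7, 8, 10, 13, 15, 16, 18, 19, 20,
     25, 27, 29, 31, 33, 36, 37, 38, 40, 45]

low, high = outlier_fences(q1=14, q3=34.5)
outliers = [v for v in x if v < low or v > high]

print(low, high, outliers)  # -16.75 65.25 []
```

No value in this array falls outside the limits, so a box plot of it would show no outliers.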
Dispersion
The word "Dispersion" implies the average distance or deviation from a central value.
The absolute measures are
1) The Range (2) The Standard deviation (3) The Mean Deviation (4) The quartile
deviation
The relative measures are
1) The coefficient of variation
2) The coefficient of mean deviation
3) The coefficient of Quartile deviation
4) The coefficient of Range
*** Absolute measures of variation are expressed in the same statistical unit in which the
original data are given, such as rupees, kilograms, tonnes etc. They can be used to compare
dispersion only when the two distributions have the same unit.
*** Relative measures of variation: A measure of relative variation is the ratio of a measure
of absolute variation to an average. It is sometimes called a coefficient of variation, because
“Coefficient” means a pure number that is independent of the unit of measurement.
Range: Range is the difference between the largest and the smallest values.
Limitations:
1) Range is not based on each and every observation.
2) The amount of range is affected by extreme values
3) It is not calculated from frequency distribution with open-end classes.
Mean deviation: MD $= \dfrac{|x_1 - \bar{x}| + |x_2 - \bar{x}| + \dots + |x_n - \bar{x}|}{n} = \dfrac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
Ex. Find the mean deviation of 4, 8, 12, 24
Average $\bar{x}$ = (4 + 8 + 12 + 24)/4 = 12
MD $= \dfrac{|4 - 12| + |8 - 12| + |12 - 12| + |24 - 12|}{4} = \dfrac{24}{4} = 6$
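The mean deviation example can be checked with a few lines of Python. The function name `mean_deviation` is an illustrative choice.

```python
def mean_deviation(values):
    """Average absolute deviation of the values from their arithmetic mean."""
    m = sum(values) / len(values)
    return sum(abs(v - m) for v in values) / len(values)


md = mean_deviation([4, 8, 12, 24])
print(md)  # 6.0
```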
Limitations:
1) It ignores the sign in measuring deviations which is bad from the mathematical point
of view. Hence, it is not good for further algebraic use of the measure.
2) The amount of mean deviation increases with the increase in size of sample.
[NB. If all the numbers in the sample are very close to each other, the standard deviation is
close to zero.]
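Standard deviation:
$\sigma = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$ ← For population
Where, $x_i$ = the individual values
$\mu$ = population mean
$n$ = number of items
$s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$ ← For sample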
For frequencies:
$\sigma = \sqrt{\dfrac{\sum f_i (x_i - \mu)^2}{n}}$ ← For population
Where, $x_i$ = mid value of each class interval (for grouped data) and $n = \sum f_i$
$s = \sqrt{\dfrac{\sum f_i (x_i - \bar{x})^2}{n - 1}}$ ← For sample
Advantage:
1) Based on all the observations.
2) As sign of deviations are not ignored, it is suitable for further use in statistical analysis.
Limitations:
As it is not unit-free, it cannot be used directly to compare the dispersion of two or more distributions.
Ex. Find the standard deviation from a population data on the weekly wages of ten workers working in a
factory.
Wages(TK): 320,310,315,322,326,340,325,321,320,331
$\mu = \dfrac{\sum_{i=1}^{10} x_i}{10} = \dfrac{3230}{10} = 323$ ∴ $\sigma = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}} = \sqrt{\dfrac{622}{10}} = 7.89$
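The wage example can be verified in Python. The sketch computes the sum of squared deviations by hand and then compares the result with `statistics.pstdev`, the standard-library function for the population standard deviation (divisor n).

```python
import math
from statistics import pstdev

wages = [320, 310, 315, 322, 326, 340, 325, 321, 320, 331]

mu = sum(wages) / len(wages)                         # population mean
sum_sq = sum((w - mu) ** 2 for w in wages)           # sum of squared deviations
sigma_by_hand = math.sqrt(sum_sq / len(wages))       # divide by n (population)
sigma = pstdev(wages)

print(mu, sum_sq, round(sigma, 2))
```

For a sample rather than a population, `statistics.stdev` uses the divisor n - 1 instead.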
Quartile deviation:
The inter quartile range is frequently reduced to the measure of semi-inter-quartile range, also known as the
quartile deviation, by dividing it by 2. Thus
$QD = \dfrac{IQR}{2} = \dfrac{Q_3 - Q_1}{2}$
This measure is more meaningful than the range because it is not based on two extreme values.
Coefficient of range:
The coefficient of range is a relative measure corresponding to range and obtained by the following formula
$CR = \dfrac{L - S}{L + S} \times 100$
Where L and S are respectively the largest and smallest observations in the data set. The coefficient of range
is rarely used as a measure of dispersion because of its inherent difficulties in interpretation.
A co-efficient of variation is computed as the ratio of the Standard deviation of the distribution to the
mean of the same distribution.
$CV = \dfrac{S}{\bar{x}} \times 100$
Uses of co-efficient of variation: CV is helpful in comparing the relative variation in several data sets
that have different means and different standard deviation.
For Height, $CV = \dfrac{S}{\bar{x}} = \dfrac{5}{40} = 0.125$
For Weight, $CV = \dfrac{S}{\bar{x}} = \dfrac{2}{10} = 0.2$
So we can say weight variability is greater than height variability.
For Systolic, $CV = \dfrac{S}{\bar{x}} = \dfrac{15}{130} = 0.115$
For Diastolic, $CV = \dfrac{S}{\bar{x}} = \dfrac{8}{60} = 0.133$
Example: Suppose that we wish to obtain some insight into whether height is more variable than weight
in the same population. For this purpose, we have the following data obtained from 150 children in a
community.
Height Weight
Mean 40 inch 10 kg
SD 5 inch 2 kg
CV 12.5% 20.0%
Examination of the respective standard deviations does not tell us in any meaningful way which characteristic
has more variability than the other, because they are measured in different units. If we now compute the
coefficient of variation, the results become comparable. Because the coefficient of variation for weight is greater
than that for height, we conclude that weight has more variability than height in the population.
Even if two variables in the same population are measured in the same unit, the standard deviation may fail to
provide a correct picture of their relative variability. This is illustrated by the example below.
Example: Consider that the blood pressures of a group of patients were measured at two levels: systolic and
diastolic, both being measured in the same unit. The results were as follows:
Systolic Diastolic
Mean 130 mm Hg 60 mm Hg
SD 15 mm Hg 8 mm Hg
CV 11.5% 13.3%
As implied by the standard deviation, systolic pressure is more variable than diastolic pressure. However, in
relative terms, as measured by the CV, the diastolic pressure has the greater variability. This shows that relative
variability is of more concern than absolute variation, hence the importance of the coefficient of variation.
The discussions and examples above tend to demonstrate that coefficient of variation is a very useful
measure when:
1) The data are in different units.
2) The data are in same units but means are far apart.
3) When the data sets involve all or nearly all positive values.
In terms of measuring the variability of spread of data, we've seen that the standard deviation is the preferred
and most used measure.
1. The standard deviation is the typical or average distance of a value from the mean
2. If all values are the same, then the standard deviation is 0
3. The standard deviation is heavily influenced by outliers just like the mean (it uses the mean in its
calculation).
4. The sample standard deviation is denoted with the letter s and the population standard deviation is
denoted with the lower case Greek letter sigma σ.
If your data is more spread out (has more variability) then you will have a higher standard deviation. It's often
difficult to interpret a standard deviation since it's based on the sample of data. Is a standard deviation of 12
high or is a .20 high?
If you know nothing about the data other than the mean, one way to interpret the relative magnitude of the
standard deviation is to divide it by the mean. This is called the coefficient of variation. For example, if
the mean is 80 and standard deviation is 12, the cv = 12/80 = .15 or 15%.
If the standard deviation is .20 and the mean is .50, then the cv = .20/.50 = .4 or 40%. So knowing nothing
else about the data, the CV helps us see that even a lower standard deviation doesn't mean less variable
data.