Topic 11 - Measures of Dispersion
• Different sets of observations may have the same average and yet differ widely from each other in a number of ways.
• So, we cannot comment on the nature of distributions by just looking at the averages.
Illustration
        Series A   Series B   Series C
           15         11          3
           15         12          6
           15         13          9
           15         14         12
           15         15         15
           15         16         18
           15         17         21
           15         18         24
           15         19         27
Total     135        135        135
Mean       15         15         15
Measures of Dispersion
• All three series – A, B and C – have the same size and the same mean.
• However, the three series are fundamentally different: Series A shows no variability, Series B is slightly dispersed, and Series C is relatively more dispersed.
Common measures of dispersion:
• Range
• Mean Deviation
• Standard Deviation
• Gini Coefficient
Measures of Dispersion
• Absolute Measures: Expressed in the same units as the original data. Limitation: not suitable for comparing the variability of two distributions which are expressed in different units of measurement.
• Relative Measures: Expressed as ratios or percentages and are thus pure numbers, independent of the units of measurement. They are therefore suitable for comparing the variability of two distributions which are expressed in different units of measurement.
1. RANGE
Range is defined as the difference between the two extreme observations of the
distribution.
• To find the range of marks, the highest and lowest values need to be found from the table. The highest coursework mark was 48 and the lowest was 23, giving a range of 25.
• In the examination, the highest mark was 47 and the lowest 12 producing a range
of 35. This indicates that there was wider variation in the students’ performance in
the examination than in the coursework for this module.
Relative Measure of Range:
Coefficient of Range = (Maximum Observation − Minimum Observation) / (Maximum Observation + Minimum Observation)
Let X denote the original series, and let Xmax and Xmin be the maximum and minimum observations of the original series respectively. Let U be the new series obtained by multiplying all the elements of the series by A (change of scale) and adding B to all the elements (change of origin). Then Umax = A·Xmax + B and Umin = A·Xmin + B (for A > 0), so Range(U) = |A|·(Xmax − Xmin): the range is unaffected by a change of origin but is multiplied by the change of scale.
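The effect of a change of origin and scale on the range can be checked numerically; a minimal sketch (the series and the constants A and B here are arbitrary illustrative choices, not from the slides):

```python
# How a change of scale and origin, U = A*X + B, affects the range.
# The range ignores B (change of origin) but is scaled by |A|.

def value_range(values):
    """Absolute range: maximum observation minus minimum observation."""
    return max(values) - min(values)

X = [11, 12, 13, 14, 15, 16, 17, 18, 19]   # Series B from the illustration
A, B = 2, 100                              # hypothetical scale and origin shift
U = [A * x + B for x in X]

print(value_range(X))   # 8
print(value_range(U))   # |A| * 8 = 16
```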
Demerits
• Not based on the entire dataset, but only on the two extreme observations. If one of these is exceptionally high or low (an outlier), the result is a range that is not typical of the variability within the dataset. For example, in the previous example, if one student was awarded a mark of zero in the coursework, the range for the coursework marks would become 48 (48 − 0) rather than 25; this new range is not typical of the dataset as a whole and is distorted by the outlier in the coursework marks.
• In case of a frequency distribution, the frequencies of the various values of the
distribution are immaterial, since range depends only on the two extreme
observations.
• Range is not suitable for further mathematical treatment.
2. QUARTILE DEVIATION or SEMI-INTERQUARTILE RANGE
• In order to reduce the problems caused by outliers in a dataset, the inter-quartile
range is often calculated instead of the range.
• The inter-quartile range is a measure that indicates the extent to which the central
50% of values within the dataset are dispersed.
Interquartile range: 𝑄3 − 𝑄1
Example: Consider the following frequency distribution.
Class Interval   Frequency
0-15                 8
15-30               26
30-45               30
45-60               45
60-75               20
75-90               17
90-105               4
Class Interval Frequency Cumulative Frequency
0-15 8 8
15-30 26 34
30-45 30 64
45-60 45 109
60-75 20 129
75-90 17 146
90-105 4 150
N = 150, so N/4 = 37.5 and Q1 lies in the class 30-45; 3N/4 = 112.5 and Q3 lies in the class 60-75.
Q1 = 30 + ((37.5 − 34)/30) × 15 = 31.75
Q3 = 60 + ((112.5 − 109)/20) × 15 = 62.625
• Inter-quartile Range = Q3 − Q1 = 62.625 − 31.75 = 30.875
• Quartile Deviation = (Q3 − Q1)/2 = 30.875/2 = 15.44
• Coefficient of Quartile Deviation = (Q3 − Q1)/(Q3 + Q1) = 30.88/94.38 = 0.33
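The grouped-data interpolation used above can be sketched in code; the helper below is my own, assuming the standard formula Qk = L + ((kN/4 − cf)/f) × h for the quartile class:

```python
# Grouped-data quartiles by linear interpolation within the quartile class.

def grouped_quartile(bounds, freqs, k):
    """k-th quartile (k = 1, 2, 3) of a grouped frequency distribution.

    bounds: class lower bounds plus the final upper bound.
    freqs:  frequency of each class.
    """
    n = sum(freqs)
    target = k * n / 4          # kN/4
    cum = 0                     # cumulative frequency before current class
    for i, f in enumerate(freqs):
        if cum + f >= target:   # quartile falls in this class
            L = bounds[i]
            h = bounds[i + 1] - bounds[i]
            return L + (target - cum) / f * h
        cum += f

bounds = [0, 15, 30, 45, 60, 75, 90, 105]
freqs = [8, 26, 30, 45, 20, 17, 4]           # N = 150

q1 = grouped_quartile(bounds, freqs, 1)       # 31.75
q3 = grouped_quartile(bounds, freqs, 3)       # 62.625
print(q1, q3, (q3 - q1) / 2)                  # quartile deviation = 15.4375
```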
BOXPLOTS (also called Box and Whisker Plots)
Boxplots (also called box and whisker plots) are a standardized way of displaying the
distribution of data based on a five number summary (“minimum”, first quartile (Q1),
median, third quartile (Q3), and “maximum”).
• Boxplot for a Symmetric Distribution: Median lies halfway between Q1 and Q3
Q3 − Q2 = Q2 − Q1
⇒ Q3 + Q1 = 2·Q2
⇒ Q2 = (Q1 + Q3)/2
Boxplot for a Positively Skewed Distribution
• Since the mass of the distribution is on the left, the median, i.e. Q2, is closer to the first quartile than to the third quartile.
• So, Q3 − Q2 > Q2 − Q1
⇒ Q3 + Q1 > 2·Q2
⇒ Q2 < (Q1 + Q3)/2
Boxplot for a Negatively Skewed Distribution
• Since the mass of the distribution is on the right, the median, i.e. Q2, is closer to the third quartile than to the first quartile.
• So, Q3 − Q2 < Q2 − Q1
⇒ Q3 + Q1 < 2·Q2
⇒ Q2 > (Q1 + Q3)/2
BOXPLOTS with OUTLIERS
• A boxplot can be made more informative by explicitly showing the outliers.
• An outlier is an observation that is numerically distant from the rest of the data.
• Arithmetically, an outlier is defined as a datapoint that lies more than 1.5 times the Interquartile Range below Q1 or more than 1.5 times the Interquartile Range above Q3.
• Some statistical packages distinguish between mild and extreme outliers. An outlier is extreme if it lies more than 3 times the Interquartile Range below Q1 or above Q3; it is mild otherwise.
• Mild outliers are represented by closed circles and extreme outliers by open circles.
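The fence rule above can be sketched as a small helper (the function name and the example quartiles are mine, purely illustrative):

```python
# Classify a datapoint against the 1.5*IQR (mild) and 3*IQR (extreme) fences.

def classify(x, q1, q3):
    """Return 'extreme', 'mild', or 'not an outlier' for datapoint x."""
    iqr = q3 - q1
    if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
        return "extreme"
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
        return "mild"
    return "not an outlier"

# Hypothetical quartiles Q1 = 2, Q3 = 4 (IQR = 2):
print(classify(11, q1=2, q3=4))   # extreme (beyond 4 + 3*2 = 10)
print(classify(8, q1=2, q3=4))    # mild (beyond 4 + 1.5*2 = 7, within 10)
print(classify(3, q1=2, q3=4))    # not an outlier
```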
BOXPLOTS with OUTLIERS: ILLUSTRATION
Median = Q2 = 92.17
IQR = Q3 − Q1 = 167.79 − 45.64 = 122.15
Since none of the observations are negative, there are no outliers at the lower end of the data.
The whiskers in the boxplot extend out to the smallest observation, 9.69, on the low end and to 312.45, the largest observation that is not an outlier, on the upper end. The data is positively skewed, since the median line is somewhat closer to the left edge of the box than to the right edge.
BOXPLOTS: NUMERICAL
Q. Blood infection concentration (mg/L) was determined both for a sample of
individuals who had died from excited delirium (ED) and for a sample of those who
had died without excited delirium:
ED (27 observations)
0 0 0 0 .1 .1 .1 .1 .2 .2 .3 .3 .3 .4 .5 .7 .8 1.0 1.5 2.7 2.8 3.5 4.0 8.9 9.2 11.7 21.0
The median is the 14th observation = .4 (since n = 27).
For the first quartile, we need the median of the lower half of the data (including the median, since n is odd):
Lower half including the median: 0 0 0 0 .1 .1 .1 .1 .2 .2 .3 .3 .3 .4
Q1 = (7th Observation + 8th Observation)/2 = (.1 + .1)/2 = .1
Upper half including the median: .4 .5 .7 .8 1.0 1.5 2.7 2.8 3.5 4.0 8.9 9.2 11.7 21.0
Median of this 14-observation set = ((n/2)th Observation + (n/2 + 1)th Observation)/2 = (7th Observation + 8th Observation)/2 = (2.7 + 2.8)/2 = 2.75 = Q3
Sample 2: Non-ED (50 observations)
0 0 0 0 0 .1 .1 .1 .1 .2 .2 .2 .3 .3 .3 .4 .5 .5 .6 .8 .9 1.0 1.2 1.4 1.5 1.7 2.0 3.2 3.5 4.1
4.3 4.8 5.0 5.6 5.9 6.0 6.4 7.9 8.3 8.7 9.1 9.6 9.9 11.0 11.5 12.2 12.7 14.0 16.6 17.8
Median = ((n/2)th Observation + (n/2 + 1)th Observation)/2 = (25th Observation + 26th Observation)/2 = (1.5 + 1.7)/2 = 1.6
For the first quartile, we need the median of the lower half of the data (the first 25 observations): Q1 = 13th Observation = .3
Upper half: 1.7 2.0 3.2 3.5 4.1 4.3 4.8 5.0 5.6 5.9 6.0 6.4 7.9 8.3 8.7 9.1 9.6 9.9 11.0 11.5 12.2 12.7 14.0 16.6 17.8
Q3 = 13th Observation of the upper half = 7.9
OUTLIERS:
Sample 1 (ED):
𝐼𝑄𝑅: 𝑄3 − 𝑄1 = 2.75 − 0.1 = 2.65
Mild outliers are less than .1 - 1.5(2.65) = -3.875 or greater than 2.75 + 1.5(2.65) =
6.725. Extreme outliers are less than .1 - 3(2.65) = -7.85 or greater than 2.75 + 3(2.65)
= 10.7.
So, the two largest observations (11.7, 21.0) are extreme outliers and the next two
largest values (8.9, 9.2) are mild outliers. There are no outliers at the lower end of the
data.
Sample 2 (Non-ED):
𝐼𝑄𝑅: 𝑄3 − 𝑄1 = 7.9 − 0.3 = 7.6
Mild outliers are less than .3 - 1.5(7.6) = -11.1 or greater than 7.9 + 1.5(7.6) = 19.3. Note that there are no mild outliers in the data; hence there cannot be any extreme outliers either.
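The ED-sample quartiles and fences can be verified numerically; this sketch uses the "include the median in each half" convention from the slides:

```python
# Check the ED-sample quartiles and the 1.5*IQR / 3*IQR outlier fences.

ed = [0, 0, 0, 0, .1, .1, .1, .1, .2, .2, .3, .3, .3, .4, .5, .7, .8,
      1.0, 1.5, 2.7, 2.8, 3.5, 4.0, 8.9, 9.2, 11.7, 21.0]

def median(v):
    v = sorted(v)
    n = len(v)
    mid = n // 2
    return v[mid] if n % 2 else (v[mid - 1] + v[mid]) / 2

n = len(ed)                 # 27 (odd), median is the 14th observation
lower = ed[: n // 2 + 1]    # lower half, including the median
upper = ed[n // 2:]         # upper half, including the median
q1, q3 = median(lower), median(upper)
iqr = q3 - q1

print(q1, q3)                                                  # Q1 = 0.1, Q3 = 2.75
print([x for x in ed if x > q3 + 3 * iqr])                     # extreme: 11.7, 21.0
print([x for x in ed if q3 + 1.5 * iqr < x <= q3 + 3 * iqr])   # mild: 8.9, 9.2
```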
COMPARATIVE BOXPLOT:
• The values of the ED data tend to be smaller than those for the Non-ED
data.
PRACTICE QUESTIONS
(The practice questions were presented as figures and are not reproduced here.)
DEMERITS OF QUARTILE DEVIATION
• Not based on all the observations, since it ignores the first 25% and the last 25% of the distribution
3. MEAN DEVIATION
• Mean Deviation is the arithmetic mean of the absolute deviations of the observations from an average A:
Mean Deviation = (1/N) ∑|X − A|
where A can be any of the averages – Mean, Median or Mode
• For a frequency distribution:
Mean Deviation = (1/N) ∑f|X − A|
• Mean Deviation is minimum when taken from the Median, compared to the Mean or the Mode
Relative Measure of Mean Deviation:
Coefficient of Mean Deviation = Mean Deviation about A / A, where A is the average used (Mean, Median or Mode)
Example: Compute the Mean Deviation about the Mean and about the Median for the following distribution.
Class Interval   Frequency
0-10                 8
10-20               12
20-30               10
30-40                8
40-50                3
50-60                2
60-70                7
Class Interval   X    f   Cum. f    fX   |X−29|  f|X−29|  |X−25|  f|X−25|
0-10             5    8      8      40     24      192      20      160
10-20           15   12     20     180     14      168      10      120
20-30           25   10     30     250      4       40       0        0
30-40           35    8     38     280      6       48      10       80
40-50           45    3     41     135     16       48      20       60
50-60           55    2     43     110     26       52      30       60
60-70           65    7     50     455     36      252      40      280
Total                50           1450             800              760
• Mean = ∑fX/N = 1450/50 = 29
• Median = 20 + ((25 − 20)/10) × 10 = 25 (median class 20-30, since N/2 = 25)
• Mean Deviation about Mean = (1/N) ∑f|X − X̄| = 800/50 = 16
• Mean Deviation about Median = (1/N) ∑f|X − Median| = 760/50 = 15.2
As can be seen, Mean Deviation about Median < Mean Deviation about Mean
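The worked example above can be checked with a few lines of code (class midpoints and frequencies taken from the table):

```python
# Mean deviation about the mean (29) and about the median (25)
# for the grouped data in the worked example.

mids  = [5, 15, 25, 35, 45, 55, 65]   # class midpoints
freqs = [8, 12, 10, 8, 3, 2, 7]       # frequencies, N = 50

N = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, mids)) / N   # 1450 / 50 = 29.0

def mean_deviation(about):
    """Mean deviation of the grouped data about the value `about`."""
    return sum(f * abs(x - about) for f, x in zip(freqs, mids)) / N

print(mean)                  # 29.0
print(mean_deviation(29))    # 16.0  (about the mean)
print(mean_deviation(25))    # 15.2  (about the median)
```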
DEMERITS OF MEAN DEVIATION
• It takes the absolute values of the deviations about an average and ignores the signs of the deviations
• Not suitable for further mathematical treatment
4. VARIANCE & STANDARD DEVIATION
• Variance is the mean of the squared deviations of the observations from their mean.
σ² = (1/N) ∑(X − X̄)²
For a frequency distribution: σ² = (1/N) ∑f(X − X̄)²
VARIANCE & STANDARD DEVIATION
• Variance involves squaring the deviations from the mean.
• Squaring has the effect of magnifying larger deviations relative to smaller ones. This means that larger deviations have a more pronounced impact on the overall variance value.
• Large deviations from the mean often represent significant departures from the
central tendency of the data. By giving them a larger weight, we make the variance
more sensitive to outliers or extreme values in the dataset. This sensitivity is
important because outliers can have a substantial impact on the overall variability of
the data, and we want the variance to reflect this impact.
VARIANCE & STANDARD DEVIATION
• Weighted by Magnitude: When you square a value that is farther from the mean, the
resulting squared value is larger than if the value were closer to the mean. This
squaring effect assigns greater weight to observations that are farther from the
mean, making them contribute more to the overall variance.
• Squaring removes the negative sign from deviations. This is important because in
the context of measuring dispersion, we are concerned with how far each
observation is from the mean, not its direction. Squaring ensures that both positive
and negative deviations contribute equally to the overall variance.
• The use of squared deviations simplifies the mathematics involved in calculating
variance. It leads to convenient mathematical properties, such as the ability to
decompose the total variance of a dataset into the sum of variances within
subgroups or components of the data.
VARIANCE & STANDARD DEVIATION
• Suppose we have a dataset of exam scores for a class of students: 85, 88, 90, 92, 95. The mean score is (85 + 88 + 90 + 92 + 95)/5 = 90.
• Now, let's calculate the squared deviations from the mean for each score:
• Squared Deviation for 85: (85 - 90)^2 = 25
• Squared Deviation for 88: (88 - 90)^2 = 4
• Squared Deviation for 90: (90 - 90)^2 = 0
• Squared Deviation for 92: (92 - 90)^2 = 4
• Squared Deviation for 95: (95 - 90)^2 = 25
VARIANCE & STANDARD DEVIATION
• Now, let's calculate the variance using these squared deviations:
Variance = (25 + 4 + 0 + 4 + 25)/5 = 58/5 = 11.6
• In this example, the squared deviations for scores of 85 and 95 contribute more to the variance than the squared deviations for scores of 88, 90, and 92. This is because the squared deviations are larger due to the larger deviations from the mean.
• This illustrates how squaring magnifies larger deviations relative to smaller ones. This property is crucial in understanding why variance is sensitive to extreme values and tends to capture the spread of data more effectively, giving more weight to observations that are farther from the mean.
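The same numbers in runnable form (a short sketch; the score list is the one assumed above):

```python
# Population variance of the example exam scores.

scores = [85, 88, 90, 92, 95]
mean = sum(scores) / len(scores)                 # 90.0
sq_dev = [(x - mean) ** 2 for x in scores]       # [25.0, 4.0, 0.0, 4.0, 25.0]
variance = sum(sq_dev) / len(scores)             # 58 / 5 = 11.6

print(mean, variance, variance ** 0.5)           # standard deviation ≈ 3.41
```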
• The limitation of variance is that it is expressed in squared units. Example: for a distribution of heights measured in inches, σ² is expressed in (inches)².
STANDARD DEVIATION
• Introduced by Karl Pearson
• It is defined as the positive square root of the arithmetic mean of the squared deviations of the observations from their Arithmetic Mean:
σ = √((1/N) ∑(X − X̄)²)
• Value of Standard Deviation will be greater if the observations are scattered widely
away from the mean.
VARIANCE
Q. Initially, there were 9 workers, all being paid a uniform wage. Later, a 10th worker is
added whose wage rate is Rs 20 less than the others.
Compute:
(a) The effect of the addition of the 10th worker on the Mean Wage
(b) Standard Deviation of Wages for the Group of 10 workers.
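The worked-solution slides for this question did not survive extraction; a quick numerical check of the answers (the base wage w below is a hypothetical value — the results do not depend on it):

```python
# Nine workers at a uniform wage w, a tenth at w - 20.
# The mean falls by 20/10 = Rs 2, and the standard deviation works out to Rs 6.

w = 100.0                                  # hypothetical uniform wage
wages = [w] * 9 + [w - 20]                 # 10th worker earns Rs 20 less

mean = sum(wages) / len(wages)             # w - 2
var = sum((x - mean) ** 2 for x in wages) / len(wages)

print(w - mean)       # 2.0  -> mean wage falls by Rs 2
print(var ** 0.5)     # 6.0  -> standard deviation of the 10 wages
```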
VARIANCE
Q. Prove that for two observations a and b, their standard deviation is half the distance
between them.
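The solution slides for this question were likewise lost; a sketch of the standard proof from the definition of standard deviation:

```latex
\bar{X} = \frac{a+b}{2}
\qquad
\sigma^2 = \frac{1}{2}\left[(a-\bar{X})^2 + (b-\bar{X})^2\right]
         = \frac{1}{2}\left[\left(\frac{a-b}{2}\right)^2 + \left(\frac{b-a}{2}\right)^2\right]
         = \left(\frac{a-b}{2}\right)^2
\qquad\Rightarrow\qquad
\sigma = \frac{|a-b|}{2}
```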