lecture_4

The document discusses measures of dispersion, including range, standard deviation, and quartile deviation, which assess the variability within a dataset. It emphasizes the importance of understanding dispersion when making decisions, such as choosing between two drugs based on their performance consistency. Additionally, it covers the calculation of variance, standard deviation, and quartiles, along with examples and exercises to illustrate these concepts.

Uploaded by

gift60470
Copyright
© © All Rights Reserved

Measures of Dispersion

Dispersion
• The measure of the spread or variability

• No Variability – No Dispersion
Measures of Variation
• There are 3 values that we will look at to
measure the amount of dispersion or variation.
(The spread of the group)

1. Range
2. Standard Deviation
3. Quartile deviation
Why is it Important?
• You want to choose the best brand of
medicine for your patients. You are
interested in how long each drug takes to
cure a disease. The choices are narrowed
down to 2 different drugs. The table shows
the number of days each drug takes to cure
a particular disease. Which drug would
you choose?

         Drug A   Drug B
           10       35
           60       45
           50       30
           30       35
           40       40
           20       25
Total     210      210
Does the Average Help?
• Drug A: Avg = 210/6 = 35 days

• Drug B: Avg = 210/6 = 35 days

• Both drugs take 35 days on average to cure the disease, so the
average is no help in deciding which to choose.
Consider the Spread
• Drug A: Spread = 60 – 10 = 50 days

• Drug B: Spread = 45 – 25 = 20 days

• Drug B has a smaller variability, which means that it
performs more consistently. Choose drug B.
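The comparison of the two drugs can be checked with a short Python sketch. The data are the days-to-cure values from the table; the helper name `value_range` is only an illustrative choice, not something the lecture defines.

```python
from statistics import mean

# Days to cure, taken from the lecture's table.
drug_a = [10, 60, 50, 30, 40, 20]
drug_b = [35, 45, 30, 35, 40, 25]

def value_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)

for name, days in (("Drug A", drug_a), ("Drug B", drug_b)):
    print(f"{name}: mean = {mean(days)} days, range = {value_range(days)} days")
```

Both drugs have the same mean, so only the spread separates them.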
Range
• The range is the difference between the highest
value in the set and the lowest value in the set.

• Range = highest value − lowest value


Example
• Find the range of the data set.

• 40, 30, 15, 2, 100, 37, 24, 99

• Range = 100 – 2 = 98
Deviation from the Mean
• A deviation from the mean, $x - \bar{x}$, is the difference
between the value of $x$ and the mean $\bar{x}$.
• We base our formulas for variance and standard
deviation on the amount that they deviate from the
mean.
• The mean deviation of a set of observations
$x_1, x_2, \cdots, x_N$ is the mean of the absolute deviations
from the mean and equals
$$\frac{1}{N}\sum_{i=1}^{N} |x_i - \bar{x}|$$
Formulae for sample and population
variances
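Definition formulae (sample variance $s^2$, population variance $\sigma^2$):
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}, \qquad \sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$$
Computation formulae:
$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1}, \qquad \sigma^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N}$$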
Standard Deviation
• The standard deviation is the square root of the
variance.

$$s = \sqrt{s^2}$$
Example – Using Formula
• Find the variance of the following
dataset: 6, 3, 8, 5, 3 (in hours)

$x$     $x^2$
6       36
3       9
8       64
5       25
3       9
$\sum x = 25$     $\sum x^2 = 143$
Example – Using Formula

$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1} = \frac{143 - \frac{25^2}{5}}{4} = \frac{143 - 125}{4} = \frac{18}{4} = 4.5$$
Find the standard deviation
• The standard deviation is the positive square
root of the variance.

$$s = \sqrt{4.5} \approx 2.12 \text{ hours}$$
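As a cross-check on the hand calculation, the following Python sketch evaluates both the computation formula and the definition formula for the same five observations. The variable names are illustrative; `statistics.variance` from the standard library also divides by n − 1 and is printed for comparison.

```python
from math import sqrt
from statistics import variance

hours = [6, 3, 8, 5, 3]
n = len(hours)

sum_x = sum(hours)                   # sum of x   = 25
sum_x2 = sum(x * x for x in hours)   # sum of x^2 = 143

# Computation formula: (sum x^2 - (sum x)^2 / n) / (n - 1)
s2_computation = (sum_x2 - sum_x**2 / n) / (n - 1)

# Definition formula: sum of squared deviations from the mean / (n - 1)
x_bar = sum_x / n
s2_definition = sum((x - x_bar) ** 2 for x in hours) / (n - 1)

s = sqrt(s2_computation)
print(s2_computation, s2_definition, variance(hours), round(s, 2))
```

The two formulas are algebraically equivalent, so they give the same variance; the computation form is usually quicker by hand because it avoids working out each deviation.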
Example: Mean, variance and standard deviation of data

Example 4.1 in Clarke and Cooke (1998), 4th ed.

• In a city there are six professional football clubs. Last season they had
25, 30, 18, 27, 28 and 22 players respectively on their full-time paid
staffs. Find the mean, variance and standard deviation of the number
of full-time paid staffs.

• Let us call the number of full-time paid staff r. It is easier to lay out the
calculation in the form of a table.
Example: Mean, variance and standard deviation of data
Club    $r_i$   $r_i - \bar{r}$   $(r_i - \bar{r})^2$   $r_i^2$
A       25      0                 0                     625
B       30      5                 25                    900
C       18      -7                49                    324
D       27      2                 4                     729
E       28      3                 9                     784
F       22      -3                9                     484
Total   150                       96                    3846

Mean $\bar{r} = 150/6 = 25$     Variance $= 96/6 = 16$

Hence the standard deviation is $\sqrt{16} = 4$.
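Note that this example divides by 6, the number of clubs, so it treats the six clubs as the whole population rather than a sample. A minimal Python sketch of the same calculation, with illustrative variable names, using the standard library's population functions as a check:

```python
from statistics import mean, pstdev, pvariance

players = [25, 30, 18, 27, 28, 22]   # clubs A to F
N = len(players)

r_bar = mean(players)

# Definition form: mean of the squared deviations (divide by N, not N - 1).
var_definition = sum((r - r_bar) ** 2 for r in players) / N

# Computation form: mean of the squares minus the square of the mean.
var_computation = sum(r * r for r in players) / N - r_bar**2

print(r_bar, var_definition, var_computation, pvariance(players), pstdev(players))
```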


Variance and standard deviation of grouped
frequency distribution
Let the table of values of a discrete variable be:

Variable values   $r_1$   $r_2$   $r_3$   ⋯   $r_k$   Total
Frequency         $f_1$   $f_2$   $f_3$   ⋯   $f_k$   $\sum_{i=1}^{k} f_i = N$

The total of all observations is equal to $\sum_{i=1}^{k} f_i r_i$.


Variance and standard deviation of grouped
frequency distribution
The variance of a set of N observations of a discrete variable, grouped so
that the values $r_i$ ($i = 1, 2, \cdots, k$) occur with frequencies $f_i$, is

$$\frac{1}{N}\sum_{i=1}^{k} f_i (r_i - \bar{r})^2 \quad\text{or}\quad \frac{1}{N}\left[\sum_{i=1}^{k} f_i r_i^2 - \frac{\left(\sum_{i=1}^{k} f_i r_i\right)^2}{N}\right]$$

The expression in the large brackets may be stated in words as “the
sum of the squares of all the observations minus the total squared and
divided by N”.
Example: Mean, variance and standard deviation of grouped
discrete frequency distribution
Number of people   Number of days,   $f_i r_i$   $f_i r_i^2$
absent, $r_i$      $f_i$
0                  44                0           0
1                  19                19          19
2                  10                20          40
3                  8                 24          72
4                  7                 28          112
5                  3                 15          75
6 or more          0                 0           0
Total              $N = 91$          106         318
Variance and standard deviation of grouped
frequency distribution
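The mean is
$$\bar{r} = \frac{\sum f_i r_i}{N} = \frac{106}{91} = 1.165$$
The variance is
$$\frac{1}{N}\left[\sum_{i=1}^{k} f_i r_i^2 - \frac{\left(\sum_{i=1}^{k} f_i r_i\right)^2}{N}\right] = \frac{1}{91}\left[318 - \frac{106^2}{91}\right] = \frac{194.53}{91} = 2.1377$$
The standard deviation is $\sqrt{2.1377} = 1.46$.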
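The grouped calculation can be reproduced with a few lines of Python. This is a sketch with illustrative names; the frequencies are those in the absence table, and the products $f_i r_i$ sum to 106, the value used for the mean.

```python
from math import sqrt

# (number absent r_i, number of days f_i) from the lecture's table;
# the "6 or more" class is left out because its frequency is 0.
table = [(0, 44), (1, 19), (2, 10), (3, 8), (4, 7), (5, 3)]

N = sum(f for _, f in table)                  # sum of f_i
total = sum(f * r for r, f in table)          # sum of f_i * r_i
total_sq = sum(f * r * r for r, f in table)   # sum of f_i * r_i^2

mean = total / N
variance = (total_sq - total**2 / N) / N      # divides by N, as in the lecture
std_dev = sqrt(variance)

print(N, total, total_sq, round(mean, 3), round(variance, 4), round(std_dev, 2))
```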
Variance and standard deviation of grouped
observations of a continuous variable
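The variance of a set of N observations of a continuous variable, in
which $f_i$ observations fall in the interval whose centre is $x_i$ ($i = 1, 2, \cdots, k$), is
$$\frac{1}{N}\sum_{i=1}^{k} f_i (x_i - \bar{x})^2 \quad\text{or}\quad \frac{1}{N}\left[\sum_{i=1}^{k} f_i x_i^2 - \frac{\left(\sum_{i=1}^{k} f_i x_i\right)^2}{N}\right]$$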
The semi-inter-quartile range
(or quartile deviation)
• The variance, the standard deviation and the mean deviation go
naturally with the mean.

• They are based on deviations from the mean, and the averaging
process is the same as that for calculating the mean.

• We are going to look at a measure of variability that is based on the
rank order of a set of observations, and is therefore related to the
median.

• The range, which we mentioned briefly in earlier lessons, is also based
on the rank order of a set of observations.
Example
• An airline flies a daily service between two cities, using a 100-seat
aircraft for the flight. The numbers of seats sold on the first fifteen
days of September are 87, 67, 98, 57, 74, 100, 83, 60, 99, 88, 54, 72,
78, 75, 93 in date order.
• Let us arrange these observations in rank order:

Rank order              1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Number of seats sold   54  57  60  67  72  74  75  78  83  87  88  93  98  99 100
Quartiles               .   .   .  Q1   .   .   .   M   .   .   .  Q3   .   .   .
Example
• The median M divides the rank order into two equal parts.
• The first or lower quartile Q1 divides the lower half of the rank
order into two equal parts.
• The third or upper quartile Q3 divides the upper half of the rank
order into two equal parts.
Example
• We could (though we rarely do!) call the median M the second
quartile and label it Q2.
• The point is that Q1, M, Q3 divide the distribution of ranked
observations into quarters, and this is the reason for calling them quartiles.
Definitions
Quartiles: The quartiles of a set of observations are the values below
which fall 25%, 50% and 75% of the observations as arranged in rank
order.
These are called respectively the first, second and third quartiles and
are denoted by 𝑄1 , 𝑄2 or 𝑀, and 𝑄3 .
Definitions
Inter-quartile range (mid-spread): The inter-quartile range, also known
as the mid-spread, is the difference between the upper and lower quartiles.
In our example this is $Q_3 - Q_1 = 93 - 67 = 26$.

Semi-inter-quartile range (quartile deviation): The semi-inter-quartile range,
also known as the quartile deviation, of a set of observations equals
$\frac{1}{2}(Q_3 - Q_1)$.
In our example above, this is $\frac{1}{2}(Q_3 - Q_1) = \frac{1}{2} \times 26 = 13$.
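The quartiles of the September airline data can be found with a short Python sketch. It assumes the median-of-halves rule, leaving the median out of both halves when the number of observations is odd, which matches the ranks 4, 8 and 12 marked in the example. Other quartile conventions exist and can give slightly different values for other data sets. The function name `quartiles` is illustrative.

```python
from statistics import median

seats = [87, 67, 98, 57, 74, 100, 83, 60, 99, 88, 54, 72, 78, 75, 93]

def quartiles(values):
    """Return (Q1, M, Q3) by the median-of-halves rule.

    Assumption: with an odd number of observations the median itself is
    excluded from both halves. Needs at least two observations.
    """
    ranked = sorted(values)
    half = len(ranked) // 2
    lower = ranked[:half]     # observations ranked below the median
    upper = ranked[-half:]    # observations ranked above the median
    return median(lower), median(ranked), median(upper)

q1, m, q3 = quartiles(seats)
iqr = q3 - q1          # inter-quartile range
semi_iqr = iqr / 2     # semi-inter-quartile range (quartile deviation)
print(q1, m, q3, iqr, semi_iqr)
```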
Exercise (Clarke & Cooke, 4th Ed.) 4.6.1
In the first fifteen days of December in the same year, on the same air
service as described above, the numbers of seats sold were 38, 75, 50,
84, 62, 46, 96, 67, 33, 55, 65, 42, 83, 70, 49 in date order. Find the
median, quartiles and semi-inter-quartile range for these observations.
Plot September and December observations on a sheet of graph-paper,
mark in the measures you have calculated, and compare them. Work
out also the range, mean deviation and standard deviation for each
month’s data.
Box-and-whisker plot
• J.W. Tukey introduced the box-and-whisker plot as well as the stem-
and-leaf plot seen before.

Example: Consider the “Airline’s seats sold” data.


• On a linear scale (see figure on next slide), make vertical marks
corresponding to the minimum (54), the first quartile (67), the
median (78), the third quartile (93) and the maximum (100).
• The lines for the first and third quartile are then joined to make a box
corresponding to the central 50% of the observations, and whiskers
are taken out from the box to the maximum and minimum values.
• This is often a convenient way of representing a distribution and is
particularly useful for comparing distributions.
Box-and-whisker plot

Figure: Box-and-whisker plot for the “Airline’s seats sold” data


Quantiles
• If we have a very large mass of data, we might want to divide it into
more than four parts to make a summary of it.
• It is often helpful to draw a cumulative frequency curve, that is, to plot
the cumulative frequency in the vertical direction against increasing
values of the variable plotted horizontally.
• Such a curve is often called an ogive, a name that describes its general shape. From
it the required results can be read (see example later).
• The median and quartiles divide a set of ranked observations into four
parts. Two other sets of measures, deciles and percentiles, are useful.
Quantiles - definitions
Deciles: the deciles of a set of ranked observations on a variable are
the variable values which divide the set into ten equal parts.

Percentiles: the percentiles of a set of ranked observations on a
variable are the variable values which divide the set into one hundred
equal parts.

More generally we can consider measures which divide a set of ranked
observations into q parts; one of these measures will be the p-th q-tile
where p takes one of the values 1, 2, …, q-1. Thus for the quartiles q = 4
and p = 1, 2, 3.

The general term for these measures is quantiles; the quartiles, deciles
and percentiles are examples.
Example: Deciles and quartiles of a grouped continuous
variable (Clarke & Cooke, 4th Ed., example 4.6.1)
The incomes of married couples over retiring age in 1973 are shown in columns 1
and 2 of the Table below. Draw a cumulative frequency curve for the data, and
from it estimate the lowest decile, the median, the lower and upper quartiles of
income. Use the curve to estimate the proportion of married couples who had a
gross weekly income between £22 and £28.

Income (£)    Number of married     Cumulative frequency
              couples (thousands)   (thousands)
12-           144                   144
14-           330                   474
16-           329                   803
18-           247                   1050
20-           412                   1462
25-           206                   1668
30 or over    391                   2059
Example: Deciles and quartiles of a grouped continuous
variable (Clarke & Cooke, 4th Ed., example 4.6.1)
• The first step is to add the third column to the table containing cumulative
frequency
• Then these values of cumulative frequency are plotted against the upper end-
points of the class-intervals of the income distribution.
• The end of the first interval is £13.99, so that the cumulative frequency for
income £13.99 is 144.
• There were no incomes below £12, so we set the cumulative frequency at £11.99
equal to zero.
• We add the remaining points from the table, as shown by the circles in the Figure
below.
• Put the best smooth curve through the circles on the diagram to get the
cumulative frequency curve as asked for.
Figure: Cumulative frequency curve for the income of married couples (vertical axis: frequency in thousands, 0 to 2000; horizontal axis: gross weekly income in £, 0 to 50)
Example: Deciles and quartiles of a grouped continuous
variable (Clarke & Cooke, 4th Ed., example 4.6.1)
• Note that the upper limit has been conveniently set at £50
• The total frequency, in thousands, is 2059.
• One-tenth of the total frequency must lie below the first decile. Thus from the
graph we need to find the income corresponding to 205.9 on the vertical scale: it
is £14.40 as accurately as we can read it. The first decile is thus £14.40.
• From the graph, the median has to have half the total frequency, i.e. 1029.5,
below it. The income corresponding to this is £19.80.
• The quartiles correspond to cumulative frequencies of $\frac{1}{4} \times 2059 = 514.75$ and
$\frac{3}{4} \times 2059 = 1544.25$; they are therefore, from the graph, $Q_1$ = £16.20 and $Q_3$ = £26.20.
Example: Deciles and quartiles of a grouped continuous
variable (Clarke & Cooke, 4th Ed., example 4.6.1)
• Finally, from the graph the cumulative frequency up to £22 is 1230 thousand,
and up to £28 is 1590 thousand, so that the number of married couples having
incomes between £22 and £28 is 360 thousand. As a proportion this is
$\frac{360}{2059} = 0.175$ (i.e. $17\frac{1}{2}\%$) of the whole.
Coefficient of variation
• The coefficient of variation (CV) is a relative measure of variability
that indicates the size of a standard deviation in relation to its mean.

• It is a standardized, unitless measure that allows you to compare
variability between disparate groups and characteristics.

• It is also known as the relative standard deviation (RSD).


Calculation of coefficient of variation
• Calculating the coefficient of variation involves a simple ratio. Simply
take the standard deviation and divide it by the mean.

$$CV = \frac{\text{Standard Deviation}}{\text{Mean}}$$

• Higher values of the CV indicate that the standard deviation is
relatively large compared to the mean.
Example and further interpretations
• A pizza restaurant measures its delivery time in minutes. The mean delivery
time is 20 minutes and the standard deviation is 5 minutes.

$$CV = \frac{\text{Standard Deviation}}{\text{Mean}} = \frac{5}{20} = 0.25$$

• The coefficient of variation is 0.25. This value tells us the relative size of the
standard deviation compared to the mean.

• Analysts often report the coefficient of variation as a percentage. In this
example, the standard deviation is 25% of the size of the mean.
Example and further interpretations
• If the value equals one or 100%, the standard deviation equals the
mean.

• Values less than one indicate that the standard deviation is smaller
than the mean (typical), while values greater than one occur when
the standard deviation is greater than the mean, a phenomenon referred to as
overdispersion.

• In general, higher values represent a greater degree of relative
variability.
Absolute versus Relative Measures of Variability

• In previous sessions, we looked at the standard deviation, interquartile
range, and range. These statistics are absolute measures of variability. They
use the variable’s unit of measurement to describe the variability.

• For the five-minute standard deviation in the pizza delivery example, we
know that the typical delivery occurs five minutes before or after the mean
delivery time.

• That information is very useful! It tells us the variability in our data using,
conveniently, the original measurement units. We can conceivably compare
this delivery time variability to another pizza restaurant.
Absolute versus Relative Measures of Variability

• On the other hand, relative measurements use a standardization
process that removes the original units of measurement.

• In the CV ratio, both the standard deviation and the mean use the
same units, which cancels them out and produces a unitless statistic.

• When would you want to use the coefficient of variation? Its
unitless nature provides it with some advantages. Specifically, the
coefficient of variation facilitates meaningful comparisons in
scenarios where absolute measures cannot.
Absolute versus Relative Measures of Variability

• Use the coefficient of variation when you want to compare variability
between:

1) Groups that have means of very different magnitudes.
2) Characteristics that use different units of measurement.

• In these two cases, absolute measures can be problematic.
Using the Coefficient of Variation when Means are Vastly
Different
• When you measure a characteristic that has a wide range of values,
you’d often expect the mean and standard deviation to change
together.
• This phenomenon frequently occurs in cross-sectional data. In these
cases, you want to know how the standard deviation
compares relatively to the vastly different means.
• Suppose you’re measuring household expenditures and want to
compare the variability of spending among high-income and low-
income households. These data are fictional.
Using the Coefficient of Variation when Means are Vastly
Different
Expenditures High Income (MWK) Low Income (MWK)
Mean 500,000 40,000
Standard Deviation 125,000 10,000

These values use the same unit of measurement (Malawi Kwacha), allowing you to compare the
standard deviations.

The variability in high-income household expenses is much greater than in low-income households
(MWK125,000 vs. MWK10,000). However, given the vast difference in mean expenses, that’s not
surprising.

If you want to compare variability while accounting for the disparate means, you need to use
a relative measure of variability, such as the coefficient of variation. The table below shows that when
you account for the differences in mean expenses, the two groups actually have equal relative variability.

                           High Income   Low Income
Coefficient of variation   25%           25%
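The two coefficients of variation can be computed directly from the summary statistics in the expenditure table. This is a minimal sketch; the function name and the zero-mean guard are illustrative choices rather than part of the lecture.

```python
def coefficient_of_variation(std_dev, mean):
    """CV = standard deviation / mean (a unitless ratio)."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean

# Summary statistics from the lecture's (fictional) expenditure table, in MWK.
groups = {
    "High income": {"mean": 500_000, "sd": 125_000},
    "Low income": {"mean": 40_000, "sd": 10_000},
}

cvs = {name: coefficient_of_variation(g["sd"], g["mean"]) for name, g in groups.items()}
for name, cv in cvs.items():
    print(f"{name}: CV = {cv:.0%}")
```

The same function applied to the pizza example, `coefficient_of_variation(5, 20)`, gives 0.25.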
Real-world examples
• Analysts frequently use the coefficient of variability when their dataset has a
broad range of means, as shown in the previous example.

• Researchers use the CV for assessing the inequality of incomes across
different countries. Average incomes by country vary greatly. There are
affluent countries and impoverished countries. To consider inequality within
each country while accounting for the vastly different mean incomes,
analysts use the coefficient of variability. In this context, when a country has
a larger coefficient of variability, it represents a greater degree of income
disparity.

• Similarly, financial analysts use the coefficient of variability to assess the
volatility of returns for financial investments across a wide range of
valuations. In this context, higher coefficients indicate a more significant risk.
Using the Coefficient of Variation to Compare
Measurements that Use Different Units
• When measurements use different scales, you can’t compare them
directly.
• Suppose you want to compare the variability in SAT scores to that in ACT
scores. While these college entrance exams are similar in nature and
purpose, they use different scales. Consequently, you can’t compare
their standard deviations directly.
• However, the coefficient of variation expresses each standard deviation relative to its mean, which
means you can compare the relative variability of SAT and ACT scores.
Using the Coefficient of Variation to Compare
Measurements that Use Different Units
• Furthermore, any time you want to assess the variability of inherently different
characteristics, you’ll need to use a relative measure of variability, such as the
coefficient of variability.
• For example, you might want to assess the variability of the operating
temperature and speed of rockets. Or compare the variability of the weight and
strength of material samples. You can’t meaningfully compare standard
deviations that use different units, such as kilograms for weight and megapascals
for strength!
• However, if your kilograms variable has a higher coefficient of variability than
megapascals, then you know weight is relatively more variable than strength.
• These examples measure entirely different characteristics using different units.
However, you can use the coefficient of variation to compare their relative
variability!
Cautions About When Not to Use the Coefficient of
Variability
While the coefficient of variability is extremely useful in some contexts,
there are cases when you should not use it.

Do not use when the mean is close to zero


• If the mean equals zero, the denominator of the ratio is zero, which is
problematic! Fortunately, you’re not likely to have a mean that equals
zero exactly. But when the mean is close to zero, the coefficient of
variation can approach infinity, and its value is susceptible to small
changes in the mean!
Cautions About When Not to Use the Coefficient of
Variability
Do not use with interval scales
• Use the coefficient of variation only when your data use a ratio scale.
Don’t use it for interval scales.
• Ratio scales have an absolute zero that represents a total lack of the
characteristic. For example, zero weight (using the Imperial or metric
system) indicates a complete absence of weight. Weight is a ratio
scale.
• However, temperatures in Fahrenheit and Celsius are interval scales.
These measurement systems have a zero value, but those zeros don’t
indicate an absence of temperature. (Kelvin has an absolute zero that
does represent a lack of temperature. Kelvin is a ratio scale.)
Cautions About When Not to Use the Coefficient of
Variability
• Interval scales do not allow you to divide measurements in a
meaningful fashion. For example, 10°C is not 1/3 the temperature of
30°C! Because the coefficient of variation involves division, this
statistic is meaningless for interval scales.
• Let’s see an example of the problem that occurs when using the
coefficient of variation with interval scales!
• The table below displays pairs of equivalent temperatures. You’d
expect their coefficients of variation to be equal. Let’s check!
Cautions About When Not to Use the Coefficient of
Variability

• The CVs are quite different! That occurs because we are assessing
interval scales.
• Use the coefficient of variation only when you have a true
absolute zero on a ratio scale!
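The lecture's temperature table is not reproduced in this copy, so the sketch below uses three made-up Celsius readings (an assumption, not the original values) to show the effect. The same physical temperatures give different coefficients of variation in Celsius and Fahrenheit, because each scale places its zero at an arbitrary point. Kelvin is included for comparison as a ratio scale.

```python
from statistics import mean, stdev

def cv(values):
    """Coefficient of variation: sample standard deviation / mean."""
    return stdev(values) / mean(values)

# Illustrative readings only; these are NOT the values from the lecture's table.
celsius = [10.0, 20.0, 30.0]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]   # the same temperatures in °F
kelvin = [c + 273.15 for c in celsius]            # the same temperatures in K

print(f"Celsius:    CV = {cv(celsius):.3f}")
print(f"Fahrenheit: CV = {cv(fahrenheit):.3f}")
print(f"Kelvin:     CV = {cv(kelvin):.3f}")
```

Only the Kelvin figure is a meaningful relative measure here, since Kelvin has a true absolute zero.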
