w3 ch2 Anno
incorporate labels and titles correctly in your diagrams and state the units which
you have used
2.4 Introduction
Both themes considered in this chapter (data visualisation and descriptive statistics)
could be applied to population data, but in most cases (as here) they are applied to
a sample. The notation would change slightly if a population were being represented.
2. Data visualisation and descriptive statistics
Most visual representations are very tedious to construct in practice without the aid of
a computer. However, you will understand much more if you try a few by hand (as is
commonly asked in examinations). You should also be aware that spreadsheets do not
always use correct terminology when discussing and labelling graphs. It is important,
once again, to go over this material slowly and make sure you have mastered the basic
statistical definitions introduced here before you proceed to more theoretical ideas.
• Discrete variables: These have outcomes you can count. Examples include the
  number of passengers on a flight and the number of telephone calls received each
  day in a call centre. Observed values for these will be 0, 1, 2, . . . (i.e. non-negative
  integers).
• Continuous variables: These have outcomes you can measure. Examples include
  height, weight and time, all of which can be measured to several decimal places,
  and typically have units of measurement (such as metres, kilograms and hours).
Many of the problems for which people use statistics to help them understand and make
decisions involve types of variables which can be measured. When we are dealing with a
continuous variable – for which there is a generally recognised method of determining
its value – we can also call it a measurable variable. The numbers which we then
obtain come ready-equipped with an ordered relation, i.e. we can always tell if two
measurements are equal (to the available accuracy) or if one is greater or less than the
other.
Of course, before we do any sort of data analysis, we need to collect data. Chapter 9
will discuss a range of different techniques which can be employed to obtain a sample.
For now, we just consider some simple examples of situations where data might be
collected, such as a:
• pre-election opinion poll asking 1,000 people about their voting intentions
• market research survey asking adults how many hours of television they watch per
  week
• census interviewer asking parents how many of their children are receiving full-time
  education (note that a census is the total enumeration of a population, hence this
  would not be a sample!).
1 Note that the word ‘data’ is plural, but is very often used as if it were singular. You will probably
see both forms used when reading widely.
2.5. Types of variable
In cases (a) and (b) we are doing simple counts, within a sample, of a single category
– graduates and Party XYZ supporters, respectively – while in cases (c) and (d) we
are looking at some kind of cross-tabulation between two categorical variables – a
scenario which will be considered in Chapter 8.
There is no obvious and generally recognised way of putting political preferences in
order (in the way that we can certainly say that 1 < 2). It is similarly impossible to
rank (as the technical term has it) many other categories of interest: in combatting
discrimination against people, for instance, organisations might want to look at the
effects of gender, religion, nationality, sexual orientation, disability etc. but the
Before we see our first graphical representation, you should be aware, when reading
articles in newspapers, magazines and even academic journals, that it is easy to
mislead the reader with careless or poorly-defined diagrams. As such, presenting data
effectively with diagrams requires careful planning.
A good diagram:
• provides a clear summary of the data
• is a fair and honest representation
• highlights underlying patterns
• allows the extraction of a lot of information quickly.
2.6. Data visualisation
A bad diagram:
• confuses the viewer
• misleads (either accidentally or intentionally).
Advertisers and politicians are notorious for ‘spinning’ data to portray a particular
narrative for their own objectives!
1. Obtain the range of the dataset (the values spanned by the data), and draw a
horizontal line to accommodate this range.
2. Place dots (hence the name ‘dot plot’!) corresponding to the values above the line,
resulting in the empirical distribution.
[Dot plot of the hourly wage rates (£) from Example 2.2: dots stacked above a horizontal axis running from 11.50 to 12.20 in steps of 0.10.]
Instantly, some interesting features emerge from the dot plot which are not
immediately obvious from the raw data. For example, most clerical assistants earn
less than £12 per hour and nobody (in the sample) earns more than £12.20 per hour.
2.6.3 Histogram
Histograms are excellent diagrams to use when we want to visualise the frequency
distribution of discrete or continuous variables. Our focus will be on how to construct a
density histogram.
Data are first organised into a table which arranges the data into class intervals (also
called bins) – disjoint subdivisions of the total range of values which the variable
takes. Let K denote the number of class intervals. These K class intervals should be
mutually exclusive (meaning they do not overlap, such that each observation belongs to
at most one class interval) and collectively exhaustive (meaning that each observation
belongs to at least one class interval).
Recall that our objective is to represent the distribution of the data. As such, when
choosing K, too many class intervals will dilute the distribution, while too few will
concentrate it (using technical jargon, will tend to degenerate the distribution). Either
way, the pattern of the distribution will be lost – defeating the purpose of the
histogram. As a guide, K = 6 or 7 should be sufficient, but remember to always exercise
common sense!
To each class interval, the corresponding frequency is determined, i.e. the number of
observations of the variable which fall within each class interval. Let fk denote the
frequency of class interval k, and let wk denote the width of class interval k, for
k = 1, 2, . . . , K.
The relative frequency of class interval k is $r_k = f_k/n$, where $n = \sum_{k=1}^{K} f_k$ is the sample
size, i.e. the sum of all the class interval frequencies.
The density of class interval k is dk = rk /wk , and it is this density which is plotted on
the y-axis (the vertical axis). It is preferable to construct density histograms only if
each class interval has the same width.
Example 2.3 Consider the weekly production output of a factory over a 50-week
period (you can choose what the manufactured good is!). Note that this is a discrete
variable since the output will take integer values, i.e. something which we can count.
The data are (in ascending order for convenience):
350 354 354 358 358 359 360 360 362 362
363 364 365 365 365 368 371 372 372 379
381 382 383 385 392 393 395 396 396 398
402 404 406 410 420 437 438 441 444 445
450 451 453 454 456 458 459 460 467 469
We construct the following table, noting that a square bracket ‘[’ includes the class
interval endpoint, while a round bracket ‘)’ excludes the class interval endpoint.
Note that here we have K = 7 class intervals each of width 20, i.e. wk = 20 for
k = 1, 2, . . . , 7. From the raw data, check to see how each of the frequencies, fk , has
been obtained. For example, f1 = 6 represents the first six observations (350, 354,
354, 358, 358 and 359).
We have n = 50, hence the relative frequencies are rk = fk/50 for k = 1, 2, . . . , 7. For
example, r1 = f1/n = 6/50 = 0.12. The density values can then be calculated. For
example, d1 = r1/w1 = 0.12/20 = 0.006.
The table above includes an additional column of ‘Cumulative frequency’, which is
obtained by simply determining the running total of the class frequencies (for
example, the cumulative frequency up to the second class interval is 6 + 14 = 20).
Note the final column is not required to construct a density histogram, although the
computation of cumulative frequencies may be useful when determining medians and
quartiles (to be discussed later in this chapter).
To construct the histogram, adjacent bars are drawn over the respective class
intervals such that the histogram has a total area of one. The histogram for the
above example is shown in Figure 2.1.
Figure 2.1: Density histogram of weekly production output for Example 2.3.
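The calculations behind a density histogram can also be sketched in code. The following is a minimal illustration rather than a prescribed method: it assumes the K = 7 class intervals of width 20 start at 340 (which is consistent with f1 = 6 and the [360, 380) and [380, 400) intervals referred to in this chapter), and the variable names are chosen purely for illustration.

```python
# Weekly production output from Example 2.3 (n = 50).
data = [
    350, 354, 354, 358, 358, 359, 360, 360, 362, 362,
    363, 364, 365, 365, 365, 368, 371, 372, 372, 379,
    381, 382, 383, 385, 392, 393, 395, 396, 396, 398,
    402, 404, 406, 410, 420, 437, 438, 441, 444, 445,
    450, 451, 453, 454, 456, 458, 459, 460, 467, 469,
]

# Assumption: class intervals [340, 360), [360, 380), ..., [460, 480).
start, width, K = 340, 20, 7
n = len(data)

rows = []
cumulative = 0
for k in range(K):
    lower = start + k * width
    upper = lower + width
    f_k = sum(lower <= x < upper for x in data)  # '[' includes, ')' excludes
    r_k = f_k / n                                # relative frequency
    d_k = r_k / width                            # density (plotted on the y-axis)
    cumulative += f_k                            # running total of frequencies
    rows.append((lower, upper, f_k, r_k, d_k, cumulative))

for lower, upper, f_k, r_k, d_k, cum in rows:
    print(f"[{lower}, {upper})  f={f_k:2d}  r={r_k:.2f}  d={d_k:.4f}  cum={cum}")

# The bars of a density histogram should have a total area of one.
total_area = sum(d_k * width for _, _, _, _, d_k, _ in rows)
print("Total area:", round(total_area, 10))
```

Checking that the total area equals one is a quick way to catch a miscounted frequency before drawing the histogram by hand.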
Example 2.4 Continuing with Example 2.3, the stem-and-leaf diagram is:
Note the informative title and labels for the stems and leaves.
For the stem-and-leaf diagram in Example 2.4, note the following points.
• These stems are formed of the ‘10s’ part of the observations.
• Leaves are vertically aligned, hence rotating the stem-and-leaf diagram 90 degrees
  anti-clockwise reproduces the shape of the data’s distribution, similar to what
  would be revealed with a density histogram.
• The leaves are placed in ascending order within the stems, so it is a good idea to
  sort the raw data into ascending order first of all (fortunately the raw data in
  Example 2.3 were already arranged in ascending order, but for other datasets this
  may not be the case).
• Unlike the histogram, the actual data values are preserved. This is advantageous if
  we want to calculate various descriptive statistics later on.
Measures of location – a central point about which the data tend (also known
as measures of central tendency).
2.7. Measures of location
32, 28, 67, 39, 19, 48, 32, 44, 37 and 24. (2.1)
2.7.1 Mean
The mean is the preferred measure of location/central tendency, and is simply the ‘average’ of the
data. It will be frequently applied in various statistical inference techniques in later
chapters.
(Sample) mean
Using the summation operator, $\sum$, which remember is just a form of ‘notational
shorthand’, we define the sample mean, $\bar{x}$, as:
\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}.
\]
To note, the notation x̄ will be used to denote an observed sample mean for a sample
dataset, while µ will denote its population counterpart, i.e. the population mean.
Of course, it is possible to encounter datasets in frequency form, that is each data value
is given with the corresponding frequency of observations for that value, fk , for
k = 1, 2, . . . , K, where there are K different variable values. In such a situation, use the
formula:
\[
\bar{x} = \frac{\sum_{k=1}^{K} f_k x_k}{\sum_{k=1}^{K} f_k}. \tag{2.2}
\]
Note that this preserves the idea of ‘adding up all the observations and dividing by the
total number of observations’. This is an example of a weighted mean, where the weights
are the relative frequencies (as seen in the construction of density histograms).
2 These three measures can be the same in special cases, such as the normal distribution (introduced
in Chapter 4) which is symmetric about the mean (and so mean = median) and achieves a maximum at
this point, i.e. mean = median = mode.
If the data are given in grouped-frequency form, such as that shown in the table in
Example 2.3, then the individual data values are unknown (see footnote 3): all we know is the class
interval in which each observation lies. The sensible solution is to use the midpoint of
the interval as a proxy for each observation recorded as belonging within that class
interval. Hence you still use the grouped-frequency mean formula (2.2), but each xk
value will be substituted with the appropriate class interval midpoint.
Example 2.6 Using the weekly production data in Example 2.3, the interval
midpoints are: 350, 370, 390, 410, 430, 450 and 470, respectively. These will act as
the data values for the respective class intervals. The mean is then calculated as:
\[
\bar{x} = \frac{\sum_{k=1}^{K} f_k x_k}{\sum_{k=1}^{K} f_k}
        = \frac{\sum_{k=1}^{7} f_k x_k}{\sum_{k=1}^{7} f_k}
        = \frac{(6 \times 350) + (14 \times 370) + \cdots + (3 \times 470)}{6 + 14 + \cdots + 3}
        = 400.4.
\]
Compared to the true mean of the raw data (which is 399.72), we see that using the
midpoints as proxies gives a mean very close to the true sample mean value. Note
the mean is not rounded up or down since it is an arithmetic result.
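As a rough check of Example 2.6, the sketch below computes both the midpoint-based mean and the mean of the raw data. It assumes the class intervals are [340, 360), [360, 380), ..., [460, 480), so that the midpoints run from 350 to 470 in steps of 20; the variable names are illustrative only.

```python
# Weekly production output from Example 2.3 (n = 50).
data = [
    350, 354, 354, 358, 358, 359, 360, 360, 362, 362,
    363, 364, 365, 365, 365, 368, 371, 372, 372, 379,
    381, 382, 383, 385, 392, 393, 395, 396, 396, 398,
    402, 404, 406, 410, 420, 437, 438, 441, 444, 445,
    450, 451, 453, 454, 456, 458, 459, 460, 467, 469,
]

# Assumption: seven class intervals of width 20, the first starting at 340.
lower_endpoints = [340 + 20 * k for k in range(7)]
midpoints = [lower + 10 for lower in lower_endpoints]
frequencies = [sum(lower <= x < lower + 20 for x in data)
               for lower in lower_endpoints]

# Formula (2.2), with each class midpoint standing in for the data values.
grouped_mean = (sum(f * x for f, x in zip(frequencies, midpoints))
                / sum(frequencies))

# Mean of the raw data, for comparison.
raw_mean = sum(data) / len(data)

print("Grouped (midpoint) mean:", grouped_mean)
print("Raw-data mean:", raw_mean)
```

The small gap between the two values reflects the information lost when each observation is replaced by its class midpoint.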
A drawback with the mean is its sensitivity to outliers, i.e. extreme observations. For
example, suppose we record the net worth of 10 randomly chosen people. If Elon Musk
(one of the world’s richest people at time of writing), say, was included, his substantial
net worth would pull the mean upward considerably! By increasing the sample size n,
the effect of his inclusion, although diluted, would still be non-negligible, assuming we
were not just sampling from the population of billionaires!
2.7.2 Median
The (sample) median, m, is the middle value of the ordered dataset, where observations
are arranged in ascending order. By definition, 50 per cent of the observations are
greater than or equal to the median, and 50 per cent are less than or equal to the
median.
(Sample) median
Arrange the n numbers in ascending order, $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ (known as the order
statistics, such that $x_{(1)}$ is the first order statistic, i.e. the smallest observed value,
and $x_{(n)}$ is the nth order statistic, i.e. the largest observed value), then the sample
median, m, depends on whether the sample size is odd or even. If:
• n is odd, then there is an explicit middle value, hence $m = x_{((n+1)/2)}$
• n is even, then there is no explicit middle value, so take the average of the values
  either side of the ‘midpoint’, hence $m = (x_{(n/2)} + x_{(n/2+1)})/2$.
3 Of course, we do have the raw data for the weekly production output and so we could work out the
exact sample mean, but here suppose we did not have access to the raw data, instead we were just given
the table of class interval frequencies as shown in Example 2.3.
Example 2.7 For the dataset in (2.1), the ordered observations are:
19, 24, 28, 32, 32, 37, 39, 44, 48 and 67.
Here n = 10, i.e. there is an even number of observations, so we compute the average
of the fifth and sixth ordered observations, that is:
\[
m = \frac{x_{(n/2)} + x_{(n/2+1)}}{2} = \frac{x_{(5)} + x_{(6)}}{2} = \frac{32 + 37}{2} = 34.5.
\]
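The median rule can be written as a short function. This is only a sketch: the function name `sample_median` is illustrative, and it relies on the usual convention that the middle value is taken directly when n is odd and the two central order statistics are averaged when n is even.

```python
def sample_median(values):
    """Return the sample median of a non-empty list of numbers."""
    ordered = sorted(values)          # order statistics x_(1), ..., x_(n)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    middle = n // 2
    if n % 2 == 1:
        # n odd: the explicit middle value x_((n+1)/2)
        return ordered[middle]
    # n even: average of x_(n/2) and x_(n/2+1); list indices start at 0
    return (ordered[middle - 1] + ordered[middle]) / 2


# Dataset (2.1).
data = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
m = sample_median(data)
print("Median:", m)
```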
If we only had data in grouped-frequency form (as in Example 2.3), then we can make
use of the cumulative frequencies. Since n = 50, the median is the 25.5th ordered
observation which must lie in the [380, 400) class interval because once we exhaust the
ordered data up to the [360, 380) class interval we have only accounted for the smallest
20 observations, while once the [380, 400) class interval is exhausted we have accounted
for the smallest 30 observations, meaning the median must lie in this class interval.
Assuming the raw data are not accessible, we could use the midpoint (i.e. 390) as
denoting the median. Alternatively, we could use an interpolation method which uses
the following ‘general’ formula for grouped data, once you have identified the class
which includes the median (such as [380, 400) above):
\[
\text{endpoint of previous bin} + \frac{\text{bin width} \times \text{number of remaining observations}}{\text{bin frequency}}.
\]
Example 2.8 Returning to the weekly production output data from Example 2.3,
the median would be:
\[
380 + \frac{20 \times (25.5 - 20)}{10} = 391.
\]
For comparison, using the raw data, x(25) = 392 and x(26) = 393, gives the ‘true’
sample median of 392.5.
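The interpolation in Example 2.8 can be sketched as follows. The function name `grouped_median` is hypothetical, equal-width classes are assumed, and the class frequencies are those counted from the raw data in Example 2.3 under the assumption that the class intervals start at 340.

```python
def grouped_median(lower_endpoints, width, frequencies):
    """Interpolated median for grouped data with equal-width classes."""
    n = sum(frequencies)
    position = (n + 1) / 2            # e.g. the 25.5th ordered observation
    cumulative = 0                    # observations accounted for so far
    for lower, f_k in zip(lower_endpoints, frequencies):
        if cumulative + f_k >= position:
            remaining = position - cumulative
            # endpoint of previous bin + width * remaining / bin frequency
            return lower + width * remaining / f_k
        cumulative += f_k
    raise ValueError("frequencies must contain at least one observation")


# Assumption: classes [340, 360), ..., [460, 480) with these frequencies.
lower_endpoints = [340, 360, 380, 400, 420, 440, 460]
frequencies = [6, 14, 10, 4, 3, 10, 3]

median_estimate = grouped_median(lower_endpoints, 20, frequencies)
print("Interpolated median:", median_estimate)
```

The interpolation implicitly treats the observations as evenly spread within the median class, which is why the result differs slightly from the raw-data median.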
Skewness
[Figure 2.2: Illustrations of a positively-skewed distribution and a negatively-skewed distribution.]
Graphically, skewness can be determined by identifying where the long ‘tail’ of the
distribution lies. If the long tail is heading toward +∞ (positive infinity) on the x-axis
(i.e. on the right-hand side), then this indicates a positively-skewed (right-skewed)
distribution. Similarly, if the long tail is heading toward −∞ (negative infinity) on the
x-axis (i.e. on the left-hand side) then this indicates a negatively-skewed (left-skewed)
distribution, as illustrated in Figure 2.2.
Example 2.9 The hourly wage rates used in Example 2.2 are skewed to the right,
due to the influence of the relatively large values 12.00, 12.10, 12.10 and 12.20. The
effect of these (similar to Elon Musk’s effect mentioned above, albeit far less extreme
here) is to ‘drag’ or ‘pull’ the mean upward, hence mean > median.
Example 2.10 For the weekly production output data in Example 2.3, we have
calculated the mean and median to be 399.72 and 392.50, respectively. Since the
mean is greater than the median, the data form a positively-skewed distribution, as
confirmed by the histogram in Figure 2.1.
2.7.3 Mode
Our final measure of location is the mode.
(Sample) mode
The (sample) mode is the most frequently occurring value in a (sample) dataset.
2.8. Measures of dispersion
Example 2.11 The modal value of the dataset in (2.1) is 32, since it occurs twice
while the other values only occur once each.
Example 2.12 For the weekly production output data in Example 2.3, looking at
the stem-and-leaf diagram in Example 2.4, we can quickly see that 365 is the modal
value (the three consecutive 5s opposite the second stem stand out). If just given
grouped frequency data, then instead of reporting a modal value we can determine
the modal class interval, which is [360, 380) with 14 observations. (The fact that this
includes 365 here is a coincidence – the modal class interval and modal value are not
equivalent.)
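Finding a modal value amounts to counting how often each value occurs. The sketch below uses `collections.Counter` from the Python standard library; the helper name `modes` is illustrative, and it returns every value tied for the highest frequency since a dataset may have more than one mode.

```python
from collections import Counter


def modes(values):
    """Return all values which occur with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


# Dataset (2.1).
data_21 = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]

# Weekly production output from Example 2.3.
production = [
    350, 354, 354, 358, 358, 359, 360, 360, 362, 362,
    363, 364, 365, 365, 365, 368, 371, 372, 372, 379,
    381, 382, 383, 385, 392, 393, 395, 396, 396, 398,
    402, 404, 406, 410, 420, 437, 438, 441, 444, 445,
    450, 451, 453, 454, 456, 458, 459, 460, 467, 469,
]

print("Mode(s) of (2.1):", modes(data_21))
print("Mode(s) of production data:", modes(production))
```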
2.8.1 Range
Our first measure of spread is the range.
Range
The range is the largest value minus the smallest value, that is:
\[
\text{Range} = x_{(n)} - x_{(1)}.
\]
Clearly, the range is very sensitive to extreme observations since (when they occur) they
are going to be the smallest and/or largest observations (x(1) and/or x(n) , respectively),
and so this measure is of limited appeal. If we were confident that no outliers were
present (or decided to remove any outliers), then the range would better represent the
true spread of the data.
However, the range motivates our consideration of the interquartile range (IQR) instead.
The IQR is the upper (third) quartile, Q3, minus the lower (first)
quartile, Q1. The upper quartile divides ordered data into the bottom 75% and the top
25%, while the lower quartile divides ordered data into the bottom 25% and the top
75%. Unsurprisingly the median, given our earlier definition, is the middle (second)
quartile, i.e. m = Q2 . By discarding the top 25% and bottom 25% of observations,
respectively, we restrict attention solely to the central 50% of observations.
Interquartile range
IQR = Q3 − Q1
where Q3 and Q1 are the third (upper) and first (lower) quartiles, respectively.
Example 2.14 Continuing with the dataset in (2.1), computation of the quartiles
can be problematic since, for example, for the lower quartile we require the value
such that the smallest 2.5 observations are below it and the largest 7.5 observations
are above it. A suggested approach (motivated by the median calculation when n is
even) is to use:
\[
Q_1 = \frac{x_{(2)} + x_{(3)}}{2} = \frac{24 + 28}{2} = 26.
\]
Similarly:
\[
Q_3 = \frac{x_{(7)} + x_{(8)}}{2} = \frac{39 + 44}{2} = 41.5.
\]
Hence IQR = Q3 − Q1 = 41.5 − 26 = 15.5. Contrast this with the range of 48
(derived in Example 2.13) which is much larger due to the effects of x(1) and x(n) .
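The simple quartile approach of Example 2.14 can be sketched in code. This is one convention among many, not a definitive rule: the helper `value_at_position` is hypothetical, and it assumes the lower and upper quartiles sit at positions n/4 and 3n/4 in the ordered data, averaging the two neighbouring order statistics when the position is not a whole number.

```python
import math


def value_at_position(ordered, position):
    """Order statistic at a (possibly fractional) 1-based position.

    For a whole-number position this returns that order statistic; otherwise
    it averages the order statistics on either side of the position.
    """
    below = math.floor(position)
    above = math.ceil(position)
    return (ordered[below - 1] + ordered[above - 1]) / 2


# Dataset (2.1).
data = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
ordered = sorted(data)
n = len(ordered)

q1 = value_at_position(ordered, n / 4)        # position 2.5
q3 = value_at_position(ordered, 3 * n / 4)    # position 7.5
iqr = q3 - q1
data_range = ordered[-1] - ordered[0]         # x_(n) - x_(1)

print("Q1 =", q1, " Q3 =", q3, " IQR =", iqr, " range =", data_range)
```

Statistical software often uses other interpolation rules, so its quartiles may differ slightly from the values produced here.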
There are many different methodologies for computing quartiles, and conventions vary
from country to country, from textbook to textbook, and even from software package to
software package! Any reasonable approach is perfectly acceptable in the examination.
For example, interpolation methods, as demonstrated previously for the case of the
median, are valid. The approach shown in Example 2.14 is the simplest, and so it is
recommended.
2.8.2 Boxplot
At this point, it is useful to introduce another graphical method, the boxplot, also
known as a box-and-whisker plot (no prizes for guessing why!).
In a boxplot, the middle horizontal line is the median and the upper and lower ends of
the box are the upper and lower quartiles, respectively. The whiskers extend from the
box to the most extreme data points within 1.5 times the IQR from the quartiles. Any
data points beyond the whiskers are considered outliers and are plotted individually.
Sometimes we distinguish between outliers and extreme outliers, with the latter plotted
using a different symbol. An example of a (generic) boxplot is shown in Figure 2.3.
If you are presented with a boxplot, then it is easy to obtain all of the following: the
median, quartiles, IQR, range and skewness. Recall that skewness (the departure from
symmetry) is characterised by a long tail, attributable to outliers, which are readily
apparent from a boxplot.
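[Figure 2.3: A generic boxplot. The upper and lower ends of the box are Q3 and Q1, the middle line is Q2 = median, and 50% of cases have values within the box.]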
Example 2.15 From the boxplot shown in Figure 2.4, it can be seen that the
median, Q2 , is around 74, Q1 is about 63, and Q3 is approximately 77. The many
outliers provide a useful indicator that this is a negatively-skewed distribution as the
long tail covers lower values of the variable. Note also that Q3 − Q2 < Q2 − Q1 ,
which tends to indicate negative skewness.
The variance and standard deviation are much better and more useful statistics for
representing the dispersion of a dataset. You need to be familiar with their definitions
and methods of calculation for a sample of data values x1 , x2 , . . . , xn .
Begin by computing the so-called ‘corrected sum of squares’, Sxx , the sum of the
squared deviations of each data value from the (sample) mean, where:
\[
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2. \tag{2.3}
\]
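Recall from earlier that $\bar{x} = \sum_{i=1}^{n} x_i / n$. To see why (2.3) holds:
\[
\begin{aligned}
S_{xx} &= \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
       &= \sum_{i=1}^{n} (x_i^2 - 2\bar{x}x_i + \bar{x}^2) && \text{(expansion of quadratic)} \\
       &= \sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} 2\bar{x}x_i + \sum_{i=1}^{n} \bar{x}^2 && \text{(separating into three summations)} \\
       &= \sum_{i=1}^{n} x_i^2 - 2\bar{x} \underbrace{\sum_{i=1}^{n} x_i}_{=\, n\bar{x}} + n\bar{x}^2 && \text{(noting that $\bar{x}$ is a constant added $n$ times)} \\
       &= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 && \text{(substituting in $n\bar{x}$)} \\
       &= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 && \text{(simplifying)}
\end{aligned}
\]
which uses the fact that $\bar{x} = \sum_{i=1}^{n} x_i / n$, and so $\sum_{i=1}^{n} x_i = n\bar{x}$.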
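A numerical check can make the identity in (2.3) more concrete. The sketch below evaluates both forms of the corrected sum of squares for the dataset in (2.1); the choice of dataset and the variable names are purely illustrative.

```python
# Dataset (2.1).
data = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
n = len(data)
mean = sum(data) / n

# Definition: sum of squared deviations from the sample mean.
sxx_definition = sum((x - mean) ** 2 for x in data)

# Computational form: sum of squares minus n times the squared mean.
sxx_shortcut = sum(x ** 2 for x in data) - n * mean ** 2

print("Sxx (definition):", sxx_definition)
print("Sxx (shortcut):  ", sxx_shortcut)
```

The shortcut form is usually quicker by hand because it avoids computing each deviation separately, although it needs the mean to be kept to enough decimal places to avoid rounding error.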
We now define the sample variance.
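Sample variance
The sample variance of a dataset $x_1, x_2, \ldots, x_n$ is:
\[
s^2 = \frac{S_{xx}}{n-1} = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right).
\]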
Note the divisor used to compute $s^2$ is n − 1, not n. Do not worry about why (this is
covered in ST104B Statistics 2); just remember to divide by n − 1 when computing a
sample variance (see footnote 4). To obtain the sample standard deviation, s, we just take the
(positive) square root of the sample variance, $s^2$.
Sample standard deviation
\[
s = \sqrt{s^2} = \sqrt{\frac{S_{xx}}{n-1}}.
\]
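To illustrate the effect of the divisor, the sketch below computes the sample variance and standard deviation for the dataset in (2.1), and compares them with what dividing by n would give. The use of dataset (2.1) here is an illustrative choice; `statistics.variance` and `statistics.pvariance` from the Python standard library use the n − 1 and n divisors, respectively.

```python
import math
import statistics

# Dataset (2.1).
data = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
n = len(data)
mean = sum(data) / n
s_xx = sum((x - mean) ** 2 for x in data)   # corrected sum of squares

sample_variance = s_xx / (n - 1)            # divisor n - 1 for a sample
sample_sd = math.sqrt(sample_variance)      # positive square root
divide_by_n = s_xx / n                      # divisor n, as for a population

print("s^2 =", round(sample_variance, 4))
print("s   =", round(sample_sd, 4))
print("Sxx / n =", round(divide_by_n, 4))

# Cross-check against the standard library.
print(math.isclose(sample_variance, statistics.variance(data)))
print(math.isclose(divide_by_n, statistics.pvariance(data)))
```

With only n = 10 observations the two divisors give noticeably different answers, which is why the distinction matters for small samples.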
When data are given in grouped-frequency form, the sample variance is calculated as
follows.
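For grouped-frequency data with K classes, to compute the sample variance we use
the formula:
\[
s^2 = \frac{\sum_{k=1}^{K} f_k (x_k - \bar{x})^2}{\sum_{k=1}^{K} f_k}
    = \frac{\sum_{k=1}^{K} f_k x_k^2}{\sum_{k=1}^{K} f_k} - \left(\frac{\sum_{k=1}^{K} f_k x_k}{\sum_{k=1}^{K} f_k}\right)^2.
\]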
Recall that the last bracketed squared term is simply the mean formula for grouped data
shown in (2.2). Note that for grouped-frequency data we can ignore the ‘divide by n − 1’
rule, since we would expect n to be very large in such cases, such that n − 1 ≈ n and so
dividing by n or n − 1 makes negligible difference in practice, noting that $\sum_{k=1}^{K} f_k = n$.
4 In contrast for population data, the population variance is $\sigma^2 = \sum_{i=1}^{N} (x_i - \mu)^2 / N$, i.e. we use the N
divisor here, where N denotes the population size while n denotes the sample size. Also, note the use of
µ (the population mean) instead of x̄ (the sample mean).
or, alternatively, [120, 130), [130, 140) etc. We now proceed to determine the density
values to plot (and cumulative frequencies, for later). We construct the following
table:
                  Interval               Relative
                  width,    Frequency,   frequency,    Density,       Midpoint,
Class interval    wk        fk           rk = fk/n     dk = rk/wk     xk          fk xk     fk xk²
[120, 130)        10        1            0.0345        0.00345        125         125       15,625
[130, 140)        10        4            0.1379        0.01379        135         540       72,900
[140, 150)        10        5            0.1724        0.01724        145         725       105,125
[150, 160)        10        6            0.2069        0.02069        155         930       144,150
[160, 170)        10        7            0.2414        0.02414        165         1,155     190,575
[170, 180)        10        5            0.1724        0.01724        175         875       153,125
[180, 190)        10        1            0.0345        0.00345        185         185       34,225
Total, Σ                    29                                                    4,535     715,725
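The column totals in the table feed directly into the grouped-frequency formulae for the mean and variance. The following sketch recomputes the table columns from the class midpoints and frequencies; it uses the divisor n (not n − 1), as described above for grouped data, and the variable names are illustrative.

```python
import math

# Class midpoints and frequencies from the table (class width 10).
midpoints = [125, 135, 145, 155, 165, 175, 185]
frequencies = [1, 4, 5, 6, 7, 5, 1]
width = 10

n = sum(frequencies)                                        # sample size
densities = [f / n / width for f in frequencies]            # d_k = r_k / w_k
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
sum_fx2 = sum(f * x ** 2 for f, x in zip(frequencies, midpoints))

grouped_mean = sum_fx / n
grouped_variance = sum_fx2 / n - grouped_mean ** 2          # divisor n
grouped_sd = math.sqrt(grouped_variance)

print("n =", n, " sum f*x =", sum_fx, " sum f*x^2 =", sum_fx2)
print("mean =", round(grouped_mean, 2))
print("variance =", round(grouped_variance, 2))
print("standard deviation =", round(grouped_sd, 2))
```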
Figure 2.5: Density histogram of trading volume data for Example 2.17.
Let us now consider an extended example bringing together many of the issues
considered in this chapter.
At a time of economic growth but political uncertainty, a random sample of n = 40
economists (from the population of all economists) produces the following forecasts for
the growth rate of an economy in the next year:
1.3 3.8 4.1 2.6 2.4 2.2 3.4 5.1 1.8 2.7
3.1 2.3 3.7 2.5 4.1 4.7 2.2 1.9 3.6 2.8
4.3 3.1 4.2 4.6 3.4 3.9 2.9 1.9 3.3 8.2
5.4 3.3 4.5 5.2 3.1 2.5 3.3 3.4 4.4 5.2
Solution:
(a) It would be sensible to have class interval widths of 1 unit, which conveniently
makes the density values the same as the relative frequencies! We construct the
following table and plot the density histogram.
                  Interval               Relative
                  width,    Frequency,   frequency,    Density,
Class interval    wk        fk           rk = fk/n     dk = rk/wk
[1.0, 2.0)        1         4            0.100         0.100
[2.0, 3.0)        1         10           0.250         0.250
[3.0, 4.0)        1         13           0.325         0.325
[4.0, 5.0)        1         8            0.200         0.200
[5.0, 6.0)        1         4            0.100         0.100
[6.0, 7.0)        1         0            0.000         0.000
[7.0, 8.0)        1         0            0.000         0.000
[8.0, 9.0)        1         1            0.025         0.025
2.9. Test your understanding
Note that we still show the ‘6’ and ‘7’ stems even though they have no
corresponding leaves. If we omitted these stems (so that the ‘8’ stem is immediately
below the ‘5’ stem) then this would distort the true shape of the sample
distribution, which would be misleading.
(c) The density histogram and stem-and-leaf diagram show that the data are
positively-skewed (skewed to the right), due to the outlier forecast of 8.2%.
Note if you are ever asked to comment on the shape of a distribution, consider:
• Is the distribution (roughly) symmetric?
• Is the distribution bimodal?
• Is the distribution skewed (an elongated tail in one direction)? If so, what is
the direction of the skewness?
• Are there any outliers?
(d) There are n = 40 observations, so the median is the average of the 20th and 21st
ordered observations. Using the stem-and-leaf diagram in part (b), we see that
x(20) = 3.3 and x(21) = 3.4. Therefore, the median is (3.3 + 3.4)/2 = 3.35%.
(e) Since Q2 is the median, which is 3.35, we now need the first and third quartiles, Q1
and Q3 , respectively. There are several methods for determining the quartiles, and
any reasonable approach would be acceptable in an examination. For simplicity,
here we will use the following since n is divisible by 4:
Q1 = x(n/4) = x(10) = 2.5% and Q3 = x(3n/4) = x(30) = 4.2%.
Hence the interquartile range (IQR) is Q3 − Q1 = 4.2 − 2.5 = 1.7%. Therefore, the
whisker limits must satisfy:
max(x(1) , Q1 − 1.5 × IQR) and min(x(n) , Q3 + 1.5 × IQR)
which is:
max(1.3, −0.05) = 1.30 and min(8.2, 6.75) = 6.75.
We see that there is just a single observation which lies outside the interval
[1.30, 6.75], which is x(40) = 8.2% and hence this is plotted individually in the
boxplot. Since this is less than Q3 + 3 × IQR = 4.2 + 3 × 1.7 = 9.3%, then this
observation is an outlier, rather than an extreme outlier.
The boxplot is (a horizontal orientation is also fine):
Note that the upper whisker terminates at 5.4, which is the most extreme data
point within 1.5 times the IQR above Q3 , i.e. the maximum value no larger than
6.75% as easily seen from the stem-and-leaf diagram in part (b). The lower whisker
terminates at x(1) = 1.3%, since the minimum value of the dataset is within 1.5
times the IQR below Q1 .
It is important to note that boxplot conventions may vary, and some software or
implementations might use slightly different methods for calculating whiskers.
Additionally, different multipliers (other than 1.5) might be used in practice
depending on the desired sensitivity to outliers.
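The whisker and outlier calculations in part (e) can be sketched as follows. This follows the conventions used in this solution only (quartiles taken as the order statistics at positions n/4 and 3n/4, whisker multiplier 1.5 and extreme-outlier multiplier 3); other software may use different rules, and the variable names are illustrative.

```python
# Growth rate forecasts (%) from the n = 40 economists.
data = [
    1.3, 3.8, 4.1, 2.6, 2.4, 2.2, 3.4, 5.1, 1.8, 2.7,
    3.1, 2.3, 3.7, 2.5, 4.1, 4.7, 2.2, 1.9, 3.6, 2.8,
    4.3, 3.1, 4.2, 4.6, 3.4, 3.9, 2.9, 1.9, 3.3, 8.2,
    5.4, 3.3, 4.5, 5.2, 3.1, 2.5, 3.3, 3.4, 4.4, 5.2,
]
ordered = sorted(data)
n = len(ordered)

# Simple quartile rule when n is divisible by 4 (list indices start at 0).
q1 = ordered[n // 4 - 1]          # x_(10)
q3 = ordered[3 * n // 4 - 1]      # x_(30)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the limits.
lower_whisker = min(x for x in ordered if x >= lower_limit)
upper_whisker = max(x for x in ordered if x <= upper_limit)

outliers = [x for x in ordered if x < lower_limit or x > upper_limit]
extreme_outliers = [x for x in outliers
                    if x < q1 - 3 * iqr or x > q3 + 3 * iqr]

print("Q1 =", q1, " Q3 =", q3, " IQR =", round(iqr, 2))
print("Whiskers end at", lower_whisker, "and", upper_whisker)
print("Outliers:", outliers, " Extreme outliers:", extreme_outliers)
```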
(f) We have sample data, not population data, hence the (sample) mean is denoted by
x̄ and the (sample) standard deviation is denoted by s. We have:
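\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{140.4}{40} = 3.51\%
\]
and:
\[
s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) = \frac{1}{39}\left(557.26 - 40 \times (3.51)^2\right) = 1.6527.
\]
Therefore, the standard deviation is $s = \sqrt{1.6527} = 1.29\%$.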
(g) In (c) it was concluded that the density histogram and stem-and-leaf diagram of
the data were positively-skewed, and this is consistent with the mean being larger
than the median. It is possible to quantify skewness, although this is beyond the
scope of the syllabus.
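(h) We calculate:
\[
\bar{x} \pm s = 3.51 \pm 1.29 \quad \Rightarrow \quad [2.22, 4.80]
\]
also:
\[
\bar{x} \pm 2 \times s = 3.51 \pm 2 \times 1.29 \quad \Rightarrow \quad [0.93, 6.09].
\]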
Now we use the stem-and-leaf diagram to see that 29 observations are between 2.22
and 4.80 (i.e. the interval [2.22, 4.80]), and 39 observations are between 0.93 and
6.09 (i.e. the interval [0.93, 6.09]). So the proportion (or percentage) of the data in
each interval, respectively, is:
\[
\frac{29}{40} = 0.725 = 72.5\% \quad \text{and} \quad \frac{39}{40} = 0.975 = 97.5\%.
\]
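The calculations in parts (f) and (h) can be reproduced with a short script. This is a sketch using `statistics.mean` and `statistics.stdev` from the Python standard library (the latter uses the n − 1 divisor); it works with unrounded values, so intermediate figures may differ slightly from the rounded ones quoted above, although the counts agree.

```python
import statistics

# Growth rate forecasts (%) from the n = 40 economists.
data = [
    1.3, 3.8, 4.1, 2.6, 2.4, 2.2, 3.4, 5.1, 1.8, 2.7,
    3.1, 2.3, 3.7, 2.5, 4.1, 4.7, 2.2, 1.9, 3.6, 2.8,
    4.3, 3.1, 4.2, 4.6, 3.4, 3.9, 2.9, 1.9, 3.3, 8.2,
    5.4, 3.3, 4.5, 5.2, 3.1, 2.5, 3.3, 3.4, 4.4, 5.2,
]
n = len(data)

mean = statistics.mean(data)
s = statistics.stdev(data)        # sample standard deviation (divisor n - 1)


def count_within(values, centre, spread, multiple):
    """Count the values within `multiple` spreads of the centre."""
    lower = centre - multiple * spread
    upper = centre + multiple * spread
    return sum(lower <= x <= upper for x in values)


within_one = count_within(data, mean, s, 1)
within_two = count_within(data, mean, s, 2)

print("mean =", round(mean, 2), " s =", round(s, 2))
print("Within one s.d.: ", within_one, "of", n, "=", within_one / n)
print("Within two s.d.s:", within_two, "of", n, "=", within_two / n)
```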
• Many ‘bell-shaped’ distributions we meet – that is, distributions which look a bit
like the normal distribution (introduced in Chapter 4) – have the property that
approximately 68% of the data lie within one standard deviation of the mean, and
approximately 95% of the data lie within two standard deviations of the mean. The
percentages in (h) are fairly similar to these.
• The exercise illustrates the importance of working to (at least) one more decimal place than in
the original data. If we had 3.5% and 1.3% for the mean and standard deviation,
respectively, the ‘boundaries’ for the interval with one standard deviation would
have been 3.5 ± 1.3 ⇒ [2.2, 4.8]. Since 2.2 is a data value which appears twice, we
would have had to worry about which side of the ‘boundary’ to allocate these.
(This type of issue can still happen with the extra decimal place, but much less
frequently.)
• too few class intervals (which is the same as too wide class intervals)
• too many class intervals (which is the same as too narrow class intervals).
For example, with too many class intervals, you mainly get 0, 1 or 2 items per class
interval, so any (true) peak is hidden by the subdivisions which you have used.
• The best number of (equal-sized) class intervals depends on the sample size. For
large samples, many class intervals will not lose the pattern, while for small
samples they will. However, with the datasets which tend to crop up in ST104A
Statistics 1, somewhere between 6 and 10 class intervals are likely to work well.
2.13. Solutions to Sample examination questions
2. The data below contain measurements of the low-density lipoproteins, also known
as the ‘bad’ cholesterol, in the blood of 30 patients. Data are measured in
milligrams per decilitre (mg/dL).
95 96 96 98 99
99 101 101 102 102
103 104 104 107 107
111 112 113 113 114
115 117 121 123 124
127 129 131 135 143
3. The average daily intakes of calories, measured in kcals, for a random sample of 12
athletes were:
(a) Construct a boxplot of the data. (The boxplot does not need to be exactly to
scale, but values of box properties and whiskers should be clearly labelled.)
(b) Based on the shape of the boxplot you have drawn, describe the distribution of
the data.
(c) Name two other types of graphical displays which would be suitable to
represent the data. Briefly explain your choices.
2. (a) We have:
                  Interval               Relative
                  width,    Frequency,   frequency,    Density,
Class interval    wk        fk           rk = fk/n     dk = rk/wk
[90, 100)         10        6            0.200         0.0200
[100, 110)        10        9            0.300         0.0300
[110, 120)        10        7            0.233         0.0233
[120, 130)        10        5            0.167         0.0167
[130, 140)        10        2            0.067         0.0067
[140, 150)        10        1            0.033         0.0033
(b) We have:
\[
\bar{x} = \frac{3{,}342}{30} = 111.4 \text{ mg/dL}
\]
\[
s = \sqrt{\frac{1}{29} \times \left(377{,}076 - 30 \times (111.4)^2\right)} = 12.83 \text{ mg/dL}.
\]
(c) The data exhibit positive skewness, as shown by the mean being greater than
the median.
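The figures in parts (b) and (c) can be checked with a few lines of code. The sketch below uses the Python standard library `statistics` module; comparing the mean with the median is only the informal indicator of skewness used in this chapter, not a formal measure.

```python
import statistics

# Low-density lipoprotein measurements (mg/dL) for the 30 patients.
ldl = [
    95, 96, 96, 98, 99,
    99, 101, 101, 102, 102,
    103, 104, 104, 107, 107,
    111, 112, 113, 113, 114,
    115, 117, 121, 123, 124,
    127, 129, 131, 135, 143,
]

mean = statistics.mean(ldl)
sd = statistics.stdev(ldl)          # sample standard deviation (divisor n - 1)
median = statistics.median(ldl)     # average of the 15th and 16th ordered values

print("mean =", round(mean, 1), " s =", round(sd, 2), " median =", median)

# Informal check: mean greater than median suggests positive skewness.
print("mean > median:", mean > median)
```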
Note that no label of the x-axis is necessary and that the plot can be
transposed.
(b) Based on the shape of the boxplot above, we can see that the distribution of
the data is positively skewed, equivalently skewed to the right, due to the
presence of the outlier of 3,061 kcals.
(c) A density histogram, stem-and-leaf diagram or a dot plot are other types of
suitable graphical displays. The reason is that the variable is measurable and
these graphs are suitable for displaying the distribution of such variables.