
Chapter 2
Data visualisation and descriptive statistics

2.1 Synopsis of chapter


This chapter contains two separate but related themes, both to do with the
understanding of data. First, we look at graphical representations for data which allow
us to see their most important characteristics. Second, we calculate simple numbers,
such as the mean or standard deviation, which will summarise those characteristics. In
summary, you should be able to use appropriate diagrams and measures in order to
explain and clarify data which you have collected or which are presented to you.

2.2 Learning outcomes


After completing this chapter, and having completed the essential reading and
activities, you should be able to:

• draw and interpret density histograms, stem-and-leaf diagrams and boxplots

• incorporate labels and titles correctly in your diagrams and state the units which you have used

• calculate the following: arithmetic mean, median, mode, standard deviation, variance, quartiles, range and interquartile range

• explain the use and limitations of the above quantities.

2.3 Recommended reading


Abdey, J. Business Analytics: Applied Modelling and Prediction. (London: SAGE
Publications, 2023) 1st edition [ISBN 9781529774092] Chapter 2.

2.4 Introduction
Both themes considered in this chapter (data visualisation and descriptive statistics)
could be applied to population data, but in most cases (namely here) they are applied to
a sample. The notation would change slightly if a population was being represented.


Most visual representations are very tedious to construct in practice without the aid of a computer. However, you will understand much more if you try a few by hand (as is commonly asked in examinations). You should also be aware that spreadsheets do not always use correct terminology when discussing and labelling graphs. It is important, once again, to go over this material slowly and make sure you have mastered the basic statistical definitions introduced here before you proceed to more theoretical ideas.

2.5 Types of variable


Data1 are obtained on any desired variable. A variable is something which, well, varies! Quantitative variables, i.e. numerical variables, can be classified into two types.

Types of quantitative variable

Discrete variables: These have outcomes you can count. Examples include the
number of passengers on a flight and the number of telephone calls received each
day in a call centre. Observed values for these will be 0, 1, 2, . . . (i.e. non-negative
integers).

Continuous variables: These have outcomes you can measure. Examples include
height, weight and time, all of which can be measured to several decimal places,
and typically have units of measurement (such as metres, kilograms and hours).

Many of the problems for which people use statistics to help them understand and make
decisions involve types of variables which can be measured. When we are dealing with a
continuous variable – for which there is a generally recognised method of determining
its value – we can also call it a measurable variable. The numbers which we then
obtain come ready-equipped with an ordered relation, i.e. we can always tell if two
measurements are equal (to the available accuracy) or if one is greater or less than the
other.
Of course, before we do any sort of data analysis, we need to collect data. Chapter 9
will discuss a range of different techniques which can be employed to obtain a sample.
For now, we just consider some simple examples of situations where data might be
collected, such as a:

• pre-election opinion poll asking 1,000 people about their voting intentions

• market research survey asking adults how many hours of television they watch per week

• census interviewer asking parents how many of their children are receiving full-time education (note that a census is the total enumeration of a population, hence this would not be a sample!).
1 Note that the word ‘data’ is plural, but is very often used as if it was singular. You will probably see both forms used when reading widely.


2.5.1 Categorical variables


Qualitative data, often referred to as categorical variables, represent characteristics or qualities that can be divided into distinct groups or categories. Unlike quantitative data, which (recall) are numerical, qualitative data are non-numeric and describe attributes or qualities. Categorical variables can take on different categories or groups, and they are often used to classify items into specific classes or labels based on shared characteristics.

A polling organisation might be asked to determine whether, say, the political preferences of voters were in some way linked to their highest level of education – for example, do graduates tend to be supporters of Party XYZ? In consumer research, market research companies might be hired to determine whether users were satisfied with the service they obtained from a business (such as a restaurant) or a department of local or central government (housing departments being one important example). Qualitative variables can be classified into two types.

Types of qualitative variable

Nominal variables: These have categories with no inherent order or ranking. Examples include colours (such as red, blue, green etc.) and types of fruit (such as apple, banana, orange etc.).

Ordinal variables: These have categories with a meaningful order or ranking but the intervals between them are not consistent. Examples include highest educational level achieved (such as high school, undergraduate, postgraduate) and degree classification (such as first class, upper second class, lower second class etc.).

Example 2.1 Consider the following.

(a) The total number of graduates (in a sample).

(b) The total number of Party XYZ supporters (in a sample).

(c) The number of graduates who support Party XYZ.

(d) The number of Party XYZ supporters who are graduates.

(e) Satisfaction levels of diners at a restaurant.

In cases (a) and (b) we are doing simple counts, within a sample, of a single category
– graduates and Party XYZ supporters, respectively – while in cases (c) and (d) we
are looking at some kind of cross-tabulation between two categorical variables – a
scenario which will be considered in Chapter 8.
There is no obvious and generally recognised way of putting political preferences in
order (in the way that we can certainly say that 1 < 2). It is similarly impossible to
rank (as the technical term has it) many other categories of interest: in combatting
discrimination against people, for instance, organisations might want to look at the
effects of gender, religion, nationality, sexual orientation, disability etc. but the


whole point of combatting discrimination is that different levels of each category cannot be ranked. Hence these are examples of nominal variables.

In case (e), by contrast, there is a clear ranking: the restaurant would be pleased if there were lots of people who expressed themselves as being ‘very satisfied’, rather than merely ‘satisfied’, let alone ‘dissatisfied’ or ‘very dissatisfied’! Hence this is an ordinal variable.

2.6 Data visualisation


Datasets consist of potentially vast amounts of data. Hedge funds, for example, have
access to very large databases of historical price information on a range of financial
assets, such as so-called ‘tick data’ – very high-frequency intra-day data. Of course, the
human brain cannot easily make sense of such large quantities of numbers when
presented with them on a screen. However, the human brain can cope with visual
representations of data. By producing various plots, we can instantly ‘eyeball’ the data to get a
bird’s-eye view of the dataset. So, at a glance, we can quickly get a feel for the data and
determine whether there are any interesting features, relationships etc. which could
then be examined in greater depth. In modelling, for example, we often make
distributional assumptions, and a suitable variable plot allows us to easily check the
feasibility of a particular distribution by eye. To summarise, plots are a great medium
for communicating the salient features of a dataset to a wide audience.
The main representations we use in ST104A Statistics 1 are histograms,
stem-and-leaf diagrams and boxplots. We will also use scatterplots to visualise the
relationship, if any, between two measurable variables (covered in Chapter 10).
Note that there are many other representations available from software packages like
Tableau, in particular pie charts and standard bar charts which are appropriate when
dealing with categorical data, although these will not be considered further in this half
course. If interested, you are recommended to study ST2187 Business analytics,
applied modelling and prediction.

2.6.1 Presentational traps

Before we see our first graphical representation you should be aware when reading
articles in newspapers, magazines and even within academic journals, that it is easy to
mislead the reader by careless or poorly-defined diagrams. As such, presenting data
effectively with diagrams requires careful planning.

A good diagram:
• provides a clear summary of the data
• is a fair and honest representation
• highlights underlying patterns
• allows the extraction of a lot of information quickly.


A bad diagram:
• confuses the viewer
• misleads (either accidentally or intentionally).
Advertisers and politicians are notorious for ‘spinning’ data to portray a particular
narrative for their own objectives!

2.6.2 Dot plot


The simplicity of a dot plot makes it an ideal starting point to think about the concept
of a sample distribution. For small datasets, this type of plot is very effective for
seeing the data’s underlying distribution. We use the following procedure.

1. Obtain the range of the dataset (the values spanned by the data), and draw a
horizontal line to accommodate this range.

2. Place dots (hence the name ‘dot plot’ !) corresponding to the values above the line,
resulting in the empirical distribution.

Example 2.2 Hourly wage rates (in £) for clerical assistants:

12.20  11.50  11.80  11.60  12.10  11.80  11.60  11.70  11.50
11.60  11.90  11.70  11.60  12.10  11.70  11.80  11.90  12.00

(Dot plot: dots stacked above a horizontal axis running from 11.50 to 12.20, one dot per observation, showing the empirical distribution.)

Instantly, some interesting features emerge from the dot plot which are not
immediately obvious from the raw data. For example, most clerical assistants earn
less than £12 per hour and nobody (in the sample) earns more than £12.20 per hour.
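As an aside (this is not part of the subject guide), a rough text version of a dot plot can be generated in a few lines of Python; the variable name wages is illustrative only.

```python
from collections import Counter

# Hourly wage rates (in £) from Example 2.2.
wages = [12.20, 11.50, 11.80, 11.60, 12.10, 11.80, 11.60, 11.70, 11.50,
         11.60, 11.90, 11.70, 11.60, 12.10, 11.70, 11.80, 11.90, 12.00]

counts = Counter(wages)
for value in sorted(counts):
    # One dot per observation beside each distinct value.
    print(f"{value:.2f}  {'•' * counts[value]}")
```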

2.6.3 Histogram
Histograms are excellent diagrams to use when we want to visualise the frequency
distribution of discrete or continuous variables. Our focus will be on how to construct a
density histogram.
Data are first organised into a table which arranges the data into class intervals (also
called bins) – disjoint subdivisions of the total range of values which the variable
takes. Let K denote the number of class intervals. These K class intervals should be
mutually exclusive (meaning they do not overlap, such that each observation belongs to
at most one class interval) and collectively exhaustive (meaning that each observation
belongs to at least one class interval).


Recall that our objective is to represent the distribution of the data. As such, when choosing K, too many class intervals will dilute the distribution, while too few will concentrate it (using technical jargon, will tend to degenerate the distribution). Either way, the pattern of the distribution will be lost – defeating the purpose of the histogram. As a guide, K = 6 or 7 should be sufficient, but remember to always exercise common sense!
To each class interval, the corresponding frequency is determined, i.e. the number of
observations of the variable which fall within each class interval. Let fk denote the
frequency of class interval k, and let wk denote the width of class interval k, for
k = 1, 2, . . . , K.
The relative frequency of class interval k is $r_k = f_k / n$, where $n = \sum_{k=1}^{K} f_k$ is the sample size, i.e. the sum of all the class interval frequencies.

The density of class interval k is $d_k = r_k / w_k$, and it is this density which is plotted on the y-axis (the vertical axis). It is preferable to construct density histograms only if each class interval has the same width.

Example 2.3 Consider the weekly production output of a factory over a 50-week
period (you can choose what the manufactured good is!). Note that this is a discrete
variable since the output will take integer values, i.e. something which we can count.
The data are (in ascending order for convenience):

350 354 354 358 358 359 360 360 362 362
363 364 365 365 365 368 371 372 372 379
381 382 383 385 392 393 395 396 396 398
402 404 406 410 420 437 438 441 444 445
450 451 453 454 456 458 459 460 467 469

We construct the following table, noting that a square bracket ‘[’ includes the class
interval endpoint, while a round bracket ‘)’ excludes the class interval endpoint.

Class interval   Interval width, wk   Frequency, fk   Relative frequency, rk = fk/n   Density, dk = rk/wk   Cumulative frequency
[340, 360)               20                  6                     0.12                      0.006                     6
[360, 380)               20                 14                     0.28                      0.014                    20
[380, 400)               20                 10                     0.20                      0.010                    30
[400, 420)               20                  4                     0.08                      0.004                    34
[420, 440)               20                  3                     0.06                      0.003                    37
[440, 460)               20                 10                     0.20                      0.010                    47
[460, 480)               20                  3                     0.06                      0.003                    50

Note that here we have K = 7 class intervals each of width 20, i.e. wk = 20 for
k = 1, 2, . . . , 7. From the raw data, check to see how each of the frequencies, fk , has
been obtained. For example, f1 = 6 represents the first six observations (350, 354,
354, 358, 358 and 359).


We have n = 50, hence the relative frequencies are rk = fk/50 for k = 1, 2, . . . , 7. For example, r1 = f1/n = 6/50 = 0.12. The density values can then be calculated. For example, d1 = r1/w1 = 0.12/20 = 0.006.
The table above includes an additional column of ‘Cumulative frequency’, which is
obtained by simply determining the running total of the class frequencies (for
example, the cumulative frequency up to the second class interval is 6 + 14 = 20).
Note the final column is not required to construct a density histogram, although the
computation of cumulative frequencies may be useful when determining medians and
quartiles (to be discussed later in this chapter).
To construct the histogram, adjacent bars are drawn over the respective class
intervals such that the histogram has a total area of one. The histogram for the
above example is shown in Figure 2.1.

Figure 2.1: Density histogram of weekly production output for Example 2.3.
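If a computer is available, the table above is easy to reproduce; the Python sketch below (illustrative only, with variable names of our own choosing) uses numpy to compute the frequency, relative frequency, density and cumulative frequency columns for the weekly production output.

```python
import numpy as np

# Weekly production output for Example 2.3 (50 observations).
data = np.array([
    350, 354, 354, 358, 358, 359, 360, 360, 362, 362,
    363, 364, 365, 365, 365, 368, 371, 372, 372, 379,
    381, 382, 383, 385, 392, 393, 395, 396, 396, 398,
    402, 404, 406, 410, 420, 437, 438, 441, 444, 445,
    450, 451, 453, 454, 456, 458, 459, 460, 467, 469,
])

edges = np.arange(340, 500, 20)           # class interval endpoints [340, 360), [360, 380), ...
freq, _ = np.histogram(data, bins=edges)  # frequencies f_k
n = freq.sum()                            # sample size
rel = freq / n                            # relative frequencies r_k = f_k / n
width = np.diff(edges)                    # interval widths w_k (all 20 here)
density = rel / width                     # densities d_k = r_k / w_k
cumfreq = freq.cumsum()                   # cumulative frequencies

for k in range(len(freq)):
    print(f"[{edges[k]}, {edges[k+1]}): f={freq[k]:2d}  r={rel[k]:.2f}  "
          f"d={density[k]:.3f}  cum={cumfreq[k]}")
```

Passing density=True to np.histogram would return the density values directly; plotting the bars (for example with matplotlib) then gives a density histogram with total area one.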

2.6.4 Stem-and-leaf diagram


A stem-and-leaf diagram uses the raw data. As the name suggests, it is formed using
a ‘stem’ and corresponding ‘leaves’. The choice of the stem involves determining a
major component of an observed value, such as the ‘10s’ unit if the order of magnitude
of the observations were 15, 25, 35 etc., or if data are of the order of magnitude 1.5, 2.5,
3.5 etc. the integer part. The remainder of the observed value plays the role of the ‘leaf’.
Applied to the weekly production dataset, we obtain the stem-and-leaf diagram shown
below in Example 2.4.


Example 2.4 Continuing with Example 2.3, the stem-and-leaf diagram is:

Stem-and-leaf diagram of weekly production output

Stem (Tens) Leaves (Units)


35 044889
36 0022345558
37 1229
38 1235
39 235668
40 246
41 0
42 0
43 78
44 145
45 0134689
46 079

Note the informative title and labels for the stems and leaves.

For the stem-and-leaf diagram in Example 2.4, note the following points.

• The stems are formed of the ‘10s’ part of the observations.

• Leaves are vertically aligned, hence rotating the stem-and-leaf diagram 90 degrees anti-clockwise reproduces the shape of the data’s distribution, similar to what would be revealed with a density histogram.

• The leaves are placed in ascending order within the stems, so it is a good idea to sort the raw data into ascending order first of all (fortunately the raw data in Example 2.3 were already arranged in ascending order, but for other datasets this may not be the case).

• Unlike the histogram, the actual data values are preserved. This is advantageous if we want to calculate various descriptive statistics later on.
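A rough Python sketch of the construction follows (illustrative only; it assumes integer-valued data and a ‘10s’ stem, as in this example).

```python
from collections import defaultdict

def stem_and_leaf(values, stem_unit=10):
    """Print a simple stem-and-leaf diagram; stems are values // stem_unit."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // stem_unit].append(v % stem_unit)
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        print(f"{stem:>4} | {leaves}")

# Weekly production output from Example 2.3.
output = [350, 354, 354, 358, 358, 359, 360, 360, 362, 362,
          363, 364, 365, 365, 365, 368, 371, 372, 372, 379,
          381, 382, 383, 385, 392, 393, 395, 396, 396, 398,
          402, 404, 406, 410, 420, 437, 438, 441, 444, 445,
          450, 451, 453, 454, 456, 458, 459, 460, 467, 469]
stem_and_leaf(output)
```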

So far we have considered how to summarise a dataset visually. This methodology is appropriate to get a visual feel for the distribution of the dataset. In practice, we would also like to summarise things numerically. There are two key properties of a dataset which will be of particular interest.

Key properties of a dataset

Measures of location – a central point about which the data tend (also known as measures of central tendency).

Measures of dispersion – a measure of the variability of the data, i.e. how spread out the data are about the central point (also known as measures of spread).


2.7 Measures of location


The mean, median and mode are the three principal measures of location. In general,
these will not all give the same numerical value for a given dataset/distribution.2 These
three measures (and, later, measures of dispersion) will now be introduced using the
following small sample dataset:

32, 28, 67, 39, 19, 48, 32, 44, 37 and 24. (2.1)

2.7.1 Mean
The mean is the preferred measure of location/central tendency; it is simply the ‘average’ of the data. It will be frequently applied in various statistical inference techniques in later chapters.

(Sample) mean
Using the summation operator, $\sum$, which remember is just a form of ‘notational shorthand’, we define the sample mean, x̄, as:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}.$$

Note that x̄ will be used to denote an observed sample mean for a sample dataset, while µ will denote its population counterpart, i.e. the population mean.

Example 2.5 For the dataset in (2.1) above:

$$\bar{x} = \frac{1}{10}\sum_{i=1}^{10} x_i = \frac{32 + 28 + \cdots + 24}{10} = \frac{370}{10} = 37.$$

Of course, it is possible to encounter datasets in frequency form, that is each data value
is given with the corresponding frequency of observations for that value, fk , for
k = 1, 2, . . . , K, where there are K different variable values. In such a situation, use the
formula:
$$\bar{x} = \frac{\sum_{k=1}^{K} f_k x_k}{\sum_{k=1}^{K} f_k}. \tag{2.2}$$

Note that this preserves the idea of ‘adding up all the observations and dividing by the
total number of observations’. This is an example of a weighted mean, where the weights
are the relative frequencies (as seen in the construction of density histograms).
2 These three measures can be the same in special cases, such as the normal distribution (introduced in Chapter 4) which is symmetric about the mean (and so mean = median) and achieves a maximum at this point, i.e. mean = median = mode.


If the data are given in grouped-frequency form, such as that shown in the table in Example 2.3, then the individual data values are unknown3 – all we know is the class interval in which each observation lies. The sensible solution is to use the midpoint of the interval as a proxy for each observation recorded as belonging within that class interval. Hence you still use the grouped-frequency mean formula (2.2), but each xk value will be substituted with the appropriate class interval midpoint.

Example 2.6 Using the weekly production data in Example 2.3, the interval midpoints are: 350, 370, 390, 410, 430, 450 and 470, respectively. These will act as the data values for the respective class intervals. The mean is then calculated as:

$$\bar{x} = \frac{\sum_{k=1}^{K} f_k x_k}{\sum_{k=1}^{K} f_k} = \frac{\sum_{k=1}^{7} f_k x_k}{\sum_{k=1}^{7} f_k} = \frac{(6 \times 350) + (14 \times 370) + \cdots + (3 \times 470)}{6 + 14 + \cdots + 3} = 400.4.$$

Compared to the true mean of the raw data (which is 399.72), we see that using the
midpoints as proxies gives a mean very close to the true sample mean value. Note
the mean is not rounded up or down since it is an arithmetic result.
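A quick computational check of Example 2.6 (an illustrative sketch, not part of the guide):

```python
midpoints = [350, 370, 390, 410, 430, 450, 470]   # class interval midpoints x_k
freqs     = [  6,  14,  10,   4,   3,  10,   3]   # class frequencies f_k

n = sum(freqs)                                              # 50
grouped_mean = sum(f * x for f, x in zip(freqs, midpoints)) / n
print(grouped_mean)   # 400.4 – close to the true sample mean of 399.72
```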

A drawback with the mean is its sensitivity to outliers, i.e. extreme observations. For
example, suppose we record the net worth of 10 randomly chosen people. If Elon Musk
(one of the world’s richest people at time of writing), say, was included, his substantial
net worth would pull the mean upward considerably! By increasing the sample size n,
the effect of his inclusion, although diluted, would still be non-negligible, assuming we
were not just sampling from the population of billionaires!

2.7.2 Median
The (sample) median, m, is the middle value of the ordered dataset, where observations
are arranged in ascending order. By definition, 50 per cent of the observations are
greater than or equal to the median, and 50 per cent are less than or equal to the
median.

(Sample) median

Arrange the n numbers in ascending order, x(1) , x(2) , . . . , x(n) , (known as the order
statistics, such that x(1) is the first order statistic, i.e. the smallest observed value,
and x(n) is the nth order statistic, i.e. the largest observed value), then the sample
median, m, depends on whether the sample size is odd or even. If:

n is odd, then there is an explicit middle value, so m = x((n+1)/2)

n is even, then there is no explicit middle value, so take the average of the values
either side of the ‘midpoint’, hence m = (x(n/2) + x(n/2+1) )/2.

3 Of course, we do have the raw data for the weekly production output and so we could work out the exact sample mean, but here suppose we did not have access to the raw data, instead we were just given the table of class interval frequencies as shown in Example 2.3.


Example 2.7 For the dataset in (2.1), the ordered observations are:

19, 24, 28, 32, 32, 37, 39, 44, 48 and 67.

Here n = 10, i.e. there is an even number of observations, so we compute the average of the fifth and sixth ordered observations, that is:

$$m = \frac{x_{(n/2)} + x_{(n/2+1)}}{2} = \frac{x_{(5)} + x_{(6)}}{2} = \frac{32 + 37}{2} = 34.5.$$

If we only had data in grouped-frequency form (as in Example 2.3), then we can make
use of the cumulative frequencies. Since n = 50, the median is the 25.5th ordered
observation which must lie in the [380, 400) class interval because once we exhaust the
ordered data up to the [360, 380) class interval we have only accounted for the smallest
20 observations, while once the [380, 400) class interval is exhausted we have accounted
for the smallest 30 observations, meaning the median must lie in this class interval.
Assuming the raw data are not accessible, we could use the midpoint (i.e. 390) as
denoting the median. Alternatively, we could use an interpolation method which uses
the following ‘general’ formula for grouped data, once you have identified the class
which includes the median (such as [380, 400) above):
$$\text{endpoint of previous bin} + \frac{\text{bin width} \times \text{number of remaining observations}}{\text{bin frequency}}.$$

Example 2.8 Returning to the weekly production output data from Example 2.3,
the median would be:
$$380 + \frac{20 \times (25.5 - 20)}{10} = 391.$$
For comparison, using the raw data, x(25) = 392 and x(26) = 393, gives the ‘true’
sample median of 392.5.
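Both median calculations are easy to reproduce in Python; the sketch below (illustrative only, with function names of our own choosing) mirrors the rules used in the text.

```python
def sample_median(values):
    """Median of a list: middle value, or average of the two middle values."""
    x = sorted(values)
    n = len(x)
    mid = n // 2
    return x[mid] if n % 2 == 1 else (x[mid - 1] + x[mid]) / 2

print(sample_median([32, 28, 67, 39, 19, 48, 32, 44, 37, 24]))  # 34.5

def grouped_median(lower, width, freq_in_bin, cum_before, n):
    """Interpolated median for grouped data, as in Example 2.8."""
    position = (n + 1) / 2             # the 25.5th ordered observation when n = 50
    remaining = position - cum_before  # observations still needed inside the median bin
    return lower + width * remaining / freq_in_bin

print(grouped_median(lower=380, width=20, freq_in_bin=10, cum_before=20, n=50))  # 391.0
```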

Although an advantage of the median is that it is not influenced by outliers (Elon Musk’s net worth would be x(n) and so would not affect the median), in practice it is of limited use in formal statistical inference.
For symmetric data, the mean and median are always equal. Therefore, this is a simple
way to verify whether a dataset is symmetric. Asymmetric distributions are skewed,
where skewness measures the departure from symmetry. Although you will not be
expected to compute the coefficient of skewness (its numerical value), you need to be
familiar with the two types of skewness.

Skewness

• When mean > median, this indicates a positively-skewed distribution (also referred to as ‘right-skewed’).

• When mean < median, this indicates a negatively-skewed distribution (also referred to as ‘left-skewed’).


Figure 2.2: Different types of skewed distributions (one positively-skewed, the other negatively-skewed).

Graphically, skewness can be determined by identifying where the long ‘tail’ of the
distribution lies. If the long tail is heading toward +∞ (positive infinity) on the x-axis
(i.e. on the right-hand side), then this indicates a positively-skewed (right-skewed)
distribution. Similarly, if the long tail is heading toward −∞ (negative infinity) on the
x-axis (i.e. on the left-hand side) then this indicates a negatively-skewed (left-skewed)
distribution, as illustrated in Figure 2.2.

Example 2.9 The hourly wage rates used in Example 2.2 are skewed to the right,
due to the influence of the relatively large values 12.00, 12.10, 12.10 and 12.20. The
effect of these (similar to Elon Musk’s effect mentioned above, albeit far less extreme
here) is to ‘drag’ or ‘pull’ the mean upward, hence mean > median.

Example 2.10 For the weekly production output data in Example 2.3, we have
calculated the mean and median to be 399.72 and 392.50, respectively. Since the
mean is greater than the median, the data form a positively-skewed distribution, as
confirmed by the histogram in Figure 2.1.

2.7.3 Mode
Our final measure of location is the mode.

(Sample) mode

The (sample) mode is the most frequently-occurring value in a (sample) dataset.

It is perfectly possible to encounter a multimodal distribution where several data values are tied in terms of their frequency of occurrence.

Example 2.11 The modal value of the dataset in (2.1) is 32, since it occurs twice while the other values only occur once each.

Example 2.12 For the weekly production output data in Example 2.3, looking at
the stem-and-leaf diagram in Example 2.4, we can quickly see that 365 is the modal
value (the three consecutive 5s opposite the second stem stand out). If just given
grouped frequency data, then instead of reporting a modal value we can determine
the modal class interval, which is [360, 380) with 14 observations. (The fact that this
includes 365 here is a coincidence – the modal class interval and modal value are not
equivalent.)
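In Python, the mode (or modes, for multimodal data) of a small dataset can be found with a counter; this is an illustrative sketch only.

```python
from collections import Counter

data = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]   # the dataset in (2.1)
counts = Counter(data)
top = max(counts.values())
modes = [value for value, c in counts.items() if c == top]
print(modes)   # [32] – a multimodal dataset would return several values here
```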

2.8 Measures of dispersion


The dispersion (or spread) of a dataset is very important when drawing conclusions
from it. Hence it is essential to have a useful measure of this property, and several
candidates exist, which are introduced below. As expected, there are advantages and
disadvantages to each.

2.8.1 Range
Our first measure of spread is the range.

Range

The range is the largest value minus the smallest value, that is:

range = x(n) − x(1) .

Example 2.13 For the dataset in (2.1), the range is:

x(n) − x(1) = 67 − 19 = 48.

Clearly, the range is very sensitive to extreme observations since (when they occur) they
are going to be the smallest and/or largest observations (x(1) and/or x(n) , respectively),
and so this measure is of limited appeal. If we were confident that no outliers were
present (or decided to remove any outliers), then the range would better represent the
true spread of the data.
However, the range motivates our consideration of the interquartile range (IQR) instead.
The IQR is the difference between the upper (third) quartile, Q3 , minus the lower (first)
quartile, Q1 . The upper quartile divides ordered data into the bottom 75% and the top
25%, while the lower quartile divides ordered data into the bottom 25% and the top 75%. Unsurprisingly the median, given our earlier definition, is the middle (second) quartile, i.e. m = Q2. By discarding the top 25% and bottom 25% of observations, respectively, we restrict attention solely to the central 50% of observations.
Interquartile range

The interquartile range (IQR) is defined as:

IQR = Q3 − Q1

where Q3 and Q1 are the third (upper) and first (lower) quartiles, respectively.

Example 2.14 Continuing with the dataset in (2.1), computation of the quartiles
can be problematic since, for example, for the lower quartile we require the value
such that the smallest 2.5 observations are below it and the largest 7.5 observations
are above it. A suggested approach (motivated by the median calculation when n is
even) is to use:
$$Q_1 = \frac{x_{(2)} + x_{(3)}}{2} = \frac{24 + 28}{2} = 26.$$

Similarly:

$$Q_3 = \frac{x_{(7)} + x_{(8)}}{2} = \frac{39 + 44}{2} = 41.5.$$
Hence IQR = Q3 − Q1 = 41.5 − 26 = 15.5. Contrast this with the range of 48
(derived in Example 2.13) which is much larger due to the effects of x(1) and x(n) .

There are many different methodologies for computing quartiles, and conventions vary
from country to country, from textbook to textbook, and even from software package to
software package! Any reasonable approach is perfectly acceptable in the examination.
For example, interpolation methods, as demonstrated previously for the case of the
median, are valid. The approach shown in Example 2.14 is the simplest, and so it is
recommended.
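Since conventions vary, any code should state which quartile rule it uses. The Python sketch below (illustrative only) mirrors the simple averaging approach of Example 2.14; numpy, R and spreadsheet functions use interpolation rules and may give slightly different answers.

```python
def quartiles(values):
    """Quartiles by averaging the order statistics either side of the n/4 and
    3n/4 positions, mirroring Example 2.14. Other conventions (including most
    software defaults) can give slightly different values."""
    x = sorted(values)
    n = len(x)
    q1 = (x[n // 4 - 1] + x[n // 4]) / 2
    q3 = (x[3 * n // 4 - 1] + x[3 * n // 4]) / 2
    return q1, q3

data = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
q1, q3 = quartiles(data)
print(q1, q3, q3 - q1)   # 26.0 41.5 15.5
```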

2.8.2 Boxplot
At this point, it is useful to introduce another graphical method, the boxplot, also
known as a box-and-whisker plot, no prizes for guessing why!
In a boxplot, the middle horizontal line is the median and the upper and lower ends of
the box are the upper and lower quartiles, respectively. The whiskers extend from the
box to the most extreme data points within 1.5 times the IQR from the quartiles. Any
data points beyond the whiskers are considered outliers and are plotted individually.
Sometimes we distinguish between outliers and extreme outliers, with the latter plotted
using a different symbol. An example of a (generic) boxplot is shown in Figure 2.3.
If you are presented with a boxplot, then it is easy to obtain all of the following: the
median, quartiles, IQR, range and skewness. Recall that skewness (the departure from
symmetry) is characterised by a long tail, attributable to outliers, which are readily
apparent from a boxplot.
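If a computer is available, a plotting library such as matplotlib will draw a boxplot using this 1.5 × IQR whisker rule by default; the sketch below is illustrative only, and its internal quartile convention may differ slightly from a hand calculation.

```python
import matplotlib.pyplot as plt

data = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]   # the dataset in (2.1)

fig, ax = plt.subplots()
ax.boxplot(data, whis=1.5)    # whiskers reach the most extreme points within 1.5 * IQR
ax.set_ylabel("Value")
ax.set_title("Boxplot of the dataset in (2.1)")
plt.show()
```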


Figure 2.3: An example of a boxplot (not to scale). The box spans Q1 to Q3, so 50% of cases have values within the box, with the median Q2 marked inside. The whiskers extend to the largest and smallest observed values that are not outliers. Values more than 1.5 box-lengths above Q3 or below Q1 are plotted as outliers (‘o’), and values more than 3 box-lengths away as extreme outliers (‘x’).

Example 2.15 From the boxplot shown in Figure 2.4, it can be seen that the
median, Q2 , is around 74, Q1 is about 63, and Q3 is approximately 77. The many
outliers provide a useful indicator that this is a negatively-skewed distribution as the
long tail covers lower values of the variable. Note also that Q3 − Q2 < Q2 − Q1 ,
which tends to indicate negative skewness.

2.8.3 Variance and standard deviation

The variance and standard deviation are much better and more useful statistics for
representing the dispersion of a dataset. You need to be familiar with their definitions
and methods of calculation for a sample of data values x1 , x2 , . . . , xn .
Begin by computing the so-called ‘corrected sum of squares’, Sxx , the sum of the
squared deviations of each data value from the (sample) mean, where:

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2. \tag{2.3}$$


Figure 2.4: A boxplot showing a negatively-skewed distribution.

Recall from earlier that $\bar{x} = \sum_{i=1}^{n} x_i / n$. To see why (2.3) holds:

$$
\begin{aligned}
S_{xx} &= \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
&= \sum_{i=1}^{n} \left(x_i^2 - 2\bar{x}x_i + \bar{x}^2\right) && \text{(expansion of quadratic)} \\
&= \sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} 2\bar{x}x_i + \sum_{i=1}^{n} \bar{x}^2 && \text{(separating into three summations)} \\
&= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 && \text{(noting that $\bar{x}$ is a constant added $n$ times)} \\
&= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 && \text{(substituting $n\bar{x}$ for $\textstyle\sum_{i=1}^{n} x_i$)} \\
&= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 && \text{(simplifying)}
\end{aligned}
$$

which uses the fact that $\bar{x} = \sum_{i=1}^{n} x_i / n$, and so $\sum_{i=1}^{n} x_i = n\bar{x}$.
We now define the sample variance.

Sample variance

The sample variance, s², is defined as:

$$s^2 = \frac{S_{xx}}{n-1} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right).$$


Note the divisor used to compute s² is n − 1, not n. Do not worry about why (this is covered in ST104B Statistics 2), just remember to divide by n − 1 when computing a sample variance.4 To obtain the sample standard deviation, s, we just take the (positive) square root of the sample variance, s².
Sample standard deviation

The sample standard deviation, s, is:

$$s = \sqrt{s^2} = \sqrt{\frac{S_{xx}}{n-1}}.$$

Example 2.16 Using the dataset in (2.1), x̄ = 37, so:

$$S_{xx} = (32 - 37)^2 + (28 - 37)^2 + \cdots + (24 - 37)^2 = 25 + 81 + \cdots + 169 = 1{,}698.$$

Hence $s = \sqrt{1{,}698/(10-1)} = 13.74$.

Note that, given $\sum_{i=1}^{n} x_i^2 = 15{,}388$, we could have calculated $S_{xx}$ using the other expression:

$$S_{xx} = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = 15{,}388 - 10 \times (37)^2 = 1{,}698.$$

So this alternative method is much quicker to calculate $S_{xx}$.
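A numerical check of Example 2.16 (an illustrative sketch; the standard library’s statistics.stdev also uses the n − 1 divisor):

```python
import statistics

data = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]

n = len(data)
mean = sum(data) / n
sxx = sum((x - mean) ** 2 for x in data)         # corrected sum of squares
s2 = sxx / (n - 1)                               # sample variance
s = s2 ** 0.5                                    # sample standard deviation

print(mean, sxx, round(s, 2))                    # 37.0 1698.0 13.74
print(round(statistics.stdev(data), 2))          # 13.74 – agrees with the hand calculation
```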

When data are given in grouped-frequency form, the sample variance is calculated as
follows.

Sample variance for grouped-frequency data

For grouped-frequency data with K classes, to compute the sample variance we use
the formula:
 2
K K K
fk (xk − x̄)2 fk x2k
P P P
 fk xk 
2 k=1 k=1  k=1
s = = K − K  .

K
P P  P 
fk fk fk
k=1 k=1 k=1

Recall that the last bracketed squared term is simply the mean formula for grouped data shown in (2.2). Note that for grouped-frequency data we can ignore the ‘divide by n − 1’ rule, since we would expect n to be very large in such cases, such that n − 1 ≈ n and so dividing by n or n − 1 makes negligible difference in practice, noting that $\sum_{k=1}^{K} f_k = n$.

4 In contrast, for population data, the population variance is $\sigma^2 = \sum_{i=1}^{N} (x_i - \mu)^2 / N$, i.e. we use the N divisor here, where N denotes the population size while n denotes the sample size. Also, note the use of µ (the population mean) instead of x̄ (the sample mean).


Example 2.17 A stockbroker is interested in the level of trading activity on a particular stock exchange. They have collected the following data, which are weekly average volumes (in millions), over a 29-week period. This is an example of time series data. Note that this variable is discrete, but because the numbers are so large it can be treated as continuous.

172.5   154.6   163.5
161.9   151.6   172.6
172.3   132.4   168.3
181.3   144.0   155.3
169.1   133.6   143.4
155.0   149.0   140.6
148.6   135.8   125.1
159.8   139.9   171.3
161.6   164.4   167.0
153.8   175.6

To construct a density histogram we first decide on the number of class intervals, K, which is a subjective decision. The objective is to convey information in a useful way. In this case the data lie between (roughly) 120 and 190 million shares/week, so class intervals of width 10 million will give K = 7 classes.
With almost 30 observations this choice is probably adequate; more observations
might support more class intervals (with widths of 5 million, say); fewer observations
would, perhaps, need a larger class interval of width 20 million.
Therefore, the class intervals are defined like this:

120 ≤ volume < 130, 130 ≤ volume < 140 etc.

or, alternatively, [120, 130), [130, 140) etc. We now proceed to determine the density
values to plot (and cumulative frequencies, for later). We construct the following
table:

Class interval   Interval width, wk   Frequency, fk   Relative frequency, rk = fk/n   Density, dk = rk/wk   Midpoint, xk   fk xk    fk xk²
[120, 130)               10                 1                    0.0345                     0.00345              125           125     15,625
[130, 140)               10                 4                    0.1379                     0.01379              135           540     72,900
[140, 150)               10                 5                    0.1724                     0.01724              145           725    105,125
[150, 160)               10                 6                    0.2069                     0.02069              155           930    144,150
[160, 170)               10                 7                    0.2414                     0.02414              165         1,155    190,575
[170, 180)               10                 5                    0.1724                     0.01724              175           875    153,125
[180, 190)               10                 1                    0.0345                     0.00345              185           185     34,225
Total                                      29                                                                               4,535    715,725

The density histogram is as shown in Figure 2.5.


We now use the grouped-frequency data to compute particular descriptive statistics, specifically the mean, variance and standard deviation. Using the grouped-frequency data, the sample mean is:
$$\bar{x} = \frac{\sum_{k=1}^{7} f_k x_k}{\sum_{k=1}^{7} f_k} = \frac{4{,}535}{29} = 156.4$$

and the sample variance is:

$$s^2 = \frac{\sum_{k=1}^{7} f_k x_k^2}{\sum_{k=1}^{7} f_k} - \left(\frac{\sum_{k=1}^{7} f_k x_k}{\sum_{k=1}^{7} f_k}\right)^2 = \frac{715{,}725}{29} - (156.4)^2 = 219.2$$
giving a standard deviation of $\sqrt{219.2} = 14.8$. For comparison, the ungrouped mean, variance and standard deviation are 156.0, 217.0 and 14.7, respectively (compute these yourself to verify!).

Note the units for the mean and standard deviation are ‘millions of shares/week’, while the units for the variance are the square of those for the standard deviation, i.e. ‘(millions of shares/week)²’, so this is an obvious reason why we often work with the standard deviation, rather than the variance, due to the original (and more meaningful) units.

Figure 2.5: Density histogram of trading volume data for Example 2.17.
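A short Python cross-check of these grouped calculations (a sketch only; as in the guide, the mean is rounded to one decimal place before being squared):

```python
midpoints = [125, 135, 145, 155, 165, 175, 185]   # class interval midpoints x_k
freqs     = [  1,   4,   5,   6,   7,   5,   1]   # class frequencies f_k

n = sum(freqs)                                              # 29
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n     # 4,535 / 29 = 156.4 (to 1 d.p.)
mean_of_squares = sum(f * x ** 2 for f, x in zip(freqs, midpoints)) / n   # 715,725 / 29
variance = mean_of_squares - round(mean, 1) ** 2            # 219.2, using the rounded mean as in the guide
print(round(mean, 1), round(variance, 1), round(variance ** 0.5, 1))      # 156.4 219.2 14.8
```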


2.9 Test your understanding

Let us now consider an extended example bringing together many of the issues considered in this chapter.
At a time of economic growth but political uncertainty, a random sample of n = 40
economists (from the population of all economists) produces the following forecasts for
the growth rate of an economy in the next year:

1.3 3.8 4.1 2.6 2.4 2.2 3.4 5.1 1.8 2.7
3.1 2.3 3.7 2.5 4.1 4.7 2.2 1.9 3.6 2.8
4.3 3.1 4.2 4.6 3.4 3.9 2.9 1.9 3.3 8.2
5.4 3.3 4.5 5.2 3.1 2.5 3.3 3.4 4.4 5.2

(a) Draw a density histogram for these data.


(b) Construct a stem-and-leaf diagram for these data.
(c) Comment on the shape of the sample distribution.
(d) Determine the median of the data using the stem-and-leaf diagram in (b).
(e) Produce a boxplot for these data.
(f) Using the following summary statistics, calculate the sample mean and sample
standard deviation of the growth rate forecasts:
$$\text{Sum of data} = \sum_{i=1}^{40} x_i = 140.4 \quad \text{and} \quad \text{Sum of squares of data} = \sum_{i=1}^{40} x_i^2 = 557.26.$$

(g) Comment on the relative values of the mean and median.


(h) What percentage of the data fall within one sample standard deviation of the
sample mean? And what percentage fall within two sample standard deviations of
the sample mean?

Solution:

(a) It would be sensible to have class interval widths of 1 unit, which conveniently
makes the density values the same as the relative frequencies! We construct the
following table and plot the density histogram.

Class interval   Interval width, wk   Frequency, fk   Relative frequency, rk = fk/n   Density, dk = rk/wk
[1.0, 2.0)               1                  4                    0.100                       0.100
[2.0, 3.0)               1                 10                    0.250                       0.250
[3.0, 4.0)               1                 13                    0.325                       0.325
[4.0, 5.0)               1                  8                    0.200                       0.200
[5.0, 6.0)               1                  4                    0.100                       0.100
[6.0, 7.0)               1                  0                    0.000                       0.000
[7.0, 8.0)               1                  0                    0.000                       0.000
[8.0, 9.0)               1                  1                    0.025                       0.025


(b) A stem-and-leaf diagram for the data is:

Stem-and-leaf diagram of economic growth forecasts

Stem (%) Leaves (0.1%)


1 3899
2 2234556789
3 1113334446789
4 11234567
5 1224
6
7
8 2

Note that we still show the ‘6’ and ‘7’ stems even though they have no
corresponding leaves. If we omitted these stems (so that the ‘8’ stem is immediately
below the ‘5’ stem) then this would distort the true shape of the sample
distribution, which would be misleading.
(c) The density histogram and stem-and-leaf diagram show that the data are
positively-skewed (skewed to the right), due to the outlier forecast of 8.2%.
Note if you are ever asked to comment on the shape of a distribution, consider:
• Is the distribution (roughly) symmetric?
• Is the distribution bimodal?
• Is the distribution skewed (an elongated tail in one direction)? If so, what is
the direction of the skewness?
• Are there any outliers?


(d) There are n = 40 observations, so the median is the average of the 20th and 21st
ordered observations. Using the stem-and-leaf diagram in part (b), we see that
x(20) = 3.3 and x(21) = 3.4. Therefore, the median is (3.3 + 3.4)/2 = 3.35%.
(e) Since Q2 is the median, which is 3.35, we now need the first and third quartiles, Q1
and Q3 , respectively. There are several methods for determining the quartiles, and
any reasonable approach would be acceptable in an examination. For simplicity,
here we will use the following since n is divisible by 4:
Q1 = x(n/4) = x(10) = 2.5% and Q3 = x(3n/4) = x(30) = 4.2%.
Hence the interquartile range (IQR) is Q3 − Q1 = 4.2 − 2.5 = 1.7%. Therefore, the
whisker limits must satisfy:
max(x(1) , Q1 − 1.5 × IQR) and min(x(n) , Q3 + 1.5 × IQR)
which is:
max(1.3, −0.05) = 1.30 and min(8.2, 6.75) = 6.75.
We see that there is just a single observation which lies outside the interval
[1.30, 6.75], which is x(40) = 8.2% and hence this is plotted individually in the
boxplot. Since this is less than Q3 + 3 × IQR = 4.2 + 3 × 1.7 = 9.3%, then this
observation is an outlier, rather than an extreme outlier.
The boxplot is (a horizontal orientation is also fine):

Note that the upper whisker terminates at 5.4, which is the most extreme data
point within 1.5 times the IQR above Q3 , i.e. the maximum value no larger than
6.75% as easily seen from the stem-and-leaf diagram in part (b). The lower whisker
terminates at x(1) = 1.3%, since the minimum value of the dataset is within 1.5
times the IQR below Q1 .
It is important to note that boxplot conventions may vary, and some software or
implementations might use slightly different methods for calculating whiskers.
Additionally, different multipliers (other than 1.5) might be used in practice
depending on the desired sensitivity to outliers.


(f) We have sample data, not population data, hence the (sample) mean is denoted by
x̄ and the (sample) standard deviation is denoted by s. We have:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{140.4}{40} = 3.51\%$$

and:

$$s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) = \frac{1}{39}\left(557.26 - 40 \times (3.51)^2\right) = 1.6527.$$

Therefore, the standard deviation is $s = \sqrt{1.6527} = 1.29\%$.

(g) In (c) it was concluded that the density histogram and stem-and-leaf diagram of
the data were positively-skewed, and this is consistent with the mean being larger
than the median. It is possible to quantify skewness, although this is beyond the
scope of the syllabus.

(h) We calculate:

x̄ − s = 3.51 − 1.29 = 2.22 and x̄ + s = 3.51 + 1.29 = 4.80

also:

x̄ − 2 × s = 3.51 − 2 × 1.29 = 0.93 and x̄ + 2 × s = 3.51 + 2 × 1.29 = 6.09.

Now we use the stem-and-leaf diagram to see that 29 observations are between 2.22
and 4.80 (i.e. the interval [2.22, 4.80]), and 39 observations are between 0.93 and
6.09 (i.e. the interval [0.93, 6.09]). So the proportion (or percentage) of the data in
each interval, respectively, is:
$$\frac{29}{40} = 0.725 = 72.5\% \quad \text{and} \quad \frac{39}{40} = 0.975 = 97.5\%.$$
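These counts are simple to verify by computer; the following sketch is illustrative only and not required for the course.

```python
data = [1.3, 3.8, 4.1, 2.6, 2.4, 2.2, 3.4, 5.1, 1.8, 2.7,
        3.1, 2.3, 3.7, 2.5, 4.1, 4.7, 2.2, 1.9, 3.6, 2.8,
        4.3, 3.1, 4.2, 4.6, 3.4, 3.9, 2.9, 1.9, 3.3, 8.2,
        5.4, 3.3, 4.5, 5.2, 3.1, 2.5, 3.3, 3.4, 4.4, 5.2]

mean, s = 3.51, 1.29
within_1 = sum(mean - s <= x <= mean + s for x in data)       # observations in [2.22, 4.80]
within_2 = sum(mean - 2*s <= x <= mean + 2*s for x in data)   # observations in [0.93, 6.09]
print(within_1 / len(data), within_2 / len(data))             # 0.725 and 0.975
```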

Some general points to note are the following.

• Many ‘bell-shaped’ distributions we meet – that is, distributions which look a bit
like the normal distribution (introduced in Chapter 4) – have the property that
68% of the data lie within approximately one standard deviation of the mean, and
95% of the data lie within approximately two standard deviations of the mean. The
percentages in (h) are fairly similar to these.

• The exercise illustrates the importance of (at least) one more decimal place than in
the original data. If we had 3.5% and 1.3% for the mean and standard deviation,
respectively, the ‘boundaries’ for the interval with one standard deviation would
have been 3.5 ± 1.3 ⇒ [2.2, 4.8]. Since 2.2 is a data value which appears twice, we
would have had to worry about which side of the ‘boundary’ to allocate these.
(This type of issue can still happen with the extra decimal place, but much less
frequently.)

• When constructing a histogram, it is possible to ‘lose’ a pattern in the data – for example, an approximate bell shape – through two common errors:

• too few class intervals (which is the same as too wide class intervals)
• too many class intervals (which is the same as too narrow class intervals).
For example, with too many class intervals, you mainly get 0, 1 or 2 items per class interval, so any (true) peak is hidden by the subdivisions which you have used.
• The best number of (equal-sized) class intervals depends on the sample size. For
large samples, many class intervals will not lose the pattern, while for small
samples they will. However, with the datasets which tend to crop up in ST104A
Statistics 1, somewhere between 6 and 10 class intervals are likely to work well.

2.10 Overview of chapter


In statistical analysis, there are usually simply too many numbers to make sense of just
by staring at them. Data visualisation and descriptive statistics attempt to summarise
key features of the data to make them understandable and easy to communicate. The
main function of diagrams is to bring out interesting features of a dataset visually by
displaying its distribution, i.e. summarising the whole sample distribution of a variable.
Descriptive statistics allow us to summarise one feature of the sample distribution in a
single number. In this chapter we have worked with measures of central tendency,
measures of dispersion and skewness.

2.11 Key terms and concepts


Boxplot
Categorical variable
Continuous
Density histogram
Discrete
Dot plot
Interquartile range (IQR)
Mean (of sample)
Measurable variable
Measures of dispersion
Measures of location
Median (of sample)
Mode (of sample)
Nominal
Ordinal
Range
Sample distribution
Skewness
Standard deviation (of sample)
Stem-and-leaf diagram
Variable
Variance (of sample)

2.12 Sample examination questions


1. Classify each one of the following variables as either measurable (continuous) or
categorical. If a variable is categorical, further classify it as nominal or ordinal.
Justify your answer.
(a) Gross domestic product (GDP) of a country.
(b) Five possible responses to a customer satisfaction survey ranging from ‘very
satisfied’ to ‘very dissatisfied’.
(c) A person’s name.


2. The data below contain measurements of the low-density lipoproteins, also known as the ‘bad’ cholesterol, in the blood of 30 patients. Data are measured in milligrams per decilitre (mg/dL).
95 96 96 98 99
99 101 101 102 102
103 104 104 107 107
111 112 113 113 114
115 117 121 123 124
127 129 131 135 143

(a) Construct a density histogram of the data.


(b) Find the mean (given that the sum of the data is 3,342), the median and the standard deviation (given that the sum of the squared data, $\sum x_i^2$, is 377,076).
(c) Comment on the data given the shape of the histogram.

3. The average daily intakes of calories, measured in kcals, for a random sample of 12
athletes were:

1,808, 1,936, 1,957, 2,004, 2,009, 2,101, 2,147, 2,154, 2,200, 2,231, 2,500, 3,061.

(a) Construct a boxplot of the data. (The boxplot does not need to be exactly to
scale, but values of box properties and whiskers should be clearly labelled.)
(b) Based on the shape of the boxplot you have drawn, describe the distribution of
the data.
(c) Name two other types of graphical displays which would be suitable to
represent the data. Briefly explain your choices.

2.13 Solutions to Sample examination questions


1. A general tip for identifying measurable and categorical variables is to think of the
possible values they can take. If these are finite and represent specific entities the
variable is categorical. Otherwise, if these consist of numbers corresponding to
measurements, the data are continuous and the variable is measurable. Such
variables may also have measurement units or can be measured to various decimal
places.
(a) Measurable, because GDP can be measured in $bn or $tn to several decimal
places.
(b) Each satisfaction level corresponds to a category. The level of satisfaction is in
a ranked order – for example, in terms of the list items provided. Therefore,
this is a categorical ordinal variable.
(c) Each name (James, Jane etc.) is a category. Also, there is no natural ordering
between the names – for example, we cannot really say that ‘James is higher
than Jane’. Therefore, this is a categorical nominal variable.


2. (a) We have:

Class interval   Interval width, wk   Frequency, fk   Relative frequency, rk = fk/n   Density, dk = rk/wk
[90, 100)               10                 6                     0.200                      0.0200
[100, 110)              10                 9                     0.300                      0.0300
[110, 120)              10                 7                     0.233                      0.0233
[120, 130)              10                 5                     0.167                      0.0167
[130, 140)              10                 2                     0.067                      0.0067
[140, 150)              10                 1                     0.033                      0.0033

(b) We have:

$$\bar{x} = \frac{3{,}342}{30} = 111.4 \text{ mg/dL}$$

also median = 109 mg/dL, and the standard deviation is:

$$s = \sqrt{\frac{1}{29} \times \left(377{,}076 - 30 \times (111.4)^2\right)} = 12.83 \text{ mg/dL}.$$

(c) The data exhibit positive skewness, as shown by the mean being greater than
the median.


3. (a) Depending on quartile calculation methods, there may be slight variations in the computed values of Q1 and Q3. However, the boxplot should look (very) similar to:

[Boxplot of the calorie intake data, with 3,061 plotted individually as an outlier.]

Note that no label of the x-axis is necessary and that the plot can be transposed.
(b) Based on the shape of the boxplot above, we can see that the distribution of
the data is positively skewed, equivalently skewed to the right, due to the
presence of the outlier of 3,061 kcals.
(c) A density histogram, stem-and-leaf diagram or a dot plot are other types of
suitable graphical displays. The reason is that the variable is measurable and
these graphs are suitable for displaying the distribution of such variables.
