meiassd2n
You’ve made your plan and collected the data. The next step is to make sense of the data.
With calculators and data analysis software able to churn out summary statistics and
diagrams with a few clicks of a button, the skills you need to analyse data are:
• an ability to get the data in the correct form for the calculations
• a clear understanding of the terminology and notation
• the ability to interpret the data/charts/summary statistics in full context
• the ability to share the data in a way that is appropriate to your audience.
The type of data you are considering is likely to dictate which techniques are available to
you.
• Categorical data where the information about each item in the sample is a category
rather than having numerical significance (e.g. favourite flavour of yoghurt). Some
categorical data is orderable (e.g. pain ratings such as: none, mild, moderate,
severe, acute). Some categorical data can be numbers, but where the number
doesn’t have significance (e.g. numbers on rugby players’ shirts). Categorical data is
qualitative.
Ordered categorical data may be converted to discrete numerical data, but it needs
some careful thought to make sure the numbers are appropriate.
• Discrete numerical data where the variable is numerically significant, and therefore
orderable, but the results are countable rather than measurable (e.g. number of
people in a household). The values are discrete: each result is one value or another. Those
values aren’t always integers though (e.g. individual judges’ scores in a diving
competition).
• Ranked data where the data is positional within the group rather than by the
measurement or score. It is possible to create a ranking based on unranked
numerical data. Ranked, or ordered, data is useful for finding the median, quartiles,
maximum, minimum, range etc.
• Continuous numerical data where the variable could, theoretically, take any value
(e.g. height of a person, weight of a vehicle, amount of rainfall). In practice most
continuous data has limits on the accuracy of what can actually be measured, and a
potential maximum and minimum value it might take. For example, the masses of
newly hatched ducklings are between 25 and 45 grams; it is not possible for a
duckling to have mass 0.1 grams or 500 grams. The mass of a particular duckling
might show on scales as 27.8 grams but in reality the duckling could have mass
anywhere between 27.75 grams and 27.85 grams.
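The remark under ranked data, that a ranking can be created from unranked numerical data, can be sketched as follows. This is a minimal illustration in Python rather than part of the original notes: the scores are made-up values, the function name is hypothetical, and giving tied values the same rank is only one of several possible conventions.

```python
def rank_descending(values):
    """Return the rank of each value, where rank 1 is the largest value.

    Tied values share the best rank available to them (an assumed
    convention; other tie-breaking rules exist).
    """
    ordered = sorted(values, reverse=True)
    # index() finds the first position of v in the ordered list,
    # so equal values receive the same rank.
    return [ordered.index(v) + 1 for v in values]


# Hypothetical scores for five competitors.
scores = [7.5, 9.0, 6.5, 9.0, 8.0]
ranks = rank_descending(scores)
print(ranks)
```

Once the data is ranked, positional measures such as the median or the quartiles can be read off by position rather than by score.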
But the context of the questions and the data gathering methodology is important when making
these decisions. Your experiment/investigation might cope with some pieces of missing
data, but it could be that the effect of reducing the sample size isn’t acceptable. In a
stratified sample, for example, if data is missing from a certain subcategory, the
proportional representation of each category may be thrown off, so a new piece of
equivalent data might need to be sought. If the sample is large enough, missing data is less
likely to have a significant impact. If the sample is affected by missing data, it is worth
considering why some data is missing or erroneous and whether there is a different way it could
be collected to reduce the imperfections.
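To see how missing data can disturb the proportions in a stratified sample, the sketch below compares planned and achieved shares for each stratum. The strata names and counts are invented purely for illustration, and `proportions` is a hypothetical helper, not something taken from the notes.

```python
def proportions(counts):
    """Return each stratum's share of the total as a fraction between 0 and 1."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Hypothetical stratified sample: planned counts per stratum.
planned = {"Year 7": 40, "Year 8": 35, "Year 9": 25}
# Hypothetical outcome: 10 responses from Year 9 are missing.
achieved = {"Year 7": 40, "Year 8": 35, "Year 9": 15}

planned_share = proportions(planned)
achieved_share = proportions(achieved)

for stratum in planned:
    print(stratum, round(planned_share[stratum], 3), round(achieved_share[stratum], 3))
```

In this made-up case the stratum with missing data becomes under-represented and the other two become over-represented, which is the kind of distortion the paragraph above warns about.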
Visualisations and calculations are used as part of a mini-cycle here. With huge data sets it
is much easier to identify anomalies using an appropriate visual representation.
Tables
When confronted with a big list of raw data it is always useful to organise it. A table can be
a good starting point. The use of technology to store, process and sort the data means
large data sets can be organised relatively easily. But even in a neat table the sheer volume
of data can make it difficult to see a way through. Some ways of starting to summarise and
organise the data are:
Frequency tables can condense a huge list into something more manageable. If the data is
categorical, or discrete numerical, with a finite number of categories it can be placed into
a frequency table. Frequency tables tell you how many occurrences there are of each value
of the variable. They can then be used to construct bar charts, or converted into
proportions and then pie charts etc. Headings x and f give the variable and the frequency.
A potentially useful extra column is fx, the frequency multiplied by the variable, which is
the total of all the data points that take that specific x value. If you sum the fx column
it gives you the value of all the individual data points added together.

x         f     fx
0         2      0
2         5     10
3        13     39
4        10     40
5         6     30
6         3     18
Totals   39    137

$\sum fx = 0 + 10 + 39 + 40 + 30 + 18 = 137$
$n = \sum f = 2 + 5 + 13 + 10 + 6 + 3 = 39$
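The totals in a frequency table are easy to check with a few lines of code. The sketch below, in Python, stores the x and f columns from the table as a dictionary (an assumed representation chosen for brevity) and recomputes n, the sum of the fx column and the mean.

```python
# Frequency table from the text: value x mapped to its frequency f.
frequency_table = {0: 2, 2: 5, 3: 13, 4: 10, 5: 6, 6: 3}

n = sum(frequency_table.values())                        # n = sum of f
sum_fx = sum(x * f for x, f in frequency_table.items())  # sum of the fx column
mean = sum_fx / n                                        # mean = sum of fx / sum of f

print(n, sum_fx, round(mean, 2))
```

Dividing the fx total by n gives the mean directly, without expanding the table back into a list of individual data points.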
Grouping data
Height, h cm       f
3.1 ≤ h < 3.7    143
3.7 ≤ h < 4.7     76
4.7 ≤ h < 5.2    237
5.2 ≤ h < 6.0    198
If the data you have is continuous, a frequency table may not be much simpler than the list
of raw data. In this case creating classes within the data can get the desired
simplification. This helps make the data more accessible but loses some of the accuracy and
detail. Calculations made from grouped data become estimates. You need to know how to
estimate the summary statistics from a grouped frequency table. To do this you can interpolate
within the group (linearly or using a cumulative frequency
You need to be comfortable using these different tables and knowing how to calculate key
summary data from each of them. See notes in the data tool kit or research independently if
you are unsure!
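As a rough sketch of how estimates can be made from a grouped frequency table, the Python below uses the height table from this section. Two assumptions are made that the notes do not spell out: the mean is estimated by treating every value as its class midpoint, and the median is estimated by linear interpolation with the data spread evenly across the median class. The function names are hypothetical.

```python
# Grouped frequency table from the text: (lower bound, upper bound, frequency).
classes = [(3.1, 3.7, 143), (3.7, 4.7, 76), (4.7, 5.2, 237), (5.2, 6.0, 198)]


def estimate_mean(classes):
    """Estimate the mean by treating every value as its class midpoint."""
    n = sum(f for _, _, f in classes)
    total = sum(f * (low + high) / 2 for low, high, f in classes)
    return total / n


def estimate_median(classes):
    """Estimate the median by linear interpolation within the median class."""
    n = sum(f for _, _, f in classes)
    target = n / 2          # position of the median in a large data set
    cumulative = 0
    for low, high, f in classes:
        if cumulative + f >= target:
            # Assume values are spread evenly across the class.
            return low + (target - cumulative) / f * (high - low)
        cumulative += f
    raise ValueError("empty table")


print(round(estimate_mean(classes), 2), round(estimate_median(classes), 2))
```

Both results are estimates: the individual heights inside each class are unknown, so a different assumption about how they are spread would give slightly different values.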
Summary statistics
Summary statistics are used to give a sense of the data by summarising some of the key
features of the data in numerical form. Think about what it is useful to know: average
(centres), distributions (shape), spread (range). Data analysis software can do the hard
work and present you with a list of summary statistics, whether you are using your
calculator, Excel or a statistical package. It is important to know what they are, what they
mean and how the calculations work.
Mean
$\bar{x} = \dfrac{\sum x}{n}$
Median
The middle value when the data is ordered. This is the $\dfrac{n}{2}$th value in an ordered
list that is sufficiently large. (In small lists use the $\dfrac{n+1}{2}$th value.) If you
need to find the 823.5th value, for example, find the mean of the 823rd and 824th values.
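The position rule for the median of a small list can be sketched in Python as below. The function name and the data are hypothetical, and only the (n + 1)/2 rule is implemented; the result is compared with the standard library's `statistics.median` as a sanity check.

```python
import math
import statistics


def median_by_position(values):
    """Find the median using the (n + 1)/2-th position rule for small lists."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                       # 1-based position, may end in .5
    lower = ordered[math.floor(position) - 1]    # value at the position rounded down
    upper = ordered[math.ceil(position) - 1]     # value at the position rounded up
    # For a whole-number position both values coincide; for a ".5" position
    # this is the mean of the two neighbouring values.
    return (lower + upper) / 2


# Hypothetical data set with an even number of values.
data = [12, 7, 9, 15, 10, 8]
print(median_by_position(data), statistics.median(data))
```

With six values the position is the 3.5th, so the function averages the 3rd and 4th ordered values, in the same way that the 823.5th value is found from the 823rd and 824th.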
Midrange
The value at the middle of the range. As with the mean, outliers have a big effect on this
value. It isn’t as common as the other measures of central tendency, but there are times
when it may be useful to include. Paired with the value of the range, for example, it
effectively gives the maximum and minimum values of the data.
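The claim that the midrange and the range together pin down the maximum and minimum can be checked with a short sketch. The data set and both function names below are hypothetical and chosen only for illustration.

```python
def midrange_and_range(values):
    """Return (midrange, range) for a list of numbers."""
    low, high = min(values), max(values)
    return (low + high) / 2, high - low


def recover_extremes(midrange, data_range):
    """Recover (minimum, maximum) from the midrange and the range."""
    return midrange - data_range / 2, midrange + data_range / 2


# Hypothetical data set.
data = [12, 15, 11, 20, 18]
mid, rng = midrange_and_range(data)
print(mid, rng, recover_extremes(mid, rng))
```

Half the range either side of the midrange gives back the two extreme values, which is why quoting both is equivalent to quoting the maximum and minimum.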
Which measures are relevant to your investigation will depend on: type of data, purpose of
questions, context etc. It may be that all these measures are useful.
You need to be able to find each of these measures from a list of data, from frequency
tables and estimate them from grouped frequency tables.
They can be used to directly compare two, or more, sets of data. They are often used to
give a sense about where an individual data point fits in comparison to the population.
“Above/below average” is a phrase you will often spot in news headlines.
Example 1
The averages of maximum daily wind gust speeds for Cambourne (1987) are
Mean = 220 knots, Median = 245 knots, Mode = 360 knots (all given accurate to 2 s.f.)
(a) Maya says that more than 50% of the days had a maximum daily wind gust speed
greater than 360 knots. Comment on Maya’s claim.
(b) Khalid says that the median is the best average to use in this case as it is the middle
of the central measures. Comment on Khalid’s statement.
(c) Poppy says that 230 knots is above average for the maximum daily wind gust speed
for Cambourne in 1987. Comment on Poppy’s statement.
Solution
(a) Maya is incorrect. The median is 245 knots and 50% of values are above this.
Since the mode is also above this it would be reasonable to conclude that less than 50% of the values are
above the mode. In fact, 360 is the largest value in the data set, though it isn’t possible to tell that from the
information given.
(b) Khalid’s statement is questionable. The median may be a good average to use, but
not necessarily for the reason he has given.
The mode can be big, small, middling or all at the same time if polymodal. Using the mode to consider the
relevance of the other measures is unsound. That’s not to say it isn’t relevant, but the mode would often lead
to more questions.
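(c) Poppy’s statement depends on which average is meant: 230 knots is above the mean
(220 knots) but below the median (245 knots) and the mode (360 knots).
Using “average” interchangeably for the different measures of central tendency can give rise to
misunderstanding and misrepresentation. Defining the type of average is useful and important. If you are
interpreting data, it should be part of what you ask.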
Measures of dispersion/spread
Example 2
Elin is applying for jobs. She applies to two companies and having researched their pay
structure finds company A has an average pay of £43 754 and company B has an average
pay of £38 973. Elin says she would prefer to work for the company that has the higher
average pay.
(a) Is this a good decision? Which company would you prefer to work for?
Further research shows that the average (given above) is the mean, and the range of pay
for company A is £195 000 and for company B is £3 000.
(b) Does this change your opinion about which company you would want to work for?
It turns out that the pay structure for both companies is that all the employees
receive the same wage and the boss gets more. For company A the boss receives £219 254
and the nine employees each get £24 254. For company B the boss receives £41 913 and
each of the 49 employees gets £38 913.
(c) Which company would you prefer to work for now? Would you make a different
decision depending on whether you were the boss or an employee?
Solution
(a) Based on the current information the higher average wage seems like a reasonable
choice, all else being equal. But without knowing more it is hard to say how this is
likely to impact what Elin’s starting wage would be.
(b) This additional information tells you that whilst the mean is much higher for company
A there is also a much bigger range. The range will spread both above and below
the mean so the wage offered could be significantly higher or lower.
(c) The additional information (how many people there are, the range, how the data is
distributed) is all important. At this point company B looks like a better option,
unless you are the boss.
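The figures quoted in Example 2 can be checked against the pay structures given in part (c). The Python sketch below rebuilds each company's pay list and recomputes the mean and range; the median is also printed as an extra point of comparison that the example itself does not quote.

```python
import statistics

# Pay structures from part (c): one boss plus employees on an identical wage.
company_a = [219254] + [24254] * 9
company_b = [41913] + [38913] * 49

for name, pay in (("A", company_a), ("B", company_b)):
    mean = statistics.mean(pay)
    median = statistics.median(pay)
    pay_range = max(pay) - min(pay)
    print(name, mean, median, pay_range)
```

For each company the median equals the employee wage, which is one way of seeing why the mean alone says little about what a new employee is likely to be paid.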
As shown above, the measures of central tendency are often used as a comparison point:
“above or below average” can be seen as a tipping point of success. However, a single
value doesn’t give much depth to understanding the nuances of what is going on. Two sets
of data could have very similar central measures but be significantly different from each other.
You might try summing the individual deviations from the mean to measure the spread of the
full data set, but the result will always be zero. So, the deviations are squared and then
summed. The sum of the squared deviations for sample data is Sxx.
$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$, which can also be written as $S_{xx} = \sum (x - \bar{x})^2$.
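A short Python sketch makes the two facts above concrete: the deviations from the mean sum to zero, while their squares sum to Sxx. The data set is hypothetical and was chosen so that the mean is a whole number, which avoids rounding noise.

```python
# Hypothetical sample chosen so that the mean is a whole number.
data = [2, 4, 4, 5, 10]

n = len(data)
mean = sum(data) / n                      # x-bar
deviations = [x - mean for x in data]     # x - x-bar for each value

sum_of_deviations = sum(deviations)       # zero here; in general zero up to rounding
s_xx = sum(d ** 2 for d in deviations)    # sum of squared deviations, Sxx

print(sum_of_deviations, s_xx)
```

The positive and negative deviations cancel exactly, which is why squaring is needed before summing.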
Is this a measure we can use to compare data sets? No, since Sxx reflects not just the size
of the deviations but the size of the data set too. By dividing by the size of the sample you
get a comparable measure of spread, which is called the variance. But even here it isn’t
quite as straightforward as that might sound. It turns out that dividing by n − 1 rather than by
n gives a better estimate of the variance of the underlying population: because the
deviations sum to zero, there aren’t n independent values but n − 1.
Variance (often denoted by $s^2$, $s_x^2$ or $s^2x$) for sample data is found using $s^2 = \dfrac{S_{xx}}{n-1}$.
Note on notation: using a subscript in $s_x^2$ is helpful if you are working with more than one data set, or you are
analysing different elements within a data set. The third version is how it may look on a calculator - check you
know how the summary statistics appear on your calculator!
Standard deviation for sample data:
$s = \sqrt{\dfrac{S_{xx}}{n-1}} = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n-1}}$
If you have data for a complete population, you can calculate the variance using the mean squared deviation,
which is $\dfrac{S_{xx}}{n}$, quite literally the mean of the squared deviations. This is often denoted by
$\sigma^2$, $\sigma_x^2$ or $\sigma^2x$.
The square root of the mean squared deviation is the root mean squared deviation, $\sqrt{\dfrac{S_{xx}}{n}}$. If
dealing with population data then this is the calculation for the standard deviation ($\sigma$, $\sigma_x$ or $\sigma x$).
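The difference between dividing by n − 1 and dividing by n can be seen in the sketch below, which treats one hypothetical data set first as a sample and then as a complete population. The hand calculations are compared with Python's `statistics` module, whose `variance` and `stdev` functions divide by n − 1 and whose `pvariance` and `pstdev` functions divide by n.

```python
import math
import statistics

# Hypothetical data set, treated first as a sample and then as a population.
data = [2, 4, 4, 5, 10]
n = len(data)
mean = sum(data) / n
s_xx = sum((x - mean) ** 2 for x in data)

sample_variance = s_xx / (n - 1)        # s^2, divides by n - 1
sample_sd = math.sqrt(sample_variance)  # s
msd = s_xx / n                          # mean squared deviation (population variance)
rmsd = math.sqrt(msd)                   # root mean squared deviation (population sd)

# Hand calculations alongside the standard library equivalents.
print(sample_variance, statistics.variance(data))
print(sample_sd, statistics.stdev(data))
print(msd, statistics.pvariance(data))
print(rmsd, statistics.pstdev(data))
```

The sample versions are always a little larger than the population versions for the same data, and the gap shrinks as n grows.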
Alternative form:
If the mean doesn’t work out neatly it can make the deviations difficult to work with. In this
case, to limit errors creeping in, you can use an alternative form for Sxx.
$S_{xx} = \sum x^2 - n\bar{x}^2$
Using this alternative form of the sum of squares would give the following formulae:
Sample variance $s^2 = \dfrac{\sum x^2 - n\bar{x}^2}{n-1}$
Sample standard deviation $s = \sqrt{\dfrac{\sum x^2 - n\bar{x}^2}{n-1}}$
With large sets of data you may be given some of the summary data ($n$, $\sum x$, $\sum x^2$) from
which to calculate the variance or standard deviation.
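When only the summary data is available, the alternative form of Sxx is the natural route to the variance. The Python sketch below is one way this might be coded; `sample_variance_from_summary` is a hypothetical helper, and the small data set exists only to generate summary values that can be checked by hand.

```python
import math


def sample_variance_from_summary(n, sum_x, sum_x2):
    """Sample variance from n, the sum of x and the sum of x squared."""
    mean = sum_x / n
    s_xx = sum_x2 - n * mean ** 2      # alternative form of Sxx
    return s_xx / (n - 1)


# Hypothetical data, used only to produce summary values.
data = [2, 4, 4, 5, 10]
n = len(data)
sum_x = sum(data)
sum_x2 = sum(x ** 2 for x in data)

variance = sample_variance_from_summary(n, sum_x, sum_x2)
print(variance, math.sqrt(variance))
```

With real data the mean is rarely a whole number, so it is worth keeping full calculator or floating-point precision for the mean until the final step rather than rounding it early.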
Calculating outliers
Samples may include values that are classed as outliers. These are values that are much
higher or lower than the rest of the sample. This doesn’t necessarily mean they are wrong
or shouldn’t be included, just that they are significantly different to the rest of the data set so
should have some consideration.
The child whose height was given as 0.924 cm belonged to the following (age-based) nursery group:
{0.924, 86.1, 87.0, 87.4, 88.9, 88.9, 90.2, 91.5, 91.7, 91.9, 92.0, 92.8, 94.1, 94.2, 95.7, 105.6}
0.924 is an outlier. It is clearly an error and so should be corrected or excluded. The value of 105.6 cm seems
quite big compared to the others based on a quick inspection of the ordered list. But does it matter? Is it
bigger by enough to count as an outlier?
As part of your approach you need to determine when a data value can be considered an
outlier and then decide how you should deal with it.
Remember, just because a data value is identified as an outlier doesn’t mean it should be excluded from the
data.
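One possible way to answer the question about the 105.6 cm height is sketched below in Python. It assumes the erroneous 0.924 value has been excluded and that an outlier is defined as a value more than 2 standard deviations from the mean, which is the definition used in Example 3; other definitions could give a different answer.

```python
import statistics

# Nursery group heights in cm from the text, with the erroneous 0.924 excluded.
heights = [86.1, 87.0, 87.4, 88.9, 88.9, 90.2, 91.5, 91.7,
           91.9, 92.0, 92.8, 94.1, 94.2, 95.7, 105.6]

mean = statistics.mean(heights)
s = statistics.stdev(heights)          # sample standard deviation (divides by n - 1)

# Assumed rule: outliers lie more than 2 standard deviations from the mean.
lower, upper = mean - 2 * s, mean + 2 * s
outliers = [h for h in heights if h < lower or h > upper]

print(round(lower, 1), round(upper, 1), outliers)
```

Under this particular rule 105.6 cm is flagged, but being flagged only means the value deserves a closer look, not that it should be removed.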
Example 3
A sample of 200 pulse rates (beats per minute) was taken randomly from an American
national health survey. Some data were missing so were excluded.
The data can be summarised as follows:
$n = 191$, $\sum x = 13772$, $\sum x^2 = 1030120$
What rates would be considered to be outliers if an outlier is defined as a value more than
2 standard deviations from the mean?
Solution
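$\bar{x} = \dfrac{\sum x}{n} = \dfrac{13772}{191} = 72.10471...$
$\bar{x}^2 = 5199.08949...$
$s = \sqrt{\dfrac{\sum x^2 - n\bar{x}^2}{n-1}} = \sqrt{\dfrac{1030120 - 191 \times 5199.08949...}{190}} = 13.97251...$
$\bar{x} + 2s = 100.0497...$
$\bar{x} - 2s = 44.15969...$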
Outliers would be pulse rates below 45 beats per minute or above 100 beats per minute.
Pulse rates are recorded as integers so it doesn’t make sense in this context to say that outliers are below
44.16 or above 100.05, for example.
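The working for Example 3 can be reproduced from the summary data alone. The Python sketch below follows the same steps and then converts the bounds into whole-number cut-offs, since pulse rates are integers; the variable names are illustrative choices rather than standard notation.

```python
import math

# Summary data from Example 3.
n, sum_x, sum_x2 = 191, 13772, 1030120

mean = sum_x / n
s = math.sqrt((sum_x2 - n * mean ** 2) / (n - 1))   # sample standard deviation

lower, upper = mean - 2 * s, mean + 2 * s

# Pulse rates are whole numbers and neither bound is a whole number here,
# so round each bound away from the mean to get integer cut-offs.
highest_low_outlier = math.floor(lower)   # rates at or below this are outliers
lowest_high_outlier = math.ceil(upper)    # rates at or above this are outliers

print(round(mean, 3), round(s, 3), highest_low_outlier, lowest_high_outlier)
```

The two cut-offs correspond to "below 45" and "above 100" in the stated answer.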
“The mean height of the bamboo is 27.7” and “the maximum value is 42.2 m” both contain some context, but
“the mean height of the bamboo is 27.7 m” and “the bamboo has a maximum height of 42.2 m” give the full
translation.