

MEI AS Mathematics: Collecting and interpreting data

Section 2: Single variable data


Notes and Examples
These notes contain the following subsections:
General terminology and notation
Clean the data – obvious anomalies
Summary statistics
• Measures of central tendency
• Measures of dispersion/spread
• Variance and deviation
Calculating outliers
Visualise the data - graphs, charts and infographics
Translate your finding(s)

You’ve made your plan and collected the data. The next step is to make sense of the data.
With calculators and data analysis software able to churn out summary statistics and
diagrams with a few clicks of a button, the skills you need to analyse data are:
• an ability to get the data in the correct form for the calculations
• a clear understanding of the terminology and notation
• the ability to interpret the data/charts/summary statistics in full context
• the ability to share the data in an appropriate way to your audience.

General terminology and notation


Variables are the properties being measured as part of your statistical investigation. If the
variable you are studying is random, for example the number of heads when a coin is
flipped 10 times, then it is referred to as a random variable. Rather than having to write
“the number of heads following ten flips of a coin” again and again, a capital letter is used to
represent the variable, for example X. This should be defined clearly before it is first used.
The lowercase x then represents the individual results. Subscripts can be used to distinguish
the individual values: $x_i$ is the ith value.

n is used to represent the total number of data values in the sample.


Frequency is the number of occurrences of a particular result (x). This is shortened to f.

MEI AS Maths: Collecting and interpreting data 2 © MEI 01/08/23


Notes and examples page 1 of 12 integralmaths.org
Sigma notation, $\sum$, is an incredibly useful piece of notation and you will find it in all
sorts of calculations. It means “the sum of”. So $\sum x$ means add up all the different values
of the variable x, and $\sum x^2$ means first square every value of x then add them all up.
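As a minimal sketch of what the two sums mean in practice, the Python snippet below works them out for a small set of values. The data values and variable names are assumptions chosen purely for illustration.

```python
# Illustrative data values (an assumption, not data from a real investigation).
x = [0, 1, 1, 3, 5]

# "Sum of x": add up all the values of the variable.
sum_x = sum(x)

# "Sum of x squared": square every value first, then add the squares up.
sum_x_squared = sum(value ** 2 for value in x)

# A common slip is to add first and square afterwards, which is a different quantity.
square_of_sum = sum_x ** 2

print("sum of x         =", sum_x)
print("sum of x squared =", sum_x_squared)
print("(sum of x) squared =", square_of_sum)
```

The order of operations matters: squaring then summing is generally not the same as summing then squaring.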

The type of data you are considering is likely to dictate which techniques are available to
you.
• Categorical data where the information about each item in the sample is a category
rather than having numerical significance (e.g. favourite flavour of yoghurt). Some
categorical data is orderable (e.g. pain ratings such as: none, mild, moderate,
severe, acute). Some categorical data can be numbers, but where the number
doesn’t have significance (e.g. numbers on rugby players’ shirts). Categorical data is
qualitative.

Ordered categorical data may be converted to discrete numerical data, but it needs
some careful thought to make sure the numbers are appropriate.

• Discrete numerical data where the variable is numerically significant, and therefore
orderable, but the results are countable rather than measurable (e.g. number of
people in a household). The values are discrete: the variable takes one value or another,
with nothing in between. Those values aren’t always integers though (e.g. individual
judges’ scores in a diving competition).

• Ranked data where the data is positional within the group rather than by the
measurement or score. It is possible to create a ranking based on unranked
numerical data. Ranked, or ordered data is useful for finding median, quartiles,
maximum, minimum, range etc.

• Continuous numerical data where the variable could, theoretically, take any value
(e.g. height of a person, weight of a vehicle, amount of rainfall). In practice most
continuous data has limits on the accuracy of what can actually be measured, and a
potential maximum and minimum value it might take. For example, the masses of
newly hatched ducklings are between 25 and 45 grams; it is not possible for a
duckling to have mass 0.1 grams or 500 grams. The mass of a particular duckling
might show on scales as 27.8 grams but in reality the duckling could have mass
anywhere between 27.75 grams and 27.85 grams.

Clean the data – obvious anomalies


Often the majority of the time spent on real life data projects is spent cleaning the data,
that is, processing the raw data so that the analysis is meaningful. This could include:
• copying it into the correct format
• looking for duplicates
• considering missing or clearly incorrect data
• converting values to something more useable (categorical into numerical,
frequencies into proportions).

If the height of a child is given as 0.924 cm it could be reasonable to assume that this is the correct
measurement, but in metres, and so could be easily converted to fit with the rest of the data.
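The kind of check described here can be sketched in Python. This is only an illustration under stated assumptions: the plausibility limits (50 cm to 150 cm), the function name clean_heights and the extra value 950.0 are all hypothetical, and in a real project the limits would come from the context of the investigation.

```python
# Hypothetical plausibility limits for this age group, in cm (an assumption).
PLAUSIBLE_MIN_CM = 50.0
PLAUSIBLE_MAX_CM = 150.0


def clean_heights(raw_heights):
    """Return (cleaned heights in cm, notes on any value that was changed or dropped)."""
    cleaned, notes = [], []
    for h in raw_heights:
        if PLAUSIBLE_MIN_CM <= h <= PLAUSIBLE_MAX_CM:
            cleaned.append(h)
        elif PLAUSIBLE_MIN_CM <= h * 100 <= PLAUSIBLE_MAX_CM:
            # Looks like a measurement in metres recorded in a cm column.
            cleaned.append(round(h * 100, 1))
            notes.append(f"{h} treated as metres and converted to cm")
        else:
            # No sensible interpretation: leave it out and record why.
            notes.append(f"{h} is implausible and was left out")
    return cleaned, notes


# 950.0 is an invented value included to show the "left out" branch.
cleaned, notes = clean_heights([0.924, 86.1, 87.0, 105.6, 950.0])
print(cleaned)
for note in notes:
    print(note)
```

Keeping a note of every change means the decision can be reviewed later, which matters because whether a conversion is justified depends on how the data was gathered.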

But the context of the questions and the data gathering methodology is important when making
these decisions. Your experiment/investigation might cope with some pieces of missing
data, but it could be that the effect of reducing the sample size isn’t acceptable. In a
stratified sample, for example, if data is missing from a certain subcategory the
proportional representation of each category may be thrown off, and so a new piece of
equivalent data may need to be sought. If the sample is large enough, missing data is less
likely to have a significant impact. If the sample is affected by missing data it might be worth
considering why some data is missing or erroneous, and whether there is a different way it
could be collected to reduce the imperfections.
Visualisations and calculations are used as part of a mini-cycle here: with huge data sets it
is much easier to identify anomalies using an appropriate visual representation.

Tables
When confronted with a big list of raw data it is always useful to organise it. A table can be
a good starting point. The use of technology to store, process and sort the data means
large data sets can be organised relatively easily. But even in a neat table the sheer volume
of the data can make it difficult to see a way through. Some ways of starting to summarise
and organise the data are:

Frequency tables can condense a huge list into something more manageable. If the data is
categorical, or discrete numerical, with a finite number of categories it can be placed into a
frequency table. Frequency tables tell you how many occurrences there are of each value of
the variable. They can then be used to construct bar charts, or converted into proportions
and then pie charts etc. Headings x and f give the variable and the frequency. A potentially
useful extra column is fx, frequency multiplied by the variable, which is the total of all the
data points that take that specific x value. If you sum the fx column, $\sum fx$, it gives you
the value of all the individual data points added together.

x         f     fx
0         2      0
2         5     10
3        13     39
4        10     40
5         6     30
6         3     18
Totals   39    137

$\sum fx = 0 + 10 + 39 + 40 + 30 + 18 = 137$
$n = \sum f = 2 + 5 + 13 + 10 + 6 + 3 = 39$
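The column totals can be checked with a short Python sketch. The dictionary layout and variable names are assumptions made for illustration; the values and frequencies are those in the frequency table.

```python
# Frequency table: each value x mapped to its frequency f.
frequency_table = {0: 2, 2: 5, 3: 13, 4: 10, 5: 6, 6: 3}

# n is the sum of the f column.
n = sum(frequency_table.values())

# The fx column, then its total.
fx_column = [x * f for x, f in frequency_table.items()]
sum_fx = sum(fx_column)

# Expanding the table back into a raw list gives the same total.
raw_data = [x for x, f in frequency_table.items() for _ in range(f)]

mean = sum_fx / n

print("fx column:", fx_column)
print("n =", n)
print("sum of fx =", sum_fx)
print("mean =", round(mean, 2))
```

Expanding the table back into raw_data is only there to show that the fx total really is the sum of every individual data point.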
Grouping data
If the data you have is continuous a frequency table may not be much simpler than the list
of raw data. In this case creating classes within the data can get the desired simplification.
This helps make the data more accessible but loses some of the accuracy and detail.
Calculations made from grouped data become estimates. You need to know how to estimate the
summary statistics from a grouped frequency table. To do this you can interpolate within the
group (linearly or using a cumulative frequency curve), or use the mid-point of each group as
an estimate of the data points. This is very unlikely to be the actual distribution but gives
a reasonable estimate for data that is spread evenly through the possible values within the
group.

Height, h cm        f
3.1 ≤ h < 3.7     143
3.7 ≤ h < 4.7      76
4.7 ≤ h < 5.2     237
5.2 ≤ h < 6.0     198
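Both estimation approaches can be sketched in Python using the class boundaries and frequencies from the grouped table. The function name estimate_median and the tuple layout are assumptions for illustration, and the results are estimates only, since the raw values are not known.

```python
# Grouped frequency table: (lower bound, upper bound, frequency).
classes = [(3.1, 3.7, 143), (3.7, 4.7, 76), (4.7, 5.2, 237), (5.2, 6.0, 198)]

n = sum(f for _, _, f in classes)

# Estimated mean: treat every value in a class as sitting at the class mid-point.
estimated_mean = sum(f * (low + high) / 2 for low, high, f in classes) / n


def estimate_median(table):
    """Estimate the median by linear interpolation within the class holding the n/2-th value."""
    total = sum(f for _, _, f in table)
    target = total / 2
    cumulative = 0
    for low, high, f in table:
        if cumulative + f >= target:
            # Assume the values are spread evenly across this class.
            return low + (target - cumulative) / f * (high - low)
        cumulative += f
    raise ValueError("the table contains no data")


estimated_median = estimate_median(classes)

print("n =", n)
print("estimated mean   =", round(estimated_mean, 2), "cm")
print("estimated median =", round(estimated_median, 2), "cm")
```

The interpolation uses n/2 as the median position, which is the usual choice for a large grouped data set.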

Stem and leaf diagrams bridge a gap between grouped data, graphical representation and
tables. They contain all the detail of the raw data. It is still possible to calculate all
the summary measures accurately. They also start to give some information about the way the
data is distributed. Look for keys to help with interpretation.

You need to be comfortable using these different tables and knowing how to calculate key
summary data from each of them. See notes in the data tool kit or research independently if
you are unsure!

Summary statistics
Summary statistics are used to give a sense of the data by summarising some of the key
features of the data in numerical form. Think about what it is useful to know: averages
(centre), distribution (shape), spread (range). Data analysis software can do the hard
work and present you with a list of summary statistics, whether you are using your
calculator, Excel or a statistical package. It is important to know what they are, what they
mean and how the calculations work.

Measures of central tendency


The measures of central tendency summarise the data by describing the centre of the data.

Mean
The value each would have if the total was shared out equally. This is often referred to as
the average. All the measures of central tendency are types of average so be careful!
$\bar{x}$ is the mean from sample data ($\mu$ is the mean from population data, but is
therefore a parameter not a statistic).
$\bar{x} = \frac{\sum x}{n}$

Median
The middle value when the data is ordered. This is the $\frac{n}{2}$th value in an ordered
list that is sufficiently large. (In small lists use the $\frac{n+1}{2}$th value.) If you
need to find the 823.5th, for example, find the mean of the 823rd and 824th values.

Mode
The most common value. Data can be multi-modal if there is more than one “most common” value.
This can be applied to continuous data if it is grouped, where it is called the “modal
class”. Be careful: if class widths vary greatly, the modal class could be misleading. Mode is
the only measure of central tendency that can be used with categorical data.

Midrange
The value at the middle of the range. As with mean the outliers have a big effect on this
value. It isn’t as common as the other measures of central tendency but there are times
when it may be useful to include. Paired with the value of the range for example it
effectively gives the maximum and minimum values of the data.
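The four measures can be sketched with Python's standard statistics module. The data values are illustrative assumptions. The midrange has no library function here, so it is computed directly from the maximum and minimum.

```python
import statistics

# Illustrative data values (an assumption).
data = [0, 1, 1, 3, 5]

mean = statistics.mean(data)
median = statistics.median(data)       # middle value of the ordered data
modes = statistics.multimode(data)     # a list, because data can be multi-modal
midrange = (min(data) + max(data)) / 2

# Paired with the range, the midrange recovers the maximum and minimum.
data_range = max(data) - min(data)
recovered_min = midrange - data_range / 2
recovered_max = midrange + data_range / 2

print("mean =", mean, " median =", median, " mode(s) =", modes, " midrange =", midrange)
print("minimum =", recovered_min, " maximum =", recovered_max)
```

For this data set the four measures give three different values, which is a reminder to say which average is being quoted.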

Which measures are relevant to your investigation will depend on: type of data, purpose of
questions, context etc. It may be that all these measures are useful.
You need to be able to find each of these measures from a list of data, from frequency
tables and estimate them from grouped frequency tables.
They can be used to directly compare two, or more, sets of data. They are often used to
give a sense about where an individual data point fits in comparison to the population.
“Above/below average” is a phrase you will often spot in news headlines.

Example 1
The averages of maximum daily wind gust speeds for Cambourne (1987) are
Mean = 220 knots, Median = 245 knots, Mode = 360 knots (all given accurate to 2 s.f.)
(a) Maya says that more than 50% of the days had a maximum daily wind gust speed
greater than 360 knots. Comment on Maya’s claim.
(b) Khalid says that the median is the best average to use in this case as it is the middle
of the central measures. Comment on Khalid’s statement.
(c) Poppy says that 230 knots is above average for the maximum daily wind gust speed
for Cambourne in 1987. Comment on Poppy’s statement.

Solution
(a) Maya is incorrect. The median is 245 knots and 50% of values are above this.

Since the mode is also above this it would be reasonable to conclude that less than 50% of the values are
above the mode. In fact, 360 is the largest value in the data set, though it isn’t possible to tell that from the
information given.

(b) Khalid’s statement is questionable. The median may be a good average to use, but
not necessarily for the reason he has given.

The mode can be big, small, middling or all at the same time if polymodal. Using the mode to consider the
relevance of the other measures is unsound. That’s not to say it isn’t relevant, but the mode would often lead
to more questions.

(c) Poppy has not stated which average she is using. 230 knots is above the mean
average but below both mode and median.

Using average interchangeably between the different measures of central tendency can give rise to
misunderstanding and misrepresentation. Defining the type of average is useful and important. If interpreting
data it should be part of what you ask.

Measures of dispersion/spread

Example 2
Elin is applying for jobs. She applies to two companies and having researched their pay
structure finds company A has an average pay of £43 754 and company B has an average
pay of £38 973. Elin says she would prefer to work for the company that has the higher
average pay.
(a) Is this a good decision? Which company would you prefer to work for?
Further research shows that the average (given above) is the mean, and the range of pay
for company A is £195 000 and for company B is £3 000.
(b) Does this change your opinion about which company you would want to work for?
It turns out that the pay structure for both companies is that all the employees
receive the same wage and the boss gets more. For company A the boss receives £219 254
and the nine employees each get £24 254. For company B the boss receives £41 913 and
each of the 49 employees gets £38 913.
(c) Which company would you prefer to work for now, would you make a different
decision depending on whether you were the boss or an employee?

Solution
(a) Based on the current information the higher average wage seems like a reasonable
choice, all else being equal. But without knowing more it is hard to say how this is
likely to impact what Elin’s starting wage would be.

(b) This additional information tells you that whilst the mean is much higher for company
A there is also a much bigger range. The range will spread both above and below
the mean so the wage offered could be significantly higher or lower.

(c) The additional information: how many people, the range, how the data is distributed
are all important. At this point company B looks like a better option, unless you are
the boss.
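The figures in this example can be checked with a short Python sketch, which also shows how the median tells a different story from the mean. The function name summarise and the list layout are assumptions for illustration; the pay values are those given in the example.

```python
import statistics

# Pay lists built from the example: one boss plus the employees on a common wage.
company_a = [219_254] + [24_254] * 9
company_b = [41_913] + [38_913] * 49


def summarise(pay):
    """Return the mean, median and range of a list of wages."""
    return {
        "mean": statistics.mean(pay),
        "median": statistics.median(pay),
        "range": max(pay) - min(pay),
    }


summary_a = summarise(company_a)
summary_b = summarise(company_b)

print("Company A:", summary_a)
print("Company B:", summary_b)
```

The means and ranges match those quoted in the question, while the medians are simply the employee wage in each company, which is what a new employee is most likely to care about.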

As shown above, the measures of central tendency are often used as a comparison point:
“above or below average” can be seen as a tipping point of success. However, a single
value doesn’t give much depth to understanding the nuances of what is going on. Two sets
of data could have very similar central measures but be significantly different from each other.

Measures of dispersion are summary statistics that give a bit more information about how
the data is spread out. They include:
• Range – the greatest data value minus the smallest data value. It is useful as it gives a
sense of the amount of variation possible within the data set. It is strongly affected by
extreme values.
• Maximum and minimum values – Used to find the range, they also show the region
that all the data points sit within.
• Quartiles, centiles, deciles – If the data is split into 4 equal parts the divisions are
called the quartiles. There are three quartile measures that form part of the summary
statistics, known as the lower quartile ($Q_1$), the median ($Q_2$) and the upper quartile
($Q_3$). The lower quartile is the data point at the split between the 1st quarter of the
data and the second. Centiles are where the data is split into 100 parts, and deciles where
it is split into 10 parts. So the 50th centile would be the median and the 25th centile is
the first quartile.
Quartiles are less likely to be affected by extreme values so measures like the
interquartile range ($Q_3 - Q_1$) are often useful.
• Deviation and variance are measures of how far away the data points are from the
mean.

Variance and deviation


Consider the small set of data {0, 1, 1, 3, 5}, which has a mean $\bar{x} = 2$.
The deviations are how far each data point is from the mean, $(x - \bar{x})$, in this case
{−2, −1, −1, 1, 3}.

If you sum the individual deviations to find the measure of spread for the full data set the
result will be zero. So, the deviations are squared and then summed. The sum of the
squared deviations for sample data is $S_{xx}$.

$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$, which can also be written as $S_{xx} = \sum (x - \bar{x})^2$

Is this a measure we can use to compare data sets? No, since $S_{xx}$ reflects not just the
size of the deviations but the size of the data set too. By dividing by the size of the sample
you get a comparable measure of spread which is called the variance. But even here it isn’t
quite as straightforward as that might sound. It turns out that dividing by $n - 1$ rather
than by $n$ gives a better estimate of the variance of the underlying population. Because the
deviations sum to zero there aren’t $n$ independent values but $n - 1$.

Variance (often denoted by $s^2$, $s_x^2$ or $s^2x$) for sample data is found using
$s^2 = \frac{S_{xx}}{n - 1}$.

Note on notation: using a subscript in $s_x^2$ is helpful if you are working with more than one data set, or you
are analysing different elements within a data set. The third version is how it may look on a calculator - check
you know how the summary statistics appear on your calculator!

Sample standard deviation (often shortened to standard deviation) is the square root of
the sample variance:

$s = \sqrt{\frac{S_{xx}}{n - 1}} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

If you have data for a complete population, you can calculate the variance using the mean squared deviation,
which is $\frac{S_{xx}}{n}$, quite literally the mean of the squared deviations. This is often denoted by
$\sigma^2$, $\sigma_x^2$ or $\sigma^2x$.
The square root of the mean squared deviation is the root mean squared deviation, $\sqrt{\frac{S_{xx}}{n}}$.
If dealing with population data then this is the calculation for the standard deviation ($\sigma$, $\sigma_x$
or $\sigma x$).
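The sample and population versions can be compared with a short Python sketch using the data set {0, 1, 1, 3, 5} from this section. The variable names are assumptions for illustration. The standard library functions statistics.variance and statistics.pstdev are used only as a cross-check on the hand calculation.

```python
import math
import statistics

data = [0, 1, 1, 3, 5]
n = len(data)
mean = sum(data) / n

# Deviations from the mean, which always sum to zero.
deviations = [x - mean for x in data]

# Sum of squared deviations.
s_xx = sum(d ** 2 for d in deviations)

# Sample versions divide by n - 1.
sample_variance = s_xx / (n - 1)
sample_sd = math.sqrt(sample_variance)

# Population versions divide by n.
mean_squared_deviation = s_xx / n
root_mean_squared_deviation = math.sqrt(mean_squared_deviation)

# Cross-check against the standard library.
matches_library = (
    math.isclose(sample_variance, statistics.variance(data))
    and math.isclose(root_mean_squared_deviation, statistics.pstdev(data))
)

print("deviations:", deviations, "sum:", sum(deviations))
print("S_xx =", s_xx)
print("sample variance =", sample_variance, " sample sd =", sample_sd)
print("msd =", mean_squared_deviation, " rmsd =", round(root_mean_squared_deviation, 4))
```

Dividing by n − 1 always gives a slightly larger value than dividing by n, and the difference shrinks as the sample gets bigger.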

Alternative form:
If the mean doesn’t work out neatly it can make the deviations difficult to work with. In this
case, to limit errors creeping in, you can use an alternative form for $S_{xx}$.

$S_{xx} = \sum x^2 - n\bar{x}^2$

Using this alternate form of the sum of squares would give the following formulae:

Sample variance: $s^2 = \frac{\sum x^2 - n\bar{x}^2}{n - 1}$

Sample standard deviation: $s = \sqrt{\frac{\sum x^2 - n\bar{x}^2}{n - 1}}$

With large sets of data you may be given some of the summary data ($n$, $\sum x$, $\sum x^2$)
from which to calculate the variance or deviation.
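To see why the two forms of the sum of squares agree, the squared deviation can be expanded, using the fact that the sum of the x values equals n times the mean. The check at the end uses the data set {0, 1, 1, 3, 5}, which has mean 2. This is a sketch of the standard algebra rather than a required derivation.

```latex
\begin{align*}
S_{xx} &= \sum (x - \bar{x})^2 \\
       &= \sum x^2 - 2\bar{x}\sum x + n\bar{x}^2 \\
       &= \sum x^2 - 2\bar{x}(n\bar{x}) + n\bar{x}^2
          && \text{since } \sum x = n\bar{x} \\
       &= \sum x^2 - n\bar{x}^2 \\[1ex]
\text{Check: } \sum x^2 &= 0 + 1 + 1 + 9 + 25 = 36, \qquad n\bar{x}^2 = 5 \times 2^2 = 20 \\
S_{xx} &= 36 - 20 = 16 = (-2)^2 + (-1)^2 + (-1)^2 + 1^2 + 3^2 \\
s^2 &= \frac{16}{5 - 1} = 4, \qquad s = \sqrt{4} = 2
\end{align*}
```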

Calculating outliers
Samples may include values that are classed as outliers. These are values that are much
higher or lower than the rest of the sample. This doesn’t necessarily mean they are wrong
or shouldn’t be included, just that they are significantly different to the rest of the data set so
should have some consideration.

The child whose height was given as 0.924 cm belonged to the following (age-based) nursery group:
{0.924, 86.1, 87.0, 87.4, 88.9, 88.9, 90.2, 91.5, 91.7, 91.9, 92.0, 92.8, 94.1, 94.2, 95.7, 105.6}
0.924 is an outlier. It is clearly an error and so should be corrected or excluded. The value of 105.6 cm seems
quite big compared to the others based on a quick inspection of the ordered list. But does it matter? Is it
bigger by enough to count as an outlier?

As part of your approach you need to determine when a data value can be considered an
outlier and then decide how you should deal with it.

Two common ways of identifying outliers:
• Using quartiles to identify outliers: any value that is more than 1.5 times the
interquartile range below the lower quartile or more than 1.5 times the interquartile
range above the upper quartile.
• Using standard deviation to identify outliers: any data that are more than two
standard deviations away from the mean.
Depending on the context, you might use different conditions for outliers, such as a different
multiple of the standard deviation. These should be defined in the methodology of a study.

Remember just because a data value is identified as an outlier doesn’t mean it should be excluded from the
data.
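Both rules can be applied to the nursery group heights to answer the question of whether 105.6 cm counts as an outlier. This Python sketch makes two assumptions: the erroneous 0.924 value is excluded rather than corrected, and the quartiles are found with the default method of statistics.quantiles, which may differ slightly from the convention used by your calculator or course. The function names are hypothetical.

```python
import statistics

# Nursery group heights in cm, with the erroneous 0.924 value excluded (an assumption).
heights_cm = [86.1, 87.0, 87.4, 88.9, 88.9, 90.2, 91.5, 91.7,
              91.9, 92.0, 92.8, 94.1, 94.2, 95.7, 105.6]


def iqr_outliers(data):
    """Values more than 1.5 x IQR below the lower quartile or above the upper quartile."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    return [x for x in data if x < lower_limit or x > upper_limit]


def sd_outliers(data, multiple=2):
    """Values more than `multiple` sample standard deviations away from the mean."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return [x for x in data if abs(x - mean) > multiple * s]


print("Outliers by the quartile rule:", iqr_outliers(heights_cm))
print("Outliers by the 2 sd rule:    ", sd_outliers(heights_cm))
```

Being flagged by both rules only identifies the value as unusual. Whether to keep it still depends on the context, as the note above says.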

Example 3
A sample of 200 pulse rates (beats per minute) was taken randomly from an American
national health survey. Some data were missing so were excluded.
The data can be summarised as follows:

$n = 191$, $\sum x = 13772$, $\sum x^2 = 1030120$
What rates would be considered to be outliers if an outlier is defined as being more than
2 standard deviations from the mean?

Solution

x= =
x 13772
n 191
= 72.10471...
x 2 = 5199.08949...

s= x 2
− nx 2
n −1
1030120 − 191 5199.08949...
=
190
= 13.97251...
x + 2s = 100.0497...
x − 2s = 44.15969...
Outliers would be pulse rates below 45 beats per minutes or above 100 beats per minute.

Pulse rates are recorded as integers so it doesn’t make sense in this context to say that outliers are below
44.16 or above 100.05, for example.
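The working in this solution can be reproduced from the summary data alone. The Python sketch below uses the alternative form of the sum of squares; the variable names are assumptions for illustration.

```python
import math

# Summary data given in the example.
n = 191
sum_x = 13772
sum_x_squared = 1030120

mean = sum_x / n

# Alternative form of the sum of squared deviations.
s_xx = sum_x_squared - n * mean ** 2
s = math.sqrt(s_xx / (n - 1))

lower_limit = mean - 2 * s
upper_limit = mean + 2 * s

# Pulse rates are whole numbers, so report the nearest integer rates that count as outliers.
highest_low_outlier = math.floor(lower_limit)
lowest_high_outlier = math.ceil(upper_limit)

print("mean =", round(mean, 5))
print("s =", round(s, 5))
print("limits:", round(lower_limit, 5), "to", round(upper_limit, 5))
print("outliers are rates of", highest_low_outlier, "or less, or", lowest_high_outlier, "or more")
```

Keeping full precision until the final step avoids the rounding errors that can creep in when intermediate values are rounded by hand.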

Visualise the data - graphs, charts and infographics
Using a graphical representation to summarise the data you have can be incredibly helpful.
Charts, graphs and infographics can show many of the summary statistics in an easy to
digest and impactful way.
There are loads of different ways to do this, so whether you are creating or interpreting a
chart take your time and think about what is important to know. Below there is a brief
summary of some of the key charts that you might be asked to use. For further detail please
refer to the Data toolkit.
Pie chart
Uses and key features: Used with categorical data, discrete data or grouped continuous data.
Shows proportions within the sample. Sections should be labelled and additional information
might need to be given.
Disadvantages and common errors: By removing the actual values and comparing as proportions
it removes detail. It can be easy to infer incorrectly about “numbers of” when it only shows
proportions. Altering pie charts by varying the angle of view can be aesthetically pleasing
but taken too far can reduce the readability and so this should be avoided.

Bar charts
Uses and key features: Used for discrete or categorical data. Show the frequencies of the
different groups. Bars should not be touching. Vertical line graphs, where the width of each
column is minimised, can be a variation, potentially helpful when showing more categories.
Disadvantages and common errors: There are many variations that look similar that might be
used. These include compound bar charts, which have more than one bar for each group for
comparison, or charts that might include a value on the y-axis, rather than frequency. If this
is the case it is possible to have negative bars, e.g. temperature or daily profit/loss.

Histogram
Uses and key features: Area represents the number, not the height of the bar. Bars joined up
along x-axis. y-axis labelled “frequency density”. Bars can have different widths. Starts to
show the shape of the data. The more bars the closer to a curve. It is possible to see skew
and the possibility of specific distributions that could be used as approximations. The
information from a grouped frequency table is maintained.
You need to be able to:
• Find the frequency of a given group.
• Calculate frequency densities from a table in order to be able to draw the graph.
• Find estimates for the summary statistics.
Disadvantages and common errors: A common error is to read the y-axis as frequency rather than
frequency density. Whenever working from grouped data it is only possible to estimate summary
statistics such as the quartiles, mean etc.

Cumulative frequency diagram
Uses and key features: Cumulative frequency shows how grouped numerical data builds up. The
line is always increasing and for many distributions has a distinctive shape. Sometimes
cumulative frequency lines are plotted with straight line segments between points, other times
they are drawn with a smooth curve passing through the plotted points. This type of diagram
can be used to estimate individual values within the data, based on an assumption that the
growth of the data is reasonably dispersed within each group.
Disadvantages and common errors: The cumulative nature of this diagram can be hard to
interpret. It is often used as a starting point from which to draw box and whisker plots.

Box and whisker plots
Uses and key features: Show the distribution as defined by the quartiles. Includes a scale.
Can be vertical or horizontal. Outliers can be included as crosses on the diagram (as
identified by 1.5×IQR below/above the first/third quartile). Can be a useful way to compare
some of the summary statistics between different sets of data. Remember the central box shows
the middle 50% of the data and the line is the median (2nd quartile).
Disadvantages and common errors: Simplification of data by using summary statistics. It is
important to remember to give a comparison of the measures and to relate the meaning to the
context of the comparison too. The data should support your conclusion, not be it.
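For the histogram, the frequency density calculation can be sketched in Python using the grouped height table from earlier in these notes. Frequency density is taken here as frequency divided by class width, so that the area of each bar equals the frequency. The function and variable names are assumptions for illustration.

```python
import math

# Grouped frequency table: (lower bound, upper bound, frequency).
classes = [(3.1, 3.7, 143), (3.7, 4.7, 76), (4.7, 5.2, 237), (5.2, 6.0, 198)]


def frequency_density(low, high, frequency):
    """Height of the histogram bar: frequency divided by class width."""
    return frequency / (high - low)


densities = [frequency_density(low, high, f) for low, high, f in classes]

# Reading a histogram: area (density x width) gives back the frequency.
areas = [d * (high - low) for d, (low, high, _) in zip(densities, classes)]
areas_match = all(math.isclose(a, f) for a, (_, _, f) in zip(areas, classes))

# The tallest bar identifies the class where the data is most concentrated.
tallest = max(range(len(classes)), key=lambda i: densities[i])

for (low, high, f), d in zip(classes, densities):
    print(f"{low} to {high}: frequency {f}, frequency density {d:.1f}")
print("Tallest bar:", classes[tallest][0], "to", classes[tallest][1])
```

Comparing densities rather than raw frequencies is what makes classes of different widths comparable, which is why reading the y-axis as frequency is such a common error.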

Translate your finding(s)


With all the beautiful calculations and representations available to you it is always important
to focus back to your original question. What is the conclusion that you have reached? Who
is going to be interested in your findings? How can the data convince others of the same
conclusion? It is always good practice to translate back into the language of the question
asked. Think about how to show the target audience simply and effectively how the data
support the conclusion. That conclusion may be that a different or further question needs to
be asked, so let’s go round [the data cycle] again!
If you are given summary data and asked to interpret it, make sure you include the full
context: what the variable is and any associated units. It is easy to forget and give only
partial context.

“The mean height of the bamboo is 27.7” or “the maximum value is 42.2 m” both contain some context but
“the mean height of the bamboo is 27.7 m” and “The bamboo has a maximum height of 42.2 m” give the full
translation.
