Modelling and Stats Guide
Copyright © 2020 Andrew Chambers. All rights reserved. 300 IA ideas: https://ibmathsresources.com.
Table of Contents
Introduction
Part 1: Modelling techniques
Paired t-tests: Comparing the same class, Reaction times
Chi Squared Goodness of Fit: Are IB results normally distributed?
Spearman’s rank: Does cola taste preference increase with price?
Introduction
I’ve written this guide to supplement the main Exploration Guide I put together. You
should consult the main guide for guidance on choosing topics, an explanation of the
marking criteria, common student mistakes and technology advice. In this guide I
look at various modelling techniques and also a number of different statistical tests.
In many cases these are taught in textbooks simply using technology, whereas it is
often desirable to demonstrate a greater understanding through non-calculator
methods in your maths exploration. So, where possible I’ve included non-calculator
techniques.
It’s important to note that these methods are not intended to be exemplars. There
are many different ways of explaining the following techniques and ideas, and these
are just my ideas! You should attempt to put your methods into your own words so that
you can demonstrate a good personal understanding. The students who do best in
their exploration consult a variety of sources, collate the ideas and are therefore
able to show a deep understanding.
If you do use this guide then it is essential that you correctly cite this source in your
exploration - failure to cite sources correctly can lead to malpractice investigations by
the IB, so make sure everything is done correctly.
Linear Regression
Method 1
If you are doing a correlation investigation then you should use the Pearson’s
Product formula first to check the strength of correlation. Once you have done this
you can then find the equation of the line of best fit.
Method 2
If you are simply trying to find a linear regression line and not measure correlation
then you can use the least squares regression formula.
y = mx + c
Where:
Say for example we have the following data points we want to fit a line through:
It’s pretty good! If we use the linear regression tool on Desmos by typing:
y_1 ~ mx_1 + c
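A minimal sketch of the least squares calculation that sits behind a regression like this, assuming the standard formulas m = Sxy / Sxx and c = (mean of y) - m(mean of x). The data points and the function name `least_squares_line` are hypothetical and only there for illustration.

```python
def least_squares_line(xs, ys):
    """Return gradient m and intercept c of the least squares line y = mx + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sxy and Sxx: sums of products / squares of deviations from the means
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    m = s_xy / s_xx
    c = mean_y - m * mean_x  # the line passes through (mean_x, mean_y)
    return m, c

# Hypothetical data points (not taken from the guide)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
m, c = least_squares_line(xs, ys)
print(f"y = {m:.3f}x + {c:.3f}")
```

Comparing the printed gradient and intercept with the Desmos output is a quick way to check a hand calculation.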
Quadratic regression
If I have the following graph I would notice that it follows a general quadratic shape.
So, usually the easiest method to fit a quadratic curve is to use the form:
$y = p(x - q)^2 + r$
Because my graph uses time and height, I will rewrite this as:
$h(t) = p(t - q)^2 + r$
When written in this form, p represents the vertical stretch factor and will be negative
because the graph is concave down. (q,r) will be the coordinates of the vertex of the
graph.
Looking at the graph, I need to decide where a best-fit quadratic curve would have
its vertex. In this case it looks like the coordinate point (3, 6.5) is quite close.
Therefore I have:
$h(t) = p(t - 3)^2 + 6.5$
Next I just need to find p. To do this I can choose any point that I want my curve to
go through. If I decide that my curve must go through (0,0) so that my model has a
height of 0 metres after 0 seconds, then I can substitute these values to find p:
$h(0) = 0 = p(0 - 3)^2 + 6.5$
$p = -\frac{6.5}{9}$
$p \approx -0.722$
Therefore:
$h(t) = -0.722(t - 3)^2 + 6.5$
We can see that it goes through the point we chose as a vertex as well as the origin.
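The substitution above can be sketched in a few lines of Python. The helper name `vertex_form_p` is my own; it simply rearranges h = p(t - q)^2 + r for p.

```python
def vertex_form_p(vertex, point):
    """Solve h = p(t - q)^2 + r for p, given the vertex (q, r) and one other point."""
    q, r = vertex
    t, h = point
    return (h - r) / (t - q) ** 2

p = vertex_form_p(vertex=(3, 6.5), point=(0, 0))

def h(t):
    """Quadratic model with vertex (3, 6.5)."""
    return p * (t - 3) ** 2 + 6.5

print(round(p, 3))   # -0.722
print(h(0), h(3))    # model heights at the chosen point and at the vertex
```

Changing the `point` argument shows how sensitive p is to the point you force the curve through.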
Desmos regression
I can also see what regression line Desmos will draw for these points by typing in:
y_1 ~ p(x_1 - q)^2 + r
We can see that this time Desmos fits the best possible quadratic for all the points -
and so it does not quite fit the maximum point or go through the origin.
Cubic regression
Simultaneous equations
$y = ax^3 + bx^2 + cx + d$
Because my graph uses time and height, I will rewrite this as:
$h(t) = at^3 + bt^2 + ct + d$
Here I have 4 unknowns and so need 4 equations. Luckily my graph goes through
(0,0) so I immediately know that d = 0. If your graph doesn’t pass through the origin
you can still use the same method but will use your GDC simultaneous equation
solver with 4 unknowns.
I will then choose the coordinate points which I want my graph to pass through. I’d
like it to pass through the origin (0, 0), the first maximum (1, 5.8), the first minimum
(3, 1.8) and the end point (4.5, 9).
This will therefore generate the following equations:
$h(0) = 0 = a(0)^3 + b(0)^2 + c(0) + d$
Therefore d = 0.
$h(1) = 5.8 = a(1)^3 + b(1)^2 + c(1)$
$h(3) = 1.8 = a(3)^3 + b(3)^2 + c(3)$
$h(4.5) = 9 = a(4.5)^3 + b(4.5)^2 + c(4.5)$
These simplify to:
$5.8 = a + b + c$
$1.8 = 27a + 9b + 3c$
$9 = 91.125a + 20.25b + 4.5c$
Simultaneous equations can be solved using a GDC. For those doing HL maths you
might want to explore how to use the inverse of a 3x3 matrix to solve this. However
just using a Casio we could use the simultaneous equation solver:
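As an alternative check on the GDC, here is a minimal sketch that solves the three equations in a, b and c by Gaussian elimination. The function `solve_linear_system` is a hypothetical helper written for this example, not a calculator or library routine.

```python
def solve_linear_system(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix, so row operations act on both sides of each equation
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution, starting from the last unknown
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution

# d = 0 because the curve passes through (0, 0)
points = [(1, 5.8), (3, 1.8), (4.5, 9)]
matrix = [[t**3, t**2, t] for t, _ in points]
rhs = [h for _, h in points]
a, b, c = solve_linear_system(matrix, rhs)
print(round(a, 3), round(b, 3), round(c, 3))
```

The same function works for a quartic if you supply four or five points and the matching powers of t.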
If we plot this graph we get:
We can see that it passes through the points we specified and it is a very good fit.
Desmos regression
We can type the following in to see what regression line Desmos will create:
Higher powers regression
You can use the same technique as for cubic regression to fit higher power
curves. For example a quartic curve has general equation:
$y = ax^4 + bx^3 + cx^2 + dx + e$
In order to fit a quartic you will need to have 5 equations because
you have 5 unknowns. So, choose 5 points you want your graph to pass through.
You can then use your GDC simultaneous equation solver to solve.
Exponential regression
$y = Ae^{Bx} + C$
Because my graph plots infected (I) and time (t), I’ll rewrite it as:
$I(t) = Ae^{Bt} + C$
This method needs your graph to have an asymptote y = 0 so that you can set C=0.
If your graph has an asymptote at (say) y = 3 then move all your points down by 3
(i.e take away 3 from each y coordinate). Then follow this method below to find A
and B. Finally you can set C = 3.
$I(t) = Ae^{Bt}$
Next I need to choose 2 points that I want the exponential to pass through. I’m going
to choose coordinates one third and two thirds along so I can represent the curve in
the middle section: (5, 4.2) and (10, 14). Substituting these gives equation (1),
$4.2 = Ae^{5B}$, and equation (2), $14 = Ae^{10B}$.
I can eliminate A by doing equation (2) divided by equation (1):
$\frac{14}{4.2} = \frac{Ae^{10B}}{Ae^{5B}}$
$\frac{10}{3} = \frac{e^{10B}}{e^{5B}}$
$\frac{10}{3} = e^{5B}$
$\ln(\frac{10}{3}) = 5B$
$B = 0.2\ln(\frac{10}{3})$
I can then find A by substituting into one of the equations (e.g equation (1) ):
$4.2 = Ae^{5B}$
$4.2 = Ae^{5(0.2\ln(\frac{10}{3}))}$
$4.2 = Ae^{\ln(\frac{10}{3})}$
$4.2 = A(\frac{10}{3})$
$A = 1.26$
$B \approx 0.241$
So my equation is:
$I(t) = 1.26e^{0.241t}$
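The elimination step can be checked with a short sketch. The function name `exponential_through` is hypothetical; it divides the two equations to remove A, exactly as in the working above.

```python
import math

def exponential_through(p1, p2):
    """Find A and B so that I(t) = A*e^(B*t) passes through both points."""
    (t1, i1), (t2, i2) = p1, p2
    B = math.log(i2 / i1) / (t2 - t1)   # dividing the two equations eliminates A
    A = i1 / math.exp(B * t1)           # substitute B back into the first equation
    return A, B

A, B = exponential_through((5, 4.2), (10, 14))
print(round(A, 2), round(B, 3))   # 1.26 0.241
```

Passing in an end point instead of one of the middle points is an easy way to try the alternative fit.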
This gives the following equation:
We can see that it goes through the 2 points we specified. It is a reasonable fit over
the first 10 days - but then fits less well over the next 10 days. We could try again
choosing the end point coordinate to ensure the curve fits through this point, or we
could choose to plot a piecewise function (i.e represent this using 2 different
equations, one equation for the first 10 days and another equation for the next 10
days).
Desmos manages to fit a much better exponential curve - which shows that we
should try our exponential model again choosing one of the end coordinates.
Exponential regression
Method 2: Linearisation
Let’s take the same graph and the same starting point:
$I(t) = Ae^{Bt}$
$\ln(I(t)) = \ln(Ae^{Bt})$
$\ln(I(t)) = \ln A + Bt$
$\ln(I(t)) = Bt + \ln A$
Therefore this is in the form of the equation of a straight line, when we plot ln(I(t)) on
the y axis against t. When we do this, the gradient will be B and the y-intercept will
be ln A.
For example with the coordinate (5, 4.2), I will plot (5, ln(4.2)) etc. This will give:
I now can draw a straight line of best fit to find the gradient.
$\ln A = -0.1832$
$A = e^{-0.1832}$
$A \approx 0.833$
So my equation is:
$I(t) = 0.833e^{0.260t}$
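Here is a minimal sketch of linearisation done numerically rather than by eye: take logs of the y values, fit a least squares line to (t, ln I), then convert the intercept back to A. The data are hypothetical, generated from I(t) = 2e^(0.3t) so that the answer is known in advance, and the function name is my own.

```python
import math

def fit_exponential_by_linearisation(ts, values):
    """Fit I(t) = A*e^(B*t) by fitting a least squares line to (t, ln I)."""
    logs = [math.log(v) for v in values]
    n = len(ts)
    mean_t = sum(ts) / n
    mean_log = sum(logs) / n
    s_tl = sum((t - mean_t) * (l - mean_log) for t, l in zip(ts, logs))
    s_tt = sum((t - mean_t) ** 2 for t in ts)
    B = s_tl / s_tt                 # gradient of the straight line
    ln_A = mean_log - B * mean_t    # y-intercept of the straight line
    return math.exp(ln_A), B

# Hypothetical data generated from I(t) = 2e^(0.3t)
ts = [0, 5, 10, 15, 20]
values = [2 * math.exp(0.3 * t) for t in ts]
A, B = fit_exponential_by_linearisation(ts, values)
print(round(A, 3), round(B, 3))
```

With real data the points will not lie exactly on a line, so A and B will be best-fit estimates.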
This gives the following graph:
The idea of linearisation is to transform a graph into a straight line, from which we
can then easily find the gradient and y-intercept.
We can also use linearisation to find the equation of graphs of the form:
$y = Ax^B$
$\ln y = \ln(Ax^B)$
$\ln y = \ln A + B\ln x$
In this case we would plot ln y against ln x and would have B as the gradient and
ln A as the y-intercept.
Trigonometric regression
$y = a\sin(b(x - c)) + d$
Because I’m looking at months t and average hours of sunlight S(t) I’ll rewrite this as:
$S(t) = a\sin(b(t - c)) + d$
Here the period is $\frac{2\pi}{b}$ and we have a translation from the standard S(t) = sin(t) graph by the vector $\begin{pmatrix} c \\ d \end{pmatrix}$.
Step 1 is to find the amplitude (a). We do this by finding the difference between the
maximum and minimum points then dividing by 2.
$a = \frac{16.93 - 7.57}{2}$
$a \approx 4.68$
Step 2 is to find b. We note that the period of the graph is 12 therefore:
$\text{period} = \frac{2\pi}{b}$
$12 = \frac{2\pi}{b}$
$b = \frac{\pi}{6}$
The graph of
$S(t) = 4.68\sin(\frac{\pi}{6}t)$
would have a y-coordinate maximum of 4.68. Therefore the vertical translation must
be:
$d = 16.93 - 4.68 = 12.25$
Then we can see that the maximum point at (3, 16.93) should be at (7, 16.93).
Therefore we need a horizontal translation of 4. So c = 4.
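The steps above can be collected into a short sketch. It assumes the vertical translation is taken as the maximum minus the amplitude, and that an unshifted sine curve peaks a quarter of a period after t = 0. The variable names are my own.

```python
import math

maximum, minimum = 16.93, 7.57    # largest and smallest values of S(t)
period = 12                       # months
t_of_maximum = 7                  # month in which the maximum occurs

a = (maximum - minimum) / 2       # amplitude
b = 2 * math.pi / period          # from period = 2*pi / b
d = maximum - a                   # vertical translation
# Without a shift, sin(b*t) peaks a quarter of a period after t = 0
c = t_of_maximum - period / 4     # horizontal translation

def S(t):
    """Modelled hours of sunlight in month t."""
    return a * math.sin(b * (t - c)) + d

print(round(a, 2), round(d, 2), c)
print(round(S(7), 2))   # the model's maximum value
```

Evaluating `S(t)` for t = 1 to 12 and comparing with the data gives a quick feel for how well the hand-built model fits before looking at the Desmos regression.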
I can type the following into Desmos to see what regression line it will create:
Other useful graphs
$y = \frac{a}{1 + be^{-r(x - c)}}$
The logistic model can be very useful for modelling population growth - and will
appear when you use the SIR model for infections. The value a is the carrying
capacity and is the maximum that the population can reach.
Sometimes when modelling harmonic motion you will notice that the amplitude
changes - eg. the height of tides or the vertical height of a pendulum. In this case we
can plot a damped graph by multiplying the trig function by an exponential term.
3. Circles
$(x - h)^2 + (y - k)^2 = r^2$
$(x - 2)^2 + (y - 3)^2 = 2^2$
4. Ellipses
$\frac{(x - h)^2}{a^2} + \frac{(y - k)^2}{b^2} = 1$
This generates an ellipse with distance from the centre to the edges of a horizontally
and b vertically, centred at (h,k). For example:
$\frac{(x - 2)^2}{1^2} + \frac{(y - 3)^2}{2^2} = 1$
5. Piecewise functions
Sometimes you’ll not be able to represent your graph using just one function - so you
can instead use a piecewise function like this:
This tells me that the function behaves like a linear function for all x values up to and
including 2, then behaves like a quadratic function for x values greater than 2. Note
here that you should usually aim to have a continuous function (i.e the value when x
= 2 is the same for both equations) when using this for modelling.
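A minimal sketch of a piecewise model and its continuity check. The two pieces used here (2x up to x = 2, then x^2) are hypothetical and chosen only because they meet at the join.

```python
def f(x):
    """Hypothetical piecewise model: linear up to x = 2, quadratic afterwards."""
    if x <= 2:
        return 2 * x      # linear piece
    return x ** 2         # quadratic piece

# Continuity check: both pieces should give the same value at x = 2
left_value = 2 * 2
right_value = 2 ** 2
print(left_value == right_value)
```

If the two values differ, adjust one of the pieces (for example by adding a constant) until the model joins up.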
Pearson’s Product Correlation investigation: height and arm span.
Correlation investigations are very common - but also have lots of things that can go
wrong with them, so here is an example where I highlight common mistakes and
show good practice.
Step 1. You need to work quite hard to justify a personal interest in order to get
more than C1 on correlation topics. Two ways of showing personal engagement are
to do some reasonably time consuming data collection and to create a narrative as
to why you are investigating this topic.
“Is there a correlation between the height and arm span of Y13 boys?” is a
reasonable topic question which will be possible to complete, but it’s quite
depersonalised. Why do you care about this?
“Can understanding the relationship between height and arm span help me design
better fitting suits for Y13 boys?” is immediately more engaging. Now there is a
genuine purpose, and plenty of scope for reflection based on this topic question.
The two main problems here are not collecting enough data for the investigation to
be meaningful, and not showing any awareness of sampling methods. I would
recommend trying to collect 40-50 data points if you are collecting your own data. If
you are using secondary data then 50-75 would be better.
You should show a clear explanation of the method used to collect data. For
example, “I borrowed the height measuring machine from the school nurse and
during a Y13 PE lesson asked my sample to line up with a straight back (no shoes).
I measured in cm to 1 decimal place. etc.”
For example, if you decide that the population you are interested in is limited to Y13
boys then you could conduct a simple random sample by assigning a number to
every Y13 boy in the school and then using a random number generator to generate
your sample.
Step 3: Data presentation. If you have a lot of data then you probably are best
including the first part of a table in the main body and then the full table in the
appendix. I’ll work through some maths with 10 data points as an example.
Height x (cm)    Arm span y (cm)
156.4            162.0
177.7            176.5
161.1            160.8
170.9            170.4
173.3            185.2
173.0            176.5
162.9            170.8
161.2            162.3
188.7            190.9
178.6            180.0
Here we have arm span on the y-axis therefore we are investigating if arm span is
dependent on height. We can clearly see a positive linear correlation so it is relevant
to do a Pearson’s Product correlation calculation.
Note - we might also want to show the graph with x and y axes starting from 150cm
to show the trend in the data points more clearly.
Mean and standard deviation are going to be relevant when considering suit
measurements. We will then do a Pearson’s Product calculation to see how strong
the correlation is.
Note, there is a convention of using $\mu_x$ for the mean of the x values when you are
measuring every x value in the population and to use $\bar{x}$ when you are finding the
mean from a sample. For example if I survey all Y13 students in my class and only
care about what the results tell me about this class then this is a population survey
(i.e I have surveyed the whole population). If I survey all Y13 students in my class
and use this to draw conclusions on other Y13 students in the school then this is a
sample.
Finding the standard deviation
This gives the standard deviation as though you were using the entire population
data. There is a very similar equation if you want to find the standard deviation for
a sample (this is called the unbiased estimator for the sample standard deviation):
At IB SL you’re not really expected to appreciate the difference but you could
mention it. I will use the standard deviation as though it is from the whole population.
x values (height)    $(x - \bar{x})^2$
156.4                195.4404
177.7                53.5824
161.1                86.1184
170.9                0.2704
173.3                8.5264
173                  6.8644
162.9                55.9504
161.2                84.2724
188.7                335.6224
178.6                67.5684
Total                894.216
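The table can be reproduced with a short sketch. It assumes the usual formulas: divide the sum of squared deviations by n for the population standard deviation and by n - 1 for the sample version.

```python
import math

heights = [156.4, 177.7, 161.1, 170.9, 173.3, 173.0, 162.9, 161.2, 188.7, 178.6]

n = len(heights)
mean = sum(heights) / n
sum_sq_dev = sum((x - mean) ** 2 for x in heights)   # total of the right-hand column

population_sd = math.sqrt(sum_sq_dev / n)        # divide by n
sample_sd = math.sqrt(sum_sq_dev / (n - 1))      # divide by n - 1

print(round(mean, 2), round(sum_sq_dev, 3))
print(round(population_sd, 2), round(sample_sd, 2))
```

With only 10 data points the two standard deviations differ noticeably, which is worth a sentence of reflection in an exploration.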
Now we have both the mean and standard deviations we can do some reflection on
what these show - linking back to our aim.
We could then use the normal distribution with our mean and standard deviation to
work out (say) the range of heights that 95% of students will have. This requires the
inverse normal function:
This returns the result that 95% of student heights would be between 152cm and
189cm.
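A minimal sketch of the inverse normal calculation using Python's standard library. The mean 170.38 and standard deviation 9.456 are the values computed from the 10 heights in the table, treated as population values.

```python
from statistics import NormalDist

# Mean and population standard deviation of the 10 heights
heights = NormalDist(mu=170.38, sigma=9.456)

lower = heights.inv_cdf(0.025)   # 2.5% of heights fall below this value
upper = heights.inv_cdf(0.975)   # 2.5% of heights fall above this value
print(round(lower), round(upper))   # 152 189
```

The central 95% is found by cutting 2.5% from each tail, which is why the inputs are 0.025 and 0.975.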
Pearson’s Product Correlation formula
There are many other different versions of the Pearson’s Product formula. I think
that the ones below are the most useful as they make use of both the mean and
standard deviation calculations. The equation is in effect the average of the product
of the standardised scores.
If we obtain our means from a sample ($\bar{x}$, $\bar{y}$) we can replace $\mu_x$ and $\mu_y$ with $\bar{x}$ and
$\bar{y}$ because the sample mean is an unbiased estimator for our population mean
values. Therefore we have:
Both equations will give the same answer - so choose to use the one which matches
the standard deviation you used. I will use equation (2) in the working out below.
$\frac{(x - \bar{x})}{\sigma_x} \cdot \frac{(y - \bar{y})}{\sigma_y}$
1.763634539
0.236863645
1.292447823
-0.017849605
0.372200556
0.084779064
0.224051459
1.127988333
3.476728722
0.580496961
We could then reflect on this and what it shows. Because it is close to 1 it is
appropriate to find the equation of the regression line.
To find c we can then use the fact that a line of best fit will always go through the
mean x and y coordinates:
Then we can show we have checked all this using our GDC:
It’s good to include a screen-capture here to prove to the moderator you’ve done
everything correctly. When n is large there will be little difference between the
population standard deviation and the sample standard deviation - but with small n
there will be a greater discrepancy.
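As a further check alongside the GDC, here is a sketch that computes r as the average of the products of the standardised scores (using population standard deviations), then the regression line. The gradient is taken as r multiplied by the ratio of the standard deviations, which is one standard way of writing the least squares gradient and is my assumption about the form intended here.

```python
import math

heights = [156.4, 177.7, 161.1, 170.9, 173.3, 173.0, 162.9, 161.2, 188.7, 178.6]
arm_spans = [162.0, 176.5, 160.8, 170.4, 185.2, 176.5, 170.8, 162.3, 190.9, 180.0]

def mean(values):
    return sum(values) / len(values)

def population_sd(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

mean_x, mean_y = mean(heights), mean(arm_spans)
sd_x, sd_y = population_sd(heights), population_sd(arm_spans)

# r is the average of the products of the standardised scores
products = [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
            for x, y in zip(heights, arm_spans)]
r = sum(products) / len(products)

gradient = r * sd_y / sd_x               # least squares gradient
intercept = mean_y - gradient * mean_x   # line passes through (mean_x, mean_y)
print(round(r, 3), round(gradient, 3), round(intercept, 1))
```

The list `products` should match the column of standardised products in the table, which is a useful line-by-line check.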
Clearly now we would look to use our regression line, make some reflections based
on the aim of the topic etc.
Standardisation in the Pearson Product equation
For those of you who know about the standardised normal calculation, you will notice
the similarity in the standardisation process.
Both of these are doing the same thing - the standardised score tells me how many
standard deviations away from the mean I am.
The standardised score returns -2 because the x value is smaller than the mean.
By using standardised x and y values, Pearson’s Product formula is then able to
compare how similarly the 2 sets of values are distributed.
Binomial investigation: Extra Sensory Powers
Step 1
We can define our data collection process - we’re going to test if Person 1 has extra
sensory perception (ESP). We’ll use ESP cards with 5 symbols. For each trial one
of these symbols will be “transmitted” to Person 1. We will then record if they
correctly guess the transmitted symbol. We will do this trial 50 times.
X ~ B (50, 0.2)
E (X) = np
E (X) = 50(0.2)
E (X) = 10
So, we would expect people who do this test to get around 10 correct if they have no
ESP powers.
$\sigma = \sqrt{np(1 - p)}$
$\sigma = \sqrt{50(0.2)(1 - 0.2)}$
$\sigma = 2\sqrt{2}$
Therefore if we expect most of our data to be within 2 standard deviations of the mean
then we would expect most people to get:
$5 \le X \le 15$
$H_0: p = 0.2$
$H_1: p > 0.2$
Our null hypothesis ( H 0 ) is that the probability of Person 1 correctly guessing the
transmitted symbol is 0.2. Our alternative hypothesis ( H 1 ) is that the probability of
Person 1 correctly guessing the transmitted symbol is greater than 0.2.
Next, we work out the critical region. We will conduct our hypothesis test to the 5%
significance level. We are interested in finding:
P (X ≤ 14) ≈ 0.939
P (X ≤ 15) ≈ 0.969
Therefore $P(X \ge 16) = 1 - P(X \le 15) \approx 0.031$, which is less than 0.05.
And so if Person 1 gets 16 or more correct then this will be in our critical region - and
this will be evidence to reject our null hypothesis and accept the alternative
hypothesis.
We can notice that the value of 16 is just outside our boundary of values within 2
standard deviations of the mean found earlier:
5 ≤ X ≤ 15
Let’s say that Person 1 got a remarkable 20 correct out of 50. In this case we can
clearly see that it is in our critical region, and so we would reject our null hypothesis.
We might then want to see how likely this result was to happen by chance.
$P(X \ge 20) = 1 - P(X \le 19)$
$P(X \ge 20) = 1 - 0.99906756$
$P(X \ge 20) \approx 0.000932$
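The critical region and the final probability can be checked with a short sketch that sums binomial probabilities directly. The function name `binomial_cdf` is my own.

```python
from math import comb

def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

n, p, significance = 50, 0.2, 0.05

# Smallest k with P(X >= k) <= 5% marks the start of the critical region
critical_value = next(k for k in range(n + 1)
                      if 1 - binomial_cdf(k - 1, n, p) <= significance)

p_at_least_20 = 1 - binomial_cdf(19, n, p)
print(critical_value)            # 16
print(round(p_at_least_20, 6))
```

Changing `significance` to 0.01 shows how the critical region shrinks for a stricter test.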
Poisson investigation: Customers in a shop
Step 1
Let’s say we are interested in finding out whether having a person standing outside a
shop handing out leaflets to passers-by has a significant impact on customer
numbers entering a shop.
We spend one hour at (say) 3-4pm on a Saturday counting the number of groups of
customers who enter the shop. We count 15 groups of people therefore we have:
X ~ P (15)
μ = 15
σ = √μ
σ = √15
Therefore if we expect most of our data to be within 2 standard deviations of the mean
then we would expect most hours to have:
$8 \le X \le 22$
Next, let’s do a hypothesis test for our Poisson investigation.
$H_0: \mu = 15$
$H_1: \mu > 15$
Our null hypothesis ( H 0 ) is that the mean number of groups entering the shop will
be 15. Our alternative hypothesis ( H 1 ) is that the mean number of groups entering
the shop will be more than 15.
Next, we work out the critical region. We will conduct our hypothesis test to the 5%
significance level. We are interested in finding:
P (X ≤ 21) ≈ 0.947
P (X ≤ 22) ≈ 0.967
Therefore $P(X \ge 23) = 1 - P(X \le 22) \approx 0.033$, which is less than 0.05.
So if 23 or more groups enter our shop this will be in our critical
region - and this will be evidence to reject our null hypothesis and accept the
alternative hypothesis.
We notice that 23 is just outside the 2 standard deviation bound that we found
earlier.
We then spend another hour (3-4pm on the next Saturday) counting customers
whilst the shop employs someone to hand out leaflets to passers-by. We find that
the shop this time has 21 groups of customers. This is not in our critical region, and
so we do not have evidence to reject the null hypothesis.
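A minimal sketch of the same test, summing Poisson probabilities to locate the critical region and then checking the observed count of 21. The function name `poisson_cdf` is my own.

```python
from math import exp, factorial

def poisson_cdf(k, mu):
    """P(X <= k) for a Poisson distribution with mean mu."""
    return sum(exp(-mu) * mu**i / factorial(i) for i in range(k + 1))

mu, significance = 15, 0.05

# Smallest k with P(X >= k) <= 5% marks the start of the critical region
critical_value = 0
while 1 - poisson_cdf(critical_value - 1, mu) > significance:
    critical_value += 1

observed = 21
print(critical_value)               # 23
print(observed >= critical_value)   # False, so we do not reject the null hypothesis
```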
2 sample t-tests: Comparing different classes, Reaction times
One of the main uses of the t-test is to be able to compare 2 samples and decide if
they both came from the same population. To do this we assume that the two
samples do come from the same population and therefore have the same standard
deviation. We want to test whether they share the same population mean.
If our population distribution is X, we need to assume that the sample mean $\bar{X}$ follows a
normal distribution. This is a reasonable assumption as long as either:
1) X is normally distributed
2) Or our sample size (n) is sufficiently large (usually it should be at least n >
30).
If X meets either of the above criteria and has population mean μ and population
standard deviation σ then:
$\bar{X} \sim N(\mu, \frac{\sigma^2}{n})$
Step 1:
Say for example I want to compare whether a class of Year 7 students have different
reaction times to a class of Year 13 students. I could conduct a 2 sample t-test.
Reaction times reasonably approximate a normal distribution (I could discuss this in
more detail) so my test assumption is met. I would then collect some data on
reaction times explaining the collection process.
x: Year 7 reaction time (milliseconds):     221 215 212 320 295 209 211 349 220 198
y: Year 13 reaction time (milliseconds):    312 341 225 214 238 188 378 301 205 226
Step 3: Maths processes
$H_0: \mu_x = \mu_y$
$H_1: \mu_x \ne \mu_y$
Our null hypothesis ( H 0 ) is that the mean for the Year 7 students ( μx ) will be the
same as the mean for the Year 13 students ( μy ). Our alternative hypothesis ( H 1 ) is
that the two means are not equal.
Therefore I do not have any evidence to reject the null hypothesis - and I can
conclude that there is no significant difference between the reaction times of Y13
and Y7 students.
I can see that whilst the means of the 2 samples are different (the mean for Y7
students is 245 milliseconds and is 262.8 milliseconds for Y13 students), this is
not significant at the 5% level.
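A minimal sketch of the pooled 2 sample t-test by hand, assuming (as stated earlier) that both samples share the same population standard deviation. The critical value 2.101 is the two-tailed 5% value for 18 degrees of freedom taken from t-tables rather than computed here.

```python
import math

year7 = [221, 215, 212, 320, 295, 209, 211, 349, 220, 198]
year13 = [312, 341, 225, 214, 238, 188, 378, 301, 205, 226]

def mean(values):
    return sum(values) / len(values)

def sum_sq_dev(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

n_x, n_y = len(year7), len(year13)
df = n_x + n_y - 2
# Pooled variance: both samples are assumed to share one population variance
pooled_var = (sum_sq_dev(year7) + sum_sq_dev(year13)) / df
standard_error = math.sqrt(pooled_var * (1 / n_x + 1 / n_y))
t = (mean(year7) - mean(year13)) / standard_error

T_CRITICAL = 2.101   # two-tailed 5% critical value for 18 degrees of freedom, from t-tables
print(round(t, 3), abs(t) > T_CRITICAL)
```

Because the absolute value of t is well below the critical value, the sketch agrees with the conclusion that the difference in means is not significant.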
Paired t-tests: Comparing the same class, Reaction times
Step 1
Here I will do a paired t-test because I am comparing the same students’ reaction
times before and after they are given some training on how to improve their reaction
times (i.e. the results are paired, with one pair of results for each student). This
will be one-tailed because I want to see if there is a significant improvement in
reaction times in the second trial. I need to make the same assumptions as for the
2 sample t-test.
x: Year 7 reaction time before training (milliseconds):    221 215 212 320 295 209 278 349 220 198
y: Year 7 reaction time after training (milliseconds):     210 250 188 318 238 188 211 301 205 167
x - y:                                                     11 -35 24 2 57 21 67 48 15 31
Maths processes:
$H_0: \mu_x - \mu_y = 0$
$H_1: \mu_x - \mu_y > 0$
Our null hypothesis ( H 0 ) is that the difference between the means for the Year 7
students before training ( μx ) and after training ( μy ) will be 0. Our alternative
hypothesis ( H 1 ) is that the difference between the two means will be positive
(because in this context it will mean that the trial with training was faster). I will do
the test at the 5% level.
Therefore I have evidence to reject the null hypothesis - and I can conclude that
there is a significant improvement in the reaction times of Y7 students after training.
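A minimal sketch of the paired t-test by hand: work with the differences x - y, then test whether their mean is greater than 0. The critical value 1.833 is the one-tailed 5% value for 9 degrees of freedom taken from t-tables rather than computed here.

```python
import math

before = [221, 215, 212, 320, 295, 209, 278, 349, 220, 198]
after = [210, 250, 188, 318, 238, 188, 211, 301, 205, 167]

differences = [x - y for x, y in zip(before, after)]
n = len(differences)
mean_diff = sum(differences) / n
# Sample standard deviation of the differences (divide by n - 1)
sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in differences) / (n - 1))
t = mean_diff / (sd_diff / math.sqrt(n))

T_CRITICAL = 1.833   # one-tailed 5% critical value for 9 degrees of freedom, from t-tables
print(round(mean_diff, 1), round(t, 2), t > T_CRITICAL)
```

Here t is larger than the critical value, which matches the decision to reject the null hypothesis.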
Chi Squared test: Efficiency of vaccinations
Say there are several different vaccinations being used to help prevent influenza,
and I want to find out whether people are able to avoid catching influenza
independently of which vaccine they receive. My null hypothesis is therefore that
there is no difference in the efficacy of the different vaccines against influenza. A Chi
squared test is appropriate here as I have categorical data and I am testing for
independence. I do not have a 2x2 table therefore I can avoid having to use Yates’
Continuity Correction, and also all my table values are greater than 5.
Data analysis:
Expected frequencies:
             Got influenza                             Avoided influenza                           Total
Vaccine 1    $\frac{225}{1350}(280) = 46.666...$       $\frac{1125}{1350}(280) = 233.333...$       280
Vaccine 2    $\frac{225}{1350}(250) = 41.666...$       $\frac{1125}{1350}(250) = 208.333...$       250
Vaccine 3    $\frac{225}{1350}(270) = 45$              $\frac{1125}{1350}(270) = 225$              270
Vaccine 4    $\frac{225}{1350}(260) = 43.333...$       $\frac{1125}{1350}(260) = 216.666...$       260
Vaccine 5    $\frac{225}{1350}(290) = 48.333...$       $\frac{1125}{1350}(290) = 241.666...$       290
The data table from the British Medical Journal article here.
$\chi^2_{calc} = \sum \frac{(f_o - f_e)^2}{f_e}$
$f_o$: observed frequencies
$f_e$: expected frequencies
We take away each of the expected frequencies from each of the observed
frequencies, square the result and divide by the expected frequency. We then sum
all our answers together.
$\frac{(f_o - f_e)^2}{f_e}$    Got influenza                                           Avoided influenza
Vaccine 1    $\frac{(43 - 46.666...)^2}{46.666...} = 0.288...$    $\frac{(237 - 233.333...)^2}{233.333...} = 0.0576...$
Therefore
$\chi^2_{calc} = \sum \frac{(f_o - f_e)^2}{f_e}$
$\chi^2_{calc} \approx 16.5$
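The expected frequencies and the Vaccine 1 contributions can be reproduced with a short sketch. Only the Vaccine 1 observed counts (43 and 237) appear in the working above, so the full statistic is not recomputed here. The dictionary layout and the function name `contribution` are my own.

```python
row_totals = {"Vaccine 1": 280, "Vaccine 2": 250, "Vaccine 3": 270,
              "Vaccine 4": 260, "Vaccine 5": 290}
column_totals = {"Got influenza": 225, "Avoided influenza": 1125}
grand_total = 1350

# Expected frequency = row total * column total / grand total
expected = {
    vaccine: {outcome: row_total * col_total / grand_total
              for outcome, col_total in column_totals.items()}
    for vaccine, row_total in row_totals.items()
}

def contribution(observed, expected_value):
    """One cell's contribution (f_o - f_e)^2 / f_e to the chi squared statistic."""
    return (observed - expected_value) ** 2 / expected_value

got = contribution(43, expected["Vaccine 1"]["Got influenza"])
avoided = contribution(237, expected["Vaccine 1"]["Avoided influenza"])
print(round(got, 3), round(avoided, 4))
```

Summing `contribution` over all ten cells of the observed table would give the chi squared statistic.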
My result is significant at the 1% level. We will therefore reject the null hypothesis
that avoidance of influenza is independent of the type of vaccination received.
Therefore we have evidence that some of the vaccines are more effective than
others in helping prevent influenza.
Chi Squared Goodness of Fit: Are IB results normally distributed?
We can use a Chi Squared Goodness of Fit test to test if data follows a certain
distribution. Say for example I want to see if the IB results follow a normal
distribution I can proceed as follows:
Step 1: Data
IB Score X:    0 ≤ X ≤ 15    16 ≤ X ≤ 23    24 ≤ X ≤ 27    28 ≤ X ≤ 31    32 ≤ X ≤ 35    36 ≤ X ≤ 39    40 ≤ X ≤ 43    44 ≤ X ≤ 45
I can then find the mean and the sample standard deviation using my GDC (plotting
the mid-point values):
I will take the sample mean as 29.7 to 3sf and the sample standard deviation (which
is the same as the population standard deviation to 3sf in this case) as 7.13.
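Therefore my null and alternative hypotheses are:
$H_0$: X follows the normal distribution $N(29.7, 7.13^2)$
$H_1$: X does not follow the normal distribution $N(29.7, 7.13^2)$
Next I can use the normal distribution $X \sim N(29.7, 7.13^2)$ to calculate the following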
expected values (leaving the bottom inequality unbounded and the top inequality
unbounded).
Therefore the expected number of students is the probability multiplied by the total
number of students:
0.14635097(85605) = 12528.37479
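A minimal sketch of one expected value, using Python's standard library. It assumes the probability 0.14635... above belongs to the 16 ≤ X ≤ 23 cell and that the class boundaries are used exactly as written, with no continuity correction.

```python
from statistics import NormalDist

model = NormalDist(mu=29.7, sigma=7.13)
total_students = 85605

# Probability of the 16 <= X <= 23 cell, using the class boundaries as written
probability = model.cdf(23) - model.cdf(16)
expected_count = probability * total_students
print(round(probability, 4), round(expected_count))
```

The first and last cells would use `model.cdf(15)` and `1 - model.cdf(44)` respectively, since those inequalities are left unbounded.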
IB Score X:    X ≤ 15    16 ≤ X ≤ 23    24 ≤ X ≤ 27    28 ≤ X ≤ 31    32 ≤ X ≤ 35    36 ≤ X ≤ 39    40 ≤ X ≤ 43    44 ≤ X
I can then use my GDC to perform a Chi Squared Goodness of Fit test. For this test
the degrees of freedom when I have n cells of data is:
$d.f = n - 1$
But I also estimated both the mean and the standard deviation therefore my degrees
of freedom are:
$d.f = n - 1 - 1 - 1$
$d.f = 8 - 1 - 1 - 1$
$d.f = 5$
Entering this data into my GDC gives:
$\chi^2_{calc} = 9753$
(You can see a YouTube video for how to do this for a CASIO here.)
The critical value at the 5% significance level with 5 degrees of freedom is 11.07.
Therefore as:
$9753 > 11.07$
we reject the null hypothesis, i.e. we accept the alternative hypothesis that X does
not follow the normal distribution we tested.
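To show what the GDC is doing here, this is a rough Python sketch of the Chi Squared statistic and the degrees of freedom calculation. The function names are hypothetical, and the small observed/expected lists are made up purely to show the arithmetic. They are not the IB data, which is not reproduced here.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def degrees_of_freedom(n_cells, n_estimated_parameters):
    """n - 1, minus one more for each parameter estimated from the data."""
    return n_cells - 1 - n_estimated_parameters


# Made-up frequencies, purely to show the arithmetic (NOT the IB data).
observed = [10, 20, 30]
expected = [20, 20, 20]
print(chi_squared_statistic(observed, expected))  # (100 + 0 + 100) / 20

# The IB example: 8 cells, with the mean and standard deviation both estimated.
df = degrees_of_freedom(8, 2)
CRITICAL_VALUE_5_PERCENT = 11.07  # critical value quoted in the text for d.f. = 5
print(df, 9753 > CRITICAL_VALUE_5_PERCENT)
```

Calculating the statistic cell by cell like this is one way to demonstrate understanding beyond simply reading the answer off the calculator.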
Bernoulli trials for polling data
Say I interview 1000 people and ask them a binary question such as “will you vote
Republican or Democrat in the next US election?” (with “don’t knows” or “won’t vote
for either” discarded). I might then want to know how confident I can be that my
results can be applied to the whole population. As long as my sampling technique
accurately reflects the population make-up, 1000 people will give a pretty accurate
gauge of public opinion (which is why many polling companies poll 1000 people).
Maths processes
We model this as 1000 repeated Bernoulli trials. A Bernoulli trial is just defined as a
binomial trial with n = 1, i.e.:
$X \sim B(1, p)$
Each trial is defined as asking a person who they are going to vote for. We can
define “success” in this case as someone polled choosing Democrat (for example), so
that p is the probability of success.
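The sum of n independent Bernoulli trials follows a binomial distribution:
$\sum_{k=1}^{n} X_k \sim B(n, p)$
$\sum_{k=1}^{1000} X_k \sim B(1000, p)$
Writing Y for this sum:
$Y \sim B(1000, p)$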
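One way to convince yourself that adding up Bernoulli trials gives a binomial count is to simulate it. This is only an illustrative sketch: the value p = 0.6, the 500 repeats and the function names are all my own assumptions, chosen just to make the example run.

```python
import random


def bernoulli_trial(p, rng):
    """One Bernoulli trial: 1 (success) with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0


def poll(n, p, rng):
    """Sum of n independent Bernoulli trials, i.e. one draw from B(n, p)."""
    return sum(bernoulli_trial(p, rng) for _ in range(n))


rng = random.Random(0)   # fixed seed so the run is repeatable
p = 0.6                  # hypothetical value, chosen only for illustration
polls = [poll(1000, p, rng) for _ in range(500)]
mean_successes = sum(polls) / len(polls)
print(mean_successes)    # should be close to np = 600
```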
So, say we interview 1000 people and 685 people say they will vote Democrat. We
want to find the range of percentages within which we can be 95% confident that the
true Democrat vote share lies.
We use our data to give an estimation of the probability someone will vote Democrat:
$p \approx \frac{685}{1000}$
$p \approx 0.685$
We want to find out a lower and upper bound for this estimate.
$Y \sim B(1000, 0.685)$
$\mu = np$
Therefore for our sample mean (m) we have the following estimate:
$m = 1000(0.685) = 685$
And for a binomial we have the following for the standard deviation:
$\sigma = \sqrt{np(1 - p)}$
Therefore for our sample standard deviation (d) we have the following estimate:
$d = \sqrt{685(1 - 0.685)}$
$d \approx 14.689$
We can now use the fact that when n is large the binomial distribution is
approximated by the normal distribution.
The 95% confidence interval has z value ±1.96, which means that 95% of values will
lie within ±1.96 standard deviations of the mean:
$685 - 1.96(14.689) < Y < 685 + 1.96(14.689)$
$656.2 < Y < 713.8$
Dividing by 1000, we can be 95% confident that the actual probability lies between:
$0.66 < p < 0.71$
We can see that our original estimate of 68.5% therefore has a margin of error of
about ±2.9% (since 1.96 × 14.689 ÷ 1000 ≈ 0.029). This means that it will be quite
accurate (as long as our sampling was done correctly). You can read more on the
maths behind this polling method here, which provided the idea behind the example
used here.
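The whole calculation can be reproduced in a few lines of Python. This is just a sketch using the numbers from the example (1000 people, 685 saying Democrat, z = 1.96); the variable names are my own.

```python
from math import sqrt

n = 1000          # people polled
successes = 685   # said they would vote Democrat
z = 1.96          # z value for a 95% confidence interval

p_hat = successes / n                      # estimate of p
sd = sqrt(n * p_hat * (1 - p_hat))         # standard deviation of the count
lower = (successes - z * sd) / n
upper = (successes + z * sd) / n
margin = z * sd / n

print(f"sd = {sd:.3f}")
print(f"{lower:.3f} < p < {upper:.3f}")
print(f"margin of error = {margin:.3f}")
```

Changing `n` shows how the margin of error shrinks as more people are polled, which could make a nice extension.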
Spearman’s rank: Does cola taste preference increase with price?
Step 1
I want to investigate if students will prefer the most expensive cola drinks when given
a blind tasting test. I could choose 5 different colas and give them to (say) 30
students, with each student ranking them from 1 (the best) to 5 (the worst). I
would then tally all the scores and give an overall rank to the 5 colas.
Taste rank 1 2 3 4 5
Cost (Baht) 55 38 40 20 25
Rewriting the second row in terms of ranks (with 1 for the most expensive) gives:
Taste rank 1 2 3 4 5
Cost rank 1 3 2 5 4
I can then use the following formula to calculate Spearman’s rank (as long as there
are no tied ranks):
$r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$
Here d is the difference between the two ranks in each pair, and n is the number of
pairs of ranks we have used.
Therefore we have:
Taste rank 1 2 3 4 5
Cost rank 1 3 2 5 4
d 0 -1 1 -1 1
So this gives:
$\sum d^2 = 0^2 + (-1)^2 + 1^2 + (-1)^2 + 1^2 = 4$
$r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$
$r_s = 1 - \frac{6(4)}{5(5^2 - 1)}$
$r_s = 0.8$
This shows that there is a positive relationship between the 2 variables, i.e. as the
cost of the cola increases, the extent to which people prefer it also tends to
increase.
We can check this on the GDC by simply finding Pearson’s product moment correlation
coefficient for the 2 sets of ranks.
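The same check can be sketched in Python using the cola data above. The function names here are my own, and the ranking helper assumes there are no tied values (as the formula requires).

```python
from math import sqrt


def rank_descending(values):
    """Rank values with 1 for the largest (assumes no ties)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman(rank_x, rank_y):
    """Spearman's rank from the d^2 formula (valid only without tied ranks)."""
    n = len(rank_x)
    sum_d2 = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


def pearson(xs, ys):
    """Pearson's product moment correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


taste_rank = [1, 2, 3, 4, 5]
cost_baht = [55, 38, 40, 20, 25]
cost_rank = rank_descending(cost_baht)     # 1 for the most expensive cola

print(cost_rank)
print(spearman(taste_rank, cost_rank))
print(pearson(taste_rank, cost_rank))
```

When there are no tied ranks the two results agree, which is why Pearson’s coefficient on the ranks works as a check.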
Sampling techniques and experiment design
It’s important to know what your population is. If I want to know about one Year 13
class, without drawing any further conclusions about other Year 13 students, then my
population is that Year 13 class. If I include every student in that class then I will
have population data, not sample data. There are slightly different formulae (and
notation) used for population data compared with sample data so you should be
aware of this.
If I take data from one Year 13 class but want to draw wider conclusions about a
bigger population then this is a sample. Be clear what your wider population is - is it
the Year 13 cohort in your school? Is it all Year 13 students in the world? Is it all
students in your school?
What your population is will determine how reasonable your choice of sample is. It
probably is reasonable to use a sample of Year 13 students in your school to give
you data about the population of Year 13 students in your school. If you use a
sample of Year 13 students in your school to draw conclusions about all Year 13
students in the world then you need to consider how representative the students in
your school are of Year 13 students worldwide. It may be that you could narrow down
your population (say to Year 13 students at international schools in your country)
so that the sample data you collect is more representative.
If your sample is not representative of the population then the conclusions you draw
will not be valid. For example if my population is school aged children and I want to
find the average height - but only sample Year 13 students, clearly the data I get is
useless in drawing any wider conclusions about the height of my population.
2) Sampling
b) Systematic sampling
There is a system to select members by using a random starting point and a
fixed interval. For example, if my population is all Y13 students at my school,
I assign a number to each one. I then use a random number generator to
select the first student, add 10 to this number to select the second
student, and so on.
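As a rough illustration, systematic sampling could be sketched in Python like this. The cohort of 120 numbered students is hypothetical, the function name is my own, and I have assumed the random starting point is chosen from the first 10 students so that the fixed interval covers the whole list.

```python
import random


def systematic_sample(population, interval, rng=random):
    """Pick a random start within the first interval, then every interval-th member."""
    start = rng.randrange(interval)
    return population[start::interval]


# Hypothetical cohort of 120 students, numbered 1 to 120.
students = list(range(1, 121))
sample = systematic_sample(students, interval=10, rng=random.Random(1))
print(sample)
```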
c) Stratified sampling
My population is divided into non-overlapping strata which share common
characteristics and I then choose a random sample from each stratum.
For example if my population is all Y13 students at my school I could divide
the students into the two strata boys and girls. I would then choose a sample
from both groups. If my population was all students at my school then my
strata could be the different year groups.
You can add a quota to your stratified sample - for example I divide my
population into year group strata - but then also want to make sure that I
survey 55% girls and 45% boys in each year group (perhaps because this
reflects the overall gender mix of the school etc).
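A stratified sample could be sketched in Python as follows. The student labels and stratum sizes are invented for illustration (66 girls and 54 boys, to match a 55% / 45% mix), and sampling the same fraction from each stratum is my own assumption about how to keep the sample in proportion.

```python
import random


def stratified_sample(strata, fraction, rng=random):
    """Take a simple random sample of the same fraction from every stratum."""
    sample = {}
    for name, members in strata.items():
        size = max(1, round(len(members) * fraction))
        sample[name] = rng.sample(members, size)
    return sample


# Hypothetical Y13 cohort split into two non-overlapping strata.
strata = {
    "girls": [f"G{i}" for i in range(1, 67)],   # 66 students
    "boys": [f"B{i}" for i in range(1, 55)],    # 54 students
}
chosen = stratified_sample(strata, fraction=0.5, rng=random.Random(2))
print({name: len(members) for name, members in chosen.items()})
```

Because each stratum is sampled at the same rate, the sample keeps the same gender mix as the cohort.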
d) Convenience sampling
Here you just choose whatever is easiest to do! As you might expect this is
not an especially good technique when trying to draw wider conclusions from
your sample. For example you may do a survey on the path to the lunch-hall
and stop the first 10 students you see. However it may be appropriate
depending on what you are hoping to achieve. If your population is Year 13
students then it may be convenient to just survey everyone in your form class
- and as long as your form class is representative of the Year 13 population
then this would still allow valid wider conclusions.
3) Survey design
You should try to design a study and a data collection process that reduces errors
and bias. At the very least you should show an awareness of potential errors/bias
when discussing your design.
a) Email/internet surveys - even if these are sent to the whole population, the
people who respond might not be representative of that population. Perhaps
they are more diligent than usual (or perhaps they have more time on their
hands!)
b) Poor design in data collection which does not allow you to accurately compare
students. For example if you are measuring heights of students you need to
very clearly set out the standardised process you used with every student.
Perhaps every student must take off their shoes, stand with their back straight
against the wall, be measured by the same person etc.
c) A lack of anonymity leading to untrue answers. You will likely get different
answers to some questions depending on whether the survey is anonymous
or not. People will (unsurprisingly) be less likely to tell the truth if they think it
makes them look bad. If you do a survey on time spent doing exercise a
week some people will tell you what they think they should be doing, not what
they are doing! This is a common problem for doctors when talking to
patients.
4) Experiment design: Permutations and Latin squares
Say for example I want to conduct an experiment to test whether listening to music
can help with memorising a list of words. A simple way of doing this would be:
Simple experiment
Twenty students are given a list of 10 words and 2 minutes to memorise. They then
write down how many they can remember.
The same 20 students are now given 10 different words and 2 minutes to memorise.
This time they listen to music whilst memorising. They then write down how many
they can remember.
Whilst this looks superficially like a fair experiment there are some serious flaws.
Firstly, I can’t be sure that the 2 lists of 10 words are equally easy to remember. This
flaw on its own means that I cannot draw any valid conclusions from the experiment.
Secondly, the students may perform better on the second trial because they have
already got their brains in gear (or worse because they are tired etc). So we need
to control for both of these problems.
Better experiment
Five students are given a list of 10 words [LIST 1] and 2 minutes to memorise.
There is no music.
The same 5 students are now given 10 different words [LIST 2] and 2 minutes to
memorise. They listen to music whilst memorising.

Five other students are given a list of 10 words [LIST 2] and 2 minutes to memorise.
There is no music.
The same 5 students are now given 10 different words [LIST 1] and 2 minutes to
memorise. They listen to music whilst memorising.

Five other students are given a list of 10 words [LIST 1] and 2 minutes to memorise.
They listen to music whilst memorising.
The same 5 students are now given 10 different words [LIST 2] and 2 minutes to
memorise. There is no music.

Five other students are given a list of 10 words [LIST 2] and 2 minutes to memorise.
They listen to music whilst memorising.
The same 5 students are now given 10 different words [LIST 1] and 2 minutes to
memorise. There is no music.
This experiment design means that half the students had LIST 1 first and half had
LIST 2 first, half the time LIST 1 was memorised with music and half the time LIST 2
was memorised with music, half the time the students listened to music first and half
the time they listened to music second.
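Writing A for LIST 1, B for LIST 2, N for no music and M for music, the four orders
used by the four groups are:
AN-BM
BN-AM
AM-BN
BM-AN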
I can then conduct a paired t-test, where I simply compare the student “with music”
score with the student “without music” score.
For more complicated designs you can use the idea of Latin Squares to design fair
experiments like this.
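The four orders in the better experiment can be generated rather than written out by hand. This is a sketch only: I am assuming A and B stand for LIST 1 and LIST 2, and N and M for no music and music. The function names and the random allocation of 20 numbered students to groups are also my own additions.

```python
import random
from itertools import product

LISTS = ("A", "B")        # assumed: A = LIST 1, B = LIST 2
CONDITIONS = ("N", "M")   # assumed: N = no music, M = music


def counterbalanced_orders():
    """Every (first list, first condition) choice; the second session uses the other of each."""
    orders = []
    for first_list, first_cond in product(LISTS, CONDITIONS):
        second_list = "B" if first_list == "A" else "A"
        second_cond = "M" if first_cond == "N" else "N"
        orders.append(f"{first_list}{first_cond}-{second_list}{second_cond}")
    return orders


def assign_groups(students, orders, rng=random):
    """Shuffle the students and deal them out evenly across the orders."""
    shuffled = list(students)
    rng.shuffle(shuffled)
    return {order: shuffled[i::len(orders)] for i, order in enumerate(orders)}


orders = counterbalanced_orders()
groups = assign_groups(range(1, 21), orders, rng=random.Random(3))
for order, members in groups.items():
    print(order, sorted(members))
```

With more lists or more conditions the number of orders grows quickly, which is where a Latin Square design becomes useful.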