Assignment_on_Probability_and_statistics
Each day Mr Burke randomly chooses one student to answer a homework question.
(a) Find the probability that on any given day Mr Burke chooses a female student to answer a question. [1]
(b) Find the probability he will choose a female student 8 times. [2]
(c) Find the probability he will choose a male student at most 9 times. [3]
(c) It was not possible to ask every person in the school, so the Headmaster arranged the student names in
alphabetical order and then asked every 10th person on the list.
(a.i) Write down the value of the Pearson’s product–moment correlation coefficient, r. [2]
(a.ii) Using the value of r, interpret the relationship between Stan’s score and Minsun’s score. [2]
(c.i) Use your regression equation from part (b) to estimate Minsun’s score when Stan awards a perfect 10. [2]
(c.ii) State whether this estimate is reliable. Justify your answer. [2]
The Commissioner for the event would like to find the Spearman’s rank correlation coefficient.
[2]
(f) The Commissioner believes Minsun’s score for competitor G is too high and so decreases the score from 9.5 to 9.1.
Explain why the value of the Spearman’s rank correlation coefficient $r_s$ does not change. [1]
(a) Find the number of potatoes in the sample with a weight of more than 200 grams. [2]
(c) The weight of the smallest potato in the sample is 20 grams and the weight of the largest is 400 grams.
Use the scale shown below to draw a box and whisker diagram showing the distribution of the weights of the
potatoes. You may assume there are no outliers.
[2]
(a) Assuming the data follows a linear model for this period, find the regression line of T on m for the remaining
data. [2]
(b) Use your line to find an estimate for the water temperature on the first day of May. [2]
(c.i) Explain why your line should not be used to estimate the value of m at which the temperature is 10.0 °C. [1]
(c.ii) Explain in context why your line should not be used to predict the value for December (month 12). [1]
(d) State a more appropriate model for the water temperature in the lake over an extended period of time. You are
not expected to calculate any parameters. [1]
(b) Find the equation of the least squares regression line of $\log_{10} P$ against $\log_{10} d$. [2]
(c.i) Use your answer to part (b) to write down the value of n to the nearest integer. [1]
Hence
(c.ii) predict the temperature of the metal rod after 3 minutes. [2]
(a.ii) Calculate an estimate of the mean height of the 200 students. [2]
(b) Use the cumulative frequency curve to estimate the interquartile range. [2]
Laszlo is a student in the data set and his height is 204 cm.
(c) Use your answer to part (b) to estimate whether Laszlo’s height is an outlier for this data. Justify your answer. [3]
It is believed that the heights of university students follow a normal distribution with mean 176 cm and standard deviation 13.5 cm.
It is decided to perform a χ² goodness of fit test on the data to determine whether this sample of 200 students could have
(d) Write down the null and the alternative hypotheses for the test. [2]
(e.ii) Hence, perform the test to a 5% significance level, clearly stating the conclusion in context. [4]
10. [Maximum mark: 27] 23N.3.AHL.TZ0.2
This question is about applying ideas from logarithms, calculus and probability to an unfamiliar mathematical theory called
information theory.
Claude Shannon developed a mathematical theory called information theory to measure the information gained when random
events occur. He defined the information, I , that is gained when an event with probability p occurs as
I = − ln p
where 0 < p ≤ 1 . For example, no information is gained (I = 0) when an event is certain to occur (p = 1) .
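As a quick numerical illustration of the definition above, here is a minimal Python sketch. The function name `information` is an assumption introduced for this example and is not part of the question.

```python
import math

def information(p: float) -> float:
    """Information gained, I = -ln p, when an event of probability p occurs."""
    if not 0 < p <= 1:
        raise ValueError("p must satisfy 0 < p <= 1")
    # ln(1/p) equals -ln p and avoids printing a negative zero when p = 1.
    return math.log(1.0 / p)

# A certain event (p = 1) carries no information.
print(information(1.0))
# Rarer events carry more information, so I falls as p rises.
print(information(0.5), information(0.1))
```

The printed values show that a smaller probability gives a larger I, which is the behaviour the sketch in part (a.i) should display.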
(a.i) Sketch the graph of I = − ln p , for 0 < p ≤ 1 , labelling all axes intercepts and asymptotes. [3]
(a.iii) Interpret what “I is a decreasing function of p” means in the given context. [1]
(b) A computer selects at random an integer x from 1 to 10, inclusive. Each outcome is equally likely.
(b.ii) Alessia is told that x is odd. Find how much information Alessia gains. [2]
The computer then selects at random an integer y from 1 to 10, inclusive. Each outcome is equally likely.
Daniel is trying to determine the value of y and asks if y is 7. He is told that it is not 7.
If a random variable has n possible outcomes with probabilities $p_1, p_2, \ldots, p_n$, then the expected information gained, E(I), is defined as
$E(I) = \sum_{r=1}^{n} -p_r \ln p_r$.
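The definition of expected information can be checked numerically. The sketch below is one possible implementation; the function name `expected_information` and the three-outcome example are assumptions made for illustration only.

```python
import math

def expected_information(probabilities):
    """E(I): the sum of -p_r * ln(p_r) over all outcomes r."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p = 0 never occur and contribute nothing to the sum.
    return sum(-p * math.log(p) for p in probabilities if p > 0)

# Hypothetical three-outcome example (not taken from the question).
print(round(expected_information([0.5, 0.25, 0.25]), 4))
```

The same function can be applied to any list of outcome probabilities, for instance a two-outcome yes/no question.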
(c) For the integer guessing game described in part (b), when Daniel asks if y is 7, there are two possible outcomes: “y
is 7” or “y is not 7”.
(c.i) Show that the expected information gained by Daniel is 0.325, correct to three significant figures. [2]
(c.ii) Alessia asks if x is odd. Show that her expected information gained is greater than Daniel’s expected information
gained. [2]
(d) When a coin is flipped, the outcome is either heads or tails. The coin may be biased. Let p be the probability of the
outcome being heads.
(d.i) Find, in terms of p, the information gained when the outcome is tails. [1]
(d.ii) Find, in terms of p, the expected information gained when the coin is flipped once. [1]
(d.iii) Hence, find the value of p when the expected information gained is maximized. [2]
A famous puzzle uses 12 balls which appear identical. 11 have the same weight, but one is either lighter or heavier than the others.
A pair of scales can be repeatedly used to compare the weights of different combinations of the balls.
The outcome of each weighing can be “balanced”, “left-hand side heavier” or “right-hand side heavier”. The aim of the puzzle is to
identify the ball which is the different weight, and whether it is heavier or lighter than the others, in as few weighings as possible.
(e) Angela wants to decide how many balls should be compared to each other in the first weighing. She produces
the following table to help plan her strategy.
(e.iv) Use the table to suggest the best choice for Angela’s first weighing. Justify your answer. [1]
Grade 1 2 3 4 5 6 7
Frequency 1 4 7 9 p 9 4
Their quality assurance team randomly selects 500 items of food to inspect. The quality of this food is classified as perfect,
satisfactory, or poor. The data is summarized in the following table.
(a) Find the probability that its quality is not perfect, given that it is from breakfast. [2]
A χ² test at the 5% significance level is carried out to determine if there is significant evidence of a difference in the quality of the
(c) State, with justification, the conclusion for this test. [2]
x  6.3  4.1  5.6  9.2   7.8  8.2
y  9.2  4.9  8.9  10.3  8.9  9.8
(a) State null and alternative hypotheses which could be used to test whether there is a linear correlation between X
and Y . [2]
(c) State whether your result from part (b)(ii) indicates there is sufficient evidence to claim that, at the 5%
significance level, X and Y are not linearly correlated.
Athlete         A     B     C     D     E     F     G     H
Age (years)     13    17    22    18    19    25    11    36
Time (seconds)  13.4  14.6  13.4  12.9  12.0  11.8  17.0  13.1
Sung-Jin decides to calculate the Spearman’s rank correlation coefficient for his set of data.
Athlete A B C D E F G H
Age rank 3
Time rank 1
[2]
(b) Calculate the Spearman’s rank correlation coefficient, $r_s$. [2]
(d) Suggest a mathematical reason why Sung-Jin may have decided not to use Pearson’s product-moment
correlation coefficient with his data from the original table. [1]
(e.i) Find the coefficient of determination for the data from the original table. [2]
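For readers who want to check a Spearman calculation without a graphic display calculator, the following Python sketch ranks both variables and then applies Pearson's formula to the ranks. It uses the age and time data from the original table. The helper names are assumptions, and ranking from smallest to largest is also an assumption (the coefficient is the same as long as both variables are ranked in the same direction). Tied values, such as the two times of 13.4 seconds, share the mean of the ranks they occupy.

```python
import math

def average_ranks(values):
    """Rank values from smallest (rank 1) upward; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

ages = [13, 17, 22, 18, 19, 25, 11, 36]
times = [13.4, 14.6, 13.4, 12.9, 12.0, 11.8, 17.0, 13.1]
print(round(spearman(ages, times), 3))
print(round(pearson(ages, times) ** 2, 3))  # coefficient of determination, r squared
```

Because only the ranks enter `spearman`, changing a value without changing its position in the ordering leaves the result unchanged.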
Year °C (y) 8.73 9.22 9.10 9.12 9.13 9.45 9.76
Tami creates a linear model for this data by finding the equation of the straight line passing through the points with coordinates
(1708, 8.73) and (1958, 9.45).
(a) Calculate the gradient of the straight line that passes through these two points. [2]
(b.i) Interpret the meaning of the gradient in the context of the question. [1]
(c) Find the equation of this line giving your answer in the form y = mx + c . [2]
(d) Use Tami’s model to estimate the mean annual temperature in the year 2000. [2]
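The two-point construction behind Tami's model can be sketched in a few lines of Python. The function name `line_through` is an assumption for this example; the two points are the ones stated in the question.

```python
def line_through(p1, p2):
    """Gradient m and intercept c of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)  # change in y per unit change in x
    c = y1 - m * x1            # chosen so the line passes through p1
    return m, c

m, c = line_through((1708, 8.73), (1958, 9.45))
estimate_2000 = m * 2000 + c   # substitute x = 2000 into y = mx + c
print(m, c, estimate_2000)
```

Here m is measured in °C per year, which is the quantity part (b.i) asks to interpret.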
(e.ii) Find the value of r, the Pearson’s product-moment correlation coefficient. [1]
(f ) Use Thandizo’s model to estimate the mean annual temperature in the year 2000. [2]
Thandizo uses his regression line to predict the year when the mean annual temperature will first exceed 15 °C.
(g) State two reasons why Thandizo’s prediction may not be valid. [2]
20. [Maximum mark: 17] 23M.2.AHL.TZ1.3
A large international sports tournament tests their athletes for banned substances.
They interpret a positive test result as meaning that the athlete uses banned substances.
A negative result means that they do not.
If an athlete uses banned substances, the probability that they will test positive is 0.71.
If an athlete does not use banned substances, the probability that they will test negative is 0.98.
(a) Using the information given, complete the following tree diagram.
[2]
(b.i) Determine the probability that a randomly selected athlete does not use banned substances and tests negative. [2]
(b.ii) If two athletes are selected at random, calculate the probability that both athletes do not use banned substances
and both test negative. [2]
(c.i) Calculate the probability that a randomly selected athlete will receive an incorrect test result. [3]
(c.ii) A random sample of 1300 athletes at the tournament are selected for testing. Calculate the expected number of
athletes in the sample that will receive an incorrect test result. [2]
Team X are competing in the tournament. There are 20 athletes in this team. It is known that none of the athletes in Team X use
banned substances.
(d) Calculate the probability that none of the athletes in Team X will test positive. [4]
(e) Determine the probability that more than 2 athletes in Team X will test positive. [2]
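The Team X parts can be modelled with a binomial distribution, assuming each athlete's test result is independent of the others (an assumption the question does not state explicitly). The sketch below uses only the 0.98 figure given above; the function name `binomial_pmf` is hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n = 20
# Probability that an athlete who does not use banned substances tests positive.
p_positive = 1 - 0.98

p_none = binomial_pmf(0, n, p_positive)
# P(X > 2) = 1 - P(X = 0) - P(X = 1) - P(X = 2)
p_more_than_2 = 1 - sum(binomial_pmf(k, n, p_positive) for k in range(3))
print(p_none, p_more_than_2)
```

Using the complement for "more than 2" keeps the sum to three terms instead of eighteen.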
The number of trees to be planted in each of the first three months are shown in the following table.
(a) Find the number of trees to be planted in the 15th month. [3]
(b) Find the total number of trees to be planted in the first 15 months. [2]
(c) Find the mean number of trees planted per month during the first 15 months. [2]
Elsie’s data for 160 people who visited the library on that particular day is shown in the following table.
(c.ii) Write down the mid-interval value for this class. [1]
(d) Use Elsie’s data to calculate an estimate of the mean time that people spent in the library. [2]
(e) Using the table, write down the maximum possible number of people who spent 35 minutes or less in the library
on that day. [1]
(f ) Find the probability a visitor spends at least 60 minutes in the library. [2]
The following box and whisker diagram shows the times, in minutes, that the 160 visitors spent in the library.
(g) Write down the median time spent in the library. [1]
(i) Hence show that the longest time that a person spent in the library is not an outlier. [3]
Elsie believes the box and whisker diagram indicates that the times spent in the library are not normally distributed.
(j) Identify one feature of the box and whisker diagram which might support Elsie’s belief. [1]
(a) Find the probability that a randomly chosen door has a total thickness of less than 9.5 mm. [5]
Eight doors are to be packed into a box to send to a customer. The width of the box is 82 mm. The thickness of each door is
independent.
(b) Find the probability that the total thickness of the eight doors is greater than the width of the box. [4]
The company buys two new machines, A and B, to make the wooden layers. An employee claims that the layers from machine B
are thinner than the layers from machine A. In order to test this claim, a random sample is taken from each machine.
The seven layers in the sample from machine A have a thickness, in mm, of
Find the
The eight layers in the sample from machine B have a mean thickness of 6.89 mm and $S_{n-1} = 0.31$.
(d) Perform a suitable test, at the 5% significance level, to test the employee’s claim. You may assume the thickness of
the wooden layers from each machine are normally distributed with equal population variance. [6]
Two friends, Peter and Helen, are discussing ways of predicting the outcomes of international football matches involving
Argentina.
Peter suggests analysing historical data to help make predictions. He lists the results of the most recent 240 matches in which
Argentina played, in chronological order, then considers blocks of four matches at a time. He counts how many times Argentina
has won in each block. The following table shows his results for the 60 blocks of four matches.
(a) Determine the mean number of wins per block of four matches for Argentina. [2]
Peter thinks that this data can be modelled by a binomial distribution with n = 4 and decides to carry out a χ² goodness of fit test.
(b) Use Peter’s data to write down an estimate for the probability p for this binomial model. [1]
(c.i) Use the binomial model to find the probability that Argentina win zero matches in a block of four matches. [1]
As some expected frequencies are less than 5, Peter combines rows in his table to produce the following observed frequencies. He
then uses his binomial model to find appropriate expected frequencies, correct to one decimal place.
Peter uses this table to carry out a χ² goodness of fit test, to test the hypothesis that the data follows a binomial distribution with
(e) Using Peter’s binomial model, find the probability that Argentina will win at least one of their next four
international football matches. [2]
Helen thinks that a better prediction might be made by considering the transition between matches. To keep the model simple, she
decides to use only two states: Argentina won (A) or Argentina did not win (B). Helen looks at Peter’s list of results and counts the
number of times that:
29
. [2]
(f.ii) Write down the transition matrix, T , for Helen’s model. [2]
(h) In her retirement, many years from now, Helen is planning to travel to three consecutive international football
matches involving Argentina. Use Helen’s model to find the probability that Argentina will win all three matches. [4]
(a) Calculate the expected number of people who will pass this polygraph test. [2]
(b) Calculate the probability that exactly 4 people will fail this polygraph test. [2]
(c) Determine the probability that fewer than 7 people will pass this polygraph test. [3]
25 33 51 62 63 63 70 74 79 79 81 88 90 90 98
For these data, the lower quartile is 62 and the upper quartile is 88.
(a) Show that the test score of 25 would not be considered an outlier. [3]
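The outlier check can be written out as a short Python sketch. It assumes the usual convention that an outlier lies more than 1.5 × IQR below the lower quartile or above the upper quartile; the quartiles are taken from the question rather than recomputed.

```python
scores = [25, 33, 51, 62, 63, 63, 70, 74, 79, 79, 81, 88, 90, 90, 98]
q1, q3 = 62, 88                # quartiles as stated in the question

iqr = q3 - q1                  # interquartile range
lower_fence = q1 - 1.5 * iqr   # values below this count as outliers
upper_fence = q3 + 1.5 * iqr   # values above this count as outliers

outliers = [s for s in scores if s < lower_fence or s > upper_fence]
print(lower_fence, upper_fence, outliers)
```

A score is an outlier only if it falls outside the two fences, so comparing 25 with `lower_fence` settles part (a).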
The box and whisker diagram showing these scores is given below.
Test scores
Another mathematics class is run by the college during the evening. A box and whisker diagram showing the scores from this class
for the same test is given below.
Test scores
A researcher reviews the box and whisker diagrams and believes that the evening class performed better than the morning class.
(b) With reference to the box and whisker diagrams, state one aspect that may support the researcher’s opinion and
one aspect that may counter it. [2]
(a) Find the probability that a randomly chosen applicant from this group was accepted by the university. [1]
An applicant is chosen at random from this group. It is found that they were accepted into the programme of their choice.
(b) Find the probability that the applicant applied for the Arts programme. [2]
(c) Two different applicants are chosen at random from the original group.
Find the probability that both applicants applied to the Arts programme. [3]
(b) Determine if the Netherlands’ score is an outlier for this data. Justify your answer. [3]
Chester is investigating the relationship between the highest-scoring countries’ Eurovision score and their population size to
determine whether population size can reasonably be used to predict a country’s score.
The populations of the countries, to the nearest million, are shown in the table.
Chester finds that, for this data, the Pearson’s product moment correlation coefficient is r = 0.249.
(c) State whether it would be appropriate for Chester to use the equation of a regression line for y on x to predict a
country’s Eurovision score. Justify your answer. [2]
Chester then decides to find the Spearman’s rank correlation coefficient for this data, and creates a table of ranks.
Write down the value of:
(d.i) a . [1]
(d.ii) b. [1]
(d.iii) c . [1]
(e.i) Find the value of the Spearman’s rank correlation coefficient $r_s$. [2]
(f) When calculating the ranks, Chester incorrectly read the Netherlands’ score as 478. Explain why the value of the
Spearman’s rank correlation $r_s$ does not change despite this error. [1]
The number of passengers that arrive to board this flight is assumed to follow a binomial distribution with a probability of 0.9.
(a) The airline sells 74 tickets for this flight. Find the probability that more than 72 passengers arrive to board the
flight. [3]
(b.i) Write down the expected number of passengers who will arrive to board the flight if 72 tickets are sold. [2]
(b.ii) Find the maximum number of tickets that could be sold if the expected number of passengers who arrive to board
the flight must be less than or equal to 72. [2]
Each passenger pays $150 for a ticket. If too many passengers arrive, then the airline will give $300 in compensation to each
passenger that cannot board.
(c) Find, to the nearest integer, the expected increase or decrease in the money made by the airline if they decide to
sell 74 tickets rather than 72. [8]
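Parts (a) and (b) can be explored with the short Python sketch below. It covers only those parts, because part (c) depends on the number of seats, which is not stated in this excerpt. The function name `prob_more_than` is an assumption for the example.

```python
from math import comb

P_ARRIVE = 0.9  # probability that a ticketed passenger arrives, from the question

def prob_more_than(threshold, tickets, p=P_ARRIVE):
    """P(X > threshold) for X ~ B(tickets, p)."""
    return sum(
        comb(tickets, k) * p**k * (1 - p) ** (tickets - k)
        for k in range(threshold + 1, tickets + 1)
    )

# (a) 74 tickets sold: probability that more than 72 passengers arrive.
p_a = prob_more_than(72, 74)

# (b.i) Expected arrivals when 72 tickets are sold: E(X) = n * p.
expected_72 = 72 * P_ARRIVE

# (b.ii) Largest n with 0.9 * n <= 72, written with integers to avoid rounding issues.
max_tickets = max(n for n in range(1, 200) if 9 * n <= 10 * 72)

print(p_a, expected_72, max_tickets)
```

The search range of 1 to 199 tickets is arbitrary and only needs to be wide enough to contain the answer.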
(b) Find the estimated number of teenagers who have a reaction time greater than 0.4 seconds. [2]
(c) Determine the 90th percentile of the reaction times from the cumulative frequency graph. [2]
Mackenzie created the cumulative frequency graph using the following grouped frequency table.
(e) Write down the modal class from the table. [1]
(f ) Use your graphic display calculator to find an estimate of the mean reaction time. [2]
Upon completion of the experiment, Mackenzie realized that some values were grouped incorrectly in the frequency table. Some
reaction times recorded in the interval 0 < t ≤ 0.2 should have been recorded in the interval 0.2 < t ≤ 0.4.
(g) Suggest how, if at all, the estimated mean and estimated median reaction times will change if the errors are
corrected. Justify your response. [4]
32. [Maximum mark: 13] 22M.2.AHL.TZ1.3
A Principal would like to compare the students in his school with a national standard. He decides to give a test to eight students
made up of four boys and four girls. One of the teachers offers to find the volunteers from his class.
(a) Name the type of sampling that best describes the method used by the Principal. [1]
The marks out of 40, for the students who took the test, are:
(c) Perform an appropriate test at the 5% significance level to see if the mean marks achieved by the students in the
school are higher than the national standard. It can be assumed that the marks come from a normal population. [5]
(d) State one reason why the test might not be valid. [1]
Two additional students take the test at a later date and the mean mark for all ten students is 28.1 and the standard deviation is 8.4.
For further analysis, a standardized score out of 100 for the ten students is obtained by multiplying the scores by 2 and adding 20.
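The effect of this linear transformation on the summary statistics can be sketched as follows. The function name `transform_summary` is an assumption; the mean of 28.1 and standard deviation of 8.4 are the values stated above.

```python
def transform_summary(mean, sd, scale=2, shift=20):
    """Mean and standard deviation of scale * X + shift."""
    new_mean = scale * mean + shift  # the mean is scaled and then shifted
    new_sd = abs(scale) * sd         # adding a constant does not change the spread
    return new_mean, new_sd

new_mean, new_sd = transform_summary(28.1, 8.4)
print(new_mean, new_sd)
```

Only the multiplication affects the standard deviation, which is why the shift of 20 appears in the mean alone.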
This equation links a variable k with the temperature T , where A and c are positive constants and T > 0 .
dT
is always positive. [3]
(b) Given that $\lim_{T \to \infty} k = A$ and $\lim_{T \to 0} k = 0$, sketch the graph of k against T. [3]
T
is a straight line.
Write down
T
. [2]
Find an estimate of
(e.i) c.
(e.ii) A .
t is the number of days since the first computer was infected by the virus.
Q(t) is the total number of computers that have been infected up to and including day t.
(a.ii) Write down the value of r, Pearson’s product-moment correlation coefficient. [1]
(a.iii) Explain why it would not be appropriate to conduct a hypothesis test on the value of r found in (a)(ii). [1]
A model for the early stage of the spread of the computer virus suggests that
Q′(t) = βN Q(t)
where N is the total number of computers in a city and β is a measure of how easily the virus is spreading between computers.
Both N and β are assumed to be constant.
(b.i) Find the general solution of the differential equation Q′(t) = βN Q(t) . [4]
(b.ii) Using the data in the table write down the equation for an appropriate non-linear regression model. [2]
(b.iv) Hence comment on the suitability of the model from (b)(ii) in comparison with the linear model found in part (a). [2]
(b.v) By considering large values of t write down one criticism of the model found in (b)(ii). [1]
(c) Use your answer from part (b)(ii) to estimate the time taken for the number of infected computers to double. [2]
The data above are taken from city X which is estimated to have 2.6 million computers.
The analyst looks at data for another city, Y. These data indicate a value of $\beta = 9.64 \times 10^{-8}$.
(d) Find in which city, X or Y, the computer virus is spreading more easily. Justify your answer using your results from
part (b). [3]
$Q'(t) \approx \frac{Q(t+5) - Q(t-5)}{10}$.
The following table shows estimates of Q′(t) for city X at different values of t.
(e) Determine the value of a and of b. Give your answers correct to one decimal place. [2]
An improved model for Q(t), which is valid for large values of t, is the logistic differential equation
$Q'(t) = kQ(t)\left(1 - \frac{Q(t)}{L}\right)$
Based on this differential equation, the graph of $\frac{Q'(t)}{Q(t)}$ against Q(t) is predicted to be a straight line.
$Q(t) = \frac{L}{1 + Ce^{-kt}}$
where C is a constant.
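To see how the logistic solution, the differential equation and the central-difference estimate of Q′(t) fit together, here is a minimal Python sketch. The parameter values for L, C and k are hypothetical and chosen only for illustration; they are not estimates for city X.

```python
import math

# Hypothetical parameter values, for illustration only.
L, C, k = 1000.0, 50.0, 0.3

def Q(t):
    """Logistic solution Q(t) = L / (1 + C * exp(-k * t))."""
    return L / (1 + C * math.exp(-k * t))

def central_difference(t, h=5):
    """Estimate Q'(t) as (Q(t + h) - Q(t - h)) / (2h); h = 5 gives the 10-day window."""
    return (Q(t + h) - Q(t - h)) / (2 * h)

def logistic_rhs(t):
    """Right-hand side of the logistic equation, k * Q * (1 - Q / L)."""
    return k * Q(t) * (1 - Q(t) / L)

for t in (10, 20, 30):
    # Q'(t)/Q(t) = k * (1 - Q/L) is linear in Q, matching the straight-line prediction.
    print(t, round(Q(t), 1), round(central_difference(t), 2), round(logistic_rhs(t), 2))

print(Q(1000))  # for large t, Q(t) approaches the limiting value L
```

With a window as wide as h = 5 the central difference is only a rough estimate of the derivative; shrinking h brings it closer to the value given by the differential equation.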
Using your answer to part (f )(i), estimate the percentage of computers in city X that are expected to have been
infected by the virus over a long period of time. [2]