Mas 202 Notes
The normal distribution is one of the most important distributions in statistics. Many measured
quantities in the natural sciences follow a normal distribution under certain circumstances. It is also a
useful approximation to the binomial distribution and to the Poisson distribution.
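The normal variable $X$ is continuous. Its probability density function $f(x)$ depends on its mean $\mu$ and
standard deviation $\sigma$, where
$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}, \quad -\infty < x < \infty$
This is written $X \sim N(\mu, \sigma^2)$.
It is bell-shaped
It is symmetrical about $\mu$
It extends from $-\infty$ to $+\infty$
The maximum value of $f(x)$ is $\dfrac{1}{\sigma\sqrt{2\pi}}$
The total area under the curve is 1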
Approximately 95% of the distribution lies within two standard deviations of the mean
Approximately 99.7% (very nearly all) of the distribution lies within three standard deviations of
the mean
The spread of the distribution depends on $\sigma$. Here are some normal curves, each drawn to the same
scale.
[Figure: three normal curves with different means and standard deviations, drawn to the same scale]
Finding probabilities
The probability that $X$ lies between $a$ and $b$ is written $P(a < X < b)$. To find this probability, you need
to find the area under the curve between $a$ and $b$.
One way of finding areas is to integrate, but since the normal function is complicated and very difficult
to integrate, tables are used instead.
In order to use the same set of tables for all possible values of $\mu$ and $\sigma$, the variable $X$ is
standardized so that the mean is 0 and the standard deviation is 1. Notice that since the variance is the
square of the standard deviation, the variance is also 1. This standardized normal variable is called
$Z$ and $Z \sim N(0, 1)$.
To illustrate how the variable $X$ is standardized to the variable $Z$, consider $X$ distributed normally with
mean 50 and variance 4, i.e. $X \sim N(50, 4)$, $\mu = 50$ and $\sigma^2 = 4$, so $\sigma = 2$.
The maximum value of $f(x)$ is $\dfrac{1}{2\sqrt{2\pi}} \approx 0.2$ and the curve is shown in the right-hand section of the
diagram below.
Now translate the curve 50 units to the left so that the mean is 0. This is shown on the left-hand section
of the diagram. The standard deviation is still 2, so the maximum value is again approximately 0.2.
[Figure: the curve of N(50, 4) on the right and the same curve translated 50 units to the left, centred on 0]
Now ‘squash’ the curve towards the vertical axis so that the standard deviation is 1. This is done by dividing by the
standard deviation. The maximum value is now approximately 0.4.
[Figure: the standard normal curve, plotted from -3 to 3]
In general
To standardize $X$, where $X \sim N(\mu, \sigma^2)$:
Subtract the mean $\mu$
Then divide by the standard deviation $\sigma$ to obtain $Z = \dfrac{X - \mu}{\sigma}$, where $Z \sim N(0, 1)$
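The two standardizing steps can be checked numerically. The sketch below is illustrative only: it assumes Python's standard-library `statistics.NormalDist` in place of printed tables, uses the N(50, 4) distribution from the illustration above, and the value x = 53 is a hypothetical choice.

```python
from statistics import NormalDist

def standardize(x, mu, sigma):
    """Subtract the mean, then divide by the standard deviation."""
    return (x - mu) / sigma

mu, sigma = 50.0, 2.0          # X ~ N(50, 4), so sigma = 2
X = NormalDist(mu, sigma)      # the original variable
Z = NormalDist(0.0, 1.0)       # the standardized variable

x = 53.0                       # hypothetical value chosen for illustration
z = standardize(x, mu, sigma)

# The area to the left of x under X equals the area to the left of z under Z.
p_direct = X.cdf(x)
p_standard = Z.cdf(z)
print(z, round(p_direct, 4), round(p_standard, 4))
```

Both probabilities agree, which is why one table for Z serves every normal variable.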
Exercise
Read the values of the following from the normal table
1. ( ) {Ans. 0.5636}
2. ( ) {Ans. 0.6350}
3. ( ) {Ans. 0.6660}
Note
When calculating probabilities, remember that the total area under the standardized normal curve is 1.
Example
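Using the standard normal tables find
a) $P(Z < 0.85)$ b) $P(Z > 0.85)$
Solution
Writing $\Phi(z)$ for the tabulated value $P(Z < z)$:
a) $P(Z < 0.85) = \Phi(0.85) = 0.8023$
b) $P(Z > 0.85) = 1 - \Phi(0.85) = 1 - 0.8023 = 0.1977$
In general
$P(Z < a) = \Phi(a)$ and $P(Z > a) = 1 - \Phi(a)$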
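The standard normal tables start at $z = 0$. You can, however, find probabilities relating to negative values of
$z$ by using the symmetrical properties of the curve, where $\Phi(z)$ denotes the tabulated value $P(Z < z)$.
To find $P(Z < -a)$ where $a > 0$:
$P(Z < -a) = P(Z > a) = 1 - \Phi(a)$, i.e. $\Phi(-a) = 1 - \Phi(a)$
To find $P(Z > -a)$:
$P(Z > -a) = P(Z < a) = \Phi(a)$
[Figure: sketches showing that the area to the left of $-a$ equals the area to the right of $a$, and that the area to the right of $-a$ equals the area to the left of $a$]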
Example
( )
iii. ( )
Solution
a) ( ) ( )
(2 s.f)
0 1.377
b) i) Using ( ) ( ) ( )
( ) ( )
-1.377 0 (2 s.f)
ii) Using ( ) ( )
( ) ( )
0 1.377
(2 s.f))
iii) Using ( ) ( ) ( )
( ) ( )
-1.377 0 (2 s.f))
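Important results
a) For $0 < a < b$: $P(a < Z < b) = \Phi(b) - \Phi(a)$
Example
$P(0.345 < Z < 1.751) = \Phi(1.751) - \Phi(0.345) = 0.9600 - 0.6350 = 0.3250$
b) For $a > 0$ and $b > 0$: $P(-a < Z < b) = \Phi(b) - \Phi(-a) = \Phi(b) - (1 - \Phi(a)) = \Phi(a) + \Phi(b) - 1$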
Example
( ) ( ) ( )
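c) For $0 < a < b$: $P(-b < Z < -a) = \Phi(-a) - \Phi(-b) = (1 - \Phi(a)) - (1 - \Phi(b)) = \Phi(b) - \Phi(a)$
Example
$P(-1.4 < Z < -0.6) = \Phi(1.4) - \Phi(0.6) = 0.9192 - 0.7257 = 0.1935$
d) $P(|Z| < a) = P(-a < Z < a) = \Phi(a) - \Phi(-a) = 2\Phi(a) - 1$
Example
$P(|Z| < 1.433) = 2\Phi(1.433) - 1 = 2(0.9240) - 1 = 0.848$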
b) ( )
Solution
a) ( ) ( ) ( )
The central 95% of the distribution lies between $-1.96$ and $1.96$, with 2.5% in each tail.
b) ( ) ( ) ( )
The central 99% of the distribution lies between $-2.575$ and $2.575$, with 0.5% in each tail.
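The limits 1.96 and 2.575 can be reproduced without tables. This is a minimal sketch, assuming Python's standard-library `statistics.NormalDist`; the helper name `central_limit` is hypothetical. The computed 99% limit is nearer 2.576, so the 2.575 quoted above reflects table rounding.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

def central_limit(coverage):
    """Return a such that P(-a < Z < a) = coverage.

    By symmetry each tail holds (1 - coverage) / 2, so
    Phi(a) = 0.5 + coverage / 2.
    """
    return Z.inv_cdf(0.5 + coverage / 2)

a95 = central_limit(0.95)
a99 = central_limit(0.99)
print(round(a95, 3), round(a99, 3))
```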
Exercise
Draw sketches to illustrate your answer and consider whether your answer is sensible.
1. If ( ), Find,
a) ( ) b) ( ) c) ( ) d) ( )
e) ( ) f) ( ) g) ( )
h) (| | ) i) (| | )
a) b)
0
1.5
2
-15 0 1.5
-0.674 0 0.674
4) If ( ) and ( ) ( ) , Find
a) ( ) b) ( )
5) If ( ) and ( ) ( ) Find
a) ( ) b) ( )
USING STANDARD NORMAL TABLES FOR ANY NORMAL VARIABLE
Remember that to standardize $X$, where $X \sim N(\mu, \sigma^2)$, use $Z = \dfrac{X - \mu}{\sigma}$.
Solution
$X$ is the length, in centimeters, of a metal strip. Since $\mu = 150$ and $\sigma = 10$, $X \sim N(150, 10^2)$.
a) You need to find the probability that the length is shorter than 165 cm, i.e. $P(X < 165)$.
To be able to use the standard normal tables, standardize the variable $X$ by subtracting the
mean, 150, then dividing by the standard deviation, 10. Apply this to both sides of the
inequality:
$X$ becomes $\dfrac{X - 150}{10} = Z$
165 becomes $\dfrac{165 - 150}{10} = 1.5$
So $P(X < 165)$ becomes $P(Z < 1.5)$
$P(X < 165) = P(Z < 1.5)$
$= \Phi(1.5)$
= 0.9332
= 0.93 (2 s.f.)
The probability that the length is shorter than 165 cm is 0.93.
Note: Although the $X$ and $Z$ distributions have different spreads, in practice it is convenient to
show the values for both distributions on one sketch.
[Figure: normal curve with X-axis values 150 and 165 and the corresponding Z-axis values 0 and 1.5]
b) To find the probability that the length is within 5 cm of the mean, you need to find $P(|X - 150| < 5)$.
Dividing by the standard deviation gives
$P\left(\dfrac{|X - 150|}{10} < 0.5\right)$, i.e. $P(|Z| < 0.5)$
$P(|Z| < 0.5) = 2\Phi(0.5) - 1$
= 0.383
= 0.38 (2 s.f.)
The probability that the length is within 5 cm of the mean is 0.38.
Example
The time taken by the milkman to deliver to the High Street is normally distributed with a mean of 12
minutes and a standard deviation of 2 minutes. He delivers milk every day. Estimate the number of days
during the year when he takes
a) Longer than 17 minutes,
b) Less than 10 minutes,
c) Between 9 and 13 minutes.
Solution
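Let $X$ be the time, in minutes, taken to deliver to the High Street, so $X \sim N(12, 2^2)$.
a) $P(X > 17) = P\left(Z > \dfrac{17 - 12}{2}\right) = P(Z > 2.5)$
$= 1 - \Phi(2.5)$
= 0.0062
Now 365 × 0.0062 = 2.26 ≈ 2
On 2 days in the year he takes longer than 17 minutes.
b) $P(X < 10) = P\left(Z < \dfrac{10 - 12}{2}\right)$
$= P(Z < -1)$
$= 1 - \Phi(1)$
= 1 - 0.8413
= 0.1587
Now 365 × 0.1587 = 57.92 ≈ 58
On 58 days in the year he takes less than ten minutes.
c) $P(9 < X < 13) = P\left(\dfrac{9 - 12}{2} < Z < \dfrac{13 - 12}{2}\right)$
$= P(-1.5 < Z < 0.5)$
$= \Phi(0.5) + \Phi(1.5) - 1$
= 0.6915 + 0.9332 - 1
= 0.6247
Now 365 × 0.6247 = 228.01 ≈ 228
On 228 days in the year he takes between nine and thirteen minutes.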
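The three day counts in the milkman example can be checked with a short script. This is a sketch under the example's own assumptions (mean 12 minutes, standard deviation 2 minutes, 365 delivery days); it uses Python's standard-library `statistics.NormalDist` rather than tables, and the variable names are illustrative.

```python
from statistics import NormalDist

# Delivery time in minutes: mean 12, standard deviation 2 (from the example).
delivery = NormalDist(mu=12, sigma=2)

p_longer_than_17 = 1 - delivery.cdf(17)
p_less_than_10 = delivery.cdf(10)
p_between_9_and_13 = delivery.cdf(13) - delivery.cdf(9)

DAYS_IN_YEAR = 365
for label, p in [
    ("longer than 17 min", p_longer_than_17),
    ("less than 10 min", p_less_than_10),
    ("between 9 and 13 min", p_between_9_and_13),
]:
    # Expected number of days = 365 x probability, rounded to a whole day.
    print(f"{label}: p = {p:.4f}, about {round(DAYS_IN_YEAR * p)} days")
```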
Note:
Since $X$ is a continuous variable, $P(X < a)$ and $P(X \le a)$ are indistinguishable, and likewise $P(X > a)$ and $P(X \ge a)$.
Exercises
Find probabilities using ( )
1) The masses of packages from a particular machine are normally distributed with a mean of 200g and
a standard deviation of 2g. Find the probability that a randomly selected package from the machine
weighs
a) Less than 197 g
b) More than 200.5 g
c) Between 198.5 g and 199.5 g
2) The life of a certain make of electric light bulb is known to be normally distributed with a mean life
of 2000 hours and a standard deviation of 120 Hours. Estimate the probability that the life of such a
bulb will be
a) Greater than 2150 hours
b) Greater than 1910 hours
c) Within the range 1850 hours to 2090 hours.
a) If you are given that $\Phi(z) = 0.9406$, to find $z$ look for 0.9406 in the main body of the table.
This occurs when $z = 1.56$, so if $\Phi(z) = 0.9406$ then
$z = \Phi^{-1}(0.9406) = 1.56$
This means that the value of $z$ such that $P(Z < z) = 0.9406$ is 1.56.
b) Find $z$ if $P(Z < z) = 0.9579$.
So $\Phi(z) = 0.9579$.
Look for 0.9579 in the main body of the table. It does not appear, so look for the number just below it.
This is 0.9573 and it occurs when $z = 1.72$. To get the digits 9579 you need to add 6 to 9573. Look
at the difference columns at the far right-hand side of the table to find 6. It is in column 7.
So $z = 1.727$
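Using the tables in reverse corresponds to evaluating the inverse of $\Phi$. The sketch below is an illustration only, assuming Python's standard-library `statistics.NormalDist`, whose `inv_cdf` method returns the z-value for a given left-hand area. It uses the two probabilities 0.9406 and 0.9579 from the discussion above.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

# z such that P(Z < z) equals the given probability
z_a = Z.inv_cdf(0.9406)
z_b = Z.inv_cdf(0.9579)

print(round(z_a, 2))   # compare with the table reading 1.56
print(round(z_b, 3))   # compare with the table reading 1.727
```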
Example
1) If $Z \sim N(0, 1)$, find the value of $a$ if
a) ( )
b) ( )
c) ( )
d) ( )
Solution
a) ( ) ( ) ( ) =1.87
b) ( ) ( )
( ) ( )
= 0.305
c) ( )
Since the probability is greater than 0.5, $a$ must be negative, and therefore $-a$ is positive. Using
symmetry, ( )
- ( )
d) ( )
Since the probability is less than 0.5, $a$ must be negative. Using symmetry,
( )
- = ( )
= - 1.41
Example
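If $Z \sim N(0, 1)$, find $a$ such that $P(|Z| < a) = 0.9$.
Solution
$P(|Z| < a) = P(-a < Z < a) = 0.9$
From symmetry, $2\Phi(a) - 1 = 0.9$
$2\Phi(a) = 1.9$, so $\Phi(a) = 0.95$ and $a = \Phi^{-1}(0.95)$
= 1.645
This means that the central 90% of the standard normal distribution lies between $-1.645$ and $1.645$.
Alternatively, if $P(-a < Z < a) = 0.9$,
then the value of $a$ corresponds to an upper tail probability of 0.05 and a lower tail probability of 0.95.
Therefore $\Phi(a) = 0.95$, so $a = \Phi^{-1}(0.95)$
= 1.645
[Figure: standard normal curve with 0.05 in each tail, beyond $-a$ and $a$]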
Example
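The heights of female students at a particular college are normally distributed with a mean of 169 cm
and a standard deviation of 9 cm.
a) Given that 80% of these female students have a height less than $h$ cm, find the value of $h$.
b) Given that 60% of these female students have a height greater than $s$ cm, find the value of $s$.
Solution
$H$ is the height, in centimeters, of a female student. $H \sim N(169, 9^2)$
a) Given $P(H < h) = 0.8$
Standardizing, $P\left(Z < \dfrac{h - 169}{9}\right) = 0.8$. Let $z = \dfrac{h - 169}{9}$.
$\Phi(z) = 0.8$, so $z = \Phi^{-1}(0.8) = 0.842$
Therefore $h = 169 + 9 \times 0.842 = 176.58$
= 176.6 (1 d.p.)
b) Given $P(H > s) = 0.6$
Standardizing,
$P\left(Z > \dfrac{s - 169}{9}\right) = 0.6$
Let $z = \dfrac{s - 169}{9}$.
Since the probability is greater than 0.5, $z$ must be negative and $\Phi(-z) = 0.6$
$-z = \Phi^{-1}(0.6) = 0.253$, so $z = -0.253$
Therefore $s = 169 - 9 \times 0.253$
= 166.723
= 166.7 (1 d.p.)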
Example
1. The marks of 500 candidates in an examination are normally distributed with a mean of 45 marks
and a standard deviation of 20 marks.
a) Given that the pass mark is 41, estimate the number of candidates who passed the
examination.
b) If 5% of the candidates obtain a distinction by scoring $x$ marks or more, estimate the value of $x$.
c) Estimate the interquartile range of the distribution.
Solution
Let $X$ be the mark obtained by a candidate. $X \sim N(45, 20^2)$
a) $P(X > 41) = P\left(Z > \dfrac{41 - 45}{20}\right) = P(Z > -0.2)$
$= \Phi(0.2)$
Therefore $P(X > 41) = 0.5793$
Since there are 500 candidates, to find the number of candidates who pass, multiply the probability by
500.
500 × 0.5793 = 289.65
Therefore 290 candidates passed.
b) $P(X > x) = 0.05$
Writing $z$ for the standardized value of $x$,
$P(Z > z) = 0.05$ where $z = \dfrac{x - 45}{20}$, so $\Phi(z) = 0.95$
$z = \Phi^{-1}(0.95) = 1.645$
Therefore, $x = 45 + 20 \times 1.645 = 77.9$
A distinction is awarded for a mark of 78 or more.
c) The interquartile range encloses the central 50% of the distribution between the lower quartile
$q_1$ and upper quartile $q_3$, with 25% of the distribution in each tail.
If $z$ is the standardized value of the upper quartile, then $z$ corresponds to an upper tail probability of 25%. So $\Phi(z) = 0.75$
$z = \Phi^{-1}(0.75)$
= 0.674
Now $z$ is the standardized value of the upper quartile, so $q_3$ is such that $\dfrac{q_3 - 45}{20} = 0.674$
$q_3$ = 58.48.
The lower quartile, $q_1$, is such that $\dfrac{q_1 - 45}{20} = -0.674$
$q_1$ = 31.52
Interquartile range = 58.48 - 31.52 = 26.96
= 27 (2 s.f.)
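The three parts of the examination example (a probability, a cut-off mark, and the interquartile range) can be sketched in a few lines. This assumes Python's standard-library `statistics.NormalDist` as a stand-in for the tables; the variable names are illustrative.

```python
from statistics import NormalDist

marks = NormalDist(mu=45, sigma=20)   # X ~ N(45, 20^2), from the example
CANDIDATES = 500

# a) Number passing with a pass mark of 41
p_pass = 1 - marks.cdf(41)
passed = round(CANDIDATES * p_pass)

# b) Mark exceeded by the top 5% of candidates
distinction_mark = marks.inv_cdf(0.95)

# c) Interquartile range: upper quartile minus lower quartile
iqr = marks.inv_cdf(0.75) - marks.inv_cdf(0.25)

print(passed, round(distinction_mark, 1), round(iqr, 1))
```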
Exercise
1. If ( ), find the upper quartile and the lower quartile of the distribution. Find also the fourth
percentile
2. The masses of cos lettuces sold at a market are normally distributed with mean mass 600g and
standard deviation 20g.
a) If a lettuce is chosen at random, find the probability that its mass lies between 570g and 610g.
b) Find the mass exceeded by 7% of the lettuces
c) In one day, 1000 lettuces are sold
Estimate how many weigh less than 545g.
3. If ( )
a) Find the limits within which the central 95% of the distribution lies
b) Find the interquartile range of the distribution
Example
The lengths of certain items follow a normal distribution with mean $\mu$ cm and standard deviation 6 cm. It
is known that 4.78% of the items have length greater than 82 cm. Find the value of the mean $\mu$.
Solution
( ) ( ) 0.8849
Therefore,
X 100 106
Y 0 z
Example
The masses of boxes of oranges are normally distributed such that 30% of them are greater than 4.00 kg
and 20% are greater than 4.53kg. Estimate the mean and standard deviation of the masses.
Solution
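Let $X$ be the mass, in kg, of a box of oranges, where $X \sim N(\mu, \sigma^2)$.
$P(X > 4.00) = 0.30$, so $P(Z > z_1) = 0.30$ where $z_1 = \dfrac{4.00 - \mu}{\sigma}$
$\Phi(z_1) = 0.70$, so $z_1 = 0.524$
Therefore, $4.00 - \mu = 0.524\sigma$ ..... (1)
Also, $P(X > 4.53) = 0.20$, so $P(Z > z_2) = 0.20$ where $z_2 = \dfrac{4.53 - \mu}{\sigma}$
$\Phi(z_2) = 0.80$, so $z_2 = 0.842$
Therefore $4.53 - \mu = 0.842\sigma$ ..... (2)
Subtracting (1) from (2): $0.53 = 0.318\sigma$, so $\sigma = 1.67$ (3 s.f.)
Substituting in (1): $\mu = 4.00 - 0.524 \times 1.667 = 3.13$ (3 s.f.)
Therefore the mean mass is approximately 3.13 kg and the standard deviation is approximately 1.67 kg.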
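Two tail probabilities give two linear equations in $\mu$ and $\sigma$, which can be solved directly. The sketch below is an illustration for the boxes-of-oranges example, assuming Python's standard-library `statistics.NormalDist`; the function name `fit_from_two_tails` is hypothetical. Because it uses unrounded z-values, the last figure may differ slightly from a hand calculation with table values.

```python
from statistics import NormalDist

Z = NormalDist()

def fit_from_two_tails(x1, upper_tail1, x2, upper_tail2):
    """Solve (x1 - mu)/sigma = z1 and (x2 - mu)/sigma = z2 for mu and sigma.

    upper_tail1 and upper_tail2 are P(X > x1) and P(X > x2).
    """
    z1 = Z.inv_cdf(1 - upper_tail1)
    z2 = Z.inv_cdf(1 - upper_tail2)
    sigma = (x2 - x1) / (z2 - z1)   # subtract the two equations
    mu = x1 - z1 * sigma            # substitute back into the first
    return mu, sigma

# 30% of boxes exceed 4.00 kg and 20% exceed 4.53 kg.
mu, sigma = fit_from_two_tails(4.00, 0.30, 4.53, 0.20)
print(round(mu, 2), round(sigma, 2))
```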
Exercise
1. The speeds of cars passing a certain point on a motorway can be taken to be normally distributed.
Observations show that, of cars passing the point, 95% are travelling at less than 85 m.p.h. and 10%
are travelling at less than 55 m.p.h.
a) Find the average speed of the cars passing the point.
b) Find the proportion of cars that travel at more than 70 m.p.h
2. On a particular day, 50% of the employees in a large company had arrived at work by 8.30 am, and
10% had not arrived by 8.55 am.
a) Assuming a normal model, find the standard deviation of the arrival times, in minutes.
b) It is given that only 5% of the employees had arrived by 8.05 am. Without further calculation,
explain why this might suggest that a normal model is not appropriate.
c) Eighty employees are selected at random. Find the expectation of the number of these
employees that arrived between 8.30 am and 8.55 am.
[Figure: a) binomial probability distributions with p = 0.5; b) binomial probability distributions with p = 0.2, each drawn for increasing values of n]
Notice
When $p = 0.5$ the distributions are symmetric and, for larger values of $n$, the distribution takes the
characteristic normal shape.
When $p = 0.2$, the distribution is positively skewed for small values of $n$, but when $n = 20$ the distribution
is almost symmetrical and bell-shaped.
For the discrete random variable $X$, distributed binomially where $X \sim B(n, p)$ and $n$ and $p$ are such that
$np > 5$ and $nq > 5$, where $q = 1 - p$, then $X \sim N(np, npq)$ approximately.
CONTINUITY CORRECTION
The following example compares probabilities obtained using a binomial distribution and a normal
approximation. It also illustrates the use of a continuity correction, needed when using a continuous
distribution (the normal) as an approximation for a discrete distribution (the binomial).
Examples
Find the probability of obtaining 4, 5, 6 or 7 heads when a fair coin is tossed 12 times.
a) Using the binomial distribution
b) Using a normal approximation to the binomial distribution.
Solution
b) The diagram below shows the probability distribution for $X \sim B(12, 0.5)$. Note that the vertical lines
have been replaced by rectangles to help illustrate the intention to use a continuous distribution as
an approximation for a discrete one. The required binomial probability is represented by the sum of
the areas of the shaded rectangles.
[Figure: probability distribution of the number of heads, 0 to 12, with the rectangles for 4, 5, 6 and 7 heads shaded between 3.5 and 7.5]
First check the conditions for a normal approximation: $np = 6 > 5$ and $nq = 6 > 5$, with $npq = 3$, so
$X \sim N(6, 3)$ approximately.
$P(4 \le X \le 7) \to P(3.5 < X < 7.5)$ (continuity correction)
$= P\left(\dfrac{3.5 - 6}{\sqrt{3}} < Z < \dfrac{7.5 - 6}{\sqrt{3}}\right)$
$= P(-1.443 < Z < 0.866)$
$= \Phi(0.866) + \Phi(1.443) - 1$
$= 0.8068 + 0.9255 - 1 = 0.732$ (3 d.p.)
Note that the probabilities found by the two different methods compare well and the working for part
(b) is quicker to perform. The approximation is good because, although $n$ is not very large, $p = 0.5$.
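The comparison between the exact binomial probability and its normal approximation can be reproduced as follows. This is a sketch for the coin example (4, 5, 6 or 7 heads in 12 tosses), assuming Python's standard-library `math.comb` and `statistics.NormalDist`; the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 12, 0.5        # 12 tosses of a fair coin
q = 1 - p

# a) Exact binomial probability of 4, 5, 6 or 7 heads
exact = sum(comb(n, k) * p**k * q**(n - k) for k in range(4, 8))

# b) Normal approximation N(np, npq) with a continuity correction:
#    the integers 4..7 become the interval 3.5 < X < 7.5
approx_dist = NormalDist(mu=n * p, sigma=sqrt(n * p * q))
approx = approx_dist.cdf(7.5) - approx_dist.cdf(3.5)

print(round(exact, 4), round(approx, 4))
```

Without the continuity correction (using 4 and 7 as the limits) the approximation would leave out half a rectangle at each end and underestimate the probability.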
Example
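It is given that 40% of the population supports the Gamboge party. One hundred and fifty members of
the population are selected at random. Use a suitable approximation to find the probability that more
than 55 out of the 150 support the Gamboge party.
Solution
Let $X$ be the number in 150 who support the Gamboge party.
So $X \sim B(150, 0.4)$ with $np = 60$ and $nq = 90$. Check: $np > 5$ and $nq > 5$, and $npq = 36$,
so $X \sim N(60, 36)$ approximately.
$P(X > 55) \to P(X > 55.5)$ (continuity correction)
$= P\left(Z > \dfrac{55.5 - 60}{6}\right) = P(Z > -0.75) = \Phi(0.75) = 0.7734$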
For $X \sim B(n, p)$:
A Poisson approximation can be used when $n$ is large ($n > 50$) and $p$ is small ($p < 0.1$).
Then $X \sim Po(np)$ approximately.
A normal approximation can be used when $n$ and $p$ are such that $np > 5$ and $nq > 5$.
Then $X \sim N(np, npq)$ approximately.
Example
A number of different types of fungi are distributed at random in a field. 80 % of these fungi are
mushrooms, and the remainder is toadstools. 5 % of the toadstools are poisonous. A man, who cannot
distinguish between mushrooms and toadstools, wanders across the field and picks a total of 100 fungi.
Determine; correct to two significant figures, using appropriate approximations, the probability that the
man has picked.
a) At least 20 toadstools.
b) Exactly two poisonous toadstools.
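Solution
a) Let $X$ be the number of toadstools in the 100 fungi picked. $X \sim B(100, 0.2)$ with $np = 20$ and $nq = 80$, so
$X \sim N(20, 16)$ approximately.
$P(X \ge 20) \to P(X > 19.5)$ (continuity correction)
$= P\left(Z > \dfrac{19.5 - 20}{4}\right) = P(Z > -0.125) = \Phi(0.125) = 0.55$ (2 s.f.)
b) Let $Y$ be the number of poisonous toadstools picked. The probability that a fungus is a poisonous
toadstool is $0.2 \times 0.05 = 0.01$, so $Y \sim B(100, 0.01)$. Since $n$ is large and $p$ is small, $Y \sim Po(1)$ approximately.
$P(Y = 2) = \dfrac{e^{-1}}{2!} = 0.18$ (2 s.f.)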
When $\lambda$ is large, the normal distribution can be used as an approximation to the Poisson distribution:
if $X \sim Po(\lambda)$, then $X \sim N(\lambda, \lambda)$ approximately. As with the normal approximation to
the binomial distribution, a continuity correction is needed, since you are using a continuous
distribution as an approximation to a discrete one.
Example
A radioactive disintegration gives counts that follow a Poisson distribution with a mean count of 25 per
second. Find the probability that in one second interval the count is between 23 and 27 inclusive.
Solution
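Let $X$ be the radioactive count in a one-second interval. $X \sim Po(25)$
Since $\lambda$ is large, $X \sim N(25, 25)$ approximately.
$P(23 \le X \le 27) \to P(22.5 < X < 27.5)$ (continuity correction)
$= P\left(\dfrac{22.5 - 25}{5} < Z < \dfrac{27.5 - 25}{5}\right)$
$= P(-0.5 < Z < 0.5)$
$= 2\Phi(0.5) - 1$
$= 2(0.6915) - 1$
= 0.383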
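To see how close the normal approximation is for the radioactive count example, the exact Poisson probability can be summed term by term and compared. This is a sketch assuming only Python's standard library; `poisson_pmf` is a hypothetical helper written out from the Poisson formula.

```python
from math import exp, factorial, sqrt
from statistics import NormalDist

lam = 25  # mean count per second, from the example

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Po(lam)."""
    return exp(-lam) * lam**k / factorial(k)

# Exact probability that the count is between 23 and 27 inclusive
exact = sum(poisson_pmf(k, lam) for k in range(23, 28))

# Normal approximation N(lam, lam) with continuity correction 22.5 < X < 27.5
approx_dist = NormalDist(mu=lam, sigma=sqrt(lam))
approx = approx_dist.cdf(27.5) - approx_dist.cdf(22.5)

print(round(exact, 4), round(approx, 4))
```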
Example
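A product is sold in packets whose masses are normally distributed with a mean of 1.42 kg and a standard
deviation of 0.025 kg.
a) Find the probability that the mass of a packet, selected at random, lies between 1.37 kg and
1.45 kg.
b) Estimate the number of packets, in an output of 5000, whose mass is less than 1.35 kg.
Solution
Let $X$ be the mass, in kg, of a packet. $X \sim N(1.42, 0.025^2)$
a) $P(1.37 < X < 1.45) = P\left(\dfrac{1.37 - 1.42}{0.025} < Z < \dfrac{1.45 - 1.42}{0.025}\right)$
$= P(-2 < Z < 1.2)$
$= \Phi(1.2) + \Phi(2) - 1$
$= 0.8849 + 0.9772 - 1 = 0.8621$
The probability that the mass lies between 1.37 kg and 1.45 kg is 0.86 (2 s.f.).
b) $P(X < 1.35) = P\left(Z < \dfrac{1.35 - 1.42}{0.025}\right)$
$= P(Z < -2.8)$
$= 1 - \Phi(2.8) = 1 - 0.9974 = 0.0026$
Since there are 5000 packets, multiply the probability by 5000. 5000 × 0.0026 = 13; 13 packets have a
mass less than 1.35 kg.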
Example
Machine A, used for filling bags with ground coffee, can be set to dispense any required mean weight of
coffee per bag. At any setting the weight of coffee in a bag can be modeled by a normal distribution
with a standard deviation of 1.95g.
a) If the machine is set to dispense a mean weight of 128g of coffee per bag, calculate the
percentage of bags that contain less than 125g.
b) To meet an official regulation the setting on a machine must be adjusted so that no more
than 1% of bags contain less than 125g.
i) Calculate the smallest mean weight to which machine A should be set to meet the
regulation.
ii) Machine B will only just meet the regulation when it is set to dispense a mean
weight of 128.5g. Assume that the weight of coffee in a bag filled by machine B
can be modeled by a normal distribution. Calculate the standard deviation of this
distribution.
Solution
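Let $X$ be the weight, in grams, of coffee in a bag from machine A. $X \sim N(\mu, 1.95^2)$
a) With $\mu = 128$, $X \sim N(128, 1.95^2)$
$P(X < 125) = P\left(Z < \dfrac{125 - 128}{1.95}\right)$
$= P(Z < -1.538)$
$= 1 - \Phi(1.538)$
$= 1 - 0.9380 = 0.062$
6.2% of bags contain less than 125g.
b) i) $X \sim N(\mu, 1.95^2)$ and $P(X < 125) = 0.01$
Standardizing, you need to find $\mu$ such that $P\left(Z < \dfrac{125 - \mu}{1.95}\right) = 0.01$
i.e. $\dfrac{125 - \mu}{1.95} = -\Phi^{-1}(0.99) = -2.326$
Therefore $\mu = 125 + 2.326 \times 1.95 = 129.5$ (1 d.p.)
The smallest mean weight to which machine A should be set is 129.5g.
ii) Let $Y$ be the weight, in grams, of coffee in a bag from machine B. $Y \sim N(128.5, \sigma^2)$ and $P(Y < 125) = 0.01$
From part (i), $\dfrac{125 - 128.5}{\sigma} = -2.326$
Therefore $\sigma = \dfrac{3.5}{2.326}$
= 1.504...
The standard deviation is 1.5g (1 d.p.)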
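Each part of the coffee-machine example fixes two of the three quantities (mean, standard deviation, tail probability) and asks for the third. The sketch below, assuming Python's standard-library `statistics.NormalDist`, works through all three; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()
LIMIT = 125.0          # regulation weight in grams
SIGMA_A = 1.95         # standard deviation for machine A

# a) Percentage of bags under 125 g when machine A is set to 128 g
percent_under = 100 * NormalDist(128, SIGMA_A).cdf(LIMIT)

# b) i) Smallest mean so that at most 1% of bags are under 125 g:
#       (125 - mu) / sigma = z, where z is the 1% point of Z (negative)
z_1pc = Z.inv_cdf(0.01)
min_mean_a = LIMIT - z_1pc * SIGMA_A

# b) ii) Machine B just meets the regulation at mean 128.5 g:
#        (125 - 128.5) / sigma = z, solved for sigma
sigma_b = (LIMIT - 128.5) / z_1pc

print(round(percent_under, 1), round(min_mean_a, 1), round(sigma_b, 1))
```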
Exercise
1. It is estimated that, on average, one match in five in the football league is drawn, and that one
match in two is a home win.
a) Twelve matches are selected at random, calculate the probability that the number of drawn
matches is
i) Exactly three
ii) At least four.
b) Ninety matches are selected at random. Use a suitable approximation to calculate the
probability that between 13 and 20 (inclusive) of the matches are drawn.
c) Twenty matches are selected at random. The random variables D and H are the numbers of
drawn matches and home wins, respectively, in these matches. State, with a reason, which
of D and H can be better approximated by a normal variable.
2. The mass of grapes sold per day in a supermarket can be modeled by a normal distribution; it is
found that, over a long period, the mean mass sold per day is 35.0kg and that on average, less than
15.0kg are sold on one day in twenty.
a) Show that the standard deviation of the mass of grapes sold per day is 12.2kg, correct to
three significant figures.
b) Calculate the probability that, on a day chosen at random, more than 53.0kg are sold.
c) Ten days are chosen at random. Assuming independence find the probability that less than
15.0kg will be sold on exactly two of these days.
3. Consultants employed by a large library reported that the time spent in the library by a user could
be modeled by a normal distribution with mean 65 minutes and standard deviation 20 minutes
a) Assuming that this model is adequate, what is the probability that a user spends
i) Less than 90 minutes in the library
ii) Between 60 and 90 minutes in the library
The library closes at 9.00 pm
b) Explain why the model above could not apply to a user who entered the library at 8.00p.m
c) Estimate an approximate latest time of entry for which the model above could still be
plausible.
If $X$ and $Y$ are any two random variables, discrete or continuous, and $a$ and $b$ are any two constants,
Sums: $E(X + Y) = E(X) + E(Y)$ ..... (1)
Differences: $E(X - Y) = E(X) - E(Y)$ ..... (2)
$E(aX + bY) = aE(X) + bE(Y)$ ..... (3)
$E(aX - bY) = aE(X) - bE(Y)$ ..... (4)
Also, if $X$ and $Y$ are independent, then
$Var(X + Y) = Var(X) + Var(Y)$ ..... (5)
$Var(X - Y) = Var(X) + Var(Y)$ ..... (6)
$Var(aX + bY) = a^2 Var(X) + b^2 Var(Y)$ ..... (7)
$Var(aX - bY) = a^2 Var(X) + b^2 Var(Y)$ ..... (8)
THE SUM OF INDEPENDENT NORMAL VARIABLE
Example
A coffee machine is installed in a students’ common room. It dispenses white coffee by first releasing a
quantity of black coffee, normally distributed with mean 122.5ml and standard deviation 7.5ml, and
then adding a quantity of milk, normally distributed with mean 30ml and standard deviation 5ml.
Each cup is marked to a level of 137.5ml and if this level is not attained the customer receives the drink
free of charge.
Solution
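Let $B$ be the amount, in milliliters, of black coffee, where $B \sim N(122.5, 7.5^2)$. Let $M$ be the amount, in
milliliters, of milk, where $M \sim N(30, 5^2)$. $B$ and $M$ are independent normal variables.
Consider $W$, the amount, in milliliters, of white coffee made by combining the black coffee and milk, so
$W = B + M$ and
$E(W) = E(B) + E(M) = 122.5 + 30 = 152.5$
$Var(W) = Var(B) + Var(M) = 56.25 + 25 = 81.25$
For independent normal variables, it is true that the sum of these variables is also normally distributed,
so $W \sim N(152.5, 81.25)$
$P(W < 137.5) = P\left(Z < \dfrac{137.5 - 152.5}{\sqrt{81.25}}\right)$
$= P(Z < -1.664)$
$= 1 - \Phi(1.664)$
= 0.048
So approximately 5% of the cups of white coffee will be given free of charge.
In general,
if $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ are independent, then $X + Y \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
This result can be extended to any set of independent normal variables $X_1, X_2, \ldots, X_n$ where, with
obvious notation,
$X_1 + X_2 + \cdots + X_n \sim N(\mu_1 + \mu_2 + \cdots + \mu_n, \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_n^2)$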
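The rule for a sum of independent normal variables (add the means, add the variances) can be sketched for the coffee-machine example as follows. This assumes Python's standard-library `statistics.NormalDist`; the helper `sum_of_independent_normals` is a hypothetical name, not a library function.

```python
from math import sqrt
from statistics import NormalDist

def sum_of_independent_normals(*parts):
    """Combine independent normals given as (mean, standard deviation) pairs.

    Means add and variances add; standard deviations do NOT add.
    """
    mean = sum(mu for mu, _ in parts)
    variance = sum(sd**2 for _, sd in parts)
    return NormalDist(mean, sqrt(variance))

black = (122.5, 7.5)   # black coffee in ml: mean, standard deviation
milk = (30.0, 5.0)     # milk in ml: mean, standard deviation

white = sum_of_independent_normals(black, milk)
z = (137.5 - white.mean) / white.stdev
p_free = white.cdf(137.5)          # cup filled below the 137.5 ml mark

print(white.mean, round(white.variance, 2), round(z, 3), round(p_free, 3))
```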
Example
Four runners, Andy, Bob, Chris and Dai, train to take part in a 1600m relay race in which Andy is to run
100m, Bob 200m, Chris 500m and Dai 800m. During training their individual times, recorded in seconds,
follow normal distributions. With obvious notation these are:
( ) ( ) ( ) ( )
Find the probability that they run the relay race in less than 3 minutes 35 seconds.
Solution
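Let $T$ be the total time, in seconds, for the relay race. Then $T = A + B + C + D$ and
$E(T) = E(A) + E(B) + E(C) + E(D)$
= 10.8 + 23.7 + 62.8 + 121.2
= 218.5
Therefore, $T \sim N(218.5, \sigma_T^2)$, where $\sigma_T^2$ is the sum of the four individual variances.
To find the probability that the total time is less than 3 minutes 35 seconds, i.e. 215 seconds, find
$P(T < 215) = P\left(Z < \dfrac{215 - 218.5}{\sigma_T}\right)$
$= P(Z < -1.513)$
$= 1 - \Phi(1.513)$
= 0.0651
The probability that the runners take less than 3 minutes 35 seconds is 0.065 (2 s.f.)
Exercise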
1) The masses of a particular article are normally distributed with mean 20g and standard deviation 2g.
A random sample of 12 such articles is chosen. Find the probability that the total mass is greater
than 230g.
2) The maximum load a lift can carry is 450 kg. The weights of men are normally distributed with
mean 60kg and standard deviation 10kg. The weights of women are normally distributed with mean
55kg and standard deviation 5kg. Find the probability that the lift will be overloaded by five men and
two women, if their weights are independent.
THE DIFFERENCE OF INDEPENDENT NORMAL VARIABLES
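If $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ are independent, then
$E(X - Y) = E(X) - E(Y)$ and $Var(X - Y) = Var(X) + Var(Y)$
$X - Y$ is normally distributed, so
$X - Y \sim N(\mu_1 - \mu_2, \sigma_1^2 + \sigma_2^2)$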
Example
A machine produces rubber balls whose diameters are normally distributed with mean 5.50cm and standard
deviation 0.08cm.
The balls are packed in cylindrical tubes whose internal diameters are normally distributed with mean
5.70cm and standard deviation 0.12cm.
If a ball, selected at random, is placed in a tube, selected at random, what is the distribution of the
clearance? (The clearance is the internal diameter of the tube minus the diameter of the ball) what is
the probability that the clearance is between 0.05cm and 0.25cm?
Solution
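Let $B$ be the diameter, in cm, of a ball and $T$ the internal diameter, in cm, of a tube, so
$B \sim N(5.50, 0.08^2)$ and $T \sim N(5.70, 0.12^2)$. The clearance is $C = T - B$.
$E(C) = 5.70 - 5.50 = 0.20$ and $Var(C) = 0.12^2 + 0.08^2 = 0.0208$
So, $C \sim N(0.20, 0.0208)$
To find the probability that the clearance is between 0.05cm and 0.25cm, find
$P(0.05 < C < 0.25) = P\left(\dfrac{0.05 - 0.20}{\sqrt{0.0208}} < Z < \dfrac{0.25 - 0.20}{\sqrt{0.0208}}\right)$
$= P(-1.040 < Z < 0.347)$
$= \Phi(0.347) + \Phi(1.040) - 1$
$= 0.6357 + 0.8508 - 1 = 0.49$ (2 s.f.)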
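For a difference of independent normal variables the means subtract but the variances still add. The sketch below illustrates this for the ball-and-tube clearance example, assuming Python's standard-library `statistics.NormalDist`; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

ball_mean, ball_sd = 5.50, 0.08   # ball diameter in cm
tube_mean, tube_sd = 5.70, 0.12   # tube internal diameter in cm

# Clearance = tube diameter - ball diameter (independent variables)
clearance_mean = tube_mean - ball_mean
clearance_var = tube_sd**2 + ball_sd**2      # variances ADD for a difference
clearance = NormalDist(clearance_mean, sqrt(clearance_var))

p = clearance.cdf(0.25) - clearance.cdf(0.05)
print(round(clearance_mean, 2), round(clearance_var, 4), round(p, 2))
```

Subtracting the variances instead would give a spread smaller than that of the tube alone, which cannot be right when a second source of variation has been introduced.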
Exercise
1. A certain liquid drug is marketed in bottles containing a nominal 20ml of drug. Tests on a large
number of bottles indicate that the volume of liquid in each bottle is distributed normally with
mean 20.42ml and standard deviation 0.429ml. If the capacity of the bottles is normally distributed
with mean of 21.77ml and standard deviation 0.210ml, estimate what percentage of bottles will
overflow during filling.
2. In a cafeteria, baked beans are served either in ordinary portions or in children’s portions. The
quantity given for an ordinary portion is a normal variable with mean 90g and standard deviation 3g and the quantity given for a child’s portion
is a normal variable with mean 43g and standard deviation 2g. What is the probability that Tom,
who has two children’s portions, is given more than his father, who has an ordinary portion?
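For any random variable $X$ and constant $a$: $E(aX) = aE(X)$ and $Var(aX) = a^2 Var(X)$
For independent random variables $X$ and $Y$ and constants $a$ and $b$:
$E(aX + bY) = aE(X) + bE(Y)$
$Var(aX + bY) = a^2 Var(X) + b^2 Var(Y)$
$E(aX - bY) = aE(X) - bE(Y)$
$Var(aX - bY) = a^2 Var(X) + b^2 Var(Y)$
If, in addition, $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$, then
$aX + bY \sim N(a\mu_1 + b\mu_2, a^2\sigma_1^2 + b^2\sigma_2^2)$
$aX - bY \sim N(a\mu_1 - b\mu_2, a^2\sigma_1^2 + b^2\sigma_2^2)$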
Example
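Let $X$ and $Y$ be independent normal random variables, ( ) ( ). Find the probability that
an observation from the population of $X$ is more than twice the value of an observation from the
population of $Y$.
Solution
Let $D = X - 2Y$. An observation from $X$ is more than twice an observation from $Y$ when $D > 0$.
From the sketch, $E(D) = -10$ and $D = 0$ standardizes to $z = 1.443$, so
$P(D > 0) = P(Z > 1.443)$
$= 1 - \Phi(1.443) = 0.0745$
The probability that an observation from the population of $X$ is more than twice the value of an
observation from the population of $Y$ is 0.075 (2 s.f.)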
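Sum: if $X_1$ and $X_2$ are independent observations of $X \sim N(\mu, \sigma^2)$, then $X_1 + X_2 \sim N(2\mu, 2\sigma^2)$
Multiple: $2X \sim N(2\mu, 4\sigma^2)$
Note that the means are the same but the variances are not.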
Example
A soft drinks manufacturer sells bottles of drinks in two sizes. The amount in each bottle, in
milliliters, is normally distributed as shown in the table:
        Mean (ml)   Variance (ml²)
Small   252         4
Large   1012        25
a) A bottle of each size is selected at random. Find the probability that the large bottle contains
less than four times the amount in the small bottle.
b) One large and four small bottles are selected at random. Find the probability that the amount in
the large bottle is less than the total amount in the four small bottles.
Solution
Let $S$ be the amount, in milliliters, in a small bottle and $L$ the amount, in milliliters, in a large bottle.
$S \sim N(252, 4)$ and $L \sim N(1012, 25)$
a) To find the probability that the large bottle contains less than four times the amount in
the small bottle you need $P(L < 4S) = P(L - 4S < 0)$.
$E(L - 4S) = E(L) - 4E(S) = 1012 - 1008 = 4$
$Var(L - 4S) = Var(L) + 16Var(S) = 25 + 64 = 89$
So $L - 4S \sim N(4, 89)$
$P(L - 4S < 0) = P\left(Z < \dfrac{0 - 4}{\sqrt{89}}\right)$
$= P(Z < -0.424)$
$= 1 - \Phi(0.424) = 0.3358$
The probability that a large bottle contains less than four times the amount in a small bottle is 0.34 (2 s.f.)
b) To find the probability that the amount in a large bottle is less than the total amount in four
small bottles you need $P(L < S_1 + S_2 + S_3 + S_4) = P(L - (S_1 + S_2 + S_3 + S_4) < 0)$.
$E(L - (S_1 + \cdots + S_4)) = E(L) - 4E(S) = 1012 - 1008 = 4$
$Var(L - (S_1 + \cdots + S_4)) = Var(L) + 4Var(S) = 25 + 16 = 41$
Therefore $L - (S_1 + \cdots + S_4) \sim N(4, 41)$
$P(L - (S_1 + \cdots + S_4) < 0) = P\left(Z < \dfrac{0 - 4}{\sqrt{41}}\right)$
$= P(Z < -0.625)$
$= 1 - \Phi(0.625) = 0.2660$
The probability that a large bottle contains less than the total amount in four small bottles is 0.27 (2 s.f.)
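The soft-drinks example turns on the difference between a multiple (4S, one bottle counted four times) and a sum (four separate bottles). The sketch below contrasts the two variances, assuming Python's standard-library `statistics.NormalDist`; the helper `prob_below_zero` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist

SMALL_MEAN, SMALL_VAR = 252, 4      # small bottle: mean (ml), variance (ml^2)
LARGE_MEAN, LARGE_VAR = 1012, 25    # large bottle: mean (ml), variance (ml^2)

def prob_below_zero(mean, variance):
    """P(D < 0) for D ~ N(mean, variance)."""
    return NormalDist(mean, sqrt(variance)).cdf(0)

# a) Multiple: L - 4S. Var(4S) = 4^2 * Var(S) = 16 * Var(S)
mean_a = LARGE_MEAN - 4 * SMALL_MEAN
var_a = LARGE_VAR + 4**2 * SMALL_VAR
p_a = prob_below_zero(mean_a, var_a)

# b) Sum: L - (S1 + S2 + S3 + S4). Var(S1 + ... + S4) = 4 * Var(S)
mean_b = LARGE_MEAN - 4 * SMALL_MEAN
var_b = LARGE_VAR + 4 * SMALL_VAR
p_b = prob_below_zero(mean_b, var_b)

print(var_a, round(p_a, 2))
print(var_b, round(p_b, 2))
```

The means agree in both cases; only the variances (89 against 41) differ, and that alone changes the answer.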
Example
The lifetimes of Econ light bulbs are normally distributed with mean 1000h and standard deviation 25h.
a) Find, to three decimal places, the probability that an Econ light bulb will have a lifetime between
975h and 1020h.
b) Calculate, to three decimal places, the probability that the sum of the lifetimes of eight Econ
light bulbs will exceed 7930h. Indicate clearly the stage in your calculation when an assumption
concerning independence is essential.
The lifetimes of Energy saver light bulbs are normally distributed with mean 7900h and standard
deviation 50h.
c) Calculate, to three decimal places, the probability that an Energy saver light bulb will last at least
eight times as long as an Econ light bulb.
Solution
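Let $X$ be the lifetime, in hours, of an Econ light bulb. Then $X \sim N(1000, 25^2)$
a) $P(975 < X < 1020) = P\left(\dfrac{975 - 1000}{25} < Z < \dfrac{1020 - 1000}{25}\right)$
$= P(-1 < Z < 0.8)$
$= \Phi(0.8) + \Phi(1) - 1$
$= 0.7881 + 0.8413 - 1 = 0.629$ (3 d.p.)
b) Let $S = X_1 + X_2 + \cdots + X_8$. Then $E(S) = 8 \times 1000 = 8000$ and, assuming the eight lifetimes are
independent, $Var(S) = 8 \times 25^2 = 5000$, so $S \sim N(8000, 5000)$
$P(S > 7930) = P\left(Z > \dfrac{7930 - 8000}{\sqrt{5000}}\right)$
$= P(Z > -0.990)$
$= \Phi(0.990) = 0.839$
The probability that the sum of the lifetimes of eight Econ light bulbs exceeds 7930h is 0.839 (3 d.p.)
c) Let $Y$ be the lifetime, in hours, of an Energy saver light bulb, where $Y \sim N(7900, 50^2)$
$P(Y > 8X)$ is needed, i.e. $P(Y - 8X > 0)$
$E(Y - 8X) = E(Y) - 8E(X) = 7900 - 8000 = -100$
$Var(Y - 8X) = Var(Y) + 64Var(X) = 2500 + 40000 = 42500$
(Assuming independence)
$Y - 8X \sim N(-100, 42500)$
$P(Y - 8X > 0) = P\left(Z > \dfrac{0 - (-100)}{\sqrt{42500}}\right)$
$= P(Z > 0.485)$
$= 1 - \Phi(0.485) = 0.314$
The probability that an Energy saver light bulb lasts at least eight times as long as an Econ light bulb
is 0.314 (3 d.p.)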
Exercise
1. The distribution of the masses of adult husky dogs may be modeled by the normal distribution with
mean 37kg and standard deviation 5kg.
a) Calculate the probability that an adult husky has a mass greater than 30kg.
b) Calculate the probability that a randomly chosen team of six huskies has a total mass lying
between 198kg and 240kg, giving your answer to three decimal places.
2. The weight of a large loaf of bread is a normal variable with mean 420g and standard deviation 30g.
The weight of a small loaf of bread is a normal variable with mean 220g and standard deviation 10g.
a) Find the probability that 5 large loaves weigh more than 10 small loaves.
b) Find the probability that the total weight of 5 large loaves and 10 small loaves lies between
4.25kg and 4.4kg.
SURVEYS
There are two types of survey:
a) A census
b) A sample survey
a) Census
It is a total enumeration of the whole population. In a census every member of the population is
surveyed. When the population is small, this could be a straightforward exercise. When the
population is large, taking a census can be time-consuming and difficult to do with accuracy.
b) Sample survey
When a survey covers less than 100% of the population it is known as a sample survey. Sample
data can be obtained relatively cheaply and quickly and, if the sample is representative of the
population, a sample survey can give an accurate indication of the population characteristic
being studied.
Sampling frame
Once the individual members of a population have been numbered to form a list, this list is called a
sampling frame.
Sampling method
Once a sampling frame has been established you can choose a method of sampling. These fall into two
categories:
Random sampling, e.g. simple, systematic, stratified
Non-random sampling, e.g. quota, cluster
Suppose a population consists of N sampling units and you require a sample of n of these units. A
sample of size n is called a simple random sample if all possible samples of size n are equally likely to be
selected.
If the unit selected at each draw is replaced into the population before the next draw, then it can appear
more than once in the sample. This is known as sampling with replacement.
If the unit selected at each draw is not replaced into the population before the next draw, this is known as
sampling without replacement.
The second method, sampling without replacement, is known as simple random sampling. Two methods of
simple random sampling are commonly used:
Drawing lots
Random number sampling
Drawing lots
For each number, place a coloured ball into a container and then draw n balls out of the container at
random and without replacement. If you wanted a sample of size 20, you would draw out 20 balls. This
is suitable for a small population. Note, however, that the sample must be large enough to provide
sufficiently accurate information about the population. The sample should be selected at random, and any
hint of possible bias should be avoided. If the population is large then the method of drawing lots,
sometimes described as ‘drawing out of a hat’, is not practical. You could instead make the choice by
referring to a random number table.
Random number tables consist of the digits 0, 1, 2, 3, …, 9, such that each digit has an equal chance of
occurring. So, for example, the probability that a 3 occurs is 0.1. In random number tables the digits
may appear singly or be grouped in some way. This is solely for convenience of printing.
Example
6 8 7 2 5 3 8 1 5 9
2 5 3 4 7 0 5 4 9 5
3 2 6 8 7 4 4 7 0 5
Solution
a) To select a group of eight people from a target population of 100 people, allocate a two-digit
number to each person. For example, allocate 01 to the first on the list, 02 to the second, … up to
98, 99, 00, calling the hundredth person 00 for convenience.
Using the list, starting at the beginning of the first row and reading along the rows, you would select
people corresponding to the following numbers:
68 72 53 81 59 25 34 70
Alternatively, you could decide to read the digits backwards, from bottom right, in which case your
sample would consist of people corresponding to the numbers
50 74 47 86 23 59 45 07
b) To select a group of 8 from a target group of 60 people, allocate each person a number from 01 to
60. Using the table, disregard any two-digit number outside the range. Starting at the beginning of
the first row and grouping in pairs gives
68 72 53 81 59 25 34 70 54 95 32 68 74 47 05
Disregarding the numbers outside the range 01 to 60 leaves the sample
53 59 25 34 54 32 47 05
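The procedure in part (b), pairing the digits and discarding pairs outside the range, can be written as a short routine. This is a sketch only: the three rows of digits are those printed in the example, and the function name `select_from_digits` is hypothetical.

```python
# The three rows of the random number table from the example.
rows = ["6872538159", "2534705495", "3268744705"]

def select_from_digits(rows, population_size, sample_size):
    """Read the digits in pairs along the rows and keep the first
    `sample_size` distinct numbers between 1 and `population_size`."""
    digits = "".join(rows)
    pairs = [int(digits[i:i + 2]) for i in range(0, len(digits) - 1, 2)]
    chosen = []
    for number in pairs:
        in_range = 1 <= number <= population_size
        if in_range and number not in chosen:   # skip out-of-range and repeats
            chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen

sample = select_from_digits(rows, population_size=60, sample_size=8)
print(sample)
```

If the table runs out before enough numbers are found, the function returns a shorter list; in practice you would continue with further rows of the table.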
Example
Use the following extract from random number tables to select a random sample of 12 numbers, each to
two decimal places, from the continuous range $0 \le x < 10$.
52 74 54 80 68 72 51 96 08 00
02 52 09 93 60 43 57 42 13 44
Solution
Since the sample values are required to two decimal place accuracy, consider groups of three digits,
inserting the decimal point between the first and second digits. In this case your sample would consist of
the values 5.27, 4.54, 8.06, 8.72, 5.19, 6.08, 0.00, 2.52, 0.99, 3.60, 4.35, 7.42.
Example
Use it to select a random sample of four numbers, each to three decimal places, from the continuous
range .
Solution
Consider groups of four digits, inserting the decimal point between the first and second digits. Disregard
any values that are out of range. This gives
You probably have a random number generator key Ran # on your calculator, which produces a
number, for example 0.398, every time you press it. The numbers generated are in fact obtained using a
mathematical formula and are really pseudo random numbers, but they suit the purpose very well
indeed.
Suppose you want to use your calculator to select a random sample of six numbers between 1 and 49 for your entry in the National Lottery.
To do this, you probably need to press shift, then Ran #, then =. Suppose the numbers you get are 0.730, 0.798, 0.369, 0.499, 0.491, 0.310, 0.135, 0.112, 0.593, 0.652, 0.015, 0.346. You can interpret them in various ways, for example:
If you decide to use the first two digits to the right of the decimal point each time, you would obtain the numbers 73, 79, 36, 49, 49, 31, 13, 11, 59, 65, 01, 34. Ignoring repeats and numbers bigger than 49, the six numbers would be 36, 49, 31, 13, 11 and 1.
If you decide to use the last two digits each time, you would obtain the numbers 30, 98, 69, 99, 91, 10, 35, 12, 93, 52, 15, 46. Ignoring repeats and numbers bigger than 49, the six numbers would be 30, 10, 35, 12, 15 and 46.
If you decide to use all the digits after the decimal point, you would be choosing from the digits 730798369499491310135112593652015346. Grouping these as two-digit numbers gives 73, 07, 98, 36, 94, 99, 49, 13, 10, 13, 51, 12, 59, 36, 52, 01, 53, 46.
Ignoring repeats and numbers bigger than 49 gives the six numbers as 7, 36, 49, 13, 10, 12. The possibilities are endless.
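As a rough check on the different interpretations, the selection rule (read two-digit numbers, skip repeats and anything outside 1 to 49) can be written as a short Python sketch. The helper name `pick_numbers` is hypothetical; the calculator readings are the twelve values quoted in the text, stored as their digits after the decimal point.

```python
def pick_numbers(digits, how_many=6, largest=49):
    """Read two-digit numbers from a digit string, skipping repeats and out-of-range values."""
    chosen = []
    for i in range(0, len(digits) - 1, 2):
        number = int(digits[i:i + 2])
        if 1 <= number <= largest and number not in chosen:
            chosen.append(number)
        if len(chosen) == how_many:
            break
    return chosen

# Digits after the decimal point of each calculator reading.
readings = ["730", "798", "369", "499", "491", "310",
            "135", "112", "593", "652", "015", "346"]

all_digits = "".join(readings)                 # use every digit
first_two = "".join(r[:2] for r in readings)   # first two digits of each reading
last_two = "".join(r[1:] for r in readings)    # last two digits of each reading

print(pick_numbers(all_digits))
print(pick_numbers(first_two))
print(pick_numbers(last_two))
```

Each interpretation gives a different, equally valid set of six numbers, which is the point being made in the text.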
Systematic sampling
Random sampling from a very large population is very cumbersome.
An alternative procedure is to list the population in some order, for example alphabetically or in order of completion on a production line, and then choose every kth member from the list after obtaining a random starting point. If you choose every tenth vehicle passing a checkpoint, you would form a 10% sample. If you choose every twentieth item, for example every twentieth card in an index file, you would form a 5% sample.
Example
Describe how to choose a systematic sample of 8 members from a list of 300.
Solution
Since you are going to choose every kth member, you need to find a suitable value for k. To do this, choose a convenient value close to $300 \div 8 = 37.5$. In this case, $k = 40$ will do. Now choose a random starting point; for example, if Ran # on your calculator gives 0.870, take the first member of the sample as 87 and then add 40 each time. The other members are 127, 167, 207, 247, 287, 27 and 67.
Note that when you reach the end of the list, you go back to the beginning ($287 + 40 = 327$ and $327 - 300 = 27$), so the sample consists of 27, 67, 87, 127, 167, 207, 247, 287.
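The wrap-around rule used in this example can be sketched as follows. The function name `systematic_sample` and its arguments are illustrative only; the list is assumed to be numbered from 1 to 300.

```python
def systematic_sample(population_size, k, start, size):
    """Take every k-th member from `start`, going back to the beginning at the end of the list."""
    members = []
    position = start
    for _ in range(size):
        members.append(position)
        position += k
        if position > population_size:
            position -= population_size  # wrap round to the start of the list
    return sorted(members)

sample = systematic_sample(population_size=300, k=40, start=87, size=8)
print(sample)
```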
The advantages of systematic sampling are that it is quick to carry out and it is easy to check for errors. For large-scale sampling, systematic selection is usually used in preference to taking simple random samples.
The disadvantage of this system is that there may be a periodic cycle within the sampling frame itself. For example, a machine may operate in such a manner that choosing every tenth item, starting at 5, would result in half the items in the sample being faulty, whereas starting at 2 would produce a sample with no faulty items. Of course, if the periodic cycle is recognized, then different samples would be taken by varying the starting point and the length of the interval between the chosen items.
Stratified Sampling
Stratified sampling is used when the population is split into distinguishable layers or strata that are quite different from each other and which together cover the whole population, for example:
Age groups
Occupational groups
Topographical regions
Separate random samples are taken from each stratum and put together to form the sample from the population.
It is usual to represent the strata in the sample in proportion to their sizes in the population, as in the following example.
Example
A competent carrier employs 320 drivers, 80 administrative staff and 40 mechanics. A committee to
represent all the employees is to be formed. The committee is to have 11 members and the selection is
to be made so that there is as close representation as possible without bias towards any individual or
groups. Explain how this could be done.
Solution
If you were to take a simple random sample from all 440 employees, every employee would have an equal chance of being selected. However, the committee could then consist largely, or even entirely, of drivers and therefore would not be representative of all employees.
A stratified random sample would provide a more accurate representation of the population and could be formed as follows:
Taking into account that the drivers make up $\frac{320}{440}$ of the workforce,
Number of drivers $= \frac{320}{440} \times 11 = 8$
Similarly:
Number of administrative staff $= \frac{80}{440} \times 11 = 2$
Number of mechanics $= \frac{40}{440} \times 11 = 1$
The required representation on the committee is eight drivers, two from the administrative staff and
one mechanic. The people to be included can then be selected from each stratum by using simple
random sampling or systematic sampling.
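The proportional allocation can be checked with a small Python sketch. The function name `proportional_allocation` is hypothetical, and simple rounding is an assumption that happens to work here because every share is a whole number; in general the rounded shares may need adjusting so that they add up to the sample size.

```python
def proportional_allocation(strata, sample_size):
    """Allocate the sample to each stratum in proportion to its size (simple rounding)."""
    total = sum(strata.values())
    return {name: round(sample_size * size / total) for name, size in strata.items()}

workforce = {"drivers": 320, "administrative staff": 80, "mechanics": 40}
committee = proportional_allocation(workforce, sample_size=11)
print(committee)
```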
Non-random sampling
a. Cluster sampling
Sometimes there is a natural sub grouping of the population and these subgroups are called clusters. For
example, in a population consisting of all children in the country attending state primary schools, the
local education authorities form natural clusters.
When a sample survey is carried out on a population that can be broken in to clusters it is often more
convenient to first choose a random sample of clusters and then to sample within each cluster chosen.
Unlike stratified sampling where the strata are as different from each other as possible, each cluster
should be as similar to other clusters as possible.
Advantages
Disadvantages
1) It is non-random
b. Quota sampling
Quota sampling is widely used in market research, where the population is divided into groups in terms of age, sex, income level and so on. The interviewer is then told how many people to interview within each specified group, but is given no specific instructions about how to locate them and fulfil the quota.
It is quick to use, complications are kept to a minimum and, unlike random sampling, any member of the sample may be replaced by another member.
If no sampling frame exists, then quota sampling may be the only practical method of obtaining a sample.
The disadvantage of quota sampling, however, is that it is non-random. There is a possibility of bias in the selection process if, for example, the interviewer selects those easiest to question or those who look co-operative.
SIMULATING RANDOM SAMPLES FROM GIVEN DISTRIBUTIONS
A good way to simulate a random sample from a given distribution is to use cumulative proportional frequencies or cumulative probabilities, as illustrated in the following examples.
a) From a frequency distribution
Example
Use the sequence of random digits 3642945883309239184000300 to generate five simulated
observations from the following frequency distribution.
x    1    2    3    4
f    8   12   14    6     Total 40
Solution
Consider first the cumulative frequencies and then convert them to cumulative proportional frequencies with a total proportion of 1. Then allocate the random numbers in proportion to these frequencies.
x    f    Cumulative    Cumulative proportional    Corresponding
          frequency     frequency                  random numbers
1    8     8            0.20                       01 to 20
2   12    20            0.50                       21 to 50
3   14    34            0.85                       51 to 85
4    6    40            1.00                       86 to 99 and 00
Since the cumulative proportional frequencies contain two decimal places, it is convenient to use two-digit random numbers. Note that 00 has been allocated to the x-value of 4 for convenience.
Take five two-digit random numbers from the list: 36, 42, 94, 58, 83.
Matching these with the corresponding sample values gives 2, 2, 4, 3, 3. So a random sample of size 5 from the given distribution is 2, 2, 3, 3, 4.
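The allocation of random numbers to values can be sketched in Python as below. The function name `simulate_from_frequencies` is hypothetical, and the treatment of 00 as the top of the range follows the convention used in the example.

```python
from itertools import accumulate

def simulate_from_frequencies(values, freqs, two_digit_numbers):
    """Match two-digit random numbers to values using cumulative proportional frequencies."""
    total = sum(freqs)
    # Upper limit of each block of random numbers: 20, 50, 85, 100 in the example.
    upper = [round(100 * c / total) for c in accumulate(freqs)]
    sample = []
    for r in two_digit_numbers:
        r = 100 if r == 0 else r  # 00 is allocated to the last value
        for value, limit in zip(values, upper):
            if r <= limit:
                sample.append(value)
                break
    return sample

digits = "3642945883309239184000300"
numbers = [int(digits[i:i + 2]) for i in range(0, 10, 2)]  # 36, 42, 94, 58, 83
observations = simulate_from_frequencies([1, 2, 3, 4], [8, 12, 14, 6], numbers)
print(observations)
```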
Solution
Form the cumulative distribution function F(x) and then allocate random numbers in a convenient way.
x    P(X = x)    F(x)    Corresponding random numbers
0    0.1         0.1     1
1    0.2         0.3     2, 3
2    0.4         0.7     4, 5, 6, 7
3    0.3         1       8, 9, 0

x    P(X = x)    F(x)      Corresponding random numbers
0    0.4096      0.4096    0001 to 4096
1    0.4096      0.8192    4097 to 8192
2    0.1536      0.9728    8193 to 9728
3    0.0256      0.9984    9729 to 9984
4    0.0016      1         9985 to 9999 and 0000
The random number 2811 is in the range 0001 to 4096 and so corresponds to $x = 0$.
Similarly, 5747 corresponds to $x = 1$,
6157 corresponds to $x = 1$,
8988 corresponds to $x = 2$.
So the random sample of four observations from the distribution consists of the values 0, 1, 1, 2.
Example
Using the random number 8135, take a single random observation from a Poisson distribution with parameter 3.
Solution
$X \sim \text{Po}(3)$
Using cumulative Poisson probability tables and arranging the results in a table together with a convenient corresponding random number allocation gives:
x            P(X ≤ x)    Corresponding random numbers
0            0.0498      0001 to 0498
1            0.1991      0499 to 1991
2            0.4232      1992 to 4232
3            0.6472      4233 to 6472
4            0.8153      6473 to 8153
5            0.9161      8154 to 9161
6            0.9665      9162 to 9665
7            0.9881      9666 to 9881
8 or over    1           9882 to 9999 and 0000
The random number 8135 lies in the range 6473 to 8153, so the random observation is $x = 4$.
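Instead of reading cumulative probabilities from tables, they can be computed directly. The sketch below is one possible way to do this; the function name `poisson_observation` is hypothetical, the cumulative probabilities are rounded to four decimal places to mimic the table, and, unlike the table, values of 8 or more are not grouped together.

```python
from math import exp, factorial

def poisson_observation(random_number, lam, digits=4):
    """Map a `digits`-digit random number to a Poisson(lam) observation."""
    scale = 10 ** digits
    # 0000 is treated as the top of the range, as in the table.
    r = scale if random_number == 0 else random_number
    x, cumulative = 0, 0.0
    while True:
        cumulative += exp(-lam) * lam ** x / factorial(x)
        if r <= round(cumulative * scale):  # upper limit of the block for this x
            return x
        x += 1

print(poisson_observation(8135, lam=3))
```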
Example
Using the random digits 723,850 take a random sample of size 2 from the continuous distribution with
probability density function.
( ) For
Solution
F( )
0 a 2
Exercise
1)Use the random numbers 382 824 to take a random sample of 2 from the normal distribution
( ). ( Ans: the two random observations are 29.4,31.9)
2)Using the random number 256 construct a random observation of the continuous random
variable where ( )
3)You are given the random number 431. Use this number to obtain a sample observation from
a) A binomial distribution with
b) A normal distribution with mean 6.2 and standard deviation 0.1.
4)Using the random numbers 267 394 018 take a random sample of size 3 from the normal
distribution with mean 35 and variance 9.
5) a) The discrete random variable X is such that X ( ) Take a random sample of size
5 from this distribution, using the random numbers 407 315 401 203 972.
b) Using the random numbers 6143 take a single random observation from the Poisson
distribution with parameter 4.
SAMPLE STATISTICS
When you are trying to find out information about a population, it seems sensible to take random samples and then consider the values obtained from them. It is therefore useful to know how these sample values are distributed.
Take a random sample of n independent observations from a population. Note that from a finite population, sampling should be with replacement to ensure that the observations are independent.
Calculate the mean of these n sample values. This is known as the sample mean.
Now repeat the procedure until you have taken all possible samples of size n, calculating the sample mean of each one.
Form a distribution of all the sample means. The distribution that would be formed is called the sampling distribution of means.
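Suppose the population has mean $\mu$ and variance $\sigma^2$, and let $X_1, X_2, \ldots, X_n$ be the $n$ independent observations, so that
$\bar{X} = \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$
Since $E(aX) = aE(X)$ and $E(X_1 + X_2) = E(X_1) + E(X_2)$,
$E(\bar{X}) = \frac{1}{n}\left(E(X_1) + E(X_2) + \cdots + E(X_n)\right) = \frac{1}{n}(n\mu) = \mu$
Since $\text{Var}(aX) = a^2\,\text{Var}(X)$ and, for independent observations, $\text{Var}(X_1 + X_2) = \text{Var}(X_1) + \text{Var}(X_2)$,
$\text{Var}(\bar{X}) = \frac{1}{n^2}\left(\text{Var}(X_1) + \text{Var}(X_2) + \cdots + \text{Var}(X_n)\right) = \frac{1}{n^2}(n\sigma^2) = \frac{\sigma^2}{n}$
i.e. $E(\bar{X}) = \mu$ and $\text{Var}(\bar{X}) = \frac{\sigma^2}{n}$.
The standard deviation of the sampling distribution is $\sqrt{\frac{\sigma^2}{n}}$, usually written $\frac{\sigma}{\sqrt{n}}$. This is known as the standard error of the mean.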
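The results $E(\bar{X}) = \mu$ and $\text{Var}(\bar{X}) = \sigma^2/n$ can be checked numerically by listing every possible sample. The sketch below uses a small made-up population (the numbers 1 to 6, purely an assumption for illustration) and samples of size 2 taken with replacement.

```python
from itertools import product
from math import isclose
from statistics import mean, pvariance

population = [1, 2, 3, 4, 5, 6]  # hypothetical population, for illustration only
n = 2

# Every possible ordered sample of size n, sampling with replacement.
sample_means = [mean(s) for s in product(population, repeat=n)]

print(mean(population), mean(sample_means))               # both equal mu
print(pvariance(population) / n, pvariance(sample_means)) # both equal sigma^2 / n
print(isclose(pvariance(sample_means), pvariance(population) / n))
```

`pvariance` is used because the whole population, and the whole set of sample means, is available, so the divisor is the number of values rather than one less.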
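[Figure: sampling distributions of the mean for samples of size 2, size 5 and size 25, each with the value 100 marked on the horizontal axis.]
From the diagrams, you can see that if samples are taken from a normal population, the sampling distribution of the mean is also normal: if $X \sim N(\mu, \sigma^2)$, then $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$.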
Example
At a college the masses of the male students can be modeled by a normal distribution with
mean mass 70kg and standard deviation 5kg.
Four male students are chosen at random. Find the probability that their mean mass is less than
65kg.
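Solution.
$X$ is the mass, in kilograms, of a male student at the college, and $X \sim N(\mu, \sigma^2)$ with $\mu = 70$ and $\sigma = 5$.
Since the distribution of $X$ is normal, the distribution of $\bar{X}$ is also normal and $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ with $n = 4$.
So $\bar{X} \sim N\left(70, \frac{25}{4}\right)$, with standard error $\frac{5}{\sqrt{4}} = 2.5$.
$P(\bar{X} < 65) = P\left(Z < \frac{65 - 70}{2.5}\right) = P(Z < -2) = 1 - \Phi(2) = 1 - 0.9772 = 0.0228$
The probability that the mean mass is less than 65 kg is 0.023 (2 s.f.).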
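The same probability can be checked without tables, for instance with `NormalDist` from Python's standard `statistics` module. This is only a verification sketch; the variable names are arbitrary.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 70, 5, 4

# Sampling distribution of the mean: N(mu, sigma^2 / n), so s.d. = sigma / sqrt(n).
sample_mean_dist = NormalDist(mu, sigma / sqrt(n))
p = sample_mean_dist.cdf(65)
print(round(p, 3))
```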
Example
The distribution of the random variable $X$ is $N(25, 340)$. The mean of a random sample of size $n$ drawn from this distribution is $\bar{X}$. Find the value of $n$, correct to two significant figures, given that
( ) is approximately 0.005.
Solution
( )
√
Therefore ( ) ( ) ( )
√ √
√ √
You are given that ( ) ( )
√ √
√
So ( )= ( ) √
√
Squaring both sides.
, ( )
b) The distribution of ̅ when is not normally distributed
The following diagram illustrate the distribution of for samples of different sizes taken from a
population
i) Distribution of when ( )
[Figure: distribution of $X$ (horizontal axis from 0 to 10, vertical axis up to 0.4), followed by the distribution of $\bar{X}$ when $X$ is not normally distributed, for means of samples of size 10, size 15 and size 30.]
Work out for
ii) Distribution of when X ( )
iii) Distribution of when ( )
Example
Thirty random observations are taken from each of the following distribution and the sample mean
calculated. Find, in each case, the probability that the sample mean exceeds 5.
Solution
a) ( )
b) ( )
By the central limit theorem, since n is large, X is approximately normal and ( ) with
i.e ( ), ( )
( ) ( ) ( )
√
( )
̅: 4.5 5
Z: 0 1.825
By the central limit theorem, since is large, ̅ is approximately normal and ̅ ( ) with =30
⁄
i.e ̅ ( ) ̅ ( )
( ̅ ) ( )
√
( )
( )
̅: 4.5 5
Z: 0 1.897
The distribution of the sample proportion
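Suppose a random sample of $n$ observations is taken from a population in which the proportion of successes is $p$ and the proportion of failures is $q = 1 - p$. If $X$ is the number of successes in the sample, then $X$ follows a binomial distribution, i.e. $X \sim B(n, p)$ with $E(X) = np$ and $\text{Var}(X) = npq$.
The random variable for the proportion of successes in the sample is written here as $P_s$, where $P_s = \frac{X}{n}$. It is possible to work out the mean and the variance of $P_s$ using expectation algebra as follows:
$E(P_s) = E\left(\frac{X}{n}\right) = \frac{1}{n}E(X) = \frac{1}{n}(np) = p$
$\text{Var}(P_s) = \text{Var}\left(\frac{X}{n}\right) = \frac{1}{n^2}\text{Var}(X) = \frac{npq}{n^2} = \frac{pq}{n}$
The distribution of $P_s$ has mean $p$ and variance $\frac{pq}{n}$. When $n$ is large, the distribution of $P_s$ is approximately normal; the larger the sample size $n$, the better the approximation.
s.d. $= \sqrt{\frac{pq}{n}}$
The distribution of $P_s$ is known as the sampling distribution of proportions. The standard deviation of this distribution is known as the standard error of proportion.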
Example
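It is known that 3% of frozen pies delivered to a canteen are broken. What is the probability that, on a morning when 500 pies are delivered, 5% or more are broken?
Solution
Let $P_s$ be the proportion of broken pies in the sample, with $p = 0.03$, $q = 0.97$ and $n = 500$. Since $n$ is large, $P_s \sim N\left(0.03, \frac{0.03 \times 0.97}{500}\right)$ approximately.
Using a continuity correction of $\frac{1}{2n} = 0.001$,
$P(P_s \ge 0.05) \rightarrow P(P_s > 0.049)$
$= P\left(Z > \frac{0.049 - 0.03}{\sqrt{0.03 \times 0.97 / 500}}\right)$
$= P(Z > 2.491)$
$= 1 - 0.9936$
$= 0.0064$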
Alternative method
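Instead of considering $P_s$, the proportion of broken pies, you could consider $X$, the number of broken pies in the sample.
Now $X \sim B(500, 0.03)$.
Since $n$ is large such that $np > 5$ and $nq > 5$, use the normal approximation for the binomial distribution, where $X \sim N(np, npq)$ with $np = 15$ and $npq = 14.55$, i.e. $X \sim N(15, 14.55)$.
You want the probability that 5% or more are broken. 5% of 500 = 25, so find the probability that 25 or more are broken.
$P(X \ge 25) \rightarrow P(X > 24.5)$ (continuity correction)
$= P\left(Z > \frac{24.5 - 15}{\sqrt{14.55}}\right)$
$= P(Z > 2.491)$
$= 0.0064$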
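Both methods can be checked numerically. The sketch below uses `NormalDist` from the standard `statistics` module; the variable names are arbitrary, and the continuity corrections are the ones used in the worked solutions.

```python
from math import sqrt
from statistics import NormalDist

n, p = 500, 0.03
q = 1 - p

# Method 1: sample proportion, with continuity correction 1/(2n).
proportion = NormalDist(p, sqrt(p * q / n))
p1 = 1 - proportion.cdf(0.05 - 1 / (2 * n))

# Method 2: number of broken pies, with continuity correction 0.5.
count = NormalDist(n * p, sqrt(n * p * q))
p2 = 1 - count.cdf(24.5)

print(round(p1, 4), round(p2, 4))
```

The two answers agree because the second method is just the first with every quantity multiplied by $n$.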
POINT ESTIMATES
( ̅) ( ̅)
̂ ̂ ( ) ( )
Example 1
A railway enthusiast simulates train journeys and records the number of minutes, , to the nearest
minute, trains are late according to the schedule being used. A random sample of 50 journeys gave the
following times.
17 5 3 10 4 3 10 5 2 14 = 73
3 14 5 5 21 9 22 36 14 34 = 163
22 4 23 6 8 15 41 23 13 7 = 162
6 13 33 8 5 34 26 17 8 43 = 193
24 14 23 4 19 5 23 13 12 10 = 147
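Given that $\sum x = 738$ and $\sum x^2 = 16526$, calculate, to two decimal places, unbiased estimates of the mean and the variance of the population from which this sample was drawn.
Solution
$X$ is the number of minutes that the train is late.
Let $E(X) = \mu$ and $\text{Var}(X) = \sigma^2$.
Unbiased estimate of $\mu$:
$\hat{\mu} = \bar{x} = \frac{\sum x}{n} = \frac{738}{50} = 14.76$
Unbiased estimate of $\sigma^2$:
$\hat{\sigma}^2 = \frac{1}{n-1}\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)$
$= \frac{1}{49}\left(16526 - \frac{738^2}{50}\right)$
$= \frac{1}{49}(16526 - 10892.88)$
$= 114.96$ (2 d.p.)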
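These estimates can be reproduced from the raw data. In Python's standard `statistics` module, `variance` uses the divisor $n - 1$, which is the unbiased estimate required here. The list name `late` is arbitrary.

```python
from statistics import mean, variance

# Minutes late for the 50 simulated journeys, row by row.
late = [17, 5, 3, 10, 4, 3, 10, 5, 2, 14,
        3, 14, 5, 5, 21, 9, 22, 36, 14, 34,
        22, 4, 23, 6, 8, 15, 41, 23, 13, 7,
        6, 13, 33, 8, 5, 34, 26, 17, 8, 43,
        24, 14, 23, 4, 19, 5, 23, 13, 12, 10]

print(sum(late), sum(x * x for x in late))  # the totals sum(x) and sum(x^2)
print(round(mean(late), 2))                 # unbiased estimate of the mean
print(round(variance(late), 2))             # unbiased estimate of the variance

# Sample proportion of trains more than 25 minutes late.
proportion_late = sum(1 for x in late if x > 25) / len(late)
print(proportion_late)
```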
Example
For the data given in example 1 above, estimate the proportion of trains that are more than 25 minutes
late.
Solution
Number in sample that are more than 25 minutes late = 7. Proportion in sample, $p_s = \frac{7}{50} = 0.14$. The unbiased estimate of the population proportion $p$ is $\hat{p}$, where $\hat{p} = p_s = 0.14$.
INTERVAL ESTIMATES
Another way of using a sample value to give a good idea of an unknown population parameter is to
construct an interval, known as a confidence interval.
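The interval is usually written (a, b), and the end values a and b are known as confidence limits. The probabilities often used in confidence intervals are 90%, 95% and 99%.
Suppose you do not know the mean $\mu$ of a particular population and you want to work out a 95% confidence interval for it. You construct an interval (a, b) so that
$P(a < \mu < b) = 0.95$
In this case, the probability that the interval includes $\mu$ is 0.95 or 95%.
For a sample of size $n$ from a normal population with known variance $\sigma^2$, $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$, so $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$.
So $P(-1.96 < Z < 1.96) = 0.95$, i.e. $P\left(-1.96 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95$.
Rearranging, the probability statement is $P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$; therefore the confidence limits are $\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}$.
If $\bar{x}$ is the mean of a random sample of any size $n$ taken from a normal population with known variance $\sigma^2$, then a 95% confidence interval for $\mu$ is given by $\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$.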
Example
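The mass of vitamin E in a capsule manufactured by a certain drug company is normally distributed with a standard deviation of 0.042 mg. A random sample of five capsules was analysed and the mean mass of vitamin E was found to be 5.12 mg. Calculate a symmetric 95% confidence interval for the population mean mass of vitamin E per capsule. Give the values of the end-points of the interval correct to three significant figures.
Solution.
$\bar{x} = 5.12$, $\sigma = 0.042$ and $n = 5$. The 95% confidence limits are
$\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}} = 5.12 \pm 1.96 \times \frac{0.042}{\sqrt{5}} = 5.12 \pm 0.0368$
so the 95% confidence interval is (5.08 mg, 5.16 mg), correct to three significant figures.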
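A confidence interval of this kind, $\bar{x} \pm z\,\sigma/\sqrt{n}$ with known $\sigma$, is easy to wrap in a small helper. The function name `z_interval` is hypothetical, and the default $z = 1.96$ is the tabulated value for 95%.

```python
from math import sqrt

def z_interval(xbar, sigma, n, z=1.96):
    """Confidence interval for the mean when the population s.d. is known."""
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

# Vitamin E example: mean 5.12, s.d. 0.042, five capsules.
lower, upper = z_interval(xbar=5.12, sigma=0.042, n=5)
print(round(lower, 2), round(upper, 2))
```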
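If the population is not normal but the sample size is large, the central limit theorem can be used.
If $\bar{x}$ is the mean of a random sample of size $n$, where $n$ is large (typically $n \ge 30$), taken from a non-normal population with known variance $\sigma^2$, then an approximate 95% confidence interval for $\mu$ is given by
$\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$
[Figure: standard normal curve with a central area of 95% and 2.5% in each tail.]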
Example
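The heights of men in a particular district are distributed with mean $\mu$ and standard deviation $\sigma$.
On the basis of the results obtained from a random sample of 100 men from the district, the 95% confidence interval for $\mu$ was calculated and found to be (177.22 cm, 179.18 cm).
Calculate
a) The value of the sample mean, $\bar{x}$
b) The value of $\sigma$
c) A symmetric 90% confidence interval for $\mu$
Solution
a) A 95% confidence interval for $\mu$ is given by $\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}$ with $n = 100$, so $\sqrt{n} = 10$.
Since the interval is (177.22, 179.18),
$\bar{x} - 0.196\sigma = 177.22$ (1)
$\bar{x} + 0.196\sigma = 179.18$ (2)
Adding (1) and (2): $2\bar{x} = 356.4$
$\bar{x} = 178.2$
The sample mean is 178.2 cm.
b) Subtracting (1) from (2): $0.392\sigma = 1.96$, so $\sigma = 5$ cm.
c) A symmetric 90% confidence interval for $\mu$ is $\bar{x} \pm 1.645\frac{\sigma}{\sqrt{n}} = 178.2 \pm 1.645 \times 0.5 = 178.2 \pm 0.8225$.
Ans: (177.38 cm, 179.02 cm)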
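Working backwards from a confidence interval to $\bar{x}$ and $\sigma$ can be sketched as follows. The critical values 1.96 and 1.645 are the usual tabulated values for 95% and 90%; the variable names are arbitrary.

```python
from math import sqrt

lower, upper, n = 177.22, 179.18, 100

# The interval is symmetric about the sample mean.
xbar = (lower + upper) / 2

# Half-width = 1.96 * sigma / sqrt(n), so rearrange for sigma.
sigma = (upper - lower) / 2 * sqrt(n) / 1.96

# Rebuild the interval at the 90% level.
half90 = 1.645 * sigma / sqrt(n)
interval90 = (xbar - half90, xbar + half90)

print(round(xbar, 2), round(sigma, 2))
print(tuple(round(v, 2) for v in interval90))
```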
CONFIDENCE INTERVAL FOR , THE POPULATION MEAN
Of a normal or non-normal population
With unknown variance
Using a large sample,
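When calculating confidence intervals it is often the case that the population variance, $\sigma^2$, is not known. Provided that the sample size, $n$, is large (typically $n \ge 30$), it is permissible to use the best unbiased estimate, $\hat{\sigma}^2$, for $\sigma^2$.
Ideally the distribution of $X$ should be normal, but an approximate confidence interval can also be given when the distribution of $X$ is not normal. Remember that in both cases, $n$ must be large.
A 95% confidence interval for $\mu$ is then
$\left(\bar{x} - 1.96\frac{\hat{\sigma}}{\sqrt{n}},\ \bar{x} + 1.96\frac{\hat{\sigma}}{\sqrt{n}}\right)$, where $\hat{\sigma}^2 = \frac{n}{n-1}s^2$ and $s^2$ is the sample variance,
or $\hat{\sigma}^2 = \frac{1}{n-1}\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)$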
Example
The fuel consumption of a new model of car is being tested. In one trial, 50 cars chosen at random were driven under identical conditions and the distances, $x$ km, covered on one litre of petrol were recorded.
The results gave the following totals. Calculate a 95% confidence interval for the mean petrol consumption, in kilometres per litre, of cars of this type.
Solution
̅
( )
is unknown, so use ̂ where ̂ ( ) ( )
= 2.2959………….
………..
95% confidence interval for , 10.92km/litre
( )
Example
The height, , of each man in a random sample of 200 men living in U.K was measured. The following
results were obtained.
a) Calculate unbiased estimates of the mean and variance of the heights of men living in the U.K.
b) Determine an approximate 90% confidence interval for the mean height of men living in U.K.
Name the theorem that you have assumed.
Solution
a) ̂ ̅
̂ ( ̅ )
= ( )
= 103.5
̅̅
̂ √
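Here $T = \frac{\bar{X} - \mu}{\hat{\sigma}/\sqrt{n}}$ has a t-distribution with $n - 1$ degrees of freedom, written $t(n-1)$.
The 95% confidence interval for $\mu$ is obtained as follows:
If $\bar{x}$ and $s^2$ are the mean and variance of a small sample ($n < 30$) from a normal population with unknown mean $\mu$ and unknown variance $\sigma^2$, then a 95% confidence interval for $\mu$ is given by
$\left(\bar{x} - t\frac{\hat{\sigma}}{\sqrt{n}},\ \bar{x} + t\frac{\hat{\sigma}}{\sqrt{n}}\right)$, where $\hat{\sigma}^2 = \frac{n}{n-1}s^2$ and $t$ is the value from a $t(n-1)$ distribution such that $(-t, t)$ encloses 95% of the $t(n-1)$ distribution.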
Example
i) ( )
ii) ( )
iii) (| | )
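Solution
i) $\nu = 11$, so find row 11 of the t-tables and go across to 1.796, then up to the top of the column. This gives 0.95, so $P(T < 1.796) = 0.95$.
ii) Find row 11 and go across to 3.106, which is in column 0.995.
So $P(T < 3.106) = 0.995$ and $P(T > 3.106) = 1 - 0.995 = 0.005$.
iii) Find row 11 and go across to 2.201, which is in column 0.975, so $P(T < 2.201) = 0.975$.
It follows that $P(T > 2.201) = 0.025$; also, by symmetry, $P(T < -2.201) = 0.025$.
$P(|T| < 2.201) = 1 - 2(0.025)$
= 0.95
= 95%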
Exercise
The random variable $T$ has a t-distribution with 14 degrees of freedom, i.e. $T \sim t(14)$. Find the value of $t$ for which
i) ( )
ii) (| | )
Example
The mass, in grams, of a packet of biscuits of a particular brand, follows a normal distribution with mean
. Ten packets of biscuits are chosen at random and their masses noted. The results, in grams are
397.3, 399.6, 401.0, 392.9, 396.8, 400.0, 397.6, 392.1, 400.8, 400.6
Calculate a 95% confidence interval for
Solution
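These can be summarized as follows: $n = 10$, $\sum x = 3978.7$, $\sum (x - \bar{x})^2 = 92.901$.
Let $X$ be the mass, in grams, of a packet of biscuits.
$X \sim N(\mu, \sigma^2)$ with both $\mu$ and $\sigma^2$ unknown.
The sample mean, $\bar{x} = \frac{3978.7}{10} = 397.87$
Since $\sigma^2$ is unknown, find $\hat{\sigma}^2$:
$\hat{\sigma}^2 = \frac{1}{n-1}\sum (x - \bar{x})^2$, with $n = 10$
$= \frac{1}{9}(92.901)$
$= 10.322\ldots$
$\hat{\sigma} = 3.2128\ldots$
Since $n$ is small, a $t(n-1)$ distribution is required. $n - 1 = 9$, so use a $t(9)$ distribution.
The 95% confidence limits for $\mu$ are $\bar{x} \pm t\frac{\hat{\sigma}}{\sqrt{n}}$, where $(-t, t)$ encloses 95% of the $t(9)$ distribution.
From tables, the critical value is 2.262. Therefore the confidence limits are $397.87 \pm 2.262 \times \frac{3.2128}{\sqrt{10}}$
$= 397.87 \pm 2.298\ldots$
95% confidence interval for $\mu$ $= (397.87 - 2.298,\ 397.87 + 2.298)$
$= (395.57,\ 400.17)$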
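The calculation can be reproduced from the raw masses. Python's standard library has no t-distribution, so in this sketch the critical value 2.262 is simply copied from t-tables for 9 degrees of freedom; `stdev` uses the divisor $n - 1$, which gives $\hat{\sigma}$.

```python
from math import sqrt
from statistics import mean, stdev

masses = [397.3, 399.6, 401.0, 392.9, 396.8, 400.0, 397.6, 392.1, 400.8, 400.6]

t_crit = 2.262            # from t-tables: 95% two-sided, 9 degrees of freedom
xbar = mean(masses)
sigma_hat = stdev(masses)  # square root of the unbiased variance estimate
half_width = t_crit * sigma_hat / sqrt(len(masses))

print(round(xbar, 2), round(sigma_hat ** 2, 3))
print(round(xbar - half_width, 2), round(xbar + half_width, 2))
```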
Exercise
A student studying the height of a particular plant knows that it follows a normal distribution with mean $\mu$ and variance $\sigma^2$, but he does not know the value of either of these parameters. He selects 15 plants at random, measures their heights and calculates that the mean height of the sample is 12.2 cm and the standard deviation is 1.4 cm. Using these values, calculate a 90% confidence interval for $\mu$. Calculate also the width of this interval.
TEST OF HYPOTHESIS
The procedure for carrying out a hypothesis test is:
Define the variable
State $H_0$ and $H_1$
State the distribution according to $H_0$
State the level and type of test
State the rejection criterion
Calculate the required probability
Make your conclusion
TYPE I AND TYPE II ERRORS
When you perform a significance test, there are four possible outcomes:
a) $H_0$ is true and your test leads you to accept $H_0$: correct decision.
b) $H_0$ is true but your test leads you to reject $H_0$: wrong decision, Type I error.
c) $H_0$ is false but your test leads you to accept $H_0$: wrong decision, Type II error.
d) $H_0$ is false and your test leads you to reject $H_0$: correct decision.
                                Test decision
                                Accept H0        Reject H0
Actual       H0 is true         Correct          Type I error
situation    H0 is false        Type II error    Correct
DEFINITIONS
A Type I error is made when you reject $H_0$ when it is true, i.e. P(Type I error) = P(reject $H_0$ when it is true).
A Type II error is made when you accept $H_0$ when it is false, i.e. P(Type II error) = P(accept $H_0$ when it is false).
Example
A random observation is taken from a binomial distribution ( ) and use to test the null
hypothesis against the alternative hypothesis
Solution
You are given that ( )
a)
b)
The critical region is , so to find the significance level of the test, find ( )
( ) ( ) ( )
=
= 0.0691
= 7%
The significance level is approximately 7%
c) P(Type I error) = P(reject $H_0$ when it is true).
$H_0$ is rejected if the observation lies in the critical region, so
P(Type I error) = 0.0691 (found above),
since the probability of a Type I error equals the significance level of the test.
d) You make a Type II error when you accept $H_0$ (which you will do if the observation lies outside the critical region) when $p$ takes the value specified in $H_1$ (not the value given by $H_0$).
The hypothesis is now
Note
This is a very high probability. To make it smaller you could increase the significance level of the test, but this would of course increase the probability of making a Type I error.
Exercise
1) The random variable can be modeled by a binomial distribution.
Test, at the 8% level, the hypothesis that against the alternative hypothesis
a) When ,
b) When
2) The random variable can be modeled by a binomial distribution with parameters and
whose value is unknown.
a) Find, at the 10% level of significance, the critical region to test the null hypothesis that
against the alternative hypothesis that
b) Explain what is meant by a type I error
c) State the probability of making a Type I error in test described in (a)
3) A random observation is taken from a binomial distribution ( ) and used to test the null
hypothesis against the alternative hypothesis
The critical region is chosen to be
a) At what significance level is the test carried out?
b) What is the probability of making a Type I error?
c) Find the probability of making a Type II error if, in fact,
CORRELATION ANALYSIS
Content
Type of correlation
Scatter diagram
The covariance
The correlation co-efficient
Probable Error
Spearman’s rank correlation
Formulae
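1. Covariance
$\text{cov}(x, y) = \frac{1}{n}\sum (x - \bar{x})(y - \bar{y})$
2. Correlation co-efficient
a) $r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$
b) $r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$
c) $r = \frac{\text{cov}(x, y)}{\sigma_x \sigma_y}$
d) When deviations $d_x = x - \bar{x}$ and $d_y = y - \bar{y}$ are taken from the actual means:
$r = \frac{\sum d_x d_y}{\sqrt{\sum d_x^2}\,\sqrt{\sum d_y^2}}$
3. Probable error
$\text{P.E.}(r) = 0.6745\,\frac{1 - r^2}{\sqrt{n}}$
4. Spearman's rank correlation
a) $\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$
b) With repeated ranks: $\rho = 1 - \frac{6\left[\sum d^2 + \sum \frac{m(m^2 - 1)}{12}\right]}{n(n^2 - 1)}$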
Point to remember
TYPES OF CORRELATION
Let $x$ and $y$ be the two variables under study. Three types of relationship can exist between these two variables: (i) they vary in the same direction, i.e. when one increases (decreases) the other increases (decreases); (ii) they vary in opposite directions, i.e. when one increases (decreases) the other decreases (increases); (iii) there is no relation. Statistically, these three situations are described as (i) positively correlated, (ii) negatively correlated, (iii) no correlation.
SCATTER DIAGRAM
It is the simplest way of representing bivariate data diagrammatically. Let $x$ and $y$ be the two variables under study with $n$ observations each. If we plot $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ on a graph sheet, the resulting diagram gives us a rough idea of the correlation between the two variables $x$ and $y$.
[Figure 1: scatter diagram of Y-values (0 to 25) against x (0 to 14).]
Fig 1 shows a positive correlation between the variables $x$ and $y$.
[Figure 2: scatter diagram of Y-values (0 to 25) against x (0 to 14).]
Fig 2 shows a negative correlation between the variables $x$ and $y$.
[Figure 3: scatter diagram of Y-values (14.5 to 17.5) against x (0 to 9).]
In Fig 3 we are not able to recognize a pattern, i.e. there is no correlation between $x$ and $y$.
Thus a scatter diagram helps us to get an idea of the correlation between the two variables ($x$ and $y$).
THE COVARIANCE
Covariance states only the nature of the relationship between the variables. To know how strongly the variables are associated with each other, we use the correlation co-efficient.
Correlation co-efficient is a measure of the degree or extent of linear relationship between two variables $x$ and $y$. Correlation co-efficient was defined by Karl Pearson in 1890.
The correlation co-efficient for $(x, y)$ is given by:
$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$
$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$
$r = \frac{\text{cov}(x, y)}{\sigma_x \sigma_y}$
Let $d_x = x - \bar{x}$ and $d_y = y - \bar{y}$ be the deviations taken from the actual means of $x$ and $y$; then $\sum d_x = \sum d_y = 0$ and
$r = \frac{\sum d_x d_y}{\sqrt{\sum d_x^2}\,\sqrt{\sum d_y^2}}$
The range of the correlation co-efficient is $[-1, 1]$, i.e. $-1 \le r \le 1$.
Example 1
You are given the following data relating to aptitude scores and productivity index
Aptitude score (X) 9 18 18 20 20 23
Productivity index (Y) 33 23 33 42 29 32
Find the co-efficient of correlation between aptitude scores and productivity index.
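Solution
X      Y      d_x = X - 18    d_y = Y - 32    d_x d_y    d_x²    d_y²
9      33     -9              1               -9         81      1
18     23     0               -9              0          0       81
18     33     0               1               0          0       1
20     42     2               10              20         4       100
20     29     2               -3              -6         4       9
23     32     5               0               0          25      0
Total  108    192   0         0               5          114     192
Here $\bar{X} = \frac{108}{6} = 18$, $\bar{Y} = \frac{192}{6} = 32$, and $\sum d_x d_y = 5$, $\sum d_x^2 = 114$, $\sum d_y^2 = 192$.
$r = \frac{\sum d_x d_y}{\sqrt{\sum d_x^2}\,\sqrt{\sum d_y^2}} = \frac{5}{\sqrt{114}\,\sqrt{192}} = 0.034$
so there is almost no linear correlation between aptitude score and productivity index.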
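The deviation method used in this example can be sketched as a small function. The name `pearson_r` is arbitrary; it implements the formula with deviations taken from the actual means.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation co-efficient using deviations from the actual means."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    s_xx = sum((x - xbar) ** 2 for x in xs)
    s_yy = sum((y - ybar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

aptitude = [9, 18, 18, 20, 20, 23]
productivity = [33, 23, 33, 42, 29, 32]
r = pearson_r(aptitude, productivity)
print(round(r, 4))
```

For data lying exactly on a rising straight line, such as X = 4, 6, 8, 10, 12 and Y = 3, 6, 9, 12, 15, the same function returns a value of 1 (up to rounding error).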
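Example 2
X      Y      u     v     uv    u²    v²
4      3      -2    -2    4     4     4
6      6      -1    -1    1     1     1
8      9      0     0     0     0     0
10     12     1     1     1     1     1
12     15     2     2     4     4     4
Total  40     45    0     0     10    10    10
Here $\bar{X} = \frac{40}{5} = 8$, $\bar{Y} = \frac{45}{5} = 9$, and the deviations have been scaled as $u = \frac{X - 8}{2}$, $v = \frac{Y - 9}{3}$ (the correlation co-efficient is unaffected by a change of origin and scale).
$r = \frac{\sum uv}{\sqrt{\sum u^2}\,\sqrt{\sum v^2}} = \frac{10}{\sqrt{10}\,\sqrt{10}} = 1$
Which shows that there is a perfect positive linear correlation between $X$ and $Y$.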
Probable error of the correlation co-efficient is a measure of the reliability of $r$, and is given by the formula
$\text{P.E.}(r) = 0.6745\,\frac{1 - r^2}{\sqrt{n}}$
Example 3
If and is significant.
( )
√
( ( ) ) ( )
√
( )
( )
EXERCISES
1. Find Karl Pearson’s co-efficient of correlation of the following data.
Age of husband ( ): 20 22 23 25 25 28 29 30 30 34
Age of wife ( ): 18 20 22 24 21 26 26 25 27 29
2. Calculate Karl Pearson’s co-efficient of correlation between percentage of pass and failure from the
following data.
No. of students ( ): 800 600 900 700 500 400
No. passed ( ): 480 300 450 560 450 300
6. Given below are the monthly incomes and their net saving of a sample of 10 supervisory staff
belonging to a firm. Calculate the correlation co-efficient.
Employee No.: 1 2 3 4 5 6 7 8 9 10
Monthly income (x): 780 360 980 250 750 82 900 620 650 390
Net saving: ( ): 84 51 91 60 68 62 86 58 53 47
9. From the following data, compute the co-efficient of correlation between and .
series series
No. of items: 15 15
Arithmetic mean: 25 18
Sum of squares of deviation from mean: 136 138
14. A company wanted to assess the impact of and expenditure ( ) on annual profit ( ) following
table presents data for past 8 years.
9 7 5 10 4 5 3 2
45 42 41 60 30 34 25 20
Compute the correlation co-efficient. Comment on the value.
15. From the following data, compute the co-efficient of correlation between and .
series series
No. of items: 15 15
Arithmetic mean: 25 18
Sum of squares of deviation from mean: 136 138
In the correlation co-efficient we used the actual values to measure the degree of relationship; here we use the ranks of the given data. Rank correlation is useful when we study the relationship between qualitative characteristics like beauty, intelligence etc. Ranks are obtained by arranging the data in order of merit. Spearman's rank correlation formula is given by
$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$
where $d$ is the difference between the two ranks of each pair and $n$ is the number of pairs.
Example 1
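Solution
x      y      R1 (rank of x)    R2 (rank of y)    d = R1 - R2    d²
15     40     7                 3                 4              16
20     30     5                 5                 0              0
25     50     4                 2                 2              4
18     36     6                 4                 2              4
40     20     3                 6                 -3             9
60     10     2                 7                 -5             25
80     60     1                 1                 0              0
Total                                                            58
$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 58}{7(7^2 - 1)}$
$= 1 - \frac{348}{336} = -0.036$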
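For data with no repeated values, the ranking and the formula can be sketched as follows. The function names are arbitrary, and rank 1 is given to the largest value, as in the table above.

```python
def ranks_descending(values):
    """Rank 1 for the largest value; assumes there are no repeated values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Spearman's rank correlation, 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(xs)
    d_squared = sum((a - b) ** 2
                    for a, b in zip(ranks_descending(xs), ranks_descending(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

x = [15, 20, 25, 18, 40, 60, 80]
y = [40, 30, 50, 36, 20, 10, 60]
rho = spearman_rho(x, y)
print(ranks_descending(x), ranks_descending(y))
print(round(rho, 4))
```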
REPEATED RANKS
If more than one item in a series has the same value, a common rank is given to the repeated items. This common rank is the average of the ranks which these items would have received if they had been slightly different from each other, and the next item gets the rank following the ranks already used. As a result there is a small adjustment in the rank correlation formula.
For each repeated value, the adjustment factor $\frac{m(m^2 - 1)}{12}$ is added to $\sum d^2$, where $m$ is the number of times the value is repeated.
$\rho = 1 - \frac{6\left[\sum d^2 + \sum \frac{m(m^2 - 1)}{12}\right]}{n(n^2 - 1)}$
Example 2
Calculate rank correlation from the following.
Marks in statistics ( ): 15 20 28 12 40 60 20 80
Marks in Accountancy ( ): 40 30 50 30 20 10 30 60
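Solution
x      y      R1 (rank of x)    R2 (rank of y)    d = R1 - R2    d²
15     40     7                 3                 4              16
20     30     5.5               5                 0.5            0.25
28     50     4                 2                 2              4
12     30     8                 5                 3              9
40     20     3                 7                 -4             16
60     10     2                 8                 -6             36
20     30     5.5               5                 0.5            0.25
80     60     1                 1                 0              0
Total                                                            81.5
In the X series, 20 is repeated 2 times; it shares the 5th and 6th positions, so each 20 gets the average rank 5.5.
In the Y series, 30 is repeated 3 times; it shares the 4th, 5th and 6th positions, so each 30 gets the average rank 5.
Here the adjustment factors are $\frac{2(2^2 - 1)}{12} = 0.5$ for the X series and $\frac{3(3^2 - 1)}{12} = 2$ for the Y series.
$\rho = 1 - \frac{6\left[\sum d^2 + \sum \frac{m(m^2 - 1)}{12}\right]}{n(n^2 - 1)} = 1 - \frac{6[81.5 + 0.5 + 2]}{8(8^2 - 1)}$
$= 1 - \frac{504}{504} = 0$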
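The tie adjustment can be sketched in Python as below. The function names are hypothetical; tied values share the average of the ranks they occupy, and the factor $m(m^2 - 1)/12$ is added once for each repeated value in either series.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 for the largest value; tied values share the average of their ranks."""
    ordered = sorted(values, reverse=True)
    # A value first found at 0-based index i and repeated m times occupies ranks
    # i+1, ..., i+m, whose average is i + (m + 1) / 2.
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def spearman_with_ties(xs, ys):
    """Spearman's rank correlation with the adjustment for repeated ranks."""
    n = len(xs)
    d_squared = sum((a - b) ** 2
                    for a, b in zip(average_ranks(xs), average_ranks(ys)))
    adjustment = sum(m * (m * m - 1) / 12
                     for series in (xs, ys)
                     for m in Counter(series).values() if m > 1)
    return 1 - 6 * (d_squared + adjustment) / (n * (n * n - 1))

statistics_marks = [15, 20, 28, 12, 40, 60, 20, 80]
accountancy_marks = [40, 30, 50, 30, 20, 10, 30, 60]
rho = spearman_with_ties(statistics_marks, accountancy_marks)
print(average_ranks(statistics_marks))
print(average_ranks(accountancy_marks))
print(rho)
```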
EXERCISES
1. Ten competitors in a beauty contest are ranked by three judges in the following order.
First judge: 1 5 4 8 9 6 10 7 3 2
Second judge: 4 8 7 6 5 9 10 3 2 1
Third judge: 6 7 8 1 5 10 9 2 3 4
2. Find rank correlation co-efficient for the following data and give your comments.
Marks in accounts( ): 85 56 89 58 59 67 74 78
Marks in maths ( ) 38 69 56 58 63 78 87 77
4. Find rank correlation co-efficient for the following data and give your comments.
Marks in statistics( ): 84 56 89 58 59 67 74 78
Marks in law( ): 38 69 56 58 63 78 87 77
5. A sample of five fathers and their eldest sons gave the following data about their weight in kgs.
Father( ): 65 60 67 63 74
Son( ): 49 45 57 40 60
Obtain the rank correlation co-efficient and comment on the results.
8. Compute the rank correlation between I.Q’s and marks scored in examination.
Personal: A B C D E F
I.Q: 100 110 140 160 120 130
Exam mark: 70 80 81 78 72 75
9. Ten competitors in a beauty contest are ranked by three judges in the following order.
Judge I: 1 6 5 10 3 2 4 9 7 8
Judge II: 3 5 8 4 7 10 2 1 6 9
Judge III: 6 4 9 8 1 2 3 10 5 7
Use the rank correlation co-efficient to determine which pair of judges has the nearest approach to
common tastes in beauty.
10. The table below shows the respective I.Q’s of 7 father and their eldest sons.
Father I.Q : 90 98 103 104 105 110 114
Sons I.Q : 102 95 107 114 100 98 113
Calculate the rank correlation for the above statistics.
REGRESSION ANALYSIS
INTRODUCTION
In the previous chapter we discussed the nature of the relationship between variables (i.e. correlation) and the extent to which they are correlated. Correlation analysis gives us an idea of the degree of association between the variables, but we would also like to know the functional relationship between the two variables. Regression methods are meant to determine these functional relationships.
REGRESSION
Regression analysis is a mathematical measure of the average relationship between two or more
variables in term of the original units of the data.
In regression analysis there are two types of variable; namely the dependent and independent variable.
A dependent variable is the one whose value is to be predicted and independent variable is the one
which influences the value of the variable (dependent)
LINE OF REGRESSION
In a scatter diagram (discussed in the previous topic), we find that the points cluster around some curve, called the curve of regression. If the curve is a straight line, it is called the line of regression.
"The line of regression is the line which gives the best estimate of the value of one variable for any specific value of the other variable".
Let us consider the relationship between two variables $x$ and $y$. Here there are two lines of regression: (i) $y$ on $x$ and (ii) $x$ on $y$.
(i) Line of regression of $y$ on $x$: Here $y$ is the dependent variable and $x$ is the independent variable, and this line gives the best estimate for the value of $y$ for any specified value of $x$. The regression equation for $y$ on $x$ is given by $y = a + bx$.
(ii) Line of regression of $x$ on $y$: this line is used to estimate the value of $x$ for any specified value of $y$. The regression equation for $x$ on $y$ is given by $x = a + by$.
In the above equations $a$ and $b$ are constants (different for each line) and are obtained by the least squares principle.
LEAST SQUARES METHOD
Legendre’s principle of least squares is “to minimize the sum of the squares of the deviations of the
actual values (of $y$) from their estimated values as given by the line of best fit.”
The regression equation of $y$ on $x$ is $y = a + bx$.
Then, according to the least squares principle, we have to minimize
$S = \sum \left( y - (a + bx) \right)^2$
where $y$ is the actual value and $a + bx$ is the estimated value. $S$ is minimized by differentiating it
partially with respect to $a$ and $b$ and equating the derivatives to zero. In this way we get two
equations:
$\sum y = na + b \sum x$
$\sum xy = a \sum x + b \sum x^2$
where $n$ is the number of pairs of observations. These equations are known as the normal equations.
Solving them simultaneously, we obtain the values of the constants $a$ and $b$.
Regression equation of $x$ on $y$:
The regression equation of $x$ on $y$ is $x = a + by$.
Here we have to minimize $\sum \left( x - (a + by) \right)^2$.
Similarly, here we get
$\sum x = na + b \sum y$
$\sum xy = a \sum y + b \sum y^2$
as the normal equations, and the values of $a$ and $b$ can be obtained from them.
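The step from the sum of squares to the normal equations for $y$ on $x$ can be written out as follows. This is a sketch of the standard derivation; the label $S$ for the sum of squares and $n$ for the number of pairs are notation assumed here.

```latex
\begin{align*}
S &= \sum (y - a - bx)^2 \\
\frac{\partial S}{\partial a} &= -2 \sum (y - a - bx) = 0
  \;\Longrightarrow\; \sum y = na + b \sum x \\
\frac{\partial S}{\partial b} &= -2 \sum x\,(y - a - bx) = 0
  \;\Longrightarrow\; \sum xy = a \sum x + b \sum x^2
\end{align*}
```

Interchanging the roles of $x$ and $y$ in the same steps gives the normal equations for $x$ on $y$.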
Example
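Find the regression equation of $y$ on $x$ for the following data.
$x$: 2 4 6 8
$y$: 5 15 20 25
Solution
          x      y     x²     y²     xy
          2      5      4     25     10
          4     15     16    225     60
          6     20     36    400    120
          8     25     64    625    200
Total    20     65    120   1275    390
With $n = 4$, the normal equations are
$\sum y = na + b \sum x$: $65 = 4a + 20b$ (1)
$\sum xy = a \sum x + b \sum x^2$: $390 = 20a + 120b$ (2)
(1) × 5: $325 = 20a + 100b$ (3)
(3) - (2): $-65 = -20b$, so $b = 3.25$.
Substituting in (1): $65 = 4a + 65$, so $a = 0$.
The regression equation of $y$ on $x$ is therefore $y = 3.25x$.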
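The elimination in this example can be checked numerically. The following is a minimal sketch that solves the two normal equations for $y = a + bx$ directly from the column totals; the function name `fit_y_on_x` is a hypothetical helper, not something defined in these notes.

```python
def fit_y_on_x(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Normal equations:
    #   sum_y  = n*a     + b*sum_x
    #   sum_xy = a*sum_x + b*sum_x2
    # Eliminating a gives b; substituting b back into the first gives a.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


a, b = fit_y_on_x([2, 4, 6, 8], [5, 15, 20, 25])
print(a, b)  # prints: 0.0 3.25
```

The result agrees with solving equations (1) and (2) by hand: $b = 3.25$ and $a = 0$.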
REGRESSION CO-EFFICIENT
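In the regression equation $y = a + bx$, $a$ is the intercept which the line cuts on the $y$-axis and $b$
is the slope of the line, called the regression co-efficient. The regression co-efficient is a measure of
the change in the dependent variable corresponding to a unit change in the independent variable.
$b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = r \frac{\sigma_y}{\sigma_x}$ for $y$ on $x$
Similarly, for $x$ on $y$ we get
$b_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = r \frac{\sigma_x}{\sigma_y}$
where $r$ is the correlation co-efficient and $\sigma_x$, $\sigma_y$ are the standard deviations of $x$ and $y$. Thus
$b_{yx} = r \frac{\sigma_y}{\sigma_x}$ for $y$ on $x$
$b_{xy} = r \frac{\sigma_x}{\sigma_y}$ for $x$ on $y$
Now the two regression equations are
$(y - \bar{y}) = b_{yx} (x - \bar{x})$ for $y$ on $x$
$(x - \bar{x}) = b_{xy} (y - \bar{y})$ for $x$ on $y$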
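The regression co-efficients can be expressed through the correlation co-efficient $r$ and the standard deviations. A short derivation sketch, assuming the usual definitions $r = \frac{\sum (x - \bar{x})(y - \bar{y})}{n \sigma_x \sigma_y}$, $\sigma_x^2 = \frac{1}{n} \sum (x - \bar{x})^2$ and $\sigma_y^2 = \frac{1}{n} \sum (y - \bar{y})^2$:

```latex
\begin{align*}
b_{yx} &= \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2}
        = \frac{n\, r\, \sigma_x \sigma_y}{n\, \sigma_x^2}
        = r\,\frac{\sigma_y}{\sigma_x}, \\
b_{xy} &= \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (y-\bar{y})^2}
        = \frac{n\, r\, \sigma_x \sigma_y}{n\, \sigma_y^2}
        = r\,\frac{\sigma_x}{\sigma_y}.
\end{align*}
```

Multiplying the two results gives $b_{yx} b_{xy} = r^2$.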
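For computation, the regression co-efficients can also be written as follows.
$b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}$
Let $d_x = x - A$ and $d_y = y - B$, where $A$ and $B$ are any convenient (assumed) values. Then
$b_{yx} = \frac{n \sum d_x d_y - \sum d_x \sum d_y}{n \sum d_x^2 - (\sum d_x)^2}$
$b_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = \frac{n \sum xy - \sum x \sum y}{n \sum y^2 - (\sum y)^2}$
Let $d_x = x - A$ and $d_y = y - B$ as before. Then
$b_{xy} = \frac{n \sum d_x d_y - \sum d_x \sum d_y}{n \sum d_y^2 - (\sum d_y)^2}$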
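The two ways of computing the regression co-efficients, from deviations about the true means and from deviations about assumed values, give the same result. The sketch below illustrates this on the data $x$: 2, 4, 6, 8 and $y$: 5, 15, 20, 25; the function names and the assumed values `A = 4`, `B = 15` are illustrative choices, not part of the notes.

```python
def coeffs_from_means(xs, ys):
    """b_yx and b_xy from deviations about the true means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / s_xx, s_xy / s_yy


def coeffs_from_assumed_means(xs, ys, A, B):
    """b_yx and b_xy from deviations dx = x - A, dy = y - B."""
    n = len(xs)
    dx = [x - A for x in xs]
    dy = [y - B for y in ys]
    # The correction term sum(dx)*sum(dy) removes the effect of A and B.
    num = n * sum(u * v for u, v in zip(dx, dy)) - sum(dx) * sum(dy)
    b_yx = num / (n * sum(u * u for u in dx) - sum(dx) ** 2)
    b_xy = num / (n * sum(v * v for v in dy) - sum(dy) ** 2)
    return b_yx, b_xy


xs = [2, 4, 6, 8]
ys = [5, 15, 20, 25]
print(coeffs_from_means(xs, ys))
print(coeffs_from_assumed_means(xs, ys, A=4, B=15))
```

Both calls give $b_{yx} = 3.25$ and $b_{xy} = 260/875 \approx 0.297$, whatever values are chosen for `A` and `B`.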
Example 1
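You are given the following:
          x     y
Mean     10    15
S.D.      2     4
Solution
Regression equation of $y$ on $x$:
$(y - \bar{y}) = r \frac{\sigma_y}{\sigma_x} (x - \bar{x})$
$(y - 15) = r \cdot \frac{4}{2} (x - 10)$
Regression equation of $x$ on $y$:
$(x - \bar{x}) = r \frac{\sigma_x}{\sigma_y} (y - \bar{y})$
$(x - 10) = r \cdot \frac{2}{4} (y - 15)$
Here $r$ is the correlation co-efficient between $x$ and $y$.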
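When only the means, standard deviations and correlation co-efficient are known, both regression lines follow from $b_{yx} = r \sigma_y / \sigma_x$ and $b_{xy} = r \sigma_x / \sigma_y$. The sketch below uses the means and standard deviations of this example; the value `r = 0.5` is purely an assumed figure for illustration, and the function name is hypothetical.

```python
def regression_lines_from_summary(mean_x, mean_y, sd_x, sd_y, r):
    """Return ((a1, b_yx), (a2, b_xy)) for y = a1 + b_yx*x and x = a2 + b_xy*y."""
    b_yx = r * sd_y / sd_x
    b_xy = r * sd_x / sd_y
    # Each line passes through the point of means (mean_x, mean_y).
    a1 = mean_y - b_yx * mean_x
    a2 = mean_x - b_xy * mean_y
    return (a1, b_yx), (a2, b_xy)


# r = 0.5 is an assumed value, chosen only to show the arithmetic.
(a1, b_yx), (a2, b_xy) = regression_lines_from_summary(10, 15, 2, 4, r=0.5)
print(f"y = {a1} + {b_yx} x")
print(f"x = {a2} + {b_xy} y")
```

With this assumed $r$, the lines are $y = 5 + x$ and $x = 6.25 + 0.25y$; a different $r$ changes both slopes but both lines still pass through $(10, 15)$.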
Example
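Find the regression equations for the following data.
$x$: 2 4 6 8
$y$: 10 20 25 30
Solution
Let $d_x = x - 5$ and $d_y = y - 15$.
          x      y     dx     dy    dx²    dy²   dxdy
          2     10     -3     -5      9     25     15
          4     20     -1      5      1     25     -5
          6     25      1     10      1    100     10
          8     30      3     15      9    225     45
Total    20     85      0     25     20    375     65
$\bar{x} = 20/4 = 5$ and $\bar{y} = 85/4 = 21.25$.
Regression equation of $y$ on $x$:
$b_{yx} = \frac{n \sum d_x d_y - \sum d_x \sum d_y}{n \sum d_x^2 - (\sum d_x)^2} = \frac{4(65) - (0)(25)}{4(20) - (0)^2} = \frac{260}{80} = 3.25$
$(y - \bar{y}) = b_{yx} (x - \bar{x})$
$(y - 21.25) = 3.25 (x - 5)$, i.e. $y = 3.25x + 5$
Regression equation of $x$ on $y$:
$b_{xy} = \frac{n \sum d_x d_y - \sum d_x \sum d_y}{n \sum d_y^2 - (\sum d_y)^2} = \frac{4(65) - (0)(25)}{4(375) - (25)^2} = \frac{260}{875} \approx 0.297$
$(x - \bar{x}) = b_{xy} (y - \bar{y})$
$(x - 5) = 0.297 (y - 21.25)$, i.e. $x \approx 0.297y - 1.31$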
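The column totals of a deviation table are easy to get wrong by hand, so it can help to recompute them. The sketch below rebuilds the totals for $x$: 2, 4, 6, 8 and $y$: 10, 20, 25, 30 with deviations taken from 5 and 15, then forms both regression co-efficients; `deviation_totals` is a hypothetical helper name.

```python
def deviation_totals(xs, ys, A, B):
    """Column totals of the table with dx = x - A and dy = y - B."""
    dx = [x - A for x in xs]
    dy = [y - B for y in ys]
    return {
        "n": len(xs),
        "sum_x": sum(xs),
        "sum_y": sum(ys),
        "sum_dx": sum(dx),
        "sum_dy": sum(dy),
        "sum_dx2": sum(u * u for u in dx),
        "sum_dy2": sum(v * v for v in dy),
        "sum_dxdy": sum(u * v for u, v in zip(dx, dy)),
    }


t = deviation_totals([2, 4, 6, 8], [10, 20, 25, 30], A=5, B=15)
num = t["n"] * t["sum_dxdy"] - t["sum_dx"] * t["sum_dy"]
b_yx = num / (t["n"] * t["sum_dx2"] - t["sum_dx"] ** 2)
b_xy = num / (t["n"] * t["sum_dy2"] - t["sum_dy"] ** 2)
x_bar = t["sum_x"] / t["n"]
y_bar = t["sum_y"] / t["n"]

print(t)
print("y on x: slope", b_yx, "intercept", y_bar - b_yx * x_bar)
print("x on y: slope", round(b_xy, 4), "intercept", round(x_bar - b_xy * y_bar, 4))
```

Recomputed this way, the totals are $\sum y = 85$, $\sum d_y = 25$, $\sum d_x^2 = 20$, $\sum d_y^2 = 375$ and $\sum d_x d_y = 65$, giving $b_{yx} = 3.25$ and $b_{xy} = 260/875 \approx 0.297$.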
3. If
The same property holds for the regression co-efficient of $x$ on $y$.
4. If one of the regression co-efficients is greater than unity, the other is less than unity, since
their product $b_{yx} b_{xy} = r^2$ cannot exceed 1;
i.e. if $b_{yx} > 1$ then $b_{xy} < 1$.
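Property 4 follows from the fact that the product of the two regression co-efficients equals $r^2$, which is at most 1. A small numerical check, assuming the data $x$: 2, 4, 6, 8 and $y$: 10, 20, 25, 30 for illustration; the function name is hypothetical.

```python
import math


def coefficients_and_r(xs, ys):
    """Return (b_yx, b_xy, r) computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    r = s_xy / math.sqrt(s_xx * s_yy)  # Pearson correlation co-efficient
    return s_xy / s_xx, s_xy / s_yy, r


b_yx, b_xy, r = coefficients_and_r([2, 4, 6, 8], [10, 20, 25, 30])
print(round(b_yx, 4), round(b_xy, 4), round(r, 4))
# The product of the co-efficients equals r squared, so it cannot exceed 1.
print(math.isclose(b_yx * b_xy, r * r))
```

Here $b_{yx} = 3.25$ is greater than 1, so $b_{xy} \approx 0.297$ is forced below 1.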
Exercise
1. Obtain the regression of $y$ on $x$ and of $x$ on $y$ from the following table, and estimate the
blood pressure when age is 50.
Age (x)    Blood pressure (y)
56         147
42         125
72         160
36         118
63         149
47         128
55         150
49         145
38         115
42         140
68         152
60         155
2. A company wanted to assess the impact of R and D expenditure ($x$) on annual profit ($y$). The
following table presents data for the past 8 years.
R and D expenditure (x)    Annual profit (y)
9                          45
7                          42
5                          41
10                         60
4                          30
5                          34
3                          25
2                          20
Find the regression equation of $y$ on $x$. Estimate the profit for an R and D expenditure of Shs 8
thousand.
4. You are given the following information about advertisement and sales.
          Adv. Exp. (x)     Sales (y)
          (Shs million)     (Shs million)
Mean      20                120
S.D.      5                 25
Correlation co-efficient = 0.8
1. Calculate the two regression equations.
2. Find the likely sales when advertisement expenditure is Shs 25 million.
          x          y
A.M.      74.50      125.50
S.D.      13.07      15.85
Summation of products of corresponding deviations from respective means = 2176. Calculate the
co-efficient of correlation between $x$ and $y$ and find the regression equations.
Mean 20 10
S.D 3 4
Mean 25 22
S.D 4 5
9. If and , find .
10. From the data below, find the two regression equations.
(x)    (y)
25     43
28     46
35     49
32     41
31     36
36     32
29     31
38     30
34     33
32     39
12. To study the effect of rain on yield of wheat the following results were obtained.
Mean S.D
Yield ( ): 800 12
Rainfall ( ): 50 2
13. Find the two regression equations, regression co-efficient of correlation from the following figures.
∑ ∑ ∑ ∑ ∑