Mas 202 Notes

The document outlines a course on statistical inference, covering topics such as the normal distribution, statistical decision problems, point and interval estimates, and hypothesis testing. It explains the characteristics of the normal distribution, methods for finding probabilities, and the use of standard normal tables. Additionally, it provides examples and exercises for practical application of the concepts discussed.

Uploaded by

walkerjames6442
Copyright
© © All Rights Reserved
MAS 202: PRINCIPLES OF STATISTICAL INFERENCE

COURSE DESCRIPTION

Meaning of statistics; objectives of statistical investigation. Statistical decision problems; basic concepts of inference. Role of the normal distribution in statistics; random samples; use of random number tables. Inference about population means: point and interval estimates, simple one-sample and two-sample tests. Linear regression and correlation analysis. Simple nonparametric tests.

References

1. Advanced Level Statistics, 4th ed., by Crawshaw, J. and Chambers, J.
2. Any other relevant book

THE NORMAL DISTRIBUTION

The normal distribution is one of the most important distributions in statistics. Many measured quantities in the natural sciences follow a normal distribution, and under certain circumstances it is also a useful approximation to the binomial distribution and to the Poisson distribution.

The normal variable X is continuous. Its probability density function f(x) depends on its mean µ and standard deviation σ, where

f(x) = (1 / (σ√(2π))) e^(−(x − µ)² / (2σ²)),   −∞ < x < ∞

To describe the distribution, write X ~ N(µ, σ²), where µ is the mean and σ² is the variance.

The normal distribution curve has the following features:

• It is bell-shaped.
• It is symmetrical about µ.
• It extends from −∞ to +∞.
• The maximum value of f(x) is 1/(σ√(2π)).
• The total area under the curve is 1.

Notice also that

• Approximately 95% of the distribution lies within two standard deviations of the mean.
• Approximately 99.7% (very nearly all) of the distribution lies within three standard deviations of the mean.

The spread of the distribution depends on σ. Here are some normal curves, each drawn to the same scale.

[Figure: three normal curves with different means and standard deviations, drawn to the same scale; the curve with the smallest standard deviation is the tallest and narrowest.]

Finding probabilities

The probability that X lies between a and b is written P(a < X < b). To find this probability, you need to find the area under the curve between a and b.

One way of finding areas is to integrate, but the normal density function has no elementary antiderivative, so the integral cannot be written in closed form. Tables are used instead.
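Software can stand in for the table lookup. The following is a minimal sketch, assuming Python 3.8 or later and its standard `statistics.NormalDist` class; the helper name `interval_prob` is illustrative and not part of these notes.

```python
from statistics import NormalDist

def interval_prob(a, b, mu, sigma):
    """P(a < X < b) for X ~ N(mu, sigma^2), as a difference of two areas."""
    x = NormalDist(mu, sigma)
    # cdf(t) is the area under the curve to the left of t.
    return x.cdf(b) - x.cdf(a)

# Area within two standard deviations of the mean (roughly 95%).
within_two_sd = interval_prob(-2, 2, mu=0, sigma=1)
print(round(within_two_sd, 4))
```

The function simply subtracts the area to the left of a from the area to the left of b, which is the same reasoning used with the printed tables.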

The standard normal variable, Z

In order to use the same set of tables for all possible values of µ and σ, the variable X is standardized so that the mean is 0 and the standard deviation is 1. Notice that since the variance is the square of the standard deviation, the variance is also 1. This standardized normal variable is called Z, and Z ~ N(0, 1).
To illustrate how the variable X is standardized to the variable Z, consider X distributed normally with mean 50 and variance 4, i.e. X ~ N(50, 4), µ = 50 and σ² = 4, so σ = 2.
The maximum value of f(x) is 1/(2√(2π)) ≈ 0.2, and the curve is shown in the right-hand section of the diagram below.

Now translate the curve 50 units to the left so that the mean is 0. This is shown in the left-hand section of the diagram. The standard deviation is still 2, so the maximum value is again approximately 0.2.

[Figure: the curve of X ~ N(50, 4), centred at 50, and the same curve translated 50 units to the left so that it is centred at 0; both have a maximum height of about 0.2.]

Now 'squash' the curve towards the vertical axis so that the standard deviation is 1. This is done by dividing by the standard deviation, 2. Since the total area must remain 1, the maximum height doubles to approximately 0.4.

[Figure: the standard normal curve, centred at 0 with a maximum height of about 0.4.]

So you write Z = (X − 50)/2, where Z ~ N(0, 1).

In general
To standardize X, where X ~ N(µ, σ²):
• Subtract the mean µ.
• Then divide by the standard deviation σ to obtain Z = (X − µ)/σ, where Z ~ N(0, 1).

Using Standard Normal tables


The standard normal table gives the area under the curve as far as a particular value z. This is written Φ(z).
This area gives the probability that Z is less than a particular value z, so P(Z < z) = Φ(z).
Note that Φ is a Greek letter, pronounced phi.

[Figure: standard normal curve with the area to the left of z shaded, representing Φ(z).]
Exercise
Read the values of the following from the normal table
1. ( ) {Ans. 0.5636}
2. ( ) {Ans. 0.6350}
3. ( ) {Ans. 0.6660}

Note
When calculating probabilities, remember that the total area under the standardized normal curve is 1.

Example
Using the standard normal tables, find

a) P(Z < 0.85)    b) P(Z > 0.85)

Solution

a) P(Z < 0.85) = Φ(0.85) = 0.8023
b) P(Z > 0.85) = 1 − Φ(0.85) = 1 − 0.8023 = 0.1977

In general,
P(Z < a) = Φ(a) and P(Z > a) = 1 − Φ(a)

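Finding probabilities involving negative values of z

The standard normal tables start at z = 0. You can, however, find probabilities relating to negative values of z by using the symmetrical properties of the curve.

To find P(Z < −a), where a > 0:

P(Z < −a) = P(Z > a) = 1 − Φ(a), so Φ(−a) = 1 − Φ(a)

To find P(Z > −a):

P(Z > −a) = 1 − Φ(−a) = 1 − (1 − Φ(a)) = Φ(a)

[Figure: pairs of standard normal curves showing that the area to the left of −a equals the area to the right of a, and that the area to the right of −a equals the area to the left of a.]

From the diagrams it is clear that

P(Z < −a) = P(Z > a)

P(Z > −a) = P(Z < a)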

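Example

Z ~ N(0, 1)

a) Using the standard normal tables, find P(Z < 1.377).
b) Drawing sketches to illustrate your answers, find
i. P(Z > −1.377)
ii. P(Z > 1.377)
iii. P(Z < −1.377)

(Give your answers correct to two significant figures.)

Solution

a) P(Z < 1.377) = Φ(1.377) = 0.9158 = 0.92 (2 s.f.)

b) i) Using P(Z > −a) = P(Z < a) = Φ(a):
P(Z > −1.377) = Φ(1.377) = 0.92 (2 s.f.)

ii) Using P(Z > a) = 1 − Φ(a):
P(Z > 1.377) = 1 − 0.9158 = 0.084 (2 s.f.)

iii) Using P(Z < −a) = P(Z > a) = 1 − Φ(a):
P(Z < −1.377) = 1 − 0.9158 = 0.084 (2 s.f.)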

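NORMAL DISTRIBUTION CONTINUED

Important results

In the following, a > 0, b > 0 and a < b.

a) P(a < Z < b) = Φ(b) − Φ(a)

Example

P(0.345 < Z < 1.751) = Φ(1.751) − Φ(0.345) = 0.9600 − 0.6350 = 0.325

b) P(−a < Z < b) = Φ(b) − Φ(−a) = Φ(b) − (1 − Φ(a))
= Φ(a) + Φ(b) − 1

Example

P(−1.696 < Z < 1.865) = Φ(1.696) + Φ(1.865) − 1 = 0.9551 + 0.9689 − 1 = 0.92 (2 s.f.)

c) P(−b < Z < −a) = Φ(−a) − Φ(−b)
= (1 − Φ(a)) − (1 − Φ(b))
= Φ(b) − Φ(a)

Example
P(−1.4 < Z < −0.6) = Φ(1.4) − Φ(0.6) = 0.9192 − 0.7257 = 0.1935

d) P(|Z| < a) = P(−a < Z < a)
= Φ(a) − Φ(−a)
= 2Φ(a) − 1

Example

P(|Z| < 1.433) = 2Φ(1.433) − 1
= 2(0.9240) − 1 = 0.848

Example

Find
a) P(−1.96 < Z < 1.96)
b) P(−2.575 < Z < 2.575)

Solution
a) P(−1.96 < Z < 1.96) = 2Φ(1.96) − 1 = 2(0.975) − 1 = 0.95

The central 95% of the distribution lies between ±1.96, leaving 2.5% in each tail.

b) P(−2.575 < Z < 2.575) = 2Φ(2.575) − 1 = 2(0.995) − 1 = 0.99

The central 99% of the distribution lies between ±2.575, leaving 0.5% in each tail.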

Exercise
Draw sketches to illustrate your answer and consider whether your answer is sensible.
1. If ( ), Find,
a) ( ) b) ( ) c) ( ) d) ( )
e) ( ) f) ( ) g) ( )
h) (| | ) i) (| | )

2) If ( ), Find the probabilities represented by the shaded areas in the diagrams

a) b)

0
1.5
2

-15 0 1.5

3) If ( ), Complete this statement:

The central………% of the distribution lies between

-0.674 0 0.674

4) If ( ) and ( ) ( ) , Find
a) ( ) b) ( )

5) If ( ) and ( ) ( ) Find
a) ( ) b) ( )

USING STANDARD NORMAL TABLES FOR ANY NORMAL VARIABLE
Remember that to standardize X, where X ~ N(µ, σ²):

• Subtract the mean µ.
• Then divide by the standard deviation σ to give Z = (X − µ)/σ, where Z ~ N(0, 1).

The procedure is illustrated in the following example.
Example
Lengths of metal strips produced by a machine are normally distributed with a mean length of 150 cm and a standard deviation of 10 cm.
Find the probability that the length of a randomly selected strip is
a) Shorter than 165 cm
b) Within 5 cm of the mean

Solution
X is the length, in centimeters, of a metal strip. Since µ = 150 and σ = 10, X ~ N(150, 10²).

a) You need to find the probability that the length is shorter than 165 cm, i.e. P(X < 165).
To be able to use the standard normal tables, standardize the variable by subtracting the mean, 150, then dividing by the standard deviation, 10. Apply this to both sides of the inequality:
X becomes (X − 150)/10 = Z
165 becomes (165 − 150)/10 = 1.5
So P(X < 165) becomes P(Z < 1.5)
P(X < 165) = P(Z < 1.5)
= Φ(1.5)
= 0.9332
= 0.93 (2 s.f.)
The probability that the length is shorter than 165 cm is 0.93.
Note: Although the X and Z distributions have different spreads, in practice it is convenient to show the values for both distributions on one sketch.

[Figure: normal curve with x-values 150 and 165 marked above the corresponding z-values 0 and 1.5.]

b) To find the probability that the length is within 5 cm of the mean, you need to find P(|X − 150| < 5).
Dividing by the standard deviation gives
P(|X − 150|/10 < 0.5), i.e. P(|Z| < 0.5)
P(|Z| < 0.5) = 2Φ(0.5) − 1
= 2(0.6915) − 1
= 0.383
= 0.38 (2 s.f.)
The probability that the length is within 5 cm of the mean is 0.38.

[Figure: normal curve with x-values 145, 150 and 155 marked above the corresponding z-values −0.5, 0 and 0.5.]
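The two steps of standardizing (subtract the mean, divide by the standard deviation) can be checked numerically for the metal strip example. This is a minimal sketch, assuming Python 3.8 or later; the function name `standardize` is illustrative only.

```python
from statistics import NormalDist

MU, SIGMA = 150.0, 10.0   # strip length: mean and standard deviation in cm
Z = NormalDist()          # standard normal variable, N(0, 1)

def standardize(x, mu=MU, sigma=SIGMA):
    """Subtract the mean, then divide by the standard deviation."""
    return (x - mu) / sigma

# a) P(X < 165) = P(Z < 1.5) = Phi(1.5)
p_shorter = Z.cdf(standardize(165))

# b) P(|X - 150| < 5) = P(|Z| < 0.5) = 2 * Phi(0.5) - 1
p_within = 2 * Z.cdf(standardize(155)) - 1

print(round(p_shorter, 4), round(p_within, 3))
```

Both results agree with the table-based working to the accuracy of the four-figure tables.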

Example
The time taken by the milkman to deliver to the High Street is normally distributed with a mean of 12
minutes and a standard deviation of 2 minutes. He delivers milk every day. Estimate the number of days
during the year when he takes
a) Longer than 17 minutes,
b) Less than 10 minutes,
c) Between 9 and 13 minutes.

Solution

X is the time, in minutes, taken to deliver milk to the High Street.

X ~ N(12, 2²)
Standardize using Z = (X − 12)/2.

a) P(X > 17) = P(Z > (17 − 12)/2) = P(Z > 2.5)

= 1 − Φ(2.5)

= 0.0062

To find the number of days, multiply by 365: 365 × 0.0062 = 2.263 ≈ 2

On two days in the year he takes longer than 17 minutes.

b) P(X < 10) = P(Z < (10 − 12)/2)
= P(Z < −1)
= 1 − Φ(1)
= 1 − 0.8413
= 0.1587
Now 365 × 0.1587 = 57.92 ≈ 58

On 58 days in the year he takes less than ten minutes.

c) P(9 < X < 13) = P((9 − 12)/2 < Z < (13 − 12)/2)
= P(−1.5 < Z < 0.5)
= Φ(0.5) + Φ(1.5) − 1
= 0.6915 + 0.9332 − 1
= 0.6247
Now 365 × 0.6247 = 228.01 ≈ 228
On 228 days in the year he takes between nine and thirteen minutes.
Note:
Since X is a continuous variable, the probability of any single exact value is zero, so P(X < a) and P(X ≤ a) are indistinguishable.
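The delivery-time example turns each probability into an expected number of days by multiplying by 365. A short sketch of that calculation, assuming Python 3.8 or later (the variable names are illustrative):

```python
from statistics import NormalDist

delivery = NormalDist(mu=12, sigma=2)   # delivery time in minutes
DAYS = 365

p_longer_17 = 1 - delivery.cdf(17)                  # P(X > 17)
p_less_10 = delivery.cdf(10)                        # P(X < 10)
p_9_to_13 = delivery.cdf(13) - delivery.cdf(9)      # P(9 < X < 13)

for label, p in [("longer than 17 min", p_longer_17),
                 ("less than 10 min", p_less_10),
                 ("between 9 and 13 min", p_9_to_13)]:
    # Expected number of days = number of days * probability, rounded.
    print(label, round(p, 4), round(DAYS * p))
```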

Exercises
Find probabilities using ( )
1) The masses of packages from a particular machine are normally distributed with a mean of 200g and
a standard deviation of 2g. Find the probability that a randomly selected package from the machine
weighs
a) Less than 197 g
b) More than 200.5 g
c) Between 198.5 g and 199.5 g

2) The life of a certain make of electric light bulb is known to be normally distributed with a mean life
of 2000 hours and a standard deviation of 120 Hours. Estimate the probability that the life of such a
bulb will be
a) Greater than 2150 hours
b) Greater than 1910 hours
c) Within the range 1850 hours to 2090 hours.

USING THE STANDARD NORMAL TABLES IN REVERSE TO FIND z WHEN Φ(z) IS KNOWN

a) If you are given that Φ(z) = 0.9406, to find z look for 0.9406 in the main body of the table.
This occurs when z = 1.56, so if Φ(z) = 0.9406 then
z = Φ⁻¹(0.9406) = 1.56
This means that the value of z such that P(Z < z) = 0.9406 is 1.56.
b) Find z if P(Z < z) = 0.9579, i.e. Φ(z) = 0.9579.
So z = Φ⁻¹(0.9579)

Look for 0.9579 in the main body of the table. It does not appear, so look for the number just below it. This is 0.9573, and it occurs when z = 1.72. To get the digits 9579 you need to add 6 to 9573. Look in the difference columns at the far right-hand side of the table to find 6. It is in column 7.

So z = Φ⁻¹(0.9579) = 1.727
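Reading the table in reverse corresponds to evaluating the inverse of Φ. A minimal sketch, assuming Python 3.8 or later, where `NormalDist.inv_cdf` plays the role of the reverse lookup:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal variable, N(0, 1)

# a) z such that Phi(z) = 0.9406
z_a = Z.inv_cdf(0.9406)

# b) z such that Phi(z) = 0.9579
z_b = Z.inv_cdf(0.9579)

print(round(z_a, 2), round(z_b, 3))
```

Small differences in the last digit compared with the tables can occur, because the printed tables are rounded to four figures.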

Example
1) If ( ) Find the value of if
a) ( )
b) ( )
c) ( )
d) ( )

Solution
a) ( ) ( ) ( ) =1.87
b) ( ) ( )
( ) ( )
= 0.305

c) ( )
Since the probability is greater than 0.5, must be negative, and therefore – is positive using
symmetry, ( )
- ( )

d) ( )
Since the probability is less than 0.5, must be negative. Using symmetry,
( )
- = ( )
= - 1.41

Example
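If Z ~ N(0, 1), find a such that P(|Z| < a) = 0.9.

Solution
P(|Z| < a) = P(−a < Z < a) = 0.9
From symmetry, 2Φ(a) − 1 = 0.9
2Φ(a) = 1.9, so Φ(a) = 0.95 and a = Φ⁻¹(0.95)
= 1.645
This means that the central 90% of the standard normal distribution lies between ±1.645.
Alternatively, if P(|Z| < a) = 0.9,
then the value of a corresponds to an upper tail probability of 0.05 and a lower tail probability of 0.95.
Therefore Φ(a) = 0.95 and a = Φ⁻¹(0.95)
= 1.645

[Figure: standard normal curves showing a tail area of 0.05 above a, and tail areas of 0.05 below −a and above a.]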
Example

The heights of female students at a particular college are normally distributed with a mean of 169 cm and a standard deviation of 9 cm.
a) Given that 80% of these female students have a height less than s cm, find the value of s.
b) Given that 60% of these female students have a height greater than s cm, find the value of s.

Solution
X is the height, in centimeters, of a female student. X ~ N(169, 9²)
a) Given P(X < s) = 0.8
Standardizing, let z = (s − 169)/9
P(Z < z) = 0.8, so z = Φ⁻¹(0.8) = 0.842
Therefore s = 169 + 9 × 0.842 = 176.578
= 176.6 (1 d.p.)
b) Given P(X > s) = 0.6
Standardizing, let z = (s − 169)/9
P(Z > z) = 0.6
Since the probability is greater than 0.5, z must be negative, and P(Z < −z) = 0.6
−z = Φ⁻¹(0.6) = 0.253, so z = −0.253
Therefore s = 169 − 9 × 0.253
= 166.723
= 166.7 (1 d.p.)

Example

1. The marks of 500 candidates in an examination are normally distributed with a mean of 45 marks and a standard deviation of 20 marks.
a) Given that the pass mark is 41, estimate the number of candidates who passed the examination.
b) If 5% of the candidates obtain a distinction by scoring x marks or more, estimate the value of x.
c) Estimate the interquartile range of the distribution.

Solution

X is the examination mark.

X ~ N(45, 20²)

a) P(X > 41) = P(Z > (41 − 45)/20) = P(Z > −0.2)
= Φ(0.2)
Therefore P(X > 41) = 0.5793
Since there are 500 candidates, to find the number of candidates who pass, multiply the probability by 500.
500 × 0.5793 = 289.65
Therefore 290 candidates passed.

b) P(X > x) = 0.05
Writing z for the standardized value of x,
P(Z > z) = 0.05, where z = (x − 45)/20, so Φ(z) = 0.95
z = Φ⁻¹(0.95) = 1.645

Therefore x = 45 + 20 × 1.645 = 77.9
A distinction is awarded for a mark of 78 or more.
c) The interquartile range encloses the central 50% of the distribution, between the lower quartile q₁ and the upper quartile q₃, with 25% of the distribution in each tail.

If P(−z < Z < z) = 0.5, then z corresponds to an upper tail probability of 25%. So Φ(z) = 0.75
z = Φ⁻¹(0.75)
= 0.674
Now z is the standardized value of the upper quartile, so q₃ is such that
(q₃ − 45)/20 = 0.674
q₃ = 58.48
The lower quartile q₁ is such that
(q₁ − 45)/20 = −0.674
q₁ = 31.52
Interquartile range = 58.48 − 31.52 = 26.96
= 27 (2 s.f.)
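The three parts of the examination example (a count, a cut-off mark and the interquartile range) can be checked with the inverse and forward cumulative functions. A minimal sketch, assuming Python 3.8 or later; the variable names are illustrative.

```python
from statistics import NormalDist

marks = NormalDist(mu=45, sigma=20)
CANDIDATES = 500

# a) expected number of candidates scoring above the pass mark of 41
passed = CANDIDATES * (1 - marks.cdf(41))

# b) mark exceeded by the top 5% of candidates
distinction = marks.inv_cdf(0.95)

# c) interquartile range = upper quartile - lower quartile
q1, q3 = marks.inv_cdf(0.25), marks.inv_cdf(0.75)
iqr = q3 - q1

print(round(passed), round(distinction, 1), round(iqr, 2))
```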

Exercise

1. If ( ), find the upper quartile and the lower quartile of the distribution. Find also the fourth
percentile

2. The masses of cos lettuces sold at a market are normally distributed with mean mass 600g and
standard deviation 20g.
a) If a lettuce is chosen at random, find the probability that its mass lies between 570g and 610g.
b) Find the mass exceeded by 7% of the lettuces
c) In one day, 1000 lettuces are sold
Estimate how many weigh less than 545g.

3. If ( )
a) Find the limits within which the central 95% of the distribution lies
b) Find the interquartile range of the distribution

FINDING THE VALUE OF µ OR σ OR BOTH

Example

The lengths of certain items follow a normal distribution with mean µ cm and standard deviation 6 cm. It is known that 4.78% of the items have length greater than 82 cm. Find the value of the mean µ.

Solution

X is the length, in centimeters, of an item. X ~ N(µ, 6²) and P(X > 82) = 0.0478

Since P(X > 82) is less than 0.5, 82 must be greater than µ.
P(X > 82) = P(Z > z), where z = (82 − µ)/6
So P(Z > z) = 0.0478
Φ(z) = 1 − 0.0478 = 0.9522
z = Φ⁻¹(0.9522) = 1.667
Therefore (82 − µ)/6 = 1.667, so µ = 82 − 6 × 1.667 = 71.998

The mean µ = 72 (2 s.f.)

Example
If X ~ N(100, σ²) and P(X < 106) = 0.8849,
find the value of the standard deviation σ.
Solution
P(X < 106) = P(Z < z), where z = (106 − 100)/σ = 6/σ

Φ(z) = 0.8849, so z = Φ⁻¹(0.8849) = 1.2

Therefore 6/σ = 1.2

The standard deviation σ = 5
Example

The masses of boxes of oranges are normally distributed such that 30% of them are greater than 4.00 kg
and 20% are greater than 4.53kg. Estimate the mean and standard deviation of the masses.

Solution

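X is the mass, in kilograms, of a box of oranges.

X ~ N(µ, σ²), where µ and σ are to be found.

P(X > 4.00) = 0.30, so P(Z > z₁) = 0.30, where z₁ = (4.00 − µ)/σ

Φ(z₁) = 0.70, so z₁ = Φ⁻¹(0.70) = 0.524

Therefore

4.00 − µ = 0.524σ   (1)

Also, P(X > 4.53) = 0.20, so P(Z > z₂) = 0.20, where z₂ = (4.53 − µ)/σ

Φ(z₂) = 0.80, so z₂ = Φ⁻¹(0.80) = 0.842

Therefore

4.53 − µ = 0.842σ   (2)

Equation (2) − equation (1) gives

0.53 = 0.318σ, so σ = 1.667

Substituting in equation (1):

µ = 4.00 − 0.524 × 1.667 = 3.127

Therefore µ = 3.13 kg and σ = 1.67 kg (3 s.f.)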
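The pair of simultaneous equations in this kind of problem always has the form x = µ + zσ, so the elimination step can be written once and reused. A minimal sketch, assuming Python 3.8 or later; the function `fit_from_upper_tails` is a hypothetical helper, not something defined in these notes.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal variable, N(0, 1)

def fit_from_upper_tails(x1, p1, x2, p2):
    """Solve x_i = mu + z_i * sigma, where P(X > x_i) = p_i."""
    z1 = Z.inv_cdf(1 - p1)   # upper tail p1 means lower tail 1 - p1
    z2 = Z.inv_cdf(1 - p2)
    sigma = (x2 - x1) / (z2 - z1)   # equation (2) minus equation (1)
    mu = x1 - z1 * sigma            # substitute back into equation (1)
    return mu, sigma

# Boxes of oranges: 30% exceed 4.00 kg and 20% exceed 4.53 kg.
mu, sigma = fit_from_upper_tails(4.00, 0.30, 4.53, 0.20)
print(round(mu, 2), round(sigma, 2))
```

Because `inv_cdf` is not rounded to three decimal places as the table values are, the computed mean may differ from the table-based working in the third significant figure.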

Exercise

1. The speeds of cars passing a certain point on a motorway can be taken to be normally distributed. Observations show that, of cars passing the point, 95% are travelling at less than 85 m.p.h. and 10% are travelling at less than 55 m.p.h.
a) Find the average speed of the cars passing the point.
b) Find the proportion of cars that travel at more than 70 m.p.h.

2. On a particular day, 50% of the employees in a large company had arrived at work by 8.30 am, and 10% had not arrived by 8.55 am.
a) Assuming a normal model, find the standard deviation of the arrival times, in minutes.
b) It is given that only 5% of the employees had arrived by 8.05 am. Without further calculation, explain why this might suggest that a normal model is not appropriate.
c) Eighty employees are selected at random. Find the expectation of the number of these employees that arrived between 8.30 am and 8.55 am.

THE NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION


Under certain circumstances the normal distribution can be used as an approximation to the binomial
distribution. One practical advantage is that the calculations for finding probabilities are much less
tedious to perform.
The diagrams below illustrate the distribution of X ~ B(n, p) for p = 0.5 and p = 0.2, for various values of n.

[Figure: bar charts of P(X = x) for binomial distributions with p = 0.5 and p = 0.2 at several values of n.]

Notice
When p = 0.5 the distributions are symmetric, and for larger values of n the distribution takes the characteristic normal shape.
When p = 0.2, the distribution is positively skewed for small values of n, but when n = 20 the distribution is almost symmetrical and bell-shaped.
For the discrete random variable X, distributed binomially where X ~ B(n, p), if n and p are such that np > 5 and nq > 5, where q = 1 − p, then X ~ N(np, npq) approximately.
CONTINUITY CORRECTION
The following example compares probabilities obtained using a binomial distribution and a normal approximation. It also illustrates the use of a continuity correction, which is needed when using a continuous distribution (the normal) as an approximation for a discrete distribution (the binomial).

Examples
Find the probability of obtaining 4, 5, 6 or 7 heads when a fair coin is tossed 12 times.
a) Using the binomial distribution
b) Using a normal approximation to the binomial distribution.

Solution

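X is the number of heads in 12 tosses. Since the coin is fair, P(head) = 0.5 and X ~ B(12, 0.5).

a) Using the binomial distribution,

P(4 ≤ X ≤ 7) = P(X = 4) + P(X = 5) + P(X = 6) + P(X = 7)
P(X = 4) = 12C4 (0.5)¹² = 0.1208
P(X = 5) = 12C5 (0.5)¹² = 0.1934
P(X = 6) = 12C6 (0.5)¹² = 0.2256
P(X = 7) = 12C7 (0.5)¹² = 0.1934
P(4 ≤ X ≤ 7) = 0.7332

b) The diagram below shows the probability distribution for X ~ B(12, 0.5). Note that the vertical lines have been replaced by rectangles to help illustrate the intention to use a continuous distribution as an approximation for a discrete one. The required binomial probability is represented by the sum of the areas of the shaded rectangles.
First check the conditions for a normal approximation: np = 12 × 0.5 = 6 and nq = 12 × 0.5 = 6.

Since np > 5 and nq > 5, use the normal approximation X ~ N(np, npq) with np = 6 and npq = 12 × 0.5 × 0.5 = 3, so

X ~ N(6, 3)
Superimposing the curve, which is approximately N(6, 3), the probability of obtaining 4, 5, 6 or 7 heads is found by considering the area under this normal curve from x = 3.5 to x = 7.5.

[Figure: histogram of the number of heads (0 to 12) with the rectangles for 4, 5, 6 and 7 shaded, and the normal curve superimposed; the shaded region runs from 3.5 to 7.5.]

P(4 ≤ X ≤ 7) transforms to P(3.5 < X < 7.5) using a continuity correction. Writing the symbol → to represent 'transforms to':

P(4 ≤ X ≤ 7) → P(3.5 < X < 7.5)   (continuity correction)

= P((3.5 − 6)/√3 < Z < (7.5 − 6)/√3)

= P(−1.443 < Z < 0.866)

= Φ(0.866) + Φ(1.443) − 1

= 0.8068 + 0.9255 − 1 = 0.7323

Note that the probabilities found by the two different methods compare well, and the working for part (b) is quicker to perform. The approximation is good because, although n is not very large, p = 0.5, so the binomial distribution is symmetric.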
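The comparison between the exact binomial probability and the normal approximation with a continuity correction can be reproduced directly. A minimal sketch, assuming Python 3.8 or later (for `math.comb` and `statistics.NormalDist`); variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 12, 0.5
q = 1 - p

# Exact binomial probability P(4 <= X <= 7).
exact = sum(comb(n, k) * p**k * q**(n - k) for k in range(4, 8))

# Normal approximation X ~ N(np, npq); NormalDist takes the standard deviation.
approx_dist = NormalDist(mu=n * p, sigma=sqrt(n * p * q))

# Continuity correction: the integers 4..7 become the interval (3.5, 7.5).
approx = approx_dist.cdf(7.5) - approx_dist.cdf(3.5)

print(round(exact, 4), round(approx, 4))
```

Dropping the correction (using 4 and 7 as the limits) gives a noticeably smaller area, which is why the half-unit adjustment matters for small n.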

Example
It is given that 40% of the population supports the Gamboge party. One hundred and fifty members of the population are selected at random. Use a suitable approximation to find the probability that more than 55 out of the 150 support the Gamboge party.

Solution
Let X be the number in 150 who support the Gamboge party.
So X ~ B(150, 0.4) with n = 150 and p = 0.4. Check np = 60 and nq = 90.

Since np > 5 and nq > 5, use the normal approximation

X ~ N(np, npq) with np = 60 and npq = 150 × 0.4 × 0.6 = 36
So X ~ N(60, 36)
P(X > 55) → P(X > 55.5)   (continuity correction)
= P(Z > (55.5 − 60)/6)
= P(Z > −0.75)
= Φ(0.75)
= 0.7734

DECIDING WHEN TO USE A NORMAL APPROXIMATION AND WHEN TO USE A POISSON


APPROXIMATION FOR A BINOMIAL DISTRIBUTION.

For X ~ B(n, p):
• A Poisson approximation can be used when n is large (n > 50) and p is small (p < 0.1).
Then X ~ Po(np) approximately.
• A normal approximation can be used when n and p are such that np > 5 and nq > 5, where q = 1 − p.
Then X ~ N(np, npq) approximately.

Example

A number of different types of fungi are distributed at random in a field. 80 % of these fungi are
mushrooms, and the remainder is toadstools. 5 % of the toadstools are poisonous. A man, who cannot
distinguish between mushrooms and toadstools, wanders across the field and picks a total of 100 fungi.
Determine; correct to two significant figures, using appropriate approximations, the probability that the
man has picked.
a) At least 20 toadstools.
b) Exactly two poisonous toadstools.

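Solution

P(mushroom) = 0.8, P(toadstool) = 0.2,

P(poisonous toadstool) = 0.05 × 0.2 = 0.01

a) Let X be the number of toadstools picked in a sample of 100. X ~ B(100, 0.2)

Since np = 20 > 5 and nq = 80 > 5, use a normal approximation with mean = np = 20 and

variance = npq = 100 × 0.2 × 0.8 = 16
X ~ N(20, 16)
P(X ≥ 20) → P(X > 19.5)   (continuity correction)
= P(Z > (19.5 − 20)/4)
= P(Z > −0.125)
= Φ(0.125)
= 0.55 (2 s.f.)

b) Let Y be the number of poisonous toadstools in 100. Y ~ B(100, 0.01)

Use a Poisson distribution, since n > 50 and p < 0.1, with mean np = 1:
Y ~ Po(1)
P(Y = 2) = e⁻¹ × 1² / 2!
= 0.18 (2 s.f.)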

THE NORMAL APPROXIMATION TO THE POISSON DISTRIBUTION

If X follows a Poisson distribution with parameter λ, i.e. X ~ Po(λ), then E(X) = λ and

Var(X) = λ

When λ is large, the normal distribution can be used as an approximation, where X ~ N(λ, λ). As with the normal approximation to the binomial distribution, a continuity correction is needed, since you are using a continuous distribution as an approximation to a discrete one.

Example
A radioactive disintegration gives counts that follow a Poisson distribution with a mean count of 25 per second. Find the probability that in a one-second interval the count is between 23 and 27 inclusive.

Solution
Let X be the radioactive count in a one-second interval. X ~ Po(25)

E(X) = 25 and Var(X) = 25

Using a normal approximation,

X ~ N(25, 25)

P(23 ≤ X ≤ 27) → P(22.5 < X < 27.5)   (continuity correction)

= P((22.5 − 25)/5 < Z < (27.5 − 25)/5)

= P(−0.5 < Z < 0.5)

= 2Φ(0.5) − 1

= 2(0.6915) − 1

= 0.383
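To see how close the approximation is for the radioactive count example, the exact Poisson probability can be summed term by term and compared with the corrected normal area. A minimal sketch, assuming Python 3.8 or later; variable names are illustrative.

```python
from math import exp, factorial, sqrt
from statistics import NormalDist

lam = 25  # mean count per second

# Exact Poisson probability P(23 <= X <= 27).
exact = sum(exp(-lam) * lam**k / factorial(k) for k in range(23, 28))

# Normal approximation X ~ N(lam, lam), with the continuity-corrected
# interval (22.5, 27.5) standing in for the integers 23..27.
approx_dist = NormalDist(mu=lam, sigma=sqrt(lam))
approx = approx_dist.cdf(27.5) - approx_dist.cdf(22.5)

print(round(exact, 4), round(approx, 4))
```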

Example

A product is sold in packets whose masses are normally distributed with a mean of 1.42 kg and a standard deviation of 0.025 kg.

a) Find the probability that the mass of a packet, selected at random, lies between 1.37 kg and 1.45 kg.
b) Estimate the number of packets, in an output of 5000, whose mass is less than 1.35 kg.

Solution

Let X be the mass, in kilograms, of a packet. X ~ N(1.42, 0.025²)

a) P(1.37 < X < 1.45) is required.
Standardizing, using Z = (X − 1.42)/0.025:
P(1.37 < X < 1.45) = P((1.37 − 1.42)/0.025 < Z < (1.45 − 1.42)/0.025)
= P(−2 < Z < 1.2)
= Φ(1.2) + Φ(2) − 1 = 0.8849 + 0.9772 − 1 = 0.8621

The probability that the mass lies between 1.37 kg and 1.45 kg is 0.86 (2 s.f.).
b) P(X < 1.35) = P(Z < (1.35 − 1.42)/0.025)
= P(Z < −2.8)
= 1 − Φ(2.8) = 1 − 0.9974 = 0.0026

Since there are 5000 packets, multiply the probability by 5000: 5000 × 0.0026 = 13. So 13 packets have a mass less than 1.35 kg.

Example

Machine A, used for filling bags with ground coffee, can be set to dispense any required mean weight of coffee per bag. At any setting the weight of coffee in a bag can be modeled by a normal distribution with a standard deviation of 1.95 g.
a) If the machine is set to dispense a mean weight of 128 g of coffee per bag, calculate the percentage of bags that contain less than 125 g.
b) To meet an official regulation, the setting on a machine must be adjusted so that no more than 1% of bags contain less than 125 g.
i) Calculate the smallest mean weight to which machine A should be set to meet the regulation.
ii) Machine B will only just meet the regulation when it is set to dispense a mean weight of 128.5 g. Assuming that the weight of coffee in a bag filled by machine B can be modeled by a normal distribution, calculate the standard deviation of this distribution.

Solution
Let X be the weight, in grams, of coffee in a bag from machine A.
a) X ~ N(128, 1.95²)
P(X < 125) = P(Z < (125 − 128)/1.95)
= P(Z < −1.538)
= 1 − Φ(1.538)
= 1 − 0.9380 = 0.062
6.2% of bags contain less than 125 g.

b) i) X ~ N(µ, 1.95²) and P(X < 125) = 0.01
Standardizing, you need to find z such that P(Z < z) = 0.01, i.e. Φ(z) = 0.01.
Since the probability is less than 0.5, z is negative, and Φ(−z) = 0.99
−z = Φ⁻¹(0.99) = 2.326, so z = −2.326
(125 − µ)/1.95 = −2.326
Therefore µ = 125 + 2.326 × 1.95 = 129.54

The smallest mean weight is 129.5 g (1 d.p.)

ii) Y is the weight, in grams, of coffee in a bag from machine B.
Y ~ N(128.5, σ²) and P(Y < 125) = 0.01
Standardizing, z = (125 − 128.5)/σ
From part (i), z = −2.326
Therefore σ = 3.5/2.326
= 1.504...
The standard deviation is 1.5 g (1 d.p.)

Exercise
1. It is estimated that, on average, one match in five in the football league is drawn, and that one
match in two is a home win.
a) Twelve matches are selected at random, calculate the probability that the number of drawn
matches is
i) Exactly three
ii) At least four.
b) Ninety matches are selected at random. Use a suitable approximation to calculate the
probability that between 13 and 20 (inclusive) of the matches are drawn.
c) Twenty matches are selected at random. The random variable D and H are the numbers of
drawn matches and home wins, respectively, in these matches. State, with a reason, which
of D and H can be better approximated by a normal variable.

2. The mass of grapes sold per day in a supermarket can be modeled by a normal distribution. It is found that, over a long period, the mean mass sold per day is 35.0 kg and that, on average, less than 15.0 kg are sold on one day in twenty.
a) Show that the standard deviation of the mass of grapes sold per day is 12.2 kg, correct to three significant figures.
b) Calculate the probability that, on a day chosen at random, more than 53.0 kg are sold.
c) Ten days are chosen at random. Assuming independence, find the probability that less than 15.0 kg will be sold on exactly two of these days.
3. Consultants employed by a large library reported that the time spent in the library by a user could
be modeled by a normal distribution with mean 65 minutes and standard deviation 20 minutes
a) Assuming that this model is adequate, what is the probability that a user spends.
i) Less than 90 minutes in the library
ii) Between 60 and 90 minutes in the library
The library closes at 9.00 pm
b) Explain why the model above could not apply to a user who entered the library at 8.00p.m
c) Estimate an approximate latest time of entry for which the model above could still be
plausible.

LINEAR COMBINATIONS OF NORMAL VARIABLES

If X and Y are any two random variables, discrete or continuous, and a and b are any two constants, then

Sums
E(X + Y) = E(X) + E(Y)   (1)
E(aX + bY) = aE(X) + bE(Y)   (3)

Differences
E(X − Y) = E(X) − E(Y)   (2)
E(aX − bY) = aE(X) − bE(Y)   (4)

Also, if X and Y are independent, then

Sums
Var(X + Y) = Var(X) + Var(Y)   (5)
Var(aX + bY) = a²Var(X) + b²Var(Y)   (7)

Differences
Var(X − Y) = Var(X) + Var(Y)   (6)
Var(aX − bY) = a²Var(X) + b²Var(Y)   (8)

THE SUM OF INDEPENDENT NORMAL VARIABLE
Example

A coffee machine is installed in a students’ common room. It dispenses white coffee by first releasing a
quantity of black coffee, normally distributed with mean 122.5ml and standard deviation 7.5ml, and
then adding a quantity of milk, normally distributed with mean 30ml and standard deviation 5ml.

Each cup is marked to a level of 137.5ml and if this level is not attained the customer receives the drink
free of charge.

What percentage of cups of white coffee will be given free of charge?

Solution
Let B be the amount, in milliliters, of black coffee, where B ~ N(122.5, 7.5²). Let M be the amount, in milliliters, of milk, where M ~ N(30, 5²). B and M are independent normal variables.
Consider W, the amount, in milliliters, of white coffee made by combining the black coffee and milk, so W = B + M and

E(W) = E(B) + E(M) = 122.5 + 30 = 152.5

Var(W) = Var(B) + Var(M) = 7.5² + 5² = 56.25 + 25 = 81.25

So W has a mean of 152.5 and a variance of 81.25.

For independent normal variables, it is true that the sum of these variables is also normally distributed, so W ~ N(152.5, 81.25).

The drink is free of charge if W < 137.5.

P(W < 137.5) = P(Z < (137.5 − 152.5)/√81.25)

= P(Z < −1.664)

= 1 − Φ(1.664)

= 1 − 0.9519 = 0.0481

So approximately 5% of the cups of white coffee will be given free of charge.

In general,
If X ~ N(µ₁, σ₁²) and Y ~ N(µ₂, σ₂²), and X and Y are independent, then X + Y ~ N(µ₁ + µ₂, σ₁² + σ₂²).
This result can be extended to any set of independent normal variables X₁, X₂, ..., Xₙ where, with obvious notation,
X₁ + X₂ + ... + Xₙ ~ N(µ₁ + µ₂ + ... + µₙ, σ₁² + σ₂² + ... + σₙ²)
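The rule that means add and variances add for independent normal variables is built into Python's `statistics.NormalDist`, whose `+` operator treats its two operands as independent. A minimal sketch for the coffee machine example, assuming Python 3.8 or later:

```python
from statistics import NormalDist

black = NormalDist(mu=122.5, sigma=7.5)   # black coffee, in ml
milk = NormalDist(mu=30, sigma=5)         # milk, in ml

# Sum of two independent normal variables: means add, variances add.
white = black + milk

# The drink is free when the total falls below the 137.5 ml mark.
p_free = white.cdf(137.5)

print(white.mean, round(white.variance, 2), round(p_free, 3))
```

Note that `NormalDist` is constructed from the standard deviation, not the variance, so 7.5 and 5 are passed rather than 56.25 and 25.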

Example

Four runners, Andy, Bob, Chris and Dai, train to take part in a 1600 m relay race in which Andy is to run 100 m, Bob 200 m, Chris 500 m and Dai 800 m. During training their individual times, recorded in seconds, follow normal distributions. With obvious notation, these are:
( ) ( ) ( ) ( )

Find the probability that they run the relay race in less than 3 minutes 35 seconds.

Solution
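Let T be the total time, in seconds, for the relay race. Then T = A + B + C + D and

E(T) = E(A) + E(B) + E(C) + E(D)

= 10.8 + 23.7 + 62.8 + 121.2

= 218.5

Assuming the four times are independent, Var(T) = Var(A) + Var(B) + Var(C) + Var(D) = 5.35.

Therefore, T ~ N(218.5, 5.35)

To find the probability that the total time is less than 3 minutes 35 seconds, i.e. 215 seconds, find
P(T < 215) = P(Z < (215 − 218.5)/√5.35)

= P(Z < −1.513)

= 1 − Φ(1.513)

= 1 − 0.9349

= 0.0651

The probability that the runners take less than 3 minutes 35 seconds is 0.065 (2 s.f.).

Exercise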
1) The masses of a particular article are normally distributed with mean 20g and standard deviation 2g.
A random sample of 12 such articles is chosen. Find the probability that the total mass is greater
than 230g.
2) The maximum load of a lift can carry is 450 kg. The weights of men are normally distributed with
mean 60kg and standard deviation 10kg. The weights of women are normally distributed with mean
55kg and standard deviation 5kg. Find the probability that the lift will be overloaded by five men and
two women, if their weights are independent.

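THE DIFFERENCE OF INDEPENDENT NORMAL VARIABLES

For two independent variables X and Y, where X ~ N(µ₁, σ₁²) and Y ~ N(µ₂, σ₂²),

E(X − Y) = E(X) − E(Y) = µ₁ − µ₂

Var(X − Y) = Var(X) + Var(Y) = σ₁² + σ₂²

X − Y is normally distributed, so

X − Y ~ N(µ₁ − µ₂, σ₁² + σ₂²)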

Example

A machine produces rubber balls whose diameters are normally with mean 5.50cm and standard
deviation 0.08cm.

The balls are packed in cylindrical tubes whose internal diameters are normally distributed with mean
5.70cm and standard deviation 0.12cm.

If a ball, selected at random, is placed in a tube, selected at random, what is the distribution of the
clearance? (The clearance is the internal diameter of the tube minus the diameter of the ball) what is
the probability that the clearance is between 0.05cm and 0.25cm?

Solution

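Let B be the diameter, in centimeters, of a rubber ball. Then B ~ N(5.50, 0.08²)

Let T be the internal diameter, in centimeters, of a cylindrical tube. Then T ~ N(5.70, 0.12²)

Let C be the clearance, in centimeters, so C = T − B and E(C) = E(T) − E(B) = 5.70 − 5.50 = 0.20

Var(C) = Var(T) + Var(B) = 0.12² + 0.08² = 0.0144 + 0.0064 = 0.0208

So, C ~ N(0.20, 0.0208)

To find the probability that the clearance is between 0.05 cm and 0.25 cm, find

P(0.05 < C < 0.25) = P((0.05 − 0.20)/√0.0208 < Z < (0.25 − 0.20)/√0.0208)

= P(−1.040 < Z < 0.347)

= Φ(0.347) + Φ(1.040) − 1 = 0.6357 + 0.8508 − 1 = 0.4865

The probability that the clearance is between 0.05 cm and 0.25 cm is 0.49 (2 s.f.).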
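A simulation offers an independent check that variances add, rather than subtract, when one independent normal variable is subtracted from another. This is a rough sketch, assuming Python 3.8 or later; the sample size and seeds are arbitrary choices, and the simulated values only approximate the theoretical ones.

```python
from statistics import NormalDist, pvariance

tube = NormalDist(mu=5.70, sigma=0.12)   # internal tube diameter, cm
ball = NormalDist(mu=5.50, sigma=0.08)   # ball diameter, cm

N = 100_000
# Independent random draws for tubes and balls (different seeds).
tube_samples = tube.samples(N, seed=1)
ball_samples = ball.samples(N, seed=2)

# Clearance = tube diameter minus ball diameter, pair by pair.
clearance = [t - b for t, b in zip(tube_samples, ball_samples)]

sim_variance = pvariance(clearance)
sim_prob = sum(0.05 < c < 0.25 for c in clearance) / N

print(round(sim_variance, 4), round(sim_prob, 2))
```

The simulated variance should land close to 0.0208 (the sum of the two variances) and the simulated proportion close to the 0.49 found from the tables.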

Exercise

1. A certain liquid drug is marketed in bottles containing a nominal 20 ml of drug. Tests on a large number of bottles indicate that the volume of liquid in each bottle is distributed normally with mean 20.42 ml and standard deviation 0.429 ml. If the capacity of the bottles is normally distributed with mean 21.77 ml and standard deviation 0.210 ml, estimate what percentage of bottles will overflow during filling.
2. In a cafeteria, baked beans are served either in ordinary portions or in children's portions. The quantity given for an ordinary portion is a normal variable with mean 90 g and standard deviation 3 g, and the quantity given for a child's portion is a normal variable with mean 43 g and standard deviation 2 g. What is the probability that Tom, who has two children's portions, is given more than his father, who has an ordinary portion?

MULTIPLES OF INDEPENDENT NORMAL VARIABLES


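Remember that for any constant a,

E(aX) = aE(X) and Var(aX) = a²Var(X)

If X is a normal variable such that X ~ N(µ, σ²), then E(aX) = aµ and

Var(aX) = a²σ²
Thus aX ~ N(aµ, a²σ²)

Now consider two independent normal variables X and Y, where X ~ N(µ₁, σ₁²) and Y ~ N(µ₂, σ₂²).

For any constants a and b,

E(aX + bY) = aE(X) + bE(Y) = aµ₁ + bµ₂

Var(aX + bY) = a²Var(X) + b²Var(Y) = a²σ₁² + b²σ₂²

E(aX − bY) = aE(X) − bE(Y) = aµ₁ − bµ₂

Var(aX − bY) = a²Var(X) + b²Var(Y) = a²σ₁² + b²σ₂²

aX + bY ~ N(aµ₁ + bµ₂, a²σ₁² + b²σ₂²)

aX − bY ~ N(aµ₁ − bµ₂, a²σ₁² + b²σ₂²)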

Example

Let and e independent random variable and ( ) ( ) find the probability that
an observation from the population of is more than twice the value of an observation from the
population of .

Solution

You need to find ( ) ( ), let


( ) ( ) ( )
( ) ( )
So ( )
( )
( ) ( )

D: -10 0
27 Z: 0 1.443

41
( )
( )

The probability that on observation from the population of is more than twice the value of an
observation from the population of is (2 s.f)

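In general, for X ~ N(µ, σ²):

Sum: X₁ + X₂ + ... + Xₙ ~ N(nµ, nσ²), where X₁, ..., Xₙ are independent observations of X

Multiple: nX ~ N(nµ, n²σ²)

Note that the means are the same but the variances are not.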

Example

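A soft drinks manufacturer sells bottles of drinks in two sizes. The amount in each bottle, in milliliters, is normally distributed as shown in the table:

          Mean (ml)    Variance (ml²)
Small     252          4
Large     1012         25

a) A bottle of each size is selected at random. Find the probability that the large bottle contains less than four times the amount in the small bottle.
b) One large and four small bottles are selected at random. Find the probability that the amount in the large bottle is less than the total amount in the four small bottles.

Solutions

Let $S$ be the amount, in milliliters, in a small bottle. Then $S \sim N(252, 4)$.

Let $L$ be the amount, in milliliters, in a large bottle. Then $L \sim N(1012, 25)$.

a) To find the probability that a large bottle contains less than four times the amount in a small bottle, you need

$P(L < 4S) = P(L - 4S < 0)$

$E(L - 4S) = E(L) - 4E(S) = 1012 - 4 \times 252 = 4$

$\mathrm{Var}(L - 4S) = \mathrm{Var}(L) + 16\mathrm{Var}(S) = 25 + 16 \times 4 = 89$

So $L - 4S \sim N(4, 89)$

$P(L - 4S < 0) = P\left(Z < \frac{0 - 4}{\sqrt{89}}\right) = P(Z < -0.424) = 0.3358$

[Figure: normal curve for L - 4S with 0 and 4 marked, corresponding to Z = -0.424 and Z = 0]

The probability that a large bottle contains less than four times the amount in a small bottle is 0.34 (2 s.f.)

b) To find the probability that the amount in a large bottle is less than the total amount in four small bottles, you need

$P(L < S_1 + S_2 + S_3 + S_4) = P(L - (S_1 + S_2 + S_3 + S_4) < 0)$

$E(L - (S_1 + \dots + S_4)) = E(L) - 4E(S) = 1012 - 1008 = 4$

$\mathrm{Var}(L - (S_1 + \dots + S_4)) = \mathrm{Var}(L) + 4\mathrm{Var}(S) = 25 + 16 = 41$

Therefore $L - (S_1 + \dots + S_4) \sim N(4, 41)$

$P(L - (S_1 + \dots + S_4) < 0) = P\left(Z < \frac{0 - 4}{\sqrt{41}}\right) = P(Z < -0.625) = 0.2660$

[Figure: normal curve for L - (S1 + ... + S4) with 0 and 4 marked, corresponding to Z = -0.625 and Z = 0]

The probability that the amount in a large bottle is less than the total amount in four small bottles is 0.27 (2 s.f.)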
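
The two bottle calculations can be reproduced with a short sketch, assuming Python's standard `statistics.NormalDist`; the variable names are illustrative. Part (a) uses a multiple of one small bottle (variance multiplied by 16), part (b) a sum of four independent small bottles (variance multiplied by 4).

```python
from statistics import NormalDist

small_mean, small_var = 252, 4      # S ~ N(252, 4)
large_mean, large_var = 1012, 25    # L ~ N(1012, 25)

# a) L - 4S: one small bottle scaled by 4, so Var = 25 + 16 * 4 = 89
diff_a = NormalDist(large_mean - 4 * small_mean,
                    (large_var + 16 * small_var) ** 0.5)
p_a = diff_a.cdf(0)

# b) L - (S1 + S2 + S3 + S4): four independent bottles, so Var = 25 + 4 * 4 = 41
diff_b = NormalDist(large_mean - 4 * small_mean,
                    (large_var + 4 * small_var) ** 0.5)
p_b = diff_b.cdf(0)

print(round(p_a, 2), round(p_b, 2))
```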

Example

The lifetimes of Econ light bulbs are normally distributed with mean 1000h and standard deviation 25h.

a) Find, to three decimal places, the probability that an Econ light bulb will have a lifetime between 975h and 1020h.
b) Calculate, to three decimal places, the probability that the sum of the lifetimes of eight Econ light bulbs will exceed 7930h. Indicate clearly the stage in your calculation when an assumption concerning independence is essential.

The lifetimes of Energy Saver light bulbs are normally distributed with mean 7900h and standard deviation 50h.
c) Calculate, to three decimal places, the probability that an Energy Saver light bulb will last at least eight times as long as an Econ light bulb.

Solution
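Let $X$ be the lifetime, in hours, of an Econ light bulb. Then $X \sim N(1000, 25^2)$.

a) $P(975 < X < 1020) = P\left(\frac{975 - 1000}{25} < Z < \frac{1020 - 1000}{25}\right)$

$= P(-1 < Z < 0.8)$

$= 0.7881 - 0.1587 = 0.6294$

[Figure: normal curve with X = 975, 1000, 1020 marked, corresponding to Z = -1, 0, 0.8]

The probability that an Econ light bulb has a lifetime between 975h and 1020h is 0.629 (3 d.p.)

b) Let $S$ be the sum of the lifetimes of eight Econ light bulbs, so $S = X_1 + X_2 + \dots + X_8$.

$E(S) = 8E(X) = 8000$
$\mathrm{Var}(S) = 8\mathrm{Var}(X) = 8 \times 625 = 5000$ (assuming that the lifetimes are independent)
$S \sim N(8000, 5000)$

$P(S > 7930) = P\left(Z > \frac{7930 - 8000}{\sqrt{5000}}\right) = P(Z > -0.990) = 0.8389$

[Figure: normal curve with S = 7930 and 8000 marked, corresponding to Z = -0.990 and Z = 0]

The probability that the sum of the lifetimes of eight Econ light bulbs exceeds 7930h is 0.839 (3 d.p.)

c) Let $Y$ be the lifetime, in hours, of an Energy Saver light bulb, so $Y \sim N(7900, 50^2)$.

$P(Y > 8X)$ is needed, i.e. $P(Y - 8X > 0)$.

$E(Y - 8X) = E(Y) - 8E(X) = 7900 - 8000 = -100$
$\mathrm{Var}(Y - 8X) = \mathrm{Var}(Y) + 64\mathrm{Var}(X) = 2500 + 64 \times 625 = 42500$ (assuming independence)
$Y - 8X \sim N(-100, 42500)$

$P(Y - 8X > 0) = P\left(Z > \frac{0 - (-100)}{\sqrt{42500}}\right) = P(Z > 0.485) = 0.3138$

[Figure: normal curve with Y - 8X = -100 and 0 marked, corresponding to Z = 0 and Z = 0.485]

The probability that an Energy Saver light bulb lasts at least eight times as long as an Econ light bulb is 0.314 (3 d.p.)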
Exercise

1. The distribution of the masses of adult husky dogs may be modeled by the normal distribution with
mean 37kg and standard deviation 5kg.
a) Calculate the probability that an adult husky has a mass greater than 30kg.

b) Calculate the probability that a randomly chosen team of six huskies has a total mass lying
between 198kg and 240kg, giving your answer to three decimal places.

2. The weight of a large loaf of bread is a normal variable with mean 420g and standard deviation 30g.
The weight of a small loaf of bread is a normal variable with mean 220g and standard deviation 10g.
a) Find the probability that 5 large loaves weigh more than 10 small loaves.
b) Find the probability that the total weight of 5 large loaves and 10 small loaves lies between
4.25kg and 4.4kg.

SAMPLING AND ESTIMATION


SAMPLING
Population
In a statistical inquiry you often need information about a particular group. This group is known as the
population or the target population, and it could be small, large or even infinite.
Note that the word ‘population’ does not necessarily mean ‘people’.
Here are some examples of populations:
• Pupils in a class
• People in England in full-time employment
• Hospitals in Wales
• Cans of soft drinks produced in a factory
• Ferns in a wood
• Rational numbers between 0 and 10

SURVEYS

Information is collected by means of a survey. There are two types:

a) A census
b) A sample survey

a) Census
A census is a total enumeration of the whole population: every member of the population is surveyed. When the population is small, this can be a straightforward exercise. When the population is large, taking a census can be time-consuming and difficult to do with accuracy.

b) Sample survey
When a survey covers less than 100% of the population it is known as a sample survey. Sample data can be obtained relatively cheaply and quickly and, if the sample is representative of the population, a sample survey can give an accurate indication of the population characteristic being studied.

Sampling frame
Once the individual members of a population have been numbered to form a list, this list is called a sampling frame.

Sampling Method
Once a sampling frame has been established you can choose a method of sampling. These fall into two categories:
• Random sampling, e.g. simple, systematic, stratified
• Non-random sampling, e.g. quota, cluster

Simple Random Sampling

Suppose a population consists of N sampling units and you require a sample of n of these units. A
sample of size n is called a simple random sample if all possible samples of size n are equally likely to be
selected.

If the unit selected at each draw is replaced into the population before the next draw, then it can appear
more than once in the sample. This is known as sampling with replacement.

If the unit selected at each draw is not replaced into the population before the next draw, this is known as sampling without replacement.

The second method, sampling without replacement, is known as simple random sampling. Two methods of simple random sampling are commonly used:

• Drawing lots
• Random number sampling

Drawing lots

For each number in the sampling frame, place a ball marked with that number into a container and then draw n balls out of the container at random and without replacement. If you wanted a sample of size 20, you would draw out 20 balls. This is suitable for a small population. Note, however, that the sample must be large enough to provide sufficiently accurate information about the population. The sample should be selected at random, and any hint of possible bias should be avoided. If the population is large, then the method of drawing lots, sometimes described as 'drawing out of a hat', is not practical. You could instead make the choice by referring to random number tables.

Using Random Number Tables

Random number tables consist of digits 0, 1, 2, 3,… 9, such that each digit has an equal chance of
occurring. So for example, the probability that a 3 occurs is 0.1. In random number tables the digits
may appear singly or be grouped in some way. This is solely for convenience of printing.

Example

Here is an extract from a set of random number tables.

6 8 7 2 5 3 8 1 5 9

2 5 3 4 7 0 5 4 9 5

3 2 6 8 7 4 4 7 0 5

Use it to select a random sample of

a) Eight people from a group of 100 people


b) Eight people from a group of 60.

Solution

a) To select a group of eight people from a target population of 100 people, allocate a two-digit number to each person: for example, allocate 01 to the first on the list, 02 to the second, and so on up to 98, 99, 00, calling the hundredth person 00 for convenience.
Using the table, starting at the beginning of the first row and reading along the rows, you would select the people corresponding to the following numbers:
68 72 53 81 59 25 34 70
Alternatively, you could decide to read the digits backwards from the bottom right, in which case your sample would consist of the people corresponding to the numbers
50 74 47 86 23 59 45 07

b) To select a group of 8 from a target group of 60 people, allocate each person a number from 01 to
60. Using the table disregard any two-digit number outside the range. Starting at the beginning of
the first row and grouping in pairs gives.
68 72 53 81 59 25 34 70 54 95 32 68
74 47 05

So you would choose people corresponding to the numbers

53 59 25 34 54 32 47 05
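
The selection procedure in parts (a) and (b) can be sketched in code. This is an illustrative helper (the function name and its handling of repeats are assumptions, not part of the notes): it reads the digit stream in pairs along the rows, treats 00 as the hundredth person, and discards labels that are out of range or already chosen.

```python
def select_from_digits(digits, population_size, sample_size):
    """Pick distinct two-digit labels from a stream of random digits."""
    chosen = []
    for i in range(0, len(digits) - 1, 2):
        label = int(digits[i:i + 2])
        if label == 0:
            label = 100  # '00' stands for the hundredth person
        if label <= population_size and label not in chosen:
            chosen.append(label)
        if len(chosen) == sample_size:
            break
    return chosen

# The three rows of the extract, read along the rows.
table = "6872538159" + "2534705495" + "3268744705"

print(select_from_digits(table, 100, 8))  # part (a)
print(select_from_digits(table, 60, 8))   # part (b)
```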

Example

Use the following extract from random number tables to select a random sample of 12 numbers, each to two decimal places, from the continuous range $0 \le x < 10$.

52 74 54 80 68 72 51 96 08 00

02 52 09 93 60 43 57 42 13 44

Solution

Since the sample values are required to two decimal place accuracy, consider groups of three digits,
inserting the decimal point between the first and second digit. In this case your sample would consist of
the values 5.27, 4.54, 8.06, 8.72, 5.19, 6.08, 0.00, 2.52, 0.99, 3.60, 4.35, 7.42.

Example

Here is a set of random numbers

848051, 386103, 153842, 242330, 580007, 479971

Use it to select a random sample of four numbers, each to three decimal places, from the continuous
range .

Solution

Consider groups of four digits, inserting the decimal point between the first and second digits, and disregard any values that are out of range. This gives:

8.480 5.138 6.103 1.538 4.224 2.330 5.800 0.747

So the numbers chosen are 1.538, 4.224, 2.330, and 0.747.

Calculator random number generator

You probably have a random number generator key Ran # on your calculator, which produces a
number, for example 0.398, every time you press it. The numbers generated are in fact obtained using a
mathematical formula and are really pseudo random numbers, but they suit the purpose very well
indeed.

Suppose you want to use your calculator to select a random sample of six numbers between 1 and 49
for your entry in the National Lottery.

To do this, you probably need to press shift, then Ran #, then =. Suppose the numbers you get are 0.730, 0.798, 0.369, 0.499, 0.491, 0.310, 0.135, 0.112, 0.593, 0.652, 0.015, 0.346. You can interpret them in various ways, for example:

• If you decide to use the first two digits to the right of the decimal point each time, you would obtain the numbers 73, 79, 36, 49, 49, 31, 13, 11, 59, 65, 01, 34.
Ignoring repeats and numbers bigger than 49, the six numbers would be 36, 49, 31, 13, 11 and 1.
• If you decide to use the last two digits each time, you would obtain the numbers 30, 98, 69, 99, 91, 10, 35, 12, 93, 52, 15, 46.
Ignoring repeats and numbers bigger than 49, the six numbers would be 30, 10, 35, 12, 15 and 46.
• If you decide to use all the digits after the decimal point, you would be choosing from the digits 730798369499491310135112593652015346. Grouping these as two-digit numbers gives 73, 07, 98, 36, 94, 99, 49, 13, 10, 13, 51, 12, 59, 36, 52, 01, 53, 46.
Ignoring repeats and numbers bigger than 49 gives the six numbers as 7, 36, 49, 13, 10, 12.

The possibilities are endless.

Systematic sampling
Random sampling from a very large population is very cumbersome.
An alternative procedure is to list the population in some order, for example alphabetically or in order of completion on a production line, and then choose every kth member from the list after obtaining a random starting point. If you choose every tenth vehicle passing a checkpoint, you would form a 10% sample. If you choose every twentieth item, for example every twentieth card in an index file, you would form a 5% sample.

Example
Describe how to choose a systematic sample of 8 members from a list of 300.

Solution
Since you are going to choose every kth member, you need to find a suitable value for k. To do this, choose a convenient value close to $300/8 = 37.5$. In this case, $k = 40$ will do. Now choose a random starting point: for example, if Ran # on your calculator gives 0.870, take the first member of the sample as 87 and then add 40 each time. The other members are 127, 167, 207, 247, 287, 27 and 67.
Note that when you reach the end of the list, you go back to the beginning (287 + 40 = 327, which wraps round to 27), so the sample consists of 27, 67, 87, 127, 167, 207, 247, 287.
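
The wrap-around rule in this solution can be sketched as follows. The function name and signature are illustrative assumptions; members are numbered from 1 to the population size.

```python
def systematic_sample(population_size, sample_size, start, step):
    """Take every `step`-th member from `start`, wrapping round at the end of the list."""
    members = []
    position = start
    for _ in range(sample_size):
        members.append(position)
        position += step
        if position > population_size:
            position -= population_size  # go back to the beginning of the list
    return sorted(members)

print(systematic_sample(300, 8, start=87, step=40))
```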

The advantages of systematic sampling are that it is quick to carry out and it is easy to check for errors. For large-scale sampling, systematic selection is usually used in preference to taking simple random samples.
The disadvantage of this system is that there may be a periodic cycle within the frame itself. For example, a machine may operate in such a manner that choosing every tenth item, starting at 5, would result in half the items in the sample being faulty, whereas starting at 2 would produce a sample with no faulty items. Of course, if the periodic cycle is recognized, then different samples would be taken by varying the starting point and the length of the interval between the chosen items.

Stratified Sampling

Stratified sampling is used when the population is split into distinguishable layers or strata that are quite different from each other and which together cover the whole population, for example:

• Age groups
• Occupational groups
• Topographical regions

Separate random samples are taken from each stratum and put together to form the sample from the population.
It is usual to represent the strata proportionately in the sample, as in the following example.

Example

A competent carrier employs 320 drivers, 80 administrative staff and 40 mechanics. A committee to
represent all the employees is to be formed. The committee is to have 11 members and the selection is
to be made so that there is as close representation as possible without bias towards any individual or
groups. Explain how this could be done.

Solution

If you were to take a simple random sample of all 440 employees, every employee would have an equal chance of being selected. The committee could then consist largely, or even entirely, of drivers and therefore would not be representative of all employees.

A stratified random sample would provide a more accurate representation of the population and could be formed as follows:

Taking into account that the drivers make up $\frac{320}{440}$ of the work force,

Number of drivers $= \frac{320}{440} \times 11 = 8$

Similarly:

Number of administrative staff $= \frac{80}{440} \times 11 = 2$

Number of mechanics $= \frac{40}{440} \times 11 = 1$

The required representation on the committee is eight drivers, two from the administrative staff and one mechanic. The people to be included can then be selected from each stratum by using simple random sampling or systematic sampling.
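
Proportional allocation of this kind can be sketched in a few lines. The function name is an illustrative assumption. In this example the allocations come out as whole numbers; in general the rounded values may not add up to the required sample size and would then need adjusting by hand.

```python
from fractions import Fraction

def proportional_allocation(strata_sizes, sample_size):
    """Allocate the sample to each stratum in proportion to its size."""
    total = sum(strata_sizes.values())
    return {name: round(Fraction(size, total) * sample_size)
            for name, size in strata_sizes.items()}

employees = {"drivers": 320, "administrative staff": 80, "mechanics": 40}
committee = proportional_allocation(employees, 11)
print(committee)
```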

Non-random sampling

a. Cluster sampling
Sometimes there is a natural sub grouping of the population and these subgroups are called clusters. For
example, in a population consisting of all children in the country attending state primary schools, the
local education authorities form natural clusters.
When a sample survey is carried out on a population that can be broken in to clusters it is often more
convenient to first choose a random sample of clusters and then to sample within each cluster chosen.
Unlike stratified sampling where the strata are as different from each other as possible, each cluster
should be as similar to other clusters as possible.

Advantages

1) There is no need to have a complete sampling frame of the whole population.

2) It is usually far less costly than random sampling.

Disadvantages
1) It is non-random

b. Quota sampling
Quota sampling is widely used in market research, where the population is divided into groups in terms of age, sex, income level and so on. The interviewer is then told how many people to interview within each specified group, but is given no specific instructions about how to locate them and fulfil the quota.
It is quick to use, complications are kept to a minimum and, unlike random sampling, any member of the sample may be replaced by another member.
If no sampling frame exists, then quota sampling may be the only practical method of obtaining a sample.
The disadvantage of quota sampling, however, is that it is non-random. There is a possibility of bias in the selection process if, for example, the interviewer selects those easiest to question or those who look co-operative.

SIMULATING RANDOM SAMPLES FROM GIVEN DISTRIBUTIONS
A good way to simulate a random sample from a given distribution is to use cumulative proportional frequencies or cumulative probabilities, as illustrated in the following examples.
a) From a frequency distribution

Example
Use the sequence of random digits 3642945883309239184000300 to generate five simulated
observations from the following frequency distribution.
x    1    2    3    4
f    8    12   14   6     Total 40

Solution
Consider first the cumulative frequencies and then transform them to cumulative proportional frequencies with a total proportion of 1. Then allocate the random numbers in proportion to these frequencies.

x    f     Cumulative    Cumulative proportional    Corresponding
           frequency     frequency                  random numbers
1    8     8             0.20                       01 to 20
2    12    20            0.50                       21 to 50
3    14    34            0.85                       51 to 85
4    6     40            1.00                       86 to 99 and 00

Since the cumulative proportional frequencies contain two decimal places, it is convenient to use two-digit random numbers. Note that 00 has been allocated to the x-value of 4 for convenience.
Take five two-digit random numbers from the list: 36, 42, 94, 58, 83.
Match these up with the corresponding sample values: 2, 2, 4, 3, 3. So a random sample of size 5 from the given distribution is 2, 2, 3, 3, 4.
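
The allocation above can be sketched in code. This illustrative helper (name and signature are assumptions) assumes, as in this example, that the cumulative proportional frequencies are exact to two decimal places, so that two-digit random numbers suffice; 00 is treated as 100.

```python
from itertools import accumulate

def simulate_from_frequencies(values, frequencies, random_digits, count):
    """Convert two-digit random numbers into observations via cumulative frequencies."""
    total = sum(frequencies)
    # Upper limit of each random-number range, e.g. 20, 50, 85, 100.
    upper_limits = [round(100 * c / total) for c in accumulate(frequencies)]
    sample = []
    for i in range(count):
        number = int(random_digits[2 * i:2 * i + 2])
        if number == 0:
            number = 100  # '00' is allocated to the last value
        for value, limit in zip(values, upper_limits):
            if number <= limit:
                sample.append(value)
                break
    return sample

digits = "3642945883309239184000300"
sample = simulate_from_frequencies([1, 2, 3, 4], [8, 12, 14, 6], digits, 5)
print(sample, sorted(sample))
```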

b) From a probability distribution


Example
Generate a random sample of size 10 from the given probability distribution, using the random numbers 3, 7, 4, 7, 6, 5, 3, 3, 9, 0.

x           0      1      2      3
P(X = x)    0.1    0.2    0.4    0.3

Solution
Form the cumulative distribution function F(x) and then allocate random numbers in a convenient way.

x    P(X = x)    F(x)    Corresponding random numbers
0    0.1         0.1     1
1    0.2         0.3     2, 3
2    0.4         0.7     4, 5, 6, 7
3    0.3         1       8, 9, 0

Take the 10 random numbers given and convert them to sample values:

Random numbers: 3, 7, 4, 7, 6, 5, 3, 3, 9, 0.
Sample values: 1, 2, 2, 2, 2, 2, 1, 1, 3, 3.
So the sample values are 1, 1, 1, 2, 2, 2, 2, 2, 3, 3.
Example
Generate a random sample of size 4 from the binomial distribution $B(4, 0.2)$, using the random numbers 2811574761578988.
Solution
Calculate the cumulative probabilities, either by calculating probabilities first or using cumulative probability tables directly.
Remember that $F(x) = P(X \le x)$.

x    P(X = x)    F(x)      Corresponding random numbers
0    0.4096      0.4096    0001 to 4096
1    0.4096      0.8192    4097 to 8192
2    0.1536      0.9728    8193 to 9728
3    0.0256      0.9984    9729 to 9984
4    0.0016      1.0000    9985 to 9999 and 0000

The random number 2811 is in the range 0001 to 4096 and so corresponds to x = 0.
Similarly, 5747 corresponds to x = 1,
6157 corresponds to x = 1,
8988 corresponds to x = 2.
So the random sample of four observations from the distribution consists of the values 0, 1, 1, 2.
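
The same lookup can be sketched in code. The sketch assumes the distribution is B(4, 0.2), which reproduces the random-number ranges 0001 to 4096, 4097 to 8192 and so on used above; the function names are illustrative.

```python
from math import comb

def binomial_cumulative(n, p):
    """Return [F(0), F(1), ..., F(n)] for X ~ B(n, p)."""
    cumulative, running = [], 0.0
    for x in range(n + 1):
        running += comb(n, x) * p ** x * (1 - p) ** (n - x)
        cumulative.append(running)
    return cumulative

def observation(cumulative, four_digit_number):
    """Map a four-digit random number (0000 counted as 10000) to an observation."""
    number = 10000 if four_digit_number == 0 else four_digit_number
    u = number / 10000
    for x, f in enumerate(cumulative):
        if u <= f + 1e-12:  # small tolerance for floating-point rounding
            return x
    return len(cumulative) - 1

cdf = binomial_cumulative(4, 0.2)
sample = [observation(cdf, r) for r in (2811, 5747, 6157, 8988)]
print(sample)
```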
Example
Using the random number 8135, take a single random observation from a Poisson distribution with parameter 3.

Solution
$X \sim \mathrm{Po}(3)$
Using cumulative Poisson probability tables and arranging the results in a table together with a convenient corresponding random number allocation gives:

x            P(X ≤ x)    Corresponding random numbers
0            0.0498      0001 to 0498
1            0.1991      0499 to 1991
2            0.4232      1992 to 4232
3            0.6472      4233 to 6472
4            0.8153      6473 to 8153
5            0.9161      8154 to 9161
6            0.9665      9162 to 9665
7            0.9881      9666 to 9881
8 or over    1           9882 to 9999 and 0000

The given random number is 8135, which lies in the range 6473 to 8153, so the random observation is x = 4.

Example
Using the random digits 723,850 take a random sample of size 2 from the continuous distribution with
probability density function.
( ) For

Solution

The cumulative distribution function is given by ( )

Taking the first three random numbers if ( ) and √


(2d.p)

[Figure: graph of $F(x)$ against $x$, with 0, $a$ and 2 marked on the horizontal axis]

Taking the next three random numbers: if ( ) √


(2d.p)

So the two random observations are

Exercise

1) Use the random numbers 382 824 to take a random sample of 2 from the normal distribution $N(30, 4)$. (Ans: the two random observations are 29.4, 31.9)

2)Using the random number 256 construct a random observation of the continuous random
variable where ( )
3)You are given the random number 431. Use this number to obtain a sample observation from
a) A binomial distribution with
b) A normal distribution with mean 6.2 and standard deviation 0.1.
4)Using the random numbers 267 394 018 take a random sample of size 3 from the normal
distribution with mean 35 and variance 9.
5) a) The discrete random variable X is such that X ( ) Take a random sample of size
5 from this distribution, using the random numbers 407 315 401 203 972.

b) Using the random numbers 6143 take a single random observation from the Poisson
distribution with parameter 4.

SAMPLE STATISTICS

When you are trying to find out information about a population, it seems sensible to take random samples and then consider the values obtained from them. It is therefore useful to know how these sample values are distributed.

THE DISTRIBUTION OF THE SAMPLE MEAN

• Take a random sample of n independent observations from a population. Note that from a finite population, sampling should be with replacement to ensure that the observations are independent.
• Calculate the mean of these n sample values. This is known as the sample mean.
• Now repeat the procedure until you have taken all possible samples of size n, calculating the sample mean of each one.
• Form a distribution of all the sample means. The distribution that would be formed is called the sampling distribution of means.

The Mean and Variance of the Sampling Distribution of Means

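Consider a population $X$ in which $E(X) = \mu$ and $\mathrm{Var}(X) = \sigma^2$.

Take $n$ independent observations $X_1, X_2, \dots, X_n$ from $X$.

Since $E(X_i) = \mu$ for each observation, $E(X_1 + X_2 + \dots + X_n) = n\mu$.

Since the observations are independent and $\mathrm{Var}(X_i) = \sigma^2$, $\mathrm{Var}(X_1 + X_2 + \dots + X_n) = n\sigma^2$.

The sample mean is

$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n}$

$E(\bar{X}) = \frac{1}{n}E(X_1 + X_2 + \dots + X_n) = \frac{1}{n}(n\mu) = \mu$

$\mathrm{Var}(\bar{X}) = \frac{1}{n^2}\mathrm{Var}(X_1 + X_2 + \dots + X_n) = \frac{1}{n^2}(n\sigma^2) = \frac{\sigma^2}{n}$

i.e. $E(\bar{X}) = \mu$ and $\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$

The standard deviation of the sampling distribution is $\sqrt{\sigma^2/n}$, usually written $\frac{\sigma}{\sqrt{n}}$. This is known as the standard error of the mean.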
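
For a very small population these results can be checked by listing every possible sample. The sketch below uses a hypothetical population {1, 2, 3, 4} and samples of size 2 drawn with replacement; both the population and the variable names are assumptions made purely for illustration.

```python
from itertools import product
from statistics import mean, pvariance

population = [1, 2, 3, 4]   # hypothetical population, for illustration only
n = 2                       # sample size

# Every ordered sample of size n drawn with replacement (4**2 = 16 samples).
sample_means = [mean(sample) for sample in product(population, repeat=n)]

mu = mean(population)
sigma_squared = pvariance(population)

print(mean(sample_means), mu)                      # mean of sample means vs mu
print(pvariance(sample_means), sigma_squared / n)  # variance of sample means vs sigma^2 / n
```

The mean of the sample means agrees with the population mean, and their variance agrees with the population variance divided by n.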

a) The distribution of $\bar{X}$ when the population of $X$ is normal.

[Figure: distribution of $X$, a normal curve centred at 100]

[Figure: distribution of $\bar{X}$ for samples of size n = 2, 5 and 25, each centred at 100]

From the diagrams, you can see that if samples are taken from a normal population, the sampling distribution of means is also normal. If $X \sim N(\mu, \sigma^2)$, then $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$.

Example
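At a college the masses of the male students can be modeled by a normal distribution with mean mass 70kg and standard deviation 5kg.
Four male students are chosen at random. Find the probability that their mean mass is less than 65kg.

Solution
$X$ is the mass, in kilograms, of a male student at the college, and $X \sim N(\mu, \sigma^2)$ with $\mu = 70$, $\sigma = 5$.
Since the distribution of $X$ is normal, the distribution of $\bar{X}$ is also normal and
$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ with $n = 4$, i.e. $\bar{X} \sim N(70, 6.25)$.

So $P(\bar{X} < 65) = P\left(Z < \frac{65 - 70}{2.5}\right) = P(Z < -2) = 1 - 0.9772 = 0.0228$

[Figure: normal curve with $\bar{X}$ = 65 and 70 marked, corresponding to Z = -2 and Z = 0]

The probability that the mean mass is less than 65kg is 0.023 (2 s.f.)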

Example

The distribution of the random variable $X$ is $N(25, 340)$. The mean of a random sample of size $n$ drawn from this distribution is $\bar{X}$. Find the value of $n$, correct to two significant figures, given that ( ) is approximately 0.005.

Solution

$X \sim N(25, 340)$

For samples of size $n$, $\bar{X} \sim N\left(25, \frac{340}{n}\right)$


Therefore ( ) ( ) ( )
√ √

√ √
You are given that ( ) ( )
√ √


So ( )= ( ) √

Squaring both sides.
, ( )

b) The distribution of $\bar{X}$ when $X$ is not normally distributed
The following diagrams illustrate the distribution of $\bar{X}$ for samples of different sizes taken from a population that is not normal.
i) Distribution of $\bar{X}$ when $X \sim$ ( )

[Figure: distribution of $X$, with probabilities from 0.1 to 0.4 on the vertical axis and x = 0 to 10 on the horizontal axis]

[Figure: distribution of $\bar{X}$ when $X$ is not normally distributed, for samples of size 10, 15 and 30]

Work out for
ii) Distribution of $\bar{X}$ when $X \sim$ ( )
iii) Distribution of $\bar{X}$ when $X \sim$ ( )

Example
Thirty random observations are taken from each of the following distribution and the sample mean
calculated. Find, in each case, the probability that the sample mean exceeds 5.

a) $X$ is the number of telephone calls made in an evening to a counseling service, where $X \sim \mathrm{Po}(4.5)$.
b) $X$ is the number of heads obtained when an unbiased coin is tossed nine times.
c) $X$ is distributed uniformly throughout the range $2 \le x \le 7$.

Solution
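a) $X \sim \mathrm{Po}(4.5)$, so $E(X) = 4.5$ and $\mathrm{Var}(X) = 4.5$.

By the central limit theorem, since $n$ is large, $\bar{X}$ is approximately normal, so

$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ with $n = 30$
i.e. $\bar{X} \sim N\left(4.5, \frac{4.5}{30}\right)$, $\bar{X} \sim N(4.5, 0.15)$

$P(\bar{X} > 5) = P\left(Z > \frac{5 - 4.5}{\sqrt{0.15}}\right) = P(Z > 1.291) \approx 0.098$

[Figure: normal curve with $\bar{X}$ = 4.5 and 5 marked, corresponding to Z = 0 and Z = 1.291]

b) $X \sim B(9, 0.5)$, so $E(X) = np = 4.5$ and $\mathrm{Var}(X) = npq = 2.25$.

By the central limit theorem, since $n$ is large, $\bar{X}$ is approximately normal and $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ with $n = 30$

i.e. $\bar{X} \sim N\left(4.5, \frac{2.25}{30}\right)$, $\bar{X} \sim N(4.5, 0.075)$

$P(\bar{X} > 5) = P\left(Z > \frac{5 - 4.5}{\sqrt{0.075}}\right) = P(Z > 1.826) \approx 0.034$

[Figure: normal curve with $\bar{X}$ = 4.5 and 5 marked, corresponding to Z = 0 and Z = 1.826]

c) When $X$ is uniformly distributed on $a \le x \le b$, $E(X) = \frac{a + b}{2}$ and $\mathrm{Var}(X) = \frac{(b - a)^2}{12}$.

Since the distribution is valid for $2 \le x \le 7$, $E(X) = \frac{2 + 7}{2} = 4.5$ and $\mathrm{Var}(X) = \frac{(7 - 2)^2}{12} = \frac{25}{12}$.

By the central limit theorem, since $n$ is large, $\bar{X}$ is approximately normal and $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ with $n = 30$

i.e. $\bar{X} \sim N\left(4.5, \frac{25}{360}\right)$, $\bar{X} \sim N(4.5, 0.0694)$

$P(\bar{X} > 5) = P\left(Z > \frac{5 - 4.5}{\sqrt{0.0694}}\right) = P(Z > 1.897) \approx 0.029$

[Figure: normal curve with $\bar{X}$ = 4.5 and 5 marked, corresponding to Z = 0 and Z = 1.897]
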
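
The three central limit theorem calculations follow the same pattern, which can be sketched as below. The (mean, variance) pairs are assumptions: each population is taken to have mean 4.5, with variances 4.5, 2.25 and 25/12, values that reproduce the z-values 1.291, 1.826 and 1.897 shown in the solution. The function name is illustrative.

```python
from statistics import NormalDist

def prob_sample_mean_exceeds(mu, variance, n, threshold):
    """P(sample mean > threshold) under the CLT approximation N(mu, variance / n)."""
    return 1 - NormalDist(mu, (variance / n) ** 0.5).cdf(threshold)

# (population mean, population variance), assumed as described above
cases = {
    "a": (4.5, 4.5),       # Poisson with mean 4.5
    "b": (4.5, 2.25),      # number of heads in 9 tosses of a fair coin
    "c": (4.5, 25 / 12),   # uniform on an interval of width 5 centred at 4.5
}

results = {k: prob_sample_mean_exceeds(mu, var, 30, 5) for k, (mu, var) in cases.items()}
for part, p in results.items():
    print(part, round(p, 3))
```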
The distribution of the sample proportion

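Suppose a random sample of $n$ observations is taken from a population in which the proportion of successes is $p$ and the proportion of failures is $q = 1 - p$. If $X$ is the number of successes in the sample, then $X$ follows a binomial distribution, i.e. $X \sim B(n, p)$, with $E(X) = np$ and $\mathrm{Var}(X) = npq$.

The random variable for the proportion of successes in the sample is $P_s$. This can be written $P_s = \frac{X}{n}$, and it is possible to work out the mean and the variance of $P_s$ using expectation algebra as follows:

$E(P_s) = E\left(\frac{X}{n}\right) = \frac{1}{n}E(X) = \frac{1}{n}(np) = p$

$\mathrm{Var}(P_s) = \mathrm{Var}\left(\frac{X}{n}\right) = \frac{1}{n^2}\mathrm{Var}(X) = \frac{npq}{n^2} = \frac{pq}{n}$

The distribution of $P_s$ has mean $p$ and variance $\frac{pq}{n}$. When $n$ is large, the distribution of $P_s$ is approximately normal, and the larger the sample size $n$, the better the approximation:

$P_s \sim N\left(p, \frac{pq}{n}\right)$, s.d. $= \sqrt{\frac{pq}{n}}$

The distribution of $P_s$ is known as the sampling distribution of proportions. The standard deviation of this distribution is $\sqrt{\frac{pq}{n}}$ and it is known as the standard error of proportion.

Since $P_s = \frac{X}{n}$, when using a continuity correction the usual $\pm 0.5$ for $X$ becomes $\pm\frac{1}{2n}$ for $P_s$.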

Example
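It is known that 3% of frozen pies delivered to a canteen are broken. What is the probability that, on a morning when 500 pies are delivered, 5% or more are broken?

Solution

Let $p$ be the probability that a pie is broken, so $p = 0.03$ and $q = 0.97$.

Let $P_s$ be the proportion of pies in the sample that are broken.
Then $P_s \sim N\left(p, \frac{pq}{n}\right)$ with $n = 500$,
i.e. $P_s \sim N\left(0.03, \frac{0.03 \times 0.97}{500}\right) = N(0.03, 0.0000582)$.
To find the probability that 5% or more are broken, find $P(P_s \ge 0.05)$. Applying the continuity correction $\frac{1}{2n} = 0.001$,

$P(P_s \ge 0.05) \rightarrow P(P_s > 0.049)$

$= P\left(Z > \frac{0.049 - 0.03}{\sqrt{0.0000582}}\right)$

$= P(Z > 2.491)$

$= 1 - 0.9936$

$= 0.0064$

[Figure: normal curve with $P_s$ = 0.03 and 0.049 marked, corresponding to Z = 0 and Z = 2.491]
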
Alternative method

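Instead of considering $P_s$, the proportion of broken pies, you could consider $X$, the number of broken pies in the sample.

In this case, $X \sim B(n, p)$ with $n = 500$, $p = 0.03$.

Now $np = 15$ and $nq = 485$.

Since $n$ is large, such that $np > 5$ and $nq > 5$, use the normal approximation for the binomial distribution, where $X \sim N(np, npq)$ with $np = 15$ and $npq = 14.55$, i.e. $X \sim N(15, 14.55)$.
You want the probability that 5% or more are broken. 5% of 500 = 25, so find the probability that 25 or more are broken. Using a continuity correction,

$P(X \ge 25) \rightarrow P(X > 24.5)$

$= P\left(Z > \frac{24.5 - 15}{\sqrt{14.55}}\right)$

$= P(Z > 2.491)$

$= 0.0064$

[Figure: normal curve with X = 15 and 24.5 marked, corresponding to Z = 0 and Z = 2.491]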
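
Both routes to the answer can be checked side by side. This is a minimal sketch using Python's standard `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

n, p = 500, 0.03
q = 1 - p

# Method 1: sample proportion, with continuity correction 1/(2n)
proportion = NormalDist(p, (p * q / n) ** 0.5)
prob_by_proportion = 1 - proportion.cdf(0.05 - 1 / (2 * n))

# Method 2: number of broken pies, with continuity correction 0.5
count = NormalDist(n * p, (n * p * q) ** 0.5)
prob_by_count = 1 - count.cdf(25 - 0.5)

print(round(prob_by_proportion, 4), round(prob_by_count, 4))
```

The two methods are the same calculation on different scales, so they agree apart from floating-point rounding.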
POINT ESTIMATES

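If the random sample taken is of size $n$,

• The best unbiased estimate of $p$, the proportion of successes in the population, is $\hat{p}$, where $\hat{p} = p_s$ and $p_s$ is the proportion of successes in the sample.
• The best unbiased estimate of $\mu$, the population mean, is $\hat{\mu}$, where $\hat{\mu} = \bar{x}$ and $\bar{x}$ is the mean of the sample.
• The best unbiased estimate of $\sigma^2$, the population variance, is $\hat{\sigma}^2$, where $\hat{\sigma}^2 = \frac{n}{n-1}s^2$ and $s^2$ is the variance of the sample.

There are alternative formats for $\hat{\sigma}^2$:

$\hat{\sigma}^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$, $\quad \hat{\sigma}^2 = \frac{1}{n-1}\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)$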

Example 1
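A railway enthusiast simulates train journeys and records the number of minutes, $x$, to the nearest minute, that trains are late according to the schedule being used. A random sample of 50 journeys gave the following times (row totals are shown on the right).

17   5   3  10   4   3  10   5   2  14   = 73
 3  14   5   5  21   9  22  36  14  34   = 163
22   4  23   6   8  15  41  23  13   7   = 162
 6  13  33   8   5  34  26  17   8  43   = 193
24  14  23   4  19   5  23  13  12  10   = 147

Given that $\sum x = 738$ and $\sum x^2 = 16526$, calculate, to two decimal places, unbiased estimates of the mean and the variance of the population from which this sample was drawn.

Solution
$X$ is the number of minutes that the train is late.
Let $E(X) = \mu$ and $\mathrm{Var}(X) = \sigma^2$.
Unbiased estimate of $\mu$:
$\hat{\mu} = \bar{x} = \frac{\sum x}{n} = \frac{738}{50} = 14.76$
Unbiased estimate of $\sigma^2$:
$\hat{\sigma}^2 = \frac{1}{n-1}\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)$
$= \frac{1}{49}\left(16526 - \frac{738^2}{50}\right)$
$= \frac{1}{49}(16526 - 10892.88)$
$= 114.96$ (2 d.p.)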
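
These estimates can be reproduced from the raw data. The sketch below uses Python's standard `statistics` module, whose `variance` function divides by n - 1 and therefore gives the unbiased estimate directly; the variable names are illustrative.

```python
from statistics import mean, variance

minutes_late = [
    17, 5, 3, 10, 4, 3, 10, 5, 2, 14,
    3, 14, 5, 5, 21, 9, 22, 36, 14, 34,
    22, 4, 23, 6, 8, 15, 41, 23, 13, 7,
    6, 13, 33, 8, 5, 34, 26, 17, 8, 43,
    24, 14, 23, 4, 19, 5, 23, 13, 12, 10,
]

mu_hat = mean(minutes_late)              # unbiased estimate of the mean
sigma2_hat = variance(minutes_late)      # divisor n - 1: unbiased estimate of the variance
p_hat = sum(x > 25 for x in minutes_late) / len(minutes_late)  # proportion over 25 minutes late

print(sum(minutes_late), sum(x * x for x in minutes_late))
print(round(mu_hat, 2), round(sigma2_hat, 2), p_hat)
```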

Example
For the data given in example 1 above, estimate the proportion of trains that are more than 25 minutes
late.

Solution

Number in sample that are more than 25 minutes late = 7. Proportion in sample, $p_s = \frac{7}{50} = 0.14$. The unbiased estimate of the population proportion, $p$, is $\hat{p}$, where $\hat{p} = p_s = 0.14$.

INTERVAL ESTIMATES

Another way of using a sample value to give a good idea of an unknown population parameter is to
construct an interval, known as a confidence interval.

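The interval is usually written (a, b) and the end values, a and b, are known as confidence limits. The probabilities often used in confidence intervals are 90%, 95% and 99%.

Suppose you do not know the mean $\mu$ of a particular population and you want to work out a 95% confidence interval for it. You construct an interval (a, b) so that

$P(a < \mu < b) = 0.95$

In this case, the probability that the interval includes $\mu$ is 0.95 or 95%.

a) Confidence interval for $\mu$, the population mean

• Of a normal population
• With known variance $\sigma^2$
• Using any size sample, n large or small.

If $X \sim N(\mu, \sigma^2)$, then $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$. Standardizing, $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$, where $Z \sim N(0, 1)$.

For a 95% confidence interval, find the values of Z between which the central 95% of the distribution lies. This means that the upper tail probability is 0.025 and the lower tail probability is also 0.025, so
$P(Z < z) = 0.975$
$z = 1.96$
The values of Z are $\pm 1.96$.

So $P(-1.96 < Z < 1.96) = 0.95$, i.e. $P\left(-1.96 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95$

Rearranging, the probability statement is $P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$. Therefore the confidence limits are $\bar{x} - 1.96\frac{\sigma}{\sqrt{n}}$ and $\bar{x} + 1.96\frac{\sigma}{\sqrt{n}}$, which can be written $\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}$.

If $\bar{x}$ is the mean of a random sample of any size $n$ taken from a normal population with known variance $\sigma^2$, then a 95% confidence interval for $\mu$ is given by $\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$.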

Example

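The mass of vitamin E in a capsule manufactured by a certain drug company is normally distributed with a standard deviation of 0.042mg. A random sample of five capsules was analysed and the mean mass of vitamin E was found to be 5.12mg. Calculate a symmetric 95% confidence interval for the population mean mass of vitamin E per capsule. Give the values of the end-points of the interval correct to three significant figures.

Solution.

$X$ is the mass, in milligrams, of vitamin E in a capsule.

$X \sim N(\mu, \sigma^2)$ with $\sigma = 0.042$
$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ with $n = 5$
The 95% confidence interval for $\mu$ is $\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$

$= \left(5.12 - 1.96 \times \frac{0.042}{\sqrt{5}},\ 5.12 + 1.96 \times \frac{0.042}{\sqrt{5}}\right)$

$= (5.12 - 0.0368,\ 5.12 + 0.0368)$

So the 95% confidence interval for $\mu$, based on the sample mean, is (5.08mg, 5.16mg) (3 s.f.)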
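
A confidence interval of this kind, for a normal population with known standard deviation, can be computed with a short helper. This is a sketch only; the function name is an illustrative assumption, and the critical value is obtained from `statistics.NormalDist().inv_cdf` rather than from tables.

```python
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, level=0.95):
    """Symmetric confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    margin = z * sigma / n ** 0.5
    return sample_mean - margin, sample_mean + margin

low, high = z_confidence_interval(5.12, 0.042, 5)
print(round(low, 2), round(high, 2))
```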

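b) Confidence interval for $\mu$, the population mean

• Of a non-normal population
• With a known variance $\sigma^2$
• Using a large sample, say $n \ge 30$

In this case, since the sample size is large, the central limit theorem can be used.

$\bar{X}$ is approximately normal and $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$. If $\bar{x}$ is the mean of a random sample of size $n$, where $n$ is large ($n \ge 30$), taken from a non-normal population with known variance $\sigma^2$, then a 95% confidence interval for $\mu$ is given by

$\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$

[Figure: normal curve with the central 95% marked and 2.5% in each tail]
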
Example

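The heights of men in a particular district are distributed with mean $\mu$ cm and standard deviation $\sigma$ cm.

On the basis of the results obtained from a random sample of 100 men from the district, the 95% confidence interval for $\mu$ was calculated and found to be (177.22cm, 179.18cm).
Calculate
a) The value of the sample mean, $\bar{x}$
b) The value of $\sigma$
c) A symmetric 90% confidence interval for $\mu$

Solution
a) A 95% confidence interval for $\mu$ is given by $\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}$ with $n = 100$, so $\sqrt{n} = 10$.
Since the interval is (177.22, 179.18),
$\bar{x} - 1.96\frac{\sigma}{10} = 177.22$   (1)
$\bar{x} + 1.96\frac{\sigma}{10} = 179.18$   (2)
Adding (1) and (2): $2\bar{x} = 356.4$
$\bar{x} = 178.2$
The sample mean is 178.2 cm.

b) Subtracting (1) from (2): $2 \times 1.96\frac{\sigma}{10} = 1.96$, so $\sigma = 5$.

c) A symmetric 90% confidence interval for $\mu$ is $\left(\bar{x} - 1.645\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.645\frac{\sigma}{\sqrt{n}}\right)$

$\bar{x} \pm 1.645\frac{\sigma}{\sqrt{n}} = 178.2 \pm 1.645 \times \frac{5}{10} = 178.2 \pm 0.8225$

So the 90% confidence interval = (177.38cm, 179.02cm) (2 d.p.)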


Example
A plant produces steel sheets whose weights are known to be normally distributed with a standard deviation of 2.4kg. A random sample of 36 sheets had a mean weight of 31.4kg.
Find 99% confidence limits for the population mean.
Solution
$X$ is the weight, in kilograms, of a steel sheet.
Then $X \sim N(\mu, 2.4^2)$.
A sample of size 36 is taken, so $n = 36$ and the sample mean $\bar{x} = 31.4$. The end values of the 99% confidence interval for $\mu$ are $\bar{x} \pm 2.576\frac{\sigma}{\sqrt{n}}$

$= 31.4 \pm 2.576 \times \frac{2.4}{6}$
$= 31.4 \pm 1.0304$
So the 99% confidence interval is $(31.4 - 1.0304,\ 31.4 + 1.0304)$ = (30.37kg, 32.43kg) (2 d.p.)
Exercise
The result of a stress test is known to be a normally distributed random variable with mean $\mu$ and standard deviation 1.3. It is required to have a 95% symmetrical confidence interval for $\mu$ with total width less than 2. Find the least number of tests that should be carried out to achieve this.

Ans: 7

CONFIDENCE INTERVAL FOR $\mu$, THE POPULATION MEAN
• Of a normal or non-normal population
• With unknown variance
• Using a large sample, $n \ge 30$

When calculating confidence intervals it is often the case that the population variance, $\sigma^2$, is not known. Provided that the sample size, $n$, is large ($n \ge 30$), it is permissible to use the best unbiased estimate, $\hat{\sigma}^2$, for $\sigma^2$.

Ideally the distribution of $X$ should be normal, but an approximate confidence interval can also be given when the distribution of $X$ is not normal. Remember that in both cases, $n$ must be large.

Provided that $n$ is large ($n \ge 30$), a 95% confidence interval for $\mu$ is

$\left(\bar{x} - 1.96\frac{\hat{\sigma}}{\sqrt{n}},\ \bar{x} + 1.96\frac{\hat{\sigma}}{\sqrt{n}}\right)$

where $\hat{\sigma}^2 = \frac{n}{n-1}s^2$ and $s^2$ is the sample variance,

or $\hat{\sigma}^2 = \frac{1}{n-1}\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)$

Example

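The fuel consumption of a new model of car is being tested. In one trial, 50 cars chosen at random were driven under identical conditions and the distances, $x$ km, covered on one litre of petrol were recorded. The results gave the following totals: $\sum x = 525$, $\sum x^2 = 5625$. Calculate a 95% confidence interval for the mean petrol consumption, in kilometers per litre, of cars of this type.

Solution
$\bar{x} = \frac{\sum x}{n} = \frac{525}{50} = 10.5$
$\sigma^2$ is unknown, so use $\hat{\sigma}^2$, where $\hat{\sigma}^2 = \frac{1}{n-1}\left(\sum x^2 - \frac{(\sum x)^2}{n}\right) = \frac{1}{49}\left(5625 - \frac{525^2}{50}\right)$
$= 2.2959\ldots$
$\hat{\sigma} = 1.5152\ldots$
95% confidence interval for $\mu$ $= \left(10.5 - 1.96 \times \frac{1.5152}{\sqrt{50}},\ 10.5 + 1.96 \times \frac{1.5152}{\sqrt{50}}\right)$
= (10.08km/litre, 10.92km/litre)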

Example
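The height, $x$ cm, of each man in a random sample of 200 men living in the U.K. was measured. The following results were obtained: $\sum x = 35050$, $\sum x^2 = 6163109$.

a) Calculate unbiased estimates of the mean and variance of the heights of men living in the U.K.
b) Determine an approximate 90% confidence interval for the mean height of men living in the U.K. Name the theorem that you have assumed.

Solution

a) $\hat{\mu} = \bar{x} = \frac{35050}{200} = 175.25$

$\hat{\sigma}^2 = \frac{1}{n-1}\left(\sum x^2 - n\bar{x}^2\right)$

$= \frac{1}{199}\left(6163109 - 200 \times 175.25^2\right)$

$= 103.5$

b) The confidence limits for a 90% confidence interval for $\mu$ are

$\bar{x} \pm 1.645\frac{\hat{\sigma}}{\sqrt{n}} = 175.25 \pm 1.645 \times \frac{\sqrt{103.5}}{\sqrt{200}}$
$= 175.25 \pm 1.1833$
So the 90% confidence interval is $(175.25 - 1.1833,\ 175.25 + 1.1833)$

= (174.07cm, 176.43cm) (2 d.p.)

The central limit theorem has been used to give an approximate distribution for $\bar{X}$, the sample mean, where $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$.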
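
The large-sample interval in part (b) can be reproduced with a short sketch, starting from the estimates 175.25 and 103.5 obtained in part (a). The function name is an illustrative assumption, and the critical value comes from `statistics.NormalDist().inv_cdf` rather than from tables.

```python
from statistics import NormalDist

def large_sample_interval(sample_mean, variance_estimate, n, level):
    """Approximate confidence interval for the mean, using the unbiased variance estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.645 for a 90% interval
    margin = z * (variance_estimate / n) ** 0.5
    return sample_mean - margin, sample_mean + margin

low, high = large_sample_interval(175.25, 103.5, 200, 0.90)
print(round(low, 2), round(high, 2))
```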

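c) Confidence interval for $\mu$ when

• The population is normal
• $\sigma^2$ is unknown
• Sample size n is small

When calculating confidence intervals, you have already encountered the situation when large samples ($n \ge 30$) are taken from a normal population with unknown variance $\sigma^2$.

For large samples,

$Z = \frac{\bar{X} - \mu}{\hat{\sigma}/\sqrt{n}}$, where $Z \sim N(0, 1)$ approximately.

But if the sample size is small ($n < 30$), $\frac{\bar{X} - \mu}{\hat{\sigma}/\sqrt{n}}$ no longer has a normal distribution.

For small samples,

$T = \frac{\bar{X} - \mu}{\hat{\sigma}/\sqrt{n}}$, where $T$ has a t-distribution with $n - 1$ degrees of freedom, written $T \sim t(n-1)$.

The 95% confidence interval for $\mu$ is obtained as follows:
If $\bar{x}$ and $s^2$ are the mean and variance of a small sample ($n < 30$) from a normal population with unknown mean $\mu$ and unknown variance $\sigma^2$, then a 95% confidence interval for $\mu$ is given by

$\left(\bar{x} - t\frac{\hat{\sigma}}{\sqrt{n}},\ \bar{x} + t\frac{\hat{\sigma}}{\sqrt{n}}\right)$, where $\hat{\sigma}^2 = \frac{n s^2}{n-1}$ and $t$ is the value from a $t(n-1)$ distribution such that $(-t, t)$ encloses 95% of the $t(n-1)$ distribution.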

Example

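Consider $T$, which follows a $t$-distribution with 11 degrees of freedom, i.e. $\nu = 11$ and $T \sim t(11)$. Find

i) $P(T < 1.796)$
ii) $P(T > 3.106)$
iii) $P(|T| < 2.201)$

Solution
i) $\nu = 11$, so find row 11 and go across to 1.796, then up to the top of the column.
This gives 0.95, so $P(T < 1.796) = 0.95$.

ii) Find row 11 and go across to 3.106, which is in column 0.995.
So $P(T < 3.106) = 0.995$ and $P(T > 3.106) = 1 - 0.995 = 0.005$.

iii) You need $P(|T| < 2.201) = P(-2.201 < T < 2.201)$. Find row 11 and go across to
2.201, which is in column 0.975.
So $P(T < 2.201) = 0.975$ and $P(T > 2.201) = 0.025$.

It follows that $P(T < -2.201) = 0.025$ also, so
$P(|T| < 2.201) = 1 - 2(0.025) = 0.95 = 95\%$

The values that enclose the central 95% of the $t(11)$ distribution are $\pm 2.201$.
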
Exercise
The random variable $T$ has a $t$-distribution with 14 degrees of freedom, i.e. $T \sim t(14)$. Find the value of
$t$ for which
i) ( )
ii) (| | )

Example
The mass, in grams, of a packet of biscuits of a particular brand follows a normal distribution with mean
$\mu$. Ten packets of biscuits are chosen at random and their masses noted. The results, in grams, are
397.3, 399.6, 401.0, 392.9, 396.8, 400.0, 397.6, 392.1, 400.8, 400.6
Calculate a 95% confidence interval for $\mu$.
Solution
Let $X$ be the mass, in grams, of a packet of biscuits.
Then $X \sim N(\mu, \sigma^2)$ with both $\mu$ and $\sigma^2$ unknown.
The data can be summarized as follows: $n = 10$, $\sum x = 3978.7$, so the sample mean is $\bar{x} = 397.87$.
Since $\sigma^2$ is unknown, find $\hat{\sigma}^2$:
$$\hat{\sigma}^2 = \frac{1}{n-1}\sum (x - \bar{x})^2 = \frac{92.901}{9} = 10.32\ldots$$
so $\hat{\sigma} = 3.213\ldots$
Since $n$ is small, a $t(n-1)$ distribution is required. $n = 10$, so use a $t(9)$ distribution.
The 95% confidence limits for $\mu$ are $\bar{x} \pm t\frac{\hat{\sigma}}{\sqrt{n}}$, where $(-t, t)$ encloses 95% of the $t(9)$ distribution.

From tables, the critical value is 2.262. Therefore the confidence limits are
$$397.87 \pm 2.262 \times \frac{3.213}{\sqrt{10}} = 397.87 \pm 2.298\ldots$$
95% confidence interval for $\mu$ = (397.87 - 2.298, 397.87 + 2.298)
= (395.57, 400.17)
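The small-sample interval for the biscuit data can be reproduced with the standard library. This is a sketch under stated assumptions: the helper name `small_sample_ci` is hypothetical, and the critical value 2.262 is taken from the $t(9)$ table quoted in the solution rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

masses = [397.3, 399.6, 401.0, 392.9, 396.8, 400.0, 397.6, 392.1, 400.8, 400.6]


def small_sample_ci(data, t_crit):
    """Confidence interval for the mean using a t critical value from tables."""
    n = len(data)
    x_bar = mean(data)
    sigma_hat = stdev(data)  # square root of the unbiased variance estimate
    half_width = t_crit * sigma_hat / sqrt(n)
    return x_bar - half_width, x_bar + half_width


# 2.262 is the tabulated value enclosing the central 95% of t(9).
low, high = small_sample_ci(masses, t_crit=2.262)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

For a different confidence level or sample size, only the tabulated `t_crit` changes.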
Exercise
A student studying the height of a particular plant knows that it follows a normal distribution with
mean $\mu$ and variance $\sigma^2$, but he does not know the value of either of these parameters. He selects 15
plants at random, measures their heights and calculates that the mean height of the sample is 12.2 cm
and the standard deviation is 1.4 cm. Using these values, calculate a 90% confidence interval for $\mu$.
Calculate also the width of this interval.

TEST OF HYPOTHESIS
The procedure for carrying out a hypothesis test is:
• Define the variable
• State $H_0$ and $H_1$
• State the distribution according to $H_0$
• State the level and type of test
• State the rejection criterion
• Calculate the required probability
• Make your conclusion

TYPE I AND TYPE II ERRORS
When you perform a significance test, there are four possible outcomes:
a) $H_0$ is true and your test leads you to accept $H_0$: correct decision.
b) $H_0$ is true but your test leads you to reject $H_0$: wrong decision (Type I error).
c) $H_0$ is false but your test leads you to accept $H_0$: wrong decision (Type II error).
d) $H_0$ is false and your test leads you to reject $H_0$: correct decision.

These outcomes can be summarized in a table.

                                 Test decision
                                 Accept $H_0$      Reject $H_0$
Actual      $H_0$ is true        Correct           Type I error
situation   $H_0$ is false       Type II error     Correct

DEFINITIONS

A Type I error is made when you reject $H_0$ when it is true,

i.e. P(Type I error) = P(reject $H_0$ when $H_0$ is true)

A Type II error is made when you accept $H_0$ when it is false, i.e. P(Type II error) = P(accept $H_0$ when
it is false)

Example

A random observation is taken from a binomial distribution ( ) and used to test the null
hypothesis against the alternative hypothesis

The critical region is chosen to be

a) What is the significance level of the test


b) What is the probability of making type I error?
c) Find the probability of making a Type II error if, in fact,

Solution
You are given that ( )

a)
b)

The critical region is , so to find the significance level of the test, find ( )

( ) ( ) ( )
=
= 0.0691
= 7%
The significance level is approximately 7%

b) P(Type I error) = P(reject $H_0$ when it is true)
$H_0$ is rejected if , so
P(Type I error) = 0.0691 (found above),
since the probability of a Type I error equals the significance level of the test.
c) You make a Type II error when you accept $H_0$ (which you will do if ) when is the value
specified in $H_1$ (not the value given by $H_0$).
The hypothesis is now

So, P(type II error) = P(accept when is true)


( )
= ( ) ( )
( ) ( ) ( )
=
= 0.8244…
= 82%
The probability of making a type II error is 82%

Note
This is a very high probability. To make it smaller you could increase the significance level of the test,
but this would of course increase the probability of making a Type I error.
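The trade-off between the two error probabilities can be explored numerically. The sketch below is illustrative only: the values of `n`, `p0`, `p1` and the critical value `c` are hypothetical and are not those of the worked example above, and the helper names are not from the notes.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def upper_tail(c, n, p):
    """P(X >= c) for X ~ Bin(n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(c, n + 1))


# Hypothetical set-up: H0: p = p0, true value p = p1, reject H0 if X >= c.
n, p0, p1, c = 10, 0.5, 0.7, 8

alpha = upper_tail(c, n, p0)     # P(Type I error) = significance level
beta = 1 - upper_tail(c, n, p1)  # P(Type II error) = P(X < c when p = p1)
print(f"P(Type I) = {alpha:.4f}, P(Type II) = {beta:.4f}")
```

Lowering `c` enlarges the critical region, which raises `alpha` and lowers `beta`, in line with the note above.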

Exercise
1) The random variable can be modeled by a binomial distribution.
Test, at the 8% level, the hypothesis that against the alternative hypothesis
a) When ,
b) When

2) The random variable can be modeled by a binomial distribution with parameters and
whose value is unknown.
a) Find, at the 10% level of significance, the critical region to test the null hypothesis that
against the alternative hypothesis that
b) Explain what is meant by a type I error
c) State the probability of making a Type I error in test described in (a)

3) A random observation is taken from a binomial distribution ( ) and used to test the null
hypothesis against the alternative hypothesis
The critical region is chosen to be
a) At what significance level is the test carried out?
b) What is the probability of making a Type I error?
c) Find the probability of making a Type II error if, in fact,

CORRELATION ANALYSIS

Content
 Type of correlation
 Scatter diagram
 The covariance
 The correlation co-efficient
 Probable Error
 Spearman’s rank correlation

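Formulae

1. Covariance

$$\operatorname{Cov}(X, Y) = \frac{1}{n}\sum (X - \bar{X})(Y - \bar{Y})$$

2. Correlation co-efficient

a) $$r(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$

b) $$r(X, Y) = \frac{n\sum XY - \sum X \sum Y}{\sqrt{n\sum X^2 - (\sum X)^2}\ \sqrt{n\sum Y^2 - (\sum Y)^2}}$$

c) $$r(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

d) When deviations are taken from the actual means ($x = X - \bar{X}$, $y = Y - \bar{Y}$):

$$r(X, Y) = \frac{\sum xy}{\sqrt{\sum x^2}\ \sqrt{\sum y^2}}$$

3. Probable error

$$PE(r) = 0.6745\,\frac{1 - r^2}{\sqrt{n}}$$

4. Spearman's rank correlation

a) $$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

b) With repeated ranks:
$$\rho = 1 - \frac{6\left[\sum d^2 + \sum \frac{m(m^2 - 1)}{12}\right]}{n(n^2 - 1)}$$

Points to remember

1. If the value of $r$ is positive, there is a positive correlation between the variables; if it is
negative, there is a negative correlation.
2. If $r = 0$, there is no linear correlation.
3. The value of $r$ is not affected by a change of scale or origin.
4. The range of $r$ is -1 to +1.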

TYPES OF CORRELATION

Let $X$ and $Y$ be the two variables under study. There are three types of relationship between
these two variables: (i) they vary directly, i.e. when one increases (decreases) the other
increases (decreases); (ii) they vary inversely, i.e. when one increases (decreases) the other
decreases (increases); (iii) there is no relation. Statistically, these three situations are described as (i)
positively correlated, (ii) negatively correlated, (iii) uncorrelated.

SCATTER DIAGRAM
It is the simplest diagrammatic representation of bivariate data. Let $X$ and $Y$ be the two
variables under study, with $n$ observations each. If we plot the points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ on a graph
sheet, the resulting diagram gives us a rough idea of the correlation between the two variables
$X$ and $Y$.

[Figure 1: scatter diagram of Y-values against X-values]

Fig 1 shows a positive correlation between the variables $X$ and $Y$.

[Figure 2: scatter diagram of Y-values against X-values]

Fig 2 shows a negative correlation between the variables $X$ and $Y$.

[Figure 3: scatter diagram of Y-values against X-values]

In Fig 3 we are not able to recognize a pattern, i.e. there is no correlation between $X$ and $Y$.

Thus a scatter diagram helps us to get an idea of the correlation between the two variables ($X$ and $Y$).

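THE COVARIANCE

Covariance states only the nature of the relationship between the variables. To know
how strongly the variables are associated with each other, we use the correlation co-efficient.
The correlation co-efficient is a measure of the degree or extent of the linear relationship between two variables
$X$ and $Y$. It was defined by Karl Pearson in the 1890s.
The correlation co-efficient for $(X, Y)$ is given by:

$$r(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$

$$r(X, Y) = \frac{n\sum XY - \sum X \sum Y}{\sqrt{n\sum X^2 - (\sum X)^2}\ \sqrt{n\sum Y^2 - (\sum Y)^2}}$$

$$r(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

Short Cut Method

Let $u = \frac{X - A}{h}$ and $v = \frac{Y - B}{k}$ (with $h, k > 0$); then $r(X, Y) = r(u, v)$, where

$$r(u, v) = \frac{n\sum uv - \sum u \sum v}{\sqrt{n\sum u^2 - (\sum u)^2}\ \sqrt{n\sum v^2 - (\sum v)^2}}$$

If the deviations are taken from the actual means of $X$ and $Y$ ($x = X - \bar{X}$, $y = Y - \bar{Y}$), then $\sum x = \sum y = 0$ and

$$r(X, Y) = \frac{\sum xy}{\sqrt{\sum x^2}\ \sqrt{\sum y^2}}$$

The range of the correlation co-efficient is $[-1, 1]$, i.e. $-1 \le r \le 1$.
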
Example 1

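You are given the following data relating to aptitude scores and productivity index.
Aptitude score (X):      9  18  18  20  20  23
Productivity index (Y): 33  23  33  42  29  32
Find the co-efficient of correlation between aptitude scores and productivity index.

Solution

          X     Y    x = X - 18   y = Y - 32    xy    x^2   y^2
          9    33        -9            1        -9     81     1
         18    23         0           -9         0      0    81
         18    33         0            1         0      0     1
         20    42         2           10        20      4   100
         20    29         2           -3        -6      4     9
         23    32         5            0         0     25     0

Total   108   192         0            0         5    114   192

Here $\bar{X} = \frac{\sum X}{n} = \frac{108}{6} = 18$, $\bar{Y} = \frac{\sum Y}{n} = \frac{192}{6} = 32$, and $x = X - \bar{X}$, $y = Y - \bar{Y}$.

$$r(X, Y) = \frac{\sum xy}{\sqrt{\sum x^2}\ \sqrt{\sum y^2}} = \frac{5}{\sqrt{114}\ \sqrt{192}} = \frac{5}{147.95} \approx 0.034$$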
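The deviation-from-the-mean formula used in this example translates directly into code. The following is a minimal sketch; the function name `pearson_r` and the variable names are illustrative and not part of the notes.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation co-efficient using deviations from the actual means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


aptitude = [9, 18, 18, 20, 20, 23]
productivity = [33, 23, 33, 42, 29, 32]
r = pearson_r(aptitude, productivity)
print(round(r, 4))
```

A value this close to zero indicates almost no linear association between the two scores.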

Example 2

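Find the correlation co-efficient for the following data

X:   4   6   8  10  12
Y:   3   6   9  12  15
Solution

          X     Y    u = (X - 8)/2   v = (Y - 9)/3    uv    u^2   v^2
          4     3         -2              -2           4      4     4
          6     6         -1              -1           1      1     1
          8     9          0               0           0      0     0
         10    12          1               1           1      1     1
         12    15          2               2           4      4     4

Total    40    45          0               0          10     10    10

Here $\bar{X} = \frac{40}{5} = 8$, $\bar{Y} = \frac{45}{5} = 9$, $u = \frac{X - 8}{2}$, $v = \frac{Y - 9}{3}$, and $r(X, Y) = r(u, v)$.

$$r(X, Y) = \frac{\sum uv}{\sqrt{\sum u^2}\ \sqrt{\sum v^2}} = \frac{10}{\sqrt{10}\ \sqrt{10}} = 1$$

This shows that there is a perfect positive linear correlation between $X$ and $Y$.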

PROBABLE ERROR AND ITS SIGNIFICANCE

Probable error of the correlation co-efficient is a measure of the reliability of $r$, which is given by the formula
$$PE(r) = 0.6745\,\frac{1 - r^2}{\sqrt{n}}$$

If $|r| < PE(r)$, the correlation co-efficient is definitely not significant. If $|r| > 6\,PE(r)$, the correlation co-efficient
is significant.

Example 3

If and is significant.
( )

( ( ) ) ( )

( )

( )
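The probable-error check is easy to automate. This is a sketch only: it assumes the conventional decision rule (not significant if $|r| < PE(r)$, significant if $|r| > 6\,PE(r)$), the label "inconclusive" for the region in between is an added assumption, and the inputs shown are hypothetical rather than those of Example 3.

```python
from math import sqrt


def probable_error(r, n):
    """PE(r) = 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1 - r ** 2) / sqrt(n)


def significance(r, n):
    """Classify r using the assumed rule |r| < PE or |r| > 6 * PE."""
    pe = probable_error(r, n)
    if abs(r) < pe:
        return "not significant"
    if abs(r) > 6 * pe:
        return "significant"
    return "inconclusive"


# Hypothetical inputs for illustration only.
print(probable_error(0.8, 25), significance(0.8, 25))
```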

EXERCISES
1. Find Karl Pearson’s co-efficient of correlation of the following data.
Age of husband ( ): 20 22 23 25 25 28 29 30 30 34
Age of wife ( ): 18 20 22 24 21 26 26 25 27 29

2. Calculate Karl Pearson’s co-efficient of correlation between percentage of pass and failure from the
following data.
No. of students ( ): 800 600 900 700 500 400
No. passed ( ): 480 300 450 560 450 300

3. Calculate Karl Pearson’s co-efficient of correlation for the following data.


Price (in shs) ( ): 21 22 23 24 25 26 27 28 29 30
Demand (in 000 units) ( ): 18 19 19 16 17 16 16 15 13 11

4. Calculate Karl Pearson’s co-efficient of correlation


: 1 2 3 4 5 6 7
: 3 5 6 8 10 11 13

5. Calculate co-efficient of correlation using Karl Pearson’s method.


Advertisement (‘000’ Shs) ( ): 39 65 62 90 82 75 25 98 36 78
Sales (in ‘000’ Shs) ( ): 47 53 58 86 62 68 60 91 51 84

6. Given below are the monthly incomes and their net saving of a sample of 10 supervisory staff
belonging to a firm. Calculate the correlation co-efficient.
Employee No.: 1 2 3 4 5 6 7 8 9 10
Monthly income (x): 780 360 980 250 750 82 900 620 650 390
Net saving: ( ): 84 51 91 60 68 62 86 58 53 47

7. From the following data calculate the correlation co-efficient.

8. From the following details find the value of


( ) and .

9. From the following data, compute the co-efficient of correlation between and .
series series
No. of items: 15 15
Arithmetic mean: 25 18
Sum of squares of deviation from mean: 136 138

10. Find the correlation co-efficient.


(i) ( )
(ii)

11. Find the probable error.


(i)
(ii)
Comment on the value of .

12. How do you interpret $r$ when it is (a) $+1$, (b) $-1$, (c) $0$?

(a) When $r = +1$, the points lie exactly on a straight line with positive slope: as $X$ increases, $Y$ also increases. There is
a perfect positive linear association between $X$ and $Y$.
(b) When $r = -1$, the points lie exactly on a straight line with negative slope: as $X$ increases (decreases),
$Y$ decreases (increases). This is a perfect negative linear association.
(c) When $r = 0$, there is no linear association between the two variables.
13. Calculate the correlation between age and playing habits of students.
Age (years): 15 16 17 18 19 20
No. of students: 800 600 900 700 500 400
Regular players: 480 300 450 560 450 300

14. A company wanted to assess the impact of R&D expenditure ($X$) on annual profit ($Y$). The following
table presents data for the past 8 years.
9 7 5 10 4 5 3 2
45 42 41 60 30 34 25 20
Compute the correlation co-efficient. Comment on the value.


SPEARMAN’S RANK CORRELATION.

In the correlation co-efficient we used the actual values to measure the degree of relationship; here we use
the ranks of the given data. Rank correlation is useful when we study the relationship between
qualitative characteristics such as beauty, intelligence, etc. Ranks are obtained by arranging the data in
order of merit. Spearman's rank correlation formula is given by

$$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

where $d$ is the difference between the ranks of the two variables and $n$ is the number of pairs.

Example 1

Calculate rank correlation from the following.


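Marks in statistics (X):    15 20 25 18 40 60 80
Marks in accountancy (Y):   40 30 50 36 20 10 60

Solution

          X     Y    Rank of X (R1)   Rank of Y (R2)   d = R1 - R2   d^2
         15    40          7                3               4         16
         20    30          5                5               0          0
         25    50          4                2               2          4
         18    36          6                4               2          4
         40    20          3                6              -3          9
         60    10          2                7              -5         25
         80    60          1                1               0          0

Total                                                                 58

$$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(58)}{7(49 - 1)} = 1 - \frac{348}{336} = -0.0357$$
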
REPEATED RANKS

If more than one item in a series has the same value, a common rank is given to the
repeated items. This common rank is the average of the ranks which these items would have received if
they had been slightly different from each other, and the next item gets the rank following the ranks already used.
As a result there is a small adjustment in the rank correlation formula. The adjustment factor

$$\frac{m(m^2 - 1)}{12}$$

is added to $\sum d^2$ for each repeated value, where $m$ is the number of times the value is repeated:

$$\rho = 1 - \frac{6\left[\sum d^2 + \sum \frac{m(m^2 - 1)}{12}\right]}{n(n^2 - 1)}$$

Example 2
Calculate rank correlation from the following.
Marks in statistics (X):    15 20 28 12 40 60 20 80
Marks in accountancy (Y):   40 30 50 30 20 10 30 60

Solution

          X     Y    Rank of X (R1)   Rank of Y (R2)   d = R1 - R2   d^2
         15    40          7                3               4         16
         20    30          5.5              5               0.5        0.25
         28    50          4                2               2          4
         12    30          8                5               3          9
         40    20          3                7              -4         16
         60    10          2                8              -6         36
         20    30          5.5              5               0.5        0.25
         80    60          1                1               0          0

Total                                                                 81.5

In the X series: 20 is repeated 2 times. The ranks would have been 5 and 6, so
the average is 5.5, and the mark 15 gets the seventh position.
Here $m = 2$:
$$\frac{m(m^2 - 1)}{12} = \frac{2(4 - 1)}{12} = 0.5$$

In the Y series: 30 is repeated 3 times. It shares the 4th, 5th and 6th positions, so each 30 gets the average rank 5.
Here $m = 3$:
$$\frac{m(m^2 - 1)}{12} = \frac{3(9 - 1)}{12} = 2$$

$$\rho = 1 - \frac{6\left[\sum d^2 + \sum \frac{m(m^2 - 1)}{12}\right]}{n(n^2 - 1)} = 1 - \frac{6[81.5 + 0.5 + 2]}{8(64 - 1)} = 1 - \frac{504}{504} = 0$$
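The ranking with shared (average) ranks and the tie adjustment can be sketched in a few lines. This is an illustrative implementation, not the notes' own: the function names are hypothetical, and ranks are assigned from the largest value (rank 1) downwards, as in the examples above.

```python
def average_ranks(values):
    """Rank from largest (rank 1) to smallest; ties share the mean rank."""
    ordered = sorted(values, reverse=True)
    ranks = {}
    for v in set(values):
        first = ordered.index(v) + 1        # first position occupied by v
        count = ordered.count(v)            # number of times v is repeated
        ranks[v] = first + (count - 1) / 2  # mean of the positions it occupies
    return [ranks[v] for v in values]


def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over every value repeated m times."""
    counts = (values.count(v) for v in set(values))
    return sum(m * (m * m - 1) / 12 for m in counts if m > 1)


def spearman_rho(xs, ys):
    """Spearman's rank correlation with the repeated-ranks adjustment."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2
             for rx, ry in zip(average_ranks(xs), average_ranks(ys)))
    d2 += tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * d2 / (n * (n * n - 1))


statistics_marks = [15, 20, 28, 12, 40, 60, 20, 80]
accountancy_marks = [40, 30, 50, 30, 20, 10, 30, 60]
rho = spearman_rho(statistics_marks, accountancy_marks)
print(rho)
```

When no value is repeated the correction terms vanish and the function reduces to the basic formula $1 - 6\sum d^2 / (n(n^2 - 1))$.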

EXERCISES
1. Ten competitors in a beauty contest are ranked by three judges in the following order.
First judge: 1 5 4 8 9 6 10 7 3 2
Second judge: 4 8 7 6 5 9 10 3 2 1
Third judge: 6 7 8 1 5 10 9 2 3 4

2. Find rank correlation co-efficient for the following data and give your comments.
Marks in accounts( ): 85 56 89 58 59 67 74 78
Marks in maths ( ) 38 69 56 58 63 78 87 77

3. Find the rank correlation co-efficient and comment


Marks in mathematics( ): 60 70 50 40 80 90 85 96
Marks in statistics ( ): 64 58 72 44 86 80 95 81

4. Find rank correlation co-efficient for the following data and give your comments.
Marks in statistics( ): 84 56 89 58 59 67 74 78
Marks in law( ): 38 69 56 58 63 78 87 77

5. A sample of five fathers and their eldest sons gave the following data about their weight in kgs.
Father( ): 65 60 67 63 74
Son( ): 49 45 57 40 60
Obtain the rank correlation co-efficient and comment on the results.

6. Calculate the rank correlation.


: 65 66 67 68 69 70 71 72
: 67 68 69 72 78 80 82 85
7. Height of fathers and sons are given in inches.
Height of father : 65 66 67 68 69 71 73
Height of son : 67 68 64 68 72 70 69

8. Compute the rank correlation between I.Q’s and marks scored in examination.
Personal: A B C D E F
I.Q: 100 110 140 160 120 130
Exam mark: 70 80 81 78 72 75

9. Ten competitors in a beauty contest are ranked by three judges in the following order.
Judge I: 1 6 5 10 3 2 4 9 7 8
Judge II: 3 5 8 4 7 10 2 1 6 9
Judge III: 6 4 9 8 1 2 3 10 5 7

Use the rank correlation co-efficient to determine which pair of judges has the nearest approach to
common tastes in beauty.

10. The table below shows the respective I.Q’s of 7 father and their eldest sons.
Father I.Q : 90 98 103 104 105 110 114
Sons I.Q : 102 95 107 114 100 98 113
Calculate the rank correlation for the above statistics.

11. Answer the following


1. and . Find
2. and. Find
3. Find
4. and Find

REGRESSION ANALYSIS
INTRODUCTION

In the previous chapter we discussed the nature of the relationship between the variables (i.e.
correlation) and the extent to which they are correlated. Correlation analysis gives us an idea of
the association between the variables, but we would also like to know the functional relationship
between the two variables. Regression methods are meant
to determine these functional relationships.
REGRESSION
Regression analysis is a mathematical measure of the average relationship between two or more
variables in terms of the original units of the data.
In regression analysis there are two types of variable, namely the dependent and the independent variable.
A dependent variable is the one whose value is to be predicted, and an independent variable is the one
which influences the value of the dependent variable.
LINE OF REGRESSION
In a scatter diagram (discussed in the previous topic), we find that the points cluster around some curve,
called the curve of regression. If the curve is a straight line, it is called the line of regression.
“The line of regression is the line which gives the best estimate of the value of one variable for any
specific value of the other variable”.
Let us consider the relationship between two variables $X$ and $Y$. Here there are two lines of regression: (i)
$Y$ on $X$ and (ii) $X$ on $Y$.
(i) Line of regression of $Y$ on $X$: here $Y$ is the dependent variable and $X$ is the independent variable,
and this line gives the best estimate of the value of $Y$ for any specified value of $X$. The
regression equation of $Y$ on $X$ is given by
$$Y = a + bX$$
(ii) Line of regression of $X$ on $Y$: this line is used to estimate the value of $X$ for any specified value
of $Y$. The regression equation of $X$ on $Y$ is given by
$$X = a + bY$$
In the above equations $a$ and $b$ are constants, obtained by the principle of least squares.

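LEAST SQUARES METHOD
Legendre's principle of least squares is “To minimize the sum of squares of the deviations of the actual
values (of $Y$) from their estimated values as given by the line of best fit.”

The regression equation of $Y$ on $X$ is $Y = a + bX$.
Then, according to the least squares principle, we have to minimize

$$S = \sum \big(Y - (a + bX)\big)^2$$

where $Y$ is the actual value and $a + bX$ is the estimated value.
$S$ is minimized by partially differentiating it with respect to $a$ and $b$ and equating the derivatives to zero. In
this way we get two equations:
$$\sum Y = na + b\sum X$$
$$\sum XY = a\sum X + b\sum X^2$$
These equations are known as the normal equations. Solving these equations simultaneously, we can obtain
the values of the constants $a$ and $b$.

Regression equation of $X$ on $Y$:
The regression equation of $X$ on $Y$ is $X = a + bY$.
Here we have to minimize $\sum \big(X - (a + bY)\big)^2$.
Similarly, here we get
$$\sum X = na + b\sum Y$$
$$\sum XY = a\sum Y + b\sum Y^2$$
as the normal equations, and the values of $a$ and $b$ can be obtained from them.

Example
Find the regression equation of $Y$ on $X$ for the following data.
X:  2   4   6   8
Y:  5  15  20  25
Solution

          X     Y    X^2    Y^2    XY
          2     5      4     25    10
          4    15     16    225    60
          6    20     36    400   120
          8    25     64    625   200
Total    20    65    120   1275   390

The normal equations for $Y$ on $X$ are

$65 = 4a + 20b$ (1)
$390 = 20a + 120b$ (2)
(1) × 5: $325 = 20a + 100b$ (3)
(3) - (2): $-65 = -20b$, so $b = 3.25$

Substituting the value of $b$ in (1) above we get $65 = 4a + 65$, so $a = 0$.
The regression equation of $Y$ on $X$ is therefore $Y = 3.25X$.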
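The two normal equations can also be solved in closed form. The following is a minimal sketch, assuming the line has the form $Y = a + bX$; the function name `fit_y_on_x` is illustrative and does not come from the notes.

```python
def fit_y_on_x(xs, ys):
    """Solve the normal equations for a and b in Y = a + bX."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Normal equations:
    #   sum(Y)  = n*a      + b*sum(X)
    #   sum(XY) = a*sum(X) + b*sum(X^2)
    # Eliminating a gives the slope b; a then follows from the first equation.
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = (sy - b * sx) / n
    return a, b


a, b = fit_y_on_x([2, 4, 6, 8], [5, 15, 20, 25])
print(f"Y = {a} + {b}X")
```

Swapping the two arguments gives the constants for the regression of X on Y.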

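REGRESSION CO-EFFICIENT
In a regression equation $Y = a + bX$, $a$ is the intercept which the line cuts on the $Y$-axis and $b$ is the slope of the line,
which is called the regression co-efficient. The regression co-efficient is a measure of the change in the dependent
variable corresponding to a unit change in the independent variable.

The regression co-efficient of $Y$ on $X$ is denoted by $b_{yx}$ and that of $X$ on $Y$ is denoted by $b_{xy}$.

RELATIONSHIP BETWEEN REGRESSION CO-EFFICIENT ‘b’ AND CORRELATION CO-EFFICIENT ‘r’

By definition we know that

$$r = \frac{\operatorname{Cov}(X, Y)}{\sigma_x \sigma_y}$$

By solving the normal equations we get

$$b_{yx} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{\operatorname{Cov}(X, Y)}{\sigma_x^2}$$

Multiplying and dividing by $\sigma_y$ we get

$$b_{yx} = r\,\frac{\sigma_y}{\sigma_x} \quad \text{for } Y \text{ on } X$$

Similarly, for $X$ on $Y$ we get

$$b_{xy} = r\,\frac{\sigma_x}{\sigma_y} \quad \text{for } X \text{ on } Y$$

Now the two regression equations are
$$(Y - \bar{Y}) = b_{yx}(X - \bar{X}) \quad \text{for } Y \text{ on } X$$

$$(X - \bar{X}) = b_{xy}(Y - \bar{Y}) \quad \text{for } X \text{ on } Y$$

Computing the regression co-efficients

The regression co-efficient of $Y$ on $X$ is given by

$$b_{yx} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}$$

Let $u = X - A$ and $v = Y - B$ be the deviations from assumed origins $A$ and $B$; then

$$b_{yx} = \frac{n\sum uv - \sum u \sum v}{n\sum u^2 - (\sum u)^2}$$

If the deviations are taken from the actual means ($x = X - \bar{X}$, $y = Y - \bar{Y}$), then $b_{yx} = \frac{\sum xy}{\sum x^2}$.

Similarly, for $X$ on $Y$ we have

$$b_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (Y - \bar{Y})^2} = \frac{n\sum XY - \sum X \sum Y}{n\sum Y^2 - (\sum Y)^2}$$

With $u = X - A$ and $v = Y - B$ as above,

$$b_{xy} = \frac{n\sum uv - \sum u \sum v}{n\sum v^2 - (\sum v)^2}$$

If the deviations are taken from the actual means, then $b_{xy} = \frac{\sum xy}{\sum y^2}$.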

Example 1

You are given the following

Mean 10 15
S.D 2 4

Find the two regression equations

Solution
Regression equation on

( ̅) ( ̅)

( ) ( )

Regression equation on

( ̅) ( ̅)

( ) ( )

Example
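Find the regression equations for the following data.
X:   2   4   6   8
Y:  10  20  25  30

Solution

Take deviations $x = X - 5$ and $y = Y - 15$.

          X     Y    x = X - 5   y = Y - 15   x^2   y^2    xy
          2    10       -3          -5          9    25    15
          4    20       -1           5          1    25    -5
          6    25        1          10          1   100    10
          8    30        3          15          9   225    45

Total    20    85        0          25         20   375    65

Here $\bar{X} = \frac{20}{4} = 5$ and $\bar{Y} = \frac{85}{4} = 21.25$.

Regression equation of $Y$ on $X$:

$$b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{4(65) - 0(25)}{4(20) - 0^2} = \frac{260}{80} = 3.25$$

$$(Y - \bar{Y}) = b_{yx}(X - \bar{X})$$

$$(Y - 21.25) = 3.25(X - 5), \quad \text{i.e. } Y = 3.25X + 5$$

Regression equation of $X$ on $Y$:

$$b_{xy} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2} = \frac{4(65) - 0(25)}{4(375) - 25^2} = \frac{260}{875} = 0.2971$$

$$(X - \bar{X}) = b_{xy}(Y - \bar{Y})$$

$$(X - 5) = 0.2971(Y - 21.25), \quad \text{i.e. } X = 0.2971Y - 1.314$$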
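Both regression lines for this example can be checked directly from the raw data. This is a sketch under stated assumptions: the function name `regression_lines` and the dictionary keys are hypothetical, and deviations are taken from the actual means rather than from an assumed origin.

```python
def regression_lines(xs, ys):
    """Return (intercept, slope) for Y on X and for X on Y."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    b_yx = sxy / sxx  # regression co-efficient of Y on X
    b_xy = sxy / syy  # regression co-efficient of X on Y
    return {
        "y_on_x": (y_bar - b_yx * x_bar, b_yx),  # Y = intercept + b_yx * X
        "x_on_y": (x_bar - b_xy * y_bar, b_xy),  # X = intercept + b_xy * Y
        "r_squared": b_yx * b_xy,                # product of the co-efficients
    }


lines = regression_lines([2, 4, 6, 8], [10, 20, 25, 30])
print(lines)
```

The product of the two regression co-efficients equals $r^2$, so it can never exceed 1.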

PROPERTIES OF REGRESSION CO-EFFICIENT

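1. The range of a regression co-efficient is $-\infty$ to $+\infty$.

2. The correlation co-efficient between two variables $X$ and $Y$ is given by $r = \pm\sqrt{b_{yx}\, b_{xy}}$, where $r$ takes the common sign of the two regression co-efficients.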

3. If

Property hold the same for regression co-efficient on .

4. If one of the regression co-efficients is greater than unity, the other one is less than unity,
i.e. if $b_{yx} > 1$ then $b_{xy} < 1$.

5. If the variables $X$ and $Y$ are independent, the regression co-efficients are zero.

DISTINCTION BETWEEN CORRELATION AND REGRESSION ANALYSIS


As we have discussed these two analyses separately, let us now distinguish between them.

    Correlation                                      Regression

1.  It studies the relationship between two         It estimates the functional relationship
    variables.                                       between the variables.

2.  Correlation need not imply a cause and           It studies the cause and effect relationship
    effect relationship between the                  between the variables.
    variables under study.

3.  Correlation analysis is confined only to         Regression analysis deals with both linear and
    the study of linear relationships.               non-linear relationships of the variables
                                                     under study.

4.  It measures the degree of covariance.            It studies the nature of covariance.

5.  The relationship may be purely due to            There is a definite relationship and it has
    chance and may have no practical                 practical relevance.
    relevance.

6.  The correlation co-efficient between $X$ and     The regression co-efficient is not symmetric,
    $Y$ is symmetric, i.e. $r(X, Y) = r(Y, X)$.      i.e. $b_{xy} \ne b_{yx}$.

7.  The correlation co-efficient is a relative       The regression co-efficient is an absolute
    measure and its range is -1 to 1.                measure and its range is $-\infty$ to $+\infty$.

Exercise

1. Obtain the regression of $Y$ on $X$ and of $X$ on $Y$ from the following table, and estimate the blood
pressure when age is 50.

Age (X)   Blood pressure (Y)

56 147
42 125
72 160
36 118
63 149

47 128
55 150
49 145
38 115
42 140
68 152
60 155

2. A company wanted to assess the impact of R&D expenditure ($X$) on annual profit ($Y$). The following
table presents data for the past 8 years.

 X    Y
 9   45
 7   42
 5   41
10   60
 4   30
 5   34
 3   25
 2   20

Find the regression equation of $Y$ on $X$. Estimate the profit for an R&D expenditure of Shs 8 thousand.

3. Calculate the two lines of regression for the following data.


: 1 2 3 4 5
: 3 6 9 12 15

4. You are given the following information about advertisement and sales.
Adv. Exp. ( ) Sales ( )
(Shs million) (Shs million)
Mean 20 120
S.D 5 25
Correlation co-efficient = 0.8
1. Calculate the two regression equations.
2. Find the likely sales when advertisement expenditure is Shs 25 million

5. Given No. of pairs = 12

A. M 74.50 125.50
S.D 13.07 15.85

Summation of products of corresponding deviations from respective means = 2176. Calculate the
co-efficient of correlation between and and find the regressions equations.

6. From the following details estimate the value of A when B =55.


Mean of A = 39.5, mean of B = 47.5
Std. Dev. of A = 10.8, Std. Dev. of B = 16.8
Co-efficient of correlation between A and B =0.42.

7. You are given the following data.

Mean 20 10
S.D 3 4

Estimate the value of when

8. Find the two regression equations.

Mean 25 22
S.D 4 5

9. If and , find .

10. From the data below, find the two regression equation.

( ) ( )
25 43
28 46
35 49
32 41
31 36
36 32
29 31
38 30
34 33
32 39

Compute co-efficient of correlation between and


11. Calculate the regression equation for the following data and also compute Karl Pearson’s coefficient
of correlation:
No. of students ( ): 800 600 900 700 500 400
No. of Passed ( ): 480 300 450 560 450 310

12. To study the effect of rain on yield of wheat the following results were obtained.
Mean S.D
Yield ( ): 800 12
Rainfall ( ): 50 2

Estimate the yield when the rainfall is 80 inches.

13. Find the two regression equations, regression co-efficient of correlation from the following figures.
∑ ∑ ∑ ∑ ∑
