Inference 1 Notes Hsts111
1.0 Introduction
In this chapter, the student will learn what Statistics is, its main branches, and some
practical applications through given problems and exercises.
The word 'Statistics' seems to have been derived from the Latin word 'status', the Italian word
'statista' or the German word 'Statistik', each of which means a political state. Statistics is an old
discipline, as old as human activity, and its utility has been increasing with time. It was used in
the administrative departments of states and governments to keep records of births, deaths,
population, etc., for administrative purposes. In the 17th century John Graunt was the first man to
make a systematic study of birth and death statistics and to calculate the expectation of life at
different ages, which led to the idea of Life Insurance. Almost all fields, such as agriculture,
engineering, health, finance, economics, sociology and management, now use Statistical
Methods for different purposes.
Defining Statistics
Understand and explain the branches of Statistics
Distinguish between Population and Sample
Distinguish between Parameters and statistics
Statistics has many branches, but the two major ones are Descriptive Statistics and Inferential
Statistics.
Definition 1.2.3: Inferential or Inductive Statistics is that branch of Statistics that deals with
making conclusions, generalisations or extensions of results obtained from sample data to the
entire population.
Definition 1.2.4: A random variable is a variable whose value is unknown, or a function that
assigns a value to each of an experiment's outcomes. When the numerical value of a variable is
determined by a chance event, that variable is called a random variable.
Random variables are often designated by letters and can be classified as discrete, that is
variables that can assume some specific and countable values within a given range, or
continuous, that is variables that can have any values within a continuous range.
Consider an experiment where a coin is tossed three times. If X represents the number of times
the coin shows heads then X is a discrete random variable that can only have the values 0, 1, 2
and 3. Examples of continuous random variables are the amount of rainfall in a city over a year
and the average height of a random sample of 40 girls.
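The coin-tossing example can be checked by listing every outcome. The sketch below is only an illustration, not part of the original notes: it assumes the coin is fair, enumerates the eight equally likely outcomes of three tosses and tabulates the distribution of X. The names `outcomes`, `counts` and `pmf` are chosen here for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2^3 = 8 equally likely outcomes of three tosses (fair coin assumed).
outcomes = list(product("HT", repeat=3))

# X = number of heads in each outcome.
counts = Counter(outcome.count("H") for outcome in outcomes)

# Probability mass function of X, kept as exact fractions.
pmf = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(f"P(X = {x}) = {p}")
```

X takes only the values 0, 1, 2 and 3, which is what makes it a discrete random variable.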
Some populations are finite whilst others are infinite. Finite populations include the people in a
certain area, ages of all first year university students, number of households in Harare, etc. Other
populations are infinite, and these include the Binomial population, etc.
In practical situations it is not possible to analyse every element in the population due to
financial, economic and time constraints. We normally resort to the use of part of the population
for information. This leads to the following definition.
There are probability and non-probability sampling methods. The probability sampling methods
include Simple random sampling, Stratified random, systematic sampling, cluster and
multistage-sampling methods. Non-probability sampling methods include convenience sampling,
judgmental sampling, etc. It is advisable to use probability rather than non-probability
sampling methods because of their sound mathematical and statistical basis. You will
learn more about sampling in the module that deals with sampling and survey techniques.
1.4 Population Parameters and Sample statistics
Summary statistics can be obtained from a population or sample. These are calculations made
using population or sample data such as the mean, median, standard deviation, to mention just a
few.
Definition 1.4.1: A parameter is any calculation obtained using population data. It can also
be defined as any numeric descriptive measure of a population characteristic or feature, such as
central tendency or variability.
In making inference about a population parameter on the basis of a sample, we normally use a
statistic, which is defined below.
Example
The probability distribution of a population of figures 1, 2, 3, 4, 5 and 6 is given by
1 2 3 4 5 6
λ
Solution
(a) Since it is a probability distribution it means
(b) ∑
( ) ( ) ( ) ( ) ( )
(c)
∑ ( ) ( ) ( ) ( ) ( ) ( )
( )
Definition 1.4.2: A statistic is any calculation obtained using sample data. It can also be
defined as any numeric descriptive measure of a sample characteristic or feature, such as the sample
mean or the sample standard deviation.
Examples of sample statistics include the sample mean $\bar{X}$, sample proportion $\hat{p}$, sample
correlation $r$, sample variance $S^2$, sample mode and median, etc.
Exercise 1
1. Distinguish between the following terms:
(a) Statistics and statistics
(b) Parameter and statistics
CHAPTER 2
2.0 Introduction
Consider the average birth weight, $\mu$ in kilograms, of all children born in Zimbabwe. The actual
mean weight of all children is not known, and it is difficult if not impossible to know it. To get
an idea of the value of $\mu$ we take a sample of size $n$ and determine its sample
mean $\bar{X}$. Sample means differ from sample to sample, and this variability makes it difficult to
draw meaningful conclusions from one sample. If the extent to which a statistic varies from
sample to sample is known, then we can use $\bar{X}$ to make inference about $\mu$. In order to make
meaningful inference about $\mu$ we need to understand the statistical distribution of the random
variable $\bar{X}$, that is, its sampling distribution.
The next section gives us the sampling distributions of sample mean, sample proportions, sample
variances and correlations.
Consider the experiment where we throw a die and take the uppermost score.
            X1   X2   X3   X4   Sample mean
Sample 1     6    2    5    6   4.75
Sample 2     2    3    1    6   3
Sample 3     1    1    4    6   3
Sample 4     6    2    2    1   2.75
Sample 5     1    5    1    3   2.5
We know that on a fair die the population mean is 3.5 and the population median is also 3.5. But
if we take a sample of four throws, the mean may be far from 3.5. Since each sample consists of
4 throws, we say that the sample size is $n = 4$. Notice that none of the five samples gave us the
correct mean, and that the mean of the first sample is far from the actual mean. The average
(mean) of these means is 3.2. Thus, although the mean of a particular sample may not be a good
predictor of the population mean, we get better results if we take the mean of many
sample means.
The example above has shown us that a sample statistic (such as the sample mean) may be "all
over the place," so a further question is: How confident can we be in the sample statistic?
Sampling distributions will help us address such a question.
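The die-throwing experiment is easy to repeat by simulation. The following is a minimal sketch, not part of the original notes: it draws five samples of four throws, and then many more samples, to show the average of the sample means settling near 3.5. The seed, the number of repetitions and the helper name `sample_mean` are arbitrary choices made here.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the run is repeatable (arbitrary choice)


def sample_mean(n):
    """Mean of n throws of a fair six-sided die."""
    return mean(random.randint(1, 6) for _ in range(n))


# Five samples of size 4, mirroring the five samples in the example.
five_means = [sample_mean(4) for _ in range(5)]

# Many samples of size 4: their average settles near the population mean 3.5.
many_means = [sample_mean(4) for _ in range(20000)]
average_of_means = mean(many_means)

print("five sample means:", five_means)
print("average of 20000 sample means:", round(average_of_means, 3))
```

Individual sample means still vary widely; it is the spread of `many_means` that a sampling distribution describes.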
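Proof:
1. $E(\bar{X}) = E\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\sum_{i=1}^{n} \mu = \mu$
2. $Var(\bar{X}) = Var\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(X_i) = \frac{1}{n^2}\sum_{i=1}^{n} \sigma^2 = \frac{\sigma^2}{n}$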
Theorem 2.3.2: Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal distribution with mean
$\mu$ and variance $\sigma^2$, then $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ follows the standard normal distribution.
Exercise 2.3
1. Prove part 3 of the theorem above.
2. Use the method of moment generating functions to show that $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ follows the standard
normal distribution.
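$\lim_{n \to \infty} P\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z\right) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt$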
This theorem can be applied regardless of the form or type of population probability distribution.
The approximation is generally good for large samples; a common rule of thumb is $n \ge 30$.
Example
A sample of 800 households had an average household income of $640 and a standard deviation
of $220. Use the Central Limit Theorem to find
(a) percentage of households whose income is below $400,
(b) proportion of households whose income is above $730, and
(c) proportion of households whose income is between $400 and $600.
Solution
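(a) $P(X < 400) = P\left(Z < \frac{400 - 640}{220}\right) = P(Z < -1.09) = 0.1379$
The percentage of households below $400 income is 13.79%.
(b) $P(X > 730) = P\left(Z > \frac{730 - 640}{220}\right) = P(Z > 0.41) = 1 - 0.6591 = 0.3409$
(c) $P(400 < X < 600) = P(-1.09 < Z < -0.18) = 0.4286 - 0.1379 = 0.2907$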
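The three probabilities in this example can also be computed directly. The sketch below is only a check, under the assumption made in the worked solution that household income is treated as normal with mean 640 and standard deviation 220. It uses `statistics.NormalDist` from the Python standard library; because it does not round z to two decimal places, its answers differ slightly from table-based values.

```python
from statistics import NormalDist

# Figures from the example: mean income $640, standard deviation $220.
income = NormalDist(mu=640, sigma=220)

p_below_400 = income.cdf(400)                          # part (a)
p_above_730 = 1 - income.cdf(730)                      # part (b)
p_between_400_600 = income.cdf(600) - income.cdf(400)  # part (c)

print(f"(a) P(X < 400)       = {p_below_400:.4f}")
print(f"(b) P(X > 730)       = {p_above_730:.4f}")
print(f"(c) P(400 < X < 600) = {p_between_400_600:.4f}")
```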
Definition 2.5.1: A random variable X has a chi-square distribution with k degrees of freedom,
i.e. $X \sim \chi^2(k)$, if its density function can be written as
$f(x) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, x^{k/2 - 1} e^{-x/2}, \quad x > 0$
From the module in probability, it can be shown that if a standard normal random variable is
squared, the resulting distribution is chi-square with one degree of freedom.
Lemma 2.5.1: Let X be a standard normal random variable, i.e. $X \sim N(0, 1)$. Then $X^2 \sim \chi^2(1)$, that
is, $X^2$ has a chi-square distribution with 1 degree of freedom.
Lemma 2.5.2: It can also be shown that if $X_1, X_2, \ldots, X_k$ are independent standard normal
random variables, then $\sum_{i=1}^{k} X_i^2 \sim \chi^2(k)$, i.e. the sum of squares of k independent
standard normal variables follows a chi-square distribution with k degrees of freedom.
The proofs of Lemmas 2.5.1 and 2.5.2 can be given using variable transformation techniques or
moment generating functions.
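Theorem 2.5.1: Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal distribution with mean $\mu$
and variance $\sigma^2$, then $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$ are independent.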
We are only going to use the result. The proof to the theorem is beyond the scope of the course.
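Theorem 2.5.2: Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal distribution with mean $\mu$
and variance $\sigma^2$, and let $\bar{X}$ and $S^2$ be the sample mean and variance respectively. Then
$\sum_{i=1}^{n} \left(\frac{X_i - \bar{X}}{\sigma}\right)^2 = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1)$
Proof:
$\sum (X_i - \mu)^2 = \sum (X_i - \bar{X} + \bar{X} - \mu)^2 = \sum (X_i - \bar{X})^2 + 2(\bar{X} - \mu)\sum (X_i - \bar{X}) + n(\bar{X} - \mu)^2$
But $\sum (X_i - \bar{X}) = 0$, so we obtain
$\sum (X_i - \mu)^2 = \sum (X_i - \bar{X})^2 + n(\bar{X} - \mu)^2$
Dividing through by $\sigma^2$ gives
$\sum \left(\frac{X_i - \mu}{\sigma}\right)^2 = \sum \left(\frac{X_i - \bar{X}}{\sigma}\right)^2 + \left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2$
Now let $U = \sum \left(\frac{X_i - \mu}{\sigma}\right)^2$, $V = \sum \left(\frac{X_i - \bar{X}}{\sigma}\right)^2$ and $W = \left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2$. Then $U \sim \chi^2(n)$ with moment
generating function $M_U(t) = (1 - 2t)^{-n/2}$ and $W \sim \chi^2(1)$ with moment generating function
$M_W(t) = (1 - 2t)^{-1/2}$. Since $U = V + W$, and $V$ and $W$ are independent because of the
independence of $\bar{X}$ and $S^2$, it can be shown that $M_U(t) = M_V(t)\, M_W(t)$. Solving this equation
for $M_V(t)$ gives
$M_V(t) = (1 - 2t)^{-(n-1)/2}$
which is the moment generating function of a chi-square distribution with $n - 1$ degrees of
freedom.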
Example
The lifetime in months of an electronic switch has a distribution. Calculate the probability
that the component will last between 15 and 30 months.
Solution
⁄
∫ (Integrating by parts)
⁄ ⁄
[ ]
⁄ ⁄
( )
Exercise 2.5
1. Suppose that the random variables are independent and normally
distributed with mean µ and variance . Compute the probability that
∑ ̅ is greater than .
2. Find such that
̅
2.6 Sampling Distribution of $\frac{\bar{X} - \mu}{S/\sqrt{n}}$
A random variable X has a Student's t-distribution with n degrees of freedom if its density
function is given by
$f(x) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\;\Gamma\left(\frac{n}{2}\right)} \left(1 + \frac{x^2}{n}\right)^{-(n+1)/2}, \quad -\infty < x < \infty$
In most practical cases $\sigma$ is unknown. The common way round this is to estimate $\sigma$
by the sample standard deviation S. The t-distribution was first derived by W.S. Gosset in 1908
under the pseudonym "Student".
Theorem 2.6.1: Let $Z$ be a standard normal random variable and $V$ be a chi-square random
variable with $n$ degrees of freedom. If $Z$ and $V$ are independent then the ratio
$T = \frac{Z}{\sqrt{V/n}}$
has a t-distribution with $n$ degrees of freedom.
The probability density function of the t-distribution is symmetric about zero, but it has tails that
are more spread out than those of the $N(0, 1)$ distribution. As the degrees of freedom increase, the
t-distribution tends to look more like the $N(0, 1)$ distribution.
Exercises 2.6
1. Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\mu, \sigma^2)$. Use theorem 2.6.1 to show that
$\frac{\bar{X} - \mu}{S/\sqrt{n}}$ has a t-distribution with $n - 1$ degrees of freedom.
2. Use statistical tables or any computer software to evaluate the following.
(a)
(b)
(c) Find a number such that , for T with 15 degrees of
freedom.
(d) when there are 15 degrees of freedom
When the sample size is small, then $X$ has a binomial distribution with mean $np$ and variance
$npq$, where $q = 1 - p$. When n is large, the binomial variable can well be approximated
by the normal distribution with mean $np$ and variance $npq$, that is
$\frac{X - np}{\sqrt{npq}} \sim N(0, 1)$ approximately.
Dividing both numerator and denominator by $n$ we obtain $\frac{\hat{p} - p}{\sqrt{pq/n}}$. We use this result to
standardize the sample proportion. From the Central Limit Theorem, $\frac{\hat{p} - p}{\sqrt{pq/n}}$ approaches the
standard normal distribution.
Using the laws of expected value and variance, we can show that the mean, variance, and
standard deviation of $\hat{p}$ are:
$E(\hat{p}) = p$,
$Var(\hat{p}) = \frac{pq}{n}$, and
$\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$
Exercises 2.7
1. In a sample of 2400 households, 1800 were found to be below the poverty datum line. Find
the sample proportion of households below the poverty datum line,
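(ii) When $n_1$ and $n_2$ are both large, the normal distribution remains valid if $\sigma_1^2$ and $\sigma_2^2$ are
replaced by their estimators $S_1^2$ and $S_2^2$. The statistic
$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$
has an approximate standard normal distribution.
(iii) If both population variances are unknown and both sample sizes are small, we assume
that both populations are normal and also that the population variances are equal
($\sigma_1^2 = \sigma_2^2 = \sigma^2$). Under these assumptions
$E(\bar{X}_1 - \bar{X}_2) = \mu_1 - \mu_2$ and $Var(\bar{X}_1 - \bar{X}_2) = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$
The common variance $\sigma^2$ can be estimated by combining information from both samples
to obtain a pooled variance denoted by $S_p^2$. The pooled variance is calculated as
$S_p^2 = \frac{\sum_{i=1}^{n_1} (X_{1i} - \bar{X}_1)^2 + \sum_{j=1}^{n_2} (X_{2j} - \bar{X}_2)^2}{n_1 + n_2 - 2}$
By using the pooled variance as an estimator of $\sigma^2$, the statistic
$T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
has a t-distribution with $n_1 + n_2 - 2$ degrees of freedom.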
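Definition **: A random variable $X$ has an F-distribution with $\nu_1$ and $\nu_2$ degrees of freedom if
its density function is given by
$f(x) = \frac{\Gamma\left(\frac{\nu_1 + \nu_2}{2}\right) \left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2} x^{\nu_1/2 - 1}}{\Gamma\left(\frac{\nu_1}{2}\right) \Gamma\left(\frac{\nu_2}{2}\right) \left(1 + \frac{\nu_1 x}{\nu_2}\right)^{(\nu_1 + \nu_2)/2}}, \quad x > 0$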
In practical cases the F-distribution is useful when testing for equality of at least two variances.
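Theorem **: Let $U$ and $V$ be independent chi-square random variables with $\nu_1$ and $\nu_2$ degrees of
freedom respectively. Then the random variable
$F = \frac{U/\nu_1}{V/\nu_2}$
has an F-distribution with $\nu_1$ and $\nu_2$ degrees of freedom.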
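Consider $S_1^2$ and $S_2^2$, the sample variances from two independent random samples of sizes $n_1$ and $n_2$
from normal distributions with population variances $\sigma_1^2$ and $\sigma_2^2$ respectively. Let
$U = \frac{(n_1 - 1)S_1^2}{\sigma_1^2}$ and $V = \frac{(n_2 - 1)S_2^2}{\sigma_2^2}$. Then the statistic
$F = \frac{U/(n_1 - 1)}{V/(n_2 - 1)} = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}$
has an F-distribution with $n_1 - 1$ and $n_2 - 1$ degrees of freedom.
For the difference between two sample proportions $\hat{p}_1$ and $\hat{p}_2$, the corresponding statistic is
$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\left[\frac{1}{n_1} + \frac{1}{n_2}\right]}}$
where $\hat{p}$ is the proportion computed from the two samples combined and $\hat{q} = 1 - \hat{p}$.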
Example
In a study to determine the proportions of males and females with obesity, 80 men in a sample
of 400 and 175 females in a sample of 700 were found to have obesity.
Calculate the probability that the proportion of females with obesity is greater than that of males.
Solution
̂ ,̂ , ̂ , and
̂ ̂ ̂ ̂
( )
√( )( )( ))
(
Sampling Distributions of Order statistics
Most sampling distribution results (except for the CLT) apply to samples from normal populations.
If data do not come from a normal (or at least approximately normal) population, then statistical
methods called “distribution-free” or “non-parametric” methods can be used. Non-parametric methods are
often based on ordered data (called order statistics: $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$) or just their ranks.
$X_{(1)}$ and $X_{(n)}$ are the minimum and maximum observations in an ordered dataset.
If $X_1, X_2, \ldots, X_n$ are from a continuous population with cdf $F(x)$ and pdf $f(x)$ then the pdf of
$X_{(k)}$ is given by
$g_k(x) = \frac{n!}{(k-1)!\,(n-k)!}\, [F(x)]^{k-1}\, [1 - F(x)]^{n-k}\, f(x)$
The confidence intervals for percentiles can be derived using the order statistics and the binomial
distribution.
Proof
(a)
[ ]
[ ]
Hence [ ]
The sampling distribution of the maximum sample observation is given by
$g_n(x) = n\,[F(x)]^{n-1} f(x)$ and the sampling distribution of the minimum sample observation is
given by $g_1(x) = n\,[1 - F(x)]^{n-1} f(x)$.
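The densities of the maximum and minimum correspond to the distribution functions $P(X_{(n)} \le x) = [F(x)]^n$ and $P(X_{(1)} \le x) = 1 - [1 - F(x)]^n$. The simulation below is a rough check of these two results and is not part of the original notes. It assumes a Uniform(0, 1) population, for which $F(x) = x$; the sample size, evaluation point, seed and number of trials are arbitrary illustrative choices.

```python
import random

random.seed(7)  # arbitrary seed for repeatability

n = 5           # sample size (illustrative)
x = 0.5         # point at which the distribution functions are evaluated
trials = 50000

# For Uniform(0, 1), F(x) = x, so
#   P(max <= x) = x**n   and   P(min <= x) = 1 - (1 - x)**n.
max_hits = 0
min_hits = 0
for _ in range(trials):
    sample = [random.random() for _ in range(n)]
    max_hits += max(sample) <= x
    min_hits += min(sample) <= x

sim_max_cdf = max_hits / trials
sim_min_cdf = min_hits / trials
theory_max_cdf = x ** n
theory_min_cdf = 1 - (1 - x) ** n

print(f"max: simulated {sim_max_cdf:.4f}, theory {theory_max_cdf:.4f}")
print(f"min: simulated {sim_min_cdf:.4f}, theory {theory_min_cdf:.4f}")
```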
This statistic gives a measure of the dispersion of the sample. Note that the distribution of the
sample range can be obtained from the joint distribution of $X_{(1)}$ and $X_{(n)}$ given earlier.
Exercises
1. Consider a random sample of size n from the exponential distribution with rate parameter
r. Compute the density function of the
(a) order statistic
2. Consider a random sample of size 10 from the uniform distribution on (0, 2).
3. Four fair dice are rolled. Find the (discrete) density function of each of the order
statistics.
CHAPTER 3
Introduction
Point estimation is concerned with finding a single value which we think best represents the
unknown population parameter. Suppose we want a value which best represents the proportion
of the Zimbabwean population owning a vehicle. The best value is the sample proportion. The
population variance is best represented by the sample variance.
If we are to make some inference about population parameters, we need
observations, or a sample, $X_1, X_2, \ldots, X_n$ from the population of interest. Based on these
values we can find an approximate probability distribution, which we can then use to address
questions relating to the population of interest. The major aim of having these values is to obtain
a statistic $\hat{\theta} = \hat{\theta}(X_1, X_2, \ldots, X_n)$ which we think is good enough to estimate the population
parameter $\theta$. Once $\theta$ has been estimated, the underlying probability distribution can be
used for further inference about the population or probability distribution.
Definition: An estimator
There are several methods used to find these point estimates. The following section deals with all
the methods used to find these estimates.
The following methods are usually used to find estimators of population parameters.
Judgemental method
In this method of estimation personal expertise, experience or judgement is used to determine an
estimate $\hat{\theta}$ of the population parameter $\theta$. This method is commonly used in Economics, where
personal judgement and experience are required to come up with sound judgements about the
levels of various parameters of an economic system.
Example
Suppose you are asked to estimate the average household income for the population in Harare. What
statistic would you use and why?
Solution
The sample mean $\bar{X}$ is the best estimate. The parameter being estimated is the population mean $\mu$,
therefore the sample mean $\bar{X}$ would be the natural estimate of $\mu$.
Example
Let be a random sample from the uniform distribution
Solution
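(a) Since all values of $X_i$ are equal to or greater than the lower limit of the distribution, the
most sensible estimator of the lower limit is the sample minimum $X_{(1)}$.
Since all values of $X_i$ are equal to or less than the upper limit of the distribution, the most
sensible estimator of the upper limit is the sample maximum $X_{(n)}$.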
Method of Moments
This method is based on both sample and population moments.
We equate the sample moments $m_r = \frac{1}{n}\sum_{i=1}^{n} X_i^r$ to the corresponding population moments $\mu_r = E(X^r)$,
then solve the resulting equations for $\theta$. The solution(s) to the equations are the estimate(s) of $\theta$.
The population parameter can be a vector or a single parameter. One can use moments about zero or
moments about the mean; the results obtained are the same.
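As an illustration of the method, consider estimating the mean and variance of a distribution from its first two moments. The sketch below is not part of the original notes and uses made-up data. It equates the first two sample moments about zero to $E(X) = \mu$ and $E(X^2) = \sigma^2 + \mu^2$, and also shows that working with the second moment about the mean gives the same variance estimate.

```python
# Hypothetical data, chosen only to keep the arithmetic simple.
data = [2.0, 4.0, 4.0, 6.0, 9.0]
n = len(data)

# Sample moments about zero: m1 = (1/n) * sum(x), m2 = (1/n) * sum(x^2).
m1 = sum(data) / n
m2 = sum(x ** 2 for x in data) / n

# Equate m1 = mu and m2 = sigma^2 + mu^2, then solve for mu and sigma^2.
mu_hat = m1
sigma2_from_raw = m2 - m1 ** 2

# Same estimate from the second sample moment about the mean.
sigma2_from_central = sum((x - m1) ** 2 for x in data) / n

print("mu_hat =", mu_hat)
print("sigma^2 (moments about zero) =", round(sigma2_from_raw, 6))
print("sigma^2 (moments about mean) =", round(sigma2_from_central, 6))
```

Note that the method of moments variance estimate divides by n, not n - 1.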
Example
Let be a random sample from a distribution with density function
Solution
(a) ∫
[ ]
is the first population moment
̅ solving for
̅
̂
̅
(b) ̅ ∑
Exercises
1. Find the method of moments estimators of based on a random sample
from each of the following density functions:
(a)
(b)
(c)
Definition: Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with density
function $f(x; \theta)$. An estimator $\hat{\theta}$ of $\theta$ that maximizes the likelihood function $L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$ is
called the Maximum Likelihood Estimator (MLE) for $\theta$.
If one or more solutions exist, then it should be verified which ones maximise $L(\theta)$. The value
of $\theta$ that maximises $L(\theta)$ also maximises the log-likelihood, $\ln L(\theta)$. So for computational
convenience the alternative form of the maximum likelihood equation
$\frac{d}{d\theta} \ln L(\theta) = 0$
can be used.
Example
Let be a random sample from a Poisson distribution
Solution
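(a) Writing $\lambda$ for the Poisson parameter, the likelihood function is given by
$L(\lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda} \lambda^{x_i}}{x_i!} = \frac{e^{-n\lambda} \lambda^{\sum x_i}}{\prod_{i=1}^{n} x_i!}$
Taking natural logarithms, differentiating with respect to $\lambda$ and setting the derivative
to 0, we obtain
$\ln L(\lambda) = -n\lambda + \sum x_i \ln \lambda - \ln\left(\prod x_i!\right)$
$\frac{d}{d\lambda} \ln L(\lambda) = -n + \frac{\sum x_i}{\lambda} = 0$
$\hat{\lambda} = \frac{\sum x_i}{n} = \bar{x}$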
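The Poisson result can be checked numerically by maximising the log-likelihood over a grid and comparing with the sample mean. This is only a sketch under stated assumptions, not part of the original notes: the counts are made up, the parameter is written `lam`, and the grid search stands in for the calculus above.

```python
from math import lgamma, log

# Hypothetical Poisson counts, for illustration only.
counts = [2, 0, 3, 1, 4, 2]
n = len(counts)


def log_likelihood(lam):
    """ln L(lam) = -n*lam + sum(x)*ln(lam) - sum(ln(x!))."""
    return (-n * lam
            + sum(counts) * log(lam)
            - sum(lgamma(x + 1) for x in counts))  # lgamma(x + 1) = ln(x!)


# Closed-form MLE from the derivation: the sample mean.
lam_closed = sum(counts) / n

# Grid search over lam = 0.01, 0.02, ..., 10.00.
grid = [k / 100 for k in range(1, 1001)]
lam_grid = max(grid, key=log_likelihood)

print("sample mean       :", lam_closed)
print("grid-search maximum:", lam_grid)
```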
Example
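Let $X_1, X_2, \ldots, X_n$ be a random sample from an exponential distribution
Solution
(a) With density $f(x; \theta) = \frac{1}{\theta} e^{-x/\theta}$, the likelihood function is
$L(\theta) = \prod_{i=1}^{n} \left(\frac{1}{\theta}\right) e^{-x_i/\theta} = \left(\frac{1}{\theta}\right)^{n} e^{-\sum x_i/\theta}$
Taking natural logarithms, differentiating, equating to 0 and solving the resulting equation
for $\theta$ gives
$\ln L(\theta) = -n \ln \theta - \frac{\sum x_i}{\theta}$, $\quad \frac{d}{d\theta} \ln L(\theta) = -\frac{n}{\theta} + \frac{\sum x_i}{\theta^2} = 0$
and
$\hat{\theta} = \bar{x}$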
There are cases where the MLE exists but cannot be obtained as a solution of the maximum
likelihood equation. The example below illustrates such a scenario.
Example
Let be a random sample from a population with a density function
(a) Find the MLE for .
∑
(b) Sketch the graph of as a function of .
(c) Hence or otherwise find the MLE for .
Solution
(b)
(c) From the graph the likelihood attains its maximum at . Thus the MLE for is
̂ .
Exercises
1. Let be a random sample from a geometric distribution
( )( )
{ ( ) }
√
Find the MLE for and .
∑ ∑
We differentiate $\sum e_i^2$ with respect to each of the $\theta_j$, then solve the resulting equations for
$\hat{\theta} = (\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_k)$.
Example
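Let $Y$ be a random variable with mean $\alpha + \beta x$. Given a sample of size
n, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, find the least squares estimates of $\alpha$ and $\beta$.
Solution
Let $Q = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$
Set the derivatives of $Q$ with respect to $\alpha$ and $\beta$ to zero and solve the simultaneous equations for $\alpha$ and $\beta$; we obtain
$\hat{\beta} = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$
and
$\hat{\alpha} = \frac{\sum y_i - \hat{\beta}\sum x_i}{n}$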
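The least squares formulas for a straight line can be applied directly. The sketch below is illustrative and not part of the original notes. It assumes the model has mean $\alpha + \beta x$ and uses small made-up data that lie exactly on the line $y = 1 + 2x$, so the estimates are easy to verify by hand. The function name `least_squares_line` is hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (alpha_hat, beta_hat) for the model E(Y) = alpha + beta * x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # beta_hat = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
    beta_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # alpha_hat = (Sy - beta_hat*Sx) / n
    alpha_hat = (sum_y - beta_hat * sum_x) / n
    return alpha_hat, beta_hat


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

alpha_hat, beta_hat = least_squares_line(xs, ys)
print("alpha_hat =", alpha_hat, " beta_hat =", beta_hat)
```

With real data the points will not lie exactly on a line, and the same function returns the line that minimises the sum of squared errors.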
Exercises
Consider the data on monthly Income ($) and Savings ($) given in the table below.
Income 980 1500 420 198 785 2800 3845 1890 4210
Savings 180 315 80 42 165 520 760 366 800
Suppose that Income (X) and savings (Y) are related through a relationship with the form
CHAPTER 4
Introduction
This Unit deals with the various requirements or properties that have to be possessed by
estimators. Different estimation procedures yield different estimators for the same population
parameter. We use the different properties to determine the estimator which is the best in some
sense. These properties include error, mean square error, unbiasedness, consistency, efficiency
and sufficiency.
Error
Consider the random sample $X_1, X_2, \ldots, X_n$ and the estimator $\hat{\theta} = \hat{\theta}(X_1, X_2, \ldots, X_n)$ of the
population parameter θ. The error of the estimator $\hat{\theta}$ is given by
$e = \hat{\theta} - \theta$
where θ is the parameter being estimated. Note that the error depends not only on the estimator
but also on the sample. Good estimators tend to have low error values whilst poor ones have
large error values.
Bias
Bias is defined as the difference between the average of the collection of estimates and the single
population parameter being estimated, that is
$Bias(\hat{\theta}) = E(\hat{\theta}) - \theta$
Mean Square Error
It is defined as the expected value of the squared errors, that is,
$MSE(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right]$
It is used to determine how far, on average, the collection of sample estimates is from the
population parameter being estimated. High values of the MSE indicate a poor estimator and
low MSE values indicate a good estimator.
Exercise
Show that MSE can be expressed as $MSE(\hat{\theta}) = Var(\hat{\theta}) + \left[Bias(\hat{\theta})\right]^2$.
Variance
Variance of an estimator $\hat{\theta} = \hat{\theta}(X_1, X_2, \ldots, X_n)$ is defined as the expected value of the
squared sampling deviations, that is
$Var(\hat{\theta}) = E\left[(\hat{\theta} - E(\hat{\theta}))^2\right]$
It is used to determine how far, on average, the collection of sample estimates is from the
expected value of the estimates. High values of the variance indicate a poor estimator and
low values usually imply a good estimator.
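Bias, variance and mean square error can all be estimated by simulation. The sketch below is an illustration added here, not part of the original notes. It studies one particular estimator, the variance estimator that divides by n, for samples from a normal population with variance 4. The population, sample size, seed and number of repetitions are arbitrary assumptions.

```python
import random
from statistics import fmean

random.seed(3)      # arbitrary seed for repeatability

true_var = 4.0      # population variance (sigma = 2), chosen for illustration
n = 10              # sample size
reps = 20000        # number of simulated samples


def var_divide_by_n(sample):
    """Variance estimator that divides by n (biased for sigma^2)."""
    m = fmean(sample)
    return sum((v - m) ** 2 for v in sample) / len(sample)


estimates = [
    var_divide_by_n([random.gauss(0.0, 2.0) for _ in range(n)])
    for _ in range(reps)
]

mean_est = fmean(estimates)
bias = mean_est - true_var                               # E(est) - theta
variance = fmean((e - mean_est) ** 2 for e in estimates)  # E[(est - E(est))^2]
mse = fmean((e - true_var) ** 2 for e in estimates)       # E[(est - theta)^2]

print(f"bias     = {bias:.4f}   (theory: {-true_var / n:.4f})")
print(f"variance = {variance:.4f}")
print(f"MSE      = {mse:.4f}")
```

The simulated bias is close to $-\sigma^2/n$, which is why this estimator is usually replaced by the one that divides by n - 1.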
Unbiasedness
Definition: Let $X_1, X_2, \ldots, X_n$ be a random sample from a population with a parameter $\theta$. An
estimator $\hat{\theta}$ of $\theta$ is called an unbiased estimator for $\theta$ if $E(\hat{\theta}) = \theta$, i.e. if the expectation of the
estimator is equal to the parameter $\theta$.
Example
Solution
̅ ( ∑ )
( )
{ }
{ }
Exercise
1. Let be a random sample from a uniform distribution
Consistency
Consistency is another way of assessing the accuracy of an estimator. This property says that as
the sample size increases, the estimator ̂ must get closer to its true value.
Definition
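Let $X_1, X_2, \ldots, X_n$ be a random sample from a population with parameter $\theta$. Then an estimator
$\hat{\theta}$ is said to be a consistent estimator for $\theta$ if for any $\varepsilon > 0$
$\lim_{n \to \infty} P\left(|\hat{\theta} - \theta| < \varepsilon\right) = 1$
and mean square consistent if
$\lim_{n \to \infty} E\left[(\hat{\theta} - \theta)^2\right] = 0$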
Example
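Let $X_1, X_2, \ldots, X_n$ be a random sample from the Bernoulli distribution
Solution
$E(\bar{X}) = E\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n}\sum E(X_i) = \frac{1}{n}\sum p = p$
$Var(\bar{X}) = Var\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n^2}\sum pq = \frac{pq}{n}$
Since $\bar{X}$ is unbiased for $p$ and its variance tends to 0 as $n \to \infty$, $\bar{X}$ is mean square consistent for $p$.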
Exercise
1. Let be a random sample from an exponential distribution
Efficiency
Efficiency is a term used in Statistics when comparing various statistical procedures; it refers to a
measure of the optimality of an estimator. A more efficient estimator requires fewer observations than
a less efficient one to achieve a desired level of performance.
Definition: Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution having a parameter θ
and let $\hat{\theta}_1$ and $\hat{\theta}_2$ be two unbiased estimators of θ. The relative efficiency of two unbiased
estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ is the ratio of their variances, i.e.
̂
̂
Example
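Consider two estimators for the parameter θ of a uniform distribution $U(0, \theta)$: $\hat{\theta}_1 = 2\bar{X}$ and
$\hat{\theta}_2 = \frac{n+1}{n} X_{(n)}$, where $X_{(n)}$ is the maximum observation in the data $X_1, X_2, \ldots, X_n$.
Determine (a) whether $\hat{\theta}_1$ and $\hat{\theta}_2$ are unbiased estimators of θ.
Solution
(a) $E(\hat{\theta}_1) = E(2\bar{X}) = 2E(\bar{X}) = 2\left(\frac{\theta}{2}\right) = \theta$
Since $E(\hat{\theta}_1) = \theta$ we conclude that $\hat{\theta}_1$ is an unbiased estimator of $\theta$.
$E(\hat{\theta}_2) = \left(\frac{n+1}{n}\right)\left(\frac{n}{\theta^n}\right)\int_0^{\theta} x^n\, dx = \left(\frac{n+1}{n}\right)\left(\frac{n}{n+1}\right)\theta = \theta$, so $\hat{\theta}_2$ is also unbiased.
(b) $Var(\hat{\theta}_1) = Var(2\bar{X}) = 4Var(\bar{X}) = 4\left(\frac{\theta^2}{12}\right)\left(\frac{1}{n}\right) = \frac{\theta^2}{3n}$
To find the variance of $\hat{\theta}_2$ we first find $E(X_{(n)}^2)$, that is
$E(X_{(n)}^2) = \int_0^{\theta} x^2\, \frac{n x^{n-1}}{\theta^n}\, dx = \frac{n\theta^2}{n+2}$. Therefore the variance of $\hat{\theta}_2$ is given by
$Var(\hat{\theta}_2) = \left(\frac{n+1}{n}\right)^2 \left\{ E(X_{(n)}^2) - \left(E(X_{(n)})\right)^2 \right\} = \left(\frac{n+1}{n}\right)^2 \left\{ \frac{n\theta^2}{n+2} - \frac{n^2\theta^2}{(n+1)^2} \right\} = \frac{\theta^2}{n(n+2)}$
$\frac{Var(\hat{\theta}_1)}{Var(\hat{\theta}_2)} = \frac{\theta^2/(3n)}{\theta^2/(n(n+2))} = \frac{n+2}{3}$
so $\hat{\theta}_2$ has the smaller variance whenever $n > 1$.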
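A simulation makes the comparison concrete. The sketch below is an illustration, not part of the original notes. It assumes the two estimators are $2\bar{X}$ and $\frac{n+1}{n}X_{(n)}$ for a $U(0, \theta)$ population, and uses arbitrary values $\theta = 10$ and $n = 5$. Both estimators should average close to $\theta$, while the second should show the smaller variance.

```python
import random
from statistics import fmean, pvariance

random.seed(11)   # arbitrary seed for repeatability

theta = 10.0      # true parameter of U(0, theta), illustrative value
n = 5             # sample size
reps = 20000      # number of simulated samples

est1 = []         # 2 * sample mean
est2 = []         # (n + 1) / n * sample maximum
for _ in range(reps):
    sample = [random.uniform(0.0, theta) for _ in range(n)]
    est1.append(2 * fmean(sample))
    est2.append((n + 1) / n * max(sample))

mean1, mean2 = fmean(est1), fmean(est2)
var1, var2 = pvariance(est1), pvariance(est2)

print(f"2*mean        : mean {mean1:.3f}, variance {var1:.3f}")
print(f"(n+1)/n * max : mean {mean2:.3f}, variance {var2:.3f}")
print(f"variance ratio: {var1 / var2:.3f}  (theory (n+2)/3 = {(n + 2) / 3:.3f})")
```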
We previously discussed how to compare at least two unbiased estimators for the same
parameter. The one with least variance is considered the better one. The question that we want
to address is, “Is there a best estimator in the sense of possessing a minimum variance? How do
we know if the estimator is the best?”
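In the next section we shall see that the variance of an unbiased estimator cannot be smaller than
a certain bound called the Cramer-Rao bound.
$Var(\hat{\theta}) \ge \frac{1}{n\, E\left[\left(\frac{\partial}{\partial\theta} \ln f(X; \theta)\right)^2\right]}$
and
$Var(\hat{\theta}) \ge \frac{-1}{n\, E\left[\frac{\partial^2}{\partial\theta^2} \ln f(X; \theta)\right]}$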
The Cramer-Rao Lower Bound (CRLB) sets a lower bound on the variance of an unbiased
estimator. Its uses are:
(a) If we find an unbiased estimator that achieves the CRLB, then we know that we have found a
Uniformly Minimum Variance Unbiased Estimator (UMVUE),
(b) The CRLB provides a benchmark against which we can compare the performance of an
estimator,
(c) The CRLB can be used to rule out impossible estimators, and
(d) The theory behind the CRLB can tell us if an estimator exists that achieves the lower
bound.
Example
Let be the total number of successes in each of n independent trials. Let p be
the probability of success in any given trial and is an unknown parameter. Let the distribution of
X be
Solution
Then and .
Taking the expected value of the above equation and substitute in the CRLB inequality, we get
( )
Sufficiency
A sufficient statistic with respect to a population parameter $\theta$ is a statistic $\hat{\theta} = \hat{\theta}(X_1, X_2, \ldots, X_n)$ that
contains all the information that is useful for the estimation of $\theta$. It is a very useful data
reduction tool, and studying its properties leads to other useful results.
The intuition behind the sufficient statistic concept is that it contains all the information
necessary for estimating θ. Therefore if one is interested in estimating θ, the original data
can be discarded while keeping only the value of the sufficient statistic.
The definition of a sufficient statistic is often hard to verify directly. A much easier way to find
sufficient statistics is through the factorization theorem.
Definition: Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed random variables
whose distribution is given by the pdf or the pmf $f(x; \theta)$. The likelihood function is the product of
the pdfs or pmfs, that is
$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$
{
Exercises
1. Let be a random sample from Bernoulli distribution with unknown
parameter p. The pdf of the is
Determine whether $\hat{\theta} = \sum_{i=1}^{n} X_i$ is sufficient for p.
2. Let be a random sample from the uniform distribution over the range
Consider the statistic ̂ , determine whether the
statistic is
(a) unbiased, and
(b) sufficient
3. Let be a random sample from a normal distribution for which the mean
µ is unknown but the variance .
(a) Find the unbiased estimator for µ.
(b) Determine whether $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is a sufficient estimator for µ.
CHAPTER 5
INTERVAL ESTIMATION
Point estimates vary from one sample to another, and the probability that an estimate is exactly
equal to the population parameter is almost zero. A way around this problem is to determine a
range of values that we think contains the parameter we are looking for with some known
probability. Such a range or interval is called an interval estimate of the parameter. In order to
come up with these ranges, we have to preset the required level of confidence and make use of
the sampling distributions that we dealt with in unit ***.
Confidence Intervals for Population Mean
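Consider the random sample $X_1, X_2, \ldots, X_n$ from a normal distribution or population with mean $\mu$
and variance $\sigma^2$. When $\sigma^2$ is known, the $(1 - \alpha)100\%$ confidence interval for the population mean is
$\bar{X} \pm z_{\alpha/2}\, \frac{\sigma}{\sqrt{n}}$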
Example
A random sample of size 12 was drawn from a normal population with a variance of 69. The
sample gave a sample mean of 28. Determine the
(a) 95%
(b) 98% and
(c) 90%
Solution
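One way to carry out this calculation is sketched below. It is not the original worked solution: it assumes the three requested quantities are confidence intervals for the population mean, and it uses the known-variance interval $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ with the figures from the example (n = 12, variance 69, sample mean 28). The helper name `z_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Figures from the example.
n = 12
xbar = 28.0
sigma2 = 69.0

standard_error = sqrt(sigma2 / n)


def z_interval(level):
    """Two-sided interval xbar +/- z * sigma / sqrt(n) at the given confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return xbar - z * standard_error, xbar + z * standard_error


intervals = {level: z_interval(level) for level in (0.95, 0.98, 0.90)}

for level, (lower, upper) in intervals.items():
    print(f"{level:.0%} CI: ({lower:.3f}, {upper:.3f})")
```

Raising the confidence level widens the interval, since a larger z multiplies the same standard error.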
Example
Recorded here is the daily number of cars (in hundreds) passing through toll gate in Nyabira over
a period of two weeks
34, 49, 53, 22, 69, 55, 47, 60, 54, 44, 37, 42, 50 and 41
Construct a 95% confidence interval for the mean number of cars passing through the toll gate .
Solution
̅ ⁄
√
and
⁄
( )
In other words, to be sure that the error of estimation $|\bar{X} - \mu|$ does not exceed $E$,
the required sample size is $n = \left(\frac{z_{\alpha/2}\, \sigma}{E}\right)^2$
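The sample size formula can be wrapped in a small function. This is a sketch added for illustration, not part of the original notes: the function name `required_sample_size` and the example inputs (standard deviation 10, error margin 2, 95% confidence) are hypothetical. The result is rounded up because a sample size must be a whole number.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(sigma, margin, level):
    """Smallest n with z * sigma / sqrt(n) <= margin at the given confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return ceil((z * sigma / margin) ** 2)


# Hypothetical inputs, for illustration only.
n_needed = required_sample_size(sigma=10, margin=2, level=0.95)
print("required sample size:", n_needed)
```

Halving the error margin roughly quadruples the required sample size, because the margin enters the formula squared.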
Example
For each case determine the sample size n that is required for estimating the population
mean. The population standard deviation $\sigma$, confidence level and the desired error margin $E$ are
given below.
(a) 95%,
(b) 98%,
Solution
⁄
(a) ( ) ( )
Thus to make sure that the sample average is within 0.75 of the true mean 95% of the
time, we must sample at least 70 units.
⁄
(b) ( ) ( )
Thus to make sure that the sample average is within 25.2 of the true mean 98% of the
time, we must sample at least 192 units.
Exercise
1. A random sample of 24 children in Harare had their birth-weights (in kilograms)
recorded. The data obtained is
3.30 3.51 2.97 3.01 2.64 3.00 2.89 3.74 3.66 3.12 2.88 2.97
(a) Construct a 98% confidence interval for true mean birth weight .
(b) Find the minimum sample sizes needed so that one is
(i) 99% confident that the sample mean will be within 0.23kgs of the true mean.
(ii) 90% confident that the sample mean will be within 0.50kgs of the true mean.
2. Measurements of the acidity (pH) of rain samples at an industrial site in Harare were
recorded at 15 sites and were
3.6 5.2 4.8 3.7 4.9 3.8 4.6 4.2 4.0 4.6 4.8 4.7 5.0 4.4 5.1
(a) Construct a 98% confidence interval for the mean acidity of rain in that region.
(b) Find the minimum sample size needed so that one is 99% confident that the sample
mean will be within 0.23 of the true mean.
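From the sampling distribution of the sample variance, it was shown that $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1)$, so one can write
$P\left(\chi^2_{1-\alpha/2} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{\alpha/2}\right) = 1 - \alpha$
where $\chi^2_{\alpha/2}$ and $\chi^2_{1-\alpha/2}$ are upper-tail points of the chi-square distribution with $n - 1$ degrees of freedom.
Rearranging, the $(1 - \alpha)100\%$ confidence interval for $\sigma^2$ is
$\left(\frac{(n-1)S^2}{\chi^2_{\alpha/2}},\ \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}}\right)$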
Example
Suppose that a random sample of size 18 from a normal distribution gave a mean 6.45 and a
variance of 1.92. Construct a 95% confidence interval for the population variance .
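Solution
$\left(\frac{(n-1)s^2}{\chi^2_{0.025}(17)},\ \frac{(n-1)s^2}{\chi^2_{0.975}(17)}\right) = \left(\frac{17(1.92)}{30.191},\ \frac{17(1.92)}{7.564}\right) = (1.081,\ 4.315)$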
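The interval for the variance in this example can be reproduced in a few lines. The sketch below is illustrative, not the original working. Since the Python standard library has no chi-square quantile function, the two critical values for 17 degrees of freedom are entered by hand from standard chi-square tables; that is an assumption of this sketch.

```python
# Figures from the example: n = 18 and sample variance 1.92.
n = 18
s2 = 1.92
df = n - 1

# Chi-square critical values for 17 degrees of freedom, taken from
# standard tables: upper 2.5% point and upper 97.5% point.
chi2_upper_025 = 30.191
chi2_upper_975 = 7.564

lower = df * s2 / chi2_upper_025
upper = df * s2 / chi2_upper_975

print(f"95% CI for the population variance: ({lower:.3f}, {upper:.3f})")
```

The interval is not symmetric about the sample variance, because the chi-square distribution is skewed.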
Exercise
1. A certain company packages its products in 2 kilograms bags. In order to determine whether
the packaging machine was able to meet the target of 2kgs a random sample of twelve
products was taken and the weights were:
1.89, 2.05, 2.09, 1.89, 1.99, 2.11, 2.08, 1.92, 1.88, 1.79, 1.98, 2.11
If the variance of the weights is greater than 0.012, the packaging machine is failing to meet the
target of 2kgs.
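The $(1 - \alpha)100\%$ confidence interval for a population proportion is
$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$
and the sample size required for an error margin $E$ is
$n = \hat{p}\hat{q}\left(\frac{z_{\alpha/2}}{E}\right)^2$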
Example
A public health survey is carried out in order to estimate the prevalence of HIV/AIDS in the
population of a certain town. A random sample of 400 people showed that 72 were HIV/AIDS
infected,
(a) Construct a 98% confidence interval of the population proportion .
(b) How many people should be sampled if one wishes to be 95% sure that error margin is
below 0.08?
Solution
̂ ̂
̂ ⁄
√
⁄
̂ ̂ ( ) ( )
Thus to make sure that the sample proportion is within 0.08 of the true proportion 95% of
the time, we must sample at least 521 units.
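Part (a) of this example can be checked numerically. The sketch below is not the original working: it assumes the usual large-sample interval $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}\hat{q}/n}$ and uses the survey figures of 72 infected people out of 400. The variable names are chosen here for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Figures from the example: 72 infected out of a sample of 400.
x, n = 72, 400
level = 0.98

p_hat = x / n
q_hat = 1 - p_hat
z = NormalDist().inv_cdf(1 - (1 - level) / 2)

margin = z * sqrt(p_hat * q_hat / n)
lower, upper = p_hat - margin, p_hat + margin

print(f"p_hat = {p_hat:.2f}")
print(f"98% CI: ({lower:.4f}, {upper:.4f})")
```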
Exercises
1. In a sample of 800 fuses produced by an electronic company, 69 of them were found to be
defective.
(a) Estimate with a 95% confidence interval the true proportion of defective fuses.
(b) How many fuses should be sampled if one wants to be 97% sure that error margin is
below 0.02?
2. In a survey of 380 voters prior to an election, 260 of them indicated that they would vote for
the incumbent candidate.
(a) Estimate with a 95% confidence interval the population proportion of voters who
support the incumbent.
(b) How many voters should be sampled if one wants to be 99% sure that error margin is
less than 0.03?
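This section deals with confidence intervals for the differences between two population means.
In this section we assume that the samples used are independent.
(a) If population variances $\sigma_1^2$ and $\sigma_2^2$ are both known, then the $(1 - \alpha)100\%$ confidence
interval for $\mu_1 - \mu_2$ is
$(\bar{X}_1 - \bar{X}_2) \pm z_{\alpha/2}\sqrt{\left(\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)}$
(b) If population variances $\sigma_1^2$ and $\sigma_2^2$ are both unknown and the sample sizes are small,
$(\bar{X}_1 - \bar{X}_2) \pm t_{(\alpha/2,\ \nu)}\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$
where $\nu = n_1 + n_2 - 2$.
(c) If population variances $\sigma_1^2$ and $\sigma_2^2$ are both unknown and the sample sizes are large,
$(\bar{X}_1 - \bar{X}_2) \pm z_{\alpha/2}\sqrt{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)}$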
Example
The following data show summary statistics for two random samples drawn from two normal
populations.
Solution
(a) .
√( )
(b) Since the confidence interval does not include a zero, the population means are different.
Exercises
1. Suppose that samples of sizes and are drawn from two normal
populations. The sample statistics are as follows:
̅ ̅
(a) Construct a 95% confidence interval for
(b) Using your results in (a) can we conclude that and are different?
2. Samples of sizes and were drawn from two populations. The sample
statistics
̅ ̅
$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$
If the confidence interval includes a zero, this means we cannot rule out the possibility that the
two population proportions are equal.
Example
In a study investigating the sexual practices of men and women, it was found that
284 women in a random sample of 1400 had sexual intercourse below the age of 15 years.
Another sample of 695 men showed that 101 of them had sexual intercourse below the age of 15
years.
(a) Construct a 95% confidence interval for the difference between the proportions of males
and females who had sexual intercourse below the age of 15 years.
(b) Using the results in part (a), can one conclude that the proportion of girls who enter into
sexual unions below the age of 15 years is higher than that of boys?
Solution
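(a) Let $p_1$ and $p_2$ be the proportions of males and females who enter into sexual
relationships below the age of 15 years. Then $\hat{p}_1 = \frac{101}{695} = 0.1453$, $\hat{p}_2 = \frac{284}{1400} = 0.2029$ and $z_{0.025} = 1.96$.
Then the 95% confidence interval for $p_1 - p_2$ is
$(0.1453 - 0.2029) \pm 1.96\sqrt{\frac{(0.1453)(0.8547)}{695} + \frac{(0.2029)(0.7971)}{1400}} = -0.0575 \pm 0.0336 = (-0.0912,\ -0.0239)$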
(b) Since the confidence interval does not include a zero, we conclude that the proportion of
girls who enter into sexual unions below the age of 15 years is higher than that of boys.
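The interval in this example can be reproduced as follows. This is a sketch for checking the arithmetic, not the original working: it assumes the unpooled interval $(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\hat{p}_1\hat{q}_1/n_1 + \hat{p}_2\hat{q}_2/n_2}$ and takes the difference as males minus females. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Figures from the example.
x_m, n_m = 101, 695     # men
x_f, n_f = 284, 1400    # women
level = 0.95

p_m = x_m / n_m
p_f = x_f / n_f
z = NormalDist().inv_cdf(1 - (1 - level) / 2)

# Standard error of the difference between two independent sample proportions.
se = sqrt(p_m * (1 - p_m) / n_m + p_f * (1 - p_f) / n_f)

diff = p_m - p_f        # males minus females
lower, upper = diff - z * se, diff + z * se

print(f"p_m = {p_m:.4f}, p_f = {p_f:.4f}")
print(f"95% CI for p_m - p_f: ({lower:.4f}, {upper:.4f})")
```

Both limits are negative, so the interval excludes zero, which agrees with the conclusion in part (b).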
Exercises
1. Consider the following data from independent samples from two populations with proportions $p_1$ and
$p_2$ respectively of items bearing a characteristic of interest. For each, construct the 98%
confidence interval for $p_1 - p_2$ and use the confidence interval obtained to test
whether $p_1 = p_2$.
(a)
(b)
2. In a public opinion survey, 80 out of a sample of 120 high-income voters and 30 out of
80 low-income earners supported a decrease in water charges.
(a) Construct a 95% confidence interval for the difference in the proportion of voters
favouring a decrease in the water charges.
(b) Using your result in (a), can we conclude that the proportions of the voters are
different?
Matched/ Paired Samples
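Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ be the paired observations, where the $x_i$ are the
sample values before the experiment and the $y_i$ are the sample values after the experiment.
Let $d_i = x_i - y_i$, $i = 1, 2, \dots, n$. Assuming that $d_1, d_2, \dots, d_n$ are independent and are distributed as
$N(\mu_d, \sigma_d^2)$ where $\sigma_d^2$ is unknown, then the $100(1-\alpha)\%$ confidence interval for $\mu_d$ is
$$\bar{d} \pm t_{\alpha/2}(n-1)\frac{s_d}{\sqrt{n}}$$
where $\bar{d}$ and $s_d$ are the mean and standard deviation of the differences $d_i$.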
Example
It is claimed that an industrial safety training programme helps to reduce the number of industrial
accidents. The numbers of accidents were recorded for a period of six weeks before the
programme and another six weeks after the programme in eight different companies. The data
obtained were
Company 1 2 3 4 5 6 7 8
Before Training 13 24 18 9 15 18 30 12
After Training 9 18 19 9 13 16 26 11
(a) Determine the 95% confidence interval for the difference in the mean number of
accidents.
(b) Using the results obtained in (a) can one conclude that the claim is true?
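Solution
(a) Taking $d_i$ = Before - After, the differences are 4, 6, -1, 0, 2, 2, 4, 1, so that $\bar{d} = 2.25$ and $s_d = 2.315$. With $t_{0.025}(7) = 2.365$, the 95% confidence interval for $\mu_d$ is
$$\bar{d} \pm t_{\alpha/2}(n-1)\frac{s_d}{\sqrt{n}} = 2.25 \pm 2.365 \times \frac{2.315}{\sqrt{8}} = 2.25 \pm 1.935 = (0.315,\ 4.185)$$
(b) The confidence interval does not include zero and lies entirely above it, so the data support the claim that the training programme reduces the number of accidents.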
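As a check on the paired-sample interval, the sketch below recomputes it from the accident data. It is an illustration only: the helper name `paired_ci` is hypothetical, and the critical value $t_{0.025}(7) = 2.365$ is passed in from the t tables because the Python standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev


def paired_ci(before, after, t_crit):
    """Confidence interval for the mean difference of paired observations.

    t_crit is the tabulated value t_{alpha/2}(n - 1).
    """
    d = [b - a for b, a in zip(before, after)]  # differences: before - after
    n = len(d)
    d_bar = mean(d)
    margin = t_crit * stdev(d) / sqrt(n)  # stdev uses the n - 1 divisor
    return d_bar - margin, d_bar + margin


before = [13, 24, 18, 9, 15, 18, 30, 12]
after = [9, 18, 19, 9, 13, 16, 26, 11]
lower, upper = paired_ci(before, after, t_crit=2.365)
print(round(lower, 3), round(upper, 3))
```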
Exercise
1. Given the following paired sample data
X 15 13 17 15 18 17
Y 14 12 14 16 18 19
(a) Construct a 95% confidence interval for the mean difference of the responses.
(b) Using your results in (a), test whether .
2. Six strings made of wool are stretched with the same force before and after washing and
their length are observed. The lengths in before and after washing are recorded in the
table below.
(a) Construct a 95% confidence interval for the mean differences in stretching.
(b) Using your results in (a), test whether washing reduces stretching length.
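The $100(1-\alpha)\%$ confidence interval for the ratio of two population variances, $\sigma_1^2/\sigma_2^2$, is
$$\left(\frac{s_1^2}{s_2^2}\cdot\frac{1}{F_{\alpha/2}(n_1-1,\ n_2-1)},\ \ \frac{s_1^2}{s_2^2}\cdot F_{\alpha/2}(n_2-1,\ n_1-1)\right)$$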
Example
Suppose that two independent samples from two normal populations $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ gave the
following statistics
Find the 95% confidence interval for the ratio $\sigma_1^2/\sigma_2^2$.
Solution
⁄ ⁄
( )
Exercise
Suppose that two independent samples from two normal populations and
gave the following statistics
CHAPTER 6
TESTING OF HYPOTHESES
Introduction
Testing of hypothesis is mainly concerned with addressing questions arising from experimental
and observational research such as:
Is the mean age of women in business 35 years?
Is there a significant difference in the viral loads between males and females infected
with HIV/AIDS?
Is there any significant difference in the production levels of 4 maize hybrids in
Zimbabwe?
The null hypothesis is a claim that is established for the purpose of testing. This claim can either
be rejected or not rejected. If there is sufficient evidence to reject the null hypothesis then the
alternative hypothesis is accepted.
A statistical test of hypothesis is a procedure that is used in conjunction with sample data to
decide whether to reject or not to reject the null hypothesis.
Definition: A test statistic, T, is a calculation made from sample data whose value is used as a
basis for deciding whether or not $H_0$ should be rejected.
Definition: A critical region or rejection region of a test is a set C chosen so that if the value
of the test statistic falls in it, then $H_0$ is rejected.
A critical value is a value that separates the acceptance region from the rejection region, and
these values are usually obtained from some statistical tables.
The significance level $\alpha$ is usually set at 5%, so that one has $95\%$ confidence in the results
one obtains.
Definition: In a statistical test, if $H_0$ is not rejected (i.e. it is retained) when it is in fact false, the
error so made is called a Type II error. The probability of making this error is usually denoted
by $\beta$ and it is given by $\beta = P(\text{do not reject } H_0 \mid H_0 \text{ is false})$.
Example
Let be a random sample of size 1 from a population with a probability distribution
{ }
and suppose that one wants to test the hypothesis
Solution
(a)
∫
(b)
(c)
In most computer-based tests of hypothesis, the critical values obtained from tables are not
readily available, so we make the decision of either rejecting or not rejecting $H_0$ based on what
we call a p-value.
Definition: The p-value is the lowest level of significance at which the null hypothesis can be
rejected. Consequently, we reject $H_0$ at the $\alpha$ level if the p-value is less than or equal to $\alpha$.
Exercise
1. Suppose you are to verify the claim that on the basis of a random sample of size
58 and you are given that .
(a) If you set your rejection region to be ̅ what is the probability of making
Type I error, i.e. the level of significance of your test.
(b) Find the numerical value of c such that the test ̅ has a 5% of significance.
3. The probability that a person has an allergic reaction to a new drug on the market
is . The drug is modified and retested on 10 people. The null hypothesis
is rejected if at most two people have an allergic reaction.
(a) What is the probability of making Type I error?
(b) Find the probability of making Type II error for and
Types of tests
One-tailed tests are those statistical tests with the rejection region either on the left or right hand
side of the distribution, i.e. those tests whose alternative hypothesis has the form
$H_1: \mu < \mu_0$ or $H_1: \mu > \mu_0$.
Two-tailed tests are those statistical tests with rejection regions on both tails or ends of the
statistical distribution, i.e. those tests whose alternative hypothesis has the form $H_1: \mu \neq \mu_0$.
The position of the rejection region depends on the form of the alternative hypothesis and its size
depends on the level of significance selected. The following diagram shows the critical/rejection
regions corresponding to each form of $H_1$.
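Consider a random sample $X_1, X_2, \dots, X_n$ from a normal distribution or population with mean $\mu$
and variance $\sigma^2$. The test statistic that is used to test the hypothesis $H_0: \mu = \mu_0$ against any of the alternatives $H_1: \mu \neq \mu_0$, $H_1: \mu < \mu_0$ or $H_1: \mu > \mu_0$
is given by
(d) $Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ if the population variance is known.
(e) $T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$, with $n - 1$ degrees of freedom, if the population variance is unknown.
(f) $Z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ if the population variance is unknown but the sample size is large.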
Example
Suppose we want to ascertain whether the mean amount of sulfur in mustard is 0.7, given that
from a sample of size 9, the mean is 0.706. It is known from past experience that the amount of
sulfur in mustard is normally distributed with a standard deviation of 0.25.
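Solution
$H_0: \mu = 0.7$ against $H_1: \mu \neq 0.7$
Test statistic
$$Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
We reject $H_0$ if $|Z| > z_{\alpha/2} = 1.96$ (1.96 is obtained from the Normal distribution Tables)
Substituting $\bar{x} = 0.706$, $\mu_0 = 0.7$, $\sigma = 0.25$ and $n = 9$ in the test statistic we have
$$Z = \frac{0.706 - 0.7}{0.25/\sqrt{9}} = 0.072.$$
Since $Z = 0.072$ is less than 1.96, we fail to reject $H_0$ and conclude that there is not sufficient
evidence that the mean sulfur content is significantly different from 0.7.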
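The same test can be carried out numerically. The sketch below is illustrative only (the function name `one_sample_z` is not from these notes); it returns the test statistic together with a two-sided p-value computed with `statistics.NormalDist`, so the decision can be made either from the critical value 1.96 or from the p-value.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(x_bar, mu0, sigma, n):
    """Z statistic and two-sided p-value for H0: mu = mu0, sigma known."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


# Sulfur example: sample mean 0.706, hypothesised mean 0.7, sigma 0.25, n = 9
z, p_value = one_sample_z(0.706, 0.7, 0.25, 9)
print(round(z, 3), round(p_value, 3))
```

A p-value far above 0.05 leads to the same decision as comparing $|Z|$ with 1.96: $H_0$ is not rejected.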
Example
The following is a random sample of 9 observations on the profits (in $000) realised per month
by women cooperatives; 4.9, 5.8, 5.9, 6.5, 5.5, 5.0, 6.0, 5.6 and 5.7. Test at whether the
mean profit is less than 5.9.
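Solution
$H_0: \mu = 5.9$ against $H_1: \mu < 5.9$
From the sample, $n = 9$, $\bar{x} = 5.656$ and $s = 0.493$. Since the population variance is unknown and the sample size is small, the test statistic is
$$T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{5.656 - 5.9}{0.493/\sqrt{9}} \approx -1.49$$
We reject $H_0$ if $T < -t_{\alpha}(8)$.
Since $T$ is greater than $-t_{\alpha}(8)$ we fail to reject $H_0$ and conclude that, at the stated level of significance, there is not enough evidence that the average
profit is less than $5900.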
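The value of the t statistic for the profit data can be verified with a short script. This is a sketch under the assumption that the hypothesised mean is 5.9 (in $000), as stated in the example; the helper name `one_sample_t` is illustrative, and the critical value must still be read from the t tables with $n - 1 = 8$ degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """T statistic for H0: mu = mu0 when the population variance is unknown."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))


profits = [4.9, 5.8, 5.9, 6.5, 5.5, 5.0, 6.0, 5.6, 5.7]
t_stat = one_sample_t(profits, 5.9)
print(round(t_stat, 3))
```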
Example
Suppose that the legally required level of sodium in the production of a certain beverage is .
A randomly selected sample of 144 bottles of the beverage gave a sample mean of and a
sample standard deviation of . Test whether that amount of sodium is lower than Use
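Solution
$H_0: \mu = 3.5$ against $H_1: \mu < 3.5$
Since the population variance is unknown but the sample size is large, i.e. $n = 144 \geq 30$, the test
statistic is
$$Z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
We reject $H_0$ if $Z < -z_{0.05} = -1.645$ (from the Normal Tables)
Substituting $\bar{x}$, $s$ and $n$ from the sample data into the test statistic gives the computed value of $Z$.
Since $Z$ is less than -1.645, we reject $H_0$ and conclude that at $\alpha = 0.05$, we have enough evidence
that the mean sodium level in the beverages is less than 3.5g.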
Exercises
1. For each of the following sets of values, test the hypotheses indicated.
(a) ̅
(b) ̅
(c) ̅
2. The following data are amounts (in grams) of a certain vitamin found in raw milk.
7.97 7.83 7.56 6.15 7.99 7.28 8.11
8.09 7.12 6.69 7.54 7.35 4.55 6.77
Is there enough evidence to conclude that the mean amount of the vitamin is 6.00g? Use
.
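situations where we are testing for equality of two population means or situations where the
difference between two population means is equal to some other non-zero value.
If we are testing for the equality of two population means, the hypotheses to be
tested become $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \neq \mu_2$, $H_1: \mu_1 < \mu_2$ or $H_1: \mu_1 > \mu_2$.
The test statistic is
$$Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
if the population variances $\sigma_1^2$ and $\sigma_2^2$ are both known. Here $\mu_1 - \mu_2$ takes the value specified under $H_0$.
(e) If the population variances $\sigma_1^2$ and $\sigma_2^2$ are both unknown and the sample sizes are small
$$T = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
where $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ and $T$ has $n_1 + n_2 - 2$ degrees of freedom.
(f) If the population variances $\sigma_1^2$ and $\sigma_2^2$ are both unknown and the sample sizes are large, i.e. $n_1, n_2 \geq 30$
$$Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$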
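For the small-sample case with unknown variances, the pooled calculation can be sketched as below. The function name `pooled_t` and the summary figures in the usage lines are hypothetical, chosen only to illustrate the arithmetic; they are not data from these notes. The sketch assumes, as the pooled formula does, that the two population variances are equal.

```python
from math import sqrt


def pooled_t(n1, mean1, var1, n2, mean2, var2, diff0=0.0):
    """Pooled two-sample T statistic for H0: mu1 - mu2 = diff0.

    var1 and var2 are the sample variances. Returns the statistic and
    its degrees of freedom, n1 + n2 - 2.
    """
    sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)  # pooled variance
    t = (mean1 - mean2 - diff0) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Hypothetical summary statistics, for illustration only
t_stat, df = pooled_t(10, 20.0, 4.0, 12, 18.0, 5.0)
print(round(t_stat, 3), df)
```

The statistic is then compared with the tabulated t value on `df` degrees of freedom for the chosen alternative hypothesis.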
Example
Males claim that their weekly savings are higher than those of their female counterparts. It is
known that the variances for their weekly savings are and . A
sample of 40 women gave an average saving of 5.22 whilst another sample of 37 men gave an
average of 6.45. Test at whether this claim can be justified.
Solution
̅ ̅
The test statistic is
√( )
√( )
Example
In the comparison of two food preservatives purchased by a certain food company, a random
sample of size 8 of one preservative gave an average shelf-life of 6.75 months with a variance of
while another sample of size 13 of the other preservative gave an average of
5.92 months with a variance of . Test if there is a significant difference between
these two preservatives. Use .
Solution
̅ ̅
The test statistic is
√( )
We reject if
. Therefore
√( )
Since the computed value of the test statistic does not fall in the rejection region, we fail to reject $H_0$ and conclude that the shelf lives of the two preservatives are
not significantly different.
Example
Consider the following sample statistics.
Test at whether the difference between the two populations means is greater than 3.2.
Solution
̅ ̅
The test statistic is
√( )
We reject if .
√( )
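Since the computed value of the test statistic falls in the rejection region, we reject $H_0$ and conclude that, at the stated level of significance, the difference between the two
population means is greater than 3.2.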
Exercises
1. Eight secretaries were taught using method 1 to type their documents and 6 others were
taught using method 2. Their typing speeds measured by words typed per minute are
shown in the table below.
Method 1: 41 34 38 42 44 39 36 40
Method 2: 56 45 50 54 48 47
Are the typing speeds significantly different for the two teaching methods? Test at the 5%
level of significance.
2. An employment consultant asked selected workers from two different industries to fill in
a questionnaire on job satisfaction. The answers were scored 0 to 50 with higher scores
indicating greater job satisfaction. The data recorded are given in table below.
Industry   $n$    $\bar{x}$    $s$
A          35     38.4         3.1
B          44     30.5         2.2
Is there a significant difference in the job satisfaction of the workers in the two
industries? Use .
Paired Samples
In the previous section we were testing the difference between two population means using
independent samples. In this section we consider situations where the samples are not
independent.
Suppose we want to decide on the basis of weights whether a certain diet has an effect of
reducing weight. In this experiment we measure the weights of the people concerned before they
are put on the diet and also measure their weights after the exercise. In this case we obtain two
samples, the before experiment sample and the after experiment sample, and we call these
samples paired/matched samples.
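Item              1       2       ...   n
Before the diet   $x_1$   $x_2$   ...   $x_n$
After the diet    $y_1$   $y_2$   ...   $y_n$
The $x_i$ represent the weights before the diet programme and the $y_i$ are the weights of
individuals after the diet exercise. To handle this problem, one has to work with the differences
between the paired measurements, i.e. compute the differences
$d_i = x_i - y_i$, $i = 1, 2, \dots, n$.
Using the new sample of the $d_i$ one can compute the sample mean $\bar{d} = \frac{1}{n}\sum d_i$ and the
sample variance $s_d^2 = \frac{1}{n-1}\left(\sum d_i^2 - \frac{(\sum d_i)^2}{n}\right)$.
The null hypothesis $H_0: \mu_d = 0$ is equivalent to saying that the programme has no effect on reducing weight.
Interpreting the alternative hypothesis depends on how one defines the differences.
Under $H_0$ the test statistic is
$$T = \frac{\bar{d}}{s_d/\sqrt{n}}$$
which has a t distribution with $n - 1$ degrees of freedom.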
Example
Ten people were put on a special exercise programme for 8 weeks in order for them to lose
weight. The table below gives the weights (in kg) of the 10 people before and after the
programme.
Person 1 2 3 4 5 6 7 8 9 10
Before 90 98 89 110 104 78 69 82 66 59
After 87 98 85 105 103 77 62 76 64 55
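Solution
Let $d_i$ = Before - After. The hypotheses are $H_0: \mu_d = 0$ against $H_1: \mu_d > 0$.
The differences are 3, 0, 4, 5, 1, 1, 7, 6, 2, 4, so that $\bar{d} = 3.3$ and $s_d = 2.312$. The test statistic is
$$T = \frac{\bar{d}}{s_d/\sqrt{n}} = \frac{3.3}{2.312/\sqrt{10}} = 4.514$$
Since $T = 4.514$ exceeds the critical value (for instance $t_{0.05}(9) = 1.833$ at the 5% level), we reject $H_0$ and conclude that the programme indeed reduces weight.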
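The paired test statistic for the weight data can be reproduced as follows. This is a minimal sketch; the function name `paired_t` is illustrative, and the differences are taken as Before minus After so that a positive statistic points to weight loss.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """T statistic for H0: mu_d = 0, with d = before - after."""
    d = [b - a for b, a in zip(before, after)]
    return mean(d) / (stdev(d) / sqrt(len(d)))


before = [90, 98, 89, 110, 104, 78, 69, 82, 66, 59]
after = [87, 98, 85, 105, 103, 77, 62, 76, 64, 55]
t_stat = paired_t(before, after)
print(round(t_stat, 3))
```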
Exercises
1. Two different methods of memorizing difficult material are being tested. Nine pairs of
students are matched according to their IQ and background and then assigned to one of
the two methods at random. A test is finally given to all the students and the results
obtained are as follows.
1 2 3 4 5 6 7 8 9
Method A 90 86 72 65 44 52 46 38 43
Method B 85 87 70 62 4 53 42 35 46
2. An experiment was conducted to test the effect of continuous music on productivity. Ten
workers were selected at random and their productivity was measured one month without
music and another month with music. The table below gives the average number of items
produced per day.
Worker          1     2     3     4     5     6     7     8     9     10
Without music   8.37  5.01  7.24  6.35  6.24  4.73  7.82  5.67  8.01  6.98
With music      7.38  6.96  8.03  6.40  6.02  4.90  8.53  6.28  8.59  7.32
(b) Construct a 98% confidence interval for the production difference.
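For a test concerning a single population proportion, with $H_0: p = p_0$, the test statistic is
$$Z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$$
where $q_0 = 1 - p_0$.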
Example
A manufacturer claims that at least 98% of his products are defect-free. A sample of 800 items
showed that 56 of them were defective. Test the claim at 1% level of significance.
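Solution
Let $p$ be the proportion of defect-free products. $H_0: p = 0.98$ against $H_1: p < 0.98$
Test statistic is
$$Z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$$
Reject $H_0$ if $Z < -z_{0.01} = -2.326$
Substitute $\hat{p} = 744/800 = 0.93$, $p_0 = 0.98$, $q_0 = 0.02$ and $n = 800$ in the test statistic
$$Z = \frac{0.93 - 0.98}{\sqrt{(0.98)(0.02)/800}} = -10.10$$
Since $Z < -2.326$, we reject $H_0$ and conclude that, at the 1% level of significance, the data do not support the manufacturer's claim.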
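The calculation for this example can be checked with the short sketch below. The function name `one_proportion_z` is an illustrative choice, and the sketch assumes the test is framed in terms of the proportion of defect-free items (744 out of 800), with the manufacturer's 98% as the hypothesised value.

```python
from math import sqrt


def one_proportion_z(successes, n, p0):
    """Z statistic for H0: p = p0, using the hypothesised p0 in the standard error."""
    p_hat = successes / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)


# 56 defective items out of 800, so 744 are defect-free
z = one_proportion_z(800 - 56, 800, 0.98)
print(round(z, 2))
```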
Exercises
1. A drug manufacturer claims that the new drug cures a certain disease in 80% of the cases.
In a random sample of 180 patients who used the drug, 105 of them found it to be
effective. Test the manufacturer’s claim at the 0.02 level of significance.
2. A random sample of 480 urban adolescents revealed that 96 of them would vote
in support of a law supporting abortion. Similarly, 88 out of another random sample of 620
rural adolescents were in support of abortion. Test at the 5% level of significance whether
there is a significant difference between the two proportions.
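The hypotheses to be tested when we are interested in the difference between two population
proportions are $H_0: p_1 = p_2$ against one of the following alternatives: $H_1: p_1 \neq p_2$, $H_1: p_1 < p_2$ or $H_1: p_1 > p_2$. The test statistic is
$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\left[\frac{1}{n_1} + \frac{1}{n_2}\right]}}$$
where $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$ is the pooled sample proportion and $\hat{q} = 1 - \hat{p}$.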
Example
In order to test the effectiveness of a new anthrax vaccine, 240 infected cattle were given the
vaccine and 300 were not. All the 540 cattle were infected with anthrax. Among those which
were vaccinated, 60 of them died and among those which were not vaccinated 115 of them died
from the disease. Does vaccination reduce the mortality rate? Use $\alpha = 0.05$.
Solution
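Let the subscript 1 represent the vaccinated cattle and 2 the unvaccinated ones.
$H_0: p_1 = p_2$ against $H_1: p_1 < p_2$
The test statistic is
$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\left[\frac{1}{n_1} + \frac{1}{n_2}\right]}}$$
Reject $H_0$ if $Z < -z_{0.05} = -1.645$
$\hat{p}_1 = 60/240 = 0.25$, $\hat{p}_2 = 115/300 = 0.3833$ and $\hat{p} = 175/540 = 0.3241$ and $\hat{q} = 0.6759$
$$Z = \frac{0.25 - 0.3833}{\sqrt{(0.3241)(0.6759)\left[\frac{1}{240} + \frac{1}{300}\right]}} = -3.29$$
Since $Z < -1.645$ we reject $H_0$ and conclude that indeed the vaccine reduces mortality rate at
5% level of significance.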
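The pooled two-proportion statistic for the vaccine data can be reproduced with the sketch below. The function name `two_proportion_z` is illustrative and not part of these notes; sample 1 is the vaccinated group and sample 2 the unvaccinated group.

```python
from math import sqrt


def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for H0: p1 = p2 using the pooled sample proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


# 60 deaths among 240 vaccinated cattle, 115 deaths among 300 unvaccinated
z = two_proportion_z(60, 240, 115, 300)
print(round(z, 2))
```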
Exercises
1. In a random sample of 400 males, 48 of them were found to have sexually transmitted
diseases. In another random sample of 630 females 111 of them were found to have
sexually transmitted diseases.
(a) Construct a 98% confidence interval of the difference in the proportions of those with
the sexually transmitted diseases.
(b) Use your result in (a) to test the hypotheses
2. In a study to determine the prevalence of poverty in rural and urban areas, a sample of
280 households was taken in urban areas and 99 of them were classified as poor. In a
sample of 600 households from the rural areas, 148 were classified as poor.
Test at the 2% level of significance whether poverty is higher in the urban areas than in
rural areas.
3. Test the following hypotheses for the given data.
(a)
(b)
(c)
Tests Concerning a Population Variance
From the sampling distributions of the sample variance, it was shown that $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1)$.
Example
A manufacturer of light bulbs claims that the average lifetime of his bulbs is 150 hours with a
standard deviation of 4.8 hours. A sample of 9 bulbs gave the following lifetimes (in hours)
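Test whether the lifetimes of bulbs have a variance greater than $4.8^2$ hours$^2$. Use 5% significance
level.
Solution
$H_0: \sigma^2 = 4.8^2$ against $H_1: \sigma^2 > 4.8^2$. The test statistic is $\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$.
Reject $H_0$ if $\chi^2 > \chi^2_{0.05}(8) = 15.5$
Since the computed value of $\chi^2$ is less than 15.5 we fail to reject $H_0$ and conclude that the variance of the lifetime of the
bulbs is not significantly greater than $4.8^2$ at 5% significance level.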
Exercises
1. The following data are amounts (in grams) of a certain vitamin found in raw milk.
7.97 7.83 7.56 6.15 7.99 7.28 8.11
8.09 7.12 6.69 7.54 7.35 4.55 6.77
Is there enough evidence to conclude that the variance of the vitamin is ? Use
.
2. The following is a random sample of 9 observations on the profits (in $000) realised per
month by women cooperatives; 4.9, 5.8, 5.9, 6.5, 5.5, 5.0, 6.0, 5.6 and 5.7. Test at
whether the population variance is significantly different from 0.2.
From our probability theory, it can be shown that $F_{1-\alpha}(v_1, v_2) = \frac{1}{F_{\alpha}(v_2, v_1)}$. This is very important in
cases where the values of $F_{1-\alpha}(v_1, v_2)$ are not given in the tables.
Example
Consider the following two samples from two independent normal populations.
Sample 2: 9.5, 8.3, 7.5, 10.9, 11.3, 9.3, 8.8 and 8.0
Determine whether there is sufficient evidence that Use
Solution
Reject $H_0$ if $F < 0.205$ or if $F > 5.29$
Since F is between 0.205 and 5.29, we do not reject $H_0$. We conclude that there is not enough evidence
that the population variances are significantly different.
Exercises
1. Test the following hypotheses:
(a)
(b)
2. Consider the following samples from two normal populations. Test whether the two
population variances are different. Use
Sample 1: 16 10 24 9 6 16 22
Sample 2: 27 17 10 32 9 15
4. The risk of an investment is at times measured by the variance of the return on
investment. In a comparison of the risk associated with two investments, monthly returns
on two $1000 investments were recorded and the data are given in the table below.
Investment 1 15 9 28 -2 21 10 0 10 13 18
Investment 2 16 -2 -13 35 22 -18 36 -12
(a) Is the risk associated with investment 2 more than that of investment 1? Use
(b) What assumptions have you put in place in order to answer part (a)?
CHAPTER 7
Chi-Square Tests
The chi-square test is used to determine whether there is a significant difference between the
expected frequencies and the observed frequencies in one or more categories. The Chi-Square
test can be used when;
(i) Testing to see whether there is an association between two categorical variables.
(Independence or Association tests)
(ii) Testing to see whether a set of observations was drawn from a specified probability
distribution (Goodness-of-Fit tests)
This test is used to determine if there is association or no association between two categorical
variables. A set of data is said to be categorical if the data is separable into categories that are
mutually exclusive, for example, race, sex, age groups, and educational level. These factors are
normally displayed in a table called the contingency table and an example of a contingency table
is given below.
B 48 145 121 67
C 58 211 350 147
F 27 74 112 169
This is a 4 by 4 contingency table because it has 4 rows and 4 columns. We use a chi-square test
to determine whether the two factors are independent or whether there is an association between
them.
Procedure
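$$\chi^2 = \sum_i \sum_j \left[\frac{(O_{ij} - E_{ij})^2}{E_{ij}}\right]$$
(v) Find the value of the test statistic $\chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, where $O_{ij}$ and $E_{ij}$ are the observed and expected frequencies in row $i$ and column $j$.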
(vi) Compare the value of the test statistic against the critical value and make a decision
whether to reject or not. Finally give a conclusion.
When dealing with a 2 by 2 contingency table, the chi-square distribution with one degree of
freedom is used. The Yates correction should be used when calculating the value of the test
statistic, and is given by
$$\chi^2 = \sum_i \sum_j \frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}}$$
Example
A study is run to determine whether there is an association between a child’s weight and success
in school. Given the following data, test at α= 0.05 if there is an association between overweight
and success in school?
                     Weight
Success      Overweight   Not overweight
Yes          162          263
No           38           37
Solution
H0: There is no association between overweight and success in school.
H1: There is an association between overweight and success in school.
Rejection Criteria: Testing at α=0.05, we reject H0 if $\chi^2_{calc} > \chi^2_{0.05}((2-1)(2-1)) = \chi^2_{0.05}(1) = 3.84$.
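Test Statistic:
Since this is a 2 by 2 contingency table, we apply the Yates' correction to the test statistic and we
obtain
$$\chi^2 = \sum_i \sum_j \frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}}$$
The expected frequencies (row total × column total ÷ grand total) are 170, 255, 30 and 45, so $|O_{ij} - E_{ij}| = 8$ in every cell and
$$\chi^2 = (8 - 0.5)^2\left(\frac{1}{170} + \frac{1}{255} + \frac{1}{30} + \frac{1}{45}\right) = 3.68$$
Conclusion
Since $\chi^2_{calc} = 3.68 < 3.84$ we fail to reject H0 and conclude that, at the 0.05 level of significance, there is
insufficient evidence of an association between weight and success. (Without the Yates' correction the statistic is 4.18, which exceeds 3.84.)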
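The test statistic for a contingency table can be computed with a few lines of code. The sketch below is an illustration only (the function name `chi_square_independence` is not from these notes). It takes the expected frequency of each cell as row total × column total ÷ grand total, which is the usual definition, and applies the Yates correction on request.

```python
def chi_square_independence(table, yates=False):
    """Chi-square statistic and degrees of freedom for an r x c table.

    table is a list of rows of observed frequencies. With yates=True,
    0.5 is subtracted from each |O - E| (never going below zero).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            diff = abs(observed - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Overweight / success data from the example
observed = [[162, 263], [38, 37]]
corrected, df = chi_square_independence(observed, yates=True)
uncorrected, _ = chi_square_independence(observed)
print(round(corrected, 2), round(uncorrected, 2), df)
```

For this table the corrected statistic is about 3.68 and the uncorrected one about 4.18, so the two versions fall on opposite sides of the critical value 3.84.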
Exercises
1. The Statistics results of 1700 students from three types of schools were analysed
and the data obtained are given in the table below.
Type of School
Government Private Boarding
Distinction 189 78 113
Credit 217 60 128
Result
Test whether there is an association between Pass rate and the type of school. Test at the
1% level of significance.
2. In a study to determine the relationship between a mother's educational level and her number of children, the
following data were obtained.
Family Size
Small      17    30    58    78
Medium     42    77    94    41
Large     117   101    46    29
Test at the 5% level of significance whether there is an association between the mother’s
level of education and family size.
3. An experiment was carried out to determine whether there is a relationship between the
type of fertilizer applied and yield. The yield of crops was classified as high, medium and
low. The data obtained are summarized in the table below.
Medium 16 17 15 48
Low 12 14 17 43
Column Total 48 56 44 148
Stating your hypotheses clearly, test at the 5% level of significance whether or not there
is evidence of an association between type of fertilizer and yield.
McNemar's test
McNemar's test can be viewed as a paired version of the chi-square test. Suppose, for example, that
participants are asked whether they like a device before and after an experiment.
Here, what you want to test is whether the number of participants who liked the device
changed significantly between before and after the experiment. Given two paired variables where
each variable has exactly two possible outcomes (1 = Yes and 2 = No), the McNemar test can be
used to test if there is a statistically significant difference between the proportions before and
after the experiment. It is useful when dealing with paired binary response data.
1. The pairs (Xi, Yi) are mutually independent.
2. Each Xi and Yi can be assigned to one of two possible categories.
3. The difference
P (Xi = 1, Yi = 2) - P (Xi = 2, Yi = 1)
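The test statistic is
$$\chi^2 = \frac{(b - c)^2}{b + c}$$
where $b$ and $c$ are the numbers of discordant pairs,
which for large samples is distributed like a chi-squared distribution with 1 degree of freedom. A
closer approximation to the chi-squared distribution uses a continuity correction
$$\chi^2 = \frac{(|b - c| - 1)^2}{b + c}$$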
Under the null hypothesis, with a sufficiently large number of discordants (cells b and c), χ2 has a
chi-squared distribution with 1 degree of freedom. If either b or c is small (b + c < 25) then χ2 is
not well-approximated by the chi-squared distribution. The binomial distribution can be used to
obtain the exact distribution for an equivalent to the uncorrected form of McNemar's test
statistic. In this formulation, b is compared to a binomial distribution with size parameter equal
to b + c and "probability of success" = ½, which is essentially the same as the binomial sign test.
For b + c < 25, the binomial calculation should be performed, and indeed, most software
packages simply perform the binomial calculation in all cases, since the result then is an exact
test in all cases. When comparing the resulting χ2 statistic to the right tail of the chi-squared
distribution, the p-value that is found is two-sided, whereas to achieve a two-sided p-value in the
case of the exact binomial test, the p-value of the extreme tail should be multiplied by 2.
If the χ2 result is significant, this provides sufficient evidence to reject the null hypothesis, in
favour of the alternative hypothesis that p1 ≠ p2, which would mean that the marginal proportions
are significantly different from each other.
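The two versions of the statistic and the exact binomial alternative described above can be sketched as follows. The function names and the discordant counts $b = 15$, $c = 5$ are hypothetical, used only to illustrate the calculation; the exact p-value doubles the smaller binomial tail with $n = b + c$ and success probability ½, as in the sign test.

```python
from math import comb


def mcnemar_statistic(b, c, correction=True):
    """McNemar chi-square statistic from the discordant counts b and c."""
    if correction:
        return (abs(b - c) - 1) ** 2 / (b + c)
    return (b - c) ** 2 / (b + c)


def mcnemar_exact_p(b, c):
    """Two-sided exact p-value: b compared with Binomial(b + c, 1/2)."""
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # doubling the extreme tail, capped at 1


# Hypothetical discordant counts, for illustration only
b, c = 15, 5
chi2_corrected = mcnemar_statistic(b, c)
chi2_uncorrected = mcnemar_statistic(b, c, correction=False)
p_exact = mcnemar_exact_p(b, c)
print(chi2_corrected, chi2_uncorrected, round(p_exact, 4))
```

Since $b + c = 20 < 25$ in this illustration, the exact p-value would be the one to report.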
Example
A study was carried out to determine if a certain treatment has an effect on a certain disease. A sample of 350
people were diagnosed for the disease (disease: present or absent), with the diagnosis before treatment
given in the rows and the diagnosis after treatment in the columns. The test requires the same
subjects to be included in the before-and-after measurements (matched pairs).
Solution
Test statistic is
We reject $H_0$ if $\chi^2 > \chi^2_{0.05}(1) = 3.84$.
The value of the test statistic is greater than 3.84; we therefore reject $H_0$ and conclude that there
is a treatment effect.
Exercises
1. Candidates of two political parties A and B gave their campaigning speeches to a group
of 600 residents of a certain small town. There were 380 people who intended
to vote against political party A before the speech, and after the speech 110 of them
changed their voting intention towards the party. There were 220 residents who intended
to vote for party A before the speech and 70 of them changed their voting intention
against the party after the speech. Does the speech have an effect of changing the voting
intentions of the residents? Use
2. In a study to determine whether prevalence of severe cold increases with age, a sample of
1500 school children were questioned. In the study the children were questioned on the
prevalence of symptoms of severe cold at the age of 12 and again at the age of 14 years.
At age twelve, 447 children were reported to have severe colds in the past 12 months
compared to 558 at age 14. The data obtained is summarized in the table below.
Suppose that items from a given population can be placed in any one of the $r$ categories
$C_1, C_2, \dots, C_r$ and a random sample of size n is drawn from this population and that the observed
sample frequencies for each category are $O_1, O_2, \dots, O_r$. There is a probability that a
randomly selected observation may fall in any one of these categories.
Category               1       2       3      ... i ...      r
Observed frequencies   $O_1$   $O_2$   $O_3$  ... $O_i$ ...  $O_r$
Suppose that the probabilities of belonging to each of these $r$ categories respectively
are denoted by $p_1, p_2, \dots, p_r$; then the expected frequencies ($E_i$) for each
category are given by $E_i = np_i$, that is $E_1 = np_1, E_2 = np_2, \dots, E_r = np_r$. The values
of $E_i$ must be greater than or equal to 5 for the test to be good. If the value of $E_i$ is
smaller than 5, combine the class or category with the adjacent one.
The test statistic used in testing the given hypotheses is given by
$$\chi^2 = \sum_{i=1}^{r} \frac{(O_i - E_i)^2}{E_i}$$
This statistic follows a chi-square distribution with $r - k - 1$ degrees of freedom where $r$ is the
number of categories and $k$ is the number of parameters that we estimate from sample data. But
for large samples the test statistic is distributed as chi-square with degrees of freedom. If
the computed value of $\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$ is equal to or greater than the tabulated value of chi-
square with $r - k - 1$ degrees of freedom and significance level $\alpha$, we reject $H_0$ at
the $\alpha$ level of significance.
Example
Consider the data in the table below.
x 0 1 2 3 4 5+
Frequency 160 98 41 19 4 1
Test at 5% level of significance whether the data came from a Poisson distribution.
Solution
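The test statistic is $\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$. We are given the observed frequencies in the table above.
We must find the expected frequencies. We have to estimate the Poisson parameter $\lambda$ by $\bar{x}$, that
is $\hat{\lambda} = \bar{x} = \frac{\sum fx}{\sum f} = \frac{258}{323} = 0.799$.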
x 0 1 2 3 4 5+
Observed Frequency 160 98 41 19 4 1
Expected Frequency 145.32 116.09 46.35 12.34 2.47 0.45
The last two expected frequencies are less than 5, so we merge the last two categories with category 3 to obtain
x 0 1 2 3+
Observed Frequency 160 98 41 24
Expected Frequency 145.32 116.09 46.35 15.26
We have estimated the mean, therefore the degrees of freedom of the test statistic become
$4 - 1 - 1 = 2$. Therefore we reject $H_0$ when $\chi^2 > \chi^2_{0.05}(2) = 5.99$.
The value of the test statistic, $\chi^2 = 9.93$, is greater than the tabulated value 5.99; therefore we reject $H_0$ and
conclude that the data did not come from a Poisson distribution.
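The whole goodness-of-fit calculation for this example can be reproduced with the sketch below. It is an illustration under stated assumptions: the function name `poisson_gof` is hypothetical, the open class "5+" is counted as 5 when estimating the mean (which matches the expected frequencies in the table), and all classes from `pool_from` upwards are merged into one.

```python
from math import exp, factorial


def poisson_gof(freqs, pool_from):
    """Chi-square goodness-of-fit statistic for a Poisson model.

    freqs[x] is the observed frequency of the value x = 0, 1, 2, ...
    Classes x >= pool_from are merged into a single upper class.
    """
    n = sum(freqs)
    lam = sum(x * f for x, f in enumerate(freqs)) / n  # estimate of lambda
    expected = [n * exp(-lam) * lam ** x / factorial(x) for x in range(pool_from)]
    expected.append(n - sum(expected))  # merged upper class
    observed = freqs[:pool_from] + [sum(freqs[pool_from:])]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1 - 1  # one parameter (the mean) was estimated
    return lam, expected, chi2, df


lam, expected, chi2, df = poisson_gof([160, 98, 41, 19, 4, 1], pool_from=3)
print(round(lam, 3), [round(e, 2) for e in expected], round(chi2, 2), df)
```

The statistic is close to 9.9 on 2 degrees of freedom, well above the tabulated 5% value, so the Poisson hypothesis is rejected.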
Exercise
1. A fair die is thrown 400 times and the score on the die is noted. The table below gives the
observed number of scores in the 400 throws.
Score 1 2 3 4 5 6
Frequency 54 60 83 77 82 44
2. According to some genetic theory, the number of white, yellow, blue, red and black
petals should appear in the ratio 3:6:9:2. Observed frequencies of white, yellow, blue, red
and black petals in a sample of 1200 plants were 240, 320, 589, and 51. Test whether the
observed frequencies are consistent with the genetic theory. Use
3. Samples of size 6 each were regularly drawn from a production line and the numbers of
defective products were noted. During one week 400 samples were drawn and the
number of defective units in each sample is noted. The data obtained are as follows.
Test whether the number of defective items follows the binomial distribution. Use a 5%
level of significance.
4. The weights of 230 grade one pupils were recorded and the data obtained were as
follows.
Weight, x <19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55+
f 6 11 27 37 47 42 34 17 9
Test whether the data follows a normal distribution. Use a 5% level of significance.
Summary