STAT-II Week End

This document provides an introduction and overview of key concepts in sampling and sampling distributions:
- Descriptive statistics analyzes and presents data from a sample, while inferential statistics makes predictions about a larger population based on a sample.
- A population is the entire group being studied, while a sample is a subset of the population. Parameters describe populations; statistics describe samples.
- Sampling methods include simple random sampling, stratified sampling, and cluster sampling. Probability sampling methods assign a known probability of selection to each unit.
- Sampling error occurs when a sample statistic is used to estimate a population parameter. Non-sampling errors come from issues such as non-response or biased responses.


Chapter 1: Sampling and Sampling Distributions

1.1. Introduction and basic definitions


Statistics is the science of collecting, summarizing, analyzing,
presenting and interpreting data. There are two main branches of
statistics: descriptive and inferential.

Descriptive statistics deals only with analyzing data and presenting information about a set of data that has been collected. The area of descriptive statistics, as you studied in Statistics for Management I, is concerned primarily with methods of presenting and interpreting data using graphs, tables and numerical summaries.

Inferential statistics, which will be dealt with in this course, is used to make predictions or
comparisons about a larger group (a population) using data gathered about a small part of that
population. Whenever we use data from a sample—i.e., a subset of the population—to make
statements about a population, we are performing statistical inference. Thus, inferential statistics
involves generalizing beyond the data, something that descriptive statistics does not do.

A population is the group of all items of interest to a statistics practitioner. It is frequently very
large and may, in fact, be infinitely large. In the language of statistics, population does not
necessarily refer to a group of people. It may, for example, refer to the population of soft drinks,
light bulbs or computers manufactured by a company.

A descriptive measure of a population is called a parameter. The parameter may be the proportion of employees who are satisfied with their job or the mean number of soft drinks consumed by all the students at a university. In most applications of inferential statistics, the parameter represents the information we need.

A census is a complete enumeration of the units to be studied, whether individuals, families, or firms in an industry. The 2007 population census in Ethiopia is one good example. The advantage of a census is that there are no sampling errors. The major disadvantages of a census are expense and time.

A sample is a smaller subset of the population. A descriptive measure of a sample is called a
statistic. We use statistics to make inferences about parameters. For example, we would use the
sample mean to infer the value of the population mean.

Data are the facts and figures that are collected, analyzed, and summarized for presentation and
interpretation. Data may be classified as either quantitative or qualitative.

Quantitative data measure either how much or how many of something and qualitative data
provide labels, or names, for categories of like items. For example, suppose that a particular
study is interested in characteristics such as age, gender, marital status and annual income for a
sample of 100 individuals. These characteristics would be called the variables of the study and
data values for each of the variables would be associated with each individual. Thus, the data
values of 28, male, single, and $30,000 would be recorded for a 28-year-old single male with an
annual income of $30,000. With 100 individuals and 4 variables, the data set would have 100 × 4
= 400 items. In this example, age and annual income are quantitative variables; the
corresponding data values indicate how many years and how much money for each individual.
Gender and marital status are qualitative variables. The labels male and female provide the
qualitative data for gender and the labels single, married, divorced and widowed indicate marital
status.

Other distinctions are sometimes made between data types.


• Discrete data are whole numbers and are usually a count of objects. For instance, one study
might count the number of employees with qualifications above first degree and here it wouldn’t
make sense to have half an employee.
• Continuous data, in contrast to discrete data, are measured data and thus may take on any real
value. For example, the number of hours an employee spends on work or on meetings each day
would be measured data since they could spend any number of hours.

1.2. Sampling methods


Sampling is the process or method of drawing a representative group of individuals or cases
from a particular population. Sampling and statistical inference are used in circumstances in
which it is impractical to obtain information from every member of the population. The required
size of a sample depends on the level of precision that is desired.

Probability sampling methods
Probability sampling methods are sampling methods in which the probability of each unit
appearing in the sample is known.

The basic and most reliable method of probability sampling is simple random sampling where
every element of the population being sampled has an equal probability of being selected. In a
random sample of a class of 50 students, for example, each student has the same probability,
1/50, of being selected. Every combination of elements drawn from the population also has an
equal probability of being selected. Thus, with simple random sampling, every possible sample
of size n has the same probability of being selected.

Another probability method, systematic random sampling, includes every nth member of the
population in the sample. Thus, if one wishes to study the attitudes of employees towards a new
organizational reform program, and the organization has 1,000 employees, one could derive a
sample of 100 employees from a payroll list of the names of the employees by randomly
choosing a number between 1 and 10, selecting the name on the list corresponding to that
number and then selecting every 10th name after it. Note that systematic sampling is not as
statistically reliable as random sampling.

Stratified simple random sampling is a variation of simple random sampling in which the
population is partitioned into relatively homogeneous groups called strata and a simple random
sample is selected from each stratum. The results from the strata are then aggregated to make
inferences about the population. A side benefit of this method is that inferences about the
subpopulation represented by each stratum can also be made.

Cluster sampling involves partitioning the population into separate groups called clusters.
Unlike in the case of stratified simple random sampling, it is desirable for the clusters to be
composed of heterogeneous units. In single-stage cluster sampling, a simple random sample of
clusters is selected and data are collected from every unit in the sampled clusters. In two-stage
cluster sampling, a simple random sample of clusters is selected and then a simple random
sample is selected from the units in each sampled cluster. One of the primary applications of cluster sampling is called area sampling, where the clusters are districts, townships, city blocks, or other well-defined geographic sections of the population.

Non-probability sampling methods


Probability sampling techniques are less likely to be useful when the population consists of a
large number of items or members that are not homogeneous. An alternative to probability
sampling is judgment sampling in which selection is based on the judgment of the researcher and
there is an unknown probability of inclusion in the sample for any given case.

One common non-probability sampling method is quota sampling. Here the probability of an individual's selection is not known; the sample is simply a collection of representative individuals. Interviewers are given a quota of the numbers they must contact and interview by sex, social class, age and other variables relevant to the investigation being undertaken.

On the face of it, a quota sample sounds an attractive choice, but of course we have no real
guarantee that we have achieved a really representative set of respondents to our questionnaire.

1.3. Errors and Bias in sampling


Sampling error is the difference between a population parameter and a sample statistic used to
estimate it. For example, the difference between a population mean and a sample mean is
sampling error. Sampling error occurs because a portion, and not the entire population, is
surveyed. Probability sampling methods, where the probability of each unit appearing in the
sample is known, enable statisticians to make probability statements about the size of the
sampling error.

Non-sampling errors are errors that arise from non-response, biased response or interviewer
error. Non-sampling errors may be caused by very simple things – the misunderstanding of a
word in a questionnaire by less educated people, the dislike of a particular social group of one
interviewer’s manner, or the loss of a batch of questionnaires during administration. These could
all occur in an unplanned way and bias your survey badly. To say that a study is biased means that there is something systematically wrong with the way the study is conducted.

Sampling errors can be quantified in advance and are largely the result of the researchers' planning, given a particular expenditure level, whereas non-sampling errors can be very difficult to detect once they have occurred.

1.4. Introduction to sampling distributions


Random variables and probability distributions
A random variable is a numerical description of the outcome of a statistical experiment. A
random variable that may assume only a finite number or an infinite sequence of values is said to
be discrete; one that may assume any value in some interval on the real number line is said to be
continuous. For instance, a random variable representing the number of employees absent on a
certain day at a particular organization would be discrete while a random variable representing
the weight of a person in kilograms (or pounds) would be continuous.

The probability distribution for a random variable describes how the probabilities are distributed
over the values of the random variable. For a discrete random variable, x, the probability
distribution is defined by a probability mass function, denoted by p(x). This function provides the
probability for each value of the random variable.

In the development of the probability function for a discrete random variable, two conditions
must be satisfied:
(1) p(x) must be nonnegative for each value of the random variable, and
(2) the sum of the probabilities for each value of the random variable must equal one.

A continuous random variable may assume any value in an interval on the real number line or in
a collection of intervals. Since there are an infinite number of values in any interval, it is not
meaningful to talk about the probability that the random variable will take on a specific value;
instead, the probability that a continuous random variable will lie within a given interval is
considered.

In the continuous case, the counterpart of the probability mass function is the probability density
function, also denoted by p(x). For a continuous random variable, the probability density
function provides the height or value of the function at any particular value of x; it does not
directly give the probability of the random variable taking on a specific value. However, the area under the graph of p(x) corresponding to some interval, obtained by computing the integral of p(x) over that interval (a method of calculus), provides the probability that the variable will take on a value within that interval.

A probability density function must satisfy two requirements:


(1) p(x) must be nonnegative for each value of the random variable, and
(2) the integral over all values of the random variable must equal one.

Special probability distributions


i. Discrete probability distributions
Two of the most widely used discrete probability distributions are the binomial and Poisson.
The binomial probability mass function provides the probability that x successes will occur in
n trials of a binomial experiment.
A binomial experiment has four properties:
(1) it consists of a sequence of n identical trials;
(2) two outcomes, success or failure, are possible on each trial;
(3) the probability of success on any trial, denoted p, does not change from trial to trial; and
(4) the trials are independent.

Definition: If X is a binomial random variable in n independent trials with probability p of success and q = 1 − p of failure on a single trial, then the probability function of X is called the binomial distribution, given by the following formula:

P(X = x) = nCx · p^x · q^(n−x), where nCx = n! / (x!(n − x)!)
Example
Suppose it is known that 10 percent of the owners of two-year old automobiles have had
problems with their automobile's electrical system. Compute the probability of finding exactly 2
owners that have had electrical system problems out of a group of 10 owners.
Solution
The binomial probability mass function can be used by setting n = 10, x = 2, and p = 0.1 in the above equation; for this case,

P(X = 2) = 10C2 · (0.1)^2 · (0.9)^8 = 0.1937

Here is one more example
Past records show that 2% of the workers of an organization are absent from their work every day. If a random sample of 7 workers is selected, determine the probability that, out of this group:
a) there will be no absentee on a given day
b) there will be exactly 2 absentees
Solutions
a) P(X = 0) = 7C0 · (0.02)^0 · (0.98)^7 = 0.868 = 86.8%

b) P(X = 2) = 7C2 · (0.02)^2 · (0.98)^5 = 21(0.0004)(0.9039) = 0.0076 = 0.76%
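The two worked examples above can be checked numerically. The sketch below is illustrative only; the helper name `binomial_pmf` is our own, built on the standard library's `math.comb` for nCx:

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) = nCx * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Electrical-system example: n = 10, p = 0.1, x = 2
print(round(binomial_pmf(2, 10, 0.1), 4))  # 0.1937

# Absentee example: n = 7, p = 0.02
print(round(binomial_pmf(0, 7, 0.02), 3))  # 0.868
print(round(binomial_pmf(2, 7, 0.02), 4))  # 0.0076
```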

The Poisson distribution


The Poisson probability distribution is often used as a model of the number of arrivals at a facility within a given period of time. For instance, a random variable might be defined as the number of telephone calls coming into an organization or its manager during a period of 15 minutes.

In general, if an event occurs λ times on average during a given unit of time, then the random variable X (the number of arrivals) follows the Poisson distribution, given by

P(X = x) = e^(−λ) · λ^x / x!

Example
Suppose that the mean number of calls arriving in a 15-minute period is 10. Find the probability
that 5 calls come in within the next 15 minutes.
Solution
Given that λ = 10 and x = 5, we substitute these values in the equation. Thus,

P(X = 5) = e^(−10) · 10^5 / 5! = 0.0378,

giving the probability that 5 calls come in within the next 15 minutes as 0.0378 or 3.78%.
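The Poisson calculation above is easy to reproduce with the standard library; this short sketch (the function name `poisson_pmf` is ours) confirms the result:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) = e^(-lam) * lam^x / x!"""
    return exp(-lam) * lam ** x / factorial(x)

# Telephone-call example: lam = 10 calls per 15 minutes, x = 5
print(round(poisson_pmf(5, 10), 4))  # 0.0378
```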

Continuous probability distribution: The normal distribution
The most widely used continuous probability distribution in statistics is the normal probability
distribution. The graph corresponding to a normal probability density function with a mean of μ
= 50 and a standard deviation of σ = 5 is shown in the following figure.

Like all normal distribution graphs, it is a bell-shaped curve. Probabilities for the normal
probability distribution can be computed using statistical tables for the standard normal
probability distribution, which is a normal probability distribution with a mean of zero and a
standard deviation of one.

A simple mathematical formula is used to convert any value from a normal probability
distribution with mean μ and a standard deviation σ into a corresponding value for a standard
normal distribution. The tables for the standard normal distribution are then used to compute the
appropriate probabilities.

A normal distribution with mean zero and standard deviation of one is known as the standard normal distribution, with the resulting pdf given by

f(z) = (1/√(2π)) · e^(−z²/2),   −∞ < z < ∞

The Z-score for any score x is found by determining how many standard deviations x is from the mean μ. This is done by the following transformation formula:

Z = (x − μ) / σ
Example
The daily wages of 1,000 workers of a factory are normally distributed with a mean of 70 Birr
and a standard deviation of 5.2 Birr. Find the probability that a randomly chosen worker has a
daily wage above 82 Birr.
P(X > 82) = P(Z > (82 − 70)/5.2) = P(Z > 2.31) = 0.5 − P(0 < Z < 2.31) = 0.5 − 0.4896 = 0.0104
Example
Suppose that a population of men’s heights is normally distributed with a mean of 68 inches, and
standard deviation of 3 inches. If a person is selected at random, find the probability that his
height is:
a) under 66 inches
b) over 72 inches
c) between 66 and 72 inches.

Answers
The two cut-off values are 66 and 72. Converted to standard units these are:

P(X < 66) = P(Z < (66 − 68)/3) = P(Z < −0.67) = 0.5 − P(0 < Z < 0.67) = 0.5 − 0.2486 = 0.2514

Similarly,

P(X > 72) = P(Z > (72 − 68)/3) = P(Z > 1.33) = 0.5 − P(0 < Z < 1.33) = 0.5 − 0.4082 = 0.0918
So the answers are:
a) 25.14%
b) 9.18%
c) 100 – 25.14 – 9.18 = 65.68%.
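Both examples can be checked with Python's built-in `statistics.NormalDist` (available since Python 3.8). The results agree with the table-based answers to about three decimal places; the small differences come from rounding z to two decimals during table lookup:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Daily-wage example: mean 70, sd 5.2, P(X > 82)
p_wage = 1 - Z.cdf((82 - 70) / 5.2)

# Heights example: mean 68, sd 3
p_under_66 = Z.cdf((66 - 68) / 3)
p_over_72 = 1 - Z.cdf((72 - 68) / 3)
p_between = 1 - p_under_66 - p_over_72

print(round(p_wage, 4), round(p_under_66, 4), round(p_over_72, 4), round(p_between, 4))
```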

Exercise
1. Check the following using the Normal Table. If Z ~ N(0, 1), then:
a) P(Z ≥ 1) = 0.1587
b) P(Z ≤ 1) = 0.8413
c) P(Z ≥ 2.68) = 0.0037
d) P(1.83 ≤ Z ≤ 3.2) = 0.0329
e) P(Z ≤ −1) = 0.1587
f) P(−1.52 ≤ Z ≤ 2.2) = 0.4357 + 0.4861 = 0.9218

Make sure you make yourselves completely familiar with this kind of work. It leads directly into your work on confidence intervals in the next chapter and, ultimately, to the ideas of hypothesis testing in Chapter 3.

Sampling distributions
A sampling distribution is a probability distribution for a sample statistic. In other words, the
sampling distribution of a statistic is the distribution of values taken by the statistic in all
possible samples of the same size.

Sampling distribution of the mean


Consider all possible samples of size n taken from a population of size N; the sample mean, x̄ᵢ, of each sample, along with its probability of occurrence (relative frequency), is called the sampling distribution of the mean.

Example: Consider a finite population of size 5 with elements 3, 5, 7, 9 and 11. Construct the
sampling distribution of the mean for the possible samples of size 2 selected without
replacement.

Sample number   Sample   Sample mean
1               3, 5     4
2               3, 7     5
3               3, 9     6
4               3, 11    7
5               5, 7     6
6               5, 9     7
7               5, 11    8
8               7, 9     8
9               7, 11    9
10              9, 11    10

Thus, the distribution of the sample mean of the above data could be given as follows:

Sample mean (x̄ᵢ)   Frequency   Relative frequency P(x̄ᵢ)
4                  1           1/10
5                  1           1/10
6                  2           2/10
7                  2           2/10
8                  2           2/10
9                  1           1/10
10                 1           1/10
                               Σ P(x̄ᵢ) = 1

This is called the sampling distribution of the sample mean.

The mean and standard deviation of the sampling distribution of the sample mean can be determined as

μ_x̄ = Σ fᵢ x̄ᵢ / Σ fᵢ = 70/10 = 7, and

σ_x̄ = √( Σ (x̄ᵢ − μ_x̄)² P(x̄ᵢ) ) = √3 ≈ 1.73

The standard deviation of the sampling distribution of the sample means measures the extent to
which we expect the means from the different samples to vary because of the chance error in the
sampling process. Thus the standard deviation of the sampling distribution of a sample statistic is
known as the standard error of the statistic. It indicates not only the size of the chance error but
also the accuracy we are likely to get if we use a sample statistic to measure a population
parameter. Therefore the standard deviation of the sample mean is called the standard error of the
mean.

Now compute the population mean, μ, and population standard deviation, σ, for the same population of size five given in the example.

μ = Σ xᵢ / N = 35/5 = 7

σ = √( Σ (xᵢ − μ)² / N ) = √8 ≈ 2.83

This shows that the mean of the sampling distribution of the sample means equals the mean of the population, and the standard deviation of the sampling distribution of the sample means is smaller than the population standard deviation.
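The whole example can be verified by brute force: enumerate every sample of size 2, compute the means, and take their mean and standard deviation. This sketch uses the standard library's `itertools.combinations`:

```python
from itertools import combinations
from math import sqrt

population = [3, 5, 7, 9, 11]

# Every sample of size 2 drawn without replacement, and its mean
means = [sum(s) / 2 for s in combinations(population, 2)]

mu_xbar = sum(means) / len(means)
sigma_xbar = sqrt(sum((m - mu_xbar) ** 2 for m in means) / len(means))

print(mu_xbar, round(sigma_xbar, 2))  # 7.0 1.73
```

As the text states, the mean of the 10 sample means equals the population mean (7), and their standard deviation (√3 ≈ 1.73) is smaller than the population standard deviation (√8 ≈ 2.83).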

For a random sample of size n taken from a population of size N having population mean μ and population standard deviation σ, the theoretical sampling distribution of the sample means x̄ᵢ has the following relationships:

μ_x̄ = μ and σ_x̄ = σ/√n, if the population is infinite or if sampling is with replacement; and

μ_x̄ = μ and σ_x̄ = (σ/√n) · √((N − n)/(N − 1)), if the population is finite or if sampling is without replacement.

The factor √((N − n)/(N − 1)) is called the finite population correction factor, and it is commonly omitted unless the sample constitutes at least 5% of the population.
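The two relationships above can be wrapped in one small helper. This is an illustrative sketch (the function name `standard_error` is ours), applying the finite population correction only when a population size N is supplied:

```python
from math import sqrt

def standard_error(sigma, n, N=None):
    """Standard error of the mean: sigma/sqrt(n), multiplied by the
    finite population correction sqrt((N - n)/(N - 1)) when N is given."""
    se = sigma / sqrt(n)
    if N is not None:
        se *= sqrt((N - n) / (N - 1))
    return se

print(standard_error(12.0, 36))                   # 2.0 (infinite population)
print(round(standard_error(12.0, 36, N=100), 2))  # 1.61 (with the correction)
```

The second call shows the correction factor at work: it always reduces the standard error, here from 2.0 to about 1.61.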

Example
Suppose the mean of a very large population is μ = 50.0 and the standard deviation of the measurements is σ = 12.0. We determine the sampling distribution of the sample means for a sample size of n = 36, in terms of the expected value and the standard error of the distribution, as follows:

μ_x̄ = μ = 50.0 and σ_x̄ = σ/√n = 12.0/√36 = 2.0
When sampling from a population that is finite and of limited size, a finite correction factor is available for the correct determination of the standard error. The effect of this correction factor is always to reduce the value that would otherwise be calculated. As a general rule, the correction is negligible and can be omitted when n < 0.05N, that is, when the sample size is less than 5 percent of the population size. Because populations from which samples are taken are usually large, many texts and virtually all computer programs do not include this correction option. The formula for the standard error of the mean with the finite correction factor included is

σ_x̄ = (σ/√n) · √((N − n)/(N − 1))

The correction factor in the above formula is the factor under the square root that has been
appended to the basic formula for the standard error of the mean. This same correction factor can
be appended to the formulas for any of the standard error formulas for the mean, difference
between means, proportion, and difference between proportions that are described and used in
this and the following chapters.

Example
To illustrate that the finite correction factor reduces the size of the standard error of the mean when its use is appropriate, suppose that a sample of n = 36 values was taken from a population of just 100 values, with μ = 50.0 and σ = 12.0 as before. The sample thus constitutes 36 percent of the population. The expected value and standard error of the sampling distribution of the mean are

μ_x̄ = μ = 50.0 and σ_x̄ = (12.0/√36) · √((100 − 36)/(100 − 1)) = 2.0 × 0.804 ≈ 1.61

If the standard deviation of the population is not known, the standard error of the mean can be estimated by using the sample standard deviation as an estimator of the population standard deviation. To differentiate this estimated standard error from the precise one based on a known σ, it is designated by the symbol s_x̄ (or by σ̂_x̄ in some texts). A few textbooks do not differentiate the exact standard error from the estimated standard error, and instead use the simplified SE(x̄) to represent the standard error of the mean. The formula for the estimated standard error of the mean is

s_x̄ = s/√n

Example
An auditor takes a random sample of size n = 16 from a set of N = 1,500 accounts receivable. The standard deviation of the amounts of the receivables for the entire group of 1,500 accounts is not known. However, the sample standard deviation is s = $57.00. Since the sample is barely 1 percent of the population, the finite correction factor can be omitted, and the value of the standard error for the sampling distribution of the mean is

s_x̄ = s/√n = 57.00/√16 = $14.25

Sampling Distribution of Proportion


Let N be the population size and x be the number of items in the population satisfying a certain characteristic or condition; then the population proportion, denoted by p, is given by

p = x/N

If x is the number of items in a sample satisfying the characteristic or condition and n is the number of items in the sample, then the sample proportion, denoted by p̂, is given by

p̂ = x/n

Example
Assume that 0.80 of all third grade students can pass a test of physical fitness. A random sample
of 20 students is chosen: 13 passed and 7 failed. Find the mean and standard deviation of the
distribution for the sample proportion
Solution

μ_p̂ = p = 0.8

and

σ_p̂ = √( p(1 − p)/n ) = √( (0.8)(0.2)/20 ) = 0.089
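For completeness, a two-line check of the fitness-test example (illustrative only; the variable names are ours):

```python
from math import sqrt

p, n = 0.80, 20  # population proportion passing; sample size

mean_phat = p
se_phat = sqrt(p * (1 - p) / n)

print(mean_phat, round(se_phat, 3))  # 0.8 0.089
```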

The Central Limit Theorem
In the introductory remarks to this chapter, we noted that the normal distribution could be used for samples whose parent (population) distribution was not normal. The Central Limit Theorem justifies this. It states, roughly, that for a large enough sample size n, the sampling distribution of the sample mean x̄ from a random sample of size n drawn with replacement from a population of values of X is close to the normal distribution N(μ, σ²/n), where the population values of X have mean μ and variance σ².

This is an approximate version of the precise result for samples from N(μ, σ²), but it holds much more generally, since the population is not restricted to be normal.
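The theorem can be illustrated with a small simulation. This sketch is not from the original text: we repeatedly sample from a uniform (clearly non-normal) population and watch the sample means centre on μ:

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible

# A decidedly non-normal parent population: uniform on the integers 1..10
population = list(range(1, 11))
mu = mean(population)  # 5.5

# Draw 5,000 samples of size 30 with replacement and record each sample mean
sample_means = [mean(random.choices(population, k=30)) for _ in range(5000)]

# The sampling distribution of the mean centres on mu, with spread sigma/sqrt(n)
print(round(mean(sample_means), 1))
```

A histogram of `sample_means` would show the familiar bell shape even though the parent distribution is flat.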

Chapter 2: Statistical Estimation
Introduction
This chapter is concerned with data-based decision-making. It is about making a decision which involves a population. The sort of information needed for a decision may be a mean value (e.g., how many hours does an employee spend working for an organization per month, on average?) or a proportion (what proportion of items manufactured have a fault?). The associated decision may range from determining the appropriate amount of employee benefits to stopping the production line for readjustment.

In order to carry out this type of exercise, one obvious decision needs to be made. How large
should the sample be? The answer to this question is – it depends! It depends on how variable the
population is, on how accurate the input to the decision needs to be, and on how costly the data
collection is.

Point Estimation
Point estimation is the method of estimating a population parameter by a single value. The way to make a point estimate is to use a statistic, i.e., a formula computed from sample data. The statistics with which we estimate parameters are called estimators.

Parameter   Estimator
μ           x̄ = Σ xᵢ / n
σ²          s² = Σ (xᵢ − x̄)² / (n − 1)
σ           s = √( Σ (xᵢ − x̄)² / (n − 1) )

Interval Estimation
An interval estimate is a statement showing that a population parameter has a value lying between two specific limits. It is a method where a population parameter is estimated to be within an interval or range of values rather than just a single point. The width of the interval depends on the probability with which the population parameter is expected to fall in that interval, which is known as the confidence level.

Confidence Intervals

In the previous section we saw that we could use x̄ as a guess for μ, and that the standard error of x̄, SE(x̄) = σ/√n, gave us an idea of how good our guess is.

• SD measures the spread of the data. If the SD is big, then it would be hard for you to guess the value of a single observation drawn at random from the population.

• SE measures the amount of trust you can put in an estimate such as x̄. If the standard error of an estimate is small, then you can be confident that it is close to the true population quantity it is estimating (e.g., that x̄ is close to μ).

Confidence intervals build on this idea. A confidence interval gives a range of possible values (e.g. 10−20) which we are highly confident μ lies within. For example, if we said that a 95% confidence interval for μ was 10−20, this would mean that we were 95% sure that μ lies between 10 and 20. The question is, how do we calculate the interval?

First assume that the population we are looking at has mean μ and standard deviation σ. Then x̄ has mean μ and standard error σ/√n. The central limit theorem also assures us that x̄ is normally distributed. This is useful because we know that a normal random variable is almost always (95% of the time) within 2 standard deviations of its mean. So, x̄ will almost always be within 2σ/√n of μ. That means we can be 95% certain that μ is no more than 2σ/√n away from x̄. Therefore, if we take the interval [x̄ − 2σ/√n, x̄ + 2σ/√n], we have a 95% chance of capturing μ.

(In fact, if we want to be exactly 95% sure of capturing μ we only need to use 1.96 rather than 2, but this is a minor point.) What if we want to be 99% sure or only 90% sure of being correct? If you look in the normal tables you will see that a normal will lie within 2.57 standard deviations of its mean 99% of the time and within 1.645 standard deviations of its mean 90% of the time. Therefore, in general we get [x̄ − zσ/√n, x̄ + zσ/√n], where z is 1.96 for a 95% interval, 1.645 for a 90% interval and 2.57 for a 99% interval. This formula applies to any other certainty level as well; just look up z in the normal table.

Confidence intervals and limits


It is important to remember that estimates made from data are bound to be imprecise, and it is
essential to indicate the level of imprecision associated with an estimate.

The generally adopted procedure for doing this is to state upper and/or lower limits within which the true value of the parameter is likely to lie. These limits are called confidence limits, and the interval between them is called a confidence interval.

When the sample size, n, is at least 30, it is generally agreed that the central limit theorem will
ensure that the sample mean follows the normal distribution. This is an important consideration.
If the sample mean is normally distributed, we can use the standard normal distribution, that is, z,
in our calculations.

The 95 percent confidence interval is computed as follows, when the number of observations in the sample is at least 30:

x̄ ± 1.96 · s/√n

Similarly, the 99 percent confidence interval is computed as follows. Again we assume that the sample size is at least 30:

x̄ ± 2.58 · s/√n

Note that the values 1.96 and 2.58 are the z values corresponding to the middle 95 percent and the middle 99 percent of the observations, respectively.

We can use other levels of confidence. For those cases the value of z changes accordingly. In
general, a confidence interval for the population mean is computed by:

Confidence Interval for the Population Mean [n ≥ 30]:

x̄ ± z · s/√n     (2.1)

where z depends on the level of confidence. Thus, for a 92 percent level of confidence, the value of z in formula (2.1) is 1.75. The value of z is from the Z-table. This table is based on half the normal distribution, so 0.9200/2 = 0.4600. The closest value in the body of the table is 0.4599, and the corresponding z value is 1.75.

Frequently, we also use the 90 percent level of confidence. In this case, we want the area
between 0 and z to be .4500, found by .9000/2. To find the z value for this level of confidence,
search the body of the table for an area close to .4500. The closest values in the table are .4495
and .4505. To be conservative we will use .4505. To find the corresponding z value, in the same
row, refer to the left column and read 1.6. Then for the same column, refer to the top margin and
find .05. Adding 1.6 and .05, the z value is 1.65. The same procedure gives the z value for any
other level of confidence.
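These table lookups can be checked numerically. The sketch below is an illustration, not part of the original text; the helper name `z_for_confidence` is ours. It uses Python's standard library `statistics.NormalDist` to invert the normal CDF, and returns the exact value 1.645 for 90 percent, which the table procedure above rounds up to 1.65.

```python
from statistics import NormalDist

def z_for_confidence(level):
    # Two-sided critical z: the point with (1 + level) / 2 of the
    # standard normal distribution to its left.
    return NormalDist().inv_cdf((1 + level) / 2)

for level in (0.90, 0.92, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_for_confidence(level):.3f}")
```

The exact 99 percent value is 2.576, which the text rounds to 2.58.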

The following example shows the details for calculating a confidence interval and interpreting
the result.

Example 1
A certain management association wishes to have information on the mean income of middle
managers in the retail industry. A random sample of 256 managers reveals a sample mean of
$45,420. The standard deviation of this sample is $2,050. The association would like answers to
the following questions:

1. What is the population mean?


2. What is a reasonable range of values for the population mean?
3. What do these results mean?

Solution:
The central limit theorem stipulates that if we select a large sample, the distribution of the
sample means will follow the normal distribution. In this instance, with a sample of 256 middle
managers (remember, at least 30 is usually large enough), we can be assured that the sampling
distribution will follow the normal distribution.

Another issue is that the population standard deviation is not known. Again, it is sound practice
to use the sample standard deviation when we have a large sample. Now to answer the questions
posed in the problem.

1. What is the population mean? In this case, we do not know. We do know the sample mean is
$45,420. Hence, our best estimate of the unknown population value is the corresponding sample
statistic. Thus the sample mean of $45,420 is a point estimate of the unknown population mean.
2. What is a reasonable range of values for the population mean? The Association decides to
use the 95 percent level of confidence. To determine the corresponding confidence interval we
use formula (2.1).

These endpoints are found from $45,420 ± 1.96($2,050/√256) = $45,420 ± $251, giving $45,169
and $45,671. These endpoints are called the confidence limits. The degree of confidence or the
level of confidence is 95 percent and the confidence interval is from $45,169 to $45,671.
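The interval in Example 1 can be reproduced in a few lines. This is a sketch using the example's own figures; the variable names are ours.

```python
import math

# Example 1: n = 256 managers, sample mean $45,420, sample sd $2,050
n, xbar, s = 256, 45_420, 2_050
z = 1.96  # middle 95 percent of the standard normal

margin = z * s / math.sqrt(n)      # 1.96 * 2050 / 16 = 251.125
lower, upper = xbar - margin, xbar + margin
print(round(lower), round(upper))  # endpoints $45,169 and $45,671
```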

3. What do these results mean? Suppose we select many samples of 256 managers, perhaps
several hundred. For each sample, we compute the mean and the standard deviation and then
construct a 95 percent confidence interval. Then we could expect about 95 percent of these
confidence intervals to contain the population mean. About 5 percent of the intervals would not
contain the population mean annual income, which is μ. However, a particular confidence
interval either contains the population parameter or it does not. The following diagram shows the
results of selecting samples from the population of middle managers in the retail industry,
computing the mean and standard deviation of each, and then, using formula (2.1), determining a
95 percent confidence interval for the population mean. Note that not all intervals include the
population mean. Both the endpoints of the fifth sample are less than the population mean. We
attribute this to sampling error, and it is the risk we assume when we select the level of
confidence.

Example 2
Measurements of the diameter of a random sample of 200 ball bearings produced by a machine
showed a mean of 0.824". The population standard deviation is σ = 0.042". Find
a) 95%, and
b) 99% confidence intervals for the true mean value of the diameter of the ball bearings.
Since σ = 0.042 is known, the intervals are 0.824 ± 1.96(0.042/√200) ≈ (0.818, 0.830) and
0.824 ± 2.58(0.042/√200) ≈ (0.816, 0.832).

Example 3
There are 250 employees in a certain organization. A poll of 40 employees reveals the mean
monthly income from overtime work is $450 with a standard deviation of $75. Construct a 90
percent confidence interval for the mean monthly income from overtime work.

Solution:
First, note that the population is finite. That is, there is a limit to the number of employees in the
organization. Second, note that the sample constitutes more than 5 percent of the population; that
is, n/N = 40/250 = .16. Hence, we use the finite-population correction factor. The 90 percent
confidence interval is constructed as follows:

$450 ± 1.65(75/√40)√((250 − 40)/(250 − 1)) = $450 ± $17.97

The endpoints of the confidence interval are $432.03 and $467.97. It is likely that the population
mean falls within this interval.
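A sketch of the finite-population computation, with the correction factor applied to the standard error. The function name is ours, and z = 1.65 follows the text's table convention for 90 percent.

```python
import math

def mean_ci_fpc(xbar, s, n, N, z):
    # Standard error with the finite-population correction sqrt((N - n)/(N - 1))
    se = (s / math.sqrt(n)) * math.sqrt((N - n) / (N - 1))
    return xbar - z * se, xbar + z * se

# Example 3: N = 250 employees, n = 40, mean $450, s = $75, 90% level
low, high = mean_ci_fpc(450, 75, 40, 250, 1.65)
print(f"({low:.2f}, {high:.2f})")  # (432.03, 467.97)
```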

Small sampling theory – the use of Student’s t


So far, our examples have used the Normal distribution. We have assumed that, the population
variance, σ2 is known. We could also assume, using the Central Limit Theorem, that, for a large
sample size, we can treat s (the sample estimate of σ) as normal. This is common practice. What
do we do, however, if we have to use the variance estimate from a small sample? The answer is
‘use the Student’s t distribution’. The statistic we use is:

t = (x̄ − μ)/(s/√n),    [2.2]
where s is the sample standard deviation computed with divisor n − 1, so that s² is the unbiased
estimate of the population variance.
It can be shown that the t distribution density function tends towards the Normal density as n
tends towards infinity.

Since the distribution is symmetrical about 0 it is used in exactly the same way as the Normal. A
general 100(1 − β)% confidence interval is:

x̄ ± k(s/√n),

where k is found from t tables. Note that the table has columns defined by the probability of
falling in the right-hand tail (which, in terms of the above interval, is β/2). The rows of the
table are defined by v (called the degrees of freedom of the sample), which for a single sample is
n − 1.

Thus for a 95% confidence interval with a sample size of 16 observations, v = 15 and the
right-tail probability is 0.025, giving k = 2.131.
Example 1
A sample of 10 measurements of the diameter of a sphere gives a sample mean of 4.38" and a
standard deviation s = 0.06". Find a 95% confidence interval for the actual diameter, and
compare it with one (incorrectly) derived from the Normal distribution.
(You need to use the t distribution since n < 30 and σ is estimated by s.)
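The comparison asked for can be sketched as follows; the critical value 2.262 for 9 degrees of freedom is taken from a standard t table, since Python's standard library has no inverse t CDF, and the variable names are ours.

```python
import math

n, xbar, s = 10, 4.38, 0.06
se = s / math.sqrt(n)

t_crit = 2.262  # t table, v = 9 degrees of freedom, 95% two-tailed
z_crit = 1.96   # standard normal, 95%

t_low, t_high = xbar - t_crit * se, xbar + t_crit * se
z_low, z_high = xbar - z_crit * se, xbar + z_crit * se

print(f"t interval: ({t_low:.3f}, {t_high:.3f})")  # correct, wider
print(f"z interval: ({z_low:.3f}, {z_high:.3f})")  # incorrectly narrow for n = 10
```

The normal-based interval understates the uncertainty, which is exactly why the t distribution is needed for small samples.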

Example 2

The manager of a company wants to estimate the mean amount spent per shopping visit by
customers. A sample of 20 customers reveals the following amounts spent.

What is the best estimate of the population mean? Determine a 95 percent confidence interval.
Interpret the result. Would it be reasonable to conclude that the population mean is $50? What
about $60?

Solution:
The population standard deviation is not known and the size of the sample is less than 30. Hence,
it is appropriate to use the t distribution to find the confidence interval.

The mall manager does not know the population mean. The sample mean is the best estimate of
that value. The mean is $49.35, which is the best estimate, the point estimate, of the unknown
population mean.

We use formula [2.2] to find the confidence interval. The value of t is available from the t table.
There are n − 1 = 20 − 1 = 19 degrees of freedom. We move across the row with 19 degrees of
freedom to the column for the 95% confidence level. The value at this intersection is 2.093. We
substitute these values into formula [2.2] to find the confidence interval.

The endpoints of the confidence interval are $45.13 and $53.57. It is reasonable to conclude that
the population mean is in that interval. Since $50 falls inside this interval it is a reasonable value
for the population mean, while $60 falls outside it and is not.

Sample Size for Estimating the Population Mean

n = (z · s / E)²

where n is the size of the sample.

z is the standard normal value corresponding to the desired level of confidence.

s is an estimate of the population standard deviation.

E is the maximum allowable error.

The result of this calculation is not always a whole number. When the outcome is not a whole
number, the usual practice is to round up any fractional result. For example, 201.22 would be
rounded up to 202.
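The formula and the round-up rule can be sketched as follows (the function name is ours):

```python
import math

def sample_size_for_mean(z, s, E):
    # n = (z * s / E)^2, rounded UP so the error bound E is still met
    return math.ceil((z * s / E) ** 2)

print(sample_size_for_mean(1.96, 1000, 100))  # 384.16 rounds up to 385
print(sample_size_for_mean(2.58, 1000, 100))  # 665.64 rounds up to 666
```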

Example 4
A student in public administration wants to determine the mean amount members of city councils
in large cities earn per month as remuneration for being a council member. The error in
estimating the mean is to be less than $100 with a 95 percent level of confidence. The student
found a report by the Department of Labor that estimated the standard deviation to be $1,000.
What is the required sample size?
Solution:
The maximum allowable error, E, is $100. The value of z for a 95 percent level of confidence is
1.96, and the estimate of the standard deviation is $1,000. Substituting these values into the
formula gives the required sample size as:

n = (1.96 × $1,000 / $100)² = (19.6)² = 384.16

The computed value of 384.16 is rounded up to 385. A sample of 385 is required to meet the
specifications. If the student wants to increase the level of confidence, for example to 99 percent,
this will require a larger sample. The z value corresponding to the 99 percent level of confidence
is 2.58, so n = (2.58 × $1,000 / $100)² = (25.8)² = 665.64, rounded up to 666.

A sample of 666 is recommended. Observe how much the change in the confidence
level changed the size of the sample. An increase from the 95 percent to the 99 percent level of
confidence resulted in an increase of 281 observations. This could greatly increase the cost of the
study, both in terms of time and money. Hence, the level of confidence should be considered
carefully.

Confidence limits for proportions


The confidence intervals above have looked at the mean of the population. We can extend the
concept to other parameters of interest. In this section you will look at the proportion of a
population. A 100(1 − α)% confidence interval for the population proportion P is

p − Zα/2 √(p(1 − p)/n) < P < p + Zα/2 √(p(1 − p)/n)

where p is the sample proportion and n is the sample size.
Example 1
A sample poll of 100 voters chosen at random from all voters in a given district indicated that
55% of them were in favor of a particular candidate. Find
a) 95%, and
b) 99% confidence limits for the proportion of all voters in favor of the candidate.
Answers
a) The 95% confidence interval for the true proportion is
0.55 ± 1.96√(0.55 × 0.45/100) = 0.55 ± 0.0975, that is, (0.4525, 0.6475).
b) The 99% confidence interval is
0.55 ± 2.58√(0.55 × 0.45/100) = 0.55 ± 0.1284, that is, (0.4216, 0.6784).
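A sketch of the proportion interval for this poll. The helper name is ours, and it computes the exact z rather than the rounded table value, so the endpoints can differ slightly in the third decimal.

```python
import math
from statistics import NormalDist

def proportion_ci(p_hat, n, level):
    # p-hat +/- z * sqrt(p-hat * (1 - p-hat) / n), large-sample interval
    z = NormalDist().inv_cdf((1 + level) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

print(proportion_ci(0.55, 100, 0.95))  # roughly (0.45, 0.65)
print(proportion_ci(0.55, 100, 0.99))  # roughly (0.42, 0.68)
```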

Estimation of intervals for the differences between means and between proportions
We can extend the concept of a confidence interval to functions of the population mean. Often
comparisons are made between two populations or between two random variables, usually with
the objective of establishing whether or not they are different. If the two random variables are
denoted as X1 and X2, and samples of observations of them,
(x11, x12, x13, ..., x1n1) and (x21, x22, x23, ..., x2n2), of sizes n1 and n2 respectively, we can
make the following substitutions in the confidence interval: the sample mean is replaced by the
difference x̄1 − x̄2, and the standard error by √(s1²/n1 + s2²/n2).

You can also look at the difference in proportions in the same way. For large samples, where you
have two proportions pA and pB and the sample sizes are nA and nB respectively, the standard
deviation of pA − pB is

√(pA(1 − pA)/nA + pB(1 − pB)/nB)

Sample Size for the Population Proportion

n = p(1 − p)(z/E)²

where p is an estimate of the population proportion, z is the z value for the chosen level of
confidence, and E is the maximum allowable error.
A 100(1 − α)% confidence interval for the difference between two population means μ1 − μ2 is
given as

(x̄1 − x̄2) − Zα/2 √(σ1²/n1 + σ2²/n2) < μ1 − μ2 < (x̄1 − x̄2) + Zα/2 √(σ1²/n1 + σ2²/n2)

Example
A study showed a sample of 150 professional working men earn an average monthly salary of
Br3,000 with a standard deviation of Br800 and a sample of 100 professional working women
earn a monthly salary of Birr2,500 with a standard deviation of Br600. Construct a 99%
confidence interval estimate for the difference in average monthly salary between professional
working men and women.

(3,000 − 2,500) ± 2.58√(800²/150 + 600²/100) = 500 ± 2.58(88.69) ≈ 500 ± 228.8

The 99% confidence interval is therefore approximately (271.2, 728.8).
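The salary example can be checked numerically; note that the margin is z times the standard error of about 88.69, not 88.69 itself. The variable names are ours.

```python
import math

# Men: n1 = 150, mean Br3,000, sd Br800; women: n2 = 100, mean Br2,500, sd Br600
n1, x1, s1 = 150, 3000, 800
n2, x2, s2 = 100, 2500, 600
z = 2.58  # 99 percent confidence

se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # about 88.69
margin = z * se                          # about 228.8
print(f"({x1 - x2 - margin:.1f}, {x1 - x2 + margin:.1f})")  # (271.2, 728.8)
```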
Review Exercises
1. A researcher wants to study the effect of team empowerment on working capability of
teams. A sample of 16 teams of workers completed a specific task in an average of 26.4
minutes with a standard deviation of 4.0 minutes. Construct a 95% confidence interval
for the mean time required to complete the task.

2. A transport company wants to see the difference in fuel efficiency of the two types of
lorries it operates. It obtained the data shown in the following table for the two types of
lorries, designated A and B.

                            Lorry type A    Lorry type B
Average mileage per liter       12.9            10
Standard deviation               2.3             2
Sample size                      20              25

Construct a 95% confidence interval to estimate the difference in the average fuel efficiency
of the two types of lorries.

Chapter 3: Hypothesis Testing
3.1. Introduction
In Chapter 2 you were introduced to the idea of the probability that a parameter could lie
within a range of values, and in particular the confidence interval (generally 90%, 95% or 99%)
for a parameter. In this chapter we are going to look at the idea of using statistics to see whether
we should accept or reject statements about these parameters. The arithmetic you will use is
similar to that which you met in the last chapter.

We often need to answer questions about a population such as "Is there a difference between the
performances of two employees?"; "Has a new teaching method improved student achievement?";
"Has a new policy brought desirable impact on the lives of the target people?" and the like.
Generally in statistics, we try to base our answer to these questions on the information we have
been given in the samples. Since the questions asked refer to populations we are concerned with
ideas of statistical inference.

The hypothesis
A hypothesis is a statement or assumption made about a population parameter which may or
may not be true. Data are then used to check the reasonableness of the statement.

The concept of hypothesis testing is manifested in a variety of real life situations. For instance, in
a fair legal system, a person is innocent until proven guilty. A jury or judge hypothesizes that a
person charged with a crime is innocent and subjects this hypothesis to verification by reviewing
the evidence and hearing testimony before reaching a verdict. In a similar sense, a patient goes to
a physician and reports various symptoms. On the basis of the symptoms, the physician will
order certain diagnostic tests, then, according to the symptoms and the test results, determine the
treatment to be followed.

In statistical analysis, we make a claim or an assertion, that is, state a hypothesis, collect data and
then use the data to test the claim (assertion). A statistical hypothesis is thus a statement about a
population parameter that is subjected to verification.

The following are examples of such possible claims

1. The proportion (percentage of) people with above elementary education in Ethiopia is not
less than 30%

2. The average monthly income of employees in an organization X is less than or equal to Birr 8,000.

3. The cause of the spread of HIV in Ethiopia is poverty.

In most cases the population is so large that it is not feasible to study all the items, objects, or
persons in the population. For example, it would not be feasible to contact every teacher or nurse
in Ethiopia to find his or her monthly income.

An alternative to measuring or interviewing the entire population is to take a sample from the
population. We can, therefore, test a statement to determine whether the sample does or does not
support the hypothesis, which is a statement concerning the population.

Hypothesis Testing: A procedure based on sample evidence and probability theory to determine
whether the hypothesis is a reasonable statement.

Five - Step Procedure for Testing a Hypothesis


Step 1: State the Null Hypothesis ( H 0 ) and the Alternate Hypothesis ( H 1 )
The first step is to state the hypothesis being tested. It is called the null hypothesis,
designated H 0 , and read "H sub zero." The capital letter H stands for hypothesis, and the
subscript zero implies "no difference." There is usually a "not" or a "no" term in the null
hypothesis, meaning that there is "no change."

A peculiarity of the theory of testing is that we pick out one hypothesis as our baseline – this is
the null hypothesis, which we write as H0. We then set up a sometimes less precise, but more
interesting, hypothesis as its competitor. We call this the alternative hypothesis and write it as
H1.

Alternate hypothesis: A statement that is accepted if the sample data provide sufficient
evidence that the null hypothesis is false.

The alternate hypothesis describes what you will conclude if you reject the null hypothesis. The
alternate hypothesis is accepted if the sample data provide us with enough statistical evidence
that the null hypothesis is false.

The above claims can be written in the formal way of stating hypothesis in the following manner.

1. H 0 : The proportion (percentage of) people with above elementary education in Ethiopia
is not less than 30%
H1 : The proportion (percentage of) people with above elementary education in Ethiopia is
less than 30%
Or
H 0 : p  0.3

H1 : p  0.3

2. H0: The average monthly income of employees in an organization X is less than or equal
to Birr 8,000.
H1: The average monthly income of employees in an organization X is more than Birr
8,000.

Or

H0: μ ≤ 8,000

H1: μ > 8,000

3. H 0 : The cause of the spread of HIV in Ethiopia is poverty.

H1 : The cause of the spread of HIV in Ethiopia is not poverty.

Generally speaking, the null hypothesis is developed for the purpose of testing. We either reject
or fail to reject the null hypothesis. The null hypothesis is a statement that is not rejected unless
our sample data provide convincing evidence that it is false.

We should emphasize that if the null hypothesis is not rejected on the basis of the sample data,
we cannot say that the null hypothesis is true. To put it another way, failing to reject the null
hypothesis does not prove that H 0 is true, it means we have failed to disprove H 0 .

Step 2: Select a Level of Significance
After establishing the null hypothesis and alternate hypothesis, the next step is to select the level
of significance.

LEVEL OF SIGNIFICANCE: The probability of rejecting the null hypothesis when it is true.

The level of significance is designated α, the Greek letter alpha. It is also sometimes called the
level of risk. This may be a more appropriate term because it is the risk you take of rejecting the
null hypothesis when it is really true.

There is no one level of significance that is applied to all tests. A decision is made to use the .05
level (often stated as the 5 percent level), the .01 level, the .10 level, or any other level between 0
and 1. Traditionally, the .05 level is selected for consumer research projects, .01 for quality
assurance, and .10 for political polling. You, the researcher, must decide on the level of
significance before formulating a decision rule and collecting sample data.

Step 3: Select the Test Statistic


There are many test statistics. In this chapter we use both z and t as the test statistic. In other
chapters we will use such test statistics as F and χ², called chi-square.

TEST STATISTIC: a value, determined from sample information, used to determine whether to
reject the null hypothesis.

For example, in hypothesis testing for the mean (μ) when σ is known or the sample size is large,
the test statistic z is computed by:

z = (x̄ − μ)/(σ/√n)

Step 4: Formulate the Decision Rule


A decision rule is a statement of the specific conditions under which the null hypothesis is
rejected and the conditions under which it is not rejected. The region or area of rejection defines
the location of all those values that are so large or so small that the probability of their
occurrence under a true null hypothesis is rather remote.

Note in the chart that:
1. The area where the null hypothesis is not rejected is to the left of 1.65. The area of
rejection is to the right of 1.65.
2. A one-tailed test is being applied. (This will also be explained later.)
3. The .05 level of significance was chosen.
4. The sampling distribution of the statistic z is normally distributed.
5. The value 1.65 separates the regions where the null hypothesis is rejected and where it is
not rejected.
6. The value 1.65 is the critical value.

CRITICAL VALUE: The dividing point between the region where the null hypothesis is
rejected and the region where it is not rejected.

Step 5: Make a Decision


The fifth and final step in hypothesis testing is computing the test statistic, comparing it to the
critical value, and making a decision to reject or not to reject the null hypothesis based on the
region in which the statistic falls.

Type I and Type II errors


As noted, only one of two decisions is possible in hypothesis testing: either accept or reject the
null hypothesis.

TYPE I ERROR: Rejecting the null hypothesis, H 0 , when it is true.

TYPE II ERROR: Accepting the null hypothesis when it is false.

We often refer to the probabilities of these two possible errors as alpha, α, and beta, β. Alpha (α)
is the probability of making a Type I error, and beta (β) is the probability of making a Type II
error.

The following table summarizes the decisions the researcher could make and the possible
consequences.

                        H0 is true          H0 is false
Do not reject H0        Correct decision    Type II error
Reject H0               Type I error        Correct decision

SUMMARY OF THE STEPS IN HYPOTHESIS TESTING


1. Establish the null hypothesis ( H 0 ) and the alternative hypothesis ( H 1 )

2. Select the level of significance, that is, α.


3. Select an appropriate test statistic.
4. Formulate a decision rule based on steps 1, 2, and 3 above.
5. Make a decision regarding the null hypothesis based on the sample information. Interpret
the results of the test.

It should be reemphasized that there is always a possibility that the null hypothesis is rejected
when it should not be rejected (a Type I error). Also, there is a definable chance that the null
hypothesis is accepted when it should be rejected (a Type II error). Before actually conducting a
test of hypothesis, we will differentiate between a one tailed test of significance and a two-tailed
test.

Sampling Distribution for the Statistic z, Left-Tailed Test, .05 Level of Significance

One way to determine the location of the rejection region is to look at the direction in which the
inequality sign in the alternate hypothesis is pointing (either < or >).

A test is one-tailed when the alternate hypothesis, H1, states a direction, such as:
H 0 : The mean income of women financial planners is less than or equal to $65,000 per year.

H 1 : The mean income of women financial planners is greater than $65,000 per year.

If no direction is specified in the alternate hypothesis, we use a two-tailed test. Changing the
previous problem to illustrate, we can say:
H 0 : The mean income of women financial planners is $65,000 per year.

H 1 : The mean income of women financial planners is not equal to $65,000 per year.

If the null hypothesis is rejected and H 1 accepted in the two-tailed case, the mean income could
be significantly greater than $65,000 per year, or it could be significantly less than $65,000 per
year. To accommodate these two possibilities, the 5 percent area of rejection is divided equally
into the two tails of the sampling distribution (2.5 percent each). Chart 3.3 shows the two areas
and the critical values. Note that the total area in the normal distribution is 1.0000, found by
.9500 + .0250 + .0250.

Chart 3.3: Regions of Non-rejection and Rejection for a Two-Tailed Test, .05 Level of
Significance

Rejection Regions for Two-Tailed and One-Tailed Tests, α = .01

For the one-tailed test, the critical value is 2.33, found by: (1) subtracting .01 from .5000 and (2)
finding the z value corresponding to .4900.

Some examples of setting and testing hypotheses


Example 1
Workers of a given industry complain that their average monthly salary is at most Birr 600.
The workers' union wanted to test the claim of the workers and took a random sample of 100
workers of the industry. The sample produced a mean monthly salary of Birr 610 with a standard
deviation of Birr 50. What should the workers' union conclude about the claim of the workers at
the 5% level of significance?
Solution
The hypotheses are
H0: μ ≤ 600
H1: μ > 600

Here, the sample is large, so by the central limit theorem the sampling distribution of the mean is
approximately normal, and s can be used in place of σ. The level of significance is 5%, which
means α = 0.05. H0 is rejected if the calculated statistic exceeds the critical value Zα.

The test statistic is

Z = (x̄ − μ)/(s/√n) = (610 − 600)/(50/√100) = 10/5 = 2

For α=0.05, Z0.05 =1.645


Since Zcal > Ztab, H0 is rejected. That is, the workers' union should reject the claim of the workers.
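The whole test in Example 1 can be sketched in a few lines (the variable names are ours):

```python
import math

# n = 100 workers, sample mean 610, s = 50; H0: mu <= 600 vs H1: mu > 600
n, xbar, mu0, s = 100, 610, 600, 50
z_crit = 1.645  # right-tailed critical value at the 5% level

z_cal = (xbar - mu0) / (s / math.sqrt(n))  # (610 - 600) / 5 = 2.0
print(z_cal, "reject H0" if z_cal > z_crit else "do not reject H0")
```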

Example2
The mean lifetime of 100 components in a sample is 1,570 hours and their standard deviation is
120 hours. Is it likely that the sample comes from a population whose mean is 1,600 hours? We
decide to test at a 1% significance level:

This is a two-tail test and hence the level or size of test is split equally between the two tails of
the distribution.
Since we are using a 1% level of test, then α/2 = 0.005, so z = 2.58 from Normal tables. Note
that:

1570  1600 30
Z   2.5
120 12
100
 Zcal < Ztab or the statistic is not the rejection region. Therefore, we accept H0
In other words, the critical region tells us to reject H0 if:

Note that, even though we are estimating σ from the sample, s can be used here as n > 30. Thus
the critical region is values of the sample mean outside 1,600 ± 2.58(120/√100), that is, outside
(1,569.04, 1,630.96).

Since x̄ = 1,570, which is between these two values, there is not enough evidence to reject H0.

Z-test for a single proportion

Z = (p̂ − P)/√(P(1 − P)/n)

where p̂ is the sample proportion and P is the hypothesized population proportion.

Example 3
A survey in a certain large multinational company indicates that 240 workers out of a randomly
selected sample of 500 workers are not professionals, so the sample proportion of professionals is
p̂ = 260/500 = 0.52. The human resource manager of the company believes that at most half of
the workers are professionals. Test the claim of the human resource manager at α = 0.05.

H0 : P ≤ 0.5

H1 : P > 0.5


P P 0.52  0.5
Z   0.89
P(1  P ) 0.5  0.5
.
n 500

 Zcal = 0.89
For α=0.05, Ztab = 1.645
Since Zcal < Ztab, the statistic is not in the rejection region. Hence, we do not reject H0.
Therefore, the sample is consistent with the claim of the human resource manager.

Comparing the means and proportions of two populations


Consider the Normal random variables X1 and X2. In this case we have samples from both
populations. The statistic of interest is the standardized difference between sample means:

Z = (x̄1 − x̄2)/√(σ1²/n1 + σ2²/n2)

or, when using the variances estimated from the samples:

Z = (x̄1 − x̄2)/√(s1²/n1 + s2²/n2)

The null hypothesis is that there is no difference; the statistic is distributed as N(0, 1) for large
samples, or as Student's t with n1 + n2 − 2 degrees of freedom for small samples when the
population standard deviations are not known.

Example 4
A sample of 200 students who graduated from college in 2010 had an average age of 25 years
and another sample of 250 students who graduated from college in 2011 had an average age of
23.5 years. Test at significance level of 5% that the average age of graduation decreased in
2011, assuming normal population and standard deviation of ages of graduation of both years are
equal to 5 years.

The hypotheses are
H0: μ1 = μ2
H1: μ1 > μ2

Here, σ is known and the samples are large, so the test statistic follows the Z-distribution.
The level of significance is 5%, which means α = 0.05.
The test statistic is

Z = (x̄1 − x̄2)/√(σ1²/n1 + σ2²/n2) = (25 − 23.5)/√(25/200 + 25/250) = 1.5/0.4743 ≈ 3.16

For α=0.05, Z0.05 =1.645


Since Zcal > Ztab , H0 is rejected. Hence, at 5% significance level, it is possible to conclude that the
average age of graduation decreased in 2011.
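A sketch of the two-sample computation in Example 4 (the variable names are ours):

```python
import math

# 2010 graduates: n1 = 200, mean 25; 2011 graduates: n2 = 250, mean 23.5
n1, x1 = 200, 25.0
n2, x2 = 250, 23.5
sigma = 5.0  # common population standard deviation

se = math.sqrt(sigma**2 / n1 + sigma**2 / n2)
z_cal = (x1 - x2) / se  # about 3.16
print(round(z_cal, 2), "reject H0" if z_cal > 1.645 else "do not reject H0")
```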

Example 5
A postal researcher wanted to test the theory that a higher response rate is achieved when a
postal questionnaire is sent out with a personalized covering letter (A) than when the covering
letter is impersonal (B). Two random samples of 100 people were selected. When the
questionnaires were dispatched, the first sample received letter A and the second received letter
B. The response rates to letter A were 70% and to letter B were 55%.
a) Do these results provide evidence in support of the theory?
b) Explain the reasoning behind the test.

Solution
Call the response rate for each sample pA and pB respectively and the number in each sample nA
and nB. Note that this is a one-tail test. We want to know if pA is greater than pB (i.e. whether the
population proportion responding to letter A is greater than that responding to letter B).

Z = (0.70 − 0.55)/√(0.70 × 0.30/100 + 0.55 × 0.45/100) = 0.15/0.0676 ≈ 2.22

Our value lies between 1% and 5% (a p value of about 0.013, or 1.3%). So
we say the proportions are different at the 5% but not at the 1% level. The evidence that a
personal covering letter improves the response rate is clear at the 5% level.

Chapter 4: Analysis of Variance and the Chi-Square Test
Introduction
In chapter 3 we learned how to compare the means of two populations. In this chapter we will
learn how to perform an Analysis of Variance which will allow us to simultaneously compare the
means of three or more groups.

Analysis of variance (usually just referred to as ANOVA) is a statistical technique used to


compare more than two population means.

What the ANOVA does is compare the between-group variance (which is how different your
groups are from each other) to the within-group variance (which is how different members of the
same group are from each other). If the between-group variance is much larger than the within-
group variance then you conclude that the means are significantly different.

An ANOVA will test the hypotheses

H0: μ1 = μ2 = ... = μI
H1: not all of the means are equal,

where I is the number of different groups we want to compare.


To measure the within- and between-group variability we calculate sums of squares. The
between-groups sum of squares is calculated as

SSB = Σj (Tj²/nj) − T²/N,    where Tj = total of the jth sample,

and the total sum of squares as

SST = Σj Σi Xji² − T²/N,    where

N = n1 + n2 + ... + nk = total number of observations in all samples and

T = the total sum of all observations.

But SST = SSB + SSW,

so SSW = SST − SSB.


The sum of SSB and SSW gives us the total sum of squares SST. This is a measure of the total
variability found in x.

Each one of our sums of squares has degrees of freedom associated with it. This is related to the
number of different sources that contribute to the sum. We typically divide the sums of squares
by the degrees of freedom to get a measure of the average variation. Working with these mean
squares lets us compare many groups simultaneously without inflating the probability of finding
significant results.

MSB = SSB/d.f.b    and    MSW = SSW/d.f.w

d.f.b = between-groups degrees of freedom = k − 1, where k is the number of samples (groups)
d.f.w = within-groups degrees of freedom = N − k

Notice that the total df is equal to the sum of the between-group and within-group df. We can
perform an F-test to determine if we have a significant amount of between group variance. This
is the same thing as testing whether any of our means are significantly different. To perform an F
test you

1. Collect the information needed to build an ANOVA table.


2. Calculate the F statistic
MSB
F
MSW
3. Look up the corresponding p-value in an F distribution table with k-1 numerator degrees of
freedom and N-k denominator degrees of freedom.

Example: The following table shows samples of 4 observations from each of 4 different
populations. Check if there is any significant difference between the means of the four
populations at α = 0.05 and α = 0.01.
X1 X2 X3 X4
1 1 1 3
2 3 2 2
1 2 2 1
2 2 2 1

Solution
Hypotheses:
H0: μ1 = μ2 = μ3 = μ4
H1: H0 is not true (at least one of the means is different)

X1 X2 X3 X4
1 1 1 3
2 3 2 2
1 2 2 1
2 2 2 1
X 1 6 X 2 8 X 3 7 X 4 7

   
X1  3 / 2 ; X 2  2 X 3  7 / 4 X 4  7 / 4
nj
k
2 T2 (6  8  7  7) 2
SST   X ji   (10  18  13  15)  7
j  1 i 1 N 4444
2
kTj T2 62 82 72 7 2 ( 6  8  7  7) 2
SSB    (    )  0.50
j 1 n j N 4 4 4 4 4444

SSW  SST  SSB  7  0.50  6.50


d . fb  k  1  4  1  3

d . f w  N  k  16  4  12

SSB 0.50
MSB    0.17
d . fb 3
SSW 6.50
MSW    0.54
d. fw 12
MSB 0.17
F   0.31
MSW 0.54

For α=0.05, F0.05(3,12) =3.49


For α=0.01, F0.01(3,12) =5.95

Since Fcal <Ftab we accept the null hypothesis at both α levels. Therefore we can conclude that
there is no statistically significant difference between the means of the four populations.
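The whole ANOVA computation can be sketched as a short function (the name `one_way_anova` is ours), reproducing the sums of squares and F value above:

```python
def one_way_anova(groups):
    # Sums of squares for a one-way ANOVA, computed exactly as in the text
    k = len(groups)
    N = sum(len(g) for g in groups)
    T = sum(sum(g) for g in groups)
    correction = T**2 / N
    sst = sum(x**2 for g in groups for x in g) - correction
    ssb = sum(sum(g)**2 / len(g) for g in groups) - correction
    ssw = sst - ssb
    msb, msw = ssb / (k - 1), ssw / (N - k)
    return ssb, ssw, msb / msw

groups = [[1, 2, 1, 2], [1, 3, 2, 2], [1, 2, 2, 2], [3, 2, 1, 1]]
ssb, ssw, f = one_way_anova(groups)
print(ssb, ssw, round(f, 2))  # SSB = 0.5, SSW = 6.5, F = 0.31
```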

Exercise
1. The following data are the semester tuition charges [$000] for a sample of private colleges in
various regions of a certain country. At the .05 significance level, can we conclude there is a
difference in the mean tuition rates for the various regions?

a. State the null and the alternate hypotheses.


b. What is the decision rule?
c. Develop an ANOVA table. What is the value of the test statistic?
d. What is your decision regarding the null hypothesis? Could there be a significant difference between the mean tuition in the Northeast and that of the West?
2. A chief of police wants to determine whether there is a difference in the mean number of crimes committed among the four districts under his command. He recorded the number of crimes reported in each district for a sample of six days. At the .05 significance level, can the chief of police conclude there is a difference in the mean number of crimes?

Chi-squared test
In Chapter 3 we learned how to perform a two-sample z test for proportions, which allows us to determine whether the proportions of an outcome are the same in two different populations. In this section we extend this idea and learn how to test whether a proportion is the same across three or more groups.

A good way to look at your data when both your predictor and response variables are categorical is with a two-way table. A two-way table is organized so that each column holds a different predictor group and each row holds a different outcome. The inside of the table contains counts of how many people in each group had each outcome.

We can use the information in a two-way table to calculate a chi-square statistic that will test
whether the proportion of outcomes is the same for each of our groups. In this case our
hypotheses are

H0: p1 = p2 = p3 = … = pk (or the populations are homogeneous)

H1: At least one pair is not equal (H0 is not true)

The steps to perform a chi-square test are:


1. Arrange your data in a two-way table.
2. Find the counts that would be expected in each cell of the table if H0 were true. These may be computed using the formula: Expected count = (row total × column total) / grand total.

3. Calculate the chi-square statistic: χ² = Σ (Observed − Expected)² / Expected, summed over all cells.

4. Look up the p-value associated with χ² in a χ² distribution table using (r − 1)(c − 1) degrees of freedom, where r is the number of rows and c is the number of columns in your two-way table.

You should not use the χ² test if more than 20% of your expected cell counts are less than 5. As long as the expected counts are high, it is acceptable for the observed cell counts to be less than 5.
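The steps above can be sketched in pure Python; the two-way table below is hypothetical, chosen only to illustrate the arithmetic:

```python
# Chi-square test of homogeneity on a small hypothetical two-way table.
# Columns are groups A, B, C; rows are the two outcomes.
observed = [
    [30, 40, 50],   # outcome 1
    [70, 60, 50],   # outcome 2
]

rows, cols = len(observed), len(observed[0])
row_totals = [sum(r) for r in observed]
col_totals = [sum(observed[i][j] for i in range(rows)) for j in range(cols)]
grand = sum(row_totals)

# Step 2: expected count = (row total * column total) / grand total
expected = [[row_totals[i] * col_totals[j] / grand for j in range(cols)]
            for i in range(rows)]

# Step 3: chi-square statistic = sum of (O - E)^2 / E over all cells
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(rows) for j in range(cols))

# Step 4: degrees of freedom for the table lookup
df = (rows - 1) * (cols - 1)
print(round(chi2, 3), df)  # → 8.333 2
```

Since 8.333 exceeds the 5% critical value χ²(2) = 5.99, H0 would be rejected for this hypothetical table.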

Examples of a contingency table and a chi-squared test
Example
In a survey made in order to decide where to locate a factory, samples from five towns were examined to see the numbers of skilled and unskilled workers. The data were as follows:

[Two-way table of skilled and unskilled workers by town, A–E; row totals: 390 skilled and 1032 unskilled; grand total 1422; the column total for the fifth town is 312.]

1. Write down H0 and H1 first:


H0: PA = PB = PC = PD = PE (The proportion of skilled workers does not vary with area)
H1: at least one pair is not equal (The proportion of skilled workers is related to area)
2. Then write down the degrees of freedom: (r – 1) (c – 1) = 4× 1 = 4.
3. Write down the 10%, 5% and 1% critical values of χ2 :7.78, 9.49, and 13.28 respectively.
Remember that the expected values Eij are easily found from the row and column totals. For instance, for the bottom right-hand corner of the table, the expected value is:
(312 × 1032)/1422 ≈ 226.4.
Note also that all the expected values have been put in brackets in the following chart.

Now look at the calculations, noting that the expected counts are printed below the observed
counts.

The value of the statistic is
χ² = (80 − 72.4)²/72.4 + (184 − 191.6)²/191.6 + ...
= 0.797 + 0.301 + 0.056 + 0.021 + 0.463 + 0.175 +2.782 + 1.051 + 0.077 + 0.029
= 5.753
If we look at our significance levels (from the statistical table), we can see that 5.753 is less than all of the critical values (10%, 5% and 1%), so we do not reject the null hypothesis. It seems clear, on this evidence, that there is little difference between the areas in the proportion of skilled workers (so, if this was one of your criteria for choosing where to put your new factory, your management team would be no further forward with their site decision!).

Exercise
1. The human resources director of a company is concerned about absenteeism among hourly
workers. She decides to sample the records to determine whether absenteeism is distributed
evenly throughout the six-day workweek. The null hypothesis to be tested is: Absenteeism is
distributed evenly throughout the week. The sample results are:

Use the .01 significance level and the five-step hypothesis testing procedure.

a. What is the decision regarding the null hypothesis?


b. Specifically, what does this indicate to the human resources director?

Chapter 5: Correlation and Regression

Introduction
Correlation and regression are two techniques which enable us to see the connection between the actual dimensions of two or more variables. The discussion in this section involves the relationship between only two variables at a time, but you should be aware that these theories and similar formulae can be used to look at the relationships among many variables.

When we use these techniques we are concerned with using models for prediction and decision
making. So, how do we model the relationships between two variables?

• Correlation – measures the strength of a relationship.

• Regression – is a way of representing that relationship.
It is important that you understand what these two techniques have in common, but also the
differences between them.

Correlation is concerned with measurements of the strength of the linear relationship between
two variables. We often begin by drawing a graph plotting this relationship. We call it a scatter
diagram. Consider the scatter diagrams below showing pairs of observations of X and Y. Let us
imagine that X represents the money spent on marketing a new product (in £) and Y represents
the value of sales of that product, by week.

Scatter diagrams showing cost of marketing and value of sales, by week.

The correlation between the variables x and y is defined as

r = [nΣxiyi − (Σxi)(Σyi)] / [√(nΣxi² − (Σxi)²) · √(nΣyi² − (Σyi)²)]

The correlation coefficient, r, is a measure of linear association between two variables. Values of
the correlation coefficient are always between −1 and +1.

A correlation coefficient of +1 indicates that two variables are perfectly related in a positive
linear sense; a correlation coefficient of −1 indicates that two variables are perfectly related in a
negative linear sense and a correlation coefficient of 0 indicates that there is no linear
relationship between the two variables. However, correlations only measure the degree of linear
association between two variables. If two variables have a zero correlation, they might still have
a strong nonlinear relationship.
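This last point can be checked numerically. In the sketch below (pure Python, implementing the raw-sums formula for r given above), y is completely determined by x through y = x², yet the linear correlation is exactly zero:

```python
import math

def pearson_r(x, y):
    """Correlation coefficient r from the raw-sums formula above."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))

x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]     # y = x^2: a perfect nonlinear relationship
print(pearson_r(x, y))     # → 0.0 (no linear association at all)
```

By contrast, a perfectly linear pair such as x = [1, 2, 3], y = [2, 4, 6] gives r = 1.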

Regression analysis involves identifying the relationship between a dependent variable and one
or more independent variables. Thus correlation and regression analysis are related in the sense
that both deal with relationships among variables.

A regression line is an equation that describes the relationship between a response variable y and an explanatory variable x. When talking about it we say that we "regress y on x". The general form of a regression line is

y = a + bx, where

b = [nΣxiyi − (Σxi)(Σyi)] / [nΣxi² − (Σxi)²]  and  a = ȳ − b·x̄

In the regression line equation y = a + bx, y is the value of the response variable, x is the value of the explanatory variable, b is the slope of the line, and a is the y-intercept of the line.

The slope is the amount by which y changes when you increase x by 1 and the intercept is the
value of y when x = 0.
Regression lines are often used to predict one variable from the other.

Example
1. The following table contains the annual cost of maintenance in thousand Birr and the age of a
machine in a certain factory.
Age Annual cost of maintenance in (000)Birr
9 40
4 16
2 8
8 37
4 15
5 17
1 5
3 10
6 25
8 35

Based on data given in the above table, find


a) The correlation coefficient
b) Equation of the regression line
c) Predict the annual cost of maintenance in Birr when the age of machine is 12 years
Solution:
Let the annual cost of maintenance be y and age of a machine be x.
Next, let’s do preliminary computations to facilitate the calculation.
xi    yi    xi²    yi²    xiyi
9 40 81 1600 360
4 16 16 256 64
2 8 4 64 16
8 37 64 1369 296
4 15 16 225 60
5 17 25 289 85
1 5 1 25 5
3 10 9 100 30
6 25 36 625 150
8 35 64 1225 280
Σxi = 50, Σyi = 208, Σxi² = 316, Σyi² = 5778, Σxiyi = 1346
x̄ = 5, ȳ = 20.8

Now, let’s use values from the above table to determine the correlation coefficient r and the
equation of the regression line.
n  xi yi   xi  yi 10(1346)  50(208)
a) r  
n  xi   xi  n  yi   yi 
2 2
2 2
10(316)  (50) 2 10(5778)  ( 208) 2

13460  10400 3060


r   0.989
660 14516 3095

b) The linear regression line will be of the form y = a + bx, where

b = [nΣxiyi − (Σxi)(Σyi)] / [nΣxi² − (Σxi)²] = [10(1346) − 50(208)] / [10(316) − 2500] = 3060/660 ≈ 4.6364

Further,
a = ȳ − b·x̄ = 20.8 − 4.6364 × 5 = 20.8 − 23.182 = −2.382
Thus the equation of the linear regression line of annual cost (y) on the age of the machine (x) is
y = −2.382 + 4.6364x

The annual cost of maintenance when the age of the machine is 12 years will be
y (cost of maintenance) = −2.382 + 4.6364 × 12 = −2.382 + 55.6368 = 53.2548
Therefore, the annual cost of maintenance when the age of the machine is 12 years is about Birr 53,254.80.
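The whole worked example can be verified with a short pure-Python sketch of the formulas used above:

```python
import math

# Machine age (x) and annual maintenance cost in thousands of Birr (y)
x = [9, 4, 2, 8, 4, 5, 1, 3, 6, 8]
y = [40, 16, 8, 37, 15, 17, 5, 10, 25, 35]
n = len(x)

sx, sy = sum(x), sum(y)                   # 50, 208
sxx = sum(v * v for v in x)               # 316
syy = sum(v * v for v in y)               # 5778
sxy = sum(a * b for a, b in zip(x, y))    # 1346

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))
b = (n * sxy - sx * sy) / (n * sxx - sx**2)   # slope
a = sy / n - b * sx / n                       # intercept: a = y_bar - b * x_bar

print(round(r, 3), round(b, 4), round(a, 3))  # → 0.989 4.6364 -2.382

predicted = a + b * 12                        # cost at age 12, thousand Birr
print(round(predicted, 2))                    # → 53.25
```

The small difference from the hand value 53.2548 comes from rounding a and b before the final step in the hand calculation.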

Exercise: The following table shows the past record of the number of employees required for the given number of departments in an organization.

Number of departments    Number of employees


3 20
4 26
6 37
9 53
11 62
12 66

a) Develop a linear regression equation for forecasting the number of employees on the
basis of the number of departments.
b) Forecast the human resource requirement (number of employees) if the organization under consideration plans to expand to 16 departments over the next five-year strategic plan period.

Areas under the Normal Curve, Z-Value

Student’s t Distribution

Critical Values of the F Distribution at a 5% Level of Significance

Critical Values of the F Distribution at a 1% Level of Significance

Critical Values of Chi-Square

