Chapter 5 Up To 9 Revised
UNIT FIVE
ELEMENTARY PROBABILITY
Objectives:
Having studied this unit, you should be able to
✓ understand the elements of probability
✓ calculate some probabilities of events associated with random experiments
✓ apply the concept of probability in some biological phenomena
5.1 Introduction
Without some formalism of probability theory, the student cannot appreciate the true
interpretation from data analysis through modern statistical methods. It is quite natural
to study probability prior to studying statistical inference. Elements of probability allow
us to quantify the strength or “confidence” in our conclusions. In this sense, concepts in
probability form a major component that supplements statistical methods and helps us
to gauge the strength of the statistical inference. The discipline of probability, then,
provides the transition between descriptive statistics and inferential methods. Elements
of probability allow the conclusion to be put into the language that the science or
engineering practitioners require. An example follows that will enable the reader to
understand the notion of a P-value, which often provides the “bottom line” in the
interpretation of results from the use of statistical methods.
STATISTICS FOR INDUSTRIAL CHEMIST
Definition 5.2:
Sample point (outcome): The individual result of a random experiment.
Sample space: The set containing all possible sample points (outcomes) of the
random experiment. The sample space is often called the universe and is denoted by S.
Event: The collection of outcomes or simply a subset of the sample space. We denote
events with capital letters, A, B, C, etc.
Concepts of set theory are important in understanding probability. Given A,B and C are
events associated with a sample space S and ω represents an elementary event
(outcome) in S, then the following are some useful definitions and results in set theory.
Definitions 5.3:
1. Union: The union of A and B, A ∪ B, is the event containing all sample points in either
A or B or both. Sometimes we use "A or B" for union.
2. Intersection: The intersection of A and B, A ∩ B, is the event containing all sample points
that are in both A and B. Sometimes we use AB or "A and B" for intersection.
3. Subset: If ω ∈ A implies ω ∈ B for every ω, then A ⊆ B.
4. Empty set: If a set A contains no points, it will be called the null set, or empty set, and
denoted by ∅.
5. Complement: The complement of a set A, denoted by Aᶜ, is the set of all ω ∈ S such that
ω ∉ A.
6. Mutually Exclusive Events: Two events are said to be mutually exclusive (or disjoint) if
their intersection is empty (i.e. A ∩ B = ∅). Subsets A1, A2, … are defined to be mutually
exclusive if Ai ∩ Aj = ∅ for every i ≠ j.
v) A ∩ S = A; A ∪ S = S; A ∩ ∅ = ∅; and A ∪ A = A
vi) (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ and (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ
In short, to assign probabilities for an event, we might need to enumerate the possible
outcomes of a random experiment and need to know the number of possible outcomes
favoring the event. The following principles will help us in determining the number of
possible outcomes favoring a given event.
Example 5.3: Suppose one wants to purchase a certain commodity and that this
commodity is on sale in 5 government owned shops, 6 public shops and 10 private shops.
How many alternatives are there for the person to purchase this commodity?
Solution: Total number of ways =5+6+10=21 ways
Example 5.4: If we can go from Addis Ababa to Rome in 2 ways and from Rome to
Washington D.C. in 3 ways then the number of ways in which we can go from Addis
Ababa to Rome to Washington D.C. is 2x3 ways or 6 ways. We may illustrate the
situation by using a tree diagram below:
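[Tree diagram: from A (Addis Ababa) two branches lead to R (Rome); from each R, three
branches lead to W (Washington D.C.), giving 2 × 3 = 6 routes in total.]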
Example 5.5: If a test consists of 10 multiple choice questions, with each permitting 4
possible answers, how many ways are there in which a student gives his/her answers?
Solution: There are 10 steps required to complete the test.
First step: To give answer to question number one. He/she has 4 alternatives.
Second step: To give answer to question number two, he/she has 4 alternatives……
Last step: To give answer to last question, he/she has 4 alternatives.
Therefore, he/she has 4 × 4 × 4 × … × 4 = 4¹⁰ = 1,048,576 ways of completing the exam.
Note that there is only one way in which he/she can give correct answers to all
questions and that there are 3¹⁰ = 59,049 ways in which all the answers will be incorrect.
Example 5.6: A manufactured item must pass through three control stations. At each
station the item is inspected for a particular characteristic and marked accordingly. At
the first station, three ratings are possible while at the last two stations four ratings are
possible. Hence there are 3 × 4 × 4 = 48 ways in which the item may be marked.
Example 5.7: Suppose that a car plate has three letters followed by three digits. How
many possible car plates are there if each plate begins with an H or an F?
Solution: There are 2 × 26 × 26 × 10 × 10 × 10 = 1,352,000 different plates.
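The counts in Examples 5.5 to 5.7 can be checked with a few lines of Python. This is only a sketch of the multiplication principle; the helper name count_ways is an illustrative choice, not something defined in these notes.

```python
from itertools import product
from math import prod


def count_ways(choices_per_step):
    """Multiplication principle: multiply the number of options at each step.

    count_ways is a hypothetical helper introduced only for this illustration.
    """
    return prod(choices_per_step)


exam_ways = count_ways([4] * 10)               # Example 5.5: 10 questions, 4 answers each
all_wrong = count_ways([3] * 10)               # 3 wrong answers available per question
ratings = count_ways([3, 4, 4])                # Example 5.6: three control stations
plates = count_ways([2, 26, 26, 10, 10, 10])   # Example 5.7: H or F, then 2 letters, 3 digits

# Brute-force check on the small case: list every possible marking explicitly.
listed_markings = list(product(range(3), range(4), range(4)))

print(exam_ways, all_wrong, ratings, plates, len(listed_markings))
```

Listing the outcomes with itertools.product is only practical for small cases such as Example 5.6; for the exam and the car plates the product of the step counts is the sensible route.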
Permutations
Suppose that we have n different objects. In how many ways, say nPn, may these objects
be arranged (permuted)? For example, if we have objects a, b and c we can consider the
following arrangements: abc, acb, bac, bca, cab, and cba. Thus the answer is 6. The
following theorem gives the general result on the number of such arrangements.
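The six arrangements of a, b and c can be listed mechanically. The sketch below uses Python's itertools.permutations and compares the count with n!, on the assumption that the general result referred to is the standard one (n distinct objects can be arranged in n! ways).

```python
from itertools import permutations
from math import factorial

objects = ["a", "b", "c"]

# Every ordering of the three objects, joined into strings such as "abc".
arrangements = ["".join(p) for p in permutations(objects)]

print(arrangements)
print(len(arrangements), factorial(len(objects)))
```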
There are many problems in which we are interested in determining the number of ways
in which r objects can be selected from n distinct objects without regard to the order in
which they are selected. Such selections are called combinations or r-sets. It may help to
think of combinations as committees. The key here is without regard for order.
To obtain the general result we recall the formula derived above: the number of ways of
choosing r objects out of n and permuting the chosen r equals n!/(n-r)!. Let C be the
number of ways of choosing r out of n, disregarding order. C is the number required.
Note that once the r items have been chosen, there are r! ways of permuting them.
Hence applying the multiplication principle again, together with the above result, we
obtain
C · r! = n!/(n − r)!. Therefore, C = n!/[r!(n − r)!]. This number arises in many contexts in
mathematics and hence a special symbol is used for it. We shall write
$\binom{n}{r} = {}_nC_r = \frac{n!}{r!\,(n-r)!}$.
Example 5.12: How many different committees of 3 can be formed from Hawa, Segenet,
Nigisty and Lensa?
Solution: The question can be restated in terms of subsets of a set of 4 objects: how
many subsets of 3 elements are there? In terms of combinations the question becomes,
what is the number of combinations of 4 distinct objects taken 3 at a time? The list of
committees is {H,S,N}, {H,S,L}, {H,N,L}, {S,N,L}. Therefore, we have 4C3 = 4 possible
committees.
Example 5.13:
(i) A committee of 3 is to be formed from a group of 20 people. How many different
committees are possible?
(ii) From a group of 5 men and 7 women, how many different committees consisting of 2
men and 3 women can be formed?
Solution: (i) There are $\binom{20}{3} = \frac{20!}{3!\,17!} = 1140$ possible committees.
(ii) There are $\binom{5}{2}\binom{7}{3} = \frac{5!}{2!\,3!} \cdot \frac{7!}{3!\,4!} = 10 \times 35 = 350$ possible committees.
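The combination counts in Examples 5.12 and 5.13 can be reproduced as follows. This is a minimal sketch: n_choose_r is a hypothetical helper that spells out the formula n!/(r!(n − r)!), while math.comb is the equivalent built-in function.

```python
from itertools import combinations
from math import comb, factorial


def n_choose_r(n, r):
    """Number of ways to choose r objects out of n, ignoring order."""
    return factorial(n) // (factorial(r) * factorial(n - r))


# Example 5.12: list the committees of 3 explicitly.
people = ["Hawa", "Segenet", "Nigisty", "Lensa"]
committees = list(combinations(people, 3))

# Example 5.13: (i) 3 out of 20; (ii) 2 of 5 men together with 3 of 7 women.
committees_of_three = n_choose_r(20, 3)
mixed_committees = n_choose_r(5, 2) * n_choose_r(7, 3)

print(len(committees), committees_of_three, mixed_committees)
```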
Remarks:
i) $\binom{n}{r} = \binom{n}{n-r}$
It is rather surprising that with only these three axioms, we can construct the "entire"
theory of probability! The next theorems and definitions help in assigning probabilities
of events.
Theorem 5.6: If A is an event in a discrete sample space S, then P(A) equals the sum of the
probabilities of the individual outcomes comprising A.
Theorem 5.7: Suppose that we have a random experiment with sample space S and
probability function P, and A and B are events. Then we have the following results:
i) P(∅) = 0
ii) P(Aᶜ) = 1 − P(A)
iii) P(B ∩ Aᶜ) = P(B) − P(A ∩ B)
iv) If A ⊆ B, then P(A) ≤ P(B).
Then, calculate the probabilities of A, B, Aᶜ, Bᶜ, and Sᶜ. The event that none of the
outcomes will occur is the same as Sᶜ.
Solution:
P(A) = 2/4 = 0.5
P(B) = 2/4 = 0.5
P(Aᶜ) = 1 − P(A) = 1 − 0.5 = 0.5
P(Bᶜ) = 1 − P(B) = 1 − 0.5 = 0.5
P(Sᶜ) = 1 − P(S) = 1 − 1 = 0 = P(∅)
Example 5.16: From a group of 5 men and 7 women, it is required to form a committee
of 5 persons. If the selection is made randomly, then
i) what is the probability that 2 men and 3 women will be in the committee?
ii) what is the probability that all members of the committee will be men?
iii) what is the probability that at least three members will be women?
Solution: The total number of possible committees is $\binom{12}{5} = \frac{12!}{5!\,7!} = 792$, i.e. the number of
possible outcomes in the sample space is 792.
i) Let A be the event that the committee will consist of 2 men and 3 women. We
need to know the number of possible outcomes favoring this event. The
number of ways we can select 2 men from 5 men is $\binom{5}{2} = \frac{5!}{2!\,3!} = 10$ and the
number of ways of selecting 3 women out of 7 women is $\binom{7}{3} = \frac{7!}{3!\,4!} = 35$.
Using the multiplication principle, the number of outcomes favoring event A is
10 × 35 = 350.
Hence, using the classical definition of probability,
$P(A) = \frac{\binom{5}{2}\binom{7}{3}}{\binom{12}{5}} = \frac{350}{792} \approx 0.44$
ii) Let B be the event that all members of the committee will be men. Hence
$P(B) = \frac{\binom{5}{5}\binom{7}{0}}{\binom{12}{5}} = \frac{1}{792}$
iii) Let C be the event that at least three of the committee members will be women.
Basically, three different compositions of committee members can be formed
in terms of sex: 3 women and 2 men, 4 women and 1 man, and all women.
Hence the number of possible outcomes favoring event C, using the principle
of combination together with the addition principle, is
$\binom{5}{2}\binom{7}{3} + \binom{5}{1}\binom{7}{4} + \binom{5}{0}\binom{7}{5} = 350 + 175 + 21 = 546$.
Therefore,
$P(C) = \frac{\binom{5}{2}\binom{7}{3} + \binom{5}{1}\binom{7}{4} + \binom{5}{0}\binom{7}{5}}{\binom{12}{5}} = \frac{546}{792} \approx 0.69$
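The three probabilities in Example 5.16 can be recomputed with exact fractions. The sketch below is one possible way to organize the calculation; the function name p_women and the constants are illustrative choices.

```python
from fractions import Fraction
from math import comb

MEN, WOMEN, SIZE = 5, 7, 5
total = comb(MEN + WOMEN, SIZE)  # number of equally likely committees


def p_women(k):
    """Probability that a randomly chosen committee has exactly k women."""
    return Fraction(comb(WOMEN, k) * comb(MEN, SIZE - k), total)


p_a = p_women(3)                                     # i) 2 men and 3 women
p_b = p_women(0)                                     # ii) all five members are men
p_c = sum(p_women(k) for k in range(3, SIZE + 1))    # iii) at least 3 women

print(total, p_a, p_b, p_c)
print(round(float(p_a), 2), round(float(p_c), 2))
```

Fraction reduces the results automatically, so 350/792 and 546/792 are printed in lowest terms.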
The above definition of probability is based on empirical data accumulated through time
or based on observations made from repeated experiments for a large number of times.
5.5 Some probability rules
Theorem 5.8: If A and B are any two events, then P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
b) what is the probability that this family owns a car or a house?
c) what is the probability that this family owns a car or a house but not both?
d) what is the probability that this family owns only a house?
e) what is the probability that this family neither owns a car nor a house?
Solution: Let A represent the event that the family owns a car and B the event that the
family owns a house. Given information: P(A) = 0.6, P(B) = 0.3, and P(A ∩ B) = 0.2.
a) Required: P(Aᶜ) = ?
P(Aᶜ) = 1 − P(A) = 1 − 0.6 = 0.4
b) Required: P(A ∪ B) = ?
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 0.6 + 0.3 − 0.2 = 0.7
c) Required: P((A ∩ Bᶜ) ∪ (Aᶜ ∩ B)) = ?
P((A ∩ Bᶜ) ∪ (Aᶜ ∩ B)) = P(A ∩ Bᶜ) + P(Aᶜ ∩ B) = [P(A) − P(A ∩ B)] + [P(B) − P(A ∩ B)]
= [0.6 − 0.2] + [0.3 − 0.2] = 0.5
d) Required: P(Aᶜ ∩ B) = ?
P(Aᶜ ∩ B) = P(B) − P(A ∩ B) = 0.3 − 0.2 = 0.1
e) Required: P(Aᶜ ∩ Bᶜ) = ?
P(Aᶜ ∩ Bᶜ) = P((A ∪ B)ᶜ) = 1 − P(A ∪ B) = 1 − 0.7 = 0.3
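The same five answers follow directly from the three given probabilities. Here is a short sketch using exact fractions so that no rounding error appears; the variable names are illustrative.

```python
from fractions import Fraction

# Given information from the solution above.
p_a = Fraction(6, 10)    # family owns a car
p_b = Fraction(3, 10)    # family owns a house
p_ab = Fraction(2, 10)   # family owns both

p_not_a = 1 - p_a                               # a) complement rule
p_a_or_b = p_a + p_b - p_ab                     # b) addition rule (Theorem 5.8)
p_exactly_one = (p_a - p_ab) + (p_b - p_ab)     # c) car or house but not both
p_only_b = p_b - p_ab                           # d) only a house
p_neither = 1 - p_a_or_b                        # e) De Morgan plus complement rule

for name, value in [("a", p_not_a), ("b", p_a_or_b), ("c", p_exactly_one),
                    ("d", p_only_b), ("e", p_neither)]:
    print(name, float(value))
```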
We can represent various events by an informative diagram called a Venn diagram. If
properly and correctly drawn, a Venn diagram helps to calculate probabilities of events
easily. The figure below shows various events represented by shaded regions. Note that
the rectangle in each figure represents the sample space.
Independence
We have introduced the conditional probability P (A|B) to capture the partial
information that event B provides about event A. An interesting and important special
case arises when the occurrence of B provides no information and does not alter the
probability that A has occurred, i.e., P(A|B) = P(A). When the above equality holds, we
say that A is independent of B. Note that by the definition P(A|B) = P(A ∩ B)/P(B), with
P(B) > 0, this is equivalent to P(A ∩ B) = P(A)P(B).
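The product condition P(A ∩ B) = P(A)P(B) can be checked by enumeration on a small sample space. The sketch below assumes two tosses of a fair coin with equally likely outcomes; the events and helper names are illustrative and are not taken from these notes.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a fair coin: HH, HT, TH, TT (equally likely).
sample_space = list(product("HT", repeat=2))


def prob(event):
    """Classical probability: favorable outcomes over total outcomes."""
    favorable = [w for w in sample_space if event(w)]
    return Fraction(len(favorable), len(sample_space))


def independent(e1, e2):
    """True when P(e1 and e2) equals P(e1) * P(e2)."""
    return prob(lambda w: e1(w) and e2(w)) == prob(e1) * prob(e2)


def first_head(w):
    return w[0] == "H"


def second_head(w):
    return w[1] == "H"


def at_least_one_head(w):
    return "H" in w


print(independent(first_head, second_head))        # the two tosses do not affect each other
print(independent(first_head, at_least_one_head))  # a head on toss 1 forces "at least one head"
```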
UNIT SIX
PROBABILITY DISTRIBUTIONS
Objectives:
Having studied this unit, you should be able to
✓ compute probabilities of events using the concept of probability distributions.
✓ compute expected values and variances of random variables.
✓ apply the concepts of probability distributions to real-life problems.
Introduction
In many applications, the outcomes of probabilistic experiments are numbers or have
some numbers associated with them, which we can use to obtain important information,
beyond what we have seen so far. We can, for instance, describe in various ways how
large or small these numbers are likely to be and compute likely averages and measures
of spread. For example, in 3 tosses of a coin, the number of heads obtained can range
from 0 to 3, and there is one of these numbers associated with each possible outcome.
Informally, the quantity “number of heads” is called a random variable, and the numbers
0 to 3 its possible values. The value of a random variable is determined by the outcome
of the experiment. Thus, we may assign probabilities to the possible values of the
random variable.
random variable. Examples include body weight of new born baby, life time of a human
being, height of a person, etc.
The most important way to characterize a random variable is through the probabilities
of the values that it can take. For a discrete random variable X, these are captured by the
probability mass function (p.m.f. for short) of X, denoted PX(x). For a continuous
random variable X it is done by the probability density function (p.d.f.), denoted fX(x).
Example 6.1: Consider an experiment of tossing two fair coins. Letting X denote the
number of heads appearing on the top face, then X is a random variable taking on one of
the values 0, 1, 2 . The random variable X assigns a 0 value for the outcome (T,T), 1 for
outcomes (T ,H) and (H, T ), and 2 for the outcome (H,H). Thus, we can calculate the
probability that X can take specific value/s as follows:
P(X = 0) = P({(T , T )}) = ¼
P(X = 1) = P({(T ,H),(H, T )}) = 2/4,
P(X = 2) = P({(H,H)}) = ¼
The table below shows the probability mass function of X.
x        0      1      2
PX(x)    ¼      2/4    ¼
We can verify that PX(x) is a probability mass function:
PX(x) ≥ 0 for x = 0, 1, 2 and
P(X = 0) + P(X = 1) + P(X = 2) = ¼ + 2/4 + ¼ = 1
Suppose we are interested to calculate the probability that X≥1. The values of X which
are greater than or equal to 1 are 1 and 2. Thus, the probability that X is greater than or
equal to 1, denoted P(X≥1), is found as P(X≥1) = P(X = 1) + P(X = 2)=3/4.
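The probability mass function in Example 6.1 can be built by listing the four outcomes and counting heads. This is a small sketch under the example's assumption of two fair coins; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# The four equally likely outcomes of tossing two fair coins.
outcomes = list(product("HT", repeat=2))

# X = number of heads in an outcome; count how many outcomes give each value.
counts = Counter(outcome.count("H") for outcome in outcomes)
pmf = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

total_probability = sum(pmf.values())
p_at_least_one = sum(p for x, p in pmf.items() if x >= 1)

print(pmf)
print(total_probability, p_at_least_one)
```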
We can use the probability density function to calculate probabilities of events expressed
in terms of the random variable X. For instance, if we are interested in the probability
that X lies between two points, say a and b, we can find it by integrating fX(x) over
the interval [a, b], i.e.
$P(a \le X \le b) = \int_a^b f_X(x)\,dx$
iii) The probability that a continuous random variable X will assume a value in a
closed interval is the same as the probability that it will assume a value in the
corresponding open or half-open interval, i.e., P(a≤X≤b) = P(a<X<b) = P(a≤X<b) = P(a<X≤b),
P(X≤c) = P(X<c), and P(X≥c) = P(X>c), where a, b, and c are constants.
Variance
Definition 6.4: The variance of a random variable X, denoted V(X) or σ², is defined
as V(X) = E[(X − μ)²] = E(X²) − μ².
i) if X is discrete, $V(X) = \sum_x x^2 P_X(x) - \mu^2$
ii) if X is continuous, $V(X) = \int_{-\infty}^{\infty} x^2 f_X(x)\,dx - \mu^2$
The variance provides a measure of dispersion of X around its mean. Another measure of
dispersion is the standard deviation of X, which is defined as the square root of the
variance and is denoted by σ.
Example 6.2: Calculate the mean and variance of the random variable X in Example 6.1.
$E(X) = \sum_x x P_X(x) = 0 \cdot \frac{1}{4} + 1 \cdot \frac{1}{2} + 2 \cdot \frac{1}{4} = 1$
$E(X^2) = \sum_x x^2 P_X(x) = 0^2 \cdot \frac{1}{4} + 1^2 \cdot \frac{1}{2} + 2^2 \cdot \frac{1}{4} = 1.5$
$V(X) = E(X^2) - \mu^2 = 1.5 - 1^2 = 0.5$
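The mean and variance of a discrete random variable can be computed directly from its probability mass function. The sketch below applies E(X) = Σ x PX(x) and V(X) = E(X²) − μ² to the two-coin example; the function names are illustrative.

```python
from fractions import Fraction

# Probability mass function of X = number of heads in two fair coin tosses.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}


def mean(pmf):
    """E(X): sum of x * P(X = x) over all possible values x."""
    return sum(x * p for x, p in pmf.items())


def variance(pmf):
    """V(X) = E(X^2) - mu^2."""
    mu = mean(pmf)
    second_moment = sum(x**2 * p for x, p in pmf.items())
    return second_moment - mu**2


print(mean(pmf), variance(pmf))
```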
Example 6.3: A fair coin is flipped 4 times. Let X be the number of heads appearing out
of the four trials. Calculate the following probabilities:
i) 2 heads will appear
ii) No head will appear
iii) At least two heads will appear
iv) Less than two heads will appear
v) At most 2 heads will appear
Solution: We can consider the outcomes of the trials to be independent of each
other. In addition, the probability that a head will appear is the same in each trial. Thus,
X has a binomial distribution with number of trials 4 and probability of success (the
occurrence of a head in a trial) ½. The probability mass function of X is given by
$P_X(x) = \binom{n}{x} 0.5^x (1-0.5)^{n-x} = \binom{n}{x} 0.5^n$, x = 0, 1, 2, 3, 4, where n = 4 and p = 1/2.
i) $P(X = 2) = \binom{4}{2} 0.5^2 (1-0.5)^{4-2} = 0.3750$
ii) $P(X = 0) = \binom{4}{0} 0.5^0 (1-0.5)^{4-0} = 0.0625$
iii) P(X ≥ 2) = P(X = 2) + P(X = 3) + P(X = 4) = 0.3750 + 0.2500 + 0.0625 = 0.6875
iv) P(X < 2) = P(X = 0) + P(X = 1) = 0.0625 + 0.2500 = 0.3125
v) P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2) = 0.0625 + 0.2500 + 0.3750 = 0.6875
Example 6.4: Suppose that a particular trait of a person (such as eye color or left
handedness) is classified on the basis of one pair of genes and suppose that d represents
a dominant gene and r a recessive gene. Thus a person with dd genes is pure dominance,
one with rr is pure recessive, and one with rd is hybrid. The pure dominance and the
hybrid are alike in appearance. Children receive one gene from each parent. If, with
respect to a particular trait, two hybrid parents have a total of four children, what is the
probability that exactly three of the four children have the outward appearance of the
dominant gene?
Solution: If we assume that each child is equally likely to inherit either of two genes
from each parent, the probabilities that the child of two hybrid parents will have dd, rr,
or rd pairs of genes are, respectively, ¼, ¼, ½. Hence, because an offspring will have the
outward appearance of the dominant gene if its gene pair is either dd or rd, it follows
that the number of such children, say X, is binomially distributed with parameters
n = 4 and p = ¾. Thus the desired probability is
$P(X = 3) = \binom{4}{3} 0.75^3 (1-0.75)^{4-3} = 0.421875$.
Example 6.5: Suppose it is known that the probability of recovery for a certain disease
is 0.4. If a random sample of 10 people who are stricken with the disease is selected,
what is the probability that:
(a) exactly 5 of them will recover?
(b) at most 9 of them will recover?
Solution: Let X be the number of persons who will recover from the disease. We can assume
that the selection process will not affect the probability of success (0.4) for each trial by
assuming a large diseased population size. Hence, X will have a binomial distribution
with number of trials equal to 10 and probability of success equal to 0.4.
$P(X = k) = \binom{10}{k} 0.4^k\, 0.6^{10-k}$, k = 0, 1, 2, ..., 10
(a) $P(X = 5) = \binom{10}{5} 0.4^5\, 0.6^{10-5} = 0.200658$
(b) $P(X \le 9) = 1 - P(X = 10) = 1 - \binom{10}{10} 0.4^{10}\, 0.6^{10-10} = 1 - 0.000105 = 0.9999$
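The binomial calculations in Examples 6.3 to 6.5 all use the same formula, so a single helper covers them. The sketch below is one way to write it with the standard library; binomial_pmf is an illustrative name.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)


# Example 6.3: four flips of a fair coin.
coin = [binomial_pmf(x, 4, 0.5) for x in range(5)]
p_two_heads = coin[2]
p_no_heads = coin[0]
p_at_least_two = sum(coin[2:])
p_less_than_two = sum(coin[:2])
p_at_most_two = sum(coin[:3])

# Example 6.4: three of four children show the dominant appearance (p = 3/4).
p_three_dominant = binomial_pmf(3, 4, 0.75)

# Example 6.5: recovery probability 0.4, sample of 10 patients.
p_five_recover = binomial_pmf(5, 10, 0.4)
p_at_most_nine = sum(binomial_pmf(k, 10, 0.4) for k in range(10))

print(p_two_heads, p_no_heads, p_at_least_two, p_less_than_two, p_at_most_two)
print(p_three_dominant, round(p_five_recover, 6), round(p_at_most_nine, 4))
```

Summing the ten terms for k = 0, ..., 9 gives the same value as the shortcut 1 − P(X = 10) used in part (b), which is why the complement is the easier route by hand.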
The Poisson Random Variable
A random variable X, taking on one of the values 0, 1, 2, . . . , is said to have a Poisson
distribution if its probability mass function is given by
$P_X(x) = \frac{e^{-\lambda}\lambda^x}{x!}$, x = 0, 1, 2, 3, ... and λ > 0.
λ is the parameter of this distribution. The mean and variance of the Poisson
distribution are both equal to λ. Note that the Poisson distribution is
used to model situations where the random variable X is the number of occurrences of a
particular event over a given period of time (or space). Together with this, the following
conditions must also be fulfilled: events are independent of each other, events occur
singly, and events occur at a constant rate (in other words, for a given time interval the
mean number of occurrences is proportional to the length of the interval).
The Poisson distribution is used as a distribution of rare events such as telephone calls
made to a switchboard in a given minute, number of misprints per page in a book, road
accidents on a particular motorway in one day, etc. The processes that give rise to such
events are called Poisson processes.
Example 6.6: Suppose that the number of typographical errors on a single page of this
lecture note has a Poisson distribution with parameter λ = 1. If we randomly select a
page in this lecture note, calculate the probability that
a) no error will occur.
b) exactly three errors will occur.
c) less than 2 errors will occur.
d) there is at least one error.
Solution: Let X = number of errors per page. Then
$P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}$, λ = 1, k = 0, 1, 2, ...
a) Required: P(X = 0) = ?
$P(X = 0) = \frac{e^{-1} 1^0}{0!} = \frac{1}{e} = 0.367879$
b) $P(X = 3) = \frac{e^{-1} 1^3}{3!} = 0.061313$
c) P(X < 2) = P(X = 0) + P(X = 1) = 0.73576
d) P(X ≥ 1) = 1 − P(X = 0) = 1 − 0.367879 = 0.632121
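The four answers of Example 6.6 can be reproduced from the Poisson formula. This is a minimal sketch; poisson_pmf is an illustrative helper name.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with parameter lam."""
    return exp(-lam) * lam**k / factorial(k)


lam = 1  # mean number of typographical errors per page

p_zero = poisson_pmf(0, lam)                                # a) no error
p_three = poisson_pmf(3, lam)                               # b) exactly three errors
p_less_than_two = poisson_pmf(0, lam) + poisson_pmf(1, lam) # c) fewer than two errors
p_at_least_one = 1 - p_zero                                 # d) at least one error

print(round(p_zero, 6), round(p_three, 6),
      round(p_less_than_two, 5), round(p_at_least_one, 6))
```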
Example 6.7: If the number of accidents occurring on a highway each day is a Poisson
random variable with parameter λ = 3, what is the probability that no accidents will
occur on a randomly selected day in the future?
Solution: Let X = number of accidents per day. Then
$P(X = k) = \frac{e^{-3} 3^k}{k!}$, k = 0, 1, 2, ...
Required: P(X = 0) = ?
$P(X = 0) = \frac{e^{-3} 3^0}{0!} = e^{-3} \approx 0.05$
Note: The Poisson random variable has a wide range of applications in a diverse number
of areas. An important property of the Poisson random variable is that it may be used to
approximate a binomial random variable when the binomial parameter n is large and p
is small. The probability that X will be k can be approximated by replacing λ with np in
the Poisson distribution, i.e. $P(X = k) \approx \frac{e^{-\lambda}\lambda^k}{k!}$, λ = np.
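A quick numerical comparison shows how close the approximation can be. The values n = 1000 and p = 0.002 below are hypothetical and chosen only for illustration; they do not come from these notes.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """Exact binomial probability P(X = k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def poisson_pmf(k, lam):
    """Poisson probability P(X = k) with parameter lam."""
    return exp(-lam) * lam**k / factorial(k)


n, p = 1000, 0.002   # hypothetical: n large, p small
lam = n * p          # Poisson parameter used for the approximation

rows = [(k, binomial_pmf(k, n, p), poisson_pmf(k, lam)) for k in range(6)]
max_gap = max(abs(b - q) for _, b, q in rows)

for k, b, q in rows:
    print(f"k={k}: binomial={b:.5f}  poisson={q:.5f}")
print(f"largest difference: {max_gap:.5f}")
```

Trying a larger p with the same n (for example p = 0.2) makes the gap visibly bigger, which illustrates why the approximation is stated only for large n and small p.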
6.4 Common continuous probability distributions
Normal distribution
The normal distribution plays an important role in statistical inference because many
real-life distributions are approximately normal; many other distributions can be made
approximately normal by appropriate data transformations (e.g., taking the log); and, as the
sample size increases, the distribution of the means of samples drawn from a population of
any distribution (with finite variance) approaches the normal distribution.
A continuous random variable X is said to follow a normal distribution if and only if its
probability density function (p.d.f.) is
$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$,
where x ∈ (−∞, ∞), μ ∈ (−∞, ∞) and σ ∈ (0, ∞). There are infinitely many normal
distributions since different values of μ and σ define different normal distributions. For
instance, when μ = 0 and σ = 1, the above density will have the following form:
$f_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}$.
This particular distribution is called the standard normal distribution and is sometimes
known as the Z-distribution. The random variable corresponding to this distribution is
usually denoted by Z. If X has a normal distribution with mean μ and variance σ², we
denote it as X ~ N(μ, σ²).
Properties of normal distribution
i) The normal distribution curve is bell shaped, symmetrical about μ and
mesokurtic. The p.d.f. attains its maximum value at x = μ.
ii) Since the line x = μ divides the area under the normal curve into two equal parts, μ is
the mean, the median and the mode of the distribution.
iii) The mean and variance of the normal distribution are μ and σ², respectively.
iv) The total area under the curve and bounded from below by the horizontal axis is 1,
i.e. $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$
problem, we use the standard normal table which gives area values bounded by two
points. Areas under the standard normal distribution curve are tabulated in various
ways. The most common tables give areas bounded between Z=0 and a positive value of
Z. In addition to the standard normal table, the properties of normal distribution and
the following theorem are useful to make probability calculations very easy for any
normal distribution.
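ii) $P(a < X < b) = P\left(\frac{a-\mu}{\sigma} < \frac{X-\mu}{\sigma} < \frac{b-\mu}{\sigma}\right) = P\left(\frac{a-\mu}{\sigma} < Z < \frac{b-\mu}{\sigma}\right)$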
Example 6.8: Let Z be the standard normal random variable. Calculate the following
probabilities using the standard normal distribution table: a) P(0<Z<1.2) b)
P(0<Z<1.43) c) P(Z≤0) d) P(-1.2<Z<0) e) P(Z≤-1.43)
f)P(-1.43≤Z<1.2) g) P(Z≥1.52) h)P(Z≥-1.52)
Solution:
a) The probability that Z lies between 0 and 1.2 can be directly found from the
standard normal table as follows: look for the value 1.2 from z column ( first
column) and then move horizontally until you find the value of 0.00 in the first
row. The point of intersection made by the horizontal and vertical movements will
give the desired area (probability). Hence P(0<Z<1.2) = 0.3849. Refer to the table
below as a guide to find this probability.
Figure: The area to the left and the right of 0 for z-distribution
d) P(−1.2 < Z < 0) = P(0 < Z < 1.2) = 0.3849, due to symmetry
e) P(Z < −1.43) = 1 − P(Z ≥ −1.43) (using the probability of the complement event)
= 1 − [P(−1.43 < Z < 0) + P(Z ≥ 0)] (since a region can be broken down into non-overlapping regions)
= 1 − [P(0 < Z < 1.43) + P(Z ≥ 0)]
= 1 − [0.4236 + 0.5]
= 1 − 0.9236 = 0.0764
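Table lookups like those in parts a), d) and e) can be cross-checked with Python's statistics.NormalDist (available from Python 3.8). This is only a sketch; the helper name between is illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1


def between(a, b):
    """P(a < Z < b), computed as the difference of two cumulative probabilities."""
    return Z.cdf(b) - Z.cdf(a)


p_a = between(0, 1.2)      # a) area between 0 and 1.2
p_d = between(-1.2, 0)     # d) equal to a) by symmetry
p_e = Z.cdf(-1.43)         # e) area to the left of -1.43

print(round(p_a, 4), round(p_d, 4), round(p_e, 4))
```

The cumulative function cdf gives the area to the left of a point, so the "0 to z" areas printed in the table correspond to cdf(z) − 0.5.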
a) $P(X > 88) = P\left(\frac{X-74}{10} > \frac{88-74}{10}\right) = P(Z > 1.4) = 0.5 - P(0 < Z < 1.4) = 0.5 - 0.4192 = 0.0808$
Hence 8.08 percent of the students scored more than you did.
b) Let x_A be the lowest mark to get letter grade A. We are given that
$P(X \ge x_A) = 0.2 = P\left(\frac{X-74}{10} \ge \frac{x_A-74}{10}\right) = P(Z \ge z_A)$
$P(0 < Z < z_A) = 0.5 - 0.2 = 0.3 \Rightarrow z_A \approx 0.85 \Rightarrow 0.85 = \frac{x_A - 74}{10}$
Hence, the lowest mark to get letter grade A is 82.5.
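Both parts can be cross-checked numerically. The sketch below assumes, as the standardization above suggests, that marks are normally distributed with mean 74 and standard deviation 10 and that the top 20 percent of students receive an A; the variable names are illustrative.

```python
from statistics import NormalDist

# Assumed model read off the worked solution: mean 74, standard deviation 10.
marks = NormalDist(mu=74, sigma=10)

# a) proportion of students scoring above 88
p_above_88 = 1 - marks.cdf(88)

# b) lowest mark for an A: the point with 80% of the marks below it
cutoff_a = marks.inv_cdf(0.80)

print(round(p_above_88, 4), round(cutoff_a, 1))
```

The computed cutoff is slightly below 82.5 because inv_cdf uses the exact quantile (z about 0.84) while the worked solution takes the table value z = 0.85.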
The chi-square and t distributions
The chi-square and t distributions are important continuous distributions which are
useful in statistical inference. In this section we will see a brief introduction of these
distributions. In later chapters, we are going to see in detail on how to use these
distributions in estimation and hypotheses testing.
Chi-square distribution
A random variable X is said to have a chi-square distribution with n degrees of freedom
(denoted by χ²(n)) if its probability density function is given by
$f_X(x) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{\frac{n}{2}-1} e^{-\frac{x}{2}}$, x > 0.
The chi-square distribution has one parameter called the degrees of freedom, n.
Depending on the values of n, we can have many different chi-square distributions. The
mean and the variance of the chi-square distribution are n and 2n, respectively.
Because of its importance, the chi-square distribution is tabulated for various values of
the parameter n (refer to the table). Thus we may find in the table the value, denoted by
χ²_α(n), satisfying P(X ≥ χ²_α(n)) = α, 0 < α < 1. The example below shows how to read
chi-square distribution values.
Example 6.11: To read the chi-square value with 2 degrees of freedom where the area to
the right of this value is 0.005, look for the degrees of freedom, 2, in the first column (df
column) and then move horizontally until you reach the column headed by the value of α,
0.005, in the first row. The point of intersection made by the horizontal and vertical
movement will give the desired chi-square value, 10.597. This value satisfies
P(X ≥ 10.597) = 0.005. In a similar way, the chi-square value with 100 degrees of
freedom where the area to the right of this value is 0.975 is 74.222.
The t distribution
The t distribution is an important distribution useful in inference concerning population
mean/means. This distribution has one parameter called the degrees of freedom.
Depending on the values of the degrees of freedom, we may have different t
distributions. The degrees of freedom is usually denoted by n. In inference on the
population mean, the degrees of freedom is related to sample size. As the sample size
or degrees of freedom increases, the t distribution approaches the standard normal
distribution.
The t- distribution shares some characteristics of the normal distribution and differs
from it in others. The t distribution is similar to the standard normal distribution in the
following ways.
i) it is bell-shaped
ii) it is symmetrical about the mean
iii) the mean, median, and mode are equal to 0 and are located at the center of the
distribution.
iv) The curve never touches the x-axis
The t distribution differs from the standard normal distribution in the following ways.
i) the variance is greater than 1.
ii) The t distribution is actually a family of curves based on the concept of degrees of
freedom.
P(t(n) ≥ t_α(n)) = α, 0 < α < 1, and t(n) represents the t random variable with n degrees of
freedom. The following example will help you to read t distribution values.
Example 6.12: To find the t value with 3 degrees of freedom where the area to the right
of this value is 0.05, look for the degrees of freedom, 3, in the first column (df column) and
then move horizontally until you reach the column headed by the value of α, 0.05, in the
first row. The point of intersection made by the horizontal and vertical movement will give
the desired t value, 2.353. This value satisfies P(t ≥ 2.353) = 0.05.
UNIT SEVEN
SAMPLING AND SAMPLING DISTRIBUTION
OF SAMPLE MEAN
Objectives:
After a successful completion of this unit, students will be able to:
✓ Differentiate the two major sampling techniques: probabilistic and non-
probabilistic
✓ Apply simple random sampling technique to select sample
✓ Define sampling distribution of the sample mean
For example, you may use the lottery method to draw a random sample by using a set of
'N' tickets, with numbers ' 1 to N' if there are 'N' units in the population. After shuffling
the tickets thoroughly, the sample of a required size, say n, is selected by picking the
required n number of tickets.
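The lottery method maps naturally onto shuffling a list of ticket numbers. The sketch below is one possible implementation with Python's random module; the function name lottery_sample and the seed argument (used only to make a run repeatable) are illustrative assumptions.

```python
import random


def lottery_sample(population_size, sample_size, seed=None):
    """Draw a simple random sample without replacement by the lottery method.

    Tickets numbered 1..N are shuffled thoroughly and the first n are picked.
    """
    rng = random.Random(seed)
    tickets = list(range(1, population_size + 1))
    rng.shuffle(tickets)
    return tickets[:sample_size]


# Illustration: pick 6 units out of a population of 60 units.
sample = lottery_sample(population_size=60, sample_size=6, seed=2024)
print(sample)
```

The standard library call random.sample(range(1, 61), 6) achieves the same thing in one step; the explicit shuffle is shown only because it mirrors the description of the tickets.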
The best method of drawing a simple random sample is to use a table of random
numbers. After assigning consecutive numbers to the units of population, the researcher
starts at any point on the table of random numbers and reads the consecutive numbers
in any direction: horizontally, vertically or diagonally. If a number read out
corresponds to the number written on a unit card, then that unit is chosen for the sample.
Row \ Column    1      2      3      4      5      …      N
1               2315   7548   5901   8372   5993   …      6744
2               0554   5550   4310   5374   3508   …      1343
3               1487   1603   5032   4043   6223   …      0834
4               3897   6749   5094   0517   5853   …      1695
5               9731   2617   1899   7553   0870   …      0510
6               1174   2693   8144   3393   0862   …      6850
7               4336   1288   5911   0164   5623   …      4036
8               9380   6204   7833   2680   4491   …      2571
9               4954   0131   8108   4298   4187   …      9527
10              3676   8726   3337   9482   1569   …      3880
11              …      …      …      …      …      …      …
12              …      …      …      …      …      …      …
13              …      …      …      …      …      …      …
14              …      …      …      …      …      …      …
15              …      …      …      …      …      …      …
N               3914   5218   3587   4855   4888   …      8042
If you start in the first row and first column and read down the column, taking the first
two digits of each entry, centers numbered 23, 05, 14, … will be selected. However, numbers
above the population size (60) will not be included in the sample. In addition, if any
number is repeated in the table, it may be substituted by the next number from the same
column. Besides, you can start at any point in the table. If you choose column 4 and row 1,
the number to start with is 83. In this way you can select the first 6 eligible numbers from
this column starting with 83.
The numbers read from the column, in order, are:
83, 53, 40, 05, 75, 33, 01, 26
Since 83 and 75 exceed the population size, the study centers numbered 53, 40, 05, 33, 01
and 26 will be in the sample.
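The reading rule just described (take the first two digits, skip numbers above 60, skip repeats) can be written as a short loop. The sketch below uses the column 4 entries of the table above; the function name select_from_table is an illustrative choice.

```python
# Column 4 of the random number table, read from row 1 downwards.
COLUMN_4 = ["8372", "5374", "4043", "0517", "7553",
            "3393", "0164", "2680", "4298", "9482"]


def select_from_table(entries, population_size, sample_size):
    """Pick unit numbers from table entries using their first two digits.

    Numbers outside 1..population_size and repeated numbers are skipped,
    and reading continues with the next entry in the column.
    """
    selected = []
    for entry in entries:
        number = int(entry[:2])
        if 1 <= number <= population_size and number not in selected:
            selected.append(number)
        if len(selected) == sample_size:
            break
    return selected


sample = select_from_table(COLUMN_4, population_size=60, sample_size=6)
print(sample)
```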
Simple random sampling ensures the best results. However, from a practical point of
view, a list of all the units of a population may not be possible to obtain. Even if it is
possible, it may involve a very high cost which a researcher or an organization may not be
able to afford. In addition, it may result in an unrepresentative sample by chance.
Stratified sampling
Stratified random sampling takes into account the stratification of the main population
into a number of sub-populations, each of which is homogeneous with respect to one or
more characteristic(s). Having ensured this stratification, it provides for selecting
randomly the required number of units from each sub-population. The selection of a
sample from each subpopulation may be done using simple random sampling. It is useful
in providing more accurate results than simple random sampling.
Systematic sampling
In this method, sample units are selected at equal intervals from the listing of the elements.
This method provides a sample as good as a simple random sample and is comparatively
easier to draw. For instance, to study the average monthly expenditure of
households in a city, you may select every fourth household from the household
listing, beginning from a randomly chosen starting point.
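A minimal sketch of this selection rule is given below, assuming a hypothetical listing of 40 households and a sampling interval of 4; the function name and the seed are illustrative.

```python
import random

def systematic_sample(units, k, seed=None):
    """Pick a random start among the first k units, then take every k-th unit."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # random position 0..k-1 in the listing
    return units[start::k]

# Hypothetical household listing numbered 1..40 (an assumption).
households = list(range(1, 41))
selected = systematic_sample(households, k=4, seed=7)
print(selected)
```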
Cluster sampling
Cluster sampling is used when a sampling frame is difficult to construct or when other
sampling techniques (such as simple random sampling) are not feasible or are too costly. For instance,
when the geographic distribution of units is scattered it is difficult to apply simple
random sampling. It involves division of the population of elementary units into groups
or clusters that serve as primary sampling units. A selection of the clusters is then made
to form the sample. The precision of estimates made based on samples taken using this
method is relatively low.
Non-probability sampling techniques
In non-probability sampling, the sample is not based on chance. It is rather determined
by personal judgment. This method is cost effective; however, we cannot make objective
statistical inferences. Depending on the technique used, non-probability samples are
classified into quota, judgment or purposive and convenience samples.
Note:
✓ The mean of the sampling distribution of the sample mean is the same as the
population mean irrespective of the sampling procedure.
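✓ The variance of the sampling distribution of the sample mean is:
$$\operatorname{Var}(\bar{X}) = \begin{cases} \dfrac{\sigma^2}{n}, & \text{if sampling is with replacement} \\[2ex] \dfrac{\sigma^2}{n}\cdot\dfrac{N-n}{N-1}, & \text{if sampling is without replacement} \end{cases}$$
where N is the population size and n is the sample size.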
✓ The problem with using the sample mean to make inferences about the population
mean is that the sample mean will probably differ from the population mean.
This error is measured by the standard deviation of the sampling distribution of the
sample mean (the square root of its variance), which is known as the standard error.
The standard error reflects the typical amount of sampling error incurred by taking a
sample rather than the whole population. As sample size increases, the standard error decreases.
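The two variance formulas in the note can be compared numerically. The sketch below takes the standard error as the square root of the variance of the sample mean; the values of the population variance, n and N are purely illustrative assumptions.

```python
import math

def variance_of_sample_mean(sigma2, n, N=None):
    """Variance of the sample mean.

    If N (population size) is given, sampling is taken to be without
    replacement and the finite population factor (N - n)/(N - 1) is applied.
    """
    var = sigma2 / n
    if N is not None:
        var *= (N - n) / (N - 1)
    return var

# Illustrative values (assumptions): sigma^2 = 25, n = 10, N = 60.
sigma2, n, N = 25.0, 10, 60
with_repl = variance_of_sample_mean(sigma2, n)
without_repl = variance_of_sample_mean(sigma2, n, N)

print("standard error, with replacement:   ", math.sqrt(with_repl))
print("standard error, without replacement:", math.sqrt(without_repl))
```

Sampling without replacement gives the smaller value because the factor (N − n)/(N − 1) is below 1 whenever n > 1.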
7.3 Central Limit Theorem
If $X_1, X_2, \dots, X_n$ is a random sample from a population with mean $\mu$ and variance $\sigma^2$, then
as n goes to infinity the distribution of the sample mean, $\bar{X}$, approaches a normal
distribution with mean $\mu$ and variance $\sigma^2/n$. That is, as n gets large, $\bar{X} \sim N(\mu, \sigma^2/n)$ and
its standardized form is
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1).$$
Note: The central limit theorem is useful for approximating the distribution of the
sample mean based on a large sample size when the population distribution is
non-normal; however, if the population is normal, then the sampling distribution of the
sample mean will be normal regardless of the sample size.
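A small simulation can make the theorem more concrete. The sketch below assumes an exponential population with mean 1 and variance 1 (a clearly non-normal choice made only for illustration) and checks that the sample means have mean close to μ and variance close to σ²/n.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n, reps = 40, 5000      # sample size and number of repeated samples (assumed)

# Exponential population with rate 1: mean 1, variance 1, strongly skewed.
means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

print("mean of sample means:    ", statistics.fmean(means))     # near mu = 1
print("variance of sample means:", statistics.variance(means))  # near 1/40 = 0.025
```

A histogram of `means` would look roughly bell-shaped even though the individual observations are skewed.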
Example 7.2: Suppose the uric acid values in normal adult males are normally distributed with
mean 5.7 mg and standard deviation 1 mg. Find the probability that
a) a sample of size 4 will yield a mean less than 5
b) a sample of size 9 will yield a mean greater than 6
Solution: Let X be the amount of uric acid in normal adult males, with mean 5.7 and
variance 1.
a) If a sample of size 4 is taken, then $\bar{X} \sim N(5.7, 0.25)$ since the population is
normally distributed.
$$P(\bar{X} < 5) = P\left(Z < \frac{5 - 5.7}{0.5}\right) = P(Z < -1.4) = 0.5 - P(0 < Z < 1.4) = 0.0808$$
b) If a sample of size 9 is taken, then $\bar{X} \sim N(5.7, 1/9)$ since the population is
normally distributed.
$$P(\bar{X} > 6) = P\left(Z > \frac{6 - 5.7}{1/3}\right) = P(Z > 0.9) = 0.5 - P(0 < Z < 0.9) = 0.1841$$
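The two probabilities in Example 7.2 can be checked without a normal table. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `sample_mean_dist` is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 5.7, 1.0  # population mean and standard deviation from the example

def sample_mean_dist(n):
    """Sampling distribution of the mean of n observations: N(mu, sigma^2/n)."""
    return NormalDist(mu, sigma / sqrt(n))

p_a = sample_mean_dist(4).cdf(5)      # P(sample mean < 5) with n = 4
p_b = 1 - sample_mean_dist(9).cdf(6)  # P(sample mean > 6) with n = 9

print(round(p_a, 4), round(p_b, 4))
```

Both values should agree with the table-based answers above up to rounding.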
UNIT EIGHT
SIMPLE LINEAR REGRESSION AND
CORRELATION
Objectives:
Having studied this unit, you should be able to:
✓ Formulate a simple linear regression model.
✓ express quantitatively the magnitude and direction of the association between
two variables
Introduction
The statistical methods discussed so far are used to analyze the data involving only one
variable. Often an analysis of data concerning two or more variables is needed to look
for any statistical relationship or association between them. Thus, regression and
correlation analysis are helpful in ascertaining the probable form of the relationship
between variables and the strength of the relationship.
The population parameters $\alpha$ and $\beta$ can be estimated from sample data using the least
squares technique. The estimators of $\alpha$ and $\beta$ are usually denoted by a and b,
respectively. The resulting regression line is
$$\hat{Y} = a + bX$$
and the equation is known as the fitted regression line. The estimated values of Y are
denoted by $\hat{Y}$. The observed values of Y are denoted by y. The difference between the
observed and the estimated values, $Y - \hat{Y}$, is known as the error or residual, and is denoted
by $\hat{\varepsilon}$. The residual can be positive, negative or zero.
A best fitting line is the one for which the sum of squares of the residuals, $\sum \hat{\varepsilon}^2$, has the
minimum value. This is called the method of least squares. According to this method,
one would select a and b such that $\sum \hat{\varepsilon}^2 = \sum (Y - \hat{Y})^2$ is minimum. The solution of this
minimization problem using partial differentiation is as follows:
$$b = \frac{\sum XY - \dfrac{\sum X \sum Y}{n}}{\sum X^2 - \dfrac{(\sum X)^2}{n}} = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} \quad \text{and} \quad a = \bar{Y} - b\bar{X}$$
Example 8.1: A researcher wants to find out whether there is any relationship between the
height of a son and that of his father. He took a random sample of 6 fathers and their sons.
The heights in inches are given in the table below:
Height of father (X)    63   65   64   65   67   68
Height of the son (Y)   66   68   65   67   69   70
i) Draw the scatter diagram and comment on the type of relationship.
ii) Fit the regression line of Y on X.
iii) Predict the height of the son if his father’s height is 66 inches.
Solution:
i)
From the scatter plot one can see that the points lie roughly on a straight line.
ii)
$n = 6$, $\sum X = 392$, $\sum Y = 405$, $\sum X^2 = 25628$, $\sum XY = 26476$, $\sum Y^2 = 27355$
$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{6(26476) - (392)(405)}{6(25628) - (392)^2} = 0.923$$
$$a = \bar{Y} - b\bar{X} = \frac{405}{6} - 0.923\left(\frac{392}{6}\right) = 7.2$$
Then the fitted (regression) line of Y on X is given by:
$$\hat{Y} = a + bX = 7.2 + 0.923X$$
✓ The slope of the line, i.e. b = 0.923, tells us that a unit (one inch) increase in
the height of the father is associated with an average increase of 0.923 inch in
the height of the son.
✓ The y-intercept of the line, i.e. a = 7.2, is the value of Y when the value of X is
zero (do you think that the intercept is meaningful?).
iii) $\hat{Y} = 7.2 + 0.923(66) = 68.118$; thus the predicted height of the son is 68.118 inches.
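The hand calculation in Example 8.1 can be reproduced with a few lines of code. This is a minimal sketch of the least squares formulas applied to the six father and son heights; the variable names are illustrative.

```python
# Heights in inches from Example 8.1.
fathers = [63, 65, 64, 65, 67, 68]  # X
sons = [66, 68, 65, 67, 69, 70]     # Y

n = len(fathers)
sum_x = sum(fathers)
sum_y = sum(sons)
sum_x2 = sum(x * x for x in fathers)
sum_xy = sum(x * y for x, y in zip(fathers, sons))

# Least squares slope and intercept.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Predicted height of a son whose father is 66 inches tall.
pred = a + b * 66

print(round(b, 3), round(a, 2), round(pred, 2))
```

Because the code keeps a and b unrounded, the prediction can differ from the hand-computed 68.118 in the second or third decimal place.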
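$$r = \frac{\dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{n-1}}{\sqrt{\dfrac{\sum (X - \bar{X})^2}{n-1}\cdot\dfrac{\sum (Y - \bar{Y})^2}{n-1}}} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$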
o r approaching -1 indicates a strong negative linear relationship between the
two variables
o r approaching 0 indicates a weak linear relationship between the two variables
Examples of correlation coefficients:
a. Calculate the correlation coefficient and comment on the strength and direction
of the relationship between the two variables.
Solution: The summary quantities are
$n = 16$, $\sum x_i = 1.656$, $\sum y_i = 170.6$, $\sum x_i y_i = 20.0397$, $\sum x_i^2 = 0.196912$, $\sum y_i^2 = 2253.56$
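The correlation coefficient can be obtained directly from summary quantities such as those listed above. The sketch below uses the standard computational forms of the corrected sums of squares and products; it is offered as an illustration and not as the worked solution of the original text.

```python
from math import sqrt

# Summary quantities from the example.
n = 16
sum_x, sum_y = 1.656, 170.6
sum_xy = 20.0397
sum_x2, sum_y2 = 0.196912, 2253.56

# Corrected sums: S_xy = sum(xy) - sum(x)sum(y)/n, and similarly for S_xx, S_yy.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n

r = s_xy / sqrt(s_xx * s_yy)
print(round(r, 3))
```

A positive value of r well above zero would point to a positive linear relationship between the two variables.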
UNIT NINE
ESTIMATION AND HYPOTHESIS TESTING
Objectives:
Having studied this unit, you should be able to
✓ construct and interpret confidence interval estimates
✓ formulate hypothesis about a population mean
✓ determine an appropriate sample size for estimation
Introduction
We now assume that we have collected, organized and summarized a random sample of
data and are trying to use that sample to estimate a population parameter. Statistical
inference is a procedure whereby inferences about a population are made on the basis of
the results obtained from a sample. Statistical inference can be divided into two main
areas: estimation and hypothesis testing. Estimation is concerned with estimating the
values of specific population parameters; hypothesis testing is concerned with testing
whether the value of a population parameter is equal to some specific value.
9.1 Point and interval estimation of the mean
Point estimate: In point estimation, a single sample statistic (such as $\bar{x}$, $s$ or $\hat{p}$) is
calculated from the sample to provide an estimate of the true value of the corresponding
population parameter (such as $\mu$, $\sigma$ or $p$). Such a single statistic is termed a point
estimator, and the specific value of the statistic is termed a point estimate. For
example, the sample mean $\bar{X}$ is an estimator of the population mean $\mu$, and $\bar{X} = 10$ is an
estimate, which is one of the possible values of $\bar{X}$.
Interval estimate: In most practical problems, a point estimate does not provide
information about ‘how close is the estimate’ to the population parameter unless
accompanied by a statement of possible sampling errors involved based on the sampling
distribution of the statistic. Hence, an interval estimate of a population parameter is a
confidence interval with a statement of confidence that the interval contains the
parameter value.
An interval estimate of the population parameter consists of two bounds within which
the parameter will be contained:
L ≤ (population parameter) ≤ U
where L is the lower bound and U is the upper bound.
Case 1: When the population is normal.
✓ If the variance $\sigma^2$ is known, the sampling distribution of the sample mean $\bar{X}$ is
normal with mean $\mu$ and variance $\sigma^2/n$, i.e., $\bar{X} \sim N(\mu, \sigma^2/n)$ and
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1).$$
✓ If the variance $\sigma^2$ is unknown,
$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$
will have a t-distribution with n - 1 degrees of freedom. Moreover, as the sample size
increases, the t-distribution is approximately the same as the standard normal.
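Consider the case where $\sigma^2$ is known; we can derive a $(1 - \alpha)100\%$ confidence interval for the
population mean $\mu$.
Let $Z_{\alpha/2}$ be a point on the standard normal curve that cuts off an area of $\alpha/2$ to the right, i.e.
$P(Z > Z_{\alpha/2}) = \alpha/2$. By the symmetry of the normal distribution, $P(Z < -Z_{\alpha/2}) = \alpha/2$
(see the diagram below).
From the standard normal distribution, we know that
$$P(-Z_{\alpha/2} < Z < Z_{\alpha/2}) = 1 - \alpha$$
[Figure: standard normal curve with an area of α/2 in each tail, beyond −Zα/2 and Zα/2]
To obtain the limits of the interval estimate, we use the standardized form of $\bar{X}$ in the
above probability statement, i.e., letting $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$,
$P(-Z_{\alpha/2} < Z < Z_{\alpha/2}) = 1 - \alpha$ becomes
$$P\left(-Z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < Z_{\alpha/2}\right) = 1 - \alpha$$
$$P\left(-Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \bar{X} - \mu < Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$
$$P\left(-\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < -\mu < -\bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$
$$P\left(\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$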
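We can assert with probability $1 - \alpha$ that the interval
$$\left(\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$
contains the population mean $\mu$ we are estimating.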
$$\left(2.28 - (2.571)\frac{0.95}{\sqrt{6}},\; 2.28 + (2.571)\frac{0.95}{\sqrt{6}}\right)$$
(2.28 - 0.997, 2.28 + 0.997)
(1.28, 3.28)
We are 95% confident that the mean drop in blood pressure lies between 1.28 mmHg
and 3.28 mmHg for the sampled population.
Example 9.2: Punctuality of patients in keeping appointments is of interest to a research
team. In a study of patient flow through the offices of general practitioners, it was found
that a sample of 35 patients were 17.2 minutes late for appointments, on average.
Previous research had shown the standard deviation to be about 8 minutes. The
population distribution was felt to be non-normal. What is the 90 percent confidence
interval for the true mean amount of time late for appointments?
Solution: Given: $\bar{X} = 17.2$, $\sigma = 8$, $n = 35$
$(1 - \alpha)100\% = 90\% \Rightarrow 1 - \alpha = 0.90 \Rightarrow \alpha = 0.1 \Rightarrow \alpha/2 = 0.05$
Since the sample size is fairly large (n > 30), and since the population standard
deviation is known, according to the central limit theorem, the sampling distribution of
the sample mean is approximately normal. Thus, a confidence interval for the population
mean is given by:
$$\left(\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$
And from the standard normal distribution table, $Z_{\alpha/2} = Z_{0.05} = 1.65$
$$\left(17.2 - (1.65)\frac{8}{\sqrt{35}},\; 17.2 + (1.65)\frac{8}{\sqrt{35}}\right)$$
(17.2 - 2.2, 17.2 + 2.2)
(15.0, 19.4)
Therefore, the 90% confidence interval for the true mean amount of time late for
appointments is between 15.0 and 19.4 minutes.
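The interval in Example 9.2 can be recomputed as follows. This is a minimal sketch that obtains the critical value from `statistics.NormalDist` instead of a table; the function name `z_interval` is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, conf):
    """Confidence interval for a mean when sigma is known (or n is large)."""
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 point of N(0, 1)
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin

# Values from Example 9.2: mean 17.2, sigma 8, n = 35, 90% confidence.
low, high = z_interval(17.2, 8, 35, 0.90)
print(round(low, 1), round(high, 1))
```

The computed critical value is slightly smaller than the tabled 1.65, so the limits may differ from the hand calculation in the second decimal place.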
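✓ If the variance $\sigma^2$ is known, the sampling distribution of the sample mean $\bar{X}$ is
normal with mean $\mu$ and variance $\sigma^2/n$, i.e., $\bar{X} \sim N(\mu, \sigma^2/n)$, and under $H_0: \mu = \mu_0$ the test statistic is
$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0,1).$$
✓ If the variance $\sigma^2$ is unknown, the test statistic is
$$t = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t(n-1).$$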
[Figure: sampling distribution centred at µ = µ0, with rejection regions of area α/2 in each tail beyond the critical values ±Zα/2 and an acceptance region of area 1-α between them]
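i. For $H_1: \mu \neq \mu_0$ (two-tailed test) reject $H_0$ if $|Z| > Z_{\alpha/2}$.
[Figure: two-tailed test, with rejection regions of area α/2 in each tail and an acceptance region of area 1-α]
[Figure: right-tailed test, with a rejection region of area α to the right of Zα and an acceptance region of area 1-α]
[Figure: left-tailed test, with a rejection region of area α to the left of -Zα and an acceptance region of area 1-α]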
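Reject $H_0: \mu = \mu_0$ if $|t| > t_{\alpha/2}(n-1)$ (two-tailed test), $t > t_{\alpha}(n-1)$ (right-tailed test), or $t < -t_{\alpha}(n-1)$ (left-tailed test).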
Step 5: Interpret the result.
Errors in Hypothesis Testing
Ideally, the hypothesis testing procedure should lead to the rejection of the null hypothesis
$H_0$ when it is false and non-rejection of $H_0$ when it is true. However, the correct decision
is not always possible. Since the decision to reject or not reject a hypothesis is based
on sample data, there is a possibility of making an incorrect decision, or error.
Hence, a decision-maker may commit one of two types of error while testing a null
hypothesis. These errors are summarized as follows:
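$$\bar{X} = \frac{\sum_{i=1}^{11} x_i}{n} = \frac{54.2 + 50.4 + \dots + 57.5 + 53.4}{11} = \frac{598.5}{11} = 54.41$$
$$S^2 = \frac{1}{n-1}\left[\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right] = \frac{1}{10}\left[32799.91 - \frac{(598.5)^2}{11}\right] = \frac{1}{10}(236.07) = 23.607$$
$$S = \sqrt{23.607} = 4.859$$
Then, the t-test statistic is calculated as:
$$t = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} = \frac{54.41 - 50}{4.859/\sqrt{11}} = \frac{4.41}{1.465} = 3.01$$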
4. For α = 0.05 and a two-tailed test, the critical (table) value is:
$$t_{\alpha/2}(n-1) = t_{0.05/2}(11-1) = t_{0.025}(10) = 2.228$$
[Figure: t-distribution with rejection regions of area 0.025 in each tail, beyond -2.228 and 2.228]
Since $t = 3.01 > t_{\alpha/2}(n-1) = 2.228$, reject the null hypothesis $H_0$.
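The t-test calculation above can be retraced from its summary figures. The sketch below uses only quantities that appear in the worked example (n = 11, Σx = 598.5, Σx² = 32799.91, μ0 = 50) and takes the critical value 2.228 from the t-table as given, since the Python standard library has no t-distribution; the variable names are illustrative.

```python
from math import sqrt

# Summary figures from the worked example.
n = 11
sum_x = 598.5
sum_x2 = 32799.91
mu0 = 50        # hypothesized mean
t_crit = 2.228  # t_{0.025}(10), read from the t-table in the text

xbar = sum_x / n
s2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)  # sample variance
s = sqrt(s2)

t = (xbar - mu0) / (s / sqrt(n))
reject = abs(t) > t_crit  # two-tailed decision rule

print(round(xbar, 2), round(s, 3), round(t, 2), reject)
```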
[Figure: standard normal curve with rejection regions of area 0.025 in each tail, beyond -1.96 and 1.96]
Since $Z = 3.16 > Z_{\alpha/2} = Z_{0.025} = 1.96$, reject the null hypothesis $H_0$.
[Figure: standard normal curve with a shaded rejection region of area 0.05 to the right of 1.65]
Since $Z = 4.23 > Z_{\alpha} = 1.65$, reject the null hypothesis $H_0$. That is, the calculated
Z value lies in the rejection region (the shaded region).
5. Conclusion: The mean VO2max score for the sampled population of healthy
midwife women is greater than 30 at the 5% level of significance.
9.3 Test of Association (Independence)
We often encounter nominal scale data. The $\chi^2$ test of association is useful for
determining whether any relationship or association exists between two
nominal variables. For instance, we might be interested in the relationship between HIV
status and sex, lung cancer and smoking habit, political affiliation and sex, etc.
When observations are classified according to two variables or attributes and arranged
in a table, the display is called a contingency table as shown below:
The test of association or independence uses the contingency table format. Here the
variables A and B have been classified into mutually exclusive categories. The value $O_{ij}$
in row i and column j of the table shows the observed frequency falling in the joint
category i and j. The row and column totals are the sums of their corresponding
frequencies. The sum of the row totals (or of the column totals) gives the grand total n,
which represents the sample size. The procedure for testing the association between two
variables is summarized as follows:
Step 1: State the null and alternative hypotheses
$H_0$: There is no association or relationship between the two variables, that is,
the two variables are independent.
$H_1$: There is an association or relationship between the two variables, that is, the two
variables are dependent.
Step 2: State the level of significance, $\alpha$.
Step 3: Calculate the expected frequencies, $E_{ij}$, corresponding to the observed frequency
in row i and column j. The expected frequency in each cell is calculated as:
$$E_{ij} = \frac{\text{Row } i \text{ total} \times \text{Column } j \text{ total}}{\text{Sample size}} = \frac{R_i \times C_j}{n}$$
Step 4: Compute the value of the test statistic:
$$\chi^2_{Cal} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where $O_{ij}$ is the observed frequency of row i and column j and $E_{ij}$ is the expected
frequency of row i and column j.
Step 5: Find the critical (table) value $\chi^2_{\alpha}(df)$ (from Appendix..). The value of $\chi^2_{\alpha}$
corresponds to an area of $\alpha$ in the right tail of the distribution,
where df = (Number of rows – 1)(Number of columns – 1) = (r – 1)(c – 1)
Step 6: Compare the calculated and table values of $\chi^2$. Decide whether the variables
are independent or not, using the following decision rule:
Reject $H_0$ if $\chi^2_{Cal}$ is greater than $\chi^2_{\alpha}(df)$. Otherwise do not reject $H_0$.
Example 9.6: The following data on the colour of eye and hair for 6800 individuals
were obtained from a source:
Eye                  Hair colour
colour     Fair    Brown    Black    Red    Total
Blue       1768      808      190     47     2813
Green       946     1387      746     43     3122
Brown       115      444      288     18      865
Total      2829     2639     1224    108     6800
Test the hypothesis that hair colour and eye colour are independently distributed (there
is no association between colour of eye and colour of hair) at the level of $\alpha$ = 0.01.
Solution:
1. $H_0$: There is no association between hair colour and eye colour.
$H_1$: There is an association between hair colour and eye colour.
2. $\alpha$ = 0.01.
3. Calculate the expected frequencies, $E_{ij}$:
$$E_{ij} = \frac{R_i \times C_j}{n}$$
$$E_{11} = \frac{2813 \times 2829}{6800} = 1170.29, \;\dots,\; E_{14} = \frac{2813 \times 108}{6800} = 44.68$$
$$E_{31} = \frac{865 \times 2829}{6800} = 359.87, \;\dots,\; E_{34} = \frac{865 \times 108}{6800} = 13.74$$
Therefore, the contingency table for expected frequencies is as follows:
Eye                     Hair colour
colour     Fair       Brown      Black     Red      Total
Blue       1170.29    1091.69    506.34    44.68    2813
Green      1298.84    1211.61    561.96    49.58    3122
Brown       359.87     335.70    155.70    13.74     865
Total      2829       2639       1224      108      6800
4. Calculate the test statistic:
$$\chi^2_{Cal} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
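The expected frequencies and the test statistic for Example 9.6 can be computed from the observed table as sketched below. The code follows Steps 3 and 4 of the procedure; the variable names are illustrative, and the critical value must still be read from a chi-square table.

```python
# Observed frequencies from Example 9.6
# (rows: Blue, Green, Brown; columns: Fair, Brown, Black, Red).
observed = [
    [1768, 808, 190, 47],
    [946, 1387, 746, 43],
    [115, 444, 288, 18],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Step 3: expected frequency E_ij = (row i total x column j total) / n.
expected = [[ri * cj / n for cj in col_totals] for ri in row_totals]

# Step 4: chi-square statistic, summed over all cells.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (r - 1)(c - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

print([round(e, 2) for e in expected[0]])
print(round(chi2, 2), df)
```

The decision in Step 6 then amounts to comparing `chi2` with the tabled critical value for `df` degrees of freedom at α = 0.01.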