Probability and Statistics
Contents:
Part One: Probability, Random Variables and Distribution
Chapter 1
1. The basic concept of probability
1.1 Introduction
1.2 Sample Space, probability of events, counting rule
1.3 Conditional probability
1.4 Multiplication rule
1.5 Bayes theorem
Chapter 2:
2. Discrete random variable and probability distribution
2.1 Introduction
2.2 Discrete random variable
2.3 Discrete probability distribution
2.4 Special functions for discrete probability distribution
Chapter 3:
3. Continuous random variable and probability distribution
3.1 Introduction
3.2 Continuous random variable
3.3 Continuous probability distribution
3.4 Special functions for continuous probability distribution
Part Two: Descriptive Statistics
Chapter 4:
4. Data display and summary of data
4.1 Introduction
4.2 The definition and the difference between sample and population
4.3 Graphical display of data: Stem and leaf, and Box-plot
4.4 The mean, variance and standard deviation of the data
Chapter 5:
5. Random sample, Central Limit Theorem, Normal Approximation and Statistical process
control: X-bar and R-charts.
5.1 Introduction
5.2 Random sample
5.3 The sampling distribution of X
5.4 Central Limit Theorem
5.5 Normal Approximation for Binomial and Poisson distributions.
5.6 Statistical process control: X-bar and R-charts
Part Three: Inferential Statistics
Chapter 6:
6. Hypothesis testing for single population
6.1 Introduction
6.2 Test about a population mean for large sample, population variance is known
6.3 The P-value for the test and confidence interval for mean
6.4 Test about a population mean for small sample, population variance is unknown
6.5 Test about a population mean, population variance is unknown but the sample size is large, n > 30
6.6 The P-value for the test and confidence interval for mean
6.7 Test about proportion
6.8 The P-value for the test and confidence interval for proportion
6.9 Confidence interval for proportion
6.10 Test about variance
6.11 The P-value for the test and confidence interval for variance
Chapter 7:
7. Hypothesis testing for two populations
7.1 Introduction
7.2 Test about the difference between the means of two populations, variances known
7.3 The P-value for the test and confidence interval for the difference between the means of two populations
7.4 Test about the difference between the means of two populations, variances unknown but assumed equal
7.5 The P-value for the test and confidence interval for the difference between the means of two populations
7.6 Test about the difference between the means of two populations, variances unknown and assumed not equal
7.7 The P-value for the test and confidence interval for the difference between the means of two populations
7.8 Test about the difference between the means of two populations, variances unknown but both sample sizes large, n1 > 30 and n2 > 30
7.9 The P-value for the test and confidence interval for the difference between the means of two populations
7.10 Test about the difference between the proportions of two populations
7.11 The P-value for the test and confidence interval for the difference between the proportions of two populations
Chapter 8:
8. Simple linear regression model
8.1 Introduction
8.2 Least squares estimator to determine the intercept and slope
8.3 Assessment of the regression: standard error of estimate, coefficient of determination
and t-test of the parameters.
8.4 Significance test for the regression model
Chapter 9:
9. Multiple linear regression models
9.1 Introduction
9.2 Normal equations
9.3 The coefficient of determination
9.4 Confidence intervals and significance tests
Part Four: Design of Experiments
Chapter 10:
10. The design and analysis of experiments
10.1 Introduction to design of experiments
10.2 One-way ANOVA
10.3 Two-way ANOVA
Appendix
Table 1: The standard normal (Z) distribution
Table 2: The Student's t-distribution
Table 3: The chi-squared distribution
Table 4: The F-distribution
Preface
This book provides an introduction to probability and statistics, with particular emphasis on
applications in applied sciences, technology and engineering. Introductory texts on engineering
statistics typically spend the first several chapters on basic probability ideas. In fact, basic
probability can easily fill a standard introductory course. Because engineering students often
have only one probability/statistics course, the material needs to be reorganized to allow for
coverage of statistical methodology.
This book is divided into four parts. Part One covers the basic concepts of probability and
probability distributions in Chapters 1 to 3. Chapter 1 gives a brief introduction to the basic
concepts of probability. Chapter 2 introduces discrete random variables and their probability
distributions. Continuous random variables and their probability distributions are discussed in
Chapter 3.
Part Two covers descriptive statistics in Chapters 4 and 5. Chapter 4 introduces ways to display
and summarize data. The random sample, the Central Limit Theorem, the normal approximation and
statistical process control are discussed in Chapter 5.
Engineering students must also be given some experience of basic data analysis, and inferential
statistics is an important part of the statistical methods used for it. Part Three covers
inferential statistics in Chapters 6 to 9. Chapter 6 introduces hypothesis testing for a single
population, and hypothesis testing for two populations is discussed in Chapter 7. Chapters 8 and 9
introduce simple and multiple linear regression, respectively.
Finally, Part Four gives students some experience of real applications in engineering. Related
topics such as factorial design and the design of experiments are discussed in Chapter 10.
Afza Shafie
Radzuan Razali
April 2010.
Chapter 1
1. The basic concept of probability
1.1 Introduction
Probability theory is the study of randomness and uncertainty. It forms the basis on which we
can make inferences about a population based on a distribution, and it provides methods for
quantifying the chance, or likelihood, associated with various outcomes. Probability helps to
explain many everyday occurrences, and we discuss it frequently.
Probability is also used every day in engineering and technology, for example the probability of
a good part being produced, or the reliability of a new machine (reliabilities are actually
probabilities).
Suppose an engineer wants to be fairly certain that the percentage of good rods is at least 90%;
otherwise he will shut down the process for recalibration. How certain can he be that at least
90% of the 1000 rods are good?
What is the difference between probability and inferential statistics? In probability, the
properties of the population under study are assumed known, and questions regarding a sample
taken from the population are posed and answered. In inferential statistics, the characteristics
of a sample are available to the experimenter, and this information enables the experimenter to
draw conclusions about the population.
1.1.1 Definitions
Some basic terms in probability must be known and well understood:
Random process: a situation in which the possible results are known but the actual result
cannot be predicted with certainty in advance.
Outcome: each possible result of a random process.
Experiment: a process by which an observation or measurement is obtained (it yields
outcomes).
1.2 Sample Space, probability of events, counting rule
Before data are collected, analysed and interpreted, it is crucial to decide how to model the
random experiment. Two related terms, the sample space and the event, are important.
1.2.1 Sample Space
The sample space, denoted by S, is the set of all possible outcomes of an experiment.
An event is any collection (subset) of outcomes contained in the sample space S.
An event is called simple if it consists of exactly one outcome and compound if it consists
of more than one outcome. The null event is an event with no outcomes; it is the impossible
event, or empty set.
Example 1.1:
Experiment: roll a die.
The sample space is S = {1, 2, 3, 4, 5, 6}.
The simple events (or outcomes) are:
E1 = {1} (observe the number 1)
E2 = {2}
E3 = {3}
E4 = {4}
E5 = {5}
E6 = {6}
Example 1.2:
Toss a coin three times and observe the number of heads. The sample space is
S = {0, 1, 2, 3}
The sample space for the lifetime of a machine (in hours) is
$S = \{ t \mid t \ge 0 \} = [0, \infty)$
The sample space for the number of calls at a telephone exchange during a specific time
interval is
S = {0, 1, 2, ...}
Some knowledge of set theory is needed to understand the basics of probability. The
union of events A and B, denoted by $A \cup B$ and read "A or B", is the event consisting of all
outcomes that are in A, in B, or in both.
The intersection of A and B, denoted by $A \cap B$ and read "A and B", is the event consisting
of all outcomes that are in both A and B.
The complement of event A, denoted by $A^c$, is the event consisting of all outcomes in the
sample space S that are not contained in A.
If two events A and B have no outcomes in common, they are said to be mutually
exclusive or disjoint. This means that if one of the events occurs, the other cannot.
All these events can be visualized with a Venn diagram.
1.2.2 Probability of Events
An event is a subset of the possible outcomes of an experiment. Each event E is assigned a
number P(E), called the probability of E, which gives a precise measure of the chance that E
will occur. When all outcomes are equally likely, the probability of an event E is the ratio of
the number of outcomes favorable to the event, n, to the total number of possible outcomes, N.
That is, P(E) = n/N.
For example, when a fair die is tossed repeatedly, what proportion of tosses do we expect, in
the long run, to show an even number? That is, what is P(E) for E = {2, 4, 6}?
Three outcomes are favorable to the event, so n = 3. The total number of possible outcomes is
six, so N = 6. Hence the probability that an even number occurs is
P(E) = 3/6 = 0.5
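The ratio n/N can be checked by listing the outcomes directly. The following is a minimal sketch, assuming equally likely outcomes; the function name `classical_probability` is only an illustrative choice.

```python
from fractions import Fraction

def classical_probability(event, sample_space):
    """Return P(E) = n/N, assuming every outcome is equally likely."""
    favourable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favourable), len(sample_space))

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a die
even = {2, 4, 6}                    # the event "an even number occurs"

p_even = classical_probability(even, sample_space)
print(p_even)  # 1/2
```

Using `Fraction` keeps the answer exact, which makes it easy to compare with a hand calculation.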
Conditions of Probability
A probability, denoted by P, is a rule (or function) that assigns a number between 0 and
1 to each event and must satisfy:
$0 \le P(E) \le 1$ for any event E
$P(\emptyset) = 0$, $P(S) = 1$
If $A_1, A_2, \ldots$ is an infinite collection of mutually exclusive events, then
$P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots$
$P(A') = 1 - P(A)$
For example, if P(rain tomorrow) = 0.6 then P(no rain tomorrow) = 0.4.
Other notations for the complement of A are $A^c$ and $\bar{A}$.
Example 1.3:
An oil-prospecting firm plans to drill two exploratory wells. Past evidence is used to
assess the possible outcomes listed in the following table:
Event   Description                             Probability
A       Neither well produces oil or gas        0.85
B       Exactly one well produces oil or gas    0.12
C       Both wells produce oil or gas           0.03
Find $P(A \cup B)$, $P(B \cup C)$ and $P(B')$, and give a description of each.
Solution:
Events A, B and C are mutually exclusive because the occurrence of one event
precludes the occurrence of either of the other two.
P(A or B) = P(A) + P(B) = 0.97 (the probability that at most one well produces oil or gas)
P(B or C) = P(B) + P(C) = 0.15 (the probability that at least one well produces oil or gas)
P(B') = 1 - P(B) = 0.88 (the probability that either neither well or both wells produce oil or gas)
1.2.3 General Addition Law
Let A and B be two events defined in a sample space S. Then
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
If two events A and B are mutually exclusive, then
$P(A \cap B) = 0$
Thus
$P(A \cup B) = P(A) + P(B)$
This can be extended to more than two mutually exclusive events.
Example 1.4
In one residential area in Ipoh, 45% of all households subscribe to the Sinar Harian
newspaper, published in a nearby city, 75% subscribe to Utusan Malaysia, and 30% of
all households subscribe to both papers. Draw a Venn diagram for this problem.
If a household is selected at random, what is the probability that it subscribes to
a) at least one of the two newspapers?
b) exactly one of the two newspapers?
Solution:
a) Let A = the household subscribes to Sinar Harian, B = the household subscribes to Utusan Malaysia.
$P(A \cup B) = P(A) + P(B) - P(A \cap B) = 0.45 + 0.75 - 0.30 = 0.9$
b) $P(\text{exactly one}) = P(A \cap B') + P(A' \cap B) = 0.15 + 0.45 = 0.6$
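The two answers in Example 1.4 can be reproduced with a few lines of arithmetic. This is a small sketch of the general addition law; the variable names are illustrative only.

```python
from fractions import Fraction

# Subscription probabilities from Example 1.4, kept exact with Fraction.
p_a = Fraction(45, 100)      # Sinar Harian
p_b = Fraction(75, 100)      # Utusan Malaysia
p_both = Fraction(30, 100)   # both papers

# General addition law: P(A or B) = P(A) + P(B) - P(A and B)
p_at_least_one = p_a + p_b - p_both

# Exactly one paper: remove the overlap from each event, then add.
p_exactly_one = (p_a - p_both) + (p_b - p_both)

print(float(p_at_least_one), float(p_exactly_one))  # 0.9 0.6
```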
When all outcomes are equally likely, the probability of an event A equals the number of
outcomes (sample points) contained in A divided by the total number of possible outcomes. That is:
P(A) = n(A) / n(S)
Important condition: all outcomes must be equally likely to occur. Listing the outcomes is
inefficient when n(S) is large.
1.2.4 Counting Rule
Counting rules eliminate the need to list each simple event and make it easy to assign
probabilities to events when the outcomes are equally likely. They are especially helpful if
the sample space is large.
Product (Multiplication) Rule
If there are k elements (or things) to choose, with n1 choices for the first
element, n2 for the second element, and so on up to nk choices for the kth element,
then the number of possible ways of selecting them is
$n_1 \times n_2 \times \cdots \times n_k$
The rule applies only when the elements are different or the order of the elements matters.
Example 1.5:
A chemical engineer wishes to conduct an experiment to determine how four
factors affect the quality of a coating. She is interested in comparing two charge levels,
three density levels, four temperature levels, and five speed levels. How many
experimental conditions are possible?
Solution:
The number of possible experimental conditions is 2 × 3 × 4 × 5 = 120.
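The product rule can be checked against a brute-force listing of every experimental condition. The sketch below assumes the four factors of Example 1.5; the dictionary keys are just labels.

```python
from itertools import product
from math import prod

# Number of levels for each factor in Example 1.5.
levels = {"charge": 2, "density": 3, "temperature": 4, "speed": 5}

# Product rule: n1 * n2 * ... * nk
n_by_rule = prod(levels.values())

# Brute force: list every combination of one level per factor.
conditions = list(product(*(range(n) for n in levels.values())))

print(n_by_rule, len(conditions))  # 120 120
```

The listing approach quickly becomes impractical as the number of factors grows, which is why the rule is useful.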
Permutations and Combinations
A permutation is an ordered arrangement of k objects taken from a set of n distinct objects
($k \le n$).
The number of permutations of k objects from n distinct objects is denoted
by the symbol $P_{k,n}$:
$P_{k,n} = P_k^n = \frac{n!}{(n-k)!}$
Example 1.6:
Eight teaching assistants are available to grade an exam of four questions. We wish to select a
different assistant to grade each question (only one assistant per question). In how many
ways can the assistants be chosen for grading?
Solution:
The number of possible ways is $P_k^n = P_4^8 = \frac{8!}{4!} = 1680$.
Combination
A combination is an unordered subset of k objects taken from a set of n distinct objects.
The number of combinations of k objects from n distinct objects is denoted by the
symbol $C_{k,n}$:
$C_{k,n} = C_k^n = \binom{n}{k} = \frac{n!}{k!\,(n-k)!} = \frac{P_{k,n}}{k!}$
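Example 1.7:
Fifteen players compete in a tournament. In how many ways can
a) rankings be assigned to the top five competitors?
b) the best five competitors be randomly chosen?
Solution
a) Order matters, so the number of rankings that can be assigned to the top five competitors is
$P_k^n = P_5^{15} = \frac{15!}{10!} = 360{,}360$
b) Order does not matter, so the number of ways to choose five competitors is
$C_k^n = C_5^{15} = \frac{15!}{5!\,10!} = 3003$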
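The counts in Examples 1.6 and 1.7 can be verified with the standard library. This sketch uses `math.perm` and `math.comb` (available from Python 3.8); the value for part (b) of Example 1.7 follows from the combination formula given earlier.

```python
from math import comb, factorial, perm

# Example 1.6: ordered choice of 4 graders out of 8 assistants.
ways_to_grade = perm(8, 4)

# Example 1.7 (a): ordered ranking of the top 5 out of 15 players.
rankings = perm(15, 5)

# Example 1.7 (b): unordered choice of 5 out of 15 players.
selections = comb(15, 5)

# A combination is a permutation with the k! orderings divided out.
assert selections == rankings // factorial(5)

print(ways_to_grade, rankings, selections)  # 1680 360360 3003
```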
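1.3 Conditional Probability
The conditional probability of A given B is
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, provided $P(B) > 0$
Similarly, the conditional probability of B given A is
$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$, provided $P(A) > 0$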
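Example 1.8:
The Information Resource Center (IRC), UTP displays three types of books, entitled
Science (S), Engineering (E), and Technology (T). The reading habits of a randomly
selected reader with respect to these types of books are:
Read regularly         Probability
S                      0.14
E                      0.23
T                      0.37
$S \cap E$             0.08
$S \cap T$             0.09
$E \cap T$             0.13
$S \cap E \cap T$      0.05
Find:
(a) $P(S \mid E)$
(b) $P(S \mid E \cup T)$
(c) $P(S \mid \text{reads at least one})$
(d) $P(S \cup E \mid T)$
Solution:
The seven disjoint regions of the Venn diagram have probabilities:
S only 0.02, E only 0.07, T only 0.20, S and E only 0.03, S and T only 0.04, E and T only 0.08,
all three 0.05.
(a) $P(S \mid E) = \frac{P(S \cap E)}{P(E)} = \frac{0.08}{0.23} = 0.3478$
(b) $P(S \mid E \cup T) = \frac{P(S \cap (E \cup T))}{P(E \cup T)} = \frac{0.12}{0.47} = 0.2553$
(c) $P(S \mid S \cup E \cup T) = \frac{P(S \cap (S \cup E \cup T))}{P(S \cup E \cup T)} = \frac{P(S)}{P(S \cup E \cup T)} = \frac{0.14}{0.49} = 0.2857$
(d) $P(S \cup E \mid T) = \frac{P((S \cup E) \cap T)}{P(T)} = \frac{0.17}{0.37} = 0.4595$,
since $P((S \cup E) \cap T) = P(S \cap T) + P(E \cap T) - P(S \cap E \cap T) = 0.09 + 0.13 - 0.05 = 0.17$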
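The conditional probabilities in Example 1.8 can be recomputed from the table by first splitting the Venn diagram into its seven disjoint regions. The following is a sketch under that approach; the helper names `prob` and `cond` are illustrative, and probabilities are stored in hundredths so the arithmetic stays exact. Note that $(S \cup E) \cap T$ consists of three regions with total probability 0.04 + 0.05 + 0.08 = 0.17, so the last value is 17/37.

```python
from fractions import Fraction

# Table values from Example 1.8, in hundredths.
table = {"S": 14, "E": 23, "T": 37, "SE": 8, "ST": 9, "ET": 13, "SET": 5}

# Disjoint Venn regions, keyed by the exact set of book types read.
regions = {
    frozenset("SET"): table["SET"],
    frozenset("SE"): table["SE"] - table["SET"],
    frozenset("ST"): table["ST"] - table["SET"],
    frozenset("ET"): table["ET"] - table["SET"],
}
for book, pairs in (("S", ("SE", "ST")), ("E", ("SE", "ET")), ("T", ("ST", "ET"))):
    # Inclusion-exclusion for the "only this type" region.
    regions[frozenset(book)] = table[book] - sum(table[p] for p in pairs) + table["SET"]

def prob(event):
    """Add up the regions for which the predicate `event` is true."""
    return Fraction(sum(p for r, p in regions.items() if event(r)), 100)

def cond(event, condition):
    """P(event | condition) = P(event and condition) / P(condition)."""
    return prob(lambda r: event(r) and condition(r)) / prob(condition)

p_s_given_e = cond(lambda r: "S" in r, lambda r: "E" in r)
p_s_given_e_or_t = cond(lambda r: "S" in r, lambda r: "E" in r or "T" in r)
p_s_given_any = cond(lambda r: "S" in r, lambda r: True)  # every region reads at least one type
p_s_or_e_given_t = cond(lambda r: "S" in r or "E" in r, lambda r: "T" in r)

for p in (p_s_given_e, p_s_given_e_or_t, p_s_given_any, p_s_or_e_given_t):
    print(p, round(float(p), 4))
```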
1.3.2 Independent Events
The probability that both events occur can be calculated by rearranging the terms in the
expression for conditional probability:
$P(A \cap B) = P(A \mid B)\,P(B)$
Two events A and B are called independent if the probability of event A is not affected by
the occurrence of event B, so that $P(A \mid B) = P(A)$ and $P(A \cap B) = P(A)\,P(B)$.
Example 1.9:
In rolling a fair die, let event A = {1, 3, 5} and event B = {4, 5, 6}.
Are events A and B independent?
Solution:
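One way to settle the question in Example 1.9 is to compare $P(A \cap B)$ with $P(A)\,P(B)$ directly. The sketch below assumes a fair die, so each outcome has probability 1/6; the helper `p` is an illustrative name.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space for one roll of a fair die
A = {1, 3, 5}
B = {4, 5, 6}

def p(event):
    """Classical probability of an event for equally likely outcomes."""
    return Fraction(len(event & S), len(S))

p_joint = p(A & B)            # only the outcome 5 is in both events
p_product = p(A) * p(B)

independent = p_joint == p_product
print(p_joint, p_product, independent)  # 1/6 1/4 False
```

Since 1/6 differs from 1/4, the events A and B are not independent.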
1.4 Multiplication rule
$P(A \cap B) = P(A \mid B)\,P(B)$
Events A and B are independent if and only if
$P(A \cap B) = P(A)\,P(B)$
If events $A_1, \ldots, A_k$ are independent, then
$P(A_1 \cap A_2 \cap \cdots \cap A_k) = P(A_1)\,P(A_2) \cdots P(A_k)$
The multiplication rule is most useful when the experiment consists of several stages in
succession. The conditioning event B describes the outcome of the first stage and A the
outcome of the second, so that $P(A \mid B)$, conditioning on what occurs first, will often be
known.
Example 1.11:
During a space shot, the primary computer system is backed up by two secondary
systems. They operate independently of one another, and each is 95% reliable. What is
the probability that all three systems will be operable at the time of the launch?
Solution
Let,
A1: event main system is operable
A2: event first backup is operable
A3: event second backup is operable
Given P(A1) = P(A2) = P(A3) = 0.95.
Since they operate independently,
$P(A_1 \cap A_2 \cap A_3) = P(A_1)\,P(A_2)\,P(A_3) = (0.95)^3 = 0.857$
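For independent systems the joint probability is simply the product of the individual reliabilities. A minimal sketch of the calculation in Example 1.11:

```python
from math import prod

# Reliability of the main system and the two backups (Example 1.11).
reliabilities = [0.95, 0.95, 0.95]

# Independence lets us multiply the individual probabilities.
p_all_operable = prod(reliabilities)

print(round(p_all_operable, 3))  # 0.857
```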
1.4.2 The Law of Total Probability
Suppose $B_1, B_2, \ldots, B_n$ are mutually exclusive and exhaustive in S. Then for any event A,
$P(A) = \sum_{i=1}^{n} P(A \cap B_i) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)$
1.4.3 Bayes' Theorem
Suppose $B_1, B_2, \ldots, B_n$ are mutually exclusive and exhaustive (their union is S). Let A
be an event such that P(A) > 0. Then for any event $B_k$, k = 1, 2, ..., n,
$P(B_k \mid A) = \frac{P(A \cap B_k)}{P(A)} = \frac{P(A \mid B_k)\,P(B_k)}{\sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)}$
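Example 1.12:
A store stocks bulbs for LCD projectors from three suppliers. Suppliers A, B, and C
supply 10%, 20%, and 70% of the bulbs respectively. It has been determined that
company A's bulbs are 1% defective while company B's are 3% defective and company
C's are 4% defective. If a bulb is selected at random and found to be defective, what is the
probability that it came from supplier B?
Solution:
Let D be the event that the selected bulb is defective. By Bayes' theorem,
$P(B \mid D) = \frac{P(B)\,P(D \mid B)}{P(A)\,P(D \mid A) + P(B)\,P(D \mid B) + P(C)\,P(D \mid C)} = \frac{(0.2)(0.03)}{(0.1)(0.01) + (0.2)(0.03) + (0.7)(0.04)} = 0.1714$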
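The law of total probability and Bayes' theorem can be combined in one small function. This is a sketch, not a library routine; `total_and_posterior` is a hypothetical name, and the 4% defect rate for supplier C is taken from Exercise 29 of this chapter, which states the same problem.

```python
def total_and_posterior(priors, likelihoods):
    """Return P(A) and the dictionary of P(B_k | A) for a partition B_1..B_n.

    priors[k] is P(B_k) and likelihoods[k] is P(A | B_k).
    """
    joint = {k: priors[k] * likelihoods[k] for k in priors}   # P(A and B_k)
    total = sum(joint.values())                               # law of total probability
    posterior = {k: v / total for k, v in joint.items()}      # Bayes' theorem
    return total, posterior

priors = {"A": 0.10, "B": 0.20, "C": 0.70}          # share of bulbs per supplier
defect_rates = {"A": 0.01, "B": 0.03, "C": 0.04}    # P(defective | supplier)

p_defective, posterior = total_and_posterior(priors, defect_rates)
print(round(p_defective, 3), round(posterior["B"], 4))  # 0.035 0.1714
```

The posterior probabilities always sum to 1, which is a convenient check on the arithmetic.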
Exercise Chapter 1:
1. Each message in a digital communication system is classified as to whether it is received
within the time specified by the system design. If 3 messages are classified, what is an
appropriate sample space for this experiment?
2. A digital scale is used that provides weights to the nearest gram. Let event A: a weight exceeds
11 grams, B: a weight is less than or equal to 15 grams, C: a weight is greater than or equal to
8 grams and less than 12 grams. What is the sample space for this experiment? Find
(a) $A \cup B$
(b) $A'$
(c) $A \cap B$
(d) $(A \cup C)'$
(e) $A \cap B \cap C$
(f) $B' \cap C$
3. Samples of building materials from three suppliers are classified for conformance to
air-quality specifications. The results from 100 samples are summarized as follows:
            Conforms
Supplier    Yes    No
R           30     10
S           22     8
T           25     5
Let A denote the event that a sample is from supplier R, and B denote the event that a sample
conforms to the specifications. If a sample is selected at random, determine the following
probabilities:
(a) $P(A)$
(b) $P(B)$
(c) $P(B')$
(d) $P(A \cup B)$
(e) $P(A \cap B)$
(f) $P(A' \cup B)$
(g) $P(A \mid B)$
(h) $P(B \mid A)$
4. The compact discs from a certain supplier are analyzed for scratch and shock resistance.
The results from 100 discs tested are summarized as follows:
                    Scratch Resistance
Shock Resistance    High    Low
High                30      10
Medium              22      8
Low                 25      5
Let A denote the event that a disc has high shock resistance, and B denote the event that a
disc has high scratch resistance. If a disc is selected at random, determine the following
probabilities:
(a) $P(A)$
(b) $P(B)$
(c) $P(B')$
(d) $P(A \cup B)$
(e) $P(A \cap B)$
(f) $P(A' \cup B)$
(g) $P(A \mid B)$
(h) $P(B \mid A)$
5. The reaction times (in minutes) of a reactor for two batches are measured in an
experiment.
(a) Define the sample space of the experiment.
(b) Define event A, where the reaction time of the first batch is less than 45 minutes, and event
B, where the reaction time of the second batch is greater than 75 minutes.
(c) Find $A \cup B$, $A \cap B$ and $A'$.
(d) Verify whether events A and B are mutually exclusive.
6. When a die is rolled and a coin is tossed, use a tree diagram to describe the set of possible
outcomes and find the probability that the die shows an odd number and the coin shows a
head.
7. A bag contains 3 black and 4 white balls. Two balls are drawn at random one at a time
without replacement.
(i) What is the probability that a second ball drawn is black?
(ii) What is the conditional probability that first ball drawn is black if the second ball is
known to be black?
8. An oil-prospecting firm plans to drill two exploratory wells. Past evidence is used to
assess the possible outcomes listed in the following table:
Event   Description                             Probability
A       Neither well produces oil or gas        0.80
B       Exactly one well produces oil or gas    0.18
C       Both wells produce oil or gas           0.02
Find $P(A \cup B)$, $P(B \cup C)$ and $P(B')$.
10. In a student organization election, we want to elect one president from five candidates,
one vice president from six candidates, and one secretary from three candidates. How many
possible outcomes are there?
11. Suppose each student is assigned a 5-digit number. How many different numbers can be
created?
12.
13. A menu has five appetizers, three soups, seven main courses, six salad dressings and eight
desserts. In how many ways can
(a) a full meal be chosen?
(b) a meal be chosen if either an appetizer or a soup is ordered, but not both?
14. Ten teaching assistants are available to grade a test of four questions. We wish to select a
different assistant to grade each question (only one assistant per question). In how many
ways can the assistants be chosen for grading?
15. A participant samples 8 products and is asked to pick the best, the second best, and the third
best. How many possible outcomes are there?
16. Suppose that in the taste test, each participant samples eight products and is asked to
select the three best products. What is the number of possible outcomes?
17. A contractor has 8 suppliers from which to purchase electrical supplies. He will select 3
of these at random and ask each supplier to submit a project bid. In how many ways can the
selection of bidders be made?
18.
19. Three balls are selected at random without replacement from the jar below. Find the
probability that one ball is red and two are black.
20.
21. There are 17 broken light bulbs in a box of 100 light bulbs. A random sample of 3 light
bulbs is chosen without replacement.
(a) How many ways are there to choose the sample?
(b)
(c)
(d)
(e)
22. An agricultural research establishment grows vegetables and grades each one as either
good or bad for taste, good or bad for its size, and good or bad for its appearance. Overall,
78% of the vegetables have a good taste. However, only 69% of the vegetables have both a
good taste and a good size. Also, 5% of the vegetables have a good taste and a good
appearance, but a bad size. Finally, 84% of the vegetables have either a good size or a good
appearance.
(a) If a vegetable has a good taste, what is the probability that it also has a good size?
(b) If a vegetable has a bad size and a bad appearance, what is the probability that it has a
good taste?
23. A local library displays three types of books, entitled Science (S), Arts (A), and
Novels (N). The reading habits of a randomly selected reader with respect to these types of
books are:
Read regularly         Probability
S                      0.14
A                      0.23
N                      0.37
$S \cap A$             0.08
$S \cap N$             0.09
$A \cap N$             0.13
$S \cap A \cap N$      0.05
24. A batch of 500 containers for frozen orange juice contains 5 that are defective. Two are
selected at random, without replacement, from the batch. Let A and B denote the events that
the first and second containers selected are defective, respectively.
(a) Are A and B independent events?
(b) If the sampling were done with replacement, would A and B be independent?
25. Every day (Mon to Fri) a batch of components sent by a first supplier arrives at a certain
inspection facility. Two days a week, a batch also arrives from a second supplier. Eighty
percent of all batches from supplier 1 pass inspection, and 90% batches of supplier 2 pass
inspection. On a randomly selected day, what is the probability that two batches pass
inspection?
26. The probability is 1% that an electrical connector that is kept dry fails during the
warranty period of a portable computer. If the connector is ever wet, the probability of a
failure during the warranty period is 5%. If 90% of the connectors are kept dry and 10% are
wet, what proportion of connectors fail during the warranty period?
27. Computer keyboard failures are due to faulty electrical connects (12%) or mechanical
defects (88%). Mechanical defects are related to loose keys (27%) or improper assembly
(73%). Electrical connect defects are caused by defective wires (35%), improper connections
(13%) or poorly welded wires (52%). Find the probability that a failure is due to
(a) loose keys
(b) improperly connected or poorly welded wires.
28. During a space shot, the primary computer system is backed up by two secondary
systems. They operate independently of one another, and each is 90% reliable. What is the
probability that all three systems will be operable at the time of the launch?
29. A store stocks light bulbs from three suppliers. Suppliers A, B, and C supply 10%, 20%,
and 70% of the bulbs respectively. It has been determined that company A's bulbs are 1%
defective while company B's are 3% defective and company C's are 4% defective. If a bulb
is selected at random and found to be defective, what is the probability that it came from
supplier B?
30. A particular city has three airports. Airport A handles 50% of all airline traffic, while
airports B and C handle 30% and 20%, respectively. The rates of losing a bag at airports
A, B and C are 0.3, 0.15 and 0.14 respectively. If a passenger arrives in the city and loses a
bag, what is the probability that the passenger arrived at airport A?
31. A company rated 75% of its employees as satisfactory and 25% unsatisfactory. Of the
satisfactory ones 80% had experience, of the unsatisfactory only 40%. If a person with
experience is hired, what is the probability that (s)he will be satisfactory?
32. In a certain assembly plant, three machines, B1, B2, B3, make 30%, 45% and 25%,
respectively, of the products. It is known from past experience that 2%, 3% and 2% of the
products made by each machine, respectively, are defective. Now, suppose that a finished
product is randomly selected.
(a) What is the probability that it is defective?
(b) If a product was chosen randomly and found to be defective, what is the probability
that it was produced by machine B3?
33. Three machines A, B and C produce identical items. Of their respective outputs, 5%, 4%
and 3% of the items are faulty. On a certain day A has produced 25%, B has produced 30%
and C has produced 45% of the total output. An item selected at random is found to be
faulty. What is the probability that it was produced by C?
34. Suppose that a test for Influenza A, H1N1 disease has a very high success rate: if a tested
patient has the disease, the test accurately reports this, a positive, 99% of the time, and if a
tested patient does not have the disease, the test accurately reports that, a negative, 95% of
the time. Suppose also, however, that only 0.1% of the population have that disease.
(a) What is the probability that the test returns a positive result?
(b) If the patient has a positive result, what is the probability that he has the disease?
(c)
35.
Chapter 2
2. Discrete random variable and probability distribution
2.1 Introduction
A random variable is a rule that assigns a number to each outcome of an experiment.
These numbers are called the measured values of the random variable. Capital letters
such as X, Y and Z are used to denote random variables, and small letters such as x, y and z
denote their measured values.
Example 2.1:
Select a soccer player; the random variable Y is the number of goals the player has
scored during the season.
The measured values of Y are 0, 1, 2, 3, ...
Consider the test marks of 100 engineering students; the random variable Z is the average
mark scored by the students.
The values of Z are 65.4, 67.8, 70.5, 77.3, ...
There are two types of random variables: discrete random variables and
continuous random variables.
2.2 Discrete random variable
The measured values of a discrete random variable are finite or countable, and are typically
integers. The number of students in this class is an example of a
discrete random variable.
Solution:
(i) discrete
(ii) discrete
(iii) continuous
(iv) discrete
(v) continuous
(vi) continuous
(vii) continuous
(viii) discrete
P(X = x_i):   1/4   1/2   1/4
The above table represents a discrete probability distribution. The probability function
$P(X = x_i)$ is called the probability mass function (pmf) of X because it relates each value of a
discrete random variable to its probability of occurrence.
2.4.1
The pmf $P(X = x_i)$ of a discrete random variable X must satisfy two conditions:
(i) $0 \le P(X = x_i) \le 1$
(ii) $\sum_{x_i} P(X = x_i) = 1$
Given the pmf, the probability of any event involving X can be calculated. For example, the
probability that at most one occurs is $P(X \le 1) = P(X = 0) + P(X = 1)$.
Example 2.3:
Two balls are drawn at random in succession without replacement from an urn containing
4 red balls and 6 black balls. Find the probabilities of all the possible outcomes.
Solution:
Let X denote the number of red balls in the outcome.
Possible outcomes   RR   RB   BR   BB
X                   2    1    1    0
Here, x1 = 2, x2 = 1, x3 = 1, x4 = 0.
Now, the probability of getting 2 red balls when we draw the balls one at a time is found as follows.
Probability of the first ball being red = 4/10
Probability of the second ball being red = 3/9 (because there are 3 red balls left in the urn, out
of a total of 9 balls left). So:
$P(RR) = \frac{4}{10} \times \frac{3}{9} = \frac{2}{15}$
Likewise, the probability of red first is 4/10, followed by black with probability 6/9 (because
there are 6 black balls still in the urn and 9 balls altogether). So:
$P(RB) = \frac{4}{10} \times \frac{6}{9} = \frac{4}{15}$
In the same way, $P(BR) = \frac{6}{10} \times \frac{4}{9} = \frac{4}{15}$ and $P(BB) = \frac{6}{10} \times \frac{5}{9} = \frac{5}{15}$, which gives:
X          2      1      0
P(X = x)   2/15   8/15   5/15
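Example 2.4:
Given the probability distribution,
X          0      1     2   3     4      5
P(X = x)   1/10   1/5   k   1/5   3/10   1/10
the probabilities must sum to 1, so
$\sum_{x_i} P(X = x_i) = \frac{1}{10} + \frac{1}{5} + k + \frac{1}{5} + \frac{3}{10} + \frac{1}{10} = 1 \Rightarrow k = \frac{1}{10}$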
The cumulative distribution function (cdf) of a discrete random variable X is
$F(x) = P(X \le x) = \sum_{x_i \le x} P(X = x_i)$
Example 2.5:
Given the pmf,
X          2      1      0
P(X = x)   2/15   8/15   5/15
the cdf is
$F(x) = \begin{cases} 0, & x < 0 \\ 5/15, & 0 \le x < 1 \\ 13/15, & 1 \le x < 2 \\ 1, & x \ge 2 \end{cases}$
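The step shape of the cdf in Example 2.5 comes from adding up the pmf over all values not exceeding x. The following sketch illustrates this; the function name `cdf` is an illustrative choice, and `Fraction` reduces 5/15 to 1/3 automatically.

```python
from fractions import Fraction

# pmf from Example 2.5: number of red balls in two draws.
pmf = {0: Fraction(5, 15), 1: Fraction(8, 15), 2: Fraction(2, 15)}

def cdf(x):
    """F(x) = P(X <= x): sum the pmf over all values x_i <= x."""
    return sum((p for value, p in pmf.items() if value <= x), Fraction(0))

# The cdf is constant between the possible values and jumps at each one.
for x in (-1, 0, 0.5, 1, 1.5, 2, 3):
    print(x, cdf(x))
```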
The mean (expected value) of X is
$E(X) = \mu = \sum_{i \ge 1} x_i\,P(X = x_i)$
and the variance of X is
$Var(X) = \sigma^2 = E(X^2) - (E(X))^2 = \sum_{i \ge 1} x_i^2\,P(X = x_i) - \mu^2$
X          2      1      0
P(X = x)   2/15   8/15   5/15
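As a quick check of the formulas for the mean and variance, the following sketch evaluates them for the pmf with values 2, 1, 0 and probabilities 2/15, 8/15, 5/15. The variable names are illustrative only.

```python
from fractions import Fraction

pmf = {2: Fraction(2, 15), 1: Fraction(8, 15), 0: Fraction(5, 15)}

# E(X) = sum of x_i * P(X = x_i)
mean = sum(x * p for x, p in pmf.items())

# Var(X) = E(X^2) - (E(X))^2
second_moment = sum(x**2 * p for x, p in pmf.items())
variance = second_moment - mean**2

print(mean, second_moment, variance)  # 4/5 16/15 32/75
```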
2.5.1 Bernoulli distribution
A Bernoulli experiment is an experiment with only two possible outcomes. Consider tossing
a fair coin once, where X is the number of heads. There are only two possible outcomes,
X = 0 or X = 1, with probability distribution:
Possible outcomes   Head   Tail
X                   1      0
P(X = x)            1/2    1/2
2.5.2 Binomial distribution
If the Bernoulli experiment is conducted n times, and the random variable X is the
number of successes, then the probability distribution of X is called the Binomial distribution,
with pmf
$P(X = x) = \binom{n}{x} p^x q^{n-x}, \quad x = 0, 1, 2, \ldots, n$
where p is the probability of success and q = 1 - p.
Using the definition, it can be shown that if X has a Binomial distribution, then the
mean of X is E(X) = np and the variance of X is Var(X) = npq.
Example 2.7:
A fair coin is tossed 10 times, and X is the number of heads.
(i) What is the pmf of X?
(ii) Find the probability that heads appears exactly 5 times.
(iii) What is the probability of no heads?
(iv) Find the mean and the variance of X.
Solution:
(i) $P(X = x) = \binom{10}{x} (0.5)^x (0.5)^{10-x}, \quad x = 0, 1, 2, \ldots, 10$
(ii) $P(X = 5) = \binom{10}{5} (0.5)^5 (0.5)^5 = \frac{10!}{5!\,5!} (0.5)^5 (0.5)^5 = 0.246$
(iii) $P(X = 0) = \binom{10}{0} (0.5)^0 (0.5)^{10} = \frac{10!}{0!\,10!} (0.5)^0 (0.5)^{10} = 0.00098$
(iv) $E(X) = np = 10(0.5) = 5$ and $Var(X) = npq = 10(0.5)(0.5) = 2.5$
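The binomial probabilities in Example 2.7 can be evaluated numerically, and the formulas E(X) = np and Var(X) = npq can be checked against the definition of expectation. This is a minimal sketch; `binomial_pmf` is an illustrative helper, not a library function.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for a Binomial(n, p) random variable."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.5   # ten tosses of a fair coin

print(round(binomial_pmf(5, n, p), 4))  # 0.2461
print(round(binomial_pmf(0, n, p), 5))  # 0.00098

# Mean and variance from the definition, to compare with np and npq.
mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
variance = sum(x**2 * binomial_pmf(x, n, p) for x in range(n + 1)) - mean**2
print(mean, variance)
```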
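$P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, 3, \ldots$
Hence, the probability that Anne receives more than 1 call in the next 15 minutes is
$P(X > 1) = 1 - P(X \le 1) = 1 - [P(X = 0) + P(X = 1)] = 1 - \left[ \frac{e^{-0.75} (0.75)^0}{0!} + \frac{e^{-0.75} (0.75)^1}{1!} \right]$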
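The Poisson probability above can be evaluated numerically. The sketch below assumes a mean of 0.75 calls in the 15-minute interval, as used in the expression; `poisson_pmf` is an illustrative helper.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return exp(-lam) * lam**x / factorial(x)

lam = 0.75   # mean number of calls in 15 minutes

# P(X > 1) = 1 - [P(X = 0) + P(X = 1)]
p_more_than_one = 1 - (poisson_pmf(0, lam) + poisson_pmf(1, lam))

print(round(p_more_than_one, 4))  # 0.1734
```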
Exercise 2
1. Identify each of the random variables as continuous or discrete random variable.
(a) The number of atoms
(b) The number of fish in a pond
(c) The home team score in a football game
(d) The voltage on a power line
(e) A score on the mathematics final exam
(f) The volume of gas in the tank
(g) The number of cars at the petrol station
(h) The number of accidents in Ipoh
(i) The number of cakes left in the pantry
0.1
5
0.2
5
0.3
5
(a)
(b)
(c)
(d)
(e)
50
10
0
15
0
200
0.1
5
0.3
5
0.2 0.0
5
4. Let X denote the number of bars of service on your cell phone whenever you are at an
intersection with the following probabilities:
x
5. A local cab company is interested in the number of pieces of luggage a cab carries on a taxi
run. A random sample of 260 taxi runs gave the following information, where x is the number of
pieces of luggage and f is the frequency with which taxi runs carried x pieces of luggage.
x:   0    1    2    3    4    5    6    7    8   9   10
f:   42   51   63   38   19   16   12   10   6   2   1
(c) at least three vehicles visiting the drive-thru within a ten-minute interval during one of
these slow periods.
10. The number of cracks in a section of PLUS highway that are significant enough to require
repair is assumed to follow a Poisson distribution with a mean of two cracks per kilometer.
Determine the probability that
(a) there are no cracks at all in 2 km of highway;
(b) there is at least one crack in 500 m of highway; and
(c) there are exactly 3 cracks in 0.5 km of highway.
Chapter 3
3. Continuous random variable and probability distribution
3.1 Introduction
If the outcome of an experiment is a continuous random variable, its
probability distribution is called a continuous probability distribution and is described by a probability
density function, pdf.
3.1.1 The properties of the probability density function (pdf)
The pdf, f(x), of a continuous random variable X must satisfy two conditions:
(i) $f(x) \ge 0$ for all x
(ii) $\int_{-\infty}^{\infty} f(x)\,dx = 1$
Given the pdf, the probability of any event involving X can be calculated. For example, the probability that X is at
most one is $P(X \le 1) = \int_{-\infty}^{1} f(x)\,dx$.
Example 3.1:
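Let X be a continuous random variable with pdf given by,
$f(x) = \begin{cases} kx^2, & 0 \le x \le 2 \\ 0, & \text{elsewhere} \end{cases}$
Find the value of the constant k.
Solution:
Since f(x) is a pdf, $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
So,
$\int_{-\infty}^{\infty} f(x)\,dx = \int_0^2 kx^2\,dx = k \left[ \frac{x^3}{3} \right]_0^2 = \frac{8}{3}k = 1 \;\Rightarrow\; k = \frac{3}{8}$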
3.1.2
The cumulative distribution function (cdf) of X is
$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt, \quad -\infty < x < \infty$
Example 3.2:
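Let X be a continuous random variable with pdf given by,
$f(x) = \begin{cases} 3x^2, & 0 \le x \le 1 \\ 0, & \text{elsewhere} \end{cases}$
Find,
(i) $P(X \le 0.5)$ and $P(0.5 \le X \le 0.75)$
(ii) the cdf of X.
Solution:
(i)
$P(X \le 0.5) = \int_{-\infty}^{0.5} f(x)\,dx = \int_0^{0.5} 3x^2\,dx = \left[ x^3 \right]_0^{0.5} = (0.5)^3 = \frac{1}{8}$
$P(0.5 \le X \le 0.75) = \int_{0.5}^{0.75} 3x^2\,dx = \left[ x^3 \right]_{0.5}^{0.75} = (0.75)^3 - (0.5)^3 = \frac{19}{64}$
(ii)
The cdf of X,
For $x < 0$, $F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt = 0$
For $0 \le x \le 1$, $F(x) = \int_0^x 3t^2\,dt = x^3$
For $x > 1$, $F(x) = 1$
Hence,
$F(x) = \begin{cases} 0, & x < 0 \\ x^3, & 0 \le x \le 1 \\ 1, & x > 1 \end{cases}$
3.1.3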
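Given the pdf of X, f(x), the parameters of X such as the mean, the variance
and the standard deviation can be determined by using the definition of expectation.
The mean of X is defined by,
$\mu = E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$
and the variance of X by,
$Var(X) = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2$
Example:
Let X be a continuous random variable with pdf given by,
$f(x) = \begin{cases} \frac{3}{8}x^2, & 0 \le x \le 2 \\ 0, & \text{elsewhere} \end{cases}$
Find the mean and the variance of X.
Solution:
The mean of X is
$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_0^2 \frac{3}{8}x^3\,dx = \left[ \frac{3x^4}{32} \right]_0^2 = \frac{3}{2} = 1.5$
The variance of X is
$Var(X) = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2 = \int_0^2 \frac{3}{8}x^4\,dx - (1.5)^2 = \left[ \frac{3x^5}{40} \right]_0^2 - \frac{9}{4} = \frac{12}{5} - \frac{9}{4} = \frac{3}{20} = 0.15$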
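The integrals for the pdf f(x) = (3/8)x² on 0 ≤ x ≤ 2 can also be checked numerically. The sketch below uses a simple midpoint rule; the helper name `integrate` and the number of steps are illustrative choices, not part of the text.

```python
def integrate(g, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

def f(x):
    """pdf f(x) = (3/8) x^2 on 0 <= x <= 2."""
    return 3 * x**2 / 8

total = integrate(f, 0, 2)                    # should be close to 1
mean = integrate(lambda x: x * f(x), 0, 2)    # E(X)
variance = integrate(lambda x: x**2 * f(x), 0, 2) - mean**2  # E(X^2) - mu^2

print(total, mean, variance)
```

The same helper can be reused for any pdf that is nonzero on a finite interval by changing `f` and the limits.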
Uniform distribution
If the random variable X has a uniform distribution on the interval [a, b], then the pdf of X is given by,
$f(x) = \begin{cases} \frac{1}{b-a}, & a \le x \le b \\ 0, & \text{elsewhere} \end{cases}$
3.2.2
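Exponential distribution
If the random variable X has an exponential distribution with parameter $\lambda > 0$, then the pdf of X is given by,
$f(x) = \begin{cases} \lambda e^{-\lambda x}, & x \ge 0 \\ 0, & \text{elsewhere} \end{cases}$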
3.2.3
Normal distribution
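If the random variable X has a Normal distribution with mean $\mu$ and variance $\sigma^2$, then the pdf of X is given by,
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$
The standard normal random variable $Z = (X - \mu)/\sigma$ has pdf
$\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}}, \quad -\infty < z < \infty$
$X \sim N(\mu, \sigma^2)$
where $\mu = 15$ and $\sigma^2 = (5)^2$
Hence,
$P(X > 25) = P\left( \frac{X - \mu}{\sigma} > \frac{25 - 15}{5} \right) = P(Z > 2) = 1 - P(Z \le 2) = 1 - \Phi(2) = 1 - 0.977 = 0.023$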
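The table lookup of Φ(2) can be reproduced with the error function from the Python standard library. This is a sketch only; `normal_cdf` is a helper name introduced here, and the values 15, 5 and 25 are the ones used in the calculation above.

```python
from math import erf, sqrt

def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Upper-tail probability P(X > 25) for X ~ N(15, 5^2).
p_upper = 1.0 - normal_cdf(25, mu=15, sigma=5)
print(round(p_upper, 3))
```

Standardising first and calling `normal_cdf(2)` gives the same probability, since (25 - 15)/5 = 2.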
Exercise 3
1. Suppose that X is a continuous random variable having the probability density function
$f(x) = \begin{cases} kx^2, & -1 \le x \le 1 \\ 0, & \text{elsewhere} \end{cases}$
(a)
(b)
(c)
(d)
2.
k x , 1 x 2
f ( x)
, elsewhere
Find
(a) the value of constant k
(b) P(X < 1)
(c) the mean of X
(d) the standard deviation of X.
3. Let X be a continuous random variable with pdf given by
$f(x) = \begin{cases} kxe^{-2x}, & x \ge 0 \\ 0, & x < 0 \end{cases}$
Find
(a) the value of constant k
(b) P(X > 1)
(c) P(0 < X < 2)
(d) the mean of X
(e) the variance of X.
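4. Let X be a continuous random variable with pdf given by
$f(x) = \begin{cases} kx^2, & 0 \le x \le 3 \\ 0, & \text{elsewhere} \end{cases}$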
Find
(a) the value of constant k
(b) the cdf, F(x)
(c) P(X >1)
(d) the mean of X
(e) the variance of X.
5. Find the cumulative probability distribution of X given that the density function
is
$f(x) = \begin{cases} k(1 - x^4), & 0 \le x \le 1 \\ 0, & \text{elsewhere} \end{cases}$
Find
(a) the value of constant k
(b) the cdf, F(x)
(c) P(0.25 < X < 0.5)
(d) the mean of X
(e) the variance of X.
9. The time between telephone calls to ASTRO, a cable television payment processing
center, follows an exponential distribution with a mean of 1.5 minutes. What is the
probability that the time between the next two calls is
(a) at least 45 seconds?
(b) between 50 and 100 seconds? and
(c) at most 150 seconds?
10. The mean weight of 500 UTP students is 68 kg and the variance is 72.25 kg². Assuming that the
weights are normally distributed, find the probability that a student weighs
(a) between 65 kg and 72 kg
(b) more than 70 kg
11. An average LCD Projector bulb manufactured by the ABC Corporation lasts 300
days with a variance of 2500 days². Assuming that the bulb life is normally
distributed, what is the probability that the bulb will last
(a) at most 365 days?
(b) between 250 days and 350 days?
(c) at least 400 days?
12. The line width of a tool used for semiconductor manufacturing is assumed to be
normally distributed with a mean of 0.5 micrometer and a standard deviation of 0.05
micrometer.
(a) What is the probability that a line width is greater than 0.62 micrometer?
(b) What is the probability that a line width is between 0.47 and 0.63 micrometer?
(c) The line width of 90% of samples is below what value?
oooOOOooo
Chapter 4
4. Data display and summary of data
4.1 Introduction
The major use of inferential statistics is to use information from a sample to infer
something about a population. A population is a collection of data whose
properties are analyzed. The population is the complete collection to be studied; it
contains all subjects of interest. A sample is a part of the population of interest, a
sub-collection selected from a population. A parameter is a numerical
measurement that describes a characteristic of a population, while a statistic is a
numerical measurement that describes a characteristic of a sample. In general, we
will use a statistic to infer something about a parameter.
4.2 Mean and variance
The mean is the sum of all numbers in the list divided by the number of items in the
list. If the list is a statistical population, the mean is called the population
mean; if the list is a statistical sample, the mean is called the sample
mean. The sample mean has an expected value of $\mu$, the population mean, so the
sample mean is a good estimator of the population mean.
Often, since the population variance is an unknown parameter, it is estimated by
the mean sum of squares, which changes the distribution of the standardized sample mean from
a normal distribution to a Student's t distribution with n - 1 degrees of freedom.
The mean and the variance of a population and the mean and the variance of a sample
are defined as follows. Comparing the equations shows the
difference.
Population mean and variance are defined as:
Mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
Variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
Sample mean and variance are defined as:
Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
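$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{55 + 68 + 90 + 42 + 89 + 70}{6} = \frac{414}{6} = 69$
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \right] = \frac{1}{5} \left[ 30334 - \frac{(414)^2}{6} \right] = 353.6$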
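The same sample mean and sample variance can be obtained with Python's standard `statistics` module, whose `variance` function uses the n - 1 divisor. The sketch below also evaluates the computational shortcut formula; the variable names are illustrative.

```python
import statistics

data = [55, 68, 90, 42, 89, 70]  # the six observations used above
n = len(data)

sample_mean = statistics.mean(data)
sample_var = statistics.variance(data)  # divides by n - 1

# Shortcut formula: [sum(x^2) - (sum(x))^2 / n] / (n - 1)
shortcut_var = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

print(sample_mean, sample_var, shortcut_var)
```

For a population rather than a sample, `statistics.pvariance` divides by N instead.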
85 80 82 83 79 79 71
80 77 89
Use the Stem and Leaf plot to determine the mode and the median for the
temperatures.
Solution:
The first step is to place the numbers in order from the smallest to the largest.
Temperatures
Tens | Ones
5    | 0 5 9
6    | 1 1 2 2 4 5 5 5 5 8 9
7    | 0 0 1 5 6 7 7 9 9
8    | 0 0 0 2 2 3 5 9
The mode is 65
Step 5: Calculate 1.5 IQR and mark the points lying 1.5 IQR below the lower
quartile and 1.5 IQR above the upper quartile. Any value that falls outside this
range is called an outlier. Any value that falls more than 3 IQR
beyond the quartiles is called an extreme outlier.
Example 4.3
Suppose that thirty UTP students live in Village 2. Their ages are as follows:
18, 20, 21, 26, 24, 19, 25, 20, 22, 21,
19, 24, 25, 28, 24, 20, 26, 20, 35, 17,
18, 24, 20, 21, 22, 27, 25, 28, 27, 24.
Step 1: Place the numbers in order from smallest to the largest.
17, 18, 18, 19, 19, 20, 20, 20, 20, 20,
21, 21, 21, 22, 22, 24, 24, 24, 24, 24,
25, 25, 25, 25, 26, 26, 27, 27, 28, 35.
Step 2: Find the median, Q2, the lower quartile, Q1, and the upper quartile, Q3, of the
given set of data.
The median, Q2 = (X15 + X16)/2 = (22 + 24)/2 = 23
The position of Q1 = (0.25)(n + 1) = 0.25(31) = 7.75
So the lower quartile, Q1 = X7 + 0.75(X8 - X7) = 20 + 0.75(20 - 20) = 20
The position of Q3 = (0.75)(n + 1) = 0.75(31) = 23.25
So the upper quartile, Q3 = X23 + 0.25(X24 - X23) = 25 + 0.25(25 - 25) = 25
Step 3: The interquartile range (IQR) = Q3 - Q1 = 25 - 20 = 5
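The quartile positions in Steps 2 and 3 can be sketched in code. The function below uses the (n + 1) position rule with linear interpolation, applied to the ordered list from Step 1; the function name and the handling of positions that fall outside the data are assumptions made for this illustration.

```python
def quantile_n_plus_1(sorted_data, q):
    """Quantile at 1-based position q*(n+1), interpolating between neighbours."""
    n = len(sorted_data)
    pos = q * (n + 1)
    k = int(pos)        # integer part of the position
    frac = pos - k      # fractional part used for interpolation
    if k < 1:
        return sorted_data[0]
    if k >= n:
        return sorted_data[-1]
    return sorted_data[k - 1] + frac * (sorted_data[k] - sorted_data[k - 1])

# Ordered ages as listed in Step 1.
ages = [17, 18, 18, 19, 19, 20, 20, 20, 20, 20,
        21, 21, 21, 22, 22, 24, 24, 24, 24, 24,
        25, 25, 25, 25, 26, 26, 27, 27, 28, 35]

q1 = quantile_n_plus_1(ages, 0.25)
q2 = quantile_n_plus_1(ages, 0.50)
q3 = quantile_n_plus_1(ages, 0.75)
iqr = q3 - q1

# Values more than 1.5 IQR beyond the quartiles are flagged as outliers.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in ages if x < low_fence or x > high_fence]
print(q1, q2, q3, iqr, outliers)
```

Other software may use a different interpolation rule for quartiles, so small differences from these values are possible on other data sets.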
[Box plot of the ages: whiskers from 17 to 28, Q1 = 20, Q2 = 23, Q3 = 25, IQR = 5, 1.5 IQR limits at 12.5 and 32.5, outlier at 35]
Exercise 4:
1. Find the mean, median and mode for the following observations:
6.5 7.8 4.6 3.7 6.5 9.2 12.1 6.5 3.7 10.8
2. Find the mean, median and mode for the following observations:
2.3 3.6 2.6 2.8 3.2 3.6 4.3 5.2 6.9 2.8 3.6
5. Find the mean, variance and standard deviation of the following samples of marks for
the probability and statistics final examination.
84.9
75.0
69.3
59.5
48.2
38.3
38.4
81.9
73.8
68.6
58.3
48.0
37.4
80.8
72.7
67.5
58.5
47.8
36.8
79.4
72.6
66.8
57.6
46.5
36.5
78.2
71.4
65.2
56.9
45.9
35.6
76.5
70.9
64.4
55.2
44.6
34.9
6. Find the mean, variance and standard deviation of the following samples of marks for
the engineering drawing course.
98.4
89.6
79.7
69.8
59.2
59.6
39.8
98.1
88.7
78.2
68.6
58.3
48.0
7. The shear strengths of 100 spot welds in a titanium alloy follow. Construct a stem-and-leaf
diagram for the weld strength data and comment on any important features that
you notice.
5408 5431 5475 5442 5376 5388 5459 5422 5416 5435
5420 5429 5401 5446 5487 5416 5382 5357 5388 5457
5407 5469 5416 5377 5454 5375 5409 5459 5445 5429
5463 5408 5481 5453 5422 5354 5421 5406 5444 5466
5399 5391 5477 5447 5329 5473 5423 5441 5412 5384
5445 5436 5454 5453 5428 5418 5465 5427 5421 5396
5381 5425 5388 5388 5378 5481 5387 5440 5482 5406
5401 5411 5399 5431 5440 5413 5406 5342 5452 5420
5458 5485 5431 5416 5431 5390 5399 5435 5387 5462
5383 5401 5407 5385 5440 5422 5448 5366 5430 5418
94.1
87.3
94.1
92.4
84.6
85.4
93.2
84.1
92.1
90.6
83.6
86.6
90.6
90.1
96.4
89.1
85.4
91.7
91.4
95.2
88.2
88.8
89.7
87.5
88.2
86.1
86.4
86.4
87.6
84.2
86.1
94.3
85.0
85.1
85.1
85.1
95.1
93.2
84.9
84.0
89.6
90.5
90.0
86.7
78.3
93.7
90.0
95.6
92.4
83.0
89.6
87.7
90.1
88.3
87.3
95.3
90.3
90.6
94.3
84.1
86.6
94.1
93.1
89.4
97.3
83.7
91.2
97.8
94.6
88.6
96.8
82.9
86.1
93.1
96.3
84.1
94.4
87.3
90.4
86.4
94.7
82.6
96.1
86.4
89.1
87.6
91.1
83.1
98.0
84.5
(a) Construct a cumulative frequency plot and histogram for the yield
(b) Construct a stem-and-leaf display for these data.
(c) Find the median, the quartiles, and the 5th and 95th percentiles for the yield
9. The average ages of the football players on each team of the premier league are as follows.
29.4
29.8
29.4
31.8
32.7
34.0
28.5
27.9
30.9
29.3
28.8
28.6
29.1
31.0
30.7
30.3
29.7
31.0
28.4
28.9
27.7
28.7
30.5
29.8
26.6
27.9
27.9
29.9
29.3
28.1
(a) Construct a cumulative frequency plot and histogram for the average ages.
(b) Construct a stem-and-leaf display for these data.
(c) Find the median, the quartiles, and the 5th and 95th percentiles for the average ages.
10. The cold start ignition times of an automobile engine obtained for a test
vehicle are as follows:
1.75 1.92 2.62 2.35 3.09 3.15 2.53 1.91
(a) Calculate the sample median, the quartiles and the IQR.
(b) Construct a box plot of the data.
11. The following data are the joint temperatures of the O-rings (°F) for each test firing or
actual launch of the space shuttle rocket motor (from Presidential Commission on the
Space Shuttle Challenger Accident, Vol. 1, pp. 129-131): 84, 49, 61, 40, 83, 67, 45,
66, 70, 69, 80, 58, 68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67, 53, 67, 75, 61, 70, 81,
76, 79, 75, 76, 58, 31.
(a) Compute the sample mean and sample standard deviation;
(b) Calculate the median, the quartiles and the IQR;
(c) Construct a box plot of the data and comment on the possible presence of outliers.
12. Ipoh Pantai Hospital compiles data on the length of stay by patients in short-term
hospitals. A random sample of 28 patients yielded the following data on length of
stay, in days.
3 4 5 6 6 4 10 8 15 12 13 11 7 3 55 1 18 9 6 12 7 1 23 9 9 4 21 10
Chapter 5
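5. Random sample, Central Limit Theorem, Normal Approximation and Statistical process
control: X-bar and R-charts.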
Example 5.1:
At the chemical engineering department, Universiti Teknologi PETRONAS, the mean
age of the students is 20.6 years, and the variance is 20 years². A random
sample of 80 students is drawn from 250 students. What is the probability that the
average age of these students is greater than 22 years?
Solution:
The mean of X is $E(X) = \mu = 20.6$ and the variance of X is $V(X) = \sigma^2 = 20$.
For n = 80, the mean of $\bar{X}$ is $E(\bar{X}) = \mu = 20.6$ and $V(\bar{X}) = \frac{\sigma^2}{n} = \frac{20}{80} = 0.25$.
Hence,
$\bar{X} \sim N(20.6, 0.25)$
So, $P(\bar{X} > 22) = P\left( Z > \frac{22 - 20.6}{\sqrt{0.25}} \right) = P(Z > 2.8) = 1 - P(Z \le 2.8) = 1 - \Phi(2.8) = 1 - 0.9974 = 0.0026$
5.3.1
The Central Limit Theorem says that as n increases, the binomial distribution with
n trials and probability p of success gets closer and closer to a normal distribution.
That is, the binomial probability of any event gets closer and closer to the normal
probability of the same event.
The normal distribution is a good approximation to the Binomial when n is
sufficiently large and p is not too close to 0 or 1. How large n needs to be depends
on the value of p. It is better to be conservative and limit the use of the normal
distribution as an approximation to the binomial to cases where np > 5 and n(1 - p) > 5.
That is, if we have a random variable X ~ Bin(n, p) with np > 5 and n(1 - p) > 5,
then probabilities for X can be calculated approximately using the Normal
distribution. The random variable X is treated as approximately normally distributed with
mean $\mu = np$ and variance $\sigma^2 = np(1 - p)$, i.e. $X \approx N(np, np(1 - p))$.
Example 5.2:
Suppose a fair coin is tossed 20 times. What is the probability
of getting between 9 and 11 heads (inclusive)?
Solution:
Let X be the random variable representing the number of heads thrown.
X ~ Bin(20, 0.5)
Since np = 10 > 5 and n(1 - p) = 10 > 5, we can use the normal approximation to find the
probability. X is then approximately normally distributed with mean np = 10 and
variance np(1 - p) = 5, i.e. X ≈ N(10, 5). Hence, with the continuity correction,
$P(9 \le X \le 11) \approx P\left( \frac{(9 - 0.5) - 10}{\sqrt{5}} \le Z \le \frac{(11 + 0.5) - 10}{\sqrt{5}} \right) = P\left( \frac{8.5 - 10}{2.24} \le Z \le \frac{11.5 - 10}{2.24} \right) = P(-0.67 \le Z \le 0.67)$
$= \Phi(0.67) - \Phi(-0.67) = 0.749 - 0.251 = 0.498$
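To see how close the approximation is, the exact binomial probability can be compared with the continuity-corrected normal value. This is a sketch under the example's assumptions (n = 20, p = 0.5); `normal_cdf` is a helper defined here, not a library function.

```python
from math import comb, erf, sqrt

def normal_cdf(z: float) -> float:
    """Standard normal cdf computed from the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, p = 20, 0.5

# Exact binomial probability of 9, 10 or 11 heads.
exact = sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(9, 12))

# Normal approximation with continuity correction (8.5 to 11.5).
mu = n * p
sigma = sqrt(n * p * (1 - p))
approx = normal_cdf((11.5 - mu) / sigma) - normal_cdf((8.5 - mu) / sigma)

print(round(exact, 4), round(approx, 4))
```

The small difference from the hand calculation comes from rounding z to two decimal places before using the table.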
5.3.2
The normal distribution can also be used to approximate the Poisson distribution
for large values of $\lambda$ (the mean of the Poisson distribution).
That is, if we have a random variable X ~ Poisson($\lambda$) and $\lambda$ is large, then probabilities for X can
be calculated approximately using the Normal distribution. The
random variable X is treated as approximately normally distributed with mean $\mu = \lambda$ and variance
$\sigma^2 = \lambda$, i.e. $X \approx N(\lambda, \lambda)$.
Example 5.3:
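A car hire firm has 20 cars to hire. The number of cars demanded per
day follows a Poisson distribution with a mean of 5. Calculate the probability that at most
ten cars will be hired in one day.
Solution:
Let the random variable X denote the number of cars demanded in a day.
The given mean value is $\lambda = 5$. By the Poisson distribution,
$P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, 3, 4, \ldots$
Using the normal approximation $X \approx N(5, 5)$ with the continuity correction,
$P(X \le 10) \approx P\left( Z \le \frac{(10 + 0.5) - 5}{\sqrt{5}} \right) = P(Z \le 2.46) = \Phi(2.46) = 0.9931$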
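Because a mean of 5 is not very large, it is worth comparing the normal approximation with the exact Poisson sum. The sketch below does this for P(X ≤ 10); `normal_cdf` is a helper defined here for illustration.

```python
from math import erf, exp, factorial, sqrt

def normal_cdf(z: float) -> float:
    """Standard normal cdf computed from the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

lam = 5.0  # Poisson mean from the example

# Exact Poisson probability P(X <= 10).
exact = sum(exp(-lam) * lam**x / factorial(x) for x in range(11))

# Normal approximation N(lam, lam) with continuity correction.
approx = normal_cdf((10 + 0.5 - lam) / sqrt(lam))

print(round(exact, 4), round(approx, 4))
```

The gap between the two values shrinks as the Poisson mean grows, which is why the approximation is recommended only for large means.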
Pareto diagrams
control charts
scatter or correlation diagrams
run charts
process flow diagrams
The most important SPC tool is the control chart. A control chart is a graphical
representation of process performance over time, concerned with how (or
whether) a process varies at different intervals and with identifying nonrandom or
assignable causes of variation. Control charts also provide a powerful
analytical tool for monitoring process variability and changes in the process
mean. Two charts are commonly used in SPC: the X-bar chart and the R-chart.
The X-bar and Range (R) charts are a set of control charts for variables data (data that
are both quantitative and continuous in measurement, such as a measured
dimension or time). The X-bar chart monitors the process location over time, based
on the average of a series of observations, called a subgroup, while the R-chart
monitors the variation between observations in the subgroup over time.
The X-bar chart and R-chart are used when you can rationally collect measurements
in groups (subgroups) of between two and ten observations. The charts' x-axes are
time based, so that the charts show a history of the process. The data are time-ordered; that is, entered in the sequence in which they were generated.
5.4.1
In order to construct the charts, the sample mean of each subgroup, the average of the subgroup means and
the limits must be calculated.
The sample mean is calculated from a set of n data values as $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
The average of the subgroup means is
$\bar{\bar{x}} = \frac{1}{mn} \sum_{j=1}^{m} \sum_{i=1}^{n} x_{ij}$
where n is the subgroup size and m is the total number of subgroups included in
the analysis.
This $\bar{\bar{x}}$ is the centre line of the X-bar chart and is called the estimated process mean.
The centre line of the R-chart is the average range,
$\bar{r} = \frac{1}{m} \sum_{i=1}^{m} r_i$, where $r_i$ is the range between the
largest and the smallest observations in subgroup i.
The upper limit for the R-chart is calculated by using the formula $D_4 \bar{r}$ and the
lower limit by using the formula $D_3 \bar{r}$, where $D_4$ and $D_3$ can be found from the process
control chart table.
After the centre lines and limits are calculated, the charts can be
constructed by plotting $\bar{x}$ against the sample number for the X-bar chart
and r against the sample number for the R-chart.
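The calculation of centre lines and limits can be sketched as follows, using the subgroup means and ranges of the 20 samples in Example 5.4. The constants A2 = 0.577, D3 = 0 and D4 = 2.114 are the commonly tabulated values for subgroups of size 5 and are an assumption here, as is the X-bar chart limit formula (grand mean ± A2 × average range), which is the usual three-sigma form; check both against the control chart table you are using.

```python
# Subgroup means and ranges for the 20 samples of size 5 (Example 5.4).
xbars = [31.6, 33.4, 35.0, 32.2, 33.8, 38.4, 31.6, 36.8, 35.0, 34.0,
         29.8, 34.0, 33.0, 34.8, 35.6, 30.8, 33.0, 31.6, 28.2, 33.8]
ranges = [4, 6, 4, 4, 2, 3, 4, 10, 15, 6, 4, 4, 10, 4, 7, 6, 5, 3, 9, 6]

# Assumed control chart constants for subgroup size n = 5.
A2, D3, D4 = 0.577, 0.0, 2.114

grand_mean = sum(xbars) / len(xbars)   # centre line of the X-bar chart
r_bar = sum(ranges) / len(ranges)      # centre line of the R-chart

xbar_lcl, xbar_ucl = grand_mean - A2 * r_bar, grand_mean + A2 * r_bar
r_lcl, r_ucl = D3 * r_bar, D4 * r_bar

# Sample numbers (1-based) that plot outside the trial limits.
out_xbar = [i + 1 for i, x in enumerate(xbars) if x < xbar_lcl or x > xbar_ucl]
out_r = [i + 1 for i, r in enumerate(ranges) if r < r_lcl or r > r_ucl]

print(round(grand_mean, 2), round(r_bar, 2))
print(round(xbar_lcl, 2), round(xbar_ucl, 2), round(r_lcl, 2), round(r_ucl, 2))
print(out_xbar, out_r)
```

Samples flagged by either list would be investigated for assignable causes before the trial limits are revised.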
Example 5.4:
A component part for a jet aircraft engine is manufactured by an investment
casting process. The vane opening on this casting is an important functional
parameter of the part.
We will illustrate the use of X and R control charts to assess the statistical
stability of this process. The table presents 20 samples of five parts each. The
values given in the table have been coded by using the last three digits of the
dimension; that is, 31.6 should be 0.50316 inch.
Sample Number   x1   x2   x3   x4   x5   x-bar   r
1               33   29   31   32   33   31.6    4
2               33   31   35   37   31   33.4    6
3               35   37   33   34   36   35.0    4
4               30   31   33   34   33   32.2    4
5               33   34   35   33   34   33.8    2
6               38   37   39   40   38   38.4    3
7               30   31   32   34   31   31.6    4
8               29   39   38   39   39   36.8    10
9               28   33   35   36   43   35.0    15
10              38   33   32   35   32   34.0    6
11              28   30   28   32   31   29.8    4
12              31   35   35   35   34   34.0    4
13              27   32   34   35   37   33.0    10
14              33   33   35   37   36   34.8    4
15              35   37   32   35   39   35.6    7
16              33   33   27   31   30   30.8    6
17              35   34   34   30   32   33.0    5
18              32   33   30   30   33   31.6    3
19              25   27   34   27   28   28.2    9
20              35   35   36   33   30   33.8    6
(a)
(b)
Exercise 5
1. Suppose X1, X2, ..., X20 is a sample from the normal distribution $N(\mu, \sigma^2)$ with $\mu = 5$,
$\sigma^2 = 4$. Find
(a) Expectation and Variance of $\bar{X}$
(b) Distribution of $\bar{X}$
2. Given that X is normally distributed with mean 50 and standard deviation 4, compute
the following for n=25.
(a) Mean and variance of X
(b) P ( X 49)
(c) P ( X 52)
(d) P ( 49 X 51.5)
3. Given that X is normally distributed with mean 20 and standard deviation 2, compute
the following for n=40.
(a) Mean and variance of X
(b) P ( X 19)
(c) P ( X 22)
(d) P (19 X 21.5)
4. Let X denote the number of flaws in a 1-inch length of copper wire. The pmf of X is
given in the following table
X = x:      0      1      2      3
P(X = x):   0.48   0.39   0.12   0.01
100 wires are sampled from this population. What is the probability that the average
number of flaws per wire in this sample is less than 0.5?
5. At a large university, the mean age of the students is 22.3 years, and the standard
deviation is 4 years. A random sample of 64 students is drawn. What is the probability
that the average age of these students is greater than 23 years?
6. Assuming an equal chance of a new baby being a boy or a girl, what is the probability
that 60 or more out of the next 100 births at Pantai Hospital will be girls?
7. If 10% of UTP students are international students, what is the probability that fewer
than 100 in a random sample of 818 students are coming from overseas?
8. Suppose that a sample of n = 1,600 tires of the same type are obtained at random from
an ongoing production process in which 8% of all such tires produced are defective.
What is the probability that in such a sample 150 or fewer tires will be defective?
9. For overseas flights, an airline has three different choices on its dessert menu: ice
cream, apple pie, and chocolate cake. Based on past experience the airline feels that
each dessert is equally likely to be chosen.
(a) If a random sample of four passengers is selected, what is the probability that at
least two will choose ice cream for dessert?
(b) If a random sample of 21 passengers is selected, what is the approximate
probability that at least two will choose ice cream for dessert?
10. Suppose that at a certain automobile plant, the number of work stoppages per day due to
equipment problems during the production process follows a Poisson
distribution with a mean of 12.0. What is the approximate probability of having 15 or fewer work
stoppages due to equipment problems on any given day?
11. The number of cars arriving per minute at a toll booth on a
particular bridge is Poisson distributed with a mean of 2.5. What is
the probability that in any given minute
(a) no cars arrive?
(b) not more than two cars arrive?
If the expected number of cars arriving at the toll booth per ten-minute interval is
25.0, what is the approximate probability that in any given ten-minute period
(c)
(d)
12. A component part for a jet aircraft engine is manufactured by an investment casting
process. The vane opening on this casting is an important functional parameter of the
part. We will illustrate the use of X and R control charts to assess the statistical
stability of this process. The table presents 20 samples of five parts each. The values
given in the table have been coded by using the last three digits of the dimension; that
is, 31.6 should be 0.50316 inch.
Sample Number   x1   x2   x3   x4   x5   x-bar   r
1               33   29   31   32   33   31.6    4
2               33   31   35   37   31   33.4    6
3               35   37   33   34   36   35.0    4
4               30   31   33   34   33   32.2    4
5               33   34   35   33   34   33.8    2
6               38   37   39   40   38   38.4    3
7               30   31   32   34   31   31.6    4
8               29   39   38   39   39   36.8    10
9               28   33   35   36   43   35.0    15
10              38   33   32   35   32   34.0    6
11              28   30   28   32   31   29.8    4
12              31   35   35   35   34   34.0    4
13              27   32   34   35   37   33.0    10
14              33   33   35   37   36   34.8    4
15              35   37   32   35   39   35.6    7
16              33   33   27   31   30   30.8    6
17              35   34   34   30   32   33.0    5
18              32   33   30   30   33   31.6    3
19              25   27   34   27   28   28.2    9
20              35   35   36   33   30   33.8    6
13. The overall length of a screw used in a knee replacement device is monitored using X-bar
and R charts. The following table gives the length for 20 samples of size 4.
(Measurements are coded from 2.00 mm; that is, 15 is 2.15 mm.)
Sample   Observation         Sample   Observation
         1    2    3    4             1    2    3    4
1        16   18   15   13   11       14   14   15   13
2        16   15   17   16   12       15   13   15   16
3        15   16   20   16   13       13   17   16   15
4        14   16   14   12   14       11   14   14   21
5        14   15   13   16   15       14   15   14   13
6        16   14   16   15   16       18   15   16   14
7        16   16   14   15   17       14   16   19   16
8        17   13   17   16   18       16   14   13   19
9        15   11   13   16   19       17   19   17   13
10       15   18   14   13   20       12   15   12   17
(a) Using all the data, find trial control limits for X-bar and R charts, construct the charts,
and plot the data.
(b) Use the trial control limits from part (a) to identify out-of-control points. If
necessary, revise your control limits, assuming that any samples that plot outside
the control limits can be eliminated.
(c) Assuming that the process is in control, estimate the process mean and process
standard deviation.
14. The thickness of a printed circuit board (PCB) is an important quality parameter. Data
on board thickness (in cm) are given below for 25 samples of three boards each.
Sample   Thickness                  Sample   Thickness
1        0.0629   0.0636   0.0640   14       0.0645   0.0640   0.0631
2        0.0630   0.0631   0.0622   15       0.0619   0.0644   0.0632
3        0.0628   0.0631   0.0633   16       0.0631   0.0627   0.0630
4        0.0634   0.0630   0.0631   17       0.0616   0.0623   0.0631
5        0.0619   0.0628   0.0630   18       0.0630   0.0630   0.0626
6        0.0613   0.0629   0.0634   19       0.0636   0.0631   0.0629
7        0.0630   0.0639   0.0625   20       0.0640   0.0635   0.0629
8        0.0628   0.0627   0.0622   21       0.0628   0.0625   0.0616
9        0.0623   0.0626   0.0633   22       0.0615   0.0625   0.0619
10       0.0631   0.0631   0.0633   23       0.0630   0.0632   0.0630
11       0.0635   0.0630   0.0638   24       0.0635   0.0629   0.0635
12       0.0623   0.0630   0.0630   25       0.0623   0.0629   0.0630
13       0.0635   0.0631   0.0630
(a) Using all the data, find trial control limits for X-bar and R charts, construct the charts,
and plot the data.
(b) Use the trial control limits from part (a) to identify out-of-control points. If
necessary, revise your control limits, assuming that any samples that plot outside
the control limits can be eliminated.
(c) Assuming that the process is in control, estimate the process mean and process
standard deviation.
oooOOOooo
Chapter 6
6. Hypothesis testing for single population
6.1 Introduction
There are two types of statistical inferences: estimation of population parameters
and hypothesis testing. Hypothesis testing is one of the most important tools of
application of statistics to real life problems. Most often, decisions are required to
               Accept H0 as true                         Reject H0 as false
H0 is true     Correct decision. Probability: 1 - α      Type I error. Probability: α
H0 is false    Type II error. Probability: β             Correct decision. Probability: 1 - β
The probability of a type I error, α, is called the level of significance, (1 - α)100%
is called the confidence level of the test and (1 - β) is called the "power" of the
test.
6.1.3 Types of test
In hypothesis testing there are three types of test on any parameter of interest:
the two tailed (sided) test, the upper tailed test and the lower tailed test, as in the
table below.
Type | Null Hypothesis, H0 | Alternative Hypothesis, H1
observed difference between the sample statistic and the mean of the sampling
distribution did not occur by chance alone.
6.1.6 Decision making
If the test statistic falls in the rejection/critical region, then
H0 is rejected, which means that there is enough evidence to support the alternative
hypothesis. Otherwise we fail to reject H0, which means that there is not enough evidence to
support the claim that H1 is true.
6.1.7 Steps to do the test
In hypothesis testing, there are seven steps to perform any statistical test:
(i) Identify the parameter of interest.
(ii) State the hypothesis: Null Hypothesis and Alternate Hypothesis
(iii) Determine the appropriate Test Statistic
(iv) Determine the critical value
(v) Determine the Rejection/Critical Region or P-value or 100(1 - α)% confidence
intervals
(vi) Calculate the Test Statistic
(vii) Make a decision or conclusion based on step (v).
6.2 Testing about the mean for large sample size and the variance is known
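If the parameter of interest is the population mean and the
variance is known or the sample size is very large, then the test can be performed
as below:
Step 1: To test about the mean when the population variance is known
Step 2: $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$ or $H_1: \mu > \mu_0$ or $H_1: \mu < \mu_0$
Step 3: Test statistic: $Z_0 = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \sim N(0, 1)$, if $\mu = \mu_0$
Step 4: The critical value at significance level α is $z_{\alpha}$ for a one tailed (sided) test and
$z_{\alpha/2}$ for a two tailed (sided) test.
Step 5: i. Rejection region approach: reject H0 if
Two tailed test, $\mu \ne \mu_0$: $|Z_0| > z_{\alpha/2}$
Upper tailed test, $\mu > \mu_0$: $Z_0 > z_{\alpha}$
Lower tailed test, $\mu < \mu_0$: $Z_0 < -z_{\alpha}$
ii. P-Value approach: Reject H0 if P < α
Two tailed test, $\mu \ne \mu_0$: P-value = $2[1 - \Phi(|z_0|)]$
Upper tailed test, $\mu > \mu_0$: P-value = $1 - \Phi(z_0)$
Lower tailed test, $\mu < \mu_0$: P-value = $\Phi(z_0)$
iii. Confidence interval approach: the 100(1 - α)% two-sided confidence interval on the mean is
$\bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$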
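The test statistic and the three P-value formulas can be collected in one small function. The sketch below is illustrative only: `z_test` and its argument names are invented here, and the call uses the figures of Example 6.1 (sample mean 20, hypothesised mean 21, variance 20, n = 20, lower tailed test).

```python
from math import erf, sqrt

def normal_cdf(z: float) -> float:
    """Standard normal cdf computed from the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def z_test(xbar, mu0, sigma, n, alternative):
    """Return (z0, P-value) for a test on the mean with known sigma."""
    z0 = (xbar - mu0) / (sigma / sqrt(n))
    if alternative == "two-sided":
        p_value = 2.0 * (1.0 - normal_cdf(abs(z0)))
    elif alternative == "greater":   # upper tailed test
        p_value = 1.0 - normal_cdf(z0)
    elif alternative == "less":      # lower tailed test
        p_value = normal_cdf(z0)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z0, p_value

z0, p_value = z_test(xbar=20, mu0=21, sigma=sqrt(20), n=20, alternative="less")
alpha = 0.05
print(round(z0, 3), round(p_value, 4), "reject H0" if p_value < alpha else "fail to reject H0")
```

The decision printed at the end follows the rule "reject H0 if P < α" from Step 5.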
Example 6.1:
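Test the hypothesis that the mean age of UTP students is less than 21, given a
random sample of 20 individuals who have a mean age of 20, and assuming that the age
is normally distributed with a variance of 20.
i. Test the hypothesis that the mean age is less than 21. Use α = 0.05.
(1) The parameter of interest is the true mean age of UTP students, $\mu$.
(2) The hypotheses:
$H_0: \mu = 21$ vs $H_1: \mu < 21$
Test statistic: $z_0 = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$
A 95% two-sided confidence interval on the mean is
$\bar{x} - z_{0.025} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{0.025} \frac{\sigma}{\sqrt{n}}$
$20 - 1.96 \frac{\sqrt{20}}{\sqrt{20}} \le \mu \le 20 + 1.96 \frac{\sqrt{20}}{\sqrt{20}}$
$18.04 \le \mu \le 21.96$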
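$T_0 = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \sim t_{n-1}$, if $\mu = \mu_0$
Step 4: The critical value at significance level α is $t_{\alpha, n-1}$ for a one tailed (sided) test
and $t_{\alpha/2, n-1}$ for a two tailed (sided) test.
Step 5: i. Rejection region approach: reject H0 if
Two tailed test, $\mu \ne \mu_0$: $|T_0| > t_{\alpha/2, n-1}$
Upper tailed test, $\mu > \mu_0$: $T_0 > t_{\alpha, n-1}$
Lower tailed test, $\mu < \mu_0$: $T_0 < -t_{\alpha, n-1}$
ii. P-Value approach: Reject H0 if P < α
Two tailed test, $\mu \ne \mu_0$: P-value = $2P(T_{n-1} > |t_0|)$
iii. Confidence interval approach: the 100(1 - α)% two-sided confidence interval on the mean is
$\bar{x} - t_{\alpha/2, n-1} \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2, n-1} \frac{s}{\sqrt{n}}$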
Example 6.2:
A particular brand of diet margarine was analyzed to determine the level of
polyunsaturated fatty acid (in percent). A sample of six packages resulted in the
following data: 16.8, 17.2, 17.4, 16.9, 16.5 and 17.1.
i. Using the P-value approach, test the hypothesis that the mean is not 17.0.
ii. Construct a 95% two-sided CI on the mean.
iii. Use the CI found in part (ii) to test the hypothesis.
Solution:
i.
$t_0 = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{16.98 - 17}{0.3188 / \sqrt{6}} = -0.1537$
From the t-table, $|t_0| = 0.1537$ with 5 df is less than $t_{0.40, 5} = 0.267$, for which the upper-tail probability is greater than 0.4, so the
P-value > 2(0.4) = 0.8
(7) Result and conclusion:
Since P-value > 0.05, we fail to reject H0 and conclude that there is no evidence that the true
mean differs from 17 at α = 0.05.
ii. A 95% two-sided CI on the mean is
$\bar{x} - t_{0.025, 5} \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{0.025, 5} \frac{s}{\sqrt{n}}$
$16.98 - 2.571 \frac{0.3188}{\sqrt{6}} \le \mu \le 16.98 + 2.571 \frac{0.3188}{\sqrt{6}}$
$16.645 \le \mu \le 17.3146$
iii. Since 17 falls in the interval, we fail to reject H0 and conclude that the true
mean is not significantly different from 17 at α = 0.05.
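The statistics in this example can be recomputed from the six observations. The sketch below uses the standard `statistics` module and takes the critical value 2.571 for 5 degrees of freedom from the t-table, as quoted in the solution, because the standard library has no t distribution; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

data = [16.8, 17.2, 17.4, 16.9, 16.5, 17.1]
mu0 = 17.0
n = len(data)

xbar = mean(data)
s = stdev(data)                      # sample standard deviation (n - 1 divisor)
t0 = (xbar - mu0) / (s / sqrt(n))    # test statistic with n - 1 = 5 df

t_crit = 2.571                       # t_{0.025, 5}, read from a t-table
half_width = t_crit * s / sqrt(n)
ci = (xbar - half_width, xbar + half_width)

print(round(xbar, 4), round(s, 4), round(t0, 4))
print(tuple(round(v, 3) for v in ci), ci[0] <= mu0 <= ci[1])
```

Because the code keeps the unrounded sample mean, its test statistic is slightly closer to zero than the hand calculation with the mean rounded to 16.98; the conclusion is unchanged.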
6.4 Testing about the proportion
In hypothesis testing, the procedure to test about the proportion of the population
is the same as the procedure to test about the mean when the population variance
is known.
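Step 1: To test about the population proportion
Step 2: $H_0: p = p_0$ versus $H_1: p \ne p_0$ or $H_1: p > p_0$ or $H_1: p < p_0$
Step 3: Test statistic:
$Z_0 = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}} \sim N(0, 1)$, if $p = p_0$, where $\hat{p} = \frac{X}{n}$
Step 4: The critical value at significance level α is $z_{\alpha}$ for a one tailed (sided) test and
$z_{\alpha/2}$ for a two tailed (sided) test.
Step 5: i. Rejection region approach: reject H0 if
Two tailed test, $p \ne p_0$: $Z_0 > z_{\alpha/2}$ or $Z_0 < -z_{\alpha/2}$
Upper tailed test, $p > p_0$: $Z_0 > z_{\alpha}$
Lower tailed test, $p < p_0$: $Z_0 < -z_{\alpha}$
ii. P-Value approach: Reject H0 if P < α
Two tailed test, $p \ne p_0$: P-value = $2[1 - \Phi(|z_0|)]$
Upper tailed test, $p > p_0$: P-value = $1 - \Phi(z_0)$
Lower tailed test, $p < p_0$: P-value = $\Phi(z_0)$
iii. Confidence interval approach: the 100(1 - α)% two-sided confidence interval on the proportion is
$\hat{p} - z_{\alpha/2} \sqrt{\hat{p}(1 - \hat{p}) / n} \le p \le \hat{p} + z_{\alpha/2} \sqrt{\hat{p}(1 - \hat{p}) / n}$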
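$\chi_0^2 = \frac{(n - 1) s^2}{\sigma_0^2} \sim \chi^2_{n-1}$, if $\sigma^2 = \sigma_0^2$
Step 4: The critical value at significance level α is $\chi^2_{\alpha, n-1}$ (upper tailed) or $\chi^2_{1-\alpha, n-1}$ (lower tailed) for a one tailed (sided) test and
$\chi^2_{\alpha/2, n-1}$ and $\chi^2_{1-\alpha/2, n-1}$ for a two tailed (sided) test.
Step 5: i. Rejection region approach: reject H0 if
Upper tailed test, $\sigma^2 > \sigma_0^2$: $\chi_0^2 > \chi^2_{\alpha, n-1}$
Two tailed test, $\sigma^2 \ne \sigma_0^2$: $\chi_0^2 > \chi^2_{\alpha/2, n-1}$ or $\chi_0^2 < \chi^2_{1-\alpha/2, n-1}$
Lower tailed test, $\sigma^2 < \sigma_0^2$: $\chi_0^2 < \chi^2_{1-\alpha, n-1}$
ii. P-Value approach: Reject H0 if P < α
iii. Confidence interval approach: the 100(1 - α)% two-sided confidence interval on the variance is
$\frac{(n - 1) s^2}{\chi^2_{\alpha/2, n-1}} \le \sigma^2 \le \frac{(n - 1) s^2}{\chi^2_{1-\alpha/2, n-1}}$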
H 0 : 2 (0.3) 2
vs
H1 : 2 (0.3) 2
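(6) Computation
$\chi_0^2 = \frac{(n - 1) s^2}{\sigma_0^2} = \frac{50 (0.37)^2}{(0.3)^2} = 76.0556$
(ii).
$\frac{(n - 1) s^2}{\chi^2_{\alpha/2, n-1}} \le \sigma^2 \le \frac{(n - 1) s^2}{\chi^2_{1-\alpha/2, n-1}}$
$\frac{50 (0.37)^2}{71.42} \le \sigma^2 \le \frac{50 (0.37)^2}{32.36}$
$0.0958 \le \sigma^2 \le 0.2115$
$0.309 \le \sigma \le 0.459$
Since $\sigma = 0.3$ is outside of the interval, we reject the null hypothesis
and conclude that the engineer's claim is true at the 0.05 level of
significance.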
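The arithmetic of this example can be reproduced as below. The sample size n = 51 is inferred from the factor 50 = n - 1 in the computation and is therefore an assumption, and the chi-square critical values 71.42 and 32.36 are taken from the text rather than computed, since the standard library has no chi-square distribution.

```python
from math import sqrt

n = 51          # assumed: inferred from (n - 1) = 50 in the computation
s = 0.37        # sample standard deviation
sigma0 = 0.3    # hypothesised standard deviation

chi2_0 = (n - 1) * s**2 / sigma0**2   # test statistic

# Table values quoted in the text for 50 degrees of freedom.
chi2_upper, chi2_lower = 71.42, 32.36

var_ci = ((n - 1) * s**2 / chi2_upper, (n - 1) * s**2 / chi2_lower)
sd_ci = (sqrt(var_ci[0]), sqrt(var_ci[1]))

print(round(chi2_0, 4))
print(tuple(round(v, 4) for v in var_ci), tuple(round(v, 3) for v in sd_ci))
print("sigma0 inside interval:", sd_ci[0] <= sigma0 <= sd_ci[1])
```

The last line prints whether the hypothesised standard deviation lies inside the interval, which is the basis of the confidence interval approach.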
oooOOOooo
Exercise 6
1. A manufacturer of sprinkler systems used for fire protection in office buildings claims
that the true average system-activation temperature is 130°F. A sample of 9 systems
when tested yields an average activation temperature of 131.08°F. If the distribution
of activation times is normal with standard deviation 1.5°F, does the data contradict
the firm's claim at level of significance α = 0.01? What is the P-value for this test?
2. A random sample of 50 battery packs is selected and subjected to a life test. The
average life of these batteries is 4.05 hours. Assume that the battery life is normally
distributed with standard deviation equal to 0.2 hour. Is there evidence to support the
claim that the mean battery life exceeds 4 hours? Use α = 0.05. What is the P-value for
this test?
3. The flow discharge of the Perak River (measured in m³/s) was obtained at random. 40
readings were collected and the mean flow discharge was found to be 3.815 m³/s with
a standard deviation of 0.5 m³/s.
(a) Test the hypothesis that the mean flow discharge of the Perak River is not equal to 4 m³/s.
Use α = 0.05;
(b) Use the P-value approach to test the null hypothesis.
(c) Construct a 95% two-sided CI on the mean flow discharge. What is your conclusion?
4. A civil engineer is analyzing the compressive strength of concrete. Compressive
strength is approximately normally distributed with variance $\sigma^2$ = 1000 psi². A random
sample of 12 specimens has a mean compressive strength of $\bar{x}$ = 3255.42 psi.
(a) Test the hypothesis that the mean compressive strength is 3500 psi. Use α = 0.01;
(b) What is the smallest level of significance at which you would be willing to reject
the null hypothesis?;
(c) Construct a 95% two-sided CI on mean compressive strength; and
(d) Construct a 99% two-sided CI on mean compressive strength. Compare the width
of this confidence interval with the width of the one in part (c). What is your
comment?
5. A new process for producing synthetic diamonds can be operated at a profitable level
only if the average weight of the diamonds is greater than 0.5 karat. To evaluate the
profitability of the process, six diamonds are generated with recorded weights 0.46,
0.61, 0.52, 0.48, 0.57 and 0.54 karat.
(a) At the 5% significance level, do the six measurements present sufficient evidence that
the average weight of the diamonds produced by the process is in excess of 0.5
karat?
(b) Use the P-value approach to test the null hypothesis.
(c) Construct a 95% CI on the average weight of diamonds.
6. A cigarette company claims that its cigarettes contain an average of only
10 mg of tar. A random sample of 25 cigarettes shows the average tar content to be
12.5 mg with a standard deviation of 4.5 mg.
(a) Construct a hypothesis test to determine whether the average tar content of
cigarettes exceeds 10 mg, using the P-value approach;
(b) Construct a 95% two-sided CI on the average tar content of cigarettes.
7. Regardless of age, about 20% of Malaysian adults participate in fitness activities at
least twice a week. In a local survey of 100 adults over 40 years old, a total of 15
people indicated that they participated in a fitness activity at least twice a week.
(a) Do these data indicate that the participation rate for adults over 40 years of age is
significantly less than 20%? Carry out a test at the 10% significance level and draw
an appropriate conclusion.
(b) Construct a 95% two-sided CI on the participation rate.
8. A survey done one year ago showed that 45% of the population participated in
recycling programs. In a recent poll a random sample of 1250 people showed that 588
participate in recycling programs.
(a) Test the hypothesis that the proportion of the population who participate in
recycling programs is greater than it was one year ago. Use a 5% significance
level.
(b) Construct a 95% two-sided CI on the proportion.
9. An Ipoh city council member gave a speech in which she said that 18% of all private
homes in the city had been undervalued by the county tax assessor's office. In a
follow-up story, the local newspaper reported that it had taken a random sample of 91
private homes. Using a professional evaluator to value the properties and checking
against county tax records, it found that 14 of the homes had been undervalued.
(a) Do these data indicate that the proportion of private homes that are undervalued
by the county tax assessor is different from 18%? Use a 5% significance level.
(b) Construct a 95% two-sided CI on the proportion.
10. Engineers designing the front-wheel-drive half shaft of a new model automobile
claim that the variance in the displacement of the constant velocity joints of the shaft
is less than 1.5 mm². Twenty simulations were conducted and the following results were
obtained: $\bar{x} = 3.39$ and $s = 1.41$.
(a) At $\alpha = 0.05$, do these data support the claim of the engineers?
(b) What is the P-value for this test?
(c) Construct a two-sided CI for $\sigma^2$.
11. Aerospace engineers claim that the standard deviation of the percentage in an
alloy used in aerospace casting is greater than 0.3. Fifty-one parts were randomly selected,
and the sample standard deviation of the percentage in the alloy used in aerospace
casting is $s = 0.37$.
(a) At $\alpha = 0.05$, do these data support the claim of the engineers?
(b) What is the P-value for this test?
(c) Construct a 95% two-sided CI for $\sigma$. What is your conclusion?
12. Scientists claim that the variance of the sugar content of the syrup in canned peaches
is 18 mg². A random sample of 10 cans yields a sample standard deviation of
4.8 mg.
(a) At $\alpha = 0.05$, do these data support the claim of the scientists?
(b) What is the P-value for this test?
(c) Construct a 95% two-sided CI for $\sigma^2$. What is your conclusion?
oooOOOooo
Chapter 7
7.
Understand the procedure (steps) for performing a hypothesis test for two populations.
Test the difference between two means when the
population variances are known.
Test the difference between two means when the
population variances are unknown but assumed to be equal.
Test the difference between two means when the
population variances are unknown and assumed to be unequal.
Test the difference between two proportions.
Test the difference between two variances.
7.1 Introduction
In hypothesis testing for two populations, the procedure is the same as
in hypothesis testing for one population, but now we test the difference
between two population parameters of interest. For example, we may test
the difference between the means of the two populations, $\mu_1$ and $\mu_2$, or
the difference between the proportions of the two populations, $p_1$ and $p_2$.
7.1.1 Types of test
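In this hypothesis testing there are three types of test on any parameter of interest:
the two-tailed (two-sided) test, the upper-tailed test and the lower-tailed test, as shown in the
table below.

Type              | Null Hypothesis, H0   | Alternative Hypothesis, H1
Two-tailed test   | $\mu_1 - \mu_2 = 0$   | $\mu_1 - \mu_2 \ne 0$
Upper-tailed test | $\mu_1 - \mu_2 = 0$   | $\mu_1 - \mu_2 > 0$
Lower-tailed test | $\mu_1 - \mu_2 = 0$   | $\mu_1 - \mu_2 < 0$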
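Step 1: The parameter of interest is the difference between two means, $\mu_1$ and $\mu_2$, when both
population variances, $\sigma_1^2$ and $\sigma_2^2$, are known.
Step 2: The hypotheses are
$$H_0: \mu_1 - \mu_2 = 0$$
versus
$$H_1: \mu_1 - \mu_2 \ne 0 \quad \text{or} \quad H_1: \mu_1 - \mu_2 > 0 \quad \text{or} \quad H_1: \mu_1 - \mu_2 < 0$$
Step 3: The test statistic is
$$Z_0 = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0, 1)$$
Step 4: The critical value at significance level $\alpha$ is $Z_\alpha$ for a one-tailed (one-sided) test and
$Z_{\alpha/2}$ for a two-tailed (two-sided) test.
Step 5: i. Critical region:

Alternative Hypothesis, H1 | Reject H0 if
$\mu_1 - \mu_2 \ne 0$      | $Z_0 > Z_{\alpha/2}$ or $Z_0 < -Z_{\alpha/2}$
$\mu_1 - \mu_2 > 0$        | $Z_0 > Z_\alpha$
$\mu_1 - \mu_2 < 0$        | $Z_0 < -Z_\alpha$

ii. P-value: reject H0 if $P < \alpha$, where

Alternative Hypothesis, H1 | P-value
$\mu_1 - \mu_2 \ne 0$      | $2[1 - \Phi(\lvert z_0 \rvert)]$
$\mu_1 - \mu_2 > 0$        | $1 - \Phi(z_0)$
$\mu_1 - \mu_2 < 0$        | $\Phi(z_0)$

iii. Confidence interval: reject H0 if $\mu_1 - \mu_2 = 0$ falls outside the interval.
For the two-tailed alternative, $\mu_1 - \mu_2 \ne 0$:
$$(\bar{x}_1 - \bar{x}_2) - Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
For the upper-tailed alternative, $\mu_1 - \mu_2 > 0$:
$$(\bar{x}_1 - \bar{x}_2) - Z_{\alpha}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le \mu_1 - \mu_2$$
For the lower-tailed alternative, $\mu_1 - \mu_2 < 0$:
$$\mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + Z_{\alpha}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$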
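$$H_0: \mu_1 - \mu_2 = 0 \quad \text{versus} \quad H_1: \mu_1 - \mu_2 \ne 0$$
Test statistic:
$$Z_0 = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0, 1)$$
Computation:
$$Z_0 = \frac{18 - 24}{\sqrt{\dfrac{9}{20} + \dfrac{9}{20}}} = -6.32$$
$$\text{P-value} = 2[1 - \Phi(6.32)] = 2[1 - 1] = 0$$
(7) Result and conclusion:
Since P-value < 0.05, we reject H0. The two propellants do not have the
same mean burning rate.
ii. A 95% two-sided CI on the difference in means, $\mu_1 - \mu_2$, is
$$(\bar{x}_1 - \bar{x}_2) - z_{0.025}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + z_{0.025}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
$$(18 - 24) - 1.96\sqrt{\frac{9}{20} + \frac{9}{20}} \le \mu_1 - \mu_2 \le (18 - 24) + 1.96\sqrt{\frac{9}{20} + \frac{9}{20}}$$
$$-7.859 \le \mu_1 - \mu_2 \le -4.141$$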
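The propellant calculation above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `normal_cdf` and `two_sample_z` are illustrative and do not come from the text, and the critical value 1.96 is passed in rather than computed.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, Phi(z), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_sample_z(xbar1, xbar2, var1, var2, n1, n2, z_crit=1.96):
    """Two-sided z test of H0: mu1 - mu2 = 0 with known variances, plus a CI.

    z_crit is the tabulated value Z_{alpha/2} (1.96 for a 95% interval).
    """
    se = math.sqrt(var1 / n1 + var2 / n2)       # standard error of xbar1 - xbar2
    diff = xbar1 - xbar2
    z0 = diff / se                               # test statistic
    p_value = 2.0 * (1.0 - normal_cdf(abs(z0)))  # two-tailed P-value
    return z0, p_value, (diff - z_crit * se, diff + z_crit * se)


# Values from the worked example: means 18 and 24, variances 9, n1 = n2 = 20.
z0, p_value, (lower, upper) = two_sample_z(18, 24, 9, 9, 20, 20)
print(round(z0, 2), round(lower, 3), round(upper, 3))
```

The P-value returned is not exactly zero but is far below any usual significance level, which agrees with the hand calculation.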
7.2 Testing the difference between two means when both population
variances are unknown
If the parameter of interest is the difference between the means of
two populations whose variances are both unknown, the test falls
into one of two cases. In the first case we assume that the population variances are equal,
$\sigma_1^2 = \sigma_2^2$; in the second case we assume that the variances are not equal, $\sigma_1^2 \ne \sigma_2^2$.
7.2.1 First case: $\sigma_1^2 = \sigma_2^2$
When both population variances are unknown but $\sigma_1^2 = \sigma_2^2$,
the test of the difference between the two means can be
performed as below:
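Step 1: The parameter of interest is the difference between two means, $\mu_1$ and $\mu_2$, when both
population variances, $\sigma_1^2$ and $\sigma_2^2$, are unknown but $\sigma_1^2 = \sigma_2^2$.
Step 2:
$$H_0: \mu_1 - \mu_2 = 0$$
versus
$$H_1: \mu_1 - \mu_2 \ne 0 \quad \text{or} \quad H_1: \mu_1 - \mu_2 > 0 \quad \text{or} \quad H_1: \mu_1 - \mu_2 < 0$$
Step 3: The test statistic is
$$t_0 = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}, \quad \text{with } (n_1 + n_2 - 2) \text{ df}$$
where
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
Step 4: The critical value at significance level $\alpha$ is $t_\alpha$ for a one-tailed (one-sided) test and
$t_{\alpha/2}$ for a two-tailed (two-sided) test.
Step 5: i. Critical region:

Alternative Hypothesis, H1 | Reject H0 if
$\mu_1 - \mu_2 \ne 0$      | $t_0 > t_{\alpha/2,\, n_1+n_2-2}$ or $t_0 < -t_{\alpha/2,\, n_1+n_2-2}$
$\mu_1 - \mu_2 > 0$        | $t_0 > t_{\alpha,\, n_1+n_2-2}$
$\mu_1 - \mu_2 < 0$        | $t_0 < -t_{\alpha,\, n_1+n_2-2}$

ii. P-value: reject H0 if $P < \alpha$.
iii. Confidence interval: reject H0 if $\mu_1 - \mu_2 = 0$ falls outside the interval.
For the two-tailed alternative, $\mu_1 - \mu_2 \ne 0$:
$$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2,\, n_1+n_2-2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2,\, n_1+n_2-2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
For the upper-tailed alternative, $\mu_1 - \mu_2 > 0$:
$$(\bar{x}_1 - \bar{x}_2) - t_{\alpha,\, n_1+n_2-2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2$$
For the lower-tailed alternative, $\mu_1 - \mu_2 < 0$:
$$\mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t_{\alpha,\, n_1+n_2-2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$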
i. Do these data indicate that the mean score for the 11:00 a.m. class is
higher than the mean score for the 8:00 a.m. class? Use a 5% significance
level.
ii. What is the P-value for this test?
iii. Construct a two-sided 95% CI on the difference in average scores.
Solution:
(i) (1) The parameter of interest is whether the mean score at 11 a.m. is higher
than the mean score at 8 a.m.
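With sample 1 taken as the 8 a.m. class and sample 2 as the 11 a.m. class, the hypotheses are
$$H_0: \mu_1 - \mu_2 = 0 \quad \text{vs} \quad H_1: \mu_1 < \mu_2$$
The test statistic is
$$t_0 = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}, \quad \text{with } n_1 + n_2 - 2 \text{ df}$$
where
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
Computation:
$$t_0 = \frac{73.2 - 78.1}{8.95\sqrt{\dfrac{1}{49} + \dfrac{1}{36}}} = \frac{-4.9}{1.96} = -2.5 \quad \text{with 83 df}$$
where
$$s_p = \sqrt{\frac{(48)(8.1)^2 + (35)(10)^2}{49 + 36 - 2}} = \sqrt{\frac{6649.28}{83}} = 8.95$$
From the t-table with 83 df, $\lvert t_0 \rvert = 2.5$ lies between t = 2.358 and t = 2.617, which
gives 0.005 < P < 0.01. Since P < 0.05, we reject H0 at the 0.05 level of
significance and conclude that there is enough evidence to say that the mean score at
11 a.m. is higher than the mean score at 8 a.m.
iii.
A 95% CI for the difference in mean scores,
where $t_{0.025,83} = 1.98$, is found from
$$\bar{x}_1 = 73.2, \quad \bar{x}_2 = 78.1, \quad s_p = 8.95, \quad n_1 = 49, \quad n_2 = 36$$
$$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2,\, n_1+n_2-2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2,\, n_1+n_2-2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
$$\mu_1 - \mu_2 = (73.2 - 78.1) \pm (1.98)(8.95)\sqrt{\frac{1}{49} + \frac{1}{36}}$$
$$-8.7899 \le \mu_1 - \mu_2 \le -1.0101$$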
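The pooled-variance calculation for the class-score example can be reproduced with a short script. This is a minimal sketch, assuming the summary statistics quoted in the solution (means 73.2 and 78.1, standard deviations 8.1 and 10, sample sizes 49 and 36); the function name `pooled_t` is illustrative, and the critical value 1.98 is taken from the worked solution rather than computed.

```python
import math


def pooled_t(xbar1, s1, n1, xbar2, s2, n2, t_crit):
    """Pooled-variance t statistic and two-sided CI for mu1 - mu2.

    t_crit is the tabulated value t_{alpha/2, n1+n2-2}.
    """
    df = n1 + n2 - 2
    # Pooled standard deviation s_p
    sp = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df)
    se = sp * math.sqrt(1.0 / n1 + 1.0 / n2)   # standard error of the difference
    diff = xbar1 - xbar2
    t0 = diff / se
    return sp, df, t0, (diff - t_crit * se, diff + t_crit * se)


sp, df, t0, (lower, upper) = pooled_t(73.2, 8.1, 49, 78.1, 10.0, 36, t_crit=1.98)
print(round(sp, 2), df, round(t0, 2), round(lower, 2), round(upper, 2))
```

Because the script does not round intermediate values, its results can differ from the hand calculation in the last decimal place.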
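Step 1: The parameter of interest is the difference between two means, $\mu_1$ and $\mu_2$, when both
population variances, $\sigma_1^2$ and $\sigma_2^2$, are unknown and $\sigma_1^2 \ne \sigma_2^2$.
Step 2:
$$H_0: \mu_1 - \mu_2 = 0$$
versus
$$H_1: \mu_1 - \mu_2 \ne 0 \quad \text{or} \quad H_1: \mu_1 - \mu_2 > 0 \quad \text{or} \quad H_1: \mu_1 - \mu_2 < 0$$
Step 3: The test statistic is
$$t_0 = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
with degrees of freedom
$$v = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$
Step 4: The critical value at significance level $\alpha$ is $t_{\alpha, v}$ for a one-tailed (one-sided) test
and $t_{\alpha/2, v}$ for a two-tailed (two-sided) test.
Step 5: i. Critical region:

Alternative Hypothesis, H1 | Reject H0 if
$\mu_1 - \mu_2 \ne 0$      | $t_0 > t_{\alpha/2, v}$ or $t_0 < -t_{\alpha/2, v}$
$\mu_1 - \mu_2 > 0$        | $t_0 > t_{\alpha, v}$
$\mu_1 - \mu_2 < 0$        | $t_0 < -t_{\alpha, v}$

ii. P-value: reject H0 if $P < \alpha$.
iii. Confidence interval: reject H0 if $\mu_1 - \mu_2 = 0$ falls outside the interval.
For the two-tailed alternative, $\mu_1 - \mu_2 \ne 0$:
$$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2, v}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2, v}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
For the upper-tailed alternative, $\mu_1 - \mu_2 > 0$:
$$(\bar{x}_1 - \bar{x}_2) - t_{\alpha, v}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \le \mu_1 - \mu_2$$
For the lower-tailed alternative, $\mu_1 - \mu_2 < 0$:
$$\mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t_{\alpha, v}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$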
i.
Do the data support the claim that the two companies produce material
with different mean wear? Use $\alpha = 0.05$, and assume that each population is
normally distributed but that their variances are not equal. What is the P-value
for this test?
ii. Construct a two-sided 95% CI that addresses the question in part (i)
above.
(6 marks)
Solution:
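(i) (1) The parameter of interest is the difference between the two means,
$\mu_A$ and $\mu_B$, with variances unknown and not equal.
(2) Hypothesis testing:
$$H_0: \mu_A - \mu_B = 0 \quad \text{vs} \quad H_1: \mu_A - \mu_B \ne 0$$
(3) Test statistic:
$$t_0 = \frac{(\bar{x}_A - \bar{x}_B) - (\mu_A - \mu_B)}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}$$
(4) Degrees of freedom and critical value:
$$v = \frac{\left(\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}\right)^2}{\dfrac{(s_A^2/n_A)^2}{n_A - 1} + \dfrac{(s_B^2/n_B)^2}{n_B - 1}} = \frac{\left(\dfrac{2^2}{25} + \dfrac{8^2}{25}\right)^2}{\dfrac{(2^2/25)^2}{24} + \dfrac{(8^2/25)^2}{24}} \approx 27$$
$$t_{0.025,\, 27} = 2.052$$
(5) The critical region is: reject H0 if $t_0 > 2.052$ or $t_0 < -2.052$.
(6) Computation:
$$\bar{x}_A = 20, \quad \bar{x}_B = 12, \quad s_A = 2, \quad s_B = 8, \quad n_A = n_B = 25$$
So
$$t_0 = \frac{20 - 12}{\sqrt{\dfrac{2^2}{25} + \dfrac{8^2}{25}}} = 4.85$$
ii. A 95% two-sided CI on the difference in means is
$$(\bar{x}_A - \bar{x}_B) - t_{\alpha/2, v}\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}} \le \mu_A - \mu_B \le (\bar{x}_A - \bar{x}_B) + t_{\alpha/2, v}\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}$$
$$(20 - 12) - 2.052\sqrt{\frac{2^2}{25} + \frac{8^2}{25}} \le \mu_A - \mu_B \le (20 - 12) + 2.052\sqrt{\frac{2^2}{25} + \frac{8^2}{25}}$$
$$4.616 \le \mu_A - \mu_B \le 11.384$$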
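The unequal-variance test statistic and its degrees of freedom for the wear example can be sketched as follows. This is only an illustration under the summary statistics given in the solution (means 20 and 12, standard deviations 2 and 8, 25 observations each); the function name `welch_t` is an assumption, not a name used in the text.

```python
import math


def welch_t(xbar1, s1, n1, xbar2, s2, n2):
    """Test statistic and approximate degrees of freedom when variances differ."""
    v1 = s1 ** 2 / n1            # s1^2 / n1
    v2 = s2 ** 2 / n2            # s2^2 / n2
    se = math.sqrt(v1 + v2)      # standard error of xbar1 - xbar2
    t0 = (xbar1 - xbar2) / se
    # Degrees of freedom formula from Step 3 of the procedure
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t0, df, se


t0, df, se = welch_t(20, 2, 25, 12, 8, 25)
print(round(t0, 2), round(df, 1), round(se, 4))
```

The degrees of freedom come out slightly below 27 and are rounded to 27 when looking up the t-table.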
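7.3 Testing the difference between two proportions, $p_1$ and $p_2$
If the parameter of interest is the difference between the
proportions of two populations, the test can be performed as below:
Step 1: The parameter of interest is the difference between two proportions, $p_1$ and $p_2$.
Step 2:
$$H_0: p_1 - p_2 = 0$$
versus
$$H_1: p_1 - p_2 \ne 0 \quad \text{or} \quad H_1: p_1 - p_2 > 0 \quad \text{or} \quad H_1: p_1 - p_2 < 0$$
Step 3: The test statistic is
$$Z_0 = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \sim N(0, 1), \quad \text{where } \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$
Step 4: The critical value at significance level $\alpha$ is $Z_\alpha$ for a one-tailed (one-sided) test
and $Z_{\alpha/2}$ for a two-tailed (two-sided) test.
Step 5: i. Critical region:

Alternative Hypothesis, H1 | Reject H0 if
$p_1 - p_2 \ne 0$          | $Z_0 > Z_{\alpha/2}$ or $Z_0 < -Z_{\alpha/2}$
$p_1 - p_2 > 0$            | $Z_0 > Z_\alpha$
$p_1 - p_2 < 0$            | $Z_0 < -Z_\alpha$

ii. P-value: reject H0 if $P < \alpha$, where

Alternative Hypothesis, H1 | P-value
$p_1 - p_2 \ne 0$          | $2[1 - \Phi(\lvert z_0 \rvert)]$
$p_1 - p_2 > 0$            | $1 - \Phi(z_0)$
$p_1 - p_2 < 0$            | $\Phi(z_0)$

iii. Confidence interval: reject H0 if $p_1 - p_2 = 0$ falls outside the interval.
For the two-tailed alternative, $p_1 - p_2 \ne 0$:
$$(\hat{p}_1 - \hat{p}_2) - Z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} \le p_1 - p_2 \le (\hat{p}_1 - \hat{p}_2) + Z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$
For the upper-tailed alternative, $p_1 - p_2 > 0$:
$$(\hat{p}_1 - \hat{p}_2) - Z_{\alpha}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} \le p_1 - p_2$$
For the lower-tailed alternative, $p_1 - p_2 < 0$:
$$p_1 - p_2 \le (\hat{p}_1 - \hat{p}_2) + Z_{\alpha}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$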
Solution:
i. (1) The parameters of interest are to test the proportion of hypertension
patients on sodium restricted diets, pA and non-hypertension patients, pB
(2) Hypothesis testing:
H 0 : p A pB
vs
H1 : p A pB
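Test statistic:
$$Z_0 = \frac{(\hat{p}_A - \hat{p}_B) - (p_A - p_B)}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_A} + \dfrac{1}{n_B}\right)}}, \quad \text{where } \hat{p} = \frac{x_A + x_B}{n_A + n_B}$$
Computation:
$$Z_0 = \frac{0.44 - 0.24}{\sqrt{0.29(1 - 0.29)\left(\dfrac{1}{55} + \dfrac{1}{149}\right)}} = 2.79, \quad \text{where } \hat{p} = \frac{24 + 36}{55 + 149} = 0.29$$
A 95% two-sided CI on the difference in proportions is
$$(\hat{p}_A - \hat{p}_B) - z_{\alpha/2}\sqrt{\frac{\hat{p}_A(1 - \hat{p}_A)}{n_A} + \frac{\hat{p}_B(1 - \hat{p}_B)}{n_B}} \le p_A - p_B \le (\hat{p}_A - \hat{p}_B) + z_{\alpha/2}\sqrt{\frac{\hat{p}_A(1 - \hat{p}_A)}{n_A} + \frac{\hat{p}_B(1 - \hat{p}_B)}{n_B}}$$
$$(0.44 - 0.24) - 1.96\sqrt{\frac{(0.44)(0.56)}{55} + \frac{(0.24)(0.76)}{149}} \le p_A - p_B \le (0.44 - 0.24) + 1.96\sqrt{\frac{(0.44)(0.56)}{55} + \frac{(0.24)(0.76)}{149}}$$
$$0.052 \le p_A - p_B \le 0.348$$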
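The two-proportion statistic and interval can also be computed directly from the counts. The sketch below assumes the counts that appear in the solution (24 of 55 and 36 of 149); the function name `two_proportion_z` is illustrative. It works with unrounded proportions, so its output differs slightly from a hand calculation that first rounds the sample proportions to 0.44 and 0.24.

```python
import math


def two_proportion_z(x1, n1, x2, n2, z_crit=1.96):
    """Pooled z statistic for H0: p1 - p2 = 0 and an unpooled two-sided CI.

    z_crit is the tabulated value Z_{alpha/2} (1.96 for a 95% interval).
    """
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z0 = (p1 - p2) / se_pool
    # The interval uses the separate sample proportions, not the pooled one
    se_ci = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return z0, p_pool, (diff - z_crit * se_ci, diff + z_crit * se_ci)


z0, p_pool, (lower, upper) = two_proportion_z(24, 55, 36, 149)
print(round(z0, 3), round(p_pool, 3), round(lower, 3), round(upper, 3))
```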
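Step 1: The parameter of interest is the difference between two variances, $\sigma_1^2$ and $\sigma_2^2$.
Step 2:
$$H_0: \sigma_1^2 - \sigma_2^2 = 0$$
versus
$$H_1: \sigma_1^2 - \sigma_2^2 \ne 0 \quad \text{or} \quad H_1: \sigma_1^2 - \sigma_2^2 > 0 \quad \text{or} \quad H_1: \sigma_1^2 - \sigma_2^2 < 0$$
Step 3: The test statistic is
$$F_0 = \frac{s_1^2}{s_2^2}$$
Step 4: The critical value at significance level $\alpha$ is $f_\alpha$ for a one-tailed (one-sided) test
and $f_{\alpha/2}$ for a two-tailed (two-sided) test.
Step 5: i. Critical region:

Alternative Hypothesis, H1                        | Reject H0 if
Two-tailed test, $\sigma_1^2 - \sigma_2^2 \ne 0$  | $F_0 > f_{\alpha/2,\, n_1-1,\, n_2-1}$ or $F_0 < f_{1-\alpha/2,\, n_1-1,\, n_2-1}$
Upper-tailed test, $\sigma_1^2 - \sigma_2^2 > 0$  | $F_0 > f_{\alpha,\, n_1-1,\, n_2-1}$
Lower-tailed test, $\sigma_1^2 - \sigma_2^2 < 0$  | $F_0 < f_{1-\alpha,\, n_1-1,\, n_2-1}$

ii. Confidence interval: reject H0 if $\sigma_1^2/\sigma_2^2 = 1$ falls outside of the interval.
For the two-tailed alternative:
$$\frac{s_1^2}{s_2^2}\, f_{1-\alpha/2,\, n_2-1,\, n_1-1} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{s_1^2}{s_2^2}\, f_{\alpha/2,\, n_2-1,\, n_1-1}$$
For the upper-tailed alternative:
$$\frac{s_1^2}{s_2^2}\, f_{1-\alpha,\, n_2-1,\, n_1-1} \le \frac{\sigma_1^2}{\sigma_2^2}$$
For the lower-tailed alternative:
$$0 \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{s_1^2}{s_2^2}\, f_{\alpha,\, n_2-1,\, n_1-1}$$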
(ii)
Find the 95% two-sided confidence interval on the ratio of two variances.
Solution:
i. (1) The parameters of interest are the two
variances of the pollution indexes, $\sigma_1^2$ and $\sigma_2^2$.
Test statistic:
$$F_0 = \frac{s_1^2}{s_2^2}$$
(4) Critical value: $f_{0.025,12,13} = 3.15$
(5) The critical region is: reject H0 if
$F_0 > f_{0.025,12,13} = 3.15$ or $F_0 < f_{0.975,12,13}$
(6) Computation:
$$f_{0.975,12,13} \approx \frac{1}{f_{0.025,12,13}} = \frac{1}{3.15} = 0.32$$
$$F_0 = \frac{s_1^2}{s_2^2} = \frac{0.0340}{0.0525} = 0.648$$
(7) Result and conclusions:
Since $0.32 < F_0 < 3.15$, we cannot reject H0. There is not enough
evidence to say that the variances of the two pollution indexes are different.
ii. A 95% two-sided confidence interval on the ratio of the two variances of the pollution
indexes is
$$\frac{s_1^2}{s_2^2}\, f_{1-\alpha/2,\, n_2-1,\, n_1-1} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{s_1^2}{s_2^2}\, f_{\alpha/2,\, n_2-1,\, n_1-1}$$
$$\frac{s_1^2}{s_2^2}\, \frac{1}{f_{\alpha/2,\, n_2-1,\, n_1-1}} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{s_1^2}{s_2^2}\, f_{\alpha/2,\, n_2-1,\, n_1-1}$$
$$\frac{0.0340}{0.0525}\left(\frac{1}{3.15}\right) \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{0.0340}{0.0525}\,(3.15)$$
$$0.206 \le \frac{\sigma_1^2}{\sigma_2^2} \le 2.041$$
From the CI, since $\sigma_1^2/\sigma_2^2 = 1$ is in the interval, we cannot reject H0. There is
not enough evidence to say that the variances of the two pollution indexes are
different.
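The variance-ratio calculation for the pollution-index example can be sketched in a few lines. The standard library has no F quantile function, so this sketch takes the tabulated critical value 3.15 as an input and, as in the worked solution, approximates the lower critical value by its reciprocal. The function name `variance_ratio_test` is illustrative.

```python
def variance_ratio_test(s1_sq, s2_sq, f_upper):
    """Two-sided F test for equal variances with a tabulated upper critical value.

    The lower critical value is approximated by 1 / f_upper, following the
    worked example; a table or statistics library would give the exact value.
    """
    f0 = s1_sq / s2_sq
    f_lower = 1.0 / f_upper
    reject = f0 > f_upper or f0 < f_lower
    interval = (f0 / f_upper, f0 * f_upper)   # CI on sigma1^2 / sigma2^2
    return f0, f_lower, reject, interval


f0, f_lower, reject, (lower, upper) = variance_ratio_test(0.0340, 0.0525, 3.15)
print(round(f0, 3), round(f_lower, 2), reject, round(lower, 3), round(upper, 3))
```

The upper limit printed is 2.040 rather than 2.041 because the script does not round the ratio to 0.648 before multiplying.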
Exercise 7
1. A random sample of size $n_1 = 25$ taken from a normal population with $\sigma_1 = 5.2$ has a
mean equal to 81. A second random sample of size $n_2 = 36$, taken from a different
normal population with $\sigma_2 = 3.4$, has a mean equal to 76.
(a) Do the data indicate that the true means $\mu_1$ and $\mu_2$ are different? Carry out a
test at $\alpha = 0.01$.
(b) Find a 90% CI on the difference in means.
2. Two machines are used for filling plastic bottles with a net volume of 16.0 oz. The fill
volume can be assumed normal with, s1 = 0.02 and s2 = 0.025. A member of the
quality engineering staff suspects that both machines fill to the same mean net
volume, whether or not this volume is 16.0 oz. A random sample of 10 bottles is taken
from the output of each machine with the following results:
(a) Do you think the engineer is correct? Use the p value approach.
(b) Find a 95% CI on the difference in means.
3. Two machines are used to fill plastic bottles with dishwashing detergent. The standard
deviations of fill volume are known to be $\sigma_1 = 0.01$ and $\sigma_2 = 0.15$ fluid ounce for the two
machines, respectively. Two random samples of $n_1 = 12$ bottles from machine 1 and
$n_2 = 10$ bottles from machine 2 are selected, and the sample mean fill volumes are
$\bar{x}_1 = 30.61$ and $\bar{x}_2 = 30.24$ fluid ounces. Assume normality.
(a) Test the hypothesis that both machines fill to the same mean volume. Use the P-value approach.
(b) Construct a 90% two-sided CI on the mean difference in fill volume.
(c) Construct a 95% two-sided CI on the mean difference in fill volume. Compare
and comment on the width of this interval relative to the width of the interval in part (b).
4. To find out whether a new serum will arrest leukemia, 9 mice, all with an advanced
stage of the disease are selected. 5 mice receive the treatment and 4 do not. Survival,
in years, from the time the experiment commenced are as follows:
Treatment:     2.1   5.3   1.4   4.6   0.9
No treatment:  1.9   0.5   2.8   3.1
At the 0.05 level of significance can the serum be said to be effective? Assume the
two distributions to be of equal variances.
5. A new policy regarding overtime pay was implemented. This policy decreased the
pay factor for overtime work. Neither the staffing pattern nor the work loads changed.
To determine if overtime loads changed under the policy, a random sample of
employees was selected. Their overtime hours for a randomly selected week before
and for another randomly selected week after the policy change were recorded as
follows:
Employee:   1    2    3    4    5    6    7    8    9    10   11   12
Before:     5    4    2    8    10   4    9    3    6    0    1    5
After:      3    7    5    3    7    4    4    1    2    3    2    2
Assume that the two population variances are equal and the underlying population is
normally distributed.
(a) Is there any evidence to support the claim that the average number of overtime hours
worked per week changed after the policy went into effect? Use a P-value approach in arriving at this conclusion.
(b) Construct a 95% CI for the difference in means before and after the policy change.
Interpret this interval.
6. The diameter of steel rods manufactured on two different extrusion machines is being
investigated. Two random samples of sizes $n_1 = 15$ and $n_2 = 17$ are selected, and
$\bar{x}_1 = 8.37$, $s_1^2 = 0.35$ and $\bar{x}_2 = 8.68$, $s_2^2 = 0.40$, respectively. Assume that the data are drawn from a
normal distribution with equal variances.
(a) Is there evidence to support the claim that the two machines produce rods with
different mean diameters? Use the P-value approach.
(b) Construct a 95% CI on the difference in mean rod diameter.
7. The following data represent the running times of films produced by 2 motion-picture
companies. Test the hypothesis that the average running time of films produced by
company 2 exceeds the average running time of films produced by company 1 by 10
minutes against the one-sided alternative that the difference is less than 10 minutes.
Use $\alpha = 0.01$ and assume the distributions of times to be approximately normal with
unequal variances.

Company   Time
X1        102   86    98   109   92
X2        81    165   97   134   92   87   114   98
and for
(a) Do the sample data support the claim that the two companies produce material
with different mean wear? Assume each population is normally distributed but
unequal variances?
(b) Construct a 95% CI for the difference in mean wear of these two companies.
Interpret this interval.
9. Professor A claims that a probability and statistics student can increase his or her
score on tests if the person is provided with a pre-test the week before the exam. To
test her theory she selected 16 probability and statistics students at random and gave
these students a pre-test the week before an exam. She also selected an independent
random sample of 12 students who were given the same exam but did not have access
to the pre-test. The first group had a mean score of 79.4 with standard deviation 8.8.
The second group had a sample mean score of 71.2 with standard deviation 7.9.
(a) Do the data support Professor A's claim that the mean score of students who get a
pre-test is different from the mean score of those who do not get a pre-test before
an exam? Use the P-value approach and assume that the variances are not equal.
(b) Construct a 95% CI for the difference in mean score of students who get a pre-test
and those who do not get a pre-test before an exam. Interpret this interval.
10. A vote is to be taken among residents of a town and the surrounding county to
determine whether a proposed chemical plant should be constructed. If 120 of 200
town voters favour the proposal and 240 of 500 county residents favour it, would you
agree that the proportion of town voters favouring the proposal is higher than the
proportion of county voters? Use $\alpha = 0.05$.
11. The rollover rate of sport utility vehicles is a transportation safety issue. Safety
advocates claim that manufacturer A's vehicle has a higher rollover rate than that
of manufacturer B. One hundred crashes for each of these vehicles were examined.
The rollover rates were $\hat{p}_A = 0.35$ and $\hat{p}_B = 0.25$.
(a) By using the P-value approach, does manufacturer A's vehicle have a higher
rollover rate than manufacturer B's?
(b) Construct a 95% one-sided CI on the difference in the two rollover rates of the
vehicles. Interpret this interval.
12. Professor Rady gave 58 A's and B's to a class of 125 students in his section of English
101. The next term Professor Hady gave 45 A's and B's to a class of 115 students in
his section of English 101.
(a) By using a 5% significance level, test the claim that Professor Rady gives a higher
percentage of A's and B's in English 101 than Professor Hady does. What is
your comment?
(b) Construct a 95% one-sided CI on the difference in the percentage of A's and B's in
English 101 given by these two professors.
13. The diameter of steel rods manufactured on two different extrusion machines is being
investigated. Two random samples of sizes $n_1 = 15$ and $n_2 = 17$ are selected, and
$\bar{x}_1 = 8.37$, $s_1^2 = 0.35$ and $\bar{x}_2 = 8.68$, $s_2^2 = 0.40$, respectively.
(a) Is there evidence to conclude that the variance of the diameter of steel rods is
different for the two machines? Use the P-value approach.
(b) Construct a 95% two-sided CI on the difference in mean rod diameter.
14. Professor A claims that a probability and statistics student can increase his or her
score on tests if the person is provided with a pre-test the week before the exam. To
test her theory she selected 16 probability and statistics students at random and gave
these students a pre-test the week before an exam. She also selected an independent
random sample of 12 students who were given the same exam but did not have access
to the pre-test. The first group had a mean score of 79.4 with standard deviation 8.8.
The second group had a sample mean score of 71.2 with standard deviation 7.9.
(a) Do the data support Professor A's claim that the mean score of students who get a
pre-test is different from the mean score of those who do not get a pre-test before
an exam? Use the P-value approach and assume that the variances are not equal.
(b) Construct a 95% two-sided CI for the difference in mean score of students who
get a pre-test and those who do not get a pre-test before an exam. Interpret this
interval.
15. The melting points of two alloys were investigated by melting 15 samples of
each material. The sample standard deviation for alloy 1 was 2.34 °F and for alloy 2
was 2.5 °F.
(a) Do the sample data support a claim that both alloys have the same variance of
melting point? Use $\alpha = 0.05$.
(b) Construct a 95% two-sided confidence interval on the ratio of the two variances.
16. A study was conducted to test whether there is a difference between the variances of
petrol consumption for two types of petrol, RON95 and RON97. Five cars were
selected at random and the petrol consumption in km/liter for each petrol
type was obtained as follows:

         Km per liter
         RON95   RON97
Car 1    8.9     9.2
Car 2    7.5     7.8
Car 3    8.2     8.5
Car 4    8.6     8.8
Car 5    9.5     9.4

(a) Do the sample data support a claim that both petrol types have the same variance
of petrol consumption? Use $\alpha = 0.05$.
(b) Construct a 95% two-sided confidence interval on the ratio of the two variances.
oooOOOooo
Chapter 8
8.
$$Y = \beta_0 + \beta_1 x + \varepsilon$$
(8.1)
where $\beta_0$ is called the intercept of the regression and $\beta_1$ is the slope of the
regression. These two parameters are called regression coefficients. The slope, $\beta_1$,
can be interpreted as the change in the mean value of Y for a unit change in x.
The random error term, $\varepsilon$, is assumed to follow the normal distribution with a
mean of 0 and a variance of $\sigma^2$. Since Y is the sum of this random term and the
mean value, E(Y), (which is a constant), the variance of Y at any given value of x
is also $\sigma^2$. Therefore, at any given value of x, say $x_i$, the dependent variable Y
follows a normal distribution with a mean of $\beta_0 + \beta_1 x_i$ and a standard deviation
of $\sigma$.
8.2 Fitted Regression Model
The true regression line corresponding to Equation (8.1) is usually unknown.
However, the regression line can be estimated by estimating the coefficients $\beta_0$
and $\beta_1$ from an observed data set. The estimates, $\hat{\beta}_0$ and $\hat{\beta}_1$, are calculated
using the least squares method. The estimated regression line, obtained using the
values of $\hat{\beta}_0$ and $\hat{\beta}_1$, is called the fitted line. The least squares estimates, $\hat{\beta}_0$
and $\hat{\beta}_1$, are given by
$$\hat{\beta}_1 = \frac{\displaystyle\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\displaystyle\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}$$
(8.2)
and
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
(8.3)
where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ is the mean of all the observed values and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the
mean of all values of the predictor variable at which the observations were taken.
Once $\hat{\beta}_0$ and $\hat{\beta}_1$ are known, the fitted regression model can be written as:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
(8.4)
where $\hat{y}$ is the fitted or estimated value based on the fitted regression model. It
is an estimate of the mean value, E(Y). The fitted value, $\hat{y}_i$, for a given value of the
predictor variable, $x_i$, may be different from the corresponding observed value, $y_i$.
The difference between the two values is called the residual,
$$e_i = y_i - \hat{y}_i$$
(8.5)
8.3 Assessment of the Regression Model
The fitted regression model, equation (8.4), can be used to estimate the value of the
response, y, for a given value of the predictor variable x.
The regression model can be evaluated by three methods of assessment. They are
the error of estimate, $\hat{\sigma}$, the coefficient of determination, $R^2$, and a test of the slope
of the regression.
8.3.1 Method 1: The error of estimate, $\hat{\sigma}$
The error of estimate is the square root of the error sum of squares divided by the
error degrees of freedom, n - 2, i.e.
$$\hat{\sigma} = \sqrt{\frac{SS_E}{n - 2}}$$
The smaller $\hat{\sigma}$, the more successful
is the linear regression model in explaining the response, y.
8.3.2 Method 2: The coefficient of determination, $R^2$
The coefficient of determination can be interpreted as the proportion of variability
in the observed response variable that is explained by the linear regression model.
The coefficient of determination measures the strength of that linear relationship and is
denoted by
$$R^2 = 1 - \frac{SS_E}{SS_T}$$
The greater $R^2$, the more successful is the linear regression model.
8.3.3 Method 3: Testing the slope, $\beta_1$
The significance of the fitted regression model can be tested by using Student's
t test on the parameter $\beta_1$. The test statistic is
$$t_0 = \frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$$
If $\lvert t_0 \rvert > t_{\alpha/2,\, n-2}$, then the null hypothesis, $\beta_1 = 0$, is rejected. This means that the
regression model is adequate and fits the data; otherwise there is no evidence of a
linear relationship between x and y.
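The analysis of variance (ANOVA) can also be used to test for the significance of
regression, as in the table below.
ANOVA table for simple linear regression, $Y = \beta_0 + \beta_1 x$:

Source of variation | Degrees of freedom | Sum of squares | Mean squares      | F0
Regression          | 1                  | SSR            | MSR = SSR/1       | MSR/MSE
Error               | n - 2              | SSE            | MSE = SSE/(n - 2) |
Total               | n - 1              | SST            |                   |

where
$$SS_T = S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$
$$SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = SS_E + SS_R$$
The null hypothesis $\beta_1 = 0$ is rejected if $F_0 > f_{\alpha,\, 1,\, n-2}$.
The standard errors of the estimated coefficients are
$$se(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2}{S_{XX}}}$$
and
$$se(\hat{\beta}_0) = \sqrt{\hat{\sigma}^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{XX}}\right)}$$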
Example 8.1
The following measurements of the specific heat of a certain chemical were made
in order to investigate the variation in specific heat with temperature.
Temperature (°C):  0      10     20     30     40     50
Specific heat:     0.51   0.55   0.57   0.59   0.63   0.65
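As a minimal sketch of equations (8.2), (8.3) and (8.5), the least squares estimates for the Example 8.1 data can be computed as follows. The function name `least_squares_fit` and the variable names are illustrative and are not taken from the text.

```python
def least_squares_fit(x, y):
    """Least squares intercept and slope from equations (8.2) and (8.3)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    # Numerator and denominator of equation (8.2)
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    s_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    slope = s_xy / s_xx
    intercept = sum_y / n - slope * sum_x / n   # equation (8.3)
    return intercept, slope


temperature = [0, 10, 20, 30, 40, 50]
specific_heat = [0.51, 0.55, 0.57, 0.59, 0.63, 0.65]

b0, b1 = least_squares_fit(temperature, specific_heat)
# Residuals e_i = y_i - y_hat_i from equation (8.5)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(temperature, specific_heat)]
print(round(b0, 4), round(b1, 6))
```

For a fit that includes an intercept, the residuals sum to zero up to rounding error, which gives a quick check on the arithmetic.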
Example 8.2
The following data were collected on 8 lung cancer patients, where x is the
number of years the patient has smoked cigarettes (or any form of nicotine product) and
y is the physician's subjective evaluation of the extent of lung damage on a scale
of 0 to 100.

x (years):   25   35   22   15   48   39   42   31
y (0-100):   55   60   50   30   75   70   71   55
Predictor   Coef     SE Coef   T       P-value
Constant    21.228   9.442     2.248   0.066
x           1.230    0.280     4.397   0.005

S = 8.17169
R-squared = ?

Analysis of Variance
Source           Df   SS        MS        F        P-value
Regression       1    1290.84   1290.84   19.331   0.005
Residual Error   6    400.66    66.777
Total                 1691.50
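The quantities in the Example 8.2 output are linked by the formulas of Section 8.3, and the missing R-squared can be recovered from the sums of squares. The sketch below assumes only the sums of squares and the error degrees of freedom printed in the output; the variable names are illustrative.

```python
import math

# Values read from the Analysis of Variance output of Example 8.2
ss_regression = 1290.84
ss_error = 400.66
df_regression = 1
df_error = 6

ss_total = ss_regression + ss_error              # SST = SSR + SSE
r_squared = 1.0 - ss_error / ss_total            # R^2 = 1 - SSE/SST
ms_error = ss_error / df_error                   # MSE = SSE/(n - 2)
error_of_estimate = math.sqrt(ms_error)          # reported as S in the output
f0 = (ss_regression / df_regression) / ms_error  # F0 = MSR/MSE

print(round(r_squared, 4), round(error_of_estimate, 4), round(f0, 3))
```

The computed error of estimate and F statistic agree with the S and F values shown in the output, which supports reading the table this way.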
Exercise 8
1. The manager of a car plant wishes to investigate how the plant's electricity usage
depends upon the plant's production. The data are given below.

Production (RM million) (x):  4.51  3.58  4.31  5.06  5.64  4.99  5.29  5.83  4.7   5.61  4.9   4.2
Electricity usage (y):        2.48  2.26  2.47  2.77  2.99  3.05  3.18  3.46  3.03  3.26  2.67  2.53

(a) Estimate the linear regression equation $Y = \beta_0 + \beta_1 x$.
(b) Find an estimate for the electricity usage when x = 5.
(c) Find a 90% confidence interval for the electricity usage.
2. An experiment was set up to investigate the variation of the specific heat of a certain
chemical with temperature. The data are given below.

Temperature (°F) (x):  50     60     70     80     90     100
Heat (y):              1.60   1.63   1.67   1.70   1.71   1.71
                       1.64   1.65   1.67   1.72   1.72   1.74
Analysis of variance
Source       DF   SS      MS      F       P-value
Regression   1    23965   23965   63.70   0.000
Residual     18   6772    376
Total        19   30737

Analysis of variance
Source       SS       MS       F        P-value
Regression   152.13   152.13   128.86   0.000
Residual     21.25    1.18
Total        173.38

(a) Estimate the purity of oxygen when the percentage of hydrocarbon is 1%.
(b) Obtain a 95% confidence interval for the true slope $\beta_1$.
(c) Test for significance of regression at $\alpha = 0.05$.
5. Regression methods were used to analyze the data from a study investigating the
relationship between roadway surface temperature (x) and pavement deflection (y).
The data follow.
Temperature x | Deflection y | Temperature x | Deflection y
70.0          | 0.621        | 72.7          | 0.637
77.0          | 0.657        | 67.8          | 0.627
72.1          | 0.640        | 76.6          | 0.652
72.8          | 0.623        | 73.4          | 0.630
78.3          | 0.661        | 70.5          | 0.627
74.5          | 0.641        | 72.1          | 0.631
74.0          | 0.637        | 71.2          | 0.641
72.4          | 0.630        | 73.0          | 0.631
75.2          | 0.644        | 72.7          | 0.634
76.0          | 0.639        | 71.4          | 0.638
1
2
0.65 0.79
4
1.36
8
2.26
16
3.59
25
5.39
Chapter 9
9.
9.1 Introduction
Multiple regression (the term was first used by Pearson, 1908) is used to learn more
about the relationship between several independent or predictor variables and a
dependent or criterion variable. It is an extension of the simple linear regression
model.
Consider the following data consisting of n sets of values:
$$(y_1, x_{11}, x_{21}, \ldots, x_{k1})$$
$$(y_2, x_{12}, x_{22}, \ldots, x_{k2})$$
$$\vdots$$
$$(y_n, x_{1n}, x_{2n}, \ldots, x_{kn})$$
(9.1)
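These equations can be solved by using matrices. Then we have the fitted
regression model as given below:
$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k$$

Source of variation | Degrees of freedom | Sum of squares | Mean squares            | F0
Regression          | k                  | SSR            | MSR = SSR/k             | MSR/MSE
Error               | n - (k+1)          | SSE            | MSE = SSE/(n - (k+1))   |
Total               | n - 1              | SST            |                         |

where
$$SS_T = S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$
$$SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = SS_E + SS_R$$
The ANOVA table is used to test for significance of regression.
The hypotheses are:
$$H_0: \beta_1 = \beta_2 = 0$$
$$H_1: \text{at least one of } \beta_1 \text{ and } \beta_2 \text{ is not zero}$$
The test statistic is
$$F_0 = \frac{MSR}{MSE}$$
and H0 is rejected if $F_0 > f_{\alpha,\, k,\, n-(k+1)}$.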
Example 9.1:
A set of experimental runs were made to determine a way of predicting cooking
time y at various levels of oven width x1, and temperature x2. The data were
recorded as follows:
ii.
Predictor   Coef    SE Coef   T        P
Constant    0.568   0.585     0.970    0.364
x1          2.706   0.194     13.935   0.000
x2          2.051   0.046     44.380   0.000

R-Sq = 100%
R-Sq(adj) = 100%

Source           DF   SS          MS         F           P
Regression       2    10953.334   5476.667   13647.872   0.000
Residual Error   7    2.809       0.401
Total            9    10956.143

Since $F > f_{0.01,\, 2,\, 7} = 9.55$, we reject H0. This means that the regression
is significant.
Exercise 9
1. Given the data:
Test Number | y   | x1 | x2
1           | 1.6 | 1  | 1
2           | 2.1 | 1  | 2
3           | 2.4 | 2  | 1
4           | 2.8 | 2  | 2
5           | 3.6 | 2  | 3
6           | 3.8 | 3  | 2
7           | 4.3 | 2  | 4
8           | 4.9 | 4  | 2
9           | 5.7 | 4  | 3
10          | 5   | 3  | 4
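Observation | Pull Strength y | Wire Length x1 | Die Height x2
1           | 9.95            | 2              | 50
2           | 24.45           | 8              | 110
3           | 31.75           | 11             | 120
4           | 35.00           | 10             | 550
5           | 25.02           | 8              | 295
6           | 16.86           | 4              | 200
7           | 14.38           | 2              | 375
8           | 9.60            | 2              | 52
9           | 24.35           | 9              | 100
10          | 27.50           | 8              | 300
11          | 17.08           | 4              | 412
12          | 37.00           | 11             | 400
13          | 41.95           | 12             | 500
14          | 11.66           | 2              | 360
15          | 21.65           | 4              | 205
16          | 17.89           | 4              | 400
17          | 69.00           | 20             | 600
18          | 10.30           | 1              | 585
19          | 34.93           | 10             | 540
20          | 46.59           | 15             | 250
21          | 44.88           | 15             | 290
22          | 54.12           | 16             | 510
23          | 56.63           | 17             | 590
24          | 22.13           | 6              | 100
25          | 21.15           | 5              | 400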
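$$\sum x_{i1} = 223, \quad \sum x_{i2} = 553, \quad \sum y_i = 1{,}916,$$
$$\sum x_{i1}^2 = 5{,}200.9, \quad \sum x_{i2}^2 = 31{,}729, \quad \sum x_{i1} x_{i2} = 12{,}352,$$
$$\sum x_{i1} y_i = 43{,}550.8, \quad \sum x_{i2} y_i = 104{,}736.8, \quad \sum y_i^2 = 371{,}595.6$$
(a) Estimate the parameters to fit the multiple regression model for these data.
(b) What is the predicted strength when $x_1 = 18$ meter and $x_2 = 43$%?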
4. A set of experimental runs were made to determine a way of predicting cooking time
y at various levels of oven width x1, and temperature x2. The data were recorded as
follows:
Brightness (%):
57
54
Contrast (%):
56 80 70 50
35
26
65
80
25
80
Compute the mean response of the useful range when brightness = 80 and contrast
= 75. Compute a 95% CI.
(f) Interpret parts (d) and (e) and comment on the comparison between the 95% PI
and 95% CI.
6. A study was performed on wear of a bearing, y, and its relationship to $x_1$ = oil viscosity
and $x_2$ = load. The following data were obtained:

x1:  1.6   15.5   22.0   43.0   33.0   40.0
x2:  851   816    1058   1201   1357   1115
y:   293   230    172    91     113    125

(a) Fit a multiple regression model to these data.
(b) Estimate $\sigma^2$ and the standard errors of the regression coefficients.
(c) Use the model to predict wear when $x_1 = 25$ and $x_2 = 1000$.
(d) Fit a multiple regression model with an interaction term to these data.
(e) Estimate $\sigma^2$ and $se(\hat{\beta}_j)$ for this new model. How did these quantities
change? Does this tell you anything about the value of adding the interaction term to the model?
(f) Use the model in (d) to predict wear when $x_1 = 25$ and $x_2 = 1000$. Compare this
prediction with the predicted value from part (c) above.
Chapter 10
10.
Design of Experiment
At the end of the lesson, the student should be able to:
Design and conduct experiments involving one and two factors using
factorial design.
Understand the concepts of one-way and two-way ANOVA.
Know how to construct an ANOVA table.
Understand how to use ANOVA to analyze data from these experiments.
Analyze and interpret main effects and interactions.
10.1 Introduction
Experimental design techniques based on statistics are useful in engineering
for improving the performance of manufacturing processes.
By using designed experiments, we can determine which subset of the process
variables has the most influence on process performance. The advantages
of experimental design include improved process yield, reduced
variability in the process, reduced design and development time, and
reduced cost of operation. Statistically designed experiments allow efficiency and
economy in the experimental process. If data are collected without an
experimental design, it may not be possible to extract the desired information,
and the results of the analysis may be confusing, misleading, not credible and not
reproducible. Normally, when several factors are of interest in an experiment, a
factorial experiment should be used.
10.2 Terminology and definitions
Some common terms in experimental design are:
Factor: a variable whose influence upon the response variable is being studied
in the experiment.
Factor level: one of the different modes or settings of a factor.
Trial (or run): the application of a treatment to an experimental unit.
Treatment (or level combination): a specific combination of the levels of
different factors.
Experimental unit (subject): the basic unit on which the response
measurements are collected.
Source of variation   SS            DF      MS                                  F
Treatment             SStreatment   k - 1   MStreatment = SStreatment/(k - 1)   MStreatment/MSerror
Error                 SSerror       N - k   MSerror = SSerror/(N - k)
Total                 SStotal       N - 1
Calculation of SS:
Grand Mean:
The grand mean is the average of all the values when the factor is ignored. It is a
weighted average of the individual sample means.
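\[
\bar{x} = \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}
        = \frac{\sum_{i=1}^{k} n_i \bar{x}_i}{\sum_{i=1}^{k} n_i}
        = \frac{n_1\bar{x}_1 + n_2\bar{x}_2 + \cdots + n_k\bar{x}_k}{n_1 + n_2 + \cdots + n_k},
\qquad N = \sum_{i=1}^{k} n_i
\]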
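Total variation:
The variation between the observations and the grand mean is given by
\[
SS_{Total} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2
           = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 - \frac{\left(\sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}\right)^2}{N}
\]
The treatment (between-group) and error (within-group) variations are given by
\[
SS_{Treatment} = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2
               = \sum_{i=1}^{k} \frac{\left(\sum_{j=1}^{n_i} x_{ij}\right)^2}{n_i} - \frac{\left(\sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}\right)^2}{N}
\]
\[
SS_{Error} = \sum_{i=1}^{k} (n_i - 1)\, s_i^2
\]
where s_i^2 is the sample variance of treatment i.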
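The three sums of squares can be checked numerically. The sketch below is a minimal illustration in Python, assuming made-up data for three treatments with unequal sample sizes; the function name `one_way_sums_of_squares` is hypothetical and not part of the original text. It computes each quantity from its defining formula, so that the partition SSTotal = SSTreatment + SSError can be confirmed.

```python
from statistics import mean, variance


def one_way_sums_of_squares(groups):
    """Return (SS_total, SS_treatment, SS_error) for a list of samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of every observation around the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    # Variation of each treatment mean around the grand mean, weighted by n_i.
    ss_treatment = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Pooled within-treatment variation: sum of (n_i - 1) * s_i^2.
    ss_error = sum((len(g) - 1) * variance(g) for g in groups)
    return ss_total, ss_treatment, ss_error


# Hypothetical data (not from the text): three treatments.
groups = [[4, 6, 8], [10, 12], [1, 3, 5, 7]]
ss_total, ss_treatment, ss_error = one_way_sums_of_squares(groups)
print(ss_total, ss_treatment, ss_error)
```

Each sample needs at least two observations, because the sample variance is undefined for a single value.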
The total df is one less than the total sample size N. If the total number of
observations is 24, then the df for total is 24 - 1 = 23.
Example:
The statistics classroom is divided into three rows: front, middle, and back. The
professor noticed that the further the students were from him, the more likely they
were to miss class or sleep in class. He wanted to see whether the students who sit
in front, nearer to him, did better on the exams. A random sample of the students in
each row was taken. The exam scores for those students were recorded as follows:
Front: 82, 83, 97, 93, 55, 67, 53
Middle: 83, 78, 68, 61, 77, 54, 69, 51, 63
Back: 38, 59, 55, 66, 45, 52, 52, 61
i.
ii.
Solution:
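Row       n     Σx      Σx²
Front     7     530     41,994
Middle    9     604     41,494
Back      8     428     23,460
Total     24    1,562   106,948

\[
\frac{\left(\sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}\right)^2}{N} = \frac{(1562)^2}{24} = 101{,}660.2,
\qquad N = \sum_{i=1}^{k} n_i = 24
\]
Calculation of SS:
\[
SS_{Total} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2
           = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 - 101{,}660.2
           = 106{,}948 - 101{,}660.2 = 5{,}287.8
\]
\[
SS_{Treatment} = \sum_{i=1}^{k} \frac{\left(\sum_{j=1}^{n_i} x_{ij}\right)^2}{n_i} - 101{,}660.2
               = \frac{530^2}{7} + \frac{604^2}{9} + \frac{428^2}{8} - 101{,}660.2
\]
\[
               = 40{,}128.57 + 40{,}535.11 + 22{,}898 - 101{,}660.2 = 1{,}901.48
\]
\[
SS_{Error} = SS_{Total} - SS_{Treatment} = 5{,}287.8 - 1{,}901.48 = 3{,}386.32
\]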
One-Way ANOVA
Source of variation          SS        DF   MS       F      P
Between group (Treatment)    1901.48   2    950.74   5.90   0.009
Within group (Error)         3386.32   21   161.25
Total                        5287.80   23
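The entries of this ANOVA table can be reproduced with a few lines of Python. The following is a sketch using only the exam scores given in the example and the shortcut formulas for the sums of squares; the variable names are illustrative. Because it keeps full precision in the correction term (1562)²/24, the treatment sum of squares comes out near 1,901.52 and not the hand-rounded 1,901.48, while the F ratio still rounds to 5.90.

```python
# Exam scores by row, as listed in the example.
front = [82, 83, 97, 93, 55, 67, 53]
middle = [83, 78, 68, 61, 77, 54, 69, 51, 63]
back = [38, 59, 55, 66, 45, 52, 52, 61]
groups = [front, middle, back]

k = len(groups)                          # number of treatments
N = sum(len(g) for g in groups)          # total number of observations
grand_total = sum(sum(g) for g in groups)
correction = grand_total ** 2 / N        # (sum of all x)^2 / N

ss_total = sum(x * x for g in groups for x in g) - correction
ss_treatment = sum(sum(g) ** 2 / len(g) for g in groups) - correction
ss_error = ss_total - ss_treatment

ms_treatment = ss_treatment / (k - 1)
ms_error = ss_error / (N - k)
f_stat = ms_treatment / ms_error

print(f"SS: {ss_treatment:.2f} {ss_error:.2f} {ss_total:.2f}")
print(f"MS: {ms_treatment:.2f} {ms_error:.2f}  F = {f_stat:.2f}")
```

The P-value needs the F distribution with (k - 1, N - k) degrees of freedom; if SciPy is available, `scipy.stats.f.sf(f_stat, k - 1, N - k)` is one way to obtain it.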
The Analysis of Variance (ANOVA) can be used to analyze the data from
experimental designs. In the ANOVA, the null hypothesis that an effect is
equal to 0 is tested. When H0 is rejected, this provides evidence that the factor
involved actually affects the outcome (response). However, some assumptions
should be made before we do the analysis. Among the assumptions are the same
number of replicates for each treatment, at least 2 replications in each cell, and
that each treatment is a random sample from a normal population.
10.5 The 2^2 factorial design
The 2^2 factorial design means that the experiment has been set up with 2 factors,
each at 2 levels. The objective is to test and determine which of the two main
effects and the interaction are important. As an example, consider an experiment
with two factors, factor A: reaction time and factor B: reaction temperature. For
factor A, the two levels are 1 hour and 2 hours. These levels can also be denoted
as - (minus) for one level and + (plus) for the other level. For factor B, the two
levels are 35 °C (-) and 55 °C (+). This can be explained by the table below:
Factor B           Factor A (Time)
(Temperature)      1 hour (-)              2 hours (+)
35 °C (-)          Yields measured:        Yields measured:
                   x111, x112, x113        x121, x122, x123
55 °C (+)          Yields measured:        Yields measured:
                   x211, x212, x213        x221, x222, x223
where n = 3 replications and x_ijk, k = 1, ..., n, are the observations in cell (i, j).
The levels can be in the form of variable data (numbers) or attribute data such as
male and female, or on and off. Normally, one level is designated as high (+) and
the other level as low (-), as shown in the table below:
Factor B           Factor A (Time)
(Temperature)      Low (-)     High (+)
Low (-)            (1)         a
High (+)           b           ab
All possible treatment combinations of the levels of the factors, called a
factorial experiment, are given in a design or test matrix for the 2^2 factorial
design as follows:
Treatment        Factorial Effect
Combination      A     B     AB
(1)              -     -     +
a                +     -     -
b                -     +     -
ab               +     +     +
The letters (1), a, b and ab represent the totals of all n observations at each
treatment combination.
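The sign pattern of the design matrix can also be generated programmatically. The sketch below is one possible way to do it in Python, coding the low level as -1 and the high level as +1 (an assumed coding convention); the function name `design_matrix_2x2` is hypothetical. The AB column is simply the product of the A and B columns.

```python
from itertools import product


def design_matrix_2x2():
    """Return rows (label, A, B, AB) in standard order: (1), a, b, ab."""
    rows = []
    # B is the outer (slow) index so that A alternates first.
    for b, a in product((-1, 1), repeat=2):
        label = (("a" if a == 1 else "") + ("b" if b == 1 else "")) or "(1)"
        rows.append((label, a, b, a * b))  # interaction sign = A sign * B sign
    return rows


for label, a, b, ab in design_matrix_2x2():
    print(f"{label:>3}  A={a:+d}  B={b:+d}  AB={ab:+d}")
```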
10.6 Estimate the effects of factors in the 2^2 factorial design
The effects of the main factors, factor A and factor B, and the effect of the AB
interaction can be calculated using the formulas below.
Effect of main factor A is:
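\[
A = \frac{a + ab}{2n} - \frac{b + (1)}{2n} = \frac{1}{2n}\left[a + ab - b - (1)\right]
\]
Effect of main factor B is:
\[
B = \frac{b + ab}{2n} - \frac{a + (1)}{2n} = \frac{1}{2n}\left[b + ab - a - (1)\right]
\]
Effect of the AB interaction is:
\[
AB = \frac{ab + (1)}{2n} - \frac{a + b}{2n} = \frac{1}{2n}\left[(1) + ab - a - b\right]
\]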
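As an illustration of these formulas, the following Python sketch computes the three effects from the treatment totals. The function name `factorial_effects` is hypothetical, and the totals used in the call, (1) = 70, a = 100, b = 49 and ab = 31 with n = 3, are the ones that appear in Example 1 of this chapter.

```python
def factorial_effects(total_1, total_a, total_b, total_ab, n):
    """Estimate the A, B and AB effects of a 2^2 design from treatment totals.

    Each total is the sum of the n replicate observations at that treatment.
    """
    contrast_a = total_a + total_ab - total_b - total_1
    contrast_b = total_b + total_ab - total_a - total_1
    contrast_ab = total_1 + total_ab - total_a - total_b
    return {
        "A": contrast_a / (2 * n),
        "B": contrast_b / (2 * n),
        "AB": contrast_ab / (2 * n),
    }


effects = factorial_effects(total_1=70, total_a=100, total_b=49, total_ab=31, n=3)
print(effects)
```

A negative effect means that the average response decreases when the factor is moved from its low level to its high level.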
The effects of main factors and the interaction factor can be tested by using the
two-way ANOVA table.
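ANOVA Table:
Source of    Degree of    Sum of     Mean of squares          F           P-value
variation    Freedom      Squares
A            1            SSA        MSA = SSA/1              MSA/MSE
B            1            SSB        MSB = SSB/1              MSB/MSE
AB           1            SSAB       MSAB = SSAB/1            MSAB/MSE
Error        4(n - 1)     SSE        MSE = SSE/[4(n - 1)]
Total        4n - 1       SST
Here n is the number of replications at each treatment combination, so the
experiment has 4n observations in total.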
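The sums of squares are:
\[
SS_A = \frac{\left[a + ab - b - (1)\right]^2}{4n} = \frac{(\text{Contrast } A)^2}{4n}
\]
\[
SS_B = \frac{\left[b + ab - a - (1)\right]^2}{4n} = \frac{(\text{Contrast } B)^2}{4n}
\]
\[
SS_{AB} = \frac{\left[(1) + ab - a - b\right]^2}{4n} = \frac{(\text{Contrast } AB)^2}{4n}
\]
\[
SS_T = \sum_{i}\sum_{j}\sum_{k} x_{ijk}^2 - \frac{\left(\sum_{i}\sum_{j}\sum_{k} x_{ijk}\right)^2}{4n},
\qquad SS_E = SS_T - SS_A - SS_B - SS_{AB}
\]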
The results can also be expressed as a regression model,
\[
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_1 x_2
\]
where x1 and x2 are the coded (-1, +1) levels of factors A and B, and
β0 = constant (grand average of all 4n observations)
β1 = the estimated coefficient of x1 (the effect of having factor A) = (effect A)/2
β2 = the estimated coefficient of x2 (the effect of having factor B) = (effect B)/2
β3 = the estimated coefficient of x1x2 (the effect of interaction between factor A
and factor B) = (effect AB)/2
The final regression model can be determined from the ANOVA table. For
instance, if the interaction between A and B is not significant, then the
final regression model is
\[
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2
\]
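The relationship between the effects and the regression coefficients can be sketched in Python as follows. This is only an illustration, assuming the factors are coded as -1 (low) and +1 (high); the function names `fit_2x2_regression` and `predict` are hypothetical, and the totals in the call are those of Example 1 in this chapter.

```python
def fit_2x2_regression(total_1, total_a, total_b, total_ab, n):
    """Return (b0, b1, b2, b3) for y = b0 + b1*x1 + b2*x2 + b3*x1*x2."""
    b0 = (total_1 + total_a + total_b + total_ab) / (4 * n)  # grand average
    effect_a = (total_a + total_ab - total_b - total_1) / (2 * n)
    effect_b = (total_b + total_ab - total_a - total_1) / (2 * n)
    effect_ab = (total_1 + total_ab - total_a - total_b) / (2 * n)
    # Each coefficient is half of the corresponding effect.
    return b0, effect_a / 2, effect_b / 2, effect_ab / 2


def predict(coeffs, x1, x2, include_interaction=True):
    """Predicted response at coded levels x1, x2 (each -1 or +1)."""
    b0, b1, b2, b3 = coeffs
    y_hat = b0 + b1 * x1 + b2 * x2
    if include_interaction:
        y_hat += b3 * x1 * x2
    return y_hat


coeffs = fit_2x2_regression(70, 100, 49, 31, n=3)
print(coeffs)
print(predict(coeffs, x1=+1, x2=-1))                             # full model
print(predict(coeffs, x1=+1, x2=-1, include_interaction=False))  # reduced model
```

With all four terms included, the model reproduces the cell averages exactly; dropping a term that is not significant gives the reduced model.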
Example 1:
An engineer is interested in the effects of cutting speed (A) and tool
geometry (B) on the life in hours of a machine tool. Two cutting speeds and
two different geometries are used. Three experimental tests were done at each of
the four combinations. The data are as follows:
Tool Geometry      Cutting Speed (A)
(B)                Low (-)        High (+)
1                  22, 28, 20     34, 37, 29
2                  18, 15, 16     11, 10, 10
Solution:
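(a) The 2^2 factorial design table:
Treatment        Factorial Effect
Combination      A     B     AB
(1)              -     -     +
a                +     -     -
b                -     +     -
ab               +     +     +
(b) Estimates of the effects:
Treatment        Life time (hour)    Total    Average
Combination
(1)              22, 28, 20          70       23.33
a                34, 37, 29          100      33.33
b                18, 15, 16          49       16.33
ab               11, 10, 10          31       10.33
With n = 3 replications:
\[
A = \frac{1}{2(3)}\left[100 + 31 - 49 - 70\right] = \frac{12}{6} = 2
\]
\[
B = \frac{1}{2(3)}\left[49 + 31 - 100 - 70\right] = \frac{-90}{6} = -15
\]
\[
AB = \frac{1}{2(3)}\left[70 + 31 - 100 - 49\right] = \frac{-48}{6} = -8
\]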
ANOVA Table:
Source of variation    SS         df    MS       F0
A                      12         1     12       1.321
B                      675        1     675      74.315
AB                     192        1     192      21.138
Error                  72.667     8     9.083
Total                  951.667    11
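The ANOVA table for Example 1 can be reproduced from the raw tool-life data. The Python sketch below uses the contrast formulas for SSA, SSB and SSAB and obtains SSE by subtraction; the variable names are illustrative. Since it does not round MSE to 9.083 before dividing, the F ratio for B comes out near 74.31 where the table shows 74.315.

```python
# Tool-life observations from Example 1, keyed by treatment combination.
cells = {
    "(1)": [22, 28, 20],
    "a": [34, 37, 29],
    "b": [18, 15, 16],
    "ab": [11, 10, 10],
}
n = 3  # replications per treatment combination

totals = {label: sum(obs) for label, obs in cells.items()}
contrasts = {
    "A": totals["a"] + totals["ab"] - totals["b"] - totals["(1)"],
    "B": totals["b"] + totals["ab"] - totals["a"] - totals["(1)"],
    "AB": totals["(1)"] + totals["ab"] - totals["a"] - totals["b"],
}
# Each effect has 1 degree of freedom, so MS equals SS.
ss = {name: c ** 2 / (4 * n) for name, c in contrasts.items()}

all_obs = [x for obs in cells.values() for x in obs]
ss_total = sum(x * x for x in all_obs) - sum(all_obs) ** 2 / (4 * n)
ss_error = ss_total - sum(ss.values())
ms_error = ss_error / (4 * (n - 1))

f0 = {name: value / ms_error for name, value in ss.items()}
for name in ("A", "B", "AB"):
    print(f"{name:>2}: SS = {ss[name]:8.3f}  F0 = {f0[name]:7.3f}")
print(f"Error: SS = {ss_error:.3f}, MS = {ms_error:.3f}; Total SS = {ss_total:.3f}")
```

Each F0 is compared with the critical value of the F distribution with 1 and 8 degrees of freedom, which is 5.32 at the 5% level, so here B and AB exceed it while A does not.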
Exercise 10
1. UTP wishes to compare four programs for training staff to perform a certain task.
Twenty new staff members are randomly assigned to the training programs, with 5 in
each program. At the end of the training, a test is conducted to see how quickly
trainees can perform the task. The number of times the task is performed per minute
is recorded for each trainee, and the results are shown in the table below:
Program        Time in minutes
Program 1
Program 2
Program 3      9, 8, 7, 8, 11
Program 4      10, 6, 9, 9, 10
i.
ii.
2. A farmer wants to determine which type of fertilizer is best for growing mango
trees in his orchard. He chooses three types of fertilizer, F1, F2 and F3, and
treats two mango trees with each type. The number of mangoes from each tree is
recorded in the table below:
Fertilizer     Number of mangoes from each tree
F1             32, 40
F2             43, 42
F3             20, 10
Treatment        Design    Material    AB    Total lifetime of 3
Combination                                  components (in hours)
(1)              -         -           +     122
a                +         -           -     60
b                -         +           -     120
ab               +         +           +     118
i. Perform a two-way analysis of variance to estimate the effects of design and
material expense on the component lifetime, given that the total sum of squares is
1050.
ii. Based on your results in part (i), what conclusions can you draw from the
factorial experiment?
iii. Indicate which effects are significant to the lifetime of a component.
iv. Write the least squares fitted model using only the significant sources.
5. An engineer suspects that the surface finish of metal parts is influenced by the
type of paint used and the drying time. He selected two drying times, 20 and 30
minutes, and used two types of paint. Three parts are tested with each combination
of paint type and drying time. The data are as follows:
Paint        Drying Time (min)
             20 min         30 min
ICI          74, 64, 50     78, 85, 92
NIPPON       92, 86, 68     66, 45, 85
i. Compute the estimates of the effects and their standard errors for this design.
ii. Perform an analysis of variance of the appropriate regression model for this
design. Include in your analysis hypothesis tests for each coefficient, as well as
residual analysis.
Low
130 155 20 70
74
High
180 82 58
168 160 82 60
i. Compute the estimates of the effects and their standard errors for this design.
ii. Perform an analysis of variance of the appropriate regression model for this
design. Include in your analysis hypothesis tests for each coefficient, as well as
residual analysis. State your final conclusions about the adequacy of the model.
Compare your results to part (c) and comment.
7. An article in the IEEE Transactions on Semiconductor Manufacturing (Vol. 5, 1992,
pp. 214-222) describes an experiment to investigate the surface charge on a silicon
wafer. The factors thought to influence induced surface charge are cleaning method
(spin rinse dry or SRD and spin dry or SD) and the position on the wafer where the
charge was measured. The surface charge (× 10^11 q/cm^3) response data are shown.
Cleaning       Test Position
Method         L          R
SD             1.66       1.84
               1.90       1.84
               1.92       1.62
SRD            -4.21      -7.58
               -1.35      -2.20
               -2.08      -5.36
i. Compute the estimates of the effects and their standard errors for this design.
ii. Perform an analysis of variance of the appropriate regression model for this
design. Include in your analysis hypothesis tests for each coefficient, as well as
residual analysis. State your final conclusions about the adequacy of the model.
Compare your results to part (c) and comment.
oooOOOooo