Introductory Statistics
STATISTICS
4th Edition
By OpenStax College
Adapted by Riyanti Boyd & Natalia Casper
Special Acknowledgement to Mark Beintema & Gladys Poma
Introductory Statistics
Susan Dean and Barbara Illowsky (Published 2013 by OpenStax College)
Adapted by: College of Lake County Faculty: Riyanti Boyd and Natalia Casper
Revised July 2021
This project was funded by a grant from the College of Lake County Foundation.
Original text materials for Introductory Statistics by Dean and Illowsky 2013
available at: http://cnx.org/content/col11562/latest/.
1 | SAMPLING AND DATA
Introduction
Chapter Objectives: By the end of this chapter, the student should be able to:
• Understand basic statistical terminology.
• Recognize and differentiate between key terms.
• Apply various types of sampling methods to data collection.
You are probably asking yourself the question, "When and where will I use statistics?" If you
read any newspaper, watch television, or use the Internet, you will see statistical
information. There are statistics about crime, sports, education, politics, and real estate.
Typically, when you read a newspaper article or watch a television news program, you are
given sample information. With this information, you may make a decision about the
correctness of a statement, claim, or "fact." Statistical methods can help you make the "best
educated guess."
Since you will undoubtedly be given statistical information at some point in your life, you
need to know some techniques for analyzing the information thoughtfully. Think about
buying a house or managing a budget. Think about your chosen profession. Think about the
COVID-19 pandemic. Think about political elections. In addition, the fields of economics,
business, psychology, education, biology, law, computer science, police science, and early
childhood development require at least one course in statistics.
Included in this chapter are some basic ideas and terms used in Statistics. We will learn how
data are gathered and how to distinguish "good" data from "bad."
1.1 | Basic Definitions
The science of Statistics deals with the collection, analysis, interpretation, and presentation
of data. We see and use data in our everyday lives. There are two basic branches of
Statistics: Descriptive and Inferential.
Effective interpretation of data (inference) is based on good procedures for producing data
and thoughtful examination of the data. We will encounter what may seem to be too many
mathematical formulas for interpreting data. The goal of statistics is not to perform
numerous calculations using the formulas, but to gain an understanding of your data. The
calculations can be done using a calculator or a computer; the understanding must come from
you. If you can thoroughly grasp the basics of statistics, you can be more confident in the
decisions you make in life and in your chosen career.
Key Terms
In Statistics, we generally want to study a population. You can think of a population as a
collection of persons, things, or objects under study. For logistical reasons, it is usually not
possible to gain access to all of the information from the entire population. So when we want
to study a population, we usually select a sample. The idea of sampling is to select a
portion (or subset) of the larger population and study that portion (the sample) to gain
information about the population. Data are the result of sampling from a population.
https://openclipart.org/detail/216524/many-and-few
Because it takes a lot of time and resources to examine an entire population, sampling is a
very practical technique. If you wished to compute the overall grade point average at your
school, it would make sense to select a sample of students who attend the school. The data
collected from the sample would be the students' grade point averages. In presidential
elections, samples of between 1,000 and 2,000 prospective voters are used for opinion polls.
The opinion poll is supposed to represent the views of the people in the entire country.
Manufacturers of canned carbonated drinks take samples to determine if a 16-ounce
can contains 16 ounces of carbonated drink.
From the sample data, we can calculate a statistic. A statistic is a number that represents
a property of the sample. For example, if we consider one math class to be a sample of the
population of all math classes, then the average number of points earned by students in that
one math class at the end of the term is an example of a statistic. The statistic is an estimate
of a population parameter. A parameter is a number that represents a property of the
population. Since we considered all math classes to be the population, then the average
number of points earned per student over all the math classes is an example of a parameter.
One of the main concerns in the field of statistics is how accurately a statistic estimates a
parameter. The accuracy really depends on how well the sample represents the
population. The sample must contain the characteristics of the population in order to be
a representative sample. We are interested in both the sample statistic and the population
parameter in inferential statistics. In a later chapter, we will use the sample statistic to test
the validity of the established population parameter.
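The difference between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration: the population of scores below is invented (generated by a seeded random number generator), and the class size of 30 is an assumption rather than a value from the text.

```python
import random
from statistics import mean

# Hypothetical population: end-of-term points for every student in all math classes.
rng = random.Random(1)  # the seed is arbitrary, fixed only for repeatability
population = [rng.randint(40, 100) for _ in range(500)]

# Parameter: a number that describes the whole population.
parameter = mean(population)

# Sample: one "class" of 30 students drawn at random from the population.
sample = rng.sample(population, 30)

# Statistic: the same kind of number, computed from the sample only.
statistic = mean(sample)

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
```

The two printed values will usually be close but not identical, which is why the statistic is described as an estimate of the parameter.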
A variable, denoted by capital letters such as X and Y, is a characteristic of interest for each
person or thing in a population. Variables may be numerical or categorical.
• Numerical variables take on numerical values with equal units such as weight in
pounds and time in hours.
• Categorical variables place the person or thing into a category.
If we let X equal the number of points earned by one math student at the end of a term,
then X is a numerical variable. If we let Y be a person's party affiliation, then Y is a
categorical variable, and some possible values of Y would be Republican, Democrat, and
Independent. We could do some math with values of X (calculate
the average number of points earned, for example), but it makes no sense to do math with
values of Y (calculating an average party affiliation makes no sense).
The actual values of a variable are called data (a single value is a datum); these values
may be numbers or words.
Example 1.1
Determine what the key terms refer to in the following study: We want to know the
average amount of money first-year college students spend at ABC College on school supplies
that do not include books. We randomly survey 100 first-year students at the college. Three
of those students spent $150, $200, and $225, respectively.
Solution 1.1:
The population is all first-year students attending ABC College this term.
The sample is the 100 first-year students surveyed at the college (although this sample
may not represent the entire population).
The parameter is the average amount of money spent (excluding books) by first-year
college students at ABC College this term. This average would be represented by µ.
The statistic is the average amount of money spent (excluding books) by first-year college
students in the sample. This average would be represented by x̄.
The variable is the amount of money spent (excluding books) by one first-year student.
Let X = the amount of money spent (excluding books) by one first-year student attending
ABC College.
The data would be the actual dollar amounts spent by the first-year students. Examples of
the data would be $150, $200, and $225.
1.1 Determine what the key terms refer to in the following study. We want to know the
average amount of money spent on school uniforms each year by families at Knoll Academy.
We randomly survey 100 families with children in the school. Three of the families spent
$65, $75, and $95, respectively.
Example 1.2
A study was conducted at a local college to analyze the average cumulative GPAs of
students who graduated last year. Fill in the letter of the phrase that best describes each of
the items below.
1. Population ____
2. Statistic ____
3. Parameter ____
4. Sample ____
5. Variable ____
6. Data ____
a) the cumulative GPA of a student who graduated from the college last year
b) the average cumulative GPA of students surveyed who graduated from the college last
year
c) 3.65, 2.80, 1.50, 3.90
d) a group of students who graduated from the college last year, randomly selected
e) all students who graduated from the college last year
f) the average cumulative GPA of all students in the study who graduated from the
college last year
Solution 1.2: 1. e; 2. b; 3. f; 4. d; 5. a; 6. c
Example 1.3
As part of a study designed to test the safety of automobiles, the National Transportation
Safety Board collected and reviewed data about the effects of an automobile crash on test
dummies. Cars with dummies in the front seats were crashed into a wall at a speed of 35
miles per hour. We want to know the proportion of dummies in the driver’s seat that would
have had head injuries if they had been actual drivers. We start with a simple random
sample of 75 cars.
Solution 1.3
The population consists of all cars containing dummies in the front seat.
The parameter is the proportion of driver dummies (if they had been real people) who
would have suffered head injuries in the population.
The statistic is the proportion of driver dummies (if they had been real people) who would
have suffered head injuries measured in the sample.
The variable X = whether or not a driver dummy (if it had been a real person) would have
suffered head injuries.
The possible data values would be either: yes, had head injury, or no, did not.
Example 1.4
An insurance company would like to determine the proportion of all medical doctors who
have been involved in one or more malpractice lawsuits. The company selects 500 doctors at
random from a professional directory and determines the number in the sample who have
been involved in a malpractice lawsuit.
Solution to 1.4:
The parameter is the proportion of medical doctors who have been involved in one or more
malpractice suits in the population.
The sample is the 500 doctors selected at random from the professional directory.
The statistic is the proportion of medical doctors who have been involved in one or more
malpractice suits in the sample.
1.2 | Data and Sampling
Data can come from a population or from a sample. Small letters like x or y generally are
used to represent data values. Most data can be put into the two major categories of
qualitative or quantitative.
Qualitative data are the result of categorizing or describing attributes of a population. Hair
color, blood type, ethnic group, type of car a person drives, and the street a person lives on
are examples of qualitative data. Qualitative data are generally described by words or letters.
For instance, hair color might be black, dark brown, light brown, blonde, gray, or red. Blood
type might be AB+, O-, or B+.
Researchers often prefer to use quantitative data over qualitative data because it lends itself
more easily to mathematical analysis. For example, it does not make sense to find an average
hair color or blood type.
Quantitative data are always numbers. Quantitative data are the result
of counting or measuring attributes of a population. Amount of money, pulse rate, weight,
number of people living in your town, and number of students who take statistics are
examples of quantitative data. Quantitative data may be either discrete or continuous.
• Quantitative discrete data are the result of counting. These data take
on only certain numerical values. If you count the number of phone calls you receive
for each day of the week, you might get values such as zero, one, two, or three.
• Quantitative continuous data are the result of measuring, assuming
that we can measure accurately. Measuring angles in radians might result in such
numbers as π, π/3, 5π/6, etc. If you and your friends carry backpacks with books in
them to school, the numbers of books in the backpacks would be discrete data and the
weights of the backpacks would be continuous data.
The data are the number of books students carry in their backpacks. You sample five
students. Two students carry three books, one student carries four books, one student carries
two books, and one student carries one book. The numbers of books (three, four, two, and one)
are the quantitative discrete data.
1.5 The data are the number of machines in a gym. You sample five gyms. One gym has
12 machines, one gym has 15 machines, one gym has ten machines, one gym has 22
machines, and the other gym has 20 machines. What type of data is this?
1.6 The data are the areas of lawns in square feet. You sample five houses. The areas of
the lawns are 144 sq. ft, 190 sq. ft, 180 sq. ft, and 210 sq. ft. What type of data is this?
Example 1.7
You go to the supermarket and purchase three cans of soup (19 ounces tomato bisque,
14.1 ounces lentil, and 19 ounces Italian wedding), two packages of nuts (walnuts and
peanuts), four different kinds of vegetables (broccoli, cauliflower, spinach, and carrots), and
two desserts (16 ounces Cherry Garcia ice cream and 32 ounces chocolate chip cookies).
Identify data sets that are quantitative discrete, quantitative continuous, and qualitative.
• The three cans of soup, two packages of nuts, four kinds of vegetables and two
desserts are quantitative discrete data because you count them.
• The weights of the soups (19 ounces, 14.1 ounces, 19 ounces) and weights of
desserts are quantitative continuous data because we measure weights as precisely as
possible.
• Types of soups, nuts, vegetables and desserts are qualitative data because they
are categorical.
Example 1.8
The data are the colors of backpacks. Again, you sample the same five students. One
student has a red backpack, two students have black backpacks, one student has a green
backpack, and one student has a gray backpack. The colors red, black, black, green, and
gray are qualitative data.
NOTE: You may collect data as numbers and report it categorically. For example,
exam scores for students are recorded throughout the term. At the end of the term, letter
grades are reported as A, B, C, D, or F.
Example 1.9
m. people’s attitudes toward the government
n. IQ scores (This may cause some discussion.)
Solution to 1.9:
Quantitative Discrete (a, e, l); Quantitative Continuous (d, f, j, k, n); Qualitative (b, c, g,
h, i, m)
1.9 Determine the correct data type (quantitative or qualitative) for the number of cars in a
parking lot. If quantitative, indicate whether it is continuous or discrete.
Example 1.10
Figure 1.1
Solution 1.10 The chart shows the students in each year, which is qualitative data.
Sampling
Gathering information about an entire population often costs too much or is virtually
impossible. Instead, we usually use data from a sample of the population.
A sample should have the same characteristics as the population it is
representing.
Most statisticians use various methods of random sampling in an attempt to achieve this
goal. This section describes a few of the most common random sampling methods. In each
form of random sampling, each member of a population initially has an equal chance of
being selected for the sample. Each method has pros and cons.
The easiest method to describe is called a simple random sample. Any group
of n individuals is as likely to be chosen as any other group of n individuals if the simple
random sampling technique is used. In other words, each sample of the same size has an
equal chance of being selected.
For example, suppose Lisa wants to form a four-person study group (herself and three other
people) from her pre-calculus class, which has 31 members not including Lisa. To choose a
simple random sample of size three from the other members of her class, Lisa could put all
31 names in a hat, shake the hat, close her eyes, and pick out three names. A more
technological way is for Lisa to first list the last names of the members of her class together
with a two-digit number, as in Table 1.2:
Lisa can then use a table of random numbers (found in many statistics books
and mathematical handbooks), a calculator, or a computer to generate random numbers. For
this example, suppose Lisa chooses to generate random numbers from a calculator. The
numbers generated are as follows:
Lisa reads two-digit groups until she has chosen three class members (that is, she reads
0.94360 as the groups 94, 43, 36, 60). Each random number may only contribute one class
member. If she needed to, Lisa could have generated more random numbers. The random
numbers 0.94360 and 0.99832 do not contain appropriate two-digit numbers. However, the
third random number, 0.14669, contains 14 (the fourth random number also contains 14), the
fifth random number contains 05, and the seventh random number contains 04. The
two-digit number 14 corresponds to Macierz, 05 corresponds to Cuningham, and 04 corresponds
to Cuarismo. Besides herself, Lisa’s group will consist
of Macierz, Cuningham, and Cuarismo.
Note: We can provide a third input to get a specified number of random values. For example,
randInt(0, 30, 3) will generate three random numbers, such as {29, 28, 4}.
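Lisa's selection can also be sketched in Python instead of on a calculator. The sketch below assumes her 31 classmates are identified only by the two-digit numbers 00 through 30; the seed value is arbitrary. Unlike repeated calls to randInt, Python's random.sample never returns the same classmate twice.

```python
import random

# Classmates are identified by the two-digit numbers 00 through 30
# (31 people, not including Lisa). Names are omitted in this sketch.
class_ids = list(range(31))

rng = random.Random(2021)  # the seed is arbitrary, fixed only for repeatability

# random.sample draws without replacement, so no classmate is picked twice
# and every group of three is equally likely to be chosen.
study_group = rng.sample(class_ids, 3)

print([f"{i:02d}" for i in study_group])
```

Because every group of three IDs has the same chance of being returned, this is a simple random sample of size three.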
Besides simple random sampling, there are other forms of sampling that involve random
chance. Other well-known random sampling methods are stratified sampling, cluster
sampling, and systematic sampling.
To choose a stratified sample, divide the population into groups called “strata” and then
take a proportionate number from each stratum. For example, you could stratify (group)
your college population by department and then choose a proportionate simple random
sample from each stratum (each department) to get a stratified random sample. To choose a
simple random sample from each department, number each member of the first department,
number each member of the second department, and do the same for the remaining
departments. Then use simple random sampling to choose proportionate numbers from the
first department and do the same for each of the remaining departments. Those numbers
picked from the first department, picked from the second department, and so on represent
the members who make up the stratified sample.
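The stratified procedure can be sketched in Python as follows. The department names and head counts are hypothetical, and the helper stratified_sample is written for this illustration only; it is not a standard library function.

```python
import random

def stratified_sample(strata, total_size, rng):
    """Draw a proportionate simple random sample from each stratum."""
    population_size = sum(len(members) for members in strata.values())
    chosen = {}
    for name, members in strata.items():
        # Proportionate allocation, rounded to the nearest whole person.
        k = round(total_size * len(members) / population_size)
        chosen[name] = rng.sample(members, k)
    return chosen

# Hypothetical departments and head counts (200, 300, and 500 people).
strata = {
    "Math": [f"M{i}" for i in range(200)],
    "English": [f"E{i}" for i in range(300)],
    "Biology": [f"B{i}" for i in range(500)],
}

sample = stratified_sample(strata, total_size=50, rng=random.Random(0))
for dept, members in sample.items():
    print(dept, len(members))
```

With these head counts, the 50 sampled people split 10, 15, and 25 across the three departments, matching each department's share of the population. With less convenient head counts, rounding can make the total differ slightly from the requested size.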
To choose a cluster sample, divide the population into clusters (groups) and then
randomly select some of the clusters. All the members from these clusters are in the cluster
sample. For example, if you randomly sample four departments from your college population,
the four departments make up the cluster sample. To do this, divide your college faculty by
department; the departments are the clusters. Number each department, and then choose four
different numbers using simple random sampling. All members of the four departments with
those numbers are the cluster sample.
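A minimal Python sketch of cluster sampling is shown below. The six numbered departments and the faculty names are invented for illustration.

```python
import random

# Hypothetical faculty list grouped by numbered department (the clusters).
departments = {
    1: ["Avery", "Bao"],
    2: ["Chen", "Diaz", "Evans"],
    3: ["Farah"],
    4: ["Gupta", "Hill"],
    5: ["Ito", "Jones", "Kim"],
    6: ["Lopez", "Meyer"],
}

rng = random.Random(7)  # the seed is arbitrary

# Step 1: simple random sample of four different department numbers.
chosen_numbers = rng.sample(sorted(departments), 4)

# Step 2: every member of each chosen department is in the sample.
cluster_sample = [person for n in chosen_numbers for person in departments[n]]

print(sorted(chosen_numbers))
print(cluster_sample)
```

Note that the randomness is applied to the clusters, not to the individuals: once a department is chosen, all of its members are included.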
To choose a systematic sample, randomly select a starting point and take every nth piece
of data from a listing of the population. For example, suppose you have to do a phone survey.
Your phone book contains 20,000 residence listings. You must choose 400 names for the
sample. Number the population 1 to 20,000 and then use a simple random sample to pick a
number that represents the first name in the sample. Then choose every fiftieth name
thereafter until you have a total of 400 names (you might have to go back to the beginning of
your phone list). Systematic sampling is frequently chosen because it is a simple method.
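The phone survey can be sketched in Python. The population and sample sizes come from the example; the seed and the wrap-around arithmetic are choices made for this illustration.

```python
import random

POPULATION_SIZE = 20_000
SAMPLE_SIZE = 400
step = POPULATION_SIZE // SAMPLE_SIZE  # every fiftieth listing

rng = random.Random(3)  # the seed is arbitrary
start = rng.randint(1, POPULATION_SIZE)  # randomly chosen first listing

# Listings are numbered 1 to 20,000. The modulo wraps back to the
# beginning of the list when the end is passed.
chosen = [(start - 1 + i * step) % POPULATION_SIZE + 1 for i in range(SAMPLE_SIZE)]

print(step, len(chosen), len(set(chosen)))  # prints: 50 400 400
```

Only the starting point is random; every later selection is determined by the step size, which is what makes the method simple to carry out.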
A type of sampling that is non-random is convenience sampling. Convenience
sampling involves using results that are readily available. For example, a computer
software store conducts a marketing study by interviewing potential customers who happen
to be in the store browsing through the available software. The results of convenience
sampling may be very good in some cases and highly biased (favoring certain outcomes) in
others.
Sampling data should be done very carefully. Collecting data carelessly can have devastating
results. Surveys mailed to households and then returned may be very biased (they may favor
a certain group). It is better for the person conducting the survey to select the sample
respondents.
True random sampling is done with replacement. That is, once a member is picked, that
member goes back into the population and thus may be chosen more than once. However, for
practical reasons, in most populations, simple random sampling is done without
replacement. Surveys are typically done without replacement. That is, a member of the
population may be chosen only once. Most samples are taken from large populations and the
sample tends to be small in comparison to the population. Since this is the case, sampling
without replacement is approximately the same as sampling with replacement because the
chance of picking the same individual more than once with replacement is very low.
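Sampling without replacement becomes a mathematical issue only when the population is small.
For example, suppose the population is 25 people and the sample size is ten. If you sample
with replacement, then the chance of picking the first person is ten out of 25, and the
chance of picking a different second person is nine out of 25 (you replace the first person).
If you sample without replacement, then the chance of picking the first person is ten out
of 25, and then the chance of picking the second person (who is different) is nine out of 24
(you do not replace the first person). Compare the fractions 9/25 and 9/24. To four decimal
places, 9/25 = 0.3600 and 9/24 = 0.3750. So these numbers are not equivalent.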
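The two fractions can be checked with a short calculation. The sketch assumes a population of 25 and a sample of ten, as the fractions above suggest. The second pair of values, for a population of 10,000 and a sample of 1,000, is a hypothetical large-population case added for contrast.

```python
from fractions import Fraction

def second_pick_chances(population, sample_size):
    """Chance that the second pick is one of the remaining sample members."""
    with_replacement = Fraction(sample_size - 1, population)
    without_replacement = Fraction(sample_size - 1, population - 1)
    return with_replacement, without_replacement

small_with, small_without = second_pick_chances(25, 10)
print(f"{float(small_with):.4f} {float(small_without):.4f}")  # 0.3600 0.3750

# Hypothetical large population: the two chances agree to four decimal places.
large_with, large_without = second_pick_chances(10_000, 1_000)
print(f"{float(large_with):.4f} {float(large_without):.4f}")  # 0.0999 0.0999
```

The gap between the two chances shrinks as the population grows, which is why the distinction matters mainly for small populations.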
When you analyze data, it is important to be aware of sampling errors and non-sampling
errors. The actual process of sampling causes sampling errors. For example, the sample may
not be large enough. Factors not related to the sampling process cause non-sampling
errors. A defective counting device can cause a non-sampling error.
In reality, a sample will never be exactly representative of the population so there will always
be some sampling error. In general, the larger the sample, the smaller the sampling error.
In Statistics, a sampling bias is created when a sample is collected from a population and
some members of the population are not as likely to be chosen as others (remember, each
member of the population should have an equally likely chance of being chosen).
When sampling bias happens, there can be incorrect conclusions drawn about the
population that is being studied.
Example 1.11
A study is done to determine the average tuition that College of Lake County transfer,
career, and adult education students pay per semester. Each student in the following
samples is asked how much tuition they paid for the fall semester. What is the type of
sampling in each case?
1.11 You are going to use the random number generator to generate different types of
samples from the data. This table displays six sets of quiz scores for an
elementary Statistics class.
#1   #2   #3   #4   #5   #6
5    7    10   9    8    3
10   5    9    8    7    6
9    10   8    6    7    9
9    10   10   9    8    9
7    8    9    5    7    4
9    9    9    10   8    7
7    7    10   9    8    8
8    8    9    10   8    8
9    7    8    7    7    8
8    8    10   9    8    7
1. Create a stratified sample by column. Pick three quiz scores randomly from each
column.
o Number each row one through ten.
o On your calculator, press Math and arrow over to PRB.
o For column 1, select randInt( and enter 1,10. Press ENTER. Record the
number. Press ENTER two more times and record these numbers, even if they
repeat. Record the three quiz scores in column one that correspond to these three
numbers.
o Repeat for columns two through six.
o These 18 quiz scores are a stratified sample.
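The calculator steps above can be mirrored in Python. The quiz scores are copied from the table, one list per column; rng.randint(1, 10) stands in for the calculator's randInt(1,10), and the seed is arbitrary. As on the calculator, repeated row numbers are kept.

```python
import random

# Quiz scores from the table, one list per column (#1 to #6), rows 1 to 10.
quiz_scores = {
    1: [5, 10, 9, 9, 7, 9, 7, 8, 9, 8],
    2: [7, 5, 10, 10, 8, 9, 7, 8, 7, 8],
    3: [10, 9, 8, 10, 9, 9, 10, 9, 8, 10],
    4: [9, 8, 6, 9, 5, 10, 9, 10, 7, 9],
    5: [8, 7, 7, 8, 7, 8, 8, 8, 7, 8],
    6: [3, 6, 9, 9, 4, 7, 8, 8, 8, 7],
}

rng = random.Random(11)  # the seed is arbitrary
stratified = []
for column, scores in quiz_scores.items():
    # Three draws of a row number from 1 to 10; repeats are kept.
    rows = [rng.randint(1, 10) for _ in range(3)]
    # Row numbers start at 1, list positions start at 0.
    stratified.extend(scores[r - 1] for r in rows)

print(len(stratified))  # prints: 18
print(stratified)
```

Each column is a stratum, so the result always contains exactly three scores from each of the six quizzes.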
Example 1.12
a. A soccer coach selects six players from a group of boys aged eight to ten, seven
players from a group of boys aged 11 to 12, and three from a group of boys aged 13 to 14
to form a recreational soccer team.
c. A high school educational researcher interviews 50 high school female teachers and
50 high school male teachers.
d. A medical researcher interviews every third cancer patient from a list of cancer
patients at a local hospital.
e. A high school counselor uses a computer to generate 50 random numbers and then
picks students whose names correspond to the numbers.
f. A student interviews classmates in his algebra class to determine how many pairs of
jeans a student owns, on the average.
Solution 1.12
a. stratified; b. cluster; c. stratified; d. systematic; e. simple random; f. convenience
1.12 Determine the type of sampling used (simple random, stratified, systematic, cluster,
or convenience).
A high school principal polls 50 freshmen, 50 sophomores, 50 juniors, and 50
seniors regarding policy changes for after school activities.
If we were to examine two samples representing the same population, they would not
be exactly the same, even if we used random sampling methods for the samples. Just as
there is variation in data, there is variation in samples. As you become accustomed to
sampling, the variability will begin to seem natural.
Example 1.13
Suppose ABC College has 10,000 part-time students (the population). We are interested in
the average amount of money a part-time student spends on books in the fall term. Asking
all 10,000 students is an almost impossible task.
Suppose we take two different samples. First, we use convenience sampling and survey ten
students from a first-term organic chemistry class. Many of these students are taking first-term
calculus in addition to the organic chemistry class. The amount of money they spend
on books is as follows:
$128; $87; $173; $116; $130; $204; $147; $189; $93; $153
The second sample is taken using a list of senior citizens who take P.E. classes and taking
every fifth senior citizen on the list, for a total of ten senior citizens. They spend:
$50; $40; $36; $15; $50; $100; $40; $53; $22; $22
a. Do you think that either of these samples is representative of (or is characteristic of)
the entire 10,000 part-time student population?
b. If these samples are not representative of the entire population, would it be wise to
use the results to describe the entire population?
c. Now suppose we take a third sample. We choose ten different part-time students from
the disciplines of Chemistry, Math, English, Psychology, Sociology, History, Nursing,
Physical Education, Art, and Early Childhood Development. (We assume that these are
the only disciplines in which part-time students at ABC College are enrolled and that an
equal number of part-time students are enrolled in each of the disciplines.) Each student
is chosen using simple random sampling. Using a calculator, random numbers are generated
and a student from a particular discipline is selected if he or she has a corresponding
number. The students spend the following amounts:
$180; $50; $150; $85; $260; $75; $180; $200; $200; $150
Solution 1.13
a. No. The first sample probably consists of science-oriented students. For example, in
addition to the chemistry course, some of them are also taking calculus or biology courses.
Books for these classes tend to be expensive. Most of these students are, more than
likely, paying more than the average part-time student for their books. The second
sample is a group of senior citizens who are likely taking courses for health and
interest. The amount of money they spend on books is probably much less than the
average part-time student. Both samples are biased. Also, in both cases, not all students
have a chance to be in either sample.
b. No. For these samples, each member of the population did not have an equally likely
chance of being chosen.
c. The sample is unbiased, but a larger sample would be recommended to increase the
likelihood that the sample will be close to representative of the population. However, for
a biased sampling technique, even a large sample runs the risk of not being
representative of the population.
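To see how far apart the three samples are, their averages can be computed directly from the amounts listed in the example. The variable names below are chosen for this sketch.

```python
from statistics import mean

# Dollar amounts spent on books, copied from the three samples in Example 1.13.
chemistry_class = [128, 87, 173, 116, 130, 204, 147, 189, 93, 153]
senior_citizens = [50, 40, 36, 15, 50, 100, 40, 53, 22, 22]
by_discipline = [180, 50, 150, 85, 260, 75, 180, 200, 200, 150]

samples = [
    ("convenience sample (chemistry class)", chemistry_class),
    ("systematic sample (senior citizens)", senior_citizens),
    ("random sample by discipline", by_discipline),
]
for label, amounts in samples:
    print(f"{label}: ${mean(amounts):.2f}")
```

The averages are $142.00, $42.80, and $153.00. The large gap between the first two shows how strongly the choice of sample can affect the estimate.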
Students often ask if it is "good enough" to take a sample, instead of surveying the entire
population. If the survey is done well, the answer is yes.
1.13 A local radio station has a fan base of 20,000 listeners. The station wants to know if its
audience would prefer more music or more talk shows. Since asking all 20,000 listeners would
be impossible, the station uses convenience sampling and surveys the first 200 people they
meet at one of the station’s music concert events. Of those sampled, 24 people said they’d
prefer more talk shows, and 176 people said they’d prefer more music. Is this sample
representative of the entire 20,000 listener population?
Variation in Data and in Samples
Variation is present in any set of data. For example, 16-ounce cans of beverage may
contain more or less than 16 ounces of liquid. In one study, eight 16-ounce cans were
measured and produced the following amounts (in ounces) of beverage:
Be aware that as you take data, your data may vary somewhat from the data someone else
is taking for the same purpose. This is completely natural. However, if two or more of you are
taking the same data and get very different results, it is time for you and the others to
reevaluate your data selection methods and your accuracy.
It was mentioned previously that two or more samples from the same population, taken
randomly, and having close to the same characteristics of the population will likely be
different from each other. Suppose Doreen and Jung both decide to study the average amount
of time students at their college sleep each night. Doreen and Jung each take samples of 500
students. Doreen uses systematic sampling and Jung uses cluster sampling. Doreen's sample
will be different from Jung's sample. Even if Doreen and Jung used the same sampling
method, in all likelihood their samples would be different. Neither would be wrong,
however.
Think about what contributes to making Doreen’s and Jung’s samples different.
If Doreen and Jung took larger samples (i.e. the number of data values is increased), their
sample results (the average amount of time a student sleeps) might be closer to the actual
population average. But still, their samples would be, in all likelihood, different from each
other. This variability in samples cannot be stressed enough.
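Sampling variability can be illustrated with a small simulation. Everything here is an assumption made for the sketch: the population of sleep times is invented (10,000 values drawn from a bell-shaped distribution centered at 7 hours), and both Doreen and Jung are given simple random samples rather than the systematic and cluster samples described above.

```python
import random
from statistics import mean

rng = random.Random(5)  # the seed is arbitrary

# Hypothetical population: nightly sleep in hours for 10,000 students.
population = [rng.gauss(7, 1.2) for _ in range(10_000)]
population_mean = mean(population)

def sample_mean(n):
    """Mean of a simple random sample of size n from the population."""
    return mean(rng.sample(population, n))

doreen = sample_mean(500)
jung = sample_mean(500)

print(f"population mean: {population_mean:.3f}")
print(f"Doreen's sample: {doreen:.3f}")
print(f"Jung's sample:   {jung:.3f}")
```

The two sample means differ from each other and from the population mean, even though both samples were drawn the same way from the same population.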
The size of a sample (often called the number of observations) is important. The examples
you have seen in this book so far have been small. Samples of only a few hundred
observations, or even smaller, are sufficient for many purposes. In polling, samples that are
from 1,200 to 1,500 observations are considered large enough and good enough if the survey
is random and is well done. You will learn why when you study confidence intervals.
Be aware that many large samples are biased. For example, call-in surveys are
invariably biased, because people choose to respond or not.
Critical Evaluation
We need to critically evaluate the statistical studies we read about and analyze them before accepting
the results of the studies. Common problems to be aware of include:
• Problems with samples: A sample must be representative of the population. A
sample that is not representative of the population is biased. Biased samples give
results that are inaccurate and not valid.
• Self-selected samples: Responses only by people who choose to respond, such as call-
in surveys, are often unreliable.
• Sample size issues: Samples that are too small may be unreliable. Larger samples
are better, if possible. In some situations, small samples are unavoidable and can
still be used to draw conclusions. Examples include crash testing cars or medical testing
for rare conditions.
• Undue influence: Collecting data or asking questions in a way that influences the
response.
• Causality: A relationship between two variables does not mean that one causes the
other to occur. They may be related (correlated) because of their relationship through a
different variable.
1.3 | Experimental Design and Ethics
Does aspirin reduce the risk of heart attacks? Is one brand of fertilizer more effective at
growing roses than another? Is fatigue as dangerous to a driver as the influence of alcohol?
Questions like these are answered using randomized experiments. In this module, you will
learn important aspects of experimental design. Proper study design ensures the production
of reliable, accurate data.
You want to investigate the effectiveness of vitamin E in preventing disease. You recruit a
group of subjects and ask them if they regularly take vitamin E. You notice that the subjects
who take vitamin E exhibit better health on average than those who do not. Does this prove
that vitamin E is effective in disease prevention? It does not. There are many differences
between the two groups compared in addition to vitamin E consumption. People who take
vitamin E regularly often take other steps to improve their health: exercise, diet, other
vitamin supplements, choosing not to smoke. Any one of these factors could be influencing
health. As described, this study does not prove that vitamin E is the key to disease
prevention.
Additional variables that can cloud a study are called lurking variables (or confounding
variables). In order to prove that the explanatory variable is causing a change in the response
variable, it is necessary to isolate the explanatory variable. The researcher must design her
experiment in such a way that there is only one difference between groups being compared:
the planned treatments. This is accomplished by the random assignment of experimental
units to treatment groups. When subjects are assigned treatments randomly, all of the
potential lurking variables are spread equally among the groups. At this point the only
difference between groups is the one imposed by the researcher. Different outcomes measured
in the response variable, therefore, must be a direct result of the different treatments. In this
way, an experiment can prove a cause-and-effect connection between the explanatory and
response variables.
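The vitamin E scenario above can be sketched as a small simulation that compares a self-selected (observational) comparison with a randomized one. The lurking variable chosen here (exercise), all of the probabilities, and the names `healthy_rate` and `compare_designs` are assumptions made for illustration. In this made-up world vitamin E has no effect at all, so any difference between the groups comes from the lurking variable or from chance.

```python
import random


def healthy_rate(people, takes_vitamin_e):
    """Share of healthy people among those who do (or do not) take vitamin E."""
    group = [p for p in people if p["vitamin_e"] == takes_vitamin_e]
    return sum(p["healthy"] for p in group) / len(group)


def compare_designs(n=20_000, seed=7):
    rng = random.Random(seed)
    observational, randomized = [], []
    for _ in range(n):
        exercises = rng.random() < 0.5  # the lurking variable
        # By construction, health depends on exercise only, never on vitamin E.
        healthy = rng.random() < (0.8 if exercises else 0.5)

        # Observational study: people who exercise are more likely to choose vitamin E.
        chooses_e = rng.random() < (0.7 if exercises else 0.2)
        observational.append({"vitamin_e": chooses_e, "healthy": healthy})

        # Randomized experiment: a coin flip decides who gets vitamin E.
        assigned_e = rng.random() < 0.5
        randomized.append({"vitamin_e": assigned_e, "healthy": healthy})

    observational_gap = healthy_rate(observational, True) - healthy_rate(observational, False)
    randomized_gap = healthy_rate(randomized, True) - healthy_rate(randomized, False)
    return observational_gap, randomized_gap


observational_gap, randomized_gap = compare_designs()
print(f"Observational difference in healthy rate: {observational_gap:+.3f}")
print(f"Randomized difference in healthy rate:    {randomized_gap:+.3f}")
```

In the observational comparison, the vitamin E group looks healthier only because it contains more people who exercise. With random assignment, exercisers are spread about evenly between the groups, so the difference shrinks to roughly zero.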
The power of suggestion can have an important influence on the outcome of an experiment.
Studies have shown that the expectation of the study participant can be as important as the
actual medication. In one study of performance-enhancing drugs, researchers noted:
Results showed that believing one had taken the substance resulted in [performance] times
almost as fast as those associated with consuming the drug itself. In contrast, taking the drug
without knowledge yielded no significant performance increment.[1]
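When participation in a study prompts a physical response from a participant, it is difficult
to isolate the effects of the explanatory variable. To counter the power of suggestion,
researchers set aside one treatment group as a control group. This group is given
a placebo treatment, that is, a treatment that cannot influence the response variable. The control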
group helps researchers balance the effects of being in an experiment with the effects of the
active treatments. Of course, if you are participating in a study and you know that you are
receiving a pill which contains no actual medication, then the power of suggestion is no longer
a factor. Blinding in a randomized experiment preserves the power of suggestion. When a
person involved in a research study is blinded, he does not know who is receiving the active
treatment(s) and who is receiving the placebo treatment. A double-blind experiment is
one in which both the subjects and the researchers involved with the subjects are blinded.
1. McClung, M. Collins, D. “Because I know it will!” Placebo effects of an ergogenic aid on athletic
performance. Journal of Sport & Exercise Psychology. 2007 Jun. 29(3):382-94. Web. April 30, 2013.
Example 1.19
Researchers want to investigate whether taking aspirin regularly reduces the risk of heart
attack. Four hundred men between the ages of 50 and 84 are recruited as participants. The
men are divided randomly into two groups: one group will take aspirin, and the other group
will take a placebo. Each man takes one pill each day for three years, but he does not know
whether he is taking aspirin or the placebo. At the end of the study, researchers count the
number of men in each group who have had heart attacks.
Identify the following for this study: population, sample, experimental units,
explanatory variable, response variable, treatments.
Solution 1.19
The population is men aged 50 to 84.
The sample is the 400 men who participated.
The experimental units are the individual men in the study.
The explanatory variable is oral medication.
The treatments are aspirin and a placebo.
The response variable is whether a subject had a heart attack.
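The random division of the 400 participants can be sketched in a few lines. This is only an illustration: the participant ID numbers, the equal group sizes of 200, the seed, and the function name `assign_groups` are assumptions, since the example says only that the men are divided randomly into two groups.

```python
import random


def assign_groups(participant_ids, seed=None):
    """Shuffle the participants, then split the shuffled list in half."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "aspirin": sorted(shuffled[:half]),
        "placebo": sorted(shuffled[half:]),
    }


# Hypothetical ID numbers 1 through 400 stand in for the 400 men.
groups = assign_groups(range(1, 401), seed=42)
print(len(groups["aspirin"]), len(groups["placebo"]))
print(groups["aspirin"][:10])
```

Because chance alone decides each man's group, neither the researchers nor the participants can steer healthier men toward one treatment.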
Example 1.20
The Smell & Taste Treatment and Research Foundation conducted a study to investigate
whether smell can affect learning. Subjects completed mazes multiple times while wearing
masks. They completed the pencil and paper mazes three times wearing floral-scented masks,
and three times with unscented masks. Participants were assigned at random to wear the
floral mask during the first three trials or during the last three trials. For each trial,
researchers recorded the time it took to complete the maze and the subject’s impression of
the mask’s scent: positive, negative, or neutral.
Solution 1.20
a. The explanatory variable is scent, and the response variable is the time it takes to
complete the maze.
b. There are two treatments: a floral-scented mask and an unscented mask.
c. All subjects experienced both treatments. The order of treatments was randomly assigned
so there were no differences between the treatment groups. Random assignment eliminates
the problem of lurking variables.
d. Subjects will clearly know whether they can smell flowers or not, so subjects cannot be
blinded in this study. Researchers timing the mazes can be blinded, though. The
researcher who is observing a subject will not know which mask is being worn.
Example 1.21
A researcher wants to study the effects of birth order on personality. Explain why this
study could not be conducted as a randomized experiment. What is the main problem in a
study that cannot be designed as a randomized experiment?
Solution 1.21
The explanatory variable is birth order. You cannot randomly assign a person’s birth order.
Random assignment eliminates the impact of lurking variables. When you cannot assign
subjects to treatment groups at random, there will be differences between the groups other
than the explanatory variable.
1.21 You are concerned about the effects of texting on driving performance. Design a study
to test the response time of drivers while texting and while driving only. How many seconds
does it take for a driver to respond when a leading car hits the brakes?
Ethics
The widespread misuse and misrepresentation of statistical information often gives the
field a bad name. Some say that “numbers don’t lie,” but the people who use numbers to
support their claims often do.
A recent investigation of the famous social psychologist Diederik Stapel has led to the
retraction of his articles from some of the world’s top journals including Journal of
Experimental Social Psychology, Social Psychology, Basic and Applied Social Psychology,
British Journal of Social Psychology, and the magazine Science. Diederik Stapel is a former
professor at Tilburg University in the Netherlands. Over the past two years, an extensive
investigation involving three universities where Stapel has worked concluded that the
psychologist is guilty of fraud on a colossal scale. Falsified data taints over 55 papers he
authored and 10 Ph.D. dissertations that he supervised.
Stapel did not deny that his deceit was driven by ambition. But it was more complicated
than that, he told me. He insisted that he loved social psychology but had been frustrated by
the messiness of experimental data, which rarely led to clear conclusions. His lifelong
obsession with elegance and order, he said, led him to concoct sexy results that journals
found attractive. “It was a quest for aesthetics, for beauty—instead of the truth,” he said. He
described his behavior as an addiction that drove him to carry out acts of increasingly
daring fraud, like a junkie seeking a bigger and better high.[2]
2. Yudhijit Bhattacharjee, “The Mind of a Con Man,” Magazine, New York Times, April 26, 2013. Available online
at: http://www.nytimes.com/2013/04/28/magazine/diederik-stapels-audacious-academic-
fraud.html?src=dayp&_r=2& (accessed May 1, 2013).
The committee investigating Stapel concluded that he was guilty of several practices
including:
• creating datasets, which largely confirmed the prior expectations,
• altering data in existing datasets,
• changing measuring instruments without reporting the change, and
• misrepresenting the number of experimental subjects.
Clearly, it is never acceptable to falsify data the way this researcher did. Sometimes,
however, violations of ethics are not so easy to spot.
Researchers have a responsibility to verify that proper methods are being followed. The
report describing the investigation of Stapel’s fraud states that, “statistical flaws
frequently revealed a lack of familiarity with elementary statistics.”[3] Many of Stapel’s co-
authors should have spotted irregularities in his data. Unfortunately, they did not know very
much about statistical analysis, and they simply trusted that he was collecting and reporting
data properly.
Many types of statistical fraud are difficult to detect. Some researchers simply stop collecting
data once they have just enough to prove what they had hoped to prove. They don’t want to
take the chance that a more extensive study would complicate their lives by producing data
contradicting their hypothesis.
When a statistical study uses human participants, as in medical studies, both ethics and
the law dictate that researchers should be mindful of the safety of their research subjects.
The U.S. Department of Health and Human Services oversees federal regulations of
research studies with the aim of protecting participants. When a university or other
research institution engages in research, it must ensure the safety of all human subjects.
For this reason, research institutions establish oversight committees known
as Institutional Review Boards (IRB). All planned studies must be approved in advance
by the IRB. Key protections that are mandated by law
include the following:
• Participants must give informed consent. This means that the risks of participation
must be clearly explained to the subjects of the study. Subjects must consent in writing,
and researchers are required to keep documentation of their consent.
• Data collected from individuals must be guarded carefully to protect their privacy.
These ideas may seem fundamental, but they can be very difficult to verify in practice. Is
removing a participant’s name from the data record sufficient to protect privacy? Perhaps the
person’s identity could be discovered from the data that remains. What happens if the study
does not proceed as planned and risks arise that were not anticipated? When is informed
consent really necessary? Suppose your doctor wants a blood sample to check your cholesterol
level. Once the sample has been tested, you expect the lab to dispose of the remaining blood.
At that point the blood becomes biological waste. Does a researcher have the right to take it
for use in a study?
3. “Flawed Science: The Fraudulent Research Practices of Social Psychologist Diederik Stapel,”
Tilburg University, November 28, 2012, http://www.tilburguniversity.edu/upload/064a10cd- bce5-4385-
b9ff- 05b840caeae6_120695_Rapp_nov_2012_UK_web.pdf (accessed May 1, 2013).
It is important that students of statistics take time to consider the ethical questions that
arise in statistical studies. How prevalent is fraud in statistical studies? You might be
surprised—and disappointed. There is a website (www.retractionwatch.com)
dedicated to cataloging retractions of study articles that
have been proven fraudulent. A quick glance will show that the misuse of statistics is a
bigger problem than most people realize.
Vigilance against fraud requires knowledge. Learning the basic theory of statistics will
empower you to analyze statistical studies critically.
Example 1.22
Describe the unethical behavior in each example and describe how it could impact the
reliability of the resulting data. Explain how the problem should be corrected. A researcher
is collecting data in a community.
a. She selects a block where she is comfortable walking because she knows many of the
people living on the street.
b. No one seems to be home at four houses on her route. She does not record the
addresses and does not return at a later time to try to find residents at home.
c. She skips four houses on her route because she is running late for an appointment.
When she gets home, she fills in the forms by selecting random answers from
other residents in the neighborhood.
Solution 1.22
b. Intentionally omitting relevant data will create bias in the sample. Suppose the
researcher is gathering information about jobs and child care. By ignoring people who
are not home, she may be missing data from working families that are relevant to her
study. She needs to make every effort to interview all members of the target sample.
c. It is never acceptable to fake data. Even though the responses she uses are “real”
responses provided by other participants, the duplication is fraudulent and can create
bias in the data. She needs to work diligently to interview everyone on her route.
1.22 Describe the unethical behavior, if any, in each example and describe how it could
impact the reliability of the resulting data. Explain how the problem should be corrected.
A study is commissioned to determine the favorite brand of fruit juice among teens in
California. The survey is commissioned by the seller of a popular brand of apple
juice. There are only two types of juice included in the study: apple juice and cranberry
juice. Researchers allow participants to see the brand of juice as samples are poured for a
taste test. Among the participants, 25% preferred Brand X, 33% preferred Brand Y and
42% had no preference between the two brands. Brand X then references the study in a
commercial saying “Most teens like Brand X as much as or more than Brand Y.”
KEY TERMS
Average a number that describes the central tendency of the data
Categorical Variable variables that take on values that are names or labels
Cluster Sampling a method for selecting a random sample; divide the population into groups
(clusters), then use simple random sampling to select a set of clusters. Every individual in
the chosen clusters is included in the sample.
Data a set of observations (a set of possible outcomes); most data can be put into two
groups: qualitative (an attribute whose value is indicated by a label) or quantitative (an
attribute whose value is indicated by a number). Quantitative data can be separated into
two subgroups: discrete and continuous. Data is discrete if it is the result of counting
(such as the number of students of a given ethnic group in a class or the number of books on
a shelf). Data is continuous if it is the result of measuring (such as distance traveled or
weight of luggage)
Discrete Random Variable a random variable (RV) whose outcomes are counted
Double-blinding the act of blinding both the subjects of an experiment and the
researchers who work with the subjects
Informed Consent Any human subject in a research study must be cognizant of any risks
or costs associated with the study. The subject has the right to know the nature of the
treatments included in the study, their potential risks, and their potential benefits. Consent
must be given freely by an informed, fit participant.
Lurking Variable a variable that has an effect on a study even though it is neither an
explanatory variable nor a response variable
Non-sampling Error an issue that affects the reliability of sampling data other than
natural variation; it includes a variety of human errors including poor study design,
biased sampling methods, inaccurate information provided by study participants, data
entry errors, and poor analysis.
Numerical Variable variables that take on values that are indicated by numbers
Placebo an inactive treatment that has no real effect on the response variable
Population all individuals, objects, or measurements whose properties are being studied
Proportion the number of successes divided by the total number in the sample
Random Assignment the act of organizing experimental units into treatment groups
using random methods
Random Sampling a method of selecting a sample that gives every member of the
population an equal chance of being selected.
Representative Sample a subset of the population that has the same characteristics as
the population
Response Variable the dependent variable in an experiment; the value that is measured
for change at the end of an experiment
Sampling Bias not all members of the population are equally likely to be selected
Sampling Error the natural variation that results from selecting a sample to represent a
larger population; this variation decreases as the sample size increases, so selecting
larger samples reduces sampling error.
Sampling with Replacement Once a member of the population is selected for inclusion in
a sample, that member is returned to the population for the selection of the next
individual.
Statistic a numerical characteristic of the sample; a statistic estimates the corresponding
population parameter.
Stratified Sampling a method for selecting a random sample used to ensure that
subgroups of the population are represented adequately; divide the population into groups
(strata). Use simple random sampling to identify a proportionate number of individuals
from each stratum.
Systematic Sampling a method for selecting a random sample; list the members of the
population. Use simple random sampling to select a starting point in the population. Let k
= (number of individuals in the population)/(number of individuals needed in the sample).
Choose every kth individual in the list starting with the one that was randomly selected. If
necessary, return to the beginning of the population list to complete your sample.
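The systematic, stratified, and cluster sampling procedures defined in this list can be sketched as short functions. This is a minimal sketch, not a standard library: the function names, the roster of 100 numbered students, the two strata, and the five class sections are all hypothetical. The systematic version rounds k down to a whole number.

```python
import random


def systematic_sample(population, n, rng):
    """Pick a random start, then take every kth individual, wrapping around the list."""
    k = len(population) // n  # k = (population size) / (sample size), rounded down
    start = rng.randrange(len(population))
    return [population[(start + i * k) % len(population)] for i in range(n)]


def stratified_sample(strata, fraction, rng):
    """Draw the same fraction from every stratum using simple random sampling."""
    sample = []
    for members in strata.values():
        count = round(len(members) * fraction)
        sample.extend(rng.sample(members, count))
    return sample


def cluster_sample(clusters, num_clusters, rng):
    """Randomly choose whole clusters and keep every individual in them."""
    chosen = rng.sample(list(clusters), num_clusters)
    return [member for name in chosen for member in clusters[name]]


rng = random.Random(2021)

# Hypothetical roster of 100 students, identified by number.
students = list(range(1, 101))
print(systematic_sample(students, 10, rng))

# Hypothetical strata: 60 first-year and 40 second-year students.
strata = {"first_year": students[:60], "second_year": students[60:]}
print(stratified_sample(strata, 0.10, rng))

# Hypothetical clusters: five class sections of 20 students each.
clusters = {f"section_{i}": students[i * 20:(i + 1) * 20] for i in range(5)}
print(cluster_sample(clusters, 2, rng))
```

Note the difference in what is chosen at random: stratified sampling draws individuals from every group, while cluster sampling draws entire groups and keeps everyone in them.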
CHAPTER REVIEW
1.1 Basic Definitions
The mathematical theory of statistics is easier to learn when you know the language. This
section presented important terms that will be used throughout the text.
Data are individual items of information that come from a population or sample. Data may
be classified as qualitative, quantitative continuous, or quantitative discrete.
Because it is not practical to measure the entire population in a study, researchers use
samples to represent the population. A random sample is a representative group from the
population chosen by using a method that gives each individual in the population an equal
chance of being included in the sample. Random sampling methods include simple random
sampling, stratified sampling, cluster sampling, and systematic sampling. Convenience
sampling is a nonrandom method of choosing a sample that often produces biased data.
Samples that contain different individuals result in different data. This is true even when
the samples are well-chosen and representative of the population. When
properly selected, larger samples model the population more closely than smaller samples.
There are many different potential problems that can affect the reliability of a sample.
Statistical data needs to be critically analyzed, not simply accepted.
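The claim that properly selected larger samples model the population more closely can be checked with a small simulation. The sketch below assumes a made-up population of 50,000 values between 0 and 100 and a hypothetical helper named `spread_of_sample_means`. It draws 200 simple random samples at each size and measures how much the sample means vary from one sample to the next.

```python
import random
import statistics


def spread_of_sample_means(population, sample_size, repeats, rng):
    """Standard deviation of the sample means across repeated random samples."""
    means = [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.pstdev(means)


rng = random.Random(3)

# Hypothetical population of 50,000 whole-number measurements from 0 to 100.
population = [rng.randint(0, 100) for _ in range(50_000)]

spreads = {
    n: spread_of_sample_means(population, n, repeats=200, rng=rng)
    for n in (10, 100, 1000)
}
for n, spread in spreads.items():
    print(f"sample size {n:>4}: spread of sample means = {spread:.2f}")
```

Different samples still give different means at every size, but the variation from sample to sample shrinks as the sample size grows. That shrinking variation is the sampling error described in the key terms.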
A poorly designed study will not produce reliable data. There are certain key components
that must be included in every experiment. To eliminate lurking variables, subjects must be
assigned randomly to different treatment groups. One of the groups must act as a
control group, demonstrating what happens when the active treatment is not applied.
Participants in the control group receive a placebo treatment that looks exactly like the active
treatments but cannot influence the response variable. To preserve the integrity of the
placebo, both researchers and subjects may be blinded. When a study is designed properly,
the only difference between treatment groups is the one imposed by the researcher.
Therefore, when groups respond differently to different treatments, the difference must be
due to the influence of the explanatory variable.
“An ethics problem arises when you are considering an action that benefits you or
some cause you support, hurts or reduces benefits to others, and violates some rule.”[4]
Ethical violations in statistics are not always easy to spot. Professional associations and
federal agencies post guidelines for proper conduct. It is important that you learn basic
statistical procedures so that you can recognize proper data analysis.
4. Andrew Gelman, “Open Data and Open Methods,” Ethics and Statistics,
http://www.stat.columbia.edu/~gelman/research/ published/ChanceEthics1.pdf (accessed
May 1, 2013).
EXERCISES FOR CHAPTER 1
Use the following information to answer the next five exercises. Studies are often done by
pharmaceutical companies to determine the effectiveness of a treatment program. Suppose
that a new AIDS antibody drug is currently under study. It is given to patients once the
AIDS symptoms have revealed themselves. Of interest is the average (mean) length of time
in months patients live once they start the treatment. Two researchers each follow a
different set of 40 patients with AIDS from the start of treatment until their deaths. The
following data (in months) are collected.
Researcher A:
3; 4; 11; 15; 16; 17; 22; 44; 37; 16;
14; 24; 25; 15; 26; 27; 33; 29; 35; 44;
13; 21; 22; 10; 12; 8; 40; 32; 26; 27;
31; 34; 29; 17; 8; 24; 18; 47; 33; 34.
Researcher B:
3; 14; 11; 5; 16; 17; 28; 41; 31; 18;
14; 14; 26; 25; 21; 22; 31; 2; 35; 44;
23; 21; 21; 16; 12; 18; 41; 22; 16; 25;
33; 34; 29; 13; 18; 24; 23; 42; 33; 29.
Determine what the key terms refer to in the example for Researcher A.
1. population
2. sample
3. parameter
4. statistic
5. variable
Use the following information to answer the next five exercises: A study was done to
determine the age, number of times per week, and the duration (amount of time) of
residents using a local park in San Antonio, Texas. The first house in the neighborhood
around the park was selected randomly, and then the resident of every eighth house in the
neighborhood around the park was interviewed.
9. The colors of the houses around the park are what kind of data?
a. qualitative b. quantitative discrete c. quantitative continuous
11. The table below contains the total number of deaths worldwide as a result of
earthquakes from 2000 to 2012.
f. Earthquakes are quantified according to the Richter scale, which measures the
amount of energy they produce (examples are 2.1, 5.0, 6.7). What type of data is that?
For the following four exercises, determine the type of sampling used (simple random,
stratified, systematic, cluster, or convenience).
12. A group of test subjects is divided into twelve groups; then four of the groups are chosen
at random.
13. A market researcher polls every tenth person who walks into a store.
14. The first 50 people who walk into a sporting event are polled on their television
preferences.
15. A computer generates 100 random numbers, and 100 people whose names correspond
with the numbers on the list are chosen.
Use the following data to answer the next five exercises: A pair of studies was performed to
measure the effectiveness of a new software program designed to help stroke patients
regain their problem-solving skills. Patients were asked to use the software program twice
a day, once in the morning and once in the evening. The studies observed 200 stroke
patients recovering over a period of several weeks. The first study collected the data
in Table 1.31. The second study collected the data in Table 1.32.
17. The first study was performed by the company that designed the software program.
The second study was performed by the American Medical Association. Which study is more
reliable?
18. Both groups that performed the study concluded that the software works. Is this
accurate?
19. The company that makes the software uses the two studies as proof that their
software causes mental improvement in stroke patients. Is this a fair statement?
20. Patients who used the software were also a part of an exercise program whereas
patients who did not use the software were not. Does this change the validity of the
conclusions from question #18?
For each of the following eight exercises, identify: a. the population, b. the sample, c. the
parameter, d. the statistic, e. the variable, and f. the data. Give examples where
appropriate.
21. A fitness center is interested in the mean amount of time a client exercises in the center
each week.
22. Ski resorts are interested in the mean age that children take their first ski and
snowboard lessons. They need this information to plan their ski classes optimally.
23. A cardiologist is interested in the mean recovery period of her patients who have had
heart attacks.
24. Insurance companies are interested in the mean health costs each year of their clients,
so that they can determine the costs of health insurance.
25. A politician is interested in the proportion of voters in his district who think he is doing
a good job.
26. A marriage counselor is interested in the proportion of clients she counsels who stay
married.
27. Political pollsters may be interested in the proportion of people who will vote for
a particular cause.
28. A marketing company is interested in the proportion of people who will buy a particular
product.
Use the following information to answer the next three exercises: A Lake Tahoe Community
College instructor is interested in the mean number of days Lake
Tahoe Community College math students are absent from class during a quarter.
30. Let X = number of days a Lake Tahoe Community College math student is absent. In
this case, X is an example of a:
a. variable. b. population. c. statistic. d. data.
31. The instructor’s sample produces a mean number of days absent of 3.5 days. This value
is an example of a:
a. parameter b. data c. statistic d. variable
For the following exercises (32 – 40), identify the type of data that would be used to describe
a response (quantitative discrete, quantitative continuous, or qualitative), and give an
example of the data.
35. number of students enrolled at Evergreen Valley College
a. Using complete sentences, list three things wrong with the way the survey was
conducted.
b. Using complete sentences, list three ways that you would improve the survey if it
were to be repeated.
42. Suppose you want to determine the mean number of students per statistics class in your
state. Describe a possible sampling method in three to five complete sentences. Make
the description detailed.
43. Suppose you want to determine the mean number of cans of soda drunk each month by
students in their twenties at your school. Describe a possible sampling method in three to
five complete sentences. Make the description detailed.
44. List some practical difficulties involved in getting accurate results from a
telephone survey.
45. List some practical difficulties involved in getting accurate results from a mailed
survey.
46. With your classmates, brainstorm some ways you could overcome these problems if you
needed to conduct a phone or mail survey.
47. Name the sampling method used in each of the following situations:
c. The marketing manager for an electronics chain store wants information about the
ages of its customers. Over the next two weeks, at each store location, 100
randomly selected customers are given questionnaires to fill out asking for information
about age, as well as about other variables of interest.
e. A political party wants to know the reaction of voters to a debate between the
candidates. The day after the debate, the party’s polling staff calls 1,200 randomly
selected phone numbers. If a registered voter answers the phone or is available to come
to the phone, that registered voter is asked whom he or she intends to vote for and
whether the debate changed his or her opinion of the candidates.
48. A “random survey” was conducted of 3,274 people of the “microprocessor generation”
(people born since 1971, the year the microprocessor was invented). It was reported that
48% of those individuals surveyed stated that if they had $2,000 to spend, they would use it
for computer equipment. Also, 66% of those surveyed considered themselves
relatively savvy computer users.
a. Do you consider the sample size large enough for a study of this type? Why or why
not?
b. Based on your “gut feeling,” do you believe the percentages accurately reflect the
U.S. population for those individuals born since 1971? If not, do you think the
percentage of the population is actually higher or lower than the sample statistics?
Why?
Additional information: The survey, reported by Intel Corporation, was filled out by
individuals who visited the Los Angeles Convention Center to see the Smithsonian
Institute's road show called “America’s Smithsonian.”
c. With this additional information, do you feel that all demographic and ethnic groups
were equally represented at the event? Why or why not?
d. With the additional information, comment on how accurately you think the sample
statistics reflect the population parameters.
49. The Gallup-Healthways Well-Being Index is a survey that follows trends of U.S.
residents on a regular basis. There are six areas of health and wellness covered in the
survey: Life Evaluation, Emotional Health, Physical Health, Healthy Behavior, Work
Environment, and Basic Access. Some of the questions used to measure the Index are listed
below. Identify the type of data obtained from each question used in this survey
as qualitative, quantitative discrete, or quantitative continuous.
a. Do you have any health problems that prevent you from doing any of the things
people your age can normally do?
b. During the past 30 days, for about how many days did poor health keep you from
doing your usual activities?
c. In the last seven days, on how many days did you exercise for 30 minutes or more?
d. Do you have health insurance coverage?
50. In advance of the 1936 presidential election, a magazine titled Literary Digest released
the results of an opinion poll predicting that the Republican candidate Alf Landon would
win by a large margin. The magazine sent post cards to approximately 10,000,000
prospective voters. These prospective voters were selected from the subscription list of the
magazine, from automobile registration lists, from phone lists, and from club membership
lists. Approximately 2,300,000 people returned the postcards.
a. Think about the state of the United States in 1936. Explain why a sample chosen
from magazine subscription lists, automobile registration lists, phone books, and club
membership lists was not representative of the population of the United States at that
time.
b. What effect does the low response rate have on the reliability of the sample?
d. During the same year, George Gallup conducted his own poll of 30,000 prospective
voters. His researchers used a method they called "quota sampling" to obtain survey
answers from specific subsets of the population. Quota sampling is an example of which
sampling method described in this module?
51. Crime-related and demographic statistics for 47 US states in 1960 were collected from
government agencies, including the FBI's Uniform Crime Report. One analysis of this data
found a strong connection between education and crime indicating that higher levels of
education in a community correspond to higher crime rates. Which of the potential
problems with samples discussed in Section 1.2 could explain this connection?
52. YouPolls is a website that allows anyone to create and respond to polls. One question
posted April 15 asks: “Do you feel happy paying your taxes when members of the Obama
administration are allowed to ignore their tax liabilities?”[5] As of April 25, 11 people
responded to this question. Each participant answered “NO!” Which of the potential
problems with samples discussed in this module could explain this connection?
53. A scholarly article about response rates begins with the following quote: “Declining
contact and cooperation rates in random digit dial (RDD) national telephone surveys raise
serious concerns about the validity of estimates drawn from such research.”[6] The Pew
Research Center for People and the Press admits: “The percentage of people we interview –
out of all we try to interview – has been declining over the past decade or more.” [7]
a. What are some possible reasons for the decline in response rate over the past
decade?
b. Explain why researchers are concerned with the impact of the declining response
rate of public opinion polls.
REFERENCES
1.1 Basic Definitions
Dominic Lusinchi, “’President’ Landon and the 1936 Literary Digest Poll:
Were Automobile and Telephone Owners to Blame?” Social Science History 36, no. 1: 23-54
(2012), http://ssh.dukejournals.org/content/36/1/23.abstract (accessed May 1, 2013).
Ankita Mehta. “Daily Dose of Aspirin Helps Reduce Heart Attacks: Study,” International Business
Times, July 21, 2011. Also available online at
http://www.ibtimes.com/daily-dose-aspirin-helps-reduce-heart-attacks-study-300443 (accessed May 1, 2013).
M.L. Jackson et al., “Cognitive Components of Simulated Driving Performance: Sleep Loss effect and
Predictors,” Accident Analysis and Prevention Journal, Jan no. 50
(2013), http://www.ncbi.nlm.nih.gov/pubmed/22721550 (accessed May 1, 2013).
“Earthquake Information by Year,” U.S. Geological
Survey. http://earthquake.usgs.gov/earthquakes/eqarchives/year/ (accessed May 1, 2013).
“Fatality Analysis Report Systems (FARS) Encyclopedia,” National Highway Traffic and Safety
Administration. http://www-fars.nhtsa.dot.gov/Main/index.aspx (accessed May 1, 2013).
U.S. Department of Health and Human Services, Code of Federal Regulations Title 45 Public
Welfare Department of Health and Human Services Part 46 Protection of Human Subjects revised
January 15, 2009. Section 46.111: Criteria for IRB Approval of Research.
“April 2013 Air Travel Consumer Report,” U.S. Department of Transportation, April 11
(2013), http://www.dot.gov/airconsumer/april-2013-air-travel-consumer-report (accessed May 1,
2013).
Maria de los A. Medina, “Ethics in Statistics,” Based on “Building an Ethics Module for Business,
Science, and Engineering Students” by Jose A. Cruz-Cruz and William
Frey, Connexions, http://cnx.org/content/m15555/latest/ (accessed May 1, 2013).
lastbaldeagle. 2013. On Tax Day, House to Call for Firing Federal Workers Who Owe Back
Taxes. Opinion poll posted online at: http://www.youpolls.com/details.aspx?id=12328 (accessed May
1, 2013).
Scott Keeter et al., “Gauging the Impact of Growing Nonresponse on Estimates from a National
RDD Telephone Survey,” Public Opinion Quarterly 70, no. 5 (2006),
http://poq.oxfordjournals.org/content/70/5/759 (accessed May 1, 2013).
Frequently Asked Questions, Pew Research Center for the People & the Press, http://www.people-
press.org/methodology/frequently-asked-questions/#dont-you-have-trouble-getting-people-to-answer-
your-polls (accessed May 1, 2013).
CHAPTER 1 SOLUTIONS:
1) AIDS patients; 3) The average length of time (in months) AIDS patients live after
treatment.
5) X = the length of time (in months) AIDS patients live after treatment; 7) b;
9) a;
17) There is not enough information given to judge if either one is correct or incorrect.
19) Yes, because we cannot tell if the improvement was due to the software or the
exercise; the data is confounded, and a reliable conclusion cannot be drawn.
21) a. all clients for the fitness center b. a smaller, selected group of these clients
c. the population mean number of hours spent each week in the fitness center
d. the sample mean number of hours spent each week in the fitness center
e. X = the number of hours spent in the fitness center for a given client in a given
week.
f. values for X, such as 2, 1.7, 3.5, …
23) a. all patients of the doctor b. a smaller, selected group of these patients
c. the mean recovery period for all patients d. the mean recovery time for the
sample patients e. X = the recovery time of a single patient
25) a. all voters in the district b. a smaller, selected group of these voters
c. the proportion of all voters in the district who approve of the politician’s job
performance. d. the proportion of the sample who approve of the politician’s job
performance.
e. X = whether or not a voter approves of the politician’s job performance f. yes, no
27) a. all voters in the region (county, state or nation) b. a smaller, selected group of
these voters c. the proportion of all voters in the region who will vote for the cause.
d. the proportion of the sample who will vote for the cause. e. X = whether or not a
voter will vote for the cause f. yes, no
35) quantitative discrete; e.g. 11,234 students 37) qualitative; e.g. Crest, Colgate
39) quantitative continuous; e.g. 51 yrs, 63.5 yrs
41) a. The survey was conducted using six similar flights. The survey would not be a
true representation of the entire population of air travelers. Conducting the survey
on a holiday weekend will not produce representative results. b. Conduct the survey
during different times of the year. Conduct the survey using flights to and from
various locations. Conduct the survey on different days of the week.
43) Answers will vary. Sample Answer: You could use a systematic sampling method.
Stop the tenth person as they leave one of the buildings on campus at 9:50 in the
morning. Then stop the tenth person as they leave a different building on campus at
1:50 in the afternoon.
45) Answers will vary. Sample Answer: Many people will not respond to mail surveys. If
they do respond to the surveys, you can’t be sure who is responding. In addition,
mailing lists can be incomplete.
51) Causality: The fact that two variables are related does not guarantee that one
variable is influencing the other. We cannot assume that crime rate impacts
education level or that education level impacts crime rate. Confounding: There are
many factors that define a community other than education level and crime rate.
Communities with high crime rates and high education levels may have other
lurking variables that distinguish them from communities with lower crime rates
and lower education levels. Because we cannot isolate these variables of interest, we
cannot draw valid conclusions about the connection between education and crime.
Possible lurking variables include police expenditures, unemployment levels, region,
average age, and size.
53) a. Possible reasons: increased use of caller ID, decreased use of landlines, increased
use of private numbers, voice mail, privacy managers, hectic nature of personal
schedules, decreased willingness to be interviewed b. When a large number of
people refuse to participate, the sample may not have the same characteristics
as the population. Perhaps the majority of people willing to participate are doing so
because they feel strongly about the subject of the survey.
2 | DISPLAYING DATA
Introduction
Chapter Objectives
Once you have collected data, what will you do with it? Data can be described and presented
in many different formats. For example, suppose you are interested in buying a house in a
particular area. You may have no clue about the house prices, so you might ask your real
estate agent to give you a sample data set of prices. Looking at all the prices in the sample
often is overwhelming. A better way might be to look at the median price and the variation
of prices. The median and variation are just two ways that you will learn to describe data.
Your agent might also provide you with a graph of the data. In this chapter, you will study
numerical and graphical ways to describe and display your data. This area of statistics is
called "Descriptive Statistics." You will learn how to calculate, and even more importantly,
how to interpret these measurements and graphs.
A statistical graph is a tool that helps you learn about the shape or distribution of a sample
or a population. A graph can be a more effective way of presenting data than a mass of
numbers because we can see where data clusters and where there are only a few data values.
Newspapers and the Internet use graphs to show trends and to enable readers to compare
facts and figures quickly. Statisticians often graph data first to get a picture of the data.
Then, more formal tools may be applied.
https://openclipart.org/detail/19980/graphs
Some of the types of graphs that are used to summarize and organize data are the dot plot,
the bar graph, the histogram, the stem-and-leaf plot, the frequency polygon (a type of broken
line graph), the pie chart, and the box plot. In this chapter, we will briefly look at stem-and-
leaf plots, line graphs, and bar graphs, as well as frequency polygons, and time series graphs.
Our emphasis will be on histograms and box plots.
2.1 | Frequency Distributions
Once you have a set of data, you will need to organize it so that you can analyze how
frequently each datum occurs in the set. This is done by creating a table of frequencies which
is known as a frequency distribution. There are three frequency distributions to choose
from, depending on whether the data is categorical or quantitative. When it is quantitative data,
there are two frequency distributions to choose from. When calculating the frequency, you
may need to round your answers so that they are as precise as possible.
Twenty students were asked how many hours they worked. Their responses, in hours, are as
follows:
5; 6; 3; 3; 2; 4; 8; 5; 2; 3; 5; 6; 5; 4; 4; 3; 5; 2; 5; 3.
Since the range is 6, we will keep each data value separate and not group them together.
Creating an ungrouped frequency distribution is a simple task. Place the data values from
smallest to largest, without skipping any values, in the first column. Place the frequency,
the count of each data value, in the corresponding row of the second column.
Table 2.1 lists the different data values in ascending order and their frequencies. Notice that
all the data values are listed, including 7, which does not appear in the original data set.
A frequency is the number of times a value of the data occurs. According to Table 2.1, there
are three students who work two hours, five students who work three hours, and so on. The
sum of the values in the frequency column, 20, represents the total number of students
included in the sample, known as sample size (n).
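The tally described above can also be sketched in a few lines of Python. This is only an illustration of the counting procedure, not part of the original text; the variable names are our own.

```python
from collections import Counter

# Hours worked, as reported by the 20 students above.
hours = [5, 6, 3, 3, 2, 4, 8, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

counts = Counter(hours)

# List every value from the smallest to the largest without skipping any,
# so a value that never occurs (here 7) still appears with frequency 0.
frequency_table = [(value, counts[value])
                   for value in range(min(hours), max(hours) + 1)]

print("Value  Frequency")
for value, frequency in frequency_table:
    print(f"{value:>5}  {frequency:>9}")

# The sum of the frequency column is the sample size n.
n = sum(frequency for _, frequency in frequency_table)
print("n =", n)
```

Listing the values with `range` rather than looping over the data directly is what keeps the row for 7 in the table.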
A relative frequency is the ratio (fraction or proportion) of the number of times a value of
the data occurs in the set of all outcomes to the total number of outcomes. To find the relative
frequencies, divide each frequency by the total number of students in the sample, n, in this
case, 20. Relative frequencies can be written as fractions, decimals, or percent.
The following table is known as a relative frequency distribution because frequencies are
translated as percent. To find the percent of the first row, take the frequency and divide by
the sample size, n. 3/20 = 15%.
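As a rough sketch of the same calculation, the Python snippet below divides each frequency by the sample size and prints the result as a percent. The frequencies are tallied from the 20 responses listed earlier; the names used are illustrative only.

```python
# Frequencies tallied from the 20 responses (value: number of students).
frequencies = {2: 3, 3: 5, 4: 3, 5: 6, 6: 2, 7: 0, 8: 1}

n = sum(frequencies.values())  # sample size

# Relative frequency = frequency divided by the sample size.
relative_frequencies = {value: f / n for value, f in frequencies.items()}

print("Value  Frequency  Relative frequency")
for value, f in frequencies.items():
    print(f"{value:>5}  {f:>9}  {relative_frequencies[value]:>18.0%}")
```

The relative frequencies should add to 1 (or 100%), which is a quick way to check the table.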
Grouped Frequency Distribution
This second type of frequency distribution is also used when there is quantitative data.
However, it is used when the range is large and the data values need to be grouped together.
For example, 28 students were asked how many hours they worked per week. Their
responses, in hours, are as follows:
15; 26; 13; 33; 22; 14; 27; 15; 32; 23; 5; 26; 25; 14; 34; 13; 15; 22; 15; 28; 10; 18; 21; 24; 20; 18;
34; 20;
Here there are too many different data values to list them separately as in the ungrouped
frequency distribution. Notice the range is 29 (highest – lowest = 34 – 5). Therefore we need
to construct a grouped frequency distribution (Table 2.4) to group data values into classes.
A class is an interval where the lowest value of the interval is known as the lower limit and
the highest value of the interval is known as the upper limit.
2. Determine the number of classes (usually a minimum of 5 classes and a maximum of
20 classes)
5. Create the other lower limits of the classes by adding the class width to the previous
lower limit
Example 2.1
Twenty-eight students were asked how many hours they worked per week. Their responses,
in hours, are as follows: 15; 26; 13; 33; 22; 14; 27; 15; 32; 23; 5; 26; 25; 14; 34; 13; 15; 22; 15;
28; 10; 18; 21; 24; 20; 18; 34; 20. Construct a grouped frequency distribution using 5 classes.
5; 10; 13; 13; 14; 14; 15; 15; 15; 15; 18; 18; 20; 20; 21; 22; 22; 23; 24; 25; 26; 26; 27; 28; 32; 33;
34; 34
1. Range = 34 – 5 = 29
5. The other lower limits will be 11, 17, 23, 29 by adding the class width of 6 to the
previous lower limit
6. The first upper limit will be 10 since the next class begins at 11. Using class width
again, the other upper limits are 16, 22, 28, 34
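The steps above can be sketched in Python as follows. This is a minimal illustration, assuming that the class width is the range divided by the number of classes, rounded up to a whole number (29/5 = 5.8, rounded up to 6), and that the first lower limit is the smallest data value; both assumptions match the limits listed in this example.

```python
import math

hours = [15, 26, 13, 33, 22, 14, 27, 15, 32, 23, 5, 26, 25, 14,
         34, 13, 15, 22, 15, 28, 10, 18, 21, 24, 20, 18, 34, 20]
num_classes = 5

data_range = max(hours) - min(hours)                # 34 - 5 = 29
# Assumption: round the class width up to the next whole number.
class_width = math.ceil(data_range / num_classes)   # 29 / 5 = 5.8 -> 6

# Lower limits: start at the smallest value and keep adding the class width.
lower_limits = [min(hours) + i * class_width for i in range(num_classes)]
# Each upper limit is one less than the next lower limit (whole-number data).
upper_limits = [lower + class_width - 1 for lower in lower_limits]

table = []
for lower, upper in zip(lower_limits, upper_limits):
    frequency = sum(lower <= x <= upper for x in hours)
    table.append((lower, upper, frequency))
    print(f"{lower:>2} - {upper:>2}  {frequency:>2}")
```

As a check, the frequencies of all classes should add up to the sample size of 28.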
Solution 2.1
2.1 The grade point averages for 40 students are listed below.
Solution ‘Try It’ 2.1:
Range = 4 - .5 = 3.5
Class width = 3.5/8 = .4375. Round up to .5: since the data values are in tenths (one decimal
place), we round the class width up to tenths as well.
A; B; O; A; AB; O; O; A; O; B; A; A; A; O; O; O; B; O; AB; B
Table 2.5 lists the different data values and their frequencies.
Frequency distributions (tables) are a good way of organizing and displaying data. But
graphs can be even more helpful in understanding the data.
The TI-83/TI-84 Calculator can sort the data values
Press STAT
Choose EDIT
Choose #2 SortA(
Press 2nd L1
2.2 | Graphs for Qualitative Data
There are no strict rules concerning which graphs to use. However, the scale of the bars or
slices should be accurate. For example, if a category is 32%, then the slice for the category
should not be smaller than a quarter of the circle or bigger than half the circle. Three graphs
that are used to display qualitative data are pie charts, bar graphs, and Pareto charts.
In a pie chart, categories of data are represented by wedges in a circle and are proportional
in size to the percent of individuals in each category. When creating a pie chart, each slice
should be labeled with the category name and the relative frequency (percent).
In a bar graph, the length of the bar for each category is proportional to the number or
percent of individuals in each category. Bars may be vertical or horizontal. For vertical
bars, the categories are on the x-axis and the frequency or relative frequency is on the y-axis.
A Pareto chart consists of bars that are sorted into order by category size (largest to
smallest). Pareto charts are typically used with nominal data. If a Pareto chart were created
from ordinal or quantitative data, the values on the x-axis might seem out of order after the
bars were rearranged from largest to smallest.
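To keep the wedges of a pie chart proportional, each central angle can be found by multiplying the relative frequency by 360 degrees. The short Python sketch below does this for the 20 blood-type responses listed before Table 2.5; it is only an illustration, and the variable names are our own.

```python
from collections import Counter

blood_types = ["A", "B", "O", "A", "AB", "O", "O", "A", "O", "B",
               "A", "A", "A", "O", "O", "O", "B", "O", "AB", "B"]

counts = Counter(blood_types)
n = len(blood_types)

# Central angle = relative frequency x 360 degrees, computed as f * 360 / n.
angles = {category: f * 360 / n for category, f in counts.items()}

for category, f in counts.items():
    print(f"{category:>2}: {f} of {n} = {f / n:.0%}, "
          f"central angle {angles[category]:.0f} degrees")
```

The central angles should add to 360 degrees, just as the percents add to 100%.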
Pie Graphs:
Example 2.2
Below is a table comparing the number of part-time and full-time students at De Anza College
and Foothill College enrolled for the spring 2010 quarter. The tables display counts
(frequencies) and percentages or proportions (relative frequencies).
Solution 2.2
The following pie charts have the “Other/Unknown” category included (since the
percentages must add to 100%). The chart in Figure 2.7b is organized by the size of each
wedge, which makes it a more visually informative graph than the unsorted, alphabetical
graph in Figure 2.7a.
Bar Graphs:
Below is the same data from Example 2.2 displayed as a bar graph.
Look at Figure 2.1 and Figure 2.2 and determine which graph (pie or bar) you think
displays the comparisons better. It is a good idea to look at a variety of graphs to see which
is the most helpful in displaying the data. We might make different choices of what we think
is the “best” graph depending on the data and the context. Our choice also depends on what
we are using the data for.
Characteristic/Category                                               Percent
Full-Time Students                                                    40.9%
Students who intend to transfer to a 4-year educational institution   48.6%
Students under age 25                                                 61%
Total                                                                 150.5%
Figure 2.3 Bar Graph
Ethnicity          Frequency               Percent
Asian              8,794                   36.1%
Black              1,412                   5.8%
Filipino           1,298                   5.3%
Hispanic           4,180                   17.1%
Native American    146                     0.6%
Pacific Islander   236                     1.0%
White              5,978                   24.5%
TOTAL              22,044 out of 24,382    90.4% out of 100%
Table 2.8 Ethnicity of Students at De Anza College Fall Term 2007 (Census Day)
Figure 2.4 Bar Graph
The following graph is the same as the previous graph but the “Other/Unknown” percent
(9.6%) has been included. The “Other/Unknown” category is large compared to some of the
other categories (Native American, 0.6%, Pacific Islander 1.0%). This is important to know
when we think about what the data are telling us. This particular bar graph in Figure
2.5 can be difficult to understand visually.
Pareto Charts:
As the above data may be more difficult to interpret in a bar graph, a Pareto chart is easier
to read and interpret. The graph in Figure 2.6 is a Pareto chart.
Example 2.3
Twenty-five students are asked their blood type. Their responses are as follows:
A; B; O; A; AB; O; O; A; O; B; A; A; A; O; O; O; B; O; AB; B; O; B; O; A; A
Blood type O has the largest frequency; therefore, it should be placed first on the x-axis and
blood type A follows.
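The ordering used in a Pareto chart can be sketched in Python by tallying the responses and sorting the categories from largest to smallest frequency. This is an illustrative sketch only; the text-based bars stand in for the drawn chart.

```python
from collections import Counter

responses = ("A; B; O; A; AB; O; O; A; O; B; A; A; A; O; O; O; B; O; AB; B; "
             "O; B; O; A; A").split("; ")

counts = Counter(responses)

# most_common() returns (category, frequency) pairs, largest frequency first,
# which is the left-to-right order of the bars in a Pareto chart.
pareto_order = counts.most_common()

for category, frequency in pareto_order:
    print(f"{category:<3}{'#' * frequency} ({frequency})")
```

For these 25 responses the bars appear in the order O, A, B, AB.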
2.3 | Graphs for Quantitative Data
Histogram:
For most of the work you do in this book, you will use a histogram to display the data. One
advantage of a histogram is that it can readily display large data sets. A rule of thumb is to
use a histogram when the data set consists of 100 values or more.
A histogram consists of contiguous (adjoining) bars. It has both a horizontal axis and a
vertical axis. The horizontal axis is labeled with what the data represents (for instance,
distance from your home to school). The horizontal axis uses the class boundaries. The vertical
axis is labeled either frequency or relative frequency (percent or probability). The graph will
have the same shape with either label. The histogram can give you the shape of the data, the
center, and the spread of the data.
Below is a histogram of the number of new coronavirus disease (Covid-19) cases reported
daily*† (N = 4,226) – United States, February 12 – March 16, 2020 from the CDC
(https://www.cdc.gov/mmwr/volumes/69/wr/mm6912e2.htm)
* Includes both COVID-19 cases confirmed by state or local public health laboratories, as well as those testing positive at the
state or local public health laboratories and confirmed at CDC.
† Cases identified before February 28 were aggregated and reported during March 1–3.
Notice that there are no gaps between the bars and that the date is in the middle of each bar.
The class boundaries are where the bars meet. The horizontal axis may also be labeled by
the class boundaries.
The relative frequency is equal to the frequency for an observed value of the data divided
by the total number of data values in the sample. (Remember, frequency is defined as the
number of times an answer occurs.)
• f = frequency
• n = total number of data values (or the sum of the individual frequencies)
• RF = relative frequency
RF = f/n
For example, if three students in Mr. Ahab's English class of 40 students received from 90%
to 100%, then, f = 3, n = 40, and RF = 3/40 = .075 = 7.5% of the students received 90–100%.
To construct a histogram:
1. Create class boundaries on the grouped frequency distribution. Choose a starting point
for the first interval to be less than the smallest data value. A convenient starting point is a
lower value carried out to one more decimal place than the value with the most decimal
places. For example, if the value with the most decimal places is 6.1 and this is the smallest
value, a convenient starting point is 6.05 (6.1 – 0.05 = 6.05). We say that 6.05 has more
precision. If the value with the most decimal places is 2.23 and the lowest value is 1.5, a
convenient starting point is 1.495 (1.5 – 0.005 = 1.495). If the value with the most decimal
places is 3.234 and the lowest value is 1.0, a convenient starting point is 0.9995 (1.0 – 0.0005
= 0.9995). If all the data happen to be integers and the smallest value is two, then a
convenient starting point is 1.5 (2 – 0.5 = 1.5). Also, when the starting point and other
boundaries are carried to one additional decimal place, no data value will fall on a boundary.
3. Draw bars as high as the frequency for each class interval within each boundary
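The rule in step 1 for choosing a convenient starting point can be sketched in Python. This is a minimal sketch, assuming the data are supplied as strings so that the number of decimal places written in the data is preserved; the function name `starting_point` is our own and is not part of the text.

```python
from decimal import Decimal

def starting_point(values):
    """Return a convenient starting point for the first class boundary.

    `values` holds the data as strings (for example "1.0") so that the
    decimal places written in the data are kept.
    """
    data = [Decimal(v) for v in values]
    # Largest number of decimal places among the values (0 for integers).
    places = max(max(-d.as_tuple().exponent, 0) for d in data)
    # Half of one unit in the last decimal place: 0.5, 0.05, 0.005, ...
    half_step = Decimal("0.5") * Decimal(10) ** -places
    return min(data) - half_step

print(starting_point(["6.1", "7.4", "8.2"]))   # one decimal place
print(starting_point(["2.23", "1.5"]))         # two decimal places
print(starting_point(["3.234", "1.0"]))        # three decimal places
print(starting_point(["2", "3", "5"]))         # integers
```

Because the starting point carries one more decimal place than any data value, no data value can fall exactly on a class boundary.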
The next two examples go into detail about how to construct a histogram using continuous
data and how to create a histogram using discrete data.
Example 2.4
The following data are the heights (in inches to the nearest half inch) of 100 male
semiprofessional soccer players. The heights are continuous data, since height is measured.
60; 60.5; 61; 61; 61.5; 63.5; 63.5; 63.5; 64; 64; 64; 64; 64; 64; 64; 64.5; 64.5; 64.5; 64.5; 64.5;
64.5; 64.5; 64.5; 66; 66; 66; 66; 66; 66; 66; 66; 66; 66; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5;
66.5; 66.5; 66.5; 66.5; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67.5; 67.5; 67.5; 67.5; 67.5;
67.5; 67.5; 68; 68; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69.5; 69.5; 69.5; 69.5; 69.5; 70; 70; 70;
70; 70; 70; 70.5; 70.5; 70.5; 71; 71; 71; 72; 72; 72; 72.5; 72.5; 73; 73.5; 74
Create a relative frequency histogram using the frequency distribution that follows:
Solution 2.4
First find the relative frequencies and class boundaries for each class.
BOUNDARIES RELATIVE FREQUENCY
59.95 – 61.95 .05
61.95 – 63.95 .03
63.95 – 65.95 .15
65.95 – 67.95 .4
67.95 – 69.95 .17
69.95– 71.95 .12
71.95 – 73.95 .07
73.95 – 75.95 .01
Second place the boundaries on the x-axis and the relative frequency on the y-axis.
Figure 2.8
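The relative frequencies in Solution 2.4 can be checked with a short Python sketch. To keep it brief, the 100 heights are stored as a tally (height: number of players) counted from the data in Example 2.4; the names used are illustrative only.

```python
# Heights from Example 2.4, stored as {height in inches: number of players}.
height_counts = {
    60: 1, 60.5: 1, 61: 2, 61.5: 1, 63.5: 3, 64: 7, 64.5: 8,
    66: 10, 66.5: 11, 67: 12, 67.5: 7, 68: 2, 69: 10, 69.5: 5,
    70: 6, 70.5: 3, 71: 3, 72: 3, 72.5: 2, 73: 1, 73.5: 1, 74: 1,
}
n = sum(height_counts.values())  # total number of data values

start, width, num_classes = 59.95, 2, 8
# round() keeps each boundary at two decimal places.
boundaries = [round(start + i * width, 2) for i in range(num_classes + 1)]

relative_frequencies = []
for lower, upper in zip(boundaries, boundaries[1:]):
    # No height falls on a boundary, so strict comparisons are safe.
    frequency = sum(count for height, count in height_counts.items()
                    if lower < height < upper)
    relative_frequencies.append(frequency / n)
    print(f"{lower:.2f} - {upper:.2f}  {frequency / n:.2f}")
```

Each printed row corresponds to one bar of the relative frequency histogram: the boundaries give the bar's position on the x-axis and the relative frequency gives its height.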
Example 2.5
The following data are the number of books bought by 50 part-time college students at the
College of Lake County. The number of books is discrete data, since books are counted.
1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1
2; 2; 2; 2; 2; 2; 2; 2; 2; 2
3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3
4; 4; 4; 4; 4; 4
5; 5; 5; 5; 5
6; 6;
Construct a Histogram.
Solution 2.5
BOUNDARIES FREQUENCY
.5 – 1.5 11
1.5 – 2.5 10
2.5 – 3.5 16
3.5 – 4.5 6
4.5 – 5.5 5
5.5 – 6.5 2
2.2 The following data are the shoe sizes of 50 male students. The sizes are
continuous data since shoe size is measured. Construct a histogram and calculate
the width of each bar or class interval. Suppose you choose six bars.
9; 9; 9.5; 9.5; 10; 10; 10; 10; 10; 10; 10.5; 10.5; 10.5; 10.5; 10.5; 10.5; 11; 11; 11; 11;
11; 11; 11; 11; 11; 11; 11; 11; 11; 11.5; 11.5; 11.5; 11.5; 11.5; 11.5; 11.5; 12; 12; 12;
12; 12; 12; 12; 12.5; 12.5; 12.5; 12.5; 14
• Press WINDOW. Set Xmin = .5, Xscl = (6.5 – .5)/6, Ymin = –1, Ymax = 20, Yscl =
1, Xres = 1.
• Press 2nd Y=. Start by pressing 4: Plotsoff ENTER.
• Press 2nd Y=. Press 1:Plot1. Press ENTER. Arrow down to TYPE. Arrow to the 3rd picture
(histogram). Press ENTER.
• Arrow down to Xlist: Enter L1 (2nd 1). Arrow down to Freq. Enter L2 (2nd 2). Press Graph.
Use TRACE key and the arrow keys to examine the histogram.
Descriptions of Histograms:
There are four descriptions of histograms. These descriptions describe how the histogram is
shaped.
1. Symmetrical: the largest bar is in the center interval and the bars on each side of the center
decrease at about the same rate.
2. Skewed Left: the largest bar is closer to the right side and the bars start decreasing toward
the left.
3. Skewed Right: the largest bar is closer to the left side and the bars start decreasing toward
the right.
4. Uniform: each bar of the histogram is the same size for every interval
NOTE: Histograms will not always be exactly one of the four descriptions; therefore use
adjectives to help describe the histogram. For example, “mostly symmetrical”.
Stem-and-Leaf Graph:
One simple graph, the stem-and-leaf graph or stemplot, comes from the field of exploratory
data analysis. It is a good choice when the data sets are small.
For example:
The number 23 has stem two and leaf three. The number 432 has stem 43 and leaf two.
Likewise, the number 5,432 has stem 543 and leaf two. The decimal 9.3 has stem nine and
leaf three.
2. Write the stems in a vertical line from smallest to largest. Draw a vertical line to the right
of the stems. Then write the leaves in increasing order next to their corresponding stem.
2.6
Example 2.6
For Susan Dean's spring pre-calculus class, scores for the first exam were as follows:
33; 42; 49; 49; 53; 55; 55; 61; 63; 67; 68; 68; 69; 69; 72; 73; 74; 78; 80; 83; 88; 88; 88; 90;
92; 94; 94; 94; 94; 96; 100
Create a Stem and leaf plot.
Solution 2.6:
Stem Leaf
3 3
4 299
5 355
6 1378899
7 2348
8 03888
9 0244446
10 0
The stemplot shows that most scores fell in the 60s, 70s, 80s, and 90s. Eight out of the 31
scores or approximately 26% were in the 90s or 100, a fairly high number of A’s.
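A stemplot like the one in Solution 2.6 can be built with a short Python sketch. This is only an illustration for whole-number data, where the leaf is the final digit and the stem is everything before it; the names are our own.

```python
from collections import defaultdict

scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

leaves_by_stem = defaultdict(list)
for score in sorted(scores):
    stem, leaf = divmod(score, 10)  # stem = leading digits, leaf = final digit
    leaves_by_stem[stem].append(leaf)

# Write every stem from smallest to largest, even one that has no leaves.
rows = []
for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
    leaves = " ".join(str(leaf) for leaf in leaves_by_stem[stem])
    rows.append(f"{stem:>2} | {leaves}".rstrip())

print("\n".join(rows))
```

Sorting the data first places the leaves in increasing order next to their stems.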
2.3 For the Park City basketball team, scores for the last 30 games were as follows
(smallest to largest):
32; 32; 33; 34; 38; 40; 42; 42; 43; 44; 46; 47; 47; 48; 48; 48; 49; 50; 50; 51; 52; 52; 52; 53;
54; 56; 57; 57; 60; 61
The stem and leaf plot is a quick way to graph data and gives an exact picture of the data.
You want to look for an overall pattern and any outliers. An outlier is an observation of
data that does not fit the rest of the data. It is sometimes called an extreme value. When
you graph an outlier, it will appear not to fit the pattern of the graph. Some outliers are due
to mistakes (for example, writing down 50 instead of 500) while others may indicate that
something unusual is happening. It takes some background information to explain outliers,
so we will cover them in more detail later.
Example 2.7
The data are the distances (in km) from a home to local supermarkets. Create
a stemplot using the data:
1.1; 1.5; 2.3; 2.5; 2.7; 3.2; 3.3; 3.3; 3.5; 3.8; 4.0; 4.2; 4.5; 4.5; 4.7; 4.8; 5.5; 5.6; 6.5;
6.7; 12.3
Do the data seem to have any concentration of values? The leaves are to the right of the
decimal.
Solution 2.7
The value 12.3 may be an outlier. Values appear to concentrate at three and four
kilometers.
Stem Leaf
1 15
2 357
3 23358
4 025578
5 56
6 57
7
8
9
10
11
12 3
Table 2.10
2.4 The following data show the distances (in miles) from the homes of off-campus
statistics students to the college. Create a stem plot using the data and identify any
outliers:
0.5; 0.7; 1.1; 1.2; 1.2; 1.3; 1.3; 1.5; 1.5; 1.7; 1.7; 1.8; 1.9; 2.0; 2.2; 2.5; 2.6; 2.8; 2.8; 2.8; 3.5;
3.8; 4.4; 4.8; 4.9; 5.2; 5.5; 5.7; 5.8; 8.0
Example 2.8
A side-by-side stem-and-leaf plot allows a comparison of the two data sets in two columns.
In a side-by-side stem-and-leaf plot, two sets of leaves share the same stem. The leaves are
to the left and the right of the stems. Table 2.11 and Table 2.12 show the ages of presidents
at their inauguration and at their death. Construct a side-by-side stem-and-leaf plot using
this data.
President Age President Age President Age
Washington 57 Lincoln 52 Hoover 54
J. Adams 61 A. Johnson 56 F. Roosevelt 51
Jefferson 57 Grant 46 Truman 60
Madison 57 Hayes 54 Eisenhower 62
Monroe 58 Garfield 49 Kennedy 43
J.Q. Adams 57 Arthur 51 L. Johnson 55
Jackson 61 Cleveland 47 Nixon 56
Van Buren 54 B. Harrison 55 Ford 61
W. H. Harrison 68 Cleveland 55 Carter 52
Tyler 51 McKinley 54 Reagan 69
Polk 49 T. Roosevelt 42 G.H.W Bush 64
Taylor 64 Taft 51 Clinton 47
Fillmore 50 Wilson 56 G.W. Bush 54
Pierce 48 Harding 55 Obama 47
Buchanan 65 Coolidge 51
Table 2.11 Presidential Ages at Inauguration
Line Graphs:
Another type of graph that is useful for specific data values is a line graph. The
particular line graph shown in Example 2.9 is for an ungrouped frequency
distribution: the x-axis (horizontal axis) consists of data values and the y-axis (vertical axis)
consists of frequency points. The frequency points are connected using line segments.
Example 2.9
In a survey, 40 mothers were asked how many times per week a teenager must be reminded
to do his or her chores. The results are shown in Table 2.13 and in Figure 2.10.
Solution 2.9:
Figure 2.10
2.5 In a survey, 40 people were asked how many times per year they had their
car in the shop for repairs. The results are shown in Table 2.14. Construct a line graph.
# of times FREQUENCY (f)
0 7
1 10
2 14
3 9
Frequency Polygon:
Frequency polygons are analogous to line graphs, and just as line graphs
make continuous data visually easy to interpret, so too do frequency polygons.
Example 2.10
Construct a frequency polygon using the distribution of Statistic Final Test Scores
CLASSES FREQUENCY (f)
50 - 59 5
60 - 69 10
70 - 79 30
80 - 89 40
90 - 99 15
Solution 2.10:
First find the midpoints of each class. Midpoints: 54.5, 64.5, 74.5, 84.5, 94.5
The first label on the x-axis is 44.5. This represents an interval extending from 39.5 to 49.5.
Since the lowest class begins at 50, this interval contains no data and is used only to allow
the graph to touch the x-axis. The point labeled 54.5 represents the next interval or the first
“real” interval from the
table, and contains five scores. This reasoning is followed for each of the remaining intervals
with the point 104.5 representing the interval from 99.5 to 109.5. Again, this interval
contains no data and is only used so that the graph will touch the x-axis. Looking at the
graph, we say that this distribution is skewed because one side of the graph does not mirror
the other side.
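The points of the frequency polygon in this example can be computed with a brief Python sketch. It is an illustration only: each midpoint is the average of the class limits, and one empty class is added on each side so that the graph touches the x-axis.

```python
# (lower limit, upper limit, frequency) for each class in Example 2.10
classes = [(50, 59, 5), (60, 69, 10), (70, 79, 30), (80, 89, 40), (90, 99, 15)]

midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
frequencies = [f for _, _, f in classes]
class_width = classes[1][0] - classes[0][0]  # 60 - 50 = 10

# Add an empty class before the first and after the last class.
x = [midpoints[0] - class_width] + midpoints + [midpoints[-1] + class_width]
y = [0] + frequencies + [0]

points = list(zip(x, y))
for midpoint, frequency in points:
    print(f"{midpoint:>6}  {frequency:>3}")
```

Connecting these points in order with line segments gives the frequency polygon.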
2.6 Construct a frequency polygon using the ages of presidents at their inauguration in the
following table.
Frequency polygons are useful for comparing distributions. This is achieved by overlaying
the frequency polygons drawn for different data sets.
Example 2.11
We will construct an overlay frequency polygon comparing the scores from Example 2.10
with the students’ final numeric grade.
Solution 2.11:
Suppose that we want to study the temperature range of a region for an entire month. Every
day at noon we note the temperature and write this down in a log. A variety of statistical
studies could be done with this data. We could find the mean or the median temperature for
the month. We could construct a histogram displaying the number of days those
temperatures reach a certain range of values. However, all of these methods ignore a portion
of the data that we have collected.
One feature of the data that we may want to consider is that of time. Since each date is paired
with the temperature reading for the day, we don’t have to think of the data as being
random. We can instead use the times given to impose a chronological order on the data. A
graph that recognizes this ordering and displays the changing temperature as the month
progresses is called a time series graph.
Example 2.12
The following data shows the Annual Consumer Price Index, each month, for ten years.
Construct a time series graph for the Annual Consumer Price Index data only.
Year 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
Annual 184 188.9 195.3 201.6 207.342 215.303 214.537 218.056 224.939 229.594
Solution 2.12:
2.7 The following table is a portion of a data set from www.worldbank.org. Use the table to
construct a time series graph for CO2 emissions for the United States.
CO2 Emissions
Year    Ukraine    United Kingdom    United States
2003    352,259    540,640           5,681,664
2004    343,121    540,409           5,790,761
2005    339,029    541,990           5,826,394
2006    327,797    542,045           5,737,615
2007    328,357    528,631           5,828,627
2008    323,657    522,247           5,656,839
2009    272,176    474,579           5,299,563
KEY TERMS
Bar Graphs a representation of categorical data where the horizontal axis contains the
qualitative categories, and the vertical axis represents the frequency of each category. The
height of the bars represents the amount of data in the category.
Frequency Polygon looks like a line graph but uses intervals to display ranges of large
amounts of data.
Frequency Table a data representation in which grouped data is displayed along with the
corresponding frequencies.
Interval also called a class interval; an interval represents a range of data and is used
when displaying large data sets.
Paired Data Set two data sets that have a one-to-one relationship so that:
• both data sets are the same size, and
• each data point in one data set is matched with exactly one point from the other set.
Pie Graphs a circular graph that shows how large each category is in relation to the
whole. Created from the frequency distribution and using relative frequencies, the central
angles are calculated by multiplying the relative frequency by 360 degrees.
Quantitative Continuous Data measurable data that is numerical and can take on any
value in a given range of numbers.
Quantitative Discrete Data measurable data that is numerical that can only take on
particular values and cannot take on the values in between.
Relative Frequency the ratio of the number of times a value of the data occurs in the set
of all outcomes to the number of all outcomes.
Time Series the same variable is measured over a period and plotted on the x-y coordinate
plane, with time on the x-axis and the variable on the y-axis.
EXERCISES for CHAPTER 2
For exercises #1 – 4, use the following frequency distribution
143147513612
111716271644
11535232413
11. Find the sample size.
18. Use the given minimum and maximum data entries, and the number of classes, to find
the class width, the lower class limits, and the upper class limits.
a. minimum = 12, maximum = 68, 6 classes
b. minimum = 28, maximum = 75, 6 classes
c. minimum = 5, maximum = 92, 8 classes
d. minimum = 30, maximum = 83, 8 classes
19. The following data represents the number of miles a full tank of gas allows for a sample
of vehicles inspected. Create a grouped frequency distribution with 7 classes.
260 271 236 244 279 296 284 299 288
288 247 256 338 360 341 333 261 266
287 296 313 311 307 307 299 303 277
283 304 305 288 290 288 289 297 299
332 330 309 328 307 328 285 291 295
298 306 315 310 318 318 320 333 321
323 324 327
20. The data represent the time, in minutes, spent reading a political blog in a day.
Construct a frequency distribution using 5 classes. In the table, include the midpoints,
relative frequencies, and cumulative frequencies. Which class has the greatest frequency
and which has the least frequency?
34 18 10 44 41 47 0 33 49 45
16 10 27 24 44 46 21 25 45 23
21. Construct a frequency distribution for the given data set using 6 classes. In the table,
include the midpoints, relative frequencies, and cumulative frequencies. Which class has
the greatest frequency and which has the least frequency?
22. The students in Ms. Ramirez’s math class have birthdays in each of the four
seasons. The table below shows the four seasons, the number of students who have
birthdays in each season, and the percentage (%) of students in each group. Construct a bar
graph showing the number of students.
23. Using the data from Ms. Ramirez’s math class supplied in Exercise 22, construct a pie
graph showing the percentages.
24. David County has six high schools. Each school sent students to participate in a county-
wide science competition. The table below shows the percentage breakdown of competitors
from each school, and the percentage of the entire student population of the county that
goes to each school. Construct a bar graph and a pie graph that show the
population percentage of competitors from each school.
High School    Science Competition Percentage    Overall Student Population Percentage
Alabaster 28.9% 8.6%
Concordia 7.6% 23.2%
Genoa 12.1% 15.0%
Mocksville 18.5% 14.3%
Tynneson 24.2% 10.1%
West End 8.7% 28.8%
25. Use the data from the David County science competition supplied in Exercise 24. Construct a bar
graph that shows population percentage of students at each school.
For exercises #26 - 29, create a stem/leaf plot for the given data.
26. The miles per gallon rating for 30 cars are shown below (lowest to highest).
19, 19, 19, 20, 21, 21, 25, 25, 25, 26, 26, 28, 29, 31, 31,
32, 32, 33, 34, 35, 36, 37, 37, 38, 38, 38, 38, 41, 43, 43
25, 27, 33, 34, 34, 34, 35, 37, 37, 38, 39, 39, 39,
40, 41, 45, 46, 47, 49, 50, 50, 53, 53, 54, 54
28. The following data are the prices of different laptops at an electronics store. Round each
value to the nearest multiple of ten.
249, 249, 260, 265, 265, 280, 299, 299, 309, 319, 325, 326, 350,
350, 350, 365, 369, 389, 409, 459, 489, 559, 569, 570, 610
29. The following data are the daily high temperatures in a town for one month.
61, 61, 62, 64, 66, 67, 67, 67, 68, 69, 70, 70, 70, 71, 71,
72, 74, 74, 74, 75, 75, 75, 76, 76, 77, 78, 78, 79, 79, 95
30. In a survey, 40 people were asked how many times they visited a store before making a
major purchase. Construct a frequency polygon. The results are shown in the table:
31. In a survey, several people were asked how many years it has been since they
purchased a mattress. Construct a histogram. The results are shown below.
32. Several children were asked how many TV shows they watch each day. Construct a
relative frequency polygon. The results of the survey are shown in the table below:
33. Sixty-five randomly selected car salespersons were asked the number of cars they
generally sell in one week. Fourteen people answered that they generally sell three cars;
nineteen generally sell four cars; twelve generally sell five cars; nine generally sell six cars;
eleven generally sell seven cars. Complete the table.
34. Refer to the table for problem 33. Construct a histogram for the given data.
35. Construct a frequency polygon and histogram for each of the following frequency tables:
a.
Pulse Rates for Women    Frequency
60 – 69    12
70 – 79    14
80 – 89    11
90 – 99    1
100 – 109    1
110 – 119    0
120 – 129    1
b.
Actual Speed in a 30 mph Zone    Frequency
42 – 45    25
46 – 49    14
50 – 53    7
54 – 57    3
58 – 61    1
c.
36. Construct a relative frequency polygon and relative frequency histogram from the
frequency distribution for the 50 highest ranked countries for depth of hunger.
Depth of Hunger Frequency
230 - 259 21
260 - 289 13
290 - 319 6
320 - 349 7
350 - 379 1
380 - 409 1
410 - 439 1
37. Use the two frequency tables to compare the life expectancy of men and women from 20
randomly selected countries. Include an overlaid frequency polygon and discuss the shapes
of the distributions, the center, the spread, and any outliers. What can we conclude about
the life expectancy of women compared to men?
38. Construct a time series graph for (a) the number of male births, (b) the number of
female births, and (c) the total number of births.
39. The following data sets list full time police per 100,000 citizens along with homicides
per 100,000 citizens for the city of Detroit, Michigan during the period from 1961 to 1972.
Year 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972
Police 260.4 269.8 272 273 272.5 261.3 268.9 296 319.9 341.4 356.6 376.7
Homicides 8.6 8.9 8.5 8.9 13.1 14.6 21.4 28 31.5 37.4 46.3 47.2
a. Construct a double time series graph using a common x-axis for both sets of data.
b. Which variable increased the fastest? Explain.
c. Did Detroit’s increase in police officers have an impact on the murder rate? Explain.
REFERENCES
2.2 Graphs for Qualitative Data
Burbary, Ken. Facebook Demographics Revisited – 2011 Statistics, 2011. Available online
at http://www.kenburbary.com/2011/03/facebook-demographics-revisited-2011-statistics-2/ (accessed August 21, 2013).
“Overweight and Obesity: Adult Obesity Facts.” Centers for Disease Control and
Prevention. Available online at http://www.cdc.gov/obesity/data/adult.html (accessed September 13, 2013).
“Timeline: Guide to the U.S. Presidents: Information on every president’s birthplace, political party,
term of office, and more.” Scholastic, 2013. Available online
at http://www.scholastic.com/teachers/article/timeline-guide-us-presidents (accessed April 3, 2013).
“Food Security Statistics.” Food and Agriculture Organization of the United Nations. Available
online at http://www.fao.org/economic/ess/ess-fs/en/ (accessed April 3, 2013).
“Consumer Price Index.” United States Department of Labor: Bureau of Labor Statistics.
Available online at http://data.bls.gov/pdq/SurveyOutputServlet (accessed April 3, 2013).
Gunst, Richard, Robert Mason. Regression Analysis and Its Application: A Data-Oriented Approach.
CRC Press: 1980.
CHAPTER 2 SOLUTIONS:
1) 1.5 – 2.5 3) n = Σf = 40
5) frequency distribution 7) midpoints: 34, 39, 44, 49, 54, 59, 64
# of potholes f
1 12
2 4
3 5
4 5
5 3
6 3
7 3
Classes Frequency
236 – 253 3
254 – 271 5
272 – 289 11
290 – 307 17
308 – 325 11
326 – 343 9
344 – 361 1
23) 25)
27) 29)
31) 33)
35)
b.
c.
37)
39) Police variable increased faster. The number of police did not impact the number of
homicides. Homicides continued to increase.
3 | NUMERICAL DESCRIPTORS
Introduction
Chapter Objectives
3.1 | Measures of Central Tendency
The "center" of a data set is also a way of describing location. The two most widely used
measures of the "center" of a data set are the mean (average) and the median. To calculate
the mean weight of 50 people, add the 50 weights together and divide by 50. To find the
median weight of the 50 people, order the data and find the number that splits the data
into two equal parts. The median is generally a better measure of the center when there are
extreme values or outliers because it is not affected by the precise numerical values of the
outliers. The mean is the most common measure of the center.
NOTE
The words “mean” and “average” are often used interchangeably. The substitution of one
word for the other is common practice. The technical term is “arithmetic mean” and “average”
is technically a center location. However, in practice among non-statisticians, “average” is
commonly accepted for “arithmetic mean.”
The symbol used to represent the sample mean is an x with a bar over it (pronounced “x
bar”): x̄.
The Greek letter μ (pronounced "mew") represents the population mean. One of the
requirements for the sample mean to be a good estimate of the population mean is for
the sample used to be truly random.
When the values in a data set are not unique, the mean can be calculated by multiplying
each distinct value by its frequency and then dividing the sum by the total number of data
values. To see that both ways of calculating the mean are the same, consider the sample:
1; 1; 1; 2; 2; 3; 4; 4; 4; 4; 4
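$$\bar{x} = \frac{1 + 1 + 1 + 2 + 2 + 3 + 4 + 4 + 4 + 4 + 4}{11} \approx 2.7$$
$$\bar{x} = \frac{3(1) + 2(2) + 1(3) + 5(4)}{11} \approx 2.7$$
In the second calculation, the frequencies are 3, 2, 1, and 5.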
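The two ways of computing the mean can be checked side by side with a few lines of Python. This is only a sketch for the sample above; the variable names are chosen here for illustration.

```python
from collections import Counter

# Sample from the text above.
sample = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4]

# Way 1: add every data value and divide by the number of values.
mean_direct = sum(sample) / len(sample)

# Way 2: multiply each distinct value by its frequency, add, then divide by n.
frequencies = Counter(sample)
mean_by_frequency = sum(value * count for value, count in frequencies.items()) / len(sample)

print(round(mean_direct, 1), round(mean_by_frequency, 1))
```

Both numbers come from the same total, so they agree before and after rounding.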
You can quickly find the location of the median by using the expression $\frac{n+1}{2}$.
The letter n is the total number of data values in the sample. If n is an odd number, the
median is the middle value of the ordered data (ordered smallest to largest). If n is an even
number, the median is equal to the two middle values added together and divided by two
after the data has been ordered. For example, if the total number of data values is 97, then
$\frac{n+1}{2} = \frac{97+1}{2} = 49$, so the median is the 49th value in the ordered data. If the total number of
data values is 100, then $\frac{n+1}{2} = \frac{100+1}{2} = 50.5$, so the median occurs midway between the 50th
and 51st values.
In general, the values of the mean and median are not the same. The upper-case letter M is
often used to represent the median. The next example illustrates the location of the median
and the value of the median.
Example 3.1
AIDS data indicating the number of months a patient with AIDS lives after taking a new
antibody drug are as follows (smallest to largest):
3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17; 17; 18; 21; 22; 22; 24;
24; 25; 26; 26; 27; 27; 29; 29; 31; 32; 33; 33; 34; 34; 35; 37; 40; 44; 44; 47
Solution 3.1
The calculation for the mean is:
$$\bar{x} = \frac{3 + 4 + (8)(2) + 10 + 11 + 12 + 13 + 14 + (15)(2) + (16)(2) + \cdots + 35 + 37 + 40 + (44)(2) + 47}{40} \approx 23.6$$
To find the median, M, first use the formula for the location. The location is
$$\frac{n+1}{2} = \frac{40+1}{2} = 20.5.$$
Starting at the smallest value, the median is located between the 20th and 21st values (the
two 24s); so
$$M = \frac{24 + 24}{2} = 24.$$
Go to STAT >> EDIT. Clear list L1 by selecting 4:ClrList, pressing 2nd 1 for list L1, and then
pressing ENTER. Go to STAT >> EDIT again and enter the data values into list L1.
Press STAT and arrow to CALC. Select 1:1-VarStats. Press 2nd 1 for L1 and then ENTER.
At the top of the screen you will see x̄ = 23.6. Use the arrow keys to scroll down to the
second output screen, and you will see the median: Med = 24.
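For readers working with software instead of a calculator, the same results can be reproduced in Python. The sketch below applies the (n + 1)/2 location rule to the data of Example 3.1 and compares it with the standard library's statistics module; the function name median_by_location is introduced here only for illustration.

```python
import statistics

def median_by_location(values):
    """Median found with the (n + 1) / 2 location rule described in the text."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) / 2                 # 1-based position in the ordered data
    if n % 2 == 1:
        return ordered[int(location) - 1]  # a single middle value
    lower = ordered[int(location) - 1]     # e.g. the 20th value when location is 20.5
    upper = ordered[int(location)]         # e.g. the 21st value
    return (lower + upper) / 2

months = [3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18, 21, 22, 22, 24,
          24, 25, 26, 26, 27, 27, 29, 29, 31, 32, 33, 33, 34, 34, 35, 37, 40, 44, 44, 47]

print(round(statistics.mean(months), 1))   # mean, rounded to one decimal place
print(median_by_location(months))          # median from the location rule
print(statistics.median(months))           # library result, for comparison
```

The data does not need to be typed in order, because the function sorts it first.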
3.1 The following data show the number of months patients typically wait on a transplant
list before getting surgery. The data are ordered from smallest to largest. Calculate the
mean and median.
Example 3.2
Suppose that in a small town of 50 people, one person earns $5,000,000 per year and the
other 49 each earn $30,000. Which is the better measure of the "center": the mean or the
median?
Solution 3.2
The mean is $\bar{x} = \frac{5{,}000{,}000 + 49(30{,}000)}{50} = 129{,}400$, whereas the median is M = 30,000.
The median is a better measure of the "center" than the mean. The mean is distorted
because of one very large value, 5,000,000. Such a value is called an “outlier”, meaning that
it is an extreme value. The 30,000 gives us a better sense of the middle of the data.
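The effect of the outlier in Example 3.2 can be seen directly in a short Python sketch. The second data set, with an even larger top income, is made up here purely to illustrate that the median does not depend on how extreme the outlier is.

```python
import statistics

# Example 3.2: one person earns $5,000,000 and the other 49 each earn $30,000.
incomes = [5_000_000] + [30_000] * 49

mean_income = statistics.mean(incomes)      # pulled upward by the single large value
median_income = statistics.median(incomes)  # depends only on the middle of the data

print(mean_income, median_income)

# Hypothetical variation: make the outlier ten times larger.
more_extreme = [50_000_000] + [30_000] * 49
print(statistics.mean(more_extreme), statistics.median(more_extreme))
```

In the variation the mean changes a great deal while the median stays at 30,000.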
3.2 In a sample of 60 households, one house is worth $2,500,000. Half of the rest are worth
$280,000, and all the others are worth $315,000. Which is the better measure of the
“center”: the mean or the median?
Another measure of the center is the mode. The mode is the most frequent value. There
can be more than one mode in a data set as long as those values have the same frequency,
and that frequency is the highest. A data set with two modes is called bimodal. There also
can be no mode. No mode exists when each data value has the same frequency.
Example 3.3
Solution 3.3
The most frequent score is 72, which occurs five times. Mode = 72.
3.3 The number of books checked out from the library by 25 students are as follows:
0; 0; 0; 1; 2; 3; 3; 4; 4; 5; 5; 7; 7; 7; 7; 8; 8; 8; 9; 10; 10; 11; 11; 12; 12
Find the mode.
Example 3.4
Solution 3.4
Consider a weight loss program that advertises a mean weight loss of six pounds the first
week of the program. The mode might indicate that most people lose two pounds the first
week, making the program less appealing.
NOTE
The mode can be calculated for qualitative data as well as for quantitative data. For
example, if the data set is: {red, red, red, green, green, yellow, purple, black, blue}, then
the mode is red.
Statistical software will easily calculate the mean, the median, and the mode. Some graphing
calculators can also make these calculations. In the real world, people make these
calculations using software.
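As one illustration of such software, the mode can be found in Python by counting how often each value occurs. The helper below is a sketch, not a standard routine: its name and its choice to return an empty list when every value occurs equally often (the "no mode" case described above) are assumptions made here. The two numeric data sets are made up for the demonstration.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency.

    Follows the text's convention that there is no mode (empty list) when
    every distinct value occurs equally often.
    """
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if len(counts) > 1 and all(count == highest for count in counts.values()):
        return []
    return [value for value, count in counts.items() if count == highest]

colors = ["red", "red", "red", "green", "green", "yellow", "purple", "black", "blue"]
print(modes(colors))                 # qualitative data
print(modes([1, 2, 2, 3, 3, 4]))     # made-up bimodal data
print(modes([5, 6, 7, 8]))           # made-up data with no repeated value
```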
3.4 Five credit scores are 680, 680, 700, 720, 720. The data set is bimodal because the scores
680 and 720 each occur twice. Consider the annual earnings of workers at a factory. The
mode is $25,000 and occurs 150 times out of 301. The median is $50,000 and the mean is
$47,500. What would be the best measure of the “center” for each of these situations?
The Law of Large Numbers says that if you take samples of larger and larger size from any
population, then the mean of the sample is highly likely to get closer and closer to μ.
Similarly, if we take larger and larger samples, the sample proportion p̂ will approach the
population proportion p. These facts are discussed in more detail later in the text. Recall that
a statistic is a number calculated from a sample, whereas a parameter is a corresponding
number calculated from a population. Examples of statistics include the mean, median, mode
and sample proportion.
The Law of Large Numbers tells us that x̄ is a sample statistic that estimates the parameter μ
(the population mean).
Similarly, the sample proportion p̂ is a sample statistic that estimates the population
proportion, p.
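A small simulation can make the Law of Large Numbers concrete. The sketch below is an illustration only: it assumes a population of fair six-sided die rolls (population mean 3.5), which is not an example from the text, and uses a fixed random seed so that a run can be repeated.

```python
import random

def sample_mean_of_die_rolls(n, rng):
    """Mean of n simulated rolls of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

population_mean = 3.5                # mean of a fair die: (1+2+3+4+5+6)/6
rng = random.Random(2021)            # fixed seed so the run is repeatable

# Print the sample size, the sample mean, and its distance from the population mean.
for n in (10, 100, 1_000, 10_000, 100_000):
    x_bar = sample_mean_of_die_rolls(n, rng)
    print(n, round(x_bar, 3), round(abs(x_bar - population_mean), 3))
```

The distance in the last column tends to shrink as n grows, although it need not shrink at every single step.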
Suppose that thirty randomly selected students were asked the number of movies they
watched the previous week. The results are shown in the frequency table below.
# of movies Frequency
0 5
1 15
2 6
3 3
4 1
Table 3.1
Suppose that we wanted to know the mean number of movies watched for the sample. Of
course, this will be the sum of the data, divided by 30 (the size of the sample):
$$\text{sample mean} = \frac{\text{data sum}}{\text{number of data values}}.$$
To do this, we add up the data values, multiplied by their respective frequencies, and divide
by 30:
$$\bar{x} = \frac{0(5) + 1(15) + 2(6) + 3(3) + 4(1)}{30} = \frac{40}{30} \approx 1.33$$
In other words, we can also calculate the mean by multiplying the data values by their
respective relative frequencies and adding the resulting values. This can also be done easily
using the TI-84 family of calculators.
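The calculation from Table 3.1 can also be sketched in Python. The dictionary below simply restates the table; the variable names are chosen here for illustration.

```python
# Table 3.1: number of movies watched -> frequency.
movie_counts = {0: 5, 1: 15, 2: 6, 3: 3, 4: 1}

n = sum(movie_counts.values())                                        # sample size
data_sum = sum(value * freq for value, freq in movie_counts.items())  # sum of all data
mean_from_frequencies = data_sum / n

# Same mean from relative frequencies: each value times (frequency / n), then add.
mean_from_relative = sum(value * (freq / n) for value, freq in movie_counts.items())

print(n, data_sum, round(mean_from_frequencies, 2), round(mean_from_relative, 2))
```

Dividing once at the end or dividing each frequency by n first gives the same mean, up to floating-point rounding.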
When only grouped data is available, we do not know the individual data values (we only
know intervals and interval frequencies); therefore, we cannot compute an exact mean for
the data set. However, we can still estimate the sample mean from the frequency table.
To calculate the mean from a grouped frequency table we can apply the same method as
above; but since we do not know the individual data values, we instead use the midpoint of
each interval. The midpoint is the average of the lower boundary and upper boundary:
$$\text{midpoint} = \frac{\text{lower boundary} + \text{upper boundary}}{2}$$
Example 3.5
A frequency table displaying Professor Blount’s last statistics test is shown below:
Solution 3.5
• Go to STAT >> EDIT and clear lists L1 and L2. Enter the midpoints into L1 and the
frequencies into L2.
• Go to STAT >> CALC, select 1-VarStats and type L1, L2 (don’t forget the comma for
the TI-83!).
• For the TI-84 Plus, type L1 into List: and L2 into FreqList:, then select Calculate.
• Hit ENTER, and the mean will appear at the top of the output screen: x̄ = 76.86.
3.5 Maris conducted a study on the effect that playing video games has on memory recall.
As part of her study, she compiled the following data:
What is the best estimate for the mean number of hours spent playing video games?
This data set can be represented by the following histogram. Each interval has width one, and
each value is in the middle of an interval.
On the other hand, the histogram for the data: 4; 5; 6; 6; 6; 7; 7; 7; 7; 8 is not
symmetrical. The right-hand side seems "chopped off" compared to the left side. A
distribution of this type is called skewed to the left because it is pulled out to the left.
The mean is 6.3, the median is 6.5, and the mode is seven. Notice that the mean is less
than the median, and they are both less than the mode. The mean and the median
both reflect the skewing, but the mean reflects it more so.
The mean is 7.7, the median is 7.5, and the mode is seven. Of the three statistics, the
mean is the largest, while the mode is the smallest. Again, the mean reflects the
skewing the most.
To summarize:
• If the distribution of data is skewed to the left, the mean is less than the median.
• If the data is symmetrical, then the median and mean are equal.
• If the data is skewed to the right, then the mean is greater than the median.
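The summary above can be tried out on data with a short Python sketch. The first data set is the left-skewed one quoted in the text; the second is its mirror image, constructed here for illustration. The function names are hypothetical, and statistics.multimode requires Python 3.8 or later.

```python
import statistics

def describe_center(values):
    """Return (mean, median, modes) for a list of numbers."""
    return (statistics.mean(values), statistics.median(values), statistics.multimode(values))

def skew_hint(values):
    """Rough label from comparing the mean with the median (the summary rule above)."""
    mean, median, _ = describe_center(values)
    if mean < median:
        return "skewed left"
    if mean > median:
        return "skewed right"
    return "mean equals median"

left = [4, 5, 6, 6, 6, 7, 7, 7, 7, 8]     # data set quoted in the text
right = sorted(14 - x for x in left)       # its mirror image, built here for illustration

print(describe_center(left), skew_hint(left))
print(describe_center(right), skew_hint(right))
```

Comparing the mean with the median is only a hint. Equal values are consistent with a symmetric distribution but do not prove it, so a graph of the data should still be examined.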
3.6 Discuss the mean, median, and mode for the following two graphs (dot plot and
histogram).
The GPA’s of all students in a business college. Each dot represents up to 4 observations.
3.2 | Measures of Variation
An important characteristic of any set of data is the variation in the data. In some data sets,
the data values are concentrated closely near the mean; in other data sets, the data values
are more widely spread out from the mean. In this section we will discuss three measures of
variation: the range, the standard deviation, and the variance.
Standard deviation.
The most used measure of variation, or spread, is the standard deviation. The standard
deviation is a non-negative number that measures how far data values are from their mean.
Its importance will become much clearer when we begin studying methods of Inferential
Statistics. For now, we point out two key features of this important statistic.
The standard deviation provides a measure of the overall variation in a data set.
The standard deviation is small when the data are all concentrated close to the mean,
exhibiting little variation or spread. The standard deviation is larger when the data values
are more spread out from the mean, exhibiting more variation.
Suppose that we are studying the amount of time customers wait in line at the checkout at
supermarket A and supermarket B. The average wait time at both supermarkets is five
minutes. At supermarket A, the standard deviation for the wait time is two minutes; at
supermarket B the standard deviation for the wait time is four minutes. Because
supermarket B has a higher standard deviation, we know that there is more variation in the
wait times at supermarket B. Overall, wait times at supermarket B are more spread out from
the average; wait times at supermarket A are more concentrated near the average, and hence
more predictable.
The standard deviation can be used to determine whether a data value is close to
or far from the mean. Suppose that Rosa and Binh both shop at supermarket A. Rosa waits
at the checkout counter for seven minutes and Binh waits for one minute. At supermarket A,
the mean waiting time is five minutes and the standard deviation is two minutes.
Rosa’s wait time of seven minutes is two minutes longer than the average of five minutes;
that is, her wait time is one standard deviation above the average of five minutes.
Binh’s wait time is four minutes less than the average of five minutes; that is, his wait time
of one minute is two standard deviations below the average of five minutes. A data
value that is two standard deviations from the average is just on the borderline for what
many statisticians would consider to be “unusual”, or “far” from the average. Considering
data to be far from the mean if it is more than two standard deviations away is more of an
approximate "rule of thumb" than a rigid rule. In general, the shape of the distribution of the
data affects how much of the data is further away than two standard deviations (as we will
see in the next section).
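The comparison of Rosa's and Binh's wait times amounts to dividing each deviation from the mean by the standard deviation. A minimal Python sketch, with a function name chosen here for illustration:

```python
def deviations_in_sd_units(value, mean, sd):
    """How many standard deviations `value` lies above (+) or below (-) the mean."""
    return (value - mean) / sd

mean_wait, sd_wait = 5, 2            # supermarket A, in minutes

rosa = deviations_in_sd_units(7, mean_wait, sd_wait)
binh = deviations_in_sd_units(1, mean_wait, sd_wait)
print(rosa, binh)

# Rule of thumb from the text: more than two standard deviations away is "far".
for name, k in (("Rosa", rosa), ("Binh", binh)):
    print(name, "far from the mean" if abs(k) > 2 else "not beyond two standard deviations")
```

Binh sits exactly at two standard deviations, the borderline case mentioned above, so the strict "more than two" test does not flag him.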
As with measures of center, we can have a standard deviation measured from sample data (a
statistic) or from population data (a parameter). The lower-case letter s represents a sample
standard deviation and the Greek letter σ (sigma, lower case) represents a population
standard deviation. This is like our notation protocol for the mean, where we used the symbol
x̄ for the sample mean and the Greek letter μ for the population mean.
Calculating the Standard Deviation
If x is a data value, then the difference between x and the mean is called its deviation. In a
data set, there are as many deviations as there are values in the data set. The deviations are
used to calculate the standard deviation; in fact, we intuitively think of the standard
deviation as a sort of “average” of the deviations. If the data belong to a population, in
symbols a deviation is x – μ. For sample data, a deviation will be x – x̄.
The formula for calculating the standard deviation will also be slightly different for sample
data than for population data. If the sample is reasonably large and representative of the
population, then s should be a good estimate of σ.
Suppose that we wanted to measure the totality of all variation in a data set. Then we might
start by adding all of the deviations x – x̄. However, in any data set, there are values both
above and below the mean x̄. So, some deviations will be positive and some will be
negative, and there will be considerable cancellation. In fact, the sum of all the deviations
is always zero. To prevent deviations from cancelling each other out, we look at the
squares of the deviations: terms of the form
(x – x̄)².
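The cancellation of the deviations can be verified exactly. The sketch below reuses the small sample from Section 3.1 and works with Python's Fraction type, an implementation choice made here so that floating-point rounding does not hide the exact zero.

```python
from fractions import Fraction

sample = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4]      # sample used earlier in this chapter

# Exact arithmetic avoids floating-point rounding, so the cancellation is visible.
mean = Fraction(sum(sample), len(sample))
deviations = [x - mean for x in sample]

sum_of_deviations = sum(deviations)               # positives and negatives cancel
sum_of_squares = sum(d ** 2 for d in deviations)  # squares cannot cancel

print(sum_of_deviations, sum_of_squares)
```

The first number is zero for any data set, while the sum of squares is zero only when every value equals the mean.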
The variance is the average of the squares of the deviations. For reasons that will
become clear later, we compute these averages slightly differently for population data than
we do for sample data.
The population variance is:
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
The sample variance is:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
The symbol σ² represents the population variance, so the population standard deviation
σ is the square root of the population variance. Similarly, the symbol s² represents the
sample variance, so the sample standard deviation s is the square root of the sample
variance. Taking the square root of the variance can be thought of as reversing the effect of
squaring the deviations. Thus, we can think of the standard deviation as a sort of “average”
of the deviations. Note also that the standard deviation will always be measured in the same
units as the data values.
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \qquad s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
NOTE
We will concentrate on using and interpreting the information that the standard deviation
gives us. However, it is instructive to do one step-by-step example to help you understand
how the standard deviation measures variation from the mean. The calculator instructions
appear at the end of this example.
Example 3.6
In a fifth-grade class, the teacher was interested in the average age and the sample standard
deviation of the ages of her students. The following data are the ages for a sample of n = 20
fifth-grade students. The ages are rounded to the nearest half year:
9; 9.5; 9.5; 10; 10; 10; 10; 10.5; 10.5; 10.5; 10.5; 11; 11; 11; 11; 11; 11; 11.5; 11.5;
11.5
The average age is 10.53 years, rounded to two places. Find the sample standard deviation,
rounded to two decimal places.
Solution 3.6
The variance may be calculated by using a table. Then the standard deviation is obtained
by taking the square root of the variance. We will explain the parts of the table after
calculating s.
The sample variance, s², is equal to the sum of the last column (9.7375) divided by the total
number of data values minus one, which is 19.
$$s^2 = \frac{9.7375}{19} = 0.5125$$
The sample standard deviation s is equal to the square root of the sample variance:
$$s = \sqrt{0.5125} \approx 0.715891$$
Again, we almost never use the formulas or the procedure above to calculate a standard
deviation. Instead, we will use either software or a calculator to calculate these.
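As one example of the software route, the steps of Example 3.6 can be followed in Python and then checked against the standard library. This is a sketch; the variable names are chosen here for illustration.

```python
import math
import statistics

ages = [9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5,
        11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5]

n = len(ages)
x_bar = sum(ages) / n                                   # sample mean
squared_deviations = [(x - x_bar) ** 2 for x in ages]   # one squared deviation per value
sample_variance = sum(squared_deviations) / (n - 1)     # divide by n - 1 for a sample
sample_sd = math.sqrt(sample_variance)

print(round(x_bar, 3), round(sum(squared_deviations), 4))
print(round(sample_variance, 4), round(sample_sd, 4))

# Library check: statistics.stdev uses the n - 1 formula, statistics.pstdev uses N.
print(round(statistics.stdev(ages), 4), round(statistics.pstdev(ages), 4))
```

The population version divides by N instead of n - 1, so for the same data it is always slightly smaller than the sample version.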
Using the TI-83, 83+, 84, 84+ Calculator
To find the standard deviation:
Go to STAT >> EDIT. Clear list L1, and enter the data into the list.
Go to STAT >> CALC and select 1:1-VarStats. Press 2nd 1 for L1 and then ENTER.
If the data is from a sample, use Sx, as it represents the sample standard deviation, s.
If the data is from a population, use σx, as it represents the population standard deviation, σ.
3.7 On a baseball team, the ages of each of the players are as follows: 21; 21; 22; 23; 24;
24; 25; 25; 28; 29; 29; 31; 32; 33; 33; 34; 35; 36; 36; 36; 36; 38; 38; 38; 40
Use a calculator or computer to find the mean and standard deviation. Then find the value
that is two standard deviations above the mean.
Recall that for grouped data we do not know individual data values, so we cannot describe
the typical value of the data with precision. Just as we could not find the exact mean from
a frequency table, neither can we find the exact standard deviation. However, by again
replacing the class intervals by their midpoints, we can estimate the standard deviation
using the same procedure as we used for estimating the mean.
Example 3.7
Find the standard deviation for the data in the following table:
Class Frequency
0-2 1
3-5 6
6-8 10
9-11 7
12-14 0
15-17 2
Solution 3.7
Here we see both a population standard deviation, σx, and the sample standard deviation Sx
displayed.
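The midpoint method for the table in Example 3.7 can be sketched in Python as follows. The code treats every value in a class as if it sat at the class midpoint, so the results are estimates; the variable names are chosen here for illustration.

```python
import math

# Example 3.7: class limits and frequencies.
classes = [((0, 2), 1), ((3, 5), 6), ((6, 8), 10), ((9, 11), 7), ((12, 14), 0), ((15, 17), 2)]

midpoints = [(low + high) / 2 for (low, high), _ in classes]
freqs = [freq for _, freq in classes]
n = sum(freqs)

# Treat every value in a class as if it sat at the class midpoint.
mean_est = sum(m * f for m, f in zip(midpoints, freqs)) / n
sum_sq = sum(f * (m - mean_est) ** 2 for m, f in zip(midpoints, freqs))

sample_sd_est = math.sqrt(sum_sq / (n - 1))      # plays the role of Sx
population_sd_est = math.sqrt(sum_sq / n)        # plays the role of σx

print(n, round(mean_est, 2), round(sample_sd_est, 2), round(population_sd_est, 2))
```

On the calculator, the same estimates come from entering the midpoints in one list and the frequencies in another.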
The following rules provide a little more insight into what the standard deviation tells us
about the distribution of the data.
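Chebyshev's Rule: For any data set, no matter what the distribution of the data is, at least
1 – 1/k² of the data is within k standard deviations of the mean, for any k > 1. In particular:
• At least 75% of the data is within two standard deviations of the mean.
• At least 89% of the data is within three standard deviations of the mean.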
Chebyshev’s Theorem
Empirical Rule: For data having a distribution that is bell-shaped and symmetric:
• Approximately 68% of the data is within one standard deviation of the mean.
• Approximately 95% of the data is within two standard deviations of the mean.
• Approximately 99.7% of the data is within three standard deviations of the mean.
It is important to note that the Empirical Rule applies only when the shape of the distribution of
the data is bell-shaped and symmetric. We will learn more about this when studying the
"Normal" distribution in later chapters.
Empirical Rule
Finally, we have seen how the mean and standard deviation can be affected by extreme
high or extreme low values. An outlier is a data value that is far removed from the rest of
the data. Outliers need to be examined carefully - sometimes they are the result of
measurement error and should be removed from the data and not used in the analysis at
all. Other times, they hold valuable information about the population under study and
should be included in the data. Now that we have introduced the standard deviation, we
can present a commonly used criterion for classifying a data value as an outlier:
A data value that is more than three standard deviations from the mean is called
an outlier.
Example 3.8
Using Chebyshev’s theorem and the Empirical Rule, find the percentage of values that lie within 3
standard deviations of the mean, given that the heights of 20-year-old males have a
symmetrical distribution with a mean of 70.2 inches and a standard deviation of 1.5 inches.
Solution 3.8:
We are given μ = 70.2 and σ = 1.5. Three standard deviations is represented by 3σ = 3(1.5)
= 4.5. Subtract 4.5 from the mean and add 4.5 to the mean: 70.2 – 4.5 = 65.7 and 70.2 + 4.5 = 74.7.
Chebyshev’s theorem states that at least 89% of 20-year-old male heights lie between 65.7 inches and
74.7 inches, while the Empirical Rule states that approximately 99.7% of 20-year-old male heights lie between
65.7 inches and 74.7 inches. Chebyshev’s theorem is more conservative since it applies to
any type of distribution. The Empirical Rule applies specifically to bell-shaped, symmetric distributions.
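The two rules in Example 3.8 can be compared numerically with a short Python sketch. It assumes the usual statement of Chebyshev's theorem, that at least 1 - 1/k² of any data set lies within k standard deviations of the mean for k > 1, which agrees with the 89% quoted for k = 3. The function and constant names are chosen here for illustration.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of any data set within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k ** 2

# Approximate Empirical Rule percentages for bell-shaped, symmetric data.
EMPIRICAL_RULE = {1: 0.68, 2: 0.95, 3: 0.997}

mu, sigma, k = 70.2, 1.5, 3          # Example 3.8: heights in inches
low, high = mu - k * sigma, mu + k * sigma

print(round(low, 1), round(high, 1))
print(round(chebyshev_lower_bound(k), 3), EMPIRICAL_RULE[k])
```

Chebyshev's figure is a guaranteed minimum for any distribution, whereas the Empirical Rule figure is an approximation that holds only for bell-shaped data.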
3.3 | Measures of Relative Position
Measures of relative position are used to describe data values relative to other data values.
Here we will discuss commonly used measures of relative position – in particular, quartiles,
percentiles, and z-scores.
Quartiles
Quartiles are special cases of percentiles. The first quartile, Q1, is the same as the 25th
percentile, and the third quartile, Q3, is the same as the 75th percentile. The median M is
both the second quartile and the 50th percentile. Thus 25% of the data values in any data
set are less than Q1. Similarly, 75% of the data values in the data set are less than Q3, and so
25% of data values are above Q3. Finally, 50% of all data values are between Q1 and Q3.
The quartiles separate the data into quarters. To find the quartiles, first find the median or
second quartile. The first quartile, Q1, is the middle value of the lower half of the data, and
the third quartile, Q3, is the middle value, or median, of the upper half of the data. To get the
idea, consider the following data set:
1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5
• There are 14 data values, so the median is the average of the 7th and 8th values in the
list. Thus, we have M = (6.8 + 7.2)/2 = 7.
• The first quartile will be the median of the lower half of the data: 1, 1, 2, 2, 4, 6, 6.8.
The middle value of these seven data values is Q1 = 2.
• The third quartile will be the median of the upper half of the data: 7.2, 8, 8.3, 9, 10,
10, 11.5. The middle value of these seven data values is Q3 = 9.
The interquartile range is the difference between the third quartile (Q3) and the first
quartile (Q1):
IQR = Q3 – Q1
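The median-of-each-half method can be sketched in Python as shown below. The data set is the one implied by the lower and upper halves listed above. The function name is hypothetical, and leaving the overall median out of both halves when n is odd is an assumption, chosen because it matches the way Example 3.9 is worked. Other textbooks and software packages may use slightly different quartile conventions.

```python
import statistics

def quartiles_by_halves(values):
    """Q1, median, Q3 using the median-of-each-half method described above.

    When n is odd, the overall median is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower, upper = ordered[:half], ordered[n - half:]
    return statistics.median(lower), statistics.median(ordered), statistics.median(upper)

# Data set taken from the lower and upper halves listed in the text.
data = [1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5]

q1, m, q3 = quartiles_by_halves(data)
iqr = q3 - q1
print(q1, m, q3, iqr)
```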
The IQR provides a numerical measure of the overall amount of variation in a data set; it is
often used in conjunction with the median. When the median is used to describe the center,
the IQR is often used to describe the spread.
The IQR can also be used to determine potential outliers. Recall that an outlier is a data
value that is significantly separated from the other data. Using the IQR, we have the
following criteria for classifying a data value as an outlier:
A data value is a potential outlier if it is less than Q1 – 1.5(IQR) or greater than Q3 + 1.5(IQR).
Example 3.9
The following data represent selling prices for homes in a certain city. Calculate the IQR
and determine if any prices are potential outliers. Prices are in dollars.
There are 13 data values, so the median is the 7th value: M = 488,800
Q1 will be the median of the lower six values: Q1 = (230,500 + 387,000)/2 = 308,750.
Q3 will be the median of the upper six values: Q3 = (639,000 + 659,000)/2 = 649,000.
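IQR = Q3 – Q1 = 649,000 – 308,750 = 340,250, so (1.5)(IQR) = 510,375.
Q1 – (1.5)(IQR) = 308,750 – 510,375 = –201,625
Q3 + (1.5)(IQR) = 649,000 + 510,375 = 1,159,375
No house price is less than –201,625. However, 5,500,000 is more than 1,159,375.
Therefore, the data value 5,500,000 is a potential outlier.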
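The outlier check in this example can be sketched in Python. The quartiles are taken from the solution above; the function names `outlier_fences` and `is_potential_outlier` are hypothetical, and the cutoffs are assumed to be the usual 1.5 × IQR distances below Q1 and above Q3, which reproduce the values –201,625 and 1,159,375 used in the solution.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_potential_outlier(value, q1, q3):
    """True if the value lies outside the two cutoffs."""
    lower, upper = outlier_fences(q1, q3)
    return value < lower or value > upper

q1, q3 = 308_750, 649_000  # quartiles from Example 3.9
print(outlier_fences(q1, q3))                   # (-201625.0, 1159375.0)
print(is_potential_outlier(5_500_000, q1, q3))  # True
print(is_potential_outlier(114_950, q1, q3))    # False
```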
3.8 Two classes take a test. The test scores for Class A are:
69; 96; 81; 79; 65; 76; 83; 99; 89; 67; 90; 77; 85; 98; 66; 91; 77; 69; 80; 94
Find the interquartile range for the following two data sets and compare them. Which class
had more variable test scores?
Five Number Summary and Boxplots
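The quartiles are part of the five-number summary. These five numbers consist of the minimum
value, the first quartile Q1, the median M, the third quartile Q3, and the maximum value.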
For example, for the real estate data in Example 3.9, the five-number summary would be:
Min = 114,950
Q1 = 308,750
M = 488,800
Q3 = 649,000
Max = 5,500,000
For a small example like this, the quartiles can be easily calculated by hand, but for a large
data set this can be tedious and time-consuming. Fortunately, most statistical software
packages and graphing calculators will calculate the five-number summary for us. When
entering the data into the calculator or software package, the data do not have to be in order.
• Go to STAT >> EDIT. Clear list L1 and enter the data into the list.
• Go to STAT >> CALC and select 1:1-VarStats. Press 2nd 1 for L1 and then ENTER.
• Use the down arrow key to scroll down; the 5-number summary appears in the 2nd
output screen.
A box plot is a graphical representation of the five-number summary. These graphs (also
called box and whisker plots or box-whisker plots) provide an image of the concentration
of the data and show how far the extreme values are from most of the data. So, the box plot
gives a good, quick picture of the data.
A box plot is constructed from the five values: the minimum value, the first quartile, the
median, the third quartile, and the maximum value. To construct a box plot, use a
horizontal number line and mark the five numbers along the axis. We make small vertical
marks above the minimum and maximum values, and larger vertical marks above the
quartiles Q1, M, and Q3. Finally, we “box off” the marks above the quartiles.
So, the first quartile marks one end of the box and the third quartile marks the other end of
the box. And the middle 50 percent of the data fall inside the box. The "whiskers"
extend from the ends of the box to the smallest and largest data values.
We calculated the quartiles already, and know that the five number summary is:
NOTES
1. It is important to start a box plot with a scaled number line. Otherwise, the box plot
may distort the spread of the data.
2. You may encounter box-and-whisker plots that have dots marking outlier values. In
those cases, the whiskers do not extend to the minimum and maximum values.
Example 3.10
The following data are the heights (in inches) of 40 students in a statistics class:
59; 60; 61; 62; 62; 63; 63; 64; 64; 64;
65; 65; 65; 65; 65; 65; 65; 65; 65; 66;
66; 67; 67; 68; 68; 69; 70; 70; 70; 70;
70; 71; 71; 72; 72; 73; 74; 74; 75; 77
a) Find the five-number summary for the data and construct a boxplot.
b) Find the IQR.
c) Give an interval that contains the middle 50% of the data.
Solution 3.10:
a) Go to STAT >> EDIT, and enter the data into L1. Then go to STAT >> CALC and select
1-VarStats. Press 2nd and 1 to select L1, and hit ENTER. Scroll down to the second output
screen to get:
Min = 59
Q1 = 64.5
M = 66
Q3 = 70
Max = 77
b) IQR = Q3 – Q1 = 70 – 64.5 = 5.5 inches
c) About 50% of the data will be between Q1 and Q3. That is, about 50% of the students in
the class had heights that were between 64.5 inches and 70 inches.
Finally, note that the calculator can also be used to construct the box plot.
• Go to the STAT PLOT menu by pressing 2nd and then the Y = button.
• Turn on Plot 1 (Plot 2, Plot 3 should be turned off).
• Use the down arrow key to move to Type, and select the boxplot option (middle of
second row).
• For Xlist, type 2nd and then 1 to select L1.
• Press Zoom and then press 9 to select the ZoomStat graphing option; this will
automatically configure the window so that all of the data are included.
Finally, if we press TRACE, and use the arrow keys to move from left to right, we can see
the quartiles.
3.9 The following data are the number of pages in 40 books on a shelf. Construct a box plot
using a graphing calculator and state the interquartile range.
136; 140; 178; 190; 205; 215; 217; 218; 232; 234;
240; 255; 270; 275; 290; 301; 303; 315; 317; 318;
326; 333; 343; 349; 360; 369; 377; 388; 391; 392;
398; 400; 402; 405; 408; 422; 429; 450; 475; 512
It is possible that in a data set, two or more numbers in the five-number summary are
equal. For instance, you might have a data set in which the median and the third quartile
are the same. In this case, the diagram would not have a dotted line inside the box
displaying the median. The right side of the box would display both the third quartile and
the median.
As an example, suppose we have a data set in which the smallest value and the first quartile
are both 1, the median and the third quartile are both 5, and the maximum value is 7;
then the box plot would look like:
In this case, at least 25% of the values are equal to 1. Twenty-five percent of the values are
between 1 and 5, inclusive. At least 25% of the values are equal to 5, and the top 25% of the
values fall between 5 and 7, inclusive.
Example 3.11
Test scores for a college statistics class held during the day are:
99; 56; 78; 55.5; 32; 90; 80; 81; 56; 59; 45; 77; 84.5; 84; 70; 72; 68; 32; 79; 90
Test scores for a college statistics class held during the evening are:
98; 78; 68; 83; 81; 89; 88; 76; 65; 45; 98; 90; 80; 84.5; 85; 79; 78; 98; 90; 79; 81; 25.5
a. Find the five-number summary for the day class.
b. Find the five-number summary for the night class.
c. Create a box plot for each set of data, using the same number line for both box plots.
d. Which box plot has the widest spread for the middle 50% of the data?
What does this mean for that set of data in comparison to the other set of data?
Solution 3.11
d. The first data set has the wider spread for the middle 50% of the data; so the IQR for
the first data set is greater than the IQR for the second set. This means that there is more
variability in the middle 50% of the first data set.
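Parts a and b, and the comparison in part d, can be checked with a short Python sketch. It assumes the median-of-halves rule for quartiles described earlier in this section; the function name `five_number_summary` and the variable names are illustrative only.

```python
from statistics import median

def five_number_summary(values):
    """Return (Min, Q1, M, Q3, Max) using the median-of-halves rule."""
    data = sorted(values)
    half = len(data) // 2  # size of each half; middle value is skipped if n is odd
    return (data[0], median(data[:half]), median(data),
            median(data[-half:]), data[-1])

day = [99, 56, 78, 55.5, 32, 90, 80, 81, 56, 59,
       45, 77, 84.5, 84, 70, 72, 68, 32, 79, 90]
night = [98, 78, 68, 83, 81, 89, 88, 76, 65, 45, 98,
         90, 80, 84.5, 85, 79, 78, 98, 90, 79, 81, 25.5]

for name, scores in (("day", day), ("night", night)):
    low, q1, m, q3, high = five_number_summary(scores)
    print(name, (low, q1, m, q3, high), "IQR =", q3 - q1)
```

With these data the day class has IQR = 26.5 and the night class has IQR = 11, which matches the conclusion in part d that the first data set has the wider middle 50%.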
A percentile indicates the relative standing of a data value when data are sorted into
numerical order from smallest to largest. For example, 15% of data values are less than or
equal to the 15th percentile.
A percentile may or may not correspond to a value judgment about whether it is "good" or
"bad". The interpretation will depend on the context of the situation to which the data
applies. In some situations, a low percentile would be considered "good;" in other contexts a
high percentile might be considered "good". In many situations, there is no value judgment
that applies. Understanding how to interpret percentiles properly is important not only
when describing data, but also when calculating probabilities in later chapters of this text.
GUIDELINES
When interpreting a percentile in the context of the given data, the sentence should contain
the following information:
Example 3.12
On a timed math test, the first quartile for time it took to finish the exam was 35 minutes.
Interpret the first quartile in the context of this situation.
Solution 3.12:
• Twenty-five percent of students finished the exam in 35 minutes or less.
• Seventy-five percent of students finished the exam in 35 minutes or more.
• A low percentile could be considered good, as finishing more quickly on a timed exam is
desirable. (If you take too long, you might not be able to finish.)
3.10 For the 100-meter dash, the third quartile for times for finishing the race was 11.5
seconds. Interpret the third quartile in the context of the situation.
Example 3.13
On a 20 question math test, the 70th percentile for number of correct answers was 16.
Interpret the 70th percentile in the context of this situation.
Solution to 3.13:
• Seventy percent of students answered 16 or fewer questions correctly.
• Thirty percent of students answered 16 or more questions correctly.
• A higher percentile could be considered good, as answering more questions correctly is
desirable.
3.11 On a 60 point written assignment, the 80th percentile for the number of points earned
was 49. Interpret the 80th percentile in the context of this situation.
3.12 During a college basketball season, the 40th percentile for points scored per player in a
game is eight. Interpret the 40th percentile in the context of this situation.
Percentiles
We say that a data value x is at the P-th percentile if P% of all data values are less than x.
For example, if a student taking a standardized test scores at the 85th percentile, this means
that 85% of all students taking the test scored lower than the student, and so 15% of students
taking the test scored better.
Percentiles divide ordered data into hundredths and are useful for comparing values. For
example, universities and colleges use percentiles extensively. One instance in which colleges
and universities use percentiles is when SAT results are used to determine a minimum
testing score that will be used as a criterion for admission. For example, suppose Duke
accepts SAT scores at or above the 75th percentile. That translates into a score of at least
1220.
Percentiles are most often used with very large populations. Therefore, if you were to say
that 90% of the test scores are less (and not the same or less) than your score, it would be
acceptable because removing one particular data value is not significant. We will learn more
about how to calculate percentiles for large data sets in a later chapter. But sometimes we
also need to calculate a percentile for small data sets. The best way to do this is using a
cumulative relative frequency table.
Example 3.14
Fifty statistics students were asked how much sleep they get per school night (rounded to
the nearest hour). The results are as follows:
Hours of Sleep   Frequency   Relative Frequency   Cumulative Relative Frequency
4                2           0.04                 0.04
5                5           0.10                 0.14
6                7           0.14                 0.28
7                12          0.24                 0.52
8                14          0.28                 0.80
9                7           0.14                 0.94
10               3           0.06                 1.00
Solution 3.14:
a. Find the 28th percentile. Notice the 0.28 in the "cumulative relative frequency" column.
Twenty-eight percent of the 50 is a total of 14 data values. There are 14 values less than
the 28th percentile. They include the two 4’s, the five 5’s, and the seven 6’s. The 28th
percentile is between the last six and the first seven. The 28th percentile is 6.5.
b. Find the median. Look again at the "cumulative relative frequency" column and find
0.52. The median is the 50th percentile. Since 50% of 50 is 25, there are 25 data values
less than the median. They include the two 4’s, the five 5’s, the seven 6’s, and eleven of
the 7’s. The median is between the 25th and 26th data values, both of which are 7. The
median is M = 7.
c. Find the third quartile. The third quartile is the same as the 75th percentile. You can
"eyeball" this answer. If you look at the "cumulative relative frequency" column, you find
0.52 and 0.80. When you have all the 4’s, 5’s, 6’s and 7’s, you have 52% of the data. When
you include all the 8’s, you will have 80% of the data. The 75th percentile must be 8.
Another way to look at the problem is to find 75% of 50, which is 37.5, and round up to
38. The third quartile, Q3, is the 38th value, which is again an 8.
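The counting argument used in parts a through c can be sketched in Python. This is one reading of the method in the solution, not a general-purpose percentile routine: it assumes that when k% of n is a whole number the percentile is the average of that ordered value and the next one, and that otherwise the position is rounded up. The function name `percentile_from_table` is hypothetical.

```python
import math

def percentile_from_table(freq, k):
    """kth percentile (0 < k < 100) from a {value: frequency} table."""
    if not 0 < k < 100:
        raise ValueError("k must be strictly between 0 and 100")
    # Expand the table into the ordered list of individual data values.
    data = [value for value in sorted(freq) for _ in range(freq[value])]
    n = len(data)
    pos = k * n / 100  # number of data values that lie below the percentile
    if pos.is_integer():
        i = int(pos)
        # The percentile sits between the i-th and (i+1)-th ordered values.
        return (data[i - 1] + data[i]) / 2
    # Otherwise round the position up and take that ordered value.
    return data[math.ceil(pos) - 1]

sleep = {4: 2, 5: 5, 6: 7, 7: 12, 8: 14, 9: 7, 10: 3}
print(percentile_from_table(sleep, 28))  # 6.5
print(percentile_from_table(sleep, 50))  # 7.0
print(percentile_from_table(sleep, 75))  # 8
```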
A Formula for Finding the kth Percentile
In addition to using a frequency distribution, there are several formulas for calculating the
kth percentile. Here is one of them.
Suppose that we want the kth percentile. It may or may not be part of the data. Order the
data, and let i be the index of the kth percentile; i.e. this is the position that the percentile
appears in the list of data. Let n be the total number of data values. Then we calculate
i = (k/100)(n + 1)
If i is a positive integer, then the kth percentile is the data value in the ith position in the
ordered set of data. If i is not a positive integer, then round i up and round i down to the
nearest integers. Average the two data values in these two positions in the ordered data set.
This is easier to understand in an example.
Example 3.15
Listed below are the ages for 29 Academy Award winning best actors:
18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47;
52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77
a. Find the 70th percentile.
b. Find the 83rd percentile.
Solution 3.15
a. k = 70, i = the index, n = 29. Calculate i = (k/100)(n + 1) = (70/100)(29 + 1) = 21.
This is a whole number, so we want the data value in the 21st position in the ordered data.
The 70th percentile is 64.
b. k = 83, i = the index, n = 29. Calculate i = (k/100)(n + 1) = (83/100)(29 + 1) = 24.9.
This is not a whole number, so we round down to get 24, and round up to get 25.
The age in the 24th position is 71, and age in the 25th position is 72. Average these two
numbers to obtain the 83rd percentile, 71.5.
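The index formula i = (k/100)(n + 1) and the averaging rule can be sketched in Python as follows. The function name `kth_percentile` is an illustrative choice, and raising an error when the index falls outside positions 1 through n is an assumption of this sketch, since the text does not cover that case.

```python
import math

def kth_percentile(values, k):
    """kth percentile using the index i = (k/100)(n + 1)."""
    data = sorted(values)
    n = len(data)
    i = k * (n + 1) / 100
    if i < 1 or i > n:
        raise ValueError("index falls outside the data for this k and n")
    if i.is_integer():
        return data[int(i) - 1]  # value in the i-th position (1-based)
    low, high = math.floor(i), math.ceil(i)
    # Average the values in the two neighboring positions.
    return (data[low - 1] + data[high - 1]) / 2

ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]
print(kth_percentile(ages, 70))  # 64
print(kth_percentile(ages, 83))  # 71.5
```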
3.13 Listed below are the ages for 29 Academy Award winning best actors:
18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47;
52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77
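To find the percentile of a given data value, order the data from smallest to largest. Let x be
the number of data values counting from the bottom of the data list up to but not including
that data value, let y be the number of data values equal to that data value, and let n be the
total number of data values.
Calculate ((x + 0.5y)/n)(100) and round to the nearest integer.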
Example 3.16
Listed below are the ages for 29 Academy Award winning best actors:
18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47;
52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77
a. Find the percentile for 58.
b. Find the percentile for 25.
Solution 3.16
a. Counting from the bottom of the list, there are 18 data values less than 58; and there is
one data value equal to 58. So x = 18 and y = 1.
((x + 0.5y)/n)(100) = ((18 + 0.5(1))/29)(100) = 63.8
Thus an age of 58 would be at the 64th percentile.
b. Counting from the bottom of the list, there are 3 data values less than 25; and there is
one data value equal to 25. So x = 3 and y = 1.
((x + 0.5y)/n)(100) = ((3 + 0.5(1))/29)(100) = 12.07
Thus an age of 25 would be at the 12th percentile.
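The percentile of a given data value, as computed in parts a and b, can be sketched in Python. Here x counts the values below the given value and y counts the values equal to it; the function name `percentile_of_value` is hypothetical.

```python
def percentile_of_value(values, value):
    """Percentile of a data value: ((x + 0.5y)/n)(100), rounded."""
    n = len(values)
    x = sum(1 for v in values if v < value)   # values below the given value
    y = sum(1 for v in values if v == value)  # values equal to the given value
    return round((x + 0.5 * y) / n * 100)

ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]
print(percentile_of_value(ages, 58))  # 64
print(percentile_of_value(ages, 25))  # 12
```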
z-Scores
The standard deviation is useful when comparing data values that come from different data
sets. If the data sets have different means and standard deviations, then comparing the
data values directly can be misleading. But we can tell which data value is farther removed
from the center by finding how many standard deviations away from its mean the value is.
And we have a statistic called a "z-score" for this purpose. Let x be a data value, µ be the
mean, and σ the standard deviation. Then the z-score corresponding to x is:
z = (x − µ)/σ
Solving for x in this equation, we get x = μ + zσ. So, this really does measure how many
standard deviations x is from the mean. Moreover, if z > 0, a positive value, then x is
greater than the mean; and if z < 0, a negative value, then x is less than the mean. When
z = 0, the value x is exactly at the mean.
Example 3.17
Two students, John and Ali, from different high schools, wanted to find out who had the
highest GPA when compared to his school.
Student   GPA    School Mean GPA   School Standard Deviation
John      2.85   3.0               0.7
Ali       77     80                10
Which student had the highest GPA when compared to his school?
Solution to 3.17:
For each student, we use the z-score to determine how many standard deviations his GPA
is away from the average for his school:
For John, z = (x − µ)/σ = (2.85 − 3.0)/0.7 = −0.21.
For Ali, z = (x − µ)/σ = (77 − 80)/10 = −0.3.
John’s GPA is 0.21 standard deviations below his school’s mean, while Ali’s GPA is 0.3
standard deviations below his school’s mean. So, John’s z-score is higher than Ali’s. For
GPA, higher values are better; so, John has the better GPA relative to his school.
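The comparison in this example can be reproduced with a few lines of Python. The function name `z_score` is an illustrative choice; the numbers come from the table in the example.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

john = z_score(2.85, 3.0, 0.7)
ali = z_score(77, 80, 10)
print(round(john, 2), round(ali, 2))  # -0.21 -0.3
print("John" if john > ali else "Ali", "has the higher z-score")
```

Both z-scores are negative, so both students are below their school means; the larger (less negative) z-score belongs to John.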
3.14 Two swimmers, Angie and Beth, from different teams, wanted to find out who had the
fastest time for the 50 meter freestyle when compared to her team. The table below shows
their times.
Which swimmer had the fastest time when compared to her team?
Z-scores will be revisited in Chapter 7: The Normal Distribution. The normal distribution
is an important continuous probability distribution. Z-scores will be on the x-axis and will
also be used in hypothesis testing.
The figure below shows how z-scores relate to position relative to the mean and to
percentiles. Recall that the z-score at the mean is zero. The figure shows the normal
distribution for the SAT. The z-score for Student A was 1, meaning Student A was 1 standard
deviation above the mean. Thus, Student A scored at the 84.13th percentile on the SAT.
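Assuming the scores follow a normal distribution, as in the figure described above, the percentile that corresponds to a z-score can be sketched with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The helper name `z_to_percentile` is hypothetical, and the result only applies when the normal model is reasonable for the data.

```python
from statistics import NormalDist

def z_to_percentile(z):
    """Percent of a standard normal distribution that lies below z."""
    return NormalDist().cdf(z) * 100

print(round(z_to_percentile(1), 2))  # 84.13
print(round(z_to_percentile(0), 2))  # 50.0
```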
Example 3.18
Three statistics students, Mia, Fiona and Raina, are in different classes and all scored an
82 on their last statistics test. In Mia’s class the mean was 68 with a standard deviation of
6. In Fiona’s class the mean was 73 with a standard deviation of 5. In Raina’s class the
mean was 80 with a standard deviation of 4. Who performed the best within their own
class?
Solution 3.18
Even though they have the same score we do not know how they compare to their
classmates so begin by calculating the z-scores for each student.
For Mia: z = (x − µ)/σ = (82 − 68)/6 = 2.33
For Fiona: z = (x − µ)/σ = (82 − 73)/5 = 1.8
For Raina: z = (x − µ)/σ = (82 − 80)/4 = 0.5
Comparing the graphs of the z-scores, it is easier to see that Mia scored the best within her
class. The z-score for her score is at the highest percentile with the most area shaded to the
left.
KEY TERMS
Box plot a graph that gives a quick picture of the middle 50% of the data
First Quartile the value that is the median of the lower half of the ordered data set
Interquartile Range or IQR, is the range of the middle 50 percent of the data values; the
IQR is found by subtracting the first quartile from the third quartile.
Mean a number that measures the central tendency of the data; a common name for mean
is 'average.'
The term 'mean' is a shortened form of ‘arithmetic mean.' By definition, the mean for a
sample is
x̄ = sample mean = (sum of data)/(number of data values) = Σx/n
When the data is from a population, we denote the mean by the Greek letter μ.
Median the number that separates ordered data into halves; half the values are the same
number or smaller than the median and half the values are the same number or larger than
the median. The median may or may not be part of the data.
Outlier an observation that falls far outside the rest of the data.
Percentile a number that divides ordered data into hundredths; percentiles may or may not
be part of the data. The median of the data is the second quartile and the 50th percentile.
The first and third quartiles are the 25th and the 75th percentiles, respectively.
Quartiles the numbers that separate the data into quarters; quartiles may or may not be
part of the data. The second quartile is the median of the data.
Variance: The mean of the squared deviations from the mean, or the square of the standard
deviation; for a set of data, a deviation can be represented as x – x̄ where x is a data value
and x̄ is the sample mean. The sample variance is equal to the sum of the squares of the
deviations divided by the difference of the sample size and one.
FORMULA REVIEW
Mean   mean = (sum of data)/(number of data values) = Σx/n
Population standard deviation   σ = √( Σ(x − µ)² / N )
Sample standard deviation   s = √( Σ(x − x̄)² / (n − 1) )
Order the data, and let i be the index of the kth percentile; i.e. this is the position that the
percentile appears in the list of data. Let n be the total number of data values.
Then we calculate i = (k/100)(n + 1).
If i is a positive integer, then the kth percentile is the data value in the ith position in the
ordered set of data.
If i is not a positive integer, then round i up and round i down to the nearest integers.
Average the two data values in these two positions in the ordered data set.
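Let x = the number of data values counting from the bottom of the data list up to but not
including the data value for which you want to find the percentile.
Let y = the number of data values equal to the data value for which you want to find the
percentile.
Let n = the total number of data values.
Then the percentile of that data value is ((x + 0.5y)/n)(100), rounded to the nearest integer.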
Exercises for Chapter 3
For problems #1 - 4, find the mean, median, standard deviation and five number summaries
for the given data set.
1. The following data shows the mileage (in mpg) for a random sample of 15 new
cars.
2. The following data shows the ages (in years) of a random sample of 22 Boeing 747
airplanes.
17 5 5 12 8 5 8 16 14 12 22
15 5 8 17 5 4 2 22 17 20 23
3. The following data shows the speeds (in mph) for a random sample of 20 cars driving on
I-294 at 1 pm on a single day.
68 65 50 79 77 60 55 61 78 75
75 67 72 58 70 62 67 72 70 74
4. The following data shows the number of cardiograms done each day for a random sample
of 20 days at an outpatient testing center.
25 31 20 32 13 14 43 2 57 23
36 32 33 32 44 32 52 44 51 45
5. Find the mean and standard deviation for the following frequency tables:
a. b.
c.
6. Sixty-five randomly selected car salespersons were asked the number of cars they
generally sell in one week. Fourteen people answered that they generally sell three cars;
nineteen generally sell four cars; twelve generally sell five cars; nine generally sell six cars;
eleven generally sell seven cars. Calculate the following:
7. When the data are skewed left, what is the typical relationship between the mean and
median?
8. When the data are symmetrical, what is the typical relationship between the mean and
median?
(i) (ii)
10. Which is the greatest, the mean, the mode, or the median of the data set?
11; 11; 12; 12; 12; 12; 13; 15; 17; 22; 22; 22
11. Which is the least, the mean, the mode, or the median of the data set?
56; 56; 56; 58; 59; 60; 62; 64; 64; 65; 67
12. Of the three measures, which tends to reflect skewing the most, the mean, the mode, or
the median? Why?
13. In a perfectly symmetrical distribution, when would the mode be different from the
mean and median?
14. The following data show the distances in miles between 20 retail stores and a large
distribution center:
29; 37; 38; 40; 58; 67; 68; 69; 76; 86; 87; 95; 96; 96; 99; 106; 112; 127; 145; 150
a) Use a calculator to find the standard deviation and round to the nearest tenth.
b) Find the value that is one standard deviation below the mean.
For the next four exercises, calculate the mean, median and standard deviation:
15. The miles per gallon rating for 30 cars are shown:
19, 19, 19, 20, 21, 21, 25, 25, 25, 26, 26, 28, 29, 31, 31,
32, 32, 33, 34, 35, 36, 37, 37, 38, 38, 38, 38, 41, 43, 43
25, 27, 33, 34, 34, 34, 35, 37, 37, 38, 39, 39, 39,
40, 41, 45, 46, 47, 49, 50, 50, 53, 53, 54, 54
17. The following data are the prices of different laptops at an electronics store.
249, 249, 260, 265, 265, 280, 299, 299, 309, 319, 325, 326, 350,
350, 350, 365, 369, 389, 409, 459, 489, 559, 569, 570, 610
18. The following data are the daily high temperatures in a town for one month.
61, 61, 62, 64, 66, 67, 67, 67, 68, 69, 70, 70, 70, 71, 71,
72, 74, 74, 74, 75, 75, 75, 76, 76, 77, 78, 78, 79, 79, 95
19. Fredo and Karl, two baseball players on different teams, wanted to find out who had
the higher batting average when compared to his team.
20. The ages for 29 Academy Award winning best actors are shown below, in order from
smallest to largest.
18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47;
52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77
21. Jesse was ranked 37th in his graduating class of 180 students. At what percentile is
Jesse’s ranking?
22. On an exam, would it be more desirable to earn a grade with a high or low percentile?
Explain.
23. Mina is waiting in line at the Department of Motor Vehicles (DMV). Her wait time of 32
minutes is the 85th percentile of wait times. Is that good or bad? Write a sentence
interpreting the 85th percentile in the context of this situation.
24. In a survey collecting data about the salaries earned by recent college graduates, Li
found that her salary was in the 78th percentile. Should Li be pleased or upset by this
result? Explain.
25. In a study collecting data about the repair costs of damage to automobiles in a certain
type of crash tests, a certain model of car had $1,700 in damage and was in the 90th
percentile. Should the manufacturer and the consumer be pleased or upset by this result?
Explain and write a sentence that interprets the 90th percentile in the context of this
problem.
26. Suppose that you are buying a house. You and your realtor have determined that the
most expensive house you can afford is the 34th percentile. The 34th percentile of housing
prices is $240,000 in the town you want to move to. In this town, can you afford 34% of the
houses or 66% of the houses?
27. The following data show the lengths of boats moored in a marina. Use this data to
construct a boxplot.
16; 17; 19; 20; 20; 21; 23; 24; 25; 25; 25; 26; 26; 27;
27; 27; 28; 29; 30; 32; 33; 33; 34; 35; 37; 39; 40
28. Find the five number summary and construct a boxplot for the given data set.
508 488 430 49 278 542 451 366 451 531 196 115
118 362 434 399 148 30 287 236 185 210 178 105
Use the following information to answer the next two exercises: Suppose one hundred eleven
people who shopped in a special t-shirt store were asked the number of t-shirts they own
costing more than $19 each. The results are summarized in the histogram:
29. The percentage of people who own at most three t-shirts costing more than $19 each is
approximately:
30. If the data were collected by asking the first 111 people who entered the store, then the
type of sampling is:
31. Given the following box plot:
a. Which quarter has the smallest spread of data? What is that spread?
b. Which quarter has the largest spread of data? What is that spread?
c. Find the interquartile range (IQR).
d. Are there more data in the interval [5, 10] or in the interval [10, 13]? Explain
32. The following box plot shows ages of the U.S. population for 1990, the latest available
year.
a. Are there fewer or more children (age 17 & under) than senior citizens (age 65 & over)?
Explain.
b. Given that about 12.6% are age 65 and over, approximately what percentage of the
population are working age adults (above age 17 to age 65)?
33. In a survey of 20-year-olds in China, Germany, and the United States, people were
asked the number of foreign countries they had visited in their lifetime. The following box
plots display the results.
a. In complete sentences, describe what the shape of each box plot implies about the
distribution of the data collected.
b. Of the U.S. and Germany, which country has the greater percentage of 20-year-olds that
have visited more than eight foreign countries?
c. Compare the three box plots. What do they imply about the foreign travel of 20-year-old
residents of the three countries when compared to each other?
34. A survey was conducted of 130 purchasers of new BMW 3 series cars, 130 purchasers of
new BMW 5 series cars, and 130 purchasers of new BMW 7 series cars. In the survey,
people were asked the age they were when they purchased their car. The following box plots
display the results.
a. In complete sentences, describe what the shape of each box plot implies about the
distribution of the data collected for that car series.
b. Which group is most likely to have an outlier? Explain how you determined that.
c. Compare the three box plots. What do they imply about the age of purchasing a BMW
from the series when compared to each other?
d. For the BMW 5 series, which quarter has the smallest spread of data? What is the
spread?
e. For the BMW 5 series, which quarter has the largest spread of data? What is the spread?
g. For the BMW 5 series, are there more data in the interval 31 to 38 or in the interval 45 to
55? How do you know this?
h. For the BMW 5 series, which interval has the fewest data in it? How do you know this?
i. 31 to 35 ii. 38 to 41 iii. 41 to 64
What does it mean to have the first and second quartiles so close together, while the second
to third quartiles are far apart?
36. The most obese countries in the world have obesity rates that range from 11.5% to
83.4%. This data is summarized in the following table:
a. What is the best estimate of the average obesity percentage for these countries?
b. The United States has an average obesity rate of 33.9%. Is this rate above average or
below?
c. What is the standard deviation for the listed obesity rates?
d. How does the United States compare to other countries? Would it be considered
“unusual”?
37. The following table gives the percent of children under five considered to be
underweight.
a. What is the best estimate for the mean percentage of underweight children?
b. What is the standard deviation?
c. Which interval(s) could be considered unusual? Explain.
38. A music school has budgeted to purchase three musical instruments. They plan to
purchase a piano costing $3,000, a guitar costing $550, and a drum set costing $600. The
mean cost for a piano is $4,000 with a standard deviation of $2,500. The mean cost for a
guitar is $500 with a standard deviation of $200. The mean cost for drums is $700 with a
standard deviation of $100. Which cost is the lowest, when compared to other instruments
of the same type? Which cost is the highest when compared to other instruments of the
same type? Justify your answer.
39. Three students were applying to the same graduate school. They came from schools
with different grading systems. Which student had the best GPA when compared to other
students at his school? Explain how you determined your answer.
40. An elementary school class ran one mile with a mean of 11 minutes and a standard
deviation of three minutes. Rachel, a student in the class, ran one mile in eight minutes. A
junior high school class ran one mile with a mean of nine minutes and a standard deviation
of two minutes. Kenji, a student in the class, ran 1 mile in 8.5 minutes. A high school class
ran one mile with a mean of seven minutes and a standard deviation of four minutes.
Nedda, a student in the class, ran one mile in eight minutes.
a. Why is Kenji considered a better runner than Nedda, even though Nedda ran faster than
he?
b. Who is the fastest runner with respect to his or her class? Explain.
41. The median age of the U.S. population in 1980 was 30.0 years. In 1991, the median age
was 33.1 years.
42. We are interested in the number of years students in a particular elementary statistics
class have lived in California. The information in the following table is from the entire
section.
43. Javier and Ercilia are supervisors at a shopping mall. Each was given the task of
estimating the mean distance that shoppers live from the mall. They each randomly
surveyed 100 shoppers. The samples yielded the following information:
                 Javier      Ercilia
Sample mean x̄    6.0 miles   6.0 miles
Sample SD, s     4.0 miles   7.0 miles
d. If the two box plots depict the distribution of values for each supervisor, which one
depicts Ercilia’s sample? How do you know?
44. Santa Clara County, CA, has approximately 27,873 Japanese Americans. Their ages
are as follows:
a. Construct a histogram of the Japanese American community in Santa Clara County, CA.
The bars will not be the same width for this example. Why not? What impact does this
have on the reliability of the graph?
45. Forty randomly selected students were asked the number of pairs of sneakers they
owned. Let X = the number of pairs of sneakers owned. The results are as follows:
46. Refer to the graphs below to determine which of the following are true and which are false.
Explain your solution to each part in complete sentences.
47. Following are the published weights (in pounds) of all of the team members of the San
Francisco 49ers from a given year.
177; 205; 210; 210; 232; 205; 185; 185; 178; 210; 206; 212;
184; 174; 185; 242; 188; 212; 215; 247; 241; 223; 220; 260;
245; 259; 278; 270; 280; 295; 275; 285; 290; 272; 273; 280;
285; 286; 200; 215; 185; 230; 250; 241; 190; 260; 250; 302;
265; 290; 276; 228; 265
48. In a recent issue of the IEEE Spectrum, 84 engineering conferences were announced.
Four conferences lasted two days. Thirty-six lasted three days. Eighteen lasted four days.
Nineteen lasted five days. Four lasted six days. One lasted seven days. One lasted eight
days. One lasted nine days. Let X = the length (in days) of an engineering conference.
a. Organize the data in a relative frequency table.
b. Find the median, the first quartile, and the third quartile.
c. Find the 65th percentile.
d. The middle 50% of the conferences last from _______ days to _______ days.
e. Construct a box plot of the data.
f. Find the 10th percentile.
g. Calculate the sample mean of days of engineering conferences.
h. Calculate the sample standard deviation of days of engineering conferences.
i. Find the mode.
j. If you were planning an engineering conference, which would you choose as the length of
the conference: mean, median, or mode? Explain why you made that choice.
k. Give two reasons why you think that three to five days seem to be popular lengths of
engineering conferences.
49. A survey of enrollment at 35 community colleges across the United States yielded the
following data:
a. Make a frequency table for the data using five intervals of equal width.
b. Construct a histogram of the data.
c. If you were to build a new community college, which piece of information would be more
valuable: the mode or the mean?
d. Calculate the sample mean.
e. Calculate the sample standard deviation.
f. A school with an enrollment of 8000 would be how many standard deviations away from
the mean?
50. According to Medical News Today, the average height of U.S. men in 2010 was 69.3 inches.
Let’s say the standard deviation is 2.8 inches.
a. Using Chebyshev’s theorem, at least 75% of heights will be between what two heights?
b. Using the Empirical Rule, 68% of heights will be between what two heights?
c. If a randomly chosen male has a height of 78 inches, what is his z-score?
d. Using the Empirical Rule, what percentage of heights are above 74.9 inches?
51. According to the Digest of Education Statistics, the average ACT score in 2010 was 21
with a standard deviation of 5.2.
a. Using Chebyshev’s theorem, at least 89% of scores will be between what two scores?
b. Using the Empirical Rule, 95% of scores will be between what two scores?
c. If a randomly chosen student has a score of 18, what is this student’s z-score?
d. Using the Empirical Rule, what percentage of scores are below 26.2?
52. Let’s say that the mean return of mutual funds in the last quarter is 2.8% with a
standard deviation of 4.5%. Assume the returns are symmetrically distributed. Find the
return value(s) that would separate the
a. top 2.5%
b. middle 99.7%
c. bottom 84%
53. Let’s say that the mean return of mutual funds in the last quarter is 2.8% with a
standard deviation of 4.5%. Assume the returns are symmetrically distributed.
54. Let’s say that the mean return of mutual funds in the last quarter is 5.8% with a
standard deviation of 2.1%. Assume the returns are skewed right.
a. Find the percent of returns that are between 1.6% and 10%.
REFERENCES
3.1 Measures of the Center of the Data
Data from The World Bank, available online at http://www.worldbank.org (accessed April 3, 2013).
Villines, Z. “What is the average height for men?” Medical News Today. Available online at
https://www.medicalnewstoday.com/articles/318155.php (accessed May 20, 2018).
Cauchon, Dennis, and Paul Overberg. “Census data shows minorities now a majority of U.S. births.”
USA Today, 2012. Available online at
http://usatoday30.usatoday.com/news/nation/story/2012-05-17/minority-birthscensus/55029100/1
(accessed April 3, 2013).
Data from the United States Department of Commerce: United States Census Bureau. Available
online at http://www.census.gov/ (accessed April 3, 2013).
“1990 Census.” United States Department of Commerce: United States Census Bureau. Available
online at http://www.census.gov/main/www/cen1990.html (accessed April 3, 2013).
CHAPTER 3 SOLUTIONS:
3) x̄ = 33.1, med = 32, s = 14.2, min = 2, Q1 = 24, Q2 = 32, Q3 = 44, max = 57
9) mean
11) The mean reflects skewing the most since it can be “pulled” to the right or left when
extreme values are either to the right or left respectively.
19) a.) P40 = 37 years old, b.) P78 = 70 years old, c.) 40th percentile, d.) 84th percentile
23) Li’s salary is higher than 78% of recent graduates; she should be pleased.
25) 34% of houses cost $240,000 or less. 66% of houses cost $240,000 or more.
So you can afford 34% of houses and 66% of the houses are too expensive for your
budget.
27)
29) d
31) a.) 17 or younger is 25% while 65 and older is less than 25% b.) 62.4%
33) a. Each box plot is spread out more in the greater values. Each plot is skewed to the
right, so the ages of the top 50% of buyers are more variable than the ages of the lower 50%.
b. The BMW 3 series is most likely to have an outlier. It has the longest whisker.
c. Comparing the median ages, younger people tend to buy the BMW 3 series, while
older people tend to buy the BMW 7 series. However, this is not a rule, because there is so
much variability in each data set.
d. The second quarter has the smallest spread. There seems to be only a three-year
difference between the first quartile and the median.
e. The third quarter has the largest spread. There seems to be approximately a 14-year
difference between the median and the third quartile.
f. IQR ~ 17 years
g. There is not enough information to tell. Each interval lies within a quarter, so we
cannot tell exactly where the data in that quarter is concentrated.
h. The interval from 31 to 35 years has the fewest data values. Twenty-five percent of
the values fall in the interval 38 to 41, and 25% fall between 41 and 64. Since 25% of values
fall between 31 and 38, we know that fewer than 25% fall between 31 and 35.
35) a.) x̄ = 23.3 b.) above average c.) s = 12.9 d.) It is within 1 standard deviation
of the mean and is therefore not unusual
37) z_piano = -0.4; z_guitar = 0.25; z_drum = -1. The drums have the lowest cost. The guitar
has the highest cost.
43)
47)
k. answers vary
49) a. between 63.7 and 74.9 inches b. between 66.5 and 72.1 inches
c. Z = 3.1 d. 2.5%
4 | PROBABILITY TOPICS
Introduction
Chapter Objectives
It is often necessary to "guess" about the outcome of an event to make a decision. Politicians
study polls to guess their likelihood of winning an election. Teachers choose a particular
course of study based on what they think students can comprehend. Doctors choose the
treatments needed for various diseases based on their assessment of likely results. You may
have visited a casino where people play games chosen because of the belief that the likelihood
of winning is good. You may have chosen your course of study based on the probable
availability of jobs.
You have very likely used probability before, and most people have an intuitive sense of
the concept. Probability deals with the chance of an event occurring. Whenever you weigh
the odds of whether to do your homework or to study for an exam, you are using probability.
In this chapter, you will learn how to solve probability problems using a systematic approach.
4.1 | Terminology
Probability is a measure that is associated with how certain we are of outcomes of a particular
experiment or activity. An experiment is a planned operation carried out under controlled
conditions. If the result is not predetermined, then the experiment is said to be a chance
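experiment. Flipping a fair coin twice is an example of an experiment. The set of all possible
outcomes of an experiment is called the sample space, which we denote by S.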
An event is any collection of outcomes. Upper case letters, like A and B, represent events.
For example, if the experiment is to flip one fair coin, event A might be getting at most one
head. The probability of an event A is written as P(A).
There are two basic ways to calculate probabilities: theoretical probability and empirical
probability.
Theoretical Probability: Suppose that all outcomes in the sample space are equally likely.
To calculate the probability of an event A, count the number of outcomes for event A and
divide by the total number of outcomes in the sample space. That is,
\[
P(A) = \frac{\text{number of outcomes in } A}{\text{number of outcomes in } S}.
\]
For example, suppose we toss a coin three times in succession. As we saw above, there is a
total of 8 outcomes in the sample space, and three of these outcomes have exactly two
heads. Thus, P(two heads) = 3/8.
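The count in this example can be checked by listing the sample space explicitly. The following Python sketch is only an illustration of the counting formula; the variable names are our own and are not part of the text.

```python
from fractions import Fraction
from itertools import product

# Sample space for three tosses of a fair coin: 2 * 2 * 2 = 8 equally likely outcomes.
sample_space = ["".join(toss) for toss in product("HT", repeat=3)]

# Event A: exactly two heads.
event_a = [outcome for outcome in sample_space if outcome.count("H") == 2]

# Theoretical probability = (number of outcomes in A) / (number of outcomes in S).
p_two_heads = Fraction(len(event_a), len(sample_space))

print(sample_space)
print(event_a, p_two_heads)
```

Using Fraction keeps the answer as an exact ratio instead of a rounded decimal.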
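Empirical Probability: When we cannot assume that the outcomes are equally likely, we
estimate the probability of an event by its relative frequency in observed data.
For example, suppose that an insurance company wants to find the probability that a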
suburban male driver, whose age is between 20 and 25 years, will have an accident in the
coming year. Then they would use data from thousands of such drivers and calculate the
proportion of those drivers that had an accident in the past year.
The Law of Large Numbers states that as the number of repetitions of an experiment is
increased, the relative frequency obtained in the experiment will become closer and closer
to the theoretical probability. For example, suppose you roll one fair six-sided die, with the
numbers 1, 2, 3, 4, 5, 6 on its faces. Let E be the event that we roll a number that is at least
five. There are two outcomes (5, 6) in E, so P(E) = 1/3. If you were to roll the die only a few
times, you would not be surprised if your observed results did not match the probability.
However, if we were to roll the die an exceptionally large number of times, the ratio of
times E occurred would get closer and closer to 1/3.
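A computer simulation gives a feel for the Law of Large Numbers. The sketch below is a rough illustration rather than a proof: it uses Python's pseudorandom number generator as a stand-in for a fair die, and the function name and the fixed seed are our own choices.

```python
import random

def relative_frequency(num_rolls, seed=0):
    """Simulate num_rolls rolls of a fair die; return the fraction that are 5 or 6."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    hits = sum(1 for _ in range(num_rolls) if rng.randint(1, 6) >= 5)
    return hits / num_rolls

# Compare the relative frequency of event E with P(E) = 1/3 as the number of rolls grows.
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(n))
```

With only a few rolls the relative frequency can be far from 1/3; with many rolls it should settle near 1/3.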
While a prisoner of war during World War II, the English mathematician John Kerrich
(1903–1985) performed a number of probability experiments to demonstrate the Law of
Large Numbers. The most famous of these consisted of tossing a coin 10,000 times and
recording whether it landed heads or tails. He obtained 5067 heads, so the empirical
probability of heads was 5067/10,000 or 50.67%.
In many real-world situations, the outcomes are not equally likely. A coin or die may be
unfair, or biased. Two math professors in Europe had their statistics students test the
Belgian one-euro coin and discovered that in 250 trials, a head was obtained 56% of the
time and a tail was obtained 44% of the time. The data seem to show that the coin is not a
fair coin; more repetitions would be helpful to draw a more accurate conclusion about such
bias. Some dice are also biased. Look at the dice in a game you have at home; the spots on
each face are usually small holes carved out and then painted to make the spots visible.
Your dice may or may not be biased; it is possible that the outcomes may be affected by the
slight weight differences due to the different numbers of holes in the faces. Gambling
casinos make a lot of money depending on outcomes from rolling dice, so casino dice are
made differently to eliminate bias. Casino dice have flat faces; the holes are filled with
paint having the same density as the material that the dice are made out of so that each
face is equally likely to occur. Later we will learn techniques to use to work with
probabilities for events that are not equally likely.
Using either definition of probability, it is easy to obtain the following three general
properties:
Properties of Probability
1. For any event A, 0 ≤ P(A) ≤ 1. That is, the probability of any event is always a
number between 0 and 1, inclusive.
2. The probability of any event A is equal to the sum of the probabilities of the
individual outcomes in A.
3. The sum of the probabilities of all outcomes in S must equal 1. This is true whether
or not S consists of equally likely outcomes.
Note that we can write probabilities as a fraction, a decimal or a percent. Moreover, if P(A)
= 0 this means the event A can never happen, whereas P(A) = 1 means the event A must
always occur (A is a certainty).
Often, we deal with events that are described in terms of other events; in particular, events
involving the connectives “and”, “or” and “not”. These are sometimes called compound
events.
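The event A and B is the set of outcomes that are common to both A and B. That is, an
outcome is in A and B if it is in both A and B at the same time. For example, let A and B be
{1, 2, 3, 4, 5} and {4, 5, 6, 7, 8}, respectively. Then A and B = {4, 5}.
The event A or B is the set of outcomes that are in A, in B, or in both. For the same two
events, A or B = {1, 2, 3, 4, 5, 6, 7, 8}.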
The complement of event A consists of all outcomes that are not in A. The complement of
A is denoted as A′ (read "A prime"). Notice that P(A) + P(A′) = 1, since every outcome is
either in A or is in its complement. This observation gives us a valuable rule for
calculating probabilities:
Complement Rule
P(A′) = 1 − P(A)
The conditional probability of A given B is written P(A | B). P(A | B) is the probability
that event A will occur given that the event B has already occurred. The additional
information that event B occurs changes the sample space, since now we are interested only
in those outcomes that are in B. And since A must also occur, we see that
\[
P(A \mid B) = \frac{\text{number of outcomes in } (A \text{ and } B)}{\text{number of outcomes in } B}.
\]
If we divide the numerator and denominator by the number of outcomes in S, we get the
equivalent formula:
Conditional Probability
The probability that event A occurs, given that event B also occurs is:
\[
P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}.
\]
For example, suppose we toss one fair, six-sided die. The sample space S = {1, 2, 3, 4, 5, 6}.
Let A = {2, 3}, and B = {2, 4, 6}. To calculate P(A | B), first count the number of outcomes
common to the two events; the outcome 2 is in both. Using the first formula, we have
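\[
P(A \mid B) = \frac{\text{number of outcomes in } (A \text{ and } B)}{\text{number of outcomes in } B} = \frac{1}{3}.
\]
We get the same result by using the second formula:
\[
P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{1/6}{3/6} = \frac{1}{6} \cdot \frac{6}{3} = \frac{1}{3}.
\]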
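The two formulas for conditional probability can be compared side by side in a few lines of Python. This is a minimal sketch for the die example, assuming equally likely outcomes; the helper function prob is our own name, not notation from the text.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space for one fair die
A = {2, 3}
B = {2, 4, 6}

def prob(event, space=S):
    """Probability of an event when all outcomes in the space are equally likely."""
    return Fraction(len(event & space), len(space))

# First formula: count outcomes inside the reduced sample space B.
by_counting = Fraction(len(A & B), len(B))

# Second formula: P(A and B) / P(B).
by_formula = prob(A & B) / prob(B)

print(by_counting, by_formula)
```

The set operation `A & B` plays the role of the event "A and B".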
When working with probability, it is important to read each problem carefully to understand
what the events are. Understanding the wording is an important first step in solving
probability problems. Reread the problem several times if necessary. Clearly identify the
event of interest. Determine whether there is a condition stated in the wording that would
indicate that the probability is conditional; carefully identify the condition, if any.
Example 4.1
a. A = ____, B = ____
b. P(A) = ____, P(B) = ____
c. A and B = ____, A or B = ____
d. P(A and B) = ____, P(A or B) = ____
e. A′ = ____, P(A′) = ____
f. P(A) + P(A′) = ____
Solution 4.1
a. A = {2, 4, 6, 8, 10, 12, 14, 16, 18, 20}, B = {14, 15, 16, 17, 18, 19, 20}
b. P(A) = 10/20 = 1/2; P(B) = 7/20
c. A and B = {14, 16, 18, 20}; A or B = {2, 4, 6, 8, 10, 12, 14, 15, 16, 17, 18, 19, 20}
d. P(A and B) = 4/20; P(A or B) = 13/20
e. A′ = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19}; P(A′) = 10/20 = 1/2
f. P(A) + P(A′) = 10/20 + 10/20 = 20/20 = 1.
g. P(A | B) = (number of outcomes in A and B) / (number of outcomes in B) = 4/7;
P(B | A) = (number of outcomes in A and B) / (number of outcomes in A) = 4/10.
The probabilities are not equal.
Example 4.2
Suppose that we roll two fair dice, one red and one white. Then there are 36 possible
outcomes in the sample space, as shown in the table below:
Suppose we write the sample space in terms of the number of dots facing up on the two
dice. That is, write S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}. Use the table to answer each of the
following:
Solution 4.2
a) The events are not equally likely – for example the probability of rolling “snake eyes” is
P(2) = 1/36 whereas the probability of rolling a “3” is P(3) = 2/36.
b) There are 3 outcomes in which the dots add to 4, so P(4) = 3/36.
c) There are 5 outcomes in which the dots add to 6, so P(6) = 5/36.
d) The event E = {2, 4, 6, 8, 10, 12}. So we add the probabilities of the outcomes:
P(E) = P(2) + P(4) + P(6) + P(8) + P(10) + P(12) = 1/36 + 3/36 + 5/36 + 5/36 + 3/36 + 1/36 = 18/36
e) The event G = {10, 11, 12}. So we add the probabilities of the outcomes:
P(G) = P(10) + P(11) + P(12) = 3/36 + 2/36 + 1/36 = 6/36
f) The theoretical probability of rolling a 7 is P(7) = 6/36 = 1/6.
By the Law of Large Numbers, we expect the proportion of rolls that equal 7 to be about
1/6. So we would expect the number of 7’s to be about 7200/6 = 1200.
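The probabilities in this solution can be recovered by listing all 36 outcomes and tallying the sums. The Python sketch below is one way to do so; the names are our own, and the figure of 7200 rolls is taken from part f of the solution.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely (red, white) outcomes.
rolls = list(product(range(1, 7), repeat=2))

# Number of outcomes that produce each sum from 2 to 12.
counts = Counter(red + white for red, white in rolls)

def p_sum(total):
    """Probability that the two dice add up to total."""
    return Fraction(counts[total], len(rolls))

p_even = sum(p_sum(t) for t in range(2, 13, 2))      # event E = {2, 4, ..., 12}
p_at_least_10 = sum(p_sum(t) for t in (10, 11, 12))  # event G = {10, 11, 12}
expected_sevens = 7200 * p_sum(7)                    # Law of Large Numbers estimate

print(p_sum(4), p_sum(6), p_even, p_at_least_10, expected_sevens)
```

Note that Fraction reduces automatically, so 18/36 is displayed as 1/2 and 6/36 as 1/6.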
Example 4.3
The table below describes the distribution of a random sample S of 100 individuals,
organized by preference for Star Wars or Star Trek and whether they are right- or left-
handed.
              Right-handed    Left-handed
Star Wars          43              9
Star Trek          44              4
Let’s denote the events as W = the subject prefers Star Wars, T = the subject prefers Star
Trek, R = the subject is right-handed, and L = the subject is left-handed. Compute the
following probabilities:
a. P(W) b. P(T) c. P(R) d. P(L)
e. P(W and R) f. P(W and L) g. P(T or W) h. P(W or R)
i. P(T or L) j. P(W′) k. P(R | W) l. P(L | T)
Solution 4.3
a. P(W) = 0.52
b. P(T) = 0.48
c. P(R) = 0.87
d. P(L) = 0.13
g. P(W or T) = 1
h. P(W or R) = (43 + 9 + 44)/100 = 0.96
i. P(T or L) = (44 + 4 + 9)/100 = 0.57
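Probabilities from a two-way table such as this one can be computed by adding the cell counts that satisfy a condition and dividing by the total. The following Python sketch assumes the four counts given in the table; the dictionary layout and the helper p are our own illustration.

```python
from fractions import Fraction

# Counts from the table: (preference, handedness) -> number of people.
table = {
    ("W", "R"): 43, ("W", "L"): 9,
    ("T", "R"): 44, ("T", "L"): 4,
}
n = sum(table.values())  # 100 individuals

def p(condition):
    """Probability that a randomly chosen individual satisfies condition(pref, hand)."""
    return Fraction(sum(c for key, c in table.items() if condition(*key)), n)

p_w = p(lambda pref, hand: pref == "W")
p_w_and_r = p(lambda pref, hand: pref == "W" and hand == "R")
p_w_or_r = p(lambda pref, hand: pref == "W" or hand == "R")
p_t_or_l = p(lambda pref, hand: pref == "T" or hand == "L")
p_r_given_w = p_w_and_r / p_w   # conditional probability P(R | W)

print(float(p_w), float(p_w_or_r), float(p_t_or_l), p_r_given_w)
```

Because each person is counted in exactly one cell, an "or" condition never counts anyone twice.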
4.2 | Independent and Mutually Exclusive Events
When applying the rules of probability, we will sometimes need to consider the relationship
between two events; in particular, we will need to consider whether two events are
independent or whether they are mutually exclusive. Although students often confuse the
two, they do not mean the same thing.
Independent Events
Two events A and B are independent if the knowledge that one occurred does not affect the
probability that the other occurs. For example, the outcomes of two rolls of a fair die are
independent events. The outcome of the first roll does not change the probability for the
outcome of the second roll. If two events are not independent, then we say that they are
dependent. If it is not known whether A and B are independent or dependent, you should
assume they are dependent until you can show otherwise.
To show that two events are independent, we show that any one of the following statements
is true:
• P(A | B) = P(A)
• P(B | A) = P(B)
• P(A and B) = P(A)P(B)
The first two are just restatements of the definition. The third statement follows from the first
two statements and the formula for conditional probability. If we multiply both sides of the
equation
\[
P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}
\]
by P(B), we get P(A and B) = P(B)P(A | B). If the events are independent, we substitute P(A)
in place of P(A | B) to get P(A and B) = P(A)P(B).
There are some situations where we will have information about the probabilities and/or
conditional probabilities, and we will use the rules above to determine whether the events
are independent. There are other situations where the events are known to be independent,
and then we will use this fact to help compute P(A and B).
Recall that sampling may be done with replacement or without replacement.
A standard deck of cards is often used in probability problems. To help you visualize the
sample space of a standard deck of cards, all the cards are shown below.
2♥ 3♥ 4♥ 5♥ 6♥ 7♥ 8♥ 9♥ 10♥ J♥ Q♥ K♥ A♥
2♦ 3♦ 4♦ 5♦ 6♦ 7♦ 8♦ 9♦ 10♦ J♦ Q♦ K♦ A♦
2♣ 3♣ 4♣ 5♣ 6♣ 7♣ 8♣ 9♣ 10♣ J♣ Q♣ K♣ A♣
2♠ 3♠ 4♠ 5♠ 6♠ 7♠ 8♠ 9♠ 10♠ J♠ Q♠ K♠ A♠
Example 4.4
Suppose we have a standard, well-shuffled deck of 52 cards. Suppose that we select two
cards. The first card is a Queen of Diamonds.
a. Find the probability of selecting a queen on the second draw if the first card is replaced
before the second is drawn.
b. Find the probability of selecting a queen on the second draw if the first card is not
replaced before the second is drawn.
Solution 4.4:
a. If the first card is replaced before the second is drawn, then the second draw is
independent of the first. So, the probability of selecting a queen on the second draw is 4/52.
b. If the first card is not replaced before the second is drawn, then the second draw is
dependent on the first draw. There are now 51 cards left, of which 3 are queens. So, the
probability of selecting a queen on the second draw is 3/51.
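The difference between the two sampling schemes can be made concrete by building the deck and removing the first card only in the second case. The Python sketch below is an illustration with our own variable names; the suit names are spelled out for readability.

```python
from fractions import Fraction

RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
deck = [(rank, suit) for suit in SUITS for rank in RANKS]   # 52 cards

first_card = ("Q", "diamonds")

def p_queen(cards):
    """Probability of drawing a queen from the given list of cards."""
    queens = [card for card in cards if card[0] == "Q"]
    return Fraction(len(queens), len(cards))

# a. With replacement: the deck is unchanged for the second draw.
p_with = p_queen(deck)

# b. Without replacement: remove the first card before drawing again.
remaining = [card for card in deck if card != first_card]
p_without = p_queen(remaining)

print(p_with, p_without)
```

The printed fractions are reduced, so 4/52 appears as 1/13 and 3/51 as 1/17.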
4.4 Suppose we have a standard, well-shuffled deck of 52 cards. It consists of four suits:
clubs, diamonds, hearts, and spades. There are 13 cards in each suit consisting of Ace, 2, 3,
4, 5, 6, 7, 8, 9, 10, J, Q and K. Suppose that we select two cards. The first card is a Queen of
Diamonds.
a. Find the probability of selecting a diamond on the second draw if the first card is
replaced before the second is drawn.
b. Find the probability of selecting a diamond on the second draw if the first card is not
replaced before the second is drawn.
In Chapter 1 we saw that if we are selecting a sample from an exceptionally large population,
the probability of getting the same individual twice will be very small. As a result, there is
virtually no difference between sampling with replacement and sampling without
replacement. Thus, when sampling from an exceptionally large population, we can think of
the individual selections as being independent of one another. This is useful for the next
example:
Example 4.5
In a large metropolitan area, it is known that 45% of all voters are registered as Democrats.
Suppose we select two voters at random.
Solution 4.5
Let event D1 = Democrat is selected on the first choice, and D2 = Democrat is selected on the
second. The key observation here is that the draws are independent, so we can use the rule
P(D1 and D2) = P(D1)P(D2).
b. If 45% are registered as Democrats, then the remaining 55% are not registered as
Democrats.
P(D1′ and D2′) = P(D1′)P(D2′) = 0.55 × 0.55 = 0.3025.
4.5 Draw two cards from a standard 52-card deck with replacement. Find the probability of
getting at least one black card.
Example 4.6
Let G = event that a student is taking a math class. Let H = event that a student is taking
a science class. Then, G and H = the event that a student is taking both a math class and a
science class. Suppose that P(G) = 0.6, P(H) = 0.5, and P(G and H) = 0.3. Based on this
information, are G and H independent?
Solution 4.6
To show that G and H are independent, we must show ONE of the following:
• P(G | H) = P(G)
• P(H | G) = P(H)
• P(G and H) = P(G)P(H)
The option we choose depends on the information given in the problem. We could choose
any of the methods here because we have the necessary information.
For example, P(G | H) = P(G and H)/P(H) = 0.3/0.5 = 0.6 = P(G).
Or, we can show P(G and H) = P(G)P(H): P(G)P(H) = (0.6)(0.5) = 0.3 = P(G and H).
Since G and H are independent, knowing that a person is taking a science class does not
affect the probability that he or she is taking a math class.
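The product test for independence is easy to automate. The sketch below is a minimal illustration; the function name are_independent is our own, and a small tolerance is used because decimal probabilities are stored only approximately on a computer.

```python
from math import isclose

def are_independent(p_a, p_b, p_a_and_b):
    """Return True when P(A and B) = P(A)P(B), up to floating-point rounding."""
    return isclose(p_a_and_b, p_a * p_b, rel_tol=1e-9, abs_tol=1e-12)

# Example 4.6: P(G) = 0.6, P(H) = 0.5, P(G and H) = 0.3.
print(are_independent(0.6, 0.5, 0.3))

# Example 4.7: P(M) = 0.60, P(C) = 0.50, P(M and C) = 0.45.
print(are_independent(0.6, 0.5, 0.45))
```

The same marginal probabilities lead to different conclusions in the two examples, because independence depends on the joint probability P(A and B).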
4.6 In a bag, there are six red marbles and four green marbles. The red marbles are
marked with the numbers 1, 2, 3, 4, 5, and 6. The green marbles are marked with the
numbers 1, 2, 3, 4. One marble is selected. Let R = event that a red marble is drawn,
let G = event that a green marble is drawn, and let O = event that an odd-numbered
marble is selected.
Example 4.7
In a particular college class, 60% of the students are taking math courses. Fifty percent of
all students in the class have graphing calculators. Forty-five percent of the students are
taking math courses and have graphing calculators. Of the students taking math courses,
75% have graphing calculators. One student is picked randomly. Let M be the event that
a student is taking a math course, and let C be the event that a student has a graphing
calculator. Are the events M and C independent?
Solution 4.7
There are three conditions we can check to determine independence:
• P(M | C) = P(M)
• P(C | M) = P(C)
• P(M and C) = P(M)P(C)
The one we use will depend on the information given in the problem. So, we first write
down the probabilities that are given in the problem:
P(M) = 0.60, P(C) = 0.50, P(M and C) = 0.45, P(C | M) = 0.75
Based on this information we could use either the second condition or the third.
(We do not know P(M | C) yet, so we cannot use the first condition.)
Using the second condition, we check whether P(C | M) equals P(C):
We are given that P(C | M) = 0.75, whereas P(C) = 0.50. Since these are not equal, the
events are not independent. This shows that a student who is taking a math course is more
likely to have a graphing calculator than a student who is not in a math course.
Mutually Exclusive Events
We say that A and B are mutually exclusive events if they cannot occur at the same time.
This means that A and B do not share any outcomes and so it follows that P(A and B) = 0.
For example, suppose that S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, A = {1, 2, 3, 4, 5}, B = {4, 5, 6, 7, 8},
and C = {7, 9}. Then A and B = {4, 5}; so the events A, B are not mutually exclusive. On the
other hand, the events B, C are mutually exclusive, since they do not have any outcomes in
common.
If it is not known whether A and B are mutually exclusive, assume they are not until you can
show otherwise. The following examples illustrate these definitions and terms.
Example 4.8
Suppose that we toss two fair coins. The sample space is {HH, HT, TH, TT} where T = tails
and H = heads. The possible outcomes are HH, HT, TH, and TT. Note that the outcomes
HT and TH are different. The outcome HT means that the first coin showed heads and the
second coin showed tails. The outcome TH means that the first coin showed tails and the
second coin showed heads.
a. Let A = the event of getting at most one tail. Find the event space.
b. Let B = the event of getting all tails. Find the event space and the
complement of B.
c. Find the probability of A and the probability of B.
d. Let C = the event of getting all heads. Are B and C mutually exclusive?
e. Let D = the event of getting more than one tail. Find the event space and
the probability of D.
f. Let E = the event of getting a head on the first toss. Find the event space
and the probability of E.
g. Let F = the event of getting at least one (one or two) tail in two flips. Find
the event space and the probability of F.
Solution 4.8
a. Let A = the event of getting at most one tail. (At most one tail means zero or one
tail.) Then A can be written as {HH, HT, TH}. The outcome HH shows zero tails. HT
and TH each show one tail.
b. Let B = the event of getting all tails. B can be written as {TT}. Note that Events A
and B are mutually exclusive. In fact, B is the complement of A, so P(A) + P(B) =
P(A) + P(A′) = 1.
c. P(A) = 3/4, and P(B) = 1/4.
d. Let C = the event of getting all heads; then B and C are mutually exclusive. Clearly
B, C have no outcomes in common because it is not possible to have all tails and all
heads at the same time.
e. Let D = event of getting more than one tail. D = {TT}, so P(D) = 1/4.
f. Let E = event of getting a head on the first toss. This means that we can get either a
head or tail on the second toss, so E = {HT, HH}. Thus, P(E) = 2/4.
g. Find the probability of getting at least one (one or two) tail in two flips. Let F =
event of getting at least one tail in two flips. Then F = {HT, TH, TT} and so P(F) =
3/4.
Example 4.9
Suppose that we toss two fair coins. Find the probabilities of the events.
a. Let F = the event of getting at most one tail (zero or one tail).
b. Let G = the event of getting two faces that are the same.
c. Let H = the event of getting a head on the first flip followed by a head or tail on the
second flip.
d. Are F and G mutually exclusive?
e. Let J = the event of getting all tails. Are J and H mutually exclusive?
Solution 4.9:
Refer to the sample space in Example 4.8.
a. Zero (0) or one (1) tails occur when the outcomes HH, TH, HT show up. P(F) = 3/4.
b. Two faces are the same if HH or TT show up. P(G) = 2/4 = 1/2.
c. A head on the first flip followed by a head or tail on the second flip occurs when HH
or HT show up. So P(H) = 2/4 = 1/2.
d. F and G share the outcome HH so F and G are not mutually exclusive.
e. Getting all tails occurs when tails shows up on both coins (TT). H’s outcomes are HH
and HT. Since J and H have no outcomes in common, they are mutually exclusive.
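Treating events as sets makes the test for mutually exclusive events a one-line check: two events are mutually exclusive exactly when their sets of outcomes are disjoint. The Python sketch below works through the events of Example 4.9; the helper names are our own.

```python
from fractions import Fraction
from itertools import product

S = {"".join(t) for t in product("HT", repeat=2)}   # {HH, HT, TH, TT}

F = {o for o in S if o.count("T") <= 1}   # at most one tail
G = {o for o in S if o[0] == o[1]}        # both faces the same
H = {o for o in S if o[0] == "H"}         # head on the first flip
J = {o for o in S if o.count("T") == 2}   # all tails

def prob(event):
    """Probability of an event when the four outcomes are equally likely."""
    return Fraction(len(event), len(S))

def mutually_exclusive(a, b):
    """Two events are mutually exclusive when they share no outcomes."""
    return a.isdisjoint(b)

print(prob(F), prob(G), prob(H))
print(mutually_exclusive(F, G), mutually_exclusive(J, H))
```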
4.8 A box has two balls, one white and one red. We select one ball, put it back in the box,
and then select a second ball. Find the probability of the following events:
Example 4.10
Let event C = taking an English class. Let event D = taking a speech class. Suppose P(C) =
0.75, P(D) = 0.3, P(C | D) = 0.75 and P(C and D) = 0.225. Use this information to answer
the following:
Solution 4.10
4.3 | Two Basic Rules of Probability
Recall that in classical probability, the probability of an event is a fraction where the
numerator is the number of outcomes in the event and the denominator is the number of
outcomes in the sample space. When calculating probability, there are two rules to consider
when determining if two events are independent or dependent and if they are mutually
exclusive or not.
Addition Rule
If events A and B are defined on the same sample space, then the probability of event A or
event B is written as P(A or B).
If A and B are two events defined on the same sample space, then
P(A or B) = P(A) + P(B) - P(A and B).
If A and B are mutually exclusive, then event A and event B can’t happen at the
same time; that is, P(A and B) = 0. Therefore the addition rule becomes:
P(A or B) = P(A) + P(B).
Multiplication Rule
Recall that the probability of A given B equals the probability of A and B divided by the
probability of B. That is,
\[
P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}.
\]
Multiplying both sides by P(B), we have:
If A and B are two events defined on the same sample space, then:
P(A and B) = P(B)∙P(A|B)
If A and B are independent, then P(A|B) = P(A). So, for independent events, the
multiplication rule simplifies to:
P(A and B) = P(A)∙P(B)
Example 4.11
Klaus is trying to choose where to go on vacation. Klaus can only afford one vacation.
His two choices are: A = New Zealand and B = Alaska. The probability that he chooses A is
P(A) = 0.6 and the probability that he chooses B is P(B) = 0.35. P(A and B) = 0, because
Klaus can only afford to take one vacation. What is the probability that Klaus chooses
either New Zealand or Alaska?
Solution 4.11
The probability that he chooses either New Zealand or Alaska is:
P(A or B) = P(A) + P(B) - P(A and B) = 0.6 + 0.35 - 0 = 0.95.
Note that the probability that he does not choose to go anywhere on vacation must be 0.05.
Example 4.12
Carlos plays college soccer. He makes a goal 65% of the time he shoots. Carlos is going to
attempt two goals in a row in the next game. Let A = the event Carlos is successful on his
first attempt, so P(A) = 0.65. Let B = the event Carlos is successful on his second attempt.
P(B) = 0.65. Carlos tends to shoot in streaks; the probability that he makes the second goal
given that he made the first goal is 0.90.
Solution 4.12
a. The problem is asking you to find P(A AND B) = P(B AND A). Since P(B|A) = 0.90:
P(B AND A) = P(B|A)P(A)
= (0.90)(0.65)
= 0.585
Carlos makes the first and second goals with probability 0.585.
c. No, they are not independent events. P(B AND A) = 0.585, whereas
P(B)∙P(A) = (0.65)(0.65) ≈ 0.423.
Since 0.423 ≠ 0.585, P(B AND A) is not equal to P(B)∙P(A).
d. No, they are not mutually exclusive because P(A AND B) = 0.585.
To be mutually exclusive, we would need P(A AND B) = 0.
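The numbers in this solution follow directly from the multiplication and addition rules, as the short Python sketch below illustrates. The variable names are our own, and the computation of P(A or B) is included only as an application of the addition rule to the same data.

```python
p_a = 0.65          # P(A): Carlos makes the first goal
p_b = 0.65          # P(B): Carlos makes the second goal
p_b_given_a = 0.90  # P(B | A): makes the second given that he made the first

# Multiplication rule: P(A and B) = P(A) * P(B | A).
p_a_and_b = p_a * p_b_given_a

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B).
p_a_or_b = p_a + p_b - p_a_and_b

# Independence would require P(A and B) = P(A) * P(B).
independent = abs(p_a_and_b - p_a * p_b) < 1e-9

# Mutually exclusive events would require P(A and B) = 0.
mutually_exclusive = p_a_and_b == 0

print(round(p_a_and_b, 3), round(p_a_or_b, 3), independent, mutually_exclusive)
```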
4.11 Helen plays basketball. For free throws, she makes the shot 75% of the time. Helen
must now attempt two free throws. C = the event that Helen makes the first shot. P(C) =
0.75. D = the event Helen makes the second shot. P(D) = 0.75. The probability that Helen
makes the second free throw given that she made the first is 0.85. What is the probability
that Helen makes both free throws?
Example 4.13
A community swim team has 150 members. Seventy-five of the members are advanced
swimmers. Forty-seven of the members are intermediate swimmers. The rest are novice
swimmers. Forty of the advanced swimmers practice four times a week. Thirty of the
intermediate swimmers practice four times a week. Ten of the novice swimmers practice
four times a week. Suppose one member of the swim team is chosen randomly.
a. What is the probability that the member is a novice swimmer?
b. What is the probability that the member practices four times a week?
c. What is the probability that the member is an advanced swimmer and practices four
times a week?
d. Are being an advanced swimmer and being an intermediate swimmer mutually exclusive?
Why or why not?
e. Are being a novice swimmer and practicing four times a week independent events?
Why or why not?
Solution 4.13
a. 28/150
b. 80/150
c. 40/150
d. P(advanced and intermediate) = 0, so these are mutually exclusive events. A
swimmer cannot be an advanced swimmer and an intermediate swimmer at the
same time.
e. No, these are not independent events.
P(novice and practices four times per week) = 0.0667
P(novice)P(practices four times per week) = 0.0996
Since 0.0667 ≠ 0.0996, the events are not independent.
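The decimals in part e come from the member counts in the problem. The Python sketch below redoes the arithmetic with exact fractions; the dictionary layout and variable names are our own illustration.

```python
from fractions import Fraction

# level -> (number of members, number who practice four times a week)
team = {
    "advanced": (75, 40),
    "intermediate": (47, 30),
    "novice": (150 - 75 - 47, 10),   # the rest of the 150 members
}
total = sum(members for members, _ in team.values())

p_novice = Fraction(team["novice"][0], total)
p_four = Fraction(sum(four for _, four in team.values()), total)
p_novice_and_four = Fraction(team["novice"][1], total)

# The events are independent only if P(novice and four) = P(novice) * P(four).
independent = p_novice_and_four == p_novice * p_four

print(round(float(p_novice_and_four), 4), round(float(p_novice * p_four), 4), independent)
```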
4.12 A school has 200 seniors of whom 140 will be going to college next year. Forty will be
going directly to work. The remainder will be taking a gap year. Fifty of the seniors going
to college play sports. Thirty of the seniors going directly to work play sports. Five of the
seniors taking a gap year play sports. What is the probability that a senior is taking a gap
year?
Example 4.14
Felicity attends the College of Lake County in Grayslake, IL. The probability that Felicity
enrolls in a math class is 0.2 and the probability that she enrolls in a speech class is 0.65.
The probability that she enrolls in a math class given that she enrolls in speech class is
0.25. Let: M = math class, S = speech class.
a. What is the probability that Felicity enrolls in math and speech? i.e. Find P(M and
S) = P(M|S)∙P(S).
b. What is the probability that Felicity enrolls in math or speech classes? i.e. Find P(M
or S) = P(M) + P(S) - P(M and S).
c. Are M and S independent? Is P(M | S) = P(M)?
d. Are M and S mutually exclusive? Is P(M and S) = 0?
Solution 4.14
a. 0.1625, b. 0.6875, c. No, d. No
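The four answers follow directly from the multiplication and addition rules. A short sketch of the arithmetic, assuming only the probabilities stated in Example 4.14 (the variable names are illustrative):

```python
p_m = 0.2            # P(M): enrolls in a math class
p_s = 0.65           # P(S): enrolls in a speech class
p_m_given_s = 0.25   # P(M|S)

# a. Multiplication rule: P(M and S) = P(M|S) * P(S)
p_m_and_s = p_m_given_s * p_s

# b. Addition rule: P(M or S) = P(M) + P(S) - P(M and S)
p_m_or_s = p_m + p_s - p_m_and_s

# c. Independent only if P(M|S) equals P(M).
independent = abs(p_m_given_s - p_m) < 1e-9

# d. Mutually exclusive only if P(M and S) equals 0.
mutually_exclusive = abs(p_m_and_s) < 1e-9

print(round(p_m_and_s, 4), round(p_m_or_s, 4), independent, mutually_exclusive)
```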
4.13 A student goes to the library. Let events B = the student checks out a book and D =
the student checks out a DVD. Suppose that P(B) = 0.40, P(D) = 0.30 and P(D|B) = 0.5.
a. Find P(B and D).
b. Find P(B or D)
Example 4.15
Studies show that about one woman in seven (approximately 14.3%) who lives to be 90 will
develop breast cancer. Suppose that of those women who develop breast cancer, a test is
negative 2% of the time. Also suppose that in the general population of women, the test for
breast cancer is negative about 85% of the time. Let B = woman develops breast cancer and
let N = tests negative. Suppose one woman is selected at random.
a. What is the probability that the woman develops breast cancer? What is the probability
that woman tests negative?
b. Given that the woman has breast cancer, what is the probability that she tests
negative?
c. What is the probability that the woman has breast cancer AND tests negative?
d. What is the probability that the woman has breast cancer or tests negative?
e. Are having breast cancer and testing negative independent events?
f. Are having breast cancer and testing negative mutually exclusive events?
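Solution 4.15
a. P(B) = 0.143; P(N) = 0.85
b. P(N|B) = 0.02
c. P(B and N) = P(N|B)∙P(B) = (0.02)(0.143) = 0.0029
d. P(B or N) = P(B) + P(N) - P(B and N) = 0.143 + 0.85 - 0.0029 = 0.9901
e. No. P(N) = 0.85; P(N|B) = 0.02. So, P(N|B) does not equal P(N).
f. No. P(B and N) = 0.0029. For B and N to be mutually exclusive, P(B and N) must equal 0.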
4.14 A school has 200 seniors of whom 140 will be going to college next year. Forty will
be going directly to work. The remainder is taking a gap year. Fifty of the seniors going
to college play sports. Thirty of the seniors going directly to work play sports. Five of the
seniors taking a gap year play sports. What is the probability that a senior is going to
college and plays sports?
Example 4.16
Refer to the information in Example 4.15 and let P = tests positive.
a. Given that a woman develops breast cancer, what is the probability that she tests
positive? Find P(P|B) = 1 - P(N|B).
b. What is the probability that a woman develops breast cancer and tests positive?
Find P(B and P) = P(P|B)∙P(B).
c. What is the probability that a woman does not develop breast cancer? Find P(B′).
d. What is the probability that a woman tests positive for breast cancer? Find P(P) = 1
- P(N).
Solution 4.16
a. 0.98;
b. 0.1401;
c. 0.857;
d. 0.15
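The answers to Examples 4.15 and 4.16 come from the complement and multiplication rules. The sketch below reproduces them under the probabilities stated in Example 4.15; the variable names are illustrative only.

```python
p_b = 0.143          # P(B): woman develops breast cancer
p_n = 0.85           # P(N): test is negative
p_n_given_b = 0.02   # P(N|B)

# Complement rule applied to the conditional probability.
p_p_given_b = 1 - p_n_given_b

# Multiplication rule for "develops cancer AND tests positive".
p_b_and_p = p_p_given_b * p_b

# Multiplication rule for "develops cancer AND tests negative".
p_b_and_n = p_n_given_b * p_b

# Complements of B and of N.
p_not_b = 1 - p_b
p_p = 1 - p_n

print(round(p_p_given_b, 2), round(p_b_and_p, 4),
      round(p_b_and_n, 4), round(p_not_b, 3), round(p_p, 2))
```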
4.15 A student goes to the library. Let events B = the student checks out a book and D =
the student checks out a DVD. Suppose that P(B) = 0.40, P(D) = 0.30 and P(D|B) = 0.5.
a. Find P(B′).
b. Find P(D and B).
c. Find P(B|D).
d. Find P(D AND B′).
e. Find P(D|B′).
Tree Diagrams and Venn Diagrams
A tree diagram is a special type of graph used to determine the outcomes of an experiment.
It consists of "branches" that are labeled with either frequencies or probabilities. Tree
diagrams can make some probability problems easier to visualize and solve. The following
example illustrates how to use a tree diagram.
Example 4.17
In an urn, there are 11 balls. Three balls are red (R) and eight balls are blue (B). Draw two
balls, one at a time, with replacement. "With replacement" means that you put the first ball
back in the urn before you select the second ball.
The tree diagram using frequencies that show all the possible outcomes follows.
The first set of branches represents the first draw. The second set of branches represents
the second draw. Each of the outcomes is distinct. In fact, we can list each red ball as R1,
R2, and R3 and each blue ball as B1, B2, B3, B4, B5, B6, B7, and B8. Then the nine RR
outcomes can be written as: R1R1; R1R2; R1R3; R2R1; R2R2; R2R3; R3R1; R3R2; R3R3.
The other outcomes are similar.
There are a total of 11 balls in the urn. Draw two balls, one at a time, with replacement.
There is a total of 11 x 11 = 121 outcomes; this is the size of the sample space.
Solution 4.17
a. B1R1; B1R2; B1R3; B2R1; B2R2; B2R3; B3R1; B3R2; B3R3; B4R1; B4R2; B4R3;
B5R1; B5R2; B5R3; B6R1; B6R2; B6R3; B7R1; B7R2; B7R3; B8R1; B8R2; B8R3
b. P(RR) = P(1st Draw Red and 2nd Draw Red) = (3/11)∙(3/11) = (9/121)
Note that we get this probability by “multiplying along the branches”
c. P(RB or BR) = (3/11)∙(8/11) + (8/11)∙(3/11) = (48/121)
d. P(R on 1st draw and B on 2nd draw) = (3/11)∙(8/11) = (24/121)
e. P(R on 2nd draw GIVEN B on 1st draw) = P(R|B) = (24/88) = (3/11)
This problem is a conditional one. The sample space has been reduced to those
outcomes that already have a blue on the first draw. There are 24 + 64 = 88 possible
outcomes (24 BR and 64 BB). Twenty-four of those outcomes are BR.
f. P(BB) = (64/121)
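The sample space for drawing with replacement is small enough to enumerate. The sketch below builds all 121 ordered outcomes and counts the favorable ones; it is an illustration, not the textbook's method, and the helper names (prob, is_red) are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Label the balls R1..R3 and B1..B8, as in the example.
balls = [f"R{i}" for i in range(1, 4)] + [f"B{i}" for i in range(1, 9)]

# With replacement: every ordered pair of balls is a possible outcome.
outcomes = list(product(balls, repeat=2))

def is_red(ball):
    return ball.startswith("R")

def prob(event):
    """Probability of an event given as a predicate on (first, second)."""
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))

p_rr = prob(lambda o: is_red(o[0]) and is_red(o[1]))
p_one_of_each = prob(lambda o: is_red(o[0]) != is_red(o[1]))
p_bb = prob(lambda o: not is_red(o[0]) and not is_red(o[1]))

# Conditional probability: shrink the sample space to "blue on first draw".
blue_first = [o for o in outcomes if not is_red(o[0])]
p_r2_given_b1 = Fraction(sum(1 for o in blue_first if is_red(o[1])),
                         len(blue_first))

print(len(outcomes), p_rr, p_one_of_each, p_bb, p_r2_given_b1)
```

Counting outcomes this way gives the same values as multiplying along the branches of the tree.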
If you do the same experiment, this time without replacement, then the tree diagram
changes (the diagram is not reproduced here).
There are a total of 11 balls in the urn. Draw two balls, one at a time, without replacement.
There are 11 × 10 = 110 outcomes; again this is the size of the sample space.
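The count of 110 can be checked by listing ordered pairs of two different balls. This is a small sketch under the same ball labels used earlier (R1 to R3, B1 to B8); the count of red-red outcomes is computed here for illustration and is not stated in the text.

```python
from fractions import Fraction
from itertools import permutations

balls = [f"R{i}" for i in range(1, 4)] + [f"B{i}" for i in range(1, 9)]

# permutations(..., 2) yields ordered pairs of two *different* balls,
# which models drawing twice without replacement.
outcomes = list(permutations(balls, 2))

n_rr = sum(1 for first, second in outcomes
           if first.startswith("R") and second.startswith("R"))
p_rr = Fraction(n_rr, len(outcomes))

print(len(outcomes), n_rr, p_rr)
```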
4.16 In a standard deck, there are 52 cards. Twelve cards are face cards (F) and 40 cards are
not face cards (N). Draw two cards, one at a time, without replacement. Draw a tree diagram
labeled with all possible probabilities.
A Venn Diagram is another visual aid that can be used to display the relationships
between different events for an experiment, and their probabilities. It generally consists of
a box that represents the sample space S together with circles or ovals representing events.
Example 4.18
Forty percent of the students at a local college belong to a club and 50% work part time.
Five percent of the students work part time and belong to a club. Draw a Venn diagram
displaying this information, letting C = “student belongs to a club” and PT = “student
works part time”. Then find:
a. The probability that the student belongs to a club given that the student works part
time.
b. The probability that the student belongs to a club or works part time.
c. The probability that a student works part-time but does not belong to a club.
d. The probability that a student neither belongs to a club nor works part time.
Solution 4.18:
We draw a rectangle representing the sample space S, and then draw two circles, one for C
and the other representing PT. The overlap between the two circles represents the
outcomes that the two events have in common. That is, this region represents the event C
and PT.
Next, we fill in the probabilities, starting with the event C and PT, whose probability is .05.
We are given that P(C) = .40; so the portion of C that does not overlap PT must have
probability .35. Similarly, the part of PT that does not overlap C will have probability .45.
This gives us the diagram:
[Venn diagram: rectangle S containing overlapping circles C and PT; C only = .35, C and PT = .05, PT only = .45]
a. The probability that the student belongs to a club given that the student works part
time is:
P(C | PT) = P(C and PT) / P(PT) = .05 / .50 = 0.10
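b. The probability that the student belongs to a club or works part time can be found using
the addition rule:
P(C or PT) = P(C) + P(PT) – P(C and PT) = .40 + .50 – .05 = 0.85
c. The event that a student works part time but does not belong to a club is represented by
the part of the circle PT that does not overlap C, which has probability 0.45.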
d. The event that a student neither belongs to a club nor works part time is represented by
the part of the box that is outside both circles. That is, this event is the complement of the
event C or PT. By the complement rule,
P(neither in club nor working part time) = 1 – P(C or PT) = 1 – .85 = 0.15.
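The regions of the Venn diagram can be computed one at a time. The sketch below assumes only the three percentages given in Example 4.18; the region names are illustrative.

```python
p_c = 0.40      # P(C): belongs to a club
p_pt = 0.50     # P(PT): works part time
p_both = 0.05   # P(C and PT)

# Split each circle into its non-overlapping part and the overlap.
c_only = p_c - p_both
pt_only = p_pt - p_both

# The union is the sum of the three disjoint regions inside the circles.
p_c_or_pt = c_only + p_both + pt_only

# Everything outside both circles (complement rule).
p_neither = 1 - p_c_or_pt

# Conditional probability: overlap relative to the PT circle.
p_c_given_pt = p_both / p_pt

print(round(c_only, 2), round(pt_only, 2), round(p_c_or_pt, 2),
      round(p_neither, 2), round(p_c_given_pt, 2))
```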
4.17 Fifty percent of the workers at a factory work a second job, 25% have a spouse who
also works, and 5% work a second job and have a spouse who also works. Draw a Venn
diagram showing the relationships, letting W = works a second job and S = spouse also
works. Find the probability that a randomly selected worker would neither work a second
job nor have a spouse that works.
Example 4.19
A person with type O blood and a negative Rh factor (Rh-) can donate blood to any person
with any blood type. It is known that 4% of African Americans have both type O blood and a
negative Rh factor, 8% of African Americans have the Rh- factor, and 51% have type O blood.
Make a Venn diagram for this situation, using O for the set of individuals with type O
blood, and Rh- for the individuals with the negative Rh factor.
a. Find the probability that a randomly selected African American has type O blood or
negative Rh factor.
b. Find the probability that a randomly selected African American has negative Rh
factor but does not have type O blood.
c. Find the probability that a randomly selected African American has neither type O
blood nor a negative Rh factor.
[Venn diagram: overlapping circles O and Rh-, with the overlap representing O and Rh-]
4.18 In a bookstore, the probability that the customer buys a novel is 0.6, and the
probability that the customer buys a non-fiction book is 0.4. Suppose that the probability
that the customer buys both is 0.2.
4.4 | Contingency Tables
A contingency table is a commonly used way of displaying data that can facilitate
calculating probabilities. The table is particularly helpful for calculating conditional
probabilities. The table displays sample values in relation to two different variables that may
be dependent or contingent on one another. Contingency tables will be used again later in
the course.
Example 4.20
Suppose a study of speeding violations and drivers who use cell phones produced the data:
                        Speeding violation   No speeding violation
                        in last year         in last year            Total
Cell phone user         25                   280                     305
Not a cell phone user   45                   405                     450
Total                   70                   685                     755
As we see in the bottom right corner, the total number of people in the sample is 755. The
row totals are 305 and 450. The column totals are 70 and 685. Notice that we can get the
total sample size by adding either the row totals or the column totals: 305 + 450 = 755 and
70 + 685 = 755.
Let C = event that person is a cell phone user; so Cʹ = event that person is not a cell phone
user.
Let V = event that person had a speeding violation in the last year, so Vʹ = event that
person had no speeding violation in the last year. Use the table to calculate the following
probabilities:
a. Find P(C).
b. Find P(Vʹ).
c. Find P(Vʹ and C).
d. Find P(Vʹ or C).
e. Find P(C given V)
f. Find P(Vʹ given Cʹ)
Solution 4.20
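a. P(C) = (number of outcomes in C) / (number of outcomes in S) = 305/755 ≈ 0.404
b. P(Vʹ) = (number of outcomes in Vʹ) / (number of outcomes in S) = 685/755 ≈ 0.907
c. P(Vʹ and C) = (number of outcomes in (Vʹ and C)) / (number of outcomes in S) = 280/755 ≈ 0.371
d. P(Vʹ or C) = (number of outcomes in (Vʹ or C)) / (number of outcomes in S). Using the addition rule:
P(Vʹ or C) = P(Vʹ) + P(C) − P(Vʹ and C) = 685/755 + 305/755 − 280/755 = 710/755 ≈ 0.940
Or, we can find the number of outcomes in (Vʹ or C) directly: simply add the cells that are in
either Vʹ or C, being careful not to count any outcome twice: 280 + 405 + 25 = 710.
e. To find P(C given V), we use the conditional probability formula:
P(C | V) = (number of outcomes in (C and V)) / (number of outcomes in V) = 25/70 ≈ 0.357
Note that the sample space is limited to the number of individuals who had a
violation.
f. To find P(Vʹ given Cʹ), we again use the conditional probability formula:
P(Vʹ | Cʹ) = (number of outcomes in (Vʹ and Cʹ)) / (number of outcomes in Cʹ) = 405/450 = 0.90
This time the sample space is limited to the number of individuals who were not cell
phone users.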
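Each probability in Example 4.20 is a ratio of cell counts from the contingency table. The sketch below stores the four inner cells and derives everything else; the dictionary layout and the helper n are illustrative choices, not part of the text.

```python
from fractions import Fraction

# Inner cells of the table in Example 4.20, keyed by (phone use, violation).
counts = {
    ("C", "V"): 25,   ("C", "V'"): 280,
    ("C'", "V"): 45,  ("C'", "V'"): 405,
}
total = sum(counts.values())

def n(event):
    """Number of people whose (phone, violation) cell satisfies `event`."""
    return sum(c for cell, c in counts.items() if event(*cell))

p_c = Fraction(n(lambda p, v: p == "C"), total)
p_not_v = Fraction(n(lambda p, v: v == "V'"), total)
p_not_v_and_c = Fraction(counts[("C", "V'")], total)
# Counting cells directly means no outcome is counted twice.
p_not_v_or_c = Fraction(n(lambda p, v: v == "V'" or p == "C"), total)

# Conditional probabilities shrink the sample space to the given event.
p_c_given_v = Fraction(counts[("C", "V")], n(lambda p, v: v == "V"))
p_not_v_given_not_c = Fraction(counts[("C'", "V'")], n(lambda p, v: p == "C'"))

for value in (p_c, p_not_v, p_not_v_and_c, p_not_v_or_c,
              p_c_given_v, p_not_v_given_not_c):
    print(value, round(float(value), 3))
```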
4.19 The following table shows the number of athletes who stretch before exercising and
how many had injuries within the past year.
Example 4.21
The table below shows the location preferences for a random sample of 100 boaters.
Solution 4.21
a. We start by filling in the grand total; this is the total sample size, 100. This is also the
sum of the two row totals: 45 + 55 = 100. Next, we can fill in the entry in the row for
“kayaker” and the column “Freshwater Lakes”. The total for this column is 41, and so
the missing value must be 41 – 16 = 25. Next, we fill in the entry in the row for kayaker
and the column “The Coastline”. The row total is 55, and so this last entry must be 55 –
25 – 14 = 16:
Continuing in this way, we get the completed table:
When finished, make sure to check that the row totals and the column totals add to 100.
b. To decide whether A and C are independent, we can check any one of the following:
P(A | C) = P(A)
P(C | A) = P(C)
P(A and C) = P(A)P(C)
We will use the last one. From the table we quickly calculate:
P(C) = 34/100 = .34, P(A) = 45/100 = .45 and P(C and A) = 18/100 = .18
Since P(C and A) = .18 ≠ (.34)(.45) = P(C)P(A), the events are not independent.
c. Let L be the event “prefers lakes and streams” and M be the event “being male”. Then
we want to calculate P(A | L):
P(A | L) = (number of outcomes in (A and L)) / (number of outcomes in L) = 25/41 ≈ .610
d. If we let A = “canoer” and P = “prefers rivers and streams”, then we want to calculate
P(A or P) = P(A) + P(P) – P(A and P) = 45/100 + 25/100 – 11/100 = 59/100 = .59.
4.20 The following table shows a random sample of 200 cyclists and the routes they prefer.
a. Out of the upright cyclists, what is the probability that the cyclist prefers a hilly path?
b. Let R = recumbent and H = hilly path. Are the events R and H independent?
Example 4.22
The table below contains the number of crimes per 100,000 inhabitants from 2008 to 2011
in the U.S.
Year     Robbery   Burglary   Rape   Vehicle   Total
2008     145.7     732.1      29.7   314.7
2009     133.1     717.7      29.1   259.2
2010     119.3     701        27.7   239.1
2011     113.7     702.2      26.8   229.6
Total
Solution 4.22:
a. 0.0294; b. 0.1551; c. 0.7165; d. 0.2365; e. 0.2575
4.21 The following table relates the vaccines and age ranges of a group of individuals
participating in an observational study.
4.5 | Counting and Probability
We now present some tools for counting to better utilize the basic definition of theoretical
probability. The underlying theme is to represent outcomes in a sample space as “lists”
formed from some collection of symbols. The fundamental questions we will need to answer
are:
In most cases, if we can explicitly describe the sample space, and can answer these basic
questions, we will be able to count outcomes in our event and in the sample space.
The first counting principle is sometimes called the “Fundamental Principle of Counting”—
as its name suggests, it forms the foundation of all our subsequent counting methods.
Suppose we have a task that can be broken down into a sequence of k steps. Further
suppose that there are
N_1 ways to do step 1, N_2 ways to do step 2, . . . , and N_k ways to do step k. Then there are
N_1 ⋅ N_2 ⋯ N_k ways to complete the entire task.
Notice that the Basic Counting Principle can be used only when counting an ordered
arrangement—the statement explicitly states that the task is broken down into a sequence
of steps.
Example 4.23
A license plate consists of three letters (from the English alphabet) followed by three digits.
How many license plates are possible?
Solution 4.23:
To form a valid license plate, we must fill in six slots: ___ ___ ___ ___ ___ ___
For each of the first three, there are 26 choices. For the last three, there are 10 choices
each. Thus, there are a total of 26 x 26 x 26 x 10 x 10 x 10 = 26^3 ⋅ 10^3 = 17,576,000 possibilities.
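The slot-by-slot count can be written as a product over the number of choices in each slot. A minimal sketch, assuming the 26-letter English alphabet and the ten digits 0 through 9; the list name is illustrative.

```python
import math
import string

letters = len(string.ascii_uppercase)  # 26 choices per letter slot
digits = len(string.digits)            # 10 choices per digit slot

# One entry per slot: three letters followed by three digits.
choices_per_slot = [letters] * 3 + [digits] * 3

# Fundamental counting principle: multiply the choices for each step.
plates = math.prod(choices_per_slot)
print(plates)
```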
Example 4.24
Solution 4.24:
b. Note that the calculation in part a counted the number of outcomes in the sample space
for this problem. Again, we draw nine slots:
___ ___ ___ - ___ ___ - ___ ___ ___ ___
There are 10 choices for the first slot. Since we cannot have repeated digits, there are only
9 choices for the second, only 8 choices for the third, etc. Thus, there is a total of
10 x 9 x 8 x 7 x 6 x 5 x 4 x 3 x 2 = 3,628,800
possible social security numbers without repeated digits. Thus, the probability that a social
security number has no repeated digits is:
P(no repeats) = #(no repeats) / (total number of SSNs) = 3,628,800 / 10^9 ≈ 0.0036.
c. This problem is easy once the previous problem is completed. Noting that “at least one
repeat” is the complement of “no repeats”, we use the complement rule to get:
P(at least one repeat) = 1 – P(no repeats) = 1 – 0.0036 = 0.9964.
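Both probabilities can be checked numerically. The sketch below assumes, as the solution does, that all nine-digit strings are equally likely; math.perm requires Python 3.8 or later.

```python
import math

total_ssn = 10 ** 9            # nine slots, ten choices each, repeats allowed
no_repeats = math.perm(10, 9)  # 10 x 9 x 8 x ... x 2 ordered choices

p_no_repeats = no_repeats / total_ssn
p_at_least_one_repeat = 1 - p_no_repeats  # complement rule

print(no_repeats, round(p_no_repeats, 4), round(p_at_least_one_repeat, 4))
```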
4.23 A company uses five-digit identification numbers for their employees. What is the
probability that a randomly selected employee has no repeated digits in his/her ID
number?
Factorial Notation
The product of the first n positive integers is often abbreviated by n!; this is read as “n
factorial”. That is,
n! = n(n − 1)(n − 2) ⋯ 3 ⋅ 2 ⋅ 1
As our examples above suggest, factorial notation is very convenient for counting problems
using the basic counting principle; that is, counting ordered arrangements.
Example 4.25:
Solution 4.25:
We have eight slots to fill in: ___ ___ ___ ___ ___ ___ ___ ___ .
There are 8 choices for the first, 7 choices for the second, 6 choices for the third, etc. Thus,
there is a total of 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1 = 8! ways to line them up.
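The basic counting rules for ordered arrangements can be summarized as follows:
1. The number of ways to order k distinct objects is k!.
2. The number of ordered arrangements of k objects, chosen without replacement
from a collection of n objects, is
n(n − 1)(n − 2) ⋯ (n − k + 1) = n! / (n − k)!.
3. The number of ordered arrangements of k objects, chosen with replacement
from a collection of n objects, is n^k.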
Justification: In each of the three cases, we have k slots to fill:
In case (1), there are k choices for the first, k – 1 choices for the second, . . ., and only one
choice for the last slot. Hence there are k(k − 1) ⋯ 2 ⋅ 1 = k! possible orderings.
In case (2), there are n choices for the first, n – 1 choices for the second, . . ., and
n − (k − 1) = n − k + 1 choices for the last slot. Hence there is a total of
n(n − 1)(n − 2) ⋯ (n − k + 1) possible arrangements.
In case (3), we are choosing with replacement, so there are n choices for each slot, hence a
total of n ⋅ n ⋯ n = n^k possible arrangements.
NOTES These formulas are recorded mostly for future reference. When counting ordered
arrangements, it is best to simply use the basic counting principle directly—write out the
correct number of slots, fill in the number of choices for each slot, and multiply.
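The three cases in the justification can be written as small functions that multiply the number of choices slot by slot. This is an illustrative sketch; the function names are hypothetical and not from the text.

```python
import math

def orderings(k):
    """Case (1): ways to order k distinct objects, k!."""
    return math.factorial(k)

def ordered_without_replacement(n, k):
    """Case (2): n choices, then n - 1, ..., down to n - k + 1."""
    count = 1
    for choices in range(n, n - k, -1):
        count *= choices
    return count

def ordered_with_replacement(n, k):
    """Case (3): n choices for each of the k slots."""
    return n ** k

print(orderings(8))
print(ordered_without_replacement(10, 3))
print(ordered_with_replacement(10, 3))
```

The slot-by-slot product in case (2) agrees with n! / (n − k)!, which is why the formula is rarely needed when the slots can be written out directly.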
Example 4.26
A club at a local community college has 14 freshmen and 18 sophomores. Suppose that the
club will elect a president, vice president and treasurer.
a. What is the probability that the president will be a sophomore, and the other two
officers are freshmen?
b. What is the probability that all three officers are sophomores?
Solution 4.26:
First we note that there are a total of 32 members in the club; and no person can fill two
different posts, so there will be no repetitions. By the Fundamental Counting Principle,
there is a total of
32 x 31 x 30 = 29,760 ways to choose the officers.
a. There are 18 sophomores to choose from to serve as president. Then there are 14
freshmen, so there are 14 choices for vice president and 13 choices for treasurer. Thus,
there is a total of 18 x 14 x 13 = 3276 ways to get a sophomore president, and freshman vice
president and treasurer. So, the probability is 3276/29,760 = 0.110.
b. There are 18 sophomores to choose from. So, there are 18 choices for president, 17 choices
for vice president, and 16 choices for treasurer. Thus, there is a total of 18 x 17 x 16 = 4896
ways to get all sophomore officers. So, the probability is 4896/29,760 = 0.165.
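The counts in Solution 4.26 can be verified by multiplying the choices for each office. A short sketch, assuming each ordered slate of officers is equally likely; the variable names are illustrative.

```python
from fractions import Fraction

freshmen, sophomores = 14, 18
members = freshmen + sophomores

# President, vice president, treasurer: no person holds two posts.
total_ways = members * (members - 1) * (members - 2)

# a. Sophomore president, then two different freshmen.
ways_a = sophomores * freshmen * (freshmen - 1)

# b. All three officers are sophomores.
ways_b = sophomores * (sophomores - 1) * (sophomores - 2)

p_a = Fraction(ways_a, total_ways)
p_b = Fraction(ways_b, total_ways)

print(total_ways, ways_a, ways_b)
print(round(float(p_a), 3), round(float(p_b), 3))
```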
4.25 Suppose that a club has 20 freshmen and 22 sophomores. If they are to elect a
president, vice president and secretary, what is the probability that the officers will all be
freshmen? What is the probability that at least one sophomore is elected as an officer?
nPk = n! / (n − k)!
This formula is not always necessary in practice, because any time we want to count ordered
arrangements, we can just use the fundamental counting principle. However, the formula
is necessary to develop a formula for counting unordered arrangements.
nCk = n! / (k!(n − k)!) (read as “n choose k”)
To understand where this formula comes from, we think about the number of ordered
arrangements of k objects chosen without replacement from a set of n objects. We can form such
an arrangement in two steps: First, we select a subset of k objects from the set; then we put
them in order. By the Fundamental Principle of Counting, we have:
(number of ordered arrangements) = (number of subsets of size k) ⋅ (number of ways to order k objects)
We know that the number on the left is nPk = n! / (n − k)!. And we know that there are k! ways
to order k distinct objects. Finally, the number of subsets of size k is nCk, which is what we
are trying to find. So the equation above becomes: n! / (n − k)! = nCk ⋅ k!. Dividing both sides by
k!, we get nCk = n! / (k!(n − k)!).
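The relationship between ordered and unordered arrangements can be checked by brute force on a small case. The sketch below uses n = 5 and k = 3, which are arbitrary illustrative values; math.perm and math.comb require Python 3.8 or later.

```python
import math
from itertools import combinations, permutations

n, k = 5, 3
items = range(n)

# Enumerate ordered arrangements and unordered subsets directly.
ordered = len(list(permutations(items, k)))
subsets = len(list(combinations(items, k)))

# Each subset of size k can be put in order in k! ways.
relation_holds = ordered == subsets * math.factorial(k)

print(ordered, subsets, relation_holds)
print(math.perm(n, k), math.comb(n, k))
```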
NOTE
The numbers nCk are often called binomial coefficients, because they appear in the Binomial
Theorem, which you may remember from your high school algebra class. In the next
chapter they will also be used for an important probability distribution called the Binomial
distribution.
Permutations:
Order matters so look for key words like position titles, “in order” or “in a line”
Formula: nPk = n! / (n − k)!
Combinations:
Order does not matter so look for key words like “committee” or “group”
Formula: nCk = n! / (k!(n − k)!)
Note also that the formulas for nPk and nCk are built into the TI-83 and TI-84 calculators:
To calculate nPk and nCk, we press the MATH button and scroll to the PRB menu.
To find nPk, enter n, then press MATH >> PRB; press 2 (nPr) and then enter k and press
the ENTER key.
To find nCk, enter n, then press MATH >> PRB; press 3 (nCr) and then enter k and press
the ENTER key.
Example 4.27
A shipment of 40 DVD players contains four defective units. Suppose that a sample of four
players is selected, and that the entire shipment will be rejected if at least one DVD player
in the sample is defective.
Solution 4.27:
There are 40C4 = 91,390 ways to choose a sample of four DVD players from the shipment.
a. There are 36 non-defective players, so there are 36C4 = 58,905 ways to choose four non-
defective players. Thus, the probability is P(no defective players) = 58,905/91,390 = 0.6445,
or about 64.5%.
b. The probability calculated in part a is the probability that the shipment is not rejected.
So we can use the complement rule to get:
P(rejected) = 1 – P(no defective players) = 1 – 0.6445 = 0.3555.
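The combination counts in Solution 4.27 can be reproduced with math.comb. The sketch below uses the figures from the solution as written (40 players, of which 36 are non-defective, and a sample of four); the variable names are illustrative.

```python
import math

shipment = 40
non_defective = 36   # figure used in the solution above
sample = 4

total_samples = math.comb(shipment, sample)
good_samples = math.comb(non_defective, sample)

p_no_defective = good_samples / total_samples
p_rejected = 1 - p_no_defective  # complement rule

print(total_samples, good_samples)
print(round(p_no_defective, 4), round(p_rejected, 4))
```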
Example 4.28
Solution 4.28:
a. A committee is an unordered arrangement without replacement, so the formula above
applies: We are choosing 4 objects from a set of 16 objects, so the total number of
committees is 16C4. Using the calculator, we press 16, then go to MATH >> PRB and select
item 3 (nCr); then press 4 and ENTER to get 16C4 = 1820.
b. We will choose the committee in two stages: First select three administrators, then
select one teacher to complete the committee. There are 9C3 = 84 ways to choose three
administrators from among the nine. There are 7C1 = 7 ways to choose the teacher and
complete the committee. Thus there are (9C3)( 7C1) = 84 x 7 = 588 committees with three
administrators.
c. The probability that a committee has three administrators is
P(three administrators) = #(committees with three administrators) / (total # committees) = 588/1820 = 0.323.
d. For a committee to have at least three administrators, there are two
possibilities: either there are exactly three administrators, or there are exactly four
administrators. So, the probability we want is P(three administrators) + P(four administrators).
Thus, the probability that the committee consists entirely of administrators is 35/1820 =
.01923. Combining this with the result from part c we have:
e. We will do this one using the complement rule. If there is no teacher on the
committee, then the committee must consist entirely of administrators. That is, the
complement of the event “at least one teacher” is the event “all administrators”. Thus the
probability that the committee has at least one teacher is
P(at least one teacher) = 1 – P(all administrators).
Using the result from part (d), the probability that the committee has at least one teacher is
KEY TERMS
Chance Experiment An experiment whose result is not predetermined.
Complement The complement of event A (denoted as Aʹ) is the set of outcomes that are not
in A.
Conditional Probability The likelihood that an event will occur given that another event
has already occurred.
Dependent Events If two events are NOT independent, then we say that they are
dependent.
Event A subset of the set of all outcomes of an experiment; that is, an event is a subset of
the sample space S. Standard notations for events are capital letters such as A, B, C, and so
on.
Independent Events The occurrence of one event has no effect on the probability of the
occurrence of another event. Events A and B are independent if one of the following is true:
1. P(A|B) = P(A)
2. P(B|A) = P(B)
3. P(A AND B) = P(A)P(B)
Mutually Exclusive Two events are mutually exclusive if they have no outcomes in
common. Thus, if events A and B are mutually exclusive, then P(A and B) = 0.
Probability a number between zero and one, inclusive, that gives the likelihood that a
specific event will occur.
Sampling with Replacement means that each member of a population is replaced after it
is picked, so that there is a possibility that the same item is chosen more than once.
Sampling without Replacement means that items are not put back after being chosen,
so each member of a population may be chosen only once.
The AND Event The event A and B consists of all outcomes that are common to both
events. So, an outcome is in the event A and B if the outcome is in both A and in event B.
Tree Diagram A visual representation of a sample space and events in the form of a “tree”
with branches marked by possible outcomes together with associated probabilities
(frequencies, relative frequencies)
Venn Diagram A visual representation of a sample space and events in the form of circles
or ovals showing their intersections.
FORMULA REVIEW
Theoretical Probability: For a sample space with equally likely outcomes,
P(A) = (number of outcomes in A) / (number of outcomes in S)
Conditional Probability: P(A | B) = P(A and B) / P(B)
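Addition Rules:
P(A or B) = P(A) + P(B) − P(A and B)
If A and B are mutually exclusive, then P(A or B) = P(A) + P(B).
Multiplication Rules:
P(A and B) = P(A | B)∙P(B)
If A and B are independent, then P(A and B) = P(A)∙P(B).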
Exercises for Chapter 4
1. If the experiment is drawing two colored marbles from a box of 8 blue marbles, 10 green
marbles, and 5 purple marbles, then state the sample space.
2. If the experiment is testing 3 items from a shipment box that contains defective (D) and
non-defective (N) items, then state the sample space.
3. A bag with 7 letters (A, B, C, D, E, F, G) is shaken and two letters are drawn without
replacement. State the sample space.
4. In a particular college class, there are sophomore and freshman students. Some students
have a math class, and some students have an English class. Write the symbols for the
probabilities of the events for parts a through j. (Note that you cannot find numerical answers
here. You were not given enough information to find any probability values yet; concentrate
on understanding the symbols.)
5. A box is filled with several party favors. It contains 12 hats, 15 noisemakers, ten finger
traps, and five bags of confetti.
a. Find P(H).
b. Find P(N).
c. Find P(F).
d. Find P(C).
9. The experiment is rolling a fair, six-sided die numbered one through six.
a. What is the probability of rolling an even number of dots?
b. What is the probability of rolling a prime number of dots?
10. You see a game at a local fair. You must throw a dart at a color wheel. Each section on
the color wheel is equal in area.
a. Let B = the event of landing on blue. Find P(B)
b. Let R = the event of landing on red. Find P(R′)
c. Let G = the event of landing on green. Find P(G′)
d. Let Y = the event of landing on yellow. Find P(Y).
11. On a baseball team, there are infielders and outfielders. Some players are great hitters,
and some players are not great hitters. Let I = the event that a player is an infielder.
Let O = the event that a player is an outfielder. Let H = the event that a player is a
great hitter. Let N = the event that a player is not a great hitter.
a. Write the symbols for the probability that a player is not an outfielder.
b. Write the symbols for the probability that a player is an outfielder or is a great
hitter.
c. Write the symbols for the probability that a player is an infielder and is not a
great hitter.
d. Write the symbols for the probability that a player is a great hitter, given that the
player is an infielder.
e. Write the symbols for the probability that a player is an infielder, given that the
player is a great hitter.
f. Write the symbols for the probability that of all the outfielders, a player is not a
great hitter.
g. Write the symbols for the probability that of all the great hitters, a player is an
outfielder.
h. Write the symbols for the probability that a player is an infielder or is not a great
hitter.
i. Write the symbols for the probability that a player is an outfielder and is a great
hitter.
j. Write the symbols for the probability that a player is an infielder.
12. What is the word for the set of all possible outcomes?
14. A shelf holds 12 books. Eight are fiction and the rest are nonfiction. Each is a different
book with a unique title. The fiction books are numbered one to eight. The nonfiction
books are numbered one to four. Randomly select one book. Let F = event that book is
fiction. Let N = event that book is nonfiction. What is the sample space?
15. What is the sum of the probabilities of an event and its complement?
16. Use the following information to answer the next two exercises. You are rolling a fair,
six-sided number cube. Let E = the event that it lands on an even number. Let M = the
event that it lands on a multiple of three.
17. Let E and F be mutually exclusive events. P(E) = 0.4; P(F) = 0.5. Find P(E∣F).
19. Let U and V be mutually exclusive events. P(U) = 0.26; P(V) = 0.37. Find:
a. P(U AND V) =
b. P(U|V) =
c. P(U OR V) =
20. Let Q and R be independent events. P(Q) = 0.4 and P(Q AND R) = 0.1. Find P(R).
21. Forty-eight percent of all California registered voters prefer life in prison without
parole over the death penalty for a person convicted of first-degree murder. Among Latino
California registered voters, 55% prefer life in prison without parole over the death
penalty for a person convicted of first-degree murder. 37.6% of all Californians are Latino.
Let C = Californians (registered voters) preferring life in prison without parole over the
death penalty for a person convicted of first-degree murder. Let L = Latino Californians.
Suppose that one Californian is randomly selected.
a. Find P(C).
b. Find P(L).
c. Find P(C|L).
d. In words, what is C|L?
e. Find P(L AND C).
f. In words, what is L AND C?
g. Are L and C independent events? Show why or why not.
h. Find P(L OR C).
i. In words, what is L OR C?
j. Are L and C mutually exclusive events? Show why or why not.
22. The following table shows a random sample of musicians and when they learned to play
their instruments.
a. Find P(musician learned in elementary school).
b. Find P(musician learned in middle school AND had private instruction).
c. Find P(musician learned in elementary school OR is self-taught).
d. Are the events “learned in elementary school” and “learning music in school”
mutually exclusive events?
23. The probability that a man develops some form of cancer in his lifetime is 0.4567. The
probability that a man has at least one false positive test result (meaning the test comes
back positive for cancer when the man does not have it) is 0.51. Let: C = a man develops cancer in
his lifetime; P = man has at least one false positive. Construct a tree diagram of the
situation.
24. An article in the New England Journal of Medicine reported on a study of smokers in
California and Hawaii. In one part of the report, the self-reported ethnicity and smoking
levels per day were given. Of the people smoking at most ten cigarettes per day, there
were 9,886 African Americans, 2,745 Native Hawaiians, 12,831 Latinos, 8,378 Japanese
Americans, and 7,650 Whites. Of the people smoking 11 to 20 cigarettes per day, there
were 6,514 African Americans, 3,062 Native Hawaiians, 4,932 Latinos, 10,680 Japanese
Americans, and 9,877 Whites. Of the people smoking 21 to 30 cigarettes per day, there
were 1,671 African Americans, 1,419 Native Hawaiians, 1,406 Latinos, 4,715 Japanese
Americans, and 6,062 Whites. Of the people smoking at least 31 cigarettes per day, there
were 759 African Americans, 788 Native Hawaiians, 800 Latinos, 2,305 Japanese
Americans, and 3,970 Whites.
a. Complete the table using the data provided. Suppose that one person from the study
is randomly selected. Find the probability that person smoked 11 to 20 cigarettes per
day.
b. Suppose that one person from the study is randomly selected. Find the probability
that person smoked 31+ cigarettes per day.
c. Find the probability that the person was Latino.
d. In words, explain what it means to pick one person from the study who is “Japanese
American AND smokes 21 to 30 cigarettes per day.” Also, find the probability.
e. In words, explain what it means to pick one person from the study who is “Japanese
American OR smokes 21 to 30 cigarettes per day.” Also, find the probability.
f. In words, explain what it means to pick one person from the study that is “Japanese
American GIVEN that person smokes 21 to 30 cigarettes per day.” Also, find the
probability.
g. Prove that smoking level/day and ethnicity are dependent events.
25. The graph in Figure 4.11 displays the sample sizes and percentages of people in different
age and gender groups who were polled concerning their approval of Mayor Ford’s actions
in office. The total number in the sample of all the age groups is 1,045.
26. Explain what is wrong with the following statements. Use complete sentences.
a. If there is a 60% chance of rain on Saturday and a 70% chance of rain on Sunday,
then there is a 130% chance of rain over the weekend.
b. The probability that a baseball player hits a home run is greater than the
probability that he gets a successful hit.
27. The graph shown is based on more than 170,000 interviews done by Gallup that took
place from January through December 2012. The sample consists of employed Americans
18 years of age or older. The Emotional Health Index Scores are the sample space. We
randomly sample one Emotional Health Index Score.
[Figure: graph of Emotional Health Index Scores. Values shown: 79.6, 80, 80.1, 80.2, 80.3,
80.7, 80.9, 81.5, 81.7, 81.8, 82.3, 82.7, 82.7, 83.7]
28. According to Gallup’s annual Values and Beliefs poll, conducted May 3 – 18, 2021,
70% of surveyed adults support same-sex marriage. Republicans showed majority support
for the first time in 2021, at 55%. Among older adults (55+) surveyed, 6 out of 10 say they
favor same-sex marriage. Of the surveyed Republicans who support same-sex marriage,
42% are older adults.
• C = Surveyed adults who support same-sex marriage.
• B = Older adults surveyed who favor same-sex marriage.
• A = Surveyed Republicans who support same-sex marriage.
a. Find P(C).
b. Find P(B).
c. Find P(C|A).
d. Find P(B|C).
e. In words, what is C|A?
f. In words, what is B|C?
g. Find P(C AND B).
h. In words, what is C AND B?
i. Find P(C OR B)
j. Are C and B mutually exclusive events? Show why or why not.
29. After Rob Ford, the mayor of Toronto, announced his plans to cut budget costs in late
2011, Forum Research polled 1,046 people to measure the mayor’s popularity.
Everyone polled expressed either approval or disapproval. These are the results their
poll produced:
30. Two spinners are spun, and the sum recorded. One spinner is numbered 1 – 6 where each
number is equally possible, and the even numbers are gray. The other spinner is
numbered 1 – 5 where each number is equally possible, and the numbers 1 – 3 are gray.
a. Draw out the sample space shading the values where both spinners are gray.
Spinners 1 2 3 4 5 6
1
2
3
4
5
b. Find the probability of getting a 5, P(5).
c. Find the probability of getting a shaded box, P(shaded).
d. Find the probability of getting an even number, P(even number).
e. Is getting an even number the complement of getting an odd number? Why?
f. Find two mutually exclusive events.
g. Are the shaded events and the odd values independent?
31. Using the information from #30, compute the probability of the following events:
a. P(an odd number)
b. P(sum is 8)
c. P(sum is less than 5)
d. P(an even number and shaded)
e. P(multiple of 3)
f. P(greater than or equal to 9)
g. P(multiple of 4 or shaded multiple of 3)
32. Using the information from #30, compute the probability of the following events:
a. P(multiple of 4)
b. P(an odd number and shaded)
c. P(sum is 10)
d. P(prime number and shaded)
e. P(greater than 7 or even shaded)
f. P(odd shaded or less than 6)
33. Suppose that you have eight cards. Five are green and three are yellow. The five green
cards are numbered 1, 2, 3, 4, and 5. The three yellow cards are numbered 1, 2, and 3.
The cards are well shuffled. You randomly draw one card. Let G = the card drawn is green
and E = the card drawn is even-numbered.
a. List the sample space.
b. P(G) =
c. P(G|E) =
d. P(G AND E) =
e. P(G OR E) =
f. Are G and E mutually exclusive? Justify your answer numerically.
34. Roll two fair dice. Each die has six faces.
a. List the sample space.
b. Let A be the event that either a three or four is rolled first, followed by an even
number. Find P(A).
c. Let B be the event that the sum of the two rolls is at most seven. Find P(B).
d. In words, explain what “P(A|B)” represents. Find P(A|B).
e. Are A and B mutually exclusive events? Explain your answer in one to three complete
sentences, including numerical justification.
f. Are A and B independent events? Explain your answer in one to three complete
sentences, including numerical justification.
35. A special deck of cards has ten cards. Four are green, three are blue, and three are red.
When a card is picked, its color is recorded. An experiment consists of first picking a
card and then tossing a coin.
a. List the sample space.
b. Let A be the event that a blue card is picked first, followed by landing a head on the
coin toss. Find P(A).
c. Let B be the event that a red or green is picked, followed by landing a head on the coin
toss. Are the events A and B mutually exclusive? Explain your answer in one to three
complete sentences, including numerical justification.
d. Let C be the event that a red or blue is picked, followed by landing a head on the coin
toss. Are the events A and C mutually exclusive? Explain your answer in one to three
complete sentences, including numerical justification.
36. An experiment consists of first rolling a die and then tossing a coin.
a. List the sample space.
b. Let A be the event that either a three or a four is rolled first, followed by landing a
head on the coin toss. Find P(A).
c. Let B be the event that the first and second tosses land on heads. Are the events A
and B mutually exclusive? Explain your answer in one to three complete sentences,
including numerical justification.
37. An experiment consists of tossing a nickel, a dime, and a quarter. Of interest is the side
the coin lands on.
a. List the sample space.
b. Let A be the event that there are at least two tails. Find P(A).
c. Let B be the event that the first and second tosses land on heads. Are the events A
and B mutually exclusive? Explain your answer in one to three complete sentences,
including justification.
38. Consider the following scenario: Let P(C) = 0.4. Let P(D) = 0.5. Let P(C|D) = 0.6.
a. Find P(C AND D).
b. Are C and D mutually exclusive? Why or why not?
c. Are C and D independent events? Why or why not?
d. Find P(C OR D).
e. Find P(D|C).
40. Let G and H be mutually exclusive events with P(G) = 0.5 and P(H) = 0.3.
a. Explain why the following statement MUST be false: P(H|G) = 0.4.
b. Find P(H OR G).
c. Are G and H independent or dependent events? Explain in a complete sentence.
41. Approximately 281,000,000 people over age five live in the United States. Of these people,
55,000,000 speak a language other than English at home. Of those who speak another
language at home, 62.3% speak Spanish. Let: E = speaks English at home; E′ = speaks
another language at home; S = speaks Spanish; Finish each probability statement by
matching the correct answer.
42. In 2018, the U.S. government issued 695,525 Family Green Cards (permits for non-citizens
to work legally in the U.S.). Renate Deutsch, from Germany, was one of approximately
2.7 million total applications. Let G = awarded green card.
a. What was Renate’s chance of being awarded a Family Green Card? Write your answer
as a probability statement.
b. In the Fall of 2017, Renate received a letter stating she was one of 950,000 finalists
chosen. Once the finalists were chosen, assuming that each finalist had an equal
chance to win, what was Renate’s chance of being awarded a Family Green Card?
Write your answer as a conditional probability statement. Let F = was a finalist.
c. Are G and F independent or dependent events? Justify your answer numerically.
d. Are G and F mutually exclusive events? Justify your answer numerically and explain
why.
44. The following table of data obtained from www.baseball-almanac.com shows hit
information for four players. Suppose that one hit from the table is randomly selected.
Are "the hit being made by Hank Aaron" and "the hit being a double" independent events?
45. United Blood Services is a blood bank that serves more than 500 hospitals in 18 states.
According to their website, a person with type O blood and a negative Rh factor (Rh-) can
donate blood to any person with any blood type. Their data show that 43% of people have
type O blood and 15% of people have Rh- factor; 52% of people have type O or Rh- factor.
a. Find the probability that a person has both type O blood and the Rh- factor.
b. Find the probability that a person does NOT have both type O blood and the Rh-
factor.
46. At a college, 72% of courses have final exams and 46% of courses require research papers.
Suppose that 32% of courses have a research paper and a final exam. Let F be the event
that a course has a final exam. Let R be the event that a course requires a research paper.
a. Find the probability that a course has a final exam or a research project.
b. Find the probability that a course has NEITHER of these two requirements.
47. In a box of assorted cookies, 36% contain chocolate and 12% contain nuts. Of those, 8%
contain both chocolate and nuts. Sean is allergic to both chocolate and nuts.
a. Find the probability that a cookie contains chocolate or nuts (he can't eat it).
b. Find the probability that a cookie does not contain chocolate or nuts (he can eat it).
48. A college finds that 10% of students have taken a distance learning class and that 40% of
students are part time students. Of the part time students, 20% have taken a distance
learning class. Let D = event that a student takes a distance learning class and E = event
that a student is a part time student.
a. Find P(D AND E).
b. Find P(E|D).
c. Find P(D OR E).
d. Using an appropriate test, show whether D and E are independent.
e. Using an appropriate test, show whether D and E are mutually exclusive.
49. The table shows the political party affiliation of each of 67 members of the US Senate in
June 2012, and when they are up for reelection.
Up for reelection:   Democratic Party   Republican Party   Independent   Total
2016                 10                 24                 0
2018                 23                 8                  2
Total
a. What is the probability that a randomly selected senator has an “Independent”
affiliation?
c. What is the probability that a randomly selected senator is a Democrat and up for
reelection in November 2016?
e. Suppose that a member of the US Senate is randomly selected. Given that the
randomly selected senator is up for reelection in November 2016, what is the
probability that this senator is a Democrat?
f. Suppose that a member of the US Senate is randomly selected. What is the probability
that the senator is up for reelection in November 2018, knowing that this senator is a
Republican?
h. The events “Independent” and “Up for reelection in November 2016” are
i. mutually exclusive.
ii. independent.
iii. both mutually exclusive and independent.
iv. neither mutually exclusive nor independent.
50. The table below gives the number of suicides estimated in the U.S. for a recent year by
age, race (black or white), and sex. We are interested in possible relationships between
age, race, and sex. We will let suicide victims be our population.
Race and Sex     1 – 14   15 – 24   25 – 64   Over 64   Totals
white, male      210      3360      13,610              22,050
white, female    80       580       3380                4930
black, male      10       460       1060                1670
black, female    0        40        270                 330
all others
Totals           310      4650      18,780              29,760
Do not include "all others" for parts f and g.
a. Fill in the column for the suicides for individuals over age 64.
b. Fill in the row for all other races.
c. Find the probability that a randomly selected individual was a white male.
d. Find the probability that a randomly selected individual was a black female.
e. Find the probability that a randomly selected individual was black.
f. Find the probability that a randomly selected individual was male.
g. Out of the individuals over age 64, find the probability that a randomly selected
individual was a black or white male.
51. The table of data obtained from www.baseball-almanac.com shows hit information for
four well known baseball players. Suppose one hit from the table is randomly selected.
Name              Single   Double   Triple   HR     Total
Babe Ruth         1517     506      136      714    2873
Jackie Robinson   1054     273      54       137    1518
Ty Cobb           3603     174      295      114    4189
Hank Aaron        2294     624      98       755    3771
Total             8471     1577     583      1720   12351
a. Find P(hit was made by Babe Ruth).
i. 1518/2873
ii. 2873/12351
iii. 583/12351
iv. 4189/12351
52. The following table identifies a group of children by one of four hair colors, and by type of
hair.
53. In a previous year, the weights of the members of the San Francisco 49ers and the Dallas
Cowboys were published in the San Jose Mercury News. The factual data were compiled
into the following table. For the following, suppose that you randomly select one player
from the 49ers or Cowboys.
54. This tree diagram shows the tossing of an unfair coin followed by drawing one bead from
a cup containing three red (R), four yellow (Y) and five blue (B) beads. For the coin, P(H)
= 2/3 and P(T) = 1/3, where H is heads and T is tails.
55. A box of cookies contains three chocolate and seven butter cookies. Miguel randomly
selects a cookie and eats it. Then he randomly selects another cookie and eats it. (How
many cookies did he take?)
a. Draw the tree that represents the possibilities for the cookie selections. Write the
probabilities along each branch of the tree.
b. Are the probabilities for the flavor of the SECOND cookie that Miguel selects
independent of his first selection? Explain.
c. For each complete path through the tree, write the event it represents and find the
probabilities.
d. Let S be the event that both cookies selected were the same flavor. Find P(S).
e. Let T be the event that the cookies selected were different flavors. Find P(T) by two
different methods: by using the complement rule and by using the branches of the
tree. Your answers should be the same with both methods.
f. Let U be the event that the second cookie selected is a butter cookie. Find P(U).
56. In a previous year, the weights of the members of the San Francisco 49ers and the Dallas
Cowboys were published in the San Jose Mercury News. The factual data are compiled
into the following table:
Shirt # ≤ 210 211-250 251 – 290 290 ≤
1 – 33 21 5 0 0
34 – 66 6 18 7 4
67 – 99 6 12 22 5
For the following, suppose that you randomly select one player from the 49ers or Cowboys.
If having a shirt number from one to 33 and weighing at most 210 pounds were
independent events, then what should be true about P(Shirt# 1–33|≤ 210 pounds)?
57. The probability that a male develops some form of cancer in his lifetime is 0.4567. The
probability that a male has at least one false positive test result (meaning the test comes
back for cancer when the man does not have it) is 0.51. Some of the following questions
do not have enough information for you to answer them. Write “not enough information”
for those answers. Let C = a man develops cancer in his lifetime and P = man has at least
one false positive.
a. P(C) =
b. P(P|C) =
c. P(P|C') =
d. If a test comes up positive, based upon numerical values, can you assume that man
has cancer? Justify numerically and explain why or why not.
58. Given events G and H: P(G) = 0.43; P(H) = 0.26; P(H AND G) = 0.14
a. Find P(H OR G).
b. Find the probability of the complement of event (H AND G).
c. Find the probability of the complement of event (H OR G).
59. Given events J and K: P(J) = 0.18; P(K) = 0.37; P(J OR K) = 0.45
a. Find P(J AND K).
b. Find the probability of the complement of event (J AND K).
c. Find the probability of the complement of event (J or K).
60. Suppose that you have eight cards. Five are green and three are yellow. The cards are
well shuffled. Suppose that you randomly draw two cards, one at a time, with
replacement. Let G1 = first card is green. Let G2 = second card is green.
a. Draw a tree diagram of the situation.
b. Find P(G1 AND G2).
c. Find P(at least one green).
d. Find P(G2|G1).
e. Are G2 and G1 independent events? Explain why or why not.
61. Suppose that you have eight cards. Five are green and three are yellow. The cards are
well shuffled. Suppose that you randomly draw two cards, one at a time, without
replacement. Let G1 = first card is green. Let G2 = second card is green.
a. Draw a tree diagram of the situation.
b. Find P(G1 AND G2).
c. Find P(at least one green).
d. Find P(G2|G1).
e. Are G2 and G1 independent events? Explain why or why not.
Use the following information to answer the next two exercises. The percent of licensed U.S.
drivers (from a recent year) that are female is 48.60. Of the females, 5.03% are age 19 and
under; 81.36% are age 20–64; 13.61% are age 65 or over. Of the licensed U.S. male drivers,
5.04% are age 19 and under; 81.43% are age 20–64; 13.53% are age 65 or over.
63. Suppose that 10,000 U.S. licensed drivers are randomly selected.
a. How many would you expect to be male?
b. Using the table or tree diagram, construct a contingency table of gender versus age
group.
c. Using the contingency table, find the probability that out of the age 20–64 group, a
randomly selected driver is female.
64. Approximately 86.5% of Americans commute to work by car, truck, or van. Out of that
group, 84.6% drive alone and 15.4% drive in a carpool. Approximately 3.9% walk to work
and approximately 5.3% take public transportation.
a. Construct a table or a tree diagram of the situation. Include a branch for all other
modes of transportation to work.
b. Assuming that the walkers walk alone, what percent of all commuters travel alone to
work?
c. Suppose that 1,000 workers are randomly selected. How many would you expect to
travel alone to work?
d. Suppose that 1,000 workers are randomly selected. How many would you expect to
drive in a carpool?
65. When the Euro coin was introduced in 2002, two math professors had their statistics
students test whether the Belgian one Euro coin was a fair coin. They spun the coin rather
than tossing it and found that out of 250 spins, 140 showed a head (event H) while 110
showed a tail (event T). On that basis, they claimed that it is not a fair coin.
a. Based on the given data, find P(H) and P(T).
b. Use a tree to find the probabilities of each possible outcome for the experiment of
tossing the coin twice.
c. Use the tree to find the probability of obtaining exactly one head in two tosses of the
coin.
d. Use the tree to find the probability of obtaining at least one head.
66. Use the following information to answer the next two exercises. The following are real
data from Santa Clara County, CA. As of a certain time, there had been a total of 3,059
documented cases of AIDS in the county. They were grouped into the following categories:
          Homosexual/Bisexual   IV Drug User*   Heterosexual Contact   Other   Totals
Female    0                     70              136                    49
Male      2146                  463             60                     135
Totals
67. Answer these questions using probability rules. Do NOT use the contingency table. Three
thousand fifty-nine cases of AIDS had been reported in Santa Clara County, CA, through
a certain date. Those cases will be our population. Of those cases, 6.4% obtained the
disease through heterosexual contact and 7.4% are female. Out of the females with the
disease, 53.3% got the disease from heterosexual contact.
a. Find P(Person is female).
b. Find P(Person obtained the disease through heterosexual contact).
c. Find P(Person is female GIVEN person got the disease from heterosexual contact)
d. Construct a Venn diagram representing this situation. Make one group females and
the other group heterosexual contact. Fill in all values as probabilities.
68. Suppose 22% of the population are 65 or older, 28% of those 65 or older have loans, and
55% of those younger than 65 have loans. Find the probabilities that a person fits into
the following categories.
a. 65 or older and has a loan.
b. Has a loan.
c. Younger than 65 and does not have a loan.
d. Are the events that a person is 65 or older and that the person has a loan
independent?
69. In an election with 4 candidates for one office and 8 candidates for another office, how
many different ballots may be printed?
70. How many different 4-letter call letters for a student radio station can be made if
a. The first letter must be an M or a K and no letter may be repeated?
b. Repeats are allowed but the first letter is an M or a K?
72. A financial advisor gives her client 9 potential investments and asks her to select and
rank her top 5. In how many ways can she do this?
73. An internet installer is given 12 customers that need installation to be completed by the
end of the day. How many different possible ways can the list of customers be arranged?
75. In a club with 12 varsity and 10 junior varsity members, a 5-member committee will be
randomly chosen. Find the probability that the committee contains the following,
a. All junior varsity members
b. 3 varsity and 2 junior varsity members
c. At least 4 junior varsity members
REFERENCES
4.1 Terminology
Lopez, Shane, Preety Sidhu. “U.S. Teachers Love Their Lives, but Struggle in the Workplace.”
Gallup Wellbeing, 2013. http://www.gallup.com/poll/161516/teachers-love-lives-struggle-
workplace.aspx (accessed May 2, 2013).
DiCamillo, Mark, Mervin Field. “The Field Poll.” Field Research Corporation. Available online at
http://www.field.com/
Rider, David, “Ford support plummeting, poll suggests,” The Star, September 14, 2011. Available
online at http://www.thestar.com/news/gta/2011/09/14/ford_support_plummeting_poll_suggests.html
(accessed May 2, 2013).
“Mayor’s Approval Down.” News Release by Forum Research Inc. Available online at
http://www.forumresearch.com/
forms/News Archives/News Releases/74209_TO_Issues_-
Shin, Hyon B., Robert A. Kominski. “Language Use in the United States: 2007.” United States
Census Bureau. Available online at http://www.census.gov/hhes/socdemo/language/data/acs/ACS-
12.pdf (accessed May 2, 2013).
Data from The Roper Center: Public Opinion Archives at the University of Connecticut. Available
online at http://www.ropercenter.uconn.edu/ (accessed May 2, 2013).
4.4 Contingency Tables
Data from the National Center for Health Statistics, part of the United States Department of Health
and Human Services. Data from United States Senate. Available online at www.senate.gov (accessed
May 2, 2013).
Haiman, Christopher A., Daniel O. Stram, Lynn R. Wilkens, Malcom C. Pike, Laurence N. Kolonel,
Brien E. Henderson, and Loīc Le Marchand. “Ethnic and Racial Differences in the Smoking-Related
Risk of Lung Cancer.” The New England Journal of Medicine, 2013. Available online at
http://www.nejm.org/doi/full/10.1056/NEJMoa033250 (accessed May 2, 2013).
Samuel, T. M. “Strange Facts about RH Negative Blood.” eHow Health, 2013. Available online at
http://www.ehow.com/
facts_5552003_strange-rh-negative-blood.html (accessed May 2, 2013).
“United States: Uniform Crime Report – State Statistics from 1960–2011.” The Disaster Center.
Available online at http://www.disastercenter.com/crime/ (accessed May 2, 2013).
Data from Santa Clara County Public H.D. Data from the American Cancer Society.
Data from The Data and Story Library, 1996. Available online at http://lib.stat.cmu.edu/DASL/
(accessed May 2, 2013). Data from the Federal Highway Administration, part of the United States
Department of Transportation.
Data from the United States Census Bureau, part of the United States Department of Commerce.
Data from USA Today.
“Search for Datasets.” Roper Center: Public Opinion Archives, University of Connecticut., 2013.
Available online at http://www.ropercenter.uconn.edu/data_access/data/search_for_datasets.html
(accessed May 2, 2013).
CHAPTER 4 SOLUTIONS:
25) a. Answers vary (e.g., individual is male, individual is between 18 and 34, ...)
b. 40% of total approve of Mayor Ford’s action in office
c. 60% of total disapprove …
d. 30% of the 18-34 years old approve of Mayor …
e. 45.7% f. 63% g. 40% h. 44% i. 78.9% j. 30%
27) a. 1/7 b. 0 c. ½ d. 5/14 e. 2/7 f. 3/14 g. 5/7 h. Physician i. Service j. 4.1 k. 81.3 l. ½
29) a. 1046 b. 58% c. 439 d. 57% e. 60%
31) a. 1/2 b. 4/30 = 2/15 c. 6/30 = 1/5 d. 3/30 = 1/10 e. 10/30 = 1/3 f. 6/30 = 1/5
g. 10/30 = 1/3
33) a. S = {G1, G2, G3, G4, G5, Y1, Y2, Y3} b. P(G) = 5/8 c. P(G | E) = 2/3
d. P(G and E) = ¼ e. P(G or E) = ¾ f. Not mutually exclusive since P(G and E) ≠ 0
35) a. S = {GH, GT, BH, BT, RH, RT} b. 3/10*1/2 = 3/20 c. Yes, mutually exclusive
since they can’t happen at the same time.
37) a. S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT} b. P(A) = ½
c. Mutually exclusive, because A and B can’t happen at the same time: P(A AND B) = 0.
51) a. ii b. ii
53) a. 26/106 b. 33/106 c. 21/106 d. 38/106 e. 21/33
55) a.
b. Not independent since there is no replacement
(eaten)
d. 48/90
e. 42/90
f. 63/90
57) a. .4567 b. 0, based on the tree diagram from problem 23 c. 0.51, based on the tree
diagram
d. No because at least 1 false positive happens over 50% of time.
61) a.
b. 5/14
c. 25/28
d. 4/7
63) a. 5140
b.
< 20 20 – 64 >64 Totals
Female 0.0244 0.3954 0.0661 0.486
Male 0.0259 0.4186 0.0695 0.514
Totals 0.0503 0.8140 0.1356 1
c. .4857
d. 504/625
69) 32 ballots
71) a. 17,576,000 b. 15,600,000
73) 12!
75) a. 10C5 / 22C5
b. (12C3 ⋅ 10C2) / 22C5
c. (10C4 ⋅ 12C1) / 22C5 + 10C5 / 22C5
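The counting answers to Exercise 75 can be checked numerically. The following is a minimal sketch, not part of the original solutions; the names VARSITY, JUNIOR, SIZE, and p_exactly are hypothetical and chosen only for illustration. It assumes every 5-member committee drawn from the 22 club members is equally likely.

```python
from fractions import Fraction
from math import comb

# Club from Exercise 75: 12 varsity and 10 junior varsity members, committee of 5.
VARSITY, JUNIOR, SIZE = 12, 10, 5
total = comb(VARSITY + JUNIOR, SIZE)  # number of equally likely committees, 22C5


def p_exactly(k_junior):
    """Probability that the committee has exactly k_junior junior varsity members."""
    ways = comb(JUNIOR, k_junior) * comb(VARSITY, SIZE - k_junior)
    return Fraction(ways, total)


p_all_junior = p_exactly(5)                   # part a: 10C5 / 22C5
p_three_varsity = p_exactly(2)                # part b: (12C3 * 10C2) / 22C5
p_at_least_four = p_exactly(4) + p_exactly(5) # part c: exactly 4 or exactly 5

for label, p in [("a", p_all_junior), ("b", p_three_varsity), ("c", p_at_least_four)]:
    print(label, p, float(p))
```

Using exact fractions avoids rounding while checking the answers; float(p) gives the decimal form if one is preferred.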
5 | DISCRETE RANDOM VARIABLES
Introduction
Chapter Objectives
A student takes a ten-question, true-false quiz. Because the student had such a busy
schedule, he or she could not study and guesses randomly at each answer. What is the
probability of the student passing the test with at least a 70%?
Small companies might be interested in the number of long-distance phone calls their
employees make during the peak time of the day. Suppose the average is 20 calls. What is
the probability that the employees make more than 20 long-distance phone calls during the
peak time?
These two examples illustrate two different types of probability problems involving discrete
random variables. Recall that discrete data are data that you can count. A random variable
is a numerical variable that describes the outcomes of a statistical experiment in words. The
values of a random variable can vary with each repetition of an experiment. Thus, a random
variable is a variable whose values are determined by chance.
We will use upper case letters such as X or Y to denote a random variable. Lower case letters
like x or y denote the value of a random variable. If X is a random variable, then X is
described in words, and the value x is given as a number.
For example, let X = the number of heads you get when you toss three fair coins. The sample
space for the toss of three fair coins is {TTT; THH; HTH; HHT; HTT; THT; TTH; HHH}. So
the values of X are x = 0, 1, 2, 3. Notice that for this example, the x values are countable
outcomes. Because you can count the possible values that X can take on and the outcomes
are random (the x values 0, 1, 2, 3), X is a discrete random variable.
Example 5.1
A child psychologist is interested in the number of times a newborn baby's crying wakes its
mother after midnight. For a random sample of 50 mothers, the following information was
obtained.
Let X = the number of times per week a newborn baby's crying wakes its mother after
midnight. For this example, x = 0, 1, 2, 3, 4, 5. Let P(x) = probability that X takes on a
value x. Then the following is the probability distribution for X :
x P(x)
0 2/50
1 11/50
2 23/50
3 9/50
4 4/50
5 1/50
Solution 5.1
5.1 A hospital researcher is interested in the number of times the average post-op patient
will ring the nurse during a 12-hour shift. Let X = the number of times a patient rings the
nurse during a 12-hour shift. For this exercise, the possible values are x = 0, 1, 2, 3, 4, 5.
P(x) = the probability that X takes on value x. For a random sample of 50 patients, the
following information was obtained:
x P(x)
0 4/50
1 8/50
2 16/50
3 14/50
4 6/50
5 2/50
These two examples show that we can think of a relative frequency distribution as an
approximation to a probability distribution. But using ideas from Chapter 4, we can calculate
the actual PDF for many probability experiments.
Example 5.2
Suppose that we toss a coin 4 times in succession. Then we can write the sample space as
follows:
TTTT, HTTT, THTT, TTHT, TTTH, HHTT, THHT, TTHH,
THTH, HTHT, HTTH, THHH, HTHH, HHTH, HHHT, HHHH.
Let X = the number of heads in four tosses; so the values of X are x = 0, 1, 2, 3, 4.
Find the PDF for the random variable X.
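One way to build this PDF is to enumerate the sample space and count the heads in each outcome. The sketch below is illustrative only; it assumes the coin is fair, so that all 16 sequences are equally likely, and the function name heads_pdf is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def heads_pdf(n_tosses):
    """Return {x: P(X = x)} where X = number of heads in n_tosses fair-coin tosses."""
    # Every sequence of H's and T's is assumed equally likely (fair coin).
    outcomes = list(product("HT", repeat=n_tosses))
    counts = Counter(seq.count("H") for seq in outcomes)
    return {x: Fraction(counts[x], len(outcomes)) for x in sorted(counts)}


pdf = heads_pdf(4)
for x, p in pdf.items():
    print(x, p)
```

The same function can be called with n_tosses = 3 to check the PDF for three tosses.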
5.2 Suppose that a fair coin is tossed three times in succession. List the outcomes in the
sample space as sequences of H’s and T’s. Let X = the number of heads in three tosses.
Construct the PDF for X.
Example 5.3
Suppose that we toss a pair of fair dice. Then we can represent the sample space using the
table:
Let X = number of dots facing up in a single toss. Find the PDF for the random variable X.
Solution 5.3:
x      2     3     4     5     6     7     8     9     10    11    12
P(x)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
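The probabilities for the sum of two fair dice can be reproduced by counting how many of the 36 equally likely (first die, second die) pairs give each sum. This is a small sketch under that assumption; the function name two_dice_pdf is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def two_dice_pdf():
    """Return {x: P(X = x)} where X = sum of the faces of two fair six-sided dice."""
    # Count the (first die, second die) pairs that produce each sum.
    sums = Counter(a + b for a in range(1, 7) for b in range(1, 7))
    return {x: Fraction(n, 36) for x, n in sorted(sums.items())}


dice_pdf = two_dice_pdf()
for x, p in dice_pdf.items():
    print(x, p)
```

Note that Fraction reduces automatically, so 6/36 is printed as 1/6.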
Finally, we can represent a PDF graphically using a histogram. For example, the
histogram for the PDF obtained by rolling two fair dice would be:
[Histogram: horizontal axis X = sum of dice; vertical axis P(X), scaled from 0 to 0.1946]
For example, suppose that we toss a fair coin and record the result. Although the probability
of getting heads is 1/2, this does not mean that exactly half of the tosses in multiple trials
will land heads. For instance, if you flip a coin two times, does probability tell you that
these flips will result in one head and one tail? No; you might even toss a fair coin ten times
and record nine heads. As we learned in Chapter 3, probability does not describe the short-term
results of an experiment. Instead, it gives information about what can be expected in the long
term. To demonstrate this principle, the British statistician Karl Pearson once tossed a fair
coin 24,000 times! He recorded the results of each toss, obtaining heads 12,012 times. In his
experiment, Pearson illustrated the Law of Large Numbers.
The Law of Large Numbers states that, as the number of trials in a probability experiment
increases, the difference between the theoretical probability of an event and the relative
frequency approaches zero (the theoretical probability and the relative frequency get closer
and closer together). When evaluating the long-term results of statistical experiments, we
often want to know the “average” outcome. This “long-term average” is known as the mean
or expected value of the experiment and is denoted by the Greek letter μ. In other words,
after conducting many trials of an experiment, you would expect this average value.
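The Law of Large Numbers can be explored with a simulated coin. The sketch below is only an illustration: it uses Python's pseudo-random generator in place of a physical coin, the function name running_relative_frequency and the checkpoint values are hypothetical choices, and the exact numbers printed depend on the seed.

```python
import random


def running_relative_frequency(n_tosses, p_heads=0.5, seed=0):
    """Simulate coin tosses and record the relative frequency of heads at checkpoints."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = 0
    checkpoints = {}
    for i in range(1, n_tosses + 1):
        heads += rng.random() < p_heads  # True counts as 1, False as 0
        if i in (10, 100, 1000, 10000, n_tosses):
            checkpoints[i] = heads / i
    return checkpoints


result = running_relative_frequency(24000)
for n, freq in result.items():
    # Show the number of tosses, the relative frequency, and its distance from 0.5.
    print(n, round(freq, 4), round(abs(freq - 0.5), 4))
```

In a typical run the distance from 0.5 is noticeably smaller after 24,000 tosses than after 10, although it need not shrink at every single step.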
Let X be a random variable, and let x1, x2, . . . , xn be the list of possible outcomes for X.
Then the mean of the distribution and expected value of X are the same quantity, given by:
μ = E(X) = Σ xi P(xi)
That is, to find the expected value or long-term average, μ, simply multiply each value of
the random variable by its probability and add the products.
Example 5.4
A men's soccer team plays soccer zero, one, or two days a week. The probability that they play
zero days is 0.2, the probability that they play one day is 0.5, and the probability that they
play two days is 0.3. Find the long-term average or expected value, μ, of the number of days
per week the men's soccer team plays soccer.
Solution 5.4:
To do the problem, first let the random variable X = the number of days the men's soccer
team plays soccer per week. Then X takes on the values 0, 1, 2. Construct a PDF table,
and add an extra column labeled x*P(x). In this column, you will multiply each x value by
its probability:
x P(x) x*P(x)
0 0.2 0*(0.2) = 0
1 0.5 1*(0.5) = 0.5
2 0.3 2*(0.3) = 0.6
This table is called an expected value table, and helps you calculate the expected value or
long-term average. In this case, we have μ = E(X) = 0 + 0.5 + 0.6 = 1.1.
This means that the men’s soccer team would, on average, expect to play 1.1 days per week.
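The expected value table for Example 5.4 can also be built programmatically. This is a minimal Python sketch (Python and the names `expected_value` and `soccer_pdf` are assumptions made for illustration); it multiplies each value by its probability and adds the products.

```python
def expected_value(pdf):
    """Return the sum of x * P(x) for a PDF given as a dict {x: P(x)}."""
    return sum(x * p for x, p in pdf.items())

# PDF from Example 5.4: number of days per week the team plays soccer
soccer_pdf = {0: 0.2, 1: 0.5, 2: 0.3}

for x, p in soccer_pdf.items():
    print(x, p, x * p)  # one row of the expected value table: x, P(x), x*P(x)

mu = expected_value(soccer_pdf)
print("expected value:", round(mu, 2))
```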
We can also calculate the variance and standard deviation of a random variable; again, this
will be similar to the way we found these statistics for a relative frequency distribution in
Chapter 3. That is, we will take a weighted average of the squared deviations from the mean.
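Let X be a random variable, and let x1, x2, . . . , xn be the list of possible outcomes for X.
Then the variance of X is the weighted average of the squared deviations from the mean,
σ² = Σ (xi – μ)² P(xi),
and the standard deviation is its square root, σ = √(σ²).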
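As a sketch of the "weighted average of the squared deviations" idea, the following Python snippet (Python and the function name are assumptions for illustration) computes the mean, the variance Σ (x – μ)² P(x), and the standard deviation for the soccer PDF of Example 5.4.

```python
import math

def mean_variance_sd(pdf):
    """Return (mean, variance, standard deviation) for a PDF given as {x: P(x)}."""
    mu = sum(x * p for x, p in pdf.items())
    variance = sum((x - mu) ** 2 * p for x, p in pdf.items())  # weighted squared deviations
    return mu, variance, math.sqrt(variance)

# PDF from Example 5.4
soccer_pdf = {0: 0.2, 1: 0.5, 2: 0.3}
mu, variance, sd = mean_variance_sd(soccer_pdf)
print(round(mu, 2), round(variance, 2), round(sd, 2))
```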
Example 5.5
Refer to Example 5.1. Find the expected value of the number of times a newborn baby's
crying wakes its mother after midnight. The expected value is the expected number of
times per week a newborn baby's crying wakes its mother after midnight. Calculate the
standard deviation of the variable as well.
Solution 5.5:
We start with the PDF from Example 5.1, then add columns showing xP(x) and
(x – μ)²P(x):
Summing the entries in the third column, we get μ = E(X) = 2.1. So, on average, we expect
a newborn to wake its mother after midnight 2.1 times per week. We then use this value
to complete the fourth column. Add the values in the fourth column of the table to get the
variance:
While doing an example using these tables is instructive, we can do these calculations quickly
and efficiently using the TI-84. The procedure is exactly as we did for calculating the mean
and standard deviation of a frequency distribution in Chapter 3:
5.5 A hospital researcher is interested in the number of times the average post-op patient
will ring the nurse during a 12-hour shift. For a random sample of 50 patients, the
following information was obtained.
x P(x)
0 0.08
1 0.16
2 0.32
3 0.28
4 0.12
5 0.04
Find the expected value and the standard deviation of this random variable.
Example 5.6
Suppose you play a game of chance in which five numbers are chosen from 0, 1, 2, 3, 4, 5, 6,
7, 8, 9. A computer randomly selects five numbers from zero to nine with replacement. You
pay $2 to play and could profit $100,000 if you match all five numbers in order (you get
your $2 back plus $100,000). Over the long term, what is your expected profit of playing
the game?
Solution 5.6
To win, you must get all five numbers correct, in order. The probability of choosing one
correct number is 1/10 because there are ten numbers. We are sampling with replacement,
so the selections are independent of one another; thus the probability of choosing all five
correctly is (1/10)(1/10)(1/10)(1/10)(1/10) = 0.00001. Therefore, the probability of winning is
0.00001 and the probability of losing is 1 − 0.00001 = 0.99999. Thus, we have the PDF:
x P(x)
Win 100,000 0.00001
Loss -2 0.99999
The expected value is μ = (100,000)(0.00001) + (–2)(0.99999) = 1 – 1.99998 = –0.99998.
Since –0.99998 is about –1, on average you should expect to lose approximately $1 for each
game you play. Note that you will never lose exactly $1 in a single game; the only potential
outcomes are winning $100,000 or losing $2. But if you play the game many, many times, the
average loss will be about $1 per game.
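The arithmetic in Example 5.6 can be checked with exact fractions. The sketch below is in Python (an assumption of this illustration, as are the variable names); it builds the two-outcome PDF and computes the expected profit.

```python
from fractions import Fraction

# Five independent picks, each correct with probability 1/10
p_win = Fraction(1, 10) ** 5
p_lose = 1 - p_win

# Profit in dollars -> probability, as in the PDF table of Example 5.6
profit_pdf = {100_000: p_win, -2: p_lose}

expected_profit = sum(x * p for x, p in profit_pdf.items())
print(float(p_win), float(p_lose))
print(float(expected_profit))  # negative: a long-run average loss per game
```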
5.5 You are playing a game of chance in which four cards are drawn from a standard deck
of 52 cards. You guess the suit of each card before it is drawn. The cards are replaced in the
deck on each draw. You pay $1 to play. If you guess the right suit every time, you get your
money back and $256. What is your expected profit of playing the game over the long term?
Example 5.7
Toss a pair of fair six-sided dice. Recall from Example 5.3 that the sample space has 36
outcomes:
Solution 5.7:
Using the sample space, we have the following PDF:
X P(x)
0 9/36
1 18/36
2 9/36
Example 5.8
According to a YouGov survey of 1354 US Adults on Dec 27, 2019, 21% of all respondents
clean their cell phone daily.
While at the mall, suppose you make a bet with a friend that a randomly chosen person at
the mall cleans their cell phone daily. If you win the bet, you win $50. If you
lose the bet, you pay $20. Let X = the amount of profit from a bet.
If you bet many times, will you come out ahead? Explain your answer in a complete
sentence using numbers. What is the standard deviation of X?
Solution 5.8:
We are given that P(win) = P(the person cleans their cell phone daily) = .21
Thus, we also have P(loss) = P(the person does not clean their cell phone daily)
= 1 – .21 = .79
And if we win, then we win $50, so x = 50. If we lose, we pay $20, so x = -20.
Thus, we have the PDF:
x P(x)
Win 50 0.21
Loss -20 0.79
From the calculator, we have μ = 50(0.21) + (–20)(0.79) = –5.30 and σ ≈ 28.51.
If you make this bet many times under the same conditions, the long-term outcome will be
an average loss of $5.30 per bet.
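The mean and standard deviation for the bet in Example 5.8 can also be computed directly from the PDF. This is a minimal Python sketch (Python and the variable names are assumptions for illustration), not the calculator procedure the text refers to.

```python
import math

# Profit in dollars -> probability, from the PDF of Example 5.8
bet_pdf = {50: 0.21, -20: 0.79}

mu = sum(x * p for x, p in bet_pdf.items())                     # expected profit per bet
variance = sum((x - mu) ** 2 * p for x, p in bet_pdf.items())   # weighted squared deviations
sigma = math.sqrt(variance)

print("mean:", round(mu, 2))
print("standard deviation:", round(sigma, 2))
```

A negative mean corresponds to an average loss per bet over many repetitions.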
5.8 According to the YouGov survey on Dec 27, 2019, of 1354 US Adults, 8% of those
surveyed never wash their cell phone. As in Example 5.8, you make a bet with your friend
that a randomly surveyed person at the mall never washes their cell phone. If you win the
bet, you win $100. If you lose the bet, you pay $10. Let X = the amount of profit from a bet.
Find the mean and standard deviation of X.
So far, we have presented discrete probability distribution functions in table form. But for
some distributions, we can come up with a formula that describes the probabilities in the
distribution. In the next few sections, we will investigate some well-known discrete
probability functions: the binomial, geometric, and Poisson distributions.
A probability distribution function is a pattern. Each distribution has its own special
characteristics. Learning the characteristics of each enables us to distinguish among
the different distributions, and to match a given probability problem to the correct pattern.
5.3 | Binomial Distribution
The binomial distribution is a special discrete probability distribution. There are four
conditions that an experiment has to meet to be considered a binomial experiment:
1. There is a fixed number of trials. Think of trials as repetitions of an experiment. The
letter n denotes the number of trials.
2. There are only two possible outcomes, called "success" and "failure," for each trial.
3. The n trials are independent and are repeated using identical conditions. Because the
n trials are independent, the outcome of one trial does not help in predicting the
outcome of another trial.
4. The letter p denotes the probability of a success on one trial, and q denotes the
probability of a failure on one trial, so p + q = 1. Since the trials are independent, p
remains the same for each trial.
Any experiment that has characteristics two, three, four, and where n = 1 is called a
Bernoulli Trial (named after Jacob Bernoulli who, in the late 1600s, studied them
extensively). A binomial experiment takes place when the number of successes is counted in
one or more Bernoulli Trials.
Example 5.9
Randomly guessing at a multiple-choice question with 4 possible answers has only two
outcomes. If a success is guessing correctly, then a failure is guessing incorrectly. Suppose
there are 6 multiple choice questions. Joe guesses on each question with no pattern.
Determine if this is a binomial experiment.
Solution 5.9
Condition 1 is met since there are a fixed number of trials: n = 6 questions
Condition 2 is met since there are two outcomes: each question is either correct or incorrect.
Condition 3 is met since each trial is independent: Joe guessing with no pattern on each
question does not help in predicting the outcome of another question.
Condition 4 is met since the probability of success (correct) is constant for each trial: p = .25
is the probability of guessing a question correct and remains constant for each question.
Example 5.10
Solution 5.10
Condition 1 is met since there are a fixed number of trials: n = 6 questions.
Condition 2 is met since there are two outcomes: each question is either correct or incorrect.
Condition 3 is met since each trial is independent: Cesar guessing with no pattern on each
question does not help in predicting the outcome of another question.
Condition 4 is not met since the probability of success (correct) is not constant for each trial:
p = .25 (1 out of the 4 choices is correct) is the probability of guessing a question correctly for
the first 3 questions, but it changes to p = .20 (1 out of 5 choices is correct) for the last 3 questions.
5.9 Sixty-five percent of people pass the state driver’s exam on the first try. A group of 50
individuals who have taken the driver’s exam is randomly selected. Give two reasons why
this is a binomial problem.
Using the TI-83, 83+, 84, 84+ Calculator
To calculate a probability with the binomial formula, use binompdf(n, p, x).
In Excel, one function handles both the probability of a single x value and the
cumulative probability up to x (see Example 5.13). The Excel function is =BINOM.DIST(number_s,
trials, probability, true/false). Notice that the x value is first, and the number of trials is
second, which is different from the TI 83/84. Select FALSE to calculate the probability of just
the x value.
Example 5.11
Randomly guessing at a multiple-choice question with 4 possible answers has only two
outcomes. If a success is guessing correctly, then a failure is guessing incorrectly. Suppose
there are 6 multiple choice questions. Mayra guesses on each question with no pattern.
Create a binomial probability distribution. Draw a histogram.
X P(X)
0 Binompdf(6, .25, 0) = .178
1 Binompdf(6, .25, 1) = .356
2 Binompdf(6, .25, 2) = .297
3 Binompdf(6, .25, 3) = .132
4 Binompdf(6, .25, 4) = .033
5 Binompdf(6, .25, 5) = .00439
6 Binompdf(6, .25, 6) = .00024
[Figure: histogram of the binomial distribution for n = 6 and p = .25. Horizontal axis: X; vertical axis: P(X), from 0 to 0.4.]
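For readers without a TI calculator, the binomial probabilities in Example 5.11 can be reproduced from the standard binomial formula P(X = x) = C(n, x) p^x (1 – p)^(n – x). The Python function below is a sketch; Python and the name `binom_pmf` are assumptions of this illustration, and the argument order simply mirrors the binompdf(n, p, x) usage in the table.

```python
from math import comb

def binom_pmf(n, p, x):
    """P(X = x) for X ~ B(n, p): C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Distribution for n = 6 guesses with p = 0.25 (Example 5.11)
table = {x: binom_pmf(6, 0.25, x) for x in range(7)}
for x, prob in table.items():
    print(x, round(prob, 5))

print("total:", sum(table.values()))  # the probabilities of a PDF add up to 1
```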
Example 5.12
Suppose you play a game that you can only either win or lose. The probability that you win
any game is 55%, and the probability that you lose is 45%. Each game you play is
independent.
a. If you play the game 20 times, write the function that describes the probability that
you win 15 of the 20 times.
b. Find the mean number of wins.
c. Find the standard deviation of wins.
Solution 5.12
a. Let X be the number of wins (0, 1, 2, 3, ..., 20). The probability of a success is p =
0.55. The number of trials is n = 20. The probability question can be stated
mathematically as P(x = 15) = Binompdf(20, .55, 15) = .0365
b. μ = n∙p = 20∙.55 = 11 wins is the mean number of wins for 20 trials.
c. σ = √(npq) = √(20∙.55∙.45) ≈ 2.22
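Parts b and c of Example 5.12 use the binomial shortcuts μ = np and σ = √(npq). A small Python sketch of that arithmetic (Python and the variable names are assumptions for illustration):

```python
import math

n, p = 20, 0.55     # number of games and probability of a win (Example 5.12)
q = 1 - p           # probability of a loss

mean_wins = n * p                 # mu = n * p
sd_wins = math.sqrt(n * p * q)    # sigma = sqrt(n * p * q)

print(round(mean_wins, 2), round(sd_wins, 2))
```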
Example 5.13
A trainer is teaching a dolphin to do tricks. The probability that the dolphin successfully
performs the trick is 35%, and the probability that the dolphin does not successfully
perform the trick is 65%. Out of 10 attempts, you want to find the probability that the
dolphin succeeds at most 5 times. State the probability question mathematically.
Solution 5.13
Here, if you define X as the number of successful performances, then X takes on the values
0, 1, 2, 3, ..., 10. The probability of success is p = .35. The probability question can be stated
mathematically as P(x ≤ 5). Here we want to add all the probabilities from 0 to 5 as seen in
the table below: P(x ≤ 5) = .0135 + .0725 + .176 + .252 + .238 + .154 ≈ .905 (using unrounded values)
X P(X)
0 Binompdf(10, .35, 0) = .0135
1 Binompdf(10, .35, 1) = .0725
2 Binompdf(10, .35, 2) = .176
3 Binompdf(10, .35, 3) = .252
4 Binompdf(10, .35, 4) = .238
5 Binompdf(10, .35, 5) = .154
6 Binompdf(10, .35, 6) = .0689
7 Binompdf(10, .35, 7) = .0212
8 Binompdf(10, .35, 8) = .00428
9 Binompdf(10, .35, 9) = .00051
10 Binompdf(10, .35, 10) = .000028
A simpler way of finding this same probability is to use the cumulative binomial function on
the calculator. The function binomcdf (n, p, x) calculates the cumulative probability. In
this case the calculator automatically sums up the probabilities from x = 0 to x = 5.
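The cumulative sum that binomcdf performs can be sketched in a few lines of Python. Python and the name `binom_cdf` are assumptions of this illustration; the function adds the binomial probabilities for 0, 1, ..., x, here for the dolphin in Example 5.13.

```python
from math import comb

def binom_cdf(n, p, x):
    """P(X <= x) for X ~ B(n, p): sum of the binomial probabilities for 0, 1, ..., x."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))

# Example 5.13: at most 5 successes in 10 attempts with p = 0.35
prob_at_most_5 = binom_cdf(10, 0.35, 5)
print(round(prob_at_most_5, 3))
```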
In Excel, the same function is used for both the probability of a single x value and the
cumulative probability. The Excel function is =BINOM.DIST(number_s, trials, probability,
true/false). Notice that the x value is first, and the number of trials is second, which is
different from the TI 83/84. Select TRUE to calculate the cumulative probability up to the x value.
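TI 83/84 and Excel Function Recap, where k is the value of x, n is the number of trials, and
p is the probability. Remember to include the equal sign, =, when using the Excel function.
P(x = k): binompdf(n, p, k) on the TI 83/84; =BINOM.DIST(k, n, p, FALSE) in Excel
P(x ≤ k): binomcdf(n, p, k) on the TI 83/84; =BINOM.DIST(k, n, p, TRUE) in Excel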
Example 5.14
A fair coin is flipped 15 times. Each flip is independent. What is the probability of getting
more than ten heads? Let X = the number of heads in 15 flips of the fair coin. X takes on
the values 0, 1, 2, 3, ..., 15. Since the coin is fair, p = 0.5 and q = 0.5. The number of trials is
n = 15. State the probability question mathematically. Find the probability.
Solution 5.14
P(x > 10) = P(x = 11) + P(x = 12) + P(x = 13) + P(x = 14) + P(x = 15)
= 1 – P(complement)
= 1 – P(x ≤ 10)
= 1 – binomcdf(15, .5, 10)
= 1 - .941 = .059
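The complement step in Solution 5.14 can be verified by computing the probability both ways. The Python sketch below (Python and the helper names are assumptions for illustration) adds P(x = 11) through P(x = 15) directly and compares the result with 1 – P(x ≤ 10).

```python
from math import comb

def binom_pmf(n, p, x):
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binom_cdf(n, p, x):
    """P(X <= x) for X ~ B(n, p)."""
    return sum(binom_pmf(n, p, k) for k in range(x + 1))

n, p = 15, 0.5  # 15 flips of a fair coin (Example 5.14)

direct = sum(binom_pmf(n, p, k) for k in range(11, n + 1))  # P(11) + ... + P(15)
via_complement = 1 - binom_cdf(n, p, 10)                    # 1 - P(x <= 10)

print(round(direct, 3), round(via_complement, 3))
```

Both routes give the same value; the complement is simply less work when the upper tail contains many terms.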
5.10 A fair, six-sided die is rolled ten times. Each roll is independent. You want to find the
probability of rolling a one more than three times. State the probability question
mathematically. Find the probability.
Example 5.15
It has been stated that about 41% of adult workers have a high school diploma but do not
pursue any further education. If 20 adult workers are randomly selected, find the probability
that at most 12 of them have a high school diploma but do not pursue any further education.
How many adult workers do you expect to have a high school diploma but do not pursue any
further education? Create a histogram.
Solution 5.15
Let X = the number of workers who have a high school diploma but do not pursue any
further education. X takes on the values 0, 1, 2, ..., 20 where n = 20, p = 0.41, and q = 1 –
0.41 = 0.59.
X ~ B(20, 0.41)
P(x ≤ 12) = binomcdf(20, .41, 12) = 0.9738.
To answer the question of how many adult workers you expect to have a high school diploma
but not pursue any further education, find the mean: μ = np = 20(0.41) = 8.2 workers.
[Figure: histogram of X ~ B(20, 0.41). Horizontal axis: x = 0, 1, 2, ..., 20; vertical axis: P(x), from 0 to 0.2.]
5.11 About 32% of students participate in a community volunteer program outside of school.
If 30 students are selected at random, find the probability that at most 14 of them participate
in a community volunteer program outside of school. Use the TI-83+ or TI-84 calculator to
find the answer.
Example 5.16
In 2020, the Paper Clip business supply catalog had 560 pages. Eight of the pages feature
different types of ink pens. Suppose we randomly sample 100 pages. Let X = the number of
pages that feature different types of ink pens.
i. the probability that two pages feature different types of ink pens
ii. the probability that at most six pages feature different types of ink pens
iii. the probability that more than three pages feature different types of ink pens.
Solution 5.16
a. x = 0, 1, 2, 3, 4, 5, 6, 7, 8
b. X ~ B(100, 8/560)
c. probabilities:
5.12 According to a Gallup poll, 60% of American adults prefer saving money over spending.
Let X = the number of American adults out of a random sample of 50 who prefer saving to
spending.
i. the probability that 25 adults in the sample prefer saving money over spending
ii. the probability that at most 20 adults prefer saving
iii. the probability that more than 30 adults prefer saving money.
c. Using the formulas, calculate the (i) mean and (ii) standard deviation of X.
Example 5.17
The lifetime risk of developing pancreatic cancer is about one in 78 (1.28%). Suppose we
randomly sample 200 people. Let X = the number of people who will develop pancreatic
cancer.
Solution 5.17
a. X ∼ B(200, 0.0128)
b. Mean and standard deviation
i. μ = np = 200(0.0128) = 2.56
ii. σ = √(npq) = √((200)(0.0128)(0.9872)) ≈ 1.5897
c. P(x ≤ 8) = binomcdf(200, 0.0128, 8) = 0.9988
d. P(x = 5) = binompdf(200, 0.0128, 5) = 0.0707
e. P(x = 6) = binompdf(200, 0.0128, 6) = 0.0298 So P(x = 5) > P(x = 6); it is more likely that
five people will develop cancer than six.
5.13 During the 2020 - 2021 regular NBA season, DeAndre Jordan of the Brooklyn Nets had
a field goal completion rate of 76.3%. Suppose you choose a random sample of 80 shots made
by DeAndre during the 2020 - 2021 season. Let X = the number of shots that scored points.
Example 5.18
The following example illustrates a problem that is not binomial. It violates the condition of
independence. ABC College has a student advisory committee made up of ten staff members
and six students. The committee wishes to choose a chairperson and a recorder. What is the
probability that the chairperson and recorder are both students?
Solution 5.18
The names of all committee members are put into a box, and two names are drawn without
replacement. The first name drawn determines the chairperson and the second name the
recorder. There are two trials. However, the trials are not independent because the outcome
of the first trial affects the outcome of the second trial. The probability of a student on the
first draw is 6/16. The probability of a student on the second draw is 5/15, when the first draw
selects a student. The probability is 6/15, when the first draw selects a staff member. The
probability of drawing a student's name changes for each of the trials and, therefore, violates
the condition of independence.
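Although Example 5.18 is not binomial, the probability it asks about can still be found with the multiplication rule for dependent events. The Python sketch below (Python and the variable names are assumptions for illustration) multiplies the two conditional probabilities given above and, for contrast, shows what the answer would be if the first name were replaced before the second draw.

```python
from fractions import Fraction

students, staff = 6, 10
total = students + staff

# Without replacement: the second probability depends on the first draw.
p_first_student = Fraction(students, total)                  # 6/16
p_second_given_first = Fraction(students - 1, total - 1)     # 5/15
p_both_students = p_first_student * p_second_given_first

# Hypothetical contrast: independent draws with replacement.
p_both_if_replaced = Fraction(students, total) ** 2

print(p_both_students, p_both_if_replaced)
```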
5.14 A lacrosse team is selecting a captain. The names of all the seniors are put into a hat,
and the first three that are drawn will be the captains. The names are not replaced once they
are drawn (one person cannot be two captains). You want to see if the captains all play the
same position. State whether this is binomial or not and state why.
5.4 | The Geometric Distribution
There are four main characteristics of a geometric experiment:
1. There are one or more Bernoulli trials with all failures except the last one, which is
a success. In other words, we keep repeating the experiment until the first success.
Then we stop.
3. The probability of a success, p, is the same for each trial. Thus, the probability of a
failure, q, is also the same for each trial. Of course, q = 1 − p.
4. The random variable is X = the number of independent trials until the first success.
For example, you throw a dart at a bullseye until you hit the bullseye. The first time you hit
the bullseye is a "success" so you stop throwing the dart. It might take six tries until you hit
the bullseye. You can think of the trials as failure, failure, failure, failure, failure, success,
STOP.
There must be at least one trial; in theory, the number of trials could go on forever.
As another example, suppose we roll one fair die; then the probability of rolling a three is
1/6. This is true no matter how many times you roll the die. Suppose you want to know the
probability of getting the first three on the fifth roll. On rolls one through four, you do not get
a three. The probability of this for each of these rolls is q = 5/6. And since the trials are
independent, the probability of getting the first three on the fifth roll is
(5/6)(5/6)(5/6)(5/6)(1/6) = 0.0804.
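The die calculation generalizes: if the first success occurs on trial x, it is preceded by x – 1 failures, so the probability is q^(x – 1) · p. A minimal Python sketch (Python and the name `geom_pmf` are assumptions for illustration):

```python
def geom_pmf(p, x):
    """P(first success occurs on trial x) = (1 - p)^(x - 1) * p, for x = 1, 2, 3, ..."""
    return (1 - p) ** (x - 1) * p

# First three on the fifth roll of a fair die: four non-threes, then a three
prob_fifth_roll = geom_pmf(1 / 6, 5)
print(round(prob_fifth_roll, 4))
```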
Example 5.19
You play a game of chance that you can either win or lose, and you keep playing until you lose. Your probability of
losing is p = 0.57. What is the probability that it takes five games until you lose? Let X = the
number of games you play until you lose (includes the losing game). Then X takes on the
values 1, 2, 3, ... (could go on indefinitely). The probability question is P(x = 5).
Solution 5.19
The probability of losing is p = 0.57, so the probability of winning is q = 1 – 0.57 = 0.43.
If the first loss occurs on the fifth play, then the first four plays must be wins. Thus,
P(x = 5) = (0.43)(0.43)(0.43)(0.43)(0.57) ≈ 0.0195.
5.15 You throw darts at a board until you hit the center area. Your probability of hitting
the center area is p = 0.17. Find the probability that it takes eight throws until you hit the
center.
Example 5.20
A safety engineer feels that 25% of all industrial accidents in her plant are caused by failure
of employees to follow instructions. She decides to look at the accident reports (selected
randomly and replaced in the pile after reading) until she finds one that shows an accident
caused by failure of employees to follow instructions. On average, how many reports would
the safety engineer expect to look at until she finds a report showing an accident caused by
employee failure to follow instructions? What is the probability that the safety engineer will
have to examine at least three reports until she finds a report showing an accident caused by
employee failure to follow instructions?
Solution 5.20
Let X = the number of accident reports the safety engineer must examine until she finds a report
showing an accident caused by employee failure to follow instructions. X takes on the
values 1, 2, 3, ....
The first question asks us to find the expected value or the mean. If the probability of
success is 0.25, then on average we would expect to check about 4 reports until we had a
success. That is, the mean is μ = 1/p = 1/0.25 = 4.
The second question asks us to find P(x ≥ 3). (At least three means three or more.)
We can calculate this using the complement rule:
P(x ≥ 3) = 1 – P(x ≤ 2) = 1 – [0.25 + (0.75)(0.25)] = 1 – 0.4375 = 0.5625.
The calculations for the geometric distribution can be done very easily using the calculator.
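As an alternative to the calculator, the two quantities in the safety engineer example can be sketched in Python (an assumption of this illustration, as are the variable names). The snippet computes the mean 1/p and applies the complement rule to get P(x ≥ 3).

```python
p = 0.25   # probability that a report shows failure to follow instructions
q = 1 - p

mean_reports = 1 / p                # expected number of reports until the first success

p_at_most_2 = p + q * p             # P(x = 1) + P(x = 2)
p_at_least_3 = 1 - p_at_most_2      # complement rule

# Equivalent view: x >= 3 means the first two reports are both failures.
p_first_two_fail = q ** 2

print(mean_reports, p_at_least_3, p_first_two_fail)
```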
5.16 An instructor feels that 15% of students get below a C on their final exam. She decides
to look at final exams (selected randomly and replaced in the pile after reading) until she
finds one that shows a grade below a C. We want to know the probability that the
instructor will have to examine at most ten exams until she finds one with a grade below a
C. Use the calculator to find P(x ≤ 10).
Example 5.20
Assume that the probability of a defective computer component is 0.02. Components are
randomly selected and tested until one component fails.
a. Find the probability that the first defect is caused by the seventh component tested.
b. How many components do you expect to test until one is found to be defective?
Solution 5.20
Let X = the number of computer components tested until the first defect is found. Then X
takes on the values 1, 2, 3, .... And the probability of a defective component is p = 0.02, so X
~ G(0.02).
a. Here we want to find P(x = 7). Go to 2nd DISTR, and select geometpdf:
P(x = 7) = geometpdf(0.02, 7) = (0.98)^6(0.02) ≈ 0.0177.
b. Here we want the mean: μ = 1/p = 1/.02 = 50. That is, we expect that we would have to
test 50 components until we found a defective one.
5.17 The probability of a defective steel rod is 0.01. Steel rods are selected at random. Find
the probability that the first defect occurs on the ninth steel rod. Use the TI-83+ or TI-84
calculator to find the answer.
Example 5.21
The lifetime risk of developing pancreatic cancer is about one in 78 (1.28%). Let X = the
number of people you ask until one says he or she has pancreatic cancer. Then X is a
discrete random variable with a geometric distribution: X ~ G(0.0128).
a. What is the probability that you ask ten people before one says he or she has
pancreatic cancer?
b. What is the probability that you must ask 20 people?
c. Find the mean and standard deviation of X.
Solution 5.21
c. The mean is μ = 1/p = 1/.0128 = 78.125 ≈ 78.
The standard deviation is σ = √((1/p)(1/p – 1)) = √(78.125(78.125 – 1)) ≈ 77.6234.
5.18 The literacy rate for a nation measures the proportion of people age 15 and over who
can read and write. The literacy rate for women in Afghanistan is 12%. Let X = the number
of Afghani women you ask until one says that she is literate.
5.5 | Poisson Distribution
Both the binomial and the Poisson are discrete probability distributions. With the binomial,
the goal was to find the probability of a specific number of successes in n trials. Now we want
to find the probability of a specific number of occurrences in a fixed interval of time or space.
A Poisson experiment has the following characteristics:
1. The experiment consists of counting the number of events occurring in a fixed interval
of time or space if these events happen with a known average rate and independently
of the time since the last event.
2. The probability of the event remains constant for each interval of equal length.
For example, a book editor might be interested in the number of words spelled incorrectly in
a particular book. It might be that, on the average, there are five words spelled incorrectly in
100 pages. The interval is the 100 pages.
P(x) = (μ^x ∙ e^(–μ)) / x!
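The Poisson formula P(x) = μ^x e^(–μ) / x! translates directly into code. The sketch below is in Python (an assumption of this illustration, as is the name `poisson_pmf`). It uses the book editor's average of five misspelled words per 100 pages; asking about exactly three misspellings is a hypothetical question chosen only to show the call.

```python
from math import exp, factorial

def poisson_pmf(mu, x):
    """P(X = x) for a Poisson distribution with mean mu: mu^x * e^(-mu) / x!."""
    return mu**x * exp(-mu) / factorial(x)

# Average of 5 misspelled words per 100 pages; probability of exactly 3 (hypothetical question)
prob_three = poisson_pmf(5, 3)
print(round(prob_three, 4))

# The probabilities over x = 0, 1, 2, ... add up to 1 (the tail beyond 60 is negligible here)
print(sum(poisson_pmf(5, x) for x in range(61)))
```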
Example 5.22
The average number of loaves of bread put on a shelf in a bakery in a half-hour period is 12.
Of interest is the number of loaves of bread put on the shelf in five minutes. The time interval
of interest is five minutes. What is the average number of loaves of bread put on a shelf in 5
minutes?
Solution 5.22
Let X = the number of loaves of bread put on the shelf in five minutes. If the average number
of loaves put on the shelf in 30 minutes (half-hour) is 12, then the average number of loaves
put on the shelf in 5 minutes is μ = λ = (12/30)(5) = 2 loaves of bread. X ~ P(2) is the Poisson
distribution notation.
5.19 The average number of fish caught in an hour is eight. Of interest is the number of fish
caught in 15 minutes. The time interval of interest is 15 minutes. What is the average
number of fish caught in 15 minutes?
Example 5.23
The average number of loaves of bread put on a shelf in a bakery in a half-hour period is 12.
Of interest is the number of loaves of bread put on the shelf in five minutes. The time interval
of interest is five minutes. What is the probability that the number of loaves, selected
randomly, put on the shelf in 5 minutes is three?
Solution 5.23
From the previous example, we found μ = 2 for the average number of loaves put on the
shelf in 5 minutes. So X ~ P(2), and P(x = 3) = poissonpdf(2, 3) ≈ 0.1804.
Example 5.24
A bank expects to receive six bad checks per day, on average. What is the probability of the
bank getting fewer than five bad checks on any given day? Of interest is the number of checks
the bank receives in one day, so the time interval of interest is one day. Let X = the number
of bad checks the bank receives in one day. If the bank expects to receive six bad checks per
day, then the average is six checks per day. Write the correct notation for the Poisson
distribution. Write a mathematical statement for the probability question.
Solution 5.24
X ~ P(6). The probability question is P(x < 5).
5.24 An electronics store expects to have ten returns per day on average. The manager wants
to know the probability of the store getting fewer than eight returns on any given day. State
the probability question mathematically.
Example 5.25
You notice that a news reporter says "uh," on average, two times per broadcast. What is the
probability that the news reporter says "uh" more than two times per broadcast?
This is a Poisson problem because you are interested in knowing the number of times the
news reporter says "uh" during a broadcast.
Solution 5.25
a. One broadcast is the fixed interval.
b. μ = 2
c. Let X = the number of times the news reporter says "uh" during one broadcast.
x = 0, 1, 2, 3, …
X ~ P(2)
d. P(x > 2)
e. P(x > 2) = P(x ≥ 3) = 1 – P(Complement) = 1 – P(x ≤ 2)
= 1 – poissoncdf(2, 2) = 1 – .677 = .323
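The complement calculation in part e can be reproduced without the calculator. In this Python sketch (Python and the name `poisson_cdf` are assumptions for illustration), the cumulative probability P(x ≤ 2) is the sum of the Poisson probabilities for x = 0, 1, 2, and P(x > 2) is one minus that sum.

```python
from math import exp, factorial

def poisson_cdf(mu, x):
    """P(X <= x) for a Poisson distribution with mean mu."""
    return sum(mu**k * exp(-mu) / factorial(k) for k in range(x + 1))

mu = 2  # average number of times the reporter says "uh" per broadcast

p_at_most_2 = poisson_cdf(mu, 2)
p_more_than_2 = 1 - p_at_most_2

print(round(p_at_most_2, 4), round(p_more_than_2, 4))
```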
5.25 An emergency room at a particular hospital gets an average of five patients per hour. A
doctor wants to know the probability that the ER gets more than five patients per hour. Give
the reason why this would be a Poisson distribution.
Example 5.26
Leah's answering machine receives about six telephone calls between 8 a.m. and 10 a.m.
What is the probability that Leah receives more than one call in the next 15 minutes?
Let X = the number of calls Leah receives in 15 minutes. (The interval of interest is 15
minutes or ¼ hour)
Solution 5.26
x = 0, 1, 2, 3, ...
If Leah receives, on the average, six telephone calls in two hours, and there are eight
15-minute intervals in two hours, then Leah receives (1/8)(6) = 0.75 calls in 15 minutes,
on average. So, μ = 0.75 for this problem.
X ~ P(0.75)
Find P(x > 1). The probability that Leah receives more than one telephone call in the next
15 minutes is about P(x > 1) = 1 – poissoncdf(0.75, 1) ≈ 0.1734.
[Figure: graph of the Poisson distribution X ~ P(0.75). Horizontal axis: x = 0, 1, 2, 3, ...; vertical axis: probability, from 0 to 0.5.]
The y-axis contains the probability of x where X = the number of calls in 15 minutes.
5.26 A customer service center receives about ten emails every half-hour. What is the
probability that the customer service center receives more than four emails in the next six
minutes? Use the TI-83+ or TI-84 calculator to find the answer.
Example 5.27
According to Baydin, an email management company, an email user gets, on average, 147
emails per day. Let X = the number of emails an email user receives per day. The discrete
random variable X takes on the values x = 0, 1, 2 …. The random variable X has a Poisson
distribution: X ~ P(147). The mean is 147 emails.
a. What is the probability that an email user receives exactly 160 emails per day?
b. What is the probability that an email user receives at most 160 emails per day?
c. What is the standard deviation?
Solution 5.27
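One way to approach the three parts is sketched below in Python (an assumption of this illustration; the function names are hypothetical). Because μ = 147 is large, μ^x and x! become enormous, so the sketch evaluates the Poisson formula in log space. Part c relies on the fact that the variance of a Poisson distribution equals its mean, so σ = √μ.

```python
from math import exp, lgamma, log, sqrt

def poisson_pmf(mu, x):
    """P(X = x) for X ~ P(mu), computed in log space to avoid overflow for large mu and x."""
    return exp(x * log(mu) - mu - lgamma(x + 1))  # lgamma(x + 1) = ln(x!)

def poisson_cdf(mu, x):
    """P(X <= x) for X ~ P(mu)."""
    return sum(poisson_pmf(mu, k) for k in range(x + 1))

mu = 147  # average number of emails per day

p_exactly_160 = poisson_pmf(mu, 160)   # part a
p_at_most_160 = poisson_cdf(mu, 160)   # part b
sigma = sqrt(mu)                       # part c: Poisson variance equals the mean

print(p_exactly_160, p_at_most_160, round(sigma, 4))
```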
5.27 According to a recent poll by the Pew Internet Project, girls between the ages of 14 and
17 send an average of 187 text messages each day. Let X = the number of texts that a girl
aged 14 to 17 sends per day. The discrete random variable X takes on the values x = 0, 1, 2
…. The random variable X has a Poisson distribution: X ~ P(187). The mean is 187 text
messages.
a. What is the probability that a teen girl sends exactly 175 texts per day?
b. What is the probability that a teen girl sends at most 150 texts per day?
c. What is the standard deviation?
Example 5.28
Text message users receive or send an average of 41.5 text messages per day.
a. How many text messages does a text message user receive or send per hour?
b. What is the probability that a text message user receives or sends two messages per
hour?
c. What is the probability that a text message user receives or sends more than two
messages per hour?
Solution 5.28:
a. Let X = the number of texts that a user sends or receives in one hour. The average
number of texts received per hour is 41.5/24 ≈ 1.7292.
b. X ~ P(1.7292), so P(x = 2) = poissonpdf(1.7292, 2) ≈ 0.2653
c. P(x > 2) = 1 – P(x ≤ 2) = 1 – poissoncdf(1.7292, 2) ≈ 1 – 0.7495 = 0.2505
5.28 Atlanta’s Hartsfield-Jackson International Airport is the busiest airport in the world.
On average there are 2,500 arrivals and departures each day.
a. How many airplanes arrive and depart the airport per hour?
b. What is the probability that there are exactly 100 arrivals and departures in one hour?
c. What is the probability that there are at most 100 arrivals and departures in one hour?
Example 5.29
On May 13, 2013, starting at 4:30 PM, the probability of low seismic activity for the next 24
hours in Alaska was reported as about 1.02%. Use this information for the next 200 days to
find the probability that there will be low seismic activity in ten of the next 200 days. Use
both the binomial and Poisson distributions to calculate the probabilities. Are they close?
Solution 5.29
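Let X = the number of days with low seismic activity. Using the binomial distribution,
X ~ B(200, 0.0102) and P(x = 10) = binompdf(200, 0.0102, 10) ≈ 0.000039. Using the Poisson
distribution with μ = np = 200(0.0102) = 2.04, P(x = 10) = poissonpdf(2.04, 10) ≈ 0.000045.
We expect the approximation to be good because n is large (greater than 20) and p is small
(less than 0.05). The results are close: both probabilities reported are almost 0.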
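The binomial and Poisson calculations for this example can be placed side by side. The Python sketch below (Python and the variable names are assumptions for illustration) evaluates the binomial formula with n = 200 and p = 0.0102, then the Poisson formula with μ = np.

```python
from math import comb, exp, factorial

n, p, x = 200, 0.0102, 10   # days observed, daily probability, number of low-activity days

# Binomial: C(n, x) * p^x * (1 - p)^(n - x)
binomial_prob = comb(n, x) * p**x * (1 - p) ** (n - x)

# Poisson approximation with mu = n * p
mu = n * p
poisson_prob = mu**x * exp(-mu) / factorial(x)

print(binomial_prob, poisson_prob, abs(binomial_prob - poisson_prob))
```

The two values differ only in the sixth decimal place, which is consistent with the rule of thumb that the approximation works well for large n and small p.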
5.29 On May 13, 2013, starting at 4:30 PM, the probability of moderate seismic activity for
the next 48 hours in the Kuril Islands off the coast of Japan was reported at about 1.43%.
Use this information for the next 100 days to find the probability that there will be low
seismic activity in five of the next 100 days. Use both the binomial and Poisson distributions
to calculate the probabilities. Are they close?
KEY TERMS
Bernoulli Trials an experiment with the following characteristics:
1. There are only two possible outcomes called “success” and “failure” for each trial.
2. The probability p of a success is the same for any trial. So, the probability q = 1 − p
of a failure is also the same for any trial.
Binomial Experiment a statistical experiment that satisfies the following four conditions:
Binomial Distribution a discrete random variable that arises from Bernoulli trials; there
are a fixed number, n, of independent trials. “Independent” means that the result of any
trial (for example, trial one) does not affect the results of the following trials, and all trials
are conducted under the same conditions. Under these circumstances the binomial random
variable X is defined as the number of successes in n trials.
Expected Value: The expected arithmetic average when an experiment is repeated many
times; also called the mean of the distribution.
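As a short illustration of the expected value, the sketch below computes the mean and standard deviation of a discrete distribution given as a table. It borrows the baker's muffin distribution from the chapter exercises; the variable names are illustrative.

```python
import math

# Baker's distribution from the chapter exercises: batches sold -> probability
distribution = {1: 0.15, 2: 0.35, 3: 0.40, 4: 0.10}

# A valid probability distribution must sum to one
assert math.isclose(sum(distribution.values()), 1.0)

# Expected value: sum of x * P(x)
mean = sum(x * p for x, p in distribution.items())

# Standard deviation: square root of the sum of (x - mean)^2 * P(x)
variance = sum((x - mean) ** 2 * p for x, p in distribution.items())
std_dev = math.sqrt(variance)

print(round(mean, 2), round(std_dev, 4))
```

The mean agrees with the value E(X) = 2.45 given in the chapter solutions.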
Geometric Experiment a statistical experiment with the following properties:
1. There are one or more Bernoulli trials with all failures except the last one, which is
a success.
2. In theory, the number of trials could go on forever. There must be at least one trial.
3. The probability, p, of a success and the probability, q, of a failure do not change from
trial to trial.
4. The trials are independent of one another.
Geometric Distribution a discrete random variable (RV) that arises from the Bernoulli
trials; the trials are repeated until the first success. The geometric variable X is defined as
the number of trials until the first success.
• Notation: X ~ G(p).
• The mean is μ = 1/p and the standard deviation is σ = √((1/p)(1/p − 1)).
• The probability that the first success occurs on trial x (that is, after exactly x − 1
failures) is given by the formula: P(X = x) = p(1 − p)^(x − 1).
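The geometric formulas can be tried out numerically. The sketch below assumes p = 0.25 purely for illustration; geometric_pmf and geometric_cdf are hypothetical helper names meant to mirror the calculator functions geometpdf and geometcdf used in this chapter.

```python
import math

def geometric_pmf(p, x):
    """P(X = x): the first success occurs on trial x (x = 1, 2, 3, ...)."""
    return p * (1 - p) ** (x - 1)

def geometric_cdf(p, x):
    """P(X <= x): at least one success within the first x trials."""
    return 1 - (1 - p) ** x

p = 0.25                                    # illustrative success probability
mean = 1 / p                                # mu = 1/p
std_dev = math.sqrt((1 / p) * (1 / p - 1))  # sigma = sqrt((1/p)(1/p - 1))

print(mean, round(std_dev, 4))
print(round(geometric_pmf(p, 10), 4), round(geometric_cdf(p, 5), 4))
```

The cumulative form follows from the fact that the first success comes after trial x only when the first x trials are all failures, which has probability (1 − p)^x.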
Mean of a Probability Distribution: The long-term average of the random variable over
many trials of a statistical experiment. The mean is also called the expected value.
Poisson Probability Distribution: A discrete random variable that counts the number
of times a certain event will occur in a specific interval. Characteristics of the variable:
• The probability p that the event occurs in a given interval is the same for all
intervals of equal length.
• The events occur with a known mean and independently of the time since the last
event.
• The distribution is defined by the mean μ of the event in the interval.
o Notation: X ~ P(μ).
o The mean is μ (with μ = np when the Poisson distribution is used to approximate a
binomial distribution) and the standard deviation is σ = √μ.
o The probability of having exactly x successes in a fixed interval is
P(X = x) = e^(−μ) μ^x / x!.
The Poisson distribution is often used to approximate the binomial distribution when n is
“large” and p is “small” (a general rule is that n should be greater than or equal to 20 and p
should be less than or equal to 0.05).
• The domain of the random variable is not necessarily a numerical set; the domain
may be categorical in nature and expressed in words; for example, if X = hair color
then the domain is {black, blond, gray, green, orange}.
• The values of the random variable are determined by chance. So we can tell what
specific value x the random variable X takes only after performing the experiment.
FORMULA REVIEW
Let X be a discrete random variable, and let x1, x2, ..., xn be the list of possible outcomes for X.
Binomial Distribution:
Geometric Distribution:
Notation: X ~ G(p)
PDF: P(X = x) = p(1 − p)^(x − 1)
Mean: μ = 1/p
Standard deviation: σ = √((1/p)(1/p − 1))
Poisson Distribution:
PDF: P(X = x) = e^(−μ) μ^x / x!
Standard deviation: σ = √μ
Exercises for Chapter 5
Use the following information to answer the next five exercises:
A company wants to evaluate its attrition rate, in other words, how long new hires stay
with the company. Over the years, they have established the following probability
distribution.
Let X = the number of years a new hire will stay with the company.
Let P(x) = the probability that a new hire will stay with the company x years.
x P(x)
0 0.12
1 0.18
2 0.30
3 0.15
4
5 0.10
6 0.05
2. P(x = 4) =
3. P(x ≥ 5) =
4. On average, how long would you expect a new hire to stay with the company?
A baker is deciding how many batches of muffins to make to sell in his bakery. He wants to
make enough to sell every muffin and no fewer. Through observation, the baker has
established a probability distribution.
X P(x)
1 0.15
2 0.35
3 0.40
4 0.10
7. What is the probability the baker will sell more than one batch? P(x > 1) =
8. What is the probability the baker will sell exactly one batch? P(x = 1) =
Ellen has music practice three days a week. She practices for all of the three days 85% of
the time, two days 8% of the time, one day 4% of the time, and no days 3% of the time. One
week is selected at random.
12. We know that for a probability distribution function to be discrete, it must have two
characteristics. One is that the sum of the probabilities is one. What is the other
characteristic?
Javier volunteers in community events each month. He does not do more than five events in
a month. He attends exactly five events 35% of the time, four events 25% of the time, three
events 20% of the time, two events 10% of the time, one event 5% of the time, and no events
5% of the time.
16. Find the probability that Javier volunteers for less than three events each month:
That is, calculate P(x < 3).
17. Find the probability that Javier volunteers for at least one event each month.
That is, calculate P(x ≥ 1).
x P(x) x* P(x)
0 0.2
1 0.2
2 0.4
3 0.2
19. Suppose that the PDF for the number of years it takes to earn a Bachelor of Science
(B.S.) degree is given in the following table.
x P(x)
3 0.05
4 0.40
5 0.30
6 0.15
7 0.10
20. Find the mean and standard deviation of the probability distributions:
a. b.
x P(x) x P(x)
2 0.1 1 0.15
4 0.3 2 0.25
6 0.3 3 0.30
8 0.2 4 0.20
10 0.1 5 0.10
A physics professor wants to know what percent of physics majors will spend the next
several years doing post-graduate research. He has the following probability distribution.
x P(x)
1 0.35
2 0.20
3 0.15
4
5 0.10
6 0.05
23. Find the probability that a physics major will do post-graduate research for four years,
P(x = 4).
24. Find the probability that a physics major will do post-graduate research for at most
three years. I.e. find P(x ≤ 3).
25. On average, how many years would you expect a physics major to spend doing post-
graduate research?
A ballet instructor is interested in knowing what percent of each year's class will continue
on to the next, so that she can plan what classes to offer. Let X = the number of years a
student will study ballet with the teacher.
Over the years, she has established the following probability distribution:
x P(x)
1 0.10
2 0.05
3 0.10
4
5 0.30
6 0.20
7 0.10
30. On average, how many years would you expect a child to study ballet with this teacher?
32. You are playing a game by drawing a card from a standard deck and replacing it. If the
card is a face card, you win $30. If it is not a face card, you pay $2. There are 12 face cards
in a deck of 52 cards. What is the expected value of playing the game?
33. You are playing a game by drawing a card from a standard deck and replacing it. If the
card is a face card, you win $30. If it is not a face card, you pay $2. There are 12 face cards
in a deck of 52 cards. Should you play the game?
Use the following information to answer the next eight exercises:
The Higher Education Research Institute at UCLA collected data from 203,967 incoming
first-time, full-time freshmen from 270 four-year colleges and universities in the U.S. Of
those surveyed, 71.3% of those students replied that, yes, they believe that same-sex
couples should have the right to legal marital status. Suppose that you randomly pick eight
first-time, full-time freshmen from the survey. You are interested in the number who
believe that same-sex couples should have the right to legal marital status.
40. What is the probability that at most five of the freshmen reply “yes”?
41. What is the probability that at least two of the freshmen reply “yes”?
42. A game involves selecting a card from a regular 52-card deck and tossing a coin. The
coin is a fair coin and is equally likely to land on heads or tails.
• If the card is a face card, and the coin lands on Heads, you win $6
• If the card is a face card, and the coin lands on Tails, you win $2
• If the card is not a face card, you lose $2, no matter what the coin shows.
a. Find the expected value for this game (expected net gain or loss).
b. Explain what your calculations indicate about your long-term average profits and
losses on this game.
c. Should you play this game to win money?
43. You buy a lottery ticket to a lottery that costs $10 per ticket. There are only 100 tickets
available to be sold in this lottery. In this lottery there are one $500 prize, two $100 prizes,
and four $25 prizes. Find your expected gain or loss.
44. A venture capitalist, willing to invest $1,000,000, has three investments to choose from.
The first investment, a software company, has a 10% chance of returning $5,000,000 profit,
a 30% chance of returning $1,000,000 profit, and a 60% chance of losing the million dollars.
The second company, a hardware company, has a 20% chance of returning $3,000,000
profit, a 40% chance of returning $1,000,000 profit, and a 40% chance of losing the million
dollars. The third company, a biotech firm, has a 10% chance of returning $6,000,000 profit,
a 70% chance of no profit or loss, and a 20% chance of losing the million dollars.
45. Suppose that 20,000 married adults in the United States were randomly surveyed as to
the number of children they have. Let X = the number of children married people have.
The results of the survey are compiled and are used as theoretical probabilities:
x P(x)
0 0.10
1 0.20
2 0.30
3
4 0.10
5 0.05
6 or more 0.05
46. A “friend” offers you the following “deal.” For a $10 fee, you may pick an envelope from
a box containing 100 seemingly identical envelopes. However, each envelope contains a
coupon for a free gift.
Based upon the financial gain or loss over the long run, should you play the game?
47. Florida State University has 14 statistics classes scheduled for its Summer 2013 term.
One class has space available for 30 students, eight classes have space for 60 students, one
class has space for 70 students, and four classes have space for 100 students.
a. What is the average class size assuming each class is filled to capacity?
b. Space is available for 980 students. Suppose that each class is filled to capacity and
select a statistics student at random. Let the random variable X equal the size of the
student’s class. Define the PDF for X.
c. Find the mean of X.
d. Find the standard deviation of X.
49. In a lottery, there are 250 prizes of $5, 50 prizes of $25, and ten prizes of $100.
Assuming that 10,000 tickets are to be issued and sold, what is a fair price to charge to
break even?
50. According to a recent article the average number of babies born with significant hearing
loss (deafness) is approximately two per 1,000 babies in a healthy baby nursery. The
number climbs to an average of 30 per 1,000 babies in an intensive care nursery. Suppose
that 1,000 babies from healthy baby nurseries were randomly surveyed.
Find the probability that exactly two babies were born deaf.
Recently, a nurse commented that when a patient calls the medical advice line claiming to
have the flu, the chance that he or she truly has the flu (and not just a nasty cold) is only
about 4%. Of the next 25 patients calling in claiming to have the flu, we are interested in
how many actually have the flu.
51. Define the random variable and list its possible values.
53. Find the probability that at least four of the 25 patients actually have the flu.
54. On average, for every 25 patients calling in, how many do you expect to have the flu?
55. People visiting video rental stores often rent more than one DVD at a time. The
probability distribution for DVD rentals per customer at Video To Go is given in the table
below. There is a five-video limit per customer at this store, so nobody ever rents more than
five DVDs.
x     0     1     2     3     4     5
P(x)  0.03  0.50  0.24        0.07  0.04
56. A school newspaper reporter decides to randomly survey 12 students to see if they will
attend Tet (Vietnamese New Year) festivities this year. Based on past years, she knows
that 18% of students attend Tet festivities. We are interested in the number of students
who will attend the festivities.
The probability that the San Jose Sharks will win any given game is 0.3694 based on a 13-
year win history of 382 wins out of 1,034 games played (as of a certain date). An upcoming
monthly schedule contains 12 games. Let X = the number of games won in that upcoming
month.
57. Find the expected number of wins for that upcoming month.
58. What is the probability that the San Jose Sharks win six games in that upcoming
month?
59. What is the probability that the San Jose Sharks win at least five games in the
upcoming month?
a. 0.369
b. 0.526
c. 0.473
d. 0.230
60. A student takes a ten-question true-false quiz but did not study and randomly guesses
each answer. Find the probability that the student passes the quiz with a grade of at least
70% of the questions correct.
61. A student takes a 32-question multiple-choice exam but did not study and randomly
guesses each answer. Each question has three possible choices for the answer. Find the
probability that the student guesses more than 75% of the questions correctly.
62. Six different colored dice are rolled. Of interest is the number of dice that show a one.
63. More than 96 percent of the very largest colleges and universities (more than 15,000
total enrollments) have some online offerings. Suppose you randomly pick 13 such
institutions. We are interested in the number that offer distance learning courses.
64. Suppose that about 85% of graduating students attend their graduation. A group of 22
graduating students is randomly chosen.
65. At The Fencing Center, 60% of the fencers use the foil as their main weapon. We
randomly survey 25 fencers at The Fencing Center. We are interested in the number of
fencers who do not use the foil as their main weapon.
e. Find the probability that six do not use the foil as their main weapon.
f. Based on numerical values, would you be surprised if all 25 did not use foil as their
main weapon? Justify your answer numerically.
66. Approximately 8% of students at a local high school participate in after-school sports all
four years of high school. A group of 60 seniors is randomly chosen. Of interest is the
number who participated in after-school sports all four years of high school.
67. The chance of an IRS audit for a tax return with over $25,000 in income is about 2% per
year. We are interested in the expected number of audits a person with that income has in
a 20-year period. Assume each year is independent.
68. It has been estimated that only about 30% of California residents have adequate
earthquake supplies. Suppose you randomly survey 11 California residents. We are
interested in the number who have adequate earthquake supplies.
69. There are two similar games played for Chinese New Year and Vietnamese New Year.
In the Chinese version, fair dice with numbers 1, 2, 3, 4, 5, and 6 are used, along with a
board with those numbers. In the Vietnamese version, fair dice with pictures of a gourd,
fish, rooster, crab, crayfish, and deer are used. The board has those six objects on it, also.
We will play with bets being $1. The player places a bet on a number or object. The “house”
rolls three dice. If none of the dice show the number or object that was bet, the house keeps
the $1 bet. If one of the dice shows the number or object bet (and the other two do not show
it), the player gets back his or her $1 bet, plus $1 profit. If two of the dice show the number
or object bet (and the third die does not show it), the player gets back his or her $1 bet, plus
$2 profit. If all three dice show the number or object bet, the player gets back his or her $1
bet, plus $3 profit. Let X = number of matches and Y = profit per game.
70. According to The World Bank, only 9% of the population of Uganda had access to
electricity as of 2009. Suppose we randomly sample 150 people in Uganda. Let X = the
number of people who have access to electricity.
71. The literacy rate for a nation measures the proportion of people aged 15 and over that
can read and write. The literacy rate in Afghanistan is 28.1%. Suppose you choose 15 people
in Afghanistan at random. Let X = the number of people who are literate.
5.4 Geometric Distribution
The Higher Education Research Institute at UCLA collected data from 203,967 incoming
first-time, full-time freshmen from 270 four-year colleges and universities in the U.S. 71.3%
of those students replied that, yes, they believe that same-sex couples should have the right
to legal marital status. Suppose that you randomly select freshmen from the study until you
find one who replies “yes.” You are interested in the number of freshmen you must ask.
72. In words, define the random variable X and identify the distribution of X.
73. On average, how many freshmen would you expect to have to ask until you found one
who replies “yes”?
74. What is the probability that you will need to ask fewer than three freshmen?
75. Suppose that the probability that an adult in America will watch the Super Bowl is
40%. Each person is considered independent. We are interested in the number of adults in
America we must survey until we find one who will watch the Super Bowl.
76. It has been estimated that only about 30% of California residents have adequate
earthquake supplies. Suppose we are interested in the number of California residents we
must survey until we find a resident who does not have adequate earthquake supplies.
77. In one of its Spring catalogs, L.L. Bean® advertised footwear on 29 of its 192 catalog
pages. Suppose we randomly survey 20 pages. We are interested in the number of pages
that advertise footwear. Each page may be picked more than once.
78. The World Bank records the prevalence of HIV in countries around the world. According
to their data, “Prevalence of HIV refers to the percentage of people ages 15 to 49 who are
infected with HIV.”[1] In South Africa, the prevalence of HIV is 17.3%. Let X = the number
of people you test until you find a person infected with HIV.
79. According to a recent Pew Research poll, 75% of millennials (people born between 1981
and 1995) have a profile on a social networking site. Let X = the number of millennials you
ask until you find a person without a profile on a social networking site.
80. A consumer looking to buy a used red Miata car will call dealerships until she finds a
dealership that carries the car. She estimates that the probability that any independent
dealership will have the car is 28%. We are interested in the number of dealerships
she must call.
a. In words, define the random variable X.
b. List the values that X may take on.
c. Give the distribution of X. X ~ ( , )
d. On average, how many dealerships would we expect her to have to call until she finds
one that has the car?
e. Find the probability that she must call at most four dealerships.
f. Find the probability that she must call three or four dealerships.
On average, eight teens in the U.S. die from motor vehicle injuries per day. As a result,
states across the country are debating raising the driving age.
81. Assume the event occurs independently in any given day. Define the random variable
X.
82. X ~ ( , )
84. For the given values of the random variable X, fill in the corresponding probabilities.
85. Is it likely that there will be no teens killed from motor vehicle injuries on any given
day in the U.S.? Justify your answer numerically.
86. Is it likely that there will be more than 20 teens killed from motor vehicle injuries on
any given day in the U.S.? Justify your answer numerically.
87. Assume the event occurs independently in any given day. Define the random variable X
and identify the distribution.
89. What is the probability of getting 35 customers in the first four hours? Assume the store
is open 12 hours each day.
90. What is the probability that the store will have more than 12 customers in the first
hour?
91. What is the probability that the store will have fewer than 12 customers in the first two
hours?
92. The maternity ward at Dr. Jose Fabella Memorial Hospital in Manila in the Philippines
is one of the busiest in the world with an average of 60 births per day. Let X = the number
of births in an hour.
93. The switchboard in a Minneapolis law office gets an average of 5.5 incoming phone calls
during the noon hour on Mondays. Experience shows that the existing staff can handle up
to six calls in an hour. Let X = the number of calls received at noon.
94. The average number of children a Japanese woman has in her lifetime is 1.37. Suppose
that one Japanese woman is randomly chosen.
95. The average number of children a Spanish woman has in her lifetime is 1.47. Suppose
that one Spanish woman is randomly chosen.
96. Fertile, female cats produce an average of three litters per year. Suppose that one
fertile, female cat is randomly chosen. In one year, find the probability she produces:
97. The chance of an IRS audit for a tax return with over $25,000 in income is about 2% per
year. Suppose that 100 people with tax returns over $25,000 are randomly picked. We are
interested in the number of people audited in one year. Use a Poisson distribution to
answer the following questions.
98. An article about directory assistance stated that “internal surveys paid for by directory
assistance providers show that even the most accurate companies give out wrong numbers
15% of the time.” Assume that you are testing such a provider by making 20 directory
assistance requests and also assume that the provider gives the wrong phone number 15%
of the time.
99. A company prices its natural disaster insurance using the following assumptions. In one
calendar year, there is at most 1 natural disaster. Let’s say the probability of a natural
disaster is 0.15. Lastly, the number of natural disasters in any calendar year is independent
of the number of natural disasters in any other calendar year.
b. Find the probability that there are fewer than 4 natural disasters in a 10-year period.
c. Find the probability that there are at least 6 natural disasters in a 10-year period.
REFERENCES
5.2 Mean, Expected Value and Standard Deviation
Class Catalogue at the Florida State University. Available online at
https://apps.oti.fsu.edu/RegistrarCourseLookup/SearchFormLegacy (accessed May 15, 2013).
“World Earthquakes: Live Earthquake News and Highlights,” World Earthquakes, 2012.
http://www.world-earthquakes.com/index.php?option=ethq_prediction (accessed May 15, 2013).
Newport, Frank. “Americans Still Enjoy Saving Rather than Spending: Few demographic differences
seen in these views other than by income,” GALLUP® Economy, 2013. Available online at
http://www.gallup.com/poll/162368/americans-enjoy-saving-rather-spending.aspx (accessed May 15,
2013).
Pryor, John H., Linda DeAngelo, Laura Palucki Blake, Sylvia Hurtado, Serge Tran. The American
Freshman: National Norms Fall 2011. Los Angeles: Cooperative Institutional Research Program at
the Higher Education Research Institute at UCLA, 2011. Also available online at
http://heri.ucla.edu/PDFs/pubs/TFS/Norms/Monographs/TheAmericanFreshman2011.pdf (accessed
May 15, 2013).
“What are the key statistics about pancreatic cancer?” American Cancer Society, 2013. Available
online at
http://www.cancer.org/cancer/pancreaticcancer/detailedguide/pancreatic-cancer-key-statistics
(accessed May 15, 2013).
“Millennials: Confident. Connected. Open to Change.” Executive Summary by PewResearch Social &
Demographic Trends, 2013. Available online at
http://www.pewsocialtrends.org/2010/02/24/millennials-confident-connected-open-to-change/
(accessed May 15, 2013).
“Prevalence of HIV, total (% of populations ages 15-49),” The World Bank, 2013. Available
online at
http://data.worldbank.org/indicator/SH.DYN.AIDS.ZS?order=wbapi_data_value_2011+wbapi_data_value+wbapi_data_value-last&sort=desc
(accessed May 15, 2013).
Pryor, John H., Linda DeAngelo, Laura Palucki Blake, Sylvia Hurtado, Serge Tran. The American
Freshman: National Norms Fall 2011. Los Angeles: Cooperative Institutional Research Program at
the Higher Education Research Institute at UCLA, 2011. Also available online at
http://heri.ucla.edu/PDFs/pubs/TFS/Norms/Monographs/TheAmericanFreshman2011.pdf (accessed
May 15, 2013).
“Summary of the National Risk and Vulnerability Assessment 2007/8: A profile of Afghanistan,” The
European Union and ICON-Institute. Available online at
http://ec.europa.eu/europeaid/where/asia/documents/afgh_brochure_summary_en.pdf (accessed May
15, 2013).
“UNICEF reports on Female Literacy Centers in Afghanistan established to teach women and girls
basic resading [sic] and writing skills,” UNICEF Television. Video available online at
http://www.unicefusa.org/assets/video/afghan-female-literacy-centers.html (accessed May 15, 2013).
Center for Disease Control and Prevention. “Teen Drivers: Fact Sheet,” Injury Prevention & Control:
Motor Vehicle Safety, October 2, 2012. Available online at
http://www.cdc.gov/Motorvehiclesafety/Teen_Drivers/teendrivers_factsheet.html (accessed May 15,
2013).
“Children and Childrearing,” Ministry of Health, Labour, and Welfare. Available online at
http://www.mhlw.go.jp/english/policy/children/children-childrearing/index.html (accessed May 15,
2013).
“Eating Disorder Statistics,” South Carolina Department of Mental Health, 2006. Available online at
http://www.state.sc.us/dmh/anorexia/statistics.htm (accessed May 15, 2013).
“Giving Birth in Manila: The maternity ward at the Dr Jose Fabella Memorial Hospital in Manila,
the busiest in the Philippines, where there is an average of 60 births a day,” the guardian,
2013. Available online at
http://www.theguardian.com/world/gallery/2011/jun/08/philippines-health#/?picture=375471900&index=2
(accessed May 15, 2013).
“How Americans Use Text Messaging,” Pew Internet, 2013. Available online at
http://pewinternet.org/Reports/2011/Cell-Phone-Texting-2011/Main-Report.aspx (accessed May 15,
2013).
Lenhart, Amanda. “Teens, Smartphones & Texting: Texting volume is up while the frequency of
voice calling is down. About one in four teens say they own smartphones,” Pew Internet, 2012.
Available online at
http://www.pewinternet.org/~/media/Files/Reports/2012/PIP_Teens_Smartphones_and_Texting.pdf
(accessed May 15, 2013).
“One born every minute: the maternity unit where mothers are THREE to a bed,” MailOnline.
Available online at
http://www.dailymail.co.uk/news/article-2001422/Busiest-maternity-ward-planet-averages-60-babies-day-mothers-bed.html
(accessed May 15, 2013).
Vanderkam, Laura. “Stop Checking Your Email, Now.” CNNMoney, 2013. Available online at
http://management.fortune.cnn.com/2012/10/08/stop-checking-your-email-now/ (accessed May 15,
2013).
“World Earthquakes: Live Earthquake News and Highlights,” World Earthquakes, 2012.
http://www.world-earthquakes.com/index.php?option=ethq_prediction (accessed May 15, 2013).
1. “Prevalence of HIV, total (% of populations ages 15-49),” The World Bank, 2013. Available online at
http://data.worldbank.org/indicator/SH.DYN.AIDS.ZS?order=wbapi_data_value_2011+wbapi_data_value+wbapi_data_value-last&sort=desc
(accessed May 15, 2013).
CHAPTER 5 SOLUTIONS:
1)
x P(x)
0 0.12
1 0.18
2 0.30
3 0.15
4 0.10
5 0.10
6 0.05
3) P(x ≥ 5) = 0.10 + 0.05 = 0.15
5) 1
7) P(x > 1) = 0.35 + 0.40 + 0.10 = 0.85
9) E(X) = 1(0.15) + 2(0.35) + 3(0.40) + 4(0.10) = 0.15 + 0.70 + 1.20 + 0.40 = 2.45
11)
x P(x)
0 0.03
1 0.04
2 0.08
3 0.85
13) Let X = the number of events Javier volunteers for each month.
15)
x P(x)
0 0.05
1 0.05
2 0.10
3 0.20
4 0.25
5 0.35
17) P(x > 0) = 1 – P(x = 0) = 0.95.
19) a. X = number of years needed to complete B.S. degree.
b. No students completed B.S. degrees in two years or less.
c. E[x] = 4.85 years
21) X = the number of years a physics major will spend doing post-graduate research.
23) 0.15
25) E(X) = 1(0.35) + 2(0.20) + 3(0.15) + 4(0.15) + 5(0.10) + 6(0.05) = 2.6 years
27) P(x ≥ 5) = 0.30 + 0.20 + 0.10 = 0.60.
29) P(x < 4) = 0.10 + 0.05 + 0.10 = 0.25
31) The probabilities sum to one because it is a probability distribution.
33) E(X) = 30(12/52) – 2(40/52) = $5.38.
35) X = the number that reply “yes”
37) x = 0, 1, 2, 3, 4, 5, 6, 7, 8
39) μ = np = (8)(0.713) ≈ 5.7
41) P(x ≤ 5) = binomcdf(8, 0.713, 5) = 0.4151.
43) E[x] = 6(6/52) + 2(6/52) – 2(40/52) = –$0.62. No, you should not play this game to win
money.
45) a.
Biotech company
x P(x)
6,000,000 .10
0 .70
-1,000,000 .20
47) b
49) Let X = the amount of money to be won on a ticket. The following table shows the PDF
for X.
X P(x)
0 0.969
5 0.025
25 0.005
100 0.001
E(X) = 0(0.969) + 5(0.025) + 25(0.005) + 100(0.001) = 0.35. So a fair price for a ticket is
$0.35.
Any price over $0.35 will enable the lottery to raise money.
51) X = the number of patients calling in claiming to have the flu, who actually have the
flu. Possible values of X are x = 0, 1, 2, ..., 25
53) 0.0165
55) a. X = the number of DVDs a Video to Go customer rents b. P(x = 3) = 0.12
c. P(x ≥ 4) = 0.07 + 0.04 = 0.11 d. P(x ≤ 2) = 0.77
73) 1.4
75) a. X = the number of adults surveyed until one says he or she will watch the Super
Bowl. b. X ~ G(0.40) c. μ = 2.5 d. 0.0187 e. 0.2304
77) a. X = the number of pages that advertise footwear b. X takes on the values 0, 1, 2,
..., 20 c. X ~ B(20, 29/192) d. 3.02 e. No f. 0.9997 g. X = the number of pages we
must survey until we find one that advertises footwear. X ~ G(29/192) h. P(x ≤ 3) =
geometcdf(29/192, 3) = 0.3881 i. μ = 192/29 = 6.6207 pages.
79) a. X ~ G(0.25) b. μ = 1/0.25 = 4. c. P(x = 10) = geometpdf(0.25, 10) = 0.0188
d. P(x = 20) = geometpdf(0.25, 20) = 0.0011 e. P(x ≤ 5) = geometcdf(0.25, 5) =
0.7627.
81) X = the number of U.S. teens who die from motor vehicle injuries per day.
83) 0, 1, 2, 3, ...
85) No
87) X = number of customers per day; X ~ P(120).
89) If the store averages 120 customers per day, then it will average 10 customers per
hour, or 40 per each four hours. So P(x = 35) = poissonpdf(40, 35) = 0.0485
91) The store averages 120 customers per day, so it will average 10 customers per hour, or
20 per each two hours. So P(x < 12) = P(x ≤ 11) = poissoncdf(20, 11) = 0.0214.
93) X ~ P(5.5) a. μ = 5.5 and σ = √5.5 ≈ 2.3452 b. P(x ≤ 6) = poissoncdf(5.5, 6) ≈ 0.6860 c. P(x
= 6) = poissonpdf(5.5, 6). There is a 15.7% probability that the law staff will receive more
calls than they can handle. d. P(x > 8) = 1 – P(x ≤ 8) = 1 – poissoncdf(5.5, 8) ≈ 1 – 0.8944 =
0.1056
95) a. X = the number of children for a Spanish woman; b. 0, 1, 2, 3, ...; c. X ~
P(1.47) d. P(x = 0) = poissonpdf(1.47, 0) = 0.2299; e. P(x < 1.47) = poissoncdf(1.47, 1) =
0.5679 f. P(x > 1.47) = 1 – P(x < 1.47) = 1 – 0.5679 = 0.4321
97) a. X = the number of people audited in one year b. 0, 1, 2, ..., 100 c. X ~ P(2)
d. 2 e. P(x = 0) = poissonpdf(2, 0) = 0.1353; f. P(x ≥ 3) = 1 – P(x ≤ 2) = 1 – poissoncdf(2,
2) = 0.3233.
6 | CONTINUOUS RANDOM VARIABLES
Figure 6.1 The heights of these radish plants are continuous random variables. (Credit: Rev Stan)
Introduction
Chapter Objectives
Continuous random variables have many applications. Baseball batting averages, IQ scores,
the length of time a long distance telephone call lasts, the amount of money a person carries,
the length of time a computer chip lasts, and SAT scores are just a few. The field of reliability
depends on a variety of continuous random variables.
6.1 | Continuous Probability Distributions
NOTE
The distinction between discrete and continuous random variables can be unclear. For example, if X is
equal to the number of miles (to the nearest mile) you drive to work, then X is a discrete
random variable. You count the miles. If X is the distance you drive to work, then you
measure values of X and X is a continuous random variable. For a second example, if X is
equal to the number of books in a backpack, then X is a discrete random variable. If X is the
weight of a book, then X is a continuous random variable because weights are measured. How
the random variable is defined is very important.
Example 6.1
A child psychologist is interested in the number of times a newborn baby's crying wakes its
mother after midnight. For a random sample of 50 mothers, the following information was
obtained.
Let X = the number of times per week a newborn baby's crying wakes its mother after
midnight. For this example, x = 0, 1, 2, 3, 4, 5. Let P(x) = probability that X takes on a
value x. Then the following is the probability distribution for X:
x P(x)
0 2/50
1 11/50
2 23/50
3 9/50
4 4/50
5 1/50
Recall P(x < 2) = 2/50 + 11/50 = 13/50
For continuous random variables, the values of X can’t be placed in a table. The outcomes are
measured, not counted.
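The discrete table from Example 6.1 can be checked with a short Python sketch. The names `counts` and `pmf` are illustrative choices made here, not part of the text; exact fractions are used so that no rounding occurs.

```python
from fractions import Fraction

# Probability distribution from Example 6.1: x -> number of mothers (out of 50).
counts = {0: 2, 1: 11, 2: 23, 3: 9, 4: 4, 5: 1}
pmf = {x: Fraction(n, 50) for x, n in counts.items()}

# The probabilities of a discrete distribution must sum to one.
total = sum(pmf.values())

# P(x < 2) adds the probabilities for x = 0 and x = 1.
p_less_than_2 = sum(p for x, p in pmf.items() if x < 2)

print(total)          # 1
print(p_less_than_2)  # 13/50
```

A table like this works only because the values of X can be listed one by one; for a continuous random variable the sum is replaced by an area.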
Properties of Continuous Probability Distributions
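The graph of a continuous probability distribution is a curve, and probability is represented by
area under the curve. The curve is called the probability density function (abbreviated as pdf). We use the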
symbol f(x) to represent the curve. f(x) is the function that corresponds to the graph; we use
the density function f(x) to draw the graph of the probability distribution.
Area under the curve is given by a different function called the cumulative distribution
function (abbreviated as cdf). The cumulative distribution function is used to evaluate
probability as area.
• The entire area under the curve and above the x-axis is equal to one.
• Probability is found for intervals of x values rather than for individual x values.
• P(c < x < d) is the probability that the random variable X is in the interval between
the values c and d. P(c < x < d) is the area under the curve, above the x-axis, to the
right of c and the left of d. P(c < x < d) is the same as P(c ≤ x ≤ d) because probability
is equal to area.
• P(x = c) = 0. The probability that x takes on any single individual value is zero. The
area below the curve, above the x-axis, and between x = c and x = c has no width, and
therefore no area (area = 0). Since the probability is equal to the area, the probability
is also zero.
We will find the area that represents probability by using geometry, formulas, technology, or
probability tables. In general, calculus is needed to find the area under the curve for many
probability density functions. When we use formulas to find the area, the formulas were
found by using the techniques of integral calculus. However, we will not be using calculus in
this textbook.
There are many continuous probability distributions. When using a continuous probability
distribution to model probability, the distribution used is selected to model and fit the
particular situation in the best way. In this chapter and the next, we will study the
continuous uniform distribution, the exponential distribution, and the normal
distribution. The following graphs illustrate these distributions respectively.
Figure 6.2 Graph shows a Uniform Distribution with the shaded area between x = 3 and x =
6 to represent the probability that the value of the random variable X is in the interval
between 3 and 6.
Remember that the x-axis has the values of the continuous random variable.
Figure 6.3 Graph above shows an Exponential Distribution with the area between x = 2 and
x = 4 shaded to represent the probability that the value of the random variable X is in the
interval between 2 and 4.
Figure 6.4 Graph shows the Standard Normal Distribution with the area between x = 1 and
x = 2 shaded to represent the probability that the value of the random variable X is in the
interval between 1 and 2.
We begin by defining a continuous probability density function. We use the function
notation f(x). Intermediate algebra may have been your first formal introduction to functions.
In the study of probability, the functions we study are special. We define the function f(x) so
that the area between it and the x-axis is equal to a probability. Since the maximum
probability is one, the maximum area is also one. For continuous probability distributions,
PROBABILITY = AREA.
Example 6.2
Consider the function f(x) = 1/20 for 0 ≤ x ≤ 20, where x is a real number.
The graph of f(x) = 1/20 is a horizontal line. However, since 0 ≤ x ≤ 20, f(x) is restricted to the
portion between x = 0 and x = 20, inclusive.
Notice that the distance on the x-axis is 20 units. Also, the height on the y-axis is 1/20.
The area of the created rectangle (A = Base · Height) is 20 · (1/20) = 1, which represents 100%.
Example 6.3
The graph of f(x) = 0.1e^(−0.1x) is an exponential function that decreases starting at x = 0, where f(0) = 0.1.
Note: the area under the curve is 100%. Continuous probability distributions have a similar
notion that the “sum” of the probabilities is 1.
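The claim that the area under the curve in Example 6.3 is 100% can be checked numerically. The sketch below is only an approximation: it adds up thin midpoint rectangles, and the cutoff at x = 200 and the helper name `area_under` are assumptions made for this illustration.

```python
import math

def f(x):
    """Density from Example 6.3: f(x) = 0.1 * e^(-0.1x) for x >= 0."""
    return 0.1 * math.exp(-0.1 * x)

def area_under(func, left, right, rectangles=100_000):
    """Approximate the area under func on [left, right] with midpoint rectangles."""
    width = (right - left) / rectangles
    return sum(func(left + (i + 0.5) * width) for i in range(rectangles)) * width

# The curve extends forever to the right; x = 200 is far enough out that the
# leftover tail area is negligible (an assumption made for this sketch).
total_area = area_under(f, 0, 200)
print(round(total_area, 4))  # 1.0
```

This is the same Base · Height idea used for the rectangle in Example 6.2, applied to many narrow rectangles instead of one.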
6.2 | Uniform Distribution
The uniform distribution is a continuous probability distribution and is concerned with
events that are equally likely to occur. When working out problems that have a uniform
distribution, be careful to note if the data is inclusive or exclusive.
• The probability density function is f(x) = 1/(b − a) for a ≤ x ≤ b
• The mean of a uniform probability distribution, µ = (a + b)/2
• Variance of a uniform probability distribution, σ² = (b − a)²/12
• The standard deviation, σ = (b − a)/√12
• Probability = Area = Base ∙ Height
The probability density function for Uniform distribution is a horizontal line from a starting
x value to an ending x value. The area under the horizontal line creates a rectangle.
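The formulas above translate directly into a few small helper functions. This is a minimal sketch; the function names are hypothetical and the values for X ~ U(0, 20) are used only as an illustration.

```python
import math

def uniform_mean(a, b):
    """Mean of X ~ U(a, b)."""
    return (a + b) / 2

def uniform_sd(a, b):
    """Standard deviation of X ~ U(a, b)."""
    return (b - a) / math.sqrt(12)

def uniform_prob(a, b, c, d):
    """P(c < x < d) for X ~ U(a, b), found as base times height."""
    base = max(0, min(b, d) - max(a, c))  # only the part of (c, d) inside [a, b] counts
    height = 1 / (b - a)
    return base * height

# Illustrative values: X ~ U(0, 20)
print(uniform_mean(0, 20))                   # 10.0
print(round(uniform_sd(0, 20), 2))           # 5.77
print(round(uniform_prob(0, 20, 4, 15), 2))  # 0.55
print(uniform_prob(0, 20, 15, 15))           # 0.0
```

The last line shows that an interval with no width has no area, so the probability of any single value is zero.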
Example 6.4
Consider the function f(x) = 1/20 for 0 ≤ x ≤ 20, where x is a real number.
Solution 6.4
b. µ = (a + b)/2 = (0 + 20)/2 = 10;
σ = (b − a)/√12 = (20 − 0)/√12 = 5.77
c. P(x < 2) = area of rectangle = base ∙ height = 2 ∙ (1/20) = 0.1 = 10%
Suppose we want to find the area between f(x) = 1/20 and the x-axis where 4 < x < 15.
Area = (Base)(Height)
Area = (15 – 4)(1/20) = 0.55
The area corresponds to the probability P(4 < x < 15) = 0.55.
NOTE: Suppose we want to find P(x = 15). On an x-y graph, x = 15 is a vertical line. A vertical
line has no width (or zero width). Therefore, P(x = 15) = (base)(height) = (0)(1/20) = 0
P(x ≤ k) (can be written as P(x < k) for continuous distributions) is called the cumulative
distribution function or CDF. Notice the "less than or equal to" symbol. We can use the
CDF to calculate P(x > k). The CDF gives "area to the left" and P(x > k) gives "area to the
right." We calculate P(x > k) for continuous distributions as follows:
P(x > k) = 1 – P(x < k)
6.1 The data that follow are the number of passengers on 35 different charter fishing boats.
The sample mean = 7.9 and the sample standard deviation = 4.33. The data follow a uniform
distribution where all values between and including zero and 14 are equally likely. State the
values of a and b. Write the distribution in proper notation, and calculate the theoretical
mean and standard deviation.
1 12 4 10 4 14 11 7 11 4 13 2 4 6 3 10 0 12
6 9 10 5 13 4 10 14 12 11 6 10 11 0 11 13 2
Example 6.5
Suppose the smiling times, in seconds, of an eight-week-old baby follow a uniform distribution between
0 and 23 seconds, inclusive.
Solution 6.5
a. f(x) = 1/23 for 0 ≤ x ≤ 23
b. X~U(0, 23)
c. Find P(2 < x < 18).
P(2 < x < 18) = (base)(height) = (18 – 2)(1/23) = 16/23
d. Ninety percent of the smiling times fall below the 90th percentile, k, so P(x < k) = 0.90
(k – 0)(1/23) = 0.90
k(1/23) = 0.90; solve for k by multiplying both sides by 23
k = 20.7 seconds
e. This probability question is a conditional. You are asked to find the probability that
an eight-week-old baby smiles more than 12 seconds when you already know the baby
has smiled for more than eight seconds.
Find P(x > 12|x > 8) There are two ways to do the problem:
For the first way, use the fact that the condition changes the sample space. The
graph illustrates the new sample space. You already know the baby smiled more than eight
seconds.
f(x) = 1/(23 − 8) = 1/15 for 8 < x < 23
P(x > 12|x > 8) = (23 – 12)(1/15) = 11/15
For the second way, use the conditional formula from Probability Topics with the original
distribution X ~ U(0, 23):
P(A|B) = P(A and B)/P(B)
P(x > 12|x > 8) = P(x > 12 and x > 8)/P(x > 8) = P(x > 12)/P(x > 8)
= [(23 − 12)(1/23)]/[(23 − 8)(1/23)] = (11/23)/(15/23) = 11/15
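The two approaches to the conditional probability in Example 6.5 can be compared with exact fractions. This is a sketch under the assumptions of the example (X ~ U(0, 23)); the helper name `prob_greater` is hypothetical.

```python
from fractions import Fraction

a, b = 0, 23  # X ~ U(0, 23), smiling time in seconds

def prob_greater(k, low, high):
    """P(x > k) for a uniform distribution on [low, high], with low <= k <= high."""
    return Fraction(high - k, high - low)

# First way: shrink the sample space to 8 < x < 23.
first_way = prob_greater(12, 8, b)

# Second way: P(x > 12 and x > 8) / P(x > 8) on the original distribution.
# The event "x > 12 and x > 8" is the same event as "x > 12".
second_way = prob_greater(12, a, b) / prob_greater(8, a, b)

print(first_way, second_way)  # 11/15 11/15
```

In the second way the height 1/23 appears in both the numerator and the denominator, so it cancels, which is why both approaches agree.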
6.2 A distribution is given as X ~ U (0, 20). What is P(2 < x < 18)? Find the 90th percentile.
Example 6.6
The amount of time, in minutes, that a person must wait for a bus is uniformly distributed
between zero and 15 minutes, inclusive.
a. What is the probability that a person waits fewer than 12.5 minutes?
b. On the average, how long must a person wait? Find the mean, μ, and the standard
deviation, σ.
c. Ninety percent of the time, the time a person must wait falls below what value?
Solution 6.6
a. Let X = the number of minutes a person must wait for a bus. a = 0 and b = 15. X ~ U(0,
15). Write the probability density function. f(x) = 1/(15 − 0) = 1/15 for 0 ≤ x ≤ 15.
P(x < 12.5) = (base)(height) = (12.5 – 0)(1/15) = 0.8333
b. μ = (a + b)/2 = (15 + 0)/2 = 7.5 minutes. On the average, a person must wait 7.5 minutes.
σ = (b − a)/√12 = (15 − 0)/√12 ≈ 4.33 minutes.
c. This asks for the 90th percentile. Draw a graph. Let k = the 90th percentile.
0.9 = (k − 0)(1/15)
0.9 = k(1/15)
k = (0.9)(15) = 13.5
The 90th percentile is 13.5 minutes. Ninety percent of the time, a person must wait at most
13.5 minutes.
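The three parts of Example 6.6 can be reproduced in a few lines. This is a minimal sketch that assumes only the values given in the example; the variable names are illustrative.

```python
import math

a, b = 0, 15                      # X ~ U(0, 15), waiting time in minutes
height = 1 / (b - a)              # f(x) = 1/15

p_under_12_5 = (12.5 - a) * height   # part a: base times height
mean = (a + b) / 2                   # part b
sd = (b - a) / math.sqrt(12)         # part b
k_90 = a + 0.90 * (b - a)            # part c: solve 0.90 = (k - a) * height for k

print(round(p_under_12_5, 4), mean, round(sd, 2), round(k_90, 1))
# 0.8333 7.5 4.33 13.5
```

For a uniform distribution the percentile is found by moving the stated fraction of the way from a to b.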
6.3 Based on a Covid-19 incubation study, a researcher looked at the incubation period as a
uniform distribution from 2.1 to 11.1 days to determine the quarantine period.
Example 6.7
Suppose the amount of gasoline distributed to a company’s fleet of trucks every week is
uniformly distributed. The minimum distributed is 2500 gallons. The maximum distributed
is 3200 gallons.
a. What is the probability that, in a randomly selected week, the amount distributed is more than
3000 gallons?
b. Find the probability that another week’s distribution is more than 3000 gallons, given
that distribution has already exceeded 2800 gallons.
Solution 6.7
a. P(x > 3000) = Base · Height = (3200 − 3000)(1/700) = 200(1/700) ≈ 0.286
b. The second question has a conditional probability. You are asked to find the probability
that another week’s distribution is more than 3000 gallons, given that distribution has
already exceeded 2800 gallons.
First way: Since you know the distribution has already exceeded 2800 gallons, the starting
point is no longer a = 2500. The new starting point is 2800.
f(x) = 1/(3200 − 2800) = 1/400 for 2800 ≤ x ≤ 3200.
P(x > 3000|x > 2800) = (3200 − 3000)(1/400) = 200/400 = 0.5
The probability that another week’s distribution is more than 3000 gallons, given that
distribution has already exceeded 2800 gallons, is 0.5, or 50%.
Second way: Draw the original graph for X ~ U (2500, 3200).
P(x > 3000|x > 2800) = P(x > 3000 and x > 2800)/P(x > 2800) = [200(1/700)]/[400(1/700)] = 200/400 = 0.5
Suppose the time it takes a student to finish a quiz is uniformly distributed between six and
15 minutes, inclusive. Let X = the time, in minutes, it takes a student to finish a quiz. X ~ U
(6, 15). Find the probability that a randomly selected student needs at least eight minutes to
complete the quiz. Then find the probability that a different student needs at least eight
minutes to finish the quiz given that she has already taken more than seven minutes.
Example 6.8
Ace Heating and Air Conditioning Service finds that the amount of time a repairman needs
to fix a furnace is uniformly distributed between 1.5 and four hours. Let X = the time needed
to fix a furnace. Then X ~ U (1.5, 4).
Solution 6.8
a. Uniform Distribution between 1.5 and 4 with an area of 0.30 shaded to the left,
representing the shortest 30% of repair times.
P(x < k) = (base)(height) = (k – 1.5)(0.4)
0.3 = (k – 1.5)(0.4); solving for k gives k = 1.5 + 0.3/0.4 = 2.25
Interpretation:
The 30th percentile of repair times is 2.25 hours. 30% of repair times are 2.25 hours or
less.
b. Uniform Distribution between 1.5 and 4 with an area of 0.25 shaded to the right
representing the longest 25% of repair times.
The longest 25% of furnace repairs take at least 3.375 hours (3.375 hours or longer).
Note: Since 25% of repair times are 3.375 hours or longer, that means that 75% of repair
times are 3.375 hours or less. 3.375 hours is the 75th percentile of furnace repair times.
σ = (b − a)/√12 = (4 − 1.5)/√12 = 0.7217
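The percentiles in Example 6.8 show how an area on the right is converted into an area on the left before solving for k. The sketch below assumes X ~ U(1.5, 4) as in the example; the function name `uniform_percentile` is hypothetical.

```python
def uniform_percentile(p, a, b):
    """Value k with P(x < k) = p for X ~ U(a, b)."""
    return a + p * (b - a)

a, b = 1.5, 4   # X ~ U(1.5, 4), repair time in hours

# Shortest 30% of repair times: area of 0.30 to the left of k.
k_30 = uniform_percentile(0.30, a, b)

# Longest 25% of repair times: an area of 0.25 to the right of k
# leaves an area of 0.75 to the left.
k_75 = uniform_percentile(1 - 0.25, a, b)

print(round(k_30, 3), round(k_75, 3))  # 2.25 3.375
```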
The amount of time a service technician needs to change the oil in a car is uniformly
distributed between 11 and 21 minutes. Let X = the time needed to change the oil on a car.
6.3 | The Exponential Distribution
The exponential distribution is often concerned with the amount of time until some specific
event occurs. For example, the amount of time (beginning now) until an earthquake occurs
has an exponential distribution. Other examples include the length, in minutes, of long
distance business telephone calls, and the amount of time, in months, a car battery lasts. It
can be shown, too, that the value of the change that you have in your pocket or purse
approximately follows an exponential distribution.
Values for an exponential random variable occur in the following way. There are fewer large
values and more small values. For example, the amount of money customers spend in one
trip to the supermarket follows an exponential distribution. There are more people who spend
small amounts of money and fewer people who spend large amounts of money.
The exponential distribution is widely used in the field of reliability. Reliability deals with
the amount of time a product lasts.
Graphing an exponential probability density function, f(x) = me^(−mx), you will have a y-intercept
of (0, m). Since m is positive and the exponent −mx is negative for x > 0, the function decreases
as x increases. The number e = 2.71828182846... It is a number that is used often in
mathematics. Scientific calculators have the key e^x (on many calculators, press 2nd and then ln);
if you enter one for x, the calculator will display the value e.
Example 6.9
Let X = amount of time (in minutes) a postal clerk spends with his or her customer. The time
is known to have an exponential distribution with the average amount of time equal to four
minutes.
µ = 1/m, so m = 1/µ
Therefore m = 1/4 = 0.25
f(x) = 0.25e^(−0.25x), where x is at least zero. For example, f(5) = 0.25e^(−(0.25)(5)) = 0.072.
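The density in Example 6.9 can be evaluated with Python's `math.exp` in place of the calculator's e^x key. This is a small sketch; the function name `exp_pdf` is an illustrative choice.

```python
import math

mean = 4        # average time with a customer, in minutes
m = 1 / mean    # decay parameter, m = 1/mu = 0.25

def exp_pdf(x, m):
    """Exponential density f(x) = m * e^(-m x) for x >= 0, and 0 otherwise."""
    return m * math.exp(-m * x) if x >= 0 else 0.0

print(exp_pdf(0, m))            # 0.25
print(round(exp_pdf(5, m), 3))  # 0.072
```

The first value is the y-intercept m, and the second matches f(5) in the example. These are heights of the curve, not probabilities; probabilities come from areas.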
The standard deviation, σ, is the same as the mean. μ = σ.
The amount of time spouses shop for anniversary cards can be modeled by an exponential
distribution with the average amount of time equal to eight minutes. Write the distribution,
state the probability density function, and graph the distribution.
Example 6.10
a. Using the information in Example 6.9, find the probability that a clerk spends less than
four minutes with a randomly selected customer.
b. Half of all customers are finished within how long? (Find the 50th percentile).
Solution 6.10
The cumulative distribution function (CDF) gives the area to the left.
P(x < k) = 1 – e^(–mk)
a. P(x < 4) = 1 – e^(–(0.25)(4)) = 1 – e^(–1) = 0.6321
The probability that a postal clerk spends less than four minutes with a randomly selected
customer is 63.21%.
NOTE: µ = 4 is the mean, and 63.21% of the distribution lies below it. This is unlike the uniform
distribution, where µ is also the median.
b. kth percentile = ln(area to right)/(−m), where ln is the natural logarithm.
50th percentile = ln(0.5)/(−0.25)
k = 2.8 minutes
NOTE: The mean is larger. The mean is 4 minutes and the median is 2.8 minutes.
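Both parts of Example 6.10 follow from the cumulative distribution function and the percentile formula. The sketch below assumes the decay parameter m = 0.25 from Example 6.9; the function names are hypothetical.

```python
import math

mean = 4
m = 1 / mean   # decay parameter from Example 6.9

def exp_cdf(k, m):
    """P(x < k) = 1 - e^(-m k): the area to the left of k."""
    return 1 - math.exp(-m * k)

def exp_percentile(p, m):
    """Value k with area p to the left; the area to the right is 1 - p."""
    return math.log(1 - p) / (-m)

print(round(exp_cdf(4, m), 4))             # 0.6321
print(round(exp_percentile(0.50, m), 1))   # 2.8
```

The percentile function is the CDF solved for k: setting 1 − e^(−mk) = p gives k = ln(1 − p)/(−m).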
The number of days ahead travelers purchase their airline tickets can be modeled by an
exponential distribution with the average amount of time equal to 15 days. Find the
probability that a traveler will purchase a ticket fewer than ten days in advance. How many
days do half of all travelers wait?
Example 6.11
On the average, a certain computer part lasts ten years. The length of time the computer part
lasts is exponentially distributed.
a. What is the probability that a computer part lasts more than 7 years?
b. On the average, how long would five computer parts last if they are used one after another?
c. Eighty percent of computer parts last at most how long?
d. What is the probability that a computer part lasts between nine and 11 years?
Solution 6.11
μ = 10, so m = 1/μ = 1/10 = 0.1
a. P(x > 7) = 1 – P(x < 7) = 1 – (1 – e^(–(0.1)(7))) = e^(–0.7) = 0.4966
b. On the average, one computer part lasts ten years. Therefore, five computer parts, if they
are used one right after the other would last, on the average, (5)(10) = 50 years.
c. Find the 80th percentile. Draw the graph. Let k = the 80th percentile.
kth percentile = ln(area to right)/(−m)
k = ln(0.2)/(−0.1) = 16.1 years
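d. P(9 < x < 11) = P(x < 11) – P(x < 9) = (1 – e^(–(0.1)(11))) – (1 – e^(–(0.1)(9))) = 0.6671 – 0.5934 = 0.0737
The probability that a computer part lasts between nine and 11 years is 7.37%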
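The probability and percentile calculations for the computer part in Example 6.11 can be sketched as follows. The only assumption is the decay parameter m = 0.1 from the solution; the names are illustrative.

```python
import math

m = 1 / 10   # decay parameter for a mean lifetime of ten years

def prob_less(k):
    """P(x < k) = 1 - e^(-m k), the area to the left of k."""
    return 1 - math.exp(-m * k)

p_more_than_7 = 1 - prob_less(7)            # lasts more than 7 years
k_80 = math.log(1 - 0.80) / (-m)            # 80th percentile
p_9_to_11 = prob_less(11) - prob_less(9)    # lasts between 9 and 11 years

print(round(p_more_than_7, 4), round(k_80, 1), round(p_9_to_11, 4))
# 0.4966 16.1 0.0737
```

A probability between two values is the area to the left of the larger value minus the area to the left of the smaller one.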
On average, a pair of running shoes can last 18 months if used every day. The length of time
running shoes last is exponentially distributed. What is the probability that a pair of running
shoes last more than 15 months? On average, how long would six pairs of running shoes last
if they are used one after the other? Eighty percent of running shoes last at most how long if
used every day?
Example 6.12
The time spent waiting between events is often modeled using the exponential distribution.
For example, suppose that an average of 30 customers per hour arrive at a store and the time
between arrivals is exponentially distributed.
Solution 6.12
a. Since we expect 30 customers to arrive per hour (60 minutes), we expect one
customer to arrive every two minutes on average.
b. Since one customer arrives every two minutes on average, it will take six minutes on
average for three customers to arrive.
e. We want to solve for the 70th percentile.
kth percentile = ln(area to right)/(−m)
k = ln(0.3)/(−0.5) = 2.41
Thus, seventy percent of customers arrive within 2.41 minutes of the previous
customer.
f. This model assumes that a single customer arrives at a time, which may not be
reasonable since people might shop in groups, leading to several customers arriving at
the same time. It also assumes that the flow of customers does not change throughout
the day, which is not valid if some times of the day are busier than others.
Suppose that on a certain stretch of highway, cars pass at an average rate of five cars per
minute. Assume that the duration of time between successive cars follows the exponential
distribution.
Memorylessness of the Exponential Distribution
In Example 6.12, recall that the amount of time between customers is exponentially
distributed with a mean of two minutes (X ~ Exp (0.5)). Suppose that five minutes have
elapsed since the last customer arrived. Since an unusually long amount of time has now
elapsed, it would seem to be more likely for a customer to arrive within the next minute. With
the exponential distribution, this is not the case–the additional time spent waiting for the
next customer does not depend on how much time has already elapsed since the last
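customer. This is referred to as the memoryless property. Specifically, the memoryless
property says that P(x > r + t|x > r) = P(x > t) for all r ≥ 0 and t ≥ 0.
For example, if five minutes has elapsed since the last customer arrived, then the probability
that more than one minute will elapse before the next customer arrives is computed by using
r = 5 and t = 1 in the foregoing equation.
P(x > 5 + 1|x > 5) = P(x > 1) = e^(–(0.5)(1)) ≈ 0.6065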
This is the same probability as that of waiting more than one minute for a customer to arrive
after the previous arrival.
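The memoryless property can be verified numerically for the store example, where the decay parameter is m = 0.5. The sketch below uses P(x > k) = e^(−mk); the function name is hypothetical.

```python
import math

m = 0.5   # decay parameter: mean time between customers is two minutes

def prob_greater(k):
    """P(x > k) = e^(-m k) for an exponential random variable."""
    return math.exp(-m * k)

r, t = 5, 1   # five minutes already elapsed, one more minute of waiting

# Conditional probability from its definition: P(x > r + t and x > r) / P(x > r).
conditional = prob_greater(r + t) / prob_greater(r)
unconditional = prob_greater(t)

print(round(conditional, 4), round(unconditional, 4))  # 0.6065 0.6065
```

The two numbers agree because e^(−m(r + t))/e^(−mr) simplifies to e^(−mt), whatever the value of r.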
Example 6.13
Refer to Example 6.9 where the time a postal clerk spends with his or her customer has an
exponential distribution with a mean of four minutes. Suppose a customer has spent four
minutes with a postal clerk. What is the probability that he or she will spend at least an
additional three minutes with the postal clerk?
Solution 6.13
For this exponential distribution, P(x > k) = e^(–0.25k).
We want to find P(x > 7|x > 4). The memoryless property says that P(x > 7|x > 4) = P(x >
3), so we just need to find the probability that a customer spends more than three minutes
with a postal clerk.
P(x > 3) = e^(–(0.25)(3)) = e^(–0.75) ≈ 0.4724
Suppose that the longevity of a light bulb is exponential with a mean lifetime of eight years.
If a bulb has already lasted 12 years, find the probability that it will last a total of over 19
years.
There is an interesting relationship between the exponential distribution and the Poisson
distribution. Suppose that the time that elapses between two successive events follows the
exponential distribution with a mean of μ units of time. Also assume that these times are
independent, meaning that the time between events is not affected by the times between
previous events. If these assumptions hold, then the number of events per unit time follows
a Poisson distribution with mean λ = 1/μ. Recall from the chapter on Discrete Random
Variables that if X has the Poisson distribution with mean λ, then P(x = k) = λ^k e^(−λ)/k!.
Conversely, if the number of events per unit time follows a Poisson distribution, then the
amount of time between events follows the exponential distribution.
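One way to see this relationship is with a small simulation. The sketch below is an illustration only: the rate of four events per minute, the number of simulated minutes, and the fixed seed are arbitrary choices made here, and `random.expovariate` from Python's standard library is used to draw the exponential waiting times.

```python
import random

random.seed(1)      # fixed seed so the sketch is repeatable

rate = 4            # average number of events per minute (mean wait = 1/4 minute)
minutes = 20_000    # number of one-minute windows to simulate

counts = [0] * minutes
clock = random.expovariate(rate)        # time of the first event
while clock < minutes:
    counts[int(clock)] += 1             # record the event in its one-minute window
    clock += random.expovariate(rate)   # exponential wait until the next event

mean_count = sum(counts) / minutes
share_with_5 = counts.count(5) / minutes
print(round(mean_count, 2), round(share_with_5, 3))
```

With these settings the average count per minute should come out close to 4, and the share of minutes with exactly five events should be close to the Poisson value 4^5 e^(−4)/5! ≈ 0.1563.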
Example 6.14
At a police station in a large city, calls come in at an average rate of four calls per minute.
Assume that the time that elapses from one call to the next has the exponential distribution.
Take note that we are concerned only with the rate at which calls come in, and we are
ignoring the time spent on the phone. We must also assume that the times spent between
calls are independent. This means that a particularly long delay between two calls does not
mean that there will be a shorter waiting period for the next call. We may then deduce that
the total number of calls received during a time period has the Poisson distribution.
Solution 6.14
a. On average, four calls occur per minute, so 15 seconds, or 15/60 = 0.25
minutes, elapse between successive calls on average.
b. Let T = time elapsed between calls. From part a, μ = 0.25, so m = 1/μ = 1/0.25 = 4. Thus,
T ∼ Exp(4). The cumulative distribution function is P(t < k) = 1 – e^(–4k).
The probability that the next call occurs in less than ten seconds (ten seconds =
1/6 minute) is P(t < 1/6) = 1 – e^(–4(1/6)) ≈ 0.4866
c. Let X = the number of calls per minute. As previously stated, the number of calls
per minute has a Poisson distribution, with a mean of four calls per minute.
Therefore, X ∼ Poisson(4), and so
P(x = 5) = 4^5 e^(−4)/5! = poissonpdf(4, 5) ≈ 0.1563
(NOTE: 5! = 5∙4∙3∙2∙1 = 120)
d. Keep in mind that X must be a whole number, so P(x < 5) = P(x ≤ 4). To compute
this, we could take P(x = 0) + P(x = 1) + P(x = 2) + P(x = 3) + P(x = 4).
Using technology, we see that P(x ≤ 4) = poissoncdf(4, 4) = 0.6288
e. Let Y = the number of calls that occur during an eight minute period.
Since there is an average of four calls per minute, there is an average of (8)(4) =
32 calls during each eight minute period. Hence, Y ∼ Poisson(32). Therefore, P(y >
40) = 1 – P(y ≤ 40) = 1 – poissoncdf(32, 40) ≈ 0.0707
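The calculator functions used in Solution 6.14 can be imitated in plain Python. The functions `poisson_pdf` and `poisson_cdf` below are hypothetical stand-ins written for this sketch (they are not the calculator's own routines), built directly from the formula P(x = k) = λ^k e^(−λ)/k!.

```python
import math

def poisson_pdf(mean, k):
    """P(x = k) for a Poisson distribution with the given mean."""
    return mean**k * math.exp(-mean) / math.factorial(k)

def poisson_cdf(mean, k):
    """P(x <= k): add up P(x = 0) through P(x = k)."""
    return sum(poisson_pdf(mean, i) for i in range(k + 1))

m = 4                                  # calls per minute, so T ~ Exp(4)
p_b = 1 - math.exp(-m * (1 / 6))       # part b: next call within 1/6 minute
p_c = poisson_pdf(4, 5)                # part c: exactly five calls in a minute
p_d = poisson_cdf(4, 4)                # part d: fewer than five calls in a minute
p_e = 1 - poisson_cdf(32, 40)          # part e: more than 40 calls in eight minutes

for value in (p_b, p_c, p_d, p_e):
    print(round(value, 4))
```

The printed values can be compared with the 0.4866, 0.1563, 0.6288, and 0.0707 obtained in the solution.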
KEY TERMS
Uniform Distribution a continuous random variable (RV) that has equally likely outcomes
over the domain, a < x < b; it is often referred to as the rectangular distribution because the
graph of the pdf has the form of a rectangle. Notation: X ~ U(a, b). The mean is μ = (a + b)/2 and
the standard deviation is σ = (b − a)/√12. The probability density function is f(x) = 1/(b − a) for
a < x < b or a ≤ x ≤ b. The cumulative distribution is P(x ≤ k) = (k − a)/(b − a).
Decay parameter: The decay parameter describes the rate at which probabilities decay to
zero for increasing values of x.
It is the value m in the probability density function f(x) = me^(−mx) of an exponential random
variable. It is also equal to m = 1/μ, where μ is the mean of the random variable.
Exponential Distribution a continuous random variable (RV) that appears when we are
interested in the intervals of time between some random events, for example, the length of
time between emergency arrivals at a hospital; the notation is X ~ Exp(m). The mean is μ = 1/m
and the standard deviation is σ = 1/m. The probability density function is f(x) = me^(−mx), x ≥ 0,
and the cumulative distribution function is P(x ≤ k) = 1 − e^(−mk).
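Memoryless Property For an exponential random variable, the probability of waiting an
additional k units of time does not depend on how much time r has already elapsed.
In symbols we say that P(x > r + k|x > r) = P(x > k).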
Poisson distribution If there is a known average of λ events occurring per unit time, and
these events are independent of each other, then the number of events X occurring in one
unit of time has the Poisson distribution. The probability of k events occurring in one unit
time is equal to P(x = k) = λ^k e^(−λ)/k!.
Formula Review
6.2 The Uniform Distribution
• The probability density function is f(x) = 1/(b − a) for a ≤ x ≤ b
• The mean of a uniform probability distribution, µ = (a + b)/2
• Variance of a uniform probability distribution, σ² = (b − a)²/12
• The standard deviation, σ = (b − a)/√12
• P(x < k) = (k − a)(1/(b − a))
• P(c < x < d) = (d − c)(1/(b − a))
• Probability = Area; Area = Base ∙ Height
EXERCISES FOR CHAPTER 6
For each probability and percentile problem, draw the picture.
5. For a continuous probability distribution, 0 ≤ x ≤ 15. What is P(x > 15)?
6. What is the total area under f(x) if the function is a continuous probability density
function?
13. Find the probability that x falls in the shaded area:
14. f(x), a continuous probability function, is equal to 1/3 and the function is restricted to
1 ≤ x ≤ 4. Describe P(x > 3/2)
15. The uniform distribution X ~ U(1.5, 4.5) describes the square footage (in 1,000 square
feet) of homes.
a. What is the height of f(x) for the continuous probability distribution?
b. What are the constraints for the values of x?
c. Graph P(2 < x < 3).
d. What is P(2 < x < 3)?
e. What is P(x < 3.5| x < 4)?
f. What is P(x = 1.5)?
g. What is the 90th percentile of square footage for homes?
h. Find the probability that a randomly selected home has more than 3,000 square
feet given that you already know the house has more than 2,000 square feet.
17. The age of cars in the staff parking lot of a suburban college is uniformly distributed
from six months (0.5 years) to 9.5 years.
a. What is being measured here?
b. In words, define the random variable X.
c. Are the data discrete or continuous?
d. The interval of values for x is
e. The distribution for X is
f. Write the probability density function.
g. Graph the probability distribution.
h. Find the probability that a randomly chosen car in the lot was less than four
years old.
i. Considering only the cars less than 7.5 years old, find the probability that a
randomly chosen car in the lot was less than four years old.
j. What changed in the previous two problems that made the solutions different?
k. Find the third quartile of ages of cars in the lot.
18. A customer service representative must spend different amounts of time with each
customer to resolve various concerns. The amount of time spent with each customer
can be modeled by the following distribution: X ~ Exp(0.2)
a. What type of distribution is this?
b. Are outcomes equally likely in this distribution? Why or why not?
c. What is m? What does it represent?
d. What is the mean?
e. What is the standard deviation?
f. State the probability density function.
g. Graph the distribution.
h. Find P(2 < x < 10).
i. Find P(x > 6).
j. Find the 70th percentile.
20. Carbon-14 is a radioactive element with a half-life of about 5,730 years. Carbon-14 is
said to decay exponentially. The decay rate is 0.000121. We start with one gram of
carbon-14. We are interested in the time (years) it takes to decay carbon-14.
f. Find the amount (percent of one gram) of carbon-14 lasting less than 5,730 years.
This means, find P(x < 5,730).
g. Find the percentage of carbon-14 lasting longer than 10,000 years.
h. Thirty percent (30%) of carbon-14 will decay within how many years?
21. Consider the following experiment. You are one of 100 people enlisted to take part in
a study to determine the percent of nurses in America with an R.N. (registered
nurse) degree. You ask nurses if they have an R.N. degree. The nurses answer “yes”
or “no.” You then calculate the percentage of nurses with an R.N. degree. You give
that percentage to your supervisor.
a. What part of the experiment will yield discrete data?
b. What part of the experiment will yield continuous data?
22. When age is rounded to the nearest year, do the data stay continuous, or do they
become discrete? Why?
23. Births are approximately uniformly distributed between the 52 weeks of the year.
They can be said to follow a uniform distribution from one to 53 (spread of 52
weeks).
a. X ~
b. Graph the probability distribution.
c. f(x) =
d. μ =
e. σ =
f. Find the probability that a person is born at the exact moment week 19 starts.
That is, find P(x = 19) =
g. P(2 < x < 31) =
h. Find the probability that a person is born after week 40.
i. P(12 < x|x < 28) =
j. Find the 70th percentile.
k. Find the minimum for the upper quarter.
24. A random number generator picks a number from one to nine in a uniform manner.
a. X ~
b. Graph the probability distribution.
c. f(x) =
d. μ =
e. σ =
f. P(3.5 < x < 7.25) =
g. P(x > 5.67)
h. P(x > 5|x > 3) =
i. Find the 90th percentile.
25. According to a study by Dr. John McDougall of his live-in weight loss program at St.
Helena Hospital, the people who follow his program lose between six and 15 pounds
a month until they approach trim body weight. Let’s suppose that the weight loss is
uniformly distributed. We are interested in the weight loss of a randomly selected
individual following the program for one month.
a. Define the random variable. X =
b. X ~
c. Graph the probability distribution.
d. f(x) =
e. μ =
f. σ =
g. Find the probability that the individual lost more than ten pounds in a month.
h. Suppose it is known that the individual lost more than ten pounds in a month.
Find the probability that he lost less than 12 pounds in the month.
i. P(7 < x < 13|x > 9) = . State this in a probability question, similarly to parts g
and h, draw the picture, and find the probability.
26. A subway train on the Red Line arrives every eight minutes during rush hour. We
are interested in the length of time a commuter must wait for a train to arrive. The
time follows a uniform distribution.
a. Define the random variable. X =
b. X ~
c. Graph the probability distribution.
d. f(x) =
e. μ =
f. σ =
g. Find the probability that the commuter waits less than one minute.
h. Find the probability that the commuter waits between three and four minutes.
i. Sixty percent of commuters wait more than how long for the train? State this in
a probability question, similarly to parts g and h, draw the picture, and find the
probability.
27. The age of a first grader on September 1 at Garden Elementary School is uniformly
distributed from 5.8 to 6.8 years. We randomly select one first grader from the class.
a. Define the random variable. X =
b. X ~
c. Graph the probability distribution.
d. f(x) =
e. μ =
f. σ =
g. Find the probability that she is over 6.5 years old.
h. Find the probability that she is between four and six years old.
i. Find the 70th percentile for the age of first graders on September 1 at Garden
Elementary School.
28. The Sky Train from the terminal to the rental–car and long–term parking center is
supposed to arrive every eight minutes. The waiting times for the train are known to
follow a uniform distribution.
a. What is the average waiting time (in minutes)?
b. Find the 30th percentile for the waiting times (in minutes).
c. The probability of waiting more than seven minutes given a person has waited
more than four minutes is?
29. The time (in minutes) until the next bus departs a major bus depot follows a
distribution with f(x) = 1/20 where x goes from 25 to 45 minutes.
a. Define the random variable. X =
b. X ~
c. Graph the probability distribution.
d. The distribution is _______ (name of distribution). Determine if it is discrete or
continuous.
e. μ =
f. σ =
g. Find the probability that the time is at most 30 minutes. Sketch and label a graph
of the distribution. Shade the area of interest. Write the answer in a probability
statement.
h. Find the probability that the time is between 30 and 40 minutes. Sketch and label
a graph of the distribution. Shade the area of interest. Write the answer in a
probability statement.
i. P(25 < x < 55) = . State this in a probability statement, similarly to parts g and
h, draw the picture, and find the probability.
j. Find the 90th percentile. This means that 90% of the time, the time is less than
_______ minutes.
k. Find the 75th percentile. In a complete sentence, state what this means. (See part
j.)
l. Find the probability that the time is more than 40 minutes given (or knowing that)
it is at least 30 minutes.
30. Suppose that the value of a stock varies each day from $16 to $25 with a uniform
distribution.
a. Find the probability that the value of the stock is more than $19.
b. Find the probability that the value of the stock is between $19 and $22.
c. Find the upper quartile, which means 25% of all days the stock is above what
value? Draw the graph.
d. Given that the stock is greater than $18, find the probability that the stock is more
than $21.
31. A fireworks show is designed so that the time between fireworks is between one and
five seconds, and follows a uniform distribution.
a. Find the average time between fireworks.
b. Find the probability that the time between fireworks is greater than four seconds.
32. The number of miles driven by a truck driver falls between 300 and 700 miles, and
follows a uniform distribution.
a. Find the probability that the truck driver goes more than 650 miles in a day.
b. Find the probability that the truck driver goes between 400 and 650 miles in a
day.
c. At least how many miles does the truck driver travel on the furthest 10% of days?
33. Suppose that the length of long distance phone calls, measured in minutes, is known
to have an exponential distribution with the average length of a call equal to eight
minutes.
a. Define the random variable. X =
b. X ~
c. Determine if the variable is discrete or continuous.
d. μ =
e. σ =
f. Draw a graph of the probability distribution. Label the axes.
g. Find the probability that a phone call lasts less than nine minutes.
h. Find the probability that a phone call lasts more than nine minutes.
i. Find the probability that a phone call lasts between seven and nine minutes.
j. If 25 phone calls are made one after another, on average, what would you expect
the total to be? Why?
34. Suppose that the useful life of a particular car battery, measured in months, decays
with parameter 0.025. We are interested in the life of the battery.
a. Define the random variable. X = .
b. Is X continuous or discrete?
c. X ~
d. On average, how long would you expect one car battery to last?
e. On average, how long would you expect nine car batteries to last, if they are used
one after another?
f. Find the probability that a car battery lasts more than 36 months.
g. Seventy percent of the batteries last at least how long?
35. The percent of persons (ages five and older) in each state who speak a language at
home other than English is approximately exponentially distributed with a mean of
9.848. Suppose we randomly pick a state.
a. Define the random variable. X = .
b. Is X continuous or discrete?
c. X ~
d. μ =
e. σ =
f. Draw a graph of the probability distribution. Label the axes.
g. Find the probability that the percent is less than 12.
h. Find the probability that the percent is between eight and 14.
i. The percent of all individuals living in the United States who speak a language at
home other than English is 13.8.
i. Why is this number different from 9.848%?
ii. What would make this number higher than 9.848%?
36. The time (in years) after reaching age 60 that it takes an individual to retire is
approximately exponentially distributed with a mean of about five years. Suppose we
randomly pick one retired individual. We are interested in the time after age 60 to
retirement.
a. Define the random variable. X = .
b. Is X continuous or discrete?
c. X ~
d. μ =
e. σ =
f. Draw a graph of the probability distribution. Label the axes.
g. Find the probability that the person retired after age 70.
h. Do more people retire before age 65 or after age 65?
i. In a room of 1,000 people over age 80, how many do you expect will NOT have
retired yet?
37. The cost of all maintenance for a car during its first year is approximately
exponentially distributed with a mean of $150.
a. Define the random variable. X = .
b. X ~
c. μ =
d. σ =
e. Draw a graph of the probability distribution. Label the axes.
f. Find the probability that a car required over $300 for maintenance during its first
year.
38. The average lifetime of a certain new cell phone is three years. The manufacturer will
replace any cell phone failing within two years of the date of purchase. The lifetime of
these cell phones is known to follow an exponential distribution.
a. The decay rate is:
b. What is the probability that a phone will fail within two years of the date of
purchase?
c. What is the median lifetime of these phones (in years)?
40. Suppose that the longevity of a light bulb is exponential with a mean lifetime of eight
years.
a. Find the probability that a light bulb lasts less than one year.
b. Find the probability that a light bulb lasts between six and ten years.
c. Seventy percent of all light bulbs last at least how long?
d. A company decides to offer a warranty to give refunds to light bulbs whose lifetime
is among the lowest two percent of all bulbs. To the nearest month, what should
be the cutoff lifetime for the warranty to take place?
e. If a light bulb has lasted seven years, what is the probability that it fails within
the 8th year?
41. At a 911 call center, calls come in at an average rate of one call every two minutes.
Assume that the time that elapses from one call to the next has the exponential
distribution.
a. On average, how much time occurs between five consecutive calls?
b. Find the probability that after a call is received, it takes more than three minutes
for the next call to occur.
c. Ninety percent of all calls occur within how many minutes of the previous call?
d. Suppose that two minutes have elapsed since the last call. Find the probability
that the next call will occur within the next minute.
e. Find the probability that less than 20 calls occur within an hour.
42. In major league baseball, a no-hitter is a game in which a pitcher, or pitchers, doesn't
give up any hits throughout the game. No-hitters occur at a rate of about three per
season. Assume that the duration of time between no-hitters is exponential.
a. What is the probability that an entire season elapses with a single no-hitter?
b. If an entire season elapses without any no-hitters, what is the probability that
there are no no-hitters in the following season?
c. What is the probability that there are more than 3 no-hitters in a single season?
d. What is the median?
e. What is the variance?
43. Assume that the life span of Android batteries is exponentially distributed. The
average time the batteries last is 4.2 years.
a. What is the probability that an Android battery lasts less than 5 years?
b. What is the probability that an Android battery lasts less than 6 years given that
it lasted already 5 years?
c. What is the variance?
d. What is the median?
e. The middle 50% of batteries last between which 2 years?
44. According to the American Red Cross, about one out of nine people in the U.S. have
Type B blood. Suppose the blood types of people arriving at a blood drive are
independent. In this case, the number of Type B blood types that arrive roughly
follows the Poisson distribution.
a. If 100 people arrive, how many on average would be expected to have Type B blood?
b. What is the probability that over 10 people out of these 100 have type B blood?
c. What is the probability that more than 20 people arrive before a person with type
B blood is found?
45. A web site experiences traffic during normal working hours at a rate of 12 visits per
hour. Assume that the duration between visits has the exponential distribution.
a. Find the probability that the duration between two successive visits to the web
site is more than ten minutes.
b. The top 25% of durations between visits are at least how long?
c. Suppose that 20 minutes have passed since the last visit to the web site. What is
the probability that the next visit will occur within the next 5 minutes?
d. Find the probability that less than 7 visits occur within a one-hour period.
46. At an urgent care facility, patients arrive at an average rate of one patient every seven
minutes. Assume that the duration between arrivals is exponentially distributed.
a. Find the probability that the time between two successive visits to the urgent care
facility is less than 2 minutes.
b. Find the probability that the time between two successive visits to the urgent care
facility is more than 15 minutes.
c. If 10 minutes have passed since the last arrival, what is the probability that the
next person will arrive within the next five minutes?
d. Find the probability that more than eight patients arrive during a half-hour
period.
e. Find the median time between two successive visits.
f. Find the 75th percentile time between two successive visits.
g. Find the variance.
REFERENCES
6.2 The Uniform Distribution
McDougall, John A. The McDougall Program for Maximum Weight Loss. Plume, 1995.
CHAPTER 6 SOLUTIONS:
17) a. Age of cars b. X = The age (in years) of cars in the staff parking lot
c. continuous d. 0.5 to 9.5 e. X ~ U(.5, 9.5) f. f( x) = 1/9
g. Graph:
19) a. m = .75 b. f(x) = .75e^(-.75x) c. P(x < k) = 1 – e^(-.75k)
d. Graph:
21) a. The number of nurses that said yes. b. The percentage of nurses that said yes.
23) a. X ~ U(1, 53) b. Graph is a uniform graph spanning from 1 to 53 with a height of
1/52.
c. f(x) = 1/52 d. µ = 54/2 = 27 e. σ = 52/sqrt(12) = 15.01
f. P(x = 19) = 0 g. P(2 < x < 31) = base*height = 29(1/52) = .558
h. P(x > 40) = 13 (1/52) = .25
i. P(12 < x | x < 28) = P(12 < x < 28)/ P(x < 28) = .308/.519 =.593
j. P70 = k, where 0.70 = (k – 1)(1/52) => k = 37.4
k. Q3 = k, where 0.75 = (k – 1)(1/52) => k = 40.
d. f(x) = 1/9 e. µ = 21/2 = 10.5 f. σ = 9/sqrt(12) = 2.6
g. P(x > 10) = base*height = 5(1/9) = .556
h. P(x < 12 | x > 10) = P(10 < x < 12) / P(x > 10) = .222/.556 = .399
i. P(7< x < 13 | x > 9) = P(9 < x < 13) / P(x > 9) = .444/.667= .666
27) a. X = age of a first grader on Sept. 1st b. X ∼ U(5.8, 6.8)
c. Graph:
h. P(30 < x < 40) = base*height = 10(1/20) = .50 i. P(25 < x < 55) = P(x > 25) = 1
j. P90 = k, where .90 = (k – 25)(1/20) => k = 43.
k. P75 = k, where .75 = (k – 25)(1/20) => k = 40
l. P( x > 40 | x ≥ 30) = P(x > 40) / P(x ≥ 30) = .25/.75 = 1/3
g. P(x < 9) = 1 – e^(-.125*9) = .675 h. P(x > 9) = e^(-.125*9) = .325
i. P(7 < x < 9) = e^(-.125*7) - e^(-.125*9) = .0922 j. 25*8 = 200 minutes since average is 8
minutes.
35) a. X = percent of persons in each state who speak a language at home other than
English
b. Continuous c. X ∼ exp(.101) d. µ = 9.848 (given in the problem) e. σ = 9.848
g. Graph:
h. P(x < 12) = 1 - e^(-.101*12) = .702 i. P(8 < x < 14) = e^(-.101*8) - e^(-.101*14) = .203
j. The number is different because the population is different.
d. P(x < 6) = 1 - e^(-.10*6) = .451 e. P(3 < x < 6) = e^(-.10*3) - e^(-.10*6) = .192
f. P(x < 7) = 1 - e^(-.10*7) = .503
45) a. m = 12/60 = .2, so P(x > 10) = e^(-.2*10) = .135 b. P75 = ln(.25)/(-.2) = 6.93 minutes
c. P(x < 25 | x > 20) = P(20 < x < 25)/P(x > 20) = .632
d. P(x < 7 visits) = poissoncdf(12, 6) = .0458
7 | THE NORMAL DISTRIBUTION
Figure 7.1 If you ask enough people about their shoe size, you will find that your graphed data is
shaped like a bell curve and can be described as normally distributed. (credit: Ömer Ünlü)
Introduction
Chapter Objectives
The normal distribution is a continuous distribution, and is the single most important of all
the distributions discussed in this text. It is widely used and even more widely abused. Its
graph is bell-shaped, and we see the bell curve in many disciplines. Some of these include
psychology, business, economics, the sciences, nursing, and, of course, mathematics. Some of
your instructors may use the normal distribution to help determine your grade. Most IQ
scores are normally distributed, as are many standardized test scores. In this chapter, we
will study the normal distribution, the standard normal distribution, and applications
associated with them.
The normal distribution has two parameters (two numerical descriptive measures of the
population), the mean, μ, and the standard deviation, σ. If X is a quantity to be measured
that has a normal distribution with mean μ and standard deviation σ, we designate this by
writing X ~ N(μ, σ). The probability density function is a rather complicated function. Do
not memorize it. It is not necessary for calculations and is included only for completeness:
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}} \]
The graph of this function exhibits the bell-shape of the normal distribution:
The cumulative distribution function is P(X < x). It can be calculated by a calculator,
statistical software, or by using a normal distribution table. Modern technology has made the
tables virtually obsolete, and we will focus mainly on the use of the TI-84 family of calculators
for calculations. But there is still some insight to be gained from working with the tables so
we will discuss these briefly.
The curve is perfectly symmetrical about a vertical line drawn through the mean, μ. Thus
the mean is the same as the median. As the notation indicates, the normal distribution
depends only on the mean and the standard deviation. Since the area under the curve must
equal one, a change in the standard deviation, σ, causes a change in the shape of the curve;
the curve becomes fatter or skinnier depending on σ. The smaller the standard deviation is,
the narrower the normal curve appears. A change in μ causes the graph to shift to the left or
right. This means there are an infinite number of normal probability distributions.
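The effect of μ and σ on the curve can be checked numerically. The following is a minimal sketch, assuming Python and its standard library; the helper name normal_pdf is hypothetical and simply writes out the density formula above term by term.

```python
import math
from statistics import NormalDist


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Normal density f(x), written out term by term from the formula."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z ** 2) / (sigma * math.sqrt(2 * math.pi))


# Same mean, different standard deviations:
# the smaller sigma gives the taller, narrower curve.
peak_narrow = normal_pdf(0, 0, 1)
peak_wide = normal_pdf(0, 0, 2)

# Changing mu only shifts the curve left or right; the peak height is unchanged.
peak_shifted = normal_pdf(5, 5, 1)

# Cross-check the hand-written formula against the standard library.
library_value = NormalDist(0, 1).pdf(0)

print(peak_narrow, peak_wide, peak_shifted, library_value)
```

Doubling σ halves the height at the center, which is what keeps the total area under the curve equal to one.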
Drawing the normal distribution curve is straightforward. First, we know that it is bell
shaped. Second, we know that it is symmetric about the mean, μ. Therefore, draw the x-axis
and place the mean at the center. Let the scale be the standard deviation, σ. The first
turning points of the bell shape occur one standard deviation from the mean. The second
turning points of the bell occur two standard deviations from the mean.
Example 7.1
Solution 7.1
The mean of the normal distribution is 30, μ = 30. The standard deviation of the
distribution is 4, σ = 4.
One standard deviation below the mean is 30 – 4 = 26. One standard deviation above the
mean is 30 + 4 = 34. The first turning points of the bell occur at 26 and 34.
Two standard deviations below the mean is 30 – 2 · 4 = 22. Two standard deviations above
the mean is 30 + 2 · 4 = 38. The second turning points of the bell occur at 22 and 38.
x = μ + z· σ = 5 + 3· 2 = 11
That is, the mean for the standard normal distribution is zero, and the standard deviation
is one. Therefore, zero is in the center.
[Figure: standard normal curve centered at 0, with z-axis marks at -2, -1, 0, 1, 2.]
z-Scores
If X is a normally distributed random variable and X ~ N(μ, σ), then the z-score is z = (x − µ)/σ.
The z-score tells you how many standard deviations the value x is above (to the right of) or
below (to the left of) the mean μ. Values of x that are larger than the mean have positive z-
scores, and values of x that are smaller than the mean have negative z-scores. If x equals the
mean, then x has a z-score of zero.
Example 7.2
Suppose X ~ N(5, 6). This says that x is a normally distributed random variable with mean
μ = 5 and standard deviation σ = 6. Suppose x = 17; then the corresponding z-score is:
z = (x − µ)/σ = (17 − 5)/6 = 2
Similarly, if x = 1, then the corresponding z-score is:
z = (x − µ)/σ = (1 − 5)/6 = −0.67
Summarizing,
- When z is positive, x is above, or to the right of μ
- When z is negative, x is to the left of, or below μ.
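The sign rule summarized above can be tried out in a few lines. This is a minimal sketch, assuming Python; the function name z_score is hypothetical, and the values are the ones used in Example 7.2 with X ~ N(5, 6).

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


# X ~ N(5, 6), as in Example 7.2
z_above = z_score(17, 5, 6)   # x is to the right of the mean, so z is positive
z_below = z_score(1, 5, 6)    # x is to the left of the mean, so z is negative
z_at_mean = z_score(5, 5, 6)  # x equals the mean, so z is zero

print(z_above, round(z_below, 2), z_at_mean)
```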
Graphical interpretation of Example 7.2. The first graph displays x = 17 on X ~ N(5, 6). The
second graph displays z = 2 on Z ~ N(0, 1).
[Figure: two bell curves. Normal Distribution, with x-axis marks at -7, -1, 5, 11, 17; Standard Normal Distribution, with z-axis marks at -2, -1, 0, 1, 2.]
7.1 What is the z-score of x, when x = 7 and X ~ N(12,3)? Display the x value on a normal
distribution graph.
Example 7.3
Some doctors believe that a person can lose five pounds, on average, in a month by reducing
his or her fat intake and by exercising consistently. Suppose weight loss has a normal
distribution. Let X = the amount of weight lost (in pounds) by a person in a month. Use a
standard deviation of two pounds. X ~ N(5, 2). Fill in the blanks.
a. Suppose a person lost ten pounds in a month. The z-score when x = 10 pounds is z =
2.5 (verify).
Solution 7.3
a. This z-score tells you that x = 10 is 2.5 standard deviations to the right of the mean of
_5_, μ = 5.
b. z = (x − μ)/σ = (−3 − 5)/2 = −4. This z-score tells you that x = –3 is four standard deviations to the
left of the mean.
One of the most important aspects of z-scores is that they allow us to compare data that are
scaled differently. To understand the concept, suppose X ~ N(5,6) represents weight gains
for one group of people who are trying to gain weight in a six week period and Y ~ N(3, 1)
measures the same weight gain for a second group of people. A negative weight gain would
be a weight loss. Since x = -1 and y = 2 are each one standard deviation to the left of their
means, they represent the same standardized weight gain relative to their means.
[Figure: two bell curves. X ~ N(5, 6), with x-axis marks at -7, -1, 5, 11, 17; Y ~ N(3, 1), with y-axis marks at 1, 2, 3, 4, 5.]
7.2 Jerome averages 16 points a game with a standard deviation of four points. Suppose
the number of points in a game is normally distributed: X ~ N(16,4). Jerome scores ten
points in a game. The z–score when x = 10 is –1.5. This score tells you that x = 10 is ___
standard deviations to the (right or left) of the mean (What is the mean?).
Example 7.4
The mean height of 15 to 18-year-old males from Chile from 2009 to 2010 was 170 cm with
a standard deviation of 6.28 cm. Male heights are known to follow a normal distribution.
Let X = the height of a 15 to 18-year-old male from Chile in 2009 to 2010, so X ~ N(170,
6.28).
a. Suppose a 15 to 18-year-old male from Chile was 168 cm tall from 2009 to 2010.
Calculate and interpret the z-score for this individual.
b. Suppose that the height of a 15 to 18-year-old male from Chile from 2009 to 2010
has a z-score of z = 1.27. What was the male’s height?
Solution 7.4
a. z = -0.32. The individual was 0.32 standard deviations to the left of the mean µ = 170.
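Both directions of the conversion in Example 7.4 (from a height to a z-score, and from a z-score back to a height) can be sketched as follows. This is a minimal sketch, assuming Python; the function names are hypothetical, and the reverse conversion uses x = μ + z·σ, which is the z-score formula solved for x.

```python
def z_from_x(x: float, mu: float, sigma: float) -> float:
    """Standardize: how many standard deviations x is from the mean."""
    return (x - mu) / sigma


def x_from_z(z: float, mu: float, sigma: float) -> float:
    """Reverse the standardization: x = mu + z * sigma."""
    return mu + z * sigma


MU, SIGMA = 170, 6.28  # heights in cm, X ~ N(170, 6.28)

z_part_a = z_from_x(168, MU, SIGMA)        # part a: z-score of a 168 cm height
height_part_b = x_from_z(1.27, MU, SIGMA)  # part b: height whose z-score is 1.27

print(round(z_part_a, 2), round(height_part_b, 2))
```

Converting a value to a z-score and back returns the original value, which is a quick way to check the arithmetic.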
7.3 Use the information in Example 7.4 to answer the following questions.
a. Suppose a 15 to 18-year-old male from Chile in this time period was 176 cm tall.
What does that tell us about the individual’s height relative to the population?
b. Suppose that the height of a 15 to 18-year-old male from Chile from 2009 to 2010
has a z-score of z = -2. What was the individual’s height?
The standard normal distribution is also known as the Z-distribution. The total area under the
curve is 1, or 100%. As in the previous chapter, we find probabilities as areas under the
curve, and we will use technology to find them for the standard normal distribution.
The notation P(z < k) represents the probability of a z-score less than a particular value k
in the standard normal distribution. Below is a graphical representation of P(z < k):
To find the probability (area), we can use the TI-83 or TI-84 (Plus) calculator or Excel.
Example 7.5
Solution 7.5
Using an Excel function, you must type an equal sign in an empty cell first.
Go to an empty cell and type an equal sign, then start typing Norm.s.dist.
After you press Enter, the answer will appear.
Using an Excel function, you must type an equal sign in an empty cell first. The
norm.s.dist(z, true) function gives the area to the left. Therefore, to find the area to
the right, we use the complement: 1 – norm.s.dist(z, true)
c. Using the calculator, P(1.2 < z < 2.4) = normalcdf (1.2, 2.4, 0, 1) = 0.1069
The Excel function norm.s.dist(z, true) gives the area to the left. We want
the area in between, so we take the difference of the area
for z = 2.4 and the area for z = 1.2.
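The three kinds of area under the standard normal curve (left, right, and in between) can also be computed outside the calculator and Excel. The following is a minimal sketch, assuming Python's standard-library NormalDist as a stand-in for normalcdf and norm.s.dist; the helper names are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist(0, 1)  # standard normal: mean 0, standard deviation 1


def area_left(z: float) -> float:
    """P(Z < z): the cumulative area to the left of z."""
    return Z.cdf(z)


def area_right(z: float) -> float:
    """P(Z > z): the complement of the area to the left."""
    return 1 - Z.cdf(z)


def area_between(z_low: float, z_high: float) -> float:
    """P(z_low < Z < z_high): the difference of two left areas."""
    return Z.cdf(z_high) - Z.cdf(z_low)


p_between = area_between(1.2, 2.4)
print(round(p_between, 4))  # matches normalcdf(1.2, 2.4, 0, 1) = 0.1069
```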
7.2 | Using Normal Distribution
Recall that the probability that the random variable X falls into any interval will be the area
under the curve corresponding to that interval. For example, scores for the Stanford-Binet
IQ test are normally distributed with a mean of µ = 100 and standard deviation of σ = 15.
Then the probability that a randomly selected individual has an IQ between 95 and 120 would
be about 54%, as shown in the graph:
[Figure: Distribution plot of the normal density with mean = 100 and standard deviation = 15 (Density versus X); the shaded area between X = 95 and X = 120 is 0.5393.]
To calculate probabilities like the problem above, we need to work with the cumulative
distribution function, P(X < x). The shaded area in the graph below shows this probability;
it is the area under the normal curve that is to the left of x. To calculate this area manually
would require Calculus; but there is a table of values that shows these areas for the standard
normal distribution, and many calculators and statistical software packages also will
calculate cumulative normal probabilities.
Just as we did for the Z-distribution, we will use normalcdf to find the area under the
curve for a normal distribution that is not standard normal.
Example 7.6
Suppose that in the year 2020 a finance researcher took the daily price of XYZ stock and found
its daily return. The daily return (day to day return) for XYZ stock is normally
distributed with a mean of 0.12%. This means that, on average, from one day to the next
the stock went up 0.12%. The standard deviation over the course of the year, day to day, was
1.81%. Let x = 1.93%.
Solution 7.6
a.) x = 1.93% means that on a particular day XYZ stock had a return of 1.93%.
b.) z = (x − μ)/σ = (1.93 − 0.12)/1.81 = 1.
This means that the x-value is one standard deviation above the mean. Below is the
graphical representation using the normal distribution curve and converting it to
standard normal distribution by finding the z-score of the x value.
[Figure: two bell curves. Normal Distribution (μ = 0.12%; σ = 1.81%) and Standard Normal Distribution (μ = 0; σ = 1).]
e.) Probability for a continuous probability distribution is the area under the respective
curve. The area of the normal distribution curve that is represented by P(x < 1.93%)
with X ~ N(0.12%, 1.81%) is the same as the area of the standard normal distribution curve
that is represented by P(z < 1) with Z ~ N(0, 1).
Example 7.7
The final exam scores in a large statistics class were normally distributed with a mean of
63 points and a standard deviation of 5 points.
a. Find the probability that a randomly selected student scored more than 65 points on the
exam without finding the z-score.
b. Find the probability that a randomly selected student scored less than 85 points on the
exam without finding the z-score.
Solution 7.7
a. Let X = score on the final exam. X ~ N(63, 5) is the notation used to state that the
distribution is normally distributed with μ = 63 and σ = 5.
We start by drawing a graph to represent P(x > 65):
Then, find P(x > 65) by using the calculator, normalcdf(lower, upper, µ, σ). The lower bound
of the shaded area is at x = 65, the upper bound of the shaded area is positive infinity which
we represent using 10^99 on the calculator.
b. Here we want to find P(x < 85). The shaded area is on the left side of 85. The lower
bound of the shaded area is negative infinity which we represent using -10^99 on the
calculator. The upper bound of the shaded area is 85.
The solution: P(x < 85) = normalcdf(-10^99, 85, 63, 5) = 0.999995. This tells us that
virtually all of the students in the class scored below 85.
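The two calculator steps in Solution 7.7 can be mirrored in code. This is a minimal sketch, assuming Python's standard-library NormalDist; the function below is a hypothetical stand-in that imitates the calculator's normalcdf(lower, upper, µ, σ), and 1e99 plays the role of 10^99.

```python
from statistics import NormalDist


def normalcdf(lower: float, upper: float, mu: float, sigma: float) -> float:
    """Area under the N(mu, sigma) curve between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)


# X ~ N(63, 5): final exam scores
p_more_than_65 = normalcdf(65, 1e99, 63, 5)   # part a: right tail
p_less_than_85 = normalcdf(-1e99, 85, 63, 5)  # part b: left tail

print(round(p_more_than_65, 4), round(p_less_than_85, 6))
```

Because the function takes µ and σ directly, no z-score has to be computed by hand, which is the point of parts a and b.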
7.4 In 2019, over 2.2 million students took the SAT exam. The distribution of total scores of
the SAT had a mean µ = 1059 and a standard deviation σ = 210. Let X = an SAT exam score
in 2019. Let X ~ N(1059, 210). Find the z-scores for x1 = 933 and x2 = 1220. State the
interpretation. Find P(x < 933) and P(x > 1220).
Now, what if we are given the area under the curve and want to find the value on the x-axis?
This is the inverse of the operation we performed in the
last example. We can find percentiles of the Z-distribution, Z ~ N(0, 1), or of a normal
distribution, X ~ N(µ, σ), using the calculator.
NOTE: Newer versions of the calculator allow you to specify which area is given; if you are
given the left area, the center area, or the right area. The above instructions are for older
calculators where the area that needs to be entered is the left area as seen in the next
example.
Example 7.8
The final exam scores in a large statistics class were normally distributed with a mean of
63 points and a standard deviation of 5 points. Find the 90th percentile.
Solution 7.8
Let the random variable X represent scores on the final exam. X ~ N(63, 5) is the notation
used to state that the distribution is normally distributed with μ = 63 and σ = 5. The
variable k on the x-axis is a specific score that has 90% of final exam scores below it and
10% of scores above it.
To find k using the calculator: invnorm(area to the left, µ, σ)
Remember that Excel and the TI calculators are made by different companies. The Excel function
takes the same inputs, but the name of the function is different.
There is a third way of finding this percentile: using the z-score formula. Recall that the z-score
gives us the number of standard deviations a value is from the mean.
z = (x − μ)/σ
We can isolate the x value: x = µ + z·σ. We know the mean and standard deviation. We can
find z using the calculator: z = invnorm(.9, 0, 1) = 1.28, so x = 63 + 1.28·5 = 69.4. Remember that
the area under the standard normal curve and the area under the normal curve are the same.
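Both routes to the 90th percentile in Example 7.8 (directly, and through the z-score) can be compared side by side. The following is a minimal sketch, assuming Python's standard-library NormalDist; inv_norm is a hypothetical helper that imitates the calculator's invnorm(area to the left, µ, σ).

```python
from statistics import NormalDist


def inv_norm(area_left: float, mu: float = 0, sigma: float = 1) -> float:
    """Value on the x-axis that has the given area to its left."""
    return NormalDist(mu, sigma).inv_cdf(area_left)


# Route 1: ask for the percentile of X ~ N(63, 5) directly.
k_direct = inv_norm(0.90, 63, 5)

# Route 2: find the z-score first, then convert with x = mu + z * sigma.
z_90 = inv_norm(0.90)
k_via_z = 63 + z_90 * 5

print(round(z_90, 2), round(k_direct, 1), round(k_via_z, 1))
```

The two routes agree because standardizing does not change the area to the left of the value.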
Example 7.9
A personal computer is used for office work at home, research, communication, personal
finances, education, entertainment, social networking, and a myriad of other things. Suppose
that the average number of hours a household personal computer is used for entertainment
is two hours per day. Assume the times spent on entertainment are normally distributed and
the standard deviation for the times is half an hour.
a. Find the probability that a household personal computer is used for entertainment
between 1.8 and 2.75 hours per day.
b. Find the maximum number of hours per day that the bottom quartile of households
uses a personal computer for entertainment.
Solution 7.9
Let X = the amount of time (in hours) a household personal computer is used for
entertainment. We are told that µ = 2 and σ = 1/2, so X ~ N(2, 0.5)
a. We wish to find P(1.8 < x < 2.75). This probability is the area between x = 1.8 and
x = 2.75:
Using the calculator, P(1.8 < x < 2.75) = normalcdf(1.8, 2.75, 2, 0.5) = 0.5886.
b. To find the maximum number of hours per day that the bottom quartile of households
uses a personal computer for entertainment, find the 25th percentile, k. This is the
value so that P(x < k) = 0.25:
Using the calculator, k = invNorm(0.25, 2, 0.5) = 1.66.
Thus, the maximum number of hours per day that the bottom quartile of households uses a
personal computer for entertainment is 1.66 hours.
7.5 The golf scores for a school team were normally distributed with a mean of 68 and a
standard deviation of three.
Find the probability that a randomly selected golfer scored less than 65.
Example 7.10
There are approximately one billion smartphone users in the world today. In the United
States the ages of smartphone users are approximately normally distributed with
approximate mean and standard deviation of 36.9 years and 13.9 years, respectively.
a. Determine the probability that a randomly selected smartphone user is between 23 and
64.7 years old.
b. Determine the probability that a randomly selected smartphone user is at most 50.8 years old.
c. Find the 80th percentile of this distribution, and interpret it in a complete sentence.
Solution 7.10
Let X = the age of a randomly selected smartphone user. We are told that µ = 36.9 and σ =
13.9, so X ~ N(36.9, 13.9).
Example 7.11
There are approximately one billion smartphone users in the world today. In the United
States the ages of smartphone users approximately follow a normal distribution with mean
and standard deviation of 36.9 years and 13.9 years respectively.
Use this to answer the following questions, rounding answers to one decimal place:
Solution 7.11
40% is the area to the right of k, and so the area to the left of k will be 1 – 0.40 = 0.60.
In other words, we are looking for the 60th percentile. This value is given by k =
invNorm(0.60, 36.9, 13.9) = 40.4215.
Rounding, we get k = 40.4. So, forty percent of the ages are at least 40.4 years.
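The step from a right-hand area to a percentile, as in Solution 7.11, can be sketched in code. This is a minimal sketch, assuming Python's standard-library NormalDist; the function name is hypothetical.

```python
from statistics import NormalDist


def value_with_area_to_right(area_right: float, mu: float, sigma: float) -> float:
    """Value k such that the given share of the distribution lies above k."""
    area_left = 1 - area_right  # the inverse function expects the area to the left
    return NormalDist(mu, sigma).inv_cdf(area_left)


# Ages of smartphone users: X ~ N(36.9, 13.9)
k = value_with_area_to_right(0.40, 36.9, 13.9)
print(round(k, 1))  # forty percent of the ages are at least this value
```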
7.6 The golf scores for a school team are normally distributed with a mean of 68 and a
standard deviation of 3.
a. Find the probability that a randomly selected golfer scores between 66 and 70.
b. Find the score for a golfer that is at the third quartile.
Calculating Probabilities using a Table (optional)
Probabilities are calculated using technology. There are instructions given as necessary for
the TI-83+ and TI-84 calculators. To calculate normal probabilities without technology we
use the normal tables provided in the Appendix. The tables include instructions for how to
use them; but we will review the steps briefly.
The normal distribution table in the Appendix shows the area under the standardized
normal curve to the left of x. I.e. it shows the probability P(z < x). We reproduce a section
of the table here:
7.7 Use the tables to find P( z < 0.85), P( z > 1.36) and P(0.45 < z < 1.45)
To use the table for non-standard normally distributed variables, we first rewrite the desired
probability in terms of z-scores, and then use the table as demonstrated above. For example,
if we have a normally distributed variable X ~ N(50, 4) and want to find P(46 < x < 55), we find
the z-scores corresponding to x = 46 and x = 55, which are z = (46 – 50)/4 = -1 and z = (55 – 50)/4
= 1.25, respectively. From there we can rewrite P(46 < x < 55) = P(-1 < z < 1.25) and use
the table.
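The claim that P(46 < x < 55) equals P(-1 < z < 1.25) can be checked numerically. The following is a minimal sketch, assuming Python's standard-library NormalDist in place of the printed table; the variable names are hypothetical.

```python
from statistics import NormalDist

X = NormalDist(50, 4)  # the original variable, X ~ N(50, 4)
Z = NormalDist(0, 1)   # the standard normal used by the table

# Convert both endpoints to z-scores.
z_low = (46 - 50) / 4
z_high = (55 - 50) / 4

# Area computed on the original scale and on the standardized scale.
p_on_x_scale = X.cdf(55) - X.cdf(46)
p_on_z_scale = Z.cdf(z_high) - Z.cdf(z_low)

print(z_low, z_high, round(p_on_x_scale, 4), round(p_on_z_scale, 4))
```

A table lookup would use the same two z-scores: the area to the left of 1.25 minus the area to the left of -1.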
Example 7.12
A citrus farmer who grows mandarin oranges finds that the diameters of mandarin oranges
harvested on his farm follow a normal distribution with a mean diameter of 5.85 cm and a
standard deviation of 0.24 cm.
a. Find the probability that a randomly selected mandarin orange from this farm has a
diameter larger than 6.0 cm by finding z-score and using table.
b. Find the 90th percentile for the diameters of mandarin oranges, and interpret.
Solution 7.12
a. P(x > 6) = normalcdf(6, 10^99, 5.85, 0.24) = 0.2660. If we use the z-table instead, we
need to find z first: z = (6 – 5.85)/0.24 = 0.625, which is halfway between 0.62 and 0.63.
When you look up 0.62 and 0.63 on the table, you get the areas to the left, 0.7324 and 0.7357.
We want the areas to the right, which can be found by taking the complement: 1 –
0.7324 = 0.2676 and 1 – 0.7357 = 0.2643. Averaging these two areas gives approximately the
same answer as the calculator.
b. The 90th percentile is invNorm(0.90, 5.85, 0.24) = 6.16 cm. If we wanted to use the z-table,
we would need to find the z-score associated with an area of 90% to the left.
7.8 Using the information from Example 7.12, answer the following:
The Empirical Rule Review
Let X be a random variable that has a normal distribution with mean µ and standard
deviation σ. Then the Empirical Rule says the following:
• About 68% of the x values lie within one standard deviation of the mean. That is, about
68% of all data values lie between µ – σ and µ + σ.
• About 95% of the x values lie within two standard deviations of the mean. That is, about
95% of all data values lie between µ – 2σ and µ + 2σ.
• About 99.7% of the x values lie within three standard deviations of the mean. That is,
about 99.7% of all data values lie between µ – 3σ and µ + 3σ.
Notice that almost all the x values lie within three standard deviations of the mean. The
empirical rule is also known as the 68-95-99.7 rule.
Since the Empirical Rule applies to bell-shaped, symmetric distributions and every normal distribution is
bell-shaped and symmetric, we can use normalcdf to check the Empirical Rule values.
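The three percentages in the Empirical Rule can be checked numerically. This is a small sketch, assuming Python 3.8 or later; it uses the standard normal distribution because the area within k standard deviations of the mean is the same for every normal distribution.

```python
from statistics import NormalDist

Z = NormalDist(0, 1)  # standard normal; any normal gives the same percentages

# Area within k standard deviations of the mean, for k = 1, 2, 3
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} standard deviation(s): {area:.4f}")
```

The computed areas are approximately 0.6827, 0.9545 and 0.9973, which the rule rounds to 68%, 95% and 99.7%.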
Example 7.13
Check with the calculator. P(44 < x < 56) = normalcdf(44, 56, 50, 6) = .6827 = 68.27%
• Similarly, µ – 2σ = 38 and µ + 2σ = 62.
7.9 Suppose X has a normal distribution with mean 25 and standard deviation five.
Between what values of x do 68% of the values lie?
Example 7.14
From 1984 to 1985, the mean height of 15 to 18-year-old males from Chile was 172.36 cm,
and the standard deviation was 6.34 cm. Let Y = the height of 15 to 18-year-old males in
1984 to 1985, so Y ~ N(172.36, 6.34).
Solution 7.14
a. Using the Empirical Rule, about 68% of the values lie within 1 standard deviation below and above the
mean. 172.36 – 1(6.34) = 166.02 and 172.36 + 1(6.34) = 178.70. Therefore, about 68% of the
values lie between 166.02 and 178.70 cm.
b. Using the Empirical Rule, about 95% of the values lie within 2 standard deviations below and above the
mean. 172.36 – 2(6.34) = 159.68 and 172.36 + 2(6.34) = 185.04. Therefore, about 95% of the
values lie between 159.68 and 185.04 cm.
c. Using the Empirical Rule, about 99.7% of the values lie within 3 standard deviations below and above the
mean. 172.36 – 3(6.34) = 153.34 and 172.36 + 3(6.34) = 191.38. Therefore, about 99.7% of
the values lie between 153.34 and 191.38 cm.
Normal Approximation to the Binomial
[Figure: four histograms of binomial distributions, each plotting Probability against X; the range of X widens from panel to panel (0 to 7, 0 to 12, 0 to 20, and 10 to 35).]
We see that as the sample size increases, the histogram for the distribution takes on a
distinctive bell shape, which means that a normal distribution will provide a good
approximation.
Recall that if X is the binomial random variable, then we write X ~ B(n, p). The shape of the
binomial distribution needs to be similar to the shape of the normal distribution. To ensure
this, the parameters n and p should satisfy the inequalities np > 10 and n(1 – p) > 10. Then
the binomial can be approximated by the normal distribution with mean μ = np and standard
deviation $\sigma = \sqrt{np(1-p)}$.
In order to get the best approximation, we add 0.5 to x or subtract 0.5 from x (use x + 0.5 or
x – 0.5). The number 0.5 is called the continuity correction factor and is used in the
following example.
Example 7.15
Suppose in a local Kindergarten through 12th grade (K - 12) school district, 53 percent of the
population favor a charter school for grades K through 5. A simple random sample of 300 is
surveyed.
Solution 7.15
Let X = the number that favor a charter school for grades K through 5. Then X follows a
binomial distribution with n = 300 and p = 0.53; i.e., X ~ B(300, 0.53). Since np = 159 > 10 and
nq = 141 > 10, we can use the normal approximation to the binomial. The mean and standard
deviation of the distribution are μ = np = 159 and $\sigma = \sqrt{300(0.53)(0.47)} = 8.6447$.
Thus, the random variable for the corresponding normal distribution is Y ~ N(159, 8.6447).
For part a, we want P(X ≥ 150), which means that x = 150 is included. In the
histogram, we want the rectangle that is centered at x = 150, and the left endpoint of this
rectangle is 149.5.
So we must find the area to the right of x = 149.5, and we write:
P(X ≥ 150) ≈ P(Y > 149.5) = normalcdf(149.5, 10^99, 159, 8.6447) = 0.8641.
Similarly, for part b, we want P(X ≤ 160), which means we want to include X = 160. In the
histogram, the rectangle for this value has a right endpoint of 160.5, and so
P(X ≤ 160) ≈ P(Y < 160.5) = normalcdf(0, 160.5, 159, 8.6447) = 0.5689.
For part c, we want to exclude 155, so we start at the right endpoint of its rectangle:
P(X > 155) ≈ P(Y > 155.5) = normalcdf(155.5, 10^99, 159, 8.6447) = 0.6572.
For part d, we want to exclude 147 so
P(X < 147) ≈ P(Y < 146.5) = normalcdf(0, 146.5, 159, 8.6447) = 0.0741.
Modern calculators and computer software allow us to easily compute binomial probabilities for any
values of n and p, making the normal approximation to the binomial distribution obsolete for
calculation purposes. However, in Chapters 7, 8 and 9, we will see that this approximation
is important for theoretical reasons.
For the previous example, the probabilities can of course be calculated using the binomcdf
and binompdf functions in the TI-84 calculator, with n = 300 and p = 0.53. Compare the
binomial and normal distribution answers:
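One way to make this comparison without a TI-84 is sketched below. It assumes Python 3.8 or later; binom_cdf is a hypothetical helper written for this illustration (it stands in for binomcdf), and the normal values use the continuity correction from Solution 7.15.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 300, 0.53                      # X ~ B(300, 0.53) from Example 7.15
mu, sigma = n * p, sqrt(n * p * (1 - p))
Y = NormalDist(mu, sigma)             # approximating normal distribution

def binom_cdf(k):
    """Exact P(X <= k) for X ~ B(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# (label, exact binomial probability, normal approximation with continuity correction)
comparison = [
    ("P(X >= 150)", 1 - binom_cdf(149), 1 - Y.cdf(149.5)),
    ("P(X <= 160)", binom_cdf(160),     Y.cdf(160.5)),
    ("P(X > 155)",  1 - binom_cdf(155), 1 - Y.cdf(155.5)),
    ("P(X < 147)",  binom_cdf(146),     Y.cdf(146.5)),
]

for label, exact, approx in comparison:
    print(f"{label}: binomial {exact:.4f}, normal {approx:.4f}")
```

For a sample this large the two columns should agree to about two or three decimal places, which is why the approximation was useful before exact binomial calculations became easy.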
7.10 In a city, 46 percent of the population favor the incumbent, Dawn Morgan, for mayor. A
simple random sample of 500 is taken. Using the continuity correction factor, find the
probability that at least 250 favor Dawn Morgan for mayor.
KEY TERMS
Normal Distribution: A random variable with pdf $f(x) = \dfrac{e^{-0.5\left(\frac{x-\mu}{\sigma}\right)^{2}}}{\sigma\sqrt{2\pi}}$, where µ is the mean of
the distribution and σ is the standard deviation. The graph of this pdf is a bell-shaped curve
that is perfectly symmetric about the line x = µ. Notation: X ~ N(μ, σ).
Standard Normal Distribution: A normal distribution with μ = 0 and σ = 1. I.e., this is a
continuous random variable X ~ N(0, 1); since this is the distribution obtained by replacing
data values by their respective z-scores, it is often noted as Z ~ N(0, 1).
z-score: If x is any data value, then its z-score is $z = \frac{x-\mu}{\sigma}$. The z-score allows us to compare
data that are normally distributed but scaled differently. If this transformation is applied to
any normal distribution X ~ N(μ, σ), the result is the standard normal distribution Z ~ N(0, 1).
FORMULA REVIEW
z-score: $z = \frac{x-\mu}{\sigma}$
k = invnorm(probability to left, 0, 1)
k = invnorm(probability to left, µ, σ)
EXERCISES FOR CHAPTER 7
1. A bottling plant produces bottles of water whose volumes are normally distributed with a
mean of 12.05 fluid ounces and a standard deviation of 0.01 ounces. Describe the random
variable X symbolically: X ~ ____.
2. A normal distribution has a mean of 61 and a standard deviation of 15. What is the
median?
4. A company manufactures rubber balls. The diameters of the balls are approximately
normally distributed with a mean of 12 cm and a standard deviation of 0.2 cm. Describe the
random variable X symbolically: X ~ ____.
8. What is the z-score of x = 12, if it is two standard deviations to the right of the mean?
9. What is the z-score of x = 9, if it is 1.5 standard deviations to the left of the mean?
15. Suppose a normal distribution has a mean of 6 and a standard deviation of 1.5.
What is the z-score of x = 5.5?
19. Suppose X ~ N(4, 2). What value of x is 1.5 standard deviations to the left of the mean?
20. Suppose X ~ N(4, 2). What value of x is 2.8 standard deviations to the right of the mean?
21. Suppose X ~ N(8, 9). What value of x is 0.67 standard deviations to the left of the mean?
26. In a normal distribution, x = 5 and z = –1.25. This tells you that x = 5 is ____ standard
deviations to the ____ (right or left) of the mean.
27. In a normal distribution, x = 3 and z = 0.67. This tells you that x = 3 is ____ standard
deviations to the ____ (right or left) of the mean.
30. In a normal distribution, x = 6 and z = –1.7. This tells you that x = 6 is ____ standard
deviations to the ____ (right or left) of the mean.
31. About what percent of x values from a normal distribution lie within one standard
deviation (left and right) of the mean of that distribution?
32. About what percent of the x values from a normal distribution lie within two standard
deviations (left and right) of the mean of that distribution?
33. About what percent of x values lie between the second and third standard deviations
(both sides)?
34. Suppose X ~ N(15, 3). Between what x values does the middle 68.26% of the data lie?
35. Suppose X ~ N(–3, 1). Between what x values does the middle 95.44% of the data lie?
36. Suppose X ~ N(–3, 1). Between what x values does 34.14% of the data lie?
37. About what percent of x values lie between the mean and three standard deviations?
38. About what percent of x values lie between the mean and one standard deviation?
39. About what percent of x values lie between the first and second standard deviations
from the mean (both sides)?
40. About what percent of x values lie between the first and third standard deviations (both
sides)?
41. How would you represent the area to the left of one as a probability statement?
43. In a normal distribution, is P(x < 1) equal to P(x ≤ 1)? Why or why not?
44. How would you represent the area to the left of x = 3 as a probability statement?
48. If the area to the left of x in a normal distribution is 0.123, what is the area to the right
of x?
49. If the area to the right of x in a normal distribution is 0.543, what is the area to the left
of x?
54. Given that X ~ N(6, 2), find the probability that x is between 3 and 9.
55. Given that X ~ N(–3, 4), find the probability that x is between 1 and 4.
56. Given that X ~ N(4, 5), find the maximum value of x for the bottom quartile.
a. Draw the shape of the distribution, label and scale the axes
c. What is the likelihood that a gaming console will break down during the
guarantee period?
d. Find the probability that a console lasts more than 10 years. Is it unusual?
58. The life span of gaming consoles is normally distributed with a mean of 7.5 years and a
standard deviation of 1.3 years. We are interested in the length of time a console lasts.
Find the 70th percentile of the distribution for the time a gaming console lasts.
59. The life span of car tires is normally distributed with a mean of 5 years and a standard
deviation of 0.85 years.
a. Draw the shape of the distribution, label and scale the axes
c. What is the likelihood that a car tire will last longer than 7 years?
Use the following information to answer the next two exercises: The patient recovery time
from a particular surgical procedure is normally distributed with a mean of 5.3 days and a
standard deviation of 2.1 days.
61. What is the z-score for a patient who takes ten days to recover?
62. Clotting time of blood is normally distributed. The mean clotting time of blood is 7.45
seconds, with a standard deviation of 0.3 seconds. What is the probability that the blood
clotting time will be less than 7 seconds?
63. The average size of the fish in a lake is 12.1 inches, with a standard deviation of 3.8
inches. The sizes of the fish in the lake are normally distributed. Find the probability of catching
a fish that is
64. The systolic blood pressure (given in millimeters) of males has an approximately normal
distribution with mean µ = 125 and standard deviation σ = 14.
a. Calculate the z-scores for the male systolic blood pressures 100 and 150 millimeters.
b. If a male friend of yours said he thought his systolic blood pressure was 2.5 standard
deviations below the mean, but that he believed his blood pressure was between 100 and
150 millimeters, what would you say to him?
65. The monthly income of trainees at a local technology center is normally distributed. The
average income is $1200 a month. The standard deviation is $150.
a. Find the probability that a trainee earns less than $900 a month.
66. According to an Eyesafe Nielsen study in 2020, the average time spent on App/Web on a
smartphone is 4.5 hours for adults in the US. Let’s assume the time spent on App/Web on a
smartphone is normally distributed with a standard deviation of 1.2 hours.
a. Find the probability that an adult spends less than 1 hour on App/Web on a
smartphone.
b. Find the probability that an adult spends between 4 and 6 hours on App/Web on a
smartphone.
67. In 2019, 2,220,087 students heading to college took the SAT. The distribution of scores in
the math section of the SAT follows a normal distribution with mean µ = 528 and standard
deviation σ = 117.
a. Calculate the z-score for an SAT score of 720. Interpret it using a complete sentence.
b. What math SAT score is 1.5 standard deviations above the mean? What can you say
about this SAT score?
c. What is the probability that a randomly selected score is greater than 750? Is it
unusual?
The patient recovery time from a particular surgical procedure is normally distributed with
a mean of 5.3 days and a standard deviation of 2.1 days.
68. What is the probability of spending more than two days in recovery?
70. Based upon the given information, would you be surprised if it took less than one
minute to find a parking space?
71. Find the probability that it takes at least eight minutes to find a parking space.
72. Seventy percent of the time, what is the most number of minutes needed to find a
parking space?
73. According to a study done by De Anza students, the height for Asian adult males is
normally distributed with an average of 66 inches and a standard deviation of 2.5 inches.
Suppose one Asian adult male is randomly chosen. Let X = height of the individual.
a. X ~ ( , )
b. Find the probability that the person is between 65 and 69 inches. Include a sketch of
the graph, and write the probability in terms of the random variable x.
c. Would you expect to meet many Asian adult males over 72 inches? Explain why or
why not, and justify your answer numerically.
d. The middle 40% of heights fall between what two values? Sketch the graph, and
write the probability statement.
74. IQ is normally distributed with a mean of 100 and a standard deviation of 15. Suppose
one individual is randomly chosen. Let X = IQ of an individual.
a. X ~ ( , )
b. Find the probability that the person has an IQ greater than 120. Include a sketch of
the graph, and write a probability statement.
c. MENSA is an organization whose members have the top 2% of all IQs. Find the
minimum IQ needed to qualify for the MENSA organization. Sketch the graph, and
write the probability statement.
d. The middle 50% of IQs fall between what two values? Sketch the graph and write
the probability statement.
75. The percent of fat calories that a person in America consumes each day is normally
distributed with a mean of about 36 and a standard deviation of 10. Suppose that one
individual is randomly chosen. Let X = percent of fat calories.
a. X ~ ( , )
b. Find the probability that the percent of fat calories a person consumes is more than
40. Graph the situation. Shade in the area to be determined.
c. Find the maximum number for the lower quarter of percent of fat calories. Sketch
the graph and write the probability statement.
76. Suppose that the distance of fly balls hit to the outfield (in baseball) is normally
distributed with a mean of 250 feet and a standard deviation of 50 feet.
77. The corn yield in a given area is normally distributed. The average corn yield in a given
area is 210 bushels per acre with a standard deviation of 47 bushels per acre.
78. The average weekly income of a call center analyst is $600 with a standard deviation of
$20. The weekly income of call center analysts is normally distributed. Let X = weekly income
of a call center analyst.
a. Find the probability that a randomly selected analyst has a weekly income less than
$530. Sketch the graph and write the probability statement.
b. Find the probability that a randomly selected analyst has a weekly income between
$610 and $630.
c. Find the third quartile for weekly income.
79. Suppose that the duration of a particular type of criminal trial is known to be normally
distributed with a mean of 21 days and a standard deviation of seven days.
80. Terri Vogel, an amateur motorcycle racer, averages 129.71 seconds per 2.5 mile lap (in a
seven-lap race) with a standard deviation of 2.28 seconds. The distribution of her race times
is normally distributed. We are interested in one of her randomly selected laps.
81. Suppose that Ricardo and Anita attend different colleges. Ricardo’s GPA is the same as
the average GPA at his school. Anita’s GPA is 0.70 standard deviations above her school
average. In complete sentences, explain why each of the following statements may be false.
82. A NUMMI assembly line, which has been operating since 1984, has built an average of
6,000 cars and trucks a week. Generally, 10% of the cars were defective coming off the
assembly line. Suppose we draw a random sample of n = 100 cars. Let X represent the number
of defective cars in the sample. What can we say about X in regard to the 68-95-99.7 empirical
rule (one standard deviation, two standard deviations and three standard deviations from
the mean are being referred to)? Assume a normal distribution for the defective cars in the
sample.
83. We flip a coin 100 times (n = 100) and note that it only comes up heads 20% (p = 0.20) of
the time. The mean and standard deviation for the number of times the coin lands on heads
is µ = 20 and σ = 4 (verify the mean and standard deviation). Solve the following:
a. There is about a 68% chance that the number of heads will be somewhere between
and .
b. There is about a chance that the number of heads will be somewhere between 12
and 28.
c. There is about a chance that the number of heads will be somewhere between eight
and 32.
Business Applications
84. The life of a coffee dispenser machine that uses regular-sized k-cups is normally
distributed with a mean of 4.88 years and a standard deviation of 0.62 years.
a. What is the probability that the coffee dispenser lasts less than 3 years?
b. If the company wants less than 2% of the coffee dispensers to fail while under warranty,
what should be the guarantee period?
85. The amount of fruit juice dispensed by a machine that fills quart bottles is normally
distributed with a mean of 31.6 oz per bottle and a standard deviation of 1.2 oz.
a. What is the probability that the amount of juice in a bottle is less than 1 quart?
b. What is the probability that the amount of juice in a bottle is at least 2 ounces
more than a quart?
c. What are the largest and smallest amounts dispensed by the middle 50%?
86. Customers at a certain cosmetic store spend an average of $45.80, with a standard
deviation of $12.75. Assume the amount spent is normally distributed.
b. What is the probability that a customer spends between $30 and $40?
c. What are the largest and smallest amounts spent by the middle 60%?
87. Suppose in a quarter, a group of 25 mutual funds has a mean return of 2.8% with a standard
deviation of 4.8%. Assume the returns are normally distributed.
c. What percent of funds would you expect to have returns between 5% and 10%?
88. Human resource departments look at job satisfaction scores. Let’s say job satisfaction
scores are normally distributed with a mean of 90 and a standard deviation of 12.
c. Human resource departments are concerned with job satisfaction scores that
drop below a specific score. What score is considered unusual?
REFERENCES
7.1 The Standard Normal Distribution
“2012 College-Bound Seniors Total Group Profile Report.” CollegeBoard, 2012. Available online at
http://media.collegeboard.com/digitalServices/pdf/research/TotalGroup-2012.pdf (accessed May 14,
2013).
“Digest of Education Statistics: ACT score average and standard deviations by sex and
race/ethnicity and percentage of ACT test takers, by selected composite score ranges and planned
fields of study: Selected years, 1995 through 2009.” National Center for Education Statistics.
Available online at http://nces.ed.gov/programs/digest/d09/tables/dt09_147.asp (accessed May 14,
2013).
Data from the National Basketball Association. Available online at www.nba.com (accessed May 14,
2013).
“Smart Phone Users, By The Numbers.” Visual.ly, 2013. Available online at
http://visual.ly/smart-phone-users-numbers (accessed May 14, 2013).
Eyesafe. “COVID-19: Screen Time Spikes to over 13 Hours per Day According to Eyesafe Nielsen
Estimates.” Eyesafe®, 30 Mar. 2020,
eyesafe.com/covid-19-screen-time-spike-to-over-13-hours-per-day/.
CHAPTER 7 SOLUTIONS:
23) 25)
61) z = (10 – 5.3)/2.1 = 2.24
63) a. P(x > 17) = normalcdf(17, 10^99, 12.1, 3.8)
67) a. z = (720 – 528)/117 = 1.64. This student scored 1.64 standard deviations above the
mean.
b. x = µ + 1.5σ = 528 + 1.5(117) = 703.5. This student scored 1.5 standard deviations
above the mean.
c. P(x > 750) = normalcdf(750, 10^99, 528, 117) = .0289. Yes, unusual.
73) c. No, the probability that an Asian male is over 72 inches tall is 0.0082.
75) a. X ~ N(36, 10) b. P(x > 40)= 0.3446.
c. Approximately 25% of people consume less than 29.26% of their calories as fat.
79) a. X = the number of days a particular type of criminal trial will take b. X ~ N(21, 7)
c. The probability that a randomly selected trial will last more than 24 days is 0.3336.
d. 22.77
8 | CENTRAL LIMIT THEOREM
Figure 8.1 If you want to figure out the distribution of fish size in fish farm pools to maximize
profit, using the central limit theorem and assuming your sample is large enough, you will find that
the distribution is normal and bell-shaped.
Introduction
Chapter Objectives
The Central Limit Theorem (CLT for short) is one of the most powerful and useful ideas in
all of Statistics. There are two alternative forms of the theorem, and both describe the center,
spread and shape of a certain sampling distribution. In general, the sampling distribution
of a statistic (such as a sample mean or sample proportion) is the distribution of values of
that statistic when all possible samples of the same size are taken from the same population.
Sampling distributions form the foundation for almost all methods in inferential statistics,
and the Central Limit Theorem allows us to explicitly describe the sampling distribution for
a sample mean $\bar{x}$ and the sampling distribution for a sample proportion $\hat{p}$.
If we select multiple random samples of size n from a population and calculate the mean for
each sample, then the sample mean $\bar{x}$ will vary from sample to sample.
[Figure: a population distribution with mean µ and standard deviation σ, and a random sample of size n drawn from it with sample mean $\bar{x}_1$.]
That is, $\bar{x}$ will be a random variable, and so it has a probability distribution. If we were able
to collect all possible random samples of size n from the population and calculate $\bar{x}$ for each
sample, then the resulting distribution is called the sampling distribution for the mean.
What would this distribution look like? Well, if n is reasonably large, we would expect most
of the sample means $\bar{x}$ to be pretty close to the true population mean, µ. Thus, we would
expect the mean of all sample means to be equal to µ. There will of course be some variation
among the sample means; however, we would expect most of the differences between a sample
mean and the population mean to be fairly small, with large deviations quite rare. Thus, we
would expect the pdf of the distribution to approach zero as we move away from the center.
Finally, since a sample mean $\bar{x}$ is as likely to overestimate as it is to underestimate the true
population mean, we would expect the positive and negative deviations from the mean to
occur with about the same proportion. Thus, we would expect the distribution to be
symmetric.
[Figure: the sampling distribution of the sample mean.]
When we put these three observations together, we would expect the sampling distribution for
$\bar{x}$ to be a bell-shaped, symmetric distribution, so it is not unreasonable to think that the
sampling distribution is a normal distribution. Moreover, the mean of this sampling
distribution is the mean, µ, of the population from which we are sampling. The Central Limit
Theorem validates our intuition; specifically, the CLT states that if we collect samples of
“sufficiently large” size n, then the sampling distribution will be approximately normal.
Similarly, a sampling distribution can be formed for the sample proportion $\hat{p}$, and the CLT can be used for this
sampling distribution too. Another version of the theorem says that if we again collect
samples of size n that are "large enough" and calculate the sum of each sample, then the sampling
distribution for the sum will again be approximately normal.
In any case, it does not matter what the distribution of the original population is,
or whether we even know it. The important fact is that the distribution of sample
means and the sums tend to follow the normal distribution.
The size of the sample, n, that is required in order to be "large enough" depends on the
original population from which the samples are drawn. If we are sampling from a normal
distribution, the sampling distribution will exhibit a bell shape even for small n. But if we
are sampling from a skewed distribution, a common rule of thumb is that the sample size should be at least 30.
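The behavior described above can be explored with a short simulation. The sketch below is only an illustration, assuming Python 3.8 or later: the exponential population, the seed, and the number of samples are our own choices, not values from this text.

```python
import random
from statistics import fmean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

n = 30               # size of each sample
num_samples = 5000   # number of samples drawn (an arbitrary choice)

# Exponential population with rate 1: strongly right-skewed, mean 1, standard deviation 1
sample_means = [
    fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

center = fmean(sample_means)   # expected to be near the population mean, 1
spread = stdev(sample_means)   # expected to be near 1 / sqrt(30), about 0.18

# Share of sample means within one standard deviation of the center;
# for a bell-shaped distribution this is roughly 68%
share = sum(abs(m - center) <= spread for m in sample_means) / num_samples

print(round(center, 3), round(spread, 3), round(share, 3))
```

Even though individual values from this population are far from bell-shaped, the sample means cluster around the population mean in a roughly normal pattern.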
8.1 | The Central Limit Theorem for Sample Means and Sample
Proportions
Suppose X is a random variable with mean µ and standard deviation σ. Suppose that we
select random samples of size n and denote the corresponding sample means as $\bar{X}$. Then we
denote the mean and standard deviation of the sampling distribution for $\bar{X}$ as $\mu_{\bar{X}}$ and $\sigma_{\bar{X}}$,
respectively.
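Suppose X is a random variable with mean µ and standard deviation σ. Suppose that we
select random samples of size n. Then the following are true:
1. The mean of the sampling distribution equals the population mean: $\mu_{\bar{X}} = \mu$.
2. The standard deviation of the sampling distribution is $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$.
3. For sufficiently large n, the sampling distribution of $\bar{X}$ is approximately normal.
The first two parts of the theorem tell us that the sampling distribution has the same mean
as the original distribution and a variance that equals the original variance divided by the
sample size (so its standard deviation is the original standard deviation divided by the
square root of the sample size). To state the third part more precisely, if we draw random
samples of size n, the distribution of the random variable $\bar{X}$ approaches a normal distribution
as the sample size n increases. We can summarize all three parts of the theorem succinctly
by saying: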
$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$
Note that the standard deviation for the sampling distribution, $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$, is often called the
standard error of the mean. Note also that as n grows larger, the standard error gets
smaller. That is, as the sample size gets larger there is less variation among the sample
means. This reinforces our intuition that larger samples give more reliable results when we
use a sample mean $\bar{x}$ to estimate µ.
Also recall the Law of Large Numbers, which says that if you take samples of larger and
larger size from any population, then the sample mean $\bar{x}$ tends to get closer and closer to the
population mean μ. We now see that this is a direct result of the Central Limit Theorem: we
know that as n gets larger and larger, the sample means follow a normal distribution with
mean μ. Moreover, as the sample size increases, the standard deviation for the sampling
distribution decreases. So as n becomes large, the sample means $\bar{x}$ really will get closer and
closer to the population mean μ.
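To see how quickly the standard error shrinks, the short sketch below evaluates σ/√n for a few sample sizes. The value σ = 15 and the sample sizes are illustrative choices, not taken from a particular example.

```python
from math import sqrt

sigma = 15  # population standard deviation (an illustrative value)

# Standard error of the mean, sigma / sqrt(n), for growing sample sizes
standard_errors = {n: sigma / sqrt(n) for n in (9, 36, 100, 400)}

for n, se in standard_errors.items():
    print(f"n = {n:3d}: standard error = {se:.2f}")
```

Notice that the sample size must be multiplied by four to cut the standard error in half, because n enters the formula through its square root.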
Example 8.1
An unknown distribution has a mean of 90 and a standard deviation of 15. Samples of size
n = 36 are drawn randomly from the population.
Solution 8.1
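a. This is asking for $\mu_{\bar{X}}$. According to the Central Limit Theorem, $\mu_{\bar{X}} = \mu$. Therefore,
$\mu_{\bar{X}} = 90$.
b. The formula for the standard error of the mean is $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$, so $\sigma_{\bar{X}} = \frac{15}{\sqrt{36}} = 2.5$.
c. Although the original distribution is unknown, the sample size n = 36 is large (at least 30),
so the sampling distribution is approximately normal.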
The fact that the sampling distribution of $\bar{x}$ is normal or approximately normal means that
we can use the techniques learned in Chapter 7 for the calculations. However, we must be
careful to use the standard deviation for the sampling distribution. For example, if
we want to calculate the z-score for a particular sample mean $\bar{x}$, then we would use the
formula:
$z = \frac{\bar{x} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
Here, µ = the mean of the original distribution we are sampling, σ = the standard
deviation of the same distribution, and n is the sample size.
When we calculate probabilities, we will always write the
probability statement in terms of the correct random variable.
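The warning about using the correct standard deviation can be made concrete. In the sketch below, z_for_sample_mean is a hypothetical helper written for this illustration, and the numbers (µ = 90, σ = 15, n = 36, and a sample mean of 92) are illustrative values.

```python
from math import sqrt

def z_for_sample_mean(x_bar, mu, sigma, n):
    """z-score of a sample mean, using the standard error sigma / sqrt(n)."""
    return (x_bar - mu) / (sigma / sqrt(n))

# Illustrative values: population mean 90, standard deviation 15, samples of size 36
z_correct = z_for_sample_mean(92, 90, 15, 36)   # uses sigma / sqrt(n) = 2.5
z_wrong = (92 - 90) / 15                        # mistakenly uses sigma itself

print(round(z_correct, 2), round(z_wrong, 2))
```

Dividing by σ instead of σ/√n understates how unusual a sample mean is, here by a factor of √36 = 6.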
Example 8.2
Continued from Example 8.1. Recall, an unknown distribution has a mean of 90 and a
standard deviation of 15. Samples of size n = 36 are drawn randomly from the population.
a. Find the probability that a randomly selected sample mean is between 85 and 92.
b. Find the value that is two standard deviations above the expected value of the sample
mean.
Solution 8.2
a. This question asks you to find a probability involving the sample mean: $P(85 < \bar{x} < 92)$.
We first draw a graph:
We know from the CLT that the distribution of $\bar{x}$ is approximately normal, with mean
$\mu_{\bar{X}} = \mu = 90$ and standard deviation $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{36}} = 2.5$. Since the
distribution is approximately normal, we can use the normalcdf function in the calculator to find
the probability:
$P(85 < \bar{x} < 92)$ = normalcdf(85, 92, 90, 2.5) = 0.7654.
b. To find the value of $\bar{x}$ that is two standard deviations above the expected value 90,
use the formula:
$\bar{x} = \mu_{\bar{X}} + 2\sigma_{\bar{X}} = 90 + 2(2.5) = 95$.
The value of $\bar{x}$ that is two standard deviations above the expected value is 95.
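Both parts of Example 8.2 can be checked in software. The sketch below assumes Python 3.8 or later and builds the sampling distribution with statistics.NormalDist; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 90, 15, 36
x_bar = NormalDist(mu, sigma / sqrt(n))   # sampling distribution of the mean

# Part a: P(85 < x-bar < 92)
p_between = x_bar.cdf(92) - x_bar.cdf(85)

# Part b: two standard errors above the expected value
two_above = x_bar.mean + 2 * x_bar.stdev

print(round(p_between, 4), two_above)
```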
8.1 An unknown distribution has a mean of 45 and a standard deviation of eight. Samples of
size n = 30 are drawn randomly from the population. Find the probability that the sample
mean is between 42 and 50.
Example 8.3
The length of time, in hours, it takes an "over 40" group of people to play one soccer match is
normally distributed with a mean of two hours and a standard deviation of 0.5 hours. A
sample of size n = 50 is drawn randomly from the population. Find the probability that the
sample mean is between 1.8 hours and 2.3 hours.
Solution 8.3
We are told that the original distribution has µ = 2, σ = 0.5, n = 50, and X ~ N(2, 0.5).
From the Central Limit Theorem, the sampling distribution of $\bar{x}$ is normal, with mean
$\mu_{\bar{X}} = \mu = 2$, standard deviation $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{0.5}{\sqrt{50}} \approx 0.0707$, and $\bar{X} \sim N(2, 0.0707)$.
Thus, $P(1.8 < \bar{x} < 2.3)$ = normalcdf(1.8, 2.3, 2, 0.0707) = 0.9977. Compare the x-axes of the two
distributions: the interval from 1.8 to 2.3 covers nearly all of the second graph (the sampling
distribution) but a much smaller share of the first graph (the original distribution).
8.2 The length of time taken on the SAT for a group of students is normally distributed with
a mean of 2.5 hours and a standard deviation of 0.25 hours. A sample size of n = 60 is drawn
randomly from the population. Find the probability that the sample mean is between two
hours and three hours.
Recall that percentiles separate ordered values: the kth percentile is the value with k% of the
distribution at or below it. We can also calculate percentiles in the sampling
distribution using the techniques from Chapter 7:
Example 8.4
a. What are the mean and standard deviation for the sampling distribution of the average
length of salmon?
b. What does the sampling distribution look like?
c. Find the probability that a sample mean length is more than 30 inches.
d. Find the 95th percentile for the sample mean length (to one decimal place).
Solution 8.4
a. The mean of the sampling distribution is $\mu_{\bar{X}} = \mu = 29$ inches and the standard deviation
of the sampling distribution is $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{3}{\sqrt{100}} = 0.3$ inches.
b. From the Central Limit Theorem, we would expect the sampling distribution to be
approximately normal: the original distribution is unknown, but the sample size n = 100 is large.
c. The probability that a sample mean length is more than 30 inches is given by $P(\bar{X} > 30)$
= normalcdf(30, 10^99, 29, 0.3) = 0.000429. It is very unlikely that the average length of 100
salmon will be more than 30 inches.
d. Let k = the 95th percentile; then k = invNorm(0.95, 29, 0.3) = 29.493, or about 29.5
inches. About 95% of sample means are less than 29.5 inches.
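Parts c and d of Solution 8.4 can be reproduced as follows. This is a sketch assuming Python 3.8 or later; the values µ = 29, σ = 3 and n = 100 are read from the solution above, and the variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

# Values used in Solution 8.4: mu = 29 inches, sigma = 3 inches, n = 100
length_mean = NormalDist(29, 3 / sqrt(100))

p_over_30 = 1 - length_mean.cdf(30)    # part c: area to the right of 30
k_95 = length_mean.inv_cdf(0.95)       # part d: 95th percentile of the sample mean

print(f"{p_over_30:.6f}", round(k_95, 1))
```

Here cdf plays the role of normalcdf and inv_cdf plays the role of invNorm, both applied to the sampling distribution rather than to the original population.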
8.3 In the same NOAA site, another farmed fish that is monitored is the sablefish. The
average length is 36 inches. Assume the standard deviation is 3.2 inches. If samples of size
49 are taken, find the middle 80% of sample means. Interpret the results in terms of farmed
sablefish.
Example 8.5
The mean number of minutes for app engagement by a tablet user is 8.2 minutes. Suppose
the standard deviation is one minute. We select a random sample of 60 tablet users and
measure the mean time spent on each app.
a. What are the mean and standard deviation for the sample mean number of minutes for
app engagement by a tablet user? Draw the sampling distribution.
b. What is the standard error of the mean?
c. Find the 90th percentile for the sample mean time for app engagement for a tablet user.
Interpret value in a complete sentence.
d. Find the probability that the sample mean is between eight minutes and 8.5 minutes.
Solution 8.5
a. The mean of the sampling distribution is $\mu_{\bar{X}} = \mu = 8.2$ minutes and the standard
deviation of the sampling distribution is $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{60}} = 0.1291$ minutes.
b. Recall that the standard error of the mean is just $\sigma_{\bar{X}}$, the standard deviation of the
sampling distribution for $\bar{x}$. If we select many samples of size 60, this statistic describes
the spread of the sample means about the population mean.
8.4 Cans of a cola beverage claim to contain 16 ounces. The amounts in a sample are
measured and the statistics are n = 34, $\bar{x}$ = 16.01 ounces. If the cans are filled so that μ =
16.00 ounces (as labeled) and σ = 0.143 ounces, find the probability that a sample of 34 cans
will have an average amount greater than 16.01 ounces. Do the results suggest that cans are
filled with an amount greater than 16 ounces?
Suppose X is a random variable with population proportion p. Suppose that we select
random samples of size n and denote the corresponding sample proportion as $\hat{p}$. Then we
denote the mean and standard deviation of the sampling distribution for $\hat{p}$ as $\mu_{\hat{p}}$ and $\sigma_{\hat{p}}$,
respectively.
Let’s look at how the CLT works for the sampling distribution of a sample proportion. We open up
a regular bag of colored candy and notice that there are 23 red candies out of 100 pieces.
Therefore $p = \frac{23}{100} = 0.23$. Now we will take samples of size 9, which we denote n = 9.
Note the properties are approximate when the population is finite and no more than 10% of
the population is included in the sample. In each sample we find the proportion of red
candies:
$\hat{p} = \frac{x}{9}$.
[Figure: the population distribution with proportion p, and a random sample of size n with sample proportion $\hat{p}_1 = \frac{x}{n}$.]
The list of 50 sample proportions from samples of size 9:
0    1/9  3/9  3/9  2/9  2/9  1/9  4/9  2/9  3/9
2/9  3/9  2/9  0    1/9  3/9  1/9  3/9  4/9  2/9
3/9  2/9  1/9  5/9  2/9  3/9  0    4/9  2/9  3/9
2/9  1/9  3/9  4/9  1/9  4/9  3/9  2/9  1/9  1/9
5/9  3/9  2/9  1/9  0    3/9  2/9  3/9  3/9  2/9
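If we look at the frequency distribution of these sample proportions, we can see that the histogram is
roughly centered around p = 0.23. We take the mean of all the $\hat{p}$ values drawn, which we denote $\mu_{\hat{p}}$. The CLT
states that $\mu_{\hat{p}} = p$ and $\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$, where q = 1 − p.
$\hat{p}$     f
0      4
1/9    10
2/9    14
3/9    15
4/9    5
5/9    2
6/9    0
7/9    0
8/9    0
1      0
[Histogram of the sample proportions, with the mean marked at $\mu_{\hat{p}} = 0.23$.]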
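As a quick check on the candy illustration, the mean of the 50 observed sample proportions can be computed from the frequency table and compared with p. The sketch below is only illustrative: the dictionary freq re-enters the table by hand (keyed by the number of red candies x in a sample of 9), and the names are assumptions, not part of the text.

```python
from fractions import Fraction
from math import sqrt

# Frequency table from the text: x red candies in a sample of 9 -> number of samples
freq = {0: 4, 1: 10, 2: 14, 3: 15, 4: 5, 5: 2, 6: 0, 7: 0, 8: 0, 9: 0}
n = 9        # sample size
p = 0.23     # population proportion of red candies

total = sum(freq.values())   # number of samples drawn

# Mean of the observed sample proportions x/9, weighted by frequency
mean_phat = sum(Fraction(x, n) * f for x, f in freq.items()) / total

# Standard deviation of p-hat according to the CLT formula sqrt(pq/n)
theoretical_sd = sqrt(p * (1 - p) / n)

print(total, float(mean_phat), round(theoretical_sd, 4))
```

With only 50 samples, the observed mean is close to p = 0.23 but does not equal it exactly; the agreement improves as more samples are drawn.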
Example 8.6
Suppose that 68% of students like donuts. Take samples of size 40. The sampling
distributions of proportions are formed.
Solution 8.6
a. $\mu_{\hat{p}} = p = 0.68$
$\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}} = \sqrt{\frac{0.68 \cdot 0.32}{40}} = 0.0738$
b. P($\hat{p}$ ≥ 0.80) = ? Since we are asking for the probability of a sample proportion, we
use the central limit theorem for proportions. P($\hat{p}$ ≥ 0.80) = normalcdf(0.80, 1E99, 0.68,
0.0738) = 0.052
Example 8.7
Based on past experience, a bank claims that 7% of the people who receive student loans
will not make the payments on time. The bank recently approved 300 student loans.
a. What is the mean of the proportion of students in this group who may not make
timely payments?
b. What is the standard deviation of the proportion of students in this group who may
not make timely payments?
c. What’s the probability that over 10% of these students will not make the payments
on time?
Solution 8.7
P($\hat{p}$ > 0.10) = ? Since we are asking for the probability of a sample proportion, we use the
central limit theorem for proportions.
a. $\mu_{\hat{p}} = p = 0.07$
b. $\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}} = \sqrt{\frac{0.07 \cdot 0.93}{300}} = 0.015$
c. P($\hat{p}$ > 0.10) = normalcdf(0.10, 1E99, 0.07, 0.015) = 0.0228
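The donut and student-loan calculations follow the same pattern, so they can be sketched with one small helper. This is a minimal Python illustration, assuming statistics.NormalDist as a stand-in for normalcdf; the helper name phat_dist is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def phat_dist(p, n):
    """Normal approximation for the sample proportion: N(p, sqrt(p*q/n))."""
    q = 1 - p
    return NormalDist(p, sqrt(p * q / n))

# 68% of students like donuts, samples of size 40
donuts = phat_dist(0.68, 40)
p_donuts = 1 - donuts.cdf(0.80)        # P(p-hat >= 0.80)

# 7% of borrowers pay late, 300 loans
loans = phat_dist(0.07, 300)
p_loans = 1 - loans.cdf(0.10)          # P(p-hat > 0.10), unrounded SD

# Same probability using the standard deviation rounded to 0.015
p_loans_rounded = 1 - NormalDist(0.07, 0.015).cdf(0.10)

print(round(p_donuts, 3), round(p_loans, 4), round(p_loans_rounded, 4))
```

Using the unrounded standard deviation in the loan example gives a slightly smaller probability than using the rounded value 0.015, so small differences from the calculator answers are expected.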
8.2 | Using the Central Limit Theorem
It is important to understand when to use the central limit theorem. If you are being asked
to find the probability of the mean, use the CLT for the mean. If you are being asked to find
the probability of a sum or total, use the CLT for sums. This also applies to percentiles for
means and sums.
NOTE
If you are being asked to find the probability of an individual value, do not use the CLT.
Use the distribution of its random variable.
Example 8.8
On the average, a certain computer part lasts 8 years. The length of time the computer part
lasts is exponentially distributed. Let X = the time an INDIVIDUAL computer part lasts,
and find the following probabilities:
a. Find the probability that the mean time that a sample of 40 of these parts lasts is shorter
than 4 years.
b. Suppose that one computer part is randomly selected. Find the probability that this
individual computer part lasts less than 4 years. This is asking us to find P(x < 4).
c. Explain why the probabilities in parts a and b are different.
d. Find the 95th percentile for the mean time that samples of 40 computer parts last.
Solution 8.8
a. This is asking us to find P($\bar{x}$ < 4). Since the original distribution is exponential with
mean µ = 8, its standard deviation is also σ = 8. The sampling distribution of sample means
is approximately normal with mean $\mu_{\bar{X}} = \mu$ and standard deviation
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$. By the CLT, $\bar{X} \sim N\left(8, \frac{8}{\sqrt{40}}\right)$.
Using the calculator, P($\bar{x}$ < 4) = normalcdf(-10^99, 4, 8, 8/√40) = 0.000783.
b. Find P(x < 4). Since we are looking at an individual computer part, we use the original
distribution, which is the exponential distribution. µ = 8, and to find the rate of decay, m, we
take the reciprocal of µ: m = 1/8, so $X \sim Exp\left(\frac{1}{8}\right)$. Thus
P(x < 4) = $1 - e^{-\frac{1}{8}(4)}$ = 1 − 0.6065 = 0.3935.
c. The probabilities are not equal because they involve different random variables, and
hence use different distributions. Part a uses the sampling distribution of sample means.
The CLT states that this distribution is approximately normal. Part b uses the exponential
distribution.
d. To find the 95th percentile for the mean time that samples of 40 computer parts last, we
again use the CLT and the fact that $\bar{X} \sim N\left(8, \frac{8}{\sqrt{40}}\right)$. Let k be the 95th percentile.
This value is given by k = invNorm(0.95, 8, 8/√40) = 10.08. Thus, about ninety-five percent of
such samples would have a mean under 10.08 years; only five percent of such samples
would have a mean above 10.08 years.
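The contrast between the sample mean (normal, by the CLT) and an individual part (exponential) can be sketched in a few lines of Python. This is an illustrative sketch only, assuming statistics.NormalDist for the normal calculations; the variable names are not from the text. For the exponential distribution, P(X < 4) is the cumulative probability 1 − e^(−4/8), while e^(−4/8) by itself is the probability that a part lasts longer than 4 years.

```python
from math import exp, sqrt
from statistics import NormalDist

mu = 8           # mean lifetime in years
sigma = 8        # for an exponential distribution the SD equals the mean
n = 40           # sample size
rate = 1 / mu    # rate of decay m = 1/8

# Sampling distribution of the mean of 40 lifetimes (CLT)
xbar_dist = NormalDist(mu, sigma / sqrt(n))

p_mean_below_4 = xbar_dist.cdf(4)            # part a: P(x-bar < 4)
p_individual_below_4 = 1 - exp(-rate * 4)    # part b: exponential CDF at 4
k95 = xbar_dist.inv_cdf(0.95)                # part d: 95th percentile of the mean

print(round(p_mean_below_4, 6), round(p_individual_below_4, 4), round(k95, 2))
```

A single part fails before 4 years fairly often, while a mean of 40 parts below 4 years is extremely rare, which is the point of part c.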
Example 8.9
A study was done on the market value of homes in a certain city. The market value of homes
is uniformly distributed between $175 thousand and $250 thousand.
a. In a sample of 30 homes, what is the probability of the mean value being greater than
$230 thousand?
b. What is the probability that an individual home has a value greater than $230
thousand?
c. Find the 95th percentile for the mean value of 30 homes.
Solution 8.9
The original distribution is uniform. For a uniform distribution (working in thousands of dollars),
$\mu = \frac{a + b}{2} = \frac{175 + 250}{2} = 212.5$ and the standard deviation is
$\sigma = \frac{b - a}{\sqrt{12}} = \frac{250 - 175}{\sqrt{12}} = 21.65$.
The sampling distribution of sample means is approximately normal with $\mu_{\bar{x}} = \mu = 212.5$
thousand dollars and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{21.65}{\sqrt{30}} = 3.95$ thousand dollars. We use the CLT to get
$\bar{X} \sim N(212.5, 3.95)$, in thousands of dollars.
b. P(x > 230) = area of rectangle = base × height = $(250 - 230) \cdot \frac{1}{75} = 20 \cdot \frac{1}{75} = 0.267$
c. The 95th percentile = invNorm(0.95, 212.5, 3.95) = $219 thousand. This indicates that in
95% of all samples of size 30, the average market value of homes in the sample is less than $219
thousand.
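The three parts of this example can be sketched together in Python. This is a hedged illustration, assuming statistics.NormalDist for the CLT-based parts and plain arithmetic for the uniform part; all names are illustrative, and values are in thousands of dollars.

```python
from math import sqrt
from statistics import NormalDist

a, b, n = 175, 250, 30            # uniform endpoints and sample size

mu = (a + b) / 2                  # mean of the uniform distribution
sigma = (b - a) / sqrt(12)        # standard deviation of the uniform distribution

xbar_dist = NormalDist(mu, sigma / sqrt(n))   # CLT for the mean of 30 homes

p_mean_over_230 = 1 - xbar_dist.cdf(230)      # part a: mean of 30 homes > 230
p_home_over_230 = (b - 230) / (b - a)         # part b: one home > 230 (rectangle area)
k95 = xbar_dist.inv_cdf(0.95)                 # part c: 95th percentile of the mean

print(p_mean_over_230, round(p_home_over_230, 3), round(k95, 1))
```

The probability for the mean is far smaller than the probability for a single home, because sample means vary much less than individual values.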
8.3 | The Central Limit Theorem for Sums
Suppose X is a random variable with a distribution that may be known or unknown (it can
be any distribution) and suppose that we select samples of size n. Recall that ΣX represents
the sum of all data values from the sample. Again, ΣX is a random variable, as this sum
will vary from sample to sample; thus, we can discuss the sampling distribution for ΣX as
well. We denote the mean and standard deviation of this sampling distribution as $\mu_{\Sigma X}$ and $\sigma_{\Sigma X}$,
respectively. Then we have the following variant of the CLT:
Suppose X is a random variable with mean µ and standard deviation σ. Suppose that we
select random samples of size n. Then the following are true:
i. $\mu_{\Sigma X} = n\mu$
ii. $\sigma_{\Sigma X} = \sqrt{n}\,\sigma$
iii. For n > 30, the sampling distribution for ΣX is approximately normal.
In other words, if we draw random samples of size n, the random variable ΣX consisting of
sums tends to be normally distributed and $\Sigma X \sim N(n\mu, \sqrt{n}\,\sigma)$. Again, this means that we
can apply the techniques from Chapter 7 for calculating z-scores and probabilities involving
sample sums.
Example 8.10
An unknown distribution has a mean of 90 and a standard deviation of 15. A sample of size
80 is drawn randomly from the population.
a. Find the probability that the sum of the 80 values is more than 7,500.
b. Find the sum that is 1.5 standard deviations above the mean of the sums.
Solution 8.10
Let X = one value from the original unknown population. The probability question asks you
to find a probability for the sum (or total of) 80 values.
ΣX = the sum of 80 values. Since μ = 90, σ = 15, and n = 80, $\Sigma X \sim N\left((80)(90), \sqrt{80}\,(15)\right)$.
That is, ΣX ~ N(7200, 134.16).
a. Find P(Σx > 7,500). It is helpful to draw a graph:
b. Find Σx where z = 1.5. This sum is 1.5 standard deviations above the mean of the sums, so
Σx = $\mu_{\Sigma X} + z\,\sigma_{\Sigma X}$ = 7,200 + (1.5)(134.16) = 7,401.2.
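Both parts of this example can be checked numerically. The sketch below is a minimal Python illustration, assuming statistics.NormalDist and the parameters nµ and √n·σ for the distribution of the sum; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 90, 15, 80     # population mean, population SD, sample size

# CLT for sums: mean n*mu and standard deviation sqrt(n)*sigma
sum_dist = NormalDist(n * mu, sqrt(n) * sigma)

# Part a: probability that the sum of the 80 values exceeds 7,500
p_sum_over_7500 = 1 - sum_dist.cdf(7500)

# Part b: the sum that lies 1.5 standard deviations above the mean of the sums
sum_at_z = sum_dist.mean + 1.5 * sum_dist.stdev

print(round(sum_dist.stdev, 2), round(p_sum_over_7500, 4), round(sum_at_z, 1))
```

Note that the standard deviation of the sum grows with √n rather than n, which is why 7,500 is only about 2.2 standard deviations above the mean of 7,200.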
8.6 An unknown distribution has a mean of 45 and a standard deviation of eight. A sample
of size 50 is drawn randomly from the population. Find the probability that the sum of the
50 values is more than 2,400.
Example 8.11
a. What are the mean and standard deviation for the sum of the lengths of salmon? What
is the distribution?
b. Find the probability that the sum of the lengths is between 2850 and 2950 inches.
c. Find the 80th percentile for the sum of the 100 lengths.
Solution 8.11
The distribution is approximately normal for sums by the central limit theorem.
b. P(2850 < Σx < 2950) = normalcdf (2850, 2950, 2900, 30) = 0.9044
c. Let k = the 80th percentile. Then k = invNorm(0.80, 2900, 30) = 2925.2 inches
8.7 According to NOAA (National Oceanic and Atmospheric Administration) Fisheries
website, the average size of the Atlantic Salmon (farmed) is about 29 inches. Let’s assume
the standard deviation is 4 inches. If we take samples of 50 salmon,
a. What are the mean and standard deviation for the sum of the lengths of the salmon?
b. Find the probability that the sum of the lengths is between 1,400 and 1,500 inches.
c. Find the 90th percentile for the sum of the 50 salmon.
Example 8.12
The mean number of minutes for app engagement by a tablet user is 8.2 minutes. Suppose
the standard deviation is one minute. Suppose that we select a sample of size 70.
a. What are the mean and standard deviation for the sums?
b. Find the 95th percentile for the sum of the sample. Interpret this value in a complete
sentence.
c. Find the probability that the total time for the sample is at least ten hours.
Solution 8.12
a. $\mu_{\Sigma x} = n\mu = 70(8.2) = 574$ minutes and $\sigma_{\Sigma x} = \sqrt{n}\,\sigma = \sqrt{70}\,(1) = 8.37$ minutes
b. Let k = the 95th percentile. Then k = invNorm(0.95, 574, 8.37) = 587.76 minutes.
For ninety-five percent of samples of 70 tablet users, the total app engagement time is at most 587.76 minutes.
c. Note that 10 hours = 600 minutes. P(Σx ≥ 600) = normalcdf(600, 10^99, 574, 8.37) =
0.0009.
8.8 The mean number of minutes for app engagement by a tablet user is 8.2 minutes.
Suppose the standard deviation is one minute. We select a random sample of size 70.
a. What is the probability that the total time for the sample is between seven hours and ten
hours? What does this mean in context of the problem?
b. Find the 84th and 16th percentiles for the sum of the sample. Interpret these values in
context.
Example 8.13
A study involving stress is conducted among the students on a college campus. The stress
scores follow a uniform distribution with the lowest stress score equal to one and the highest
equal to five. Using a sample of 50 students, find:
a. The probability that the mean stress score for the 50 students is less than 2.5.
b. The 90th percentile for the mean stress score for the 50 students.
c. The probability that the total of the 50 stress scores is less than 125.
d. The 90th percentile for the total stress score for the 50 students.
Let X = one stress score.
Solution 8.13
Problems a and b ask you to find a probability or a percentile for a mean.
Problems c and d ask you to find a probability or a percentile for a total or sum.
The sample size is n = 50.
---------------------------------------------------------------------------------------------------------------------
Since the individual stress scores follow a uniform distribution, X ~ U(1, 5); thus the mean
and standard deviation for the uniform distribution are µ = (1 + 5)/2 = 3 and
$\sigma = \frac{5 - 1}{\sqrt{12}} \approx 1.155$.
These values, µ = 3 and σ ≈ 1.155, are for the original distribution. Because the sample size
n = 50 is large, the sampling distribution for sample means is approximately normal. Using the
central limit theorem for sample means, $\mu_{\bar{X}} = \mu = 3$ and
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{1.155}{\sqrt{50}} = 0.163$. Therefore, $\bar{X} \sim N(3, 0.163)$.
b. Let k = the 90th percentile; that is, P($\bar{x}$ < k) = 0.90. Again, a graph is helpful:
Using the calculator, and using the parameters from part a, we see that the 90th percentile
for the mean of 50 scores is k = invNorm(.90, 3, 0.163), which is about 3.21. This tells us
that, among all samples of size 50, 90% of the mean stress scores are at most 3.21, and that
10% are at least 3.21.
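All four quantities asked for in this example use the same two sampling distributions, one for the mean and one for the sum. The following Python sketch is illustrative only: it assumes statistics.NormalDist, uses a = 1 and b = 5 from the problem statement, and the variable names are not from the text.

```python
from math import sqrt
from statistics import NormalDist

a, b, n = 1, 5, 50               # uniform endpoints and sample size
mu = (a + b) / 2                 # mean of one stress score
sigma = (b - a) / sqrt(12)       # SD of one stress score

mean_dist = NormalDist(mu, sigma / sqrt(n))      # CLT for the sample mean
sum_dist = NormalDist(n * mu, sqrt(n) * sigma)   # CLT for the sample sum

p_mean_below_2_5 = mean_dist.cdf(2.5)    # part a
k90_mean = mean_dist.inv_cdf(0.90)       # part b
p_sum_below_125 = sum_dist.cdf(125)      # part c
k90_sum = sum_dist.inv_cdf(0.90)         # part d

print(round(p_mean_below_2_5, 4), round(k90_mean, 2))
print(round(p_sum_below_125, 4), round(k90_sum, 2))
```

Parts a and c give the same probability because a total below 125 for 50 students is the same event as a mean below 2.5; likewise the percentile for the sum is 50 times the percentile for the mean.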
-----------------------------------------------------------------------------------------------------------------------------
KEY TERMS
Central Limit Theorem for Sample Means: Given a random variable X with known
mean µ and standard deviation σ, we select random samples of size n and examine the
sampling distribution of the sample mean $\bar{x}$. This sampling distribution has the following
characteristics:
i. $\mu_{\bar{X}} = \mu$
ii. $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
Central Limit Theorem for Sample Proportions: Suppose X is a random variable with
population proportion p. Suppose that we select random samples of size n. The sampling
distribution of the sample proportion $\hat{p}$ has the following characteristics:
i. $\mu_{\hat{p}} = p$
ii. $\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$
Central Limit Theorem for Sample Sums: Given a random variable X with known
mean µ and standard deviation σ, we select random samples of size n and examine the
sampling distribution of the sample sum, ΣX. This sampling distribution has the following
characteristics:
i. $\mu_{\Sigma X} = n\mu$
ii. $\sigma_{\Sigma X} = \sqrt{n}\,\sigma$
iii. For n > 30, the sampling distribution for ΣX is
approximately normal. That is, $\Sigma X \sim N(n\mu, \sqrt{n}\,\sigma)$.
Standard Error of the Mean: The standard deviation of the sampling distribution for $\bar{x}$:
$\frac{\sigma}{\sqrt{n}}$.
FORMULA REVIEW
z-score for a sample mean: $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
z-score for a sample proportion: $z = \frac{\hat{p} - p}{\sqrt{pq/n}}$

For $\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$:
P($\bar{X}$ < k) = normalcdf(-10^99, k, µ, σ/√n)
P($\bar{X}$ > k) = normalcdf(k, 10^99, µ, σ/√n)
P(k1 < $\bar{X}$ < k2) = normalcdf(k1, k2, µ, σ/√n)
k = invNorm(probability to left, µ, σ/√n)

For $\hat{p} \sim N\left(p, \sqrt{\frac{pq}{n}}\right)$:
P($\hat{p}$ < k) = normalcdf(-10^99, k, p, √(pq/n))
P($\hat{p}$ > k) = normalcdf(k, 10^99, p, √(pq/n))
P(k1 < $\hat{p}$ < k2) = normalcdf(k1, k2, p, √(pq/n))
k = invNorm(probability to left, p, √(pq/n))
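For readers working without the calculator, the two commands used throughout this chapter can be imitated in Python. This is a hedged sketch, not the calculator's actual implementation: it assumes statistics.NormalDist from the standard library, and the function names normalcdf and invNorm are chosen here only to mirror the calculator syntax.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean, sd):
    """Area under the normal curve N(mean, sd) between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

def invNorm(area_to_left, mean, sd):
    """Value k with the given area to its left under N(mean, sd)."""
    return NormalDist(mean, sd).inv_cdf(area_to_left)

# Usage with the salmon example: mean 29 inches, standard error 0.3 inches
print(round(normalcdf(30, 1e99, 29, 0.3), 6))
print(round(invNorm(0.95, 29, 0.3), 3))
```

As on the calculator, a very large bound such as 1e99 (or -1e99) plays the role of infinity for one-sided probabilities.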
Exercises for Chapter 8
1. A J-shaped distribution has a mean of 450 and a standard deviation of 28. Samples of size
n = 40 are drawn randomly from the population.
2. A normal distribution has a mean of 580 and a standard deviation of 40. Samples of size
25 are drawn randomly from the population.
3. Darla is a personnel manager in a large corporation. Each month she must review 16 of
the employees. From past experience, she has found that the reviews take her approximately
four hours each to do with a population standard deviation of 1.2 hours. Let Χ be the random
variable representing the time it takes her to complete one review; assume that X is normally
distributed. Let X-bar be the random variable representing the mean time to complete 16
reviews. Assume that the 16 reviews represent a random set of reviews.
b. Find the probability that the mean time for a month’s reviews will require between
3.5 and 4.25 hours, and sketch the graph.
4. Darla is a personnel manager in a large corporation. Each month she must review 20 of
the employees. From past experience, she has found that the reviews take her approximately
4.5 hours each to do with a population standard deviation of 1.1 hours. Let Χ be the random
variable representing the time it takes her to complete one review; assume that X is normally
distributed. Let X-bar be the random variable representing the mean time to complete 20
reviews. Assume that the 20 reviews represent a random set of reviews.
b. Find the probability that an individual review will require less than
4 hours, and sketch the graph.
c. Find the probability that the mean time for a month’s reviews will require less than
4 hours, and sketch the graph.
5. A normal distribution has a mean of 460 and a standard deviation of 30. Samples of size
25 are drawn randomly from the population. Round to tenths.
6. A J-shaped distribution has a mean of 320 and a standard deviation of 28. Samples of size
n = 45 are drawn randomly from the population. Find the 95th percentile for the sample
mean. Sketch the graph of the sampling distribution of sample means.
Use the following information to answer the next three exercises: A manufacturer
produces 25-pound lifting weights. The lowest actual weight is 24 pounds, and the highest is
26 pounds. Each weight is equally likely so the distribution of weights is uniform. A sample
of 100 weights is selected.
7. a. What is the distribution for the weights of one 25-pound lifting weight? What is
the mean and standard deviation of this distribution?
b. What is the distribution for the mean weight of one hundred 25-pound lifting
weights? What are the mean and standard deviation of the sampling distribution?
c. Find the probability that the mean actual weight for the 100 weights is less than
24.9. Draw the graph.
8. Find the probability that the mean actual weight for the 100 weights is greater than
25.2. Draw the graph.
9. Find the 90th percentile for the mean weight for the 100 weights. Draw the graph
Use the following information to answer the next four exercises: The lifetime in months
of a particular smartphone's battery follows an exponential distribution with a rate of decay
of 10%. A sample of 64 of these smartphones is taken.
11. Suppose that a sample of 64 batteries is selected. What is the distribution for the mean
length of time that the batteries last?
12. Find the probability that the sample mean for a sample of 64 batteries will be between 7
months and 11 months.
13. Find the IQR for the mean amount of time for a sample of 64 batteries.
14. Suppose that the distance of fly balls hit to the outfield (in baseball) is normally
distributed with a mean of 250 feet and a standard deviation of 50 feet. We randomly sample
49 fly balls.
a. Describe the distribution for $\bar{x}$, the average distance in feet, for 49 fly balls.
b. What is the probability that the 49 balls travel a mean distance of less than 240 feet?
c. Find the 80th percentile of the distribution of the average of 49 fly balls.
15. Previously, De Anza Statistics students estimated that the amount of change daytime
Statistics students carry is exponentially distributed with a mean of $0.88. Suppose that we
randomly pick 25 daytime statistics students.
a. In words, Χ =
b. X ~ ( , )
c. In words, $\bar{X}$ =
d. $\bar{X}$ ~ ( , )
e. Find the probability that an individual had between $0.80 and $1.00. Graph the
situation, and shade in the area to be determined.
f. Find the probability that the average of the 25 students was between $0.80 and $1.00.
Graph the situation, and shade in the area to be determined.
g. Explain why there is a difference in part e and part f.
16. Suppose that a category of world-class runners are known to run a marathon (26 miles)
in an average of 145 minutes with a standard deviation of 14 minutes. We select a random
sample of 49 races, and let x-bar represent the average time for the 49 running times.
a. $\bar{X}$ ~ ( , )
b. Find the probability that the runner will average between 142 and 146 minutes in
these 49 marathons.
c. Find the 80th percentile for the average of these 49 marathons.
d. Find the median of the average running times.
17. According to the Internal Revenue Service, the average length of time for an individual
to complete (keep records for, learn, prepare, copy, assemble, and send) IRS Form 1040 is
10.53 hours (without any attached schedules). The distribution is unknown. Let us assume
that the standard deviation is two hours. Suppose we randomly sample 36 taxpayers.
a. In words, Χ =
b. In words, $\bar{X}$ =
c. $\bar{X}$ ~ ( , )
d. Would you be surprised if the 36 taxpayers finished their Form 1040s in an
average of more than 12 hours? Explain why or why not in complete sentences.
e. Would you be surprised if one taxpayer finished his or her Form 1040 in more
than 12 hours? In a complete sentence, explain why.
18. In 2019 the average size of a U.S. farm was 444 acres, with a standard deviation of 55
acres. Suppose we randomly survey 38 farms from 2019. What is the probability that the
average size was more than 500 acres?
19. The length of songs in a collector’s iTunes album collection is uniformly distributed from
two to 3.5 minutes. Suppose we randomly pick five albums from the collection. There are a
total of 43 songs on the five albums.
a. In words, Χ =
b. Χ ~
c. In words, $\bar{X}$ = _____
d. Describe the distribution: $\bar{X}$ ~ ( , )
e. Find the first quartile for the average song length.
f. The IQR (interquartile range) for the average song length is from _____ to _____.
20. The percent of fat calories that a person in America consumes each day is normally
distributed with a mean of about 36 and a standard deviation 10. Suppose that 16
individuals are randomly chosen.
a. Find the probability that the mean percent of fat calories for a group of 16 is greater
than 40.
b. Find the first quartile for the average percent of fat calories.
21. Determine which of the following are true and which are false. Then, in complete
sentences, justify your answers.
22. Which of the following is NOT TRUE about the distribution of sample means?
23. The distribution of income in some summer paid internships is considered wedge shaped
(many very low paid, very few middle paid, and even fewer high paid). Suppose we pick a
company with a wedge-shaped distribution. Let the average summer salary be $2,000 per
summer with a standard deviation of $4,000. We randomly survey 100 interns.
a. In words, Χ =
b. In words, $\bar{X}$ =
c. $\bar{X}$ ~
d. How is it possible for the standard deviation to be greater than the average?
e. Why is it more likely that the average of the 100 interns will be from $2,000 to
$2,200 than from $2,200 to $2,400?
24. The attention span of a two-year-old is exponentially distributed with a mean of about
eight minutes. Suppose we randomly survey 60 two-year-olds.
a. In words, Χ =
b. Χ ~ ( , )
c. In words, $\bar{X}$ =
d. $\bar{X}$ ~ ( , )
e. Before doing any calculations, which do you think will be higher? Explain why.
i. The probability that an individual attention span is less than ten minutes.
ii. The probability that the average attention span for the 60 children is less than
ten minutes?
f. Calculate the probabilities in part e.
25. The cost of unleaded gasoline in the Bay Area once followed an unknown distribution
with a mean of $4.59 and a standard deviation of $0.10. Sixteen gas stations from the Bay
Area are randomly chosen. We are interested in the average cost of gasoline for the 16 gas
stations. The distribution to use for the average cost of gasoline for the 16 gas stations is:
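a. $\bar{X} \sim N(4.59, 0.10)$
b. $\bar{X} \sim N\left(4.59, \frac{0.10}{\sqrt{16}}\right)$
c. $\bar{X} \sim N\left(4.59, \frac{0.10}{16}\right)$
d. $\bar{X} \sim N\left(4.59, \frac{16}{0.10}\right)$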
26. The time to wait for a particular rural bus is distributed uniformly from zero to 75
minutes. One hundred riders are randomly sampled to learn how long they waited. Would
you be surprised, based upon numerical calculations, if the sample average wait time (in
minutes) for 100 riders was less than 30 minutes?
27. The time to wait for a particular rural bus is distributed uniformly from zero to 75
minutes. One hundred riders are randomly sampled to learn how long they waited. Find the
90th percentile sample average wait time (in minutes) for a sample of 100 riders.
28. The cost of unleaded gasoline in the Bay Area once followed an unknown distribution
with a mean of $4.59 and a standard deviation of $0.10. Sixteen gas stations from the Bay
Area are randomly chosen. We are interested in the average cost of gasoline for the 16 gas
stations. Find the probability that the average price for 30 gas stations is less than $4.55.
29. The cost of unleaded gasoline in the Bay Area once followed an unknown distribution
with a mean of $4.59 and a standard deviation of $0.10. Sixteen gas stations from the Bay
Area are randomly chosen. We are interested in the average cost of gasoline for the 16 gas
stations. What's the approximate probability that the average price for 16 gas stations is over
$4.69?
30. The average length of a maternity stay in a U.S. hospital is said to be normally distributed
with mean of 2.4 days with a standard deviation of 0.9 days. We randomly survey 80 women
who recently bore children in a U.S. hospital.
a. In words, X =
b. In words, $\bar{X}$ =
c. $\bar{X}$ ~
d. Is it likely that an individual stayed more than five days in the hospital? Why or
why not?
e. Is it likely that the average stay for the 80 women was more than five days? Why
or why not?
h. Which is more likely:
i. An individual stayed more than five days.
ii. the average stay of 80 women was more than five days.
31. Salaries for teachers in a particular elementary school district are normally distributed
with a mean of $44,000 and a standard deviation of $6,500. We randomly survey ten
teachers from that district.
32. Men have an average weight of 172 pounds with a standard deviation of 29 pounds.
a. Find the probability that 20 randomly selected men will have a total weight
exceeding 3600 lbs.
b. If 20 men have a sum weight greater than 3500 lbs, then their total weight
exceeds the safety limits for water taxis. Based on (a), is this a safety concern?
Explain.
33. Never Ready Batteries has engineered a newer, longer lasting AAA battery. The company
claims this battery has an average life span of 17 hours with a standard deviation of 0.8
hours. Your statistics class questions this claim. As a class, you randomly select 30 batteries
and find that the sample mean life span is 16.7 hours. If the process is working properly,
what is the probability of getting a random sample of 30 batteries in which the sample mean
lifetime is 16.7 hours or less? Is the company’s claim reasonable?
34. A typical adult has an average IQ score of 105 with a standard deviation of 20. If 20
randomly selected adults are given an IQ test, what is the probability that the sample mean
scores will be between 85 and 125 points?
35. Your company has a contract to perform preventive maintenance on thousands of air-
conditioners in a large city. Based on service records from previous years, the time that a
technician spends servicing a unit averages one hour with a standard deviation of one hour.
In the coming week, your company will service a simple random sample of 70 units in the
city. You plan to budget an average of 1.1 hours per technician to complete the work. Will
this be enough time?
36. According to ClinicalTrials.gov, the rapid, onsite COVID-19 detection had an average
result in 90 minutes. Let’s assume the standard deviation of detection time is 15 minutes.
Random samples of size 40 are made. Find the probability that sample mean of 60 tests is
less than 1 hour.
37. Certain coins have an average weight of 5.201 grams with a standard deviation of 0.065
g. If a vending machine is designed to accept coins whose weights range from 5.111 g to 5.291
g, what is the expected number of rejected coins when 280 randomly selected coins are
inserted into the machine?
38. Four friends, Janice, Barbara, Kathy and Roberta, decided to carpool together to get to
school. Each day the driver would be chosen by randomly selecting one of the four names.
They carpool to school for 96 days. Use the normal approximation to the binomial to
calculate the following probabilities.
b. Find the probability that Roberta is the driver more than 16 days.
39. Suppose in a local Kindergarten through 12th grade (K - 12) school district, 53 percent of
the population favor a charter school for grades K through five. A simple random sample of
300 is surveyed. Calculate the following using the normal approximation to the binomial
distribution.
a. Find the probability that less than 100 favor a charter school for K through 5.
b. Find the probability that at least 170 favor a charter school for K through 5.
c. Find the probability that less than 141 favor a charter school for K through 5.
d. Find the probability that there are fewer than 130 that favor a charter school for
K through 5.
e. Find the probability that exactly 150 favor a charter school for K through 5.
40. Suppose that 54% of students register two weeks before the start of the semester. If you
randomly survey 70 students, what is the probability your survey says at most 20% register
two weeks before the start of the semester?
41. According to a USA Today article, 56% of 18 to 24 year olds voted for Hillary Clinton in
the 2016 election. If we randomly survey thirty 18 to 24 year olds, what is the probability
that at least 60% of those selected voted for Hillary Clinton?
43. Based on past experience, a bank claims that 24% of the people who receive student
loans will not make the payments on time due to the pandemic. The bank recently
reviewed 400 student loans.
a. What is the mean of the proportion of students in this group who may not make
timely payments?
b. What is the standard deviation of the proportion of students in this group who may
not make timely payments?
c. What’s the probability that over 30% of these students will not make the payments
on time?
44. According to Nielsen media, 44% of Hispanic adult consumers are more likely to be
considered what they call “recovery optimists” after the COVID-19 pandemic. If we
randomly survey 50 Hispanic adult consumers, what is the probability that less than 30%
of those selected are considered “recovery optimists”?
Use the following information to answer the next four exercises: An unknown distribution
has a mean of 80 and a standard deviation of 12. A sample of size 95 is drawn randomly
from the population.
45. Find the probability that the sum of the 95 values is greater than 7,650.
46. Find the probability that the sum of the 95 values is less than 7,400.
47. Find the sum that is two standard deviations above the mean of the sums.
48. Find the sum that is 1.5 standard deviations below the mean of the sums.
Use the following information to answer the next five exercises: The distribution of results
from a cholesterol test has a mean of 180 and a standard deviation of 20. A sample of size 40
is drawn randomly.
49. Find the probability that the sum of the 40 values is greater than 7,500.
50. Find the probability that the sum of the 40 values is less than 7,000.
51. Find the sum that is one standard deviation above the mean of the sums.
52. Find the sum that is 1.5 standard deviations below the mean of the sums.
53. Find the percentage of sums between 1.5 standard deviations below the mean of the
sums and one standard deviation above the mean of the sums.
Use the following information to answer the next six exercises: A researcher measures the
amount of sugar in several cans of the same soda. The mean is 39.01 with a standard
deviation of 0.5. The researcher randomly selects a sample of 100.
54. Find the probability that the sum of the 100 values is greater than 3,910.
55. Find the probability that the sum of the 100 values is less than 3,900.
56. Find the probability that the sum of the 100 values falls between 3900 and 3910.
59. Find the probability that the sums will fall between the z = -2 and z = 1.
Use the following information to answer the next four exercises: An unknown distribution has
a mean 12 and a standard deviation of one. A sample size of 25 is taken. Let X = the object of
interest.
64. True or False: Only the sums of normal distributions are also normal distributions.
65. In order for the sums of a distribution to approach a normal distribution, what must be
true?
66. What three things must you know about a distribution to find the probability of sums?
67. An unknown distribution has a mean of 25 and a standard deviation of six. Let X = one
object from this distribution. What is the sample size if the standard deviation of ΣX is 42?
68. An unknown distribution has a mean of 19 and a standard deviation of 20. Let X = the
object of interest. What is the sample size if the mean of ΣX is 15,200?
69. A market researcher analyzes how many electronics devices customers buy in a single
purchase. The distribution has a mean of three with a standard deviation of 0.7. She samples
400 customers. What is P(Σx < 1,186)?
Use the following information to answer the next three exercises: An unknown distribution
has a mean of 100, a standard deviation of 100, and a sample size of 100. Let X = one object
of interest.
Use the following information to answer the next four exercises: A manufacturer produces
25-pound lifting weights. The lowest actual weight is 24 pounds, and the highest is 26
pounds. Each weight is equally likely so the distribution of weights is uniform. A sample of
100 weights is selected.
73. a. What is the distribution for the sum of the weights of 100 25-pound lifting weights?
b. Find P(Σx < 2,450).
75. Find the 90th percentile for the total weight of the 100 weights.
Use the following information to answer the next three exercises: The length of time a
particular smartphone's battery lasts follows an exponential distribution with a mean of ten
months. A sample of 64 of these smartphones is taken.
77. What is the distribution for the total length of time that 64 batteries last?
78. Find the 80th percentile for the total length of time that 64 batteries will last.
79. Find the middle 80% for the total amount of time that 64 batteries will last.
Use the following information to answer the next six exercises:
A uniform distribution has a minimum of six and a maximum of ten. A sample of 50 is
selected.
REFERENCES
8.1 The Central Limit Theorem for Sample Means (Averages) and Sample Proportions
Baran, Daya. “20 Percent of Americans Have Never Used Email.” WebGuild, 2010. Available
online at http://www.webguild.org/20080519/20-percent-of-americans-have-never-used-email
(accessed May 17, 2013).
“National Health and Nutrition Examination Survey.” Centers for Disease Control and Prevention.
Available online at http://www.cdc.gov/nchs/nhanes.htm (accessed May 17, 2013).
Castillo, Walbert, et al. “How We Voted - by Age, Education, Race and Sexual Orientation.” USA
Today, Gannett Satellite Information Network, 9 Nov. 2016,
college.usatoday.com/2016/11/09/how-we-voted-by-age-education-race-and-sexual-orientation/
(accessed July 17, 2018).
“Hispanic Consumers Are Recovery Optimists; Black Consumers Are Cautious Optimists.” Nielsen, 4
May 2021,
www.nielsen.com/us/en/insights/article/2021/hispanic-consumers-are-recovery-optimists-black-consumers-are-cautious-optimists/.
Farago, Peter. “The Truth About Cats and Dogs: Smartphone vs Tablet Usage Differences.” The
Flurry Blog, 2013. Posted October 29, 2012. Available online at http://blog.flurry.com
(accessed May 17, 2013).
CHAPTER 8 SOLUTIONS:
17) a. X = length of time for an individual to complete IRS form 1040, in hours.
b. X̄ = mean length of time for a sample of 36 taxpayers to complete IRS form 1040
c. X̄ ~ N(10.53, 0.333)
d. Yes, we would be surprised, since P(x̄ > 12) is approximately 0.
e. No; would not be totally surprised because the probability is P(x > 12) = 0.2312.
So about 23% of individuals take more than 12 hours to complete the form.
19) a. X = the length of a song, in minutes, in the collection b. X ~ U(2, 3.5)
c. X̄ = the average length, in minutes, of the songs from a sample of five albums
d. X̄ ~ N(2.75, 0.066)
e. Q1 = invNorm(.25, 2.75, 0.066) = 2.71 minutes f. 0.09 minutes
21) a. True. By the CLT, the mean of a sampling distribution of the means is
approximately the mean of the data distribution for large n.
b. True. According to the Central Limit Theorem, the larger the sample, the closer
the sampling distribution of the means becomes normal.
c. False. The standard deviation of the sampling distribution of the means
decreases as the sample size increases; so $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ will be much smaller than σ.
25) option b
33) We are given that μ = 17, σ = 0.8, x̄ = 16.7, and n = 30. To calculate the probability, we
use P(x̄ < 16.7) = normalcdf(-10^99, 16.7, 17, 0.8/√30) = 0.020. If the process is working
properly, then there is only about a 2% probability that a sample of 30 batteries would have
an average lifetime of 16.7 hours or less. So the sample data appears to be incompatible
with the company’s claim. Therefore, the class was justified to question the claim.
35) We are given that μ = 1, σ = 1, x̄ = 1.1, and n = 70. We calculate the probability
P(x̄ < 1.1) = normalcdf(-10^99, 1.1, 1, 1/√70) = 0.7986. So there is about an 80% chance that the average
service time will be less than 1.1 hours. It would be wise to schedule more time since there
is about a 20% chance that the mean maintenance time will be greater than 1.1 hours.
37) Since P(5.111 < x̄ < 5.291) = normalcdf(5.111, 5.291, 5.201, 0.065/√280) ≈ 1, we can conclude
that virtually all the coins are within the limits. We would not expect any coins to be
rejected in a sample of 280.
41) 0.329 43) a.) 24% b.) 2.14% c.) P(p̂ > 30%) = normalcdf(.30, 10^99, .24, .0214)
= .0025
45) 0.3345 47) 7,833.92 49) 0.0089 51) 7,326.49 53) 77.45% 55) 0.4207
75) 2,507.40 77) ΣX ~ N(640, 80) 79) P10 = invNorm(.10, 640, 80) = 537.5, P90 =
invNorm(.90, 640, 80) = 742.5
9 | CONFIDENCE INTERVALS
Have you ever wondered about the average number of apps a tablet user downloads each month? You can
use confidence intervals to answer this question.
Chapter Objectives
• Calculate and interpret confidence intervals for estimating a population mean and a
population proportion.
• Interpret the Student's t probability distribution as the sample size changes.
• Discriminate between problems applying the normal and the Student's t distributions.
• Calculate the sample size required to estimate a population mean and a population
proportion given a desired confidence level and margin of error.
Suppose you were trying to determine the mean rent of a two-bedroom apartment in your
town. You might look in the classified section of the newspaper, write down several rents
listed, and average them together. You would have obtained a point estimate of the true
mean. If you are trying to determine the percentage of times you make a basket when
shooting a basketball, you might count the number of shots you make and divide that by the
number of shots you attempted. In this case, you would have obtained a point estimate for
the true proportion.
9.1 INTRODUCTION
We use sample data to generalize about an unknown population. This part of statistics is
called inferential statistics. The sample data help us to make an estimate of a population
parameter. We realize that the point estimate is most likely not the exact value of the
population parameter, but close to it. Therefore, a point estimate is a sample statistic that
serves as a starting measurement for estimating the population parameter.
Example 9.1
Inequality notation: 3 < x < 10, here we are saying x is between 3 and 10.
A confidence interval is another type of estimate but, instead of being just one number, it is
an interval of numbers. The interval of numbers is a range of values calculated from a given
set of sample data. The confidence interval is likely to include an unknown population
parameter between a lower bound value and upper bound value with a certain level of
confidence.
NOTE: For a mean or a proportion, the lower bound value is the point estimate – margin of error,
and the upper bound value is the point estimate + margin of error.
The interval notation of this form is (point estimate – margin of error, point estimate +
margin of error)
The empirical rule, which applies to bell-shaped distributions, says that in approximately
95% of the samples, the sample mean, x̄, will be within two standard deviations of the
population mean µ. From Example 9.1, two standard deviations is (2)(0.1) = 0.2. The sample
mean x̄ is likely to be within 0.2 units of µ. Therefore, 0.2 is considered the margin of error.
Because x̄ is within 0.2 units of µ, which is unknown, µ is likely to be within 0.2 units
of x̄ in 95% of the samples. The population mean µ is contained in an interval whose lower
bound value is calculated by taking the sample mean and subtracting two standard
deviations (2)(0.1) and whose upper bound value is calculated by taking the sample mean
and adding two standard deviations. In other words, µ is between x̄ − 0.2 and x̄ + 0.2 in 95%
of all the samples.
For Example 9.1, the unknown population mean µ is between x̄ − 0.2 = 4 − 0.2 = 3.8
(lower bound value) and x̄ + 0.2 = 4 + 0.2 = 4.2 (upper bound value).
We say that we are 95% confident that the unknown population mean number of apps
downloaded from the app store per month is between 3.8 and 4.2. The 95% confidence interval
is (3.8, 4.2).
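The arithmetic behind this interval can be sketched in a few lines of Python. This is only an illustration and not part of the calculator workflow used in this book; the function name interval_from_margin is hypothetical, and the values 4 and 0.1 are the ones quoted above from Example 9.1.

```python
def interval_from_margin(point_estimate, margin_of_error):
    """Return (lower bound, upper bound) = point estimate -/+ margin of error."""
    return (point_estimate - margin_of_error, point_estimate + margin_of_error)

# Values quoted from Example 9.1: sample mean 4, standard deviation 0.1.
x_bar = 4
margin = 2 * 0.1   # empirical rule: about 95% of sample means lie within 2 standard deviations

lower, upper = interval_from_margin(x_bar, margin)
print(f"95% confidence interval: ({lower:.1f}, {upper:.1f})")
```

The same helper applies to a proportion, since the interval is always the point estimate minus and plus the margin of error.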
Recap of Terms
• Margin of Error: maximum difference from point estimate to the lower or upper
bound. E = upper bound – point estimate.
• Confidence Interval: the range of values that likely contains the population
parameter within a certain level of confidence.
Example 9.2
Find the point estimate for the population mean, µ, for a 90% confidence interval. Assume
the distribution is normally distributed with a population standard deviation of 6 minutes.
A random sample of 28 pizza delivery restaurants is taken and has a sample mean delivery
time of 36 minutes.
Solution 9.2
The point estimate for the population mean is the sample mean, x̄. Therefore x̄ = 36 minutes is
the point estimate.
Example 9.3
Find the point estimate for the population mean, µ, for a 90% confidence interval. Assume
the distribution is normally distributed with a population standard deviation of 6 minutes.
A random sample of 10 pizza delivery restaurants is taken and has the following raw data in
minutes:
Solution 9.3
The point estimate for the population mean is the sample mean, x̄. We can find the sample mean
using the calculator. First, we must enter the data into L1.
2nd Quit,
Mean(L1)
Example 9.4
In a Gallup poll that took place in June 2021, 96 out of 1,200 US adults mentioned “the coronavirus
pandemic” as the most important problem in the US. This proportion has gone down 7% from
the previous month. (Adapted from News.Gallup.com) In constructing a 98% confidence
interval for the population proportion that mentions the pandemic as the most important
problem in the US, find the point estimate.
Solution 9.4
The point estimate for the population proportion is the sample proportion, p̂. Therefore
$\hat{p} = \frac{96}{1200}$ is the point estimate. It can be written as a decimal (0.08) or a percent (8%).
9.1 Find the point estimate for the population mean and population standard deviation,
µ and σ. Assume the distribution is approximately normal. In a random sample of 8 people,
the mean commute to work was 35.5 minutes and the sample standard deviation was 7.2
minutes.
Point Estimates
Sample Mean: x̄
Sample Variance: s²
9.2 | Confidence Interval for a Single Population Mean when σ known
A confidence interval for a population mean with a known standard deviation is based on the
fact that the sample means follow an approximately normal distribution:
$\bar{X} \sim N\left(\mu_x, \frac{\sigma}{\sqrt{n}}\right)$
Remember that a confidence interval is created for an unknown population parameter like
the population mean, μ. Confidence intervals for this parameter have the form:
(x̄ − E, x̄ + E)
The margin of error, E, depends on the confidence level, CL (the percentage of confidence),
and the standard error of the mean, $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$. The value that represents the level of
confidence for a certain distribution is known as a critical value. The critical value for the
Z-distribution has the notation Zc or Zα/2. Therefore, $E = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$.
The confidence level, CL, is the area in the middle of the Z-distribution (standard normal
distribution). The confidence level is often considered the probability that the calculated
confidence interval estimate will contain the true population parameter. However, it is more
accurate to state that the confidence level is the percent of confidence intervals that contain
the true population parameter when repeated samples are taken. Most often, it is the choice
of the person constructing the confidence interval to choose a confidence level of 90% or higher
because that person wants to be reasonably certain of his or her conclusions.
CL = 1 – α, so α is the area that is not in the center and is split equally between the two tails.
Therefore, α is the probability that the interval will not contain the true population
parameter. Each of the tails contains an area equal to α/2. The z-score that has an area to
the right of α/2 is denoted by Zα/2.
For example, when CL = 0.95, α = 0.05 and α/2 = 0.025; we write Zα/2 = Z0.025.
Recall from Chapter 7 that, when we are given the area to the left, we can use the
calculator to find the corresponding value of Z.
The left critical value is negative since it is on the left side of zero. The right critical value is
the symmetric opposite of the left critical value. Round critical values to two decimal places.
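For readers who want to check critical values without a calculator, the following is a minimal Python sketch. It assumes Python 3.8 or later, whose standard library provides statistics.NormalDist; the helper name z_critical is hypothetical. It uses the fact stated above that the area to the left of the right critical value is 1 – α/2.

```python
from statistics import NormalDist

def z_critical(confidence_level):
    """Right critical value z_(alpha/2) for a confidence level such as 0.95."""
    alpha = 1 - confidence_level
    # The area to the left of the right critical value is 1 - alpha/2.
    return NormalDist().inv_cdf(1 - alpha / 2)

for cl in (0.90, 0.95, 0.98):
    z = z_critical(cl)
    # The left critical value is the symmetric opposite of the right one.
    print(f"CL = {cl:.2f}: left = {-z:.3f}, right = {z:.3f}")
```

The printout keeps three decimal places so that the 90% value can be compared with the 1.645 used later in this section; round to two decimal places when following the rounding rule above.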
Example 9.5
Find the critical value for the Z-distribution for 98% confidence level.
Solution 9.5
Calculating the Confidence Interval for mean when σ is known
Point estimate – margin of error < µ < point estimate + margin of error
(x̄ − E, x̄ + E)
where $E = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$.
• Calculate the sample mean, x̄, from the sample data. Remember, in this section we
already know the population standard deviation σ.
• Find the critical value, Zα/2 that corresponds to the confidence level.
• Calculate the margin of error, E.
• Construct the confidence interval.
• Write a sentence that interprets the estimate in the context of the situation in the
problem. (Explain what the confidence interval means, in the words of the problem.)
The interpretation should clearly state the confidence level (CL), explain what population
parameter is being estimated (here, a population mean), and state the confidence interval
(both endpoints). "We estimate with ___% confidence that the true population mean (include
the context of the problem) is between ___ and ___ (include appropriate units)."
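The calculation steps listed above can be sketched as a short Python function. This is an illustration under stated assumptions rather than the method used in this book, which relies on the TI calculator: the function name z_interval is hypothetical, and the inputs (sample mean 68, σ = 3, n = 36, 90% confidence) are the values of Example 9.6.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence_level):
    """Confidence interval for a population mean when sigma is known."""
    alpha = 1 - confidence_level
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value z_(alpha/2)
    margin = z * sigma / sqrt(n)              # E = z_(alpha/2) * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin, margin

lower, upper, margin = z_interval(x_bar=68, sigma=3, n=36, confidence_level=0.90)
print(f"E = {margin:.4f}; interval = ({lower:.2f}, {upper:.2f})")
```

Because the function uses an unrounded critical value, the margin of error can differ in the third or fourth decimal place from a hand calculation that uses a rounded z.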
Example 9.6
Suppose scores on exams in statistics are normally distributed with an unknown population
mean and a population standard deviation of three points. A random sample of 36 scores is
taken and gives a sample mean (sample mean score) of 68. Find a confidence interval
estimate for the population mean exam score (the mean score on all exams).
Find a 90% confidence interval for the true (population) mean of statistics exam scores.
Solution 9.6
• The second solution uses the TI-83, 83+, and 84+ calculators (Solution B).
9.6 Solution A
To find the confidence interval, you need the sample mean, x̄, and the margin of error, E.
x̄ = 68
$E = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} = 1.645 \cdot \frac{3}{\sqrt{36}} = 0.8225$
x̄ – E = 68 – 0.8225 = 67.1775
x̄ + E = 68 + 0.8225 = 68.8225
The 90% confidence interval for the population mean is (67.1775, 68.8225).
9.6 Solution B
Interpretation:
We estimate with 90% confidence that the true population mean exam score for all statistics
students is between 67.18 and 68.82.
Ninety percent of all confidence intervals constructed in this way contain the true mean
statistics exam score. For example, if we constructed 100 of these confidence intervals, we
would expect 90 of them to contain the true population mean exam score.
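The repeated-sampling idea in this interpretation can be explored with a small simulation. The sketch below is only an illustration: the true mean of 70 is a hypothetical value chosen for the simulation (in practice it is unknown), the function name coverage is made up, and σ = 3 and n = 36 are borrowed from Example 9.6.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def coverage(true_mean, sigma, n, confidence_level, trials, seed=0):
    """Fraction of simulated confidence intervals that contain true_mean."""
    rng = random.Random(seed)                 # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin = z * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        # Draw one random sample of size n and build its interval.
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        x_bar = mean(sample)
        if x_bar - margin <= true_mean <= x_bar + margin:
            hits += 1
    return hits / trials

rate = coverage(true_mean=70, sigma=3, n=36, confidence_level=0.90, trials=2000)
print(f"Share of intervals containing the true mean: {rate:.3f}")
```

The share should land near 0.90, although any single run of the simulation will be a little above or below that value.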
9.2 Suppose average pizza delivery times are normally distributed with an unknown
population mean and a population standard deviation of six minutes. A random sample of 28
pizza delivery restaurants is taken and has a sample mean delivery time of 36 minutes. Find
a 90% confidence interval estimate for the population mean delivery time.
Example 9.7
$98, $131, $97, $89, $114, $145, $109, $141, $102, $110, $90, $92, $134, $144, $152,
$140, $118, $98, $144, $87, $150, $118, $118, $131, $99, $120, $115, $105, $95, $96
Find a 98% confidence interval for the true (population) mean of the monthly post pay mobile
phone bill. Assume that the population standard deviation is σ = $18.
To find the confidence interval, start by finding the point estimate: the sample mean.
x̄ = $116.07
Next, find the margin of error, E. Because you are creating a 98% confidence interval, CL =
0.98.
$E = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$
$E = 2.33 \cdot \frac{18}{\sqrt{30}} \approx 7.66$ dollars. NOTE: the units of E are the same as the units of x̄ and s.
To find the 98% confidence interval, find the lower and upper bounds of the interval.
Interpretation: We estimate with 98% confidence that the true mean monthly post pay mobile
phone bill for Americans aged 18 to 64 is between $108.42 and $123.72. Therefore, the
article stating that the average American pays $114 is supported, because $114 lies inside this interval.
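The point estimate and margin of error for this example can be double-checked with a short Python sketch. It is an illustration only: the variable names are arbitrary, the list holds the 30 bills given in the example, and the critical value 2.33 is the rounded value used in the hand calculation.

```python
from math import sqrt
from statistics import mean

# The 30 monthly bills (in dollars) listed in Example 9.7.
bills = [98, 131, 97, 89, 114, 145, 109, 141, 102, 110, 90, 92, 134, 144, 152,
         140, 118, 98, 144, 87, 150, 118, 118, 131, 99, 120, 115, 105, 95, 96]

sigma = 18   # population standard deviation given in the example
z = 2.33     # critical value for a 98% confidence level, rounded to two decimals

x_bar = mean(bills)                          # point estimate
margin = z * sigma / sqrt(len(bills))        # E = z_(alpha/2) * sigma / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin
print(f"x-bar = {x_bar:.2f}, E = {margin:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```

Depending on where intermediate values are rounded, the endpoints may differ by a cent or two from the interval quoted in the interpretation.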
9.3 The data that follow show a different random sample of 20 mobile phone bills. Use
these data to calculate a 93% confidence interval for the true (population) mean of the monthly
post pay mobile phone bill. Assume that the population standard deviation is σ = $18, as
previously stated. $91, $137, $97, $103, $137, $127, $86, $118, $128, $138, $126, $90,
$144, $94, $104, $94, $130, $141, $93, $128
Changing the Confidence Level or Sample Size
We now examine whether the width of the confidence interval is affected by changing the
confidence level or by changing the sample size.
Example 9.8
Suppose we change the original problem in Example 9.6, which uses 90% confidence level,
by using a 95% confidence level. Find a 95% confidence interval for the true (population)
mean statistics exam score.
To find the confidence interval about the mean with the population standard deviation
known, we use Zinterval.
E = 68.98 – 68 = 0.98
Notice that E is larger for the 95% confidence level than for the 90% confidence level of Example 9.6.
The 90% confidence interval is (67.18, 68.82). The 95% confidence interval is (67.02, 68.98).
The 95% confidence interval is wider. If you look at the graphs, because the area 0.95 is larger
than the area 0.90, it makes sense that the 95% confidence interval is wider. To be more
confident that the confidence interval actually does contain the true value of the population
mean for all statistics exam scores, the confidence interval necessarily needs to be wider.
Summary: Effect of Changing the Confidence Level
• Increasing the confidence level increases the margin of error, making the confidence
interval wider.
• Decreasing the confidence level decreases the margin of error, making the confidence
interval narrower.
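The effect of the confidence level on the margin of error can be tabulated with a brief Python sketch. It is illustrative only: σ = 3 and n = 36 come from Example 9.6, the 99% level is added purely for comparison, and the helper name margin_of_error is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

sigma, n = 3, 36   # values from Example 9.6

def margin_of_error(confidence_level):
    """E = z_(alpha/2) * sigma / sqrt(n) for the fixed sigma and n above."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return z * sigma / sqrt(n)

margins = {cl: margin_of_error(cl) for cl in (0.90, 0.95, 0.99)}
for cl, e in margins.items():
    print(f"CL = {cl:.2f}: E = {e:.3f}")
```

With the sample size held fixed, only the critical value changes, so the margin of error grows with the confidence level.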
9.4 Refer back to the pizza-delivery 9.2 Try It exercise. The population standard deviation is
six minutes and the sample mean delivery time is 36 minutes. Use a sample size of 20. Find a
95% confidence interval estimate for the true mean pizza delivery time.
Example 9.9
Suppose we change the original problem in Example 9.6 to see what happens to the margin
of error if the sample size is changed.
Leave everything the same except the sample size. Use the original 90% confidence level.
a. What happens to the margin of error and the confidence interval if we increase the
sample size and use n = 100 instead of n = 36?
b. What happens if we decrease the sample size to n = 25 instead of n = 36?
Summary: Effect of Changing the Sample Size
• Increasing the sample size causes the margin of error to decrease, making the
confidence interval narrower.
• Decreasing the sample size causes the margin of error to increase, making the
confidence interval wider.
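A few lines of Python make the effect of the sample size concrete. This sketch is an illustration, not the calculator solution: it reuses σ = 3, the sample mean 68, and the 90% critical value 1.645 from Example 9.6, with the three sample sizes discussed in Example 9.9.

```python
from math import sqrt

sigma = 3      # population standard deviation from Example 9.6
x_bar = 68     # sample mean from Example 9.6
z = 1.645      # critical value for the 90% confidence level

# Margin of error for each sample size considered in Example 9.9.
margins = {n: z * sigma / sqrt(n) for n in (25, 36, 100)}
for n, e in margins.items():
    print(f"n = {n:3d}: E = {e:.4f}, interval = ({x_bar - e:.2f}, {x_bar + e:.2f})")
```

Because n sits under a square root, quadrupling the sample size only halves the margin of error.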
9.5 Refer back to the pizza-delivery 9.2 Try It exercise. The mean delivery time is 36 minutes
and the population standard deviation is six minutes. Assume the sample size is changed to
50 restaurants with the same sample mean. Find a 90% confidence interval estimate for the
population mean delivery time.
If researchers desire a specific margin of error, then they can use the margin of error formula
to calculate the required sample size. The margin of error for a population mean
when the population standard deviation is known is:
$E = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$
Solving this formula for n gives the required sample size:
$n = \left(\frac{z_{\alpha/2} \cdot \sigma}{E}\right)^2$
Example 9.10
The population standard deviation for the age of CLC students is 15 years. If we want to be
95% confident that the sample mean age is within two years of the true population mean age
of CLC students, how many randomly selected CLC students must be surveyed?
Solution 9.10
$n = \left(\frac{z_{\alpha/2} \cdot \sigma}{E}\right)^2 = \left(\frac{1.96 \cdot 15}{2}\right)^2 = 216.09$
Use n = 217: Always round the answer UP to the next higher integer to ensure that the
sample size is large enough. Therefore, 217 CLC students should be surveyed in order to be
95% confident that we are within two years of the true population mean age of CLC students.
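The sample size formula and the round-up rule can be sketched in Python as follows. The function name required_sample_size is hypothetical, and the inputs are the values of Example 9.10; math.ceil performs the "always round up" step.

```python
from math import ceil

def required_sample_size(z, sigma, margin):
    """Smallest whole n satisfying n >= (z * sigma / margin)^2."""
    return ceil((z * sigma / margin) ** 2)

# Example 9.10: z = 1.96 (95% confidence), sigma = 15 years, E = 2 years.
n = required_sample_size(z=1.96, sigma=15, margin=2)
print(n)
```

Rounding up rather than to the nearest integer matters: with n = 216 the margin of error would be slightly larger than the two years requested.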
9.6 The population standard deviation for the height of high school basketball players is three
inches. If we want to be 95% confident that the sample mean height is within 1.5 inches of the
true population mean height, how many randomly selected students must be surveyed?
9.3 | Confidence Interval for a Single Population Mean when σ unknown
In practice, we rarely know the population standard deviation, σ. In the past, when the
sample size was large, this did not present a problem to statisticians. They used the sample
standard deviation s as an estimate for σ and proceeded as before to calculate a confidence
interval with close enough results. However, statisticians ran into problems when the sample
size was small. A small sample size caused inaccuracies in the confidence interval.
William S. Gosset (1876–1937) of the Guinness brewery in Dublin, Ireland ran into this
problem. His experiments with hops and barley produced very few samples. Just replacing σ
with s did not produce accurate results when he tried to calculate a confidence interval. He
realized that he could not use a normal distribution for the calculation; he found that the
actual distribution depends on the sample size. This problem led him to "discover" what is
called the Student's t-distribution. The name comes from the fact that Gosset wrote under
the pen name "Student."
Up until the mid-1970s, some statisticians used the normal distribution approximation for
large sample sizes and used the Student's t-distribution only for sample sizes of at most
30. With graphing calculators and computers, the practice now is to use the Student's
t-distribution whenever s is used as an estimate for σ.
• The graph for the Student's t-distribution is similar to the standard normal curve;
however, it has more probability in its tails than the standard normal distribution
because the spread of the t-distribution is greater than the spread of the standard normal.
So the graph of the Student's t-distribution will be thicker in the tails and shorter in the
center than the graph of the standard normal distribution.
• The mean for the Student's t-distribution is 0 and the distribution is symmetric about 0.
• The exact shape of the Student's t-distribution depends on the degrees of freedom. As the
degrees of freedom increases, the graph of Student's t-distribution becomes more like the
graph of the standard normal distribution.
• The t-score, $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$, has the same interpretation as the z-score, which is the number of
standard deviations that x̄ lies from the mean μ.
Critical Value, tα/2.
Recall, confidence level (CL) is the area in the middle of the distribution. The confidence
level is often considered the probability that the calculated confidence interval estimate will
contain the true population parameter.
For example, when CL = 0.95, α = 0.05 and α/2 = 0.025; we write tα/2 = t0.025.
There are three ways to find the critical value for the t-distribution. All three ways need the
area in the tail (α/2) and the degrees of freedom (n – 1).
1.) For those who have a TI-83, 83+, or 84: to find tα/2, you must use the t-distribution chart
(located in the appendix) or Excel (T.inv).
2.) For those who have the newer TI 84 + calculators
Press 2nd Vars, choose #4 invT(area in the tail, df)
Example 9.11
Find the critical value for the t-distribution for 98% confidence level where n = 12.
Left: - tα/2 = -t0.01 = invT(.01, 11) = -2.718 Right: tα/2 = t0.01 = - invT(.01, 11) = 2.718
Recall that t-distribution is symmetric. Therefore, the left and right critical values are
symmetric opposites. This means that if you know the left critical value, then the right
critical value is the positive version of it.
Go to an empty cell and type an equal sign, then start typing t.inv. After you press Enter, the
answer will appear.
All three methods of finding the left and right critical values for the t-distribution need the
area in the tail (1% in this example) and the degrees of freedom (df = n – 1).
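As a fourth option outside the calculator, chart, and Excel methods described in this book, the critical value can be approximated with plain Python. The sketch below is an assumption-laden illustration: the names t_pdf, t_cdf, and inv_t are hypothetical, the area under the t density is approximated numerically with Simpson's rule, and the critical value is located by bisection. It takes the area to the left and the degrees of freedom, like invT.

```python
from math import gamma, pi, sqrt

def t_pdf(t, df):
    """Density of the Student's t-distribution with df degrees of freedom."""
    coeff = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """Area to the left of t: Simpson's rule on [0, |t|] plus symmetry about 0."""
    a = abs(t)
    h = a / steps                       # steps is even, as Simpson's rule requires
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                # area between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area

def inv_t(area_left, df):
    """Value of t with the given area to its left, found by bisection."""
    lo, hi = -200.0, 200.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < area_left:
            lo = mid                    # need a larger t to accumulate more area
        else:
            hi = mid
    return (lo + hi) / 2

left = inv_t(0.01, 11)                  # 98% confidence level, n = 12, df = 11
print(f"left = {left:.3f}, right = {-left:.3f}")
```

This numerical approach is meant for small to moderate degrees of freedom; for very large df the gamma function overflows, and the standard normal critical value is then a close substitute.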
Point estimate – margin of error < µ < point estimate + margin of error
(x̄ − E, x̄ + E)
where $E = t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$.
The interpretation should clearly state the confidence level (CL), explain what population
parameter is being estimated (here, a population mean), and state the confidence interval
(both endpoints). "We estimate with ___% confidence that the true population mean (include
the context of the problem) is between ___ and ___ (include appropriate units)."
Example 9.12
Suppose you do a study of acupuncture to determine how effective it is in relieving pain. You
measure sensory rates for 15 subjects with the results given. Use the sample data to construct
a 95% confidence interval for the mean sensory rate for the population (assumed normal)
from which you took the data.
8.6; 9.4; 7.9; 6.8; 8.3; 7.3; 9.2; 9.6; 8.7; 11.4; 10.3; 5.4; 8.1; 5.5; 6.9
Solution 9.12
• The first solution uses the TI-83+ and TI-84 calculators (Solution A).
To find the confidence interval, you need the sample mean, x̄, sample standard deviation, s,
and the margin of error, E.
$E = t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$
The confidence level is 95% (CL = 0.95)
$E = t_{\alpha/2} \cdot \frac{s}{\sqrt{n}} = 2.14 \cdot \frac{1.6722}{\sqrt{15}} = 0.924$
The 95% confidence interval for the population mean is (7.30, 9.15).
Interpretation: We estimate with 95% confidence that the true population mean sensory rate
is between 7.30 and 9.15.
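The hand calculation for this example can be reproduced with a short Python sketch. It is an illustration only: the list holds the 15 sensory rates from the example, and the critical value 2.14 is simply entered as the rounded value used in the solution (Python's standard library has no built-in t-distribution).

```python
from math import sqrt
from statistics import mean, stdev

# The 15 sensory rates listed in Example 9.12.
rates = [8.6, 9.4, 7.9, 6.8, 8.3, 7.3, 9.2, 9.6, 8.7, 11.4, 10.3, 5.4, 8.1, 5.5, 6.9]

t_crit = 2.14   # t_(0.025) with df = 14, as used in the solution

n = len(rates)
x_bar = mean(rates)
s = stdev(rates)                    # sample standard deviation (divides by n - 1)
margin = t_crit * s / sqrt(n)       # E = t_(alpha/2) * s / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin
print(f"x-bar = {x_bar:.4f}, s = {s:.4f}, E = {margin:.3f}")
print(f"interval = ({lower:.2f}, {upper:.2f})")
```

Note that statistics.stdev is the sample standard deviation; statistics.pstdev, which divides by n, would be the wrong choice here.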
9.7 You do a study of hypnotherapy to determine how effective it is in increasing the number
of hours of sleep subjects get each night. You measure hours of sleep for 12 subjects with the
following results. Construct a 95% confidence interval for the mean number of hours slept
for the population (assumed normal) from which you took the data.
8.2; 9.1; 7.7; 8.6; 6.9; 11.2; 10.1; 9.9; 8.9; 9.2; 7.5; 10.5
Example 9.13
Industrial chemicals may enter the body through pollution or as ingredients in consumer
products. In October 2008, the scientists at HTP (Human Toxome Project) tested cord blood
samples for 20 newborn infants in the United States. The cord blood of the "In utero/newborn"
group was tested for 430 industrial compounds, pollutants, and other chemicals, including
chemicals linked to brain and nervous system toxicity, immune system toxicity,
reproductive toxicity, and fertility problems. There are health concerns about the effects of
some chemicals on the brain and nervous system. The table below shows how many of the
targeted chemicals were found in each infant’s cord blood. Use this sample data to construct
a 90% confidence interval for the mean number of targeted industrial chemicals to be found
in an infant’s blood.
Solution 9.13 Solution A
We estimate with 90% confidence that the mean number of all targeted industrial chemicals
found in cord blood in the United States is between 117.412 and 137.488.
First, from the sample, you can calculate x̄ = 127.45 and s = 25.965 using 1-varStats. We are
also given n = 20, from which we can find the degrees of freedom, df = 20 – 1 = 19.
You are asked to calculate a 90% confidence interval, which gives us the following notation:
Second, we can find the critical value by using Calculator, Excel, or t-chart:
$E = t_{\alpha/2} \cdot \frac{s}{\sqrt{n}} = 1.729 \cdot \frac{25.965}{\sqrt{20}} = 10.038$
9.4 | Confidence Interval for a Single Population Proportion, p
During an election year, we see articles in the newspaper that state confidence intervals in
terms of proportions or percentages. For example, a poll for a particular candidate running
for president might show that the candidate has 40% of the vote within three percentage
points (if the sample is large enough). Often, election polls are calculated with 95%
confidence, so, the pollsters would be 95% confident that the true proportion of voters who
favored the candidate would be between 0.37 and 0.43:
How do you know you are dealing with a proportion problem? First, the underlying
distribution is a binomial distribution. (There is no mention of a mean or average.) If X is a
binomial random variable, then X ~ B(n, p) where n is the number of trials and p is the
probability of a success. To form a proportion, take X, the random variable for the number of
successes and divide it by n, the number of trials (or the sample size). The random variable
P̂ (read "P hat") is that proportion,
$\hat{P} = \frac{X}{n}$
When n is large and p is not close to zero or one, we can use the normal distribution to
approximate the binomial. When np < 10 or nq < 10, the normal approximation probabilities
for the binomial distribution are not close enough to the actual binomial probabilities.
Therefore, the approximation should not be used when either quantity is less than 10.
$\hat{P} \sim N\left(\frac{np}{n}, \frac{\sqrt{npq}}{n}\right)$ Recall that for the binomial distribution, $\mu = np$ and $\sigma = \sqrt{npq}$.
$\hat{P} \sim N\left(p, \sqrt{\frac{pq}{n}}\right)$ is the reduced version.
Point estimate – margin of error < p < point estimate + margin of error
where $E = z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}}$ and $\hat{q} = 1 - \hat{p}$
NOTE: the confidence interval can be used only if the number of successes, np̂, and the number of
failures, nq̂, are both greater than ten.
The steps to construct and interpret the confidence interval are:
• Calculate the sample proportion, $\hat{p} = \frac{x}{n}$, where x is the frequency, from the sample
data.
• Determine if the number of successes, np̂, and the number of failures, nq̂, are both greater
than ten.
• Find the critical value, Zα/2 that corresponds to the confidence level.
• Calculate the margin of error, E.
• Construct the confidence interval.
• Write a sentence that interprets the estimate in the context of the situation in the
problem. (Explain what the confidence interval means, in the words of the problem.)
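The calculation steps above can be sketched as a Python function. This is a minimal illustration, not the calculator procedure used in this book: the name proportion_interval is hypothetical, and the demonstration uses the Gallup counts from Example 9.4 (96 out of 1,200 at a 98% confidence level), for which the text itself only asks for the point estimate.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(x, n, confidence_level):
    """Confidence interval for a population proportion from x successes in n trials."""
    p_hat = x / n
    q_hat = 1 - p_hat
    # Condition from the text: successes and failures must both exceed ten.
    if n * p_hat <= 10 or n * q_hat <= 10:
        raise ValueError("need more than ten successes and more than ten failures")
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)   # critical value
    margin = z * sqrt(p_hat * q_hat / n)                       # E = z * sqrt(p-hat * q-hat / n)
    return p_hat - margin, p_hat + margin, margin

lower, upper, margin = proportion_interval(x=96, n=1200, confidence_level=0.98)
print(f"p-hat = {96 / 1200:.2f}, E = {margin:.4f}, interval = ({lower:.4f}, {upper:.4f})")
```

Raising an error when the success/failure condition fails is one possible design; a gentler version could return the interval together with a warning.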
Example 9.14
Solution 9.14
• The first solution uses a function of the TI-83, 83+ or 84 calculators (Solution A).
NOTE: x represents the frequency (the number of successes in the sample), x = p̂·n, rounded to a
whole number.
Solution 9.14 Solution B (step by step)
To calculate the confidence interval, you must first find p̂ (point estimate), which is given in
the problem.
Second, determine if np̂ and nq̂ are both greater than ten.
Third, find the margin of error, E, where the confidence level is 96% (CL = 0.96).
$E = z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}} = 2.05 \sqrt{\frac{0.38 \cdot 0.62}{4004}} = 0.016$
NOTE: the units of E are the same units as the point estimate; therefore in this section E is
a percentage. E = 1.6%
Interpretation:
We estimate with 96% confidence that between 36.4% and 39.6% of all Verdict global readers
have health concerns about the Covid-19 pandemic.
Ninety-six percent of the confidence intervals constructed in this way would contain the true
value for the population proportion of all Verdict global readers who have health concerns about
the Covid-19 pandemic.
9.8 Suppose 250 randomly selected people are surveyed to determine if they own a tablet. Of
the 250 surveyed, 98 reported owning a tablet. Using a 95% confidence level, compute a
confidence interval estimate for the true proportion of people who own tablets.
Example 9.15
For a class project, a political science student at a large university wants to estimate the
percent of students who are registered voters. He surveys 500 students and finds that 59.7%
are registered voters. Compute a 90% confidence interval for the true percent of students who
are registered voters, and interpret the confidence interval.
Solution 9.15
• The first solution uses a function of the TI-83, 83+ or 84 calculators (Solution A).
NOTE: x on the calculator must be in whole number form; therefore, the calculator
solution will be slightly off from Solution B.
Solution 9.15 Solution B
$E = z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}} = 1.645 \sqrt{\frac{0.597 \cdot 0.403}{500}} = 0.036$
Interpretation:
• We estimate with 90% confidence that the true percent of all students that are registered
voters is between 56.2% and 63.4%.
• Alternate Wording: We estimate with 90% confidence that between 56.2% and 63.4% of
ALL students are registered voters.
Ninety percent of all confidence intervals constructed in this way contain the true value for
the population percent of students that are registered voters.
9.9 A student polls her school to see if students in the school district are for or against the
new legislation regarding school uniforms. She surveys 600 students and finds that 79.8%
are against the new legislation. Compute a 90% confidence interval for the true percent of
students who are against the new legislation, and interpret the confidence interval.
“Plus Four” Confidence Interval for p
There is a certain amount of error introduced into the process of calculating a confidence
interval for a proportion. Because we do not know the true proportion for the population, we
are forced to use point estimates to calculate the appropriate standard deviation of the
sampling distribution. Studies have shown that the resulting estimation of the standard
deviation can be flawed.
Fortunately, there is a simple adjustment that allows us to produce more accurate confidence
intervals. We simply pretend that we have four additional observations. Two of these
observations are successes and two are failures. The new sample size, then, is n + 4, and the
new count of successes is x + 2.
Computer studies have demonstrated the effectiveness of this method. It should be used
when the confidence level desired is at least 90% and the sample size is at least ten.
Example 9.16
A random sample of 25 statistics students was asked: “Have you smoked a cigarette in the
past week?” Ten students reported smoking within the past week. Use the “plus-four”
method to find a 95% confidence interval for the true proportion of statistics students who
smoke.
Solution 9.16
Ten students out of 25 reported smoking within the past week, so x = 10 and n = 25. Because
we are using the “plus-four” method, we will use x = 10 + 2 = 12 and n = 25 + 4 = 29.
We are 95% confident that the true proportion of all statistics students who smoke cigarettes
is between 23.5% and 59.3%.
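The plus-four adjustment in Solution 9.16 can be sketched in a few lines. The function name plus_four_ci is an illustrative assumption; the only change from the usual proportion interval is that x + 2 and n + 4 replace x and n.

```python
from math import sqrt
from statistics import NormalDist


def plus_four_ci(x, n, confidence_level):
    """Plus-four interval: pretend there are two extra successes and two extra failures.

    plus_four_ci is a hypothetical helper name used only for this sketch.
    """
    x_adj = x + 2                # adjusted count of successes
    n_adj = n + 4                # adjusted sample size
    p_adj = x_adj / n_adj        # adjusted point estimate
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    e = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - e, p_adj + e


# Values from Example 9.16: x = 10 smokers out of n = 25, 95% confidence
lower, upper = plus_four_ci(10, 25, 0.95)
print(f"({lower:.3f}, {upper:.3f})")
# prints: (0.235, 0.593)
```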
9.10 Out of a random sample of 65 freshmen at State University, 31 students have declared
a major. Use the “plus-four” method to find a 96% confidence interval for the true proportion
of freshmen at State University who have declared a major.
Calculating the Sample Size n
If researchers desire a specific margin of error, then they can use the margin of error formula
to calculate the required sample size.
The margin of error formula for a population proportion is
$$E = z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} \quad \text{where } \hat{q} = 1 - \hat{p}$$
Solving for n gives
$$n = \hat{p}(1-\hat{p})\left(\frac{z_{\alpha/2}}{E}\right)^2$$
Recall a clue word for E is “within”. NOTE: if the estimated sample proportion is not given,
then use p̂ =.5
Example 9.17
Suppose a mobile phone company wants to determine the current percentage of customers
aged 50+ who use text messaging on their cell phones. How many customers aged 50+ should
the company survey in order to be 90% confident that the estimated (sample) proportion is
within three percentage points of the true population proportion of customers aged 50+ who
use text messaging on their cell phones?
Solution 9.17
$$n = \hat{p}(1-\hat{p})\left(\frac{z_{\alpha/2}}{E}\right)^2 = 0.5(0.5)\left(\frac{1.645}{0.03}\right)^2$$
n = 751.7
Round the answer to the next higher value. The sample size should be 752 cell phone
customers aged 50+ in order to be 90% confident that the estimated (sample) proportion is
within three percentage points of the true population proportion of all customers aged 50+
who use text messaging on their cell phones.
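As a rough check of Solution 9.17, the sample size formula can be coded directly. The function name sample_size_for_proportion is hypothetical, and the default p̂ = 0.5 follows the NOTE above. Because the script uses an unrounded critical value instead of 1.645, the value before rounding comes out slightly below the text's 751.7, but the required sample size is the same after rounding up.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(margin_of_error, confidence_level, p_hat=0.5):
    """Return (unrounded n, n rounded up) for estimating a proportion.

    Hypothetical helper; p_hat defaults to 0.5 when no estimate is given.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    n_exact = p_hat * (1 - p_hat) * (z / margin_of_error) ** 2
    # Always round up so the margin of error is no larger than requested.
    return n_exact, ceil(n_exact)


# Example 9.17: within three percentage points with 90% confidence
n_exact, n_required = sample_size_for_proportion(0.03, 0.90)
print(n_required)
# prints: 752
```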
9.11 Suppose an internet marketing company wants to determine the current percentage of
customers who click on ads on their smartphones. How many customers should the company
survey in order to be 90% confident that the estimated proportion is within five percentage
points of the true population proportion of customers who click on ads on their smartphones?
9.5 | Confidence Interval for Standard Deviation, σ (Optional)
This section is repeated in chapter 12 (chi-square distribution) with more examples.
Let’s conduct the following statistical experiment. We select samples of size n from a normal
population, which has a standard deviation of σ. We find that the standard deviation in our
sample is equal to s. Given these data, we can define a statistic, called chi-square, using the
following formula:
$$\chi^2 = \frac{(n-1)s^2}{\sigma^2}$$
[Figure: chi-square distribution curves for df = 2, df = 6, and df = 11]
For the χ² distribution, the population mean is μ = df (degrees of freedom) and the population
standard deviation is σ = √(2(df)). The random variable is shown as χ², but may also be any
upper-case letter.
Recall that critical values are values from the distribution that separate the confidence area
and the non-confidence area. We found Zα/2 by using ±invnorm(α/2) and tα/2 by using the
t-distribution chart or ±invT(α/2, df). Since the Z-distribution and the t-distribution are
symmetrical, the left and right critical values are opposite values. However, in the chi-square
distribution, values are only positive, so the left and right critical values are not opposites.
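The symmetry of the Z critical values can be seen in a small sketch that mirrors the invnorm(α/2) step. It uses Python's standard library in place of the calculator, and the function name z_critical_values is only illustrative.

```python
from statistics import NormalDist


def z_critical_values(confidence_level):
    """Return (left, right) critical values of the standard normal distribution."""
    alpha = 1 - confidence_level
    left = NormalDist().inv_cdf(alpha / 2)  # plays the role of invnorm(alpha/2)
    right = -left                           # symmetry: the right value is the opposite
    return left, right


left, right = z_critical_values(0.95)
print(f"{left:.3f}, {right:.3f}")
# prints: -1.960, 1.960
```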
The critical values for the χ² distribution are recorded in two tables (left critical value table
and right critical value table). To find these values, you need the area in the tail and the
degrees of freedom.
Example 9.18
Find the left and right critical values for a 95% confidence level for the chi-square
distribution with 11 degrees of freedom.
Solution 9.18
Below you can see that the chi-square distribution table is split into two tables. The first
table gives the left critical values, $\chi^2_L$. The second table gives the right critical values,
$\chi^2_R$. The tail area is 0.025 and df = 11, so $\chi^2_L$ = 3.8157 using the table for the left
critical values and $\chi^2_R$ = 21.92 using the table for the right critical values. The two
complete tables are found in the appendix.
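For readers who want to reproduce the table lookup, the sketch below computes chi-square critical values using only Python's standard library. It is not the method used in the text (which reads the values from tables): the cumulative area is evaluated with a power series, and the critical value is located by bisection. All function names are hypothetical. In practice, a statistics library would normally provide the quantile function directly.

```python
import math


def chi2_cdf(x, df):
    """Area to the left of x under the chi-square curve with df degrees of freedom."""
    if x <= 0:
        return 0.0
    a, z = df / 2, x / 2
    # Power series for the regularized lower incomplete gamma function P(a, z).
    term = 1 / a
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def chi2_quantile(p, df):
    """Value with area p to its left, found by bisection on chi2_cdf."""
    lo, hi = 0.0, df + 20 * math.sqrt(2 * df) + 20  # generous upper bracket
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def chi2_critical_values(confidence_level, df):
    """Return (left, right) critical values; each tail has area alpha/2."""
    tail = (1 - confidence_level) / 2
    return chi2_quantile(tail, df), chi2_quantile(1 - tail, df)


# Example 9.18: 95% confidence level, df = 11
left, right = chi2_critical_values(0.95, 11)
print(f"left = {left:.4f}, right = {right:.2f}")
```

The two values agree with the table entries 3.8157 and 21.92 to the precision shown in the tables.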
9.12 Find the left and right critical values for the chi-square distribution for C.L. = .98 and
df = 6
Recall that critical values from the Z-distribution (standard normal) and the t-distribution
(approximately normal) are symmetric opposites. The critical values from the chi-square
distribution are both positive. Also, the t-distribution and the chi-square distribution both
depend on degrees of freedom (df).
To use Excel functions, the first thing to type is an equal sign. Here are the Excel functions
for each critical value.
In Example 9.18, C.L. = .95 and df = 11. If the confidence level is 95%, that leaves 5% in
the two tails of the distribution combined, so α = 0.05. Split the 5% evenly, which gives us
2.5% = 0.025 as α/2.
Note: The chi-square distribution and the t-distribution depend on degrees of freedom, which
is why df is included in the input of the Excel functions.
Confidence Interval for Variance/Standard Deviation
Here we present intervals for estimating σ² and σ between a lower and upper bound.
Unlike other confidence intervals we have seen, these bounds can’t be found using the
calculator. They depend on the chi-square critical values.
$$\frac{(n-1)s^2}{\chi^2_R} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_L}$$
where $\chi^2_L$ and $\chi^2_R$ are critical values with d.f. = n – 1, and s² is the point estimate.
NOTE: $\chi^2_L$ and $\chi^2_R$ are not to be squared. The right critical value is on the left side and
the left critical value is on the right side. Also notice the numerators are the same. When you
divide the same value by a larger number, you get a smaller value.
Taking square roots gives the interval for σ:
$$\sqrt{\frac{(n-1)s^2}{\chi^2_R}} < \sigma < \sqrt{\frac{(n-1)s^2}{\chi^2_L}}$$
The sample variance, s², can be found using the calculator or an Excel function. Using the
TI calculator, s² can be found by:
Using Excel:
Example 9.19
The weights (in pounds) of 15 dogs selected randomly from those adopted out by an animal
shelter last week are shown in the list below. Construct a 98% confidence interval for the
population variance.
25, 34, 29, 30, 31, 28, 27, 28, 33, 31, 28, 29, 32, 29, 29
Solution 9.19
First we need to find the sample variance (chapter 3) of the data set. Recall that to find it by
calculator, you enter the data into L1 (Stat, Edit), then use 2nd Stat, Math, #8 Variance (L1):
s² = 5.552
Second, we find the critical values using the chi-square distribution chart.
With C.L. = 0.98, the area in each tail is α/2 = 0.01, and df = n – 1 = 14, so
$\chi^2_L$ = 4.660 and $\chi^2_R$ = 29.141.
Third, we plug the values into the formula.
$$\frac{(n-1)s^2}{\chi^2_R} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_L}$$
$$\frac{14(5.552)}{29.141} < \sigma^2 < \frac{14(5.552)}{4.660}$$
$$2.67 < \sigma^2 < 16.68$$
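The three steps of Solution 9.19 can also be carried out in a short script. This is a sketch under two assumptions: the sample variance is computed with Python's statistics.variance (which divides by n – 1), and the critical values 4.660 and 29.141 are typed in from the chi-square table for df = 14 and C.L. = 0.98, not computed.

```python
from statistics import variance

# Dog weights (in pounds) from Example 9.19
weights = [25, 34, 29, 30, 31, 28, 27, 28, 33, 31, 28, 29, 32, 29, 29]

n = len(weights)
s2 = variance(weights)  # sample variance, denominator n - 1

# Critical values copied from the chi-square table (df = 14, alpha/2 = 0.01)
chi2_left, chi2_right = 4.660, 29.141

# The right critical value gives the lower bound; the left gives the upper bound.
lower = (n - 1) * s2 / chi2_right
upper = (n - 1) * s2 / chi2_left

print(f"s^2 = {s2:.3f}")
print(f"{lower:.2f} < sigma^2 < {upper:.2f}")
# prints: s^2 = 5.552
#         2.67 < sigma^2 < 16.68
```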
Example 9.20
A random sample of 18 men has a mean height of 67.5 inches and a standard deviation of
1.8 inches. Construct a 99% confidence interval for the population standard deviation.
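$$\sqrt{\frac{(n-1)s^2}{\chi^2_R}} < \sigma < \sqrt{\frac{(n-1)s^2}{\chi^2_L}} \;\Rightarrow\; \sqrt{\frac{17(3.24)}{35.718}} < \sigma < \sqrt{\frac{17(3.24)}{5.697}}$$
$$\sqrt{\frac{(n-1)s^2}{\chi^2_R}} < \sigma < \sqrt{\frac{(n-1)s^2}{\chi^2_L}} \;\Rightarrow\; 1.242 < \sigma < 3.109$$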
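When only summary statistics are given, as in Example 9.20, the interval for σ follows from the sample standard deviation and the two table values. The helper name sd_confidence_interval is made up for this sketch, and the critical values 5.697 and 35.718 are the ones used in the example (df = 17, 99% confidence).

```python
from math import sqrt


def sd_confidence_interval(n, s, chi2_left, chi2_right):
    """Return (lower, upper) bounds for the population standard deviation.

    Hypothetical helper; chi2_left and chi2_right are table values for df = n - 1.
    """
    lower = sqrt((n - 1) * s ** 2 / chi2_right)
    upper = sqrt((n - 1) * s ** 2 / chi2_left)
    return lower, upper


# Example 9.20: n = 18, s = 1.8 inches, so s^2 = 3.24
lower, upper = sd_confidence_interval(18, 1.8, 5.697, 35.718)
print(f"{lower:.3f} < sigma < {upper:.3f}")
# prints: 1.242 < sigma < 3.109
```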
KEY TERMS
Confidence Interval (CI) an interval estimate for an unknown population parameter. This
depends on:
Confidence Level (CL) the percent expression for the probability that the confidence
interval contains the true population parameter; for example, if the CL = 90%, then in 90 out
of 100 samples the interval estimate will enclose the true population parameter.
Critical values are the values (±zα/2, ±tα/2, $\chi^2_L$, $\chi^2_R$) from a distribution that
separate the confidence area from the non-confidence area.
Degrees of Freedom (df) the number of objects in a sample that are free to vary
Margin of error (E) depends on the confidence level, sample size, and known or estimated
population standard deviation.
Inferential Statistics also called statistical inference or inductive statistics; this facet of
statistics deals with estimating a population parameter based on a sample statistic. For
example, if four out of the 100 calculators sampled are defective we might infer that four
percent of the production is defective.
Point Estimate a single number computed from a sample and used to estimate a population
parameter
Standard Deviation a number that is equal to the square root of the variance and measures
how far data values are from their mean; notation: s for sample standard deviation and σ for
population standard deviation
• There is a "family" of t-distributions: each representative of the family is completely
defined by the number of degrees of freedom, which is one less than the number of
data values.
Formula Review
Critical Values                              Calculator Function      Excel Function
Left Z-distribution critical value, -Zc      Invnorm(α/2, 0, 1)       =norm.s.inv(α/2)
Right Z-distribution critical value, Zc      -Invnorm(α/2, 0, 1)      =-norm.s.inv(α/2)
Left t-distribution critical value, -tc      InvT(α/2, df)            =t.inv(α/2, df)
Right t-distribution critical value, tc      -InvT(α/2, df)           =-t.inv(α/2, df)
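$$(\bar{x} - E,\ \bar{x} + E) \quad \text{where } E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
$$(\bar{x} - E,\ \bar{x} + E) \quad \text{where } E = t_{\alpha/2}\frac{s}{\sqrt{n}}$$
$$(\hat{p} - E,\ \hat{p} + E) \quad \text{where } E = z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} \text{ and } \hat{q} = 1 - \hat{p}$$
NOTE: the confidence interval can be used only if the number of successes np̂ and the number
of failures nq̂ are both greater than ten.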
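Calculating the Sample Size for a population mean
$$n = \left(\frac{z_{\alpha/2} \cdot \sigma}{E}\right)^2$$
Calculating the Sample Size for a population proportion
$$n = \hat{p}(1-\hat{p})\left(\frac{z_{\alpha/2}}{E}\right)^2$$
NOTE: if the estimated sample proportion is not given, then use p̂ = 0.5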
Confidence Interval for Variance: $\frac{(n-1)s^2}{\chi^2_R} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_L}$
Confidence Interval for Standard Deviation: $\sqrt{\frac{(n-1)s^2}{\chi^2_R}} < \sigma < \sqrt{\frac{(n-1)s^2}{\chi^2_L}}$
EXERCISES FOR CHAPTER 9
1. Find the critical value, Zα/2 , for the following different confidence levels
a. 92%
b. 94%
c. 97%
2. A random sample of 49 students has a grade point average with a population standard
deviation of 0.78. Find the margin of error if the confidence level is 98%.
3. A random sample of 19 students has a grade point average with a standard deviation of
0.78. Find the margin of error if the confidence level is 98%.
5. The U.S. Census Bureau conducts a study to determine the time needed to complete the
short form. The Bureau surveys 200 people. The sample mean is 8.2 minutes. The population
standard deviation is 2.2 minutes. The population distribution is assumed to be normal.
iii. Calculate the margin of error.
c. If the Census wants to increase its level of confidence and keep the margin of error
the same by taking another survey, what changes should it make?
d. If the Census did another survey, kept the margin of error the same, and surveyed
only 50 people instead of 200, what would happen to the level of confidence? Why?
e. Suppose the Census needed to be 98% confident of the population mean length of time.
Would the Census have to survey more people? Why or why not?
6. A sample of 20 heads of lettuce was selected. Assume that the population distribution of
head weight is normal. The weight of each head of lettuce was then recorded. The mean
weight was 2.2 pounds with a standard deviation of 0.1 pounds. The population standard
deviation is known to be 0.2 pounds.
7. The mean age for all Foothill College students for a recent Fall term was 33.2. The
population standard deviation has been pretty consistent at 15. Suppose that twenty-five
Winter students were randomly selected. The mean age for the sample was 30.4. We are
interested in the true mean age for Winter Foothill College students. Let X = the age of a
Winter Foothill College student.
c. What is x̄ estimating?
d. How much area is in both tails (combined)? α
e. Identify the margin of error.
f. In one complete sentence, explain what the interval means.
g. Using the same mean, standard deviation, and level of confidence, suppose that n were
69 instead of 25. Would the margin of error become larger or smaller? How do you
know?
h. Using the same mean, standard deviation, and sample size, how would the margin of
error change if the confidence level were reduced to 90%? Why?
i. Fill in the blanks on the graph with the area, upper and lower bounds of the confidence
interval, and the sample mean
8. Find the critical value, tα/2 , for the following different confidence levels
a. 90% when n = 25
b. 95% when n = 10
c. 98% when n = 36
9. A hospital is trying to cut down on emergency room wait times. It is interested in the
amount of time patients must wait before being called back to be examined. An investigation
committee randomly surveyed 70 patients. The sample mean was 1.5 hours with a sample
standard deviation of 0.5 hours.
a. Construct a 95% confidence interval for the population mean time spent waiting.
i. State the confidence interval,
ii. Find the critical values and sketch the graph,
iii. Calculate the margin of error.
b. Explain in complete sentences what the confidence interval means.
10. One hundred eight Americans were surveyed to determine the number of hours they
spend watching television each month. It was revealed that they watched an average of 151
hours each month with a standard deviation of 32 hours. Assume that the underlying
population distribution is normal.
a. Define the random variable X in words.
b. Define the random variable X̄ in words.
c. Which distribution should you use for this problem? X ∼ ______________
d. Construct a 99% confidence interval for the population mean hours spent watching
television per month.
i. State the confidence interval,
ii. Find the critical value
iii. Calculate the margin of error.
e. Why would the margin of error change if the confidence level were lowered to 95%?
11. The data in the following table are the result of a random survey of 39 national flags (with
replacement between picks) from various countries. We are interested in finding a confidence
interval for the true mean number of colors on a national flag. Let X = the number of colors
on a national flag.
X Freq
1 1
2 7
3 18
4 7
5 6
e. Fill in the blanks on the graph with the areas, the upper and lower limits of the
Confidence Interval and the sample mean.
f. In one complete sentence, explain what the interval means.
g. Using the same x̄, sx, and level of confidence, suppose that n were 69 instead of 39.
Would the margin of error become larger or smaller? How do you know?
h. Using the same x̄, sx, and n = 39, how would the margin of error change if the
confidence level were reduced to 90%? Why?
12. Marketing companies are interested in knowing the population percent of women who
make the majority of household purchasing decisions.
13. Suppose the marketing company did do a survey. They randomly surveyed 200
households and found that in 120 of them, the woman made the majority of the purchasing
decisions. We are interested in the population proportion of households where women make
the majority of the purchasing decisions.
14. Of 1,050 randomly selected adults, 360 identified themselves as manual laborers, 280
identified themselves as non-manual wage earners, 250 identified themselves as mid-level
managers, and 160 identified themselves as executives. In the survey, 82% of manual
laborers preferred trucks, 62% of non-manual wage earners preferred trucks, 54% of mid-level
managers preferred trucks, and 26% of executives preferred trucks.
a. Construct a 95% confidence interval for the percent of executives who prefer trucks.
b. Calculate the margin of error.
c. Suppose we want to lower the sampling error. What is one way to accomplish that?
d. The sampling error given in the survey is ±2%. Explain what the ±2% means.
15. A poll of 1,200 voters asked what the most significant issue was in the upcoming election.
Sixty-five percent answered the economy. We are interested in the population proportion of
voters who feel the economy is the most important.
a. Find a 90% confidence interval, and state the confidence interval and the margin of
error.
b. What would happen to the confidence interval if the level of confidence were 95%?
16. According to a news release from the Bureau of Labor Statistics (US Department of Labor)
for June 2021, a household survey states that 8.6% of employed persons who identify as
Hispanic/Latino teleworked because of the pandemic. Let’s assume that the
number surveyed is 3400. Calculate the following:
i. x=
ii. n=
iii. p̂ =
17. Among various college baseball pitchers, the standard deviation of heights is known to be
approximately three inches. We wish to construct a 95% confidence interval for the mean
height of college baseball pitchers. Forty-eight baseball pitchers are surveyed. The sample
mean is 72.5 inches. The sample standard deviation is 2.8 inches.
a. Construct a 95% confidence interval for the population mean height of baseball
pitchers.
b. Find the critical values associated with this problem.
c. What will happen to the level of confidence obtained if 200 baseball pitchers were
surveyed?
18. Announcements for 84 upcoming engineering conferences were randomly picked from a
stack of IEEE Spectrum magazines. The mean length of the conferences was 3.94 days, with
a standard deviation of 1.28 days. Assume the underlying population is normal. Construct a
95% confidence interval for the population mean length of engineering conferences.
19. Suppose that an accounting firm does a study to determine the time needed to complete
one person’s tax forms. It randomly surveys 100 people. The sample mean is 23.6 hours.
There is a known standard deviation of 7.0 hours. The population distribution is assumed to
be normal.
a. Which distribution should you use for this problem? Explain your choice.
b. Construct a 90% confidence interval for the population mean time to complete the tax
forms.
c. Find the critical values.
d. Calculate the margin of error.
e. If the firm wished to increase its level of confidence and keep the margin of error the
same by taking another survey, what changes should it make?
f. If the firm did another survey, kept the margin of error the same, and only surveyed
49 people, what would happen to the level of confidence? Why?
g. Suppose that the firm decided that it needed to be at least 96% confident of the
population mean length of time to within one hour. How would the number of people
the firm surveys change? Why?
20. A sample of 16 small bags of the same brand of candies was selected. Assume that the
population distribution of bag weights is normal. The weight of each bag was then recorded.
The mean weight was two ounces with a standard deviation of 0.12 ounces. The population
standard deviation is known to be 0.1 ounce.
a. Which distribution should you use for this problem? Explain your choice.
b. Construct a 90% confidence interval for the population mean weight of the candies.
c. Calculate the margin of error.
d. Construct a 98% confidence interval for the population mean weight of the candies.
e. In complete sentences, explain why the confidence interval in part b is smaller than
the confidence interval in part d.
f. In complete sentences, give an interpretation of what the interval in part d means.
21. A new coffee house owner is interested in the mean number of disposable cups used in a
month to budget total cost for the year. The owner took a look at 20 different similar
coffeehouses. The mean from the sample is 802 disposable cups used with a sample standard
deviation of 24.8.
a. Construct a 90% confidence interval for the population mean number of disposable
cups used in a month by a coffeehouse.
b. What will happen to the margin of error and confidence interval if 100 coffee houses
were looked at? Why?
22. MULTIPLE CHOICE: What is meant by the term “90% confident” when constructing a
confidence interval for a mean?
a. If we took repeated samples, approximately 90% of the samples would produce the
same confidence interval.
b. If we took repeated samples, approximately 90% of the confidence intervals calculated
from those samples would contain the sample mean.
c. If we took repeated samples, approximately 90% of the confidence intervals calculated
from those samples would contain the true value of the population mean.
d. If we took repeated samples, the sample mean would equal the population mean in
approximately 90% of the samples.
23. The Federal Election Commission collects information about campaign contributions and
disbursements for candidates and political committees each election cycle. During the 2012
campaign season, there were 1,619 candidates for the House of Representatives across the
United States who received contributions from individuals. The table below shows the total
receipts from individuals for a random selection of 40 House candidates rounded to the
nearest $100. The standard deviation for this data to the nearest hundred is σ = $909,200.
24. The American Community Survey (ACS), part of the United States Census Bureau,
conducts a yearly census similar to the one taken every ten years, but with a smaller
percentage of participants. The most recent survey estimates with 90% confidence that the
mean household income in the U.S. falls between $69,720 and $69,922. Find the point
estimate for mean U.S. household income and the margin of error for mean U.S. household
income.
25. The ASA (average speed of answer) is 82 seconds for a month at a call center that helps
employees with computer issues. Let’s say the standard deviation for all calls is 22 seconds.
The manager wants to estimate the mean speed of answer within 5 seconds with 93%
confidence to report to the vice president. How many calls must the manager measure?
26. In six packages of “The Flintstones® Real Fruit Snacks” there were five Bam-Bam snack
pieces. The total number of snack pieces in the six bags was 68. We wish to calculate a 96%
confidence interval for the population proportion of Bam-Bam snack pieces.
27. A random survey of enrollment at 35 community colleges across the United States yielded
the following figures:
6,414; 1,550; 2,109; 9,350; 21,828; 4,300; 5,944; 5,722; 2,825; 2,044; 5,481; 5,200; 5,853; 2,750;
10,012; 6,357; 27,000; 9,414; 7,681; 3,200; 17,500; 9,200; 7,380; 18,314; 6,557; 13,713; 17,768;
7,493; 2,771; 2,861; 1,263; 7,285; 28,165; 5,080; 11,622. Assume the underlying population is
normal.
28. Suppose that a committee is studying whether or not there is waste of time in our judicial
system. It is interested in the mean amount of time individuals waste at the courthouse
waiting to be called for jury duty. The committee randomly surveyed 81 people who recently
served as jurors. The sample mean wait time was eight hours with a sample standard
deviation of four hours.
a. Construct a 95% confidence interval for the population mean time wasted.
b. Find the critical values.
c. Explain in a complete sentence what the confidence interval means.
29. A pharmaceutical company makes tranquilizers. It is assumed that the distribution for
the length of time they last is approximately normal. Researchers in a hospital used the drug
on a random sample of nine patients. The effective period of the tranquilizer for each patient
(in hours) was as follows: 2.7; 2.8; 3.0; 2.3; 2.3; 2.2; 2.8; 2.1; and 2.4.
a. Which distribution should you use for this problem? Explain your choice.
b. Construct a 95% confidence interval for the population mean length of time.
c. Find the critical values.
d. What does it mean to be “95% confident” in this problem?
30. Suppose that 14 employees, who were trained to assemble electric switch boards, were
surveyed to determine how long they had to use training modules. It was revealed that they
used them an average of 5.2 months with a sample standard deviation of 2.4 months. Assume
that the underlying population distribution is normal.
a. Construct a 99% confidence interval for the population mean length of time using
training modules.
b. Calculate the margin of error.
c. Why would the margin of error change if the confidence level were lowered to 90%?
31. The Teacher’s Retirement System of the State of Illinois (TRS) collects information about
annual pensions of public employees. A public relations associate is writing an article
responding to critics that teachers have high pensions. This associate wants to state the
average annual pension amount.
The following table shows the annual pensions of Illinois retired teachers for a random
selection of 30 teachers:
a. Find x̄ = $___________
b. Find sample standard deviation.
c. Use this sample data to construct a 96% confidence interval for the mean amount of
annual pension for retired teachers in Illinois. Use the Student's t-distribution.
32. Forbes magazine published data on the best small firms in 2012. These were firms that had been
publicly traded for at least a year, had a stock price of at least $5 per share, and had reported annual
revenue between $5 million and $1 billion. The table shows the ages of the corporate CEOs for a random
sample of these firms. Use this sample data to construct a 90% confidence interval for the mean age of
CEOs for these top small firms.
48 58 51 61 56 59
74 63 53 50 59 60
60 57 46 55 63 57
47 55 57 43 61 62
49 67 67 55 55 49
33. Unoccupied seats on flights cause airlines to lose revenue. Suppose a large airline wants
to estimate its mean number of unoccupied seats per flight over the past year. To accomplish
this, the records of 225 flights are randomly selected and the number of unoccupied seats is
noted for each of the sampled flights. The sample mean is 11.6 seats and the sample standard
deviation is 4.1 seats.
34. In a recent sample of 84 used car sales costs, the sample mean was $6,425 with a standard
deviation of $3,156. Assume the underlying distribution is approximately normal.
a. Construct a 95% confidence interval for the population mean cost of a used car.
b. Calculate the margin of error.
c. Explain what a “95% confidence interval” means for this study.
35. Six different national brands of chocolate chip cookies were randomly selected at the
supermarket. The grams of fat per serving are as follows: 8; 8; 10; 7; 9; 9. Assume the
underlying distribution is approximately normal.
a. Construct a 90% confidence interval for the population mean grams of fat per serving
of chocolate chip cookies sold in supermarkets.
b. Find the critical value.
c. Calculate the margin of error.
d. If you wanted a smaller margin of error while keeping the same level of confidence,
what should have been changed in the study before it was done?
36. A survey of the mean number of cents off that coupons give was conducted by randomly
surveying one coupon per page from the coupon sections of a recent San Jose Mercury News.
The following data were collected: 20¢; 75¢; 50¢; 65¢; 30¢; 55¢; 40¢; 40¢; 30¢; 55¢; $1.50; 40¢;
65¢; 40¢. Assume the underlying distribution is approximately normal.
a. Find
i. x̄ =
ii. s =
iii. n=
b. Construct a 95% confidence interval for the population mean worth of coupons.
c. Find the critical values
d. Calculate the margin of error.
e. If many random samples were taken of size 14, what percent of the confidence
intervals constructed should contain the population mean worth of coupons? Explain
why.
37. A quality control specialist for a restaurant chain takes a random sample of size 12 to
check the amount of soda served in the 16 oz. serving size. The sample mean is 13.30 with a
sample standard deviation of 1.55. Assume the underlying population is normally
distributed.
a. Find the 95% Confidence Interval for the true population mean for the amount of soda
served.
b. What is the margin of error?
38. Insurance companies are interested in knowing the population percent of drivers who
always buckle up before riding in a car.
39. Suppose that the insurance companies did do a survey. They randomly surveyed 400
drivers and found that 320 claimed they always buckle up. We are interested in the
population proportion of drivers who claim they always buckle up.
a. Identify
i. x=
ii. n=
iii. p̂ =
b. Construct a 95% confidence interval for the population proportion who claim they
always buckle up.
c. Calculate the margin of error.
d. If this survey were done by telephone, list three difficulties the companies might have
in obtaining random results.
40. According to a recent survey of 1,200 people, 61% feel that the president is doing an
acceptable job. We are interested in the population proportion of people who feel the president
is doing an acceptable job.
41. An article regarding racial gaps in advanced math courses focusing on 8th grade was
published in Sage Journals. It was reported that 67% of Asian students (n = 80), 45% of white
students (n = 180), 38% of Latinx students (n = 130), and less than 20% of black students
take algebra I by 8th grade (n = 90). These gaps widen as students advance through high
school. Let’s assume n = 500 eighth graders. Construct a 90% confidence interval for the true
proportion of Latinx students who took algebra I by 8th grade.
43. Stanford University conducted a study of whether running is healthy for men and women
over age 50. During the first eight years of the study, 1.5% of the 451 members of the 50-Plus
Fitness Association died. We are interested in the proportion of people over 50 who ran and
died in the same eight-year period.
44. A telephone poll of 1,000 adult Americans was reported in an issue of Time Magazine.
One of the questions asked was “What is the main problem facing the country?” Twenty
percent answered “crime.” We are interested in the population proportion of adult Americans
who feel that crime is the main problem.
f. The sampling error given by Yankelovich Partners, Inc. (which conducted the poll) is
±3%. In one to three complete sentences, explain what the ±3% represents.
45. Refer to Exercise 44. Another question in the poll was “[How much are] you worried about
the quality of education in our schools?” Sixty-three percent responded “a lot”. We are
interested in the population proportion of adult Americans who are worried a lot about the
quality of education in our schools.
a. Construct a 95% confidence interval for the population proportion of adult Americans
who are worried a lot about the quality of education in our schools.
b. Calculate the margin of error.
c. The sampling error given by Yankelovich Partners, Inc. (which conducted the poll) is
±3%. In one to three complete sentences, explain what the ±3% represents.
46. According to a Field Poll, 70% of Illinoisans (actual results are 420 out of 600 registered
voters surveyed) feel that “property taxes are too high” in Illinois. We wish to construct a 90%
confidence interval for the true proportion of Illinois registered voters who feel that IL
property tax is too high.
47. Five hundred and eleven (511) homes in a certain southern California community are
randomly surveyed to determine if they meet minimal earthquake preparedness
recommendations. One hundred seventy-three (173) of the homes surveyed met the minimum
recommendations for earthquake preparedness, and 338 did not.
a. Find the confidence interval at the 90% Confidence Level for the true population
proportion of southern California community homes meeting at least the minimum
recommendations for earthquake preparedness.
b. Find the point estimate for the population proportion of homes that do not meet the
minimum recommendations for earthquake preparedness.
48. In July 2021, Gallup reported that of the 4843 adults surveyed, 18% of Americans say
that they completely or mostly isolate themselves from non-family members due to the
pandemic. The confidence level for this study was reported at 95% with a ±2% margin of
error.
e. Compare the margin of error in part d to the margin of error reported by Gallup.
Explain any differences between the values.
f. Create a confidence interval for the results of this study.
g. A reporter is covering the release of this study for a local news station. How should
the reporter explain the confidence interval to their audience?
49. A national survey of 2175 likely U.S. voters was conducted on June 20 to 24, 2021 by
Rasmussen Reports. It concluded with 95% confidence that 37% to 41% of voters think the
country is heading in the right direction.
a. Find the point estimate and the margin of error for this confidence interval.
b. Can we conclude with 95% confidence that less than half of all voters believe this?
c. Use the point estimate from part a and n = 2175 to calculate a 75% confidence interval
for the proportion of voters that believe the country is heading in the right direction.
d. Can we (with 75% confidence) conclude that less than half of all voters believe this?
50. Lending Tree recently conducted a survey asking adults across the U.S. about “borrowing”
someone else’s login for streaming services. When asked, 305 of the 586 participants
admitted that they had used someone else’s login info to use Netflix.
a. Create a 99% confidence interval for the true proportion of American adults who have
used someone else’s login for Netflix.
b. This survey was conducted through an online survey method from Feb. 11 to 16, 2021. The
margin of error of the survey compensates for sampling error, or natural variability
among samples. List some factors that could affect the survey’s outcome that are not
covered by the margin of error.
c. Describe how the confidence interval would change if the CL changed from 99% to
90%.
51. You plan to conduct a survey on your college campus to learn about the political
awareness of students. You want to estimate the true proportion of college students on your
campus who voted in the 2020 presidential election with 95% confidence and a margin of
error no greater than five percent. How many students must you interview?
52. In a College Pulse study, 2000 undergraduates responded to a survey of 15 questions.
Of these students, 38% strongly agreed with the statement “Higher education has a
role to play in racial justice and racial equality in the US” while 7% strongly disagreed. Find
a 98% confidence interval for the true percentage of students who strongly agree with the
statement.
53. In a June 2021 survey reported by Insider, “52% of Americans want all student-loan borrowers to
have their debt forgiven”. The survey used a random sample of 3600 adults through
GoBankingRates, a personal finance portal.
a.) Find the 90% confidence interval for the true proportion of Americans that want
all student loan borrowers to have their debt forgiven.
b.) Let’s say the full sample has a margin of error of plus/minus 3.6 percentage points.
If we want a margin of error of 3 percentage points with 95% confidence, how large of
a sample should be used?
54. Bankrate conducted a survey of 2,194 adults, including 1,330 homeowners. The survey
found that 29 percent of respondents with a mortgage either didn't know their rate or
wouldn't say.
b. Find the 98% confidence interval for the true proportion of adults with a mortgage who
either don’t know their rate or wouldn’t say.
55. Find the left and right critical values of the chi-square distribution for a 98% confidence
interval with 20 degrees of freedom.
56. Find the left and right critical values of the chi-square distribution for a 90% confidence
interval with 26 degrees of freedom.
57. Find the left and right critical values of the chi-square distribution for a 95% confidence
interval with 12 degrees of freedom.
58. Find the left and right critical values of the chi-square distribution for a 99% confidence
interval with 9 degrees of freedom.
59. Unoccupied seats on flights cause airlines to lose revenue. Suppose a large airline wants
to estimate its mean number of unoccupied seats per flight over the past year. To accomplish
this, the records of 25 flights are randomly selected and the number of unoccupied seats is
noted for each of the sampled flights. The sample mean is 11.6 seats and the sample standard
deviation is 4.1 seats.
b. Construct a 90% confidence interval for the population standard deviation of the
number of unoccupied seats per flight.
60. In a recent sample of 84 used car sales costs, the sample mean was $6,425 with a standard
deviation of $3,156. Construct a 95% confidence interval for the population variance.
61. Six different national brands of chocolate chip cookies were randomly selected at the
supermarket. The grams of fat per serving are as follows: 8; 8; 10; 7; 9; 9. Construct a 90%
confidence interval for the population variance.
62. A survey of the mean number of cents off that coupons give was conducted by randomly
surveying one coupon per page from the coupon sections of a recent San Jose Mercury News.
The following data were collected: 20¢; 75¢; 50¢; 65¢; 30¢; 55¢; 40¢; 40¢; 30¢; 55¢; $1.50; 40¢;
65¢; 40¢. Assume the underlying distribution is approximately normal.
a. Find s =
b. Construct a 95% confidence interval for the population standard deviation worth of
coupons.
63. A quality control specialist for a restaurant chain takes a random sample of size 12 to
check the amount of soda served in the 16 oz. serving size. The sample mean is 13.30 with a
sample standard deviation of 1.55. Assume the underlying population is normally
distributed. Find the 95% Confidence Interval for the true population standard deviation for
the amount of soda served.
REFERENCES
9.1 Confidence interval about a single population mean when σ known
“Disclosure Data Catalog: Candidate Summary Report 2012.” U.S. Federal Election Commission.
Available online at http://www.fec.gov/data/index.jsp (accessed July 2, 2013).
“Headcount Enrollment Trends by Student Demographics Ten-Year Fall Trends to Most Recently
Completed Fall.” Foothill De Anza Community College District. Available online at
http://research.fhda.edu/factbook/FH_Demo_Trends/FoothillDemographicTrends.htm (accessed
September 30, 2013).
La, Lynn, Kent German. "Cell Phone Radiation Levels." c|net part of CBX Interactive Inc. Available
online at http://reviews.cnet.com/cell-phone-radiation-levels/ (accessed July 2, 2013).
“Mean Income in the Past 12 Months (in 2011 Inflation-Adjusted Dollars): 2011 American
Community Survey 1-Year Estimates.” American Fact Finder, U.S. Census Bureau. Available online
at
http://factfinder2.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_11_1YR_S1902&prodType=table
(accessed July 2, 2013).
“Metadata Description of Candidate Summary File.” U.S. Federal Election Commission. Available
online at http://www.fec.gov/finance/disclosure/metadata/metadataforcandidatesummary.shtml
(accessed July 2, 2013).
“National Health and Nutrition Examination Survey.” Centers for Disease Control and Prevention.
Available online at http://www.cdc.gov/nchs/nhanes.htm (accessed July 2, 2013).
“Disclosure Data Catalog: Leadership PAC and Sponsors Report, 2012.” Federal Election
Commission. Available online at http://www.fec.gov/data/index.jsp (accessed July 2, 2013).
“Human Toxome Project: Mapping the Pollution in People.” Environmental Working Group.
Available online at
http://www.ewg.org/sites/humantoxome/participants/participant-group.php?group=in+utero%2Fnewborn
(accessed July 2, 2013).
“Metadata Description of Leadership PAC List.” Federal Election Commission. Available online at
http://www.fec.gov/finance/disclosure/metadata/metadataLeadershipPacList.shtml (accessed July
2, 2013).
Jensen, Tom. “Democrats, Republicans Divided on Opinion of Music Icons.” Public Policy Polling.
Available online at http://www.publicpolicypolling.com/Day2MusicPoll.pdf (accessed July 2, 2013).
Madden, Mary, Amanda Lenhart, Sandra Coresi, Urs Gasser, Maeve Duggan, Aaron Smith, and
Meredith Beaton. “Teens, Social Media, and Privacy.” PewInternet, 2013. Available online at
http://www.pewinternet.org/Reports/2013/Teens-Social-Media-And-Privacy.aspx (accessed July 2,
2013).
Princeton Survey Research Associates International. “2013 Teen and Privacy Management Survey.”
Pew Research Center: Internet and American Life Project. Available online at
http://www.pewinternet.org/~/media//Files/Questionnaire/2013/Methods%20and%20Questions_Teens%20and%20Social%20Media.pdf
(accessed July 2, 2013).
Saad, Lydia. “Three in Four U.S. Workers Plan to Work Past Retirement Age: Slightly more say they
will do this by choice rather than necessity.” Gallup® Economy, 2013. Available online at
http://www.gallup.com/poll/162758/three-four-workers-plan-work-past-retirement-age.aspx
(accessed July 2, 2013).
Martin, Emmie. “30% Of Homeowners Are Making a Mistake That Could Cost Them Thousands.”
CNBC, CNBC, 1 May 2018, www.cnbc.com/2018/05/01/a-third-of-homeowners-dont-know-their-
mortgage-rate.html?__source=twitter%7Cmain. (accessed July 17, 2018)
“Concern over COVID Spread Hits the Lowest in June.” Pharmaceutical Technology, Verdict Media
Limited, July 2021, www.pharmaceutical-technology.com/news/concern-over-covid-19-spread-
hits-the-lowest-in-june-since-the-start-of-the-pandemic-poll/. (accessed July 9, 2021)
Irizarry, Yasmiyn. “On Track or Derailed? Race, Advanced Math, and the Transition to High School -
Yasmiyn Irizarry, 2021.” SAGE Journals, SAGE Journals, 11 Jan. 2021,
journals.sagepub.com/doi/full/10.1177/2378023120980293. (accessed July 9, 2021)
Saad, Lydia. “Strict Social Distancing in the U.S. Dwindles to 18%.” Gallup.com, Gallup, 7 July
2021, news.gallup.com/poll/352166/strict-social-distancing-dwindles.aspx. (accessed July 11,
2021)
Rasmussen_Poll. “Right Direction or Wrong Track.” Rasmussen Reports®,
www.rasmussenreports.com/public_content/politics/mood_of_america/right_direction_wrong_tr
ack_jul5 (accessed July 11, 2021).
Papandrea, Dawn. “Nearly 4 in 10 Americans Are Mooching Off Someone Else's Streaming Account.”
LendingTree, 11 Mar. 2021, www.lendingtree.com/americans-mooching-streaming-accounts/
(accessed July 11, 2021).
Sheffey, Ayelet. “52% Of Americans Want All Student-Loan Borrowers to Have Their Debt Forgiven,
New Survey Finds.” Business Insider, Business Insider, 21 June 2021,
www.businessinsider.com/americans-support-blanket-student-loan-debt-forgiveness-go-
banking-rates-2021-6 (accessed July 11, 2021).
Ezarik, Maria. What Students Think about Racial Justice Efforts on Campus, 6 May 2021,
www.insidehighered.com/news/2021/05/06/what-students-think-about-racial-justice-efforts-
campus (accessed July 12, 2021).
CHAPTER 9 SOLUTIONS:
1) b. 1 – α = .94, α = .06, α/2 = .03; The critical values are ±Zα/2 = ±invnorm(.03) = ±1.88.
3) 1 – α = .98, α = .02, α/2 = .01; tα/2 = 2.55; Margin of error is: $E = t_{\alpha/2} \cdot \frac{s}{\sqrt{n}} = .46$
5) a. x̄ = 8.2, σ = 2.2, n = 200
b. Using Zinterval: (7.944, 8.456); ±zα/2 = ±invnorm(.05) = ±1.645; E = .26 minutes
c. Increase n
d. The level of confidence would decrease. The smaller the n, the larger the E but since
E is staying the same, the confidence level would decrease.
e. The larger the confidence level, the larger n should be.
b. 95% confident that the population mean for wait time is between 1.38 hours and 1.62
hours.
f. 95% confidence that the population mean is between 2.93 and 3.59 colors for national
flags. g. Smaller, more accurate h. Smaller less confidence
13.) a. x = 120, n = 200, $\hat{p} = \frac{120}{200} = .6$ b. X is the number of households where women make
the majority of the purchasing decisions. p̂ is the proportion of households where women
make the majority of the purchasing decisions. c. p̂ ~ N(.6, .035) d. 1propZint:
(.532, .668), E = .068 e. answers vary
15) a. 1propZint .627 < p < .673, E = .023 b. .623 < p < .677
17) a. Tinterval 71.69 < µ < 73.31 b. ±tc = ± 2.012 c. the level of confidence
would stay same but the interval would become closer together.
25) $n = \left(\frac{1.81 \cdot 22}{5}\right)^2 \approx 63.4$, so round up to n = 64 calls
29) a. T distribution because sigma is unknown b. Tinterval (2.27, 2.76) c. ±tc = ±2.31
d. 95% confidence that the population mean lies within 2.27 hours and 2.76 hours.
39) a. x = 320, n = 400, p̂ = .80 b. 1propZint (.761, .839) c. E = 3.92% d. answers
vary
43) a. X is the number of people over 50 who ran and died; p̂ is the sample proportion of
people over 50 who ran and died b. 1propZint (.00289, .0282) c. We are 97% confident that the
true proportion of those over 50 who ran and died is between .3% and 2.8%.
45) a. 1propZint 60% < p < 66% b. E = 3% c. ±3% represents the margin of
error and that the true proportion of adult Americans who are worried a lot about the
quality of education in our schools is between 60% and 66%.
47) a. 1propZint (.304, .373) b. The point estimate for homes that do not meet the recommendations is 338/511 = .661
49) a. p̂ = .39, E = .02 b. Yes, because the entire interval (37% to 41%) lies below 50%.
51) n = 385
59) a. x̄ = 11.6, s = 4.1, n = 25 b. $\sqrt{\frac{(24)4.1^2}{36.415}} < \sigma < \sqrt{\frac{(24)4.1^2}{13.848}}$, so 3.33 < σ < 5.4
61) s² = 1.1, n = 6; $\frac{(5)1.1}{11.0705} < \sigma^2 < \frac{(5)1.1}{1.1458}$, so .497 < σ² < 4.801
63) s = 1.55, n = 12; $\sqrt{\frac{(11)1.55^2}{21.92}} < \sigma < \sqrt{\frac{(11)1.55^2}{3.816}}$, so 1.098 < σ < 2.632
10 | HYPOTHESIS TESTING WITH ONE SAMPLE
Figure 10.1 Scientist doing a PCR (Polymerase Chain Reaction) in electrophoresis gel. This technique is
often used for genetic and DNA testing. Photo by National Cancer Institute on Unsplash
Chapter Objectives
Introduction
One job of a statistician is to make statistical inferences about populations based on samples
taken from the population. In the last chapter, we used confidence intervals to estimate an
unknown population parameter. In this chapter we will use statistical inference to make a
decision about a parameter; that is, given two competing claims about an unknown
parameter, we will decide which of them is more plausible. The method for doing this is
called hypothesis testing. We already have all of the tools needed to implement this method
– and we will shortly distill the process into a straightforward five-step method. We will
illustrate the basic ideas with an example:
The ACME chemical corporation makes a fabric cleaner; the company claims that this
product successfully removes 90% of all stains. A consumer organization questions this claim
and decides to test the product. They select a random sample of 100 stained garments and
apply the product to each. They find that the stain remover successfully cleaned only 78 of
them. Does this sample data provide evidence that the company is overstating the
effectiveness of their product? That is, does the sample data provide evidence that the success
rate for the fabric cleaner is less than 90%?
We let p denote the success rate for the cleaner – that is, this is the proportion of all stains
in the population that will be removed by the product. Now we see that we have two
competing claims:
• ACME’s claim: p = 0.90
• The consumer organization’s claim: p < 0.90
Which one of these claims is true? Of course, there is no way to determine p explicitly, but
we could use the methods of Chapter 9, along with the given sample data, to compute a
confidence interval for p. Doing this gives us a 95% confidence interval of (0.699, 0.861).
That is, we are 95% confident that 69.9% < p < 86.1%. Since the upper bound of the interval
is well under 90%, the sample data does indeed provide evidence that p is less than 90%.
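The interval quoted above can be checked outside the calculator. The following is a minimal Python sketch, assuming the usual normal-approximation interval for a proportion; the function name proportion_ci is our own label and does not come from the text.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a population proportion.

    proportion_ci is a hypothetical helper name used only for this sketch.
    """
    p_hat = successes / n
    # Critical value z_(alpha/2) from the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Consumer organization's sample: 78 of 100 garments cleaned.
low, high = proportion_ci(78, 100)
print(f"({low:.3f}, {high:.3f})")  # (0.699, 0.861)
```

Because the upper endpoint stays below 0.90, the sketch agrees with the conclusion drawn from the interval.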
But there is another way we could go about this. Note that if the company’s claim is true,
then we would expect to get about 90 successes in a sample of 100. But the number of
successes in 100 trials will change from sample to sample, so the fact that we got fewer than
90 does not in itself provide evidence that the success rate is less than 90%. ACME could
claim that the discrepancy between their claimed success rate and the sample data is simply
due to sample variation. This is where probability comes in. Let us give the company the
benefit of the doubt. Let us assume that the success rate really is 90% and calculate the
probability of getting a sample of 100 garments with a success rate of 78% or less (a sample
proportion less than 78% would also support the consumer organization’s claim). In Chapter
7, we saw that if n = 100 and p = .90, then the sampling distribution of p̂ is approximately
normal, with mean $\mu_{\hat{p}} = p = .90$ and standard deviation
$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{.90(.10)}{100}} = 0.03$.
Using the calculator, we see that the probability we want is:
P(p̂ ≤ 0.78) = normalcdf(−10^99, 0.78, 0.90, 0.03) ≈ 0.00003
So, if ACME’s claim really is true, then getting 78 or fewer successes in a sample of 100
garments would be highly unlikely (only about 3 times for every 100,000 samples).
However, the consumer group really did get a sample of 100 with only 78 successes! This
means that there are two possibilities: The success rate really is 90%, and a very rare event
occurred, or else the assumption that p = .90 is not correct. Given how small the probability
we calculated was, the second option seems to be more plausible. In other words, the sample
data is not compatible with ACME’s claim about the success rate of their product; so we would
reject the assumption that the success rate is 90%. Thus, this sample data provides evidence
that ACME was overstating the success rate of their product.
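As a cross-check on the probability used in this argument, here is a short Python sketch that does the same calculation without a calculator. It assumes, as the text does, that the sampling distribution of p̂ is approximately normal when p = .90 and n = 100; the variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.90   # success rate claimed by ACME (assumed true for the calculation)
n = 100     # number of garments in the sample

# Standard deviation of the sample proportion under the claim.
sigma_p_hat = sqrt(p0 * (1 - p0) / n)          # about 0.03

# Probability of a sample proportion of 78% or less if the claim is true.
sampling_dist = NormalDist(mu=p0, sigma=sigma_p_hat)
p_value = sampling_dist.cdf(0.78)

print(f"standard deviation = {sigma_p_hat:.2f}")
print(f"P(p-hat <= 0.78)   = {p_value:.6f}")
```

The result is roughly 0.00003, which matches the "about 3 times for every 100,000 samples" quoted above.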
The line of reasoning we used here is known as hypothesis testing, and it is a widely used
method of statistical inference. In general, a hypothesis test will always consist of two
contradictory hypotheses or statements, a decision based on sample data, and a conclusion.
Note that in all of our problems, we will be given either the sample data or the summary
statistics. Reviewing our example, we see that a hypothesis test will involve the following
five steps:
1. Set up two contradictory hypotheses, which we call the null hypothesis and the
alternative hypothesis.
2. Determine the correct sampling distribution to perform the calculations.
3. Assuming that the null hypothesis is true, we calculate the probability of getting
sample data like the data we have actually observed.
4. If this probability is sufficiently small, we reject the null hypothesis.
5. Interpret the decision to write a meaningful conclusion; i.e. interpret the decision to
answer the original question.
These five steps comprise what is called the Five Step Method of hypothesis testing.
In this chapter, we develop hypothesis tests on single means and single proportions. We will
also learn about the errors associated with these tests. In subsequent chapters, we learn
many other tests to apply these ideas to other situations – but every test we learn will follow
the Five Step Method.
10.1 | Elements of a Hypothesis Test
Every hypothesis test begins by considering two hypotheses. These are called the null
hypothesis and the alternative hypothesis. These hypotheses contain opposing viewpoints, and
almost always are stated in terms of one or more unknown population parameters.
Population Parameters
Mean μ
Proportion p
Standard Deviation σ
The null hypothesis is denoted by H0. It is a statement about the parameter that either is
believed to be true or is used to put forth an argument unless it can be shown to be incorrect
beyond a reasonable doubt. Thus, all of our calculations will be done using the assumption
that the null hypothesis is true.
The alternative hypothesis is denoted by Ha. It is a claim about the parameter that is
contradictory to H0. The alternative Ha is also sometimes called the “research hypothesis”.
Since the hypotheses are contradictory, only one of them can be true; thus, if our sample data
leads us to reject H0, then we will have statistical evidence that Ha is true. And we only will
reject H0 when the sample data provides compelling evidence that it is false (this will be
discussed in greater detail in the next section).
Based on calculations using sample data, we will make a decision about H0. There are only
two possible decisions:
• Reject H0
• Do not reject H0
If the decision is to reject H0, then this means that the sample data is incompatible with the
assumption that H0 is true. When the decision is to reject H0 then the sample data provides
evidence that Ha is true. On the other hand, if the decision is “do not reject H0”, then the sample
data simply does not provide enough evidence to reject the null hypothesis.
By its very nature, H0 asserts that the value of the parameter is known; thus, H0 will always
include an equal sign; so the null hypothesis will always be a statement involving either “=”,
“≤” or “≥”.
On the other hand, Ha will never have a symbol with an equal in it; so Ha will be a statement
involving either “≠”, “>” or “<”. The choice of symbol depends on the claim being tested.
(Note that some practitioners write H0 as a simple statement of equality, even with > or < as
the symbol in the alternative hypothesis. This practice is also acceptable.)
Example 10.1
We wish to test the claim that more than 30% of registered voters in a certain county voted
in the primary election. Find the hypotheses for this test.
Solution 10.1
The claim here concerns a population proportion p, the proportion of all registered voters in
the county that voted in the primary. The claim is that more than 30% voted, or p > 0.30.
This claim must be represented by either H0 or Ha; since the claim does not include an equal
sign, it will be Ha. Remember that H0 and Ha are opposites of each other. Thus we get the
hypotheses:
H0: p ≤ 0.30 vs. Ha: p > 0.30
10.1 A medical trial is conducted to test whether or not a new medicine reduces cholesterol
by at least 25%. State the null and alternative hypotheses for the test.
Example 10.2
We want to test whether the mean GPA of students in American colleges is different from
2.0 (out of 4.0). State the hypotheses for this test.
Solution 10.2
Here the claim being tested involves a population mean, μ; and the claim states that the
mean GPA is different from 2.0. So, the claim is that μ ≠ 2.0. Since this is a “not equal”
statement, it must be represented by Ha, and H0 will be the opposing statement. So, the
hypotheses are:
H0: μ = 2.0 vs. Ha: μ ≠ 2.0
10.2 We want to test whether the mean height of eighth graders is 66 inches. State the null
and alternative hypotheses. Fill in the correct symbol (=, ≠, ≥, <, ≤, >) for the null and
alternative hypotheses:
Example 10.3
Suppose that we want to test whether the average time needed to complete a bachelor’s
degree in the U.S. is less than five years.
Solution 10.3
a. The claim is that the population mean μ is less than five years; i.e. the claim is μ < 5.
This must be represented by the alternative hypothesis, and so the null and
alternative hypotheses are: H0: μ ≥ 5 vs. Ha: μ < 5
b. Since the decision is to reject H0, there is evidence that H0 is false. That is, there is
evidence that Ha is true. Thus, there is significant evidence that the average time
needed to complete a bachelor’s degree in the U.S. is less than five years.
10.3 We want to test if it takes fewer than 45 minutes to teach a lesson plan. State the null
and alternative hypotheses. Fill in the correct symbol ( =, ≠, ≥, <, ≤, >) for the null and
alternative hypotheses.
Example 10.4
In an issue of U.S. News and World Report, an article on school standards stated that about
half of all students in France, Germany, and Israel take advanced placement exams and of
those, a third pass. The same article stated that 6.6% of U.S. students take advanced
placement exams and 4.4% pass. Suppose that a test is conducted to test whether the
percentage of U.S. students who take advanced placement exams is more than 6.6%.
Solution 10.4
a. The claim is that more than 6.6% take the exams; i.e. that p > 0.066. This is a strict inequality,
so it will be represented by Ha, and we get the hypotheses: H0: p ≤ 0.066 vs. Ha:
p > 0.066
b. Since the decision is to not reject H0, this means that there is not enough evidence to
support Ha. Thus, there is not enough evidence to conclude that the percentage of
U.S. students who take advanced placement exams is more than 6.6%.
10.4 On a state driver’s test, about 40% pass the test on the first try. We want to test the
claim that more than 40% pass on the first try. Fill in the correct symbol (=, ≠, ≥, <, ≤, >) for
the null and alternative hypotheses:
The p-value and Significance Level for a Test
We have discussed at some length Steps 1 and 5 of the process; now we focus on Steps 3 and
4. The probability calculated in Step 3 is called the p-value for the test. To compute this
value, we assume that the null hypothesis is true, and then find the probability of getting a
sample statistic that is at least as extreme as the statistic from our sample. Thus, the smaller
the p-value is, the stronger the evidence against H0.
For any test, the p-value will be calculated in an appropriate sampling distribution. For tests
about a single population mean or population proportion, we can use the techniques learned
in Chapter 8. For example, suppose that we want to test the claim that the average time
spent to complete a bachelor’s degree is less than five years. As we saw in Example 10.3, the
hypotheses for this test are:
H0: μ ≥ 5 vs. Ha: μ < 5
Further suppose that we collected a sample and found that x̄ = 3.5. Then the p-value would
be the probability p-value = P(x̄ < 3.5), calculated under the assumption that H0 is true.
Graphically, this probability is the area under the
normal curve that is to the left of x̄ = 3.5. For this reason, a test in which the alternative
hypothesis is a “less than” statement is called a left-tailed test.
Similarly, if we were testing the hypotheses H0: μ ≤ 65 vs. Ha: μ > 65, and the sample data
gave us x̄ = 67, then the p-value would be the probability p = P(x̄ > 67). Graphically, this
probability is the area under the normal curve that is to the right of x̄ = 67. So, a test in
which the alternative hypothesis is a “greater than” statement is called a right-tailed test.
Finally, if the alternative hypothesis is a “not equal” statement, then the p-value will be the
area in both tails. And if the sampling distribution is symmetric, then the p-value will be
split evenly between the two tails. E.g. suppose we test the hypotheses:
H0: μ = 50 Ha: μ ≠ 50
Tests in which the alternative hypothesis is a “not equal” statement are called two-tailed
tests.
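The three kinds of tests differ only in which tail area is taken as the p-value. The sketch below illustrates this for a standardized test statistic z, assuming a normal sampling distribution; the function name p_value_from_z and the value z = −2.0 are illustrative choices of ours, not values from the text.

```python
from statistics import NormalDist


def p_value_from_z(z, tail):
    """Tail area under the standard normal curve for a test statistic z.

    tail is 'left' (Ha uses <), 'right' (Ha uses >), or 'two' (Ha uses ≠).
    """
    std_normal = NormalDist()
    if tail == "left":
        return std_normal.cdf(z)
    if tail == "right":
        # By symmetry, the area to the right of z equals the area to the left of -z.
        return std_normal.cdf(-z)
    if tail == "two":
        # Area in both tails: double the area beyond |z| in one tail.
        return 2 * std_normal.cdf(-abs(z))
    raise ValueError("tail must be 'left', 'right', or 'two'")


z = -2.0  # hypothetical test statistic, for illustration only
p_left = p_value_from_z(z, "left")
p_right = p_value_from_z(z, "right")
p_two = p_value_from_z(z, "two")
print(f"left: {p_left:.4f}  right: {p_right:.4f}  two-tailed: {p_two:.4f}")
```

For the same statistic, the two-tailed p-value is twice the smaller one-tailed p-value, which reflects the even split between the two tails described above.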
Once we have calculated the p-value, we still need to use it to decide whether or not to reject
the null hypothesis. According to Step 4 of the Five Step Method, we will reject the null
hypothesis H0 only if p is “sufficiently small”; and the criterion for this is provided by a
numerical value called the significance level of the test.
Before conducting a hypothesis test, the researcher will decide on a maximum allowable
probability for a Type I error, which we denote as α. A Type I error occurs when the decision
at the end of the hypothesis test is to reject H0 when in fact H0 is true. The value α is the
significance level, and it serves as a threshold: if the p-value is less than α, then we will reject H0.
The significance level chosen will depend on the consequences of making a Type I error. For
example, in the social sciences, it is not uncommon to use a significance level of α = .10,
whereas in the pharmaceutical industry the standard level of significance for drug testing is
α = .01. The most widely used level is α = .05.
Rejection Rule Using the p-value
• If the p-value is less than α, reject H0.
• Otherwise, do not reject H0.
NOTE: We do not use the phrase “accept H0” if the conclusion is “do not reject H0”.
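The rule of comparing the p-value with α can be written as a small Python sketch. The function name decide and the p-value of 0.03 are hypothetical, chosen only to show how the same sample can lead to different decisions at different significance levels.

```python
def decide(p_value, alpha=0.05):
    """Apply the rule 'reject H0 when the p-value is less than alpha'."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    # Note the wording: we never return 'accept H0'.
    return "Reject H0" if p_value < alpha else "Do not reject H0"


# Hypothetical p-value, compared against the three levels mentioned in the text.
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}: {decide(0.03, alpha)}")
```

With this illustrative p-value, H0 is rejected at the 0.10 and 0.05 levels but not at the stricter 0.01 level.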
Example 10.5
It’s a Boy Genetics Labs claim their procedures improve the chances of a boy being born.
We let p be the proportion of male babies born to It’s A Boy clients. Suppose that we test the
hypotheses using a significance level of α = 0.01.
Solution 10.5
If the claim is H0 and the decision is to reject H0, then “there is enough evidence at the ___%
level of significance to reject the claim that …”
If the claim is Ha and the decision is to reject H0, then “there is enough evidence at the ___%
level of significance to support the claim that …”
10.5 A certain computer component is known to have an average time of 520 hours between
failures. A modification is made to the component that is supposed to extend the time
between failures. To test that the modification is successful, a researcher samples 100 of
the components and uses the data to test the hypotheses H0: μ ≤ 520, Ha: μ > 520 using a
0.05 significance level.
Example 10.6
A baker claims that his bread height is more than 15 cm, on average. Several of his
customers do not believe him. To persuade his customers that he is right, the baker decides
to do a hypothesis test. The baker knows from baking hundreds of loaves of bread that the
standard deviation for the height is 1.5 cm. and the distribution of heights is normal. He
makes 30 loaves of bread and finds that the mean height for the sample is 17 cm. Using a
.05 significance level, does this sample data provide evidence that the mean bread height
is more than 15 cm?
Solution 10.6
The claim being tested is that the mean is more than 15 cm, or μ > 15. Since this is a strict
inequality, this claim will be represented by Ha. The null hypothesis must contradict the
alternative hypothesis, so we get the hypotheses:
H0: μ ≤ 15 vs. Ha: μ > 15
We are sampling from a normal distribution and the population SD is known (σ = 1.5 cm),
so the sampling distribution for x̄ is normal with mean $\mu_{\bar{x}} = 15$ and standard deviation
$\sigma_{\bar{x}} = \frac{1.5}{\sqrt{30}} = 0.27386$.
Next, we calculate the p-value for the test. Suppose the null hypothesis is true: that is,
assume that the mean height of the loaves is no more than 15 cm. Using this assumption,
we find the probability of observing a sample mean that is greater than or equal to 17; since
the sampling distribution is normal, we can use any of the methods from Chapter 7 to do the
calculation. For example, we can use the normalcdf function in the TI-84 calculator:
$p = P(\bar{x} > 17) = \text{normalcdf}\left(17,\ 10^{99},\ 15,\ \frac{1.5}{\sqrt{30}}\right) = 1.42 \times 10^{-13}$
In other words, if the population mean really is μ = 15, then the probability of selecting a
sample of 30 with x̄ > 17 is p = 0.00000000000014, which is virtually zero.
Since p < .05 we reject H0. In fact, with such a small p-value, we would have rejected the
null hypothesis at any level of significance. A p-value of approximately zero tells us that, if
the population mean height had really been 15 cm, it would be virtually impossible to get a sample
mean of 17 cm. purely by chance. Since the outcome of 17 cm. is so unlikely, its occurrence
provides strong evidence against the null hypothesis. There is sufficient evidence that the
true mean height for the population of the baker's loaves of bread is greater than 15 cm.
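For readers without a TI-84, the baker's test can be sketched in Python. This is only an illustration under the assumptions stated in the example (normal heights, known σ = 1.5 cm, n = 30, sample mean 17 cm); the variable names are ours. The right-tail area is computed with math.erfc, which stays accurate for very small probabilities.

```python
from math import erfc, sqrt

mu0 = 15      # mean height under the null hypothesis (cm)
sigma = 1.5   # known population standard deviation (cm)
n = 30        # number of loaves in the sample
x_bar = 17    # observed sample mean (cm)
alpha = 0.05  # significance level

standard_error = sigma / sqrt(n)       # about 0.27386
z = (x_bar - mu0) / standard_error     # about 7.30

# Right-tailed p-value: P(Z > z) = 0.5 * erfc(z / sqrt(2)).
p_value = 0.5 * erfc(z / sqrt(2))

print(f"standard error = {standard_error:.5f}")
print(f"z = {z:.2f}, p-value = {p_value:.2e}")
print("Reject H0" if p_value < alpha else "Do not reject H0")
```

The p-value comes out on the order of 10^-13, in line with the calculator result, so the decision is to reject H0. Small differences in the last digits between software packages are expected this far out in the tail.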
10.6 A machine at a bottling plant is programmed to fill bottles to an average volume of 12.2
oz. The volumes are normally distributed with a standard deviation of σ = 0.6 oz. The line
manager suspects that the machine may be overfilling the bottles, and so he will test the
claim that the mean is greater than 12.2. A sample of 36 is selected; the sample mean is 12.5
oz, and the p-value is calculated to be p = 0.0013. What can we conclude from this test?
Note that in our example, we will follow the Five Step Method using p-value. The steps
are listed in the box below. This method will be used for the different types of hypothesis
tests in the chapter.
Five Step Method for Hypothesis Testing using p-value
The Five Step Method streamlines and organizes our work by providing a template for
performing a hypothesis test. Note that of the five steps involved, only Step 3 requires any
calculation; and these calculations are programmed into the TI-83 and TI-84 calculators:
To calculate the p-value (and test statistic) for a Z-Test, go to the STAT menu, and then
scroll over to the TESTS menu.
• If we are given summary statistics σ, x̄ and n, then we select the Stats option.
Enter the mean hypothesized in H0 for μ0; enter the values for σ, x̄ and n.
In the row marked as “μ: ”, scroll over to the option that matches the alternative Ha.
Put the cursor on Calculate, and press Enter. The p-value and the test statistic z
will be displayed.
If you instead put the cursor on Draw, the calculator will show a graph of the
normal distribution, with the area corresponding to the p-value shaded; this option
also shows the test statistic z and p-value.
• If we are given raw data, go to the EDIT menu and enter the data into a list.
Enter the hypothesized mean in H0 for μ0, the value for σ, and the list name
(e.g. press 2nd 1 for list L1). In the row marked as “μ: ”, scroll over to the option
that matches the alternative Ha. Press Calculate and Enter to see the p-value and
the test statistic.
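The Stats option of the calculator's Z-Test can be mirrored in a few lines of Python. The sketch below is only an illustration, not part of the TI-84 workflow: the function name z_test and its tail argument are hypothetical, and it relies on NormalDist from Python's standard statistics module. It is run on the bottling-plant numbers from exercise 10.6, where the text reports a p-value of 0.0013.

```python
from math import sqrt
from statistics import NormalDist

def z_test(mu0, sigma, xbar, n, tail):
    """Return (z, p_value) for a one-sample Z-Test from summary statistics.

    tail is "left" (Ha: mu < mu0), "right" (Ha: mu > mu0) or "two" (Ha: mu != mu0).
    The function name and arguments are illustrative, not a calculator API.
    """
    z = (xbar - mu0) / (sigma / sqrt(n))   # standardized test statistic
    std_normal = NormalDist()              # mean 0, standard deviation 1
    if tail == "left":
        p_value = std_normal.cdf(z)
    elif tail == "right":
        p_value = 1 - std_normal.cdf(z)
    elif tail == "two":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("tail must be 'left', 'right' or 'two'")
    return z, p_value

# Exercise 10.6: mu0 = 12.2, sigma = 0.6, sample mean 12.5, n = 36, right-tailed
z, p_value = z_test(12.2, 0.6, 12.5, 36, "right")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The three branches correspond to the three choices offered in the "μ:" row of the calculator screen.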
Example 10.7
Jeffrey, as an eight-year-old, established a mean time of 16.43 seconds for swimming the 25-
yard freestyle, with a standard deviation of 0.8 seconds. His dad, Frank, thought that Jeffrey
could swim the 25-yard freestyle faster using goggles. Frank bought Jeffrey a new pair of
expensive goggles and timed Jeffrey for 15 25-yard freestyle swims. For the 15 swims,
Jeffrey's mean time was 16 seconds. Frank thought that the goggles helped Jeffrey to swim
faster than the 16.43 seconds. Conduct a hypothesis test using a significance level of α = 0.05.
Assume that the swim times for the 25-yard freestyle are normal.
Solution 10.7
(Step 1) Set up the hypotheses: This is a test of a single population mean, and the claim is
that Jeffrey swims faster with the new goggles. For Jeffrey to swim faster, his time will be
less than 16.43 seconds; the claim is that μ < 16.43. This claim will be the alternative Ha,
and we have:
(Step 2) Select the correct test and identify the significance level α. Since the swim times
are normally distributed, and the population standard deviation is known (σ = 0.8), we use
a Z-Test. The significance level α is 0.05.
(Step 3) Calculate the p-value. In the TI-84, go to the STAT menu and then the TESTS
menu; select Z-Test (item 1). Enter the following:
μ0: 16.43 (this is the mean hypothesized in H0)
σ: 0.8
x̄ : 16
n : 15
μ : < μ0
Put the cursor on Calculate and press Enter to get the p-value = 0.0187.
The graph looks like:
(Step 4) Use these results to make the decision about H0. Since p-value < α, we reject H0.
Note that if H0 is true, there is a 0.0187 probability that Jeffrey's mean time to swim the
25-yard freestyle will be 16 seconds or less. Because a 1.87% chance is small, the mean time of
16 seconds or less is unlikely to have happened randomly. It is a rare event.
(Step 5) Interpret the decision about H0 in the context of the given problem to state a
conclusion. At the 5% significance level, we conclude that Jeffrey swims faster using the
new goggles. The sample data show there is sufficient evidence that Jeffrey's mean time to
swim the 25-yard freestyle is less than 16.43 seconds.
10.7 The mean throwing distance of a football for Marco, a high school freshman quarterback,
is 40 yards, with a standard deviation of two yards. The team coach tells Marco to adjust his
grip to get more distance. The coach records the distances for 20 throws. For the 20 throws,
Marco’s mean distance was 45 yards. The coach thought the different grip helped Marco
throw farther than 40 yards. Test the coach’s claim using a significance level of α = 0.05.
Assume that the throwing distances for footballs are normally distributed, with a standard
deviation of σ = 10 yards.
If we are sampling from a normal distribution, the sampling distribution for x̄ will be
approximately normal. But for a small sample, the sample standard deviation may be a poor
estimate for σ. To compensate for this potential error, we use the t-distribution introduced
in Chapter 9 to calculate our p-values and critical values for our hypothesis tests. Recall that
this distribution is bell-shaped like the normal distribution, but with a larger variance.
Moreover, there is a different t-distribution for every sample size, and the smaller the sample
size the larger the variance. Finally, as the sample size gets very large, the t-distribution
approaches the normal distribution.
A test for a population mean μ that is performed under these conditions (i.e. in which σ is
unknown) is called a T-Test. We start with an example:
Example 10.8
Statistics students believe that the mean score on the first statistics test is 65. A statistics
instructor thinks the mean score is higher than 65. To test his claim, he selects a random
sample of ten statistics students and obtains the scores 65; 65; 70; 67; 66; 63; 63; 68; 72; 71.
He uses this data to conduct a hypothesis test using a 5% level of significance. The data are
assumed to be from a normal distribution. Does this sample data provide evidence that the
mean score for all students is more than 65?
Solution 10.8
(Step 1) This is a test of a single population mean. Since the instructor thinks the average
score is more than 65, his claim is represented by the alternative hypothesis, and we have:
(Step 2) A 5% level of significance means that α = 0.05. If you read the problem carefully,
you will notice that there is no population standard deviation given. You are only given n =
10 sample data values. Notice also that the data come from a normal distribution. This means
that we must use a t-test.
(Step 3) Therefore, the distribution for the test is t9 where n = 10 and so df = 10 - 1 = 9. The
mean of the distribution will be $\mu_{\bar{x}} = \mu = 65$. We do not have an explicit formula for the
standard deviation of the sampling distribution as we did for normal distributions. But the
p-value can easily be calculated using the TI-84:
To calculate the p-value (and test statistic) for a T-Test, follow the same basic instructions
as for a Z-Test, but in the TESTS menu we will select T-Test.
• Go to the STAT menu, and then to the TESTS menu; select T-Test (option 2). In this
problem, we are given raw data, so we first go to the STAT menu, select the Edit
menu, and press ENTER. Next, we enter the 10 data values into a list, say L1.
• Press STAT, arrow over to the TESTS menu, and select T-Test. For Inpt:, select the
Data option, and press ENTER. Then, enter the following:
• Arrow down to Calculate and press ENTER.
The display screen will show the p-value, p = 0.0396.
It is useful to have a graphical representation of this p-value; this is a right-tailed
test, so the p-value is P(x̄ > 67), the area to the right of the observed sample mean:
(Step 5) And since we reject μ = 65, we have sufficient evidence to conclude that Ha is true.
At a 5% level of significance, the sample data provide sufficient evidence that the mean test
score is more than 65.
NOTES:
• If we select Draw instead of Calculate, the calculator will graph the area
corresponding to the p-value; the p-value and test statistic will still appear on
this screen.
• The output screen also shows the alternative hypothesis, μ > 65 at the top. It is
always a good idea to double check that the correct alternative hypothesis was
selected.
• The output screen also displays the sample mean, sample standard deviation and
sample size: x̄ = 67, s = 3.1972, and n = 10, respectively. One of the advantages of
using the data option is that the calculator will automatically calculate the key
sample statistics for us. We just need to enter the data!
Before starting another example, we will take this opportunity to show another way to
calculate the p-value. First, we can quickly use 1-VarStats to calculate the sample mean and
sample standard deviation. Then we calculate the t-statistic using the formula:
$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{67 - 65}{3.1972/\sqrt{10}} = 1.978$.
This is a right-tailed test, so the p-value = P(x̄ > 67); and this probability can be rewritten
as p-value = P(x̄ > 67) = P(t > 1.978).
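As a cross-check of the arithmetic, the sample statistics and the t-statistic for Example 10.8 can be recomputed from the raw scores. This is a minimal sketch using only Python's standard library, offered as an aid rather than as the book's method. The standard library has no t-distribution, so the final step, turning t into a p-value, is still left to the calculator's tcdf function.

```python
from math import sqrt
from statistics import mean, stdev

scores = [65, 65, 70, 67, 66, 63, 63, 68, 72, 71]  # raw data from Example 10.8
mu0 = 65                                           # mean hypothesized in H0

n = len(scores)
xbar = mean(scores)      # sample mean
s = stdev(scores)        # sample standard deviation (divides by n - 1)
df = n - 1               # degrees of freedom for the t-distribution
t = (xbar - mu0) / (s / sqrt(n))

print(f"n = {n}, df = {df}, xbar = {xbar}, s = {s:.4f}, t = {t:.3f}")
```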
The latter probability can be calculated using the tcdf function in the TI-84 calculator:
Press 2nd VARS to access the DISTR menu and select the tcdf function; this function is
similar to the normalcdf function that we have used many times. However, the tcdf function
needs only three inputs – the left endpoint, right endpoint, and degrees of freedom. For this
problem, we have
10.8 It is believed that a stock price for a particular company will grow, on average, at a
rate of $5 per week with a standard deviation of $1. An investor claims that the stock won’t
grow as quickly. The changes in stock price are recorded for ten weeks and are as follows:
$4, $3, $2, $3, $1, $7, $2, $1, $1, $2.
Use this data, along with a 5% level of significance, to test the investor's claim that the mean growth rate is less than $5 per week.
Example 10.9
The National Institute of Standards and Technology provides exact data on conductivity
properties of materials. Following are conductivity measurements for 11 randomly selected
pieces of a particular type of glass.
1.11; 1.07; 1.11; 1.07; 1.12; 1.08; .98; .98; 1.02; .95; .95
Using α = .05, does this data provide evidence that the average conductivity of this type of
glass is greater than one? Assume the population is normal.
Solution to 10.9
(Step 1) The claim being tested is that the average conductivity of the selected glass is
greater than one; i.e. the claim is that μ > 1, which will be the alternative hypothesis. So,
the hypotheses will be:
(Step 2) We are testing a sample mean and the population standard deviation is unknown.
Therefore, we need to use a Student's t-distribution. (Assume the underlying population is
normal.)
(Step 3) Do the calculations: In the TI-84, go to the STAT menu, then to the Edit menu;
enter the data into one of the lists. Go to the STAT menu, then to the TESTS menu, and
select T-Test.
Select the data option and enter the following:
Arrow down to Calculate, and press Enter to get t = 2.014 and p-value = 0.0359.
Alternatively, arrow down to Draw, and press enter to see a graphical representation of the
area corresponding to the p-value; it is the area in the right tail:
(Step 4) Since p-value = .0359 < .05, we reject H0 at the .05 level of significance.
(Step 5) Thus the results are statistically significant; there is sufficient evidence to conclude
that the mean conductivity for this type of glass is greater than 1.
then the sampling distribution for p̂ will be approximately normal. Moreover, the mean and
standard deviation for this sampling distribution are $\mu_{\hat{p}} = p$ and $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$,
respectively. Using this information and the Five Step Method, we can test claims like the
one in the following example:
Example 10.10
A consumer group asserts that the proportion of households that have three cell phones is
30%. A cell phone company has reason to believe that the proportion is more than 30%. Before
they start a big advertising campaign, they conduct a hypothesis test. Their marketing people
survey 150 households with the result that 60 of the households have three cell phones; this
data is then used to test the claim that more than 30% of households have three cell phones.
At the 5% significance level, is there sufficient evidence to support the company’s claim?
Solution to 10.10
(Step 1) The phone company’s claim is that p > 0.30, which will be represented by the alternative
hypothesis Ha. The null hypothesis will be the contradictory statement, giving us the
hypotheses:
(Step 2) The significance level is 5% or 0.05. Since the claim involves a population
proportion, the appropriate sample statistic will be a sample proportion, p̂ .
(Step 3) Note that np = 150(.30) = 45 and n(1 – p) = 150(.70) = 105, so the sampling
distribution for p̂ will be approximately normal with mean $\mu_{\hat{p}} = 0.30$ and standard deviation
$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{.30(.70)}{150}} \approx 0.03742$.
(Step 4) This p-value is less than the significance level, so our decision is to reject H0.
Moreover, this p-value tells us that if the null hypothesis is true, there is only about a 1.6%
chance that the sample proportion p̂ for a randomly selected sample will be at least
as far above the expected proportion of .30 as the one we observed.
(Step 5) Thus, at the α = .05 level of significance, the sample data provides sufficient
evidence that the percentage of households with three cell phones is more than 30%. In
other words, we have significant statistical evidence that the company’s claim is correct.
As was the case with the Z-test and T-Test, the p-value and test statistic for a hypothesis
test involving a proportion can be easily calculated using the TI-83 or TI-84 calculator:
Using the TI-83, 83+, 84, 84+ Calculator
To calculate the p-value and the test statistic for a test involving a proportion, we go to the
STAT menu, and then to the TESTS menu. Select 1-PropZTest (item 5).
Caution: The output screen will show two values labeled as p. The first one is the p-value
for the test. The second is the sample proportion p̂ (but the ^ is very small and easy to miss).
10.9 Marketers believe that 92% of adults in the United States own a cell phone. A cell phone
manufacturer believes that percentage is actually lower. A random sample of 200 American
adults is surveyed, of whom 174 report owning cell phones. Test the manufacturer’s claim
using a 5% level of significance. State the null and alternative hypothesis, find the p-value,
and state your conclusion.
Example 10.11
In a study of 420,019 cell phone users, 172 of the subjects developed brain cancer. Test the
claim that cell phone users developed brain cancer at a greater rate than that for non-cell
phone users (the rate of brain cancer for non-cell phone users is 0.0340%). Since this is a
critical issue, use a 0.005 significance level. Explain why the significance level should be so
low in terms of a Type I error.
Solution 10.11
(Step 1) Let p be the proportion of cell phone users that develop brain cancer; then the
claim is that p > 0.00034. This claim will be represented by the alternative hypothesis, so
we get:
(Step 2) We are testing a claim about a proportion, so we use a one-proportion z-test. Note
that the sample is sufficiently large because np = 420,019(0.00034) = 142.8, and n(1 – p) =
420,019(0.99966) = 419,876.2.
(Step 3) In the TI-84, we go to STAT >> TESTS and select 1-PropZTest. Enter p0 = 0.00034,
x = 172 and n = 420,019. Select the “>” option for the alternative, place the cursor either
on Calculate or Draw and press Enter to get the following output:
(Step 4) Since the p-value = 0.0073 is greater than our alpha value = 0.005, we do not reject
H0.
(Step 5) We conclude that there is not enough evidence to support the claim of higher brain
cancer rates for cell phone users.
Finally, if we commit a Type I error, then we conclude that the rate of brain cancer is higher
for cell-phone users than for non-cell-phone users, when in fact it is not. Since the claim
describes cancer-causing environments, we want to minimize the chances of incorrectly
identifying causes of cancer, which is why the significance level is set so low.
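The 1-PropZTest calculation in Step 3 of this solution can be reproduced by hand. The sketch below is an illustration only, assuming Python's standard statistics.NormalDist; the variable names are chosen for readability and do not come from the calculator. It standardizes the sample proportion using the standard deviation under H0 and reads the right-tail area from the standard normal distribution.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.00034            # proportion hypothesized in H0
x, n = 172, 420019      # observed cases and sample size from Example 10.11
alpha = 0.005

p_hat = x / n                               # sample proportion
std_error = sqrt(p0 * (1 - p0) / n)         # SD of p-hat when H0 is true
z = (p_hat - p0) / std_error                # test statistic
p_value = NormalDist().cdf(-z)              # right-tail area, by symmetry

print(f"p-hat = {p_hat:.6f}, z = {z:.3f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "do not reject H0")
```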
Example 10.12
According to the US Census there are approximately 268,608,618 residents aged 12 and
older. Statistics from the Rape, Abuse, and Incest National Network indicate that 207,754
rapes occur each year (male and female), on average, for persons aged 12 and older. This
translates into a percentage of sexual assaults of 0.078%. In Daviess County, KY, there were
reported 11 rapes for a population of 37,937. Conduct an appropriate hypothesis test to
determine if there is a statistically significant difference between the local sexual assault
percentage and the national sexual assault percentage. Use a significance level of 0.01.
Solution 10.12
(Step 1) The claim is that the proportion of sexual assaults in Daviess County, KY is
significantly different from the national average. The proportion for the nation is
207,754/268,608,618, which is about 0.00078. This means that the claim is p ≠ 0.00078, and
so the hypotheses are:
(Step 2) Since we are working with proportions, we will use a one-proportion z-test. The
level of significance is 0.01.
(Step 3) Go to STAT >> TESTS and select 1-PropZTest. Enter 0.00078 as the hypothesized
proportion p0, enter 11 for x and 37,937 for n. Select the ≠ option for the alternative. Scroll
down to Calculate and press Enter to get the following output screen:
The figure on the right shows the output obtained by using the DRAW option.
(Step 4) Since the p-value, p = 0.00063, is less than the alpha level of 0.01, the sample data
indicates that we should reject the null hypothesis.
(Step 5) The sample data support the claim that the proportion of sexual assaults in
Daviess County, Kentucky is different from the national average proportion.
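For a two-tailed test such as the one in Example 10.12, the p-value counts both tails: it is twice the area beyond the observed test statistic. The following is a rough sketch of that calculation under the same assumptions as the calculator's 1-PropZTest (normal approximation, standard deviation computed from p0); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.00078         # national proportion used in H0
x, n = 11, 37937     # reported cases and population for Daviess County, KY
alpha = 0.01

p_hat = x / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
# Two-tailed test: double the area in the tail beyond |z|.
p_value = 2 * NormalDist().cdf(-abs(z))
reject_h0 = p_value < alpha

print(f"z = {z:.3f}, p-value = {p_value:.5f}, reject H0: {reject_h0}")
```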
Here is a summary of the types of tests we have looked at so far using p-values.
10.5 | Critical Values and the Critical Value Method
Before statistical software and graphing calculators became widely available, probabilities
had to be calculated using tables. This is time-consuming, so statistical practitioners
instead used the concept of a critical value to make the decision whether or not to reject H0.
Critical values can be calculated for distributions we have looked at previously in the chapter,
the normal distribution and t-distribution as well as other probability distributions. The
critical values are the values separating unlikely values from those that are likely to occur.
The critical value method compares the critical value with the test statistic from the
hypothesis test.
Let’s look at the normal distribution. We already encountered critical values in Chapter 9;
recall that the value zα is the z-value for which the area to the right is exactly equal to α. For
example, suppose that we are performing a right-tailed z-test for a population mean, using
a .05 significance level. Note that we can find z.05 using the invNorm function in the
calculator:
We can restate this by saying P(z > 1.645) = 0.05. Also note that as the z value increases,
the area to the right of z decreases; this is easy to see using graphs:
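[Figure: Distribution plot of the standard normal distribution (Mean = 0, StDev = 1), with Density on the vertical axis and X on the horizontal axis. The area of 0.05 to the right of 1.645 is shaded.]
The z value is on the horizontal axis (1.645). The area in red is 0.05.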
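[Figure: Distribution plot of the standard normal distribution (Mean = 0, StDev = 1), with Density on the vertical axis and X on the horizontal axis. The area of 0.04457 to the right of 1.7 is shaded.]
The z value is on the horizontal axis (1.7). The red area, 0.04457, is smaller than in the previous graph.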
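[Figure: Distribution plot of the standard normal distribution (Mean = 0, StDev = 1), with Density on the vertical axis and X on the horizontal axis. The area of 0.04006 to the right of 1.75 is shaded.]
The z value is on the horizontal axis (1.75). The red area, 0.04006, is smaller than in the previous graphs.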
It follows that the area to the right of z is less than 0.05 if and only if z > 1.645. Thus:
the p-value for the test is less than .05 if and only if the test statistic z is greater
than 1.645.
We would then refer to the area to the right of z.05 = 1.645 as the rejection region for the
test since we would reject H0 if and only if the test statistic z is in this region.
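The link between the critical value and the right-tail area can also be checked numerically. In this sketch, NormalDist().inv_cdf from Python's standard statistics module is used as a stand-in for the calculator's invNorm function (an assumption made for illustration), and the three z values are the ones shown in the graphs.

```python
from statistics import NormalDist

std_normal = NormalDist()   # standard normal: mean 0, standard deviation 1

# Critical value with area 0.05 to its right, i.e. cumulative area 0.95 to its left.
z_crit = std_normal.inv_cdf(0.95)
print(f"z.05 = {z_crit:.3f}")

# As z increases, the area to its right decreases.
for z in (1.645, 1.7, 1.75):
    right_area = 1 - std_normal.cdf(z)
    print(f"z = {z}: area to the right = {right_area:.5f}, "
          f"in rejection region: {z > z_crit}")
```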
Of course, we could find the critical value and rejection region for any type of test, and for
any significance level. But most tests use the significance levels α = .10, .05 or .01.
Z-value critical values would be used for the population mean where the standard deviation
is known (z-test) and population proportion (1-propztest). The rejection region for z-tests
using these significance levels is summarized in the table below:
Rejection Region for Z-Test
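The rejection regions for the three common significance levels can be generated rather than looked up. The helper below, z_rejection_regions, is a hypothetical name introduced for this sketch; it assumes Python's standard statistics.NormalDist and simply applies the inverse normal function to 1 − α for one-tailed tests and to 1 − α/2 for two-tailed tests.

```python
from statistics import NormalDist

std_normal = NormalDist()

def z_rejection_regions(alpha):
    """Describe the z-test rejection region for each type of tail."""
    one_tail = std_normal.inv_cdf(1 - alpha)       # all of alpha in one tail
    two_tail = std_normal.inv_cdf(1 - alpha / 2)   # alpha split between two tails
    return {
        "left": f"z < {-one_tail:.3f}",
        "right": f"z > {one_tail:.3f}",
        "two": f"z < {-two_tail:.3f} or z > {two_tail:.3f}",
    }

for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}")
    for tail, region in z_rejection_regions(alpha).items():
        print(f"  {tail}-tailed: {region}")
```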
Critical values for the Student’s t-distribution were also discussed. Recall that the
Student’s t-distribution is different for different sample sizes. The degrees of freedom, n -1,
changes the distribution, which will change the critical values. Therefore, the critical values
for the t-test are dependent on the level of significance, the tail of the test, and the degrees
of freedom.
Suppose we use a sample size of 10 and a level of significance of 0.05. If we are performing a right-
tailed t-test, we can find the critical value t.05 using the invT function in the calculator
with df = n - 1 = 9:
We can restate this by saying P(t > 1.833) = 0.05 when n = 10 and the test is right-tailed. The
rejection region would be t > 1.833.
Recall that the Student’s t-distribution is symmetric around zero like the normal
distribution so that in a left-tailed test where n = 10, the t critical value would be negative,
t.05 = -1.833 and the rejection region would be when t < - 1.833.
If the test was two-tailed with n = 10 and the level of significance of 0.05, we can find the
t.025 using the invT function in the calculator:
This means that the critical values for a two tailed t-test with a level of significance of 0.05
and n = 10 are -2.26 and 2.26. The rejection regions would be t < -2.26 and t > 2.26.
Unfortunately, there is not a concise table of rejection regions for t-tests like the one for z-tests,
because the critical values change depending on the degrees of freedom, the tails of the test, and the level
of significance.
P-value method versus critical value method
The first few steps are the same for both methods of hypothesis testing. The two methods
differ only after the test statistic has been found: the p-value method compares the p-value
with α, while the critical value method compares the test statistic with the critical value.
The p-value is the probability, assuming H0 is true, of getting a test statistic at least as
extreme as the one observed. The critical value is the value from the distribution that
separates the rejection region from the non-rejection region. Here is a recap of the rejection rules for both methods:
Below are examples of hypothesis tests that follow the five-step method with the critical value
method. Each type of test previously done in the chapter can be done using critical values
instead of comparing p-values.
Example 10.13
According to a recent study, the mean cost of a heart-bypass operation is slightly less than
$26,100, and approximately 230,000 operations are performed annually. A sample of 36
bypass operations showed a mean cost of $25,000. Assume the population standard
deviation is $2400. Develop and test an appropriate hypothesis to see whether or not the
mean cost of a bypass operation is less than $26,100 using α = 0.05.
Solution 10.13
This is a test for mean when the population standard deviation is known. We will follow the
five-step method using the critical values.
(Step 1) The claim being tested is μ < 26,100, which will be the alternative hypothesis, Ha.
So the hypotheses are: H0: μ ≥ 26,100 vs. Ha: μ < 26,100.
(Step 2) Since the alternative hypothesis is less than, “<”, it is a left tailed test. The level of
significance is 0.05.
(Step 3) We are testing a claim about a mean μ, and the population SD σ is known, so we
use a Z-Test.
This is a left-tailed test, so the critical value will be the z-score that has an area of .05 to the
left; i.e. the critical value is z.05 = -1.645.
(Step 4) The test statistic, -2.75, is to the left of the critical value, -1.645. Since
z = -2.75 < -1.645, we reject H0.
(Step 5) There is sufficient evidence to conclude that μ < 26,100. That is, there is
significant evidence that the mean cost of a bypass operation is less than $26,100.
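To see that the critical value method and the p-value method lead to the same decision, both can be applied to the numbers in Example 10.13. This is a sketch under the example's assumptions (left-tailed Z-Test, σ known), using Python's standard statistics.NormalDist in place of the calculator.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma = 26100, 2400     # hypothesized mean and population SD
xbar, n = 25000, 36          # sample mean and sample size
alpha = 0.05

std_normal = NormalDist()
z = (xbar - mu0) / (sigma / sqrt(n))     # test statistic
z_crit = std_normal.inv_cdf(alpha)       # left-tailed critical value
p_value = std_normal.cdf(z)              # area to the left of z

reject_by_critical_value = z < z_crit
reject_by_p_value = p_value < alpha

print(f"z = {z:.2f}, critical value = {z_crit:.3f}, p-value = {p_value:.4f}")
print(f"critical value method rejects H0: {reject_by_critical_value}")
print(f"p-value method rejects H0: {reject_by_p_value}")
```

For a left-tailed test the two comparisons are equivalent, because the area to the left of z is below α exactly when z lies to the left of the critical value.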
Example 10.14
An article in the Wall Street Journal claims that the mean time it takes a taxi to travel
from Manhattan to LaGuardia is now more than 35 minutes. Suppose a new study using
40 taxis has an average of 42 minutes travel time from Manhattan to LaGuardia. Assume
the population standard deviation is 5.1 minutes. Develop and test an appropriate
hypothesis to see whether or not the mean travel time is more than 35 minutes using α =
0.05.
Solution 10.14
This is a test for mean when the population standard deviation is known. We will follow the
five-step method using the critical values.
(Step 2) Set up the correct test and identify the level of significance:
(Step 3) Calculate the test statistic: Since the population standard deviation is known, we perform a Z-test.
(Step 4) Use these results to make the decision about H0. The z critical value is 1.645 for a
right-tailed test when the level of significance is 0.05. The test statistic for the z-test is 8.68.
Since z = 8.68 > 1.645, we reject H0.
(Step 5) There is enough evidence at 5% level of significance to support the claim that the
mean travel time from Manhattan to LaGuardia is now more than 35 minutes.
Example 10.15
An engineer at Duracell has designed a new battery for industrial use. The old-style battery
had an average life of 29.2 hours. The engineer claims the new battery is an improvement
over the old one since the new design has a longer life. A sample of 20 new batteries finds a
mean life of 32.3 hours with a standard deviation of 3.9 hours. Test the engineer's claim
using α = 0.01.
Solution 10.15
This is a test for mean when the population standard deviation is unknown. We will follow
the five-step method using the critical values.
(Step 1) The engineer’s claim is that the new battery is an improvement, meaning that it
will have a longer average life. So, the claim is that μ > 29.2 hours, and the hypotheses are:
(Step 2) The alternative hypothesis is “greater than”, so we have a right-tailed test. The level
of significance, which is given in the problem, is 0.01.
(Step 3) We are testing a claim about a population mean μ and the population standard
deviation is unknown, so we use a t-test.
Using the given sample statistics, the test value is $t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{32.3 - 29.2}{3.9/\sqrt{20}} = 3.55$.
(Step 4) Before deciding about the null hypothesis we must find the critical value; the
critical value will be t.01, the t value so that the area to the right of it is exactly 0.01. This
value can be found using the invT function; this function works just like the invNorm
function that we have used many times. The inputs will be the cumulative probability and
the degrees of freedom. In this problem, n = 20, so df = 19. And since the area in the right
tail will be .01, the area to the left – the cumulative probability – will be .99.
On the calculator press 2nd VARS to access the DISTR menu and select the invT function.
The critical value is:
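So we will reject H0 if and only if t > 2.539. Since the test statistic t = 3.55 is greater than 2.539, we reject H0. The rejection region is shown below:
[Figure: Distribution plot of the t-distribution with df = 19, with Density on the vertical axis and X on the horizontal axis. The area of 0.01 to the right of 2.539 is shaded.]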
(Step 5) Thus, the data provides sufficient evidence that μ > 29.2 hours. There is sufficient
statistical evidence to support the engineer’s claim.
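The decision in Example 10.15 can be double-checked with a short calculation. Python's standard library has no inverse t function, so this sketch does not compute the critical value; it takes 2.539 from the text's invT result as given and only recomputes the test statistic and the comparison.

```python
from math import sqrt

mu0 = 29.2                    # mean life of the old-style battery (H0)
xbar, s, n = 32.3, 3.9, 20    # sample mean, sample SD, sample size
t_crit = 2.539                # t.01 with df = 19, as reported in the text

df = n - 1
t = (xbar - mu0) / (s / sqrt(n))   # test statistic
reject_h0 = t > t_crit             # right-tailed rejection rule

print(f"df = {df}, t = {t:.2f}, critical value = {t_crit}, reject H0: {reject_h0}")
```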
Example 10.16
When Mendel conducted his famous genetic experiments with peas, one sample of offspring
consisted of 428 green peas and 152 yellow peas. Use a 0.01 level of significance to test
Mendel’s claim that under the same circumstances, 25% of the offspring will be yellow.
Solution to 10.16
This is a test for population proportion. We will follow the five-step method using the
critical values.
(Step 1) Set up the hypotheses:
H0: p = 0.25
Ha: p ≠ 0.25
(Step 2) Select the correct test and identify the significance level α:
Since Ha is “not equal to” it is a two-tailed test. The significance level is 0.01.
(Step 3) Calculate the test statistic.
Since this is a test for a proportion, use 1-PropZTest on the calculator.
(Step 4) Use these results to make a decision about H0. First, we must find the critical
values for the two-tailed test at a 0.01 significance level. Since the test is two-tailed, we split
the significance level between the two tails: 0.01/2 = 0.005.
So, the rejection regions are z < -2.576 or z > 2.576. Since the test statistic z = 0.671 lies
between -2.576 and 2.576, we fail to reject H0.
(Step 5) Interpret the decision about H0 in the context of the given problem to state a
conclusion.
There is not enough evidence to reject the claim that 25% of the offspring peas will be
yellow.
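The numbers quoted in this solution (test value 0.671, critical values ±2.576) can be reproduced from the pea counts. The sketch below assumes the usual one-proportion z statistic and uses Python's standard statistics.NormalDist for the critical value; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

green, yellow = 428, 152   # Mendel's sample from Example 10.16
p0 = 0.25                  # proportion of yellow peas claimed in H0
alpha = 0.01

n = green + yellow
p_hat = yellow / n                                 # sample proportion of yellow peas
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)         # test statistic
z_crit = NormalDist().inv_cdf(1 - alpha / 2)       # two-tailed critical value
reject_h0 = abs(z) > z_crit

print(f"n = {n}, p-hat = {p_hat:.4f}, z = {z:.3f}")
print(f"critical values = ±{z_crit:.3f}, reject H0: {reject_h0}")
```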
Z-Test
• Used to test a claim about a population mean μ.
• Assumptions: Data comes from a random sample. Sampling from a normal distribution, or sample size n > 30. The population SD, σ, is known.
• Test statistic: $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
• Calculator function: Z-Test

T-Test
• Used to test a claim about a population mean μ.
• Assumptions: Data comes from a random sample. Sampling from a normal distribution. The population SD, σ, is not known.
• Test statistic: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$
• Calculator function: T-Test

1-PropZTest
• Used to test a claim about a population proportion p.
• Assumptions: Data comes from a random sample. The sample size and hypothesized proportion satisfy: np0 > 10 and n(1 – p0) > 10.
• Test statistic: $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$
• Calculator function: 1-PropZTest
10.6 | Type I and Type II Errors
We have looked briefly at Type I errors, but now we will spend more time investigating
Type I and Type II errors. When we perform a hypothesis test, the decision to reject H0
depends on randomly selected sample data – so there is always the possibility of an error.
No matter how careful we are, there is always a small probability that we will just get an
unusual sample which leads us to the wrong decision. Since there are two possible
decisions, and each of these can be either correct or incorrect, there are a total of four
possible outcomes for a test, which are summarized in the following table:
                      H0 is true          H0 is false
Reject H0             Type I error        Correct decision
Do not reject H0      Correct decision    Type II error
As we see, there are two outcomes where the correct decision is made, and two outcomes
where an error occurs.
A Type I error occurs when the decision is to reject H0 when in fact H0 is true. The
probability of a Type I error is denoted by the Greek letter α.
A Type II error occurs when the decision is to not reject H0 when in fact H0 is false. The
probability of a Type II error is denoted by the Greek letter β.
Obviously, both α and β should be as small as possible because they are probabilities of
errors.
There is one more probability related to this chart; the Power of the Test is the probability
of rejecting the null hypothesis when H0 is false. Thus, the Power of the Test is 1 – β.
Ideally, we want a high power that is as close to one as possible. Increasing the sample size
can increase the Power of the Test.
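The effect of sample size on power can be illustrated with a right-tailed Z-Test. The sketch below is an illustration, not a calculation from the text: it borrows μ0 = 12.2 and σ = 0.6 from the bottling exercise 10.6 and assumes, purely hypothetically, that the true mean is 12.5 oz. The function name right_tailed_z_power is also an invention for this example. Power is computed as 1 – β, where β is the probability that the test statistic falls short of the critical value when the true mean is the assumed one.

```python
from math import sqrt
from statistics import NormalDist

def right_tailed_z_power(mu0, mu_true, sigma, n, alpha):
    """Power of a right-tailed Z-Test when the true mean is mu_true (> mu0)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)           # reject H0 when z > z_crit
    shift = (mu_true - mu0) / (sigma / sqrt(n))      # how far the true mean moves z
    beta = std_normal.cdf(z_crit - shift)            # P(Type II error)
    return 1 - beta

mu0, sigma, alpha = 12.2, 0.6, 0.05
mu_true = 12.5    # hypothetical true mean, assumed only for illustration

powers = {n: right_tailed_z_power(mu0, mu_true, sigma, n, alpha) for n in (9, 36, 100)}
for n, power in powers.items():
    print(f"n = {n:3d}: power = {power:.4f}, beta = {1 - power:.4f}")
```

Under these assumptions the power rises as n grows, which is the behavior the paragraph describes.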
Example 10.17
Suppose the null hypothesis, H0, is: Frank's rock climbing equipment is safe. Find the Type
I and Type II error. Also explain α and β.
Solution 10.17
Type I error: Frank thinks that his rock-climbing equipment may not be safe when, in fact,
it really is safe.
Type II error: Frank thinks that his rock-climbing equipment may be safe when, in fact,
it is not safe.
α = probability that Frank thinks his rock-climbing equipment may not be safe when, in
fact, it really is safe.
β = probability that Frank thinks his rock-climbing equipment may be safe when, in fact, it
is not safe.
Notice that, in this case, the error with the greater consequence is the Type II error; if
Frank thinks his rock-climbing equipment is safe, he will go ahead and use it.
10.10 Suppose the null hypothesis, H0, is: The blood cultures contain no traces of pathogen
X. State the Type I and Type II errors.
Example 10.18
Suppose the null hypothesis, H0, is: The victim of an automobile accident arriving at the
emergency room of a hospital is infected with HIV.
a. What type of error has a greater consequence, a Type I or Type II error?
b. Describe the probability α in words.
Solution 10.18
Type I error: The emergency crew thinks that the victim does not have HIV when, in
reality the victim is infected.
Type II error: The emergency crew thinks that the victim does have HIV, when in reality
the patient is not infected.
a. The error with the greater consequence is the Type I error. If the emergency crew
mistakenly believes that the victim is uninfected with HIV, then they may not take proper
precautions to avoid infection while treating him.
b. Recall that α is the probability of a Type I error; that is, the probability that the
emergency room staff thinks the patient does not have HIV when in fact he/she does have
the virus.
10.11 Suppose the null hypothesis, H0, is: a patient is not sick. Which type of error has the
greater consequence, Type I or Type II?
Example 10.19
A certain experimental drug claims a cure rate of at least 75% for males with prostate
cancer. Describe both the Type I and Type II errors in context. Which error is the more
serious, a Type I or Type II error?
Solution 10.19
The claim is that the cure rate is at least 75%. So the claim is p ≥ 0.75; this statement
includes an equal sign, and so it will be the null hypothesis H0. Thus, the errors are:
Type I Error: A cancer patient believes the cure rate for the drug is less than 75% when it
actually is at least 75%.
Type II Error: A cancer patient believes the experimental drug has at least a 75% cure rate
when it has a cure rate that is less than 75%.
In this scenario, the Type II error contains the more severe consequence. If a patient
believes the drug works at least 75% of the time, this most likely will influence the patient’s
(and doctor’s) choice about whether to use the drug as a treatment option.
10.12 Determine the Type I and Type II errors for the following scenario:
Assume a null hypothesis, H0, states that the percentage of adults with jobs is at least 88%.
Identify the Type I and Type II errors from these four statements.
a. Not to reject the null hypothesis that the percentage of adults who have jobs is at
least 88% when that percentage is actually less than 88%.
b. Not to reject the null hypothesis that the percentage of adults who have jobs is at
least 88% when the percentage is actually at least 88%.
c. Reject the null hypothesis that the percentage of adults who have jobs is at least
88% when the percentage is actually at least 88%.
d. Reject the null hypothesis that the percentage of adults who have jobs is at least
88% when that percentage is actually less than 88%.
KEY TERMS and FORMULA REVIEW
Hypothesis Test: A procedure for determining whether the hypothesis stated is a
reasonable statement and should not be rejected or is unreasonable and should be rejected.
p-value: The probability that an event will happen purely by chance assuming the null
hypothesis is true. The smaller the p-value, the stronger the evidence is against the null
hypothesis.
Type I Error: The decision is to reject the null hypothesis when, in fact, the null
hypothesis is true.
Type II Error: The decision is not to reject the null hypothesis when, in fact, the null
hypothesis is false.
T-Test: A test for a population mean, μ that is used when the population SD, σ, is
unknown.
Z-Test: A test for a population mean μ that is used when the population SD, σ, is known.
CHAPTER REVIEW
All hypothesis tests use the same basic steps, which we have called the Five Step Method:
(Step 1) The first step is to set up the hypotheses. The null hypothesis H0 and the
alternative hypothesis Ha are contradictory claims about one or more unknown population
parameters. There are a few basic rules for setting these up:
(Step 3) The third step is to calculate the p-value and the test statistic. We can get both very
easily using the TI-84 family of calculators.
The p-value is the probability that, if the null hypothesis is true, the results from another
randomly selected sample will be as extreme as or more extreme than the results obtained
from the given sample.
The test statistic is a measure of relative position, and the formula depends on the type of
test being used (see below).
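The meaning of the p-value can be checked by simulation. The sketch below is an illustration only and assumes a right-tailed Z-test; the values μ0 = 50, σ = 10, n = 25, and a sample mean of 53.5 are hypothetical. It draws many samples from a population in which H0 is true and counts how often the sample mean is at least as extreme as the observed one.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

random.seed(1)  # fixed seed so the sketch is repeatable

mu0, sigma, n = 50.0, 10.0, 25   # hypothetical null mean, known population SD, sample size
observed_mean = 53.5             # hypothetical sample mean; Ha: mu > mu0 (right-tailed)

# Exact p-value from the standard normal distribution.
z = (observed_mean - mu0) / (sigma / sqrt(n))
exact_p = 1 - NormalDist().cdf(z)

# Simulated p-value: sample repeatedly from a world where H0 is true and
# count the samples whose mean is at least as large as the observed mean.
trials = 20_000
as_extreme = 0
for _ in range(trials):
    sample_mean = fmean(random.gauss(mu0, sigma) for _ in range(n))
    if sample_mean >= observed_mean:
        as_extreme += 1
simulated_p = as_extreme / trials

print(f"z = {z:.2f}, exact p-value = {exact_p:.4f}, simulated p-value = {simulated_p:.4f}")
```

The two p-values should agree closely; the simulated one varies slightly with the seed and the number of trials.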
(Step 4) The fourth step is to make the decision about H0. In a test, there are only two possible
decisions: Either we reject H0 or we do not reject H0. This decision is made by comparing the
p-value found in Step 3 to a predetermined significance level α according to the following rule:
(Step 5) The fifth step is to write a conclusion. After we make our decision about H0, we
interpret the decision in the context of the given problem to write a meaningful conclusion
for the test in terms of the given problem. That is, we should write a statement, in plain
English, that explains the result of the test. A few things to keep in mind:
• If we reject H0, then there is sufficient evidence to conclude that H0 is incorrect. Thus,
rejecting H0 provides evidence that the alternative hypothesis Ha is true.
• If we fail to reject H0, then there is not sufficient evidence to conclude that the
alternative hypothesis Ha is true.
• Failing to reject H0 does not mean that we have evidence that H0 is true. It simply
means that the sample data have failed to provide sufficient evidence to cast serious
doubt about the truthfulness of H0.
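Steps 4 and 5 can be sketched as a small helper. This is an illustration under a stated assumption: it uses the common convention of rejecting H0 when the p-value is less than α. The function name and the wording template are hypothetical, not taken from the text.

```python
def decide_and_conclude(p_value, alpha, ha_in_words):
    """Return (decision, conclusion) for a test, given Ha stated in plain English.

    Assumed convention: reject H0 when p_value < alpha.
    """
    if p_value < alpha:
        decision = "reject H0"
        conclusion = f"There is sufficient evidence to conclude that {ha_in_words}."
    else:
        decision = "do not reject H0"
        # Failing to reject H0 is not evidence that H0 is true.
        conclusion = f"There is not sufficient evidence to conclude that {ha_in_words}."
    return decision, conclusion

# Hypothetical p-values compared against alpha = 0.05.
print(decide_and_conclude(0.012, 0.05, "the mean is greater than 4.5 hours"))
print(decide_and_conclude(0.310, 0.05, "the mean is greater than 4.5 hours"))
```

Note that both conclusions are phrased in terms of Ha; neither one claims that H0 has been shown to be true.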
Summary of Tests:
Z-Test:
• Used to test a claim about a population mean μ when σ (the population SD) is
known.
• Test statistic: $z = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$
Rejection Regions for Z-Test
Significance Level   Left-Tailed   Right-Tailed   Two-Tailed
α = .10              z < -1.28     z > 1.28       z < -1.645 or z > 1.645
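The Z-test statistic and the cutoffs of the rejection regions can be reproduced with a short sketch. It is illustrative only: the function names are hypothetical, and the sample values (a sample mean of 24.2, μ0 = 25, σ = 5, n = 100) are made up for the example.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution: mean 0, SD 1

def z_statistic(sample_mean, mu0, sigma, n):
    """z = (x-bar - mu0) / (sigma / sqrt(n)), for a known population SD sigma."""
    return (sample_mean - mu0) / (sigma / sqrt(n))

def critical_values(alpha):
    """z cutoffs for left-tailed, right-tailed and two-tailed tests at level alpha."""
    one_tail = STD_NORMAL.inv_cdf(1 - alpha)       # all of alpha in one tail
    two_tail = STD_NORMAL.inv_cdf(1 - alpha / 2)   # alpha split evenly between two tails
    return {"left": -one_tail, "right": one_tail, "two": two_tail}

cv = critical_values(0.10)
print(f"left-tailed:  reject H0 if z < {cv['left']:.3f}")
print(f"right-tailed: reject H0 if z > {cv['right']:.3f}")
print(f"two-tailed:   reject H0 if z < {-cv['two']:.3f} or z > {cv['two']:.3f}")

# Hypothetical sample: z falls in the left-tailed rejection region at alpha = .10,
# but not in the two-tailed rejection region.
z = z_statistic(sample_mean=24.2, mu0=25, sigma=5, n=100)
print(f"z = {z:.2f}")
```

The same observed z can lead to different decisions depending on whether the test is one-tailed or two-tailed, which is why the alternative hypothesis has to be fixed before looking at the data.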
T-Test:
• Used to test a claim about a population mean μ when σ (the population SD) is not
known.
• Test statistic: $t = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}$
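The T-test statistic can be computed from raw data as in the sketch below. This is an illustration with hypothetical data and a hypothetical function name. It stops at the test statistic and the degrees of freedom, because Python's standard library has no Student's t-distribution; the p-value would still come from a calculator or a t-table.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(data, mu0):
    """Return (t, degrees of freedom) for a test of H0: mu = mu0 with sigma unknown."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)                     # sample SD (divides by n - 1)
    t = (x_bar - mu0) / (s / sqrt(n))
    return t, n - 1

# Hypothetical sample of five measurements, tested against mu0 = 10.
t, df = t_statistic([8, 10, 12, 14, 16], mu0=10)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

The only change from the Z-test statistic is that the sample SD s replaces the unknown population SD σ, which is why the t-distribution with n – 1 degrees of freedom is used instead of the normal distribution.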
One-Proportion Z-test:
• The sample size and hypothesized proportion satisfy: np0 > 10 and n(1 – p0) > 10.
• Test statistic: $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}$
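A one-proportion Z-test can be sketched in the same way. The example is hypothetical (60 successes in a sample of 200, tested against p0 = 0.25, right-tailed), and the function name is an assumption made for illustration. The sample-size condition is checked before the statistic is computed.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z(successes, n, p0):
    """Return (p-hat, z) for a test of H0: p = p0, after checking the sample-size condition."""
    if not (n * p0 > 10 and n * (1 - p0) > 10):
        raise ValueError("need n*p0 > 10 and n*(1 - p0) > 10 to use the normal approximation")
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    return p_hat, z

# Hypothetical data: 60 successes out of 200, H0: p = 0.25, Ha: p > 0.25.
p_hat, z = one_proportion_z(successes=60, n=200, p0=0.25)
p_value = 1 - NormalDist().cdf(z)   # right-tailed p-value
print(f"p-hat = {p_hat:.3f}, z = {z:.3f}, p-value = {p_value:.4f}")
```

Note that the standard error uses the hypothesized proportion p0, not the sample proportion, because the calculation assumes H0 is true.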
Chapter 10 Exercises
1. A test is being conducted to determine whether the mean speed of a cable internet
connection is more than three Megabits per second.
a. What is the random variable? Describe in words.
b. State the null and alternative hypotheses for the test.
2. The mean entry level annual salary of an employee at a large company is $58,000. An
economist wishes to test the claim that the mean entry level salary is higher for IT
professionals in the company. State the null and alternative hypotheses.
3. A test will be conducted to test the claim that the mean number of children for American
families is 2.
a. What is the random variable? Describe in words.
b. State the null and alternative hypotheses for the test.
4. A sociologist claims that the probability that a person picked at random in Times Square
in New York City is visiting the area is 0.83. A tourism expert thinks that the proportion is
actually less. Assume that the tourism expert’s claim is tested using an appropriate
hypothesis test.
5. In a population of fish, it is widely believed that approximately 42% are female. A biologist
will conduct a hypothesis test to determine if the proportion of female fish is less than 42%.
State the null and alternative hypotheses for the test.
6. An article in the 1990s stated that the mean time spent in jail by a first-time convicted
burglar was 2.5 years. A study was then done to see if the mean time in jail has increased in
the new century. A random sample of 26 first-time convicted burglars in a recent year was
picked. The mean length of time in jail from the survey was 3 years. Assume that the
distribution of the population is normal, and that the population standard deviation is 1.5. If
you were conducting a hypothesis test to determine if the mean length of jail time has
increased, what would the null and alternative hypotheses be?
7. A random survey of 75 death row inmates revealed that the mean length of time on death
row is 17.4 years with a standard deviation of 6.3 years. If you were conducting a hypothesis
test to determine if the mean time spent on death row is equal to 15 years, what would the null
and alternative hypotheses be?
a. H0: b. Ha:
8. The National Institute of Mental Health published an article stating that in any one-year
period, approximately 9.5 percent of American adults suffer from depression or a depressive
illness. Suppose that we are conducting a hypothesis test to determine if the true proportion
of people in a given town suffering from depression or a depressive illness is lower than the
percent in the general adult American population, what would the null and alternative
hypotheses be?
a. H0: b. Ha:
9. A market analyst claims that the mean price of mid-sized cars in the Midwest region is
$32,000. A suitable hypothesis test will be conducted to determine if the claim is true. State
the Type I and Type II errors in complete sentences.
10. A group of doctors is deciding whether or not to perform an operation. Suppose the null
hypothesis H0, in words, is: The surgical procedure will go well.
11. The power of a test is 0.981. What is the probability of a Type II error?
12. A group of divers is exploring an old sunken ship. Suppose the null hypothesis, H0, is:
the sunken ship does not contain buried treasure. State the Type I and Type II errors.
13. A microbiologist is testing a water sample for E-coli. Suppose the null hypothesis, H0, is:
the sample does not contain E-coli. The probability that the sample does not contain E-coli,
but the microbiologist thinks it does is 0.012. The probability that the sample does contain
E-coli, but the microbiologist thinks it does not is 0.002.
15. A population has a standard deviation of σ = 5. A test is to be conducted to test the claim
that the mean is μ = 25. A sample of 108 individuals yields a sample mean of 24. What
distribution should you use to perform a hypothesis test?
16. It is thought that 42% of respondents in a taste test would prefer Brand A. Suppose we
want to test the claim that the true population proportion is less than 42%. We select a
random sample of 100 people, and find that 39% preferred Brand A. What distribution should
we use to perform a hypothesis test?
17. Suppose we are conducting a hypothesis test of a single population mean using a
Student’s t-distribution. What must we assume about the distribution of the data?
18. Suppose we are conducting a hypothesis test for a single population proportion. What
must be true about the quantities of np and nq= n(1 – p)?
19. It is believed that the mean height of high school students who play basketball in a large
urban district is 73 inches with a standard deviation of σ = 1.8 inches. A random sample
of 40 players is chosen. The sample mean was 71 inches, and the sample standard deviation
was 1.5 inches. Do the data support the claim that the mean height is less than 73 inches?
20. It is conjectured that the mean age of graduate students at a large state university is at
most 31 years. To test this claim, a random sample of 15 graduate students is taken. The
sample mean is 32 years, and the sample standard deviation is two years. Assume that the
ages are normally distributed and conduct an appropriate test.
a. Does the shaded region represent a low or a high p-value compared to a level of
significance of 1%?
b. Is this a right-tailed, left-tailed or two-tailed test? Explain.
22. Consider the statement, “If you do not reject the null hypothesis, then H0 must be
true.” Is this statement correct? Explain why or why not.
23. Suppose that a recent article stated that the mean time spent in jail by a first-time
convicted burglar is 2.5 years. A study was then done to see if the mean time has increased
in the new century. A random sample of 26 first-time convicted burglars in a recent year is
selected. The mean length of time in jail from the survey was three years with a standard
deviation of 1.8 years. Suppose that it is somehow known that the population standard
deviation is 1.5.
Conduct a hypothesis test to determine if the mean length of jail time has increased. Assume
the distribution of the jail times is approximately normal.
24. A random survey of 75 death row inmates revealed that the mean length of time on death
row is 17.4 years with a standard deviation of 6.3 years. Conduct a hypothesis test to
determine if the mean time on death row for all inmates is at most 15 years.
25. Suppose that we conduct a test with hypotheses H0: μ = 9 and Ha: μ < 9.
Is this a left-tailed, right-tailed, or two-tailed test?
26. Suppose that we conduct a test with hypotheses H0: μ ≤ 6 and Ha: μ > 6.
Is this a left-tailed, right-tailed, or two-tailed test?
27. Suppose that we conduct a test with hypotheses H0: p = 0.25 and Ha: p ≠ 0.25.
Is this a left-tailed, right-tailed, or two-tailed test?
28. A certain brand bottle of water is labeled as containing 16 fluid ounces of water. You
believe that the bottles contain less than 16 oz. Suppose that you will conduct a hypothesis
test for your claim. Would this be a left-tailed, right-tailed, or two-tailed test?
29. A golf pro claims that his mean golf score is 63. A sportswriter wants to test the claim
that his mean score is higher than 63. Would this be a left-tailed, right-tailed, or two-tailed test?
30. A bathroom scale claims to be able to correctly identify any weight within a pound. You
suspect that it cannot be that accurate and will conduct an appropriate test to test your
claim. Would this be a left-tailed, right-tailed, or two-tailed test?
31. You flip a coin and record whether it shows heads or tails. The coin is supposed to be
balanced, but you suspect that the probability of getting for this particular coin is less than
50%. If you were to test your claim, would this be a left-tailed, right-tailed, or two-tailed
test?
32. If the alternative hypothesis has a not equals ( ≠ ) symbol, what type of test does this
signify?
33. Assume the null hypothesis states that the mean is at least 18. Is this a left-tailed,
right-tailed, or two-tailed test?
34. Assume the null hypothesis states that the mean is at most 12. Is this a left-tailed,
right-tailed, or two-tailed test?
35. Each of the following statements refers to a claim for a hypothesis test; so, each could
be either the null hypothesis or the alternative hypothesis. In each case, state the null
hypothesis H0 and the alternative hypothesis Ha, in terms of the appropriate parameter
(either μ or p).
36. Refer to problem 35; classify each test as either left-tailed, right-tailed, or two-tailed.
37. Refer to problem 35; for each test, state the Type I and Type II errors in complete
sentences.
38. Over the past few decades, public health officials have examined the link between weight
concerns and teen girls' smoking. Researchers surveyed a group of 273 randomly selected
teen girls living in Massachusetts (between 12 and 15 years old). After four years the girls
were surveyed again. Sixty-three said they smoked to stay thin. This data is used to test the
claim that more than thirty percent of the teen girls smoke to stay thin. The alternative
hypothesis is:
39. A statistics instructor wishes to test the claim that fewer than 20% of students at her
college attended the opening night midnight showing of the latest Harry Potter movie. She
surveys 84 of her students and finds that 11 attended the midnight showing. An appropriate
alternative hypothesis would be:
40. Some years ago, an organization reported that on average, teenagers spent 4.5 hours per
week on the phone. The organization thinks that the current mean is higher. A test will be
conducted to test whether the current mean is higher than the previously reported value of
4.5 hours per week. Fifteen randomly chosen teenagers were asked how many hours per week
they spend on the phone. The sample mean was 4.75 hours with a sample standard deviation
of 2.0. The null and alternative hypotheses are:
c. H0: μ = 4.75, Ha: μ > 4.75
d. H0: μ = 4.5, Ha: μ > 4.5
41. When a new drug is created, the pharmaceutical company must subject it to testing before
receiving the necessary permission from the Food and Drug Administration (FDA) to market
the drug. Suppose the null hypothesis is “the drug is unsafe.” What is the Type II Error?
42. A statistics instructor believes that fewer than 20% of McHenry Community College
(MCC) students attended the opening midnight showing of the latest Harry Potter movie. To
test this claim she surveys 84 of her students and finds that 11 of them attended the midnight
showing. What is the Type I error for the test?
a. To conclude that the percent of MCC students who attended is at least 20%, when in
fact, it is less than 20%.
b. To conclude that the percent of MCC students who attended is 20%, when in fact, it
is 20%.
c. To conclude that the percent of MCC students who attended is less than 20%, when
in fact, it is 20% or more.
d. To conclude that the percent of MCC students who attended is less than 20%, when
in fact, it is less than 20%.
43. It is believed that College of Lake County (CLC) Intermediate Algebra students get less
than seven hours of sleep per night, on average. A survey of 22 CLC Intermediate Algebra
students generated a mean of 7.24 hours with a standard deviation of 1.93 hours. A test is
conducted to test the claim that CLC Intermediate Algebra students get less than
seven hours of sleep per night, on average. Which of the following statements is true?
a. The Type II error is to not reject that the mean number of hours of sleep CLC
students get per night is at least seven when, in fact, the mean number of hours is
more than seven hours.
b. The Type II error is to not reject that the mean number of hours of sleep CLC
students get per night is at least seven when, in fact, the mean number of hours is at
most seven hours.
c. The Type II error is to not reject that the mean number of hours of sleep CLC
students get per night is at least seven when, in fact, the mean number of hours is at
least seven hours.
d. The Type II error is to not reject that the mean number of hours of sleep CLC
students get per night is at least seven when, in fact, the mean number of hours is
less than seven hours.
44. Previously, an organization reported that teenagers spent 4.5 hours per week, on average,
on the phone. The organization thinks that, currently, the mean is higher. Fifteen randomly
chosen teenagers were asked how many hours per week they spend on the phone. The sample
mean was 4.75 hours with a sample standard deviation of 2.0. Suppose that a hypothesis test
is conducted to test the claim that the mean hours is more than 4.5 hours per week. What is
the Type I error?
a. To conclude that the current mean is higher than 4.5 hrs/week, when in fact, it is
higher.
b. To conclude that the current mean is more than 4.5 hrs/week, when in fact, it is still
4.5 hrs.
c. To conclude that the current mean hours per week is 4.5 hrs/week, when in fact, it is
higher.
d. To conclude that the current mean is no higher than 4.5 hrs/week, when in fact,
it is not higher.
45. It is believed that College of Lake County (CLC) Intermediate Algebra students get less
than seven hours of sleep per night, on average. A survey of 22 CLC Intermediate Algebra
students generated a mean of 7.24 hours with a standard deviation of 1.93 hours. A test is
conducted to test the claim that CLC Intermediate Algebra students get less than
seven hours of sleep per night, on average.
a. $N\left(7.24, \dfrac{1.93}{\sqrt{22}}\right)$ b. $N(7.24, 1.93)$ c. $t_{22}$ d. $t_{21}$
46. The National Institute of Mental Health published an article stating that in any one-year
period, approximately 9.5 percent of American adults suffer from depression or a depressive
illness. Suppose that in a survey of 100 people in a certain town, seven of them suffered from
depression or a depressive illness. Conduct a hypothesis test to determine if the true
proportion of people in that town suffering from depression or a depressive illness is lower
than the percent in the general adult American population.
For the remaining exercises, use the FIVE STEP METHOD and a suitable hypothesis test to
answer each question. For problems involving a Student's-t distribution, you may assume
that the underlying population is normally distributed.
47. A particular brand of tires claims that its deluxe tire averages at least 50,000 miles
before it needs to be replaced. From past studies of this tire, the standard deviation is
known to be 8,000. A survey of owners of that tire design is conducted. From the 28 tires
surveyed, the mean lifespan was 46,500 miles with a standard deviation of 9,800 miles.
Using α = 0.05, is the data highly inconsistent with the claim?
48. From generation to generation, the mean age when smokers first start to smoke varies.
However, the standard deviation of that age remains constant at around 2.1 years. A survey
of 40 smokers of this generation was done to see if the mean starting age is at least 19. The
sample mean was 18.1 with a sample standard deviation of 1.3. Do the data support the
claim at the 5% level?
49. The cost of a daily newspaper varies from city to city. However, the variation among prices
remains steady with a standard deviation of 20¢. A study was done to test the claim that the
mean cost of a daily newspaper is $1.00. Twelve costs yield a mean cost of 95¢ with a standard
deviation of 18¢. Do the data support the claim at the 1% level?
50. An article in the San Jose Mercury News stated that students in the California state
university system take 4.5 years, on average, to finish their undergraduate degrees. Suppose
you believe that the mean time is longer. You conduct a survey of 49 students and obtain a
sample mean of 5.1 with a sample standard deviation of 1.2. Do the data support your claim
at the 1% level?
51. The mean number of sick days an employee takes per year is believed to be about ten.
Members of a personnel department do not believe this figure. They randomly survey eight
employees. The number of sick days they took for the past year are as follows:
Let x = the number of sick days they took for the past year. Does this data provide evidence
to support the personnel department’s claim the mean number of sick days differs from ten?
Conduct an appropriate test using a significance level of α = .05.
52. In 1955, Life Magazine reported that the 25-year-old mother of three worked, on
average, an 80-hour week. Recently, many groups have been studying whether or not the
women's movement has, in fact, resulted in an increase in the average work week for
women (combining employment and at-home work). Suppose a study was done to determine
if the mean work week has increased. 81 women were surveyed with the following results.
The sample mean was 83; the sample standard deviation was 10. Using a 5% significance
level, does this data provide evidence that the mean number of hours worked by women
each week has increased?
53. Your statistics instructor claims that 60 percent of the students who take her Elementary
Statistics class go through life feeling more enriched. For some reason that she can't quite
figure out, most people don't believe her. You decide to check this out on your own. You
randomly survey 64 of her past Elementary Statistics students and find that 34 feel more
enriched as a result of her class. Now, what do you think?
54. A Nissan Motor Corporation advertisement read, “The average man’s IQ is 107. The
average brown trout’s IQ is 4. So why can’t man catch brown trout?” Suppose you believe
that the brown trout’s mean IQ is greater than four. You catch 12 brown trout. A fish
psychologist determines the IQ’s for the fish as follows: 5; 4; 7; 3; 6; 4; 5; 3; 6; 3; 8; 5.
Use this data to conduct a hypothesis test for the claim that the mean IQ for brown trout is
greater than 4.
55. Refer to the previous exercise. Conduct a hypothesis test to see if your decision and
conclusion would change if your belief were that the brown trout’s mean IQ is different
from 4.
56. According to an article in Newsweek, the natural ratio of girls to boys is 100:105. In China,
the birth ratio is 100: 114 (46.7% girls). Suppose you don’t believe the reported figures of the
percent of girls born in China. You conduct a study. In this study, you count the number of
girls and boys born in 150 randomly chosen recent births. There are 60 girls and 90 boys born
of the 150. Based on your study, do you believe that the percent of girls born in China is 46.7?
57. A poll done for Newsweek found that 13% of Americans have seen or sensed the presence
of an angel. A contingent doubts that the percent is really that high. It conducts its own survey.
Out of 76 Americans surveyed, only two had seen or sensed the presence of an angel. As a
result of the contingent’s survey, would you agree with the Newsweek poll? In complete
sentences, also give three reasons why the two polls might give different results.
58. The mean work time each week for engineers in a start-up company is believed to be
about 60 hours. A newly hired engineer hopes that it’s shorter. She asks ten engineering
friends in start-ups for the lengths of their mean work weeks. The data is as follows:
70; 45; 55; 60; 65; 55; 55; 60; 50; 55.
Use this data to test the claim that the mean number of hours worked each week is less
than 60.
59. Toastmasters International cites a report by Gallup Poll that 40% of Americans fear
public speaking. A student believes that less than 40% of students at her school fear public
speaking. She randomly surveys 361 schoolmates and finds that 135 report they fear public
speaking. Conduct a hypothesis test to determine if the percent at her school is less than
40%.
60. A recent report states that 68% of online courses taught at community colleges nationwide
were taught by full-time faculty. To test if this proportion is the same in Illinois, College of
Lake County (CLC) was randomly selected for comparison. In that same year, 34 of the
44 online courses CLC offered were taught by full-time faculty. Does this data provide
evidence that the percentage of online courses at community colleges that are taught by full-
time faculty is different from 68% in Illinois?
61. According to an article in Bloomberg Businessweek, New York City's most recent adult
smoking rate is 14%. Suppose that a survey is conducted to determine this year’s rate. Nine
out of 70 randomly chosen N.Y. City residents reply that they smoke. Use this data to test
the claim that the proportion of NYC adults who smoke has decreased.
62. The mean age of College of Lake County students in a previous term was 26.6 years old.
An instructor thinks the mean age for online students is older than 26.6. She randomly
surveys 56 online students and finds that the sample mean is 29.4 with a standard deviation
of 2.1. Conduct a hypothesis test to test the instructor’s claim.
63. Nationwide, registered nurses earn an average annual salary of $69,110. A survey was
conducted of 41 California registered nurses to determine if the annual salary is higher than
$69,110 for California nurses. The sample average was $71,121 with a sample standard
deviation of $7,489. Conduct an appropriate test to test the claim that the mean salary for
nurses in California is higher than the national average.
64. La Leche League International reports that the mean age of weaning a child from breast-
feeding is age four to five worldwide. In America, most nursing mothers wean their children
much earlier. Suppose a random survey is conducted of 21 U.S. mothers who recently weaned
their children. The mean weaning age was nine months (3/4 year) with a standard deviation
of 4 months. Conduct a hypothesis test to determine if the mean weaning age in the U.S. is
less than four years old.
65. Over the past few decades, public health officials have examined the link between weight
concerns and teen girls' smoking. Researchers surveyed a group of 273 randomly selected
teen girls living in Massachusetts (between 12 and 15 years old). After four years the girls
were surveyed again. Sixty-three said they smoked to stay thin. Does this data provide
sufficient evidence to conclude that more than thirty percent of teen girls smoke to stay thin?
After conducting the test, your decision and conclusion are
a. Reject H0: There is sufficient evidence to conclude that more than 30% of teen girls
smoke to stay thin.
b. Do not reject H0: There is not sufficient evidence to conclude that less than 30% of
teen girls smoke to stay thin.
c. Do not reject H0: There is not sufficient evidence to conclude that more than 30% of
teen girls smoke to stay thin.
d. Reject H0: There is sufficient evidence to conclude that less than 30% of teen girls
smoke to stay thin.
66. A statistics instructor believes that fewer than 20% of students at her community college
attended the opening night midnight showing of the latest Harry Potter movie. She surveys
84 of her students and finds that 11 of them attended the midnight showing. At a 1% level
of significance, an appropriate conclusion is:
a. There is insufficient evidence to conclude that the percent of EVC students who
attended the midnight showing of Harry Potter is less than 20%.
b. There is sufficient evidence to conclude that the percent of EVC students who
attended the midnight showing of Harry Potter is more than 20%.
c. There is sufficient evidence to conclude that the percent of EVC students who
attended the midnight showing of Harry Potter is less than 20%.
d. There is insufficient evidence to conclude that the percent of EVC students who
attended the midnight showing of Harry Potter is at least 20%.
67. Previously, an organization reported that teenagers spent an average of 4.5 hours per
week on the phone. The organization thinks that the mean is now higher. To test the claim,
15 randomly chosen teenagers were asked how many hours per week they spend on the
phone. The sample mean was 4.75 hours with a sample standard deviation of 2.0. Use this
data to test the claim that the current mean is greater than 4.5 hours. At a significance
level of α = 0.05, what is the correct conclusion?
a. There is enough evidence to conclude that the mean number of hours is more than
4.75.
b. There is enough evidence to conclude that the mean number of hours is more than
4.5.
c. There is not enough evidence to conclude that the mean number of hours is more
than 4.5.
d. There is not enough evidence to conclude that the mean number of hours is more
than 4.75.
68. According to the Center for Disease Control website, in 2011 at least 18% of high school
students have smoked a cigarette. An Introduction to Statistics class in Davies County, KY
conducted a hypothesis test at the local high school (a medium sized–approximately 1,200
students–small city demographic) to determine if the local high school’s percentage was
lower. One hundred fifty students were chosen at random and surveyed. Of the 150 students
surveyed, 82 have smoked. Use this data with a significance level of 0.05 to test whether the
local high school’s percentage is less than 18%.
69. A recent survey in the N.Y. Times Almanac indicated that 48.8% of families own stock. A
broker wanted to determine if this survey could be valid. He surveyed a random sample of
250 families and found that 142 owned some type of stock. At the 0.05 significance level, can
the survey be considered to be accurate? In other words, at the .05 level, does this data
provide evidence that the true proportion of families who own stock differs from 48.8%?
70. Driver error can be listed as the cause of approximately 54% of all fatal auto accidents,
according to the American Automobile Association. Thirty randomly selected fatal accidents
are examined, and it is determined that 14 were caused by driver error. Using α = 0.05, is
the AAA proportion accurate?
71. The US Department of Energy reported that 51.7% of homes nationwide were heated by
natural gas. A random sample of 221 homes in Kentucky found that 115 were heated by
natural gas. Does this data provide evidence to support the claim that the percentage of
homes heated by natural gas differs from the national average? Use a significance level of α
= 0.05 to test.
72. For Americans using library services, the American Library Association claims that at
most 67% of patrons borrow books. The library director in Gurnee, Illinois feels this is not
true, so she asked a local college statistics class to conduct a survey. The class randomly
selected 100 patrons and found that 82 borrowed books. Did the class demonstrate that the
percentage was higher in Gurnee, IL? Use an α = 0.01 level of significance. What is the
possible proportion of patrons that do borrow books from the Warren Newport Library in
Gurnee?
73. The Weather Underground reported that the mean amount of summer rainfall for the
northeastern US is at least 11.52 inches. Ten cities in the northeast are randomly selected
and the mean rainfall amount is calculated to be 7.42 inches with a standard deviation of 1.3
inches. At the α = 0.05 level, can it be concluded that the mean rainfall was below the reported
average? What if α = 0.01? Assume the amount of summer rainfall follows a normal
distribution.
74. A survey in the N.Y. Times Almanac finds the mean commute time (one way) is 25.4
minutes for the 15 largest US cities. The Austin, TX chamber of commerce feels that Austin’s
commute time is less and wants to publicize this fact. The mean for 25 randomly selected
commuters is 22.1 minutes with a standard deviation of 5.3 minutes. At the α = 0.10 level, is
the Austin, TX commute significantly less than the mean commute time for the 15 largest
US cities?
75. A report by the Gallup Poll found that a woman visits her doctor, on average, at most 5.8
times each year. A hospital administrator believes that the average is actually higher than
reported. A random sample of 20 women results in these yearly visit totals:
3; 2; 1; 3; 7; 2; 9; 4; 6; 6; 8; 0; 5; 6; 4; 2; 1; 3; 4; 1
At the α = 0.05 level, can it be concluded that the mean number of visits for women is higher
than 5.8 visits per year?
76. According to the N.Y. Times Almanac the mean family size in the U.S. is 3.18. A professor
claims that mean family size is greater than the reported number above. A sample of a college
math class at a large university resulted in the following family sizes:
5; 4; 5; 4; 4; 3; 6; 4; 3; 3; 5; 5; 6; 3; 3; 2; 7; 4; 5; 2; 2; 2; 3; 2
At the α = 0.05 level, is there evidence to support the professor’s claim? Does this mean that the
Almanac result is incorrect? Why or why not?
77. According to a Harvard Health Publishing article, the normal body temperature may be
lower than 98.6 degrees according to data samples from the past 160 years. A sample of 12
random adults from a biology class had the following temperatures in Fahrenheit:
96.6; 97.5; 98.9; 96.4; 98.1; 98.9; 97.6; 97.8; 98.1; 97.9; 96.7; 98.6
At α = 0.05, is there evidence to support the traditional belief that the average body
temperature is 98.6 degrees?
78. Normal resting heart rates for women are between 60 – 100 beats per minute. A group of
13 women who exercise three days a week at the local YMCA recorded their resting heart
rate, and the data is below:
67; 71; 82; 56; 94; 67; 78; 83; 63; 62; 74; 90; 72
At α = 0.05, is there evidence to support the claim that the resting heart rate of women who
exercise three days a week at the YMCA is less than 80 beats per minute?
79. According to the CDC adults 18-60 years old need 7 or more hours of sleep per night. It
is claimed that the average sleep time is greater than 7 hours. The sleep times of a random
sample from a CLC statistics class who are adults between 18-60 are listed below:
At α = 0.05 can it be concluded that the average is more than 7 hours of sleep per night?
80. According to the CDC adults 61 – 64 years old need 7 – 8 hours of sleep per night. The
sleep time of a random sample of adults in a sleep study of adults between 61 – 64 are given
below:
At the α = 0.05 level, is there evidence to support the claim that the mean sleep time in the
sleep study is 7.5 hours per night?
81. IQ, or intelligence quotient, is one way to measure a person’s ability to reason and
problem solve. The IQ population standard deviation is 15 points. A statistics teacher claims
that students in his classes have an IQ greater than 100. He samples 20 students, and the
IQ scores are listed below:
86; 90; 110; 115; 100; 95; 91; 103; 98; 87;
105; 95; 103; 114; 102; 99; 120; 120; 93; 101
At the α = 0.05 level, is there evidence to support the teacher’s claim?
82. Another statistics teacher claims that the students in her classes have an IQ of 100, which
is the average. The IQ population standard deviation is 15 points. She samples 24 students,
and the IQ scores are listed below:
100; 87; 90; 93; 110; 99; 89; 102; 101; 87; 101; 112
88; 99; 82; 109; 97; 85; 99; 98; 101; 91; 92; 96
83. The student academic group on a college campus claims that freshman students study at
least 2.5 hours per day, on average. One Introduction to Statistics class was skeptical; they
claim that freshmen study less than 2.5 hours per day on average. To test the claim the class
selected a random sample of 30 freshman students and found a mean study time of 137
minutes with a standard deviation of 45 minutes. At the α = 0.01 level, is there evidence to
support the Stats class’ claim?
84. A billing company that collects bills for medical offices in the surrounding area claims
that there is a change in the percent of bills being paid by Medicare. In the past, the
percentage that is paid by Medicare is 30%. A study of 7500 recent bills shows that 33% of
these bills are being paid by Medicare. Test the claim using a 5% level of significance.
Show all 5 steps.
REFERENCES
Data from the National Institute of Mental Health. Available online at
http://www.nimh.nih.gov/publicat/depression.cfm.
Data from Amit Schitai. Director of Instructional Technology and Distance Learning. LBCC.
Data from energy.gov. Available online at http://energy.gov (accessed June 27, 2013).
Data from Gallup®. Available online at www.gallup.com (accessed June 27, 2013).
Data from Growing by Degrees by Allen and Seaman.
Data from the American Automobile Association. Available online at www.aaa.com (accessed June
27, 2013).
Data from the American Library Association. Available online at www.ala.org (accessed
June 27, 2013).
Data from the Centers for Disease Control and Prevention. Available online at www.cdc.gov
(accessed June 27, 2013).
Data from Weather Underground. Available online at www.wunderground.com (accessed June 27,
2013).
Harvard Health Publishing. “Time to redefine normal body temperature?” Shmerling, Robert H.
March 13, 2020. Available online at
http://www.health.harvard.edu/blog/time-to-redefine-normal-body-temperature-2020031319173
(accessed July 14, 2021).
Federal Bureau of Investigations. “Uniform Crime Reports and Index of Crime in Daviess in the
State of Kentucky enforced by Daviess County from 1985 to 2005.” Available online at
http://www.disastercenter.com/kentucky/crime/3868.htm (accessed June 27, 2013).
“Foothill-De Anza Community College District.” De Anza College, Winter 2006. Available
online at http://research.fhda.edu/factbook/DAdemofs/Fact_sheet_da_2006w.pdf.
Johansen, C., J. Boice, Jr., J. McLaughlin, J. Olsen. “Cellular Telephones and Cancer—a
Nationwide Cohort Study in Denmark.” Institute of Cancer Epidemiology and the Danish Cancer
Society, 93(3):203-7. Available online at http://www.ncbi.nlm.nih.gov/pubmed/11158188 (accessed
June 27, 2013).
Rape, Abuse & Incest National Network. “How often does sexual assault occur?” RAINN, 2009.
Available online at http://www.rainn.org/get-information/statistics/frequency-of-sexual-assault
(accessed June 27, 2013).
CHAPTER 10 SOLUTIONS:
1) a. The RV is x̄ = the sample mean Internet speed in Megabits per second.
b. H0: μ ≤ 3, Ha: μ > 3
3) a. The RV is x̄ = mean number of children from a random sample of American families.
b. H0: μ = 2, Ha: μ ≠ 2
5) a. H0: p = 0.42 b. Ha: p < 0.42 7) a. H0: μ = 15 b. Ha: μ ≠ 15
9) Type I: The mean price of mid-sized cars is $32,000, but we conclude that it is not
$32,000.
Type II: The mean price of mid-sized cars is not $32,000, but we conclude that it is
$32,000.
11) β = 1 – Power = 0.019 13) a. Power = 1 – β = 0.998 b. The Type II error
15) a normal distribution for a single population mean
17) It must be approximately normally distributed.
19) a. H0: μ ≥ 73 Ha: μ < 73 b. The p-value is almost zero, which means there is
sufficient data to conclude that the mean height of high school students who play basketball
on the school team is less than 73 inches. The data do support the claim.
21) a. The shaded region shows a low p-value.
b. The p-value is the area in the right tail, so it is a right-tailed test.
23) a. It is a test for a population mean. b. x̄
c. the mean time spent in jail for 26 first time convicted burglars
d. σ = 1.5 e. x̄ = 3, σ = 1.5, s = 1.8, n = 26.
e. Since we know σ = 1.5, we should use it; using s instead would cause errors in the
calculations. f. This will be a Z-test; x̄ ~ N(2.5, 1.5/sqrt(26))
25) This is a left-tailed test. 27) This is a two-tailed test.
29) a right-tailed test 31) a left-tailed test 33) This is a left-tailed test.
35) a. H0: μ = 34; Ha: μ ≠ 34 b. H0: p ≤ 0.60; Ha: p > 0.60
c. H0: μ ≥ 100,000; Ha: μ < 100,000 d. H0: p = 0.29; Ha: p ≠ 0.29
e. H0: p = 0.05; Ha: p < 0.05 f. H0: μ ≤ 10; Ha: μ > 10
g. H0: p = 0.50; Ha: p ≠ 0.50 h. H0: μ = 6; Ha: μ ≠ 6
i. H0: p ≥ 0.11; Ha: p < 0.11 j. H0: μ ≤ 20,000; Ha: μ > 20,000
37) c
39) a. Type I error: We conclude that the mean is not 34 years, when it really is 34 years.
Type II error: We conclude that the mean is 34 years, when in fact it really is not 34 years.
b. Type I error: We conclude that more than 60% of Americans vote in presidential
elections, when the actual percentage is at most 60%. Type II error: We conclude that at
most 60% of Americans vote in presidential elections when, in fact, more than 60% do.
c. Type I error: We conclude that the mean starting salary is less than $100,000, when it
really is at least $100,000. Type II error: We conclude that the mean starting salary is at
least $100,000 when, in fact, it is less than $100,000.
d. Type I error: We conclude that the proportion of high school seniors who get drunk each
month is not 29%, when it really is 29%. Type II error: We conclude that the proportion of
high school seniors who get drunk each month is 29% when, in fact, it is not 29%.
e. Type I error: We conclude that fewer than 5% of adults ride the bus to work in Los
Angeles, when the percentage that do is really 5% or more. Type II error: We conclude that
5% or more adults ride the bus to work in Los Angeles when, in fact, fewer than 5% do.
f. Type I error: We conclude that the mean number of cars a person owns in his or her
lifetime is more than 10, when in reality it is not more than 10. Type II error: We conclude
that the mean number of cars a person owns in his or her lifetime is not more than 10
when, in fact, it is more than 10.
g. Type I error: We conclude that the proportion of Americans who prefer to live away from
cities is not about half, though the actual proportion is about half. Type II error: We
conclude that the proportion of Americans who prefer to live away from cities is half when,
in fact, it is not half.
h. Type I error: We conclude that the duration of paid vacations each year for Europeans is
not six weeks, when in fact it is six weeks. Type II error: We conclude that the duration of
paid vacations each year for Europeans is six weeks when, in fact, it is not.
i. Type I error: We conclude that the proportion is less than 11%, when it is really at least
11%. Type II error: We conclude that the proportion of women who develop breast cancer is
at least 11%, when in fact it is less than 11%.
j. Type I error: We conclude that the average tuition cost at private universities is more
than $20,000, though in reality it is at most $20,000. Type II error: We conclude that the
average tuition cost at private universities is at most $20,000 when, in fact, it is more than
$20,000.
41) b 43) d 45) d
47) Hypotheses: H0: μ ≥ 50,000 , Ha : μ < 50,000
Use a Z-test since σ = 8000 is known. From the TI-84, z = -2.315 and p = 0.0103.
Since p < .05, we reject the null hypothesis. There is sufficient evidence to conclude that the
mean lifespan of the tires is less than 50,000 miles.
49) Hypotheses: H0: μ = $1.00 Ha: μ ≠ $1.00.
Use a Z-test since σ = .20 is known. From the TI-84, z = -0.866 and p = 0.3865.
Since p > .01, we do not reject the null hypothesis. There is not enough evidence to conclude
that the mean cost of daily papers is different from $1. The mean cost could reasonably be
$1.
51) Hypotheses: H0: μ = 10 Ha: μ ≠ 10
Use T-test because population SD is unknown. From TI-84, t = –1.12 and p = 0.300
Since p > 0.05, do not reject the null hypothesis. At the 5% significance level, there is
insufficient evidence to conclude that the mean number of sick days is different from 10.
53) Hypotheses: H0: p ≥ 0.6 Ha: p < 0.6
Use 1-PropZTest since testing a claim about a proportion.
From the TI-84, we have z = -1.12 and p-value = 0.1308.
Since p > .05, we do not reject Ho. There is not enough evidence to conclude that less than
60% of people who take Statistics feel more enriched.
55) Hypotheses: H0: μ = 4 Ha: μ ≠ 4
Use T-test because population SD is unknown. From TI-84, t = 1.95 and p-value = 0.076.
Since p > 0.05, we do not reject the null hypothesis. There is insufficient evidence to
conclude that the average IQ of brown trout is different from four.
57) Hypotheses: H0: p ≥ 0.13 Ha: p < 0.13
Use 1-PropZTest since testing a claim about a proportion.
From the TI-84, we have z = -2.688 and p-value = 0.0036. Since p < 0.05, reject Ho.
There is sufficient evidence to conclude that the percentage of Americans who have seen or
sensed an angel is less than 13%.
59) Hypotheses: H0: p = 0.40 Ha: p < 0.40
Use 1-PropZTest since testing a claim about a proportion. From the TI-84, z = -1.01 and p-
value = 0.1563. Since p > 0.05, do not reject Ho. There is insufficient evidence to support
the claim that less than 40% of students at the school fear public speaking.
61) Hypotheses: H0: p = 0.14 Ha: p < 0.14
Use 1-PropZTest since testing a claim about a proportion. From the TI-84, z = -0.2756 and
p-value = 0.3914. Since p > 0.05, do not reject Ho. There is insufficient evidence to
conclude that the proportion of NYC residents who smoke is less than 0.14.
63) Hypotheses: H0: μ = 69,110 Ha: μ > 69,110.
Use T-test because population SD is unknown. From TI-84, t = 1.719 and p-value = 0.0466.
Since p < 0.05, we reject the null hypothesis. There is sufficient evidence to conclude that
the mean salary of California registered nurses exceeds $69,110.
65) c 67) c
69) Hypotheses: H0: p = 0.488, Ha: p ≠ 0.488
Use 1-PropZTest since testing a claim about a proportion.
From the TI-84, z = 2.53 and p-value = 0.0114. Since p < 0.05, reject the null hypothesis.
At the 5% level of significance, there is enough evidence to conclude that the percentage of
families who own stocks is not 48.8%. So, the survey does not appear to be accurate.
71) Hypotheses: H0: p = 0.517, Ha: p ≠ 0.517
Use 1-PropZTest since testing a claim about a proportion.
From the TI-84, p-value = 0.9203. Since p > 0.05, do not reject the null hypothesis.
At the 5% significance level, there is not enough evidence to conclude that the proportion of
homes in Kentucky that are heated by natural gas differs from 0.517.
However, we cannot generalize this result to the entire nation. First, the sample’s
population is only the state of Kentucky. Second, it is reasonable to assume that homes in
the extreme north and south will have extreme high usage and low usage, respectively.
73) Hypotheses: H0: µ ≥ 11.52, Ha: µ < 11.52
Use T-test because population SD is unknown. From TI-84, p-value = 0.000002.
Since p < .05, reject the null hypothesis. At the 5% significance level, there is enough
evidence to conclude that the mean amount of summer rain in the northeastern US is less
than 11.52 inches. Note that since the p-value is almost 0, we would reject Ho
at any level of significance.
75) Hypotheses: H0: µ ≤ 5.8 Ha: µ > 5.8
Use T-test because population SD is unknown. From TI-84, p-value = 0.9987.
Since p is much larger than .05, we do not reject the null hypothesis. At the 5% level of
significance, there is no evidence whatsoever to conclude that women visit their doctors an
average of more than 5.8 times a year.
77) Hypotheses: H0: μ = 98.6, Ha: μ ≠ 98.6
Use T-test because population SD is unknown. Two-tailed test with α = 0.05. From TI-84,
p-value = 0.0057.
Since p is smaller than 0.05, we reject the null hypothesis. At the 5% level of significance,
there is sufficient evidence to reject the claim that the mean body temperature
is 98.6 degrees.
79) Hypotheses: H0: μ ≤ 7, Ha: μ > 7
Use T-test because population SD is unknown. Right-tailed test with α = 0.05. From TI-84,
p-value = 0.0409.
Since p is smaller than 0.05, we reject the null hypothesis. At the 5% level of significance,
there is enough evidence to support the claim that the class’s mean is greater than 7 hours of
sleep.
81) Hypotheses: H0: μ ≤ 100, Ha: μ > 100
Use Z-test because the population SD is known, 15. Right tailed test with α = 0.05. From
TI-84 p-value = 0.3437
Since p is greater than 0.05, we do not reject the null hypothesis. At the 5% level of
significance there is not enough evidence to support the teacher’s claim that the class mean
IQ is greater than 100.
83) Hypotheses: H0: µ ≥ 150 Ha: µ < 150 (measuring in minutes)
Use T-test because population SD is unknown. From TI-84, p-value = 0.0622.
Since p > 0.01, do not reject the null hypothesis. At the 1% significance level, there is not
enough evidence to conclude that freshmen students study less than 2.5 hours per day, on
average. The data do not contradict the student academic group’s claim.
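The z-based solutions above can be cross-checked without a TI-84. The following is a minimal Python sketch using only the standard library. The function names `one_prop_z_test` and `one_mean_z_test` are illustrative, not calculator or library functions; the inputs are the counts and data given in exercises 69 and 81.

```python
from math import sqrt
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()  # standard normal distribution: mean 0, SD 1


def tail_p_value(z, tail):
    """Return the p-value for a z statistic; tail is 'left', 'right', or 'two'."""
    if tail == "left":
        return STD_NORMAL.cdf(z)
    if tail == "right":
        return 1 - STD_NORMAL.cdf(z)
    if tail == "two":
        return 2 * (1 - STD_NORMAL.cdf(abs(z)))
    raise ValueError("tail must be 'left', 'right', or 'two'")


def one_prop_z_test(successes, n, p0, tail):
    """One-proportion z-test; the standard error uses the hypothesized p0."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    return z, tail_p_value(z, tail)


def one_mean_z_test(data, mu0, sigma, tail):
    """One-mean z-test with a known population standard deviation sigma."""
    z = (mean(data) - mu0) / (sigma / sqrt(len(data)))
    return z, tail_p_value(z, tail)


# Exercise 69: 142 of 250 families own stock; H0: p = 0.488, two-tailed.
z69, p69 = one_prop_z_test(142, 250, 0.488, "two")

# Exercise 81: IQ scores with sigma = 15; H0: mu <= 100, right-tailed.
iq = [86, 90, 110, 115, 100, 95, 91, 103, 98, 87,
      105, 95, 103, 114, 102, 99, 120, 120, 93, 101]
z81, p81 = one_mean_z_test(iq, 100, 15, "right")

print(f"Exercise 69: z = {z69:.2f}, p-value = {p69:.4f}")
print(f"Exercise 81: z = {z81:.2f}, p-value = {p81:.4f}")
```

The t-test solutions are not covered by this sketch, because the Python standard library has no t-distribution.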
11 | HYPOTHESIS TESTING WITH
TWO SAMPLES
Figure 11.1 Student compares independent Marvel fans and DC Comic fans
Chapter Objectives
Introduction
Studies often require the comparison of two unknown population means or two unknown
population proportions. For example, medical researchers have done several studies to
measure the effect aspirin has in preventing heart attacks. Typically, the treatment group
is given aspirin and the control group is given a placebo. Then the heart attack rate for each
group is monitored and compared over several years.
There are other situations that deal with the comparison of two groups. For example, studies
compare various diet and exercise programs. Politicians compare the proportion of
individuals from different income brackets who might vote for them. Students are interested
in whether SAT or GRE preparatory courses really help raise their scores.
Now that we have learned how to conduct hypothesis tests on single means and single
proportions, we will expand our methods to test claims involving two unknown means or two
unknown proportions. The basic framework – the Five Step Method – will be the same, but
the sampling distributions in which we do the calculations will be slightly different.
Here we will be comparing parameters from two different populations, using random samples
chosen from each. In the first three sections we will assume that the samples selected are
independent of one another. That is, the sample values selected from one population are
not related in any way to sample values selected from the other population. In Section 11.4
we will compare two population means using “matched pairs”; for this test the data consist
of two samples that are dependent on one another.
NOTE
This chapter relies heavily on either a calculator or a computer to calculate the degrees
of freedom, the test statistics, and p-values. The TI-83+ and TI-84 calculators have utilities
for doing tests involving two independent samples. Instructions for the TI-83+ and TI-84
calculators are included, as well as formulas for the test statistic of each test. We will also
be able to use these calculators to do the calculations related to matched pairs tests.
11.1 | Two Population Means with Known σ1 and σ2
Suppose we want to test a claim about two unknown population means, µ1 and µ2, where µ1
is the mean for population 1 and µ2 is the mean for population 2. Note that any inequality
relating these two parameters can be rewritten in terms of µ1 – µ2. For example, the
statement µ1 < µ2 can be rewritten as µ1 – µ2 < 0. Similarly, µ1 > µ2 can be rewritten as µ1
– µ2 > 0. Thus, we can really think of claims comparing two different population means, µ1
and µ2, as claims involving a single new parameter µ1 – µ2. An obvious point estimate for
this parameter is the difference of sample means, x̄1 – x̄2, where x̄1 and x̄2 are calculated
from random samples selected from populations 1 and 2, respectively.
Assumptions:
1. We are using independent, random samples selected from the two populations.
o If the sample sizes are small, the distributions from which we are sampling should
be approximately normal.
o If the sample sizes are both large (n > 30), then the distributions are not important;
they need not be normal.
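Random Variable: $\bar{X}_1 - \bar{X}_2$
Mean: $\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$
Standard Deviation: $\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
When we are sampling from normal distributions, or the sample size is large, then the
sampling distribution is approximately normal:
$\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2,\ \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right)$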
Some important notes:
When testing claims about a single population mean, we used one of two sampling
distributions: either a normal distribution (z-test) or a t-distribution, depending on whether
or not the population standard deviation is known. This will be the case for two-sample
tests as well. When both of the population standard deviations are known, we will use a
Two-Sample Z-Test; the test statistic for this test is:
$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{x}_1 - \bar{x}_2}}$
Note that for most tests, we will start with a null hypothesis that includes an equal sign; that is, our null
hypothesis will generally include the case µ1 – µ2 = 0. So the numerator will be simplified,
and we have the test statistic:
$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
Once we calculate our test statistic, we could use the fact that the sampling distribution for
X̄1 – X̄2 is approximately normal and either compute the p-value or use an appropriate
critical value for our test. However, the TI-83 and TI-84 calculators have a built-in utility
that will give us both the test statistic and p-value for our test.
Go to the STAT menu and then to TESTS. Select option 3, which is 2-SampZTest.
Enter the population SD’s, the sample means and the sample sizes; select the appropriate
alternative hypothesis. Arrow down to Calculate and press ENTER. The test statistic and
p-value will appear on the output screen. If we select the Draw option instead of Calculate,
we can view a graph showing the region corresponding to the p-value.
NOTE: The first inputs we are prompted for are the population standard deviations, σ1 and
σ2. If these are not known, we should not be using 2-SampZTest.
Example 11.1
The mean lasting time of two competing floor waxes is to be compared. The amount of time
that each brand lasts is normally distributed. Each brand of wax is applied to 20 randomly
chosen floors and tested for how long the finish lasts under normal wear; the data are
recorded in the table below:
Using a 5% significance level, does this data suggest that Brand A wax lasts longer, on
average, than Brand B?
Solution 11.1:
This is a test of two means, using independent samples. The random variable we will be
working with is X̄1 – X̄2 = difference in the mean number of months the competing floor
waxes last; and since the population standard deviations are known, we can use 2-SampZTest.
The claim is that Brand A wax lasts longer than Brand B, on average; i.e., the claim is μA > μB.
So this is a right-tailed test, and the hypotheses can be written as either:
H0: μA ≤ μB or H0: μA – μB ≤ 0
Ha: μA > μB Ha: μA – μB > 0
However, the first is preferred since it more naturally shows the claim and also matches the
inputs for the calculator. Go to 2-SampZTest, and enter the following:
σ1: 0.33
σ2: 0.36
x̄1: 3
n1: 20
x̄2: 2.8
n2: 20
µ1: > µ2
Put the cursor on Calculate and press enter to get Z = 1.83 and p-value = 0.0335
Since p-value < .05, we reject H0. At the 5% significance level, there is sufficient evidence
to conclude that Brand A wax lasts longer, on average, than Brand B wax.
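The same calculation can be sketched in Python for readers who want to see what 2-SampZTest is doing. This is a minimal illustration that assumes the null hypothesis difference is 0. The function name `two_sample_z_test` and its parameter names are made up for this sketch, and the inputs are the values entered into the calculator in Solution 11.1.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_test(xbar1, xbar2, sigma1, sigma2, n1, n2, tail="right"):
    """Two-sample z-test of H0: mu1 - mu2 = 0 with known population SDs.

    tail is 'left', 'right', or 'two' and matches the alternative hypothesis.
    """
    # Standard deviation of the sampling distribution of xbar1 - xbar2.
    se = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    z = (xbar1 - xbar2) / se
    cdf = NormalDist().cdf  # standard normal CDF
    if tail == "right":
        p_value = 1 - cdf(z)
    elif tail == "left":
        p_value = cdf(z)
    elif tail == "two":
        p_value = 2 * (1 - cdf(abs(z)))
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    return z, p_value


# Brand A (sample 1) versus Brand B (sample 2); right-tailed test.
z, p_value = two_sample_z_test(3, 2.8, 0.33, 0.36, 20, 20, tail="right")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

With these inputs the sketch should agree with the calculator output quoted in the solution, up to rounding.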
11.1 The means of the number of revolutions per minute of two competing engines are to be
compared. Thirty engines are randomly assigned to be tested. Both populations have
normal distributions. The table below shows the result. Using a 5% level of significance, do
the data indicate that Engine 2 has higher RPM than Engine 1?
Example 11.2
A field researcher is gathering data on the trunk diameters of mature pine and spruce
trees in a certain area. The following are the results of his random sampling.
Using a .10 significance level, can he conclude that the average trunk diameter of a pine
tree is greater than the average diameter of a spruce tree?
This is a test of two means, using independent samples. The random variable we will be
working with is X̄1 – X̄2, where X̄1 represents a random sample mean diameter of pines, and
X̄2 is the random sample mean diameter of spruces.
(Step 1) Let µ1 be the population mean of diameters of mature pine trees, and µ2 the
population mean of diameters of mature spruce trees. The claim is that pines have larger
mean diameter, or μ1 > μ2.
H0: μ1 ≤ μ2
Ha: μ1 > μ2
(Step 2) So this is a right-tailed test because Ha contains a greater-than symbol, >
(Step 3) Since the population standard deviations are known, we can use 2-SampZTest.
NOTE: The calculator asks for the standard
deviation, which is the square root of the
variance.
Put the cursor on Calculate and press enter to get Z = 2.94 and p-value = 0.0206.
(Step 5) At the .10 significance level, there is sufficient evidence to conclude that the mean
diameter for pine trees is larger than the mean diameter for spruce trees.
(Step 1) Let µ1 be the population mean of diameters of mature pine trees, and µ2 the
population mean of diameters of mature spruce trees. The claim is that pines have larger
mean diameter, or μ1 > μ2.
H0: μ1 ≤ μ2
Ha: μ1 > μ2
(Step 2) So this is a right-tailed test because Ha contains a greater-than symbol, >
(Step 3) Since the population standard deviations are known, we can use 2-SampZTest.
CALCULATOR: 2-SampZTest
(Step 4) This is a right-tailed Z-test with significance level α = .10. The critical value for the test
will be the z-value that has an area of .10 to the right of it; so the area to the left of this value is
.90. This is easily found using the calculator:
Thus, we would reject Ho if and only if the test statistic Z > 1.282. Since our test statistic
is Z = 2.94, our decision again is to reject Ho.
(Step 5) At the .10 significance level, there is sufficient evidence to conclude that the mean
diameter for pine trees is larger than the mean diameter for spruce trees.
For convenience we again show the rejection regions for the three most commonly used
significance levels:
Rejection Regions for Z-Test
Significance Level    Left-Tailed    Right-Tailed    Two-Tailed
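The critical values that define these rejection regions can be generated rather than looked up. The sketch below uses Python's standard library and assumes that the three most commonly used significance levels are 0.10, 0.05, and 0.01; the helper name `z_rejection_regions` is illustrative only.

```python
from statistics import NormalDist

inverse_cdf = NormalDist().inv_cdf  # inverse of the standard normal CDF


def z_rejection_regions(alpha):
    """Return the (left, right, two-tailed) critical z-values for level alpha."""
    left = inverse_cdf(alpha)           # reject H0 when z < left
    right = inverse_cdf(1 - alpha)      # reject H0 when z > right
    two = inverse_cdf(1 - alpha / 2)    # reject H0 when |z| > two
    return left, right, two


print("alpha   left-tailed   right-tailed   two-tailed")
for alpha in (0.10, 0.05, 0.01):  # assumed "most commonly used" levels
    left, right, two = z_rejection_regions(alpha)
    print(f"{alpha:<7.2f} z < {left:.3f}    z > {right:.3f}      |z| > {two:.3f}")
```

The right-tailed value for α = 0.10 is the critical value 1.282 used in Step 4 of Example 11.2.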
• Used for testing a claim about two unknown means µ1, µ2 with data from
independent samples.
• Use only when the population standard deviations σ1, σ2 are known.
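• Random variable and distribution: $\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2,\ \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right)$
• Test statistic: $z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$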
• Calculator function: 2-SampZTest
11.2 | Two Population Means when σ1 and σ2 are Unknown
When we want to compare two unknown population means, it is often the case that the
population standard deviations are also unknown. The test comparing two independent
population means with unknown and possibly unequal population standard deviations is
called the Two Sample T-Test, or Aspin-Welch t-test. (The formula for the degrees of freedom
was developed by Aspin and Welch.)
Again we are using independent samples, and working with the random variable X̄1 – X̄2.
And again the mean of this sampling distribution is $\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$. But when we do not know
the population standard deviations, we must estimate them using the two sample standard
deviations (s1 and s2) from our independent samples. This in turn provides an estimate for
the standard deviation of the sampling distribution:
$\sigma_{\bar{x}_1 - \bar{x}_2} \approx s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
This expression is called the standard error of the difference of sample means. The test
statistic for the test will be
$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\text{standard error}}$,
where µ1 – µ2 is the difference in means hypothesized in H0. As with the Two Sample Z-Test,
we almost always start with a null hypothesis
that includes an equal sign, so for our calculations we will be assuming that µ1 – µ2 = 0, giving
the simplified test statistic:
$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
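The degrees of freedom (df) for this test are given by:
$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{s_1^4}{n_1^2 (n_1 - 1)} + \frac{s_2^4}{n_2^2 (n_2 - 1)}}$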
It is not necessary to compute this by hand. The TI-83 and TI-84 calculators will
compute it for us automatically (thank goodness!)
The calculator instructions for a Two-Sample T-Test are essentially the same as for a Two-
Sample Z-test, except that there is one additional input. When using 2-SampTTest, we will
be asked whether we want to “Pool the Data”. If we know that the population variances
are equal, then we can combine the two samples to get a better estimate of the common
population variance. This is called “pooling the data”; we only pool the data if we know
the population variances are equal.
Go to the STAT menu and then to TESTS. Select option 4, which is 2-SampTTest.
Enter the sample means, sample SD’s and sample sizes; select the appropriate alternative
hypothesis. For 2-SampTTest, we are also asked if we want to Pool the data, and use the
rule:
• If we know that the population variances are equal, we say YES (pool the data)
• If we do not know, or if the population variances are not equal, we say NO.
Arrow down to Calculate and press ENTER. The test statistic and p-value will appear on the
output screen. If we select the Draw option instead of Calculate, we can view a graph showing
the region corresponding to the p-value.
NOTE: Any time we are unsure of whether to pool the data, we should select NO.
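To see what the pooling choice changes, the sketch below computes the standard error and degrees of freedom both ways. It uses the Aspin-Welch formula from this section for the unpooled case and, for the pooled case, the standard pooled-variance formula with n1 + n2 – 2 degrees of freedom, which this text does not state. The function names and the summary statistics are hypothetical and chosen only for illustration.

```python
from math import sqrt


def unpooled_se_and_df(s1, n1, s2, n2):
    """Standard error and Aspin-Welch df (Pooled: NO)."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    se = sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return se, df


def pooled_se_and_df(s1, n1, s2, n2):
    """Standard error and df when the variances are assumed equal (Pooled: YES)."""
    # Weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return se, n1 + n2 - 2


# Hypothetical summary statistics with very different sample SDs.
s1, n1, s2, n2 = 4.0, 12, 9.0, 15
se_no, df_no = unpooled_se_and_df(s1, n1, s2, n2)
se_yes, df_yes = pooled_se_and_df(s1, n1, s2, n2)
print(f"Pooled NO:  SE = {se_no:.4f}, df = {df_no:.2f}")
print(f"Pooled YES: SE = {se_yes:.4f}, df = {df_yes}")
```

When the sample standard deviations differ this much, the two choices give noticeably different standard errors and degrees of freedom, which is why NO is the safer default.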
Example 11.3
The average amount of time students in a hybrid Stats class spend on a unit exam is
compared to students in a traditional Stats class. A study is done and data are collected,
resulting in the data in the table below. Each population has a normal distribution.
Using a 5% level of significance, is there a difference between the mean amount of time
hybrid stats students and traditional stats students spend on a unit exam?
Solution 11.3 using p-value
The population standard deviations are not known. Let h be the subscript for hybrid and t
be the subscript for traditional. Then, μh is the population mean for hybrid students and μt is the
population mean for traditional students. This is a test for two population means using independent
samples, and the population standard deviations are not known – so we use 2-SampTTest.
(Step 1) This will use the random variable X̄h – X̄t, which is the difference in the sample
means between hybrid and traditional. We are testing whether there is a difference, so the
claim is that μh ≠ μt and the hypotheses are:
H0: μh = μt
Ha: μh ≠ μt
Since we are unsure if the population standard deviations are equal, we state NO for
pooling.
Put the cursor on Calculate and press enter to get t = 3.42 and p-value = 0.0028. Recall that the
p-value is the probability, assuming H0 is true, of getting a test statistic at least as extreme as
the one observed (counting both tails in this case).
(Step 5) At the 5% significance level, there is sufficient evidence to conclude that there is a
difference in the mean time spent taking unit exams between hybrid section students and
traditional section students.
(Step 1) We are testing whether there is a difference, so the claim is that μh ≠ μt and the
hypotheses are:
H0: μh = μt
Ha: μh ≠ μt
Since we are unsure if the population standard deviations are equal, we state NO for
pooling.
Put the cursor on Calculate and press enter to get t = 3.42. Note also that the degrees of
freedom is shown on the output screen: df = 19.55.
(Step 4) Since the test is two-tailed, find two critical values using α/2 = .025 and df = 19.55.
(Step 5) At the 5% significance level, there is sufficient evidence to conclude that there is a
difference in the mean time spent taking unit exams between hybrid section students and
traditional section students.
11.2 Data from two independent samples are shown in the table below. Both have normal
distributions. The means for the two populations are thought to be the same. Does this
sample data provide evidence of a difference in the population means? Test at the 5% level of
significance.
Example 11.4
A study is done by a community group in two neighboring colleges to determine which school
graduates students with more math classes. College A samples 11 graduates; the average for
this sample is four math classes with a standard deviation of 1.5 math classes. College B
samples nine graduates. Their average is 3.5 math classes with a standard deviation of one
math class. The community group believes that, on average, students who graduate from
College A have taken more math classes than students from College B. Assume that both
populations have a normal distribution, and we are testing at a 1% significance level. Answer
the following questions.
Solution 11.4
a. Test for two means b. These are unknown c. Student's t-distribution
d. $\bar{X}_A - \bar{X}_B$
h. Do not reject H0 since p-value > .01
i. There is not sufficient evidence at the 1% level of significance to conclude that a student
who graduates from college A has taken more math classes, on the average, than a student
who graduates from college B.
11.3 A study is done to determine if Company A retains its workers longer than Company B.
Company A samples 15 workers, and their average time with the company is five years with
a standard deviation of 1.2. Company B samples 20 workers, and their average time with the
company is 4.5 years with a standard deviation of 0.8. The populations are normally
distributed.
Example 11.5
Online class:
67.6 41.2 85.3 55.9 82.4 91.2 73.5 94.1 64.7 64.7 70.6 38.2 61.8 88.2 70.6
58.8 91.2 73.5 82.4 35.5 94.1 88.2 64.7 55.9 88.2 97.1 85.3 61.8 79.4 79.4
Face-to-face Class:
77.9 95.3 81.2 74.1 98.8 88.2 85.9 92.9 87.1 88.2 69.4 57.6 69.4 67.1 97.6
85.9 88.2 91.8 78.8 71.8 98.8 61.2 92.9 90.6 97.6 100 95.3 83.5 92.9 89.4
Is the mean of the Final Exam scores of the online class lower than the mean of the Final
Exam scores of the face-to-face class? We will test this claim with a 5% significance level.
We let μ1 be the population mean for online classes, and μ2 be the population mean for face-
to-face classes. The claim being tested is that scores for online classes are lower, on average,
than scores for face-to-face classes.
(Step 1) the claim is that μ1 < μ2, and so the hypotheses are:
H0: μ1 ≥ μ2 and Ha: μ1 < μ2
(Step 3) This is a test for two unknown population means, using independent samples and
unknown population SD’s, so we use a 2-SampTTest.
Go to the STAT menu and then to EDIT; enter the sample data into two lists (e.g. L1, L2).
Arrow over to TESTS and select option 4: 2SampTTest.
Choose the Data option, and press ENTER. Arrow down and enter L1 for the first list
and L2 for the second list.
Select the “<” alternative; do not pool the data (i.e. select Pooled: No) Arrow down to
Calculate and press ENTER to get:
(Step 5) At the 5% level of significance, there is sufficient evidence to conclude that the
mean of the final exam scores for the online class is less than the mean of final exam scores
of the face-to-face class. So it appears that the professor was correct.
(Step 1) the claim is that μ1 < μ2, and so the hypotheses are:
(Step 3) This is a test for two unknown population means, using independent samples and
unknown population SD’s, so we use a 2-SampTTest.
(Step 4) Find left critical value since it is left-tailed and state conclusion. Recall that a left
critical value of the t-distribution is negative. Reject Ho if the test value is in the rejection
region.
(Step 5) At the 5% level of significance, there is sufficient evidence to conclude that the
mean of the final exam scores for the online class is less than the mean of final exam scores
of the face-to-face class. So it appears that the professor was correct.
Cohen's Standards for Small, Medium, and Large Effect Sizes (Optional)
Cohen's d statistic is a measure of effect size based on the differences between two means.
Named for American statistician Jacob Cohen, d measures the relative strength of the
differences between the means of two populations based on sample data. This statistic is
the difference of the observed sample means, divided by the pooled standard deviation:
$s_{pooled} = \sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$.
This is the sample standard deviation obtained by “pooling” the data; that is, by combining
the two sample variances into a single weighted estimate. Note that we can get this value from
the calculator; if we use 2-SampTTest and select YES for Pooling, the last entry in the output
screen is Sxp, which is the pooled standard deviation.
Now, the Cohen’s d-statistic is defined as $d = \dfrac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}$.
The calculated value of effect size is then compared to Cohen’s standards of small, medium,
and large effect sizes according to the following table:
Size of Effect    d
Small 0.2
Medium 0.5
Large 0.8
Example 11.6
Calculate Cohen’s d for Example 11.4. Is the size of the effect small, medium, or large?
Explain what the size of the effect means for this problem.
Solution 11.6
Go to 2-SampTTest, and enter the data; select YES for pooling, then scroll to Calculate and
press Enter to get spooled = Sxp = 1.302. Using the formula, we get d = (4 − 3.5)/1.302 ≈ 0.384,
which lies between Cohen’s values of 0.2 (small) and 0.5 (medium).
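As a cross-check on the calculator, the pooled standard deviation and Cohen's d for Example 11.4 can be computed directly from the summary statistics. The following is a minimal Python sketch, not part of the original text. The variable names and the helper `effect_size_label` are hypothetical, and treating Cohen's benchmarks 0.2, 0.5, and 0.8 as lower cutoffs is an assumption made only for illustration.

```python
import math

# Summary statistics quoted in Example 11.4 (College A vs. College B).
n_a, mean_a, sd_a = 11, 4.0, 1.5
n_b, mean_b, sd_b = 9, 3.5, 1.0

# Pooled standard deviation: weighted combination of the two sample variances.
s_pooled = math.sqrt(
    ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
)

# Cohen's d: difference of the sample means, in units of the pooled SD.
cohens_d = (mean_a - mean_b) / s_pooled


def effect_size_label(d):
    """Hypothetical helper: name the largest Cohen benchmark that |d| reaches."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"


print(f"s_pooled = {s_pooled:.3f}, d = {cohens_d:.3f}, "
      f"effect: {effect_size_label(cohens_d)}")
```

The computed pooled standard deviation agrees with the calculator value Sxp = 1.302 quoted above.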
Example 11.7
Calculate Cohen’s d for Example 11.3. Is the size of the effect small, medium or large?
Explain what the size of the effect means for this problem.
Solution 11.7
Go to 2-SampTTest, and enter the data; select YES for pooling, then scroll to Calculate and
press Enter to get spooled = Sxp = 5.94.
The resulting value, d = 1.347, indicates a large effect size, because 1.347 is greater than
Cohen’s cutoff of 0.8 for a large effect size. The difference between the mean unit exam times
of the hybrid sections and the traditional sections is large, which is consistent with the
significant difference in the means found by the test.
Is there a difference in the mean weighted alpha for banks in the northeast and in the
west? Test at a 5% significance level. Explain what the size of the effect means for this
problem.
Summary of Two-Sample T-Test
• Used for testing a claim about two unknown means µ1, µ2 with data from
independent samples.
• Use when sampling from two normal distributions and the population standard
deviations σ1, σ2 are not known.
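• Test statistic: $t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$, with degrees of freedom
$df = \dfrac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{s_1^4}{n_1^2 (n_1 - 1)} + \dfrac{s_2^4}{n_2^2 (n_2 - 1)}}$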
• Select the “Pooled” option only when the population variances are equal.
If in doubt, do not pool the data.
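The test statistic and degrees of freedom in the summary above can also be computed without the calculator. The following is a minimal Python sketch using the final exam scores from Example 11.5. The function name `welch_t` and the variable names are hypothetical, and only the Python standard library is used, so the sketch stops at t and df rather than computing a p-value, which would need a t-distribution routine.

```python
import math
import statistics

# Final exam scores from Example 11.5.
online = [
    67.6, 41.2, 85.3, 55.9, 82.4, 91.2, 73.5, 94.1, 64.7, 64.7,
    70.6, 38.2, 61.8, 88.2, 70.6, 58.8, 91.2, 73.5, 82.4, 35.5,
    94.1, 88.2, 64.7, 55.9, 88.2, 97.1, 85.3, 61.8, 79.4, 79.4,
]
face_to_face = [
    77.9, 95.3, 81.2, 74.1, 98.8, 88.2, 85.9, 92.9, 87.1, 88.2,
    69.4, 57.6, 69.4, 67.1, 97.6, 85.9, 88.2, 91.8, 78.8, 71.8,
    98.8, 61.2, 92.9, 90.6, 97.6, 100, 95.3, 83.5, 92.9, 89.4,
]


def welch_t(sample1, sample2):
    """Unpooled two-sample t statistic and Welch degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    # s^2 / n for each sample (statistics.variance is the sample variance).
    v1 = statistics.variance(sample1) / n1
    v2 = statistics.variance(sample2) / n2
    t = (statistics.mean(sample1) - statistics.mean(sample2)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


t_stat, df = welch_t(online, face_to_face)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

Because the pooled option is not used, df is generally not a whole number, which matches the non-integer df shown on the calculator's output screen.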
2. The number of successes is at least 10, and the number of failures is at least 10, for each
of the samples. That is, we should have $n_1 p_1 \ge 10$ and $n_1(1 - p_1) \ge 10$, as well as
$n_2 p_2 \ge 10$ and $n_2(1 - p_2) \ge 10$.
These assumptions imply that the sampling distribution for $\hat{p}_1 - \hat{p}_2$ will be approximately
normal. The mean of the sampling distribution will be p1 – p2, and the variance will be the
sum of the variances of the sampling distributions for $\hat{p}_1$ and $\hat{p}_2$. That is, we have:
$\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2$ and $\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}$, where $q_1 = 1 - p_1$ and $q_2 = 1 - p_2$.
Generally, the null hypothesis will state that the two population proportions are the same.
That is, H0: p1 = p2. Under this assumption both samples estimate a single common proportion,
so we can get a better approximation to the standard deviation above by using a pooled
proportion, pc. This is the total number of successes divided by the total sample size:
$p_c = \dfrac{x_1 + x_2}{n_1 + n_2}$. Inserting pc in place of p1 and p2 in the formula
above and simplifying, we get: $\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{p_c (1 - p_c)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$.
Now the test statistic for testing a claim about two proportions will be:
$z = \dfrac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sigma_{\hat{p}_1 - \hat{p}_2}}$.
Because the null hypothesis assumes that p1 and p2 are equal, we have p1 – p2 = 0;
substituting the expression for $\sigma_{\hat{p}_1 - \hat{p}_2}$ above, we get the test statistic:
$z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{p_c (1 - p_c)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$.
This may look complicated, but it is programmed into the TI-83 and TI-84 calculators. As
with other tests, the calculator will give us both the test statistic and the p-value.
Go to the STAT menu and then to TESTS. Select option 6, which is 2-PropZTest.
Note that both x1 and x2 must be whole numbers; so if you are given sample
proportions, use $x_1 = n_1 \hat{p}_1$ and $x_2 = n_2 \hat{p}_2$, and round to the nearest whole number.
Select the appropriate alternative hypothesis.
The test statistic and p-value will appear on the output screen. Note that the pooled
proportion pc also shows on the output screen as p̂ (without a subscript). As with
other tests, we can select the Draw option instead of Calculate to view a graph showing
the region corresponding to the p-value.
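For readers who want to see what 2-PropZTest computes, the following is a minimal Python sketch of the pooled two-proportion z-test. It uses the counts from Example 11.10 later in this section (205 of 620 and 196 of 700). The function name `two_prop_z_test` is hypothetical, and the sketch assumes a two-tailed alternative.

```python
import math
from statistics import NormalDist


def two_prop_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, pooled proportion, two-tailed p-value)."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Pooled proportion: total successes over total sample size.
    p_c = (x1 + x2) / (n1 + n2)
    # Standard deviation of p1_hat - p2_hat under H0: p1 = p2.
    se = math.sqrt(p_c * (1 - p_c) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    # Two-tailed p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_c, p_value


z, p_c, p_value = two_prop_z_test(205, 620, 196, 700)
print(f"z = {z:.3f}, pooled proportion = {p_c:.4f}, p-value = {p_value:.4f}")
```

For these counts the sketch agrees with the calculator output reported in Example 11.10 (z = 2 and p-value = 0.0458, after rounding).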
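NOTE: the following summary focuses on $H_0: p_1 - p_2 = 0$, which also covers the ≥ and ≤ forms
of the null hypothesis.
Test value: $z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{p_c (1 - p_c)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$
$n_1 \hat{p}_1 \ge 10$; $n_1 (1 - \hat{p}_1) \ge 10$; $n_2 \hat{p}_2 \ge 10$; $n_2 (1 - \hat{p}_2) \ge 10$.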
Example 11.8
Two types of medication for hives are being tested to determine if there is a difference in the
proportions of adult patient reactions. Twenty out of a random sample of 200 adults given
medication A still had hives 30 minutes after taking the medication. Twelve out of another
random sample of 200 adults given medication B still had hives 30 minutes after taking the
medication. Test the claim that there is a difference in the proportions using a 1% level of
significance.
This is a test of two proportions. Let pA and pB be the population proportions of individuals with
hives after 30 minutes for medications A and B, respectively. The problem asks if there is a
difference in proportions,
(Step 1) the claim being tested is that pA ≠ pB and so the hypotheses are:
(Step 3) we have data from independent random samples. From the given information, we
can see that the number of successes and failures in each sample is at least 10, so the
sampling distribution for pˆ A − pˆ B will be approximately normal. Thus, we will use
2-PropZTest. In the calculator, go to the STAT menu, then to TESTS and select 2-
PropZTest.
Enter the following:
(Step 5) Thus, at a 1% level of significance, this sample data does not provide sufficient
evidence to conclude that there is a difference in the proportions of adult patients who did
not react after 30 minutes to medication A and medication B.
difference in proportions,
(Step 1) the claim being tested is that pA ≠ pB and so the hypotheses are:
(Step 5) Thus, at a 1% level of significance, this sample data does not provide sufficient
evidence to conclude that there is a difference in the proportions of adult patients who did
not react after 30 minutes to medication A and medication B.
11.5 Two types of valves are being tested to determine if there is a difference in pressure
tolerances. In a random sample of 100 Brand A valves, 14 cracked under 4,500 psi. In a
random sample of 100 of Brand B valves, 19 cracked under 4,500 psi. Using a 5% level of
significance, test the claim that there is a difference in the proportion of valves that crack at
4,500 psi.
Example 11.9
PC Magazine wrote an article stating that IOS is more popular in the US than in the UK. IOS is
the operating system used by the iPhone. Is there sufficient evidence to conclude that the
proportion of IOS users in the US is greater than the proportion of IOS users in the UK?
US UK
IOS 189 161
Total surveyed 355 310
This is a test of two population proportions. Let US and UK be the subscripts for the US and
the UK, and let pUS and pUK be the corresponding proportions of IOS users. The claim is that
the proportion of IOS users in the US is greater than the proportion of IOS users in the UK:
(Step 2) α = .01; the test is right-tailed.
(Step 5) At the 1% level of significance, there is not sufficient evidence to conclude that the
proportion of US IOS users is greater than the proportion of UK IOS users.
Example 11.10
Bachelor’s Associate’s
Changed majors 205 196
Total surveyed 620 700
Let pB and pA be the proportions of students who changed majors in Bachelor’s degree programs
and in Associate’s degree programs, respectively.
(Step 1) claim pB = pA
H0: pB = pA
H a : p B ≠ pA
(Step 2) α = .05; two-tailed test.
x1: 205
n1: 620
x2: 196
n2: 700
p1: ≠ p2
Scroll down to Calculate and press Enter to get: z = 2 and p-value = 0.0458
(Step 5) At the 5% level of significance, there is sufficient evidence to reject the claim that
the proportion of students in Bachelor’s degree programs who switch majors is the same as the
proportion of students in Associate’s degree programs who switch majors.
(Step 1) claim pB = pA
H0: pB = pA
H a : p B ≠ pA
x1: 205
n1: 620
x2: 196
n2: 700
p1: ≠ p2
(Step 4) Find the critical values. Since this is a two-tailed test, there is one on the left and
one on the right: ±Zc = ±invnorm(.025, 0, 1) = ±1.96. Reject Ho because the test value, 2, is in
the right rejection region.
(Step 5) At the 5% level of significance, there is sufficient evidence to reject the claim that
the proportion of students in Bachelor’s degree programs who switch majors is the same as the
proportion of students in Associate’s degree programs who switch majors.
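The critical-value decision in Step 4 can be checked numerically. The following is a minimal Python sketch in which `NormalDist().inv_cdf` from the standard library plays the role of the calculator's invNorm. The variable names are hypothetical, and the test value z = 2 is taken from the calculator output quoted in this example.

```python
from statistics import NormalDist

alpha = 0.05   # significance level for the two-tailed test
z_test = 2.0   # test value reported in Example 11.10

# Right critical value: the point with area alpha/2 to its right.
# By symmetry, the left critical value is its negative.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# Reject H0 when the test value falls in either rejection region.
reject_h0 = abs(z_test) > z_crit

print(f"critical values: ±{z_crit:.2f}; reject H0: {reject_h0}")
```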
11.6 Among new graduates who have student loans, the Department of Education looks at the
number of students who default for the first time. In a sample of 2100 students from public
institutions, 10.8% were in first-time default. In a sample of 1500 students from private
for-profit institutions, 15% were in first-time default. At a 5% significance level, is there
a difference in the proportions?
• Used for testing a claim about two unknown proportions p1, p2 with data from
independent samples.
• Test statistic: $z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{p_c (1 - p_c)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$, where $p_c = \dfrac{x_1 + x_2}{n_1 + n_2}$
• Calculator function: 2-PropZTest
11.4 | Matched Pairs Test
Two samples are dependent when the subjects are paired or matched in some way. For
example, to study whether or not a medication helps lower cholesterol, researchers might
take a random sample of patients and measure their cholesterol. Then after a specified time
taking the medication, the patients would have their cholesterol measured again – this would
produce pairs of “before” and “after” data values, with one pair for each patient.
We could then analyze the actual differences, D = before – after, of the pairs using an ordinary
t-test. Thus the data is the gain or loss in cholesterol readings. Such a test is called a
matched-pairs test. One of the key advantages of this experiment design is that it controls
for a lot of individual attributes of the patients, such as diet, exercise, overall health, etc.
When using a hypothesis test for matched or paired samples, the following conditions must
be met:
µ0: 0; LIST: L1; FREQ: 1; highlight correct Ha symbol (≠, <, >)
Example 11.11
Subject: A B C D E F G H
Before 6.6 6.5 9.0 10.3 11.3 8.1 6.3 11.6
After 6.8 2.4 7.4 8.5 8.1 6.1 3.4 2.0
A lower score indicates less pain. The "before" value is matched to an "after" value and the
differences are calculated. Assume that the population of differences has a normal
distribution. Using a 5% significance level, is there evidence that the sensory measurements
are lower, on average, after hypnotism?
d = "before" – "after"
If the sensory readings are lower after hypnotism, then “before” readings should, on
average, be larger than “after” readings and so we would expect the mean of all d values to
be positive.
(Step 1) Thus the claim we want to test is µd > 0. So the hypotheses are:
(Step 3) Now we calculate the differences; we can easily do this by adding another row to
our table:
Subject: A B C D E F G H
Before 6.6 6.5 9.0 10.3 11.3 8.1 6.3 11.6
After 6.8 2.4 7.4 8.5 8.1 6.1 3.4 2.0
Difference: -0.2 4.1 1.6 1.8 3.2 2.0 2.9 9.6
(Step 4) Reject Ho because 0.0095 < 0.05
(Step 5) At the 5% level of significance, there is sufficient evidence to conclude that the
sensory measurements are lower, on average, after hypnotism. Hypnotism appears to be
effective in reducing pain.
NOTE:
For the TI-84 calculator, you can either calculate the differences (before – after)
ahead of time and put the differences directly into a list, or you can put the before
data into one list L1 and the after data into a second list, L2. Then go to L3 and
arrow up to the name; that is, put the cursor on L3 at the top of the screen.
Type L1 – L2 and press Enter; the calculator will do the subtractions and store the
differences in list L3.
(Step 1) Thus the claim we want to test is µd > 0. So the hypotheses are:
(Step 3) Now we calculate the differences; we can easily do this by adding another row to
our table:
Subject: A B C D E F G H
Before 6.6 6.5 9.0 10.3 11.3 8.1 6.3 11.6
After 6.8 2.4 7.4 8.5 8.1 6.1 3.4 2.0
Difference: -0.2 4.1 1.6 1.8 3.2 2.0 2.9 9.6
(Step 4) Find the critical value. Since this is a right-tailed t-test, we find the right critical
value tc.
(Step 5) At the 5% level of significance, there is sufficient evidence to conclude that the
sensory measurements are lower, on average, after hypnotism. Hypnotism appears to be
effective in reducing pain.
11.7 A study was conducted to investigate how effective a new diet was in lowering
cholesterol. Results for the randomly selected subjects are shown in the table. The
differences have a normal distribution. Are the subjects’ cholesterol levels lower on average
after the diet? Test at the 5% level.
Subject: A B C D E F G H I
Before 209 210 205 198 216 217 238 240 222
After 199 207 189 209 217 202 211 223 201
1.) There is a decrease from before to after, which means after is “lower” than
before: this scenario will result in the average difference being positive, µd > 0.
2.) There is an increase from before to after, which means after is “higher” than
before: this scenario will result in the average difference being negative, µd < 0.
3.) There is a difference from before to after, which means after is not the same as
before: this scenario will result in the average difference being not equal to 0,
µd ≠ 0.
4.) There is no difference from before to after, which means after is the same as
before: this scenario will result in the average difference being equal to 0, µd = 0.
Example 11.12
A pharmaceutical company wishes to test a new drug with the expectation of lowering
cholesterol levels. Ten subjects are randomly selected and pretested. The subjects were
placed on the drug for a period of 6 months, after which their cholesterol levels were tested
again.
The test results, before and after, are listed below. (All units are milligrams per deciliter.)
Subject 1 2 3 4 5 6 7 8 9 10
Before 195 225 202 195 175 250 235 268 190 240
After 180 220 210 175 170 243 205 250 183 225
Use this data, along with a 1% significance level, to test the claim that the drug is effective
in lowering cholesterol.
Solution 11.12 using p-value method
Here we are comparing two population means, using a matched pairs design. Thus, we will
use a t-test on the differences: 15, 5, -8, 20, 5, 7, 30, 18, 7, 15
Let d = cholesterol level before taking drug – cholesterol level after taking drug.
(Step 1) If the drug lowers cholesterol, then the cholesterol level should be
higher before the drug so if the claim is true, we expect mean of the differences to be
positive, giving us the hypotheses:
Ho: μd ≤ 0
Ha: μd > 0 (claim)
(Step 3) Enter the differences into a list, and go to STAT >> TESTS >> TTest. Enter 0 for
µ0, specify the list name, and select the > µ0 alternative.
(Step 5) There is sufficient evidence to conclude that the mean of the differences is positive.
Thus, there is sufficient evidence to conclude that the drug helps lower cholesterol.
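The matched-pairs calculation for Example 11.12 can be sketched in Python as a check on the differences and the test statistic. This is a minimal illustration using only the standard library, with hypothetical variable names. It stops at the t statistic, since converting t to a p-value requires a t-distribution routine that the standard library does not provide.

```python
import math
import statistics

# Cholesterol readings (mg/dL) from Example 11.12.
before = [195, 225, 202, 195, 175, 250, 235, 268, 190, 240]
after = [180, 220, 210, 175, 170, 243, 205, 250, 183, 225]

# d = before - after, so a positive mean difference indicates a decrease.
diffs = [b - a for b, a in zip(before, after)]

n = len(diffs)
d_bar = statistics.mean(diffs)   # sample mean of the differences
s_d = statistics.stdev(diffs)    # sample standard deviation of the differences

# One-sample t statistic on the differences, testing against mu_d = 0.
t_stat = (d_bar - 0) / (s_d / math.sqrt(n))

print(f"differences: {diffs}")
print(f"mean = {d_bar:.1f}, sd = {s_d:.3f}, t = {t_stat:.3f}, df = {n - 1}")
```

The positive mean difference and large positive t value are consistent with the right-tailed conclusion stated in Step 5.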
11.8 A new prep class was designed to improve SAT test scores. Nine students were
selected at random. Their scores on two practice exams were recorded, one before the class
and one after. Are the scores, on average, higher after the class? Test at a 5% level. The
data recorded in the table below:
Student A B C D E F G H I
1st Score 480 510 530 540 550 560 600 620 660
2nd Score 400 520 550 530 580 580 610 640 690
Example 11.13
Seven eighth graders at Kennedy Middle School measured how far they could push the shot-
put with their dominant (writing) hand and their weaker (non-writing) hand. They thought
that they could push equal distances with either hand. The data were collected and recorded
in the table below; the distances shown are in feet.
H0: μd = 0, Ha: μd ≠ 0
(Step 3)
We will assume that the differences have a normal distribution, and will use a T-test.
Enter the differences into a list, and go to STAT >> TESTS >> TTest. Enter 0 for µ0, specify
the list name, and select the ≠ µ0 alternative.
(Step 4) Since p > 0.05, we do not reject Ho.
(Step 5) Thus, there is not enough evidence to conclude that there is a significant difference
in the means.
Note that if we had selected the DRAW option, we would see the following graph:
The graph shows that had this been a right-tailed test, our decision would have been to reject
Ho. However, our hypotheses are based on the question that has been asked, not on the
sample data. And it would not be appropriate to change the hypotheses once we saw the data
in order to get a different result.
Notice the other statistics that are given on the calculator screen ($\bar{x}$ and Sx). These values
are the sample average difference and sample standard deviation of the differences. The
symbols that should be used for the values are as follows:
11.5 | Confidence Intervals from Two Samples
In this chapter, we have learned how to investigate claims involving two parameters; but
sometimes we actually need an estimate of the difference of two parameters. For example, if
we conducted a hypothesis test and concluded that µ1 > µ2, a natural follow-up question might
be: by how much? Using a confidence interval, we can actually give bounds for how large the
difference between these parameters really is. In this section we will extend the ideas of
Chapter 9 to develop confidence interval estimates for the difference of two means and the
difference of two proportions. Like the intervals developed in Chapter 9, these will have the
basic form:
And we now have all of the ingredients needed. For example, if we want to find a confidence
interval for µ1 – µ2 using data from two independent samples, we know that the appropriate
point estimate is the difference of sample means, $\bar{x}_1 - \bar{x}_2$; and this random variable will either
follow a normal distribution or a t-distribution. When the population standard deviations
are known, we can use a z-distribution, and the margin of error will be
$E = z_{\alpha/2}\,\sigma_{\bar{x}_1 - \bar{x}_2} = z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$. Thus, the confidence interval will have the form:
$(\bar{x}_1 - \bar{x}_2) - z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$.
When the population standard deviations are not known, we estimate σ1 and σ2
by s1 and s2 and use critical t values in place of the z critical values:
$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2}\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2}\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$.
While these formulas look complicated, we should remember that they are really the same
basic format as the intervals for a single mean µ that we developed in Chapter 9 (but the
point estimates and standard errors are different).
Moreover, we never need to actually use these formulas, because they are built into the TI-
83 and TI-84 calculators.
Using the TI-83, 83+, 84, 84+ Calculator
To calculate a confidence interval for the difference of two means, µ1 – µ2:
Enter the SD’s, the sample means and the sample sizes just as we did for two-sample tests.
Specify the confidence level. Arrow down to Calculate and press ENTER. The confidence
interval will appear on the screen.
To illustrate this, we revisit the question from Example 10.2; in that example, we used a
hypothesis test to compare the mean trunk diameters for pine and spruce trees. In that test,
we concluded that the average trunk diameter for pine trees is in fact greater than
the average diameter of spruce trees.
Example 11.14
A field researcher is gathering data on the trunk diameters of mature pine and spruce trees
in a certain area. The following are the results of his random sampling.
Use this sample data to find a 90% confidence interval for the difference in mean trunk
diameter for pine and spruce trees. Interpret this interval.
Solution 11.14
We let µ1 be the mean diameter for pine trees, and µ2 be the mean diameter for spruce
trees.
We will calculate a confidence interval for µ1 – µ2; the population variances are known, so
we will use 2-SampZInt.
Hit CALCULATE to get the interval. So we are 90% confident that 0.971 < µ1 – µ2 < 9.03.
That is, the mean for pine trees is at least 0.971 inches more than the mean for spruce trees.
And, the mean for pine trees is at most 9.03 inches more than the mean for spruce trees.
Recall also Example 11.5; in that example, a professor at a large college wanted to compare
the mean final exam scores for two populations of students – online students and students in
traditional, face to face courses. We conducted a test and found evidence that the mean score
for online classes was lower than the mean for traditional courses. Here we will calculate a
confidence interval for the difference of means.
Example 11.15
A professor at a large community college wanted to compare the mean final exam scores
between students who took his statistics course online and the students who took his face-to-
face statistics class. The randomly selected 30 final exam scores from each group are listed
below:
Online class:
67.6 41.2 85.3 55.9 82.4 91.2 73.5 94.1 64.7 64.7 70.6 38.2 61.8 88.2 70.6
58.8 91.2 73.5 82.4 35.5 94.1 88.2 64.7 55.9 88.2 97.1 85.3 61.8 79.4 79.4
Face-to-face Class:
77.9 95.3 81.2 74.1 98.8 88.2 85.9 92.9 87.1 88.2 69.4 57.6 69.4 67.1 97.6
85.9 88.2 91.8 78.8 71.8 98.8 61.2 92.9 90.6 97.6 100 95.3 83.5 92.9 89.4
Calculate a 95% confidence interval for the difference in mean scores; interpret this
interval.
Solution 11.15
We let μ1 be the population mean for online classes, and μ2 be the population mean for face-
to-face classes. We want a confidence interval for μ1 – μ2; the population SD’s are not
known, so we will use 2-SampTInt. Moreover, instead of sample statistics, we are given raw
data, so we will use the DATA option.
- First go to the STAT menu and then to EDIT; enter the sample data into two lists (e.g. L1,
L2).
- Go to STAT again, then to TESTS; select option 10: 2SampTInt. Choose the DATA option.
- Arrow down and enter L1 for the first list and L2 for the second list.
- Set C-Level: .95
- Set Pooled: No (since we have no reason to think the population variances are equal)
- Arrow down to Calculate and press Enter to get the interval: (-19.67, -4.59).
We are 95% confident that the mean for online courses is at least 4.59 points lower than the
mean for traditional courses, and at most 19.67 points lower than the mean for traditional
courses.
We can also find a confidence interval for p1 – p2, the difference of the two population
proportions. The point estimate for this parameter is $\hat{p}_1 - \hat{p}_2$, where $\hat{p}_1$ and $\hat{p}_2$ are sample
proportions chosen from the respective populations. And we know that if the sample sizes
are sufficiently large (see Section 10.3), then the sampling distribution for $\hat{p}_1 - \hat{p}_2$ will be
approximately normal, so the margin of error will be
$E = z_{\alpha/2}\,\sigma_{\hat{p}_1 - \hat{p}_2} = z_{\alpha/2}\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}$. Thus
we get the following formula for the confidence interval:
$(\hat{p}_1 - \hat{p}_2) - z_{\alpha/2}\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}} < p_1 - p_2 < (\hat{p}_1 - \hat{p}_2) + z_{\alpha/2}\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}$
Since p1 and p2 are unknown, in practice the sample proportions are used in their place under
the square root.
Again, we would rarely (if ever) use this formula, as it is built into the TI-84 calculator:
Go to the STAT menu and then to TESTS. Select option B, which is 2-PropZInt.
Enter the values for x1, n1, x2, and n2. Note that both x1 and x2 must be whole numbers; so if
you are given sample proportions, use $x_1 = n_1 \hat{p}_1$ and $x_2 = n_2 \hat{p}_2$, and round to the nearest whole
number.
Specify the Confidence level, as a decimal. Arrow down to Calculate and press ENTER.
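The interval formula above can be sketched in Python to show the calculation behind 2-PropZInt. This is a minimal illustration, not the calculator's own implementation. The function name `two_prop_z_interval` is hypothetical, the sample proportions are used in place of the unknown p1 and p2, and the counts are borrowed from Example 11.8 (20 of 200 and 12 of 200) purely to have numbers to work with.

```python
import math
from statistics import NormalDist


def two_prop_z_interval(x1, n1, x2, n2, c_level):
    """Confidence interval for p1 - p2 using the unpooled standard error."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Standard error with sample proportions in place of p1 and p2.
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    # Critical value z_{alpha/2} for the requested confidence level.
    z_crit = NormalDist().inv_cdf(1 - (1 - c_level) / 2)
    margin = z_crit * se
    diff = p1_hat - p2_hat
    return diff - margin, diff + margin


low, high = two_prop_z_interval(20, 200, 12, 200, 0.95)
print(f"95% interval for p1 - p2: ({low:.4f}, {high:.4f})")
```

Note that the interval uses each sample's own proportion rather than the pooled proportion pc, because no assumption that p1 = p2 is being made here. For these counts the interval contains 0.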
Example 11.16
Researchers for a cell phone company conducted a study of smartphone use among adults.
They wish to compare the proportion of smartphone users among two different populations:
adult whites (non-Hispanic) and African American adults. The results of the survey indicate
that of the 232 African American cell phone owners randomly sampled, 5% have an iPhone.
Of the 1,343 white cell phone owners randomly sampled, 10% own an iPhone. Use this data
to construct a 95% confidence interval for the difference in population proportions.
Solution 11.16
Let pW and pA be the proportions of people who own iPhones in the white and African-
American populations, respectively. Then we want a confidence interval for pW – pA.
The conditions for using a Two-Proportion Z-Interval are easily met, so we go to STAT, then to
TESTS, and scroll down to 2-PropZInt. Enter the following:
Arrow down to Calculate and press Enter to get the interval: (0.0156, 0.0875).
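Thus, we are 95% confident that .0156 < pW – pA < .0875. That is, the proportion of iPhone
owners among white cell phone owners is between 1.56 and 8.75 percentage points higher than
the proportion among African American cell phone owners.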
KEY TERMS and FORMULA REVIEW
Cohen’s effect size: This is a measure of the relative strength of the difference between
two means based on sample data. Cohen’s d-statistic is defined as $d = \dfrac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}$, where
spooled is the pooled standard deviation, $s_{pooled} = \sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$.
Degrees of freedom: The number of objects in a sample that are free to vary. For a one-
sample t-test, the degrees of freedom was simple: df = n – 1. However, for a two-sample t-
test, the calculation is more complicated, given by Welch’s formula:
$df = \dfrac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{s_1^4}{n_1^2 (n_1 - 1)} + \dfrac{s_2^4}{n_2^2 (n_2 - 1)}}$
Pooled proportion: Used in a hypothesis test for two proportions to estimate the
common value of p1 and p2: $p_c = \dfrac{x_1 + x_2}{n_1 + n_2}$.
$\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2$ and $\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}$.
Provided np1 and np2 are both greater than 10, the distribution will be approximately
normal.
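The mean and standard deviation are: $\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$ and $\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$
If we are sampling from normal distributions, or the sample sizes are both large, then the
sampling distribution is approximately normal: $\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2, \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}\right)$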
Standard Error: This is an estimate of $\sigma_{\bar{x}_1 - \bar{x}_2}$ obtained by estimating the unknown
population variances by the respective sample variances; the standard error is:
$s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$.
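$(\bar{x}_1 - \bar{x}_2) - z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$
$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2}\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2}\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$
$(\hat{p}_1 - \hat{p}_2) - z_{\alpha/2}\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}} < p_1 - p_2 < (\hat{p}_1 - \hat{p}_2) + z_{\alpha/2}\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}$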
CHAPTER REVIEW
11.1 Two Population Means with known Standard Deviations
If we are testing a claim comparing two unknown population means, µ1 and µ2, where the
population standard deviations σ1 and σ2 are known, then we use a Two-Sample Z-Test.
The key features of the test are:
• Random variable and distribution: $\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2, \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}\right)$
• Test statistic: $z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$
• Calculator function: 2-SampZTest
If we are testing a claim comparing two unknown population means, µ1 and µ2, where the
population standard deviations σ1 and σ2 are not known, then we use a Two-Sample T-Test;
the key features of the test are:
11.3 Comparing Two Independent Population Proportions
If we are testing a claim comparing two unknown population proportions, p1 and p2, we use
a Two-Proportion Z-Test; the key features of the test are
If we are testing a claim about two unknown population means, using dependent samples
consisting of matched pairs of data values, then we use an ordinary t-test on the
differences.
The key features of the test are:
• Test the differences d by subtracting one measurement from the other measurement in
each pair.
• If n (the number of differences) is less than 30, then we must assume that the
differences are normally distributed.
• Test statistic: $t = \dfrac{\bar{d} - \mu_d}{s_d / \sqrt{n}}$.
• Calculator function: place difference (d) into L1, T-test on L1, Enter 0 for µ0
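The matched-pairs procedure above can be sketched as follows. The data are the ten paired ratings from the marital satisfaction exercise in this chapter; taking each difference as the husband's score minus the wife's score is an assumption about the direction of subtraction, and the variable names are illustrative only.

```python
from math import sqrt
from statistics import mean, stdev

# Paired ratings from the marital satisfaction exercise
wife = [2, 2, 3, 3, 4, 2, 1, 1, 2, 4]
husband = [2, 2, 1, 3, 2, 1, 1, 1, 2, 4]

# Differences within each pair (assumed direction: husband minus wife)
d = [h - w for h, w in zip(husband, wife)]

n = len(d)
d_bar = mean(d)   # sample mean of the differences
s_d = stdev(d)    # sample standard deviation of the differences
t = (d_bar - 0) / (s_d / sqrt(n))  # hypothesized mean difference is 0
df = n - 1

print(f"mean difference = {d_bar}, t = {t:.2f}, df = {df}")
```

The p-value then comes from a Student's t-distribution with n – 1 degrees of freedom, for example from the calculator's T-Test output.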
Exercises for Chapter 11
The next seven exercises all describe a claim or scenario that will be investigated
using a hypothesis test. In each case:
1. It is believed that 70% of males pass their driver's test in the first attempt, while 65% of
females pass the test in the first attempt. A test will be conducted to check whether the
proportions are in fact equal.
2. The manufacturer of a new windshield treatment claims that their product will repel water
more effectively. To test the claim, ten windshields are tested by simulating rain without the
new treatment. The same windshields are then treated, and the experiment is run again.
This data is then used to test the manufacturer’s claim.
3. The known standard deviation in salary for all mid-level professionals in the financial
industry is $11,000. Company A and Company B are in the financial industry. Suppose
samples are taken of mid-level professionals from Company A and from Company B. The
sample mean salary for mid-level professionals in Company A is $80,000. The sample mean
salary for mid-level professionals in Company B is $96,000. This data is used to test the
claim that mid-level professionals at Company A and Company B are paid differently, on
average.
4. It is believed that the average grade on an English essay in a particular school system for
females is higher than for males. A random sample of 31 females had a mean score of 82 with
a standard deviation of three, and a random sample of 25 males had a mean score of 76 with
a standard deviation of four.
5. It is thought that teenagers sleep more than adults on average. A study is done to verify
this. A sample of 16 teenagers has a mean of 8.9 hours slept and a standard deviation of 1.2.
A sample of 12 adults has a mean of 6.9 hours slept and a standard deviation of 0.6.
6. A sample of 12 in-state graduate school programs at school A has a mean tuition of $64,000
with a standard deviation of $8,000. At school B, a sample of 16 in-state graduate programs
has a mean of $80,000 with a standard deviation of $6,000. On average, are the mean tuitions
different?
7. The manufacturer of a new medicine claims that the drug helps improve sleep. Eight
subjects are picked at random and given the medicine. The mean hours slept for each person
were recorded before starting the medication and after. This data is then used to test the
manufacturer’s claim.
8. A study is done to determine which of two soft drinks has more sugar. The researchers
believe that Beverage B has more sugar than Beverage A, on average. A random sample of
13 cans of Beverage A is selected; the mean amount of sugar in the sample is 36 grams with
a standard deviation of 0.6 grams. A random sample of 10 cans of Beverage B is selected; the
mean amount of sugar in this sample is 38 grams with a standard deviation of 0.8 grams.
Both populations have normal distributions.
9. The U.S. Center for Disease Control reports that the mean life expectancy was 47.6 years
for whites born in 1900 and 33.0 years for nonwhites. Suppose that you randomly survey
death records for people born in 1900 in a certain county. Of the 124 whites, the mean life
span was 45.3 years with a standard deviation of 12.7 years. Of the 82 nonwhites, the mean
life span was 34.1 years with a standard deviation of 15.6 years. Conduct a hypothesis test
to see if the mean life spans in the county were the same for whites and nonwhites.
10. The mean speeds of fastball pitches from two different baseball pitchers are to be
compared. The populations (of fastball speeds) for each pitcher have normal distributions. A
sample of 14 fastball pitches is measured from each pitcher; the results are shown in the
table:
Scouters believe that Rodriguez pitches a speedier fastball than Wesley; use the data above
to test this claim.
a. State the null and alternative hypotheses in terms of the appropriate parameters.
b. What test should be used, and why?
c. Calculate the test statistic and the p-value.
d. At the 1% significance level, what is the decision about Ho?
e. What can we conclude from this test?
11. A researcher is testing the effects of plant food on plant growth; the researcher thinks
the food makes the plants grow taller. Nine plants have been given the plant food. Another
nine plants have not been given the plant food. The heights of the plants (in inches) are
recorded after eight weeks. The following table shows the results.
12. Two metal alloys are being considered as material for ball bearings. The mean melting
point of the two alloys is to be compared, and the melting points of each alloy have normal
distributions. 15 pieces of each metal are being tested; the results are shown in the
following table. This data will be used to test the claim that Alloy Zeta has a different
melting point than Alloy Gamma.
13. Two types of phone operating system are being tested to determine if there is a difference
in the proportions of system failures (crashes). Fifteen out of a random sample of 150 phones
with OS1 had system failures within the first eight hours of operation. Nine out of another
random sample of 150 phones with OS2 had system failures within the first eight hours of
operation. OS2 is believed to be more stable (have fewer crashes) than OS1.
d. At the 5% significance level, what is the decision about Ho?
e. What can you conclude about the two operating systems?
14. In the recent Census, three percent of the U.S. population reported being of two or more
races. However, this proportion varies from state to state. Suppose that two random surveys
are conducted. In the first random survey, out of 1,000 North Dakotans, only nine people
reported being of two or more races. In the second random survey, out of 500 Nevadans, 17
people reported being of two or more races. Conduct a hypothesis test to determine if the
population percentages are the same for the two states or if the percent for Nevada is
statistically higher than for North Dakota.
15. A study was conducted to test the effectiveness of a software patch in reducing system
failures over a six-month period. Results for randomly selected installations are shown in the
table below. Each installation was tested before and after the patch, and the number of
system failures was recorded. Using a 1% significance level, test the claim that
the average number of system failures is reduced after installing the patch. Assume the
differences have a normal distribution.
Installation A B C D E F G H
Before 3 6 4 2 5 8 2 6
After 1 5 2 0 1 0 2 2
a. What is the random variable?
b. State the null and alternative hypotheses.
c. What kind of test should be used and why?
d. What are the test statistic and p-value?
e. What is the decision for Ho?
f. What conclusion can we draw about the software patch?
16. A study was conducted to test the effectiveness of a juggling class. Before the class
started, six subjects juggled as many balls as they could at once. After the class, the same
six subjects juggled as many balls as they could. The differences in the number of balls are
calculated. Assume that the differences have a normal distribution, and test the claim that
the mean number of balls increased after the class; use a 5% significance level.
Subject A B C D E F
Before 3 4 3 2 4 5
After 4 5 6 4 5 7
a. State the null and alternative hypotheses.
b. What kind of test should be used and why?
c. What are the test statistic and p-value?
d. What is the decision for Ho?
e. What conclusion can we draw about the juggling class?
17. A doctor wants to know if a medication is effective in lowering blood pressure. Six subjects
have their blood pressures recorded. After twelve weeks on the medication, the same six
subjects have their blood pressure recorded again. For this test, only systolic pressure is of
concern. Test the doctor’s claim at the 1% significance level.
Patient A B C D E F
Before 161 162 165 162 166 171
After 158 159 166 160 167 169
NOTE
In the following problems, if you are using a Student's t-distribution, including for paired
data, you may assume that the underlying population is normally distributed. If no
significance level is given, assume that α = .05.
18. The mean number of English courses taken in a two–year time period by male and female
college students is believed to be about the same. An experiment is conducted and data are
collected from 29 males and 16 females. The males took an average of three English courses
with a standard deviation of 0.8. The females took an average of four English courses with a
standard deviation of 1.0. Is there a significant difference in the means?
19. A student at a four-year college claims that mean enrollment at four–year colleges is
higher than at two–year colleges in the United States. Two surveys are conducted. Of the 35
two–year colleges surveyed, the mean enrollment was 5,068 with a standard deviation of
4,777. Of the 35 four-year colleges surveyed, the mean enrollment was 5,466 with a standard
deviation of 8,191.
20. At Rachel’s 11th birthday party, eight girls were timed to see how long (in seconds) they
could hold their breath in a relaxed position. After a two-minute rest, they timed themselves
while jumping. The girls thought that the mean difference between their jumping and relaxed
times would be zero. Test their hypothesis.
Relaxed time Jumping Time
26 21
47 40
30 28
22 21
23 25
45 43
37 35
29 32
21. Mean entry-level salaries for college graduates with mechanical engineering degrees and
electrical engineering degrees are believed to be approximately the same. A recruiting office
thinks that the mean mechanical engineering salary is actually lower than the mean
electrical engineering salary. The recruiting office randomly surveys 50 entry level
mechanical engineers and 60 entry level electrical engineers. Their mean salaries were
$46,100 and $46,700, respectively. Their standard deviations were $3,450 and $4,210,
respectively. Conduct a hypothesis test to determine if there is evidence that the mean entry-
level mechanical engineering salary is lower than the mean entry-level electrical engineering
salary. (Use α = .05)
22. Marketing companies have collected data implying that teenage girls use more ring tones
on their cellular phones than teenage boys do. In one particular study of 40 randomly chosen
teenage girls and boys (20 of each) with cellular phones, the mean number of ring tones for
the girls was 3.2 with a standard deviation of 1.5. The mean for the boys was 1.7 with a
standard deviation of 0.8. Conduct a hypothesis test to determine if the means are
approximately the same or if the girls’ mean is higher than the boys’ mean.
Use the following information to answer the next two exercises: The Eastern and
Western Major League Soccer conferences have a new Reserve Division that allows new
players to develop their skills. Data for a randomly picked date showed the following annual
goals:
Western Eastern
Los Angeles 9 D.C. United 9
FC Dallas 3 Chicago 8
Chivas USA 4 Columbus 7
Real Salt Lake 3 New England 6
Colorado 4 MetroStars 5
San Jose 4 Kansas City 3
23. Suppose that a hypothesis test is conducted to test the claim that Western Division
teams score more goals, on average, than Eastern Division teams. The exact distribution
for this hypothesis test would be:
a. the normal distribution b. the Student's t-distribution
c. the uniform distribution d. the exponential distribution
24. If the level of significance for the test in #23 is α = 0.05, then the conclusion would be:
a. There is sufficient evidence to conclude that the W Division teams score fewer
goals, on average, than the E teams
b. There is insufficient evidence to conclude that the W Division teams score more
goals, on average, than the E teams.
c. There is insufficient evidence to conclude that the W teams score fewer goals, on
average, than the E teams score.
d. Unable to determine
Use the following information to answer the next two exercises: A statistics instructor
believes that there is no significant difference between the mean class scores of statistics day
students on Exam 2 and statistics night students on Exam 2. She takes random samples from
each of the populations. The mean and standard deviation for 35 statistics day students were
75.86 and 16.91. The mean and standard deviation for 37 statistics night students were 75.41
and 19.73. The “day” subscript refers to the day students and the “night” subscript refers to
the night students.
25. If this data is used to test the instructor’s claim, which of the following would be the
appropriate conclusion?
a. There is insufficient evidence to conclude that the statistics day students' mean
on Exam 2 is better than the statistics night students' mean on Exam 2.
b. There is insufficient evidence to conclude that there is a significant difference
between the means of the statistics day students and night students on Exam 2.
c. There is sufficient evidence to conclude that there is a significant difference
between the means of the statistics day students and night students on Exam 2.
a. µday > µnight b. µday < µnight c. µday = µnight d. µday ≠ µnight
27. Researchers interviewed street prostitutes in Canada and the United States. The mean
age of the 100 Canadian prostitutes upon entering prostitution was 18 with a standard
deviation of six. The mean age of the 130 United States prostitutes upon entering prostitution
was 20 with a standard deviation of eight. Is the mean age of entering prostitution in Canada
lower than the mean age in the United States? Test at a 1% significance level.
28. A test is conducted to compare two types of diet. A powder diet is tested on 49 people,
and a liquid diet is tested on 36 different people. The powder diet group had a mean weight
loss of 42 pounds with a standard deviation of 12 pounds. The liquid diet group had a mean
weight loss of 45 pounds with a standard deviation of 14 pounds. Use this data and a 5%
significance level to test the claim that the liquid diet yields a higher mean weight loss than
the powder diet.
29. Parents of teenage boys often complain that auto insurance costs more, on average, for
teenage boys than for teenage girls. A group of concerned parents examines a random sample
of insurance bills. The mean annual cost for 36 teenage boys was $679. For 23 teenage girls,
it was $559. From past years, it is known that the population standard deviation for each
group is $180. Does this data provide evidence that the mean cost for auto insurance for
teenage boys is greater than that for teenage girls?
30. A group of transfer bound students wondered if they will spend the same mean amount
on texts and supplies each year at their four-year university as they have at their community
college. They conducted a random survey of 54 students at their community college and 66
students at their local four-year university. The sample means were $947 and $1,011,
respectively. From past studies, the population standard deviations are known to be $254
and $87, respectively. Conduct a hypothesis test to determine if the means are statistically
the same.
31. Some manufacturers claim that non-hybrid sedan cars have a lower mean miles-per-
gallon (mpg) than hybrid ones. Suppose that consumers test 21 hybrid sedans and get a mean
of 31 mpg with a standard deviation of 7 mpg. And 31 non-hybrid sedans get a mean of 22
mpg with a standard deviation of 4 mpg. Suppose that the population standard deviations
are known to be 6 and 3, respectively. Conduct a hypothesis test to evaluate the
manufacturers’ claim.
32. A study is done to determine if students in the California State University system take
longer to graduate, on average, than students enrolled in private universities. One hundred
students from both the California State University system and private universities are
surveyed. Suppose that from years of research, it is known that the population standard
deviations are 1.5811 years and 1 year, respectively. The following data are collected. The
California State University system students took on average 4.5 years with a standard
deviation of 0.8. The private university students took an average of 4.1 years with a standard
deviation of 0.3. Use this data to test the claim.
33. One of the questions in a study of marital satisfaction of dual-career couples was to rate
the statement “I’m pleased with the way we divide the responsibilities for childcare.” The
ratings went from one (strongly agree) to five (strongly disagree). The table below contains
ten of the paired responses for husbands and wives. Use this data to conduct a hypothesis
test to see if the mean difference in the husband’s versus the wife’s satisfaction level is
negative (meaning that, within the partnership, the husband is happier than the wife).
Wife’s score 2 2 3 3 4 2 1 1 2 4
Husband’s score 2 2 1 3 2 1 1 1 2 4
34. A recent drug survey showed an increase in the use of drugs and alcohol among local high
school seniors as compared to the national percent. Suppose that a survey of 100 local seniors
and 100 national seniors is conducted to test whether the proportion of drug and alcohol use
is higher locally than nationally. Locally, 65 seniors reported using drugs or alcohol within
the past month, while 60 national seniors reported using them. Does this data provide
evidence to support the claim?
35. A test is conducted to determine whether the proportions of female suicide victims
who are ages 15 to 24 are the same for whites and for African-Americans in
the United States. We randomly pick one year, 1992, to compare the populations. The number
of suicides estimated in the United States in 1992 for white females is 4,930. Five hundred
eighty were aged 15 to 24. The estimate for black females is 330. Forty were aged 15 to 24.
We will let female suicide victims be our population.
Use the following information to answer the next three exercises. Neuro-invasive West
Nile virus is a severe disease that affects a person’s nervous system. It is spread by the Culex
species of mosquito. In the United States in 2010 there were 629 reported cases of neuro-
invasive West Nile virus out of a total of 1,021 reported cases and there were 486 neuro-
invasive reported cases out of a total of 712 cases reported in 2011. Is the 2011 proportion of
neuro-invasive West Nile virus cases more than the 2010 proportion of neuro-invasive West
Nile virus cases? Using a 1% level of significance, we conduct an appropriate hypothesis test.
38. The p-value is 0.0022. At a 1% level of significance, the appropriate conclusion is:
c. There is insufficient evidence to conclude that the proportion of people in the
United States in 2011 who contracted neuro-invasive West Nile disease is less than
the proportion of people in the United States in 2010 who contracted neuro-invasive
West Nile disease.
39. A recent year was randomly picked from 1985 to the present. In that year, there were
2,051 Hispanic students at Cabrillo College out of a total of 12,328 students. At Lake Tahoe
College, there were 321 Hispanic students out of a total of 2,441 students. Does this data
provide evidence that the percent of Hispanic students at the two colleges is significantly
different? Explain.
40. Adults aged 18 years old and older were randomly selected for a survey on obesity. Adults
are considered obese if their body mass index (BMI) is at least 30. The researchers wanted to
determine if the proportion of women who are obese in the south is less than the proportion
of southern men who are obese. The results are shown in the table below. At the 1% level of
significance, is there evidence to support the claim?
41. Researchers conducted a study to find out if there is a difference in the use of eReaders
by different age groups. Randomly selected participants were divided into two age groups. In
the 16- to 29-year-old group, 7% of the 628 surveyed use eReaders, while 11% of the 2,309
participants 30 years old and older use eReaders. Does this data indicate that there is a
significant difference in the proportions between the age groups?
42. A group of friends debated whether a higher percentage of men use smartphones than
women. They consulted a research study of smartphone use among adults. The results of the
survey indicate that of the 973 men randomly sampled, 379 use smartphones. For women,
404 of the 1,304 who were randomly sampled use smartphones. Use this data to test the
friends’ conjecture at the 5% level of significance.
43. Two computer users were discussing tablet computers. They conjectured that a higher
proportion of people ages 16 to 29 use tablets than the proportion of people age 30 and older.
The table below shows the number of tablet owners for each age group. Test the claim using
a 5% level of significance.
44. A teacher is interested in whether children’s educational computer software costs less,
on average, than children’s entertainment software. Thirty-six educational software titles
were randomly picked from a catalog. The mean cost was $31.14 with a standard deviation
of $4.69. Thirty-five entertainment software titles were randomly picked from the same
catalog. The mean cost was $33.86 with a standard deviation of $10.87. Use this data to test
the claim that children’s educational software costs less, on average, than children’s
entertainment software.
45. A social scientist recently claimed that the proportion of college-age males with at least
one pierced ear is as high as the proportion of college-age females. She conducted a survey
at a college to test her claim. Out of 107 males, 20 had at least one pierced ear. Out of 92
females, 47 had at least one pierced ear. Does this data provide evidence to reject her
claim?
46. Ten individuals went on a low–fat diet for 12 weeks to lower their cholesterol. The data
are recorded in the table below. Does this data provide evidence that their cholesterol levels
were significantly lowered?
Use the following information to answer the next two exercises. A new AIDS prevention
drug was tried on a group of 224 HIV positive patients. Forty-five patients developed AIDS
after four years. In a control group of 224 HIV positive patients, 68 developed AIDS after four
years. We want to test whether the method of treatment reduces the proportion of patients
that develop AIDS after four years or if the proportions of the treated group and the untreated
group stay the same. Let the subscript t = treated patient and ut = untreated patient.
48. If the p-value is 0.0062 and we use α = .05, what is the conclusion?
49. A golf instructor is interested in determining if her new technique for improving players’
golf scores is effective. She takes four new students. She records their 18-hole scores before
learning the technique and then after having taken her class. She conducts a hypothesis test.
The data are as follows:
50. A local cancer support group believes that the estimate for new female breast cancer
cases in the south is higher in 2013 than in 2012. The group compared the estimates of new
female breast cancer cases by southern state in 2012 and in 2013. The results are shown in
the table:
States AL AR FL GA KY LA MS NC OK SC TN TX
2012 3450 2150 15540 6970 3160 3320 1990 7090 2630 3570 4680 15050
2013 3720 2280 15710 7310 3300 3630 2080 7430 2690 3580 5070 14980
Using a 5% significance level, does this data provide evidence that there were more cases in
2013 than in 2012?
51. A traveler wanted to know if the prices of hotels are different in the ten cities that he
visits the most often. The list of the cities with the corresponding hotel prices (in dollars) for
his two favorite hotel chains is in the table below. At a 5% level of significance, is there
evidence that the prices are different, on average, between the two chains?
52. Mean entry-level salaries for college graduates with mechanical engineering degrees and
electrical engineering degrees are believed to be approximately the same. A recruiting officer
wants to estimate the difference in the means. He randomly surveys 50 entry level
mechanical engineers and 60 entry level electrical engineers. Their mean salaries were
$46,100 and $46,700, respectively, and their standard deviations were $3,450 and $4,210,
respectively. Construct a 95% confidence interval for the difference in the means. Does this
interval suggest that the means could be the same? Explain.
53. Marketing companies have collected data implying that on average, teenage girls use
more ring tones on their cellular phones than teenage boys do. In one particular study of 40
randomly chosen teenage girls and boys (20 of each) with cellular phones, the mean number
of ring tones for the girls was 3.2 with a standard deviation of 1.5. The mean for the boys was
1.7 with a standard deviation of 0.8. Use this data to construct a 95% confidence interval for
the difference in means between girls and boys. According to this interval, is there evidence
that on average, girls use more ring tones than boys? How big is this difference?
54. Some manufacturers claim that non-hybrid sedan cars have a lower mean miles-per-
gallon (mpg) than hybrid ones. Suppose that consumers test 21 hybrid sedans and get a mean
of 31 mpg with a standard deviation of 7 mpg. And 31 non-hybrid sedans get a mean of 22
mpg with a standard deviation of 4 mpg. Suppose that the population standard deviations
are known to be 6 and 3, respectively. Use this data to construct a 95% confidence interval
for the difference in mean mpg between hybrid cars and non-hybrid cars.
55. A study is done to determine if students in the California State University system take
longer to graduate, on average, than students enrolled in private universities. One hundred
students from both the California State University system and private universities are
surveyed. Suppose that from years of research, it is known that the population standard
deviations are 1.5811 years and 1 year, respectively. The following data are collected. The
California State University system students took on average 4.5 years with a standard
deviation of 0.8. The private university students took an average of 4.1 years with a standard
deviation of 0.3. Use this data to find a 95% confidence interval for the difference in means
between the Cal State system and private universities.
56. A recent year was randomly picked from 1985 to the present. In that year, there were
2,051 Hispanic students at Cabrillo College out of a total of 12,328 students. At Lake Tahoe
College, there were 321 Hispanic students out of a total of 2,441 students. Use this data to
construct a 90% confidence interval for the difference in the proportions.
57. Adults aged 18 years old and older were randomly selected for a survey on obesity. Adults
are considered obese if their body mass index (BMI) is at least 30. The researchers wanted to
determine if the proportion of women who are obese in the south is less than the proportion
of southern men who are obese. The results are shown in the table below. Use this data to
construct a 95% confidence interval for the difference in obesity rates between men and
women.
58. An accounting firm is trying to decide between IT training that is conducted in-house or
conducted by consultants. The table below shows the average annual training cost per
employee. Are the mean costs significantly different using a 5% level of significance? Assume
the population variances are equal. Show all hypothesis testing steps.
59. According to a CNNMoney article, millennials who are eligible for a workplace retirement
plan are saving at the same participation rate as older generations. Test this
claim at a 1% level of significance. Show all hypothesis testing steps. Suppose the
following data is collected:
n proportion
Millennials 300 94%
Other Generations 330 92%
REFERENCES
11.1 Two Population Means with Unknown Standard Deviations
Data from the United States Senate website, available online at www.Senate.gov (accessed June 17,
2013).
“Strip Clubs: Where Prostitution and Trafficking Happen.” Prostitution Research and Education,
2013. Available online at www.prostitutionresearch.com/ProsViolPosttrauStress.html (accessed June
17, 2013).
Hinduja, Sameer. “Sexting Research and Gender Differences.” Cyberbullying Research Center,
2013. Available online at http://cyberbullying.us/blog/sexting-research-and-gender-differences/
(accessed June 17, 2013).
“Smart Phone Users, By the Numbers.” Visually, 2013. Available online at
http://visual.ly/smart-phone-users-numbers (accessed June 17, 2013).
Smith, Aaron. “35% of American adults own a Smartphone.” Pew Internet, 2013. Available
online at http://www.pewinternet.org/~/media/Files/Reports/2011/PIP_Smartphones.pdf (accessed
June 17, 2013).
“Texas Crime Rates 1960–1012.” FBI, Uniform Crime Reports, 2013. Available online at:
http://www.disastercenter.com/crime/txcrime.htm (accessed June 17, 2013).
11.3 Comparing Two Independent Population Proportions
Data from Hilton Hotels. Available online at http://www.hilton.com (accessed June 17, 2013).
Data from Hyatt Hotels. Available online at http://hyatt.com (accessed June 17, 2013).
Data from Statistics, United States Department of Health and Human Services.
Data from Whitney Exhibit on loan to San Jose Museum of Art.
Data from the American Cancer Society. Available online at http://www.cancer.org/index (accessed
June 17, 2013).
Data from the Chancellor’s Office, California Community Colleges, November 1994.
“West Nile Virus.” Centers for Disease Control and Prevention. Available online at
http://www.cdc.gov/ncidod/dvbid/westnile/index.htm (accessed June 17, 2013).
“66% Of Millennials Have Nothing Saved for Retirement.” CNNMoney, Cable News Network,
money.cnn.com/2018/03/07/retirement/millennial-retirement-savings/index.html?sr=twCNN030718millennial-retirement-savings0103PMStory
(accessed July 18, 2018).
Cohen, Jason. “IOS More Popular in Japan and US, Android Dominates in China and India.”
PCMAG, PCMag, 4 Sept. 2020,
www.pcmag.com/news/ios-more-popular-in-japan-and-us-android-dominates-in-china-and-india.
CHAPTER 11 SOLUTIONS:
27) The hypotheses are H0: μ1 ≥ μ2, Ha: μ1 < μ2 (μ1 is the mean age for Canada). Use
2-SampTTest. From the calculator, t = -2.17 and p = 0.0157. Since p > .01, do not reject Ho.
There is not enough evidence at the .01 level of significance to conclude that the mean age
for entering prostitution is lower in Canada than in the U.S.
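The test statistic in this solution can be checked from the summary statistics given in the exercise (Canada: mean 18, standard deviation 6, n = 100; United States: mean 20, standard deviation 8, n = 130). The sketch below is an illustration only; the function name `two_sample_t` is hypothetical, and the p-value itself still requires a Student's t-distribution, which the Python standard library does not provide.

```python
from math import sqrt

def two_sample_t(xbar1, s1, n1, xbar2, s2, n2):
    """Return (t, df) for a two-sample t-test with unpooled variances."""
    a, b = s1**2 / n1, s2**2 / n2
    t = (xbar1 - xbar2) / sqrt(a + b)  # difference divided by its standard error
    df = (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))  # Welch's formula
    return t, df

# Summary statistics from the exercise (Canada first, United States second)
t, df = two_sample_t(18, 6, 100, 20, 8, 130)
print(f"t = {t:.2f}, df = {df:.1f}")
```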
29) The hypotheses are H0: µ1 ≤ µ2, Ha: µ1 > µ2 (where µ1 is the mean for boys) Use
2-SampZTest, since the population SD’s are both known. From the calculator, z = 2.50 and
p-value: 0.0062. Since p < 0.05, Reject Ho. At the 5% significance level, there is sufficient
evidence to conclude that the mean cost of auto insurance for teenage boys is greater than
that for girls.
31) The hypotheses are H0: µ1 ≤ µ2, Ha: µ1 > µ2 (where µ1 is the mean for hybrid cars) Use
2-SampZTest, since the population SD’s are both known. From the calculator, z = 6.36 and
p = 0. Since p < .05, we reject Ho. At the 5% significance level, there is sufficient evidence
to conclude that the mean miles per gallon of non-hybrid sedans is less than that of hybrid
sedans.
33) The hypotheses are H0: µd = 0, Ha: µd < 0 (d = wife’s score – husband’s score)
This is a matched-pairs t-test. From the calculator, t = -1.86 and p = 0.0479. Since p <
.05, we reject the null hypothesis. There is sufficient evidence (just barely!) to conclude that
the mean difference is negative. That is, at the .05 significance level, there is evidence
that on average, husbands are happier with the division of child-care responsibilities.
35) The hypotheses are: H0: pW = pB, Ha: pW ≠ pB. We use 2-PropZTest; from the
calculator we get z = -0.1944 and p-value = 0.8458. Do not reject Ho, since p > .05.
There is not sufficient evidence to conclude that the proportions of white and black female
suicide victims, aged 15 to 24, are different.
37) The hypotheses would be Ho: p2011 ≤ p2010, Ha: p2011 > p2010. Answer a.
41) The hypotheses are: H0: p1 = p2, Ha: p1 ≠ p2. Use 2-PropZTest; from the calculator,
z = -2.94 and p-value = 0.0033. Since p < .05, we reject Ho. At the 5% level of significance,
there is sufficient evidence to conclude that the proportion of eReader users aged 16 to 29
years is different from the proportion of eReader users 30 and older.
43) The hypotheses are: H0: p1 ≤ p2, Ha: p1 > p2 (where p1 is the proportion for 16-29 yr
olds). Use 2-PropZTest; from the calculator, z = -2.94 and p-value = 0.2354. Since p > .01, we
do not reject Ho. At the 1% level of significance, there is not sufficient evidence to conclude
that a higher proportion of tablet owners are aged 16 to 29 years old than are 30 years and
older.
45) The hypotheses are: H0: p1 ≥ p2, Ha: p1 < p2 (where p1 is the proportion for male
students). Use 2-PropZTest; from the calculator, z = -4.82 and p-value = 0.0000007.
Since p < .05, we reject Ho. At the 5% significance level, there is sufficient evidence to
conclude that the proportion of male students with at least one pierced ear is less than the
proportion for female students.
47) The hypotheses are H0: pt ≥ put, Ha: pt < put (where pt is the proportion of untreated
patients who develop AIDS) Answer d.
49) The hypotheses are H0: µd = 0, Ha: µd < 0 (d = score before class – score after). This is a
matched-pairs t-test. From the calculator, t = 1.19 and p = 0.8405. Since p > .05, we do not
reject the null hypothesis. There is no evidence that the course helped improve scores.
51) The hypotheses are H0: µd = 0, Ha: µd ≠ 0 (d = Hyatt price – Hilton price)
This is a matched-pairs t-test. From the calculator, t = .41 and p = 0.6881.
Since p > .05, we do not reject the null hypothesis. There is no evidence that there is a
difference in prices between the chains.
53) Let µ1 = mean for girls, µ2 = mean for boys. Using 2SampTInt, we are 95% confident
that .7224 < µ1 – µ2 < 2.2775. So we are 95% confident that .7224 + µ2 < µ1. There is
evidence that the mean number for girls is at least .7224 more than the mean for boys.
55) Let µ1 = mean for Cal State, µ2 = mean for private universities. Using 2SampZInt, we
are 95% confident that -.1185 < µ1 – µ2 < .9185. Since the interval contains both positive
and negative values, this interval does not provide evidence that the mean time needed to
graduate is higher for Cal State than it is for private schools.
57) Let p1 = proportion for men, p2 = women. Using 2PropZInt, we are 95% confident that
0.0022 < p1 – p2 < 0.2750. Thus, there is evidence that 0.0022 + p2 < p1. That is, we are
95% confident that the proportion for men is at least .0022 more than for women.
59) Let p1 = proportion for millennials, p2 = older generation. Ho: p1 = p2 (claim) Ha: p1
≠ p2 Using 2PropZTest, test value Z = .92; p-value = .355, Do not Reject Ho. There is not
enough evidence to reject the claim that millennials save at the same proportion as older
generations.
12 | THE CHI-SQUARE DISTRIBUTION
Chapter Objectives
12.1 | Introduction to Chi-Square Distribution
Have you ever wondered if lottery numbers were evenly distributed or if some numbers
occurred with a greater frequency? How about if the types of movies people preferred were
different across different age groups? What about if a coffee machine was dispensing
approximately the same amount of coffee each time? You could answer these questions by
conducting a hypothesis test.
You will now study a new distribution, one that is used to determine the answers to such
questions. This distribution is called the chi-square distribution.
In this chapter, you will learn four major applications of the chi-square distribution:
NOTE: Though the chi-square distribution depends on calculators or computers for most of
the calculations, there are Chi-Square Distribution tables available (see Appendix). TI-83+
and TI-84 calculator instructions are included in the text.
Let’s conduct the following statistical experiment. We select samples of size n from a normal
population, which has a standard deviation of σ. We find that the standard deviation in our
sample is equal to s. Given these data, we can define a statistic, called chi-square, using the
following formula:
$$\chi^2 = \frac{(n-1)s^2}{\sigma^2}$$
[Figure: chi-square distribution curves for df = 2, df = 6, and df = 11]
NOTE: The degrees of freedom for the three major applications of Chi-square distribution
are each calculated differently.
For the χ² distribution, the population mean is μ = df (degrees of freedom) and the population
standard deviation is σ = √(2·df). The random variable is shown as χ², but may also be any
upper-case letter.
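The statement that the χ² statistic has mean df and standard deviation √(2·df) can be checked informally by simulation. The following Python sketch is only an illustration: the population mean (0), σ = 2, n = 5, the number of repetitions, and the random seed are arbitrary choices that do not come from the text.

```python
import math
import random
import statistics

def simulate_chi_square(n, sigma, reps, seed=1):
    """Draw `reps` samples of size n from a normal population with
    standard deviation sigma and return (n - 1) * s^2 / sigma^2 for each."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        s2 = statistics.variance(sample)  # sample variance (divisor n - 1)
        values.append((n - 1) * s2 / sigma ** 2)
    return values

n, sigma = 5, 2.0          # arbitrary illustrative values
df = n - 1
values = simulate_chi_square(n, sigma, reps=20000)

print("simulated mean:", round(statistics.mean(values), 2), "| theory:", df)
print("simulated SD:  ", round(statistics.stdev(values), 2),
      "| theory:", round(math.sqrt(2 * df), 2))
```

The simulated mean and standard deviation should land close to df = 4 and √8 ≈ 2.83, and every simulated value is zero or greater, as expected for a chi-square variable.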
Recall that critical values are values from the distribution that separate the confidence area
and the non-confidence area. We found zα/2 by using ±invnorm(α/2) and tα/2 by using t-
distribution chart or by ±invT(α/2, df). Since the z-distribution and the t-distribution are
symmetrical, the left and right critical values are opposite values. However, in the chi-square
distribution, values are only positive.
The critical values for the χ² distribution are recorded in two tables (left critical value table
and right critical value table). To find these values, you need area in the tail and the degrees
of freedom.
Below you can see that the chi-square distribution table is split into two tables. The first table
gives the left critical values, χ²L. The second table gives the right critical values, χ²R. For
example, if the tail area is 0.05 and df = 14, then χ²L = 6.571 using the table for the left critical
values and χ²R = 23.685 using the table for the right critical values. The two complete tables are
found in the appendix.
12.1 Find the left and right critical values for the chi-square distribution for C.L. = .98 and
df = 6
Recall that critical values from the Z-distribution (standard normal) and the t-distribution
(approximately normal) are symmetric opposites. The critical values from chi-square
distribution are both positive. Also, t-distribution and chi-square distribution are dependent
on degrees of freedom (df).
To use Excel functions, the first thing to type is an equal sign. Here are the Excel functions
for each critical value.
The following example uses C.L. = .95 and df = 14. If the confidence level is 95%, that leaves
a total of 5% for the two tails of the distribution, so α = 0.05. Splitting the 5% evenly gives
2.5% = 0.025 as α/2, the area in each tail.
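For readers without Excel or a chart at hand, the same critical values can be approximated in Python. This is a minimal sketch using only the standard library: the helper names `chi2_cdf`, `chi2_inv`, and `critical_values` are made up for this illustration, the cumulative area is computed with a standard series for the incomplete gamma function, and the inverse is found by simple bisection.

```python
import math

def chi2_cdf(x, df):
    """Area to the left of x under the chi-square curve with df degrees
    of freedom (series for the regularized lower incomplete gamma)."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi2_inv(area_left, df):
    """Value that has `area_left` of the distribution to its left (bisection)."""
    lo, hi = 0.0, df + 10.0 * math.sqrt(2.0 * df) + 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < area_left:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def critical_values(conf_level, df):
    """Left and right critical values, with alpha/2 in each tail."""
    alpha = 1.0 - conf_level
    return chi2_inv(alpha / 2.0, df), chi2_inv(1.0 - alpha / 2.0, df)

left, right = critical_values(0.95, 14)
print("left critical value: ", round(left, 3))
print("right critical value:", round(right, 3))
```

For C.L. = .95 and df = 14 the results should agree with the chart entries for a tail area of 0.025 (about 5.629 and 26.119).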
12.2 | Confidence Interval for Variance/Standard Deviation
Recall that an interval is a range of numbers. A confidence interval is a range of values,
calculated from sample data, that is used to estimate an unknown population parameter.
The interval gives us a lower bound value and upper bound for the parameter with a certain
level of confidence.
Here we present intervals for estimating σ² and σ between a lower and upper bound.
Unlike other confidence intervals we have seen, these bounds can’t be found using the
calculator. They depend on the chi-square critical values.
$$\frac{(n-1)s^2}{\chi^2_R} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_L}$$
where χ²L and χ²R are the critical values with d.f. = n – 1, and s² is the point estimate.
NOTE: χ²L and χ²R are not to be squared. The right critical value is on the left side and the left
critical value is on the right side. Also notice the numerators are the same. When you divide the
same value by a larger number, you get a smaller value.
$$\sqrt{\frac{(n-1)s^2}{\chi^2_R}} < \sigma < \sqrt{\frac{(n-1)s^2}{\chi^2_L}}$$
The sample variance, s², can be found using the calculator or an Excel function. Using the TI
calculator, s² can be found by:
Using Excel:
Example 12.1
The weights (in pounds) of 15 dogs selected randomly from those adopted out by an animal
shelter last week are shown in the list below. Construct a 98% confidence interval for the
population variance.
25, 34, 27, 27, 31, 28, 27, 28, 33, 31, 28, 29, 32, 29, 29
Solution 12.1
First we need to find the sample variance (chapter 2) of the data set. To find it with the
calculator, enter the data into L1 (Stat, Edit), then use 2nd Stat, Math, #8 Variance (L1):
s² = 6.314
Second, we find the critical values using the chi-square distribution chart.
C.L. = 0.98, so α/2 = 0.01 in each tail. With df = 14, χ²L = 4.660 and χ²R = 29.141.
Third, we plug in the values into the formula.
$$\frac{(n-1)s^2}{\chi^2_R} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_L}$$
$$\frac{14(6.314)}{29.141} < \sigma^2 < \frac{14(6.314)}{4.660}$$
$$3.03 < \sigma^2 < 18.97$$
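The three steps of Solution 12.1 can also be sketched in Python. The critical values are typed in from the chi-square chart, exactly as in the hand calculation; the variable names are chosen here for illustration only.

```python
import statistics

# Dog weights (in pounds) from Example 12.1
weights = [25, 34, 27, 27, 31, 28, 27, 28, 33, 31, 28, 29, 32, 29, 29]

n = len(weights)
s2 = statistics.variance(weights)   # sample variance (divisor n - 1)

# Critical values for df = 14 and alpha/2 = 0.01, read from the chart
chi2_left, chi2_right = 4.660, 29.141

# The right critical value gives the lower bound, the left one the upper bound
lower = (n - 1) * s2 / chi2_right
upper = (n - 1) * s2 / chi2_left

print(f"s^2 = {s2:.3f}")
print(f"{lower:.2f} < sigma^2 < {upper:.2f}")
```

Because the code keeps the unrounded sample variance, its bounds may differ from a hand calculation in the third decimal place.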
Example 12.2
A random sample of 18 men has a mean height of 67.5 inches and a standard deviation of
1.5 inches. Construct a 99% confidence interval for the population standard deviation.
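Here s² = (1.5)² = 2.25 and df = n – 1 = 17. For α/2 = 0.005, χ²L = 5.697 and χ²R = 35.718.
$$\sqrt{\frac{(n-1)s^2}{\chi^2_R}} < \sigma < \sqrt{\frac{(n-1)s^2}{\chi^2_L}} \;\Rightarrow\; \sqrt{\frac{17(2.25)}{35.718}} < \sigma < \sqrt{\frac{17(2.25)}{5.697}}$$
$$1.035 < \sigma < 2.591$$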
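When only summary statistics are given, as in Example 12.2, the interval for σ needs just the sample size, the sample standard deviation, and the two chart values. A short sketch follows; the function name `sd_interval` is hypothetical.

```python
import math

def sd_interval(n, s, chi2_left, chi2_right):
    """Confidence interval for the population standard deviation:
    take the square roots of the variance bounds."""
    lower = math.sqrt((n - 1) * s ** 2 / chi2_right)
    upper = math.sqrt((n - 1) * s ** 2 / chi2_left)
    return lower, upper

# Example 12.2: n = 18, s = 1.5, chart values for df = 17 and alpha/2 = 0.005
lower, upper = sd_interval(18, 1.5, chi2_left=5.697, chi2_right=35.718)
print(f"{lower:.3f} < sigma < {upper:.3f}")
```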
12.3 | Test of a Single Variance
A test of a single variance assumes that the underlying distribution is normal. The null and
alternative hypotheses are stated in terms of the population variance, σ² (or population
standard deviation, σ). The test statistic is:
$$\chi^2 = \frac{(n-1)s^2}{\sigma^2}$$
where:
s² = sample variance
σ² = population variance
Example 12.3
Math instructors are not only interested in how their students do on exams, on average, but
how the exam scores vary. To many instructors, the variance (or standard deviation) may be
more important than the average. Standard deviation tells us how consistent data is.
Suppose a math instructor believes that the standard deviation for his final exam is five
points. One of his best students thinks otherwise. The student claims that the standard
deviation is more than five points. If the student were to conduct a hypothesis test, what
would the null and alternative hypotheses be?
Solution 12.3
Even though we are given the population standard deviation, we can set up the test using
the population variance as follows. Recall variance is the square of standard deviation.
H0: σ² ≤ 5²
Ha: σ² > 5²
12.2 A SCUBA instructor wants to record the collective depths of each of his students’ dives
during their checkout. He is interested in how the depths vary, even though everyone should
have been at the same depth. He believes the standard deviation is three feet. His assistant
thinks the standard deviation is less than three feet. If the instructor were to conduct a test,
what would the null and alternative hypotheses be?
Example 12.4
With individual lines at its various windows, a post office finds that the standard deviation
for normally distributed waiting times for customers on Friday afternoon is 7.2 minutes. The
post office experiments with a single, main waiting line and finds that for a random sample
of 25 customers, the waiting times for customers have a standard deviation of 3.5 minutes.
With a significance level of 5%, test the claim that a single line causes lower variation among
waiting times (more consistent waiting times) for customers.
Since the claim is that a single line causes less variation, this is a test of a single variance.
The parameter is the population variance, σ², or the population standard deviation, σ.
H0: σ² ≥ 7.2²
Ha: σ² < 7.2²
Step 2: Select the correct test, determine the correct sampling distribution
Calculate the test statistic: Substituting n = 25, s = 3.5, and σ = 7.2 (the value of σ in
Ho), we get:
$$\chi^2 = \frac{(n-1)s^2}{\sigma^2} = \frac{(25-1)(3.5)^2}{(7.2)^2} = 5.67$$
p-value = P(χ² < 5.67) = χ²cdf(0, 5.67, 24) = 0.000042 (calculator: 2nd VARS, χ²cdf)
Step 4: State conclusion based on p-value. P-value is compared with alpha. Alpha is the
level of significance.
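The test statistic and left-tailed p-value from Example 12.4 can be reproduced with a short Python sketch. The helper `chi2_cdf` is not a library function; it is written out here using a standard series for the incomplete gamma function, and it plays the role of χ²cdf(0, x, df) on the calculator.

```python
import math

def chi2_cdf(x, df):
    """Area to the left of x under the chi-square curve with df degrees
    of freedom (series for the regularized lower incomplete gamma)."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

# Example 12.4: n = 25, s = 3.5, hypothesized sigma = 7.2, alpha = 0.05
n, s, sigma0, alpha = 25, 3.5, 7.2, 0.05
df = n - 1

test_stat = df * s ** 2 / sigma0 ** 2
p_value = chi2_cdf(test_stat, df)      # left-tailed test: area to the left

decision = "Reject H0" if p_value < alpha else "Do not reject H0"
print(f"test statistic = {test_stat:.2f}, p-value = {p_value:.6f}")
print(decision)
```

Since the p-value is far below α = 0.05, the sketch reaches the same decision as the critical value method: reject Ho.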
H0: σ² ≥ 7.2²
Ha: σ² < 7.2²
Step 2: Select the correct test, determine the correct sampling distribution
Calculate the test statistic: Substituting n = 25, s = 3.5, and σ = 7.2 (the
hypothesized value of σ), we get:
$$\chi^2 = \frac{(n-1)s^2}{\sigma^2} = \frac{(25-1)(3.5)^2}{(7.2)^2} = 5.67$$
We are only finding the left critical value since the test is left-tailed. We do not split
α = 0.05 since the test is one-tailed. Therefore, using the chi-square (χ²) left critical value
chart, we need the area of the tail, which is 0.05, and df = n – 1 = 24.
χ²L = 13.848
Step 4: State the conclusion. In the critical value method, you reject Ho if the test statistic
is in the rejection region.
In this example, the rejection region is on the left side because it is a left-tailed test.
The left rejection region of a chi-square distribution starts at 0 and ends at the critical
value, χ²L = 13.848.
[Figure: chi-square curve with the rejection region from 0 to 13.848; the test statistic 5.67 falls inside it.]
The test statistic is 5.67 and is between 0 and 13.848; therefore, the conclusion is
Reject Ho.
12.3 The FCC conducts broadband speed tests to measure how much data per second passes
between a consumer’s computer and the internet. Suppose the standard deviation of Internet
speeds across Internet Service Providers (ISPs) is 10.8 percent. A sample of 20
ISPs is taken, and the sample standard deviation is 12.2. An analyst claims that the standard
deviation of speeds is more than what was reported. State the null and alternative
hypotheses, compute the degrees of freedom, the test statistic, sketch the graph of the p-
value, and draw a conclusion. Test at the 1% significance level.
12.4 | Goodness-of-Fit Test
In this type of hypothesis test, you determine whether the data "fits" a particular distribution
or not. For example, you may suspect your unknown data fits a binomial or a uniform
distribution. You use a chi-square test (meaning the distribution for the hypothesis test is
chi-square) to determine if there is a fit or not. The null and the alternative hypotheses for
this test may be written in sentences or may be stated as equations or inequalities.
The test statistic for a goodness-of-fit test is:
$$\chi^2 = \sum_{k} \frac{(O-E)^2}{E}$$
where
• O = observed frequency
• E = expected values (from theory)
• Degrees of freedom = k – 1, where k = the number of different data cells or categories
The observed values are the data values and the expected values are the values you would
expect to get if the null hypothesis were true.
The goodness-of-fit test is almost always right-tailed. If the observed values and the
corresponding expected values are not close to each other, then the test statistic can get very
large and will be way out in the right tail of the chi-square curve.
NOTE:
The expected value for each cell needs to be at least five in order for you to use this test.
Example 12.5
The Nike picture at the beginning of this chapter stated that chi-square distribution can be
used to determine if shoe styles are equally likely. Suppose that a study from a Nike running
shoe order based on customer demand was done and found the following observed
frequencies.
a. Determine the null and alternative hypotheses needed to conduct a goodness-of-fit test
to test the claim that Nike shoe styles are equally likely.
b. Find the degree of freedom.
c. Find the test value
d. Find the p-value
e. State the conclusion
f. Interpret the decision
Solution 12.5
c.) To find the test value, we first need to find the Expected values. To find the Expected values
when the Goodness of Fit test is about uniform distribution (equally likely), divide n by
the number of categories. Here, n = 215 and k = number of categories = 5. So the Expected
values are 215/5 = 43.
Now place the Observed values (33, 41, 54, 47, 40) into L1, place the Expected values (43,
43, 43, 43, 43) into L2.
Sum up L3 (2nd Stat, Math), where L3 holds the values (L1 − L2)²/L2, to find the test value.
Then use χ²cdf (2nd Vars) to find the p-value.
d.) Since Goodness of Fit tests are always right-tailed, the p-value is on the right side of
the chi-square distribution.
χ²cdf(test value, 10^99, df) = χ²cdf(5.814, 10^99, 4) = .213
f.) Since level of significance is not given, the standard level of significance is 5%.
Remember the claim is assigned to Ho in this example.
Notice we are still following the 5 steps of hypothesis testing shown in chapter 9. The
alternative hypothesis uses words instead of inequality symbols (>, <, ≠). The type of test is
always right-tailed.
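The calculator steps of Solution 12.5 can be mirrored in Python. In this sketch the two lists stand in for L1 and L2, the function names `chi2_cdf` and `goodness_of_fit` are hypothetical, and the cumulative area is computed from a standard series rather than a library routine.

```python
import math

def chi2_cdf(x, df):
    """Area to the left of x under the chi-square curve with df degrees
    of freedom (series for the regularized lower incomplete gamma)."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def goodness_of_fit(observed, expected):
    """Return the test value, degrees of freedom, and right-tail p-value."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    p_value = 1.0 - chi2_cdf(stat, df)   # the test is right-tailed
    return stat, df, p_value

# Example 12.5: observed frequencies; equally likely styles under H0
observed = [33, 41, 54, 47, 40]
n, k = sum(observed), len(observed)
expected = [n / k] * k                   # 215 / 5 = 43 for every style

stat, df, p_value = goodness_of_fit(observed, expected)
print(f"test value = {stat:.3f}, df = {df}, p-value = {p_value:.3f}")
```

With a p-value above 0.05, the data do not give sufficient evidence against the claim that the shoe styles are equally likely.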
Example 12.6
Absenteeism of college students from math classes is a major concern to math instructors
because missing class appears to increase the drop rate. Suppose that a study was done to
determine if the actual student absenteeism rate follows faculty perception. The faculty
expected that a group of 100 students would miss class according to Table 12.1.
a. Determine the null and alternative hypotheses needed to conduct a goodness-of-fit test.
b. Can you use the information as it appears in the charts to conduct the goodness-of-fit
test? If not, create a new frequency distribution to continue.
c. Find the test value
d. Find the p-value
e. State the conclusion
f. Interpret the decision
A random survey across all mathematics courses was then done to determine the actual
number (observed) of absences in a course. The chart in Table 12.2 displays the results of
that survey.
Solution 12.6
c.) Degrees of freedom (df) = k – 1 = number of categories – 1 = 4 – 1 = 3
d.) Test Value using calculator
Enter Observed frequency values into L1. Enter Expected frequency values in to L2.
Example 12.7
Employers want to know which days of the week employees are absent in a five-day work
week. Most employers would like to believe that employees are absent uniformly during the
week. Suppose a random sample of 60 managers were asked on which day of the week they
had the highest number of employee absences. The results were distributed as in the table
below. For the population of employees, do the days for the highest number of absences occur
with equal frequencies during a five-day work week? Test at a 5% significance level.
Solution 12.7
H0: The absent days occur with equal frequencies, that is, they fit a uniform distribution.
Ha: The absent days occur with unequal frequencies, that is, they do not fit a uniform
distribution.
If the absent days occur with equal frequencies, then, out of 60 absent days there would be
12 absences expected on each day. These numbers are the expected (E) values. The values
in the table are the observed (O) values or data.
χ² Test Value = 3
Conclusion: Do Not Reject Ho (The absent days occur with equal frequencies) because p-
value ≥ α
Interpretation: At a 5% level of significance, the sample data does not provide sufficient
evidence to conclude that the absences do not occur with equal frequencies.
12.4 Teachers want to know which night each week their students are doing most of their
homework. Most teachers think that students do homework equally throughout the week.
Suppose a random sample of 49 students were asked on which night of the week they did the
most homework. The results were distributed as in the following table:
Example 12.8
One study claims that the number of televisions that American families have is as follows:
10% of families have 0 televisions, 16% have 1 television, 55% have 2 televisions, 11% have
3 televisions, and 8% have 4+ televisions. A random sample of 600 families in the far western
United States resulted in the following table.
Number of Televisions    Frequency
0                        66
1                        119
2                        340
3                        60
4+                       15
                         Total = 600
At the 1% significance level, does it appear that the distribution “number of televisions" of
far western United States families is different from the distribution for the American
population as a whole?
Solution 12.8
To find the expected values, we need to take the percentages in H0 (since H0 is assumed true)
and multiply each by n (n = 600):
df = 5 – 1 = 4; df ≠ 600 – 1
This means you reject the belief that the distribution for the far western states is the same
as that of the American population as a whole.
Interpretation: At the 1% significance level, from the data, there is sufficient evidence to
conclude that the "number of televisions" distribution for the far western United States is
different from the "number of televisions" distribution for the American population as a
whole.
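When the null hypothesis gives percentages instead of equal frequencies, as in Example 12.8, each expected value is the percentage times n. The sketch below works through that example in Python. Note one assumption: the helper `right_tail_even_df` uses a closed-form expression for the right-tail area that is valid only when df is even (here df = 4); it is not a general-purpose routine.

```python
import math

def right_tail_even_df(x, df):
    """P(chi-square > x). Closed form that holds only for even df."""
    if df % 2 != 0:
        raise ValueError("this shortcut requires an even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

# Example 12.8: claimed percentages under H0 and observed frequencies
claimed = [0.10, 0.16, 0.55, 0.11, 0.08]    # 0, 1, 2, 3, 4+ televisions
observed = [66, 119, 340, 60, 15]

n = sum(observed)
expected = [p * n for p in claimed]          # percentage times n
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                       # number of categories minus 1
p_value = right_tail_even_df(stat, df)

alpha = 0.01
print("expected:", [round(e, 1) for e in expected])
print(f"test value = {stat:.2f}, df = {df}, p-value = {p_value:.7f}")
print("Reject H0" if p_value < alpha else "Do not reject H0")
```

Every expected value is at least five, so the test is appropriate, and the very small p-value matches the decision in the text to reject Ho at the 1% level.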
Example 12.9
Suppose you flip two coins 100 times. The results are 20 HH, 27 HT, 30 TH, and 23 TT. Are
the coins fair? Test at a 5% significance level.
Solution 12.9
This problem can be set up as a goodness-of-fit problem. The sample space for flipping two
fair coins is {HH, HT, TH, TT}. Out of 100 flips, you would expect 25 HH, 25 HT, 25 TH, and
25 TT. This is the expected distribution. The question, "Are the coins fair?" is the same as
saying, "Does the distribution of the coins (20 HH, 27 HT, 30TH, 23 TT) fit the expected
distribution?"
Random Variable: Let X = the number of heads in one flip of the two coins. X takes on the
values 0, 1, 2. (There are 0, 1, or 2 heads in the flip of two coins.) Therefore, the number of
cells is three. Since X = the number of heads, the observed frequencies are 20 (for two heads),
57 (for one head), and 23 (for zero heads or both tails). The expected frequencies are 25 (for
two heads), 50 (for one head), and 25 (for zero heads or both tails). This test is right-tailed.
Graph:
Interpretation: There is insufficient evidence to conclude that the coins are not fair.
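The numbers behind this conclusion can be sketched in Python. The expected frequencies come from the fair-coin probabilities for 2, 1, and 0 heads. Because df = 2 here, the right-tail area has the simple form e^(−χ²/2); that shortcut applies only to df = 2 and is used in place of the calculator's χ²cdf.

```python
import math

flips = 100
# Observed counts for 2 heads, 1 head (HT or TH), and 0 heads
observed = [20, 27 + 30, 23]

# Fair coins: P(h heads in two flips) = C(2, h) / 4
expected = [flips * math.comb(2, h) / 4 for h in (2, 1, 0)]

stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                 # three cells, so df = 2

p_value = math.exp(-stat / 2.0)        # exact right-tail area when df = 2

alpha = 0.05
print("expected:", expected)
print(f"test value = {stat:.2f}, df = {df}, p-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Do not reject H0")
```

The p-value is well above 0.05, which is why the example does not reject the hypothesis that the coins are fair.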
12.5 | Test of Independence and Test for Homogeneity
Test of Independence
Tests of independence involve using a contingency table of observed (data) values. The test
statistic for a test of independence is similar to that of a goodness-of-fit test:
$$\chi^2 = \sum_{i \cdot j} \frac{(O-E)^2}{E}$$
where:
• O = observed values
• E = expected values
• i = the number of rows in the table
• j = the number of columns in the table
NOTE:
The expected value for each cell needs to be at least five in order for you to use this test.
In a test of independence, we state the null and alternative hypotheses in words. Since the
contingency table consists of two factors, the null hypothesis states that the factors are
independent and the alternative hypothesis states that they are not independent
(dependent). If we do a test of independence using the example, then the null hypothesis is:
The test of independence is always right-tailed because of the calculation of the test statistic.
If the expected and observed values are not close together, then the test statistic is very large
and way out in the right tail of the chi-square curve, as it is in a goodness-of-fit.
The following formula calculates the expected number (E):
$$E = \frac{(\text{row total})(\text{column total})}{\text{total number surveyed}}$$
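For a test of independence, each expected value is (row total)(column total) divided by the total number surveyed. The Python sketch below applies that rule to a whole contingency table and then computes the test value and df = (rows − 1)(columns − 1). The 2 by 3 table of counts is made up purely for illustration and does not come from any example in this chapter; the function names are also hypothetical.

```python
def expected_counts(table):
    """Expected value for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def independence_statistic(table):
    """Chi-square test value and degrees of freedom for a contingency table."""
    expected = expected_counts(table)
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical observed counts (illustration only)
table = [[20, 30, 50],
         [30, 30, 40]]

expected = expected_counts(table)
stat, df = independence_statistic(table)
print("expected values:", expected)
print(f"test value = {stat:.3f}, df = {df}")
```

The `expected` table plays the role of Matrix [B] on the calculator, and `table` plays the role of Matrix [A].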
12.5 A sample of 300 students is taken. Of the students surveyed, 50 were music students,
while 250 were not. Ninety- seven were on the honor roll, while 203 were not. If we assume
being a music student and being on the honor roll are independent events, what is the
expected number of music students who are also on the honor roll?
Example 12.10
In a volunteer group, adults 21 and older volunteer from one to nine hours each week to spend
time with a disabled senior citizen. The program recruits among community college students,
four-year college students, and nonstudents. In the following table is a sample of the adult
volunteers and the number of hours they volunteer per week. Is the number of hours
volunteered independent of the type of volunteer?
Solution 12.10
The observed table and the question at the end of the problem, "Is the number of hours
volunteered independent of the type of volunteer?" tell you this is a test of independence.
The two factors are number of hours volunteered and type of volunteer. This test is always
right-tailed.
Using the TI-83, 83+, 84, 84+ Calculator
To calculate test value of independence test
Enter the 3 by 3 Contingency Table into Matrix [A] by going to 2nd Matrix, Edit
Go to STAT, TESTS, and select χ2-Test; make sure Observed: [A]. Press Calculate.
NOTE: You do not have to enter anything into Matrix [B]. Matrix [B] will automatically be
filled with Expected Values.
Graph:
Using the TI-83, 83+, 84, 84+ Calculator
To see the Expected Values
Expected Values of the Independence test were placed into Matrix [B]. To view them, Press
2nd Matrix, Edit, [B]
12.6 The Bureau of Labor Statistics gathers data about employment in the United States. A
sample is taken to calculate the number of U.S. citizens working in one of several industry
sectors over time. The table below shows the results:
We want to know if the change in the number of jobs is independent of the change in years.
State the null and alternative hypotheses and the degrees of freedom.
Example 12.11
College of Lake County is interested in the relationship between anxiety level and the need
to succeed in school. A random sample of 400 students took a test that measured anxiety level
and need to succeed in school. The following table shows the results. CLC wants to know if
anxiety level and need to succeed in school are independent events.
Solution 12.11
Enter the 3 by 5 Contingency Table into Matrix [A] by going to 2nd Matrix, Edit
Find the Critical Value: α = 0.01 (do not split since it is right-tailed test), df = 8
χ²R = 20.090
Recall that the critical value separates the rejection region from the non-rejection region.
Here the rejection region is on the right side since it is a right-tailed test. The rejection
region starts at 20.090 and ends at positive infinity. Reject Ho if the test value is inside
the rejection region, which means that the test value is greater than the critical value.
Conclusion: Reject H0 because 48.42 (test value) > 20.090 (critical value)
Rejecting Ho means we are rejecting that the variables are independent. If we are
rejecting independence, we are saying that the variables are dependent. Another phrase
for variables being dependent is that the variables are related. Therefore, the
interpretation for this example is as follows:
b. How many high anxiety level students are expected to have a high need to succeed in
school?
Here we are looking for an expected value. So we need to look at Matrix [B].
You can expect about 22 high anxiety level students to have high need to succeed in
school.
Test for Homogeneity
A special type of independence test, called the test for homogeneity, can be used to draw a
conclusion about whether two populations have the same distribution. To calculate the test
statistic for a test for homogeneity, follow the same procedure as with the test of
independence. However, the hypotheses are stated differently.
Hypotheses
H0: The distributions of the two populations are the same.
Ha: The distributions of the two populations are not the same.
Test Statistic (value)
Use a χ² test statistic. It is computed in the same way as the test for independence, using the
matrix feature of the calculator.
Common Uses:
Comparing two populations. For example: freshman vs sophomore, before vs. after, east vs.
west. The variable is categorical with more than two possible response values.
Example 12.12
It is claimed that online and in-class college students have the same grade distribution.
Suppose that 250 randomly selected online college students and 300 randomly selected in-
class college students’ grades were looked at. The results are shown in the following table.
Do online and in-class college students have the same grade distribution? Use a level of
significance of 0.05.
Solution 12.12
H0: The distribution of grades for online college students is the same as the distribution of
grades for in-class college students.
Ha: The distribution of grades for online college students is not the same as the distribution
of grades for in-class college students
Degrees of Freedom, df = 4
Conclusion: Reject H0 because p-value < α. This means that the distributions are not the
same.
Notice that the conclusion is only that the distributions are not the same. We cannot use the
test for homogeneity to draw any conclusions about how they differ.
12.7 Suppose that 100 randomly selected families and 200 randomly selected singles were
asked what type of car they drove: sport, sedan, hatchback, truck, or van/SUV. The results
are shown in the table below. Do families and singles have the same distribution of cars? Test
at a level of significance of 0.05.
Example 12.13
Both before and after a recent earthquake, surveys were conducted asking voters which of
the three candidates they planned on voting for in the upcoming city council election. The
following table shows the results of the survey. Has there been a change in the distribution
of voter preferences since the earthquake? Use a level of significance of 0.05.
Solution 12.13
H0: The distribution of voter preferences was the same before and after the earthquake.
Ha: The distribution of voter preferences was not the same before and after the earthquake.
Key Terms
Critical Value is a value from the distribution that separates the confidence area from the
non-confidence area.
Chi Square Distribution is a family of curves created by using variance and dependent on
degrees of freedom. The shape of the curves is skewed right. The values of the distribution
are greater than or equal to zero. µ = df, σ = √(2·df)
Formula Review
Critical Values                         Calculator Function    Excel Function
Left chi-square critical value, χ²L     None, use a chart      =chisq.inv(α/2, df)
Right chi-square critical value, χ²R    None, use a chart      =chisq.inv.rt(α/2, df)
Confidence Interval for Variance:
$$\frac{(n-1)s^2}{\chi^2_R} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_L}$$
Confidence Interval for Standard Deviation:
$$\sqrt{\frac{(n-1)s^2}{\chi^2_R}} < \sigma < \sqrt{\frac{(n-1)s^2}{\chi^2_L}}$$
Review of Tests
You have seen the χ² test statistic used in four different circumstances. The following
bulleted list is a summary that will help you decide which χ²-test is the appropriate one to
use.
Test statistic: $$\chi^2 = \sum \frac{(O-E)^2}{E}$$
(calculator: χ²GOF-Test)
Degrees of freedom: df = k – 1
• Independence: Use the test for independence to decide whether two variables (factors)
are independent or dependent. In this case there will be two qualitative survey questions
or experiments and a contingency table will be constructed. The goal is to see if the two
variables are unrelated (independent) or related (dependent). The null and alternative
hypotheses are:
Test statistic: $$\chi^2 = \sum_{i \cdot j} \frac{(O-E)^2}{E}$$
(calculator: χ²-Test)
• Homogeneity: Use the test for homogeneity to decide if two populations with unknown
distributions have the same distribution as each other. In this case there will be a single
qualitative survey question or experiment given to two different populations. The null
and alternative hypotheses are:
Test statistic: $$\chi^2 = \sum_{i \cdot j} \frac{(O-E)^2}{E}$$
(calculator: χ²-Test)
EXERCISES FOR CHAPTER 12
1. If the number of degrees of freedom for a chi-square distribution is 25, what are the
population mean and standard deviation?
4. Find the critical values, χ²L and χ²R, for C.L. = 0.95 and df = 11.
5. Find the critical values, χ²L and χ²R, for C.L. = 0.98 and df = 19.
6. Find the critical values, χ²L and χ²R, for C.L. = 0.99 and df = 23.
7. Find the critical values, χ²L and χ²R, for C.L. = 0.90 and df = 35.
8. The grade point averages for 10 randomly selected students are listed below. Construct a
90% confidence interval for the population standard deviation, σ.
2.0 3.2 1.8 2.9 0.9 4.0 3.3 2.9 3.6 0.8
9. The outputs of 12 selected Keurig coffee makers are listed below. The quality control
needs the standard deviation to be low. Construct a 98% confidence interval for the
population standard deviation, σ.
11.85, 11.96, 11.9, 11.92, 12.01, 11.95, 11.96, 11.89, 11.97, 12.03, 11.88, 11.94
10. A large health maintenance organization (HMO) uses control charts to monitor the
process of directing patient calls to the proper department or doctor’s receptionist. The
standard deviation of 21 control charts is 6.8 minutes. Construct a 95% confidence
interval for the population variance, σ².
11. An archer’s standard deviation for his hits is six (data is measured in distance from the
center of the target). An observer claims the standard deviation is less.
a. What type of test should be used?
b. State the null and alternative hypotheses.
c. Is this a right-tailed, left-tailed, or two-tailed test?
12. The standard deviation of heights for students in a school is 0.81. A random sample of 50
students is taken, and the standard deviation of heights of the sample is 0.96. A
researcher in charge of the study believes the standard deviation of heights for the school
is greater than 0.81.
a. State the null and alternative hypotheses.
b. df =
13. The average waiting time in a doctor’s office varies. The standard deviation of waiting
times in a doctor’s office is 3.4 minutes. A random sample of 30 patients in the doctor’s
office has a standard deviation of waiting times of 4.1 minutes. One doctor believes the
variance of waiting times is greater than originally thought.
a. What type of test should be used?
b. What is the test statistic?
c. What is the p-value?
d. What can you conclude at the 5% significance level?
14. Suppose an airline claims that its flights are consistently on time with an average delay
of at most 15 minutes. It claims that the average delay is so consistent that the variance
is no more than 150 minutes. Doubting the consistency part of the claim, a disgruntled
traveler calculates the delays for his next 25 flights. The average delay for those 25 flights
is 22 minutes with a standard deviation of 15 minutes.
a. Is the traveler disputing the claim about the average or about the variance?
b. A sample standard deviation of 15 minutes is the same as a sample variance of ______
minutes.
c. Is this a right-tailed, left-tailed, or two-tailed test?
d. H0:
e. df =
f. chi-square test statistic =
g. p-value =
h. Graph the situation. Label and scale the horizontal axis. Mark the mean and test
statistic. Shade the p-value.
i. Let α = 0.05. What is the conclusion? Write out the conclusion in complete
sentences.
j. How did you know to test the variance instead of the mean?
k. If an additional test were done on the claim of the average delay, which distribution
would you use?
15. A plant manager is concerned her equipment may need recalibrating. It seems that the
actual weight of the 15 oz. cereal boxes it fills has been fluctuating. The standard
deviation should be at most 0.5 oz. In order to determine if the machine needs to be
recalibrated, 84 randomly selected boxes of cereal from the next day’s production were
weighed. The standard deviation of the 84 boxes was 0.54. Does the machine need to be
recalibrated? Use α = 0.01.
16. Consumers may be interested in whether the cost of a particular calculator varies from
store to store. Based on surveying 43 stores, which yielded a sample mean of $84 and a
sample standard deviation of $12, test the claim that the standard deviation is greater
than $15.
17. Isabella, an accomplished Bay to Breakers runner, claims that the standard deviation for
her time to run the 7.5 mile race is at most three minutes. To test her claim, Rupinder
looks up five of her race times. They are 55 minutes, 61 minutes, 58 minutes, 63 minutes,
and 57 minutes. Use α = 0.10.
18. Airline companies are interested in the consistency of the number of babies on each flight,
so that they have adequate safety equipment. They are also interested in the variation of
the number of babies. Suppose that an airline executive believes the average number of
babies on flights is six with a variance of nine at most. The airline conducts a survey. The
results of the 18 flights surveyed give a sample average of 6.4 with a sample standard
deviation of 3.9. Conduct a hypothesis test of the airline executive’s belief. Use α = 0.05.
19. The number of births per woman in China is 1.6 down from 5.91 in 1966. This fertility
rate has been attributed to the law passed in 1979 restricting births to one per woman.
Suppose that a group of students studied whether or not the standard deviation of births
per woman was greater than 0.75. They asked 50 women across China the number of
births they had had. The results are shown in the following table. Does the students’
survey indicate that the standard deviation is greater than 0.75? Use α = 0.01.
# of Births Frequency
0 5
1 30
2 10
3 5
20. According to an avid aquarist, the average number of fish in a 20-gallon tank is 10, with
a standard deviation of two. His friend, also an aquarist, does not believe that the
standard deviation is two. She counts the number of fish in 15 other 20-gallon tanks.
Based on the results that follow, do you think that the standard deviation is different
from two?
Use α = 0.10.
Data: 11; 10; 9; 10; 10; 11; 11; 10; 12; 9; 7; 9; 11; 10; 11
21. The manager of "Frenchies" is concerned that patrons are not consistently receiving the
same amount of French fries with each order. The chef claims that the standard deviation
for a ten-ounce order of fries is at most 1.5 oz., but the manager thinks that it may be
higher. He randomly weighs 49 orders of fries, which yields a mean of 11 oz. and a
standard deviation of two oz.
22. You want to buy a specific computer. A sales representative of the manufacturer claims
that retail stores sell this computer at an average price of $1,249 with a very narrow
standard deviation of $25. You find a website that has a price comparison for the same
computer at a series of stores as follows: $1,299; $1,229.99; $1,193.08; $1,279; $1,224.95;
$1,229.99; $1,269.95; $1,249. Can you argue that pricing has a larger standard deviation
than claimed by the manufacturer? Use the 5% significance level. As a potential buyer,
what would be the practical conclusion from your analysis?
23. A company packages apples by weight. One of the weight grades is Class A apples. Class
A apples have a mean weight of 150 g, and there is a maximum allowed weight tolerance
of 5% above or below the mean for apples in the same consumer package. A batch of apples
is selected to be included in a Class A apple package. Given the following apple weights
of the batch, does the fruit comply with the Class A grade weight tolerance requirements?
Conduct an appropriate hypothesis test.
(a) at the 5% significance level
(b) at the 1% significance level
Weights in selected apple batch (in grams): 158; 167; 149; 169; 164; 139; 154; 150; 157;
171; 152; 161; 141; 166; 172
c. A personal trainer is putting together a weight-lifting program for her clients. For a
90-day program, she expects each client to lift a specific maximum weight each week.
As she goes along, she records the actual maximum weights her clients lifted. She
wants to know how well her expectations met with what was observed.
25. A teacher predicts the distribution of grades on the final exam. The predicted
distribution is recorded in the following table on the left, and the actual distribution
for a class of 20 is given on the right:
26. The following data are real. The cumulative number of COVID-19 vaccine doses reported
for Lake County, IL on May 27, 2021 is broken down by ethnicity as in the following table.
a. If the ethnicities of COVID-19 vaccine doses followed the ethnicities of the total county
population, fill in the expected number of cases per ethnic group.
The percentage of each ethnic group in Lake County is given in the following table
(Census 2019 projections).
b. Perform a goodness-of-fit test to determine whether the occurrence of vaccine doses
follows the ethnicities of the general population of Lake County, IL.
H0:
Ha:
c. Is this a right-tailed, left-tailed, or two-tailed test?
d. degrees of freedom =
e. χ² test statistic =
f. p-value =
g. Graph the situation. Label and scale the horizontal axis. Mark the mean and test
statistic. Shade in the region corresponding to the p-value.
h. Let α = 0.05. Decision:
i. Conclusion (write out in complete sentences):
j. Does it appear that the pattern of vaccine doses in Lake County corresponds to the
distribution of ethnic groups in this county? Why or why not?
27. A Computerworld survey on mobile data service asked, “How would you rate the overall
trustworthiness of your mobile data provider?” The results from 2016 are shown in the
following table:
b. The owner of a baseball team is interested in the relationship between player salaries
and team winning percentage. He takes a random sample of 100 players from different
organizations.
c. A marathon runner is interested in the relationship between the brand of shoes
runners wear and their run times. She takes a random sample of 50 runners and
records their run times as well as the brand of shoes they were wearing.
29. Transit Railroads is interested in the relationship between travel distance and the ticket
class purchased. A random sample of 200 passengers is taken. The following table shows
the results. The railroad wants to know if a passenger’s choice in ticket class is
independent of the distance they must travel.
30. An article in the New England Journal of Medicine discussed a study on smokers in
California and Hawaii. In one part of the report, the self-reported ethnicity and smoking
levels per day were given. The table below summarizes the results.
a. State the hypotheses.
H0:
Ha:
b. Find the Expected Values. Round to two decimal places. Place them into a table.
c. df =
d. χ² test statistic =
e. p-value =
f. Is this a right-tailed, left-tailed, or two-tailed test? Explain why.
g. Graph the situation. Label and scale the horizontal axis. Mark the mean and test
statistic. Shade in the region corresponding to the p-value.
h. State the decision and conclusion (in a complete sentence) for α = 0.05
i. State the decision and conclusion (in a complete sentence) for α = 0.01
31. A math teacher wants to see if two of her classes have the same distribution of test scores.
a. What test should she use?
b. What are the null and alternative hypotheses?
32. A market researcher wants to see if two different stores have the same distribution of
sales throughout the year. What type of test should he use?
33. A meteorologist wants to know if East and West Australia have the same distribution of
storms. What type of test should she use?
34. Do private practice doctors and hospital doctors have the same distribution of working
hours? Suppose that a sample of 100 private practice doctors and 150 hospital doctors are
selected at random and asked about the number of hours a week they work. The results
are shown in the following table:
                   20 – 29   30 – 39   40 – 49   50 – 59
Private Practice      16        40        38         6
Hospital               8        44        59        39
a. State the null and alternative hypotheses.
b. df =
c. What is the test statistic?
d. What is the p-value?
e. What can you conclude at the 5% significance level?
35. Which test would you use to decide whether two factors have a relationship?
36. Which test would you use to decide if two populations have the same distribution?
38. How are tests of independence different from tests for homogeneity?
40. A six-sided die is rolled 120 times. Fill in the expected frequency column. Then, conduct
a hypothesis test to determine if the die is fair. The data in the following table are the
result of the 120 rolls.
41. The marital status distribution of the U.S. male population, ages 15 and older, is as shown
in the following table.
Suppose that a random sample of 400 U.S. young adult males, 18 to 24 years old, yielded the
following frequency distribution. We are interested in whether this age group of males fits
the distribution of the U.S. adult population. Calculate the frequency one would expect when
surveying 400 people. Fill in the Expected frequency in the above table, rounding to two
decimal places.
42. The columns in the following table contain the Race/Ethnicity of U.S. Public Schools for
a recent year, the percentages for the Advanced Placement Examinee Population for that
class, and the Overall Student Population. Suppose the right column contains the result
of a survey of 1,000 local students from that year who took an AP Exam.
43. The City of South Lake Tahoe, CA, has an Asian population of 1,419 people, out of a total
population of 23,609. Suppose that a survey of 1,419 self-reported Asians in the
Manhattan, NY, area yielded the data in the following table. Conduct a goodness-of-fit
test to determine if the self-reported sub-groups of Asians in the Manhattan area fit that
of the Lake Tahoe area.
e. In a goodness-of-fit test, if the p-value is 0.0113, in general, do not reject the null
hypothesis.
45. The CLC math department expects the grade distribution of College Algebra to be as
follows: 15% A grade, 25% B grade, 30% C grade, 10% D grade, 5% F grade, and 15% W.
W is a withdrawal. The spring 2021 observed grade distribution of --- students is listed
in the table:
Conduct a goodness-of-fit test to determine if the actual grade distribution fits the
distribution of the department’s expected distribution. Use the level of significance of 1%.
46. A sample of 212 commercial businesses was surveyed for recycling one commodity; a
commodity here means any one type of recyclable material such as plastic or aluminum.
The following table shows the business categories in the survey, the sample size of each
category, and the number of businesses in each category that recycle one commodity.
Based on the study, on average half of the businesses were expected to be recycling one
commodity. As a result, the last column shows the expected number of businesses in each
category that recycle one commodity. At the 5% significance level, perform a hypothesis
test to determine if the observed number of businesses that recycle one commodity follows
the uniform distribution of the expected values.
47. The following table contains information from a survey among 499 participants classified
according to their age groups. The second column shows the percentage of obese people
per age class among the study participants. The last column comes from a different study
at the national level that shows the corresponding percentages of obese people in the same
age classes in the USA. Perform a hypothesis test at the 5% significance level to determine
whether the survey participants are a representative sample of the USA obese population.
Age Class (Years)   Obese (Percentage)   Expected USA average (%)
21 – 30             75                   32.6
31 – 40             26.5                 32.6
41 – 50             13.6                 36.6
51 – 60             21.9                 36.6
61 – 70             21                   39.7
48. A recent debate about where in the United States skiers believe the skiing is best
prompted the following survey. Test to see if the best ski area is independent of the level
of the skier.
U.S. Ski Area Beginner Intermediate Advanced
Tahoe 20 30 40
Utah 10 30 60
Colorado 10 40 50
49. Car manufacturers are interested in whether there is a relationship between the size of
car an individual drives and the number of people in the driver’s family (that is, whether
car size and family size are independent). To test this, suppose that 800 car owners were
randomly surveyed with the results in the following table. Conduct a test of
independence.
50. College students may be interested in whether or not their majors have any effect on
starting salaries after graduation. Suppose that 300 recent graduates were surveyed as
to their majors in college and their starting salaries after graduation. The following table
shows the data. Conduct a test of independence using α = .05.
51. Some travel agents claim that honeymoon hot spots vary according to age of the bride.
Suppose that 280 recent brides were interviewed as to where they spent their
honeymoons. The information is given in the following table. Conduct a test of
independence using α = .10.
52. A manager of a sports club keeps information concerning the main sport in which
members participate and their ages. To test whether there is a relationship between the
age of a member and his or her choice of sport, 643 members of the sports club are
randomly selected. Conduct a test of independence using α = .05.
53. A major food manufacturer is concerned that the sales for its skinny French fries have
been decreasing. As a part of a feasibility study, the company conducts research into the
types of fries sold across the country to determine if the type of fries sold is independent
of the area of the country. The results of the study are shown in table below. Conduct a
test of independence using α = .05.
54. According to Dan Lenard, an independent insurance agent in the Buffalo, N.Y. area, the
following is a breakdown of the amount of life insurance purchased by males in the
following age groups. He is interested in whether the age of the male and the amount of
life insurance purchased are independent events. Conduct a test for independence using
α = .01.
55. Suppose that 600 thirty-year-olds were surveyed to determine whether or not there is a
relationship between the level of education an individual has and salary. Conduct a test
of independence using α = .05.
57. An ice cream maker performs a nationwide survey about favorite flavors of ice cream in
different geographic areas of the U.S. Based on the following table, do the numbers
suggest that geographic location is independent of favorite ice cream flavors? Test at the
5% significance level.
58. The following table provides a recent survey of the youngest online entrepreneurs whose
net worth is estimated at one million dollars or more. Their ages range from 17 to 30.
Each cell in the table illustrates the number of entrepreneurs who correspond to the
specific age group and their net worth. Are the ages and net worth independent? Perform
a test of independence at the 5% significance level.
Age Group / Net Worth Value (in millions)   1 – 5   6 – 24   25 +
17 – 25                                       8       7        5
26 – 30                                       6       5        9
59. Based on data from the Pew Research Center, the table below shows the number of
registered voters in 2018 who identify with each of the following generations and
political parties. Conduct a test of independence at the 5% significance level. What is
the expected number of Republican Generation X voters?
61. Do people from the Midwest and West coast select different breakfasts? The breakfasts
ordered by randomly selected Midwesterners and West coasters at a popular breakfast
place are shown in the table below. Conduct a test for homogeneity at a 5% level of
significance.
62. A fisherman is interested in whether the distribution of fish caught in Green Valley Lake
is the same as the distribution of fish caught in Echo Lake. Of the 191 randomly selected
fish caught in Green Valley Lake, 105 were rainbow trout, 27 were other trout, 35 were
bass, and 24 were catfish. Of the 293 randomly selected fish caught in Echo Lake, 115
were rainbow trout, 58 were other trout, 67 were bass, and 53 were catfish. Perform a
test for homogeneity at a 5% level of significance.
63. According to the U.S. National Center for Education Statistics, in 2017 the GPA
groupings with respect to enrollment in STEM were as shown in the following table.
According to the survey results shown in the table, is the distribution of GPA the same
for STEM and non-STEM enrollment? Provide your assessment at the 5% significance level.
Did you expect the result you obtained?
64. When looking at energy consumption, we are often interested in detecting trends over
time and how they correlate among different countries. The information in the following
table shows the average energy use (in units of kg of oil equivalent per capita) in the USA
and the joint European Union countries (EU) for the six-year period 2005 to 2010. Do the
energy use values in these two areas come from the same distribution? Perform the
analysis at the 5% significance level.
65. The Insurance Institute for Highway Safety collects safety information about all types of
cars every year, and publishes a report of Top Safety Picks among all cars, makes, and
models. The following table presents the number of Top Safety Picks in six car
categories for the two years 2013 and 2017. Analyze the table data to conclude whether
the distribution of cars that earned the Top Safety Picks safety award has remained the
same between 2013 and 2017. Derive your results at the 5% significance level.
REFERENCES
12.1 The Chi-Square Distribution
“HIV/AIDS Epidemiology Santa Clara County.” Santa Clara County Public Health Department, May
2011.
Data from the College Board. Available online at http://www.collegeboard.com.
Data from the U.S. Census Bureau, Current Population Reports.
Ma, Y., E.R. Bertone, E.J. Stanek III, G.W. Reed, J.R. Hebert, N.L. Cohen, P.A. Merriam, I.S. Ockene,
“Association between Eating Patterns and Obesity in a Free-living US Adult Population.” American
Journal of Epidemiology volume 158, no. 1, pages 85-92.
Ogden, Cynthia L., Margaret D. Carroll, Brian K. Kit, Katherine M. Flegal, “Prevalence of Obesity in
the United States, 2009–2010.” NCHS Data Brief no. 82, January 2012. Available online at
http://www.cdc.gov/nchs/data/databriefs/db82.pdf (accessed May 24, 2013).
Stevens, Barbara J., “Multi-family and Commercial Solid Waste and Recycling Survey.” Arlington
County, VA. Available online at
http://www.arlingtonva.us/departments/EnvironmentalServices/SW/file84429.pdf (accessed May
24, 2013).
12.5 Test of Independence/Homogeneity
DiCamilo, Mark, Mervin Field, “Most Californians See a Direct Linkage between Obesity and Sugary
Sodas. Two in Three Voters Support Taxing Sugar-Sweetened Beverages If Proceeds are Tied to
Improving School Nutrition and Physical Activity Programs.” The Field Poll, released Feb. 14, 2013.
Available online at http://field.com/fieldpollonline/subscribers/Rls2436.pdf (accessed May 24, 2013).
Data from the Insurance Institute for Highway Safety, 2013. Available online at
www.iihs.org/iihs/ratings (accessed May 24, 2013).
“Energy use (kg of oil equivalent per capita).” The World Bank, 2013. Available online at
http://data.worldbank.org/indicator/EG.USE.PCAP.KG.OE/countries (accessed May 24, 2013).
“Parent and Family Involvement Survey of 2007 National Household Education Survey Program
(NHES),” U.S. Department of Education, National Center for Education Statistics. Available online at
http://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2009030 (accessed May 24, 2013).
Gralla, Preston. “The Votes Are in: Which Mobile Data Provider Is Best?” Computerworld,
Computerworld, 21 Dec. 2016, www.computerworld.com/article/3150992/wireless-carriers/the-votes-
are-in-which-mobile-data-provider-is-best.html.
“White Millennial Voters Are More Democratic than White Voters in Older Generations.” Pew
Research Center - U.S. Politics & Policy, Pew Research Center, 20 Mar. 2018,
www.pewresearch.org/politics/2018/03/20/1-trends-in-party-affiliation-among-demographic-
groups/2_10-4/.
U.S. Department of Education, National Center for Education Statistics, 2012/17 Beginning
Postsecondary Students Longitudinal Study (BPS:12/17).
CHAPTER 12 SOLUTIONS:
3) The chi-square distribution approximates the normal distribution when df > 90.
9) $\chi^2_L = 3.054$, $\chi^2_R = 24.725$; $s^2 = .00282$;
$\sqrt{\frac{(n-1)s^2}{\chi^2_R}} < \sigma < \sqrt{\frac{(n-1)s^2}{\chi^2_L}}$, which gives .0354 < σ < .101
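The interval in solution 9 can be checked numerically. The sketch below is an illustration, not part of the original solutions. It takes the data from exercise 9 and the two critical values quoted in the solution as given (it does not compute the critical values), and assumes the interval for σ is the square root of the corresponding interval for σ².

```python
import math
import statistics

# Outputs of the 12 coffee makers from exercise 9
outputs = [11.85, 11.96, 11.9, 11.92, 12.01, 11.95,
           11.96, 11.89, 11.97, 12.03, 11.88, 11.94]

# Critical values quoted in solution 9 (df = 11, C.L. = 0.98); taken as given
chi2_left, chi2_right = 3.054, 24.725

n = len(outputs)
s2 = statistics.variance(outputs)  # sample variance, divides by n - 1

# Confidence interval for the population standard deviation
lower = math.sqrt((n - 1) * s2 / chi2_right)
upper = math.sqrt((n - 1) * s2 / chi2_left)

print(f"s^2 = {s2:.5f}")
print(f"{lower:.4f} < sigma < {upper:.3f}")
```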
There is not quite enough evidence to conclude that the standard deviation of
waiting times is greater than 3.4 minutes.
15) The hypotheses are Ho: σ ≤ 0.5; Ha: σ > 0.5. Use a χ² test for a single variance.
The test statistic is χ² = 83(.54)²/(.5)² = 96.81, and the p-value is p = χ²cdf(96.81, 10^9,
83) = 0.1426. Since p > .01, we do not reject Ho. There is not enough evidence to
conclude that the machine needs recalibrating.
17) The hypotheses are Ho: σ ≤ 3 (claim); Ha: σ > 3. Use a χ² test for a single
variance. The test statistic is χ² = 4(3.194)²/(3)² = 4.534, and the p-value is p =
χ²cdf(4.534, 10^9, 4) = 0.3385. Since p > .10, we do not reject Ho. There is not enough
evidence to refute her claim.
19) The hypotheses are Ho: σ ≤ 0.75; Ha: σ > 0.75. Use a χ² test for a single variance.
The test statistic is χ² = 49(.789)²/(.75)² = 54.23, and the p-value is p = χ²cdf(54.23,
10^9, 49) = 0.2818. Since p > .01, we do not reject Ho. There is not enough evidence to
conclude that the standard deviation is more than 0.75.
21) a. Round the degrees of freedom down to df = 40 to get $\chi^2_L$ = 26.509 and $\chi^2_R$ = 55.758.
Apply the formula to get the interval: 1.86 < σ < 2.69. Since the lower bound of
the interval is more than 1.5, this suggests that the standard deviation really is
more than 1.5 oz, as the manager suspected.
b. The hypotheses are Ho: σ ≤ 1.5; Ha: σ > 1.5. Use a χ² test for a single variance.
The test statistic is χ² = 48(2)²/(1.5)² = 85.33, and the p-value is p = χ²cdf(85.33,
10^9, 48) = 0.00073. Since p < .05, we reject Ho. There is sufficient evidence to
conclude that the standard deviation is greater than 1.5 oz.
23) Note that 5% of 150 is 7.5, so the hypotheses are Ho: σ ≤ 7.5; Ha: σ > 7.5. Use a χ²
test for a single variance; the test statistic is χ² = 14(10.43)²/(7.5)² = 27.07, and the
p-value is p = χ²cdf(27.07, 10^9, 14) = 0.0189.
a. Since p < .05, we reject Ho. At the .05 level of significance, there is sufficient
evidence to conclude that the standard deviation exceeds the weight tolerance.
b. Since p > .01, we do not reject Ho at the .01 level of significance. That is, at the .01
level of significance, there is not enough evidence to conclude that the standard
deviation exceeds the weight tolerance.
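The single-variance solutions above all follow the same pattern: compute χ² = (n − 1)s²/σ₀², then find the right-tail area. The following Python sketch reproduces solution 17 as an illustration and is not part of the original solutions. The helper name `chi_square_right_tail` is hypothetical. It relies on a closed-form expression for the right-tail area that holds only when df is even, so it covers solutions 17, 21 and 23 but not 15 or 19, where df is odd and the calculator's χ²cdf is needed.

```python
import math
import statistics


def chi_square_right_tail(x, df):
    """Right-tail area P(chi-square > x); closed form valid only for even df."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even number of degrees of freedom")
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


# Exercise 17: Isabella's five race times (minutes); H0: sigma <= 3
times = [55, 61, 58, 63, 57]
sigma0 = 3

n = len(times)
s = statistics.stdev(times)                # sample standard deviation
statistic = (n - 1) * s ** 2 / sigma0 ** 2  # chi-square test statistic
p_value = chi_square_right_tail(statistic, n - 1)

print(f"s = {s:.3f}, chi-square = {statistic:.3f}, p-value = {p_value:.4f}")
```

With α = 0.10 the p-value is larger than α, which agrees with the decision not to reject Ho in solution 17.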
27) Ho: p1 = .31, p2 = .41, p3 = .09, p4 = .04, p5 = .15; Ha: at least 1 proportion is
different from stated. Test value: χ² = 32.43; p-value = .00000156; Reject Ho; There
is enough evidence to support the claim that the distribution has changed since
2016.
29) a. Ho: Distance and ticket class are independent. Ha: Distance and ticket class are
dependent.
b. Ho: class 1 and class 2 have the same distribution of test scores. Ha: class 1 and
class 2 do not have the same distribution of test scores.
33) χ² test (test for homogeneity)
35) χ² test (test for independence)
39) a. true b. false c. false (they are close in number but not exactly the same)
Ho: p1 = 9.2%, p2 = 8.3%, p3 = 73.6%, p4 = 5.6%, p5 = .8%, p6 = .6%, p7 = 1.7%
Ha: at least 1 proportion is not the same as stated
χ² = 2035; p-value = 0; Reject Ho; There is enough evidence to reject that the
self-reported sub-groups of Asians in the Manhattan area fit that of the Lake Tahoe
area.
45) Expected frequency: A 50.25, B 83.75, C 100.5, D 33.5, F 16.75, W 50.25; χ² GOF
test value = 50.97, p-value = 8.77E-10; reject Ho because p-value < α. There is
enough evidence to reject the claim that the grade distribution for College Algebra is
the following: 15% A grade, 25% B grade, 30% C grade, 10% D grade, 5% F grade,
and 15% W.
χ² = 54.01; p-value = 0; Reject Ho. Surveyed obese does not fit the expected obese.
49) Ho: Family size and size of car are independent; χ² = 15.83; p-value = .071. Do not
reject Ho. There is not enough evidence to reject the claim that they are
independent.
51) Ho: Location of honeymoon and Age are independent; χ² = 15.7; p-value = .0734;
Reject Ho. Location of honeymoon and Age are dependent.
53) Ho: type of fries and area of US sold are independent; χ² = 18.84; p-value = .00445;
Reject Ho. They are dependent.
55) Ho: Annual Salary and level of education are independent. χ² = 2558; p = 0; Reject
Ho. They are related.
57) Ho: Geographic location and favorite ice cream flavor are independent. χ² = 14.06;
p-value = .521. They are independent.
59) Ho: Generation and political party are independent; χ² = 4.7; p-value = .194; Do not
reject Ho. They are independent. Expected Gen X and Republican is 70 people.
61) Ho: Midwesterners and West coasters have the same distribution for breakfast
selection; χ² = 11.56; p-value = .0091; We can reject that they have the same
distribution.
63) Ho: Applicable reasons and Most important reasons have the same distribution; χ² =
234; Reject Ho. They do not have the same distribution.
65) Ho: 2009 and 2013 distribution of cars is the same; χ² = 6.65; p-value = .248; Do not
reject Ho.
13 | LINEAR REGRESSION AND CORRELATION
Figure 13.1 There is a correlation between the amount of cheese consumption and the number of
people who died by becoming tangled in their bedsheets between 2000 – 2009. For more correlations
go to www.tylervigen.com. Photo by Katrin Leinfellner on Unsplash
Chapter Objectives
INTRODUCTION
Professionals often want to know how two or more numeric variables are related. For
example, is there a relationship between the grade on the second math exam a student takes
and the grade on the final exam? If there is a relationship, what is the relationship and how
strong is it?
In another example, your income may be influenced by your education, your profession, your
years of experience, and your ability. The amount you pay a repair person for labor is often
determined by an initial amount plus an hourly fee. The type of data described in this
chapter is bivariate data – the prefix "bi" indicating there are two variables. Although we
will only study models involving two variables, in many real-world situations statisticians
use multivariate data, meaning many variables.
In this chapter, we will be studying the simplest form of regression, "linear regression" with
one independent variable. More specifically, given sample data measured on two variables,
x and y, we wish to find a linear equation that best fits the observed sample data. We will
also develop statistical measures and tests that measure how strong the linear relationship
is between the variables.
A linear equation has the form y = a + bx, where a and b are constant numbers. This is
called a linear equation because the graph of the equation is a straight line.
Example 13.1
a. y = -1 + 2x b. y = 2e^x
Solution 13.1
13.1 Is the following an example of a linear equation? If so, sketch the graph.
y = -0.125 – 3.5x
Example 13.2
Aaron's Word Processing Service (AWPS) does word processing. The rate for services is $32
per hour plus a $31.50 one-time charge. The total cost to a customer depends on the number
of hours it takes to complete the job. Find the equation that expresses the total cost in terms
of the number of hours required to complete the job.
Solution 13.2
Let x = the number of hours it takes to get the job done. Let y = the total cost to the
customer.
The $31.50 is a fixed cost. If it takes x hours to complete the job, then (32)(x) is the cost of
the word processing only. The total cost is: y = 31.50 + 32x
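The equation in Solution 13.2 can be written as a small function, which makes it easy to evaluate the total cost for any number of hours. This is a minimal Python sketch; the function and variable names are illustrative and do not come from the text.

```python
def awps_total_cost(hours):
    """Total cost y = 31.50 + 32x for a job that takes x = hours."""
    one_time_charge = 31.50  # fixed cost, in dollars
    hourly_rate = 32         # cost per hour, in dollars
    return one_time_charge + hourly_rate * hours


# Evaluate the linear equation for a few job lengths
for hours in (0, 1, 2.5):
    print(hours, awps_total_cost(hours))
```

A job that takes zero hours costs only the one-time charge, and each additional hour adds $32 to the total.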
13.2 Emma’s Extreme Sports hires hang-gliding instructors and pays them a fee of $50 per
class as well as $20 per student in the class. The total cost Emma pays depends on the number
of students in a class. Find the equation that expresses the total cost in terms of the number
of students in a class.
Slope and Intercept of a Linear Equation
Given a linear equation y = a + bx, b is called the slope and a is called the y-intercept. From
basic algebra, recall that the slope is a number that describes the steepness of the line, and
the y-intercept is the y-coordinate of the point (0, a) where the line crosses the y-axis. The
sign of the slope b determines whether the line slopes upward or downward, as shown in the
figure below:
Example 13.3
Svetlana tutors to make extra money for college. For each tutoring session, she charges a
one-time fee of $25 plus $15 per hour of tutoring. A linear equation that expresses the total
amount of money Svetlana earns for each session she tutors is y = 25 + 15x.
What are the independent and dependent variables? What is the y-intercept and what is
the slope? Interpret these using complete sentences.
Solution 13.3
The independent variable x is the number of hours Svetlana tutors each session. The
dependent variable y is the amount, in dollars, Svetlana earns for each session.
The y-intercept is 25 (a = 25). At the start of the tutoring session, Svetlana charges a one-
time fee of $25 (this is when x = 0). The slope is 15 (b = 15). For each additional hour she
tutors, Svetlana earns an additional $15.
13.3 Ethan repairs household appliances like dishwashers and refrigerators. For each visit,
he charges $25 plus $20 per hour of work. A linear equation that expresses the total
amount of money Ethan earns per visit is y = 25 + 20x.
What are the independent and dependent variables? What is the y-intercept and what is
the slope? Interpret them using complete sentences.
Example 13.4
SCUBA divers have maximum dive times they cannot exceed when going to different
depths. The data in the table below show different depths with the maximum dive times in
minutes.
Construct a scatter plot. Let x = depth in feet and let y = maximum dive time in minutes.
Solution 13.4
1. Enter your X data into list L1 and your Y data into list L2.
2. Press 2nd STATPLOT ENTER to use Plot 1. On the input screen for PLOT 1,
highlight On and press ENTER. (Keep the other plots set to OFF.)
3. For TYPE: highlight the very first icon, which is the scatter plot, and press ENTER.
4. For Xlist: enter L1 and for Ylist: enter L2.
5. For Mark: it does not matter which symbol you highlight, but the square is the
easiest to see. Press ENTER.
6. Make sure there are no other equations that could be plotted. Press Y = and clear
any equations out.
7. Press the ZOOM key and then select option 9, "ZoomStat"; the calculator will fit the
window to the data. You can press WINDOW to see the scaling of the axes.
[Figure: scatter plot of maximum dive time (in minutes) against depth (in feet).]
The points in the scatterplot exhibit a linear pattern; this suggests that a linear equation
would be a good fit to this data. We also can see that as the depth increases, the maximum
dive time decreases.
A scatter plot shows the direction of the relationship between the variables.
• If the points in the graph slope upward as we move left to right, then we say that
there is a positive relationship between the variables x and y.
• If the points in the graph slope downward as we move left to right, then we say that
there is a negative relationship between the variables x and y.
We can also roughly determine the strength of the linear relationship by looking at the
scatter plot; the more closely the points conform to a linear pattern, the better a linear
equation will describe the relationship. If the points in the graph appear to follow a non-
linear pattern (i.e. there is curvature) then perhaps a linear model is not appropriate.
When we look at a scatterplot, we want to notice the overall pattern and any deviations
from the pattern. The following scatterplot examples illustrate these concepts.
Again, in this chapter, we are interested in scatter plots that show a linear pattern. The
linear relationship is strong if the points are close to a straight line, except in the case of a
horizontal line where there is no relationship. If we think that the points show a linear
relationship, we would like to fit a line to the points in the scatter plot. This line can be
calculated through a process called linear regression, which we will learn in the next
section. We will also develop a numerical measure that indicates more objectively
how well the line fits the data. We will only use the fitted line when we have determined
that there is a statistically significant relationship between the variables x and y.
13.4 Amelia plays basketball for her high school. She wants to improve to play at the college
level. She notices that the number of points she scores in a game goes up in response to the
number of hours she practices her jump shot each week. She records the following data:
Construct a scatter plot and state whether Amelia’s conjecture appears to be true.
Example 13.5
Two different tests are designed to measure employee productivity and dexterity. Several
employees of a company are randomly selected and asked to complete the tests. The results
are below. Draw a scatterplot using Excel and describe it.
Dexterity 23 25 28 21 21 25 26 30 34 36
Productivity 49 53 59 42 47 53 55 63 67 75
Solution 13.5
First, place Dexterity values in column A and place Productivity values in column B.
Second, highlight both columns and choose Insert, then Scatter. Choose the scatter chart in which the points are not connected.
13.3 | The Regression Equation
Data rarely fit a straight line exactly; usually we must be satisfied with estimates. In this
section we discuss how to calculate the Line of Best Fit. This equation is also known as
the Least-Squares Line.
Example 13.6
A random sample of 11 statistics students produced the following data, where x is the third
exam score out of 80, and y is the final exam score out of 200. Using the scatterplot, does it
appear that we can predict the final exam score of a random student if we know the third
exam score?
Solution 13.6
The scatterplot below exhibits a tight, linear pattern. This indicates a linear model could
be used to predict the final exam score from the score on the third exam.
Next, we wish to find a line that best "fits" the data. To do this, we use what is called a least-
squares regression line to obtain the line of best fit.
Consider the following diagram. Each point of data is of the form (x, y) and each point of the
line of best fit using least-squares linear regression has the form (x, ŷ). The ŷ is read "y-hat"
and is the estimated value of y; that is, it is the value of y obtained by plugging x into the
regression line. It is not generally equal to y from data.
The term y0 – ŷ0 = ε0 is called the "error" or residual (ε is the Greek letter epsilon). It is not
an error in the sense of a mistake; instead, it is the difference between the estimate and the
true value of y.
• If the observed data point lies above the line, the residual is positive, and the line
underestimates the actual data value for y.
• If the observed data point lies below the line, the residual is negative, and the line
overestimates the actual data value for y.
In the diagram above, y0 – ŷ0 = ε0 is the residual for the point shown; since the point lies above
the line, the residual is positive.
To get the line of best fit, our first impulse might be to minimize the sum of all the residuals
– that is, the sum of all signed vertical differences y – ŷ. However, this sum is not a useful
measure of fit. For any line that passes through the middle of the data, the positive
residuals cancel out the negative residuals, so the sum is close to zero even when the line
fits poorly. To avoid this cancellation effect, we might instead minimize the sum of the
actual vertical distances from the data points to the points on the line; these distances are
given by the absolute value of the residuals.
However, to minimize the sum, we need to use Calculus; and in Calculus the absolute value
function is difficult to work with – so instead, we will take the square of each residual, and
minimize the sum of these squares. This is where the name least-squares line comes from.
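The cancellation effect can be seen numerically. The following is a minimal Python sketch using the dexterity and productivity data from Example 13.5; the three trial slopes and the helper names are arbitrary choices made only for illustration. Every trial line is forced through the point of means, so its residuals sum to zero, yet the sums of absolute and squared residuals still tell the lines apart.

```python
# Dexterity (x) and productivity (y) scores from Example 13.5.
xs = [23, 25, 28, 21, 21, 25, 26, 30, 34, 36]
ys = [49, 53, 59, 42, 47, 53, 55, 63, 67, 75]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

def residuals(slope):
    """Residuals y - y_hat for the line through (x_bar, y_bar) with this slope."""
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def summarize(slope):
    """Return (sum of residuals, sum of |residuals|, sum of squared residuals)."""
    res = residuals(slope)
    return sum(res), sum(abs(e) for e in res), sum(e * e for e in res)

# Three arbitrary trial slopes (hypothetical, chosen for illustration only).
for slope in (-1.0, 0.0, 2.0):
    total, total_abs, sse = summarize(slope)
    print(f"slope={slope:5.1f}  sum={total:8.4f}  sum|e|={total_abs:8.2f}  SSE={sse:9.2f}")
```

The plain sum is essentially zero for every trial line, so it cannot be used to choose among them; the squared residuals can.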
For each data point, we calculate the residual or error, y_i – ŷ_i = ε_i; then we square these
residuals and add them up. The regression line is the line for which the sum of the
squared residuals,
$$ \sum (y_i - \hat{y}_i)^2 = \sum \varepsilon_i^2 $$
is minimized. This expression is called the Sum of Squared Errors and is denoted SSE.
Using Calculus, we can determine the values of a and b that make the SSE a minimum; the
resulting line, ŷ = a + bx, is called the least squares regression line. The coefficients
are:
$$ a = \bar{y} - b\,\bar{x} \qquad \text{and} \qquad b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} $$
Here, x̄ and ȳ are the means of the observed x- and y-values, respectively. Note that the
regression line always passes through the point (x̄, ȳ).
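For readers curious about the calculus step, the derivation can be outlined as follows. This is a standard sketch rather than a full proof, and it uses the same symbols a, b, x̄, ȳ, and SSE as above.

```latex
\begin{align*}
\text{SSE}(a, b) &= \sum (y_i - a - b x_i)^2 \\[4pt]
\frac{\partial\,\text{SSE}}{\partial a} &= -2 \sum (y_i - a - b x_i) = 0
  \quad\Longrightarrow\quad a = \bar{y} - b\,\bar{x} \\[4pt]
\frac{\partial\,\text{SSE}}{\partial b} &= -2 \sum x_i\,(y_i - a - b x_i) = 0 \\[4pt]
\text{Substituting } a = \bar{y} - b\,\bar{x}: \quad
  & \sum x_i (y_i - \bar{y}) - b \sum x_i (x_i - \bar{x}) = 0 \\[4pt]
b &= \frac{\sum x_i (y_i - \bar{y})}{\sum x_i (x_i - \bar{x})}
   = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
\end{align*}
```

The last equality holds because Σ(y_i – ȳ) = 0 and Σ(x_i – x̄) = 0, so subtracting x̄ inside each sum changes nothing. The first condition also says that the residuals of the least-squares line sum to zero, which is why the line passes through the point of means.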
• Linear regression - the line of best fit or the least-squares line is the straight line
that “best” fits the scatterplot of the data.
• Residual – the difference between the observed value of y and the y value predicted
by the regression equation.
The process of fitting a line to a set of points in the plane is called linear regression. The
least-squares approach described above is widely used, and so it is no surprise that these
calculations are programmed into most statistical software packages, as well as into many
calculators. So, there is no need to memorize the formulas above; instead, we will use the
TI-83/TI-84 calculators:
1. Go to STAT and then to the EDIT menu. Enter the x-values into list L1 and the y-
values into list L2. Make sure that the order is preserved – i.e. these must remain
paired so that the corresponding (x,y) values are next to each other in the lists.
2. Go to STAT and then to the TESTS menu; scroll down and select LinRegTTest.
3. On the LinRegTTest input screen enter: Xlist: L1 ; Ylist: L2 ; Freq: 1
4. On the next line, at the prompt β or ρ, highlight "≠ 0" and press ENTER
5. Leave the line for "RegEq:" blank
6. Highlight Calculate and press ENTER.
The output screen contains a lot of information. For now, we will focus on a few items from
the output, and will return later to the other items. In particular, at the top of the screen
you will see y = a + bx which tells us that the intercept is a and the slope is b.
Example 13.7
To illustrate, recall the situation from Example 13.6, the “Third Exam/Final Exam”
example. Find the regression line (or line of best fit).
Solution 13.7
Following the instructions above, we have the following input and output screens:
Scroll down to find the values a = –173.513 and b = 4.8273; the equation of the best-fit line is
ŷ = –173.51 + 4.83x
We have already seen how to make a scatterplot for the data; we can graph the regression
line on the same screen. To graph the best-fit line, press the "Y=" key and type the
equation –173.51 + 4.83X into equation Y1. Press ZOOM 9 again to graph it.
Another way to graph the line after you create a scatter plot is to use LinRegTTest:
1. Make sure you have already made the scatter plot. Check it on your screen.
2. Go to LinRegTTest and enter the lists.
3. At RegEq: press VARS and arrow over to Y-VARS. Press 1 for 1:Function. Press 1
for 1:Y1.
Then arrow down to Calculate and do the calculation for the line of best fit.
4. Press Y = (you will see the regression equation).
5. Press GRAPH. The line will be drawn.
Following either of these procedures, the graph of the line of best fit for the third
exam/final-exam example is as follows:
Now that we have the regression line, we want to give a practical interpretation of the slope
and intercept.
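When ŷ = a + bx, the y-intercept a is the predicted value of y when x = 0, and the slope b is the predicted average change in y for each one-unit increase in x.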
But in this case, the value a (-173.5) does not have any practical meaning – of course, it
would not make sense for a final exam score to be negative! Moreover, 0 is very far outside
the range of observed x-values, so we would not want to use the model to make such a
prediction in any case.
For this problem, we have b = 4.83, so for each one-point increase in the third exam score,
we would expect the final exam to increase by about 4.83 points, on average.
Example 13.8
In an area of the Midwest, records were kept on the relationship between the annual rainfall
(in inches) and the yield of wheat (bushels per acre).
Rain (inches) 10.5 8.8 13.4 12.5 18.8 10.3 7.0 15.6 16.0
Yield (bushels/acre) 50.5 46.2 58.8 59.0 82.4 49.2 31.9 76.0 78.8
Solution 13.8
Go to the STAT menu, and then the EDIT menu. Enter the Rain values into L1, and the
Yield values into L2. Go to the STAT menu again, and then to the TESTS menu; scroll
down to LinRegTTest; set the alternative to “≠”, scroll to Calculate and press ENTER.
a. Scroll down in the output screen to find a = 4.26681 and b = 4.37908. Thus, the
regression equation is: ŷ = 4.267 + 4.379x.
b. The slope is the coefficient of x. The slope is 4.379; this means that for each additional inch of
rain each year, we expect the yield of wheat to increase by about 4.4 bushels/acre on
average.
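The calculator output can be cross-checked by applying the least-squares formulas directly. Below is a minimal Python sketch using the rainfall and yield data from this example; the function name least_squares is an illustrative choice, and the code only assumes the formulas a = ȳ – b x̄ and b = Σ(x – x̄)(y – ȳ) / Σ(x – x̄)² given earlier.

```python
rain = [10.5, 8.8, 13.4, 12.5, 18.8, 10.3, 7.0, 15.6, 16.0]           # inches
wheat_yield = [50.5, 46.2, 58.8, 59.0, 82.4, 49.2, 31.9, 76.0, 78.8]  # bushels/acre

def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

a, b = least_squares(rain, wheat_yield)
print(f"y_hat = {a:.3f} + {b:.3f}x")
```

The printed coefficients should agree with the values a and b read from the LinRegTTest output screen.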
13.5 The data below are ages and systolic blood pressures of 9 randomly selected adults.
Age 38 41 45 48 51 53 57 61 65
Pressure 116 120 123 131 142 145 148 150 152
If the scatter plot indicates that there is a linear relationship between the variables, we
may want to use the line of best fit to make predictions. There are two different types of
predictions.
• Interpolation – predicted values that fall within the range of data points that were
taken.
• Extrapolation – predicted values for points outside the range of data that were
taken.
Interpolation is considered more accurate than extrapolation. It is reasonable to
use the line to make predictions for y given x within the domain of x-values in the sample
data, but not necessarily for x-values outside that domain.
Extrapolation is less accurate because we don’t know how the variables are related outside
this range; it may be that for other data values the points in the scatterplot no longer have
a linear pattern at all, and sharply curve away from the line. So, for values outside the
range of observed x-values, the equation will be unreliable at best.
A scatterplot is a valuable visual tool; however, scatterplots can also be misleading. For
example, by changing the horizontal or vertical scale, we can make the linear pattern look
stronger or weaker than it really is. Other than examining the scatter plot and seeing that a
linear model seems reasonable, how else can we tell if there is a linear relationship? We need
a numerical measure of the strength of the relationship between x and y, and that numerical
measure is the correlation coefficient.
The correlation coefficient, r, developed by Karl Pearson in the early 1900s, is numerical
and provides a measure of strength and direction of the linear association between the
independent variable x and the dependent variable y.
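$$ r = \frac{n \sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right]\left[n \sum y^2 - \left(\sum y\right)^2\right]}} $$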
The formula for r looks complex, but it is easily calculated using the TI-83 and TI-84
calculators. In fact, we already know how to do it. It appears at the bottom of the output
screens for LinRegTTest.
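To see how the formula works, it can be applied term by term. The following is a minimal Python sketch using the rainfall and wheat yield data from Example 13.8; the function name pearson_r and the conversion of rainfall to centimeters are illustrative additions, not part of the original text.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r from the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

rain = [10.5, 8.8, 13.4, 12.5, 18.8, 10.3, 7.0, 15.6, 16.0]           # inches
wheat_yield = [50.5, 46.2, 58.8, 59.0, 82.4, 49.2, 31.9, 76.0, 78.8]  # bushels/acre

r = pearson_r(rain, wheat_yield)

# Re-expressing rainfall in centimeters should leave r unchanged.
rain_cm = [2.54 * x for x in rain]
r_cm = pearson_r(rain_cm, wheat_yield)

print(round(r, 4), round(r_cm, 4))
```

Both printed values should match, since r does not depend on the units of measurement.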
Example 13.9
For example, recall the Third Exam/Final Exam data used in Examples 13.6
and 13.7:
Solution 13.9
We return to the output screen for the Third Exam/Final Exam example and find r at the
bottom of the LinRegTTest output table. We see that r = 0.663. Thus, we would say that
there is a moderately strong, positive correlation between scores on the third exam and scores
on the final exam.
Recall that the sign of r shows the direction of the relationship between x and y and can be
seen in the figures below.
NOTE
Even if we have a very strong correlation, this does not suggest that x causes y or y causes
x. Remember, "correlation does not imply causation."
Properties of the Correlation Coefficient, r
• The sign of r is the same as the sign of the slope, b, of the best-fit line.
• r is the same no matter what units are used to measure x and y; whatever
units are used, they cancel out of the numerator and denominator of the
formula.
The statistic r² is called the coefficient of determination; as the notation suggests, this
is just the square of the correlation coefficient. This statistic is usually stated as a percent,
rather than in decimal form, and it measures something very specific in the context
of the data.
When we calculate a regression, some of the variation in the y-values is due to variation in
the x-values. This variation is called explained variation. But unless the linear
relationship between x and y is a perfect correlation (very rare!) there will also be variation
in the y-values that has nothing to do with x. These variations could be simple sampling
variation, or could be from other factors that influence the variable y. This type of
variation is called unexplained variation.
The coefficient of determination r² measures the percentage of variation in the
dependent variable y that is explained by variation in the independent variable x.
Once we have calculated r, finding r² is very easy; but the coefficient of determination also
appears on the output for LinRegTTest. Again, referring to the Third Exam/Final Exam
example, we can read the value of r² from the output screen: r² = 0.6631² = 0.4397. Since
0.4397 ≈ 44%, this means that approximately 44% of the variation in the final exam grades
can be explained by the variation in the grades on the third exam.
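The split into explained and unexplained variation can be checked numerically. The sketch below, in Python, uses the rainfall and wheat yield data from Example 13.8; the variable names (total, explained, unexplained) are illustrative labels for the sums of squares and are not notation from the text.

```python
rain = [10.5, 8.8, 13.4, 12.5, 18.8, 10.3, 7.0, 15.6, 16.0]           # inches
wheat_yield = [50.5, 46.2, 58.8, 59.0, 82.4, 49.2, 31.9, 76.0, 78.8]  # bushels/acre

n = len(rain)
x_bar = sum(rain) / n
y_bar = sum(wheat_yield) / n

# Least-squares slope and intercept.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(rain, wheat_yield))
     / sum((x - x_bar) ** 2 for x in rain))
a = y_bar - b * x_bar
predicted = [a + b * x for x in rain]

# Variation in y, split into the part the line accounts for and the part it does not.
total = sum((y - y_bar) ** 2 for y in wheat_yield)
explained = sum((p - y_bar) ** 2 for p in predicted)
unexplained = sum((y - p) ** 2 for y, p in zip(wheat_yield, predicted))  # this is SSE

r_squared = explained / total
print(f"explained + unexplained = {explained + unexplained:.2f}, total = {total:.2f}")
print(f"r^2 = {r_squared:.3f}")
```

For a least-squares line, the explained and unexplained parts add up to the total variation, so r² can equivalently be computed as 1 – SSE/total.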
Example 13.10
The paired data below consist of the temperatures on randomly chosen days and the
amount a certain kind of plant grew (in millimeters):
Temp 62 76 50 51 71 46 51 44 79
Growth 36 39 50 13 33 33 17 6 16
a. Draw a scatterplot.
b. Find linear regression equation.
c. Find the correlation coefficient r and interpret.
d. Find the coefficient of determination r² and interpret.
Solution 13.10
a. Scatterplot
Go to the STAT menu, and then the EDIT menu. Enter the Temp values into L1, and the
Growth values into L2. Go to the STAT menu again, and then to the TESTS menu; scroll
down to LinRegTTest; set the alternative to the “≠”, scroll to Calculate and press Enter.
13.6 The data below are ages and systolic blood pressures of 9 randomly selected adults.
Age 38 41 45 48 51 53 57 61 65
Pressure 116 120 123 131 142 145 148 150 152
Finally, we will point out that the correlation coefficient is part of an alternative formula for
calculating the slope of the regression line. Let x̄ and ȳ be the means of the observed x- and
y-values, respectively, and let s_x and s_y be the standard deviations of the observed x- and
y-values. Then we can calculate the slope using the formula:
$$ b = r \, \frac{s_y}{s_x} $$
Note that this formula clearly shows one of the properties listed earlier; namely, that the sign
of r is the same as the sign of the slope, b, of the best-fit line.
13.4 | Testing the Significance of the Correlation Coefficient
The correlation coefficient, r, tells us about the strength and direction of the linear
relationship between x and y. However, our interpretation of this statistic is inherently
subjective. For example, we know that when r is “close to 1”, this indicates a strong
correlation. But this raises the question: how close? That is, just how close must r be to 1 in
order for the correlation to be considered “strong”? Similarly, how close does r need to be to
0 to call the correlation “weak”? More importantly, when is the correlation strong enough to
warrant using the regression equation for predictions? What we need is an objective
criterion; in short, we need a hypothesis test for whether the correlation is statistically
significant.
This hypothesis test is called the test of the "significance of the correlation
coefficient” and will allow us to quickly and efficiently decide whether the linear
relationship in the sample data is strong enough to use to model the relationship in the
population. We can carry out this test with the same five-step method used previously.
The sample data are used to compute r, the correlation coefficient for the sample. If we had
data for the entire population, we could find the population correlation coefficient; this
unknown parameter is denoted by ρ, the Greek letter "rho." The sample statistic r is a
point estimate for ρ.
The most common test involves testing for a significant linear correlation. Then the null
hypothesis states that the population correlation coefficient ρ is 0. This means that if we were
to calculate the regression line using all data pairs in the population, then the points in the
scatterplot would be randomly scattered about a horizontal line. That is, knowing the value
of x would not be of any use in predicting y. So, the null hypothesis essentially states that
there really is no linear relationship between x and y.
On the other hand, the alternative hypothesis states that the population correlation
coefficient is different from zero; that is, Ha states that there is a significant
linear relationship between x and y.
As described above, the test of significance is two-tailed (because the alternative hypothesis
is a “≠” statement). And it is a t-test with df = n – 2, where n is the number of data pairs.
The test statistic is:
$$ t = r \sqrt{\frac{n-2}{1-r^2}} $$
To decide the test, we can use either critical values or p-values. Once the test statistic is
found, the p-value can be found with the TI-83 and TI-84 calculators using the
tcdf function (in the DISTR menu). Even better, we can get both the test statistic and the p-value
using LinRegTTest and the following instructions:
1. Go to STAT and then to the EDIT menu. Enter the x-values into list L1 and the y-
values into list L2.
2. Go to STAT and then to the TESTS menu; scroll down and select LinRegTTest.
3. On the LinRegTTest input screen enter: Xlist: L1 ; Ylist: L2 ; Freq: 1
4. On the next line, at the prompt β or ρ, highlight "≠ 0" and press ENTER.
5. Leave the line for "RegEq:" blank.
6. Highlight Calculate and press ENTER.
The test statistic t and p-value will appear on the output screen.
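When only r and n are available, the test statistic and p-value can also be computed away from the calculator. The following Python sketch is one way to do it with the standard library only; it assumes the values r = 0.6631 and n = 11 from the Third Exam/Final Exam example, and it approximates the area under the t density with Simpson's rule as a stand-in for the calculator's tcdf function. The function names are hypothetical.

```python
import math

def t_statistic(r, n):
    """Test statistic t = r * sqrt((n - 2) / (1 - r^2))."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=20000):
    """Two-tailed p-value, using Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3      # approximates P(0 < T < |t|)
    return 1 - 2 * area       # area in both tails

t = t_statistic(0.6631, 11)
p = two_tailed_p(t, df=11 - 2)
print(f"t = {t:.3f}, p = {p:.4f}")
```

The numerical integration is only an approximation chosen to keep the sketch self-contained; statistical software would normally supply the t distribution directly.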
Example 13.11
Recall the Third Exam/Final Exam case. We were given data on exam scores:
We have already calculated the regression line for the data: ŷ = –173.51 + 4.83x, as well as
the correlation coefficient r = 0.663. Use this information to determine whether the
correlation is significant at the 5% significance level.
Solution 13.11
This is a hypothesis test for correlation, and we will use the five-step method using p-value.
(Step 1) Set up the hypotheses H0 and Ha. The hypotheses are H0: ρ = 0 and Ha: ρ ≠ 0.
(Step 2) Select the correct test and identify the significance level α. This is a two-tailed
t-test, and α = 0.05.
(Step 3) Calculate the p-value. Following the instructions above, we have the following
input and output screens:
(Step 4) Use these results to make the decision about H0. Since p < .05, we reject H0.
(Step 5) Interpret the decision about H0 in the context of the given problem to state a
conclusion. Thus, there is a significant correlation between the Exam 3 scores and the Final
Exam scores. And because r is significant, the regression line can be used to predict
final exam scores.
Example 13.12
Consider the following data on grams of sugar and calories per serving for cereals
sampled from a big box store. A concerned shopper claims that there is a positive linear
correlation between sugar per serving and calories. Perform a test to determine if there is a
positive linear correlation with a 5% level of significance.
Solution 13.12
Put the data into the L1 and L2 using STAT 1:Edit, finishing with quit (2nd mode). Use
STAT TESTS F:LinRegTTest. Select lists, Freq 1, β & ρ > 0, Calculate.
(Step 4) Since the p-value is greater than α, (0.5467 > 0.05) do not reject the null
hypothesis.
(Step 5) There is not enough evidence to support the claim that there is a positive linear
correlation between calories and grams of sugar in the cereals at the store.
In Summary
• If r is not significant, the regression equation should not be used for prediction.
• If r is significant then the regression equation can be used to predict the value of y
for values of x that are within the domain of observed x values.
• If r is significant, the line may NOT be appropriate or reliable for prediction
OUTSIDE the domain of observed x values in the data.
The p-value is quite easy to obtain using technology such as the TI-84 calculator. However,
the calculator requires us to have the raw data available. But there are some cases when we
just have the correlation coefficient and want to use it to determine whether the correlation
is significant. In these instances, we could calculate the test statistic t and then compare it
to an appropriate critical value; but there are also tables of critical values for r. That is, we
can compare the correlation coefficient r directly to the critical value to make the decision
about H0.
There is a table for critical values for the Pearson correlation coefficient at the end of this
chapter and in the appendix. We reproduce an excerpt here for convenience:
Because this is a two-tailed t-test, we will reject H0 at the α = .05 level if either r is less than the negative critical value or r is greater than the positive critical value.
The table also shows that the sample size is a factor in whether a correlation is
significant.
Consider once more the third exam/final exam example. The line of best fit is
ŷ = –173.51 + 4.83x, r = 0.6631, and there are n = 11 data points. Can the regression line be used
for prediction?
But instead of using the p-value as before, we will instead use the critical value. We know
that df = n – 2 = 11 – 2 = 9.
From the table, the critical values are ±0.602; since 0.6631 > 0.602, we reject H0 and again
conclude that r is significant.
There is sufficient evidence to conclude that there is a significant linear relationship between
the third exam score and the final exam score. Because r is significant, the regression line
can be used to predict final exam scores.
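The critical values for r are tied to the critical values for t: solving t = r √((n – 2)/(1 – r²)) for r gives r = t / √(t² + df), where df = n – 2. The short Python sketch below illustrates this relationship. The critical t values 2.262 (df = 9) and 2.306 (df = 8) are two-tailed values for α = .05 taken from a standard t-table, which is an assumption brought in from outside this section; the function name critical_r is hypothetical.

```python
import math

def critical_r(t_critical, df):
    """Critical value of r that corresponds to a critical t with df = n - 2."""
    return t_critical / math.sqrt(t_critical ** 2 + df)

# Two-tailed critical t values for alpha = .05 (from a standard t-table; assumed).
print(round(critical_r(2.262, 9), 3))   # n = 11 data points
print(round(critical_r(2.306, 8), 3))   # n = 10 data points
```

These reproduce the table entries 0.602 and 0.632 used in this section, which shows that comparing r with its critical value is equivalent to comparing t with its critical value.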
Example 13.13
Suppose you computed the following correlation coefficients using samples of the given size.
For each of the following cases, use the table at the end of the chapter to locate the critical
value and use it to determine if r is significant and the line of best fit associated with each r
can be used to predict a y value. If it helps, draw a number line. Assume Ha: ρ ≠ 0
Solution 13.13
a. We computed r = 0.801 using n = 10 data points. Then df = n – 2 = 10 – 2 = 8, and the
critical values associated with df = 8 and α = .05 are ±0.632.
So we would reject H0 if either r < -0.632 or r > 0.632. Since r = 0.801 and 0.801 > 0.632, we
conclude that r is significant, and the regression line may be used for prediction. It is
helpful to view this on a number line:
b. We computed r = –0.624 with 14 data points, so df = 12; with α = .01, the critical values
are ±0.661. Since –0.624 is not in either tail (–0.661 < –0.624 < 0.661), r is not significant,
and the line cannot be used for prediction.
r = 0.776 > +0.574. Therefore, r is significant.
13.7 For each of the following, use critical values to determine whether the correlation is
significant. Then state whether or not the regression line can be used for prediction.
a. r = 0.5204 using n = 9 data points, α = .1, two-tailed
b. r = –0.7204 using n = 8 data points, α = .05, left-tailed
c. r = 0.708 and n = 9, α = .02, two-tailed
d. r = 0.434 and n = 14, α = .05 , right-tailed
Example 13.14
In an area of the Midwest, records were kept on the relationship between the rainfall (in
inches) and the yield of wheat (bushels per acre). Use α = .01
Rain (inches) 10.5 8.8 13.4 12.5 18.8 10.3 7.0 15.6 16.0
Yield (bushels/acre) 50.5 46.2 58.8 59.0 82.4 49.2 31.9 76.0 78.8
a. Test the significance of the correlation between Rain and Yield using the p-value.
b. Test the significance of the correlation using the critical value.
Solution 13.14
Go to the STAT menu, and then the EDIT menu. Enter the Rain values into L1, and the
Yield values into L2. Go to the STAT menu again, and then to the TESTS menu; scroll
down to LinRegTTest; set the alternative to “≠”, scroll to Calculate and press ENTER.
a. The test statistic and p-value are: t = 13.31 and p = 3.16 × 10⁻⁶ = 0.00000316. Since
p < .01, we reject H0. There is a significant relationship between Rain and Yield.
NOTE
The table provided in this textbook shows critical values only for the significance level of 5%, α =
0.05. But there are other tables available that show critical values for other significance
levels as well. Of course, if we are using the p-value method, we could test the significance
of the correlation using any desired significance level; we would not be limited to using α =
0.05.
Testing the significance of the correlation coefficient requires certain assumptions about
the population data. Remember, the purpose of this test is to determine whether the linear
relationship that we see between x and y in the sample data provides strong enough
evidence to conclude that there is in fact a linear relationship between x and y in the
population.
The regression equation that we calculate from the sample data gives the best-fit line for
our particular sample. We want to use this best-fit line for the sample as an estimate of the
best-fit line for the population. Examining the scatterplot and testing the significance of the
correlation coefficient helps us determine if it is appropriate to do this. The assumptions underlying the test of significance are:
1) There is a linear relationship in the population that models the average value of y
for varying values of x. In other words, the expected value of y for each particular x-
value lies on a straight line in the population. (We do not know the equation for the
line for the population. Our regression line from the sample is our best estimate of
this line in the population.)
2) The y values for any particular, fixed x-value are normally distributed about the
line. This implies that there are more y values scattered closer to the line than are
scattered farther away. Assumption (1) states that these normal distributions are
centered on the line; so, for a fixed x-value, the mean of the normal distribution of
corresponding y-values is the y-value predicted by the regression equation.
3) The standard deviations of the normal distributions described in (2) are all equal.
That is, the population of y-values corresponding to a fixed x will have the same
standard deviation σ, regardless of the value of x.
Note that in the literature, Assumptions (2) and (3) are usually called the Normality
Assumption and the Constant Variance assumption, respectively. Combining the first
three assumptions, we see that no matter what the value of x, the distribution of
corresponding y-values will have the same shape and spread about the regression line.
This is illustrated in the following figure:
The y values for each x value are normally distributed about the line with the same standard
deviation. For each x value, the mean of the corresponding y values lies on the regression line.
Finally, the point estimate for the common standard deviation σ in Assumption (3) is called
the standard error of the estimate and is denoted s. Note that this is just the standard deviation of the
residuals. This statistic provides a measure of the spread of y-values about the regression
line, and can be calculated using the formula:
$$ s = \sqrt{\frac{\text{SSE}}{n-2}} = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}} $$
We rarely need to use this formula, however. Like other key statistics related to regression,
the standard error s appears on the output screens for LinRegTTest. For example, in the
Third Exam/Final Exam example, the standard error is s = 16.42.
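For readers who want to see the formula in action, here is a minimal Python sketch that computes the standard error of the estimate directly from the residuals. It uses the rainfall and wheat yield data from Example 13.14 rather than the exam data; the function name standard_error is an illustrative choice.

```python
import math

def standard_error(xs, ys):
    """Standard error of the estimate, s = sqrt(SSE / (n - 2))."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    # SSE: sum of squared residuals about the least-squares line.
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))

rain = [10.5, 8.8, 13.4, 12.5, 18.8, 10.3, 7.0, 15.6, 16.0]           # inches
wheat_yield = [50.5, 46.2, 58.8, 59.0, 82.4, 49.2, 31.9, 76.0, 78.8]  # bushels/acre

s = standard_error(rain, wheat_yield)
print(f"s = {s:.2f} bushels/acre")
```

The result is in the same units as y, so it can be read as a typical vertical distance between an observed yield and the regression line.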
Test statistic: $t = r \sqrt{\dfrac{n-2}{1-r^2}}$          Calculator Function: LinRegTTest
13.5 | Prediction
Recall the Third Exam/Final Exam example. We examined the scatterplot and showed
that there is a significant correlation between the Exam 3 scores and the Final Exam scores
(i.e. the correlation coefficient is significant). We also found the equation of the best-fit line
for the final exam grade as a function of the grade on the third exam: ŷ = –173.51 + 4.83x.
We can now use the least-squares regression line for prediction. For example, suppose we
wanted to predict the mean final exam score of statistics students who received 73 on the
third exam. Note that the sample exam scores (x-values) range from 65 to 75. Since x = 73 is
within the range of observed x-values (65 to 75), and the correlation is significant, we can use
the regression equation. So, we substitute x = 73 into the equation to get:
ŷ = –173.51 + 4.83(73) = 179.08
Of course, this does not mean that any student who scores 73 on the third exam is
guaranteed to score 179.08 on final exam. There will be other factors that influence a
student’s score on the final exam (e.g. the number of hours spent studying for the exam).
What we know is that in the population, among all students who scored 73 on the third
exam, the average score on the final exam is the y-value on the regression line. So when
we say that 179.08 is the predicted score, this means:
We predict that statistics students who earn a grade of 73 on the third exam will earn a
grade of 179.08 on the final exam, on average.
Example 13.15
Recall the Third Exam/Final Exam example from Example 13.6. We have shown that
the correlation is significant, and the regression equation is ŷ = –173.51 + 4.83x.
a. What would you predict the final exam score to be for a student who scored a 66 on
the third exam?
b. What would you predict the final exam score to be for a student who scored a 90 on
the third exam?
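Solution 13.15
a. Since 66 is within the range of observed x-values (65 to 75), we substitute x = 66 into
the regression equation: ŷ = –173.51 + 4.83(66) = 145.27.
b. The observed x-values in the sample data are between 65 and 75. Ninety is well
outside of the domain of these observed x-values, so we cannot reliably predict the
final exam score for this student.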
Of course, we can easily enter 90 into the equation for x and calculate the corresponding y-
value, but the predicted value will not be reliable. To illustrate how unreliable the prediction
can be for x-values outside the range of observed data, make the substitution x = 90 into the
equation:
ŷ = –173.51 + 4.83(90) = 261.19
The final-exam score is predicted to be 261.19, which is not even possible, since the largest the
final-exam score can be is 200.
Recall that the process of predicting within the range of x-values observed in the data is
called interpolation. The process of predicting outside the range of x-values observed in
the data is called extrapolation. In general, extrapolation should be done with extreme
caution, if at all.
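The distinction between interpolation and extrapolation can be built into a prediction routine. The sketch below, written in Python for illustration, uses the rounded regression equation ŷ = –173.51 + 4.83x and the observed range 65 to 75 from this example; the function name predict_final is hypothetical.

```python
# Rounded regression line from the Third Exam/Final Exam example.
A, B = -173.51, 4.83
X_MIN, X_MAX = 65, 75  # range of observed third-exam scores


def predict_final(x):
    """Return (predicted final score, True if x is inside the observed range)."""
    y_hat = round(A + B * x, 2)
    return y_hat, X_MIN <= x <= X_MAX


for score in (73, 66, 90):
    y_hat, inside = predict_final(score)
    label = "interpolation" if inside else "extrapolation, not reliable"
    print(f"x = {score}: y-hat = {y_hat} ({label})")
```

Returning the flag alongside the number, rather than refusing to compute, mirrors the text: the substitution x = 90 can always be carried out, but the result should be labeled as unreliable.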
13.8 Data are collected on the relationship between the number of hours per week
practicing a musical instrument and scores on a math test. The line of best fit is as follows:
ŷ = 72.5 + 2.8x
What would you predict the score on a math test would be for a student who practices a
musical instrument for five hours a week?
Prediction and r
• If r is not significant, the regression equation should not be used for prediction.
Use the mean of the observed y values for prediction.
• If r is significant then the regression equation can be used to predict the value of y
for values of x that are within the domain of observed x values.
13.6 | Outliers
In some data sets, there are values (observed data points) called outliers. Outliers are
observed data points that are far from the least squares line. They have large "errors", where
the "error" or residual is the vertical distance from the line to the point.
Outliers need to be examined closely. Sometimes, for some reason or another, they should not
be included in the analysis of the data. It is possible that an outlier is a result of erroneous
data. Other times, an outlier may hold valuable information about the population under
study and should remain included in the data. The key is to examine carefully what causes
a data point to be an outlier.
Besides outliers, a sample may contain one or a few points that are called influential points.
Influential points are observed data points that are far from the other observed data points
in the horizontal direction. These points may have a big effect on the slope of the regression
line. To begin to identify an influential point, you can remove it from the data set and see if
the slope of the regression line is changed significantly.
Computers and many calculators can be used to identify outliers from the data. Computer
output for regression analysis will often identify both outliers and influential points so that
you can examine them.
Identifying Outliers
We could guess at outliers by looking at a graph of the scatterplot and best-fit line. However,
we would like some guideline as to how far away a point needs to be in order to be considered
an outlier.
As a rough rule of thumb, we can flag any point that is located further than two
standard errors above or below the best-fit line as an outlier.
We can do this visually in the scatter plot by drawing an extra pair of lines that are two
standard errors (2s) above and below the best-fit line. Any data points that are outside this
extra pair of lines are flagged as potential outliers. Or we can do this numerically by
calculating each residual and comparing it to twice the standard error. On the TI-83,
83+, or 84+, the graphical approach is easier. The graphical procedure is shown first, followed
by the numerical calculations. You would generally need to use only one of these methods.
Example 13.16
In the Third Exam/Final Exam example, use the graphical method to determine if there
is an outlier or not. If there is an outlier, we will delete it and fit the remaining data to a new
line; the new line ought to fit the remaining data better. This means the SSE and standard
error should both be smaller, and the correlation coefficient ought to be closer to +1.
Solution 13.16
Recall that the regression equation for this example is ŷ = –173.51 + 4.83x, and the standard
error is s = 16.4. (This statistic appears on the output screen for LinRegTTest.)
We want to draw two lines that are parallel to the regression line, and such that the vertical
distance to the regression line is 2s at each point; to do this, we add and subtract 2s = 2(16.4)
to the equation for the regression line:
Y2 = –173.5 + 4.83x – 2(16.4) and Y3 = –173.5 + 4.83x + 2(16.4)
Graph the scatterplot with the best fit line in equation Y1, as before.
Then enter the two extra lines as Y2 and Y3 in the "Y=" equation editor and press ZOOM 9.
We see that the only data point that is not between lines Y2 and Y3 is the point x = 65, y =
175. The outlier is the student who had a grade of 65 on the third exam and 175 on the final
exam; this point is further than two standard deviations away from the best-fit line.
Sometimes a point is so close to the lines used to flag outliers on the graph that it is difficult
to tell if the point is between or outside the lines; for this example, the point is just barely
outside the parallel lines. On a computer, enlarging the graph may help; on the calculator
screen it can help to zoom in.
Numerical with Line of Best Fit
If we are in doubt, we can verify numerically that the point is an outlier.
1. Calculate the predicted final exam score for a student who scores 65 on the third
exam: ŷ = –173.51 + 4.83(65) = 140.44
2. Calculate the difference between the observed and predicted values (i.e. the
residual): y – ŷ = 175 – 140.44 = 34.56
3. Determine if the residual is more than two standard errors. Since s = 16.4, we have 2s
= 32.8. Since 34.56 > 32.8, the observation is an outlier.
In the table below, the first two columns are the third-exam and final-exam data. The third
column shows the predicted ŷ values calculated from the line of best fit: ŷ = –173.5 + 4.83x.
The residuals, or errors, have been calculated in the fourth column of the table.
Recall that the standard error s is the standard deviation of the residuals, which is s = 16.4.
x     y     ŷ     y − ŷ
65    175   140    35
67    133   150   -17
71    185   169    16
71    163   169    -6
66    126   145   -19
75    198   189     9
67    153   150     3
70    163   164    -1
71    159   169   -10
69    151   160    -9
69    159   160    -1
We are looking for all data points for which the residual is greater than 2s = 2(16.4) = 32.8 or
less than -32.8. Compare these values to the residuals in column four of the table. The only
such data point is the student who had a grade of 65 on the third exam and 175 on the final
exam; the residual for this student is 35.
NOTE: This method involves a lot of calculation, but it is easily automated. In fact, most
statistical packages (such as Minitab or SPSS) will automatically calculate the residuals and
flag data points that appear to be outliers.
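As an illustration of how this check can be automated, here is a minimal Python sketch of the two-standard-error rule applied to the Third Exam/Final Exam data. It is not the procedure used by Minitab or SPSS; it simply repeats the hand calculation above with unrounded values, so the residuals differ slightly from the rounded table.

```python
import math

# Third Exam / Final Exam scores (x = third exam, y = final exam).
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares line y-hat = a + b*x.
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar

# Residuals and the standard error of the estimate.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Flag every point whose residual is more than 2s above or below the line.
outliers = [(xi, yi) for xi, yi, e in zip(x, y, residuals) if abs(e) > 2 * s]
print(f"2s = {2 * s:.1f}; flagged points: {outliers}")
```

Only the point (65, 175) is flagged, in agreement with the graphical and numerical methods.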
Now that we have identified the point (65, 175) as an outlier, we would re-examine the data
for this point to see if there are any problems with the data. If there is an error, we should
fix the error if possible, or delete the data point if not. If the data is correct, we would leave it in
the data set. For this example, let us suppose that after reviewing the data, we found that
this data pair really was an error and should be removed.
We remove the data point and compute a new best-fit line and correlation coefficient using
the ten remaining points; using the calculator as before, we obtain the new regression
equation:
ŷ = –355.19 + 7.39x
Moreover, the new model has correlation coefficient r = 0.9121. Recall that for the original
data, we had r = 0.6631, so the new line exhibits a stronger correlation. This means that
the new line is a better fit to the ten remaining data values. Moreover, the new standard
error is s = 9.275, a significant reduction. Both of these tell us that the new regression line
can better predict the final exam score given the third exam score.
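The before-and-after comparison can be reproduced with a short Python sketch. The helper fit_line below is a hypothetical name introduced for illustration; it applies the least-squares formulas from this chapter to the full data set and to the ten points that remain after (65, 175) is removed.

```python
import math


def fit_line(x, y):
    """Return intercept a, slope b, correlation r, and standard error s."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    r = sxy / math.sqrt(sxx * syy)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, r, math.sqrt(sse / (n - 2))


points = [(65, 175), (67, 133), (71, 185), (71, 163), (66, 126), (75, 198),
          (67, 153), (70, 163), (71, 159), (69, 151), (69, 159)]
kept = [p for p in points if p != (65, 175)]  # drop the flagged outlier

for label, data in (("all 11 points", points), ("outlier removed", kept)):
    xs, ys = zip(*data)
    a, b, r, s = fit_line(xs, ys)
    print(f"{label}: y-hat = {a:.2f} + {b:.2f}x, r = {r:.4f}, s = {s:.3f}")
```

Reporting both fits side by side also follows the recommendation to provide results with and without the deleted data.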
NOTE
When outliers are deleted, the researcher should either record that data was deleted, and
why, or should provide results both with and without the deleted data. If data is erroneous
and the correct values are known (e.g., student one actually scored a 70 instead of a 65), then
this correction can be made to the data.
13.9 Identify the potential outlier in the scatter plot. The standard deviation of the
residuals is approximately 8.6.
Example 13.17
The Consumer Price Index (CPI) measures the average change over time in the prices paid
by urban consumers for consumer goods and services. The CPI affects nearly all Americans
because of the many ways it is used. One of its biggest uses is as a measure of inflation. By
providing information about price changes in the Nation's economy to government,
business, and labor, the CPI helps them to make economic decisions. The President,
Congress, and the Federal Reserve Board use the CPI's trends to formulate monetary and
fiscal policies. In the following table, x is the year and y is the CPI.
Year (x)   CPI (y)     Year (x)   CPI (y)
1915       10.1        1969       36.7
1926       17.7        1975       49.3
1935       13.7        1979       72.6
1940       14.7        1980       82.4
1947       24.1        1986       109.6
1952       26.5        1991       130.7
1964       31.0        1999       166.6
Solution 13.17
a. Parts a and c:
Go to STAT, and then the EDIT menu. Enter the x-values into L1 and the y-values
into L2.
Go to STAT, and then the TESTS menu; scroll down to LinRegTTest. Specify the
lists, select the “≠” alternative, place the cursor on Calculate and press Enter to get
the output screens.
NOTE: To enter Y1 in RegEQ, press VARS, choose Y-VARS, choose Function, then select Y1 and press ENTER.
NOTE
In the example, notice the pattern of the points compared to the line. Although the
correlation coefficient is significant, the pattern in the scatterplot indicates that a curve
might be a more appropriate model to use than a line. If we were to try and fit a non-linear
function to the data, we might obtain a more accurate model.
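One way to see the curved pattern without a graph is to look at the signs of the residuals. The following Python sketch fits the least-squares line to the CPI table from this example and lists each residual; the code and its variable names are an illustration, not part of the calculator procedure.

```python
import math

years = [1915, 1926, 1935, 1940, 1947, 1952, 1964,
         1969, 1975, 1979, 1980, 1986, 1991, 1999]
cpi = [10.1, 17.7, 13.7, 14.7, 24.1, 26.5, 31.0,
       36.7, 49.3, 72.6, 82.4, 109.6, 130.7, 166.6]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(cpi) / n
sxx = sum((x - x_bar) ** 2 for x in years)
syy = sum((y - y_bar) ** 2 for y in cpi)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, cpi))

b = sxy / sxx                   # slope
a = y_bar - b * x_bar           # intercept
r = sxy / math.sqrt(sxx * syy)  # correlation coefficient

# A run of same-sign residuals suggests the line misses a curve in the data.
residuals = [y - (a + b * x) for x, y in zip(years, cpi)]
for x, e in zip(years, residuals):
    print(f"{x}: residual = {e:+.1f}")
print(f"slope = {b:.3f}, r = {r:.4f}")
```

The residuals are positive for the earliest and latest years and negative for the years in between, which is the signature of data that bend upward away from a straight line.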
13.10 The following table shows economic development measured in per capita income
PCINC.
Example 13.18
The following data represent the thread count of various bed sheets and their
corresponding price. Use the data to answer the following questions, using the TI-83/84 and Excel
when possible.
Thread Count 300 800 1600 600 500 400 1000 700 1500 1800
Price in Dollars 53 54 77 67 55 40 97 70 80 96
a. Draw a scatterplot using TI-83/84 and Excel.
b. Determine the line of best fit. Interpret the slope, the y-intercept and the correlation
coefficient, r.
c. Calculate the residuals.
d. Determine if there are any outliers.
e. Calculate the coefficient of determination, r2 .
f. Test the claim that there is a linear correlation between thread counts and cost in
dollars at a 5% level of significance using p-value.
Solution 13.18
a. Using TI-83/84 put Thread Count into a list and Price in dollars into another list. To
edit lists go to STAT, 1:Edit.
Go to Stat Plot (2nd y =). Pick a plot. Select ON, Type: Scatter plot, Xlist: Thread
Count, Ylist: Cost in Dollars. You must change the window to fit the data. Window: x
max = 2000, y max = 150. Then press graph.
Using Excel place the data into cells and select the scatter graph.
NOTE
Lists can be cleared by going to the top and pushing clear. This clears the whole list. To
clear one item in the list, use the del button. Do not use del at the top of the list as it will
delete the whole list.
Using Excel and the LINEST function: =LINEST(y values, x values, TRUE, TRUE), where the first
TRUE tells Excel to solve for the intercept b and the second TRUE requests additional regression statistics.
The first row of the LINEST output gives the slope (m = 0.027531) and the y-intercept (43.57143).
The slope is 0.027531, which means that as the thread count goes up by 1, the cost goes
up by about $0.0275.
The y-intercept is 43.57143. This means that when x = 0 (that is, a thread count of zero),
the predicted cost is $43.57. Notice that this is an extrapolation, so the value is not reliable.
The correlation coefficient, r, is 0.778587 which is positive indicating a positive slope and
positive correlation.
c. The residuals, ε = y – ŷ, can be calculated with both the TI-83/84 and Excel, as long as
the calculations are done correctly in each column. Both tables would have these
values. Excel was used to make the table below.
d. Determine if there are any outliers. Looking at the LinRegTTest results in
part b, note that s = 12.59. Outliers are points whose residuals are greater than 2s =
2(12.59) = 25.18 or less than –25.18.
The residual when the thread count is 1000 is 25.89757, which is
greater than 25.18, so the point (1000, 97) is an outlier.
e. Calculate the coefficient of determination, r².
Looking at b for both the TI-83/84 and Excel, r² = 0.606198. This means that 60.6%
of the variation in the cost in dollars can be explained by the variation in the thread
count.
f. Test the claim that there is a correlation between thread count in sheets and cost in
dollars using p-value.
(Step 3) See b for the test statistic, t = 3.5092, and the p-value, p = 0.003985.
(Step 4) p-value < α: 0.003985 < 0.05. Reject the null hypothesis. There is sufficient
evidence of a linear correlation between thread count and cost.
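The key numbers in this solution can be checked without a calculator or spreadsheet. The Python sketch below recomputes the slope, intercept, r, r², s, the test statistic t, and the 2s outlier check from the thread count data; it is an illustrative cross-check with made-up variable names, and it does not compute the p-value, which requires the t distribution.

```python
import math

thread = [300, 800, 1600, 600, 500, 400, 1000, 700, 1500, 1800]
price = [53, 54, 77, 67, 55, 40, 97, 70, 80, 96]

n = len(thread)
x_bar = sum(thread) / n
y_bar = sum(price) / n
sxx = sum((x - x_bar) ** 2 for x in thread)
syy = sum((y - y_bar) ** 2 for y in price)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(thread, price))

slope = sxy / sxx
intercept = y_bar - slope * x_bar
r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

# Residuals, standard error of the estimate, and the test statistic.
residuals = [y - (intercept + slope * x) for x, y in zip(thread, price)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
t = r * math.sqrt((n - 2) / (1 - r ** 2))

# Points more than 2s above or below the line are flagged as outliers.
outliers = [(x, y) for x, y, e in zip(thread, price, residuals) if abs(e) > 2 * s]

print(f"y-hat = {intercept:.5f} + {slope:.6f}x")
print(f"r = {r:.6f}, r^2 = {r_squared:.6f}, s = {s:.2f}, t = {t:.4f}")
print("outliers:", outliers)
```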
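$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$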
Coefficient of determination: The square of the correlation coefficient, which is r². The
coefficient of determination measures the percentage of variation in the dependent variable
that is explained by variation in the independent variable.
Constant Variance Assumption: This assumption states that for any fixed x-value, the
population of corresponding y-values has the same variance.
Least Squares Regression Line: This is the equation for the line of best fit to a set of n
data points (x, y). The equation is ŷ = a + bx, where:
$$ a = \bar{y} - b\bar{x} \quad \text{and} \quad b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} $$
Here, x̄ and ȳ are the means of the observed x- and y-values, respectively; the regression
line always passes through the point (x̄, ȳ).
Normality Assumption: This assumption states that for any fixed x-value, the population
of corresponding y-values is normally distributed.
Outlier: An observation that does not fit the rest of the data. For regression, this is an
observed data point that is located at a vertical distance more than 2s from the regression
line, where s is the standard error of the estimate.
Residual: The difference y – ŷ between an observed y-value and the y-value predicted by
the regression line.
Scatterplot: A graph of the ordered pairs (x, y) of observed data, where x is the
independent variable and y is the dependent variable.
Standard error of the estimate: This is the standard deviation of the residuals; it is also
a point estimate for the common standard deviation from the Constant Variance assumption.
This statistic provides a measure of the spread of y-values about the regression line, and is
calculated using:
$$ s = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}} $$
CHAPTER REVIEW
13.2 Scatter Plots
The procedure of fitting a linear equation to a collection of observed data points is called
linear regression.
The regression line is the “line of best fit” to the data. To understand what this means, we
calculate the residuals, the differences y – ŷ between the observed y-values and the y-values
predicted by the line. The residuals measure the vertical distances between the observed data
points and the corresponding points on the regression line.
The sum of the squares of the residuals, Σ(y – ŷ)², is called the SSE, or Sum of Squared
Errors. The regression line is found by minimizing the SSE and so is also called the
least-squares line. This line, usually denoted as ŷ = a + bx, is easily calculated on the TI-84 using
LinRegTTest.
The correlation coefficient r is a numerical measure of the strength and direction of the
linear association between x and y:
These properties are intuitive and easy to understand but are somewhat subjective. So, we
also introduced a test to determine whether or not a correlation between two variables is
significant. This test involves the correlation parameter ρ, and the hypotheses are
H0: ρ = 0 and Ha: ρ ≠ 0.
This is a two-tailed t-test with test statistic $t = r\sqrt{\dfrac{n-2}{1-r^2}}$. We can get both the test statistic
and p-value from the calculator using LinRegTTest. We can also make the decision about
H0 by comparing r directly to an appropriate critical value.
Rejecting H0 at the .05 significance level provides evidence that the correlation is
significant.
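As a quick check on the test statistic, the formula can be evaluated directly from r and n. The Python sketch below uses a hypothetical helper name, correlation_t, and applies it to two examples from this chapter: the Third Exam/Final Exam data (r = 0.6631, n = 11) and the thread count data (r = 0.778587, n = 10).

```python
import math


def correlation_t(r, n):
    """Test statistic for H0: rho = 0; it has n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


t_exam = correlation_t(0.6631, 11)
t_thread = correlation_t(0.778587, 10)
print(f"exam data:   t = {t_exam:.4f}, df = {11 - 2}")
print(f"thread data: t = {t_thread:.4f}, df = {10 - 2}")
```

The thread count value matches the t = 3.5092 reported by LinRegTTest in Example 13.18.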
There are several important assumptions that we make when calculating a regression
equation, and judging how well it fits the data:
1. There is a linear relationship in the population that models the average value of y for
varying values of x. In other words, the expected value of y for each particular x-value
lies on a straight line in the population.
2. The y-values corresponding to a fixed x-value are normally distributed about the line.
Assumption (1) states that these normal distributions are centered on the regression
line, so for a fixed x-value, the mean of the normal distribution of corresponding y-
values is the y-value predicted by the regression equation.
3. The standard deviations of the normal distributions described in (2) are all equal.
That is, the population of y-values corresponding to a fixed x will have the same
standard deviation σ, regardless of the value of x. The point estimate for this common
standard deviation is called the standard error of the estimate; it is the standard
deviation of the residuals, and so provides a measure of the spread of y-values about
the regression line. The standard error of the estimate also appears on the output
screens for LinRegTTest.
13.5 Prediction
Once we determine that there is a significant correlation between x and y, we can use the
least squares regression line to make predictions; that is, given a value of the independent
variable x, we can use the equation ŷ = a + bx to find the corresponding y-value. Some things
to remember:
• If r is not significant, the regression equation should not be used for prediction.
• If r is significant then the regression equation can be used to predict the value of y
for values of x that are within the domain of observed x values.
• If r is significant, the line may NOT be appropriate or reliable for prediction
OUTSIDE the domain of observed x values in the data.
13.6 Outliers
An outlier is a point that lies well outside the pattern of the observed data; specifically, it is
an observed data point that is located at a vertical distance more than 2s from the
regression line, where s is the standard error of the estimate. We showed both a graphical
method and a numerical method for identifying outliers; see Example 13.16.
Critical Values of r
Exercises for Chapter 13
1. A specialty cleaning company charges an equipment fee and an hourly labor fee. The
total amount of the fee the company charges for each session is given by the equation y = 50
+ 100x.
2. The price of a single issue of stock can fluctuate throughout the day. The price of stock
for Shipment Express is y = 15 – 1.5x where x is the number of hours passed in an eight-
hour day of trading.
4. What does an r value of zero mean?
5. A random sample of ten professional athletes produced the following data where x is the
number of endorsements the player has and y is the amount of money made (in millions of
dollars).
x  0  3  2  1  5   5   4  3  0  4
y  2  8  7  3  13  12  9  9  3  10
8. A landscaping company is hired to mow the grass for several large properties. The total
area of the properties combined is 1,345 acres. The rate at which one person can mow is as
follows: ŷ = 1350 – 1.2x where x is the number of hours and ŷ represents the number of
acres left to mow.
9. An electronics retailer used regression to find a simple model to predict sales growth in
the first quarter of the new year (January through March). The model is good for 90 days,
where x is the day. The model can be written as follows ŷ = 101.32 + 2.48x where ŷ is in
thousands of dollars.
10. The following table shows cell phone usage in the U.S. for the years 2003-2009:
a. Construct a scatterplot.
b. Find the line of best-fit.
c. Find the correlation coefficient.
d. Would the regression line provide a reliable estimate for cell phone usage in 2016?
Explain.
11. Find the regression equation and correlation coefficient for the following data:
12. Find the regression equation and correlation coefficient for the following data:
x 57 53 59 61 53 56 60
y 156 164 163 177 159 175 151
13. The following data shows the number of years spent studying a foreign language and
the score on a test of fluency.
Number of years (x) Score (y)
3 57
4 78
4 72
2 58
5 89
3 63
4 73
5 84
3 75
2 48
Find the regression equation and correlation coefficient for the data.
14. In an area of the Midwest, records were kept on the relationship between the rainfall
(in inches) and the yield of wheat (bushels per acre). The data is shown below:
Rain (inches) 10.5 8.8 13.4 12.5 18.8 10.3 7.0 15.6 16.0
Yield (bushels/acre) 50.5 46.2 58.8 59.0 82.4 49.2 31.9 76.0 78.8
Use this data to determine whether there is a significant correlation between the amount of
rainfall and yield of wheat.
15. The data below are the final exam scores of 6 randomly selected statistics students and
the number of hours they studied for the exam.
Hours 5 10 4 6 10 9
Score 64 86 69 86 59 87
Use this data to determine whether there is a significant correlation between the number of
hours studied and the exam score.
The following data shows data for the first two decades of AIDS reporting. Use this data to
answer questions #16-23.
16. Create a scatterplot for the data, using YEAR as the independent variable.
17. Find the regression equation. Round the coefficients to whole numbers.
19. Is there a significant correlation between the year and the number of AIDS cases
diagnosed? Explain.
22. Would it be appropriate to use the regression line to estimate the number of diagnosed
AIDS cases for the current year?
23. Does the line seem to fit the data? Why or why not? Does your answer affect how you
would use the regression equation for estimation?
24. a. The SSE (Sum of Squared Errors) for a data set of 18 numbers is 49. What is the
standard error of the estimate?
b. The standard error of the estimate for a data set is 9.8. What is the cutoff for the
vertical distance that a point can be from the line of best fit to be considered an
outlier?
25. The following scatter plot shows the relationship between hours spent studying and
exam scores. The line shown is the calculated line of best fit. The correlation coefficient is
0.69.
a. Do there appear to be any outliers? If so, describe the point.
b. A point is removed, and the line of best fit is recalculated. The new correlation
coefficient is r = 0.98. Does the point appear to have been an outlier? Why or why
not?
c. What effect did the potential outlier have on the line of best fit?
d. Are you more or less confident in the predictive ability of the new line of best fit?
26. The Gross Domestic Product Purchasing Power Parity is an indication of a country’s
currency value compared to another country. The following table shows the GDP PPP of
Cuba as compared to US dollars.
27. Does the higher cost of tuition translate into higher-paying jobs? The table lists the top
ten colleges based on mid-career salary and the associated yearly tuition costs.
a. Construct a scatter plot of the data, using tuition as the independent variable.
b. Find the correlation coefficient r. Is there a significant correlation?
c. From the scatterplot, do there appear to be any outliers?
d. After removing the outlier(s), is there a significant correlation?
29. Recently, the annual number of driver deaths per 100,000 for the selected age groups
was as follows:
For each age group, pick the midpoint of the interval for the x value; for the 75+ group, use
80.
a. Using “ages” as the independent variable, calculate the least squares (best–fit) line.
b. Find the correlation coefficient. Is it significant?
c. Based on this data, is there a linear relationship between age of a driver and driver
fatality rate?
30. The table below shows the life expectancy for an individual born in the United States in
certain years.
Year of Birth   Life Expectancy
1930            59.7
1940            62.9
1950            70.2
1965            69.7
1973            71.4
1982            74.5
1987            75
1992            75.7
2010            78.7
a. Which variable should be the independent and which should be the dependent
variable?
b. Draw a scatter plot of the ordered pairs.
c. Calculate the least squares line; put the equation in the form ŷ = a + bx.
d. Find the correlation coefficient. Is the correlation significant?
e. Find the estimated life expectancy for an individual born in 1950 and for one born in
1982.
f. Why aren’t the answers to part e the same as the values in the table?
g. Should we use the least squares line to find the estimated life expectancy for an
individual born in 1850? Why or why not?
h. Interpret the slope of the regression line.
31. The maximum discount value of the Entertainment® card for the “Fine Dining” section,
Edition ten, for various pages is given below:
32. The table below gives the gold medal times for every other Summer Olympics for the
women’s 100-meter freestyle (swimming).
Year Time (seconds)
1912 82.2
1924 72.4
1932 66.8
1952 66.8
1960 61.2
1968 60.0
1976 55.65
1984 55.92
1992 54.64
2000 53.8
2008 53.1
33. Ornithologists, scientists who study birds, tag sparrow hawks in 13 different colonies to
study their population. They gather data for the percent of new sparrow hawks in each
colony and the percent of those that have returned from migration.
Percent return: 74 66 81 52 73 62 52 45 62 46 60 46 38
Percent new: 5 6 8 11 12 15 16 17 18 18 19 20 20
a. Enter the data into your calculator and make a scatter plot.
b. Use your calculator’s regression function to find the equation of the least-squares
regression line. Add this to your scatter plot from part a.
c. Interpret the slope and intercept of the regression line in the context of the problem.
d. How well does the regression line fit the data? Explain your answer.
e. Which point has the largest residual? Is this point an outlier? An influential point?
Explain.
f. An ecologist wants to predict how many birds will join another colony of sparrow
hawks to which 70% of the adults from the previous year have returned. What is the
prediction?
34. The height (sidewalk to roof) of notable tall buildings in America is compared to the
number of stories of the building (beginning at street level). The data is shown below:
a. Using “stories” as the independent variable and “height” as the dependent variable,
make a scatter plot of the data.
b. Does it appear from inspection that there is a relationship between the variables?
c. Calculate the least squares line. Put the equation in the form: ŷ = a + bx
d. Find the correlation coefficient. Is it significant?
e. Find the estimated heights for a 32-story building.
f. Find the height for a 94-story building.
g. Based on this data, is there a linear relationship between the number of stories in
tall buildings and the height of the buildings?
h. Suppose we wanted to estimate the height of a building with six stories? Would the
least squares line give an accurate estimate of height? Explain why or why not.
i. Based on the least squares line, adding an extra story is predicted to add about how
many feet to a building?
35. The following table consists of one student athlete’s time (in minutes) to swim 2000
yards and the student’s heart rate (beats per minute) after swimming on a random sample
of 10 days:
a. Enter the data into your calculator and make a scatter plot.
b. Find the equation of the least-squares regression line. Add this to your scatter plot
from part a.
c. Interpret the slope and y-intercept of the regression line.
d. How well does the regression line fit the data? Explain your response.
e. Which point has the largest residual? Explain what the residual means in context. Is
this point an outlier? An influential point? Explain.
36. The following table shows data on average per capita wine consumption (in liters) and
deaths from heart disease (per 100,000 residents) in a random sample of 10 countries:
Consumption 2.5 3.9 2.9 2.4 2.9 0.8 9.1 2.7 0.8 0.7
Deaths 221 167 131 191 220 297 71 172 211 300
a. Enter the data into your calculator and make a scatter plot.
b. Find the regression equation and add to your scatter plot from part a.
c. Explain in words what the slope and y-intercept of the regression line tell us.
d. How well does the regression line fit the data? Explain your response.
e. Do the data provide convincing evidence that there is a linear relationship between
the amount of wine consumed and the heart disease death rate? Carry out an
appropriate test at the .05 significance level to justify your answer.
37. The percentage of female wage and salary workers who are paid hourly rates for the
years 1979 to 1992 are given in the following table:
38. The average number of people in a family that received welfare for various years is
given in the following table:
a. Using “year” as the independent variable, draw a scatter plot of the data.
b. Calculate the least-squares line. Put the equation in the form: ŷ = a + bx
c. Find the correlation coefficient. Is the correlation significant?
d. Based on this data, is there a linear relationship between the year and the average
number of people in a welfare family?
e. Using the least-squares line, estimate the welfare family size for 1960 and 1995.
Does the least-squares line give an accurate estimate for those years? Explain why
or why not.
39. The following are advertised sale prices of color televisions at a major appliance store.
Suppose that we wish to use the size of the television to predict the selling price.
a. Calculate the least-squares line. Put the equation in the form of: ŷ = a + bx
b. Find the correlation coefficient. Is there a significant correlation between size and
price?
c. Find the estimated sale price for a 32-inch television. Find the cost for a 50-inch
television.
d. Interpret the slope of the regression line.
Net Taxable Estate ($) Approx. Probate Fees and Taxes ($)
600,000 30,000
750,000 92,500
1,000,000 203,000
1,500,000 438,000
2,000,000 688,000
2,500,000 1,037,000
3,000,000 1,350,000
Suppose that we wish to investigate the relationship between these two variables.
a. Which would be the independent variable? Which would be the dependent variable?
b. Calculate the least-squares line. Put the equation in the form of: ŷ = a + bx
c. Find the correlation coefficient. Is it significant?
d. Find the estimated total cost for a net taxable estate of $1,000,000; why does this
value differ from the value shown in the table?
e. Interpret the slope of the least-squares regression line.
f. List the residuals for each x value.
41. The following table shows data for the average heights of American boys of different
ages. Suppose that we want to use a boy’s age to predict his height.
a. Which variable should be the independent variable? Which should be the dependent
variable?
b. Calculate the least-squares line. Put the equation in the form: ŷ = a + bx
c. Find the correlation coefficient. Is the correlation significant? Explain.
d. Find the estimated average height for a one-year-old. Find the estimated average
height for an eleven-year-old.
e. Interpret the slope of the least-squares line.
f. List the residuals for each x value.
REFERENCES
Data from the National Center for HIV, STD, and TB Prevention.
13.5 Prediction
Data from the National Center for HIV, STD, and TB Prevention.
13.6 Outliers
Data from the House Ways and Means Committee, the Health and Human Services Department.
Data from Microsoft Bookshelf.
Data from the United States Department of Labor, the Bureau of Labor Statistics. Data from the
Physician’s Handbook, 1990.
Data from the United States Department of Labor, the Bureau of Labor Statistics.
CHAPTER 13 SOLUTIONS:
27) a. [Scatterplot of Mid-Career Salary vs Tuition]
b. r = -0.148. The critical value is CV = .632. Since | r | < .632, there is not a significant
correlation between tuition and mid-career salary.
c. The two military institutions appear to be outliers (these have zero tuition)
d. Removing these two points, the new correlation coefficient is r = -0.715, and the new
critical value is CV = .707. Since we now have | r | > .707, the correlation is significant.
29) a. ŷ = 35.58 – 0.192 x
b. r = –0.579; LinRegTTest gives p-value = 0.2288 so we would not reject the null
hypothesis. I.e. the correlation is not significant.
c. There is not a significant linear relationship between deaths and age of driver.
31) a. Independent variable is x = page number; dependent variable is y = max value of
discount.
b. [Scatterplot of Max Value vs Page Number]
c. ŷ = 17.218 – 0.014 x
d. r = –0.275; from the table, CV = 0.666. Since |r| < CV, there is not a significant
correlation between page number and maximum value of discount.
e. Since there is not a significant correlation, we should not use the regression equation for
estimation.
33) a. in the calculator b. ŷ = 31.93 – 0.304 x.
c. The slope of the regression line is -0.304 with a y-intercept of 31.93. The y-intercept
indicates that when there are no returning sparrow hawks, there will be almost 32% new
sparrow hawks; this doesn’t make sense because if there are no returning birds, then
the new percentage would have to be 100% (this shows the dangers of extrapolation). The
slope tells us that for each percentage increase in returning birds, the percentage of new
birds in the colony decreases by 0.304%.
d. The coefficient of determination is r² = .560, which means that about 56% of the total
variation in the percent of new birds is explained by the model. The correlation
coefficient, r = –0.75, indicates a moderately strong negative correlation between returning
and new percentages.
e. The ordered pair (66, 6) generates the largest residual of 6.0. This means that when the
observed return percentage is 66%, our observed new percentage, 6%, is almost 6% less
than the predicted new value of 11.87%. However, the standard error is s = 3.66, and so
this residual is less than two standard errors from the predicted value; hence the point is
not an outlier.
f. If there are 70% returning birds, we would expect to see y = -0.304(70) + 31.93 = 10.65%
new birds in the colony.
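The prediction in part f and the residual in part e can be checked with a short Python sketch. It assumes the rounded coefficients from part b; the function name `predict_new_percent` is our own and is used only for illustration.

```python
# Regression line from part b (rounded coefficients): y-hat = 31.93 - 0.304 x
INTERCEPT = 31.93
SLOPE = -0.304


def predict_new_percent(returning_percent):
    """Predicted percent of new birds for a given percent of returning birds."""
    return INTERCEPT + SLOPE * returning_percent


# Part f: prediction for 70% returning birds
pred_70 = predict_new_percent(70)

# Part e: residual = observed value - predicted value for the point (66, 6)
residual_66 = 6 - predict_new_percent(66)

print(round(pred_70, 2))      # 10.65
print(round(residual_66, 2))  # -5.87
```

The residual is negative because the observed value lies below the regression line; its size (about 5.87) is what the solution rounds to 6.0.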
37) a. [Scatterplot of percent paid hourly rates vs. year]
b. Yes, the scatterplot exhibits a fairly strong linear pattern. c. ŷ = −266.9 + 0.166 x
d. r = 0.944; the correlation is significant.
e. When x = 1991, y = 62.82; when x = 1988, y = 62.33.
f. Yes, the scatterplot, correlation coefficient and hypothesis test all indicate a linear
relationship.
g. Yes, the point (1987, 62.7) gives a residual of .906. The standard error is s = .247, so the
observed y-value is more than three standard errors from the predicted y-value.
h. When x = 2050, we would get y = 72.59. But this would not be an accurate prediction,
since x = 2050 is too far outside the range of observed data for a reliable prediction.
i. The slope is b = 0.1656. The percent of workers paid hourly rates tends to increase by
0.1656 (1/6 of 1%) each year.
14 | F-DISTRIBUTION AND 1-WAY ANOVA
Figure 13.1 One-way ANOVA can be used to compare average roasting time of 3 or more blends.
Introduction
Chapter Objectives
• Interpret the F probability distribution as the number of groups and the sample size
change.
• Discuss two uses for the F distribution: one-way ANOVA and the test of two variances.
• Conduct and interpret one-way ANOVA.
• Conduct and interpret hypothesis tests of two variances.
Many statistical applications in psychology, social science, business administration, and the
natural sciences involve several groups. For example, an environmentalist is interested in
knowing if the average amount of pollution varies in several bodies of water. A sociologist is
interested in knowing if the amount of income a person earns varies according to his or her
upbringing. A consumer looking for a new car might compare the average gas mileage of
several models.
14.1 | Test of Two Variances/Standard Deviations
One use of the F distribution is testing two variances. It is often desirable to compare two
variances rather than two averages. For instance, college administrators would like two
college professors grading exams to have the same variation in their grading. In order for a
lid to fit a container, the variation in the lid and the container should be the same. A
supermarket might be interested in the variability of check-out times for two checkers.
In order to perform an F test of two variances, it is important that the following are true:
1. The populations from which the two samples are drawn are normally distributed.
2. The two populations are independent of each other.
Unlike most other tests in this book, the F test for equality of two variances is very sensitive
to deviations from normality. If the two distributions are not normal, the test can give higher
p-values than it should, or lower ones, in ways that are unpredictable. Many texts suggest
that students not use this test at all, but in the interest of completeness we include it here.
We will again be using the 5 steps of hypothesis testing.
Suppose we sample randomly from two independent normal populations. Let σ₁² and σ₂² be
the population variances and s₁² and s₂² be the sample variances. Let the sample sizes be n₁
and n₂. Since we are interested in comparing the two sample variances, we use the F ratio:
$$F = \frac{s_1^2 / \sigma_1^2}{s_2^2 / \sigma_2^2}$$
F has a distribution similar in shape to the chi-square distribution, but it depends on two
degrees of freedom: one for the numerator (d.f.N. = n₁ – 1) and one for the denominator
(d.f.D. = n₂ – 1).
[Figure: F distribution curves for F2,5, F8,26, F16,7, and F3,11]
Since the null hypothesis will have equality, σ₁² = σ₂² (or σ₁² ≥ σ₂² or σ₁² ≤ σ₂²), the F
ratio becomes
$$F = \frac{s_1^2}{s_2^2}$$
NOTE:
When using the 2-SampFtest, standard
deviation is requested instead of
variance. Make sure you know which is
given in the problem.
Recall that a critical value is a value of the distribution that separates the confidence area
from the non-confidence area. We will use the F-distribution table from the Statistics Online
Computational Resource site (http://www.socr.ucla.edu/Applets.dir/F_Table.html) or Excel to
find the critical values. The same rules for alpha (α) apply as with the previous distributions
(Z, t, and χ²): if the test is two-tailed, split α between the two tails.
To find FL, the left critical value:
1. Switch the degrees of freedom: d.f.N. will now be d.f.D. and vice versa.
2. Look up the new d.f.N. at the top of the chart and the new d.f.D. on the side of the
chart.
3. Take the reciprocal of the chart value: $F_L = \frac{1}{\text{chart value}}$
Example 14.1
Find the right and left critical values of the F-distribution when the degrees of
freedom of the numerator is 5 and the degrees of freedom of the denominator is 10. Use a 90%
confidence level for a two-tailed test.
First, find the probability in the tails. CL = 0.90, which gives α/2 = .05 (area in 1 tail). Look
for the 0.05 page in the tables.
FR = 3.3258
$F_L = \frac{1}{\text{chart value}} = \frac{1}{4.7351} = 0.2112$
Solution 14.1 using Excel
Here dfn = 5 and dfd = 10 stays the same for both FR and FL.
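The same critical values can be reproduced outside Excel. The sketch below uses only the Python standard library: it evaluates the F cumulative distribution through the regularized incomplete beta function and inverts it by bisection. The function names (`reg_inc_beta`, `f_cdf`, `f_inv`) are our own; `f_inv` is written to behave like Excel's left-tail F.INV. If SciPy is installed, `scipy.stats.f.ppf` returns the same quantile with far less code.

```python
from math import exp, lgamma, log

TINY = 1e-300  # guards against division by zero in the continued fraction


def _nonzero(value):
    """Replace values that are too close to zero with a tiny positive number."""
    return value if abs(value) >= TINY else TINY


def _beta_cf(a, b, x, max_iter=300, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    c = 1.0
    d = 1.0 / _nonzero(1.0 - (a + b) * x / (a + 1.0))
    h = d
    for m in range(1, max_iter + 1):
        # Even-numbered term of the continued fraction
        term = m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m))
        d = 1.0 / _nonzero(1.0 + term * d)
        c = _nonzero(1.0 + term / c)
        h *= d * c
        # Odd-numbered term of the continued fraction
        term = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0))
        d = 1.0 / _nonzero(1.0 + term * d)
        c = _nonzero(1.0 + term / c)
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = exp(lgamma(a + b) - lgamma(a) - lgamma(b)
                + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_cdf(x, dfn, dfd):
    """Area to the left of x under the F distribution with (dfn, dfd) d.f."""
    if x <= 0.0:
        return 0.0
    return reg_inc_beta(dfn / 2.0, dfd / 2.0, dfn * x / (dfn * x + dfd))


def f_inv(p, dfn, dfd):
    """Value with left-tail area p, found by bisection on f_cdf."""
    lo, hi = 0.0, 1.0
    while f_cdf(hi, dfn, dfd) < p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if f_cdf(mid, dfn, dfd) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Two-tailed test at the 90% confidence level: 0.05 in each tail
f_right = f_inv(0.95, 5, 10)
f_left = f_inv(0.05, 5, 10)
print(round(f_right, 4))  # 3.3258
print(round(f_left, 4))   # 0.2112
```

Because the area is computed directly, the degrees of freedom are not switched here; the switch is only needed when a right-tail table is used to look up a left critical value.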
Example 14.2
Two college instructors are interested in whether or not there is any variation in the way
they grade math exams. They each grade the same set of 30 exams. The first instructor's
grades have a variance of 52.3. The second instructor's grades have a variance of 89.9. Test
the claim that the first instructor's variance is smaller. (In most colleges, it is desirable for
the variances of exam grades to be nearly the same among instructors.) The level of
significance is 5%.
Claim: σ₁² < σ₂² (the first instructor's variance is smaller than the second instructor's variance).
F = 2sampFtest = .5818
Draw the graph labeling and shading appropriately.
With a 5% level of significance, from the data, there is not sufficient evidence to conclude that
the variance in grades for the first instructor is smaller.
Claim: σ₁² < σ₂² (the first instructor's variance is smaller than the second instructor's variance).
F = 2sampFtest = 0.5818
Distribution for the test is F29,29 where d.f.N. = n1 – 1 = 29 and d.f.D. = n2 – 1 = 29.
Since the test is left-tailed, we are looking for the left critical value, FL. When we switch
the degrees of freedom to use the table, nothing changes, since in this example d.f.N. =
d.f.D. There is no 29 at the top of the chart, so we use the closest degrees of freedom,
which is 30. On the side of the chart, we use 29 since it is listed.
$F_L = \frac{1}{\text{chart value}} = \frac{1}{1.85} = .5405$
Or use Excel
Since the test statistic, 0.5818, is not in the critical region (0.5818 > 0.5405), Do Not Reject
H0.
With a 5% level of significance, from the data, there is not sufficient evidence to conclude
that the variance in grades for the first instructor is smaller.
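The arithmetic in this example can be sketched in a few lines of Python. The variable names are our own, and the chart value 1.85 is the one quoted in the table lookup above.

```python
# Sample variances of the two instructors' grades (Example 14.2)
var_first = 52.3
var_second = 89.9

# Test statistic: ratio of the sample variances
f_stat = var_first / var_second

# Left critical value from the table: reciprocal of the chart value
chart_value = 1.85
f_left = 1 / chart_value

# Left-tailed test: reject H0 only if F falls below the left critical value
reject_h0 = f_stat < f_left

print(round(f_stat, 4))  # 0.5818
print(round(f_left, 4))  # 0.5405
print(reject_h0)         # False
```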
Example 14.3
The New York Choral Society divides male singers up into four categories from highest voices
to lowest: Tenor1, Tenor2, Bass1, and Bass2. In the table are heights of the men in the Tenor1
and Bass2 groups. One suspects that taller men will have lower voices, and that the variance
of height may go up with the lower voices as well. Do we have good evidence that the standard
deviations of the heights of singers in each of these two groups (Tenor1 and Bass2) are
different using 10% level of significance? Use the table below.
Tenor 1: 69, 72, 71, 66, 76, 74, 71, 66, 68, 67, 70, 65, 72, 70, 68, 64, 73, 66, 68, 67, 64
Bass 2: 72, 75, 67, 75, 74, 72, 72, 74, 72, 72, 74, 70, 66, 68, 75, 68, 70, 72, 67, 70, 70, 69,
72, 71, 74, 75
H0: σ₁ = σ₂
Ha: σ₁ ≠ σ₂ (claim)
F = 2-SampFTest = 1.489
There is not enough evidence at the 10% level of significance to support the claim that the
standard deviations of the heights of singers in these two groups (Tenor1 and Bass2) are
different.
Solution 14.3 using critical value
H0: σ₁ = σ₂
Ha: σ₁ ≠ σ₂ (claim)
Since the test is two-tailed, we split α. We look up the critical values using the 0.05 chart
of the F-distribution table (http://www.socr.ucla.edu/Applets.dir/F_Table.html).
Distribution for the test is F20, 25 where d.f.N. = n1 – 1 = 20 and d.f.D. = n2 – 1 = 25.
FR = 2.01 and $F_L = \frac{1}{\text{chart value}} = \frac{1}{2.08} = .4808$
NOTE: d.f.N. and d.f.D. switch for the left critical value when using the table.
Do Not Reject H0 because the test statistic, 1.489, is not in either tail.
There is not enough evidence at the 10% level of significance to support the claim that the
standard deviations of the heights of singers in these two groups (Tenor1 and Bass2) are
different.
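The test statistic can also be computed directly from the heights. The sketch below uses Python's `statistics.variance` (the sample variance) and assumes the table splits into 21 Tenor1 heights and 26 Bass2 heights, which agrees with the degrees of freedom 20 and 25 used in the solution. The critical values are the chart values quoted above.

```python
from statistics import variance

# Heights (inches); the column split is an assumption based on d.f. = 20 and 25
tenor1 = [69, 72, 71, 66, 76, 74, 71, 66, 68, 67, 70,
          65, 72, 70, 68, 64, 73, 66, 68, 67, 64]
bass2 = [72, 75, 67, 75, 74, 72, 72, 74, 72, 72, 74, 70, 66,
         68, 75, 68, 70, 72, 67, 70, 70, 69, 72, 71, 74, 75]

# Degrees of freedom for numerator and denominator
dfn = len(tenor1) - 1
dfd = len(bass2) - 1

# F statistic: ratio of the two sample variances
f_stat = variance(tenor1) / variance(bass2)

# Critical values from the 0.05 chart (two-tailed test at the 10% level)
f_right = 2.01
f_left = 1 / 2.08

# Reject H0 only if F lands in either tail
reject_h0 = f_stat < f_left or f_stat > f_right

print(dfn, dfd)          # 20 25
print(round(f_stat, 3))  # 1.489
print(reject_h0)         # False
```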
14.1 The New York Choral Society divides male singers up into four categories from highest
voices to lowest: Tenor1, Tenor2, Bass1, and Bass2. In the table are heights of the men in the
Tenor1 and Bass2 groups. One suspects that taller men will have lower voices, and that the
variance of height may go up with the lower voices as well. Do we have enough evidence that
the standard deviation of Tenor 1 heights is larger than the standard deviation of the heights
of Bass 2 singers using 5% level of significance? Use the table below.
Tenor 1: 70, 72, 71, 66, 76, 74, 71, 66, 68, 67, 70, 65, 72, 70, 68, 64, 73, 66, 68, 67, 64
Bass 2: 72, 75, 67, 75, 74, 72, 72, 74, 72, 72, 74, 70, 66, 68, 75, 68, 70, 72, 67, 70, 70, 69,
72, 71, 74, 75
14.2 | One-Way ANOVA
The purpose of a one-way ANOVA test is to determine the existence of a statistically
significant difference among several group means. There must be at least three groups. The
test actually uses variances to help determine if the means are equal or not. In order to
perform a one-way ANOVA test, there are five basic assumptions to be fulfilled:
The null hypothesis is simply that all the group population means are the same. The
alternative hypothesis is that at least one pair of means is different. For example, if there
are k independent samples:
H0: μ1 = μ2 = μ3 = ... = μk
Ha: At least one of the means is different
The graphs, a set of box plots representing the distribution of values with the group means
indicated by a horizontal line through the box, help in the understanding of the hypothesis
test. In the first graph (red box plots), H0: μ1 = μ2 = μ3 and the three populations have the
same distribution if the null hypothesis is true. The variance of the combined data is
approximately the same as the variance of each of the populations.
If the null hypothesis is false, then the variance of the combined data is larger which is caused
by the different means as shown in the second graph (green box plots).
NOTE:
To calculate the F ratio for the ANOVA test value, two estimates of the variance
are made:
Test Statistic (Value): $F = \frac{MSB}{MSW}$
SSbetween (SSB) = the sum of squares that represents the variation among the different samples
SSwithin (SSW) = the sum of squares that represents the variation within samples that is due
to chance.
To find a "sum of squares" means to add together squared quantities that, in some cases, may
be weighted. We used sum of squares to calculate the sample variance and the sample
standard deviation in Descriptive Statistics.
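Let $s_j$ be the sum of the values in the jth group, $n_j$ the size of the jth group, and n the
total number of data values.
• Total variation: $SS_{total} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$
• Explained variation: $SS_{between} = \sum \frac{(s_j)^2}{n_j} - \frac{\left(\sum s_j\right)^2}{n}$
• Unexplained variation: $SS_{within} = SS_{total} - SS_{between}$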
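The sums of squares above translate directly into code. The following Python sketch applies the formulas to three small made-up groups (the data are hypothetical and chosen only so the arithmetic is easy to follow); the function name `one_way_anova` is our own. It uses k – 1 and n – k as the numerator and denominator degrees of freedom, where k is the number of groups.

```python
def one_way_anova(groups):
    """Compute sums of squares, mean squares and F for a list of groups."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total number of data values
    all_values = [x for g in groups for x in g]

    # (sum of all x)^2 / n, which equals (sum of the group sums)^2 / n
    correction = sum(all_values) ** 2 / n

    ss_total = sum(x * x for x in all_values) - correction
    ss_between = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    ss_within = ss_total - ss_between

    ms_between = ss_between / (k - 1)        # MSB
    ms_within = ss_within / (n - k)          # MSW
    return {
        "SS_between": ss_between,
        "SS_within": ss_within,
        "SS_total": ss_total,
        "MSB": ms_between,
        "MSW": ms_within,
        "F": ms_between / ms_within,
    }


# Hypothetical data: three groups of three observations each
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result["SS_between"], result["SS_within"], result["SS_total"])  # 26.0 6.0 32.0
print(result["MSB"], result["MSW"], result["F"])                      # 13.0 1.0 13.0
```

Here the group sums are 6, 9 and 18, so the explained variation is 36/3 + 81/3 + 324/3 – 33²/9 = 147 – 121 = 26.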
The one-way ANOVA test depends on the fact that MSB can be influenced by population
differences among means of the several groups. Since MSW compares values of each group to
its own group mean, the fact that group means might be different does not affect MSW.
The null hypothesis says that all groups are samples from populations having the same
normal distribution. The alternate hypothesis says that at least two of the sample groups
come from populations with different normal distributions. If the null
hypothesis is true, MSB and MSW should both estimate the same value.
NOTE: The null hypothesis says that all the group population means are equal. The
hypothesis of equal means implies that the populations have the same normal distribution,
because it is assumed that the populations are normal and that they have equal variances.
$$F = \frac{MSB}{MSW}$$
If MSB and MSW estimate the same value (following the belief that H0 is true), then the F-
ratio should be approximately equal to one. Mostly, just sampling errors would contribute to
variations away from one. As it turns out, MSB consists of the population variance plus a
variance produced from the differences between the samples. MSW is an estimate of the
population variance. Since variances are always positive, if the null hypothesis is false, MSB
will generally be larger than MSW. Then the F-ratio will be larger than one. However, if the
population effect is small, it is not unlikely that MSW will be larger in a given sample.
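A small simulation can illustrate this behavior. The sketch below is only an illustration under assumed settings: three groups of 10 values are drawn from the same normal population (mean 50, standard deviation 10, both arbitrary), so the null hypothesis is true, and the F ratio is computed for each of 2,000 repetitions. The helper `f_ratio` is our own; it uses the equivalent "deviation from the mean" form of the sums of squares.

```python
import random
from statistics import mean


def f_ratio(groups):
    """F = MSB / MSW computed from deviations about the group and grand means."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


random.seed(14)  # fixed seed so the run is repeatable

# Every group comes from the same normal population, so H0 is true
f_values = []
for _ in range(2000):
    groups = [[random.gauss(50, 10) for _ in range(10)] for _ in range(3)]
    f_values.append(f_ratio(groups))

average_f = mean(f_values)
share_below_one = sum(f < 1 for f in f_values) / len(f_values)

print(round(average_f, 2))        # close to one when H0 is true
print(round(share_below_one, 2))  # fraction of samples in which MSW exceeded MSB
```

The average F ratio comes out near one, and in a sizable share of the samples MSW is larger than MSB, which is exactly the sampling variation described above.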
ANOVA Summary Table
Total N-1
Example 14.4
Three different diet plans are to be tested for mean weight loss. The entries in the table are
the weight losses for the different plans. The one-way ANOVA results are shown in the
summary table:
Within
(Error) 20.8542 7
Total
a. Complete the ANOVA summary table
b. Using a 5% level of significance, test the claim that the mean weight loss is the same for
the three different plans.
Solution 14.4 part a
Total 23.1 9
H0: μ1 = μ2 = μ3 (claim)
Ha: At least one of the means is different
$F = \frac{MSB}{MSW} = .3769$
There is not sufficient evidence at the 5% level of significance to reject the claim that the
mean weight loss is the same for the three different plans.
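Completing a summary table is a matter of subtraction and division. The sketch below starts from the Within (Error) row (SS = 20.8542, df = 7) and the Total row (SS = 23.1, df = 9) shown in this example and fills in the rest; the variable names are our own.

```python
# Known entries of the ANOVA summary table (Example 14.4)
ss_within, df_within = 20.8542, 7
ss_total, df_total = 23.1, 9

# Between (Factor) row: total minus within
ss_between = ss_total - ss_within
df_between = df_total - df_within

# Mean squares: each sum of squares divided by its degrees of freedom
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# Test statistic
f_stat = ms_between / ms_within

print(round(ss_between, 4), df_between)  # 2.2458 2
print(round(ms_between, 4))              # 1.1229
print(round(ms_within, 4))               # 2.9792
print(round(f_stat, 4))                  # 0.3769
```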
If we are given the data of 3 or more groups, we can use the List feature of the calculator to
enter the data. We can also use the ANOVA(L1, L2, L3, …) function to find the test statistic.
Example 14.5
A car manufacturing company needs to purchase steel to make roll bar frames for dune buggy
style autos. The purchasing manager looks at 4 steel makers, takes a random sample of
steel bars, and tests their strength (in MPa). The results are shown in the following table:
a. Using a significance level of 1%, is there a difference in the mean steel strength
among the 4 steel makers?
b. Create an ANOVA summary table.
H0: μ1 = μ2 = μ3 = μ4
Ha: At least one of the means is different (claim)
There is sufficient evidence at the 1% level of significance to support the claim that the mean
steel strength differs among the 4 steel makers.
Variation         Sum of Squares (SS)  Degrees of Freedom (df)  Mean Squares (MS)  Test Value (F)  p-value
Between (Factor)  24175                3                        8058.33            F = 5.68        0.00759
Within (Error)    22680                16                       1417.5
Total             46855                19
To recap, MSB = 8058.33 is an estimate of σ² that is based on the variability among the
sample means. MSW = 1417.5 is an estimate of σ² that is based on the sample variances.
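The entries of the summary table can be cross-checked from the two sums of squares and their degrees of freedom. In this Python sketch the helper `anova_summary` is our own name; the p-value is taken from the table rather than computed.

```python
def anova_summary(ss_between, df_between, ss_within, df_within):
    """Derive the remaining ANOVA table entries from the SS and df columns."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "MSB": ms_between,
        "MSW": ms_within,
        "F": ms_between / ms_within,
        "SS_total": ss_between + ss_within,
        "df_total": df_between + df_within,
    }


# Sums of squares and degrees of freedom from the table in Example 14.5
table = anova_summary(24175, 3, 22680, 16)

# Decision using the p-value reported in the table and alpha = 1%
p_value = 0.00759
alpha = 0.01
reject_h0 = p_value < alpha

print(round(table["MSB"], 2))                # 8058.33
print(table["MSW"])                          # 1417.5
print(round(table["F"], 2))                  # 5.68
print(table["SS_total"], table["df_total"])  # 46855 19
print(reject_h0)                             # True
```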
Example 14.6
A medical researcher wishes to try three different techniques to lower blood pressure of
patients with high blood pressure. The subjects are randomly selected and assigned to one
of three groups. Group 1 is given medication, Group 2 is given an exercise program, and
Group 3 is assigned a diet program. At the end of six weeks, each subject's blood pressure is
recorded, with data as shown below. Using p-value and the level of significance of 1%, test
the claim that the mean is the same for the different groups.
Solution 14.6
H0: μ1 = μ2 = μ3 (claim);
There is enough evidence at 1% level of significance to reject the claim that the mean is the
same for the different groups.
14.2 Four different types of fertilizers are used on raspberry plants. The number of
raspberries on each randomly selected plant is given below.
A suitable hypothesis test will be conducted to test the claim that the type of fertilizer
makes no difference in the mean number of raspberries per plant. Use α = .05 to test the
claim.
Key Terms
Analysis of Variance also referred to as ANOVA, is a method of testing whether or not
the means of three or more populations are equal. The method is applicable if:
Review of Tests
You have seen the F test statistic used in two different circumstances. The following bulleted
list is a summary that will help you decide which F test is the appropriate one to use.
• Testing 2 Variances:
Test statistic: $F = \frac{s_1^2}{s_2^2}$ (calculator: 2-SampFTest)
Degrees of freedom of numerator: d.f.N. = n1 – 1
Mean Square Between: $MSB = \frac{SSB}{d.f.N.}$
Mean Square Within: $MSW = \frac{SSW}{d.f.D.}$
Total sum of squares: $SS_{total} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$
Exercises for Chapter 14
1. Find the right critical value for F-distribution with α = 0.10 and d.f.n = 8 and d.f.d = 14
2. Find the right critical value for F-distribution with α = 0.05 and d.f.n. = 6 and d.f.d = 8
3. Find the left critical value for F-distribution with α = 0.025 and d.f.n. = 12 and d.f.d = 15
4. Find the left critical value for F-distribution with α = 0.10 and d.f.n. = 10 and d.f.d = 20
5. State the two assumptions that must be true in order to perform an F- test of two
variances.
6. Suppose that we test the hypotheses H0: σ₁² = σ₂² vs. Ha: σ₁² ≠ σ₂² using the following
data:
7. Suppose that we test the hypotheses H0: σ₁² = σ₂² vs. Ha: σ₁² ≠ σ₂² using the following
data:
8. Two coworkers commute from the same building. They are interested in comparing the
variability of their driving times to work. They each record their times for 20 commutes. The
first worker’s times have a variance of 12.1. The second worker’s times have a variance of
16.9. The first worker claims that his commuting time is less variable than that of his
colleague. Test this claim at the 10% level.
9. Two students are interested in comparing the amount of variation in their test scores for
math class. There are 15 total math tests they have taken so far. The first student’s grades
have a standard deviation of 38.1. The second student’s grades have a standard deviation of
22.5. The second student thinks his scores are less variable. Assuming that the populations
of test scores are normally distributed, test his claim at the α = .05 significance level.
10. Two cyclists are comparing the variances of their overall paces going uphill. Each cyclist
records his or her speeds going up 35 hills. The first cyclist has a variance of 23.8 and the
second cyclist has a variance of 32.1. Use this data along with a .05 significance level to test
the claim that there is a difference in the variances.
11. List the five basic assumptions that must be fulfilled in order to perform a one-way
ANOVA test.
12. State the hypotheses for a one-way ANOVA test if there are three groups.
13. State the hypotheses for a one-way ANOVA test if there are four groups.
14. Groups of men from three different areas of the country are to be tested for mean
weight. The entries in the table are the weights for the different groups. The one-way
ANOVA results are shown in the table:
15. Girls from four different soccer teams are to be tested for mean goals scored per game.
The entries in the table are the goals per game for the different teams. The one-way
ANOVA results are shown in the table below:
a. What is SSbetween?
b. What is the degrees of freedom for the numerator?
c. What is MSbetween?
d. What is SSwithin?
e. What is the degrees of freedom for the denominator?
f. What is MSwithin?
g. What is the F statistic?
h. Judging by the F statistic, do you think it is likely that you will reject the null
hypothesis?
16. Five basketball teams each took a random sample of players regarding how high each
player can jump (in inches). The results are shown in the table below:
17. A video game developer is testing a new game on three different groups. Each group
represents a different target market for the game. The developer collects scores from a
random sample from each group. The results are shown in the table:
This data is used to test whether there is a significant difference among the means for each
group.
18. Three different traffic routes are tested for mean driving time. The entries in the table
are the driving times in minutes on the three different routes. The data from three random
samples are shown in the table below:
Use this data to test the claim that the routes have the same mean driving time.
19. Three students, Linda, Tuan, and Javier, are each given five laboratory rats for a
nutritional experiment. Each rat's weight is recorded in grams. Linda feeds her rats Formula
A, Tuan feeds his rats Formula B, and Javier feeds his rats Formula C. At the end of a
specified time period, each rat is weighed again, and the net gain in grams is recorded. Using
a significance level of 10%, test the hypothesis that the three formulas produce the same
mean weight gain.
20. A grassroots group opposed to a proposed increase in the gas tax claimed that the increase
would hurt working-class people the most, since they commute the farthest to work. Suppose
that the group randomly surveyed 24 individuals and asked them their daily one-way
commuting mileage. The results are in the table below. Using a 5% significance level, test
the hypothesis that the three mean commuting mileages are the same.
21. The table below lists the number of pages in four different types of magazines. Use this
data and a 5% level of significance to test the hypothesis that the four magazine types have
the same mean number of pages.
22. A researcher wants to know if the mean times (in minutes) that people watch their
favorite news station are the same. The table below shows the results of a study.
Assume that all distributions are normal, the four population standard deviations are
approximately the same, and the data were collected independently and randomly. Use a .05
level of significance.
23. Are the means for the final exams the same for all statistics class delivery types? The
table below shows the scores on final exams from randomly selected classes using the
different delivery types.
Assume that all distributions are normal, the population standard deviations are
approximately the same, and the data were collected independently and randomly. Use a .05
level of significance.
24. Are the mean number of times a month a person eats out the same for whites, blacks,
Hispanics and Asians? Suppose that the table below shows the results of a study:
Use this data, along with a .05 significance level, to test the claim that the mean is the
same for all four populations. Assume that all distributions are normal, the four population
standard deviations are approximately the same, and the data were collected independently
and randomly.
25. Are the mean numbers of daily visitors to a ski resort the same for the three types of
snow conditions? Suppose that the data below shows the results of a study:
Use this data and a .05 significance level to test the claim that the mean number of visitors
is the same for all three types of snow conditions. Assume that all distributions are normal,
the three population standard deviations are approximately the same, and the data were
collected independently and randomly.
26. DDT is a pesticide that has been banned from use in the United States and most other
areas of the world. It is quite effective, but persisted in the environment and over time became
seen as harmful to higher-level organisms. Famously, egg shells of eagles and other raptors
were believed to be thinner and prone to breakage in the nest because of ingestion of DDT in
the food chain of the birds.
An experiment was conducted on the number of eggs (fecundity) laid by female fruit flies.
There are three groups of flies. One group was bred to be resistant to DDT (the RS group).
Another was bred to be especially susceptible to DDT (SS). Finally there was a control line of
non-selected or typical fruitflies (NS). Here are the data:
RS SS NS RS SS NS
12.8 38.4 35.4 22.4 23.1 22.6
21.6 32.9 27.4 27.5 29.4 40.4
14.8 48.5 19.3 20.3 16 34.4
23.1 20.9 41.8 38.7 20.1 30.4
34.6 11.6 20.3 26.4 23.3 14.9
19.7 22.3 37.6 23.7 22.9 51.8
22.6 30.2 36.9 26.1 22.5 33.8
29.6 33.4 37.3 29.5 15.1 37.9
16.4 26.7 28.2 38.6 31 29.5
20.3 39 23.4 44.4 16.9 42.4
29.3 12.8 33.7 23.2 16.1 36.6
14.9 14.6 29.2 23.6 10.8 47.4
27.3 12.2 41.7
The values are the average number of eggs laid daily for each of 75 flies (25 in each group)
over the first 14 days of their lives. Using a 1% level of significance, are the mean rates of egg
selection for the three strains of fruitfly different? If so, in what way? Specifically, the
researchers were interested in whether or not the selectively bred strains were different from
the non-selected line, and whether the two selected lines were different from each other.
27. Three students, Linda, Tuan, and Javier, are given five laboratory rats each for a
nutritional experiment. Each rat’s weight is recorded in grams. Linda feeds her rats Formula
A, Tuan feeds his rats Formula B, and Javier feeds his rats Formula C. At the end of a
specified time period, each rat is weighed again and the net gain in grams is recorded.
Determine whether or not there is a significant difference in the variance among Javier’s and
Linda’s rats. Test at a significance level of 10%.
28. A grassroots group opposed to a proposed increase in the gas tax claimed that the increase
would hurt working-class people the most, since they commute the farthest to work. Suppose
that the group randomly surveyed 24 individuals and asked them their daily one-way
commuting mileage. The results are as follows.
Working Class                 17.8 26.7 49.4 9.4 65.4 47.1 19.5 51.2
Professional (middle income)  16.5 17.4 22 7.4 9.4 2.1 6.4 13.9
Professional (wealthy)        8.5 6.3 4.6 12.6 11 28.6 15.4 9.3
Determine whether or not the variance in mileage driven is statistically the same among the
working class and professional (middle income) groups. Use a 5% significance level.
29. The following table lists the number of pages in four different types of magazines:
Use this data to test whether there is a significant difference in the variance for home
decorating magazines and news magazines.
30. Is the variance for the amount of money (in dollars) that shoppers spend on Saturdays
at the mall the same as the variance for the amount of money that shoppers spend on
Sundays at the mall? Suppose that the table shows the results of a study.
Use this data, along with a 5% significance level to test the claim that the variances are
equal.
31. Are the variances for incomes on the East Coast and the West Coast the same? Suppose
that the table below shows the results of a study; income is shown in thousands of dollars.
Use this data to test the hypothesis that the variances are the same. Assume that both
distributions are normal, and the samples are randomly and independently selected.
East 38 47 30 82 75 52 115 67
West 71 126 42 51 44 90 88
32. Thirty men in college were taught a method of finger tapping. They were randomly
assigned to three groups of ten, with each receiving one of three doses of caffeine: 0 mg, 100
mg, 200 mg. This is approximately the amount in no, one, or two cups of coffee. Two hours
after ingesting the caffeine, the men had the rate of finger tapping per minute recorded. The
experiment was double blind, so neither the recorders nor the students knew which group
they were in. Does caffeine affect the rate of tapping, and if so how?
0 mg    242 244 247 242 246 245 248 248 244 242
100 mg  248 245 248 247 243 246 247 250 246 244
200 mg  246 250 248 246 245 248 252 250 248 250
33. King Manuel I, Komnenus ruled the Byzantine Empire from Constantinople (Istanbul)
during the years 1145 to 1180 A.D. The empire was very powerful during his reign, but
declined significantly afterwards. Coins minted during his era were found in Cyprus, an
island in the eastern Mediterranean Sea. Nine coins were from his first coinage, seven from
the second, four from the third, and seven from a fourth. These spanned most of his reign.
We have data on the silver content of the coins:
Did the silver content of the coins change over the course of Manuel’s reign? Here are the
means and variances of each coinage. The data are unbalanced.
34. Four different diet plans are to be tested for mean weight loss. The entries in the table
are the weight losses for the different plans. The one-way ANOVA results are shown in the
summary table:
Within
(Error) 101.81 53
Total
a. Complete the ANOVA summary table
b. Using 5% level of significance, test the claim that the mean weight loss for the four
different plans are the same.
35. We are interested in testing the mean compression strength of four different box types.
The one-way ANOVA results are shown in the summary table:
Within
(Error) 32.901
Total 23
a. Complete the ANOVA summary table
b. Using 1% level of significance, test the claim that the mean compression strength for
the four different box types is different.
REFERENCES
14.1 Test of Two Variances
Hand, D.J., F. Daly, A.D. Lunn, K.J. McConway, and E. Ostrowski. A Handbook of Small Datasets:
Data for Fruitfly
14.2 ANOVA
Data from a fourth grade classroom in 1994 in a private K – 12 school in San Jose, CA.
Hand, D.J., F. Daly, A.D. Lunn, K.J. McConway, and E. Ostrowski. A Handbook of Small Datasets.
London: Chapman & Hall, 1994, pg. 50.
Hand, D.J., F. Daly, A.D. Lunn, K.J. McConway, and E. Ostrowski. A Handbook of Small Datasets.
London: Chapman & Hall, 1994, pg. 118.
Mackowiak, P. A., Wasserman, S. S., and Levine, M. M. (1992), "A Critical Appraisal of 98.6 Degrees
F, the Upper Limit of the Normal Body Temperature, and Other Legacies of Carl Reinhold August
Wunderlich," Journal of the American Medical Association, 268, 1578-1580.
CHAPTER 14 SOLUTIONS:
3) Using Excel, FL = F.inv(0.025, 12, 15) = 0.3147. Using the chart, switch the degrees of
freedom (dfn = 15, dfd = 12): FL = 1/3.1772 = 0.3147.
7) From the calculator with 2SampFTest, F = .72 and p = .4810. Note: If the subscripts
are switched, then we get F = 1.39, with the same p-value.
9) a. Ho: σ1 ≤ σ2; Ha: σ1 > σ2. b. Use a two-sample F-test (test for two variances).
c. From the calculator, F = 2.87 and p = 0.0291. d. Since p < .05, reject Ho. e.
There is sufficient evidence to conclude that the second student’s scores are less
variable.
Since p > .05, we do not reject Ho. There is not enough evidence to conclude that
there is a difference in the mean weight gains among the three groups of rats.
21) Ho: µ1 = µ2 = µ3 = µ4; Ha: At least one of the means is different from the others. Use
ANOVA F-test; from the calculator, F = 8.69 and p = .0012. Since p < .05, we reject
Ho. There is sufficient evidence to conclude that there is a difference in the mean
number of pages for different magazine types.
23) Ho: µ1 = µ2 = µ3; Ha: At least one of the means is different from the others. Use
ANOVA F-test; from the calculator, F = .64 and p = .5437. Since p > .05, we do not
reject Ho. There is not enough evidence to reject the claim that the mean final exam
scores are the same for all three delivery types.
25) Ho: µ1 = µ2 = µ3; Ha: At least one of the means is different from the others. Use
ANOVA F-test. From the calculator, F = 3.10 and p = 0.082. Since p > .05, we do
not reject Ho. There is not enough evidence to reject the claim that the mean
number of visitors is the same for three types of snow conditions.
27) Ho: σ1² = σ2²; Ha: σ1² ≠ σ2². Use 2SampFTest; from the calculator, we get F = .33
and p = 0.3127. Since p > .05, do not reject Ho. There is not enough evidence to
conclude that there is a difference between the variances for Javier’s and Linda’s rats.
NOTE: Here we used σ1² to represent the variance for Linda’s rats, and σ2² to
represent the variance for Javier’s rats. Had these been entered in the opposite
order, we would have gotten F = 3.0; however, because it is a two-tailed test, the
p-value would be the same.
29) Ho: σ1² = σ2²; Ha: σ1² ≠ σ2². Use 2SampFTest; from the calculator, we get F = 12.69
and p = 0.0305. Since p < .05, we reject Ho. There is sufficient evidence to conclude
that the variance is different for the two types of magazines. Again, had we
switched the subscripts, we would get a different F-statistic, F = 0.079. However,
because it is a two-tailed test, the p-value would remain the same.
31) Ho: σ1² = σ2²; Ha: σ1² ≠ σ2². Use 2SampFTest; from the calculator, we get F = 0.812
and p = 0.7825. Since p > .05, do not reject Ho. There is not enough evidence to reject
the claim that the variances for incomes are the same on the two coasts.
33) Ho: µ1 = µ2 = µ3 = µ4; Ha: At least one of the means is different from the others. Use
ANOVA F-test. From the calculator, F = 26.27 and p = 0.0000001. Since p < .05, we
reject Ho. There is very strong evidence that the mean silver contents of the different
coinages are not all the same.
35) a.
Total 65.039 23
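Two pieces of arithmetic recur in the solutions above. First, a left-tail F critical value (or an F statistic with the subscripts switched) is the reciprocal of its counterpart with the degrees of freedom swapped. Second, a one-way ANOVA summary table can be completed from a few known entries. The Python sketch below illustrates both with the numbers from Solutions 3, 7, and 35. The function names are hypothetical, and the sketch assumes that 65.039 is the total sum of squares and 32.901 the within-group sum of squares for the four box types in Exercise 35.

```python
def reciprocal_f(f_value):
    """Return the F value obtained when numerator and denominator are swapped."""
    return 1.0 / f_value


def complete_anova_table(ss_within, ss_total, df_total, groups):
    """Fill in a one-way ANOVA summary table from its known entries."""
    df_between = groups - 1                 # k - 1
    df_within = df_total - df_between       # n - k
    ss_between = ss_total - ss_within       # SS(total) = SS(between) + SS(within)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "df_between": df_between,
        "df_within": df_within,
        "ss_between": ss_between,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f": ms_between / ms_within,
    }


# Solution 3: left critical value from the right critical value with swapped df.
print(round(reciprocal_f(3.1772), 4))   # 0.3147

# Solution 7: switching the subscripts turns F = 0.72 into its reciprocal.
print(round(reciprocal_f(0.72), 2))     # 1.39

# Exercise 35 (assumed entries: SS within, SS total, total df, four box types).
table = complete_anova_table(ss_within=32.901, ss_total=65.039,
                             df_total=23, groups=4)
for name, value in table.items():
    print(f"{name:>10}: {value:.4f}")
```

The resulting F statistic would then be compared with the critical value for 3 and 20 degrees of freedom at the chosen significance level.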
Standard Normal Distribution
Z 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5 0.504 0.508 0.512 0.516 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.591 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.648 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.67 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.695 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.719 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.758 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.791 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.834 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.877 0.879 0.881 0.883
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.898 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.937 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.975 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.983 0.9834 0.9838 0.9842 0.9846 0.985 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.989
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.992 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.994 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.996 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.997 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.998 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.999 0.999
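The entries in the table above are areas to the left of z under the standard normal curve. As a cross-check, they can be reproduced with the error function from Python's standard library. This is a minimal sketch; the function name `standard_normal_cdf` is an illustrative choice, not something defined in this text.

```python
import math


def standard_normal_cdf(z):
    """Area to the left of z under the standard normal curve."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Row 1.9, column 0.06 of the table corresponds to z = 1.96.
for z in (0.0, 1.0, 1.96, 2.5):
    print(f"z = {z:.2f}  area to the left = {standard_normal_cdf(z):.4f}")
```

For a negative z, symmetry gives the area as 1 minus the table entry for |z|; the function above handles negative inputs directly.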
The critical values of the t distribution are listed by significance level α and degrees
of freedom. Each column gives a one-tailed α together with the equivalent two-tailed α
(for example, 0.05 one-tailed and 0.1 two-tailed share the first column). Each row gives
the degrees of freedom.
α (1 tail) 0.05 0.025 0.01 0.005 0.0025 0.001 0.0005
α (2 tail) 0.1 0.05 0.02 0.01 0.005 0.002 0.001
1 6.3138 12.7062 31.8205 63.6567 127.3213 318.3088 636.6192
2 2.9200 4.3027 6.9646 9.9248 14.0890 22.3271 31.5991
3 2.3534 3.1824 4.5407 5.8409 7.4533 10.2145 12.9240
4 2.1318 2.7764 3.7469 4.6041 5.5976 7.1732 8.6103
5 2.0150 2.5706 3.3649 4.0321 4.7733 5.8934 6.8688
6 1.9432 2.4469 3.1427 3.7074 4.3168 5.2076 5.9588
7 1.8946 2.3646 2.9980 3.4995 4.0293 4.7853 5.4079
8 1.8595 2.3060 2.8965 3.3554 3.8325 4.5008 5.0413
9 1.8331 2.2622 2.8214 3.2498 3.6897 4.2968 4.7809
10 1.8125 2.2281 2.7638 3.1693 3.5814 4.1437 4.5869
11 1.7959 2.2010 2.7181 3.1058 3.4966 4.0247 4.4370
12 1.7823 2.1788 2.6810 3.0545 3.4284 3.9296 4.3178
13 1.7709 2.1604 2.6503 3.0123 3.3725 3.8520 4.2208
14 1.7613 2.1448 2.6245 2.9768 3.3257 3.7874 4.1405
15 1.7531 2.1314 2.6025 2.9467 3.2860 3.7328 4.0728
16 1.7459 2.1199 2.5835 2.9208 3.2520 3.6862 4.0150
17 1.7396 2.1098 2.5669 2.8982 3.2224 3.6458 3.9651
18 1.7341 2.1009 2.5524 2.8784 3.1966 3.6105 3.9216
19 1.7291 2.0930 2.5395 2.8609 3.1737 3.5794 3.8834
20 1.7247 2.0860 2.5280 2.8453 3.1534 3.5518 3.8495
21 1.7207 2.0796 2.5176 2.8314 3.1352 3.5272 3.8193
22 1.7171 2.0739 2.5083 2.8188 3.1188 3.5050 3.7921
23 1.7139 2.0687 2.4999 2.8073 3.1040 3.4850 3.7676
24 1.7109 2.0639 2.4922 2.7969 3.0905 3.4668 3.7454
25 1.7081 2.0595 2.4851 2.7874 3.0782 3.4502 3.7251
26 1.7056 2.0555 2.4786 2.7787 3.0669 3.4350 3.7066
27 1.7033 2.0518 2.4727 2.7707 3.0565 3.4210 3.6896
28 1.7011 2.0484 2.4671 2.7633 3.0469 3.4082 3.6739
29 1.6991 2.0452 2.4620 2.7564 3.0380 3.3962 3.6594
30 1.6973 2.0423 2.4573 2.7500 3.0298 3.3852 3.6460
31 1.6955 2.0395 2.4528 2.7440 3.0221 3.3749 3.6335
32 1.6939 2.0369 2.4487 2.7385 3.0149 3.3653 3.6218
33 1.6924 2.0345 2.4448 2.7333 3.0082 3.3563 3.6109
34 1.6909 2.0322 2.4411 2.7284 3.0020 3.3479 3.6007
35 1.6896 2.0301 2.4377 2.7238 2.9960 3.3400 3.5911
36 1.6883 2.0281 2.4345 2.7195 2.9905 3.3326 3.5821
37 1.6871 2.0262 2.4314 2.7154 2.9852 3.3256 3.5737
38 1.6860 2.0244 2.4286 2.7116 2.9803 3.3190 3.5657
39 1.6849 2.0227 2.4258 2.7079 2.9756 3.3128 3.5581
40 1.6839 2.0211 2.4233 2.7045 2.9712 3.3069 3.5510
41 1.6829 2.0195 2.4208 2.7012 2.9670 3.3013 3.5442
42 1.6820 2.0181 2.4185 2.6981 2.9630 3.2960 3.5377
43 1.6811 2.0167 2.4163 2.6951 2.9592 3.2909 3.5316
44 1.6802 2.0154 2.4141 2.6923 2.9555 3.2861 3.5258
45 1.6794 2.0141 2.4121 2.6896 2.9521 3.2815 3.5203
46 1.6787 2.0129 2.4102 2.6870 2.9488 3.2771 3.5150
47 1.6779 2.0117 2.4083 2.6846 2.9456 3.2729 3.5099
48 1.6772 2.0106 2.4066 2.6822 2.9426 3.2689 3.5051
49 1.6766 2.0096 2.4049 2.6800 2.9397 3.2651 3.5004
50 1.6759 2.0086 2.4033 2.6778 2.9370 3.2614 3.4960
51 1.6753 2.0076 2.4017 2.6757 2.9343 3.2579 3.4918
52 1.6747 2.0066 2.4002 2.6737 2.9318 3.2545 3.4877
53 1.6741 2.0057 2.3988 2.6718 2.9293 3.2513 3.4838
54 1.6736 2.0049 2.3974 2.6700 2.9270 3.2481 3.4800
55 1.6730 2.0040 2.3961 2.6682 2.9247 3.2451 3.4764
56 1.6725 2.0032 2.3948 2.6665 2.9225 3.2423 3.4729
57 1.6720 2.0025 2.3936 2.6649 2.9204 3.2395 3.4696
58 1.6716 2.0017 2.3924 2.6633 2.9184 3.2368 3.4663
59 1.6711 2.0010 2.3912 2.6618 2.9164 3.2342 3.4632
60 1.6706 2.0003 2.3901 2.6603 2.9146 3.2317 3.4602
61 1.6702 1.9996 2.3890 2.6589 2.9127 3.2293 3.4573
62 1.6698 1.9990 2.3880 2.6575 2.9110 3.2270 3.4545
63 1.6694 1.9983 2.3870 2.6561 2.9093 3.2247 3.4518
64 1.6690 1.9977 2.3860 2.6549 2.9076 3.2225 3.4491
65 1.6686 1.9971 2.3851 2.6536 2.9060 3.2204 3.4466
66 1.6683 1.9966 2.3842 2.6524 2.9045 3.2184 3.4441
67 1.6679 1.9960 2.3833 2.6512 2.9030 3.2164 3.4417
68 1.6676 1.9955 2.3824 2.6501 2.9015 3.2145 3.4394
69 1.6672 1.9949 2.3816 2.6490 2.9001 3.2126 3.4372
70 1.6669 1.9944 2.3808 2.6479 2.8987 3.2108 3.4350
71 1.6666 1.9939 2.3800 2.6469 2.8974 3.2090 3.4329
72 1.6663 1.9935 2.3793 2.6459 2.8961 3.2073 3.4308
73 1.6660 1.9930 2.3785 2.6449 2.8949 3.2057 3.4289
74 1.6657 1.9925 2.3778 2.6439 2.8936 3.2041 3.4269
75 1.6654 1.9921 2.3771 2.6430 2.8924 3.2025 3.4250
76 1.6652 1.9917 2.3764 2.6421 2.8913 3.2010 3.4232
77 1.6649 1.9913 2.3758 2.6412 2.8902 3.1995 3.4214
78 1.6646 1.9908 2.3751 2.6403 2.8891 3.1980 3.4197
79 1.6644 1.9905 2.3745 2.6395 2.8880 3.1966 3.4180
80 1.6641 1.9901 2.3739 2.6387 2.8870 3.1953 3.4163
81 1.6639 1.9897 2.3733 2.6379 2.8860 3.1939 3.4147
82 1.6636 1.9893 2.3727 2.6371 2.8850 3.1926 3.4132
83 1.6634 1.9890 2.3721 2.6364 2.8840 3.1913 3.4116
84 1.6632 1.9886 2.3716 2.6356 2.8831 3.1901 3.4102
85 1.6630 1.9883 2.3710 2.6349 2.8822 3.1889 3.4087
86 1.6628 1.9879 2.3705 2.6342 2.8813 3.1877 3.4073
87 1.6626 1.9876 2.3700 2.6335 2.8804 3.1866 3.4059
88 1.6624 1.9873 2.3695 2.6329 2.8795 3.1854 3.4045
89 1.6622 1.9870 2.3690 2.6322 2.8787 3.1843 3.4032
90 1.6620 1.9867 2.3685 2.6316 2.8779 3.1833 3.4019
91 1.6618 1.9864 2.3680 2.6309 2.8771 3.1822 3.4007
92 1.6616 1.9861 2.3676 2.6303 2.8763 3.1812 3.3994
93 1.6614 1.9858 2.3671 2.6297 2.8755 3.1802 3.3982
94 1.6612 1.9855 2.3667 2.6291 2.8748 3.1792 3.3971
95 1.6611 1.9853 2.3662 2.6286 2.8741 3.1782 3.3959
96 1.6609 1.9850 2.3658 2.6280 2.8734 3.1773 3.3948
97 1.6607 1.9847 2.3654 2.6275 2.8727 3.1764 3.3937
98 1.6606 1.9845 2.3650 2.6269 2.8720 3.1755 3.3926
99 1.6604 1.9842 2.3646 2.6264 2.8713 3.1746 3.3915
100 1.6602 1.9840 2.3642 2.6259 2.8707 3.1737 3.3905
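The t critical values tabulated above can be approximated numerically without a statistics package. The sketch below integrates the t density with Simpson's rule and then uses bisection to find the point whose right-tail area equals the one-tailed α. The function names, the number of integration steps, and the search bracket [0, 100] are assumptions made for illustration; the bracket is too narrow for the most extreme entries in the df = 1 row.

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x] (steps is even)."""
    if x == 0:
        return 0.5
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(alpha_one_tail, df):
    """Value t* with right-tail area alpha_one_tail, found by bisection."""
    lo, hi = 0.0, 100.0  # assumed bracket; covers all but extreme df = 1 cases
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1.0 - t_cdf(mid, df) > alpha_one_tail:
            lo = mid  # tail area still too large, so move right
        else:
            hi = mid
    return (lo + hi) / 2


# One-tailed alpha = 0.025 (two-tailed 0.05) with 10 degrees of freedom.
print(round(t_critical(0.025, 10), 4))
```

For a two-tailed test at level α, call `t_critical(α / 2, df)`, which mirrors how the two header rows of the table line up.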
The LEFT critical values of the chi-square distribution are listed by one-tail α (the
area to the left of the critical value) and degrees of freedom.
α (1 tail) 0.05 0.025 0.01 0.005 0.0025 0.001 0.0005
1 0.0039 0.0010 0.0002 0.0000 0.0000 0.0000 0.0000
2 0.1026 0.0506 0.0201 0.0100 0.0050 0.0020 0.0010
3 0.3518 0.2158 0.1148 0.0717 0.0449 0.0243 0.0153
4 0.7107 0.4844 0.2971 0.2070 0.1449 0.0908 0.0639
5 1.1455 0.8312 0.5543 0.4117 0.3075 0.2102 0.1581
6 1.6354 1.2373 0.8721 0.6757 0.5266 0.3811 0.2994
7 2.1673 1.6899 1.2390 0.9893 0.7945 0.5985 0.4849
8 2.7326 2.1797 1.6465 1.3444 1.1043 0.8571 0.7104
9 3.3251 2.7004 2.0879 1.7349 1.4501 1.1519 0.9717
10 3.9403 3.2470 2.5582 2.1559 1.8274 1.4787 1.2650
11 4.5748 3.8157 3.0535 2.6032 2.2321 1.8339 1.5868
12 5.2260 4.4038 3.5706 3.0738 2.6612 2.2142 1.9344
13 5.8919 5.0088 4.1069 3.5650 3.1119 2.6172 2.3051
14 6.5706 5.6287 4.6604 4.0747 3.5820 3.0407 2.6967
15 7.2609 6.2621 5.2293 4.6009 4.0697 3.4827 3.1075
16 7.9616 6.9077 5.8122 5.1422 4.5734 3.9416 3.5358
17 8.6718 7.5642 6.4078 5.6972 5.0917 4.4161 3.9802
18 9.3905 8.2307 7.0149 6.2648 5.6233 4.9048 4.4394
19 10.1170 8.9065 7.6327 6.8440 6.1674 5.4068 4.9123
20 10.8508 9.5908 8.2604 7.4338 6.7228 5.9210 5.3981
21 11.5913 10.2829 8.8972 8.0337 7.2889 6.4467 5.8957
22 12.3380 10.9823 9.5425 8.6427 7.8649 6.9830 6.4045
23 13.0905 11.6886 10.1957 9.2604 8.4502 7.5292 6.9237
24 13.8484 12.4012 10.8564 9.8862 9.0442 8.0849 7.4527
25 14.6114 13.1197 11.5240 10.5197 9.6463 8.6493 7.9910
26 15.3792 13.8439 12.1981 11.1602 10.2562 9.2221 8.5379
27 16.1514 14.5734 12.8785 11.8076 10.8733 9.8028 9.0932
28 16.9279 15.3079 13.5647 12.4613 11.4973 10.3909 9.6563
29 17.7084 16.0471 14.2565 13.1211 12.1279 10.9861 10.2268
30 18.4927 16.7908 14.9535 13.7867 12.7646 11.5880 10.8044
35 22.4650 20.5694 18.5089 17.1918 16.0317 14.6878 13.7875
40 26.5093 24.4330 22.1643 20.7065 19.4171 17.9164 16.9062
45 30.6123 28.3662 25.9013 24.3110 22.8996 21.2507 20.1366
50 34.7643 32.3574 29.7067 27.9907 26.4636 24.6739 23.4610
55 38.9580 36.3981 33.5705 31.7348 30.0971 28.1731 26.8658
60 43.1880 40.4817 37.4849 35.5345 33.7911 31.7383 30.3405
65 47.4496 44.6030 41.4436 39.3831 37.5382 35.3616 33.8767
70 51.7393 48.7576 45.4417 43.2752 41.3323 39.0364 37.4674
75 56.0541 52.9419 49.4750 47.2060 45.1686 42.7573 41.1072
80 60.3915 57.1532 53.5401 51.1719 49.0429 46.5199 44.7910
85 64.7494 61.3888 57.6339 55.1696 52.9517 50.3203 48.5151
90 69.1260 65.6466 61.7541 59.1963 56.8921 54.1552 52.2758
95 73.5198 69.9249 65.8984 63.2496 60.8614 58.0220 56.0702
100 77.9295 74.2219 70.0649 67.3276 64.8574 61.9179 59.8957
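The left-tail chi-square critical values above can also be approximated with the standard library alone. The sketch below evaluates the chi-square cumulative distribution through the series expansion of the regularized lower incomplete gamma function and then uses bisection to find the value with left-tail area α. The function names, the convergence tolerance, and the search bracket are illustrative assumptions.

```python
import math


def chi2_cdf(x, df):
    """P(X <= x) for a chi-square variable, via the incomplete gamma series."""
    if x <= 0:
        return 0.0
    a, y = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= y / (a + n)   # next term of sum_{n} y^n / (a (a+1) ... (a+n))
        total += term
        n += 1
    return total * math.exp(-y + a * math.log(y) - math.lgamma(a))


def chi2_left_critical(alpha, df):
    """Value with area alpha to its left, found by bisection."""
    lo = 0.0
    hi = df + 10.0 * math.sqrt(2.0 * df) + 10.0  # assumed generous upper bound
    for _ in range(80):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < alpha:
            lo = mid  # left area still too small, so move right
        else:
            hi = mid
    return (lo + hi) / 2


# Left-tail alpha = 0.05 with 10 degrees of freedom.
print(round(chi2_left_critical(0.05, 10), 4))
```

A right-tail critical value for area α can be obtained from the same function as `chi2_left_critical(1 - α, df)`.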
The RIGHT critical values of the chi-square distribution are listed by one-tail α (the
area to the right of the critical value) and degrees of freedom.
The critical values of the correlation coefficient r are listed by significance level α
and degrees of freedom. Each column gives a one-tailed α together with the equivalent
two-tailed α (for example, 0.05 one-tailed and 0.1 two-tailed share the first column).
Each row gives the degrees of freedom.
α (1 tail) 0.05 0.025 0.01 0.005 0.0025
α (2 tail) 0.1 0.05 0.02 0.01 0.005
1 0.9877 0.9969 0.9995 0.9999 1.0000
2 0.9000 0.9500 0.9800 0.9900 0.9950
3 0.8054 0.8783 0.9343 0.9587 0.9740
4 0.7293 0.8114 0.8822 0.9172 0.9417
5 0.6694 0.7545 0.8329 0.8745 0.9056
6 0.6215 0.7067 0.7887 0.8343 0.8697
7 0.5822 0.6664 0.7498 0.7977 0.8359
8 0.5494 0.6319 0.7155 0.7646 0.8046
9 0.5214 0.6021 0.6851 0.7348 0.7759
10 0.4973 0.5760 0.6581 0.7079 0.7496
11 0.4762 0.5529 0.6339 0.6835 0.7255
12 0.4575 0.5324 0.6120 0.6614 0.7034
13 0.4409 0.5140 0.5923 0.6411 0.6831
14 0.4259 0.4973 0.5742 0.6226 0.6643
15 0.4124 0.4821 0.5577 0.6055 0.6470
16 0.4000 0.4683 0.5425 0.5897 0.6308
17 0.3887 0.4555 0.5285 0.5751 0.6158
18 0.3783 0.4438 0.5155 0.5614 0.6018
19 0.3687 0.4329 0.5034 0.5487 0.5886
20 0.3598 0.4227 0.4921 0.5368 0.5763
21 0.3515 0.4132 0.4815 0.5256 0.5647
22 0.3438 0.4044 0.4716 0.5151 0.5537
23 0.3365 0.3961 0.4622 0.5052 0.5434
24 0.3297 0.3882 0.4534 0.4958 0.5336
25 0.3233 0.3809 0.4451 0.4869 0.5243
26 0.3172 0.3739 0.4372 0.4785 0.5154
27 0.3115 0.3673 0.4297 0.4705 0.5070
28 0.3061 0.3610 0.4226 0.4629 0.4990
29 0.3009 0.3550 0.4158 0.4556 0.4914
30 0.2960 0.3494 0.4093 0.4487 0.4840
31 0.2913 0.3440 0.4032 0.4421 0.4770
32 0.2869 0.3388 0.3972 0.4357 0.4703
33 0.2826 0.3338 0.3916 0.4296 0.4639
34 0.2785 0.3291 0.3862 0.4238 0.4577
35 0.2746 0.3246 0.3810 0.4182 0.4518
36 0.2709 0.3202 0.3760 0.4128 0.4461
37 0.2673 0.3160 0.3712 0.4076 0.4406
38 0.2638 0.3120 0.3665 0.4026 0.4353
39 0.2605 0.3081 0.3621 0.3978 0.4301
40 0.2573 0.3044 0.3578 0.3932 0.4252
41 0.2542 0.3008 0.3536 0.3887 0.4204
42 0.2512 0.2973 0.3496 0.3843 0.4158
43 0.2483 0.2940 0.3457 0.3801 0.4113
44 0.2455 0.2907 0.3420 0.3761 0.4070
45 0.2429 0.2876 0.3384 0.3721 0.4028
46 0.2403 0.2845 0.3348 0.3683 0.3987
47 0.2377 0.2816 0.3314 0.3646 0.3948
48 0.2353 0.2787 0.3281 0.3610 0.3909
49 0.2329 0.2759 0.3249 0.3575 0.3872
50 0.2306 0.2732 0.3218 0.3542 0.3836
51 0.2284 0.2706 0.3188 0.3509 0.3801
52 0.2262 0.2681 0.3158 0.3477 0.3766
53 0.2241 0.2656 0.3129 0.3445 0.3733
54 0.2221 0.2632 0.3102 0.3415 0.3700
55 0.2201 0.2609 0.3074 0.3385 0.3669
56 0.2181 0.2586 0.3048 0.3357 0.3638
57 0.2162 0.2564 0.3022 0.3328 0.3608
58 0.2144 0.2542 0.2997 0.3301 0.3578
59 0.2126 0.2521 0.2972 0.3274 0.3550
60 0.2108 0.2500 0.2948 0.3248 0.3522
61 0.2091 0.2480 0.2925 0.3223 0.3494
62 0.2075 0.2461 0.2902 0.3198 0.3468
63 0.2058 0.2441 0.2880 0.3173 0.3441
64 0.2042 0.2423 0.2858 0.3150 0.3416
65 0.2027 0.2404 0.2837 0.3126 0.3391
66 0.2012 0.2387 0.2816 0.3104 0.3366
67 0.1997 0.2369 0.2796 0.3081 0.3343
68 0.1982 0.2352 0.2776 0.3060 0.3319
69 0.1968 0.2335 0.2756 0.3038 0.3296
70 0.1954 0.2319 0.2737 0.3017 0.3274
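The critical values of r above appear to be tied to the t critical values through the relation r = t / sqrt(t² + df). The sketch below checks this for a few rows. It assumes that the degrees of freedom in the r table equal n − 2 and that the table was built from the t test for zero correlation; the function name is hypothetical, and the t values are copied from the t table in this appendix.

```python
import math


def r_critical(t_crit, df):
    """Critical correlation coefficient implied by a t critical value."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


# (df, two-tailed alpha, t critical value from the t table)
rows = [
    (1, 0.05, 12.7062),
    (10, 0.05, 2.2281),
    (35, 0.10, 1.6896),
]
for df, alpha, t_crit in rows:
    print(f"df = {df:>2}, alpha (2 tail) = {alpha}: r = {r_critical(t_crit, df):.4f}")
```

Under these assumptions, a sample correlation whose absolute value exceeds the tabulated r is significant at the corresponding α.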
Back Cover