Introduction To Psychological Statistics (Foster Et Al.)
PSYCHOLOGICAL
STATISTICS
Foster et al.
University of Missouri-St. Louis, Rice
University, & University of Houston,
Downtown Campus
An Introduction to Psychological Statistics
Foster et al.
This text is disseminated via the Open Education Resource (OER) LibreTexts Project (https://LibreTexts.org) and like the hundreds
of other texts available within this powerful platform, it is freely available for reading, printing and "consuming." Most, but not all,
pages in the library have licenses that may allow individuals to make changes, save, and print this book. Carefully
consult the applicable license(s) before pursuing such effects.
Instructors can adopt existing LibreTexts texts or Remix them to quickly build course-specific resources to meet the needs of their
students. Unlike traditional textbooks, LibreTexts’ web based origins allow powerful integration of advanced features and new
technologies to support learning.
The LibreTexts mission is to unite students, faculty and scholars in a cooperative effort to develop an easy-to-use online platform
for the construction, customization, and dissemination of OER content to reduce the burdens of unreasonable textbook costs to our
students and society. The LibreTexts project is a multi-institutional collaborative venture to develop the next generation of open-
access texts to improve postsecondary education at all levels of higher learning by developing an Open Access Resource
environment. The project currently consists of 14 independently operating and interconnected libraries that are constantly being
optimized by students, faculty, and outside experts to supplant conventional paper-based books. These free textbook alternatives are
organized within a central environment that is both vertically (from advanced to basic level) and horizontally (across different fields)
integrated.
The LibreTexts libraries are Powered by MindTouch® and are supported by the Department of Education Open Textbook Pilot
Project, the UC Davis Office of the Provost, the UC Davis Library, the California State University Affordable Learning Solutions
Program, and Merlot. This material is based upon work supported by the National Science Foundation under Grant No. 1246120,
1525057, and 1413739. Unless otherwise noted, LibreTexts content is licensed by CC BY-NC-SA 3.0.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not
necessarily reflect the views of the National Science Foundation nor the US Department of Education.
Have questions or comments? For information about adoptions or adaptations contact [email protected]. More information on our
activities can be found via Facebook (https://facebook.com/Libretexts), Twitter (https://twitter.com/libretexts), or our blog
(http://Blog.Libretexts.org).
5: PROBABILITY
Probability can seem like a daunting topic for many students. In a mathematical statistics course this might be true, as the meaning and
purpose of probability get obscured and overwhelmed by equations and theory. In this chapter we will focus only on the principles and ideas
necessary to lay the groundwork for future inferential statistics. We accomplish this by quickly tying the concepts of probability to what we
already know about normal distributions and z-scores.
5.3: THE BIGGER PICTURE
5.E: PROBABILITY (EXERCISES)
6: SAMPLING DISTRIBUTIONS
We have come to the final chapter in this unit. We will now take the logic, ideas, and techniques we have developed and put them together to
see how we can take a sample of data and use it to make inferences about what's truly happening in the broader population. This is the final
piece of the puzzle that we need to understand in order to have the groundwork necessary for formal hypothesis testing. Though some of the
concepts in this chapter seem strange, they are all simple extensions of what we have already learned.
8: INTRODUCTION TO T-TESTS
Last chapter we made a big leap from basic descriptive statistics into full hypothesis testing and inferential statistics. For the rest of the unit,
we will be learning new tests, each of which is just a small adjustment on the test before it. In this chapter, we will learn about the first of
three t-tests, and we will learn a new method of testing the null hypothesis: confidence intervals.
9: REPEATED MEASURES
So far, we have dealt with data measured on a single variable at a single point in time, allowing us to gain an understanding of the logic and
process behind statistics and hypothesis testing. Now, we will look at a slightly different type of data that has new information we couldn’t
get at before: change. Specifically, we will look at how the value of a variable, within people, changes across two time points. This is a very
powerful thing to do, and, as we will see shortly, it involves only a very small adjustment to the techniques we already know.
10.1: DIFFERENCE OF MEANS
10.2: RESEARCH QUESTIONS ABOUT INDEPENDENT MEANS
10.3: HYPOTHESES AND DECISION CRITERIA
10.4: INDEPENDENT SAMPLES T-STATISTIC
10.5: STANDARD ERROR AND POOLED VARIANCE
10.6: MOVIES AND MOOD
10.7: EFFECT SIZES AND CONFIDENCE INTERVALS
10.8: HOMOGENEITY OF VARIANCE
10.E: INDEPENDENT SAMPLES (EXERCISES)
12: CORRELATIONS
All of our analyses thus far have focused on comparing the value of a continuous variable across different groups via mean differences. We
will now turn away from means and look instead at how to assess the relation between two continuous variables in the form of correlations.
As we will see, the logic behind correlations is the same as it was for group means, but we will now have the ability to assess an entirely new
data structure.
14: CHI-SQUARE
We come at last to our final topic: chi-square (χ²). This test is a special form of analysis called a non-parametric test, so the structure of it
will look a little bit different from what we have done so far. However, the logic of hypothesis testing remains unchanged. The purpose of
chi-square is to understand the frequency distribution of a single categorical variable or find a relation between two categorical variables,
which is a frequently very useful way to look at our data.
BACK MATTER
INDEX
GLOSSARY
About this Book
We are constantly bombarded by information, and finding a way to filter that information in an objective way is crucial to
surviving this onslaught with your sanity intact. This is what statistics, and the logic we use in it, enable us to do. Through the
lens of statistics, we learn to find the signal hidden in the noise when it is there and to know when an apparent trend or
pattern is really just randomness. The study of statistics involves math and relies upon calculations of numbers. But it also
relies heavily on how the numbers are chosen and how the statistics are interpreted.
This work was created as part of the University of Missouri’s Affordable and Open Access Educational Resources
Initiative (https://www.umsystem.edu/ums/aa/oer). The contents of this work have been adapted from the following Open
Access Resources: Online Statistics Education: A Multimedia Course of Study (http://onlinestatbook.com/). Project
Leader: David M. Lane, Rice University. Changes to the original works were made by Dr. Garett C. Foster in the
Department of Psychological Sciences to tailor the text to fit the needs of the introductory statistics course for psychology
majors at the University of Missouri – St. Louis. Materials from the original sources have been combined, reorganized, and
added to by the current author, and any conceptual, mathematical, or typographical errors are the responsibility of the
current author.
Garett C. Foster, University of Missouri-St. Louis
David Lane, Rice University
David Scott, Rice University
Mikki Hebl, Rice University
Rudy Guerra, Rice University
Dan Osherson, Rice University
Heidi Zimmer, University of Houston, Downtown Campus
Recommended Citation
Foster, Garett C.; Lane, David; Scott, David; Hebl, Mikki; Guerra, Rudy; Osherson, Dan; and Zimmer, Heidi, "An
Introduction to Psychological Statistics" (2018). Open Educational Resources Collection. 4.
https://irl.umsl.edu/oer/4
CHAPTER OVERVIEW
1: INTRODUCTION
This chapter provides an overview of statistics as a field of study and presents terminology that will be used throughout the course.
1.1: What are statistics?
Statistics include numerical facts and figures. For instance:
The largest earthquake measured 9.2 on the Richter scale.
Men are at least 10 times more likely than women to commit murder.
One in every 8 South Africans is HIV positive.
By the year 2020, there will be 15 people aged 65 and over for every new baby born.
The study of statistics involves math and relies upon calculations of numbers. But it also relies heavily on how the
numbers are chosen and how the statistics are interpreted. For example, consider the following three scenarios and the
interpretations based upon the presented statistics. You will find that the numbers may be right, but the interpretation may
be wrong. Try to identify a major flaw with each interpretation before we describe it.
1. A new advertisement for Ben and Jerry's ice cream introduced in late May of last year resulted in a 30% increase in ice
cream sales for the following three months. Thus, the advertisement was effective. A major flaw is that ice cream
consumption generally increases in the months of June, July, and August regardless of advertisements. This effect is
called a history effect and leads people to interpret outcomes as the result of one variable when another variable (in this
case, one having to do with the passage of time) is actually responsible.
2. The more churches in a city, the more crime there is. Thus, churches lead to crime. A major flaw is that both increased
churches and increased crime rates can be explained by larger populations. In bigger cities, there are both more
churches and more crime. This problem, which we will discuss in more detail in Chapter 6, is known as the third-variable
problem. Namely, a third variable can cause both situations; however, people erroneously believe that there is a causal
relationship between the two primary variables rather than recognize that a third variable can cause both.
3. 75% more interracial marriages are occurring this year than 25 years ago. Thus, our society accepts interracial
marriages. A major flaw is that we don't have the information that we need. What is the rate at which marriages are
occurring? Suppose only 1% of marriages 25 years ago were interracial and so now 1.75% of marriages are interracial
(1.75 is 75% higher than 1). But this latter number is hardly evidence suggesting the acceptability of interracial
marriages. In addition, the statistic provided does not rule out the possibility that the number of interracial marriages
has seen dramatic fluctuations over the years and this year is not the highest. Again, there is simply not enough
information to understand fully the impact of the statistics.
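The arithmetic behind that third example is worth checking directly. The sketch below uses the example's made-up figures (1% of marriages then, a "75% more" relative increase now) to show how a large relative increase can still leave a small absolute rate:

```python
# Hypothetical figures from the interracial-marriage example:
# 1% of marriages 25 years ago, and "75% more" of them this year.
old_rate = 1.0           # percent of all marriages, 25 years ago
relative_increase = 0.75 # "75% more" is a relative, not absolute, change

new_rate = old_rate * (1 + relative_increase)
print(new_rate)  # 1.75 -- still under 2% of all marriages
```

The point of the sketch: a 75% relative increase on a tiny base yields another tiny number, which is why the raw percentage increase alone tells us little about societal acceptance.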
As a whole, these examples show that statistics are not only facts and figures; they are something more than that. In the
broadest sense, “statistics” refers to a range of techniques and procedures for analyzing, interpreting, displaying, and
making decisions based on data.
Statistics is the language of science and data. The ability to understand and communicate using statistics enables
researchers from different labs, different languages, and different fields to articulate to one another exactly what they have
found in their work. It is an objective, precise, and powerful tool in science and in everyday life.
Beyond its use in science, however, there is a more personal reason to study statistics. Like most people, you probably feel
that it is important to “take control of your life.” But what does this mean? Partly, it means being able to properly evaluate
the data and claims that bombard you every day. If you cannot distinguish good from faulty reasoning, then you are
vulnerable to manipulation and to decisions that are not in your best interest. Statistics provides tools that you need in
order to react intelligently to information you hear or read. In this sense, statistics is one of the most important things that
you can study.
To be more specific, here are some claims that we have heard on several occasions. (We are not saying that each one of
these claims is true!)
4 out of 5 dentists recommend Dentine.
Almost 85% of lung cancers in men and 45% in women are tobacco-related.
Condoms are effective 94% of the time.
People tend to be more persuasive when they look others directly in the eye and speak loudly and quickly.
Women make 75 cents to every dollar a man makes when they work the same job.
A surprising new study shows that eating egg whites can increase one's life span.
People predict that it is very unlikely there will ever be another baseball player with a batting average over .400.
There is an 80% chance that in a room full of 30 people that at least two people will share the same birthday.
79.48% of all statistics are made up on the spot.
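One claim in the list above can actually be checked with a short computation. Assuming 365 equally likely birthdays (no leap years, no twins), the exact probability that at least two of 30 people share a birthday works out to about 71% — close to, but a bit lower than, the 80% claimed:

```python
# Exact birthday-problem probability for n people, assuming 365
# equally likely birthdays (ignores leap years and twins).
def p_shared_birthday(n):
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (365 - k) / 365
    return 1 - p_all_distinct

print(round(p_shared_birthday(30), 3))  # -> 0.706
```

This is a good illustration of the chapter's point: the claim is statistical in character, and only a careful calculation tells us how accurate it is.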
All of these claims are statistical in character. We suspect that some of them sound familiar; if not, we bet that you have
heard other claims like them. Notice how diverse the examples are. They come from psychology, health, law, sports,
business, etc. Indeed, data and data interpretation show up in discourse from virtually every facet of contemporary life.
Statistics are often presented in an effort to add credibility to an argument or advice. You can see this by paying attention
to television advertisements. Many of the numbers thrown about in this way do not represent careful statistical analysis.
They can be misleading and push you into decisions that you might find cause to regret. For these reasons, learning about
statistics is a long step towards taking control of your life. (It is not, of course, the only step needed for this purpose.) The
purpose of this course, beyond preparing you for a career in psychology, is to help you learn statistical essentials. It will
make you into an intelligent consumer of statistical claims.
You can take the first step right away. To be an intelligent consumer of statistics, your first reflex must be to question the
statistics that you encounter. The British Prime Minister Benjamin Disraeli is quoted by Mark Twain as having said,
“There are three kinds of lies -- lies, damned lies, and statistics.” This quote reminds us why it is so important to
understand statistics. So let us invite you to reform your statistical habits from now on. No longer will you blindly accept
numbers or findings. Instead, you will begin to think about the numbers, their sources, and most importantly, the
procedures used to generate them.
The above section puts an emphasis on defending ourselves against fraudulent claims wrapped up as statistics, but let us
look at a more positive note. Just as important as detecting the deceptive use of statistics is the appreciation of the proper
use of statistics. You must also learn to recognize statistical evidence that supports a stated conclusion. Statistics are all
around you, sometimes used well, sometimes not. We must learn how to distinguish the two cases. In doing so, statistics
will likely be the course you use most in your day to day life, even if you do not ever run a formal analysis again.
Types of Variables
When conducting research, experimenters often manipulate variables. For example, an experimenter might compare the
effectiveness of four types of antidepressants. In this case, the variable is “type of antidepressant.” When a variable is manipulated
by an experimenter, it is called an independent variable. The experiment seeks to determine the effect of the independent variable
on relief from depression. In this example, relief from depression is called a dependent variable. In general, the independent
variable is manipulated by the experimenter and its effects on the dependent variable are measured.
Example 1.3.1
Can blueberries slow down aging? A study indicates that antioxidants found in blueberries may slow down the
process of aging. In this study, 19-month-old rats (equivalent to 60-year-old humans) were fed either their standard
diet or a diet supplemented by either blueberry, strawberry, or spinach powder. After eight weeks, the rats were
given memory and motor skills tests. Although all supplemented rats showed improvement, those supplemented
with blueberry powder showed the most notable improvement.
a. What is the independent variable? (dietary supplement: none, blueberry, strawberry, and spinach)
b. What are the dependent variables? (memory test and motor skills test)
Example 1.3.2
Does beta-carotene protect against cancer? Beta-carotene supplements have been thought to protect against
cancer. However, a study published in the Journal of the National Cancer Institute suggests this is false. The study
was conducted with 39,000 women aged 45 and up. These women were randomly assigned to receive a beta-
carotene supplement or a placebo, and their health was studied over their lifetime. Cancer rates for women taking
the beta-carotene supplement did not differ systematically from the cancer rates of those women taking the placebo.
a. What is the independent variable? (supplements: beta-carotene or placebo)
b. What is the dependent variable? (occurrence of cancer)
Example 1.3.3
How bright is right? An automobile manufacturer wants to know how bright brake lights should be in order to
minimize the time required for the driver of a following car to realize that the car in front is stopping and to hit the
brakes.
a. What is the independent variable? (brightness of brake lights)
b. What is the dependent variable? (time to hit brakes)
Levels of Measurement
Before we can conduct a statistical analysis, we need to measure our dependent variable. Exactly how the measurement is carried
out depends on the type of variable involved in the analysis. Different types are measured differently. To measure the time taken to
respond to a stimulus, you might use a stop watch. Stop watches are of no use, of course, when it comes to measuring someone's
attitude towards a political candidate. A rating scale is more appropriate in this case (with labels like “very favorable,” “somewhat
favorable,” etc.). For a dependent variable such as “favorite color,” you can simply note the color-word (like “red”) that the subject
offers.
Although procedures for measurement differ in many ways, they can be classified using a few fundamental categories. In a given
category, all of the procedures share some properties that are important for you to know about. The categories are called “scale
types,” or just “scales,” and are described in this section.
Nominal scales
When measuring using a nominal scale, one simply names or categorizes responses. Gender, handedness, favorite color, and
religion are examples of variables measured on a nominal scale. The essential point about nominal scales is that they do not imply
any ordering among the responses. For example, when classifying people according to their favorite color, there is no sense in
which green is placed “ahead of” blue. Responses are merely categorized. Nominal scales embody the lowest level of
measurement.
Ordinal scales
A researcher wishing to measure consumers' satisfaction with their microwave ovens might ask them to specify their feelings as
either “very dissatisfied,” “somewhat dissatisfied,” “somewhat satisfied,” or “very satisfied.” The items in this scale are ordered,
ranging from least to most satisfied. This is what distinguishes ordinal from nominal scales. Unlike nominal scales, ordinal scales
allow comparisons of the degree to which two subjects possess the dependent variable. For example, our satisfaction ordering
makes it meaningful to assert that one person is more satisfied than another with their microwave ovens. Such an assertion reflects
the first person's use of a verbal label that comes later in the list than the label chosen by the second person.
On the other hand, ordinal scales fail to capture important information that will be present in the other scales we examine. In
particular, the difference between two levels of an ordinal scale cannot be assumed to be the same as the difference between two
other levels. In our satisfaction scale, for example, the difference between the responses “very dissatisfied” and “somewhat
dissatisfied” is probably not equivalent to the difference between “somewhat dissatisfied” and “somewhat satisfied.” Nothing in
our measurement procedure allows us to determine whether the two differences reflect the same difference in psychological
satisfaction. Statisticians express this point by saying that the differences between adjacent scale values do not necessarily represent
equal intervals on the underlying scale giving rise to the measurements. (In our case, the underlying scale is the true feeling of
satisfaction, which we are trying to measure.)
Interval scales
Interval scales are numerical scales in which intervals have the same interpretation throughout. As an example, consider the
Fahrenheit scale of temperature. The difference between 30 degrees and 40 degrees represents the same temperature difference as
the difference between 80 degrees and 90 degrees. This is because each 10-degree interval has the same physical meaning (in terms
of the kinetic energy of molecules).
Interval scales are not perfect, however. In particular, they do not have a true zero point even if one of the scaled values happens to
carry the name “zero.” The Fahrenheit scale illustrates the issue. Zero degrees Fahrenheit does not represent the complete absence
of temperature (the absence of any molecular kinetic energy). In reality, the label “zero” is applied to its temperature for quite
accidental reasons connected to the history of temperature measurement. Since an interval scale has no true zero point, it does not
make sense to compute ratios of temperatures. For example, there is no sense in which the ratio of 40 to 20 degrees Fahrenheit is
the same as the ratio of 100 to 50 degrees; no interesting physical property is preserved across the two ratios. After all, if the “zero”
label were applied at the temperature that Fahrenheit happens to label as 10 degrees, the two ratios would instead be 30 to 10 and
90 to 40, no longer the same! For this reason, it does not make sense to say that 80 degrees is “twice as hot” as 40 degrees. Such a
claim would depend on an arbitrary decision about where to “start” the temperature scale, namely, what temperature to call zero
(whereas the claim is intended to make a more fundamental assertion about the underlying physical reality).
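The zero-shift argument in this paragraph is easy to verify numerically. The sketch below uses the chapter's own Fahrenheit numbers to show that relabeling the zero point changes ratios but leaves differences intact:

```python
# Shifting the zero point of an interval scale destroys ratios
# but preserves differences -- the Fahrenheit argument above.
high, low = 40, 20
shift = 10  # relabel "zero" at what Fahrenheit calls 10 degrees

ratio_before = high / low                      # 40 / 20 = 2.0
ratio_after = (high - shift) / (low - shift)   # 30 / 10 = 3.0
diff_before = high - low                       # 20
diff_after = (high - shift) - (low - shift)    # still 20

print(ratio_before, ratio_after)  # 2.0 3.0 -- ratios not preserved
print(diff_before, diff_after)    # 20 20   -- differences preserved
```

This is exactly why "twice as hot" is meaningless on an interval scale: the ratio depends on an arbitrary choice of zero, while the difference does not.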
Ratio scales
The ratio scale of measurement is the most informative scale. It is an interval scale with the additional property that its zero
position indicates the absence of the quantity being measured. You can think of a ratio scale as the three earlier scales rolled up in
one. Like a nominal scale, it provides a name or category for each object (the numbers serve as labels). Like an ordinal scale, the
objects are ordered (in terms of the ordering of the numbers). Like an interval scale, the same difference at two places on the scale
has the same meaning. And in addition, the same ratio at two places on the scale also carries the same meaning.
The Fahrenheit scale for temperature has an arbitrary zero point and is therefore not a ratio scale. However, zero on the Kelvin
scale is absolute zero. This makes the Kelvin scale a ratio scale. For example, if one temperature is twice as high as another as
measured on the Kelvin scale, then it has twice the kinetic energy of the other temperature.
Another example of a ratio scale is the amount of money you have in your pocket right now (25 cents, 55 cents, etc.). Money is
measured on a ratio scale because, in addition to having the properties of an interval scale, it has a true zero point: if you have zero
money, this implies the absence of money. Since money has a true zero point, it makes sense to say that someone with 50 cents has
twice as much money as someone with 25 cents (or that Bill Gates has a million times more money than you do).
Table 1.3.1 : Hypothetical memory test data (1 = item remembered; items run from easiest on the left to hardest on the right)
Subject Item responses Total
A 0 0 1 1 0 0 0 0 0 0 2
B 1 0 1 1 0 0 0 0 0 0 3
C 1 1 1 1 1 1 1 0 0 0 7
D 1 1 1 1 1 0 1 1 0 1 8
Let's compare (i) the difference between Subject A's score of 2 and Subject B's score of 3 and (ii) the difference between Subject
C's score of 7 and Subject D's score of 8. The former difference is a difference of one easy item; the latter difference is a difference
of one difficult item. Do these two differences necessarily signify the same difference in memory? We are inclined to respond “No”
to this question since only a little more memory may be needed to retain the additional easy item whereas a lot more memory may
be needed to retain the additional hard item. The general point is that it is often inappropriate to consider psychological
measurement scales as either interval or ratio.
Consequences of level of measurement
Why are we so interested in the type of scale that measures a dependent variable? The crux
of the matter is the relationship between the variable's level of measurement and the statistics that can be meaningfully computed
with that variable. For example, consider a hypothetical study in which 5 children are asked to choose their favorite color from
blue, red, yellow, green, and purple. The researcher codes the results as follows:
Table 1.3.2 : Favorite color data code
Color Code
Blue 1
Red 2
Yellow 3
Green 4
Purple 5
This means that if a child said her favorite color was “Red,” then the choice was coded as “2,” if the child said her favorite color
was “Purple,” then the response was coded as “5,” and so forth. Consider the following hypothetical data:
Table 1.3.3 : Favorite color data
Subject Color Code
1 Blue 1
2 Blue 1
3 Green 4
4 Green 4
5 Purple 5
Each code is a number, so nothing prevents us from computing the average code assigned to the children. The average happens to
be 3, but you can see that it would be senseless to conclude that the average favorite color is yellow (the color with a code of 3).
Such nonsense arises because favorite color is a nominal scale, and taking the average of its numerical labels is like counting the
number of letters in the name of a snake to see how long the beast is.
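The snake-measuring problem can be made concrete with a quick sketch using the hypothetical codes from Tables 1.3.2 and 1.3.3. The mean of the codes is computable but meaningless; the mode (most common category) is the summary that actually respects a nominal scale:

```python
from statistics import mean, multimode

# Hypothetical data from Table 1.3.3, coded as in Table 1.3.2
# (Blue=1, Red=2, Yellow=3, Green=4, Purple=5).
codes = {"Blue": 1, "Red": 2, "Yellow": 3, "Green": 4, "Purple": 5}
responses = ["Blue", "Blue", "Green", "Green", "Purple"]

# Nothing stops us from averaging the numeric labels...
avg_code = mean(codes[c] for c in responses)
print(avg_code)  # 3 -- but "the average favorite color is Yellow" is nonsense

# ...whereas the mode is a meaningful summary of a nominal variable.
print(multimode(responses))  # ['Blue', 'Green'] -- a tie between two modes
```

The design point: the numbers attached to nominal categories are labels, so any statistic that depends on their magnitudes (means, differences, ratios) inherits the arbitrariness of the labeling.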
Does it make sense to compute the mean of numbers measured on an ordinal scale? This is a difficult question, one that statisticians
have debated for decades. The prevailing (but by no means unanimous) opinion of statisticians is that for almost all practical situations, the mean of an ordinally-measured variable is a meaningful statistic.
Example 1.4.1
You have been hired by the National Election Commission to examine how the American people feel about the
fairness of the voting procedures in the U.S. Who will you ask?
Solution
It is not practical to ask every single American how he or she feels about the fairness of the voting procedures. Instead,
we query a relatively small number of Americans, and draw inferences about the entire country from their responses.
The Americans actually queried constitute our sample of the larger population of all Americans.
A sample is typically a small subset of the population. In the case of voting attitudes, we would sample a few thousand
Americans drawn from the hundreds of millions that make up the country. In choosing a sample, it is therefore crucial
that it not over-represent one kind of citizen at the expense of others. For example, something would be wrong with
our sample if it happened to be made up entirely of Florida residents. If the sample held only Floridians, it could not be
used to infer the attitudes of other Americans. The same problem would arise if the sample were comprised only of
Republicans. Inferences from statistics are based on the assumption that sampling is representative of the population. If
the sample is not representative, then the possibility of sampling bias occurs. Sampling bias means that our
conclusions apply only to our sample and are not generalizable to the full population.
Example 1.4.2
We are interested in examining how many math classes have been taken on average by current graduating seniors at
American colleges and universities during their four years in school.
Solution
Whereas our population in the last example included all US citizens, now it involves just the graduating seniors
throughout the country. This is still a large set since there are thousands of colleges and universities, each enrolling
many students. (New York University, for example, enrolls 48,000 students.) It would be prohibitively costly to
examine the transcript of every college senior. We therefore take a sample of college seniors and then make inferences
to the entire population based on what we find. To make the sample, we might first choose some public and private
colleges and universities across the United States. Then we might sample 50 students from each of these institutions.
Suppose that the average number of math classes taken by the people in our sample were 3.2. Then we might speculate
that 3.2 approximates the number we would find if we had the resources to examine every senior in the entire
population. But we must be careful about the possibility that our sample is non-representative of the population.
Perhaps we chose an overabundance of math majors, or chose too many technical institutions that have heavy math
requirements. Such bad sampling makes our sample unrepresentative of the population of all seniors.
To solidify your understanding of sampling bias, consider the following example. Try to identify the population and the
sample, and then reflect on whether the sample is likely to yield the information desired.
Example 1.4.3
A substitute teacher wants to know how students in the class did on their last test. The teacher asks the 10 students
sitting in the front row to state their latest test score. He concludes from their report that the class did extremely well.
What is the sample? What is the population? Can you identify any problems with choosing the sample in the way that
the teacher did?
Solution
The population consists of all students in the class. The sample is made up of just the 10 students sitting in the front
row. The sample is not likely to be representative of the population. Those who sit in the front row tend to be more
interested in the class and tend to perform higher on tests. Hence, the sample may perform at a higher level than the
population.
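The front-row problem can be illustrated with a small simulation. All of the numbers below (class size, score distributions) are invented for illustration, not data from the example; the only point is that a convenience sample drawn from the high-scoring seats overestimates the class average, while a random sample of the same size does not do so systematically:

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical class of 40 test scores: the 10 front-row students
# skew higher; the other 30 students are more mixed.
front_row = [random.gauss(88, 5) for _ in range(10)]
rest = [random.gauss(75, 10) for _ in range(30)]
population = front_row + rest

biased_mean = sum(front_row) / len(front_row)      # front-row sample only
true_mean = sum(population) / len(population)      # whole class
random_sample = random.sample(population, 10)      # same size, drawn at random
random_mean = sum(random_sample) / len(random_sample)

print(round(biased_mean, 1), round(true_mean, 1), round(random_mean, 1))
# The biased (front-row) mean sits well above the true class mean;
# the random-sample mean tends to land much nearer the truth.
```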
Example 1.4.4
A coach is interested in how many cartwheels the average college freshmen at his university can do. Eight volunteers
from the freshman class step forward. After observing their performance, the coach concludes that college freshmen
can do an average of 16 cartwheels in a row without stopping.
Solution
The population is the class of all freshmen at the coach's university. The sample is composed of the 8 volunteers. The
sample is poorly chosen because volunteers are more likely to be able to do cartwheels than the average freshman;
people who can't do cartwheels probably did not volunteer! In the example, we are also not told of the gender of the
volunteers. Were they all women, for example? That might affect the outcome, contributing to the non-representative
nature of the sample (if the school is co-ed).
Example 1.4.5
A research scientist is interested in studying the experiences of twins raised together versus those raised apart. She
obtains a list of twins from the National Twin Registry, and selects two subsets of individuals for her study. First, she
chooses all those in the registry whose last name begins with Z. Then she turns to all those whose last name begins
with B. Because there are so many names that start with B, however, our researcher decides to incorporate only every
other name into her sample. Finally, she mails out a survey and compares characteristics of twins raised apart versus
together.
Solution
The population consists of all twins recorded in the National Twin Registry. It is important that the researcher only
make statistical generalizations to the twins on this list, not to all twins in the nation or world. That is, the National
Twin Registry may not be representative of all twins. Even if inferences are limited to the Registry, a number of
problems affect the sampling procedure we described. For instance, choosing only twins whose last names begin with
Z does not give every individual an equal chance of being selected into the sample. Moreover, such a procedure risks
over-representing ethnic groups with many surnames that begin with Z. There are other reasons why choosing just the
Z's may bias the sample. Perhaps such people are more patient than average because they often find themselves at the
end of the line! The same problem occurs with choosing twins whose last name begins with B.
Stratified Sampling
Since simple random sampling often does not ensure a representative sample, a sampling method called stratified random
sampling is sometimes used to make the sample more representative of the population. This method can be used if the
population has a number of distinct “strata” or groups. In stratified sampling, you first identify members of your population
who belong to each group. Then you randomly sample from each of those subgroups in such a way that the sizes of the
subgroups in the sample are proportional to their sizes in the population.
Let's take an example: Suppose you were interested in views of capital punishment at an urban university. You have the
time and resources to interview 200 students. The student body is diverse with respect to age; many older people work
during the day and enroll in night courses (average age is 39), while younger students generally enroll in day classes
(average age of 19). It is possible that night students have different views about capital punishment than day students. If
70% of the students were day students, it makes sense to ensure that 70% of the sample consisted of day students. Thus,
your sample of 200 students would consist of 140 day students and 60 night students. The proportion of day students in the
sample and in the population (the entire university) would be the same. Inferences to the entire population of students at
the university would therefore be more secure.
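Proportional allocation can be sketched in a few lines of Python. The roster below is hypothetical (numeric IDs standing in for students); only the 70/30 day/night split comes from the example above.

```python
import random

def stratified_sample(strata, n_total, seed=0):
    """Draw a stratified random sample: each stratum contributes
    members in proportion to its share of the population."""
    rng = random.Random(seed)
    pop_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional allocation: stratum share of the population
        n_stratum = round(n_total * len(members) / pop_size)
        sample[name] = rng.sample(members, n_stratum)
    return sample

# Hypothetical roster: 700 day students and 300 night students
strata = {"day": list(range(700)), "night": list(range(700, 1000))}
sample = stratified_sample(strata, n_total=200)
print(len(sample["day"]), len(sample["night"]))  # 140 60
```

Because 70% of the hypothetical roster are day students, 70% of the 200 sampled students (140) come from that stratum.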
Convenience Sampling
Not all sampling methods are perfect, and sometimes that’s okay. For example, if we are beginning research into a
completely unstudied area, we may sometimes take some shortcuts to quickly gather data and get a general idea of how
things work before fully investing a lot of time and money into well-designed research projects with proper sampling. This
is known as convenience sampling, named for its ease of use. In limited cases, such as the one just described, convenience
sampling is okay because we intend to follow up with a representative sample. Unfortunately, sometimes convenience
sampling is used due only to its convenience without the intent of improving on it in future work.
Experimental Designs
If we want to know if a change in one variable causes a change in another variable, we must use a true experiment. An
experiment is defined by the use of random assignment to treatment conditions and manipulation of the independent
variable. To understand what this means, let’s look at an example:
A clinical researcher wants to know if a newly developed drug is effective in treating the flu. Working with collaborators at
several local hospitals, she randomly samples 40 flu patients and randomly assigns each one to one of two conditions:
Group A receives the new drug and Group B receives a placebo. She measures the symptoms of all participants after 1
week to see if there is a difference in symptoms between the groups.
In the example, the independent variable is the drug treatment; we manipulate it into 2 levels: new drug or placebo.
Without the researcher administering the drug (i.e. manipulating the independent variable), there would be no difference
between the groups. Each person, after being randomly sampled to be in the research, was then randomly assigned to one
of the 2 groups. That is, random sampling and random assignment are not the same thing and cannot be used
interchangeably. For research to be a true experiment, random assignment must be used. For research to be representative
of the population, random sampling must be used. The use of both techniques helps ensure that there are no systematic
differences between the groups, thus eliminating the potential for sampling bias.
The dependent variable in the example is flu symptoms. Barring any other intervention, we would assume that people in
both groups, on average, get better at roughly the same rate. Because there are no systematic differences between the 2
groups, if the researcher does find a difference in symptoms, she can confidently attribute it to the effectiveness of the new
drug.
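The distinction between random sampling and random assignment can be made concrete with a short sketch. The patient labels below are placeholders; the code shows one common way to assign an already-drawn sample evenly to two conditions.

```python
import random

def randomly_assign(participants, groups=("drug", "placebo"), seed=42):
    """Shuffle the sampled participants, then deal them alternately
    into conditions so that group sizes stay equal."""
    rng = random.Random(seed)
    shuffled = participants[:]       # copy so the original list is untouched
    rng.shuffle(shuffled)
    assignment = {g: [] for g in groups}
    for i, person in enumerate(shuffled):
        assignment[groups[i % len(groups)]].append(person)
    return assignment

# The 40 flu patients from the example (labels are placeholders)
patients = [f"patient_{i}" for i in range(40)]
assignment = randomly_assign(patients)
print(len(assignment["drug"]), len(assignment["placebo"]))  # 20 20
```

Random sampling would happen one step earlier, when the 40 patients are drawn from the population of all flu patients; this sketch covers only the assignment step.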
Quasi-Experimental Designs
Quasi-experimental research involves getting as close as possible to the conditions of a true experiment when we cannot
meet all requirements. Specifically, a quasi-experiment involves manipulating the independent variable but not randomly
assigning people to groups. There are several reasons this might be used. First, it may be unethical to deny potential
treatment to someone if there is good reason to believe it will be effective and that the person would unduly suffer if they
did not receive it. Alternatively, it may be impossible to randomly assign people to groups. Consider the following
example:
A professor wants to test out a new teaching method to see if it improves student learning. Because he is teaching two
sections of the same course, he decides to teach one section the traditional way and the other section using the new
method. At the end of the semester, he compares the grades on the final for each class to see if there is a difference.
In this example, the professor has manipulated his teaching method, which is the independent variable, hoping to find a
difference in student performance, the dependent variable. However, because students enroll in courses, he cannot
randomly assign the students to a particular group, thus precluding using a true experiment to answer his research question.
Because of this, we cannot know for sure that there are no systematic differences between the classes other than teaching
style and therefore cannot determine causality.
A data scientist wants to know if there is a relation between how conscientious a person is and whether that person is a
good employee. She hopes to use this information to predict the job performance of future employees by measuring their
personality when they are still job applicants. She randomly samples volunteer employees from several different
companies, measuring their conscientiousness and having their bosses rate their performance on the job. She analyzes this
data to find a relation.
Here, it is not possible to manipulate conscientiousness, so the researcher must gather data from employees as they are in order
to find a relation between her variables.
Although this technique cannot establish causality, it can still be quite useful. If the relation between conscientiousness and
job performance is consistent, then it doesn’t necessarily matter whether conscientiousness causes good performance or if they
are both caused by something else – she can still measure conscientiousness to predict future performance. Additionally,
these studies have the benefit of reflecting reality as it actually exists since we as researchers do not change anything.
Descriptive Statistics
Descriptive statistics are numbers that are used to summarize and describe data. The word “data” refers to the information
that has been collected from an experiment, a survey, an historical record, etc. (By the way, “data” is plural. One piece of
information is called a “datum.”) If we are analyzing birth certificates, for example, a descriptive statistic might be the
percentage of certificates issued in New York State, or the average age of the mother. Any other number we choose to
compute also counts as a descriptive statistic for the data from which the statistic is computed. Several descriptive statistics
are often used at one time to give a full picture of the data. Descriptive statistics are just descriptive. They do not involve
generalizing beyond the data at hand. Generalizing from our data to another set of cases is the business of inferential
statistics, which you'll be studying in another section. Here we focus on (mere) descriptive statistics. Some descriptive
statistics are shown in Table 1.6.1. The table shows the average salaries for various occupations in the United States in
1999.
Table 1.6.1: Average salaries for various occupations in 1999.
Salary     Occupation
$112,760   pediatricians
$106,130   dentists
$100,090   podiatrists
$76,140    physicists
$53,410    architects
$49,720    school, clinical, and counseling psychologists
$47,910    flight attendants
$39,560    elementary school teachers
$38,710    police officers
$18,980    floral designers
Descriptive statistics like these offer insight into American society. It is interesting to note, for example, that we pay the
people who educate our children and who protect our citizens a great deal less than we pay people who take care of our
feet or our teeth.
For more descriptive statistics, consider Table 1.6.2. It shows the number of unmarried men per 100 unmarried women in
U.S. Metro Areas in 1990. From this table we see that men outnumber women most in Jacksonville, NC, and women
outnumber men most in Sarasota, FL. You can see that descriptive statistics can be useful if we are looking for an
opposite-sex partner! (These data come from the Information Please Almanac.)
Table 1.6.2: Number of unmarried men per 100 unmarried women in U.S. Metro Areas in 1990.
Cities with mostly men               Men per 100 women    Cities with mostly women    Men per 100 women
1. Jacksonville, NC                  224                  1. Sarasota, FL             66
7. Clarksville-Hopkinsville, TN-KY   113                  7. Wheeling, WV             70
8. Anchorage, Alaska                 112                  8. Charleston, WV           71
9. Salinas-Seaside-Monterey, CA      112                  9. St. Joseph, MO           71
10. Bryan-College Station, TX        111                  10. Lynchburg, VA           71
NOTE: Unmarried includes never-married, widowed, and divorced persons, 15 years or older.
These descriptive statistics may make us ponder why the numbers are so disparate in these cities. One potential
explanation, for instance, as to why there are more women in Florida than men may involve the fact that elderly
individuals tend to move down to the Sarasota region and that women tend to outlive men. Thus, more women might live
in Sarasota than men. However, in the absence of proper data, this is only speculation.
You probably know that descriptive statistics are central to the world of sports. Every sporting event produces numerous
statistics such as the shooting percentage of players on a basketball team. For the Olympic marathon (a foot race of 26.2
miles), we possess data that cover more than a century of competition. (The first modern Olympics took place in 1896.)
The following table shows the winning times for both men and women (the latter have only been allowed to compete since
1984).
Table 1.6.3: Winning Olympic marathon times (year, winner, country, and time; men's and women's events).
There are many descriptive statistics that we can compute from the data in the table. To gain insight into the improvement
in speed over the years, let us divide the men's times into two pieces, namely, the first 13 races (up to 1952) and the second
13 (starting from 1956). The mean winning time for the first 13 races is 2 hours, 44 minutes, and 22 seconds (written
2:44:22). The mean winning time for the second 13 races is 2:13:18. This is quite a difference (over half an hour). Does
this prove that the fastest men are running faster? Or is the difference just due to chance, no more than what often emerges from natural year-to-year variation in performance?
Inferential Statistics
Descriptive statistics are wonderful at telling us what our data look like. However, what we often want to understand is
how our data behave. What variables are related to other variables? Under what conditions will the value of a variable
change? Are two groups different from each other, and if so, are people within each group different or similar? These are
the questions answered by inferential statistics, and inferential statistics are how we generalize from our sample back up to
our population. Units 2 and 3 are all about inferential statistics, the formal analyses and tests we run to make conclusions
about our data.
For example, we will learn how to use a t statistic to determine whether people change over time when enrolled in an
intervention. We will also use an F statistic to determine if we can predict future values on a variable based on current
known values of a variable. There are many types of inferential statistics, each allowing us insight into a different behavior
of the data we collect. This course will only touch on a small subset (or a sample) of them, but the principles we learn
along the way will make it easier to learn new tests, as most inferential statistics follow the same structure and format.
Grape   Weight
1       4.6
2       5.1
3       4.9
4       4.4
We label Grape 1's weight X₁, Grape 2's weight X₂, etc. The following formula means to sum up the weights of the four
grapes:

∑_{i=1}^{4} Xᵢ    (1.7.1)

The Greek letter Σ indicates summation. The “i = 1” at the bottom indicates that the summation is to start with X₁, and the
4 at the top indicates that the summation will end with X₄. The “Xᵢ” indicates that X is the variable to be summed as i
goes from 1 to 4.
The symbol

∑_{i=1}^{3} Xᵢ

indicates that only the first 3 scores are to be summed. The index variable i goes from 1 to 3.
When all the scores of a variable (such as X) are to be summed, it is often convenient to use the following abbreviated
notation:

∑X

Thus, when no values of i are shown, it means to sum all the values of X.
Many formulas involve squaring numbers before they are summed. This is indicated as

∑X² = 4.6² + 5.1² + 4.9² + 4.4² = 90.54

Notice that

(∑X)² ≠ ∑X²    (1.7.2)

because the expression on the left means to sum up all the values of X and then square the sum (19² = 361), whereas the
expression on the right means to square the numbers and then sum the squares (90.54, as shown).
X   Y   XY
1   3   3
2   2   4
3   7   21

∑XY = 28
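The notation above translates directly into code. A minimal Python check of the grape weights and the X, Y table:

```python
X = [4.6, 5.1, 4.9, 4.4]  # the four grape weights

sum_x = sum(X)                      # ∑X  = 19
sum_x_sq = sum(x**2 for x in X)     # ∑X² = 90.54 (square first, then sum)
sq_sum_x = sum(X)**2                # (∑X)² = 361 (sum first, then square)

# Sum of cross products ∑XY from the X, Y table above
pairs = [(1, 3), (2, 2), (3, 7)]
sum_xy = sum(x * y for x, y in pairs)  # ∑XY = 28

print(sum_x, sum_x_sq, sq_sum_x, sum_xy)
```

Note that sum_x_sq and sq_sum_x differ, illustrating the inequality in Equation 1.7.2.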
Answer:
Your answer could take many forms but should include information about objectively interpreting information and/or
communicating results and research conclusions.
Answer:
a. Ordinal
b. Ratio
c. Ordinal
d. Nominal
e. Interval
4. What is the difference between a population and a sample? Which is described by a parameter and which is described
by a statistic?
5. What is sampling bias? What is sampling error?
Answer:
Sampling bias is the difference in demographic characteristics between a sample and the population it should represent.
Sampling error is the difference between a population parameter and a sample statistic that is caused by random chance
in the selection of the sample.
6. What is the difference between a simple random sample and a stratified random sample?
7. What are the two key characteristics of a true experimental design?
Answer:
Random assignment to treatment conditions and manipulation of the independent variable
9. Given the following table, compute each of the values in a-d below.
X   Y
2   8
3   8
7   4
5   1
9   4
a. ∑X
b. ∑Y²
c. ∑XY
d. (∑Y)²
Answer:
a. 26
b. 161
c. 109
d. 625
10. What are the most common measures of central tendency and spread?
2.1: Graphing Qualitative Variables
When Apple Computer introduced the iMac computer in August 1998, the company wanted to learn whether the iMac was
expanding Apple’s market share. Was the iMac just attracting previous Macintosh owners? Or was it purchased by
newcomers to the computer market and by previous Windows users who were switching over? To find out, 500 iMac
customers were interviewed. Each customer was categorized as a previous Macintosh owner, a previous Windows owner,
or a new computer purchaser.
This section examines graphical methods for displaying the results of the interviews. We’ll learn some general lessons
about how to graph data that fall into a small number of categories. A later section will consider how to graph numerical
data in which each observation is represented by a number in some range. The key point about the qualitative data that
occupy us in the present section is that they do not come with a pre-established ordering (the way numbers are ordered).
For example, there is no natural sense in which the category of previous Windows users comes before or after the category
of previous Macintosh users. This situation may be contrasted with quantitative data, such as a person’s weight. People of
one weight are naturally ordered with respect to people of a different weight.
Frequency Tables
All of the graphical methods shown in this section are derived from frequency tables. Table 1 shows a frequency table for
the results of the iMac study; it shows the frequencies of the various response categories. It also shows the relative
frequencies, which are the proportion of responses in each category. For example, the relative frequency for “none” is
85/500 = 0.17.
Table 2.1.1 : Frequency Table for the iMac Data.
Previous Ownership Frequency Relative Frequency
None 85 0.17
Windows 60 0.12
Macintosh 355 0.71
Total 500 1
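The relative-frequency column is just each frequency divided by the total number of responses. A quick sketch:

```python
# Frequencies from Table 2.1.1 (iMac study)
freqs = {"None": 85, "Windows": 60, "Macintosh": 355}
total = sum(freqs.values())  # 500 interviews

# Relative frequency = proportion of responses in each category
rel_freqs = {category: count / total for category, count in freqs.items()}
for category, rf in rel_freqs.items():
    print(f"{category:10s} {freqs[category]:4d} {rf:.2f}")
```

The relative frequencies (0.17, 0.12, 0.71) necessarily sum to 1.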
Pie Charts
The pie chart in Figure 2.1.1 shows the results of the iMac study. In a pie chart, each category is represented by a slice of
the pie. The area of the slice is proportional to the percentage of responses in the category. This is simply the relative
frequency multiplied by 100. Although most iMac purchasers were Macintosh owners, Apple was encouraged by the 12%
of purchasers who were former Windows users, and by the 17% of purchasers who were buying a computer for the first
time.
Figure 2.1.1 : Pie chart of iMac purchases illustrating frequencies of previous computer ownership.
Pie charts are effective for displaying the relative frequencies of a small number of categories. They are not recommended,
however, when you have a large number of categories. Pie charts can also be confusing when they are used to compare the
outcomes of two different surveys or experiments. In an influential book on the use of graphs, Edward Tufte asserted “The
only worse design than a pie chart is several of them.” Another important caveat is that pie charts based on only a small number of observations can be misleading.
Figure 2.1.2 : Bar chart of iMac purchases as a function of previous computer ownership.
Comparing Distributions
Often we need to compare the results of different surveys, or of different conditions within the
same overall survey. In this case, we are comparing the “distributions” of responses between the surveys or conditions. Bar
charts are often excellent for illustrating differences between two distributions. Figure 2.1.3 shows the number of people
playing card games at the Yahoo web site on a Sunday and on a Wednesday in the spring of 2001. We see that there were
more players overall on Wednesday compared to Sunday. The number of people playing Pinochle was nonetheless the
same on these two days. In contrast, there were about twice as many people playing hearts on Wednesday as on Sunday.
Facts like these emerge clearly from a well-designed bar chart.
Figure 2.1.7 : A line graph used inappropriately to depict the number of people playing different card games on Sunday
and Wednesday.
There are many types of graphs that can be used to portray distributions of quantitative variables. The upcoming sections
cover the following types of graphs:
1. stem and leaf displays
2. histograms
3. frequency polygons
4. box plots
5. bar charts
6. line graphs
7. dot plots
8. scatter plots (discussed in a different chapter)
Some graph types such as stem and leaf displays are best-suited for small to moderate amounts of data, whereas others
such as histograms are best suited for large amounts of data. Graph types such as box plots are good at depicting
differences between distributions. Scatter plots are used to show the relationship between two variables.
Figure 2.2.2 : Stem and leaf display of the number of touchdown passes.
To make this clear, let us examine Figure 2.2.2 more closely. In the top row, the four leaves to the right of stem 3 are 2, 3,
3, and 7. Combined with the stem, these leaves represent the numbers 32, 33, 33, and 37, which are the numbers of TD
passes for the first four teams in Figure 2.2.1. The next row has a stem of 2 and 12 leaves. Together, they represent 12 data
points, namely, two occurrences of 20 TD passes, three occurrences of 21 TD passes, three occurrences of 22 TD passes,
one occurrence of 23 TD passes, two occurrences of 28 TD passes, and one occurrence of 29 TD passes. We leave it to you
to figure out what the third row represents. The fourth row has a stem of 0 and two leaves. It stands for the last two entries
in Figure 2.2.1, namely 9 TD passes and 6 TD passes. (The latter two numbers may be thought of as 09 and 06.)
Figure 2.2.3 : Stem and leaf display with the stems split in two.
Figure 2.2.3 is more revealing than Figure 2.2.2 because the latter figure lumps too many values into a single row.
Whether you should split stems in a display depends on the exact form of your data. If rows get too long with single stems,
you might try splitting them into two or more parts.
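The procedure for building a basic (unsplit) stem and leaf display is easy to express in code. The sketch below uses touchdown-pass counts consistent with the rows described above; the stem-1 row values are invented for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Build a simple stem and leaf display for non-negative integers:
    stem = tens digit, leaf = ones digit, largest stems on top."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    lines = []
    for stem in sorted(rows, reverse=True):
        leaves = "".join(str(leaf) for leaf in rows[stem])
        lines.append(f"{stem}|{leaves}")
    return lines

# Touchdown-pass counts; the teens row is hypothetical
tds = [37, 33, 33, 32,
       29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20,
       14, 13, 12, 12,
       9, 6]
print("\n".join(stem_and_leaf(tds)))
```

The top row comes out as 3|2337, matching the numbers 32, 33, 33, and 37 discussed above.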
There is a variation of stem and leaf displays that is useful for comparing distributions. The two distributions are placed
back to back along a common column of stems. The result is a “back-to-back stem and leaf display.” Figure 2.2.4 shows
such a graph. It compares the numbers of TD passes in the 1998 and 2000 seasons. The stems are in the middle, the leaves
to the left are for the 1998 data, and the leaves to the right are for the 2000 data. For example, the second-to-last row shows
that in 1998 there were teams with 11, 12, and 13 TD passes, and in 2000 there were two teams with 12 and three teams
with 14 TD passes.
Figure 2.2.4 : Back-to-back stem and leaf display. The left side shows the 1998 TD data and the right side shows the 2000
TD data.
Figure 2.2.4 helps us see that the two seasons were similar, but that only in 1998 did any teams throw more than 40 TD
passes.
There are two things about the football data that make them easy to graph with stems and leaves. First, the data are limited
to whole numbers that can be represented with a one-digit stem and a one-digit leaf. Second, all the numbers are positive.
If the data include numbers with three or more digits, or contain decimals, they can be rounded to two-digit accuracy.
Negative values are also easily handled. Let us look at another example.
Figure 2.2.5 shows data from the case study Weapons and Aggression. Each value is the mean difference over a series of
trials between the times it took an experimental subject to name aggressive words (like “punch”) under two conditions. In
one condition, the words were preceded by a non-weapon word such as “bug.” In the second condition, the same words
were preceded by a weapon word such as “gun” or “knife.” The issue addressed by the experiment was whether a
preceding weapon word would speed up (or prime) pronunciation of the aggressive word compared to a non-weapon
priming word. A positive difference implies greater priming of the aggressive word by the weapon word. Negative
differences imply that the priming by the weapon word was less than for a neutral word.
Figure 2.2.6 : Stem and leaf display with negative numbers and rounding.
Observe that the figure contains a row headed by “0” and another headed by “-0.” The stem of 0 is for numbers between 0
and 9, whereas the stem of -0 is for numbers between 0 and -9. For example, the fifth row of the table holds the numbers 1,
2, 4, 5, 5, 8, 9 and the sixth row holds 0, -6, -7, and -9. Values that are exactly 0 before rounding should be split as evenly
as possible between the “0” and “-0” rows. In Figure 2.2.5, none of the values are 0 before rounding. The “0” that appears
in the “-0” row comes from the original value of -0.2 in the table.
Although stem and leaf displays are unwieldy for large data sets, they are often useful for data sets with up to 200
observations. Figure 2.2.7 portrays the distribution of populations of 185 US cities in 1998. To be included, a city had to
have between 100,000 and 500,000 residents.
Figure 2.2.7: Stem and leaf display of populations of 185 US cities with populations between 100,000 and 500,000 in
1998.
Since a stem and leaf plot shows only two-place accuracy, we had to round the numbers to the nearest 10,000. For example
the largest number (493,559) was rounded to 490,000 and then plotted with a stem of 4 and a leaf of 9. The fourth highest
number (463,201) was rounded to 460,000 and plotted with a stem of 4 and a leaf of 6. Thus, the stems represent units of
100,000 and the leaves represent units of 10,000. Notice that each stem value is split into five parts: 0-1, 2-3, 4-5, 6-7, and
8-9.
Histograms
A histogram is a graphical method for displaying the shape of a distribution. It is particularly useful when there are a large
number of observations. We begin with an example consisting of the scores of 642 students on a psychology test. The test
consists of 197 items each graded as “correct” or “incorrect.” The students' scores ranged from 46 to 167.
The first step is to create a frequency table. Unfortunately, a simple frequency table would be too big, containing over 100
rows. To simplify the table, we group scores together as shown in Table 2.2.1.
Table 2.2.1 : Grouped Frequency Distribution of Psychology Test Scores
Interval's Lower Limit Interval's Upper Limit Class Frequency
39.5 49.5 3
49.5 59.5 10
59.5 69.5 53
69.5 79.5 107
79.5 89.5 147
89.5 99.5 130
99.5 109.5 78
109.5 119.5 59
119.5 129.5 36
129.5 139.5 11
139.5 149.5 6
149.5 159.5 1
159.5 169.5 1
To create this table, the range of scores was broken into intervals, called class intervals. The first interval is from 39.5 to
49.5, the second from 49.5 to 59.5, etc. Next, the number of scores falling into each interval was counted to obtain the
class frequencies. There are three scores in the first interval, 10 in the second, etc.
Class intervals of width 10 provide enough detail about the distribution to be revealing without making the graph too
“choppy.” More information on choosing the widths of class intervals is presented later in this section. Placing the limits of
the class intervals midway between two numbers (e.g., 49.5) ensures that every score will fall in an interval rather than on
the boundary between intervals.
In a histogram, the class frequencies are represented by bars. The height of each bar corresponds to its class frequency. A
histogram of these data is shown in Figure 2.2.8.
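Grouping scores into class intervals can be sketched as follows. The scores below are a small hypothetical batch, not the 642 actual test scores; the class limits match Table 2.2.1.

```python
def class_frequencies(scores, low=39.5, width=10, n_bins=13):
    """Count how many scores fall into each class interval
    [low, low+width), [low+width, low+2*width), ..."""
    counts = [0] * n_bins
    for s in scores:
        idx = int((s - low) // width)   # which interval the score falls in
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts

# A small hypothetical batch of test scores
scores = [46, 52, 58, 63, 71, 74, 85, 88, 91, 104, 112, 126, 131, 150, 163]
print(class_frequencies(scores))
```

Placing the limits at half-values such as 39.5 means no integer score can ever land exactly on a boundary, mirroring the point made above.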
Frequency Polygons
Frequency polygons are a graphical device for understanding the shapes of distributions. They serve the same purpose as
histograms, but are especially helpful for comparing sets of data. Frequency polygons are also a good choice for displaying
cumulative frequency distributions.
To create a frequency polygon, start just as for histograms, by choosing a class interval. Then draw an X-axis representing
the values of the scores in your data. Mark the middle of each class interval with a tick mark, and label it with the middle
value represented by the class. Draw the Y-axis to indicate the frequency of each class. Place a point in the middle of each
class interval at the height corresponding to its frequency. Finally, connect the points. You should include one class interval
below the lowest value in your data and one above the highest value. The graph will then touch the X-axis on both sides.
A frequency polygon for 642 psychology test scores shown in Figure 2.2.8 was constructed from the frequency table
shown in Table 2.2.2.
Table 2.2.2 : Frequency Distribution of Psychology Test Scores
Lower Limit Upper Limit Count Cumulative Count
29.5 39.5 0 0
39.5 49.5 3 3
49.5 59.5 10 13
59.5 69.5 53 66
69.5 79.5 107 173
79.5 89.5 147 320
89.5 99.5 130 450
99.5 109.5 78 528
109.5 119.5 59 587
119.5 129.5 36 623
129.5 139.5 11 634
139.5 149.5 6 640
149.5 159.5 1 641
159.5 169.5 1 642
169.5 170.5 0 642
The first label on the X-axis is 35. This represents an interval extending from 29.5 to 39.5. Since the lowest test score is
46, this interval has a frequency of 0. The point labeled 45 represents the interval from 39.5 to 49.5. There are three scores
in this interval. There are 147 scores in the interval that surrounds 85.
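The cumulative-count column of Table 2.2.2 is simply a running total of the Count column, which is easy to verify:

```python
from itertools import accumulate

# Count column of Table 2.2.2, from the lowest interval upward
counts = [0, 3, 10, 53, 107, 147, 130, 78, 59, 36, 11, 6, 1, 1, 0]

# Running total: each entry is the count of scores at or below
# the upper limit of that class interval
cumulative = list(accumulate(counts))
print(cumulative)  # ends at 642, the total number of students
```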
You can easily discern the shape of the distribution from Figure 2.2.9. Most of the scores are between 65 and 115. It is
clear that the distribution is not symmetric inasmuch as good scores (to the right) trail off more gradually than poor scores
(to the left). In the terminology of Chapter 3 (where we will study shapes of distributions more systematically), the
distribution is skewed.
using the same data from the cursor task. The difference in distributions for the two targets is again evident.
Continuing with the box plots, we put “whiskers” above and below each box to give additional information about the
spread of data. Whiskers are vertical lines that end in a horizontal stroke. Whiskers are drawn from the upper and lower
hinges to the upper and lower adjacent values (24 and 14 for the women's data), as shown in Figure 2.2.15.
Figure 2.2.16 : The box plots with the outside value shown.
There is one more mark to include in box plots (although sometimes it is omitted). We indicate the mean score for a group
by inserting a plus sign. Figure 2.2.17 shows the result of adding means to our box plots.
Bar Charts
In the section on qualitative variables, we saw how bar charts could be used to illustrate the frequencies of different
categories. For example, the bar chart shown in Figure 2.2.19 shows how many purchasers of iMac computers were
previous Macintosh users, previous Windows users, and new computer purchasers.
Figure 2.2.20 : Percent increase in three stock indexes from May 24th 2000 to May 24th 2001.
Bar charts are particularly effective for showing change over time. Figure 2.2.21, for example, shows the percent increase
in the Consumer Price Index (CPI) over four three-month periods. The fluctuation in inflation is apparent in the graph.
Bar charts are often used to compare the means of different experimental conditions. Figure 2.2.22 shows the mean time it
took one of us (DL) to move the cursor to either a small target or a large target. On average, more time was required for
small targets than for large ones.
Figure 2.2.22 : Bar chart showing the means for the two conditions.
Although bar charts can display means, we do not recommend them for this purpose. Box plots should be used instead
since they provide more information than bar charts without taking up more space. For example, a box plot of the cursor-
movement data is shown in Figure 2.2.23. You can see that Figure 2.2.23 reveals more about the distribution of movement
times than does Figure 2.2.22.
Figure 2.2.23 : Box plots of times to move the cursor to the small and large targets.
The section on qualitative variables presented earlier in this chapter discussed the use of bar charts for comparing
distributions. Some common graphical mistakes were also noted. The earlier discussion applies equally well to the use of
bar charts to display quantitative variables.
Figure 2.2.24 : A bar chart of the percent change in the CPI over time. Each bar represents percent increase for the three
months ending at the date indicated.
A line graph of these same data is shown in Figure 2.2.25. Although the figures are similar, the line graph emphasizes the
change from period to period.
Figure 2.2.25 : A line graph of the percent change in the CPI over time. Each point represents percent increase for the three
months ending at the date indicated.
Line graphs are appropriate only when both the X- and Y-axes display ordered (rather than qualitative) variables. Although
bar charts can also be used in this situation, line graphs are generally better at comparing changes over time. Figure 2.2.26,
for example, shows percent increases and decreases in five components of the CPI. The figure makes it easy to see that
medical costs had a steadier progression than the other components. Although you could create an analogous bar chart, its
interpretation would not be as easy.
Figure 2.2.27 : A line graph, inappropriately used, depicting the number of people playing different card games on
Wednesday and Sunday.
Answer:
Qualitative variables are displayed using pie charts and bar charts. Quantitative variables are displayed as box plots,
histograms, etc.
2. Given the following data, construct a pie chart and a bar chart. Which do you think is the more appropriate or useful
way to display the data?
Favorite Movie Genre Frequency
Comedy 14
Horror 9
Romance 8
Action 12
3. Pretend you are constructing a histogram for describing the distribution of salaries for individuals who are 40 years or
older, but are not yet retired.
a. What is on the Y-axis? Explain.
b. What is on the X-axis? Explain.
c. What would be the probable shape of the salary distribution? Explain why.
Answer:
[You do not need to draw the histogram, only describe it below]
a. The Y-axis would have the frequency or proportion because this is always the case in histograms.
b. The X-axis has income, because this is our quantitative variable of interest.
c. Because most income data are positively skewed, this histogram would likely be skewed positively too.
4. A graph appears below showing the number of adults and children who prefer each type of soda. There were 130 adults
and kids surveyed. Discuss some ways in which the graph below could be improved.
5. Which of the box plots on the graph has a large positive skew? Which has a large negative skew?
6. Create a histogram of the following data representing how many shows children said they watch each day:
Shows per day Frequency
0 2
1 18
2 36
3 7
4 3
7. Explain the differences between bar charts and histograms. When would each be used?
Answer:
In bar charts, the bars do not touch; in histograms, the bars do touch. Bar charts are appropriate for qualitative
variables, whereas histograms are better for quantitative variables.
Major Frequency
Psychology 144
Biology 120
Chemistry 24
Physics 12
10. Create a histogram of the following data. Label the tails and body and determine if it is skewed (and direction, if so) or
symmetrical.
Hours worked per week Proportion
0 - 10 4
10 - 20 8
20 - 30 11
30 - 40 51
40 - 50 12
50 - 60 9
60+ 5
1 2/18/2022
3.1: What is Central Tendency?
What is “central tendency,” and why do we want to know the central tendency of a group of scores? Let us first try to
answer these questions intuitively. Then we will proceed to a more formal discussion.
Imagine this situation: You are in a class with just four other students, and the five of you took a 5-point pop quiz. Today
your instructor is walking around the room, handing back the quizzes. She stops at your desk and hands you your paper.
Written in bold black ink on the front is “3/5.” How do you react? Are you happy with your score of 3 or disappointed?
How do you decide? You might calculate your percentage correct, realize it is 60%, and be appalled. But it is more likely
that when deciding how to react to your performance, you will want additional information. What additional information
would you like?
If you are like most students, you will immediately ask your neighbors, “Whad'ja get?” and then ask the instructor, “How
did the class do?” In other words, the additional information you want is how your quiz score compares to other students'
scores. You therefore understand the importance of comparing your score to the class distribution of scores. Should your
score of 3 turn out to be among the higher scores, then you'll be pleased after all. On the other hand, if 3 is among the
lower scores in the class, you won't be quite so happy.
This idea of comparing individual scores to a distribution of scores is fundamental to statistics. So let's explore it further,
using the same example (the pop quiz you took with your four classmates). Three possible outcomes are shown in Table
3.1.1. They are labeled “Dataset A,” “Dataset B,” and “Dataset C.” Which of the three datasets would make you happiest?
In other words, in comparing your score with your fellow students' scores, in which dataset would your score of 3 be the
most impressive?
Table 3.1.1 : Three possible datasets for the 5-point make-up quiz.
Student Dataset A Dataset B Dataset C
You 3 3 3
John's 3 4 2
Maria's 3 4 2
Shareecia's 3 4 2
Luther's 3 5 1
In Dataset A, everyone's score is 3. This puts your score at the exact center of the distribution. You can draw satisfaction
from the fact that you did as well as everyone else. But of course it cuts both ways: everyone else did just as well as you.
Now consider the possibility that the scores are described as in Dataset B. This is a depressing outcome even though your
score is no different than the one in Dataset A. The problem is that the other four students had higher grades, putting yours
below the center of the distribution.
Finally, let's look at Dataset C. This is more like it! All of your classmates score lower than you so your score is above the
center of the distribution.
Now let's change the example in order to develop more insight into the center of a distribution. Figure 3.1.1 shows the
results of an experiment on memory for chess positions. Subjects were shown a chess position and then asked to
reconstruct it on an empty chess board. The number of pieces correctly placed was recorded. This was repeated for two
more chess positions. The scores represent the total number of chess pieces correctly placed for the three chess positions.
The maximum possible score was 89.
Definitions of Center
Now we explain the three different ways of defining the center of a distribution. All three are called measures of central
tendency.
Balance Scale
One definition of central tendency is the point at which the distribution is in balance. Figure 3.1.2 shows the distribution of
the five numbers 2, 3, 4, 9, 16 placed upon a balance scale. If each number weighs one pound, and is placed at its position
along the number line, then it would be possible to balance them by placing a fulcrum at 6.8.
Value Absolute Deviation from 10
2 8
3 7
4 6
9 1
16 6
Sum 28
The first row of the table shows that the absolute value of the difference between 2 and 10 is 8; the second row shows that
the absolute difference between 3 and 10 is 7, and similarly for the other rows. When we add up the five absolute
deviations, we get 28. So, the sum of the absolute deviations from 10 is 28. Likewise, the sum of the absolute deviations
from 5 equals 3 + 2 + 1 + 4 + 11 = 21. So, the sum of the absolute deviations from 5 is smaller than the sum of the absolute
deviations from 10. In this sense, 5 is closer, overall, to the other numbers than is 10.
We are now in a position to define a second measure of central tendency, this time in terms of absolute deviations.
Specifically, according to our second definition, the center of a distribution is the number for which the sum of the absolute
deviations is smallest. As we just saw, the sum of the absolute deviations from 10 is 28 and the sum of the absolute
deviations from 5 is 21. Is there a value for which the sum of the absolute deviations is even smaller than 21? Yes. For
these data, there is a value for which the sum of absolute deviations is only 20. See if you can find it.
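If you would like to check sums of absolute deviations by computation rather than by hand, the short Python sketch below (added as an illustration; it is not part of the original text, and the function name sum_abs_dev is ours) evaluates any candidate center:

```python
# A way to check the sums of absolute deviations reported above, and to
# search for the minimizing value.
scores = [2, 3, 4, 9, 16]

def sum_abs_dev(target, data):
    """Sum of absolute deviations of every score from a candidate center."""
    return sum(abs(x - target) for x in data)

print(sum_abs_dev(10, scores))  # 28, as in the text
print(sum_abs_dev(5, scores))   # 21, as in the text

# Search whole-number candidates (note: this reveals the answer to the
# exercise above):
best = min(range(min(scores), max(scores) + 1),
           key=lambda t: sum_abs_dev(t, scores))
print(best, sum_abs_dev(best, scores))
```

Trying every whole-number candidate between the smallest and largest score is a simple brute-force check; it confirms that a sum of 20 is attainable.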
Value Squared Deviation from 10
2 64
3 49
4 36
9 1
16 36
Sum 186
The first row in the table shows that the squared value of the difference between 2 and 10 is 64; the second row shows that
the squared difference between 3 and 10 is 49, and so forth. When we add up all these squared deviations, we get 186.
Changing the target from 10 to 5, we calculate the sum of the squared deviations from 5 as 9 + 4 + 1 + 16 + 121 = 151. So,
the sum of the squared deviations from 5 is smaller than the sum of the squared deviations from 10. Is there a value for
which the sum of the squared deviations is even smaller than 151? Yes, it is possible to reach 134.8. Can you find the
target number for which the sum of squared deviations is 134.8?
The target that minimizes the sum of squared deviations provides another useful definition of central tendency (the last one
to be discussed in this section). It can be challenging to find the value that minimizes this sum.
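The same kind of check works for squared deviations. In the Python sketch below (ours, not part of the original text), evaluating the sum at the mean of the five numbers confirms the minimum mentioned above:

```python
# Evaluating the sum of squared deviations at several targets; the mean of
# the five numbers (6.8) produces the smallest possible sum, 134.8.
scores = [2, 3, 4, 9, 16]

def sum_sq_dev(target, data):
    """Sum of squared deviations of every score from a candidate center."""
    return sum((x - target) ** 2 for x in data)

mean = sum(scores) / len(scores)           # 6.8
print(sum_sq_dev(10, scores))              # 186, as in the text
print(sum_sq_dev(5, scores))               # 151, as in the text
print(round(sum_sq_dev(mean, scores), 1))  # 134.8, the minimum
```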
Arithmetic Mean
The arithmetic mean is the most common measure of central tendency. It is simply the sum of the numbers divided by the
number of numbers. The symbol “μ” (pronounced “mew”) is used for the mean of a population. The symbol “X̄” (pronounced “X-bar”) is used for the mean of a sample. The formula for μ is shown below:

μ = ΣX / N    (3.2.1)
where ΣX is the sum of all the numbers in the population and N is the number of numbers in the population.
The formula for X̄ is essentially identical:

X̄ = ΣX / N    (3.2.2)

where ΣX is the sum of all the numbers in the sample and N is the number of numbers in the sample. The only distinction between these two equations is whether we are referring to the population (in which case we use the parameter μ) or a sample of that population (in which case we use the statistic X̄).
As an example, the mean of the numbers 1, 2, 3, 6, 8 is 20/5 = 4 regardless of whether the numbers constitute the entire
population or just a sample from the population.
Figure 3.2.1 shows the number of touchdown (TD) passes thrown by each of the 31 teams in the National Football League
in the 2000 season. The mean number of touchdown passes thrown is 20.45 as shown below.
μ = ΣX / N = 634 / 31 = 20.45
Mode
The mode is the most frequently occurring value in the dataset. For the data in Figure 3.2.1, the mode is 18 since more
teams (4) had 18 touchdown passes than any other number of touchdown passes. With continuous data, such as response
time measured to many decimals, the frequency of each value is one since no two scores will be exactly the same (see
discussion of continuous variables). Therefore the mode of continuous data is normally computed from a grouped
frequency distribution. Table 3.2.1 shows a grouped frequency distribution for the target response time data. Since the
interval with the highest frequency is 600-700, the mode is the middle of that interval (650). Though the mode is not
frequently used for continuous data, it is nevertheless an important measure of central tendency as it is the only measure
we can use on qualitative or categorical data.
Table 3.2.1 : Grouped frequency distribution
Range Frequency
500 - 600 3
600 - 700 6
700 - 800 5
800 - 900 5
900 - 1000 0
1000 - 1100 1
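Finding a mode amounts to tallying frequencies and taking the most common value. A short Python sketch (ours; the list below is hypothetical and merely mimics the key feature of Figure 3.2.1, where 18 occurred most often):

```python
# Tallying frequencies to find the mode of a discrete dataset.
from collections import Counter

td_passes = [18, 18, 18, 18, 20, 21, 21, 33, 14, 28]  # hypothetical data
mode, frequency = Counter(td_passes).most_common(1)[0]
print(mode, frequency)  # 18 occurs 4 times
```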
Value Abs. Deviation from Median Abs. Deviation from Mean Squared Deviation from Median Squared Deviation from Mean
2 2 4.8 4 23.04
3 1 3.8 1 14.44
4 0 2.8 0 7.84
9 5 2.2 25 4.84
16 12 9.2 144 84.64
Total 20 22.8 174 134.8
Figure 3.2.2 shows that the distribution balances at the mean of 6.8 and not at the median of 4. The relative advantages and
disadvantages of the mean and median are discussed in the section “Comparing Measures” later in this chapter.
Measures of central tendency for the distribution in Figure 3.2.3: Mode = 84.00, Median = 90.00, Mean = 91.58.
The distribution of baseball salaries (in 1994) shown in Figure 3.2.4 has a much more pronounced skew than the
distribution in Figure 3.2.3.
Measures of central tendency for the salaries in Figure 3.2.4 (in thousands of dollars): Mode = 250, Median = 500, Mean = 1,183.
Range
The range is the simplest measure of variability to calculate, and one you have probably encountered many times in your life. The
range is simply the highest score minus the lowest score. Let’s take a few examples. What is the range of the following group of
numbers: 10, 2, 5, 6, 7, 3, 4? Well, the highest number is 10, and the lowest number is 2, so 10 - 2 = 8. The range is 8. Let’s take
another example. Here’s a dataset with 10 numbers: 99, 45, 23, 67, 45, 91, 82, 78, 62, 51. What is the range? The highest number is
99 and the lowest number is 23, so 99 - 23 equals 76; the range is 76. Now consider the two quizzes shown in Figure 3.3.1 and
Figure 3.3.2. On Quiz 1, the lowest score is 5 and the highest score is 9. Therefore, the range is 4. The range on Quiz 2 was larger:
the lowest score was 4 and the highest score was 10. Therefore the range is 6.
The problem with using range is that it is extremely sensitive to outliers, and one number far away from the rest of the data will
greatly alter the value of the range. For example, in the set of numbers 1, 3, 4, 4, 5, 8, and 9, the range is 8 (9 – 1).
However, if we add a single person whose score is nowhere close to the rest of the scores, say, 20, the range more than doubles
from 8 to 19.
Interquartile Range
The interquartile range (IQR) is the range of the middle 50% of the scores in a distribution and is sometimes used to communicate
where the bulk of the data in the distribution are located. It is computed as follows:
IQR = 75th percentile − 25th percentile    (3.3.1)
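Equation 3.3.1 can be computed directly. The Python sketch below (ours, not part of the original text) uses the Quiz 1 scores from Table 3.3.1 and the standard library's statistics.quantiles; note that different percentile conventions can give slightly different quartile values:

```python
# Computing the interquartile range of the Quiz 1 scores.
import statistics

quiz1 = [5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9]
q1, q2, q3 = statistics.quantiles(quiz1, n=4)  # 25th, 50th, 75th percentiles
iqr = q3 - q1
print(q1, q3, iqr)  # 6.0 8.0 2.0
```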
Sum of Squares
Variability can also be defined in terms of how close the scores in the distribution are to the middle of the distribution. Using the mean as the measure of the middle of the distribution, we can see how far, on average, each data point is from the center. The data from Quiz 1 are shown in Table 3.3.1. The mean score is 7.0 (ΣX/N = 140/20 = 7). Therefore, the column “X − X̄” contains deviations (how far each score deviates from the mean), here calculated as the score minus 7. The column “(X − X̄)²” has the squared deviations, calculated by squaring each value in the deviation column:
X X − X̄ (X − X̄)²
9 2 4
9 2 4
9 2 4
8 1 1
8 1 1
8 1 1
8 1 1
7 0 0
7 0 0
7 0 0
7 0 0
7 0 0
6 -1 1
6 -1 1
6 -1 1
6 -1 1
6 -1 1
6 -1 1
5 -2 4
5 -2 4
Σ = 140 Σ = 0 Σ = 30
Dividing this Sum of Squares by the number of scores gives the population variance, σ² (“sigma squared”):

σ² = Σ(X − μ)² / N    (3.3.2)
Notice that the numerator of that formula is identical to the formula for Sum of Squares presented above with X̄ replaced by μ. Thus, we can use the Sum of Squares table to easily calculate the numerator and then simply divide that value by N to get variance. If we assume that the values in Table 3.3.1 represent the full population, then we can take our value of Sum of Squares and divide it by N to get our population variance:
σ² = 30 / 20 = 1.5
So, on average, scores in this population are 1.5 squared units away from the mean. This measure of spread is much more robust (a term used by statisticians to mean resilient or resistant) to outliers than the range, so it is a much more useful value to compute.
Additionally, as we will see in future chapters, variance plays a central role in inferential statistics.
The sample statistic used to estimate the variance is s² (“s-squared”):

s² = Σ(X − X̄)² / (N − 1)    (3.3.3)
This formula is very similar to the formula for the population variance with one change: we now divide by N − 1 instead of N. The value N − 1 has a special name: the degrees of freedom (abbreviated as df). You don’t need to understand in depth what degrees of freedom are (essentially they account for the fact that we have to use a sample statistic to estimate the mean (X̄) before we estimate the variance) in order to calculate variance, but knowing that the denominator is called df provides a nice shorthand for the variance formula: SS/df.
Going back to the values in Table 3.3.1 and treating those scores as a sample, we can estimate the sample variance as:
s² = 30 / (20 − 1) = 1.58    (3.3.4)
Notice that this value is slightly larger than the one we calculated when we assumed these scores were the full population. This is
because our value in the denominator is slightly smaller, making the final value larger. In general, as your sample size N gets
bigger, the effect of subtracting 1 becomes less and less. Compare a sample size of 10 to a sample size of 1000: 10 – 1 = 9, or
90% of the original value, whereas 1000 – 1 = 999, or 99.9% of the original value. Thus, larger sample sizes will bring the estimate
of the sample variance closer to that of the population variance. This is a key idea and principle in statistics that we will see over
and over again: larger sample sizes better reflect the population.
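The calculations above can be reproduced in a few lines of Python (a sketch we have added, using the Quiz 1 scores from Table 3.3.1):

```python
# Reproducing the Sum of Squares, population variance, and sample variance
# for the Quiz 1 scores.
quiz1 = [9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 5, 5]

N = len(quiz1)                            # 20
mean = sum(quiz1) / N                     # 7.0
ss = sum((x - mean) ** 2 for x in quiz1)  # Sum of Squares = 30
pop_var = ss / N                          # sigma-squared = 1.5
sample_var = ss / (N - 1)                 # s-squared = 30/19, about 1.58
print(ss, pop_var, round(sample_var, 2))  # 30.0 1.5 1.58
```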
Standard Deviation
The standard deviation is simply the square root of the variance. This is a useful and interpretable statistic because taking the
square root of the variance (recalling that variance is the average squared difference) puts the standard deviation back into the
original units of the measure we used. Thus, when reporting descriptive statistics in a study, scientists virtually always report mean
and standard deviation. Standard deviation is therefore the most commonly used measure of spread for our purposes.
The population parameter for standard deviation is σ (“sigma”), which, intuitively, is the square root of the variance parameter σ² (on occasion, the symbols work out nicely that way). The formula is simply the formula for variance under a square root sign:

σ = √( Σ(X − μ)² / N )    (3.3.5)
The standard deviation is an especially useful measure of variability when the distribution is normal or approximately normal
because the proportion of the distribution within a given number of standard deviations from the mean can be calculated. For
example, 68% of the distribution is within one standard deviation (above and below) of the mean and approximately 95% of the
distribution is within two standard deviations of the mean. Therefore, if you had a normal distribution with a mean of 50 and a
standard deviation of 10, then 68% of the distribution would be between 50 - 10 = 40 and 50 + 10 = 60. Similarly, about 95% of the distribution would be between 50 - 2 × 10 = 30 and 50 + 2 × 10 = 70.
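The arithmetic in that example is easy to verify. A brief Python sketch (ours, not part of the original text):

```python
# The standard deviation as the square root of the variance, and the 68%/95%
# intervals for the normal distribution with mean 50 and standard deviation 10
# described above.
import math

mu, sigma = 50, 10
print(math.sqrt(sigma ** 2))                   # 10.0: sigma is the root of the variance
within_1sd = (mu - sigma, mu + sigma)          # about 68% of scores fall here
within_2sd = (mu - 2 * sigma, mu + 2 * sigma)  # about 95% of scores fall here
print(within_1sd, within_2sd)  # (40, 60) (30, 70)
```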
Answer:
If the mean is higher, that means it is farther out into the right-hand tail of the distribution. Therefore, we know this distribution
is positively skewed.
2. Compare the mean, median, and mode in terms of their sensitivity to extreme scores.
3. Your younger brother comes home one day after taking a science test. He says that someone at school told him that “60% of
the students in the class scored above the median test grade.” What is wrong with this statement? What if he had said “60% of
the students scored above the mean?”
Answer:
The median is defined as the value with 50% of scores above it and 50% of scores below it; therefore, 60% of scores cannot fall above the median. If 60% of scores fall above the mean, that would indicate that the mean has been pulled down below the value of the median, which means that the distribution is negatively skewed.
Answer:
μ = 4.80, σ² = 2.36
6. For the following problem, use the following scores: 5, 8, 8, 8, 7, 8, 9, 12, 8, 9, 8, 10, 7, 9, 7, 6, 9, 10, 11, 8
a. Create a histogram of these data. What is the shape of this histogram?
b. How do you think the three measures of central tendency will compare to each other in this dataset?
c. Compute the sample mean, the median, and the mode
d. Draw and label lines on your histogram for each of the above values. Do your results match your predictions?
7. Compute the range, sample variance, and sample standard deviation for the following scores: 25, 36, 41, 28, 29, 32, 39, 37, 34,
34, 37, 35, 30, 36, 31, 31
Answer:
range = 16, s² = 18.40, s = 4.29
8. Using the same values from problem 7, calculate the range, sample variance, and sample standard deviation, but this time
include 65 in the list of values. How did each of the three values change?
9. Two normal distributions have exactly the same mean, but one has a standard deviation of 20 and the other has a standard
deviation of 10. How would the shapes of the two distributions compare?
Answer:
If both distributions are normal, then they are both symmetrical, and having the same mean causes them to overlap with one
another. The distribution with the standard deviation of 10 will be narrower than the other distribution
10. Compute the sample mean and sample standard deviation for the following scores: -8, -4, -7, -6, -8, -5, -7, -9, -2, 0
4.1: Normal Distributions
The normal distribution is the most important and most widely used distribution in statistics. It is sometimes called the “bell curve,” although the tonal qualities of such a bell would be less than pleasing. It is also called the “Gaussian curve” or Gaussian distribution, after the mathematician Carl Friedrich Gauss.
Strictly speaking, it is not correct to talk about “the normal distribution” since there are many normal distributions. Normal
distributions can differ in their means and in their standard deviations. Figure 1 shows three normal distributions. The green (left-
most) distribution has a mean of -3 and a standard deviation of 0.5, the distribution in red (the middle distribution) has a mean of 0
and a standard deviation of 1, and the distribution in black (right-most) has a mean of 2 and a standard deviation of 3. These as well
as all other normal distributions are symmetric with relatively more values at the center of the distribution and relatively few in the
tails. What is consistent about all normal distributions is the shape and the proportion of scores within a given distance along the x-axis. We will focus on the Standard Normal Distribution (also known as the Unit Normal Distribution), which has a mean of 0 and a standard deviation of 1 (i.e., the red distribution in Figure 4.1.1).
As you can see, z -scores combine information about where the distribution is located (the mean/center) with how wide the
distribution is (the standard deviation/spread) to interpret a raw score (x). Specifically, z -scores will tell us how far the
score is away from the mean in units of standard deviations and in what direction.
The value of a z -score has two parts: the sign (positive or negative) and the magnitude (the actual number). The sign of the
z -score tells you in which half of the distribution the z-score falls: a positive sign (or no sign) indicates that the score is
above the mean and on the right hand-side or upper end of the distribution, and a negative sign tells you the score is below
the mean and on the left-hand side or lower end of the distribution. The magnitude of the number tells you, in units of
standard deviations, how far away the score is from the center or mean. The magnitude can take on any value between negative and positive infinity, but for reasons we will see soon, z-scores generally fall between -3 and 3.
Let’s look at some examples. A z -score value of -1.0 tells us that this z-score is 1 standard deviation (because of the
magnitude 1.0) below (because of the negative sign) the mean. Similarly, a z -score value of 1.0 tells us that this z -score is
1 standard deviation above the mean. Thus, these two scores are the same distance away from the mean but in opposite
directions. A z -score of -2.5 is two-and-a-half standard deviations below the mean and is therefore farther from the center
than both of the previous scores, and a z -score of 0.25 is closer than all of the ones before. In Unit 2, we will learn to
formalize the distinction between what we consider “close” to the center or “far” from the center. For now, we will use a
rough cut-off of 1.5 standard deviations in either direction as the difference between close scores (those within 1.5 standard
deviations or between z = -1.5 and z = 1.5) and extreme scores (those farther than 1.5 standard deviations – below z = -1.5
or above z = 1.5).
We can also convert raw scores into z -scores to get a better idea of where in the distribution those scores fall. Let’s say we
get a score of 68 on an exam. We may be disappointed to have scored so low, but perhaps it was just a very hard exam.
Having information about the distribution of all scores in the class would be helpful to put some perspective on ours. We
find out that the class got an average score of 54 with a standard deviation of 8. To find out our relative location within this
distribution, we simply convert our test score into a z -score.
z = (X − μ) / σ = (68 − 54) / 8 = 1.75
We find that we are 1.75 standard deviations above the average, beyond our rough cut-off of 1.5 for distinguishing close from far scores. Suddenly our 68 is looking pretty good!
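The exam example can be written as a small function, z = (X − μ)/σ. This Python sketch is ours (the function name z_score is an assumption, not from the text):

```python
# Converting a raw score into a z-score.
def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean, and in which direction."""
    return (x - mu) / sigma

z = z_score(68, 54, 8)
print(z)  # 1.75: well above the class mean
```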
which is just slightly below average (note the use of “math” as a subscript; subscripts are used when presenting multiple versions of the same statistic in order to know which one is which and have no bearing on the actual calculation). The critical reading section has a mean of 495 and standard deviation of 116, so
z_CR = (501 − 495) / 116 = 0.05
So even though we were almost exactly average on both tests, we did a little bit better on the critical reading portion
relative to other people.
Finally, z -scores are incredibly useful if we need to combine information from different measures that are on different
scales. Let’s say we give a set of employees a series of tests on things like job knowledge, personality, and leadership. We
may want to combine these into a single score we can use to rate employees for development or promotion, but look what
happens when we take the average of raw scores from different scales, as shown in Table 4.2.1:
Table 4.2.1 : Raw test scores on different scales (ranges in parentheses).
Raw Scores Job Knowledge (0 – 100) Personality (1 –5) Leadership (1 – 5) Average
Because the job knowledge scores were so big and the scores were so similar, they overpowered the other scores and
removed almost all variability in the average. However, if we standardize these scores into z -scores, our averages retain
more variability and it is easier to assess differences between employees, as shown in Table 4.2.2.
Table 4.2.2: Standardized scores.
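The standardize-then-average idea can be sketched in Python. The employee numbers below are hypothetical stand-ins (the book's Table 4.2.1 values are not reproduced here), and the helper name standardize is ours:

```python
# Standardizing each measure before averaging, so that the large
# job-knowledge scale cannot dominate the composite score.
import statistics

def standardize(values):
    """Convert raw scores to z-scores using the sample mean and standard deviation."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]

job_knowledge = [98, 96, 97]   # 0-100 scale (hypothetical)
personality = [4.2, 3.1, 2.5]  # 1-5 scale (hypothetical)
leadership = [1.1, 4.5, 3.2]   # 1-5 scale (hypothetical)

z_columns = [standardize(col) for col in (job_knowledge, personality, leadership)]
composites = [statistics.mean(employee) for employee in zip(*z_columns)]
print([round(c, 2) for c in composites])
```

Because each measure is put on the same z-score scale before averaging, differences between employees survive in the composite instead of being washed out by the largest raw scale.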
x = zσ + μ for the population, or x = zs + X̄ for a sample. Notice that these are just simple rearrangements of the original formulas for calculating z from raw scores.
Let’s say we create a new measure of intelligence, and initial calibration finds that our scores have a mean of 40 and
standard deviation of 7. Three people who have scores of 52, 43, and 34 want to know how well they did on the measure.
We can convert their raw scores into z -scores:
z = (52 − 40) / 7 = 1.71

z = (43 − 40) / 7 = 0.43

z = (34 − 40) / 7 = −0.86

A problem is that these new z-scores aren’t exactly intuitive for many people. We can give people information about their relative location in the distribution (for instance, the first person scored well above average), or we can translate these z-scores into the more familiar metric of IQ scores, which have a mean of 100 and standard deviation of 16:

IQ = 1.71 × 16 + 100 = 127.36

We would also likely round these values to 127, 107, and 86, respectively, for convenience.
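Moving from a z-score to a new metric is the transformation x = zσ + μ. A Python sketch (ours; the function name z_to_scale is an assumption):

```python
# Translating a z-score into the IQ metric (mean 100, standard deviation 16).
def z_to_scale(z, mu, sigma):
    """Map a z-score onto a distribution with the given mean and standard deviation."""
    return z * sigma + mu

iq = z_to_scale(1.71, 100, 16)
print(round(iq, 2), round(iq))  # 127.36 127
```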
Figure 4.3.1 : Shaded areas represent the area under the curve in the tails
We will have much more to say about this concept in the coming chapters. As it turns out, this is a quite powerful idea that
enables us to make statements about how likely an outcome is and what that means for research questions we would like to
answer and hypotheses we would like to test. But first, we need to make a brief foray into some ideas about probability.
Answer:
The location above or below the mean (from the sign of the number) and the distance in standard deviations away from the
mean (from the magnitude of the number).
Answer:
X̄ = 4.2, s = 1.64; z = -1.34, -0.73, 0.49, 0.49, 1.10
4. True or false:
a. All normal distributions are symmetrical
b. All normal distributions have a mean of 1.0
c. All normal distributions have a standard deviation of 1.0
d. The total area under the curve of all normal distributions is equal to 1
5. Interpret the location, direction, and distance (near or far) of the following z -scores:
a. -2.00
b. 1.25
c. 3.50
d. -0.34
Answer:
a. 2 standard deviations below the mean, far
b. 1.25 standard deviations above the mean, near
c. 3.5 standard deviations above the mean, far
d. 0.34 standard deviations below the mean, near
6. Transform the following z -scores into a distribution with a mean of 10 and standard deviation of 2: -1.75, 2.20, 1.65, -0.95
7. Calculate z -scores for the following raw scores taken from a population with a mean of 100 and standard deviation of 16: 112,
109, 56, 88, 135, 99
Answer:
z = 0.75, 0.56, -2.75, -0.75, 2.19, -0.06
Answer:
a. -0.50
b. 0.25
c. 3.00
d. 1.10
10. Calculate the raw score for the following z -scores from a distribution with a mean of 15 and standard deviation of 3:
a. 4.0
5.1: What is Probability?
When we speak of the probability of something happening, we are talking about how likely it is that “thing” will happen based
on the conditions present. For instance, what is the probability that it will rain? That is, how likely do we think it is that it
will rain today under the circumstances or conditions today? To define or understand the conditions that might affect how
likely it is to rain, we might look out the window and say, “it’s sunny outside, so it’s not very likely that it will rain today.”
Stated using probability language: given that it is sunny outside, the probability of rain is low. “Given” is the word we use
to state what the conditions are. As the conditions change, so does the probability. Thus, if it were cloudy and windy
outside, we might say, “given the current weather conditions, there is a high probability that it is going to rain.”
In these examples, we spoke about whether or not it is going to rain. Raining is an example of an event, which is the catch-
all term we use to talk about any specific thing happening; it is a generic term that we specified to mean “rain” in exactly
the same way that “conditions” is a generic term that we specified to mean “sunny” or “cloudy and windy.”
It should also be noted that the terms “low” and “high” are relative and vague, and they will likely be interpreted differently by different people (in other words: given how vague the terminology was, the probability of different interpretations is
high). Most of the time we try to use more precise language or, even better, numbers to represent the probability of our
event. Regardless, the basic structure and logic of our statements are consistent with how we speak about probability using
numbers and formulas.
Let’s look at a slightly deeper example. Say we have a regular, six-sided die (note that “die” is singular and “dice” is
plural, a distinction that Dr. Foster has yet to get correct on his first try) and want to know how likely it is that we will roll
a 1. That is, what is the probability of rolling a 1, given that the die is not weighted (which would introduce what we call a
bias, though that is beyond the scope of this chapter). We could roll the die and see if it is a 1 or not, but that won’t tell us
about the probability, it will only tell us a single result. We could also roll the die hundreds or thousands of times,
recording each outcome and seeing what the final list looks like, but this is time consuming, and rolling a die that many
times may lead down a dark path to gambling or, worse, playing Dungeons & Dragons. What we need is a simple equation
that represents what we are looking for and what is possible.
To calculate the probability of an event, which here is defined as rolling a 1 on an unbiased die, we need to know two
things: how many outcomes satisfy the criteria of our event (stated differently, how many outcomes would count as what we
are looking for) and the total number of outcomes possible. In our example, only a single outcome, rolling a 1, will satisfy
our criteria, and there are a total of six possible outcomes (rolling a 1, rolling a 2, rolling a 3, rolling a 4, rolling a 5, and
rolling a 6). Thus, the probability of rolling a 1 on an unbiased die is 1 in 6 or 1/6. Put into an equation using generic
terms, we get:
Probability of an event = (number of outcomes that satisfy our criteria) / (total number of possible outcomes)   (5.1.1)
We can also use P() as shorthand for probability and A as shorthand for an event:

P(A) = (number of outcomes that count as A) / (total number of possible outcomes)   (5.1.2)
Using this equation, let’s now calculate the probability of rolling an even number on this die:
P(Even Number) = (2, 4, or 6) / (1, 2, 3, 4, 5, or 6) = 3/6 = 1/2
So we have a 50% chance of rolling an even number on this die. The principles laid out here operate under a certain set of
conditions and can be elaborated into ideas that are complex yet powerful and elegant. However, such extensions are not
necessary for a basic understanding of statistics, so we will end our discussion on the math of probability here. Now, let’s
turn back to more familiar topics.
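For readers who like to verify by computation, equation 5.1.2 can be sketched as a short Python function that simply counts outcomes; a fair six-sided die is assumed:

```python
from fractions import Fraction

def probability(event_outcomes, all_outcomes):
    """P(A) = outcomes that count as A / total number of possible outcomes."""
    return Fraction(len(event_outcomes), len(all_outcomes))

die = [1, 2, 3, 4, 5, 6]             # all possible outcomes of one roll
ones = [o for o in die if o == 1]    # outcomes that count as our event
evens = [o for o in die if o % 2 == 0]

print(probability(ones, die))    # 1/6
print(probability(evens, die))   # 1/2
```

Using Fraction keeps the answers as exact ratios, matching the 1/6 and 1/2 worked out above.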
Probability in Pie Charts
Recall that a pie chart represents how frequently a category was observed and that all slices of the pie chart add up to 100%,
or 1. This means that if we randomly select an observation from the data used to create the pie chart, the probability of it taking on
a specific value is exactly equal to the size of that category’s slice in the pie chart.
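As a rough sketch of this slice-to-probability correspondence, the probabilities can be computed directly from category counts; the season data below are hypothetical:

```python
# Hypothetical pie-chart data: counts of favorite season in a class of 20.
counts = {"Spring": 5, "Summer": 10, "Fall": 4, "Winter": 1}
total = sum(counts.values())

# Each slice's share of the pie is exactly the probability of randomly
# drawing an observation from that category.
probabilities = {k: v / total for k, v in counts.items()}
print(probabilities["Summer"])  # 0.5 -- a slice covering half the pie
```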
Contributors
Foster et al. (University of Missouri-St. Louis, Rice University, & University of Houston, Downtown
Campus)
Answer:
Your answer should include information about an event happening under certain conditions given certain criteria. You could
also discuss the relation between probability and the area under the curve or the proportion of the area in a chart.
2. There is a bag with 5 red blocks, 2 yellow blocks, and 4 blue blocks. If you reach in and grab one block without looking, what
is the probability it is red?
3. Under a normal distribution, which of the following is more likely? (Note: this question can be answered without any
calculations if you draw out the distributions and shade properly)
Getting a z -score greater than z = 2.75
Getting a z -score less than z = -1.50
Answer:
Getting a z -score less than z = -1.50 is more likely. z = 2.75 is farther out in the right tail than z = -1.50 is in the left tail;
therefore, there are fewer extreme scores beyond 2.75 than beyond -1.50, regardless of direction.
4. The heights of women in the United States are normally distributed with a mean of 63.7 inches and a standard deviation of 2.7
inches. If you randomly select a woman in the United States, what is the probability that she will be between 65 and 67 inches
tall?
5. The heights of men in the United States are normally distributed with a mean of 69.1 inches and a standard deviation of 2.9
inches. What proportion of men are taller than 6 feet (72 inches)?
Answer:
15.87% or 0.1587
6. You know you need to score at least 82 points on the final exam to pass your class. After the final, you find out that the average
score on the exam was 78 with a standard deviation of 7. How likely is it that you pass the class?
7. What proportion of the area under the normal curve is greater than z = 1.65?
Answer:
4.95% or 0.0495
8. Find the z -score that bounds 25% of the lower tail of the distribution.
9. Find the z -score that bounds the top 9% of the distribution.
Answer:
z = 1.34 (the top 9% means 9% of the area is in the upper tail and 91% is in the body to the left; finding the value in the normal
table closest to .9100 is .9099, which corresponds to z = 1.34)
10. In a distribution with a mean of 70 and standard deviation of 12, what proportion of scores are lower than 55?
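Several of the answers given above can be checked with Python's standard library, which provides a normal distribution object (statistics.NormalDist, available since Python 3.8):

```python
from statistics import NormalDist

std = NormalDist()  # standard normal: mean 0, standard deviation 1

# Question 5: men taller than 72 in., with mu = 69.1 and sigma = 2.9
men = NormalDist(mu=69.1, sigma=2.9)
print(round(1 - men.cdf(72), 4))     # 0.1587

# Question 7: area under the curve above z = 1.65
print(round(1 - std.cdf(1.65), 4))   # 0.0495

# Question 9: z bounding the top 9% (91% of the area lies below it)
print(round(std.inv_cdf(0.91), 2))   # 1.34
```

The cdf and inv_cdf methods play the roles of the z-table lookup and the reverse lookup, respectively.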
1 2/18/2022
6.1: People, Samples, and Populations
Most of what we have dealt with so far has concerned individual scores grouped into samples, with those samples being
drawn from and, hopefully, representative of a population. We saw how we can understand the location of individual
scores within a sample’s distribution via z -scores, and how we can extend that to understand how likely it is to observe
scores higher or lower than an individual score via probability.
Inherent in this work is the notion that an individual score will differ from the mean, which we quantify as a z -score. All of
the individual scores will differ from the mean in different amounts and different directions, which is natural and expected.
We quantify these differences as variance and standard deviation. Measures of spread and the idea of variability in
observations is a key principle in inferential statistics. We know that any observation, whether it is a single score, a set of
scores, or a particular descriptive statistic will differ from the center of whatever distribution it belongs in.
This is equally true of things outside of statistics and formal data collection and analysis. Some days you hear your alarm
and wake up easily, other days you need to hit snooze a few [dozen] times. Some days traffic is light, other days it is very
heavy. Some classes you are able to focus, pay attention, and take good notes, but other days you find yourself zoning out
the entire time. Each individual observation is an insight but is not, by itself, the entire story, and it takes an extreme
deviation from what we expect for us to think that something strange is going on. Being a little sleepy is normal, but being
completely unable to get out of bed might indicate that we are sick. Light traffic is a good thing, but almost no cars on the
road might make us think we forgot it is Saturday. Zoning out occasionally is fine, but if we cannot focus at all, we might
be in a stats class rather than a fun one.
All of these principles carry forward from scores within samples to samples within populations. Just like an individual
score will differ from its mean, an individual sample mean will differ from the true population mean. We encountered this
principle in earlier chapters: sampling error. As mentioned way back in chapter 1, sampling error is an incredibly important
principle. We know ahead of time that if we collect data and compute a sample mean, the observed value of that mean will be
at least slightly off from what we expect it to be based on our supposed population mean; this is natural and expected.
However, if our sample mean is extremely different from what we expect based on the population mean, there may be
something going on.
distribution is called the standard error, the quantification of sampling error, denoted σ_X̄. The formula for standard error is:

σ_X̄ = σ / √n   (6.2.1)
Notice that the sample size is in this equation. As stated above, the sampling distribution refers to samples of a specific
size. That is, all sample means must be calculated from samples of the same size n, such as n = 10, n = 30, or n = 100. This
sample size refers to how many people or observations are in each individual sample, not how many samples are used to
form the sampling distribution. This is because the sampling distribution is a theoretical distribution, not one we will ever
actually calculate or observe. Figure 6.2.1 displays the principles stated here in graphical form.
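As a quick sketch, equation 6.2.1 reduces to a one-line function; the σ = 10 values below are illustrative:

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)  (Equation 6.2.1)."""
    return sigma / sqrt(n)

# Larger samples shrink the standard error:
print(round(standard_error(10, 10), 2))   # 3.16
print(round(standard_error(10, 100), 2))  # 1.0
```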
For samples of a single size n, drawn from a population with a given mean μ and variance σ², the sampling distribution of sample means will have a mean μ_X̄ = μ and variance σ²_X̄ = σ²/n. This distribution will approach normality as n increases.
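A small simulation can make this statement concrete. The sketch below (with values chosen to match the μ = 50, σ = 10 examples used in this chapter) builds an approximate sampling distribution by brute force:

```python
import random
from statistics import mean, stdev

random.seed(42)
mu, sigma, n = 50, 10, 30

# Build an (approximate) sampling distribution: many sample means of size n.
sample_means = [mean(random.gauss(mu, sigma) for _ in range(n))
                for _ in range(10_000)]

print(round(mean(sample_means), 1))   # close to mu = 50
print(round(stdev(sample_means), 1))  # close to sigma / sqrt(n), about 1.8
```

The mean of the simulated sample means sits at μ, and their standard deviation sits at the standard error σ/√n, just as the theorem says.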
From this, we are able to find the standard deviation of our sampling distribution, the standard error. As you can see, just
like any other standard deviation, the standard error is simply the square root of the variance of the distribution.
The last sentence of the central limit theorem states that the sampling distribution will be normal as the sample size of the
samples used to create it increases. What this means is that bigger samples will create a more normal distribution, so we
are better able to use the techniques we developed for normal distributions and probabilities. So how large is large enough?
In general, a sampling distribution will be normal if either of two characteristics is true:
1. the population from which the samples are drawn is normally distributed or
2. the sample size is equal to or greater than 30.
This second criterion is very important because it enables us to use methods developed for normal distributions even if the
true population distribution is skewed.
Figure 6.2.2 : Sampling distributions from the same population with μ = 50 and σ = 10 but different sample sizes (N = 10,
N = 30, N = 50, N = 100)
Let’s say we are drawing samples from a population with a mean of 50 and standard deviation of 10 (the same values used
in Figure 2). What is the probability that we get a random sample of size 10 with a mean greater than or equal to 55? That
is, for n = 10, what is the probability that X̄ ≥ 55? First, we need to convert this sample mean into a z -score:

z = (55 − 50) / (10 / √10) = 5 / 3.16 = 1.58
Now we need to shade the area under the normal curve corresponding to scores greater than z = 1.58 as in Figure 6.3.1:
n ↑  →  σ_X̄ ↓  →  z ↑  →  p ↓   (6.3.3)
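This chain of relationships can be traced numerically; the sketch below reuses the μ = 50, σ = 10, X̄ = 55 example at two sample sizes:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, xbar = 50, 10, 55
for n in (10, 50):
    se = sigma / sqrt(n)
    z = (xbar - mu) / se
    p = 1 - NormalDist().cdf(z)  # probability of a sample mean >= 55
    print(n, round(se, 2), round(z, 2), round(p, 4))
```

As n grows from 10 to 50, the standard error shrinks, the z-score grows, and the probability of a mean that extreme drops sharply.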
Let’s look at this one more way. For the same population with mean 50 and standard deviation 10, what proportion of
sample means fall between 47 and 53 if they are of sample size 10 and sample size 50?
We’ll start again with n = 10. Converting 47 and 53 into z -scores, we get z = -0.95 and z = 0.95, respectively. From our z -
table, we find that the proportion between these two scores is 0.6578 (the process is left for the student to practice converting X̄ to z and z to proportions). So, 65.78% of sample means of sample size 10 will fall between 47 and 53. For n
= 50, our z -scores for 47 and 53 are ±2.13, which gives us a proportion of the area of 0.9668, almost 97%! The shaded regions for each of these sampling distributions are displayed in Figure 6.3.3. The sampling distributions are shown on the original scale, rather than as z -scores, so you can see the effect of the shading and how much of the body falls into the range, which is marked off with dotted lines.
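The same proportions can be computed directly; note that exact CDF values differ slightly in the last decimals from the table-rounded figures above:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 50, 10
for n in (10, 50):
    se = sigma / sqrt(n)
    dist = NormalDist(mu, se)            # sampling distribution of the mean
    between = dist.cdf(53) - dist.cdf(47)  # proportion between 47 and 53
    print(n, round(between, 4))
```

About 66% of sample means fall in the range for n = 10, and roughly 97% for n = 50, matching the worked values to rounding.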
Answer:
The sampling distribution (or sampling distribution of the sample means) is the distribution formed by combining many
sample means taken from the same population and of a single, consistent sample size.
2. What are the two mathematical facts that describe how sampling distributions work?
3. What is the difference between a sampling distribution and a regular distribution?
Answer:
A sampling distribution is made of statistics (e.g. the mean) whereas a regular distribution is made of individual scores.
4. What effect does sample size have on the shape of a sampling distribution?
5. What is standard error?
Answer:
Standard error is the spread of the sampling distribution and is the quantification of sampling error. It is how much we
expect the sample mean to naturally change based on random chance.
6. For a population with a mean of 75 and a standard deviation of 12, what proportion of sample means of size n = 16 fall
above 82?
7. For a population with a mean of 100 and standard deviation of 16, what is the probability that a random sample of size
4 will have a mean between 110 and 130?
Answer:
10.46% or 0.1046
8. Find the z -score for the following means taken from a population with mean 10 and standard deviation 2:
a. X̄ = 8, n = 12
b. X̄ = 8, n = 30
c. X̄ = 20, n = 4
d. X̄ = 20, n = 16
9. As the sample size increases, what happens to the p-value associated with a given sample mean?
Answer:
As sample size increases, the p -value will decrease
10. For a population with a mean of 35 and standard deviation of 7, find the sample mean of size n = 20 that cuts off the
top 5% of the sampling distribution.
7.1: Logic and Purpose of Hypothesis Testing
The statistician R. Fisher explained the concept of hypothesis testing with a story of a lady tasting tea. Here we will present
an example based on James Bond who insisted that martinis should be shaken rather than stirred. Let's consider a
hypothetical experiment to determine whether Mr. Bond can tell the difference between a shaken and a stirred martini.
Suppose we gave Mr. Bond a series of 16 taste tests. In each test, we flipped a fair coin to determine whether to stir or
shake the martini. Then we presented the martini to Mr. Bond and asked him to decide whether it was shaken or stirred.
Let's say Mr. Bond was correct on 13 of the 16 taste tests. Does this prove that Mr. Bond has at least some ability to tell
whether the martini was shaken or stirred?
This result does not prove that he does; it could be he was just lucky and guessed right 13 out of 16 times. But how
plausible is the explanation that he was just lucky? To assess its plausibility, we determine the probability that someone
who was just guessing would be correct 13/16 times or more. This probability can be computed to be 0.0106. This is a
pretty low probability, and therefore someone would have to be very lucky to be correct 13 or more times out of 16 if they
were just guessing. So either Mr. Bond was very lucky, or he can tell whether the drink was shaken or stirred. The
hypothesis that he was guessing is not proven false, but considerable doubt is cast on it. Therefore, there is strong evidence
that Mr. Bond can tell whether a drink was shaken or stirred.
Let's consider another example. The case study Physicians' Reactions sought to determine whether physicians spend less
time with obese patients. Physicians were sampled randomly and each was shown a chart of a patient complaining of a
migraine headache. They were then asked to estimate how long they would spend with the patient. The charts were
identical except that for half the charts, the patient was obese and for the other half, the patient was of average weight. The
chart a particular physician viewed was determined randomly. Thirty-three physicians viewed charts of average-weight
patients and 38 physicians viewed charts of obese patients.
The mean time physicians reported that they would spend with obese patients was 24.7 minutes as compared to a mean of
31.4 minutes for normal-weight patients. How might this difference between means have occurred? One possibility is that
physicians were influenced by the weight of the patients. On the other hand, perhaps by chance, the physicians who
viewed charts of the obese patients tend to see patients for less time than the other physicians. Random assignment of
charts does not ensure that the groups will be equal in all respects other than the chart they viewed. In fact, it is certain the
groups differed in many ways by chance. The two groups could not have exactly the same mean age (if measured precisely
enough such as in days). Perhaps a physician's age affects how long physicians see patients. There are innumerable
differences between the groups that could affect how long they view patients. With this in mind, is it plausible that these
chance differences are responsible for the difference in times?
To assess the plausibility of the hypothesis that the difference in mean times is due to chance, we compute the probability
of getting a difference as large or larger than the observed difference (31.4 - 24.7 = 6.7 minutes) if the difference were, in
fact, due solely to chance. Using methods presented in later chapters, this probability can be computed to be 0.0057. Since
this is such a low probability, we have confidence that the difference in times is due to the patient's weight and is not due to
chance.
Reactions example, the null hypothesis is that in the population of physicians, the mean time expected to be spent with obese
patients is equal to the mean time expected to be spent with average-weight patients. This null hypothesis can be written as:
H0 : μobese − μaverage = 0 (7.3.1)
The null hypothesis in a correlational study of the relationship between high school grades and college grades would typically be
that the population correlation is 0. This can be written as
H0 : ρ = 0 (7.3.2)
H0 : μ = 7.47
The number on the right hand side is our null hypothesis value that is informed by our research question. Notice that we are testing
the value for μ, the population parameter, NOT the sample statistic X̄. This is for two reasons: 1) once we collect data, we know what the value of X̄ is; it is not a mystery or a question, it is observed and known; and 2) we are interested in understanding the population, not just our sample.
Keep in mind that the null hypothesis is typically the opposite of the researcher's hypothesis. In the Physicians' Reactions study, the
researchers hypothesized that physicians would expect to spend less time with obese patients. The null hypothesis that the two
types of patients are treated identically is put forward with the hope that it can be discredited and therefore rejected. If the null
hypothesis were true, a difference as large or larger than the sample difference of 6.7 minutes would be very unlikely to occur.
Therefore, the researchers rejected the null hypothesis of no difference and concluded that in the population, physicians intend to
spend less time with obese patients.
In general, the null hypothesis is the idea that nothing is going on: there is no effect of our treatment, no relation between our
variables, and no difference in our sample mean from what we expected about the population mean. This is always our baseline
starting assumption, and it is what we seek to reject. If we are trying to treat depression, we want to find a difference in average
symptoms between our treatment and control groups. If we are trying to predict job performance, we want to find a relation
between conscientiousness and evaluation scores. However, until we have evidence against it, we must use the null hypothesis as
our starting point.
or H1. The alternative hypothesis is simply the reverse of the null hypothesis, and there are three options, depending on
where we expect the difference to lie. Thus, our alternative hypothesis is the mathematical way of stating our research
question. If we expect our obtained sample mean to be above or below the null hypothesis value, which we call a
directional hypothesis, then our alternative hypothesis takes the form:

HA : μ > 7.47 or HA : μ < 7.47

The direction is based on the research question itself. We should only use a directional hypothesis if we have good reason, based on prior
observations or research, to suspect a particular direction. When we do not know the direction, such as when we are
entering a new area of research, we use a non-directional alternative:
HA : μ ≠ 7.47
We will set different criteria for rejecting the null hypothesis based on the directionality (greater than, less than, or not
equal to) of the alternative. To understand why, we need to see where our criteria come from and how they relate to z -
scores and distributions.
region”). Finding the critical value works exactly the same as finding the z-score corresponding to any area under the
curve like we did in Unit 1. If we go to the normal table, we will find that the z-score corresponding to 5% of the area
under the curve is equal to 1.645 (z = 1.64 corresponds to a tail area of 0.0505 and z = 1.65 corresponds to 0.0495, so .05 is exactly in
between them) if we go to the right and -1.645 if we go to the left. The direction must be determined by your alternative
hypothesis, and drawing then shading the distribution is helpful for keeping directionality straight.
Suppose, however, that we want to do a non-directional test. We need to put the critical region in both tails, but we don’t
want to increase the overall size of the rejection region (for reasons we will see later). To do this, we simply split it in half
so that an equal proportion of the area under the curve falls in each tail’s rejection region. For α = .05, this means 2.5% of
the area is in each tail, which, based on the z-table, corresponds to critical values of z∗ = ±1.96. This is shown in Figure
7.5.2.
To formally test our hypothesis, we compare our obtained z -statistic to our critical z -value. If Z_obt > Z_crit, that means it falls in the rejection region (to see why, draw a line for z = 2.5 on Figure 7.5.1 or Figure 7.5.2) and so we reject H0. If Z_obt < Z_crit, we fail to reject. Remember that as z gets larger, the corresponding area under the curve beyond z gets
smaller. Thus, the proportion, or p-value, will be smaller than the area for α , and if the area is smaller, the probability gets
smaller. Specifically, the probability of obtaining that result, or a more extreme result, under the condition that the null
hypothesis is true gets smaller.
The z -statistic is very useful when we are doing our calculations by hand. However, when we use computer software, it
will report to us a p-value, which is simply the proportion of the area under the curve in the tails beyond our obtained z -
statistic. We can directly compare this p-value to α to test our null hypothesis: if p < α, we reject H0, but if p > α, we fail to reject. Note also that the reverse is always true: if we use critical values to test our hypothesis, we will always know
if p is greater than or less than α . If we reject, we know that p < α because the obtained z -statistic falls farther out into the
tail than the critical z -value that corresponds to α , so the proportion (p-value) for that z -statistic will be smaller.
Conversely, if we fail to reject, we know that the proportion will be larger than α because the z -statistic will not be as far
into the tail. This is illustrated for a one-tailed test in Figure 7.5.3.
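The equivalence of the two decision rules can be demonstrated in a short sketch (an upper-tailed test is assumed):

```python
from statistics import NormalDist

def one_tailed_decision(z_obt, alpha=0.05):
    """The two routes to a decision always agree for an upper-tailed test."""
    z_crit = NormalDist().inv_cdf(1 - alpha)   # e.g. about 1.645 for alpha = .05
    p = 1 - NormalDist().cdf(z_obt)            # area beyond the obtained z
    assert (z_obt > z_crit) == (p < alpha)     # the equivalence described above
    return "reject H0" if p < alpha else "fail to reject H0"

print(one_tailed_decision(2.50))   # reject H0
print(one_tailed_decision(0.13))   # fail to reject H0
```

Comparing z-statistics to critical values and comparing p-values to α are two views of the same cutoff, so they always yield the same decision.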
So our test statistic is z = -2.50, which we can draw onto our rejection region distribution:
This is very similar to our formula for z , but we no longer take into account the sample size (since overly large samples
can make it too easy to reject the null). Cohen’s d is interpreted in units of standard deviations, just like z . For our
example:
d = (7.75 − 8.00) / 0.50 = −0.25 / 0.50 = −0.50
Cohen’s d is interpreted by its magnitude as small, moderate, or large. Specifically, d = 0.20 is small, d = 0.50 is moderate, and d = 0.80 is large. Obviously values can fall in between these guidelines, so we should use our best judgment and the context of the problem to make our final interpretation of size. The magnitude of our effect size happened to be exactly equal to one of these benchmarks, so we say that there was a moderate effect.
Effect sizes are incredibly useful and provide important information and clarification that overcomes some of the weakness
of hypothesis testing. Whenever you find a significant result, you should always calculate an effect size.
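A minimal sketch of the calculation, using the values from the example above (the text reports the magnitude, 0.50):

```python
def cohens_d(xbar, mu, sigma):
    """Effect size in standard-deviation units: d = (xbar - mu) / sigma."""
    return (xbar - mu) / sigma

d = cohens_d(7.75, 8.00, 0.50)
print(d)     # -0.5 (judged by its magnitude, a moderate effect)
size = "large" if abs(d) >= 0.8 else "moderate" if abs(d) >= 0.5 else "small"
print(size)  # moderate
```

Note that, unlike z, no sample size appears anywhere in the function: the effect size does not change just because we collected more data.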
Next you state the alternative hypothesis. You have reason to suspect a specific direction of change, so you make a one-
tailed test:
HA : The average building temperature is higher than claimed
HA : μ > 74
Step 2: Find the Critical Values You know that the most common level of significance is α = 0.05, so you keep that the
same and know that the critical value for a one-tailed z -test is z∗ = 1.645. To keep track of the directionality of the test and
rejection region, you draw out your distribution:
Monday 77
Tuesday 76
Wednesday 74
Thursday 78
Friday 78
You calculate the average of these scores to be X̄ = 76.6 degrees. You use this to calculate the test statistic, using μ = 74
(the supposed average temperature), σ = 1.00 (how much the temperature should vary), and n = 5 (how many data points
you collected):

z = (76.60 − 74.00) / (1.00 / √5) = 2.60 / 0.45 = 5.78
This value falls so far into the tail that it cannot even be plotted on the distribution!
Based on 5 observations, the average temperature (X̄ = 76.6 degrees) is statistically significantly higher than it is supposed to be, z = 5.78, p < .05.
The effect size you calculate is definitely large, meaning someone has some explaining to do!
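The whole test can be reproduced in a few lines; computing the standard error without intermediate rounding gives z ≈ 5.81 rather than the 5.78 above, with the same conclusion:

```python
from math import sqrt
from statistics import mean, NormalDist

temps = [77, 76, 74, 78, 78]     # Monday through Friday readings
mu, sigma = 74, 1.00             # claimed mean and claimed variability

xbar = mean(temps)               # 76.6
z = (xbar - mu) / (sigma / sqrt(len(temps)))
p = 1 - NormalDist().cdf(z)      # one-tailed p-value

print(round(xbar, 1), round(z, 2))   # 76.6 5.81
print(p < 0.05)                      # True -- reject H0
```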
Step 2: Find the Critical Values We have seen the critical values for z -tests at α = 0.05 levels of significance several times.
To find the values for α = 0.01, we will go to the standard normal table and find the z -score cutting off 0.005 (0.01 divided
by 2 for a two-tailed test) of the area in the tail, which is z∗ = ±2.575. Notice that this cutoff is much higher than it was for
α = 0.05. This is because we need much less of the area in the tail, so we need to go very far out to find the cutoff. As a
result, this will require a much larger effect or much larger sample size in order to reject the null hypothesis.
Step 3: Calculate the Test Statistic We can now calculate our test statistic. We will use σ = 10 as our known population
standard deviation and the following data to calculate our sample mean:
61 62
65 61
58 59
54 61
60 63
The average of these scores is X̄ = 60.40. From this we calculate our z -statistic as:

z = (60.40 − 60.00) / (10.00 / √10) = 0.40 / 3.16 = 0.13
Step 4: Make the Decision Our obtained z -statistic, z = 0.13, is very small. It is much less than our critical value of 2.575.
Thus, this time, we fail to reject the null hypothesis. Our conclusion would look something like:
Based on the sample of 10 scores, we cannot conclude that there is an effect causing the mean (X̄ = 60.40) to be
statistically significantly different from 60.00, z = 0.13, p > 0.01.
Notice two things about the end of the conclusion. First, we wrote that p is greater than instead of p is less than, like we
did in the previous two examples. This is because we failed to reject the null hypothesis. We don’t know exactly what the
p -value is, but we know it must be larger than the α level we used to test our hypothesis. Second, we used 0.01 instead of
the usual 0.05, because this time we tested at a different level. The number you compare to the p-value should always be
the significance level you test at.
Finally, because we did not detect a statistically significant effect, we do not need to calculate an effect size.
Answer:
Your answer should include mention of the baseline assumption of no difference between the sample and the
population.
Answer:
Alpha is the significance level. It is the criterion we use when deciding whether to reject or fail to reject the null hypothesis,
corresponding to a given proportion of the area under the normal distribution and a probability of finding extreme
scores assuming the null hypothesis is true.
4. Why do we phrase null and alternative hypotheses with population parameters and not sample means?
5. If our null hypothesis is “H0 : μ = 40”, what are the three possible alternative hypotheses?
Answer:
HA : μ ≠ 40, HA : μ > 40, HA : μ < 40
6. Why do we state our hypotheses and decision criteria before we collect our data?
7. When and why do you calculate an effect size?
Answer:
We calculate an effect size when we find a statistically significant result to see if our result is practically meaningful or
important
8. Determine whether you would reject or fail to reject the null hypothesis in the following situations:
a. z = 1.99, two-tailed test at α = 0.05
b. z = 0.34, z∗ = 1.645
c. p = 0.03, α = 0.05
d. p = 0.015, α = 0.01
9. You are part of a trivia team and have tracked your team’s performance since you started playing, so you know that
your scores are normally distributed with μ = 78 and σ = 12. Recently, a new person joined the team, and you think the
scores have gotten better. Use hypothesis testing to see if the average score has improved based on the following 8
weeks’ worth of score data: 82, 74, 62, 68, 79, 94, 90, 81, 80.
Answer:
Step 1: H0 : μ = 78 “The average score is not different after the new person joined”, HA : μ > 78 “The average score has gone up since the new person joined.”
Step 2: One-tailed test to the right, assuming α = 0.05, z∗ = 1.645.
Step 4: z > z∗, Reject H0. Based on 8 weeks of games, we can conclude that our average score (X̄ = 88.75) is higher now that the new person is on the team, z = 2.54, p < .05. Since the result is significant, we need an effect size: Cohen’s d = 0.90, which is a large effect.
10. You get hired as a server at a local restaurant, and the manager tells you that servers’ tips are $42 on average but vary
about $12 (μ = 42, σ = 12). You decide to track your tips to see if you make a different amount, but because this is your
first job as a server, you don’t know if you will make more or less in tips. After working 16 shifts, you find that your
8.1: The t-statistic
Last chapter, we were introduced to hypothesis testing using the z -statistic for sample means that we learned in Unit 1. This was a
useful way to link the material and ease us into the new way of looking at data, but it isn’t a very common test because it relies on
knowing the population’s standard deviation, σ, which is rarely going to be the case. Instead, we will estimate that parameter σ
using the sample statistic s in the same way that we estimate μ using X̄ (μ will still appear in our formulas because we suspect
something about its value and that is what we are testing). Our new statistic is called t, and for testing one population mean using a
single sample (called a 1-sample t -test) it takes the form:

t = (X̄ − μ) / s_X̄ = (X̄ − μ) / (s / √n)   (8.1.1)
Notice that t looks almost identical to z ; this is because they test the exact same thing: the value of a sample mean compared to
what we expect of the population. The only difference is that the standard error is now denoted s_X̄ to indicate that we use the sample statistic for standard deviation, s, instead of the population parameter σ. The process of using and interpreting the standard
error and the full test statistic remain exactly the same.
In chapter 3 we learned that the formulae for sample standard deviation and population standard deviation differ by one key factor:
the denominator for the parameter is N but the denominator for the statistic is N – 1, also known as degrees of freedom, df .
Because we are using a new measure of spread, we can no longer use the standard normal distribution and the z -table to find our
critical values. For t -tests, we will use the t -distribution and t -table to find these values.
The t -distribution, like the standard normal distribution, is symmetric and bell-shaped with a mean of 0 and standard error
(as the measure of standard deviation for sampling distributions) of 1. However, because the calculation of standard error uses
degrees of freedom, there will be a different t-distribution for every degree of freedom. Luckily, they all work exactly the same, so
in practice this difference is minor.
Figure 8.1.1 shows four curves: a normal distribution curve labeled z, and three t -distribution curves for 2, 10, and 30 degrees of
freedom. Two things should stand out: First, for lower degrees of freedom (e.g. 2), the tails of the distribution are much fatter,
meaning a larger proportion of the area under the curve falls in the tail. This means that we will have to go farther out into the
tail to cut off the portion corresponding to 5% or α = 0.05, which will in turn lead to higher critical values. Second, as the degrees
of freedom increase, we get closer and closer to the z curve. Even the distribution with df = 30, corresponding to a sample size of
just 31 people, is nearly indistinguishable from z . In fact, a t -distribution with infinite degrees of freedom (theoretically, of course)
is exactly the standard normal distribution. Because of this, the bottom row of the t -table also includes the critical values for z -tests
at the specific significance levels. Even though these curves are very close, it is still important to use the correct table and critical
values, because small differences can add up quickly.
HA : This shop takes longer to change oil than your old mechanic
HA : μ > 30
Step 2: Find the Critical Values As noted above, our critical values still delineate the area in the tails under the curve
corresponding to our chosen level of significance. Because we have no reason to change significance levels, we will use α
= 0.05, and because we suspect a direction of effect, we have a one-tailed test. To find our critical values for t , we need to
add one more piece of information: the degrees of freedom. For this example:
df = N – 1 = 4 – 1 = 3
Going to our t-table, we find the column corresponding to our one-tailed significance level and find where it intersects
with the row for 3 degrees of freedom. As shown in Figure 8.2.1, our critical value is t∗ = 2.353.
X   (X − X̄)   (X − X̄)²
46   -7.75   60.06
58   4.25   18.06
40   -13.75   189.06
71   17.25   297.56
Σ = 215   Σ = 0   Σ = 564.74
After filling in the first column to get Σ = 215, we find that the mean is X̄ = 53.75 (215 divided by sample size 4), which
allows us to fill in the rest of the table to get our sum of squares SS = 564.74, which we then plug in to the formula for
standard deviation from chapter 3:
s = √(∑(X − X̄)²/(N − 1)) = √(SS/df) = √(564.74/3) = 13.72
Next, we take this value and plug it in to the formula for standard error:
s_X̄ = s/√n = 13.72/2 = 6.86
And, finally, we put the standard error, sample mean, and null hypothesis value into the formula for our test statistic t :
t = (X̄ − μ)/s_X̄ = (53.75 − 30)/6.86 = 23.75/6.86 = 3.46
This may seem like a lot of steps, but it is really just taking our raw data to calculate one value at a time and carrying that
value forward into the next equation: data → sample size/degrees of freedom → mean → sum of squares → standard deviation →
standard error → test statistic. At each step, we simply match the symbols of what we just calculated to where they appear in
the next formula to make sure we are plugging everything in correctly.
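If you would like to verify this chain of calculations, here is a short Python sketch (not part of the original text; the data and variable names simply mirror the oil-change example). Note that keeping full precision gives SS = 564.75; the 564.74 above comes from rounding each squared deviation to two decimals.

```python
import math

# Oil-change times (minutes) from the four visits in the example
data = [46, 58, 40, 71]
mu = 30                                   # null hypothesis value (old mechanic's average)

n = len(data)
df = n - 1                                # 4 - 1 = 3
mean = sum(data) / n                      # 53.75
ss = sum((x - mean) ** 2 for x in data)   # Sum of Squares = 564.75 at full precision
s = math.sqrt(ss / df)                    # sample standard deviation, about 13.72
se = s / math.sqrt(n)                     # standard error, about 6.86
t = (mean - mu) / se                      # test statistic, about 3.46
```

Since 3.46 is larger than the critical value of 2.353, the script agrees with the decision made in Step 4.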
Step 4: Make the Decision Now that we have our critical value and test statistic, we can make our decision using the same
criteria we used for a z -test. Our obtained t -statistic was t = 3.46 and our critical value was t∗ = 2.353 : t > t∗ , so we
reject the null hypothesis and conclude:
Based on our four oil changes, the new mechanic takes longer on average (X̄ = 53.75) to change oil than our old mechanic,
t(3) = 3.46, p < .05.
This is a large effect. It should also be noted that for some things, like the minutes in our current example, we can also
interpret the magnitude of the difference we observed (23 minutes and 45 seconds) as an indicator of importance since
time is a familiar metric.
One important consideration when calculating the margin of error is that it can only be calculated using the critical value
for a two-tailed test. This is because the margin of error moves away from the point estimate in both directions, so a one-
tailed value does not make sense.
The critical value we use will be based on a chosen level of confidence, which is equal to 1 – α. Thus, a 95% level of
confidence corresponds to α = 0.05, so at the 0.05 level of significance we create a 95% Confidence Interval. How to
interpret that is discussed further on.
Once we have our margin of error calculated, we add it to our point estimate for the mean to get an upper bound to the
confidence interval and subtract it from the point estimate for the mean to get a lower bound for the confidence interval:
Upper Bound = X̄ + Margin of Error
Lower Bound = X̄ − Margin of Error   (8.3.2)
Or simply:
Confidence Interval = X̄ ± t∗(s/√n)   (8.3.3)
To write out a confidence interval, we always use soft brackets and put the lower bound, a comma, and the upper bound:
Let’s see what this looks like with some actual numbers by taking our oil change data and using it to create a 95%
confidence interval estimating the average length of time it takes at the new mechanic. We already found that our average
was X̄ = 53.75 and our standard error was s_X̄ = 6.86. We also found a critical value to test our hypothesis, but remember
that we were testing a one-tailed hypothesis, so that critical value won't work. To see why that is, look at the column
headers on the t -table. The column for one-tailed α = 0.05 is the same as a two-tailed α = 0.10. If we used the old critical
value, we’d actually be creating a 90% confidence interval (1.00-0.10 = 0.90, or 90%). To find the correct value, we use
the column for two-tailed α = 0.05 and, again, the row for 3 degrees of freedom, to find t∗ = 3.182.
Now we have all the pieces we need to construct our confidence interval. The margin of error is t∗(s_X̄) = 3.182(6.86) = 21.83, so:
UB = 53.75 + 21.83 = 75.58
LB = 53.75 − 21.83 = 31.92
Thus, our 95% confidence interval is (31.92, 75.58): a range of plausible values for the mean that could
have been the case based on what we know about how much these scores vary (i.e. our standard error).
It is very tempting to also interpret this interval by saying that we are 95% confident that the true population mean falls
within the range (31.92, 75.58), but this is not true. The reason it is not true is that phrasing our interpretation this way
suggests that we have firmly established an interval and the population mean does or does not fall into it, suggesting that
our interval is firm and the population mean will move around. However, the population mean is an absolute that does not
change; it is our interval that will vary from data collection to data collection, even taking into account our standard error.
The correct interpretation, then, is that we are 95% confident that the range (31.92, 75.58) brackets the true population
mean. This is a very subtle difference, but it is an important one.
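The confidence-interval arithmetic above can be sketched in a few lines of Python (an illustration, not part of the original text), using the two-tailed critical value t∗ = 3.182 read from the t-table for df = 3:

```python
mean, se = 53.75, 6.86   # sample mean and standard error from the oil-change example
t_crit = 3.182           # two-tailed critical value, alpha = 0.05, df = 3

moe = t_crit * se        # margin of error, about 21.83
lb = mean - moe          # lower bound, about 31.92
ub = mean + moe          # upper bound, about 75.58
```

This reproduces the interval (31.92, 75.58) found by hand above.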
HA : There is a difference in how friendly the local community is compared to the national average
HA : μ ≠ 38
Step 2: Find the Critical Values We need our critical values in order to determine the width of our margin of error. We will
assume a significance level of α = 0.05 (which will give us a 95% CI). From the t -table, a two-tailed critical value at α =
0.05 with 29 degrees of freedom (N – 1 = 30 – 1 = 29) is t∗ = 2.045.
Step 3: Calculations Now we can construct our confidence interval. After we collect our data, we find that the average
person in our community scored 39.85, or X̄ = 39.85, and our standard deviation was s = 5.61. First, we need to use this
standard deviation, plus our sample size of N = 30, to calculate our standard error:
s_X̄ = s/√n = 5.61/5.48 = 1.02
Now we can put that value, our point estimate for the sample mean, and our critical value from step 2 into the formula for
a confidence interval:
The margin of error is t∗(s_X̄) = 2.045(1.02) = 2.09, so:
UB = 39.85 + 2.09 = 41.94
LB = 39.85 − 2.09 = 37.76
Step 4: Make the Decision Finally, we can compare our confidence interval to our null hypothesis value. The null value of
38 is higher than our lower bound of 37.76 and lower than our upper bound of 41.94. Thus, the confidence interval
brackets our null hypothesis value, and we fail to reject the null hypothesis:
Fail to Reject H0. Based on our sample of 30 people, our community is not different in average friendliness (X̄ = 39.85)
than the nation as a whole, 95% CI = (37.76, 41.94).
Answer:
A z -test uses population standard deviation for calculating standard error and gets critical values based on the standard
normal distribution. A t-test uses sample standard deviation as an estimate when calculating standard error and gets
critical values from the t-distribution based on degrees of freedom.
Answer:
As the level of confidence gets higher, the interval gets wider. In order to speak with more confidence about having
found the population mean, you need to cast a wider net. This happens because critical values for higher confidence
levels are larger, which creates a wider margin of error.
4. Construct a confidence interval around the sample mean X̄ = 25 for the following conditions:
a. N = 25, s = 15, 95% confidence level
b. N = 25, s = 15, 90% confidence level
c. s_X̄ = 4.5, α = 0.05, df = 20
Answer:
False: a confidence interval is a range of plausible scores that may or may not bracket the true population mean.
6. You hear that college campuses may differ from the general population in terms of political affiliation, and you want to
use hypothesis testing to see if this is true and, if so, how big the difference is. You know that the average political
affiliation in the nation is μ = 4.00 on a scale of 1.00 to 7.00, so you gather data from 150 college students across the
nation to see if there is a difference. You find that the average score is 3.76 with a standard deviation of 1.52. Use a 1-
sample t -test to see if there is a difference at the α = 0.05 level.
7. You hear a lot of talk about increasing global temperature, so you decide to see for yourself if there has been an actual
change in recent years. You know that the average land temperature from 1951-1980 was 8.79 degrees Celsius. You
find annual average temperature data from 1981-2017 and decide to construct a 99% confidence interval (because you
want to be as sure as possible and look for differences in both directions, not just one) using this data to test for a
difference from the previous average.
Answer:
X̄ = 9.44, s = 0.35, s_X̄ = 0.06, df = 36, t∗ = 2.719, 99% CI = (9.28, 9.60); CI does not bracket μ, reject null
hypothesis. d = 1.83
Answer:
Step 1: H0: μ = 0 “The average person has a neutral opinion towards craft beer”, HA: μ ≠ 0 “Overall people will
have an opinion about craft beer, either good or bad.”
Step 2: Two-tailed test, df = 54, t∗ = 2.009.
Step 3: X̄ = 1.10, s_X̄ = 0.05, t = 22.00.
Step 4: t > t∗, Reject H0. Based on opinions from 55 people, we can conclude that the average opinion of craft beer (X̄ = 1.10)
is positive, t(54) = 22.00, p < .05. Since the result is significant, we need an effect size: Cohen’s d = 2.75, which is a large effect.
10. You want to know if college students have more stress in their daily lives than the general population (μ = 12), so you
gather data from 25 people to test your hypothesis. Your sample has an average stress score of X̄ = 13.11 and a standard
deviation of s = 3.89. Use a 1-sample t-test to see if there is a difference.
1 2/18/2022
9.1: Change and Differences
Researchers are often interested in change over time. Sometimes we want to see if change occurs naturally, and other times
we are hoping for change in response to some manipulation. In each of these cases, we measure a single variable at
different times, and what we are looking for is whether or not we get the same score at time 2 as we did at time 1. The
absolute value of our measurements does not matter – all that matters is the change. Let’s look at an example:
Table 9.1.1 : Raw and difference scores before and after training.
Before After Improvement
6 9 3
7 7 0
4 10 6
1 3 2
8 10 2
Table 9.1.1 shows scores on a quiz that five employees received before they took a training course and after they took the
course. The difference between these scores (i.e. the score after minus the score before) represents improvement in the
employees’ ability. This third column is what we look at when assessing whether or not our training was effective. We
want to see positive scores, which indicate that the employees’ performance went up. What we are not interested in is how
good they were before they took the training or after the training. Notice that the lowest scoring employee before the
training (with a score of 1) improved just as much as the highest scoring employee before the training (with a score of 8),
regardless of how far apart they were to begin with. There’s also one improvement score of 0, meaning that the training did
not help this employee. An important factor in this is that the participants received the same assessment at both time
points. To calculate improvement or any other difference score, we must measure only a single variable.
When looking at change scores like the ones in Table 9.1.1, we calculate our difference scores by taking the time 2 score
and subtracting the time 1 score. That is:
X_D = X_T2 − X_T1   (9.1.1)
Where X_D is the difference score, X_T1 is the score on the variable at time 1, and X_T2 is the score on the variable at time 2.
The difference score, X_D, will be the data we use to test for improvement or change. We subtract time 2 minus time 1 for
ease of interpretation; if scores get better, then the difference score will be positive. Similarly, if we’re measuring
something like reaction time or depression symptoms that we are trying to reduce, then better outcomes (lower scores) will
yield negative difference scores.
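As a quick illustration (not part of the original text), the difference scores in Table 9.1.1 can be computed in Python by subtracting each time-1 score from its paired time-2 score:

```python
# Quiz scores for the five employees in Table 9.1.1
before = [6, 7, 4, 1, 8]    # time 1
after = [9, 7, 10, 3, 10]   # time 2

# X_D = X_T2 - X_T1, so improvement shows up as a positive score
diffs = [x2 - x1 for x1, x2 in zip(before, after)]   # [3, 0, 6, 2, 2]
```

The pairing via zip matters: each difference score links one person's two measurements, which is exactly what makes this analysis of change possible.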
We can also test to see if people who are matched or paired in some way agree on a specific topic. For example, we can see
if a parent and a child agree on the quality of home life, or we can see if two romantic partners agree on how serious and
committed their relationship is. In these situations, we also subtract one score from the other to get a difference score. This
time, however, it doesn’t matter which score we subtract from the other because what we are concerned with is the
agreement.
In both of these types of data, what we have are multiple scores on a single variable. That is, a single observation or data
point is comprised of two measurements that are put together into one difference score. This is what makes the analysis of
change unique – our ability to link these measurements in a meaningful way. This type of analysis would not work if we
had two separate samples of people that weren’t related at the individual level, such as samples of people from different
states that we gathered independently. Such datasets and analyses are the subject of the following chapter.
As with our other null hypotheses, we express the null hypothesis for paired samples t -tests in both words and
mathematical notation. The exact wording of the written-out version should be changed to match whatever research
question we are addressing (e.g. “ There is no change in ability scores after training”). However, the mathematical version
of the null hypothesis is always exactly the same: the average change score is equal to zero. Our population parameter for
the average is still μ , but it now has a subscript D to denote the fact that it is the average change score and not the average
raw observation before or after our manipulation. Obviously individual difference scores can go up or down, but the null
hypothesis states that these positive or negative change values are just random chance and that the true average change
score across all people is 0.
Our alternative hypotheses will also follow the same format that they did before: they can be directional if we suspect a
change or difference in a specific direction, or we can use an inequality sign to test for any change:
HA : There is a change or difference
HA : μD ≠ 0
As before, your choice of which alternative hypothesis to use should be specified before you collect data based on your
research question and any evidence you might have that would indicate a specific directional (or non-directional) change.
Test Statistic
Our test statistic for our change scores follows exactly the same format as it did for our 1-sample t -test. In fact, the only
difference is in the data that we use. For our change test, we first calculate a difference score as shown above. Then, we
use those scores as the raw data in the same mean calculation, standard error formula, and t -statistic. Let’s look at each of
these.
The mean difference score is calculated in the same way as any other mean: sum each of the individual difference scores
and divide by the sample size.
X̄_D = ΣX_D / n   (9.2.1)
We will find the numerator, the Sum of Squares, using the same table format that we learned in chapter 3, which gives us the
standard deviation of the difference scores, s_D = √(SS/df) (9.2.2). Once we have our standard deviation, we can find the
standard error:
s_X̄D = s_D / √n   (9.2.3)
As we can see, once we calculate our difference scores from our raw measurements, everything else is exactly the same.
Let’s see an example.
In this case, we are hoping that the changes we made will improve employee satisfaction, and, because we based the changes on
employee recommendations, we have good reason to believe that they will. Thus, we will use a one-directional alternative
hypothesis.
Step 2: Find the Critical Values Our critical values will once again be based on our level of significance, which we know is α =
0.05, the directionality of our test, which is one-tailed to the right, and our degrees of freedom. For our dependent-samples t -test,
the degrees of freedom are still given as df = n– 1 . For this problem, we have 40 people, so our degrees of freedom are 39. Going
to our t-table, we find that the critical value is t∗ = 1.685 as shown in Figure 9.3.1.
Now, we can put that value, along with our sample mean and null hypothesis value, into the formula for t and calculate the test
statistic:
t = (X̄_D − μ_D)/s_X̄D = (2.96 − 0)/0.46 = 6.43
Notice that, because the null hypothesis value of a dependent samples t -test is always 0, we can simply divide our obtained sample
mean by the standard error.
Step 4: Make the Decision We have obtained a test statistic of t = 6.43 that we can compare to our previously established critical
value of t∗ = 1.685. 6.43 is larger than 1.685, so t > t∗ and we reject the null hypothesis:
Reject H0. Based on data from 40 workers, we can conclude that the changes significantly improved
job satisfaction (X̄_D = 2.96) among the workers, t(39) = 6.43, p < 0.05.
Because this result was statistically significant, we will want to calculate Cohen’s d as an effect size using the same format as we
did for the last t -test:
d = X̄_D / s_D = 2.96 / 2.85 = 1.04
This is a large effect size. Notice again that we can omit the null hypothesis value here because it is always equal to 0.
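These two calculations can be sketched in Python from the summary values reported in the example (an illustration only; the text's rounded standard error of 0.46 is used directly, since s_D = 2.85 and n = 40 reproduce it only approximately):

```python
mean_d = 2.96   # average difference score from the example
se_d = 0.46     # standard error of the difference scores, as reported in the text
sd_d = 2.85     # standard deviation of the difference scores

t = (mean_d - 0) / se_d   # null value is always 0, so t is about 6.43
d = mean_d / sd_d         # Cohen's d, about 1.04
```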
Hopefully the above example made it clear that running a dependent samples t -test to look for differences before and after some
treatment works exactly the same way as a regular 1-sample t -test does, which was just a small change in how z -tests were
performed in chapter 7. At this point, this process should feel familiar, and we will continue to make small adjustments to this
familiar process as we encounter new types of data to test new types of research questions.
Step 2: Find the Critical Values Just like with our regular hypothesis testing procedure, we will need critical values from
the appropriate level of significance and degrees of freedom in order to form our confidence interval. Because we have 7
participants, our degrees of freedom are df = 6. From our t-table, we find that the critical value corresponding to this df at
this level of significance is t∗ = 1.943.
Step 3: Calculate the Confidence Interval The data collected before (time 1) and after (time 2) the participants viewed the
commercial is presented in Table 9.4.1. In order to build our confidence interval, we will first have to calculate the mean
and standard deviation of the difference scores, which are also in Table 9.4.1. As a reminder, the difference scores are
calculated as Time 2 – Time 1.
Table 9.4.1 : Opinions of the bank
Time 1 Time 2 XD
3 2 -1
3 6 3
5 3 -2
8 4 -4
3 9 6
1 2 1
4 5 1
X̄_D = ΣX_D / n = 4/7 = 0.57
The standard deviation will be solved by first using the Sum of Squares Table:
Table 9.4.2 : Sum of Squares
X_D   (X_D − X̄_D)   (X_D − X̄_D)²
-1 -1.57 2.46
3 2.43 5.90
-2 -2.57 6.60
-4 -4.57 20.88
6 5.43 29.48
1 0.43 0.18
1 0.43 0.18
Σ = 4 Σ = 0 Σ = 65.68
s_D = √(SS/df) = √(65.68/6) = √10.95 = 3.31
We now have all the pieces needed to compute our confidence interval:
95% CI = X̄_D ± t∗(s_X̄D)
where the standard error is s_X̄D = s_D/√n = 3.31/√7 = 1.25, giving a margin of error of 1.943(1.25) = 2.43:
UB = 0.57 + 2.43 = 3.00
LB = 0.57 − 2.43 = −1.86
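The whole interval can be reproduced from the raw scores in Table 9.4.1 with a short Python sketch (not part of the original text; tiny differences from the hand calculation come from rounding):

```python
import math

# Opinions of the bank before and after the commercial (Table 9.4.1)
time1 = [3, 3, 5, 8, 3, 1, 4]
time2 = [2, 6, 3, 4, 9, 2, 5]

diffs = [x2 - x1 for x1, x2 in zip(time1, time2)]   # difference scores, Time 2 - Time 1
n = len(diffs)
mean_d = sum(diffs) / n                             # about 0.57
ss = sum((x - mean_d) ** 2 for x in diffs)          # Sum of Squares, about 65.71
sd = math.sqrt(ss / (n - 1))                        # about 3.31
se = sd / math.sqrt(n)                              # about 1.25
t_crit = 1.943                                      # critical value used in the text for df = 6
lb = mean_d - t_crit * se                           # about -1.86
ub = mean_d + t_crit * se                           # about 3.00
```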
Step 4: Make the Decision Remember that the confidence interval represents a range of values that seem plausible or
reasonable based on our observed data. The interval spans -1.86 to 3.00, which includes 0, our null hypothesis value.
Because the null hypothesis value is in the interval, it is considered a reasonable value, and because it is a reasonable
value, we have no evidence against it. We fail to reject the null hypothesis.
Fail to Reject H0. Based on our focus group of 7 people, we cannot say that the average change in opinion (X̄_D = 0.57)
was any better or worse after viewing the commercial, CI: (-1.86, 3.00).
As before, we only report the confidence interval to indicate how we performed the test.
Answer:
A 1-sample t-test uses raw scores to compare an average to a specific value. A dependent samples t-test uses two raw
scores from each person to calculate difference scores and test whether the average difference score differs from zero. The
calculations, steps, and interpretation are exactly the same for each.
2. Name 3 research questions that could be addressed using a dependent samples t -test.
3. What are difference scores and why do we calculate them?
Answer:
Difference scores indicate change or discrepancy relative to a single person or pair of people. We calculate them to
eliminate individual differences in our study of change or agreement.
5. A researcher is interested in testing whether explaining the processes of statistics helps increase trust in computer
algorithms. He wants to test for a difference at the α = 0.05 level and knows that some people may trust the algorithms
less after the training, so he uses a two-tailed test. He gathers pre-post data from 35 people and finds that the average
difference score is X̄_D = 12.10 with a standard deviation of s_D = 17.39. Conduct a hypothesis test to answer the
research question.
Answer:
Step 1: H0: μ_D = 0 “The average change in trust of algorithms is 0”, HA: μ_D ≠ 0 “People’s opinions of how much they
trust algorithms changes.”
Step 2: Two-tailed test, df = 34, t∗ = 2.032.
Step 3: X̄_D = 12.10, s_X̄D = 2.94, t = 4.12.
Step 4: t > t∗, Reject H0. Based on opinions from 35 people, we can conclude that people trust algorithms more (X̄_D =
12.10) after learning statistics, t(34) = 4.12, p < .05. Since the result is significant, we need an effect size: Cohen’s d
= 0.70, which is a moderate to large effect.
6. Decide whether you would reject or fail to reject the null hypothesis in the following situations:
a. X̄_D = 3.50, s_D = 1.10, n = 12, α = 0.05, two-tailed test
Time 1 Time 2
61 83
75 89
91 98
83 92
74 80
82 88
98 98
82 77
69 88
76 79
91 91
70 80
Answer:
Time 1 Time 2 XD
61 83 22
75 89 14
91 98 7
83 92 9
74 80 6
82 88 6
98 98 0
82 77 -5
69 88 19
76 79 3
91 91 0
70 80 10
8. You want to know if an employee’s opinion about an organization is the same as the opinion of that employee’s boss.
You collect data from 18 employee-supervisor pairs and code the difference scores so that positive scores indicate that
the employee has a higher opinion and negative scores indicate that the boss has a higher opinion (meaning that
difference scores of 0 indicate no difference and complete agreement). You find that the mean difference score is X̄_D =
-3.15 with a standard deviation of s_D = 1.97. Test this hypothesis at the α = 0.01 level.
9. Construct confidence intervals from a mean of X̄_D = 1.25, standard error of s_X̄D = 0.45, and df = 10 at the 90%, 95%,
and 99% confidence level. Describe what happens as confidence changes and whether to reject H0.
Answer:
At the 90% confidence level, t∗ = 1.812 and CI = (0.43, 2.07) so we reject H0. At the 95% confidence level, t∗ = 2.228
and CI = (0.25, 2.25) so we reject H0. At the 99% confidence level, t∗ = 3.169 and CI = (-0.18, 2.68) so we fail to
reject H0. As the confidence level goes up, our interval gets wider (which is why we have higher confidence), and
eventually we do not reject the null hypothesis because the interval is so wide that it contains 0.
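The answer above can be verified with a small Python sketch (an illustration, not part of the original text; the critical values are taken from the t-table for df = 10):

```python
mean_d, se = 1.25, 0.45   # mean difference score and standard error from the problem

# two-tailed critical values for df = 10 at each confidence level
crit = {0.90: 1.812, 0.95: 2.228, 0.99: 3.169}

# each interval is mean plus/minus t* times the standard error
intervals = {level: (mean_d - t * se, mean_d + t * se) for level, t in crit.items()}
```

Printing `intervals` shows the width growing with the confidence level, and only the 99% interval contains 0.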
10. A professor wants to see how much students learn over the course of a semester. A pre-test is given before the class
begins to see what students know ahead of time, and the same test is given at the end of the semester to see what
students know at the end. The data are below. Test for an improvement at the α = 0.05 level. Did scores increase? How
much did scores increase?
Pretest Posttest XD
90 89
60 66
95 99
93 91
95 100
10.1: Difference of Means
Last chapter, we learned about mean differences, that is, the average value of difference scores. Those difference scores
came from ONE group and TWO time points (or two perspectives). Now, we will deal with the difference of the means,
that is, the average values of separate groups that are represented by separate descriptive statistics. This analysis involves
TWO groups and ONE time point. As with all of our other tests, both of these analyses are concerned with a single
variable.
It is very important to keep these two tests separate and understand the distinctions between them because they assess very
different questions and require different approaches to the data. When in doubt, think about how the data were collected
and where they came from. If they came from two time points with the same people (sometimes referred to as
“longitudinal” data), you know you are working with repeated measures data (the measurement literally was repeated) and
will use a repeated measures/dependent samples t -test. If the data came from a single time point that used separate groups, you
need to look at the nature of those groups and whether they are related. Can individuals in one group be meaningfully matched
up with one and only one individual from the other group? For example, are they a romantic couple? If so, we call those
data matched and we use a matched pairs/dependent samples t -test. However, if there’s no logical or meaningful way to
link individuals across groups, or if there is no overlap between the groups, then we say the groups are independent and
use the independent samples t -test, the subject of this chapter.
H0 : μ1 = μ2
or
H0 : μ1 − μ2 = 0
Both of these formulations of the null hypothesis tell us exactly the same thing: that the numerical value of the means is
the same in both groups. This is more clear in the first formulation, but the second formulation also makes sense (any
number minus itself is always zero) and helps us out a little when we get to the math of the test statistic. Either one is
acceptable and you only need to report one. The English interpretation of both of them is also the same:
Our alternative hypotheses are also unchanged: we simply replace the equal sign (=) with one of the three inequalities (>,
<, ≠):
HA : μ1 > μ2
HA : μ1 < μ2
HA : μ1 ≠ μ2
Or
HA : μ1 − μ2 > 0
HA : μ1 − μ2 < 0
HA : μ1 − μ2 ≠ 0
Whichever formulation you chose for the null hypothesis should be the one you use for the alternative hypothesis (be
consistent), and the interpretation of them is always the same:
Notice that we are now dealing with two means instead of just one, so it will be very important to keep track of which
mean goes with which population and, by extension, which dataset and sample data. We use subscripts to differentiate
between the populations, so make sure to keep track of which is which. If it is helpful, you can also use more descriptive
subscripts. To use the experimental medication example:
H0 : There is no difference between the means of the treatment and control groups
H0 : μtreatment = μcontrol
HA : There is a difference between the means of the treatment and control groups
HA : μtreatment ≠ μcontrol
Once we have our hypotheses laid out, we can set our criteria to test them using the same three pieces of information as
before: significance level (α ), directionality (left, right, or two-tailed), and degrees of freedom, which for an independent
samples t -test are:
df = n1 + n2 − 2
This looks different than before, but it is just adding the individual degrees of freedom from each group (n– 1) together.
Notice that the sample sizes, n , also get subscripts so we can tell them apart.
This looks like more work to calculate, but remember that our null hypothesis states that the quantity μ 1 − μ2 = 0 , so we
can drop that out of the equation and are left with:
t = (X̄1 − X̄2) / s_(X̄1−X̄2)   (10.4.2)
Our standard error in the denominator is still a standard deviation (s) with a subscript denoting what it is the standard error
of. Because we are dealing with the difference between two separate means, rather than a single mean or single mean of
difference scores, we put both means in the subscript. Calculating our standard error, as we will see next, is where the
biggest differences between this t -test and other t -tests appears. However, once we do calculate it and use it in our test
statistic, everything else goes back to normal. Our decision criterion is still comparing our obtained test statistic to our
critical value, and our interpretation based on whether or not we reject the null hypothesis is unchanged as well.
subscript p serves as a reminder indicating that it is the pooled variance. The term “pooled variance” is a literal name because we
are simply pooling or combining the information on variance – the Sum of Squares and Degrees of Freedom – from both of our
samples into a single number. The result is a weighted average of the observed sample variances, the weight for each being
determined by the sample size, and will always fall between the two observed variances. The computational formula for the pooled
variance is:
s²_p = ((n1 − 1)s²_1 + (n2 − 1)s²_2) / (n1 + n2 − 2)   (10.5.1)
This formula can look daunting at first, but it is in fact just a weighted average. Even more conveniently, some simple algebra can
be employed to greatly reduce the complexity of the calculation. The simpler and more appropriate formula to use when calculating
pooled variance is:
s²_p = (SS1 + SS2) / (df1 + df2)   (10.5.2)
Using this formula, it’s very simple to see that we are just adding together the same pieces of information we have been calculating
since chapter 3. Thus, when we use this formula, the pooled variance is not nearly as intimidating as it might have originally
seemed.
Once we have our pooled variance calculated, we can drop it into the equation for our standard error:
s_(X̄1−X̄2) = √(s²_p/n1 + s²_p/n2)   (10.5.3)
Once again, although this formula may seem different than it was before, in reality it is just a different way of writing the same
thing. An alternative but mathematically equivalent way of writing our old standard error is:
s_X̄ = s/√n = √(s²/n)   (10.5.4)
Looking at that, we can now see that, once again, we are simply adding together two pieces of information: no new logic or
interpretation required. Once the standard error is calculated, it goes in the denominator of our test statistic, as shown above and as
was the case in all previous chapters. Thus, the only additional step to calculating an independent samples t-statistic is computing
the pooled variance. Let’s see an example in action.
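Before the worked example, the two formulas can be sketched as a hypothetical pair of Python helper functions (not part of the original text; the function names are my own):

```python
import math

def pooled_variance(ss1, ss2, df1, df2):
    # Formula 10.5.2: combine the Sums of Squares and degrees of freedom
    # from both samples into a single weighted-average variance
    return (ss1 + ss2) / (df1 + df2)

def standard_error_diff(sp2, n1, n2):
    # Formula 10.5.3: standard error of the difference between two means,
    # dividing the pooled variance by each sample size before taking the root
    return math.sqrt(sp2 / n1 + sp2 / n2)
```

For instance, `standard_error_diff(pooled_variance(ss1, ss2, n1 - 1, n2 - 1), n1, n2)` gives the denominator of the independent samples t-statistic.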
or
H0 : μ1 = μ2
HA : The comedy film will give a better average mood than the horror film
HA : μ1 − μ2 > 0
or
HA : μ1 > μ2
Notice that in the first formulation of the alternative hypothesis we say that the first mean minus the second mean will be
greater than zero. This is based on how we code the data (higher is better), so we suspect that the mean of the first group
will be higher. Thus, we will have a larger number minus a smaller number, which will be greater than zero. Be sure to pay
attention to which group is which and how your data are coded (higher is almost always used as better outcomes) to make
sure your hypothesis makes sense!
Step 2: Find the Critical Values Just like before, we will need critical values, which come from our t -table. In this
example, we have a one-tailed test at α = 0.05 and expect a positive answer (because we expect the difference between the
means to be greater than zero). Our degrees of freedom for our independent samples t-test is just the degrees of freedom
from each group added together: 35 + 29 – 2 = 62. From our t-table, we find that our critical value is t∗ = 1.671. Note that
because 62 does not appear on the table, we use the next lowest value, which in this case is 60.
Step 3: Compute the Test Statistic The data from our two groups are presented in the tables below. Table 10.6.1 shows the
values for the Comedy group, and Table 10.6.2 shows the values for the Horror group. Values for both have already been
placed in the Sum of Squares tables since we will need to use them for our further calculations. As always, the column on
the left is our raw data.
Table 10.6.1 : Raw scores and Sum of Squares for Group 1
X   (X − X̄)   (X − X̄)²
Using the sum of the first column for each table, we can calculate the mean for each group:

X̄1 = 840 / 35 = 24.00

And

X̄2 = 478.60 / 29 = 16.50
These values were used to calculate the middle rows of each table, which sum to zero as they should (the middle column
for group 2 sums to a very small value instead of zero due to rounding error – the exact mean is 16.50344827586207, but
that’s far more than we need for our purposes). Squaring each of the deviation scores in the middle columns gives us the
values in the third columns, which sum to our next important value: the Sum of Squares for each group: SS1 = 5061.60
and SS2 = 3896.45. These values have all been calculated and take on the same interpretation as they have since chapter 3
– no new computations yet. Before we move on to the pooled variance that will allow us to calculate standard error, let’s
compute our standard deviation for each group; even though we will not use them in our calculation of the test statistic,
they are still important descriptors of our data:
s1 = √(SS1 / df1) = √(5061.60 / 34) = 12.20

And

s2 = √(SS2 / df2) = √(3896.45 / 28) = 11.80
Now we can move on to our new calculation, the pooled variance, which is just the Sums of Squares that we calculated
from our table and the degrees of freedom, which is just n − 1 for each group:

sp² = (SS1 + SS2) / (df1 + df2) = (5061.60 + 3896.45) / (34 + 28) = 8958.05 / 62 = 144.48
As you can see, if you follow the regular process of calculating standard deviation using the Sum of Squares table, finding
the pooled variance is very easy. Now we can use that value to calculate our standard error, the last step before we can find
our test statistic:
s(X̄1−X̄2) = √(sp²/n1 + sp²/n2) = √(144.48/35 + 144.48/29) = √(4.13 + 4.98) = √9.11 = 3.02
Finally, we can use our standard error and the means we calculated earlier to compute our test statistic. Because the null
hypothesis value of μ1 − μ2 is 0.00, we will leave that portion out of the equation for simplicity:

t = (X̄1 − X̄2) / s(X̄1−X̄2) = (24.00 − 16.50) / 3.02 = 2.48
The process of calculating our obtained test statistic t = 2.48 followed the same sequence of steps as before: use raw data
to compute the mean and sum of squares (this time for two groups instead of one), use the sum of squares and degrees of
freedom to calculate standard error (this time using pooled variance instead of standard deviation), and use that standard
error and the observed means to get t. Now we can move on to the final step of the hypothesis testing procedure.
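Although the text carries out these computations by hand, the same sequence (pooled variance, then standard error, then t) can be sketched in Python; the function names are my own, and the numbers are those of the comedy/horror example:

```python
import math

def pooled_variance(ss1, ss2, n1, n2):
    # s_p^2 = (SS1 + SS2) / (df1 + df2), as in the chapter
    return (ss1 + ss2) / ((n1 - 1) + (n2 - 1))

def independent_t(mean1, mean2, ss1, ss2, n1, n2):
    sp2 = pooled_variance(ss1, ss2, n1, n2)
    # standard error of the difference between the two means
    se = math.sqrt(sp2 / n1 + sp2 / n2)
    return (mean1 - mean2) / se

t = independent_t(24.00, 16.50, 5061.60, 3896.45, 35, 29)
print(round(t, 2))  # 2.48
```

The helper also reproduces the pooled variance of 144.48 from the worked example.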
Step 4: Make the Decision Our test statistic has a value of t = 2.48, and in step 2 we found that the critical value is t* =
1.671. 2.48 > 1.671, so we reject the null hypothesis:

Reject H0. Based on our sample data from people who watched different kinds of movies, we can say that the average
mood after a comedy movie (X̄1 = 24.00) is better than the average mood after a horror movie (X̄2 = 16.50), t(62) = 2.48, p < .05.
For our example above, we can calculate the effect size to be:

d = (X̄1 − X̄2) / √sp² = (24.00 − 16.50) / √144.48 = 7.50 / 12.02 = 0.62
We interpret this using the same guidelines as before, so we would consider this a moderate or moderately large effect.
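As a quick arithmetic check, Cohen's d for an independent-samples design divides the mean difference by the pooled standard deviation; this sketch (function name my own) uses the example's numbers:

```python
import math

def cohens_d(mean1, mean2, pooled_var):
    # d = (mean difference) / sqrt(pooled variance)
    return (mean1 - mean2) / math.sqrt(pooled_var)

print(round(cohens_d(24.00, 16.50, 144.48), 2))  # 0.62
```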
Our confidence intervals also take on the same form and interpretation as they have in the past. The value we are interested
in is the difference between the two means, so our point estimate is the value of one mean minus the other, or X̄1 − X̄2.
Just like before, this is our observed effect and is the same value as the one we place in the numerator of our test
statistic. We calculate this value then place the margin of error – still our critical value times our standard error – above and
below it. That is:
Confidence Interval = (X̄1 − X̄2) ± t*(s(X̄1−X̄2))   (10.7.2)
Because our hypothesis testing example used a one-tailed test, it would be inappropriate to calculate a confidence interval
on those data (remember that we can only calculate a confidence interval for a two-tailed test because the interval extends
in both directions). Let’s say we find summary statistics on the average life satisfaction of people from two different towns
and want to create a confidence interval to see if the difference between the two might actually be zero.
Our sample data are X̄1 = 28.65, s1 = 12.40, n1 = 40 and X̄2 = 25.40, s2 = 15.68, n2 = 42. At face value, it looks like
the people from the first town have higher life satisfaction (28.65 vs. 25.40), but it will take a confidence interval (or
complete hypothesis testing process) to see if that is true or just due to random chance. First, we want to calculate the
difference between our sample means, which is 28.65 – 25.40 = 3.25. Next, we need a critical value from our t -table. If we
want to test at the normal 95% level of confidence, then our sample sizes will yield degrees of freedom equal to 40 + 42 –
2 = 80. From our table, that gives us a critical value of t∗ = 1.990. Finally, we need our standard error. Recall that our
standard error for an independent samples t -test uses pooled variance, which requires the Sum of Squares and degrees of
freedom. Up to this point, we have calculated the Sum of Squares using raw data, but in this situation, we do not have
access to it. So, what are we to do?
If we have summary data like standard deviation and sample size, it is very easy to calculate the pooled variance, and the
key lies in rearranging the formulas to work backwards through them. We need the Sum of Squares and degrees of
freedom to calculate our pooled variance. Degrees of freedom is very simple: we just take the sample size minus 1.00 for
each group. Getting the Sum of Squares is also easy: remember that variance is standard deviation squared and is the Sum
of Squares divided by the degrees of freedom. That is:
s² = (s)² = SS / df   (10.7.3)
To get the Sum of Squares, we just multiply both sides of the above equation by the degrees of freedom to get:
s² × df = SS   (10.7.4)

For our two groups:

(s1)² × df1 = SS1: (12.40)² × (40 − 1) = 5996.64

(s2)² × df2 = SS2: (15.68)² × (42 − 1) = 10080.36
All of these steps are just slightly different ways of using the same formulae, numbers, and ideas we have worked with up
to this point. Once we get our standard error, it’s time to build our confidence interval.
U B = 3.25 + 6.25
U B = 9.50
LB = 3.25 − 6.25
LB = −3.00
Our confidence interval, as always, represents a range of values that would be considered reasonable or plausible based on
our observed data. In this instance, our interval (-3.00, 9.50) does contain zero. Thus, even though the means look a little
bit different, it may very well be the case that the life satisfaction in both of these towns is the same. Proving otherwise
would require more data.
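Working backwards from summary statistics, as this example does, is easy to automate. The sketch below (function name my own) recovers each Sum of Squares from the standard deviations, pools them, and builds the interval; its endpoints differ from the text's (−3.00, 9.50) only by rounding:

```python
import math

def ci_mean_diff(m1, s1, n1, m2, s2, n2, t_crit):
    ss1 = s1 ** 2 * (n1 - 1)                     # SS = s^2 * df
    ss2 = s2 ** 2 * (n2 - 1)
    sp2 = (ss1 + ss2) / ((n1 - 1) + (n2 - 1))    # pooled variance
    se = math.sqrt(sp2 / n1 + sp2 / n2)          # standard error
    diff = m1 - m2
    moe = t_crit * se                            # margin of error
    return diff - moe, diff + moe

lb, ub = ci_mean_diff(28.65, 12.40, 40, 25.40, 15.68, 42, 1.990)
print(round(lb, 2), round(ub, 2))
```

Because the interval straddles zero, the conclusion is the same as in the text.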
Answer:
The difference of the means is one mean, calculated from a set of scores, compared to another mean which is calculated
from a different set of scores; the independent samples t-test looks for whether the two separate values are different
from one another. This is different than the “mean of the differences” because the latter is a single mean computed on a
single set of difference scores that come from one data collection of matched pairs. So, the difference of the means
deals with two numbers but the mean of the differences is only one number.
2. Describe three research questions that could be tested using an independent samples t -test.
3. Calculate pooled variance from the following raw data:
Group 1 Group 2
16 4
11 10
9 15
7 13
5 12
4 9
12 8
Answer:
SS1 = 106.86, SS2 = 78.86, sp² = 15.48
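The answer can be verified directly from the raw scores; this sketch (function name my own) computes each group's Sum of Squares and pools them:

```python
def sum_of_squares(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

g1 = [16, 11, 9, 7, 5, 4, 12]
g2 = [4, 10, 15, 13, 12, 9, 8]
ss1, ss2 = sum_of_squares(g1), sum_of_squares(g2)
sp2 = (ss1 + ss2) / ((len(g1) - 1) + (len(g2) - 1))  # pooled variance
print(round(ss1, 2), round(ss2, 2), round(sp2, 2))  # 106.86 78.86 15.48
```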
5. Determine whether to reject or fail to reject the null hypothesis in the following situations:
a. t(40) = 2.49, α = 0.01, one-tailed test to the right
b. X̄1 = 64, X̄2 = 54, n1 = 14, n2 = 12, s(X̄1−X̄2) = 9.75, α = 0.05, two-tailed test
Answer:
a. Reject
b. Fail to Reject
c. Reject
6. A professor is interested in whether or not the type of software program used in a statistics lab affects how well students
learn the material. The professor teaches the same lecture material to two classes but has one class use a point-and-click
software program in lab and has the other class use a basic programming language. The professor tests for a difference
between the two classes on their final exam scores.
Point-and-Click Programming
83 86
83 79
63 100
77 74
86 70
84 67
78 83
61 85
65 74
75 86
100 87
60 61
90 76
66 100
54
7. A researcher wants to know if there is a difference in how busy someone is based on whether that person identifies as
an early bird or a night owl. The researcher gathers data from people in each group, coding the data so that higher
scores represent higher levels of being busy, and tests for a difference between the two at the .05 level of significance.
Early Bird Night Owl
23 26
28 10
27 20
33 19
26 26
30 18
22 12
25 25
26
Answer:
Step 1: H0 : μ1 – μ2 = 0 “There is no difference in the average busyness of early birds versus night owls”,
HA : μ1 – μ2 ≠ 0 “There is a difference in the average busyness of early birds versus night owls.”
Step 2: Two-tailed test, df = 15, t∗ = 2.131.
Step 3: X̄1 = 26.67, X̄2 = 19.50, sp² = 23.73, s(X̄1−X̄2) = 2.37
Step 4: t > t*, Reject H0. Based on our data of early birds and night owls, we can conclude that early birds are busier
(X̄1 = 26.67) than night owls (X̄2 = 19.50), t(15) = 3.03, p < .05. Since the result is significant, we need an effect
size: Cohen’s d = 1.47, which is a large effect.
8. Lots of people claim that having a pet helps lower their stress level. Use the following summary data to test the claim
that there is a lower average stress level among pet owners (group 1) than among non-owners (group 2) at the .05 level
of significance.
X̄1 = 16.25, X̄2 = 20.95, s1 = 4.00, s2 = 5.10, n1 = 29, n2 = 25
9. Administrators at a university want to know if students in different majors are more or less extroverted than others.
They provide you with descriptive statistics they have for English majors (coded as 1) and History majors (coded as 2)
Answer:

X̄1 − X̄2 = 1.55, t* = 1.990, s(X̄1−X̄2) = 0.45, CI = (0.66, 2.44). This confidence interval does not contain zero, so it
does suggest that there is a difference between the extroversion of English majors and History majors.
10. Researchers want to know if people’s awareness of environmental issues varies as a function of where they live. The
researchers have the following summary data from two states, Alaska and Hawaii, that they want to use to test for a
difference.
X̄H = 47.50, X̄A = 45.70, sH = 14.65, sA = 13.20, nH = 139, nA = 150
1 2/18/2022
11.1: Observing and Interpreting Variability
We have seen time and again that scores, be they individual data or group means, will differ naturally. Sometimes this is
due to random chance, and other times it is due to actual differences. Our job as scientists, researchers, and data analysts is
to determine if the observed differences are systematic and meaningful (via a hypothesis test) and, if so, what is causing
those differences. Through this, it becomes clear that, although we are usually interested in the mean or average score, it is
the variability in the scores that is key.
Take a look at Figure 11.1.1, which shows scores for many people on a test of skill used as part of a job application. The
x-axis has each individual person, in no particular order, and the y -axis contains the score each person received on the test.
As we can see, the job applicants differed quite a bit in their performance, and understanding why that is the case would be
extremely useful information. However, there’s no interpretable pattern in the data, especially because we only have
information on the test, not on any other variable (remember that the x-axis here only shows individual people and is not
ordered or interpretable).
Suppose, however, that we also know which group each applicant belongs in. Now that we can differentiate between applicants this way, a pattern starts to emerge: those applicants with a
relevant degree (coded red) tend to be near the top, those applicants with no college degree (coded black) tend to be near
the bottom, and the applicants with an unrelated degree (coded green) tend to fall into the middle. However, even within
these groups, there is still some variability, as shown in Figure 11.1.2.
Our second variable is our outcome variable. This is the variable on which people differ, and we are trying to explain or
account for those differences based on group membership. In the example above, our outcome was the score each person
earned on the test. Our outcome variable will still use X for scores as before. When describing the outcome variable using
means, we will use subscripts to refer to specific group means. So if we have k = 3 groups, our means will be X̄1, X̄2, and
X̄3. We will also have a single mean representing the average of all participants across all groups. This is known as the
grand mean, and we use the symbol X̄G. These different means – the individual group means and the overall grand mean –
will all be used in our calculations.
Each group also has its own sample size: n1, n2, and n3. We also have the overall sample size in our dataset, and we will
denote this with a capital N. The total sample size is just the group sample sizes added together. Using these, the Between
Groups Sum of Squares is:

SSB = Σ nj(X̄j − X̄G)²

The subscript j refers to the jth group, where j = 1…k, to keep track of which group mean and sample size we are
working with. As you can see, the only difference between this equation and the familiar sum of squares for variance is
that we are adding in the sample size. Everything else logically fits together in the same way.
The Within Groups Sum of Squares takes each person’s score relative to their own group mean:

SSW = Σ(Xij − X̄j)²

In this instance, because we are calculating this deviation score for each individual person, there is no need to multiply by
how many people we have. The subscript j again represents a group and the subscript i refers to a specific person. So, Xij
is read as “the ith person of the jth group.” It is important to remember that the deviation score for each person is only
calculated relative to their group mean: do not calculate these scores relative to the other group means.
The Total Sum of Squares compares each individual score to the grand mean:

SST = Σ(Xij − X̄G)²

We can see that our Total Sum of Squares is just each individual score minus the grand mean. As with our Within Groups
Sum of Squares, we are calculating a deviation score for each individual person, so we do not need to multiply anything by
the sample size; that is only done for Between Groups Sum of Squares.
An important feature of the sums of squares in ANOVA is that they all fit together. We could work through the algebra to
demonstrate that if we added together the formulas for SSB and SSW, we would end up with the formula for SST. That
is:

SST = SSB + SSW   (11.2.4)
This will prove to be very convenient, because if we know the values of any two of our sums of squares, it is very quick
and easy to find the value of the third. It is also a good way to check calculations: if you calculate each SS by hand, you
can make sure that they all fit together as shown above, and if not, you know that you made a math mistake somewhere.
We can see from the above formulas that calculating an ANOVA by hand from raw data can take a very, very long time.
For this reason, you will not be required to calculate the SS values by hand, but you should still take the time to understand
how they fit together and what each one represents to ensure you understand the analysis itself.
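The additive relation SST = SSB + SSW also makes it easy to recover whichever of the three values is missing; a small sketch (names my own):

```python
def missing_ss(ss_total=None, ss_between=None, ss_within=None):
    # Exactly one argument is left as None; it is recovered
    # from the identity SST = SSB + SSW.
    if ss_total is None:
        return ss_between + ss_within
    if ss_between is None:
        return ss_total - ss_within
    return ss_total - ss_between

print(missing_ss(ss_between=8246, ss_within=3020))  # 11266
```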
Table 11.3.1: ANOVA table

Source    SS     df       MS          F
Between   SSB    k − 1    SSB/dfB     MSB/MSW
Within    SSW    N − k    SSW/dfW
Total     SST    N − 1
The first column of the ANOVA table, labeled “Source”, indicates which of our sources of variability we are using:
between groups, within groups, or total. The second column, labeled “SS”, contains our values for the sums of squares that
we learned to calculate above. As noted previously, calculating these by hand takes too long, and so the formulas are not
presented in Table 11.3.1. However, remember that the Total is the sum of the other two, in case you are only given two
SS values and need to calculate the third.
The next column, labeled “df ”, is our degrees of freedom. As with the sums of squares, there is a different df for each
source, and the formulas are presented in the table. Notice that the total degrees of freedom, N − 1, is the same as it was for
our regular variance. This matches the SST formulation to again indicate that we are simply taking our familiar variance
term and breaking it up into different sources. Also remember that the capital N in the df calculations refers to the
overall sample size, not a specific group sample size. Notice that the total row for degrees of freedom, just like for sums of
squares, is just the Between and Within rows added together. If you take (N − k) + (k − 1), then the “−k” and “+k” portions
will cancel out, and you are left with N − 1. This is a convenient way to quickly check your calculations.
The third column, labeled “M S ”, is our Mean Squares for each source of variance. A “mean square” is just another way to
say variability. Each mean square is calculated by dividing the sum of squares by its corresponding degrees of freedom.
Notice that we do this for the Between row and the Within row, but not for the Total row. There are two reasons for this.
First, our Total Mean Square would just be the variance in the full dataset (put together the formulas to see this for
yourself), so it would not be new information. Second, the Mean Square values for Between and Within would not add up
to equal the Mean Square Total because they are divided by different denominators. This is in contrast to the first two
columns, where the Total row was both the conceptual total (i.e. the overall variance and degrees of freedom) and the
literal total of the other two rows.
The final column in the ANOVA table, labeled “F ”, is our test statistic for ANOVA. The F statistic, just like a t - or z -
statistic, is compared to a critical value to see whether we can reject or fail to reject a null hypothesis. Thus, although the
calculations look different for ANOVA, we are still doing the same thing that we did in all of Unit 2. We are simply using a
new type of data to test our hypotheses. We will see what these hypotheses look like shortly, but first, we must take a
moment to address why we are doing our calculations this way.
For only three groups, we would have three t -tests: group 1 vs group 2, group 1 vs group 3, and group 2 vs group 3. This
may not sound like a lot, especially with the advances in technology that have made running an analysis very fast, but it
quickly scales up. With just one additional group, bringing our total to four, we would have six comparisons: group 1 vs
group 2, group 1 vs group 3, group 1 vs group 4, group 2 vs group 3, group 2 vs group 4, and group 3 vs group 4. This
makes for a logistical and computational nightmare for five or more groups.
A bigger issue, however, is our probability of committing a Type I Error. Remember that a Type I error is a false positive,
and the chance of committing a Type I error is equal to our significance level, α . This is true if we are only running a
single analysis (such as a t -test with only two groups) on a single dataset. However, when we start running multiple
analyses on the same dataset, our Type I error rate increases, raising the probability that we are capitalizing on random
chance and rejecting a null hypothesis when we should not. ANOVA, by comparing all groups simultaneously with a
single analysis, averts this issue and keeps our error rate at the α we set.
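To see how quickly the problem grows, the sketch below (not from the text) counts the pairwise comparisons for k groups and, under the simplifying assumption that the tests are independent, the probability of at least one Type I error:

```python
def familywise_error(k, alpha=0.05):
    m = k * (k - 1) // 2                # number of pairwise comparisons
    # P(at least one false positive), assuming independent tests
    return m, 1 - (1 - alpha) ** m

for k in (2, 3, 4, 5):
    m, fwe = familywise_error(k)
    print(k, m, round(fwe, 3))
```

With four groups, the six tests already push the familywise rate above .26.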
H0 : μ1 = μ2 = μ3
We list as many μ parameters as groups we have. In the example above, we have three groups to test, so we have three
parameters in our null hypothesis. If we had more groups, say, four, we would simply add another μ to the list and give it
the appropriate subscript, giving us:
H0 : μ1 = μ2 = μ3 = μ4
Notice that we do not say that the means are all equal to zero, we only say that they are equal to one another; it does not
matter what the actual value is, so long as it holds for all groups equally.
Our alternative hypothesis for ANOVA is a little bit different. Let’s take a look at it and then dive deeper into what it
means:

HA : At least one mean is different

The first difference is obvious: there is no mathematical statement of the alternative hypothesis in ANOVA. This is due to
the second difference: we are not saying which group is going to be different, only that at least one will be. Because we do
not hypothesize about which mean will be different, there is no way to write it mathematically. Related to this, we do not
have directional hypotheses (greater than or less than) like we did in Unit 2. Due to this, our alternative hypothesis is
always exactly the same: at least one mean is different.
In Unit 2, we saw that, if we reject the null hypothesis, we can adopt the alternative, and this made it easy to understand
what the differences looked like. In ANOVA, we will still adopt the alternative hypothesis as the best explanation of our
data if we reject the null hypothesis. However, when we look at the alternative hypothesis, we can see that it does not give
us much information. We will know that a difference exists somewhere, but we will not know where that difference is. Is
only group 1 different but groups 2 and 3 the same? Is it only group 2? Are all three of them different? Based on just our
alternative hypothesis, there is no way to be sure. We will come back to this issue later and see how to find out specific
differences. For now, just remember that we are testing for any difference in group means, and it does not matter where
that difference occurs.
Now that we have our hypotheses for ANOVA, let’s work through an example. We will continue to use the data from
Figures 11.1.1 through 11.1.3 for continuity.
H0 : μ1 = μ2 = μ3
Again, we phrase our null hypothesis in terms of what we are actually looking for, and we use a number of population
parameters equal to our number of groups. Our alternative hypothesis is always exactly the same.
The dfB is often called the “Degrees of Freedom: Numerator” because it is the degrees of freedom value used to calculate the Mean Square Between,
which in turn was the numerator of our F statistic. Likewise, the dfW is the “df denom.” (short for denominator) because
it is the degrees of freedom value used to calculate the Mean Square Within, which was our denominator for F.
The formula for dfB is k − 1, and remember that k is the number of groups we are assessing. In this example, k = 3 so our
dfB = 2. This tells us that we will use the second column, the one labeled 2, to find our critical value. To find the proper
row, we simply calculate the dfW, which was N − k. The original prompt told us that we have “three groups of 10 people”
each, so N = 30 and dfW = 30 − 3 = 27. Looking at the column for 2 and the row
for 27, we find that our critical value is 3.35. We use this critical value the same way as we did before: it is our criterion
against which we will compare our obtained test statistic to determine statistical significance.
Source    SS
Between   8246
Within    3020
Total
These may seem like random numbers, but remember that they are based on the distances between the groups themselves
and within each group. Figure 11.6.2 shows the plot of the data with the group means and grand mean included. If we
wanted to, we could use this information, combined with our earlier information that each group has 10 people, to
calculate the Between Groups Sum of Squares by hand. However, doing so would take some time, and without the specific
values of the data points, we would not be able to calculate our Within Groups Sum of Squares, so we will trust that these
values are the correct ones.
Source    SS
Between   8246
Within    3020
Total     11266
We also calculated our degrees of freedom earlier, so we can fill in those values. Additionally, we know that the total
degrees of freedom is N – 1, which is 29. This value of 29 is also the sum of the other two degrees of freedom, so
everything checks out.
Table 11.6.3: Total Sum of Squares

Source    SS      df    MS    F
Between   8246    2
Within    3020    27
Total     11266   29

Now we have everything we need to calculate our mean squares. Our MS values for each row are just the SS divided by
the df for that row, giving us:

Table 11.6.4: Total Sum of Squares

Source    SS      df    MS        F
Between   8246    2     4123.00
Within    3020    27    111.85
Total     11266   29
Remember that we do not calculate a Total Mean Square, so we leave that cell blank. Finally, we have the information we
need to calculate our test statistic: F is our MSB divided by MSW:

F = MSB / MSW = 4123.00 / 111.85 = 36.86

Source    SS      df    MS        F
Between   8246    2     4123.00   36.86
Within    3020    27    111.85
Total     11266   29
So, working our way through the table given only two SS values and the sample size and group size given before, we
calculate our test statistic to be Fobt = 36.86, which we will compare to the critical value in step 4.
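The table-filling arithmetic above is mechanical, so it can be sketched in Python (names my own; values from the example):

```python
def anova_table(ss_b, ss_w, k, n_total):
    df_b, df_w = k - 1, n_total - k
    ms_b, ms_w = ss_b / df_b, ss_w / df_w   # MS = SS / df
    return {"SS_T": ss_b + ss_w,
            "df": (df_b, df_w, n_total - 1),
            "MS_B": ms_b, "MS_W": ms_w,
            "F": ms_b / ms_w}

tbl = anova_table(8246, 3020, k=3, n_total=30)
print(round(tbl["F"], 2))  # 36.86
```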
Step 4: Make the Decision Our obtained statistic (F = 36.86) is larger than our critical value (3.35), so we can reject the null hypothesis.
Reject H0. Based on our 3 groups of 10 people, we can conclude that job test scores are statistically significantly different
based on educational attainment, F(2, 27) = 36.86, p < .05.

Since the result is significant, we calculate our effect size for ANOVA, eta-squared:

η² = SSB / SST   (11.7.1)

For our example, η² = 8246 / 11266 = 0.73.
So, we are able to explain 73% of the variance in job test scores based on education. This is, in fact, a huge effect size, and
most of the time we will not explain nearly that much variance. Our guidelines for the size of our effects are:
Table 11.7.1: Guidelines for the size of our effects

η²      Size
0.01    Small
0.09    Medium
0.25    Large
So, we found that not only do we have a statistically significant result, but that our observed effect was very large!
However, we still do not know specifically which groups are different from each other. It could be that they are all
different, or that only those who have a relevant degree are different from the others, or that only those who have no degree
are different from the others. To find out which is true, we need to do a special analysis called a post hoc test.
Bonferroni Test
A Bonferroni test is perhaps the simplest post hoc analysis. A Bonferroni test is a series of t -tests performed on each pair
of groups. As we discussed earlier, the number of comparisons grows quickly with the number of groups, which inflates Type I
error rates. To avoid this, a Bonferroni test divides our significance level α by the number of comparisons we are making
so that, when they are all run, they sum back up to our original Type I error rate. Once we have our new significance level,
we simply run independent samples t -tests to look for differences between our pairs of groups. This adjustment is
sometimes called a Bonferroni Correction, and it is easy to do by hand if we want to compare obtained p-values to our new
corrected α level, but it is more difficult to do when using critical values like we do for our analyses, so we will leave our
discussion of it at that.
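The correction itself is one line: divide α by the number of pairwise comparisons (a sketch, function name my own):

```python
def bonferroni_alpha(alpha, k_groups):
    m = k_groups * (k_groups - 1) // 2   # number of pairwise comparisons
    return alpha / m

print(round(bonferroni_alpha(0.05, 3), 4))  # 0.0167
```

Each of the three pairwise t-tests would then be judged against α ≈ .0167 instead of .05.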
As we can see, none of these intervals contain 0.00, so we can conclude that all three groups are different from one
another.
Scheffe’s Test
Another common post hoc test is Scheffe’s Test. Like Tukey’s HSD, Scheffe’s test adjusts the test statistic for how many
comparisons are made, but it does so in a slightly different way. The result is a test that is “conservative,” which means
that it is less likely to commit a Type I Error, but this comes at the cost of less power to detect effects. We can see this by
looking at the confidence intervals that Scheffe’s test gives us:
Table 11.8.2: Confidence intervals given by Scheffe’s test
Comparison Difference Scheffe’s Test CI
None vs Relevant 40.60 (28.35, 52.85)
As we can see, these are slightly wider than the intervals we got from Tukey’s HSD. This means that, all other things being
equal, they are more likely to contain zero. In our case, however, the results are the same, and we again conclude that all
three groups are different from one another.
Answer:
Variance between groups (SSB), variance within groups (SSW ) and total variance (SST ).
2. What does rejecting the null hypothesis in ANOVA tell us? What does it not tell us?
3. What is the purpose of post hoc tests?
Answer:
Post hoc tests are run if we reject the null hypothesis in ANOVA; they tell us which specific group differences are
significant.
4. Based on the ANOVA table below, do you reject or fail to reject the null hypothesis? What is the effect size?
Source SS df MS F
Total 274.33 44
Source SS df MS F
Between 87.40
Within
Total 199.22 33
b. N = 14
Source SS df MS F
Between 2 14.10
Within
Total 64.65
c.
Source SS df MS F
Between 2 42.36
Within 54 2.48
Total
Answer:

a. K = 4

Source    SS       df    MS      F
Between   87.40    3     29.13   7.81
Within    111.82   30    3.73
Total     199.22   33

b. N = 14

Source    SS      df    MS      F
Between   28.20   2     14.10   4.26
Within    36.45   11    3.31
Total     64.65   13

c.

Source    SS       df    MS       F
Between   210.10   2     105.05   42.36
Within    133.92   54    2.48
Total     344.02   56
6. You know that stores tend to charge different prices for similar or identical products, and you want to test whether or
not these differences are, on average, statistically significantly different. You go online and collect data from 3 different
stores, gathering information on 15 products at each store. You find that the average prices at each store are: Store 1
X̄ = $27.82, Store 2 X̄ = $38.96, and Store 3 X̄ = $24.53. Based on the overall variability in the products and
the variability within each store, you find the following values for the Sums of Squares: SST = 683.22, SSW = 441.19.
Complete the ANOVA table and use the 4 step hypothesis testing procedure to see if there are systematic price
differences between the stores.
7. You and your friend are debating which type of candy is the best. You find data on the average rating for hard candy
(e.g. jolly ranchers, X̄ = 3.60), chewable candy (e.g. starburst, X̄ = 4.20), and chocolate (e.g. snickers, X̄ = 4.40); each
type of candy was rated by 30 people. Test for differences in average candy rating using SSB = 16.18 and SSW =
28.74.
Answer:

Step 1: H0 : μ1 = μ2 = μ3 “There is no difference in average rating of candy quality”, HA : “At least one mean is
different.”

Step 2: 3 groups and 90 total observations yields dfnum = 2 and dfden = 87, α = 0.05, F* = 3.11.

Step 3: based on the given SSB and SSW and the computed df from step 2, the completed ANOVA table is:

Source    SS      df    MS     F
Between   16.18   2     8.09   24.52
Within    28.74   87    0.33
Total     44.92   89

Step 4: F > F*, reject H0. Based on the data in our 3 groups, we can say that there is a statistically significant
difference in the quality of different types of candy, F(2, 87) = 24.52, p < .05. Since the result is significant, we need
an effect size: η² = 16.18/44.92 = .36, which is a large effect.
8. Administrators at a university want to know if students in different majors are more or less extroverted than others.
They provide you with data they have for English majors (X̄ = 3.78, n = 45), History majors (X̄ = 2.23, n = 40),
Answer:
Step 1: H0 : μ1 = μ2 = μ3 “There is no difference in average outcome based on treatment”, HA : “At least one mean
is different.”

Step 2: 3 groups and 57 total participants yields dfnum = 2 and dfden = 54, α = 0.05, F* = 3.18.

Step 3: based on the given SSB and SSW and the computed df from step 2, the completed ANOVA table is:

Source    SS       df    MS       F
Between   210.10   2     105.05   42.36
Within    133.90   54    2.48
Total     344.00   56

Step 4: F > F*, reject H0. Based on the data in our 3 groups, we can say that there is a statistically significant
difference in the effectiveness of the treatments, F(2, 54) = 42.36, p < .05. Since the result is significant, we need an
effect size: η² = 210.10/344.00 = .61, which is a large effect.
10. You are in charge of assessing different training methods for effectiveness. You have data on 4 methods: Method 1
(X̄ = 87, n = 12), Method 2 (X̄ = 92, n = 14), Method 3 (X̄ = 88, n = 15), and Method 4 (X̄ = 75, n = 11). Test for
differences among these means, assuming SSB = 64.81 and SST = 399.45.
12.4: PEARSON’S R
There are several different types of correlation coefficients, but we will only focus on the most common: Pearson’s r. r is a very popular
correlation coefficient for assessing linear relations, and it serves as both a descriptive statistic and as a test statistic. It is descriptive
because it describes what is happening in the scatterplot; r will have both a sign (+/–) for the direction and a number (0 – 1 in absolute
value) for the magnitude.
12.1: Variability and Covariance
Because we have two continuous variables, we will have two characteristics or scores on which people will vary. What we
want to know is whether people vary on the scores together. That is, as one score changes, does the other score also change in a
predictable or consistent way? This notion of variables differing together is called covariance (the prefix “co” meaning
“together”).
Let's look at our formula for variance on a single variable:

s² = Σ(X − X̄)² / (N − 1)    (12.1.1)
We use X to represent a person's score on the variable at hand, and X̄ to represent the mean of that variable. The
numerator of this formula is the Sum of Squares, which we have seen several times for various uses. Recall that squaring a
value is just multiplying that value by itself. Thus, we can write the same equation as:
s² = Σ((X − X̄)(X − X̄)) / (N − 1)    (12.1.2)
This is the same formula and works the same way as before, where we multiply the deviation score by itself (we square it)
and then sum across squared deviations.
Now, let’s look at the formula for covariance. In this formula, we will still use X to represent the score on one variable,
and we will now use Y to represent the score on the second variable. We will still use bars to represent averages of the
scores. The formula for covariance (cov_XY, with the subscript XY to indicate covariance across the X and Y variables) is:

cov_XY = Σ((X − X̄)(Y − Ȳ)) / (N − 1)    (12.1.3)
As we can see, this is the exact same structure as the previous formula. Now, instead of multiplying the deviation score by
itself on one variable, we take the deviation scores from a single person on each variable and multiply them together. We
do this for each person (exactly the same as we did for variance) and then sum them to get our numerator. The numerator
in this is called the Sum of Products.
SP = Σ((X − X̄)(Y − Ȳ))    (12.1.4)
We will calculate the sum of products using the same table we used to calculate the sum of squares. In fact, the table for
sum of products is simply a sum of squares table for X, plus a sum of squares table for Y , with a final column of products,
as shown below.
Table 12.1.1: Sum of Products table

X    (X − X̄)    (X − X̄)²    Y    (Y − Ȳ)    (Y − Ȳ)²    (X − X̄)(Y − Ȳ)
This table works the same way that it did before (remember that the column headers tell you exactly what to do in that
column). We list our raw data for the X and Y variables in the X and Y columns, respectively, then add them up so we
can calculate the mean of each variable. We then take those means and subtract them from the appropriate raw score to get
our deviation scores for each person on each variable, and the columns of deviation scores will both add up to zero. We
will square our deviation scores for each variable to get the sum of squares for X and Y so that we can compute the
variance and standard deviation of each (we will use the standard deviation in our equation below). Finally, we take the
deviation score from each variable and multiply them together to get our product score. Summing this column will give us
our sum of products. It is very important that you multiply the raw deviation scores from each variable, NOT the squared
deviation scores.
Our sum of products will go into the numerator of our formula for covariance, and then we only have to divide by N – 1 to
get our covariance. Unlike the sum of squares, both our sum of products and our covariance can be positive, negative, or
zero, and they will always match (e.g., if our sum of products is positive, our covariance will always be positive). In the example scatterplot of job satisfaction and well-being, we can see from the axes that each of these variables is measured on a 10-point scale, with 10 being the highest on both variables
(high satisfaction and good health and well-being) and 1 being the lowest (dissatisfaction and poor health). When we look
at this plot, we can see that the variables do seem to be related. The higher scores on job satisfaction tend to also be the
higher scores on well-being, and the same is true of the lower scores.
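The table's logic translates directly into code. A minimal sketch (made-up scores and our own function name), multiplying each person's raw deviation scores and summing them before dividing by N − 1:

```python
# Sketch: Sum of Products and covariance, computed the way the table
# does it -- raw (not squared) deviation scores multiplied per person.

def covariance(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of Products: one product of deviation scores per person
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sp / (n - 1)

satisfaction = [6, 8, 3, 9, 5]   # hypothetical 10-point scores
wellbeing    = [7, 9, 4, 8, 5]
print(covariance(satisfaction, wellbeing))
```

A positive result means the two sets of scores tend to rise and fall together, matching the sign of the sum of products.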
Direction
The direction of the relation between two variables tells us whether the variables change in the same way at the same time
or in opposite ways at the same time. We saw this concept earlier when first discussing scatterplots, and we used the terms
positive and negative. A positive relation is one in which X and Y change in the same direction: as X goes up, Y goes up,
and as X goes down, Y also goes down. A negative relation is just the opposite: X and Y change together in opposite
directions: as X goes up, Y goes down, and vice versa.
As we will see soon, when we calculate a correlation coefficient, we are quantifying the relation demonstrated in a
scatterplot. That is, we are putting a number to it. That number will be either positive, negative, or zero, and we interpret
the sign of the number as our direction. If the number is positive, it is a positive relation, and if it is negative, it is a
negative relation. If it is zero, then there is no relation. The direction of the relation corresponds directly to the slope of the hypothetical line we draw through scatterplots when assessing the form of the relation. If the line has a positive slope, the relation is positive; if the line has a negative slope, the relation is negative.
Magnitude
The number we calculate for our correlation coefficient, which we will describe in detail below, corresponds to the
magnitude of the relation between the two variables. The magnitude is how strong or how consistent the relation between
the variables is. Higher numbers mean greater magnitude, which means a stronger relation.
Our correlation coefficients will take on any value between -1.00 and 1.00, with 0.00 in the middle, which again represents
no relation. A correlation of -1.00 is a perfect negative relation; as X goes up by some amount, Y goes down by the same
amount, consistently. Likewise, a correlation of 1.00 indicates a perfect positive relation; as X goes up by some amount, Y
also goes up by the same amount. Finally, a correlation of 0.00, which indicates no relation, means that as X goes up by
some amount, Y may or may not change by any amount, and it does so inconsistently.
The vast majority of correlations do not reach -1.00 or positive 1.00. Instead, they fall in between, and we use rough cutoffs for how strong the relation is based on this number. Importantly, the sign of the number (the direction of the relation)
has no bearing on how strong the relation is. The only thing that matters is the magnitude, or the absolute value of the
correlation coefficient. A correlation of -1 is just as strong as a correlation of 1. We generally use values of 0.10, 0.30, and
0.50 as indicating weak, moderate, and strong relations, respectively.
The strength of a relation, just like the form and direction, can also be inferred from a scatterplot, though this is much more
difficult to do. Some examples of weak and strong relations are shown in Figures 12.3.5 and 12.3.6, respectively. Weak
correlations still have an interpretable form and direction, but it is much harder to see. Strong correlations have a very clear
pattern, and the points tend to form a line. The examples show two different directions, but remember that the direction
does not matter for the strength, only the consistency of the relation and the size of the number, which we will see next.
The first formula, r = cov_XY / (s_X · s_Y), gives a direct sense of what a correlation is: a covariance standardized onto the scale of X and Y; the second formula, r = SP / √(SS_X · SS_Y), is computationally simpler and faster. Both of these equations will give the same value, and as we saw at the beginning of the chapter, all of these values are easily computed by using the sum of products table. When we do this calculation, we will find that our answer is always between -1.00 and 1.00 (if it's not, check the math again), which gives us a standard, interpretable metric, similar to what z-scores did.
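The equivalence of the two formulas can be verified numerically. A minimal sketch (made-up data; the helper function is ours), computing r definitionally as standardized covariance and computationally from SP, SS_X, and SS_Y:

```python
# Sketch: Pearson's r two equivalent ways.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sp  = sum((a - mx) * (b - my) for a, b in zip(x, y))   # Sum of Products
    ssx = sum((a - mx) ** 2 for a in x)                    # SS for X
    ssy = sum((b - my) ** 2 for b in y)                    # SS for Y
    # definitional: covariance divided by the product of the SDs
    r_def  = (sp / (n - 1)) / (math.sqrt(ssx / (n - 1)) * math.sqrt(ssy / (n - 1)))
    # computational: SP over the square root of the product of the SS
    r_comp = sp / math.sqrt(ssx * ssy)
    assert abs(r_def - r_comp) < 1e-9    # the two formulas agree
    return r_comp

print(pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))   # a positive correlation
```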
It was stated earlier that r is a descriptive statistic like X̄, and just like X̄, it corresponds to a population parameter. For correlations, the population parameter is the lowercase Greek letter ρ ("rho"); be careful not to confuse ρ with a p-value – they look quite similar. r is an estimate of ρ just like X̄ is an estimate of μ. Thus, we will test our observed value of r that we calculate from the data and compare it to a value of ρ specified by our null hypothesis to see if the relation between our variables is significant, as we will see in our example next.
H0 : ρ = 0
HA : ρ > 0
Remember that ρ ("rho") is our population parameter for the correlation that we estimate with r, just like X̄ and μ for means. Remember also that if there is no relation between variables, the magnitude will be 0, which is where we get the null and alternative hypothesis values.
The α = 0.05 level has a critical value of r* = 0.549. Thus, if our observed correlation is greater than 0.549, it will be statistically significant. This is a rather high bar (remember, the guideline for a strong relation is r = 0.50); this is because we have so few people. Larger samples make it easier to find significant relations.
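Where does 0.549 come from? Critical values of r are linked to critical values of t by the standard identity r* = t*/√(t*² + df). A minimal sketch, assuming N = 10 people (so df = N − 2 = 8) and taking t* = 1.860, the one-tailed α = .05 critical t for 8 df from a standard t-table:

```python
# Sketch: convert a critical t into a critical r.
# Uses the standard identity t = r * sqrt(df) / sqrt(1 - r^2),
# solved for r: r = t / sqrt(t^2 + df).
import math

def critical_r(t_crit, df):
    return t_crit / math.sqrt(t_crit ** 2 + df)

print(round(critical_r(1.860, 8), 3))   # -> 0.549
```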
Anx 3.54 3.05 3.81 3.43 4.03 3.59 4.17 3.46 3.19 4.12
We will need to put these values into our Sum of Products table to calculate the standard deviation and covariance of our
variables. We will use X for depression and Y for anxiety to keep track of our data, but be aware that this choice is arbitrary
and the math will work out the same if we decided to do the opposite. Our table is thus:
Table 12.5.2: Sum of Products table

X    (X − X̄)    (X − X̄)²    Y    (Y − Ȳ)    (Y − Ȳ)²    (X − X̄)(Y − Ȳ)
The bottom row is the sum of each column. We can see from this that the sum of the X observations is 31.63, which makes the mean of the X variable X̄ = 3.16. The deviation scores for X sum to 0.03, which is very close to 0, given rounding error, so everything looks right so far. The next column is the squared deviations for X, so we can see that the sum of squares for X is SS_X = 7.97. The same is true of the Y columns, with an average of Ȳ = 3.64, deviations that sum to zero within rounding error, and a sum of squares of SS_Y = 1.33. The final column is the product of our deviation scores (NOT of our squared deviations); summing this column gives our sum of products, SP = 2.22.
The formulas for standard deviation are the same as before. Using subscripts X and Y to denote depression and anxiety:

s_X = √(Σ(X − X̄)² / (N − 1)) = √(7.97/9) = 0.94

s_Y = √(Σ(Y − Ȳ)² / (N − 1)) = √(1.33/9) = 0.38
We can verify this using our other formula, which is computationally shorter:

r = SP / √(SS_X · SS_Y) = 2.22 / √(7.97 · 1.33) = .70
So our observed correlation between anxiety and depression is r = 0.70, which, based on sign and magnitude, is a strong,
positive correlation. Now we need to compare it to our critical value to see if it is also statistically significant.
is interpreted as the amount of variance explained in the outcome variable, and the cut scores are the same as well: 0.01, 0.09, and 0.25 for small, medium, and large, respectively. Notice here that these are the same cutoffs we used for regular r effect sizes, but squared (0.10² = 0.01, 0.30² = 0.09, 0.50² = 0.25) because, again, the r² effect size is just the squared correlation, so its interpretation should be, and is, the same. The reason we use r² as an effect size is because our ability to
analyses, even if they look nothing alike. That is because, behind the scenes, they actually are! In the next chapter, we will
learn a technique called Linear Regression, which will formally link the two analyses together.
Range Restriction
The strength of a correlation depends on how much variability is in each of the variables X and Y . This is evident in the
formula for Pearson’s r, which uses both covariance (based on the sum of products, which comes from deviation scores)
and the standard deviation of both variables (which are based on the sums of squares, which also come from deviation
scores). Thus, if we reduce the amount of variability in one or both variables, our correlation will go down. Failure to capture the full variability of a variable is called range restriction.
Take a look at Figures 12.8.1 and 12.8.2 below. The first shows a strong relation (r = 0.67) between two variables. An
oval is overlain on top of it to make the relation even more distinct. The second shows the same data, but the bottom half
of the X variable (all scores below 5) have been removed, which causes our relation (again represented by a red oval) to
become much weaker (r = 0.38). Thus range restriction has truncated (made smaller) our observed correlation.
Outliers
Another issue that can cause the observed size of our correlation to be inappropriately large or small is the presence of
outliers. An outlier is a data point that falls far away from the rest of the observations in the dataset. Sometimes outliers are
the result of incorrect data entry, poor or intentionally misleading responses, or simple random chance. Other times,
Figure 12.8.3 : Three plots showing correlations with and without outliers.
Satisfaction 1.00
Notice that there are values of 1.00 where each row and column of the same variable intersect. This is because a variable
correlates perfectly with itself, so the value is always exactly 1.00. Also notice that the upper cells are left blank and only
the cells below the diagonal of 1s are filled in. This is because correlation matrices are symmetrical: they have the same
values above the diagonal as below it. Filling in both sides would provide redundant information and make it a bit harder
to read the matrix, so we leave the upper triangle blank.
Correlation matrices are a very condensed way of presenting many results quickly, so they appear in almost all research
studies that use continuous variables. Many matrices also include columns that show the variable means and standard
deviations, as well as asterisks showing whether or not each correlation is statistically significant.
Answer:
Correlations assess the linear relation between two continuous variables
Answer:
Covariance is an unstandardized measure of how related two continuous variables are. Correlations are standardized
versions of covariance that fall between negative 1 and positive 1.
Answer:
Strong, positive, linear relation
0.62 2.02
1.50 4.62
0.34 2.60
0.97 1.59
3.54 4.67
0.69 2.52
1.53 2.28
0.32 1.68
1.94 2.50
1.25 4.04
1.42 2.63
3.07 3.53
3.99 3.90
1.73 2.75
1.9 2.95
Answer:
Your scatterplot should look similar to this:
8. In the following correlation matrix, what is the relation (number, direction, and magnitude) between…
a. Pay and Satisfaction
b. Stress and Health
Workplace    Pay    Satisfaction    Stress    Health
Pay    1.00
9. Using the data from problem 7, test for a statistically significant relation between the variables.
Answer:
Step 1: H0: ρ = 0, "There is no relation between time spent studying and overall performance in class"; HA: ρ > 0, "There is a positive relation between time spent studying and overall performance in class."
Step 2: df = 15 − 2 = 13, α = 0.05, 1-tailed test, r* = 0.441.
Step 3: Using the Sum of Products table, you should find: X̄ = 1.61, SS_X = 17.44, Ȳ = 2.95, SS_Y = 13.60, SP = 10.06, r = 0.65.
Step 4: Obtained statistic is greater than critical value, reject H0. There is a statistically significant, strong, positive relation between time spent studying and performance in class, r(13) = 0.65, p < .05.
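The Step 3 arithmetic can be checked directly from the table values. A minimal sketch using the reported SS_X, SS_Y, and SP:

```python
# Sketch: verify r from the summary values in Step 3.
import math

ssx, ssy, sp = 17.44, 13.60, 10.06   # from the Sum of Products table
r = sp / math.sqrt(ssx * ssy)        # computational formula for r
print(round(r, 2))                   # -> 0.65
assert r > 0.441                     # exceeds the one-tailed critical value
```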
10. A researcher collects data from 100 people to assess whether there is any relation between level of education and levels
of civic engagement. The researcher finds the following descriptive values:
13.2: PREDICTION
In regression, we most frequently talk about prediction, specifically predicting our outcome variable Y from our explanatory variable X,
and we use the line of best fit to make our predictions.
13.1: Line of Best Fit
In correlations, we referred to a linear trend in the data. That is, we assumed that there was a straight line we could draw through
the middle of our scatterplot that would represent the relation between our two variables, X and Y . Regression involves solving for
the equation of that line, which is called the Line of Best Fit.
The line of best fit can be thought of as the central tendency of our scatterplot. The term “best fit” means that the line is as close to
all points (with each point representing both variables for a single person) in the scatterplot as possible, with a balance of scores
above and below the line. This is the same idea as the mean, which has an equal weighting of scores above and below it and is the
best singular descriptor of all our data points for a single variable.
We have already seen many scatterplots in chapter 2 and chapter 12, so we know by now that no scatterplot has points that form a
perfectly straight line. Because of this, when we put a straight line through a scatterplot, it will not touch all of the points, and it
may not even touch any! This will result in some distance between the line and each of the points it is supposed to represent, just
like a mean has some distance between it and all of the individual scores in the dataset.
The distances between the line of best fit and each individual data point go by two different names that mean the same thing: errors
and residuals. The term “error” in regression is closely aligned with the meaning of error in statistics (think standard error or
sampling error); it does not mean that we did anything wrong, it simply means that there was some discrepancy or difference
between what our analysis produced and the true value we are trying to get at it. The term “residual” is new to our study of
statistics, and it takes on a very similar meaning in regression to what it means in everyday parlance: there is something left over. In
regression, what is “left over” – that is, what makes up the residual – is an imperfection in our ability to predict values of the Y
variable using our line. This definition brings us to one of the primary purposes of regression and the line of best fit: predicting
scores.
Ŷ = a + bX (13.2.1)
What this shows us is that we will use our known value of X for each person to predict the value of Y for that person. The
predicted value, Ŷ , is called “y -hat” and is our best guess for what a person’s score on the outcome is. Notice also that the form of
the equation is very similar to very simple linear equations that you have likely encountered before and has only two parameter
estimates: an intercept (where the line crosses the Y-axis) and a slope (how steep – and the direction, positive or negative – the line
is). These are parameter estimates because, like everything else in statistics, we are interested in approximating the true value of the
relation in the population but can only ever estimate it using sample data. We will soon see that one of these parameters, the slope,
is the focus of our hypothesis tests (the intercept is only there to make the math work out properly and is rarely interpretable). The
formulae for these parameter estimates use very familiar values:
a = Ȳ − bX̄    (13.2.2)

b = cov_XY / s²_X = SP / SS_X = r(s_Y / s_X)    (13.2.3)
We have seen each of these before. Ȳ and X̄ are the means of Y and X, respectively; cov_XY is the covariance of X and Y we learned about with correlations; and s²_X is the variance of X. The formula for the slope is very similar to the formula for a Pearson correlation coefficient; the only difference is that we are dividing by the variance of X instead of the product of the standard deviations of X and Y. Because of this, our slope is scaled to the same scale as our X variable and is no longer constrained to be
between 0 and 1 in absolute value. This formula provides a clear definition of the slope of the line of best fit, and just like with
correlation, this definitional formula can be simplified into a short computational formula for easier calculations. In this case, we
are simply taking the sum of products and dividing by the sum of squares for X.
Notice that there is a third formula for the slope of the line that involves the correlation between X and Y . This is because
regression and correlation look for the same thing: a straight line through the middle of the data. The only difference between a
regression coefficient in simple linear regression and a Pearson correlation coefficient is the scale. So, if you lack raw data but have
summary information on the correlation and standard deviations for variables, you can still compute a slope, and therefore
intercept, for a line of best fit.
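These estimates can be sketched in a few lines (made-up data; the helper name is ours), including a check that SP/SS_X and r(s_Y/s_X) give the same slope:

```python
# Sketch: slope and intercept of the line of best fit, two ways.
import math

def fit_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sp  = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    b_slope = sp / ssx                    # computational formula: SP / SSX
    # equivalent formula: r * (sy / sx)
    r  = sp / math.sqrt(ssx * ssy)
    sx = math.sqrt(ssx / (n - 1))
    sy = math.sqrt(ssy / (n - 1))
    assert abs(b_slope - r * (sy / sx)) < 1e-9
    a_int = my - b_slope * mx             # intercept: a = Ybar - b*Xbar
    return a_int, b_slope

a, b = fit_line([1, 2, 3, 4], [2, 4, 5, 9])
print(f"Y-hat = {a:.2f} + {b:.2f}X")
```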
It is very important to point out that the Y values in the equations for a and b are our observed Y values in the dataset, NOT the
predicted Y values (Ŷ ) from our equation for the line of best fit. Thus, we will have 3 values for each person: the observed value of
X(X), the observed value of Y (Y ) , and the predicted value of Y (Ŷ ). You may be asking why we would try to predict Y if we
have an observed value of Y , and that is a very reasonable question. The answer has two explanations: first, we need to use known
values of Y to calculate the parameter estimates in our equation, and we use the difference between our observed values and
predicted values (Y − Ŷ) to see how accurate our equation is; second, we often use regression to create a predictive model that we can then use to predict values of Y for other people for whom we only have information on X.
way we use subtraction for deviation scores and sums of squares. The value (Y − Ŷ) is our residual, which, as defined above, is how
close our line of best fit is to our actual values. We can visualize residuals to get a better sense of what they are by creating a
scatterplot and overlaying a line of best fit on it, as shown in Figure 13.2.1.
dark red bracket between the triangular dots and the predicted scores on the line of best fit are our residuals (they are only drawn
for four observations for ease of viewing, but in reality there is one for every observation); you can see that some residuals are
positive and some are negative, and that some are very large and some are very small. This means that some predictions are very
accurate and some are very inaccurate, and some predictions overestimated values while others underestimated them. Across the entire dataset, the line of best fit is the one that minimizes the total (sum) of all squared residuals. That is, although predictions at an
individual level might be somewhat inaccurate, across our full sample and (theoretically) in future samples our total amount of
error is as small as possible. We call this property of the line of best fit the Least Squares Error Solution. This term means that the
solution – or equation – of the line is the one that provides the smallest possible value of the squared errors (squared so that they
can be summed, just like in standard deviation) relative to any other straight line we could draw through the data.
in Figure 13.2.2.
are the same type of idea: the distance between an observed score and a given line, either the line of best fit that gives predictions
or the line representing the mean that serves as a baseline. The difference between these two values, which is the distance between
the lines themselves, is our model's ability to predict scores above and beyond the baseline mean; that is, it is our model's ability to
explain the variance we observe in Y based on values of X. If we have no ability to explain variance, then our line will be flat (the
slope will be 0.00) and will be the same as the line representing the mean, and the distance between the lines will be 0.00 as well.
We now have three pieces of information: the distance from the observed score to the mean, the distance from the observed score to
the prediction line, and the distance from the prediction line to the mean. These are our three pieces of information needed to test
our hypotheses about regression and to calculate effect sizes. They are our three Sums of Squares, just like in ANOVA. Our
distance from the observed score to the mean is the Sum of Squares Total, which we are trying to explain. Our distance from the
observed score to the prediction line is our Sum of Squares Error, or residual, which we are trying to minimize. Our distance from
the prediction line to the mean is our Sum of Squares Model, which is our observed effect and our ability to explain variance. Each
of these will go into the ANOVA table to calculate our test statistic.
Source    SS           df       MS           F
Model     Σ(Ŷ − Ȳ)²    1        SS_M/df_M    MS_M/MS_E
Error     Σ(Y − Ŷ)²    N − 2    SS_E/df_E
Total     Σ(Y − Ȳ)²    N − 1
As with ANOVA, getting the values for the SS column is a straightforward but somewhat arduous process. First, you take
the raw scores of X and Y and calculate the means, variances, and covariance using the sum of products table introduced
in our chapter on correlations. Next, you use the variance of X and the covariance of X and Y to calculate the slope of the
line, b , the formula for which is given above. After that, you use the means and the slope to find the intercept, a , which is
given alongside b . After that, you use the full prediction equation for the line of best fit to get predicted Y scores (Yˆ ) for
each person. Finally, you use the observed Y scores, predicted Y scores, and mean of Y to find the appropriate deviation
scores for each person for each sum of squares source in the table and sum them to get the Sum of Squares Model, Sum of
Squares Error, and Sum of Squares Total. As with ANOVA, you won’t be required to compute the SS values by hand, but
you will need to know what they represent and how they fit together.
The other columns in the ANOVA table are all familiar. The degrees of freedom column still has N – 1 for our total, but
now we have N – 2 for our error degrees of freedom and 1 for our model degrees of freedom; this is because simple linear
regression only has one predictor, so our degrees of freedom for the model is always 1 and does not change. The total
degrees of freedom must still be the sum of the other two, so our degrees of freedom error will always be N – 2 for simple
linear regression. The mean square columns are still the SS column divided by the df column, and the test statistic F is
still the ratio of the mean squares. Based on this, it is now explicitly clear that not only do regression and ANOVA have the
same goal but they are, in fact, the same analysis entirely. The only difference is the type of data we feed into the predictor
side of the equations: continuous for regression and categorical for ANOVA.
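The partitioning above can be sketched end to end. A minimal illustration (made-up data, our own function name) that fits the line, computes predicted scores, and splits the variance into the three Sums of Squares:

```python
# Sketch: regression ANOVA -- partition SS Total into Model and Error.

def regression_anova(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((a - mx) * (c - my) for a, c in zip(x, y))
         / sum((a - mx) ** 2 for a in x))           # slope = SP / SSX
    a0 = my - b * mx                                # intercept
    y_hat = [a0 + b * xi for xi in x]               # predicted scores
    ss_model = sum((yh - my) ** 2 for yh in y_hat)              # SS Model
    ss_error = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # SS Error
    ss_total = sum((yi - my) ** 2 for yi in y)                  # SS Total
    f = (ss_model / 1) / (ss_error / (n - 2))       # df model = 1, df error = N - 2
    return ss_model, ss_error, ss_total, f

ssm, sse, sst, f = regression_anova([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(f"SSM = {ssm:.2f}, SSE = {sse:.2f}, SST = {sst:.2f}, F = {f:.2f}")
```

Note that SS Model + SS Error reproduces SS Total, and the effect size falls out as R² = SS Model / SS Total.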
H0 : β = 0
HA : β > 0
HA : β < 0
HA : β ≠ 0
A non-zero slope indicates that we can explain values in Y based on X and therefore predict future values of Y based on X.
Our alternative hypotheses are analogous to those in correlation: positive relations have values above zero, negative relations
have values below zero, and two-tailed tests are possible. Just like ANOVA, we will test the significance of this relation using
the F statistic calculated in our ANOVA table compared to a critical value from the F distribution table. Let’s take a look at an
example and regression in action.
H0 : β = 0
HA : β ≠ 0
that the appropriate critical value for 1 and 16 degrees of freedom is F* = 4.49, shown below in Figure 13.5.1.
From the raw data in our X and Y columns, we find that the means are X̄ = 19.78 and Ȳ = 17.45. The deviation scores for each variable sum to zero, so all is well there. The sums of squares for X and Y ultimately lead us to standard deviations of s_X = 2.12 and s_Y = 3.20. Finally, our sum of products is 58.29, which gives us a covariance of cov_XY = 3.43, so we know our relation will be positive. This is all the information we need for our equations for the line of best fit.
First, we must calculate the slope of the line:
b = SP / SS_X = 58.29 / 76.14 = 0.77
This means that as X changes by 1 unit, Y will change by 0.77. In terms of our problem, as health increases by 1, happiness
goes up by 0.77, which is a positive relation. Next, we use the slope, along with the means of each variable, to compute the
intercept:
a = Ȳ − bX̄
For this particular problem (and most regressions), the intercept is not an important or interpretable value, so we will not read
into it further. Now that we have all of our parameters estimated, we can give the full equation for our line of best fit:
Ŷ = 2.42 + 0.77X
We can plot this relation in a scatterplot and overlay our line onto it, as shown in Figure 13.5.2.
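With the parameters estimated, prediction is just plugging values into the line. A minimal sketch using the example's fitted equation Ŷ = 2.42 + 0.77X (the health scores below are hypothetical):

```python
# Sketch: generate predicted happiness scores from the fitted line.

def predict(x, a=2.42, b=0.77):
    """Y-hat = a + b*X, with a and b from the worked example."""
    return a + b * x

for health in (15, 20, 25):              # hypothetical health scores
    print(health, round(predict(health), 2))
```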
Source    SS        df    MS    F
Model     44.62
Error     129.37
Total
Now that we have these, we can fill in the rest of the ANOVA table. We already found our degrees of freedom in Step 2:
Table 13.5.3: ANOVA Table
Source SS df MS F
Model 44.62 1
Error 129.37 16
Total
Our total line is always the sum of the other two lines, giving us:
Table 13.5.4: ANOVA Table
Source SS df MS F
Model 44.62 1
Error 129.37 16
Total 173.99 17
Our mean squares column is only calculated for the model and error lines and is always our SS divided by our df , which is:
Table 13.5.5: ANOVA Table
Source    SS        df    MS       F
Model     44.62     1     44.62    5.52
Error     129.37    16    8.09
Total     173.99    17
This gives us an obtained F statistic of 5.52, which we will now use to test our hypothesis.
Reject H0 . Based on our sample of 18 people, we can predict levels of happiness based on how healthy someone is,
F (1, 16) = 5.52, p < .05 .
Effect Size: We know that, because we rejected the null hypothesis, we should calculate an effect size. In regression, our effect size is variance explained, just like it was in ANOVA. Instead of using η² to represent this, we instead use R², as we saw in correlation (yet more evidence that all of these are the same analysis). Variance explained is still the ratio of SS_M to SS_T:

R² = SS_M / SS_T = 44.62 / 173.99 = 0.26

We are explaining 26% of the variance in happiness based on health, which is a large effect size (R² uses the same effect size cutoffs as η²).
Accuracy in Prediction
We found a large, statistically significant relation between our variables, which is what we hoped for. However, if we want to
use our estimated line of best fit for future prediction, we will also want to know how precise or accurate our predicted values
are. What we want to know is the average distance from our predictions to our actual observed values, or the average size of
the residual (Y − Ŷ). The average size of the residual is known by a specific name: the standard error of the estimate, s_(Y−Ŷ), which is given by the formula

s_(Y−Ŷ) = √(Σ(Y − Ŷ)² / (N − 2))    (13.5.1)
This formula is almost identical to our standard deviation formula, and it follows the same logic. We square our residuals, add
them up, then divide by the degrees of freedom. Although this sounds like a long process, we already have the sum of the
squared residuals in our ANOVA table! In fact, the value under the square root sign is just the SS_E divided by the df_E, which is the Mean Square Error:

s_(Y−Ŷ) = √(Σ(Y − Ŷ)² / (N − 2)) = √MS_E    (13.5.2)
use variables that are statistically significantly related to our outcome to explain the variance we observe in that outcome.
Other forms of regression include curvilinear models that can explain curves in the data rather than the straight lines used
here, as well as moderation models that change the relation between two variables based on levels of a third. The
possibilities are truly endless and offer a lifetime of discovery.
Answer:
ANOVA and simple linear regression both take the total observed variance and partition it into pieces that we can
explain and cannot explain and use the ratio of those pieces to test for significant relations. They are different in that
ANOVA uses a categorical variable as a predictor whereas linear regression uses a continuous variable.
2. What is a residual?
3. How are correlation and regression similar? How are they different?
Answer:
Correlation and regression both involve taking two continuous variables and finding a linear relation between them.
Correlations find a standardized value describing the direction and magnitude of the relation whereas regression finds
the line of best fit and uses it to partition and explain variance.
4. What are the two parameters of the line of best fit, and what do they represent?
5. What is our criterion for finding the line of best fit?
Answer:
Least Squares Error Solution; the line that minimizes the total amount of residual error in the dataset.
6. Fill out the rest of the ANOVA tables below for simple linear regressions:
a.
Source   SS      df   MS   F
Model    34.21
Error
Total    66.12   54

b.
Source   SS      df   MS   F
Model    6.03
Error            16
Total    19.98
7. In chapter 12, we found a statistically significant correlation between overall performance in class and how much time someone studied. Use the summary statistics calculated in that problem (provided here) to compute a line of best fit predicting success from study times: X̄ = 1.61, s_X = 1.12, Ȳ = 2.95, s_Y = 0.99, r = 0.65.
Answer:
b = r(s_Y / s_X) = 0.65(0.99/1.12) = 0.57; a = Ȳ − bX̄ = 2.95 − (0.57)(1.61) = 2.03; Ŷ = 2.03 + 0.57X
8. Using the line of best fit equation created in problem 7, predict the scores for how successful people will be based on
how much they study:
a. X = 1.20
b. X = 3.33
c. X = 0.71
d. X = 4.00
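Problem 8 is just a matter of plugging each X into the fitted line. A minimal sketch (plain Python) that derives the slope and intercept from the problem 7 summary statistics and then generates the four predictions — note that b = r(s_Y/s_X) with these numbers works out to roughly 0.57:

```python
# Summary statistics given in problem 7
r, s_x, s_y = 0.65, 1.12, 0.99
mean_x, mean_y = 1.61, 2.95

b = r * (s_y / s_x)      # slope, about 0.57
a = mean_y - b * mean_x  # intercept, about 2.03 if the slope is rounded first

for x in (1.20, 3.33, 0.71, 4.00):
    y_hat = a + b * x
    print(f"X = {x:.2f} -> predicted success = {y_hat:.2f}")
```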
Draft Ranking (X)   Final Ranking (Y)
1                   3
2                   6
3                   8
4                   13
5                   2
6                   15
7                   4
8                   10
9                   11
10                  16
11                  9
12                  7
13                  14
14                  12
15                  1
16                  5
Answer:
Step 1: H0: β = 0, “There is no predictive relation between draft rankings and final rankings in fantasy football”; HA: β ≠ 0, “There is a predictive relation between draft rankings and final rankings in fantasy football.”
Step 2: Our model will have 1 (based on the number of predictors) and 14 (based on how many observations we have) degrees of freedom, giving us a critical value of F* = 4.60.
Step 3: Using the sum of products table, we find: X̄ = 8.50, Ȳ = 8.50, SS_X = 339.86, SP = 29.99, giving us a line of best fit of: b = 29.99/339.86 = 0.09; a = 8.50 − 0.09(8.50) = 7.74; Ŷ = 7.74 + 0.09X. Our given SS values and our df from step 2 allow us to fill in the ANOVA table:

Source   SS       df   MS      F
Model    2.65     1    2.65    0.11
Error    337.21   14   24.09
Total    339.86   15
Step 4: Our obtained value was smaller than our critical value, so we fail to reject the null hypothesis. There is no
evidence to suggest that draft rankings have any predictive value for final fantasy football rankings,
F (1, 14) = 0.11, p > .05
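The whole problem 9 test can be reproduced in a few lines of plain Python. One caution: the printed data table appears to have suffered in typesetting, so the sketch below assumes the first final ranking is 3 rather than 14 — that makes the final rankings a permutation of 1–16, which is the only reading consistent with the answer's Ȳ = 8.50 and SP ≈ 30:

```python
# Draft rankings (X) and final rankings (Y); first Y assumed to be 3 (see note above)
X = list(range(1, 17))
Y = [3, 6, 8, 13, 2, 15, 4, 10, 11, 16, 9, 7, 14, 12, 1, 5]
n = len(X)

mx, my = sum(X) / n, sum(Y) / n
sp = sum((x - mx) * (y - my) for x, y in zip(X, Y))  # sum of products
ss_x = sum((x - mx) ** 2 for x in X)
ss_y = sum((y - my) ** 2 for y in Y)

b = sp / ss_x    # slope
a = my - b * mx  # intercept

ss_model = b * sp           # SS explained by the regression line
ss_error = ss_y - ss_model  # leftover (residual) SS
f_stat = (ss_model / 1) / (ss_error / (n - 2))

print(f"b = {b:.2f}, a = {a:.2f}, F(1, {n - 2}) = {f_stat:.2f}")
# F is far below the critical value of 4.60, so we fail to reject H0
```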
10. You have summary data for two variables: how extroverted someone is (X) and how often someone volunteers (Y). Using these values, calculate the line of best fit predicting volunteering from extroversion, then test for a statistically significant relation using the hypothesis testing procedure: X̄ = 12.58, s_X = 4.65, Ȳ = 7.44, s_Y = 2.12, r = 0.34, N = 67, SS_M = 19.79, SS_E = 215.77.
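Problem 10 can be checked with the same machinery. A sketch (plain Python) using the given summary values; the critical value of roughly 3.99 for F(1, 65) at α = .05 is an assumption taken from a standard F table, not from the text:

```python
# Given summary statistics for extroversion (X) and volunteering (Y)
r, s_x, s_y = 0.34, 4.65, 2.12
mean_x, mean_y = 12.58, 7.44
n, ss_model, ss_error = 67, 19.79, 215.77

b = r * (s_y / s_x)      # slope
a = mean_y - b * mean_x  # intercept

# F ratio from the given model and error sums of squares
f_stat = (ss_model / 1) / (ss_error / (n - 2))
print(round(b, 2), round(a, 2), round(f_stat, 2))  # 0.16 5.49 5.96
```

With F ≈ 5.96 exceeding roughly 3.99, the relation would be judged statistically significant.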
14.2: GOODNESS-OF-FIT
The first of our two χ² tests assesses one categorical variable against a null hypothesis of equally sized frequencies. Equal frequency
distributions are what we would expect to get if categorization was completely random. We could, in theory, also test against a specific
distribution of category sizes if we have a good reason to (e.g. we have a solid foundation of how the regular population is distributed), but
this is less common, so we will not deal with it in this text.
14.3: Χ² STATISTIC
The calculations for our test statistic in χ² tests combine our information from our observed frequencies ( O ) and our expected frequencies
( E ) for each level of our categorical variable. For each cell (category) we find the difference between the observed and expected values,
square them, and divide by the expected values. We then sum this value across cells for our test statistic.
14.1: Categories and Frequency Tables
Our data for the χ² test are categorical, specifically nominal, variables. Recall from unit 1 that nominal variables have no specified order and can only be described by their names and the frequencies with which they occur in the dataset. Thus, unlike our other variables that we have tested, we cannot describe our data for the χ² test using means and standard deviations.

Table 14.1.1: Pet preference frequencies

           Cats   Dogs   Other   Total
Observed   14     17     5       36
Expected   12     12     12      36
Table 14.1.1 gives an example of a frequency table used for a χ² test. The columns represent the different categories within our single variable, which in this example is pet preference. The χ² test can assess as few as two categories, and there is no technical upper limit on how many categories can be included in our variable, although, as with ANOVA, having too many categories makes our computations long and our interpretation difficult. The final column in the table is the total number of observations, or N. The χ² test assumes that each observation comes from only one person and that each person will provide only one observation, so our total observations will always equal our sample size.
There are two rows in this table. The first row gives the observed frequencies of each category from our dataset; in this example, 14 people reported preferring cats as pets, 17 people reported preferring dogs, and 5 people reported a different animal. The second row gives expected values; expected values are what would be found if each category had equal representation. The calculation for an expected value is:
E = N / C   (14.1.1)
where N is the total number of people in our sample and C is the number of categories in our variable (also the number of columns in our table). The expected values correspond to the null hypothesis for χ² tests: equal representation of categories. Our first of two χ² tests, the Goodness-of-Fit test, will assess how well our data line up with, or deviate from, this assumption.
Equal frequency distributions are what we would expect to get if categorization was completely random. We could, in theory, also test against a specific distribution of category sizes if we have a good reason to (e.g. we have a solid foundation of how the regular population is distributed), but this is less common, so we will not deal with it in this text.
Hypotheses
All χ² tests, including the goodness-of-fit test, are non-parametric. This means that there is no population parameter we are estimating or testing against; we are working only with our sample data. Because of this, there are no mathematical statements for χ² hypotheses. This should make sense because the mathematical hypothesis statements were always about population parameters (e.g. μ), so if we are non-parametric, we have no parameters and therefore no mathematical statements.
We do, however, still state our hypotheses verbally. For goodness-of-fit χ² tests, our null hypothesis is that there is an equal number of observations in each category. That is, there is no difference between the categories in how prevalent they are. Our alternative hypothesis says that the categories do differ in their frequency. We do not have specific directions or one-tailed tests for χ², matching our lack of mathematical statements.
Our degrees of freedom for the χ² test are based on the number of categories we have in our variable, not on the number of people or observations like it was for our other tests. Luckily, they are just as simple to calculate:

df = C − 1   (14.2.1)

So for our pet preference example, we have 3 categories, so we have 2 degrees of freedom. Our degrees of freedom, along with our significance level (still defaulted to α = 0.05), are used to find our critical values in the χ² table, which is shown in figure 1. Because we do not have directional hypotheses for χ² tests, we do not need to differentiate between critical values for 1- or 2-tailed tests. In fact, just like our F tests for regression and ANOVA, all χ² tests are 1-tailed tests.
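As an aside, for the special case of 2 degrees of freedom the χ² distribution reduces to an exponential distribution, so the tabled critical value of 5.991 can be verified with one line of arithmetic (a check, not something the text asks you to do):

```python
import math

# For df = 2, the chi-square CDF is 1 - exp(-x/2), so the critical value
# at significance level alpha solves exp(-x/2) = alpha, i.e. x = -2 ln(alpha)
alpha = 0.05
crit = -2 * math.log(alpha)
print(round(crit, 3))  # 5.991, matching the chi-square table for df = 2
```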
The calculations for our test statistic in χ² tests combine our information from our observed frequencies (O) and our expected frequencies (E) for each level of our categorical variable. For each cell (category) we find the difference between the observed and expected values, square it, and divide by the expected value. We then sum this value across cells for our test statistic. This is shown in the formula:

χ² = Σ (O − E)² / E   (14.3.1)
Notice that, for each cell’s calculation, the expected value in the numerator and the expected value in the denominator are
the same value. Let’s now take a look at an example from start to finish.
Observed   26   19   45

χ² = (26 − 22.50)²/22.50 + (19 − 22.50)²/22.50 = 0.54 + 0.54 = 1.08
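This cell-by-cell arithmetic is easy to mirror in code. A minimal sketch (plain Python) for a goodness-of-fit statistic, using the two observed counts above with their shared expected value of 45/2 = 22.50:

```python
# Observed counts for the two categories; H0 expects N / C = 45 / 2 = 22.5 each
observed = [26, 19]
n = sum(observed)
expected = [n / len(observed)] * len(observed)

# Chi-square statistic: sum of (O - E)^2 / E across cells (Equation 14.3.1)
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 2))  # 1.09 (the text rounds each term to 0.54 first, giving 1.08)
```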
As noted above, our only description for nominal data is frequency, so we will again present our observations in a
frequency table. When we have two categorical variables, our frequency table is crossed. That is, each combination of
levels from each categorical variable are presented. This type of frequency table is called a contingency table because it
shows the frequency of each category in one variable, contingent upon the specific level of the other variable.
An example contingency table is shown in Table 14.5.1, which displays whether or not 168 college students watched
college sports growing up (Yes/No) and whether the students’ final choice of which college to attend was influenced by the
college’s sports teams (Yes – Primary, Yes – Somewhat, No):
Table 14.5.1: Contingency table of college sports and decision making

                                 Affected Decision
Watched College Sports    Primary   Somewhat   No   Total
Yes                       47        26         14   87
No                        21        23         37   81
Total                     68        49         51   168
In contrast to the frequency table for our goodness-of-fit test, our contingency table does not contain expected values, only observed data. Within our table, wherever our rows and columns cross, we have a cell. A cell contains the frequency of observing its corresponding specific levels of each variable at the same time. The top left cell in Table 14.5.1 shows us that 47 people in our study watched college sports as a child AND had college sports as their primary deciding factor in which college to attend.
Cells are numbered based on which row they are in (rows are numbered top to bottom) and which column they are in (columns are numbered left to right). We always name the cell using (R,C), with the row first and the column second. A quick and easy way to remember the order is that R/C Cola exists but C/R Cola does not. Based on this convention, the top left cell containing our 47 participants who watched college sports as a child and had sports as a primary criterion is cell (1,1). Next to it, which has 26 people who watched college sports as a child but had sports only somewhat affect their decision, is cell (1,2), and so on. We only number the cells where our categories cross. We do not number our total cells, which have their own special name: marginal values.
Marginal values are the total values for a single category of one variable, added up across levels of the other variable. In Table 14.5.1, these marginal values have been italicized for ease of explanation, though this is not normally the case. We can see that, in total, 87 of our participants (47 + 26 + 14) watched college sports growing up and 81 (21 + 23 + 37) did not. The total of these two marginal values is 168, the total number of people in our study. Likewise, 68 people used sports as a primary criterion for deciding which college to attend, 49 considered it somewhat, and 51 did not use it as a criterion at all. The total of these marginal values is also 168, our total number of people. The marginal values for rows and columns will always both add up to the total number of participants, N, in the study. If they do not, then a calculation error was made and you must go back and check your work.
Using the data from Table 14.5.1, we can calculate the expected frequency for cell (1,1), the college sports watchers who used sports as their primary criterion, to be:

E(1,1) = (87 × 68) / 168 = 35.21
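Each expected frequency is just (row total × column total) / N. A short sketch (plain Python) that fills in the whole expected-value table from the observed contingency table above:

```python
# Observed contingency table from Table 14.5.1
# rows: watched college sports yes/no; columns: primary / somewhat / no influence
observed = [
    [47, 26, 14],
    [21, 23, 37],
]
row_totals = [sum(row) for row in observed]        # [87, 81]
col_totals = [sum(col) for col in zip(*observed)]  # [68, 49, 51]
n = sum(row_totals)                                # 168

# E(r, c) = row total * column total / N
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
print([[round(e, 2) for e in row] for row in expected])
# [[35.21, 25.38, 26.41], [32.79, 23.62, 24.59]]
```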
We can follow the same math to find all the expected values for this table:
Table 14.5.2: Expected values for college sports and decision making

                                 Affected Decision
Watched College Sports    Primary   Somewhat   No      Total
Yes                       35.21     25.38      26.41   87
No                        32.79     23.62      24.59   81
Total                     68        49         51      168

Notice that the marginal values still add up to the same totals as before. This is because the expected frequencies are just row and column averages simultaneously. Our total N will also add up to the same value.
The observed and expected frequencies can be used to calculate the same χ² statistic as we did for the goodness-of-fit test. Before we get there, though, we should look at the hypotheses and degrees of freedom used for contingency tables.
values of each categorical variable (that is, the frequency of their levels) is related to or independent of the values of the other categorical variable. Because we are still doing a χ² test, which is non-parametric, we still do not have mathematical versions of our hypotheses. The actual interpretations of the hypotheses are quite simple: the null hypothesis says that the variables are independent or not related, and the alternative says that they are not independent or that they are related. Using this setup and the data provided in Table 14.5.2, let's formally test whether or not watching college sports as a child is related to using sports as a criterion for selecting a college to attend.
In our example:

df = (2 − 1)(3 − 1) = 1 × 2 = 2

Based on our 2 degrees of freedom, our critical value from our table is 5.991.
χ² = Σ (O − E)² / E   (14.7.2)

χ² = (47 − 35.21)²/35.21 + (26 − 25.38)²/25.38 + (14 − 26.41)²/26.41 + (21 − 32.79)²/32.79 + (23 − 23.62)²/23.62 + (37 − 24.59)²/24.59 = 20.31

someone watches college sports growing up and how much a college's sports teams factor into that person's decision on which college to attend, χ²(2) = 20.31, p < 0.05.
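The full test of independence is a natural candidate for code. A sketch (plain Python) that reproduces the statistic above from the observed table, computing expected values on the fly rather than from rounded intermediate values:

```python
# Observed contingency table (watched college sports x decision influence)
observed = [
    [47, 26, 14],
    [21, 23, 37],
]
row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # expected frequency for this cell
        chi_sq += (o - e) ** 2 / e             # (O - E)^2 / E, summed over cells

print(round(chi_sq, 2))  # 20.31, well above the critical value of 5.991
```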
be calculated for statistically significant results. There are many options for which effect size to use, and the ultimate decision is based on the type of data, the structure of your frequency or contingency table, and the types of conclusions you would like to draw. For the purpose of our introductory course, we will focus only on a single effect size that is simple and flexible: Cramer's V.

Cramer's V is a type of correlation coefficient that can be computed on categorical data. Like any other correlation coefficient (e.g. Pearson's r), the cutoffs for small, medium, and large effect sizes of Cramer's V are 0.10, 0.30, and 0.50, respectively. The calculation of Cramer's V is very simple:

V = √( χ² / (N(k − 1)) )   (14.7.3)

For this calculation, k is the smaller value of either R (the number of rows) or C (the number of columns). The numerator is simply the test statistic we calculate during step 3 of the hypothesis testing procedure. For our example, we had 2 rows and 3 columns, so k = 2:

V = √( 20.31 / (168(2 − 1)) ) = 0.35
So the statistically significant relation between our variables was moderately strong.
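The effect-size step can be appended to any χ² computation in a line or two; a minimal sketch using the values from this example:

```python
import math

chi_sq, n = 20.31, 168  # test statistic and sample size from the example
k = min(2, 3)           # the smaller of the number of rows and columns

v = math.sqrt(chi_sq / (n * (k - 1)))  # Cramer's V (Equation 14.7.3)
print(round(v, 2))  # 0.35 -- between the medium (0.30) and large (0.50) cutoffs
```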
Answer:
Frequency tables display observed category frequencies and (sometimes) expected category frequencies for a single
categorical variable. Contingency tables display the frequency of observing people in crossed category levels for two
categorical variables, and (sometimes) the marginal totals of each variable level.
Answer:
Expected values are what we would observe if the proportion of categories was completely random (i.e. no consistent difference other than chance), which is the same as what the null hypothesis predicts to be true.
             Category A   Category B
Category C   22           38
Category D   16           14

Answer:

Observed     Category A   Category B   Total
Category C   22           38           60
Category D   16           14           30
Total        38           52           90

Expected     Category A   Category B   Total
Category C   25.33        34.67        60
Category D   12.67        17.33        30
Total        38           52           90
6. Test significance and find effect sizes (if significant) for the following tests:
a. N = 19, R = 3, C = 2, χ²(2) = 7.89, α = .05
7. You hear a lot of people claim that The Empire Strikes Back is the best movie in the original Star Wars trilogy, and you
decide to collect some data to demonstrate this empirically (pun intended). You ask 48 people which of the original
movies they liked best; 8 said A New Hope was their favorite, 23 said The Empire Strikes Back was their favorite, and
17 said Return of the Jedi was their favorite. Perform a chi-square test on these data at the .05 level of significance.
Answer:
Step 1: H0: “There is no difference in preference for one movie,” HA: “There is a difference in how many people prefer one movie over the others.”
Step 2: 3 categories (columns) gives df = 2, χ²crit = 5.991.
Step 3: Based on the given frequencies:

            A New Hope   The Empire Strikes Back   Return of the Jedi   Total
Observed    8            23                        17                   48
Expected    16           16                        16                   48

χ² = 7.13.
Step 4: Our obtained statistic is greater than our critical value, so we reject H0. Based on our sample of 48 people, there is a statistically significant difference in the proportion of people who prefer one Star Wars movie over the others, χ²(2) = 7.13, p < .05. Since this is a statistically significant result, we should calculate an effect size: Cramer's V = √( 7.13 / (48(3 − 1)) ) = 0.27, which is a moderate effect size.
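Problem 7 end to end, as a sketch in plain Python (goodness-of-fit with equal expected frequencies, plus Cramer's V):

```python
import math

# Observed favorite-movie counts: A New Hope, Empire Strikes Back, Return of the Jedi
observed = [8, 23, 17]
n = sum(observed)             # 48
expected = n / len(observed)  # 16 under H0 of equal preference

# Goodness-of-fit statistic, summing (O - E)^2 / E over the three categories
chi_sq = sum((o - expected) ** 2 / expected for o in observed)
df = len(observed) - 1        # C - 1 = 2

# Effect size: Cramer's V with k = number of categories
v = math.sqrt(chi_sq / (n * (len(observed) - 1)))

print(round(chi_sq, 2), round(v, 2))  # 7.12 0.27 (the text rounds the statistic to 7.13)
```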
8. A pizza company wants to know if people order the same number of different toppings. They look at how many
pepperoni, sausage, and cheese pizzas were ordered in the last week; fill out the rest of the frequency table and test for
a difference.
Expected
9. A university administrator wants to know if there is a difference in proportions of students who go on to grad school
across different majors. Use the data below to test whether there is a relation between college major and going to grad
school.
                         Major
                         Psychology   Business   Math
Graduate School   Yes    32           8          36
                  No     15           41         12
Answer:
Step 1: H0: “There is no relation between college major and going to grad school,” HA: “Going to grad school is related to college major.”
Step 2: df = (2 − 1)(3 − 1) = 2, χ²crit = 5.991.
Step 3: Based on the given frequencies, the expected values are:

                         Major
Expected Values          Psychology   Business   Math
Graduate School   Yes    24.81        25.86      25.33
                  No     22.19        23.14      22.67

χ² = 2.09 + 12.34 + 4.49 + 2.33 + 13.79 + 5.02 = 40.05
Step 4: Our obtained statistic is greater than the critical value, so we reject H0. Based on our data, there is a statistically significant relation between college major and going to grad school, χ²(2) = 40.05, p < .05, Cramer's V = 0.53, which is a large effect size.
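Since problems 9 and 10 both reduce to the same computation, a small reusable helper is worth sketching (plain Python; it returns the χ² statistic and degrees of freedom for any contingency table of observed counts):

```python
def chi_square_independence(table):
    """Chi-square test of independence for a 2-D list of observed counts.

    Returns (chi_sq, df): the test statistic and (R-1)(C-1) degrees of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Sum (O - E)^2 / E over every cell, with E = row total * column total / N
    chi_sq = sum(
        (obs - rt * ct / n) ** 2 / (rt * ct / n)
        for rt, row in zip(row_totals, table)
        for ct, obs in zip(col_totals, row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_sq, df

# Problem 9: grad school (yes/no) by major (Psychology, Business, Math)
chi_sq, df = chi_square_independence([[32, 8, 36], [15, 41, 12]])
print(round(chi_sq, 2), df)  # 40.05 2
```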
10. A company you work for wants to make sure that they are not discriminating against anyone in their promotion
process. You have been asked to look across gender to see if there are differences in promotion rate (i.e. if gender and
promotion rate are independent or not). The following data should be assessed at the normal level of significance:
Gender   Women   8   5
         Men     9   7