Gignac, G. E. (2019). How2statsbook (Online Edition 1). Perth, Australia: Author.

Chapter 4
Categorical Data
Contents
Binomial Test
Binomial Test Hypotheses
Binomial Test Confidence Intervals: SPSS
Pearson Chi-Square: Test of Two Independent Proportions/Frequencies
Pearson Chi-Square: Two Independent Proportions - SPSS
Binomial Test Versus Pearson Chi-Square
2x2 Pearson Chi-Square: Test of Association
2x2 Pearson Chi-Square: SPSS
2x2 Pearson Chi-Square: Measure of Association (phi)
Assumptions: Pearson Chi-Square
2x2 Pearson Chi-Square: Effect Size Guidelines
Yates Continuity Correction
Pearson Chi-Square: r x c Contingency Table Analysis
Pearson Chi-Square: r x c Contingency Table Analysis: SPSS
Measure of Effect Size: Cramer’s V
r x c Contingency Table Analysis: Follow-Up Analyses
Adjusted Standardized Residual Analysis
Adjusted Standardized Residual Analysis: SPSS
Adjusted Standardized Residual Analysis: Bonferroni Correction
Adjusted Standardized Residual Analysis: Bonferroni Correction - SPSS
What is the Minimum Cell Frequency Required?
Differences Between Proportions/Percentages: Within-Subjects Designs
McNemar Chi-Square
McNemar’s Chi-Square: SPSS
Advanced Topics
Why do Researchers Use p < .05?
Is phi Just a Pearson Correlation (r)? Yes.
Adjusted phi
Is a One-Tailed Pearson Chi-square Analysis Possible?
Odds/Ratios
Odds/Ratios: SPSS
Relative Risk
Relative Risk: SPSS
How to Interpret Odds Ratios and Risk Ratios Less than 1
2 x 2 Pearson Chi-Square: Interactions?
Partitioning
CHAPTER 4: CATEGORICAL DATA
Binomial Test
I once had a girlfriend who insisted that she could tell the difference between Pepsi
and Coke. I had my doubts that anyone could do this, so I decided to put her to the test
statistically (as you do). First, I blindfolded her, so that it would be a blind taste test. I then
filled five cups with Pepsi and five cups with Coke (but I didn’t tell her that I was going to fill
the cups 50/50; it could have been any fraction, from her perspective). I placed them onto a
table in front of her. I had her taste the contents of each of the ten cups. After each cup
tasting, she told me whether she thought the drink was Pepsi or Coke. I wrote down her
responses and scored each response as a 1 for a correct identification and a 0 for an incorrect
identification.
What is important to keep in mind, here, is that anyone would be expected to achieve
50% accuracy on this test, because there are only two possible answers for each of the ten
taste tests. Thus, anyone would be able to get 5 out of 10 taste tests correct, just by guessing.
My ex-girlfriend managed to identify correctly the contents of 7 out of the 10 cups. The key
statistical question is whether 7 out 10 is beyond the chance observation of 5 out of 10? That
is, my ex-girlfriend could have just been guessing, and, just by chance, managed to identify the
contents of 7 out of the 10 cups, correctly. This is precisely the type of question that can be
answered by a statistical test. Is the observation of 7 out of 10 beyond the expected chance of
5 out of 10? Before I report the results of the statistical analysis I performed in this case, I
would like to explain what “beyond chance” means in the context of statistics.
People often fool themselves into believing something systematic has happened,
when, in fact, it was really just a chance event (e.g., Croson & Sundali, 2005; Tversky &
Kahneman, 1971). Consequently, statisticians want to protect themselves against concluding,
incorrectly, that something happened simply by chance. In practical terms, “beyond chance”
means that there is, at most, a 5% chance that one has fooled oneself into concluding
an event has occurred beyond chance, when, in fact, it has occurred simply by chance.
Theoretically, nothing is actually beyond chance, if you take a probabilistic perspective on life
(which statisticians do). However, for better and for worse, statisticians have adopted a chance
event of 5% or less as sufficiently unlikely to merit the label of “beyond chance”. To return to
my ex-girlfriend’s apparent ability to detect the difference between Pepsi and Coke, I needed
to estimate the chances of 7 correct guesses out of 10, under the expectation that as many as 5
correct guesses out of 10 trials could happen simply by guessing randomly. To estimate the
chances, in this case, one could conduct a statistical analysis known as the binomial test. I did
just that with my ex-girlfriend’s data. If the chances of having observed 7 out of 10 correct
taste tests was less than 5%, I would have concluded that my ex-girlfriend had the ability to
distinguish the taste of Pepsi and Coke “beyond chance”.
Null hypothesis (H0): My ex-girlfriend does not have the ability to detect Pepsi and Coke
systematically (identification probability = .50)
Alternative hypothesis (H1): My ex-girlfriend does have the ability to detect Pepsi and Coke
systematically (identification probability ≠ .50)
To conduct a binomial test “by hand” is a little more complicated than one might
think. Fortunately, most statistical programs offer the option to conduct a binomial test on
data.
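Although the analyses in this chapter are reported from SPSS, the same binomial test can be sketched in a few lines of Python. This is an illustrative aside (the scipy library is assumed to be available), not part of the original analysis:

```python
# Two-sided binomial test: is 7 correct identifications out of 10
# beyond the 50% accuracy expected from pure guessing?
from scipy.stats import binomtest

result = binomtest(k=7, n=10, p=0.5, alternative="two-sided")
print(round(result.pvalue, 3))  # .344
```

Because .344 is greater than .05, the null hypothesis of pure guessing (identification probability = .50) is retained.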
The key value to examine in this table is in the last column: .344. The .344 value
corresponds to the probability of fooling oneself into thinking that my ex-girlfriend had the
ability to distinguish between Pepsi and Coke in a taste test based on 10 tastings. Because .344
is greater than .05 (i.e., 5%), I concluded that the observed correct identification proportion of
.70 was not beyond the .50 expectation under the null hypothesis, i.e., correct identifications
that would be expected purely by guessing. Thus, my ex-girlfriend failed to demonstrate a
statistically significant ability at detecting the difference between Pepsi and Coke.1
1 Fortunately for me, she was totally ignorant of the concept of statistical power and the matter was left at that. To learn about the importance of statistical power, check out the chapter on the difference between two means. Also, check out Practice Question 1 of this chapter.
2 Long story short, I’m not fond of the Monte Carlo utility in SPSS for such purposes, which appears to be available within the binomial analysis menu option.
χ² = (3 − 5)²/5 + (7 − 5)²/5 = 4/5 + 4/5 = 0.8 + 0.8 = 1.60
The solved Pearson chi-square formula yielded a chi-square value of 1.60. The reason
the Pearson chi-square test is known as a chi-square test is because it was demonstrated by
Karl Pearson that the values calculated from the formula above follow the theoretical chi-
square distribution (Watch Video 4.5: The Chi-Square Distribution - Explained). The chi-square
distribution is similar in nature to the z-distribution and the t-distribution. Theoretically, the
chi-square distribution represents the sampling variability of various statistical values under
the null hypothesis (i.e., chance variation). For example, values obtained from the Pearson chi-
square formula (1) are known to follow the chi-square distribution, when the data are
consistent with the absence of a difference in proportions in the population. Consequently,
the calculated Pearson chi-square value of 1.60 above can be placed within the chi-square
distribution to determine how unlikely it is, under the expectation that the null hypothesis is
true (i.e., that there is no difference in the proportions).
Just like the one-sample t-test, a Pearson chi-square analysis requires the identification
of degrees of freedom. The Pearson chi-square test with only two possible outcomes (e.g., hit
or miss, heads or tails, predicted or non-predicted) is associated with one degree of freedom.
Based on the theoretical chi-square distribution with one degree of freedom, it is known that a
chi-square value of 3.841 corresponds to the 95th percentile (i.e., p = .05). Thus, a calculated
Pearson chi-square value greater than 3.841 would imply a sufficiently unlikely event as to
suggest that it occurred beyond chance (i.e., p < .05). As the calculated Pearson chi-square
value of 1.60 was smaller than 3.841, the alternative hypothesis that my ex-girlfriend had the
ability to discriminate between Pepsi and Coke was not supported. To calculate the precise
probability associated with a Pearson chi-square value of 1.60 and 1 degree of freedom, the
analysis could be performed in SPSS, which I do next.
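Before turning to SPSS, note that the precise p-value can also be obtained from the chi-square distribution directly. The short Python sketch below (scipy assumed available) is purely illustrative:

```python
# Survival function of the chi-square distribution with df = 1
# gives the p-value for the calculated chi-square of 1.60.
from scipy.stats import chi2

p_value = chi2.sf(1.60, df=1)
print(round(p_value, 3))  # .206

# The 95th percentile (critical value) with df = 1 is 3.841.
critical = chi2.ppf(0.95, df=1)
print(round(critical, 3))  # 3.841
```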
Next, an SPSS table entitled ‘Test Statistics’ includes the key statistical result, the chi-square
value. In this example, the chi-square was calculated at 1.60. With 1 degree of freedom, the p-
value was estimated at .206. Thus, as p = .206 is greater than p = .05, the null hypothesis of no
difference between the proportion of trials guessed correctly by my ex-girlfriend (.70) and the
null hypothesis expectation (.50) was not rejected (p > .05, or p = .206 more precisely). Stated
alternatively, there was an absence of statistical evidence to suggest that my ex-girlfriend had
the ability to distinguish between Pepsi and Coke.
3 In the Advanced Topics section of the chapter, I describe a syntax-based method that can be used in SPSS, when one or more variables are constants in the context of the Pearson chi-square analysis and the McNemar chi-square analysis.
be dealt with by the binomial test. Consequently, despite the requirement for at least some
variation in the data (almost always satisfied with real data), the Pearson chi-square statistic is
much more commonly used than the binomial test, even in the context of testing the
difference between two independent proportions. I suspect another reason the Pearson chi-
square test is more popular than the binomial test is because the Pearson chi-square test is
more powerful (i.e., a greater chance to reject the null hypothesis). You may have noted that the
p-value associated with the binomial test was p = .344, whereas the p-value associated with the
Pearson chi-square test was p = .206. Because p = .206 was closer to .05 than .344, the Pearson
chi-square analysis was more likely to detect a ‘statistically significant’ effect.
4 Geschwind and Behan (1982) used an extreme groups approach in their data collection procedure, which was not replicated here in the simulated data.
However, in this case, the method used to calculate the expected frequencies is a little more
complicated, because two variables (handedness and dyslexia) need to be considered
simultaneously (Watch Video 4.7: 2x2 Pearson Chi-Square Calculations – Step-by-Step).
Conceptually, the main thing you need to know about expected and observed frequencies is
that the larger the discrepancy between the two types of frequencies, the greater chance the
null hypothesis of no association between the two independent variables will be rejected, all
other things equal.
By the way, the degrees of freedom associated with any Pearson chi-square contingency table
analysis are equal to (r – 1)(c – 1), where r is equal to the number of rows, and c is equal to the
number of columns. Thus, in this 2 x 2 contingency table analysis, (r – 1)(c – 1) = (2 – 1)(2 – 1),
which is equal to 1. In the previous Pearson chi-square analysis based on the Pepsi vs. Coke
taste testing data, the degrees of freedom were also equal to 1 (r – 1 = 2 – 1 = 1).
Hypotheses
Null Hypothesis (H0): The percentage of dyslexics across left-handers and right-handers will be
equal.
Alternative Hypothesis (H1): The percentage of dyslexics across left-handers and right-handers
will be unequal.
As can be seen in the SPSS table entitled ‘Chi-Square Tests’, the Pearson chi-square
value was estimated at 22.03, which was statistically significant, p < .001. Thus, a
disproportionately large percentage of dyslexics were left-handed. Stated alternatively, the
difference in the percentages (2.2% versus 22.7%) was statistically significant, p < .001.
Thus, in the handedness and dyslexia example, phi worked out to .297.
phi = √(22.03 / 250) = .297
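For readers who prefer to verify such results outside SPSS, a minimal Python sketch follows (scipy assumed; the cell frequencies are the ones reported later in this chapter’s odds/ratio section):

```python
# 2x2 contingency table: rows = handedness (left, right),
# columns = dyslexia (yes, no); simulated Geschwind and Behan (1982) data.
import math
from scipy.stats import chi2_contingency

observed = [[5, 17],    # left-handers: 5 dyslexic, 17 not
            [5, 223]]   # right-handers: 5 dyslexic, 223 not

# correction=False requests the uncorrected Pearson chi-square
# (the default applies the Yates correction for 2x2 tables).
chi2_value, p, df, expected = chi2_contingency(observed, correction=False)
print(round(chi2_value, 2))  # 22.03

phi = math.sqrt(chi2_value / 250)  # N = 250
print(round(phi, 3))  # 0.297
```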
I discuss measures of standardized association in detail in chapter 5. I note here only briefly
that phi and Pearson correlation are identical (also, see Advanced Topics of this chapter).
1: Random Sampling
All cases in the population must have an equal chance of being selected into the
sample. In practice, it is entirely unrealistic to run a study such that every person (or case) in
the population to which you wish to infer your results has an equal chance of being selected
into your sample. Consequently, in practice, this first assumption is virtually always violated.
The extent to which the violation of this assumption affects the accuracy of statistical results is
anyone’s guess. Ultimately, there is not much that can be done about it. The show must go on,
as they say.
In the handedness and dyslexia study, many of the participants were recruited from
the health centers that Dr. Geschwind worked at in Glasgow, Scotland. Thus, not everyone in
the adult population had an equal chance to be selected into the sample. Instead, the sample
used in the Geschwind and Behan (1982) study would be considered a convenience sample.
It’s impossible to know the extent to which this may have compromised the accuracy of the p-
value obtained from the Pearson chi-square analysis.
2: Independence of Observations
Independence of observations implies that the participants in the investigation have
not influenced each other with respect to the variables of interest. Typically, it is easy to satisfy
this assumption, if the participants complete the tasks, tests or questionnaires independently.
In the handedness and dyslexia study, it is very likely that this assumption was
satisfied. That is, all of the participants were tested individually. Furthermore, they were not
related to each other (not brothers/sisters, for example).
yield the highest percentage of clicks on ads. Tech companies often refer to these types of
studies as A/B testing. In practice, the studies involve the presentation of two or more slightly
different experiences on a particular webpage. Which version of the webpage is displayed
is chosen at random when the webpage visitor’s browser loads the webpage’s content. Then,
they conduct statistical analyses to determine whether there was a statistically significant
difference between the percentages of clicks on ads across the different versions of the
webpage. Although I would not expect Google to ever publish the results associated with their
particular studies, I have simulated some data to replicate my impressions of what they might
be expected to observe for a particular type of study. The study I have in mind consists of the
type of banner ad that appears on the right side of the screen on a Google dedicated webpage
(e.g., Google Finance). Should the ad be selected based on: (1) ads that are gaining popularity
generally across the internet; (2) the last item the user purchased online; or (3) the last few
search terms the user inputted into Google’s search engine? Thus, there
are three groups of ad selections: (1) ‘trending’, (2) ‘purchased’, and (3) ‘search’. The
dependent variable is the percentage of ads that are clicked on, known as the click-through-
rate (CTR).
Because there are three groups in this analysis, a 2x2 Pearson chi-square analysis
could not be applied. Instead, a larger Pearson chi-square analysis needs to be selected. As a
general term, all Pearson chi-square analyses relevant to the test of percentages are known as
contingency table analyses. A contingency table can include two variables with any number
of levels. They are known as r by c contingency tables, where r corresponds to the number of
rows, and c corresponds to the number of columns. The 2 x 2 Pearson chi-square analysis is
by far the most commonly observed
Pearson chi-square analysis in the literature. However, it is limited to two variables with a
maximum of only two levels (e.g., left/right, agree/disagree, correct/incorrect). Larger
contingency table analyses are known as ‘omnibus’ tests, because they include more than just
one statistical comparison, simultaneously.
In the current example, there are three possible comparisons: (1) trending versus
purchased; (2) trending versus search; and (3) purchased versus search. A Pearson chi-square
analysis based on such data would be referred to as a 3x2 Pearson chi-square analysis5,
because there are three levels in the grouping variable (trending, purchased, and search) and
two levels in the dependent variable (clicked versus not-clicked) (Data File: ad_types).
Hypotheses
Null Hypothesis (H0): The CTR (percentage) will be equal across all three ad selection options.
5 It is essentially arbitrary whether you refer to the analysis as a 2x3 or a 3x2 Pearson chi-square analysis.
Alternative Hypothesis (H1): The CTR (percentage) will not be equal across all three ad
selection options.
As can be seen in the SPSS table entitled ‘Chi-Square Tests’, the null hypothesis of equal
percentages was rejected, χ²(2) = 11.98, p = .003. However, because a Pearson chi-square
analysis larger than a 2x2 is an omnibus test, it is not known precisely which percentages were
statistically significantly different from the null hypothesis. It is only known that at least one
percentage is statistically significantly different.
In order to uncover the nature of the statistically significant effect in more detail, further
analyses need to be performed. First, however, I cover the topic of effect size. Then, I discuss
follow-up analyses.
It will be noted that phi was also estimated at .020. The nature of the Cramer’s V formula is
such that when either the r or the c portion of the contingency table is associated with only
two levels, the phi coefficient and the Cramer’s V coefficient will equal each other. However, I
do not believe that such an observation renders a Cramer’s V coefficient an interpretable
measure of association in the 2x3 contingency table case.
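As an illustration, Cramer’s V can be computed from a chi-square value, the sample size, and the table dimensions. In the sketch below, the chi-square of 11.98 is from the ad-type example, but N = 30,000 is a hypothetical value (back-calculated so that the result matches the phi of .020 reported above; the actual N resides in the ad_types data file):

```python
# Cramer's V = sqrt(chi2 / (N * (min(r, c) - 1))).
import math

chi2_value = 11.98
n = 30_000          # hypothetical sample size (assumption, see lead-in)
r, c = 3, 2         # 3x2 contingency table
v = math.sqrt(chi2_value / (n * (min(r, c) - 1)))
print(round(v, 3))  # 0.02
```

Because min(r, c) − 1 = 1 for any table in which one variable has two levels, V reduces to phi here, which is the point made above.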
analysis, and the three adjusted standardized residuals. In practice, researchers typically
disregard the potential impact of the omnibus statistical analysis with respect to the impact on
the familywise error rate. Instead, they focus upon the number of follow-up analyses used to
help uncover the nature of the effect. In the ad type example, there were three adjusted
standardized residuals that were consulted and evaluated, where a z-value greater than |1.96|
was indicative of a p-value less than .05 (i.e., statistically significant).
Additionally, I’ll note that I only applied the Bonferroni correction in the current
example under the pretense that three comparisons were made, even though the 2x3
contingency table included a total of six adjusted standardized residuals. In my opinion, the six
standardized residuals are not independent from each other. In fact, the left and right side of
the contingency table are a mirror image of each other. Therefore, it would be inappropriate
to adjust the p-value, based on six analyses. There were really only three analyses.
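The adjusted standardized residuals themselves can be sketched in Python. The 2x3 frequencies below are hypothetical (the chapter’s ad_types data are not reproduced here); the formula is the standard one, z = (O − E) / √(E(1 − row total/N)(1 − column total/N)):

```python
# Adjusted standardized residuals for a contingency table.
import math

observed = [[120, 880],   # trending: clicked, not clicked (hypothetical)
            [150, 850],   # purchased (hypothetical)
            [100, 900]]   # search (hypothetical)

rows = [sum(row) for row in observed]
cols = [sum(col) for col in zip(*observed)]
n = sum(rows)

z = {}
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = rows[i] * cols[j] / n
        z[(i, j)] = (o - e) / math.sqrt(
            e * (1 - rows[i] / n) * (1 - cols[j] / n))

# With a Bonferroni correction for three comparisons (alpha = .05 / 3),
# |z| must exceed roughly 2.39 to be declared statistically significant.
for key, value in z.items():
    print(key, round(value, 2))
```

Note that the output illustrates the mirror-image property described above: within each row, the residual for ‘clicked’ is exactly the negative of the residual for ‘not clicked’.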
McNemar Chi-Square
Does the percentage of infants who cry at night change across the ages of 12 months
to 36 months? Presumably, the percentage decreases, right? Or perhaps it increases? Gaylor,
Goodlin-Jones and Anders (2001) were interested in examining this issue scientifically. To this
effect, they collected data on 33 children as they slept at night (video cameras) when they
were 12 months old, and then again when they were 36 months old. The children were coded
as ‘signalers’, if they cried and needed to be settled, or ‘self-soothers’, if they were able to
soothe themselves throughout the entire night. At 12 months, 16 of the children (48.5%) were
identified as self-soothers. At 36 months, 22 of the children (66.7%) were identified as self-
soothers. Thus, at least numerically, the percentage of self-soothing children increased by
18.2% (66.7 – 48.5). To test the numerical difference between 48.5% and 66.7% statistically,
one could use the McNemar chi-square statistic.
Hypotheses
Null Hypothesis (H0): The percentage of signalers at 12 months will be equal to the percentage
of signalers at 36 months.
Alternative Hypothesis (H1): The percentage of signalers at 12 months will not be equal to the
percentage of signalers at 36 months.
McNemar (1947) derived a very simple formula for the test of the difference between
two percentages (or proportions) which relies exclusively upon the number of discordant pairs
in the observations. That is, the number of observations that switched from ‘yes’ to ‘no’,
divided by the total number of switchers.
χ² = (b − c)² / (b + c)    (3)
The values obtained from formula (3) correspond to the chi-square distribution with 1
degree of freedom. In this example, the number of self-soothers at age 12 months that
changed into signalers at age 36 months was 2 (see Table C4.1). By contrast, the number of
signalers at age 12 months that changed into self-soothers at age 36 months was 8.
Thus, applying formula (3) to the relevant data in Table C4.1 yielded a chi-square value of 3.60.
Is the chi-square value of 3.60 large enough within the context of the chi-square distribution to
declare it ‘statistically significant’?
χ² = (2 − 8)² / (2 + 8) = 36 / 10 = 3.60
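The McNemar calculation, along with the exact binomial p-value that SPSS reports when there are few discordant pairs, can be sketched in Python (scipy assumed):

```python
# McNemar chi-square from the discordant pairs: 2 self-soothers became
# signalers, and 8 signalers became self-soothers.
from scipy.stats import chi2, binomtest

b, c = 2, 8
mcnemar = (b - c) ** 2 / (b + c)
print(round(mcnemar, 2))  # 3.6

p_chi2 = chi2.sf(mcnemar, df=1)
print(round(p_chi2, 3))   # .058

# Exact binomial version (the test SPSS substitutes for McNemar
# when the number of discordant pairs is small):
p_exact = binomtest(b, b + c, 0.5).pvalue
print(round(p_exact, 3))  # .109
```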
It turns out that all McNemar chi-square statistic values that are larger than 3.84 are
statistically significant, p < .05. Thus, as the calculated McNemar chi-square value of 3.60 was
not greater than 3.84, the null hypothesis of equal percentages (or proportions) was not
rejected. Rather than calculate the McNemar chi-square by hand, it would be more efficient to
use software. Also, the precise p-values associated with a McNemar chi-square value can be
obtained from a computer program.
It will be noted, however, that the subscript ‘a’ next to the p-value of .109 denoted
that the Binomial distribution was used. Thus, SPSS did not actually conduct a McNemar chi-
square analysis. Instead, when the number of discordant pairs is 10 or less, SPSS automatically
conducts the more conservative binomial test, instead of the McNemar test. In this context,
the binomial test is similar to the Yates correction applied to the Pearson chi-square analysis.
Unfortunately, both tests have been shown to be excessively conservative (Camilli & Hopkins,
1978, 1979; Conover, 1974; Fagerland, Lydersen, & Laake, 2013; Feinberg, 1980; Larntz, 1978;
Thompson, 1988). Consequently, I’m not convinced that the binomial test, or the Yates
correction, should be used.
Fortunately, there is a relatively easy solution to the problem: conduct a Cochran’s Q
analysis. Cochran’s Q is a generalized version of McNemar’s chi-square analysis. Thus, it can
test the difference between two or more within-subjects percentages/proportions. In the
signaler/self-soother study, there are only two percentages. To conduct a Cochran’s Q analysis,
you can use the Nonparametrics utility in SPSS (Watch Video 4.14: Cochran’s Q as a Substitute for
McNemar Chi-Square).
As can be seen in the SPSS table entitled ‘Test Statistics’, the sample size was 33 and the
Cochran’s Q value was estimated at 3.600, which is precisely the same chi-square value
calculated above with the McNemar chi-square formula. Thus, the Cochran’s Q value is really a
chi-square value. Furthermore, the p-value was estimated at .058, which suggests that the null
hypothesis was not rejected (as expected, as the chi-square value was not greater than 3.84).
Thus, there was an absence of statistically significant evidence to suggest that the percentage
of self-soothers changed from 12 to 36 months.
Advanced Topics
6 Never trust an experimental psychologist.
Table C4.2. Probability of a Run of Identical Coin-Flip Outcomes (e.g., Heads)

Flips in a row:  1     2     3     4     5     6     7     8     9     10
Probability:    .500  .250  .125  .063  .031  .016  .008  .004  .002  .001
To help understand the results reported by Cowles and Davis (1982) in a different
manner, I have reported a probability table under the pretense that a person played a coin
flipping game (heads or tails) with odds of winning at, theoretically, .50 for any trial (see Table
C4.2). However, if the coin were “fixed” such that the participant were to lose every time, it
would be predicted that participants would express suspiciousness of a possible non-chance
event after somewhere between three to four losses in a row (p = .125 to p = .063); and they
would be convinced of a non-chance event after 7 losses in a row (i.e., p = .008).
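The probabilities in Table C4.2 are simply .5 raised to the number of consecutive outcomes, which can be verified with a one-line Python computation:

```python
# Probability of n identical outcomes in a row with a fair coin: .5 ** n
for n in range(1, 11):
    print(n, 0.5 ** n)
```

After rounding to three decimals, the values match the .500, .250, …, .001 entries in the table.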
7 Pearson’s r and phi also have slightly different standard errors, so they will also not be associated with the same exact p-value. I believe it has not yet been established which p-value is more accurate.
Adjusted phi
As demonstrated by Breaugh (2003), as the row and column marginal proportions
differ in magnitude, a very likely scenario in practice, the maximum possible phi values will
become increasingly less than 1.0. Breaugh (2003) provided a realistic example where the
estimated phi between two dichotomously scored variables was .20, which would suggest a
relatively small effect. However, the maximum possible phi value was .33, which would
suggest that the observed .20 was large. For this reason, some researchers report a statistic
known as adjusted phi, which is the ratio of the observed phi to the maximum possible phi:
phi(adj) = phi(obs) / phi(max)    (4)
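Because phi-max depends only on the marginal frequencies, it can be obtained numerically by constructing the most extreme table the margins allow. The sketch below does this for the handedness/dyslexia margins from earlier in the chapter; it is a worked illustration, not a method described in the chapter:

```python
# Phi-max: with margins fixed, phi is maximized when cell 'a' takes
# its largest admissible value, min(row 1 total, column 1 total).
import math

r1, r2 = 22, 228    # row margins (left-, right-handers)
c1, c2 = 10, 240    # column margins (dyslexic, not dyslexic)
n = r1 + r2

a = min(r1, c1)     # most extreme table consistent with the margins
b = r1 - a
c = c1 - a
d = n - a - b - c

phi_max = (a * d - b * c) / math.sqrt(r1 * r2 * c1 * c2)
phi_obs = 0.297     # observed phi from the chi-square analysis
phi_adj = phi_obs / phi_max
print(round(phi_max, 3))  # 0.657
print(round(phi_adj, 2))  # 0.45
```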
associated with a critical chi-square value of 3.84. Furthermore, 3.84 square rooted
equals 1.96. Not coincidentally, 1.96/-1.96 in the z-distribution corresponds to 95% of the area
under the normal curve, with approximately 2.5% of the area beyond 1.96/-1.96 in each of the
two tails. Stated alternatively, a z-value of 1.96 corresponds to an alpha of .05. Thus, if 1.96 is
a two-tailed critical value in the context of the z-distribution, then 3.84 should be considered a
two-tailed critical value with alpha = .05. In order to conduct a one-tailed test with the z-
distribution (alpha = .05), one could use 1.64 as the critical value. With respect to SPSS, there is
no specific option to conduct a one-tailed Pearson chi-square analysis. However, one could
simply conduct the Pearson chi-square analysis as per normal and then divide the reported p-
value in half. If it is less than .05, then one could declare a statistically significant effect.
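The correspondence between the chi-square and z critical values described above can be checked directly (scipy assumed):

```python
# The chi-square critical value (df = 1, alpha = .05) equals 1.96 squared.
from scipy.stats import chi2, norm

chi2_crit = chi2.ppf(0.95, df=1)
z_two_tailed = norm.ppf(0.975)
print(round(chi2_crit, 2))          # 3.84
print(round(z_two_tailed, 2))       # 1.96
print(round(z_two_tailed ** 2, 2))  # 3.84

# One-tailed z critical value at alpha = .05:
print(round(norm.ppf(0.95), 2))     # 1.64
```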
One-tailed Pearson chi-square testing is possible only in the 2x2 contingency table
design. In contingency table designs with more than two levels, the logic of a
one-tailed test breaks down completely. It is tantamount to conducting a one-tailed
ANOVA (described in another chapter). Ultimately, it is not possible to specify the direction of
an effect with three or more levels.
Odds/Ratios
Another method to describe a statistically significant 2x2 Pearson chi-square result is
to express the effect as an odds/ratio. When there is a complete absence of an association
between the two variables, the odds/ratio will equal 1.0. The further the odds/ratio deviates
from 1.0 (either lower or higher), the larger the magnitude of the effect. An odds/ratio greater
than 1.0 implies a greater chance of Y given an event X. An odds/ratio less than 1.0 implies a
lesser chance of Y given an event X. An intuitive formula for the calculation of an odds/ratio is:
OR = (a / b) / (c / d)     (5)
Where a, b, c, and d refer to the frequencies associated with each of the cells within the
contingency table analysis. What exactly constitutes a, b, c, and d is to some degree arbitrary,
in the sense that a and c can certainly be interchanged, so long as b and d are also
interchanged. In most cases, you will likely want to speak of the presence of the outcome in
one group relative to the other. Thus, in the handedness and dyslexia example, it makes most
sense to me to speak of the presence of dyslexia in left-handers relative to right-handers.
Thus, the observed frequencies were a = 5, b = 17, c = 5 and d = 223.
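With those frequencies, the odds ratio of Equation 5 works out as follows (a minimal Python sketch of the arithmetic that SPSS performs):

```python
# Odds ratio for the handedness and dyslexia example (Equation 5),
# using the cell frequencies reported above.
a, b, c, d = 5, 17, 5, 223

odds_ratio = (a / b) / (c / d)
print(round(odds_ratio, 3))  # 13.118
```

This matches the value of 13.118 reported by SPSS in the 'Risk Estimate' table discussed below.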
Odds Ratios: SPSS
In order to conduct an odds ratio analysis in SPSS, you can use the Crosstabs menu
utility (Watch Video 4.16: Odds Ratios in SPSS):
As can be seen in the SPSS table entitled 'Risk Estimate', the odds ratio (OR) was estimated at
13.118. Thus, left-handers were 13.12 times more likely to be diagnosed with dyslexia than
right-handers. Furthermore, the 95% confidence intervals corresponded to 3.455 and 49.801.
Thus, if this study were conducted a large number of times with new samples, we would
expect the OR point-estimate to fall somewhere between 3.455 and 49.801 across 95% of the
samples. Given the relatively large sample of 250 participants, it may be surprising to observe
such a wide confidence interval.
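The interval SPSS reports can be reproduced by hand, presumably via the standard log-odds-ratio method, in which the standard error of the log odds ratio is sqrt(1/a + 1/b + 1/c + 1/d). A Python sketch, under that assumption:

```python
import math

# Sketch of the 95% confidence interval for the odds ratio, using the
# standard log-odds-ratio standard error. Assumes SPSS uses this
# common method; the cell frequencies are those from the example above.
a, b, c, d = 5, 17, 5, 223
odds_ratio = (a * d) / (b * c)  # 13.118

se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
print(round(lower, 2), round(upper, 2))  # close to SPSS's 3.455 and 49.801
```

The very small discordant cell counts (a = c = 5) are what drive the large standard error, and hence the wide interval, despite the overall sample of 250.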
Relative Risk
Relative risk is an alternative approach to the odds ratio for the interpretation of an
effect associated with a 2x2 Pearson chi-square analysis. It represents the ratio of the
observed condition percentages of interest across the two groups. It is easy to get confused
here, because the manner in which the data are entered into the formula or stats program will
affect the relative risk value that is estimated. Consequently, it is important for you to know
the manner in which you want to present the results. In the handedness and dyslexia example
(described in the Foundations section of this chapter), my preference would be to report the
relative risk as the percentage of people diagnosed with dyslexia who were left-handed
relative to the percentage of people diagnosed with dyslexia who were right-handed.
Numerically, my preference works out to the following: 22.7 / 2.2 = 10.32. Thus, the relative
risk of a dyslexia diagnosis, if a person were left-handed, rather than right-handed, was 10.32.
The relative risk value of 10.32 was comparable to the odds ratio of 13.12. In cases
where there is a fairly substantial difference between the odds ratio and the relative risk ratio,
you will almost always be better off reporting the relative risk ratio (Altman et al., 1998).
SPSS simply and automatically divides the top row percentage by the corresponding
bottom row percentage. SPSS does so twice: once for the left-side column of results and once
for the right-side column of results. Thus, as can be seen in the SPSS table entitled
‘handedness * dyslexia Crosstabulation’, the relative risk of 1.266 was obtained by the
following ratio: 97.8 / 77.3. Furthermore, the relative risk of .096 was obtained by the
following ratio: 2.2 / 22.7. In my opinion, neither of these two relative risk ratios is
particularly intuitive. Fortunately, as a researcher, you have the option to divide the
corresponding percentages as you see fit. Thus, in this case, my preference is to divide 22.7%
by 2.2% to yield a relative risk ratio of 10.32. Doing so would allow me to say that left-handers
are 10.32 times more at risk of being diagnosed with dyslexia than right-handers. The take-home
message, here, is that you cannot rely upon SPSS to necessarily report the most intuitive
relative risk ratio for any particular analysis. Instead, you need to think about how best to
report the relative risk ratio, and possibly calculate it yourself.
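Calculating it yourself is straightforward. A Python sketch using the cell frequencies from the handedness and dyslexia example (a = 5, b = 17, c = 5, d = 223):

```python
# Relative risk computed the way described above: the proportion of
# left-handers diagnosed with dyslexia divided by the proportion of
# right-handers diagnosed with dyslexia.
a, b, c, d = 5, 17, 5, 223

risk_left = a / (a + b)    # 5/22  = .2273 (22.7%)
risk_right = c / (c + d)   # 5/228 = .0219 (2.2%)

relative_risk = risk_left / risk_right
print(round(relative_risk, 2))  # 10.36
```

Note that the exact value (10.36) differs slightly from the 10.32 obtained above from the rounded percentages (22.7 / 2.2); working from the raw frequencies avoids that rounding error.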
The limitation with calculating the relative risk yourself, if SPSS does not report your
preferred risk ratio, is that you will not get the 95% confidence intervals. Consequently, you
cannot report the relative risk as statistically significant, nor can you report the level of
confidence associated with the estimated relative risk point estimate.
Fortunately, there is a relatively easy solution to the problem. You can simply recode
one of the variables, so that the relevant percentages are ordered in the contingency table in
such a way as to yield the desired relative risk ratio. In the handedness and dyslexia example, I
recoded the handedness variable such that left-handers were coded ‘0’ and right-handers
were coded ‘1’. When I re-ran the analysis with the recoded handedness variable and the
original dyslexia variable, I obtained exactly the same Pearson chi-square results. Additionally,
I obtained the following relative risk ratios:
As can be seen in the SPSS table entitled ‘Risk Estimate’, the ‘For cohort dyslexia = yes’
row includes a relative risk ratio of 10.364, which is precisely the relative risk ratio I calculated
by hand above. Thus, left-handers are 10.36 times more likely to be diagnosed with dyslexia
than right-handers. Furthermore, the relative risk ratio estimate of 10.36 was statistically
significant, because the 95% confidence intervals did not intersect with 1.0. Specifically, the
95% confidence intervals corresponded to 3.25 and 33.05. Unfortunately, SPSS does not report
the p-values associated with the estimated relative risk point-estimates.
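The confidence interval SPSS reports for the relative risk can also be reproduced by hand, presumably via the standard log-relative-risk method, in which the standard error is sqrt(1/a - 1/(a+b) + 1/c - 1/(c+d)). A Python sketch, under that assumption:

```python
import math

# Sketch of the 95% confidence interval for the relative risk, using
# the standard log-relative-risk standard error. Assumes SPSS uses
# this common method; frequencies are from the example above.
a, b, c, d = 5, 17, 5, 223
relative_risk = (a / (a + b)) / (c / (c + d))  # 10.364

se_log_rr = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))
lower = math.exp(math.log(relative_risk) - 1.96 * se_log_rr)
upper = math.exp(math.log(relative_risk) + 1.96 * se_log_rr)
print(round(lower, 2), round(upper, 2))  # close to SPSS's 3.25 and 33.05
```

Because the interval excludes 1.0, the relative risk may be declared statistically significant, consistent with the interpretation above.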
Partitioning
The partitioning approach to the decomposition of a statistically significant omnibus
Pearson chi-square analysis involves conducting a series of 2x2 Pearson chi-square follow-up
analyses. In the context of the webpage and ad clicking study, there are three possible 2x2
Pearson chi-square analyses: trending versus purchase; (2) trending versus search; and (3)
purchase versus trending.
Partitioning: SPSS
In order to conduct a series of partitioning Pearson chi-square analyses, it is necessary
to select the groups for comparisons. For example, in order to restrict the analysis to a 2x2
Pearson chi-square analysis of the trending versus purchase percentages, those two groups
need to be selected in SPSS, while the search group is excluded. In order to select groups in
SPSS, the Select Cases menu utility can be used.
Each time the appropriate groups/cases are selected, the 2x2 Pearson chi-square
analysis can be conducted and interpreted. The results associated with the three 2x2 Pearson
chi-square analyses are reported in Table C4.3. It can be observed that the trending vs.
purchase (p = .001) and the purchase versus search (p = .022) 2x2 Pearson chi-square analyses
were statistically significant. However, after the application of a Bonferroni correction, only
the trending versus purchase 2x2 Pearson chi-square analysis was statistically significant, p =
.003 (Watch Video 4.18: Pearson Chi-Square Partitioning - SPSS).
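The Bonferroni arithmetic above can be sketched in Python: each uncorrected p-value is multiplied by the number of comparisons (three), which is equivalent to testing each against .05 / 3.

```python
# Bonferroni correction for the two statistically significant
# uncorrected follow-up analyses reported above (three 2x2
# comparisons were conducted in total).
n_comparisons = 3
p_values = {"trending vs. purchase": 0.001,
            "purchase vs. search": 0.022}

for label, p in p_values.items():
    p_adj = min(p * n_comparisons, 1.0)  # corrected p, capped at 1.0
    print(label, round(p_adj, 3), p_adj < 0.05)
```

As in the text, only the trending versus purchase comparison survives the correction (p = .003), whereas .022 becomes a non-significant .066.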
It should be kept in mind that the result associated with the omnibus 2x3 Pearson chi-
square analysis does not correspond exactly to a series of three 2x2 Pearson chi-square
analyses. For example, the sum of the three 2x2 Pearson chi-square values will not equal the
2x3 Pearson chi-square value. The reason is that the expected cell frequencies are not the
same across the 2x3 contingency table and the corresponding 2x2 contingency tables. Thus, it
is conceivable that one could obtain a statistically significant 2x3 Pearson chi-square analysis
and no statistically significant 2x2 Pearson chi-square analyses (or vice versa).
Ultimately, the main problem with the Pearson chi-square approach to testing the
difference between percentages is that the expected cell frequencies change from analysis to
analysis, depending on which variables are included in the analysis. By contrast, the within-row
percentages remain the same. For example, the trending ad type CTR percentage was 1.90%
across all Pearson chi-square analyses which included the trending ad type variable. By
contrast, the trending ad type adjusted standardized residual was -2.53 in the 2x3 Pearson chi-
square analysis and -1.06 in the trending by search 2x2 Pearson chi-square analysis.
square analysis was estimated at .429, which suggests that there was some correspondence
(association) across time. However, such an observation is distinct from testing the within-subjects
null hypothesis that the percentage of signalers did not change from 12 months to 36 months of
age. Hypotheses relevant to change cannot be tested via Pearson's chi-square. Only the
McNemar chi-square statistic (or Cochran's Q) is appropriate for such questions.
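The McNemar chi-square depends only on the two discordant cells of the paired table, conventionally labeled b and c. A Python sketch with hypothetical counts (purely for illustration; these are not the counts from the signaling study):

```python
# McNemar chi-square for paired dichotomous data: only the discordant
# pairs (cases that changed in one direction vs. the other) carry
# information about change. Counts below are hypothetical.
b, c = 14, 4

mcnemar_chi2 = (b - c) ** 2 / (b + c)
print(round(mcnemar_chi2, 2))  # 5.56
```

With df = 1, this hypothetical value of 5.56 would exceed the critical value of 3.84, implying a statistically significant change across the two occasions.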
Standard Syntax for a 2x2 Pearson Chi-Square:

CROSSTABS
  /TABLES=v1 BY v2
  /FORMAT=AVALUE TABLES
  /STATISTICS=MCNEMAR
  /CELLS=COUNT EXPECTED
  /COUNT ROUND CELL
  /METHOD=EXACT TIMER(5).

Modified Syntax for a 2x2 Pearson Chi-Square:

CROSSTABS variables = v1 v2 (1,2)
  /TABLES=v1 BY v2
  /FORMAT=AVALUE TABLES
  /STATISTICS=MCNEMAR
  /CELLS=COUNT EXPECTED
  /COUNT ROUND CELL
  /METHOD=EXACT TIMER(5).
Practice Questions
1: My ex-girlfriend - revisited
Had I been a supportive boyfriend, I would have given my ex-girlfriend more than 10
trials to prove her capacity to distinguish Pepsi from Coke. That is, based on the 10 trials, she
achieved a 70% success rate. True, that was not found to be statistically significantly different
from 50% (p = .206). However, had she been able to maintain a 70% success rate across, say, 30
trials, it is possible that the p-value would have come in below .05. Unfortunately for me,
back then, I was more concerned with being right than being supportive – and I paid the price!
Test the hypothesis that a 70% success rate would have been found to be statistically
significant based on 30 taste-testing trials (Data File: how_to_get_kissed) (Watch Video 4.P1:
How to Get Kissed).
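The calculation behind this practice question can be sketched with an exact binomial test in Python (a sketch only; the two-tailed p is obtained here by doubling the upper tail, one common convention):

```python
import math

# Exact binomial test: 70% success (21 of 30 trials) against a chance
# rate of .50. The two-tailed p-value is obtained by doubling the
# upper-tail probability.
n, k, p0 = 30, 21, 0.5

upper_tail = sum(math.comb(n, i) * p0**i * (1 - p0)**(n - i)
                 for i in range(k, n + 1))
p_two_tailed = 2 * upper_tail
print(round(p_two_tailed, 3))  # 0.043
```

Thus, the same 70% success rate that was non-significant across 10 trials would fall below .05 across 30 trials, illustrating the role of sample size.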
3: Q: “How does your lower-back feel?” A: “Partly cloudy with a 40% chance of an afternoon
storm.”
Some people who suffer from lower-back pain (and osteoarthritis) insist that there is a
connection between changes in the weather and the pain they feel in their body. So much so,
that they can predict the weather. Beilken, Hancock, Maher, Li, and Steffens (2016)
investigated this supposed connection in a sample of 981 individuals suffering from periodic, acute
lower-back pain. Beilken et al. (2016) had the participants complete a daily diary to specify
when they experienced the lower-back pain. Beilken et al. (2016) also matched the back pain
events with the meteorological measurements for that day/time. Test the hypothesis that
there was an association between feeling back-pain and a change in barometric pressure in
the atmosphere. (Data File: weather_back_pain) (Watch Video 4.P3: Can a Person's Lower Back
Predict the Weather?)
“When mother posed a fearful expression none of the 17 infants ventured across the
deep side. In sharp contrast, 14 of the 19 infants who observed mothers' happy face
crossed the deep side, χ2(1) = 20.49, p < .0001.”
(forthcoming)
References
Altman, D. G., Deeks, J. J., & Sackett, D. L. (1998). Odds ratios should be avoided when events
are common. BMJ: British Medical Journal, 317(7168), 1318-1318.
Beilken, K., Hancock, M. J., Maher, C. G., Li, Q., & Steffens, D. (2016). Acute low back pain? Do
not blame the weather—A case-crossover study. Pain Medicine, pnw126.
Campbell, I. (2007). Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample
recommendations. Statistics in Medicine, 26, 3661–3675.
Camilli, G., & Hopkins, K. D. (1978). Applicability of chi-square to 2 × 2 contingency tables with
small expected frequencies. Psychological Bulletin, 85, 163-167.
Camilli, G., & Hopkins, K. D. (1979). Testing for association in 2 × 2 contingency tables with very
small sample sizes. Psychological Bulletin, 86, 1011-1014.
Conover, W. J. (1974). Some reasons for not using the Yates continuity correction on 2 × 2
contingency tables. Journal of the American Statistical Association, 69(346), 374-376.
Croson, R., & Sundali, J. (2005). The gambler’s fallacy and the hot hand: Empirical data from
casinos. Journal of Risk and Uncertainty, 30(3), 195-209.
Fagerland, M. W., Lydersen, S., & Laake, P. (2013). The McNemar test for binary matched-pairs
data: mid-p and asymptotic are better than exact conditional. BMC Medical Research
Methodology, 13, 91-91.
Fienberg, S. E. (1980). The analysis of cross-classified categorical data. Cambridge, MA: MIT Press.
Ferguson, G. A. (1976). Statistical analysis in psychology and education. Tokyo: McGraw-Hill.
Free, C., Knight, R., Robertson, S., Whittaker, R., Edwards, P., Zhou, W., ... & Roberts, I. (2011).
Smoking cessation support delivered via mobile phone text messaging (txt2stop): a
single-blind, randomised trial. The Lancet, 378(9785), 49-55.
Geschwind, N., & Behan, P. (1982). Left-handedness: Association with immune disease,
migraine, and developmental learning disorder. Proceedings of the National Academy
of Sciences, 79(16), 5097-5100.
Hughes, J. R., Keely, J., & Naud, S. (2004). Shape of the relapse curve and long‐term abstinence
among untreated smokers. Addiction, 99(1), 29-38.
Larntz, K. (1978). Small sample comparisons of exact levels for chi-square goodness of fit
statistics. Journal of the American Statistical Association, 73, 253-263.
MacDonald, P. L., & Gardner, R. C. (2000). Type I error rate comparisons of post hoc
procedures for I × J chi-square tables. Educational and Psychological
Measurement, 60(5), 735-754.
McNemar, Q. (1947). Note on the sampling error of the difference between correlated
proportions or percentages. Psychometrika, 12(2), 153-157.
Newcombe, R. G. (1998). Two‐sided confidence intervals for the single proportion: comparison
of seven methods. Statistics in Medicine, 17(8), 857-872.
C4.38
Gignac, G. E. (2019). How2statsbook (Online Edition 1). Perth, Australia: Author.
CHAPTER 4: CATEGORICAL DATA
Sorce, J. F., Emde, R. N., Campos, J. J., & Klinnert, M. D. (1985). Maternal emotional signaling:
Its effect on the visual cliff behavior of 1-year-olds. Developmental Psychology, 21(1),
195-200.
Thompson, B. (1988). Misuse of chi-square contingency-table test statistics. Educational and
Psychological Research, 8(1), 39-49.
Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological
Bulletin, 76(2), 105-110.
Yates, F. (1934). Contingency tables involving small numbers and the chi-square test. Supplement to the
Journal of the Royal Statistical Society, 1(2), 217-235.