Lecture3 - Contingency Analysis
Objectives
After studying this lesson and answering the questions in the exercises, a student will be able
to do the following:
Concepts:
Goodness of fit test
Two-Way Tables
Expected Counts in Two-Way Tables
The Chi-Square Test Statistic
Cell Counts Required for the Chi-Square Test
Uses of the Chi-Square Test
The Chi-Square Distributions
Objectives:
Perform a chi-square goodness of fit test.
Construct and interpret two-way tables.
Calculate expected counts in two-way tables.
Describe the chi-square test statistic.
Describe the cell counts required for the chi-square test.
Describe uses of the chi-square test.
Describe the chi-square distributions.
References:
Moore, D. S., Notz, W. I., & Fligner, M. A. (2013). The basic practice of statistics (6th ed.).
New York, NY: W. H. Freeman and Company.
The chi-square distribution has only one parameter: df, the degrees of freedom. The degrees of
freedom depend on the application, as we will see later. Here are a few facts about the
chi-square distribution. If $X^2 \sim \chi^2_{df}$, the following are true of $X^2$:
$X^2$ is a continuous random variable.
$X^2 = Z_1^2 + Z_2^2 + \cdots + Z_{df}^2$; that is, $X^2$ is the sum of df independent squared
standard normal random variables.
Data values cannot be negative: $x \in [0, \infty)$.
$\mu = df$ (the mean of the chi-square distribution is the degrees of freedom).
$\sigma = \sqrt{2 \cdot df}$, so $V(X^2) = 2 \cdot df$.
When df > 90, $X^2$ is approximately normal.
Chi-square distributions are continuous, have one mode, and are skewed to the right
(positively skewed).
The critical value of a chi-square test statistic is determined by specifying a significance
level and the degrees of freedom.
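The facts about the mean and variance can be checked by simulation. The sketch below is only an illustration: the function name `simulate_chi_square`, the choice df = 4, the number of draws, and the seed are all assumptions made here, not part of the lecture.

```python
import random
import statistics


def simulate_chi_square(df, n_draws, seed=0):
    """Return n_draws values of Z_1^2 + ... + Z_df^2 for independent standard normal Z_i."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df)) for _ in range(n_draws)]


df = 4  # illustrative choice
draws = simulate_chi_square(df, n_draws=20_000)

sample_mean = statistics.fmean(draws)
sample_var = statistics.variance(draws)

print(f"sample mean     = {sample_mean:.3f}  (theory: df = {df})")
print(f"sample variance = {sample_var:.3f}  (theory: 2*df = {2 * df})")
print(f"smallest draw   = {min(draws):.4f}  (a sum of squares is never negative)")
```

With enough draws, the sample mean should sit close to df and the sample variance close to 2·df. Both will vary slightly with the seed.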
The chi-square distributions are a family of distributions that take only positive values and
are skewed to the right. A particular chi-square distribution is specified by giving its degrees
of freedom (Figure 13).
Figure 13: Chi squared distribution shapes
The chi-square test for a two-way table with r rows and c columns uses critical values from
the chi-square distribution with (r – 1)(c – 1) degrees of freedom. The P-value is the area
under the density curve of this chi-square distribution to the right of the value of the test
statistic.
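The right-tail area that defines the P-value can be computed directly for whole-number degrees of freedom. The following is a minimal sketch using only the Python standard library. The helper names `chi2_sf` and `table_df` are hypothetical, and the approach (starting from the known df = 1 and df = 2 tails and stepping up two degrees of freedom at a time) is one convenient choice, not necessarily what statistical software does internally.

```python
import math


def chi2_sf(x, df):
    """Right-tail area P(X^2 > x) for a chi-square distribution with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-half)               # exact tail for df = 2
    else:
        a, tail = 0.5, math.erfc(math.sqrt(half))    # exact tail for df = 1
    # Each pass raises the degrees of freedom by 2 using
    # Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1).
    while 2 * a < df:
        tail += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return tail


def table_df(n_rows, n_cols):
    """Degrees of freedom for a two-way table with r rows and c columns."""
    return (n_rows - 1) * (n_cols - 1)


# Tabulated 5% critical values should give a right-tail area of roughly 0.05.
for df, critical in [(1, 3.841), (2, 5.991), (3, 7.815), (4, 9.488)]:
    print(df, critical, round(chi2_sf(critical, df), 3))
```

A call such as `chi2_sf(statistic, table_df(r, c))` then gives the P-value for a two-way table.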
The image above shows that the distribution of the chi-square statistic starts at zero
and can only have positive values.
The shape of the distribution differs markedly from that of the t or z statistic and is
skewed to the right.
The shape of the distribution changes as the degrees of freedom increase.
for the method of transportation (drive, bike, walk, other) used by students to get to class
changed from that of 5 years ago?
2. Test of Homogeneity:
This test is for assessing if two or more populations are Homogeneous (alike) with respect to
the distribution of some discrete (categorical) variable. For example, is the distribution of
opinion on legal gambling the same for adult males versus adult females?
3. Test of Independence:
This test helps us to assess if two discrete (categorical) variables are independent for a
population, or if there is an association between the two variables. For example, is there an
association between satisfaction with the quality of public schools (not satisfied, somewhat
satisfied, very satisfied) and religious affiliation (Catholic, Protestant, Muslim, etc.)?
The first test is the one-sample test for count data. The other two tests (homogeneity and
independence) are actually the same test. Although the hypotheses are stated differently and
the underlying assumptions about how the data is gathered are different, the steps for doing
the two tests are exactly the same.
All three tests are based on an $X^2$ test statistic that, if the corresponding H0 is true and the
assumptions hold, follows a chi-square distribution with some degrees of freedom, written
$\chi^2(df)$.
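For reference, the statistic in these tests is usually computed in Pearson's form. The symbols are notation assumed here rather than taken from the lecture: $O_i$ is the observed count in cell $i$, $E_i$ is the count expected under H0, and $k$ is the number of cells.

```latex
X^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
```

Large values of $X^2$ mean the observed counts are far from what H0 predicts, which is why the P-value is taken from the right tail only.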
Example 1
A certain Toy Company prints baseball cards. The company claims that 30% of the cards are
rookies, 60% veterans but not All-Stars, and 10% are veteran All-Stars. Suppose a random
sample of 100 cards has 50 rookies, 45 veterans, and 5 All-Stars. Is this consistent with the
company's claim? Use a 0.05 level of significance.
Solution
1. State the hypotheses.
Null hypothesis: The proportion of rookies, veterans, and All-Stars is 30%,
60% and 10%, respectively.
4. Decision rule/Conclusion
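The calculation for Example 1 can be sketched from the figures given in the problem statement (claimed proportions 30%, 60%, 10%; observed counts 50, 45, 5; α = 0.05). The variable names are illustrative. The critical value and P-value use closed forms that hold only for df = 2, which is the case here because there are three categories.

```python
import math

# Figures taken from the statement of Example 1.
claimed = {"rookies": 0.30, "veterans": 0.60, "all_stars": 0.10}
observed = {"rookies": 50, "veterans": 45, "all_stars": 5}
alpha = 0.05

n = sum(observed.values())
expected = {name: n * p for name, p in claimed.items()}

# Pearson statistic: sum over categories of (observed - expected)^2 / expected.
statistic = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
df = len(observed) - 1

# For df = 2 only, the right-tail area is exp(-x / 2), so both
# the critical value and the P-value have simple closed forms.
assert df == 2
critical = -2.0 * math.log(alpha)
p_value = math.exp(-statistic / 2.0)
reject = statistic > critical

print(f"expected counts: {expected}")
print(f"X^2 = {statistic:.2f}, df = {df}, critical value = {critical:.2f}")
print(f"P-value = {p_value:.5f}, reject H0: {reject}")
```

Under these inputs the statistic is about 19.58, well above the critical value of about 5.99, so the sample is not consistent with the company's claim at the 0.05 level.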
Example 2:
A University conducted a survey of its recent graduates to collect demographic and health
information for future planning purposes as well as to assess students' satisfaction with their
undergraduate experiences. The survey revealed that a substantial proportion of students
were not engaging in regular exercise, many felt their nutrition was poor and a substantial
number were smoking. In response to a question on regular exercise, 60% of all graduates
reported getting no regular exercise, 25% reported exercising sporadically and 15% reported
exercising regularly as undergraduates. The next year the University launched a health
promotion campaign on campus in an attempt to increase healthy behaviors among
undergraduates. The program included modules on exercise, nutrition and smoking cessation.
To evaluate the impact of the program, the University again surveyed graduates and asked
the same questions. The survey was completed by 470 graduates and the following data were
collected on the exercise question:
Solution
In this example, we have one sample and a discrete (ordinal) outcome variable (with three
response options). We specifically want to compare the distribution of responses in the
sample to the distribution reported the previous year (i.e., 60%, 25%, 15% reporting no,
sporadic and regular exercise, respectively). We now run the test using the five-step
approach:
NB:
Since there are three categories, the degrees of freedom are df = k - 1 = 3 - 1 = 2.
Now, go to the chi-square table. There you will find that the critical value is
$\chi^2_{2,\,0.05} = 5.99$.
Conclusion:
$X^2 = 8.46 > \chi^2_{2,\,0.05} = 5.99$
We reject the null hypothesis, and conclude that the distribution of exercise has changed; it is
no longer 60%, 25%, 15%.
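The expected counts and the decision for Example 2 can be checked from the figures in the text (reference distribution 60%, 25%, 15%; 470 respondents; test statistic 8.46). This sketch takes the reported statistic as given, since the observed counts are not reproduced here. The variable names are illustrative, and the closed-form P-value is valid only because df = 2.

```python
import math

# Reference distribution from the earlier survey and sample size from the text.
reference = {"no regular exercise": 0.60, "sporadic": 0.25, "regular": 0.15}
n = 470
statistic = 8.46   # value reported in the worked example
alpha = 0.05

# Expected count in each category under H0 is n times the reference proportion.
expected = {name: n * p for name, p in reference.items()}
smallest_expected = min(expected.values())

df = len(reference) - 1
assert df == 2                      # the closed forms below hold for df = 2 only
critical = -2.0 * math.log(alpha)   # about 5.99
p_value = math.exp(-statistic / 2.0)

for name, count in expected.items():
    print(f"{name:>20}: expected {count:.1f}")
print(f"smallest expected count = {smallest_expected:.1f}")
print(f"critical value = {critical:.2f}, P-value = {p_value:.4f}")
```

All three expected counts are far above 5, so the sample is large enough for the chi-square approximation, and the P-value of roughly 0.015 agrees with the decision to reject H0 at the 0.05 level.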
The test of independence helps us to assess whether two discrete (categorical) variables are
independent for a population, or whether there is an association between the two variables.
The test statistic has $df = (r - 1)(c - 1)$ degrees of freedom,
where r = the number of rows in the contingency table, and c = the number of columns.
For example, in the following contingency table, $df = (r-1)(c-1) = (3-1)(3-1) = 4$:
High Exposure
Medium Exposure
Low Exposure
There are 3 exposure categories and 3 outcome categories, so df= (3-1) * (3-1) = 2*2 = 4
The research question can be phrased as either:
Is there a difference in outcome between two or more groups?
Is there an association between two variables?
Therefore,
H0: The distribution of the outcome is independent of the groups
H1: H0 is false
Example:
We have one population of interest, say factory workers.
Question:
Is there a relationship between smoking habits and whether or not a factory worker
experiences hypertension?
Data:
From one random sample of 180 factory workers, we measure two variables:
Y = hypertension status (yes or no)
X = smoking habit (non, moderate, heavy)
The table below summarizes the data in terms of the observed counts.
Observed Counts:
                                X (Smoking habits)
                           Non   Moderate   Heavy   Total
Y (Hypertension   Yes       21         36      30      87
status)           No        48         26      19      93
                  Total     69         62      49     180
H0: There is no association between smoking habit and hypertension status for the
population of factory workers. (or The two factors, smoking habit and hypertension status,
are independent for the population.)
Mathematically, this can be stated as:
H0: $P(X = i \text{ and } Y = j) = P(X = i)\,P(Y = j)$ for all i and j
Ha: There is an association between smoking habit and hypertension status for the
population of factory workers. (or The two factors, smoking habit and hypertension status,
are dependent for the population.)
Significance level: $\alpha = 0.01$
The two-way table provides the OBSERVED counts. Our next step is to compute the
EXPECTED counts, under the assumption that H0 is true. The expected counts are obtained
using the cross tabulation rule:
$\text{Expected count} = \dfrac{(\text{row total})(\text{column total})}{\text{grand total, } N}$
Observed counts, with expected counts in parentheses:
                                      X (Smoking habits)
                              Non          Moderate      Heavy         Total
Y (Hypertension   Yes         21 (33.35)   36 (29.97)    30 (23.68)       87
status)           No          48 (35.65)   26 (32.03)    19 (25.32)       93
                  Total       69           62            49              180
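Decision rule:
With $df = (2-1)(3-1) = 2$ and $\alpha = 0.01$, the critical value is $\chi^2_{2,\,0.01} = 9.21$.
Since $X^2 = 14.46 > 9.21$, we reject H0.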
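The expected counts, the test statistic and the decision for the smoking and hypertension example can be reproduced from the observed table. This is a sketch with illustrative variable names. The closed-form critical value and P-value apply only because a 2 by 3 table has df = 2.

```python
import math

# Observed counts: rows are hypertension (yes, no),
# columns are smoking habit (non, moderate, heavy).
observed = [[21, 36, 30],
            [48, 26, 19]]
alpha = 0.01

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Cross tabulation rule: (row total * column total) / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

assert df == 2                      # the closed forms below hold for df = 2 only
critical = -2.0 * math.log(alpha)
p_value = math.exp(-statistic / 2.0)

for row in expected:
    print([round(e, 2) for e in row])
print(f"X^2 = {statistic:.2f}, df = {df}, critical value = {critical:.2f}")
print(f"P-value = {p_value:.5f}, reject H0: {statistic > critical}")
```

The expected counts match the values in parentheses in the table, and the statistic of about 14.46 exceeds the critical value of about 9.21, so H0 is rejected at the 0.01 level.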
Report:
A 2 by 3 chi-square test of independence indicated a significant association between
hypertension status and smoking habits ($\chi^2(2, N = 180) = 14.46$, $p < 0.01$). Therefore, the null
hypothesis was rejected and we conclude that the two factors, smoking habit and
hypertension status, are dependent for the population.
In sum,
1.6.2. Assumptions
Assumption #1: Your two variables should be measured at an ordinal or nominal level (i.e.,
categorical data).
Assumption #2: Your two variables should consist of two or more categorical, independent
groups. Example independent variables that meet this criterion include gender (2 groups:
Males and Females), ethnicity (e.g., 3 groups: Caucasian, African American and Hispanic),
physical activity level (e.g., 4 groups: sedentary, low, moderate and high), profession (e.g., 5
groups: surgeon, doctor, nurse, dentist, therapist), and so forth.
Assumption #3: Chi-squared tests are only valid when you have a reasonable sample size.
For 2 by 2 tables (i.e. only two categories in each variable):
If the total sample size is greater than 40, chi-squared can be used.
If the total sample size is between 20 and 40, and the smallest expected frequency is
at least 5, chi-squared can be used (see the note at the bottom of the SPSS output to see if
this is a problem).
Otherwise Fisher's exact test must be used.
For other tables:
Chi-squared can be used if no more than 20% of the expected frequencies are less
than 5 (see the note at the bottom of the SPSS output to see if this is a problem).
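The sample-size rules in Assumption #3 can be written as a small check. The sketch below is an illustration: the function names are hypothetical, "between 20 and 40" is read as inclusive at both ends (an assumption), and the second table is made-up data used only to show a failing case.

```python
def expected_counts(observed):
    """Expected counts under independence: (row total * column total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_is_appropriate(observed):
    """Apply the rules of thumb stated in Assumption #3 to a table of observed counts."""
    expected = [e for row in expected_counts(observed) for e in row]
    n = sum(sum(row) for row in observed)
    is_2x2 = len(observed) == 2 and len(observed[0]) == 2
    if is_2x2:
        if n > 40:
            return True
        # Assumption: "between 20 and 40" includes both 20 and 40.
        return 20 <= n <= 40 and min(expected) >= 5
    # Larger tables: at most 20% of expected counts may be below 5.
    share_small = sum(e < 5 for e in expected) / len(expected)
    return share_small <= 0.20


smoking_table = [[21, 36, 30], [48, 26, 19]]   # from the worked example
small_table = [[2, 8], [9, 6]]                 # hypothetical 2 by 2 table, n = 25

print(chi_square_is_appropriate(smoking_table))
print(chi_square_is_appropriate(small_table))
```

When the check fails for a 2 by 2 table, the text's recommendation is Fisher's exact test.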
Exercises
1. In this exercise, we look at the relationship between reported diabetes and high blood
pressure. This is a crosstabulation:
2. The table below is taken from a study investigating the cause of diarrhoea in patients with
gastroenteritis and shows the relationship between foreign travel and a positive result for
the organism Providencia alcalifaciens (Haynes and Hawkey 1989).
                               P. alcalifaciens
Recent travel abroad?   Positive (no.)   Negative (no.)   Total
Yes                           25              229           254
No                             5              368           373
Total                         30              597           627
Chi Squared = 23.98, P<0.001
a) What is meant by 'chi-squared = 23.98, P < 0.001'?
b) What conditions do the data have to meet for the test to be valid?
c) What conclusions can be drawn from these data?
d) What other information would be useful in deciding whether P. alcalifaciens was a likely
cause of gastroenteritis in travelers?
3. Conduct a hypothesis test to determine if the actual majors of graduating females fit the
expected distribution of their majors. The observed data were collected from 5,000
graduating females. Complete a hypothesis test at the $\alpha = 0.05$ significance level to test if
the actual distribution of female students to majors matches the expected distribution.
a. State the null and alternative hypotheses for this test of independence.
b. Complete the table of expected values assuming the success rate is
independent of the treatment method. Use two decimal places of accuracy.
a) Provide the null and alternative hypothesis and an interpretation of the results
b) Calculate the expected counts for the cells in the table above.
c) Test to see if the association between mortality outcome and treatment status is
statistically significant.