Lecture3 - Contingency Analysis

Course

Uploaded by

audengweha
Copyright
© © All Rights Reserved
1. Contingency Analysis ("crosstab") / Measures of Association

Objectives
After studying this lesson and answering the questions in the exercises, a student will be able
to do the following:

Concepts:
• Goodness of fit test
• Two-Way Tables
• Expected Counts in Two-Way Tables
• The Chi-Square Test Statistic
• Cell Counts Required for the Chi-Square Test
• Uses of the Chi-Square Test
• The Chi-Square Distributions

Objectives:
• Perform a chi-square goodness of fit test.
• Construct and interpret two-way tables.
• Calculate expected counts in two-way tables.
• Describe the chi-square test statistic.
• Describe the cell counts required for the chi-square test.
• Describe uses of the chi-square test.
• Describe the chi-square distributions.

References:
Moore, D. S., Notz, W. I., & Fligner, M. A. (2013). The basic practice of statistics (6th ed.).
New York, NY: W. H. Freeman and Company.

1.1. Brief History


The term "$\chi^2$ tests" usually refers not to the preceding test concerning a variance but to a
group of tests based on the Pearson approximation and contingency tables. Karl (Carl) Pearson
(1857-1936) is often regarded as the "father" of statistics.

1.2. The chi-square distribution


The chi-square distributions are a family of distributions that take only positive values and
are skewed to the right. A particular chi-square distribution is specified by giving its degrees
of freedom.

The chi-square distribution has only one parameter: df = degrees of freedom. The degrees of
freedom depend on the application, as we will see later. Here are a few facts about the
chi-square distribution. If $X^2 \sim \chi^2_{df}$, the following are true of $X^2$:
• $X^2$ is a continuous random variable.
• $X^2 = Z_1^2 + Z_2^2 + \cdots + Z_{df}^2$; that is, $X^2$ is the sum of df independent squared
standard normal random variables.
• Data values cannot be negative: $x \in [0, \infty)$.
• $\mu = df$ (the mean of the chi-square distribution is its degrees of freedom).
• $\sigma = \sqrt{2 \cdot df}$, so $V(X^2) = 2 \cdot df$.
• When df > 90, $X^2$ is approximately normal.
• The distribution is continuous, has one mode, and is skewed to the right (positively
skewed).
• The critical value of a test statistic in a chi-square distribution is determined by
specifying a significance level and the degrees of freedom.
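The facts listed above can be checked informally by simulation. The following is a minimal sketch, not part of the original lecture: the function name `simulate_chi_square`, the choice df = 5, the number of draws and the seed are all assumptions made for illustration.

```python
import random
import statistics


def simulate_chi_square(df, n_draws=20000, seed=0):
    """Draw n_draws values of Z_1^2 + ... + Z_df^2, where each Z_i ~ N(0, 1)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
        for _ in range(n_draws)
    ]


draws = simulate_chi_square(df=5)
sample_mean = statistics.fmean(draws)
sample_var = statistics.variance(draws)

print(f"sample mean     = {sample_mean:.2f} (theory: df = 5)")
print(f"sample variance = {sample_var:.2f} (theory: 2*df = 10)")
print(f"smallest draw   = {min(draws):.4f} (a sum of squares is never negative)")
```

The sample mean and variance should land close to df and 2*df, but not exactly on them, because the draws are random.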
The shapes of chi-square distributions with different degrees of freedom are shown in Figure 13.

Figure 13: Chi squared distribution shapes

The chi-square test for a two-way table with r rows and c columns uses critical values from
the chi-square distribution with (r – 1)(c – 1) degrees of freedom. The P-value is the area
under the density curve of this chi-square distribution to the right of the value of the test
statistic.
• The figure above shows that the distribution of the chi-square statistic starts at zero
and cannot take negative values.
• The shape of the distribution differs markedly from that of the t or z distribution and is
skewed to the right.
• The shape of the distribution changes as the degrees of freedom increase.
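The right-tail area that defines the P-value can also be computed numerically instead of being read from a table. The sketch below is an illustration only: `chi2_sf` is a hypothetical helper name, and it uses the standard series expansion of the regularized incomplete gamma function, which is adequate for the moderate statistic values met in this lesson but is not a production-quality routine.

```python
import math


def chi2_sf(x, df):
    """Right-tail area P(X^2 >= x) for a chi-square distribution with df degrees of freedom."""
    if x <= 0:
        return 1.0  # the whole distribution lies to the right of zero
    a, half_x = df / 2.0, x / 2.0
    # Series for the regularized lower incomplete gamma function P(a, half_x):
    # P = exp(-half_x) * half_x**a * sum_n half_x**n / Gamma(a + 1 + n)
    term = math.exp(-half_x + a * math.log(half_x) - math.lgamma(a + 1.0))
    total, n = term, 0
    while term > 1e-16 * total:
        n += 1
        term *= half_x / (a + n)
        total += term
    return max(0.0, 1.0 - total)


# Tabulated critical values should give back their significance levels.
print(chi2_sf(5.99, 2))   # about 0.05
print(chi2_sf(9.21, 2))   # about 0.01
print(chi2_sf(3.841, 1))  # about 0.05
```

A statistic larger than the critical value always corresponds to a P-value smaller than the significance level, so the two decision rules agree.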

1.3. Uses of the Chi-Square Test


Use the chi-square test to test the null hypothesis
H0: there is no relationship between two categorical variables
when the two-way table comes from one of these situations:
• Independent random samples from two or more populations, with each individual
classified according to one categorical variable.
• A single random sample, with each individual classified according to both of two
categorical variables.

1.4. The main types of Chi Square Test


Three main types of tests will be covered here:

1. Goodness of Fit Test:


This test assesses whether a particular discrete model is a good fit for a discrete
characteristic, based on a random sample from the population. For example, has the model
for the method of transportation (drive, bike, walk, other) used by students to get to class
changed from that of 5 years ago?

2. Test of Homogeneity:
This test is for assessing if two or more populations are Homogeneous (alike) with respect to
the distribution of some discrete (categorical) variable. For example, is the distribution of
opinion on legal gambling the same for adult males versus adult females?

3. Test of Independence:
This test helps us to assess whether two discrete (categorical) variables are independent for a
population, or whether there is an association between the two variables. For example, is there
an association between satisfaction with the quality of public schools (not satisfied, somewhat
satisfied, very satisfied) and religious affiliation (Catholic, Protestant, Muslim, etc.)?

The first test is the one-sample test for count data. The other two tests (homogeneity and
independence) are actually the same test. Although the hypotheses are stated differently and
the underlying assumptions about how the data is gathered are different, the steps for doing
the two tests are exactly the same.

All three tests are based on an $X^2$ test statistic that, if the corresponding H0 is true and the
assumptions hold, follows a chi-square distribution with some degrees of freedom, written
$\chi^2(df)$.

1.5. Goodness of Fit Test


Hypotheses
• We use the goodness of fit test to test whether a discrete (categorical) random variable
matches a predetermined "expected" distribution. The hypotheses in a goodness of fit
test are
H0: the actual distribution fits the expected distribution
HA: the actual distribution does not fit the expected distribution
REQUIREMENT: For the chi-square goodness of fit test to be appropriate, the expected
count in each category must be at least 5. It may be possible to combine categories to meet
this requirement.
Our goal is to see whether the observed values are close enough to the expected values that the
differences could be due to random variation or, alternatively, whether the differences are great
enough that we can conclude that the distribution is not as expected. Therefore, our sample
statistic (which is also the test statistic in this case) should measure how far the "observed"
frequencies are from the "expected" frequencies, as a group.

1.5.1.Steps in Testing the Hypotheses

Step 1: State the hypotheses
H0: The population proportions equal the hypothesized values p1, p2, ..., pk
H1: At least one proportion differs from its hypothesized value
Step 2: Calculate the expected frequency counts. The expected frequency counts at each
level of the categorical variable are equal to:
$$E_i = n p_i$$
where $E_i$ is the expected frequency count for the ith level of the categorical variable, n
is the total sample size, and $p_i$ is the hypothesized proportion of observations in
level i.
Step 3: Calculate the chi-square statistic:
$$X^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
where $O_i$ is the observed frequency count for the ith level.
Step 4: Determine the critical value:
$$X^2_c = X^2(df, \alpha)$$
where
df = k – 1 and k is the number of levels of the categorical variable
Step 5: Decision rule
Reject H0 if $X^2 > X^2_c$

Example 1

A certain Toy Company prints baseball cards. The company claims that 30% of the cards are
rookies, 60% veterans but not All-Stars, and 10% are veteran All-Stars. Suppose a random
sample of 100 cards has 50 rookies, 45 veterans, and 5 All-Stars. Is this consistent with the
company's claim? Use a 0.05 level of significance.

Solution

1. State the hypotheses.
• Null hypothesis: The proportions of rookies, veterans, and All-Stars are 30%,
60% and 10%, respectively.
• Alternative hypothesis: At least one of the proportions in the null hypothesis is
false.
2. Calculate the expected frequency counts and the chi-square statistic (note that
steps 2 and 3 are combined in this case):

Card         Observed (Oi)   Claimed percent   Ei = n*pi          (Oi - Ei)^2 / Ei
Rookies      50              30                100 * 0.30 = 30    13.33
Veterans     45              60                100 * 0.60 = 60     3.75
All-Stars     5              10                100 * 0.10 = 10     2.50
Total        100             100               100                X^2 = 19.58
3. Analyze sample data. Based on the significance level and the degrees of freedom,
we determine the critical value, $X^2_c$:
df = k - 1 = 3 - 1 = 2
where
df = the degrees of freedom,
k = the number of levels of the categorical variable.
Hence
$X^2_c = X^2(2, 0.05) = 5.99$ (see the chi-square distribution table)

[Table: Percentage points of the chi-squared distribution]

4. Decision rule/Conclusion

Since $X^2 = 19.58 > X^2_c = 5.99$, we reject the null hypothesis: the sample is not consistent
with the company's claim.
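The arithmetic of Example 1 can be reproduced with a few lines of code. This is a minimal sketch under stated assumptions: the function name `goodness_of_fit` is hypothetical, and the critical value 5.99 is simply copied from the table lookup above rather than computed.

```python
def goodness_of_fit(observed, proportions):
    """Return expected counts, the chi-square statistic and df for a goodness of fit test."""
    n = sum(observed)
    expected = [n * p for p in proportions]           # E_i = n * p_i
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1                            # df = k - 1
    return expected, statistic, df


# Example 1: rookies, veterans, All-Stars in a sample of 100 cards.
observed = [50, 45, 5]
claimed = [0.30, 0.60, 0.10]
expected, x2, df = goodness_of_fit(observed, claimed)

critical_value = 5.99  # X^2(2, 0.05), taken from the chi-square table
print(f"expected counts: {[round(e, 2) for e in expected]}")
print(f"X^2 = {x2:.2f}, df = {df}")
print("reject H0" if x2 > critical_value else "do not reject H0")
```

The same function applies to any goodness of fit problem once the observed counts and hypothesized proportions are listed in the same category order.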

Example 2:
A University conducted a survey of its recent graduates to collect demographic and health
information for future planning purposes as well as to assess students' satisfaction with their
undergraduate experiences. The survey revealed that a substantial proportion of students
were not engaging in regular exercise, many felt their nutrition was poor and a substantial
number were smoking. In response to a question on regular exercise, 60% of all graduates
reported getting no regular exercise, 25% reported exercising sporadically and 15% reported
exercising regularly as undergraduates. The next year the University launched a health
promotion campaign on campus in an attempt to increase health behaviors among
undergraduates. The program included modules on exercise, nutrition and smoking cessation.
To evaluate the impact of the program, the University again surveyed graduates and asked
the same questions. The survey was completed by 470 graduates and the following data were
collected on the exercise question:

No Regular Exercise Sporadic Exercise Regular Exercise Total

Observed # 255 125 90 470


Based on the data, is there evidence of a shift in the distribution of responses to the exercise
question following the implementation of the health promotion campaign on campus?

Solution
In this example, we have one sample and a discrete (ordinal) outcome variable (with three
response options). We specifically want to compare the distribution of responses in the
sample to the distribution reported the previous year (i.e., 60%, 25%, 15% reporting no,
sporadic and regular exercise, respectively). We now run the test using the five-step
approach:

Step 1: We set up the hypotheses and determine the level of significance.

The null hypothesis again represents the "no change" or "no difference" situation. If the
health promotion campaign has no impact then we expect the distribution of responses to the
exercise question to be the same as that measured prior to the implementation of the
program:
H0: p1 = 0.60, p2 = 0.25, p3 = 0.15, or equivalently
H0: The distribution of responses is 0.60, 0.25, 0.15
H1: H0 is false, or
H1: The distribution of responses is different from 0.60, 0.25, 0.15
Level of significance: α = 0.05
NB:
• The research hypothesis (H1) as stated captures any difference in the distribution of
responses from that specified in the null hypothesis.
• We do not specify a particular alternative distribution; instead, we are testing whether
the sample data "fit" the distribution in H0 or not. With the $\chi^2$ goodness-of-fit test
there is no upper- or lower-tailed version of the test.
Steps 2 and 3: Calculate the number of students expected in each exercise category and the
chi-square statistic.

                  No Exercise        Sporadic Exercise    Regular Exercise    Total
# Observed (O)    255                125                  90                  470
# Expected (E)    470*0.60 = 282     470*0.25 = 117.5     470*0.15 = 70.5     470
(O - E)^2 / E     2.59               0.48                 5.39                X^2 = 8.46

Step 4: Determine the critical value.

Since there are three categories, the degrees of freedom are df = k - 1 = 3 - 1 = 2.
Now, go to the chi-squared table. There you will find that the critical value is
$X^2(2, 0.05) = 5.99$
Step 5: Conclusion:
$X^2(2, 0.05) = 5.99 < X^2 = 8.46$
We reject the null hypothesis, and conclude that the distribution of exercise has changed; it is
no longer 60%, 25%, 15%.
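As a cross-check that is not part of the original solution, the P-value can be written in closed form here, because a chi-square distribution with 2 degrees of freedom has right-tail area $e^{-x/2}$:

```latex
P\text{-value} = P\left(\chi^2_2 \ge 8.46\right) = e^{-8.46/2} = e^{-4.23} \approx 0.015 < 0.05
```

The P-value is below the significance level, which agrees with the comparison against the critical value.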

1.6. Chi-Square Test for Independence/Homogeneity


Contingency Tables

o A contingency table is a cross-tabulation of n paired observations into categories.
o Each cell shows the count of observations that fall into the category defined
by its row (r) and column (c) heading.
[Please note that the Kolmogorov-Smirnov test is another test for goodness of fit. The
Kolmogorov-Smirnov test has higher power, but can only be applied to continuous-level
variables.]

The chi-square test is also used to test whether a statistically significant relationship exists
between a dependent and an independent variable. When used as a test of independence, the
chi-square test is applied to a contingency table, or cross tabulation (sometimes called
crosstabs for short).
The test of independence helps us to assess whether two discrete (categorical) variables are
independent for a population, or whether there is an association between the two variables.

1.6.1. Chi-squared test of independence step by step


1) Formulate Hypotheses
2) Calculate row and column totals
3) Calculate row and column proportions
4) Calculate expected frequencies (Ei)
5) Calculate χ2 statistic
6) Calculate degrees of freedom
7) Obtain Critical Value from table
8) Make decision regarding the Null-hypothesis
The chi-squared test of independence also uses the chi-squared statistic and chi-squared
distribution, but it is used to test whether there is a difference in frequency among two or
more groups. The outcome is categorical (2 or more levels) or ordinal. Therefore, there can
be multiple rows or columns in our contingency table, and the degrees of freedom are
df = (r - 1)(c - 1)
where r = the number of rows in the contingency table, and c = the number of columns.
For example, in the following contingency table, df = (r - 1)(c - 1) = (3 - 1)(3 - 1) = 4:

                    Good    Fair    Poor
High Exposure
Medium Exposure
Low Exposure

There are 3 exposure categories and 3 outcome categories, so df = (3 - 1)(3 - 1) = 2*2 = 4.
The research question can be phrased as either:
• Is there a difference in outcome between two or more groups?
• Is there an association between two variables?
Therefore,
• H0: The distribution of the outcome is independent of the groups
• H1: H0 is false

Example:
We have one population of interest, say factory workers.

Question:
Is there a relationship between smoking habits and whether or not a factory worker
experiences hypertension?
Data:
From one random sample of 180 factory workers, we measure two variables:
Y = hypertension status (yes or no)
X = smoking habit (non, moderate, heavy)
The table below summarizes the data in terms of the observed counts.

Observed Counts:
                           X (Smoking habits)
Y (Hypertension status)    Non    Moderate    Heavy    Total
Yes                        21     36          30       87
No                         48     26          19       93
Total                      69     62          49       180

The null hypothesis:

H0: There is no association between smoking habit and hypertension status for the
population of factory workers (or: the two factors, smoking habit and hypertension status,
are independent for the population).
Mathematically, this can be stated as:
H0: P(X = i and Y = j) = P(X = i)P(Y = j) for all i and j
Ha: There is an association between smoking habit and hypertension status for the
population of factory workers (or: the two factors, smoking habit and hypertension status,
are dependent for the population).
Level of significance: α = 0.01

The two-way table provides the OBSERVED counts. Our next step is to compute the
EXPECTED counts, under the assumption that H0 is true. The expected counts are obtained
using the cross tabulation rule:
$$\text{Expected count} = \frac{(\text{row total})(\text{column total})}{N}$$
where N is the grand total.

Observed counts, with expected counts in parentheses:
                           X (Smoking habits)
Y (Hypertension status)    Non           Moderate      Heavy         Total
Yes                        21 (33.35)    36 (29.97)    30 (23.68)    87
No                         48 (35.65)    26 (32.03)    19 (25.32)    93
Total                      69            62            49            180

Calculation of the chi-squared statistic

$$X^2 = \frac{(21 - 33.35)^2}{33.35} + \frac{(36 - 29.97)^2}{29.97} + \cdots + \frac{(19 - 25.32)^2}{25.32}$$
$$= 4.57 + 1.21 + \cdots + 1.58$$
$$= 14.46$$
Next, we determine the critical value. The table has r = 2 rows and c = 3 columns, so
df = (r - 1)(c - 1)
= (2 - 1)(3 - 1)
= 2
Therefore,
$X^2(2, 0.01) = 9.21$

Decision rule:
$X^2 = 14.46 > X^2(2, 0.01) = 9.21$, so we reject H0
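The whole calculation for the factory-worker example can be sketched in code. This is an illustration rather than a prescribed method: the function names `expected_counts` and `chi_square_independence` are assumptions, and the critical value 9.21 is taken from the table lookup above.

```python
def expected_counts(table):
    """Expected count for each cell: (row total)(column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_independence(table):
    """Return expected counts, the chi-square statistic and df for an r x c table."""
    expected = expected_counts(table)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)  # (r - 1)(c - 1)
    return expected, statistic, df


# Rows: hypertension yes / no. Columns: non, moderate, heavy smokers.
observed = [[21, 36, 30],
            [48, 26, 19]]
expected, x2, df = chi_square_independence(observed)

critical_value = 9.21  # X^2(2, 0.01), taken from the chi-square table
for row in expected:
    print([round(e, 2) for e in row])
print(f"X^2 = {x2:.2f}, df = {df}")
print("reject H0" if x2 > critical_value else "do not reject H0")
```

Because the expected counts depend only on the row and column totals, the same function works for any r by c table of observed counts.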
Report:
A 2 by 3 chi-square test of independence indicated a significant association between
hypertension status and smoking habits ($X^2$(2, N = 180) = 14.46; p < 0.01). Therefore, the null
hypothesis was rejected and we conclude that the two factors, smoking habit and
hypertension status, are dependent for the population.

1.6.2. Assumptions
Assumption #1: Your two variables should be measured at an ordinal or nominal level (i.e.,
categorical data).
Assumption #2: Your two variables should consist of two or more categorical, independent
groups. Example independent variables that meet this criterion include gender (2 groups:
Males and Females), ethnicity (e.g., 3 groups: Caucasian, African American and Hispanic),
physical activity level (e.g., 4 groups: sedentary, low, moderate and high), profession (e.g., 5
groups: surgeon, doctor, nurse, dentist, therapist), and so forth.

Assumption #3: Chi-squared tests are only valid when you have a reasonable sample size.
For 2 by 2 tables (i.e., only two categories in each variable):
• If the total sample size is greater than 40, chi-squared can be used.
• If the total sample size is between 20 and 40, and the smallest expected frequency is
at least 5, chi-squared can be used (see the note at the bottom of the SPSS output to see if
this is a problem).
• Otherwise Fisher's exact test must be used.
For other tables:
• Chi-squared can be used if no more than 20% of the expected frequencies are less
than 5 (see the note at the bottom of the SPSS output to see if this is a problem).
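The sample-size rules above can be written as a small check. This is a sketch only: the function name `chi_square_is_appropriate` is hypothetical, "between 20 and 40" is assumed to include both endpoints, and the second table of expected counts is invented for illustration.

```python
def chi_square_is_appropriate(expected, n):
    """Apply the sample-size rules of thumb to a table of expected counts.

    expected: list of rows of expected frequencies; n: total sample size.
    Returns False when the rules point to another method (for 2 by 2 tables,
    Fisher's exact test).
    """
    cells = [e for row in expected for e in row]
    is_two_by_two = len(expected) == 2 and all(len(row) == 2 for row in expected)
    if is_two_by_two:
        if n > 40:
            return True
        if 20 <= n <= 40:  # assumption: endpoints included
            return min(cells) >= 5
        return False
    # Other tables: at most 20% of expected frequencies may be below 5.
    share_small = sum(e < 5 for e in cells) / len(cells)
    return share_small <= 0.20


# Expected counts from the hypertension example (2 by 3 table, n = 180).
hypertension_expected = [[33.35, 29.97, 23.68], [35.65, 32.03, 25.32]]
print(chi_square_is_appropriate(hypertension_expected, 180))

# Invented 2 by 2 table with n = 30 and one small expected count.
small_expected = [[3.0, 12.0], [3.0, 12.0]]
print(chi_square_is_appropriate(small_expected, 30))
```

The check looks at expected counts, not observed counts, so the expected frequencies have to be computed first.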

1.7. Measuring associations between variables

Exercises
1. In this exercise, we look at the relationship between reported diabetes and high blood
pressure. This is a crosstabulation:

                High blood pressure
Diabetes        No      Yes     Total
No              172     20      192
Yes             7       3       10
Total           179     23      202

a) What kinds of variables are diabetes and high blood pressure?


b) Which cell of the table would have the smallest expected frequency and, roughly,
what would this be?
c) What statistical method should be used to test the null hypothesis that diabetes and
high blood pressure are unrelated in this population, and why?
d) The test gives P = 0.09. What can we conclude about high blood pressure and
diabetes, given that the test was conducted at 5% significance level?

2. The table below is taken from a study investigating the cause of diarrhoea in patients with
gastroenteritis and shows the relationship between foreign travel and a positive result for
the organism Providencia alcalifaciens (Haynes and Hawkey 1989).

                          P. alcalifaciens
Recent travel abroad?     Positive (no.)    Negative (no.)    Total
Yes                       25                229               254
No                        5                 368               373
Total                     30                597               627
Chi-squared = 23.98, P < 0.001
a) What is meant by 'chi-squared = 23.98, P < 0.001'?
b) What conditions do the data have to meet for the test to be valid?
c) What conclusions can be drawn from these data?
d) What other information would be useful in deciding whether P. alcalifaciens was a likely
cause of gastroenteritis in travelers?
3. Conduct a hypothesis test to determine if the actual majors of graduating females fit the
expected distribution of their majors. The observed data were collected from 5,000
graduating females. Complete a hypothesis test at the α = 0.05 significance level to test if
the actual distribution of female students to majors matches the expected distribution.

a. Find the expected frequencies and complete the table.


b. Are the requirements for a chi-square goodness of fit test satisfied? Explain
and adjust the categories if needed.
c. Write the null and alternative hypotheses.
d. What is the distribution?
e. Find the test statistic.
f. Find the p-value.
4. Treating stress fractures. With respect to stress fractures in a foot bone, does the success
rate of the treatment depend on the treatment method, or do all methods of treatment have
basically the same success rate? Use the following data and a significance level of α =
0.01 to complete a test of independence.

a. State the null and alternative hypotheses for this test of independence.
b. Complete the table of expected values assuming the success rate is
independent of the treatment method. Use two decimal places of accuracy.

c. Is the requirement for a test of independence satisfied?


d. Find the distribution of the test statistic, including the degrees of freedom.
e. Calculate the test statistic value using your preferred method.
f. Sketch the density curve, marking and labeling the test statistic and p-value.
g. What is the outcome of the test for independence? (Can we conclude that the
success rate depends on the method of treatment or not?)
5. A one year follow-up study was conducted to examine the effect of an experimental drug
on mortality in 296 cases of advanced non-Hodgkin's lymphoma. Controls received
standard treatment. The data are provided below.

a) Provide the null and alternative hypotheses and an interpretation of the results.
b) Calculate the expected counts for the cells in the table above.
c) Test to see if the association between mortality outcome and treatment status is
statistically significant.
