Lecture 3 - Measures of Association
Measures of Association
1/14/2025 1
Chi-square Test
• We will introduce the chi-square test. In particular, we will cover:
– cross-tabulation
Cross-Tabulation
• Cross-tabulation can be used for categorical variables only. It shows the joint frequency
distribution of two variables.
Categorical Data Analysis
Basic Contingency Table Structure
Structure of Contingency Tables
Contingency Analysis ("crosstab" or Cross-tabulations)
                     Age
Gender    20-29   30-39   40-49   50-59   60-69
Male        12      13       7       8       7
Female      12      14      10       9      11
The numbers inside the table are called the "cells", and the "All" or "Total" rows or
columns are called the "marginals".
A contingency table helps us look at whether the value of one variable is associated with, or
“contingent” upon, that of another. It is most useful when each variable contains only a few
categories. Usually, though not always, such variables will be nominal or ordinal.
… Example of Cross-Tabulation
Mothers’ level of knowledge and nutritional status of their children:
Example
There were also 148 children who did not eat burritos. Among these,
10 cases of illness were reported.
Brief History
The Chi-Square distribution has only one parameter: df = degrees of freedom. The degrees of
freedom depend on the application, as we will see later. Here are a few facts about the Chi-
Square distribution.
… Chi-Square distribution
The Chi-Square Test has two major fields of application: 1) goodness of fit test and 2) test of
independence.
Firstly, the Chi-Square Test can test whether the distribution of a variable in a sample
approximates an assumed theoretical distribution (e.g., normal distribution, Beta).
[Please note that the Kolmogorov-Smirnov test is another goodness-of-fit test. The
Kolmogorov-Smirnov test has higher power, but can only be applied to continuous-level
variables.]
Secondly, the Chi-Square Test can be used as a test of independence between two
variables. That is, it tests whether one variable is independent of another. In
other words, it tests whether or not a statistically significant relationship exists between a
dependent and an independent variable. When used as a test of independence, the Chi-Square
Test is applied to a contingency table, or cross tabulation (sometimes called crosstabs for
short).
The chi-square distribution
It is a family of continuous probability distributions that have one mode and are skewed to
the right (positively skewed).
It is non-negative.
It is based on degrees of freedom; the exact shape varies with the number of degrees
of freedom.
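The properties listed above can be checked with a small simulation. This is only a sketch: it relies on the standard fact (not stated on the slide) that a chi-square variable with df degrees of freedom is the sum of df squared standard normal draws. The function name, sample size, and seed are arbitrary choices made for illustration.

```python
import random
import statistics

def chi_square_samples(df, n, seed=0):
    """Draw n values, each the sum of df squared standard normal draws."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df)) for _ in range(n)]

samples = chi_square_samples(df=3, n=20000)
sample_min = min(samples)                    # never negative: a sum of squares
sample_mean = statistics.mean(samples)       # close to df
sample_median = statistics.median(samples)   # below the mean: right skew

print(f"min={sample_min:.3f} mean={sample_mean:.3f} median={sample_median:.3f}")
```

A mean larger than the median is one simple sign of the right skew; rerunning with a larger df should show the gap shrinking relative to the mean.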
• Tables may also have additional information:
• Row and column marginals (i.e., totals)
         Women   Men   Total
SDF        27     10     37
CPDM       16     15     31
Total      43     25     68   (this is the total N)
Chi-square Test
There are 2 types of chi-square tests:
• The chi-square goodness of fit test is
used to compare the observed
frequencies in a data sample with the
frequencies based on some prior
expectation – either empirical or
theoretical.
• The chi-square test of independence
assesses whether observed frequencies
are dependent on (i.e., contingent on)
certain conditions.
Chi-square frequency tables
• A frequency distribution table / contingency table shows
how observations are distributed between different
groups (i.e., the number of observations in each group).
A chi-square goodness of fit test
can test whether these observed
frequencies are significantly different
from what was expected, such as
equal frequencies.
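A goodness of fit statistic against equal expected frequencies might be computed as in the sketch below. The observed counts are hypothetical, made up purely for illustration, and the function name is an assumption. The degrees of freedom are taken as the number of categories minus one, which is the usual choice when no parameters are estimated from the data.

```python
def goodness_of_fit_statistic(observed, expected=None):
    """Return (chi-square statistic, degrees of freedom) for one-way counts."""
    n = sum(observed)
    if expected is None:
        # Default expectation: equal frequencies in every category.
        expected = [n / len(observed)] * len(observed)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1

# Hypothetical counts for four groups (not from the lecture).
observed_counts = [18, 22, 30, 30]
gof_stat, gof_df = goodness_of_fit_statistic(observed_counts)
print(f"chi-square = {gof_stat:.2f}, df = {gof_df}")
```

The statistic is then compared with the critical value for the chosen significance level and df, read from a table of the chi-square distribution.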
Chi-square Test of Independence
The Chi-Square Test of Independence is also known as Pearson's Chi-Square or Chi-Squared test.
• How does a chi-square test of independence work?
• It is based on comparing the observed cell values with the values
you’d expect if there were no relationship between variables
• Definitions:
• Observed values = values in the crosstab cells based on
your sample
• Expected values = crosstab cell values you would expect if
your variables were unrelated.
Assumption for test:
Large N (>100)
• Critical value: df = (R – 1)(C – 1)
= (number of rows – 1) × (number of columns – 1)
CHI SQUARE FORMULA:
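The formula itself did not survive on this slide. The standard Pearson statistic, which agrees with the per-cell terms worked out in this lecture (for example (23.4 – 27)²/23.4), can be written as follows. Here f_ij is the observed frequency in row i and column j, and f̂_ij is the corresponding expected frequency; the summation limits R and C (number of rows and columns) are notation assumed for this sketch.

```latex
\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{\left(f_{ij} - \hat{f}_{ij}\right)^2}{\hat{f}_{ij}},
\qquad df = (R - 1)(C - 1)
```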
Choice of test by type of data:
• Qualitative data: chi-square test
• Quantitative data: t-test
Steps in Testing the Hypotheses
Hypothesis testing using crosstabs
[Figure: the crosstab (row CPDM: 16, 15) with a label pointing out a table column.]
• Tables can also reflect percentages
• Either of total N, or of row or column marginals
• This table shows percentage of total N:
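The percentage table is missing from this slide. As a sketch, the three kinds of percentages can be computed from the SDF/CPDM crosstab used elsewhere in this lecture; that this is the table the slide showed is an assumption, and the variable names are illustrative.

```python
counts = {
    "SDF": {"Women": 27, "Men": 10},
    "CPDM": {"Women": 16, "Men": 15},
}

n_total = sum(sum(row.values()) for row in counts.values())
row_totals = {party: sum(row.values()) for party, row in counts.items()}
col_totals = {sex: sum(row[sex] for row in counts.values()) for sex in ("Women", "Men")}

# Percentage of total N, of the row marginal, and of the column marginal.
pct_of_total = {p: {s: 100 * c / n_total for s, c in row.items()} for p, row in counts.items()}
pct_of_row = {p: {s: 100 * c / row_totals[p] for s, c in row.items()} for p, row in counts.items()}
pct_of_col = {p: {s: 100 * c / col_totals[s] for s, c in row.items()} for p, row in counts.items()}

for party in counts:
    for sex in ("Women", "Men"):
        print(party, sex,
              f"total: {pct_of_total[party][sex]:.1f}%",
              f"row: {pct_of_row[party][sex]:.1f}%",
              f"column: {pct_of_col[party][sex]:.1f}%")
```

The three tables answer different questions: what share of everyone falls in a cell, what share of a party is women or men, and what share of women or men supports a party.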
• The value in a cell is referred to as a frequency
– Math symbol = f
• Cells are referred to by row and column numbers
– Ex: women republicans = 2nd row, 1st column
– In general, rows are numbered from 1 to i, columns are
numbered from 1 to j
• Thus, the value in any cell of any table can be written as:
– $f_{ij}$
Expected Cell Values
• If two variables are independent, cell values will depend only on
row & column marginals/Totals
– Marginals reflect frequencies… And, if frequency is high, all
cells in that row (or column) should be high
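• The formula for the expected value in a cell is:
$\hat{f}_{ij} = \dfrac{(f_i)(f_j)}{N}$
where f_i is the row total, f_j is the column total, and N is the total number of observations.
         Women   Men   Total
SDF        27     10     37
CPDM       16     15     31
Total      43     25     68
E(Woman∩SDF) = (37)(43)/68 = 23.4, etc.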
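The expected value for every cell can be computed from the marginals alone. The sketch below applies the row total × column total / N rule to the SDF/CPDM table; the function name and the list-of-rows layout are choices made for illustration.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Rows: SDF, CPDM. Columns: Women, Men.
observed = [[27, 10], [16, 15]]
expected = expected_counts(observed)

for label, row in zip(("SDF", "CPDM"), expected):
    print(label, [round(value, 1) for value in row])
```

Each row and column of the expected table sums to the same marginal as the observed table, which is a quick way to check the arithmetic by hand.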
…Chi-square Test of Independence
            Women                        Men
SDF   (23.4 – 27)²/23.4 = .55    (13.6 – 10)²/13.6 = .95
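Summing the per-cell terms over all four cells gives the test statistic. The sketch below does this for the full SDF/CPDM table, including the CPDM row that is not shown on the slide. The 5% critical value of 3.841 for df = 1 is the standard tabulated value, not a figure taken from the slides.

```python
def chi_square_statistic(table):
    """Return (Pearson chi-square statistic, degrees of freedom) for a crosstab."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            expected_count = row_totals[i] * col_totals[j] / n
            stat += (observed_count - expected_count) ** 2 / expected_count
    return stat, (len(row_totals) - 1) * (len(col_totals) - 1)

# Rows: SDF, CPDM. Columns: Women, Men.
stat, df = chi_square_statistic([[27, 10], [16, 15]])
CRITICAL_5PCT_DF1 = 3.841  # standard table value for alpha = 0.05, df = 1
print(f"chi-square = {stat:.2f}, df = {df}, reject H0: {stat > CRITICAL_5PCT_DF1}")
```

For this table the statistic falls below the critical value, so at the 5% level the null hypothesis of no association would not be rejected.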
…Chi-Square Test of Independence
Percentage points of the Chi-squared Distribution
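The table of percentage points is missing here. As a rough substitute, the upper-tail probability of the chi-square distribution can be computed directly for integer degrees of freedom. The sketch below uses the closed-form series for the regularized incomplete gamma function with only the standard library; the function name is an assumption, and the values 3.841, 5.991, and 7.815 are the standard 5% points for df = 1, 2, 3.

```python
import math

def chi2_upper_tail(x, df):
    """P(X >= x) for a chi-square variable with a positive integer df."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-h) * sum of h**i / i! for i = 0 .. df/2 - 1
        term = math.exp(-h)
        total = term
        for i in range(1, df // 2):
            term *= h / i
            total += term
        return total
    # Odd df: erfc(sqrt(h)) plus exp(-h) * sum of h**(i + 1/2) / Gamma(i + 3/2)
    total = math.erfc(math.sqrt(h))
    term = math.sqrt(h) * math.exp(-h) / math.gamma(1.5)
    for i in range(df // 2):
        total += term
        term *= h / (i + 1.5)
    return total

for df, critical in ((1, 3.841), (2, 5.991), (3, 7.815)):
    print(df, critical, round(chi2_upper_tail(critical, df), 4))
```

Each printed probability should be close to 0.05, which is what it means for these values to be the 5% percentage points.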
EXAMPLE 2 - Hodgkin’s lymphoma
• A one year follow-up study was conducted to examine the
effect of an experimental drug on mortality in 296 cases of
advanced non-Hodgkin's lymphoma. Controls received
standard treatment. The data are provided below.
1. Provide the null and alternative hypotheses and an interpretation of the results.
2. Calculate the expected counts for the cells in the table above.
3. Test to see if the association between mortality outcome and treatment
status is statistically significant.
…EXAMPLE - Hodgkin’s lymphoma
             Died   Survived   Total
Treatment      9      190       199
Control       13       84        97
• Compute Expected values for each cell
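One way to work through this example is sketched below: it computes the expected counts, the chi-square statistic, and a p-value for the treatment by mortality table. The p-value uses the df = 1 identity P(X ≥ x) = erfc(√(x/2)), which holds only for one degree of freedom. Variable names are illustrative, and this is not necessarily the layout of the solution shown in the lecture.

```python
import math

# Rows: Treatment, Control. Columns: Died, Survived.
lymphoma = [[9, 190], [13, 84]]

row_totals = [sum(row) for row in lymphoma]
col_totals = [sum(col) for col in zip(*lymphoma)]
n = sum(row_totals)

# Expected counts under H0: mortality is independent of treatment status.
lymphoma_expected = [[r * c / n for c in col_totals] for r in row_totals]

lymphoma_stat = sum(
    (lymphoma[i][j] - lymphoma_expected[i][j]) ** 2 / lymphoma_expected[i][j]
    for i in range(2)
    for j in range(2)
)
lymphoma_p = math.erfc(math.sqrt(lymphoma_stat / 2))  # valid for df = 1 only

print([[round(e, 2) for e in row] for row in lymphoma_expected])
print(f"chi-square = {lymphoma_stat:.2f}, df = 1, p = {lymphoma_p:.4f}")
```

The statistic is well above the 5% critical value of 3.841 for df = 1, so under this sketch the null hypothesis of no association between treatment and mortality would be rejected.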
Assumptions:
1. No Eij < 1, and
2. No more than 20% of Eij < 5.
3. Large N (>100)
4. Variables should be measured at an ordinal or nominal level (i.e.,
categorical data).
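The first three assumptions can be checked mechanically before running the test. The sketch below applies them to the two tables used in this lecture; the function name and the dictionary keys are made up for illustration, and the fourth assumption (measurement level) still has to be judged by hand.

```python
def check_chi_square_assumptions(observed):
    """Check the expected-count and sample-size rules for a crosstab."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [r * c / n for r in row_totals for c in col_totals]
    share_below_5 = sum(e < 5 for e in expected) / len(expected)
    return {
        "no_expected_below_1": all(e >= 1 for e in expected),
        "at_most_20pct_below_5": share_below_5 <= 0.20,
        "n_above_100": n > 100,
    }

lymphoma_check = check_chi_square_assumptions([[9, 190], [13, 84]])
party_check = check_chi_square_assumptions([[27, 10], [16, 15]])
print("lymphoma:", lymphoma_check)
print("party by gender:", party_check)
```

The party by gender table has N = 68, so it does not meet the Large N (>100) rule as stated, although all of its expected counts are above 5.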
Reject H0 iff χ² > χ² (critical)
A _____ X _____ Chi-Square test of independence indicated a _____________ difference
between gender and political affiliation, (χ²(_______) = .__________).
Therefore, the null hypothesis was _______________ and we conclude that there
is evidence of an association between mortality outcome and treatment.
HOMEWORK
Question 1
Question 2
A dog trainer wants to know if golden retrievers and French bulldogs are equally
good at learning how to skateboard. She tries to train 40 golden retrievers and 60
French bulldogs to skateboard and finds the following:
Should she reject the null hypothesis that a dog’s breed is unrelated to its
skateboarding ability?
a) She should reject the null hypothesis.
b) She should fail to reject the null hypothesis.
Question 3
A restaurant reviewer wants to know if three popular burger restaurants are equally
recommended by their customers. At each of the three restaurants, he asks 25 random
customers whether they would recommend the restaurant to a friend. He finds the
following:
Should he reject the null hypothesis that the proportion of customers recommending
the restaurant is the same for the three restaurants?
a) He should reject the null hypothesis.
b) He should fail to reject the null hypothesis.
Measuring associations between variables
Exercise:
Is there a relationship between smoking and cancer? Test at the 5% level of significance and
report your findings clearly.
Any