Lecture 3 - Measures of Association

Contingency Analysis ("crosstab") /

Measures of Association

Prof. Dr.rer.nat.tech. Ndoh Mbue

1/14/2025 1
Chi-square Test
• We will introduce the chi-square test. In particular, we will cover:
– cross-tabulation
– what a chi-square test is
– the chi-square test in SPSS

• This presentation is intended for students in the initial stages of statistics. No previous knowledge is required.
Contingency Analysis

Aims of the lecture


By the end of the lecture you will know:
– the concept of a cross table.
– the key steps in conducting a contingency analysis.
– the process of hypothesis testing with the χ² (chi-squared) test of independence.
– measures of association.

You will be able to conduct a contingency analysis with SPSS.

• In particular, you will know how to …
– interpret the output
– use a layer variable
DESCRIPTIVE STATISTICS

A. For one variable ("univariate analysis"):


• Measures of "CENTRAL TENDENCY" (averages) and of DISPERSION (spread around that average).
• Examples: means, modes, medians, standard deviations, quartiles

B. Descriptive statistics for the strength of relationship between


two variables (bivariate analysis) or among a set of variables
(multivariate analysis) are measures of ASSOCIATION or
correlation.

Cross-Tabulation

• One of the most widely used bivariate statistics.

• Versatile and relatively simple.

• Can be used with any level of measurement.

• A way of presenting data about two variables in a table so that their relations are more obvious.

• Also called a contingency table or a crossbreak table.

• It shows the joint frequency distribution of two categorical variables; variables measured at a higher level must first be grouped into categories.
Categorical Data Analysis

• Categorical data analysis deals with discrete data that can be organized into categories.
• The data are organized into a contingency table. The most basic structure consists of two columns and two rows.
• The χ² distribution is used in categorical data analysis.
Basic Contingency Table Structure

• The basic structure of a 2×2 contingency table has two columns and two rows.
Structure of Contingency Tables

• Cells are labeled A through D.


• Columns and rows are added for labels.

Contingency Analysis ("crosstab" or cross-tabulation)

            Age:
Gender      20-29   30-39   40-49   50-59   60-69
Male        12      13      7       8       7
Female      12      14      10      9       11

The numbers inside the table are called the "cells", and the "All" or "Total" rows or columns are called the "marginals".

A contingency table helps us look at whether the value of one variable is associated with, or "contingent" upon, that of another. It is most useful when each variable contains only a few categories. Usually, though not always, such variables will be nominal or ordinal.
… Example of Cross-Tabulation
Mothers’ level of knowledge and nutritional status of their children:

Women’s attendance at nutritional education and their level of nutritional knowledge:
Example

A total of 452 children in elementary schools in Georgia and Florida were served burritos for lunch. Among these, 304 children reported eating the burritos. Among those who ate burritos, 145 reported getting sick from bacterial contamination.

There were also 148 children who did not eat burritos. Among these,
10 cases of illness were reported.

A case of disease was defined as gastrointestinal upset, fever and other symptoms. The CDC studied this event using categorical data analysis. They reported relative risk, significance and a confidence interval.
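The relative risk mentioned above can be recomputed from the counts in the example. The following is only a minimal sketch using the four numbers given on the slide; the variable names are illustrative, and the confidence interval and significance test reported by the CDC are not reproduced here.

```python
# Counts taken from the burrito example.
ate_total, ate_ill = 304, 145          # children who ate burritos / of whom fell ill
not_ate_total, not_ate_ill = 148, 10   # children who did not eat / of whom fell ill

# Risk (attack rate) in each group = ill / total in that group.
risk_exposed = ate_ill / ate_total
risk_unexposed = not_ate_ill / not_ate_total

# Relative risk = risk among the exposed divided by risk among the unexposed.
relative_risk = risk_exposed / risk_unexposed

print(f"Risk if ate burritos:     {risk_exposed:.3f}")
print(f"Risk if did not eat them: {risk_unexposed:.3f}")
print(f"Relative risk:            {relative_risk:.2f}")
```

A relative risk well above 1, as here, means illness was much more common among the children who ate the burritos.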
Chi-square Test
• A non-parametric test that is used to test for an association between two categorical variables.
• We use it when we have observed frequencies of a categorical variable (e.g., male vs female, healthy vs sick).
Brief History

Chi-square is a statistical test used to examine the differences between categorical variables from a random sample in order to judge the goodness of fit between expected and observed results.
The chi-square distribution
The chi-square distributions are a family of distributions that take only non-negative values and are skewed to the right. A particular chi-square distribution is specified by giving its degrees of freedom.

The Chi-Square distribution has only one parameter: df = degrees of freedom. The degrees of
freedom depend on the application, as we will see later. Here are a few facts about the Chi-
Square distribution.

… Chi-Square distribution

The Chi-Square Test has two major fields of application: 1) goodness of fit test and 2) test of
independence.

Firstly, the Chi-Square Test can test whether the distribution of a variable in a sample
approximates an assumed theoretical distribution (e.g., normal distribution, Beta).

[Please note that the Kolmogorov-Smirnov test is another test for goodness of fit. The Kolmogorov-Smirnov test has a higher power, but can only be applied to continuous-level variables.]

Secondly, the Chi-Square Test can be used to test the independence of two variables. That means that it tests whether one variable is independent of another. In other words, it tests whether or not a statistically significant relationship exists between the two variables (for example, a dependent and an independent variable). When used as a test of independence, the Chi-Square Test is applied to a contingency table, or cross tabulation (sometimes called crosstabs for short).
The chi-square distribution

Chi-square distributions are continuous probability distributions that have one mode and are skewed to the right (positively skewed).

They are non-negative.

They are based on degrees of freedom; the exact shape varies with the number of degrees of freedom.

The critical value of a test statistic in a chi-square distribution is determined by specifying a significance level and the degrees of freedom.
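To illustrate how a significance level and the degrees of freedom fix the critical value, here is a small sketch that uses only the Python standard library. It relies on two special cases that have closed forms (df = 1 and df = 2); for other degrees of freedom a printed table or a statistics package would normally be used. The variable names are illustrative.

```python
from math import log
from statistics import NormalDist

alpha = 0.05  # significance level

# df = 1: a chi-square variable is the square of a standard normal variable,
# so the upper-alpha critical value is the squared two-sided z value.
z = NormalDist().inv_cdf(1 - alpha / 2)
critical_df1 = z ** 2

# df = 2: the upper-tail probability is exp(-x / 2), which can be inverted directly.
critical_df2 = -2 * log(alpha)

print(f"alpha = {alpha}, df = 1: critical value = {critical_df1:.2f}")
print(f"alpha = {alpha}, df = 2: critical value = {critical_df2:.2f}")
```

The df = 1 result matches the value 3.84 found in printed chi-square tables for a 5% significance level.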
• Tables may also have additional information:
• Row and column marginals (i.e., totals)

         Women   Men   Total
SDF      27      10    37      (27 + 10 = 37)
CPDM     16      15    31
Total    43      25    68      (68 is the total N)
Chi-square Test
There are 2 types of chi-square tests:
• The chi-square goodness of fit test is
used to compare the observed
frequencies in a data sample with the
frequencies based on some prior
expectation – either empirical or
theoretical.
• The chi-square test of independence assesses whether observed frequencies are dependent on (i.e., contingent on) certain conditions.
Chi-square frequency tables
• A frequency distribution table / contingency table shows
how observations are distributed between different
groups (i.e., the number of observations in each group).
A chi-square goodness of fit test
can test whether these observed
frequencies are significantly different
from what was expected, such as
equal frequencies.

• Null hypothesis (H0): The bird species visit the feeder in the same proportions as the average over the past five years.
• Alternative hypothesis (H1): The bird species visit the feeder in different proportions from the average over the past five years.
Source: https://www.scribbr.com/statistics/chi-square-tests/
Chi-square frequency tables
• A frequency distribution table / contingency table shows
how observations are distributed between different
groups (i.e., the number of observations in each group).
A chi-square test of independence
can test whether these observed
frequencies are significantly different
from the frequencies expected if
handedness is unrelated to
nationality.
• Null hypothesis (H0): The proportion of people who are left-handed is the same for Americans and Canadians.
• Alternative hypothesis (H1): The proportion of people who are left-handed differs between nationalities.
Source: https://www.scribbr.com/statistics/chi-square-tests/
Chi-square Test of Independence

• Chi-Square test is a test of independence


• Asks “is there a relationship between variables or not?”
• Independence = no relationship
• ANOVA, T-Test do this too (same means = independent)
• Null hypothesis: the two variables are statistically independent
• H0: Gender and political party are independent
• There is no relationship between them
• Alternate hypothesis: the variables are related, not independent of
each other
• H1: Gender and political party are not independent.

Chi-square Test of Independence
The Chi-Square Test of Independence is also known as Pearson's chi-square (or chi-squared) test.
• How does a chi-square test of independence work?
• It is based on comparing the observed cell values with the values you’d expect if there were no relationship between variables
• Definitions:
• Observed values = values in the crosstab cells based on your sample
• Expected values = crosstab cell values you would expect if your variables were unrelated.
Assumption for test:
• Large N (>100)
Degrees of freedom (needed to look up the critical value):
• df = (R-1)(C-1) = (number of rows – 1) × (number of columns – 1)
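CHI SQUARE FORMULA:
$\chi^2 = \sum \frac{(E - O)^2}{E}$, summed over all cells of the table, where O is the observed and E the expected frequency in a cell.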
DATA
• Qualitative → chi-square test
• Quantitative → t-test
Steps in Testing the Hypotheses

Hypothesis testing using crosstabs

Example: Consider a dataset of 68 people distributed as follows:

• Frequencies along the first variable (gender): 43 women, 25 men
• Suppose further that the breakdown by the second variable, political affiliation, is:
• Of 43 women, 27 = SDF, 16 = CPDM
• Of 25 men, 10 = SDF, 15 = CPDM.

– H0: Gender and political affiliation are independent
– H1: There is an association between gender and political affiliation
• Crosstab: a table that presents joint frequencies
• Also called a "joint contingency table"

         Women   Men
SDF      27      10
CPDM     16      15

• Each box with a value is a "cell".
• Each horizontal line of values (e.g., SDF: 27, 10) is a table row.
• Each vertical line of values (e.g., Men: 10, 15) is a table column.
• Tables can also reflect percentages
• Either of total N, or of row or column marginals
• This table shows percentage of total N:

         Women         Men           N
SDF      27 (39.7%)    10 (14.7%)    37
CPDM     16 (23.5%)    15 (22.1%)    31
N        43            25            68

Just divide each cell value by the total N to get a proportion. Multiply by 100 for a percentage: (10/68)(100) = 14.7
• In addition, you can calculate percentages with respect to either row or column marginals
• Here is an example of column percentages

         Women         Men           N
SDF      27 (62.8%)    10 (40.0%)    37
CPDM     16 (37.2%)    15 (60.0%)    31
N        43            25            68

Just divide each cell by the column marginal to get a proportion. Multiply by 100 for a percentage: (10/25)(100) = 40%
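The two kinds of percentages can be reproduced from the observed counts with a few lines of Python. This is a minimal sketch: the dictionary layout and variable names are illustrative choices, not part of the lecture or of SPSS.

```python
# Observed counts from the gender × party example.
observed = {
    "SDF":  {"Women": 27, "Men": 10},
    "CPDM": {"Women": 16, "Men": 15},
}
columns = ["Women", "Men"]

# Marginals: total N and the column totals.
n_total = sum(sum(row.values()) for row in observed.values())
col_totals = {c: sum(row[c] for row in observed.values()) for c in columns}

# Percentage of total N: cell / N * 100.
total_pct = {party: {c: 100 * row[c] / n_total for c in columns}
             for party, row in observed.items()}

# Column percentage: cell / column marginal * 100.
column_pct = {party: {c: 100 * row[c] / col_totals[c] for c in columns}
              for party, row in observed.items()}

for party in observed:
    print(party,
          "% of N:", [f"{total_pct[party][c]:.1f}" for c in columns],
          "column %:", [f"{column_pct[party][c]:.1f}" for c in columns])
```

Column percentages are usually the more informative choice when the column variable (here gender) is treated as the explanatory variable.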
• If there is no relationship between two variables, they are said to
be “independent”
• Neither “depends” on the other
• If there is a relationship, the variables are said to be “associated”
or to “covary”
• If individuals in one category also consistently fall in another (women = SDF, men = CPDM), you may suspect that there is a relationship between the two variables
• Just as when the mean of a certain sub-group is much higher or lower than another (in T-test/ANOVA).
• The value in a cell is referred to as a frequency
– Math symbol = f
• Cells are referred to by row and column numbers
– Ex: women CPDM = 2nd row, 1st column
– In general, rows are indexed by i and columns by j
• Thus, the value in any cell of any table can be written as:

– f_ij
Expected Cell Values
• If two variables are independent, cell values will depend only on
row & column marginals/Totals
– Marginals reflect frequencies… And, if frequency is high, all
cells in that row (or column) should be high
• The formula for the expected value in a cell is:

$\hat{f}_{ij} = \frac{(f_i)(f_j)}{N}$

• f_i and f_j are the row and column marginals
• N is the total sample size
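As a worked illustration of the expected-value formula, the sketch below applies it to the gender × party counts used in this lecture. The list layout (rows = party, columns = gender) and the variable names are assumptions made for the example.

```python
# Observed counts: rows are SDF, CPDM; columns are women, men.
observed = [[27, 10],
            [16, 15]]

row_totals = [sum(row) for row in observed]        # row marginals
col_totals = [sum(col) for col in zip(*observed)]  # column marginals
n = sum(row_totals)                                # total sample size N

# Expected count in each cell = (row marginal × column marginal) / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for label, row in zip(["SDF", "CPDM"], expected):
    print(label, [round(value, 1) for value in row])
```

These are the counts one would expect in each cell if gender and party were completely unrelated, given the same marginals.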
Crosstabulation: Independence
• Here, column percentages highlight the relationship among
variables:

Women Men N

SDF 62.8% 40.0% 37

CPDM 37.2% 60.0% 31

N 43 25 68

• It appears as though women tend to be more SDF, while men tend to be CPDM
Chi-square Test of Independence
• Example: Gender and Political Views. Let’s pretend that N of 68 is sufficient.

Observed counts:

         Women   Men   Total
SDF      27      10    37
CPDM     16      15    31
Total    43      25    68

Expected count for each cell = (row total × column total) / N:
E(Woman∩SDF) = (37 × 43)/68 = 23.4, etc.
…Chi-square Test of Independence

• Compute (E – O)²/E for each cell

         Women                        Men
SDF      (23.4 – 27)²/23.4 = .55      (13.6 – 10)²/13.6 = .95
CPDM     (19.6 – 16)²/19.6 = .66      (11.4 – 15)²/11.4 = 1.14

…Chi-Square Test of Independence

• Finally, sum up to compute the Chi-square
χ² = .55 + .95 + .66 + 1.14 = 3.30
• What is the critical value for α = .05?
• Degrees of freedom: (R-1)(C-1) = (2-1)(2-1) = 1
• From tables, critical value is χ²(0.05; df=1) = 3.84

• Question: Can we reject H0?
• No! χ² = 3.30 < χ²(.05; 1) = 3.84
• We cannot conclude that there is a relationship between gender and political party affiliation.
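The whole calculation for the gender × party table can be checked with a short script. This is a sketch under stated assumptions: it works from unrounded expected counts (so the statistic comes out at about 3.31), it takes the critical value 3.84 from the table rather than computing it, and the p-value line uses a shortcut that is valid only for df = 1.

```python
from math import erfc, sqrt

# Observed counts: rows are SDF, CPDM; columns are women, men.
observed = [[27, 10],
            [16, 15]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum (E - O)^2 / E over all cells.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n
        chi_square += (e - o) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 3.84                  # from the chi-square table, alpha = .05, df = 1
p_value = erfc(sqrt(chi_square / 2))   # upper-tail probability, valid for df = 1 only
reject_h0 = chi_square > critical_value

print(f"chi-square = {chi_square:.2f}, df = {df}, p = {p_value:.3f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

The statistic stays below the critical value, so the decision is not to reject H0.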
Results

A _____ X _____ Chi-Square test of independence indicated a _____________ association between gender and political affiliation, (X2(_______) = __________).

Therefore, the null hypothesis was _______ and we conclude that __________________________
Percentage points of the Chi-squared Distribution

EXAMPLE 2 - Non-Hodgkin’s lymphoma
• A one-year follow-up study was conducted to examine the effect of an experimental drug on mortality in 296 cases of advanced non-Hodgkin's lymphoma. Controls received standard treatment. The data are provided below.

1. Provide the null and alternative hypotheses and an interpretation of the results.
2. Calculate the expected counts for the cells in the table.
3. Test to see if the association between mortality outcome and treatment status is statistically significant.
…EXAMPLE - Non-Hodgkin’s lymphoma

            Died   Survived   Total
Treatment   9      190        199
Control     13     84         97
Total       22     274        296
• Compute expected values for each cell

            Died                           Survived
Treatment   E(9) = 199*22/296 = 14.79      E(190) = 199*274/296 = 184.21
Control     E(13) = 97*22/296 = 7.21       E(84) = 97*274/296 = 89.79
Assumptions:
1. No Eij < 1, and
2. No more than 20% of Eij < 5.
3. Large N (>100)
4. Variables should be measured at an ordinal or nominal level (i.e., categorical data)

Reject H0 iff χ² > χ²(critical)
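The first three assumptions can be checked mechanically before running the test. The sketch below does this for the lymphoma table from Example 2; the thresholds are the ones listed above, and the variable names are illustrative.

```python
# Observed counts: rows are treatment, control; columns are died, survived.
observed = [[9, 190],
            [13, 84]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts for all cells, flattened into one list.
expected = [r * c / n for r in row_totals for c in col_totals]

any_below_1 = any(e < 1 for e in expected)                       # assumption 1
share_below_5 = sum(e < 5 for e in expected) / len(expected)     # assumption 2
large_n = n > 100                                                # assumption 3

assumptions_ok = (not any_below_1) and share_below_5 <= 0.20 and large_n

print("Smallest expected count:", round(min(expected), 2))
print("Share of cells with E < 5:", share_below_5)
print("Assumptions met:", assumptions_ok)
```

Here the smallest expected count is above 5, so the chi-square approximation is considered acceptable for this table.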
A _____ X _____ Chi-Square test of independence indicated a _____________ association between mortality outcome and treatment status, (X2(_______) = __________).

Therefore, the null hypothesis was _______________ and we conclude that there is evidence of an association between mortality outcome and treatment.
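The conclusion for Example 2 can be verified numerically. The sketch below wraps the calculation in a small helper (the function name `chi_square_with_contributions` is a hypothetical choice, not an SPSS or library function) and also returns each cell's contribution, which shows which cells drive the result. The critical value 3.84 is taken from the table for α = .05 and df = 1, and the p-value shortcut again holds only for df = 1.

```python
from math import erfc, sqrt


def chi_square_with_contributions(observed):
    """Return the chi-square statistic and the per-cell (E - O)^2 / E terms."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    contributions = [
        [(r * c / n - o) ** 2 / (r * c / n) for o, c in zip(row, col_totals)]
        for row, r in zip(observed, row_totals)
    ]
    return sum(map(sum, contributions)), contributions


# Observed counts: rows are treatment, control; columns are died, survived.
observed = [[9, 190],
            [13, 84]]

chi_square, contributions = chi_square_with_contributions(observed)
p_value = erfc(sqrt(chi_square / 2))   # valid for df = 1 only
reject_h0 = chi_square > 3.84          # critical value for alpha = .05, df = 1

for label, row in zip(["Treatment", "Control"], contributions):
    print(label, [round(value, 2) for value in row])
print(f"chi-square = {chi_square:.2f}, p = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

The largest contribution comes from the control patients who died, where the observed count (13) is well above the expected count.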
HOMEWORK
Question 1

Question 2
A dog trainer wants to know if golden retrievers and French bulldogs are equally
good at learning how to skateboard. She tries to train 40 golden retrievers and 60
French bulldogs to skateboard and finds the following:

                    Skateboards   Can’t skateboard
Golden retrievers   20            20
French bulldogs     50            10

Should she reject the null hypothesis that the dog’s breed is unrelated to their
skateboarding ability?
a) She should reject the null hypothesis.
b) She should fail to reject the null hypothesis.

Question 3
A restaurant reviewer wants to know if three popular burger restaurants are equally
recommended by their customers. At each of the three restaurants, he asks 25 random
customers whether they would recommend the restaurant to a friend. He finds the
following:

                 Would recommend   Would not recommend
Tasty Burgers    20                5
Burger Prince    22                3
Burger Town      18                7

Should he reject the null hypothesis that the proportion of customers recommending
the restaurant is the same for the three restaurants?
a) He should reject the null hypothesis.
b) He should fail to reject the null hypothesis.

Measuring associations between variables

Exercise:

Is there a relationship between smoking and cancer? Test at 5% level of significance and
report your findings clearly
