Chi Square Test
Does it Work?
Overview
What is the chi-square test? How does it work?
Learn about the different types of Chi-Square tests and where and when you
should apply them
Introduction
Let’s start with a case study. I want you to think of your favorite restaurant right
now. Let’s say you can predict a certain number of people arriving for lunch five
days a week. At the end of the week, you observe that the expected footfall was
different from the actual footfall.
So, how will you check whether the difference between the observed and the
expected footfall values is statistically significant? Remember, this is a
categorical variable – ‘Days of the week’ – with 5 categories [Monday, Tuesday,
Wednesday, Thursday, Friday].
One of the best ways to deal with this is by using the Chi-Square test.
We can always opt for z-tests, t-tests or ANOVA when we’re dealing with
continuous variables. But the situation becomes tricky when working with
categorical features (as most data scientists will attest to!). I’ve found the
chi-square test to be quite helpful in my own projects.
So let’s dive into the article to understand all about the chi-square test, what it is,
how it works and how we can implement it in R.
Table of Contents
What are Categorical Variables?
What is a Chi-Square Test and Why Do We Use It?
Assumptions of the Chi-Square Test
Types of Chi-Square Tests (With implementation in R)
o Chi-Square Goodness of Fit Test
o Chi-Square Test of Association between Two Variables
What are Categorical Variables?
I’m sure you’ve encountered categorical variables before, even if you might not
have intuitively recognized them. They can be tricky to deal with in the data
science world, so let’s first define them.
A categorical variable is one whose values fall into a finite number of
categories. These categories are generally names or labels. Such variables are
also called qualitative variables as they describe a quality or characteristic
rather than a quantity.
For example, the variable “Movie Genre” in a list of movies could take the
categories “Action”, “Fantasy”, “Comedy”, “Romance”, etc.
There are two types of categorical variables:
1. Nominal Variable: A variable with two or more categories that have no natural
ordering. For example, Marital Status (Single, Married, Divorced); Gender (Male,
Female, Transgender), etc.
2. Ordinal Variable: A variable whose categories can be placed in an order. For
example, Customer Satisfaction (Excellent, Very Good, Good, Average, Bad), and so on.
When the data we want to analyze contains this type of variable, we turn to the
chi-square test, denoted by χ², to test our hypothesis.
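To make the nominal/ordinal distinction concrete, here is a minimal R sketch using base R factors. The values are made up for illustration and the variable names are my own, not from any data set used in this article.

```r
# Nominal variable: the categories have no natural order.
marital_status <- factor(c("Single", "Married", "Married", "Single", "Married"))

# Ordinal variable: the order is declared explicitly through `levels`.
satisfaction <- factor(
  c("Good", "Excellent", "Bad", "Good"),
  levels  = c("Bad", "Average", "Good", "Very Good", "Excellent"),
  ordered = TRUE
)

is.ordered(marital_status)   # FALSE
is.ordered(satisfaction)     # TRUE

# table() turns either kind of factor into the category counts
# that a chi-square test works with.
status_counts <- table(marital_status)
print(status_counts)
```

Either kind of factor can be passed to table(), which is why both end up as plain counts by the time the chi-square test is applied.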
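What is a Chi-Square Test and Why Do We Use It?
Let’s understand this with an example. A research scholar is interested in the
relationship between the placement of students in the statistics department of a
reputed University and their C.G.P.A.
He obtains a random sample of placement records of the past five years from the
placement cell database. He records how many students who got placed fell into
each of the following C.G.P.A. categories – 9-10, 8-9, 7-8, 6-7, and below 6.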
If there is no relationship between the placement rate and the C.G.P.A., then the
placed students should be equally spread across the different C.G.P.A. categories
(i.e. there should be similar numbers of placed students in each category).
However, if students with a C.G.P.A. above 8 are more likely to get placed,
then there would be a larger number of placed students in the higher C.G.P.A.
categories than in the lower C.G.P.A. categories. In this case, the data
collected would make up the observed frequencies.
So the question is, are these frequencies being observed by chance or do they
follow some pattern?
Here enters the chi-square test! The chi-square test helps us answer the above
question by comparing the observed frequencies to the frequencies that we
might expect to obtain purely by chance.
In hypothesis testing, the chi-square test is used to test a hypothesis about the
distribution of observations/frequencies across different categories.
We are almost at the implementing aspect of chi-square tests but there’s one
more thing we need to learn before we get there.
Assumptions of the Chi-Square Test
Just like any other statistical test, the chi-square test comes with a few
assumptions of its own:
1. The χ² test assumes that the data for the study is obtained through random
selection, i.e. the observations are randomly picked from the population.
2. The categories are mutually exclusive, i.e. each subject fits in only one
category. For example, in our restaurant case above, a person who had lunch in
your restaurant on Monday can’t be counted in the Tuesday category.
3. The data should be in the form of frequencies or counts for each category,
not percentages.
4. The data should not consist of paired samples or groups. In other words, the
observations should be independent of each other.
5. When more than 20% of the expected frequencies have a value of less than 5,
the chi-square test should not be used, because its approximation becomes
unreliable. To tackle this problem, either combine categories (only where that
is meaningful) or obtain more data.
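The last assumption is easy to check before running a test. The sketch below uses made-up counts and hypothesised proportions (both are illustrative assumptions, not data from this article) to compute the expected frequencies, measure the share that falls below 5, and show the effect of merging two sparse categories.

```r
# Illustrative (made-up) counts and hypothesised proportions for five categories.
observed <- c(18, 12, 6, 3, 1)
p_null   <- c(0.40, 0.30, 0.15, 0.10, 0.05)

# Expected counts under the null hypothesis: total count times each proportion.
expected <- sum(observed) * p_null

# Share of categories whose expected count is below 5.
share_small <- mean(expected < 5)
share_small > 0.20   # TRUE here, so the rule of thumb is violated

# One remedy: merge the two sparse categories, if that is meaningful.
observed_merged <- c(observed[1:3], sum(observed[4:5]))
p_null_merged   <- c(p_null[1:3], sum(p_null[4:5]))
expected_merged <- sum(observed_merged) * p_null_merged
mean(expected_merged < 5)   # 0: every expected count is now at least 5
```

Note that the check is on the expected counts, not the observed ones; a category with few observations is fine as long as its expected count is large enough.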
Types of Chi-Square Tests (With implementation in R)
Chi-Square Goodness of Fit Test
This is a non-parametric test. We typically use it to find whether the observed
value of a given event is significantly different from the expected value. In this
case, we have categorical data for one independent variable, and we want to check
whether the distribution of the data is similar to or different from the
expected distribution.
Let’s consider the above example where the research scholar was interested in
the relationship between the placement of students in the statistics department
of a reputed University and their C.G.P.A.
In this case, the independent variable is C.G.P.A with the categories 9-10, 8-9, 7-8,
6-7, and below 6.
The statistical question here is: whether or not the observed frequencies of
placed students are equally distributed for different C.G.P.A categories (so that
our theoretical frequency distribution contains the same number of students in
each of the C.G.P.A categories).
We will arrange this data by using the contingency table which will consist of both
the observed and expected values as below:
C.G.P.A.                                9-10   8-9   7-8   6-7   Below 6   Total
Observed frequency of placed students    30     35    20    10      5       100
Expected frequency of placed students    20     20    20    20     20       100
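After constructing the contingency table, the next task is to compute the value of
the chi-square statistic. The formula for chi-square is given as:
χ² = Σ (O - E)² / E
where O is the observed frequency and E is the expected frequency of a category,
and the sum runs over all categories. Let’s compute it step by step: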
Step 1: Subtract each expected frequency from the related observed frequency. For
example, for the C.G.P.A. category 9-10, it will be “30-20 = 10”. Apply the same
operation to all the categories.
Step 2: Square each value obtained in step 1, i.e. (O-E)². For example, for the
C.G.P.A. category 9-10, the value obtained in step 1 is 10. It becomes 100 on
squaring. Apply the same operation to all the categories.
Step 3: Divide all the values obtained in step 2 by the related expected
frequencies, i.e. (O-E)²/E. For example, for the C.G.P.A. category 9-10, the value
obtained in step 2 is 100. On dividing it by the related expected frequency, which
is 20, it becomes 5. Apply the same operation to all the categories.
Step 4: Add all the values obtained in step 3 to get the chi-square value. In this
case, the chi-square value comes out to be 5 + 11.25 + 0 + 5 + 11.25 = 32.5.
Step 5: Once we have calculated the chi-square value, the next task is to compare
it with the critical chi-square value. We can look this up in a chi-square
distribution table against the degrees of freedom (number of categories – 1) and
the level of significance.
In this case, the degrees of freedom are 5-1 = 4. So, the critical value at 5% level
of significance is 9.49.
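The five steps above can be reproduced in a few lines of base R. This is a sketch of the hand calculation only; the variable names are my own, and qchisq() is used in place of a printed chi-square table to obtain the critical value.

```r
# Observed and expected frequencies from the placement table above.
observed <- c(30, 35, 20, 10, 5)
expected <- rep(20, 5)

# Steps 1-3: (O - E)^2 / E for every category.
contributions <- (observed - expected)^2 / expected
contributions            # 5.00 11.25 0.00 5.00 11.25

# Step 4: the chi-square statistic is the sum of the contributions.
chi_sq <- sum(contributions)
chi_sq                   # 32.5

# Step 5: critical value at the 5% level with (categories - 1) degrees of freedom.
df       <- length(observed) - 1
critical <- qchisq(0.95, df = df)
round(critical, 2)       # 9.49

chi_sq > critical        # TRUE: reject the null hypothesis
```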
Our obtained value of 32.5 is much larger than the critical value of
9.49. Therefore, we can say that the observed frequencies are significantly
different from the expected frequencies. In other words, C.G.P.A is related to
the number of placements that occur in the department of statistics.
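In practice you would rarely do this by hand. Base R’s chisq.test() runs the same goodness of fit test when it is given a vector of counts and the hypothesised probabilities through its p argument. The sketch below assumes the null hypothesis of equal proportions used in this example; the category names are added only for readable output.

```r
# Observed placements per C.G.P.A. category.
observed <- c(30, 35, 20, 10, 5)
names(observed) <- c("9-10", "8-9", "7-8", "6-7", "Below 6")

# Null hypothesis: placed students are spread equally across the five categories.
result <- chisq.test(x = observed, p = rep(1 / 5, 5))

result$statistic   # X-squared = 32.5
result$parameter   # df = 4
result$p.value     # far below 0.05
result$expected    # 20 in every category
```

Instead of comparing the statistic with a critical value, chisq.test() reports a p-value; a p-value below 0.05 leads to the same conclusion as a statistic above 9.49.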
Problem Statement
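Let’s understand the problem statement before we dive into R.
An organization states that the experience of its employees across different
departments is distributed as follows:
11 – 20 Years = 20%
21 – 40 Years = 17%
6 – 10 Years = 41% and
Up to 5 Years = 22%
We want to test whether the experience intervals recorded in the data follow this
stated distribution.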
Output:
data: table(data$Experience.intervals)
X-squared = 14.762, df = 3, p-value = 0.002032
The p-value here is less than 0.05. Therefore, we will reject our null hypothesis.
Hence, the distribution of experience of the employees of different departments
differs from what the organization states.
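For readers who want to try this themselves, the sketch below shows the general shape of a call that produces output of this form. The original data set is not reproduced here, so the data frame is a hypothetical stand-in with made-up counts; only the column name Experience.intervals is taken from the output above, and the resulting statistic will therefore not match 14.762.

```r
# Proportions stated in the problem, named by experience band.
p_claimed <- c("11 - 20 Years" = 0.20, "21 - 40 Years" = 0.17,
               "6 - 10 Years"  = 0.41, "Up to 5 Years" = 0.22)

# Hypothetical stand-in for the article's data set (made-up counts).
data <- data.frame(
  Experience.intervals = rep(names(p_claimed), times = c(30, 25, 25, 20))
)

# Count how many employees fall into each experience band.
counts <- table(data$Experience.intervals)

# Match the probabilities to the order of the table's categories,
# then test the counts against the claimed distribution.
result <- chisq.test(counts, p = p_claimed[names(counts)])
print(result)
```

Indexing p_claimed by names(counts) matters: table() orders categories by factor level, so the probabilities must be supplied in that same order.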
Chi-Square Test for Association/Independence
Let’s take another example to understand this. A teacher wants to know the
answer to whether the outcome of a mathematics test is related to the gender of
the person taking the test. Or in other words, she wants to know if males show a
different pattern of pass/fail rates than females.
So, here are two categorical variables: Gender (Male and Female) and
mathematics test outcome (Pass or Fail). Let us now look at the contingency
table:
Boys Girls
Pass 17 20
Fail 8 5
By looking at the above contingency table, we can see that the girls have a
comparatively higher pass rate than boys. However, to test whether this observed
difference is significant or not, we will carry out the chi-square test.
Step 1: Calculate the row and column totals of the above contingency table:
        Boys   Girls   Total
Pass     17     20      37
Fail      8      5      13
Total    25     25      50
Step 2: Calculate the expected frequency for each individual cell by multiplying
its row total by its column total and dividing by the grand total:
Expected Frequency = (Row Total x Column Total)/Grand Total
For the first cell, the expected frequency would be (37*25)/50 = 18.5. Now, write
them next to the observed frequencies in brackets:
        Boys        Girls
Pass    17 (18.5)   20 (18.5)
Fail     8 (6.5)     5 (6.5)
Step 3: Calculate (O-E)²/E for each cell. For example, for the first cell,
((17-18.5)^2)/18.5 = 0.1216.
Step 4: Then, add all the values obtained for each cell. In this case, the values are:
0.1216+0.1216+0.3461+0.3461 = 0.9354
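The same calculation can be sketched in base R. The matrix below holds the observed pass/fail counts from the contingency table; the object names are my own. Because R keeps full precision for every cell, the sum comes out at about 0.9356 rather than 0.9354, which is just the effect of rounding the four terms before adding them.

```r
# Observed counts: rows are outcomes, columns are genders (filled column by column).
scores <- matrix(c(17, 8, 20, 5), nrow = 2,
                 dimnames = list(Outcome = c("Pass", "Fail"),
                                 Gender  = c("Boys", "Girls")))

# Step 1: row and column totals.
addmargins(scores)

# Step 2: expected frequency = (row total x column total) / grand total.
expected <- outer(rowSums(scores), colSums(scores)) / sum(scores)
expected                 # Pass: 18.5 18.5, Fail: 6.5 6.5

# Steps 3-4: per-cell contributions and their sum.
contributions <- (scores - expected)^2 / expected
chi_sq <- sum(contributions)
round(chi_sq, 4)         # 0.9356
```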
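The next task is to compare it with the critical chi-square value from a
chi-square distribution table. For a contingency table, the degrees of freedom are
(number of rows – 1) x (number of columns – 1), which is (2-1) x (2-1) = 1 here,
so the critical value at the 5% level of significance is 3.84.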
The Chi-Square calculated value is 0.9354 which is less than the critical value
of 3.84. So in this case, we fail to reject the null hypothesis. This means there is no
significant association between the two variables, i.e, boys and girls have a
statistically similar pattern of pass/fail rates on their mathematics tests.
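Here is how the test of association might look with base R’s chisq.test(). One detail to be aware of: for 2 x 2 tables, chisq.test() applies Yates’ continuity correction by default, so the sketch sets correct = FALSE to match the hand calculation above. Whether to use the correction in a real analysis is a separate judgment call.

```r
# Observed counts: rows are outcomes, columns are genders (filled column by column).
scores <- matrix(c(17, 8, 20, 5), nrow = 2,
                 dimnames = list(Outcome = c("Pass", "Fail"),
                                 Gender  = c("Boys", "Girls")))

# correct = FALSE turns off Yates' continuity correction so that the
# statistic matches the manual (O - E)^2 / E calculation.
result <- chisq.test(scores, correct = FALSE)

round(unname(result$statistic), 4)   # 0.9356
result$parameter                     # df = 1
result$p.value > 0.05                # TRUE: fail to reject the null hypothesis

# Critical value at the 5% level with 1 degree of freedom.
round(qchisq(0.95, df = 1), 2)       # 3.84
```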