Chi Square Test

The document discusses the chi-square test, which is used to test hypotheses about categorical variable distributions. It can determine if observed frequencies differ significantly from expected frequencies. The chi-square test assumes random sampling, mutually exclusive categories, and count data. There are two main types: goodness of fit tests compare observed counts to theoretical distributions, while tests of association examine relationships between two categorical variables. An example goodness of fit test analyzes the relationship between student GPAs and job placements.

Uploaded by

Sudhanshu Singh

What is the Chi-Square Test and How Does it Work?

Overview

• What is the chi-square test? How does it work?
• Learn about the different types of Chi-Square tests and where and when you should apply them

Introduction

“Science is advanced by proposing and testing a hypothesis, not by declaring questions unsolvable” – Nick Matzke

Let’s start with a case study. I want you to think of your favorite restaurant right
now. Let’s say you can predict a certain number of people arriving for lunch five
days a week. At the end of the week, you observe that the expected footfall was
different from the actual footfall.

Sounds like a prime statistics problem? That’s the idea!

So, how will you check whether the difference between the observed and the
expected footfall is statistically significant? Remember that we are working with
a categorical variable – ‘Days of the week’ – with 5 categories [Monday, Tuesday,
Wednesday, Thursday, Friday].

One of the best ways to deal with this is by using the Chi-Square test.

We can always opt for z-tests, t-tests or ANOVA when we’re dealing with
continuous variables. But the situation becomes tricky when working with
categorical features (as most data scientists will attest!). I’ve found the
chi-square test to be quite helpful in my own projects.

So let’s dive into the article to understand all about the chi-square test: what it is,
how it works and how we can implement it in R.

Table of Contents

• What are Categorical Variables?
• What is a Chi-Square Test and Why Do We Use It?
• Assumptions of the Chi-Square Test
• Types of Chi-Square Tests (With implementation in R)
    ◦ Chi-Square Goodness of Fit Test
    ◦ Chi-Square Test of Association between Two Variables

What are Categorical Variables?

I’m sure you’ve encountered categorical variables before, even if you might not
have recognized them as such. They can be tricky to deal with in the data
science world, so let’s first define them.

Categorical variables are variables whose values fall into a finite number of
categories. These categories are generally names or labels. Such variables are
also called qualitative variables, as they describe a quality or characteristic
rather than a measured quantity.

For example, the categorical variable “Movie Genre” in a list of movies could take
the categories “Action”, “Fantasy”, “Comedy”, “Romance”, etc.

There are broadly two types of categorical variables:

1. Nominal Variable: A variable with two or more categories that have no natural
ordering. For example, Marital Status (Single, Married, Divorced), Gender (Male,
Female, Transgender), etc.
2. Ordinal Variable: A variable whose categories can be placed in an order. For
example, Customer Satisfaction (Excellent, Very Good, Good, Average, Bad), and so on.

When the data we want to analyze contains this type of variable, we turn to the
chi-square test, denoted by χ², to test our hypothesis.
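To make the idea concrete, here is a small Python sketch (the article itself works in R) that tallies a categorical variable into frequency counts, which is the form of data the chi-square test works with. The genre labels below are made-up illustration data, not taken from any real dataset.

```python
from collections import Counter

# Hypothetical raw observations of the categorical variable "Movie Genre".
genres = ["Action", "Comedy", "Action", "Romance", "Fantasy", "Comedy", "Action"]

# Tally how many observations fall into each category.
genre_counts = Counter(genres)

for genre, count in sorted(genre_counts.items()):
    print(f"{genre}: {count}")
```

Each observation lands in exactly one category, and the result is a set of counts rather than percentages.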

What is a Chi-Square Test and Why Do We Use It?

A Chi-Square test is a test of statistical significance for categorical variables.

Let’s learn the use of chi-square with an intuitive example.

A research scholar is interested in the relationship between the placement of
students in the statistics department of a reputed University and their C.G.P.A.
(their final assessment score).

He obtains a random sample of placement records from the past five years from the
placement cell database. He records how many students who got placed fell into
each of the following C.G.P.A. categories – 9-10, 8-9, 7-8, 6-7, and below 6.

[Image. Source: Anibrain School of Media Design]

If there is no relationship between the placement rate and the C.G.P.A., then the
placed students should be equally spread across the different C.G.P.A. categories
(i.e. there should be similar numbers of placed students in each category).

However, if students with a C.G.P.A. above 8 are more likely to get placed,
then there would be a larger number of placed students in the higher C.G.P.A.
categories than in the lower ones. The data collected make up the observed
frequencies.

So the question is, are these frequencies being observed by chance or do they
follow some pattern?

Here enters the chi-square test! The chi-square test helps us answer the above
question by comparing the observed frequencies to the frequencies that we
might expect to obtain purely by chance.

In hypothesis testing, the chi-square test is used to test a hypothesis about how
observations (frequencies) are distributed across different categories.

Note: I strongly recommend going through the below article if you need to brush
up your hypothesis testing concepts:

We are almost ready to implement chi-square tests, but there’s one
more thing we need to learn before we get there.

Assumptions of the Chi-Square Test

Just like any other statistical test, the chi-square test comes with a few
assumptions of its own:

• The χ² test assumes that the data for the study is obtained through random
selection, i.e. the observations are randomly picked from the population
• The categories are mutually exclusive, i.e. each subject fits in only one
category. For example, in our restaurant example above, a person who had
lunch in your restaurant on Monday can’t be counted in the Tuesday category
• The data should be in the form of frequencies or counts for each category,
not percentages
• The data should not consist of paired samples or groups; in other words, the
observations should be independent of each other
• When more than 20% of the expected frequencies are less than 5, the
chi-square approximation is unreliable and the test should not be used. To
tackle this problem, either combine categories (only where doing so makes
sense) or obtain more data
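The last assumption can be checked mechanically before running a test. The following Python sketch flags a set of expected frequencies when more than 20% of them fall below 5. The function name, its parameters and the example values are illustrative assumptions, not part of the original article.

```python
def expected_counts_ok(expected, min_count=5, max_share=0.20):
    """Return True if at most `max_share` of the expected counts are below `min_count`."""
    small = sum(1 for e in expected if e < min_count)
    return small / len(expected) <= max_share


# Hypothetical expected frequencies for two five-category problems.
print(expected_counts_ok([20, 20, 20, 20, 20]))  # no cells below 5
print(expected_counts_ok([12, 9, 4, 3, 2]))      # 3 of 5 cells are below 5
```

The first call passes the check and the second fails it, which would be the cue to merge sparse categories or collect more data.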

Types of Chi-Square Tests

Chi-Square Goodness of Fit Test

This is a non-parametric test. We typically use it to find out whether the observed
frequencies of a given event differ significantly from the expected frequencies. In
this case, we have categorical data for a single variable, and we want to check
whether the distribution of the data matches the expected distribution.

Let’s consider the above example where the research scholar was interested in
the relationship between the placement of students in the statistics department
of a reputed University and their C.G.P.A.

In this case, the independent variable is C.G.P.A with the categories 9-10, 8-9, 7-8,
6-7, and below 6.

The statistical question here is: whether or not the observed frequencies of
placed students are equally distributed for different C.G.P.A categories (so that
our theoretical frequency distribution contains the same number of students in
each of the C.G.P.A categories).

We will arrange this data by using the contingency table which will consist of both
the observed and expected values as below:

C.G.P.A                                  10-9   9-8   8-7   7-6   Below 6   Total
Observed Frequency of Placed students    30     35    20    10    5         100
Expected Frequency of Placed students    20     20    20    20    20        100
After constructing the contingency table, the next task is to compute the value of
the chi-square statistic. The formula for chi-square is given as:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

where,

• χ² = Chi-Square value
• O_i = Observed frequency of the i-th category
• E_i = Expected frequency of the i-th category

Let us look at the step-by-step approach to calculate the chi-square value:

Step 1: Subtract each expected frequency from the related observed frequency. For
example, for the C.G.P.A category 10-9, it will be “30-20 = 10”. Apply the same
operation to all the categories

Step 2: Square each value obtained in step 1, i.e. (O-E)². For example, for the
C.G.P.A category 10-9, the value obtained in step 1 is 10. It becomes 100 on
squaring. Apply the same operation to all the categories

Step 3: Divide all the values obtained in step 2 by the related expected
frequencies, i.e. (O-E)²/E. For example, for the C.G.P.A category 10-9, the value
obtained in step 2 is 100. On dividing it by the related expected frequency, which
is 20, it becomes 5. Apply the same operation to all the categories

Step 4: Add all the values obtained in step 3 to get the chi-square value. In this
case, 5 + 11.25 + 0 + 5 + 11.25 gives a chi-square value of 32.5

Step 5: Once we have calculated the chi-square value, the next task is to compare it
with the critical chi-square value. We can look this up in a chi-square distribution
table using the degrees of freedom (number of categories – 1) and the level of
significance

In this case, the degrees of freedom are 5-1 = 4. So, the critical value at the 5%
level of significance is 9.49.

Our obtained value of 32.5 is much larger than the critical value of
9.49. Therefore, we can say that the observed frequencies are significantly
different from the expected frequencies. In other words, C.G.P.A is related to
the number of placements that occur in the department of statistics.
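The hand calculation above can be reproduced in a few lines. This is a minimal Python sketch using only the observed and expected frequencies from the table and the critical value quoted in the text. The function and variable names are illustrative.

```python
observed = [30, 35, 20, 10, 5]   # placed students per C.G.P.A. category
expected = [20, 20, 20, 20, 20]  # equal spread under the null hypothesis


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


chi_square = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1
critical_value = 9.49  # 5% critical value for 4 degrees of freedom, as quoted in the text

print(chi_square)                   # 32.5
print(degrees_of_freedom)           # 4
print(chi_square > critical_value)  # True
```

Because the statistic exceeds the critical value, the sketch reaches the same conclusion as the text: the observed frequencies differ significantly from an equal spread.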

Problem Statement
Let’s understand the problem statement before we dive into R.

An organization claims that the experience of the employees of different
departments is distributed in the following categories:

• 11 – 20 Years = 20%
• 21 – 40 Years = 17%
• 6 – 10 Years = 41% and
• Up to 5 Years = 22%

A random sample of 1470 employees is collected. Does this random sample
provide evidence against the organization’s claim?

Setting up the hypothesis

• Null hypothesis: The true proportions of the experience of the employees of
different departments are distributed in the following categories: 11 – 20
Years = 20%, 21 – 40 Years = 17%, 6 – 10 Years = 41% and up to 5 Years = 22%
• Alternative hypothesis: The distribution of experience of the employees of
different departments differs from what the organization states
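Under the null hypothesis, the expected count in each category is the sample size multiplied by the claimed proportion. Here is a short Python sketch of that step, using only the figures stated above. The variable names are illustrative, and the observed counts are not reproduced because the text does not list them.

```python
sample_size = 1470

# Proportions claimed by the organization (the null hypothesis).
claimed = {
    "11 - 20 Years": 0.20,
    "21 - 40 Years": 0.17,
    "6 - 10 Years": 0.41,
    "Up to 5 Years": 0.22,
}

# Expected count per category = sample size x claimed proportion.
expected_counts = {label: sample_size * p for label, p in claimed.items()}

for label, count in expected_counts.items():
    print(f"{label}: {count:.1f}")
```

All four expected counts are far above 5, so the expected-frequency assumption is comfortably met for this problem.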

Calculate the chi-square value:

Output:

Chi-square test for given probabilities

data: table(data$Experience.intervals)
X-squared = 14.762, df = 3, p-value = 0.002032

The p-value here is less than 0.05. Therefore, we reject the null hypothesis: the
sample provides evidence that the distribution of experience of the employees of
different departments differs from what the organization states.
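As a cross-check, the reported p-value can be recovered from the reported statistic and degrees of freedom. The Python sketch below is not the article's R code; it uses the closed-form upper-tail probability of the chi-square distribution for whole-number degrees of freedom, built from the standard library only. The function name `chi2_sf` is an illustrative choice.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        total = sum(half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_{i=1}^{(df-1)/2} (x/2)^(i - 1/2) / Gamma(i + 1/2)
    total = sum(
        half ** (i - 0.5) / math.gamma(i + 0.5)
        for i in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


# Statistic and degrees of freedom as reported in the output above.
p_value = chi2_sf(14.762, 3)
print(p_value)
print(p_value < 0.05)
```

The result is approximately 0.002, in line with the p-value in the output, and the comparison against 0.05 leads to the same decision to reject the null hypothesis.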

Chi-Square Test for Association/Independence

The second type of chi-square test is Pearson’s chi-square test of association.
This test is used when we have categorical data for two variables and we want to
see if there is any relationship between them.

Let’s take another example to understand this. A teacher wants to know the
answer to whether the outcome of a mathematics test is related to the gender of
the person taking the test. Or in other words, she wants to know if males show a
different pattern of pass/fail rates than females.

So, here are two categorical variables: Gender (Male and Female) and
mathematics test outcome (Pass or Fail). Let us now look at the contingency
table:

Boys Girls
Pass 17 20
Fail 8 5

By looking at the above contingency table, we can see that the girls have a
comparatively higher pass rate than boys. However, to test whether this observed
difference is significant or not, we will carry out the chi-square test.

The steps to calculate the chi-square value are as follows:

Step 1: Calculate the row and column total of the above contingency table:

        Boys   Girls   Total
Pass    17     20      37
Fail    8      5       13
Total   25     25      50

Step 2: Calculate the expected frequency for each individual cell by multiplying
the row total by the column total and dividing by the grand total:

Expected Frequency = (Row Total x Column Total)/Grand Total

For the first cell, the expected frequency would be (37*25)/50 = 18.5. Now, write
the expected frequencies in brackets with the observed frequencies:

        Boys        Girls       Total
Pass    17 (18.5)   20 (18.5)   37
Fail    8 (6.5)     5 (6.5)     13
Total   25          25          50
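Steps 1 and 2 can be sketched in Python as follows, starting from the observed table and deriving the totals and expected frequencies. The variable names are illustrative.

```python
observed = [
    [17, 20],  # Pass: boys, girls
    [8, 5],    # Fail: boys, girls
]

# Step 1: row totals, column totals and the grand total.
row_totals = [sum(row) for row in observed]
column_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Step 2: expected frequency = (row total x column total) / grand total.
expected = [
    [r * c / grand_total for c in column_totals]
    for r in row_totals
]

print(row_totals, column_totals, grand_total)  # [37, 13] [25, 25] 50
print(expected)                                # [[18.5, 18.5], [6.5, 6.5]]
```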

Step 3: Calculate the value of chi-square using the formula:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

Calculate the term (O − E)²/E for each cell. For example, for the first cell,
(17 − 18.5)²/18.5 = 0.1216.

Step 4: Then, add all the values obtained for each cell. In this case, the values are:

0.1216 + 0.1216 + 0.3462 + 0.3462 = 0.9356

Step 5: Calculate the degrees of freedom, i.e. (Number of rows-1)*(Number of
columns-1) = 1*1 = 1

The next task is to compare the calculated value with the critical chi-square value
for 1 degree of freedom at the 5% level of significance, which is 3.84.

The calculated chi-square value is 0.9356, which is less than the critical value
of 3.84. So in this case, we fail to reject the null hypothesis. This means there is no
significant association between the two variables, i.e. boys and girls have a
statistically similar pattern of pass/fail rates on their mathematics tests.
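The remaining steps can be sketched the same way. This Python example takes the observed and expected frequencies from the tables above, sums the per-cell terms without intermediate rounding, and compares the result with the critical value quoted in the text. The variable names are illustrative.

```python
observed = [[17, 20], [8, 5]]          # rows: Pass, Fail; columns: boys, girls
expected = [[18.5, 18.5], [6.5, 6.5]]  # expected frequencies from Step 2

# Steps 3 and 4: sum (O - E)^2 / E over every cell of the table.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Step 5: degrees of freedom = (rows - 1) * (columns - 1).
rows, cols = len(observed), len(observed[0])
degrees_of_freedom = (rows - 1) * (cols - 1)
critical_value = 3.84  # 5% critical value for 1 degree of freedom, as quoted in the text

print(round(chi_square, 2))         # 0.94
print(degrees_of_freedom)           # 1
print(chi_square < critical_value)  # True
```

The statistic of roughly 0.94 is well below 3.84, so the sketch agrees with the decision not to reject the null hypothesis. One caveat when moving to R: `chisq.test` applies Yates' continuity correction to 2x2 tables by default, so its statistic will be smaller than this hand calculation unless `correct = FALSE` is passed.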

Important things to note when considering using the Chi-Square test

First, the chi-square test only tells us whether or not two variables are
independent, a binary “yes” or “no” answer. It does not provide any insight into the
strength or direction of the relationship, so researchers cannot tell from the test
result alone how large the difference between the categories is.

Second, the chi-square test requires frequency counts rather than percentages or
ratios. This can limit the flexibility that researchers have in terms of the
processes that they use.
