Chi-square test in Data Science & Data Analytics
Last Updated :
26 Jul, 2025
The Chi-Square test helps us determine whether there is a significant relationship between two categorical variables. It is a non-parametric statistical test, meaning it does not assume that the data follows a normal distribution.
The Chi-Square test compares the observed frequencies (actual data) with the expected frequencies (what we would expect if there were no relationship). In machine learning, this comparison helps identify which categorical features are informative for predicting the target variable.
Chi-square statistic is calculated as:
\chi^2_c = \sum \frac{(O_{i} - E_{i})^2}{E_{i}} ...eq(1)
where,
- c is the degrees of freedom
- O_{i} is the observed frequency in cell {i}
- E_{i} is the expected frequency in cell {i}
It is often used with non-normally distributed data. Before we jump into the calculations, let's understand some important terms:
- Observed Values (O): Actual counts from the data.
- Expected Values (E): Counts expected if variables are independent.
- Contingency Table: A table showing counts of two categorical variables.
- Degrees of Freedom (df): The number of values that are free to vary; used to look up the critical value.
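To make the idea of a contingency table concrete, pandas' `crosstab` can tabulate two categorical columns; the purchase records below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical purchase records (illustrative data only)
data = pd.DataFrame({
    "category": ["Electronics", "Books", "Electronics", "Clothing", "Books", "Electronics"],
    "payment":  ["Credit Card", "PayPal", "PayPal", "Credit Card", "Credit Card", "PayPal"],
})

# Observed frequencies (O): counts of each category/payment combination
observed = pd.crosstab(data["category"], data["payment"])
print(observed)
```

Each cell of the resulting table is an observed count O; the row and column margins are what the expected counts E are computed from.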
Types of Chi-Square test
The two main types are the chi-square test for independence and the chi-square goodness-of-fit test.
1. Chi-Square Test for Independence: This test is used to determine whether there is a significant relationship between two categorical variables.
- This test is applied when we have counts of values for two nominal or categorical variables.
- To conduct this test two requirements must be met: independence of observations and a relatively large sample size.
- For example, we can test whether shopping preference (Electronics, Clothing, Books) is related to payment method (Credit Card, Debit Card, PayPal). The null hypothesis assumes no relationship between them.
2. Chi-Square Goodness-of-Fit Test: The Chi-Square Goodness-of-Fit test is used to check if a variable follows a specific expected pattern or distribution.
- This test is used with counts of categorical data to see if the observed values match what we expect based on a hypothesis. It helps determine if the data represents the whole population well.
- For example, when testing if a six-sided die is fair, the null hypothesis assumes each face has an equal chance of landing face up, meaning the die is unbiased and all faces occur equally often.
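A minimal sketch of the goodness-of-fit test for the die example, using `scipy.stats.chisquare`; the roll counts below are hypothetical:

```python
from scipy import stats

# Observed face counts from 60 hypothetical rolls of a die
observed = [8, 12, 9, 11, 10, 10]
# H0: the die is fair, so each of the 6 faces is expected 60/6 = 10 times
expected = [10] * 6

chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2_stat, p_value)  # chi2 = 1.0; the large p-value gives no evidence of bias
```

Because the observed counts stay close to the expected ones, the statistic is small and we fail to reject the null hypothesis that the die is fair.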
Step 1: Define Your Hypotheses
- Null Hypothesis (H₀): The two variables are independent (no relationship).
- Alternative Hypothesis (H₁): The two variables are related (there is a relationship).
Step 2: Create a Contingency Table: This is simply a table that displays the frequency distribution of the two categorical variables.
Step 3: Calculate Expected Values: To find the expected value for each cell use this formula:
E_{i} = \frac{(Row\ Total \times Column\ Total)}{Grand\ Total}
Step 4: Compute the Chi-Square Statistic: Now use the Chi-Square formula:
\chi^2 = \sum \frac{(O_{i} - E_{i})^2}{E_{i}}
where:
- O_{i} = Observed value
- E_{i} = Expected value
If the observed and expected values are very different, the Chi-Square value will be high, which indicates a strong relationship.
Step 5: Compare with the Critical Value:
- If \chi^2 > critical value → Reject H₀ (There is a relationship).
- If \chi^2 < critical value → Fail to reject H₀ (no evidence of a relationship).
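The five steps above can be sketched directly with numpy; the contingency table here is hypothetical:

```python
import numpy as np
from scipy import stats

# Step 2: observed contingency table (hypothetical counts)
observed = np.array([[30, 20],
                     [10, 40]])

# Step 3: expected counts under H0 (independence):
# E = (row total x column total) / grand total, i.e. an outer product of the margins
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

# Step 4: the chi-square statistic
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Step 5: compare with the critical value at alpha = 0.05
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
critical = stats.chi2.ppf(0.95, dof)
print(chi2_stat, critical, chi2_stat > critical)  # True -> reject H0
```

For this made-up table the statistic (about 16.67) far exceeds the critical value (3.84 for 1 degree of freedom), so we would reject the null hypothesis of independence.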
Why do we use the Chi-Square Test?
The Chi-Square Test helps us find relationships or differences between categories. Its main uses are:
- Feature Selection in Machine Learning: It helps decide if a categorical feature (like color or product type) is important for predicting the target (like sales or satisfaction), improving model performance.
- Testing Independence: It checks if two categorical variables are related or independent. For example, whether age or gender affects product preferences.
- Assessing Model Fit: It helps check if a model’s predicted categories match the actual data, which is useful to improve classification models.
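For the feature-selection use case, scikit-learn exposes a chi-square scorer. The toy arrays below are illustrative only; note that `sklearn`'s `chi2` requires non-negative feature values, such as counts or one-hot encodings:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical count-encoded features and a binary target
X = np.array([[3, 0], [2, 1], [0, 4], [1, 3], [3, 1], [0, 5]])
y = np.array([1, 1, 0, 0, 1, 0])

# Chi-square score and p-value of each feature against the target;
# a higher score suggests stronger dependence on the target
scores, p_values = chi2(X, y)
print(scores, p_values)

# Keep only the highest-scoring feature
X_selected = SelectKBest(chi2, k=1).fit_transform(X, y)
print(X_selected.shape)  # (6, 1)
```

`SelectKBest` simply ranks features by their chi-square score and keeps the top k, which is a common preprocessing step before training a classifier.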
Example: Income Level vs Subscription Status
Let us examine a dataset with the features "income level" (low, medium, high) and "subscription status" (subscribed, not subscribed), which indicates whether a customer subscribed to a service. The goal is to determine whether income level is relevant for predicting subscription status.
Step 1: State the Hypotheses
- Null Hypothesis (H₀): There is no significant association between income level and subscription status.
- Alternative Hypothesis (H₁): There is a significant association between them.
Step 2: Contingency table
| Income Level | Subscribed | Not Subscribed | Row Total |
|---|---|---|---|
| Low | 20 | 30 | 50 |
| Medium | 40 | 25 | 65 |
| High | 10 | 15 | 25 |
| Column Total | 70 | 70 | 140 |
Step 3: Now calculate the expected frequencies. For example, the expected frequency for "Low Income" and "Subscribed" is:
- The row total for Low income is 50, the column total for Subscribed is 70, and the grand total is 140.
- E(Low, Subscribed) = (50 \times 70) \div 140 = 25
Similarly we can find the expected frequencies for the other cells:
| | Subscribed | Not Subscribed |
|---|---|---|
| Low Income | 25 | 25 |
| Medium Income | 32.5 | 32.5 |
| High Income | 12.5 | 12.5 |
Step 4: Calculate the Chi-Square Statistic: Let's summarize the observed and expected values into a table and calculate the Chi-Square value:
| | Subscribed (O) | Not Subscribed (O) | Subscribed (E) | Not Subscribed (E) |
|---|---|---|---|---|
| Low Income | 20 | 30 | 25 | 25 |
| Medium Income | 40 | 25 | 32.5 | 32.5 |
| High Income | 10 | 15 | 12.5 | 12.5 |
Now using the formula specified in equation 1 we can compute the chi-square statistic:
\chi^2 = \frac{(20 - 25)^2}{25} + \frac{(30 - 25)^2}{25} + \frac{(40 - 32.5)^2}{32.5} + \frac{(25 - 32.5)^2}{32.5} + \frac{(10 - 12.5)^2}{12.5} + \frac{(15 - 12.5)^2}{12.5}
= 1 + 1 + 1.731 + 1.731 + 0.5 + 0.5 = 6.462
Step 5: Degrees of Freedom
\text{Degrees of Freedom (df)} = (3 - 1) \times (2 - 1) = 2
Step 6: Interpretations
Now compare the calculated \chi^2 value (6.462) with the critical value for 2 degrees of freedom. If \chi^2 is greater than the critical value, we reject the null hypothesis, which would mean "income level" is significantly related to "subscription status" and is an important feature. For the implementation below, basic familiarity with numpy, matplotlib and scipy is helpful.
Python
import scipy.stats as stats

df = 2        # degrees of freedom
alpha = 0.05  # significance level
critical_value = stats.chi2.ppf(1 - alpha, df)
print(critical_value)
Output:
5.991464547107979
For df = 2 and significance level \alpha = 0.05, the critical value is 5.991.
- Since 6.462 > 5.991, we reject the null hypothesis.
- Conclusion: There is a significant association between income level and subscription status, so income level is a useful predictor.
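As a cross-check, `scipy.stats.chi2_contingency` computes the expected counts, statistic, degrees of freedom, and p-value directly from the observed table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed table from the example (rows: low/medium/high income)
observed = np.array([[20, 30],
                     [40, 25],
                     [10, 15]])

# Yates' continuity correction only applies to 2x2 tables,
# so for this 3x2 table this is the plain Pearson chi-square test
chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(chi2_stat, dof)  # ~6.46 with 2 degrees of freedom
print(expected)        # [[25, 25], [32.5, 32.5], [12.5, 12.5]]
```

The returned expected counts match the hand-computed ones, and the p-value tells us directly whether the result is significant at the chosen alpha without looking up a critical value.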
Visualizing Chi-Square Distribution
Python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
df = 2
alpha = 0.05
c_val = stats.chi2.ppf(1 - alpha, df)
cal_chi_s = 6.462
x = np.linspace(0, 10, 1000)
y = stats.chi2.pdf(x, df)
plt.plot(x, y, label='Chi-Square Distribution (df=2)')
plt.fill_between(x, y, where=(x > c_val), color='red', alpha=0.5, label='Critical Region')
plt.axvline(cal_chi_s, color='blue', linestyle='dashed', label='Calculated Chi-Square')
plt.axvline(c_val, color='green', linestyle='dashed', label='Critical Value')
plt.title('Chi-Square Distribution and Critical Region')
plt.xlabel('Chi-Square Value')
plt.ylabel('Probability Density Function')
plt.legend()
plt.show()
Output:
Chi-square Distribution
In this plot:
- The green dashed line marks the critical value (5.991) for a significance level of 0.05 with 2 degrees of freedom, the threshold beyond which we reject the null hypothesis.
- The blue dashed line marks the calculated Chi-Square statistic (6.462).
- The red shaded area to the right of the critical value is the rejection (critical) region.
Since the calculated Chi-Square statistic falls inside the critical region, we reject the null hypothesis. Hence there is a significant association between the two variables.