Chi-Square Test

Defence Institute of Advanced Technology
School of Computer Engineering and Applied Mathematics
MSc. Data Science
Probability & Statistical Methods with R Lab

(Subject code: AMMSCD505)
Report on
Chi-Square Test
Submitted by
Deekshita S Iyer
23-60-02
Chi-Square Test
The chi-square test is a statistical test used to determine if there is a significant
association between two categorical variables. It is a non-parametric test, meaning it
does not make assumptions about the underlying distribution of the data.
It compares the observed frequencies of the categories in a contingency table with the
expected frequencies that would occur under the assumption of independence between
the variables.
The chi-square value is calculated as:
$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$

where
$O_i$ is the observed value for category $i$
$E_i$ is the expected value for category $i$
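As a minimal sketch, the formula above can be written as a small R function. The name chi_square_stat is a hypothetical helper introduced here for illustration (it is not part of base R), and the counts in the example call are made up purely to show the arithmetic.

```r
# Pearson chi-square statistic: sum over categories of (O - E)^2 / E.
# chi_square_stat is a hypothetical helper name, not a base R function.
chi_square_stat <- function(observed, expected) {
  stopifnot(length(observed) == length(expected), all(expected > 0))
  sum((observed - expected)^2 / expected)
}

# Illustrative counts only (not taken from the report's data):
# (12 - 10)^2 / 10 + (8 - 10)^2 / 10 = 0.4 + 0.4
chi_square_stat(observed = c(12, 8), expected = c(10, 10))   # 0.8
```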
There are two main types of chi-square tests commonly used in statistics:
Chi-Square Goodness of Fit Test: It is used to check whether the observed data
distribution matches the expected data distribution.
Chi-Square Test of Independence: It is used to verify whether there is a significant
association between two categorical variables in a sample from a population.
Chi-square distribution
Chi-square (χ²) distributions are a family of continuous probability distributions that
are widely used in hypothesis tests such as the chi-square goodness of fit test and the
chi-square test of independence.
The shape of a chi-square distribution is determined by the degrees of freedom,
represented by the parameter k. The diagram below shows the chi-square distributions
for various values of k.
Fig 1: Chi-square distribution
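A plot comparable to Fig 1 can be drawn with the base R density function dchisq. This is only a sketch: the values of k below are arbitrary choices for illustration and may differ from those in the original figure.

```r
# Density of the chi-square distribution for several degrees of freedom k.
x <- seq(0.1, 20, by = 0.1)
k_values <- c(1, 2, 4, 8)            # arbitrary illustrative choices of k
densities <- sapply(k_values, function(k) dchisq(x, df = k))
colnames(densities) <- paste0("k=", k_values)

# One curve per k, drawn on shared axes.
matplot(x, densities, type = "l", lty = 1, col = seq_along(k_values),
        xlab = "x", ylab = "density")
legend("topright", legend = colnames(densities), lty = 1,
       col = seq_along(k_values))
```

As k grows, the curve becomes less skewed and its bulk moves to the right.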
Assumptions for a chi-square test include:
● Random selection of data: Data must be randomly selected to minimize potential
biases
● Categorical data: The variables in question must be nominal or ordinal
● Mutually exclusive categories: The levels (or categories) of the variables are
mutually exclusive
● Single data contribution: Each subject can contribute data to one and only one
cell of the χ² table
● Independence of study groups: Observations must be independent
● Specific cell expected frequency: The values in the cells must be frequencies or
counts of cases rather than percentages
Goodness of fit test
The chi-square goodness of fit test is a non-parametric test that checks whether the
observed values of a given event differ significantly from the expected values. Here, we
have categorical data for one independent variable, and we check how similar or different
the data distribution is from the expected distribution.
In this method, we start by setting up our hypotheses and choosing a significance
level. We then calculate the χ² value using the given formula and compare it with the
critical χ² obtained from the chi-square table. If the calculated χ² is greater than the
critical χ², we reject the null hypothesis.
Consider the data below on petal sizes of different flowers, categorized as small, medium
and large. In this case, the independent variable is size, with the categories small,
medium and large.
The statistical question here is: Is there any statistically significant difference in the
proportion of flowers that are small, medium and large?
small medium large
59 71 20
Table 1: Observed-Frequency table of petal sizes
Setting up the hypothesis:
H0: The proportions of flowers that are small, medium and large are equal
H1: The proportions of flowers that are small, medium and large are NOT equal
Significance level: 0.05
Degrees of freedom = Number of categories - 1 = 3 - 1 = 2
Based on H0, each of the 3 categories is expected to hold an equal share of the 150
flowers (150 / 3 = 50):
small medium large
50 50 50
Table 2: Expected-Frequency table of petal sizes
Using the formula, calculate chi-square:
$\chi^2 = \frac{(59-50)^2}{50} + \frac{(71-50)^2}{50} + \frac{(20-50)^2}{50} = 28.44$
From the chi-square table, for 2 degrees of freedom and a significance level of
0.05, we obtain a critical χ² of 5.991.
As the obtained χ² (28.44) is greater than the critical value (5.991), we reject the null
hypothesis.
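The hand calculation above can be cross-checked in R. The sketch below uses the counts from Table 1 and takes the critical value from qchisq instead of a printed chi-square table; the variable names are illustrative.

```r
observed <- c(small = 59, medium = 71, large = 20)                    # Table 1
expected <- rep(sum(observed) / length(observed), length(observed))   # 50 each under H0

chi_sq   <- sum((observed - expected)^2 / expected)
df       <- length(observed) - 1
critical <- qchisq(1 - 0.05, df = df)   # upper 5% point of the chi-square distribution

chi_sq               # 28.44
round(critical, 3)   # 5.991
chi_sq > critical    # TRUE, so H0 is rejected
```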
R implementation
In R, the chisq.test function is used to perform a chi-squared test of independence or
goodness of fit. The specific usage depends on the type of data and analysis you are
conducting.
chisq.test(x, p)
x - a numeric vector or matrix
p - a vector of probabilities of the same length as x. An error is given if any entry of p
is negative.
If x is a vector, or a matrix with one row or column, then a goodness-of-fit test is
performed (x is treated as a one-dimensional contingency table). The entries of x must be
non-negative integers.
If x is a matrix with at least two rows and columns, it is taken as a two-dimensional
contingency table; again, the entries of x must be non-negative integers. Pearson's
chi-squared test is then performed on the null hypothesis that the joint distribution of
the cell counts is the product of the row and column marginals.
The output of chisq.test includes a chi-squared statistic, degrees of freedom, and a
p-value. The p-value is used to assess the significance of the test. If the p-value is less
than the chosen significance level (commonly 0.05), the null hypothesis is rejected.
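As a brief sketch of how these outputs can be read programmatically, the object returned by chisq.test is a list whose components can be accessed by name. The counts are those of Table 1; p is written out explicitly here although equal probabilities are already the default.

```r
counts <- c(small = 59, medium = 71, large = 20)   # Table 1

# Goodness of fit against equal proportions.
result <- chisq.test(counts, p = rep(1/3, 3))

result$statistic   # chi-squared statistic
result$parameter   # degrees of freedom
result$p.value     # p-value
result$expected    # expected counts under H0

result$p.value < 0.05   # TRUE means H0 is rejected at the 5% level
```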
For the above example, the required data is stored in the variable flowers:
chisq.test(table(flowers$size))
Output
> chisq.test(table(flowers$size))
Chi-squared test for given probabilities
data: table(flowers$size)
X-squared = 28.44, df = 2, p-value = 6.673e-07
Here, the p-value is less than 0.05, so we reject the null hypothesis.
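The reported p-value can also be recovered directly from the statistic, as a quick consistency check. The sketch below uses pchisq with the statistic (28.44) and degrees of freedom (2) from the output.

```r
# Upper-tail probability of a chi-square distribution with 2 degrees of freedom.
p_value <- pchisq(28.44, df = 2, lower.tail = FALSE)

signif(p_value, 4)   # 6.673e-07
p_value < 0.05       # TRUE
```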
Test of independence
The chi-square test of independence is used to examine if there is a significant
association between two categorical variables in a sample from a population. It
compares the observed frequencies in a contingency table with the expected frequencies
assuming independence between the variables.
In this method, we start by setting up our hypotheses and choosing a significance
level. We then calculate the contingency table of expected values (if the expected data
is not given) using the formula
$E = \frac{RT \times CT}{N}$
where
RT is the row total from the observed data
CT is the column total from the observed data
N is the total number of observations
We then calculate the χ² value using the given formula and compare it with the critical
χ² obtained from the chi-square table. If the calculated χ² is greater than the critical
χ², we reject the null hypothesis.
Consider the data on petal sizes of flowers of different species shown in the contingency
table below.
Species \ Size   small   medium   large   Total
setosa              47        3       0      50
versicolor          11       36       3      50
virginica            1       32      17      50
Total               59       71      20     150
Table 3: Observed contingency table (species and petal sizes)
The statistical question here is: Does knowing the value of one variable help predict the
value of the other variable?
Setting up the hypothesis:
H0: The variables are independent (i.e., no relationship between the variables)
H1: The variables are dependent
Significance level: 0.05
Degrees of freedom = (No. of rows - 1)(No. of columns - 1) = (3-1)(3-1) = 4
From the chi-square table, for 4 degrees of freedom and a significance level of
0.05, we obtain a critical χ² of 9.488.
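The same critical value can be obtained in R without a printed table. A minimal sketch, using the (rows - 1)(columns - 1) = 4 degrees of freedom of the 3 x 3 table:

```r
alpha <- 0.05
df <- (3 - 1) * (3 - 1)                  # (rows - 1) * (columns - 1) = 4
critical <- qchisq(1 - alpha, df = df)   # upper 5% point

round(critical, 3)   # 9.488
```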
Calculating the expected values using the above formula:
Expected     small   medium   large
setosa       19.67    23.67    6.67
versicolor   19.67    23.67    6.67
virginica    19.67    23.67    6.67
Table 4: Expected contingency table (species and petal sizes)
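The expected values in Table 4 can be reproduced from the observed counts with one line of R. In this sketch the observed table is typed in as a matrix, and outer() forms every product of a row total and a column total.

```r
observed <- matrix(
  c(47,  3,  0,
    11, 36,  3,
     1, 32, 17),
  nrow = 3, byrow = TRUE,
  dimnames = list(species = c("setosa", "versicolor", "virginica"),
                  size    = c("small", "medium", "large"))
)

# E = (row total * column total) / N for every cell
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)
```

Every row total is 50, which is why the three rows of expected values are identical.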
$\chi^2 = \frac{(47-19.67)^2}{19.67} + \frac{(3-23.67)^2}{23.67} + \dots + \frac{(17-6.67)^2}{6.67} = 111.63$
As the obtained χ² (111.63) is greater than the critical value (9.488), we reject the null
hypothesis.
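The sum over all nine cells can be checked in R. This is a sketch that repeats the observed counts and expected values so that it runs on its own:

```r
observed <- matrix(c(47, 3, 0, 11, 36, 3, 1, 32, 17), nrow = 3, byrow = TRUE)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

chi_sq <- sum((observed - expected)^2 / expected)
df     <- (nrow(observed) - 1) * (ncol(observed) - 1)

round(chi_sq, 2)                 # 111.63
chi_sq > qchisq(0.95, df = df)   # TRUE, so H0 is rejected
```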
R implementation
chisq.test(table(flowers))
Output
> chisq.test(table(flowers))
Pearson's Chi-squared test
data: table(flowers)
X-squared = 111.63, df = 4, p-value < 2.2e-16
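The report does not show how the flowers variable was created. As an assumption for illustration, an equivalent data frame with one row per flower and the columns species and size can be rebuilt from the counts in Table 3, after which the call above runs as written.

```r
counts <- matrix(
  c(47, 3, 0, 11, 36, 3, 1, 32, 17), nrow = 3, byrow = TRUE,
  dimnames = list(species = c("setosa", "versicolor", "virginica"),
                  size    = c("small", "medium", "large"))
)

# Expand the counts into one row per flower (hypothetical stand-in for `flowers`).
long <- as.data.frame(as.table(counts))   # columns: species, size, Freq
flowers <- long[rep(seq_len(nrow(long)), long$Freq), c("species", "size")]

result <- chisq.test(table(flowers))
result$statistic   # X-squared
result$parameter   # degrees of freedom
result$expected    # expected counts, as in Table 4
```

Passing the counts matrix itself, chisq.test(counts), gives the same result without rebuilding the raw data.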
