Technology
MSc. Data Science
Report on
Chi-Square Test
Submitted by
Deekshita S Iyer
23-60-02
Chi-Square Test
The chi-square test is a non-parametric statistical test, which means it does not make assumptions about the underlying distribution of the data. It compares the observed frequencies of the categories in a contingency table with the expected frequencies that would occur under the assumption of independence between the variables.
There are two main types of chi-square tests commonly used in statistics:
Chi-Square Goodness of Fit Test: It is used to check whether the observed data distribution is similar to the expected data distribution.
Chi-Square Test of Independence: It is used to check whether two categorical variables are related to or independent of each other.
Chi-square distribution
The shape of a chi-square distribution is determined by the degrees of freedom,
represented by the parameter k. The diagram below shows the chi-square distributions for various values of k.
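A minimal way to draw such a diagram in base R (a sketch; the particular values of k are illustrative, and dchisq() gives the chi-square density):

# Plot chi-square densities for several degrees of freedom k
ks <- c(1, 2, 3, 5, 9)                      # illustrative values of k
x  <- seq(0.01, 20, length.out = 500)
plot(NULL, xlim = c(0, 20), ylim = c(0, 0.5),
     xlab = "x", ylab = "density",
     main = "Chi-square distributions for various k")
for (i in seq_along(ks)) lines(x, dchisq(x, df = ks[i]), lty = i)
legend("topright", legend = paste("k =", ks), lty = seq_along(ks))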
Goodness of fit test
The chi-squared goodness of fit is a non-parametric test that finds how significantly the observed value of a given event differs from the expected value. Here, we have categorical data for one independent variable, and we check how similar or different the observed distribution is from the expected distribution.
In this method, we start off by setting up our hypotheses and choosing a significance level. We then calculate the χ2 value using the given formula and compare it with the critical χ2 obtained from the chi-square table. If the calculated χ2 is greater than the critical χ2, we reject the null hypothesis.
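The χ2 value referred to here is the standard Pearson statistic,
χ2 = Σ (Oi - Ei)² / Ei,
where Oi is the observed count and Ei the expected count of category i; for a goodness of fit test with k categories, the degrees of freedom are k - 1.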
Consider the below data of petal sizes of different flowers, categorized as small, medium and large. In this case, the independent variable is size, with the categories small, medium and large.
The statistical question here is: Is there any statistically significant difference in the proportions of flowers that are small, medium and large?
Observed counts:
small  medium  large
   59      71     20
Setting up the hypothesis:
H0: The proportions of flowers that are small, medium and large are equal
H1: The proportions of flowers that are small, medium and large are NOT equal
Since the 150 flowers would be split equally across the three categories under H0, the expected count for each category is 150 / 3 = 50.
Expected counts:
small  medium  large
   50      50     50
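As a worked check with the formula above:
χ2 = (59 - 50)²/50 + (71 - 50)²/50 + (20 - 50)²/50 = 1.62 + 8.82 + 18.00 = 28.44
With k - 1 = 2 degrees of freedom, the critical χ2 at the 0.05 significance level is 5.991. Since 28.44 > 5.991, we reject H0, which agrees with the R output further below.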
R implementation
In R, the chisq.test() function can be used to perform both the chi-square test of independence and the chi-square goodness of fit test. The specific usage depends on the type of data and analysis you are conducting.
chisq.test(x, p)
Here, x is a numeric vector or matrix of observed counts, and p is an optional vector of probabilities of the same length as x; an error is given if any entry of p is negative.
If 'x' is a matrix with one row or column, then a goodness-of-fit test is performed (x is treated as a one-dimensional contingency table, whose entries must be non-negative integers).
If 'x' is a matrix with at least two rows and columns, it is taken as a two-dimensional contingency table, whose entries must again be non-negative integers. Then Pearson's chi-squared test is performed on the null hypothesis that the joint distribution of the cell counts in a 2-dimensional contingency table is the product of the row and column marginals.
In both cases the function reports the χ2 statistic, the degrees of freedom and the p-value. The p-value is used to assess the significance of the test. If the p-value is less than the chosen significance level (commonly 0.05), the null hypothesis is rejected.
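The calls below assume a data frame named flowers with a size column (and a species column, used again in the test of independence). The report does not show how flowers was prepared; one way to rebuild it from the counts reported in the contingency table of the next section is sketched here:

# Assumed reconstruction of the 'flowers' data frame from the reported counts
counts <- matrix(c(47,  3,  0,
                   11, 36,  3,
                    1, 32, 17),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(species = c("setosa", "versicolor", "virginica"),
                                 size    = c("small", "medium", "large")))
tab <- as.data.frame(as.table(counts))                 # long format with a Freq column
flowers <- tab[rep(seq_len(nrow(tab)), tab$Freq), c("species", "size")]
table(flowers$size)                                    # small 59, medium 71, large 20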
chisq.test(table(flowers$size))
Output
> chisq.test(table(flowers$size))
data: table(flowers$size)
X-squared = 28.44, df = 2, p-value = 6.673e-07
Here, we observe that the p-value is less than 0.05; thus, we reject the null hypothesis and conclude that the proportions of small, medium and large flowers are not equal.
Test of independence
In this method, we start off by setting up our hypotheses and choosing a significance level. We then calculate the expected frequency for each cell of the contingency table (if we don't already have expected data) using the formula below:
E = (RT × CT) / N
Where,
RT - row total from observed data
CT - column total from observed data
N - total number of observations
We then calculate the χ2 value using the given formula and compare it with the critical χ2 obtained from the chi-square table. If the calculated χ2 is greater than the critical χ2, we reject the null hypothesis.
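This formula maps directly onto base R; in the sketch below, obs stands for any matrix of observed counts (the values are hypothetical, not the report's data):

# Expected counts under independence: E = (row total x column total) / N
obs <- matrix(c(20, 30, 25, 25), nrow = 2)             # hypothetical observed counts
E   <- outer(rowSums(obs), colSums(obs)) / sum(obs)    # matrix of expected counts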
Consider the data of petal sizes of flowers of different species shown in the contingency table below.
            small  medium  large
setosa         47       3      0
versicolor     11      36      3
virginica       1      32     17
Total          59      71     20
The statistical question here is: Does knowing the value of one variable (species) help predict the value of the other variable (petal size)?
Setting up the hypothesis:
H0: The variables are independent (i.e., no relation between the variables)
H1: The variables are NOT independent (i.e., there is a relation between the variables)
The expected count for each cell is computed with E = (RT × CT) / N; for example, for setosa and small, E = (50 × 59) / 150 = 19.67.
Expected      small  medium  large
setosa        19.67   23.67   6.67
versicolor    19.67   23.67   6.67
virginica     19.67   23.67   6.67
Since the obtained χ2 (111.63) is greater than the critical χ2 (9.488, for (3 - 1)(3 - 1) = 4 degrees of freedom at the 0.05 level), we reject our null hypothesis.
R implementation
chisq.test(table(flowers))
Output
> chisq.test(table(flowers))
data: table(flowers)
X-squared = 111.63, df = 4, p-value < 2.2e-16
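As a cross-check (a sketch, assuming the flowers data frame reconstructed earlier), the expected counts and the statistic can also be read off the object returned by chisq.test():

res <- chisq.test(table(flowers))
round(res$expected, 2)   # expected counts, E = (RT x CT) / N, as in the table above
res$statistic            # approximately 111.63, matching the manual calculation

Since the p-value is far below 0.05, we again reject the null hypothesis and conclude that petal size is not independent of species.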