Chi-Square Test


Defence Institute of Advanced Technology

School of Computer Engineering and Applied Mathematics

MSc. Data Science

Probability & Statistical Methods with R Lab

(Subject code: AMMSCD505)

Report on
Chi-Square Test

Submitted by

Deekshita S Iyer
23-60-02
Chi-Square Test

The chi-square test is a statistical test used to determine whether there is a significant association between two categorical variables. It is a non-parametric test, meaning it does not make assumptions about the underlying distribution of the data.

It compares the observed frequencies of the categories in a contingency table with the expected frequencies that would occur if the variables were independent.

The chi-square value is calculated as:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

Where,
$O_i$ - observed value in category i
$E_i$ - expected value in category i
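
As a minimal sketch of the formula above, the statistic can be computed directly from two vectors of counts. The helper name chi_sq_stat and the toy counts are assumptions made for illustration only; they do not come from the report's data.

```r
# Pearson chi-square statistic: sum over categories of (O - E)^2 / E.
# chi_sq_stat is a hypothetical helper name used only for illustration.
chi_sq_stat <- function(observed, expected) {
  stopifnot(length(observed) == length(expected), all(expected > 0))
  sum((observed - expected)^2 / expected)
}

# Made-up counts chosen so the arithmetic is easy to follow:
# (10-20)^2/20 + (20-20)^2/20 + (30-20)^2/20 = 5 + 0 + 5 = 10
toy_observed <- c(10, 20, 30)
toy_expected <- c(20, 20, 20)
print(chi_sq_stat(toy_observed, toy_expected))
```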

There are two main types of chi-square tests commonly used in statistics:

● Chi-Square Goodness of Fit Test: It is used to check whether the observed data distribution matches the expected data distribution.
● Chi-Square Test of Independence: It is used to verify whether there is a significant association between two categorical variables in a sample from a population.

Chi-square distribution

Chi-square (χ²) distributions are a family of continuous probability distributions that are widely used in hypothesis tests such as the chi-square goodness of fit test and the chi-square test of independence.

The shape of a chi-square distribution is determined by the degrees of freedom, represented by the parameter k. The diagram below shows the chi-square distributions for various values of k.

Fig 1: Chi-square distribution
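
A rough way to draw curves like those in Fig 1 is sketched below, assuming base R graphics. The particular values of k and the plotting range are arbitrary choices, not taken from the original figure.

```r
# Chi-square densities for several degrees of freedom k.
# The k values and the x range are arbitrary illustrative choices.
k_values <- c(1, 2, 3, 5, 9)
x <- seq(0.05, 20, length.out = 400)

# One column of density values per k, evaluated with dchisq().
densities <- sapply(k_values, function(k) dchisq(x, df = k))

# Draw all curves on one set of axes; the y-axis is capped so that the
# steep k = 1 curve near zero does not flatten the others.
matplot(x, densities, type = "l", lty = 1, col = seq_along(k_values),
        ylim = c(0, 0.5), xlab = "x", ylab = "density",
        main = "Chi-square densities for different k")
legend("topright", legend = paste("k =", k_values),
       col = seq_along(k_values), lty = 1)
```

For small k the density is concentrated near zero and strongly right-skewed; as k grows the curve shifts right and becomes more symmetric.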

Assumptions for a chi-square test include:

● Random selection of data: Data must be randomly selected to minimize potential biases
● Categorical data: The variables in question must be nominal or ordinal
● Mutually exclusive categories: The levels (or categories) of the variables are mutually exclusive
● Single data contribution: Each subject can contribute data to one and only one cell of the χ² table
● Independence of study groups: Observations must be independent
● Cell frequencies: The values in the cells must be frequencies or counts of cases, not percentages

Goodness of fit test

The chi-square goodness of fit test is a non-parametric test that checks whether the observed frequencies of an event differ significantly from the expected frequencies. Here, we have categorical data for one independent variable, and we check how similar or different the observed distribution is from the expected distribution.

In this method we start off by setting up our hypotheses and choosing a significance level. We then calculate the χ² value using the given formula and compare it with the critical χ² obtained from the chi-square table. If the calculated χ² is greater than the critical χ², we reject the null hypothesis.
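
Instead of reading the critical value from a printed table, it can be obtained from the chi-square quantile function. The sketch below assumes a significance level of 0.05; reject_null is a hypothetical helper name introduced only to spell out the decision rule.

```r
# Critical value: the (1 - alpha) quantile of the chi-square distribution.
alpha <- 0.05
critical_df2 <- qchisq(1 - alpha, df = 2)
critical_df4 <- qchisq(1 - alpha, df = 4)
print(round(c(df2 = critical_df2, df4 = critical_df4), 3))

# Decision rule from the text: reject H0 when the calculated statistic
# exceeds the critical value. reject_null is a hypothetical helper.
reject_null <- function(statistic, df, alpha = 0.05) {
  statistic > qchisq(1 - alpha, df = df)
}
```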

Consider the data below on petal sizes of different flowers, categorized as small, medium and large. In this case, the independent variable is size, with the categories small, medium and large.

The statistical question here is: Is there a statistically significant difference in the proportions of flowers that are small, medium and large?

small    medium    large
59       71        20

Table 1: Observed frequency table of petal sizes

Setting up the hypothesis:

H0: The proportions of flowers that are small, medium and large are equal

H1: The proportions of flowers that are small, medium and large are NOT equal

Significance level: 0.05

Degrees of freedom = Number of categories - 1 = 3 - 1 = 2

Under H0, the 150 observed flowers are spread evenly over the three categories, so each expected value is 150 / 3 = 50:

small    medium    large
50       50        50

Table 2: Expected frequency table of petal sizes

Using the formula, calculate chi-square:

$$\chi^2 = \frac{(59-50)^2}{50} + \frac{(71-50)^2}{50} + \frac{(20-50)^2}{50} = 28.44$$

From the chi-square table, for 2 degrees of freedom and a significance level of 0.05, we obtain a critical χ² of 5.991.

As the obtained χ² (28.44) is greater than the critical value (5.991), we reject the null hypothesis.
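
The hand calculation above can be retraced step by step in R. This is a sketch that uses only the counts from Table 1; the variable names are illustrative.

```r
# Observed petal-size counts from Table 1.
observed <- c(small = 59, medium = 71, large = 20)

# Under H0 all three categories are equally likely, so every expected
# count is the total divided by the number of categories.
expected <- rep(sum(observed) / length(observed), length(observed))

# Chi-square statistic, degrees of freedom, critical value and p-value.
statistic <- sum((observed - expected)^2 / expected)
deg_free  <- length(observed) - 1
critical  <- qchisq(0.95, df = deg_free)
p_value   <- pchisq(statistic, df = deg_free, lower.tail = FALSE)

print(c(statistic = statistic, critical = critical, p_value = p_value))
print(statistic > critical)  # TRUE means H0 is rejected at the 0.05 level
```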

R implementation

In R, the chisq.test function is used to perform a chi-squared test of independence or goodness of fit. The specific usage depends on the type of data and the analysis you are conducting.

chisq.test(x, p)

x - a numeric vector or matrix

p - a vector of probabilities of the same length as x. An error is given if any entry of p is negative.

If x is a vector, or a matrix with one row or column, then a goodness-of-fit test is performed (x is treated as a one-dimensional contingency table). The entries of x must be non-negative integers.

If x is a matrix with at least two rows and columns, it is taken as a two-dimensional contingency table; again, the entries of x must be non-negative integers. Pearson's chi-squared test is then performed on the null hypothesis that the joint distribution of the cell counts in the two-dimensional contingency table is the product of the row and column marginals.
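
The way the shape of x selects the test can be checked through the method field of the returned object. The counts below are made up purely for illustration and are not part of the report's data.

```r
# A plain vector of counts: goodness-of-fit test.
gof_vector <- chisq.test(c(20, 30, 25))

# A matrix with a single row is treated the same way.
gof_one_row <- chisq.test(matrix(c(20, 30, 25), nrow = 1))

# A matrix with at least two rows and two columns: test of independence.
two_way <- matrix(c(20, 30,
                    25, 25,
                    30, 20), nrow = 2)  # filled column by column: 2 rows, 3 columns
indep <- chisq.test(two_way)

print(gof_vector$method)
print(gof_one_row$method)
print(indep$method)
```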

The output of chisq.test includes a chi-squared statistic, degrees of freedom, and a p-value. The p-value is used to assess the significance of the test. If the p-value is less than the chosen significance level (commonly 0.05), the null hypothesis is rejected.
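
The pieces of the output can also be pulled out of the returned object by name, which is convenient when the decision has to be made in code. The sketch below passes the Table 1 counts directly as a named vector; alpha and the variable names are illustrative choices.

```r
# Goodness-of-fit test on the petal-size counts from Table 1.
result <- chisq.test(c(small = 59, medium = 71, large = 20))

chi_sq   <- unname(result$statistic)  # chi-squared statistic
deg_free <- unname(result$parameter)  # degrees of freedom
p_value  <- result$p.value            # p-value

# Compare the p-value with the chosen significance level.
alpha <- 0.05
decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"
print(c(chi_sq = chi_sq, df = deg_free, p_value = p_value))
print(decision)
```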

For the above example, the required data is stored in the variable flowers, whose size column holds the petal-size category of each flower:

chisq.test(table(flowers$size))

Output

> chisq.test(table(flowers$size))

Chi-squared test for given probabilities

data: table(flowers$size)
X-squared = 28.44, df = 2, p-value = 6.673e-07

Here, we observe that the p-value is less than 0.05, thus we reject the null hypothesis.

Test of independence

The chi-square test of independence is used to examine whether there is a significant association between two categorical variables in a sample from a population. It compares the observed frequencies in a contingency table with the expected frequencies assuming independence between the variables.

In this method we start off by setting up our hypotheses and choosing a significance level. We then calculate the contingency table of expected values (if we don't have expected data) using the formula below:

$$E = \frac{\mathrm{RT} \times \mathrm{CT}}{N}$$

Where,
RT - row total from observed data
CT - column total from observed data
N - total number of observations

We then calculate the χ² value using the given formula and compare it with the critical χ² obtained from the chi-square table. If the calculated χ² is greater than the critical χ², we reject the null hypothesis.

Consider the data on petal sizes of flowers from different species, shown in the contingency table below:

Species \ Size    small    medium    large    Total
setosa            47       3         0        50
versicolor        11       36        3        50
virginica         1        32        17       50
Total             59       71        20       150

Table 3: Observed contingency table (species and petal sizes)

The statistical question here is: Does knowing the value of one variable help predict the value of the other variable?

Setting up the hypothesis:

H0: The variables are independent (i.e., there is no relation between the variables)

H1: The variables are dependent

Significance level: 0.05

Degrees of freedom = (No. of rows - 1)(No. of columns - 1) = (3-1)(3-1) = 4

From the chi-square table, for 4 degrees of freedom and a significance level of 0.05, we obtain a critical χ² of 9.488.

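Calculating the expected values using the above formula (each species has a row total of 50 and N = 150, so, for example, the expected count for small setosa flowers is 50 × 59 / 150 = 19.67):

Expected      small    medium    large
setosa        19.67    23.67     6.67
versicolor    19.67    23.67     6.67
virginica     19.67    23.67     6.67

Table 4: Expected contingency table (species and petal sizes)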
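
The expected table can be reproduced with an outer product of the row and column totals. This is a sketch that re-enters the observed counts from Table 3 as a matrix; the object names are illustrative.

```r
# Observed counts from Table 3 (rows: species, columns: petal size).
observed <- matrix(c(47,  3,  0,
                     11, 36,  3,
                      1, 32, 17),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(species = c("setosa", "versicolor", "virginica"),
                                   size    = c("small", "medium", "large")))

# E = (row total * column total) / N for every cell at once.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected, 2))
```

Because every species contributes the same row total, the three rows of the expected table are identical.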

$$\chi^2 = \frac{(47-19.67)^2}{19.67} + \frac{(3-23.67)^2}{23.67} + \dots + \frac{(17-6.67)^2}{6.67} = 111.63$$

As the obtained χ² (111.63) is greater than the critical value (9.488), we reject the null hypothesis.
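
The full nine-term sum, which is abbreviated in the equation above, can be evaluated cell by cell as follows. The sketch assumes the observed counts from Table 3 and uses unrounded expected values, so intermediate terms may differ slightly from a hand calculation with two-decimal rounding.

```r
# Observed counts from Table 3 (rows: species, columns: petal size).
observed <- matrix(c(47,  3,  0,
                     11, 36,  3,
                      1, 32, 17),
                   nrow = 3, byrow = TRUE)

# Expected counts under independence.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Sum (O - E)^2 / E over all nine cells.
statistic <- sum((observed - expected)^2 / expected)
deg_free  <- (nrow(observed) - 1) * (ncol(observed) - 1)
critical  <- qchisq(0.95, df = deg_free)

print(round(c(statistic = statistic, df = deg_free, critical = critical), 3))
print(statistic > critical)  # TRUE means H0 (independence) is rejected
```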

R implementation

chisq.test(table(flowers))

Output

> chisq.test(table(flowers))

Pearson's Chi-squared test

data: table(flowers)
X-squared = 111.63, df = 4, p-value < 2.2e-16
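
The report does not show how flowers was built. As an assumption for illustration, the sketch below reconstructs a two-column data frame (species and size, one row per flower) from the counts in Table 3, so that the same chisq.test(table(flowers)) call can be run end to end.

```r
# Counts from Table 3 (rows: species, columns: petal size).
counts <- matrix(c(47,  3,  0,
                   11, 36,  3,
                    1, 32, 17),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(species = c("setosa", "versicolor", "virginica"),
                                 size    = c("small", "medium", "large")))

# Hypothetical reconstruction: expand the counts into one row per flower.
long <- as.data.frame(as.table(counts))  # columns: species, size, Freq
flowers <- long[rep(seq_len(nrow(long)), times = long$Freq), c("species", "size")]
rownames(flowers) <- NULL

# table() cross-tabulates the two columns; chisq.test() then runs the
# test of independence on the resulting 3 x 3 table.
result <- chisq.test(table(flowers))
print(result)
```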
