Chi Square
10/15/08 6:52 PM
The chi-square test is used in two situations:

a. for estimating how closely an observed distribution matches an expected distribution - we'll refer to this as the goodness-of-fit test
b. for estimating whether two random variables are independent.

The Goodness-of-Fit Test

One of the more interesting goodness-of-fit applications of the chi-square test is to examine issues of fairness and cheating in games of chance, such as cards, dice, and roulette. Since such games usually involve wagering, there is significant incentive for people to try to rig the games, and allegations of missing cards, "loaded" dice, and "sticky" roulette wheels are all too common.

So how can the goodness-of-fit test be used to examine cheating in gambling? It is easier to describe the process through an example. Take the example of dice. Most dice used in wagering have six sides, with each side having a value of one, two, three, four, five, or six. If the die being used is fair, then the chance of any particular number coming up is the same: 1 in 6. However, if the die is loaded, then certain numbers will have a greater likelihood of appearing, while others will have a lower likelihood.

One night at the Tunisian Nights Casino, renowned gambler Jeremy Turner (a.k.a. The Missouri Master) is having a fantastic night at the craps table. In two hours of playing, he's racked up $30,000 in winnings and is showing no sign of stopping. Crowds are gathering around him to watch his streak - and The Missouri Master is telling anyone within earshot that his good luck is due to the fact that he's using the casino's lucky pair of "bruiser dice," so named because one is black and the other blue.
http://ccnmtl.columbia.edu/projects/qmss/chi_test.html
Unbeknownst to Turner, however, a casino statistician has been quietly watching his rolls and marking down the values of each roll, noting the values of the black and blue dice separately. After 60 rolls, the statistician has become convinced that the blue die is loaded.
Value on Blue Die   Observed Frequency   Expected Frequency
1                   16                   10
2                   5                    10
3                   9                    10
4                   7                    10
5                   6                    10
6                   17                   10
Total               60                   60
At first glance, this table would appear to be strong evidence that the blue die was, indeed, loaded. There are more 1's and 6's than expected, and fewer of the other numbers. However, it's possible that such differences occurred by chance. The chi-square statistic can be used to estimate the likelihood that the values observed on the blue die occurred by chance.

The key idea of the chi-square test is a comparison of observed and expected values. How many of something were expected and how many were observed in some process? In this case, we would expect 10 of each number to have appeared, and the counts actually observed are listed in the Observed Frequency column. With these sets of figures, we calculate the chi-square statistic as follows:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed frequency and E is the expected frequency for each value, and the sum runs over all six values of the die.
Using this formula with the values in the table above gives us a value of 13.6. Lastly, to determine the significance level we need to know the "degrees of freedom." In the case of the chi-square goodness-of-fit test, the number of degrees of freedom is equal to the number of terms used in calculating chi-square minus one. There were six terms in the chi-square for this problem - therefore, the number of degrees of freedom is five.

We then compare the value calculated in the formula above to a standard set of tables. The value returned from the table is 1.8%. We interpret this as meaning that if the die was fair (or not loaded), then the chance of getting a χ² statistic as large or larger than the one calculated above is only 1.8%. In other words, there's only a very slim chance that these rolls came from a fair die. The Missouri Master is in serious trouble.
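The dice calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `chi_square` is an assumption made for this sketch, and the critical value 11.070 (5 degrees of freedom, alpha = 0.05) is taken from a standard chi-square table rather than computed.

```python
# Observed and expected counts for the blue die (values 1 through 6),
# taken from the table in the text.
observed = [16, 5, 9, 7, 6, 17]
expected = [10, 10, 10, 10, 10, 10]


def chi_square(observed, expected):
    """Return the sum over all categories of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


statistic = chi_square(observed, expected)

# Goodness-of-fit: degrees of freedom = number of terms minus one.
degrees_of_freedom = len(observed) - 1

# Tabulated critical value for df = 5 at alpha = 0.05 (an input, not computed).
CRITICAL_VALUE_DF5_ALPHA_05 = 11.070

print(f"chi-square = {statistic:.1f}, df = {degrees_of_freedom}")
print("exceeds the 5% critical value:", statistic > CRITICAL_VALUE_DF5_ALPHA_05)
```

Because the statistic exceeds the tabulated 5% critical value, the sketch reaches the same conclusion as the text: the rolls are unlikely to have come from a fair die.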
Recap

To recap the steps used in calculating a goodness-of-fit test with chi-square:

1. Establish hypotheses.
2. Calculate chi-square statistic. Doing so requires knowing:
   a. The number of observations
   b. Expected values
   c. Observed values
3. Assess significance level. Doing so requires knowing the number of degrees of freedom.
4. Finally, decide whether to reject or fail to reject the null hypothesis.

Testing Independence

The other primary use of the chi-square test is to examine whether two variables are independent or not. What does it mean to be independent, in this sense? It means that the two factors are not related. Typically in social science research, we're interested in finding factors that are related - education and income, occupation and prestige, age and voting behavior. In this case, the chi-square can be used to assess whether two variables are independent or not.

More generally, we say that variable Y is "not correlated with" or "independent of" the variable X if more of one is not associated with more of another. If two categorical variables are correlated their values tend to move together, either in the same direction or in the opposite.

Example

Return to the example discussed at the introduction to chi-square, in which we want to know whether boys or girls get into trouble more often in school. Below is the table documenting the number of boys and girls who got into trouble in school:
         Got in Trouble   No Trouble   Total
Boys     46               71           117
Girls    37               83           120
Total    83               154          237
To examine statistically whether boys got in trouble in school more often, we need to frame the question in terms of hypotheses.

1. Establish Hypotheses

As in the goodness-of-fit chi-square test, the first step of the chi-square test for independence is to establish hypotheses. The null hypothesis is that the two variables are independent - or, in this particular case, that the likelihood of getting in trouble is the same for boys and girls. The alternative hypothesis to be tested is that the likelihood of getting in trouble is not the same for boys and girls.
Cautionary Note It is important to keep in mind that the chi-square test only tests whether two variables are independent. It cannot address questions of which is greater or less. Using the chi-square test, we cannot evaluate directly the hypothesis that boys get in trouble more than girls; rather, the test (strictly speaking) can only test whether the two variables are independent or not.
2. Calculate the expected value for each cell of the table

As with the goodness-of-fit example described earlier, the key idea of the chi-square test for independence is a comparison of observed and expected values. How many of something were expected and how many were observed in some process? In the case of tabular data, however, we usually do not know what the distribution should look like (as we did with rolls of dice). Rather, in this use of the chi-square test, expected values are calculated based on the row and column totals from the table. The expected value for each cell of the table can be calculated using the following formula:

$$E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$$
For example, in the table comparing the number of boys and girls in trouble, the expected count for the number of boys who got in trouble is:

$$E = \frac{117 \times 83}{237} \approx 40.97$$
The first step, then, in calculating the chi-square statistic in a test for independence is generating the expected value for each cell of the table. Presented in the table below are the expected values (in parentheses and italics) for each cell:
         Got in Trouble   No Trouble    Total
Boys     46 (40.97)       71 (76.03)    117
Girls    37 (42.03)       83 (77.97)    120
Total    83               154           237
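3. Calculate Chi-square statistic

With these sets of figures, we calculate the chi-square statistic as follows:

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(46 - 40.97)^2}{40.97} + \frac{(71 - 76.03)^2}{76.03} + \frac{(37 - 42.03)^2}{42.03} + \frac{(83 - 77.97)^2}{77.97} \approx 1.87$$

(The value 1.87 is obtained with unrounded expected values.)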
4. Assess significance level

Lastly, to determine the significance level we need to know the "degrees of freedom." In the case of the chi-square test of independence, the number of degrees of freedom is equal to the number of columns in the table minus one multiplied by the number of rows in the table minus one. In this table, there were two rows and two columns. Therefore, the number of degrees of freedom is:

$$df = (2 - 1)(2 - 1) = 1$$
We then compare the value calculated in the formula above to a standard set of tables. The value returned from the table is p < 20%. Because this is well above the conventional 5% significance level, we cannot reject the null hypothesis, and we conclude that boys are not significantly more likely to get in trouble in school than girls.

Recap

To recap the steps used in calculating a test of independence with chi-square:

1. Establish hypotheses.
2. Calculate expected values for each cell of the table.
3. Calculate chi-square statistic. Doing so requires knowing:
   a. The number of observations
   b. Observed values
4. Assess significance level. Doing so requires knowing the number of degrees of freedom.
5. Finally, decide whether to reject or fail to reject the null hypothesis.
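The steps in this recap can be sketched for the boys/girls table as follows. The helper names `expected_counts` and `chi_square` are assumptions made for this illustration. The p-value line relies on a closed form that holds only for 1 degree of freedom, so it is a shortcut for this 2 x 2 case and not a general method.

```python
import math

# Observed counts: rows are boys and girls,
# columns are "got in trouble" and "no trouble".
observed = [[46, 71],
            [37, 83]]


def expected_counts(table):
    """Expected count per cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(table):
    """Sum of (O - E)^2 / E over every cell of the table."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))


expected = expected_counts(observed)
statistic = chi_square(observed)

# Degrees of freedom = (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 1 only, the upper-tail probability equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(statistic / 2))

print(f"expected boys in trouble = {expected[0][0]:.2f}")
print(f"chi-square = {statistic:.2f}, df = {df}, p = {p_value:.3f}")
```

The resulting p-value falls between 10% and 20%, which agrees with the "p < 20%" read from the table in the text.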
10/15/08 6:46 PM
What is your sex?
Discrete - How many cars do you own? Two or three
Continuous - How tall are you? 72 inches
Notice that discrete data arise from a counting process, while continuous data arise from a measuring process. The Chi Square statistic compares the tallies or counts of categorical responses between two (or more) independent groups. (Note: Chi square tests can only be used on actual numbers and not on percentages, proportions, means, etc.)

2 x 2 Contingency Table

There are several types of chi square tests depending on the way the data was collected and the hypothesis being tested. We'll begin with the simplest case: a 2 x 2 contingency table. If we set the 2 x 2 table to the general notation shown below in Table 1, using the letters a, b, c, and d to denote the contents of the cells, then we would have the following table:

Table 1. General notation for a 2 x 2 contingency table.

                          Variable 1
Variable 2      Data type 1   Data type 2   Totals
Category 1      a             b             a+b
Category 2      c             d             c+d
Total           a+c           b+d           a+b+c+d = N
For a 2 x 2 contingency table the Chi Square statistic is calculated by the formula:

$$\chi^2 = \frac{N(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}$$
http://math.hws.edu/javamath/ryan/ChiSquare.html
Note: notice that the four components of the denominator are the four totals from the table columns and rows.

Suppose you conducted a drug trial on a group of animals and you hypothesized that the animals receiving the drug would survive better than those that did not receive the drug. You conduct the study and collect the following data:

Ho: The survival of the animals is independent of drug treatment.
Ha: The survival of the animals is associated with drug treatment.
Table 2. Number of animals that survived a treatment.

          Dead   Alive   Total
          36     14      50
          30     25      55
Total     66     39      105
Chi square = 105[(36)(25) - (14)(30)]^2 / [(50)(55)(39)(66)] = 3.418

Before we can proceed we need to know how many degrees of freedom we have. When a comparison is made between one sample and another, a simple rule is that the degrees of freedom equal (number of columns minus one) x (number of rows minus one), not counting the totals for rows or columns. For our data this gives (2-1) x (2-1) = 1.

We now have our chi square statistic (χ² = 3.418), our predetermined alpha level of significance (0.05), and our degrees of freedom (df = 1). Entering the Chi square distribution table with 1 degree of freedom and reading along the row we find our value of χ² (3.418) lies between 2.706 and 3.841. The corresponding probability is 0.05 < P < 0.10. This is above the conventionally accepted significance level of 0.05 or 5%, so the null hypothesis that the two distributions are the same cannot be rejected. In other words, only when the computed χ² statistic exceeds the critical value in the table for a 0.05 probability level can we reject the null hypothesis of equal distributions. Since our χ² statistic (3.418) did not exceed the critical value for the 0.05 probability level (3.841), we fail to reject the null hypothesis that the survival of the animals is independent of drug treatment (i.e. the data do not show an effect of the drug on survival).

Table 3. Chi Square distribution table.
probability level (alpha)
Df    0.5     0.10    0.05     0.02     0.01     0.001
1     0.455   2.706   3.841    5.412    6.635    10.827
2     1.386   4.605   5.991    7.824    9.210    13.815
3     2.366   6.251   7.815    9.837    11.345   16.268
4     3.357   7.779   9.488    11.668   13.277   18.465
5     4.351   9.236   11.070   13.388   15.086   20.517
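The 2 x 2 shortcut formula applied to the drug trial data can be sketched as below. The function name `chi_square_2x2` and its argument order (a, b across the first row, c, d across the second) are assumptions chosen to match the general notation in Table 1. The critical value 3.841 is the tabulated entry for 1 degree of freedom at alpha = 0.05.

```python
def chi_square_2x2(a, b, c, d):
    """Shortcut chi-square for a 2 x 2 table laid out as
           a  b
           c  d
    using N(ad - bc)^2 / [(a+b)(c+d)(a+c)(b+d)].
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column total must be non-zero")
    return numerator / denominator


# Drug trial counts from Table 2: first row 36 dead, 14 alive;
# second row 30 dead, 25 alive.
statistic = chi_square_2x2(36, 14, 30, 25)

# Tabulated critical value for df = 1 at alpha = 0.05.
CRITICAL_VALUE_DF1_ALPHA_05 = 3.841
reject_null = statistic > CRITICAL_VALUE_DF1_ALPHA_05

print(f"chi-square = {statistic:.3f}")
print("reject the null hypothesis at 5%:", reject_null)
```

Since the statistic stays below the critical value, the sketch does not reject the null hypothesis of independence between survival and treatment.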
Table 4. Results of a monohybrid cross between two heterozygotes for the 'a' gene.

          A     a     Totals
A         10    42    52
a         33    15    48
Totals    43    57    100
The phenotypic ratio is 85 of the A-type and 15 of the a-type (homozygous recessive). In a monohybrid cross between two heterozygotes, however, we would have predicted a 3:1 ratio of phenotypes. In other words, we would have expected to get 75 A-type and 25 a-type. Are our results different?
Calculate the chi square statistic χ² by completing the following steps:

1. For each observed number in the table subtract the corresponding expected number (O - E).
2. Square the difference [ (O - E)^2 ].
3. Divide the squares obtained for each cell in the table by the expected number for that cell [ (O - E)^2 / E ].
4. Sum all the values for (O - E)^2 / E. This is the chi square statistic.

For our example, the calculation would be:
          Observed   Expected   (O - E)   (O - E)^2   (O - E)^2 / E
A-type    85         75         10        100         1.33
a-type    15         25         -10       100         4.00
Total     100        100                              5.33
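The four steps listed above can be followed literally in code, one row of the worked table at a time. This is a minimal sketch for the monohybrid cross data. The variable names are illustrative only.

```python
observed = [85, 15]   # A-type and a-type offspring
expected = [75, 25]   # 3:1 ratio applied to 100 offspring

rows = []
for o, e in zip(observed, expected):
    difference = o - e              # step 1: O - E
    squared = difference ** 2       # step 2: (O - E)^2
    contribution = squared / e      # step 3: (O - E)^2 / E
    rows.append((o, e, difference, squared, contribution))

# Step 4: sum the last column to get the chi-square statistic.
statistic = sum(row[4] for row in rows)

print("Observed  Expected  O-E  (O-E)^2  (O-E)^2/E")
for o, e, difference, squared, contribution in rows:
    print(f"{o:8d}  {e:8d}  {difference:3d}  {squared:7d}  {contribution:9.2f}")
print(f"chi-square = {statistic:.2f}")
```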
χ² = 5.33

We now have our chi square statistic (χ² = 5.33), our predetermined alpha level of significance (0.05), and our degrees of freedom (df = 1). Entering the Chi square distribution table with 1 degree of freedom and reading along the row we find our value of χ² (5.33) lies between 3.841 and 5.412. The corresponding probability is 0.02 < P < 0.05. This is smaller than the conventionally accepted significance level of 0.05 or 5%, so the null hypothesis that the two distributions are the same is rejected. In other words, when the computed χ² statistic exceeds the critical value in the table for a 0.05 probability level, then we can reject the null hypothesis of equal distributions. Since our χ² statistic (5.33) exceeded the critical value for the 0.05 probability level (3.841), we can reject the null hypothesis that the observed values of our cross are the same as the theoretical distribution of a 3:1 ratio.

Table 3. Chi Square distribution table.
probability level (alpha)
Df    0.5     0.10    0.05     0.02     0.01     0.001
1     0.455   2.706   3.841    5.412    6.635    10.827
2     1.386   4.605   5.991    7.824    9.210    13.815
3     2.366   6.251   7.815    9.837    11.345   16.268
4     3.357   7.779   9.488    11.668   13.277   18.465
5     4.351   9.236   11.070   13.388   15.086   20.517
To put this into context, it means that we do not have a 3:1 ratio of A_ to aa offspring.
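Reading a probability range off the distribution table, as done in both worked examples, amounts to finding the two neighbouring critical values that enclose the statistic. The sketch below does this for the df = 1 row of Table 3. The function name `bracket_p` and the list layout are assumptions made for this illustration.

```python
# Row for df = 1 of the chi-square distribution table,
# as (alpha, critical value) pairs ordered by decreasing alpha.
DF1_CRITICAL = [
    (0.5, 0.455),
    (0.10, 2.706),
    (0.05, 3.841),
    (0.02, 5.412),
    (0.01, 6.635),
    (0.001, 10.827),
]


def bracket_p(statistic, critical_values):
    """Return (lower, upper) bounds on P read from one table row.

    A bound is None when the table cannot supply it, for example
    when the statistic is beyond the last tabulated critical value.
    """
    lower, upper = None, None
    for alpha, critical in critical_values:
        if statistic >= critical:
            upper = alpha       # statistic reached this column, so P <= alpha
        else:
            lower = alpha       # first column not reached, so P > alpha
            break
    return lower, upper


print(bracket_p(5.33, DF1_CRITICAL))    # monohybrid cross example
print(bracket_p(3.418, DF1_CRITICAL))   # drug trial example
```

A table lookup only brackets the probability. An exact p-value would require the chi-square distribution function itself.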
a        b        c        a+b+c
d        e        f        d+e+f
g        h        i        g+h+i
a+d+g    b+e+h    c+f+i    a+b+c+d+e+f+g+h+i = N
Now we need to calculate the expected values for each cell in the table, and we can do that by multiplying the row total by the column total and dividing by the grand total (N). For example, for cell a the expected value would be (a+b+c)(a+d+g)/N. Once the expected values have been calculated for each cell, we can use the same procedure as before for a simple 2 x 2 table, filling in a table with these columns:

Observed   Expected   |O - E|   (O - E)^2   (O - E)^2 / E
Suppose you have the following categorical data set.

Table . Incidence of three types of malaria in three tropical regions.

             Asia   Africa   South America   Totals
Malaria A    31     14       45              90
Malaria B    2      5        53              60
Malaria C    53     45       2               100
Totals       86     64       100             250
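We could now set up the following table:

Observed   Expected   |O - E|   (O - E)^2   (O - E)^2 / E
31         30.96      0.04      0.00        0.00
14         23.04      9.04      81.72       3.55
45         36.00      9.00      81.00       2.25
2          20.64      18.64     347.45      16.83
5          15.36      10.36     107.33      6.99
53         24.00      29.00     841.00      35.04
53         34.40      18.60     345.96      10.06
45         25.60      19.40     376.36      14.70
2          40.00      38.00     1444.00     36.10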
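The worked table for the malaria data can be reproduced with a short sketch that derives each expected value from the row and column totals. The helper name `expected_counts` is an assumption for this illustration, and 9.488 is the tabulated critical value for 4 degrees of freedom at alpha = 0.05. Without intermediate rounding the sum comes out at about 125.52, very close to the 125.516 quoted in the text; the small gap presumably reflects rounding in the original hand calculation.

```python
# Rows: Malaria A, B, C. Columns: Asia, Africa, South America.
observed = [[31, 14, 45],
            [2, 5, 53],
            [53, 45, 2]]


def expected_counts(table):
    """Expected count per cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_counts(observed)

statistic = 0.0
print("Observed  Expected  |O-E|   (O-E)^2  (O-E)^2/E")
for obs_row, exp_row in zip(observed, expected):
    for o, e in zip(obs_row, exp_row):
        squared = (o - e) ** 2
        statistic += squared / e
        print(f"{o:8d}  {e:8.2f}  {abs(o - e):5.2f}  {squared:8.2f}  {squared / e:9.2f}")

# Degrees of freedom = (columns - 1) * (rows - 1).
df = (len(observed[0]) - 1) * (len(observed) - 1)

# Tabulated critical value for df = 4 at alpha = 0.05.
CRITICAL_VALUE_DF4_ALPHA_05 = 9.488

print(f"chi-square = {statistic:.2f}, df = {df}")
print("reject the null hypothesis at 5%:", statistic > CRITICAL_VALUE_DF4_ALPHA_05)
```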
Degrees of Freedom = (c - 1)(r - 1) = 2(2) = 4

Table 3. Chi Square distribution table.
probability level (alpha)

Df    0.5     0.10    0.05     0.02     0.01     0.001
1     0.455   2.706   3.841    5.412    6.635    10.827
2     1.386   4.605   5.991    7.824    9.210    13.815
3     2.366   6.251   7.815    9.837    11.345   16.268
4     3.357   7.779   9.488    11.668   13.277   18.465
5     4.351   9.236   11.070   13.388   15.086   20.517
Reject Ho because 125.516 is greater than 9.488 (for alpha = 0.05). Thus, we would reject the null hypothesis that there is no relationship between location and type of malaria. Our data tell us there is a relationship between type of malaria and location, but that's all they tell us.

This page was created as part of the Mathbeans Project. The java applets were created by David Eck and modified by Jim Ryan. The Mathbeans Project is funded by a grant from the National Science Foundation DUE-9950473.