Chi Square Goodness-of-Fit Test

The document discusses how to perform a chi-square goodness-of-fit test to analyze collected data and test hypotheses. It provides examples of using chi-square tests to analyze coin toss and gerbil breeding experiments to determine if the results differ significantly from what is expected. Students are instructed to perform chi-square calculations and analyses on provided data to evaluate hypotheses.
Introduction: Statistical analysis is one of the cornerstones of modern science. For instance, Mendel's great insights about the behavior of inherited factors were founded upon his understanding of mathematics and the laws of probability. Today, we still apply those mathematical principles to the analysis of genetic information, as well as to virtually any other kind of numerical data that might be collected. In this lab investigation, we will be examining one of these applications, the Chi Square (χ²) Goodness-of-Fit Test. Collected data rarely conform exactly to prediction, so it is important to determine whether the deviation between the expected values (based upon the hypothesis) and the actual results is significant enough to discredit the original hypothesis. This need has led to the development of a variety of statistical devices (such as the chi square test) designed to challenge the collected data. We will be examining this procedure using several simple examples of hypotheses and data collection.
Remember that the purpose of this test is to determine whether the actual results are different enough from the predicted results to suggest that the hypothesis is not correct.

A Note about probability: Probabilities are predictions. We make predictions of this kind all the time. For example, "There's a fifty percent chance that baby will be a boy," is a probability statement, based on the hypothesis that half of human births produce boys and half produce girls (which is in turn based upon understanding about X and Y chromosomes, and about sperm and eggs). In formal mathematical language, probabilities are expressed as decimals between zero (no chance) and one (certainty). So the prediction above would be expressed as "the probability for a boy is 0.5." Expressed as a mathematical "sentence," it would be P(boy) = 0.5.

Exercise 1: Work in groups of four.

1. Each partner should collect a penny from the instructor.
2. The class will discuss the hypothesis which will be tested, and the expected results if that hypothesis is true. Record these numbers on your data sheet in the indicated spaces.
3. Each partner should toss his/her penny 100 times. Record the number of heads and the number of tails. You need not keep track of the order, simply how many of each. Combine the sets of figures for your group, then combine with those generated by the other students in the class. The total numbers of heads and tails for the class represent our observed results.
4. Carefully follow the calculation of the chi square value for this experiment as demonstrated by the instructor, and the use to which that value is put. (There is also an example included at the end of this handout.)

The purpose of the chi square test is to answer the following question:

Your observed results were almost certainly not precisely identical to your expected results. Are they different enough to merit questioning our hypothesis, that the pennies are fairly balanced?

This is a very important question, and is not always easy to answer. In a scientific report, the comparison of collected data and expected results must always be made through the use of a statistical challenge of significance (i.e., a statistical test to determine whether there is a significant difference between predicted results and actual results).
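The bookkeeping in Exercise 1 can be rehearsed with simulated pennies. The sketch below is only an illustration, not part of the lab procedure: the function name toss_counts and the fixed seed are our own assumptions, and a real penny may of course behave differently from a simulated fair coin.

```python
import random

def toss_counts(n_tosses, rng):
    """Return (heads, tails) for n_tosses flips of a simulated fair coin."""
    heads = sum(1 for _ in range(n_tosses) if rng.random() < 0.5)
    return heads, n_tosses - heads

# Fixed seed so the run is repeatable (an arbitrary choice, not from the handout).
rng = random.Random(1)

# One group: four partners, 100 tosses each.
group = [toss_counts(100, rng) for _ in range(4)]

observed_heads = sum(h for h, _ in group)
observed_tails = sum(t for _, t in group)
total = observed_heads + observed_tails

# Under the "fair penny" hypothesis, P(H) = P(T) = 0.5.
expected_heads = 0.5 * total
expected_tails = 0.5 * total

print("per partner:", group)
print("observed:", observed_heads, observed_tails)
print("expected:", expected_heads, expected_tails)
```

The observed counts will almost never equal the expected ones exactly, which is the situation the chi square test is designed to judge.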

Exercise 2: A similar activity will be performed using dice. Your group will be expected to perform this chi-square test more independently.

Exercise 3: These techniques will now be applied to biological data.

1. Consider a mating between two brown gerbils. Here's the way this mating would be presented:

The hypothesis concerning the fur color gene in consideration is that the trait is controlled by a single gene with two different forms (alleles), brown and black, and that the brown allele is dominant to the black allele. We will use the symbols B for the brown allele and b for the black allele. Our knowledge of genetics tells us that each gerbil carries two alleles for this gene, so possible allele combinations would be BB (which would be a brown gerbil), bb (which would be a black gerbil), and Bb (which would be a brown gerbil, because of our hypothesis that brown is dominant). Note also that, according to this hypothesis, we won't be able to tell whether a brown gerbil is BB or Bb.

Our mother gerbil is Honey, who is brown. Honey's mother and father were brown and black, respectively. This leads us to predict that Honey's allele combination is Bb. Our father gerbil is Ritz, who is also brown, and also had one brown parent and one black parent. This leads us to predict that Ritz's allele combination is also Bb.

So our mating is: Bb x Bb

We'd work our little genetics problem (which is our way to arrive at our prediction) like this:

         B      b
   B     BB     Bb
   b     Bb     bb

From this, we predict that among the babies we'd have:

   1/4 BB     2/4 Bb     1/4 bb

Stated as probabilities:

   3/4 Brown     1/4 Black

or

   P(Brown) = 0.75     P(Black) = 0.25

2. Your instructor will give you actual data from matings like this one. Referring to the chi square table provided on the data sheet, complete a chi square analysis on these data. Are the actual results close enough to the predicted results to merit accepting this hypothesis?

3. Consider a second gene, also studied using gerbils like Honey and Ritz. In this case, the gene is for a fur color pattern called "white spotting," or "Canada White Spot." Wild gerbils have essentially solid brown fur. White spotted gerbils have a pattern of white markings on their otherwise colored fur. Here's the hypothesis regarding this gene:

Again, the hypothesis concerning this gene is that the trait is controlled by a single gene with two different forms (alleles), white spotted and solid (wild type). Preliminary observation suggests that the white spotted allele is the dominant one. The observations that led to this prediction were (1) sometimes white spotted parents produced solid color babies, indicating that solid can be hidden; (2) two solid color parents never produced any white spotted babies, suggesting that white spotting can't be hidden, and thus can't be recessive. We will use the symbols W for the white spotted allele and w for the solid allele. Our knowledge of genetics tells us that each gerbil carries two alleles for this gene, so possible allele combinations would be WW (which would be a white spotted gerbil), ww (which would be a solid gerbil), and Ww (which would be a white spotted gerbil, because of our hypothesis that white spotted is dominant). Note also that, according to this hypothesis, we won't be able to tell whether a white spotted gerbil is WW or Ww.

Once again, our mother gerbil is Honey, who is white spotted. Honey's mother and father were white spotted and solid, respectively. This leads us to predict that Honey's allele combination is Ww. Our father gerbil is Ritz, who is also white spotted, and also had one white spotted parent and one solid parent. This leads us to predict that Ritz's allele combination is also Ww.

So our mating is: Ww x Ww

We'd work our little genetics problem (which is our way to arrive at our prediction) like this:

         W      w
   W     WW     Ww
   w     Ww     ww

From this, we predict that among the babies we'd have:

   1/4 WW     2/4 Ww     1/4 ww

Stated as probabilities:

   3/4 White spotted     1/4 Solid

or

   P(White spotted) = 0.75     P(Solid) = 0.25

4. Again, your instructor will give you data from actual matings involving this gene. Perform a Chi-Square Goodness-of-Fit test to determine whether this hypothesis about this gene is supported by the data.
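Both gerbil predictions come from the same little genetics problem, so they can be checked with one short routine. This is a sketch under the handout's own hypothesis (a single gene, two alleles, complete dominance of the uppercase allele); the function names cross and phenotypes are ours, chosen for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def cross(parent1, parent2):
    """Return genotype probabilities for a one-gene cross, e.g. cross("Bb", "Bb")."""
    counts = Counter()
    for a, b in product(parent1, parent2):
        # Sorting puts the uppercase allele first, so "bB" and "Bb" are pooled.
        counts["".join(sorted(a + b))] += 1
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}

def phenotypes(genotype_probs, dominant_name, recessive_name):
    """Pool genotypes into phenotypes, assuming the uppercase allele is dominant."""
    result = {dominant_name: Fraction(0), recessive_name: Fraction(0)}
    for genotype, p in genotype_probs.items():
        shows_dominant = any(allele.isupper() for allele in genotype)
        result[dominant_name if shows_dominant else recessive_name] += p
    return result

fur = phenotypes(cross("Bb", "Bb"), "Brown", "Black")
spotting = phenotypes(cross("Ww", "Ww"), "White spotted", "Solid")

print(cross("Bb", "Bb"))  # genotype fractions: BB 1/4, Bb 1/2, bb 1/4
print(fur)                # Brown 3/4, Black 1/4
print(spotting)           # White spotted 3/4, Solid 1/4
```

Multiplying these probabilities by the number of babies in the instructor's data gives the expected values for the chi square table.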

Example: Chi Square Analysis

My hypothesis is that a particular penny is a fair penny. In other words, that it is not weighted or in any other way designed to favor falling with heads up or to favor falling with tails up. If this is true of my coin, then my prediction is that the probability of flipping heads (P(H)) is 0.5, and the probability of flipping tails (P(T)) is also 0.5. This means that I am predicting that ½ of the time the coin will come up heads, and ½ of the time it will come up tails. Therefore, if I flip a coin 300 times, my hypothesis predicts:

Expected:   Heads: 150   Tails: 150   Total: 300

To test this hypothesis, I flip my penny 300 times. Here are the numbers I get:

Observed:   Heads: 162   Tails: 138   Total: 300

1. There are several factors which are important in determining the significance of the difference between the observed (O) and expected (E) values. The absolute difference in numbers is important. This is obtained by subtracting the E value from the O value (O - E).
   For heads: O - E = 162 - 150 = 12
   For tails: O - E = 138 - 150 = -12

2. To get rid of the plus and minus signs, and for other esoteric statistical reasons, these values are squared, giving us (O - E)² for each of our data classes.
   For heads: (O - E)² = 12² = 144
   For tails: (O - E)² = (-12)² = 144

3. The number of trials is also very important. A particular deviation from perfect means a lot more if there are only a few trials than it would if there were many trials. We take this into account by dividing our (O - E)² values by the expected values (which reflect the number of trials).
   For heads: (O - E)²/E = 144/150 = 0.96*
   For tails: (O - E)²/E = 144/150 = 0.96*

   *These values won't always work out to be the same for all of the categories. In this case they do because we have only two categories of data, and our expectations for the two categories are identical.

4. To calculate the chi square value for our experiment, we add together all of the (O - E)²/E values, one for each of the categories of results. (In this experiment, our categories of results are "heads" and "tails"; for the dice you will be using in class, there would be six categories of results: 1, 2, 3, 4, 5, and 6.)
   Sum of the χ² = 0.96 + 0.96 = 1.92

5. Note some important features of this number. It's the sum of two numbers derived from fractions. The absolute differences between expected and observed results are in the numerators of those fractions, so the more you miss, the bigger the chi square number will turn out to be. The expected values, reflecting the number of trials, are in the denominators of those fractions, and thus, for a given deviation, the bigger your sample size, the smaller the χ² numbers will turn out to be.

6. All of this information can be laid out in a data table:

   Class (of data)   Expected   Observed   (O - E)   (O - E)²   (O - E)²/E
   Heads             150        162         12       144        0.96
   Tails             150        138        -12       144        0.96
   Total             300        300                             Sum of χ² = 1.92
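The columns of this table can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in steps 1 through 4; the function name chi_square_terms is our own, and the counts are the ones from the coin example.

```python
def chi_square_terms(observed, expected):
    """Return the (O - E)^2 / E term for each class of data."""
    return [(o - e) ** 2 / e for o, e in zip(observed, expected)]

classes = ["Heads", "Tails"]
observed = [162, 138]   # counts from the 300 tosses in the example
expected = [150, 150]   # 0.5 * 300 for each class

terms = chi_square_terms(observed, expected)
chi_square = sum(terms)

# One row per class: Expected, Observed, O - E, (O - E)^2, (O - E)^2 / E.
for name, o, e, term in zip(classes, observed, expected, terms):
    print(f"{name:<6}{e:>6}{o:>6}{o - e:>6}{(o - e) ** 2:>6}{term:>7.2f}")
print(f"Sum of chi-square = {chi_square:.2f}")   # Sum of chi-square = 1.92
```

For the dice in Exercise 2, the same function applies unchanged; the two lists would simply have six entries each.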

NOTE that the greater the deviation of any observed value from its expected value, the larger the χ² value will be, and that, for the same deviation, the larger the sample size, the smaller the χ² value will be. Thus, in general, the smaller the Sum of the χ² value, the better the fit between our prediction and our actual data.

Now that you have a sum of the χ² value, you must determine how significant that value is. Remember that the question is, are your actual data different enough from your predicted data to cast your hypothesis in doubt? For the next step, you need one additional bit of information: the degrees of freedom (df). Degrees of freedom reflects how many of the class counts are free to vary once the total number of observations is fixed. To calculate the degrees of freedom, we need to know the number of classes of data. In the case of this example, that number would be two ("heads" and "tails"). If you were doing an exercise with dice, rather than coins, the number of classes of data would be six (the six possible sides of the dice). Degrees of freedom will generally be the number of classes of data minus one. In this case, 2 - 1 = 1 degree of freedom. Again, if we were dealing with dice rather than coins, degrees of freedom would be 6 - 1 = 5.

Now we have two different numbers, the sum of the χ² and the degrees of freedom: 1.92 and 1, respectively, for our coin tossing example. The final step in our process is to refer to a professionally prepared table of the probabilities of χ² values. Such a table is reproduced on the last page of this document. These tables come in a variety of sizes, depending upon how many subdivisions (columns) are present, and how high the degrees of freedom go. This particular table is rather small compared to many available tables. The table lists the degrees of freedom as the headings to the rows. Across the top are probability figures, the "probability of the Chi-Square." The interior of the table consists of the sum of the χ² values themselves.

Remember, the point of the exercise is to decide whether our actual data are far enough away from the numbers which we predicted to justify throwing out our hypothesis.
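Two of the points above, the effect of sample size and the rule for degrees of freedom, are easy to check numerically. In the sketch below the 3000-toss counts are hypothetical figures invented for the comparison (they keep the same deviation of 12 as the real example), and the function names are our own.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all classes of data."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def degrees_of_freedom(n_classes):
    """Number of classes of data minus one."""
    return n_classes - 1

# The example from the handout: 300 tosses, each class off by 12.
small_sample = chi_square([162, 138], [150, 150])

# Hypothetical larger experiment: 3000 tosses, each class still off by 12.
large_sample = chi_square([1512, 1488], [1500, 1500])

print(round(small_sample, 3))   # 1.92
print(round(large_sample, 3))   # 0.192
print(degrees_of_freedom(2))    # coins: 1
print(degrees_of_freedom(6))    # dice: 5
```

The same miss of 12 produces a chi square value ten times smaller when the expected counts are ten times larger, which is why the expected values sit in the denominators.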

To Use the Table

1. Find the degrees of freedom for your data (1 in this case) in the left-hand column of the table.
2. Scan across the row of χ² values beside the df number until you find two values which bracket your calculated number (1.92 in this case). This means that one of the figures will be larger, and the other will be smaller. If the table were subdivided into enough columns, you might have found your exact calculated value on the table, but you should easily be able to see why that happens only very rarely. Generally, you have to be satisfied with finding the bracketing numbers. In this case, 1.92 falls between the numbers 0.455 and 2.706.
3. Look up at the top of the table to see which probabilities correspond to your bracketing χ² values: in this case, 0.50 and 0.10 respectively. If you had found your exact χ² value on this table, its probability would have fallen somewhere between these two. So we could say that 0.10 < P(χ²) < 0.50. This mathematical statement means "the probability of our Chi-Square falls between 0.10 and 0.50."
4. So what does that mean? A probability of 0.10 corresponds to a "chance" of 10%; a probability of 0.50 to a "chance" of 50%. This chi-square result means that, if our hypothesis is correct, and we performed exactly this experiment over and over again, 10% to 50% of the time, our results would be at least this far from what we predicted. Or, the probability that we would get results at least as bad as these, even though our hypothesis is correct, is between 0.10 and 0.50.
5. The usual "level of discrimination" used by investigators is P(χ²) = 0.05. Thus, if your chi-square value has a probability of 0.05 or lower, it is very likely (but not certain) that your hypothesis is not correct. In our coin example the probability is well above 0.05, so the data give us no reason to doubt that the penny is fair.
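The bracketing in steps 1 through 3 can be written out as a short search along one row of the table. This is a sketch: the function names are our own, and the exact-probability helper is an addition that is not part of the handout. It relies on the fact that, for 1 degree of freedom only, the tail probability equals erfc(sqrt(χ²/2)).

```python
import math

# The df = 1 row of the critical values table: (probability, chi-square value).
DF1_ROW = [(0.995, 0.000), (0.975, 0.000), (0.9, 0.016), (0.5, 0.455),
           (0.1, 2.706), (0.05, 3.841), (0.025, 5.024), (0.01, 6.635),
           (0.005, 7.879)]

def bracket(chi_square, row):
    """Return (lower, upper) probabilities bracketing chi_square, or None at the table's edges."""
    for (p_upper, x_smaller), (p_lower, x_larger) in zip(row, row[1:]):
        if x_smaller < chi_square < x_larger:
            return p_lower, p_upper
    return None

def exact_probability_df1(chi_square):
    """Exact tail probability, valid for 1 degree of freedom only."""
    return math.erfc(math.sqrt(chi_square / 2))

print(bracket(1.92, DF1_ROW))                   # (0.1, 0.5)
print(round(exact_probability_df1(1.92), 3))    # falls between 0.10 and 0.50
```

The exact figure is shown only to confirm that the table lookup and the direct calculation agree; the lab itself asks for nothing more than the bracketing probabilities.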

Critical Values of the χ² Distribution

Probability of the Chi-Square [P(χ²)]

df     0.995    0.975    0.9      0.5      0.1      0.05     0.025    0.01     0.005
1      0.000    0.000    0.016    0.455    2.706    3.841    5.024    6.635    7.879
2      0.010    0.051    0.211    1.386    4.605    5.991    7.378    9.210    10.597
3      0.072    0.216    0.584    2.366    6.251    7.815    9.348    11.345   12.838
4      0.207    0.484    1.064    3.357    7.779    9.488    11.143   13.277   14.860
5      0.412    0.831    1.610    4.351    9.236    11.070   12.832   15.086   16.750
6      0.676    1.237    2.204    5.348    10.645   12.592   14.449   16.812   18.548
7      0.989    1.690    2.833    6.346    12.017   14.067   16.013   18.475   20.278
8      1.344    2.180    3.490    7.344    13.362   15.507   17.535   20.090   21.955
9      1.735    2.700    4.168    8.343    14.684   16.919   19.023   21.666   23.589
10     2.156    3.247    4.865    9.342    15.987   18.307   20.483   23.209   25.188
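Because the usual level of discrimination is 0.05, the 0.05 column of the critical values table is the one consulted most often. The sketch below turns that column into a yes-or-no check; the function name hypothesis_in_doubt is our own, and the dice value of 12.5 is a hypothetical result used only to show the other outcome.

```python
# The 0.05 column of the critical values table, indexed by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070,
               6: 12.592, 7: 14.067, 8: 15.507, 9: 16.919, 10: 18.307}

def hypothesis_in_doubt(chi_square, df):
    """True when chi_square is at or beyond the 0.05 critical value for df."""
    return chi_square >= CRITICAL_05[df]

print(hypothesis_in_doubt(1.92, 1))   # coin example: False
print(hypothesis_in_doubt(12.5, 5))   # hypothetical dice result: True
```

A value at or beyond the critical value corresponds to a probability of 0.05 or lower, the point at which the handout treats the hypothesis as very likely (but not certainly) incorrect.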

Chi Square Goodness-of-Fit Test

@ata Sheet

Exercise 1: Coins

Predictions based upon your hypothesis:   P(H)          P(T)

Data collection:

                               Heads     Tails
   Your Tosses
   Your Group
   Class Totals (from board)

   LH/________ (Class Prediction)     LT/________ (Class Prediction)

Chi-Square Table (Coins):

   Class    Expected   Observed   O - E   (O - E)²   (O - E)²/E
   Heads
   Tails
   Total

Degrees of Freedom: ______________     Probability of the χ²: ____________________________________

Conclusion:

Exercise 2: Dice

Predictions based upon your hypothesis:

   Class    Probability   Expected Number
   One
   Two
   Three
   Four
   Five
   Six

Data Collection:

                Your Rolls   Group   Class
   1s Rolled
   2s Rolled
   3s Rolled
   4s Rolled
   5s Rolled
   6s Rolled

Chi Square Table (Dice)

   Class    Expected   Observed   O - E   (O - E)²   (O - E)²/E






   Total

Degrees of Freedom: ______________     Probability of the χ²: ____________________________________

Conclusion:

Exercise 3: Gerbils

First Mating:

                Brown   Black
   Predicted
   Actual

Chi-Square Table (Gerbils 1):

   Class    Expected   Observed   O - E   (O - E)²   (O - E)²/E
   Brown
   Black
   Total

Degrees of Freedom: ______________     Probability of the χ²: ____________________________________

Conclusion:

Second Mating:

                White Spotted   Solid
   Predicted
   Actual

Chi-Square Table (Gerbils 2):

   Class            Expected   Observed   O - E   (O - E)²   (O - E)²/E
   White Spotted
   Solid
   Total

Degrees of Freedom: ______________     Probability of the χ²: ____________________________________

Conclusion: