G-Test: Statistics Likelihood-Ratio Maximum Likelihood Statistical Significance Chi-Squared Tests

The G-test is a statistical test that is increasingly used instead of the chi-squared test. It compares observed values to values expected under a null hypothesis. The G-test calculates a test statistic (G-value) that measures the difference between observed and expected values. To determine if the difference is statistically significant, the G-value is compared to critical values from tables based on the degrees of freedom. If the G-value exceeds the critical value, the null hypothesis is rejected, meaning the observed values differ significantly from expected values. An example compares genotype frequencies observed in a population to those expected under Hardy-Weinberg equilibrium.
Copyright
© All Rights Reserved

G-test

In statistics, G-tests are likelihood-ratio or maximum likelihood statistical significance tests that are increasingly being used in situations where chi-squared tests were previously recommended.

The commonly used chi-squared tests for goodness of fit to a distribution and for
independence in contingency tables are in fact approximations of the log-likelihood ratio on
which the G-tests are based. This approximation was developed by Karl Pearson because at
the time it was unduly laborious to calculate log-likelihood ratios. With the advent of
electronic calculators and personal computers, this is no longer a problem. G-tests have
been recommended at least since the 1981 edition of the popular statistics textbook by
Sokal and Rohlf.[1] Dunning[2] introduced the test to the computational linguistics
community, where it is now widely used.

The general formula for Pearson's chi-squared test statistic is

    χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ

where Oᵢ is the frequency observed in a cell, Eᵢ is the frequency expected under the null
hypothesis, and the sum is taken across all cells. The corresponding general formula for G
is

    G = 2 Σ Oᵢ ln(Oᵢ / Eᵢ)

where ln denotes the natural logarithm (log to the base e) and the sum is again taken over
all non-empty cells.
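As a minimal sketch, the two statistics can be computed side by side in pure Python; the three-category counts below are hypothetical and serve only as an illustration:

```python
from math import log

def pearson_chi2(observed, expected):
    """Pearson's chi-squared statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def g_statistic(observed, expected):
    """G statistic: 2 * sum of O * ln(O / E) over all non-empty cells."""
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

# Hypothetical counts: three categories, 60 observations total
observed = [22, 17, 21]
expected = [20, 20, 20]

print(g_statistic(observed, expected))    # ~0.717
print(pearson_chi2(observed, expected))   # ~0.700
```

For counts this close to their expected values the two statistics nearly coincide, which is the approximation described above.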

Relation to mutual information

The value of G can also be expressed in terms of mutual information.

Let

    N = Σᵢⱼ Oᵢⱼ,  πᵢⱼ = Oᵢⱼ / N,  πᵢ. = Σⱼ Oᵢⱼ / N,  and  π.ⱼ = Σᵢ Oᵢⱼ / N.

Then G can be expressed in several alternative forms:

    G = 2N Σᵢⱼ πᵢⱼ ln(πᵢⱼ / (πᵢ. π.ⱼ))
      = 2N [H(r) + H(c) − H(r, c)]
      = 2N · MI(r, c)

where the entropy of a discrete random variable is defined as

    H(X) = − Σₓ p(x) ln p(x)

and where

    MI(r, c) = H(r) + H(c) − H(r, c)

is the mutual information between the row vector and the column vector of the contingency
table.
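The entropy form of this identity can be checked numerically. The sketch below uses a hypothetical 2×3 contingency table and natural-log entropies; both routes should give the same G:

```python
from math import log

# Hypothetical 2x3 contingency table (rows x columns); illustration only
table = [[10, 20, 30],
         [25, 15, 10]]

N = sum(sum(row) for row in table)
row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]

def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * log(p) for p in probs if p > 0)

# Entropies of the row marginal, column marginal, and joint distribution
H_r = entropy([t / N for t in row_tot])
H_c = entropy([t / N for t in col_tot])
H_rc = entropy([o / N for row in table for o in row])

G_via_mi = 2 * N * (H_r + H_c - H_rc)

# Direct form: G = 2 * sum O * ln(O / E), with E = row_tot * col_tot / N
G_direct = 2 * sum(
    o * log(o / (row_tot[i] * col_tot[j] / N))
    for i, row in enumerate(table) for j, o in enumerate(row) if o > 0
)

assert abs(G_via_mi - G_direct) < 1e-9
```

The assertion holds because the mutual information of the table is exactly the mean log-likelihood ratio per observation, so multiplying by 2N recovers G.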

It can also be shown[citation needed] that the inverse document frequency weighting commonly
used for text retrieval is an approximation of G applicable when the row sum for the query
is much smaller than the row sum for the remainder of the corpus. Similarly, the result of
Bayesian inference applied to a choice of single multinomial distribution for all rows of the
contingency table taken together versus the more general alternative of a separate
multinomial per row produces results very similar to the G statistic.[citation needed]

Distribution and usage

Given the null hypothesis that the observed frequencies result from random sampling from
a distribution with the given expected frequencies, the distribution of G is approximately a
chi-squared distribution, with the same number of degrees of freedom as in the
corresponding chi-squared test.

For samples of a reasonable size, the G-test and the chi-squared test will lead to the same
conclusions. However, the approximation to the theoretical chi-squared distribution for the
G-test is better than for the Pearson chi-squared test in cases where, for any cell,
|Oᵢ − Eᵢ| > Eᵢ, and in any such case the G-test should always be used.[citation needed]

For very small samples the multinomial test for goodness of fit, and Fisher's exact test for
contingency tables, or even Bayesian hypothesis selection are preferable to either the chi-
squared test or the G-test.[citation needed]
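A quick numerical sketch (hypothetical counts) of how the two statistics diverge when one cell's observed count differs from its expected count by more than the expected count itself:

```python
from math import log

# Hypothetical counts with one badly-fitting cell: |12 - 5| = 7 > 5
observed = [12, 4, 4]
expected = [5, 10, 5]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
g = 2 * sum(o * log(o / e) for o, e in zip(observed, expected))

# The two statistics now differ noticeably (~13.6 vs ~11.9)
print(round(chi2, 2), round(g, 2))
```

With such cells, Pearson's statistic inflates faster than G, which is the situation where the two tests can lead to different conclusions.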

Application

An application of the G-test is known as the McDonald–Kreitman test in statistical
genetics.


The G-test allows biologists to compare observed values with those predicted from a
specific null hypothesis. The G-test determines the probability that differences between the
observed and predicted (or expected) values are large enough that they are unlikely to have
occurred due to chance alone. The G-test is generally used with variables that are counts,
not continuous measurements. The chi-squared test can be used in similar instances.

For example, the following table compares the observed distributions of genotypes in a
population with that predicted by the Hardy-Weinberg Principle. The aim of the G-test is to
determine whether the Aa genotype is really more common, and the aa genotype less
common, than they should be under the Hardy-Weinberg assumptions.

                                 Genotype
                                 AA    Aa    aa
    Expected from H-W            32    64    32
    Observed in population       32    74    22

1) First Step is to calculate the "test statistic" – this is the G value.

    G = 2 × Σ [O × ln(O/E)]

where:
    O = observed number for each category
    E = expected number for each category
    ln = natural log
    Σ ("the sum of") = sum the values in the bracket over the n categories (in this case, n = 3)

A G value of zero means that the observed numbers are exactly equal to the expected
numbers. The larger the differences between observed and expected, the greater the value
of G. The higher the G value, the more likely that the results are significantly different from
that predicted by the null hypothesis (i.e., the smaller the P value).

In the example given above, G is:


G = 2[32*ln(32/32) + 74*ln(74/64) + 22*ln(22/32)] = 2[0 + 10.74 – 8.24] = 5.0
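The arithmetic above can be checked term by term with a few lines of Python, using the counts from the table:

```python
from math import log

observed = [32, 74, 22]   # AA, Aa, aa observed in the population
expected = [32, 64, 32]   # AA, Aa, aa expected under Hardy-Weinberg

# The three bracketed terms O * ln(O/E), then G = 2 * their sum
terms = [o * log(o / e) for o, e in zip(observed, expected)]
print([round(t, 2) for t in terms])   # [0.0, 10.74, -8.24]
G = 2 * sum(terms)
print(round(G, 1))                    # 5.0
```

Note the first term is exactly zero because the observed AA count equals its expected value.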

2) Second Step is to calculate the degrees of freedom

G increases as the observed values become more and more different from the expected
values, but G also increases as we add more categories. To correct for this, we need to
figure the degrees of freedom in our sample. To calculate the degrees of freedom (df), we
need to determine the minimum number of categories whose value we need to know before
we could calculate the rest. In our genotype example, if we know any two genotype
categories, it is possible to calculate the third (by subtracting from the total). Thus the
degrees of freedom is two.

3) Third Step is to compare your G-value to the Critical G-value

To determine whether the difference between the observed and expected values is greater
than that expected by chance alone, the G value is compared to those on a table. In the
table, the degrees of freedom are listed down the left side, and the Critical G-values are
given in the adjacent column.

A p-value of less than or equal to 0.05 is usually accepted as indicating a significant
difference. If you look at the table, you will see that for 2 degrees of freedom, a G-value of
6.0 is necessary to yield a p-value of 0.05 or less. The G-value calculated above is NOT
greater than 6.0, hence we cannot reject the null hypothesis. Therefore, we must conclude
that the data do not differ significantly from the expected Hardy-Weinberg distribution.
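For 2 degrees of freedom specifically, no table is strictly needed: the chi-squared upper-tail probability has the closed form exp(−x/2), so both the p-value and the critical value can be computed directly (a sketch; the 6.0 quoted above is this critical value rounded up):

```python
from math import exp, log

G = 5.0    # G-value from the worked example
df = 2     # degrees of freedom

# For df = 2, the chi-squared upper-tail probability is exp(-x / 2)
p_value = exp(-G / 2)
critical = -2 * log(0.05)   # the G that would give p = 0.05 exactly

print(round(p_value, 3))    # 0.082 -> greater than 0.05, so not significant
print(round(critical, 2))   # 5.99
```

This closed form holds only for df = 2; for other degrees of freedom a table (or a statistics library) is still required.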

Sometimes the expected values for the G-test are generated from the data themselves. For
example, you have data on the number of times that a large male cricket mates (n1=12),
and the number of times that a small male cricket mates (n2=4). You would expect, if
mating were not related to size, that the two types of males would have the same
opportunity to mate. Thus, since there were 16 total mating events, each male would be
expected to mate 8 times. You can then perform the G-test using 8 as the expected value for
each male, and 12 and 4 as the observed values.
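The cricket calculation runs the same way; here the expected values come from the data themselves, as described above:

```python
from math import log

# Cricket mating example: 16 matings split evenly under the null hypothesis
observed = [12, 4]   # large male, small male
expected = [8, 8]

G = 2 * sum(o * log(o / e) for o, e in zip(observed, expected))
print(round(G, 2))   # 4.19
```

With two categories there is 1 degree of freedom, and 4.19 exceeds the standard tabled critical value of 3.84 at p = 0.05, so in this hypothetical dataset mating success would differ significantly by size.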

4) Fourth Step is to report your results

"The genotype frequencies in the first simulation did not differ from those predicted by
Hardy-Weinberg equilibrium (G = 1.23, p > 0.05, d.f. = 2)."
OR if you did find an effect:

"The genotype frequencies in the small populations differed significantly from those
predicted by Hardy-Weinberg equilibrium (G = 8.45, p < 0.05, d.f. = 2)."
