Fisher 1922
Author(s): R. A. Fisher
Source: Journal of the Royal Statistical Society, Vol. 85, No. 4 (Jun., 1922), pp. 597-612
Published by: Wiley for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/2341124
MISCELLANEA.
By R. A. FISHER, M.A.
Introduction.
The widespread desire to introduce into statistical methods some
degree of critical exactitude has led to the employment, now
general in careful work, of the two types of quantity which
characterize modern statistics, namely, the "probable error" and the
test of "goodness of fit." The test of goodness of fit was devised
by Pearson, to whose labours principally we now owe it, that the
test may readily be applied to a great variety of questions of
frequency distribution. It is an essential means of justifying
a posteriori the methods which have been employed in the reduction
of any body of data. Slutsky and Pearson have extended the
test to apply also to the fitness of regression formulae, Pearson's
correlation ratio having also been employed for this purpose.
It has been shown in a previous communication [2]
that the $\chi^2$ test of goodness of fit can be accurately applied only
if allowance is made for the number of constants fitted in
reconstructing the theoretical population. This correction is particularly
important in contingency tables, but is necessary in all cases; and
the fact that it has not been recognized has led to the adoption
of erroneous values in almost all the cases in which tests of goodness
of fit have been employed. The values of P have been exaggerated,
and it is to be feared that in many cases wrong conclusions have
been drawn from the values of P obtained.
It has, therefore, been necessary to extend the examination to
the tests of goodness of fit of regression lines. The errors due to
neglecting the number of constants fitted are here very pronounced;
but in addition other points have to be taken into consideration,
which did not arise in our previous investigation. In the most
important class of cases the curve of distribution of $\chi^2$ is now no
longer of the Pearsonian Type III, which is the basis of Elderton's
tables, but of the neighbouring Type VI. Certain misconceptions
also exist as to the form of the distribution of the correlation
ratio, $\eta$, which we hope to have cleared up. We have also taken
the opportunity of solving the outstanding problem of the
distribution of the regression coefficients in small samples.
1. The accurate application of Elderton's tables.
With any two variables x and y we shall suppose that the
number of observations for which $x = x_p$ is $n_p$, and the number
of these for which $y = y_q$ is $n_{pq}$; also that $\bar{y}_p$ is the mean of the
observed values of y for a given value of x, so that
\[ n_p \bar{y}_p = S_q (n_{pq} y_q). \]
We may regard the group $n_p$ as a random sample from a
population in which the value of x is constant; but the value of y varies
freely about a certain mean, $m_p$, with a certain standard deviation,
$\sigma_p$.
\[ df \propto t^{\frac{1}{2}(N-a-2)} e^{-\frac{t}{2\sigma^2}} \, dt, \]
and these two distributions are independent, for the one depends
only on the deviations from the means of normal samples, and the
other only on the means.
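The distribution of $\chi^2$ will now be that of $(N-a)\,T/t$, so,
substituting
\[ T = \frac{\chi^2 t}{N-a} \]
in
\[ t^{\frac{1}{2}(N-a-2)} \, T^{\frac{1}{2}(a-p-3)} \, e^{-\frac{t+T}{2\sigma^2}} \, dt \, dT, \]
we obtain, ignoring constants,
\[ (\chi^2)^{\frac{1}{2}(a-p-3)} \, t^{\frac{1}{2}(N-p-3)} \, e^{-\frac{t}{2\sigma^2}\left(1 + \frac{\chi^2}{N-a}\right)} \, dt \, d(\chi^2), \]
and, integrating with respect to t,
\[ df \propto (\chi^2)^{\frac{1}{2}(a-p-3)} \left(1 + \frac{\chi^2}{N-a}\right)^{-\frac{1}{2}(N-p-1)} d(\chi^2). \]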
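3. The nature of the approximation of the Type VI curve
\[ df = \frac{\left(\frac{N-p-3}{2}\right)!}{\left(\frac{N-a-2}{2}\right)! \, \left(\frac{a-p-3}{2}\right)!} \, (N-a)^{-\frac{1}{2}(a-p-1)} \, (\chi^2)^{\frac{1}{2}(a-p-3)} \left(1 + \frac{\chi^2}{N-a}\right)^{-\frac{1}{2}(N-p-1)} d(\chi^2) \]
to the Type III curve
\[ df = \frac{1}{2^{\frac{1}{2}(a-p-1)} \left(\frac{a-p-3}{2}\right)!} \, (\chi^2)^{\frac{1}{2}(a-p-3)} \, e^{-\frac{1}{2}\chi^2} \, d(\chi^2). \]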
When $\chi^2$ is small, the two curves have closely similar forms, the
latter being the distribution of $\chi^2$, as given by Elderton's table,
when $n' = a - p$. The ratio of the ordinates at the terminus of
the curve is obtained by expanding the constant multiplier of the
first curve in powers of $N^{-1}$. It reduces to
\[ 1 + \frac{(n'-1)(n'-3)}{4N}. \]
              Mean.                            Mode.
Type VI ...   $(a-p-1)\dfrac{N-a}{N-a-2}$      $(a-p-3)\dfrac{N-a}{N-a+2}$
The mean, therefore, is raised and the mode lowered in about the
same proportion. For the higher values of $\chi^2$ the curves are not
closely similar, and since it is for these especially that the value of P
is required, we shall obtain the necessary correction in P, as far
as the terms in $N^{-1}$. The ratio of the ordinates is
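\[ 1 + \frac{1}{4N}\left\{\chi^4 - 2(n'-1)\chi^2 + (n'-1)(n'-3)\right\}; \]
but, since
\[ 2^{\frac{1}{2}(n'-1)} \left(\frac{n'-3}{2}\right)! \, P_{n'} = \int_{\chi^2}^{\infty} x^{\frac{1}{2}(n'-3)} e^{-\frac{1}{2}x} \, dx, \]
we have the correction
\[ \frac{1}{4N}\left\{(n'-1)(n'+1) P_{n'+4} - 2(n'-1)^2 P_{n'+2} + (n'-1)(n'-3) P_{n'}\right\} \]
\[ = \frac{n'-1}{4N}\left\{(n'+1) P_{n'+4} - 2(n'-1) P_{n'+2} + (n'-3) P_{n'}\right\}, \]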
which, in the absence of tables of the Type VI curve, will usually
be found adequate.
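
As a rough numerical check of the correction above, the following Python sketch compares three values of P for one case: the Type III (Elderton) value, the Type III value plus the $N^{-1}$ correction, and the Type VI value obtained by direct integration. The figures N = 106, a = 6, p = 1 and $\chi^2$ = 9 are hypothetical, as are all function names. The Type VI tail is computed on the assumption, taken from the derivation in the text, that $\chi^2/(N-a)$ is the ratio T/t of two independent sums of squares, so that $w = u/(1+u)$ with $u = \chi^2/(N-a)$ follows a Beta distribution.

```python
import math

def chi2_tail(chi2, n_prime, terms=400):
    """P_{n'}: upper tail of the Type III curve with n' - 1 degrees of freedom,
    from the power series of the regularized lower incomplete gamma function."""
    s, h = (n_prime - 1) / 2.0, chi2 / 2.0
    term = math.exp(s * math.log(h) - h - math.lgamma(s + 1.0))
    lower = 0.0
    for j in range(terms):
        lower += term
        term *= h / (s + j + 1.0)
    return 1.0 - lower

def type6_tail(chi2, N, a, p, steps=2000):
    """Upper tail of the Type VI curve by Simpson's rule on w = u / (1 + u).
    Assumes a - p - 1 >= 2 and N - a > 2 so the integrand is bounded."""
    alpha, beta = (a - p - 1) / 2.0, (N - a) / 2.0
    log_b = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)

    def dens(w):
        if w <= 0.0 or w >= 1.0:
            return 0.0
        return math.exp((alpha - 1) * math.log(w)
                        + (beta - 1) * math.log(1.0 - w) - log_b)

    u = chi2 / (N - a)
    lo = u / (1.0 + u)
    h = (1.0 - lo) / steps
    total = dens(lo) + dens(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * dens(lo + i * h)
    return total * h / 3.0

def correction(chi2, N, n_prime):
    """The N^{-1} correction to P quoted in the text."""
    P = lambda k: chi2_tail(chi2, k)
    return (n_prime - 1) / (4.0 * N) * (
        (n_prime + 1) * P(n_prime + 4)
        - 2 * (n_prime - 1) * P(n_prime + 2)
        + (n_prime - 3) * P(n_prime))

# Hypothetical case: 106 observations in 6 arrays, straight line fitted (p = 1).
N, a, p, chi2 = 106, 6, 1, 9.0
n_prime = a - p
p_type3 = chi2_tail(chi2, n_prime)
p_corrected = p_type3 + correction(chi2, N, n_prime)
p_type6 = type6_tail(chi2, N, a, p)
print(p_type3, p_corrected, p_type6)
```

In this one case the corrected value lies much closer to the integrated Type VI value than the uncorrected Type III value does, which is what the expansion to terms in $N^{-1}$ would lead one to expect; it is an illustration, not a general proof.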
4. The correlation ratio.
We are now in a position to make an accurate use of the correlation
ratio, as a test of the fitness of regression formulae. Let Y be the
function of x used as regression formula, and let
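\[ N R^2 s_y^2 = S\{n_p (Y_p - \bar{y})^2\}, \qquad N s_y^2 = S (y - \bar{y})^2, \]
where $\bar{y}$ is the mean of all the observed values of y; then it is easy
to see that, provided Y has been fitted to the data so that
\[ \chi^2 = \frac{S\{n_p (\bar{y}_p - Y_p)^2\}}{\dfrac{N s_y^2}{N-a}\,(1-\eta^2)}. \]
In other words
\[ \chi^2 = (N-a)\,\frac{\eta^2 - R^2}{1 - \eta^2}, \]
and
\[ n' = a - p, \]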
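
The relation between $\chi^2$, $\eta^2$ and $R^2$ can be illustrated with a short Python sketch. The data below are hypothetical, the function name is an illustration only, and the sketch assumes the usual definition $N \eta^2 s_y^2 = S\{n_p(\bar{y}_p - \bar{y})^2\}$ of the correlation ratio together with a straight line fitted by least squares (so that p = 1). It computes $\chi^2$ in two ways: from the ratio formula, and directly as the sum of squares of the array means about the line divided by the estimate of $\sigma^2$ within arrays.

```python
def regression_chi2(arrays):
    """arrays maps each x value to the list of y values observed in that array.
    Returns (eta^2, R^2, chi^2 from the ratio formula, chi^2 computed directly, n')."""
    xs = [x for x, group in arrays.items() for _ in group]
    ys = [y for group in arrays.values() for y in group]
    N, a = len(ys), len(arrays)
    x_bar, y_bar = sum(xs) / N, sum(ys) / N

    # Least-squares straight line Y = y_bar + b (x - x_bar): two fitted constants.
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))

    total = sum((y - y_bar) ** 2 for y in ys)          # N s_y^2
    between = fitted = lack = within = 0.0
    for x, group in arrays.items():
        n_p = len(group)
        mean_p = sum(group) / n_p
        Y_p = y_bar + b * (x - x_bar)
        between += n_p * (mean_p - y_bar) ** 2         # N eta^2 s_y^2
        fitted += n_p * (Y_p - y_bar) ** 2             # N R^2 s_y^2
        lack += n_p * (mean_p - Y_p) ** 2              # S{n_p (ybar_p - Y_p)^2}
        within += sum((y - mean_p) ** 2 for y in group)

    eta2, R2 = between / total, fitted / total
    chi2_ratio = (N - a) * (eta2 - R2) / (1.0 - eta2)
    chi2_direct = lack / (within / (N - a))
    n_prime = a - 1                                    # n' = a - p with p = 1
    return eta2, R2, chi2_ratio, chi2_direct, n_prime

# Hypothetical data: four arrays of unequal size.
data = {
    1.0: [2.1, 2.9, 2.4],
    2.0: [3.8, 4.4, 4.1, 3.6],
    3.0: [5.2, 4.7, 5.9],
    4.0: [6.1, 7.0, 6.4, 6.8],
}
eta2, R2, chi2_ratio, chi2_direct, n_prime = regression_chi2(data)
print(eta2, R2, chi2_ratio, chi2_direct, n_prime)
```

The two values of $\chi^2$ agree because, for a least-squares fit, the sum of squares between arrays splits into the part accounted for by the line and the part measuring departure from it.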
5. Comparison with previous formulae.
Slutsky, in his method [4, p. 83] of treating homoscedastic
data, has used a process analogous to that arrived at above, but
with four deviations:-
(i) He averages the standard deviations of the arrays, and not
their squares, in estimating the value of $\sigma^2$.
(ii) He divides his total by N instead of N - a.
(iii) He enters Elderton's table with $n' = a + 1$, instead of
$n' = a - p$.
(iv) He takes the Type III distribution to be exact.
(i) Pearson [1, pp. 249-51] has criticized the first point, but
his practice is not quite explicit. In his opinion evidently, if the
surface is homoscedastic, we must take $s_y^2(1 - \eta^2)$, but in the
special case when the regression is also linear he replaces $1 - \eta^2$
\[ \sigma^2 = \frac{1}{N-a} \, SS\{n_{pq} (y - \bar{y}_p)^2\}. \]
Now
\[ (1 - \eta^2) \, S (y - \bar{y})^2 = SS\{n_{pq} (y - \bar{y}_p)^2\}; \]
whence it follows that the mean value of $1 - \eta^2$ is
\[ \frac{N-a}{N-1}, \]
and that of $\eta^2$, therefore,
\[ \frac{a-1}{N-1}. \]
Pearson has discussed the distribution of $\eta^2$ in this case [5].
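
The mean value of $\eta^2$ quoted above can be checked by simulation. The sketch below is only an illustration under stated assumptions: y is drawn from a single normal population independently of the array (so that there is no real association), the arrays are of equal size, and the number of arrays, their size, the number of trials and the seed are all hypothetical choices.

```python
import random

def eta_squared(arrays):
    """Correlation ratio squared: sum of squares between arrays over total sum of squares."""
    ys = [y for group in arrays for y in group]
    y_bar = sum(ys) / len(ys)
    total = sum((y - y_bar) ** 2 for y in ys)
    between = sum(len(g) * (sum(g) / len(g) - y_bar) ** 2 for g in arrays)
    return between / total

def mean_eta_squared(a, per_array, trials, seed=1922):
    """Average eta^2 over repeated samples in which y is unrelated to the array."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(trials):
        arrays = [[rng.gauss(0.0, 1.0) for _ in range(per_array)]
                  for _ in range(a)]
        acc += eta_squared(arrays)
    return acc / trials

a, per_array = 5, 4            # hypothetical: 5 arrays of 4 observations each
N = a * per_array
simulated = mean_eta_squared(a, per_array, trials=4000)
expected = (a - 1) / (N - 1)   # the mean value given in the text
print(simulated, expected)
```

With so few observations the expected value of $\eta^2$ is far from zero even though no association exists, which is why the uncorrected ratio cannot be read at face value in small samples.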
(N-a) --2
1-v
a-3 N 4N)
3-'
If
\[ \chi^2 = S\left\{\frac{(n_p - \bar{n}_p)^2}{n_p}\right\}, \]
where $n_p$ is the number of observations expected, and $\bar{n}_p$ the number
observed in any cell, then the value of $n'$ with which Elderton's
table should be entered is not the total number of cells, but one
more than the number of values of
\[ n_p - \bar{n}_p \]
which can be independently specified. That is to say, that when
the values of $n_p$ are reconstructed from the data of the sample,
$(n' - 1)$ is the number of degrees of freedom left after making
this reconstruction.
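
The practical effect of this rule can be seen in a small Python sketch. It assumes the standard result for an r by c contingency table whose margins are reconstructed from the sample, namely that (r - 1)(c - 1) deviations can be independently specified; the table size and the value of $\chi^2$ are hypothetical, and the function name is an illustration only.

```python
import math

def chi2_tail(chi2, n_prime, terms=400):
    """P for a given chi^2, entered as in Elderton's table with n' = degrees of freedom + 1.
    Uses the power series of the regularized lower incomplete gamma function."""
    s, h = (n_prime - 1) / 2.0, chi2 / 2.0
    term = math.exp(s * math.log(h) - h - math.lgamma(s + 1.0))
    lower = 0.0
    for j in range(terms):
        lower += term
        term *= h / (s + j + 1.0)
    return 1.0 - lower

rows, cols = 2, 2                       # hypothetical fourfold table
chi2 = 3.84                             # hypothetical observed value
cells = rows * cols
free = (rows - 1) * (cols - 1)          # deviations that can be independently specified

p_cells = chi2_tail(chi2, cells)        # n' taken as the number of cells
p_free = chi2_tail(chi2, free + 1)      # n' = degrees of freedom + 1
print(p_cells, p_free)
```

Entering the table with the number of cells gives a value of P several times larger than the value obtained from the degrees of freedom actually left, which is the exaggeration of P that the text describes.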
In the same way for regression lines
\[ \chi^2 = \frac{1}{\sigma^2} \, S\{n_p (\bar{y}_p - Y_p)^2\}, \]
of which the coefficients a and b are calculated by the equations
\[ a = \bar{y}, \qquad b = \frac{S\{y (x - \bar{x})\}}{S (x - \bar{x})^2}, \]
\[ \sigma_a^2 = \frac{\sigma^2}{n}. \]
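So that if $\alpha$ is the population value of a, and $\tau = (a - \alpha)\sqrt{n}/\sigma$,
then $\tau$ is normally distributed about zero with standard deviation
unity. If $\sigma^2$ is unknown, the best estimate that can be made of it
from the sample is
\[ s^2 = \frac{1}{n-2} \, S (y - Y)^2, \]
where the sum is divided by (n - 2) to allow for the two constants
used in fitting the regression line. Then the distribution of $s^2$ is,
if
\[ \chi^2 = \frac{(n-2) s^2}{\sigma^2}, \]
\[ df = \frac{1}{\left(\frac{n-4}{2}\right)!} \left(\frac{\chi^2}{2}\right)^{\frac{1}{2}(n-4)} e^{-\frac{1}{2}\chi^2} \, d\!\left(\frac{\chi^2}{2}\right). \]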
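The distributions of the two quantities s and a are wholly
independent; hence, following "Student," we find the distribution of a
quantity completely calculable from the sample, namely,
\[ z = \frac{\tau}{\chi} = \frac{(a - \alpha)\sqrt{n}}{\sqrt{S (y - Y)^2}}. \]
The joint distribution of $\chi^2$ and z is
\[ df \propto (\chi^2)^{\frac{1}{2}(n-3)} \, e^{-\frac{1}{2}\chi^2 (1 + z^2)} \, d(\chi^2) \, dz; \]
and integrating with respect to $\chi^2$ from 0 to $\infty$, we have
\[ df = \frac{1}{\sqrt{\pi}} \, \frac{\left(\frac{n-3}{2}\right)!}{\left(\frac{n-4}{2}\right)!} \, \frac{dz}{(1 + z^2)^{\frac{1}{2}(n-1)}}. \]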
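Similarly
\[ \sigma_b^2 = \frac{\sigma^2}{S (x - \bar{x})^2}, \]
and if
\[ z = \frac{(b - \beta)\sqrt{S (x - \bar{x})^2}}{\sqrt{S (y - Y)^2}}, \]
we arrive at the same distribution as before, $\beta$ being the population
value of the regression coefficient.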
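
A short Python sketch may make the use of z concrete for a straight line. The data are hypothetical, the hypothesis tested is $\beta = 0$, and the function names are illustrations only. The probability of a larger |z| is obtained by integrating the Type VII density numerically with Simpson's rule rather than from "Student's" tables; the sketch also prints $z\sqrt{n-2}$, which is the quantity now usually called t with n - 2 degrees of freedom.

```python
import math

def fit_line(xs, ys):
    """Least-squares line Y = a + b (x - x_bar); returns a, b, S(x - x_bar)^2, S(y - Y)^2."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a = y_bar
    b = sum(y * (x - x_bar) for x, y in zip(xs, ys)) / sxx
    rss = sum((y - (a + b * (x - x_bar))) ** 2 for x, y in zip(xs, ys))
    return a, b, sxx, rss

def type7_density(z, n):
    """Density of z when two constants have been fitted to n observations."""
    log_c = (math.lgamma((n - 1) / 2.0) - math.lgamma((n - 2) / 2.0)
             - 0.5 * math.log(math.pi))
    return math.exp(log_c - 0.5 * (n - 1) * math.log1p(z * z))

def two_sided_p(z, n, steps=2000):
    """Probability of a deviation exceeding |z| in either direction (Simpson's rule)."""
    z = abs(z)
    h = z / steps
    total = type7_density(0.0, n) + type7_density(z, n)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * type7_density(i * h, n)
    return 1.0 - 2.0 * total * h / 3.0

# Hypothetical sample of n = 8 pairs; test the hypothesis beta = 0.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 2.3, 2.9, 4.1, 5.2, 5.8, 7.1, 7.9]
n = len(xs)
a, b, sxx, rss = fit_line(xs, ys)
beta = 0.0
z_b = (b - beta) * math.sqrt(sxx) / math.sqrt(rss)
print(a, b, z_b, z_b * math.sqrt(n - 2), two_sided_p(z_b, n))
```

The same two functions serve for the constant a by replacing $\sqrt{S(x - \bar{x})^2}$ with $\sqrt{n}$ and b with a, since both coefficients lead to the same curve.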
The above argument immediately extends itself to regression
lines of any form and involving any number of coefficients. For,
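suppose the regression equation is of the form
\[ Y = a + bX_1 + cX_2 + \dots + kX_p, \]
where $X_1, X_2, \dots, X_p$ are orthogonal functions of x for the observed
values, so that
\[ S (X_a X_b) = 0, \]
then
\[ k = \frac{S (y X_p)}{S (X_p^2)} \]
and
\[ \sigma_k^2 = \frac{\sigma^2}{S (X_p^2)}. \]
Consequently, if
\[ z = \frac{(k - \kappa)\sqrt{S (X_p^2)}}{\sqrt{S (y - Y)^2}}, \]
the distribution of z is the Type VII curve
\[ df = \frac{1}{\sqrt{\pi}} \, \frac{\left(\frac{n-p-2}{2}\right)!}{\left(\frac{n-p-3}{2}\right)!} \, \frac{dz}{(1 + z^2)^{\frac{1}{2}(n-p)}}, \]
and in this case, when p + 1 constants have been fitted, all the
other regression coefficients will be distributed in like manner,
only substituting the corresponding function of x for $X_p$.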
Tables of the Probability Integral of the above Type VII
distribution have been prepared by "Student" [8], for values of n - p
from 0 to 30. These tables are in a suitable form for testing the
significance of an observed regression coefficient. For larger
samples the curve will be sufficiently normal for most purposes,
the variance of z being
\[ \frac{1}{n-p-3}. \]
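
The stated variance can be confirmed numerically. The sketch below assumes the Type VII density in the form $C\,(1 + z^2)^{-\frac{1}{2}(n-p)}$ with $C = \Gamma\{\frac{1}{2}(n-p)\} / [\sqrt{\pi}\,\Gamma\{\frac{1}{2}(n-p-1)\}]$, substitutes $z = \tan\theta$ so that the infinite range becomes finite, and integrates by Simpson's rule. The values n = 12 and p = 2 are hypothetical, and the function name is an illustration only.

```python
import math

def type7_moments(n, p, steps=2000):
    """Total mass and variance of the Type VII curve for z, via z = tan(theta).
    Requires n - p >= 4 so that the variance integrand stays bounded."""
    m = (n - p) / 2.0
    c = math.exp(math.lgamma(m) - math.lgamma(m - 0.5) - 0.5 * math.log(math.pi))

    def cos_clamped(theta):
        # Guard against a tiny negative cosine from rounding at the end points.
        return max(math.cos(theta), 0.0)

    def mass_f(theta):
        # (1 + z^2)^(-m) dz becomes cos(theta)^(2m - 2) d(theta)
        return cos_clamped(theta) ** (2 * m - 2)

    def var_f(theta):
        # z^2 (1 + z^2)^(-m) dz becomes sin(theta)^2 cos(theta)^(2m - 4) d(theta)
        return math.sin(theta) ** 2 * cos_clamped(theta) ** (2 * m - 4)

    def simpson(f):
        lo, hi = -math.pi / 2.0, math.pi / 2.0
        h = (hi - lo) / steps
        total = f(lo) + f(hi)
        for i in range(1, steps):
            total += (4 if i % 2 else 2) * f(lo + i * h)
        return total * h / 3.0

    return c * simpson(mass_f), c * simpson(var_f)

n, p = 12, 2                     # hypothetical: 12 observations, p + 1 = 3 constants
mass, variance = type7_moments(n, p)
print(mass, variance, 1.0 / (n - p - 3))
```

The integrated mass should come out as unity and the variance as 1/(n - p - 3), which gives a quick way of judging how near the curve is to normal for a given sample size.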
The utility of "Student's" curve for the distribution of errors
in the mean of a sample, in terms of the standard deviation, as
estimated from the same sample, is increased by the circumstance
that the same distribution also gives that of differences between
such means. Thus, if $\bar{x}$ and $\bar{x}'$ are the means of samples of n and n',
and we wish to test if the means are in sufficient agreement to
warrant the belief that the samples are drawn from the same
population, we may calculate
\[ z = \frac{\bar{x} - \bar{x}'}{\sqrt{S (x - \bar{x})^2 + S' (x' - \bar{x}')^2}} \sqrt{\frac{n n'}{n + n'}}. \]
'vS (y-Y)2 21
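then, as before, z will be distributed in the Type VII distribution
\[ df = \frac{1}{\sqrt{\pi}} \, \frac{\left(\frac{n-p-2}{2}\right)!}{\left(\frac{n-p-3}{2}\right)!} \, \frac{dz}{(1 + z^2)^{\frac{1}{2}(n-p)}}. \]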
Conclusions.
References.
1. K. Pearson (1916).-"On the Application of Goodness of Fit Tables to
Test Regression Curves and Theoretical Curves used to describe Observational
or Experimental Data." Biom., XI, 239-61.
2. R. A. Fisher (1922).-"On the Significance of $\chi^2$ from Contingency
Tables, and on the Calculation of P." J.R.S.S., LXXXV, pp. 87-94.
3. R. A. Fisher (1915).-"Frequency Distribution of the Values of the
Correlation Coefficient in Samples from an Indefinitely Large Population."
Biom., X, 507-21.
4. E. Slutsky (1913).-"On the Criterion of Goodness of Fit of the
Regression Lines, and on the best Method of fitting them to the Data." J.R.S.S.,
LXXVII, 78-84.
5. K. Pearson (1911).-"On a Correction to be made to the Correlation
Ratio." Biom., VIII, 254-6.
6. K. Pearson (1905).-"On the General Theory of Skew Correlation and
Non-linear Regression." Drapers' Company Research Memoirs: Dulau and Co.
7. Student (1908).-"The Probable Error of a Mean." Biom., VI, pp. 1-25.
8. Student (1917).-"Tables for Estimating the Probability that the Mean
of a unique Sample of Observations lies between $-\infty$ and any given Distance
of the Mean of the Population from which the Sample is drawn." Biom., XI,
414-17.
9. R. A. Fisher (1921).-"An Examination of the Yield of Dressed Grain
from Broadbalk." Journal of Agricultural Science, XI, 107-35.