Fisher 1922

This document discusses the goodness of fit of regression formulas and the distribution of regression coefficients. It addresses issues with applying Pearson's chi-squared test of goodness of fit when the number of constants fitted is not properly accounted for. It also examines the exact distribution of chi-squared when the variance is estimated from the sample data rather than known, and solves the problem of the distribution of regression coefficients in small samples.


The Goodness of Fit of Regression Formulae, and the Distribution of Regression Coefficients

Author(s): R. A. Fisher
Source: Journal of the Royal Statistical Society, Vol. 85, No. 4 (Jun., 1922), pp. 597-612
Published by: Wiley for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/2341124
Accessed: 31/08/2013 13:17


MISCELLANEA.

THE GOODNESS OF FIT OF REGRESSION FORMULAE, AND THE DISTRIBUTION OF REGRESSION COEFFICIENTS.

By R. A. FISHER, M.A.

Introduction.

THE widespread desire to introduce into statistical methods some degree of critical exactitude has led to the employment, now general in careful work, of the two types of quantity which characterize modern statistics, namely, the "probable error" and the test of "goodness of fit." The test of goodness of fit was devised by Pearson, to whose labours principally we now owe it that the test may readily be applied to a great variety of questions of frequency distribution. It is an essential means of justifying a posteriori the methods which have been employed in the reduction of any body of data. Slutsky and Pearson have extended the test to apply also to the fitness of regression formulae, Pearson's correlation ratio having also been employed for this purpose.

It has been shown in a previous communication [2, Fisher, 1922] that the χ² test of goodness of fit can be accurately applied only if allowance is made for the number of constants fitted in reconstructing the theoretical population. This correction is particularly important in contingency tables, but is necessary in all cases; and the fact that it has not been recognized has led to the adoption of erroneous values in almost all the cases in which tests of goodness of fit have been employed. The values of P have been exaggerated, and it is to be feared that in many cases wrong conclusions have been drawn from the values of P obtained.

It has, therefore, been necessary to extend the examination to the tests of goodness of fit of regression lines. The errors due to neglecting the number of constants fitted are here very pronounced; but in addition other points have to be taken into consideration, which did not arise in our previous investigation. In the most important class of cases the curve of distribution of χ² is now no longer of the Pearsonian Type III, which is the basis of Elderton's tables, but of the neighbouring Type VI. Certain misconceptions also exist as to the form of the distribution of the correlation ratio, η, which we hope to have cleared up.


We have also taken the opportunity of solving the outstanding problem of the distribution of the regression coefficients in small samples.

1. The accurate application of Elderton's tables.

With any two variables x and y we shall suppose that the number of observations for which x = x_p is n_p, and the number of these for which y = y_q is n_{pq}; also that ȳ_p is the mean of the observed values of y for a given value of x, so that

n_p \bar y_p = S_p(n_{pq} y_q).

We may regard the group n_p as a random sample from a population in which the value of x is constant, but the value of y varies freely about a certain mean, m_p, with a certain standard deviation, σ_p. For such samples of n_p, therefore, the mean, ȳ_p, will vary about the same mean m_p, and since this mean of y is independent of the number of the array, m_p will be the mean of all values of ȳ_p from random samples, however the number n_p may vary.

Any opinion put forward by Professor Pearson is worthy of respect; but it is impossible to agree with his statement [1, p. 240] that "This result cannot be taken as obvious, as the size of the array in the sample varies." The fact, however, Pearson has verified for large samples as far as the third order of approximation. The difference in principle is of some importance, since the simplicity of many of the results here obtained is a consequence of the fact that we have not attempted to eliminate known quantities, given by the sample, from the distribution formulae of the statistics studied, but only the unknown quantities (parameters of the population from which the sample is drawn) which have to be estimated somewhat inexactly from the given sample.*

Next, for arrays of any given size, the standard deviation of ȳ_p is σ_p/√n_p, and it will be normally distributed if the population-array be normal, and approximately so in most cases if n_p be large. Pearson rightly points out that the values of ȳ_p for arrays of different sizes will not be normally distributed, but the distribution will be markedly leptokurtic even for considerable arrays. This result follows from the fact that the distribution is a mixture of normal distributions, having the same mean, but different standard deviations.
* Statistics whose sampling distribution depends upon other statistics given by the sample cannot, in the strict sense, fulfil the Criterion of Sufficiency. In certain cases evidently no statistic exists which strictly fulfils this criterion. In these cases statistics obtained by the Method of Maximum Likelihood appear to fulfil the Criterion of Efficiency; the extension of this criterion to finite samples thus takes a new importance.


This mixed distribution need not concern us, however, for in applying tests of fitness we do not in practice ignore the size of the array. The simple fact is that, when the population arrays are normal, the quantity

z_p = \sqrt{n_p}\,(\bar y_p - m_p)

is normally distributed about zero, with a standard deviation σ_p, and this distribution is independent of the size of the array.

In the case when the population arrays are equally variable, σ_p is constant (= σ), and if there are a arrays, the quantity

S(z_p^2) = S\{n_p(\bar y_p - m_p)^2\}

is the sum of the squares of a independent, normally and equally variable quantities, and consequently, if we write

\chi^2 \sigma^2 = S(z_p^2),

χ² will be distributed as is the ordinary measure of goodness of fit.

In applying Elderton's tables we must, of course, put n' equal to one more than the number of degrees of freedom, as I have demonstrated elsewhere [2]. If the values of m_p were known a priori, we should take n' = a + 1; but for regression formulae fitted to the data by equations linear in ȳ_p we merely reduce the number of degrees of freedom by the number of constants fitted. Thus, if m_p is a linear function of x, and a straight line is fitted, we have n' = a - 1, and the value of χ² then constitutes a test of whether or not m_p is in reality adequately represented by a linear function of x. Similarly, if a cubic polynomial in x be fitted, we have n' = a - 3.
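The rule above lends itself to a short numerical sketch (added here; it is not part of Fisher's text, and the array positions, counts, means and the value of σ below are hypothetical). It computes χ² for a straight line fitted to the array means and enters the chi-squared distribution with a - 2 degrees of freedom, i.e. Elderton's n' = a - 1:

```python
import numpy as np
from scipy import stats

# Hypothetical grouped data: a arrays of y at fixed values of x,
# with the within-array standard deviation sigma assumed known a priori.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])          # array positions
n_p = np.array([8, 12, 10, 9, 11, 10])                 # observations per array
ybar_p = np.array([2.1, 2.9, 4.2, 5.0, 6.1, 6.8])      # observed array means
sigma = 0.9                                            # known standard deviation

# Fit a straight line to the array means, weighting each mean by n_p
# (p + 1 = 2 constants fitted).
coef = np.polyfit(x, ybar_p, deg=1, w=np.sqrt(n_p))
Y_p = np.polyval(coef, x)

# Fisher's chi-squared for the fit of the regression line.
chi2 = np.sum(n_p * (ybar_p - Y_p) ** 2) / sigma ** 2

# Degrees of freedom: a arrays minus the 2 fitted constants,
# i.e. Elderton's n' = a - 1, so df = n' - 1 = a - 2 here.
a = len(x)
df = a - 2
P = stats.chi2.sf(chi2, df)
print(f"chi2 = {chi2:.3f}, df = {df}, P = {P:.4f}")
```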
2. The exact distribution of χ² when σ is determined from the data.
So far the results are exact on the assumption that σ is known; but as in practice σ must usually be obtained from the data, errors will be introduced from this source which necessarily influence the distribution of χ². It is true that σ may be estimated from the whole data, and is therefore known with accuracy of a higher order than the quantities which contribute to χ²; nevertheless it is necessary to determine what aberrations are to be expected when the data are not very numerous.

From each array we can directly calculate the second moment s_p², and it has been shown [3] that the second moment of a normal sample of n_p is so distributed that the frequency with which it falls into the range ds_p² is proportional to

\sigma^{-(n_p - 1)}\,(s_p^2)^{\frac{n_p - 3}{2}}\,e^{-\frac{n_p s_p^2}{2\sigma^2}}\;d(s_p^2);

the chance that all the observed values of s_p² fall in assigned ranges is the product of a such quantities, for all are distributed independently; consequently, to find the optimum value of σ, which will also be the value with the least probable error, we must make this product a maximum for variations of σ.

Taking logarithms and differentiating, we have

\frac{\partial L}{\partial \sigma} = \frac{S(n_p s_p^2)}{\sigma^3} - \frac{S(n_p - 1)}{\sigma},

whence the optimum value of σ² is s², where

(N - a)\,s^2 = S(n_p s_p^2).

We shall, therefore, suppose that σ is estimated by this method, and that

\chi^2 = \frac{S(z^2)}{s^2};

we must now find the distribution of this statistic.


The distribution of s² is of the same kind as those with which we have been concerned. For

S(n_p s_p^2) = SS\{n_{pq}(y - \bar y_p)^2\},

and may be regarded as the sum of the squares of N equally variable quantities, independent save for a linear restrictions, one for each array, of the form

S_p(y) = n_p \bar y_p.

If, therefore, we specify the distribution of s² in such a way as to express the frequency element, df, in terms of the variate element within which it occurs, we shall have

df \propto t^{\frac{N-a-2}{2}}\,e^{-\frac{t}{2\sigma^2}}\,dt,

where t stands for s²(N - a). In the same way, if T stand for χ²s², we have, if p + 1 constants have been used in fitting,

df \propto T^{\frac{a-p-3}{2}}\,e^{-\frac{T}{2\sigma^2}}\,dT,

and these two distributions are independent, for the one depends only on the deviations from the means of normal samples, and the other only on the means.

The distribution of χ² will now be that of (N - a)T/t; so, substituting

T = \frac{\chi^2 t}{N - a}

in

t^{\frac{N-a-2}{2}}\,T^{\frac{a-p-3}{2}}\,e^{-\frac{t+T}{2\sigma^2}}\,dt\,dT,

we obtain, ignoring constants,

(\chi^2)^{\frac{a-p-3}{2}}\,t^{\frac{N-p-3}{2}}\,e^{-\frac{t}{2\sigma^2}\left(1 + \frac{\chi^2}{N-a}\right)}\,dt\,d\chi^2,

and so, on integrating from 0 to ∞ with respect to t,

(\chi^2)^{\frac{a-p-3}{2}}\left(1 + \frac{\chi^2}{N-a}\right)^{-\frac{N-p-1}{2}}\,d\chi^2.

The variation in s², therefore, changes the exact form of the distribution curve for χ² from Type III to Type VI. The change is, however, very small if N be large, for as N increases

\left(1 + \frac{\chi^2}{N-a}\right)^{-\frac{N-p-1}{2}} \to e^{-\frac{\chi^2}{2}},

and so reproduces the Type III distribution.
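As an illustrative check (not in the original), one may note that the Type VI variate above equals (a - p - 1) times an F variate with a - p - 1 and N - a degrees of freedom; a brief simulation with hypothetical values of a, N and p shows how its tail differs from the Type III (ordinary chi-squared) tail:

```python
import numpy as np
from scipy import stats

# Hypothetical sizes: a arrays, N observations in all, p + 1 = 2 constants fitted.
a, N, p = 10, 40, 1
df = a - p - 1                     # degrees of freedom of the Type III curve

# Monte Carlo version of the argument above: T/sigma^2 and t/sigma^2 are independent
# chi-squared variates with a - p - 1 and N - a degrees of freedom, and Fisher's
# chi-squared statistic (computed with the pooled estimate s^2) is (N - a) T / t.
rng = np.random.default_rng(1)
T = rng.chisquare(df, size=200_000)
t = rng.chisquare(N - a, size=200_000)
chi2_fisher = (N - a) * T / t

x0 = 16.0                          # an arbitrary observed value of chi-squared
print("simulated P     :", np.mean(chi2_fisher > x0))
print("Type VI (exact) :", stats.f.sf(x0 / df, df, N - a))   # df * F(df, N - a)
print("Type III approx.:", stats.chi2.sf(x0, df))
```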

3. The nature of the approximation of the Type VI curve

df = \frac{(N-a)^{-\frac{a-p-1}{2}}\left(\frac{N-p-3}{2}\right)!}{\left(\frac{N-a-2}{2}\right)!\,\left(\frac{a-p-3}{2}\right)!}\;x^{\frac{a-p-3}{2}}\left(1 + \frac{x}{N-a}\right)^{-\frac{N-p-1}{2}}\,dx

to the Type III curve

df = \frac{1}{2^{\frac{a-p-1}{2}}\left(\frac{a-p-3}{2}\right)!}\;x^{\frac{a-p-3}{2}}\,e^{-\frac{x}{2}}\,dx.

When x is small, the two curves have closely similar forms, the latter being the distribution of χ², as given by Elderton's table, when n' = a - p. The ratio of the ordinates at the terminus of the curve is obtained by expanding the constant multiplier of the first curve in powers of N⁻¹. It reduces to

1 + \frac{(n'-1)(n'-3)}{4N}

for high values of P; therefore, 1 - P, as given by Elderton's table, may be corrected by multiplying by this factor.

* The symbol x! is used throughout this paper as equivalent to Γ(x + 1), whether x is an integer or not.


Near the centre of the curve we may observe the position of the mean and mode.

              Mean.                           Mode.
Type III ...  a - p - 1                       a - p - 3
Type VI  ...  (a - p - 1)(N-a)/(N-a-2)        (a - p - 3)(N-a)/(N-a+2)

The mean, therefore, is raised and the mode lowered in about the same proportion. For the higher values of x the curves are not closely similar, and since it is for these especially that the value of P is required, we shall obtain the necessary correction in P, as far as the terms in N⁻¹. The ratio of the ordinates is

1 + \frac{1}{4N}\left\{x^2 - 2(n'-1)x + (n'-1)(n'-3)\right\};

but, since

P_{n'}(x) = \frac{1}{2^{\frac{n'-1}{2}}\left(\frac{n'-3}{2}\right)!}\int_x^{\infty} x^{\frac{n'-3}{2}}\,e^{-\frac{x}{2}}\,dx,

we have the correction

\frac{1}{4N}\left\{(n'-1)(n'+1)\,P_{n'+4} - 2(n'-1)^2\,P_{n'+2} + (n'-1)(n'-3)\,P_{n'}\right\}
= \frac{n'-1}{4N}\left\{(n'+1)\,P_{n'+4} - 2(n'-1)\,P_{n'+2} + (n'-3)\,P_{n'}\right\},

to be added to the value of P; this correction, in the absence of tables of the Type VI curve, will usually be found adequate.
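A small numerical sketch of how this correction might be applied (added for illustration; the values of n', N and χ² are hypothetical, and the exact Type VI tail is obtained through the equivalent F form noted earlier):

```python
from scipy import stats

def corrected_P(chi2, n_prime, N):
    """Type III value of P (Elderton's table, entered with n') plus Fisher's 1/N correction."""
    P = lambda m: stats.chi2.sf(chi2, m - 1)   # table entered with n' = m has m - 1 d.f.
    corr = (n_prime - 1) / (4 * N) * (
        (n_prime + 1) * P(n_prime + 4)
        - 2 * (n_prime - 1) * P(n_prime + 2)
        + (n_prime - 3) * P(n_prime)
    )
    return P(n_prime) + corr

# Hypothetical case: a = 11 arrays, straight line fitted (p = 1, n' = a - p = 10),
# N = 60 observations in all, observed chi-squared of 18.
a, p, N, chi2_obs = 11, 1, 60, 18.0
n_prime = a - p
print("Type III P :", stats.chi2.sf(chi2_obs, n_prime - 1))
print("corrected P:", corrected_P(chi2_obs, n_prime, N))
# Exact Type VI value, chi^2 being (a - p - 1) times an F(a - p - 1, N - a) variate.
print("Type VI P  :", stats.f.sf(chi2_obs / (a - p - 1), a - p - 1, N - a))
```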

4. The correlation ratio.

We are now in a position to make an accurate use of the correlation ratio as a test of the fitness of regression formulae. Let Y be the function of x used as regression formula, and let

N R^2 s_y^2 = S\{n_p(Y_p - \bar y)^2\}, \qquad N s_y^2 = S(y - \bar y)^2,

where ȳ is the mean of all the observed values of y; then it is easy to see that, provided Y has been fitted to the data so that

S\{n_p(\bar y_p - Y_p)^2\}

is a minimum for proportional variations of Y - ȳ, then

N(1 - R^2)\,s_y^2 = SS\{n_{pq}(y - Y_p)^2\}.


But the correlation ratio is given by the parallel formula

N(1 - \eta^2)\,s_y^2 = SS\{n_{pq}(y - \bar y_p)^2\} = (N - a)\,s^2;

hence, by subtraction,

N(\eta^2 - R^2)\,s_y^2 = S\{n_p(\bar y_p - Y_p)^2\} = \chi^2 s^2,

so that

\chi^2 = \frac{N s_y^2 (\eta^2 - R^2)}{s^2}.

In other words,

\chi^2 = (N - a)\,\frac{\eta^2 - R^2}{1 - \eta^2},

and to test the significance of η² - R² we enter Elderton's table with n' = a - p, where p + 1 is the number of constants fitted to the regression line. Thus, for a linear regression formula,

\chi^2 = (N - a)\,\frac{\eta^2 - r^2}{1 - \eta^2} \quad\text{and}\quad n' = a - 1,

using, if necessary, the correction for Type VI as before.

The exact form of the distribution of η itself would be difficult to obtain, but in practice η is usually employed to test the validity of a linear or other regression formula. For this purpose it is not the distribution of η but of the more variable quantity (η² - R²)/(1 - η²) that is required, and the above expressions show that it is approximately represented by a Type III curve, and that the probability of a greater discrepancy occurring by chance may be obtained from Elderton's table.
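An illustrative sketch of this use of the correlation ratio (not from the paper; the arrays and observations are hypothetical), testing whether a straight line adequately represents the array means:

```python
import numpy as np
from scipy import stats

# Hypothetical grouped data: observations y listed by array (fixed x).
x_vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
arrays = [np.array([1.8, 2.3, 2.0, 2.6]),
          np.array([3.4, 2.9, 3.8, 3.1, 3.3]),
          np.array([4.0, 4.6, 4.3, 4.9]),
          np.array([4.8, 5.5, 5.1, 5.0, 5.4]),
          np.array([5.6, 6.2, 5.9, 6.4])]

y_all  = np.concatenate(arrays)
N, a   = len(y_all), len(arrays)
n_p    = np.array([len(g) for g in arrays])
ybar_p = np.array([g.mean() for g in arrays])
ybar   = y_all.mean()

# Correlation ratio eta^2 and the corresponding R^2 for a fitted straight line.
s_y2 = np.sum((y_all - ybar) ** 2) / N
eta2 = np.sum(n_p * (ybar_p - ybar) ** 2) / (N * s_y2)
coef = np.polyfit(x_vals, ybar_p, deg=1, w=np.sqrt(n_p))   # p + 1 = 2 constants
Y_p  = np.polyval(coef, x_vals)
R2   = np.sum(n_p * (Y_p - ybar) ** 2) / (N * s_y2)

# Fisher's test of eta^2 - R^2: chi^2 = (N - a)(eta^2 - R^2)/(1 - eta^2),
# entered with n' = a - p, i.e. a - 2 degrees of freedom for a straight line.
chi2 = (N - a) * (eta2 - R2) / (1 - eta2)
dfree = a - 2
print(f"eta^2 = {eta2:.3f}, R^2 = {R2:.3f}, chi2 = {chi2:.3f}, "
      f"P = {stats.chi2.sf(chi2, dfree):.3f}")
```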

5. Comparison with previous formulae.

Slutsky, in his method [4, p. 83] of treating homoscedastic data, has used a process analogous to that arrived at above, but with four deviations:

(i) He averages the standard deviations of the arrays, and not their squares, in estimating the value of σ².
(ii) He divides his total by N instead of N - a.
(iii) He enters Elderton's table with n' = a + 1, instead of n' = a - p.
(iv) He takes the Type III distribution to be exact.

(i) Pearson [1, pp. 249-51] has criticized the first point, but his practice is not quite explicit. In his opinion evidently, if the surface is homoscedastic, we must take s_y²(1 - η²), but in the special case when the regression is also linear he replaces 1 - η²
by 1 - r². The point is not one of importance, and I am not convinced that any material difference would be made by replacing 1 - η² by 1 - R² in general, when the regression is well fitted. There would seem to be no reason for treating linear differently from other regression formulae. In dealing with Slutsky's price data, where the regression is doubtfully linear, Pearson prefers to use 1 - r².

(ii) The second point is, strictly, a matter of convenience, for when we know the distribution of χ², calculated by one method, we also know its distribution in the second case. Since neither of these distributions is exactly the Type III tabled by Elderton, we are free to use whichever we please. The form we have chosen has the advantage of involving the best estimate of σ, and we have chosen it for this reason; but as in the Type VI distribution errors of estimation are completely eliminated, this choice has only the force of a convention. The close agreement of the curve we have obtained with the corresponding Type III in the neighbourhood of the median is a practical advantage; it should in any case be noted that the corrections which we have obtained for P refer only to our own form of the statistic χ².

Although strictly a matter of convenience, there is a real advantage, when the matter is approached from other points of view, in the use of the best estimates. Thus, for example, when the arrays are undifferentiated with respect to the distribution of y, we naturally take

\frac{1}{N-1}\,S(y - \bar y)^2

as the best estimate of the variance of the whole of the observations; and as the arrays are undifferentiated this should agree on the average with our estimate of the variance in each array,

\frac{1}{N-a}\,SS\{n_{pq}(y - \bar y_p)^2\}.

Now

(1 - \eta^2)\,S(y - \bar y)^2 = SS\{n_{pq}(y - \bar y_p)^2\};

whence it follows that the mean value of 1 - η² is

\frac{N-a}{N-1},

and that of η², therefore,

\frac{a-1}{N-1}.

Pearson has discussed the distribution of η in this case [5].

Observing that, even if the arrays are wholly undifferentiated, η will necessarily be positive, he points out that, in testing whether η differs significantly from zero, it is not only necessary to know the standard error of η, but also the mean value about which it varies. The standard error of η for undifferentiated arrays he had previously [6] evaluated at 1/√N, and he then by a somewhat intricate method finds for the mean value of η² the value

\frac{a-1}{N},

and deduces that the mean value of η will be

\sqrt{\frac{a-1}{N}},

the latter deduction being clearly a slip.

In the case under consideration we have p = 0, R = 0, the regression line fitted being Y = ȳ. Then

(N - a)\,\frac{\eta^2}{1 - \eta^2}

will be distributed in the Type VI curve

df = \frac{(N-a)^{-\frac{a-1}{2}}\left(\frac{N-3}{2}\right)!}{\left(\frac{N-a-2}{2}\right)!\,\left(\frac{a-3}{2}\right)!}\;x^{\frac{a-3}{2}}\left(1 + \frac{x}{N-a}\right)^{-\frac{N-1}{2}}\,dx;

whence, substituting for x, we find that η² is distributed in the Type I curve

df = \frac{\left(\frac{N-3}{2}\right)!}{\left(\frac{N-a-2}{2}\right)!\,\left(\frac{a-3}{2}\right)!}\;(\eta^2)^{\frac{a-3}{2}}\,(1 - \eta^2)^{\frac{N-a-2}{2}}\,d\eta^2.

For large values of N the distribution of η does not tend to normality, as Pearson supposed, but that of Nη² tends to a Type III curve. For the mean values of η and η² we have

\bar\eta = \frac{\left(\frac{a-2}{2}\right)!\,\left(\frac{N-3}{2}\right)!}{\left(\frac{a-3}{2}\right)!\,\left(\frac{N-2}{2}\right)!},

or, approximately,

\bar\eta = \sqrt{\frac{a - \frac{3}{2}}{N}}\left(1 + \frac{3}{4N}\right),
while

\overline{\eta^2} = \frac{a-1}{N-1},

in agreement with our previous value.

The mean value for η² thus agrees sufficiently with that obtained by Pearson, but the accurate values for the mean and the standard deviation differ from his values. There is no purpose in pressing further a comparison on these lines, since, unless the number of arrays be large, the distribution of η is far from normal, and the significance of an observed value of η may be tested with some accuracy by the use of χ².

It may be noticed that, when the number of arrays is large,

\overline{\eta^2} - \bar\eta^2 = \frac{1}{2N}\left(1 - \frac{a}{N}\right)

to a first approximation, of which the second factor may usually be ignored.
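A brief simulation sketch (added for illustration, with hypothetical numbers of arrays and observations) of the Type I result: for wholly undifferentiated arrays, η² should follow a Beta((a - 1)/2, (N - a)/2) law with mean (a - 1)/(N - 1):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a, n_per = 6, 5                    # hypothetical: a arrays of n_per observations each
N = a * n_per

def eta_squared(y):
    """Correlation ratio eta^2 for equal arrays laid out as an (a, n_per) matrix."""
    ybar_p = y.mean(axis=1)
    ybar = y.mean()
    return n_per * np.sum((ybar_p - ybar) ** 2) / np.sum((y - ybar) ** 2)

# Undifferentiated arrays: every observation drawn from one and the same population.
sims = np.array([eta_squared(rng.normal(size=(a, n_per))) for _ in range(50_000)])

print("simulated mean of eta^2:", sims.mean())
print("(a - 1)/(N - 1)        :", (a - 1) / (N - 1))
# Fisher's Type I curve for eta^2 is a Beta((a - 1)/2, (N - a)/2) distribution.
print("simulated P(eta^2 > .4):", np.mean(sims > 0.4))
print("Beta tail at 0.4       :", stats.beta.sf(0.4, (a - 1) / 2, (N - a) / 2))
```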
(iii) The third point of difference between my method and those of Slutsky and Pearson, whereby I have made allowance for the number of constants involved in fitting the regression formula, has been more fully explained in a recent paper [2].

It is there shown that if

\chi^2 = S\left\{\frac{(n_p - \tilde n_p)^2}{\tilde n_p}\right\},

where \tilde n_p is the number of observations expected, and n_p the number observed, in any cell, then the value of n' with which Elderton's table should be entered is not the total number of cells, but one more than the number of values of

n_p - \tilde n_p

which can be independently specified. That is to say, that when the values of \tilde n_p are reconstructed from the data of the sample, (n' - 1) is the number of degrees of freedom left after making this reconstruction.

In the same way for regression lines

\chi^2 = \frac{1}{\sigma^2}\,S\{n_p(\bar y_p - Y_p)^2\},

and, if a is the number of arrays, n' - 1 = a only if the values of Y_p are assigned independently of the sample. If, as is more usually the case, the values of Y_p are those of a regression formula fitted to the sample, the number of values of

\bar y_p - Y_p


which can be independently specified is reduced by the number of constants fitted. For example, if a cubic polynomial has been fitted, the number of degrees of freedom is (a - 4), so that n' = a - 3.

6. The distribution of regression coefficients.

Hitherto we have only considered data in which a number of values of y are observed corresponding in groups to identical values of x; little statistical or physical data is strictly of this form, although the former may in favourable cases be confidently grouped, so as to simulate the kind of data for which the fitness of regression lines may be tested. The limitation of our methods to data of this form constitutes one of the most serious deficiencies in the statistical methods so far available. The position is well stated by Pearson [1, p. 258]:

"Of course it is needful for a test of this kind that the number of measurements of A, 'the dependent variable,' should considerably exceed the number of values of B tested. It would fail entirely if only one value of A were taken for each value of B, however numerous the latter might be. We must have some basis on which to determine the error made in a single determination of A. This is a point, I think, often overlooked by the physicist. A fairly good determination, I mean a quantitative determination, of the goodness of fit of theory to observation could be made from ten series of eight observations of A corresponding to ten values of B. But no measure of goodness of fit could be obtained from eighty observations of A corresponding to eighty values of B, yet the latter system would probably make the greater appeal to most physicists. I do not see how quantitatively to obtain any measure of the goodness of fit of theory to observation in the latter method of procedure."

It appears to the writer that the problem is one rather for the statistician than for the physicist; for, given equally variable arrays, and a regression line of known form, the problem is perfectly objective. I emphasize it here as a problem awaiting solution, a manageable solution of which would be of great practical utility. That it is an objective problem is clear from the confidence with which very bad fits will be rejected at sight, as also from the fact that rough and common-sense methods of testing have been developed for some purposes [9, Fisher, 1921].

Although exact methods of testing the goodness of fit of regression lines are not available for the extended class of data, we are in a position to give an exact solution of the distribution of the regression coefficients.

This problem has been outstanding for many years; but the need for its solution was recently brought home to the writer by correspondence with "Student," whose brilliant researches [7] in 1908 form the basis of the exact solution.

For consider a simple linear regression formula

Y = a + b(x - \bar x),

of which the coefficients a and b are calculated by the equations

a = \bar y, \qquad b = \frac{S\{y(x - \bar x)\}}{S(x - \bar x)^2};

we note first that a and b are orthogonal functions, in that, given the series of values of x observed, their sampling variation is independent.

Now "Student" [7] has shown how the probable error of a may be calculated; for if for a given value of x the standard deviation of y is σ, then a will be normally distributed, so that

\sigma_a^2 = \frac{\sigma^2}{n}.

So that if α is the population value of a, then (a - α)√n/σ is normally distributed about zero with standard deviation unity. If σ² is unknown, the best estimate that can be made of it from the sample is

s^2 = \frac{1}{n-2}\,S(y - Y)^2,

where the sum is divided by (n - 2) to allow for the two constants used in fitting the regression line. Then the distribution of s² is, if

\chi^2 = \frac{(n-2)\,s^2}{\sigma^2},

df = \frac{1}{\left(\frac{n-4}{2}\right)!}\left(\frac{\chi^2}{2}\right)^{\frac{n-4}{2}} e^{-\frac{\chi^2}{2}}\,d\!\left(\frac{\chi^2}{2}\right).

The distributions of the two quantities s and a are wholly independent; hence, following "Student," we find the distribution of a quantity completely calculable from the sample, namely,

z = \frac{(a - \alpha)\sqrt{n}}{\sqrt{S(y - Y)^2}}.

For

df = \frac{1}{\left(\frac{n-4}{2}\right)!}\left(\frac{\chi^2}{2}\right)^{\frac{n-4}{2}} e^{-\frac{\chi^2}{2}}\cdot\frac{1}{\sqrt{2\pi}}\,e^{-\frac{\chi^2 z^2}{2}}\,\chi\;dz\,d\!\left(\frac{\chi^2}{2}\right);

and, integrating with respect to χ² from 0 to ∞, we have

df = \frac{\left(\frac{n-3}{2}\right)!}{\sqrt{\pi}\,\left(\frac{n-4}{2}\right)!}\;\frac{dz}{(1+z^2)^{\frac{n-1}{2}}},

the Type VII curve obtained by "Student," with n reduced by unity, since we have fitted a regression line of the first degree.

Similarly, for b,

\sigma_b^2 = \frac{\sigma^2}{S(x - \bar x)^2},

and if

z = \frac{(b - \beta)\sqrt{S(x - \bar x)^2}}{\sqrt{S(y - Y)^2}},

we arrive at the same distribution as before, β being the population value of the regression coefficient.
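An illustrative sketch (not in the paper; the observations are hypothetical) of testing a slope in this way; in modern terms Fisher's z, multiplied by √(n - 2), is a Student t variate with n - 2 degrees of freedom:

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations (one y per x, no replicate arrays needed).
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5])
y = np.array([2.3, 2.1, 3.0, 3.4, 3.1, 4.0, 4.4, 4.2, 5.1, 5.3])
n = len(x)

# Least-squares coefficients of Y = a + b(x - xbar).
xbar = x.mean()
b = np.sum(y * (x - xbar)) / np.sum((x - xbar) ** 2)
a = y.mean()
Y = a + b * (x - xbar)

# Fisher's z for the hypothesis that the population slope beta = 0.
beta0 = 0.0
z = (b - beta0) * np.sqrt(np.sum((x - xbar) ** 2)) / np.sqrt(np.sum((y - Y) ** 2))

# z * sqrt(n - 2) follows Student's t with n - 2 degrees of freedom.
t = z * np.sqrt(n - 2)
P = 2 * stats.t.sf(abs(t), n - 2)      # two-sided probability
print(f"b = {b:.3f}, z = {z:.3f}, t = {t:.3f}, P = {P:.4f}")
```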
The above argument immediately extends itself to regression lines of any form and involving any number of coefficients. For, suppose the regression equation is of the form

Y = a + bX_1 + cX_2 + \ldots + kX_p,

where X_1, X_2, \ldots, X_p are orthogonal functions of x for the observed values, so that

S(X_a X_b) = 0 \qquad (a \neq b)

(in the most important case X_p will be a polynomial in x, of degree p, orthogonal to the polynomials of lower degree [9]); then, for example,

k = \frac{S(yX_p)}{S(X_p^2)}

and

\sigma_k^2 = \frac{\sigma^2}{S(X_p^2)}.

Also, if

s^2 = \frac{1}{n-p-1}\,S(y - Y)^2,

the distribution of s is given by

df = \frac{1}{\left(\frac{n-p-3}{2}\right)!}\left(\frac{\chi^2}{2}\right)^{\frac{n-p-3}{2}} e^{-\frac{\chi^2}{2}}\,d\!\left(\frac{\chi^2}{2}\right),

where

\chi^2 = \frac{(n-p-1)\,s^2}{\sigma^2}.

Consequently, if

z = \frac{(k - \kappa)\sqrt{S(X_p^2)}}{\sqrt{S(y - Y)^2}},

the distribution of z is the Type VII curve

df = \frac{\left(\frac{n-p-2}{2}\right)!}{\sqrt{\pi}\,\left(\frac{n-p-3}{2}\right)!}\;\frac{dz}{(1+z^2)^{\frac{n-p}{2}}};

and in this case, when p + 1 constants have been fitted, all the other regression coefficients will be distributed in like manner, only substituting the corresponding function of x for X_p.

Tables of the Probability Integral of the above Type VII distribution have been prepared by "Student" [8], for values of n - p from 0 to 30. These tables are in a suitable form for testing the significance of an observed regression coefficient. For larger samples the curve will be sufficiently normal for most purposes, the variance of z being

\frac{1}{n-p-3}.
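A sketch of the same test for an orthogonal-polynomial coefficient (added for illustration; the data are hypothetical, and the orthogonal functions are built here by a QR factorisation rather than taken from tables of orthogonal polynomials):

```python
import numpy as np
from scipy import stats

# Hypothetical data: one y per x, quadratic regression Y = a + b*X1 + c*X2.
x = np.linspace(0.0, 9.0, 12)
rng = np.random.default_rng(3)
y = 1.0 + 0.6 * x + 0.05 * x ** 2 + rng.normal(scale=0.5, size=x.size)
n, p = x.size, 2

# Orthogonal functions of x over the observed values (Gram-Schmidt on 1, x, x^2).
X = np.vander(x, p + 1, increasing=True)
Q, _ = np.linalg.qr(X)                    # columns are orthogonal over the sample
X1, X2 = Q[:, 1], Q[:, 2]

# Coefficients and fitted values: k = S(y X_p) / S(X_p^2).
a = y.mean()
b = np.sum(y * X1) / np.sum(X1 ** 2)
c = np.sum(y * X2) / np.sum(X2 ** 2)
Y = a + b * X1 + c * X2

# Fisher's z for the quadratic coefficient c (population value kappa = 0),
# converted to Student's t with n - p - 1 degrees of freedom.
z = c * np.sqrt(np.sum(X2 ** 2)) / np.sqrt(np.sum((y - Y) ** 2))
t = z * np.sqrt(n - p - 1)
print("t =", round(t, 3), " P =", round(2 * stats.t.sf(abs(t), n - p - 1), 4))
```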
The utility of "Student's" curve for the distribution of errors in the mean of a sample, in terms of the standard deviation as estimated from the same sample, is increased by the circumstance that the same distribution also gives that of differences between such means. Thus, if x̄ and x̄' are the means of samples of n and n', and we wish to test if the means are in sufficient agreement to warrant the belief that the samples are drawn from the same population, we may calculate

z = \frac{\bar x - \bar x'}{\sqrt{S(x - \bar x)^2 + S'(x' - \bar x')^2}}\;\sqrt{\frac{n n'}{n + n'}};

then z will be distributed so that

df = \frac{\left(\frac{n+n'-3}{2}\right)!}{\sqrt{\pi}\,\left(\frac{n+n'-4}{2}\right)!}\;\frac{dz}{(1+z^2)^{\frac{n+n'-1}{2}}}.

This method of comparison may be applied directly to regression coefficients, when the same series of values of x is observed in each case.
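A short sketch of this comparison of two means (illustrative only; the samples are hypothetical). Fisher's z, multiplied by √(n + n' - 2), is the familiar pooled-variance t statistic:

```python
import numpy as np
from scipy import stats

# Hypothetical samples of sizes n and n'.
x1 = np.array([5.1, 4.8, 5.6, 5.0, 5.3, 4.9, 5.2])
x2 = np.array([5.7, 5.9, 5.4, 6.1, 5.8, 6.0])
n1, n2 = len(x1), len(x2)

ss1 = np.sum((x1 - x1.mean()) ** 2)
ss2 = np.sum((x2 - x2.mean()) ** 2)

# Fisher's z for the difference of means.
z = (x1.mean() - x2.mean()) / np.sqrt(ss1 + ss2) * np.sqrt(n1 * n2 / (n1 + n2))

# Equivalent Student t with n + n' - 2 degrees of freedom (pooled-variance t test).
t = z * np.sqrt(n1 + n2 - 2)
P = 2 * stats.t.sf(abs(t), n1 + n2 - 2)
print(f"z = {z:.3f}, t = {t:.3f}, P = {P:.4f}")
# Cross-check against the library's pooled t test.
print(stats.ttest_ind(x1, x2, equal_var=True))
```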
The above problem, in which the errors of the coefficients of a regression of any form are considered, is in reality a special case of the multiple regression surface: special in the sense that with a single variable we can conveniently choose the terms of the regression equation so that the several terms consist of uncorrelated functions. When this is not the case we have such a regression system as

Y = b_1 x_1 + b_2 x_2 + \ldots + b_p x_p,

where x_1, x_2, \ldots, x_p are p independent variables, with certain mutual correlations. The accuracy of the regression coefficients is only affected by the correlations which appear in the sample, so that if we construct the determinant

\Delta = \begin{vmatrix} S(x_1^2) & S(x_1 x_2) & \cdots & S(x_1 x_p) \\ S(x_1 x_2) & S(x_2^2) & \cdots & S(x_2 x_p) \\ \vdots & \vdots & & \vdots \\ S(x_1 x_p) & S(x_2 x_p) & \cdots & S(x_p^2) \end{vmatrix}

from the values of the sample, then

\sigma_{b_1}^2 = \frac{\sigma^2\,\Delta_{11}}{\Delta},

where Δ₁₁ is the minor of S(x_1^2). Consequently, if

z = \frac{b_1 - \beta_1}{\sqrt{S(y - Y)^2}}\,\sqrt{\frac{\Delta}{\Delta_{11}}},

then, as before, z will be distributed in the Type VII distribution

df = \frac{\left(\frac{n-p-2}{2}\right)!}{\sqrt{\pi}\,\left(\frac{n-p-3}{2}\right)!}\;\frac{dz}{(1+z^2)^{\frac{n-p}{2}}}.
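A sketch of the multiple-regression case (added for illustration with hypothetical data), computing Δ and the minor Δ₁₁ directly and checking the result against the equivalent matrix-inverse form of the standard error:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 30, 3

# Hypothetical correlated predictors, measured from their means, and a response.
X = rng.normal(size=(n, p))
X[:, 1] += 0.6 * X[:, 0]                       # introduce mutual correlation
X -= X.mean(axis=0)
beta_true = np.array([0.8, 0.0, 0.4])
y = X @ beta_true + rng.normal(scale=1.0, size=n)
y -= y.mean()

# Least-squares coefficients and residual sum of squares.
A = X.T @ X                                    # matrix of S(x_i x_j)
b = np.linalg.solve(A, X.T @ y)
resid_ss = np.sum((y - X @ b) ** 2)

# Fisher's z for b_1 (population value beta_1 = 0): Delta and its minor Delta_11.
Delta = np.linalg.det(A)
Delta_11 = np.linalg.det(A[1:, 1:])            # minor of S(x_1^2)
z = b[0] / np.sqrt(resid_ss) * np.sqrt(Delta / Delta_11)

# Equivalent modern form: Delta_11/Delta is the (1, 1) element of A^{-1};
# with the variables measured from their means, the residual has n - p - 1 d.f.
se_b1 = np.sqrt(resid_ss / (n - p - 1) * np.linalg.inv(A)[0, 0])
t = z * np.sqrt(n - p - 1)
print("t via z      :", round(t, 3))
print("t via inverse:", round(b[0] / se_b1, 3))
print("two-sided P  :", round(2 * stats.t.sf(abs(t), n - p - 1), 4))
```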

Conclusions.

(1) In testing the fitness of regression lines account must be taken of the number of degrees of freedom which have been absorbed in the process of fitting.

(2) The Type III distribution of Elderton's tables is not exact for testing regression lines, but the tables may be used as the basis of a useful approximation.

(3) The exact distribution of χ² is given by a curve of the Pearsonian Type VI, which for large samples approaches the Type III distribution.

(4) For undifferentiated arrays the distribution of η² is given by a curve of the Pearsonian Type I; for large samples this curve approaches the Type III distribution.

(5) The distribution in random samples of a great variety of regression coefficients may be treated by the method introduced by "Student" for the distribution of the mean of a normal sample, and as in that case leads to a distribution curve of the Pearsonian Type VII, which for large samples rapidly approaches normality.

The importance of the last result is considerable. It shows that a number of regression coefficients may be safely calculated from a sample of moderate size. Thus, in studying relations of a complex kind, such as occur in agricultural meteorology, it is useful to know that we may as accurately determine thirty coefficients from a sample of sixty sets of observations as we may calculate a single coefficient, or mean, from a sample of thirty-one observations.

References.

1. K. Pearson (1916). "On the Application of Goodness of Fit Tables to Test Regression Curves and Theoretical Curves used to describe Observational or Experimental Data." Biom., XI, 239-61.
2. R. A. Fisher (1922). "On the Significance of χ² from Contingency Tables, and on the Calculation of P." J.R.S.S., LXXXV, 87-94.
3. R. A. Fisher (1915). "Frequency Distribution of the Values of the Correlation Coefficient in Samples from an Indefinitely Large Population." Biom., X, 507-21.
4. E. Slutsky (1913). "On the Criterion of Goodness of Fit of the Regression Lines, and on the best Method of fitting them to the Data." J.R.S.S., LXXVII, 78-84.
5. K. Pearson (1911). "On a Correction to be made to the Correlation Ratio." Biom., VIII, 254-6.
6. K. Pearson (1905). "On the General Theory of Skew Correlation and Non-linear Regression." Drapers' Company Research Memoirs: Dulau and Co.
7. "Student" (1908). "The Probable Error of a Mean." Biom., VI, 1-25.
8. "Student" (1917). "Tables for Estimating the Probability that the Mean of a unique Sample of Observations lies between -∞ and any given Distance of the Mean of the Population from which the Sample is drawn." Biom., XI, 414-17.
9. R. A. Fisher (1921). "An Examination of the Yield of Dressed Grain from Broadbalk." Journal of Agricultural Science, XI, 107-35.
