Fisher 1922

This document discusses the goodness of fit of regression formulas and the distribution of regression coefficients. It addresses issues with applying Pearson's chi-squared test of goodness of fit when the number of constants fitted is not properly accounted for. It also examines the exact distribution of chi-squared when the variance is estimated from the sample data rather than known, and solves the problem of the distribution of regression coefficients in small samples.


The Goodness of Fit of Regression Formulae, and the Distribution of Regression Coefficients

Author(s): R. A. Fisher
Source: Journal of the Royal Statistical Society, Vol. 85, No. 4 (Jun., 1922), pp. 597-612
Published by: Wiley for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/2341124
Accessed: 31/08/2013 13:17


MISCELLANEA.

THE GOODNESS OF FIT OF REGRESSION FORMULAE, AND THE DISTRIBUTION OF REGRESSION COEFFICIENTS.

By R. A. FISHER, M.A.

Introduction.

THE widespread desire to introduce into statistical methods some degree of critical exactitude has led to the employment, now general in careful work, of the two types of quantity which characterize modern statistics, namely, the "probable error" and the test of "goodness of fit." The test of goodness of fit was devised by Pearson, to whose labours principally we now owe it that the test may readily be applied to a great variety of questions of frequency distribution. It is an essential means of justifying a posteriori the methods which have been employed in the reduction of any body of data. Slutsky and Pearson have extended the test to apply also to the fitness of regression formulae, Pearson's correlation ratio having also been employed for this purpose.

It has been shown in a previous communication [2, Fisher, 1922] that the χ² test of goodness of fit can be accurately applied only if allowance is made for the number of constants fitted in reconstructing the theoretical population. This correction is particularly important in contingency tables, but is necessary in all cases; and the fact that it has not been recognized has led to the adoption of erroneous values in almost all the cases in which tests of goodness of fit have been employed. The values of P have been exaggerated, and it is to be feared that in many cases wrong conclusions have been drawn from the values of P obtained.

It has, therefore, been necessary to extend the examination to the tests of goodness of fit of regression lines. The errors due to neglecting the number of constants fitted are here very pronounced; but in addition other points have to be taken into consideration, which did not arise in our previous investigation. In the most important class of cases the curve of distribution of χ² is now no longer of the Pearsonian Type III, which is the basis of Elderton's tables, but of the neighbouring Type VI. Certain misconceptions also exist as to the form of the distribution of the correlation ratio, η, which we hope to have cleared up.


We have also taken the opportunity of solving the outstanding problem of the distribution of the regression coefficients in small samples.

1. The accurate application of Elderton's tables.

With any two variables x and y we shall suppose that the number of observations for which x = x_p is n_p, and the number of these for which y = y_q is n_{pq}; also that ȳ_p is the mean of the observed values of y for a given value of x, so that

n_p \bar y_p = S_p(n_{pq} y_q).

We may regard the group n_p as a random sample from a population in which the value of x is constant, but the value of y varies freely about a certain mean, m_p, with a certain standard deviation, σ_p. For such samples of n_p, therefore, the mean, ȳ_p, will vary about the same mean m_p, and since this mean of y is independent of the number of the array, m_p will be the mean of all values of ȳ_p from random samples, however the number n_p may vary.

Any opinion put forward by Professor Pearson is worthy of respect; but it is impossible to agree with his statement [1, p. 240] that "This result cannot be taken as obvious, as the size of the array in the sample varies." The fact, however, Pearson has verified for large samples as far as the third order of approximation. The difference in principle is of some importance, since the simplicity of many of the results here obtained is a consequence of the fact that we have not attempted to eliminate known quantities, given by the sample, from the distribution formulae of the statistics studied, but only the unknown quantities (parameters of the population from which the sample is drawn) which have to be estimated somewhat inexactly from the given sample.*

Next, for arrays of any given size, the standard deviation of ȳ_p is σ_p/√n_p, and it will be normally distributed if the population-array be normal, and approximately so in most cases if n_p be large. Pearson rightly points out that the values of ȳ_p for arrays of different sizes will not be normally distributed, but the distribution will be markedly leptokurtic even for considerable arrays. This result follows from the fact that the distribution is a mixture of normal distributions, having the same mean, but different standard deviations.
* Statistics whose sampling distribution depends upon other statistics given by the sample cannot, in the strict sense, fulfil the Criterion of Sufficiency. In certain cases evidently no statistic exists which strictly fulfils this criterion. In these cases statistics obtained by the Method of Maximum Likelihood appear to fulfil the Criterion of Efficiency; the extension of this criterion to finite samples thus takes a new importance.


This mixed distribution need not concern us, however, for in applying tests of fitness we do not in practice ignore the size of the array. The simple fact is that, when the population arrays are normal, the quantity

z_p = \sqrt{n_p}\,(\bar y_p - m_p)

is normally distributed about zero, with a standard deviation σ_p, and this distribution is independent of the size of the array.

In the case when the population arrays are equally variable, σ_p is constant (= σ), and if there are a arrays, the quantity

S(z_p^2) = S\{n_p(\bar y_p - m_p)^2\}

is the sum of the squares of a independent, normally and equally variable quantities, and consequently, if we write

\chi^2 \sigma^2 = S(z_p^2),

χ² will be distributed as is the ordinary measure of goodness of fit.

In applying Elderton's tables we must, of course, put n' equal to one more than the number of degrees of freedom, as I have demonstrated elsewhere [2]. If the values of m_p were known a priori, we should take n' = a + 1; but for regression formulae fitted to the data by equations linear in ȳ_p we merely reduce the number of degrees of freedom by the number of constants fitted. Thus, if m_p is a linear function of x, and a straight line is fitted, we have n' = a - 1, and the value of χ² then constitutes a test of whether or not m_p is in reality adequately represented by a linear function of x. Similarly, if a cubic polynomial in x be fitted, we have n' = a - 3.
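The rule above lends itself to a short numerical sketch (added here; it is not part of Fisher's text, and the array positions, counts, means and the value of σ below are hypothetical). It computes χ² for a straight line fitted to the array means and enters the chi-squared distribution with a - 2 degrees of freedom, i.e. Elderton's n' = a - 1:

```python
import numpy as np
from scipy import stats

# Hypothetical grouped data: a arrays of y at fixed values of x,
# with the within-array standard deviation sigma assumed known a priori.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])          # array positions
n_p = np.array([8, 12, 10, 9, 11, 10])                 # observations per array
ybar_p = np.array([2.1, 2.9, 4.2, 5.0, 6.1, 6.8])      # observed array means
sigma = 0.9                                            # known standard deviation

# Fit a straight line to the array means, weighting each mean by n_p
# (p + 1 = 2 constants fitted).
coef = np.polyfit(x, ybar_p, deg=1, w=np.sqrt(n_p))
Y_p = np.polyval(coef, x)

# Fisher's chi-squared for the fit of the regression line.
chi2 = np.sum(n_p * (ybar_p - Y_p) ** 2) / sigma ** 2

# Degrees of freedom: a arrays minus the 2 fitted constants,
# i.e. Elderton's n' = a - 1, so df = n' - 1 = a - 2 here.
a = len(x)
df = a - 2
P = stats.chi2.sf(chi2, df)
print(f"chi2 = {chi2:.3f}, df = {df}, P = {P:.4f}")
```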
2. The exact distribution of χ² when σ is determined from the data.
So far the results are exact on the assumption that σ is known; but as in practice σ must usually be obtained from the data, errors will be introduced from this source which necessarily influence the distribution of χ². It is true that σ may be estimated from the whole data, and is therefore known with accuracy of a higher order than the quantities which contribute to χ²; nevertheless it is necessary to determine what aberrations are to be expected when the data are not very numerous.

From each array we can directly calculate the second moment s_p², and it has been shown [3] that the second moment of a normal sample of n_p is so distributed that the frequency with which it falls into the range ds_p² is proportional to

\sigma^{-(n_p - 1)}\,(s_p^2)^{\frac{n_p - 3}{2}}\,e^{-\frac{n_p s_p^2}{2\sigma^2}}\;d(s_p^2);

the chance that all the observed values of s_p² fall in assigned ranges is the product of a such quantities, for all are distributed independently; consequently, to find the optimum value of σ, which will also be the value with the least probable error, we must make this product a maximum for variations of σ.

Taking logarithms and differentiating, we have

\frac{\partial L}{\partial \sigma} = \frac{S(n_p s_p^2)}{\sigma^3} - \frac{S(n_p - 1)}{\sigma},

whence the optimum value of σ² is s², where

(N - a)\,s^2 = S(n_p s_p^2).

We shall, therefore, suppose that σ is estimated by this method, and that

\chi^2 = \frac{S(z^2)}{s^2};

we must now find the distribution of this statistic.


The distribution of s² is of the same kind as those with which we have been concerned. For

S(n_p s_p^2) = SS\{n_{pq}(y - \bar y_p)^2\},

and may be regarded as the sum of the squares of N equally variable quantities, independent save for a linear restrictions, one for each array, of the form

S_p(y) = n_p \bar y_p.

If, therefore, we specify the distribution of s² in such a way as to express the frequency element, df, in terms of the variate element within which it occurs, we shall have

df \propto t^{\frac{N-a-2}{2}}\,e^{-\frac{t}{2\sigma^2}}\,dt,

where t stands for s²(N - a). In the same way, if T stand for χ²s², we have, if p + 1 constants have been used in fitting,

df \propto T^{\frac{a-p-3}{2}}\,e^{-\frac{T}{2\sigma^2}}\,dT,

and these two distributions are independent, for the one depends only on the deviations from the means of normal samples, and the other only on the means.

The distribution of χ² will now be that of (N - a)T/t; so, substituting

T = \frac{\chi^2 t}{N - a}

in

t^{\frac{N-a-2}{2}}\,T^{\frac{a-p-3}{2}}\,e^{-\frac{t+T}{2\sigma^2}}\,dt\,dT,

we obtain, ignoring constants,

(\chi^2)^{\frac{a-p-3}{2}}\,t^{\frac{N-p-3}{2}}\,e^{-\frac{t}{2\sigma^2}\left(1 + \frac{\chi^2}{N-a}\right)}\,dt\,d\chi^2,

and so, on integrating from 0 to ∞ with respect to t,

(\chi^2)^{\frac{a-p-3}{2}}\left(1 + \frac{\chi^2}{N-a}\right)^{-\frac{N-p-1}{2}}\,d\chi^2.

The variation in s², therefore, changes the exact form of the distribution curve for χ² from Type III to Type VI. The change is, however, very small if N be large, for as N increases

\left(1 + \frac{\chi^2}{N-a}\right)^{-\frac{N-p-1}{2}} \to e^{-\frac{\chi^2}{2}},

and so reproduces the Type III distribution.
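As an illustrative check (not in the original), one may note that the Type VI variate above equals (a - p - 1) times an F variate with a - p - 1 and N - a degrees of freedom; a brief simulation with hypothetical values of a, N and p shows how its tail differs from the Type III (ordinary chi-squared) tail:

```python
import numpy as np
from scipy import stats

# Hypothetical sizes: a arrays, N observations in all, p + 1 = 2 constants fitted.
a, N, p = 10, 40, 1
df = a - p - 1                     # degrees of freedom of the Type III curve

# Monte Carlo version of the argument above: T/sigma^2 and t/sigma^2 are independent
# chi-squared variates with a - p - 1 and N - a degrees of freedom, and Fisher's
# chi-squared statistic (computed with the pooled estimate s^2) is (N - a) T / t.
rng = np.random.default_rng(1)
T = rng.chisquare(df, size=200_000)
t = rng.chisquare(N - a, size=200_000)
chi2_fisher = (N - a) * T / t

x0 = 16.0                          # an arbitrary observed value of chi-squared
print("simulated P     :", np.mean(chi2_fisher > x0))
print("Type VI (exact) :", stats.f.sf(x0 / df, df, N - a))   # df * F(df, N - a)
print("Type III approx.:", stats.chi2.sf(x0, df))
```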

3. The nature of the approximation of the Type VI curve

df = \frac{(N-a)^{-\frac{a-p-1}{2}}\left(\frac{N-p-3}{2}\right)!}{\left(\frac{N-a-2}{2}\right)!\,\left(\frac{a-p-3}{2}\right)!}\;x^{\frac{a-p-3}{2}}\left(1 + \frac{x}{N-a}\right)^{-\frac{N-p-1}{2}}\,dx

to the Type III curve

df = \frac{1}{2^{\frac{a-p-1}{2}}\left(\frac{a-p-3}{2}\right)!}\;x^{\frac{a-p-3}{2}}\,e^{-\frac{x}{2}}\,dx.

When x is small, the two curves have closely similar forms, the latter being the distribution of χ², as given by Elderton's table, when n' = a - p. The ratio of the ordinates at the terminus of the curve is obtained by expanding the constant multiplier of the first curve in powers of N⁻¹. It reduces to

1 + \frac{(n'-1)(n'-3)}{4N}

for high values of P; therefore, 1 - P, as given by Elderton's table, may be corrected by multiplying by this factor.

* The symbol x! is used throughout this paper as equivalent to Γ(x + 1), whether x is an integer or not.


Near the centre of the curve we may observe the position of the mean and mode.

              Mean.                           Mode.
Type III ...  a - p - 1                       a - p - 3
Type VI  ...  (a - p - 1)(N-a)/(N-a-2)        (a - p - 3)(N-a)/(N-a+2)

The mean, therefore, is raised and the mode lowered in about the same proportion. For the higher values of x the curves are not closely similar, and since it is for these especially that the value of P is required, we shall obtain the necessary correction in P, as far as the terms in N⁻¹. The ratio of the ordinates is

1 + \frac{1}{4N}\left\{x^2 - 2(n'-1)x + (n'-1)(n'-3)\right\};

but, since

P_{n'}(x) = \frac{1}{2^{\frac{n'-1}{2}}\left(\frac{n'-3}{2}\right)!}\int_x^{\infty} x^{\frac{n'-3}{2}}\,e^{-\frac{x}{2}}\,dx,

we have the correction

\frac{1}{4N}\left\{(n'-1)(n'+1)\,P_{n'+4} - 2(n'-1)^2\,P_{n'+2} + (n'-1)(n'-3)\,P_{n'}\right\}
= \frac{n'-1}{4N}\left\{(n'+1)\,P_{n'+4} - 2(n'-1)\,P_{n'+2} + (n'-3)\,P_{n'}\right\},

to be added to the value of P; this correction, in the absence of tables of the Type VI curve, will usually be found adequate.
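A small numerical sketch of how this correction might be applied (added for illustration; the values of n', N and χ² are hypothetical, and the exact Type VI tail is obtained through the equivalent F form noted earlier):

```python
from scipy import stats

def corrected_P(chi2, n_prime, N):
    """Type III value of P (Elderton's table, entered with n') plus Fisher's 1/N correction."""
    P = lambda m: stats.chi2.sf(chi2, m - 1)   # table entered with n' = m has m - 1 d.f.
    corr = (n_prime - 1) / (4 * N) * (
        (n_prime + 1) * P(n_prime + 4)
        - 2 * (n_prime - 1) * P(n_prime + 2)
        + (n_prime - 3) * P(n_prime)
    )
    return P(n_prime) + corr

# Hypothetical case: a = 11 arrays, straight line fitted (p = 1, n' = a - p = 10),
# N = 60 observations in all, observed chi-squared of 18.
a, p, N, chi2_obs = 11, 1, 60, 18.0
n_prime = a - p
print("Type III P :", stats.chi2.sf(chi2_obs, n_prime - 1))
print("corrected P:", corrected_P(chi2_obs, n_prime, N))
# Exact Type VI value, chi^2 being (a - p - 1) times an F(a - p - 1, N - a) variate.
print("Type VI P  :", stats.f.sf(chi2_obs / (a - p - 1), a - p - 1, N - a))
```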

4. The correlation ratio.

We are now in a position to make an accurate use of the correlation ratio as a test of the fitness of regression formulae. Let Y be the function of x used as regression formula, and let

N R^2 s_y^2 = S\{n_p(Y_p - \bar y)^2\}, \qquad N s_y^2 = S(y - \bar y)^2,

where ȳ is the mean of all the observed values of y; then it is easy to see that, provided Y has been fitted to the data so that

S\{n_p(\bar y_p - Y_p)^2\}

is a minimum for proportional variations of Y - ȳ, then

N(1 - R^2)\,s_y^2 = SS\{n_{pq}(y - Y_p)^2\}.


But the correlation ratio is given by the parallel formula

N(1 - \eta^2)\,s_y^2 = SS\{n_{pq}(y - \bar y_p)^2\} = (N - a)\,s^2;

hence, by subtraction,

N(\eta^2 - R^2)\,s_y^2 = S\{n_p(\bar y_p - Y_p)^2\} = \chi^2 s^2,

so that

\chi^2 = \frac{N s_y^2 (\eta^2 - R^2)}{s^2}.

In other words,

\chi^2 = (N - a)\,\frac{\eta^2 - R^2}{1 - \eta^2},

and to test the significance of η² - R² we enter Elderton's table with n' = a - p, where p + 1 is the number of constants fitted to the regression line. Thus, for a linear regression formula,

\chi^2 = (N - a)\,\frac{\eta^2 - r^2}{1 - \eta^2} \quad\text{and}\quad n' = a - 1,

using, if necessary, the correction for Type VI as before.

The exact form of the distribution of η itself would be difficult to obtain, but in practice η is usually employed to test the validity of a linear or other regression formula. For this purpose it is not the distribution of η but of the more variable quantity (η² - R²)/(1 - η²) that is required, and the above expressions show that it is approximately represented by a Type III curve, and that the probability of a greater discrepancy occurring by chance may be obtained from Elderton's table.
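An illustrative sketch of this use of the correlation ratio (not from the paper; the arrays and observations are hypothetical), testing whether a straight line adequately represents the array means:

```python
import numpy as np
from scipy import stats

# Hypothetical grouped data: observations y listed by array (fixed x).
x_vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
arrays = [np.array([1.8, 2.3, 2.0, 2.6]),
          np.array([3.4, 2.9, 3.8, 3.1, 3.3]),
          np.array([4.0, 4.6, 4.3, 4.9]),
          np.array([4.8, 5.5, 5.1, 5.0, 5.4]),
          np.array([5.6, 6.2, 5.9, 6.4])]

y_all  = np.concatenate(arrays)
N, a   = len(y_all), len(arrays)
n_p    = np.array([len(g) for g in arrays])
ybar_p = np.array([g.mean() for g in arrays])
ybar   = y_all.mean()

# Correlation ratio eta^2 and the corresponding R^2 for a fitted straight line.
s_y2 = np.sum((y_all - ybar) ** 2) / N
eta2 = np.sum(n_p * (ybar_p - ybar) ** 2) / (N * s_y2)
coef = np.polyfit(x_vals, ybar_p, deg=1, w=np.sqrt(n_p))   # p + 1 = 2 constants
Y_p  = np.polyval(coef, x_vals)
R2   = np.sum(n_p * (Y_p - ybar) ** 2) / (N * s_y2)

# Fisher's test of eta^2 - R^2: chi^2 = (N - a)(eta^2 - R^2)/(1 - eta^2),
# entered with n' = a - p, i.e. a - 2 degrees of freedom for a straight line.
chi2 = (N - a) * (eta2 - R2) / (1 - eta2)
dfree = a - 2
print(f"eta^2 = {eta2:.3f}, R^2 = {R2:.3f}, chi2 = {chi2:.3f}, "
      f"P = {stats.chi2.sf(chi2, dfree):.3f}")
```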

5. Comparison with previous formulae.

Slutsky, in his method [4, p. 83] of treating homoscedastic data, has used a process analogous to that arrived at above, but with four deviations:

(i) He averages the standard deviations of the arrays, and not their squares, in estimating the value of σ².
(ii) He divides his total by N instead of N - a.
(iii) He enters Elderton's table with n' = a + 1, instead of n' = a - p.
(iv) He takes the Type III distribution to be exact.

(i) Pearson [1, pp. 249-51] has criticized the first point, but his practice is not quite explicit. In his opinion evidently, if the surface is homoscedastic, we must take s_y²(1 - η²), but in the special case when the regression is also linear he replaces 1 - η²
by 1 - r². The point is not one of importance, and I am not convinced that any material difference would be made by replacing 1 - η² by 1 - R² in general, when the regression is well fitted. There would seem to be no reason for treating linear differently from other regression formulae. In dealing with Slutsky's price data, where the regression is doubtfully linear, Pearson prefers to use 1 - r².

(ii) The second point is, strictly, a matter of convenience, for when we know the distribution of χ², calculated by one method, we also know its distribution in the second case. Since neither of these distributions is exactly the Type III tabled by Elderton, we are free to use whichever we please. The form we have chosen has the advantage of involving the best estimate of σ, and we have chosen it for this reason; but as in the Type VI distribution errors of estimation are completely eliminated, this choice has only the force of a convention. The close agreement of the curve we have obtained with the corresponding Type III in the neighbourhood of the median is a practical advantage; it should in any case be noted that the corrections which we have obtained for P refer only to our own form of the statistic χ².

Although strictly a matter of convenience, there is a real advantage, when the matter is approached from other points of view, in the use of the best estimates. Thus, for example, when the arrays are undifferentiated with respect to the distribution of y, we naturally take

\frac{1}{N-1}\,S(y - \bar y)^2

as the best estimate of the variance of the whole of the observations; and as the arrays are undifferentiated this should agree on the average with our estimate of the variance in each array,

\frac{1}{N-a}\,SS\{n_{pq}(y - \bar y_p)^2\}.

Now

(1 - \eta^2)\,S(y - \bar y)^2 = SS\{n_{pq}(y - \bar y_p)^2\};

whence it follows that the mean value of 1 - η² is

\frac{N-a}{N-1},

and that of η², therefore,

\frac{a-1}{N-1}.

Pearson has discussed the distribution of η in this case [5].

Observing that, even if the arrays are wholly undifferentiated, η will necessarily be positive, he points out that, in testing whether η differs significantly from zero, it is not only necessary to know the standard error of η, but also the mean value about which it varies. The standard error of η for undifferentiated arrays he had previously [6] evaluated at 1/√N, and he then by a somewhat intricate method finds for the mean value of η² the value

\frac{a-1}{N},

and deduces that the mean value of η will be

\sqrt{\frac{a-1}{N}},

the latter deduction being clearly a slip.

In the case under consideration we have p = 0, R = 0, the regression line fitted being Y = ȳ. Then

(N - a)\,\frac{\eta^2}{1 - \eta^2}

will be distributed in the Type VI curve

df = \frac{(N-a)^{-\frac{a-1}{2}}\left(\frac{N-3}{2}\right)!}{\left(\frac{N-a-2}{2}\right)!\,\left(\frac{a-3}{2}\right)!}\;x^{\frac{a-3}{2}}\left(1 + \frac{x}{N-a}\right)^{-\frac{N-1}{2}}\,dx;

whence, substituting for x, we find that η² is distributed in the Type I curve

df = \frac{\left(\frac{N-3}{2}\right)!}{\left(\frac{N-a-2}{2}\right)!\,\left(\frac{a-3}{2}\right)!}\;(\eta^2)^{\frac{a-3}{2}}\,(1 - \eta^2)^{\frac{N-a-2}{2}}\,d\eta^2.

For large values of N the distribution of η does not tend to normality, as Pearson supposed, but that of Nη² tends to a Type III curve. For the mean values of η and η² we have

\bar\eta = \frac{\left(\frac{a-2}{2}\right)!\,\left(\frac{N-3}{2}\right)!}{\left(\frac{a-3}{2}\right)!\,\left(\frac{N-2}{2}\right)!},

or, approximately,

\bar\eta = \sqrt{\frac{a - \frac{3}{2}}{N}}\left(1 + \frac{3}{4N}\right),
while

\overline{\eta^2} = \frac{a-1}{N-1},

in agreement with our previous value.

The mean value for η² thus agrees sufficiently with that obtained by Pearson, but the accurate values for the mean and the standard deviation differ from his values. There is no purpose in pressing further a comparison on these lines, since, unless the number of arrays be large, the distribution of η is far from normal, and the significance of an observed value of η may be tested with some accuracy by the use of χ².

It may be noticed that, when the number of arrays is large,

\overline{\eta^2} - \bar\eta^2 = \frac{1}{2N}\left(1 - \frac{a}{N}\right)

to a first approximation, of which the second factor may usually be ignored.
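A brief simulation sketch (added for illustration, with hypothetical numbers of arrays and observations) of the Type I result: for wholly undifferentiated arrays, η² should follow a Beta((a - 1)/2, (N - a)/2) law with mean (a - 1)/(N - 1):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a, n_per = 6, 5                    # hypothetical: a arrays of n_per observations each
N = a * n_per

def eta_squared(y):
    """Correlation ratio eta^2 for equal arrays laid out as an (a, n_per) matrix."""
    ybar_p = y.mean(axis=1)
    ybar = y.mean()
    return n_per * np.sum((ybar_p - ybar) ** 2) / np.sum((y - ybar) ** 2)

# Undifferentiated arrays: every observation drawn from one and the same population.
sims = np.array([eta_squared(rng.normal(size=(a, n_per))) for _ in range(50_000)])

print("simulated mean of eta^2:", sims.mean())
print("(a - 1)/(N - 1)        :", (a - 1) / (N - 1))
# Fisher's Type I curve for eta^2 is a Beta((a - 1)/2, (N - a)/2) distribution.
print("simulated P(eta^2 > .4):", np.mean(sims > 0.4))
print("Beta tail at 0.4       :", stats.beta.sf(0.4, (a - 1) / 2, (N - a) / 2))
```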
(iii) The third point of difference between my method and those of Slutsky and Pearson, whereby I have made allowance for the number of constants involved in fitting the regression formula, has been more fully explained in a recent paper [2].

It is there shown that if

\chi^2 = S\left\{\frac{(n_p - \tilde n_p)^2}{\tilde n_p}\right\},

where \tilde n_p is the number of observations expected, and n_p the number observed, in any cell, then the value of n' with which Elderton's table should be entered is not the total number of cells, but one more than the number of values of

n_p - \tilde n_p

which can be independently specified. That is to say, that when the values of \tilde n_p are reconstructed from the data of the sample, (n' - 1) is the number of degrees of freedom left after making this reconstruction.

In the same way for regression lines

\chi^2 = \frac{1}{\sigma^2}\,S\{n_p(\bar y_p - Y_p)^2\},

and, if a is the number of arrays, n' - 1 = a only if the values of Y_p are assigned independently of the sample. If, as is more usually the case, the values of Y_p are those of a regression formula fitted to the sample, the number of values of

\bar y_p - Y_p


which can be independently specified is reduced by the number of constants fitted. For example, if a cubic polynomial has been fitted, the number of degrees of freedom is (a - 4), so that n' = a - 3.

6. The distribution of regression coefficients.

Hitherto we have only considered data in which a number of values of y are observed corresponding in groups to identical values of x; little statistical or physical data is strictly of this form, although the former may in favourable cases be confidently grouped, so as to simulate the kind of data for which the fitness of regression lines may be tested. The limitation of our methods to data of this form constitutes one of the most serious deficiencies in the statistical methods so far available. The position is well stated by Pearson [1, p. 258]:

"Of course it is needful for a test of this kind that the number of measurements of A, 'the dependent variable,' should considerably exceed the number of values of B tested. It would fail entirely if only one value of A were taken for each value of B, however numerous the latter might be. We must have some basis on which to determine the error made in a single determination of A. This is a point, I think, often overlooked by the physicist. A fairly good determination, I mean a quantitative determination, of the goodness of fit of theory to observation could be made from ten series of eight observations of A corresponding to ten values of B. But no measure of goodness of fit could be obtained from eighty observations of A corresponding to eighty values of B, yet the latter system would probably make the greater appeal to most physicists. I do not see how quantitatively to obtain any measure of the goodness of fit of theory to observation in the latter method of procedure."

It appears to the writer that the problem is one rather for the statistician than for the physicist; for, given equally variable arrays, and a regression line of known form, the problem is perfectly objective. I emphasize it here as a problem awaiting solution, a manageable solution of which would be of great practical utility. That it is an objective problem is clear from the confidence with which very bad fits will be rejected at sight, as also from the fact that rough and common-sense methods of testing have been developed for some purposes [9, Fisher, 1921].

Although exact methods of testing the goodness of fit of regression lines are not available for the extended class of data, we are in a position to give an exact solution of the distribution of the regression coefficients.

This problem has been outstanding for many years; but the need for its solution was recently brought home to the writer by correspondence with "Student," whose brilliant researches [7] in 1908 form the basis of the exact solution.

For consider a simple linear regression formula

Y = a + b(x - \bar x),

of which the coefficients a and b are calculated by the equations

a = \bar y, \qquad b = \frac{S\{y(x - \bar x)\}}{S(x - \bar x)^2};

we note first that a and b are orthogonal functions, in that, given the series of values of x observed, their sampling variation is independent.

Now "Student" [7] has shown how the probable error of a may be calculated; for if for a given value of x the standard deviation of y is σ, then a will be normally distributed, so that

\sigma_a^2 = \frac{\sigma^2}{n}.

So that if α is the population value of a, then (a - α)√n/σ is normally distributed about zero with standard deviation unity. If σ² is unknown, the best estimate that can be made of it from the sample is

s^2 = \frac{1}{n-2}\,S(y - Y)^2,

where the sum is divided by (n - 2) to allow for the two constants used in fitting the regression line. Then the distribution of s² is, if

\chi^2 = \frac{(n-2)\,s^2}{\sigma^2},

df = \frac{1}{\left(\frac{n-4}{2}\right)!}\left(\frac{\chi^2}{2}\right)^{\frac{n-4}{2}} e^{-\frac{\chi^2}{2}}\,d\!\left(\frac{\chi^2}{2}\right).

The distributions of the two quantities s and a are wholly independent; hence, following "Student," we find the distribution of a quantity completely calculable from the sample, namely,

z = \frac{(a - \alpha)\sqrt{n}}{\sqrt{S(y - Y)^2}}.

For

df = \frac{1}{\left(\frac{n-4}{2}\right)!}\left(\frac{\chi^2}{2}\right)^{\frac{n-4}{2}} e^{-\frac{\chi^2}{2}}\cdot\frac{1}{\sqrt{2\pi}}\,e^{-\frac{\chi^2 z^2}{2}}\,\chi\;dz\,d\!\left(\frac{\chi^2}{2}\right);

and, integrating with respect to χ² from 0 to ∞, we have

df = \frac{\left(\frac{n-3}{2}\right)!}{\sqrt{\pi}\,\left(\frac{n-4}{2}\right)!}\;\frac{dz}{(1+z^2)^{\frac{n-1}{2}}},

the Type VII curve obtained by "Student," with n reduced by unity, since we have fitted a regression line of the first degree.

Similarly, for b,

\sigma_b^2 = \frac{\sigma^2}{S(x - \bar x)^2},

and if

z = \frac{(b - \beta)\sqrt{S(x - \bar x)^2}}{\sqrt{S(y - Y)^2}},

we arrive at the same distribution as before, β being the population value of the regression coefficient.
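An illustrative sketch (not in the paper; the observations are hypothetical) of testing a slope in this way; in modern terms Fisher's z, multiplied by √(n - 2), is a Student t variate with n - 2 degrees of freedom:

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations (one y per x, no replicate arrays needed).
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5])
y = np.array([2.3, 2.1, 3.0, 3.4, 3.1, 4.0, 4.4, 4.2, 5.1, 5.3])
n = len(x)

# Least-squares coefficients of Y = a + b(x - xbar).
xbar = x.mean()
b = np.sum(y * (x - xbar)) / np.sum((x - xbar) ** 2)
a = y.mean()
Y = a + b * (x - xbar)

# Fisher's z for the hypothesis that the population slope beta = 0.
beta0 = 0.0
z = (b - beta0) * np.sqrt(np.sum((x - xbar) ** 2)) / np.sqrt(np.sum((y - Y) ** 2))

# z * sqrt(n - 2) follows Student's t with n - 2 degrees of freedom.
t = z * np.sqrt(n - 2)
P = 2 * stats.t.sf(abs(t), n - 2)      # two-sided probability
print(f"b = {b:.3f}, z = {z:.3f}, t = {t:.3f}, P = {P:.4f}")
```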
The above argument immediately extends itself to regression lines of any form and involving any number of coefficients. For, suppose the regression equation is of the form

Y = a + bX_1 + cX_2 + \ldots + kX_p,

where X_1, X_2, \ldots, X_p are orthogonal functions of x for the observed values, so that

S(X_a X_b) = 0 \qquad (a \neq b)

(in the most important case X_p will be a polynomial in x, of degree p, orthogonal to the polynomials of lower degree [9]); then, for example,

k = \frac{S(yX_p)}{S(X_p^2)}

and

\sigma_k^2 = \frac{\sigma^2}{S(X_p^2)}.

Also, if

s^2 = \frac{1}{n-p-1}\,S(y - Y)^2,

the distribution of s is given by

df = \frac{1}{\left(\frac{n-p-3}{2}\right)!}\left(\frac{\chi^2}{2}\right)^{\frac{n-p-3}{2}} e^{-\frac{\chi^2}{2}}\,d\!\left(\frac{\chi^2}{2}\right),

where

\chi^2 = \frac{(n-p-1)\,s^2}{\sigma^2}.

Consequently, if

z = \frac{(k - \kappa)\sqrt{S(X_p^2)}}{\sqrt{S(y - Y)^2}},

the distribution of z is the Type VII curve

df = \frac{\left(\frac{n-p-2}{2}\right)!}{\sqrt{\pi}\,\left(\frac{n-p-3}{2}\right)!}\;\frac{dz}{(1+z^2)^{\frac{n-p}{2}}};

and in this case, when p + 1 constants have been fitted, all the other regression coefficients will be distributed in like manner, only substituting the corresponding function of x for X_p.

Tables of the Probability Integral of the above Type VII distribution have been prepared by "Student" [8], for values of n - p from 0 to 30. These tables are in a suitable form for testing the significance of an observed regression coefficient. For larger samples the curve will be sufficiently normal for most purposes, the variance of z being

\frac{1}{n-p-3}.
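A sketch of the same test for an orthogonal-polynomial coefficient (added for illustration; the data are hypothetical, and the orthogonal functions are built here by a QR factorisation rather than taken from tables of orthogonal polynomials):

```python
import numpy as np
from scipy import stats

# Hypothetical data: one y per x, quadratic regression Y = a + b*X1 + c*X2.
x = np.linspace(0.0, 9.0, 12)
rng = np.random.default_rng(3)
y = 1.0 + 0.6 * x + 0.05 * x ** 2 + rng.normal(scale=0.5, size=x.size)
n, p = x.size, 2

# Orthogonal functions of x over the observed values (Gram-Schmidt on 1, x, x^2).
X = np.vander(x, p + 1, increasing=True)
Q, _ = np.linalg.qr(X)                    # columns are orthogonal over the sample
X1, X2 = Q[:, 1], Q[:, 2]

# Coefficients and fitted values: k = S(y X_p) / S(X_p^2).
a = y.mean()
b = np.sum(y * X1) / np.sum(X1 ** 2)
c = np.sum(y * X2) / np.sum(X2 ** 2)
Y = a + b * X1 + c * X2

# Fisher's z for the quadratic coefficient c (population value kappa = 0),
# converted to Student's t with n - p - 1 degrees of freedom.
z = c * np.sqrt(np.sum(X2 ** 2)) / np.sqrt(np.sum((y - Y) ** 2))
t = z * np.sqrt(n - p - 1)
print("t =", round(t, 3), " P =", round(2 * stats.t.sf(abs(t), n - p - 1), 4))
```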
The utility of "Student's" curve for the distribution of errors in the mean of a sample, in terms of the standard deviation as estimated from the same sample, is increased by the circumstance that the same distribution also gives that of differences between such means. Thus, if x̄ and x̄' are the means of samples of n and n', and we wish to test if the means are in sufficient agreement to warrant the belief that the samples are drawn from the same population, we may calculate

z = \frac{\bar x - \bar x'}{\sqrt{S(x - \bar x)^2 + S'(x' - \bar x')^2}}\;\sqrt{\frac{n n'}{n + n'}};

then z will be distributed so that

df = \frac{\left(\frac{n+n'-3}{2}\right)!}{\sqrt{\pi}\,\left(\frac{n+n'-4}{2}\right)!}\;\frac{dz}{(1+z^2)^{\frac{n+n'-1}{2}}}.

This method of comparison may be applied directly to regression coefficients, when the same series of values of x is observed in each case.
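A short sketch of this comparison of two means (illustrative only; the samples are hypothetical). Fisher's z, multiplied by √(n + n' - 2), is the familiar pooled-variance t statistic:

```python
import numpy as np
from scipy import stats

# Hypothetical samples of sizes n and n'.
x1 = np.array([5.1, 4.8, 5.6, 5.0, 5.3, 4.9, 5.2])
x2 = np.array([5.7, 5.9, 5.4, 6.1, 5.8, 6.0])
n1, n2 = len(x1), len(x2)

ss1 = np.sum((x1 - x1.mean()) ** 2)
ss2 = np.sum((x2 - x2.mean()) ** 2)

# Fisher's z for the difference of means.
z = (x1.mean() - x2.mean()) / np.sqrt(ss1 + ss2) * np.sqrt(n1 * n2 / (n1 + n2))

# Equivalent Student t with n + n' - 2 degrees of freedom (pooled-variance t test).
t = z * np.sqrt(n1 + n2 - 2)
P = 2 * stats.t.sf(abs(t), n1 + n2 - 2)
print(f"z = {z:.3f}, t = {t:.3f}, P = {P:.4f}")
# Cross-check against the library's pooled t test.
print(stats.ttest_ind(x1, x2, equal_var=True))
```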
The above problem, in which the errors of the coefficients of a regression of any form are considered, is in reality a special case of the multiple regression surface: special in the sense that with a single variable we can conveniently choose the terms of the regression equation so that the several terms consist of uncorrelated functions. When this is not the case we have such a regression system as

Y = b_1 x_1 + b_2 x_2 + \ldots + b_p x_p,

where x_1, x_2, \ldots, x_p are p independent variables, with certain mutual correlations. The accuracy of the regression coefficients is only affected by the correlations which appear in the sample, so that if we construct the determinant

\Delta = \begin{vmatrix} S(x_1^2) & S(x_1 x_2) & \cdots & S(x_1 x_p) \\ S(x_1 x_2) & S(x_2^2) & \cdots & S(x_2 x_p) \\ \vdots & \vdots & & \vdots \\ S(x_1 x_p) & S(x_2 x_p) & \cdots & S(x_p^2) \end{vmatrix}

from the values of the sample, then

\sigma_{b_1}^2 = \frac{\sigma^2\,\Delta_{11}}{\Delta},

where Δ₁₁ is the minor of S(x_1^2). Consequently, if

z = \frac{b_1 - \beta_1}{\sqrt{S(y - Y)^2}}\,\sqrt{\frac{\Delta}{\Delta_{11}}},

then, as before, z will be distributed in the Type VII distribution

df = \frac{\left(\frac{n-p-2}{2}\right)!}{\sqrt{\pi}\,\left(\frac{n-p-3}{2}\right)!}\;\frac{dz}{(1+z^2)^{\frac{n-p}{2}}}.
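A sketch of the multiple-regression case (added for illustration with hypothetical data), computing Δ and the minor Δ₁₁ directly and checking the result against the equivalent matrix-inverse form of the standard error:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 30, 3

# Hypothetical correlated predictors, measured from their means, and a response.
X = rng.normal(size=(n, p))
X[:, 1] += 0.6 * X[:, 0]                       # introduce mutual correlation
X -= X.mean(axis=0)
beta_true = np.array([0.8, 0.0, 0.4])
y = X @ beta_true + rng.normal(scale=1.0, size=n)
y -= y.mean()

# Least-squares coefficients and residual sum of squares.
A = X.T @ X                                    # matrix of S(x_i x_j)
b = np.linalg.solve(A, X.T @ y)
resid_ss = np.sum((y - X @ b) ** 2)

# Fisher's z for b_1 (population value beta_1 = 0): Delta and its minor Delta_11.
Delta = np.linalg.det(A)
Delta_11 = np.linalg.det(A[1:, 1:])            # minor of S(x_1^2)
z = b[0] / np.sqrt(resid_ss) * np.sqrt(Delta / Delta_11)

# Equivalent modern form: Delta_11/Delta is the (1, 1) element of A^{-1};
# with the variables measured from their means, the residual has n - p - 1 d.f.
se_b1 = np.sqrt(resid_ss / (n - p - 1) * np.linalg.inv(A)[0, 0])
t = z * np.sqrt(n - p - 1)
print("t via z      :", round(t, 3))
print("t via inverse:", round(b[0] / se_b1, 3))
print("two-sided P  :", round(2 * stats.t.sf(abs(t), n - p - 1), 4))
```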

Conclusions.

(1) In testing the fitness of regression lines account must be taken of the number of degrees of freedom which have been absorbed in the process of fitting.

(2) The Type III distribution of Elderton's tables is not exact for testing regression lines, but the tables may be used as the basis of a useful approximation.

(3) The exact distribution of χ² is given by a curve of the Pearsonian Type VI, which for large samples approaches the Type III distribution.

(4) For undifferentiated arrays the distribution of η² is given by a curve of the Pearsonian Type I; for large samples this curve approaches the Type III distribution.

(5) The distribution in random samples of a great variety of regression coefficients may be treated by the method introduced by "Student" for the distribution of the mean of a normal sample, and as in that case leads to a distribution curve of the Pearsonian Type VII, which for large samples rapidly approaches normality.

The importance of the last result is considerable. It shows that a number of regression coefficients may be safely calculated from a sample of moderate size. Thus, in studying relations of a complex kind, such as occur in agricultural meteorology, it is useful to know that we may as accurately determine thirty coefficients from a sample of sixty sets of observations as we may calculate a single coefficient, or mean, from a sample of thirty-one observations.

References.

1. K. Pearson (1916). "On the Application of Goodness of Fit Tables to Test Regression Curves and Theoretical Curves used to describe Observational or Experimental Data." Biom., XI, 239-61.
2. R. A. Fisher (1922). "On the Significance of χ² from Contingency Tables, and on the Calculation of P." J.R.S.S., LXXXV, 87-94.
3. R. A. Fisher (1915). "Frequency Distribution of the Values of the Correlation Coefficient in Samples from an Indefinitely Large Population." Biom., X, 507-21.
4. E. Slutsky (1913). "On the Criterion of Goodness of Fit of the Regression Lines, and on the best Method of fitting them to the Data." J.R.S.S., LXXVII, 78-84.
5. K. Pearson (1911). "On a Correction to be made to the Correlation Ratio." Biom., VIII, 254-6.
6. K. Pearson (1905). "On the General Theory of Skew Correlation and Non-linear Regression." Drapers' Company Research Memoirs: Dulau and Co.
7. "Student" (1908). "The Probable Error of a Mean." Biom., VI, 1-25.
8. "Student" (1917). "Tables for Estimating the Probability that the Mean of a unique Sample of Observations lies between -∞ and any given Distance of the Mean of the Population from which the Sample is drawn." Biom., XI, 414-17.
9. R. A. Fisher (1921). "An Examination of the Yield of Dressed Grain from Broadbalk." Journal of Agricultural Science, XI, 107-35.
