This Content Downloaded From 2.14.59.251 On Sat, 18 Jul 2020 08:34:10 UTC


On a Measure of Dependence Between two Random Variables

Author(s): Nils Blomqvist


Source: The Annals of Mathematical Statistics , Dec., 1950, Vol. 21, No. 4 (Dec., 1950), pp.
593-600
Published by: Institute of Mathematical Statistics

Stable URL: http://www.jstor.com/stable/2236609

ON A MEASURE OF DEPENDENCE BETWEEN
TWO RANDOM VARIABLES

BY NILS BLOMQVIST

University of Stockholm and Boston University

1. Summary. The properties of a measure of dependence q' between two
random variables are studied. It is shown (Sections 3-5) that q' under fairly
general conditions has an asymptotically normal distribution and provides
approximate confidence limits for the population analogue of q'. A test of
independence based on q' is non-parametric (Section 6), and its asymptotic
efficiency in the normal case is about 41% (Section 7). The q'-distribution in
the case of independence is tabulated for sample sizes up to 50.

2. Introduction and definitions. In drawing conclusions from statistical data
it frequently happens that it is unnecessary to utilize all the information given
by the data. In such cases it seems desirable to use methods which are
1) valid under rather weak assumptions regarding the distribution of the
population and
2) easy to deal with in practice.
Naturally such methods should always be used, but their applicability is, in
most cases, limited by their small efficiency.
Concerning methods of measuring correlation and testing independence some
so-called rank correlation coefficients have been defined [2, 3, 4, 6] which have
the first property. In large samples these are, however, rather tiresome to calcu-
late, and a simpler method might then be preferable. The coefficient studied
here has in most cases both properties mentioned above and can be used when-
ever its efficiency is not too small.
Let $(x_1, y_1), \ldots, (x_n, y_n)$ be a sample from a two-dimensional population with
cdf $F(x, y)$, and consider the two sample medians $\tilde{x}$ and $\tilde{y}$. The cdf $F(x, y)$ is
assumed to have continuous marginal cdf's $F_1(x)$ and $F_2(y)$ in order that the
probability of obtaining two equal x-values or two equal y-values in the sample
will be zero. Let the x, y-plane be divided into four regions by the lines $x = \tilde{x}$
and $y = \tilde{y}$. It is then clear that some information about the correlation between
x and y can be obtained from the number of sample points, say $n_1$, belonging
to the first or third quadrants compared with the number, say $n_2$, belonging
to the second or fourth quadrants.
Before going further we shall explain what is meant here by 'belong to'. If
the sample size n is an even number the calculation of $n_1$ and $n_2$ is evident. If,
however, n is an odd number, one or two sample points must fall on the lines
$x = \tilde{x}$ and $y = \tilde{y}$. In the first case this sample point shall not be counted. In
the other case one point falls on each of the lines. Then one of the points shall
be said to belong to the quadrant touched by both points, while the other shall
not be counted. It is easy to see that with this convention $n_1$ and $n_2$ are always
even numbers.
As a measure of correlation we define

$$q' = \frac{n_1 - n_2}{n_1 + n_2} = \frac{2 n_1}{n_1 + n_2} - 1 \qquad (-1 \le q' \le 1). \tag{1}$$

The definition of q' is not new [5] but as far as is known, its statistical proper-
ties have never been studied completely.
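
The counting convention of Section 2 and definition (1) can be sketched in Python as follows. This is only an illustration and not part of the original paper: the function name `blomqvist_q` is hypothetical, and the sketch assumes that there are no ties among the x-values or among the y-values, which the continuity assumption guarantees with probability one.

```python
from statistics import median


def blomqvist_q(xs, ys):
    """Return (q', n1, n2) following definition (1).

    Assumption: no ties among the x-values and none among the y-values.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = median(xs), median(ys)
    n1 = n2 = 0
    on_vertical = None    # y-side (+1/-1) of a point lying on the line x = mx
    on_horizontal = None  # x-side (+1/-1) of a point lying on the line y = my
    for x, y in zip(xs, ys):
        sx = (x > mx) - (x < mx)  # -1, 0 or +1
        sy = (y > my) - (y < my)
        if sx == 0 and sy == 0:
            continue              # one point on both median lines: not counted
        if sx == 0:
            on_vertical = sy
        elif sy == 0:
            on_horizontal = sx
        elif sx == sy:
            n1 += 1               # first or third quadrant
        else:
            n2 += 1               # second or fourth quadrant
    if on_vertical is not None and on_horizontal is not None:
        # Two points lie on the lines: one is counted in the quadrant
        # touched by both, the other is not counted.
        if on_vertical == on_horizontal:
            n1 += 1
        else:
            n2 += 1
    return (n1 - n2) / (n1 + n2), n1, n2
```

For odd n the two points on the median lines contribute exactly one count, so both counts stay even in this sketch.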

3. The asymptotic distribution. It is known [1] that the median in a sample
from a one-dimensional distribution under certain conditions is a consistent
estimate of the population median and asymptotically normally distributed.
Although it seems possible to weaken the requirements in our case, we shall not
do so. We require that
a) the population medians are uniquely defined (and assumed to equal zero),
b) the marginal distributions of $F(x, y)$ admit density functions $f_1(x)$ and
$f_2(y)$,
c) $f_1(x)$, $f_2(y)$ and their first derivatives are continuous in some neighbourhood
of the origin and
d) $f_1(0)$ and $f_2(0)$ are $\neq 0$.
In order to avoid trivial complications we shall assume here that the sample
size n = 2k + 1.
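Now define for every arbitrarily chosen point (x, y)

$$\begin{aligned}
a(x, y) &= P\{\xi > x,\ \eta > y\}, \\
b(x, y) &= P\{\xi < x,\ \eta > y\}, \\
c(x, y) &= P\{\xi < x,\ \eta < y\}, \\
d(x, y) &= P\{\xi > x,\ \eta < y\},
\end{aligned} \tag{2}$$

where the measure P refers to the cdf $F(x, y)$ and evidently

$$a + b + c + d = 1.$$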

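As the number of sample points belonging to the first and third quadrants
around (x, y) must be equal, the probability of the combined event

$$\{n_1 = 2r;\ \tilde{x} \in (x, x + dx),\ \tilde{y} \in (y, y + dy)\}$$

is

$$p_k(2r; x, y) = \frac{(2k + 1)!}{r!^2\,(k - r)!^2}\,(ac)^r (bd)^{k-r}\, S, \tag{3}$$

where

$$S = \frac{r}{a}\,\frac{\partial a}{\partial x}\frac{\partial a}{\partial y}\,dx\,dy
- \frac{k - r}{b}\,\frac{\partial b}{\partial x}\frac{\partial b}{\partial y}\,dx\,dy
+ \frac{r}{c}\,\frac{\partial c}{\partial x}\frac{\partial c}{\partial y}\,dx\,dy
- \frac{k - r}{d}\,\frac{\partial d}{\partial x}\frac{\partial d}{\partial y}\,dx\,dy
+ dF. \tag{4}$$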


Each of the first four terms of the expression (4) refers to a case in which two
sample points determine (x, y), and the last term refers to a case in which (x, y)
is determined by only one point. From (3) it follows that the probability of
obtaining $n_1$ at most equal to $2R$ is

$$P\{n_1 \le 2R\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \sum_{r=0}^{R} p_k(2r; x, y). \tag{5}$$

If we introduce the joint cdf $\Psi_k(x, y)$ of $\tilde{x}$ and $\tilde{y}$, (5) can be written

$$P\{n_1 \le 2R\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} d\Psi_k(x, y)\,
\frac{\sum_{r=0}^{R} p_k(2r; x, y)}{\sum_{r=0}^{k} p_k(2r; x, y)} \tag{6}$$

as

$$d\Psi_k(x, y) = \sum_{r=0}^{k} p_k(2r; x, y).$$

Clearly the integrand in (6) is $\le 1$ everywhere it exists. In the points (x, y)
where the denominator is equal to zero the integrand is undefined, but as the
measure ($\Psi_k$) of the set of such points is zero, we need not have any trouble
with them.
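Under the conditions a)-d) $\tilde{x}$ and $\tilde{y}$ converge in probability to zero; that is

$$\lim_{k \to \infty} \Psi_k(x, y) =
\begin{cases}
1 & \text{for } \{x > 0,\ y > 0\}, \\
0 & \text{otherwise.}
\end{cases}$$

Thus, when k and R tend to infinity such that $R/k \to$ const,

$$\lim P\{n_1 \le 2R\} = \lim \frac{\sum_{r=0}^{R} p_k(2r; 0, 0)}{\sum_{r=0}^{k} p_k(2r; 0, 0)}. \tag{7}$$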

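According to (3)

$$p_k(2r; 0, 0) = \frac{(2k + 1)!}{r!^2\,(k - r)!^2}\,(a_0 c_0)^r (b_0 d_0)^{k-r}\, S_0, \tag{8}$$

where the subscripts indicate the value at the point (0, 0). Because of (2),

$$c_0 = a_0, \qquad d_0 = b_0 \qquad \text{and} \qquad a_0 + b_0 = \tfrac{1}{2},$$

and the two parts of (8) are for large k

$$\frac{(2k + 1)!}{r!^2\,(k - r)!^2}\, a_0^{2r}\, b_0^{2(k-r)} \approx
\frac{1}{4\pi a_0 b_0 \sqrt{\pi k}}\,
\exp\left(-\frac{(r - 2k a_0)^2}{4 k a_0 b_0}\right)$$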


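and

$$S_0 \approx 2k\left[
\left(\frac{\partial a}{\partial x}\right)_0\left(\frac{\partial a}{\partial y}\right)_0
- \left(\frac{\partial b}{\partial x}\right)_0\left(\frac{\partial b}{\partial y}\right)_0
+ \left(\frac{\partial c}{\partial x}\right)_0\left(\frac{\partial c}{\partial y}\right)_0
- \left(\frac{\partial d}{\partial x}\right)_0\left(\frac{\partial d}{\partial y}\right)_0
\right] dx\,dy.$$
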
The first of these expressions follows from the usual application of Stirling's
approximation formula and we omit all details here.
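Hence, after the introduction of

$$r = 2k a_0 + t\sqrt{2k a_0 b_0}, \qquad R = 2k a_0 + T\sqrt{2k a_0 b_0},$$

the expression (7) is transformed to

$$\lim_{k \to \infty} P\left\{\frac{n_1 - 4k a_0}{\sqrt{8k a_0 b_0}} \le T\right\}
= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{T} e^{-t^2/2}\,dt. \tag{9}$$
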

From (9) it follows that $n_1$ is asymptotically normally distributed with mean
$4k a_0$ and standard deviation $\sqrt{8k a_0 b_0}$. Thus

$$q' = \frac{2 n_1}{2k} - 1 = \frac{n_1}{k} - 1$$

is asymptotically normally distributed with mean $4a_0 - 1$ and standard deviation

$$2\sqrt{a_0(1 - 2a_0)/k}.$$
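
As a numerical illustration of the limiting mean and standard deviation just stated, the following minimal sketch evaluates both for given $a_0$ and k. The function name `asymptotic_mean_sd` is an assumption made for this example and does not appear in the paper.

```python
from math import sqrt


def asymptotic_mean_sd(a0, k):
    """Limiting mean and s.d. of q' for sample size n = 2k + 1.

    a0 is the probability mass of the first quadrant around the
    population medians, so 0 <= a0 <= 1/2.
    """
    if not 0.0 <= a0 <= 0.5 or k <= 0:
        raise ValueError("require 0 <= a0 <= 1/2 and k > 0")
    mean = 4.0 * a0 - 1.0
    sd = 2.0 * sqrt(a0 * (1.0 - 2.0 * a0) / k)
    return mean, sd
```

Under independence $a_0 = 1/4$, so the mean is zero and the standard deviation reduces to $1/\sqrt{2k}$.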

4. Properties as an estimator. Suppose we measure the correlation between
x and y by

$$q = 2\left[\int_{0}^{\infty}\!\!\int_{0}^{\infty} dF + \int_{-\infty}^{0}\!\!\int_{-\infty}^{0} dF\right] - 1 = 4a_0 - 1, \tag{10}$$

where, as before, (0, 0) are the coordinates of the population medians. Then q
has the desired property of being equal to zero in the case of independence and
equal to $\pm 1$ in the case of linear relationship between x and y.
According to (9) q' is a consistent estimate of q when the conditions a)-d) are
fulfilled. Furthermore, as the standard deviation of q' is, to a first approximation,
independent of quantities other than q, it is possible to construct approximate
confidence limits for q for large sample sizes. This is done in the following way.
In terms of n and q we have, according to the last paragraph of section 3 and
(10),

$$E q' \approx q, \qquad \sigma(q') \approx \sqrt{\frac{1 - q^2}{n}}.$$

Let $\Phi(x)$ be a standardized normal cdf and $\lambda_1$ and $\lambda_2$ two numbers such that
$\Phi(\lambda_2) - \Phi(\lambda_1) = 1 - \alpha$. According to (9) we then have

$$P\left\{\lambda_1 < \frac{(q' - q)\sqrt{n}}{\sqrt{1 - q^2}} < \lambda_2\right\} \approx 1 - \alpha, \tag{11}$$

which gives the desired result.
If we let $\lambda_2 = -\lambda_1 = \lambda$ and solve the inequality in (11) for q, the following
symmetrical confidence interval is obtained

$$q' - \frac{\lambda}{n}\sqrt{\lambda^2 + n(1 - q'^2)} < q < q' + \frac{\lambda}{n}\sqrt{\lambda^2 + n(1 - q'^2)},$$

where we have used that $\lambda^2 \ll n$.
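
The symmetrical interval can be evaluated with a few lines of Python. This is a sketch under the large-sample approximation of this section only; the function name `q_confidence_interval` and the default level are assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist


def q_confidence_interval(q_hat, n, alpha=0.05):
    """Approximate symmetrical confidence interval for q.

    q_hat is the observed q', n the sample size; the multiplier is the
    standard normal quantile at 1 - alpha/2.
    """
    if not -1.0 <= q_hat <= 1.0 or n <= 0 or not 0.0 < alpha < 1.0:
        raise ValueError("invalid arguments")
    lam = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = (lam / n) * sqrt(lam ** 2 + n * (1.0 - q_hat ** 2))
    return q_hat - half_width, q_hat + half_width
```

For example, `q_confidence_interval(0.0, 100)` gives an interval of roughly plus or minus 0.20 around zero. The interval is not truncated to the range from -1 to 1, in keeping with the formula as stated.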

5. The normal case. If x and y are normally distributed with correlation
coefficient $\rho$, we have

$$q = \frac{2}{\pi} \arcsin \rho. \tag{12}$$
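
Relation (12) and its inverse are easy to evaluate numerically. The sketch below is illustrative only; the function names `q_from_rho` and `rho_from_q` are hypothetical, and the inverse $\rho = \sin(\pi q / 2)$ is simply (12) solved for $\rho$.

```python
from math import asin, pi, sin


def q_from_rho(rho):
    """Population value q implied by (12) in the bivariate normal case."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    return 2.0 * asin(rho) / pi


def rho_from_q(q):
    """Inverse of (12): the normal correlation corresponding to q."""
    if not -1.0 <= q <= 1.0:
        raise ValueError("q must lie in [-1, 1]")
    return sin(pi * q / 2.0)
```

For instance, $\rho = 0.5$ corresponds to $q = 1/3$.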

This expression is the same as the mean of Esscher-Kendall's rank correlation
coefficient $\tau$ [2, 4]. Hence, in the normal case q' and $\tau$ estimate the same quantity.
The coefficient q' has, however, a much smaller efficiency. The asymptotic
efficiency of q' relative to the aforementioned coefficient is

$$\frac{\sigma^2(\tau)}{\sigma^2(q')} =
\frac{4\left[\frac{1}{9} - \left(\frac{2}{\pi}\arcsin\frac{\rho}{2}\right)^2\right]}
{1 - \left(\frac{2}{\pi}\arcsin\rho\right)^2} = \frac{4}{9}$$

for $\rho = 0$.

6. Tests of independence based on q'. In testing independence between x
and y it is in practice more convenient to use critical regions based on $n_1$ instead
of q'. Since, under the null hypothesis, the measure of a critical region is
independent of $F(x, y)$ ($F_1(x)$ and $F_2(y)$ are assumed to be continuous), any test
based on $n_1$ is non-parametric. We have made exact calculations of the
q'-distribution for sample sizes n up to 50. For larger sample sizes the normal
approximation for $n_1$ does not seem to entail errors of practical importance.
To derive the exact distribution of $n_1$ under the null hypothesis we suppose
that n equals 2k. The probability that any k sample points shall have smaller
x-values than the other k points is

$$1 \Big/ \binom{2k}{k}.$$

Hence, since any arrangement of the sample points according to their x-values
does not affect the distribution of the y-values,

$$P\{n_1 = 2r\} = \frac{\binom{k}{r}^2}{\binom{2k}{k}}. \tag{13}$$


If n = 2k + 1 it is easily verified that the probability (13) remains unchanged,
if we use the procedure in calculating $n_1$ and $n_2$ proposed in Section 2. This is,
in fact, the main reason for the proposal.

Table of $P\{|n_1 - k| \ge \nu\}$

ν \ 2k      4      8     12     16     20     24     28     32     36     40     44     48

     0  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
     2   .333   .486   .567   .619   .656   .684   .706   .724   .740   .752   .764   .773
     4          .029   .080   .132   .179   .220   .257   .289   .318   .343   .366   .387
     6                .0022   .010   .023   .039   .057   .076   .094   .113   .131   .148
     8                       .0002  .0011  .0033  .0070   .012   .018   .026   .034   .042
    10                                     .0001  .0004  .0011  .0022  .0038  .0060  .0087
    12                                                          .0002  .0004  .0007  .0012
    14                                                                        .0001  .0001

ν \ 2k      6     10     14     18     22     26     30     34     38     42     46     50

     1  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
     3   .100   .206   .286   .347   .395   .434   .466   .494   .517   .538   .556   .572
     5         .0079   .029   .057   .086   .115   .143   .169   .194   .217   .238   .258
     7                .0006  .0034  .0089   .017   .027   .038   .050   .063   .076   .089
     9                              .0003  .0012  .0028  .0053  .0086   .013   .017   .023
    11                                            .0001  .0004  .0009  .0017  .0028  .0042
    13                                                          .0001  .0001  .0003  .0005
    15

2k is the largest even number contained in the sample size.
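
The tabulated tail probabilities follow directly from (13) and can be recomputed exactly. The sketch below is an illustration under that formula; the function names `null_pmf` and `tail_probability` are hypothetical.

```python
from fractions import Fraction
from math import comb


def null_pmf(k):
    """Exact null distribution of n1 from (13): maps n1 = 2r to its probability."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    total = comb(2 * k, k)
    return {2 * r: Fraction(comb(k, r) ** 2, total) for r in range(k + 1)}


def tail_probability(k, nu):
    """P{|n1 - k| >= nu} under independence, as an exact fraction."""
    return sum(p for n1, p in null_pmf(k).items() if abs(n1 - k) >= nu)
```

With 2k = 4 and ν = 2 this gives 1/3, and with 2k = 10 and ν = 3 it gives 52/252, in agreement with the entries .333 and .206 of the table.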

The distribution of $n_1$ is symmetric about $n_1 = k$ with the variance

$$\frac{k^2}{2k - 1}.$$

Thus, in testing independence we can for large sample sizes use

$$\frac{(n_1 - k)\sqrt{2k - 1}}{k}$$

as an approximately normally distributed random variable with mean zero
and unit s.d.
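
A large-sample test based on this standardized variable might be sketched as follows. The function names `independence_z` and `two_sided_p_value` are assumptions made for the example, and the two-sided p-value is one possible choice of critical region, not one prescribed by the paper.

```python
from math import sqrt
from statistics import NormalDist


def independence_z(n1, k):
    """Standardized n1 under the null hypothesis: (n1 - k) * sqrt(2k - 1) / k."""
    if k < 1 or not 0 <= n1 <= 2 * k:
        raise ValueError("require k >= 1 and 0 <= n1 <= 2k")
    return (n1 - k) * sqrt(2 * k - 1) / k


def two_sided_p_value(n1, k):
    """Normal-approximation p-value for a two-sided test of independence."""
    z = independence_z(n1, k)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))
```

For 2k = 100 and $n_1 = 60$ the standardized value is about 1.99, giving a two-sided p-value a little below 0.05.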

7. The asymptotic efficiency of the q'-test. In the case that x and y are
normally distributed with the correlation coefficient $\rho$, it is possible, but rather
tedious, to calculate the power function of the q'-test. We will, therefore, restrict
ourselves to considering only the asymptotic behavior of the power function.
Consider tests of independence ($\rho = 0$) against one-sided alternatives $\rho > 0$.
Let $L_m^{(1)}(\rho)$ be the power function of the q'-test for the sample size m and $L_n^{(2)}(\rho)$
be the power function of the test based on the correlation coefficient r in a
sample of size n. We assume that all tests have the same size, i.e.

$$L_m^{(1)}(0) = L_n^{(2)}(0) = \alpha \tag{14}$$


for all m and n. We shall say that the q'-test has the asymptotic efficiency e if

$$\lim_{n \to \infty}
\frac{\left(\partial L_m^{(1)} / \partial \rho\right)_{\rho = 0}}
{\left(\partial L_n^{(2)} / \partial \rho\right)_{\rho = 0}} = 1 \tag{15}$$

when

$$m = \frac{n}{e}.$$

This means that the sample size in using the r-test need only be 100e% of
that in using the q'-test, in order to get the same derivative of the power function
at $\rho = 0$ (for large sample sizes). Since the definition of e only concerns the
behavior in the neighborhood of $\rho = 0$, it might perhaps be more correct to call e
the asymptotic local efficiency.
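In order to calculate e we define two sequences $\{q_m\}$ and $\{r_n\}$ such that
$\{q' > q_m\}$ and $\{r > r_n\}$ are tests with the aforementioned properties. According
to (9) and (10) q' is asymptotically normally distributed with mean q and s.d.
$\sqrt{(1 - q^2)/m}$. Furthermore, r is asymptotically normally distributed with mean
$\rho$ and s.d. $(1 - \rho^2)/\sqrt{n}$. Hence,

$$1 - L_m^{(1)}(\rho) = P\{q' < q_m \mid \rho\} \approx \Phi\left(\frac{(q_m - q)\sqrt{m}}{\sqrt{1 - q^2}}\right),$$

$$1 - L_n^{(2)}(\rho) = P\{r < r_n \mid \rho\} \approx \Phi\left(\frac{(r_n - \rho)\sqrt{n}}{1 - \rho^2}\right),$$

from which it follows

$$\frac{\left(\partial L_m^{(1)} / \partial \rho\right)_0}{\left(\partial L_n^{(2)} / \partial \rho\right)_0}
\approx \frac{\Phi'(q_m\sqrt{m})\,\sqrt{m}}{\Phi'(r_n\sqrt{n})\,\sqrt{n}} \left(\frac{dq}{d\rho}\right)_0. \tag{16}$$

According to (14) we have

$$\lim_{m \to \infty} q_m \sqrt{m} = \lim_{n \to \infty} r_n \sqrt{n} = \Phi^{-1}(1 - \alpha).$$

Thus we conclude

$$\lim_{m, n \to \infty}
\frac{\left(\partial L_m^{(1)} / \partial \rho\right)_0}{\left(\partial L_n^{(2)} / \partial \rho\right)_0}
= \lim_{m, n \to \infty} \sqrt{\frac{m}{n}} \left(\frac{dq}{d\rho}\right)_0. \tag{17}$$

Clearly (17) is equal to 1 if

$$\frac{n}{m} = \left(\frac{dq}{d\rho}\right)_0^2.$$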


Hence, according to (12) and (15)

$$e = \left(\frac{2}{\pi}\right)^2.$$

In other words, the asymptotic efficiency of the q'-test is about 41%.
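
The figure of about 41% can be checked numerically by squaring the slope of (12) at $\rho = 0$. The sketch below uses a central difference in place of the analytic derivative; the function names `q_of_rho` and `local_efficiency` and the step size are assumptions made for the illustration.

```python
from math import asin, pi


def q_of_rho(rho):
    """Relation (12): q as a function of the normal correlation rho."""
    return 2.0 * asin(rho) / pi


def local_efficiency(h=1e-6):
    """Square of dq/d(rho) at rho = 0, estimated by a central difference."""
    slope = (q_of_rho(h) - q_of_rho(-h)) / (2.0 * h)
    return slope ** 2
```

The result agrees with $(2/\pi)^2 \approx 0.405$ to many decimal places.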

8. Concluding remarks. An interesting similarity exists between the q'-test
of independence and a test of equal location parameters in two distributions,
constructed in the following way. Suppose that two samples of equal size, say k,
are drawn independently from two distributions. Compute the number of
individuals, say r, in the first sample, falling short of the median of the pooled
samples. Then the distribution of 2r under the null hypothesis is the same as
that of $n_1$ in the q'-test for sample size 2k (or 2k + 1). The test based on r was
discussed by F. Mosteller [7].
Another similarity is between the q'-test and a special case of the exact test of
independence in a 2 x 2 table [8]. If in such a table the marginals happen to be cut
at the 50% points the two test procedures become identical.

REFERENCES

[1] H. CRAMÉR, Mathematical Methods of Statistics, Princeton University Press, 1946.
[2] F. ESSCHER, "On a method of determining correlation from the ranks of a variate",
Skandinavisk Aktuarietidskrift, Vol. 7 (1924), p. 201.
[3] W. HOEFFDING, "A non-parametric test of independence", Annals of Math. Stat.,
Vol. 19 (1948), p. 546.
[4] M. G. KENDALL, "A new measure of rank correlation", Biometrika, Vol. 30 (1938), p. 81.
[5] F. MOSTELLER, "On some useful 'inefficient' statistics", Annals of Math. Stat., Vol. 17
(1946), p. 377.
[6] C. SPEARMAN, "The proof and measurement of association between two things", Am.
Jour. of Psych., Vol. 15 (1904), p. 88.
[7] F. MOSTELLER, "On some useful 'inefficient' statistics", unpublished thesis, Princeton
University, 1946.
[8] R. A. FISHER, Statistical Methods for Research Workers, 8th Ed., Stechert & Co., 1941.
