0% found this document useful (0 votes)
130 views

Convergence of Stochastic Processes

An exposition od selected parts of empirical process theory, with related interesting facts about weak convergence, and applications to mathematical statistics. The high points of the book describe the combinatorial ideas needed to prove maximal inequalities for empirical processes indexed by classes of sets or classes of functions.
Copyright
© © All Rights Reserved
Available Formats
Download as PDF or read online on Scribd
0% found this document useful (0 votes)
130 views

Convergence of Stochastic Processes

An exposition od selected parts of empirical process theory, with related interesting facts about weak convergence, and applications to mathematical statistics. The high points of the book describe the combinatorial ideas needed to prove maximal inequalities for empirical processes indexed by classes of sets or classes of functions.
Copyright
© © All Rights Reserved
Available Formats
Download as PDF or read online on Scribd
You are on page 1/ 223
David Pollard Convergence of Stochastic Processes With 36 Illustrations $ Springer-Verlag New York Berlin Heidelberg Tokyo David Pollard Department of Statistics, Yale University New Haven, CT 06520 USA [AMS Subject Classifications: 60F99, 60G07, 601199, 62M99, Library of Congress Cataloging in Publication Data Pollard, David Convergence of stochastie processes, (Springer series in statistics) Bibliography: p. Includes index. 1. Stochastic processes. 2. Convergence, 1. Title IM. Series QAI74P6t 1984 51924401 © 1984 by Springer-Verlag New York Inc All rights reserved. No part of this book may be translated ot reproduced in any {orm without written permission from Springer-Verlag, 175 Fith Avenue, New York, New York 10010, US.A. ‘Typeset by Composition House Ltd, Salisbury, England. Printed and bound by R. R. Donnelley & Sons, Harrisonburg, Virginia, Printed inthe United States of America, 987654321 ISBN 0-387-90990-7 Springer-Verlag New York Berlin Heidelberg Tokyo ISBN 3-540-90990-7 Springer-Verlag Berlin Heidelberg New York Tokyo To Barbara Amato Preface A more accurate title for this book might be: An Exposition of Selected Parts of Empirical Process Theory, With Related Interesting Facts About Weak Convergence, and Applications to Mathematical Statistics. The high points are Chapters II and VIL, which describe some of the developments inspired by Richard Dudley's 1978 paper. There I explain the combinatorial ideas and approximation methods that are needed to prove maximal inequalities for empirical processes indexed by classes of sets or classes of functions. The material is somewhat arbitrarily divided into results used to prove consistency theorems and results used to prove central limit theorems. This has allowed me to put the easier material in Chapter I, with the hope of enticing the casual reader to delve deeper. Chapters III through VI deal with more classical material, as seen from a different perspective. The novelties are: convergence for measures that don’t live on borel o-fields; the joys of working with the uniferm metric on DLO, 1); and finite-dimensional approximation as the unifying idea behind ‘weak convergence. Uniform tightness reappears in disguise as a condition that justifies the finite-dimensional approximation. Only later is it exploited a8 a method for proving the existence of limit distributions. The last chapter has a heuristic flavor. I didn't want to confuse the ‘martingale issues with the martingale facts. My introduction to empirical processes came during my 1977-8 stay with Peter Gaenssler and Winfried Stute at the Ruhr University in Bochum, while I was supported by an Alexander von Humboldt Fellowship. Peter and I both spent part of 1982 at the University of Washington in Seattle, ‘where we both gave lectures and absorbed the empirical process wisdom of Ron Pyke and Galen Shorack. The published lecture notes (Gaenssler 1984) show how closely our ideas have evolved in parallel since Bochum. I also vill Preface hhad the privilege of seeing a draft manuscript of a book on empirical processes by Galen Shorack and Jon Wellner. At Yale I have been helped by a number of friends. Dan Barry read and criticized early drafts of the manuscript. Deb Nolan did the same for the later drafts, and then helped with the proofreading. First Jeanne Boyce, and then Barbara Amato, fed innumerable versions of the manuscript into the DEC-20, John Hartigan inspired me to think. The National Science Foundation has supported my research and writing over several summers. am most grateful to everyone who has encouraged and aided me to get this thing finished. Contents Notation CHAPTER f Functionals on Stochastic Processes 1. Stochastic Processes as Random Functions Notes Problems CHAPTER Ii Uniform Convergence of Empirical Measures 1. Uniformity and Consistency 2. Direet Approximation 3. The Combinatorial Method 4, Classe of Sets with Polynomial Discrimins 5 6 (Classes of Functions Rates of Convergence Notes Problems (CHAPTER It Convergence in Distribution in Euclidean Spaces 1. The Definition ‘The Continuous Mapping Theorem Expectations of Smooth Functions ‘The Central Limit Theorem Characteristic Functions ‘Quantile Transformations and Almost Sure Representations Notes Problems B 6 24 30 36 38 a 8 30 34 a 6 o (CHAPTER IV Convergence in Distribution in Metric Spaces 1. Measurabilty ‘The Continuous Mapping Taeorem Representation by Almost Surely Convergent Sequences Coupling ‘Weakly Convergent Subsequences Notes Problems CHAPTER V ‘The Uniform Metric on Spaces of Cadlag Functions 1. Approximation of Stochastic Processes Empirical Processes Existence of Brownian Bridge and Brownian Motion Processes with Independent Increments Infinite Time Seales, Funetionals of Brownian Motion and Brownian Bridge Notes Problems (CHAPTER VI The Skorohod Metric on D[0, oo) 1. Properties ofthe Metric 2, Convergence in Distribution Notes Problems CHAPTER VII Central Limit Theorems Stochastic Equicontinuity Chaining Gaussian Processes |. Random Covering Numbers Empirical Central Limit Theorems Restricted Chaining Notes Problems (CHAPTER vit Martingales 1. A Central Limit Theorem for Martingale-Difference Arrays 2. Continuous Time Martingales 3. Estimation from Censored Data Notes Problems Contents 9 95 100 103 107 10 u7 118 12 1m 130 136 137 138 138 142 146 149 155 160 165 167 170 170 176 182 185 186 Contents APPENDIX A Stochastic-Order Symbols APPENDIX B Exponential Inequalities Notes Problems APPENDIX C Measurability Notes Problems References Author Index Subject Index 139) 191 193 193 195 200 200 201 Notation Integrals and expectations are written in linear functional notation; sets are identified with their indicator functions. Thus, instead of [, f(s)IP(dx) write (fA). When the variable of integration needs to be identified, asin iterated integrals, T return to the traditional notation. And orthodcxy constrains re to write { f(x) dx for the lebesgue integral, in whatever dimension is appropriate. If unspecified, the domain of integration is the whole space. Abbreviations can stand for a probability measure or a random variable distributed according to that probability measure Bin(n, p) = binomial distribution for n trials with success probability p, jormal distribution with mean j1 and variance o?. multivariate normal distribution with mean vector and variance matrix V. Uniform(a, b) = uniform distribution on the open interval (a,b); square brackets, as in Uniform{0, 1J, indicate closed intervals. Poisson(2) = poisson distribution with mean 2, No?) Nw Y) The symbol 1 denotes end of proof, end of definition, and so on—some- thing to indicate resumption of the main text. Product measures, product aces, and product o-fields share the product symbol @. Maxima and minima are v and. Set-theoretic difference is \; symmetric difference is A.If ay/b, + 00, for sequences {a,} and {b,), then write a, > by. Invariably IR denotes the real line, and IR‘ denotes k-dimensional euclidean space, The borel a-field on a metric space J is always (7). The symbol IP xiv Notation denotes a probability measure on a (Sometimes unspecified) measurable space (Q, 6); miscellaneous random variables live or this space. ‘An~,a cross between ~ (the sign for “is distributed according to”) and an ordinary arrow > (for convergence), is used for convergence in distribu- tion and weak convergence. AA result stated and proved in the text is always referred to with intial letters capitalized. Thus the Multivariate Central Limit Theorem is numbered 111.30, but Taylor's theorem and dominated convergence are not reproved. The letters B, U, Bp, Ep usually denote the gaussian processes: brownian motion, brownian’ bridge, P-motion, and P-bridge. The letters U, and E, denote empirical processes, with U, generated by observations on Uniform, 1). Usually P, isthe empirical measure. The set of all square-integrable functions with respect to & measure jis written £2); the corresponding space of equivalence clases is Lalu). A similar distinction holds for #'(x) and L'(j.). Often pp denotes the 7(P) seminorm; isthe supremum norm on a space of functions. The symbols ‘ys, and so on, are usually projection maps on function spaces. Expressions like N(B), Ny(O,d, 7), and. N,(6, P.%) represent various covering numbers; J(3), J, 4, 7), and J3(6, P, F) ate the corresponding covering integral. CHAPTER I Functionals on Stochastic Processes ‘Which inttoduces the idea of studying random variables determined by the ‘whole sample path of a stochastie process, 1.1. Stochastic Processes as Random Functions Functions analyzed as points of abstract spaces of functions appear in many branches of mathematics, Geometric intuitions about distance (or approximation, or convergence, or orthogonality, or any other ideas learned from the study of euclidean space) carry over to those abstract spaces lending, familiarity to operations carried out on the functions. We enjoy similar benefits in the study of stochastic processes if we analyze then as random elements of spaces of functions, Remember that a stochastic process is a collection {X,:1¢T} of real random variables, all defined on a common probability spece (0, 6, IP). Often Twill be an interval of the real line (which makes the temptation to think of the process as evolving in time almost irresistible), but we shall also ‘meet up with fancier index sets—subsets of higher-cimensional euclidean spaces, and collections of functions. The random variable X, depends on boti r and the point « in © at which itis evaluated. To emphasize its role as a function of two variables, write it as X(o, 1). For fixed ¢, the function X(, 1) is, by assumption, a measurable map from Q into IR. For fixed @, the function X(o,-) is called a sample path of the stochastic process. If all the sample paths lie in some fixed collection 2 of real-valued functions on T, the process X can be thought of as a map from 9 into %, a random element of . For example, if a process indexed by [0, 1] hes continuous sample paths it defines a random element of the space C[0, 1] of all continuous, real-valued functions on [0,1]. (In Chapter IV we shall formalize the definition by adding a measurability requirement.) Each sample path of X is a single point in 4. Fach random variable Z for which Z(o) depends only on the sample path X(c,-), such as the maximum 2 |. Functionas on Stochasti Processes of X(o, 1) over allt, can be expressed by means of a functional on. That is, the value Z(o)is found by applying to X(«,) a map H that sends functions in onto points ofthe real line. The name functional serves to distinguish functions on spaces of functions from funetions on other abstract sets, an ‘outmoded distinction, but one that can help us to remember where H lives. By breaking Z into the composition of a functional with a map from into 2 we also break any analysis of Z into iwo parts: calculations involving only the random element X; and caleulaticns involving only the functional LH. This allows us o study many different Z's simultaneously, just by varying the choice of H. Of course we only gain by this if most of the hard work can bbe disposed of once and for all in the analysis of X. ‘The idea can be taken further. Suppose that a second stochastic process {¥: re T) puts all its sample paths in the same function space 2. Suppose we want to study the same functional Hf of both processes: we want (0 show that HX and HY have distributions that are close, perhaps. Break the problem into its two parts: show that the distributions of X and ¥ (the probability measures they induce on 2) are close; then show that H has a continuity property ensuring that closeness of the distributions of X and ¥ implies closeness of the distributions of HX and HY. Such an approach would make the analysis easier for other functionals with the same sort of continuity property; for a different H only the second part of the analysis would need repeating. 1 Example. Goodness-of-fit test statistics can often be expressed as func- tionals on a suitably standardized empirical distribution function. Consider the basic case of an independent sample &,,.... 6, from the Uniform(0, 1) distribution. Define the uniform empirical process U,, by Uloj =n? ¥ (Go) sh-H for O 4 for each f > m then the median is a continuous functional, in tae sense that ‘median(Q) ~ median(P)| < & whenever the distribution Q is close enough to P. Close means sup |O(—<0, 1] — P(—2, 1 < 6, where the tiny 6 is chosen so that P(-co,m— 2] <$—6, P(-w,m+ e] > $46. ‘The argument goes: if @ has median my then P(—c0, m'] > O—20, m1 - 8 =$-6, so certainly m > m — e. Similarly, for every m" < mi, P(—c0,m"] < Qo, m+ 6<446, which implies m” < m+ ¢, and hence m' < m+ ¢. Next comes the probability theory. If the empirical measure P, is con- structed from a sample of independent observations on P, the Glivenko Cantelli theorem tells us that sup |P,(— 00, f] — P(—20, f]| +0 almost surely. From this we deduce that, almost surely, |median(P,) — median(P)| <¢ eventually. ‘The sample median is strongly consistent as an estimator of the population median, o For this example we didn't have to prove the uniformity result; the Glivenko-Cantelli theorem is the oldest and best-known uniform strong law of large numbers in the literature, But as we encounter new functions (usually called functionals) of the empirical measure, new uniform con- vergence theorems will be demanded. We shall be exploring two methods for proving these theorems. ‘The first method is simpler in concept, but harder in execution. It involves direct approximation of functions in an infinite class F by a 1. Uniform Convergence of Fmpirical Measures finite collection of functions. Classical convergence results, such as the strong law of large numbers or the ergodic theorem, ensure uniform convergence for the finite collection; the form of approximation carries the uniformity over to & Section 2 deals with direct approximation. ‘The second method depends heavily upon symmetry properties implied by independence. It uses simple combinatorial arguments to identify classes satisfying uniform strong laws of large numbers under independent sampling. Sections 3 to 5 assemble the ideas behind this method, 11.2. Direct Approximation Throughout the section ¥ will be a class of (measurable) functions on a set, S with a o-field that carries a probability measure P. The empirical measure P, is constructed by sampling from P. Assume P| f | < oo for each f in AFF were finite, the convergence of P, f to Pf assured by the strong law of large numbers would, for trivial reasons, be uniform in f. If ¥ can be ap- proximated by a finite class (not necessarily a subclass of #) in such a way that the errors of approximation are uniformly small :he uniformity carries over to & The direct method achieves this by requiring that each member of F be sandwiched between a pair of approximating functions taken from the finite class. 2 Theorem. Suppose that for each e > O there exists afinite class F, containing lower and upper approximations to each fin F, for which Fur SSS fav and Phy — fd 0 and Timsup sup(P,f ~ Pf) <0 or, equivalently, limint int P(—) — P(—f) 2 0. ‘Then two applications of the next theorem will complete the proof. 0 3 Theorem. Suppose that for each ¢ > 0 there exists a finite class F, of Junctions for which: to each fin F there exists an fin F, such that f, < f and Pf, = Pf e. Then liming inf(P,f — PF) 20 almost surely. 112. Direct Approximation 9 PRoor. For each ¢ > 0, liminf inf(P, f — Pf) > liminf inf(P,f,— Pf) because f. < f * - > limin€ inf(P, fe — PA) + inf (Ph ~ PN) 20+ —e almost surely, as Fis finite. ‘Throw away an aberrant null set for each positive rational eto arrive at the asserted result, o You might have noticed that independence enters only as a way of guaranteeing the almost sure convergence of Pf, to Pj for each approximat- ing f. Weaker assumptions, such as stationarity and ergodicity, could substitute for independence. 4 Example, The method of k-means belongs to the host of ad hoc procedures that have been suggested as ways of partitioning multivariate data into groups somehow indicative of clusters in the underlying population. We can prove a consistency theorem for the procedure by application of the one- sided uniformity result of Theorem 3. For purposes of illustration, consider only the simple case where observa- tions &,...,é, from a distribution P on the real line are to be partitioned into two groups. The method prescribes that the two groups be chosen to minimize the within-groups sum of squares. Equivalently, we may choose ‘optimal centers a, and b, to minimize Dis —aP 016 — oP, then allocate each ¢,to its nearest center. The optimal centers must lie at the mean of those observations drawn into their clusters, hence the name k-means (or 2-means, in the present case). In terms of the empirical measure P,, the method seeks to minimize W(a,by Ph) = PaSuoe where Sas As the sample size increases, W(a, b, P,) converges almost surely to WOa.b, P) = Phaw for each fixed (a, b). This suggests that (a,,b,), which minimizes W(., -, P,), might converge to the (a*, b*) that minimizes W(.,-, P). Given a few obvious conditions, that is indeed what happens. To ensure finiteness of W(-, , P), assume that P|x|* < co. Assume also that there exists a unique (a*, b*) minimizing W. Adopt the convention that lx al a lx — bP. 10 I. Uniform Convergence of Empirical Measures 3D then Fas 1x] = DY because |a| < M; the lower approximation for f,,, on [-D, D] will also serve for f, = al? = ff Ix] < D} 12. Direet Approximation nl Let C, be a finite subset of [-3D, 3D]? such that each (a,b) in that square has an (a,b) with a —a'| < e/D and |b ~ b'| < e/D. Then for each x in [—D, DJ, ao) — fun ()] SIG = a)? = (& — a] + [bY -(—BF] S 2a a'||x— Ma +a] +2]b —b'|\x— 46+) 5 2@/DXD + 3D) + 2(e/DXD + 3D) = 16 ‘The class F3, consists of all functions (fy, (x) — 162){|x| < D} for (a,b) ranging over C,. From Theorem 3, Pfa,s) 2 0. limint inf(P, f, Eventually the optimal cemters (a,, b,) lie in C, Thus liminf(W(ay, by, P,) — W(ay, by, P)) = 0 almost surely. Since Way, by, Py) < Wat, b*, P,) because (a,, b,) is optimal for Py + W(at,b*, P) almost surely S W(a,, By, P) because (a*, b*) is optimal for P, we then deduce that Way, by, P) > Wa, b*, P) almost surely. Notice what happened. The uniformity allowed us to transfer optimality of (a4, b,) for P, to a sort of asymptotic optimality for P; the processes WC, Pa) have disappeared, leaving everything in terms of the fixed, non- random function W(-, , P). We have assumed that 1V(., -, P) achieves its unique minimum at (a*, b*), ‘Complete the argument by strengthening this to: for each neighborhood U of (a*, B*), inf W(a, b, P) > Wat, b*, P). Continuity of W(, -, P) takes cate of the infimum over bounded regions of C\U. If there were an unbounded sequence (x, 8) in C with W(%, Bi, P) > Wa, b*, P), wwe could extract a subsequence along which, say, a> —o0 and fj, f, with IB| $M. Dominated convergence would give Wat, b*, P) = Plx ~ BP, which would contradict uniqueness of (a, b*): for every a, the pait (a, 6) would minimize W(-,-, P). The pair (a,,6,), by seeking out the unique minimum of W(-,-, P) over the region C, must converge to (a*,b*). 2 TI Uniform Convergence of Empirical Measures The k-means example typifies consistency proofs for estimators defined by optimization of a random criterion function. By ad hoc arguments one forces the optimal solution into a restricted, often compact, region, That is usually the hardest part of the proof. (Problem 2 desvribes one particularly nice ad hoe argument.) Then one appeals to a uniform strong law over the restricted region, to replace the random criterion function by a deterministic limit function. Global properties of the limit function force the optimal solution into desired neighborhoods. If one wants consistency results that apply not just to independent sequences but also, for example, to stationary ergodic sequences, one is stuck with cumbersome direct approximation arguments; but for independent sampling, slicker methods are available for proving the uniform strong laws. We shall return to the k-means problem in Section 5 (Example 29 to be precise) after we have developed these methods 5 Example. Let @ be the parameter of a stationary autoregressive process Yn = OV ty for independent, identically distributed innovations {u,}. Stationarity requires |@| < 1. A generalized M-estimator for 0 is any value 8, for which the random function HO) = (0-1 Loonies ~ 8y) takes the value zero. We would hope that 8, converges to the 0 at which the deterministic function H(6) = Pa(y:)602 ~ 1) takes the value zero. If|g| < 1 and || < 1 and is continuous, we can go part of the way towards proving this by means of a uniform strong law for a bivariate empirical measure. Write Q, for the probability measure that puts equal mass (n — 1)~* on each of the pairs (),, 2), ++, (la-1s Ja: For fixed (integrable) f(-,-), Qf OF almost surely, where Q denotes the joint distribution of (y,, 72). This follows from the exgodic theorem forthe stationary bivariate process {(V, Jee 1)} Check the approximation conditions of Theorem 2, with Q in place of P, for the class of functions F061, ¥2, 6) = gle)O02 — Ox,) for ~15 <1. First, choose an integer K so large that P(ly:1S Ky|ys| 1—6 13. The Combinatorial Method B Then appeal to uniform continuity of $ on the compact interval [—2K, 2K] to find a 6 >0 such that |$(a) — 4(6)| K} + 2E[xa] > K). With the integer running over the finite range needed for these intervals to cover [—1, 1], the functions I (1, ¥2, KB/K) + & + 2x4] > K} + 2x2] > K} provide the upper and lower approximations required by Theorem 2, As noted following Theorem 3, the uniform strong laws also apply to empirical measures constructed from stationary ergodic sequences, Ac- cordingly, © $82 1Qs/ 8) — OFC, )| +0 almost surly, that is, sup |H,(@) ~ H@)| +0 almost surely. feist Provided @, lies in the range [—1, 1], we can deduce from (6) that H(6,) +0 almost surely. It would be a sore embarrassment if the estimate of the auto- regressive parameter were not in this range. Usually one avoids the em- barrassment by insisting only that H,(8,) — 0, with @, in [—1, 1]. Such a 6, always exists because H,(6*) + 0 almost surely. Convergence results for 8, depend upon the form of H(.). We know 8, gets forced eventually into the set {|#f| <<} for each © > 0, If this set shrinks to 6* as ¢ | 0 then 8, must converge to 6*, which necessarily would have to be the unique zero of H(-). If we assume that H does have these properties we get the consistency result for the generalized M-estimator. 11.3. The Combinatorial Method Since understanding of general methods grows from insights into simple special cases, let us begin with the best-known example of a uniform strong. aw of large numbers, the classical Glivenko-Cantelli theorem. This asserts that, for every distribution P on the real line, o sup|P,(—00, t] — P(—<0, | +0 almost surely, When the empirical measure P, comes from independent sampling on P. ‘The ideas that will emerge from the treatment of this special case will later be expanded into methods applicable to other classes of functions. To facilitate back reference, break the proof into five steps. 4 1. Uniform Convergence of Empirical Measures Keep the notation tidy by writing ||| to denote the supremum over the class ¥ of intervals (—s0, f], for —o0 <1< 20. We could restrict the supremum to rational ¢ to ensure measurability. First SYMMETRIZATION, Instead of matching P, against its parent distribution P, look at the difference between P, and an independent copy, P, say, of itself The difference P, — Pr is determined by a set of 2n points (albeit random) oa the real line; it can be attacked by combinatorial methods, which lead to a bound on deviation probabilities for [P, — Pl. A symmetrization inequality converts this into a bound on ||P, — Pl deviations. 8 Symmetrization Lemma. Let (Z(t):t © T} and (20: te T) be indepen- dent stochastic processes sharing an index set T. Suppose there exist constants B > Oand a > 0 such that IP{|Z'(@| < a} & P for every tin T. Then Oy {up 1200) > 4 < prefeupize = Z| > 2 ~ 4 Proor. Select a random + for which |Z(z)| > € on the set (sup |Z()| > e}. Since r is determined by Z, it is independent of Z’. It behaves like a fixed index value when we condition on Z P(IZ@| < 4/2) > B. Integrate out, e{sup [Z| > 4 SPUZ@| <4 |Z@)| > 2} S P(IZ()— Z'@)| > - a} oat o Close inspection of the proof would reveal a disregard for a number of measure-theoretic niceties. A more careful treatment may be found in Appendix C. For our present purpose it would suffice if we assumed T countable; the proof is impeccable for stochastic processes sharing a count- able index set, We could replace suprema over all intervals (00, f] by suprema over intervals with a rational endpoint. For fixed t, Ps(—00, t] isan average of the n independent random variables { <0), each having expected value P(—o, ¢] and variance P(— v0, ] — (P(x, £)), which is less than one, By Tehebychev’s inequality, IPIP,(—c0, 9] ~ P(—o0, 1 < de} >} it n> Be? 113. The Combinatorial Method 15 Apply the Symmetrization Lemma with Z = P, ~ Pand Z’ = P, — P, the class J as index set, = $6, and B= 3 10) PLP, — Pi > ¢} < UP(EP, — Pill > $e) if mB Bem, SECOND SYMMETRIZATION. The difference P, ~ P, depends on 2n observations, The double sample size ereates a minor nuisance, atleast notationally. It can be avoided by a second symmetrization trick, at the cost of a further diminution of the Independently of the observations €,,...,€, Ejr--+4 5 from which the empirical measures are constructed, generate independent sign random variables ¢,,.... 0, for which IP(o; = +1} = P(g; = —1} = 4. The sym- metric random variables {é, <1} — {<1}, for on and —o0 seh i >} > at Write Pj for the signed measure that places mass n~a, até, The two sym- metrizations give, for n = 872, ay PP, ~ Pl > e} < AP{IP5I| > 28) To bound the right-hand side, work conditionally on the vector of observa- tions &, leaving only the randomness contributed by the sign variables. PIP, ~ Pil > de) = {sp p = P(e rt S otis no Sots +ifap Maxinat INEQUALITY. Once the locations of the & observations are fixed, the supremum ||P3l reduces {o a maximum taken over a strategically chosen set of intervals I= (20, 4h for j = 0, 1,...,n. OF course the choice of these intervals depends on &; we need one t, between each pair of adjacent observations. (The fg and t, are not really necessary.) With the number of intervals reduced so drastically, we can afford a crude bound for the supremum, (12) Wil Psll > Fel 8} = Srurs| > b|B) S (n+ 1) max P{|P3L,| > 468). ? 16 1. Uniform Convergence of Empirical Measures This bound will be adequate for the present because the conditional proba- bilities decrease exponentially fast with n, thanks to an inequality of Hoeffding for sums of independent, bounded random variables. EXPONENTIAL BOUNDS. Let ¥;,..., ¥, be independent random variables, each with zero mean and bounded range: a, < ¥;0, Hoefiding’s Inequality (Appendix B) asserts PUK ee + YEN} s2e1p{—27/ 36, - «| Apply the inequality with ¥; = o({é, < 1}. Given & the random variable ¥, takes only two values, + (; Sel} < 200 + 1) exp(—ne/32). Notice that the right-hand side now does not depend on &. InteraTion. Take expectations over & IPP, — Pl) > e} < 8(n + 1) exp(—me?/32). This gives very fast convergence in probability, so fast that $pur.—ri>a 0. The Borel-Cantelli lemma turns this into the full almost sure ‘convergence asserted by the Glivenko-Cantelli theorem, IL4. Classes of Sets with Polynomial Discrimination ‘We made use of very few distinguishing properties of intervals for the proof of the Glivenko-Cantelli theorem in Section 3. The main requirement was that they should pick out at most n + 1 subsets from any set of n points. Other classes have a similar property. For example, quadrants of the form (~c, t] in IR? can pick out fewer than (n + 1)? different subsets from a TA. Classes of Sets with Polynomial Discrimination 7 setofinpointsin the plane—there are at mostn + | placesto set the horizontal boundary and at most m + 1 places to set the vertical boundary. (Problem 8 ives the precise upper bound.) With (n+ 1)? replacing the n + 1 factor, ‘we could repeat the arguments from Section 3 to get the bivariate analogue of the Glivenko-Cantelli theorem. The exponential bound would swallow up (1 + 1, just as it did the m + 1. Indeed, it would swallow up any poly. nomial, The argument works for intervals, quadrants, and any other class of sets that picks out @ polynomial number of subsets. 13 Definition, Let J be a class of subsets of some space S. Its said to have polynomial discrimination (of degree 1) if there exists a polynomial p(-) (of degree 2) such that, from every set of N’ points in S, the class picks out at most p(N) distinet subsets. Formally, if So consists of NV poitts, then there are at most p(N) distinct sets of the form So.“ D with D in , Call p(:) the discriminating polynomial for 2. o When the risk of confusion with the algebraic sort of polynomial is slight, Jet us shorten the name “class having polynomial discrimination” to “polynomial class,” and adopt the usual terminology for polytomials of low