Large Deviations: S. R. S. Varadhan
1. Introduction

The theory of large deviations deals with the rates at which probabilities of certain events decay as a natural parameter in the problem varies. It is best to think of a specific example to clarify the idea. Suppose we have $n$ independent and identically distributed random variables $\{X_i\}$, having mean zero and variance one, with a common distribution $\alpha$. The distribution of $\bar X_n = \frac{X_1 + \cdots + X_n}{n}$ converges to the degenerate distribution at $0$, while
\[ Z_n = \frac{X_1 + \cdots + X_n}{\sqrt n} \]
has a limiting normal distribution according to the Central Limit Theorem. In particular,

(1.1) \[ \lim_{n\to\infty} P[Z_n \ge \ell] = \frac{1}{\sqrt{2\pi}} \int_\ell^\infty e^{-\frac{x^2}{2}}\, dx. \]
While the convergence in (1.1) is uniform in $\ell$, it does not say much if $\ell$ is large. A natural question to ask is: does the ratio of $P[Z_n \ge \ell_n]$ to the Gaussian tail probability tend to $1$ even when $\ell_n \to \infty$? It depends on how rapidly $\ell_n$ is getting large. If $\ell_n \ll \sqrt n$ it holds under suitable conditions, but if $\ell_n \sim c\sqrt n$ it clearly does not. For instance, if the $\{X_i\}$ are bounded by a constant $C$ and $\ell_n > C\sqrt n$, then $P[Z_n \ge \ell_n] = 0$, while the Gaussian probability is positive. When $\ell_n \ll \sqrt n$ these are referred to as moderate deviations; large deviations are when $\ell_n \sim c\sqrt n$. While moderate deviations are refinements of the Central Limit Theorem, large deviations are different. It is better to think of them as estimating the probability
\[ p_n(\ell) = P[X_1 + \cdots + X_n \ge n\ell] = P[\bar X_n \ge \ell]. \]
It is expected that this probability decays exponentially. In the Gaussian case it is clear that
\[ p_n(\ell) = \sqrt{\frac{n}{2\pi}} \int_\ell^\infty e^{-\frac{n x^2}{2}}\, dx = e^{-\frac{n\ell^2}{2} + o(n)} = e^{-n I(\ell) + o(n)}. \]
The function $I(\ell) = \frac{\ell^2}{2}$ reflects the fact that the common distribution of the $\{X_i\}$ was Gaussian to begin with.

Following Cramér, let us take for the common distribution of our random variables $X_1, X_2, \ldots, X_n$ an arbitrary distribution $\alpha$. We shall try to estimate $P[\bar X_n \in A]$, where $A = \{x : x \ge \ell\}$ for some $\ell > m$, $m$ being the mean of the distribution $\alpha$. Under suitable assumptions $\mu_n(A)$, where $\mu_n$ denotes the distribution of $\bar X_n$, should again decay exponentially rapidly in $n$ as $n \to \infty$, and our goal is to find the precise exponential constant. In other words, we want to calculate
\[ \lim_{n\to\infty} -\frac1n \log \mu_n(A) = I(\ell) \]
as explicitly as possible. The attractive feature of large deviation theory is that such objects can be readily computed. If we denote the sum by $S_n = \sum_{i=1}^n X_i$, then
\[ \mu_n(A) = \Pr[S_n \ge n\ell] \]
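The Gaussian computation above is easy to check numerically. The following sketch (my own illustration, not part of the notes) evaluates $p_n(\ell) = P[Z \ge \ell\sqrt n]$ exactly for standard normal summands via the complementary error function, and watches $-\frac1n \log p_n(\ell)$ approach $I(\ell) = \frac{\ell^2}{2}$:

```python
import math

def gaussian_rate(n, ell):
    # -(1/n) * log P[ (X_1+...+X_n)/n >= ell ] for i.i.d. standard normals;
    # the sum is N(0, n), so the tail probability is 0.5*erfc(ell*sqrt(n)/sqrt(2))
    p = 0.5 * math.erfc(ell * math.sqrt(n) / math.sqrt(2.0))
    return -math.log(p) / n

ell = 0.5
for n in [20, 200, 2000]:
    print(n, gaussian_rate(n, ell))   # approaches ell**2/2 = 0.125 from above
```

The convergence is slow, with a correction of order $\frac{\log n}{n}$ coming from the polynomial prefactor of the Gaussian tail.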
can be estimated by the standard Tchebychev-type estimate
\[ \mu_n(A) \le e^{-\sigma n \ell}\, E\{\exp[\sigma S_n]\} \]
with any $\sigma > 0$. Denoting the moment generating function of the underlying distribution by
\[ M(\sigma) = \int e^{\sigma x}\, d\alpha \]
and its logarithm by $\psi(\sigma) = \log M(\sigma)$, we obtain the obvious inequality
\[ \mu_n(A) \le e^{-\sigma n \ell}\, [M(\sigma)]^n, \]
and by taking logarithms on both sides and dividing by $n$,
\[ \frac1n \log \mu_n(A) \le -\sigma \ell + \psi(\sigma). \]
Since the above inequality is valid for every $\sigma \ge 0$, we should optimize and get
\[ \frac1n \log \mu_n(A) \le -\sup_{\sigma \ge 0}\, [\sigma\ell - \psi(\sigma)]. \]
By an application of Jensen's inequality we can see that if $\sigma < 0$, then $\psi(\sigma) \ge \sigma m \ge \sigma \ell$, so that $\sigma\ell - \psi(\sigma) \le 0$. Because $\sigma = 0$ always yields the trivial value $0$, replacing $\sup_{\sigma \ge 0}[\sigma\ell - \psi(\sigma)]$ by $\sup_{\sigma}[\sigma\ell - \psi(\sigma)]$ does not increase its value. Therefore we have

(1.2) \[ \frac1n \log \mu_n(A) \le -\sup_{\sigma}\, [\sigma\ell - \psi(\sigma)]. \]

It is in fact more convenient to introduce the conjugate function

(1.3) \[ h(\ell) = \sup_{\sigma}\, [\sigma\ell - \psi(\sigma)], \]
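The supremum in (1.3) is a one-dimensional concave maximization and is easy to evaluate numerically. As an illustration (my own sketch; the Bernoulli example and the ternary-search routine are not from the notes), for a Bernoulli($p$) distribution $\psi(\sigma) = \log(1 - p + p e^{\sigma})$, and the conjugate is known in closed form: $h(\ell) = \ell \log\frac{\ell}{p} + (1-\ell)\log\frac{1-\ell}{1-p}$.

```python
import math

def psi(sigma, p):
    # psi(sigma) = log M(sigma) for a Bernoulli(p) distribution
    return math.log(1.0 - p + p * math.exp(sigma))

def h(ell, p, lo=-50.0, hi=50.0, iters=200):
    # h(ell) = sup_sigma [sigma*ell - psi(sigma)]; the objective is concave
    # in sigma, so ternary search converges to the maximizer sigma_0.
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if m1 * ell - psi(m1, p) < m2 * ell - psi(m2, p):
            lo = m1
        else:
            hi = m2
    s = 0.5 * (lo + hi)
    return s * ell - psi(s, p)

def h_exact(ell, p):
    # closed form: relative entropy of Bernoulli(ell) w.r.t. Bernoulli(p)
    return ell * math.log(ell / p) + (1 - ell) * math.log((1 - ell) / (1 - p))

print(h(0.75, 0.5), h_exact(0.75, 0.5))
```

The same numerical Legendre transform works for any distribution whose $\psi$ can be evaluated.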
which is seen to be a nonnegative convex function of $\ell$, with a minimum value of $0$ at the mean $\ell = m$. From the convexity we see that $h(\ell)$ is non-increasing for $\ell \le m$ and non-decreasing for $\ell \ge m$. We can now rewrite our upper bound as
\[ \frac1n \log \mu_n(A) \le -h(\ell) \]
for $\ell \ge m$. A similar statement is valid for sets of the form $A = \{x : x \le \ell\}$ with $\ell \le m$.

We shall now prove that the upper bounds we obtained are optimal. To do this we need effective lower bounds. Let us first assume that $\ell = m$. In this case we know by the central limit theorem that
\[ \lim_{n\to\infty} \mu_n(A) = \frac12, \]
and clearly it follows that
\[ \liminf_{n\to\infty} \frac1n \log \mu_n(A) \ge 0 = -h(m). \]
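For coin tossing the probability $\mu_n(A)$ can be computed exactly, so the predicted decay rate can be observed directly. The sketch below is my own illustration (not from the notes); for a fair coin with values $0$ and $1$ the rate function is the standard closed form $h(\ell) = \ell\log(2\ell) + (1-\ell)\log(2(1-\ell))$.

```python
import math

def log_binom_tail(n, k):
    # log P[S_n >= k] for S_n ~ Binomial(n, 1/2), summed stably via log-sum-exp
    logs = [math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
            - n * math.log(2.0) for j in range(k, n + 1)]
    mx = max(logs)
    return mx + math.log(sum(math.exp(t - mx) for t in logs))

ell = 0.75
h = ell * math.log(2 * ell) + (1 - ell) * math.log(2 * (1 - ell))
for n in [100, 400, 1600]:
    rate = -log_binom_tail(n, int(n * ell)) / n
    print(n, rate, h)   # rate decreases toward h(0.75)
```

As in the Gaussian case, the observed rate exceeds $h(\ell)$ by a correction of order $\frac{\log n}{n}$.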
To do the general case, let us assume that the distribution $\alpha$ does not live on any proper subinterval of $(-\infty, \infty)$. Then $\psi(\sigma)$ grows super-linearly at $\pm\infty$, and the supremum in $\sup_\sigma[\sigma\ell - \psi(\sigma)]$ is attained at some value $\sigma_0$ depending on $\ell$. Equating the derivative to $0$,

(1.4) \[ \ell = \psi'(\sigma_0) = \frac{M'(\sigma_0)}{M(\sigma_0)} = \frac{1}{M(\sigma_0)} \int x\, e^{\sigma_0 x}\, d\alpha(x). \]

The tilted distribution
\[ d\alpha_{\sigma_0}(x) = \frac{e^{\sigma_0 x}}{M(\sigma_0)}\, d\alpha(x) \]
then has $\ell$ for its expected value. If we denote by $\lambda_n$ the distribution of the mean of the $n$ random variables under the assumption that their common distribution is $\alpha_{\sigma_0}$ rather than $\alpha$, an elementary calculation yields

(1.5) \[ \mu_n(A) = \int_A [M(\sigma_0)\, e^{-\sigma_0 x}]^n\, d\lambda_n(x). \]

To get a lower bound, let us replace $A$ by $A_\delta = \{x : \ell \le x \le \ell + \delta\}$ for some $\delta > 0$. Then
\[ \mu_n(A) \ge \mu_n(A_\delta) \ge [M(\sigma_0)\, e^{-\sigma_0(\ell + \delta)}]^n\, \lambda_n(A_\delta). \]
Again applying the central limit theorem, but now to $\lambda_n$, we see that $\lambda_n(A_\delta) \to \frac12$, and taking logarithms we get
\[ \liminf_{n\to\infty} \frac1n \log \mu_n(A) \ge \psi(\sigma_0) - \sigma_0(\ell + \delta). \]
Since $\delta > 0$ was arbitrary, we can let it go to $0$ and obtain
\[ \liminf_{n\to\infty} \frac1n \log \mu_n(A) \ge \psi(\sigma_0) - \sigma_0 \ell = -h(\ell). \]
A similar proof works for intervals of the form $A = \{x : x \le \ell\}$ as well. We have therefore calculated the precise exponential decay rate of the relevant probabilities.
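The tilting construction is also the basis of a practical Monte Carlo method (importance sampling): sample under $\alpha_{\sigma_0}$, where the event is no longer rare, and reweight by the Radon–Nikodym derivative $[M(\sigma_0) e^{-\sigma_0 x}]^n$ along the sample. The sketch below is my own illustration for standard Gaussian $X_i$, where $M(\sigma) = e^{\sigma^2/2}$, so $\sigma_0 = \ell$ and the tilted law is $N(\ell, 1)$.

```python
import math, random

def tilted_estimate(n, ell, samples, seed=0):
    # Estimate P[(X_1+...+X_n)/n >= ell] for i.i.d. standard normals by
    # sampling X_i ~ N(ell, 1) (the tilted law; sigma_0 = ell for the
    # Gaussian) and reweighting each path by exp(n*sigma0^2/2 - sigma0*S_n).
    rng = random.Random(seed)
    sigma0 = ell
    total = 0.0
    for _ in range(samples):
        s = sum(rng.gauss(ell, 1.0) for _ in range(n))
        if s >= n * ell:
            total += math.exp(n * sigma0 * sigma0 / 2.0 - sigma0 * s)
    return total / samples

def exact_tail(n, ell):
    # the sum is N(0, n): P[S_n >= n*ell] = 0.5*erfc(ell*sqrt(n)/sqrt(2))
    return 0.5 * math.erfc(ell * math.sqrt(n) / math.sqrt(2.0))

n, ell = 50, 0.5
print(tilted_estimate(n, ell, 20000), exact_tail(n, ell))
```

Naive sampling would need on the order of $10^4$ times more samples to see this event (its probability is about $2\times10^{-4}$) even once with comparable accuracy; under the tilted law roughly half the samples land in $A$.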
Remark 1.1. This way of computing the probability of a rare event, by changing the underlying model to one where the rare event is no longer rare and using the explicit form of the Radon–Nikodym derivative to estimate the probability, goes back to Cramér–Lundberg in the 1930s. The problem is to estimate

(1.6) \[ q(\ell) = P\Bigl[\sup_{t} X(t) \ge \ell\Bigr], \]

where $X(t) = \sum_{\tau_i \le t} \xi_i - m t$ is all the claims paid up to time $t$ less the premiums collected, reflecting the net total loss up to time $t$, and $\ell$ is the initial capital of an insurance company. The claims follow a compound Poisson process with the Lévy–Khintchine representation

(1.7) \[ \log E\bigl[\exp[\sigma X(t)]\bigr] = t \log \rho(\sigma) = t\, \lambda \int (e^{\sigma x} - 1)\, dF(x) - \sigma m t. \]

Here $\lambda$ is the claims rate, $F$ is the distribution of the individual claim amounts, and $m$ is the rate at which premium is being collected. It is assumed that the company is profitable, i.e. $\lambda \int x\, dF(x) < m$. If $\ell$ is large, the probability of ever running out of cash should be small.
The problem is to estimate $q(\ell)$ for large $\ell$. First consider the embedded random walk, which is the net outflow of cash after each claim is paid:
\[ S_{n+1} = S_n + \xi_{n+1} - m\,\tau_{n+1} = S_n + Y_{n+1}. \]
Here $\xi_{n+1}$ is the amount of the new claim and $\tau_{n+1}$ is the gap between the claims, during which a premium of $m\,\tau_{n+1}$ was collected. The idea of tilting is to find $\sigma_0 > 0$ so that
\[ E[e^{\sigma_0 Y}] = E[e^{\sigma_0(\xi - m\tau)}] = E[e^{\sigma_0 \xi}]\, E[e^{-m\sigma_0 \tau}] = M(\sigma_0)\, \frac{\lambda}{\lambda + m\sigma_0} = 1, \]
where $M$ is the moment generating function of the claim size distribution $F$. Such a $\sigma_0 \ne 0$ exists because $E[Y] < 0$. Tilt so that $(\xi, \tau)$ is now distributed as
\[ e^{\sigma_0 y}\, dF(y)\; \lambda\, e^{-(\lambda + m\sigma_0)\tau}\, d\tau, \]
which is a probability distribution precisely because $E[e^{\sigma_0 Y}] = 1$. The tilted process $Q$ will now have a net positive outflow: we have made the claims more frequent and more expensive, while keeping the premium the same. It will run out of cash. Let $\eta$ be the overshoot, i.e. the shortfall beyond $\ell$ when that happens. Using the Radon–Nikodym derivative $e^{-\sigma_0 S}$ of the original walk with respect to the tilted one, evaluated at the first time the walk exceeds $\ell$, where $S = \ell + \eta$,
\[ q(\ell) = E^Q\bigl[e^{-\sigma_0(\ell + \eta)}\bigr] = e^{-\sigma_0 \ell}\, E^Q\bigl[e^{-\sigma_0 \eta}\bigr], \]
and $E^Q[e^{-\sigma_0 \eta}]$ has a limit as $\ell \to \infty$.
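For exponentially distributed claims the tilting parameter $\sigma_0$ can be found both numerically and in closed form. Assuming $\xi \sim \mathrm{Exp}(\mu)$ and $\tau \sim \mathrm{Exp}(\lambda)$ (my own illustrative choice, not from the notes), the equation $E[e^{\sigma_0 \xi}] E[e^{-m\sigma_0 \tau}] = 1$ becomes $\frac{\mu}{\mu - \sigma} \cdot \frac{\lambda}{\lambda + m\sigma} = 1$, whose nonzero root is $\sigma_0 = \mu - \lambda/m$.

```python
import math

def ee_sigma_y(sigma, lam, mu, m):
    # E[e^{sigma*Y}] for Y = xi - m*tau, xi ~ Exp(mu), tau ~ Exp(lam);
    # valid for 0 <= sigma < mu
    return (mu / (mu - sigma)) * (lam / (lam + m * sigma))

def adjustment_coefficient(lam, mu, m, tol=1e-12):
    # Bisection for the nonzero root sigma_0 of E[e^{sigma*Y}] = 1 on (0, mu).
    # The mgf is convex, equals 1 at 0, dips below 1 (since E[Y] < 0) and
    # blows up as sigma -> mu, so exactly one root lies in between.
    lo, hi = tol, mu - tol
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ee_sigma_y(mid, lam, mu, m) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam, mu, m = 1.0, 1.0, 1.5   # profitable: mean claim outflow lam/mu < m
print(adjustment_coefficient(lam, mu, m), mu - lam / m)
```

In risk theory $\sigma_0$ is called the adjustment (Lundberg) coefficient, and $q(\ell) \approx C e^{-\sigma_0 \ell}$.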
Remark 1.2. If the underlying distribution were the standard Gaussian, then we see that $h(\ell) = \frac{\ell^2}{2}$.

Remark 1.3. One should really think of $h(\ell)$ as giving the local decay rate of the probability at or near $\ell$. Because $a^n + b^n \le 2[\max(a, b)]^n$, and the factor $2$ leaves no trace when we take logarithms and divide by $n$, the global decay rate is really the worst local decay rate. The correct form of the result should state
\[ \lim_{n\to\infty} \frac1n \log \mu_n(A) = -\inf_{x \in A} h(x). \]
Using the monotonicity of $h$ on either side of $m$, one can see that this is indeed correct for sets of the form $A = (-\infty, \ell]$ or $[\ell, \infty)$. In fact the upper bound
\[ \limsup_{n\to\infty} \frac1n \log \mu_n(A) \le -\inf_{x \in \bar A} h(x) \]
is easily seen to be valid for more or less any set $A$. Just consider the smallest set of the form $(-\infty, a] \cup [b, \infty)$, with $a \le m \le b$, that contains $A$. While $a$ and $b$ may not be in $A$, they clearly belong to $\bar A$, the closure of $A$. If we use the continuity of $h$ we see that $A$ can be arbitrary; if we do not want to use the continuity of $h$, then it is surely true for closed sets. The lower bound is more of a problem. In the Gaussian case, if $A$ is a set of Lebesgue measure zero, then $\mu_n(A) = 0$ for all $n$ and the lower bound clearly does not hold in this case. Or if we take the coin tossing or Binomial case, if $A$ contains only irrationals, we are again out of luck. We can see that the proof we gave works if the point $\ell$ is an interior point of $A$. Therefore it is natural to assume for the lower bound that $A$ is an open set, just as it is natural to assume that $A$ is closed for the upper bound.
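The principle that the global decay rate is the worst local rate can be observed numerically in the coin-tossing case. The sketch below (my own illustration) computes the exact decay rate of $\mu_n(A)$ for a two-sided set $A = (-\infty, a] \cup [b, \infty)$ with a biased coin and compares it with $\inf_{x \in A} h(x) = \min(h(a), h(b))$.

```python
import math

def log_binom_pmf(n, k, p):
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def log_prob_set(n, p, a, b):
    # log P[ mean <= a or mean >= b ] for the mean of n Bernoulli(p) variables
    logs = [log_binom_pmf(n, k, p) for k in range(0, n + 1)
            if k <= a * n or k >= b * n]
    mx = max(logs)
    return mx + math.log(sum(math.exp(t - mx) for t in logs))

def h(x, p):
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

p, a, b = 0.6, 0.2, 0.8
target = min(h(a, p), h(b, p))   # the worst (smallest) local rate wins
for n in [200, 800, 3200]:
    print(n, -log_prob_set(n, p, a, b) / n, target)
```

Here $h(0.2) \approx 0.335$ dominates $h(0.8) \approx 0.0915$, so only the nearer endpoint $b = 0.8$ is visible in the limit.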
Exercise 1. For coin tossing, where $\alpha$ is the discrete distribution with masses $p$ and $q = 1 - p$ at $1$ and $0$ respectively, calculate explicitly the function $h(x)$. In this case our proof of the lower bound may not be valid because $\alpha$ lives on a bounded interval. Provide two proofs of the lower bound: one by explicit combinatorial calculation and Stirling's formula, and the other by an argument that extends our proof to the general case.

Exercise 2. Extend the result to the case where $X_1, X_2, \ldots, X_n$ take values in $\mathbb{R}^d$ with a common distribution $\alpha$ on $\mathbb{R}^d$. Tchebychev will give estimates of probabilities for half-spaces, and for a ball we can try to get the half-space that gives the optimal estimate. If the ball is small this is adequate. A compact set is then covered by a finite number of small balls, and a closed set is approximated by closed bounded sets by cutting it off. This will give the upper bound for closed sets, and the lower bound for open sets presents no additional difficulties in the multidimensional case.

Exercise 3. Let us look at the special case of $\alpha$ on $\mathbb{R}^d$ that has probabilities $\pi_1, \pi_2, \ldots, \pi_d$ (with $\sum_i \pi_i = 1$), respectively, at the unit vectors $e_1, e_2, \ldots, e_d$ in the coordinate directions. Calculate explicitly the rate function $h(x)$ in this case. Is there a way of recovering from this example the one-dimensional result for a general discrete $\alpha$ with $d$ mass points?

Exercise 4. If we replace the compound Poisson process by Brownian motion with a positive drift, $x(t) = \beta(t) + mt$, $m > 0$, then
\[ q(\ell) = P\Bigl[\inf_{t \ge 0} x(t) \le -\ell\Bigr] = e^{-2m\ell}. \]
References:

1. H. Cramér, Collective Risk Theory, Stockholm, 1955.
2. H. Cramér, Sur un nouveau théorème limite de la théorie des probabilités, Actualités Scientifiques, Paris, 1938.

2. General Principles

In this section we will develop certain basic principles that govern the theory of large deviations. We will be working for the most part on Polish (i.e. complete separable metric) spaces. Let $X$ be such a space. A function $I(\cdot): X \to [0, \infty]$ will be called a (proper) rate function if it is lower semicontinuous and the level sets $K_\ell = \{x : I(x) \le \ell\}$ are compact in $X$ for every $\ell < \infty$. Let $P_n$ be a sequence of probability distributions on $X$. We say that $P_n$ satisfies the large deviation principle (LDP) on $X$ with rate function $I(\cdot)$ if the following two statements hold. For every closed set $C \subset X$,
\[ \limsup_{n\to\infty} \frac1n \log P_n(C) \le -\inf_{x \in C} I(x), \]
and for every open set $G \subset X$,
\[ \liminf_{n\to\infty} \frac1n \log P_n(G) \ge -\inf_{x \in G} I(x). \]
Remark 2.1. The value $+\infty$ is allowed for the function $I(\cdot)$. All the infima are effectively taken only on the set where $I(\cdot) < \infty$.

Remark 2.2. The examples of the previous section are instances where the LDP is valid. One should check that the assumption of compact level sets holds in those cases, and this verification is left as an exercise.

Remark 2.3. Under our assumptions the infimum $\inf_x I(x)$ is clearly attained, and since $P_n(X) = 1$, this infimum has to be $0$. If we define $K_0 = \{x : I(x) = 0\}$, then $K_0$ is a compact set and the sequence $P_n$ is tight, with any limit point $Q$ of the sequence satisfying $Q(K_0) = 1$. In particular, if $K_0$ is a single point $x_0$ of $X$, then $P_n \Rightarrow \delta_{x_0}$, i.e. $P_n$ converges weakly to the distribution degenerate at the point $x_0$ of $X$.

There are certain general properties that are easy to prove.

Theorem 2.4. Suppose $P_n$ is a sequence that satisfies an LDP on $X$ with respect to a proper rate function $I(\cdot)$. Then for any $\ell < \infty$ there exists a compact set $D_\ell \subset X$ such that
\[ P_n(D_\ell) \ge 1 - e^{-n\ell} \quad \text{for every } n. \]

Proof. For each $k \ge 1$, let us consider the compact level set $A_k = \{x : I(x) \le \ell + k + 1\}$. From the compactness of $A_k$ it follows that it can be covered by a finite number of open balls of radius $\frac1k$. The union, which we will denote by $U_k$, is an open set, and $I(x) \ge \ell + k + 1$ on the closed set $U_k^c$. By the LDP, for every $k$,
\[ \limsup_{n\to\infty} \frac1n \log P_n(U_k^c) \le -(\ell + k + 1), \]
so there is an $n_0(k)$ such that
\[ P_n(U_k^c) \le e^{-kn}\, e^{-\ell n} \quad \text{for } n \ge n_0(k). \]
We can assume without loss of generality that $n_0(k) \ge k$ for every $k \ge 1$. For $j = 1, 2, \ldots, n_0(k)$, we can find compact sets $B_{k,1}, B_{k,2}, \ldots, B_{k,n_0(k)}$ such that
\[ P_j(B_{k,j}^c) \le e^{-kj}\, e^{-\ell j} \quad \text{for every } j; \]
this is possible because each individual measure $P_j$ on a Polish space is tight. Let us define
\[ E = \bigcap_{k} \Bigl[ U_k \cup \bigl( \bigcup_{j} B_{k,j} \bigr) \Bigr]. \]
$E$ is clearly totally bounded: for each $k$ it is contained in a finite union of balls of radius $\frac1k$ together with finitely many compact sets, each of which can again be covered by finitely many balls of radius $\frac1k$. Moreover, for every $n$ and $k$, either $n \ge n_0(k)$, in which case $P_n(U_k^c) \le e^{-kn} e^{-\ell n}$, or $n < n_0(k)$, in which case $P_n(B_{k,n}^c) \le e^{-kn} e^{-\ell n}$; in either case the $k$-th set in the intersection has complement of $P_n$-measure at most $e^{-kn} e^{-\ell n}$. Therefore
\[ P_n(E^c) \le \sum_{k \ge 1} e^{-kn}\, e^{-\ell n} \le e^{-\ell n}, \]
and the closure $D_\ell = \bar E$, which is compact, satisfies $P_n(D_\ell) \ge 1 - e^{-n\ell}$. We are done.
Remark 2.5. The conclusion of the theorem, which is similar in spirit to Prohorov's tightness condition in the context of weak convergence, will be called superexponential tightness.

There are some elementary relationships where the validity of the LDP in one context implies the same in other related situations. We shall develop a couple of them.

Theorem 2.6. Suppose $P_n$ and $Q_n$ are two sequences on two spaces $X$ and $Y$ satisfying the LDP with rate functions $I(\cdot)$ and $J(\cdot)$ respectively. Then the sequence of product measures $R_n = P_n \times Q_n$ on $X \times Y$ satisfies an LDP with the rate function $K(x, y) = I(x) + J(y)$.

Proof. The proof is typical of large deviation arguments, and we will go through it once for the record.

Step 1. Let us pick a point $z = (x, y)$ in $Z = X \times Y$, and let $\epsilon > 0$ be given. We wish to show that there exists an open set $N = N_{z,\epsilon}$ containing $z$ such that
\[ \limsup_{n\to\infty} \frac1n \log R_n(N) \le -K(z) + \epsilon. \]
Let us find an open set $U_1$ in $X$ such that $I(x') \ge I(x) - \frac\epsilon2$ for all $x' \in U_1$. This is possible by the lower semicontinuity of $I(\cdot)$. By general separation theorems in a metric space we can find an open set $U_2$ such that $x \in U_2 \subset \bar U_2 \subset U_1$. By the LDP of the sequence $P_n$,
\[ \limsup_{n\to\infty} \frac1n \log P_n(U_2) \le \limsup_{n\to\infty} \frac1n \log P_n(\bar U_2) \le -\inf_{x' \in \bar U_2} I(x') \le -I(x) + \frac\epsilon2. \]
Similarly,
\[ \limsup_{n\to\infty} \frac1n \log Q_n(V_2) \le \limsup_{n\to\infty} \frac1n \log Q_n(\bar V_2) \le -\inf_{y' \in \bar V_2} J(y') \le -J(y) + \frac\epsilon2 \]
for some open sets $V_1, V_2$ with $y \in V_2 \subset \bar V_2 \subset V_1$. If we take $N = U_2 \times V_2$ as the neighborhood of $z = (x, y)$, then $R_n(N) = P_n(U_2)\, Q_n(V_2)$ and we are done.

Step 2. Let $D \subset Z = X \times Y$ be a compact set, and let $\epsilon > 0$ be given. We will show that for some neighborhood $D^\epsilon$ of $D$,
\[ \limsup_{n\to\infty} \frac1n \log R_n(D^\epsilon) \le -\inf_{z \in D} K(z) + \epsilon. \]
We know from Step 1 that for each $z \in D$ there is a neighborhood $N_z$ such that
\[ \limsup_{n\to\infty} \frac1n \log R_n(N_z) \le -K(z) + \epsilon \le -\inf_{z' \in D} K(z') + \epsilon. \]
From the open covering $\{N_z\}$ of $D$ we extract a finite subcover $\{N_j : 1 \le j \le k\}$, and if we take $D^\epsilon = \cup_j N_j$, then
\[ R_n(D^\epsilon) \le \sum_{j} R_n(N_j). \]
Since the number $k$ of summands leaves no trace after taking logarithms, dividing by $n$, and passing to the limit, we get
\[ \limsup_{n\to\infty} \frac1n \log R_n(D^\epsilon) \le -\inf_{z \in D} K(z) + \epsilon. \]
In particular, because $\epsilon > 0$ is arbitrary, we get
\[ \limsup_{n\to\infty} \frac1n \log R_n(D) \le -\inf_{z \in D} K(z). \]

Step 3. From the superexponential tightness, for any given $\ell$ there are compact sets $A_\ell$ and $B_\ell$ in $X$ and $Y$ respectively such that $P_n([A_\ell]^c) \le e^{-n\ell}$ and $Q_n([B_\ell]^c) \le e^{-n\ell}$. If we define the compact set $C_\ell \subset Z$ by $C_\ell = A_\ell \times B_\ell$, then $R_n([C_\ell]^c) \le 2 e^{-n\ell}$. We can complete the proof of the upper bound by writing any closed set $C$ as the union $C = [C \cap C_\ell] \cup [C \cap (C_\ell)^c]$. An easy calculation yields
\[ \limsup_{n\to\infty} \frac1n \log R_n(C) \le \max\Bigl( -\inf_{z \in C \cap C_\ell} K(z),\, -\ell \Bigr) \le \max\Bigl( -\inf_{z \in C} K(z),\, -\ell \Bigr), \]
and we can let $\ell \to \infty$ to obtain the upper bound for arbitrary closed sets.

Step 4. The lower bound is much simpler. We need to prove only that if $z \in Z$ is arbitrary and $N$ is a neighborhood of $z$ in $Z$, then
\[ \liminf_{n\to\infty} \frac1n \log R_n(N) \ge -K(z). \]
Since any neighborhood of $z$ contains a product $U \times V$ of neighborhoods $U$ and $V$, the lower bound for $R_n$ follows easily from the lower bounds in the LDP for $P_n$ and $Q_n$.

The next result has to do with the behavior of the LDP under a continuous mapping from one space to another, and is referred to as the Contraction Principle.

Theorem 2.7. If $P_n$ satisfies an LDP on $X$ with a rate function $I(\cdot)$, and $F$ is a continuous mapping from the Polish space $X$ to another Polish space $Y$, then the family $Q_n = P_n F^{-1}$ satisfies an LDP on $Y$ with a rate function $J(\cdot)$ given by
\[ J(y) = \inf_{x : F(x) = y} I(x). \]
Proof. Let $C \subset Y$ be closed. Then $D = F^{-1}(C) = \{x : F(x) \in C\}$ is a closed subset of $X$, and
\[ \limsup_{n\to\infty} \frac1n \log Q_n(C) = \limsup_{n\to\infty} \frac1n \log P_n(D) \le -\inf_{x \in D} I(x) = -\inf_{y \in C}\, \inf_{x : F(x) = y} I(x) = -\inf_{y \in C} J(y). \]
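The formula $J(y) = \inf_{x : F(x) = y} I(x)$ can be evaluated by brute force when $X = Y = \mathbb R$. The sketch below is my own illustration (not from the notes): with $I(x) = \frac{x^2}{2}$, the Gaussian rate, and $F(x) = x^2$, the fiber over $y \ge 0$ is $\{\pm\sqrt y\}$, so $J(y) = \frac{y}{2}$. The approximate-fiber trick is a crude discretization.

```python
# Numerical contraction principle: J(y) = inf of I over {x : F(x) = y}.

def I(x):
    return x * x / 2.0   # Gaussian rate function

def F(x):
    return x * x

def J_numeric(y, lo=-10.0, hi=10.0, steps=200001, tol=1e-3):
    # brute-force infimum of I over the approximate fiber {x : |F(x)-y| < tol}
    best = float("inf")
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        if abs(F(x) - y) < tol:
            best = min(best, I(x))
    return best

print(J_numeric(4.0), 4.0 / 2.0)   # the smaller of I(2) and I(-2)
```

The infimum over the fiber is exactly the "worst local rate wins" principle again, now applied fiber by fiber.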
The lower bound is proved just as easily from the definition: if $U \subset Y$ is open, then $V = F^{-1}(U)$ is open in $X$, and
\[ \liminf_{n\to\infty} \frac1n \log Q_n(U) = \liminf_{n\to\infty} \frac1n \log P_n(V) \ge -\inf_{x \in V} I(x) = -\inf_{y \in U}\, \inf_{x : F(x) = y} I(x) = -\inf_{y \in U} J(y). \]
(One also checks that $J$ is a proper rate function: if $J(y_n) \le \ell$ and $y_n \to y$, points $x_n$ with $F(x_n) = y_n$ and $I(x_n) \le \ell + \frac1n$ lie in a compact level set of $I$, and any limit point $x$ satisfies $F(x) = y$ and $I(x) \le \ell$; thus the level sets of $J$ are closed subsets of continuous images of compact sets, hence compact.)

Let us now examine the asymptotic behavior of integrals of the form
\[ a_n = \int e^{nF(x)}\, d\mu_n(x), \qquad \text{where } d\mu_n(x) = e^{-nI(x)}\, dx. \]
It is clear that the contribution comes mostly from the point where $F(x) - I(x)$ achieves its maximum. In fact, under mild conditions it is easy to see that
\[ \lim_{n\to\infty} \frac1n \log a_n = \sup_x\, [F(x) - I(x)]. \]
This essentially remains true under an LDP.

Theorem 2.8. Assume that $P_n$ satisfies an LDP with rate function $I(\cdot)$ on $X$. Suppose that $F(\cdot)$ is a bounded continuous function on $X$. Then
\[ \lim_{n\to\infty} \frac1n \log \int_X e^{nF(x)}\, dP_n(x) = \sup_{x \in X}\, [F(x) - I(x)]. \]
Proof. The basic idea of the proof is that a sum of a finite number of exponentials grows like the largest of them, and since they are all nonnegative there is no chance of any cancellation.

Step 1. Let $\epsilon > 0$ be given. For every $x \in X$ there is a neighborhood $N_x$ such that $F(x') \le F(x) + \epsilon$ on $N_x$ and $I(x') \ge I(x) - \epsilon$ on the closure $\bar N_x$ of $N_x$. We have here used the lower semicontinuity of $I(\cdot)$ and the continuity (in fact only the upper semicontinuity) of $F(\cdot)$. If we denote
\[ a_n(A) = \int_A e^{nF(x)}\, dP_n(x), \]
we have shown that for any $x \in X$ there is a neighborhood $N_x$ of $x$ such that
\[ a_n(N_x) \le e^{n[F(x) + \epsilon]}\, P_n(N_x) \le e^{n[F(x) + \epsilon]}\, P_n(\bar N_x). \]
Step 2. By a standard compactness argument, for any compact $K \subset X$,
\[ \limsup_{n\to\infty} \frac1n \log a_n(K) \le \sup_x\, [F(x) - I(x)] + 2\epsilon. \]

Step 3. As for the contribution from the complement of a compact set: by superexponential tightness, for any $\ell < \infty$ there exists a compact set $K_\ell$ such that $P_n(K_\ell^c) \le e^{-n\ell}$, and if $F$ is bounded by a constant $M$, the contribution from $K_\ell^c$ can be estimated by
\[ \limsup_{n\to\infty} \frac1n \log a_n(K_\ell^c) \le M - \ell. \]
Putting the two pieces together, we get
\[ \limsup_{n\to\infty} \frac1n \log a_n = \limsup_{n\to\infty} \frac1n \log a_n(X) \le \max\bigl( M - \ell,\ \sup_x\, [F(x) - I(x)] + 2\epsilon \bigr). \]
If we let $\ell \to \infty$ and $\epsilon \to 0$, we are done.

Step 4. The lower bound is elementary. Let $x \in X$ be arbitrary. By the continuity of $F(\cdot)$ (in fact lower semicontinuity at this time), $F(x') \ge F(x) - \epsilon$ in some neighborhood $N_x$ of $x$, and
\[ a_n = a_n(X) \ge a_n(N_x) \ge e^{n[F(x) - \epsilon]}\, P_n(N_x). \]
By taking logarithms, dividing by $n$, and letting $n \to \infty$, we get
\[ \liminf_{n\to\infty} \frac1n \log a_n \ge [F(x) - I(x)] - \epsilon. \]
We let $\epsilon \to 0$ and take the supremum over the points $x \in X$ to get our result.

Remark 2.9. We have used only the upper semicontinuity of $F$ for the upper bound and the lower semicontinuity of $F$ for the lower bound. It is often necessary to weaken the regularity assumptions on $F$, and this has to be done with care. More on this later, when we start applying the theory to specific circumstances.

Remark 2.10. The boundedness of $F$ is only needed to derive the upper bound; the lower bound is purely local. For unbounded $F$ we could surely try our luck with truncation and try to estimate the errors. With some control this can be done.

Finally, we end the section by transferring the LDP from $P_n$ to $Q_n$, where $Q_n$ is defined by
\[ Q_n(A) = \frac{\int_A e^{nF(x)}\, dP_n(x)}{\int_X e^{nF(x)}\, dP_n(x)} \]
for Borel subsets $A \subset X$.
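Theorem 2.8 can be checked in the toy setting of the motivating integral $a_n = \int e^{n[F(x) - I(x)]}\, dx$. The sketch below is my own illustration; the choices $F(x) = x$ and $I(x) = x^2$ on $[0, 1]$ are arbitrary, with $\sup [F - I] = \frac14$ attained at $x = \frac12$.

```python
import math

def F(x):
    return x

def I(x):
    return x * x

def log_a_n(n, pts=20001):
    # log of the trapezoidal approximation to the integral over [0,1] of
    # exp(n*(F(x)-I(x))), computed in log space to avoid overflow for large n
    h = 1.0 / (pts - 1)
    logs = []
    for i in range(pts):
        x = i * h
        w = 0.5 if i in (0, pts - 1) else 1.0
        logs.append(n * (F(x) - I(x)) + math.log(w * h))
    mx = max(logs)
    return mx + math.log(sum(math.exp(t - mx) for t in logs))

for n in [10, 100, 1000]:
    print(n, log_a_n(n) / n)   # should approach sup (F - I) = 0.25
```

The gap to $\frac14$ is the $\frac{\log n}{n}$ correction coming from the Gaussian prefactor of Laplace's method.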
Theorem 2.11. If $P_n$ satisfies an LDP with rate function $I(\cdot)$ and $F$ is a bounded continuous function on $X$, then $Q_n$ defined above satisfies an LDP on $X$ as well, with the new rate function $J(\cdot)$ given by
\[ J(x) = \sup_{x' \in X}\, [F(x') - I(x')] \; - \; [F(x) - I(x)]. \]

Proof. The proof essentially repeats the arguments of Theorem 2.8. In the notation of that proof, along the same lines one can establish, for closed $C$,
\[ \limsup_{n\to\infty} \frac1n \log a_n(C) \le \sup_{x \in C}\, [F(x) - I(x)]. \]
Now $\log Q_n(C) = \log a_n(C) - \log a_n$, and we get
\[ \limsup_{n\to\infty} \frac1n \log Q_n(C) \le \sup_{x \in C}\, [F(x) - I(x)] - \sup_{x \in X}\, [F(x) - I(x)] = -\inf_{x \in C} J(x). \]
This gives us the upper bound; the lower bound is just as easy and is left as an exercise.
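The new rate function $J$ vanishes exactly at the maximizers of $F - I$, so $Q_n$ concentrates there, and sets away from the maximizer are charged at rate $\inf J$. A numerical sketch (my own illustration, with $F(x) = x$ and $I(x) = x^2$ discretized on $[0, 1]$; then $\sup [F - I] = \frac14$ at $x = \frac12$, and for instance $J(0.8) = \frac14 - (0.8 - 0.64) = 0.09$):

```python
import math

def log_weights(n, pts=20001):
    # log of the (unnormalized) tilted masses exp(n*(F(x)-I(x))) * dx on a
    # grid over [0,1], with F(x) = x and I(x) = x**2 (illustrative choices)
    h = 1.0 / (pts - 1)
    return [n * ((i * h) - (i * h) ** 2) + math.log(h) for i in range(pts)]

def log_Qn(n, a, b, pts=20001):
    # log Q_n([a, b]) for the tilted (normalized) measure Q_n
    logs = log_weights(n, pts)
    h = 1.0 / (pts - 1)
    mx = max(logs)
    total = mx + math.log(sum(math.exp(t - mx) for t in logs))
    sel = [t for i, t in enumerate(logs) if a <= i * h <= b]
    mxs = max(sel)
    part = mxs + math.log(sum(math.exp(t - mxs) for t in sel))
    return part - total

# decay rate for C = [0.8, 1] should approach inf over C of J, i.e. J(0.8) = 0.09
for n in [100, 1000]:
    print(n, -log_Qn(n, 0.8, 1.0) / n)
```

Note that $Q_n([0.8, 1])$ itself tends to zero exponentially: all the mass of $Q_n$ piles up near the maximizer $x = \frac12$, where $J = 0$.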