Class Notes in Statistics and Econometrics
(3.4.12)    = F_x(\sqrt{q}) - \bigl(1 - F_x(\sqrt{q})\bigr)
(3.4.13)    = 2F_x(\sqrt{q}) - 1.  □

Instead of the cumulative distribution function F_y one can also use the quantile function F_y^{-1} to characterize a probability measure. As the notation suggests, the quantile function can be considered some kind of "inverse" of the cumulative distribution function. The quantile function is the function (0,1) → ℝ defined by

(3.4.14)    F_y^{-1}(p) = \inf\{u : F_y(u) \ge p\}

or, plugging the definition of F_y into (3.4.14),

(3.4.15)    F_y^{-1}(p) = \inf\{u : \Pr[y \le u] \ge p\}.

The quantile function is only defined on the open unit interval, not on the endpoints 0 and 1, because it would often assume the values −∞ and +∞ on these endpoints, and the information given by these values is redundant. The quantile function is continuous from the left, i.e., from the other side than the cumulative distribution function. If F is continuous and strictly increasing, then the quantile function is the inverse of the distribution function in the usual sense, i.e., F^{-1}(F(t)) = t for all t ∈ ℝ, and F(F^{-1}(p)) = p for all p ∈ (0,1). But even if F is flat on certain intervals, and/or F has jump points, i.e., F does not have an inverse function, the following important identity holds for every y ∈ ℝ and p ∈ (0,1):

(3.4.16)    p \le F_y(y) \quad\text{iff}\quad F_y^{-1}(p) \le y

PROOF. ⇒: If p ≤ F(y), then of course y ≥ inf{u : F(u) ≥ p}. ⇐: y ≥ inf{u : F(u) ≥ p} means that every z > y satisfies F(z) ≥ p; therefore, since F is continuous from the right, also F(y) ≥ p. This proof is from [Rei89, p. 318]. □

PROBLEM 49. You throw a pair of dice and your random variable x is the sum of the points shown.

• a. Draw the cumulative distribution function of x.

ANSWER. This is Figure 1: the cdf is 0 in (−∞,2), 1/36 in [2,3), 3/36 in [3,4), 6/36 in [4,5), 10/36 in [5,6), 15/36 in [6,7), 21/36 in [7,8), 26/36 in [8,9), 30/36 in [9,10), 33/36 in [10,11), 35/36 in [11,12), and 1 in [12,+∞). □

FIGURE 1. Cumulative Distribution Function of Discrete Variable

• b. Draw the quantile function of x.

ANSWER. This is Figure 2: the quantile function is 2 in (0,1/36], 3 in (1/36,3/36], 4 in (3/36,6/36], 5 in (6/36,10/36], 6 in (10/36,15/36], 7 in (15/36,21/36], 8 in (21/36,26/36], 9 in (26/36,30/36], 10 in (30/36,33/36], 11 in (33/36,35/36], and 12 in (35/36,1]. □

FIGURE 2. Quantile Function of Discrete Variable

PROBLEM 50. 1 point Give the formula of the cumulative distribution function of a random variable which is uniformly distributed between 0 and b.

ANSWER. F_x(x) = 0 for x ≤ 0, x/b for 0 ≤ x ≤ b, and 1 for x ≥ b. □

Empirical Cumulative Distribution Function:

Besides the cumulative distribution function of a random variable or of a probability measure, one can also define the empirical cumulative distribution function of a sample. Empirical cumulative distribution functions are zero for all values below the lowest observation, then 1/n for everything below the second lowest, etc. They are step functions. If two observations assume the same value, then the step at that value is twice as high, etc. The empirical cumulative distribution function can be considered an estimate of the cumulative distribution function of the probability distribution underlying the sample. [Rei89, p. 12] writes it as a sum of indicator functions:

(3.4.17)    \hat F(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{[y_i,\infty)}(t)
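A minimal Python sketch (not part of the original notes; the helper names ecdf and dice_sum_cdf are ad hoc) that evaluates the empirical cumulative distribution function (3.4.17) for a sample and, as a check, the exact cdf of the sum of two dice from Problem 49:

    from fractions import Fraction
    import random

    def ecdf(sample):
        """Empirical cdf of a sample, eq. (3.4.17): F_hat(t) = (1/n) * #{i : y_i <= t}."""
        n = len(sample)
        return lambda t: sum(1 for y in sample if y <= t) / n

    def dice_sum_cdf(t):
        """Exact cdf of the sum of two fair dice (Problem 49)."""
        pmf = {s: Fraction(6 - abs(s - 7), 36) for s in range(2, 13)}
        return sum(p for s, p in pmf.items() if s <= t)

    # The cdf jumps at 2,...,12; e.g. F(7) = 21/36 and F(8) = 26/36, as in the answer above.
    print(dice_sum_cdf(7), dice_sum_cdf(8))      # 7/12 (= 21/36), 13/18 (= 26/36)

    # The empirical cdf of a large simulated sample approximates the exact cdf.
    random.seed(0)
    sample = [random.randint(1, 6) + random.randint(1, 6) for _ in range(10_000)]
    print(round(ecdf(sample)(7), 3))             # close to 21/36 ≈ 0.583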
3.5. Discrete and Absolutely Continuous Probability Measures

One can define two main classes of probability measures on ℝ: One kind is concentrated in countably many points. Its probability distribution can be defined in terms of the probability mass function.

PROBLEM 51. Show that a distribution function can only have countably many jump points.

ANSWER. Proof: There are at most two jump points with jump height greater than 1/3, at most four with jump height greater than 1/5, etc., since the jump heights cannot add up to more than 1. The set of all jump points is therefore a countable union of finite sets, hence countable. □

Among the other probability measures we are only interested in those which can be represented by a density function (absolutely continuous). A density function is a nonnegative integrable function which, integrated over the whole line, gives 1. Given such a density function, called f_x(x), the probability Pr[x ∈ (a,b)] = ∫_a^b f_x(x) dx. The density function is therefore an alternate way to characterize a probability measure. But not all probability measures have density functions.

Those who are not familiar with integrals should read up on them at this point. Start with derivatives, then: the indefinite integral of a function is a function whose derivative is the given function. Then it is an important theorem that the area under the curve is the difference of the values of the indefinite integral at the end points. This is called the definite integral. (The area is considered negative when the curve is below the x-axis.)

The intuition of a density function comes out more clearly in terms of infinitesimals. If f_x(x) is the value of the density function at the point x, then the probability that the outcome of x lies in an interval of infinitesimal length located near the point x is the length of this interval, multiplied by f_x(x). In formulas, for an infinitesimal dx one has

(3.5.1)    \Pr\bigl[x \in [x, x + dx]\bigr] = f_x(x)\,|dx|.

The name "density function" is therefore appropriate: it indicates how densely the probability is spread out over the line. It is, so to say, the quotient between the probability measure induced by the variable and the length measure on the real numbers.

If the cumulative distribution function has everywhere a derivative, this derivative is the density function.

3.6. Transformation of a Scalar Density Function

Assume x is a random variable with values in the region A ⊂ ℝ, i.e., Pr[x ∉ A] = 0, and t is a one-to-one mapping A → ℝ. One-to-one (as opposed to many-to-one) means: if a, b ∈ A and t(a) = t(b), then already a = b. We also assume that t has a continuous nonnegative first derivative t′ ≥ 0 everywhere in A. Define the random variable y by y = t(x). We know the density function of y, and we want to get that of x. (I.e., t expresses the old variable, that whose density function we know, in terms of the new variable, whose density function we want to know.) Since t is one-to-one, it follows for all a, b ∈ A that a = b ⟺ t(a) = t(b). And recall the definition of a derivative in terms of infinitesimals dx: t′(x) = (t(x + dx) − t(x))/dx. In order to compute f_x(x) we will use the following identities, valid for all x ∈ A:

(3.6.1)    f_x(x)\,|dx| = \Pr\bigl[x \in [x, x + dx]\bigr] = \Pr\bigl[t(x) \in [t(x), t(x + dx)]\bigr]
(3.6.2)    = \Pr\bigl[t(x) \in [t(x), t(x) + t'(x)\,dx]\bigr] = f_y(t(x))\,|t'(x)\,dx|

Absolute values are multiplicative, i.e., |t′(x) dx| = |t′(x)| |dx|; divide by |dx| to get

(3.6.3)    f_x(x) = f_y(t(x))\,|t'(x)|.

This is the transformation formula how to get the density of x from that of y. This formula is valid for all x ∈ A; the density of x is 0 for all x ∉ A.

Heuristically one can get this transformation as follows: write |t′(x)| = |dy|/|dx|; then one gets it from f_x(x) |dx| = f_y(t(x)) |dy| by just dividing both sides by |dx|.

In other words, this transformation rule consists of 4 steps: (1) Determine A, the range of the new variable; (2) obtain the transformation t which expresses the old variable in terms of the new variable, and check that it is one-to-one on A; (3) plug expression (2) into the old density; (4) multiply this plugged-in density by the absolute value of the derivative of expression (2). This gives the density inside A; it is 0 outside A.
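As a numerical check of steps (1) to (4) (this sketch and its concrete example are mine, not part of the notes): let the old variable y be uniform on (0,1) and let the new variable be x = √y, so that the old variable in terms of the new one is y = t(x) = x² on A = (0,1), and (3.6.3) gives f_x(x) = f_y(x²) · |2x| = 2x. A short Monte Carlo simulation agrees:

    import random

    # Old variable y ~ Uniform(0,1), so f_y(y) = 1 on (0,1).  New variable x = sqrt(y),
    # i.e. y = t(x) = x**2 on A = (0,1), and (3.6.3) predicts f_x(x) = 2x on (0,1).
    random.seed(0)
    n = 100_000
    xs = [random.random() ** 0.5 for _ in range(n)]          # draws of x = sqrt(y)

    # Compare the average density of x in a few bins with the prediction 2x.
    for lo, hi in [(0.0, 0.2), (0.2, 0.4), (0.4, 0.6), (0.6, 0.8), (0.8, 1.0)]:
        empirical = sum(lo < x <= hi for x in xs) / n / (hi - lo)   # histogram height
        predicted = lo + hi                                         # average of 2x over the bin
        print(f"({lo:.1f},{hi:.1f}]  empirical={empirical:.3f}  predicted={predicted:.3f}")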
An alternative proof is conceptually simpler but cannot be generalized to the multivariate case: First assume t is monotonically increasing. Then F_x(a) = Pr[x ≤ a] = Pr[t(x) ≤ t(a)] = F_y(t(a)). Now differentiate and use the chain rule. Then also do the monotonically decreasing case. This is how [Ame94, theorem 3.6.1 on p. 48] does it. [Ame94, pp. 52/3] has an extension of this formula to many-to-one functions.

PROBLEM 52. 4 points [Lar82, example 3.5.4 on p. 148] Suppose y has density function f_y(y) = 1 for 0 < y < 1 and 0 otherwise. … for x > 0 and 0 otherwise. □

PROBLEM 53. 6 points [Dhr86, p. 1574] Assume the random variable z has the exponential distribution with parameter λ, i.e., its density function is f_z(z) = λ exp(−λz) for z > 0 and 0 otherwise. Define u = −log z. Show that the density function of u is f_u(u) = exp(μ − u − exp(μ − u)) where μ = log λ. This density will be used in Problem 151.

ANSWER. (1) Since z only has values in (0,∞), its log is well defined, and A = ℝ. (2) Express the old variable in terms of the new: −u = log z, therefore z = e^{−u}; this is one-to-one everywhere. (3) Plugging in (since e^{−u} > 0 for all u, we must plug it into λ exp(−λz)) gives λ exp(−λe^{−u}). (4) The derivative of z = e^{−u} is −e^{−u}; taking absolute values gives the Jacobian factor e^{−u}. Plugging in and multiplying gives the density of u: f_u(u) = λ exp(−λe^{−u}) e^{−u} = λ e^{−u − λe^{−u}}, and using λ exp(−u) = exp(μ − u) this simplifies to the formula above.

Alternative without the transformation rule for densities:

F_u(u) = \Pr[-\log z \le u] = \Pr[\log z \ge -u] = \Pr[z \ge e^{-u}] = \int_{e^{-u}}^{+\infty} \lambda e^{-\lambda z}\,dz = -e^{-\lambda z}\Big|_{e^{-u}}^{+\infty} = e^{-\lambda e^{-u}};

differentiating with respect to u gives f_u(u) = λ e^{−u} e^{−λe^{−u}}, the same density as above. □

PROBLEM 54. Assume the random variable z has the exponential distribution with λ = 1, i.e., its density function is f_z(z) = exp(−z) for z > 0 and 0 otherwise. Define u = √z. Compute the density function of u.

ANSWER. (1) A = {u : u ≥ 0}, since √ always denotes the nonnegative square root. (2) Express the old variable in terms of the new: z = u²; this is one-to-one on A (but not one-to-one on all of ℝ). (3) Then the derivative is 2u, which is nonnegative on A as well, so no absolute values are necessary. (4) Multiplying gives the density of u: f_u(u) = 2u exp(−u²) if u ≥ 0 and 0 elsewhere. □

3.7. Example: Binomial Variable

Go back to our Bernoulli trial with parameters p and n, and define a random variable x which represents the number of successes. Then the probability mass function of x is

(3.7.1)    p_x(k) = \Pr[x = k] = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, 1, 2, \ldots, n

The proof is simple: every subset of k elements represents one possibility of spreading out the k successes.

We will call any observed random variable a statistic. And we call a statistic t sufficient for a parameter θ if and only if for any event A and for any possible value t of t, the conditional probability Pr[A | t = t] does not involve θ.

… E[h(x)] = h(E[x]). (3) The existence of such an h follows from convexity. Since g is convex, for every point a ∈ B there is a number β so that g(x) ≥ g(a) + β(x − a). This β is the slope of g if g is differentiable, and otherwise it is some number between the left and the right derivative (which both always exist for a convex function). We need this for a = E[x]. This existence is the deepest part of this proof. We will not prove it here; for a proof see [Rao73, pp. 57, 58]. One can view it as a special case of the separating hyperplane theorem. □
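As a quick numerical illustration of Jensen's inequality g(E[x]) ≤ E[g(x)] (my own example, not part of the notes), take a fair die and the convex function g(x) = x², which is exactly the case Problem 62 below asks about:

    from fractions import Fraction

    # Fair die: outcomes 1,...,6 with probability 1/6 each; convex g(x) = x**2.
    outcomes = range(1, 7)
    p = Fraction(1, 6)

    E_x = sum(p * x for x in outcomes)        # E[x]   = 7/2
    E_g = sum(p * x**2 for x in outcomes)     # E[x^2] = 91/6
    print(E_x, E_g, E_x**2)                   # 7/2 91/6 49/4
    print(E_x**2 <= E_g)                      # True: g(E[x]) = (E[x])^2 <= E[x^2] = E[g(x)]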
PROBLEM 62. Use Jensen's inequality to show that (E[x])² ≤ E[x²]. You are allowed to use, without proof, the fact that a function is convex on B if the second derivative exists on B and is nonnegative.

PROBLEM 63. Show that the expected value of the empirical distribution of a sample is the sample mean.

Other measures of location: The median is that number m for which there is as much probability mass to the left of m as to the right, i.e.,

(3.10.11)    \Pr[x \le m] = \tfrac{1}{2}.

It is much more robust with respect to outliers than the mean. If there is more than one m satisfying (3.10.11), then some authors choose the smallest (in which case the median is a special case of the quantile function, m = F^{-1}(1/2)), and others the average between the biggest and smallest. If there is no m with property (3.10.11), …

3.10.3. Mean-Variance Calculations. If one knows mean and variance of a random variable, one does not by any means know the whole distribution, but one has already some information. For instance, one can also compute E[x²] from them: E[x²] = var[x] + (E[x])².

PROBLEM 66. 4 points Consumer M has an expected utility function for money income u(x) = 12x − x². The meaning of an expected utility function is very simple: if he owns an asset that generates some random income y, then the utility he derives from this asset is the expected value E[u(y)]. He is contemplating acquiring two assets. One asset yields an income of 4 dollars with certainty. The other yields an expected income of 5 dollars with standard deviation 2 dollars. Does he prefer the certain or the uncertain asset?

ANSWER. E[u(y)] = 12 E[y] − E[y²] = 12 E[y] − var[y] − (E[y])². Therefore the certain asset gives him utility 48 − 0 − 16 = 32, and the uncertain one 60 − 4 − 25 = 31. He prefers the certain asset. □

3.10.4. Moment Generating Function and Characteristic Function. Here we will use the exponential function e^x, also often written exp(x), which has the two properties: e^x = lim_{n→∞}(1 + x/n)^n (Euler's limit), and e^x = 1 + x + x²/2! + x³/3! + ⋯. Many (but not all) random variables x have a moment generating function m_x(t) for certain values of t. If they do for t in an open interval around zero, then their distribution is uniquely determined by it. The definition is

(3.10.18)    m_x(t) = \operatorname{E}[e^{tx}].

It is a powerful computational device.

The moment generating function is in many cases a more convenient characterization of the random variable than the density function. It has the following uses:

1. One obtains the moments of x by the simple formula

(3.10.19)    \operatorname{E}[x^k] = \frac{d^k}{dt^k} m_x(t)\Big|_{t=0}.

Proof:

(3.10.20)    e^{tx} = 1 + tx + \frac{t^2 x^2}{2!} + \frac{t^3 x^3}{3!} + \cdots
(3.10.21)    m_x(t) = \operatorname{E}[e^{tx}] = 1 + t\,\operatorname{E}[x] + \frac{t^2}{2!}\operatorname{E}[x^2] + \frac{t^3}{3!}\operatorname{E}[x^3] + \cdots
(3.10.22)    \frac{d}{dt} m_x(t) = \operatorname{E}[x] + t\,\operatorname{E}[x^2] + \frac{t^2}{2!}\operatorname{E}[x^3] + \cdots
(3.10.23)    \frac{d^2}{dt^2} m_x(t) = \operatorname{E}[x^2] + t\,\operatorname{E}[x^3] + \cdots \qquad \text{etc.}

2. The moment generating function is also good for determining the probability distribution of linear combinations of independent random variables.

a. It is easy to get the m.g.f. of λx from the one of x:

(3.10.24)    m_{\lambda x}(t) = m_x(\lambda t)

because both sides are E[e^{λtx}].

b. If x, y are independent, then

(3.10.25)    m_{x+y}(t) = m_x(t)\,m_y(t).

The proof is simple:

(3.10.26)    \operatorname{E}[e^{t(x+y)}] = \operatorname{E}[e^{tx} e^{ty}] = \operatorname{E}[e^{tx}]\,\operatorname{E}[e^{ty}]

due to independence.
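To see (3.10.19) and (3.10.25) in action, here is a short symbolic sketch in Python using SymPy (my own illustration, not part of the notes). It takes as given the standard moment generating function λ/(λ − t) of the exponential distribution with parameter λ (valid for t < λ), which follows from (3.10.18) by integration:

    import sympy as sp

    t = sp.symbols('t', real=True)
    lam = sp.symbols('lambda', positive=True)

    # Known m.g.f. of the exponential distribution with parameter lambda, for t < lambda.
    m_z = lam / (lam - t)

    # (3.10.19): the k-th moment is the k-th derivative of the m.g.f. at t = 0.
    E_z = sp.diff(m_z, t, 1).subs(t, 0)      # -> 1/lambda
    E_z2 = sp.diff(m_z, t, 2).subs(t, 0)     # -> 2/lambda**2
    print(E_z, E_z2)

    # (3.10.25): the m.g.f. of the sum of two independent such variables is the product.
    m_sum = sp.simplify(m_z * m_z)           # lambda**2/(lambda - t)**2
    E_sum = sp.diff(m_sum, t, 1).subs(t, 0)  # -> 2/lambda, the sum of the two means
    print(sp.simplify(E_sum))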
The characteristic function is defined as ψ_x(t) = E[e^{itx}], where i = √−1. It has the disadvantage that it involves complex numbers, but it has the advantage that it always exists, since exp(ix) = cos x + i sin x, and since cos and sin are both bounded, they always have an expected value. And, as its name says, the characteristic function characterizes the probability distribution. Analytically, many of its properties are similar to those of the moment generating function.

3.11. Entropy

3.11.1. Definition of Information. Entropy is the average information gained by the performance of the experiment. The actual information yielded by an event A with probability Pr[A] = p ≠ 0 is defined as follows:

(3.11.1)    I[A] = \log_2 \frac{1}{\Pr[A]}

This is simply a transformation of the probability, and it has the dual interpretation of either how unexpected the event was, or the information yielded by the occurrence of event A. It is characterized by the following properties [AD75, pp. 3-5]:

• I[A] only depends on the probability of A; in other words, the information content of a message is independent of how the information is coded.
• I[A] ≥ 0 (nonnegativity), i.e., after knowing whether A occurred we are no more ignorant than before.
• If A and B are independent then I[A ∩ B] = I[A] + I[B] (additivity for independent events). This is the most important property.
• Finally the (inessential) normalization that if Pr[A] = 1/2 then I[A] = 1, i.e., a yes-or-no decision with equal probability (coin flip) is one unit of information.

Note that the information yielded by occurrence of the certain event is 0, and that yielded by occurrence of the impossible event is ∞.

But the important information-theoretic results refer to average, not actual, information, therefore let us define now entropy:

3.11.2. Definition of Entropy. The entropy of a probability field (experiment) is a measure of the uncertainty prevailing before the experiment is performed, or of the average information yielded by the performance of this experiment. If the set U of possible outcomes of the experiment has only a finite number of different elements, say their number is n, and the probabilities of these outcomes are p_1, …, p_n, then the Shannon entropy H[F] of this experiment is defined as

(3.11.2)    \frac{H[F]}{\text{bits}} = \sum_{k=1}^{n} p_k \log_2 \frac{1}{p_k}

This formula uses log_2, the logarithm with base 2, which can easily be computed from natural logarithms: log_2 x = log x / log 2. The choice of base 2 is convenient because in this way the most informative Bernoulli experiment, that with success probability p = 1/2 (coin flip), has entropy 1. This is why one says: "the entropy is measured in bits." If one goes over to logarithms of a different base, this simply means that one measures entropy in different units. In order to indicate this dependence on the measuring unit, equation (3.11.2) was written as the definition of H[F]/bits instead of H[F] itself, i.e., this is the number one gets if one measures the entropy in bits. If one uses natural logarithms, then the entropy is measured in "nats."
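Equation (3.11.2) translates directly into a few lines of Python (my own sketch, not part of the notes); the guard p > 0 anticipates the convention 0 · log(1/0) = 0 discussed at the end of this chapter:

    import math

    def entropy_bits(probs):
        """Shannon entropy of a finite probability vector, eq. (3.11.2), in bits."""
        assert abs(sum(probs) - 1.0) < 1e-12
        return sum(p * math.log2(1 / p) for p in probs if p > 0)

    print(entropy_bits([0.5, 0.5]))      # 1.0 -- the coin flip carries one bit
    print(entropy_bits([0.9, 0.1]))      # about 0.469 -- a biased coin is less informative
    print(entropy_bits([1/32] * 32))     # 5.0 -- uniform over 32 outcomes, see the example below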
Entropy can be characterized axiomatically by the following axioms [Khi57]:

• The uncertainty associated with a finite complete scheme takes its largest value if all events are equally likely, i.e., H(p_1, …, p_n) ≤ H(1/n, …, 1/n).
• The addition of an impossible event to a scheme does not change the amount of uncertainty.
• Composition Law: If the possible outcomes are arbitrarily combined into m groups W_1 = X_{11} ∪ ⋯ ∪ X_{1k_1}, W_2 = X_{21} ∪ ⋯ ∪ X_{2k_2}, …, W_m = X_{m1} ∪ ⋯ ∪ X_{mk_m}, with corresponding probabilities w_1 = p_{11} + ⋯ + p_{1k_1}, w_2 = p_{21} + ⋯ + p_{2k_2}, …, w_m = p_{m1} + ⋯ + p_{mk_m}, then

H(p_1, \ldots, p_n) = H(w_1, \ldots, w_m) + w_1 H(p_{11}/w_1, \ldots, p_{1k_1}/w_1) + w_2 H(p_{21}/w_2, \ldots, p_{2k_2}/w_2) + \cdots + w_m H(p_{m1}/w_m, \ldots, p_{mk_m}/w_m).

Since p_{ij}/w_i = Pr[X_{ij} | W_i], the composition law means: if you first learn half the outcome of the experiment, and then the other half, you will in the average get as much information as if you had been told the total outcome all at once.

The entropy of a random variable x is simply the entropy of the probability field induced by x on ℝ. It does not depend on the values x takes but only on the probabilities. For discretely distributed random variables it can be obtained by the following "eerily self-referential" prescription: plug the random variable into its own probability mass function and compute the expected value of the negative logarithm of this, i.e.,

(3.11.3)    \frac{H[x]}{\text{bits}} = \operatorname{E}[-\log_2 p_x(x)]

One interpretation of the entropy is: it is the average number of yes-or-no questions necessary to describe the outcome of the experiment. For instance, consider an experiment which has 32 different outcomes occurring with equal probabilities. The entropy is

(3.11.4)    \frac{H}{\text{bits}} = \sum_{i=1}^{32} \frac{1}{32} \log_2 32 = \log_2 32 = 5, \qquad\text{i.e.,}\quad H = 5\ \text{bits},

which agrees with the number of bits necessary to describe the outcome.

PROBLEM 67. Design a questioning scheme to find out the value of an integer between 1 and 32, and compute the expected number of questions in your scheme if all numbers are equally likely.

ANSWER. In binary digits one needs a number of length 5 to describe a number between 0 and 31, therefore the 5 questions might be: write down the binary expansion of your number minus 1; is the first binary digit in this expansion a zero? Then: is the second binary digit in this expansion a zero? Etc. Formulated without the use of binary digits these same questions would be: Is the number between 1 and 16? Then: is it between 1 and 8 or 17 and 24? Then: is it between 1 and 4 or 9 and 12 or 17 and 20 or 25 and 28? Etc., the last question being whether it is odd. Of course, you can also formulate these questions conditionally: First: between 1 and 16? If no, then second: between 17 and 24? If yes, then second: between 1 and 8? Etc. Each of these questions gives you exactly the entropy of 1 bit. □
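A small sketch (mine, not from the notes) of the conditional halving scheme from Problem 67; the helper name questions_needed is ad hoc. It confirms that every number from 1 to 32 takes exactly 5 questions, so the average equals the entropy of 5 bits:

    def questions_needed(secret, lo=1, hi=32):
        """Count the yes/no questions a halving scheme needs to pin down secret in {lo,...,hi}."""
        count = 0
        while lo < hi:
            mid = (lo + hi) // 2
            if secret <= mid:        # question: "is it between lo and mid?"
                hi = mid
            else:
                lo = mid + 1
            count += 1
        return count

    counts = [questions_needed(k) for k in range(1, 33)]
    print(min(counts), max(counts), sum(counts) / 32)   # 5 5 5.0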
PROBLEM 68. [CT91, example 1.1.2 on p. 5] Assume there is a horse race with eight horses taking part. The probabilities for winning for the eight horses are 1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, and 1/64.

• a. 1 point Show that the entropy of the horse race is 2 bits.

ANSWER.

\frac{1}{2}\log_2 2 + \frac{1}{4}\log_2 4 + \frac{1}{8}\log_2 8 + \frac{1}{16}\log_2 16 + \frac{4}{64}\log_2 64 = \frac{1}{2} + \frac{1}{2} + \frac{3}{8} + \frac{1}{4} + \frac{3}{8} = \frac{4+4+3+2+3}{8} = 2

□

• b. 1 point Suppose you want to send a binary message to another person indicating which horse won the race. One alternative is to assign the bit strings 000, 001, 010, 011, 100, 101, 110, 111 to the eight horses. This description requires 3 bits for any of the horses. But since the win probabilities are not uniform, it makes sense to use shorter descriptions for the horses more likely to win, so that we achieve a lower expected value of the description length. For instance, we could use the following set of bit strings for the eight horses: 0, 10, 110, 1110, 111100, 111101, 111110, 111111. Show that the expected length of the message you send to your friend is 2 bits, as opposed to 3 bits for the uniform code. Note that in this case the expected value of the description length is equal to the entropy.

ANSWER. The math is the same as in the first part of the question:

\frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{8}\cdot 3 + \frac{1}{16}\cdot 4 + \frac{4}{64}\cdot 6 = \frac{1}{2} + \frac{1}{2} + \frac{3}{8} + \frac{1}{4} + \frac{3}{8} = \frac{4+4+3+2+3}{8} = 2

□

PROBLEM 69. [CT91, example 2.1.2 on pp. 14/15] The experiment has four possible outcomes; outcome x=a occurs with probability 1/2, x=b with probability 1/4, x=c with probability 1/8, and x=d with probability 1/8.

• a. 2 points The entropy of this experiment (in bits) is one of the following three numbers: 11/8, 7/4, 2. Which is it?

• b. 2 points Suppose we wish to determine the outcome of this experiment with the minimum number of questions. An efficient first question is "Is x=a?" This splits the probability in half. If the answer to the first question is no, then the second question can be "Is x=b?" The third question, if it is necessary, can then be: "Is x=c?" Compute the expected number of binary questions required.

• c. 2 points Show that the entropy gained by each question is 1 bit.

• d. 3 points Assume we know about the first outcome that x≠a. What is the entropy of the remaining experiment (i.e., under the conditional probability)?

• e. 5 points Show in this example that the composition law for entropy holds.

PROBLEM 70. 2 points In terms of natural logarithms equation (3.11.2) defining entropy reads

(3.11.5)    \frac{H}{\text{bits}} = \frac{1}{\ln 2} \sum_{k=1}^{n} p_k \ln\frac{1}{p_k}.

Compute the entropy of (i.e., the average information gained by) a roll of an unbiased die.

ANSWER. Since each outcome is equally likely, this is the same as the actual information gained:

(3.11.6)    \frac{H}{\text{bits}} = \frac{1}{\ln 2}\Bigl(\frac{1}{6}\ln 6 + \cdots + \frac{1}{6}\ln 6\Bigr) = \frac{\ln 6}{\ln 2} = 2.585

□

• a. 3 points How many questions does one need in the average to determine the outcome of the roll of an unbiased die? In other words, pick a certain questioning scheme (try to make it efficient) and compute the average number of questions if this scheme is followed. Note that this average cannot be smaller than the entropy H/bits, and if one chooses the questions optimally, it is smaller than H/bits + 1.

ANSWER. First question: is it bigger than 3? Second question: is it even? Third question (if necessary): is it a multiple of 3? In this scheme, the number of questions for the six faces of the die are 3, 2, 3, 3, 2, 3, therefore the average is (4/6)·3 + (2/6)·2 = 2⅔. Also optimal: (1) is it bigger than 2? (2) is it odd? (3) is it bigger than 4? This gives 2, 2, 3, 3, 3, 3. Also optimal: first question: is it 1 or 2? If the answer is no, then the second question is: is it 3 or 4?; otherwise go directly to the third question: is it odd or even? The steamroller approach: Is it 1? Is it 2? Etc. gives 1, 2, 3, 4, 5, 5 with expected number 3⅓. Even this is here < 1 + H/bits. □

PROBLEM 71.

• a. 1 point Compute the entropy of a roll of two unbiased dice if they are distinguishable.

ANSWER. Just twice the entropy from Problem 70:

(3.11.7)    \frac{H}{\text{bits}} = \frac{1}{\ln 2}\Bigl(36\cdot\frac{1}{36}\ln 36\Bigr) = \frac{2\ln 6}{\ln 2} = 5.170

□

• b. Would you expect the entropy to be greater or less in the more usual case that the dice are indistinguishable? Check your answer by computing it.

ANSWER. If the dice are indistinguishable, then one gets less information, therefore the experiment has less entropy. One has six like pairs with probability 1/36 each and 6·5/2 = 15 unlike pairs with probability 2/36 = 1/18 each. Therefore the average information gained is

(3.11.8)    \frac{H}{\text{bits}} = \frac{1}{\ln 2}\Bigl(6\cdot\frac{1}{36}\ln 36 + 15\cdot\frac{1}{18}\ln 18\Bigr) = \frac{1}{\ln 2}\Bigl(\frac{1}{6}\ln 36 + \frac{5}{6}\ln 18\Bigr) = 4.337

□

• c. 3 points Note that the difference between these two entropies is 5/6 = 0.833. How can this be explained?

ANSWER. This is the composition law in action. Assume you roll two dice which you first consider indistinguishable and afterwards someone tells you which is which. How much information do you gain? Well, if the numbers are the same, then telling you which die is which does not give you any information, since the outcomes of the experiment are defined as: which number has the first die, which number has the second die, regardless of where on the table the dice land. But if the numbers are different, then telling you which is which allows you to discriminate between two outcomes both of which have conditional probability 1/2 given the outcome you already know; in this case the information you gain is therefore 1 bit. Since the probability of getting two different numbers is 5/6, the expected value of the information gained explains the difference in entropy. □
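The entropies (3.11.7) and (3.11.8) and their 5/6-bit difference can be checked numerically; this is my own sketch, repeating the small entropy helper from the earlier snippet so that it runs on its own:

    import math

    def entropy_bits(probs):
        """Shannon entropy in bits of a list of probabilities."""
        return sum(p * math.log2(1 / p) for p in probs if p > 0)

    # (3.11.7): two distinguishable dice -- 36 equally likely ordered pairs.
    H_dist = entropy_bits([1/36] * 36)

    # (3.11.8): indistinguishable dice -- 6 like pairs (prob 1/36) and 15 unlike pairs (prob 1/18).
    H_indist = entropy_bits([1/36] * 6 + [1/18] * 15)

    print(round(H_dist, 3), round(H_indist, 3), round(H_dist - H_indist, 3))
    # 5.17 4.337 0.833 -- the 5/6 bit difference explained in part c above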
All these definitions use the convention 0 log(1/0) = 0, which can be justified by the following continuity argument: Define the function, graphed in Figure 3:

(3.11.9)    \eta(w) = \begin{cases} w \log\frac{1}{w} & \text{if } w > 0 \\ 0 & \text{if } w = 0. \end{cases}

η is continuous for all w ≥ 0, even at the boundary point w = 0. Differentiation gives η′(w) = −(1 + log w), and η″(w) = −w^{−1}. The function starts out at the origin with a vertical tangent, and since the second derivative is negative, it is strictly concave for all w > 0. The definition of strict concavity is η(w) < η(v) + (w − v)η′(v) for w ≠ v, i.e., the function lies below all its tangents. Substituting η′(v) = −(1 + log v) …