Class Notes in Statistics and Econometrics

Hans G. Ehrbar

Economics Department, University of Utah, 1645 Campus Center Drive, Salt Lake City UT 84112-9300, U.S.A.
URL: www.econ.utah.edu/ehrbar/ecmet.pdf
E-mail address: ehrbar@econ.utah.edu

ABSTRACT. This is an attempt to make a carefully argued set of class notes freely available. The source code for these notes can be downloaded from www.econ.utah.edu/ehrbar/ecmet-sources.zip. Copyright Hans G. Ehrbar under the GNU Public License.

Contents

Chapter 1. Preface

Chapter 2. Probability Fields
2.1. The Concept of Probability
2.2. Events as Sets
2.3. The Axioms of Probability
2.4. Objective and Subjective Interpretation of Probability
2.5. Counting Rules
2.6. Relationships Involving Binomial Coefficients
2.7. Conditional Probability
2.8. Ratio of Probabilities as Strength of Evidence
2.9. Bayes Theorem
2.10. Independence of Events
2.11. How to Plot Frequency Vectors and Probability Vectors

Chapter 3. Random Variables
3.1. Notation
3.2. Digression about Infinitesimals
3.3. Definition of a Random Variable
3.4. Characterization of Random Variables
3.5. Discrete and Absolutely Continuous Probability Measures
3.6. Transformation of a Scalar Density Function
3.7. Example: Binomial Variable
3.8. Pitfalls of Data Reduction: The Ecological Fallacy
3.9. Independence of Random Variables
3.10. Location Parameters and Dispersion Parameters of a Random Variable
3.11. Entropy

Chapter 4. Random Number Generation and Encryption
4.1. Alternatives to the Linear Congruential Random Generator
4.2. How to Test Random Generators
4.3. The Wichmann Hill Generator
4.4. Public Key Cryptology

Chapter 5. Specific Random Variables
5.1. Binomial
5.2. The Hypergeometric Probability Distribution
5.3. The Poisson Distribution
5.4. The Exponential Distribution
5.5. The Gamma Distribution
5.6. The Uniform Distribution
5.7. The Beta Distribution
5.8. The Normal Distribution
5.9. The Chi-Square Distribution
5.10. The Lognormal Distribution
5.11. The Cauchy Distribution

Chapter 6. Sufficient Statistics and their Distributions
6.1. Factorization Theorem for Sufficient Statistics
6.2. The Exponential Family of Probability Distributions

Chapter 7. Chebyshev Inequality, Weak Law of Large Numbers, and Central Limit Theorem
7.1. Chebyshev Inequality
7.2. The Probability Limit and the Law of Large Numbers
7.3. Central Limit Theorem

Chapter 8. Vector Random Variables
8.1. Expected Value, Variances, Covariances
8.2. Marginal Probability Laws
8.3. Conditional Probability Distribution and Conditional Mean
8.4. The Multinomial Distribution
8.5. Independent Random Vectors
8.6. Conditional Expectation and Variance
8.7. Expected Values as Predictors
8.8. Transformation of Vector Random Variables

Chapter 9. Random Matrices
9.1. Linearity of Expected Values
9.2. Means and Variances of Quadratic Forms in Random Matrices

Chapter 10. The Multivariate Normal Probability Distribution
10.1. More About the Univariate Case
10.2. Definition of Multivariate Normal
10.3. Special Case: Bivariate Normal
10.4. Multivariate Standard Normal in Higher Dimensions
10.5. Higher Moments of the Multivariate Standard Normal
10.6. The General Multivariate Normal

Chapter 11. The Regression Fallacy

Chapter 12. A Simple Example of Estimation
12.1. Sample Mean as Estimator of the Location Parameter
12.2. Intuition of the Maximum Likelihood Estimator
12.3. Variance Estimation and Degrees of Freedom

Chapter 13. Estimation Principles and Classification of Estimators
13.1. Asymptotic or Large-Sample Properties of Estimators
13.2. Small Sample Properties
13.3. Comparison Unbiasedness Consistency
13.4. The Cramer-Rao Lower Bound
13.5. Best Linear Unbiased Without Distribution Assumptions
13.6. Maximum Likelihood Estimation
13.7. Method of Moments Estimators
13.8. M-Estimators
13.9. Sufficient Statistics and Estimation
13.10. The Likelihood Principle
13.11. Bayesian Inference

Chapter 14. Interval Estimation

Chapter 15. Hypothesis Testing
15.1. Duality between Significance Tests and Confidence Regions
15.2. The Neyman Pearson Lemma and Likelihood Ratio Tests
15.3. The Runs Test
15.4. Pearson's Goodness of Fit Test
15.5. Permutation Tests
15.6. The Wald, Likelihood Ratio, and Lagrange Multiplier Tests

Chapter 16. General Principles of Econometric Modelling

Chapter 17. Causality and Inference

Chapter 18. Mean-Variance Analysis in the Linear Model
18.1. Three Versions of the Linear Model
18.2. Ordinary Least Squares
18.3. The Coefficient of Determination
18.4. The Adjusted R-Square

Chapter 19. Digression about Correlation Coefficients
19.1. A Unified Definition of Correlation Coefficients
19.2. Correlation Coefficients and the Associated Least Squares Problem
19.3. Canonical Correlations
19.4. Some Remarks about the Sample Partial Correlation Coefficients

Chapter 20. Numerical Methods for Computing OLS Estimates
20.1. QR Decomposition
20.2. The LINPACK Implementation of the QR Decomposition

Chapter 21. About Computers
21.1. General Strategy
21.2. The Emacs Editor
21.3. How to Enter and Exit SAS
21.4. How to Transfer SAS Data Sets Between Computers
21.5. Instructions for Statistics 5969, Hans Ehrbar's Section
21.6. The Data Step in SAS

Chapter 22. Specific Datasets
22.1. Cobb Douglas Aggregate Production Function
22.2. Houthakker's Data
22.3. Long Term Data about US Economy
22.4. Dougherty Data
22.5. Wage Data

Chapter 23. The Mean Squared Error as an Initial Criterion of Precision
23.1. Comparison of Two Vector Estimators

Chapter 24. Sampling Properties of the Least Squares Estimator
24.1. The Gauss Markov Theorem
24.2. Digression about Minimax Estimators
24.3. Miscellaneous Properties of the BLUE
24.4. Estimation of the Variance
24.5. Mallow's Cp-Statistic as Estimator of the Mean Squared Error
24.6. Optimality of Variance Estimators

Chapter 25. Variance Estimation: Should One Require Unbiasedness?
25.1. Setting the Framework Straight
25.2. Derivation of the Best Bounded MSE Quadratic Estimator of the Variance
25.3. Unbiasedness Revisited
25.4. Summary

Chapter 26. Nonspherical Positive Definite Covariance Matrix

Chapter 27. Best Linear Prediction
27.1. Minimum Mean Squared Error, Unbiasedness Not Required
27.2. The Associated Least Squares Problem
27.3. Prediction of Future Observations in the Regression Model

Chapter 28. Updating of Estimates When More Observations become Available

Chapter 29. Constrained Least Squares
29.1. Building the Constraint into the Model
29.2. Conversion of an Arbitrary Constraint into a Zero Constraint
29.3. Lagrange Approach to Constrained Least Squares
29.4. Constrained Least Squares as the Nesting of Two Simpler Models
29.5. Solution by Quadratic Decomposition
29.6. Sampling Properties of Constrained Least Squares
29.7. Estimation of the Variance in Constrained OLS
29.8. Inequality Restrictions
29.9. Application: Biased Estimators and Pre-Test Estimators

Chapter 30. Additional Regressors
30.1. Selection of Regressors

Chapter 31. Residuals: Standardized, Predictive, "Studentized"
31.1. Three Decisions about Plotting Residuals
31.2. Relationship between Ordinary and Predictive Residuals
31.3. Standardization

Chapter 32. Regression Diagnostics
32.1. Missing Observations
32.2. Grouped Data
32.3. Influential Observations and Outliers
32.4. Sensitivity of Estimates to Omission of One Observation

Chapter 33. Regression Graphics
33.1. Scatterplot Matrices
33.2. Conditional Plots
33.3. Spinning
33.4. Sufficient Plots

Chapter 34. Asymptotic Properties of the OLS Estimator
34.1. Consistency of the OLS Estimator
34.2. Asymptotic Normality of the Least Squares Estimator

Chapter 35. Least Squares as the Normal Maximum Likelihood Estimate

Chapter 36. Bayesian Estimation in the Linear Model

Chapter 37. OLS With Random Constraint

Chapter 38. Stein Rule Estimators

Chapter 39. Random Regressors
39.1. Strongest Assumption: Error Term Well Behaved Conditionally on Explanatory Variables
39.2. Contemporaneously Uncorrelated Disturbances
39.3. Disturbances Correlated with Regressors in Same Observation

Chapter 40. The Mahalanobis Distance
40.1. Definition of the Mahalanobis Distance
40.2. The Conditional Mahalanobis Distance
40.3. First Scenario: Minimizing Relative Increase in Mahalanobis Distance if Distribution is Known
40.4. Second Scenario: One Additional IID Observation
40.5. Third Scenario: One Additional Observation in a Regression Model

Chapter 41. Interval Estimation
41.1. A Basic Construction Principle for Confidence Regions
41.2. Coverage Probability of the Confidence Regions
41.3. Conventional Formulas for the Test Statistics
41.4. Interpretation in Terms of Studentized Mahalanobis Distance

Chapter 42. Three Principles for Testing a Linear Constraint
42.1. Mathematical Detail of the Three Approaches
42.2. Examples of Tests of Linear Hypotheses
42.3. The F-Test Statistic is a Function of the Likelihood Ratio
42.4. Tests of Nonlinear Hypotheses
42.5. Choosing Between Nonnested Models

Chapter 43. Multiple Comparisons in the Linear Model
43.1. Rectangular Confidence Regions
43.2. Relation between F-test and t-tests
43.3. Large-Sample Simultaneous Confidence Regions

Chapter 44. Sample SAS Regression Output

Chapter 45. Flexible Functional Form
45.1. Categorical Variables: Regression with Dummies and Factors
45.2. Flexible Functional Form for Numerical Variables
45.3. More than One Explanatory Variable: Backfitting

Chapter 46. Transformation of the Response Variable
46.1. Alternating Least Squares and Alternating Conditional Expectations
46.2. Additivity and Variance Stabilizing Transformations (avas)
46.3. Comparing ace and avas

Chapter 47. Density Estimation
47.1. How to Measure the Precision of a Density Estimator
47.2. The Histogram
47.3. The Frequency Polygon
47.4. Kernel Densities
47.5. Transformational Kernel Density Estimators
47.6. Confidence Bands
47.7. Other Approaches to Density Estimation
47.8. Two- and Three-Dimensional Densities
47.9. Other Characterizations of Distributions
47.10. Quantile-Quantile Plots
47.11. Testing for Normality

Chapter 48. Measuring Economic Inequality
48.1. Web Resources about Income Inequality
48.2. Graphical Representations of Inequality
48.3. Quantitative Measures of Income Inequality
48.4. Properties of Inequality Measures

Chapter 49. Distributed Lags
49.1. Geometric Lag
49.2. Autoregressive Distributed Lag Models

Chapter 50. Investment Models
50.1. Accelerator Models
50.2. Jorgenson's Model
50.3. Investment Function Project

Chapter 51. Distinguishing Random Variables from Variables Created by a Deterministic Chaotic Process
51.1. Empirical Methods: Grassberger-Procaccia Plots

Chapter 52. Instrumental Variables

Chapter 53. Errors in Variables
53.1. The Simplest Errors-in-Variables Model
53.2. General Definition of the EV Model
53.3. Particular Forms of EV Models
53.4. The Identification Problem
53.5. Properties of Ordinary Least Squares in the EV Model
53.6. Kalman's Critique of Malinvaud
53.7. Estimation if the EV Model is Identified
53.8. P-Estimation
53.9. Estimation When the Error Covariance Matrix is Exactly Known

Chapter 54. Dynamic Linear Models
54.1. Specification and Recursive Solution
54.2. Locally Constant Model
54.3. The Reference Model
54.4. Exchange Rate Forecasts
54.5. Company Market Share
54.6. Productivity in Milk Production

Chapter 55. Numerical Minimization

Chapter 56. Nonlinear Least Squares
56.1. The J Test
56.2. Nonlinear Instrumental Variables Estimation

Chapter 57. Applications of GLS with Nonspherical Covariance Matrix
57.1. Cases When OLS and GLS are Identical
57.2. Heteroskedastic Disturbances
57.3. Equicorrelated Covariance Matrix

Chapter 58. Unknown Parameters in the Covariance Matrix
58.1. Heteroskedasticity
58.2. Autocorrelation
58.3. Autoregressive Conditional Heteroskedasticity (ARCH)

Chapter 59. Generalized Method of Moments Estimators

Chapter 60. Bootstrap Estimators

Chapter 61. Random Coefficients

Chapter 62. Multivariate Regression
62.1. Multivariate Econometric Models: A Classification
62.2. Multivariate Regression with Equal Regressors
62.3. Growth Curve Models

Chapter 63. Independent Observations from the Same Multivariate Population
63.1. Notation and Basic Statistics
63.2. Two Geometries
63.3. Assumption of Normality
63.4. EM-Algorithm for Missing Observations
63.5. Wishart Distribution
63.6. Sample Correlation Coefficients

Chapter 64. Pooling of Cross Section and Time Series Data
64.1. OLS Model
64.2. The Between-Estimator
64.3. Dummy Variable Model (Fixed Effects)
64.4. Relation between the three Models so far
64.5. Variance Components Model (Random Effects)

Chapter 65. Disturbance Related (Seemingly Unrelated) Regressions
65.1. The Supermatrix Representation
65.2. The Likelihood Function
65.3. Concentrating out the Covariance Matrix (Incomplete)
65.4. Situations in which OLS is Best
65.5. Unknown Covariance Matrix

Chapter 66. Simultaneous Equations Systems
66.1. Examples
66.2. General Mathematical Form
66.3. Indirect Least Squares
66.4. Instrumental Variables (2SLS)
66.5. Identification
66.6. Other Estimation Methods

Chapter 67. Timeseries Analysis
67.1. Covariance Stationary Timeseries
67.2. Vector Autoregressive Processes
67.3. Nonstationary Processes
67.4. Cointegration

Chapter 68. Seasonal Adjustment
68.1. Methods of Seasonal Adjustment
68.2. Seasonal Dummies in a Regression

Chapter 69. Binary Choice Models
69.1. Fisher's Scoring and Iteratively Reweighted Least Squares
69.2. Binary Dependent Variable
69.3. The Generalized Linear Model

Chapter 70. Multiple Choice Models

Appendix A. Matrix Formulas
A.1. A Fundamental Matrix Decomposition
A.2. The Spectral Norm of a Matrix
A.3. Inverses and g-Inverses of Matrices
A.4. Deficiency Matrices
A.5. Nonnegative Definite Symmetric Matrices
A.6. Projection Matrices
A.7. Determinants
A.8. More About Inverses
A.9. Eigenvalues and Singular Value Decomposition

Appendix B. Arrays of Higher Rank
B.1. Informal Survey of the Notation
B.2. Axiomatic Development of Array Operations
B.3. An Additional Notational Detail
B.4. Equality of Arrays and Extended Substitution
B.5. Vectorization and Kronecker Product

Appendix C. Matrix Differentiation
C.1. First Derivatives

Bibliography

CHAPTER 1

Preface

These are class notes from several different graduate econometrics and statistics classes. In the Spring 2000 they were used for Statistics 6869, syllabus on p. ??, and in the Fall 2000 for Economics 7800, syllabus on p. ??. The notes give a careful and complete mathematical treatment intended to be accessible also to a reader inexperienced in math. There are 618 exercise questions, almost all with answers. The R-package ecmet has many of the datasets and R-functions needed in the examples. P. 547 gives instructions how to download it.

Here are some features by which these notes may differ from other teaching material available:

• A typographical distinction is made between random variables and the values taken by them (page 63).
• Best linear prediction of jointly distributed random variables is given as a second basic building block next to the least squares model (chapter 27).
• Appendix A gives a collection of general matrix formulas in which the g-inverse is used extensively.
• The "deficiency matrix," which gives an algebraic representation of the null space of a matrix, is defined and discussed in Appendix A.4.
• A molecule-like notation for concatenation of higher-dimensional arrays is introduced in Appendix B and used occasionally, see (10.5.7), (64.3.2), (65.0.18).
• Other unusual treatments can be found in chapters/sections 3.11, 18.3, 25, 29, 40, 36, 41-42, and 64.
• There are a number of plots of density functions, confidence ellipses, and other graphs which use the full precision of TeX, and more will be added in the future.

Some chapters are carefully elaborated, while others are still in the process of construction. In some topics covered in these notes I am an expert, in others I am still a beginner.

This edition also includes a number of comments from a critical realist perspective, inspired by [Bha78] and [Bha93]; see also [Law89]. There are many situations in the teaching of probability theory and statistics where the concepts of totality, transfactual efficacy, etc., can and should be used. These comments are still in an experimental state, and students are not required to know them for the exams. In the on-line version of the notes they are printed in a different color.

After some more cleaning out of the code, I am planning to make the AMS-LaTeX source files for these notes publicly available under the GNU public license, and to upload them to the TeX-archive network CTAN. Since I am using Debian GNU/Linux, the materials will also be available as a deb archive.

The most up-to-date version will always be posted at the web site of the Economics Department of the University of Utah, www.econ.utah.edu/ehrbar/ecmet.pdf. You can contact me by email at [email protected].

Hans Ehrbar

CHAPTER 2

Probability Fields

2.1. The Concept of Probability

Probability theory and statistics are useful in dealing with the following types of situations:

• Games of chance: throwing dice, shuffling cards, drawing balls out of urns.
• Quality control in production: you take a sample from a shipment and count how many defectives.
• Actuarial problems: the length of life anticipated for a person who has just applied for life insurance.
• Scientific experiments: you count the number of mice which contract cancer when a group of mice is exposed to cigarette smoke.
• Markets: the total personal income in New York State in a given month.
• Meteorology: the rainfall in a given month.
• Uncertainty: the exact date of Noah's birth.
• Indeterminacy: the closing of the Dow Jones industrial average or the temperature in New York City at 4 pm on February 28, 2014.
• Chaotic determinacy: the relative frequency of the digit 3 in the decimal representation of π.
• Quantum mechanics: the proportion of photons absorbed by a polarization filter.
• Statistical mechanics: the velocity distribution of molecules in a gas at a given pressure and temperature.

In the probability theoretical literature the situations in which probability theory applies are called "experiments," see for instance [Rén70, p. 1]. We will not use this terminology here, since probabilistic reasoning applies to several different types of situations, and not all of these can be considered "experiments."

PROBLEM 1. (This question will not be asked on any exams.) Rényi says: "Observing how long one has to wait for the departure of an airplane is an experiment." Comment.

ANSWER. Rényi commits the epistemic fallacy in order to justify his use of the word "experiment." Not the observation of the departure but the departure itself is the event which can be theorized probabilistically, and the word "experiment" is not appropriate here.

What does the fact that probability theory is appropriate in the above situations tell us about the world? Let us go through our list one by one:

• Games of chance: Games of chance are based on sensitivity to initial conditions: you tell someone to roll a pair of dice or shuffle a deck of cards, and despite the fact that this person is doing exactly what he or she is asked to do and produces an outcome which lies within a well-defined universe known beforehand (a number between 1 and 6, or a permutation of the deck of cards), the question which number or which permutation comes up is beyond their control.
The precise location and speed of the die or the precise order of the cards varies, and these small variations in initial conditions give rise, by the "butterfly effect" of chaos theory, to unpredictable final outcomes. A critical realist recognizes here the openness and stratification of the world: if many different influences come together, each of which is governed by laws, then their sum total is not determinate, as a naive hyper-determinist would think, but indeterminate. This is not only a condition for the possibility of science (in a hyper-deterministic world, one could not know anything before one knew everything, and science would also not be necessary because one could not do anything), but also for practical human activity: the macro outcomes of human practice are largely independent of micro detail (the postcard arrives whether the address is written in cursive or in printed letters, etc.). Games of chance are situations which deliberately project this micro indeterminacy into the macro world: the micro influences cancel each other out without one enduring influence taking over (as would be the case if the die were not perfectly symmetric and balanced) or deliberate human corrective activity stepping into the void (as a card trickster might do if the cards being shuffled somehow were distinguishable from the backside).

The experiment in which one draws balls from urns shows clearly another aspect of this paradigm: the set of different possible outcomes is fixed beforehand, and the probability enters in the choice of one of these predetermined outcomes. This is not the only way probability can arise; it is an extensionalist example, in which the connection between success and failure is external. The world is not a collection of externally related outcomes collected in an urn.
Success and failure are not determined by a choice between different spatially separated and individually inert balls (or playing cards or faces on a die), but are the outcome of development and struggle that is internal to the individual unit.

• Quality control in production: you take a sample from a shipment, count how many defectives. Why are statistics and probability useful in production? Because production is work, it is not spontaneous. Nature does not voluntarily give us things in the form in which we need them. Production is similar to a scientific experiment because it is the attempt to create local closure. Such closure can never be complete; there are always leaks in it, through which irregularity enters.

• Actuarial problems: the length of life anticipated for a person who has just applied for life insurance. Not only production, but also life itself is a struggle with physical nature; it is emergence. And sometimes it fails: sometimes the living organism is overwhelmed by the forces which it tries to keep at bay and to subject to its own purposes.

• Scientific experiments: you count the number of mice which contract cancer when a group of mice is exposed to cigarette smoke. There is local closure regarding the conditions under which the mice live, but even if this closure were complete, individual mice would still react differently, because of genetic differences. No two mice are exactly the same, and despite these differences they are still mice. This is again the stratification of reality. Two mice are two different individuals, but they are both mice. Their reaction to the smoke is not identical, since they are different individuals, but it is not completely capricious either, since both are mice. It can be predicted probabilistically. Those mechanisms which make them mice react to the smoke. The probabilistic regularity comes from the transfactual efficacy of the mouse organisms.
• Meteorology: the rainfall in a given month. It is very fortunate for the development of life on our planet that we have the chaotic alternation between cloud cover and clear sky, instead of a continuous cloud cover as on Venus or a continuous clear sky. The butterfly effect all over again, but it is possible to make probabilistic predictions since the fundamentals remain stable: the transfactual efficacy of the energy received from the sun and radiated back out into space.

• Markets: the total personal income in New York State in a given month. Market economies are very much like the weather; planned economies would be more like production or life.

• Uncertainty: the exact date of Noah's birth. This is epistemic uncertainty: assuming that Noah was a real person, the date exists and we know a time range in which it must have been, but we do not know the details. Probabilistic methods can be used to represent this kind of uncertain knowledge, but other methods to represent this knowledge may be more appropriate.

• Indeterminacy: the closing of the Dow Jones Industrial Average (DJIA) or the temperature in New York City at 4 pm on February 28, 2014. This is ontological uncertainty, not only epistemological uncertainty. Not only do we not know it, but it is objectively not yet decided what these data will be. Probability theory has limited applicability for the DJIA, since it cannot be expected that the mechanisms determining the DJIA will be the same at that time; therefore we cannot base ourselves on the transfactual efficacy of some stable mechanisms. It is not known which stocks will be included in the DJIA at that time, or whether the US dollar will still be the world reserve currency and the New York stock exchange the pinnacle of international capital markets. Perhaps a different stock market index located somewhere else will at that time play the role the DJIA is playing today.
We would not even be able to ask questions about that alternative index today. Regarding the temperature, it is more defensible to assign a probability, since the weather mechanisms have probably stayed the same, except for changes due to global warming (unless mankind has learned by that time to manipulate the weather locally by cloud seeding etc.).

• Chaotic determinacy: the relative frequency of the digit 3 in the decimal representation of π. The laws by which the number π is defined have very little to do with the procedure by which numbers are expanded as decimals; therefore the former has no systematic influence on the latter. (It has an influence, but not a systematic one; it is the error of actualism to think that every influence must be systematic.) But it is also known that laws can have remote effects: one of the most amazing theorems in mathematics is the formula π/4 = 1 − 1/3 + 1/5 − 1/7 + ⋯, which establishes a connection between the geometry of the circle and some simple arithmetics.

• Quantum mechanics: the proportion of photons absorbed by a polarization filter. If these photons are already polarized (but in a different direction than the filter), then this is not epistemic uncertainty but ontological indeterminacy, since the polarized photons form a pure state, which is atomic in the algebra of events. In this case, the distinction between epistemic uncertainty and ontological indeterminacy is operational: the two alternatives follow different mathematics.

• Statistical mechanics: the velocity distribution of molecules in a gas at a given pressure and temperature. Thermodynamics cannot be reduced to the mechanics of molecules, since mechanics is reversible in time, while thermodynamics is not. An additional element is needed, which can be modeled using probability.

PROBLEM 2. Not every kind of uncertainty can be formulated stochastically. Which other methods are available if stochastic means are inappropriate?
ANSWER. Dialectics.

PROBLEM 3. How are the probabilities of rain in weather forecasts to be interpreted?

ANSWER. Rényi in [Rén70, pp. 33/4]: "By saying that the probability of rain tomorrow is 80% (or, what amounts to the same, 0.8), the meteorologist means that in a situation similar to that observed on the given day, there is usually rain on the next day in about 8 out of 10 cases; thus, while it is not certain that it will rain tomorrow, the degree of certainty of this event is 0.8."

Pure uncertainty is as hard to generate as pure certainty; it is needed for encryption and numerical methods.

Here is an encryption scheme which leads to a random-looking sequence of numbers (see [Rao97, p. 13]): First a string of binary random digits is generated which is known only to the sender and receiver. The sender converts his message into a string of binary digits. He then places the message string below the key string and obtains a coded string by changing every message bit to its alternative at all places where the key bit is 1 and leaving the others unchanged. The coded string, which appears to be a random binary sequence, is transmitted. The received message is decoded by making the changes in the same way as in encrypting, using the key string which is known to the receiver.

PROBLEM 4. Why is it important in the above encryption scheme that the key string is purely random and does not have any regularities?

PROBLEM 5. [Knu81, pp. 7, 452] Suppose you wish to obtain a decimal digit at random, not using a computer. Which of the following methods would be suitable?

a. Open a telephone directory to a random place (i.e., stick your finger in it somewhere) and use the unit digit of the first number found on the selected page.

ANSWER. This will often fail, since users select "round" numbers if possible. In some areas, telephone numbers are perhaps assigned randomly. But it is a mistake in any case to try to get several successive random numbers from the same page, since many telephone numbers are listed several times in a sequence.

b. Same as a, but use the units digit of the page number.

ANSWER. But do you use the left-hand page or the right-hand page? Say, use the left-hand page, divide by 2, and use the units digit.

c. Roll a die which is in the shape of a regular icosahedron, whose twenty faces have been labeled with the digits 0, 0, 1, 1, ..., 9, 9. Use the digit which appears on top when the die comes to rest. (A felt table with a hard surface is recommended for rolling dice.)

ANSWER. The markings on the faces will slightly bias the die, but for practical purposes this method is quite satisfactory. See Math. Comp. 15 (1961), 94-95, for further discussion of these dice.

d. Expose a geiger counter to a source of radioactivity for one minute (shielding yourself) and use the unit digit of the resulting count. (Assume that the geiger counter displays the number of counts in decimal notation, and that the count is initially zero.)

ANSWER. This is a difficult question thrown in purposely as a surprise. The number is not uniformly distributed! One sees this best if one imagines the source of radioactivity is very low level, so that only a few emissions can be expected during this minute. If the average number of emissions per minute is λ, the probability that the counter registers k is e^{−λ} λ^k / k! (the Poisson distribution). So the digit 0 is selected with probability Σ_{k=0}^∞ e^{−λ} λ^{10k} / (10k)!, etc.

e. Glance at your wristwatch, and if the position of the second-hand is between 6n and 6(n + 1), choose the digit n.

ANSWER. Okay, provided that the time since the last digit selected in this way is random. A bias may arise if borderline cases are not treated carefully.
A better device seems to be to use a stopwatch which has been started long ago, and which one stops arbitrarily, and then one has all the time necessary to read the display. □

f. Ask a friend to think of a random digit, and use the digit he names.

ANSWER. No, people usually think of certain digits (like 7) with higher probability. □

g. Assume 10 horses are entered in a race and you know nothing whatever about their qualifications. Assign to these horses the digits 0 to 9, in arbitrary fashion, and after the race use the winner's digit.

ANSWER. Okay; your assignment of numbers to the horses had probability 1/10 of assigning a given digit to the winning horse. □

2.2. Events as Sets

With every situation with uncertain outcome we associate its sample space U, which represents the set of all possible outcomes (described by the characteristics which we are interested in). Events are associated with subsets of the sample space, i.e., with bundles of outcomes that are observable in the given experimental setup. The set of all events we denote with F. (F is a set of subsets of U.)

Look at the example of rolling a die. U = {1, 2, 3, 4, 5, 6}. The event of getting an even number is associated with the subset {2, 4, 6}; getting a six with {6}; not getting a six with {1, 2, 3, 4, 5}, etc. Now look at the example of rolling two indistinguishable dice. Observable events may be: getting two ones, getting a one and a two, etc. But we cannot distinguish between the first die getting a one and the second a two, and vice versa. I.e., if we define the sample set to be U = {1, ..., 6} × {1, ..., 6}, i.e., the set of all pairs of numbers between 1 and 6, then certain subsets are not observable. {(1,5)} is not observable (unless the dice are marked or have different colors etc.), only {(1,5), (5,1)} is observable.
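These examples can be played through directly with Python's built-in sets (an illustration that is not part of the notes; the operations used here are the ones defined formally below):

```python
# One die: the sample space and some observable events as Python sets.
U = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}              # the event "an even number"
six = {6}
assert U - six == {1, 2, 3, 4, 5}     # complement: "not a six"
assert even | six == {2, 4, 6}        # union
assert even & six == {6}              # intersection

# Two indistinguishable dice: the observable event "a one and a five"
# must bundle both ordered outcomes.
one_and_five = {(1, 5), (5, 1)}

# De Morgan's laws (Problem 7 below) hold for any A, B inside U:
A, B = {1, 2, 3}, {3, 4}
assert U - (A | B) == (U - A) & (U - B)
assert U - (A & B) == (U - A) | (U - B)
```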
If the experiment is measuring the height of a person in meters, and we make the idealized assumption that the measuring instrument is infinitely accurate, then all possible outcomes are numbers between 0 and 3, say. Sets of outcomes one is usually interested in are whether the height falls within a given interval; therefore all intervals within the given range represent observable events.

If the sample space is finite or countably infinite, very often all subsets are observable events. If the sample set contains an uncountable continuum, it is not desirable to consider all subsets as observable events. Mathematically one can define quite crazy subsets which have no practical significance and which cannot be meaningfully given probabilities. For the purposes of Econ 7800, it is enough to say that all the subsets which we may reasonably define are candidates for observable events.

The "set of all possible outcomes" is well defined in the case of rolling a die and other games; but in the social sciences, situations arise in which the outcome is open and the range of possible outcomes cannot be known beforehand. If one uses a probability theory based on the concept of a "set of possible outcomes" in such a situation, one reduces a process which is open and evolutionary to an imaginary predetermined and static "set." Furthermore, in social theory, the mechanisms by which these uncertain outcomes are generated are often internal to the members of the statistical population. The mathematical framework models these mechanisms as an extraneous "picking an element out of a pre-existing set."

From given observable events we can derive new observable events by set-theoretical operations. (All the operations below involve subsets of the same U.)

Mathematical Note: Notation of sets: there are two ways to denote a set: either by giving a rule, or by listing the elements.
(The order in which the elements are listed, or the fact whether some elements are listed twice or not, is irrelevant.)

Here are the formal definitions of set-theoretic operations. The letters A, B, etc. denote subsets of a given set U (events), and I is an arbitrary index set. ω stands for an element, and ω ∈ A means that ω is an element of A.

(2.2.1)  A ⊂ B ⟺ (ω ∈ A ⟹ ω ∈ B)   (A is contained in B)
(2.2.2)  A ∩ B = {ω : ω ∈ A and ω ∈ B}   (intersection of A and B)
(2.2.3)  ⋂_{i∈I} A_i = {ω : ω ∈ A_i for all i ∈ I}
(2.2.4)  A ∪ B = {ω : ω ∈ A or ω ∈ B}   (union of A and B)
(2.2.5)  ⋃_{i∈I} A_i = {ω : there exists an i ∈ I such that ω ∈ A_i}
(2.2.6)  U   Universal set: all ω we talk about are ∈ U.
(2.2.7)  A′ = {ω : ω ∉ A but ω ∈ U}   (complement of A)
(2.2.8)  ∅ = the empty set: ω ∉ ∅ for all ω.

These definitions can also be visualized by Venn diagrams; and for the purposes of this class, demonstrations with the help of Venn diagrams will be admissible in lieu of mathematical proofs.

PROBLEM 6. For the following set-theoretical exercises it is sufficient that you draw the corresponding Venn diagrams and convince yourself by just looking at them that the statement is true. Those who are interested in a precise mathematical proof derived from the definitions of A ∪ B etc. given above should remember that a proof of the set-theoretical identity A = B usually has the form: first you show that ω ∈ A implies ω ∈ B, and then you show the converse.

a. Prove that A ∪ B = B ⟺ A ∩ B = A.

ANSWER. If one draws the Venn diagrams, one can see that either side is true if and only if A ⊂ B. If one wants a more precise proof, the following proof by contradiction seems most illuminating: Assume the lefthand side does not hold, i.e., there exists an ω ∈ A with ω ∉ B. Then ω ∉ A ∩ B, i.e., A ∩ B ≠ A. Now assume the righthand side does not hold, i.e., there is an ω ∈ A with ω ∉ B. This ω lies in A ∪ B but not in B, i.e., the lefthand side does not hold either. □

b. Prove that A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).

ANSWER.
If ω ∈ A then it is clearly always in the righthand side and in the lefthand side. If there is therefore any difference between the righthand and the lefthand side, it must be for the ω ∉ A: If ω ∉ A and it is still in the lefthand side then it must be in B ∩ C, therefore it is also in the righthand side. If ω ∉ A and it is in the righthand side, then it must be both in B and in C, therefore it is in the lefthand side. □

c. Prove that A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).

ANSWER. If ω ∉ A then it is clearly neither in the righthand side nor in the lefthand side. If there is therefore any difference between the righthand and the lefthand side, it must be for the ω ∈ A: If ω ∈ A and it is in the lefthand side then it must be in B ∪ C, i.e., in B or in C or in both, therefore it is also in the righthand side. If ω ∈ A and it is in the righthand side, then it must be in either B or C or both, therefore it is in the lefthand side. □

d. Prove that A ∩ (⋃_{i=1}^∞ B_i) = ⋃_{i=1}^∞ (A ∩ B_i).

ANSWER. Proof: If ω is in the lefthand side, then it is in A and in at least one of the B_i, say it is in B_k. Therefore it is in A ∩ B_k, and therefore it is in the righthand side. Now assume, conversely, that ω is in the righthand side; then it is in at least one of the A ∩ B_i, say it is in A ∩ B_k. Hence it is in A and in B_k, i.e., in A and in ⋃ B_i, i.e., it is in the lefthand side. □

PROBLEM 7. 3 points Draw a Venn diagram which shows the validity of de Morgan's laws: (A ∪ B)′ = A′ ∩ B′ and (A ∩ B)′ = A′ ∪ B′. If done right, the same Venn diagram can be used for both proofs.

ANSWER. There is a proof in [HT83, p. 12]. Draw A and B inside a box which represents U, and shade A′ from the left (blue) and B′ from the right (yellow), so that A′ ∩ B′ is cross-shaded (green); then one can see these laws. □

PROBLEM 8. 3 points [HT83, Exercise 1.2-13 on p. 14] Evaluate the following unions and intersections of intervals.
Use the notation (a,b) for open and [a,b] for closed intervals, (a,b] or [a,b) for half-open intervals, {a} for sets containing one element only, and ∅ for the empty set.

(2.2.9)–(2.2.13) [five expressions of the form ⋃_{n=1}^∞(⋯) and ⋂_{n=1}^∞(⋯); not legible in this copy]

Explanation of ⋃_{n=1}^∞ [1/n, 2]: for every a with 0 < a ≤ 2 there is an n with 1/n ≤ a, hence a ∈ [1/n, 2]; the union is therefore (0, 2].

2.3. The Axioms of Probability

A probability measure Pr assigns to every event A ∈ F a number Pr[A] such that the following axioms are satisfied:

(2.3.1)  Pr[U] = 1
(2.3.2)  Pr[A] ≥ 0 for all events A
(2.3.3)  If A_i ∩ A_j = ∅ for all i, j with i ≠ j, then Pr[⋃_{i=1}^∞ A_i] = Σ_{i=1}^∞ Pr[A_i]

Here an infinite sum is mathematically defined as the limit of partial sums. These axioms make probability what mathematicians call a measure, like area or weight. In a Venn diagram, one might therefore interpret the probability of the events as the area of the bubble representing the event.

PROBLEM 9. Prove that Pr[A′] = 1 − Pr[A].

ANSWER. Follows from the fact that A and A′ are disjoint and their union U has probability 1. □

PROBLEM 10. 2 points Prove that Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B].

ANSWER. For Econ 7800 it is sufficient to argue it out intuitively: if one adds Pr[A] + Pr[B] then one counts Pr[A ∩ B] twice and therefore has to subtract it again. The brute-force mathematical proof guided by this intuition is somewhat verbose: Define D = A ∩ B′, E = A ∩ B, and F = A′ ∩ B. D, E, and F satisfy

(2.3.4)  D ∪ E = (A ∩ B′) ∪ (A ∩ B) = A ∩ (B′ ∪ B) = A ∩ U = A,
(2.3.5)  E ∪ F = B,
(2.3.6)  D ∪ E ∪ F = A ∪ B.

You may need some of the properties of unions and intersections in Problem 6. The next step is to prove that D, E, and F are mutually exclusive. Therefore it is easy to take probabilities:

(2.3.7)  Pr[A] = Pr[D] + Pr[E];
(2.3.8)  Pr[B] = Pr[E] + Pr[F];
(2.3.9)  Pr[A ∪ B] = Pr[D] + Pr[E] + Pr[F].

Take the sum of (2.3.7) and (2.3.8), and subtract (2.3.9):

(2.3.10)  Pr[A] + Pr[B] − Pr[A ∪ B] = Pr[E] = Pr[A ∩ B].

A shorter but trickier alternative proof is the following. First note that A ∪ B = A ∪ (A′ ∩ B) and that this is a disjoint union, i.e., Pr[A ∪ B] = Pr[A] + Pr[A′ ∩ B].
Then note that B = (A ∩ B) ∪ (A′ ∩ B), and this is a disjoint union, therefore Pr[B] = Pr[A ∩ B] + Pr[A′ ∩ B], or Pr[A′ ∩ B] = Pr[B] − Pr[A ∩ B]. Putting this together gives the result. □

PROBLEM 11. 1 point Show that for arbitrary events A and B, Pr[A ∪ B] ≤ Pr[A] + Pr[B].

ANSWER. From Problem 10 we know that Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B], and from axiom (2.3.2) follows Pr[A ∩ B] ≥ 0. □

PROBLEM 12. 2 points (Bonferroni inequality) Let A and B be two events. Writing Pr[A] = 1 − α and Pr[B] = 1 − β, show that Pr[A ∩ B] ≥ 1 − (α + β). You are allowed to use that Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B] (Problem 10), and that all probabilities are ≤ 1.

ANSWER.

(2.3.11)  Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B] ≤ 1
(2.3.12)  Pr[A] + Pr[B] ≤ 1 + Pr[A ∩ B]
(2.3.13)  Pr[A] + Pr[B] − 1 ≤ Pr[A ∩ B]
(2.3.14)  1 − α + 1 − β − 1 = 1 − α − β ≤ Pr[A ∩ B]  □

PROBLEM 13. (Not eligible for in-class exams) Given a rising sequence of events B₁ ⊂ B₂ ⊂ B₃ ⋯, define B = ⋃_{i=1}^∞ B_i. Show that Pr[B] = lim_{i→∞} Pr[B_i].

ANSWER. Define C₁ = B₁, C₂ = B₂ ∩ B₁′, C₃ = B₃ ∩ B₂′, etc. Then C_i ∩ C_j = ∅ for i ≠ j. In other words, we have now represented every B_n = ⋃_{i=1}^n C_i and B = ⋃_{i=1}^∞ C_i as disjoint unions, so that Pr[B] = Σ_{i=1}^∞ Pr[C_i]. The infinite sum is merely a short way of writing Pr[B] = lim_{n→∞} Σ_{i=1}^n Pr[C_i], i.e., the infinite sum is the limit of the finite sums. But since these finite sums are exactly Σ_{i=1}^n Pr[C_i] = Pr[⋃_{i=1}^n C_i] = Pr[B_n], the assertion follows. This proof, as it stands, is for our purposes entirely acceptable. One can make some steps in this proof still more stringent. For instance, one might use induction to prove B_n = ⋃_{i=1}^n C_i. And how does one show that B = ⋃_{i=1}^∞ C_i? Well, one knows that C_i ⊂ B_i, therefore ⋃_{i=1}^∞ C_i ⊂ ⋃_{i=1}^∞ B_i = B. Now take an ω ∈ B. Then it lies in at least one of the B_i, but it can be in many of them. Let k be the smallest k for which ω ∈ B_k. If k = 1, then ω ∈ C₁ = B₁ as well. Otherwise, ω ∉ B_{k−1}, and therefore ω ∈ C_k. I.e., any element in B lies in at least one of the C_k, therefore B ⊂ ⋃_{i=1}^∞ C_i. □

PROBLEM 14.
(Not eligible for in-class exams) From Problem 13 derive also the following: if A₁ ⊃ A₂ ⊃ A₃ ⋯ is a declining sequence, and A = ⋂_{i=1}^∞ A_i, then Pr[A] = lim Pr[A_i].

ANSWER. If the A_i are declining, then their complements B_i = A_i′ are rising: B₁ ⊂ B₂ ⊂ B₃ ⋯; therefore I know the probability of B = ⋃ B_i. Since by de Morgan's laws, B = A′, this gives me also the probability of A. □

The results regarding the probabilities of rising or declining sequences are equivalent to the third probability axiom. This third axiom can therefore be considered a continuity condition for probabilities.

If U is finite or countably infinite, then the probability measure is uniquely determined if one knows the probability of every one-element set. We will call Pr[{ω}] = p(ω) the probability mass function. Other terms used for it in the literature are probability function, or even probability density function (although it is not a density; more about this below). If U has uncountably many elements, the probabilities of one-element sets may not give enough information to define the whole probability measure.

Mathematical Note: Not all infinite sets are countable. Here is a proof, by contradiction, that the real numbers between 0 and 1 are not countable: assume there is an enumeration, i.e., a sequence a₁, a₂, ... which contains them all. Write them underneath each other in their (possibly infinite) decimal representation, where 0.d_{i1}d_{i2}d_{i3}... is the decimal representation of a_i. Then any real number whose decimal representation is such that the first digit is not equal to d₁₁, the second digit is not equal to d₂₂, the third not equal to d₃₃, etc., is a real number which is not contained in this enumeration. That means, an enumeration which contains all real numbers cannot exist.
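The continuity property of Problem 13 can be illustrated numerically with the length measure discussed in the next paragraph (a sketch that is not part of the notes): the rising events B_n = [0, 1 − 1/n) have union B = [0, 1), and Pr[B_n] = 1 − 1/n indeed converges to Pr[B] = 1.

```python
# Rising events B_n = [0, 1 - 1/n) under the length measure on [0, 1):
# Pr[B_n] = 1 - 1/n, and Pr[union of all B_n] = Pr[[0, 1)] = 1.
probs = [1 - 1/n for n in range(1, 100001)]

assert all(p1 <= p2 for p1, p2 in zip(probs, probs[1:]))  # Pr[B_n] is rising
assert 1 - probs[-1] < 1e-4                               # and converges to Pr[B] = 1
```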
On the real numbers between 0 and 1, the length measure (which assigns to each interval its length, and to sets composed of several intervals the sums of the lengths, etc.) is a probability measure. In this probability field, every one-element subset of the sample set has zero probability. This shows that events other than ∅ may have zero probability. In other words, if an event has probability 0, this does not mean it is logically impossible. It may well happen, but it happens so infrequently that in repeated experiments the average number of occurrences converges toward zero.

2.4. Objective and Subjective Interpretation of Probability

The mathematical probability axioms apply to both objective and subjective interpretations of probability. The objective interpretation considers probability a quasi-physical property of the experiment. One cannot simply say: Pr[A] is the relative frequency of the occurrence of A, because we know intuitively that this frequency does not necessarily converge. E.g., even with a fair coin it is physically possible that one always gets heads, or that one gets some other sequence which does not converge towards ½. The above axioms resolve this dilemma, because they allow one to derive the theorem that the relative frequencies converge towards the probability with probability one.

The subjectivist interpretation (de Finetti: "probability does not exist") defines probability in terms of people's ignorance and willingness to take bets. It is interesting for economists because it uses money and utility, as in expected utility. Call "a lottery on A" a lottery which pays $1 if A occurs, and which pays nothing if A does not occur. If a person is willing to pay p dollars for a lottery on A and 1 − p dollars for a lottery on A′, then, according to a subjectivist definition of probability, he assigns subjective probability p to A.
There is the presumption that his willingness to bet does not depend on the size of the payoff (i.e., the payoffs are considered to be small amounts).

PROBLEM 15. Assume A, B, and C are a complete disjunction of events, i.e., they are mutually exclusive and A ∪ B ∪ C = U, the universal set.

a. 1 point Arnold assigns subjective probability p to A, q to B, and r to C. Explain exactly what this means.

ANSWER. We know six different bets which Arnold is always willing to make, not only on A, B, and C, but also on their complements. □

b. 1 point Assume that p + q + r > 1. Name three lotteries which Arnold would be willing to buy, the net effect of which would be that he loses with certainty.

ANSWER. Among those six we have to pick subsets that make him a sure loser. If p + q + r > 1, then we sell him a bet on A, one on B, and one on C. The payoff is always 1, and the cost is p + q + r > 1. □

c. 1 point Now assume that p + q + r < 1. Name three lotteries which Arnold would be willing to buy, the net effect of which would be that he loses with certainty.

ANSWER. If p + q + r < 1, then we sell him a bet on A′, one on B′, and one on C′. The payoff is always 2, and the cost is 1 − p + 1 − q + 1 − r > 2. □

d. 1 point Arnold is therefore only coherent if Pr[A] + Pr[B] + Pr[C] = 1. Show that the additivity of probability can be derived from coherence, i.e., show that any subjective probability that satisfies the rule: whenever A, B, and C is a complete disjunction of events, then the sum of their probabilities is 1, is additive, i.e., Pr[A ∪ B] = Pr[A] + Pr[B].

ANSWER. Since r is his subjective probability of C, 1 − r must be his subjective probability of C′ = A ∪ B. Since p + q + r = 1, it follows 1 − r = p + q. □

This last problem indicates that the finite additivity axiom follows from the requirement that the bets be consistent or, as subjectivists say, "coherent" with each other.
However, it is not possible to derive the additivity for countably infinite sequences of events from such an argument.

2.5. Counting Rules

In this section we will be working in a finite probability space, in which all atomic events have equal probabilities. The acts of rolling dice or drawing balls from urns can be modeled by such spaces. In order to compute the probability of a given event, one must count the elements of the set which this event represents. In other words, we count how many different ways there are to achieve a certain outcome. This can be tricky, and we will develop some general principles of how to do it.

PROBLEM 16. You throw two dice.

a. 1 point What is the probability that the sum of the numbers shown is five or less?

ANSWER. The outcomes 11, 12, 13, 14, 21, 22, 23, 31, 32, 41, i.e., 10 out of 36 possibilities, give the probability 5/18. □

b. 1 point What is the probability that both of the numbers shown are five or less?

ANSWER. The outcomes 11, 12, ..., 55 with both numbers at most five, i.e., 25 out of 36, give 25/36. □

c. 2 points What is the probability that the maximum of the two numbers shown is five? (As a clarification: if the first die shows 4 and the second shows 3 then the maximum of the numbers shown is 4.)

ANSWER. The outcomes 15, 25, 35, 45, 55, 54, 53, 52, 51, i.e., 9 out of 36, give 1/4. □

In this and in similar questions to follow, the answer should be given as a fully shortened fraction.

The multiplication principle is a basic aid in counting: If the first operation can be done n₁ ways, and the second operation n₂ ways, then the total can be done n₁n₂ ways.

Definition: A permutation of a set is its arrangement in a certain order. It was mentioned earlier that for a set it does not matter in which order the elements are written down; the number of permutations is therefore the number of ways a given set can be written down without repeating its elements. From the multiplication principle follows: the number of permutations of a set of n elements is n(n − 1)(n − 2)⋯(2)(1) = n! (n factorial). By definition, 0! = 1.
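Problem 16 can be verified by brute-force enumeration of the 36 equally likely ordered outcomes (a check that is not part of the notes):

```python
# All 36 ordered outcomes of throwing two dice, each with probability 1/36.
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a predicate on the ordered pair."""
    return Fraction(sum(1 for a, b in rolls if event(a, b)), len(rolls))

assert prob(lambda a, b: a + b <= 5) == Fraction(5, 18)          # part a
assert prob(lambda a, b: a <= 5 and b <= 5) == Fraction(25, 36)  # part b
assert prob(lambda a, b: max(a, b) == 5) == Fraction(1, 4)       # part c
```

Using Fraction rather than floating point gives the fully shortened fraction the answers ask for.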
If one does not arrange the whole set, but is interested in the number of k-tuples made up of distinct elements of the set, then the number of possibilities is n(n − 1)(n − 2)⋯(n − k + 2)(n − k + 1) = n!/(n − k)!. (Start with n; the number of factors is k.) (k-tuples are sometimes called ordered k-tuples because the order in which the elements are written down matters.) [Ame94, p. 8] uses the notation Pₖⁿ for this.

This leads us to the next question: how many k-element subsets does an n-element set have? We already know how many permutations into k elements it has; but always k! of these permutations represent the same subset; therefore we have to divide by k!. The number of k-element subsets of an n-element set is therefore

(2.5.1)  (n choose k) = n!/(k!(n − k)!) = n(n − 1)⋯(n − k + 1)/k!.

It is pronounced as n choose k, and is also called a "binomial coefficient." It is defined for all 0 ≤ k ≤ n. The name comes from the binomial theorem:

(a + b)ⁿ = Σ_{k=0}^n (n choose k) a^{n−k} b^k.

Why? When the n factors a + b are multiplied out, each of the resulting terms selects from each of the n original factors either a or b. The term a^{n−k}b^k occurs therefore (n choose n−k) = (n choose k) times.

As an application: If you set a = 1, b = 1, you simply get a sum of binomial coefficients, i.e., you get the number of subsets of a set with n elements: it is 2ⁿ (always count the empty set as one of the subsets). The number of all subsets is easily counted directly. You go through the set element by element and about every element you ask: is it in the subset or not? I.e., for every element you have two possibilities, therefore by the multiplication principle the total number of possibilities is 2ⁿ.

2.7. Conditional Probability

The concept of conditional probability is arguably more fundamental than probability itself. Every probability is conditional, since we must know that the "experiment" has happened before we can speak of probabilities. [Ame94, p. 10] and [Rén70] give axioms for conditional probability which take the place of the above axioms (2.3.1), (2.3.2) and (2.3.3).
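The counting formulas just derived are available directly in Python's math module; a quick check (not part of the notes):

```python
# Permutations, ordered k-tuples, and binomial coefficients from the text.
from math import comb, factorial, perm

n, k = 12, 3
assert factorial(n) == n * factorial(n - 1)              # n! = n * (n-1)!
assert perm(n, k) == factorial(n) // factorial(n - k)    # ordered k-tuples: n!/(n-k)!
assert comb(n, k) == perm(n, k) // factorial(k)          # divide by the k! orderings

# Sum of the binomial coefficients = number of subsets of an n-element set = 2^n
assert sum(comb(n, j) for j in range(n + 1)) == 2 ** n
```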
However, we will follow here the common procedure of defining conditional probabilities in terms of the unconditional probabilities:

(2.7.1)  Pr[B|A] = Pr[B ∩ A] / Pr[A].

How can we motivate (2.7.1)? If we know that A has occurred, then of course the only way that B occurs is when B ∩ A occurs. But we want to multiply all probabilities of subsets of A with an appropriate proportionality factor so that the probability of the event A itself becomes = 1.

PROBLEM 20. 8 points Let A be an event with nonzero probability. Show that the probability conditionally on A, i.e., the mapping B ↦ Pr[B|A], satisfies all the axioms of a probability measure:

(2.7.2)  Pr[U|A] = 1
(2.7.3)  Pr[B|A] ≥ 0 for all events B
(2.7.4)  Pr[⋃_{i=1}^∞ B_i | A] = Σ_{i=1}^∞ Pr[B_i|A]  if B_i ∩ B_j = ∅ for all i, j with i ≠ j.

ANSWER. Pr[U|A] = Pr[U ∩ A]/Pr[A] = 1. Pr[B|A] = Pr[B ∩ A]/Pr[A] ≥ 0 because Pr[B ∩ A] ≥ 0 and Pr[A] > 0. Finally,

(2.7.5)  Pr[⋃_{i=1}^∞ B_i | A] = Pr[(⋃_{i=1}^∞ B_i) ∩ A]/Pr[A] = Pr[⋃_{i=1}^∞ (B_i ∩ A)]/Pr[A] = Σ_{i=1}^∞ Pr[B_i ∩ A]/Pr[A] = Σ_{i=1}^∞ Pr[B_i|A].

The first equal sign is the definition of conditional probability, the second is distributivity of unions and intersections (Problem 6 d), the third because the B_i are disjoint and therefore the B_i ∩ A are even more disjoint: B_i ∩ A ∩ B_j ∩ A = B_i ∩ B_j ∩ A = ∅ ∩ A = ∅ for all i, j with i ≠ j, and the last equal sign again by the definition of conditional probability. □

PROBLEM 21. You draw two balls without replacement from an urn which has 7 white and 14 black balls.

If both balls are white, you roll a die, and your payoff is the number which the die shows in dollars. If one ball is black and one is white, you flip a coin until you get your first head, and your payoff will be the number of flips it takes you to get a head, in dollars again. If both balls are black, you draw from a deck of 52 cards, and you get the number shown on the card in dollars. (Ace counts as one; J, Q, and K as 11, 12, 13; i.e., basically the deck contains every number between 1 and 13 four times.)
Show that the probability that you receive exactly two dollars in this game is 1/6.

ANSWER. You know a complete disjunction of events: U = {ww} ∪ {bb} ∪ {wb}, with Pr[{ww}] = (7/21)(6/20) = 1/10, Pr[{bb}] = (14/21)(13/20) = 13/30, and Pr[{wb}] = (7/21)(14/20) + (14/21)(7/20) = 7/15. Furthermore you know the conditional probabilities of getting 2 dollars conditionally on each of these events: Pr[{2}|{ww}] = 1/6; Pr[{2}|{bb}] = 1/13; Pr[{2}|{wb}] = 1/4. Now Pr[{2} ∩ {ww}] = Pr[{2}|{ww}] Pr[{ww}] etc., therefore

(2.7.6)  Pr[{2}] = Pr[{2} ∩ {ww}] + Pr[{2} ∩ {wb}] + Pr[{2} ∩ {bb}]
(2.7.7)        = (1/6)(1/10) + (1/4)(7/15) + (1/13)(13/30)
(2.7.8)        = 1/60 + 7/60 + 2/60 = 1/6.  □

PROBLEM 22. 2 points A and B are arbitrary events. Prove that the probability of B can be written as:

(2.7.9)  Pr[B] = Pr[B|A] Pr[A] + Pr[B|A′] Pr[A′].

This is the law of iterated expectations (8.6.2) in the case of discrete random variables: it might be written as Pr[B] = E[Pr[B|A]].

ANSWER. B = B ∩ U = B ∩ (A ∪ A′) = (B ∩ A) ∪ (B ∩ A′) and this union is disjoint, i.e., (B ∩ A) ∩ (B ∩ A′) = B ∩ (A ∩ A′) = B ∩ ∅ = ∅. Therefore Pr[B] = Pr[B ∩ A] + Pr[B ∩ A′]. Now apply the definition of conditional probability to get Pr[B ∩ A] = Pr[B|A] Pr[A] and Pr[B ∩ A′] = Pr[B|A′] Pr[A′]. □

PROBLEM 23. 2 points Prove the following lemma: If Pr[B|A₁] = Pr[B|A₂] (call it c) and A₁ ∩ A₂ = ∅ (i.e., A₁ and A₂ are disjoint), then also Pr[B|A₁ ∪ A₂] = c.

ANSWER.

(2.7.10)  Pr[B|A₁ ∪ A₂] = Pr[B ∩ (A₁ ∪ A₂)]/Pr[A₁ ∪ A₂] = Pr[(B ∩ A₁) ∪ (B ∩ A₂)]/Pr[A₁ ∪ A₂]
          = (Pr[B ∩ A₁] + Pr[B ∩ A₂])/(Pr[A₁] + Pr[A₂]) = (c Pr[A₁] + c Pr[A₂])/(Pr[A₁] + Pr[A₂]) = c.  □

PROBLEM 24. Show by counterexample that the requirement A₁ ∩ A₂ = ∅ is necessary for this result to hold. Hint: use the example in Problem 38 with A₁ = {HH, HT}, A₂ = {HH, TH}, B = {HH, TT}.

ANSWER. Pr[B|A₁] = 1/2 and Pr[B|A₂] = 1/2, but Pr[B|A₁ ∪ A₂] = 1/3. □

The conditional probability can be used for computing probabilities of intersections of events.

PROBLEM 25. [Lar82, exercises 2.5.1 and 2.5.2 on p. 57, solutions on p. 597, but no discussion]. Five white and three red balls are laid out in a row at random.
a. 3 points What is the probability that both end balls are white? What is the probability that one end ball is red and the other white?

ANSWER. You can lay the first ball first and the last ball second: for two white balls the probability is (5/8)(4/7) = 5/14; for one white and one red it is (5/8)(3/7) + (3/8)(5/7) = 15/28. □

b. 4 points What is the probability that all red balls are together? What is the probability that all white balls are together?

ANSWER. Of the (8 choose 3) = 56 equally likely arrangements, the three red balls form a block in 6 of them, giving 6/56 = 3/28; the five white balls form a block in 4 of them, giving 4/56 = 1/14. □

PROBLEM 26. The first three questions here are discussed in [Lar82, example 2.6.3 on p. 62]: There is an urn with 4 white and 8 black balls. You take two balls out without replacement.

a. 1 point What is the probability that the first ball is white?

ANSWER. 1/3 □

b. 1 point What is the probability that both balls are white?

ANSWER. It is Pr[second ball white|first ball white] Pr[first ball white] = (3/11)(4/12) = 1/11. □

c. 1 point What is the probability that the second ball is white?

ANSWER. It is Pr[first ball white and second ball white] + Pr[first ball black and second ball white]:

(2.7.11)  (4/12)(3/11) + (8/12)(4/11) = 44/132 = 1/3.

This is the same as the probability that the first ball is white. The probabilities do not depend on the order in which one takes the balls out. This property is called "exchangeability." One can see it also in this way: Assume you number the balls at random, from 1 to 12. Then the probability for a white ball to have the number 2 assigned to it is obviously 1/3. □

d. 1 point What is the probability that both of them are black?

ANSWER. (8/12)(7/11) = 56/132 = 14/33 (or (8 choose 2)/(12 choose 2) = 28/66). □

e. 1 point What is the probability that both of them have the same color?

ANSWER. The sum of the two above, 1/11 + 14/33 = 17/33 (or 34/66). □

Now you take three balls out without replacement.

f. 2 points Compute the probability that at least two of the three balls are white.

ANSWER. It is 13/55. The possibilities are wwb, wbw, bww, and www. Of the first three, each has probability (4/12)(3/11)(8/10) = 4/55.
Therefore the probability for exactly two being white is 3 · 4/55 = 12/55. The probability for www is (4/12)(3/11)(2/10) = 1/55. Add this to get 13/55. More systematically, the answer is ((4 choose 2)(8 choose 1) + (4 choose 3)(8 choose 0))/(12 choose 3). □

g. 1 point Compute the probability that at least two of the three are black.

ANSWER. It is 42/55. For exactly two: 3 · (8/12)(7/11)(4/10) = 28/55. For three it is (8/12)(7/11)(6/10) = 14/55. Together 42/55. One can also get it as: it is the complement of the last answer, or as ((8 choose 2)(4 choose 1) + (8 choose 3)(4 choose 0))/(12 choose 3). □

h. 1 point Compute the probability that two of the three are of the same and the third of a different color.

ANSWER. It is ((4 choose 2)(8 choose 1) + (8 choose 2)(4 choose 1))/(12 choose 3) = 160/220 = 8/11. □

i. 1 point Compute the probability that at least two of the three are of the same color.

ANSWER. This probability is 1. You have 5 black socks and 5 white socks in your drawer. There is a fire at night and you must get out of your apartment in two minutes. There is no light. You fumble in the dark for the drawer. How many socks do you have to take out so that you will have at least 2 of the same color? The answer is 3 socks. □

PROBLEM 27. If a poker hand of five cards is drawn from a deck, what is the probability that it will contain three aces? (How can the concept of conditional probability help in answering this question?)

ANSWER. [Ame94, example 2.3.3 on p. 9] and [Ame94, example 2.5.1 on p. 13] give two alternative ways to do it. The second answer uses conditional probability: The probability to draw three aces in a row first and then 2 nonaces is (4/52)(3/51)(2/50)(48/49)(47/48). Then multiply this by (5 choose 3) = 10. This gives 0.0017, i.e., 0.17%. □

PROBLEM 28. 2 points A friend tosses two coins. You ask: "did one of them land heads?" Your friend answers, "yes." What's the probability that the other also landed heads?

ANSWER. U = {HH, HT, TH, TT}; the probability is 1/3. □

PROBLEM 29. (Not eligible for in-class exams) [Ame94, p.
5] What is the probability that a person will win a game in tennis if the probability of his or her winning a point is p?

ANSWER.

(2.7.12)  Pr[win] = p⁴(1 + 4(1 − p) + 10(1 − p)²) + 20p³(1 − p)³ · p²/(1 − 2p(1 − p)).

How to derive this: {ssss} has probability p⁴; {sssfs}, {ssfss}, {sfsss}, and {fssss} have probability 4p⁴(1 − p); {sssffs} etc. (2 f and 3 s in the first 5, and then an s, together (5 choose 2) = 10 possibilities) have probability 10p⁴(1 − p)². The (6 choose 3) = 20 possibilities with three s and three f in the first six points, such as {sssfff}, give deuce at least once in the game, i.e., the probability of deuce is 20p³(1 − p)³. Now Pr[win|deuce] = p² + 2p(1 − p) Pr[win|deuce], because you win either if you score twice in a row (p²) or if you get deuce again (probability 2p(1 − p)) and then win. Solve this to get Pr[win|deuce] = p²/(1 − 2p(1 − p)), and then multiply this conditional probability with the probability of getting deuce at least once: Pr[win after at least one deuce] = 20p³(1 − p)³ p²/(1 − 2p(1 − p)). This gives the last term in (2.7.12). □

PROBLEM 30. (Not eligible for in-class exams) Andy, Bob, and Chris play the following game: each of them draws a card without replacement from a deck of 52 cards. The one who has the highest card wins. If there is a tie (like: two kings and no aces), then that person wins among those who drew this highest card whose name comes first in the alphabet. What is the probability for Andy to be the winner? For Bob? For Chris? Does this probability depend on the order in which they draw their cards out of the stack?

ANSWER. Let A be the event that Andy wins, B that Bob, and C that Chris wins.

One way to approach this problem is to ask: what are the chances for Andy to win when he draws a king?, etc., i.e., compute it for all 13 different cards. Then: what are the chances for Bob to win when he draws a king, and also his chances for the other cards, and then for Chris.
It is computationally easier to make the following partitioning of all outcomes: either all three cards drawn are different (call this event D), or all three cards are equal (event E), or two of the three cards are equal (event T). This third case has to be split into T = H ∪ L, according to whether the card that is different is higher or lower than the two equal ones.

If all three cards are different, then Andy, Bob, and Chris have equal chances of winning; if all three cards are equal, then Andy wins. What about the case that two cards are the same and the third is different? There are two possibilities. If the card that is different is higher than the two that are the same, then the chances of winning are again evenly distributed; but if the two equal cards are higher, then Andy has a 2/3 chance of winning (namely when the distribution of the cards Y (lower) and Z (higher) among Andy, Bob, and Chris is ZZY or ZYZ), and Bob has a 1/3 chance of winning (when the distribution is YZZ). What we just did was computing the conditional probabilities Pr[A|D], Pr[A|E], etc.

Now we need the probabilities of D, E, and T. What is the probability that all three cards drawn are the same? The probability that the second card is the same as the first is 3/51; and the probability that the third is the same too is 2/50; therefore the total probability is (3/51)(2/50) = 6/2550. The probability that all three are unequal is (48/51)(44/50) = 2112/2550. The probability that two are equal and the third is different is therefore 2550/2550 - 6/2550 - 2112/2550 = 432/2550. In half of these cases the card that is different is higher, and in half of the cases it is lower.

Putting this together one gets:

                               Uncond. Prob.   Cond. Prob.         Prob. of intersection
                                               A     B     C       A          B          C
E  all 3 equal                    6/2550       1     0     0       6/2550     0          0
H  2 of 3 equal, 3rd higher     216/2550      1/3   1/3   1/3     72/2550    72/2550    72/2550
L  2 of 3 equal, 3rd lower      216/2550      2/3   1/3    0     144/2550    72/2550     0
D  all 3 unequal               2112/2550      1/3   1/3   1/3    704/2550   704/2550   704/2550
   Sum                         2550/2550                         926/2550   848/2550   776/2550

I.e., the probability that Andy wins is 926/2550 = 463/1275 ≈ .363, the probability that Bob wins is 848/2550 = 424/1275 ≈ .3325, and the probability that Chris wins is 776/2550 = 388/1275 ≈ .304. Here we are using Pr[A] = Pr[A|E] Pr[E] + Pr[A|H] Pr[H] + Pr[A|L] Pr[L] + Pr[A|D] Pr[D]. These probabilities do not depend on the order in which the cards are drawn, since all ordered draws of three cards are equally likely. □

PROBLEM 31. 4 points You are the contestant in a game show. There are three closed doors at the back of the stage. Behind one of the doors is a sports car, behind the other two doors are goats. The game master knows which door has the sports car behind it, but you don't. You have to choose one of the doors; if it is the door with the sports car, the car is yours. After you make your choice, say door A, the game master says: "I want to show you something." He opens one of the two other doors, let us assume it is door B, and it has a goat behind it. Then the game master asks: "Do you still insist on door A, or do you want to reconsider your choice?"

Can you improve your odds of winning by abandoning your previous choice and instead selecting the door which the game master did not open? If so, by how much?

ANSWER. If you switch, you will lose the car if you had initially picked the right door, but you will get the car if you were wrong before! Therefore you improve your chances of winning from 1/3 to 2/3. This is simulated on the web, see www.stat.sc.edu/~west/javahtml/LetsMakeaDeal.html.

It is counterintuitive. You may think that one of the two other doors always has a goat behind it, whatever your choice, therefore there is no reason to switch.
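The switching advantage is also easy to check empirically. A minimal Monte Carlo sketch in Python (my own illustration, not part of the original notes), which can then be compared with the explanation that follows:

```python
import random

def play(switch, rng):
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # the game master opens a goat door different from the contestant's pick
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(42)
n = 100_000
stay = sum(play(False, rng) for _ in range(n)) / n
swap = sum(play(True, rng) for _ in range(n)) / n
print(stay, swap)  # close to 1/3 and 2/3
```

With the seed fixed, the two estimated winning probabilities come out near 1/3 (staying) and 2/3 (switching), in agreement with the answer above.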
But the game master not only shows you that there is another door with a goat, he also shows you one of the other doors with a goat behind it, i.e., he restricts your choice if you switch. This is valuable information. It is as if you could bet on both other doors simultaneously, i.e., you get the car if it is behind one of the doors B or C. I.e., it is as if the quiz master had said: I give you the opportunity to switch to the following: you get the car if it is behind B or C. Do you want to switch? The only doubt the contestant may have about this is: had I not picked a door with the car behind it, then I would not have been offered this opportunity to switch. □

2.8. Ratio of Probabilities as Strength of Evidence

Pr1 and Pr2 are two probability measures defined on the same set F of events. Hypothesis H1 says Pr1 is the true probability, and H2 says Pr2 is the true probability. Then the observation of an event A for which Pr1[A] > Pr2[A] is evidence in favor of H1 as opposed to H2. [Roy97] argues that the ratio of the probabilities (also called "likelihood ratio") is the right way to measure the strength of this evidence. Among others, the following justification is given [Roy97, p. 7]: If H2 is true, it is usually not impossible to find evidence favoring H1, but it is unlikely; and its probability is bounded by the reverse of the ratio of probabilities. This can be formulated mathematically as follows: Let S be the union of all events A for which Pr1[A] ≥ k Pr2[A]. Then it can be shown that Pr2[S] ≤ 1/k, i.e., if H2 is true, the probability to find evidence favoring H1 with strength k is never greater than 1/k.

Here is a proof in the case that there is only a finite number of possible outcomes U = {ω1, ..., ωn}: Renumber the outcomes such that for i = 1, ..., m, Pr1[{ωi}] < k Pr2[{ωi}], and for j = m+1, ..., n, Pr1[{ωj}] ≥ k Pr2[{ωj}]. Then S = {ω_{m+1}, ..., ω_n}, therefore

Pr2[S] = Σ_{j=m+1}^{n} Pr2[{ωj}] ≤ (1/k) Σ_{j=m+1}^{n} Pr1[{ωj}] ≤ 1/k.  □

This result has an implication for sequential observation. Assume a researcher, whose own hypothesis is H1 but whose rival's hypothesis H2 is in fact true, makes an observation ω_{i(1)}. If Pr1[{ω_{i(1)}}] ≥ k Pr2[{ω_{i(1)}}], he publishes his result; if not, he makes a second independent observation of the experiment, ω_{i(2)}. If Pr1[{ω_{i(1)}}] Pr1[{ω_{i(2)}}] ≥ k Pr2[{ω_{i(1)}}] Pr2[{ω_{i(2)}}], he publishes his result; if not, he makes a third observation and incorporates that in his publication as well, etc. It can be shown that this strategy will not help: if his rival's hypothesis is true, then the probability that he will ever be able to publish results which seem to show that his own hypothesis is true is still ≤ 1/k. I.e., the sequence of independent observations ω_{i(1)}, ω_{i(2)}, ... is such that

(2.8.1)  Pr2[ Π_{j=1}^{n} Pr1[{ω_{i(j)}}] ≥ k Π_{j=1}^{n} Pr2[{ω_{i(j)}}] for some n = 1, 2, ... ] ≤ 1/k.

It is not possible to take advantage of the indeterminacy of a random outcome by carrying on until chance places one ahead, and then to quit. If one fully discloses all the evidence one is accumulating, then the probability that this accumulated evidence supports one's hypothesis cannot rise above 1/k.

PROBLEM 32. It is usually not possible to assign probabilities to the hypotheses H1 and H2, but sometimes it is. Show that in this case, the likelihood ratio of event A is the factor by which the ratio of the probabilities of H1 and H2 is changed by the observation of A, i.e.,

(2.8.2)  Pr[H1|A] / Pr[H2|A] = (Pr[H1] / Pr[H2]) · (Pr[A|H1] / Pr[A|H2]).

ANSWER. Apply Bayes's theorem (2.9.1) twice, once for the numerator, once for the denominator. □

A world in which probability theory applies is therefore a world in which the transitive dimension must be distinguished from the intransitive dimension. Research results are not determined by the goals of the researcher.

2.9. Bayes Theorem

In its simplest form, Bayes's theorem reads

(2.9.1)  Pr[A|B] = Pr[B|A] Pr[A] / (Pr[B|A] Pr[A] + Pr[B|A'] Pr[A']).

PROBLEM 33. Prove Bayes's theorem!

ANSWER. Obvious, since the numerator is Pr[B ∩ A] and the denominator Pr[B ∩ A] + Pr[B ∩ A'] = Pr[B]. □
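Bayes's theorem (2.9.1) and the odds form (2.8.2) are easy to verify numerically. A sketch in Python (my own illustration; the numbers are the learning-for-the-exam figures used in the text that follows): the posterior computed by (2.9.1) agrees with prior odds times likelihood ratio.

```python
from fractions import Fraction

def bayes(prior_a, lik_b_given_a, lik_b_given_not_a):
    """Pr[A|B] via (2.9.1)."""
    num = lik_b_given_a * prior_a
    den = num + lik_b_given_not_a * (1 - prior_a)
    return num / den

prior = Fraction(6, 10)          # Pr[A]:    student learned for the exam
p_b_a = Fraction(8, 10)          # Pr[B|A]:  passed, given learned
p_b_not_a = Fraction(5, 10)      # Pr[B|A']: passed, given did not learn

posterior = bayes(prior, p_b_a, p_b_not_a)
# (2.8.2): posterior odds = prior odds * likelihood ratio
post_odds = posterior / (1 - posterior)
assert post_odds == (prior / (1 - prior)) * (p_b_a / p_b_not_a)
print(posterior)  # 12/17, i.e. about .706
```

Working in exact fractions makes the identity check an equality rather than a floating-point approximation.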
This theorem has its significance in cases in which A can be interpreted as a cause of B, and B an effect of A. For instance, A is the event that a student who was picked randomly from a class has learned for a certain exam, and B is the event that he passed the exam. Then the righthand side expression contains that information which you would know from the cause-effect relations: the unconditional probability of the event which is the cause, and the conditional probabilities of the effect conditioned on whether or not the cause happened. From this, the formula computes the conditional probability of the cause given that the effect happened.

Bayes's theorem tells us therefore: if we know that the effect happened, how sure can we be that the cause happened? Clearly, Bayes's theorem has relevance for statistical inference.

Let's stay with the example with learning for the exam; assume Pr[A] = .6, Pr[B|A] = .8, and Pr[B|A'] = .5. Then the probability that a student who passed the exam has learned for it is Pr[A|B] = (.8)(.6)/((.8)(.6) + (.5)(.4)) = .48/.68 ≈ .706. Look at these numbers: the numerator is the average percentage of students who learned and passed, and the denominator the average percentage of students who passed.

PROBLEM 34. AIDS diagnostic tests are usually over 99.9% accurate on those who do not have AIDS (i.e., only 0.1% false positives) and 100% accurate on those who have AIDS (i.e., no false negatives at all). (A test is called positive if it indicates that the subject has AIDS.)

a. 3 points Assuming that 0.5% of the population actually have AIDS, compute the probability that a particular individual has AIDS, given that he or she has tested positive.

ANSWER. A is the event that he or she has AIDS, and T the event that the test is positive.

Pr[A|T] = Pr[T|A] Pr[A] / (Pr[T|A] Pr[A] + Pr[T|A'] Pr[A']) = (1 · 0.005)/(1 · 0.005 + 0.001 · 0.995) = (1000 · 5)/(1000 · 5 + 1 · 995) = 5000/5995 = 1000/1199 ≈ 0.834028.

Even after testing positive there is still a 16.6% chance that this person does not have AIDS. □

b. 1 point If one is young, healthy and not in one of the risk groups, then the chances of having AIDS are not 0.5% but 0.1% (this is the proportion of the applicants to the military who have AIDS). Re-compute the probability with this alternative number.

ANSWER. (1 · 0.001)/(1 · 0.001 + 0.001 · 0.999) = (1000 · 1)/(1000 · 1 + 1 · 999) = 1000/1999 ≈ 0.50025. □

2.10. Independence of Events

2.10.1. Definition of Independence. Heuristically, we want to say: event B is independent of event A if Pr[B|A] = Pr[B|A']. From this follows by Problem 23 that the conditional probability is equal to the unconditional probability Pr[B], i.e., Pr[B] = Pr[B ∩ A]/Pr[A]. Therefore we will adopt as definition of independence the so-called multiplication rule:

Definition: B and A are independent, notation B⊥A, if Pr[B ∩ A] = Pr[B] Pr[A].

This is a symmetric condition, i.e., if B is independent of A, then A is also independent of B. This symmetry is not immediately obvious given the above definition of independence, and it also has the following nontrivial practical implication (this example from [Daw79a, pp. 2/3]): A is the event that one is exposed to some possibly carcinogenic agent, and B the event that one develops a certain kind of cancer. In order to test whether B⊥A, i.e., whether the exposure to the agent does not increase the incidence of cancer, one often collects two groups of subjects, one group which has cancer and one control group which does not, and checks whether the exposure in these two groups to the carcinogenic agent is the same. I.e., the experiment checks whether A⊥B, although the purpose of the experiment was to determine whether B⊥A.

PROBLEM 35.
3 points Given that Pr[B ∩ A] = Pr[B] · Pr[A] (i.e., B is independent of A), show that Pr[B ∩ A'] = Pr[B] · Pr[A'] (i.e., B is also independent of A').

ANSWER. If one uses our heuristic definition of independence, i.e., B is independent of event A if Pr[B|A] = Pr[B|A'], then it is immediately obvious, since that definition is symmetric in A and A'. However, if we use the multiplication rule as the definition of independence, as the text of this Problem suggests, we have to do a little more work: Since B is the disjoint union of (B ∩ A) and (B ∩ A'), it follows Pr[B] = Pr[B ∩ A] + Pr[B ∩ A'], or Pr[B ∩ A'] = Pr[B] − Pr[B ∩ A] = Pr[B] − Pr[B] Pr[A] = Pr[B](1 − Pr[A]) = Pr[B] Pr[A']. □

PROBLEM 36. 2 points A and B are two independent events with Pr[A] = 1/3 and Pr[B] = 1/4. Compute Pr[A ∪ B].

ANSWER. Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B] = Pr[A] + Pr[B] − Pr[A] Pr[B] = 1/3 + 1/4 − 1/12 = 1/2. □

PROBLEM 37. 3 points You have an urn with five white and five red balls. You take two balls out without replacement. A is the event that the first ball is white, and B that the second ball is white.

a. What is the probability that the first ball is white?
b. What is the probability that the second ball is white?
c. What is the probability that both have the same color?
d. Are these two events independent, i.e., is Pr[B|A] = Pr[A]?
e. Are these two events disjoint, i.e., is A ∩ B = ∅?

ANSWER. Clearly, Pr[A] = 1/2. Pr[B] = Pr[B|A] Pr[A] + Pr[B|A'] Pr[A'] = (4/9)(1/2) + (5/9)(1/2) = 1/2. The probability that both have the same color is Pr[both white] + Pr[both red] = (4/9)(1/2) + (4/9)(1/2) = 4/9. The events are not independent: Pr[B|A] = 4/9 ≠ Pr[B], or Pr[A ∩ B] = (1/2)(4/9) = 2/9 ≠ 1/4. They would be independent if the first ball had been replaced. The events are also not disjoint: it is possible that both balls are white. □

2.10.2. Independence of More than Two Events. If there are more than two events, we must require that all possible intersections of these events, not only the pairwise intersections, follow the above multiplication rule.
For instance, Pr[A ∩ B] = Pr[A] Pr[B]; Pr[A ∩ C] = Pr[A] Pr[C]; Pr[B ∩ C] = Pr[B] Pr[C]; and Pr[A ∩ B ∩ C] = Pr[A] Pr[B] Pr[C]. This last condition is not implied by the other three. Here is an example: Draw a ball at random from an urn containing four balls numbered 1, 2, 3, 4. Define A = {1,4}, B = {2,4}, and C = {3,4}. These events are pairwise independent but not mutually independent.

(2.10.1)  A, B, C mutually independent ⟺ Pr[A ∩ B] = Pr[A] Pr[B], Pr[A ∩ C] = Pr[A] Pr[C], Pr[B ∩ C] = Pr[B] Pr[C], and Pr[A ∩ B ∩ C] = Pr[A] Pr[B] Pr[C].

PROBLEM 38. 2 points Flip a coin two times independently and define the following three events:

(2.10.2)  A = Head in first flip;  B = Head in second flip;  C = Same face in both flips.

Are these three events pairwise independent? Are they mutually independent?

ANSWER. U = {HH, HT, TH, TT}. A = {HH, HT}, B = {HH, TH}, C = {HH, TT}. Pr[A] = 1/2, Pr[B] = 1/2, Pr[C] = 1/2. They are pairwise independent, but Pr[A ∩ B ∩ C] = Pr[{HH}] = 1/4 ≠ Pr[A] Pr[B] Pr[C], therefore the events cannot be mutually independent. □

PROBLEM 39. 3 points A, B, and C are pairwise independent events whose probabilities are greater than zero and smaller than one, and A ∩ B ⊂ C. Can those events be mutually independent?

ANSWER. No; from A ∩ B ⊂ C follows A ∩ B ∩ C = A ∩ B, and therefore Pr[A ∩ B ∩ C] ≠ Pr[A ∩ B] Pr[C], since Pr[C] < 1 and Pr[A ∩ B] > 0. □

If one takes unions, intersections, complements of different mutually independent events, one will still end up with mutually independent events. E.g., if A, B, C are mutually independent, then A', B, C are mutually independent as well, and A ∩ B is independent of C, and A ∪ B is independent of C, etc. This is not the case if the events are only pairwise independent. In Problem 39, A ∩ B is not independent of C.

2.10.3. Conditional Independence. If A and B are independent in the probability measure conditionally on C, i.e., if Pr[A ∩ B|C] = Pr[A|C] Pr[B|C], then they are called conditionally independent given that C occurred, notation A⊥B|C.
In formulas:

(2.10.3)  (Pr[A ∩ C]/Pr[C]) · (Pr[B ∩ C]/Pr[C]) = Pr[A ∩ B ∩ C]/Pr[C].

PROBLEM 40. 5 points Show that A⊥B|C is equivalent to Pr[A|B ∩ C] = Pr[A|C]. In other words: independence of A and B conditionally on C means: once we know that C occurred, the additional knowledge whether B occurred or not will not help us to sharpen our knowledge about A.

Literature about conditional independence (of random variables, not of events) includes [Daw79a], [Daw79b], [Daw80].

FIGURE 1. Generic Venn Diagram for 3 Events

2.10.4. Independent Repetition of an Experiment. If a given experiment has sample space U, and we perform the experiment n times in a row, then this repetition can be considered a single experiment with the sample space consisting of n-tuples of elements of U. This set is called the product set Uⁿ = U × U × ··· × U (n terms). If a probability measure Pr is given on F, then one can define in a unique way a probability measure on the subsets of the product set so that events in different repetitions are always independent of each other.

The Bernoulli experiment is the simplest example of such an independent repetition. U = {s, f} (stands for success and failure). Assume Pr[{s}] = p, and that the experimenter has several independent trials. For instance, Uⁿ has, among others, the following possible outcomes:

(2.10.4)
If ω = (f, f, ..., f, f) then Pr[{ω}] = (1−p)ⁿ
If ω = (f, f, ..., f, s) then Pr[{ω}] = (1−p)^{n−1} p
If ω = (f, f, ..., s, f) then Pr[{ω}] = (1−p)^{n−1} p
If ω = (f, ..., f, s, s) then Pr[{ω}] = (1−p)^{n−2} p²

One sees, this is very cumbersome, and usually unnecessarily so. If we toss a coin 5 times, the only thing we usually want to know is how many successes there were. As long as the experiments are independent, the question how the successes were distributed over the n different trials is far less important. This brings us to the definition of a random variable, and to the concept of a sufficient statistic.
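The product-space bookkeeping in (2.10.4) can be made concrete: enumerating Uⁿ and adding up the probabilities of all n-tuples with a given number of successes reproduces the binomial probabilities (n choose k) p^k (1−p)^{n−k}, which is exactly why only the success count matters. A sketch (my own illustration):

```python
from itertools import product
from math import comb, isclose

p, n = 0.3, 5
U = ['s', 'f']

def prob(outcome):
    # independent repetitions: the probability of an n-tuple is a product
    q = 1.0
    for trial in outcome:
        q *= p if trial == 's' else 1 - p
    return q

# collapse the 2^n outcomes according to their number of successes
by_successes = [0.0] * (n + 1)
for outcome in product(U, repeat=n):
    by_successes[outcome.count('s')] += prob(outcome)

for k in range(n + 1):
    assert isclose(by_successes[k], comb(n, k) * p**k * (1 - p)**(n - k))
print(by_successes)
```

Collapsing the 2ⁿ elementary outcomes into n + 1 success counts is a first glimpse of data reduction by a sufficient statistic.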
2.11. How to Plot Frequency Vectors and Probability Vectors

If there are only 3 possible outcomes, i.e., U = {ω1, ω2, ω3}, then the set of all probability measures is the set of nonnegative 3-vectors whose components sum to 1. Graphically, such vectors can be represented as points inside an equilateral triangle with height 1: the three components of the vector are the distances of the point to each of the sides of the triangle. The R/Splus function triplot in the ecmet package, written by Jim Ramsay, does this, with optional rescaling if the rows of the data matrix do not have unit sums.

PROBLEM 41. In an equilateral triangle, call a = the distance of the sides from the center point, b = half the side length, and c = the distance of the corners from the center point (as in Figure 2). Show that b = a√3 and c = 2a.

ANSWER. From (a + c)² + b² = 4b², i.e., (a + c)² = 3b², follows a + c = b√3. But we also have a² + b² = c². Therefore a² + 2ac + c² = 3b² = 3c² − 3a², or 4a² + 2ac − 2c² = 0, or 2a² + ac − c² = (2a − c)(a + c) = 0. The positive solution is therefore c = 2a. This gives a + c = 3a = b√3, or b = a√3. □

FIGURE 2. Geometry of an equilateral triangle

And the function quadplot, also written by Jim Ramsay, does quadrilinear plots, meaning that proportions for four categories are plotted within a regular tetrahedron. Quadplot displays the probability tetrahedron and its points using XGobi. Each vertex of the triangle or tetrahedron corresponds to the degenerate probability distribution in which one of the events has probability 1 and the others have probability 0. The labels of these vertices indicate which event has probability 1.

The script kai is an example visualizing data from [Mor65]; it can be run using the command ecmet.script(kai).

Example: Statistical linguistics. In the study of ancient literature, the authorship of texts is a perplexing problem.
When books were written and reproduced by hand, the rights of authorship were limited and what would now be considered forgery was common. The names of reputable authors were borrowed in order to sell books or get attention for books; the writings of disciples and collaborators were published under the name of the master; or anonymous old manuscripts were optimistically attributed to famous authors. In the absence of conclusive evidence of authorship, the attribution of ancient texts must be based on the texts themselves, for instance, by statistical analysis of literary style. Here it is necessary to find stylistic criteria which vary from author to author, but are independent of the subject matter of the text. An early suggestion was to use the probability distribution of word length, but this was never acted upon, because it is too dependent on the subject matter. Sentence-length distributions, on the other hand, have proved highly reliable. [Mor65, p. 184] says that sentence-length is "periodic rather than random," therefore the sample should have at least about 100 sentences. "Sentence-length distributions are not suited to dialogue, they cannot be used on commentaries written on one author by another, nor are they reliable on such texts as the fragmentary books of the historian Diodorus Siculus."

PROBLEM 42. According to [Mor65, p. 184], sentence-length is "periodic rather than random." What does this mean?

ANSWER. In a text, passages with long sentences alternate with passages with shorter sentences. This is why one needs at least 100 sentences to get a representative distribution of sentences, and this is why fragments and drafts and commentaries on others' writings do not exhibit an average sentence-length distribution: they do not have the melody of the finished text.
□

Besides the length of sentences, also the number of common words which express a general relation ("and", "in", "but", "I", "to be") is random with the same distribution at least within the same genre. By contrast, the occurrence of the definite article "the" cannot be modeled by simple probabilistic laws, because the number of nouns with definite article depends on the subject matter.

Table 1 has data about the epistles of St. Paul. Abbreviations: Rom Romans; Co1 1st Corinthians; Co2 2nd Corinthians; Gal Galatians; Phi Philippians; Col Colossians; Th1 1st Thessalonians; Ti1 1st Timothy; Ti2 2nd Timothy; Heb Hebrews. 2nd Thessalonians, Titus, and Philemon were excluded because they were too short to give reliable samples. From an analysis of these and other data, [Mor65, p. 224] concludes that the first 4 epistles (Romans, 1st Corinthians, 2nd Corinthians, and Galatians) form a consistent group, and all the other epistles lie more than 2 standard deviations from the mean of this group (using χ² statistics). If Paul is defined as being the author of Galatians, then he also wrote Romans and 1st and 2nd Corinthians. The remaining epistles come from at least six hands.

TABLE 1. Number of sentences in Paul's epistles with 0, 1, 2, and ≥ 3 occurrences of kai

              Rom  Co1  Co2  Gal  Phi  Col  Th1  Ti1  Ti2  Heb
no kai        386  424  192  128   42   23   34   49   45   55
one           141  152   86   48   29   32   23   38   28   94
two            34   35    8    5   19   17    8    9   11   37
3 or more      17   16   13    6   12    9   16   10    4   24

PROBLEM 43. Enter the data from Table 1 into xgobi and brush the four epistles which are, according to Morton, written by Paul himself. 3 of those points are almost on top of each other, and one is a little apart. Which one is this?

ANSWER. In order to see those points, one has to normalize with xgobi. In R, issue the commands library(xgobi), then data(PaulKAI), then quadplot(PaulKAI, TRUE). If you have xgobi but not R, this dataset is one of the default datasets coming with xgobi. □

CHAPTER 3

Random Variables
3.1. Notation

Throughout these class notes, lower case bold letters will be used for vectors and upper case bold letters for matrices, and letters that are not bold for scalars. The (i,j) element of the matrix A is a_ij, and the ith element of a vector b is b_i; the arithmetic mean of all elements is b̄. All vectors are column vectors; if a row vector is needed, it will be written in the form b⊤. Furthermore, the on-line version of these notes uses green symbols for random variables, and the corresponding black symbols for the values taken by these variables. If a black-and-white printout of the on-line version is made, then the symbols used for random variables and those used for specific values taken by these random variables can only be distinguished by their grey scale or cannot be distinguished at all; therefore a special monochrome version is available which should be used for the black-and-white printouts. It uses an upright math font, called "Euler," for the random variables, and the same letter in the usual slanted italic font for the values of these random variables.

Example: If y is a random vector, then y denotes a particular value, for instance an observation, of the whole vector; y_i denotes the ith element of y (a random scalar), and y_i is a particular value taken by that element (a nonrandom scalar).

With real-valued random variables, the powerful tools of calculus become available to us. Therefore we will begin the chapter about random variables with a digression about infinitesimals.

3.2. Digression about Infinitesimals

In the following pages we will recapitulate some basic facts from calculus. But it will differ in two respects from the usual calculus classes. (1) Everything will be given its probability-theoretic interpretation, and (2) we will make explicit use of infinitesimals. This last point bears some explanation.

You may say infinitesimals do not exist. Do you know the story with Achilles and the turtle?
They are racing, the turtle starts 1 km ahead of Achilles, and Achilles runs ten times as fast as the turtle. So when Achilles arrives at the place the turtle started, the turtle has run 100 meters; and when Achilles has run those 100 meters, the turtle has run 10 meters, and when Achilles has run the 10 meters, then the turtle has run 1 meter, etc. The Greeks were actually arguing whether Achilles would ever reach the turtle.

This may sound like a joke, but in some respects, modern mathematics never went beyond the level of the Greek philosophers. If a modern mathematician sees something like

(3.2.1)  lim_{n→∞} 1/n = 0,   lim_{n→∞} Σ_{j=0}^{n} 10^{−j} = 10/9,

then he will probably say that the lefthand term in each equation never really reaches the number written on the right; all he will say is that the term on the left comes arbitrarily close to it.

This is like saying: I know that Achilles will get as close as 1 cm or 1 mm to the turtle, he will get closer than any distance, however small, to the turtle, instead of simply saying that Achilles reaches the turtle. Modern mathematical proofs are full of races between Achilles and the turtle of the kind: give me an ε, and I will prove to you that the thing will come at least as close as ε to its goal (so-called epsilontism), but never speaking about the moment when the thing will reach its goal. Of course, it "works," but it makes things terribly cumbersome, and it may have prevented people from seeing connections.

Abraham Robinson in [Rob74] is one of the mathematicians who tried to remedy it. He did it by adding more numbers, infinite numbers and infinitesimal numbers. Robinson showed that one can use infinitesimals without getting into contradictions, and he demonstrated that mathematics becomes much more intuitive this way, not only in its elementary proofs, but especially in the deeper results. One of the elementary books based on his calculus is [HK79].
The well-known logician Kurt Gödel said about Robinson's work: "I think, in coming years it will be considered a great oddity in the history of mathematics that the first exact theory of infinitesimals was developed 300 years after the invention of the differential calculus."

Gödel called Robinson's theory the first theory. I would like to add here the following speculation: perhaps Robinson shares the following error with the "standard" mathematicians whom he criticizes: they consider numbers only in a static way, without allowing them to move. It would be beneficial to expand on the intuition of the inventors of differential calculus, who talked about "fluxions," i.e., quantities in flux, in motion. Modern mathematicians even use arrows in their symbol for limits, but they are not calculating with moving quantities, only with static quantities. This perspective makes the category-theoretical approach to infinitesimals taken in [MR91] especially promising. Category theory considers objects on the same footing with their transformations (and uses lots of arrows).

Maybe a few years from now mathematics will be done right. We should not let this temporary backwardness of mathematics hold us back in our intuition. The equation Δy/Δx = 2x does not hold exactly on a parabola for any pair of given (static) Δx and Δy; but if you take a pair (Δx, Δy) which is moving towards zero, then this equation holds in the moment when they reach zero, i.e., when they vanish. Writing dy and dx means therefore: we are looking at magnitudes which are in the process of vanishing. If one applies a function to a moving quantity, one again gets a moving quantity, and the derivative of this function compares the speed with which the transformed quantity moves with the speed of the original quantity. Likewise, the equation Σ_{i=1}^{n} 2^{−i} = 1 holds in the moment when n reaches infinity.
From this point of view, the axiom of σ-additivity in probability theory (in its equivalent form of rising or declining sequences of events) indicates that the probability of a vanishing event vanishes. Whenever we talk about infinitesimals, therefore, we really mean magnitudes which are moving, and which are in the process of vanishing. dV_{x,y} is therefore not, as one might think from what will be said below, a static but small volume element located close to the point (x,y), but it is a volume element which is vanishing into the point (x,y). The probability density function therefore signifies the speed with which the probability of a vanishing element vanishes.

3.3. Definition of a Random Variable

The best intuition of a random variable would be to view it as a numerical variable whose values are not determinate but follow a statistical pattern, and call it x, while possible values of x are called x. In order to make this a mathematically sound definition, one says: A mapping x : U → R of the set U of all possible outcomes into the real numbers R is called a random variable. (Again, mathematicians are able to construct pathological mappings that cannot be used as random variables, but we let that be their problem, not ours.) The random x is then defined as x = x(ω). I.e., all the randomness is shunted off into the process of selecting an element of U. Instead of being an indeterminate function, it is defined as a determinate function of the random ω. It is written here as x(ω), the determinate mapping applied to the random outcome, because the function itself is determinate, only its argument is random.

Whenever one has a mapping x : U → R between sets, one can construct from it in a natural way an "inverse image" mapping between subsets of these sets. Let F, as usual, denote the set of subsets of U, and let B denote the set of subsets of R. We will define a mapping x⁻¹ : B → F in the following way: For any B ⊂ R, we define x⁻¹(B) = {ω ∈ U : x(ω) ∈ B}.
(This is not the usual inverse of a mapping, which does not always exist. The inverse-image mapping always exists, but the inverse image of a one-element set is no longer necessarily a one-element set; it may have more than one element or may be the empty set.)

This "inverse image" mapping is well behaved with respect to unions and intersections, etc. In other words, we have identities x⁻¹(A ∩ B) = x⁻¹(A) ∩ x⁻¹(B) and x⁻¹(A ∪ B) = x⁻¹(A) ∪ x⁻¹(B), etc.

PROBLEM 44. Prove the above two identities.

ANSWER. These are very subtle proofs. x⁻¹(A ∩ B) = {ω ∈ U : x(ω) ∈ A ∩ B} = {ω ∈ U : x(ω) ∈ A and x(ω) ∈ B} = {ω ∈ U : x(ω) ∈ A} ∩ {ω ∈ U : x(ω) ∈ B} = x⁻¹(A) ∩ x⁻¹(B). The other identity has a similar proof. □

PROBLEM 45. Show, on the other hand, by a counterexample, that the "direct image" mapping defined by x(E) = {r ∈ R : there exists ω ∈ E with x(ω) = r} no longer satisfies x(E ∩ F) = x(E) ∩ x(F).

By taking inverse images under a random variable x, the probability measure on F is transplanted into a probability measure on the subsets of R by the simple prescription Pr[B] = Pr[x⁻¹(B)]. Here, B is a subset of R and x⁻¹(B) one of U; the Pr on the right side is the given probability measure on U, while the Pr on the left is the new probability measure on R induced by x. This induced probability measure is called the probability law or probability distribution of the random variable.

Every random variable induces therefore a probability measure on R, and this probability measure, not the mapping itself, is the most important ingredient of a random variable. That is why Amemiya's first definition of a random variable (definition 3.1.1 on p. 18) is: "A random variable is a variable that takes values according to a certain distribution." In other words, it is the outcome of an experiment whose set of possible outcomes is R.
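Problems 44 and 45 can be checked concretely with finite sets. A small sketch (my own illustration): the inverse image respects intersections and unions, while the direct image of an intersection can fail to equal the intersection of the images.

```python
# finite sample space and a non-injective "random variable" x : U -> R
U = {1, 2, 3, 4}
x = {1: 0, 2: 0, 3: 1, 4: 2}          # x(1) = x(2) = 0, so x has no ordinary inverse

def inv_image(B):
    return {w for w in U if x[w] in B}

def image(E):
    return {x[w] for w in E}

A, B = {0, 1}, {1, 2}
assert inv_image(A & B) == inv_image(A) & inv_image(B)   # Problem 44
assert inv_image(A | B) == inv_image(A) | inv_image(B)

# Problem 45: counterexample for the direct image
E, F = {1, 3}, {2, 4}
print(image(E & F), image(E) & image(F))  # set() {0} -- not equal
```

The counterexample works precisely because x identifies the two outcomes 1 and 2, which is also why the inverse image of a one-element set need not be a one-element set.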
3.4. Characterization of Random Variables

We will begin our systematic investigation of random variables with an overview over all possible probability measures on R. The simplest way to get such an overview is to look at the cumulative distribution functions. Every probability measure on R has a cumulative distribution function, but we will follow the common usage of assigning the cumulative distribution not to a probability measure but to the random variable which induces this probability measure on R.

Given a random variable x : U ∋ ω ↦ x(ω) ∈ R. Then the cumulative distribution function of x is the function F_x : R → R defined by:

(3.4.1)  F_x(a) = Pr[{ω ∈ U : x(ω) ≤ a}] = Pr[x ≤ a].

This function uniquely defines the probability measure which x induces on R.

Properties of cumulative distribution functions: a function F : R → R is a cumulative distribution function if and only if

(3.4.2)  a ≤ b ⟹ F(a) ≤ F(b);  lim_{a→−∞} F(a) = 0;  lim_{a→+∞} F(a) = 1;  lim_{ε→0, ε>0} F(a + ε) = F(a).

Why is a cumulative distribution function continuous from the right? For every nonnegative sequence ε1, ε2, ... > 0 converging to zero which also satisfies ε1 ≥ ε2 ≥ ..., it follows that {x ≤ a} = ∩_i {x ≤ a + εi}; by the axiom of σ-additivity (in its form for declining sequences of events), therefore, F(a) = lim_i F(a + εi).

PROBLEM 48. Let z be a random variable whose distribution is symmetric about zero, i.e., Pr[z ≤ −a] = Pr[z ≥ a] for all a. Call its cumulative distribution function F_z(z). Show that the cumulative distribution function of the random variable q = z² is F_q(q) = 2F_z(√q) − 1 for q ≥ 0, and 0 for q < 0.

ANSWER. If q ≥ 0, then

(3.4.9)   F_q(q) = Pr[z² ≤ q]
(3.4.10)  = Pr[−√q ≤ z ≤ √q]
(3.4.11)  = Pr[z ≤ √q] − Pr[z < −√q]
(3.4.12)  = F_z(√q) − (1 − F_z(√q))
(3.4.13)  = 2F_z(√q) − 1,

where (3.4.12) uses the symmetry: Pr[z < −√q] = Pr[z > √q] = 1 − F_z(√q). □

Instead of the cumulative distribution function F_y, one can also use the quantile function F_y⁻¹ to characterize a probability measure. As the notation suggests, the quantile function can be considered some kind of "inverse" of the cumulative distribution function. The quantile function is the function (0,1) → R defined by

(3.4.14)  F_y⁻¹(p) = inf{u : F_y(u) ≥ p}

or, plugging the definition of F_y into (3.4.14),

(3.4.15)  F_y⁻¹(p) = inf{u : Pr[y ≤ u] ≥ p}.
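The formula of Problem 48 can be spot-checked for a standard normal z, which is symmetric about zero: for q ≥ 0, Pr[z² ≤ q] should equal 2F_z(√q) − 1. A quick Monte Carlo sketch (my own illustration, using math.erf for the normal cdf):

```python
import math
import random

def norm_cdf(x):
    # standard normal cdf via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

rng = random.Random(1)
draws = [rng.gauss(0, 1) for _ in range(200_000)]

for q in [0.5, 1.0, 2.0]:
    empirical = sum(z * z <= q for z in draws) / len(draws)
    formula = 2 * norm_cdf(math.sqrt(q)) - 1
    print(q, round(empirical, 3), round(formula, 3))  # should agree to about 0.003
```

For q = 1 the formula gives 2Φ(1) − 1 ≈ 0.683, the familiar one-standard-deviation probability.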
The quantile function is only defined on the open unit interval, not at the endpoints 0 and 1, because it would often assume the values −∞ and +∞ there, and the information given by these values is redundant. The quantile function is continuous from the left, i.e., from the other side than the cumulative distribution function. If F is continuous and strictly increasing, then the quantile function is the inverse of the distribution function in the usual sense, i.e., F⁻¹(F(t)) = t for all t ∈ ℝ, and F(F⁻¹(p)) = p for all p ∈ (0,1). But even if F is flat on certain intervals, and/or F has jump points, i.e., F does not have an inverse function, the following important identity holds for every y ∈ ℝ and p ∈ (0,1):

(3.4.16)  p ≤ F_x(y)  iff  F_x⁻¹(p) ≤ y.

PROOF. ⇒: If p ≤ F_x(y), then y belongs to the set {u : F_x(u) ≥ p}, hence F_x⁻¹(p) = inf{u : F_x(u) ≥ p} ≤ y. ⇐: F_x⁻¹(p) = inf{u : F_x(u) ≥ p} ≤ y means that every z > y satisfies F_x(z) ≥ p; therefore, since F_x is continuous from the right, also F_x(y) ≥ p. This proof is from [Rei89, p. 318]. □

PROBLEM 49. You throw a pair of dice and your random variable x is the sum of the points shown.

a. Draw the cumulative distribution function of x.

ANSWER. This is Figure 1: the cdf is 0 on (−∞,2), 1/36 on [2,3), 3/36 on [3,4), 6/36 on [4,5), 10/36 on [5,6), 15/36 on [6,7), 21/36 on [7,8), 26/36 on [8,9), 30/36 on [9,10), 33/36 on [10,11), 35/36 on [11,12), and 1 on [12,+∞). □

FIGURE 1. Cumulative Distribution Function of Discrete Variable

b. Draw the quantile function of x.

ANSWER. This is Figure 2: the quantile function is 2 on (0,1/36], 3 on (1/36,3/36], 4 on (3/36,6/36], 5 on (6/36,10/36], 6 on (10/36,15/36], 7 on (15/36,21/36], 8 on (21/36,26/36], 9 on (26/36,30/36], 10 on (30/36,33/36], 11 on (33/36,35/36], and 12 on (35/36,1]. □

FIGURE 2. Quantile Function of Discrete Variable

PROBLEM 50.
1 point Give the formula of the cumulative distribution function of a random variable which is uniformly distributed between 0 and b.

ANSWER. F_x(x) = 0 for x ≤ 0, x/b for 0 ≤ x ≤ b, and 1 for x ≥ b. □

Empirical Cumulative Distribution Function:

Besides the cumulative distribution function of a random variable or of a probability measure, one can also define the empirical cumulative distribution function of a sample. Empirical cumulative distribution functions are zero for all values below the lowest observation, then 1/n for values at or above the lowest but below the second lowest, etc. They are step functions. If two observations assume the same value, then the step at that value is twice as high, etc. The empirical cumulative distribution function can be considered an estimate of the cumulative distribution function of the probability distribution underlying the sample. [Rei89, p. 12] writes it as a sum of indicator functions:

(3.4.17)  F̂(x) = (1/n) Σ_{i=1}^n 1_{[x_i ≤ x]}.

3.5. Discrete and Absolutely Continuous Probability Measures

One can define two main classes of probability measures on ℝ: One kind is concentrated in countably many points. Its probability distribution can be defined in terms of the probability mass function.

PROBLEM 51. Show that a distribution function can only have countably many jump points.

ANSWER. Proof: There are at most two jump points with jump height greater than 1/3, at most four with jump height greater than 1/5, etc., since the jump heights sum to at most 1. The set of all jump points is therefore a countable union of finite sets, hence countable. □

Among the other probability measures we are only interested in those which can be represented by a density function (absolutely continuous). A density function is a nonnegative integrable function which, integrated over the whole line, gives 1. Given such a density function, called f_x(x), the probability Pr[x ∈ (a,b)] = ∫_a^b f_x(x) dx. The density function is therefore an alternate way to characterize a probability measure. But not all probability measures have density functions.
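The empirical cumulative distribution function (3.4.17) is easy to compute directly. A minimal sketch, using a made-up five-observation sample, shows the double step at a value observed twice:

```python
from fractions import Fraction

# Empirical cdf (3.4.17) of a made-up sample; the repeated observation 1
# produces a step of height 2/5 instead of 1/5.
sample = [3, 1, 4, 1, 5]
n = len(sample)

def ecdf(x):
    """F-hat(x) = (1/n) * number of observations <= x."""
    return Fraction(sum(1 for obs in sample if obs <= x), n)

assert ecdf(0) == 0                 # below the lowest observation
assert ecdf(1) == Fraction(2, 5)    # double step at the repeated value
assert ecdf(3) == Fraction(3, 5)
assert ecdf(10) == 1                # at or above the highest observation
```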
Those who are not familiar with integrals should read up on them at this point. Start with derivatives; then: the indefinite integral of a function is a function whose derivative is the given function. Then it is an important theorem that the area under the curve is the difference of the values of the indefinite integral at the end points. This is called the definite integral. (The area is considered negative when the curve is below the x-axis.)

The intuition of a density function comes out more clearly in terms of infinitesimals. If f_x(x) is the value of the density function at the point x, then the probability that the outcome of x lies in an interval of infinitesimal length located near the point x is the length of this interval, multiplied by f_x(x). In formulas, for an infinitesimal dx follows

(3.5.1)  Pr[x ∈ [x, x + dx]] = f_x(x) |dx|.

The name "density function" is therefore appropriate: it indicates how densely the probability is spread out over the line. It is, so to say, the quotient between the probability measure induced by the variable and the length measure on the real numbers.

If the cumulative distribution function has everywhere a derivative, this derivative is the density function.

3.6. Transformation of a Scalar Density Function

Assume x is a random variable with values in the region A ⊆ ℝ, i.e., Pr[x ∉ A] = 0, and t is a one-to-one mapping A → ℝ. One-to-one (as opposed to many-to-one) means: if a, b ∈ A and t(a) = t(b), then already a = b. We also assume that t has a continuous first derivative t′ ≠ 0 everywhere in A. Define the random variable y by y = t(x). We know the density function of y, and we want to get that of x. (I.e., t expresses the old variable, that whose density function we know, in terms of the new variable, whose density function we want to know.) Since t is one-to-one, it follows for all a, b ∈ A that a = b ⟺ t(a) = t(b).
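The change-of-variables idea derived in this section can be previewed by simulation. In a hypothetical example (not from the text), let y be uniform on (0,1) and y = t(x) with t(x) = x², so that x = √y; the transformation formula then gives f_x(x) = 2x on (0,1), hence Pr[x ≤ 1/2] = 1/4. A Monte Carlo sketch (sample size and seed are arbitrary choices):

```python
import random

# Monte Carlo preview: y uniform on (0,1), old variable y = t(x) = x**2,
# so the new variable is x = sqrt(y).  The transformation formula gives
# f_x(x) = 2x on (0,1), hence Pr[x <= 0.5] = 0.5**2 = 0.25.
random.seed(1)                      # arbitrary seed, for reproducibility
n = 200_000
xs = (random.random() ** 0.5 for _ in range(n))
frac = sum(1 for v in xs if v <= 0.5) / n
assert abs(frac - 0.25) < 0.01      # agrees with the integral of 2x up to 0.5
```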
Also recall the definition of a derivative in terms of infinitesimals dx: t′(x) = (t(x + dx) − t(x))/dx. In order to compute f_x(x) we will use the following identities valid for all x ∈ A:

(3.6.1)  f_x(x) |dx| = Pr[x ∈ [x, x + dx]] = Pr[t(x) ∈ [t(x), t(x + dx)]]
(3.6.2)  = Pr[t(x) ∈ [t(x), t(x) + t′(x) dx]] = f_y(t(x)) |t′(x) dx|.

Absolute values are multiplicative, i.e., |t′(x) dx| = |t′(x)| |dx|; divide by |dx| to get

(3.6.3)  f_x(x) = f_y(t(x)) |t′(x)|.

This is the transformation formula how to get the density of x from that of y. This formula is valid for all x ∈ A; the density of x is 0 for all x ∉ A.

Heuristically one can get this transformation as follows: write |t′(x)| = |dy|/|dx|; then one gets it from f_x(x) |dx| = f_y(t(x)) |dy| by just dividing both sides by |dx|.

In other words, this transformation rule consists of 4 steps: (1) Determine A, the range of the new variable; (2) obtain the transformation t which expresses the old variable in terms of the new variable, and check that it is one-to-one on A; (3) plug expression (2) into the old density; (4) multiply this plugged-in density by the absolute value of the derivative of expression (2). This gives the density inside A; it is 0 outside A.

An alternative proof is conceptually simpler but cannot be generalized to the multivariate case: First assume t is monotonically increasing. Then F_x(a) = Pr[x ≤ a] = Pr[t(x) ≤ t(a)] = F_y(t(a)). Now differentiate and use the chain rule. Then also do the monotonically decreasing case. This is how [Ame94, theorem 3.6.1 on p. 48] does it. [Ame94, pp. 52/3] has an extension of this formula to many-to-one functions.

PROBLEM 52. 4 points [Lar82, example 3.5.4 on p. 148] Suppose y has density function f_y(y) = 1 for 0 < y < 1 and 0 otherwise. …

ANSWER. … for x > 0 and 0 otherwise. □

PROBLEM 53. 6 points [Dhr86, p. 1574] Assume the random variable z has the exponential distribution with parameter λ, i.e., its density function is f_z(z) = λ exp(−λz) for z > 0 and 0 for z ≤ 0. Define u = −log z.
Show that the density function of u is f_u(u) = exp(μ − u − exp(μ − u)) where μ = log λ. This density will be used in Problem 151.

ANSWER. (1) Since z only has values in (0, ∞), its log is well defined, and A = ℝ. (2) Express the old variable in terms of the new: −u = log z, therefore z = e^(−u); this is one-to-one everywhere. (3) Plugging in (since e^(−u) > 0 for all u, we must plug it into λ exp(−λz)) gives λ exp(−λe^(−u)). (4) The derivative of z = e^(−u) is −e^(−u); taking absolute values gives the Jacobian factor e^(−u). Plugging in and multiplying gives the density of u: f_u(u) = λ exp(−λe^(−u)) e^(−u) = λ e^(−u) exp(−λe^(−u)), and using λ exp(−u) = exp(μ − u) this simplifies to the formula above.

Alternative without the transformation rule for densities: F_u(u) = Pr[−log z ≤ u] = Pr[z ≥ e^(−u)] = ∫_{e^(−u)}^{+∞} λ e^(−λz) dz = exp(−λe^(−u)); differentiating with respect to u gives f_u(u) = λ e^(−u) exp(−λe^(−u)) as before. □

PROBLEM 54. Assume the random variable z has the exponential distribution with λ = 1, i.e., its density function is f_z(z) = exp(−z) for z > 0 and 0 for z ≤ 0. Define u = √z. Compute the density function of u.

ANSWER. (1) A = {u : u ≥ 0}, since √ always denotes the nonnegative square root; (2) express the old variable in terms of the new: z = u²; this is one-to-one on A (but not one-to-one on all of ℝ); (3) the derivative is 2u, which is nonnegative on A as well, so no absolute values are necessary; (4) multiplying gives the density of u: f_u(u) = 2u exp(−u²) if u ≥ 0 and 0 elsewhere. □

3.7. Example: Binomial Variable

Go back to our Bernoulli trial with parameters p and n, and define a random variable x which represents the number of successes. Then the probability mass function of x is

(3.7.1)  p_x(k) = Pr[x = k] = (n choose k) p^k (1 − p)^(n−k),  k = 0, 1, …, n.

Proof is simple: every subset of k elements represents one possibility of spreading out the k successes.

We will call any observed random variable a statistic. And we call a statistic t sufficient for a parameter θ if and only if for any event A and for any possible value t̄ of t, the conditional probability Pr[A | t = t̄] does not involve θ. …

… ≥ E[h(x)] = h(E[x]). (3) The existence of such an h follows from convexity. Since g is convex, for every point a ∈ B there is a number β so that g(x) ≥ g(a) + β(x − a).
This β is the slope of g if g is differentiable, and otherwise it is some number between the left and the right derivative (which both always exist for a convex function). We need this for a = E[x]. This existence is the deepest part of this proof. We will not prove it here; for a proof see [Rao73, pp. 57, 58]. One can view it as a special case of the separating hyperplane theorem. □

PROBLEM 62. Use Jensen's inequality to show that (E[x])² ≤ E[x²]. You are allowed to use, without proof, the fact that a function is convex on B if the second derivative exists on B and is nonnegative.

PROBLEM 63. Show that the expected value of the empirical distribution of a sample is the sample mean.

Other measures of location: The median is that number m for which there is as much probability mass to the left of m as to the right, i.e.,

(3.10.11)  Pr[x ≤ m] ≥ 1/2 and Pr[x ≥ m] ≥ 1/2.

It is much more robust with respect to outliers than the mean. If there is more than one m satisfying (3.10.11), then some authors choose the smallest (in which case the median is a special case of the quantile function m = F⁻¹(1/2)), and others the average between the biggest and smallest. …

3.10.3. Mean-Variance Calculations. If one knows mean and variance of a random variable, one does not by any means know the whole distribution, but one has already some information. For instance, one can compute E[x²] from it, too.

PROBLEM 66. 4 points Consumer M has an expected utility function for money income u(x) = 12x − x². The meaning of an expected utility function is very simple: if he owns an asset that generates some random income y, then the utility he derives from this asset is the expected value E[u(y)]. He is contemplating acquiring two assets. One asset yields an income of 4 dollars with certainty. The other yields an expected income of 5 dollars with standard deviation 2 dollars. Does he prefer the certain or the uncertain asset?

ANSWER.
E[u(y)] = 12 E[y] − E[y²] = 12 E[y] − var[y] − (E[y])². Therefore the certain asset gives him utility 48 − 0 − 16 = 32, and the uncertain one 60 − 4 − 25 = 31. He prefers the certain asset. □

3.10.4. Moment Generating Function and Characteristic Function. Here we will use the exponential function e^x, also often written exp(x), which has the two properties: e^x = lim_{n→∞} (1 + x/n)^n (Euler's limit), and e^x = 1 + x + x²/2! + x³/3! + ⋯. Many (but not all) random variables x have a moment generating function m_x(t) for certain values of t. If they do for t in an open interval around zero, then their distribution is uniquely determined by it. The definition is

(3.10.18)  m_x(t) = E[e^(tx)].

It is a powerful computational device. The moment generating function is in many cases a more convenient characterization of the random variable than the density function. It has the following uses:

1. One obtains the moments of x by the simple formula

(3.10.19)  E[x^k] = (d^k/dt^k) m_x(t) |_{t=0}.

Proof:

(3.10.20)  e^(tx) = 1 + tx + t²x²/2! + t³x³/3! + ⋯
(3.10.21)  m_x(t) = E[e^(tx)] = 1 + t E[x] + (t²/2!) E[x²] + (t³/3!) E[x³] + ⋯
(3.10.22)  (d/dt) m_x(t) = E[x] + t E[x²] + (t²/2!) E[x³] + ⋯
(3.10.23)  (d²/dt²) m_x(t) = E[x²] + t E[x³] + ⋯, etc.

2. The moment generating function is also good for determining the probability distribution of linear combinations of independent random variables.

a. It is easy to get the m.g.f. of λx from the one of x:

(3.10.24)  m_{λx}(t) = m_x(λt)

because both sides are E[e^(λtx)].

b. If x, y independent, then

(3.10.25)  m_{x+y}(t) = m_x(t) m_y(t).

The proof is simple:

(3.10.26)  E[e^(t(x+y))] = E[e^(tx) e^(ty)] = E[e^(tx)] E[e^(ty)]

due to independence.

The characteristic function is defined as ψ_x(t) = E[e^(itx)], where i = √−1. It has the disadvantage that it involves complex numbers, but it has the advantage that it always exists, since exp(ix) = cos x + i sin x. Since cos and sin are both bounded, they always have an expected value. And, as its name says, the characteristic function characterizes the probability distribution.
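Both uses of the moment generating function can be illustrated numerically with a fair die (an illustrative example, not from the text): differentiating m_x(t) = E[e^(tx)] at t = 0 by finite differences recovers E[x] = 3.5 and E[x²] = 91/6, and the m.g.f. of the sum of two independent dice factors as in (3.10.25). A sketch:

```python
import math

# Moment generating function of a fair die, m(t) = E[exp(t*x)].
def m(t):
    return sum(math.exp(t * k) for k in range(1, 7)) / 6

h = 1e-5
first = (m(h) - m(-h)) / (2 * h)              # central difference ~ m'(0)
second = (m(h) - 2 * m(0) + m(-h)) / h**2     # ~ m''(0)
assert abs(first - 3.5) < 1e-6                # E[x] = 3.5
assert abs(second - 91 / 6) < 1e-3            # E[x^2] = 91/6

# (3.10.25): the mgf of the sum of two independent dice is m(t)**2.
def m_sum(t):
    return sum(math.exp(t * (a + b))
               for a in range(1, 7) for b in range(1, 7)) / 36

assert abs(m_sum(0.3) - m(0.3) ** 2) < 1e-9
```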
Analytically, many of its properties are similar to those of the moment generating function.

3.11. Entropy

3.11.1. Definition of Information. Entropy is the average information gained by the performance of the experiment. The actual information yielded by an event A with probability Pr[A] = p ≠ 0 is defined as follows:

(3.11.1)  I[A] = log₂ (1 / Pr[A]).

This is simply a transformation of the probability, and it has the dual interpretation of either how unexpected the event was, or the information yielded by the occurrence of event A. It is characterized by the following properties [AD75, pp. 3–5]:

• I[A] only depends on the probability of A; in other words, the information content of a message is independent of how the information is coded.
• I[A] ≥ 0 (nonnegativity), i.e., after knowing whether A occurred we are no more ignorant than before.
• If A and B are independent then I[A ∩ B] = I[A] + I[B] (additivity for independent events). This is the most important property.
• Finally the (inessential) normalization that if Pr[A] = 1/2 then I[A] = 1, i.e., a yes-or-no decision with equal probability (coin flip) is one unit of information.

Note that the information yielded by occurrence of the certain event is 0, and that yielded by occurrence of the impossible event is ∞.

But the important information-theoretic results refer to average, not actual, information; therefore let us now define entropy:

3.11.2. Definition of Entropy. The entropy of a probability field (experiment) is a measure of the uncertainty prevailing before the experiment is performed, or of the average information yielded by the performance of this experiment. If the set U of possible outcomes of the experiment has only a finite number of different elements, say their number is n, and the probabilities of these outcomes are p₁, …, p_n, then the Shannon entropy H[F] of this experiment is defined as

(3.11.2)  H[F]/bits = Σ_{k=1}^n p_k log₂ (1/p_k).
This formula uses log₂, the logarithm with base 2, which can easily be computed from natural logarithms: log₂ x = log x / log 2. The choice of base 2 is convenient because in this way the most informative Bernoulli experiment, that with success probability p = 1/2 (coin flip), has entropy 1. This is why one says: "the entropy is measured in bits." If one goes over to logarithms of a different base, this simply means that one measures entropy in different units. In order to indicate this dependence on the measuring unit, equation (3.11.2) was written as the definition of H[F]/bits instead of H[F] itself; i.e., this is the number one gets if one measures the entropy in bits. If one uses natural logarithms, then the entropy is measured in "nats."

Entropy can be characterized axiomatically by the following axioms [Khi57]:

• The uncertainty associated with a finite complete scheme takes its largest value if all events are equally likely, i.e., H(p₁, …, p_n) ≤ H(1/n, …, 1/n).
• The addition of an impossible event to a scheme does not change the amount of uncertainty.
• Composition Law: If the possible outcomes are arbitrarily combined into m groups W₁ = X₁₁ ∪ ⋯ ∪ X₁ₖ₁, W₂ = X₂₁ ∪ ⋯ ∪ X₂ₖ₂, …, W_m = X_{m1} ∪ ⋯ ∪ X_{mk_m}, with corresponding probabilities w₁ = p₁₁ + ⋯ + p₁ₖ₁, w₂ = p₂₁ + ⋯ + p₂ₖ₂, …, w_m = p_{m1} + ⋯ + p_{mk_m}, then

H(p₁, …, p_n) = H(w₁, …, w_m) +
+ w₁ H(p₁₁/w₁, …, p₁ₖ₁/w₁) +
+ w₂ H(p₂₁/w₂, …, p₂ₖ₂/w₂) + ⋯ +
+ w_m H(p_{m1}/w_m, …, p_{mk_m}/w_m).

Since p_{ij}/w_i = Pr[X_{ij} | W_i], the composition law means: if you first learn half the outcome of the experiment, and then the other half, you will on average get as much information as if you had been told the total outcome all at once.

The entropy of a random variable x is simply the entropy of the probability field induced by x on ℝ. It does not depend on the values x takes but only on the probabilities.
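The composition law can be verified numerically on a small example (a four-outcome distribution chosen for illustration, not from the text): grouping the outcomes into two groups and adding the weighted within-group entropies reproduces the total entropy.

```python
import math

def H(ps):
    """Shannon entropy in bits, with the convention 0*log2(1/0) = 0."""
    return sum(p * math.log2(1 / p) for p in ps if p > 0)

# Illustrative distribution: p = (1/2, 1/4, 1/8, 1/8), entropy 7/4 bits.
p = [1/2, 1/4, 1/8, 1/8]
groups = [p[:2], p[2:]]                  # W1 = first two outcomes, W2 = rest
w = [sum(g) for g in groups]
rhs = H(w) + sum(wi * H([pj / wi for pj in g]) for wi, g in zip(w, groups))
assert abs(H(p) - 1.75) < 1e-12
assert abs(H(p) - rhs) < 1e-12           # composition law holds
```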
For discretely distributed random variables it can be obtained by the following "eerily self-referential" prescription: plug the random variable into its own probability mass function and compute the expected value of the negative logarithm of this, i.e.,

(3.11.3)  H[x]/bits = E[−log₂ p_x(x)].

One interpretation of the entropy is: it is the average number of yes-or-no questions necessary to describe the outcome of the experiment. For instance, consider an experiment which has 32 different outcomes occurring with equal probabilities. The entropy is

(3.11.4)  H/bits = Σ_{k=1}^{32} (1/32) log₂ 32 = log₂ 32 = 5,  i.e., H = 5 bits,

which agrees with the number of bits necessary to describe the outcome.

PROBLEM 67. Design a questioning scheme to find out the value of an integer between 1 and 32, and compute the expected number of questions in your scheme if all numbers are equally likely.

ANSWER. In binary digits one needs a number of length 5 to describe a number between 0 and 31, therefore the 5 questions might be: write down the binary expansion of your number minus 1. Is the first binary digit in this expansion a zero? Then: is the second binary digit in this expansion a zero? Etc. Formulated without the use of binary digits these same questions would be: is the number between 1 and 16? Then: is it between 1 and 8 or 17 and 24? Then: is it between 1 and 4 or 9 and 12 or 17 and 20 or 25 and 28? Etc., the last question being whether it is odd. Of course, you can formulate these questions conditionally: First: between 1 and 16? If no, then second: between 17 and 24? If yes, then second: between 1 and 8? Etc. Each of these questions gives you exactly the entropy of 1 bit. □

PROBLEM 68. [CT91, example 1.1.2 on p. 5] Assume there is a horse race with eight horses taking part. The probabilities for winning for the eight horses are 1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64.

a. 1 point Show that the entropy of the horse race is 2 bits.

ANSWER.
(1/2) log₂ 2 + (1/4) log₂ 4 + (1/8) log₂ 8 + (1/16) log₂ 16 + (4/64) log₂ 64 =
= 1/2 + 1/2 + 3/8 + 1/4 + 3/8 = (4 + 4 + 3 + 2 + 3)/8 = 2. □

b. 1 point Suppose you want to send a binary message to another person indicating which horse won the race. One alternative is to assign the bit strings 000, 001, 010, 011, 100, 101, 110, 111 to the eight horses. This description requires 3 bits for any of the horses. But since the win probabilities are not uniform, it makes sense to use shorter descriptions for the horses more likely to win, so that we achieve a lower expected value of the description length. For instance, we could use the following set of bit strings for the eight horses: 0, 10, 110, 1110, 111100, 111101, 111110, 111111. Show that the expected length of the message you send to your friend is 2 bits, as opposed to 3 bits for the uniform code. Note that in this case the expected value of the description length is equal to the entropy.

ANSWER. The math is the same as in the first part of the question:

(1/2)·1 + (1/4)·2 + (1/8)·3 + (1/16)·4 + (4/64)·6 = 1/2 + 1/2 + 3/8 + 1/4 + 3/8 = (4 + 4 + 3 + 2 + 3)/8 = 2. □

PROBLEM 69. [CT91, example 2.1.2 on pp. 14/15] The experiment has four possible outcomes; outcome x=a occurs with probability 1/2, x=b with probability 1/4, x=c with probability 1/8, and x=d with probability 1/8.

a. 2 points The entropy of this experiment (in bits) is one of the following three numbers: 11/8, 7/4, 2. Which is it?

b. 2 points Suppose we wish to determine the outcome of this experiment with the minimum number of questions. An efficient first question is "Is x=a?" This splits the probability in half. If the answer to the first question is no, then the second question can be "Is x=b?" The third question, if it is necessary, can then be: "Is x=c?" Compute the expected number of binary questions required.

c. 2 points Show that the entropy gained by each question is 1 bit.

d. 3 points Assume we know about the first outcome that x ≠ a.
What is the entropy of the remaining experiment (i.e., under the conditional probability)?

e. 5 points Show in this example that the composition law for entropy holds.

PROBLEM 70. 2 points In terms of natural logarithms equation (3.11.4) defining entropy reads

(3.11.5)  H/bits = (1/ln 2) Σ_{k=1}^n p_k ln(1/p_k).

Compute the entropy of (i.e., the average information gained by) a roll of an unbiased die.

ANSWER. Same as the actual information gained, since each outcome is equally likely:

(3.11.6)  H/bits = (1/ln 2) ((1/6) ln 6 + ⋯ + (1/6) ln 6) = ln 6 / ln 2 = 2.585. □

a. 3 points How many questions does one need on average to determine the outcome of the roll of an unbiased die? In other words, pick a certain questioning scheme (try to make it efficient) and compute the average number of questions if this scheme is followed. Note that this average cannot be smaller than the entropy H/bits, and if one chooses the questions optimally, it is smaller than H/bits + 1.

ANSWER. First question: is it bigger than 3? Second question: is it even? Third question (if necessary): is it a multiple of 3? In this scheme, the numbers of questions for the six faces of the die are 3, 2, 3, 3, 2, 3, therefore the average is (4/6)·3 + (2/6)·2 = 2⅔. Also optimal: (1) is it bigger than 2? (2) is it odd? (3) is it bigger than 4? This gives 2, 2, 3, 3, 3, 3. Also optimal: 1st question: is it 1 or 2? If the answer is no, then second question: is it 3 or 4?; otherwise go directly to the third question: is it odd or even? The steamroller approach: Is it 1? Is it 2? Etc. gives 1, 2, 3, 4, 5, 5 with expected number 3⅓. Even this is here < 1 + H/bits. □

PROBLEM 71.

a. 1 point Compute the entropy of a roll of two unbiased dice if they are distinguishable.

ANSWER. Just twice the entropy from Problem 70:

(3.11.7)  H/bits = (1/ln 2) Σ_{k=1}^{36} (1/36) ln 36 = 2 ln 6 / ln 2 = 5.170. □

b. Would you expect the entropy to be greater or less in the more usual case that the dice are indistinguishable?
Check your answer by computing it.

ANSWER. If the dice are indistinguishable, then one gets less information, therefore the experiment has less entropy. One has six like pairs with probability 1/36 each and 6·5/2 = 15 unlike pairs with probability 2/36 = 1/18 each. Therefore the average information gained is

(3.11.8)  H/bits = (1/ln 2) (6 · (1/36) ln 36 + 15 · (1/18) ln 18) = (1/ln 2) ((1/6) ln 36 + (5/6) ln 18) = 4.337. □

c. 3 points Note that the difference between these two entropies is 5/6 = 0.833. How can this be explained?

ANSWER. This is the composition law in action. Assume you roll two dice which you first consider indistinguishable, and afterwards someone tells you which is which. How much information do you gain? Well, if the numbers are the same, then telling you which die is which does not give you any information, since the outcomes of the experiment are defined as: which number has the first die, which number has the second die, regardless of where on the table the dice land. But if the numbers are different, then telling you which is which allows you to discriminate between two outcomes both of which have conditional probability 1/2 given the outcome you already know; in this case the information you gain is therefore 1 bit. Since the probability of getting two different numbers is 5/6, the expected value of the information gained explains the difference in entropy. □

All these definitions use the convention 0 log(1/0) = 0, which can be justified by the following continuity argument: Define the function, graphed in Figure 3:

(3.11.9)  η(w) = w log(1/w) if w > 0, and η(w) = 0 if w = 0.

η is continuous for all w ≥ 0, even at the boundary point w = 0. Differentiation gives η′(w) = −(1 + log w), and η″(w) = −w⁻¹. The function starts out at the origin with a vertical tangent, and since the second derivative is negative, it is strictly concave for all w > 0. The definition of strict concavity is η(w) < η(v) + (w − v)η′(v) for w ≠ v, i.e., the function lies below all its tangents.
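The strict concavity claim η(w) < η(v) + (w − v)η′(v) can be spot-checked numerically (using natural logarithms; the grid of test points is an arbitrary choice):

```python
import math

def eta(w):
    """eta(w) = w*log(1/w) for w > 0, with eta(0) = 0 (natural logarithm)."""
    return 0.0 if w == 0 else w * math.log(1 / w)

def eta_prime(v):
    return -(1 + math.log(v))

# Strict concavity: eta lies strictly below each of its tangents.
for v in (0.1, 0.5, 1.0, 2.0):
    for w in (0.01, 0.3, 0.9, 1.5, 3.0):
        if w != v:
            assert eta(w) < eta(v) + (w - v) * eta_prime(v)
assert eta(0) == 0.0
```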
Substituting η′(v) = −(1 + log v) …
