Appendix A Probability and Statistics
Shaojian Chen
September 16, 2021
1 Basics of Probability
1.1 Random Variables
A listing of the values x taken by a random variable X and their associated probabilities is
a probability distribution, f(x). For a discrete random variable,

f(x) = \mathrm{Prob}(X = x)

\sum_x f(x) = 1
For the continuous case, the probability associated with any particular point is zero, and
we can only assign positive probabilities to intervals in the range (or support) of x. The
probability density function (pdf), f(x), is defined so that f(x) ≥ 0 and

\mathrm{Prob}(a \le X \le b) = \int_a^b f(x)\,dx \ge 0
This result is the area under f(x) in the range from a to b. For a continuous variable,

\int_{-\infty}^{+\infty} f(x)\,dx = 1
If the range of x is not infinite, then it is understood that f(x) = 0 anywhere outside the
appropriate range. Because the probability associated with any individual point is 0,

\mathrm{Prob}(a \le X \le b) = \mathrm{Prob}(a \le X < b) = \mathrm{Prob}(a < X \le b) = \mathrm{Prob}(a < X < b)
F(x) is the cumulative distribution function (cdf), or distribution function. For a discrete
random variable,

F(x) = \mathrm{Prob}(X \le x)

f(x_i) = F(x_i) - F(x_{i-1})
In both the continuous and discrete cases, F(x) must satisfy the following properties:
(1) 0 ≤ F(x) ≤ 1;
(2) If x ≥ y, then F(x) ≥ F(y);
(3) F(+∞) = 1;
(4) F(−∞) = 0.
From the definition of the cdf,

\mathrm{Prob}(a < X \le b) = F(b) - F(a)
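As a small numerical illustration (not from the original text; the probabilities are made up), the cdf of a discrete distribution can be built by cumulative summation, and interval probabilities follow by differencing:

import numpy as np

# Hypothetical discrete distribution on the values 1, 2, 3, 4.
x = np.array([1, 2, 3, 4])
f = np.array([0.2, 0.3, 0.4, 0.1])   # probabilities sum to 1
F = np.cumsum(f)                     # cdf values F(x_i)

print(F)             # [0.2, 0.5, 0.9, 1.0]
print(np.diff(F))    # recovers f(x_i) = F(x_i) - F(x_{i-1}) for i >= 2
print(F[2] - F[0])   # Prob(1 < X <= 3) = F(3) - F(1) = 0.7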
Definition 2.1 (Mean of a Random Variable). The mean, or expected value, of a random
variable is

E[X] = \sum_x x \,\mathrm{Prob}(X = x)   if X is discrete
E[X] = \int_x x f(x)\,dx                 if X is continuous

More generally, for a function g of X,

E[g(X)] = \sum_x g(x) \,\mathrm{Prob}(X = x)   if X is discrete
E[g(X)] = \int_x g(x) f(x)\,dx                 if X is continuous
\mathrm{Var}[X] = E[X^2] - (E[X])^2

\mathrm{Var}[a + bX] = b^2 \,\mathrm{Var}[X]
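To make these definitions concrete, here is a short sketch (the distribution and the constants a and b are invented for the example) that computes E[X] and Var[X] for a discrete distribution and checks the rule Var[a + bX] = b^2 Var[X]:

import numpy as np

# Hypothetical discrete distribution: values x with probabilities f(x).
x = np.array([0.0, 1.0, 2.0, 3.0])
f = np.array([0.1, 0.4, 0.3, 0.2])     # sums to 1

mean = np.sum(x * f)                   # E[X] = sum_x x * Prob(X = x)
var = np.sum(x**2 * f) - mean**2       # Var[X] = E[X^2] - (E[X])^2

a, b = 2.0, 3.0
mean_t = np.sum((a + b * x) * f)       # E[a + bX]
var_t = np.sum((a + b * x)**2 * f) - mean_t**2

print(mean, var)
print(var_t, b**2 * var)               # the two numbers agree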
The general form of the normal distribution with mean μ and standard deviation σ is

f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}[(x-\mu)/\sigma]^2}

This result is usually denoted X ~ N[μ, σ²]. Among the most useful properties of the
normal distribution is its preservation under linear transformation:
If X ~ N[μ, σ²], then (a + bX) ~ N[a + bμ, b²σ²].
The specific notation φ(z) is often used for this density and Φ(z) for its cdf. It follows
from the definitions above that if X ~ N[μ, σ²], then

f(x) = \frac{1}{\sigma}\,\phi\!\left(\frac{x-\mu}{\sigma}\right)

and

\mathrm{Prob}(a \le X \le b) = \Phi\!\left(\frac{b-\mu}{\sigma}\right) - \Phi\!\left(\frac{a-\mu}{\sigma}\right)

which can always be read from a table of the standard normal distribution. In addition,
because the distribution is symmetric, Φ(−z) = 1 − Φ(z). Hence, it is not necessary to
tabulate both the negative and positive halves of the distribution.
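As a quick check of the standardization formula (the values of μ, σ, a, and b below are arbitrary), SciPy's standard normal cdf can stand in for the table:

import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0
a, b = 0.0, 3.0

# Prob(a <= X <= b) = Phi((b - mu)/sigma) - Phi((a - mu)/sigma)
p_standardized = norm.cdf((b - mu) / sigma) - norm.cdf((a - mu) / sigma)

# The same probability from the N(mu, sigma^2) cdf directly.
p_direct = norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma)
print(p_standardized, p_direct)       # the two values match

# Symmetry: Phi(-z) = 1 - Phi(z)
z = 1.3
print(norm.cdf(-z), 1 - norm.cdf(z))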
⚫ If Z_i, i = 1, ..., n, are independent N[0, 1] variables, then

\sum_{i=1}^{n} Z_i^2 \sim \chi^2[n]

The mean and variance of a chi-squared variable with n degrees of freedom are n
and 2n, respectively.
⚫ If Z_i, i = 1, ..., n, are independent N[0, σ²] variables, then

\sum_{i=1}^{n} (Z_i/\sigma)^2 \sim \chi^2[n]

⚫ If X_1 and X_2 are independent chi-squared variables with n_1 and n_2 degrees of freedom, respectively, then

X_1 + X_2 \sim \chi^2[n_1 + n_2]
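A brief simulation sketch of the first fact above (the degrees of freedom and number of replications are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 200_000

z = rng.standard_normal((reps, n))
chi2_draws = np.sum(z**2, axis=1)   # each row: sum of n squared N(0,1) draws

print(chi2_draws.mean())   # close to n
print(chi2_draws.var())    # close to 2n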
The joint density function for two random variables X and Y, denoted f(x, y), is
defined so that
\mathrm{Prob}(a \le X \le b,\; c \le Y \le d) = \sum_{a \le x \le b} \sum_{c \le y \le d} f(x, y)   if X and Y are discrete

\mathrm{Prob}(a \le X \le b,\; c \le Y \le d) = \int_a^b \int_c^d f(x, y)\,dy\,dx   if X and Y are continuous

The joint density must also integrate to one over the support:

\int_x \int_y f(x, y)\,dy\,dx = 1   if X and Y are continuous
If (and only if) X and Y are independent, then the cdf factors as well as the pdf:
F(x, y) = F_X(x) F_Y(y)

or

\mathrm{Prob}(X \le x, Y \le y) = \mathrm{Prob}(X \le x)\,\mathrm{Prob}(Y \le y)
The means, variances, and higher moments of the variables in a joint distribution are
defined with respect to the marginal distributions. For the mean of X in a discrete
distribution,
E[X] = \sum_x x f_X(x) = \sum_x x \Bigl[\sum_y f(x, y)\Bigr] = \sum_x \sum_y x f(x, y)
The means of the variables in a continuous distribution are defined likewise, using
integration instead of summation:
E[X] = \int_x x f_X(x)\,dx = \int_x \int_y x f(x, y)\,dy\,dx
The variance is defined analogously; in the discrete case,

\mathrm{Var}[X] = \sum_x (x - E[X])^2 f_X(x) = \sum_x \sum_y (x - E[X])^2 f(x, y)
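For illustration (the joint probabilities below are invented for the example), the marginal mean and variance of X can be computed directly from a joint pmf:

import numpy as np

# Hypothetical joint pmf f(x, y): rows index x in {0, 1, 2}, columns index y in {0, 1}.
f_xy = np.array([[0.10, 0.20],
                 [0.30, 0.15],
                 [0.15, 0.10]])
x = np.array([0.0, 1.0, 2.0])

f_x = f_xy.sum(axis=1)                   # marginal pmf: f_X(x) = sum_y f(x, y)
mean_x = np.sum(x * f_x)                 # E[X]
var_x = np.sum((x - mean_x)**2 * f_x)    # Var[X]

print(f_x, mean_x, var_x)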
If X and Y are independent, then f(x, y) = f_X(x) f_Y(y) and the covariance is zero:

\sigma_{XY} = \sum_x \sum_y f_X(x) f_Y(y) (x - \mu_X)(y - \mu_Y)
            = \sum_x (x - \mu_X) f_X(x) \sum_y (y - \mu_Y) f_Y(y)
            = E[X - \mu_X]\, E[Y - \mu_Y]
            = 0
The sign of the covariance will indicate the direction of covariation of X and Y . Its
magnitude depends on the scales of measurement, however. In view of this fact, a
preferable measure is the correlation coefficient:
\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}
and
\mathrm{Cov}[aX + bY, cX + dY] = ac\,\mathrm{Var}[X] + bd\,\mathrm{Var}[Y] + (ad + bc)\,\mathrm{Cov}[X, Y]

If X and Y are independent, then

E[g_1(X) g_2(Y)] = E[g_1(X)]\, E[g_2(Y)]
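The covariance rule and the correlation coefficient are easy to check numerically. The sketch below is only an illustration: the correlated pair (X, Y) and the constants a, b, c, d are invented for the example.

import numpy as np

rng = np.random.default_rng(1)
n = 500_000

X = rng.standard_normal(n)
Y = 0.5 * X + rng.standard_normal(n)        # construct a correlated pair

a, b, c, d = 1.0, 2.0, 3.0, -1.0

cov_xy = np.cov(X, Y)[0, 1]
lhs = np.cov(a * X + b * Y, c * X + d * Y)[0, 1]
rhs = a * c * np.var(X, ddof=1) + b * d * np.var(Y, ddof=1) + (a * d + b * c) * cov_xy

print(lhs, rhs)                             # agree up to floating-point error
print(np.corrcoef(X, Y)[0, 1])              # rho_XY = sigma_XY / (sigma_X * sigma_Y)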
In a bivariate distribution, the conditional density of y given x is

f(y \mid x) = \frac{f(x, y)}{f_X(x)}

and, likewise,

f(x \mid y) = \frac{f(x, y)}{f_Y(y)}
The conditional mean, or regression, of Y on x is the mean of the conditional distribution:

E[Y \mid X = x] \equiv E[Y \mid x] = \int_y y f(y \mid x)\,dy   if Y is continuous
E[Y \mid X = x] \equiv E[Y \mid x] = \sum_y y f(y \mid x)       if Y is discrete
The conditional variance is called the scedastic function and, like the regression, is
generally a function of x . Unlike the conditional mean function, however, it is common
for the conditional variance not to vary with x .
Some useful results for the moments of a conditional distribution are given in the following
theorems.
Theorem 5.1 (Law of Iterated Expectations). E[Y] = E_X[E[Y | X]].
The notation E_X[·] indicates the expectation over the values of x. Note that E[Y | X] is
a function of X.
Theorem 5.2 (Decomposition of Variance).
In a joint distribution,

\mathrm{Var}[Y] = \mathrm{Var}_X[E[Y \mid X]] + E_X[\mathrm{Var}[Y \mid X]]

The notation Var_X[·] indicates the variance over the distribution of X. This equation
states that in a bivariate distribution, the variance of Y decomposes into the variance of
the conditional mean function plus the expected variance around the conditional mean.
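Both theorems are easy to verify by simulation. In the sketch below the joint distribution (X ~ N(0,1) and Y | X = x ~ N(2x, 1)) is an arbitrary choice made only for illustration, so that E[Y | X] = 2X and Var[Y | X] = 1.

import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

X = rng.standard_normal(n)
Y = 2 * X + rng.standard_normal(n)

cond_mean = 2 * X      # E[Y | X]
cond_var = 1.0         # Var[Y | X]

# Law of iterated expectations: E[Y] = E_X[ E[Y | X] ]
print(Y.mean(), cond_mean.mean())

# Decomposition of variance: Var[Y] = Var_X[ E[Y | X] ] + E_X[ Var[Y | X] ]
print(Y.var(), cond_mean.var() + cond_var)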
2 Basics of Statistics
2.1 Finite Sample Properties of Estimators
Given a random sample {Y_1, Y_2, ..., Y_n} drawn from a population distribution that depends
on an unknown parameter θ, an estimator of θ is a rule that assigns each possible
outcome of the sample a value of θ. The rule is specified before any sampling is carried
out; in particular, the rule is the same regardless of the data actually obtained.
As an example of an estimator, let {Y_1, Y_2, ..., Y_n} be a random sample from a population
with mean μ. A natural estimator of μ is the average of the random sample:

\bar{Y} = n^{-1} \sum_{i=1}^{n} Y_i

\bar{Y} is called the sample average, viewed as an estimator. Given any outcome of the random
variables {Y_1, Y_2, ..., Y_n}, we use the same rule to estimate μ: we simply average them.
For actual data outcomes {y_1, y_2, ..., y_n}, the estimate is just the average in the sample:
\bar{y} = (y_1 + y_2 + ... + y_n)/n.
More generally, an estimator W of a parameter θ can be expressed as an abstract
mathematical formula:

W = h(Y_1, Y_2, ..., Y_n)
2.1.2 Unbiasedness
Letting Y_1, ..., Y_n denote the random sample from the population with E(Y) = μ and
Var(Y) = σ², define the estimator as

S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2

which is usually called the sample variance. It can be shown that S² is unbiased for σ²:
E(S²) = σ². The division by n − 1, rather than n, accounts for the fact that the mean μ is
estimated rather than known. If μ were known, an unbiased estimator of σ² would be
n^{-1} \sum_{i=1}^{n} (Y_i - \mu)^2, but μ is rarely known in practice.
We now obtain the variance of the sample average for estimating the mean μ from a
population:

\mathrm{Var}(\bar{Y}) = \sigma^2 / n
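A short simulation sketch of these two results (the population parameters, sample size, and replication count below are arbitrary):

import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 5.0, 2.0, 10, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))

ybar = samples.mean(axis=1)            # sample average in each replication
s2 = samples.var(axis=1, ddof=1)       # sample variance with n - 1 in the denominator

print(s2.mean(), sigma**2)             # E(S^2) is close to sigma^2 (unbiasedness)
print(ybar.var(), sigma**2 / n)        # Var(Ybar) is close to sigma^2 / n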
2.1.4 Efficiency
2.2.1 Consistency
Consistency. Let W_n be an estimator of θ based on a sample Y_1, Y_2, ..., Y_n of size n.
Then W_n is a consistent estimator of θ if, for every ε > 0,

P(|W_n - \theta| > \varepsilon) \to 0 \text{ as } n \to \infty
For independent, identically distributed random variables Y_1, Y_2, ..., Y_n with mean μ, the law of
large numbers states that

\mathrm{plim}(\bar{Y}_n) = \mu

The law of large numbers means that, if we are interested in estimating the population
average μ, we can get arbitrarily close to μ by choosing a sufficiently large sample.
The central limit theorem states that if Y_1, Y_2, ..., Y_n is a random sample with mean μ and
variance σ², then the standardized sample average

Z_n = \frac{\bar{Y}_n - \mu}{\sigma / \sqrt{n}}

has an asymptotic standard normal distribution:

P(Z_n \le z) \to \Phi(z) \text{ as } n \to \infty
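A simulation sketch of both limit results; the Exponential(1) population used here (for which μ = σ = 1) and the sample sizes are arbitrary illustrative choices.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
mu = 1.0                                    # mean (and std dev) of an Exponential(1) population

# Law of large numbers: the sample average approaches mu as n grows.
for n in (10, 1_000, 100_000):
    print(n, rng.exponential(mu, size=n).mean())

# Central limit theorem: the standardized sample average is approximately N(0, 1).
n, reps = 50, 100_000
ybar = rng.exponential(mu, size=(reps, n)).mean(axis=1)
z_n = (ybar - mu) / (mu / np.sqrt(n))       # sigma = mu = 1 for this population
print(np.mean(z_n <= 1.0), norm.cdf(1.0))   # empirical cdf at 1.0 vs. Phi(1.0)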
2.3 General Approaches to Parameter Estimation
N/A
In hypothesis testing, we can make two kinds of mistakes. First, we can reject the null
hypothesis when it is in fact true. This is called a Type I error. The second kind of error is
failing to reject H_0 when it is actually false. This is called a Type II error.
After we have made the decision of whether or not to reject the null hypothesis, we have
either decided correctly or we have committed an error. We will never know with certainty
whether an error was committed. However, we can compute the probability of making
either a Type I or a Type II error. Hypothesis testing rules are constructed to make the
probability of committing a Type I error fairly small. Generally, we define the significance
level (or simply the level) of a test as the probability of a Type I error; it is typically denoted
by α. Symbolically, we have

\alpha = P(\text{Reject } H_0 \mid H_0)
The right-hand side is read as: "The probability of rejecting H_0 given that H_0 is true."
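The significance level can be checked by simulation. The sketch below is illustrative only: it uses a two-sided z-test of H_0: μ = 0 with known σ, and the sample size, level, and replication count are arbitrary choices.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
mu0, sigma, n, alpha, reps = 0.0, 1.0, 30, 0.05, 100_000

# Simulate data under H_0 (true mean equals mu0) and apply the z-test in each replication.
samples = rng.normal(mu0, sigma, size=(reps, n))
z = (samples.mean(axis=1) - mu0) / (sigma / np.sqrt(n))
reject = np.abs(z) > norm.ppf(1 - alpha / 2)

print(reject.mean(), alpha)     # rejection frequency under H_0 is close to alpha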
N/A