Appendix A Probability and Statistics

This document provides an overview of probability and statistics concepts covered in Lecture 1, including: 1) Definitions of probability distributions, probability density functions, and cumulative distribution functions for discrete and continuous random variables. 2) Concepts of expectations, means, variances, and standard deviations of random variables. 3) Examples of specific probability distributions like the normal, chi-squared, t and F distributions. 4) Introduction to joint distributions and marginal distributions of two random variables.


Lecture 1 Basics of Probability and Statistics

Shaojian Chen
September 16, 2021

1 Basics of Probability
1.1 Random Variables

1.1.1 Probability Distributions

A listing of the values $x$ taken by a random variable $X$ and their associated
probabilities is a probability distribution, $f(x)$. For a discrete random variable,

$f(x) = \mathrm{Prob}(X = x)$

The axioms of probability require that

(1) $0 \le \mathrm{Prob}(X = x) \le 1$;

(2) $\sum_x f(x) = 1$.
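As a concrete illustration, the two axioms can be checked directly when a discrete distribution is stored as a table. This is a minimal sketch using a hypothetical fair six-sided die, not an example from the text:

```python
# Hypothetical example: f(x) for a fair six-sided die.
f = {x: 1/6 for x in range(1, 7)}

# Axiom (1): 0 <= Prob(X = x) <= 1 for every x.
assert all(0 <= p <= 1 for p in f.values())

# Axiom (2): the probabilities sum to one.
assert abs(sum(f.values()) - 1.0) < 1e-12

# The probability of an event is the sum of f(x) over its outcomes.
prob_even = sum(p for x, p in f.items() if x % 2 == 0)
```

Here `prob_even` comes out to 1/2, the probability of rolling an even number.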

For the continuous case, the probability associated with any particular point is zero, and
we can only assign positive probabilities to intervals in the range (or support) of $x$.
The probability density function (pdf), $f(x)$, is defined so that $f(x) \ge 0$ and

(1) $\mathrm{Prob}(a \le X \le b) = \int_a^b f(x)\,dx \ge 0$

This result is the area under $f(x)$ in the range from $a$ to $b$. For a continuous
variable,

(2) $\int_{-\infty}^{+\infty} f(x)\,dx = 1$

If the range of $x$ is not infinite, then it is understood that $f(x) = 0$ anywhere
outside the appropriate range. Because the probability associated with any individual
point is 0,

$\mathrm{Prob}(a \le X \le b) = \mathrm{Prob}(a \le X < b) = \mathrm{Prob}(a < X \le b) = \mathrm{Prob}(a < X < b)$

1.1.2 Cumulative Distribution Function


For any random variable $X$, the probability that $X$ is less than or equal to $a$ is
denoted $F(a)$. $F(x)$ is the cumulative distribution function (cdf), or distribution
function. For a discrete random variable,

$F(x) = \mathrm{Prob}(X \le x)$

In view of the definition of $f(x)$,

$f(x_i) = F(x_i) - F(x_{i-1})$

For a continuous random variable,

$F(x) = \mathrm{Prob}(X \le x) = \int_{-\infty}^{x} f(t)\,dt, \qquad f(x) = \frac{dF(x)}{dx}$

In both the continuous and discrete cases, $F(x)$ must satisfy the following properties:

(1) $0 \le F(x) \le 1$;

(2) if $x \ge y$, then $F(x) \ge F(y)$;

(3) $F(+\infty) = 1$;

(4) $F(-\infty) = 0$.

From the definition of the cdf,

$\mathrm{Prob}(a < X \le b) = F(b) - F(a)$

1.2 Expectations of a Random Variable

Definition 2.1 (Mean of a Random Variable). The mean, or expected value, of a random
variable is

$E[X] = \begin{cases} \sum_x x\,\mathrm{Prob}(X = x) & \text{if } X \text{ is discrete} \\ \int_x x f(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$

More generally, for a function $g(X)$,

$E[g(X)] = \begin{cases} \sum_x g(x)\,\mathrm{Prob}(X = x) & \text{if } X \text{ is discrete} \\ \int_x g(x) f(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$

Definition 2.2 (Variance of a Random Variable). The variance of a random variable is

$\mathrm{Var}[X] = E[(X - E[X])^2] = \begin{cases} \sum_x (x - E[X])^2\,\mathrm{Prob}(X = x) & \text{if } X \text{ is discrete} \\ \int_x (x - E[X])^2 f(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$

The variance of $X$, $\mathrm{Var}[X]$, which must be nonnegative, is usually denoted
$\sigma^2$. It is a measure of the dispersion of a distribution. Computation of the
variance is simplified by using the following important results:

$\mathrm{Var}[X] = E[X^2] - (E[X])^2$

$\mathrm{Var}[a + bX] = b^2\,\mathrm{Var}[X]$
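The definitions above can be verified numerically. The sketch below uses a hypothetical discrete distribution (number of heads in two fair coin tosses, not an example from the text) to check both the shortcut formula and the linear-transformation rule:

```python
# Hypothetical discrete distribution: X = number of heads in two fair
# coin tosses, so f(0) = 1/4, f(1) = 1/2, f(2) = 1/4.
f = {0: 0.25, 1: 0.5, 2: 0.25}

mean = sum(x * p for x, p in f.items())                   # E[X]
var_def = sum((x - mean) ** 2 * p for x, p in f.items())  # E[(X - E[X])^2]
ex2 = sum(x ** 2 * p for x, p in f.items())               # E[X^2]
var_short = ex2 - mean ** 2                               # shortcut formula

# Var[a + bX] = b^2 Var[X]: check on the transformed distribution.
a, b = 3.0, -2.0
f_t = {a + b * x: p for x, p in f.items()}
mean_t = sum(x * p for x, p in f_t.items())
var_t = sum((x - mean_t) ** 2 * p for x, p in f_t.items())
```

Both routes give the same variance (0.5 here), and the transformed variable's variance equals $b^2$ times it.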

1.3 Some Specific Probability Distributions

1.3.1 The Normal Distributions

The general form of the normal distribution with mean $\mu$ and standard deviation
$\sigma$ is

$f(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}[(x-\mu)/\sigma]^2}$

This result is usually denoted $X \sim N[\mu, \sigma^2]$. Among the most useful properties
of the normal distribution is its preservation under linear transformation:

If $X \sim N[\mu, \sigma^2]$, then $(a + bX) \sim N[a + b\mu,\, b^2\sigma^2]$.

One particularly convenient transformation is $a = -\mu/\sigma$ and $b = 1/\sigma$. The
resulting variable $Z = (X - \mu)/\sigma$ has the standard normal distribution, denoted
$N[0,1]$, with density

$\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$

The specific notation $\phi(z)$ is often used for this density and $\Phi(z)$ for its cdf.
It follows from the definitions above that if $X \sim N[\mu, \sigma^2]$, then

$f(x) = \frac{1}{\sigma}\,\phi\!\left(\frac{x - \mu}{\sigma}\right)$

For any normally distributed variable,

$\mathrm{Prob}(a \le X \le b) = \mathrm{Prob}\!\left(\frac{a - \mu}{\sigma} \le \frac{X - \mu}{\sigma} \le \frac{b - \mu}{\sigma}\right) = \Phi\!\left(\frac{b - \mu}{\sigma}\right) - \Phi\!\left(\frac{a - \mu}{\sigma}\right)$

which can always be read from a table of the standard normal distribution. In addition,
because the distribution is symmetric, $\Phi(-z) = 1 - \Phi(z)$. Hence, it is not
necessary to tabulate both the negative and positive halves of the distribution.
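Standardization and the symmetry identity are easy to exercise in code. A minimal sketch, using the exact identity $\Phi(z) = \tfrac{1}{2}[1 + \mathrm{erf}(z/\sqrt{2})]$ and hypothetical numbers $\mu = 10$, $\sigma = 2$:

```python
import math

# Standard normal cdf via the error function:
# Phi(z) = (1 + erf(z / sqrt(2))) / 2.
def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Symmetry: Phi(-z) = 1 - Phi(z).
assert abs(Phi(-1.3) - (1.0 - Phi(1.3))) < 1e-12

# Hypothetical example: X ~ N[10, 4]; Prob(8 <= X <= 14) by standardizing.
mu, sigma = 10.0, 2.0
a, b = 8.0, 14.0
prob = Phi((b - mu) / sigma) - Phi((a - mu) / sigma)  # Phi(2) - Phi(-1)
```

The result is $\Phi(2) - \Phi(-1) \approx 0.8186$, which matches the tabled values.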

1.3.2 The Chi-Squared, T, and F Distributions


The first of the essential results is

⚫ If $X_1, \ldots, X_n$ are $n$ independent chi-squared$[1]$ variables, then

$\sum_{i=1}^{n} X_i \sim \chi^2[n]$

The mean and variance of a chi-squared variable with $n$ degrees of freedom are $n$
and $2n$, respectively.

⚫ If $Z_i$, $i = 1, \ldots, n$, are independent $N[0,1]$ variables, then

$\sum_{i=1}^{n} Z_i^2 \sim \chi^2[n]$

⚫ If $Z_i$, $i = 1, \ldots, n$, are independent $N[0, \sigma^2]$ variables, then

$\sum_{i=1}^{n} (Z_i/\sigma)^2 \sim \chi^2[n]$

⚫ If $X_1$ and $X_2$ are independent chi-squared variables with $n_1$ and $n_2$ degrees
of freedom, respectively, then

$X_1 + X_2 \sim \chi^2[n_1 + n_2]$

⚫ If $X_1$ and $X_2$ are independent chi-squared variables with degrees of freedom
parameters $n_1$ and $n_2$, respectively, then the ratio

$F[n_1, n_2] = \frac{X_1/n_1}{X_2/n_2}$

has the F distribution with $n_1$ and $n_2$ degrees of freedom.

⚫ If $Z$ is an $N[0,1]$ variable and $X$ is $\chi^2[n]$ and is independent of $Z$, then
the ratio

$T[n] = \frac{Z}{\sqrt{X/n}}$

has the t distribution with $n$ degrees of freedom.
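The first two bullets can be illustrated by simulation. This is a hypothetical Monte Carlo sketch (not from the text): sums of $n$ squared independent $N[0,1]$ draws should have mean close to $n$ and variance close to $2n$:

```python
import random

# Monte Carlo check: sum of n squared N[0,1] draws behaves like
# chi-squared[n], with mean ~ n and variance ~ 2n.
random.seed(0)
n, reps = 5, 200_000
sums = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))
        for _ in range(reps)]

mean_hat = sum(sums) / reps
var_hat = sum((s - mean_hat) ** 2 for s in sums) / reps
```

With $n = 5$, `mean_hat` lands near 5 and `var_hat` near 10, within Monte Carlo error.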

1.4 Joint Distributions

The joint density function for two random variables $X$ and $Y$, denoted $f(x, y)$, is
defined so that

$\mathrm{Prob}(a \le X \le b,\; c \le Y \le d) = \begin{cases} \sum_{a \le x \le b} \sum_{c \le y \le d} f(x, y) & \text{if } X \text{ and } Y \text{ are discrete} \\ \int_a^b \int_c^d f(x, y)\,dy\,dx & \text{if } X \text{ and } Y \text{ are continuous} \end{cases}$

The counterparts of the requirements for a univariate probability density are

$f(x, y) \ge 0$

$\sum_x \sum_y f(x, y) = 1 \quad \text{if } X \text{ and } Y \text{ are discrete}$

$\int_x \int_y f(x, y)\,dy\,dx = 1 \quad \text{if } X \text{ and } Y \text{ are continuous}$

The cumulative probability is likewise the probability of a joint event:

$F(x, y) = \mathrm{Prob}(X \le x, Y \le y) = \begin{cases} \sum_{X \le x} \sum_{Y \le y} f(x, y) & \text{in the discrete case} \\ \int_{-\infty}^{x} \int_{-\infty}^{y} f(t, s)\,ds\,dt & \text{in the continuous case} \end{cases}$

1.4.1 Marginal Distributions

A marginal probability density or marginal probability distribution is defined with
respect to an individual variable. To obtain the marginal distributions from the joint
density, it is necessary to sum or integrate out the other variable:

$f_X(x) = \begin{cases} \sum_y f(x, y) & \text{in the discrete case} \\ \int_y f(x, s)\,ds & \text{in the continuous case} \end{cases}$

and similarly for $f_Y(y)$.

Two random variables are statistically independent if and only if their joint density is
the product of the marginal densities:

$f(x, y) = f_X(x) f_Y(y) \iff X \text{ and } Y \text{ are independent}$

If (and only if) $X$ and $Y$ are independent, then the cdf factors as well as the pdf:

$F(x, y) = F_X(x) F_Y(y)$

or

$\mathrm{Prob}(X \le x, Y \le y) = \mathrm{Prob}(X \le x)\,\mathrm{Prob}(Y \le y)$
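Marginalization and the independence factorization are mechanical once the joint density is a table. A minimal sketch with a hypothetical joint distribution (two independent fair coin flips coded 0/1, not an example from the text):

```python
from itertools import product

# Hypothetical joint density: two independent fair coin flips, so
# f(x, y) = 1/4 for every cell.
f = {(x, y): 0.25 for x, y in product([0, 1], repeat=2)}

# Marginals: sum out the other variable.
fX = {x: sum(p for (xx, y), p in f.items() if xx == x) for x in [0, 1]}
fY = {y: sum(p for (x, yy), p in f.items() if yy == y) for y in [0, 1]}

# Independence: joint density equals the product of the marginals
# at every point.
independent = all(abs(f[(x, y)] - fX[x] * fY[y]) < 1e-12 for x, y in f)
```

Each marginal is the fair-coin distribution (probability 1/2 on each value), and `independent` is `True`.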

1.4.2 Expectations in a Distribution

The means, variances, and higher moments of the variables in a joint distribution are
defined with respect to the marginal distributions. For the mean of $X$ in a discrete
distribution,

$E[X] = \sum_x x f_X(x) = \sum_x x \left[\sum_y f(x, y)\right] = \sum_x \sum_y x f(x, y)$

The means of the variables in a continuous distribution are defined likewise, using
integration instead of summation:

$E[X] = \int_x x f_X(x)\,dx = \int_x \int_y x f(x, y)\,dy\,dx$

Variances are computed in the same manner:

$\mathrm{Var}[X] = \sum_x (x - E[X])^2 f_X(x) = \sum_x \sum_y (x - E[X])^2 f(x, y)$

1.4.3 Covariance and Correlation

For any function $g(X, Y)$,

$E[g(X, Y)] = \begin{cases} \sum_x \sum_y g(x, y) f(x, y) & \text{in the discrete case} \\ \int_x \int_y g(x, y) f(x, y)\,dy\,dx & \text{in the continuous case} \end{cases}$

The covariance of $X$ and $Y$ is a special case:

$\mathrm{Cov}[X, Y] = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X \mu_Y = \sigma_{XY}$

If $X$ and $Y$ are independent, then $f(x, y) = f_X(x) f_Y(y)$ and

$\sigma_{XY} = \sum_x \sum_y f_X(x) f_Y(y)(x - \mu_X)(y - \mu_Y) = \sum_x (x - \mu_X) f_X(x) \sum_y (y - \mu_Y) f_Y(y) = E[X - \mu_X]\,E[Y - \mu_Y] = 0$

The sign of the covariance will indicate the direction of covariation of $X$ and $Y$. Its
magnitude depends on the scales of measurement, however. In view of this fact, a
preferable measure is the correlation coefficient:

$\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$, respectively.
The correlation coefficient has the same sign as the covariance but is always between
$-1$ and $1$ and is thus unaffected by any scaling of the variables.

Some general results regarding expectations in a joint distribution, which can be
verified by applying the appropriate definitions, are

$E[aX + bY + c] = aE[X] + bE[Y] + c$

$\mathrm{Var}[aX + bY + c] = a^2\,\mathrm{Var}[X] + b^2\,\mathrm{Var}[Y] + 2ab\,\mathrm{Cov}[X, Y] = \mathrm{Var}[aX + bY]$

and

$\mathrm{Cov}[aX + bY,\, cX + dY] = ac\,\mathrm{Var}[X] + bd\,\mathrm{Var}[Y] + (ad + bc)\,\mathrm{Cov}[X, Y]$

If $X$ and $Y$ are uncorrelated, then

$\mathrm{Var}[X + Y] = \mathrm{Var}[X - Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$

For any two functions $g_1(X)$ and $g_2(Y)$, if $X$ and $Y$ are independent, then

$E[g_1(X) g_2(Y)] = E[g_1(X)]\,E[g_2(Y)]$
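These identities hold exactly, so they can be checked without simulation on any small joint table. A sketch with a hypothetical dependent joint distribution (the cell probabilities below are invented for illustration):

```python
# Exact check of Var[aX + bY + c] = a^2 Var[X] + b^2 Var[Y] + 2ab Cov[X, Y]
# on a small hypothetical joint distribution with dependent X and Y.
f = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def E(g):
    """Expectation of g(X, Y) under the joint table f."""
    return sum(g(x, y) * p for (x, y), p in f.items())

mu_x, mu_y = E(lambda x, y: x), E(lambda x, y: y)
var_x = E(lambda x, y: (x - mu_x) ** 2)
var_y = E(lambda x, y: (y - mu_y) ** 2)
cov = E(lambda x, y: (x - mu_x) * (y - mu_y))  # equals E[XY] - mu_x * mu_y

a, b, c = 2.0, -1.0, 5.0
mu_w = E(lambda x, y: a * x + b * y + c)
var_w = E(lambda x, y: (a * x + b * y + c - mu_w) ** 2)
rhs = a ** 2 * var_x + b ** 2 * var_y + 2 * a * b * cov
```

`var_w` and `rhs` agree to machine precision, and the covariance shortcut $E[XY] - \mu_X\mu_Y$ gives the same `cov`.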

1.5 Conditioning in a Bivariate Distribution

In a bivariate distribution, there is a conditional distribution over $y$ for each value
of $x$. The conditional densities are

$f(y \mid x) = \frac{f(x, y)}{f_X(x)}$

and

$f(x \mid y) = \frac{f(x, y)}{f_Y(y)}$

If $X$ and $Y$ are independent, then $f(y \mid x) = f_Y(y)$ and $f(x \mid y) = f_X(x)$.

1.5.1 Regression: The Conditional Mean

A conditional mean is the mean of the conditional distribution and is defined by

$E[Y \mid X = x] \equiv E[Y \mid x] = \begin{cases} \int_y y f(y \mid x)\,dy & \text{if } Y \text{ is continuous} \\ \sum_y y f(y \mid x) & \text{if } Y \text{ is discrete} \end{cases}$

The conditional mean function $E[Y \mid x]$ is called the regression of $y$ on $x$.

1.5.2 Conditional Variance

A conditional variance is the variance of the conditional distribution:

$\mathrm{Var}[Y \mid x] = E[(Y - E[Y \mid x])^2 \mid x] = \int_y (y - E[Y \mid x])^2 f(y \mid x)\,dy \quad \text{if } Y \text{ is continuous}$

or

$\mathrm{Var}[Y \mid x] = \sum_y (y - E[Y \mid x])^2 f(y \mid x) \quad \text{if } Y \text{ is discrete}$

The computation can be simplified by using

$\mathrm{Var}[Y \mid x] = E[Y^2 \mid x] - (E[Y \mid x])^2$

The conditional variance is called the scedastic function and, like the regression, is
generally a function of $x$. Unlike the conditional mean function, however, it is common
for the conditional variance not to vary with $x$.

1.5.3 Relationships among Marginal and Conditional Moments

Some useful results for the moments of a conditional distribution are given in the
following theorems.

Theorem 5.1 (Law of Iterated Expectations). $E[Y] = E_X[E[Y \mid X]]$.

The notation $E_X[\cdot]$ indicates the expectation over the values of $X$. Note that
$E[Y \mid X]$ is a function of $X$.

Theorem 5.2 (Decomposition of Variance). In a joint distribution,

$\mathrm{Var}[Y] = \mathrm{Var}_X[E[Y \mid X]] + E_X[\mathrm{Var}[Y \mid X]]$

The notation $\mathrm{Var}_X[\cdot]$ indicates the variance over the distribution of $X$.
This equation states that in a bivariate distribution, the variance of $Y$ decomposes
into the variance of the conditional mean function plus the expected variance around the
conditional mean.
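Both theorems can be verified exactly on a small discrete joint distribution. A sketch, with a hypothetical joint table chosen so that $X$ and $Y$ are dependent:

```python
# Exact check of E[Y] = E_X[E[Y|X]] and
# Var[Y] = Var_X[E[Y|X]] + E_X[Var[Y|X]] on a hypothetical joint table.
f = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}

xs = {x for x, _ in f}
fX = {x: sum(p for (xx, _), p in f.items() if xx == x) for x in xs}

def cond_moments(x):
    """Conditional mean and variance of Y given X = x."""
    cond = {y: p / fX[x] for (xx, y), p in f.items() if xx == x}
    m = sum(y * p for y, p in cond.items())
    v = sum((y - m) ** 2 * p for y, p in cond.items())
    return m, v

# Marginal moments of Y.
mu_y = sum(y * p for (_, y), p in f.items())
var_y = sum((y - mu_y) ** 2 * p for (_, y), p in f.items())

# Iterated expectations and the variance decomposition.
e_cm = sum(cond_moments(x)[0] * fX[x] for x in xs)   # E_X[E[Y|X]]
var_cm = sum((cond_moments(x)[0] - e_cm) ** 2 * fX[x] for x in xs)
e_cv = sum(cond_moments(x)[1] * fX[x] for x in xs)   # E_X[Var[Y|X]]
```

Here `e_cm` equals `mu_y`, and `var_cm + e_cv` equals `var_y`, to machine precision.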

2 Basics of Statistics
2.1 Finite Sample Properties of Estimators

2.1.1 Estimators and Estimates

Given a random sample $\{Y_1, Y_2, \ldots, Y_n\}$ drawn from a population distribution
that depends on an unknown parameter $\theta$, an estimator of $\theta$ is a rule that
assigns each possible outcome of the sample a value of $\theta$. The rule is specified
before any sampling is carried out; in particular, the rule is the same regardless of the
data actually obtained.

As an example of an estimator, let $\{Y_1, Y_2, \ldots, Y_n\}$ be a random sample from a
population with mean $\mu$. A natural estimator of $\mu$ is the average of the random
sample:

$\bar{Y} = n^{-1} \sum_{i=1}^{n} Y_i$

$\bar{Y}$ is called the sample average, viewed as an estimator. Given any outcome of the
random variables $\{Y_1, Y_2, \ldots, Y_n\}$, we use the same rule to estimate $\mu$: we
simply average them. For actual data outcomes $\{y_1, y_2, \ldots, y_n\}$, the estimate
is just the average in the sample: $\bar{y} = (y_1 + y_2 + \cdots + y_n)/n$.

More generally, an estimator $W$ of a parameter $\theta$ can be expressed as an abstract
mathematical formula:

$W = h(Y_1, Y_2, \ldots, Y_n)$

for some known function $h$ of the random variables $Y_1, Y_2, \ldots, Y_n$.

2.1.2 Unbiasedness

Unbiased Estimator. An estimator $W$ of $\theta$ is an unbiased estimator if

$E(W) = \theta$

for all possible values of $\theta$.

The sample average $\bar{Y}$ is an unbiased estimator of the population mean $\mu$,
regardless of the underlying population distribution, since

$E(\bar{Y}) = n^{-1} \sum_{i=1}^{n} E(Y_i) = n^{-1}(n\mu) = \mu$

Letting $Y_1, \ldots, Y_n$ denote the random sample from the population with
$E(Y) = \mu$ and $\mathrm{Var}(Y) = \sigma^2$, define the estimator

$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$

which is usually called the sample variance. It can be shown that $S^2$ is unbiased for
$\sigma^2$: $E(S^2) = \sigma^2$. The division by $n - 1$, rather than $n$, accounts for
the fact that the mean $\mu$ is estimated rather than known. If $\mu$ were known, an
unbiased estimator of $\sigma^2$ would be $n^{-1} \sum_{i=1}^{n} (Y_i - \mu)^2$, but
$\mu$ is rarely known in practice.
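Unbiasedness of $S^2$ can be illustrated by simulation. A hypothetical Monte Carlo sketch (not from the text), drawing many small samples from an $N[0,1]$ population so that $\sigma^2 = 1$; it also shows that dividing by $n$ instead of $n-1$ is biased downward, toward $(n-1)\sigma^2/n$:

```python
import random

# Average S^2 over many samples from N[0, 1] (sigma^2 = 1); the n-1
# divisor is unbiased, the n divisor is biased toward (n-1)/n = 0.8.
random.seed(1)
n, reps = 5, 100_000

def sample_var(ys, ddof):
    ybar = sum(ys) / len(ys)
    return sum((y - ybar) ** 2 for y in ys) / (len(ys) - ddof)

s2_unbiased, s2_biased = 0.0, 0.0
for _ in range(reps):
    ys = [random.gauss(0.0, 1.0) for _ in range(n)]
    s2_unbiased += sample_var(ys, 1) / reps
    s2_biased += sample_var(ys, 0) / reps
```

`s2_unbiased` lands near 1.0 and `s2_biased` near 0.8, within Monte Carlo error.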

2.1.3 The Sampling Variance of Estimators

We now obtain the variance of the sample average for estimating the mean $\mu$ from a
population with variance $\sigma^2$:

$\mathrm{Var}(\bar{Y}) = \sigma^2/n$

An important implication of $\mathrm{Var}(\bar{Y}) = \sigma^2/n$ is that it can be made
arbitrarily close to zero by increasing the sample size $n$.

2.1.4 Efficiency

Relative Efficiency. If $W_1$ and $W_2$ are two unbiased estimators of $\theta$, $W_1$ is
efficient relative to $W_2$ when $\mathrm{Var}(W_1) \le \mathrm{Var}(W_2)$ for all
$\theta$, with strict inequality for at least one value of $\theta$.

2.2 Asymptotic or Large Sample Properties of Estimators

2.2.1 Consistency
Consistency. Let $W_n$ be an estimator of $\theta$ based on a sample
$Y_1, Y_2, \ldots, Y_n$ of size $n$. Then, $W_n$ is a consistent estimator of $\theta$ if
for every $\varepsilon > 0$,

$P(|W_n - \theta| > \varepsilon) \to 0 \text{ as } n \to \infty$

If $W_n$ is not consistent for $\theta$, then we say it is inconsistent. When $W_n$ is
consistent, we also say that $\theta$ is the probability limit of $W_n$, written as
$\mathrm{plim}(W_n) = \theta$.

Unlike unbiasedness, which is a feature of an estimator for a given sample size,
consistency involves the behavior of the sampling distribution of the estimator as the
sample size $n$ gets large.

Law of Large Numbers. Let $Y_1, Y_2, \ldots, Y_n$ be independent, identically
distributed random variables with mean $\mu$. Then,

$\mathrm{plim}(\bar{Y}_n) = \mu$

The law of large numbers means that, if we are interested in estimating the population
average $\mu$, we can get arbitrarily close to $\mu$ by choosing a sufficiently large
sample.
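The law of large numbers is easy to see numerically. A hypothetical simulation sketch (not from the text), averaging uniform$(0,1)$ draws whose mean is $\mu = 1/2$:

```python
import random

# Sample averages of uniform(0, 1) draws (mu = 0.5) concentrate
# around mu as n grows.
random.seed(2)
mu = 0.5

def ybar(n):
    return sum(random.random() for _ in range(n)) / n

err_small = abs(ybar(100) - mu)       # typically around 0.03
err_large = abs(ybar(1_000_000) - mu) # typically around 0.0003
```

With a million draws the sample average sits within a fraction of a percent of $\mu$; the standard deviation of $\bar{Y}_n$ here is $\sqrt{1/12}/\sqrt{n}$, shrinking with $n$ exactly as $\mathrm{Var}(\bar{Y}) = \sigma^2/n$ predicts.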

2.2.2 Asymptotic Normality

Asymptotic Normality. Let $\{Z_n : n = 1, 2, \ldots\}$ be a sequence of random variables
such that for all numbers $z$,

$P(Z_n \le z) \to \Phi(z) \text{ as } n \to \infty$

where $\Phi(z)$ is the standard normal cumulative distribution function. Then, $Z_n$ is
said to have an asymptotic standard normal distribution. In this case, we often write
$Z_n \overset{a}{\sim} \mathrm{Normal}(0, 1)$. (The "$a$" above the tilde stands for
"asymptotically" or "approximately.")

Central Limit Theorem. Let $\{Y_1, Y_2, \ldots, Y_n\}$ be a random sample with mean
$\mu$ and variance $\sigma^2$. Then,

$Z_n = \frac{\bar{Y}_n - \mu}{\sigma/\sqrt{n}}$

has an asymptotic standard normal distribution.
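A hypothetical simulation sketch of the central limit theorem (not from the text): even for uniform draws, which are far from normal, the standardized mean of $n = 30$ observations is already close to $N[0,1]$, so $P(Z_n \le 1)$ should be near $\Phi(1) \approx 0.8413$:

```python
import math
import random

# Standardized means of uniform(0, 1) samples: mu = 1/2, sigma^2 = 1/12.
random.seed(3)
n, reps = 30, 50_000
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)

below = 0
for _ in range(reps):
    ybar = sum(random.random() for _ in range(n)) / n
    z = (ybar - mu) / (sigma / math.sqrt(n))
    if z <= 1.0:
        below += 1

frac = below / reps
phi1 = 0.5 * (1.0 + math.erf(1.0 / math.sqrt(2.0)))  # Phi(1)
```

The simulated fraction `frac` matches `phi1` to within Monte Carlo error.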

2.3 General Approaches to Parameter Estimation

N/A

2.4 Hypothesis Testing

2.4.1 Fundamentals of Hypothesis Testing

In hypothesis testing, we can make two kinds of mistakes. First, we can reject the null
hypothesis when it is in fact true. This is called a Type I error. The second kind of
error is failing to reject $H_0$ when it is actually false. This is called a Type II
error.

After we have made the decision of whether or not to reject the null hypothesis, we have
either decided correctly or we have committed an error. We will never know with certainty
whether an error was committed. However, we can compute the probability of making either
a Type I or a Type II error. Hypothesis testing rules are constructed to make the
probability of committing a Type I error fairly small. Generally, we define the
significance level (or simply the level) of a test as the probability of a Type I error;
it is typically denoted by $\alpha$. Symbolically, we have

$\alpha = P(\text{Reject } H_0 \mid H_0 \text{ is true})$

The right-hand side is read as: "The probability of rejecting $H_0$ given that $H_0$ is
true."
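The significance level can be checked by simulation. A hypothetical sketch (not from the text): testing $H_0: \mu = 0$ with known $\sigma = 1$ at $\alpha = 0.05$, rejecting when $|Z| > 1.96$; since the data are generated with $H_0$ true, the rejection frequency should be close to $\alpha$:

```python
import math
import random

# Z test of H0: mu = 0 with known sigma = 1, level alpha = 0.05.
# Data are drawn under H0, so rejections are Type I errors.
random.seed(4)
n, reps, crit = 25, 50_000, 1.96

rejections = 0
for _ in range(reps):
    ybar = sum(random.gauss(0.0, 1.0) for _ in range(n)) / n
    z = ybar / (1.0 / math.sqrt(n))  # (ybar - 0) / (sigma / sqrt(n))
    if abs(z) > crit:
        rejections += 1

type1_rate = rejections / reps
```

`type1_rate` comes out near 0.05, confirming that the rule's Type I error probability equals its stated level.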

2.4.2 Computing and Using p-Values

N/A
