Probability Distributions
1 Indicator Functions
The concept of an indicator function is a really useful one. This is a function that takes the
value one if its argument is true, and the value zero if its argument is false. Sometimes these
functions are called Heaviside functions or unit step functions. I write an indicator function
as I{A}(x), although it is sometimes written 1{A}(x). If the context is obvious, we can
also simply write I{A}.
Example:
\[
I_{\{x>3\}}(x) =
\begin{cases}
0 & x \le 3 \\
1 & x > 3
\end{cases}
\]
Indicator functions are useful for making sure that you don’t take the log of a negative
number, and things like that. Indicator functions are always first in the order of operations—
if the indicator function is zero, you don’t try to evaluate the rest of the expression. When
taking derivatives they just go along for the ride. When taking integrals, they may affect
the range over which the integral is evaluated.
Example: The density function for the exponential distribution (see Section 4.1) can be
written as f (x) = λ exp(−λx)I{x≥0} (x). We can integrate the density and show that it is
indeed a proper density and integrates to one:
\[
\int_{-\infty}^{\infty} \lambda \exp(-\lambda x) I_{\{x \ge 0\}}(x)\,dx = \int_0^{\infty} \lambda \exp(-\lambda x)\,dx = -\exp(-\lambda x)\Big|_0^{\infty} = -(0 - 1) = 1.
\]
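As a quick numerical sketch of this calculation, we can integrate the exponential density over [0, ∞) in Python; the rate λ = 2 below is an arbitrary choice for illustration:

    # Numerically integrate the exponential density lambda * exp(-lambda * x) on [0, inf).
    # The indicator I{x >= 0} is handled by starting the integration at 0.
    import numpy as np
    from scipy import integrate

    lam = 2.0  # arbitrary rate for illustration
    value, err = integrate.quad(lambda x: lam * np.exp(-lam * x), 0, np.inf)
    print(value)  # approximately 1.0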
2 Expected Values
The expected value, also known as the expectation or mean, of a random variable X is
denoted E(X). It is the weighted average of all values X could take, with weights given by
the probabilities of those values. If X is discrete-valued, then
\[
E(X) = \sum_x x \cdot P(X = x) = \sum_x x \cdot f(x) .
\]
If X is continuous with PDF f(x), the sum is replaced by an integral:
\[
E(X) = \int_{-\infty}^{\infty} x \cdot f(x)\,dx .
\]
The expectation of a random variable can be thought of as the average value. If, for example,
we observed many realizations of a random variable X and took their average, it would be
close to E(X).
One nice property of expected values is that they are easy to compute for linear functions
of random variables. To see this, let X and Y be random variables with E(X) = µX and
E(Y ) = µY . Suppose we are interested in a new random variable Z = aX +bY +c where a, b,
and c are any real constants. The mean of Z is easy to compute: E(Z) = E(aX + bY + c) =
aE(X) + bE(Y ) + c = aµX + bµY + c.
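To illustrate this linearity property, here is a small simulation sketch; the distributions and the constants a = 5, b = -1, c = 4 are arbitrary choices for illustration:

    # Estimate E(aX + bY + c) by simulation and compare to a*E(X) + b*E(Y) + c.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000
    x = rng.exponential(scale=2.0, size=n)      # E(X) = 2
    y = rng.normal(loc=3.0, scale=1.0, size=n)  # E(Y) = 3
    a, b, c = 5.0, -1.0, 4.0
    print(np.mean(a * x + b * y + c))  # close to 5*2 + (-1)*3 + 4 = 11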
We can also compute expectations of functions of X. For example, suppose g(X) = 2/X. Then we have
\[
E(g(X)) = \int_{-\infty}^{\infty} g(x) f(x)\,dx = \int_{-\infty}^{\infty} \frac{2}{x} f(x)\,dx .
\]
Note, however, that in general E(g(X)) ≠ g(E(X)).
Example: Let’s say continuous random variable X has PDF f (x) = 3x2 I{0≤x≤1} (x). We
want to find E(X) and E(X 2 ). First,
\[
E(X) = \int_{-\infty}^{\infty} x \cdot 3x^2 \, I_{\{0 \le x \le 1\}}(x)\,dx = \int_0^1 x \cdot 3x^2\,dx = \int_0^1 3x^3\,dx = \frac{3}{4}x^4 \Big|_{x=0}^{x=1} = \frac{3}{4}(1 - 0) = \frac{3}{4}, \tag{1}
\]
and second,
\[
E(X^2) = \int_{-\infty}^{\infty} x^2 \cdot 3x^2 \, I_{\{0 \le x \le 1\}}(x)\,dx = \int_0^1 x^2 \cdot 3x^2\,dx = \int_0^1 3x^4\,dx = \frac{3}{5}x^5 \Big|_{x=0}^{x=1} = \frac{3}{5}(1 - 0) = \frac{3}{5}. \tag{2}
\]
2.1 Variance
The variance of a random variable measures how spread out its values are. If X is a random
variable with mean E(X) = µ, then the variance is E[(X − µ)2 ]. In words, the variance
is the expected value of the squared deviation of X from its mean. If X is discrete, this is
calculated as
\[
\mathrm{Var}(X) = \sum_x (x - \mu)^2 \cdot P(X = x)
\]
and if X is continuous, it is
\[
\mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 \cdot f(x)\,dx .
\]
For both discrete and continuous X, a convenient formula for the variance is V ar(X) =
E[X 2 ] − (E[X])2 . The square root of variance is called the standard deviation.
Variance has a linear property similar to expectation. Again, let X and Y be random variables with Var(X) = σ_X^2 and Var(Y) = σ_Y^2. It is also necessary to assume that X and Y are independent. Suppose we are interested in a new random variable Z = aX + bY + c where a, b, and c are any real constants. The variance of Z is then Var(Z) = Var(aX + bY + c) = a^2 Var(X) + b^2 Var(Y) + 0 = a^2 σ_X^2 + b^2 σ_Y^2. Because c is constant, it has variance 0.
Example: Continuing the previous example, let’s say continuous random variable X has
PDF f (x) = 3x2 I{0≤x≤1} (x). We found in Equations 1 and 2 that E(X) = 3/4 and E(X 2 ) =
3/5. Then, V ar(X) = E[X 2 ] − (E[X])2 = 3/5 − (3/4)2 = 3/80.
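These values can be checked numerically; the following sketch integrates the same density with scipy:

    # Verify E(X) = 3/4, E(X^2) = 3/5, and Var(X) = 3/80 for f(x) = 3x^2 on [0, 1].
    from scipy import integrate

    f = lambda x: 3 * x**2
    ex, _ = integrate.quad(lambda x: x * f(x), 0, 1)      # expect 0.75
    ex2, _ = integrate.quad(lambda x: x**2 * f(x), 0, 1)  # expect 0.6
    print(ex, ex2, ex2 - ex**2)                           # variance: 3/80 = 0.0375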
3 Discrete Distributions
3.1 Geometric
The geometric distribution is the number of trials needed to get the first success, i.e., the
number of Bernoulli events until a success is observed, such as the first head when flipping
a coin. It takes values on the positive integers starting with one (since at least one trial is
needed to observe a success).
\[
\begin{aligned}
X &\sim \mathrm{Geo}(p) \\
P(X = x \mid p) &= p(1 - p)^{x-1} \quad \text{for } x = 1, 2, \ldots \\
E[X] &= \frac{1}{p}
\end{aligned}
\]
If the probability of getting a success is p, then the expected number of trials until the first
success is 1/p.
Example: What is the probability that we flip a fair coin four times and don’t see any heads?
This is the same as asking what is P (X > 4) where X ∼ Geo(1/2). P (X > 4) = 1 − P (X =
1)−P (X = 2)−P (X = 3)−P (X = 4) = 1−(1/2)−(1/2)(1/2)−(1/2)(1/2)2 −(1/2)(1/2)3 =
1/16. Of course, we could also have just computed it directly, but here we see an example
of using the geometric distribution and we can also see that we got the right answer.
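The same answer can be obtained from scipy, whose geometric distribution also counts the number of trials starting at one (a quick sketch for illustration):

    # P(X > 4) for X ~ Geo(1/2): the survival function gives (1 - p)^4 = 1/16.
    from scipy.stats import geom

    print(geom.sf(4, 0.5))       # 0.0625
    print(1 - geom.cdf(4, 0.5))  # same value via the CDF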
3.2 Multinomial
Another generalization of the Bernoulli and the binomial is the multinomial distribution,
which is like a binomial when there are more than two possible outcomes. Suppose we have
n trials and there are k different possible outcomes which occur with probabilities p1 , . . . , pk .
For example, if we are rolling a six-sided die that might be loaded so that the sides are not
equally likely, then n is the total number of rolls, k = 6, p1 is the probability of rolling a one,
and we denote by x1 , . . . , x6 a possible outcome for the number of times we observe rolls of
each of one through six, where $\sum_{i=1}^{6} x_i = n$ and $\sum_{i=1}^{6} p_i = 1$.
\[
f(x_1, \ldots, x_k \mid p_1, \ldots, p_k) = \frac{n!}{x_1! \cdots x_k!} \, p_1^{x_1} \cdots p_k^{x_k} .
\]
Recall that n! stands for n factorial, which is the product of n times n − 1 times . . . 1, e.g.,
4! = 4 · 3 · 2 · 1 = 24. The expected number of observations in category i is npi .
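As an illustration, the multinomial pmf can be evaluated with scipy; the die probabilities and counts below are made-up values:

    # Probability of one particular outcome of n = 10 rolls of a hypothetical loaded die.
    from scipy.stats import multinomial

    p = [0.1, 0.1, 0.1, 0.1, 0.2, 0.4]          # category probabilities, sum to 1
    counts = [1, 0, 2, 1, 2, 4]                 # observed counts x_1, ..., x_6, sum to 10
    print(multinomial.pmf(counts, n=10, p=p))   # probability of exactly these counts
    print([10 * pi for pi in p])                # expected counts n * p_i per category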
3.3 Poisson
The Poisson distribution is used for counts, and arises in a variety of situations. The param-
eter λ > 0 is the rate at which we expect to observe the thing we are counting.
\[
\begin{aligned}
X &\sim \mathrm{Pois}(\lambda) \\
P(X = x \mid \lambda) &= \frac{\lambda^x \exp(-\lambda)}{x!} \quad \text{for } x = 0, 1, 2, \ldots \\
E[X] &= \lambda \\
\mathrm{Var}[X] &= \lambda
\end{aligned}
\]
A Poisson process is a process wherein events occur on average at rate λ, events occur one
at a time, and events occur independently of each other.
Example: Significant earthquakes occur in the Western United States approximately fol-
lowing a Poisson process with rate of two earthquakes per week. What is the probability
there will be at least 3 earthquakes in the next two weeks? Answer: the rate per two weeks is
2 × 2 = 4, so let X ∼ Pois(4) and we want to know P(X ≥ 3) = 1 − P(X ≤ 2) = 1 − P(X = 0) − P(X = 1) − P(X = 2) = 1 − e^{−4} − 4e^{−4} − 4^2 e^{−4}/2 = 1 − 13e^{−4} ≈ 0.762. Note that 0! = 1 by definition.
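The same probability can be computed directly with scipy as a quick check:

    # P(X >= 3) for X ~ Pois(4), via the CDF and via the survival function.
    from scipy.stats import poisson

    print(1 - poisson.cdf(2, mu=4))   # approximately 0.7619
    print(poisson.sf(2, mu=4))        # same quantity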
4 Continuous Distributions
4.1 Exponential
The exponential distribution is often used to model the waiting time between random events.
Indeed, if the waiting times between successive events are independent draws from an Exp(λ)
distribution, then for any fixed time window of length t, the number of events occurring in
that window will follow a Poisson distribution with mean tλ.
\[
\begin{aligned}
X &\sim \mathrm{Exp}(\lambda) \\
f(x \mid \lambda) &= \lambda e^{-\lambda x} I_{\{x \ge 0\}}(x) \\
E[X] &= \frac{1}{\lambda} \\
\mathrm{Var}[X] &= \frac{1}{\lambda^2}
\end{aligned}
\]
Similar to the Poisson distribution, the parameter λ is interpreted as the rate at which the
events occur.
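The connection to the Poisson distribution can be illustrated by simulation; the values λ = 2 and t = 3 below are arbitrary choices for this sketch:

    # With Exp(lam) waiting times between events, the number of events in a window of
    # length t should behave like a Pois(t * lam) random variable.
    import numpy as np

    rng = np.random.default_rng(1)
    lam, t, reps = 2.0, 3.0, 100_000
    counts = []
    for _ in range(reps):
        elapsed, n_events = 0.0, 0
        while True:
            elapsed += rng.exponential(scale=1 / lam)  # next waiting time
            if elapsed > t:
                break
            n_events += 1
        counts.append(n_events)
    print(np.mean(counts), np.var(counts))  # both close to t * lam = 6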
4.2 Gamma
If X1, X2, . . . , Xn are independent (and identically distributed Exp(λ)) waiting times between
successive events, then the total waiting time for all n events to occur, $Y = \sum_{i=1}^{n} X_i$, will
follow a gamma distribution with shape parameter α = n and rate parameter β = λ:
\[
\begin{aligned}
Y &\sim \mathrm{Gamma}(\alpha, \beta) \\
f(y \mid \alpha, \beta) &= \frac{\beta^{\alpha}}{\Gamma(\alpha)} y^{\alpha - 1} e^{-\beta y} I_{\{y \ge 0\}}(y) \\
E[Y] &= \frac{\alpha}{\beta} \\
\mathrm{Var}[Y] &= \frac{\alpha}{\beta^2}
\end{aligned}
\]
where Γ(·) is the gamma function, a generalization of the factorial function which can accept
non-integer arguments. If n is a positive integer, then Γ(n) = (n − 1)!. Note also that α > 0
and β > 0.
The exponential distribution is a special case of the gamma distribution with α = 1. The
gamma distribution commonly appears in statistical problems, as we will see in this course.
It is used to model positive-valued, continuous quantities whose distribution is right-skewed.
As α increases, the gamma distribution more closely resembles the normal distribution.
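The waiting-time interpretation above can also be checked by simulation; n = 5 and λ = 2 are arbitrary choices for this sketch:

    # The sum of n iid Exp(lam) variables should match Gamma(alpha = n, beta = lam)
    # in mean (alpha / beta) and variance (alpha / beta^2).
    import numpy as np

    rng = np.random.default_rng(2)
    n, lam, reps = 5, 2.0, 200_000
    sums = rng.exponential(scale=1 / lam, size=(reps, n)).sum(axis=1)
    print(np.mean(sums), n / lam)    # both near 2.5
    print(np.var(sums), n / lam**2)  # both near 1.25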
4.3 Uniform
The uniform distribution is used for random variables whose possible values are equally likely
over an interval. If the interval is (a, b), then the uniform probability density function (PDF)
f (x) is flat for all values in that interval and 0 everywhere else.
\[
\begin{aligned}
X &\sim \mathrm{Uniform}(a, b) \\
f(x \mid a, b) &= \frac{1}{b - a} I_{\{a \le x \le b\}}(x) \\
E[X] &= \frac{a + b}{2} \\
\mathrm{Var}[X] &= \frac{(b - a)^2}{12}
\end{aligned}
\]
The standard uniform distribution is obtained when a = 0 and b = 1.
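For reference, scipy parameterizes the uniform distribution by loc = a and scale = b − a; the endpoints a = 2, b = 5 below are arbitrary choices for this sketch:

    # Compare scipy's mean and variance for Uniform(2, 5) with the formulas above.
    from scipy.stats import uniform

    a, b = 2.0, 5.0
    dist = uniform(loc=a, scale=b - a)
    print(dist.mean(), (a + b) / 2)        # 3.5 and 3.5
    print(dist.var(), (b - a) ** 2 / 12)   # 0.75 and 0.75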
4.4 Beta
The beta distribution is used for random variables which take on values between 0 and 1.
For this reason (and other reasons we will see later in the course), the beta distribution is
commonly used to model probabilities.
\[
\begin{aligned}
X &\sim \mathrm{Beta}(\alpha, \beta) \\
f(x \mid \alpha, \beta) &= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} I_{\{0 < x < 1\}}(x) \\
E[X] &= \frac{\alpha}{\alpha + \beta} \\
\mathrm{Var}[X] &= \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}
\end{aligned}
\]
where Γ(·) is the gamma function introduced with the gamma distribution. Note also that
α > 0 and β > 0. The standard Uniform(0, 1) distribution is a special case of the beta
distribution with α = β = 1.
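A quick check with scipy; the shape parameters α = 2 and β = 3 below are arbitrary choices for this sketch:

    # Verify the Beta(alpha, beta) mean and the fact that Beta(1, 1) is flat on (0, 1).
    from scipy.stats import beta

    a, b = 2.0, 3.0
    print(beta(a, b).mean(), a / (a + b))            # both 0.4
    print(beta(1, 1).pdf(0.3), beta(1, 1).pdf(0.8))  # both 1.0, the Uniform(0, 1) density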
4.5 Normal
The normal, or Gaussian, distribution is one of the most important distributions in statistics.
It arises as the limiting distribution of sums (and averages) of random variables. This is
due to the Central Limit Theorem, introduced in Section 5. Because of this property, the
normal distribution is often used to model the “errors,” or unexplained variation of individual
observations in regression models.
The standard normal distribution is given by
\[
\begin{aligned}
Z &\sim \mathrm{N}(0, 1) \\
f(z) &= \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right) \\
E[Z] &= 0 \\
\mathrm{Var}[Z] &= 1
\end{aligned}
\]
Now consider X = σZ+µ where σ > 0 and µ is any real constant. Then E(X) = E(σZ+µ) =
σE(Z) + µ = σ · 0 + µ = µ and V ar(X) = V ar(σZ + µ) = σ 2 V ar(Z) + 0 = σ 2 · 1 = σ 2 .
Then, X follows a normal distribution with mean µ and variance σ 2 (standard deviation σ)
denoted as
\[
\begin{aligned}
X &\sim \mathrm{N}(\mu, \sigma^2) \\
f(x \mid \mu, \sigma^2) &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
\end{aligned}
\]
The normal distribution is symmetric about the mean µ, and is often described as a “bell-
shaped” curve. Although X can take on any real value (positive or negative), more than
99% of the probability mass is concentrated within three standard deviations of the mean.
The normal distribution has several desirable properties. One is that if X1 ∼ N(µ1 , σ12 )
and X2 ∼ N(µ2 , σ22 ) are independent, then X1 +X2 ∼ N(µ1 +µ2 , σ12 +σ22 ). Consequently, if we
take the average of n independent and identically distributed (iid) normal random variables,
\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i ,
\]
where the Xi are iid N(µ, σ^2) for i = 1, 2, . . . , n, then
\[
\bar{X} \sim \mathrm{N}\!\left(\mu, \frac{\sigma^2}{n}\right). \tag{3}
\]
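Equation 3 can be illustrated by simulation; µ = 10, σ = 2, and n = 25 below are arbitrary choices for this sketch:

    # Sample means of n iid N(mu, sigma^2) draws should be N(mu, sigma^2 / n).
    import numpy as np

    rng = np.random.default_rng(3)
    mu, sigma, n, reps = 10.0, 2.0, 25, 100_000
    xbars = rng.normal(loc=mu, scale=sigma, size=(reps, n)).mean(axis=1)
    print(np.mean(xbars))               # close to mu = 10
    print(np.var(xbars), sigma**2 / n)  # both close to 0.16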
4.6 t
If we have normal data, we can use Equation 3 to help us estimate the mean µ. Reversing
the transformation from the previous section, we get
\[
\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim \mathrm{N}(0, 1) . \tag{4}
\]
However, we may not know the value of σ. If we estimate it from data, we can replace it with
$S = \sqrt{\sum_i (X_i - \bar{X})^2 / (n - 1)}$, the sample standard deviation. This causes the expression (4)
to no longer be distributed as standard normal, but as a standard t distribution with ν = n−1
degrees of freedom.
\[
\begin{aligned}
Y &\sim t_{\nu} \\
f(y) &= \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\nu\pi}} \left(1 + \frac{y^2}{\nu}\right)^{-\left(\frac{\nu + 1}{2}\right)} \\
E[Y] &= 0 \quad \text{if } \nu > 1 \\
\mathrm{Var}[Y] &= \frac{\nu}{\nu - 2} \quad \text{if } \nu > 2
\end{aligned}
\]
The t distribution is symmetric and resembles the normal distribution, but with thicker tails.
As the degrees of freedom increase, the t distribution looks more and more like the standard
normal distribution.
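The thicker tails are easy to see numerically in this short sketch:

    # Compare upper-tail probabilities P(Y > 2) for the t and standard normal distributions.
    from scipy.stats import norm, t

    print(norm.sf(2))        # about 0.023 for the standard normal
    print(t.sf(2, df=3))     # about 0.070 with nu = 3: a heavier tail
    print(t.sf(2, df=100))   # close to the normal value as nu grows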
5 Central Limit Theorem
The Central Limit Theorem states that if X1, X2, . . . , Xn are independent and identically distributed with mean µ and finite variance σ^2 (they do not need to be normally distributed), then
\[
\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \Rightarrow \mathrm{N}(0, 1) .
\]
That is, X̄ is approximately normally distributed with mean µ and variance σ^2/n, or standard deviation σ/√n.
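A simulation sketch of this result; the Exp(1) data and the sample size n = 50 are arbitrary choices for illustration:

    # Standardized averages of non-normal (exponential) data look standard normal.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(4)
    n, reps = 50, 100_000
    mu, sigma = 1.0, 1.0                      # mean and standard deviation of Exp(1)
    samples = rng.exponential(scale=1.0, size=(reps, n))
    z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma
    print(np.mean(z <= 1.0), norm.cdf(1.0))   # empirical vs. N(0, 1) value, both near 0.84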