
A HITCHHIKER’S GUIDE TO PROBABILITY

MATT RAYMOND

Abstract. This guide is intended as an informal, reference-like paper on
introductory probability, so that we can use data from the Large Hadron
Collider (LHC) in Geneva to make predictions about particle physics. We
extend many of the notions introduced here in Week 4 of the course. I omit
discussion of machine learning and particle physics here. This is not
exhaustive, or in any sense deep, so the questioning reader should consult
an introductory probability text.

1. Some Prerequisites

First we need to introduce some helpful terminology (which has probably
already been introduced). Suppose we have two sets, call them A and B, and
let us consider a function f : A → B.

(1) We call f surjective (or onto) if for every b ∈ B, there is an a ∈ A
with f (a) = b.
(2) We call f injective (or one-to-one) if distinct elements of A are sent to
distinct elements of B; that is, f (a1 ) = f (a2 ) implies a1 = a2 .
(3) We call f bijective if it is both surjective and injective.

Suppose now there is a bijection A → B (sometimes we will write this as
A ≃ B). Then we say A and B have the same cardinality. Now let Jn be
the set of the first n positive integers. If Jn ≃ A, for some n ∈ N, we call A
finite. Otherwise A is infinite. If A ≃ N then we say A is countable.1
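For instance (a standard example, not from these notes): the map n ↦ 2n is a
bijection from N to the set of even natural numbers, so the even natural
numbers are countable, even though they form a proper subset of N.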

Heuristically, the continuity of f can be characterised by whether the graph
of f is unbroken and ’smooth’ (this is good enough for our purposes). Let
f : R → R, and suppose x ∈ R. Then we say that f is continuous at x if,
as ε → 0, f (x + ε) → f (x).2

Now we are ready to state a primitive definition of the differentiability of
f : R → R. Let ε > 0 and define the quotient

(1.1)    ψ(x) = (f (x + ε) − f (x)) / ε.

If the limit of ψ(x) as ε → 0 exists, we say f is differentiable at x, and we
call the value which ψ(x) tends to as ε → 0 the derivative of f at x, f ′(x)
(there are helpful rules to compute these, many of which are in our textbook).
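As a quick worked example (my own, not in the original notes): take
f (x) = x². Then ψ(x) = ((x + ε)² − x²)/ε = (2xε + ε²)/ε = 2x + ε, which tends
to 2x as ε → 0, so f ′(x) = 2x.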

Date: 2/3/21.
1Be careful! Some authors use other names for A when A ≃ N.
2In order to make this rigorous, we really need to define what it means for x to tend to y.
This leads us to limits!


Let f : [a, b] → R be differentiable on [a, b] and suppose we divide [a, b] into
a union of segments [a, x1 ], [x1 , x2 ], . . . , [xn , b], with a ≤ x1 ≤ . . . ≤ xn ≤ b.
Writing x0 = a and xn+1 = b, the following sum (as n → ∞) is called the
integral over [a, b] of f :

(1.2)    ∑_{i=1}^{n+1} f (xi )(xi − xi−1 ) ≈ ∫_a^b f (x) dx.

The astute reader would recognise that the above sum approximates the area
under the graph of f .3
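To make (1.2) concrete, here is a minimal numerical sketch (mine, not part of
the notes), using equal-width segments to approximate the integral of x² over
[0, 1]; the exact value, 1/3, is recovered below via Proposition 1.4.

    # Riemann-sum sketch: approximate the integral of f(x) = x**2 over [0, 1],
    # sampling f at the right endpoint of each segment as in (1.2).
    def riemann_sum(f, a, b, n):
        width = (b - a) / n
        return sum(f(a + i * width) * width for i in range(1, n + 1))

    f = lambda x: x ** 2
    for n in (10, 100, 1000):
        print(n, riemann_sum(f, 0.0, 1.0, n))   # approaches 1/3 as n grows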

Now we go further. Let f : [a, b] → R be continuous, and let F : [a, b] → R
be differentiable on [a, b]. Then we say F is a primitive (or antiderivative)
of f on [a, b] if F ′(x) = f (x) for every x ∈ [a, b], and we write

(1.3)    F (x) = ∫ f (x) dx.

Notice how f may have many primitives (but they all differ by a constant).
Now we state a practical and important (slightly simplified) result.

Proposition 1.4. Let f : [a, b] → R have a primitive F on [a, b]. Then

(1.5)    ∫_a^b f (x) dx = F (b) − F (a).

This is often referred to as the ’Fundamental Theorem’ of Calculus.
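For example (a quick check, not in the original): F (x) = x³/3 is a primitive
of f (x) = x², so by Proposition 1.4, ∫_0^1 x² dx = F (1) − F (0) = 1/3,
matching the Riemann-sum sketch above.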


Some functions may not have a primitive (or may not have one which has a
closed form expression).4 A useful example for our purposes is the Gaussian
integral,

(1.6)    I = ∫_R e^{−x²} dx = ∫_{−∞}^{∞} e^{−x²} dx = √π,

which will be fundamental in describing the ’normal’ distribution.5
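A quick numerical sanity check of (1.6) (a sketch of mine; NumPy is my
assumption, the notes do not mention it):

    import numpy as np

    # Midpoint-rule approximation of the Gaussian integral (1.6) over [-10, 10];
    # the tails beyond |x| = 10 contribute a negligible amount.
    dx = 1e-4
    x = np.arange(-10.0, 10.0, dx) + dx / 2
    approx = np.sum(np.exp(-x ** 2)) * dx
    print(approx, np.sqrt(np.pi))   # both print roughly 1.7724538509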

2. Random Variables

Call Ω our probability (sample) space, and let X : Ω → R. Then we call X a
random variable.
(1) A random variable X is discrete if the image X(Ω) ⊂ R is finite or
countable.
(2) A random variable X is continuous if X(Ω) is uncountable.
We could consider mixed random variables, in which X behaves continuously
on some parts of R and discretely on others (but we do not, for lack of time).
We first consider the discrete case.
3However, consider an odd function on [−a, a]. Why does the integral of this function over
[−a, a] vanish?
4A function f has a closed form expression if it can be expressed as a composition of
’elementary’ functions like sin x, ln x, x² and so on.
5Analogous rules for integration may be found in our textbook, but here most of the integrals
we deal with in our course are messy and can be better computed numerically.

Let X be the number of heads we collect if we throw 2 coins. We want to
define the probability of X attaining some value n. We define a probability
mass function on X as a map P : R → [0, 1] which is normalized,

(2.1)    ∑_{i=1}^{k} P (X = xi ) = 1,

with pX (xi ) = P (X = xi ) > 0 for each xi ∈ X(Ω), and pX (x) = 0 for all
others. The set of xi with P (X = xi ) > 0 is called the support of X, suppX.
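For the two-coin example above (assuming the coins are fair and independent,
which the notes leave implicit): pX (0) = 1/4, pX (1) = 1/2, pX (2) = 1/4, and
indeed 1/4 + 1/2 + 1/4 = 1, so (2.1) holds and suppX = {0, 1, 2}.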

Example 2.2. Suppose we toss an unfair coin once, with X being the number
of heads collected (either 1 or 0), pX (1) = p, and pX (0) = 1 − pX (1) = 1 − p.
This gives us a first example of a probability distribution, from which X
is sampled. We say X is Bernoulli distributed, and write X ∼ Ber(p).

Exercise 2.3. Suppose the distribution (generated by X) on a finite space
Ω is given by P (X = xi ) = 1/n when xi ∈ [a, b] and P (X = x) = 0 when
x ∉ [a, b], where suppX ≃ Jn . Prove P is a mass function.

We say that in the above X is uniformly distributed on [a, b], or X ∼ U([a, b]).
Let us quickly review conditional probability. Suppose that X = xj occurs
before X = xi . Then we define the conditional probability of X = xi
given that X = xj has already occurred as

(2.4)    P (X = xi | X = xj ) = P (X = xi ∩ X = xj ) / P (X = xj ),

requiring that P (X = xj ) ≠ 0.

Exercise 2.5. Prove Bayes’ Theorem (apply the definition twice):

(2.6)    P (X = xj | X = xi ) = P (X = xi | X = xj ) P (X = xj ) / P (X = xi ),

where P (X = xi ) ≠ 0.

Often we want to consider multiple random variables, and to do this we
consider joint distributions. Let X, Y be discrete. Then the mass function

(2.7)    pX,Y (xi , yj ) = P (X = xi , Y = yj ) = P (X = xi ∩ Y = yj )

defines the joint distribution of X and Y . Notice we can recover pX by
summing over every state of Y , and conversely pY by summing over all states
of X.6 If X and Y are independent, we have the familiar multiplication rule

(2.8)    P (X = xi ∩ Y = yj ) = P (X = xi )P (Y = yj ).
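A small sketch (my own illustration, not from the notes) of (2.7) and (2.8)
for two independent fair coins X and Y , recovering the marginal pX by
summing the joint mass function over the states of Y :

    from itertools import product

    # Joint mass function (2.7) of two independent fair coins (1 = heads, 0 = tails).
    p_joint = {(x, y): 0.25 for x, y in product((0, 1), repeat=2)}

    # Recover the marginal of X by summing the joint mass over every state of Y.
    p_X = {x: sum(p_joint[(x, y)] for y in (0, 1)) for x in (0, 1)}
    print(p_X)   # {0: 0.5, 1: 0.5}

    # Multiplication rule (2.8): the joint mass factorises for independent variables
    # (here p_Y(y) = 0.5 for both states of Y).
    assert all(abs(p_joint[(x, y)] - p_X[x] * 0.5) < 1e-12 for (x, y) in p_joint)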

Mutually independent random variables X, Y, Z . . . satisfy the multiplication
rule for every sub-collection of them we choose (this is a little stronger
than pairwise independence). We now develop the continuous case. For a
continuous random variable X, which takes values on [a, b], notice that the
probability of getting any single point on [a, b] is 0 (why?). But we can
consider the probability that X ∈ [x − ε, x + ε], as ε → 0.7
6This did not make sense to me at first so see me and I’ll give you an example.

Given a continuous random variable X, call p a probability density function
on X if the domain of p is all possible states of X, p(x) ≥ 0 for every state
x of X, and

(2.9)    ∫_R p(x) dx = 1.
Example 2.10. We formulate a version of the uniform distribution for the
continuous case. Suppose the density of X is p(x) = 1/(b − a) for every state
x of X in [a, b], and p(x) = 0 otherwise. This is nonnegative, and

(2.11)    ∫_R p(x) dx = ∫_a^b 1/(b − a) dx = [x/(b − a)]_a^b = b/(b − a) − a/(b − a) = 1.
Exercise 2.12. Suppose X1 , X2 , . . . , Xn are discrete, mutually independent,
and Xi ∼ Ber(p), and suppose we have k successes among them. Then we can
consider the above in terms of one binomially-distributed random variable
X = X1 + · · · + Xn . That is, for X ∼ Bin(n, p),

(2.13)    pX (n, k, p) = C(n, k) p^k (1 − p)^{n−k} ,

where C(n, k) = n!/(k!(n − k)!) is the binomial coefficient. Show this is a
mass function with mean np.
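A short numerical sketch (mine, not the notes’) of Exercise 2.12, checking
that (2.13) sums to 1 and has mean np for one choice of n and p:

    from math import comb

    # Binomial mass function (2.13) for one choice of n and p.
    n, p = 10, 0.3
    pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

    print(sum(pmf))                                # ~1.0, so (2.13) is a mass function
    print(sum(k * q for k, q in enumerate(pmf)))   # ~3.0, i.e. the mean n * p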

We often wish to describe the probability of a frequency of independent events
in a given time interval.8 Consider a discrete variable X. Then X is said to
be Poisson distributed if, given a constant mean λ > 0, for each k ∈ SuppX =
{0, 1, 2, . . . },

(2.14)    pX (λ, k) = (λ^k / k!) e^{−λ} .
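As a check (a standard fact, not spelled out in the notes):
∑_{k=0}^{∞} pX (λ, k) = e^{−λ} ∑_{k=0}^{∞} λ^k / k! = e^{−λ} e^{λ} = 1 by the
exponential series, so (2.14) really is a mass function.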
Suppose we wished to model a continuous random variable X which has a
sharp peak at µ ∈ R. We can model this using the Laplace distribution,
writing X ∼ Laplace(µ, γ):

(2.15)    pX (x, µ, γ) = (1 / (2γ)) exp(−|x − µ| / γ),

where γ > 0 is a parameter that dictates the variance of X (we will define
this shortly). Finally we consider the Gaussian (normal) distribution.
Let X be a continuous variable, with parameters µ ∈ R and σ² > 0. Then
X ∼ N (µ, σ²) if

(2.16)    pX (x, µ, σ²) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²)).

Proposition 2.17. (Central Limit Theorem) The sum of mutually independent
variables X1 , . . . , Xn which all have the same probability distribution becomes
(after suitable centring and scaling) normally distributed as n → ∞.
7This can be formulated less handwavingly by considering the notion of a cumulative
distribution function.
8For example, the Poisson distribution could model the number of patients admitted to ED
between 12pm and 1am.

This is why Gaussian distributions are so important! The sum of any set
of continuous or discrete random variables (sampled independently from one
distribution) becomes approximately normally distributed as the number of
variables grows.9
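A minimal simulation sketch (my own; NumPy is an assumption) illustrating
Proposition 2.17: sums of independent Uniform(0, 1) variables have the mean
and variance the CLT picture predicts, and a histogram of them is visibly
bell-shaped.

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials = 50, 100_000

    # Sum n independent Uniform(0, 1) variables, repeated over many trials.
    sums = rng.random((trials, n)).sum(axis=1)

    # The empirical mean and variance match the predictions n/2 and n/12.
    print(sums.mean(), n / 2)
    print(sums.var(), n / 12)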

3. Expectation, Variance, Covariance and Correlation

Example 3.1. Define a (discrete) random variable X with states: student
has yellow hair X = 0, green hair X = 1, grey hair X = 2 or purple hair
X = 3, with probabilities P (0) = 1/50, P (1) = 3/50, P (2) = 4/50, and
P (3) = 21/25. If we calculate a weighted sum,

(3.2)    (1/50) × 0 + (3/50) × 1 + (4/50) × 2 + (21/25) × 3 = 2.74.

This number is close to 3, telling us that on average we expect to select a
student with purple-coloured hair (X = 3).
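The same weighted sum as a line of code (a sketch of the general pattern, not
from the notes):

    # The weighted sum (3.2) for the hair-colour example above.
    states = {0: 1/50, 1: 3/50, 2: 4/50, 3: 21/25}
    print(sum(p * x for x, p in states.items()))   # 2.74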

For a discrete variable X with S = SuppX finite or countable, and probabilities
{P (x) : x ∈ S}, define the expected value of X as

(3.3)    EX = ∑_{x∈S} x P (x).

For continuous variables, replace the sum over S with an integral over R.

Example 3.4. Suppose pX (x) = 2 − 2x for x ∈ (0, 1), and pX (x) = 0 when
x ∉ (0, 1). The expected value of X is given by

(3.5)    E(X) = ∫_R (2 − 2x)x dx = ∫_0^1 (2x − 2x²) dx = [x² − (2/3)x³]_0^1 = 1 − 2/3 = 1/3.

This makes sense: plot the density 2 − 2x and the integrand 2x − 2x² on (0, 1).
The operator E has some nice properties:

(1) The map E is R-linear. That is, for λ, µ ∈ R and random variables X
and Y , E(λX + µY ) = λE(X) + µE(Y ).
(2) The expectation of a random variable with a symmetric distribution
coincides with its axis of symmetry.
(3) Suppose for X : Ω → R, we have a map f : R → R. Then the
expectation of the map f ◦ X : Ω → R → R is

(3.6)    E(f ◦ X) = ∫_R f (x) pX (x) dx.

The same holds for discrete variables on replacing the integral over R
with a sum over SuppX.
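For instance (continuing Example 3.4, my own check): with pX (x) = 2 − 2x on
(0, 1) and f (x) = x², property (3) gives E(X²) = ∫_0^1 x²(2 − 2x) dx = 2/3 − 1/2 = 1/6,
so by Exercise 3.11 below, Var X = 1/6 − (1/3)² = 1/18.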

Sometimes, EX doesn’t exist. Suppose we select students until we find one
with purple hair, and (for this example) each selection succeeds with
probability 1/2, so the chance that exactly n selections are needed is 2^{−n}.
Suppose further that we win a prize of 2^n dollars when the n-th selection is
the first success, and let X be the prize, so SuppX = {2, 4, 8, . . . }. Then

(3.7)    E(X) = ∑_{n=1}^{∞} 2^n · (1/2)^n = 1 + 1 + 1 + 1 + . . .

9There are other ways to state the CLT more precisely.



It is clear that this sum diverges as n → ∞, so EX fails to exist.

Exercise 3.8. Prove that the mass function in the above example is normalised
(hint: geometric series).

Exercise 3.9. Show that for X and Y independent, E(XY ) = E(X)E(Y ).

To finish we list a few other numbers which help us characterise a distribution.
The variance of X gives a measure of how far the values of X typically lie
from the expected value, or how ’spread’ our data is:

(3.10)    Var(X) = E[(X − EX)²].

Exercise 3.11. Show that Var(X) = E(X²) − (EX)². Hence show that for
X ∼ Pois(λ), EX = λ and E(X²) = λ² + λ (use 3.3). Hence, show Var X = λ.
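A quick numerical check of Exercise 3.11 (my own sketch), using the Poisson
mass function (2.14) directly:

    from math import exp, factorial

    # Poisson mass function (2.14), truncated at K terms (the tail is negligible).
    lam, K = 4.0, 60
    pmf = [lam ** k * exp(-lam) / factorial(k) for k in range(K)]

    mean = sum(k * p for k, p in enumerate(pmf))
    second_moment = sum(k * k * p for k, p in enumerate(pmf))
    print(mean, second_moment - mean ** 2)   # both are roughly lam = 4.0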

Suppose we now want to analyse X ∼ N (µ, σ²). First we compute the
expected value:

(3.12)    EX = (1 / √(2πσ²)) ∫_R x exp(−(x − µ)² / (2σ²)) dx.

Making the substitution t = (x − µ)/√(2σ²) gives us

(3.13)    EX = (√(2σ²) / √(2πσ²)) ∫_R (t√(2σ²) + µ) exp(−t²) dt

(3.14)       = (1/√π) ( √(2σ²) [−(1/2) exp(−t²)]_{−∞}^{∞} + µ√π )

(3.15)       = (1/√π) (0 + µ√π) = µ.

So EX = µ. Notice we applied the Gaussian integral previously discussed (in
1.6) in (3.14). Similarly we can show Var X = σ². The square root of Var X
is usually called the standard deviation.

Finally we introduce covariance, a measure of how linearly related two random
variables are, defined as

(3.16)    Cov(X, Y ) = E[(X − EX)(Y − EY )].

Exercise 3.17. Show that if X and Y are independent then Cov(X, Y ) = 0
(use Exercise 3.9). By considering Var(X + Y ), find that covariance is an
obstruction to the additivity of variance.

Finally, let the correlation of X and Y be the normalised covariance,

(3.18)    Corr(X, Y ) = Cov(X, Y ) / √(Var X · Var Y ).

Notice that Corr(X, Y ) ∈ [−1, 1] always, and that if Corr(X, Y ) = ±1 then
Y = mX + b, for m, b ∈ R.
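A last numerical sketch (mine; NumPy is an assumption) showing (3.16)–(3.18)
in action: when Y is an exact linear function of X the correlation is ±1, and
adding noise pulls it strictly inside (−1, 1).

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=100_000)

    y_exact = -3.0 * x + 2.0                     # Y = mX + b with m < 0
    y_noisy = -3.0 * x + 2.0 + rng.normal(size=x.size)

    print(np.corrcoef(x, y_exact)[0, 1])         # about -1.0
    print(np.corrcoef(x, y_noisy)[0, 1])         # strictly between -1 and 0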
