
CS 215

Data Analysis and Interpretation


Expectation
Suyash P. Awate
Expectation
• “Expectation” of the random variable;
“Expected value” of the random variable;
“Mean” of the random variable.
• “Expected value” isn’t necessarily the value
that is most likely to be observed in the
random experiment
• Can think of it as the center of mass of
the probability mass/density function
Expectation
• Definition:
Expectation of a Discrete Random Variable: E[X] := ∑_i x_i P(X = x_i)
• Frequentist interpretation of probabilities and expectation
• If a random experiment is repeated infinitely many times,
then the proportion of number of times event E occurs is the probability P(E)
• If a random experiment underlying a discrete random variable X
is repeated infinitely many times,
then the proportion of number of experiments when X takes value x is P(X=x)
• So, in N→∞ experiments, number of times X takes value xi will → N.P(X=xi)
• So, across all N→∞ experiments,
arithmetic average of observed values will
→ (1/N) ∑_i x_i · (N·P(X=x_i)) = ∑_i x_i P(X=x_i)
= E[X]
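The following is a minimal numerical sketch of the frequentist interpretation above (not from the slides; Python/NumPy, the seed, and the sample sizes are illustrative assumptions): averaging many simulated die rolls approaches E[X] = ∑_i x_i P(X=x_i) = 3.5.

```python
# Minimal sketch (illustrative assumption, not from the slides): frequentist
# view of expectation. The arithmetic average of N simulated fair-die rolls
# should approach E[X] = 3.5 as N grows.
import numpy as np

rng = np.random.default_rng(0)
values = np.arange(1, 7)                      # faces {1,...,6}, each with P = 1/6
expectation = np.sum(values * (1.0 / 6.0))    # E[X] = sum_i x_i P(X = x_i)

for n in [100, 10_000, 1_000_000]:
    rolls = rng.integers(1, 7, size=n)        # n independent die rolls
    print(n, rolls.mean())                    # sample average -> 3.5

print("E[X] =", expectation)                  # 3.5
```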
Expectation
• Another Formulation of Expectation
• Recall:
• Discrete random variable X is a function defined on a probability space {Ω,ℬ,P}
• Function X:Ω→R, maps each element in sample space Ω to a single numerical value
belonging to the set of real numbers

[Figure: the random variable as a map X(·), taking s ∈ Ω to x = X(s) ∈ R]
• E[X] := ∑_i x_i P(X = x_i) = ∑_{s∈Ω} X(s) P(s)
Expectation
• Example
• “Expected value” for the uniform random variable modelling die roll
• Values on die are {1,2,3,4,5,6}
• E[X] = 3.5
• Expectation of a uniform random variable (discrete case)
• If X has uniform distribution over n consecutive integers over [a,b],
then E[X] = (a+b)/2
Expectation
• Example
• Expectation of a binomial random variable (when n=1, this is Bernoulli)

• E[X] = ∑_{k=0}^{n} k · C(n,k) p^k (1−p)^{n−k}
= np ∑_{j=0}^{m} C(m,j) p^j (1−p)^{m−j}   (substituting j := k − 1, m := n − 1)
= np   (the remaining sum adds a Binomial(m,p) PMF over its full support, so it equals 1)
Expectation
• Example
• Expectation of a Poisson random variable
• Consider random arrivals/hits occurring at a constant average rate λ>0,
i.e., λ arrivals/hits (typically) per unit time
• E[X] = ∑_{k=0}^{∞} k · e^{−λ} λ^k / k! = λ e^{−λ} ∑_{j=0}^{∞} λ^j / j! = λ   (substituting j := k − 1)
• This gives meaning to parameter λ as the average number of arrivals in unit time


Expectation
• Definition:
"
Expectation of a Continuous Random variable: E[X] :=∫!" 𝑥𝑃 𝑥 𝑑𝑥
• Frequentist interpretation of probabilities and expectation
• If a random experiment underlying a continuous random variable X
is repeated N→∞ times,
then,
for a tiny interval [x,x+Δx],
the proportion of time X takes values within interval is approximately P(x)Δx
• So, in N→∞ experiments,
number of times we will get X within [xi,xi+Δx] is approximately N.P(xi)Δx
• So, across all N→∞ experiments,
arithmetic average of all observed values is
approximately (1/N) ∑i (xi) (N.P(xi)Δx)
• In the limit that Δx→0, this average→E[X]
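A small sketch of the discretization argument above, assuming Python/NumPy and an arbitrary Gaussian PDF with mean μ = 2 (both are my assumptions, not from the slides): the Riemann-style sum ∑_i x_i P(x_i) Δx approximates ∫ x P(x) dx.

```python
# Minimal sketch (illustrative assumption): approximate E[X] = ∫ x P(x) dx by
# the discretization sum_i x_i P(x_i) Δx used in the frequentist argument,
# here for a Gaussian PDF with mean mu = 2.
import numpy as np

mu, sigma = 2.0, 1.0
dx = 1e-3
x = np.arange(mu - 10 * sigma, mu + 10 * sigma, dx)   # grid covering the mass
pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

approx_E = np.sum(x * pdf * dx)   # sum_i x_i P(x_i) Δx
print(approx_E)                   # ≈ 2.0, i.e., ≈ mu
```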
Expectation
• Another Formulation of Expectation
[Figure: the random variable as a map X(·), taking s ∈ Ω to x = X(s)]
• Recall:
• Random variable X is a function defined on a probability space {Ω,ℬ,P}
• Function X:Ω→R, maps each element in sample space Ω to a single numerical value
belonging to the set of real numbers

• E[X] := ∫_{−∞}^{+∞} x P(x) dx = ∫_Ω X(s) P(s) ds
• Intuition remains the same as in the discrete case
• Using probability-mass conservation:
P(x)Δx over a tiny interval [x, x+Δx] is approximated by P(s1)Δs1 + P(s2)Δs2 + …,
where [s1, s1+Δs1], [s2, s2+Δs2], … are the intervals in Ω that X(·) maps into [x, x+Δx]
• Thus, x·P(x)Δx is approximated by X(s1)·P(s1)Δs1 + X(s2)·P(s2)Δs2 + …
• A more rigorous proof needs advanced results in real analysis
Expectation
• Mean as the center of mass

[Figure: PDF P(x) balanced at a fulcrum placed at m]
• By definition,
mean m := E[X] := ∫_x x P(x) dx
• Thus, ∫_x (x − m) P(x) dx = 0
• Mass P(x)dx
placed around location ‘x’
applies a torque ∝ P(x)dx·(x − m)
at the fulcrum placed at location ‘m’
• Because the integral ∫_x (x − m) P(x) dx is zero,
the net torque around the fulcrum ‘m’ is zero
• Hence, ‘m’ is the center of mass
Expectation
• Example
• Expectation of a uniform random variable (continuous case)
• If X is uniform over [a,b], then E[X] = ∫_a^b x · 1/(b−a) dx = (a+b)/2
Expectation
• Example
• Expectation of an exponential random variable
• PDF: P(x) = 0 for all x < 0; P(x) = λ·exp(−λx) for all x ≥ 0
• CDF: F(x) = 0 for all x < 0; F(x) = 1 − exp(−λx) for all x ≥ 0
• Consider random arrivals/hits occurring
at a constant average rate λ > 0
• Define β := 1/λ

• E[X] = ∫_0^∞ x · λ exp(−λx) dx = 1/λ = β   (integrate by parts)
• This gives meaning to parameter β as the average inter-arrival time
• A larger arrival/hit rate λ leads to a shorter average inter-arrival time β
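A hedged numerical check, assuming Python/NumPy (the rate λ = 2.5 is an arbitrary choice, not from the slides): both numerical integration of x·λexp(−λx) and averaging simulated inter-arrival times give ≈ β = 1/λ.

```python
# Minimal sketch (illustrative assumption): E[X] = β = 1/λ for an exponential
# random variable, checked by numerical integration and by sampling.
import numpy as np

lam = 2.5                      # arrival rate λ (arbitrary choice)
beta = 1.0 / lam               # claimed mean inter-arrival time

dx = 1e-4
x = np.arange(0.0, 50.0 / lam, dx)
pdf = lam * np.exp(-lam * x)
print(np.sum(x * pdf) * dx)    # ≈ 0.4 = 1/λ

rng = np.random.default_rng(1)
samples = rng.exponential(scale=beta, size=1_000_000)
print(samples.mean())          # ≈ 0.4 as well
```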
Expectation
• Example
• Expectation of a Gaussian random variable
• E[X] = ∫_{−∞}^{+∞} x · N(x; μ, σ2) dx = μ, since the PDF is symmetric about μ
Expectation
• Example
• Expectation of a limiting case of binomial
• As n tends to infinity, the binomial PMF tends to a “Gaussian” form
• Gaussian expectation μ (= np here) is consistent with binomial expectation np
Expectation
• Linearity of Expectation
• For both discrete and continuous random variables
• For random variables X and Y having a joint probability space (Ω,ẞ,P),
the following rules hold:
• E[X + Y] = E[X] + E[Y]
• Either LHS = ∑_x ∑_y (x+y) P(x,y) = ∑_x x [∑_y P(x,y)] + ∑_y y [∑_x P(x,y)] = E[X] + E[Y] = RHS
• Or LHS = ∫_x ∫_y (x+y) P(x,y) dx dy = ∫_x x [∫_y P(x,y) dy] dx + ∫_y y [∫_x P(x,y) dx] dy = RHS
• E[X + c] = E[X] + c, where ‘c’ is a constant

• E[a X] = a E[X], where ‘a’ is a scalar constant

• This generalizes to: E[a1X1 + … + anXn + c] = a1E[X1] + … + anE[Xn] + c
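A quick sketch of linearity on simulated data, assuming Python/NumPy (the distributions and constants are arbitrary illustrations, not from the slides); note that no independence between X and Y is needed.

```python
# Minimal sketch (illustrative assumption): linearity of expectation checked
# on samples of two dependent random variables X and Y.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(1.0, 2.0, size=1_000_000)
y = 3.0 * x + rng.uniform(-1.0, 1.0, size=x.size)   # Y depends on X

a, c = 4.0, -2.0
print((x + y).mean(), x.mean() + y.mean())          # E[X+Y] = E[X] + E[Y]
print((a * x + c).mean(), a * x.mean() + c)         # E[aX+c] = aE[X] + c
```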


Expectation
• Expectation of a “function of a random variable”
• Let us define values y := Y(x), or “Y(.) is a function of the random variable X”

[Figure: composition of maps: s —X(·)→ x := X(s) —Y(·)→ y := Y(x) = Y(X(s))]

• Discrete random variable: E[Y(X)] := E_{P(X)}[Y(X)] := ∑_i Y(x_i) P(x_i)


• Continuous random variable: E[Y(X)] := E_{P(X)}[Y(X)] := ∫_x Y(x) P(x) dx
• Property:
• Just as EP(S)[X(S)] = EP(X)[X], …
• … we get EP(X)[Y(X)] = EP(Y)[Y]
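A minimal sketch of the “function of a random variable” formula, assuming Python/NumPy and taking Y := X^2 for a fair die (my choice of example): ∑_i Y(x_i) P(x_i) matches the sample average of Y, i.e., E_{P(Y)}[Y].

```python
# Minimal sketch (illustrative assumption): E[Y(X)] = sum_i Y(x_i) P(x_i)
# for a fair die and Y := X^2; averaging Y over simulated rolls agrees.
import numpy as np

x_vals = np.arange(1, 7)
p_x = np.full(6, 1.0 / 6.0)

E_Y_via_PX = np.sum((x_vals ** 2) * p_x)   # sum_i Y(x_i) P(x_i) = 91/6
print(E_Y_via_PX)                           # ≈ 15.1667

rng = np.random.default_rng(3)
rolls = rng.integers(1, 7, size=1_000_000)
print((rolls ** 2).mean())                  # ≈ 15.17, i.e., E_{P(Y)}[Y]
```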
Expectation
• Expectation of a function of multiple random variables
• Definition: When we have multiple random variables X1,…,Xn with
a joint PMF/PDF P(X1,…,Xn) and
a function of the multiple random variables g(X1,…,Xn),
then we define the expectation of g(X1,…,Xn) as:
E[g(X1,…,Xn)] := ∑_{x1,…,xn} g(x1,…,xn) P(X1=x1,…,Xn=xn)
or
E[g(X1,…,Xn)] := ∫_{x1,…,xn} g(x1,…,xn) P(x1,…,xn) dx1…dxn

• If X and Y are independent, then E[XY] = E[X] E[Y]


• Proof:
• ∑_{x,y} x y P(X=x, Y=y) = ∑_{x,y} x y P(X=x) P(Y=y) = [∑_x x P(X=x)] [∑_y y P(Y=y)] = E[X] E[Y]
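A small numerical illustration of the product rule, assuming Python/NumPy with arbitrarily chosen distributions (not from the slides): the rule holds for independent X and Y and visibly fails for dependent ones.

```python
# Minimal sketch (illustrative assumption): E[XY] = E[X]E[Y] for independent
# X and Y; for dependent variables it generally fails.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(2.0, 1.0, size=1_000_000)
y = rng.exponential(scale=3.0, size=x.size)     # independent of x

print((x * y).mean(), x.mean() * y.mean())      # nearly equal

z = x + rng.normal(0.0, 0.1, size=x.size)       # strongly dependent on x
print((x * z).mean(), x.mean() * z.mean())      # differ by ≈ Var(X) = 1
```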
Expectation
• Tail-sum formula
• Let X be a discrete random variable taking values in the set of natural numbers
• Then, E[X] = ∑_{k=1}^{∞} P(X ≥ k)
• Proof: arrange x copies of P(X=x) in row x of a triangular array:
P(X=1)
P(X=2)  P(X=2)
P(X=3)  P(X=3)  P(X=3)
P(X=4)  P(X=4)  P(X=4)  P(X=4)  …
Sum over rows (row number = x): ∑_x x·P(X=x) = E[X]
Sum over columns (column number = k): ∑_k P(X ≥ k)
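A hedged check of the tail-sum formula on a fair die, assuming Python/NumPy (the example distribution is my choice, not from the slides).

```python
# Minimal sketch (illustrative assumption): tail-sum formula
# E[X] = sum_{k>=1} P(X >= k) for a discrete X on {1,...,6} (a fair die).
import numpy as np

p = np.full(6, 1.0 / 6.0)            # P(X = x), x = 1..6
x = np.arange(1, 7)

direct = np.sum(x * p)               # sum_x x P(X=x) = 3.5
tails = np.sum([p[x >= k].sum() for k in range(1, 7)])   # sum_k P(X >= k)
print(direct, tails)                 # both 3.5
```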


Expectation
• Tail-sum formula
• Let X be a continuous random variable taking non-negative values
• Notation: For random variable X, PDF is fX(.) and CDF is FX(.)
• Then, E[X] = ∫_0^∞ P(X > t) dt = ∫_0^∞ (1 − F_X(t)) dt
• Proof: ∫_0^∞ P(X > t) dt = ∫_{t=0}^∞ ∫_{x=t}^∞ f_X(x) dx dt
= ∫_{x=0}^∞ [∫_{t=0}^x dt] f_X(x) dx   (swap the order of integration over the region 0 ≤ t ≤ x)
= ∫_{x=0}^∞ x f_X(x) dx = E[X]
Expectation in Life
• Action without expectation → Happiness [Indian Philosophy]
Quantile, Quartile
• Definition: For a discrete/continuous random variable
with a PMF/PDF P(.), the q-th quantile
(where 0<q<1) is any real number ‘xq’
such that P(X≤xq) ≥ q and P(X≥xq) ≥ 1-q
• Quartiles: q = 0.25 (1st quartile),
q = 0.5 (2nd), q = 0.75 (3rd)
• Percentiles
• q = 0.25 → 25th percentile
• Box plot,
box-and-whisker plot
• Inter-Quartile Range
(IQR)
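A minimal sketch of sample quartiles and the IQR, assuming Python/NumPy (the Gaussian data and its parameters are illustrative assumptions); np.quantile computes the q-th sample quantile.

```python
# Minimal sketch (illustrative assumption): sample quartiles and the
# inter-quartile range (IQR), the quantities summarized by a box plot.
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(loc=10.0, scale=2.0, size=100_000)

q1, q2, q3 = np.quantile(data, [0.25, 0.50, 0.75])
print("quartiles:", q1, q2, q3)      # ≈ 8.65, 10.0, 11.35 for N(10, 2^2)
print("IQR:", q3 - q1)               # ≈ 2 * 1.349 ≈ 2.70
```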
Quantile, Median
• Definition:
For a discrete/continuous random variable with a PMF/PDF P(.),
the median is any real number ‘m’
such that P(X≤m) ≥ 0.5 and P(X≥m) ≥ 0.5
• Median = second quartile
• Definition:
For a continuous random variable with a PDF P(.),
the median is any real number ‘m’
such that P(X≤m) = P(X>m)
• Equivalently, CDF F_X(m) = 0.5
• A PDF can be associated with multiple medians
Mode
• For discrete X
• Mode m is a value for which the PMF value P(X=m) is maximum
• A PMF can have multiple modes
• For continuous X
• Mode ‘m’ is any local maximum of the PDF P(.)
• A PDF can have multiple modes
• Unimodal PDF = A PDF having only 1 local maximum
• Bimodal PDF:
2 local maxima
• Multimodal PDF:
2 or more
local maxima
Mean, Median, Mode
• For continuous X, for unimodal and symmetric distributions,
mode = mean = median
• Assuming symmetry
around mode,
mass on left of mode =
mass on right of mode
• So, mode = median
• Assuming symmetry
around mode,
every P(x)dx mass on left of mode
is matched by
a P(x)dx mass on right of mode
• So, mode = mean
Variance
• Definition: Var(X) := E[(X-E[X])2]
• A measure of the spread of the mass (in PMF or PDF) around the mean
• Property: Variance is always non-negative
• Property: Var(X) = E[X2] – (E[X])2
• Proof: LHS =
E[(X-E[X])2]
= E[ X2 + (E[X])2 – 2.X.E[X] ]
= E[X2] + (E[X])2 – 2(E[X])2
= E[X2] – (E[X])2 = RHS
• Definition: Standard deviation is the square root of the variance
• Units of variance = square of units of values taken by random variable
• Units of standard deviation = units of values taken by random variable
Variance
• Variance of a Uniform Random Variable
• Discrete case
• X has uniform distribution over n integers {a, a+1, …, b}
• Here, n = b–a+1
• Variance = (n2 – 1) / 12
Variance
• Variance of a Binomial Random Variable
• Var(X) = E[X2] – (E[X])2 , where E[X] = np
Variance
• Variance of a Binomial Random Variable
• Var(X) = E[X2] – (E[X])2 , where E[X] = np
• So, E[X2]
= np (mp + 1)
= np ((n–1)p + 1)
= (np)2 + np(1-p)
• Thus, Var(X) = np(1–p) = npq
• Interpretation
• When p=0 or p=1,
then Var(X) = 0,
which is the minimum possible
• When p=q=0.5,
then Var(X) is maximized
Variance
• Variance of a Poisson Random Variable
• Var(X) = E[X2] – (E[X])2 , where E[X] = λ
Variance
• Variance of a Poisson Random Variable
• Var(X) = E[X2] – (E[X])2 , where E[X] = λ
• So, E[X2]
= λ (λ.1 + 1)
= λ2 + λ
• Thus, Var(X) = λ
• Interpretation
• Mean of Poisson random variable was also λ
• Standard deviation of Poisson random variable is λ^0.5
• As mean increases, so does variance (and standard deviation)
• When the mean increases by a factor of N (i.e., N times larger signal = number of arrivals/hits),
the standard deviation (spread) increases only by a factor of N^0.5
• So, as N increases,
the variability in the number of arrivals/hits, relative to the average arrival/hit rate, decreases
Variance
• Variance of a Uniform Random Variable
• Continuous case
• X has uniform distribution over [a,b]
• Variance = (b – a)2 / 12
Variance
• Variance of an Exponential Random Variable
• PDF: P(x) = 0 for all x < 0; P(x) = λ·exp(−λx) for all x ≥ 0
• CDF: F(x) = 0 for all x < 0; F(x) = 1 − exp(−λx) for all x ≥ 0
• Var(X) = E[X2] – (E[X])2, where E[X] = β := 1/λ

• E[X2] = ∫_0^∞ x2 · λ exp(−λx) dx = 2/λ2 = 2β2   (integrate by parts twice)
• So, Var(X) = 2β2 − β2 = β2, i.e., β = E[X] = SD(X); unlike the Poisson, where Var(X) equals the mean


Variance
• Variance of a Gaussian Random Variable
• Var(X) = E[X2] – (E[X])2, where E[X] = μ
Variance
• Variance of a Gaussian Random Variable
• Var(X) = E[X2] – (E[X])2 , where E[X] = μ

• Derivation hint: write t2·exp(−t2) as t · (t·exp(−t2)) and integrate by parts; this yields Var(X) = σ2
Variance
• Example
• Variance of a limiting case of binomial
• As n tends to infinity, the binomial PMF tends to a Gaussian form
• Gaussian variance σ2 (= npq in this case) is consistent with binomial variance npq
Variance
• Property: Var(aX+c) = a2Var(X)
• Adding a constant to a random variable doesn’t change the variance (spread)
• This only shifts the PDF/PMF
• If Y := X + c, then Var(Y) = Var(X)
• If we scale a random variable by ‘a’, then the variance gets scaled by a2
• If Y := aX, then Var(Y) = a2Var(X)
• Proof: Var(aX+c) = E[(aX + c − aE[X] − c)2] = E[a2(X − E[X])2] = a2Var(X)
Variance
• Property: Var(X+Y) = Var(X) + Var(Y) + 2(E[XY] – E[X]E[Y])
• Proof:
Var(X+Y) = E[(X+Y)2] − (E[X+Y])2
= E[X2] + E[Y2] + 2E[XY] − (E[X])2 − (E[Y])2 − 2E[X]E[Y]
= Var(X) + Var(Y) + 2(E[XY] − E[X]E[Y])
• If X and Y are independent,
then E[XY] = E[X] E[Y], and so Var(X+Y) = Var(X) + Var(Y)
• If X,Y,Z are independent, then
Var(X+Y+Z) = Var(X+Y) + Var(Z) = Var(X) + Var(Y) + Var(Z)
• For independent random variables X1, …, Xn;
Var(X1 + … + Xn) = Var(X1) + … + Var(Xn)
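A quick numerical check, assuming Python/NumPy with arbitrary distributions (not from the slides): the variance of a sum needs the cross term 2(E[XY] − E[X]E[Y]) unless X and Y are independent.

```python
# Minimal sketch (illustrative assumption): Var(X+Y) equals
# Var(X) + Var(Y) + 2(E[XY] − E[X]E[Y]); the cross term vanishes when X and Y
# are independent.
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(0.0, 1.0, size=1_000_000)
y = rng.normal(0.0, 2.0, size=x.size)            # independent of x

print(np.var(x + y), np.var(x) + np.var(y))      # ≈ 5 in both cases

w = x + rng.normal(0.0, 0.5, size=x.size)        # dependent on x
cross = (x * w).mean() - x.mean() * w.mean()     # ≈ Var(X) = 1
print(np.var(x + w), np.var(x) + np.var(w) + 2 * cross)
```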
Markov’s Inequality
• Theorem: Let X be a random variable with PDF P(.).
Let u(.) be a non-negative-valued function.
Let ‘c’ be a positive constant.
Then, P(u(X) ≥ c) ≤ E[u(X)] / c
• Proof:
• E[u(X)] = ∫x:u(x)≥c u(x) P(x) dx + ∫x:u(x)<c u(x) P(x) dx
• Because u(.) takes non-negative values, each integral above is non-negative
• So, E[u(X)] ≥ ∫x:u(x)≥c u(x) P(x) dx
≥ c ∫x:u(x)≥c P(x) dx
= c P(u(X) ≥ c)
• Because c>0, we get E[u(X)]/c ≥ P(u(X) ≥ c)
• Special case → when X takes non-negative values and u(x) := x, we get P(X ≥ c) ≤ E[X]/c
Chebyshev’s Inequality
• Recall Markov’s inequality: P(u(X) ≥ c) ≤ E[u(X)] / c
• Theorem: Let X be a random variable with PDF P(.),
finite expectation E[X], and finite variance Var(X).
Then, P(|X-E[X]| ≥ a) ≤ Var(X) / a2
• Proof:
• Define random variable u(X) := (X-E[X])2
• Then, by Markov’s inequality, P(u(X) ≥ a2) ≤ E[u(X)] / a2
• LHS = P(|X-E[X]| ≥ a)
• RHS = Var(X) / a2
• Q.E.D.
• Corollary: If random variable X has standard deviation σ, then
P(|X-E[X]| ≥ kσ) ≤ 1/k2
• This is consistent with the notion of standard deviation (σ) or variance (σ2)
measuring the spread of the PDF around the mean (center of mass)
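A hedged empirical check of both inequalities, assuming Python/NumPy and an exponential example of my choosing (E[X] = Var(X) = 1), not taken from the slides.

```python
# Minimal sketch (illustrative assumption): empirical check of Markov's and
# Chebyshev's inequalities on an exponential random variable.
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=1.0, size=1_000_000)   # E[X] = 1, Var(X) = 1, X >= 0

c = 3.0
print((x >= c).mean(), x.mean() / c)             # P(X >= c)  <=  E[X]/c

k, mu, sigma = 2.0, x.mean(), x.std()
print((np.abs(x - mu) >= k * sigma).mean(), 1.0 / k**2)   # P(|X-mu| >= k*sigma) <= 1/k^2
```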
Chebyshev’s Inequality
Chebyshev
• Pafnuty Chebyshev
• Founding father of Russian mathematics
• Students: Lyapunov, Markov
• First person to think
systematically in terms of
random variables and their
moments and expectations
Markov
• Andrey Markov
• Russian mathematician best known for
his work on stochastic processes
• Advisor: Chebyshev
• Students: Voronoy
• One year after doctoral defense,
appointed extraordinary professor
• He figured out that he could use chains to model
the alliteration of vowels and consonants
in Russian literature
Jensen’s Inequality
• Theorem: Let X be any random variable; f(.) be any convex function.
Then, E[f(X)] ≥ f(E[X])
(A real-valued function is called convex if the line segment between any two points
on the graph of the function lies above/never below the graph between the two points.)
• Proof:
• Let m := E[X], can be anywhere on real line
• Consider a tangent (subderivative line) to f(.) at [m,f(m)]
• This line is, say, Y = aX+b,
which lies at/below (never above) f(X)
• Then, f(m) = am+b
• Then,
E[f(X)] ≥ E[aX+b]
= aE[X] + b
= f(E[X])
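A minimal sketch of Jensen’s inequality, assuming Python/NumPy and two standard convex functions (x^2 and exp) as illustrative choices not taken from the slides.

```python
# Minimal sketch (illustrative assumption): Jensen's inequality
# E[f(X)] >= f(E[X]) for convex f, checked with f(x) = x^2 and f(x) = exp(x).
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(0.5, 1.0, size=1_000_000)

print((x ** 2).mean(), x.mean() ** 2)        # E[X^2] >= (E[X])^2
print(np.exp(x).mean(), np.exp(x.mean()))    # E[exp(X)] >= exp(E[X])
```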
Jensen’s Inequality
• Corollary: Let X be any random variable; g(.) be any concave function.
Then, E[g(X)] ≤ g(E[X])
(A real-valued function is called concave if the line segment between any two points
on the graph of the function lies below/never above the graph between the two points.)
• Proof:
• Let m := E[X], can be anywhere on real line
• Consider a tangent (subderivative line) to g(.) at [m,g(m)]
• This line is, say, Y = aX+b,
which lies at/above (never below) g(X)
• Then, g(m) = am+b
• Then,
E[g(X)] ≤ E[aX+b]
= aE[X] + b
= g(E[X])
Jensen
• Johan Jensen
• Danish mathematician and engineer
• President of the Danish Mathematical Society
from 1892 to 1903
• Never held any academic position
• Engineer for Copenhagen Telephone Company
• Became head of its technical department
• Learned advanced math topics by himself
• All his mathematics research
was carried out in his spare time
Minimizer of Expected Absolute Deviation
• Theorem: E[|X – c|] is minimum when c = Median(X)
• Case 1: Let c ≤ m := Median(X)
• E[|X – c|] = ∫_{−∞}^{c} (c − x) P(x) dx + ∫_{c}^{∞} (x − c) P(x) dx   (say, A + B)
• A = ∫_{−∞}^{m} (c − x) P(x) dx − ∫_{c}^{m} (c − x) P(x) dx   (say, A1 – A2)
• B = ∫_{c}^{m} (x − c) P(x) dx + ∫_{m}^{∞} (x − c) P(x) dx   (say, B1 + B2)
• Now, B1 – A2 = 2 ∫_{c}^{m} (x − c) P(x) dx ≥ 0
• A1 = ∫_{−∞}^{m} (c − m) P(x) dx + ∫_{−∞}^{m} (m − x) P(x) dx   (say, A11 + A12)
• B2 = ∫_{m}^{∞} (x − m) P(x) dx + ∫_{m}^{∞} (m − c) P(x) dx   (say, B21 + B22)
• Now, A11 + B22 = –(m–c) (1–P(X≥m)) + (m–c) P(X≥m) = (m–c) (2P(X≥m)–1) ≥ 0
• Now, A12 + B21 = E[|X – m|]
• So, A+B = E[|X – m|] + (m–c) (2P(X≥m) – 1) + 2 ∫_{c}^{m} (x − c) P(x) dx
• Value of c minimizing A+B is c = m
Minimizer of Expected Absolute Deviation
• Theorem: E[|X – c|] is minimum when c = Median(X)
• Case 2: Let m := Median(X) ≤ c
• E[|X – c|] = ∫_{−∞}^{c} (c − x) P(x) dx + ∫_{c}^{∞} (x − c) P(x) dx   (say, A + B)
• A = ∫_{−∞}^{m} (c − x) P(x) dx + ∫_{m}^{c} (c − x) P(x) dx   (say, A1 + A2)
• B = − ∫_{m}^{c} (x − c) P(x) dx + ∫_{m}^{∞} (x − c) P(x) dx   (say, – B1 + B2)
• Now, A2 – B1 = 2 ∫_{m}^{c} (c − x) P(x) dx ≥ 0
• A1 = ∫_{−∞}^{m} (c − m) P(x) dx + ∫_{−∞}^{m} (m − x) P(x) dx   (say, A11 + A12)
• B2 = ∫_{m}^{∞} (x − m) P(x) dx + ∫_{m}^{∞} (m − c) P(x) dx   (say, B21 + B22)
• Now, A11 + B22 = (c–m) P(X≤m) – (c–m) (1–P(X≤m)) = (c–m) (2P(X≤m)–1) ≥ 0
• Now, A12 + B21 = E[|X – m|]
• So, A+B = E[|X – m|] + (c–m) (2P(X≤m) – 1) + 2 ∫_{m}^{c} (c − x) P(x) dx
• Value of c minimizing A+B is c = m
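A small numerical sketch of the theorem, assuming Python/NumPy and a skewed exponential sample (my choice, so that mean and median differ): scanning candidate constants c shows E[|X − c|] is smallest near the sample median.

```python
# Minimal sketch (illustrative assumption): among all constants c, the sample
# median minimizes the mean absolute deviation E[|X − c|].
import numpy as np

rng = np.random.default_rng(9)
x = rng.exponential(scale=1.0, size=200_000)     # skewed, so mean != median

cs = np.linspace(0.0, 3.0, 601)                  # candidate constants c
mad = [np.abs(x - c).mean() for c in cs]         # mean absolute deviation per c
best_c = cs[int(np.argmin(mad))]

print(best_c, np.median(x))                      # both ≈ ln 2 ≈ 0.693
print(x.mean())                                  # ≈ 1.0, clearly different
```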
Mean, Median, Standard Deviation
• Theorem:
Mean(X) and Median(X) are within a distance of SD(X) of each other
• Proof:
• Distance between mean and median
= |E[X] – Median(X)|
= |E[X – Median(X)]|
This is |E[.]|, where |.| is a convex function. Apply Jensen’s inequality.
≤ E[|X – Median(X)|]
≤ E[|X – E[X]|] (because Median(X) minimizes expected absolute deviation)
= E[Sqrt{ (X – E[X])2 }]
This is E[Sqrt(.)], where Sqrt(.) is a concave function. Apply Jensen’s inequality.
≤ Sqrt{ E[ (X – E[X])2 ] }
= Sqrt{ Var(X) } = SD(X)
Law of Large Numbers
• This justifies why the expectation is motivated as an average over a
large number of random experiments (“long-term average”)
• Let random variables X1, …, Xi, …, Xn be ‘n’ independent and identically
distributed (i.i.d.), each with mean μ=E[Xi] and finite variance v=Var(Xi)
• Let the average, over ‘n’ experiments, be modeled by
a random variable X̄ := (X1 + … + Xn) / n
• Then, the expected average E[X̄] = μ, by the linearity of expectation
• But, in specific runs, how close is X̄ to the expectation μ ?
• So, we analyze the spread of X̄ around μ
• Var(X̄) = Var(X1/n) + … + Var(Xn/n) = n(v/n2) = v/n
Law of Large Numbers
• This justifies why the expectation is motivated as an average over a
large number of random experiments
• Law of large numbers: For all ε > 0, as n→∞, P(|X̄ – μ| ≥ ε) → 0
• Proof: Using Chebyshev’s inequality,
P(|X̄ – μ| ≥ ε)
≤ Var(X̄) / ε2
= v / (nε2)
→ 0, as n→∞
• Thus, as the average X̄ uses data from a larger number of experiments ‘n’,
the event of “X̄ being farther from μ than ε” has a probability that tends to 0
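A hedged simulation of the law of large numbers, assuming Python/NumPy and Uniform[0,1] data (μ = 0.5, v = 1/12); the run counts and ε are arbitrary choices, not from the slides. The empirical probability of |X̄ − μ| ≥ ε falls with n and stays below the Chebyshev bound v/(nε²).

```python
# Minimal sketch (illustrative assumption): law of large numbers for
# Uniform[0,1] data, where μ = 0.5 and v = Var(X_i) = 1/12.
import numpy as np

rng = np.random.default_rng(10)
mu, v, eps, runs = 0.5, 1.0 / 12.0, 0.05, 5000

for n in [10, 100, 1000]:
    xbar = rng.uniform(0.0, 1.0, size=(runs, n)).mean(axis=1)  # X̄ over 'runs' repetitions
    p_far = (np.abs(xbar - mu) >= eps).mean()                  # empirical P(|X̄ − μ| ≥ ε)
    print(n, p_far, v / (n * eps ** 2))                        # vs Chebyshev bound v/(nε²)
```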
Law of Large Numbers
• Example
• This also gives us a way to
compute an “estimate” of
the expectation μ of a
random variable X
from “observations”/data
• What is the estimate?
• X̄, the average of the observed values
Law of Large Numbers

[Figure source: www.nature.com/articles/nmeth.2613]
Covariance
• For random variables X and Y, consider the joint PMF/PDF P(X,Y)
• Covariance: A measure of how the values taken by X and Y vary
together (“co”-“vary”)
• Definition: Cov(X,Y) := E[(X – E[X])(Y – E[Y])]
• Interpretation:
• Define U(X) := X – E[X] and V(Y) := Y – E[Y] (Note: U and V have expectation 0)
• In the joint distribution P(U,V),
if larger (more +ve) values of U typically correspond to larger values of V, and
smaller (more –ve) values of U typically correspond to smaller values of V,
then U and V co-vary positively
• In the joint distribution P(U,V),
if larger values of U typically correspond to smaller values of V, and …
then U and V co-vary negatively
• Property: Symmetry: Cov(X,Y) = Cov(Y,X)
Covariance
• Examples
Covariance
• Property: Cov(X,Y) = E[XY] – E[X]E[Y]
• Proof:
• Cov(X,Y) = E[(X – E[X])(Y – E[Y])] = E[XY] – E[X]E[Y] – E[X]E[Y] + E[X]E[Y] = E[XY] – E[X]E[Y]
• So, Var(X+Y) = Var(X) + Var(Y) + 2(E[XY] – E[X]E[Y]) = Var(X) + Var(Y) + 2Cov(X,Y)
• Also, when X and Y are independent, then Cov(X,Y) = 0
• Property: When Var(X) and Var(Y) are finite, and one of them is 0,
then Cov(X,Y)=0
• Property: When Y := mX + c (with finite m), what is Cov(X,Y) ?
• Cov(X,Y) = E[XY] – E[X]E[Y]
= E[mX2 + cX] – E[X](m.E[X] + c)
= m.E[X2] – m(E[X])2 = m.Var(X)
• When Var(X)>0, covariance is ∝ line-slope ‘m’, and has same sign as that of m
Covariance
• Bilinearity of Covariance
• Let X, X1, X2, Y, Y1, Y2 be random variables. Let ‘a’ be a scalar constant.
• Property: Cov(X1 + X2, Y) = Cov(X1, Y) + Cov(X2, Y) = Cov(Y, X1 + X2)
• Proof (first part; second part follows from symmetry):
• Cov(X1 + X2, Y) = E[(X1 + X2)Y] − E[X1 + X2]E[Y]
= (E[X1Y] − E[X1]E[Y]) + (E[X2Y] − E[X2]E[Y]) = Cov(X1, Y) + Cov(X2, Y)
• Property: Cov(aX, Y) = a.Cov(X, Y) = Cov(X, aY)


• Proof (first part):
• Cov(aX, Y)
= E[ aXY ] − E[ aX ]E[ Y ]
= a (E[ XY ] − E[ X ]E[ Y ])
= a Cov(X,Y)
Standardized Random Variable
• Definition:
If X is a random variable, then its standardized form is given by
X* := (X – E[X]) / SD(X), where SD(.) gives the standard deviation
• Property: E[X*] = 0, Var(X*) = 1
• Proof:
E[X*] = (E[X] − E[X]) / SD(X) = 0;   Var(X*) = Var(X − E[X]) / SD(X)2 = Var(X)/Var(X) = 1
• X* is unit-less
• X* is obtained by:
• First shifting/translating X to make mean 0, and
• Then scaling the shifted variable to make variance 1
Correlation
• For covariance, the magnitude isn’t easy to interpret (unlike its sign)
• Correlation: A measure of how the values taken by X and Y vary
together (“co”-“relate”) obtained by rescaling covariance
• Pearson’s correlation coefficient
• Assuming X and Y are linearly related, correlation magnitude shows the
strength of the (functional/deterministic) relationship between X and Y
• Let ‘SD’ = standard deviation
• Definition: Cor(X,Y) := Cov(X,Y) / (SD(X)·SD(Y))
• Thus, Cor(X,Y) = E[X*Y*], where X* and Y* are the standardized variables
= E[X*Y*] – E[X*]E[Y*]   (since E[X*] = E[Y*] = 0)
= Cov(X*,Y*)
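A minimal sketch, assuming Python/NumPy and an arbitrary linear-plus-noise pair (X, Y) of my choosing: the average of the product of the standardized variables matches NumPy’s Pearson correlation.

```python
# Minimal sketch (illustrative assumption): Pearson's correlation as the
# covariance of the standardized variables, Cor(X,Y) = E[X*Y*].
import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(0.0, 2.0, size=500_000)
y = 0.7 * x + rng.normal(0.0, 1.0, size=x.size)

xs = (x - x.mean()) / x.std()       # standardized X*
ys = (y - y.mean()) / y.std()       # standardized Y*

print((xs * ys).mean())             # E[X*Y*]
print(np.corrcoef(x, y)[0, 1])      # NumPy's Pearson correlation, same value
```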
Correlation
• Property: -1 ≤ Cor(X,Y) ≤ 1
• Proof:
• First inequality
• 0 ≤ E[(X*+Y*)2]
= E[(X*)2] + E[(Y*)2] + 2E[X*Y*]
= 2(1 + Cor(X,Y))
• So, –1 ≤ Cor(X,Y)

• Second inequality
• 0 ≤ E[(X*–Y*)2]
= E[(X*)2] + E[(Y*)2] – 2E[X*Y*]
= 2(1 – Cor(X,Y))
• So, Cor(X,Y) ≤ 1
Correlation
• Property: If X and Y are linearly related, i.e., Y = mX + c,
and are non-constant (i.e., SD(X)>0 and SD(Y)>0),
then |Cor(X,Y)| = 1
• Proof:
• When Y = mX + c, then SD(Y) = |m| SD(X)
• Cor(X,Y)
= Cov(X,Y) / (SD(X) SD(Y))
= mVar(X) / (SD(X) |m|SD(X))
= ±1
= sign of the slope m
Correlation
• Property: If |Cor(X,Y)| = 1, then X and Y are linearly related
• Proof:
• If Cor(X,Y) = 1, then E[(X*–Y*)2] = 2(1 – Cor(X,Y)) = 0
• For discrete X,Y: this must imply X*=Y* for all (x’,y’) where P(X=x’,Y=y’) > 0
• Else the summation underlying the expectation cannot be zero
• For continuous X,Y: this must imply X*=Y* for all measures (dx’,dy’) where P(dx’,dy’) > 0
• X* and Y* can be unequal only on a countable set of isolated points where P(dx’,dy’) > 0
• Else the integral underlying the expectation cannot be zero
• If Cor(X,Y) = (–1), then E[(X*+Y*)2] = 2(1 + Cor(X,Y)) = 0
• For discrete X,Y: this must imply X*=(–Y*) for all (x’,y’) where P(X=x’,Y=y’) > 0
• For continuous X,Y: this must imply X*=(–Y*) for all measures (dx’,dy’) where P(dx’,dy’) > 0
• Inequality can hold only on a countable set of isolated points where P(dx’,dy’) > 0
• If X* = ±Y*, then Y must be of the form mX+c
Correlation
• If |Cor(X,Y)|=1 (or Y=mX+c), then
how to find the equation of the line from data {(xi,yi): i=1,…,n}?
• By the way: line must pass through (E[X],E[Y])
• Because, when X=E[X], value of Y must be mE[X]+c, but that also equals E[Y]
• We proved that: if Y=mX+c, then |Cor(X,Y)|=1 and Y* = ±X* = Cor(X,Y) X*
• So, (Y – E[Y]) / SD(Y) = Cor(X,Y) (X – E[X]) / SD(X)
• So, Y = E[Y] + SD(Y) Cor(X,Y) (X – E[X]) / SD(X)
• So, Y = E[Y] + Cov(X,Y) (X – E[X]) / Var(X)
• This gives the equation of the line with:
• Slope m := Cov(X,Y) / Var(X)
• Intercept c := E[Y] – Cov(X,Y) E[X] / Var(X)
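A quick sketch of recovering the line from data, assuming Python/NumPy and an exactly linear example Y = 2X + 3 of my choosing (not from the slides).

```python
# Minimal sketch (illustrative assumption): recover Y = mX + c from data using
# slope m = Cov(X,Y)/Var(X) and the fact that the line passes through (E[X], E[Y]).
import numpy as np

rng = np.random.default_rng(12)
x = rng.uniform(-5.0, 5.0, size=100_000)
y = 2.0 * x + 3.0                     # exactly linear, so |Cor(X,Y)| = 1

cov_xy = ((x - x.mean()) * (y - y.mean())).mean()
m_hat = cov_xy / np.var(x)            # slope estimate
c_hat = y.mean() - m_hat * x.mean()   # intercept estimate

print(m_hat, c_hat)                   # ≈ 2.0 and 3.0
print(np.corrcoef(x, y)[0, 1])        # ≈ 1.0
```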
Correlation
• Examples
Correlation
• Four sets of data with the same correlation of 0.816
• Blue line indicates the line passing through (E[X],E[Y]) with slope = 0.816
(more on this when we study estimation)
• So, correlation = 0.816
doesn’t always mean that data
lies along a line of slope 0.816
• This indicates the likely
misinterpretation of correlation
when variables underlying data
aren’t linearly related
Correlation
• Zero correlation doesn’t imply independence

• We showed that independence implies zero covariance/correlation,
but the converse isn’t always true
• Example: Let X be uniformly distributed within [-1,+1]. Let Y := X2.
• Cov(X,X2) = E[X.X2] – E[X]E[X2] = E[X3] – 0.E[X2] = 0
• Thus, Cov(X,Y) = 0 = Cor(X,Y) even though Y is a deterministic function of X
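A small numerical confirmation of this example, assuming Python/NumPy (seed and sample size are arbitrary).

```python
# Minimal sketch (illustrative assumption): zero correlation does not imply
# independence. With X uniform on [-1, 1] and Y := X^2, Y is a deterministic
# function of X, yet Cov(X, Y) = 0.
import numpy as np

rng = np.random.default_rng(13)
x = rng.uniform(-1.0, 1.0, size=1_000_000)
y = x ** 2

cov = (x * y).mean() - x.mean() * y.mean()
print(cov)                         # ≈ 0 (E[X^3] = 0 by symmetry)
print(np.corrcoef(x, y)[0, 1])     # ≈ 0 as well, despite full dependence
```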
Correlation
• Non-zero correlation doesn’t imply causation
• https://hbr.org/2015/06/beware-spurious-correlations
• https://science.sciencemag.org/content/348/6238/980.2
• http://www.tylervigen.com/spurious-correlations
