Data Analysis For Social Scientists Cheatsheet

This document is a comprehensive cheat sheet for data analysis, covering fundamental concepts of probability, distributions, and statistical methods as taught in an online course by Prof. Esther Duflo and Prof. Sara Ellison. It includes definitions and properties of various probability distributions, such as binomial and hypergeometric distributions, as well as techniques for data collection and analysis, including web scraping and kernel density estimation. Additionally, it outlines key principles of ethical research involving human subjects, as well as methods for summarizing and describing data through visualizations like histograms and cumulative distributions.


14.310x Data Analysis for Social Scientists

This is a cheat sheet for data analysis based on the online course given by Prof. Esther Duflo and Prof. Sara Ellison. Compiled by Janus B. Advincula.
Last Updated December 3, 2019

Module 1

Introduction
• Data is plentiful.
• Data is beautiful.
• Data is insightful.
• Data is powerful.
• Data can be deceitful.

Causation vs. Correlation
• Correlation is not causality.
• A causal story is not causality either.
• Even more sophisticated data use may still not be causality.

What We Need to Learn
• How do we model the processes that might have generated our data? - Probability
• How do we summarize and describe data, and try to uncover what process may have generated it? - Statistics
• How do we uncover patterns between variables? - Exploratory data analysis - Econometrics

Module 2

Fundamentals of Probability

A sample space S is a collection of all possible outcomes of an experiment.
An event A is any collection of outcomes (including individual outcomes, the entire sample space, the null set).
Useful results:
• If A ⊂ B, then A ∪ B = B.
• If A ⊂ B and B ⊂ A, then A = B.
• If A ⊂ B, then A ∩ B = AB = A.
• A ∪ A^c = S (A and A^c form a partition of S.)
A and B are mutually exclusive (disjoint) if they have no outcomes in common.
A and B are exhaustive (complementary) if their union is S.

Probability
We will assign to every event A a number P(A), which is the probability the event will occur (P : S → R).
We require that:
1. P(A) ≥ 0 for all A ⊂ S
2. P(S) = 1
3. For any sequence of disjoint sets A1, A2, . . . ,
  P(∪i Ai) = Σi P(Ai)
A probability on a sample space S is a collection of numbers P(A) that satisfy axioms 1-3.

Counting
1. If an experiment has two parts, the first one having m possibilities and, regardless of the outcome in the first part, the second one having n possibilities, then the experiment has m × n possible outcomes.
2. Any ordered arrangement of objects is called a permutation. The number of different permutations of N objects is N!. The number of different permutations of n objects taken from N objects is N!/(N − n)!.
3. Any unordered arrangement of objects is called a combination. The number of different combinations of n objects taken from N objects is N!/[(N − n)! n!]. We typically denote this (N choose n).

Properties:
• P(A^c) = 1 − P(A)
• P(∅) = 0
• If A ⊂ B then P(A) ≤ P(B)
• For all A, 0 ≤ P(A) ≤ 1
• P(A ∪ B) = P(A) + P(B) − P(AB)
• P(AB^c) = P(A) − P(AB)

Independence Events A and B are independent if P(AB) = P(A)P(B).
Theorem If A and B are independent, A and B^c are also independent.

Conditional Probability The probability of A conditional on B is
  P(A|B) = P(AB)/P(B),  P(B) > 0.
If A and B are independent and P(B) > 0, then
  P(A|B) = P(AB)/P(B) = P(A)P(B)/P(B) = P(A).

Bayes' Theorem
  P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|A^c)P(A^c)]
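As a quick numeric illustration of Bayes' theorem, here is a small R check with made-up numbers (a rare condition with 1% prevalence, a test with 95% sensitivity and a 10% false-positive rate; every value is hypothetical):

  p_A <- 0.01              # P(A): prevalence of the condition (assumed)
  p_B_given_A <- 0.95      # P(B|A): probability of a positive test given the condition (assumed)
  p_B_given_Ac <- 0.10     # P(B|A^c): false positive rate (assumed)
  # Bayes' theorem: P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|A^c)P(A^c)]
  p_A_given_B <- p_B_given_A * p_A /
    (p_B_given_A * p_A + p_B_given_Ac * (1 - p_A))
  p_A_given_B              # about 0.088: most positive tests are still false alarms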
Random Variables, Distributions, and Joint Distributions

A random variable is a real-valued function whose domain is the sample space.
A discrete random variable can take on only a finite or countably infinite number of values.
A random variable that can take on any value in some interval, bounded or unbounded, of the real line is called a continuous random variable.

The probability function (PF) of X, where X is a discrete random variable, is the function fX such that for any real number x, fX(x) = P(X = x).
Properties:
• 0 ≤ fX(xi) ≤ 1
• Σi fX(xi) = 1
• P(A) = P(X ⊂ A) = Σ_A fX(xi)
• P(X = x) = 0 for any x if X is continuous.

The density or probability density function (PDF) is the continuous analog to the discrete PF in many ways.
A random variable X is continuous if there exists a non-negative function fX such that for any interval A ⊂ R,
  P(X ⊂ A) = ∫_A fX(x) dx.
Properties:
• 0 ≤ fX(x)
• ∫ fX(x) dx = 1
• P(A) = P(a ≤ X ≤ b) = ∫_A fX(x) dx

The cumulative distribution function (CDF) FX of a random variable X is defined for each x as
  FX(x) = P(X ≤ x).
Properties:
• 0 ≤ FX(x) ≤ 1
• FX(x) is non-decreasing in x
• lim_{x→−∞} FX(x) = 0
• lim_{x→∞} FX(x) = 1
• FX(x) is right continuous.

A PF/PDF and a CDF for a particular random variable contain exactly the same information about its distribution, just in a different form.
  FX(x) = P(X ≤ x) = ∫_{−∞}^{x} fX(t) dt
  F'X(x) = dFX(x)/dx = fX(x)

Binomial Distribution X ∼ B(n, p)
  fX(x) = (n choose x) p^x (1 − p)^(n−x),  x = 0, 1, . . . , n
The binomial distribution describes the number of "successes" in n trials where the trials are independent and the probability of success in each is p.

Hypergeometric Distribution X ∼ H(N, K, n)
  fX(x) = (K choose x)(N−K choose n−x) / (N choose n),  x = max(0, n + K − N), . . . , min(n, K)
The hypergeometric distribution describes the number of "successes" in n trials where you are sampling without replacement from a population of size N whose initial probability of success was K/N.
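In R, the built-in d* functions give these probability functions directly; a minimal check with arbitrarily chosen numbers:

  # Binomial: P(X = 3) for X ~ B(n = 10, p = 0.2)
  dbinom(3, size = 10, prob = 0.2)
  choose(10, 3) * 0.2^3 * 0.8^7       # same value from the formula
  # Hypergeometric: N = 20, K = 7 successes in the population, n = 5 draws
  # R's parameterization is dhyper(x, m = K, n = N - K, k = n)
  dhyper(3, m = 7, n = 13, k = 5)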
Joint, Marginal and Conditional Distributions

Joint Distribution
If X and Y are continuous random variables defined on the same sample space S, then the joint probability density function of X & Y, fXY(x, y), is the surface such that for any region A of the xy-plane,
  P((X, Y) ⊂ A) = ∫∫_A fXY(x, y) dx dy.
Example:
  fXY(x, y) = c x^2 y for x^2 ≤ y ≤ 1, and 0 otherwise
Support: the region between the parabola y = x^2 and the line y = 1, with corners at (−1, 1) and (1, 1).
What is c?
  ∫_{−1}^{1} ∫_{x^2}^{1} c x^2 y dy dx = (4/21) c = 1  ⇒  c = 21/4
What is P(X > Y)?
  ∫_{0}^{1} ∫_{x^2}^{x} (21/4) x^2 y dy dx = 3/20

Marginal Distribution
For discrete:
  fX(x) = Σ_y fXY(x, y)
For continuous:
  fX(x) = ∫_y fXY(x, y) dy

Independence X & Y are independent if
  P(X ⊂ A and Y ⊂ B) = P(X ⊂ A) P(Y ⊂ B)
for all regions A and B. Also, X and Y are independent iff fXY(x, y) = fX(x) fY(y).

Conditional Distribution
The conditional PDF of Y given X is
  fY|X(y|x) = fXY(x, y) / fX(x)
  (= P(Y = y|X = x) for X, Y discrete)
Conditional distributions and independence:
  fY|X(y|x) = fY(y) iff fXY(x, y) = fX(x) fY(y) iff X & Y independent
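The constant c and P(X > Y) from the example above can be checked numerically in R with nested one-dimensional integrals:

  # Check c: the density c*x^2*y must integrate to 1 over the support x^2 <= y <= 1
  inner_c <- function(x) sapply(x, function(xx)
    integrate(function(y) xx^2 * y, lower = xx^2, upper = 1)$value)
  integrate(inner_c, -1, 1)$value   # 4/21 = 0.190..., so c = 21/4
  # Check P(X > Y): the part of the support with y < x only exists for 0 < x < 1
  inner_p <- function(x) sapply(x, function(xx)
    integrate(function(y) (21/4) * xx^2 * y, lower = xx^2, upper = xx)$value)
  integrate(inner_p, 0, 1)$value    # 3/20 = 0.15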
Gathering and Collecting Data

Where can we find data?
1. Existing data libraries
2. Collecting your own data
3. Extracting data from the internet

What is web scraping?
• Pull data from one page
• Crawl an entire web page
• A set of forms running in the background
• Any of the above in an ongoing fashion

Web scraping in Python
You will work using the requests library and the BeautifulSoup library.
Web scraping in R
R has a web scraping package built by Hadley Wickham called rvest.
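A minimal rvest sketch; the URL and the CSS selector below are placeholders for illustration, not from the course:

  library(rvest)
  page   <- read_html("https://example.com/table-page")    # hypothetical page
  tables <- html_table(html_nodes(page, "table"))          # parse all <table> elements into data frames
  prices <- html_text(html_nodes(page, ".price"))           # text from a hypothetical .price class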


Human Subject Research

Research A systematic investigation, including research development, testing and evaluation, designed to develop or contribute to generalizable knowledge.
Human Subject A living individual about whom an investigator (whether professional or student) conducting research obtains
1. data through intervention or interaction with the individual, or
2. identifiable private information.

Key Principles of the Belmont Report
1. Respect for Persons
• Respect individual autonomy
• Protect individuals with reduced autonomy
2. Beneficence
• Maximize benefits and minimize harm
3. Justice
• Equitable distribution of research burdens and benefits

Module 3

Summarizing and Describing Data

Histogram
A histogram is a rough estimate of the probability distribution function of a continuous variable. It is a function that counts the number of observations that fit into each bin.
Example: Women's height in Bihar
[Histogram: count vs. height in centimeters, Bihar females]

The Kernel Density Estimation
Kernel density estimation is a non-parametric way to estimate the probability density function of a random variable.
Let (x1, x2, . . . , xn) be an independent and identically distributed sample drawn from some distribution with an unknown PDF f. We are interested in estimating the shape of this function f. Its kernel density estimator is given by
  f̂_h(x) = (1/n) Σi K_h(x − xi) = (1/(nh)) Σi K((x − xi)/h)
where K() is the kernel, a non-negative function that integrates to 1 and has mean zero, and h > 0 is the bandwidth.
Things to choose:
• the kernel function (Epanechnikov, Normal, etc.)
• the bandwidth (the optimal bandwidth minimizes the Mean Squared Error)
[Kernel density estimate of women's height: density vs. height_cm]

Cumulative Histogram, CDF
Cumulative Histogram the number/frequency of cases that are smaller or equal to the value for a particular bin
You can get a smoothed version of a CDF (using ecdf in R).
[Empirical CDF of height in centimeters]
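The three summaries above map to one-liners in R; simulated heights stand in for the Bihar data:

  set.seed(1)
  height <- rnorm(1000, mean = 152, sd = 6)        # placeholder for the height data
  hist(height, breaks = 30, freq = FALSE)          # histogram on the density scale
  lines(density(height, kernel = "epanechnikov"))  # kernel density estimate (default bandwidth)
  plot(ecdf(height))                               # empirical CDF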
Module 4

Functions of Random Variables

X is a random variable with fX(x) known. We want the distribution of Y = h(X). Then,
  FY(y) = ∫_{x : h(x) ≤ y} fX(x) dx
If Y is also continuous, then
  fY(y) = dFY(y)/dy.
Example:
  fX(x) = 1/2 for −1 ≤ x ≤ 1, and 0 otherwise
Y = X^2. What is fY(y)? Note that the support of X is [−1, 1], which implies that the induced support of Y is [0, 1].
  FY(y) = P(Y ≤ y) = P(X^2 ≤ y)
        = P(−√y ≤ X ≤ √y) = ∫_{−√y}^{√y} (1/2) dx
        = √y for 0 ≤ y ≤ 1
  fY(y) = 1/(2√y) for 0 ≤ y ≤ 1, and 0 otherwise

Linear Transformation
Let X have PDF fX(x). Let Y = aX + b, a ≠ 0.
  fY(y) = (1/|a|) fX((y − b)/a)

Probability Integral Transformation Let X, continuous, have PDF fX(x) and CDF FX(x). Let Y = FX(X). How is Y distributed?
A continuous random variable transformed by its own CDF will always have a U[0, 1] distribution.
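A quick simulation of the probability integral transformation (a normal X is used purely as an illustration):

  set.seed(2)
  x <- rnorm(10000, mean = 2, sd = 3)   # any continuous X works
  u <- pnorm(x, mean = 2, sd = 3)       # Y = F_X(X)
  hist(u, breaks = 20)                  # looks flat: approximately U[0, 1]
  quantile(u, c(0.25, 0.5, 0.75))       # close to 0.25, 0.50, 0.75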


Convolution
A convolution refers to the sum of independent random variables.
Let X be continuous with PDF fX, Y continuous with PDF fY. X and Y are independent. Let Z be their sum. What is the PDF of Z?
  fZ(z) = ∫_{−∞}^{∞} fX(z − y) fY(y) dy,  −∞ < z < ∞

Order Statistics
Let X1, . . . , Xn be continuous, independent, identically distributed, with PDF fX. Let Yn = max{X1, . . . , Xn}. This is called the nth order statistic.
Distribution:
  Fn(y) = FX(y)^n
  fn(y) = dFn(y)/dy = n (FX(y))^(n−1) fX(y)

Moments of a Distribution
The mode is the point where the PDF reaches its highest value.
The median is the point above and below which the integral of the PDF is equal to 1/2.
The mean, or expectation, or expected value, is defined as
  E[X] = ∫ x fX(x) dx.
For Y = g(X),
  E[Y] = E[g(X)] = ∫ g(x) fX(x) dx

Expectation, Variance and an Introduction to Regression

Properties of Expectation:
1. E[a] = a, a constant
2. E[Y] = aE[X] + b, for Y = aX + b
3. E[Y] = E[X1] + · · · + E[Xn], for Y = X1 + · · · + Xn
4. E[Y] = a1E[X1] + · · · + anE[Xn] + b, for Y = a1X1 + · · · + anXn + b
5. E[XY] = E[X]E[Y] if X, Y independent

Variance
  Var(X) = E[(X − µ)^2]
Properties of Variance:
1. Var(X) ≥ 0
2. Var(a) = 0, a constant
3. Var(Y) = a^2 Var(X), for Y = aX + b
4. Var(Y) = Var(X1) + · · · + Var(Xn), for Y = X1 + · · · + Xn with X1, . . . , Xn independent
5. Var(Y) = a1^2 Var(X1) + · · · + an^2 Var(Xn), for Y = a1X1 + · · · + anXn + b with X1, . . . , Xn independent
6. Var(X) = E[X^2] − (E[X])^2

Standard Deviation
  SD(X) = σ = √Var(X)

Conditional Expectation
  E[Y|X] = ∫ y fY|X(y|x) dy
Note that E[Y|X] is a function of X and, therefore, a random variable.
Law of Iterated Expectations
  E[E[Y|X]] = E[Y]
Law of Total Variance
  Var(E[Y|X]) + E[Var(Y|X)] = Var(Y)

Covariance and Correlation

Covariance
  Cov(X, Y) = E[(X − µX)(Y − µY)]
Correlation
  ρ(X, Y) = E[(X − µX)(Y − µY)] / (√Var(X) √Var(Y))
• X & Y are "positively correlated" if ρ > 0.
• X & Y are "negatively correlated" if ρ < 0.
• X & Y are "uncorrelated" if ρ = 0.
Properties of Covariance:
1. Cov(X, X) = Var(X)
2. Cov(X, Y) = Cov(Y, X)
3. Cov(X, Y) = E[XY] − E[X]E[Y]
4. X, Y independent ⇒ Cov(X, Y) = 0
5. Cov(aX + b, cY + d) = ac Cov(X, Y)
6. Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
7. |ρ(X, Y)| ≤ 1
8. |ρ(X, Y)| = 1 iff Y = aX + b, a ≠ 0

A Preview of Regression
We have two random variables, X & Y.
  E[X] = µX, Var(X) = σX^2
  E[Y] = µY, Var(Y) = σY^2
  ρXY = Cov(X, Y) / (σX σY)
If |ρXY| < 1, then we can write Y = α + βX + U.
Let β = ρXY σY/σX and α = µY − βµX.
Then U = Y − α − βX has the following properties:
• E[U] = 0
• Cov(X, U) = 0
α & β are the regression coefficients.

Inequalities
Markov Inequality X is a random variable that is always non-negative. Then, for any t > 0,
  P(X ≥ t) ≤ E[X]/t
Chebyshev Inequality X is a random variable for which Var(X) exists. Then, for any t > 0,
  P(|X − E[X]| ≥ t) ≤ Var(X)/t^2
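A quick empirical check of both inequalities in R (exponential draws chosen arbitrarily):

  set.seed(3)
  x <- rexp(100000, rate = 1)        # non-negative, E[X] = 1, Var(X) = 1
  t <- 2
  mean(x >= t)                       # about 0.135, below the Markov bound E[X]/t = 0.5
  mean(abs(x - mean(x)) >= t)        # about 0.05, below the Chebyshev bound Var(X)/t^2 = 0.25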
Module 5

Special Distributions

Bernoulli Two possible outcomes: success or failure. The probability of success is p, of failure q (= 1 − p).
Binomial If X1, . . . , Xn are i.i.d. random variables, all Bernoulli distributed with success probability p, then
  X = Σ_{k=1}^{n} Xk ∼ B(n, p)  (binomial distribution)
The binomial distribution is the number of successes in a sequence of n independent (success/failure) trials, each of which yields success with probability p.
Hypergeometric The binomial distribution is used to model the number of successes in a sample of size n with replacement. If you sample without replacement, you get the hypergeometric distribution.
Negative Binomial Consider a sequence of independent Bernoulli trials, and let X be the number of trials necessary to achieve r successes. X has a negative binomial distribution.
Geometric A negative binomial distribution with r = 1 is a geometric distribution. It is the number of failures before the first success.
• The sum of r independent Geometric(p) random variables is a negative binomial (r, p) random variable.
• If the Xi are i.i.d. and negative binomial (ri, p), then Σ Xi is distributed as a negative binomial (Σ ri, p).
• Memorylessness
Poisson The Poisson distribution expresses the probability of a given number of events occurring in a fixed interval of time if:
1. the events can be counted in whole numbers,
2. the occurrences are independent, and
3. the average frequency of occurrence for a time period is known.
Relationship between Poisson and Binomial For small values of p, the Poisson distribution can approximate the Binomial distribution.
Exponential
• waiting time between two events in a Poisson process
• memoryless
• It is a special case of a gamma distribution, the "waiting time" before a number of occurrences.
Uniform
• quasi-random number generators
• From a uniform distribution, you can use the inverse CDF method to get a sample for many (not all) distributions you are interested in.
Normal Distribution
Properties:
• If X1 is normal, then X2 = a + bX1 is also normal, with mean a + b E[X1] and variance b^2 Var(X1).
• Normal distributions are symmetric, unimodal, "bell-shaped," have thin tails, and the support is R.

Useful R Commands:
  rnorm(n, mean, sd)   generates random numbers from the normal distribution
  dnorm(x, mean, sd)   probability density function (PDF)
  pnorm(q, mean, sd)   cumulative distribution function (CDF)
  qnorm(p, mean, sd)   quantile function, the inverse of pnorm

The Sample Mean, Central Limit Theorem and Estimation

The sample mean is the arithmetic average of the n random variables (or realizations) from a random sample of size n.
  X̄n = (1/n)(X1 + · · · + Xn) = (1/n) Σi Xi
Expectation of the sample mean:
  E[X̄n] = µ
Variance of the sample mean:
  Var(X̄n) = σ^2/n

The Central Limit Theorem
Let X1, . . . , Xn form a random sample of size n from a distribution with finite mean and variance. Then for any fixed number x,
  lim_{n→∞} P(√n (X̄ − µ)/σ ≤ x) = Φ(x)
where Φ(x) is the CDF of a standard normal random variable.
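A small simulation of the CLT (exponential draws, an arbitrary choice), standardizing the sample mean and comparing it with Φ:

  set.seed(4)
  n <- 40
  z <- replicate(5000, sqrt(n) * (mean(rexp(n, rate = 1)) - 1) / 1)  # for Exp(1), mu = sigma = 1
  mean(z <= 1.0)    # close to pnorm(1.0) = 0.841
  mean(z <= -0.5)   # close to pnorm(-0.5) = 0.309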
Statistics
An estimator is a function of the random variables in a random sample.
A parameter is a constant indexing a family of distributions.
The function of the random sample is the estimator. The number, or realization of the function of the random sample, is the estimate.
Example: Suppose X ∼ U[0, θ]
  fX(x) = 1/θ for 0 < x < θ, and 0 otherwise
We want to estimate θ.
  θ̂1 = max{X1, . . . , Xn}
  θ̂2 = (2/n) Σi Xi

Module 6

Assessing and Deriving Estimators

An estimator is unbiased for θ if E[θ̂] = θ for all θ in Θ.
Example Xi i.i.d. U[0, θ],
  θ̂ = (2/n) Σi Xi
  E[θ̂] = (2/n) Σi E[Xi] = (2/n) Σi (θ/2) = θ  — unbiased
Theorem The sample mean for an i.i.d. sample is unbiased for the population mean.
Theorem The sample variance for an i.i.d. sample is unbiased for the population variance, where the sample variance is
  S^2 = (1/(n − 1)) Σi (Xi − X̄n)^2
Given two unbiased estimators, θ̂1 & θ̂2, θ̂1 is more efficient than θ̂2 if, for a given sample size,
  Var(θ̂1) < Var(θ̂2)
Mean Squared Error Sometimes we are interested in trading off bias and variance/efficiency.
  MSE(θ̂) = E[(θ̂ − θ)^2] = Var(θ̂) + (E[θ̂] − θ)^2
θ̂ is a consistent estimator for θ if
  lim_{n→∞} P(|θ − θ̂n| < δ) = 1.
Roughly, an estimator is consistent if its distribution collapses to a single point at the true parameter as n → ∞.

Method of Moments
Population Moments (about the origin):
  E[X], E[X^2], E[X^3], . . .
Sample Moments:
  (1/n) Σi Xi, (1/n) Σi Xi^2, (1/n) Σi Xi^3, . . .
If you have k parameters to estimate, you will have k moment conditions. In other words, you will have k equations in k unknowns to solve.

Maximum Likelihood Estimation
The maximum likelihood estimator of a parameter θ is the value θ̂ which most likely would have generated the observed sample.
Likelihood Function
  L(θ|x) = Πi f(xi|θ)
We just maximize L over θ in Θ.
Properties:
• If there is an efficient estimator in a class of consistent estimators, MLE will produce it.
• Under certain regularity conditions, MLEs will have asymptotically normal distributions.
Disadvantages:
• They can be biased.
• They might be difficult to compute.
• They can be sensitive to incorrect assumptions about the underlying distribution, more so than other estimators.
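Returning to the U[0, θ] example above, a small simulation compares the bias and MSE of the two estimators (θ = 5 and n = 20 chosen arbitrarily):

  set.seed(5)
  theta <- 5; n <- 20
  est <- replicate(10000, {
    x <- runif(n, 0, theta)
    c(max = max(x), two_mean = 2 * mean(x))
  })
  rowMeans(est) - theta        # bias: the maximum is biased downward, 2*mean is roughly unbiased
  rowMeans((est - theta)^2)    # MSE: despite its bias, the maximum has the lower MSE here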
Confidence Intervals and Hypothesis Testing

The standard error of an estimate is the standard deviation (or estimated standard deviation) of the estimator.

χ² Distribution
The sample variance
  S^2 = (1/(n − 1)) Σi (Xi − X̄n)^2
is an unbiased estimator for the variance of a distribution.
  (n − 1)S^2/σ^2 ∼ χ²_{n−1}

t Distribution
If X ∼ N(0, 1) and Z ∼ χ²_n and they're independent, then
  X/√(Z/n) ∼ t_n.
Suppose we are sampling from a N(µ, σ^2) distribution. Then
  √n (X̄ − µ)/S ∼ t_{n−1}

F Distribution
If X ∼ χ²_n and Z ∼ χ²_m and they're independent, then
  (X/n)/(Z/m) ∼ F_{n,m}.

Confidence Intervals
Case 1 We are sampling from a normal distribution with a known variance and we want a confidence interval for the mean.
  P(Φ^{-1}(α/2) < √n (X̄ − µ)/σ < −Φ^{-1}(α/2)) = 1 − α
  CI_{1−α} = [X̄ + Φ^{-1}(α/2) σ/√n, X̄ − Φ^{-1}(α/2) σ/√n]
Case 2 We are sampling from a normal distribution with an unknown variance and we want a confidence interval for the mean.
  P(t_{n−1}^{-1}(α/2) < √n (X̄ − µ)/S < −t_{n−1}^{-1}(α/2)) = 1 − α
  CI_{1−α} = [X̄ + t_{n−1}^{-1}(α/2) S/√n, X̄ − t_{n−1}^{-1}(α/2) S/√n]
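Both cases in R, for a made-up sample (qnorm and qt give the Φ^{-1} and t^{-1} quantiles):

  set.seed(6)
  x <- rnorm(25, mean = 10, sd = 2); n <- length(x); alpha <- 0.05
  # Case 1: variance known (sigma = 2)
  mean(x) + c(1, -1) * qnorm(alpha / 2) * 2 / sqrt(n)
  # Case 2: variance unknown, use S and the t quantile
  mean(x) + c(1, -1) * qt(alpha / 2, df = n - 1) * sd(x) / sqrt(n)
  t.test(x, conf.level = 0.95)$conf.int     # matches Case 2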
Hypothesis Testing

An hypothesis is an assumption about the distribution of a random variable in a population.
A maintained hypothesis is one that cannot or will not be tested.
A testable hypothesis is one that can be tested using evidence from a random sample.
The null hypothesis, H_O, is the one that will be tested.
The alternative hypothesis, H_A, is a possibility (or series of possibilities) other than the null.
We might want to perform a test concerning an unknown parameter θ where Xi ∼ f(x|θ).
  H_O: θ in Θ_O
  H_A: θ in Θ_A, where Θ_O and Θ_A are disjoint.
A simple hypothesis is one characterized by a single point, i.e., Θ_O = θ_O.
A composite hypothesis is one characterized by multiple points, i.e., Θ_O has multiple values or a range of values.
Example Xi i.i.d. N(µ, σ^2), where σ^2 known — maintained hypothesis
Interested in testing whether µ = 0 — testable hypothesis
  H_O: µ = 0  (null hypothesis, simple)
  H_A: µ = 1  (alternative hypothesis, simple)

             | H_O True     | H_O False
  Accept H_O | No error     | Type II error
  Reject H_O | Type I error | No error

The significance level of the test, α, is the probability of type I error.
The operating characteristic of the test, β, is the probability of type II error.
1 − α is the confidence level.
1 − β is the power.
We define the critical region of the test, C or C_X, as the region of the support of the random sample for which we reject the null. The critical region will take the form X̄ > k for some k yet to be determined.

Statistical Power
[Figure: densities of the test statistic under H_O and under H_A; the area α (under H_O, beyond k) and the area β (under H_A, below k) are the two error probabilities, and 1 − β is the power.]
Choice of any one of α, β, or k determines the other two. This involves an explicit trade-off between the probability of type I and type II errors.
• increasing k means α ↓ and β ↑
• decreasing k means α ↑ and β ↓
What happens as n increases or decreases?
[Figure: with a larger n the sampling distributions are tighter, so for a given α, β is smaller.]

Power Calculations
We tend to pick α low because society does not want to conclude that some treatment works when in fact it really does not.
We want to pick N = N_c + N_t such that, if the average treatment effect is in fact some value τ, the power of the test will be at least 1 − β for some β, given that a fraction γ of the units are assigned to the treatment group.
In addition, we must assume (know) something about the variance of the outcome in each treatment arm: for simplicity, we often assume it is the same, some parameter σ^2.
In summary, we know, impose, or assume α, β, τ, σ, and γ, and we are looking for N.
Alternatively, we could be interested in the power for a given sample size: we know α, τ, σ, γ, and N and look for β.
  T = (Ȳ_t^obs − Ȳ_c^obs)/√V̂_Neyman ≈ (Ȳ_t^obs − Ȳ_c^obs)/√(σ^2/N_t + σ^2/N_c)
We reject the null hypothesis if |T| > Φ^{-1}(1 − α/2), e.g. if α = 0.05, if |T| > 1.96.
  (Ȳ_t^obs − Ȳ_c^obs − τ)/√(σ^2/N_t + σ^2/N_c) ≈ N(0, 1)
  P(|T| > Φ^{-1}(1 − α/2)) ≈ Φ(−Φ^{-1}(1 − α/2) + τ/√(σ^2/N_t + σ^2/N_c)) + Φ(−Φ^{-1}(1 − α/2) − τ/√(σ^2/N_t + σ^2/N_c))
The second term is very small, so we ignore it. We want the first term to be equal to 1 − β:
  Φ^{-1}(1 − β) = −Φ^{-1}(1 − α/2) + τ √N √(γ(1 − γ))/σ
where γ = N_t/N.
The required sample size is
  N = (Φ^{-1}(1 − β) + Φ^{-1}(1 − α/2))^2 / [(τ^2/σ^2) γ(1 − γ)]
• With a stratified design, the variance of the estimated treatment effect is lower.
• With a clustered design, the variance of the estimated treatment effect is larger.
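The required-N formula above translates directly into R; the numbers below (an effect of 0.2 standard deviations, an even split) are only an example:

  required_n <- function(tau, sigma, alpha = 0.05, beta = 0.20, gamma = 0.5) {
    (qnorm(1 - beta) + qnorm(1 - alpha / 2))^2 / ((tau / sigma)^2 * gamma * (1 - gamma))
  }
  required_n(tau = 0.2, sigma = 1)                  # about 785 units in total for 80% power
  power.t.test(delta = 0.2, sd = 1, power = 0.8)    # built-in check: about 393 per arm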
Stratified Design
• Take the difference in means within each stratum.
• Take a weighted average of the treatment effects with weights the size of the strata:
  Σ_g (N_g/N) τ̂_g
• This will be an unbiased estimate of the average treatment effect.
• The variance will be calculated as:
  Σ_g (N_g/N)^2 V̂_g
• Special case: the probability of assignment to the control group stays the same in each stratum. Then this coefficient is equal to the simple difference between treatment and control, but the variance is always weakly lower.
• Stratification will lower the required sample size for a given power.

Clustered Design
• We need to take into account the fact that the potential outcomes for units within randomization clusters are not independent.
• Conservative way to do this: just average the outcome by cluster and treat each one as an observation.
• The number of observations is the number of clusters, and you can analyze this data exactly as a completely randomized experiment but with clusters as the unit of analysis.
• A randomization with two clusters is unlikely to go very far!

Module 7

Causality

Definition of Causal Effects For any unit, the causal effect of a treatment is the difference between the potential outcome with and without the treatment.
Example Consider a single unit contemplating whether or not to take an aspirin for a headache. The unit-level causal effect involves one of four possibilities:
1. Headache gone only with aspirin: Y(Aspirin) = No Headache, Y(No Aspirin) = Headache
2. No effect of aspirin, with a headache in both cases: Y(Aspirin) = Headache, Y(No Aspirin) = Headache
3. No effect of aspirin, with the headache gone in both cases: Y(Aspirin) = No Headache, Y(No Aspirin) = No Headache
4. Headache gone only without aspirin: Y(Aspirin) = Headache, Y(No Aspirin) = No Headache

  Unit | Y(Aspirin)  | Y(No Aspirin) | Causal Effect
  You  | No Headache | Headache      | Improvement due to Aspirin

The Problem of Causal Inference
• The definition of the causal effect depends on the potential outcomes, but it does not depend on which outcome is actually observed.
• The causal effect is the comparison of potential outcomes, for the same unit, at the same moment in time post-treatment. The fundamental problem of causal inference is therefore the problem that at most one of the potential outcomes can be realized and thus observed.
• We must rely on multiple units to make causal inferences.

Stable Unit Treatment Value Assumption (SUTVA) The potential outcomes for any unit do not vary with the treatments assigned to other units, and, for each unit, there are no different forms or versions of each treatment level which lead to different potential outcomes.

The Assignment Mechanism Let's assume we have a population of size N, indexed by i. Let the treatment indicator Wi take on the values 0 (the control treatment) and 1 (the active treatment). We have one realized (and possibly observed) potential outcome for each unit, denoted by Yi^obs:
  Yi^obs = Yi(Wi) = Yi(0) if Wi = 0, and Yi(1) if Wi = 1.
For each unit we also have one missing potential outcome, Yi^mis:
  Yi^mis = Yi(1 − Wi) = Yi(1) if Wi = 0, and Yi(0) if Wi = 1.
Comparisons of Yi(1) and Yi(0) are unit-level causal effects:
  Yi(1) − Yi(0)
Missing data problem Given any treatment assigned to an individual unit, the potential outcome associated with any alternate treatment is missing. A key role is therefore played by the missing data mechanism, or the assignment mechanism. How is it determined which units get which treatments or, equivalently, which potential outcomes are realized and which are not?

The Selection Problem Imagine we have a larger group of people who took aspirin and a group who did not, and we decide to take the sample mean of headache for people who got or did not get the pill. We know that this is a good estimator for
  E[Yi^obs|Wi = 1] − E[Yi^obs|Wi = 0]
  = E[Yi(1)|Wi = 1] − E[Yi(0)|Wi = 0]
  = E[Yi(1)|Wi = 1] − E[Yi(0)|Wi = 1] + E[Yi(0)|Wi = 1] − E[Yi(0)|Wi = 0]
Treatment effect on the treated: E[Yi(1)|Wi = 1] − E[Yi(0)|Wi = 1]
Selection bias: E[Yi(0)|Wi = 1] − E[Yi(0)|Wi = 0]

Randomization solves the selection problem In a completely randomized experiment, N_t units are randomly drawn to be in the treatment group, and N_c units are drawn to be in the control group. Then, the probability of assignment does not depend on potential outcomes:
  E[Yi(0)|Wi = 1] − E[Yi(0)|Wi = 0] = 0
and
  E[Yi^obs|Wi = 1] − E[Yi^obs|Wi = 0]
  = E[Yi(1)|Wi = 1] − E[Yi(0)|Wi = 1]
  = E[Yi(1) − Yi(0)|Wi = 1]
  = E[Yi(1) − Yi(0)]

Types of RCT
• Complete randomization
• Stratified randomization
• Pairwise randomization
• Clustered randomization

Analyzing Randomized Experiments

The Average Treatment Effect
  ATE = E[Yi^obs|Wi = 1] − E[Yi^obs|Wi = 0]
Suppose we have a completely randomized experiment with N_t treatment units and N_c control units. The difference in sample averages is
  τ̂ = (1/N_t) Σ_{i:Wi=1} Yi^obs − (1/N_c) Σ_{i:Wi=0} Yi^obs = Ȳ_t^obs − Ȳ_c^obs.
The variance of a difference of two statistically independent variables is the sum of their variances. Thus,
  V(τ̂) = S_c^2/N_c + S_t^2/N_t.
To estimate the variance, V̂(τ̂), replace S_c^2 and S_t^2 by their sample counterparts:
  s_c^2 = (1/(N_c − 1)) Σ_{i:Wi=0} (Yi^obs − Ȳ_c^obs)^2
  s_t^2 = (1/(N_t − 1)) Σ_{i:Wi=1} (Yi^obs − Ȳ_t^obs)^2

Confidence Intervals We want to find a function of the random samples A and B such that
  P(A(X1, . . . , XN) < θ < B(X1, . . . , XN)) > 1 − α
The ratio of the difference and the estimated standard error will follow a t-distribution, so
  CI_{1−α} = (τ̂ − t_crit √V̂, τ̂ + t_crit √V̂).
With small samples, take t_crit from a table of the t-distribution for the relevant α with N_t + N_c − 1 degrees of freedom.
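A compact R sketch of this analysis, with simulated data standing in for Y^obs and W:

  set.seed(7)
  N <- 200
  w <- sample(rep(c(0, 1), each = N / 2))     # completely randomized assignment
  y <- 5 + 1.5 * w + rnorm(N)                 # simulated outcomes; true effect 1.5
  tau_hat <- mean(y[w == 1]) - mean(y[w == 0])
  v_hat   <- var(y[w == 1]) / sum(w == 1) + var(y[w == 0]) / sum(w == 0)
  tau_hat + c(-1, 1) * qt(0.975, df = N - 1) * sqrt(v_hat)   # 95% confidence interval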
Hypothesis Testing
  H0: (1/N) Σi (Yi(1) − Yi(0)) = 0
  H1: (1/N) Σi (Yi(1) − Yi(0)) ≠ 0
Natural Test Statistic
  t = (Ȳ_t^obs − Ȳ_c^obs)/√V̂
It follows a t-distribution with N − 1 degrees of freedom or, with N large enough, a normal distribution.

Fisher Exact Test
Another view of uncertainty.
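In the Fisher spirit, the sharp null of no effect for any unit can be tested by re-randomizing the assignment; a minimal sketch, continuing the simulated y and w from the snippet above:

  obs <- mean(y[w == 1]) - mean(y[w == 0])
  perm <- replicate(5000, {
    ws <- sample(w)                           # re-draw the assignment under the sharp null
    mean(y[ws == 1]) - mean(y[ws == 0])
  })
  mean(abs(perm) >= abs(obs))                 # randomization p-value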
(More) Exploratory Data Analysis: Non-Parametric Comparisons and Regressions

Kolmogorov-Smirnov Test
Let X1, . . . , Xn be a random sample with CDF F and let Y1, . . . , Yn be a random sample with CDF G.
We are interested in testing the hypotheses
  Ho: F = G
  Ha: F ≠ G
The Statistic
  D_nm = max_x |F_n(x) − G_m(x)|
where F_n and G_m are the empirical CDFs of the first and second sample. The empirical CDF just counts the number of sample points below level x:
  F_n(x) = P_n(X < x) = (1/n) Σi I(Xi < x)

First Order Stochastic Dominance: One-sided Kolmogorov-Smirnov Test
• We are interested in testing the hypothesis
  Ho: F = G
against
  Ha: F > G
(which would mean that G FSD F).
• The one-sided KS statistic is:
  D+_nm = max_x [F_n(x) − G_m(x)]

Asymptotic Distribution of the KS Statistic
• Under Ho, the limit of KS as n and m go to infinity is 0, so we want to compare the KS statistic to 0. We will reject the hypothesis if the statistic is "large" enough.
• Under Ho, the distribution of
  (nm/(n + m))^(1/2) D_nm
has a known distribution (KS) with associated critical values.
• Therefore, we reject the null of equality if
  (nm/(n + m))^(1/2) D_nm > c(α)
where c(α) are critical values which we find in tables.

Non-Parametric (Bivariate) Regression
You have two random variables, X and Y, and express the conditional expectation of Y given X as E[Y|X] = g(X). Therefore, for any x and y,
  y = g(x) + ε
where ε is the prediction error. The problem is to estimate g(x) without imposing a functional form.

The Kernel Regression
  E[Y|X = x] = ∫ y f(y|x) dy
By Bayes' rule:
  ∫ y f(y|x) dy = ∫ y f(x, y)/f(x) dy = [∫ y f(x, y) dy] / f(x)

Non-Parametric Kernel Estimator Replace f(x, y) and f(x) by their empirical estimates:
  ĝ(x) = [∫ y f̂(x, y) dy] / f̂(x)
Denominator:
  f̂(x) = (1/(nh)) Σi K((x − xi)/h)
Numerator:
  (1/(nh)) Σi yi K((x − xi)/h)
where h is the bandwidth and K() is a density; the denominator is the kernel estimate of the density of x.
Large sample properties:
• As h goes to zero, the bias goes to zero.
• As nh goes to infinity, the variance goes to zero.
• As you increase the number of observations, you promise to decrease the bandwidth.
Choices to make:
• Choice of kernel
  1. Histogram: K(u) = 1/2 if |u| ≤ 1, K(u) = 0 otherwise.
  2. Epanechnikov: K(u) = (3/4)(1 − u^2) if |u| ≤ 1, K(u) = 0 otherwise.
  3. Quartic: K(u) = [(3/4)(1 − u^2)]^2 if |u| ≤ 1, K(u) = 0 otherwise.
• Choice of bandwidth: trade off bias and variance
  - A large bandwidth will lead to more bias.
  - A small bandwidth will lead to more variance.
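Both tools above have base R counterparts: ks.test for the two-sample Kolmogorov-Smirnov test and ksmooth for a Nadaraya-Watson style kernel regression; simulated data for illustration:

  set.seed(8)
  x1 <- rnorm(100); x2 <- rnorm(120, mean = 0.3)
  ks.test(x1, x2)                                                      # two-sided test of F = G
  x <- runif(300, -2, 2); y <- sin(2 * x) + rnorm(300, sd = 0.3)
  plot(x, y)
  lines(ksmooth(x, y, kernel = "normal", bandwidth = 0.5), lwd = 2)    # kernel regression estimate of g(x)
  lines(ksmooth(x, y, kernel = "normal", bandwidth = 2), lty = 2)      # larger bandwidth: smoother, more bias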
Module 8

The Linear Model

Linear Model
  Yi = β0 + β1 Xi + εi,  i = 1, . . . , n
Basic assumptions:
1. Xi, εi uncorrelated
2. identification:
  (1/n) Σi (Xi − X̄)^2 > 0
  (the sample variance of X is positive)
3. zero mean: E[εi] = 0
4. homoskedasticity: E[εi^2] = σ^2 for all i
5. no serial correlation: E[εi εj] = 0 if i ≠ j
Assumptions 3-5 could be subsumed under a stronger assumption: εi i.i.d. N(0, σ^2).
Properties:
• E[Yi] = β0 + β1 Xi
• Var(Yi) = E[εi^2] = σ^2
• Cov(Yi, Yj) = 0, i ≠ j

Estimates for β0 and β1
• least squares (OLS)
  min_β Σi (Yi − β0 − β1 Xi)^2
• least absolute deviations
  min_β Σi |Yi − β0 − β1 Xi|
• reverse least squares
  min_β Σi (Xi − (Yi − β0)/β1)^2
Under the assumptions of the Classical Linear Regression Model, OLS provides the minimum variance (most efficient) unbiased estimator of β0 and β1. It is the MLE under normality of errors, and the estimates are consistent and asymptotically normal.
Closed-form solutions:
  β̂1 = [(1/n) Σi (Xi − X̄)(Yi − Ȳ)] / [(1/n) Σi (Xi − X̄)^2],  β̂0 = Ȳ − β̂1 X̄
Fitted Value Ŷi = β̂0 + β̂1 Xi
Residual ε̂i = Yi − Ŷi
Regression Line or Fitted Line Y = β̂0 + β̂1 X
Let X̄ = (1/n) Σi Xi and σ̂X^2 = (1/n) Σi (Xi − X̄)^2.

      | Mean | Variance                        | Covariance
  β̂0 | β0   | σ^2/n + σ^2 X̄^2/(n σ̂X^2)      | −σ^2 X̄/(n σ̂X^2)
  β̂1 | β1   | σ^2/(n σ̂X^2)                   |

Some comparative statics:
• A larger σ^2 means a larger Var(β̂).
• A larger σ̂X^2 means a smaller Var(β̂).
• A larger n means a smaller Var(β̂).
• If X̄ > 0, Cov(β̂0, β̂1) < 0.
If we use the stronger assumption that the errors are i.i.d. N(0, σ^2), β̂0 and β̂1 will also have normal distributions.

Analysis of Variance
We want some way to indicate how much of Y's variation is explained by X's variation. We perform an analysis of variance and that leads us to a measure of goodness-of-fit.
Sum of Squared Residuals (SSR)
  SSR = Σi (Yi − β̂0 − β̂1 Xi)^2 = Σi ε̂i^2
Total Sum of Squares (SST)
  SST = Σi (Yi − Ȳ)^2
Model Sum of Squares (SSM)
  SSM = Σi (Ŷi − Ȳ)^2
The fact that the regression line is the least squares line ensures that SSR ≤ SST, so
  0 ≤ SSR/SST ≤ 1
We want a measure of fit that has larger values when the fit is better, so we define
  R^2 = 1 − SSR/SST.
In addition to using R^2 as a basic measure of goodness-of-fit, we can also use it as the basis of a test of the hypothesis that β1 = 0. We reject the hypothesis when
  (n − 1)R^2/(1 − R^2),
which has an F distribution under the null, is large.
Interpretation β̂1 is the estimated effect on Y of a one-unit increase in X.

The Multivariate Linear Model

General Linear Model:
  Yi = β0 + β1 X1i + · · · + βk Xki + εi,  i = 1, . . . , n
Matrix notation:
  Y = Xβ + ε
Assumptions:
1. identification: n > k + 1, X has full column rank k + 1 (i.e., regressors are linearly independent; X'X is invertible)
2. error behavior: E(ε) = 0, E(εε') = Cov(ε) = σ^2 I_n. In the stronger version, ε ∼ N(0, σ^2 I_n).
β̂ is the vector that minimizes the sum of squared errors, i.e.,
  ε̂'ε̂ = (Y − Xβ̂)'(Y − Xβ̂)
Solution:
  β̂ = (X'X)^{-1} X'Y if X'X is invertible
Properties:
• E[β̂] = β
• Cov(β̂) = σ^2 (X'X)^{-1}
• σ̂^2 = ε̂'ε̂/(n − k)

Inference in the Linear Model

Consider the hypotheses
  HO: Rβ = c
  HA: Rβ ≠ c.
R is an r × (k + 1) matrix of restrictions. If, for instance, R = [0 1 0 . . . 0] and c = [0], that corresponds to HO: β1 = 0. If R stacks the rows [0 1 0 . . . 0], [0 0 1 . . . 0], . . . , [0 0 0 . . . 1] and c = 0, that corresponds to HO: β1 = β2 = · · · = βk = 0.
One thing you cannot do in this framework is test one-sided hypotheses.
Steps:
1. We estimate the unrestricted model.
2. We impose the restrictions of the null and estimate that model.
3. We compare the goodness-of-fit of the models.
What if the restriction is that some β = c? This is an F-test.
  T = [(SSR_R − SSR_U)/r] / [SSR_U/(n − (k + 1))]
T ∼ F_{r, n−(k+1)} under the null and we reject the null for large values of the test statistic.
  HO: βi = c
  T = (β̂i − c)/SE(β̂i),  where SE(β̂i) = [σ^2 (X'X)^{-1}_{ii}]^{1/2}
  HO: Rβ = c
  T = (Rβ̂ − c)/SE(Rβ̂),  where SE(Rβ̂) = [σ^2 R(X'X)^{-1} R']^{1/2}
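The restricted-vs-unrestricted comparison above is what anova() does for nested lm() fits; a sketch on simulated data:

  set.seed(9)
  x1 <- rnorm(200); x2 <- rnorm(200); x3 <- rnorm(200)
  y  <- 1 + 0.5 * x1 + 0.8 * x2 + rnorm(200)
  unrestricted <- lm(y ~ x1 + x2 + x3)
  restricted   <- lm(y ~ x1)              # imposes beta2 = beta3 = 0
  anova(restricted, unrestricted)         # F statistic for the joint restriction
  summary(unrestricted)                   # t statistics test each beta_i = 0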
Module 9

Practical Issues in Running Regressions

Dummy Variables
  Yi = α + βDi + εi
Di is a dummy variable, or an indicator variable, if it takes the value 1 if the observation is in group A and 0 if in group B.
Interpretation Without any control variables,
  β̂ = Ȳ_A − Ȳ_B.
You can always estimate the difference between the treatment and control groups for an RCT using an OLS regression framework.
Categorical Variables If there are more than two groups, you can transform them into dummy variables, one for each group. Warning: omit one category to avoid multi-collinearity.
Interpretation Each coefficient is the difference between the value of that group and the value for the omitted (reference) group.

Other Variables in the Regression
  Yi = α + βDi + γXi + εi
β̂ is the difference in intercept between group A and group B. The Xi are "control" variables – things that did not affect the assignment but may have been different at baseline.

Dummy Variables and Interactions Imagine you have two sets of dummy variables, say, Treatment and Control Di, Male and Female Mi:
  Yi = α + βDi + γMi + δMi·Di + εi
• α̂: an estimate of the mean for women in the control group
• β̂: an estimate of the difference between treatment and control group means for women (treatment main effect)
• γ̂: an estimate of the difference between males and females (gender main effect)
• δ̂: an estimate of the difference between the treatment effect for males and for females (interaction effect)
This is the basic difference-in-differences model, which is used by empirical researchers in a situation where there was a change in the law (or an event) affecting one group but not the other, and you are willing to assume that in the absence of the law, the difference between the two groups would have remained stable over time.

More Generally: Interactions More generally, the coefficient on the interaction between a dummy variable and some variable X tells us the extent to which the dummy variable changes the regression function for that regressor.
  Yi = β0 + β0* Di + β1 X1i + β* Di X1i + · · · + εi
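In R the interaction model is one line; simulated data for illustration:

  set.seed(10)
  n <- 400
  d <- rbinom(n, 1, 0.5)     # treatment dummy
  m <- rbinom(n, 1, 0.5)     # male dummy
  y <- 2 + 1.0 * d + 0.5 * m + 0.7 * d * m + rnorm(n)
  coef(lm(y ~ d * m))        # alpha, beta (effect for women), gamma, delta (extra effect for men)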
Transformations of the Dependent Variable
Suppose
  Yi = A X1i^β1 X2i^β2 e^εi.
Then run the linear regression
  log(Yi) = β0 + β1 log X1i + β2 log X2i + εi
to estimate β1 and β2. Note that β1 and β2 are elasticities: when X1i changes by 1%, Yi changes by β1%.
Returns to education formulation
  log Yi = β0 + β1 Si + εi
When education increases by 1 year, wages increase by β1 × 100%.
Box-Cox Transformation Suppose
  Yi = 1/(β0 + β1 X1i + β2 X2i + εi).
Then run the regression
  1/Yi = β0 + β1 X1i + β2 X2i + εi.
Discrete Choice Model Suppose
  Pi = e^(β0 + β1 X1i + β2 X2i + εi) / (1 + e^(β0 + β1 X1i + β2 X2i + εi)).
If Pi is the percentage of individuals choosing a particular option (e.g., buying a particular car), then run the regression
  Yi = log(Pi/(1 − Pi)) = β0 + β1 X1i + β2 X2i + εi

Polynomial Models
  Yi = β0 + β1 X1i + β2 X1i^2 + · · · + βk X1i^k + εi
• You can choose a straight polynomial, series expansion, orthogonal polynomials, etc.
• If you assume that the model is known, this is just standard OLS.
• If you assume that the model is not known, this is a non-parametric method – there is bias (because the shape is never quite perfect) and variance (as you add more Xs), so you add more terms as the number of observations increases (series regression).

Regression Discontinuity Design
• add polynomials:
  Yi = β0 + β1 Dai + β2 ai + β3 ai^2 + β4 ai^3 + εi
• fit a polynomial on each side of the discontinuity:
  Yi = β0 + β1 Dai + β2 (ai − a0) + β3 (ai − a0)·Dai + · · · + εi
Centering the variables ensures that β1 is still the jump at the discontinuity.
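A sketch of the centered RD regression in R (simulated running variable with a cutoff at a0 = 0; the rdd package's RDestimate(), listed in the command table below, automates this):

  set.seed(11)
  a  <- runif(1000, -1, 1)                  # running variable
  a0 <- 0
  D  <- as.numeric(a >= a0)                 # treatment switches on at the cutoff
  y  <- 1 + 0.8 * D + 0.5 * (a - a0) + 0.3 * (a - a0) * D + rnorm(1000, sd = 0.5)
  coef(lm(y ~ D + I(a - a0) + D:I(a - a0))) # the coefficient on D estimates the jump (0.8 here)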
Omitted Variable Bias
Suppose that the regression model excludes a key variable (e.g., the data is unavailable).
Example Consider the model:
  log(Wi) = β0 + β1 Ei + β2 Xi + β3 Ai + εi
where Wi is wage, Ei is education, Xi is job experience and Ai is ability. We are interested in measuring the effects of Ei and Xi on Wi with Ai constant. Suppose that Ai is unavailable, so we run the regression without it. Next, we instead use IQ as a proxy for the omitted variable and run the regression. The results are shown below.

  log(wage)  | Coeff.  SE     | Coeff.  SE
  Education  | 0.078   0.007  | 0.057   0.007
  Experience | 0.020   0.003  | 0.020   0.003
  IQ         | —       —      | 0.006   0.001
  Constant   | 5.503   0.112  | 5.198   0.122

The estimated return to education changes from 7.8% to 5.7%.
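A simulation of the same mechanism (all coefficients invented): when an omitted variable is correlated with an included regressor, the included coefficient absorbs part of its effect.

  set.seed(12)
  n <- 5000
  ability <- rnorm(n)
  educ    <- 12 + 2 * ability + rnorm(n)       # education correlated with ability
  exper   <- rnorm(n, 10, 3)
  logw    <- 1 + 0.06 * educ + 0.02 * exper + 0.10 * ability + rnorm(n, sd = 0.3)
  coef(lm(logw ~ educ + exper))                # education coefficient biased upward (about 0.10)
  coef(lm(logw ~ educ + exper + ability))      # about 0.06 once ability is controlled for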
More Advanced Techniques

Machine Learning: Double Post LASSO
• Suppose that we have lots of variables and we are not sure which ones to include.
• There are machine learning techniques to learn which variables are predictive.
• Three steps:
  1. Regress X1 on all the available variables and see what LASSO picks. Call this X2.
  2. Regress Y on all the available variables and see what LASSO picks. Call this X3.
  3. Run
    Yi = β0 + β1 X1i + β2 X2i + β3 X3i + εi.

Module 10

Machine Learning

Estimation
• Strict assumptions about the data generating process
• Back out coefficients
• Low-dimensional
Prediction
• Allow for flexible functional forms
• Get high quality predictions
• Give up on adjudicating between observably similar functions (variables)
Understanding OLS In-sample fit vs. out-of-sample fit
  β̂_OLS = arg min_β Ê_{Sn}[(β'x − y)^2]
  β_prediction = arg min_β E_{(y,x)}[(β'x − y)^2]
OLS looks good with the sample you have. We overfit in estimation.
Processing of data requires machine learning. Two kinds of processing:
• Pre-processing
• Processing

Visualizing Data

Two different goals of data visualization
• For yourself: getting a sense of what is in the data – to guide future analysis
• For others: telling a story about the data and your results – to communicate your results
Scientific visualization
What to achieve:
• Show the data.
• Not lie about it.
• Illustrate a story.
• Reduce clutter.
• Visualization must complement the text and have enough information to stand alone.
Tufte's Principles
• Show the data.
• Maximize the data-ink ratio.
• Erase non-data ink (as much as possible).
• Erase redundant data ink.
• Avoid chart junk (moiré, ducks).
• Try to increase the density of data ink.
• Graphics should tend to be horizontal.

Module 11

Endogeneity and Instrumental Variables

Consider a more general model:
  Yi = β0 + β1 Xi + β2 Ti + εi
  Xi = α0 + α1 Yi + α2 Zi + δi
Endogenous variables (Xi and Yi) are determined within the system.
We talk about endogeneity when there is a mutual relationship, i.e., when a reasonable case can be made either way.

Instrumental Variables
An instrument for the model
  Yi = β0 + β1 Xi + εi
is a variable Zi such that
  Cov(Z, X) ≠ 0 and Cov(Z, ε) = 0.
Three conditions:
1. It affects X: Cov(Z, X) ≠ 0.
2. It is randomly assigned.
3. It has no direct effect on Y: Cov(Z, ε) = 0 (exclusion restriction)
The IV estimation can be seen as a two-step estimator within a simultaneous equations model.

RCT as IV Let Zi be a dummy variable equal to 1 if assigned to the treatment group and 0 otherwise. Then,
  β̂1 = (E[Yi|Zi = 1] − E[Yi|Zi = 0]) / (E[Xi|Zi = 1] − E[Xi|Zi = 0])
• The denominator is the first stage relationship.
• The numerator is the reduced form relationship.
• β̂1 is the Wald estimate.
The interpretation of IV when the treatment effect is not constant Under a fairly mild assumption, the Wald estimate still has a causal interpretation. It captures the effect of the treatment on those who are compelled by the instrument to get treated (Local Average Treatment Effect or LATE).

From the Wald estimate to Two-Stage Least Squares (2SLS)
• We could couch this in a regression framework.
• First stage: π̂1 in
  Xi = π0 + π1 Zi + δi
• Reduced form: γ̂1 in
  Yi = γ0 + γ1 Zi + ωi
• Two-Stage Least Squares: Run the first stage and take the fitted values X̂i. In the second stage, run Yi = β0 + β1 X̂i + εi.
The 2SLS and the Wald estimates are identical:
  β̂1 = Cov(Yi, X̂i)/Var(X̂i) = Cov(Yi, π0 + π1 Zi)/Var(π0 + π1 Zi) = π1 Cov(Yi, Zi)/(π1^2 Var(Zi)) = γ1/π1
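A manual 2SLS on simulated data, checking that it equals the Wald ratio; the AER package's ivreg() (see the command table below) gives the same point estimate:

  set.seed(13)
  n <- 2000
  z <- rbinom(n, 1, 0.5)                            # randomized encouragement
  u <- rnorm(n)
  x <- rbinom(n, 1, plogis(-0.5 + 1.5 * z + u))     # endogenous treatment take-up
  y <- 2 + 1.0 * x + u + rnorm(n)
  wald <- (mean(y[z == 1]) - mean(y[z == 0])) / (mean(x[z == 1]) - mean(x[z == 0]))
  xhat <- fitted(lm(x ~ z))                         # first stage
  tsls <- coef(lm(y ~ xhat))[2]                     # second stage
  c(wald = wald, tsls = unname(tsls))               # identical; plain OLS of y on x would be biased upward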
Experimental Design

What is experimental design?
• What is being randomized?
  - the intervention
• Who is being randomized?
  - the level of randomization (schools, individuals, villages, cells)
  - the sample over which you randomize
• How is randomization introduced?
  - method of randomization
  - stratification
• How many units are being randomized?
  - power

Randomization
• Simple randomization: define your sample frame and your unit of randomization, use software to randomly assign one group to treatment, one to control
• Stratification: create groups that are similar ex-ante
• Clustering: randomize at the group level

Introducing Randomization
• Phase-in design
• Randomization "in the bubble"
• Encouragement design
Some R commands

  Command            | Library | What it does
  chooseMatrix(n,m)  | perm    | Creates a matrix of choose(n,m) rows and n columns. The matrix has unique rows with m ones in each row and the rest zeros.
  NROW(x), NCOL(x)   |         | Returns the number of rows or columns in matrix x.
  var(x)             |         | Computes the variance of x, which is a vector, matrix or dataframe.
  cov(x,y)           |         | Computes the covariance of x and y, where both arguments are vectors, matrices or dataframes with comparable dimensions to each other.
  apply()            |         | Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.
  lm()               |         | Fits a linear model to the given data.
  confint()          |         | Computes confidence intervals for one or more parameters in a fitted model.
  felm()             | lfe     | Fits linear models with multiple group fixed effects, similar to lm.
  DCdensity()        | rdd     | Implements the McCrary (2008) sorting test.
  RDestimate()       | rdd     | Calculates the Regression Discontinuity estimate.
  ivreg()            | AER     | Fits IV regression by a two-stage least squares method.
Recommended Resources

• Causal Inference for Statistics, Social, and Biomedical Sciences (Guido W. Imbens and Donald B. Rubin)
• Mastering 'Metrics (Joshua D. Angrist and Jörn-Steffen Pischke)
• Data Analysis for Social Scientists [Lecture Slides] (http://www.edx.org)
• R Studio (https://www.rstudio.com)

Please share this cheatsheet with friends!


Summary of Special Distributions

  Distribution      | PDF / PMF                                                      | Expectation and Variance
  Bernoulli         | pX(x) = p^x (1 − p)^(1−x),  x ∈ {0, 1}                        | E[X] = p;  Var(X) = p(1 − p)
  Binomial          | pX(x) = (n choose x) p^x (1 − p)^(n−x),  x = 0, 1, . . . , n  | E[X] = np;  Var(X) = np(1 − p)
  Hypergeometric    | pX(x) = (A choose x)(B choose n−x) / (A+B choose n)           | E[X] = nA/(A + B);  Var(X) = nAB(A + B − n) / [(A + B)^2 (A + B − 1)]
  Negative Binomial | pX(k) = (r+k−1 choose k) p^r (1 − p)^k                        | E[X] = r(1 − p)/p;  Var(X) = r(1 − p)/p^2
  Geometric         | pX(k) = (1 − p)^(k−1) p                                       | E[X] = 1/p;  Var(X) = (1 − p)/p^2
  Poisson           | pX(k) = λ^k e^(−λ)/k!                                         | E[X] = λ;  Var(X) = λ
  Exponential       | fX(x) = λ e^(−λx)                                             | E[X] = 1/λ;  Var(X) = 1/λ^2
  Uniform           | fX(x) = 1/(b − a)                                             | E[X] = (a + b)/2;  Var(X) = (b − a)^2/12
  Normal            | fX(x) = (1/(σ√(2π))) e^(−(x−µ)^2/(2σ^2))                      | E[X] = µ;  Var(X) = σ^2

(The original sheet also shows a small graph of each distribution.)