Data Analysis For Social Scientists Cheatsheet

14.310x Data Analysis for Social Scientists

This is a cheat sheet for data analysis based on the online course given by Prof. Esther Duflo and Prof. Sara Ellison. Compiled by Janus B. Advincula.

Last Updated December 3, 2019
Module 1

Introduction
• Data is plentiful.
• Data is beautiful.
• Data is insightful.
• Data is powerful.
• Data can be deceitful.

Causation vs. Correlation
• Correlation is not causality.
• A causal story is not causality either.
• Even more sophisticated data use may still not be causality.

What We Need to Learn
• How do we model the processes that might have generated our data?
  - Probability
• How do we summarize and describe data, and try to uncover what process may have generated it?
  - Statistics
• How do we uncover patterns between variables?
  - Exploratory data analysis
  - Econometrics
Module 2

Fundamentals of Probability

A sample space S is a collection of all possible outcomes of an experiment.
An event A is any collection of outcomes (including individual outcomes, the entire sample space, the null set).

Useful results:
• If A ⊂ B, then A ∪ B = B.
• If A ⊂ B and B ⊂ A, then A = B.
• If A ⊂ B, then A ∩ B = AB = A.
• A ∪ Aᶜ = S. (A and Aᶜ form a partition of S.)

A and B are mutually exclusive (disjoint) if they have no outcomes in common.
A and B are exhaustive (complementary) if their union is S.

Probability
We will assign to every event A a number P(A), which is the probability the event will occur (P : S → R). We require that:
1. P(A) ≥ 0 for all A ⊂ S
2. P(S) = 1
3. For any sequence of disjoint sets A1, A2, . . . , P(∪ᵢ Aᵢ) = Σᵢ P(Aᵢ)

A probability on a sample space S is a collection of numbers P(A) that satisfy axioms 1-3.

Counting
1. If an experiment has two parts, the first one having m possibilities and, regardless of the outcome in the first part, the second one having n possibilities, then the experiment has m × n possible outcomes.
2. Any ordered arrangement of objects is called a permutation. The number of different permutations of N objects is N!. The number of different permutations of n objects taken from N objects is N!/(N − n)!.
3. Any unordered arrangement of objects is called a combination. The number of different combinations of n objects taken from N objects is N!/[(N − n)! n!]. We typically denote this (N choose n).

Properties:
• P(Aᶜ) = 1 − P(A)
• P(∅) = 0
• If A ⊂ B, then P(A) ≤ P(B)
• For all A, 0 ≤ P(A) ≤ 1
• P(A ∪ B) = P(A) + P(B) − P(AB)
• P(ABᶜ) = P(A) − P(AB)

Independence  Events A and B are independent if P(AB) = P(A)P(B).
Theorem  If A and B are independent, A and Bᶜ are also independent.

Conditional Probability  The probability of A conditional on B is
P(A|B) = P(AB)/P(B),  P(B) > 0.
If A and B are independent and P(B) > 0, then
P(A|B) = P(AB)/P(B) = P(A)P(B)/P(B) = P(A).

Bayes' Theorem
P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|Aᶜ)P(Aᶜ)]
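A quick numeric illustration of Bayes' theorem in R (a hypothetical testing example, not from the course; all numbers are made up):

p_A      <- 0.01   # P(A): prior probability of the event
p_B_A    <- 0.95   # P(B|A)
p_B_notA <- 0.10   # P(B|A^c)
# Bayes' theorem: P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|A^c)P(A^c)]
p_B_A * p_A / (p_B_A * p_A + p_B_notA * (1 - p_A))   # about 0.088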
Random Variables, Distributions, and Joint Distributions

A random variable is a real-valued function whose domain is the sample space.
A discrete random variable can take on only a finite or countably infinite number of values.
A random variable that can take on any value in some interval, bounded or unbounded, of the real line is called a continuous random variable.

The probability function (PF) of X, where X is a discrete random variable, is the function fX such that for any real number x, fX(x) = P(X = x).
Properties:
• 0 ≤ fX(xᵢ) ≤ 1
• Σᵢ fX(xᵢ) = 1
• P(A) = P(X ⊂ A) = Σ_{xᵢ ∈ A} fX(xᵢ)
• P(X = x) = 0 for any x if X is continuous.

The density or probability density function (PDF) is the continuous analog to the discrete PF in many ways. A random variable X is continuous if there exists a non-negative function fX such that for any interval A ⊂ R,
P(X ⊂ A) = ∫_A fX(x) dx.
Properties:
• 0 ≤ fX(x)
• ∫ fX(x) dx = 1
• P(A) = P(a ≤ X ≤ b) = ∫_A fX(x) dx

The cumulative distribution function (CDF) FX of a random variable X is defined for each x as
FX(x) = P(X ≤ x).
Properties:
• 0 ≤ FX(x) ≤ 1
• FX(x) is non-decreasing in x
• lim_{x→−∞} FX(x) = 0
• lim_{x→∞} FX(x) = 1
• FX(x) is right continuous.

A PF/PDF and a CDF for a particular random variable contain exactly the same information about its distribution, just in a different form.
FX(x) = P(X ≤ x) = ∫_{−∞}^{x} fX(t) dt
F′X(x) = dFX(x)/dx = fX(x)

Binomial Distribution  X ∼ B(n, p)
fX(x) = (n choose x) p^x (1 − p)^(n−x),  x = 0, 1, . . . , n
The binomial distribution describes the number of "successes" in n trials where the trials are independent and the probability of success in each is p.

Hypergeometric Distribution  X ∼ H(N, K, n)
fX(x) = (K choose x)(N − K choose n − x) / (N choose n),  x = max(0, n + K − N), . . . , min(n, K)
The hypergeometric distribution describes the number of "successes" in n trials where you're sampling without replacement from a population of size N whose initial probability of success was K/N.

Joint Distributions
If X and Y are continuous random variables defined on the same sample space S, then the joint probability density function of X & Y, fXY(x, y), is the surface such that for any region A of the xy-plane,
P((X, Y) ⊂ A) = ∫∫_A fXY(x, y) dx dy.
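In R, dbinom() and dhyper() evaluate these probability functions directly; a minimal sketch (the numbers are arbitrary):

# Binomial: P(X = 3) for X ~ B(n = 10, p = 0.3)
dbinom(3, size = 10, prob = 0.3)
choose(10, 3) * 0.3^3 * 0.7^7                  # same value from the formula

# Hypergeometric with N = 20, K = 8, n = 5: P(X = 2)
# (R parameterizes dhyper by m = K successes, n = N - K failures, k = n draws)
dhyper(2, m = 8, n = 12, k = 5)
choose(8, 2) * choose(12, 3) / choose(20, 5)   # same value from the formula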
Gathering and Collecting Data

Research involves human subjects if the researcher obtains
1. data through intervention or interaction with the individual, or
2. identifiable private information.

Key Principles of the Belmont Report
1. Respect for Persons
• Respect individual autonomy
• Protect individuals with reduced autonomy

Summarizing and Describing Data

Histogram
[Figure: histogram of height in centimeters, Bihar females; counts up to about 1,500 over the range 120-180 cm.]

Kernel Density Estimation
[Figure: kernel density estimate of the same variable; density values up to about 0.04.]
You can get a smoothed version of the histogram with a kernel density estimate, and an empirical version of the CDF (using ecdf in R).
[Figure: empirical CDF, rising from 0.00 to 1.00.]
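A minimal R sketch producing these three summaries for a hypothetical numeric vector (heights is simulated here, not the course data):

set.seed(1)
heights <- rnorm(1000, mean = 150, sd = 8)   # hypothetical sample
hist(heights)                                # histogram
plot(density(heights))                       # kernel density estimate
plot(ecdf(heights))                          # empirical CDF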
Joint, Marginal and Conditional Distributions

Joint Distribution
Example:
fXY(x, y) = c x² y  for x² ≤ y ≤ 1,  and 0 otherwise.

Support:
[Figure: the support is the region between the parabola y = x² and the line y = 1.]

Requiring the density to integrate to 1 gives c = 21/4, and then, for instance,
P(X ≥ Y) = ∫₀¹ ∫_{x²}^{x} (21/4) x² y dy dx = 3/20.

Marginal Distribution
For continuous random variables:
fX(x) = ∫_y fXY(x, y) dy

Conditional Distribution
fY|X(y|x) = fXY(x, y) / fX(x)
(= P(Y = y | X = x) for X, Y discrete)

Conditional distributions and independence
fY|X(y|x) = fY(y)  iff  fXY(x, y) = fX(x) fY(y)  iff  X & Y independent
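A quick numerical sanity check of this example in R (a crude Riemann sum; the grid size is arbitrary):

f <- function(x, y) ifelse(x^2 <= y & y <= 1, 21/4 * x^2 * y, 0)
g <- expand.grid(x = seq(-1, 1, length.out = 400), y = seq(0, 1, length.out = 400))
cell <- (2 / 400) * (1 / 400)
sum(f(g$x, g$y)) * cell                     # close to 1
sum(f(g$x, g$y) * (g$x >= g$y)) * cell      # close to 3/20 = 0.15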
Module 4

Functions of Random Variables
X is a random variable with fX(x) known. We want the distribution of Y = h(X). Then,
FY(y) = ∫_{x : h(x) ≤ y} fX(x) dx
If Y is also continuous, then
fY(y) = dFY(y)/dy.

Example:
fX(x) = 1/2 for −1 ≤ x ≤ 1, and 0 otherwise.
Y = X². What is fY(y)? Note that the support of X is [−1, 1], which implies that the induced support of Y is [0, 1].
FY(y) = P(Y ≤ y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y) = ∫_{−√y}^{√y} (1/2) dx = √y  for 0 ≤ y ≤ 1
fY(y) = 1/(2√y) for 0 ≤ y ≤ 1, and 0 otherwise.
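A simulation sketch of this change-of-variables example in R: draw X uniformly on [−1, 1], square it, and compare the histogram of Y with the derived density 1/(2√y).

set.seed(1)
x <- runif(1e5, -1, 1)                       # f_X(x) = 1/2 on [-1, 1]
y <- x^2                                     # Y = X^2 has support [0, 1]
hist(y, breaks = 50, freq = FALSE)
curve(1 / (2 * sqrt(x)), from = 0.01, to = 1, add = TRUE)   # f_Y(y) = 1/(2*sqrt(y))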
Convolution
A convolution refers to the sum of independent random variables. Let X be continuous with PDF fX, Y continuous with PDF fY, and X and Y independent. Let Z be their sum. What is the PDF of Z?
fZ(z) = ∫_{−∞}^{∞} fX(z − y) fY(y) dy,  −∞ < z < ∞

Order Statistics
Let X1, . . . , Xn be continuous, independent, identically distributed, with PDF fX. Let Yn = max{X1, . . . , Xn}. This is called the nth order statistic.
Distribution:
Fn(y) = (FX(y))^n
fn(y) = dFn(y)/dy = n (FX(y))^(n−1) fX(y)
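A simulation sketch for the nth order statistic of Uniform(0, 1) draws, where FX(y) = y, so Fn(y) = y^n and fn(y) = n y^(n−1):

set.seed(1)
n <- 5
ymax <- replicate(1e4, max(runif(n)))      # nth order statistic of n Uniform(0,1) draws
hist(ymax, breaks = 50, freq = FALSE)
curve(n * x^(n - 1), add = TRUE)           # f_n(y) = n * y^(n-1)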
Moments of a Distribution
The mode is the point where the PDF reaches its highest value.
The median is the point above and below which the integral of the PDF is equal to 1/2.
The mean, or expectation, or expected value, is defined as
E[X] = ∫ x fX(x) dx.
For a function Y = g(X),
E[Y] = E[g(X)] = ∫ g(x) fX(x) dx

Expectation, Variance and an Introduction to Regression
Properties of Expectation:
1. E[a] = a, a constant
2. E[Y] = aE[X] + b,  Y = aX + b
3. E[Y] = E[X1] + · · · + E[Xn],  Y = X1 + · · · + Xn
4. E[Y] = a1E[X1] + · · · + anE[Xn] + b,  Y = a1X1 + · · · + anXn + b
5. E[XY] = E[X]E[Y] if X, Y independent

Variance
Var(X) = E[(X − µ)²]
6. Var(X) = E[X²] − (E[X])²

Standard Deviation
SD(X) = σ = √Var(X)

Conditional Expectation
E[Y|X] = ∫ y fY|X(y|x) dy
Note that E[Y|X] is a function of X and, therefore, a random variable.

Law of Iterated Expectations
E[E[Y|X]] = E[Y]

Law of Total Variance
Var(E[Y|X]) + E[Var(Y|X)] = Var(Y)

Covariance and Correlation
Covariance
Cov(X, Y) = E[(X − µX)(Y − µY)]
Correlation
ρ(X, Y) = E[(X − µX)(Y − µY)] / (√Var(X) √Var(Y))
• X & Y are "positively correlated" if ρ > 0.
• X & Y are "negatively correlated" if ρ < 0.
• X & Y are "uncorrelated" if ρ = 0.

Properties of Covariance:
1. Cov(X, X) = Var(X)
2. Cov(X, Y) = Cov(Y, X)
3. Cov(X, Y) = E[XY] − E[X]E[Y]
4. X, Y independent ⇒ Cov(X, Y) = 0
5. Cov(aX + b, cY + d) = ac Cov(X, Y)
6. Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
7. |ρ(X, Y)| ≤ 1
8. |ρ(X, Y)| = 1 iff Y = aX + b, a ≠ 0

A Preview of Regression
Let β = Cov(X, Y)/Var(X) and α = µY − βµX.
Then U = Y − α − βX has the following properties:
• E[U] = 0
• Cov(X, U) = 0
α & β are the regression coefficients.

Inequalities
Markov Inequality  X is a random variable that is always non-negative. Then, for any t > 0,
P(X ≥ t) ≤ E[X]/t
Chebyshev Inequality  X is a random variable for which Var(X) exists. Then, for any t > 0,
P(|X − E[X]| ≥ t) ≤ Var(X)/t²
Module 5

Special Distributions
Bernoulli  Two possible outcomes: success or failure. The probability of success is p, of failure q (= 1 − p).
Binomial  If X1, . . . , Xn are i.i.d. random variables, all Bernoulli distributed with success probability p, then
Σ_{k=1}^{n} Xk ∼ B(n, p), the binomial distribution.
The binomial distribution is the number of successes in a sequence of n independent (success/failure) trials, each of which yields success with probability p.
Hypergeometric  The binomial distribution is used to model the number of successes in a sample of size n with replacement. If you sample without replacement, you get the hypergeometric distribution.
Negative Binomial  Consider a sequence of independent Bernoulli trials, and let X be the number of trials necessary to achieve r successes. X has a negative binomial distribution.
Geometric  A negative binomial distribution with r = 1 is a geometric distribution. It is the number of failures before the first success.
• The sum of r independent Geometric(p) random variables is a negative binomial (r, p) random variable.
• If the Xi are i.i.d. and negative binomial (ri, p), then Σ Xi is distributed as a negative binomial (Σ ri, p).
• Memorylessness
Poisson  The Poisson distribution expresses the probability of a given number of events occurring in a fixed interval of time if:
1. the events can be counted in whole numbers,
2. the occurrences are independent, and
3. the average frequency of occurrence for a time period is known.
Relationship between Poisson and Binomial  For small values of p, the Poisson distribution can be used to approximate the Binomial distribution.
Exponential
• waiting time between two events in a Poisson process

The Sample Mean, Central Limit Theorem and Estimation
The sample mean is the arithmetic average of the n random variables (or realizations) from a random sample of size n.
X̄n = (1/n)(X1 + · · · + Xn) = (1/n) Σ_{i=1}^{n} Xi
Expectation of the sample mean:
E[X̄n] = µ
Variance of the sample mean:
Var(X̄n) = σ²/n

The Central Limit Theorem
Let X1, . . . , Xn form a random sample of size n from a distribution with finite mean and variance. Then for any fixed number x,
lim_{n→∞} P(√n (X̄ − µ)/σ ≤ x) = Φ(x)
where Φ(x) is the CDF of a standard normal random variable.
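A simulation sketch of the CLT in R: standardized means of i.i.d. Exponential(1) samples (µ = σ = 1) look approximately standard normal even for moderate n.

set.seed(1)
n <- 30
z <- replicate(1e4, sqrt(n) * (mean(rexp(n, rate = 1)) - 1) / 1)   # sqrt(n)(Xbar - mu)/sigma
hist(z, breaks = 50, freq = FALSE)
curve(dnorm(x), add = TRUE)                                        # standard normal limit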
Statistics
An estimator is a function of the random variables in a random sample.
A parameter is a constant indexing a family of distributions.
The function of the random sample is the estimator; the number, or realization, of that function is the estimate.

Theorem  The sample mean for an i.i.d. sample is unbiased for the population mean.
Theorem  The sample variance for an i.i.d. sample is unbiased for the population variance, where the sample variance is
S² = 1/(n − 1) Σ_{i=1}^{n} (Xi − X̄n)²

Given two unbiased estimators, θ̂1 & θ̂2, θ̂1 is more efficient than θ̂2 if, for a given sample size,
Var(θ̂1) < Var(θ̂2)

Mean Squared Error  Sometimes we are interested in trading off bias and variance/efficiency.
MSE(θ̂) = E[(θ̂ − θ)²] = Var(θ̂) + (E[θ̂] − θ)²

θ̂ is a consistent estimator for θ if, for every δ > 0,
lim_{n→∞} P(|θ − θ̂n| < δ) = 1.
Roughly, an estimator is consistent if its distribution collapses to a single point at the true parameter as n → ∞.

Method of Moments
Population Moments (about the origin):
E[X], E[X²], E[X³], . . .
Sample Moments:
(1/n) Σ Xi, (1/n) Σ Xi², (1/n) Σ Xi³, . . .
To estimate parameters, set the sample moments equal to the corresponding population moments and solve.
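A minimal method-of-moments sketch in R (a standard exponential example, not necessarily the one used in the course): since E[X] = 1/λ, matching the first sample moment gives λ̂ = 1/X̄.

set.seed(1)
x <- rexp(500, rate = 2)     # true lambda = 2
1 / mean(x)                  # method-of-moments estimate, close to 2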
F Distribution
If X ∼ χ²n and Z ∼ χ²m and they're independent, then
(X/n)/(Z/m) ∼ F_{n,m}.

Confidence Intervals
Case 1  We are sampling from a normal distribution with a known variance and we want a confidence interval for the mean.
P(Φ⁻¹(α/2) < √n (X̄ − µ)/σ < −Φ⁻¹(α/2)) = 1 − α
CI_{1−α} = [X̄ + Φ⁻¹(α/2) σ/√n,  X̄ − Φ⁻¹(α/2) σ/√n]

Case 2  We are sampling from a normal distribution with an unknown variance and we want a confidence interval for the mean. With S the sample standard deviation,
P(t⁻¹_{n−1}(α/2) < √n (X̄ − µ)/S < −t⁻¹_{n−1}(α/2)) = 1 − α
CI_{1−α} = [X̄ + t⁻¹_{n−1}(α/2) S/√n,  X̄ − t⁻¹_{n−1}(α/2) S/√n]

What happens as n increases or decreases?
[Figure: the sampling distribution of X̄ for smaller n vs. larger n.]
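Both cases in R, using qnorm() and qt() for the Φ⁻¹ and t⁻¹ quantiles (the data are simulated):

set.seed(1)
x <- rnorm(25, mean = 10, sd = 2); n <- length(x); alpha <- 0.05
# Case 1: known sigma = 2 (lower then upper limit, since qnorm(alpha/2) < 0)
mean(x) + c(1, -1) * qnorm(alpha / 2) * 2 / sqrt(n)
# Case 2: unknown sigma, use the sample sd and the t distribution
mean(x) + c(1, -1) * qt(alpha / 2, df = n - 1) * sd(x) / sqrt(n)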
Hypothesis Testing
A hypothesis is an assumption about the distribution of a random variable in a population.
A maintained hypothesis is one that cannot or will not be tested.
A testable hypothesis is one that can be tested using evidence from a random sample.
The null hypothesis, H0, is the one that will be tested.
The alternative hypothesis, HA, is a possibility (or series of possibilities) other than the null.
We might want to perform a test concerning an unknown parameter θ where Xi ∼ f(x|θ):
H0: θ in Θ0
HA: θ in ΘA, where Θ0 and ΘA are disjoint.
We define the critical region of the test, C or CX, as the region of the support of the random sample for which we reject the null. The critical region will take the form X̄ > k for some k yet to be determined.

Statistical Power
[Figure: fX̄ under H0 and fX̄ under HA with cutoff k; β is the area under HA below k, α the area under H0 beyond k, and the "power" is 1 − β.]
Choice of any one of α, β, or k determines the other two. This involves an explicit trade-off between the probability of type I and type II errors.
• Increasing k means α ↓ and β ↑.
• Decreasing k means α ↑ and β ↓.

For a randomized experiment with Nt treatment and Nc control units and treatment effect τ,
(Ȳt_obs − Ȳc_obs − τ) / √(σ²/Nt + σ²/Nc) ≈ N(0, 1)
The power of a two-sided test of size α is
P(|T| > Φ⁻¹(1 − α/2)) ≈ Φ(−Φ⁻¹(1 − α/2) + τ/√(σ²/Nt + σ²/Nc)) + Φ(−Φ⁻¹(1 − α/2) − τ/√(σ²/Nt + σ²/Nc))
The second term is very small so we ignore it. We want the first term to be equal to 1 − β:
Φ⁻¹(1 − β) = −Φ⁻¹(1 − α/2) + τ√(N γ(1 − γ))/σ
where γ = Nt/N.
The required sample size is
N = (Φ⁻¹(1 − β) + Φ⁻¹(1 − α/2))² / [(τ²/σ²) γ(1 − γ)]
• With stratified design, the variance of the estimated treatment effect is lower.
• With clustered design, the variance of the estimated treatment effect is larger.
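The sample-size formula in R, under assumed values τ/σ = 0.2 (standardized effect size), α = 0.05, power 1 − β = 0.8, and γ = 0.5 (all hypothetical):

tau_over_sigma <- 0.2; alpha <- 0.05; beta <- 0.2; gamma <- 0.5
N <- (qnorm(1 - beta) + qnorm(1 - alpha / 2))^2 /
     (tau_over_sigma^2 * gamma * (1 - gamma))
ceiling(N)     # total sample size, roughly 785 (close to what power.t.test() gives)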
Stratified Design
• Take the difference in means within each stratum.
• Take a weighted average of the stratum-level treatment effects, with weights the size of the strata:
  Σ_g (Ng/N) τ̂g
• This will be an unbiased estimate of the average treatment effect.
• The variance will be calculated as:
  Σ_g (Ng/N)² V̂g
• Special case: the probability of assignment to the control group stays the same in each stratum. Then this coefficient is equal to the simple difference between treatment and control, but the variance is always weakly lower.
• Stratification will lower the required sample size for a given power.

Clustered Design
• We need to take into account the fact that the potential outcomes for units within randomization clusters are not independent.
• Conservative way to do this: just average the outcome by cluster and treat each cluster as an observation.
• The number of observations is then the number of clusters, and you can analyze this data exactly as a completely randomized experiment but with clusters as the units of observation.

Stable Unit Treatment Value Assumption (SUTVA)  The potential outcomes for any unit do not vary with the treatments assigned to other units, and, for each unit, there are no different forms or versions of each treatment level which lead to different potential outcomes.

The Assignment Mechanism  Let's assume we have a population of size N, indexed by i. Let the treatment indicator Wi take on the values 0 (the control treatment) and 1 (the active treatment). We have one realized (and possibly observed) potential outcome for each unit, denoted by Yi_obs:
Yi_obs = Yi(Wi) = Yi(0) if Wi = 0,  Yi(1) if Wi = 1.
For each unit we also have one missing potential outcome, Yi_mis:
Yi_mis = Yi(1 − Wi) = Yi(1) if Wi = 0,  Yi(0) if Wi = 1.
Comparisons of Yi(1) and Yi(0) are unit-level causal effects:
Yi(1) − Yi(0)

Missing data problem  Given any treatment assigned to an individual unit, the potential outcome associated with any alternate treatment is missing. A key role is therefore played by the missing data mechanism, or the assignment mechanism: how is it determined which units get which treatments or, equivalently, which potential outcomes are realized (and observed) and which are missing?

Analyzing Randomized Experiments

The Average Treatment Effect
ATE = E[Yi_obs | Wi = 1] − E[Yi_obs | Wi = 0]
Suppose we have a completely randomized experiment with Nt treatment units and Nc control units. The difference in sample averages is
τ̂ = (1/Nt) Σ_{i:Wi=1} Yi_obs − (1/Nc) Σ_{i:Wi=0} Yi_obs = Ȳt_obs − Ȳc_obs.
The variance of a difference of two statistically independent variables is the sum of their variances. Thus,
V(τ̂) = Sc²/Nc + St²/Nt.
To estimate the variance, V̂(τ̂), replace Sc² and St² by their sample counterparts:
sc² = 1/(Nc − 1) Σ_{i:Wi=0} (Yi_obs − Ȳc_obs)²
and analogously st² for the treated units.
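A minimal sketch of τ̂ and its estimated variance in R, for hypothetical vectors y (outcome) and w (treatment indicator):

set.seed(1)
w <- rbinom(200, 1, 0.5)
y <- 1 + 0.5 * w + rnorm(200)                 # hypothetical data, true effect 0.5
tau_hat <- mean(y[w == 1]) - mean(y[w == 0])
v_hat   <- var(y[w == 0]) / sum(w == 0) + var(y[w == 1]) / sum(w == 1)
c(estimate = tau_hat, std.error = sqrt(v_hat))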
Kolmogorov-Smirnov Test
Let X1, . . . , Xn be a random sample with CDF F and let Y1, . . . , Ym be a random sample with CDF G.
We are interested in testing the hypotheses
H0: F = G
Ha: F ≠ G
The Statistic
Dnm = max_x |Fn(x) − Gm(x)|
where Fn and Gm are the empirical CDFs of the first and second sample. The empirical CDF just counts the share of sample points below level x:
Fn(x) = Pn(X < x) = (1/n) Σ_{i=1}^{n} I(Xi < x)

First Order Stochastic Dominance: One-sided Kolmogorov-Smirnov Test
We are interested in testing the hypothesis
H0: F = G
against
Ha: F > G
(which would mean that G FSD F).
The one-sided KS statistic is:
D⁺nm = max_x [Fn(x) − Gm(x)]
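R's ks.test() computes these statistics; a minimal sketch on simulated samples:

set.seed(1)
x <- rnorm(100); y <- rnorm(100, mean = 0.5)
ks.test(x, y)                            # two-sided: Ha is F != G
ks.test(x, y, alternative = "greater")   # one-sided, based on D+ = max(Fn - Gm)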
Kernel Density Estimation and Kernel Regression
The kernel (Nadaraya-Watson) regression estimate of E[Y | X = x] is the ratio of:
Numerator:
(1/nh) Σ_{i=1}^{n} yi K((x − xi)/h)
Denominator:
f̂(x) = (1/nh) Σ_{i=1}^{n} K((x − xi)/h)
where h is the bandwidth, the denominator f̂(x) is the kernel estimate of the density of x, and K(·) is a density.

Large sample properties:
• As h goes to zero, the bias goes to zero.
• As nh goes to infinity, the variance goes to zero.
• As you increase the number of observations, you promise to decrease the bandwidth.

Choices to make:
• Choice of kernel
  1. Histogram (uniform): K(u) = 1/2 if |u| ≤ 1, K(u) = 0 otherwise.
  2. Epanechnikov: K(u) = (3/4)(1 − u²) if |u| ≤ 1, K(u) = 0 otherwise.
  3. Quartic: K(u) = (15/16)(1 − u²)² if |u| ≤ 1, K(u) = 0 otherwise.
• Choice of bandwidth: trade off bias and variance.
  – A large bandwidth will lead to more bias.
  – A small bandwidth will lead to more variance.
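In base R, density() implements the kernel density estimate with a choice of kernel and bandwidth, and ksmooth() gives a basic kernel regression; a minimal sketch (bandwidths chosen arbitrarily):

set.seed(1)
x <- rnorm(500); y <- sin(x) + rnorm(500, sd = 0.3)
plot(density(x, kernel = "epanechnikov", bw = 0.2))   # small bandwidth: less bias, more variance
lines(density(x, kernel = "epanechnikov", bw = 1))    # large bandwidth: more bias, less variance
plot(x, y)
lines(ksmooth(x, y, kernel = "normal", bandwidth = 0.5), col = 2)   # kernel regression fit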
Linear Regression
Under the assumptions of the Classical Linear Regression Model, OLS provides the minimum variance (most efficient) unbiased estimator of β0 and β1. It is the MLE under normality of errors, and the estimates are consistent and asymptotically normal.

Closed-form solutions:
β̂1 = [(1/n) Σ (Xi − X̄)(Yi − Ȳ)] / [(1/n) Σ (Xi − X̄)²],   β̂0 = Ȳ − β̂1 X̄

Fitted Value  Ŷi = β̂0 + β̂1 Xi
Residual  ε̂i = Yi − Ŷi
Regression Line or Fitted Line  Ŷ = β̂0 + β̂1 X

Let X̄ = (1/n) Σ Xi and σ̂²X = (1/n) Σ (Xi − X̄)².

Some comparative statics:
Mean: E[β̂0] = β0,  E[β̂1] = β1
Variance: Var(β̂0) = σ²/n + σ² X̄²/(n σ̂²X),  Var(β̂1) = σ²/(n σ̂²X)
Covariance: Cov(β̂0, β̂1) = −σ² X̄/(n σ̂²X)
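A quick check in R that lm() reproduces the closed-form solutions:

set.seed(1)
x <- rnorm(100); y <- 2 + 3 * x + rnorm(100)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)
coef(lm(y ~ x))    # same estimates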
Useful R Functions
chooseMatrix(n, m)  [package perm]  Creates a matrix of choose(n, m) rows and n columns; the matrix has unique rows with m ones in each row and the rest zeros.
confint()  Computes confidence intervals for one or more parameters in a fitted model.
ivreg()  [package AER]  Fits an IV regression by the two-stage least squares method.
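A usage sketch for confint() on a fitted model (with hypothetical data; ivreg() fits are handled the same way):

fit <- lm(y ~ x, data = data.frame(x = rnorm(100), y = rnorm(100)))
confint(fit, level = 0.95)   # 95% confidence intervals for the intercept and slope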
Recommended Resources
Distribution — PDF / PMF — Expectation and Variance

Binomial
pX(x) = (n choose x) p^x (1 − p)^(n−x),  x = 0, 1, . . . , n
E[X] = np,  Var(X) = np(1 − p)

Hypergeometric
pX(x) = (A choose x)(B choose n − x) / (A + B choose n)
E[X] = nA/(A + B),  Var(X) = nAB(A + B − n) / [(A + B)²(A + B − 1)]

Negative Binomial
pX(k) = (r + k − 1 choose k) p^r (1 − p)^k
E[X] = r(1 − p)/p,  Var(X) = r(1 − p)/p²

Geometric
pX(k) = (1 − p)^(k−1) p
E[X] = 1/p,  Var(X) = (1 − p)/p²

Poisson
pX(k) = λ^k e^(−λ) / k!
E[X] = λ,  Var(X) = λ

Exponential
fX(x) = λ e^(−λx)
E[X] = 1/λ,  Var(X) = 1/λ²

Uniform
fX(x) = 1/(b − a)
E[X] = (a + b)/2,  Var(X) = (b − a)²/12

Normal
fX(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²))
E[X] = µ,  Var(X) = σ²