Lecture 1: Intro to Bayes
Rebecca C. Steorts
Bayesian Methods and Modern Statistics: STA 360/601
Lecture 1
- Course webpage
- Syllabus
- LaTeX reference manual
- R Markdown reference manual
- Please come to office hours for all questions.
- Office hours are not a review period if you cannot come to class.
- Join the Google group.
- Graded on labs/HWs and exams.
- Labs/HWs and exams must be submitted in R Markdown format (it must compile).
- Nothing late will be accepted.
- Your lowest homework will be dropped.
- Announcements: by email or in class.
- All your lab/homework assignments will be uploaded to Sakai.
- How to reach me and the TAs: email or the Google group.
Expectations
Expectations for Homework
Things available to you!
- Why should we learn about Bayesian concepts?
- They are natural if we think of unknown parameters as random.
- They give a full distribution for the parameter when we perform an update.
- We automatically get uncertainty quantification.
- Drawbacks: they can be slow and inconsistent.
Record linkage
These are clearly not the same Steve Fienberg!
Syrian Civil War
Bayesian Model
Y_{j′ℓ} ∼ G_ℓ
z_{ijℓ} | β_{iℓ} ∼ Bernoulli(β_{iℓ})
β_{iℓ} ∼ Beta(a, b)
λ_{ij} ∼ DiscreteUniform(1, . . . , N_max), where N_max = ∑_{i=1}^{k} n_i
The model I showed you is very complicated.
This course will give you an intro to Bayesian models and methods.
Often Bayesian models are hard to work with, so we’ll learn about
approximations.
- “Bayesian” traces its origin to the 18th-century English Reverend Thomas Bayes, who, along with Pierre-Simon Laplace, discovered what we now call “Bayes’ Theorem”:

  p(θ|x) = p(x|θ) p(θ) / p(x) ∝ p(x|θ) p(θ).    (1)

  p(θ|x)  posterior
  p(x|θ)  likelihood
  p(θ)    prior
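To make the pieces of (1) concrete, here is a minimal sketch in R (not from the slides): it places a grid over θ, multiplies a likelihood by a prior, and normalizes to obtain the posterior. The flat prior and the Binomial(6, θ) likelihood with x = 5 are assumptions chosen purely for illustration.

# Minimal sketch: posterior is proportional to likelihood times prior.
# The flat prior and Binomial(6, theta) likelihood with x = 5 are illustrative assumptions.
theta <- seq(0, 1, length.out = 101)              # grid of parameter values
prior <- rep(1, length(theta))                    # flat prior p(theta)
likelihood <- dbinom(5, size = 6, prob = theta)   # p(x | theta) for x = 5 heads in 6 flips
unnormalized <- likelihood * prior                # p(x | theta) p(theta)
posterior <- unnormalized / sum(unnormalized)     # normalize over the grid
theta[which.max(posterior)]                       # grid point closest to the posterior mode (about 5/6)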
Polling Example 2012
[Figure: density of the prior for θ, plotted over θ ∈ (0, 1)]
[Figure: densities of the prior and the likelihood for θ]
[Figure: densities of the prior, the likelihood, and the posterior for θ]
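Figures like the three above can be produced with a few lines of R. The sketch below assumes a Beta prior and a binomial likelihood; the particular values of a, b, x, and n are made up for illustration, since the slides do not state the actual polling numbers.

# Illustrative only: Beta prior, binomial likelihood, Beta posterior.
# The numbers (a, b, x, n) are assumptions, not the values used on the slides.
a <- 2; b <- 2            # prior: theta ~ Beta(a, b)
x <- 55; n <- 100         # data: x "successes" out of n polled
theta <- seq(0, 1, length.out = 500)

prior      <- dbeta(theta, a, b)
likelihood <- dbeta(theta, x + 1, n - x + 1)   # binomial likelihood rescaled to integrate to 1
posterior  <- dbeta(theta, a + x, b + n - x)   # conjugate update (derived in the Beta-Binomial slide)

plot(theta, posterior, type = "l", xlab = expression(theta), ylab = "Density")
lines(theta, likelihood, lty = 2)
lines(theta, prior, lty = 3)
legend("topleft", legend = c("Posterior", "Likelihood", "Prior"), lty = 1:3)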
The basic philosophical difference between the frequentist and Bayesian paradigms is that
- Bayesians treat an unknown parameter θ as random.
- Frequentists treat θ as unknown but fixed.
Stopping Rule
Let θ be the probability of a particular coin landing on heads, and suppose we want to test the hypotheses H0: θ = 1/2 versus H1: θ > 1/2.
- Suppose the experiment is “Flip six times and record the results.”
- X counts the number of heads, and X ∼ Binomial(6, θ).
- The observed data was x = 5, and the p-value of our hypothesis test is

  p-value = P_{θ=1/2}(X ≥ 5)
          = P_{θ=1/2}(X = 5) + P_{θ=1/2}(X = 6)
          = 6/64 + 1/64 = 7/64 = 0.109375 > 0.05.

  So we fail to reject H0 at α = 0.05.
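The same tail probability can be checked in R; this is just a numerical restatement of the calculation above.

# P(X >= 5) when X ~ Binomial(6, 1/2): the p-value for the "flip six times" experiment
dbinom(5, size = 6, prob = 0.5) + dbinom(6, size = 6, prob = 0.5)   # 7/64 = 0.109375
pbinom(4, size = 6, prob = 0.5, lower.tail = FALSE)                 # same value, via the upper tail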
- Suppose now the experiment is “Flip until we get tails.”
- X counts the number of the flip on which the first tails occurs, and X ∼ Geometric(1 − θ).
- The observed data was x = 6, and the p-value of our hypothesis test is

  p-value = P_{θ=1/2}(X ≥ 6)
          = 1 − P_{θ=1/2}(X < 6)
          = 1 − ∑_{x=1}^{5} P_{θ=1/2}(X = x)
          = 1 − (1/2 + 1/4 + 1/8 + 1/16 + 1/32) = 1/32 = 0.03125 < 0.05.

  So we reject H0 at α = 0.05.
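Again this can be checked in R. One caveat: R's geometric functions count the number of failures before the first success, so the event X ≥ 6 (first tails on flip six or later) corresponds to at least five failures.

# P(X >= 6) when X ~ Geometric(1 - theta), theta = 1/2, support 1, 2, ...
# R's dgeom/pgeom count failures before the first success, so X = x means x - 1 failures.
1 - sum(dgeom(0:4, prob = 0.5))             # 1 - P(X <= 5) = 1/32 = 0.03125
pgeom(4, prob = 0.5, lower.tail = FALSE)    # same value: P(at least 5 failures)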
- The conclusions differ, which strikes some people as absurd.
- The p-values aren't even close: one is 3.5 times as large as the other.
- The result of our hypothesis test depends on whether we would have stopped flipping if we had gotten a tails sooner.
- The tests depend on what we call the stopping rule.
- The likelihood for the actual value of x that was observed is the same for both experiments (up to a constant):

  p(x|θ) ∝ θ^5 (1 − θ).
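A small sketch verifying this in R: the binomial and geometric probabilities of the observed data differ only by the constant factor choose(6, 5) = 6, so they carry the same information about θ.

theta <- seq(0.05, 0.95, by = 0.05)
binom_lik <- dbinom(5, size = 6, prob = theta)   # "flip six times", observe 5 heads
geom_lik  <- dgeom(5, prob = 1 - theta)          # "flip until tails", first tails on flip 6
unique(round(binom_lik / geom_lik, 10))          # constant ratio of 6 = choose(6, 5)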
Hierarchical Bayesian Models
X | θ ∼ f(x|θ)
Θ | γ ∼ π(θ|γ)
Γ ∼ φ(γ).
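To make the three levels concrete, here is a small simulation sketch in R. The specific choices of f, π, and φ below (Binomial, Beta, and Exponential) are assumptions made purely for illustration; the slide leaves them generic.

# Illustrative hierarchy (distributional choices are assumptions, not from the slides):
#   gamma ~ Exponential(1)               hyperprior  phi(gamma)
#   theta | gamma ~ Beta(gamma, gamma)   prior       pi(theta | gamma)
#   X | theta ~ Binomial(10, theta)      likelihood  f(x | theta)
set.seed(360)
gamma_draw <- rexp(1, rate = 1)
theta_draw <- rbeta(1, shape1 = gamma_draw, shape2 = gamma_draw)
x_draw     <- rbinom(1, size = 10, prob = theta_draw)
c(gamma = gamma_draw, theta = theta_draw, x = x_draw)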
Conjugate Distributions
A prior p(θ) is called conjugate for a likelihood p(x|θ) when the resulting posterior p(θ|x) is in the same family of distributions as the prior, as the Beta-Binomial example below illustrates.
Beta-Binomial
Suppose x | θ ∼ Binomial(n, θ) and θ ∼ Beta(a, b). Then

π(θ|x) ∝ p(x|θ) p(θ)
       ∝ [ (n choose x) θ^x (1 − θ)^{n−x} ] × [ Γ(a + b)/(Γ(a)Γ(b)) θ^{a−1} (1 − θ)^{b−1} ]
       ∝ θ^x (1 − θ)^{n−x} θ^{a−1} (1 − θ)^{b−1}
       ∝ θ^{x+a−1} (1 − θ)^{n−x+b−1}
  =⇒ θ | x ∼ Beta(x + a, n − x + b).
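A short numerical check of this conjugate update in R: the normalized product of the binomial likelihood and the Beta(a, b) prior matches the Beta(x + a, n − x + b) density. The particular values of n, x, a, and b below are arbitrary.

# Numerical check of the Beta-Binomial conjugate update (values of n, x, a, b are arbitrary).
n <- 20; x <- 13; a <- 2; b <- 3
theta <- seq(0.001, 0.999, length.out = 999)
step  <- theta[2] - theta[1]

unnormalized   <- dbinom(x, size = n, prob = theta) * dbeta(theta, a, b)
grid_posterior <- unnormalized / (sum(unnormalized) * step)   # normalize on the grid
closed_form    <- dbeta(theta, x + a, n - x + b)               # Beta(x + a, n - x + b)

max(abs(grid_posterior - closed_form))   # small: the grid posterior matches the closed form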