Bayesian estimation

Chrysafis Vogiatzis
Lecture 19

Learning objectives

After this lecture, we will be able to:


• Use Bayesian estimation to find point estimators for unknown parameters.

• Propose new point estimators for unknown parameters based on prior information.

Motivation: Heads or Tails?

We flip a coin 10 times and we get 6 Heads and 4 Tails. Do you believe it is a fair coin? What do the method of moments and the maximum likelihood estimation method say about this situation?

Quick review

During these past two lectures, we discussed two methods to identify


“good” estimators Θ̂ for a series of unknown parameters:

• the method of moments.


1. Compute the moments of the population, calculated as E[X^k].

2. Compute the moments of the sample (empirical moments), calculated as (1/n) ∑_{i=1}^{n} X_i^k.

3. Equate the two and solve a system of equations for the unknown parameters.

• maximum likelihood estimation.

1. Calculate the likelihood function as

L(θ) = f(X1, θ) · f(X2, θ) · . . . · f(Xn, θ)

2. Or the log-likelihood function as

ln(L(θ)) = ln( f(X1, θ)) + ln( f(X2, θ)) + . . . + ln( f(Xn, θ))

3. Find the maximizer (usually by setting the derivative with respect to each unknown parameter equal to 0).
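For a quick numerical check of both recipes, here is a minimal Python sketch applied to the motivating question (10 flips, 6 Heads); the function names and the grid search are our own choices, not part of the method itself. For a Bernoulli sample, both estimators reduce to the sample proportion x/n.

# Minimal sketch: method of moments and MLE for a Bernoulli(p) sample.

def method_of_moments(x, n):
    # E[X] = p and the empirical first moment is x/n; equating the two gives x/n.
    return x / n

def mle(x, n, grid_size=10_000):
    # Maximize L(p) = p^x * (1 - p)^(n - x) over a fine grid of p values.
    best_p, best_L = 0.0, -1.0
    for i in range(grid_size + 1):
        p = i / grid_size
        L = p**x * (1 - p)**(n - x)
        if L > best_L:
            best_p, best_L = p, L
    return best_p

print(method_of_moments(6, 10))   # 0.6
print(mle(6, 10))                 # 0.6

Both print 0.6: with no prior information, the data alone drive the estimate.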

Both of these methods have one thing in common: they require no prior information to work; instead, they rely entirely on the obtained sample. What if I already know some more information about what is going on?

Bayesian estimation through an example

We begin in a slightly different way than usual. We begin with an


example to help us build intuition! Assume I carry 3 coins with me:

1. One with both sides showing Heads.

2. One with both sides showing Tails.

3. One that is fair and has a side of Heads and a side of Tails.

Assume I randomly pick one coin and start flipping it. I report
to you the number of tries (n) and the number of Heads (x). For
example, I may tell you n = 8, x = 5 or n = 2, x = 0, and so on.
Flipping the coin: first take

I let you know that I flipped the coin three times and got Heads twice: n = 3, x = 2. What are the method of moments and the maximum likelihood estimators for p?

We will have E[X] = p and (1/3) · (1 + 1 + 0) = 2/3, and equating the two will give p̂ = 2/3. The likelihood function is L(p) = p^2 · (1 − p), and maximizing it will also give p̂ = 2/3.

But... I carry three coins with me. Shouldn't I use this information somehow?

1. Can it be my “2-Heads” coin?

2. Can it be my “2-Tails” coin?

3. Does it have to be my “50-50” coin?

This is the key to realizing what Bayesian estimation brings to


the table: extra information in the form of prior probabilities for the
parameters that are unknown.

Bayesian estimation

We separate the discussion into discrete sets for the values the parameter can take (as in the previous example, where I carried 3 distinct coins with me) and continuous sets, where the parameter can be any real number in a range of values.

For discrete parameter values


Before describing the method, we provide some notation:

• prior probabilities (“priors”): the probability of seeing a certain


parameter P (θ ).

• likelihood probabilities (“likelihoods”): the likelihood of seeing


an outcome given a certain parameter P ( X |θ ).

• posterior probabilities (“posteriors”): the product of the two, P(θ) · P(X|θ).

A quick note about the likelihoods: they are calculated in exactly the same manner as the likelihood function in the maximum likelihood estimation method!
The Bayesian estimation method then states that:

“The higher the posterior probability,


the better the chance of having that parameter.”

This is it! This is the whole method!


Flipping the coin: second take

Let us go back to the example where I carried three coins (“2-


Heads”; “2-Tails”; and “50-50”) and I picked one at random.
After 3 tries, we got 2 Heads: n = 3, x = 2. Let’s see what we
have for these three distinct cases of p = 1, p = 0, p = 0.5:

• priors P(p): the probability of picking a certain coin, that is, P(p = 1) = P(p = 0) = P(p = 0.5) = 1/3.

• likelihoods P(X = 2|p): the likelihood function of seeing two Heads for each coin. For example, the likelihood function for the p = 0.5 coin with x = 2 Heads in n = 3 tries would be: p^2 · (1 − p) = 0.5^2 · 0.5 = 0.125.

• posteriors P( p) · P( X = 2| p): we will need to calculate this


for each coin.

Let us put this in table format.


parameter p    prior P(p)    likelihood P(X = 2|p)      posterior P(p) · P(X = 2|p)
0              1/3           0^2 · 1^1 = 0              (1/3) · 0 = 0
1              1/3           1^2 · 0^1 = 0              (1/3) · 0 = 0
0.5            1/3           0.5^2 · 0.5^1 = 0.125      (1/3) · 0.125 = 0.04166

The maximum value (and the only non-zero posterior!) is achieved for the “50-50” coin, so it must be that one!
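If it helps to see the arithmetic spelled out, the table above can be reproduced with a short Python sketch (the dictionary-based bookkeeping is just one convenient choice, not part of the notes):

# Sketch: posteriors for the three-coin example (n = 3 tosses, x = 2 Heads).
priors = {0.0: 1/3, 1.0: 1/3, 0.5: 1/3}          # P(p): one of three coins, picked at random

def likelihood(p, x=2, n=3):
    # Same likelihood as in maximum likelihood estimation: p^x * (1 - p)^(n - x).
    return p**x * (1 - p)**(n - x)

posteriors = {p: priors[p] * likelihood(p) for p in priors}
print(posteriors)                                # {0.0: 0.0, 1.0: 0.0, 0.5: 0.041666...}
print(max(posteriors, key=posteriors.get))       # 0.5, i.e., the "50-50" coin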

See? It is pretty intuitive. Of course, we may complicate things by


making the probability of picking a coin a little more general.
Flipping the coin: third take

I still have three types of coins on me. But, given that I am an adult who carries money wherever I go, I carry more actual coins (“50-50”) than novelty coins (“2-Heads”, “2-Tails”). More specifically, I carry 6 real coins and 1 of each novelty coin, for 8 coins in total. I take a coin out, toss it twice, and get two Heads! Which coin is it?

Let us try the table format again:


parameter p    prior P(p)    likelihood P(X = 2|p)    posterior P(p) · P(X = 2|p)
0              1/8           0^2 = 0                  (1/8) · 0 = 0
1              1/8           1^2 = 1                  (1/8) · 1 = 0.125
0.5            3/4           0.5^2 = 0.25             (3/4) · 0.25 = 0.1875
The maximum value is still achieved for a “50-50” coin, so we are inclined to think we picked one. Note how much closer the posteriors are, though...

It would actually take only one more Heads to change our parameter estimate towards the “2-Heads” novelty coin! Why is that?

Normalizing may also be useful. Instead of looking at the raw posterior values, we may turn them into actual “%” values to help compare them. To normalize, simply take each posterior and divide it by the sum of all posterior values. For example, in our third take (see above) we would end up with probabilities:

• P(p = 0) = 0 / (0 + 0.125 + 0.1875) = 0.

• P(p = 1) = 0.125 / (0 + 0.125 + 0.1875) = 0.4.

• P(p = 0.5) = 0.1875 / (0 + 0.125 + 0.1875) = 0.6.

This helps us quantify our parameter estimation even more. There


is a 40% chance we picked the “2-Heads” coin and a 60% chance we
picked one of the “50-50” coins.
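The same normalization in a short Python sketch (using the third-take priors of 1/8, 1/8, and 3/4, and the likelihood p^2 for two Heads in two tosses):

# Sketch: normalized posteriors for the "third take" (2 Heads in 2 tosses).
priors = {0.0: 1/8, 1.0: 1/8, 0.5: 3/4}
posteriors = {p: priors[p] * p**2 for p in priors}     # likelihood of HH is p^2
total = sum(posteriors.values())
normalized = {p: post / total for p, post in posteriors.items()}
print(normalized)                                      # {0.0: 0.0, 1.0: 0.4, 0.5: 0.6}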
Let’s work one more example before we move to the continuous case.

A computer vision system

A machine learning algorithm for computer vision is trained to observe the first vehicle that passes through an intersection at or after 8am every day. Then, it repeatedly reports the time from one vehicle to the next. We assume this time is exponentially distributed, but with unknown λ.
• If the first vehicle of the day was a personal car, then λ1 = 1
per minute.

• If the first vehicle of the day was a motorcycle, then λ2 = 1


per 5 minutes.

• If the first vehicle was a truck, then λ3 = 1 per 10 minutes.

• If the first vehicle was a bike, then λ4 = 1 per 15 minutes.


We observe a sample of 4 interarrival times: X1 = 9 minutes, X2 = 8.5 minutes, X3 = 8 minutes, X4 = 10.5 minutes. What is the probability of each parameter λ?
First, we need to calculate the prior probabilities P(λ). The first vehicle of the day is:

• a personal car with probability λ1 / (λ1 + λ2 + λ3 + λ4) = 0.732 (why?),

• a motorcycle with probability λ2 / (λ1 + λ2 + λ3 + λ4) = 0.146,

• a truck with probability λ3 / (λ1 + λ2 + λ3 + λ4) = 0.073,

• or a bike with probability λ4 / (λ1 + λ2 + λ3 + λ4) = 0.049.

With these in hand, we calculate the likelihood functions as λ · e^(−λ·X1) · λ · e^(−λ·X2) · . . . · λ · e^(−λ·Xn), since we have an exponentially distributed random variable. For example, if λ = λ1 = 1, then we would have 1 · e^(−1·9) · 1 · e^(−1·8.5) · 1 · e^(−1·8) · 1 · e^(−1·10.5) = e^(−36) = 2.32 · 10^(−16).
Finally:
Finally:

parameter λ          prior P(λ)    likelihood P(X1, X2, X3, X4 | λ)    posterior P(λ) · P(X1, X2, X3, X4 | λ)
λ1 = 1               0.732         2.32 · 10^(−16)                     1.70 · 10^(−16)
λ2 = 0.2             0.146         1.19 · 10^(−6)                      1.74 · 10^(−7)
λ3 = 0.1             0.073         2.73 · 10^(−6)                      1.99 · 10^(−7)
λ4 = 1/15 ≈ 0.067    0.049         1.79 · 10^(−6)                      8.77 · 10^(−8)

From the results, it seems that the first vehicle that passed today is most likely a truck!

If we wanted to assign probability values to each type of vehicle, we would normalize the posteriors and report:

• personal car: 1.70 · 10^(−16) / (1.70 · 10^(−16) + 1.74 · 10^(−7) + 1.99 · 10^(−7) + 8.77 · 10^(−8)) ≈ 3.7 · 10^(−10) ≈ 0.

• motorcycle: 0.378.

• truck: 0.432.

• bike: 0.190.
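Here is a sketch of the full computation for this example: priors from the competing rates, exponential likelihoods for the observed interarrival times, posteriors, and their normalized values. The bike rate of 1/15 per minute is the value used in the table above.

import math

# Sketch: Bayesian estimation of the interarrival rate lambda.
rates = {"car": 1.0, "motorcycle": 1/5, "truck": 1/10, "bike": 1/15}
sample = [9, 8.5, 8, 10.5]                        # observed interarrival times (minutes)

total_rate = sum(rates.values())
priors = {v: lam / total_rate for v, lam in rates.items()}   # P(first vehicle is of type v)

def likelihood(lam, xs):
    # Product of exponential densities lam * exp(-lam * x), one per observation.
    out = 1.0
    for x in xs:
        out *= lam * math.exp(-lam * x)
    return out

posteriors = {v: priors[v] * likelihood(lam, sample) for v, lam in rates.items()}
total = sum(posteriors.values())
# "truck" has the largest posterior (normalized value of about 0.43).
for v, post in posteriors.items():
    print(v, post, post / total)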

We can now move to the continuous case.

For continuous parameter values


Let us again begin with an example. The method is largely the same, but the definitions of some of the items change slightly to accommodate the continuous nature of the unknown parameter(s).

Say we have a coin that is made with the goal of being fair, that is, “50-50”. But materials fail and get deposited more on one side than the other, resulting in slightly different probabilities for Heads and Tails. Say, in the end, the probability of Heads is normally distributed as N(0.5, 0.01), that is, with a mean of µ = 0.5 and a variance of σ^2 = 0.01 =⇒ σ = 0.1. Visually, we would get the distribution of Figure 1.

Figure 1: The distribution of the probability of getting Heads in the continuous version
of the problem. We see how p = 0.5 is more likely, but we can get values as low as 0.1,
0.2, or as high as 0.8, 0.9, albeit with very small likelihood.


Now that we know this, say we tossed the coin 10 times and got Heads 10 straight times! Recall that both the method of moments and the maximum likelihood estimation method would simply estimate p̂ = 1 and proceed.

Getting 10 Heads in 10 tosses would be highly improbable for a coin that is “50-50”, but it could mean that I have a coin biased towards Heads. So, what should our estimate be?

First, calculate the likelihood function, the way we did during the maximum likelihood estimation calculations. In this case, it would be L(p) = p^10. Let's plot that (see Figure 2).

Figure 2: The likelihood function of getting 10 Heads after tossing a coin 10 times. It is
maximized at p = 1, which would then be our maximum likelihood estimator.


In Bayesian estimation for discrete-valued parameters earlier, we multiplied P(θ) (the priors) by P(X|θ) (the likelihoods) to obtain a series of posteriors that we would compare. In the continuous version, we multiply f(θ) (the prior distribution) by L(θ) (the likelihood function) to obtain a posterior distribution whose maximizer we then find! Confused? Let's look at this visually again in Figure 3.

Figure 3: The posterior distribution, found by multiplying f (θ ) (the pdf of the normal
distribution N (0.5, 0.01)) with the likelihood function L(θ ). The maximizer here is the
Bayesian estimator and is found at p̂ = 0.6531.

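A minimal numerical sketch of this maximization (normal prior with µ = 0.5 and σ = 0.1, likelihood p^10, and a plain grid search, which is simply one convenient way to locate the maximizer):

import math

# Sketch: maximize the posterior f(p) * L(p) for the biased-coin example.
mu, sigma = 0.5, 0.1                      # prior: p ~ N(0.5, 0.01)

def prior(p):
    # Normal pdf with mean mu and standard deviation sigma.
    return math.exp(-(p - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def likelihood(p):
    return p**10                          # 10 Heads in 10 tosses

grid = [i / 100_000 for i in range(100_001)]
p_hat = max(grid, key=lambda p: prior(p) * likelihood(p))
print(p_hat)                              # approximately 0.6531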

Let us define the notation for the method then:

• prior distribution: the distribution of the real-valued and continuous parameter θ, f(θ).

• likelihood function: the likelihood function, built just as in the maximum likelihood estimation method, L(θ).

• posterior distribution: the product of the two functions, f(θ) · L(θ).

The Bayesian estimation method for continuous parameters states


that:

“The Bayesian estimator is found by


maximizing the posterior distribution.”

And, yes! This sums it up. Let us view one example from beginning
to end using the method.
Mortality risk

We call the mortality risk of a hospital the probability of death occurring for any patient admitted to the hospital. The mortality risk in US hospitals is in general exponentially distributed with a mean at 1.5% (that is, λ = 1/1.5%). You have been observing a hospital and have seen 25 deaths in the first 150 patient admissions. What is the Bayesian estimator for the true mortality risk of the hospital?

Right away, distinguish between two items:

1. the prior distribution that we believe the mortality risk follows (exponential);

2. the outcome of each admission, which is a Bernoulli random variable (death with probability p, survival with probability 1 − p); in our case, we have collected a sample (25 deaths in 150 admissions) to help us estimate p.

So, let us start collecting what we need one-by-one.

Prior distribution:

f(p) = (1/1.5) · e^(−p/1.5).

Likelihood function:

L(p) = p^25 · (1 − p)^125.

Posterior distribution:

f(p) · L(p) = (1/1.5) · e^(−p/1.5) · p^25 · (1 − p)^125.

Mortality risk (cont’d)

To maximize, get the derivative and equate it to 0 to get:

∂[ f(p) · L(p)] / ∂p = (4/9) · e^(−(2/3) p) · (p − 1)^124 · p^24 · ((p − 226) p + 37.5) = 0
=⇒ ((p − 226) p + 37.5) = 0 =⇒ p = 0.16605 (the only root in [0, 1]).

The maximizer is, then, at p̂ = 0.16605.
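To double-check the algebra, here is a short numerical sketch that maximizes the same posterior over a fine grid of p values (the grid search is just a convenient substitute for the derivative calculation):

import math

# Sketch: maximize f(p) * L(p) = (1/1.5) * exp(-p/1.5) * p^25 * (1 - p)^125 numerically.
def posterior(p):
    return (1 / 1.5) * math.exp(-p / 1.5) * p**25 * (1 - p)**125

grid = [i / 100_000 for i in range(1, 100_000)]   # p in (0, 1)
p_hat = max(grid, key=posterior)
print(p_hat)                                      # approximately 0.16605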

If we want to, we can see the same result visually. First, plot our prior beliefs/distribution:

[Figure: the prior distribution f(p), plotted over 0 ≤ p ≤ 0.3]

Then, plot our likelihood function based on the sample collected:

[Figure: the likelihood function L(p) = p^25 · (1 − p)^125, plotted over 0 ≤ p ≤ 0.3]

And finally plot the posterior distribution, and check that the maximizer is indeed at p̂ = 0.16605:

[Figure: the posterior distribution f(p) · L(p), plotted over 0 ≤ p ≤ 0.3, peaking at p̂ = 0.16605]

One last example


Let us work on one more example for continuously distributed parameters. Assume we have a population distribution with pdf f(x) = (θ + 1) x^θ, for 0 ≤ x ≤ 1. Moreover, assume that θ is not completely unknown, but is instead distributed with pdf f(θ) = (1/12) (3 − θ), defined over −2 ≤ θ ≤ 2. Assume we have collected a sample of X1 = 0.9, X2 = 0.89, X3 = 0.76, X4 = 0.96. What is the Bayesian estimator for θ?

You may inspect the solution visually as a homework assignment.


Algebraically, though, we would multiply the prior distribution

( f (θ )) with the likelihood function (L(θ )) to obtain the posterior


distribution. In mathematical terms:
f(θ) = (1/12) (3 − θ)

L(θ) = (θ + 1) X1^θ · (θ + 1) X2^θ · (θ + 1) X3^θ · (θ + 1) X4^θ
     = (θ + 1)^4 (X1 · X2 · X3 · X4)^θ = (θ + 1)^4 · 0.5844096^θ

f(θ) · L(θ) = (1/12) (3 − θ) · (θ + 1)^4 · 0.5844096^θ

Getting the derivative of the posterior and equating it to 0, we get:

∂[ f(θ) · L(θ)] / ∂θ = 0 =⇒ 0.0447628 · 0.58441^θ · (1 + θ)^3 · (17.4783 + θ (−11.3083 + θ)) = 0.

We get three possible solutions: θ = −1, θ = 1.85, or θ = 9.46. We note that the last one cannot happen, as θ must lie between −2 and 2. Between the two remaining possible solutions, we compare their posterior distribution values:

• f(−1) · L(−1) = (1/12) (3 − (−1)) · ((−1) + 1)^4 · 0.5844096^(−1) = 0.

• f(1.85) · L(1.85) = (1/12) (3 − 1.85) · (1.85 + 1)^4 · 0.5844096^(1.85) = 2.34.

Hence, θ̂ = 1.85 is the maximizer and the Bayesian estimator.


I lied... here is the visual version of the posterior as well. It is clear that θ̂ = 1.85 is indeed the maximizer!

[Figure: the posterior distribution f(θ) · L(θ), plotted over −2 ≤ θ ≤ 2, maximized at θ̂ = 1.85]
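For completeness, a small numerical check of this last example, multiplying the prior (1/12)(3 − θ) on [−2, 2] by the likelihood (θ + 1)^4 · 0.5844096^θ and maximizing over a grid:

# Sketch: numerical check of the Bayesian estimator in the last example.
def posterior(theta):
    prior = (3 - theta) / 12                          # f(theta) on [-2, 2]
    likelihood = (theta + 1)**4 * 0.5844096**theta    # (theta+1)^4 * (X1*X2*X3*X4)^theta
    return prior * likelihood

grid = [-2 + i * 4 / 100_000 for i in range(100_001)]
theta_hat = max(grid, key=posterior)
print(theta_hat, posterior(theta_hat))                # approximately 1.85 and 2.34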
