
2.9 Posterior Predictive Distributions

Next, we can move right into R for our analysis.

setwd("~/Desktop/sta4930/ch3")
lam = seq(0,100, length=500)
x = 42
a = 5
b = 6
like = dgamma(lam,x+1,scale=1)        ## likelihood of lambda for x = 42, normalized as a Gamma(x+1, 1) curve
prior = dgamma(lam,5,scale=6)         ## Gamma(a = 5, scale b = 6) prior
post = dgamma(lam,x+a,scale=b/(b+1))  ## posterior: Gamma(x + a, scale b/(b+1))
pdf("preg.pdf", width = 5, height = 4.5)
plot(lam, post, xlab = expression(lambda), ylab= "Density", lty=2, lwd=3, type="l")
lines(lam,like, lty=1,lwd=3)
lines(lam,prior, lty=3,lwd=3)
legend(70,.06,c("Prior", "Likelihood","Posterior"), lty = c(3,1,2),
lwd=c(3,3,3)) ## line types match the curves above: prior lty=3, likelihood lty=1, posterior lty=2
dev.off()

##posterior predictive distribution


xnew = seq(0,100) ## will all be ints
post_pred_values = dnbinom(xnew,x+a,b/(2*b+1))
plot(xnew, post_pred_values, type="h", xlab = "x", ylab="Posterior Predictive Distribution")

## what is the posterior predictive probability that the number
## of pregnant women who arrive is between 40 and 45 (inclusive)?

(ans = sum(post_pred_values[41:46])) ## recall xnew starts at 0, so indices 41:46 correspond to 40:45

In the first part of the code, we plot the prior, likelihood, and posterior.
This should be self-explanatory since we have already done an example.

When we find our posterior predictive distribution, we must create a sequence of integers from 0 to 100 (inclusive) using the seq command. Then we find the posterior predictive values using the function dnbinom. Finally, we plot the sequence xnew on the x-axis and the corresponding posterior predictive values on the y-axis. We set type="h" so that the plot shows a vertical line at each integer, somewhat like a histogram.

Finally, in order to calculate the posterior predictive probability that the number of pregnant women who arrive is between 40 and 45, we simply add up the posterior predictive probabilities corresponding to these values. We find that the posterior predictive probability that the number of pregnant women who arrive is between 40 and 45 is 0.1284.
Chapter 3

Being Objective

No, it does not make sense for me to be an ‘Objective Bayesian’!


—Stephen E. Fienberg

Thus far in this course, we have mostly considered informative or subjective priors. Ideally, we want to choose a prior reflecting our beliefs about the unknown parameter of interest. This is a subjective choice. All Bayesians agree that wherever prior information is available, one should try to incorporate a prior reflecting this information as much as possible. We have mentioned how incorporating expert prior opinion would strengthen a purely data-based analysis in real-life decision problems. Using prior information can also be useful in problems of statistical inference when your sample size is small or you have a high- or infinite-dimensional parameter space.

However, in dealing with real-life problems you may run into difficulties such as

• not having past historical data,

• not having an expert opinion to base your prior knowledge on (perhaps your research is cutting-edge and new), or

• finding it hard to know what priors to put on each unknown parameter as your model becomes more complicated.

The problems we have dealt with all semester have been very simple in nature. We have only had one parameter to estimate (except for one example).


Think about a more complex problem such as the following (we looked at this problem in Chapter 1):

X | θ ∼ N(θ, σ²)
θ | σ² ∼ N(µ, τ²)
σ² ∼ IG(a, b)

where now θ and σ² are both unknown and we must find the posterior distributions of θ | X, σ² and σ² | X. For this slightly more complex problem, it is much harder to think about what values µ, τ², a, b should take for a particular problem. What should we do in these types of situations?
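One practical way to get a feel for what particular values of µ, τ², a, b imply is to simulate draws of (σ², θ) from the prior and look at the spread they produce. Below is a minimal R sketch in the spirit of the earlier code; the hyperparameter values are purely hypothetical and chosen only for illustration.

## simulate from the prior: sigma^2 ~ IG(a, b), theta ~ N(mu, tau^2)
## (hypothetical hyperparameter values, for illustration only)
mu = 0; tau2 = 4; a = 3; b = 10
nsim = 10000
sigma2 = 1/rgamma(nsim, shape=a, rate=b)   ## inverse-gamma draws via the reciprocal of a Gamma
theta = rnorm(nsim, mean=mu, sd=sqrt(tau2))
summary(sigma2); summary(theta)            ## do these ranges look plausible for your problem?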

Often no reliable prior information concerning θ exists, or inference based completely on the data is desired. It might appear that inference in such settings would be impossible, but reaching this conclusion is too hasty.

Suppose we could find a distribution p(θ) that contained little or no information about θ in the sense that it didn't favor one value of θ over another (provided this is possible). Then it would be natural to refer to such a distribution as a noninformative prior. We could also argue that all or most of the information contained in the posterior distribution, p(θ | x), came from the data. Thus, all resulting inferences would be objective and not subjective.
Definition 3.1: Informative/subjective priors represent our prior beliefs about parameter values before collecting any data. In practice, if statisticians are unsure about how to specify the prior, they will turn to experts in the field or to past data from experimenters to help fix the prior.
Example 3.1: (Pregnant Mothers) Suppose that X is the number of pregnant mothers arriving at a hospital to deliver their babies during a given month. The discrete count nature of the data as well as its natural interpretation leads to adopting a Poisson likelihood,

p(x | θ) = e^(−θ) θ^x / x!,   x ∈ {0, 1, 2, . . .}, θ > 0.

A convenient choice for the prior distribution here is a Gamma(a, b) since it is conjugate for the Poisson likelihood. To illustrate the example further, suppose that 42 moms deliver babies during the month of December. Suppose from past data at this hospital, we assume a prior of Gamma(5, 6). From this, we can easily calculate the posterior distribution, posterior mean and variance, and do various calculations of interest in R.
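As a quick check of these calculations (using the same values as the R code earlier: x = 42, a = 5, scale b = 6), the posterior is Gamma(x + a, scale = b/(b + 1)), and its mean and variance can be computed directly:

x = 42; a = 5; b = 6
post_shape = x + a          ## shape of the Gamma posterior
post_scale = b/(b + 1)      ## scale of the Gamma posterior
post_mean = post_shape*post_scale      ## = 47*6/7, about 40.3
post_var = post_shape*post_scale^2     ## = 47*(6/7)^2, about 34.5
c(mean = post_mean, var = post_var)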

Definition 3.2: Noninformative/objective priors contain little or no information about θ in the sense that they do not favor one value of θ over another. Therefore, when we calculate the posterior distribution, most if not all of the inference will arise from the likelihood. Inferences in this case are objective and not subjective. Let's look at the following example to see why we might consider such priors.

Example 3.2: (Pregnant Mothers Continued) Recall Example 3.1. As we noted earlier, it would be natural to take the prior on θ to be Gamma(a, b) since it is the conjugate prior for the Poisson likelihood. However, suppose that for this data set we do not have any information on the number of pregnant mothers arriving at the hospital, so there is no basis for using a Gamma prior or any other informative prior. In this situation, we could take some noninformative prior.

Comment: Since many objective priors are improper, we must check that the posterior is proper.

Theorem 3.1. Propriety of the Posterior

• If the prior is proper, then the posterior will always be proper.

• If the prior is improper, you must check that the posterior is proper.

◯ Meaning Of Flat

What does a “flat prior” really mean? People often abuse the word flat and use it interchangeably with noninformative. Let's talk about what people really mean when they use the term “flat,” since it can have different meanings.

Example 3.3: Often statisticians will refer to a prior as being flat when a plot of its density actually looks flat, i.e., uniform. An example of this would be taking the prior to be

θ ∼ Unif(0, 1).

We can plot the density of this prior to see that the density is flat.

What happens, though, if we consider the transformation to 1/θ? Is our prior still flat?
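As a quick numerical sketch (not part of the original example), we can draw from the Unif(0,1) prior and check the implied prior on φ = 1/θ; analytically p(φ) = 1/φ² on (1, ∞), which is far from flat.

## draw from a Unif(0,1) prior on theta and check the implied prior on phi = 1/theta
set.seed(1)
theta = runif(100000)
phi = 1/theta
## sanity check: P(phi <= 2) should be 1/2, consistent with the density p(phi) = 1/phi^2 on (1, Inf)
c(empirical = mean(phi <= 2), theoretical = 1 - 1/2)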

[Figure 3.1: Unif(0,1) prior]

Example 3.4: Now suppose we consider Jeffreys' prior, p_J(θ), where

X ∼ Bin(n, θ).

We calculate Jeffreys' prior by finding the Fisher information. The Fisher information tells us how much information the data gives us for particular parameter values.

In this example, it can be shown that p_J(θ) ∝ Beta(1/2, 1/2). Let's consider the plot of this prior. “Flat” here is a purely abstract idea. In order to achieve objective inference, we need to compensate more for values on the boundary than for values in the middle.

[Figure 3.2: Jeffreys' prior for the Binomial likelihood]

Example 3.5: Finally, we consider the following prior on θ:

θ ∼ N(0, 1000).

What happens in this situation? We look at two plots in Figure 3.3 to consider the behavior of this prior.

◯ Objective Priors in More Detail

Uniform Prior of Bayes and Laplace

Example 3.6: (Thomas Bayes) In 1763, Thomas Bayes considered the question of what prior to use when estimating a binomial success probability p. He described the problem quite differently back then by considering throwing balls onto a billiard table. He separated the billiard table into many different intervals and considered different events. By doing so (and not going into the details of this), he argued that a Uniform(0,1) prior was appropriate for p.

[Figure 3.3: Normal priors: the N(0, 1000) density plotted over (−1000, 1000), and the same prior restricted to [−10, 10]]

Example 3.7: (Laplace) In 1814, Pierre-Simon Laplace wanted to know

the probability that the sun will rise tomorrow. He answered this question
using the following Bayesian analysis:

• Let X represent the number of days the sun rises. Let p be the probability the sun will rise tomorrow.

• Let X | p ∼ Bin(n, p).

• Suppose p ∼ Uniform(0, 1).

• Based on reading the Bible, Laplace computed the total number of days n in recorded history, and the number of days x on which the sun rose. Clearly, x = n.

Then

π(p | x) ∝ (n choose x) p^x (1 − p)^(n−x) · 1
         ∝ p^((x+1)−1) (1 − p)^((n−x+1)−1).

This implies

p | x ∼ Beta(x + 1, n − x + 1).

Then

p̂ = E[p | x] = (x + 1)/(x + 1 + n − x + 1) = (x + 1)/(n + 2) = (n + 1)/(n + 2),

since x = n.

Thus, Laplace's estimate for the probability that the sun rises tomorrow is (n + 1)/(n + 2), where n is the total number of days recorded in history. For instance, if so far we have encountered 100 days in the history of our universe, this would say that the probability the sun will rise tomorrow is 101/102 ≈ 0.9902. However, we know that this calculation is ridiculous. Here, we have extremely strong subjective information (the laws of physics) that says it is extremely likely that the sun will rise tomorrow. Thus, objective Bayesian methods shouldn't be recklessly applied to every problem we study, especially when subjective information this strong is available.

Criticism of the Uniform Prior

The Uniform prior of Bayes and Laplace and has been criticized for many
di↵erent reasons. We will discuss one important reason for criticism and not
go into the other reasons since they go beyond the scope of this course.

In statistics, it is often a good property when a rule for choosing a prior is invariant under what are called one-to-one transformations. Invariant basically means unchanging in some sense. The invariance principle means that a rule for choosing a prior should provide equivalent beliefs even if we consider a transformed version of our parameter, like p² or log p instead of p.

Jeffreys' Prior

One prior that is invariant under one-to-one transformations is Jeffreys' prior.

What does the invariance principle mean? Suppose our prior parameter is θ; however, we would like to transform to φ.

Define φ = f(θ), where f is a one-to-one function.

Jeffreys' prior says that if θ has the distribution specified by Jeffreys' prior for θ, then φ = f(θ) will have the distribution specified by Jeffreys' prior for φ. We will clarify by going over two examples to illustrate this idea.

Note, for example, that if θ has a Uniform prior, then one can show that φ = f(θ) will not have a Uniform prior (unless f is the identity function).

Aside from the invariance property of Jeffreys' prior, in the univariate case, Jeffreys' prior satisfies many optimality criteria that statisticians are interested in.

Definition 3.3: Define

I(θ) = −E[ ∂² log p(y | θ) / ∂θ² ],

where I(θ) is called the Fisher information. Then Jeffreys' prior is defined to be

p_J(θ) = √(I(θ)).

Example 3.8: (Uniform Prior is Not Invariant to Transformation)

Let θ ∼ Uniform(0, 1). Suppose now we would like to transform from θ to θ².


Let = ✓2 . Then ✓ = . It follows that

= √ .
@✓ 1
@ 2

Thus, p( ) = √ , 0 < < 1 which shows that is not Uniform on (0, 1).
1
2
Hence, the transformation is not invariant. Criticism such as this led to
consideration of Je↵reys’ prior.
Example 3.9: (Jeffreys' Prior Invariance Example)
Suppose

X | θ ∼ Exp(θ).

One can show using calculus that I(θ) = 1/θ². Then p_J(θ) = 1/θ. Suppose that φ = θ². It follows that

∂θ/∂φ = 1/(2√φ).

Then

p_J(φ) = p_J(√φ) |∂θ/∂φ| = (1/√φ) · (1/(2√φ)) ∝ 1/φ.

Hence, we have shown for this example that Jeffreys' prior is invariant under the transformation φ = θ².
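The same fact can be checked numerically. In the sketch below (an illustration over an arbitrary grid, not from the notes), transforming p_J(θ) = 1/θ through φ = θ² gives a density with a constant ratio to 1/φ, i.e., proportional to Jeffreys' prior computed directly for φ.

## numerical check that Jeffreys' prior for Exp(theta) is invariant under phi = theta^2
theta = seq(0.5, 5, length=50)
pJ_theta = 1/theta                    ## Jeffreys' prior for theta (up to a constant)
phi = theta^2
dtheta_dphi = 1/(2*sqrt(phi))         ## Jacobian of theta = sqrt(phi)
induced = pJ_theta*dtheta_dphi        ## prior on phi induced by transforming p_J(theta)
direct = 1/phi                        ## Jeffreys' prior computed directly for phi
range(induced/direct)                 ## constant ratio, so the two agree up to proportionality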
Example 3.10: (Jeffreys' prior) Suppose

X | θ ∼ Binomial(n, θ).

Let's calculate the posterior using Jeffreys' prior. To do so we need to calculate I(θ). Ignoring terms that don't depend on θ, we find

log p(x | θ) = x log(θ) + (n − x) log(1 − θ),

which implies

∂ log p(x | θ)/∂θ = x/θ − (n − x)/(1 − θ),
∂² log p(x | θ)/∂θ² = −x/θ² − (n − x)/(1 − θ)².

Since E(X) = nθ,

I(θ) = −E[ −X/θ² − (n − X)/(1 − θ)² ] = nθ/θ² + (n − nθ)/(1 − θ)² = n/θ + n/(1 − θ) = n/(θ(1 − θ)).

This implies that

p_J(θ) = √( n/(θ(1 − θ)) ) ∝ θ^(1/2 − 1) (1 − θ)^(1/2 − 1) ∝ Beta(1/2, 1/2).

[Figure 3.4: Jeffreys' prior, Beta(1/2, 1/2), and the flat Beta(1, 1) prior densities]

Figure 3.4 compares the prior density π_J(θ) with that of a flat prior, which is equivalent to a Beta(1,1) distribution.
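A short R sketch, in the same style as the plotting code earlier in the notes, reproduces this comparison:

## compare Jeffreys' Beta(1/2,1/2) prior with the flat Beta(1,1) prior
theta = seq(0.001, 0.999, length=500)
plot(theta, dbeta(theta, 1/2, 1/2), type="l", lwd=3, lty=1,
     xlab=expression(theta), ylab=expression(p(theta)), ylim=c(0,3))
lines(theta, dbeta(theta, 1, 1), lwd=3, lty=2)
legend("top", c("Beta(1/2,1/2)", "Beta(1,1)"), lty=c(1,2), lwd=c(3,3))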

Note that in this case the prior is inversely proportional to the standard deviation of the data. Why does this make sense?

We see that the data has the least effect on the posterior when the true θ = 0.5, and has the greatest effect near the extremes, θ = 0 or 1. Jeffreys' prior compensates for this by placing more mass near the extremes of the range, where the data has the strongest effect. We could get the same effect by (for example) letting the prior be π(θ) ∝ 1/Var_θ(X) instead of π(θ) ∝ 1/[Var_θ(X)]^(1/2).

However, the former prior is not invariant under reparameterization, as we would prefer.

We then find that

p(θ | x) ∝ θ^x (1 − θ)^(n−x) θ^(1/2 − 1) (1 − θ)^(1/2 − 1)
         = θ^(x − 1/2) (1 − θ)^(n − x − 1/2)
         = θ^((x + 1/2) − 1) (1 − θ)^((n − x + 1/2) − 1).

Thus, θ | x ∼ Beta(x + 1/2, n − x + 1/2), which is a proper posterior since the prior is proper.

Note: Remember that it is important to check that the posterior is proper.
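As a small worked illustration (with made-up data: say x = 7 successes out of n = 10 trials), the Jeffreys-prior posterior can be summarized directly in R:

## posterior under Jeffreys' prior for hypothetical data x = 7, n = 10
n = 10; x = 7
post_a = x + 1/2; post_b = n - x + 1/2      ## theta | x ~ Beta(x + 1/2, n - x + 1/2)
post_mean = post_a/(post_a + post_b)        ## = 7.5/11, about 0.68
ci = qbeta(c(0.025, 0.975), post_a, post_b) ## central 95% posterior interval
c(mean = post_mean, lower = ci[1], upper = ci[2])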

Jeffreys' and Conjugacy

Jeffreys priors are widely used in Bayesian analysis. In general, they are not conjugate priors; the fact that we ended up with a conjugate Beta prior for the binomial example above is just a lucky coincidence. For example, with a Gaussian model X ∼ N(µ, σ²), it can be shown that π_J(µ) ∝ 1 and π_J(σ) ∝ 1/σ, which do not look anything like a Gaussian or an inverse gamma, respectively. However, it can be shown that Jeffreys priors are limits of conjugate prior densities. For example, a Gaussian density N(µ₀, σ₀²) approaches a flat prior as σ₀² → ∞, while the inverse gamma density (σ²)^(−(a+1)) e^(−b/σ²) approaches (σ²)^(−1) as a, b → 0.

Limitations of Jeffreys'

Jeffreys' priors work well for single-parameter models, but not for models with multidimensional parameters. By analogy with the one-dimensional case, one might construct a naive Jeffreys prior as the joint density

π_J(θ) = |I(θ)|^(1/2),

where | · | denotes the determinant and the (i, j)th element of the Fisher information matrix is given by

I(θ)_ij = −E[ ∂² log p(X | θ) / (∂θ_i ∂θ_j) ].

Let's see what happens when we apply a Jeffreys' prior for θ to a multivariate Gaussian location model. Suppose X ∼ N_p(θ, I), and we are interested in performing inference on ‖θ‖². In this case the Jeffreys' prior for θ is flat. It turns out that the posterior has the form of a non-central χ² distribution with p degrees of freedom. The posterior mean given one observation of X is E(‖θ‖² | X) = ‖X‖² + p. This is not a good estimate because it adds p to the squared norm of X, whereas we would normally want to shrink our estimate towards zero. By contrast, the minimum variance frequentist estimate of ‖θ‖² is ‖X‖² − p.
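A minimal Monte Carlo sketch of this point (with a hypothetical dimension and observation, not from the notes): under the flat prior the posterior is θ | X ∼ N_p(X, I), so simulating from it shows the posterior mean of ‖θ‖² sitting near ‖X‖² + p rather than ‖X‖² − p.

## under a flat prior and X ~ N_p(theta, I), the posterior is theta | X ~ N_p(X, I)
set.seed(42)
p = 10
X = rnorm(p, mean=2)                        ## one hypothetical observed vector
draws = matrix(rnorm(50000*p, mean=X), ncol=p, byrow=TRUE)  ## posterior draws of theta
post_mean_norm2 = mean(rowSums(draws^2))
c(posterior = post_mean_norm2, norm2_plus_p = sum(X^2) + p, norm2_minus_p = sum(X^2) - p)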

Intuitively, a multidimensional flat prior carries a lot of information about the expected value of a parameter. Since most of the mass of a flat prior distribution is in a shell at infinite distance, it says that we expect the value of θ to lie at some extreme distance from the origin, which causes our estimate of the norm to be pushed further away from zero.

Haldane's Prior

In 1963, Haldane introduced the following improper prior for a binomial proportion:

p(θ) ∝ θ^(−1) (1 − θ)^(−1).

It can be shown to be improper using simple calculus, which we will not go into. However, the posterior is proper under certain conditions. Let

Y | θ ∼ Bin(n, θ).

Calculate p(θ | y) and show that it is improper when y = 0 or y = n.

Remark: Recall that for a Binomial distribution, Y can take values y = 0, 1, 2, . . . , n.

We will first calculate p(θ | y):

p(θ | y) ∝ [ (n choose y) θ^y (1 − θ)^(n−y) ] / [ θ(1 − θ) ]
         ∝ θ^(y−1) (1 − θ)^(n−y−1)
         = θ^(y−1) (1 − θ)^((n−y)−1).

The density of a Beta(a, b) is the following:

f(θ) = [ Γ(a + b) / (Γ(a) Γ(b)) ] θ^(a−1) (1 − θ)^(b−1),   0 < θ < 1.

This implies that θ | Y ∼ Beta(y, n − y).

Finally, we need to check that our posterior is proper. Recall that the parameters of the Beta need to be positive. Thus, we need y > 0 and n − y > 0. This means that y ≠ 0 and y ≠ n in order for the posterior to be proper.

Remark: Recall that the Beta density integrates to 1 whenever the parameter values are positive. When they are not positive, the density does not integrate to 1; it integrates to ∞. Thus, for the problem above, when y = 0 or y = n the posterior is improper.
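A quick numerical sketch of why the y = 0 case fails (using a hypothetical n = 10): the unnormalized posterior θ^(y−1)(1 − θ)^(n−y−1) integrates to a finite number when 0 < y < n, but for y = 0 the integral over (ε, 1) grows without bound as ε → 0.

## unnormalized Haldane posterior kernel, hypothetical n = 10
n = 10
kernel = function(theta, y) theta^(y - 1)*(1 - theta)^(n - y - 1)
integrate(kernel, 0, 1, y = 3)$value       ## finite: equals Beta(3, 7)
## for y = 0 the integral over (eps, 1) keeps growing as eps shrinks
sapply(c(1e-2, 1e-4, 1e-6), function(eps) integrate(kernel, eps, 1, y = 0)$value)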

There are many other objective priors that are used in Bayesian inference; however, this is the level of exposure that we will cover in this course. If you're interested in learning more about objective priors (the g-prior, probability matching priors), see me and I can give you some references.

3.1 Reference Priors

Reference priors were proposed by Jose Bernardo in a 1979 paper and further developed by Jim Berger and others from the 1980s through the present. They are credited with bringing about an objective Bayesian renaissance; an annual conference is now devoted to the objective Bayesian approach.

The idea behind reference priors is to formalize what exactly we mean by an uninformative prior: it is a function that maximizes some measure of distance or divergence between the posterior and prior, as data observations are made. Any of several possible divergence measures can be chosen, for example the Kullback-Leibler divergence or the Hellinger distance. By maximizing the divergence, we allow the data to have the maximum effect on the posterior estimates.

For one-dimensional parameters, it turns out that reference priors and Jeffreys' priors are equivalent. For multidimensional parameters, they differ. One might ask: how can we choose a prior to maximize the divergence between the posterior and prior without having seen the data first? Reference priors handle this by taking the expectation of the divergence, given a model distribution for the data. This sounds superficially like a frequentist approach, since it bases inference on imagined data, but once the prior is chosen based on some model, inference proceeds in a standard Bayesian fashion.

(This contrasts with the frequentist approach, which continues to deal with
imagined data even after seeing the real data!)

◯ Laplace Approximation

Before deriving reference priors in some detail, we go through the Laplace approximation, which is very useful in Bayesian analysis since we often need to evaluate integrals of the form

∫ g(θ) f(x | θ) π(θ) dθ.

For example, when g(θ) = 1, the integral reduces to the marginal likelihood of x. The posterior mean requires evaluation of two integrals, ∫ θ f(x | θ) π(θ) dθ and ∫ f(x | θ) π(θ) dθ. Laplace's method is a technique for approximating integrals when the integrand has a sharp maximum.

Remark: There is a nice refinement of the Laplace approximation due to Tierney, Kass, and Kadane (JASA, 1989). Due to time constraints, we won't go into this, but if you're looking to apply this in research, this is something you should look up in the literature and use when needed.

Theorem 3.2. Laplace Approximation

Let I = ∫ q(θ) exp{n h(θ)} dθ. Assume that θ̂ maximizes h and that h has a sharp maximum at θ̂. Let c = −h″(θ̂) > 0. Then

I = q(θ̂) exp{n h(θ̂)} √(2π/(nc)) (1 + O(n⁻¹)) ≈ q(θ̂) exp{n h(θ̂)} √(2π/(nc)).

Proof. Apply a Taylor expansion about θ̂:

I ≈ ∫ from θ̂ − δ to θ̂ + δ of [ q(θ̂) + (θ − θ̂) q′(θ̂) + (1/2)(θ − θ̂)² q″(θ̂) ]
      × exp{ n h(θ̂) + n(θ − θ̂) h′(θ̂) + (n/2)(θ − θ̂)² h″(θ̂) } dθ + ⋯

  ≈ q(θ̂) e^(n h(θ̂)) ∫ [ 1 + (θ − θ̂) q′(θ̂)/q(θ̂) + (1/2)(θ − θ̂)² q″(θ̂)/q(θ̂) ]
      × exp{ −(nc/2)(θ − θ̂)² } dθ + ⋯ ,

where we have used h′(θ̂) = 0 and c = −h″(θ̂).

Now let t = √(nc)(θ − θ̂). This implies that dθ = (1/√(nc)) dt. Hence,

I ≈ [ q(θ̂) e^(n h(θ̂)) / √(nc) ] ∫ [ 1 + (t/√(nc)) q′(θ̂)/q(θ̂) + (t²/(2nc)) q″(θ̂)/q(θ̂) ] e^(−t²/2) dt

  ≈ [ q(θ̂) e^(n h(θ̂)) / √(nc) ] √(2π) [ 1 + 0 + q″(θ̂)/(2nc q(θ̂)) ]

  ≈ [ q(θ̂) e^(n h(θ̂)) / √(nc) ] √(2π) [ 1 + O(1/n) ] ≈ q(θ̂) e^(n h(θ̂)) √(2π)/√(nc).
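To make the theorem concrete, here is a small R check on an assumed example (not from the notes): take q(θ) = 1 and h(θ) = log θ − θ on (0, ∞), so that I = ∫ θ^n e^(−nθ) dθ = Γ(n + 1)/n^(n+1), with θ̂ = 1 and c = −h″(1) = 1.

## Laplace approximation check for h(theta) = log(theta) - theta, q(theta) = 1
n = 20
h_hat = log(1) - 1                      ## h at theta-hat = 1
c_val = 1                               ## c = -h''(1)
laplace = exp(n*h_hat)*sqrt(2*pi/(n*c_val))
exact = gamma(n + 1)/n^(n + 1)          ## exact value of the integral
c(laplace = laplace, exact = exact, relative_error = (laplace - exact)/exact)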

◯ Some Probability Theory

First, we give a few definitions from probability theory (you may have seen these before); we will be informal about these.

• If X_n is O(n⁻¹), then X_n "goes to 0 at least as fast as 1/n."

• If X_n is o(n⁻¹), then X_n "goes to 0 faster than 1/n."

Definition 3.4: Formally, writing

X_n = o(r_n) as n → ∞

means that X_n/r_n → 0. Similarly,

X_n = O(r_n) as n → ∞

means that X_n/r_n is bounded.

This shouldn't be confused with the definition below.

Definition 3.5: Formally, let X_n, n ≥ 1, be random vectors and R_n, n ≥ 1, be positive random variables. Then

X_n = o_p(R_n) if X_n/R_n → 0 in probability.
