Nonparametric Inference Techniques For High-Dimensional Data: Challenges and Solutions
setwd("~/Desktop/sta4930/ch3")
lam = seq(0, 100, length = 500)          # grid of lambda values
x = 42                                   # observed count
a = 5                                    # prior shape
b = 6                                    # prior scale
like = dgamma(lam, x + 1, scale = 1)           # likelihood in lambda, normalized as a Gamma(x+1, 1) density
prior = dgamma(lam, a, scale = b)              # Gamma(a, b) prior
post = dgamma(lam, x + a, scale = b/(b + 1))   # Gamma(x + a, b/(b+1)) posterior
pdf("preg.pdf", width = 5, height = 4.5)
plot(lam, post, xlab = expression(lambda), ylab = "Density", lty = 2, lwd = 3, type = "l")
lines(lam, like, lty = 1, lwd = 3)
lines(lam, prior, lty = 3, lwd = 3)
# legend labels now match the line types used above (posterior: lty 2, likelihood: lty 1, prior: lty 3)
legend(70, .06, c("Posterior", "Likelihood", "Prior"), lty = c(2, 1, 3),
       lwd = c(3, 3, 3))
dev.off()
In the first part of the code, we plot the prior, likelihood, and posterior.
This should be self-explanatory since we have already done an example.
Being Objective
The problems we have dealt with all semester have been very simple in nature: we have only had one parameter to estimate (except for one example). However, in dealing with real-life problems, you may run into more difficult situations.
Think about a more complex problem such as the following (we looked at
this problem in Chapter 1):
X | θ ∼ N(θ, σ²)
θ | σ² ∼ N(µ, τ²)
σ² ∼ IG(a, b)
where now θ and σ² are both unknown and we must find the posterior distributions of θ | X, σ² and σ² | X. For this slightly more complex problem, it is much harder to think about what values µ, τ², a, and b should take for a particular problem. What should we do in these types of situations?
• If the prior is improper, you must check that the posterior is proper.
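One informal way to get a feel for candidate values of µ, τ², a, and b is to simulate data from the prior and ask whether the implied observations look plausible. The following prior-predictive sketch is my own illustration rather than anything from the notes; the hyperparameter values and the single-observation setup are placeholder assumptions.
set.seed(1)
mu = 0        # prior mean for theta (placeholder value)
tau2 = 100    # prior variance for theta (placeholder value)
a = 2         # IG shape for sigma^2 (placeholder value)
b = 2         # IG scale for sigma^2 (placeholder value)
nsim = 10000
sigma2 = 1/rgamma(nsim, shape = a, rate = b)         # sigma^2 ~ IG(a, b)
theta = rnorm(nsim, mean = mu, sd = sqrt(tau2))      # theta | sigma^2 ~ N(mu, tau^2)
xrep = rnorm(nsim, mean = theta, sd = sqrt(sigma2))  # X | theta, sigma^2 ~ N(theta, sigma^2)
summary(xrep)   # if these look wildly implausible for the application, rethink mu, tau^2, a, b
hist(xrep, breaks = 50, main = "Prior predictive draws of X")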
◯ Meaning Of Flat
What does a “flat prior” really mean? People often abuse the word “flat” and use it interchangeably with “noninformative.” Let’s talk about what people really mean when they use the term “flat,” since it can have different meanings.
Example 3.3: Often statisticians will refer to a prior as being flat, when
a plot of its density actually looks flat, i.e., uniform. An example of this
would be taking such a prior to be
θ ∼ Unif(0, 1).
We can plot the density of this prior to see that the density is flat.
[Figure: the Unif(0, 1) prior density, which is constant (flat) at 1 on (0, 1).]
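A minimal R sketch that reproduces this kind of plot (the axis limits are arbitrary choices on my part):
theta = seq(0, 1, length = 500)
plot(theta, dunif(theta, 0, 1), type = "l", lwd = 3,
     xlab = expression(theta), ylab = "density", ylim = c(0, 1.5))
# the Unif(0,1) density is constant at 1, so the plotted line is flat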
In this example, it can be shown that pJ(θ) ∝ Beta(1/2, 1/2) (this is the Jeffreys prior, which we discuss below). Let’s consider the plot of this prior. “Flat” here is a purely abstract idea: in order to achieve objective inference, we need to compensate more for values on the boundary than for values in the middle.
Sometimes “flat” is instead used to mean a prior that is very diffuse, i.e., one with a very large variance, such as
θ ∼ N(0, 1000).
[Figure: the N(0, 1000) prior density for θ, plotted over −10 ≤ θ ≤ 10; on this range it looks essentially flat.]
A related historical example of a flat prior is Laplace’s famous question: what is the probability that the sun will rise tomorrow? He answered this question using the following Bayesian analysis:
• Let X represent the number of days the sun rises, and let p be the probability the sun will rise tomorrow.
Then, taking X | p ∼ Binomial(n, p) with the uniform prior p ∼ Unif(0, 1), this implies
p | x ∼ Beta(x + 1, n − x + 1).
Thus, Laplace’s estimate for the probability that the sun rises tomorrow is (n + 1)/(n + 2), where n is the total number of days recorded in history. For instance, if so far we have encountered 100 days in the history of our universe, this would say that the probability the sun will rise tomorrow is 101/102 ≈ 0.9902. However, we know that this calculation is ridiculous.
Here, we have extremely strong subjective information (the laws of physics) that says it is extremely likely that the sun will rise tomorrow. Thus, objective Bayesian methods shouldn’t be recklessly applied to every problem we study, especially when subjective information this strong is available.
The Uniform prior of Bayes and Laplace has been criticized for many different reasons. We will discuss one important criticism and not go into the other reasons, since they go beyond the scope of this course.
Jeffreys’ Prior
What does the invariance principle mean? Suppose our prior is on the parameter θ, but we would like to transform to φ = f(θ).
Jeffreys’ prior says that if θ has the distribution specified by Jeffreys’ prior for θ, then φ = f(θ) will have the distribution specified by Jeffreys’ prior for φ.
We will clarify by going over two examples to illustrate this idea.
Note, for example, that if θ has a Uniform prior, then one can show that φ = f(θ) will not, in general, have a Uniform prior (unless f is the identity function).
Aside from the invariance property of Jeffreys’ prior, in the univariate case Jeffreys’ prior satisfies many optimality criteria that statisticians are interested in.
Define
I(θ) = −E[∂² log f(X | θ) / ∂θ²],
where I(θ) is called the Fisher information. Then Jeffreys’ prior is defined to be
pJ(θ) = √I(θ).
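As an informal check (not part of the notes), we can approximate I(θ) numerically for a Binomial(n, θ) model, using the fact that the Fisher information equals the variance of the score, and confirm that √I(θ) has the Beta(1/2, 1/2) shape mentioned earlier:
set.seed(1)
n = 20
theta.grid = seq(0.05, 0.95, by = 0.10)
fisher.mc = sapply(theta.grid, function(theta) {
  x = rbinom(1e5, size = n, prob = theta)
  score = x/theta - (n - x)/(1 - theta)   # d/d(theta) of log f(x | theta)
  var(score)                              # I(theta) = variance of the score
})
# unnormalized Jeffreys prior from simulation vs. the analytic form sqrt(n/(theta(1 - theta)))
cbind(theta = theta.grid, pJ.numeric = sqrt(fisher.mc),
      pJ.analytic = sqrt(n/(theta.grid * (1 - theta.grid))))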
To see why a Uniform prior is not invariant, suppose θ ∼ Unif(0, 1) and let φ = θ², so that θ = √φ. It follows that
∂θ/∂φ = 1/(2√φ).
Thus,
p(φ) = 1/(2√φ), 0 < φ < 1,
which shows that φ is not Uniform on (0, 1). Hence, the Uniform prior is not invariant under this transformation. Criticism such as this led to consideration of Jeffreys’ prior.
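A quick Monte Carlo illustration of the same point (again, just a sketch, not from the notes): squaring Uniform draws piles the mass up near 0, and the histogram matches the density 1/(2√φ) derived above.
set.seed(1)
theta = runif(1e5)                 # theta ~ Unif(0, 1)
phi = theta^2                      # transformed parameter
hist(phi, breaks = 50, freq = FALSE, main = "Induced prior on phi = theta^2")
curve(1/(2 * sqrt(x)), from = 0.001, to = 1, add = TRUE, lwd = 2)  # p(phi) = 1/(2 sqrt(phi))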
Example 3.9: (Jeffreys’ Prior Invariance Example)
Suppose
X | θ ∼ Exp(θ).
One can show using calculus that I(θ) = 1/θ², so that pJ(θ) = 1/θ. Suppose that φ = θ², so θ = √φ. It follows that
∂θ/∂φ = 1/(2√φ).
Then
pJ(φ) = pJ(θ(φ)) |∂θ/∂φ| = (1/√φ) × (1/(2√φ)) ∝ 1/φ.
This is exactly the prior obtained by computing the Fisher information directly in the φ parametrization, since I(φ) = 1/(4φ²) gives pJ(φ) = √I(φ) = 1/(2φ) ∝ 1/φ. Hence, we have shown for this example that Jeffreys’ prior is invariant under the transformation φ = θ².
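The invariance can also be checked numerically (an illustrative sketch, not part of the notes): estimate √I(φ) directly in the φ parametrization by simulation and compare it with the transformed prior pJ(θ(φ)) |∂θ/∂φ|; both should behave like 1/(2φ).
set.seed(1)
phi.grid = c(0.5, 1, 2, 4)
pJ.direct = sapply(phi.grid, function(phi) {
  theta = sqrt(phi)
  x = rexp(1e5, rate = theta)
  # score with respect to phi, since log f(x | phi) = (1/2) log(phi) - sqrt(phi) * x
  score = 1/(2 * phi) - x/(2 * sqrt(phi))
  sqrt(var(score))                        # estimate of sqrt(I(phi))
})
pJ.transformed = (1/sqrt(phi.grid)) * (1/(2 * sqrt(phi.grid)))  # pJ(theta(phi)) |d theta/d phi| = 1/(2 phi)
cbind(phi = phi.grid, direct = pJ.direct, transformed = pJ.transformed)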
Example 3.10: (Jeffreys’ prior) Suppose
X | θ ∼ Binomial(n, θ). One can show that the Jeffreys prior is pJ(θ) ∝ θ^(−1/2)(1 − θ)^(−1/2), which is the Beta(1/2, 1/2) density.
[Figure 3.4: the Beta(1/2, 1/2) Jeffreys prior density compared with the flat Beta(1, 1) prior density.]
Figure 3.4 compares the prior density πJ(θ) with that for a flat prior, which is equivalent to a Beta(1, 1) distribution.
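A small R sketch that reproduces this comparison (the line types and axis limits are my own choices, not taken from the figure):
theta = seq(0.001, 0.999, length = 500)
plot(theta, dbeta(theta, 1/2, 1/2), type = "l", lwd = 3, lty = 1,
     xlab = expression(theta), ylab = "density", ylim = c(0, 1.5))  # ylim clips the unbounded spikes at 0 and 1
lines(theta, dbeta(theta, 1, 1), lwd = 3, lty = 2)
legend("top", c("Beta(1/2,1/2)", "Beta(1,1)"), lty = c(1, 2), lwd = 3)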
Note that in this case the prior is inversely proportional to the standard deviation. Why does this make sense?
We see that the data has the least effect on the posterior when the true θ = 0.5, and has the greatest effect near the extremes, θ = 0 or 1. Jeffreys’ prior compensates for this by placing more mass near the extremes of the range, where the data has the strongest effect. We could get the same effect by (for example) letting the prior be π(θ) ∝ 1/Var θ instead of π(θ) ∝ 1/[Var θ]^(1/2).
Thus, θ | x ∼ Beta(x + 1/2, n − x + 1/2), which is a proper posterior since the prior is proper.
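For concreteness, here is a small sketch computing this posterior for hypothetical data (x = 7 successes in n = 10 trials are made-up numbers):
x = 7
n = 10
post.a = x + 1/2        # posterior Beta parameters under the Jeffreys prior
post.b = n - x + 1/2
post.a/(post.a + post.b)                 # posterior mean
qbeta(c(0.025, 0.975), post.a, post.b)   # 95% equal-tailed credible interval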
Limitations of Jeffreys’
Jeffreys’ priors work well for single-parameter models, but not for models with multidimensional parameters. By analogy with the one-dimensional case, one might construct a naive Jeffreys prior as the joint density:
πJ(θ) = |I(θ)|^(1/2),
where | ⋅ | denotes the determinant and the (i, j)th element of the Fisher information matrix is given by
I(θ)ij = −E[∂² log p(X | θ) / (∂θi ∂θj)].
Let’s see what happens when we apply a Jeffreys’ prior for θ to a multivariate Gaussian location model. Suppose X ∼ Np(θ, I), and we are interested in performing inference on ||θ||². In this case the Jeffreys’ prior for θ is flat. It turns out that the posterior of ||θ||² has the form of a non-central χ² distribution with p degrees of freedom. The posterior mean given one observation of X is E(||θ||² | X) = ||X||² + p. This is not a good estimate, because it adds p to the squared norm of X, whereas we might normally want to shrink our estimate towards zero. By contrast, the minimum variance frequentist estimate of ||θ||² is ||X||² − p.
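A quick Monte Carlo check of this formula (an illustrative sketch with made-up data, not part of the notes): under the flat prior, θ | X ∼ Np(X, I), so posterior draws of ||θ||² should average to ||X||² + p.
set.seed(1)
p = 5
X = rnorm(p, mean = 2)                         # one observed vector (made-up data)
theta.draws = matrix(rnorm(1e5 * p, mean = rep(X, each = 1e5)), ncol = p)  # rows are posterior draws of theta
mean(rowSums(theta.draws^2))                   # Monte Carlo estimate of E(||theta||^2 | X)
sum(X^2) + p                                   # closed form ||X||^2 + p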
Haldane’s Prior
Haldane’s prior for a Binomial success probability p is π(p) ∝ 1/(p(1 − p)), which is improper; with y successes in n trials it yields the posterior p | y ∼ Beta(y, n − y). We need to check that this posterior is proper. Recall that the parameters of the Beta need to be positive. Thus we need y > 0 and n − y > 0; that is, y ≠ 0 and y ≠ n in order for the posterior to be proper.
There are many other objective priors that are used in Bayesian inference; however, this is the level of exposure that we will cover in this course. If you’re interested in learning more about objective priors (the g-prior, probability matching priors), see me and I can give you some references.
Reference Priors
Reference priors were proposed by Jose Bernardo in a 1979 paper, and further developed by Jim Berger and others from the 1980s through the present. They are credited with bringing about an objective Bayesian renaissance; an annual conference is now devoted to the objective Bayesian approach.
For one-dimensional parameters, it will turn out that reference priors and Jeffreys’ priors are equivalent. For multidimensional parameters, they differ. One might ask, how can we choose a prior to maximize the divergence between the posterior and prior, without having seen the data first? Reference priors handle this by taking the expectation of the divergence, given a model distribution for the data. This sounds superficially like a frequentist approach, basing inference on imagined data. But once the prior is chosen based on some model, inference proceeds in a standard Bayesian fashion.
(This contrasts with the frequentist approach, which continues to deal with
imagined data even after seeing the real data!)
◯ Laplace Approximation
In Bayesian inference we often need to evaluate integrals of the form ∫ g(θ) f(x | θ) π(θ) dθ. For example, when g(θ) = 1, the integral reduces to the marginal likelihood of x. The posterior mean requires evaluation of two integrals, ∫ θ f(x | θ) π(θ) dθ and ∫ f(x | θ) π(θ) dθ. Laplace’s method is a technique for approximating integrals when the integrand has a sharp maximum.
Proof. Write I = ∫ q(θ) exp{n h(θ)} dθ, and let θ̂ be the value at which h is maximized, so that h′(θ̂) = 0. Apply a Taylor expansion about θ̂. For a small δ > 0,
I ≈ ∫ from θ̂−δ to θ̂+δ of [ q(θ̂) + (θ − θ̂) q′(θ̂) + ½ (θ − θ̂)² q″(θ̂) ]
    × exp{ n h(θ̂) + n (θ − θ̂) h′(θ̂) + (n/2) (θ − θ̂)² h″(θ̂) } dθ + ⋯
  ≈ q(θ̂) e^{n h(θ̂)} ∫ from θ̂−δ to θ̂+δ of [ 1 + (θ − θ̂) q′(θ̂)/q(θ̂) + ½ (θ − θ̂)² q″(θ̂)/q(θ̂) ]
    × exp{ −(nc/2) (θ − θ̂)² } dθ + ⋯ ,
where c = −h″(θ̂) and we have used the fact that h′(θ̂) = 0.
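To make the method concrete, here is a small R sketch (entirely my own illustration, with made-up data x = 7, n = 10 and a Unif(0, 1) prior; the helper name laplace.int is hypothetical). It applies the one-dimensional approximation ∫ exp{ℓ(θ)} dθ ≈ exp{ℓ(θ̂)} √(2π/(−ℓ″(θ̂))) to the numerator and denominator integrals of the posterior mean and compares the result with the exact answer.
laplace.int = function(ell, lower, upper) {
  # Laplace approximation to the integral of exp{ell(theta)} over (lower, upper)
  opt = optimize(ell, c(lower, upper), maximum = TRUE)
  theta.hat = opt$maximum
  eps = 1e-5
  ell2 = (ell(theta.hat + eps) - 2 * ell(theta.hat) + ell(theta.hat - eps))/eps^2  # numerical ell''(theta.hat)
  exp(opt$objective) * sqrt(2 * pi/(-ell2))
}
x = 7; n = 10                                            # made-up binomial data
log.num = function(theta) log(theta) + x * log(theta) + (n - x) * log(1 - theta)  # log of theta * f(x | theta) * 1
log.den = function(theta) x * log(theta) + (n - x) * log(1 - theta)               # log of f(x | theta) * 1
laplace.int(log.num, 0, 1)/laplace.int(log.den, 0, 1)    # Laplace approximation to the posterior mean
(x + 1)/(n + 2)                                          # exact: the posterior is Beta(x + 1, n - x + 1)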
First, we give a few definitions from probability theory (you may have seen
these before) and we will be informal about these.
Xn = o(rn) as n → ∞
means that
Xn/rn → 0.
Similarly,
Xn = O(rn) as n → ∞
means that
Xn/rn is bounded.
Finally, Xn = op(Rn) if Xn/Rn → 0 in probability.