Statistics Lecture Notes

This document provides an introduction and overview for Stat 511, an advanced statistics course. It discusses the conceptual and mathematical foundations of the course, including the basic ingredients of a statistical inference problem and introducing concepts from measure theory that are important for probability and statistics. Measure theory provides the foundation for modern probability and will be important for understanding research papers in statistical theory. The course will focus on parametric distributions like exponential families and developing a working knowledge of basic measure theory concepts.

Stat 511 – Lecture Notes I

Introduction and Preparations ∗†


Ryan Martin
www.math.uic.edu/~rgmartin
Spring 2013

1 Introduction
Stat 511 is a first course in advanced statistical theory. This first set of notes is intended
to set the stage for the material that is the core of the course. In particular, these notes
define the notation we shall use throughout, and also set the conceptual and mathematical
level we will be working at. Naturally, both the conceptual and mathematical level will
be higher than in an intermediate course, such as Stat 411 at UIC.
On the conceptual side, besides being able to apply theory to particular examples, I
hope to communicate why such theory was developed; that is, not only do I want you
be familiar with results and techniques, but I hope you can understand the motivation
behind these developments. Along these lines, in this set of notes, I will discuss the
basic ingredients of a statistical inference problem, along with some discussion about
statistical reasoning, addressing the fundamental question: how do we reason from sample
to population? Surprisingly, there’s no perfectly satisfactory answer to this question.
On the mathematical side, real analysis and, in particular, measure theory, is very im-
portant in probability and statistics. Indeed, measure theory is the foundation on which
modern probability is built and, by the close connection between probability and statis-
tics, it is natural that measure theory also permeates the statistics literature. Measure
theory itself can be very abstract and difficult. I am not an expert in measure theory,
and I don’t expect you to be an expert either. But, in general, to read and understand
research papers in statistical theory, one should at least be familiar with the basic ter-
minology and results of measure theory. My presentation here is meant to introduce you
to these basics, so that we have a working measure-theoretic vocabulary moving forward
to our main focus in the course. Keener (2010), the course textbook, also takes a similar
approach to its measure theory presentation.
The remainder of this first set of notes makes a connection between measure theory and probability, including conditional distributions, introduces the two most important general classes of parametric distributions—namely, exponential families and group transformation families—and concludes with a discussion of convex functions.

∗This version: January 8, 2013
†These notes are meant to supplement in-class lectures. The author makes no guarantees that these notes are free of typos or more serious errors.

2 Conceptual preliminaries
2.1 Ingredients of a statistical inference problem
Statistics in general is concerned with the collection and analysis of data. The data-
collection step is an important one, but this will not be considered here—we will assume
the data is given, and concern ourselves only with how these data should be analyzed. In
our case, the general statistical problem we will face consists of data X, possibly vector-
valued, taking values in X, and a model that describes the mechanism that produced this
data. For example, if X = (X1 , . . . , Xn ) is a vector consisting of the recorded heights of n
UIC students, then the model might say that these individuals were sampled completely
at random from the entire population of UIC students, and that heights of students in the
population are normally distributed. In short, we would write something like X1 , . . . , Xn
are iid N(µ, σ 2 ); here “iid” stands for independent and identically distributed. There
would be nothing to analyze if the population in question were completely known. In
the heights example, it shall be assumed that at least one of µ and σ² is unknown, and
we want to use the observed data X to learn about these unknown quantities. So, in
some sense, the population in question is actually just a class/family of distributions—in
the heights example this is the collection of all (univariate) normal distributions. More
generally, we shall say that X ∼ Pθ where, for each θ ∈ Θ, Pθ is a probability distribution.
Then the statistics problem can be stated as follows: Use data X to determine which
population in {Pθ : θ ∈ Θ} was the one that produced the observed X. We shall refer
to θ as the parameter and Θ the parameter space. We will have some more to say about
these parametric families in Section 3.4 below.
The data and model should be familiar ingredients in the statistical inference problem.
There is an important but less-familiar piece of the statistical inference problem—the loss
function—that is not given much attention in introductory inference courses. To facilitate
this discussion, consider the problem of trying to estimate θ based on data X ∼ Pθ . The
loss function L records how much I lose by guessing that θ equals any particular a in
Θ. In other words, (θ, a) ↦ L(θ, a) is just a real-valued function defined on Θ × Θ. In
introductory courses, one usually takes L(θ, a) = (a − θ)2 , the so-called squared error
loss, without explanation. In this course, we will consider more general loss functions, in
more general inference problems, particularly when we discuss decision theory.
To summarize, the statistical inference problem consists of data X taking values in a sample space X and a family of probability distributions {Pθ : θ ∈ Θ}. In some
cases, we will need to consider the loss function L(·, ·), and in other cases there will be a
known probability distribution Π sitting on the parameter space Θ, which we will need
to incorporate somehow. In any case, the ultimate goal is to identify the particular Pθ
which produced the observed X.

2.2 Reasoning from sample to population
It is generally believed that statistics and probability are closely related. But while this
claim is true in some sense, the connection is not immediate or obvious. Surely, the
general sampling model “X ∼ Pθ ” is a probabilistic statement for known θ. For example,
if X ∼ N(0, 1) then we know from introductory probability courses that P(0,1) {X ≤ 1} =
Φ(1), where Φ(·) is the standard normal distribution function. Similar calculations can
be made for any known θ = (µ, σ 2 ). But how can this information be used when θ
is unknown? You see, probability can be used directly to describe characteristics of a
sample taken from a fixed population. The statistics problem, on the other hand, has
a known sample but an unknown population—basically the opposite of the probability
setup. It is not clear how probability should be used, is it?
The general strategy is to embed, in one way or another, the particular problem
into a hypothetical sequence of infinitely many “similar” problems. In so doing, tools
from probability can be introduced. For example, in frequentist statistics, one looks for
procedures which perform well on average across this hypothetical sequence. In Bayesian
statistics, a super-population is introduced from which the unknown θ is believed to be sampled; this allows application of Bayes' theorem to incorporate the observed data.
For the most part, we will avoid such philosophical concerns in this course, but students should be aware that there is no widely agreed-upon setup. The issue is that
the statistical inference problem is ill-posed, from a mathematical point of view, so one
cannot deduce, from first principles, a “correct answer.” Fisher thought very carefully
about such things; this is related to his emphatic point that likelihood and probability
are different. There have been efforts to develop different solutions to the inference prob-
lem (non-frequentist and non-Bayesian) but these have, for the most part, not reached
the mainstream. The basic idea behind each of these alternatives is to avoid introduc-
ing a hypothetical super-population. The challenge is that these alternatives generally
require use of something other than the familiar probability, e.g., belief functions. These
alternatives include Fisher’s fiducial inference (Zabell 1992, is a good review), general-
ized fiducial inference (Hannig 2009), structural inference (Fraser 1968), Dempster–Shafer
theory (Dempster 2008; Shafer 1976), and the new inferential model framework (Martin
and Liu 2012). Depending on time, we may discuss some of these alternatives in class.

3 Mathematical preliminaries
3.1 Measure and integration
Measure theory is the foundation on which modern probability theory is built. All
statisticians should, at least, be familiar with the terminology and the key results (e.g.,
Lebesgue’s dominated convergence theorem). The presentation below is based on mate-
rial in Lehmann and Casella (1998); similar things are presented in Keener (2010).
A measure is a generalization of the concept of length, area, volume, etc. More
specifically, a measure µ is a non-negative set-function, i.e., µ assigns a non-negative
number to subsets A of an abstract set X, and this number is denoted by µ(A). Similar
to lengths, µ is assumed to be additive:
µ(A ∪ B) = µ(A) + µ(B), for each disjoint A and B.

This extends by induction to any finite collection A_1, . . . , A_n of disjoint sets. But a stronger assumption is σ-additivity:

    µ(∪_{i=1}^∞ A_i) = ∑_{i=1}^∞ µ(A_i),   for all disjoint A_1, A_2, . . ..

Note that finite additivity does not imply σ-additivity. All of the (probability) measures
we're familiar with are σ-additive.¹ But there are some peculiar measures which are
finitely additive but not σ-additive. The classical example of this is the following.
Example 1. Take X = {1, 2, . . .} and define a measure µ by

    µ(A) = 0 if A is finite,   µ(A) = 1 if A is co-finite,

where a set A is "co-finite" if it's the complement of a finite set. It is easy to see that µ is additive. But taking the disjoint sequence A_i = {i}, we find that µ(∪_{i=1}^∞ A_i) = µ(X) = 1, while ∑_{i=1}^∞ µ(A_i) = ∑_{i=1}^∞ 0 = 0. Therefore, µ is not σ-additive.
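To see the failure of σ-additivity concretely, the measure of Example 1 can be sketched in code. The encoding below (a set is stored as its finite part together with a finite/co-finite flag) is an illustrative choice, not part of the example itself.

```python
# A subset of X = {1, 2, ...} is encoded as ("finite", S) or ("cofinite", S),
# where S is the finite set itself or the finite part of the complement.
def mu(A):
    kind, _ = A
    return 0 if kind == "finite" else 1   # Example 1's measure

# Finite additivity: a disjoint union of finitely many singletons is finite,
# so both sides of mu(A1 ∪ ... ∪ An) = mu(A1) + ... + mu(An) equal 0.
singletons = [("finite", {i}) for i in range(1, 101)]
finite_union = ("finite", set().union(*(S for _, S in singletons)))
assert mu(finite_union) == sum(mu(A) for A in singletons) == 0

# Sigma-additivity fails: the union of ALL the singletons is X itself, which
# is co-finite (complement of the empty set), yet every term of the sum is 0.
X_full = ("cofinite", set())
assert mu(X_full) == 1 and sum(mu(A) for A in singletons) == 0
```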
In general, a measure µ cannot be defined for all subsets A ⊆ X. But the class of
subsets on which the measure can be defined is, in general, a σ-algebra, or σ-field.
Definition 1. A σ-algebra A is a collection of subsets of X that satisfies the following properties:
• X is in A;
• if A ∈ A, then A^c ∈ A;
• if A_1, A_2, . . . ∈ A, then ∪_{i=1}^∞ A_i ∈ A.

The sets A ∈ A are said to be measurable. We refer to (X, A) as a measurable space and, if µ is a measure defined on (X, A), then (X, A, µ) is a measure space.
A measure µ is finite if µ(X) is a finite number. Probability measures (see Section 3.2) are special finite measures with µ(X) = 1. A measure µ is said to be σ-finite if there exists a sequence of sets {A_i} ⊂ A such that ∪_{i=1}^∞ A_i = X and µ(A_i) < ∞ for each i.
Example 2. Let X be a countable set and A the class of all subsets of X; then clearly A
is a σ-algebra. Define µ according to the rule
µ(A) = number of points in A, A ∈ A.
Then µ is a σ-finite measure which is referred to as counting measure.
Example 3. Let X be a subset of d-dimensional Euclidean space Rd . Take A to be the
smallest σ-algebra that contains the collection of open rectangles
A = {(x1 , . . . , xd ) : ai < xi < bi , i = 1, . . . , d}, ai < bi .
Then A is the Borel σ-algebra on X, which contains all open and closed sets in X; but
there are subsets of X that do not belong to A! The (unique) measure µ, defined by

    µ(A) = ∏_{i=1}^d (b_i − a_i)   for rectangles A ∈ A,

is called Lebesgue measure, and it's σ-finite.


¹Remember, that's one of Kolmogorov's axioms!

Next we consider integration of a real-valued function f with respect to a measure µ
on (X, A). This more general definition of integral satisfies most of the familiar properties
from calculus, such as linearity, monotonicity, etc. But the calculus integral is defined
only for a class of functions which is generally too small for our applications.
The class of functions of interest are those which are measurable. In particular, a
real-valued function f is measurable if and only if, for every real number a, the set
{x : f (x) ≤ a} is in A. If A is a measurable set, then the indicator function IA (x), which
equals 1 when x ∈ A and 0 otherwise, is measurable. More generally, a simple function

    s(x) = ∑_{k=1}^K a_k I_{A_k}(x)

is measurable provided that A_1, . . . , A_K ∈ A. Continuous functions f are also usually measurable.


The integral of a non-negative simple function s with respect to µ is defined as

    ∫ s dµ = ∑_{k=1}^K a_k µ(A_k).   (1)

Take a non-decreasing sequence of non-negative simple functions {s_n} and define

    f(x) = lim_{n→∞} s_n(x).   (2)

It can be shown that f defined in (2) is measurable. Then the integral of f with respect to µ is defined as

    ∫ f dµ = lim_{n→∞} ∫ s_n dµ,

the limit of the simple function integrals. It turns out that this limit does not depend on the particular sequence {s_n}, so the integral is well-defined. In fact, an equivalent definition for the integral of a non-negative f is

    ∫ f dµ = sup_{0 ≤ s ≤ f, s simple} ∫ s dµ.   (3)
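As a sanity check on this definition, the standard dyadic simple functions s_n(x) = min{n, ⌊2ⁿ f(x)⌋/2ⁿ} can be integrated numerically. In the sketch below, Lebesgue measure on [0, 1] is approximated by an evenly spaced grid; the grid size and the test function are illustrative choices, not part of the theory.

```python
import math

def simple_integral(f, n, grid=20000):
    # Integral of s_n(x) = min(n, floor(2^n f(x)) / 2^n) with respect to
    # Lebesgue measure on [0, 1]; the measure of each level set is
    # approximated by the fraction of grid midpoints it contains.
    total = 0.0
    for k in range(grid):
        x = (k + 0.5) / grid
        total += min(n, math.floor(2 ** n * f(x)) / 2 ** n)
    return total / grid

f = lambda x: x * x                         # non-negative measurable f on [0, 1]
approx = [simple_integral(f, n) for n in (2, 4, 8)]
assert approx[0] <= approx[1] <= approx[2]  # simple-function integrals increase
assert abs(approx[2] - 1 / 3) < 0.01        # toward the true value ∫ x² dx = 1/3
```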

For a general measurable function f which may take negative values, define

    f⁺(x) = max{f(x), 0}   and   f⁻(x) = − min{f(x), 0}.

Both the positive part f⁺ and the negative part f⁻ are non-negative, and f = f⁺ − f⁻. The integral of f with respect to µ is defined as

    ∫ f dµ = ∫ f⁺ dµ − ∫ f⁻ dµ,

where the two integrals on the right-hand side are defined through (3). In general, a measurable function f is said to be µ-integrable, or just integrable, if ∫ f⁺ dµ and ∫ f⁻ dµ are both finite.

Example 4 (Counting measure). If X = {x_1, x_2, . . .} and µ is counting measure, then

    ∫ f dµ = ∑_{i=1}^∞ f(x_i).

Example 5 (Lebesgue measure). If X is a Euclidean space and µ is Lebesgue measure, then ∫ f dµ exists and is equal to the usual Riemann integral of f from calculus whenever the latter exists. But the Lebesgue integral exists for f which are not Riemann integrable.
Next we list some important results from analysis, related to integrals. The first
two have to do with interchange of limits and integration, which is often important in
statistical problems. The first is relatively weak, but is used in the proof of the second.
Theorem 1 (Fatou's lemma). Given {f_n}, non-negative and measurable,

    ∫ (lim inf_{n→∞} f_n) dµ ≤ lim inf_{n→∞} ∫ f_n dµ.

The opposite inequality holds for lim sup, provided that |f_n| ≤ g for integrable g.
Theorem 2 (Dominated convergence). Given measurable {f_n}, suppose that

    f(x) = lim_{n→∞} f_n(x)   µ-almost everywhere,

and |f_n(x)| ≤ g(x) for all n, for all x, and for some integrable function g. Then f_n and f are integrable, and

    ∫ f dµ = lim_{n→∞} ∫ f_n dµ.

Proof. First, by definition of f as the pointwise limit of f_n, we have that |f_n − f| ≤ |f_n| + |f| ≤ 2g and lim sup_n |f_n − f| = 0. From Exercise 10, we get

    |∫ f_n dµ − ∫ f dµ| = |∫ (f_n − f) dµ| ≤ ∫ |f_n − f| dµ

and, for the upper bound, by the "reverse Fatou's lemma," we have

    lim sup_n ∫ |f_n − f| dµ ≤ ∫ (lim sup_n |f_n − f|) dµ = 0.

Therefore, ∫ f_n dµ → ∫ f dµ, which completes the proof.
Note, the phrase “µ-almost everywhere” used in the theorem means that the property
holds everywhere except on a µ-null set; i.e., a set N with µ(N ) = 0.
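A quick numerical illustration of the theorem, with f_n(x) = x/(1 + nx²) on [0, 1], which is dominated by the integrable g(x) = x and converges pointwise to 0. The choice of sequence and the midpoint-rule integrator are for illustration only.

```python
import math

def integral(f, m=20000):
    # midpoint-rule stand-in for the Lebesgue integral over [0, 1]
    return sum(f((k + 0.5) / m) for k in range(m)) / m

def fn(n):
    return lambda x: x / (1 + n * x * x)   # |f_n(x)| <= g(x) = x, integrable

# Pointwise limit is f = 0, so the theorem says the integrals must go to 0.
vals = [integral(fn(n)) for n in (1, 10, 100, 1000)]
assert all(a > b for a, b in zip(vals, vals[1:]))   # decreasing toward 0
assert vals[-1] < 0.01                              # close to ∫ 0 dµ = 0
assert abs(vals[0] - math.log(2) / 2) < 1e-6        # exact value at n = 1
```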
The next theorem is useful for bounding integrals of products of two functions. You may be familiar with its name from other courses, such as linear algebra; it turns out that certain collections of integrable functions behave very much like vectors in a finite-dimensional vector space.
Theorem 3 (Cauchy–Schwarz inequality). If f² and g² are measurable, then

    (∫ f g dµ)² ≤ ∫ f² dµ · ∫ g² dµ.

Proof.² Take any λ; then ∫ (f + λg)² dµ ≥ 0. In particular,

    ∫ g² dµ · λ² + 2 ∫ f g dµ · λ + ∫ f² dµ ≥ 0   for all λ;

write a = ∫ g² dµ, b = 2 ∫ f g dµ, and c = ∫ f² dµ. In other words, the quadratic (in λ) can have at most one real root. Using the quadratic formula,

    λ = [−b ± √(b² − 4ac)] / (2a),

it is clear that the only way there can be fewer than two real roots is if b² − 4ac ≤ 0. Using the definitions of a, b, and c we find that

    4 (∫ f g dµ)² − 4 ∫ f² dµ · ∫ g² dµ ≤ 0,

and from this the result follows immediately.
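With µ taken to be counting measure on a finite set, the integrals become plain sums and the inequality can be checked directly. This is a toy illustration; the random vectors are arbitrary.

```python
import random

random.seed(0)
n = 50
f = [random.uniform(-1, 1) for _ in range(n)]
g = [random.uniform(-1, 1) for _ in range(n)]

# With µ = counting measure on {1, ..., n}, ∫ h dµ is just sum(h).
lhs = sum(fi * gi for fi, gi in zip(f, g)) ** 2            # (∫ fg dµ)²
rhs = sum(fi * fi for fi in f) * sum(gi * gi for gi in g)  # ∫f² dµ · ∫g² dµ
assert lhs <= rhs
```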


The next result defines “double-integrals” and shows that, under certain conditions,
the order of integration does not matter. Fudging a little bit on the details, for two
measure spaces (X, A, µ) and (Y, B, ν), define the product space
(X × Y, A ⊗ B, µ × ν),
where X × Y is the usual set of ordered pairs (x, y), A ⊗ B is the smallest σ-algebra that
contains all the sets A × B for A ∈ A and B ∈ B, and µ × ν is the product measure
defined as
(µ × ν)(A × B) = µ(A)ν(B).
This concept is important for us because independent probability distributions induce a
product measure. Fubini’s theorem is a powerful result that allows certain integrals over
the product to be done one dimension at a time.
Theorem 4 (Fubini). Let f(x, y) be a non-negative measurable function on X × Y. Then

    ∫_X [ ∫_Y f(x, y) dν(y) ] dµ(x) = ∫_Y [ ∫_X f(x, y) dµ(x) ] dν(y).   (4)

The common value above is the double integral, written ∫_{X×Y} f d(µ × ν).
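With µ = ν = counting measure on finite sets, (4) reduces to interchanging two finite sums, which is easy to verify. The function on the grid is an arbitrary non-negative choice.

```python
# Fubini sanity check: with counting measures, both iterated integrals in (4)
# are double sums, and the order of summation should not matter.
f = lambda x, y: (x + 1.0) / (y + 2.0)     # non-negative on the grid
X, Y = range(5), range(7)

y_then_x = sum(sum(f(x, y) for y in Y) for x in X)
x_then_y = sum(sum(f(x, y) for x in X) for y in Y)
assert abs(y_then_x - x_then_y) < 1e-12
```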
Our last result has something to do with constructing new measures from old. It also
allows us to generalize the familiar notion of probability densities which, in turn, will
make our lives easier when discussing the general statistical inference problem. Suppose
f is a non-negative integrable function. Then

    ν(A) = ∫_A f dµ   (5)

defines a new measure ν on (X, A). An important property is that µ(A) = 0 implies
ν(A) = 0; the terminology is that ν is absolutely continuous with respect to µ, or ν is
dominated by µ, written ν ≪ µ. But it turns out that, if ν ≪ µ, then there exists f such
that (5) holds. This is the famous Radon–Nikodym theorem.
²A different proof, based on Jensen's inequality, is given in Example 8.

Theorem 5 (Radon–Nikodym). Suppose ν ≪ µ. Then there exists a non-negative µ-
integrable function f , unique modulo µ-null sets, such that (5) holds. The function f ,
often written as f = dν/dµ, is the Radon–Nikodym derivative of ν with respect to µ.
We’ll see later that, in statistical problems, the Radon–Nikodym derivative is the
familiar density or, perhaps, a likelihood ratio. The Radon–Nikodym theorem also for-
malizes the idea of change-of-variables in integration. For example, suppose that µ and
ν are σ-finite measures defined on X, such that ν ≪ µ, so that there exists a unique Radon–Nikodym derivative f = dν/dµ. Then, for a ν-integrable function ϕ, we have

    ∫ ϕ dν = ∫ ϕ f dµ;

symbolically this makes sense: dν = (dν/dµ) dµ.
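The change-of-variables identity can be sketched numerically. In this illustration µ itself has a Lebesgue density, so the derivatives chain; the specific densities and the test function ϕ are arbitrary choices made for the example.

```python
def leb_integral(h, m=20000):
    # midpoint rule on [0, 1] as a stand-in for the Lebesgue integral
    return sum(h((k + 0.5) / m) for k in range(m)) / m

w = lambda x: 2 * x        # dµ/dx: µ has density 2x w.r.t. Lebesgue measure
f = lambda x: 1.5 * x      # dν/dµ: the Radon–Nikodym derivative of ν w.r.t. µ
phi = lambda x: x ** 2     # a ν-integrable test function

# Change of variables: ∫ φ dν = ∫ φ (dν/dµ) dµ = ∫ φ(x) f(x) w(x) dx.
lhs = leb_integral(lambda x: phi(x) * f(x) * w(x))
# Chaining derivatives, ν has Lebesgue density f(x)·w(x) = 3x², so the same
# integral can be computed from dν/dx directly:
rhs = leb_integral(lambda x: phi(x) * 3 * x ** 2)
assert abs(lhs - rhs) < 1e-12
assert abs(lhs - 3 / 5) < 1e-6     # ∫ 3x⁴ dx on [0, 1] = 3/5
```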

3.2 Probability basics


It turns out that mathematical probability is just a special case of the measure theory presented above. Our probabilities are finite measures, our random variables are
the measurable functions, our expected values are just integrals.
Start with an essentially arbitrary measurable space (Ω, F), and introduce a proba-
bility measure P; that is P(Ω) = 1. Then (Ω, F, P) is called a probability space. The idea
is that Ω contains all possible outcomes of the random experiment. Consider, for exam-
ple, the heights example above in Section 2.1. Suppose we plan to sample a single UIC
student at random from the population of students. Then Ω consists of all students, and
exactly one of these students will be the one that’s observed. The measure P will encode
the underlying sampling scheme. But in this example, it’s not the particular student
chosen that’s of interest: we want to know the student’s height, which is a measurement
or characteristic of the sampled student. How do we account for this?
A random variable X is nothing but a measurable function from Ω to another space
X. It’s important to understand that X, as a mapping, is not random; instead, X is a
function of a randomly chosen element ω in Ω. So when we are discussing probabilities
that X satisfies such and such properties, we’re actually thinking about the probability
(or measure) the set of ω’s for which X(ω) satisfies the particular property. To make this
more precise, we write

    P(X ∈ A) = P{ω : X(ω) ∈ A} = PX⁻¹(A).

To simplify notation, etc, we will often ignore the underlying probability space, and work
simply with the probability measure PX (·) = PX −1 (·). This is what we’re familiar with
from basic probability and statistics; the statement X ∼ N(0, 1) means simply that the
probability measure induced on R by the mapping X is a standard normal distribution.
When there is no possibility of confusion, we will drop the “X” subscript and simply
write P for PX .
When PX, a measure on the X-space X, is dominated by a σ-finite measure µ, the Radon–Nikodym theorem says there is a density dPX/dµ = pX, and

    PX(A) = ∫_A pX dµ.

This is the familiar case we’re used to; when µ is counting measure, pX is a probability
mass function and, when µ is Lebesgue measure, pX is a probability density function.
One of the benefits of the measure-theoretic formulation is that we do not have to handle
these two important cases separately.
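A small illustration of this unification: the same formula PX(A) = ∫_A pX dµ covers both a pmf under counting measure and a pdf under Lebesgue measure. The particular distributions (Poisson and standard normal) and the crude numerical integrator are illustrative choices.

```python
import math

# (1) µ = counting measure: p_X is a pmf and the integral over A is a sum.
lam = 3.0
pmf = lambda k: math.exp(-lam) * lam ** k / math.factorial(k)  # Poisson(3)
p_discrete = sum(pmf(k) for k in range(4))        # P(X <= 3)

# (2) µ = Lebesgue measure: p_X is a pdf and the integral is over an interval.
pdf = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)  # N(0, 1)
m, a, b = 100000, -8.0, 1.0                       # [-8, 1] stands in for (-inf, 1]
p_cont = sum(pdf(a + (k + 0.5) * (b - a) / m) for k in range(m)) * (b - a) / m

assert abs(p_discrete - 13 * math.exp(-3)) < 1e-12  # exact: 13 e^{-3} ≈ 0.6472
assert abs(p_cont - 0.8413) < 5e-4                  # Φ(1) ≈ 0.8413
```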
Let ϕ be a real-valued measurable function defined on X. Then the expected value of ϕ(X) is

    EX{ϕ(X)} = ∫_X ϕ(x) dPX(x) = ∫_X ϕ(x) pX(x) dµ(x),

the latter expression holding only when PX ≪ µ for a σ-finite measure µ on X. The
usual properties of expected value hold in this more general case; the same tools we use
in measure theory to study properties of integrals of measurable functions are useful for
deriving such things.
At this point I should mention that it will be assumed you are familiar with all the
basic probability calculations defined and used in intermediate probability and statistics
courses, such as Stat 401 and Stat 411 at UIC. In particular, you are expected to know the
common distributions (e.g., normal, binomial, Poisson, gamma, uniform, etc) and how to
calculate expectations for these and other distributions. Moreover, I will assume you are
familiar with some basic operations involving random vectors (e.g., covariance matrices)
and some simple linear algebra stuff. Keener (2010), Sections 1.7 and 1.8, introduces
these concepts and notations.
In probability and statistics, product spaces are especially important. The reason, as we alluded to before, is that independence of random variables is connected with product spaces and, in particular, product measures. If X1, . . . , Xn are iid PX, then their joint distribution is the product measure

    PX1 × PX2 × · · · × PXn = PX × PX × · · · × PX = PX^n.

The first equality holds with independence alone; the second requires "identically distributed;" the last term is just a short-hand notation for the middle one.
When we talk about convergence theorems, such as the law of large numbers, we
say something like: for an infinite sequence of random variables X1 , X2 , . . . some event
happens with probability 1. But what is the measure being referenced here? In the iid
case, it turns out that it's an infinite product measure, written as PX^∞. We'll have more to say about this when the time comes.

3.3 Conditional distributions


Conditional distributions in general are rather abstract. When the random variables in
question are discrete (µ = counting measure), however, things are quite simple; the reason
is that events where the value of the random variable is fixed have positive probability,
so the ordinary conditional probability formula involving ratios can be applied.
When one or more of the random variables in question are continuous (dominated
by Lebesgue measure), then more care must be taken. Suppose random variables X
and Y have a joint distribution with density function pX,Y (x, y), with respect to some
dominating (product) measure µ × ν. Then the corresponding marginal distributions

have densities with respect to µ and ν, respectively, given by
    pX(x) = ∫ pX,Y(x, y) dν(y)   and   pY(y) = ∫ pX,Y(x, y) dµ(x).

Moreover, the conditional distribution of Y , given X = x, also has a density with respect
to ν, and is given by the ratio

pY |X (y | x) = pX,Y (x, y)/pX (x).

As a function of x, for given y, this is clearly µ-measurable since the joint and marginal
densities are measurable. Also, for a given x, pY |X (y | x) defines a probability mea-
sure Qx, called the conditional distribution of Y, given X = x, through the integral Qx(B) = ∫_B pY|X(y | x) dν(y); that is, pY|X(y | x) is the Radon–Nikodym derivative for the conditional distribution Qx. For us, a conditional distribution can always be defined
through its conditional density though, in general, a conditional density may not exist
even if the conditional distribution Qx does exist. There are real cases where the most
general definition of conditional distribution (Keener 2010, Sec. 6.2) is required, e.g., in
the proof of the Neyman–Fisher factorization theorem and in the proof of the general
Bayes theorem. Also, I should mention that conditional distributions are not unique;
but, we shall not dwell on this point here.
Given conditional distribution with density pY |X (y | x), we can define conditional
probabilities and expectations. That is,
    P(Y ∈ B | X = x) = ∫_B pY|X(y | x) dν(y).

Here I use the more standard notation for conditional probability. The law of total
probability then allows us to write
    P(Y ∈ B) = ∫ P(Y ∈ B | X = x) pX(x) dµ(x);

in other words, marginal probabilities for Y may be obtained by taking expectation of


the conditional probabilities. More generally, for any ν-integrable function ϕ, we may
write the conditional expectation

    E{ϕ(Y) | X = x} = ∫ ϕ(y) pY|X(y | x) dν(y).

We may evaluate the above expectation for any x, so we actually have defined a (µ-
measurable) function, say, g(x) = E(Y | X = x); here I took ϕ(y) = y for simplicity.
Now, g(X) is a random variable, to be denoted by E(Y | X), and we can ask about
its mean, variance, etc. The corresponding versions of the law of total probability for
conditional expectations are

E(Y ) = E{E(Y | X)},


V(Y ) = V{E(Y | X)} + E{V(Y | X)},

where V(Y | X) is the conditional variance, i.e., the variance of Y relative to its con-
ditional distribution. The first formula above is called smoothing in Keener (2010) but

I would probably call it a law of iterated expectation. This is actually a very powerful
result that can simplify lots of calculations; Keener (2010) uses this a lot.
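Both identities are easy to check by simulation. In the sketch below, X is uniform on {0, 3} and Y | X = x ∼ N(x, 1), a toy model chosen so that E(Y | X) = X and V(Y | X) = 1; the sample size and seed are arbitrary.

```python
import random
import statistics

random.seed(1)
N = 200000
xs = [random.choice([0.0, 3.0]) for _ in range(N)]  # X uniform on {0, 3}
ys = [random.gauss(x, 1.0) for x in xs]             # Y | X = x  ~  N(x, 1)

# The formulas predict
#   E(Y) = E{E(Y|X)} = E(X) = 1.5
#   V(Y) = V{E(Y|X)} + E{V(Y|X)} = V(X) + 1 = 2.25 + 1 = 3.25
assert abs(statistics.fmean(ys) - 1.5) < 0.05
assert abs(statistics.pvariance(ys) - 3.25) < 0.1
```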
As a final word about conditional distributions, it is worth mentioning that conditional
distributions are particularly useful in the specification of complex models. Indeed, it
can be difficult to specify a meaningful joint distribution for a collection of random
variables in a given application. However, it is often possible to write down a series of
conditional distributions that, together, specify a meaningful joint distribution. That is,
we can simplify the modeling step by working with several lower-dimensional conditional
distributions. This is particularly useful for specifying prior distributions for unknown
parameters in a Bayesian analysis; we will discuss this more later.

3.4 Parametric families of distributions


In most of what we do in this course, here in particular, we will ignore the underlying
probability space and work just with probability measures on the X-space. As discussed in
Section 2.1, in a statistical problem, there is not just one probability measure in question,
but a whole family of measures Pθ indexed³ by a parameter θ ∈ Θ. You're already familiar
with this setup; X1 , . . . , Xn iid N(θ, 1) is one common example. A very important and
broad class of distributions is the exponential family. That is, for a given dominating
measure µ, an exponential family has density function (Radon–Nikodym derivative with
respect to µ) of the form

    pθ(x) = e^{⟨η(θ), T(x)⟩ + A(θ)} h(x),

where η(θ), T(x), A(θ), and h(x) are some functions, and ⟨·, ·⟩ is the Euclidean inner
product. You should be familiar with these distributions from an intermediate statistics
course, such as Stat 411 at UIC. We will discuss exponential families in detail later.
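As a concrete instance, the N(θ, 1) family fits this form with η(θ) = θ, T(x) = x, A(θ) = −θ²/2, and h(x) the N(0, 1) density (using the sign convention for A(θ) displayed above). A quick numerical check:

```python
import math

# N(θ, 1) in exponential-family form: p_θ(x) = exp{<η(θ), T(x)> + A(θ)} h(x).
eta = lambda t: t                     # natural parameter
T = lambda x: x                       # sufficient statistic
A = lambda t: -t * t / 2
h = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

p = lambda t, x: math.exp(eta(t) * T(x) + A(t)) * h(x)
norm_pdf = lambda t, x: math.exp(-((x - t) ** 2) / 2) / math.sqrt(2 * math.pi)

# exp(θx - θ²/2 - x²/2) = exp(-(x - θ)²/2), so the two agree everywhere.
for t, x in [(0.0, 0.3), (1.5, -0.7), (-2.0, 2.0)]:
    assert abs(p(t, x) - norm_pdf(t, x)) < 1e-12
```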
In this section we will consider a special family of probability measures which are
characterized by a “base measure” and a suitable collection of transformations. We
begin with an important special case.
Example 6. Let P0 be a probability measure with symmetric density p0 with respect to
Lebesgue measure on R. Symmetry implies that the median is 0; if the expected value
exists, then it equals 0 too. For X ∼ P0, define X′ = X + θ for some real number θ. Then the distribution of X′ is

    Pθ(A) = P0(X + θ ∈ A) = P0(A − θ);

i.e., the distribution function is Pθ(x) = P0(x − θ). Doing this for all θ generates the
family {Pθ : θ ∈ R}. The normal family N(θ, 1) is a special case.
The family of distributions in Example 6 is generated by a single distribution, centered at 0, and a collection of "location shifts." There are four key properties of these
location shifts: first, shifting by zero doesn’t change anything; second, the result of any
two consecutive location shifts can be achieved by a single location shift; third, the order
in which location shifts are made is irrelevant; fourth, for any given location, there is a
shift that takes the location back to 0. It turns out that these four properties characterize what's called a group of transformations.
³Note that the subscript in Pθ serves a different purpose than the subscript PX described in Section 3.2.

Definition 2. A set of transformations G on X is a group if (i) there exists e ∈ G such that e(x) = x for all x ∈ X; (ii) for any two g1, g2 ∈ G, the composition g1g2 also belongs to G; (iii) for any g1, g2, g3 ∈ G, (g1g2)g3 = g1(g2g3); and (iv) for any g ∈ G, there exists g⁻¹ ∈ G such that gg⁻¹ = e, i.e., gg⁻¹x = g⁻¹gx = x.
To generalize the location shift example, start with a fixed probability measure P on
(X, A). Now introduce a group G of transformations on X, and take Pe = P; here the
subscript “e” refers to the group identity e. Then define the family {Pg : g ∈ G } as
    Pg(A) = Pe(g⁻¹A),   A ∈ A.

That is, Pg(A) is the probability, under X ∼ Pe, that gX lands in A. In the case where Pe has a density pe with respect to Lebesgue measure, we have

    pg(x) = pe(g⁻¹x) |d(g⁻¹x)/dx|,

which is just the usual change-of-variable formula from introductory probability.
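For instance, the scale group g_σ(x) = σx applied to a standard normal base measure generates the N(0, σ²) family. A quick check of the change-of-variable formula; the base density is an illustrative choice.

```python
import math

p_e = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)  # base: N(0, 1)

# Scale group g_σ(x) = σx, so g⁻¹x = x/σ and |d(g⁻¹x)/dx| = 1/σ.
def p_g(sigma, x):
    return p_e(x / sigma) / sigma

# The generated family should be N(0, σ²); compare with that density directly.
norm_pdf = lambda s, x: math.exp(-x * x / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
for s, x in [(2.0, 1.0), (0.5, -0.3), (3.0, 4.0)]:
    assert abs(p_g(s, x) - norm_pdf(s, x)) < 1e-12
```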

3.5 Convex functions


There is a special property that functions can have which we will occasionally take
advantage of later on. This property is called convexity. Throughout this section, unless
otherwise stated, take f (x) to be a real-valued function defined over a p-dimensional
Euclidean space X. The function f is said to be convex on X if, for any x, y ∈ X and any
α ∈ [0, 1], the following inequality holds:
f (αx + (1 − α)y) ≤ αf (x) + (1 − α)f (y).
For the case p = 1, this property is easy to visualize. Examples of convex (univariate) functions include e^x, −log x (on x > 0), and x^r for r > 1 (on x ≥ 0).
In the case where f is twice differentiable, there is an alternative characterization of
convexity. This is something that’s covered in most intermediate calculus courses.
Proposition 1. A twice-differentiable function f, defined on p-dimensional space, is convex if and only if the matrix of second derivatives,

∇²f(x) = ( ∂²f(x)/∂xi∂xj )_{i,j=1,...,p},

is positive semi-definite for each x.
Convexity is important in optimization problems (maximum likelihood, least squares, etc.) because it relates to the existence and uniqueness of global minima. For example, if the criterion (loss) function to be minimized is convex and a local minimum exists, then convexity guarantees that it's a global minimum.
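For intuition, Proposition 1 can be checked numerically on a specific function; the example f(x, y) = e^x + (x − y)², the finite-difference Hessian, and the 2×2 positive semi-definiteness test below are our own illustrative choices, not part of the notes.

```python
# Hedged sketch: checking Proposition 1 numerically for the convex function
# f(x, y) = exp(x) + (x - y)^2 on R^2, whose Hessian is [[e^x + 2, -2], [-2, 2]].
import math

def f(x, y):
    return math.exp(x) + (x - y) ** 2

def hessian(fun, x, y, h=1e-4):
    # matrix of second derivatives via central finite differences
    fxx = (fun(x + h, y) - 2 * fun(x, y) + fun(x - h, y)) / h**2
    fyy = (fun(x, y + h) - 2 * fun(x, y) + fun(x, y - h)) / h**2
    fxy = (fun(x + h, y + h) - fun(x + h, y - h)
           - fun(x - h, y + h) + fun(x - h, y - h)) / (4 * h**2)
    return fxx, fxy, fyy

def is_psd_2x2(a, b, d):
    # [[a, b], [b, d]] is positive semi-definite iff a, d >= 0 and ad - b^2 >= 0
    tol = 1e-6
    return a >= -tol and d >= -tol and a * d - b * b >= -tol

# the Hessian determinant is 2 e^x > 0, so the matrix is PSD at every point
for x, y in [(0.0, 0.0), (1.0, -2.0), (-3.0, 0.5)]:
    assert is_psd_2x2(*hessian(f, x, y))
```

For general p the same check would use the leading principal minors or the eigenvalues of the p × p Hessian.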
We will occasionally make use of the following result regarding expectations of convex
functions of random variables.
Theorem 6 (Jensen's inequality). Suppose ϕ is a convex function on an open interval X ⊆ R, and X is a random variable taking values in X. Then

ϕ[E(X)] ≤ E[ϕ(X)].

If ϕ is strictly convex, then equality holds if and only if X is constant.

Proof. First, take x0 to be any fixed point in X. By convexity, there exists a linear function ℓ(x) = c(x − x0) + ϕ(x0), through the point (x0, ϕ(x0)), such that ℓ(x) ≤ ϕ(x) for all x; this is a supporting line of ϕ at x0. To prove the claim, take x0 = E(X), and note that

ϕ(X) ≥ c[X − E(X)] + ϕ[E(X)].

Taking expectations on both sides gives the result.
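The inequality is easy to see empirically; the choices ϕ(x) = e^x and X ~ Unif(0, 1) in this Monte Carlo sketch are ours, for demonstration only.

```python
# Illustrative Monte Carlo check of Jensen's inequality with phi(x) = exp(x)
# and X ~ Uniform(0, 1); both choices are assumptions for demonstration.
import math
import random

random.seed(1)
xs = [random.random() for _ in range(200_000)]   # X ~ Uniform(0, 1)

lhs = math.exp(sum(xs) / len(xs))                # phi(E X), approx exp(1/2)
rhs = sum(math.exp(x) for x in xs) / len(xs)     # E phi(X), exactly e - 1

# Jensen: phi(E X) <= E phi(X); strict here since phi is strictly convex
# and X is not constant
assert lhs < rhs
```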


Jensen’s inequality can be used to confirm: E(1/X) > 1/E(X), E(X 2 ) ≥ E(X)2 , and
E[log X] < log E(X). An interesting consequence is the following.
Example 7 (Kullback–Leibler divergence). Let f and g be two probability density functions dominated by a σ-finite measure µ. The Kullback–Leibler divergence of g from f is defined as

Ef{log[f(X)/g(X)]} = ∫ log(f/g) f dµ.
It follows from Jensen's inequality that

Ef{log[f(X)/g(X)]} = −Ef{log[g(X)/f(X)]}
                   ≥ −log Ef[g(X)/f(X)]
                   = −log ∫ (g/f) f dµ = −log 1 = 0.
That is, the Kullback–Leibler divergence is non-negative for all f and g. Moreover, it equals zero if and only if f = g (µ-almost everywhere). Therefore, the Kullback–Leibler divergence acts like a distance measure between two density functions. While it's not a metric in a mathematical sense4, it has a lot of statistical applications.
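As a numerical illustration (our own sketch, not from the notes), the divergence between two normal densities can be computed by quadrature and compared against the well-known closed form for normals.

```python
# Sketch: Kullback-Leibler divergence between two normal densities, computed
# by a midpoint Riemann sum and checked against the closed form for normals.
import math

def dnorm(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def kl_numeric(mu1, s1, mu2, s2, lo=-20.0, hi=20.0, n=100_000):
    # approximates int log(f/g) f dmu with f = N(mu1, s1^2), g = N(mu2, s2^2)
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        fx, gx = dnorm(x, mu1, s1), dnorm(x, mu2, s2)
        total += math.log(fx / gx) * fx * h
    return total

def kl_exact(mu1, s1, mu2, s2):
    # standard closed form: log(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 s2^2) - 1/2
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

approx = kl_numeric(0.0, 1.0, 1.0, 2.0)
assert approx >= 0.0                                      # non-negativity
assert abs(approx - kl_exact(0.0, 1.0, 1.0, 2.0)) < 1e-3  # matches closed form
assert abs(kl_numeric(0.0, 1.0, 0.0, 1.0)) < 1e-8         # zero when f = g
```

Note that the divergence is not symmetric: swapping the two normals above gives a different value, consistent with the footnote.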
Example 8 (Another proof of Cauchy–Schwarz). Recall that f and g are µ-measurable functions. If ∫ g² dµ is infinite, then there is nothing to prove, so suppose otherwise. Then p = g²/∫ g² dµ is a probability density on X. Moreover,

( ∫ f g dµ / ∫ g² dµ )² = ( ∫ (f/g) p dµ )² ≤ ∫ (f/g)² p dµ = ∫ f² dµ / ∫ g² dµ,

where the inequality follows from Theorem 6. Rearranging terms, one gets

( ∫ f g dµ )² ≤ ∫ f² dµ · ∫ g² dµ,

which is the desired result.
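The inequality is easy to check numerically; in this sketch we take µ to be Lebesgue measure on [0, 1], and the particular f and g are arbitrary choices of ours.

```python
# Numerical illustration of the Cauchy-Schwarz inequality with mu = Lebesgue
# measure on [0, 1]; the particular f and g are arbitrary illustrative choices.
import math

n = 100_000
h = 1.0 / n
xs = [(i + 0.5) * h for i in range(n)]

f = [math.sin(3 * x) + 2 for x in xs]
g = [math.exp(-x) for x in xs]

int_fg = sum(a * b for a, b in zip(f, g)) * h   # int f g dmu
int_ff = sum(a * a for a in f) * h              # int f^2 dmu
int_gg = sum(b * b for b in g) * h              # int g^2 dmu

# (int f g dmu)^2 <= int f^2 dmu * int g^2 dmu
assert int_fg ** 2 <= int_ff * int_gg
```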


We will find some more applications of convexity in the decision-theoretic context. In
particular, when the loss function is convex, it will follow that randomized decision rules
can be ignored in a specific sense.
4 It's not symmetric and does not satisfy the triangle inequality.

4 Exercises
1. Recall the standard 100(1 − α)% confidence interval for a normal mean θ with
known variance σ 2 = 1, i.e., X̄ ± z1−α/2 σn−1/2 , where Φ(z1−α/2 ) = 1 − α/2.

(a) When we say that the coverage probability is 1 − α, what do we mean?


(b) How does the confidence interval’s coverage probability relate to the true θ?

2. A fundamental concept in frequentist statistical theory is sampling distributions.


For an observable sample X1 , . . . , Xn from a distribution depending on some pa-
rameter θ, let T = T (X1 , . . . , Xn ) be some statistic.

(a) What do we mean by the sampling distribution of T ?


(b) Give a rough explanation of how the sampling distribution is used for statistical
inference. Think about the “embedding” idea described in Section 2.2.

3. Keener, Problem 1.1, page 17.


4. Show that if A1, A2, . . . are members of a σ-algebra A, then so is ∩_{i=1}^∞ Ai.

5. Keener, Problem 1.6, page 17.

6. For A1, A2, . . . ∈ A, define

   lim sup An = ∩_{n=1}^∞ ∪_{m=n}^∞ Am = {x : x is in An for infinitely many n}.

   Show that lim sup An is also in A.

7. Prove the Borel–Cantelli Lemma: If µ is a finite measure (i.e., µ(X) < ∞) and ∑_{n=1}^∞ µ(An) < ∞, then µ(lim sup An) = 0.

8. Keener, Problem 1.8, page 17.

9. Prove that if f and g are measurable functions, then so are f + g and f ∨ g = max{f, g}. [Hint: For proving f + g is measurable, note that if f(x) < a − g(x), then there is a rational number r that sits between f(x) and a − g(x).]
R R
10. Show that if f is µ-integrable, then | f dµ| ≤ |f | dµ. [Hint: Write |f | in terms
of f + and f − .]

11. (a) Use Fubini's theorem to show that, for a non-negative random variable X with distribution function F, we have E(X) = ∫_0^∞ {1 − F(x)} dx.

    (b) Use this result to derive the mean of an exponential distribution with scale parameter θ.

12. Keener, Problem 1.26, page 21.

13. Keener, Problems 1.36 and 1.37, page 22.

14. Keener, Problem 1.40, page 23.

15. Let X ∼ N(µ, σ 2 ) and, given X = x, Y ∼ N(x, τ 2 ). Find the conditional distribution
of X, given Y = y.

16. Show that if X is distributed according to a scale family, then Y = log X is dis-
tributed according to a location family.

17. Suppose that X is a positive random variable, and consider the family of distribu-
tions generated by X and transformations {gb,c } given by

X′ = gb,c(X) = b X^{1/c}, b > 0, c > 0.

(a) Show that this defines a group family.


(b) Show that if X has a unit-rate exponential distribution, then the family generated by {gb,c} is the Weibull family, with density

    (c/b)(x/b)^{c−1} exp{−(x/b)^c}, x > 0.
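Part (b) can be sanity-checked by simulation before attempting the proof; this is a sketch under our own choices of b and c, not a solution. Draws of bX^{1/c} with X unit-rate exponential are compared with the Weibull distribution function F(x) = 1 − exp{−(x/b)^c} obtained by integrating the density above.

```python
# Simulation sanity check for Exercise 17(b); b and c are arbitrary choices.
# If X ~ Exp(1), then b * X**(1/c) should follow the Weibull distribution
# with CDF F(x) = 1 - exp{-(x/b)^c}.
import math
import random

random.seed(2)
b, c = 2.0, 1.5
draws = [b * random.expovariate(1.0) ** (1 / c) for _ in range(100_000)]

for x in (0.5, 1.0, 2.0, 4.0):
    emp = sum(d <= x for d in draws) / len(draws)   # empirical CDF at x
    cdf = 1 - math.exp(-((x / b) ** c))             # Weibull CDF at x
    assert abs(emp - cdf) < 0.01
```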

18. Keener, Problem 1.17, page 19. [Hint: Consider the probability generating function g(t) = E(t^X) and, in particular, g(1) and g(−1).]

19. Suppose ϕ is convex on (a, b) and ψ is convex and nondecreasing on the range of
ϕ. Prove that ψ ◦ ϕ is convex on (a, b).

20. Suppose X takes values x1 , . . . , xn with probabilities p1 , . . . , pn . Use the definition


of convexity (of ϕ) and induction to prove Jensen’s inequality, ϕ[E(X)] ≤ E[ϕ(X)].

References
Dempster, A. P. (2008), “Dempster–Shafer calculus for statisticians,” Internat. J. of
Approx. Reason., 48, 265–277.

Fraser, D. A. S. (1968), The Structure of Inference, New York: John Wiley & Sons Inc.

Hannig, J. (2009), “On generalized fiducial inference,” Statist. Sinica, 19, 491–544.

Keener, R. W. (2010), Theoretical Statistics, Springer Texts in Statistics, New York:


Springer.

Lehmann, E. L. and Casella, G. (1998), Theory of Point Estimation, Springer Texts in


Statistics, New York: Springer-Verlag, 2nd ed.

Martin, R. and Liu, C. (2012), “Inferential models: A framework for prior-free posterior
probabilistic inference,” J. Amer. Statist. Assoc., to appear, arXiv:1206.4091.

Shafer, G. (1976), A Mathematical Theory of Evidence, Princeton, N.J.: Princeton Uni-


versity Press.

Zabell, S. L. (1992), “R. A. Fisher and the fiducial argument,” Statist. Sci., 7, 369–387.
