
Olivier Cappé, Eric Moulines and Tobias Rydén

Inference in Hidden Markov Models


May 20, 2011

Springer
Berlin Heidelberg New York Hong Kong London Milan Paris Tokyo

Preface

Hidden Markov models, most often abbreviated to the acronym HMMs, are one of the most successful statistical modelling ideas that have come up in the last forty years: the use of hidden (or unobservable) states makes the model generic enough to handle a variety of complex real-world time series, while the relatively simple prior dependence structure (the Markov bit) still allows for the use of efficient computational procedures. Our goal with this book is to present a reasonably complete picture of statistical inference for HMMs, from the simplest finite-valued models, which were already studied in the 1960s, to recent topics like computational aspects of models with continuous state space, asymptotics of maximum likelihood, Bayesian computation and model selection, and all this illustrated with relevant running examples. We want to stress at this point that by using the term hidden Markov model we do not limit ourselves to models with finite state space (for the hidden Markov chain), but also include models with continuous state space; such models are often referred to as state-space models in the literature.

We build on the considerable developments that have taken place during the past ten years, both at the foundational level (asymptotics of maximum likelihood estimates, order estimation, etc.) and at the computational level (variable dimension simulation, simulation-based optimization, etc.), to present an up-to-date picture of the field that is self-contained from a theoretical point of view and self-sufficient from a methodological point of view. We therefore expect that the book will appeal to academic researchers in the field of HMMs, in particular PhD students working on related topics, by summing up the results obtained so far and presenting some new ideas. We hope that it will similarly interest practitioners and researchers from other fields by leading them through the computational steps required for making inference in HMMs and/or providing them with the relevant underlying statistical theory.

The book starts with an introductory chapter which explains, in simple terms, what an HMM is, and it contains many examples of the use of HMMs in fields ranging from biology to telecommunications and finance. This chapter also describes various extensions of HMMs, like models with autoregression or hierarchical HMMs.


Chapter 2 defines some basic concepts like transition kernels and Markov chains. The remainder of the book is divided into three parts: State Inference, Parameter Inference, and Background and Complements; there are also three appendices.

Part I of the book covers inference for the unobserved state process. We start in Chapter 3 by defining smoothing, filtering and predictive distributions and describe the forward-backward decomposition and the corresponding recursions. We do this in a general framework with no assumption on finiteness of the hidden state space. The special cases of HMMs with finite state space and Gaussian linear state-space models are detailed in Chapter 5. Chapter 3 also introduces the idea that the conditional distribution of the hidden Markov chain, given the observations, is Markov too, although non-homogeneous, for both ordinary and time-reversed index orderings. As a result, two alternative algorithms for smoothing are obtained. A major theme of Part I is simulation-based methods for state inference; Chapter 6 is a brief introduction to Monte Carlo simulation, and to Markov chain Monte Carlo and its applications to HMMs in particular, while Chapters 7 and 8 describe, starting from scratch, so-called sequential Monte Carlo (SMC) methods for approximating filtering and smoothing distributions in HMMs with continuous state space. Chapter 9 is devoted to asymptotic analysis of SMC algorithms. More specialized topics of Part I include recursive computation of expectations of functions with respect to smoothed distributions of the hidden chain (Section 4.1), SMC approximations of such expectations (Section 8.3) and mixing properties of the conditional distribution of the hidden chain (Section 4.3). Variants of the basic HMM structure like models with autoregression and hierarchical HMMs are considered in Sections 4.2, 6.3.2 and 8.2.

Part II of the book deals with inference for model parameters, mostly from the maximum likelihood and Bayesian points of view. Chapter 10 describes the expectation-maximization (EM) algorithm in detail, as well as its implementation for HMMs with finite state space and Gaussian linear state-space models. This chapter also discusses likelihood maximization using gradient-based optimization routines. HMMs with continuous state space do not generally admit exact implementation of EM, but require simulation-based methods. Chapter 11 covers various Monte Carlo algorithms like Monte Carlo EM, stochastic gradient algorithms and stochastic approximation EM. In addition to providing the algorithms and illustrative examples, it also contains an in-depth analysis of their convergence properties. Chapter 12 gives an overview of the framework for asymptotic analysis of the maximum likelihood estimator, with some applications like asymptotics of likelihood-based tests. Chapter 13 is about Bayesian inference for HMMs, with the focus being on models with finite state space. It covers so-called reversible jump MCMC algorithms for choosing between models of different dimensionality, and contains detailed examples illustrating these as well as simpler algorithms. It also contains a section on multiple imputation algorithms for global maximization of the posterior density.


Part III of the book contains a chapter on discrete and general Markov chains, summarizing some of the most important concepts and results and applying them to HMMs. The other chapter of this part focuses on order estimation for HMMs with both finite state space and finite output alphabet; in particular it describes how concepts from information theory are useful for elaborating on this subject.

Various parts of the book require different amounts of, and also different kinds of, prior knowledge from the reader. Generally we assume familiarity with probability and statistical estimation at the levels of Feller (1971) and Bickel and Doksum (1977), respectively. Some prior knowledge of Markov chains (discrete and/or general) is very helpful, although Part III does contain a primer on the topic; this chapter should however be considered more a brush-up than a comprehensive treatise of the subject. A reader with that knowledge will be able to understand most parts of the book. Chapter 13 on Bayesian estimation features a brief introduction to the subject in general but, again, some previous experience with Bayesian statistics will undoubtedly be of great help. The more theoretical parts of the book (Section 4.3, Chapter 9, Sections 11.2-11.3, Chapter 12, Sections 14.2-14.3 and Chapter 15) require knowledge of probability theory at the measure-theoretic level for a full understanding, even though most of the results as such can be understood without it.

There is no need to read the book in linear order, from cover to cover. Indeed, this is probably the wrong way to read it! Rather we encourage the reader to first go through the more algorithmic parts of the book, to get an overall view of the subject, and then, if desired, later return to the theoretical parts for a fuller understanding. Readers with particular topics in mind may of course be even more selective. A reader interested in the EM algorithm, for instance, could start with Chapter 1, have a look at Chapter 2, and then proceed to Chapter 3 before reading about the EM algorithm in Chapter 10. Similarly a reader interested in simulation-based techniques could go to Chapter 6 directly, perhaps after reading some of the introductory parts, or even directly to Section 6.3 if he/she is already familiar with MCMC methods. Each of the two chapters entitled "Advanced Topics in ..." (Chapters 4 and 8) is really composed of three disconnected complements to Chapters 3 and 7, respectively. As such, the sections that compose Chapters 4 and 8 may be read independently of one another. Most chapters end with a section entitled "Complements" whose reading is not required for understanding other parts of the book; most often, this section contains bibliographical notes, although in some chapters (9 and 11 in particular) it also features elements needed to prove the results stated in the main text.

Even in a book of this size, it is impossible to include all aspects of hidden Markov models. We have focused on the use of HMMs to model long, potentially stationary, time series; we call such models ergodic HMMs. In other applications, for instance speech recognition or protein alignment, HMMs are used to represent short variable-length sequences; such models are often called left-to-right HMMs and are hardly mentioned in this book.


Having said that, we stress that the computational tools for both classes of HMMs are virtually the same. There are also a number of generalizations of HMMs which we do not consider. In Markov random fields, as used in image processing applications, the Markov chain is replaced by a graph of dependency which may be represented as a two-dimensional regular lattice. The numerical techniques that can be used for inference in hidden Markov random fields are similar to some of the methods studied in this book, but the statistical side is very different. Bayesian networks are even more general, since the dependency structure is allowed to take any form represented by a (directed or undirected) graph. We do not consider Bayesian networks in their generality, although some of the concepts developed in the Bayesian networks literature (the graph representation, the sum-product algorithm) are used. Continuous-time HMMs may also be seen as a further generalization of the models considered in this book. Some of these continuous-time HMMs, and in particular partially observed diffusion models used in mathematical finance, have recently received considerable attention. We decided, however, that this topic is outside the scope of the book; furthermore, the stochastic calculus tools needed for studying these continuous-time models are not appropriate for our purpose.

We acknowledge the help of Stéphane Boucheron, Randal Douc, Gersende Fort, Elisabeth Gassiat, Christian P. Robert, and Philippe Soulier, who participated in the writing of the text and contributed the two chapters that compose Part III (see next page for details of the contributions). We are also indebted to them for suggesting various forms of improvement in the notations, layout, etc., as well as helping us track typos and errors. We thank François Le Gland and Catherine Matias for participating in the early stages of this book project. We are grateful to Christophe Andrieu, Søren Asmussen, Arnaud Doucet, Hans Künsch, Steve Levinson, Ya'acov Ritov and Mike Titterington, who provided various helpful inputs and comments. Finally, we thank John Kimmel of Springer for his support and enduring patience.

Paris, France & Lund, Sweden
March 2005

Olivier Cappé
Eric Moulines
Tobias Rydén

Contributors

We are grateful to

Randal Douc, École Polytechnique
Christian P. Robert, CREST-INSEE & Université Paris-Dauphine

for their contributions to Chapters 9 (Randal) and 6, 7, and 13 (Christian), as well as for their help in proofreading these and other parts of the book.

Chapter 14 was written by

Gersende Fort, CNRS & LMC-IMAG
Philippe Soulier, Université Paris-Nanterre

with Eric Moulines.

Chapter 15 was written by

Stéphane Boucheron, Université Paris VII-Denis Diderot
Elisabeth Gassiat, Université d'Orsay, Paris-Sud

Contents

Preface  V
Contributors  IX

1 Introduction  1
  1.1 What Is a Hidden Markov Model?  1
  1.2 Beyond Hidden Markov Models  4
  1.3 Examples  6
    1.3.1 Finite Hidden Markov Models  6
    1.3.2 Normal Hidden Markov Models  13
    1.3.3 Gaussian Linear State-Space Models  15
    1.3.4 Conditionally Gaussian Linear State-Space Models  17
    1.3.5 General (Continuous) State-Space HMMs  24
    1.3.6 Switching Processes with Markov Regime  29
  1.4 Left-to-Right and Ergodic Hidden Markov Models  33

2 Main Definitions and Notations  35
  2.1 Markov Chains  35
    2.1.1 Transition Kernels  35
    2.1.2 Homogeneous Markov Chains  37
    2.1.3 Non-homogeneous Markov Chains  40
  2.2 Hidden Markov Models  42
    2.2.1 Definitions and Notations  42
    2.2.2 Conditional Independence in Hidden Markov Models  44
    2.2.3 Hierarchical Hidden Markov Models  46

Part I State Inference


3 Filtering and Smoothing Recursions  51
  3.1 Basic Notations and Definitions  53
    3.1.1 Likelihood  53
    3.1.2 Smoothing  54
    3.1.3 The Forward-Backward Decomposition  56
    3.1.4 Implicit Conditioning (Please Read This Section!)  58
  3.2 Forward-Backward  59
    3.2.1 The Forward-Backward Recursions  59
    3.2.2 Filtering and Normalized Recursion  61
  3.3 Markovian Decompositions  66
    3.3.1 Forward Decomposition  66
    3.3.2 Backward Decomposition  70
  3.4 Complements  74

4 Advanced Topics in Smoothing  77
  4.1 Recursive Computation of Smoothed Functionals  77
    4.1.1 Fixed Point Smoothing  78
    4.1.2 Recursive Smoothers for General Functionals  79
    4.1.3 Comparison with Forward-Backward Smoothing  82
  4.2 Filtering and Smoothing in More General Models  85
    4.2.1 Smoothing in Markov-switching Models  86
    4.2.2 Smoothing in Partially Observed Markov Chains  86
    4.2.3 Marginal Smoothing in Hierarchical HMMs  87
  4.3 Forgetting of the Initial Condition  89
    4.3.1 Total Variation  90
    4.3.2 Lipschitz Contraction for Transition Kernels  95
    4.3.3 The Doeblin Condition and Uniform Ergodicity  97
    4.3.4 Forgetting Properties  100
    4.3.5 Uniform Forgetting Under Strong Mixing Conditions  105
    4.3.6 Forgetting Under Alternative Conditions  110

5 Applications of Smoothing  121
  5.1 Models with Finite State Space  121
    5.1.1 Smoothing  122
    5.1.2 Maximum a Posteriori Sequence Estimation  125
  5.2 Gaussian Linear State-Space Models  127
    5.2.1 Filtering and Backward Markovian Smoothing  127
    5.2.2 Linear Prediction Interpretation  131
    5.2.3 The Prediction and Filtering Recursions Revisited  137
    5.2.4 Disturbance Smoothing  143
    5.2.5 The Backward Recursion and the Two-Filter Formula  148
    5.2.6 Application to Marginal Filtering and Smoothing in CGLSSMs  155


6 Monte Carlo Methods  161
  6.1 Basic Monte Carlo Methods  161
    6.1.1 Monte Carlo Integration  162
    6.1.2 Monte Carlo Simulation for HMM State Inference  163
  6.2 A Markov Chain Monte Carlo Primer  166
    6.2.1 The Accept-Reject Algorithm  166
    6.2.2 Markov Chain Monte Carlo  170
    6.2.3 Metropolis-Hastings  171
    6.2.4 Hybrid Algorithms  179
    6.2.5 Gibbs Sampling  180
    6.2.6 Stopping an MCMC Algorithm  185
  6.3 Applications to Hidden Markov Models  186
    6.3.1 Generic Sampling Strategies  186
    6.3.2 Gibbs Sampling in CGLSSMs  194

7 Sequential Monte Carlo Methods  209
  7.1 Importance Sampling and Resampling  210
    7.1.1 Importance Sampling  210
    7.1.2 Sampling Importance Resampling  211
  7.2 Sequential Importance Sampling  214
    7.2.1 Sequential Implementation for HMMs  214
    7.2.2 Choice of the Instrumental Kernel  218
  7.3 Sequential Importance Sampling with Resampling  231
    7.3.1 Weight Degeneracy  231
    7.3.2 Resampling  236
  7.4 Complements  242
    7.4.1 Implementation of Multinomial Resampling  242
    7.4.2 Alternatives to Multinomial Resampling  244

8 Advanced Topics in Sequential Monte Carlo  251
  8.1 Alternatives to SISR  251
    8.1.1 I.I.D. Sampling  253
    8.1.2 Two-Stage Sampling  256
    8.1.3 Interpretation with Auxiliary Variables  260
    8.1.4 Auxiliary Accept-Reject Sampling  261
    8.1.5 Markov Chain Monte Carlo Auxiliary Sampling  263
  8.2 Sequential Monte Carlo in Hierarchical HMMs  264
    8.2.1 Sequential Importance Sampling and Global Sampling  265
    8.2.2 Optimal Sampling  267
    8.2.3 Application to CGLSSMs  274
  8.3 Particle Approximation of Smoothing Functionals  278


9 Analysis of Sequential Monte Carlo Methods  287
  9.1 Importance Sampling  287
    9.1.1 Unnormalized Importance Sampling  287
    9.1.2 Deviation Inequalities  291
    9.1.3 Self-normalized Importance Sampling Estimator  293
  9.2 Sampling Importance Resampling  295
    9.2.1 The Algorithm  295
    9.2.2 Definitions and Notations  297
    9.2.3 Weighting and Resampling  300
    9.2.4 Application to the Single-Stage SIR Algorithm  307
  9.3 Single-Step Analysis of SMC Methods  311
    9.3.1 Mutation Step  311
    9.3.2 Description of Algorithms  315
    9.3.3 Analysis of the Mutation/Selection Algorithm  319
    9.3.4 Analysis of the Selection/Mutation Algorithm  320
  9.4 Sequential Monte Carlo Methods  321
    9.4.1 SISR  321
    9.4.2 I.I.D. Sampling  324
  9.5 Complements  333
    9.5.1 Weak Limit Theorems for Triangular Arrays  333
    9.5.2 Bibliographic Notes  342

Part II Parameter Inference

10 Maximum Likelihood Inference, Part I: Optimization Through Exact Smoothing  347
  10.1 Likelihood Optimization in Incomplete Data Models  347
    10.1.1 Problem Statement and Notations  348
    10.1.2 The Expectation-Maximization Algorithm  349
    10.1.3 Gradient-based Methods  353
    10.1.4 Pros and Cons of Gradient-based Methods  358
  10.2 Application to HMMs  359
    10.2.1 Hidden Markov Models as Missing Data Models  359
    10.2.2 EM in HMMs  360
    10.2.3 Computing Derivatives  362
    10.2.4 Connection with the Sensitivity Equation Approach  364
  10.3 The Example of Normal Hidden Markov Models  367
    10.3.1 EM Parameter Update Formulas  367
    10.3.2 Estimation of the Initial Distribution  370
    10.3.3 Recursive Implementation of E-Step  371
    10.3.4 Computation of the Score and Observed Information  374
  10.4 The Example of Gaussian Linear State-Space Models  384
    10.4.1 The Intermediate Quantity of EM  385
    10.4.2 Recursive Implementation  387


  10.5 Complements  389
    10.5.1 Global Convergence of the EM Algorithm  389
    10.5.2 Rate of Convergence of EM  392
    10.5.3 Generalized EM Algorithms  393
    10.5.4 Bibliographic Notes  394

11 Maximum Likelihood Inference, Part II: Monte Carlo Optimization  397
  11.1 Methods and Algorithms  398
    11.1.1 Monte Carlo EM  398
    11.1.2 Simulation Schedules  403
    11.1.3 Gradient-based Algorithms  408
    11.1.4 Interlude: Stochastic Approximation and the Robbins-Monro Approach  411
    11.1.5 Stochastic Gradient Algorithms  412
    11.1.6 Stochastic Approximation EM  414
    11.1.7 Stochastic EM  416
  11.2 Analysis of the MCEM Algorithm  419
    11.2.1 Convergence of Perturbed Dynamical Systems  420
    11.2.2 Convergence of the MCEM Algorithm  423
    11.2.3 Rate of Convergence of MCEM  426
  11.3 Analysis of Stochastic Approximation Algorithms  429
    11.3.1 Basic Results for Stochastic Approximation Algorithms  429
    11.3.2 Convergence of the Stochastic Gradient Algorithm  431
    11.3.3 Rate of Convergence of the Stochastic Gradient Algorithm  432
    11.3.4 Convergence of the SAEM Algorithm  433
  11.4 Complements  435

12 Statistical Properties of the Maximum Likelihood Estimator  441
  12.1 A Primer on MLE Asymptotics  442
  12.2 Stationary Approximations  443
  12.3 Consistency  446
    12.3.1 Construction of the Stationary Conditional Log-likelihood  446
    12.3.2 The Contrast Function and Its Properties  448
  12.4 Identifiability  450
    12.4.1 Equivalence of Parameters  451
    12.4.2 Identifiability of Mixture Densities  454
    12.4.3 Application of Mixture Identifiability to Hidden Markov Models  455
  12.5 Asymptotic Normality of the Score and Convergence of the Observed Information  457


    12.5.1 The Score Function and Invoking the Fisher Identity  457
    12.5.2 Construction of the Stationary Conditional Score  459
    12.5.3 Weak Convergence of the Normalized Score  464
    12.5.4 Convergence of the Normalized Observed Information  465
    12.5.5 Asymptotics of the Maximum Likelihood Estimator  465
  12.6 Applications to Likelihood-based Tests  466
  12.7 Complements  468

13 Fully Bayesian Approaches  471
  13.1 Parameter Estimation  471
    13.1.1 Bayesian Inference  471
    13.1.2 Prior Distributions for HMMs  475
    13.1.3 Non-identifiability and Label Switching  478
    13.1.4 MCMC Methods for Bayesian Inference  481
  13.2 Reversible Jump Methods  488
    13.2.1 Variable Dimension Models  488
    13.2.2 Green's Reversible Jump Algorithm  490
    13.2.3 Alternative Sampler Designs  498
    13.2.4 Alternatives to Reversible Jump MCMC  500
  13.3 Multiple Imputations Methods and Maximum a Posteriori  501
    13.3.1 Simulated Annealing  502
    13.3.2 The SAME Algorithm  503

Part III Background and Complements

14 Elements of Markov Chain Theory  513
  14.1 Chains on Countable State Spaces  513
    14.1.1 Irreducibility  513
    14.1.2 Recurrence and Transience  514
    14.1.3 Invariant Measures and Stationarity  517
    14.1.4 Ergodicity  519
  14.2 Chains on General State Spaces  520
    14.2.1 Irreducibility  521
    14.2.2 Recurrence and Transience  523
    14.2.3 Invariant Measures and Stationarity  534
    14.2.4 Ergodicity  541
    14.2.5 Geometric Ergodicity and Foster-Lyapunov Conditions  548
    14.2.6 Limit Theorems  552
  14.3 Applications to Hidden Markov Models  556
    14.3.1 Phi-irreducibility  557
    14.3.2 Atoms and Small Sets  558
    14.3.3 Recurrence and Positive Recurrence  560


15 An Information-Theoretic Perspective on Order Estimation  565
  15.1 Model Order Identification: What Is It About?  566
  15.2 Order Estimation in Perspective  567
  15.3 Order Estimation and Composite Hypothesis Testing  569
  15.4 Code-based Identification  571
    15.4.1 Definitions  571
    15.4.2 Information Divergence Rates  574
  15.5 MDL Order Estimators in Bayesian Settings  576
  15.6 Strongly Consistent Penalized Maximum Likelihood Estimators for HMM Order Estimation  577
  15.7 Efficiency Issues  580
    15.7.1 Variations on Stein's Lemma  581
    15.7.2 Achieving Optimal Error Exponents  584
  15.8 Consistency of the BIC Estimator in the Markov Order Estimation Problem  587
    15.8.1 Some Martingale Tools  589
    15.8.2 The Martingale Approach  591
    15.8.3 The Union Bound Meets Martingale Inequalities  592
  15.9 Complements  600

Part IV Appendices

A Conditioning  605
  A.1 Probability and Topology Terminology and Notation  605
  A.2 Conditional Expectation  606
  A.3 Conditional Distribution  611
  A.4 Conditional Independence  614

B Linear Prediction  617
  B.1 Hilbert Spaces  617
  B.2 The Projection Theorem  619

C Notations  621
  C.1 Mathematical  621
  C.2 Probability  622
  C.3 Hidden Markov Models  622
  C.4 Sequential Monte Carlo  624

References  625
Index  645

1 Introduction

1.1 What Is a Hidden Markov Model?


A hidden Markov model (abbreviated HMM) is, loosely speaking, a Markov chain observed in noise. Indeed, the model comprises a Markov chain, which we will denote by {Xk}k≥0, where k is an integer index. This Markov chain is often assumed to take values in a finite set, but we will not make this restriction in general, thus allowing for a quite arbitrary state space. Now, the Markov chain is hidden, that is, it is not observable. What is available to the observer is another stochastic process {Yk}k≥0, linked to the Markov chain in that Xk governs the distribution of the corresponding Yk. For instance, Yk may have a normal distribution, the mean and variance of which are determined by Xk, or Yk may have a Poisson distribution whose mean is determined by Xk. The underlying Markov chain {Xk} is sometimes called the regime, or state. All statistical inference, even on the Markov chain itself, has to be done in terms of {Yk} only, as {Xk} is not observed. There is also a further assumption on the relation between the Markov chain and the observable process, saying that Xk must be the only variable of the Markov chain that affects the distribution of Yk. This is expressed more precisely in the following formal definition.

A hidden Markov model is a bivariate discrete time process {Xk, Yk}k≥0, where {Xk} is a Markov chain and, conditional on {Xk}, {Yk} is a sequence of independent random variables such that the conditional distribution of Yk only depends on Xk. We will denote the state space of the Markov chain {Xk} by X and the set in which {Yk} takes its values by Y.

The dependence structure of an HMM can be represented by a graphical model as in Figure 1.1. Representations of this sort use a directed graph without loops to describe dependence structures among random variables. The nodes (circles) in the graph correspond to the random variables, and the edges (arrows) represent the structure of the joint probability distribution, with the interpretation that the latter may be factored as a product of the conditional distributions of each node given its parent nodes (those that are directly connected to it by an arrow).




Fig. 1.1. Graphical representation of the dependence structure of a hidden Markov model, where {Yk} is the observable process and {Xk} is the hidden chain.

Figure 1.1 thus implies that the distribution of a variable Xk+1 conditional on the history of the process, X0, . . . , Xk, is determined by the value taken by the preceding one, Xk; this is called the Markov property. Likewise, the distribution of Yk conditionally on the past observations Y0, . . . , Yk−1 and the past values of the state, X0, . . . , Xk, is determined by Xk only (this is exactly the definition we made above). We shall not go into details about graphical models, but just sometimes use them as an intuitive means of illustrating various kinds of dependence. The interested reader is referred to, for example, Jensen (1996) or Jordan (2004) for introductory texts and to Lauritzen (1996), Cowell et al. (1999), or Jordan (1999) for in-depth coverage.

Throughout the book, we will assume that each HMM is homogeneous, by which we mean that the Markov chain {Xk} is homogeneous (its transition kernel does not depend on the time index k), and that the conditional law of Yk given Xk does not depend on k either. In order to keep this introductory discussion simple, we do not embark on precise mathematical definitions of Markov chain concepts such as transition kernels for instance. The formalization of several of the ideas that are first reviewed on intuitive grounds here will be the topic of the first part of the book (Section 2.1).

As mentioned above, of the two processes {Xk} and {Yk}, only {Yk} is actually observed, whence inference on the parameters of the model must be achieved using {Yk} only. The other topic of interest is of course inference on the unobserved {Xk}: given a model and some observations, can we estimate the unobservable sequence of states? As we shall see later in the book, these two major statistical objectives are indeed strongly connected. Models that comprise unobserved random variables, as HMMs do, are called latent variable models, missing data models, or also models with incomplete data, where the latent variable refers to the unobservable random quantities.

Let us already at this point give a simple and illustrative example of an HMM. Suppose that {Xk} is a Markov chain with state space {0, 1} and that Yk, conditional on Xk = i, has a Gaussian N(μi, σi²) distribution.


In other words, the value of the regime governs the mean and variance of the Gaussian distribution from which we then draw the output. This model illustrates a common feature of HMMs considered in this book, namely that the conditional distributions of Yk given Xk all belong to a single parametric family, with parameters indexed by Xk. In this case, it is the Gaussian family of distributions, but one may of course also consider the Gamma family, the Poisson family, etc. A meaningful observation, in the current example, is that the marginal distribution of {Yk} is that of a mixture of two Gaussian distributions. Hence we may also view HMMs as an extension of independent mixture models, including some degree of dependence between observations. Indeed, even though the Y-variables are conditionally independent given {Xk}, {Yk} is not an independent sequence because of the dependence in {Xk}. In fact, {Yk} is not a Markov chain either: the joint process {Xk, Yk} is of course a Markov chain, but the observable process {Yk} does not have the loss of memory property of Markov chains, in the sense that the conditional distribution of Yk given Y0, . . . , Yk−1 generally depends on all the conditioning variables. As we shall see in Chapter 2, however, the dependence in the sequence {Yk} (defined in a suitable sense) is not stronger than that in {Xk}. This is a general observation that is valid not only for the current example.

Another view is to consider HMMs as an extension of Markov chains, in which the observation {Yk} of the state {Xk} is distorted or blurred in some manner that includes some additional, independent randomness. In the previous example, the distortion is simply caused by additive Gaussian noise, as we may write this model as Yk = μ_{Xk} + σ_{Xk} Vk, where {Vk}k≥0 is an i.i.d. (independent and identically distributed) sequence of standard Gaussian random variables. We could even proceed one step further by deriving a similar functional representation for the unobservable sequence of states. More precisely, if {Uk}k≥0 denotes an i.i.d. sequence of uniform random variables on the interval [0, 1], we can define recursively X1, X2, . . . by the equation Xk+1 = 1(Uk ≤ p_{Xk}), where p0 and p1 are defined respectively by pi = P(Xk+1 = 1 | Xk = i) (for i = 0 and 1). Such a representation of a Markov chain is usually referred to as a stochastically recursive sequence (sometimes abbreviated to SRS) (Borovkov, 1998). An alternative view consists in regarding 1(Uk ≤ p·) as a random function (here on {0, 1}), hence the name iterated random functions also used to refer to the above representation of a Markov chain (Diaconis and Freedman, 1999). Our simple example is by no means a singular case and, in great generality, any HMM may be equivalently defined through a functional representation known as a (general) state-space model,

Xk+1 = a(Xk, Uk) ,    (1.1)
Yk = b(Xk, Vk) ,      (1.2)

where {Uk}k≥0 and {Vk}k≥0 are mutually independent i.i.d. sequences of random variables that are independent of X0, and a and b are measurable functions.


The first equation is known as the state or dynamic equation, whereas the second one is the observation equation. These two equations correspond to a recursive, generative form of the model, as opposed to our initial exposition, which focused on the specification of the joint probability distribution of the variables. Which view is most natural and fruitful typically depends on what the HMM is intended to model and for what purpose it is used (see the examples section below). In the time series literature, the term state-space model is usually reserved for models in which a and b are linear functions and the sequences {Uk}, {Vk}, and X0 are jointly Gaussian (Anderson and Moore, 1979; Brockwell and Davis, 1991; Kailath et al., 2000). In this book, we reverse the perspective and refer to the family of models defined by (1.1) as (general) state-space models. The linear Gaussian sub-family of models will be covered in some detail, notably in Chapter 5, but is clearly not the main focus of this book. Similarly, in the classical HMM literature like the tutorial by Rabiner (1989) or the books by Elliott et al. (1995) and MacDonald and Zucchini (1997), it is tacitly assumed that the denomination hidden Markov model implies a finite state space X. This is a very important case indeed, but in this book we will treat more general state spaces as well. In our view, the terms hidden Markov model and state-space model refer to the same type of objects, although we will reserve the latter for describing the functional representation of the model given by (1.1).
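As a concrete illustration of the functional representation (1.1)-(1.2), here is a minimal simulation sketch of the two-state Gaussian HMM discussed above. It is written in Python with NumPy (a tool the book itself does not use), and all numerical parameter values are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (not taken from the book)
p = np.array([0.1, 0.3])       # p_i = P(X_{k+1} = 1 | X_k = i)
mu = np.array([0.0, 2.0])      # regime-dependent means mu_0, mu_1
sigma = np.array([1.0, 0.5])   # regime-dependent standard deviations

def simulate_hmm(n, x0=0):
    """Simulate (X_k, Y_k), k = 0, ..., n-1, in state-space form:
    Y_k = mu_{X_k} + sigma_{X_k} V_k  and  X_{k+1} = 1(U_k <= p_{X_k})."""
    x = np.empty(n, dtype=int)
    y = np.empty(n)
    x[0] = x0
    for k in range(n):
        y[k] = mu[x[k]] + sigma[x[k]] * rng.standard_normal()  # observation equation
        if k + 1 < n:
            x[k + 1] = int(rng.uniform() <= p[x[k]])           # state (dynamic) equation
    return x, y

states, obs = simulate_hmm(1000)
```

As noted above, the marginal distribution of the simulated observations is a two-component Gaussian mixture, while any dependence between successive observations comes entirely from the hidden chain.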

1.2 Beyond Hidden Markov Models


The original works on (finite state space) hidden Markov models, as well as most of the theory regarding Gaussian linear state-space models, date back to the 1960s. Since then, the practical success of these models in several distinct application domains has generated an ever-increasing interest in HMMs and a similarly increasing number of new models based on HMMs. Several of these extensions of the basic HMM structure are, to some extent, also covered in this book.

A first simple extension is when the hidden state sequence {Xk}k≥0 is a dth order Markov process, that is, when the conditional distribution of Xk given past values Xl (with 0 ≤ l < k) depends on the d-tuple Xk−d, Xk−d+1, . . . , Xk−1. At least conceptually this is not a very significant step, as we can fall back to the standard HMM setup by redefining the state to be the vector (Xk−d+1, . . . , Xk), which has Markovian evolution. Another variation consists in allowing for non-homogeneous transitions of the hidden chain or for non-homogeneous observation distributions. By this we mean that the distribution of Xk given Xk−1, or that of Yk given Xk, can be allowed to depend on the index k. As we shall see in the second part of this book, non-homogeneous models lead to identical methods as far as state inference, i.e., inference about the hidden chain {Xk}, is concerned (except for the need to index conditional distributions with k).




Fig. 1.2. Graphical representation of the dependence structure of a Markov-switching model, where {Yk} is the observable process and {Xk} is the hidden chain.

Markov-switching models perhaps constitute the most significant generalization of HMMs. In such models, the conditional distribution of Yk+1, given all past variables, depends not only on Xk+1 but also on Yk (and possibly more lagged Y-variables). Thus, conditional on the state sequence {Xk}k≥0, {Yk}k≥0 forms a (non-homogeneous) Markov chain. Graphically, this is represented as in Figure 1.2. In state-space form, a Markov-switching model may be written as

Xk+1 = a(Xk, Uk) ,            (1.3)
Yk+1 = b(Xk+1, Yk, Vk+1) .    (1.4)
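To make the feedback from Yk concrete, the sketch below simulates a two-regime Markov-switching autoregression, taking the observation function in (1.4) to be b(Xk+1, Yk, Vk+1) = φ_{Xk+1} Yk + σ_{Xk+1} Vk+1. This particular choice of b, as well as all parameter values, is an illustrative assumption and not something prescribed by the book.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters (assumptions, not from the book)
P = np.array([[0.95, 0.05],    # transition matrix of the hidden regime chain
              [0.10, 0.90]])
phi = np.array([0.3, 0.9])     # autoregressive coefficient in each regime
sigma = np.array([1.0, 0.2])   # innovation standard deviation in each regime

def simulate_markov_switching_ar(n, x0=0, y0=0.0):
    """Markov-switching AR(1): X_{k+1} ~ P(X_k, .),
    Y_{k+1} = phi_{X_{k+1}} * Y_k + sigma_{X_{k+1}} * V_{k+1}."""
    x = np.empty(n, dtype=int)
    y = np.empty(n)
    x[0], y[0] = x0, y0
    for k in range(n - 1):
        x[k + 1] = rng.choice(2, p=P[x[k]])   # regime transition
        y[k + 1] = phi[x[k + 1]] * y[k] + sigma[x[k + 1]] * rng.standard_normal()
    return x, y

regimes, series = simulate_markov_switching_ar(500)
```

In contrast with the plain HMM simulated earlier, here the previous observation enters the observation equation, so {Yk} is Markov only conditionally on the regime sequence.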

The terminology regarding these models is not fully standardized, and the term Markov jump systems is also used, at least in cases where the (hidden) state space is finite. Markov-switching models have much in common with basic HMMs. In particular, virtually identical computational machinery may be used for both models. The statistical analysis of Markov-switching models is however much more intricate than for HMMs, due to the fact that the properties of the observed process {Yk} are not directly controlled by those of the unobservable chain {Xk} (as is the case in HMMs; see the details in Chapter 4). In particular, {Yk} is an infinite memory process whose dependence may be stronger than that of {Xk}, and it may even be the case that no stationary solution {Yk}k≥0 to (1.3)-(1.4) exists.

A final observation is that the computational tools pertaining to posterior inference, and in particular the smoothing equations of Chapter 3, hold in even greater generality. One could for example simply assume that {Xk, Yk}k≥0 jointly forms a Markov process, only a part {Yk}k≥0 of which is actually observed. We shall see however in the third part of the book that all statistical statements that we can currently make about the properties of estimators of the parameters of HMMs heavily rely on the fact that {Xk}k≥0 is a Markov chain, and even more crucially, a uniformly ergodic Markov chain (see Chapter 4).


For more general models such as partially observed Markov processes, it is not yet clear what type of (not overly restrictive and reasonably general) conditions are required to guarantee that reasonable estimators (such as the maximum likelihood estimator for instance) are well behaved.

1.3 Examples
HMMs and their generalizations are nowadays used in many different areas. The (partial) bibliography by Cappé (2001b) (which contains more than 360 references for the period 1990-2000) gives an idea of the reach of the domain. Several specialized books are available that largely cover applications of HMMs to some specific areas such as speech recognition (Rabiner and Juang, 1993; Jelinek, 1997), econometrics (Hamilton, 1989; Kim and Nelson, 1999), computational biology (Durbin et al., 1998; Koski, 2001), or computer vision (Bunke and Caelli, 2001). We shall of course not try to compete with these in fully describing real-world applications of HMMs. We will however consider throughout the book a number of prototype HMMs (used in some of these applications) in order to illustrate the variety of situations: finite-valued state space (DNA or protein sequencing), binary Markov chain observed in Gaussian noise (ion channel), non-linear Gaussian state-space model (stochastic volatility), conditionally Gaussian state-space model (deconvolution), etc. It should be stressed that the idea one has about the nature of the hidden Markov chain {Xk} may be quite different from one case to another. In some cases it does have a well-defined physical meaning, whereas in other cases it is conceptually more diffuse, and in yet other cases the Markov chain may be completely fictitious and the probabilistic structure of the HMM is then used only as a tool for modeling dependence in data. These differences are illustrated in the examples below.

1.3.1 Finite Hidden Markov Models

In a finite hidden Markov model, both the state space X of the hidden Markov chain and the set Y in which the output lies are finite. We will generally assume that these sets are {1, 2, . . . , r} and {1, 2, . . . , s}, respectively. The HMM is then characterized by the transition probabilities qij = P(Xk+1 = j | Xk = i) of the Markov chain and the conditional probabilities gij = P(Yk = j | Xk = i).

Example 1.3.1 (Gilbert-Elliott Channel Model). The Gilbert-Elliott channel model, after Gilbert (1960) and Elliott (1963), is used in information theory to model the occurrence of transmission errors in some digital communication channels. Interestingly, this is a pre-HMM hidden Markov model, as it predates the seminal papers by Baum and his colleagues who introduced the term hidden Markov model.

In digital communications, all signals to be transmitted are first digitized and then transformed, a step known as source coding.


After this preprocessing, one can safely assume that the bits that represent the signal to be transmitted form an i.i.d. sequence of fair Bernoulli draws (Cover and Thomas, 1991). We will denote by {Bk}k≥0 the sequence of bits at the input of the transmission system. Abstracted high-level models of how this sequence of bits may get distorted during the transmission are useful for devising efficient reception schemes and deriving performance bounds. The simplest model is the (memoryless) binary symmetric channel, in which it is assumed that each bit may be randomly flipped by an independent error sequence,

Yk = Bk ⊕ Vk ,    (1.5)

where {Yk}k≥0 are the observations and {Vk}k≥0 is an i.i.d. Bernoulli sequence with P(Vk = 1) = q, and ⊕ denotes modulo-two addition. Hence, the received bit is equal to the input bit Bk if Vk = 0; otherwise Yk ≠ Bk and an error occurs. The more realistic Gilbert-Elliott channel model postulates that errors tend to be more bursty than predicted by the memoryless channel. In this model, the channel regime is modeled as a two-state Markov chain {Sk}k≥0, which represents low and high error conditions, respectively. The transition matrix of this chain is determined by the switching probabilities p0 = P(Sk+1 = 1 | Sk = 0) (transition into the high error regime) and p1 = P(Sk+1 = 0 | Sk = 1) (transition into the low error regime). In each regime, the model acts like the memoryless symmetric channel with error probabilities q0 = P(Yk ≠ Bk | Sk = 0) and q1 = P(Yk ≠ Bk | Sk = 1), where q0 < q1.

To recover the HMM framework, define the hidden state sequence as the joint process that collates the emitted bits and the sequence of regimes, Xk = (Bk, Sk). This is a four-state Markov chain whose transition matrix, with rows indexed by the current state Xk and columns by the next state Xk+1 in the order (0,0), (0,1), (1,0), (1,1), is

(0,0):   (1 - p0)/2    p0/2        (1 - p0)/2    p0/2
(0,1):    p1/2         (1 - p1)/2   p1/2         (1 - p1)/2
(1,0):   (1 - p0)/2    p0/2        (1 - p0)/2    p0/2
(1,1):    p1/2         (1 - p1)/2   p1/2         (1 - p1)/2

Neither the emitted bit Bk nor the channel regime Sk is observed directly, but the model asserts that, conditionally on {Xk}k≥0, the observations are independent Bernoulli draws with P(Yk = b | Bk = b, Sk = s) = 1 - qs.
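The following short sketch (Python/NumPy; the switching and error probabilities are arbitrary illustrative values, and the helper function is ours, not the book's) simulates the Gilbert-Elliott channel by drawing the regime chain {Sk}, the i.i.d. input bits {Bk}, and the regime-dependent bit flips.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative parameters (assumptions, not from the book)
p0, p1 = 0.05, 0.20   # P(S_{k+1} = 1 | S_k = 0) and P(S_{k+1} = 0 | S_k = 1)
q0, q1 = 0.01, 0.30   # error probabilities in the low and high error regimes

def gilbert_elliott(n):
    """Simulate n steps of the Gilbert-Elliott channel; returns (B_k, S_k, Y_k)."""
    b = rng.integers(0, 2, size=n)            # i.i.d. fair input bits
    s = np.empty(n, dtype=int)
    s[0] = 0
    for k in range(n - 1):                    # two-state regime chain
        if s[k] == 0:
            s[k + 1] = int(rng.uniform() < p0)
        else:
            s[k + 1] = int(rng.uniform() >= p1)
    q = np.where(s == 0, q0, q1)              # regime-dependent error probability
    flips = (rng.uniform(size=n) < q).astype(int)
    y = b ^ flips                             # modulo-two addition: flip the bit on error
    return b, s, y

bits, regimes, received = gilbert_elliott(10_000)
print("empirical error rate:", np.mean(bits != received))
```

Because the regime chain is persistent, the simulated errors arrive in bursts, which is precisely the feature the Gilbert-Elliott model is meant to capture.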

Example 1.3.2 (Channel Coding and Transmission Over Memoryless Discrete Channel). We will consider in this example another elementary example of the use of HMMs, also drawn from the digital communication world.


Assume we are willing to transmit a message encoded as a sequence {b0, . . . , bm} of bits, where bi ∈ {0, 1} are the bits and m is the length of the message. We wish to transmit this message over a channel, which will typically affect the transmitted message by introducing (at random) errors. To go further, we need to have an abstract model for the channel. In this example, we will consider discrete channels, that is, the channel's inputs and outputs are assumed to belong to finite alphabets: {i1, . . . , iq} for the inputs and {o1, . . . , ol} for the outputs. In this book, we will most often consider binary channels only; then the inputs and the outputs of the transmission channel are bits, q = l = 2 and {i1, i2} = {o1, o2} = {0, 1}. A transmission channel is said to be memoryless if the probability of the channel's output Y0:n = y0:n conditional on its input sequence S0:n = s0:n factorizes as
P(Y0:n | S0:n) = ∏_{i=0}^{n} P(Yi | Si) .

In words, conditional on the input sequence S0:n, the channel outputs are conditionally independent. The transition probabilities of the discrete memoryless channel are defined by a transition kernel R : {i1, . . . , iq} × {o1, . . . , ol} → [0, 1], where for i = 1, . . . , q and j = 1, . . . , l,

R(ii, oj) = P(Y0 = oj | S0 = ii) .    (1.6)

The most classical example of a discrete memoryless channel is the binary symmetric channel (BSC) with binary input and binary output, for which R(0, 1) = R(1, 0) = ε with ε ∈ [0, 1]. In words, every time a bit Sk = 0 or Sk = 1 is sent across the BSC, the output is also a bit Yk ∈ {0, 1}, which differs from the input bit with probability ε; that is, the error probability is P(Yk ≠ Sk) = ε. As described in Example 1.3.1, the output of a binary symmetric channel can be modeled as a noisy version of the input sequence, Yk = Sk ⊕ Vk, where ⊕ is the modulo-two addition and {Vk}k≥0 is an independent and identically distributed sequence of bits, independent of the input sequence {Sk}k≥0 and with P(Vk = 0) = 1 - ε. If we wish to transmit a message S0:m = b0:m over a BSC without coding, the probability of getting an error will be

P(Y0:m ≠ b0:m | S0:m = b0:m) = 1 - P(Y0:m = b0:m | S0:m = b0:m) = 1 - (1 - ε)^m .

Therefore, as m becomes large, with probability close to 1, at least one bit of the message will be incorrectly received, which calls for a practical solution. Channel coding is a viable method to increase reliability, but at the expense of reduced information rate. Increased reliability is achieved by adding redundancy to the information symbol vector, resulting in a longer coded vector of symbols that are distinguishable at the output of the channel. There are many ways to construct codes, and we consider in this example only a very elementary example of a rate 1/2 convolutional coder with memory length 2.
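A quick numerical illustration of this point (a minimal Python sketch; the values of ε and m are arbitrary, and the error-free probability (1 - ε)^m follows the convention of the display above):

```python
# Probability that at least one bit of an uncoded length-m message is
# received in error over a BSC with crossover probability eps.
for eps in (0.01, 0.05):
    for m in (10, 100, 1000):
        p_at_least_one_error = 1 - (1 - eps) ** m
        print(f"eps={eps:.2f}  m={m:4d}  P(error) = {p_at_least_one_error:.4f}")
```

Even for a small crossover probability, the chance of an error-free uncoded transmission decays geometrically with the message length, which is what motivates coding.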


Fig. 1.3. Rate 1/2 convolutional code with memory length 2.

The rate 1/2 means that a message of length m will be transformed into a message of length 2m, that is, we will send 2m bits over the transmission channel in order to introduce some kind of redundancy to increase our chance of getting an error-free message. The principle of this convolutional coder is depicted in Figure 1.3. Because the memory length is 2, there are 4 different states and the behavior of this convolutional encoder can be captured as a 4-state machine, where the state alphabet is X = {(0, 0), (0, 1), (1, 0), (1, 1)}. Denote by Xk the value of the state at time k, Xk = (Xk,1, Xk,2) ∈ X. Upon the arrival of the bit Bk+1, the state is transformed to

Xk+1 = (Xk+1,1, Xk+1,2) = (Bk+1, Xk,1) .

In the engineering literature, Xk is said to be a shift register. If the sequence {Bk}k≥0 of input bits is i.i.d. with probability P(Bk = 1) = p, then {Xk}k≥0 is a Markov chain with transition probabilities

P[Xk+1 = (1, 1) | Xk = (1, 0)] = P[Xk+1 = (1, 1) | Xk = (1, 1)] = p ,
P[Xk+1 = (1, 0) | Xk = (0, 1)] = P[Xk+1 = (1, 0) | Xk = (0, 0)] = p ,
P[Xk+1 = (0, 1) | Xk = (1, 0)] = P[Xk+1 = (0, 1) | Xk = (1, 1)] = 1 - p ,
P[Xk+1 = (0, 0) | Xk = (0, 1)] = P[Xk+1 = (0, 0) | Xk = (0, 0)] = 1 - p ,

all other transition probabilities being zero. To each input bit, the convolutional encoder generates two outputs according to

Sk = (Sk,1, Sk,2) = (Bk ⊕ Xk,2, Bk ⊕ Xk,2 ⊕ Xk,1) .

These encoded bits, referred to as symbols, are then sent on the transmission channel. A graphical interpretation of the problem is quite useful. A convolutional encoder (or, more generally, a finite state Markovian machine) can be represented by a state transition diagram of the type in Figure 1.4. The nodes are the states and the branches represent transitions having non-zero probability. If we index the states with both the time index k and state index m, we get the trellis diagram of Figure 1.4. The trellis diagram shows the time progression of the state sequences.


Fig. 1.4. Trellis representation of rate 1/2 convolutional code with memory length 2.

The trellis diagram shows the time progression of the state sequences. For every state sequence, there is a unique path through the trellis diagram and vice versa. More generally, the channel encoder is a finite state machine that transforms a message encoded as a finite stream of bits into an output sequence whose length is increased by a multiplicative factor that is the inverse of the rate of the encoder. If the input bits are i.i.d., the state sequence of this finite state machine is a finite state Markov chain. The m distinct states of the Markov source are {t_1, . . . , t_m}. The output of this finite state machine is a sequence S_k with values in a finite alphabet {o_1, . . . , o_q}. The state transitions of the Markov source are governed by the transition probabilities p(i, j) = P(X_n = t_j | X_{n−1} = t_i) and the output of the finite-state machine by the probabilities q(i; j, k) = P(S_n = o_i | X_n = t_j, X_{n−1} = t_k). The Markov source always starts from the same initial state, X_0 = t_1 say, and produces an output sequence S_{0:n} = (S_0, S_1, . . . , S_n) ending in the terminal state X_n = t_1. S_{0:n} is the input to a noisy discrete memoryless channel whose output is the sequence Y_{0:n} = (Y_0, . . . , Y_n). This discrete memoryless channel is also governed by the transition probabilities (1.6). It is easy to recognize the general set-up of hidden Markov models, which are an extremely useful and popular tool in the digital communication community. The objective of the decoder is to examine Y_{0:n} and estimate the a posteriori probability of the states and transitions of the Markov source, i.e., the conditional probabilities P(X_k = t_i | Y_{0:n}) and P(X_k = t_i, X_{k+1} = t_j | Y_{0:n}).
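These posterior probabilities can be computed exactly by the forward-backward recursions discussed in Chapter 3. As a preview, here is a hedged Python sketch for a generic finite-state HMM with transition matrix Q, emission matrix G, and initial distribution nu; the variable names and the toy numbers are illustrative, not the book's notation.

import numpy as np

def forward_backward(nu, Q, G, obs):
    """Posterior state probabilities P(X_k = i | Y_{0:n}) for a finite HMM.
    nu: initial distribution (m,), Q: transition matrix (m, m),
    G[i, y]: probability of emitting symbol y from state i, obs: observed symbols."""
    n, m = len(obs), len(nu)
    alpha = np.zeros((n, m))            # scaled forward variables
    beta = np.ones((n, m))              # scaled backward variables
    alpha[0] = nu * G[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for k in range(1, n):
        alpha[k] = (alpha[k - 1] @ Q) * G[:, obs[k]]
        alpha[k] /= alpha[k].sum()
    for k in range(n - 2, -1, -1):
        beta[k] = Q @ (G[:, obs[k + 1]] * beta[k + 1])
        beta[k] /= beta[k].sum()
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)

# toy two-state example
nu = np.array([0.5, 0.5])
Q = np.array([[0.9, 0.1], [0.2, 0.8]])
G = np.array([[0.8, 0.2], [0.3, 0.7]])
print(forward_backward(nu, Q, G, [0, 0, 1, 1, 0]))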


Example 1.3.3 (HMM in Biology). Another example featuring finite HMMs is stochastic modeling of biological sequences. This is certainly one of the most successful examples of applications of HMM methodology in recent years. There are several different uses of HMMs in this context (see Churchill, 1992; Durbin et al., 1998; Koski, 2001; Baldi and Brunak, 2001, for further references and details), and we only briefly describe the application of HMMs to gene finding in DNA or, more generally, functional annotation of sequenced genomes. In their genetic material, all living organisms carry a blueprint of the molecules they need for the complex task of living. This genetic material is (usually) stored in the form of DNA, short for deoxyribonucleic acid, sequences. The DNA is not actually a sequence, but a long, chain-like molecule that can be specified uniquely by listing the sequence of bases from which it is composed. This process is known as sequencing and is a challenge of its own, although the number of completely sequenced genomes has been growing at an impressive rate since the early 1990s. This motivates the abstract view of DNA as a sequence over the four-letter alphabet A, C, G, and T (for adenine, cytosine, guanine, and thymine, the four possible instantiations of the base). The role of DNA is as a storage medium for information about the individual molecules needed in the biochemical processes of the organism. A region of the DNA that encodes a single functional molecule is referred to as a gene. Unfortunately, there is no easy way to discriminate coding regions (those that correspond to genes) from non-coding ones. In addition, the dimension of the problem is enormous, as typical bacterial genomes can be millions of bases long, with the number of genes to be located ranging from a few hundred to a few thousand. The simplistic approach to this problem (Churchill, 1992) consists in modeling the observed sequence of bases {Y_k}_{k≥0} ∈ {A, C, G, T} by a two-state hidden Markov model such that the non-observable state is binary-valued, with one state corresponding to non-coding regions and the other one to coding regions. In the simplest form of the model, the conditional distribution of Y_k given X_k is simply parameterized by the vector of probabilities of observing A, C, G, or T when in the coding and non-coding states, respectively. Despite its deceptive simplicity, the results obtained by estimating the parameters of this basic two-state finite HMM on actual genome sequences and then determining the smoothed estimate of the state sequence X_k (using techniques to be discussed in Chapter 3) were sufficiently promising to generate an important research effort in this direction. The basic strategy described above has been improved over the years to incorporate more and more of the knowledge accumulated about the behavior of actual genome sequences; see Krogh et al. (1994), Burge and Karlin (1997), Lukashin and Borodovsky (1998), Jarner et al. (2001) and references therein. A very important fact, for instance, is that in coding regions the DNA is structured into codons, which are composed of three successive symbols in our A, C, G, T alphabet. This property can be accommodated by using higher order HMMs in which the distribution of Y_k does not only depend on the current state X_k but also on the previous two observations Y_{k−1} and Y_{k−2}. Another option consists in using non-homogeneous models such that the distribution of Y_k does not only depend on the current state X_k but also on the value of the index k modulo 3. In addition, some particular


sub-sequences have a specific function, at least when they occur in a coding region (there are start and stop codons for instance). Needless to say, enlarging the state space X to add specific states corresponding to those well identified functional sub-sequences is essential. Finally and most importantly, the functional description of the DNA sequence is certainly not restricted to just the coding/non-coding dichotomy, and most models use many more hidden states to differentiate between several distinct functional regions in the genome sequence.

Example 1.3.4 (Capture-Recapture). Capture-recapture models are often used in the study of populations with unknown sizes, as in surveys, census undercount, animal abundance evaluation, and software debugging, to name a few of their numerous applications. To set up the model in its original framework, we consider here the setting examined in Dupuis (1995) of a population of lizards (Lacerta vivipara) that move between three spatially connected zones, denoted 1, 2, and 3, the focus being on modeling these moves. For a given lizard, the sequence of the zones where it stays can be modeled as a Markov chain with transition matrix Q. This model still pertains to HMMs as, at a given time, most lizards are not observed: this is therefore a partly hidden Markov model. To draw inference on the matrix Q, the capture-recapture experiment is run as follows. At time k = 0, a (random) number of lizards are captured, marked, and released. This operation is repeated at times k = 1, . . . , n by tagging the newly captured animals and by recording at each capture the position (zone) of the recaptured animals. Therefore, the model consists of a series of capture events and positions (conditional on a capture) of n + 1 cohorts of animals marked at times k = 0, . . . , n. To account for open populations (as lizards can either die or leave the region of observation for good), a fourth state is usually added to the three spatial zones. It is denoted † (dagger) and, from the point of view of the underlying Markov chain, it is an absorbing state while, from the point of view of the HMM, it is always hidden.¹ The observations may thus be summarized by the series {Y_k^m}_{0≤k≤n} of capture histories that indicate, for each lizard captured at least once (m being the lizard index), in which zone it was at each of the times it was captured. We may for instance record

{y_k^m}_{0≤k≤n} = (0, . . . , 0, 1, 1, 2, 0, 2, 0, 0, 3, 0, 0, 0, 1, 0, . . . , 0) ,

where 0 means that the lizard was not captured at that particular time index. To each such observed sequence, there corresponds a (partially) hidden sequence {X_k^m}_{0≤k≤n} of lizard locations, for instance

{x_k^m}_{0≤k≤n} = (1, . . . , 2, 1, 1, 2, 2, 2, 3, 2, 3, 3, 2, 2, 1, †, . . . , †)
¹ One could argue that lizards may also enter the population, either by migration or by birth. The latter reason is easily accounted for, as the age of the lizard can be assessed at the first capture. The former reason is real but will be ignored.


which indicates that the animal disappeared right after the last capture (where the values that are deterministically known from the observations have been stressed in bold). The purposes in running capture-recapture experiments are often twofold: first, inference can be drawn on the size of the whole population based on the recapture history, as in the Darroch model (Castledine, 1981; Seber, 1983), and, second, features of the population can be estimated from the captured animals, like capture and movement probabilities.
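A tiny simulation helps make this structure concrete. The sketch below is a toy illustration with made-up numbers, not estimates from Dupuis (1995): it moves one lizard between the zones 1, 2, 3 and the absorbing state † according to a transition matrix Q, and then thins the trajectory with a fixed capture probability to produce a capture history of the kind displayed above.

import numpy as np

rng = np.random.default_rng(1)
# states 0, 1, 2 are the zones 1, 2, 3; state 3 is the absorbing state (dagger)
Q = np.array([[0.70, 0.15, 0.05, 0.10],
              [0.10, 0.70, 0.10, 0.10],
              [0.05, 0.15, 0.70, 0.10],
              [0.00, 0.00, 0.00, 1.00]])
p_capture = 0.4   # illustrative probability of capturing a live lizard

def one_lizard(n):
    x = [rng.integers(0, 3)]                     # initial zone
    for _ in range(n):
        x.append(rng.choice(4, p=Q[x[-1]]))      # Markov move, dagger is absorbing
    y = [xi + 1 if xi < 3 and rng.random() < p_capture else 0 for xi in x]
    return x, y                                  # hidden locations, capture history

locations, history = one_lizard(15)
print("hidden:", locations)
print("observed (0 = not captured):", history)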

1.3.2 Normal Hidden Markov Models

By a normal hidden Markov model we mean an HMM in which the conditional distribution of Y_k given X_k is Gaussian. In many applications, the state space is finite, and we will then assume it is {1, 2, . . . , r}. In this case, given X_k = i, Y_k ∼ N(μ_i, σ_i²), so that the marginal distribution of Y_k is a finite mixture of normals.

Example 1.3.5 (Ion Channel Modeling). A cell, for example in the human body, needs to exchange various kinds of ions (sodium, potassium, etc.) with its surroundings for its metabolism and for purposes of chemical communication. The cell membrane itself is impermeable to such ions but contains so-called ion channels, each tailored for a particular kind of ion, to let ions pass through. Such a channel is really a large molecule, a protein, that may assume different configurations, or states. In some states, the channel allows ions to flow through (the channel is open), whereas in other states ions cannot pass (the channel is closed). A flow of ions is a transportation of electrical charge, hence an electric current (of the order of picoamperes). In other words, each state of the channel is characterized by a certain conductance level. These levels may correspond to a fully open channel, a closed channel, or something in between. The current through the channel can be measured using special probes (this is by no means trivial!), with the result being a time series that switches between different levels as the channel reconfigures. In this context, the main motivation is to study the characteristics of the dynamics of these ion channels, which are only partly understood, based on sampled measurements. In the basic model, the channel current is simply assumed to be corrupted by additive white (i.i.d.) Gaussian measurement noise. If the state of the ion channel is modeled as a Markov chain, the measured time series becomes an HMM with conditionally Gaussian output and with the variances σ_i² not depending on i. A limitation of this basic model is that if each physical configuration of the channel (say closed) corresponds to a single state of the underlying Markov chain, we are implicitly assuming that each visit to this state has a duration drawn from a geometric distribution. A work-around that makes it possible to keep the HMM framework consists in modeling each physical configuration by a compound of distinct states of the underlying Markov chain,


which are constrained to have a common conditional Gaussian output distribution. Depending on the exact transition matrix of the hidden chain, the durations spent in a given physical configuration can be modeled by negative binomial, mixtures of geometric, or more complicated discrete distributions. Further reading on ion-channel modeling can be found, for example, in Ball and Rice (1992) for basic references and in Ball et al. (1999) and Hodgson (1998) for more advanced statistical approaches.

Example 1.3.6 (Speech Recognition). As yet another example of normal HMMs, we consider applications to speech recognition, which was the first area where HMMs were used extensively, starting in the early 1980s. The basic task is to determine automatically, from a recording of a person's voice (or in real time, on-line), what he or she said. To do that, the recorded and sampled speech signal is slotted into short sections (also called frames), typically representing about 20 milliseconds of the original signal. Each section is then analyzed separately to produce a set of coefficients that represent the estimated power spectral density of the signal in the frame. This preprocessing results in a discrete-time multivariate time series of spectral coefficients. For a given word to be recognized (imagine, for simplicity, that speakers only pronounce single words), the length of the series of vectors resulting from this preprocessing is not determined beforehand but depends on the time taken for the speaker to utter the word. A primary requirement on the model is thus to cope with the time alignment problem so as to be able to compare multivariate sequences of unequal lengths. In this application, the hidden Markov chain corresponds to sub-elements of the utterance that are expected to have comparable spectral characteristics. In particular, we may view each word as a sequence of phonemes (for instance, red: [r-e-d]; class: [k-l-a:-s]). The state of the Markov chain is then the hypothetical phoneme that is currently being uttered at a given time slot. Thus, for a word with three phonemes, like red for example, the state of the Markov chain may evolve according to Figure 1.5. Note that as opposed to Figures 1.1 and 1.2, Figure 1.5 is an automaton description of the Markov chain that indicates where the chain may jump to given its current state. Each arrow thus represents a possible transition that is associated with a non-zero transition probability. In this book, we shall use double circles for the nodes of such automata, as in Figure 1.5, to distinguish them from graphical models. We see that each state corresponding to a phoneme has a transition back
Fig. 1.5. Automaton representation of the Markov chain structure of an HMM for recognizing the word red.


to itself, that is, a loop; this is to allow the phoneme to last for as long as the recording of it does. The purpose of the initial state Start and terminal state Stop is simply to have well-defined starts and terminations of the Markov chain; the stop state may be thought of as an absorbing state with no associated observation. The observation vectors associated with a particular (unobservable) state are assumed to be independent and are assigned a multivariate distribution, most often a mixture of Gaussian distributions. The variability induced by this distribution is used to model spectral variability within and between speakers. The actual speech recognition is realized by running the recorded word as input to several different HMMs, each representing a particular word, and selecting the one that assigns the largest likelihood to the observed sequence. In a prior training phase, the parameters of each word model have been estimated using a large number of recorded utterances of the word. Note that the association of the states of the hidden chain with the phonemes in Figure 1.5 is more a conceptual view than an actual description of what the model does. In practice, the recognition performance of HMM-based speech recognition engines is far better than their efficiency at segmenting words into phonemes. Further reading on speech recognition using HMMs can be found in the books by Rabiner and Juang (1993) and Jelinek (1997). The famous tutorial by Rabiner (1989) gives a more condensed description of the basic model, and Young (1996) provides an overview of current large-scale speech recognition systems.
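To fix ideas, the sketch below samples from a small left-to-right normal HMM of the kind pictured in Figure 1.5, with one scalar Gaussian output per phoneme state; the state names, self-transition probability, and output parameters are invented for illustration only.

import numpy as np

rng = np.random.default_rng(2)
states = ["R", "E", "D"]               # phoneme states between Start and Stop
stay = 0.8                             # illustrative self-transition probability
means, sigmas = [-1.0, 0.0, 1.5], [0.3, 0.3, 0.3]

def sample_word():
    """Generate one utterance: traverse R -> E -> D, looping in each state."""
    obs = []
    for i in range(len(states)):
        while True:
            obs.append(rng.normal(means[i], sigmas[i]))
            if rng.random() > stay:    # leave the current phoneme state
                break
    return np.array(obs)

y = sample_word()
print(len(y), "frames:", np.round(y, 2))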

1.3.3 Gaussian Linear State-Space Models

The standard state-space model that we shall most often employ in this book takes the form

X_{k+1} = A X_k + R U_k ,   (1.7)
Y_k = B X_k + S V_k ,   (1.8)

where {U_k}_{k≥0}, called the state or process noise, and {V_k}_{k≥0}, called the measurement noise, are independent standard (multivariate) Gaussian white noise (sequences of i.i.d. multidimensional Gaussian random variables with zero mean and identity covariance matrices); the initial condition X_0 is Gaussian with known mean and covariance and is uncorrelated with the processes {U_k} and {V_k}; and the state transition matrix A, the measurement transition matrix B, the square root of the state noise covariance R, and the square root of the measurement noise covariance S are known matrices with appropriate dimensions.
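For readers who prefer to see the model run, here is a minimal NumPy simulation of (1.7)-(1.8); the particular matrices are arbitrary placeholders chosen only so that the script is self-contained, not values from the text.

import numpy as np

rng = np.random.default_rng(3)
# placeholder system matrices (2-dimensional state, scalar observation)
A = np.array([[0.95, 1.0], [0.0, 0.95]])
R = np.array([[0.0], [0.5]])
B = np.array([[1.0, 0.0]])
S = np.array([[0.8]])

n, dx, du, dy = 200, 2, 1, 1
x = np.zeros(dx)                                # X_0 (taken here as zero for simplicity)
ys = []
for _ in range(n):
    y = B @ x + S @ rng.standard_normal(dy)     # Y_k = B X_k + S V_k
    ys.append(y)
    x = A @ x + R @ rng.standard_normal(du)     # X_{k+1} = A X_k + R U_k

print("first observations:", np.round(np.array(ys[:5]).ravel(), 2))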


Ever since the pioneering work by Kalman and Bucy (1961), the study of the above model has been a favorite both in the engineering (automatic control, signal processing) and the time series literature. Recommended readings on the state-space model include the books by Anderson and Moore (1979), Caines (1988), and Kailath et al. (2000). In addition to its practical importance, the Gaussian linear state-space model is interesting because it corresponds to one of the very few cases for which exact and reasonably efficient numerical procedures are available to compute the distributions of the X-variables given the Y-variables (see Chapters 3 and 5).

Remark 1.3.7. The form adopted for the model (1.7)-(1.8) is rather standard (except for the symbols chosen for the various matrices, which vary widely in the literature), but the role of the matrices R and S deserves some comments. We assume in the following that both noise sequences {U_k} and {V_k} are i.i.d. with identity covariance matrices. Hence R and S serve as square roots of the noise covariances, as

Cov(R U_k) = R R^t   and   Cov(S V_k) = S S^t ,

where the superscript t denotes matrix transposition. In some cases, and in particular when either the X- or Y-variables are scalar, it would probably be simpler to use Ũ_k = R U_k and Ṽ_k = S V_k as noise variables, adopting their respective covariance matrices as parameters of the model. In many situations, however, the covariance matrices have a special structure that is most naturally represented by using R and S as parameters. In Example 1.3.8 below for instance, the dynamic noise vector U_k has a dimension much smaller than that of the state vector X_k. Hence R is a tall matrix (with more rows than columns) and the covariance matrix of Ũ_k = R U_k is rank deficient. It is then much more natural to work only with the low-dimensional unit covariance disturbance vector U_k rather than with Ũ_k = R U_k. In the following, we will assume that S S^t is a full rank covariance matrix (for reasons discussed in Section 5.2), but R R^t will often be rank deficient as in Example 1.3.8. In many respects, the case in which the state and measurement noises {U_k} and {V_k} are correlated is not much more complicated. It however departs from our usual assumptions in that {X_k, Y_k} then forms a Markov chain but {X_k} itself is no longer Markov. We will thus restrict ourselves to the case in which {U_k} and {V_k} are independent and refer, for instance, to Kailath et al. (2000) for further details on this issue.

Example 1.3.8 (Noisy Autoregressive Process). We shall define a pth order scalar autoregressive (AR) process {Z_k}_{k≥0} as one that satisfies the stochastic difference equation

Z_{k+1} = φ_1 Z_k + · · · + φ_p Z_{k−p+1} + U_k ,   (1.9)

where {U_k}_{k≥0} is white noise. Define the lag-vector


X_k = (Z_k, . . . , Z_{k−p+1})^t ,   (1.10)

and let A be the so-called companion matrix

A = \begin{pmatrix} \phi_1 & \phi_2 & \cdots & \cdots & \phi_p \\ 1 & 0 & \cdots & & 0 \\ 0 & 1 & \cdots & & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix} .   (1.11)

Using these notations, (1.9) can be equivalently rewritten in state-space form:

X_k = A X_{k−1} + (1 0 . . . 0)^t U_{k−1} ,   (1.12)
Y_k = (1 0 . . . 0) X_k .   (1.13)

If the autoregressive process is not directly observable but only a noisy version of it is available, the measurement equation (1.13) is replaced by

Y_k = (1 0 . . . 0) X_k + V_k ,   (1.14)

where {Vk }k0 is the measurement noise. When there is no feedback between the measurement noise and the autoregressive process, it is sensible to assume that the state and measurement noises {Uk } and {Vk } are independent.
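The rewriting of (1.9) in the form (1.12)-(1.14) is easy to automate. The following sketch builds the companion matrix for given AR coefficients and simulates a noisy AR(p) observation sequence; the coefficient values and noise scales are placeholders chosen for illustration, not values from the text.

import numpy as np

def companion(phi):
    """Companion matrix (1.11) for AR coefficients phi = (phi_1, ..., phi_p)."""
    p = len(phi)
    A = np.zeros((p, p))
    A[0, :] = phi
    A[1:, :-1] = np.eye(p - 1)
    return A

rng = np.random.default_rng(4)
phi = [0.5, 0.3]                       # illustrative AR(2) coefficients
sigma_u, sigma_v, n = 1.0, 0.5, 300
A = companion(phi)
e1 = np.zeros(len(phi)); e1[0] = 1.0   # the vector (1 0 ... 0)^t

x = np.zeros(len(phi))
y = np.empty(n)
for k in range(n):
    x = A @ x + e1 * sigma_u * rng.standard_normal()   # state equation (1.12)
    y[k] = x[0] + sigma_v * rng.standard_normal()      # noisy observation (1.14)

print("sample variance of the noisy AR(2) observations:", round(y.var(), 3))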

1.3.4 Conditionally Gaussian Linear State-Space Models

We gradually move toward more complicated models for which the state space X of the hidden chain is no longer finite. The previous example is, as we shall see in Chapter 5, a singular case because of the unique properties of the multivariate Gaussian distribution with respect to linear transformations. We now describe a related, although more complicated, situation in which the state X_k is composed of two components C_k and W_k, where the former is finite-valued whereas the latter is a continuous, possibly vector-valued, variable. The term conditionally Gaussian linear state-space models, or CGLSSMs for short, corresponds to structures by which the model, when conditioned on the finite-valued process {C_k}_{k≥0}, reduces to the form studied in the previous section. Conditionally Gaussian linear state-space models belong to a class of models that we will refer to as hierarchical hidden Markov models, whose dependence structure is depicted in Figure 1.6. In such models the variable C_k, which is the highest in the hierarchy, influences both the transition from W_{k−1} to W_k as well as the observation Y_k. When {C_k} takes its values in a finite set, it is also common to refer to such models as jump Markov models, where the jumps correspond to the instants k at which the value of C_k differs from that of C_{k−1}. Of course, Figure 1.6 also corresponds to a standard HMM structure by

Fig. 1.6. Graphical representation of the dependence structure of a hierarchical HMM.

considering the composite state X_k = (C_k, W_k). But for hierarchical HMMs in general and CGLSSMs in particular, it is often advantageous to consider the intermediate state sequence {W_k}_{k≥0} as a nuisance parameter and to focus on the {C_k} component that stands at the top of the hierarchy in Figure 1.6. To do so, one needs to integrate out the influence of {W_k}, conditioning on {C_k} only. This principle can only be made effective in situations where the model belongs to a simple class (such as Gaussian linear state-space models) once conditioned on {C_k}. Below we give several simple examples that illustrate the potential of this important class of models.

Example 1.3.9 (Rayleigh-fading Channel). We will now follow up on Example 1.3.1 and again consider a model of interest in digital communication. The point is that for wireless transmissions it is possible, and desirable, to model more explicitly (than in Example 1.3.1) the physical processes that cause errors during transmissions. As in Example 1.3.1, we shall assume that the signal to be transmitted forms an i.i.d. sequence of fair Bernoulli draws. Here the sequence is denoted by {C_k}_{k≥0} and we assume that it takes its values in the set {−1, 1} rather than in {0, 1}. This sequence is transmitted through a suitable modulation (Proakis, 1995) that is not of direct interest to us. At the receiving side, the signal is first demodulated, and the simplest model, known as the additive white Gaussian noise (AWGN) channel, postulates that the demodulated signal {Y_k}_{k≥0} may be written

Y_k = h C_k + V_k ,   (1.15)

where h is a (real) channel gain, also known as a fading coefficient, and {V_k}_{k≥0} is an i.i.d. sequence of Gaussian observation noise with zero mean and


variance σ². For reasons that are inessential for the discussion that follows, the actual model features complex channel gain and noise (Proakis, 1995), a fact that we will ignore in the following. The AWGN channel model ignores inter-symbol interference in the sense that under (1.15) the observations {Y_k} are i.i.d. In many practical situations, it is necessary to account for channel memory to obtain a reasonable model of the received signal. Another issue is that, in particular in wireless communication, the physical characteristics of the propagation path or channel are continuously changing over time. As a result, the fading coefficient h will typically not stay constant but vary with time. A very simple model consists in assuming that the fading coefficient follows a (complex) autoregressive model of order 1, giving the model

W_{k+1} = φ W_k + U_k ,
Y_k = W_k C_k + V_k ,

where the time-varying h is denoted by W_k, and {U_k}_{k≥0} is white Gaussian noise (an i.i.d. sequence of zero mean Gaussian random variables). With this model, it is easily checked that if we assume that W_0 is a Gaussian random variable independent of both the observation noise {V_k} and the state noise {U_k}, {Y_k} is the observation sequence corresponding to an HMM with hidden state X_k = (C_k, W_k) (the emitted bit and the fading coefficient). This is a general state-space HMM, as W_k is a real random variable. In this application, the aim is to estimate the sequence {C_k} of bits, which is thus a component of the unobservable state sequence, given the observations {Y_k}. The fading coefficients {W_k} are of no direct interest and constitute nuisance variables. This model however has a unique feature among general state-space HMMs in that, conditionally on the sequence {C_k} of bits, it reduces to a Gaussian linear state-space model with state variables {W_k}. The only difference to Section 1.3.3 is that the observation equation becomes non-homogeneous in time, Y_k = W_k c_k + V_k, where {C_k = c_k} is the event on which we are conditioning. As a striking consequence, we shall see in Chapters 4 and 5 that the distribution of W_k given the observations Y_0, Y_1, . . . , Y_k is a mixture of 2^{k+1} Gaussian distributions. Because this is clearly not a tractable form when k is a two-digit number, the challenge consists in finding practical approaches to approximate the exact distributions.
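As a quick illustration of the model just described, the following sketch simulates the flat-fading channel and shows how strongly the received signal is modulated by the slowly varying fading coefficient; the autoregression coefficient and noise levels are arbitrary choices for the purpose of the demonstration, not values from the text.

import numpy as np

rng = np.random.default_rng(5)
n, phi, sigma_u, sigma_v = 500, 0.99, 0.1, 0.3   # illustrative parameters

c = rng.choice([-1, 1], size=n)                  # i.i.d. emitted bits C_k
w = np.empty(n)                                   # fading coefficients W_k
w[0] = rng.standard_normal()
for k in range(n - 1):
    w[k + 1] = phi * w[k] + sigma_u * rng.standard_normal()
y = w * c + sigma_v * rng.standard_normal(n)      # Y_k = W_k C_k + V_k

# naive detector ignoring the fading (sign of Y_k) versus a genie knowing sign(W_k)
naive = np.mean(np.sign(y) != c)
genie = np.mean(np.sign(y * np.sign(w)) != c)
print("bit error rate, naive:", naive, " knowing the fading sign:", genie)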


Conditionally Gaussian models related to the previous example are also commonly used to approximate non-Gaussian state-space models. Imagine that we are interested in the linear model given by Eqs. (1.7)-(1.8) with both noise sequences still being i.i.d. but at least one of them with a non-Gaussian distribution. Assuming a very general form of the noise distribution would directly lead us into the world of (general) continuous state-space HMMs. As a middle ground, we may however assume that the distribution of the noise is a finite mixture of Gaussian distributions. Let {C_k}_{k≥0} denote an i.i.d. sequence of random variables taking values in a set C, which can be finite or infinite. We refer to these variables as the indicator variables when C is finite and latent variables otherwise. To model non-Gaussian system dynamics we will typically replace the evolution equation (1.7) by

W_{k+1} = μ_W(C_{k+1}) + A(C_{k+1}) W_k + R(C_{k+1}) U_k ,   U_k ∼ N(0, I) ,

where μ_W, A and R are respectively vector-valued and matrix-valued functions of suitable dimensions on C. When C = {1, . . . , r} is finite, the distribution of the noise μ_W(C_{k+1}) + R(C_{k+1}) U_k driving the state equation is a finite mixture of multivariate Gaussian distributions,

Σ_{i=1}^r m_i N(μ_W(i), R(i) R^t(i))   with   m_i = P(C_0 = i) .

Another option consists in using the same modeling to represent non-Gaussian observation noise by replacing the observation equation (1.8) by

Y_k = μ_Y(C_k) + B(C_k) W_k + S(C_k) V_k ,   V_k ∼ N(0, I) ,

where μ_Y, B and S are respectively vector-valued and matrix-valued functions of suitable dimensions on C. Of course, by doing this the state of the HMM has to be extended to the joint process {X_k}_{k≥0}, where X_k = (W_k, C_k), taking values in the product set X × C. At first sight, it is not obvious that anything has been gained at all by introducing additional mixture indices with respect to our basic objective, which is to allow for linear state-space models with non-Gaussian noises. We shall see however in Chapter 8 that the availability of computational procedures that evaluate quantities such as E[W_k | Y_0, . . . , Y_k, C_0, . . . , C_k] is a distinct advantage of conditionally linear state-space models over more general (unstructured) continuous state-space HMMs. Conditionally Gaussian linear state-space models (CGLSSMs) have found an exceptionally broad range of applications.

Example 1.3.10 (Change Point Detection). A simple yet useful example of CGLSSMs appears in change point detection problems (Shumway and Stoffer, 1991; Fearnhead, 1998). In a Gaussian linear state-space model, the dynamics of the state depends on the state transition matrix and on the state noise covariance. These quantities may change over time, and if the changes, when they occur, do so unannounced and at unknown time points, then the associated inferential problem is referred to as a change point problem. Various important application areas of statistics involve change detection in a central way (for instance, environmental monitoring, quality assurance, biology). In the simplest change point problem, the state variable is the level

Fig. 1.7. Left: well-log data waveform with a median smoothing estimate of the state. Right: median smoothing residual.

of a quantity of interest, which is modeled as a step function; the time instants at which the step function jumps are the change points. An example of this situation is provided by the well-log data considered in Chapter 5 of the book by Ó Ruanaidh and Fitzgerald (1996) and analyzed, among others, by Fearnhead (1998) and Fearnhead and Clifford (2003). In this example, the data, which is plotted in Figure 1.7, consists of measurements of the nuclear magnetic response of underground rocks that are obtained whilst drilling for oil. The data contains information about the rock structure that is being drilled through. In particular, it contains information about boundaries between rock strata; jumps in the step function relate to the rock strata boundaries. As can be seen from the data, the underlying state is a step function, which is corrupted by a fairly large amount of noise. It is the position of these jumps that one needs to estimate. To model this situation, we put C = {0, 1}, where C_k = 0 means that there is no change point at time index k, whereas C_k = 1 means that a change point has occurred. The state-space model is

W_{k+1} = A(C_{k+1}) W_k + R(C_{k+1}) U_k ,
Y_k = W_k + V_k ,

where A(0) = I, R(0) = 0 and A(1) = 0, R(1) = R. The simplest model consists in taking for {C_k}_{k≥0} an i.i.d. sequence of Bernoulli random variables with probability of success p. The time between two change points (the period of time during which the state variable is constant) is then distributed as a geometric random variable with mean 1/p;

W_{k+1} = U_k   with probability p ,
W_{k+1} = W_k   otherwise .   (1.16)
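A few lines of Python make the change point mechanism explicit; the change probability, level scale, and noise level below are illustrative values, not fitted to the well-log data.

import numpy as np

rng = np.random.default_rng(6)
n, p, level_scale, sigma_v = 400, 0.01, 1.0, 0.3   # illustrative parameters

c = rng.random(n) < p                 # C_k = 1 signals a change point
w = np.empty(n)
w[0] = level_scale * rng.standard_normal()
for k in range(1, n):
    # a new level is drawn at a change point, otherwise the level is kept
    w[k] = level_scale * rng.standard_normal() if c[k] else w[k - 1]
y = w + sigma_v * rng.standard_normal(n)            # noisy observations

print("number of change points:", int(c[1:].sum()))
print("distinct levels visited:", len(np.unique(np.round(w, 6))))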


It is possible to allow a more general form for the prior distribution of the durations of the periods by introducing dependence among the indicator variables. Note that it is also possible to consider such multiple change point models under the different, although strictly equivalent, perspective of a Bayesian model with an unknown number of parameters. In this alternative representation, the hidden state trajectory is parameterized by the succession of its levels (between two change points), which thus form a variable dimension set of parameters (Green, 1995; Lavielle and Lebarbier, 2001). Bayesian inference about such parameters, equipped with a suitable prior distribution, is then carried out using simulation-based techniques to be discussed further in Chapter 13.

Example 1.3.11 (Linear State-Space Model with Observational Outliers and Heavy-Tailed Noise). Another interesting application of conditionally Gaussian linear state-space models pertains to the field of robust statistics (Schick and Mitter, 1994). In the course of model building and validation, statisticians are often confronted with the problem of dealing with outliers. Routinely ignoring unusual observations is neither wise nor statistically sound, as such observations may contain valuable information about unmodeled system characteristics, model degradation and breakdown, measurement errors, and so forth. The well-log data considered in the previous example illustrates this situation. A visual inspection of the nuclear response reveals the presence of outliers, which tend to clump together in bursts (or clusters). This is confirmed when plotting the quantile-quantile regression plot (see Figure 1.8) of the residuals of the well-log data obtained from a crude moving median estimate of the state variable (the median filter applies a sliding window to a sequence and outputs the median value of all points in the window as a smoothed estimate at the window center). It can be seen that the normal distribution does not fit the measurement noise well in the tails. Following Fearnhead and Clifford (2003), we model the measurement noise as a mixture of two Gaussian distributions. The model can be written

W_{k+1} = A(C_{k+1,1}) W_k + R(C_{k+1,1}) U_k ,   U_k ∼ N(0, 1) ,
Y_k = μ(C_{k,2}) + B(C_{k,2}) W_k + S(C_{k,2}) V_k ,   V_k ∼ N(0, 1) ,

where C_{k,1} ∈ {0, 1} and C_{k,2} ∈ {0, 1} are indicators of a change point and of the presence of an outlier, respectively. As above, the level is assumed to be constant between two change points. Therefore we put A(0) = 1, R(0) = 0, A(1) = 0, and R(1) = σ_U. When there is no outlier, that is, C_{k,2} = 0, we assume that the level is observed in additive Gaussian noise. Therefore {μ(0), B(0), S(0)} = (0, 1, σ_{V,0}). In the presence of an outlier, the measurement no longer carries information about the current value of the level, that is, B(1) = 0, and the measurement noise is assumed to follow a Gaussian distribution with negative mean and (large) variance σ²_{V,1}. Therefore

Fig. 1.8. Quantile-quantile regression of empirical quantiles of the well-log data residuals with respect to quantiles of the standard normal distribution.

{μ(1), B(1), S(1)} = (μ, 0, σ_{V,1}). One possible model for {C_{k,2}} would be a Bernoulli model in which we could include information about the ratio of outliers/non-outliers in the success probability. However, this does not incorporate any information about the way samples of outliers cluster together, as samples are assumed independent in such a model. A better model might be a two-state Markov chain in which the state transition probabilities allow a preference for cohesion within outlier bursts and non-outlier sections. Similar models have been used for audio signal restoration, where an outlier is a local degradation of the signal (click, scratch, etc.). There are, of course, many additional degrees of freedom in the framework of CGLSSMs. For example, Ó Ruanaidh and Fitzgerald (1996) claimed that the distribution of the measurement noise in the clean segments (segments free from outliers) of the nuclear response measurements has tails heavier than those of the Gaussian distribution, and they advocated a Laplacian additive noise model. The use of heavy-tailed distributions to model either the state noise or the measurement noise, which finds its roots in the field of robust statistics, is very popular and has been worked out in many different fields. One can of course consider using Laplace, Weibull, or Student t-distributions, depending on the expected size of the tails, but if one is willing to exploit the full strength of conditionally Gaussian linear systems, it is wiser to consider using Gaussian scale mixtures. A random vector V is a Gaussian scale mixture if it can be expressed as the product of a Gaussian vector W with zero mean and identity covariance matrix and an independent positive scalar random variable C: V = C W (Andrews and Mallows, 1974). The variable C is the multiplier or the scale. If C has finite support, then V is a finite mixture of Gaussian vectors, whereas if C has a density with respect to Lebesgue measure on R, then V is a continuous mixture of Gaussian vectors. Gaussian scale mixtures are symmetric, zero mean, and have leptokurtic marginal densities (tails heavier than those of a Gaussian distribution).
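The scale-mixture construction is easy to verify numerically. In the sketch below, an illustration rather than a prescription from the text, the scale C is drawn so that C W has a Student t distribution, one classical example of a continuous Gaussian scale mixture, and the excess kurtosis relative to a Gaussian shows up immediately.

import numpy as np

rng = np.random.default_rng(7)
n, nu = 200_000, 6                        # nu: degrees of freedom of the target t law

def excess_kurtosis(x):
    x = x - x.mean()
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0

w = rng.standard_normal(n)                # Gaussian variables with unit variance
c = np.sqrt(nu / rng.chisquare(nu, n))    # positive scale, independent of w
v = c * w                                 # Gaussian scale mixture: Student t_nu

print("excess kurtosis of V:", round(excess_kurtosis(v), 2))   # positive: heavy tails
print("excess kurtosis of W:", round(excess_kurtosis(w), 2))   # close to 0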


1.3.5 General (Continuous) State-Space HMMs

Example 1.3.12 (Bearings-only Tracking). Bearings-only tracking concerns online estimation of a target trajectory when the observations consist solely of the directions of arrival (bearings) of a plane wavefront radiated by a target, as seen from a known observer position (which can be fixed but is, in most applications, moving). The measurements are blurred by noise, which accounts for the errors occurring when estimating the bearings. In this context, the range information (the distance between the object and the sensor) is not available. The target is usually assumed to be traveling in a two-dimensional space, the state of the target being its position and its velocity. Although the observations occur at regularly spaced instants, we describe the movement of the object in continuous time to be able to define the derivatives of the motion. The system model that we describe here is similar to that used in Gordon et al. (1993) and Chapter 6 of Ristic et al. (2004); see also Pitt and Shephard (1999) and Carpenter et al. (1999). The state vector at time kT is X_k = (P_{x,k}, Ṗ_{x,k}, P_{y,k}, Ṗ_{y,k})^t, representing the target's position at time kT and its velocity, where T denotes the sampling period. One possible discretization of this model, based on a second order Taylor expansion, is given by (Gordon et al., 1993)

X_{k+1} = A X_k + R U_k ,

where

A = \begin{pmatrix} 1 & T & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & T \\ 0 & 0 & 0 & 1 \end{pmatrix} ,   R = σ_U \begin{pmatrix} T^2/2 & 0 \\ T & 0 \\ 0 & T^2/2 \\ 0 & T \end{pmatrix} ,   (1.17)

and {U_k}_{k≥0} is bivariate standard white Gaussian noise, U_k ∼ N(0, I_2). The scale σ_U characterizes the magnitude of the random fluctuations of the acceleration between two sampling points. The initial position X_0 is multivariate Gaussian with mean (μ_x, μ_ẋ, μ_y, μ_ẏ)^t and covariance matrix diag(σ²_x, σ²_ẋ, σ²_y, σ²_ẏ). The measurements {Y_k}_{k≥0} are modeled as

Y_k = tan⁻¹( (P_{y,k} − R_{y,k}) / (P_{x,k} − R_{x,k}) ) + σ_V V_k ,   (1.18)

where {V_k}_{k≥0} is white Gaussian noise with zero mean and unit variance, and (R_{x,k}, R_{y,k}) is the (known) observer position. It is moreover assumed that {U_k} and {V_k} are independent. One important feature of this model is that the amount of information about the range of the target that is present in the measurements is, in general, small. The only range information in the observations arises through the knowledge of the state equations, which are informative about the maneuvers that the target is likely to perform. Therefore, the majority of the range information contained in the model is that which is included in the prior model of the target motion.
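To illustrate the non-linear measurement equation (1.18), the fragment below simulates a short constant-velocity trajectory and the corresponding noisy bearings seen from a fixed observer at the origin; all numerical values are placeholders chosen just to make the example run.

import numpy as np

rng = np.random.default_rng(8)
T, n, sigma_u, sigma_v = 1.0, 50, 0.05, 0.02

A = np.array([[1, T, 0, 0], [0, 1, 0, 0], [0, 0, 1, T], [0, 0, 0, 1]], float)
R = sigma_u * np.array([[T**2 / 2, 0], [T, 0], [0, T**2 / 2], [0, T]])

x = np.array([10.0, -0.2, 5.0, 0.1])     # initial position and velocity
observer = np.array([0.0, 0.0])          # fixed observer at the origin
bearings = []
for _ in range(n):
    dx, dy = x[0] - observer[0], x[2] - observer[1]
    # arctan2 resolves the quadrant, a common way to implement the tan^-1 in (1.18)
    bearings.append(np.arctan2(dy, dx) + sigma_v * rng.standard_normal())
    x = A @ x + R @ rng.standard_normal(2)

print("first bearings (radians):", np.round(bearings[:5], 3))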


Fig. 1.9. Two-dimensional bearings-only target tracking geometry.

Example 1.3.13 (Stochastic Volatility). The distributional properties of speculative prices have important implications for several financial models. Let S_k be the price of a financial asset, such as a share price, stock index, or foreign exchange rate, at time k. Instead of the prices, it is more customary to consider the relative returns (S_k − S_{k−1})/S_{k−1} or the log-returns log(S_k/S_{k−1}), which both describe the relative change over time of the price process. In what follows we often refer, for short, to returns instead of relative or log-returns (see Figure 1.10). The unit of the discrete time index k may be for example an hour, a day, or a month. The famous Black-Scholes model, which is a continuous-time model and postulates a geometric Brownian motion for the price process, corresponds to log-returns that are i.i.d. with a Gaussian N(μ, σ²) distribution, where σ is the volatility (volatility is the word used in econometrics for standard deviation). The Black and Scholes option pricing model provides the foundation for the modern theory of option valuation. In actual applications, however, this model has certain well-documented deficiencies. Data from financial markets clearly indicate that the distribution of returns usually has tails that are heavier than those of the normal distribution (see Figure 1.11). In addition, even though the returns are approximately uncorrelated over time (as predicted by the Black and Scholes model), they are not independent. This can be readily verified by the fact that the sample autocorrelations of the absolute values (or squares) of the returns are non-zero for a large number of lags (see Figure 1.12). Whereas the former

Fig. 1.10. Left: opening values of the Standard and Poor's index 500 (S&P 500) over the period January 2, 1990 to August 25, 2000. Right: log-returns of the opening values of the S&P 500, same period.
Fig. 1.11. Left: histogram of S&P 500 log-returns. Right: quantile-quantile regression plot of empirical quantiles of S&P 500 log-returns against quantiles of the standard normal distribution.

property indicates that the returns can be modeled by a white noise sequence (a stationary process with zero autocorrelation at all positive lags), the latter property indicates that the returns are dependent and that the dependence may even span a rather long period of time. The variance of returns tends to change over time: the large and small values in the sample occur in clusters. Large changes tend to be followed by

1.3 Examples

27

Fig. 1.12. Left: correlation coefficients of S&P 500 log-returns over the period January 2, 1990 to August 25, 2000. The dashed lines are 95% confidence bands (±1.96/√n) corresponding to the autocorrelation function of i.i.d. white Gaussian noise. Right: correlation coefficients of absolute values of log-returns, same period.

large changes, of either sign, and small changes tend to be followed by small changes, a phenomenon often referred to as volatility clustering. Most models for return data that are used in practice are of a multiplicative form,

Y_k = σ_k V_k ,   (1.19)

where {V_k}_{k≥0} is an i.i.d. sequence and the volatility process {σ_k}_{k≥0} is a non-negative stochastic process such that σ_k and V_k are independent for all k. Mostly, {σ_k} is assumed to be strict sense stationary. It is often assumed that V_k is symmetric or, at least, has zero mean. The rationale for using these models is quite simple. First of all, the direction of the price changes is modeled by the sign of V_k only, independently of the order of magnitude of this change, which is directed by the volatility. Because σ_k and V_k are independent and V_k is assumed to have unit variance, σ_k² is then the conditional variance of Y_k given σ_k. Most models assume that σ_k is a function of past values. The simplest model assumes that σ_k is a function of the squares of the previous observations. This leads to the celebrated autoregressive conditional heteroscedasticity (ARCH) model developed by Engle (1982),

Y_k = X_k^{1/2} V_k ,
X_k = α_0 + Σ_{i=1}^p α_i Y_{k−i}² ,   (1.20)


where α_0, . . . , α_p are non-negative constants. In the Engle (1982) model, {V_k} is normal; hence the conditional error distribution is normal, but with conditional variance equal to a linear function of the p past squared observations. ARCH models are thus able to reproduce the tendency for extreme values to be followed by other extreme values, but of unpredictable sign. The autoregressive structure can be seen by the following argument. Writing ν_k = Y_k² − X_k = X_k(V_k² − 1), one obtains

Y_k² − Σ_{i=1}^p α_i Y_{k−i}² = α_0 + ν_k .   (1.21)
Because {V_k} is an i.i.d. sequence with zero mean and unit variance, {ν_k}_{k≥0} is an uncorrelated sequence. Because ARCH(p) processes do not fit log-returns very well unless the order p is quite large, various people have thought about improvements. As (1.21) bears some resemblance to an AR structure, a possible generalization is to introduce an ARMA structure. This construction leads to the so-called GARCH(p, q) process (Bollerslev et al., 1994). This model displays some striking similarities to autoregressive models with Markov regime; this will be discussed in more detail below. An alternative to the ARCH/GARCH framework is a model in which the variance is specified to follow some latent stochastic process. Such models, referred to as stochastic volatility (SV) models, appear in the theoretical literature on option pricing and exchange rate modeling. In contrast to GARCH-type processes, there is no direct feedback from past returns to the volatility process, which has been questioned as unnatural by some authors. Empirical versions of the SV model are typically formulated in discrete time, which makes inference problems easier to deal with. The canonical model in SV for discrete-time data is (Hull and White, 1987; Jacquier et al., 1994)

X_{k+1} = φ X_k + σ U_k ,   U_k ∼ N(0, 1) ,
Y_k = β exp(X_k/2) V_k ,   V_k ∼ N(0, 1) ,   (1.22)

where the observations {Y_k}_{k≥0} are the log-returns, {X_k}_{k≥0} is the log-volatility, which is assumed to follow a stationary autoregression of order 1, and {U_k}_{k≥0} and {V_k}_{k≥0} are independent i.i.d. sequences. The parameter β plays the role of the constant scaling factor, φ is the persistence (memory) in the volatility, and σ is the volatility of the log-volatility. Despite a very parsimonious representation, this model is capable of exhibiting a wide range of behaviors. Like ARCH/GARCH models, the model can give rise to a high persistence in volatility (volatility clustering). Even with φ = 0, the model is a Gaussian scale mixture that will give rise to excess kurtosis in the marginal distribution of the data. In ARCH/GARCH models with normal errors, the degree of kurtosis is tied to the roots of the volatility equation; as the volatility becomes more correlated, the degree of kurtosis also increases. In the stochastic volatility model, the parameter σ governs the degree of mixing independently of the degree of smoothness in the variance evolution.
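The volatility clustering produced by (1.22) can be seen with a handful of lines of code; the parameter values below are typical illustrative choices, not estimates reported in the text.

import numpy as np

rng = np.random.default_rng(9)
n, phi, sigma, beta = 2000, 0.98, 0.2, 0.01   # illustrative SV parameters

x = np.empty(n)                                # log-volatility X_k
x[0] = sigma / np.sqrt(1 - phi**2) * rng.standard_normal()   # stationary start
for k in range(n - 1):
    x[k + 1] = phi * x[k] + sigma * rng.standard_normal()
y = beta * np.exp(x / 2) * rng.standard_normal(n)             # log-returns Y_k

abs_y = np.abs(y) - np.abs(y).mean()
acf_lag1 = np.mean(abs_y[1:] * abs_y[:-1]) / np.mean(abs_y ** 2)
print("lag-1 autocorrelation of |Y_k|:", round(acf_lag1, 3))   # markedly positive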


It is interesting to note that stochastic volatility models are related to conditionally Gaussian linear state-space models. By taking logarithms of the squared returns, one obtains

X_k = φ X_{k−1} + σ U_{k−1} ,
log Y_k² = log β² + X_k + Z_k ,   where Z_k = log V_k² .

If V_k is standard normal, Z_k follows the log χ²₁ distribution. This distribution may be approximated with arbitrary accuracy by a finite mixture of Gaussian distributions, and then the SV model becomes a conditionally Gaussian linear state-space model (Sandmann and Koopman, 1998; Durbin and Koopman, 2000). This time, the latent variable C_k is the mixture component and the model writes

W_{k+1} = φ W_k + σ U_k ,   U_k ∼ N(0, 1) ,
Y_k = W_k + (μ(C_k) + σ_V(C_k) V_k) ,   V_k ∼ N(0, 1) .

This representation of the stochastic volatility model may prove useful when deriving numerical algorithms to filter the hidden state or estimate the model parameters.

1.3.6 Switching Processes with Markov Regime

We now consider several examples that are not HMMs but belong to the class of Markov-switching models already mentioned in Section 1.2. Perhaps the most famous example of Markov-switching processes is the switching autoregressive process that was introduced by Hamilton (1989) to model econometric data.

1.3.6.1 Switching Linear Models

A switching linear autoregression is a model of the form

Y_k = μ(C_k) + Σ_{i=1}^d a_i(C_k)(Y_{k−i} − μ(C_{k−i})) + σ(C_k) V_k ,   k ≥ 1 ,   (1.23)
where {C_k}_{k≥0}, called the regime, is a Markov chain on a finite state space C = {1, 2, . . . , r}, and {V_k}_{k≥0} is white noise independent of the regime; the functions μ : C → R, a_i : C → R, i = 1, . . . , d, and σ : C → R describe the dependence of the parameters on the realized regime. In this model, we change only the scale of the innovation as a function of the regime, but we can of course more drastically change the innovation distribution conditional on each state.


Remark 1.3.14. A model closely related to (1.23) is

Y_k = μ(C_k) + Σ_{i=1}^d a_i(C_k) Y_{k−i} + σ(C_k) V_k ,   k ≥ 1 .   (1.24)
In (1.23), μ(C_k) is the mean of Y_k conditional on the sequence of states C_1, . . . , C_k, whereas in (1.24) the shift is on the intercept of the autoregressive process. A model like this is not an HMM because, given {C_k}, the Y_k are not conditionally independent but rather form a non-homogeneous autoregression. Hence it is a Markov-switching model. Obviously, the conditional distribution of Y_k does not only depend on C_k and Y_{k−1} but also on other lagged Cs and Y s back to C_{k−d} and Y_{k−d}. By vectorizing the Y s and Cs, that is, stacking them in groups of d elements, we can obtain a process whose conditional distribution depends on one lagged variable only, as in Figure 1.2. This model can be rewritten in state-space form. Let Y_k = [Y_k, Y_{k−1}, . . . , Y_{k−d+1}]^t, C_k = [C_k, C_{k−1}, . . . , C_{k−d+1}]^t, μ(C_k) = [μ(C_k), . . . , μ(C_{k−d+1})]^t, V_k = [V_k, 0, . . . , 0]^t, and denote by A(c) the d × d companion matrix associated with the autoregressive coefficients of the state c,

A(c) = \begin{pmatrix} a_1(c) & a_2(c) & \cdots & \cdots & a_d(c) \\ 1 & 0 & \cdots & & 0 \\ 0 & 1 & \cdots & & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & \cdots & 0 & 1 & 0 \end{pmatrix} .   (1.25)

The stacked observation vector Y_k then satisfies

Y_k = μ(C_k) + A(C_k) (Y_{k−1} − μ(C_{k−1})) + σ(C_k) V_k .   (1.26)

Interestingly enough, switching autoregressive processes have a rather rich probabilistic structure and have proven to be useful in many different contexts. We focus here on applications in econometrics and finance, but the scope of potential applications of these models spans many different areas.

Example 1.3.15 (Regime Switches in Econometrics). The Hamilton (1989) model for the U.S. business cycle fostered a great deal of interest in Markov-switching autoregressive models as an empirical vehicle for characterizing macro-economic fluctuations. This model provides a formal statistical


representation of the old idea that expansion and contraction constitute two distinct economic phases: Hamilton's model assumes that a macro-economic aggregate (real output growth, a country's gross national product measured per quarter, annum, etc.) follows one of two different autoregressions depending on whether the economy is expanding or contracting, with the shift between regimes governed by the outcome of an unobserved Markov chain. The simple business cycle model advocated by Hamilton takes the form

Y_k = μ(C_k) + Σ_{i=1}^d a_i (Y_{k−i} − μ(C_{k−i})) + V_k ,   (1.27)

where {V_k}_{k≥0} is white Gaussian noise with zero mean and unit variance, and {C_k}_{k≥0} is the unobserved latent variable that reflects the state of the business cycle (the autoregressive coefficients do not change; only the mean of the process is effectively modulated). In the simplest model, {C_k} takes only two values; for example, C_k = 0 could indicate that the economy is in recession and C_k = 1 that it is in expansion. When C_k = 0, the average growth rate is given by μ(0), whereas when C_k = 1 the average growth rate is μ(1). This simple model can be made more sophisticated by making the variance a function of the state C_k as well,

Y_k = μ(C_k) + Σ_{i=1}^d a_i (Y_{k−i} − μ(C_{k−i})) + σ(C_k) V_k .
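A compact simulation of this two-regime business cycle model is given below; the transition probabilities, regime means, and AR coefficient are invented for illustration and are not Hamilton's estimates.

import numpy as np

rng = np.random.default_rng(10)
n, a1 = 400, 0.3                                   # one AR lag for simplicity
mu = {0: -0.5, 1: 1.0}                             # recession / expansion means
P = np.array([[0.90, 0.10],                        # regime transition matrix
              [0.05, 0.95]])

c = np.empty(n, dtype=int)
y = np.empty(n)
c[0], y[0] = 1, mu[1]
for k in range(1, n):
    c[k] = rng.choice(2, p=P[c[k - 1]])
    y[k] = mu[c[k]] + a1 * (y[k - 1] - mu[c[k - 1]]) + rng.standard_normal()

print("fraction of periods in expansion:", round(c.mean(), 2))
print("mean growth in expansion vs recession:",
      round(y[c == 1].mean(), 2), round(y[c == 0].mean(), 2))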

The Markov assumption on the hidden states basically says that if the economy was, say, in expansion the last period, the probability of going into recession is a fixed constant that does not depend on how long the economy has been in expansion or on other measures of the strength of the expansion. This assumption, though rather naive, does not appear to be a bad representation of historical experience, though several researchers have suggested that more complicated specifications of the transition matrix ought to be considered. Further reading on applications of switching linear Gaussian autoregressions in economics and finance can be found in, for instance, Krolzig (1997), Kim and Nelson (1999), Raj (2002), and Hamilton and Raj (2003). It is possible to include an additional degree of sophistication by considering, instead of a linear autoregression, linear state-space models (see for instance Tugnait, 1984; West and Harrison, 1989; Kim and Nelson, 1999; Doucet et al., 2000a; Chen and Liu, 2000):

W_{k+1} = μ_W(C_{k+1}) + A(C_{k+1}) W_k + R(C_{k+1}) U_k ,
Y_k = μ_Y(C_k) + B(C_k) W_k + S(C_k) V_k ,   (1.28)

where {C_k}_{k≥0} is a Markov chain on a discrete state space, {U_k}_{k≥0} and {V_k}_{k≥0} are mutually independent i.i.d. sequences independent of {C_k}_{k≥0},


and μ_W, μ_Y, A, B, R, and S are vector- and matrix-valued functions of appropriate dimensions. Each state of the underlying Markov chain is then associated with a particular regime of the dynamic system, specified by particular values of (μ_W, μ_Y, A, B, R, S) governing the behavior of the state and observations. Switching linear state-space models approximate complex non-linear dynamics with a dynamic mixture of linear processes. This type of model has found a broad range of applications in econometrics (Kim and Nelson, 1999) and in engineering, including control (hybrid systems, target tracking), signal processing (blind channel equalization), and communications (interference suppression) (Doucet et al., 2000b, 2001b).

Example 1.3.16 (Maneuvering Target). Recall that in Example 1.3.12, we considered the motion of a single target that evolves in 2-D space with (almost) constant velocity. To represent changes in the velocity (either speed or direction or both), we redefine the model that describes the evolution of the state W_k = (P_{x,k}, Ṗ_{x,k}, P_{y,k}, Ṗ_{y,k}) by making it conditional upon a maneuver indicator C_k = c_k ∈ {1, . . . , r} that is assumed to take only a finite number of values corresponding to various predefined maneuver scenarios. The state now evolves according to the following conditionally Gaussian linear equation

W_{k+1} = A(C_{k+1}) W_k + R(C_{k+1}) U_k ,   U_k ∼ N(0, I) ,

where A(c) and R(c) describe the parameters of the dynamic system characterizing the motion of the target for the maneuver labeled by c. Assuming that the observations are linear, Y_k = B W_k + V_k, the system is a switching Gaussian linear state-space model.

1.3.6.2 Switching Non-linear Models

Switching autoregressive processes with Markov regime can be generalized by allowing non-linear autoregressions. Such models were considered in particular by Francq and Roussignol (1997) and take the form

Y_k = f(Y_{k−1}, . . . , Y_{k−d}, X_k) + σ(Y_{k−1}, . . . , Y_{k−d}, X_k) V_k ,   (1.29)

where {X_k}_{k≥0}, called the regime, is a Markov chain on a discrete state space X, {V_k} is an i.i.d. sequence, independent of the regime, with zero mean and unit variance, and f : R^d × X → R and σ : R^d × X → R_+ are (measurable) functions. Of particular interest are the switching ARCH models (Francq et al., 2001),

Y_k = (α_0(X_k) + α_1(X_k) Y_{k−1}² + · · · + α_d(X_k) Y_{k−d}²)^{1/2} V_k .

Krishnamurthy and Rydén (1998) studied an even more general class of switching autoregressive processes that do not necessarily admit an additive decomposition; these are characterized by

Y_k = f(Y_{k−1}, . . . , Y_{k−d}, X_k, V_k) ,   (1.30)


where {X_k}_{k≥0}, the regime, is a Markov chain on a discrete state space, {V_k}_{k≥0} is an i.i.d. sequence independent of the regime, and f : R^d × X × R → R is a (measurable) function. Conditional on the regime, {Y_k} is thus a dth order Markov chain on a general state space. Douc et al. (2004) studied the same kind of model but allowed the regime to evolve on a general state space.

Example 1.3.17 (Switching ARCH Models). Hamilton's (1989) switching autoregression (1.27) models a change in the business cycle phase as a shift in the average growth rate. By contrast, Hamilton and Susmel (1994) modeled changes in the volatility of the stock market as a shift in the overall scale of the ARCH process modeling stock returns. They suggested to model the monthly excess return of a financial asset (for example, the excess return of a financial index over the treasury bill yield) as

W_k = (a_0 + a_1 W_{k−1}² + · · · + a_m W_{k−m}²)^{1/2} U_k ,
Y_k = α_0 + α_1 Y_{k−1} + · · · + α_q Y_{k−q} + λ(C_k) W_k .   (1.31)

where {U_k}_{k≥0} is Gaussian white noise with zero mean and {C_k}_{k≥0} is an unobserved Markov chain on a discrete state space that represents the volatility phase of the stock market; {C_k} and {U_k} are independent. In the absence of such phases, the parameter λ(C_k) would simply be constant over k, and (1.31) would describe stock returns by an autoregressive model whose innovations follow an mth order ARCH process. More generally, when the function λ : C → R_+ is not identically equal to unity, the latent ARCH process W_k is multiplied by a scale factor λ(C_k) representing the current phase C_k that characterizes overall stock volatility. Assuming again that the market has two phases, C = {0, 1}, and normalizing λ(0) = 1, λ(1) has the interpretation of the ratio of the average variance of stock returns when C_k = 1 compared to that observed when C_k = 0.

1.4 Left-to-Right and Ergodic Hidden Markov Models


Most HMMs fall into one of two principally different classes of models: left-to-right HMMs and ergodic HMMs. By a left-to-right HMM is meant an HMM with a Markov chain that starts in a particular initial state, traverses a number of intermediate states, and finally terminates in a final state (this state may be considered as absorbing). When traversing the intermediate states the chain may not go backwards, toward the initial state, but only toward the final state. This progression is usually pictured from left to right; thus the term left-to-right HMM. Speech recognition, discussed in Example 1.3.6 above, is typically a case where only left-to-right HMMs are used. A left-to-right HMM is not ergodic, but produces a sequence, typically of random length, of output. The number of states is also usually large.

In contrast, an ergodic HMM is one for which the underlying Markov chain is ergodic, or at least is irreducible and admits a unique stationary distribution (thus allowing for periodicity). Such a model can thus produce an infinitely long sequence of output, which is an ergodic sequence as well. The number of states, if the state space is finite, is typically small. Most of the examples mentioned in Section 1.3 correspond to ergodic HMMs. Left-to-right HMMs and ergodic HMMs have much in common, in particular on the computational side. Indeed, computational algorithms like the EM algorithm, which is widely used for HMMs, may be implemented similarly whatever the structure of the Markov chain. Of course, because left-to-right HMMs often have many states, in such models it is often considerably more difficult to find the maximum likelihood estimator, say, among all local maxima of the likelihood function. Having said that, when it comes to matters of theoretical statistics, there are noticeable differences between ergodic and left-to-right HMMs. Inference in left-to-right HMMs cannot be based on a single observed sequence of output, but is based on many, usually independent sequences. In contrast, inference in ergodic HMMs is usually based on a single long observed sequence, within which there is no independence. For this reason, issues regarding asymptotics of estimators and statistical tests are to be treated quite differently. For ergodic HMMs, one cannot rely on statistical theory for i.i.d. data but must develop specific methods. This development was initiated in the late 1960s by Baum and Petrie (1966) but was not continued until the 1990s. The case of left-to-right HMMs is simpler because it involves only independent observations, even though each observation is a sequence of random length. It should however be stressed that, when dealing with left-to-right HMMs, finding the global maximum of the log-likelihood function, that is, the maximum likelihood estimator, or computing confidence intervals for parameters, etc., is not always a main goal, as for left-to-right HMMs the focus is often on how the model performs with respect to the particular application at hand: how good is the DNA sequence alignment; how large is the percentage of correctly recognized words, etc.? Indeed, even comparisons between models of different structure are often done by evaluating their performance on the actual application rather than applying statistical model selection procedures. For these reasons, one can argue that left-to-right HMMs are often applied in a data fitting or data mining way, rather than in a statistical way. Throughout this book, most examples given are based on ergodic HMMs, but the methodologies described are with few exceptions applicable to left-to-right HMMs either directly or after minor modifications.

2 Main Definitions and Notations

We now formally describe hidden Markov models, setting the notations that will be used throughout the book. We start by reviewing the basic definitions and concepts pertaining to Markov chains.

2.1 Markov Chains


2.1.1 Transition Kernels

Definition 2.1.1 (Transition Kernel). Let $(\mathsf{X}, \mathcal{X})$ and $(\mathsf{Y}, \mathcal{Y})$ be two measurable spaces. An unnormalized transition kernel from $(\mathsf{X}, \mathcal{X})$ to $(\mathsf{Y}, \mathcal{Y})$ is a function $Q : \mathsf{X} \times \mathcal{Y} \to [0, \infty]$ that satisfies
(i) for all $x \in \mathsf{X}$, $Q(x, \cdot)$ is a positive measure on $(\mathsf{Y}, \mathcal{Y})$;
(ii) for all $A \in \mathcal{Y}$, the function $x \mapsto Q(x, A)$ is measurable.
If $Q(x, \mathsf{Y}) = 1$ for all $x \in \mathsf{X}$, then $Q$ is called a transition kernel, or simply a kernel. If $\mathsf{X} = \mathsf{Y}$ and $Q(x, \mathsf{X}) = 1$ for all $x \in \mathsf{X}$, then $Q$ will also be referred to as a Markov transition kernel on $(\mathsf{X}, \mathcal{X})$.
An (unnormalized) transition kernel $Q$ is said to admit a density with respect to the positive measure $\mu$ on $(\mathsf{Y}, \mathcal{Y})$ if there exists a non-negative function $q : \mathsf{X} \times \mathsf{Y} \to [0, \infty]$, measurable with respect to the product σ-field $\mathcal{X} \otimes \mathcal{Y}$, such that
$$Q(x, A) = \int_A q(x, y)\,\mu(dy)\;, \qquad A \in \mathcal{Y}\;.$$
The function $q$ is then referred to as an (unnormalized) transition density function. When $\mathsf{X}$ and $\mathsf{Y}$ are countable sets it is customary to write $Q(x, y)$ as a shorthand notation for $Q(x, \{y\})$, and $Q$ is generally referred to as a transition matrix (whether or not $\mathsf{X}$ and $\mathsf{Y}$ are finite sets).
We summarize below some key properties of transition kernels, introducing important pieces of notation that are used in the following.

Let $Q$ and $R$ be unnormalized transition kernels from $(\mathsf{X}, \mathcal{X})$ to $(\mathsf{Y}, \mathcal{Y})$ and from $(\mathsf{Y}, \mathcal{Y})$ to $(\mathsf{Z}, \mathcal{Z})$, respectively. The product $QR$, defined by
$$QR(x, A) \stackrel{\mathrm{def}}{=} \int Q(x, dy)\,R(y, A)\;, \qquad x \in \mathsf{X},\ A \in \mathcal{Z}\;,$$
is then an unnormalized transition kernel from $(\mathsf{X}, \mathcal{X})$ to $(\mathsf{Z}, \mathcal{Z})$. If $Q$ and $R$ are transition kernels, then so is $QR$, that is, $QR(x, \mathsf{Z}) = 1$ for all $x \in \mathsf{X}$.
If $Q$ is an (unnormalized) Markov transition kernel on $(\mathsf{X}, \mathcal{X})$, its iterates are defined inductively by
$$Q^0(x, \cdot) = \delta_x \ \text{ for } x \in \mathsf{X}, \qquad\text{and}\qquad Q^k = Q Q^{k-1} \ \text{ for } k \ge 1\;.$$
These iterates satisfy the Chapman-Kolmogorov equation: $Q^{n+m} = Q^n Q^m$ for all $n, m \ge 0$. That is, for all $x \in \mathsf{X}$ and $A \in \mathcal{X}$,
$$Q^{n+m}(x, A) = \int Q^n(x, dy)\,Q^m(y, A)\;. \qquad(2.1)$$
If $Q$ admits a density $q$ with respect to the measure $\lambda$ on $(\mathsf{X}, \mathcal{X})$, then for all $n \ge 2$ the kernel $Q^n$ is also absolutely continuous with respect to $\lambda$. The corresponding transition density is
$$q_n(x, y) = \int_{\mathsf{X}^{n-1}} q(x, x_1) \cdots q(x_{n-1}, y)\,\lambda(dx_1) \cdots \lambda(dx_{n-1})\;. \qquad(2.2)$$
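When the state space is finite, a Markov transition kernel is simply a stochastic matrix, composition of kernels is matrix multiplication, and (2.1) becomes an identity between matrix powers. The following short sketch (Python/NumPy, with an arbitrary two-state matrix chosen purely for illustration) checks this numerically.

```python
import numpy as np

Q = np.array([[0.9, 0.1],
              [0.3, 0.7]])   # a Markov transition matrix on a two-point space

def iterate(Q, k):
    # Q^0 is the identity (the kernel x -> delta_x), and Q^k = Q Q^{k-1}
    return np.linalg.matrix_power(Q, k)

n, m = 3, 4
# Chapman-Kolmogorov equation (2.1): Q^{n+m} = Q^n Q^m
assert np.allclose(iterate(Q, n + m), iterate(Q, n) @ iterate(Q, m))
```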

Positive measures operate on (unnormalized) transition kernels in two different ways. If $\nu$ is a positive measure on $(\mathsf{X}, \mathcal{X})$, the positive measure $\nu Q$ on $(\mathsf{Y}, \mathcal{Y})$ is defined by
$$\nu Q(A) \stackrel{\mathrm{def}}{=} \int \nu(dx)\,Q(x, A)\;, \qquad A \in \mathcal{Y}\;.$$
Moreover, the measure $\nu \otimes Q$ on the product space $(\mathsf{X} \times \mathsf{Y}, \mathcal{X} \otimes \mathcal{Y})$ is defined by
$$\nu \otimes Q(C) \stackrel{\mathrm{def}}{=} \iint_C \nu(dx)\,Q(x, dy)\;, \qquad C \in \mathcal{X} \otimes \mathcal{Y}\;.$$
If $\nu$ is a probability measure and $Q$ is a transition kernel, then $\nu Q$ and $\nu \otimes Q$ are probability measures.
(Unnormalized) transition kernels operate on functions. Let $f$ be a real measurable function on $\mathsf{Y}$. The real measurable function $Qf$ on $\mathsf{X}$ is defined by
$$Qf(x) \stackrel{\mathrm{def}}{=} \int Q(x, dy)\,f(y)\;, \qquad x \in \mathsf{X}\;,$$
provided the integral is well-defined. It will sometimes be more convenient to use the alternative notation $Q(x, f)$ instead of $Qf(x)$. In particular,

for $x \in \mathsf{X}$ and $A \in \mathcal{Y}$, $Q(x, A)$, $\delta_x Q(A)$, $Q\mathbf{1}_A(x)$, and $Q(x, \mathbf{1}_A)$, where $\mathbf{1}_A$ denotes the indicator function of the set $A$, are four equivalent ways of denoting the same quantity. In general, we prefer using the $Q(x, \mathbf{1}_A)$ and $Q(x, A)$ variants, which are less prone to confusion in complicated expressions.
For any positive measure $\nu$ on $(\mathsf{X}, \mathcal{X})$ and any real measurable function $f$ on $(\mathsf{Y}, \mathcal{Y})$,
$$(\nu Q)(f) = \nu(Qf) = \iint \nu(dx)\,Q(x, dy)\,f(y)\;,$$
provided the integrals are well-defined. We may thus use the simplified notation $\nu Q f$ instead of $(\nu Q)(f)$ or $\nu(Qf)$.

Definition 2.1.2 (Reverse Kernel). Let $Q$ be a transition kernel from $(\mathsf{X}, \mathcal{X})$ to $(\mathsf{Y}, \mathcal{Y})$ and let $\nu$ be a probability measure on $(\mathsf{X}, \mathcal{X})$. The reverse kernel $\overleftarrow{Q}_\nu$ associated to $\nu$ and $Q$ is a transition kernel from $(\mathsf{Y}, \mathcal{Y})$ to $(\mathsf{X}, \mathcal{X})$ such that for all bounded measurable functions $f$ defined on $\mathsf{X} \times \mathsf{Y}$,
$$\iint_{\mathsf{X} \times \mathsf{Y}} f(x, y)\,\nu(dx)\,Q(x, dy) = \iint_{\mathsf{X} \times \mathsf{Y}} f(x, y)\,\nu Q(dy)\,\overleftarrow{Q}_\nu(y, dx)\;. \qquad(2.3)$$

The reverse kernel does not necessarily exist and is not uniquely defined. Nevertheless, if $\overleftarrow{Q}_{\nu,1}$ and $\overleftarrow{Q}_{\nu,2}$ satisfy (2.3), then for all $A \in \mathcal{X}$, $\overleftarrow{Q}_{\nu,1}(y, A) = \overleftarrow{Q}_{\nu,2}(y, A)$ for $\nu Q$-almost every $y$ in $\mathsf{Y}$. The reverse kernel does exist if $\mathsf{X}$ and $\mathsf{Y}$ are Polish spaces endowed with their Borel σ-fields (see Appendix A.1 for details). If $Q$ admits a density $q$ with respect to a measure $\mu$ on $(\mathsf{Y}, \mathcal{Y})$, then $\overleftarrow{Q}_\nu$ can be defined for all $y$ such that $\int_{\mathsf{X}} q(z, y)\,\nu(dz) \ne 0$ by
$$\overleftarrow{Q}_\nu(y, dx) = \frac{q(x, y)\,\nu(dx)}{\int_{\mathsf{X}} q(z, y)\,\nu(dz)}\;. \qquad(2.4)$$
The values of $\overleftarrow{Q}_\nu$ on the set $\{y \in \mathsf{Y} : \int_{\mathsf{X}} q(z, y)\,\nu(dz) = 0\}$ are irrelevant because this set is $\nu Q$-negligible. In particular, if $\mathsf{X}$ is discrete and the reference measure is counting measure, then for all $(x, y) \in \mathsf{X} \times \mathsf{Y}$ such that $\nu Q(y) \ne 0$,
$$\overleftarrow{Q}_\nu(y, x) = \frac{\nu(x)\,Q(x, y)}{\nu Q(y)}\;. \qquad(2.5)$$

2.1.2 Homogeneous Markov Chains

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $(\mathsf{X}, \mathcal{X})$ be a measurable space. An $\mathsf{X}$-valued (discrete index) stochastic process $\{X_n\}_{n\ge 0}$ is a collection of $\mathsf{X}$-valued random variables. A filtration of $(\Omega, \mathcal{F})$ is a non-decreasing sequence $\{\mathcal{F}_n\}_{n\ge 0}$ of sub-σ-fields of $\mathcal{F}$. A filtered space is a triple $(\Omega, \mathcal{F}, \mathbb{F})$, where $\mathbb{F}$ is a filtration; $(\Omega, \mathcal{F}, \mathbb{F}, \mathbb{P})$ is called a filtered probability space. For any filtration

$\mathbb{F} = \{\mathcal{F}_n\}_{n\ge 0}$, we denote by $\mathcal{F}_\infty = \bigvee_{n=0}^{\infty} \mathcal{F}_n$ the σ-field generated by $\mathbb{F}$ or, in other words, the minimal σ-field containing $\mathbb{F}$. A stochastic process $\{X_n\}_{n\ge 0}$ is adapted to $\mathbb{F} = \{\mathcal{F}_n\}_{n\ge 0}$, or simply $\mathbb{F}$-adapted, if $X_n$ is $\mathcal{F}_n$-measurable for all $n \ge 0$. The natural filtration of a process $\{X_n\}_{n\ge 0}$, denoted by $\mathbb{F}^X = \{\mathcal{F}^X_n\}_{n\ge 0}$, is the smallest filtration with respect to which $\{X_n\}$ is adapted.

Definition 2.1.3 (Markov Chain). Let $(\Omega, \mathcal{F}, \mathbb{F}, \mathbb{P})$ be a filtered probability space and let $Q$ be a Markov transition kernel on a measurable space $(\mathsf{X}, \mathcal{X})$. An $\mathsf{X}$-valued stochastic process $\{X_k\}_{k\ge 0}$ is said to be a Markov chain under $\mathbb{P}$, with respect to the filtration $\mathbb{F}$ and with transition kernel $Q$, if it is $\mathbb{F}$-adapted and for all $k \ge 0$ and $A \in \mathcal{X}$,
$$\mathbb{P}(X_{k+1} \in A \mid \mathcal{F}_k) = Q(X_k, A)\;. \qquad(2.6)$$
The distribution of $X_0$ is called the initial distribution of the chain, and $\mathsf{X}$ is called the state space.

If $\{X_k\}_{k\ge 0}$ is $\mathbb{F}$-adapted, then for all $k \ge 0$ it holds that $\mathcal{F}^X_k \subseteq \mathcal{F}_k$; hence a Markov chain with respect to a filtration $\mathbb{F}$ is also a Markov chain with respect to its natural filtration. Hereafter, a Markov chain with respect to its natural filtration will simply be referred to as a Markov chain. When there is no risk of confusion, we will not mention the underlying probability measure $\mathbb{P}$.
A fundamental property of a Markov chain is that its finite-dimensional distributions, and hence the distribution of the process $\{X_k\}_{k\ge 0}$, are entirely determined by the initial distribution and the transition kernel.

Proposition 2.1.4. Let $\{X_k\}_{k\ge 0}$ be a Markov chain with initial distribution $\nu$ and transition kernel $Q$. For any $k \ge 0$ and any bounded $\mathcal{X}^{\otimes(k+1)}$-measurable function $f$ on $\mathsf{X}^{k+1}$,
$$\mathrm{E}[f(X_0, \ldots, X_k)] = \int\cdots\int f(x_0, \ldots, x_k)\,\nu(dx_0)\,Q(x_0, dx_1) \cdots Q(x_{k-1}, dx_k)\;.$$

In the following, we will use the generic notation $f \in \mathcal{F}_b(\mathsf{Z})$ to denote the fact that $f$ is a measurable bounded function on $(\mathsf{Z}, \mathcal{Z})$. In the case of Proposition 2.1.4 for instance, one considers functions $f$ that are in $\mathcal{F}_b(\mathsf{X}^{k+1})$. More generally, we will usually describe measures and transition kernels on $(\mathsf{Z}, \mathcal{Z})$ by specifying the way they operate on the functions of $\mathcal{F}_b(\mathsf{Z})$.

2.1.2.1 Canonical Version

Let $(\mathsf{X}, \mathcal{X})$ be a measurable space. The canonical space associated to $(\mathsf{X}, \mathcal{X})$ is the infinite-dimensional product space $(\mathsf{X}^{\mathbb{N}}, \mathcal{X}^{\otimes\mathbb{N}})$. The coordinate process is the $\mathsf{X}$-valued stochastic process $\{X_k\}_{k\ge 0}$ defined on the canonical space by $X_n(\omega) = \omega(n)$. The canonical space will always be endowed with the natural filtration $\mathbb{F}^X$ of the coordinate process.

Let $(\Omega, \mathcal{F}) = (\mathsf{X}^{\mathbb{N}}, \mathcal{X}^{\otimes\mathbb{N}})$ be the canonical space associated to the measurable space $(\mathsf{X}, \mathcal{X})$. The shift operator $\theta : \Omega \to \Omega$ is defined by
$$\theta(\omega)(n) = \omega(n+1)\;, \qquad n \ge 0\;.$$
The iterates of the shift operator are defined inductively by $\theta^0 = \mathrm{Id}$ (the identity), $\theta^1 = \theta$ and $\theta^k = \theta \circ \theta^{k-1}$ for $k \ge 1$. If $\{X_k\}_{k\ge 0}$ is the coordinate process with associated natural filtration $\mathbb{F}^X$, then for all $k, n \ge 0$, $X_k \circ \theta^n = X_{k+n}$, and more generally for any $\mathcal{F}^X_k$-measurable random variable $Y$, $Y \circ \theta^n$ is $\mathcal{F}^X_{n+k}$-measurable.
The following theorem, which is a particular case of the Kolmogorov consistency theorem, states that it is always possible to define a Markov chain on the canonical space.

Theorem 2.1.5. Let $(\mathsf{X}, \mathcal{X})$ be a measurable space, $\nu$ a probability measure on $(\mathsf{X}, \mathcal{X})$, and $Q$ a transition kernel on $(\mathsf{X}, \mathcal{X})$. Then there exists a unique probability measure on $(\mathsf{X}^{\mathbb{N}}, \mathcal{X}^{\otimes\mathbb{N}})$, denoted by $\mathbb{P}_\nu$, such that the coordinate process $\{X_k\}_{k\ge 0}$ is a Markov chain (with respect to its natural filtration) with initial distribution $\nu$ and transition kernel $Q$. For $x \in \mathsf{X}$, let $\mathbb{P}_x$ be an alternative simplified notation for $\mathbb{P}_{\delta_x}$. Then for all $A \in \mathcal{X}^{\otimes\mathbb{N}}$, the mapping $x \mapsto \mathbb{P}_x(A)$ is $\mathcal{X}$-measurable, and for any probability measure $\nu$ on $(\mathsf{X}, \mathcal{X})$,
$$\mathbb{P}_\nu(A) = \int \nu(dx)\,\mathbb{P}_x(A)\;. \qquad(2.7)$$

The Markov chain defined in Theorem 2.1.5 is referred to as the canonical version of the Markov chain. The probability $\mathbb{P}_\nu$ defined on $(\mathsf{X}^{\mathbb{N}}, \mathcal{X}^{\otimes\mathbb{N}})$ depends on $\nu$ and on the transition kernel $Q$. Nevertheless, the dependence with respect to $Q$ is traditionally omitted in the notation. The relation (2.7) implies that $x \mapsto \mathbb{P}_x$ is a regular version of the conditional probability $\mathbb{P}_\nu(\cdot \mid X_k = x)$ in the sense that one can rewrite (2.6) as
$$\mathbb{P}_\nu\big(X_{k+1} \in A \mid \mathcal{F}^X_k\big) = \mathbb{P}_\nu\big(X_1 \circ \theta^k \in A \mid \mathcal{F}^X_k\big) = \mathbb{P}_{X_k}(X_1 \in A) \qquad \mathbb{P}_\nu\text{-a.s.}$$

2.1.2.2 Markov Properties

More generally, an induction argument easily yields the Markov property: for any $\mathcal{F}^X_\infty$-measurable random variable $Y$,
$$\mathrm{E}_\nu[Y \circ \theta^k \mid \mathcal{F}^X_k] = \mathrm{E}_{X_k}[Y] \qquad \mathbb{P}_\nu\text{-a.s.} \qquad(2.8)$$
The Markov property can be extended to a specific class of random times known as stopping times. Let $\bar{\mathbb{N}} = \mathbb{N} \cup \{+\infty\}$ denote the extended integer set and let $(\Omega, \mathcal{F}, \mathbb{F})$ be a filtered space. Then, a mapping $\tau : \Omega \to \bar{\mathbb{N}}$ is said to be an $\mathbb{F}$-stopping time if $\{\tau = n\} \in \mathcal{F}_n$ for all $n \ge 0$. Intuitively, this means that at any time $n$ one should be able to tell, based on the information $\mathcal{F}_n$

available at that time, whether the stopping time occurs at this time $n$ (or before then) or not. The class $\mathcal{F}_\tau$ defined by
$$\mathcal{F}_\tau = \{B \in \mathcal{F} : B \cap \{\tau = n\} \in \mathcal{F}_n \text{ for all } n \ge 0\}$$
is a σ-field, referred to as the σ-field of the events occurring before $\tau$.

Theorem 2.1.6 (Strong Markov Property). Let $\{X_k\}_{k\ge 0}$ be the canonical version of a Markov chain and let $\tau$ be an $\mathbb{F}^X$-stopping time. Then for any bounded $\mathcal{F}^X_\infty$-measurable function $\Psi$,
$$\mathrm{E}_\nu\big[\mathbf{1}_{\{\tau < \infty\}}\,\Psi \circ \theta^\tau \mid \mathcal{F}^X_\tau\big] = \mathbf{1}_{\{\tau < \infty\}}\,\mathrm{E}_{X_\tau}[\Psi] \qquad \mathbb{P}_\nu\text{-a.s.} \qquad(2.9)$$

We note that an $\mathcal{F}^X_\infty$-measurable function, or random variable, $\Psi$, is typically a function of potentially the whole trajectory of the Markov chain, although it may of course be a rather simple function like $X_1$ or $X_2 + X_3^2$.

2.1.3 Non-homogeneous Markov Chains

Definition 2.1.7 (Non-homogeneous Markov Chain). Let $(\Omega, \mathcal{F}, \mathbb{F}, \mathbb{P})$ be a filtered probability space and let $\{Q_k\}_{k\ge 0}$ be a family of transition kernels on a measurable space $(\mathsf{X}, \mathcal{X})$. An $\mathsf{X}$-valued stochastic process $\{X_k\}_{k\ge 0}$ is said to be a non-homogeneous Markov chain under $\mathbb{P}$, with respect to the filtration $\mathbb{F}$ and with transition kernels $\{Q_k\}$, if it is $\mathbb{F}$-adapted and for all $k \ge 0$ and $A \in \mathcal{X}$,
$$\mathbb{P}(X_{k+1} \in A \mid \mathcal{F}_k) = Q_k(X_k, A)\;.$$
For $i \le j$ we define $Q_{i,j} = Q_i Q_{i+1} \cdots Q_j$. With this notation, if $\nu$ denotes the distribution of $X_0$ (which we refer to as the initial distribution as in the homogeneous case), the distribution of $X_n$ is $\nu Q_{0,n-1}$.
An important example of a non-homogeneous Markov chain is the so-called reverse chain. The construction of the reverse chain is based on the observation that if $\{X_k\}_{k\ge 0}$ is a Markov chain, then for any index $n \ge 1$ the time-reversed (or, index-reversed) process $\{X_{n-k}\}_{k=0}^{n}$ is a Markov chain too. The definition below provides its transition kernels.

Definition 2.1.8 (Reverse Chain). Let $Q$ be a Markov kernel on some space $\mathsf{X}$, let $\nu$ be a probability measure on this space, and let $n \ge 1$ be an index. The reverse chain is the non-homogeneous Markov chain with initial distribution $\nu Q^n$, (time) index set $k = 0, 1, \ldots, n$ and transition kernels
$$\overleftarrow{Q}_k = \overleftarrow{Q}_{\nu Q^{n-k-1}}\;, \qquad k = 0, \ldots, n-1\;,$$
assuming that the reverse kernels are indeed well-defined.

If the transition kernel $Q$ admits a transition density function $q$ with respect to a measure $\lambda$ on $(\mathsf{X}, \mathcal{X})$, then $\overleftarrow{Q}_k$ also admits a density with respect to the same measure $\lambda$, namely
$$h_k(y, x) = \frac{\int \nu(dz)\,q_{n-k-1}(z, x)\,q(x, y)}{\int \nu(dz)\,q_{n-k}(z, y)}\;. \qquad(2.10)$$
Here, $q_l$ is the transition density function of $Q^l$ with respect to $\lambda$ as defined in (2.2). If the state space is countable, then
$$\overleftarrow{Q}_k(y, x) = \frac{\nu Q^{n-k-1}(x)\,Q(x, y)}{\nu Q^{n-k}(y)}\;. \qquad(2.11)$$
An interesting question is in what cases the kernels $\overleftarrow{Q}_k$ do not depend on the index $k$ and are in fact all equal to the forward kernel $Q$. A Markov chain with this property is said to be reversible. The following result gives a necessary and sufficient condition for reversibility.

Theorem 2.1.9. Let $\mathsf{X}$ be a Polish space. A Markov kernel $Q$ on $\mathsf{X}$ is reversible with respect to a probability measure $\nu$ if and only if for all bounded measurable functions $f$ on $\mathsf{X} \times \mathsf{X}$,
$$\iint f(x, x')\,\nu(dx)\,Q(x, dx') = \iint f(x, x')\,\nu(dx')\,Q(x', dx)\;. \qquad(2.12)$$

The relation (2.12) is referred to as the local balance equations (or detailed balance equations). If the state space is countable, these equations hold if for all $x, x' \in \mathsf{X}$,
$$\nu(x)\,Q(x, x') = \nu(x')\,Q(x', x)\;. \qquad(2.13)$$
Upon choosing a function $f$ that only depends on the second variable in (2.12), it is easily seen that $\nu Q(f) = \nu(f)$ for all functions $f \in \mathcal{F}_b(\mathsf{X})$. We can also write this as $\nu = \nu Q$. This equation is referred to as the global balance equations. By induction, we find that $\nu Q^n = \nu$ for all $n \ge 0$. The left-hand side of this equation is the distribution of $X_n$, which thus does not depend on $n$ when global balance holds. This is a form of stationarity, obviously implied by local balance. We shall tie this form of stationarity to the following customary definition.

Definition 2.1.10 (Stationary Process). A stochastic process $\{X_k\}$ is said to be stationary (under $\mathbb{P}$) if its finite-dimensional distributions are translation invariant, that is, if for all $k, n \ge 1$ and all $n_1, \ldots, n_k$, the distribution of the random vector $(X_{n_1+n}, \ldots, X_{n_k+n})$ does not depend on $n$.

A stochastic process with index set $\mathbb{N}$, stationary but otherwise general, can always be extended to a process with index set $\mathbb{Z}$, having the same finite-dimensional distributions (and hence being stationary). This is a consequence of Kolmogorov's existence theorem for stochastic processes.
For a Markov chain, any multi-dimensional distribution can be expressed in terms of the initial distribution and the transition kernel (this is Proposition 2.1.4), and hence the characterization of stationarity becomes much simpler than above. Indeed, a Markov chain is stationary if and only if its initial distribution $\nu$ and transition kernel $Q$ satisfy $\nu Q = \nu$, that is, satisfy global balance. Much more will be said about stationary distributions of Markov chains in Chapter 14.
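For a countable state space, the local balance equations (2.13) and the global balance equation are elementary matrix identities. The following minimal numerical check (Python/NumPy, with a small reversible chain whose values are chosen purely for illustration) verifies both.

```python
import numpy as np

# A reversible random walk on {0, 1, 2} with holding probabilities.
Q = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
nu = np.array([0.25, 0.50, 0.25])     # candidate stationary distribution

# Local (detailed) balance, eq. (2.13): nu(x) Q(x, x') = nu(x') Q(x', x)
flow = nu[:, None] * Q
assert np.allclose(flow, flow.T)

# Global balance: nu Q = nu, hence nu Q^n = nu for all n
assert np.allclose(nu @ Q, nu)
```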

2.2 Hidden Markov Models


A hidden Markov model is a doubly stochastic process with an underlying stochastic process that is not directly observable (it is hidden) but can be observed only through another stochastic process that produces the sequence of observations. As shown in the introduction, the scope of HMMs is large and covers a variety of situations. To accommodate these conceptually different models, we now define formally a hidden Markov model.

2.2.1 Definitions and Notations

In simple cases such as fully discrete models, it is common to define hidden Markov models by using the concept of conditional independence. Indeed, this was the view taken in Chapter 1, where an HMM was defined as a bivariate process $\{(X_k, Y_k)\}_{k\ge 0}$ such that
- $\{X_k\}_{k\ge 0}$ is a Markov chain with transition kernel $Q$ and initial distribution $\nu$;
- conditionally on the state process $\{X_k\}_{k\ge 0}$, the observations $\{Y_k\}_{k\ge 0}$ are independent, and for each $n$ the conditional distribution of $Y_n$ depends on $X_n$ only.

It turns out that conditional independence is mathematically more difficult to define in general settings (in particular, when the state space $\mathsf{X}$ of the Markov chain is not countable), and we will adopt a different route to define general hidden Markov models. The HMM is defined as a bivariate Markov chain, only partially observed though, whose transition kernel has a special structure. Indeed, its transition kernel should be such that both the joint process $\{X_k, Y_k\}_{k\ge 0}$ and the marginal unobservable (or hidden) chain $\{X_k\}_{k\ge 0}$ are Markovian. From this definition, the usual conditional independence properties of HMMs will then follow (see Corollary 2.2.5 below).

Definition 2.2.1 (Hidden Markov Model). Let $(\mathsf{X}, \mathcal{X})$ and $(\mathsf{Y}, \mathcal{Y})$ be two measurable spaces and let $Q$ and $G$ denote, respectively, a Markov transition kernel on $(\mathsf{X}, \mathcal{X})$ and a transition kernel from $(\mathsf{X}, \mathcal{X})$ to $(\mathsf{Y}, \mathcal{Y})$. Consider the Markov transition kernel defined on the product space $(\mathsf{X} \times \mathsf{Y}, \mathcal{X} \otimes \mathcal{Y})$ by

$$T[(x, y), C] = \iint_C Q(x, dx')\,G(x', dy')\;, \qquad (x, y) \in \mathsf{X} \times \mathsf{Y},\ C \in \mathcal{X} \otimes \mathcal{Y}\;. \qquad(2.14)$$
The Markov chain $\{X_k, Y_k\}_{k\ge 0}$ with Markov transition kernel $T$ and initial distribution $\nu \otimes G$, where $\nu$ is a probability measure on $(\mathsf{X}, \mathcal{X})$, is called a hidden Markov model.

Although the definition above concerns the joint process $\{X_k, Y_k\}_{k\ge 0}$, the term hidden is only justified in cases where $\{X_k\}_{k\ge 0}$ is not observable. In this respect, $\{X_k\}_{k\ge 0}$ can also be seen as a fictitious intermediate process that is useful only in defining the distribution of the observed process $\{Y_k\}_{k\ge 0}$.
We shall denote by $\mathbb{P}_\nu$ and $\mathrm{E}_\nu$ the probability measure and corresponding expectation associated with the process $\{X_k, Y_k\}_{k\ge 0}$ on the canonical space $\big((\mathsf{X} \times \mathsf{Y})^{\mathbb{N}}, (\mathcal{X} \otimes \mathcal{Y})^{\otimes\mathbb{N}}\big)$. Notice that this constitutes a slight departure from the Markov notations introduced previously, as $\nu$ is a probability measure on $\mathsf{X}$ only and not on the state space $\mathsf{X} \times \mathsf{Y}$ of the joint process. This slight abuse of notation is justified by the special structure of the model considered here. Equation (2.14) shows that whatever the distribution of the initial joint state $(X_0, Y_0)$, even if it were not of the form $\nu \otimes G$, the law of $\{X_k, Y_k\}_{k\ge 1}$ only depends on the marginal distribution of $X_0$. Hence it makes sense to index probabilities and expectations by this marginal initial distribution only.
If both $\mathsf{X}$ and $\mathsf{Y}$ are countable, the hidden Markov model is said to be discrete, which is the case originally considered by Baum and Petrie (1966). Many of the examples given in the introduction (those of Section 1.3.2 for instance) correspond to cases where $\mathsf{Y}$ is uncountable and is a subset of $\mathbb{R}^d$ for some $d$. In such cases, we shall generally assume that the following holds true.

Definition 2.2.2 (Partially Dominated Hidden Markov Model). The model of Definition 2.2.1 is said to be partially dominated if there exists a probability measure $\mu$ on $(\mathsf{Y}, \mathcal{Y})$ such that for all $x \in \mathsf{X}$, $G(x, \cdot)$ is absolutely continuous with respect to $\mu$, $G(x, \cdot) \ll \mu(\cdot)$, with transition density function $g(x, \cdot)$. Then, for $A \in \mathcal{Y}$, $G(x, A) = \int_A g(x, y)\,\mu(dy)$ and the joint transition kernel $T$ can be written as
$$T[(x, y), C] = \iint_C Q(x, dx')\,g(x', y')\,\mu(dy')\;, \qquad C \in \mathcal{X} \otimes \mathcal{Y}\;. \qquad(2.15)$$

In the third part of the book (Chapter 10 and following) where we consider statistical estimation for HMMs with unknown parameters, we will require even stronger conditions and assume that the model is fully dominated in the following sense.

Definition 2.2.3 (Fully Dominated Hidden Markov Model). If, in addition to the requirements of Definition 2.2.2, there exists a probability measure $\lambda$ on $(\mathsf{X}, \mathcal{X})$ such that $\nu \ll \lambda$ and, for all $x \in \mathsf{X}$, $Q(x, \cdot) \ll \lambda(\cdot)$ with transition density function $q(x, \cdot)$, then, for $A \in \mathcal{X}$, $Q(x, A) = \int_A q(x, x')\,\lambda(dx')$

and the model is said to be fully dominated. The joint Markov transition kernel $T$ is then dominated by the product measure $\lambda \otimes \mu$ and admits the transition density function
$$t[(x, y), (x', y')] \stackrel{\mathrm{def}}{=} q(x, x')\,g(x', y')\;. \qquad(2.16)$$
Note that for such models, we will generally re-use the notation $\nu$ to denote the probability density function of the initial state $X_0$ (with respect to $\lambda$) rather than the distribution itself.

2.2.2 Conditional Independence in Hidden Markov Models

In this section, we will show that the intuitive way of thinking about an HMM, in terms of conditional independence, is justified by Definition 2.2.1. Readers unfamiliar with conditioning in general settings may want to read more on this topic in Appendix A.4 before reading the rest of this section.

Proposition 2.2.4. Let $\{X_k, Y_k\}_{k\ge 0}$ be a Markov chain over the product space $\mathsf{X} \times \mathsf{Y}$ with transition kernel $T$ given by (2.14). Then, for any integer $p$, any ordered set $\{k_1 < \cdots < k_p\}$ of indices and all functions $f_1, \ldots, f_p \in \mathcal{F}_b(\mathsf{Y})$,
$$\mathrm{E}_\nu\Big[\prod_{i=1}^{p} f_i(Y_{k_i}) \,\Big|\, X_{k_1}, \ldots, X_{k_p}\Big] = \prod_{i=1}^{p} \int_{\mathsf{Y}} f_i(y)\,G(X_{k_i}, dy)\;. \qquad(2.17)$$

Proof. For any $h \in \mathcal{F}_b(\mathsf{X}^p)$, it holds that
$$\mathrm{E}_\nu\Big[\prod_{i=1}^{p} f_i(Y_{k_i})\,h(X_{k_1}, \ldots, X_{k_p})\Big] = \int\cdots\int \nu(dx_0)\,G(x_0, dy_0)\prod_{i=1}^{k_p} Q(x_{i-1}, dx_i)\,G(x_i, dy_i)\prod_{i=1}^{p} f_i(y_{k_i})\,h(x_{k_1}, \ldots, x_{k_p})$$
$$= \int\cdots\int \nu(dx_0)\prod_{i=1}^{k_p} Q(x_{i-1}, dx_i)\,h(x_{k_1}, \ldots, x_{k_p})\prod_{i \notin \{k_1, \ldots, k_p\}} G(x_i, dy_i)\prod_{i \in \{k_1, \ldots, k_p\}} f_i(y_i)\,G(x_i, dy_i)\;.$$
Because $\int G(x_i, dy_i) = 1$,
$$\mathrm{E}_\nu\Big[\prod_{i=1}^{p} f_i(Y_{k_i})\,h(X_{k_1}, \ldots, X_{k_p})\Big] = \mathrm{E}_\nu\Big[h(X_{k_1}, \ldots, X_{k_p})\prod_{i \in \{k_1, \ldots, k_p\}} \int f_i(y_i)\,G(X_i, dy_i)\Big]\;.$$

Corollary 2.2.5.
(i) For any integer $p$ and any ordered set $\{k_1 < \cdots < k_p\}$ of indices, the random variables $Y_{k_1}, \ldots, Y_{k_p}$ are $\mathbb{P}_\nu$-conditionally independent given $(X_{k_1}, X_{k_2}, \ldots, X_{k_p})$.
(ii) For any integers $k$ and $p$ and any ordered set $\{k_1 < \cdots < k_p\}$ of indices such that $k \notin \{k_1, \ldots, k_p\}$, the random variables $Y_k$ and $(X_{k_1}, \ldots, X_{k_p})$ are $\mathbb{P}_\nu$-conditionally independent given $X_k$.

Proof. Part (i) is an immediate consequence of Proposition 2.2.4. To prove (ii), note that for any $f \in \mathcal{F}_b(\mathsf{Y})$ and $h \in \mathcal{F}_b(\mathsf{X}^p)$,
$$\mathrm{E}_\nu\big[f(Y_k)\,h(X_{k_1}, \ldots, X_{k_p}) \mid X_k\big] = \mathrm{E}_\nu\big[\mathrm{E}_\nu[f(Y_k) \mid X_{k_1}, \ldots, X_{k_p}, X_k]\,h(X_{k_1}, \ldots, X_{k_p}) \mid X_k\big] = \mathrm{E}_\nu[f(Y_k) \mid X_k]\,\mathrm{E}_\nu[h(X_{k_1}, \ldots, X_{k_p}) \mid X_k]\;.$$

As a direct application of Propositions A.4.2 and A.4.3, the conditional independence of the observations given the underlying sequence of states implies that for any integers $p$ and $p'$, any indices $k_1 < \cdots < k_p$ and $k'_1 < \cdots < k'_{p'}$ such that $\{k_1, \ldots, k_p\} \cap \{k'_1, \ldots, k'_{p'}\} = \emptyset$, and any function $f \in \mathcal{F}_b(\mathsf{Y}^p)$,
$$\mathrm{E}_\nu\big[f(Y_{k_1}, \ldots, Y_{k_p}) \mid X_{k_1}, \ldots, X_{k_p}, X_{k'_1}, \ldots, X_{k'_{p'}}, Y_{k'_1}, \ldots, Y_{k'_{p'}}\big] = \mathrm{E}_\nu\big[f(Y_{k_1}, \ldots, Y_{k_p}) \mid X_{k_1}, \ldots, X_{k_p}\big]\;. \qquad(2.18)$$
Indeed, in terms of conditional independence of the variables,
$$(Y_{k_1}, \ldots, Y_{k_p}) \perp\!\!\!\perp (Y_{k'_1}, \ldots, Y_{k'_{p'}}) \mid (X_{k_1}, \ldots, X_{k_p}, X_{k'_1}, \ldots, X_{k'_{p'}}) \quad [\mathbb{P}_\nu]$$
and
$$(Y_{k_1}, \ldots, Y_{k_p}) \perp\!\!\!\perp (X_{k'_1}, \ldots, X_{k'_{p'}}) \mid (X_{k_1}, \ldots, X_{k_p}) \quad [\mathbb{P}_\nu]\;.$$
Hence, by the contraction property of Proposition A.4.3,
$$(Y_{k_1}, \ldots, Y_{k_p}) \perp\!\!\!\perp (X_{k'_1}, \ldots, X_{k'_{p'}}, Y_{k'_1}, \ldots, Y_{k'_{p'}}) \mid (X_{k_1}, \ldots, X_{k_p}) \quad [\mathbb{P}_\nu]\;,$$
which implies (2.18).

2.2.3 Hierarchical Hidden Markov Models

In examples such as 1.3.16 and 1.3.15, we met hidden Markov models whose state variable naturally decomposes into two distinct sub-components. To accommodate such structures, we define a specific sub-class of HMMs for which the state $X_k$ consists of two components, $X_k = (C_k, W_k)$. This additional structure will be used to introduce a level of hierarchy in the state variables. We call this class hierarchical hidden Markov models. In general, the hierarchical structure will be as follows.
- $\{C_k\}_{k\ge 0}$ is a Markov chain on a state space $(\mathsf{C}, \mathcal{C})$ with transition kernel $Q^C$ and initial distribution $\nu^C$. Thus, for any $f \in \mathcal{F}_b(\mathsf{C})$ and any $k \ge 1$,
$$\mathrm{E}[f(C_k) \mid C_{0:k-1}] = Q^C(C_{k-1}, f) \qquad\text{and}\qquad \mathrm{E}[f(C_0)] = \nu^C(f)\;.$$
- Conditionally on $\{C_k\}_{k\ge 0}$, $\{W_k\}_{k\ge 0}$ is a Markov chain on $(\mathsf{W}, \mathcal{W})$. More precisely, there exists a transition kernel $Q^W : (\mathsf{W} \times \mathsf{C}) \times \mathcal{W} \to [0, 1]$ such that for any $k \ge 1$ and any function $f \in \mathcal{F}_b(\mathsf{W})$,
$$\mathrm{E}[f(W_k) \mid W_{0:k-1}, C_{0:k}] = Q^W[(W_{k-1}, C_k), f]\;.$$
In addition, there exists a transition kernel $\nu^W : \mathsf{C} \times \mathcal{W} \to [0, 1]$ such that for any $f \in \mathcal{F}_b(\mathsf{W})$,
$$\mathrm{E}[f(W_0) \mid C_0] = \nu^W(C_0, f)\;.$$

We denote by $X_k = (C_k, W_k)$ the composite state variable. Then, $\{X_k\}_{k\ge 0}$ is a Markov chain on $\mathsf{X} = \mathsf{C} \times \mathsf{W}$ with transition kernel
$$Q[(c, w), A \times B] = \iint_{A \times B} Q^C(c, dc')\,Q^W[(w, c'), dw']\;, \qquad A \in \mathcal{C},\ B \in \mathcal{W}\;,$$
and initial distribution
$$\nu(A \times B) = \int_A \nu^C(dc)\,\nu^W(c, B)\;.$$
As before, we assume that the observations $\{Y_k\}_{k\ge 0}$ are conditionally independent given $\{X_k\}_{k\ge 0}$ and that the conditional distribution of $Y_n$ depends on $X_n$ only, meaning that (2.17) holds. The distinctive feature of hierarchical HMMs is that it is often advantageous to consider that the state variables are $\{C_k\}_{k\ge 0}$ rather than $\{X_k\}_{k\ge 0}$. Of course, the model is then no longer an HMM because the observation $Y_k$ depends on all partial states $C_l$ for $l \le k$ due to the marginalization of the intermediate component $W_l$ (for $l = 0, \ldots, k$). Nonetheless, this point of view is often preferable, particularly in cases where the structure of $\{C_k\}_{k\ge 0}$ is very simple, such as when $\mathsf{C}$ is finite.
The most common example of hierarchical HMM is the conditionally Gaussian linear state-space model (CGLSSM), which we already met in Examples 1.3.9, 1.3.11, and 1.3.16. We now formally define this model.

Definition 2.2.6 (Conditionally Gaussian Linear State-Space Model). A CGLSSM is a model of the form
$$W_{k+1} = A(C_{k+1})\,W_k + R(C_{k+1})\,U_k\;,$$
$$Y_k = B(C_k)\,W_k + S(C_k)\,V_k\;, \qquad(2.19)$$
$$W_0 \sim \mathrm{N}(\mu_\nu, \Sigma_\nu)\;,$$
subject to the following conditions.
- The indicator process $\{C_k\}_{k\ge 0}$ is a Markov chain with transition kernel $Q^C$ and initial distribution $\nu^C$. Usually, $\mathsf{C}$ is finite and then identified with the set $\{1, \ldots, r\}$.
- The state (or process) noise $\{U_k\}_{k\ge 0}$ and the measurement noise $\{V_k\}_{k\ge 0}$ are independent multivariate Gaussian white noises with zero mean and identity covariance matrices. In addition, the indicator process $\{C_k\}_{k\ge 0}$ is independent of both the state noise and of the measurement noise.
- $A$, $B$, $R$, and $S$ are known matrix-valued functions of appropriate dimensions.
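The following is a minimal simulation sketch of a CGLSSM as in Definition 2.2.6, written in Python/NumPy with a scalar state and observation and made-up two-regime parameter values; the initial state is drawn from a standard normal purely for simplicity of the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative two-regime CGLSSM (eq. 2.19) with scalar W and Y.
P = np.array([[0.95, 0.05], [0.10, 0.90]])   # kernel Q^C of the indicator chain
A = np.array([0.9, 0.5])                     # A(c)
R = np.array([0.5, 2.0])                     # R(c)
B = np.array([1.0, 1.0])                     # B(c)
S = np.array([0.3, 0.3])                     # S(c)

n = 200
C = np.zeros(n, dtype=int)                   # C_0 fixed to regime 0 in this sketch
W = np.zeros(n)
Y = np.zeros(n)
W[0] = rng.standard_normal()                 # W_0 ~ N(0, 1) for the sketch
for k in range(n):
    if k > 0:
        C[k] = rng.choice(2, p=P[C[k - 1]])
        W[k] = A[C[k]] * W[k - 1] + R[C[k]] * rng.standard_normal()
    Y[k] = B[C[k]] * W[k] + S[C[k]] * rng.standard_normal()
```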

Part I

State Inference

3 Filtering and Smoothing Recursions

This chapter deals with a fundamental issue in hidden Markov modeling: given a fully specied model and some observations Y0 , . . . , Yn , what can be said about the corresponding unobserved state sequence X0 , . . . , Xn ? More specically, we shall be concerned with the evaluation of the conditional distributions of the state at index k, Xk , given the observations Y0 , . . . , Yn , a task that is generally referred to as smoothing. There are of course several options available for tackling this problem (Anderson and Moore, 1979, Chapter 7) and we focus, in this chapter, on the xed-interval smoothing paradigm in which n is held xed and it is desired to evaluate the conditional distributions of Xk for all indices k between 0 and n. Note that only the general mechanics of the smoothing problem are dealt with in this chapter. In particular, most formulas will involve integrals over X. We shall not, for the moment, discuss ways in which these integrals can be eectively evaluated, or at least approximated, numerically. We postpone this issue to Chapter 5, which deals with some specic classes of hidden Markov models, and Chapters 6 and 7, in which generally applicable Markov chain Monte Carlo methods or sequential importance sampling techniques are reviewed. The driving line of this chapter is the existence of a variety of smoothing approaches that involve a number of steps that only increase linearly with the number of observations. This is made possible by the fact (to be made precise in Section 3.3) that conditionally on the observations Y0 , . . . , Yn , the state sequence still is a Markov chain, albeit a non-homogeneous one. Readers already familiar with the eld could certainly object that as the probabilistic structure of any hidden Markov model may be represented by the generic probabilistic network drawn in Figure 1.1 (Chapter 1), the xed interval smoothing problem under consideration may be solved by applying the general principle known as probability propagation or sum-productsee Cowell et al. (1999) or Frey (1998) for further details and references. As patent however from Figure 1.1, the graph corresponding to the HMM structure is so simple and systematic in its design that ecient instances of the probability propagation approach are all based on combining two systematic phases:

one in which the graph is scanned systematically from left to right (or forward pass), and one in which the graph is scanned in reverse order (backward pass). In this context, there are essentially only three dierent ways of implementing the above principle, which are presented below in Sections 3.2.2, 3.3.1, and 3.3.2. From a historical perspective, it is interesting to recall that most of the early references on smoothing, which date back to the 1960s, focused on the specic case of Gaussian linear state-space models, following the pioneering work by Kalman and Bucy (1961). The classic book by Anderson and Moore (1979) on optimal ltering, for instance, is fully devoted to linear state-space modelssee also Chapter 10 of the recent book by Kailath et al. (2000) for a more exhaustive set of early references on the smoothing problem. Although some authors such as (for instance) Ho and Lee (1964) considered more general state-space models, it is fair to say that the Gaussian linear state-space model was the dominant paradigm in the automatic control community1 . In contrast, the work by Baum and his colleagues on hidden Markov models (Baum et al., 1970) dealt with the case where the state space X of the hidden state is nite. These two streams of research (on Gaussian linear models and nite state space models) remained largely separated. Approximately at the same time, in the eld of probability theory, the seminal work by Stratonovich (1960) stimulated a number of contributions that were to compose a body of work generally referred to as ltering theory. The object of ltering theory is to study inference about partially observable Markovian processes in continuous time. A number of early references in this domain indeed consider some specic form of discrete state space continuous-time equivalent of the HMM (Shiryaev, 1966; Wonham, 1965)see also Lipster and Shiryaev (2001), Chapter 9. Working in continuous time, however, implies the use of mathematical tools that are denitely more complex than those needed to tackle the discrete-time model of Baum et al. (1970). As a matter of fact, ltering theory and hidden Markov models evolved as two mostly independent elds of research. A poorly acknowledged fact is that the pioneering paper by Stratonovich (1960) (translated from an earlier Russian publication) describes, in its rst section, an equivalent to the forward-backward smoothing approach of Baum et al. (1970). It turns out, however, that the formalism of Baum et al. (1970) generalizes well to models where the state space is not discrete anymore, in contrast to that of Stratonovich (1960) (see Section 3.4 for the exact correspondence between both approaches).

1 Interestingly, until the early 1980s, the works that did not focus on the linear state-space model were usually advertised by the use of the words Bayes or Bayesian in their title; see, e.g., Ho and Lee (1964) or Askar and Derin (1981).

3.1 Basic Notations and Definitions


In the rest of this chapter, the principles of smoothing as introduced by Baum et al. (1970) are exposed in a general setting that is suitable for all the examples introduced in Section 1.3.

3.1.1 Likelihood

The joint probability of the unobservable states and observations up to index $n$ is such that for any function $f \in \mathcal{F}_b\big((\mathsf{X} \times \mathsf{Y})^{n+1}\big)$,
$$\mathrm{E}_\nu[f(X_0, Y_0, \ldots, X_n, Y_n)] = \int\cdots\int f(x_0, y_0, \ldots, x_n, y_n)\,\nu(dx_0)\,g(x_0, y_0)\prod_{k=1}^{n}\{Q(x_{k-1}, dx_k)\,g(x_k, y_k)\}\,\mu^n(dy_0, \ldots, dy_n)\;, \qquad(3.1)$$
where $\mu^n$ denotes the product distribution $\mu^{\otimes(n+1)}$ on $(\mathsf{Y}^{n+1}, \mathcal{Y}^{\otimes(n+1)})$. Marginalizing with respect to the unobservable variables $X_0, \ldots, X_n$, one obtains the marginal distribution of the observations only,
$$\mathrm{E}_\nu[f(Y_0, \ldots, Y_n)] = \int f(y_0, \ldots, y_n)\,\mathrm{L}_{\nu,n}(y_0, \ldots, y_n)\,\mu^n(dy_0, \ldots, dy_n)\;, \qquad(3.2)$$
where $\mathrm{L}_{\nu,n}$ is an important quantity which we define below for future reference.

Definition 3.1.1 (Likelihood). The likelihood of the observations is the probability density function of $Y_0, Y_1, \ldots, Y_n$ with respect to $\mu^n$ defined, for all $(y_0, \ldots, y_n) \in \mathsf{Y}^{n+1}$, by
$$\mathrm{L}_{\nu,n}(y_0, \ldots, y_n) = \int\cdots\int \nu(dx_0)\,g(x_0, y_0)\,Q(x_0, dx_1)\,g(x_1, y_1) \cdots Q(x_{n-1}, dx_n)\,g(x_n, y_n)\;. \qquad(3.3)$$
In addition,
$$\ell_{\nu,n} \stackrel{\mathrm{def}}{=} \log \mathrm{L}_{\nu,n} \qquad(3.4)$$
is referred to as the log-likelihood function.

Remark 3.1.2 (Concise Notation for Sub-sequences). For the sake of conciseness, we will use in the following the notation $Y_{l:m}$ to denote the collection of consecutively indexed variables $Y_l, \ldots, Y_m$ wherever possible (proceeding the same way for the unobservable sequence $\{X_k\}$). In quoting (3.3) for instance, we shall write $\mathrm{L}_{\nu,n}(y_{0:n})$ rather than $\mathrm{L}_{\nu,n}(y_0, \ldots, y_n)$. By transparent convention, $Y_{k:k}$ refers to the single variable $Y_k$, although the second notation ($Y_k$) is to be preferred in this particular case. In systematic expressions, however, it may be helpful to understand $Y_{k:k}$ as a valid replacement

of $Y_k$. For similar reasons, we shall, when needed, accept $Y_{k+1:k}$ as a valid empty set. The latter convention should easily be recalled by programmers, as instructions of the form "for i equals k+1 to k, do...", which do nothing, constitute a well-accepted ingredient of most programming idioms.

3.1.2 Smoothing

We first define generically what is meant by the word smoothing before deriving the basic results that form the core of the techniques discussed in the rest of the chapter.

Definition 3.1.3 (Smoothing, Filtering, Prediction). For positive indices $k$, $l$, and $n$ with $l \ge k$, denote by $\phi_{\nu,k:l|n}$ the conditional distribution of $X_{k:l}$ given $Y_{0:n}$, that is,
(a) $\phi_{\nu,k:l|n}$ is a transition kernel from $\mathsf{Y}^{n+1}$ to $\mathsf{X}^{l-k+1}$: for any given set $A \in \mathcal{X}^{\otimes(l-k+1)}$, $y_{0:n} \mapsto \phi_{\nu,k:l|n}(y_{0:n}, A)$ is a $\mathcal{Y}^{\otimes(n+1)}$-measurable function, and for any given sub-sequence $y_{0:n}$, $A \mapsto \phi_{\nu,k:l|n}(y_{0:n}, A)$ is a probability distribution on $(\mathsf{X}^{l-k+1}, \mathcal{X}^{\otimes(l-k+1)})$;
(b) $\phi_{\nu,k:l|n}$ satisfies, for any function $f \in \mathcal{F}_b(\mathsf{X}^{l-k+1})$,
$$\mathrm{E}_\nu[f(X_{k:l}) \mid Y_{0:n}] = \int f(x_{k:l})\,\phi_{\nu,k:l|n}(Y_{0:n}, dx_{k:l})\;,$$
where the equality holds $\mathbb{P}_\nu$-almost surely.

Specific choices of $k$ and $l$ give rise to several particular cases of interest:
- Joint Smoothing: $\phi_{\nu,0:n|n}$, for $n \ge 0$;
- (Marginal) Smoothing: $\phi_{\nu,k|n}$ for $n \ge k \ge 0$;
- Prediction: $\phi_{\nu,n+1|n}$ for $n \ge 0$; in describing algorithms, it will be convenient to extend our notation to use $\phi_{\nu,0|-1}$ as a synonym for the initial distribution $\nu$;
- p-step Prediction: $\phi_{\nu,n+p|n}$ for $n, p \ge 0$;
- Filtering: $\phi_{\nu,n|n}$ for $n \ge 0$. Because the use of filtering will be preeminent in the following, we shall most often abbreviate $\phi_{\nu,n|n}$ to $\phi_{\nu,n}$.

In more precise terms (see details in Section A.2 of Appendix A), $\phi_{\nu,k:l|n}$ is a version of the conditional distribution of $X_{k:l}$ given $Y_{0:n}$. It is however not obvious that such a quantity indeed exists in great generality. The proposition below complements Definition 3.1.3 by a constructive approach to defining the smoothing quantities from the elements of the hidden Markov model.

Proposition 3.1.4. Consider a hidden Markov model compatible with Definition 2.2.2, let $n$ be a positive integer and $y_{0:n} \in \mathsf{Y}^{n+1}$ a sub-sequence such that $\mathrm{L}_{\nu,n}(y_{0:n}) > 0$. The joint smoothing distribution $\phi_{\nu,0:n|n}$ then satisfies

$$\phi_{\nu,0:n|n}(y_{0:n}, f) = \mathrm{L}_{\nu,n}(y_{0:n})^{-1}\int\cdots\int f(x_{0:n})\,\nu(dx_0)\,g(x_0, y_0)\prod_{k=1}^{n} Q(x_{k-1}, dx_k)\,g(x_k, y_k) \qquad(3.5)$$
for all functions $f \in \mathcal{F}_b(\mathsf{X}^{n+1})$. Likewise, for indices $p \ge 0$,
$$\phi_{\nu,0:n+p|n}(y_{0:n}, f) = \int\cdots\int f(x_{0:n+p})\,\phi_{\nu,0:n|n}(y_{0:n}, dx_{0:n})\prod_{k=n+1}^{n+p} Q(x_{k-1}, dx_k) \qquad(3.6)$$

for all functions $f \in \mathcal{F}_b(\mathsf{X}^{n+p+1})$.

Proof. Equation (3.5) defines $\phi_{\nu,0:n|n}$ in a way that obviously satisfies part (a) of Definition 3.1.3. To prove the (b) part of the definition, recall the characterization of the conditional expectation given in Appendix A.2 and consider a function $h \in \mathcal{F}_b(\mathsf{Y}^{n+1})$. By (3.1),
$$\mathrm{E}_\nu[h(Y_{0:n})\,f(X_{0:n})] = \int\cdots\int h(y_{0:n})\,f(x_{0:n})\,\nu(dx_0)\,g(x_0, y_0)\prod_{k=1}^{n} Q(x_{k-1}, dx_k)\,g(x_k, y_k)\,\mu^n(dy_{0:n})\;.$$
Using Definition 3.1.1 of the likelihood $\mathrm{L}_{\nu,n}$ and (3.5) for $\phi_{\nu,0:n|n}$ yields
$$\mathrm{E}_\nu[h(Y_{0:n})\,f(X_{0:n})] = \int h(y_{0:n})\,\phi_{\nu,0:n|n}(y_{0:n}, f)\,\mathrm{L}_{\nu,n}(y_{0:n})\,\mu^n(dy_{0:n}) = \mathrm{E}_\nu[h(Y_{0:n})\,\phi_{\nu,0:n|n}(Y_{0:n}, f)]\;. \qquad(3.7)$$
Hence $\mathrm{E}_\nu[f(X_{0:n}) \mid Y_{0:n}]$ equals $\phi_{\nu,0:n|n}(Y_{0:n}, f)$, $\mathbb{P}_\nu$-a.e., for any function $f \in \mathcal{F}_b(\mathsf{X}^{n+1})$.
For (3.6), proceed similarly and consider two functions $f \in \mathcal{F}_b(\mathsf{X}^{n+p+1})$ and $h \in \mathcal{F}_b(\mathsf{Y}^{n+1})$. First apply (3.1) to obtain
$$\mathrm{E}_\nu[h(Y_{0:n})\,f(X_{0:n+p})] = \int\cdots\int h(y_{0:n})\,f(x_{0:n+p})\,\nu(dx_0)\,g(x_0, y_0)\prod_{k=1}^{n} Q(x_{k-1}, dx_k)\,g(x_k, y_k)\prod_{l=n+1}^{n+p} Q(x_{l-1}, dx_l)\,g(x_l, y_l)\,\mu^{n+p}(dy_{0:n+p})\;.$$
When integrating with respect to the sub-sequence $y_{n+1:n+p}$, the last product in the previous equation reduces to $\prod_{l=n+1}^{n+p} Q(x_{l-1}, dx_l)\,\mu^n(dy_{0:n})$. Finally use (3.3) and (3.5) to obtain
$$\mathrm{E}_\nu[h(Y_{0:n})\,f(X_{0:n+p})] = \int\cdots\int h(y_{0:n})\,f(x_{0:n+p})\,\phi_{\nu,0:n|n}(y_{0:n}, dx_{0:n})\prod_{k=n+1}^{n+p} Q(x_{k-1}, dx_k)\,\mathrm{L}_{\nu,n}(y_{0:n})\,\mu^n(dy_{0:n})\;, \qquad(3.8)$$

which concludes the proof.

Remark 3.1.5. The requirement that $\mathrm{L}_{\nu,n}(y_{0:n})$ be non-null is obviously required to guarantee that (3.5) makes sense and that (3.7) and (3.8) are correct. Note that if $S$ is a set such that $\int_S \mathrm{L}_{\nu,n}(y_{0:n})\,\mu^n(dy_{0:n}) = 0$, then $\mathbb{P}_\nu(Y_{0:n} \in S) = 0$ and the value of $\phi_{\nu,0:n|n}(y_{0:n}, \cdot)$ for $y_{0:n} \in S$ is irrelevant (see discussion in Appendix A.3). In the sequel, it is implicit that results similar to those in Proposition 3.1.4 hold for values of $y_{0:n} \in S_{\nu,n} \subseteq \mathsf{Y}^{n+1}$, where the set $S_{\nu,n}$ is such that $\mathbb{P}_\nu(Y_{0:n} \in S_{\nu,n}) = 1$. In most models of practical interest, this nuance can be ignored as it is indeed possible to set $S_{\nu,n} = \mathsf{Y}^{n+1}$. This is in particular the case when $g(x, y)$ is strictly positive for all values of $(x, y) \in \mathsf{X} \times \mathsf{Y}$. There are however more subtle cases where, for instance, the set $S_{\nu,n}$ really depends upon the initial distribution $\nu$ (see Example 4.3.28).

Proposition 3.1.4 also implicitly defines all other particular cases of smoothing kernels mentioned in Definition 3.1.3, as these are obtained by marginalization. For instance, the marginal smoothing kernel $\phi_{\nu,k|n}$ for $0 \le k \le n$ is such that for any $y_{0:n} \in \mathsf{Y}^{n+1}$ and $f \in \mathcal{F}_b(\mathsf{X})$,
$$\phi_{\nu,k|n}(y_{0:n}, f) \stackrel{\mathrm{def}}{=} \int f(x_k)\,\phi_{\nu,0:n|n}(y_{0:n}, dx_{0:n})\;, \qquad(3.9)$$
where $\phi_{\nu,0:n|n}$ is defined by (3.5). Likewise, for any given $y_{0:n} \in \mathsf{Y}^{n+1}$, the $p$-step predictive distribution $\phi_{\nu,n+p|n}(y_{0:n}, \cdot)$ may be obtained by marginalization of the joint distribution $\phi_{\nu,0:n+p|n}(y_{0:n}, \cdot)$ with respect to all variables $x_k$ except the last one (the one with index $k = n+p$). A closer examination of (3.6) together with the use of the Chapman-Kolmogorov equations introduced in (2.1) (cf. Chapter 14) directly shows that $\phi_{\nu,n+p|n}(y_{0:n}, \cdot) = \phi_{\nu,n}(y_{0:n}, \cdot)Q^p$, where $\phi_{\nu,n}$ refers to the filter (conditional distribution of $X_n$ given $Y_{0:n}$).

3.1.3 The Forward-Backward Decomposition

As stated in the introduction, the rest of the chapter is devoted to techniques upon which the marginal smoothing kernels $\phi_{\nu,k|n}$ may be efficiently computed

for all values of $k$ in $\{0, \ldots, n\}$ for a given, pre-specified, value of $n$. This is the task that we referred to as fixed interval smoothing. In doing so, our main tool will be a simple representation of $\phi_{\nu,k|n}$, which we now introduce. Replacing $\phi_{\nu,0:n|n}$ in (3.9) by its expression given in (3.5) shows that it is always possible to rewrite $\phi_{\nu,k|n}(y_{0:n}, f)$, for functions $f \in \mathcal{F}_b(\mathsf{X})$, as
$$\phi_{\nu,k|n}(y_{0:n}, f) = \mathrm{L}_{\nu,n}(y_{0:n})^{-1}\int f(x)\,\alpha_{\nu,k}(y_{0:k}, dx)\,\beta_{k|n}(y_{k+1:n}, x)\;, \qquad(3.10)$$
where $\alpha_{\nu,k}$ and $\beta_{k|n}$ are defined below in (3.11) and (3.12), respectively. In simple terms, $\alpha_{\nu,k}$ corresponds to the factors in the multiple integral that are to be integrated with respect to the state variables $x_l$ with indices $l \le k$, while $\beta_{k|n}$ gathers the remaining factors (which are to be integrated with respect to $x_l$ for $l > k$). This simple splitting of the multiple integration in (3.9) constitutes the forward-backward decomposition.

Definition 3.1.6 (Forward-Backward Variables). For $k \in \{0, \ldots, n\}$, define the following quantities.
Forward Kernel: $\alpha_{\nu,k}$ is the non-negative finite kernel from $(\mathsf{Y}^{k+1}, \mathcal{Y}^{\otimes(k+1)})$ to $(\mathsf{X}, \mathcal{X})$ such that
$$\alpha_{\nu,k}(y_{0:k}, f) = \int\cdots\int f(x_k)\,\nu(dx_0)\,g(x_0, y_0)\prod_{l=1}^{k} Q(x_{l-1}, dx_l)\,g(x_l, y_l)\;, \qquad(3.11)$$
with the convention that the rightmost product term is empty for $k = 0$.
Backward Function: $\beta_{k|n}$ is the non-negative measurable function on $\mathsf{Y}^{n-k} \times \mathsf{X}$ defined by
$$\beta_{k|n}(y_{k+1:n}, x) = \int\cdots\int Q(x, dx_{k+1})\,g(x_{k+1}, y_{k+1})\prod_{l=k+2}^{n} Q(x_{l-1}, dx_l)\,g(x_l, y_l)\;, \qquad(3.12)$$
for $k \le n-1$ (with the same convention that the rightmost product is empty for $k = n-1$); $\beta_{n|n}(\cdot)$ is set to the constant function equal to 1 on $\mathsf{X}$.

The terms forward and backward variables, as well as the use of the symbols $\alpha$ and $\beta$, are part of the HMM credo and date back to the seminal work of Baum and his colleagues (Baum et al., 1970, p. 168). It is clear however that for a general model as given in Definition 2.2.2, these quantities as defined in (3.11) and (3.12) are very different in nature, and indeed sufficiently so to prevent the use of the loosely defined term "variable". In the original framework studied by Baum and his coauthors where $\mathsf{X}$ is a finite set, both the forward measures $\alpha_{\nu,k}(y_{0:k}, \cdot)$ and the backward functions $\beta_{k|n}(y_{k+1:n}, \cdot)$ can be represented by vectors with non-negative entries. Indeed, in this case $\alpha_{\nu,k}(y_{0:k}, x)$ has the interpretation $\mathbb{P}_\nu(Y_0 = y_0, \ldots, Y_k = y_k, X_k = x)$ while

$\beta_{k|n}(y_{k+1:n}, x)$ has the interpretation $\mathbb{P}(Y_{k+1} = y_{k+1}, \ldots, Y_n = y_n \mid X_k = x)$. This way of thinking of $\alpha_{\nu,k}$ and $\beta_{k|n}$ may be extended to general state spaces: $\alpha_{\nu,k}(y_{0:k}, dx)$ is then the joint density (with respect to $\mu^{\otimes(k+1)}$) of $Y_0, \ldots, Y_k$ and distribution of $X_k$, while $\beta_{k|n}(y_{k+1:n}, x)$ is the conditional joint density (with respect to $\mu^{\otimes(n-k)}$) of $Y_{k+1}, \ldots, Y_n$ given $X_k = x$. Obviously, these entities may then not be represented as vectors of finite length, as when $\mathsf{X}$ is finite; this situation is the exception rather than the rule. Let us simply remark at this point that while the forward kernel at index $k$ is defined irrespectively of the length $n$ of the observation sequence (as long as $n \ge k$), the same is not true for the backward functions. The sequence of backward functions clearly depends on the index where the observation sequence stops. In general, for instance, $\beta_{k|n-1}$ differs from $\beta_{k|n}$ even if we assume that the same sub-observation sequence $y_{0:n-1}$ is considered in both cases. This is the reason for adding the terminal index $n$ to the notation used for the backward functions. This notation also constitutes a departure from HMM traditions in which the backward functions are simply indexed by $k$. For $\alpha_{\nu,k}$, the situation is closer to standard practice and we simply add the subscript $\nu$ to recall that the forward kernel $\alpha_{\nu,k}$, in contrast with the backward function, does depend on the distribution $\nu$ postulated for the initial state $X_0$.
n

,0:n|n (f ) = L1 ,n

f (x0:n ) (dx0 )g0 (x0 )


i=1

Q(xi1 , dxi )gi (xi ) , (3.13)


def

where gk are the data-dependent functions on X dened by gk (x) = g(x, yk ) for the particular sequence y0:n under consideration. The sequence of functions {gk } is about the only new notation that is needed as we simply re-use the previously dened quantities omitting their explicit dependence on the observations. For instance, in addition to writing L,n instead of L,n (y0:n ), we

will also use $\alpha_{\nu,n}(\cdot)$ rather than $\alpha_{\nu,n}(y_{0:n}, \cdot)$, $\beta_{k|n}(\cdot)$ rather than $\beta_{k|n}(y_{k+1:n}, \cdot)$, etc. This notational simplification implies a corresponding terminological adjustment. For instance, $\alpha_{\nu,k}$ will be referred to as the forward measure at index $k$ and considered as a positive finite measure on $(\mathsf{X}, \mathcal{X})$. In all cases, the conversion should be easy to do mentally: in the case of $\alpha_{\nu,k}$, for instance, what is meant is really the measure $\alpha_{\nu,k}(y_{0:k}, \cdot)$, for a particular value of $y_{0:k} \in \mathsf{Y}^{k+1}$.
At first sight, omitting the observations may seem a weird thing to do in a statistically oriented book. However, for posterior state inference in HMMs, one indeed works conditionally on a given fixed sequence of observations. Omitting the observations from our notation will thus allow more concise expressions in most parts of the book. There are of course some properties of the hidden Markov model for which dependence with respect to the distribution of the observations does matter (hopefully!). This is in particular the case of Section 4.3 on forgetting and of Chapter 12, which deals with statistical properties of the estimates, for which we will make the dependence with respect to the observations explicit.

3.2 Forward-Backward
The forward-backward decomposition introduced in Section 3.1.3 is just a rewriting of the multiple integral in (3.9) such that for $f \in \mathcal{F}_b(\mathsf{X})$,
$$\phi_{\nu,k|n}(f) = \mathrm{L}_{\nu,n}^{-1}\int f(x)\,\alpha_{\nu,k}(dx)\,\beta_{k|n}(x)\;, \qquad(3.14)$$
where
$$\alpha_{\nu,k}(f) = \int\cdots\int f(x_k)\,\nu(dx_0)\,g_0(x_0)\prod_{l=1}^{k} Q(x_{l-1}, dx_l)\,g_l(x_l) \qquad(3.15)$$
and
$$\beta_{k|n}(x) = \int\cdots\int Q(x, dx_{k+1})\,g_{k+1}(x_{k+1})\prod_{l=k+2}^{n} Q(x_{l-1}, dx_l)\,g_l(x_l)\;. \qquad(3.16)$$
The last expression is, by convention, equal to 1 for the final index $k = n$. Note that we are now using the implicit conditioning convention discussed in the previous section.

3.2.1 The Forward-Backward Recursions

The point of using the forward-backward decomposition for the smoothing problem is that both the forward measures $\alpha_{\nu,k}$ and the backward functions

$\beta_{k|n}$ can be expressed recursively rather than by their integral representations (3.15) and (3.16). This is the essence of the forward-backward algorithm proposed by Baum et al. (1970, p. 168), which we now describe. Section 3.4 at the end of this chapter gives further comments on historical and terminological aspects of the forward-backward algorithm.

Proposition 3.2.1 (Forward-Backward Recursions). The forward measures defined by (3.15) may be obtained, for all $f \in \mathcal{F}_b(\mathsf{X})$, recursively for $k = 1, \ldots, n$ according to
$$\alpha_{\nu,k}(f) = \iint f(x')\,\alpha_{\nu,k-1}(dx)\,Q(x, dx')\,g_k(x') \qquad(3.17)$$
with initial condition
$$\alpha_{\nu,0}(f) = \int f(x)\,g_0(x)\,\nu(dx)\;. \qquad(3.18)$$
Similarly, the backward functions defined by (3.16) may be obtained, for all $x \in \mathsf{X}$, by the recursion
$$\beta_{k|n}(x) = \int Q(x, dx')\,g_{k+1}(x')\,\beta_{k+1|n}(x') \qquad(3.19)$$
operating on decreasing indices $k = n-1$ down to 0; the initial condition is
$$\beta_{n|n}(x) = 1\;. \qquad(3.20)$$

Proof. The proof of this result is straightforward and similar for both recursions. For $\alpha_{\nu,k}$ for instance, simply rewrite (3.15) as
$$\alpha_{\nu,k}(f) = \int_{x_k \in \mathsf{X}} f(x_k)\int_{x_{k-1} \in \mathsf{X}}\bigg[\int_{x_0 \in \mathsf{X}, \ldots, x_{k-2} \in \mathsf{X}} \nu(dx_0)\,g_0(x_0)\prod_{l=1}^{k-1} Q(x_{l-1}, dx_l)\,g_l(x_l)\bigg]\,Q(x_{k-1}, dx_k)\,g_k(x_k)\;,$$
where the term in brackets is recognized as $\alpha_{\nu,k-1}(dx_{k-1})$.

Remark 3.2.2 (Concise Markov Chain Notations). In the following, we shall often quote the above results using the concise Markov chain notations introduced in Chapter 2. For instance, instead of (3.17) and (3.19) one could write more simply $\alpha_{\nu,k}(f) = \alpha_{\nu,k-1}Q(f\,g_k)$ and $\beta_{k|n} = Q(g_{k+1}\,\beta_{k+1|n})$. Likewise, the decomposition (3.14) may be rewritten as $\phi_{\nu,k|n}(f) = \mathrm{L}_{\nu,n}^{-1}\,\alpha_{\nu,k}(f\,\beta_{k|n})$.
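When $\mathsf{X}$ is finite, the forward measures and backward functions are vectors and the recursions (3.17)-(3.20) become matrix-vector products. The following minimal sketch (Python/NumPy, with made-up transition and observation values for a two-state chain and four observations) implements them in unnormalized form and recovers the marginal smoothing distributions of (3.14).

```python
import numpy as np

# Finite state space: Q is the transition matrix, nu the initial distribution,
# and g[k] the vector of observation densities g(., y_k) evaluated at each
# state (all numerical values are made up for illustration).
Q = np.array([[0.8, 0.2], [0.3, 0.7]])
nu = np.array([0.5, 0.5])
g = np.array([[0.9, 0.2], [0.1, 0.7], [0.8, 0.3], [0.2, 0.6]])
n = g.shape[0] - 1

alpha = np.zeros_like(g)
alpha[0] = nu * g[0]                          # eq. (3.18)
for k in range(1, n + 1):
    alpha[k] = (alpha[k - 1] @ Q) * g[k]      # eq. (3.17)

beta = np.ones_like(g)                        # eq. (3.20)
for k in range(n - 1, -1, -1):
    beta[k] = Q @ (g[k + 1] * beta[k + 1])    # eq. (3.19)

L = alpha[n].sum()                            # likelihood L_{nu,n}
smooth = alpha * beta / L                     # marginal smoothing, eq. (3.14)
assert np.allclose(smooth.sum(axis=1), 1.0)
```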

The main shortcoming of the forward-backward representation is that the quantities $\alpha_{\nu,k}$ and $\beta_{k|n}$ do not have an immediate probabilistic interpretation. Recall, in particular, that the first one is a finite (positive) measure but certainly not a probability measure, as $\alpha_{\nu,k}(\mathbf{1}) \ne 1$ (in general). There is however an important solidarity result between the forward and backward quantities $\alpha_{\nu,k}$ and $\beta_{k|n}$, which is summarized by the following proposition.

Proposition 3.2.3. For all indices $k \in \{0, \ldots, n\}$, $\alpha_{\nu,k}(\beta_{k|n}) = \mathrm{L}_{\nu,n}$ and $\alpha_{\nu,k}(\mathbf{1}) = \mathrm{L}_{\nu,k}$, where $\mathrm{L}_{\nu,k}$ refers to the likelihood of the observations up to index $k$ (included) only, under $\mathbb{P}_\nu$.

Proof. Because (3.14) must hold in particular for $f = \mathbf{1}$ and the marginal smoothing distribution $\phi_{\nu,k|n}$ is a probability measure,
$$\phi_{\nu,k|n}(\mathbf{1}) = 1 = \mathrm{L}_{\nu,n}^{-1}\,\alpha_{\nu,k}(\beta_{k|n})\;.$$
For the final index $k = n$, $\beta_{n|n}$ is the constant function equal to 1 and hence $\alpha_{\nu,n}(\mathbf{1}) = \mathrm{L}_{\nu,n}$. This observation is however not specific to the final index $n$, as $\alpha_{\nu,k}$ only depends on the observations up to index $k$ and thus any particular index may be selected as a potential final index (in contrast to what happens for the backward functions).

3.2.2 Filtering and Normalized Recursion

The forward and backward quantities $\alpha_{\nu,k}$ and $\beta_{k|n}$, as defined in previous sections, are unnormalized in the sense that their scales are largely unknown. On the other hand, we know that $\alpha_{\nu,k}(\beta_{k|n})$ is equal to $\mathrm{L}_{\nu,n}$, the likelihood of the observations up to index $n$ under $\mathbb{P}_\nu$. The long-term behavior of the likelihood $\mathrm{L}_{\nu,n}$, or rather its logarithm, is governed by a result known as the asymptotic equipartition property, or AEP (Cover and Thomas, 1991), in the information theoretic literature and as the Shannon-McMillan-Breiman theorem in the statistical literature. For HMMs, Proposition 12.3.3 (Chapter 12) shows that under suitable mixing conditions on the underlying unobservable chain $\{X_k\}_{k\ge 0}$, the AEP holds in that $n^{-1}\log\mathrm{L}_{\nu,n}$ converges $\mathbb{P}_\nu$-a.s. to a limit as $n$ tends to infinity. The likelihood $\mathrm{L}_{\nu,n}$ will thus either grow to infinity or shrink to zero, depending on the sign of the limit, exponentially fast in $n$. This has the practical implication that in all cases where the recursions of Proposition 3.2.1 are effectively computable (like in the case of finite state space to be discussed in Chapter 5), the dynamics of the numerical values needed to represent $\alpha_{\nu,k}$ and $\beta_{k|n}$ is so large that it
rapidly exceeds the available machine representation possibilities (even with high accuracy floating-point representations). The famous tutorial by Rabiner (1989) coined the term scaling to describe a practical solution to this problem. Interestingly, scaling also partly answers the question of the probabilistic interpretation of the forward and backward quantities.
Scaling as described by Rabiner (1989) amounts to normalizing $\alpha_{\nu,k}$ and $\beta_{k|n}$ by positive real numbers to keep the numeric values needed to represent $\alpha_{\nu,k}$ and $\beta_{k|n}$ within reasonable bounds. There are clearly a variety of options available, especially if one replaces (3.14) by the equivalent auto-normalized form
$$\phi_{\nu,k|n}(f) = [\alpha_{\nu,k}(\beta_{k|n})]^{-1}\,\alpha_{\nu,k}(f\,\beta_{k|n})\;, \qquad(3.21)$$
assuming that $\alpha_{\nu,k}(\beta_{k|n})$ is indeed finite and non-zero. In our view, the most natural scaling scheme (developed below) consists in replacing the measure $\alpha_{\nu,k}$ and the function $\beta_{k|n}$ by scaled versions $\bar\alpha_{\nu,k}$ and $\bar\beta_{k|n}$ of these quantities, satisfying both (i) $\bar\alpha_{\nu,k}(\mathbf{1}) = 1$, and (ii) $\bar\alpha_{\nu,k}(\bar\beta_{k|n}) = 1$. Item (i) implies that the normalized forward measures $\bar\alpha_{\nu,k}$ are probability measures that have a probabilistic interpretation given below. Item (ii) implies that the normalized backward functions are such that $\phi_{\nu,k|n}(f) = \int f(x)\,\bar\beta_{k|n}(x)\,\bar\alpha_{\nu,k}(dx)$ for all $f \in \mathcal{F}_b(\mathsf{X})$, without the need for a further renormalization. We note that this scaling scheme differs slightly from the one described by Rabiner (1989). The reason for this difference, which only affects the scaling of the backward functions, is non-essential and will be discussed in Section 3.4.
To derive the probabilistic interpretation of $\bar\alpha_{\nu,k}$, observe that (3.14) and Proposition 3.2.3, instantiated for the final index $k = n$, imply that the filtering distribution $\phi_{\nu,n}$ at index $n$ (recall that $\phi_{\nu,n}$ is used as a simplified notation for $\phi_{\nu,n|n}$) may be written $[\alpha_{\nu,n}(\mathbf{1})]^{-1}\alpha_{\nu,n}$. This finding is of course not specific to the choice of the index $n$, as already discussed when proving the second statement of Proposition 3.2.3. Thus, the normalized version $\bar\alpha_{\nu,k}$ of the forward measure $\alpha_{\nu,k}$ coincides with the filtering distribution $\phi_{\nu,k}$ introduced in Definition 3.1.3. This observation together with Proposition 3.2.3 implies that there is a unique choice of scaling scheme that satisfies the two requirements of the previous paragraph, as
$$\int f(x)\,\phi_{\nu,k|n}(dx) = \mathrm{L}_{\nu,n}^{-1}\int f(x)\,\alpha_{\nu,k}(dx)\,\beta_{k|n}(x) = \int f(x)\,\big[\mathrm{L}_{\nu,k}^{-1}\,\alpha_{\nu,k}(dx)\big]\,\big[\mathrm{L}_{\nu,k}\,\mathrm{L}_{\nu,n}^{-1}\,\beta_{k|n}(x)\big] = \int f(x)\,\bar\alpha_{\nu,k}(dx)\,\bar\beta_{k|n}(x)$$

63

must hold for any f Fb (X). The following denition summarizes these conclusions, using the notation ,k rather than ,k , as these two denitions refer to the same objectthe ltering distribution at index k. Denition 3.2.4 (Normalized Forward-Backward Variables). For k {0, . . . , n}, the normalized forward measure ,k coincides with the ltering distribution ,k and satises ,k = [,k (1)]1 ,k = L1 ,k . ,k The normalized backward functions k|n are dened by k|n = L,k ,k (1) k|n = k|n . ,k (k|n ) L,n

The above denition would be pointless if computing ,k and k|n was in deed necessary to obtain the normalized variables ,k and k|n . The following result shows that this is not the case. Proposition 3.2.5 (Normalized Forward-Backward Recursions). Forward Filtering Recursion The ltering measures may be obtained, for all f Fb (X), recursively for k = 1, . . . , n according to c,k = ,k (f ) = c1 ,k with initial condition c,0 = g0 (x)(dx) , f (x)g0 (x) (dx) . ,k1 (dx)Q(x, dx )gk (x ) , f (x ) ,k1 (dx)Q(x, dx )gk (x ) , (3.22)

,0 (f ) = c1 ,0

Normalized Backward Recursion The normalized backward functions may be obtained, for all x X, by the recursion k|n (x) = c1 ,k+1 Q(x, dx )gk+1 (x )k+1|n (x ) (3.23)

operating on decreasing indices k = n 1 down to 0; the initial condition is n|n (x) = 1. Once the two recursions above have been carried out, the smoothing distribution at any given index k {0, . . . , n} is available via ,k|n (f ) = for all f Fb (X). f (x) k|n (x),k (dx) (3.24)

64

3 Filtering and Smoothing Recursions

Proof. Proceeding by forward induction for ,k and backward induction for k|n , it is easily checked from (3.22) and (3.23) that
k 1 n 1

,k =
l=0

c,l

,k

and

k|n =
l=k+1

c,l

k|n .

(3.25)

Because ,k is normalized,
k 1

,k (1) = 1 =
l=0

def

c,l

,k (1) .

Proposition 3.2.3 then implies that for any integer k,


k

L,k =
l=0

c,l .

(3.26)

In other words, c,0 = L,0 and for subsequent indices k 1, c,k = L,k /L,k1 . Hence (3.25) coincides with the normalized forward and backward variables as specied by Denition 3.2.4. We now pause to state a series of remarkable consequences of Proposition 3.2.5. Remark 3.2.6. The forward recursion in (3.22) may also be rewritten to highlight a two-step procedure involving both the predictive and ltering measures. Recall our convention that ,0|1 refers to the predictive distribution of X0 when no observation is available and is thus an alias for , the distribution of X0 . For k {0, 1, . . . , n} and f Fb (X), (3.22) may be decomposed as c,k = ,k|k1 (gk ) , ,k (f ) = c1 ,k|k1 (f gk ) , ,k ,k+1|k = ,k Q . (3.27) The equivalence of (3.27) with (3.22) is straightforward and is a direct consequence of the remark that k+1|k = ,k Q, which follows from Proposition 3.1.4 in Section 3.1.2. In addition, each of the two steps in (3.27) has a very transparent interpretation. Predictor to Filter : The rst two equations in (3.27) may be summarized as ,k (f ) f (x) g(x, Yk ) ,k|k1 (dx) , (3.28)

where the symbol means up to a normalization constant (such that ,k (1) = 1) and the full notation g(x, Yk ) is used in place of gk (x) to highlight the dependence on the current observation Yk . Equation (3.28) is recognized as Bayes rule applied to a very simple equivalent Bayesian pseudo-model in which

3.2 Forward-Backward

65

Xk is distributed a priori according to the predictive distribution ,k|k1 , g is the conditional probability density function of Yk given Xk . The lter ,k is then interpreted as the posterior distribution of Xk given Yk in this simple equivalent Bayesian pseudo-model.

Filter to Predictor : The last equation in (3.27) simply means that the updated predicting distribution ,k+1|k is obtained by applying the transition kernel Q to the current ltering distribution ,k . We are thus left with the very basic problem of determining the one-step distribution of a Markov chain given its initial distribution. Remark 3.2.7. In many situations, using (3.27) to determine ,k is indeed the goal rather than simply a rst step in computing smoothed distributions. In particular, for sequentially observed data, one may need to take actions based on the observations gathered so far. In such cases, ltering (or prediction) is the method of choice for inference about the unobserved states, a topic that will be developed further in Chapter 7. Remark 3.2.8. Another remarkable fact about the ltering recursion is that (3.26) together with (3.27) provides a method for evaluating the likelihood L,k of the observations up to index k recursively in the index k. In addition, as c,k = L,k /L,k1 from (3.26), c,k may be interpreted as the conditional likelihood of Yk given the previous observations Y0:k1 . However, as discussed at the beginning of Section 3.2.2, using (3.26) directly is generally impracticable for numerical reasons. In order to avoid numerical underor overow, one can equivalently compute the log-likelihood ,k . Combining (3.26) and (3.27) gives the important formula
k def ,k

= log L,k =
l=0

log ,l|l1 (gl ) ,

(3.29)

where ,l|l1 is the one-step predictive distribution computed according to (3.27) (recalling that by convention, ,0|1 is used as an alternative notation for ). Remark 3.2.9. The normalized backward function k|n does not have a simple probabilistic interpretation when isolated from the corresponding ltering measure. However, (3.24) shows that the marginal smoothing distribution, ,k|n , is dominated by the corresponding ltering distribution ,k and that k|n is by denition the Radon-Nikodym derivative of ,k|n with respect to ,k , d,k|n k|n = d,k As a consequence,

66

3 Filtering and Smoothing Recursions

inf M R : ,k ({k|n M }) = 0 1 and sup M R : ,k ({k|n M }) = 0 1 , with the conventions inf = and sup = . As a consequence, all values of k|n cannot get simultaneously large or close to zero as was the case for k|n , although one cannot exclude the possibility that k|n still has important dynamics without some further assumptions on the model. n The normalizing factor l=k+1 c,l = L,n /L,k by which k|n diers from the corresponding unnormalized backward function k|n may be interpreted as the conditional likelihood of the future observations Yk+1:n given the observations up to index k, Y0:k .

3.3 Markovian Decompositions


The forward-backward recursions (Proposition 3.2.1) and their normalized versions (Proposition 3.2.5) were probably already well-known to readers familiar with the hidden Markov model literature. A less widely observed fact is that the smoothing distributions may also be expressed using Markov transitions. In contrast to the forward-backward algorithm, this second approach will already be familiar to readers working with dynamic (or state-space) models (Kailath et al., 2000, Chapter 10). Indeed, the method to be described in Section 3.3.2, when applied to the specic case of Gaussian linear state-space models, is known as Rauch-Tung-Striebel (sometimes, abbreviated to RTS) smoothing after Rauch et al. (1965). The important message here is that {Xk }k0 (as well as the index-reversed version of {Xk }k0 , although greater care is needed to handle this second case) is a non-homogeneous Markov chain when conditioned on some observed values {Yk }0kn . The use of this approach for HMMs with nite state spaces as an alternative to the forwardbackward recursions is due to Askar and Derin (1981)see also (Ephraim and Merhav, 2002, Section V) for further references. 3.3.1 Forward Decomposition Let n be a given positive index and consider the nite-dimensional distributions of {Xk }k0 given Y0:n . Our goal will be to show that the distribution of Xk given X0:k1 and Y0:n reduces to that of Xk given Xk1 only and Y0:n , this for any positive index k. The following denition will be instrumental in decomposing the joint posterior distributions ,0:k|n . Denition 3.3.1 (Forward Smoothing Kernels). Given n 0, dene for indices k {0, . . . , n 1} the transition kernels

3.3 Markovian Decompositions

67

if k|n (x) = 0 otherwise , (3.30) for any point x X and set A X . For indices k n, simply set Fk|n (x, A) =
def A

[k|n (x)]1 0

Q(x, dx )gk+1 (x )k+1|n (x )

Fk|n = Q , where Q is the transition kernel of the unobservable chain {Xk }k0 .

def

(3.31)

Note that for indices k n 1, Fk|n depends on the future observations Yk+1:n through the backward variables k|n and k+1|n only. The subscript n in the Fk|n notation is meant to underline the fact that, like the backward functions k|n , the forward smoothing kernels Fk|n depend on the nal index n where the observation sequence ends. The backward recursion of Proposition 3.2.1 implies that [k|n (x)]1 is the correct normalizing constant. Thus, for any x X, A Fk|n (x, A) is a probability measure on X . Because the functions x k|n (x) are measurable on (X, X ), for any set A X , x Fk|n (x, A) is X /B(R)-measurable. Therefore, Fk|n is indeed a Markov transition kernel on (X, X ). The next proposition provides a probabilistic interpretation of this denition in terms of the posterior distribution of the state at time k + 1, given the observations up to time n and the state sequence up to time k. Proposition 3.3.2. Given n, for any index k 0 and function f Fb (X), E [f (Xk+1 ) | X0:k , Y0:n ] = Fk|n (Xk , f ) , where Fk|n is the forward smoothing kernel dened by (3.30) for indices k n 1 and (3.31) for indices k n. Proof. First consider an index 0 k n and let f and h denote functions in Fb (X) and Fb Xk+1 , respectively. Then E [f (Xk+1 )h(X0:k ) | Y0:n ] = f (xk+1 )h(x0:k ) ,0:k+1|n (dx0:k+1 ) ,

which, using (3.13) and the denition (3.16) of the backward function, expands to
k

L1 ,n

h(x0:k ) (dx0 )g0 (x0 )


i=1

Q(xi1 , dxi )gi (xi )

Q(xk , dxk+1 )f (xk+1 )gk+1 (xk+1 )


n

i=k+2

Q(xi1 , dxi )gi (xi ) . (3.32)


k+1|n (xk+1 )

68

3 Filtering and Smoothing Recursions

From Denition 3.3.1, Q(xk , dxk+1 )f (xk+1 )gk+1 (xk+1 )k+1|n (xk+1 ) is equal to Fk|n (xk , f )k|n (xk ). Thus, (3.32) may be rewritten as E [f (Xk+1 )h(X0:k ) | Y0:n ] = L1 ,n
k

Fk|n (xk , f )h(x0:k )

(dx0 )g0 (x0 )


i=1

Q(xi1 , dxi )gi (xi ) k|n (xk ) . (3.33)

Using the denition (3.16) of k|n again, this latter integral is easily seen to be similar to (3.32) except for the fact that f (xk+1 ) has been replaced by Fk|n (xk , f ). Hence E [f (Xk+1 )h(X0:k ) | Y0:n ] = E [Fk|n (Xk , f )h(X0:k ) | Y0:n ] , for all functions h Fb Xk+1 as requested. For k n, the situation is simpler because (3.6) implies that ,0:k+1|n = ,0:k|n Q. Hence, E [f (Xk+1 )h(X0:k ) | Y0:n ] = and thus E [f (Xk+1 )h(X0:k ) | Y0:n ] = h(x0:k ),0:k|n (dx0:k )Q(xk , f ) , h(x0:k ) ,0:k|n (dx0:k ) Q(xk , dxk+1 )f (xk+1 ) ,

= E [Q(Xk , f )h(X0:k ) | Y0:n ] .

Remark 3.3.3. A key ingredient of the above proof is (3.32), which gives a representation of the joint smoothing distribution of the state variables X0:k given the observations up to index n, with n k. This representation, which states that ,0:k|n (f )
k

= L1 ,n

f (x0:k ) (dx0 )g0 (x0 )


i=1

Q(xi1 , dxi )gi (xi ) k|n (xk ) (3.34)

for all f Fb Xk+1 , is a generalization of the marginal forward-backward decomposition as stated in (3.14). Proposition 3.3.2 implies that, conditionally on the observations Y0:n , the state sequence {Xk }k0 is a non-homogeneous Markov chain associated with

3.3 Markovian Decompositions

69

the family of Markov transition kernels {Fk|n }k0 and initial distribution ,0|n . The fact that the Markov property of the state sequence is preserved when conditioning sounds surprising because the (marginal) smoothing distribution of the state Xk depends on both past and future observations. There is however nothing paradoxical here, as the Markov transition kernels Fk|n indeed depend (and depend only) on the future observations Yk+1:n . As a consequence of Proposition 3.3.2, the joint smoothing distributions may be rewritten in a form that involves the forward smoothing kernels using the Chapman-Kolmogorov equations (2.1). Proposition 3.3.4. For any integers n and m, function f Fb Xm+1 and initial probability on (X, X ), E [f (X0:m )) | Y0:n ] =
m

f (x0:m ) ,0|n (dx0 )


i=1

Fi1|n (xi1 , dxi ) , (3.35)

where {Fk|n }k0 are dened by (3.30) and (3.31) and ,0|n is the marginal smoothing distribution dened, for any A X , by ,0|n (A) = [(g0 0|n )]1
A

(dx)g0 (x)0|n (x) .

(3.36)

If one is only interested in computing the xed point marginal smoothing distributions, (3.35) may also be used as the second phase of a smoothing approach which we recapitulate below. Corollary 3.3.5 (Alternative Smoothing Algorithm). Backward Recursion Compute the backward variables n|n down to 0|n by backward recursion according to (3.19) in Proposition 3.2.1. Forward Smoothing ,0|n is given by (3.36) and for k 0, ,k+1|n = ,k|n Fk|n , where Fk|n are the forward kernels dened by (3.30). For numerical implementation, Corollary 3.3.5 is denitely less attractive than the normalized forward-backward approach of Proposition 3.2.5 because the backward pass cannot be carried out in normalized form without rst determining the forward measures ,k . We will discuss in Chapter 5 some specic models where these recursions can be implemented with some form of normalization, but generally speaking the backward decomposition to be described next is preferable for practical computation of the marginal smoothing distributions. On the other hand, Proposition 3.3.4 provides a general decomposition of the joint smoothing distribution that will be instrumental in establishing some form of ergodicity of the Markov chain that corresponds to the unobservable states {Xk }k0 , conditional on some observations Y0:n (see Section 4.3).

70

3 Filtering and Smoothing Recursions

3.3.2 Backward Decomposition In the previous section it was shown that, conditionally on the observations up to index n, Y0:n , the state sequence {Xk }k0 is a Markov chain, with transition kernels Fk|n . We now turn to the so-called time-reversal issue: is it true in general that the unobserved chain with the indices in reverse order, forms a non-homogeneous Markov chain, conditionally on some observations Y0:n ? We already discussed time-reversal for Markov chains in Section 2.1 where it has been argued that the main technical diculty consists in guaranteeing that the reverse kernel does exist. For this, we require somewhat stronger assumptions on the nature of X by assuming for the rest of this section that X is a Polish space and that X is the associated Borel -eld. From the discussion in Section 2.1 (see Denition 2.1.2 and comment below), we then know that the reverse kernel does exist although we may not be able to provide a simple closed-form expression for it. The reverse kernel does have a simple expression, however, as soon as one assumes that the kernel to be reversed and the initial distribution admit densities with respect to some measure on X. Let us now return to the smoothing problem. For positive indices k such that k n1, the posterior distribution of (Xk , Xk+1 ) given the observations up to time k satises E [f (Xk , Xk+1 ) | Y0:k ] = f (xk , xk+1 ) ,k (dxk ) Q(xk , dxk+1 ) (3.37)

for all f Fb (X X). From the previous discussion, there exists a Markov transition kernel B,k which satises Denition 2.1.2, that is B,k = {B,k (x, A), x X, A X } such that for any function f Fb (X X), E [f (Xk , Xk+1 ) | Y0:k ] = f (xk , xk+1 ) ,k+1|k (dxk+1 ) B,k (xk+1 , dxk ) , (3.38) where ,k+1|k = ,k Q is the one-step predictive distribution. Proposition 3.3.6. Given a strictly positive index n, initial distribution , and index k {0, . . . , n 1}, E [f (Xk ) | Xk+1:n , Y0:n ] = B,k (Xk+1 , f ) for any f Fb (X). Here, B,k is the backward smoothing kernel dened in (3.38). Before giving the proof of this result, we make a few remarks to provide some intuitive understanding of the backward smoothing kernels.
def

3.3 Markovian Decompositions

71

Remark 3.3.7. Contrary to the forward kernel, the backward transition kernel is only dened implicitly through the equality of the two representations (3.37) and (3.38). This limitation is fundamentally due to the fact that the backward kernel implies a non-trivial time-reversal operation. Proposition 3.3.6 however allows a simple interpretation of the backward kernel: Because E [f (Xk ) | Xk+1:n , Y0:n ] is equal to B,k (Xk+1 , f ) and thus depends neither on Xl for l > k + 1 nor on Yl for l k + 1, the tower property of conditional expectation (Proposition A.2.3) implies that not only is B,k (Xk+1 , f ) equal to E [f (Xk ) | Xk+1 , Y0:n ] but also coincides with E [f (Xk ) | Xk+1 , Y0:k ], for any f Fb (X). In addition, the distribution of Xk+1 given Xk and Y0:k reduces to Q(Xk , ) due to the particular form of the transition kernel associated with a hidden Markov model (see Denition 2.2.1). Recall also that the distribution of Xk given Y0:k is denoted by ,k . Thus, B,k can be interpreted as a Bayesian posterior in the equivalent pseudo-model where Xk is distributed a priori according to the ltering distribution ,k , The conditional distribution of Xk+1 given Xk is Q(Xk , ).

B,k (Xk+1 , ) is then interpreted as the posterior distribution of Xk given Xk+1 in this equivalent pseudo-model. In particular, for HMMs that are fully dominated in the sense of Definition 2.2.3, Q has a transition probability density function q with respect to a measure on X. This is then also the case for ,k , which is a marginal of (3.13). In such cases, we shall use the slightly abusive but unambiguous notation ,k (dx) = ,k (x) (dx) (that is, ,k denotes the probability density function with respect to rather than the probability distribution). The backward kernel B,k (xk+1 , ) then has a probability density function with respect to , which is given by Bayes formula, B,k (xk+1 , x) = ,k (x)q(x, xk+1 ) . ,k (x)q(x, xk+1 ) (dx) X (3.39)

Thus, in many cases of interest, the backward transition kernel B,k can be written straightforwardly as a function of ,k and Q. Several examples of such cases will be dealt with in some detail in Chapter 5. In these situations, Proposition 3.3.9 is the method of choice for smoothing, as it only involves normalized quantities, whereas Corollary 3.3.5 is not normalized and thus can generally not be implemented as it stands. Proof (of Proposition 3.3.6). Let k {0, . . . , n 1} and h Fb Xnk . Then E [f (Xk )h(Xk+1:n ) | Y0:n ] = f (xk )h(xk+1:n ) ,k:n|n (dxk:n ) . (3.40)

Using the denition (3.13) of the joint smoothing distribution ,k:n|n yields

72

3 Filtering and Smoothing Recursions

E [f (Xk )h(Xk+1:n ) | Y0:n ]


k

= L1 ,n

(dx0 )g0 (x0 )


i=1

Q(xi1 , dxi )gi (xi )f (xk )

Q(xi1 , dxi )gi (xi ) h(xk+1:n ) ,


i=k+1

L,k L,n

,k|n (dxk )Q(xk , dxk+1 )f (xk )gk+1 (xk+1 )


n

i=k+2

Q(xi1 , dxi )gi (xi ) h(xk+1:n ) ,

(3.41)

which implies, by the denition (3.38) of the backward kernel, that E [f (Xk )h(Xk+1:n ) | Y0:n ] L,k B,k (xk+1 , dxk )f (xk ),k+1|k (dxk+1 )gk+1 (xk+1 ) = L,n
n

i=k+2

Q(xi1 , dxi )gi (xi ) h(xk+1:n ) . (3.42)

Taking f 1 shows that for any function h Fb Xnk , E [h (Xk+1:n ) | Y0:n ] = L,k L,n h (xk+1:n )
n

,k+1|k (dxk+1 )gk+1 (xk+1 )


i=k+2

Q(xi1 , dxi )gi (xi ) .

Identifying h with h(xk+1:n ) f (x) B,k (xk+1 , dx), we nd that (3.42) may be rewritten as E [f (Xk )h(Xk+1:n ) | Y0:n ] = E h(Xk+1:n ) which concludes the proof. The next result is a straightforward consequence of Proposition 3.3.6, which reformulates the joint smoothing distribution ,0:n|n in terms of the backward smoothing kernels. Corollary 3.3.8. For any integer n > 0 and initial probability ,
n1

B,k (Xk+1 , dx)f (x) Y0:n ,

E [f (X0:n ) | Y0:n ] =

f (x0:m ) ,n (dxn )
k=0

B,k (xk+1 , dxk )

(3.43)

3.3 Markovian Decompositions

73

for all f Fb Xn+1 . Here, {B,k }0kn1 are the backward smoothing kernels dened in (3.38) and ,n is the marginal ltering distribution corresponding to the nal index n. It follows from Proposition 3.3.6 and Corollary 3.3.8 that, conditionally on Y0:n , the joint distribution of the index-reversed sequence {Xk }0kn , with k = Xnk , is that of a non-homogeneous Markov chain with initial distriX bution ,n and transition kernels {B,nk }1kn . This is an exact analog of the forward decomposition where the ordering of indices has been reversed, starting from the end of the observation sequence and ending with the rst observation. Three important dierences versus the forward decomposition should however be kept in mind. (i) The backward smoothing kernel B,k depends on the initial distribution and on the observations up to index k but it depends neither on the future observations nor on the index n where the observation sequence ends. As a consequence, the sequence of backward transition kernels {B,k }0kn1 may be computed by forward recurrence on k, irrespectively of the length of the observation sequence. In other terms, the backward smoothing kernel B,k depends only on the ltering distribution ,k , whereas the forward smoothing kernel Fk|n was to be computed from the backward function k|n . (ii) Because B,k depends on ,k rather than on the unnormalized forward measure ,k , its computation involves only properly normalized quantities (Remark 3.3.7). The backward decomposition is thus more adapted to the actual computation of the smoothing probabilities than the forward decomposition. The necessary steps are summarized in the following result. Proposition 3.3.9 (Forward Filtering/Backward Smoothing). Forward Filtering Compute, forward in time, the ltering distributions ,0 to ,n using the recursion (3.22). At each index k, the backward transition kernel B,k may be computed according to (3.38). Backward Smoothing From ,n , compute, for k = n 1, n 2, . . . , 0, ,k|n = ,k+1|n B,k , recalling that ,n|n = ,n . (iii) A more subtle dierence between the forward and backward Markovian decompositions is the observation that Denition 3.3.1 does provide an expression of the forward kernels Fk|n for any k 0, that is, also for indices after the end of the observation sequence. Hence, the process {Xk }k0 , when conditioned on some observations Y0:n , really forms a non-homogeneous Markov chain whose nite-dimensional distributions are dened by Proposition 3.3.4. In contrast, the backward kernels B,k
def

74

3 Filtering and Smoothing Recursions

are dened for indices k {0, . . . , n1} only, and thus the index-reversed process {Xnk } is also dened, by Proposition 3.3.6, for indices k in the range {0, . . . , n} only. In order to dene the index-reversed chain for negative indices, a minimal requirement is that the underlying chain {Xk } also be well dened for k < 0. Dening Markov chains {Xk } with indices k Z is only meaningful in the stationary case, that is when is the stationary distribution of Q. As both this stationarization issue and the forward and backward Markovian decompositions play a key role in the analysis of the statistical properties of the maximum likelihood estimator, we postpone further discussion of this point to Chapter 12.

3.4 Complements
The forward-backward algorithm is known to many, especially in the eld of speech processing, as the Baum-Welch algorithm, although the rst published description of the approach is due to Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss (1970, p. 168). The denomination refers to the collaboration between Baum and Lloyd R. Welch (Welch, 2003) who also worked out together an early version of the EM approach (to be discussed in Chapter 10). To the best of our knowledge however, the note entitled A Statistical Estimation Procedure for Probabilistic Functions of Finite Markov Processes, co-authored by Baum and Welch and mentioned in the bibliography of Baum et al. (1970), has never been published. The forward-backward algorithm was discovered several times in the early 1970s. A salient example is the paper by Bahl et al. (1974) on the computation of posterior probabilities for a nite-state Markov channel encoder for transmission over a discrete memoryless channel (see Example 1.3.2 in the introductory chapter). The algorithm described by Bahl et al. (1974) is fully equivalent to the forward-backward and is known in digital communication as the BCJR (for Bahl, Cocke, Jelinek, and Raviv) algorithm. Chang and Hancock (1966) is another less well-known reference, contemporary of the work of Baum and his colleagues, which also describes the forward-backward decomposition and its use for decoding in communication applications. It is important to keep in mind that the early work on HMMs by Baum and his colleagues was conducted at the Institute for Defense Analyses (IDA) in Princeton under a contract from the U.S. National Security Agency. Although there are a few early publications of theoretical nature, most of the practical work that dealt with cryptography was kept secret and has never been published. It explains why some signicant practical aspects (like the need for scaling to be discussed below) remained unpublished until HMMs became the de facto standard approach to speech recognition in the 1980s. The famous tutorial by Rabiner (1989) is considered by many as the standard source of information for practical implementation of hidden Markov

3.4 Complements

75

models. The impact of this publication has been very signicant in speech processing but also in several other domains of application such as bioinformatics (Durbin et al., 1998). It was Rabiner (1989) who coined the term scaling to describe the need for normalization when implementing the forward-backward recursions. There is indeed a subtle dierence between the normalization scheme described in Section 3.2.2 and the solution advocated by Rabiner (1989), which was rst published by Levinson et al. (1983). As was done in Section 3.2.2, Rabiner (1989) recommends normalizing the forward measures so that they integrate to one. However, the normalized backward functions are n n dened as k|n = ( l=k c,l )1 k|n rather than n|n = ( l=k+1 c,l )1 k|n . This dierence is a consequence of the normalized backward recursion being carried out as n|n (x) = c1 ,n k|n (x) = c1 ,k and Q(x, dx )gk+1 (x )k+1|n (x )
X

for k = n 1 down to 0,

rather than as prescribed by (3.23). In contrast to our approach, Rabiners scaling implies that normalization is still required for computing the marginal smoothing distributions as ,k|n (dx) = [,k (k|n )]1 k|n (x),k (dx) . On the other hand, the joint smoothing distribution ,k:k+1|n of Xk and Xk+1 may be obtained directly, without normalization, as ,k:k+1|n (dx, dx ) = ,k (dx)Q(x, dx )gk+1 (x )k+1|n (x ) . Indeed, ,k = (
k 1 ,k l=0 c,l ) n

and thus
1

,k:k+1|n (dx, dx ) =
l=0 n

c,l

,k (dx)Q(x, dx )gk+1 (x )k+1|n (x ) ,

as requested, as L,n = l=0 c,l is the normalization factor common to all smoothing distributions from (3.13). Easy computation of bivariate smoothing distributions does not, in our view, constitute a strong motivation for preferring a particular scaling scheme. The Markovian structure of the joint smoothing distribution exhibited in Section 3.3 in particular provides an easy means of evaluating bivariate smoothing distributions. For instance, with the scaling scheme described in Section 3.2.2, the forward Markovian decomposition of Section 3.3.1 implies that ,k:k+1|n (dx, dx ) = c,k ,k|n (dx) Q(x, dx )gk+1 (x )k+1|n (x ) . k|n (x)

As stated in the introduction, Stratonovich (1960) proposed a decomposition that is largely related to the forward-backward approach when the state

76

3 Filtering and Smoothing Recursions

space X is discrete. The forward measure, named w in the work of Stratonovich (1960), is dened as wk (x) = P (Xk = x | Y0:k ) , which coincides with the denition of the ltering probability ,k for a discrete X. Also recall that ,k corresponds to the normalized forward variable ,k = [,k (1)]1 ,k . Instead of the backward function, Stratonovich (1960) dened wk (x) = P (Xk = x | Yk:n ) . Forward and backward recursions for wk and wk , respectively, as well as the relation for computing the marginal smoothing probability from wk and wk , are given in the rst section of Stratonovich (1960) on pages 160162. Although wk as dened by Stratonovich (1960) obviously has a probabilistic interpretation that the backward function lacks, the resulting recursion is more complicated because it requires the evaluation of the prior probabilities P (Xk = x) for k 0. In addition, generalizing the denition of wk to gen eral state spaces X would require using the more restrictive index- (or time-) reversal concept discussed in Section 3.3.2. In contrast, the forward-backward decomposition of Baum et al. (1970) provides a very general framework for smoothing as discussed in this chapter. The fact that, in some cases, a probabilistic interpretation may be given to the backward function k|n (or to equivalent quantities) also explains why in the control and signal processing literatures, the forward-backward recursions are known under the generic term of two-lter formulas (Kitagawa, 1996; Kailath et al., 2000, Section 10.4). This issue will be discussed in detail for Gaussian linear state-space models in Section 5.2.5.

4 Advanced Topics in Smoothing

This chapter covers three distinct complements to the basic smoothing relations developed in the previous chapter. In the rst section, we provide recursive smoothing relations for computing smoothed expectations of general functions of the hidden states. In many respects, this technique is reminiscent of the ltering recursion detailed in Section 3.2.2, but somewhat harder to grasp because the quantity that needs to be updated recursively is less directly interpretable. In the second section, it is shown that the ltering and smoothing approaches discussed so far (including those of Section 4.1) may be applied, with minimal adaptations, to a family of models that is much broader than simply the hidden Markov models. We consider in some detail the case of hierarchical HMMs (introduced in Section 1.3.4) for which marginal ltering and smoothing formulas are still available, despite the fact that the hierarchic component of the state process is not a posteriori Markovian. The third section is dierent in nature and is devoted to the so-called forgetting property of the ltering and smoothing recursions, which are instrumental in the statistical theory of HMMs (see Chapter 12). Forgetting refers to the fact that observations that are either far back in the past or in the remote future (relative to the current time index) have little impact on the posterior distribution of the current state. Although this section is written to be self-contained, its content is probably better understood after some exposure to the stability properties of Markov chains as can be found in Chapter 14.

4.1 Recursive Computation of Smoothed Functionals


Chapter 3 mostly dealt with xed-interval smoothing, that is, computation of k|n 1 for a xed value of the observation horizon n and for all indices
Note that we omit the dependence with respect to the initial distribution , which is not important in this section.
1

78

4 Advanced Topics in Smoothing

0 k n. For Gaussian linear state-space models, it is well-known however that recursive (in n) evaluation of k|n for a xed value of k, also called xed-point smoothing, is feasible (Anderson and Moore, 1979, Chapter 7). Gaussian linear state-space models certainly constitute a particular case, as the smoothing distributions k|n are then entirely dened by their rst and second moments (see Chapter 5). But xed-point smoothing is by no means limited to some specic HMMs and (3.13) implies the existence of recursive update equations for evaluating k|n with k xed and increasing values of n. Remember that, as was the case in the previous chapter, we consider for the moment that evaluating integrals on X is a feasible operation. The good news is that there also exist recursive formulas for computing a large class of smoothed quantities, which include in particular expressions n n like E[ k=0 s(Xk ) | Y0:n ] and E[( k=0 s(Xk ))2 | Y0:n ], where s is a real-valued measurable function on (X, X ) such that both expectations are well-dened. Although one can of course consider arbitrary functions in this class, we will see in Chapter 10 that smoothed expectations of the state variables, for some specic choices of the function of interest, are instrumental in numerical approximations of the maximum likelihood estimate for parameter-dependent HMMs. 4.1.1 Fixed Point Smoothing The fundamental equation here is (3.13), which upon comparing the expressions corresponding to n and n + 1 gives the following update equation for the joint smoothing distribution: 0:n+1|n+1 (fn+1 ) = Ln+1 Ln
1

fn+1 (x0:n+1 ) (4.1)

0:n|n (dx0 , . . . , dxn ) Q(xn , dxn+1 ) gn+1 (xn+1 )

for functions fn+1 Fb Xn+2 . Recall that we used the notation cn+1 for the scaling factor Ln+1 /Ln that appears in (4.1), where, according to (3.27), cn+1 may also be evaluated as n+1|n (gn+1 ). Equation (4.1) corresponds to a simple, yet rich, structure in which the joint smoothing distribution is modied by applying an operator that only aects the last coordinate2 . The probabilistic interpretation of this nding is that Xn+1 and X0:n1 are conditionally independent given both Y0:n+1 and Xn . This remark suggests that while the objective of updating k|n recursively in n (for a xed k) may not be achievable directly, k,n|n the joint distribution of Xk and Xn given Y0:n does follow a simple recursion. Proposition 4.1.1 (Fixed Point Smoothing). For k 0 and any f Fb X2 ,
This structure also has deep implications, which we do not comment on here, for sequential Monte Carlo approaches (to be discussed in Chapters 7 and 8).
2

4.1 Recursive Computation of Smoothed Functionals

79

k,k+1|k+1 (f ) = c1 k+1

f (xk , xk+1 ) k (dxk ) Q(xk , dxk+1 ) gk+1 (xk+1 ) ,

where k is the ltering distribution and ck+1 = k Qgk+1 . For n k + 1 and any f Fb X2 , k,n+1|n+1 (f ) = c1 n+1 f (xk , xn+1 ) k,n|n (dxk , dxn ) Q(xn , dxn+1 ) gn+1 (xn+1 ) .

Both relations are obtained by integrating (4.1) over all variables but those of relevant indices (k and k + 1 for the rst one, k, n, and n + 1 for the second one). At any index n, the marginal smoothing distribution may be evaluated through k|n = k,n|n (, X). Similarly the ltering distribution, which is required to evaluate cn+1 , is given by n = k,n|n (X, ). 4.1.2 Recursive Smoothers for General Functionals From Proposition 4.1.1, one can easily infer a smoothing scheme that applies to the specic situation where the only quantity of interest is E[s(Xk ) | Y0:n ] for a particular function s, and not the full conditional distribution k,n|n . To this aim, dene the nite signed measure n on (X, X ) by n (f ) = f (xn ) s(xk ) k,n|n (dxk , dxn ) , f Fb (X) ,

so that n (X) = E[s(Xk ) | Y0:n ]. Proposition 4.1.1 then implies that k+1 (f ) = c1 k+1 and n+1 (f ) = c1 n+1 f (xn+1 ) n (dxn ) Q(xn , dxn+1 ) gn+1 (xn+1 ) (4.2) f (xk+1 ) s(xk ) k (dxk ) Q(xk , dxk+1 ) gk+1 (xk+1 ) ,

for n k + 1 and f Fb (X). Equation (4.2) is certainly less informative than Proposition 4.1.1, as one needs to x the function s whose smoothed conditional expectation is to be updated recursively. On the other hand, this principle may be adapted to compute smoothed conditional expectations for a general class of functions that depend on the whole trajectory of the hidden states X0:n rather than on just a single particular hidden state Xk . Before exposing the general framework, we rst need to clarify a matter of terminology. In the literature on continuous time processes, and particularly in works that originate from the automatic control community, it is fairly common to refer to quantities similar to n as lterssee for instance Elliott et al. (1995, Chapters 5 and 6) or Zeitouni and Dembo (1988). A lter is then

80

4 Advanced Topics in Smoothing

dened as an object that may be evaluated recursively in n and is helpful in computing a quantity of interest that involves the observations up to index n. A more formal denition, which will also illustrate what is the precise meaning of the word recursive, is that a lter {n }n0 is such that 0 = R (Y0 ) and n+1 = Rn (n , Yn+1 ) where R and {Rn }n0 are some nonrandom operators. In the case discussed at the beginning of this section, Rn is dened by (4.2) where Q is xed (this is the transition kernel of the hidden chain) and Yn+1 enters through gn+1 (x) = g(x, Yn+1 ). Note that because the normalizing constant c1 in (4.2) depends on n , Q and gn+1 , to be coherent n+1 with our denition we should say that {n , n }n0 jointly forms a lter. In this book, we however prefer to reserve the use of the word lter to designate the state lter n . We shall refer to quantities similar to {n }n0 as the recursive smoother associated with the functional {tn }n0 , where the previous example corresponds to tn (x0 , . . . , xn ) = s(xk ). It is not generally possible to derive a recursive smoother without being more explicit about the family of functions {tn }n0 . The device that we will use in the following consists in specifying {tn }n0 using a recursive formula that involves a set of xed-dimensional functions. Denition 4.1.2 (Smoothing Functional). A smoothing functional is a sequence {tn }n0 of functions such that tn is a function Xn+1 R, and which may be dened recursively by tn+1 (x0:n+1 ) = mn (xn , xn+1 )tn (x0:n ) + sn (xn , xn+1 ) (4.3)

for all x0:n+1 Xn+2 and n 0, where {mn }n0 and {sn }n0 are two sequences of measurable functions X X R and t0 is a function X R. This denition can be extended to cases in which the functions tn are ddimensional vector-valued functions. In that case, {sn }n0 also are vectorvalued functions X X Rd while {mn }n0 are matrix-valued functions X X Rd R d . In simpler terms, a smoothing functional is such that the value of tn+1 in x0:n+1 diers from that of tn , applied to the sub-vector x0:n , only by a multiplicative and an additive factor that both only depend on the last two components xn and xn+1 . The whole family is thus entirely specied by t0 and the two sequences {mn }n0 and {sn }n0 . This form has of course been chosen because it reects the structure observed in (4.1) for the joint smoothing distributions. It does however encompass some important functionals of interest. The rst and most obvious example is when tn is a homogeneous additive functional, that is, when
n

tn (x0:n ) =
k=0

s(xk )

for a given measurable function s. In that case, sn (x, x ) reduces to s(x ) and mn is the constant function equal to 1.

4.1 Recursive Computation of Smoothed Functionals

81

The same strategy also applies for more complicated functions such as the n squared sum ( k=0 s(xk ))2 . This time, we need to dene two functions
n

tn,1 (x0:n ) =
k=0 n

s(xk ) ,
2

tn,2 (x0:n ) =
k=0

s(xk )

(4.4)

for which we have the joint update formula tn+1,1 (x0:n+1 ) = tn,1 (x0:n ) + s(xn+1 ) , tn+1,2 (x0:n+1 ) = tn,2 (x0:n ) + s2 (xn+1 ) + 2s(xn+1 )tn,1 (x0:n ) . Note that these equations can also be considered as an extension of Denition 4.1.2 for the vector valued function tn = (tn,1 , tn,2 )t . We now wish to compute E[tn (X0:n ) | Y0:n ] recursively in n, assuming that the functions tn are such that these expectations are indeed nite. We proceed as previously and dene the family of nite signed measures {n } on (X, X ) such that n (f ) =
def

f (xn ) tn (x0:n ) 0:n|n (dx0 , . . . , dxn )

(4.5)

for all functions f Fb (X). Thus, n (X) = E[tn (X0:n ) | Y0:n ]. We then have the following direct consequence of (4.1). Proposition 4.1.3. Let (tn )n0 be a sequence of functions on Xn+1 R possessing the structure of Denition 4.1.2. The nite signed measures {n }n0 on (X, X ) dened by (4.5) may then be updated recursively according to 0 (f ) = {(g0 )} and n+1 (f ) = c1 n+1 f (xn+1 ) n (dxn ) Q(xn , dxn+1 ) gn+1 (xn+1 )mn (xn , xn+1 ) +n (dxn ) Q(xn , dxn+1 ) gn+1 (xn+1 )sn (xn , xn+1 ) (4.6)
1

f (x0 ) (dx0 ) t0 (x0 ) g0 (x0 )

for n 0, where f denotes a generic function in Fb (X). At any index n, E[tn (X0:n ) | Y0:n ] may be evaluated by computing n (X). In order to use (4.6), it is required that the standard ltering recursions (Proposition 3.2.5) be computed in parallel to (4.6). In particular, the normalizing constant cn+1 is given by (3.22) as cn+1 = n Qgn+1 .

82

4 Advanced Topics in Smoothing

As was the case for Denition 4.1.2, Proposition 4.1.3 can obviously be extended to cases where the functional (tn )n0 is vector-valued, without any additional diculty. Because the general form of the recursion dened by Proposition 4.1.3 is quite complex, we rst examine the simple case of homogeneous additive functionals mentioned above. Example 4.1.4 (First and Second Moment Functionals). Let s be a xed function on X and assume that the functionals of interest are the sum and squared sum in (4.4). A typical example is when the base function s equals 1A for a some measurable set A. Then, E[tn,1 (X0:n ) | Y0:n ] is the conditional expected occupancy of the set A by the hidden chain {Xk }k0 between indices 0 and n. Likewise, E[tn,2 (X0:n ) | Y0:n ](E[tn,1 (X0:n ) | Y0:n ])2 is the conditional variance of the occupancy of the set A. We dene the signed measures n,1 and n,2 associated to tn,1 and tn,2 by (4.5). We now apply the general formula given by Proposition 4.1.3 to obtain a recursive update for n,1 and n,2 : 0,1 (f ) = [(g0 )]1 0,2 (f ) = [(g0 )]1 and, for n 0, n+1,1 (f ) = f (xn+1 ) n,1 (dxn ) Q(xn , dxn+1 ) gn+1 (xn+1 ) , f (x0 ) (dx0 ) s(x0 )g0 (x0 ) , f (x0 ) (dx0 ) s2 (x0 )g0 (x0 )

n+1 (dxn+1 ) s(xn+1 ) + c1 n+1

n+1,2 (f ) =

f (xn+1 ) n,2 (dxn ) Q(xn , dxn+1 ) gn+1 (xn+1 )

n+1 (dxn+1 ) s2 (xn+1 ) + c1 n+1 + 2c1 n+1

n,1 (dxn ) Q(xn , dxn+1 ) gn+1 (xn+1 )s(xn+1 ) .

4.1.3 Comparison with Forward-Backward Smoothing It is important to contrast the approach of Section 4.1.2 above with the techniques discussed previously in Chapter 3. What are exactly the dierences between the recursive smoother of Proposition 4.1.3 and the various versions of forward-backward smoothing discussed in Sections 3.2 and 3.3? Is it always possible to apply either of the two approaches? If yes, is one of them preferable

4.1 Recursive Computation of Smoothed Functionals

83

to the other? These are important issues that we review below. Note that for the moment we only compare these two approaches on principle grounds and we do not even try to discuss the computational burden associated with the eective implementation of either approach. This latter aspect is of course entirely dependent of the way in which we are to evaluate (or approximate) integrals, which is itself highly dependent on the specic model under consideration. Several concrete applications of this approach will be considered in Chapters 10 and 11. 4.1.3.1 Recursive Smoothing Is More General Remember that in Chapter 3 our primary objective was to develop approaches for computing marginal smoothing distributions k|n = P(Xk | Y0:n ). A closer inspection of the results indicate that both in the standard forwardbackward approach (Section 3.2) or when using a Markovian (forward or backward) decomposition (Section 3.3), one may easily obtain the bivariate joint smoothing distribution k+1:k|n = P((Xk+1 , Xk ) | Y0:n ) as a by-product of evaluating k|n , with essentially no additional calculation (see in particular Section 3.4). If we consider however the second-order functional tn,2 discussed in Example 4.1.4, we may write
n n

E[tn,2 (X0:n ) | Y0:n ] =


i=0 j=0

E[s(Xi )s(Xj ) | Y0:n ] .

The conditional expectations on the right-hand side indeed only involve the bivariate joint smoothing distributions but for indices that are not consecutive: it is not sucient to determine k+1:k|n for k = 0, . . . , n 1 to evaluate E[tn,2 (X0:n ) | Y0:n ] directly. One would require the complete set of distributions P[(Xi , Xj ) | Y0:n ] for 0 i j n. From this example we may conclude that computing E[tn (X0:n ) | Y0:n ] using forward-backward smoothing is not possible for the whole class of functionals dened in (4.3) but only for a subset of it. If we are to use only the bivariate joint smoothing distributions k+1:k|n , then tn must be an additive functional for which the multipliers mn are constant (say, equal to 1). In that case, tn reduces to
n1

tn (x0:n ) = t0 (x0 ) +
k=0

sk (xk , xk+1 ) ,

and the expected value of tn may be directly evaluated as E[tn (X0:n ) | Y0:n ] =
n1

t0 (x0 ) 0|n (dx0 ) +


k=0

sk (xk , xk+1 ) k:k+1|n (dxk , dxk+1 ) . (4.7)

84

4 Advanced Topics in Smoothing

Recursive smoothing is more general in the sense that it is not restricted to sum functionals but applies to the whole class of functions whose structure agrees with (4.3). 4.1.3.2 For Additive Functionals, Forward-Backward Is More General A distinctive feature however of recursive smoothing is that it may only be applied once a particular function in the class has been selected. The recursive smoother n is associated with a specic choice of the functional tn . As an example, denote by n,A the recursive smoother associated with the homogeneous sum functional
n

tn,A (x0:n ) =
k=0

1A (xk )

for a given set A. We may compute n,A , recursively in n using Proposin tion 4.1.3 and evaluate k=0 P(Xk A | Y0:n ) as n,A (X). If we now consider n a dierent set B, there is no way of evaluating k=0 P(Xk B | Y0:n ) from the previous recursive smoother n,A . It is thus required to run a specic recursive smoother for each function that we are possibly interested in. In contrast, once we have evaluated k+1:k|n for all indices k between 0 and n 1, we may apply (4.7) to obtain the expectation of any particular sum functional that we might be interested in. 4.1.3.3 Recursive Smoothing Is Recursive! A nal element of the comparison of the two approaches is the fact that forward-backward is fundamentally intended for a xed amount of observations, a situation usually referred to as block or batch processing. Consider again, as an example, a simple sum functional of the form
n

tn (x0:n ) =
k=0

s(xk ) ,

and suppose that we are given our n observations not as a whole but one by one, starting with X0 and then X1 , X2 , etc. If we use the normalized forward-backward recursions (Proposition 3.2.5) or the equivalent backward Markovian decomposition (Proposition 3.3.9), the only quantities that are available at an intermediate index k (with k less than n) are the ltering distributions 0 to k . Although we could evaluate E[s(Xj ) | Y0:j ] for j k, it is not yet possible to evaluate E[tk (X0:k ) | Y0:k ]. To be able to compute smoothed quantities, one must decide on an endpoint, say k = n, from which the backward recursion is started. The backward recursion then provides us with the smoothed marginal distributions k|n from which E[tk (X0:n ) | Y0:n ] can be evaluated. This is even more obvious for the forward

4.2 Filtering and Smoothing in More General Models

85

Markovian decomposition (Corollary 3.3.5), which starts by the backward recursion initialized at the nal index n. In contrast, for the recursive smoother, the update equation (4.6) in Proposition 4.1.3 provides a means of computing E[tk (X0:k ) | Y0:k ] for all indices k = 1, 2, . . . , whether or not we have reached the nal observation index. There need not even be a nal observation index, and the method can be applied also when n = or when the nal observation index is not specied. Note that in cases where n is nite but quite large, forward-backward (or the equivalent Markovian decompositions) requires that all the intermediate results be stored: before we can compute k|n we rst need to evaluate and keep track of all the ltering distributions 0 to n (or, for the forward Markovian decomposition, the backward functions n|n down to 0|n ). Thus for large values of n, recursive smoothing approaches are also preferable to those based on forward-backward ideas. Remember however that the price to pay for deriving a recursive smoother is the need to particularize the function of interest. We will discuss in Chapter 10 the exact computational cost of both approaches in examples of HMMs for which the computation corresponding to Proposition 4.1.3 is actually feasible. 4.1.3.4 Bibliographic Notes The recursive smoothing approach discussed in this section was rst described by Zeitouni and Dembo (1988) and Elliott (1993) for continuous time discrete state Markov processes observed in (Gaussian) noise. The approach is also at the core of the book by Elliott et al. (1995). The application of the same principle to the specic case of Gaussian linear state-space models is considered, among others, by Elliott and Krishnamurthy (1999) (see also references therein). The common theme of these works is to use the EM algorithm (see Chapter 10), replacing forward-backward smoothing by recursive smoothing. For reasons to be explained in Section 10.2, the functionals of interest in this context are sums (that is, mn = 1 in Denition 4.1.2). We will see in Section 10.2.4 that the same approach (always with sum functionals) also applies for computing the gradient of the log-likelihood with respect to the parameters in parameterized models. The fact that the same approach applies for more general functionals such as squared sums is, to the best of our knowledge, new (see also Section 10.3.4 for an example of this latter case).

4.2 Filtering and Smoothing in More General Models


Although our main interest is hidden Markov models as dened in Section 2.2, the smoothing decompositions and recursions derived so far turn out to be far more general. We briey discuss below the case of several non-HMM models of practical interest before considering the specic case of hierarchical HMMs as dened in Section 2.2.3.

86

4 Advanced Topics in Smoothing

4.2.1 Smoothing in Markov-switching Models In Markov-switching models (see Section 1.3.6), the distribution of Yk given X0:k and Y0:k1 does not only depend on Xk but also on a number of past values of the observed sequence. Assume for ease of notation that the dependence with respect to previous observations is only on the last observation Yk1 . It is easily checked that (3.1), which denes the joint distribution of a number of consecutive hidden states and observations, should then be replaced by E [f (X0 , Y0 , . . . , Xn , Yn )] =
n

f (x0 , y0 , . . . , xn , yn )

(dx0 )h(x0 , y0 )
k=1

{Q(xk1 , dxk ) g [(xk , yk1 ), yk ]} n (dy0 , . . . , dyn ) (4.8)

for all f Fb {X Y}n+1 , where g[(xk , yk1 ), ] is the transition density function of Yk given Xk and Yk1 . Note that for Markov-switching models, it is more natural to dene the initial distribution as the joint distribution of X0 and Y0 and hence as a probability measure on (X Y, X Y). In (4.8), we have adopted a particular and equivalent way of representing this distribution as (dx0 ) h(x0 , dy0 ) (dy0 ) for some transition density function h. Equation (4.8) is similar to (3.1) and will be even more so once we adopt the implicit conditioning convention introduced in Section 3.1.4. Indeed, upon dening g0 () = h(, Y0 ) , gk () = g [(, Yk1 ), Yk ]
def def

for k 1 ,

the joint distribution ,0:n|n of the hidden states X0:n given the observations Y0:n is still given by (3.13), and hence the mechanics of smoothing for switching autoregressive models are the same as for the standard HMM (see for instance Hamilton, 1994, Chapter 22). 4.2.2 Smoothing in Partially Observed Markov Chains It should also be clear that the same remark holds, mutatis mutandis, for other variants of the model such as non-homogeneous onesif Q depends on the index k for instanceor if the transition from Xk to Xk+1 also depends on some function of the past observations Y0:k1 . Moreover, a closer inspection of the smoothing relations obtained previously indicate that, except when one wishes to exhibit predicted quantitiesas in (3.27)only the unnormalized product kernel Rk1 (xk1 , dxk ) = Q(xk1 , dxk ) gk (xk ) does play a role3 . In
We will come back to this remark when examining sequential Monte Carlo approaches in Chapter 7.
3

4.2 Filtering and Smoothing in More General Models

87

particular, for the general class of models in which it is only assumed that {Xk , Yk }k0 jointly form a Markov chain, the joint distribution of Yk and Xk given Yk1 and Xk1 may be represented as Q [(xk1 , yk1 ), dxk ] g [(xk1 , yk1 , xk ), yk ] (dyk ) , assuming that the second conditional distribution is dominated by . Hence in this case also, one may dene Rk1 (xk1 , dxk ) = Q [(xk1 , Yk1 ), dxk ] g [(xk1 , Yk1 , xk ), Yk ] and use the same ltering and smoothing relations as before. With this notation, it is a simple matter of rewriting, replacing the product of Q and gk by Rk1 to obtain, for instance, the ltering update from (3.22): c,k = ,k (f ) = c1 ,k ,k1 (dx) Rk1 (x, dx ) , f (x ) ,k1 (dx) Rk1 (x, dx ) , f Fb (X) .
def

4.2.3 Marginal Smoothing in Hierarchical HMMs An example that nicely illustrates the previous discussion on the generality of the ltering and smoothing recursions of Chapter 3 is the case of hierarchical HMMs. These models dened in Section 2.2.3 are hidden Markov models in which the unobservable chain {Xk }k0 is split into two components {Ck }k0 and {Wk }k0 such that the component {Ck }k0 , which is the highest in the hierarchy, marginally forms a Markov chain. Of course, these models are HMMs and can be handled as such. In many cases, it is however advantageous to consider that the component of interest is {Ck }k0 only, marginalizing with respect to the intermediate component {Wk }k0 . A typical example is the case of conditionally Gaussian linear state-space models (Denition 2.2.6), where the indicator component Ck takes values in a nite set, whereas the intermediate component Wk is a vector-valued, possibly high-dimensional, variable. It is clear however that the pair (Ck , Yk ) does not correspond to a hidden Markov model. In particular, the distribution of Yn depends on all indicator variables C0 up to Cn (rather than on Cn only), due to the marginalization of the intermediate variables W0:n . Because of the generality of the smoothing relations obtained in Chapter 3, the implementation of marginal smoothingthat is, estimation of {Ck }k0 only given {Yk }k0 however bears some similarity with the (simpler) case of HMMs. For notational simplicity, we consider in the remainder of this section that the hierarchic component {Ck }k0 takes values in the nite set {1, . . . , r}. As usual in this context, we use the notations QC (x, x ) and C (x) rather than QC (x, {x }) and C ({x}). The other notations pertaining to hierarchical

88

4 Advanced Topics in Smoothing

hidden Markov models can be found in Section 2.2.3. Let ,0:k|k denote the posterior distribution of C0:k given Y0:k , ,0:k|k (c0:k ) = P ( C0:k = c0:k | Y0:k ) .
def

(4.9)

Using (3.13) for the hierarchical HMM and integrating with respect to the intermediate component w0:n readily gives
n

,0:n|n (c0:n ) = L1 C (c0 ) ,n


k=1

QC (ck1 , ck )
n

def

W (c0 , dw0 )
k=1

QW [(wk1 , ck ), dwk ] gk (ck , wk ) , (4.10)

where gk (ck , wk ) = g [(ck , wk ), Yk ]. Comparing the above expression for two successive indices, say n and n + 1, yields ,0:n+1|n+1 (c0:n+1 ) = L,n+1 L,n
1

,0:n|n (c0:n ) QC (cn , cn+1 )

,n+1|n (c0:n+1 , dwn+1 ) gn+1 (cn+1 , wn+1 ) , (4.11) where ,n+1|n (c0:n+1 , f ) =
n def

W (c0 , dw0 )
Wn+1 k=1

QW [(wk1 , ck ), dwk ]gk (ck , wk ) QW [(wn , cn+1 ), f ]


n

W (c0 , dw0 )
Wn+1 k=1

QW [(wk1 , ck ), dwk ]gk (ck , wk ) (4.12)

for f Fb (W), which is recognized as the predictive distribution of the intermediate component Wn+1 given the observations Y0:n up to index n and the indicator variables C0:n+1 up to index n + 1. In the example of conditionally Gaussian linear state-space models, the conditional predictive distribution ,n+1|n (c0:n+1 , ) given in (4.12) is Gaussian and may indeed be evaluated recursively for a given sequence of indicator variables c0:n+1 using the Kalman recursions (see Section 5.2). Moreover, in these models the integral featured on the second line of (4.11) may also be evaluated exactly. It is important however to understand that even in this (favorable) case, the existence of (4.11) does not provide an easy solution to updating the marginal ltering distribution ,n|n as it does for HMMs. The fundamental problem is that (4.12) also directly indicates that the predictive distribution Wn+1 given Y0:n , but without conditioning on the indicator variables C0:n+1 , is a mixture distribution with a number of components equal

4.3 Forgetting of the Initial Condition

89

to the number of possible congurations of C0:n+1 , that is, rn+2 . Hence in practice, even in cases such as the conditionally Gaussian linear state-space models for which evaluation of (4.12) is feasible, it is not possible to implement the exact marginal ltering relations for the sequence {Ck }k0 because of the combinatorial explosion due to the need to enumerate all congurations of the indicator variables. Thus, (4.1) will only be helpful in approaches where it is possible to impute values to (part of) the unknown sequence {Ck }k0 , making it possible to avoid exhaustive enumeration of all congurations of the indicator variables. This is precisely the aim of sequential Monte Carlo methods to be described in Chapters 7 and 8, where the specic case of hierarchical HMMs will be detailed in Section 8.2. Note that while (4.1) obviously suggests a recursion in increasing values of n, it is also possible to write an analog to the forward-backward decomposition (see Section 3.2) starting from (4.10): ,0:n|n (c0:n ) = L1 ,n where ,k (c0:k , f ) =
def

,k (c0:k , dwk ) k|n (ck:n , wk ) ,

(4.13)

f (wk )
k

C (c0 ) W (c0 , dw0 )


l=1

QC (cl1 , cl ) QW [(wl1 , cl ), dwl ] gl (cl , wl )

for f Fb (W) and


n

k|n (ck:n , wk ) =

def

l=k+1

QC (cl1 , cl ) QW [(wl1 , cl ), dwl ] gl (cl , wl ) .

The same comment as before applies regarding the fact that both the forward and backward variables do depend on complete sub-sequences of indicator variables; c0:k for ,k and ck:n for k|n . This property of hierarchical HMMs restricts the practical use of (4.13) to cases in which it is possible, for instance, to condition on all values of Cl in the sequence C0:n except Ck . The main application of this decomposition is to be found in Markov chain Monte Carlo methods (Chapter 6) and, more precisely, in the so-called Gibbs sampling approach (Section 6.2.5). The use of (4.13) in this context will be fully illustrated for conditionally Gaussian linear state space models in Sections 5.2.6 and 6.3.2.

4.3 Forgetting of the Initial Condition


Recall from previous chapters that in a partially dominated HMM model (see Denition 2.2.2), we denote by

90

4 Advanced Topics in Smoothing

P the probability associated to the Markov chain {Xk , Yk }k0 on the canonical space (X Y)N , (X Y)N with initial probability measure and transition kernel T dened by (2.15); ,k|n the distribution of the hidden state Xk conditionally on the observations Y0:n , under the probability measure P .

Forgetting properties pertain to the dependence of ,k|n with respect to the initial distribution . A typical question is to ask whether ,k|n and ,k|n are close (in some sense) for large values of k and arbitrary choices of and . This issue will play a key role both when studying the convergence of sequential Monte Carlo methods (Chapter 9) and when analyzing the asymptotic behavior of the maximum likelihood estimator (Chapter 12). In the following, it is shown more precisely that, under appropriate conditions on the kernel Q of the hidden chain and on the transition density function g, the total variation distance ,k|n ,k|n TV converges to zero as k tends to innity. Remember that, following the implicit conditioning convention (Section 3.1.4), we usually omit to indicate explicitly that ,k|n indeed depends on the observations Y0:n . In this section however we cannot use this convention anymore, as we will meet both situations in which, say, ,n ,n TV converges to zero (as n tends to innity) for all possible values of the sequence {yn }n0 YN (uniform forgetting) and cases where ,n ,n TV can be shown to converge to zero almost surely only when {Yk }k0 is assumed to be distributed under a specic distribution (typically P for some initial distribution ). In this section, we thus make dependence with respect to the observations explicit by indicating the relevant subset of observation between brackets, using, for instance, ,k|n [y0:n ] rather than ,k|n . We start by recalling some elementary facts and results about the total variation norm of a signed measure, providing in particular useful characterizations of the total variation as an operator norm over appropriately dened function spaces. We then discuss the contraction property of Markov kernels, using the measure-theoretic approach introduced in an early paper by Dobrushin (1956) and recently revisited and extended by Del Moral et al. (2003). We nally present the applications of these results to establish forgetting properties of the smoothing and ltering recursions and discuss the implications of the technical conditions required to obtain these results. 4.3.1 Total Variation Let (X, X ) be a measurable space and let be a signed measure on (X, X ). Then there exists a measurable set H X , called a Jordan set, such that (i) (A) 0 for each A X such that A H; (ii) (A) 0 for each A X such that A X \ H. The set H is not unique, but any other such set H X satises (H H ) = 1. Hence two Jordan sets dier by at most a set of zero measure. If X is

4.3 Forgetting of the Initial Condition

91

nite or countable and X = P(X) is the collection of all subsets of X, then H = {x : (x) 0} and H = {x : (x) > 0} are two Jordan sets. As another example, if is absolutely continuous with respect to a measure on (X, X ) with Radon-Nikodym derivative f , then {f 0} and {f > 0} are two Jordan sets. We dene two measures on (X, X ) by + (A) = (H A) and (A) = (H c A) , AX .

The measures + and are referred to as the positive and negative variations of the signed measure . By construction, = + . This decomposition of into its positive and negative variations is called the Hahn-Jordan decomposition of . The denition of the positive and negative variations above is easily shown to be independent of the particular Jordan set chosen. Denition 4.3.1 (Total Variation of a Signed Measure). Let (X, X ) be a measurable space and let be a signed measure on (X, X ). The total variation norm of is dened as
TV

= + (X) + (X) ,

where (+ , ) is the Hahn-Jordan decomposition of . If X is nite or countable and is a signed measure on (X, P(X)), then TV = xX |(x)|. If has a density f with respect to a measure on (X, X ), then TV = |f (x)| (dx). Denition 4.3.2 (Total Variation Distance). Let (X, X ) be a measurable space and let and be two measures on (X, X ). The total variation distance between and is the total variation norm of the signed measure . Denote by M(X, X ) the set of nite signed measures on the measurable space (X, X ), by M1 (X, X ) the set of probability measures on (X, X ) and by M0 (X, X ) the set of nite signed measures on (X, X ) satisfying (X) = 0. M(X, X ) is a Banach space with respect to the total variation norm. In this Banach space, the subset M1 (X, X ) is closed and convex. Let Fb (X) denote the set of bounded measurable real functions on X. This set embedded with the supremum norm f = sup{f (x) : x X} also is a Banach space. For any M(X, X ) and f Fb (X), we may dene (f ) = f d. Therefore any nite signed measure in M(X, X ) denes a linear functional on the Banach space (Fb (X) , ). We will use the same notation for the measure and for the functional. The following lemma shows that the total variation of the signed measure agrees with the operator norm of . Lemma 4.3.3. (i) For any M(X, X ) and f Fb (X), f d
TV

92

4 Advanced Topics in Smoothing

(ii) For any M(X, X ),


TV

= sup {(f ) : f Fb (X, X ) , f

= 1} .

(iii) For any f Fb (X), f

= sup {(f ) : M(X, X ),

TV

= 1} .

Proof. Let H be a Hahn-Jordan set of . Then + (H) = (H) and (H c ) = (H c ). For f Fb (X), |(f )| |+ (f )| + | (f )| f

(+ (X) + (X)) = f

TV

showing (i). It also shows that the suprema in (ii) and (iii) are no larger than TV and f , respectively. To establish equality in these relations, rst note that 1H 1H c = 1 and (1H 1H c ) = (H) (H c ) = TV . This proves (ii). Next pick f and let let {xn } be a sequence in X such that limn |f (xn )| = f . Then f = limn |xn (f )|, proving (iii). The set M0 (X, X ) possesses some interesting properties that will prove useful in the sequel. Let be in this set. Because (X) = 0, for any f Fb (X) and any real c it holds that (f ) = (f c). Therefore by Lemma 4.3.3(i), |(f )| TV f c , which implies that |(f )| inf TV cR f c

It is easily seen that for any f Fb (X), inf cR f c is related to the oscillation semi-norm of f , also called the global modulus of continuity, osc (f ) =
def

sup
(x,x )XX

|f (x) f (x )| = 2 inf f c
cR

(4.14)

The lemma below provides some additional insight into this result. Lemma 4.3.4. For any M(X, X ) and f Fb (X), |(f )| sup
(x,x )XX

|+ (X)f (x) (X)f (x )| ,

(4.15)

where (+ , ) is the Hahn-Jordan decomposition of . In particular, for any M0 (X, X ) and f Fb (X), |(f )| where osc (f ) is given by (4.14). 1 2
TV

osc (f ) ,

(4.16)

4.3 Forgetting of the Initial Condition

93

Proof. First note that (f ) = = Therefore |(f )| |f (x)/ (X) f (x )/+ (X)| + (dx) (dx ) sup
(x,x )XX

f (x) + (dx)

f (x) (dx) f (x ) + (dx) (dx ) . + (X)

f (x) + (dx) (dx ) (X)

|f (x)/ (X) f (x )/+ (X)| + (X) (X) ,


1 2

which shows (4.15). If (X) = 0, then + (X) = (X) = (4.16).

TV ,

showing

Therefore, for M0 (X, X ), TV is the operator norm of considered as an operator over the space Fb (X) equipped with the oscillation semi-norm (4.14). As a direct application of this result, if and are two probability measures on (X, X ), then M0 (X, X ) which implies that for any f Fb (X), 1 |(f ) (f )| TV osc (f ) . (4.17) 2 This inequality is sharper than the bound |(f ) (f )| TV f provided by Lemma 4.3.3(i), because osc (f ) 2 f . We conclude this section by establishing some alternative expressions for the total variation distance between two probability measures. Lemma 4.3.5. For any and in M1 (X, X ), 1 2
TV

= sup |(A) (A)|


A

(4.18) (4.19) (4.20)

= 1 sup (X)
, n

= 1 inf
p=1

(Ai ) (Ai ) .

Here the supremum in (4.18) is taken over all measurable subsets of X, the supremum in (4.19) is taken over all nite signed measures on (X, X ) satisfying and , and the inmum in (4.20) is taken over all nite measurable partitions A1 , . . . , An of X. Proof. To prove (4.18), rst write (A) (A) = ( )1A and note that osc (1A ) = 1. Thus (4.17) shows that the supremum in (4.18) is no larger than (1/2) TV . Now let H be a Jordan set of the signed measure .

94

4 Advanced Topics in Smoothing

The supremum is bounded from below by (H) (H) = ( )+ (X) = (1/2) TV . This establishes equality in (4.18). We now turn to (4.19). For any p, q R, |pq| = p+q 2(pq). Therefore for any A X , 1 1 |(A) (A)| = ((A) + (A)) (A) (A) . 2 2 Applying this relation to the sets H and H c , where H is as above, shows that 1 1 ( ) (H) = [(H) + (H)] (H) (H) , 2 2 1 1 c ( ) (H ) = [(H c ) + (H c )] (H c ) (H c ) . 2 2 For any measure such that and , it holds that (H) (H) (H) and (H c ) (H c ) (H c ), showing that 1 1 1 ( ) (H) + ( ) (H c ) = 2 2 2
TV

1 (X) .

Thus (4.19) is no smaller than the left-hand side. To show equality, let be the measure dened by (A) = (A H c ) + (A H) . (4.21)

By the denition of H, (A H c ) (A H c ) and (A H) (A H) for any A X . Therefore (A) (A) and (A) (A). In addition, (H) = (H) = (H) (H) and (H c ) = (H c ) = (H c ) (H c ), showing that 1 2 TV = 1 (X) and concluding the proof of (4.19). Finally, because (X) = (H) (H) + (H c ) (H c ) we have
n

sup (X) inf


, i=1

(Ai ) (Ai ) .

Conversely, for any measure satisfying and , and any partition A1 , . . . , A n ,


n n

(X) =
i=1

(Ai )
i=1 n

(Ai ) (Ai ) ,

showing that sup (X) inf


, i=1

(Ai ) (Ai ) .

The supremum and the inmum thus agree, and the proof of (4.20) follows from (4.19).

4.3 Forgetting of the Initial Condition

95

4.3.2 Lipshitz Contraction for Transition Kernels In this section, we study the contraction property of transition kernels with respect to the total variation distance. Such results have been discussed in a seminal paper by Dobrushin (1956) (see Del Moral, 2004, Chapter 4, for a modern presentation and extensions of these results to a general class of distance-like entropy criteria). Let (X, X ) and (Y, Y) be two measurable spaces and let K be a transition kernel from (X, X ) to (Y, Y) (see Denition 2.1.1). The kernel K is canonically associated to two linear mappings: (i) a mapping M(X, X ) M(Y, Y) that maps any in M(X, X ) to a (possibly signed) measure K given by K(A) = X (dx) K(x, A) for any A Y; (ii) a mapping Fb (Y) Fb (X) that maps any f in Fb (Y) to the function Kf given by Kf (x) = K(x, dy) f (y). Here again, with a slight abuse in notation, we use the same notation K for these two mappings. If we equip the spaces M(X, X ) and M(Y, Y) with the total variation norm and the spaces Fb (X) and Fb (Y) with the supremum norm, a rst natural problem is to compute the operator norm(s) of the kernel K. Lemma 4.3.6. Let (X, X ) and (Y, Y) be two measurable spaces and let K be a transition kernel from (X, X ) to (Y, Y). Then 1 = sup { K = sup { Kf Proof. By Lemma 4.3.3, sup { K
TV TV

: M(X, X ), : f Fb (Y) , f

TV

= 1}

= 1} .

: M(X, X ),

TV

= 1}

= sup {Kf : M(X, X ), f Fb (Y) , f = sup { Kf

= 1,

TV

= 1}

: f Fb (Y, Y) , f

= 1} 1 .

If is a probability measure then so is K. Because the total variation of any probability measure is one, we see that the left-hand side of this display is indeed equal to one. Thus all members equate to one, and the proof is complete. To get sharper results, we will have to consider K as an operator acting on a smaller set of nite measures than M(X, X ). Of particular interest is the subset M0 (X, X ) of signed measures with zero total mass. Note that if lies in this subset, then K is in M0 (Y, Y). Below we will bound the operator norm of the restriction of the operator K to M0 (X, X ).

96

4 Advanced Topics in Smoothing

Denition 4.3.7 (Dobrushin Coecient). Let K be a transition kernel from (X, X ) to (Y, Y). Its Dobrushin coecient (K) is given by (K) = = 1 sup K(x, ) K(x , ) 2 (x,x )XX sup
(x,x )XX,x=x TV TV

K(x, ) K(x , ) x x TV

We remark that as K(x, ) and K(x , ) are probability measures, it holds that K(x, ) TV = K(x , ) TV = 1. Hence (K) 1 (1 + 1) = 1, so that the 2 Dobrushin coecient satises 0 (K) 1. Lemma 4.3.8. Let be a nite signed measure on (X, X ) and let K be a transition kernel from (X, X ) to (Y, Y). Then K
TV

(K)

TV

+ (1 (K)) |(X)| .

(4.22)

Proof. Pick M(X, X ) and let, as usual, + and be its positive and negative part, respectively. If (X) = 0 ( is a measure), then TV = (X) and (4.22) becomes K TV TV ; this follows from Lemma 4.3.6. If + (X) = 0, an analogous argument applies. Thus assume that both + and are non-zero. In view of Lemma 4.3.3(ii), it suces to prove that for any f Fb (Y) with f = 1, |Kf | (K)(+ (X) + (X)) + (1 (K))|+ (X) (X)| . (4.23)

We shall suppose that + (X) (X), if not, replace by and (4.23) remains the same. Then as |+ (X) (X)| = + (X) (X), (4.23) becomes |Kf | 2 (X)(K) + + (X) (X) . Now, by Lemma 4.3.4, for any f Fb (Y) it holds that |Kf | sup
(x,x )XX

(4.24)

|+ (X)Kf (x) (X)Kf (x )| + (X)K(x, ) (X)K(x , )


TV

sup
(x,x )XX

Finally (4.24) follows upon noting that + (X)K(x, ) (X)K(x , )


TV TV

(X) K(x, ) K(x , )

+ [+ (X) (X)] K(x, )

TV

= 2 (X)(K) + + (X) (X) .

Corollary 4.3.9. (K) = sup { K


TV

: M0 (X, X ),

TV

1} .

(4.25)

4.3 Forgetting of the Initial Condition

97

Proof. If (X) = 0, then (4.22) becomes K sup { K


TV

TV

(K)

TV ,

showing that

: M0 (X, X ),

TV

1} (K) .

The converse inequality is obvious, as (K) = sup (x, x ) X X, 1 (x x )K 2


TV

TV TV

sup { K

: M0 (X, X ),

= 1} .

If and are two probability measures on (X, X ), Corollary 4.3.9 implies that K K TV (K) TV . Thus the Dobrushin coecient is the norm of K considered as a linear operator from M0 (X, X ) to M0 (Y, Y). Proposition 4.3.10. The Dobrushin coecient is sub-multiplicative. That is, if K : (X, X ) (Y, Y) and R : (Y, Y) (Z, Z) are two transition kernels, then (KR) (K)(R). Proof. This is a direct consequence of the fact that the Dobrushin coecient is an operator norm. By Corollary 4.3.9, if M0 (X, X ), then K M0 (Y, Y) and K TV (K) TV . Likewise, R TV (R) TV holds for any M0 (Y, Y). Thus KR
TV

= (K)R

TV

(R) K

TV

(K)(R)

TV

4.3.3 The Doeblin Condition and Uniform Ergodicity Anticipating results on general state-space Markov chains presented in Chapter 14, we will establish, using the contraction results developed in the previous section, some ergodicity results for a class of Markov chains (X, X ) satisfying the so-called Doeblin condition. Assumption 4.3.11 (Doeblin Condition). There exist an integer m 1, (0, 1), and a transition kernel = {x,x , (x, x ) XX} from (XX, X X ) to (X, X ) such that for all (x, x ) X X and A X , Qm (x, A) Qm (x , A) x,x (A) . We will frequently consider a strengthened version of this assumption.

98

4 Advanced Topics in Smoothing

Assumption 4.3.12 (Doeblin Condition Reinforced). There exist an integer m 1, (0, 1), and a probability measure on (X, X ) such that for any x X and A X , Qm (x, A) (A) . By Lemma 4.3.5, the Dobrushin coecient of Qm may be equivalently written as
n

(Q ) = 1 inf
i=1

Qm (x, Ai ) Qm (x , Ai ) ,

(4.26)

where the inmum is taken over all (x, x ) X X and all nite measurable partitions A1 , . . . , An of X of X. Under the Doeblin condition, the sum in this n display is bounded from below by i=1 x,x (Ai ) = . Hence the following lemma is true. Lemma 4.3.13. Under Assumption 4.3.11, (Qm ) 1 . Stochastic processes that are such that for any k, the distribution of the random vector (Xn , . . . , Xn+k ) does not depend on n are called stationary (see Denition 2.1.10). It is clear that in general a Markov chain will not be stationary. Nevertheless, given a transition kernel Q, it is possible that with an appropriate choice of the initial distribution we may produce a stationary process. Assuming that such a distribution exists, the stationarity of the marginal distribution implies that E [1A (X0 )] = E [1A (X1 )] for any A X . This can equivalently be written as (A) = Q(A), or = Q. In such a case, the Markov property implies that all nite-dimensional distributions of {Xk }k0 are also invariant under translation in time. These considerations lead to the denition of invariant measure. Denition 4.3.14 (Invariant Measure). If Q is a Markov kernel on (X, X ) and is a -nite measure satisfying Q = , then is called an invariant measure. If an invariant measure is nite, it may be normalized to an invariant probability measure. In practice, this is the main situation of interest. If an invariant measure has innite total mass, its probabilistic interpretation is much more dicult. In general, there may exist more than one invariant measure, and if X is not nite, an invariant measure may not exist. As a trivial example, consider X = N and Q(x, x + 1) = 1. Invariant probability measures are important not merely because they dene stationary processes. Invariant probability measures also dene the longterm or ergodic behavior of a stationary Markov chain. Assume that for some initial measure , the sequence of probability measures {Qn }n0 converges to a probability measure in total variation norm. This implies that for any function f Fb (X), limn Qn (f ) = (f ). Therefore

4.3 Forgetting of the Initial Condition

99

(f ) = lim

(dx) Qn (x, dx ) f (x ) = lim (dx) Qn1 (x, dx ) Qf (x ) = (Qf ) .

Hence, if a limiting distribution exists, it is an invariant probability measure, and if there exists a unique invariant probability measure, then the limiting distribution will be independent of , whenever it exists. These considerations lead to the following denitions. Denition 4.3.15. Let Q be a Markov kernel admitting a unique invariant probability measure . The chain is said to be ergodic if for all x in a set A X such that (A) = 1, limn Qn (x, ) TV = 0. It is said to be uniformly ergodic if limn supxX Qn (x, ) TV = 0. Note that when a chain is uniformly ergodic, it is indeed uniformly geometrically ergodic because limn supxX Qn (x, ) TV = 0 implies that 1 there exists an integer m such that 2 sup(x,x )XX Qm (x, ) Qm (x , ) TV < 1 by the triangle inequality. Hence the Dobrushin coecient (Qm ) is strictly less than 1, and Qm is contractive with respect to the total variation distance by Lemma 4.3.8. Thus there exist constants C < and [0, 1) such that supxX Qn (x, ) TV Cn for all n. The following result shows that if a power Qm of the Markov kernel Q satises Doeblins condition, then the chain admits a unique invariant probability and is uniformly ergodic. Theorem 4.3.16. Under Assumption 4.3.11, Q admits a unique invariant probability measure . In addition, for any M1 (X, X ), Qn
TV

(1 )

n/m

TV

where u is the integer part of u. Proof. Let and be two probability measures on (X, X ). Corollary 4.3.9, Proposition 4.3.10, and Lemma 4.3.13 yield that for all k 1, Qkm Qkm
TV

k (Qm )

TV

(1 )k

TV

(4.27)

Taking = Qpm , we nd that Qkm Q(k+p)m


TV

(1 )k ,

showing that {Qkm } is a Cauchy sequence in M1 (X, X ) endowed with the total variation norm. Because this metric space is complete, there exists a probability measure such that Qkm . In view of the discussion above, is invariant for Qm . Moreover, by (4.27) this limit does not depend on . Thus Qm admits as unique invariant probability measure. The ChapmanKolmogorov equations imply that (Q)Qm = (Qm )Q = Q, showing that Q is also invariant for Qm and hence that Q = as claimed.

100

4 Advanced Topics in Smoothing

Remark 4.3.17. Classical uniform convergence to equilibrium for Markov processes has been studied during the rst half of the 20th century by Doeblin, Kolmogorov, and Doob under various conditions. Doob (1953) gave a unifying form to these conditions, which he named Doeblin type conditions. More recently, starting in the 1970s, an increasing interest in non-uniform convergence of Markov processes has arisen. An explanation for this interest is that many useful processes do not converge uniformly to equilibrium, while they do satisfy weaker properties such as a geometric convergence. It later became clear that non-uniform convergence relates to local Doeblin type condition and to hitting times for so-called small sets. These types of conditions are detailed in Chapter 14.

4.3.4 Forgetting Properties Recall from Chapter 3 that the smoothing probability ,k|n [Y0:n ] is dened by ,k|n [Y0:n ](f ) = E [f (Xk ) | Y0:n ] , f Fb (X) . Here, k and n are integers, and is the initial probability measure on (X, X ). The ltering probability is dened by ,n [Y0:n ] = ,n|n [Y0:n ]. In this section, we will establish that under appropriate conditions on the transition kernel Q and on the function g, the sequence of ltering probabilities satises a property referred to in the literature as forgetting of the initial condition. This property can be formulated as follows: given two probability measures and on (X, X ),
n

lim

,n [Y0:n ]

,n [Y0:n ] TV

=0

P -a.s.

(4.28)

where is the initial probability measure that denes the law of the observations {Yk }. Forgetting is also a concept that applies to the smoothing distributions, as it is often possible to extend the previous results showing that
k n0

lim sup ,k|n [Y0:n ]

,k|n [Y0:n ] TV

=0

P -a.s.

(4.29)

Equation (4.29) can also be strengthened by showing that, under additional conditions, the forgetting property is uniform with respect to the observed sequence Y0:n in the sense that there exists a deterministic sequence {k } satisfying k 0 and sup sup ,k|n [y0:n ]
y0:n Yn+1 n0 ,k|n [y0:n ] TV

k .

Several of the results to be proven in the sequel are of this latter type (uniform forgetting).

4.3 Forgetting of the Initial Condition

101

As shown in (3.5), the smoothing distribution is dened as the ratio ,k|n [y0:n ](f ) = f (xk ) (dx0 ) g(x0 , y0 ) i=1 Q(xi1 , dxi ) g(xi , yi ) . n (dx0 ) g(x0 , y0 ) i=1 Q(xi1 , dxi ) g(xi , yi )
n

Therefore, the mapping associating the probability measure M1 (X, X ) to the probability measure ,k|n [y0:n ] is non-linear. The theory developed above allows one to separately control the numerator and the denominator of this quantity but does not lend a direct proof of the forgetting properties (4.28) or (4.29). To achieve this, we use the alternative representation of the smoothing probability ,k|n [y0:n ] introduced in Proposition 3.3.4, which states that
k

,k|n [y0:n ](f ) =

,0|n [y0:n ](dx0 )


i=1 k

Fi1|n [yi:n ](xi1 , dxi ) f (xk )

= ,0|n [y0:n ]
i=1

Fi1|n [yi:n ]f .

(4.30)

Here we have used the following notations and denitions from Chapter 3. (i) Fi|n [yi+1:n ] are the forward smoothing kernels (see Denition 3.3.1) given for i = 0, . . . , n 1, x X and A X , by Fi|n [yi+1:n ](x, A) = i|n [yi+1:n ](x)
A def 1

Q(x, dxi+1 ) g(xi+1 , yi+1 )i+1|n [yi+2:n ](xi+1 ) , (4.31)

where i|n [yi+1:n ](x) are the backward functions (see Denition 3.1.6) i|n [yi+1:n ](x) = Q(x, dxi+1 ) g(xi+1 , yi+1 )i+1|n [yi+2:n ](xi+1 ) . (4.32) Recall that, by Proposition 3.3.2, {Fi|n }i0 are the transition kernels of the non-homogeneous Markov chain {Xk } conditionally on Y0:n , E [f (Xi+1 ) | X0:i , Y0:n ] = Fi|n [Yi+1:n ](Xi , f ) . (ii) ,0|n [y0:n ] is the posterior distribution of the state X0 conditionally on Y0:n = y0:n , dened for any A X by ,0|n [y0:n ](A) =
A

(dx0 ) g(x0 , y0 )0|n [y1:n ](x0 ) . (dx0 ) g(x0 , y0 )0|n [y1:n ](x0 )

(4.33)

We see that the non-linear mapping ,k|n [y0:n ] is the composition of two mappings on M1 (X, X ).

102

4 Advanced Topics in Smoothing

(i) The mapping ,0|n [y0:n ], which associates to the initial distribution the posterior distribution of the state X0 given Y0:n = y0:n . This mapping consists in applying Bayes formula, which we write as ,0|n [y0:n ] = B[g(, y0 )0|n [y1:n ](), ] . Here B[, ](f ) = f (x)(x) (dx) , (x) (dx) f Fb (X) , (4.34)

for any probability measure on (X, X ) and any non-negative measurable function on X. Note that B[, ] is a probability measure on (X, X ). Because of the normalization, this step is non-linear. k (ii) The mapping i=1 Fi1|n [yi:n ], which is a linear mapping being dened as product of Markov transition kernels. For two initial probability measures and on (X, X ), the dierence of the associated smoothing distributions may thus be expressed as ,k|n [y0:n ]
,k|n [y0:n ]

=
k

B[g(, y0 )0|n [y1:n ], ] B[g(, y0 )0|n [y1:n ], ]


i=1

Fi1|n [yi:n ] . (4.35)

Note that the function g(x, y0 )0|n [y1:n ](x) dened for x X may also be interpreted as the likelihood of the observation Lx ,n [y0:n ] when starting from the initial condition X0 = x (Proposition 3.2.3). In the sequel, we use the likelihood notation whenever possible, writing, in addition, Lx,n [y0:n ] rather than Lx ,n [y0:n ] and L,n [y0:n ] when referring to the whole function. Using Corollary 4.3.9, (4.35) implies that ,k|n [y0:n ] ,k|n [y0:n ]
TV

k TV i=1

B[L,n [y0:n ], ] B[L,n [y0:n ], ]

Fi1|n [yi:n ]

, (4.36)

where the nal factor is a Dobrushin coecient. Because Bayes operator B returns probability measures, the total variation distance in the right-hand side of this display is always bounded by 2. Although this bound may be sucient, it is often interesting to relate the total variation distance between B[, ] and B[, ] to the total variation distance between and . The following lemma is adapted from (Knsch, 2000)see also (Del Moral, 2004, u Theorem 4.3.1). Lemma 4.3.18. Let and be two probability measures on (X, X ) and let be a non-negative measurable function such that () > 0 or () > 0. Then B[, ] B[, ]
TV

() ()

TV

(4.37)

4.3 Forgetting of the Initial Condition

103

Proof. We may assume, without loss of generality, that () (). For any f Fb (X), B[, ](f ) B[, ](f ) = f (x)(x) (dx) (x) ( )(dx) f (x)(x) ( )(dx) + (x) (dx) (x) (dx) (x) (dx) 1 ( )(dx) (x)(f (x) B[, ](f )) . = ()

By Lemma 4.3.5, ( )(dx) (x)(f (x) B[, ](f ))

TV

1 sup |(x)(f (x) B[, ](f )) (x )(f (x ) B[, ](f ))| . 2 (x,x )XX Because |B[, ](f )| f and 0, the supremum on the right-hand side of this display is bounded by 2 f . This concludes the proof. As mentioned by Knsch (2000), the Bayes operator may be non-contractive: u the numerical factor in the right-hand side of (4.37) is sometimes larger than one and the bound may be shown to be tight on particular examples. The intuition that the posteriors should at least be as close as the priors if the same likelihood (the same data) is applied is thus generally wrong. Equation (4.30) also implies that for any integer j such that j k,
j k

,k|n [y0:n ] = ,0|n [y0:n ]


i=1 k

Fi1|n [yi:n ]
i=j+1

Fi1|n [yi:n ]

= ,j|n [y0:n ]
i=j+1

Fi1|n [yi:n ] .

(4.38)

This decomposition and Corollary 4.3.9 shows that for any 0 j k, any initial distributions and and any sequence y0:n such that L,n [y0:n ] > 0 and L ,n [y0:n ] > 0, ,k|n [y0:n ]
,k|n [y0:n ] TV

Fi1|n [yi:n ] ,j|n [y0:n ]


,j|n [y0:n ] TV

i=j+1

Because the Dobrushin coecient of a Markov kernel is bounded by one, this relation implies that the total variation distance between the smoothing distributions associated with two dierent initial distributions is non-expanding. To summarize this discussion, we have obtained the following result.

104

4 Advanced Topics in Smoothing

Proposition 4.3.19. Let and be two probability measures on (X, X ). For any non-negative integers j, k, and n such that j k and any sequence y0:n Yn+1 such that L,n [y0:n ] > 0 and L ,n [y0:n ] > 0, ,k|n [y0:n ]
i=j+1 ,k|n [y0:n ] TV k

Fi1|n [yi:n ] ,j|n [y0:n ]


,j|n [y0:n ] TV

, (4.39)

,k|n [y0:n ]

,k|n [y0:n ] TV k

L,n [y0:n ] L,n [y0:n ] L ,n [y0:n ]

Fi1|n [yi:n ]
i=1

TV

. (4.40)

Along the same lines, we can compare the posterior distribution of the state Xk given observations Yj:n for dierent values of j. To avoid introducing new notations, we will simply denote these conditional distributions by P ( Xk | Yj:n = yj:n ). As mentioned in the introduction of this chapter, it is sensible to expect that P ( Xk | Yj:n ) gets asymptotically close to P ( Xk | Y0:n ) as k j tends to innity. Here again, to establish this alternative form of the forgetting property, we will use a representation of P ( Xk | Yj:n ) similar to (4.30). Because {(Xk , Yk )} is a Markov chain, and assuming that k j, P ( Xk | Xj , Yj:n ) = P ( Xk | Xj , Y0:n ) . Moreover, we know that conditionally on Y0:n , {Xk } is a non-homogeneous Markov chain with transition kernels Fk|n [Yk+1:n ] where Fi|n = Q for i n (Proposition 3.3.2). Therefore the Chapman-Kolmogorov equations show that for any function f Fb (X), E [f (Xk ) | Yj:n ] = E [ E [f (Xk ) | Xj , Yj:n ] | Yj:n ]
k

= E
i=j+1

Fi1|n [Yi:n ]f (Xj ) Yj:n = ,j|n [Yj:n ]


i=j+1

Fi1|n [Yi:n ]f ,

cf. (4.38), where the probability measure ,j|n [Yj:n (f )] is dened by ,j|n [Yj:n ](f ) = E [f (Xj ) | Yj:n ] , f Fb (X) .

Using (4.38) as well, we thus nd that the dierence between P ( Xk | Yj:n ) and P ( Xk | Y0:n ) may be expressed by
k

E [f (Xk ) | Yj:n ] E [f (Xk ) | Y0:n ] = (,j|n ,j|n )


i=j+1

Fi1|n [Yi:n ]f .

4.3 Forgetting of the Initial Condition

105

Proceeding like in Proposition 4.3.19, we may thus derive a bound on the total variation distance between these probability measures. Proposition 4.3.20. For any integers j, k, and n such that 0 j k and any probability measure on (X, X ),
k

P ( Xk | Y0:n ) P ( Xk | Yj:n )

TV

2
i=j+1

Fi1|n [Yi:n ] . (4.41)

4.3.5 Uniform Forgetting Under Strong Mixing Conditions In light of the discussion above, establishing forgetting properties amounts to determining non-trivial bounds on the Dobrushin coecient of products of forward transition kernels and, if required, on ratio of likelihoods Lx,n (y0:n )/(L,n (y0:n ) L ,n (y0:n )). To do so, we need to impose additional conditions on Q and g. We consider in this section the following assumption, which was introduced by Le Gland and Oudjane (2004, Section 2). Assumption 4.3.21 (Strong Mixing Condition). There exist a transition kernel K : (Y, Y) (X, X ) and measurable functions and + from Y to (0, ) such that for any A X and y Y, (y)K(y, A)
A

Q(x, dx ) g(x , y) + (y)K(y, A) .

(4.42)

We rst show that under this condition, one may derive a non-trivial upper bound on the Dobrushin coecient of the forward smoothing kernels. Lemma 4.3.22. Under Assumption 4.3.21, the following hold true. (i) For any non-negative integers k and n such that k < n and x X,
n n

(yj ) k|n [yk+1:n ](x)


j=k+1 j=k+1

+ (yj ) .

(4.43)

(ii) For any non-negative integers k and n such that k < n and any probability measures and on (X, X ), (yk+1 ) + (yk+1 ) (dx) k|n [yk+1:n ](x) + (yk+1 ) . (yk+1 ) (dx) k|n [yk+1:n ](x) X
X

(iii) For any non-negative integers k and n such that k < n, there exists a transition kernel k,n from (Ynk , Y (nk) ) to (X, X ) such that for any x X, A X , and yk+1:n Ynk ,

106

4 Advanced Topics in Smoothing

(yk+1 ) k,n (yk+1:n , A) Fk|n [yk+1:n ](x, A) + (yk+1 ) + (yk+1 ) k,n (yk+1:n , A) . (yk+1 )

(4.44)

(iv) For any non-negative integers k and n, the Dobrushin coecient of the forward smoothing kernel Fk|n [yk+1:n ] satises (Fk|n [yk+1:n ]) where for any y Y, 0 (y) = 1
def

0 (yk+1 ) 1

k<n, kn,

(y) + (y)

and

1 = 1

def

(y) (dy) .

(4.45)

Proof. Take A = X in Assumption 4.3.21 to see that X Q(x, dx ) g(x , y) is bounded from above and below by + (y) and (y), respectively. Part (i) then follows from (3.16). Next, (3.19) shows that (dx) k|n [yk+1:n ](x) = (dx) Q(x, dxk+1 ) g(xk+1 , yk+1 )k+1|n [yk+2:n ](xk+1 ) .

This expression is bounded from above by + (yk+1 ) K(yk+1 , dxk+1 ) k+1|n [yk+2:n ](xk+1 ) ,

and similarly a lower bound, with (yk+1 ) rather than + (yk+1 ), holds too. These bounds are independent of , and (ii) follows. We turn to part (iii). Using the denition (3.30), the forward kernel Fk|n [yk+1:n ] may be expressed as Fk|n [yk+1:n ](x, A) =
A

Q(x, dxk+1 ) g(xk+1 , yk+1 )k+1|n [yk+2:n ](xk+1 ) . Q(x, dxk+1 ) g(xk+1 , yk+1 )k+1|n [yk+2:n ](xk+1 ) X

Using arguments as above, (4.44) holds with k,n (yk+1:n , A) =


def A

K(yk+1 , dxk+1 ) k+1|n [yk+2:n ](xk+1 ) . K(yk+1 , dxk+1 ) k+1|n [yk+2:n ](xk+1 ) X

Finally, part (iv) for k < n follows from part (iii) and Lemma 4.3.13. In the opposite case, recall from (3.31) that Fk|n = Q for indices k n. Integrating

4.3 Forgetting of the Initial Condition

107

(4.42) with respect to and using A X and any x X, Q(x, A) (y)K(y, A) (dy) =

g(x, y) (dy) = 1, we nd that for any (y)K(y, A) (dy) , (y) (dy)

(y) (dy)

where the ratio on the right-hand side is a probability measure. The proof of part (iv) again follows from Lemma 4.3.13. The nal part of the above lemma shows that under Assumption 4.3.21, the Dobrushin coecient of the transition kernel Q satises (Q) 1 for some > 0. This is in fact a rather stringent assumption, which fails to be satised in many of the examples considered in Chapter 1. When X is nite, this condition is satised if Q(x, x ) for any (x, x ) X X. When X is countable, (Q) < 1 is satised under the Doeblin condition 4.3.11 with n = 1. When X Rd or more generally is a topological space, (Q) < 1 typically requires that X is compact, which is, admittedly, a serious limitation. Proposition 4.3.23. Under 4.3.21 the following hold true. (i) For any non-negative integers k and n and any probability measures and on (X, X ), ,k|n [y0:n ]
j=1 ,k|n [y0:n ] TV

kn

0 (yj ) kkn ,0|n [y0:n ] 1

,0|n [y0:n ] TV

where 0 and 1 are dened in (4.45). (ii) For any non-negative integer n and any probability measures and on (X, X ) such that (dx0 ) g(x0 , y0 ) > 0 and (dx0 ) g(x0 , y0 ) > 0, ,0|n [y0:n ]
,0|n [y0:n ] TV + (y1 ) (y1 )

g (g(, y0 )) (g(, y0 ))

TV

(iii) For any non-negative integers j, k, and n such that j k and any probability measure on (X, X ), P ( Xk | Y0:n = y0:n ) P (Xk | Yj:n = yj:n )
kn TV kj(knjn)

2
i=jn+1

0 (yi ) 1

Proof. Using Lemma 4.3.22(iv) and Proposition 4.3.10, we nd that for j k,


kn

(Fj|n [yj+1:n ] Fk|n [yk+1:n ])


i=jn+1

0 (yi ) 1

kj(knjn)

108

4 Advanced Topics in Smoothing

Parts (i) and (iii) then follow from Propositions 4.3.19 and 4.3.20, respectively. Next we note that (4.33) shows that ,0|n [y0:n ] = B 0|n [y1:n ](), B[g(, y0 ), ] . Apply Lemma 4.3.18 twice to this form to arrive at a bound on the total variation norm of the dierence ,0|n [y0:n ] ,0|n [y0:n ] given by 0|n [y1:n ] g(, y0 ) B[g(, y0 ), ](0|n [y1:n ]) (g(, y0 )) (g(, y0 ))
TV

Finally, bound the rst ratio of this display using Lemma 4.3.22(ii); the supremum norm is obtained by taking one of the initial measures as an atom at some point x X. This completes the proof of part (ii). From the above it is clear that forgetting properties stem from properties of the product
kn

0 (Yi )1
i=jn+1

kj(knjn)

(4.46)

The situation is elementary when the factors of this product are (non-trivially) upper-bounded uniformly with respect to the observations Y0:n . To obtain such bounds, we consider the following strengthening of the strong mixing condition, rst introduced by Atar and Zeitouni (1997). Assumption 4.3.24 (Strong Mixing Reinforced). (i) There exist two positive real numbers and + and a probability measure on (X, X ) such that for any x X and A X , (A) Q(x, A) + (A) . (ii) For all y Y, 0 <
X

(dx) g(x, y) < .

It is easily seen that this implies Assumption 4.3.21. Lemma 4.3.25. Assumption 4.3.24 implies Assumption 4.3.21 with (y) = X (dx) g(x, y), + (y) = + X (dx) g(x, y), and K(y, A) =
A

(dx) g(x, y) . (dx) g(x, y) X

In particular, (y)/ + (y) = / + for any y Y. Proof. The proof follows immediately upon observing that
A

(dx ) g(x , y)
A

Q(x, dx ) g(x , y) +
A

(dx ) g(x , y) .

4.3 Forgetting of the Initial Condition

109

Replacing Assumption 4.3.21 by Assumption 4.3.24, Proposition 4.3.23 may be strengthened as follows. Proposition 4.3.26. Under Assumption 4.3.24, the following hold true. (i) For any non-negative integers k and n and any probability measures and on (X, X ), ,k|n [y0:n ] 1 +
,k|n [y0:n ] TV

kn

(1 )kkn ,0|n [y0:n ]

,0|n [y0:n ] TV

(ii) For any non-negative integer n and any probability measures and on (X, X ) such that (dx0 ) g(x0 , y0 ) > 0 and (dx0 ) g(x0 , y0 ) > 0, ,0|n [y0:n ]
,0|n [y0:n ] TV +

g [g(, y )] [g(, y )] 0 0

TV

(iii) For any non-negative integers j, k, and n such that j k and any probability measure on (X, X ), P ( Xk | Y0:n = y0:n ) P ( Xk | Yj:n = yj:n ) 2 1 +
knjn TV

kj(knjn)

Thus, under Assumption 4.3.24 the lter and the smoother forget their initial conditions exponentially fast, uniformly with respect to the observations. This property, which holds under rather stringent assumptions, plays a key role in the sequel (see for instance Chapters 9 and 12). Of course, the product (4.46) can be shown to vanish asymptotically under conditions that are less stringent than Assumption 4.3.24. A straightforward adaptation of Lemma 4.3.25 shows that the following result is true. Lemma 4.3.27. Assume 4.3.21 and that there exists a set C Y and constants 0 < + < satisfying (C) > 0 and, for all y C, (y) + (y) + . Then, 0 (y) 1 / + , 1 1 (C) and
kn

0 (Yi )1
i=jn+1

kj(knjn)

1 / +

kn i=jn+1

1C (Yi )

1 (C)

kj(knjn)

. (4.47)

110

4 Advanced Topics in Smoothing

In words, forgetting is guaranteed to occur when {Yk } visits a given set C innitely often in the long run. Of course, such a property cannot hold true for all possible sequences of observations but it may hold with probability one under appropriate assumptions on the law of {Yk }, assuming in particular that the observations are distributed under the model, perhaps with a different initial distribution . To answer whether this happens or not requires additional results from the general theory of Markov chains, and we postpone this discussion to Section 14.3 (see in particular Proposition 14.3.8 on the recurrence of the joint chain in HMMs). 4.3.6 Forgetting Under Alternative Conditions Because Assumptions 4.3.21 and 4.3.24 are not satised in many contexts of interest, it is worthwhile to consider ways in which these assumptions can be weakened. This happens to raise dicult mathematical challenges that largely remain unsolved today. Perhaps surprisingly, despite many eorts in this direction, there is up to now no truly satisfactory assumption that covers a reasonable fraction of the situations of practical interest. The problem really is more complicated than appears at rst sight. In particular, Example 4.3.28 below shows that the forgetting property does not necessarily hold under assumptions that imply that the underlying Markov chain is uniformly ergodic. This last section on forgetting is more technical and requires some knowledge of Markov chain theory as can be found in Chapter 14. Example 4.3.28. This example was rst discussed by Kaijser (1975) and recently worked out by Chigansky and Lipster (2004). Let {Xk } be a Markov chain on X = {0, 1, 2, 3}, dened by the recurrence equation Xk = (Xk1 + Uk ) mod 4, where {Uk } is an i.i.d. binary sequence with P(Bk = 0) = p and P(Bk = 1) = 1 p for some 0 < p < 1. For any (x, x ) X X, Q4 (x, x ) > 0, which implies that (Q4 ) < 1 and, by Theorem 4.3.16, that the chain is uniformly geometrically ergodic. The observations {Yk } are a deterministic binary function of the chain, namely Yk = 1{0,2} (Xk ) . The function mapping Xk to Yk is not injective, but knowledge of Yk indicates two possible values of Xk . The ltering distribution is given recursively by ,k [y0:k ](0) = yk {,k1 [y0:k1 ](0) + ,k1 [y0:k1 ](3)} , ,k [y0:k ](1) = (1 yk ) {,k1 [y0:k1 ](1) + ,k1 [y0:k1 ](0)} , ,k [y0:k ](2) = yk {,k1 [y0:k1 ](2) + ,k1 [y0:k1 ](1)} , ,k [y0:k ](3) = (1 yk ) {,k1 [y0:k1 ](3) + ,k1 [y0:k1 ](2)} . In particular, either one of the two sets {0, 2} and {1, 3} has null probability under ,k [y0:k ], depending on the value of yk , and irrespectively of the choice of . We also notice that

4.3 Forgetting of the Initial Condition

111

yk ,k [y0:k ](j) = ,k [y0:k ](j) , (1 yk ) ,k [y0:k ](j) = ,k [y0:k ](j) ,

for j = 0, 2, for j = 1, 3. (4.48)

In addition, it is easily checked that, except when ({0, 2}) or ({1, 3}) equals 1 (which rules out one of the two possible values for y0 ), the likelihood L,n [y0:n ] is strictly positive for any integer n and any sequence y0:n {0, 1}n+1 . Dropping the dependence on y0:k for notational simplicity and using (4.48) we obtain |,k (0)
,k (0)| ,k1 (0)

= yk |,k1 (0)

+ ,k1 (3)

,k1 (3)| ,k1 (3)|}

= yk {yk1 |,k1 (0)

,k1 (0)|

+ (1 yk1 )|,k1 (3)

Proceeding similarly, we also nd that |,k (1) |,k (2) |,k (3)
,k (1)|

=
,k1 (1)|

(1 yk ) {(1 yk1 )|,k1 (1)


,k (2)|

+ yk1 |,k1 (0)

,k1 (0)|}

=
,k1 (2)|

yk {yk1 |,k1 (2)


,k (3)|

+ (1 yk1 )|,k1 (1)


,k1 (3)|

,k1 (1)|}

, .

= + yk1 |,k1 (2)


,k1 (2)|}

(1 yk ) {(1 yk1 )|,k1 (3)

Adding the above equalities using (4.48) again shows that for any k = 1, . . . , n, ,k [y0:k ]
,k [y0:k ] TV

= ,k1 [y0:k1 ] = ,0 [y0 ]


,0 [y0 ]

,k1 [y0:k1 ] TV TV

By construction, ,0 [y0 ](j) = y0 (j)/((0) + (2)) for j = 0 and 2, and ,0 [y0 ](j) = (1 y0 ) (j)/((1) + (3)) for j = 1 and 3. This implies that ,0 [y0 ] ,0 [y0 ] TV = 0 if = . In this model, the hidden Markov chain {Xk } is uniformly ergodic, but the ltering distributions ,k [y0:k ] never forget the inuence of the initial distribution , whatever the observed sequence. In the above example, the kernel Q does not satisfy Assumption 4.3.24 with m = 1 (one-step minorization), but the condition is veried for a power Qm (here for m = 4). This situation is the rule rather than the exception. In particular, a Markov chain on a nite state space has a unique invariant probability measure and is ergodic if and only if there exists an integer m > 0 such that Qm (x, x ) > 0 for all (x, x ) X X (but the condition may not hold for m = 1). This suggests considering the following assumption (see for instance Del Moral, 2004, Chapter 4).

112

4 Advanced Topics in Smoothing

Assumption 4.3.29. (i) There exist an integer m, two positive real numbers and + , and a probability measure on (X, X ) such that for any x X and A X , (A) Qm (x, A) + (A) . (ii) There exist two measurable functions g and g from Y to (0, ) such that for any y Y, g (y) inf g(x, y) sup g(x, y) g + (y) .
xX xX

Compared to Assumption 4.3.24, the condition on the transition kernel has been weakened, but at the expense of strengthening the assumption on the function g. Note in particular that part (ii) is not satised in Example 4.3.28. Using (4.30) and writing k = jm + r with 0 r < m, we may express ,k|n [y0:n ] as
j1 (u+1)m1 k1

,k|n [y0:n ] = ,0|n [y0:n ]


u=0

i=um

Fi|n [yi+1:n ]
i=jm

Fi|n [yi+1:n ] .

This implies, using Corollary 4.3.9, that for any probability measures and on (X, X ) and any sequence y0:n satisfying L,n [y0:n ] > 0 and L ,n [y0:n ] > 0, ,k|n [y0:n ]
j1 ,k|n [y0:n ] TV

(u+1)m1

Fi|n [yi+1:n ] ,0|n [y0:n ]


,0|n [y0:n ] TV

u=0

i=um

. (4.49)

This expression suggest computing a bound on ( i=um Fi|n [yi+1:n ]) rather than a bound on (Fi|n ). The following result shows that such a bound can be derived under Assumption 4.3.29. Lemma 4.3.30. Under Assumption 4.3.29, the following hold true. (i) For any non-negative integers k and n such that k < n and x X,
n n

um+m1

g (yj ) k|n [yk+1:n ](x)


j=k+1 j=k+1

g + (yj ) ,

(4.50)

where k|n is the backward function (3.16). (ii) For any non-negative integers u and n such that 0 u < n/m and any probability measures and on (X, X ), + g (yi ) g + (yi ) i=um+1
(u+1)m

(dx) um|n [yum+1:n ](x) + (dx) um|n [yum+1:n ](x) X


X

g + (yi ) . g (yi ) i=um+1

(u+1)m

4.3 Forgetting of the Initial Condition

113

(iii) For any non-negative integers u and n such that 0 u < n/m , there exists a transition kernel u,n from Y(n(u+1)m) , Y (n(u+1)m) to (X, X ) such that for any x X, A X and yum+1:n Y(num) , + g (yi ) u,n (y(u+1)m+1:n , A) g + (yi ) i=um+1 +
(u+1)m (u+1)m (u+1)m1

Fi|n [yi+1:n ](x, A)


i=um

g + (yi ) u,n (y(u+1)m+1:n , A) . (4.51) g (yi ) i=um+1

(iv) For any non-negative integers u and n, (u+1)m1 0 (yum+1:(u+1)m ) Fi|n [yi+1:n ] 1 i=um where for any yum+1:(u+1)m Ym , 0 (yum+1:(u+1)m ) = 1
def

u < n/m , u n/m ,

g (yi ) g + (yi ) i=um+1

(u+1)m

and

1 = 1 . (4.52)

def

Proof. Part (i) can be proved using an argument similar to the one used for Lemma 4.3.22(i). Next notice that for 0 u < n/m , um|n [yum+1:n ](xum )
(u+1)m

i=um+1

Q(xi1 , dxi ) g(xi , yi ) (u+1)m|n [y(u+1)m+1:n ](x(u+1)m ) .

Under Assumption 4.3.29, dropping the dependence on the ys for notational simplicity, the right-hand side of this display is bounded from above by
(u+1)m (u+1)m

g + (yi )
i=um+1

i=um+1

Q(xi1 , dxi ) (u+1)m|n (x(u+1)m )

(u+1)m

+
i=um+1

g + (yi )

(u+1)m|n (x(u+1)m ) (dx(u+1)m ) .

In a similar fashion, a lower bound may be obtained, containing and g rather than + and g + . Thus part (ii) follows. For part (iii), we use (3.30) to write

114

4 Advanced Topics in Smoothing

(u+1)m1

Fi|n [yi+1:n ](xum , A)


i=um

(u+1)m i=um+1 Q(xi1 , xi ) g(xi , yi ) A (x(u+1)m )(u+1)m|n (x(u+1)m ) (u+1)m i=um+1 Q(xi1 , xi ) g(xi , yi )(u+1)m|n (x(u+1)m )

The right-hand side is bounded from above by + g + (yi ) g (yi ) i=um+1


(u+1)m A

(dx) (u+1)m|n [y(u+1)m+1:n ](x) . (dx) (u+1)m|n [y(u+1)m+1:n ](x)

We dene u,n as the second ratio of this expression. Again a corresponding lower bound is obtained similarly, proving part (iii). Part (iv) follows from part (iii) and Lemma 4.3.13. Using this result together with (4.49), we may obtain statements analogous to Proposition 4.3.23. In particular, if there exist positive real numbers and + such that for all y Y, g (y) g + (y) + , then the smoothing and the ltering distributions both forget uniformly the initial distribution. Assumptions 4.3.24 and 4.3.29 are still restrictive and fail to hold in many interesting situations. In both cases, we assume that either the one-step or the m-step transition kernel is uniformly bounded from above and below. The following weaker condition is a rst step toward handling more general settings. Assumption 4.3.31. Let Q be dominated by a probability measure on (X, X ) such that for any x X and A X , Q(x, A) = A q (x, x ) (dx ) for some transition density function q . Assume in addition that (i) There exists a set C X , two positive real numbers and + such that for all x C and x X, q (x, x ) + . (ii) For all y Y and all x X, C q (x, x ) g(x , y) (dx ) > 0; (iii) There exists a (non-identically null) function : Y [0, 1] such that for any (x, x ) X X and y Y,
C

[x, x ; y](x ) (dx ) (y) , [x, x ; y](x ) (dx ) X

where for (x, x , x ) X3 and y Y, [x, x ; y](x ) = q (x, x )g(x , y)q (x , x ) .


def

(4.53)

4.3 Forgetting of the Initial Condition

115

Part (i) of this assumption implies that the set C is 1-small for the kernel Q (see Denition 14.2.10). It it shown in Section 14.2.2.2 that such small sets do exist under conditions that are weak and generally simple to check. Assumption 4.3.31 is trivially satised under Assumption 4.3.24 using the whole state space X as the state C: in that case, their exists a transition density function q (x, x ) that is bounded from above and below for all (x, x ) X2 . It is more interesting to consider cases in which the hidden chain is not uniformly ergodic. One such example, rst addressed by Budhiraja and Ocone (1997), is a Markov chain observed in noise with bounded support. Example 4.3.32 (Markov Chain in Additive Bounded Noise). We consider real states {Xk } and observations {Yk }, assuming that the states form a Markov chain with a transition density q(x, x ) with respect to Lebesgue measure. Furthermore we assume the following. (i) Yk = Xk + Vk , where {Vk } is an i.i.d. sequence of satisfying P(|V | M ) = 0 for some nite M (the essential supremum of the noise sequence is bounded). In addition, Vk has a probability density g with respect to Lebesgue measure. (ii) The transition density satises q(x, x ) > 0 for all (x, x ) and there exists a positive constant A, a probability density h and positive constants and + such that for all x C = [A M, A + M ], h(x ) q(x, x ) + h(x ) . The results below can readily be extended to cover the case Yk = (Xk ) + Vk , provided that the level sets {x R : |(x)| K} of the function are compact. This is equivalent to requiring |(x)| as |x| . Likewise extensions to multivariate states and/or observations are obvious. Under (ii), Assumption 4.3.31(i) is satised with C as above and (dx) = h(x) dx. Denote by the probability density of the random variables Vk . Then g(x, y) = (y x). The density may be chosen such that supp [M, +M ], so that g(x, y) > 0 if and only if x [y M, y + M ]. To verify Assumption 4.3.31(iii), put = [A, A]. For y , we then have g(x, y) = 0 if x [A M, A + M ], and thus
A+M

q(x, x )g(x , y)q(x , x ) dx =


AM

q(x, x )g(x , y)q(x , x ) dx .

This implies that for all (x, x ) X X,


C

q(x, x )g(x , y)q(x , x ) dx =1. q(x, x )g(x , y)q(x , x ) dx X

The bounded noise case is of course very specic, because an observation Yk allows locating the corresponding state Xk within a bounded set.

116

4 Advanced Topics in Smoothing

Under assumption 4.3.31, the lemma below establishes that the set C is a 1-small set for the forward transition kernels Fk|n [yk+1:n ] and that it is also uniformly accessible from the whole space X (for the same kernels). Lemma 4.3.33. Under Assumption 4.3.31, the following hold true. (i) For any initial probability measure on (X, X ) and any sequence y0:n Yn+1 satisfying C (dx0 ) g(x0 , y0 ) > 0, L,n (y0:n ) > 0 . (ii) For any non-negative integers k and n such that k < n and any y0:n Yn+1 , the set C is a 1-small set for the transitions kernels Fk|n . Indeed there exists a transition kernel k,n from (Y(nk) , Y (nk) ) to (X, X ) such that for all x C, yk+1:n Ynk and A X , Fk|n [yk+1:n ](x, A) k,n [yk+1:n ](A) . +

(iii) For any non-negative integers k and n such that n 2 and k < n 1, and any yk+1:n Ynk ,
xX

inf Fk|n [yk+1:n ](x, C) (yk+1 ) .

Proof. Write
n

L,n (y0:n ) =
C

(dx0 ) g(x0 , y0 )
i=1 n

Q(xi1 , dxi ) g(xi , yi ) Q(xi1 , dxi ) g(xi , yi )1C (xi1 ) g(xi , yi ) (dxi ) ,
C

(dx0 ) g(x0 , y0 )

(dx0 ) g(x0 , y0 )

i=1 n n i=1

showing part (i). The proof of (ii) is similar to that of Lemma 4.3.22(iii). For (iii), write Fk|n [yk+1:n ](x, C) = = [x, xk+2 ; yk+1 ](xk+1 )1C (xk+1 )[yk+2:n ](xk+2 ) (dxk+1:k+2 ) [x, xk+2 ; yk+1 ](xk+1 )[yk+2:n ](xk+2 ) (dxk+1:k+2 ) [yk+1 ](x, xk+2 )[x, xk+2 ; yk+1 ](xk+1 )[yk+2:n ](xk+2 ) (dxk+1:k+2 ) . [x, xk+2 ; yk+1 ](xk+1 )[yk+2:n ](xk+2 ) (dxk+1:k+2 )

where is dened in (4.53) and [yk+2:n ](xk+2 ) = g(xk+2 , yk+2 )k+2|n [yk+3:n ](xk+2 ) , [yk+1 ](x, xk+2 ) = [x, xk+2 ; yk+1 ](xk+1 )1C (xk+1 ) (dxk+1 ) . [x, xk+2 ; yk+1 ](xk+1 ) (dxk+1 )

4.3 Forgetting of the Initial Condition

117

Under Assumption 4.3.31, (x, x ; y) (y) for all (x, x ) X X and y Y, which concludes the proof. The corollary below then shows that the whole set X is a 1-small set for the composition Fk|n [yk+1:n ]Fk+1|n [yk+2:n ]. This generalizes a well-known result for homogeneous Markov chains (see Proposition 14.2.12). Corollary 4.3.34. Under Assumption 4.3.31, for positive indices 2 k n,
k/2 1

,k|n [y0:n ]

,k|n [y0:n ] TV

2
j=0

(y2j+1 ) . +

Proof. Because of Lemma 4.3.33(i), we may use the decomposition in (4.39) with j = 0 bounding the total variation distance by 2 to obtain
k1

,k|n [y0:n ]

,k|n [y0:n ] TV

2
j=0

Fj|n [yj+1:n ] .

Now, using assertions (ii) and (iii) of Lemma 4.3.33, Fj|n [yj+1:n ]Fj+1|n [yj+2:n ](x, A)
C

Fj|n [yj+1:n ](x, dx )Fj+1|n [yj+2:n ](x , A) (yj+1 )

j+1,n [yj+2:n ](A) , + for all x X and A X . Hence the composition Fj|n [yj+1:n ]Fj+1|n [yj+2:n ] satises Doeblins condition (Assumption 4.3.12) and the proof follows by Application of Lemma 4.3.13. Corollary 4.3.34 is only useful in cases where the function is such that the obtained bound indeed decreases as k and n grow. In Example 4.3.32, one could set (y) = 1 (y), for an interval . In such a case, it suces that the joint chain {Xk , Yk }k0 be recurrent under P which was the case in Example 4.3.32to guarantee that 1 (Yk ) equals one innitely often and thus that ,k|n [Y0:n ] ,k|n [Y0:n ] TV tends to zero P -almost surely as k, n . The following example illustrates a slightly more complicated situation in which Assumption 4.3.31 still holds. Example 4.3.35 (Non-Gaussian Autoregressive Process in Gaussian Noise). In this example, we consider a rst-order non-Gaussian autoregressive process, observed in Gaussian noise. This is a practically relevant example for which there is apparently no results on forgetting available in the literature. The model is thus Xk+1 = Xk + Uk , Yk = Xk + Vk , where X0 ,

118

4 Advanced Topics in Smoothing

(i) {Uk }k0 is an i.i.d. sequence of random variables with Laplace (double exponential) distribution with scale parameter ; (ii) {Vk }k0 is an i.i.d. sequence of Gaussian random variable with zero mean and variance 2 . We will see below that the fact that the tails of the Xs are heavier than the tails of the observation noise is important for the derivations that follow. It is assumed that || < 1, which implies that the chain {Xk } is positive recurrent, that is, admits a single invariant probability measure . It may be shown (see Chapter 14) that although the Markov chain {Xk } is geometrically ergodic, that is, Qn (x, ) TV 0 geometrically fast, it is not uniformly ergodic as lim inf n supxR Qn (x, ) TV > 0. We will nevertheless see that the forward smoothing kernel is uniformly geometrically ergodic. Under the stated assumptions, q(x, x ) = 1 exp (|x x|) , 2 1 (y x)2 g(x, y) = exp . 2 2 2

Here we set, for some M > 0 to be specied later, C = [M 1/2, M + 1/2], and we let y [1/2, +1/2]. Note that
M +1/2 exp(|u x| |y u|2 /2 2 |x u|) du M 1/2 exp(|u x| |y u|2 /2 2 |x u|) du M exp(|u x| u2 /2 2 |x M exp(|u x| u2 /2 2 |x

u|) du u|) du

and to show Assumption 4.3.31(iii) it suces to show that the right-hand side is bounded from below. This in turn is equivalent to showing that sup(x,x )RR R(x, x ) < 1, where
M

R(x, x ) =

exp(|u x| u2 |x u|) du (4.54)

exp(|u x| u2 |x u|) du

with = , = 1/2 2 and = . To do this, rst note that any M > 0 we have sup{R(x, x ) : |x| M, |x | M } < 1, and we thus only need to study the behavior of this quantity when x and/or x become large. We rst show that lim sup sup R(x, x ) < 1 . (4.55)
M xM, |x |M

For this we note that for |x | M and x M , it holds that

4.3 Forgetting of the Initial Condition


x

119

+
M x

exp |x u| u2 (u x ) du ex eM exp[M 2 + ( )M ] exp(x2 x) + eM , 2M ( ) 2x + ( + )

where we used the bound

exp(u u2 ) du (2y ) exp(y 2 + y) ,


y

which holds as soon as 2y 0. Similarly, we have


M

exp (x u) u2 (x u) du

ex eM

exp[M 2 ( + )M ] , 2M + ( + )

exp (x u) u2 |u x | du
M M

e2M ex
M

exp(u2 + u) du .

Thus, (4.54) is bounded by exp[x2 + ( )x] 2 exp[M 2 + ( )M ] + sup 2M + x + ( + ) xM


M M

e3M

exp(u2 + u) du

proving (4.55). Next we show that lim sup sup R(x, x ) < 1 . (4.56)
M xM, x M

We consider the case M x x ; the other case can be handled similarly. The denominator in (4.54) is then bounded by
M

exx
M

exp(u2 + ( + )u) du .

The two terms in the numerator are bounded by, respectively,


M

exp (x u) u2 (x u) du

exx

exp[M 2 ( + )M ] 2M + +

120

4 Advanced Topics in Smoothing

and

exp |x u| u2 |x u| du
M

exp[M 2 + ( + )M ] 2M exp(x2 + x x ) exp[(x )2 + x x ] + , + 2x + 2x + + exx and (4.56) follows by combining the previous bounds. We nally have to check that lim sup sup R(x, x ) < 1 .
M x M, xM

This can be done along the same lines.

5 Applications of Smoothing

Remember that in the previous two chapters, we basically considered that integration over X was a feasible operation. This is of course not the case in general, and numerical evaluation of the integrals involved in the smoothing recursions turns out to be a dicult task. In Chapters 6 and 7, generally applicable methods for approximate smoothing, based on Monte Carlo simulations, will be considered. Before that, we rst examine two very important particular cases in which an exact numerical evaluation is feasible: models with nite state space in Section 5.1 and Gaussian linear state-space models in Section 5.2. Most of the concepts to be used below have already been introduced in Chapters 3 and 4, and the current chapter mainly deals with computational aspects and algorithms. It also provides concrete examples of application of the methods studied in the previous chapters. Note that we do not yet consider examples of application of the technique studied in Section 4.1, as the nature of functionals that can be computed recursively will only become more explicit when we discuss the EM framework in Chapter 10. Corresponding examples will be considered in Section 10.2.

5.1 Models with Finite State Space


We rst consider models for which the state space X of the hidden variables is nite, that is, when the unobservable states may only take a nite number of distinct values. In this context, the smoothing recursions discussed in Chapter 3 take the familiar form described in the seminal paper by Baum et al. (1970) as well as Rabiners (1989) tutorial (which also covers scaling issues). Section 5.1.2 discusses a technique that is of utmost importance in many applications, for instance digital communications and speech processing, by which one can determine the maximum a posteriori sequence of hidden states given the observations.

122

5 Applications of Smoothing

5.1.1 Smoothing 5.1.1.1 Filtering Let X denote a nite set that we will, without loss of generality, identify with X = {1, . . . , r}. Probability distributions on X can be represented by vectors belonging to the simplex of Rr , that is, the set
r

(p1 , . . . , pr ) : pi 0 for every 1 i r,


i=1

pi = 1

The components of the transition matrix Q and the initial distribution of the hidden chain are denoted by (qij )1i,jr and (i )1ir , respectively. Similarly, for the ltering and smoothing distributions, we will use the slightly abusive but unambiguous notation k (i) = P(Xk = i | Y0:k ), for 1 i r, instead of k ({i}). Finally, because we are mainly concerned with computational aspects given a particular model specication, we do not need to indicate dependence with respect to the initial distribution of X0 and will simply denote the lter (and all associated quantities) by k instead of ,k . The rst item below describes the specic form taken by the ltering recursionsor, in Rabiners (1989) terminology, the normalized forward recursionwhen the state space X is nite. Algorithm 5.1.1 (Forward Filtering). Assume X = {1, . . . , r}. Initialization: For i = 1, . . . , r, 0|1 (i) = (i) . Forward Recursion: For k = 0, . . . , n,
r

ck =
i=1

k|k1 (i)gk (i) ,

(5.1) (5.2) (5.3)

k (j) = k|k1 (j)gk (j)/ck ,


r

k+1|k (j) =
i=1

k (i)qij ,

for each j = 1, . . . , r. The computational cost of ltering is thus proportional to n, the number of observations, and scales like r2 (squared cardinality of the state space X) because of the r vector matrix products corresponding to (5.3). Note however that in models with many zero entries in the transition matrix, in particular for left-to-right models like speech processing HMMs (Example 1.3.6), the complexity of (5.3) is at most of order r times the maximal number of nonzero elements along the rows of Q, which can be signicantly less. In addition,

5.1 Models with Finite State Space

123

and this is also the case for speech processing HMMs, if the Yk are highdimensional multivariate observations, the main computational load indeed lies in (5.1)(5.2) when computing the numerical values of the conditional densities of Yk given Xk = j for all r possible states j. Recall from Section 3.2.2 that the likelihood of the observations Y0:n can be computed directly on the log scale according to
n def n

= log Ln =
k=0

log ck .

(5.4)

This form is robust to numerical over- or underow and should be systematically preferred to the product of the normalization constants ck , which would evaluate the likelihood on a linear scale. 5.1.1.2 The Forward-Backward Algorithm As discussed in Section 3.4, the standard forward-backward algorithm as exposed by Rabiner (1989) adopts the scaling scheme described by Levinson et al. (1983). The forward pass is given by Algorithm 5.1.1 as described above, where both the normalization constants ck and the lter vectors k have to be stored for k = 0, . . . , n. Note that the tradition consists in denoting the forward variables by the letter , but we reserved this notation for the unscaled forward variables (see Section 3.2). Here we actually only store the lter vectors k , as their unnormalized versions would quickly under- or overow the machine precision for any practical value of n. Algorithm 5.1.2 (Backward Smoothing). Given stored values of 0 , . . . , n and c0 , . . . , cn , computed during the forward ltering pass (Algorithm 5.1.1), and starting from the end of the data record, do the following. Initialization: For j = 1, . . . , r, n|n (j) = c1 . n Backward Recursion: For k = n 1, . . . , 0,
r

k|n (i) = c1 k
j=1

qij gk+1 (j)k+1|n (j)

(5.5)

for each i = 1, . . . , r. For all indices k < n, the marginal smoothing probabilities may be evaluated as k|n (i) = P(Xk = i | Y0:n ) = and the bivariate smoothing probabilities as
def k:k+1|n (i, j) = P(Xk = i, Xk+1 = j | Y0:n ) = k (i)qij gk+1 (j)k+1|n (j) . def

k (i)k|n (i) r j=1 k (j)k|n (j)

(5.6)

124

5 Applications of Smoothing

The correctness of the algorithm described above has already been discussed in Section 3.4. We recall that it diers from the line followed in Section 3.2.2 only by the choice of the normalization scheme. Algorithms 5.1.1 and 5.1.2 constitute the standard form of the two-pass algorithm known as forward-backward introduced by Baum et al. (1970), where the normalization scheme is rst mentioned in Levinson et al. (1983) (although the necessity of scaling was certainly known before that date, as discussed in Section 3.4). The complexity of the backward pass is comparable to that of the forward ltering, that is, it scales as n r2 . Note however that for high-dimensional observations Yk , the computational cost of the backward pass is largely reduced, as it is not necessary to evaluate the (n + 1)r conditional densities gk (i) that have already been computed (given that these have been stored in addition to the lter vectors 0 , . . . n ). 5.1.1.3 Markovian Backward Smoothing The backward pass as described in Algorithm 5.1.2 can be replaced by the use of the backward Markovian decomposition introduced in Section 3.3.2. Although this second form of backward smoothing is equivalent to Algorithm 5.1.2 from a computational point of view, it is much more transparent on principle grounds. In particular, it shows that the smoothing distributions may be evaluated from the ltering ones using backward Markov transition matrices. In addition, these transition matrices only depend on the ltering distributions themselves and not on the data anymore. In this respect, the computation of the observation densities in (5.5) is thus inessential. The algorithm, which has been described in full generality in Section 3.3.2, goes as follows, Algorithm 5.1.3 (Markovian Backward Smoothing). Given stored values of 0 , . . . , n and starting from the end of the data record, do the following. Initialization: For j = 1, . . . , r, n|n (j) = n (j). Backward Recursion: For k = n 1, . . . , 0, Compute the backward transition kernel according to Bk (j, i) = k (i)qij r m=1 k (m)qmj (5.7)

for j, i = 1, . . . , r (if the denominator happens to be null for index j, then Bk (j, i) can be set to arbitrary values for i = 1, . . . , r). Compute k:k+1|n (i, j) = k+1|n (j)Bk (j, i) and

5.1 Models with Finite State Space


r

125

k|n (i) =
m=1

k+1|n (m)Bk (m, i)

for i, j = 1, . . . , r. Compared to the general situation investigated in Section 3.3.2, the formulation of Algorithm 5.1.3 above takes prot of (3.39) in Remark 3.3.7, which provides an explicit form for the backward kernel Bk in cases where the hidden Markov model is fully dominated (which is always the case when the state space X is nite). Note also that the value of Bk (j, i) in cases where the denominator of (5.7) happens to be null is irrelevant. The condition r m=1 k (m)qmj = 0 is equivalent to stating that k+1|k (j) = 0 by (5.3), which in turn implies that k+1 (j) = 0 by (5.2) and nally that k+1|n (j) = 0 for n k+1 by (5.6). Hence the value of Bk (j, i) is arbitrary and is (hopefully) never used in Algorithm 5.1.3, as it is multiplied by zero. As noted in Section 3.3.2, the idea of using this form of smoothing for nite state space models is rarely ever mentioned except by Askar and Derin (1981) who illustrated it on a simple binary-valued examplesee also discussion in Ephraim and Merhav (2002) about stable forms of the forwardbackward recursions. Of course, one could also consider the forward Markovian decomposition, introduced in Section 3.3.1, which involves the kernels Fk|n that are computed from the backward variables k|n . We tend to prefer Algorithm 5.1.2, as it is more directly connected to the standard way of computing smoothed estimates in Gaussian linear state-space models to be discussed later in Section 5.2. 5.1.2 Maximum a Posteriori Sequence Estimation When X is nite, it turns out that it is also possible to carry out a dierent type of inference concerning the unobservable sequence of states X0 , . . . Xn . This second form is non-probabilistic in the sense that it does not provide a distributional statement concerning the unknown states. On the other hand, the result that is obtained is the jointly optimal, in terms of maximal conditional probability, sequence X0 , . . . Xn of unknown states given the corresponding observations, which is in some sense much stronger a result than just the marginally (or bivariate) optimal sequence of states. However, neither optimality property implies the other. To express this precisely, let xk maximize the conditional probability P(Xk = xk | Y0:n ) for each k = 0, 1, . . . , n, and let the sequence x0:n maximize the joint conditional probability P(X0:n = x0:n | Y0:n ). Then, in general, the sequences x0:n and x0:n do not agree. It may even be that a transition (xk , xk+1 ) of the marginally optimal sequence is disallowed in the sense that qxk ,xk+1 = 0. In the HMM literature, the algorithm that makes possible to compute efciently the a posteriori most likely sequence of states is known as the Viterbi algorithm, after Viterbi (1967). It is based on the well-known dynamic programming principle. The key observation is indeed (4.1), which we rewrite

126

5 Applications of Smoothing

in log form with notations appropriate for the nite state space case under consideration: log 0:k+1|k+1 (x0 , . . . , xk+1 ) = (
k

k+1 )

+ log 0:k|k (x0 , . . . , xk ) + log qxk xk+1 + log gk+1 (xk+1 ) , (5.8) where k denotes the log-likelihood of the observations up to index k and 0:k|k is the joint distribution of the states X0:k given the observations Y0:k . The salient feature of (5.8) is that, except for a constant term that does not depend on the state sequence (on the right-hand side of the rst line), the a posteriori log-probability of the subsequence x0:k+1 is equal to that of x0:k up to terms that only involve the pair (xk , xk+1 ). Dene mk (i) =
{x0 ,...,xk1 }Xk

max

log 0:k|k (x0 , . . . , xk1 , i) +

(5.9)

that is, up to a number independent of the state sequence, the maximal conditional probability (on the log scale) of a sequence up to time k and ending with state i. Also dene bk (i) to be that value in X of xk1 for which the optimum is achieved in (5.9); in other words, bk (i) is the second nal state in an optimal state sequence of length k + 1 and ending with state i. Using (5.8), we then have the simple recursive relation mk+1 (j) = max
i{1,...,r}

[mk (i) + log qij ] + log gk+1 (j) ,

(5.10)

and bk+1 (j) equals the index i for which the maximum is achieved. This observation immediately leads us to formulate the Viterbi algorithm. Algorithm 5.1.4 (Viterbi Algorithm). Forward Recursion (for optimal conditional probabilities): Let m0 (i) = log((i)g0 (i)) . Then for k = 0, 1, . . . , n 1, compute mk+1 (j) for all states j as in (5.10). Backward Recursion (for optimal sequence): Let xn be the state j for which mn (j) is maximal. Then for k = n 1, n 2, . . . , 0, let xk be the state i for which the maximum is attained in (5.10) for mk+1 (j) with j = xk+1 . That is, xk = bk+1 (k+1 ). x The backward recursion rst identies the nal state of the optimal state sequence. Then, once the nal state is known, the next to nal one can be determined as the state that gives the optimal probability for sequences ending with the now known nal state. After that, the second next to nal state can be determined in the same manner, and so on. Thus the algorithm requires

5.2 Gaussian Linear State-Space Models

127

storage of all the mk (j). Storage of the bk (j) is not necessary but makes the backward recursion run faster. In cases where there is no unique maximizing state i in (5.10), there may be no unique optimal state sequence either, and bk+1 (j) can be taken arbitrarily within the set of maximizing indices i.

5.2 Gaussian Linear State-Space Models


Gaussian linear state-space models form another important class for which the tools introduced in Chapter 3 provide implementable algorithms. Sections 5.2.1 to 5.2.4 review two dierent variants of the general principle outlined in Proposition 3.3.9. The second form, exposed in Section 5.2.4, is definitely more involved, but also more ecient in several situations, and is best understood with the help of linear prediction tools that are reviewed in Sections 5.2.2 and 5.2.3. Finally, the exact counterpart of the forwardbackward approach, examined in great generality in Section 3.2, is exposed in Section 5.2.5. 5.2.1 Filtering and Backward Markovian Smoothing We here consider a slight generalization of the Gaussian linear state-space model dened in Section 1.3.3: Xk+1 = Ak Xk + Rk Uk , Yk = Bk Xk + Sk Vk , (5.11) (5.12)

where {Uk }k0 and {Vk }k0 are two independent vector-valued i.i.d. Gaussian sequences such that Uk N(0, I) and Vk N(0, I) where I is a generic notation for the identity matrices (of suitable dimensions). In addition, X0 is assumed to be N(0, ) distributed and independent of {Uk } and {Vk }. Recall t from Chapter 1 that while we typically assume that Sk Sk = Cov(Sk Vk ) is a full-rank covariance matrix, the dimension of the state noise vector (also referred to as the excitation or disturbance) Uk is in many situations smaller t than that of the state vector Xk and hence Rk Rk may be rank decient. Compared to the basic model introduced in Section 1.3.3, the dierence lies in the fact that the parameters of the state-space model, Ak , Bk , Rk , and Sk , depend on the time index k. This generalization is motivated by conditionally Gaussian state-space models, as introduced in Section 1.3.4. For such models, neither is the state space nite nor is the complete model equivalent to a Gaussian linear state-space model. However, it is indeed possible, and often advantageous, to perform ltering while conditioning on the state of the unobservable indicator variables. In this situation, although the basic model is homogeneous in time, the conditional model features time-dependent parameters. There are also cases in which the means of {Uk } and {Vk } depend on time. To avoid notational blow-up, we consider only the zero-mean case:

128

5 Applications of Smoothing

the modications needed to handle non-zero means are straightforward as explained in Remark 5.2.14 below. A feature that is unique to the Gaussian linear state-space model dened by (5.11)(5.12) is that because the states X0:n and the observations Y0:n are jointly multivariate Gaussian (for any n), all smoothing distributions are also Gaussian. Hence any smoothing distribution is fully determined by its mean vector and covariance matrix. We consider in particular below the predictive state estimator k|k1 and ltered state estimator k and denote by k|k1 = N Xk|k1 , k|k1 k = N Xk|k , k|k their respective means and covariance matrices. Remark 5.2.1. Note that up to now we have always used k as a simplied notation for k|k , thereby expressing a default interest in the ltering distri bution. To avoid all ambiguity, however, we will adopt the notations Xk|k and k|k to denote the rst two moments of the ltering distributions in Gaussian linear state-space models. The reason for this modication is that the conventions used in the literature on state-space models are rather variable, but with a marked general preference for using Xk and k to refer to the moments of predictive distribution k|k1 see, e.g., Anderson and Moore (1979) or Kailath et al. (2000). In contrast, the more explicit notations Xk|k and k|k are self-explaining and do not rely on an implicit knowledge of whether the focus is on the ltering or prediction task. The following elementary lemma is instrumental in computing the predictive and the ltered state estimator. Proposition 5.2.2 (Conditioning in the Gaussian Linear Model). Let X and V be two independent Gaussian random vectors with E[X] = X , Cov(X) = X , and Cov(V ) = V , and assume E[V ] = 0. Consider the model Y = BX + V , (5.15) where B is a deterministic matrix of appropriate dimensions. Further assume that BX B t + V is a full rank matrix. Then E [X | Y ] = E[X] + Cov(X, Y ) {Cov(Y )} = X + X B and Cov(X | Y ) = Cov(X E[X|Y ]) = E (X E[X|Y ])X t = X X B
t t 1

(5.13) (5.14)

(Y E[Y ]) (Y BX )

(5.16)

BX B + V

(5.17)

BX B + V

BX .

5.2 Gaussian Linear State-Space Models

129

Proof. Denote by X the right-hand side of (5.16). Then X X = X E(X) Cov(X, Y ){Cov(Y )}1 (Y E[Y ]) , which implies that Cov(X X, Y ) = Cov(X, Y ) Cov(X, Y ){Cov(Y )}1 Cov(Y ) = 0 . (5.18) The random vectors Y and X X thus are jointly Gaussian (as linear transformations of a Gaussian multivariate random vector) and uncorrelated. Hence, Y and X X are also independent. Writing X = X + (X X) , where X is (Y ) measurable (as a linear combination of the components of Y ) and X X is independent of X, it is then easily checked (see Appendix A.2) = E(X | Y ) and that, in addition, that X
def Cov (X | Y ) = Cov (X X)(X X)

Y = Cov(X X) .

Finally, (5.17) is obtained upon noting that Cov(X X) = E[(X X)(X X)t ] = E[(X X)X t ] , using (5.18) and the fact that X is a linear transform of Y . The second lines of (5.16) and (5.17) follow from the linear structure of (5.15). For Gaussian linear state-space models, Proposition 5.2.2 implies in par ticular that while the mean vectors Xk|k1 or Xk|k do depend on the observations, the covariance matrices k|k1 and k|k are completely determined by the model parameters. Our rst result below simply consists in applying the formula derived in Proposition 5.2.2 for the Gaussian linear model to obtain an explicit equivalent of (3.27) in terms of the model parameters. Proposition 5.2.3 (Filtering in Gaussian Linear State-Space Models). The ltered and predictive mean and covariance matrices may be updated recursively as follows, for k 0. Filtering:
t t t Xk|k = Xk|k1 + k|k1 Bk (Bk k|k1 Bk + Sk Sk )1 (Yk Bk Xk|k1 ) , (5.19) t t t k|k = k|k1 k|k1 Bk (Bk k|k1 Bk + Sk Sk )1 Bk k|k1 ,

(5.20)

with the conventions X0|1 = 0 and 0|1 = . Prediction: Xk+1|k = Ak Xk|k , k+1|k = Ak k|k At k +
t Rk Rk

(5.21) , (5.22)

130

5 Applications of Smoothing

Proof. As mentioned in Remark 3.2.6, the predictor-to-lter update is obtained by computing the posterior distribution of Xk given Yk in the equiva lent pseudo-model Xk N(Xk|k1 , k|k1 ) and Yk = Bk Xk + Vk ,
t where Vk is N(0, Sk Sk ) distributed and independent of Xk . Equations (5.19) and (5.20) thus follow from Proposition 5.2.2. Equations (5.21) and (5.22) correspond to the moments of

Xk+1 = Ak Xk + Rk Uk when Xk and Uk are independent and, respectively, N(Xk|k , k|k ) and N(0, I) distributed (see discussion in Remark 3.2.6). Next we consider using the backward Markovian decomposition of Sec tion 3.3.2 to derive the smoothing recursion. We will denote by Xk|n and k|n respectively the mean and covariance matrix of the smoothing distribution k|n . According to Remark 3.3.7, the backward kernel Bk corresponds to the distribution of Xk given Xk+1 in the pseudo-model Xk+1 = Ak Xk + Rk Uk , when Xk N(Xk|k , k|k ) and Uk N(0, I) independently of Xk . Using Proposition 5.2.2 once again, Bk (Xk+1 , ) is seen to be the Gaussian distribution with mean and covariance matrix given by, respectively,
t Xk|k + k|k At (Ak k|k At + Rk Rk )1 (Xk+1 Ak Xk|k ) , k k

(5.23)

and covariance matrix


t k|k k|k At (Ak k|k At + Rk Rk )1 Ak k|k . k k

(5.24)

Proposition 3.3.9 asserts that Bk is the transition kernel that maps k+1|n to k|n . Hence, if we assume that k+1|n = N(Xk+1|n , k+1|n ) is already known, Xk|n = Xk|k + k|k At Mk (Xk+1|n Ak Xk|k ) , k k|n = k|k k|k At Mk Ak k|k k + k|k At Mk k+1|n Mk Ak k|k k , (5.25) (5.26)

give the moments of k|n , where


t Mk = (Ak k|k At + Rk Rk )1 . k

To derive these two latter equations, we must observe that (i) Bk (Xk+1 , ) may be interpreted as an ane transformation of Xk+1 as in (5.23) followed by adding an independent zero mean Gaussian random vector with covariance matrix as in (5.24), and that (ii) mapping k+1|n into k|n amounts to replac ing the xed Xk+1 by a random vector with distribution N(Xk+1|n , k+1|n ).

5.2 Gaussian Linear State-Space Models

131

The random vector obtained through this mapping is Gaussian with mean and covariance as in (5.25)(5.26), the third term of (5.26) being the extra term arising because of (ii). We summarize these observations in the form of an algorithm. Algorithm 5.2.4 (Rauch-Tung-Striebel Smoothing). Assume that the ltering moments Xk|k and k|k are available (for instance by application of Proposition 5.2.3) for k = 0, . . . , n. The smoothing moments Xk|n and k|n may be evaluated backwards by applying (5.25) and (5.26) from k = n 1 down to k = 0. This smoothing approach is generally known as forward ltering, backward smoothing or RTS (Rauch-Tung-Striebel) smoothing after Rauch et al. (1965). From the discussion above, it clearly corresponds to an application of the general idea that the backward posterior chain is a Markov chain as discussed in Section 3.3.2. Algorithm 5.2.4 is thus the exact counterpart of Algorithm 5.1.3 for Gaussian linear state-space models. 5.2.2 Linear Prediction Interpretation The approach that we have followed so far to derive the ltering and smoothing recursions is simple and ecient and has the merit of being directly connected with the general framework investigated in Chapter 3. It however suers from two shortcomings, the latter being susceptible of turning into a real hindrance in practical applications of the method. The rst concern has to do with the interpretability of the obtained recursions. Indeed, by repeated applications of Proposition 5.2.2, we rapidly obtain complicated expressions such as (5.26). Although such expressions are usable in practice granted that one identies common terms that need only be computed once, they are hard to justify on intuitive grounds. This may sound like a vague or naive statement, but interpretability turns out to be a key issue when considering more involved algorithms such as the disturbance smoothing approach of Section 5.2.4 below. The second remark is perhaps more troublesome because it concerns the numerical eciency of the RTS smoothing approach described above. Several of the state-space models that we have considered so far share a common feature, which is dramatically exemplied in the noisy AR(p) model (Example 1.3.8 in Chapter 1). In this model, the disturbance Uk is scalar, and there is a deterministic relationship between the state variables Xk and Xk+1 , which is that the last p 1 components of Xk+1 are just a copy of the rst p 1 components of Xk . In such a situation, it is obvious that the same deter ministic relation should be reected in the values of Xk|n and Xk+1|n , in the k+1|n must coincide with the rst sense that the last p 1 components of X p 1 components of Xk|n . In contrast, Algorithm 5.2.4 implies a seemingly

132

5 Applications of Smoothing

complex recursion, which involves a p p matrix inversion, to determine Xk|n from Xk+1|n and Xk|k . In order to derive a smoothing algorithm that takes advantage of the model structure (5.11)(5.12), we will need to proceed more cautiously. For models like the noisy AR(p) model, it is in fact more appropriate to perform the smoothing on the disturbance (or dynamic noise) variables Uk rather than the states Xk themselves. This idea, which will be developed in Section 5.2.4 below, does not directly t into the framework of Chapter 3 however because the pairs {Uk , Yk }k0 are not Markovian, in contrast to {Xk , Yk }k0 . The rest of this section thus follows a slightly dierent path by developing the theory of best linear prediction in mean squared error sense. The key point here is that linear prediction can be interpreted geometrically using (elementary) Hilbert space theory. In state-space models (and more generally, in time series analysis), this geometric intuition serves as a valuable guide in the development and construction of algorithms. As a by-product, this approach also constitutes a framework that is not limited to the Gaussian case considered up to now and applies to all linear state-space models with nite second moments. However, the fact that this approach also fully characterizes the marginal smoothing distributions is of course particular to Gaussian models. 5.2.2.1 Best Linear Prediction This section and the following require basic familiarity with the key notions of L2 projections, which are reviewed briey in Appendix B. Let Y0 , . . . , Yk and X be elements of L2 (, F, P). We will assume for the moment that Y0 , . . . , Yk and X are scalar random variables. The best linear predictor of X given Y0 , . . . , Yk is the L2 projection of X on the linear subspace
k

span(1, Y0 , . . . , Yk ) =

def

Y :Y =+
i=0

i Yi ,

, 0 , . . . , k R

The best linear predictor will be denoted by proj(X|1, Y0 , . . . , Yk ), or simply by X in situations where there is no possible confusion regarding the subspace on which X is projected. The notation 1 refers to the constant (deterministic) random variable, whose role will be made clearer in Remark 5.2.5 below. According to the projection theorem (Theorem B.2.4 in Appendix B), X is characterized by the equations E{(X X)Y } = 0 for all Y span(1, Y0 , . . . , Yk ) .

Because 1, Y0 , . . . , Yk is a generating family of span(1, Y0 , . . . , Yk ), this condition may be equivalently rewritten as E[(X X)1] = 0 and E[(X X)Yi ] = 0, for all i = 0, . . . , k .

5.2 Gaussian Linear State-Space Models

133

The notations X X span(1, Y0 , . . . , Yk ) and X X Yi will also be used to denote concisely these orthogonality relations, where orthogonality is to be understood in the L2 (, F, P) sense. Because X span(1, Y0 , . . . , Yk ), the projection may be represented as X = + 0 (Y0 E[Y0 ]) + . . . + k (Yk E[Yk ]) (5.27)

for some scalars , 0 , . . . , k . Denoting by k the matrix [Cov(Yi , Yj )]0i,jk and k the vector [Cov(X, Y0 ), . . . , Cov(X, Yk )]t , the prediction equations may be summarized as = E[X] and n = k , where = (1 , . . . , k )t . (5.28)

The projection theorem guarantees that there is at least one solution . If the covariance matrix k is singular, there are innitely many solutions, but all of them correspond to the same (uniquely dened) optimal linear predictor. An immediate consequence of Proposition B.2.6(iii) is that the covariance of the prediction error may be written in two equivalent, and often useful, ways, Cov(X X) = E[X(X X)] = Cov(X) Cov(X) . (5.29)

Remark 5.2.5. The inclusion of the deterministic constant in the generating family of the prediction subspace is simply meant to capture the prediction capacity of E[X]. Indeed, because E[(X )2 ] = E{[X E(X)]2 } + [ E(X)]2 E(X 2 ) + [ E(X)]2 , predicting X by E(X) is the optimal guess that always reduces the mean squared error in the absence of observations. In (5.27), we used a technique that will be recurrent in the following and consists in replacing some variables by orthogonalized ones. Because E[(Yi E(Yi ))1] = 0 for i = 0, . . . , k, the projection on span(1, Y0 , . . . , Yk ) may be decomposed as the projection on span(1), that is, E(X), plus the projection on span(Y0 E[Y0 ], . . . , Yk E[Yk ]). Following (5.28), projecting a non-zero mean variable X is then achieved by rst considering the projection on the centered observations Yi E(Yi ) and then adding the expectation of X to the obtained prediction. For this reason, considering means is not crucial, and we assume in the sequel that all variables under consideration have zero mean. Hence, X is directly dened as the projection on span(Y0 , . . . , Yk ) only and the covariances Cov(Yi , Yj ) and Cov(X, Yi ) can be replaced by E(Yi Yj ) and E(XYi ), respectively. We now extend these denitions to the case of vector-valued random variables. Denition 5.2.6 (Best Linear Predictor). Let X = [X(1), . . . , X(dx )]t be a dx -dimensional random vector and Y0 , . . . , Yk a family of dy -dimensional

134

5 Applications of Smoothing

random vectors, all elements of L2 (, F, P). It is further assumed that E(X) = 0 and E(Yi ) = 0 for i = 0, . . . , k. The minimum mean square error prediction of X given Y0 , . . . , Yk is dened as the vector [X(1), . . . , X(dx )]t such that 2 every component X(j), j = 1, . . . , dx , is the L -projection of X(j) on span {Yi (j)}0ik,1jdy As a convention, we will also use the notations X = proj(X|Y0 , . . . , Yk ) = proj(X| span(Y0 , . . . , Yk )) , in this context. Denition 5.2.6 asserts that each component X(j) of X is to be projected on the linear subspace spanned by linear combinations of the components of the vectors Yi , k dy Y :Y = i,j Yi (j) , i,j R .
i=0 j=1

Proceeding as in the case of scalar variables, the projection X may be written


k

X=
i=0

i Yi ,

where 0 , . . . , k are dx dy matrices. The orthogonality relations that char acterize the projection of X may the be summarized as
k

i E(Yi Yjt ) = E(XYjt )


i=0

for j = 0, . . . , k ,

(5.30)

where E(Yi Yjt ) and E(XYjt ) are respectively dy dy and dx dy matrices such that E(Yi Yjt )
l1 l2 E(XYjt ) l l 1 2

= E[Yi (l1 )Yj (l2 )] , = E[X(l1 )Yj (l2 )] .

The projection theorem guarantees that there is at least one solution to this system of linear equations. The solution is unique if the dy (k + 1) dy (k + 1) block matrix t E(Y0 Y0t ) E(Y0 Yk ) . . . . k = . .
t E(Yn Y0t ) E(Yn Yn )

is invertible. As in the scalar case, the covariance matrix of the prediction error may be written in any of the two forms

5.2 Gaussian Linear State-Space Models

135

Cov(X X) = E[X(X X)t ] = E(XX t ) E(X X t ) . An important remark, which can be easily checked from (5.30), is that proj(AX|Y0 , . . . , Yk ) = A proj(X|Y0 , . . . , Yk ) ,

(5.31)

(5.32)

whenever A is a deterministic matrix of suitable dimensions. This simply says that the projection operator is linear. Clearly, solving for (5.30) directly is only possible in cases where the dimension of k is modest. In all other cases, an incremental way of computing the predictor would be preferable. This is exactly what the innovation approach to be described next is all about. 5.2.2.2 The Innovation Approach Let us start by noting that when k = 0, and when the covariance matrix E(Y Y t ) is invertible, then the best linear predictor of the vector X in terms of Y only satises X = E(XY t ) E(Y Y t )
t 1

Y ,
t t t 1

(5.33) E(Y X t ) .

Cov(X X) = E[X(X X) ] = E(XX ) E(XY ) E(Y Y )

Interestingly, (5.33) is an expression that we already met in Proposition 5.2.2. Equation (5.33) is equivalent to the rst expressions given in (5.16) and (5.17), assuming that X is a zero mean variable. This is not surprising, as the proof of Proposition 5.2.2 was based on the fact that X, as dened by (5.33), is is uncorrelated with Y . The only dierence is that in the such that X X (multivariate) Gaussian case, the best linear predictor and the covariance of the prediction error also correspond to the rst two moments of the conditional distribution of X given Y , which is Gaussian, and hence entirely dene this distribution. Another case of interest is when the random variables Y0 , . . . , Yk are uncorrelated in the sense that E(Yi Yjt ) = 0 for any i, j = 0, . . . , k such that i = j. In this case, provided that the covariance matrices E(Yi Yit ) are positive denite for every i = 0, . . . , k, the best linear predictor of X in terms of {Y0 , . . . , Yk } is given by
k

X=
i=0

E(XYit ) E(Yi Yit )

Yi .

(5.34)

The best linear predictor of X in terms of Y0 , . . . , Yk thus reduces to the sum of the best linear predictors of X in terms of each individual vector Yi , i = 0, . . . , k. Of course, in most problems the vectors Y0 , . . . , Yk are correlated, but there is a generic procedure by which we may fall back to this simple case, irrespectively of the correlation structure of the Yk . This approach is the

136

5 Applications of Smoothing

analog of the Gram-Schmidt orthogonalization procedure used to obtain a basis of orthogonal vectors from a set of linearly independent vectors. Consider the linear subspace span(Y0 , . . . , Yj ) spanned by the observations up to index j. By analogy with the Gram-Schmidt procedure, one may replace the set {Y0 , . . . , Yj } of random vectors by an equivalent set { 0 , . . . , j } of uncorrelated random vectors spanning the same linear subspace, span(Y0 , . . . , Yj ) = span( 0 , . . . , j ) for all j = 0, . . . , k .
j

(5.35) by
0

This can be achieved by dening recursively the sequence of and j+1 = Yj+1 proj(Yj+1 | span(Y0 , . . . , Yj ))

= Y0 (5.36)

for j 0. The projection of Yj+1 on span(Y0 , . . . , Yj ) = span( 0 , . . . , j ) has an explicit form, as 0 , . . . , j are uncorrelated. According to (5.34),
j

proj(Yj+1 | span( 0 , . . . , j )) =
i=0

E(Yj+1 t ) E( i

t 1 i i) i

(5.37)

which leads to the recursive expression


j j+1

= Yj+1
i=0

E(Yj+1 t ) E( i

t 1 i i) i

(5.38)

For any j = 0, . . . , k, j may be interpreted as the part of the random variable Yj that cannot be linearly predicted from the history Y0 , . . . , Yj1 . For this reason, j is called the innovation. The innovation sequence { j }j0 constructed recursively from (5.38) is uncorrelated but is also in a causal relationship with {Yj }j0 in the sense that for every j 0,
j

span(Y0 , . . . , Yj )

and Yj span( 0 , . . . , j ) .

(5.39)

In other words, the sequences {Yj }j0 and { j }j0 are related by a causal and causally invertible linear transformation. To avoid degeneracy in (5.37) and (5.38), one needs to assume that the covariance matrix E( j t ) is positive denite. Hence we make the following j denition, which guarantees that none of the components of the random vector Yj+1 can be predicted without error by some linear combination of past variables Y0 , . . . , Yj . Denition 5.2.7 (Non-deterministic Process). The process {Yk }k0 is said to be non-deterministic if for any j 0 the matrix Cov [Yj+1 proj(Yj+1 |Y0 , . . . , Yj )] is positive denite.

5.2 Gaussian Linear State-Space Models

137

The innovation sequence { k }k0 is useful for deriving recursive prediction formulas for variables of interest. Let Z L2 (, F, P) be a random vector and denote by Z|k the best linear prediction of Z given observations up to index k. Using (5.34), Z|k satises the recursive relation
k

Z|k =
i=0

E(Z t ) E( i

t 1 i i) i t 1 k k k)

(5.40) .

= Z|k1 + E(Z t ) E( k

The covariance of the prediction error is given by Cov(Z Z|k ) = Cov(Z) Cov(Z|k )
k

(5.41)
t 1 i i)

= Cov(Z)
i=0

E(Z t ) E( i

E( i Z t )
t 1 k k)

= Cov(Z) Cov(Z|k1 ) E(Z t ) E( k

E( k Z t ) .

5.2.3 The Prediction and Filtering Recursions Revisited 5.2.3.1 Kalman Prediction We now consider again the state-space model Xk+1 = Ak Xk + Rk Uk , Yk = Bk Xk + Sk Vk , (5.42) (5.43)

where {Uk }k0 and {Vk }k0 are now only assumed to be uncorrelated secondorder white noise sequences with zero mean and identity covariance matrices. The initial state variable X0 is assumed to be uncorrelated with {Uk } and {Vk } and such that E(X0 ) = 0 and Cov(X0 ) = . It is also assumed that {Yk }k0 is non-deterministic in the sense of Denition 5.2.7. The form of (5.43) shows that a simple sucient (but not necessary) condition that guarantees this t requirement is that Sk Sk be positive denite for all k 0. As a notational convention, for any (scalar or vector-valued) process {Zk }k0 , the projection of Zk onto the linear space spanned by the random vectors Y0 , . . . , Yn will be denoted by Zk|n . Particular cases of interest are Xk|k1 , which corresponds to the (one-step) state prediction as well as Yk|k1 for the observation prediction. The innovation k discussed in the previous section is by denition equal to the observation prediction error Yk Yk|k1 . We nally introduce two additional notations, k = Cov( k )
def def and k|n = Cov(Xk Xk|n ) .

138

5 Applications of Smoothing

Remark 5.2.8. The careful reader will have noticed that we overloaded the notations Xk|k1 and k|k1 , which correspond, in Proposition 5.2.3, to the mean and covariance matrix of k|k1 and, in Algorithm 5.2.9, to the best mean square linear predictor of Xk in terms of Y0 , . . . , Yk1 and the covari ance of the linear prediction error Xk Xk|k1 . This abuse of notation is justied by Proposition 5.2.2, which states that these concepts are equivalent in the Gaussian case. In the general non-Gaussian model, only the second interpretation (linear prediction) is correct. We rst consider determining the innovation sequence from the observations. Projecting (5.43) onto span(Y0 , . . . , Yk1 ) yields Yk|k1 = Bk Xk|k1 + Sk Vk|k1 . (5.44)

Our assumptions on the state-space model imply that E(Vk Yjt ) = 0 for j = 0, . . . , k 1, so that Vk|k1 = 0. Hence
k

= Yk Yk|k1 = Yk Bk Xk|k1 .

(5.45)

We next apply the general decomposition obtained (5.40) to the variable Xk+1 to obtain the state prediction update. Equation (5.40) applied with Z = Xk+1 yields Xk+1|k = Xk+1|k1 + E(Xk+1 t ) E( k
t 1 k k) k

(5.46)

To complete the recursion, the rst term on the right-hand side should be expressed in terms of Xk|k1 and k1 . Projecting the state equation (5.42) on the linear subspace spanned by Y0 , . . . , Yk1 yields Xk+1|k1 = Ak Xk|k1 + Rk Uk|k1 = Ak Xk|k1 , (5.47)

because E(Uk Yjt ) = 0 for indices j = 0, . . . , k 1. Thus, (5.46) may be written Xk+1|k = Ak Xk|k1 + Hk
k

(5.48)

where Hk , called the Kalman gain 1 , is a deterministic matrix dened by


1 Hk = E(Xk+1 t )k . k def

(5.49)

To evaluate the Kalman gain, rst note that


k

= Yk Bk Xk|k1 = Bk (Xk Xk|k1 ) + Sk Vk .

(5.50)

1 Readers familiar with the topic will certainly object that we do not comply with the well-established tradition of denoting the Kalman gain by the letter K. We will however meet in Algorithm 5.2.13 below a dierent version of the Kalman gain for which we reserve the letter K.

5.2 Gaussian Linear State-Space Models

139

Because E(Vk (Xk Xk|k1 )t ) = 0, (5.50) implies that


t t k = Bk k|k1 Bk + Sk Sk ,

(5.51)

where k|k1 is our notation for the covariance of the state prediction error Xk Xk|k1 . Using the same principle, E(Xk+1 t ) = Ak E(Xk t ) + Rk E(Uk t ) k k k
t t = Ak k|k1 Bk + Rk E[Uk (Xk Xk|k1 )t ]Bk t = Ak k|k1 Bk ,

(5.52)

where we have used the fact that Uk span(X0 , U0 , . . . , Uk1 , V0 , . . . , Vk1 ) span(Xk , Y0 , . . . , Yk1 ) . Combining (5.51) and (5.52) yields the expression of the Kalman gain:
t t t Hk = Ak k|k1 Bk Bk k|k1 Bk + Sk Sk 1

(5.53)

As a nal step, we now need to evaluate k+1|k . Because Xk+1 = Ak Xk + t Rk Uk and E(Xk Uk ) = 0,
t Cov(Xk+1 ) = Ak Cov(Xk )At + Rk Rk . k

(5.54)

Similarly, the predicted state estimator follows (5.48) in which Xk|k1 and k also are uncorrelated, as the former is an element of span(Y0 , . . . , Yk1 ). Hence t Cov(Xk+1|k ) = Ak Cov(Xk|k1 )At + Hk k Hk . (5.55) k Using (5.31), k+1|k = Cov(Xk+1 ) Cov(Xk+1|k )
t t = Ak k|k1 At + Rk Rk Hk k Hk , k

(5.56)

upon subtracting (5.55) from (5.54). Equation (5.56) is known as the Riccati equation. Collecting (5.45), (5.48), (5.51), (5.53), and (5.56), we obtain the standard form of the so-called Kalman lter, which corresponds to the prediction recursion. Algorithm 5.2.9 (Kalman Prediction). Initialization: X0|1 = 0 and 0|1 = . Recursion: For k = 0, . . . n,
k

= Yk Bk Xk|k1 , , , +
t Rk Rk

innovation innovation cov. Kalman Gain


k

(5.57) (5.58) (5.59) (5.61)

t t k = Bk k|k1 Bk + Sk Sk t 1 Hk = Ak k|k1 Bk k ,

Xk+1|k = Ak Xk|k1 + Hk k+1|k = (Ak

predict. state estim. (5.60) . predict. error cov.

Hk Bk )k|k1 At k

140

5 Applications of Smoothing

It is easily checked using (5.59) that (5.61) and (5.56) are indeed equivalent, the former being more suited for practical implementation, as it requires fewer matrix multiplications. Equation (5.61) however dissimulates the fact that k+1|k indeed is a symmetric matrix. One can also check by simple substitution that Algorithm 5.2.9 is also equivalent to the application of the recursion derived in Proposition 5.2.3 for Gaussian models. Remark 5.2.10. Evaluating the likelihood function for general linear statespace models is a complicated task. For Gaussian models however, k and k entirely determine the rst two moments, and hence the full conditional probability density function of Yk given the previous observations Y0 , . . . , Yk1 , in the form 1 1 (5.62) (2)dy /2 |k |1/2 exp t k k 2 k where dy is the dimension of the observations. As a consequence, the loglikelihood of observations up to index n may be computed as
n

1 (n + 1)dy log(2) 2 2

log |k | +
k=0

t 1 k k k

(5.63)

which may be evaluated recursively (in n) using Algorithm 5.2.9. Equation (5.63), which is very important in practice for parameter estimation in state-space models, is easily recognized as a particular form of the general relation (3.29). Example 5.2.11 (Random Walk Plus Noise Model). To illustrate Algorithm 5.2.9 on a simple example, consider the scalar random walk plus noise model dened by Xk+1 = Xk + u Uk , Yk = Xk + v Vk , where all variables are scalar. Applying the Kalman prediction equations yields, for k 1, Xk+1|k = Xk|k1 + k|k1 Yk Xk|k1 2 k|k1 + v
2 k|k1 2 k|k1 + v def

(5.64)

= (1 ak )Xk|k1 + ak Yk ,
2 k+1|k = k|k1 + u 2 k|k1 v 2 k|k1 + v

2 + u = f (k|k1 ) ,

(5.65)

5.2 Gaussian Linear State-Space Models

141

2 with the notation ak = k|k1 /(k|k1 + v ). This recursion is initialized by setting X0|1 = 0 and 0|1 = . For such a state-space model with timeindependent parameters, it is interesting to consider the steady-state solutions for the prediction error covariance, that is, to solve for in the equation 2 v 2 + u . 2 + v

= f () = Solving this equation for 0 yields = 1 2 + 2 u

4 2 2 u + 4u v

Straightforward calculations show that, for any M < , sup0M |f()| < 1. In addition, for k 1, (k+1|k )(k|k1 ) 0. These remarks imply that k+1|k always falls between k|k1 and , and in particular that k+1|k max(1|0 , ). Because f is strictly contracting on any compact subset of R+ , regardless of the value of , the coecients 2 ak = k|k1 /(k|k1 + v ) converge to a = , 2 + v

and the mean squared error of the observation predictor (Yk+1 Yk+1|k ) con2 verges to + v . Remark 5.2.12 (Algebraic Riccati Equation). The equation obtained t t by assuming that the model parameters Ak , Bk , Sk Sk , and Rk Rk are time invariant, that is, do not depend on the index k, and then dropping indices in (5.56), is the so-called algebraic Riccati equation (ARE). Using (5.51) and (5.53), one nds that the ARE may be written = AAt + AB t (BB t + SS t )1 BAt + RRt . Conditions for the existence of a symmetric positive semi-denite solution to this equation, and conditions under which the recursive form (5.56) converges to such a solution can be found, for instance, in (Caines, 1988).

5.2.3.2 Kalman Filtering Algorithm 5.2.9 is primarily intended to compute the state predictor Xk|k1 and the covariance k|k1 of the associated prediction error. It is of course possible to obtain a similar recursion for the ltered state estimator Xk|k and associated covariance matrix k|k . Let us start once again with (5.40), applied with Z = Xk , to obtain
1 Xk|k = Xk|k1 + E(Xk t )k k k

= Xk|k1 + Kk

(5.66)

142

5 Applications of Smoothing
def

1 where, this time, Kk = Cov(Xk , k )k is the lter version of the Kalman gain. The rst term on the right-hand side of (5.66) may be rewritten as

Xk|k1 = Ak1 Xk1|k1 + Rk1 Uk1|k1 = Ak1 Xk1|k1 , where we have used Uk1 span(X0 , U0 , . . . , Uk2 ) span(Y0 , . . . , Yk1 ) . Likewise, the second term on the right-hand side of (5.66) reduces to
t 1 Kk = k|k1 Bk k ,

(5.67)

(5.68)

t because k = Bk (Xk Xk|k1 ) + Sk Vk with E(Xk Vk ) = 0. The only missing piece is the relationship between the error covariance matrices k|k and k|k1 . The state equation Xk = Ak1 Xk1 + Rk1 Uk1 and the state prediction equation Xk|k1 = Ak1 Xk1|k1 imply that t Cov(Xk ) = Ak1 Cov(Xk1 )At + Rk1 Rk1 , k1 Cov(Xk|k1 ) = Ak1 Cov(Xk1|k1 )At , k1

which, combined with (5.31), yield


t k|k1 = Ak1 k1|k1 At + Rk1 Rk1 . k1

(5.69)

By the same argument, the state recursion Xk = Ak1 Xk1 + Rk1 Uk1 and the lter update Xk|k = Ak1 Xk1|k1 + Kk k imply that
t t k|k = Ak1 k1|k1 At + Rk1 Rk1 Kk k Kk . k1

(5.70)

These relations are summarized in the form of an algorithm. Algorithm 5.2.13 (Kalman Filtering). For k = 0, . . . n, do the following. If k = 0, set Xk|k1 = 0 and k|k1 = ; otherwise, set Xk|k1 = Ak1 Xk1|k1 ,
t k|k1 = Ak1 k1|k1 At + Rk1 Rk1 . k1

Compute
k

= Yk Bk Xk|k1 , ,

innovation innovation cov. Kalman (lter.) gain lter. state estim. lter. error cov.

(5.71) (5.72) (5.73) (5.74) (5.75)

t t k = Bk k|k1 Bk + Sk Sk t 1 Kk = k|k1 Bk k ,

Xk|k = Xk|k1 + Kk

k|k = k|k1 Kk Bk k|k1 .

5.2 Gaussian Linear State-Space Models

143

There are several dierent ways in which Algorithm 5.2.13 may be equivalently rewritten. In particular, it is possible to completely omit the prediction variables Xk|k1 and k|k1 (Kailath et al., 2000). Remark 5.2.14. As already mentioned in Remark 5.2.5, the changes needed to adapt the ltering and prediction recursions to the case where the state and measurement noises are not assumed to be zero-mean are straightforward. The basic idea is to convert the state-space model by dening properly centered states and measurement variables. Dene Xk = Xk E[Xk ], Uk = Uk E[Uk ], Yk = Yk E[Yk ], and Vk = Vk E[Vk ]; the expectations of the state and measurement variables can be computed recursively using E[Xk+1 ] = Ak E[Xk ] + Rk E[Uk ] , E[Yk ] = Bk E[Xk ] + Sk E[Vk ] . It is obvious that
Xk+1 = Xk+1 E[Xk+1 ] = Ak (Xk E[Xk ]) + Rk (Uk E[Uk ]) = Ak Xk + Rk Uk

and, similarly,
Yk = Yk E[Yk ] = Bk Xk + Sk Vk . Thus {Xk , Yk }k0 follows the model dened by (5.42)(5.43) with X0 = 0, E[Uk ] = 0 and E[Vk ] = 0. The Kalman recursions may be applied directly to compute for instance Xk|k1 , the best linear estimate of Xk in terms of Y0 , . . . , Yk1 . The best linear estimate of Xk in terms of Y0 , . . . , Yk1 is then given by Xk|k1 = X + E[Xk ] . k|k1

All other quantities of interest can be treated similarly.

5.2.4 Disturbance Smoothing After revisiting Proposition 5.2.3, we are now ready to derive an alternative solution to the smoothing problem that will share the general features of Algorithm 5.2.4 (RTS smoothing) but operate only on the disturbance vectors Uk rather than on the states Xk . This second form of smoothing, which is more ecient in situations discussed at the beginning of Section 5.2.2, has been popularized under the name of disturbance smoothing by De Jong (1988), Kohn and Ansley (1989), and Koopman (1993). It is however a rediscovery of a technique known, in the engineering literature, as Bryson-Frazier (or BF) smoothing, named after Bryson and Frazier (1963)see also (Kailath et al., 2000, Section 10.2.2). The original arguments invoked by Bryson and Frazier (1963) were however very dierent from the ones discussed here and

144

5 Applications of Smoothing

the use of the innovation approach to obtain smoothing estimates was initiated by Kailath and Frost (1968). Recall that for k = 0, . . . , n1 we denote by Uk|n the smoothed disturbance estimator, i.e., the best linear prediction of the disturbance Uk in terms of the observations Y0 , . . . , Yn . The additional notation k|n = Cov(Uk Uk|n ) will also be used. We rst state the complete algorithm before proving that it is actually correct. Algorithm 5.2.15 (Disturbance Smoother). Forward ltering: Run the Kalman lter (Algorithm 5.2.9) and store for k = 1 0, . . . , n the innovation k , the inverse innovation covariance k , the state prediction error covariance k|k1 , and k = Ak Hk Bk , where Hk is the Kalman (prediction) gain. Backward smoothing: For k = n 1, . . . , 0, compute pk = Ck =
t 1 Bn n n 1 t Bk+1 k+1 def def

k+1

t pk+1 k+1

for k = n 1, otherwise,

(5.76)

t 1 Bn n Bn 1 t Bk+1 k+1 Bk+1 + t Ck+1 k+1 k+1

for k = n 1, otherwise, (5.77) (5.78) (5.79)

t Uk|n = Rk pk ,

k|n = I

t Rk Ck Rk

Initial Smoothed State Estimator: Compute


t 1 X0|n = B0 0 0

+ t p0 , 0 + t C0 0 0 .

(5.80) (5.81)

0|n =

t 1 B0 0 B0

Smoothed State Estimator: For k = 0, . . . n 1, Xk+1|n = Ak Xk|n + Rk Uk|n , k+1|n = Ak k|n At k


t + Rk k|n Rk t Ak k|k1 t Ck Rk Rk k

(5.82)
t Rk Rk Ck k k|k1 At . (5.83) k

Algorithm 5.2.15 is quite complex, starting with an application of the Kalman prediction recursion, followed by a backward recursion to obtain the smoothed disturbances and then a nal forward recursion needed to evaluate the smoothed states. The proof below is split into two parts that concentrate on each of the two latter aspects of the algorithm.

5.2 Gaussian Linear State-Space Models

145

Proof (Backward Smoothing). We begin with the derivation of the equations needed for computing the smoothed disturbance estimator Uk|n for k = n 1 down to 0. As previously, it is advantageous to use the innovation sequence { 0 , . . . , n } instead of the correlated observations {Y0 , . . . , Yn }. Using (5.40), we have n n Uk|n = E(Uk t ) 1 i = E(Uk t ) 1 i , (5.84) i i
i i i=0 i=k+1

where the fact that Uk span{Y0 , . . . Yk } = span{ 0 , . . . ,


k}

has been used to obtain the second expression. We now prove by induction that for any i = k + 1, . . . , n, E[Uk (Xi Xi|i1 )t ] = E(Uk t ) = i First note that E(Uk
t k+1 ) t = E[Uk (Xk+1 Xk+1|k )t ]Bk+1 t t t t = E(Uk Xk+1 )Bk+1 = Rk Bk+1 , t Rk , t t t Rk k+1 t k+2 . . . i1 ,

i=k+1, ik+2,

(5.85)

t t Rk Bk+1 , t t t Rk t t k+1 k+2 . . . i1 Bi ,

i=k+1, ik+2.

(5.86)

using (5.45) and the orthogonality relations Uk Vk+1 , Uk span(Y0 , . . . , Yk ) and Uk Xk . Now assume that (5.85)(5.86) hold for some i k + 1. Combining the state equation (5.42) and the prediction update equation (5.48), we obtain Xi+1 Xi+1|i = i (Xi Xi|i1 ) + Ri Ui Hi Si Vi . (5.87)

Because E(Uk Uit ) = 0 and E(Uk Vit ) = 0, the induction assumption implies that
t t E[Uk (Xi+1 Xi+1|i )t ] = E[Uk (Xi Xi|i1 )t ]t = Rk t t i k+1 k+2 . . . i . (5.88) Proceeding as in the case i = k above,

E(Uk

t i+1 )

t t t t = E[Uk (Xi+1 Xi+1|i )t ]Bi+1 = Rk t t k+1 k+2 . . . i Bi+1 , (5.89)

which, by induction, shows that (5.85)(5.86) hold for all indices i k + 1. Plugging (5.86) into (5.84) yields
n t Uk|n = Rk 1 t Bk+1 k+1 k+1

+
i=k+2

t t . . . t Bi i1 k+1 i1

(5.90)

146

5 Applications of Smoothing

where the term between parentheses is easily recognized as pk dened recursively by (5.76), thus proving (5.78). To compute the smoothed disturbance error covariance k|n , we apply once again (5.41) to obtain k|n = Cov(Uk ) Cov Uk|n
n

(5.91)

=I
i=k+1

t E(Uk t )i1 E( i Uk ) i

1 t t = I Rk Bk+1 k+1 Bk+1 n

+
i=k+2

t t 1 t k+1 . . . i1 Bi i Bi i1 . . . k+1 Rk ,

where I is the identity matrix with dimension that of the disturbance vector and (5.89) has been used to obtain the last expression. The term in parentheses in (5.91) is recognized as Ck dened by (5.77), and (5.79) follows. Proof (Smoothed State Estimation). The key ingredient here is the following set of relations: E[Xk (Xi Xi|i1 )t ] = E(Xk t ) = i k|k1 , t k|k1 t t k k+1 . . . i1 , i=k, ik+1, (5.92)

t k|k1 Bk , t t k|k1 t t k k+1 . . . i1 Bi ,

i=k, ik+1,

(5.93)

which may be proved by induction exactly like (5.85)(5.86). Using (5.40) as usual, the minimum mean squared error linear predictor of the initial state X0 in terms of the observations Y0 , . . . , Yn may be expressed as n X0|n = E(X0 t ) 1 i . (5.94) i
i i=0

Hence by direct application of (5.93),


n

X0|n =

t 1 B0 0

0+ i=1

t t . . . t Bi i1 0 i1

(5.95)

proving (5.80). Proceeding as for (5.91), the expression for the smoothed initial state error covariance in (5.81) follows from (5.41). The update equation (5.82) is a direct consequence of the linearity of the projection operator applied to the state equation (5.42). Finally, to prove (5.83), rst combine the state equation (5.42) with (5.82) to obtain

5.2 Gaussian Linear State-Space Models

147

Cov(Xk+1 Xk+1|n ) = Cov[Ak (Xk Xk|n ) + Rk (Uk Uk|n )] = t t t t Ak k|n At + Rk k|n Rk Ak E(Xk Uk|n )Rk Rk E(Uk|n Xk )At , (5.96) k k where the remark that E[Xk|n (Uk Uk|n )t ] = 0, because Xk|n belongs to span(Y0 , . . . , Yn ), has been used to obtain the second expression. In order to compute E(Xk U t ) we use (5.90), writing
k|n

t E(Xk Uk|n ) = E(Xk

1 t k+1 )k+1 Bk+1 Rk + n E(Xk t )i1 Bi i1 i i=k+2

. . . k+1 Rk . (5.97)

Finally, invoke (5.93) to obtain


1 t t E(Xk Uk|n ) = k|k1 t Bk+1 k+1 Bk+1 Rk + k n t t 1 k|k1 t t k k+1 . . . i1 Bi i Bi i1 . . . k+1 Rk , i=k+2

which may be rewritten as t E(Xk Uk|n ) = k|k1 t Ck Rk . k Equation (5.83) then follows from (5.96). Remark 5.2.16. There are a number of situations where computing the best linear prediction of the state variables is the only purpose of the analysis, and computation of the error covariance Cov(Xk Xk|n ) is not required. Algorithm 5.2.15 may then be substantially simplied because (5.77), (5.79), (5.81), and (5.83) can be entirely skipped. Storage of the prediction error covariance matrices k|k1 during the initial Kalman ltering pass is also not needed anymore. Remark 5.2.17. An important quantity in the context of parameter estimation (to be discussed in Section 10.4 of Chapter 10) is the one-step posterior cross-covariance Ck,k+1|n = E
def

(5.98)

Xk Xk|n

Xk+1 Xk+1|n

Y0:n

(5.99)

This is a quantity that can readily be evaluated during the nal forward recursion of Algorithm 5.2.15. Indeed, from (5.42)(5.82), Xk+1 Xk+1|n = Ak Xk Xk|n + Rk Uk Uk|n Hence
t t Ck,k+1|n = k|n At E Xk Uk|n Rk , k t where the fact that E(Xk Uk ) = 0 has been used. Using (5.98) then yields t Ck,k+1|n = k|n At k|k1 t Ck Rk Rk . k k

(5.100)

148

5 Applications of Smoothing

5.2.5 The Backward Recursion and the Two-Filter Formula Notice that up to now, we have not considered the backward functions k|n in the case of Gaussian linear state-space models. In particular, and although the details of both approaches dier, the smoothing recursions discussed in Sections 5.2.1 and 5.2.4 are clearly related to the general principle of backward Markovian smoothing discussed in Section 3.3.2 and do not rely on the forward-backward decomposition discussed in Section 3.2. A rst terminological remark is that although major sources on Gaussian linear models never mention the forward-backward decomposition, it is indeed known under the name of two-lter formula (Fraser and Potter, 1969; Kitagawa, 1996; Kailath et al., 2000, Section 10.4). A problem however is that, as noted in Chapter 3, the backward function k|n is not directly interpretable as a probability distribution (recall for instance that the initialization of the backward recursion is n|n (x) = 1 for all x X). A rst approach consists in introducing some additional assumptions on the model that ensure that k|n (x), suitably normalized, can indeed be interpreted as a probability density function. The backward recursion can then be interpreted as the Kalman prediction algorithm, applied backwards in time, starting from the end of the data record (Kailath et al., 2000, Section 10.4). A dierent option, originally due to Mayne (1966) and Fraser and Potter (1969), consists in deriving the backward recursion using a reparameterization of the backward functions k|n , which is robust to the fact that k|n (x) may not be integrable over X. This solution has the advantage of being generic in that it does not require any additional assumptions on the model, other t than Sk Sk being invertible. The drawback is that we cannot simply invoke a variant of Algorithm 5.2.3 but need to derive a specic form of the backward recursion using a dierent parameterization. This implementation of the backward recursion (which could also be used, with some minor modications, for usual forward prediction) is referred to as the information form of the Kalman ltering and prediction recursions (Anderson and Moore, 1979, Section 6.3; Kailath et al., 2000, Section 9.5.2). In the time series literature, this method is also sometimes used as a tool to compute the smoothed estimates when using so-called diuse priors (usually for X0 ), which correspond to the notion of improper at distributions to be discussed below. 5.2.5.1 The Information Parameterization The main ingredient of what follows consists in revisiting the calculation of the posterior distribution of the unobserved component X in the basic Gaussian linear model Y = BX + V . Indeed, in order to prove Proposition 5.2.2, we could have followed a very dierent route: assuming that both V and Cov(Y ) = B t X B + V are full

5.2 Gaussian Linear State-Space Models

149

rank matrices, the posterior probability density function of X given Y , which we denote by p(x|y), is known by Bayes rule to be proportional to the product of the prior p(x) on X and the conditional probability density function p(y|x) of Y given X, that is, 1 1 1 (y Bx)t V (y Bx) + (x X )t X (x X ) , 2 (5.101) where the symbol indicates proportionality up to a constant that does not depend on the variable x. Note that this normalizing constant could easily be determined in the current case because we know that p(x|y) corresponds to a multivariate Gaussian probability density function. Hence, to fully determine p(x|y), we just need to rewrite (5.101) as a quadratic form in x: p(x|y) exp p(x|y) exp { 1 t t 1 1 1 1 x (B V B + X )x xt (B t V y + X X ) 2
1 1 (B t V y + X X )t x

, (5.102)

that is, p(x|y) exp where


1 1 X|Y = X|Y B t V y + X X

1 1 (x X|Y )t X|Y (x X|Y )] 2

(5.103)

(5.104) (5.105)

X|Y = B

1 V B

1 1 X

Note that in going from (5.102) to (5.104), we have used once again the fact that p(x|y) only needs be determined up to a normalization factor, whence terms that do not depend on x can safely be ignored. As a rst consequence, (5.105) and (5.104) are alternate forms of equations (5.17) and (5.16), respectively, which we rst met in Proposition 5.2.2. The fact that (5.17) and (5.105) coincide is a well-known result from matrix theory known as the matrix inversion lemma that we could have invoked directly to obtain (5.104) and (5.105) from Proposition 5.2.2. This simple rewriting of the conditional mean and covariance in the Gaussian linear model is however not the only lesson that can be learned from (5.104) and (5.105). In particular, a very natural parameterization of the Gaussian distribution in this context consists in considering the inverse of the covariance matrix = 1 and the vector = rather than the covariance and the mean vector . Both of these parameterizations are of course fully equivalent when the covariance matrix is invertible. In some contexts, the inverse covariance matrix is referred to as the precision matrix, but in the ltering context the

150

5 Applications of Smoothing

use of this parameterization is generally associated with the word information (in reference to the fact that in a Gaussian experiment, the inverse of the covariance matrix is precisely the Fisher information matrix associated with the estimation of the mean). We shall adopt this terminology and refer to the use of and as parameters of the Gaussian distribution as the information parameterization. Note that because a Gaussian probability density function p(x) with mean and covariance may be written p(x) exp 1 t 1 x x 2xt 1 2 1 = exp trace xxt 1 2xt 1 2

= 1 and = also form the natural parameterization of the multivariate normal, considered as a member of the exponential family of distributions (Lehmann and Casella, 1998). 5.2.5.2 The Gaussian Linear Model (Again!) We summarize our previous ndingsEqs. (5.104) and (5.105)in the form of the following alternative version of Proposition 5.2.2, Proposition 5.2.18 (Conditioning in Information Parameterization). Let Y = BX + V , where X and V are two independent Gaussian random vectors such that, 1 1 in information parameterization, X = Cov(X) E(X), X = Cov(X) , 1 V = Cov(V ) and V = E(V ) = 0, B being a deterministic matrix. Then X|Y = X + B t V Y , X|Y = X + B V B ,
1 1 t

(5.106) (5.107)

where X|Y = Cov(X|Y ) E(X|Y ) and X|Y = Cov(X|Y ) . If the matrices X , V , or X|Y are not full rank matrices, (5.106) and (5.107) can still be interpreted in a consistent way using the concept of improper (at) distributions. Equations (5.106) and (5.107) deserve no special comment as they just correspond to a restatement of (5.104) and (5.105), respectively. The last sentence of Proposition 5.2.18 is a new element, however. To understand the point, consider (5.101) again and imagine what would happen if p(x), for instance, was assumed to be constant. Then (5.102) would reduce to p(x|y) exp 1 t t 1 1 1 x (B V B)x xt (B t V y) (B t V y)t x 2 , (5.108)

5.2 Gaussian Linear State-Space Models

151

which corresponds to a perfectly valid Gaussian distribution, when viewed as a 1 function of x, at least when B t V B has full rank. The only restriction is that there is of course no valid probability density function p(x) that is constant on X. This practice is however well established in Bayesian estimation (to be discussed in Chapter 13.1.1) where such a choice of p(x) is referred to as using an improper at prior. The interpretation of (5.108) is then that under an (improper) at prior on Y , the posterior mean of X given Y is
1 B t V B 1 1 B t V Y ,

(5.109)

which is easily recognized as the (deterministic) optimally weighted leastsquares estimate of x in the linear regression model Y = Bx + V . The important message here is that (5.109) can be obtained direct from (5.106) by assuming that X is the null matrix and X the null vector. Hence Proposition 5.2.18 also covers the case where X has an improper at distribution, which is handled simply by setting the precision matrix X and the vector X equal to 0. A more complicated situation is illustrated by the following example. Example 5.2.19. Assume that the linear model is such that X is bivariate Gaussian and the observation Y is scalar with B= 10 and Cov(V ) = 2 .

Proposition 5.2.18 asserts that the posterior parameters are then given by X|Y = X + X|Y = X + 2 Y 0 2 0 0 0 , . (5.110) (5.111)

In particular, if the prior on X is improper at, then (5.110) and (5.111) simply mean that the posterior distribution of the rst component of X given Y is Gaussian with mean Y and variance 2 , whereas the posterior on the second component is also improper at. In the above example, what is remarkable is not the result itself, which is obvious, but the fact that it can be obtained by application of a single set of formulas that are valid irrespectively of the fact that some distributions are improper. In more general situations, directions that are in the null space of X|Y form a subspace where the resulting posterior is improper at, whereas the posterior distribution of X projected on the image X|Y is a valid Gaussian distribution. The information parameterization is ambivalent because it can be used both as a Gaussian prior density function as in Proposition 5.2.18 but also as an observed likelihood. There is nothing magic here but simply the observation

152

5 Applications of Smoothing

that as we (i) allow for improper distributions and (ii) omit the normalization factors, Gaussian priors and likelihood are equivalent. The following lemma is a complement to Proposition 5.2.18, which will be needed below. Lemma 5.2.20. Up to terms that do not depend on x, exp 1 1 (y Bx)t 1 (y Bx) exp y t y 2y t 2 2 1 exp xt B t (I + )1 Bx 2xt B t (I + )1 , 2 dy (5.112)

where I denotes the identity matrix of suitable dimension. Proof. The left-hand side of (5.112), which we denote by p(x), may be rewritten as 1 p(x) = exp xB t 1 Bx 2 1 exp y t ( + 1 )y 2y t ( + 1 Bx) dy . (5.113) 2 Completing the square, the bracketed term in the integrand of (5.113) may be written y ( + 1 )1 ( + 1 Bx)
t

( + 1 )

y ( + 1 )1 ( + 1 Bx) ( + 1 Bx)t ( + 1 )1 ( + 1 Bx) . (5.114) The exponent of 1/2 times the rst two lines of (5.114) integrates to a constant (or, rather, a number not depending on x), as it is recognized as a Gaussian probability density function. Thus p(x) exp 1 [2xt B t 1 ( + 1 )1 2 + xt B t 1 1 ( + 1 )1 1 Bx , (5.115)

where terms that do not depend on x have been dropped. Equation (5.112) follows from the equalities 1 ( + 1 )1 = (I + )1 and 1 1 ( + 1 )1 1 = 1 ( + 1 )1 ( + 1 ) 1 = (I + )1 . Note that the last identity is the matrix inversion lemma that we already met, as (I + )1 = ( 1 + )1 . Using this last form however is not a good idea in general, however, as it obviously does not apply in cases where is non-invertible.

5.2 Gaussian Linear State-Space Models

153

5.2.5.3 The Backward Recursion The question now is, what is the link between our original problem, which consists in implementing the backward recursion in Gaussian linear state-space models, and the information parameterization discussed in the previous section? The connection is the fact that the backward functions dened by (3.16) do not correspond to probability measures. More precisely, k|n (Xk ) dened by (3.16) is the conditional density of the future observations Yk+1 , . . . , Yn given Xk . For Gaussian linear models, we know from Proposition 5.2.18 that this density is Gaussian and hence that k|n (x) has the form of a Gaussian likelihood, 1 p(y|x) exp (y M x)t 1 (y M x) , 2 for some M and given by (5.16) and (5.17). Proceeding as previously, this equation can be put in the same form as (5.108) (replacing B and V by M and , respectively). Hence, a possible interpretation of k|n (x) is that it corresponds to the posterior distribution of Xk given Yk+1 , . . . , Yn in the pseudo-model where Xk is assumed to have an improper at prior distribution. According to the previous discussion, k|n (x) itself may not correspond to a valid Gaussian distribution unless one can guarantee that M t 1 M is a full rank matrix. In particular, recall from Section 3.2.1 that the backward recursion is initialized by setting n|n (x) = 1, and hence n|n never is a valid Gaussian distribution. The route from now on is clear: in order to implement the backward recursion, one needs to dene a set of information parameters corresponding to k|n and derive (backward) recursions for these parameters based on Proposition 5.2.18. We will denote by k|n and k|n the information parameters (precision matrix times mean and precision matrix) corresponding to k|n for k = n down to 0 where, by denition, n|n = 0 and n|n = 0. It is important to keep in mind that k|n and k|n dene the backward function k|n only up to an unknown constant. The best we can hope to determine is k|n (x) , k|n (x) dx by computing the Gaussian normalization factor in situations where k|n is a full rank matrix. But this normalization is not more legitimate or practical than other ones, and it is preferable to consider that k|n will be determined up to a constant only. In most situations, this will be a minor concern, as formulas that take into account this possible lack of normalization, such as (3.21), are available. Proposition 5.2.21 (Backward Information Recursion). Consider the t Gaussian linear state-space model (5.11)(5.12) and assume that Sk Sk has full rank for all k 0. The information parameters k|n and k|n , which determine k|n (up to a constant), may be computed by the following recursion.

154

5 Applications of Smoothing

Initialization: Set n|n = 0 and n|n = 0. Backward Recursion: For k = n 1 down to 0,


t t k+1|n = Bk+1 Sk+1 Sk+1 1

Yk+1 + k+1|n , Bk+1 + k+1|n ,


1

(5.116) (5.117) (5.118) (5.119)

k+1|n =

t Bk+1

1 t Sk+1 Sk+1

t k|n = At I + k+1|n Rk Rk k t k|n = At I + k+1|n Rk Rk k

k+1|n ,
1

k+1|n Ak .

Proof. The initialization of Proposition 5.2.21 has already been discussed and we just need to check that (5.116)(5.119) correspond to an implementation of the general backward recursion (Proposition 3.2.1). We split this update in two parts and rst consider computing k+1|n (x) gk+1 (x)k+1|n (x) (5.120)

from k+1|n . Equation (5.120) may be interpreted as the posterior distribution of X in the pseudo-model in which X has a (possibly improper) prior distribution k+1|n (with information parameters k+1|n and k+1|n ) and Y = Bk+1 X + Sk+1 V is observed, where V is independent of X. Equations(5.116)(5.117) thus correspond to the information parameterization of k+1|n by application of Proposition 5.2.18. From (3.19) we then have k|n (x) = Qk (x, dx )k+1|n (x ) , (5.121)

where we use the notation Qk rather than Q to emphasize that we are dealing with possibly non-homogeneous models. Given that Qk is a Gaussian transition density function corresponding to (5.11), (5.121) may be computed explicitly by application of Lemma 5.2.20 which gives (5.118) and (5.119). While carrying out the backward recursion according to Proposition 5.2.21, it is also possible to simultaneously compute the marginal smoothing distribution by use of (3.21). Algorithm 5.2.22 (Forward-Backward Smoothing). Forward Recursion: Perform Kalman ltering according to Algorithm 5.2.13 and store the values of Xk|k and k|k . Backward Recursion: Compute the backward recursion, obtaining for each k the mean and covariance matrix of the smoothed estimate as Xk|n = Xk|k + k|k I + k|n k|k k|n = k|k k|k I + k|n k|k
1 1

(k|n k|n Xk|k ) , k|n k|k .

(5.122) (5.123)

5.2 Gaussian Linear State-Space Models

155

Proof. These two equations can be obtained exactly as in the proof of Lemma 5.2.20, replacing (y Bx)t 1 (y Bx) by (x )t 1 (x ) and applying the result with = Xk|k , = k|k , = k|n and = k|n . If k|n is invertible, (5.122) and (5.123) are easily recognized as the application 1 of Proposition 5.2.2 with B = I, Cov(V ) = k|n , and an equivalent observed 1 value of Y = k|n k|n . Remark 5.2.23. In the original work by Mayne (1966), the backward infor mation recursion is carried out on the parameters of k|n , as dened by (5.120), rather than on k|n . It is easily checked using (5.116)(5.119) that, except for this dierence of focus, Proposition 5.2.21 is equivalent to the Mayne (1966) formulassee also Kailath et al. (2000, Section 10.4) on this point. Of course, in the work of Mayne (1966), k|n has to be combined with the predictive distribution k|k1 rather than with the ltering distribution k , as k|n already incorporates the knowledge of the observation Yk . Proposition 5.2.21 and Algorithm 5.2.22 are here stated in a form that is compatible with our general denition of the forward-backward decomposition in Section 3.2. 5.2.6 Application to Marginal Filtering and Smoothing in CGLSSMs The algorithms previously derived for linear state-space models also have important implications for conditionally Gaussian linear state-space models (CGLSSMs). According to Denition 2.2.6, a CGLSSM is such that conditionally on {Ck }k0 , Wk+1 = A(Ck+1 )Wk + R(Ck+1 )Uk , Yk = B(Ck )Wk + S(Ck )Vk , where the indicator process {Ck }k0 is a Markov chain on a nite set C, with some transition matrix QC . We follow the general principle outlined in Section 4.2.3 and consider the computation of the posterior distribution of the indicator variables C0:k given the observations Y0:k , marginalizing with respect to the continuous component of the state W0:k . The key remarksee (4.11)is that one may evaluate the conditional distribution of Wk given the observations Y0:k1 and the indicator variables C0:k . For CGLSSMs, this distribution is Gaussian with mean Wk|k1 (C0:k ) and covariance k|k1 (C0:k )the dependence on the measurement, here Y0:k1 , is implicit and we emphasize only the dependence with respect to the indicator variables in the following. Both of these quantities may be evaluated using the Kalman lter recursion (Algorithm 5.2.13), which we briey recall here. Given Wk1|k1 (C0:k1 ) and k1|k1 (C0:k1 ), the ltered partial state estimator and the ltered partial state error covariance at time k 1, evaluate W0 N( , ) ,

156

5 Applications of Smoothing

the predicted partial state and the associated predicted partial state error covariance as Wk|k1 (C0:k ) = A(Ck )Wk1|k1 (C0:k1 ) ,
t t

(5.124)

k|k1 (C0:k ) = A(Ck )k1|k1 (C0:k1 )A (Ck ) + R(Ck )R (Ck ) . From these quantities, determine in a second step the innovation and the covariance of the innovation given the indicator variables,
k (C0:k )

= Yk B(Ck )Wk|k1 (C0:k ) ,


t t

(5.125)

k (C0:k ) = B(Ck )k|k1 (C0:k )B (Ck ) + S(Ck )S (Ck ) . In a third and last step, evaluate the ltered partial state estimation and ltered partial state error covariance from the innovation and the innovation covariance,
1 Kk (C0:k ) = k|k1 (C0:k )B(Ck )k (C0:k ) , Wk|k (C0:k ) = Wk|k1 (C0:k ) + Kk (C0:k ) k (C0:k ) ,

(5.126)

k|k (C0:k ) = {I Kk (C0:k )B(Ck )} k|k1 (C0:k ) . As a by-product of the above recursion, one may also determine the conditional probability of Ck given the history of the indicator process C0:k1 and the observations Y0:k up to index k. Indeed, by Bayes rule, P (Ck = c | C0:k1 , Y0:k ) = P (Ck = c | C0:k1 , Y0:k ) L (Y0:k | C0:k1 , Ck = c)QC (Ck1 , c) , (5.127) L (Y0:k | C0:k1 , Ck = c )QC (Ck1 , c ) where L denotes the conditional likelihood of the observations given the indicator variables. Both the numerator and the denominator can be evaluated, following Remark (5.2.10), by applying the Kalman recursions (5.125)(5.126) for the two values Ck = c and Ck = c . Using (5.62) and (5.127) then yields P (Ck = c | C0:k1 , Y0:k ) |k (C0:k1 , c)|1/2 exp 1 2
1 t k (C0:k1 , c)k (C0:k1 , c) k (C0:k1 , c)

QC (Ck1 , c) , (5.128)

where the normalization factor may be evaluated by summation of (5.128) over all c C. At the expense of computing r times (5.125)(5.126), where r is the cardinality of C, it is thus possible to evaluate the conditional distribution of Ck given the history of the indicator process C0:k1 , where the continuous variables W0:k have been fully marginalized out. To be applicable however, (5.128) implies that the history of the indicator process before index k be exactly known. This is hardly conceivable except in simulation-based

5.2 Gaussian Linear State-Space Models

157

smoothing approximations where one imputes values of the unknown sequence of indicators {Ck }k0 . The application of (5.125)(5.126) and (5.128) for this purpose will be fully described in Chapter 8. A similar remark holds regarding the computation of the conditional distribution of Ck given both the history C0:k1 and future Ck+1:n of the indicator sequence and the corresponding observations Y0:n . The principle that we follow here is an instance of the generalized forward-backward decomposition (4.13) which, in the case of CGLSSMs, amounts to adapting Algorithm 5.2.22 as follows. 1. Use the backward information recursion of Proposition 5.2.21 to compute k|n (Ck+1:n ) and k|n (Ck+1:n )2 . 2. Use the ltering recursion of Algorithm 5.2.13restated above as (5.124) (5.126)to compute Wk1|k1 (C0:k1 ) and k1|k1 (C0:k1 ). 3. For all values of c C, evaluate k (C0:k1 , c), k (C0:k1 , c), as well as Wk|k (C0:k1 , c), k|k (C0:k1 , c) using one step of Algorithm 5.2.13. Then apply (5.122) and (5.123) to obtain Wk|n (C0:k1 , c, Ck+1:n ) and k|n (C0:k1 , c, Ck+1:n ). The most dicult aspect then consists in computing the likelihood of the observations Y0:n given the indicator sequence, where all indicators variables but ck are xed and ck takes all possible values in C. The lemma below provides a simple formula for this task. Lemma 5.2.24. Assume that k (ck ), k (ck ), Wk|k (ck ), k|k (ck ), Wk|n (ck ), and k|n (ck ) are available, where we omit dependence with respect to the indicator variables cl for l = k, which is implicit in the following. The likelihood of the observations Y0:n given the indicator sequence C0:n = c0:n is then proportional to the quantity 1 1 1 exp t (ck )k (ck ) k (ck ) 1/2 2 k |k (ck )| 1 1 t 1 exp Wk|k (ck )k|k (ck )Wk|k (ck )) 1/2 2 |k|k (ck )| 1 1 t 1 exp Wk|n (ck )k|n (ck )Wk|n (ck )) 2 |k|n (ck )|1/2
1

, (5.129)

where the proportionality constant does not depend on the value of ck . Before actually proving this identity, we give a hint of the fundamental argument behind (5.129). If X and Y are jointly Gaussian variables (with non-singular covariance matrices), Bayes rule implies that
We do not repeat Proposition 5.2.21 with the notations appropriate for CGLSSMs as we did for (5.124)(5.126).
2

158

5 Applications of Smoothing

p(x|y) =

p(y|x)p(x) . p(y|x )p(x ) dx

In particular, the denominator on the right-hand side equals p(y|x)p(x)/p(x|y) for any value of x. For instance, in the linear model of Proposition 5.2.2, applying this identity for x = 0 yields p(Y |x)p(x) dx 1 1 1 exp Y t V Y 1/2 2 |V | 1 1 1 exp t X X 1/2 2 X |X |
def

1 |X|Y |1/2

1 exp t X|Y X|Y 2 X|Y


def

, (5.130)

where X|Y = E(X|Y ) and X|Y = Cov(X|Y ) and constants have been omitted. It is tedious but straightforward to check from (5.16) and (5.17) using the matrix inversion lemma that (5.130) indeed coincides with what we know to be the correct result: p(Y |x)p(x) dx = p(Y ) 1 1 exp (Y BX )t (V + BX B t )1 (Y BX ) . 2 |V + BX B t |1/2 Equation (5.130) is certainly not the most ecient way of computing p(Y ) but it is one that does not necessitate any other knowledge than that of the prior p(x), the conditional p(y|x), and the posterior p(x|y). Lemma 5.2.24 will now be proved by applying the same principle to the conditional smoothing distribution in a CGLSSM. Proof (Conditional Smoothing Lemma). The forward-backward decomposition provides a simple general expression for the likelihood of the observations Y0:n in the form Ln = k (dw)k|n (w) (5.131)

for any k = 0, . . . , n. Recall that our focus is on the likelihood of the observations conditional on a given sequence of indicator variables C0:n = c0:n , and more precisely on the evaluation of the likelihood for all values of ck in C, the other indicator variables cl , l = k, being held xed. In the following, every expression should be understood as being conditional on C0:n = c0:n , where only the dependence with respect to ck is of interest (terms that do not depend on the value of ck will cancel out by normalization). This being said, (5.131) may be rewritten as L(ck ) = n k1 (dwk1 )Q(ck ) (wk1 , dwk )gk k (wk )k|n (wk )
(c )

(5.132)

5.2 Gaussian Linear State-Space Models

159

using the forward recursion (3.17), where the superscript (ck ) is used to highlight quantities that depend on this variable. Because the rst term of the integrand does not depend on ck , it may be replaced by its normalized version k1 to obtain L(ck ) n k1 (dwk1 )Q(ck ) (wk1 , dwk )gk k (wk )k|n (wk ) ,
(c )

(5.133)

where the proportionality constant does not depend on ck . Now, using the prediction and ltering relations (see Proposition 3.2.5 and Remark 3.2.6), the right-hand side of (5.133) may be rewritten as the product
k k|k1 (dw)gk k (w)

(c )

(c )

k k (dw)k|n (w) .

(c )

(5.134)

Finally note that in the case of conditionally Gaussian linear state-space models: (i) the rst integral in (5.134) may be computed from the innovation k as the rst line of (5.129)a remark that was already used in obtaining (5.128); (c ) (ii) k k is a Gaussian probability density function with parameters Wk|k (ck ) and k|k (ck ); (iii) k|n is a Gaussian likelihood dened, up to a constant, by the information parameters k|n and k|n ;
k (iv) k|n (dw) =

(c )

k k (dw)k|n (w) k k (dw )k|n (w )


(c )

(c )

is the Gaussian distribution with parameters Xk|n and k|n given by (5.122) and (5.123), respectively. The last two factors of (5.129) are now easily recognized as an instance of (5.130) applied to the second integral term in (5.134), where the factor k|n (0) has been ignored because it does not depend on the value of ck . Note that as a consequence, the fact that k|n and k|n dene k|n up to an unknown constant only is not detrimental. Once again, the context in which Lemma 5.2.24 will be useful is not entirely obvious at this point and will be fully discussed in Section 6.3.2 when reviewing Monte Carlo methods. From the proof of this result, it should be clear however that (5.129) is deeply connected to the smoothing approach discussed in Section 5.2.5 above.

6 Monte Carlo Methods

This chapter takes a dierent path to the study of hidden Markov models in that it abandons the pursuit of closed-form formulas and exact algorithms to cover instead simulation-based techniques. This change of perspective allows for a much broader coverage of HMMs, which is not restricted to the specic cases discussed in Chapter 5. In this chapter, we consider sampling the unknown sequence of states X0 , . . . , Xn conditionally on the observed sequence Y0 , . . . Yn . In subsequent chapters, we will also use simulation to do inference about the parameters of HMMs, either using simulation-based stochastic algorithms that optimize the likelihood (Chapter 11) or in the context of Bayesian joint inference on the states and parameters (Chapter 13). But even the sole simulation of the missing states may prove itself a considerable challenge in complex settings like continuous state-space HMMs. Therefore, and although these dierent tasks are presented in separate chapters, simulating hidden states in a model whose parameters are assumed to be known is certainly not disconnected from parameter estimation to be discussed in Chapters 11 and 13.

6.1 Basic Monte Carlo Methods


Although we will not go into a complete description of simulation methods in this book, the reader must be aware that recent developments of these methods have oered new opportunities for inference in complex models like hidden Markov models and their generalizations. For a more in-depth covering of these simulation methods and their implications see, for instance, the books by Chen and Shao (2000), Evans and Swartz (2000), Liu (2001), and Robert and Casella (2004).

162

6 Monte Carlo Methods

6.1.1 Monte Carlo Integration Integration, in general, is most useful for computing probabilities and expectations. Of course, when given an expectation to compute, the rst thing is to try to compute the integral analytically. When analytic evaluation is impossible, numerical integration is an option. However, especially when the dimension of the space is large, numerical integration can become numerically involved: the number of function evaluations required to achieve some degree of approximation increases exponentially in the dimension of the problem (this is often called the curse of dimensionality). Thus it is useful to consider other methods for evaluating integrals. Fortunately, there are methods that do not suer so directly from the curse of dimensionality, and Monte Carlo methods belong to this group. In particular, recall that, by the strong law of large numbers, if 1 , 2 , . . . is a sequence of i.i.d. X-valued random variables with common probability distribution , then the estimator
N

N (f ) = N 1 MC
i=1

f ( i )

converges almost surely to (f ) for all -integrable functions f . Obviously this Monte Carlo estimate of the expectation is not exact, but generating a suciently large number of random variables can render this approximation error arbitrarily small, in a suitable probabilistic sense. It is even possible to assess the size of this error. If (|f |2 ) = |f (x)|2 (dx) < ,

MC the central limit theorem shows that N N (f ) (f ) has an asymptotic normal distribution, which can be used to construct asymptotic condence regions for (f ). For instance, if f is real-valued, a condence interval with asymptotic probability of coverage is given by N (f ) c N 1/2 N (, f ), N (f ) + c N 1/2 N (, f ) , MC MC where
def 2 N (, f ) = N

(6.1)

1 i=1

f ( i ) N (f ) MC

and c is the /2 quantile of the standard Gaussian distribution. If generating a sequence of i.i.d. samples from is practicable, one can make the condence interval as small as desired by increasing the sample size N . When compared to univariate numerical integration and quasi-Monte Carlo methods (Niederreiter, 1992), the convergence rate is not fast. In practical terms, (6.1) implies that an extra digit of accuracy on the approximation requires 100 times as many replications, where the rate 1/ N cannot be improved. On the other

6.1 Basic Monte Carlo Methods

163

hand, it is possible to derive methods to reduce the asymptotic variance of the Monte Carlo estimate by allowing a certain amount of dependence among the random variables 1 , 2 , . . . Such methods include antithetic variables, control variates, stratied sampling, etc. These techniques are not discussed here (see for instance Robert and Casella, 2004, Chapter 4). A remarkable fact however is that the rate of convergence of 1/ N in (6.1) remains the same whatever the dimension of the space X is, which leaves some hope of eectively using the Monte Carlo approach in large-dimensional settings. 6.1.2 Monte Carlo Simulation for HMM State Inference 6.1.2.1 General Markovian Simulation Principle We now turn to the specic task of simulating the unobserved sequence of states in a hidden Markov model, given some observations. The main result has already been discussed in Section 3.3: given some observations, the unobserved sequence of states constitutes a non-homogeneous Markov chain whose transition kernels may be evaluated, either from the backward functions for the forward chain (with indices increasing as usual) or from the forward measuresor equivalently ltering distributionsfor the backward chain (with indices in reverse order). Schematically, both available options are rather straightforward to implement. Backward Recursion/Forward Sampling: First compute (and store) the backward functions k|n by backward recursion, for k = n, n 1 down to 0 (Proposition 3.2.1). Then, simulate Xk+1 given Xk from the forward transition kernels Fk|n specied in Denition 3.3.1. Forward Recursion/Backward Sampling: First compute and store the forward measures ,k by forward recursion, according to Proposition 3.2.1. As an alternative, one may evaluate the normalized versions of the forward measures, which coincide with the ltering distributions ,k , following Proposition 3.2.5. Then Xk is simulated conditionally on Xk+1 (starting from Xn ) according to the backward transition kernel B,k dened by (3.38). Despite its beautiful simplicity, the method above will obviously be of no help in cases where an exact implementation of the forward-backward recursion is not available. 6.1.2.2 Models with Finite State Space In the case where the state space X is nite, the implementation of the forwardbackward recursions is feasible and has been fully described in Section 5.1. The second method described above is a by-product of Algorithm 5.1.3. Algorithm 6.1.1 (Markovian Backward Sampling). Given the stored values of 0 , . . . , n computed by forward recursion according to Algorithm 5.1.1, do the following.

164

6 Monte Carlo Methods

Final State: Simulate Xn from n . Backward Simulation: For k = n 1 down to 0, compute the backward transition kernel according to (5.7) and simulate Xk from Bk (Xk+1 , ). The numerical complexity of this sampling algorithm is thus equivalent to that of Algorithm 5.1.3, whose computational cost depends most importantly on the cardinal r of X and on the diculty of evaluating the function g(x, Yk ) for all x X and k = 0, . . . , n (see Section 5.1). The backward simulation pass in Algorithm 6.1.1 is simpler than its smoothing counterpart in Algorithm 5.1.3, as one only needs to evaluate Bk (Xk+1 , ) for the simulated value of Xk+1 rather than Bk (i, j) for all (i, j) {1, . . . , r}2 . 6.1.2.3 Gaussian Linear State-Space Models As discussed in Section 5.2, Rauch-Tung-Striebel smoothing (Algorithm 5.2.4) is the exact counterpart of Algorithm 5.1.3 in the case of Gaussian linear statespace models. Not surprisingly, to obtain the smoothing means and covariance matrices in Algorithm 5.2.4, we explicitly constructed the backward Gaussian transition density, whose mean and covariance are given by (5.23) and (5.24), respectively. We simply reformulate this observation in the form of an algorithm as follows. Algorithm 6.1.2 (Gaussian Backward Markovian State Sampling). Assume that the ltering moments Xk|k and k|k have been computed using Proposition 5.2.3. Then do the following. Final State: Simulate Xn N(Xn|n , n|n ). Backward Simulation: For k = n 1 down to 0, simulate Xk from a Gaussian distribution with mean and covariance matrix given by (5.23) and (5.24), respectively. The limitations discussed in the beginning of Section 5.2.2 concerning RTS smoothing (Algorithm 5.2.4) also apply here. In some models, Algorithm 6.1.2 is far from being computationally ecient (Frhwirth-Schnatter, 1994; Carter u and Kohn, 1994). With these limitations in mind, De Jong and Shephard (1995) described a sampling algorithm inspired by disturbance (or BrysonFrazier) smoothing (Algorithm 5.2.15) rather than by RTS smoothing. The method of De Jong and Shephard (1995) is very close to Algorithm 5.2.15 and proceeds by sampling the disturbance vectors Uk backwards (for k = n 1, . . . , 0) and then the initial state X0 , from which the complete sequence X0:n may be obtained by repeated applications of the dynamic equation (5.11). Because the sequence of disturbance vectors {Uk }k=n1,...,0 does not however have a backward Markovian structure, the method of De Jong and Shephard (1995) is not a simple by-product of disturbance smoothing (as was the case

6.1 Basic Monte Carlo Methods

165

for Algorithms 5.2.4 and 6.1.2). Durbin and Koopman (2002) described an approach that is conceptually simpler and usually about as ecient as the disturbance sampling method of De Jong and Shephard (1995). The basic remark is that if X and Y are jointly Gaussian variables, the conditional distribution of X given Y is Gaussian with mean vector E [X | Y ] and covariance matrix Cov(X | Y ), where Cov(X | Y ) equals Cov(XE[X | Y ]) and, in addition, does not depend on Y (Proposition 5.2.2). In particular, if (X , Y ) is another independent pair of Gaussian distributed random vectors with the same (joint) distribution, X E[X | Y ] and X E[X | Y ] are independent and both are N (0, Cov(X | Y )) distributed. In summary, to simulate from the distribution of X given Y , one may 1. Simulate an independent pair of Gaussian variables (X , Y ) with the same distribution as (X, Y ) and compute X E[X | Y ]; 2. Given Y , compute E[X | Y ], and set = E[X | Y ] + X E[X | Y ] . This simulation approach only requires the ability to compute conditional expectations and to simulate from the prior joint distribution of X and Y . When applied to the particular case of Gaussian linear state-space models, this general principle yields the following algorithm. Algorithm 6.1.3 (Sampling with Dual Smoothing). Given a Gaussian linear state-space model following (5.11)(5.12) and observations Y0 , . . . , Yn , do the following.
1. Simulate a ctitious independent sequence {Xk , Yk }k=0,...,n of both states and observations using the model equations. 2. Compute {Xk|n }k=0,...,n and {Xk|n }k=0,...,n using Algorithm 5.2.15 for the two sequences {Yk }k=0,...,n and {Yk }k=0,...,n . Then {Xk|n + Xk Xk|n }k=0,...,n is distributed according to the posterior distribution of the states given Y0 , . . . , Yn .

Durbin and Koopman (2002) list a number of computational simplications that are needed to make the above algorithm competitive with the disturbance sampling approach. As already noted in Remark 5.2.16, the backward recursion of Algorithm 5.2.15 may be greatly simplied when only the best linear estimates (and not their covariances) are to be computed. During the forward Kalman prediction recursion, it is also possible to save on computations by noting that all covariance matrices (state prediction error, innovation) will be common for the two sequences {Yk } and {Yk }, as these matrices do not depend on the observations but only on the model. The same remark should be used when the purpose is not only to simulate one sequence but N sequences of states conditional on the same observations, which will be the standard situation in a Monte Carlo approach. Further improvement can be

166

6 Monte Carlo Methods

gained by carrying out simultaneously the simulation and Kalman prediction tasks, as both of them are implemented recursively (Durbin and Koopman, 2002).

6.2 A Markov Chain Monte Carlo Primer


As we have seen above, the general task of simulating the unobserved X0:n given observations Y0:n is non-trivial except when X in nite or the model is a Gaussian linear state-space model. In fact, in such models, analytic integration with respect to (low-dimensional marginals of) the conditional distribution of X0:n given observations is most often feasible, whence there is generally no true need for simulation of the unobserved Markov chain. The important and more dicult challenge is rather to explore methods to carry out this task in greater generality, and this is the object of the current section. We start by describing the accept-reject algorithm, which is a general approach to simulation of i.i.d. samples from a prescribed distribution, and then turn to so-called Markov chain Monte Carlo methods, which are generally more successful in large-dimensional settings. 6.2.1 The Accept-Reject Algorithm For specic distributions such as the Gaussian, Poisson, or Gamma distributions, there are ecient tailor-made simulation procedures; however, we shall not discuss here the most basic (but nonetheless essential) aspects of random variate generation for which we refer, for instance, to the books by Devroye (1986), Ripley (1987), or Gentle (1998). We are rather concerned with methods that can provide i.i.d. samples from any pre-specied distribution , not just for specic choices of this distribution. It turns out that there are only a limited number of options for this task, which include the accept-reject algorithm discussed here and the sampling importance resampling approach to be discussed in Section 7.1 (although the latter only provides an approximate i.i.d. sample). The accept-reject algorithm, rst described by von Neumann, is important both for its direct applications and also because its principle is at the core of many of the more advanced methods to be discussed in the following (for general references on the accept-reject method, see Devroye, 1986, Chapter 2, Ripley, 1987, p. 6062, or Robert and Casella, 2004, Chapter 2). It is easier to introduce the key concepts using probability densities, and we assume that has a density with respect to a measure ; because this assumption will be adopted all through this section, we shall indeed use the notation for this density as well. The key requirement of the method is the availability of another probability density function (with respect to ) r whose functional form is known and from which i.i.d. sampling is readily feasible. We also

6.2 A Markov Chain Monte Carlo Primer

167

target density

envelope

Mr(x)

(x)

Fig. 6.1. Illustration of the accept-reject method. Random points are drawn uniformly under the bold curve and rejected if the ordinate exceeds (x) (dashed curve).

assume that for some constant M > 1, M r(x) (x), for all x X, as illustrated by Figure 6.1. Proposition 6.2.2 below asserts that abscissas of i.i.d. random points in X R+ that are generated uniformly under the graph of (x) are distributed according to . Of course, it is not easier to sample uniformly under the graph of (x) in X R+ than it is to sample directly from , but one may instead sample uniformly under the graph of the envelope M r(x) and accept only those samples that fall under the graph of . To do this, rst generate a candidate, say according to the density r and compute () as well as the height of the envelope M r(). A uniform U([0, 1]) random variable U is then generated independently from , and the pair is accepted if U M r() (). In case of rejection, the whole procedure is started again until one eventually obtains a pair , U which is accepted. The algorithm is summarized below, Algorithm 6.2.1 (Accept-Reject Algorithm). Repeat: Generate two independent random variables: r and U U([0, 1]). Until: U ()/(M r()). The correctness of the accept-reject method can be deduced from the following two simple results. Proposition 6.2.2. Let be a random variable with density with respect to a measure on X and U be an independent real random variable uniformly

168

6 Monte Carlo Methods

distributed on the interval [0, M ]. Then the pair (, U ()) of random variables is uniformly distributed on S,M = (x, u) X R+ : 0 < u < M (x) ,

with respect to Leb , where Leb denotes Lebesgue measure. Conversely, if a random vector (, U ) of X R+ is uniformly distributed on S,M , then admits as marginal probability density function. Proof. Obviously, if Proposition 6.2.2 is to be true for some value of M0 , then both claims also hold for all values of M > 0 simply by scaling the ordinate by M/M0 . In the following, we thus consider the case where M equals one. For the rst statement, take a measurable subset B S,1 and let Bx denote the section of B in x, that is, Bx = {u : (x, u) B}. Then P {(, U ()) B} =
xX uBx

1 Leb (du)(x) (dx) = (x)

Leb (du) (dx) .


B

For the second statement, consider a measurable subset A X and set A = + {(x, u) A R : 0 u (x)}. Then P( A) = P (, U ) A = Leb (du) (dx) = Leb (du) (dx) S,1
A

(x) (dx) .
A

Lemma 6.2.3. Let V1 , V2 , . . . be a sequence of i.i.d. random variables taking values in a measurable space (V, V) and B V a set such that P(V1 B) = p > 0. The integer-valued random variable = inf {k 1, Vk B} (with the convention that inf = ) is geometrically distributed with parameter p, i.e., for all i 0, P( = i) = (1 p)i1 p . (6.2) The random variable V = V 1< is distributed according to P(V A) = Proof. First note that P( = i) = P(V1 B, . . . , Vi1 B, Vi B) = (1 p)i1 p , showing (6.2), which implies in particular that the waiting time is nite with probability one. For A V, P(V A B) . p (6.3)

6.2 A Markov Chain Monte Carlo Primer

169

P(V A) =
i=1

P(V1 B, . . . , Vi1 B, Vi A B) (1 p)i1 P(V1 A B)


i=1

= P(V1 A B)

1 . 1 (1 p)

Hence by Proposition 6.2.2, the intermediate pairs (i , Ui ) generated in Algorithm 6.2.1 are such that (i , M Ui r(i )) are uniformly distributed under the graph of M r(x). By Lemma 6.2.3, the accepted pair (, U ) is then uniformly distributed under the graph of (x) and, using Proposition 6.2.2, is marginally distributed according to . The probability p of acceptance is equal to P U1 (1 ) M r(1 ) = P {(1 , M U1 r(1 )) S,M } = (x) (dx) 1 = . M M r(x) (dx) X
X

Remark 6.2.4. The same algorithm can be applied also in cases where the densities or r are known only up to a constant. In that case, denote by C = (x) (dx) and Cr = r(x) (dx) the normalizing constants. The condition (x) M r(x) can be equivalently written as (x) M (Cr /C )(x), r where (x) = (x)/C and r(x) = r(x)/Cr denote the actual probability den sity functions. Because the two stopping conditions (x) M (Cr /C )(x) r and (x) M r(x) are equivalent, using the accept-reject algorithm with , r, and M amounts to using it with , r and M Cr /C . Therefore, the knowl edge of the normalizing constants C and Cr is not required. Note however that when either C or Cr diers from one, it is not possible anymore to interpret 1/M as the acceptance probability, and the actual acceptance probability C /(Cr M ) is basically unknown. In that case, the complexity of the accept-reject algorithm (typically how many intermediate draws are required on average before accepting a single one) cannot be determined in advance and may only be estimated empirically. Of course, the assumption (x) M r(x) puts some stringent constraints on the choice of the density r from which samples are drawn. The density r should have both heavier tails and sharper innite peaks than . The eciency of the algorithm is the ratio of the areas under the two graphs of (x) and M r(x), which equals 1/M . Therefore, it is essential to keep M as close to one as possible. The optimal choice of M for a given r is Mr = supxX (x)/r(x), as it maximizes the acceptance probability and therefore minimizes the average required computational eort. Determining a proposal density r such that Mr is small and evaluating Mr (or a tight upper bound for it) are the

170

6 Monte Carlo Methods

two key ingredients for practical application of the accept-reject method. In many situations, and especially in multi-dimensional settings, both of these tasks are often equally dicult (see Robert and Casella, 2004, for examples). 6.2.2 Markov Chain Monte Carlo The remarks above highlight that although accept-reject is often a viable approach in low-dimensional problems, it has serious drawbacks in largedimensional ones. Most fortunately, there exists a class of alternatives that allow us to handle arbitrary distributions, on large-dimensional sets, without a detailed study of them. This class of simulation methods is called Markov chain Monte Carlo (or MCMC) methods, as they rely on Markov-dependent simulations. It should be stressed at this point that the Markov in Markov chain Monte Carlo has nothing to do with the Markov in hidden Markov models. These MCMC methods are generic/universal and, while they naturally apply in HMM settings, they are by no means restricted to those. The original MCMC algorithm was introduced by Metropolis et al. (1953) for the purpose of optimization on a discrete state space, in connection with statistical physics: the paper was actually published in the Journal of Chemical Physics. The Metropolis algorithm was later generalized by Hastings (1970) and Peskun (1973, 1981) to statistical simulation. Despite several other papers that highlighted its usefulness in specic settings (see, for example, Geman and Geman, 1984; Tanner and Wong, 1987; Besag, 1989), the starting point for an intensive use of MCMC methods by the statistical community can be traced to the presentation of the Gibbs sampler by Gelfand and Smith (1990). The MCMC approach is now well-known in many scientic domains, which include physics and statistics but also biology, engineering, etc. Returning for a while to the general case where is a distribution, the tenet of MCMC methods is the remark that simulating an i.i.d. sequence 1 , . . . , n with common probability distribution is not the only way to approximate in the sense of being able to approximate the expectation of any -integrable function f . In particular, one may consider Markov-dependent sequences { i }i1 rather than i.i.d. sequences. The ergodic theorem for Markov chains asserts that, under suitable conditions (discussed in Section 14.2.6 of Chapter 14), N MCMC (f ) 1 = N
N

f ( i )
i=1

(6.4)

is a reasonable estimate of the expectation of f under the stationary distribution of the chain { i }i1 , for all integrable functions f . In addition, the rate of convergence is identical to that of standard (independent) Monte Carlo, that is, 1/ N . To make this idea practicable however requires simulation schemes that guarantee (i) that simulating the chain { i }i1 given an arbitrary initial value 1 is an easily implementable process;

6.2 A Markov Chain Monte Carlo Primer

171

(ii) that the stationary distribution of { i }i1 indeed coincides with the desired distribution ; (iii) that the chain { i }i1 satises conditions needed to guarantee the convergence towards , irrespectively of the initial value 1 . We will introduce below two major classes of such algorithms, and we refer the reader to Robert and Casella (2004) and Roberts and Tweedie (2005) for an appropriate detailed coverage of these MCMC methods. In this context, the specic distribution of interest is generally referred to as the target distribution. To keep the presentation simple, we will also assume that all distributions and conditional distributions arising are dominated by a common measure . The target distribution in particular is assumed to have a probability density function, as above denoted by , with respect to . 6.2.3 Metropolis-Hastings The (very limited) assumption underlying the Metropolis-Hastings algorithm, besides the availability of , is that one can simulate from a transition density function r (with respect to the same measure ), called the proposal distribution, whose functional form is also known. Algorithm 6.2.5 (The Metropolis-Hastings Algorithm). Simulate a sequence of values { i }i1 , which forms a Markov chain on X, with the following mechanism: given i , 1. Generate r( i , ); 2. Set i with probability ( i , ) def () r(, ) 1 = i+1 ( i ) r( i , ) = i otherwise The initial value 1 may be chosen arbitrarily. In practice, (6.5) is carried out by drawing an independent U([0, 1]) variable U and accepting only if U A( i , ), where A( i , ) = () r(, i ) , ( i ) r( i , )

(6.5)

is generally referred to as the Metropolis-Hastings acceptance ratio. The reason for this specic choice of acceptance probability in (6.5), whose name follows from Metropolis et al. (1953) and Hastings (1970), is that the associated Markov chain {t } satises the detailed balance equation (2.12) discussed in Chapter 2. Proposition 6.2.6 (Reversibility of the Metropolis-Hastings Kernel). The chain { i }i1 generated by Algorithm 6.2.5 is reversible and is its stationary probability density function.

172

6 Monte Carlo Methods

Proof. The transition kernel K associated with Algorithm 6.2.5 is such that for a function f Fb (X), K(x, f ) = f (x ) [(x, x )r(x, x ) (dx ) + pR (x) x (dx )] ,

where pR (x) is the probability of remaining in the state x, given by pR (x) = 1 Hence f1 (x)f2 (x )(x) (dx) K(x, dx ) = f1 (x)f2 (x )(x)(x, x )r(x, x ) (dx) (dx ) + f1 (x)f2 (x)(x)pR (x) (dx) (6.6) (x, x )r(x, x ) (dx ) .

for all functions f1 , f2 Fb (X). According to (6.5), (x)(x, x )r(x, x ) = (x )r(x , x) (x)r(x, x ) , which is symmetric in x and x , and thus K satises the detailed balance condition (2.12), as we may swap the functions f1 and f2 in both terms on the right-hand side of (6.6). This implies in particular that is a stationary density for the kernel K. The previous result is rather weak as there is no guarantee that the chain { i }i1 indeed converges in distribution to , whatever the choice of the initialization 1 . We postpone the study of such questions until Chapter 14, where we show that such results can be obtained under weak additional conditions (see for instance Theorem 14.2.37). We refer to the books by Robert and Casella (2004) and Roberts and Tweedie (2005) for further discussion of convergence issues and focus, in the following, on the practical aspects of MCMC. Remark 6.2.7. An important feature of the Metropolis-Hastings algorithm is that it can be applied also when or r is known only through the ratio (x )/(x) or r(x , x)/r(x, x ). This allows the algorithm to be used without knowing the normalizing constants: evaluating and/or r only up to a constant scale factor, or even the ratio /r, is sucient to apply Algorithm 6.2.5. This fact is instrumental when the algorithm is to be used to simulate from posterior distributions in Bayesian models (see Chapter 13 for examples), as these distributions are most often dened though Bayes theorem as the product of the likelihood and the prior density, where the normalization is not computable (or else one would not consider using MCMC...).

6.2 A Markov Chain Monte Carlo Primer

173

In hidden Markov models, this feature is very useful for simulating from the posterior distribution of an unobservable sequence of states X0:n given the corresponding observations Y0:n . Indeed, the functional form of the conditional distribution of X0:n given Y0:n is given in (3.13), which is fully explicit except for the normalization factor L,n . For MCMC approaches, there is no point in trying to evaluate this normalization factor L,n , and it suces to know that the desired joint target distribution is proportional to
n

0:n|n (x0:n ) (x0 )g0 (x0 )


k=1

q(xk1 , xk )gk (xk ) ,

(6.7)

where we assume that the model is fully dominated in the sense of Denition (2.2.3) and hence that and q denote, respectively, a probability density function and a transition density function (with respect to ). The target distribution 0:n|n dened by (6.7) is thus perfectly suitable for MCMC simulation. We now consider two important classes of Metropolis-Hastings algorithms. 6.2.3.1 Independent Metropolis-Hastings A rst option for the choice of the proposal transition density function r(x, ) is to select a xedthat is, independent of xdistribution over X, like the uniform distribution if X is compact, or more likely some other distribution that is related to . This method, as rst proposed by Hastings (1970), appears to be an alternative to importance sampling and the accept-reject algorithms1 . To stress this special case, we denote the independent proposal density by rind (x). The Metropolis-Hastings acceptance probability then reduces to (x, x ) = (x )/rind (x ) 1. (x)/rind (x)

In particular, in the case of a uniform proposal rind , the acceptance probability is nothing but the ratio (x )/(x) (a feature shared with the random walk Metropolis-Hastings algorithm below). Intuitively, the transition from Xn = x to Xn+1 = x is accomplished by generating an independent sample from a proposal distribution rind and then thinning it down based on a comparison of the corresponding importance ratios (x)/rind (x) and (x )/rind (x ). One can notice the connection with the importance sampling method (see Section 7.1.1) in that the Metropolis-Hastings acceptance probability is also based on the importance weight ( )/rind ( ). A major dierence is
1 The importance sampling algorithm is conceptually simpler than MCMC methods. For coherence reasons however, the former will be discussed later in the book, when considering sequential Monte Carlo methods. Readers not familiar with the concept of importance sampling may want to go through Section 7.1.1 at this point.

174

6 Monte Carlo Methods

that importance sampling preserves all the simulations while the independent Metropolis-Hastings algorithm only accepts moving to new values with sufciently large importance ratio. It can thus be seen as an approximation to sampling importance resampling of Section 7.1.2 in that it also replicates the points with the highest importance weights. As reported in Mengersen and Tweedie (1996), the performance of an independent Metropolis-Hastings algorithm will vary widely, depending on, in particular, whether or not the importance ratio ()/rind () is bounded (which is also the condition required for applying the accept-reject algorithm). In Mengersen and Tweedie (1996, Theorem 2.1), it is proved that the algorithm is uniformly ergodic (see denition 4.3.15) if there exists > 0 such that xX: and then, for any x X, K n (x, )
TV

rind (x) (x)

=1,

(6.8)

(1 )n .

Conversely, if for every > 0 the set on which (6.8) fails has positive measure, then the algorithm is not even geometrically ergodic. The practical implication is that the chain may tend to get stuck in regions with low values of . This happens when the proposal has lighter tails than the target distribution. To ensure robust performance, it is thus advisable to let rind be a relatively heavy-tailed distribution (such as the t-distribution for example). Example 6.2.8 (Squared and Noisy Autoregression). Consider the following model where the hidden Markov chain is from a regular AR(1) model, Xk+1 = Xk + Uk with Uk N(0, 2 ), and where the observable is
2 Yk = Xk + Vk

with Vk N(0, 2 ). The conditional distribution of Xk given Xk1 , Xk+1 and Y0:n is, by Remark 6.2.7, equal to the conditional distribution of Xk given Xk1 , Xk+1 and Yk , with density proportional to exp 1 2 2 (xk xk1 )2 + (xk+1 xk )2 + 2 (yk x2 )2 k 2 . (6.9)

Obviously, the diculty with this distribution is the (yk x2 )2 term in the k exponential. A naive resolution of this diculty is to ignore the term in the proposal distribution, which is then a N(k , 2 ) distribution with k k = xk1 + xk+1 1 + 2 and 2 = k 2 . 1 + 2

6.2 A Markov Chain Monte Carlo Primer

175

Fig. 6.2. Illustration of Example 6.2.8. Top: plot of the last 500 realizations of the chain {i }i1 produced by the independent Metropolis-Hastings algorithm associated with the N(k , 2 ) proposal over 10,000 iterations. Bottom: histogram of k a chain of length 10,000 compared with the target distribution (normalized by numerical integration).

The ratio (x)/rind (x) is then equal to exp (yk x2 )2 /2 2 , which is bounded. k Figure 6.2 (bottom) shows how the Markov chain produced by Algorithm 6.2.5 does converge to the proper posterior distribution, even though the target is bimodal (because of the ambiguity on the sign of xt resulting from the square in the observation equation). Figure 6.2 (top) also illustrates the fact that, to jump from one mode to another, the chain has to remain in a given state for several iterations before jumping to the alternative modal region. When the ratio (x)/rind (x) is not bounded, the consequences may be very detrimental on the convergence of the algorithm, as shown by the following elementary counterexample. Example 6.2.9 (Cauchy Meets Normal). Consider a Cauchy C(0, 1) target distribution with a Gaussian N(0, 1) proposal. The ratio (x)/rind (x) is then exp{x2 /2}/(1 + x2 ), which is unbounded and can produce very high values. Quite obviously, the simulation of a sequence of normal proposals to achieve simulation from a Cauchy C(0, 1) distribution is bound to fail, as the normal distribution, whatever its scale, cannot reach the tails of the Cauchy distribution: this failure is illustrated in Figure 6.3. To stress the importance of this requirement (that the ratio (x)/rind (x) be bounded), it is important to remember that we can diagnose the failure in Figure 6.3 only because we are cheating and know what the target distribution is, including its normalization. In real practical uses of the method, it would be very dicult in such a case to detect that the sampling algorithm is not doing what it is expected to.

176

6 Monte Carlo Methods

Density 0.0
4

0.1

0.2

0.3

0.4

Fig. 6.3. Illustration of Example 6.2.9. Histogram of a independent MetropolisHastings chain of length 5,000, based on a N(0, 1) proposal, compared with the target C(0, 1) distribution.

6.2.3.2 Random Walk Metropolis-Hastings

Given that the derivation of an acceptable independent proposal becomes less realistic as the dimension of the problem increases, another option for the choice of r(x, ·) is to propose local moves around x with the hope that, by successive jumps, the Markov chain will actually explore the whole range of the target distribution. The most natural (and historically first) proposal in a continuous state space X is the random walk proposal,

\[ r(x, x') = h(x' - x) \;, \]

where h is a symmetric density. The Metropolis-Hastings acceptance probability is then

\[ \alpha(x, x') = \frac{\pi(x')}{\pi(x)} \wedge 1 \;, \]

due to the symmetry assumption on h. Once again, the chain {ξ^i}_{i≥1} thus visits each state x in proportion to π(x).

Example 6.2.10 (Squared and Noisy Autoregression, Continued). The conditional distribution of X_k given X_{k−1}, X_{k+1} and Y_k in (6.9) is generally bimodal as in Figure 6.2. For some occurrences of X_{k−1}, X_{k+1} and Y_k, the zone located in between the modes has a very low probability under the conditional distribution. If we use a Gaussian random walk, i.e., h a centered Gaussian density, with a scale that is too small, the random walk will never jump to the other mode. This is illustrated in Figure 6.4 for a scale of 0.1. Conversely, if the scale is sufficiently large, the corresponding Markov chain will explore both


Fig. 6.4. Illustration of Example 6.2.10. Same legend as Figure 6.2 but for a different outcome of (X_{t−1}, X_{t+1}, Y_t) and with the Markov chain based on a random walk with scale 0.1.

Fig. 6.5. Illustration of Example 6.2.10. Same legend and data set (X_{t−1}, X_{t+1}, Y_t) as Figure 6.4 but with the Markov chain based on a random walk with scale 0.5.

modes and give a satisfactory approximation of the target distribution, as shown by Figure 6.5 for a scale of 0.5. Comparing Figures 6.4 and 6.5 also confirms that a higher acceptance rate does not necessarily imply, by far, a better performance of the Metropolis-Hastings algorithm (in Figure 6.4, the acceptance rate is about 50% and it drops to 13% in the case of Figure 6.5). Especially with random walk proposals, it is normal to observe a fair amount of rejections when the algorithm is properly tuned. Even though the choice of a symmetric density h seems to offer fewer opportunities for misbehaving, there are two levels at which the algorithm may err: one is related to tail behavior, namely that the tail of h must be heavy enough if geometric convergence is to occur (Mengersen and Tweedie, 1996); and the


other is the scale of the random walk. From a theoretical point of view, note that the random walk Metropolis-Hastings kernel is never uniformly ergodic in unbounded state spaces X (Robert and Casella, 2004, Section 7.5). Depending on which scale is chosen, the Markov chain may be very slow to converge either because it moves too cautiously (if the scale is too small) or too wildly (if the scale is too large). Based on time-scaling arguments (i.e., continuous-time limits for properly rescaled random walk Metropolis-Hastings chains), Roberts and Rosenthal (2001) recommend setting the acceptance rate in the range 0.2 to 0.35, which can be used as a guideline to select the scale of the random walk. In cases similar to the one considered in Example 6.2.10, with well-separated modes, it is customary to observe that the best scaling of the proposal (in terms of the empirical correlation of the MCMC chain for instance) corresponds to an acceptance rate that is even lower than these numbers. Unexpected multimodality really is a very significant difficulty in this respect: if the target distribution has several separated modes that are not expected, a random walk with too small a scale can miss those modes without detecting a problem with convergence, as the exploration of the known modes may well be very satisfactory, as exemplified in Figure 6.4.
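As an illustration of these tuning considerations (ours, using an assumed bimodal toy target), the following sketch runs a Gaussian random walk Metropolis-Hastings sampler for several proposal scales so that the resulting acceptance rates can be compared with the guideline quoted above.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    # illustrative bimodal target: equal-weight mixture of N(-2, 1) and N(2, 1)
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def random_walk_mh(n_iter, scale, x0=0.0):
    x, n_accept = x0, 0
    chain = np.empty(n_iter)
    for i in range(n_iter):
        x_cand = x + scale * rng.standard_normal()     # symmetric proposal h
        # acceptance probability pi(x')/pi(x) ^ 1, evaluated on the log scale
        if np.log(rng.uniform()) < log_target(x_cand) - log_target(x):
            x, n_accept = x_cand, n_accept + 1
        chain[i] = x
    return chain, n_accept / n_iter

for scale in (0.1, 0.5, 2.4, 10.0):
    chain, rate = random_walk_mh(20_000, scale)
    print(f"scale {scale:5.1f}: acceptance rate {rate:.2f}, chain std {chain.std():.2f}")
```

With too small a scale the acceptance rate is high but the chain remains trapped near one mode; with too large a scale most moves are rejected; intermediate scales yield acceptance rates in the recommended range and an empirical spread close to that of the target.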

Figure 6.6 illustrates the performance of the algorithm in this setting. The graphical fit of the Cauchy density by the histogram is good but, if we follow Roberts and Tweedie (2005) and look at the chain in more detail, it appears that after 10,000 iterations the range of the chain is (−14.44, 15.57), which shows that the chain fails to explore in a satisfactory fashion the tails of the Cauchy distribution. In fact, the 99% quantile of the Cauchy C(0, 1) distribution is 31, implying that on average 200 points out of the 10,000 first values of the Markov chain should be above 31 in absolute value! Roberts and Tweedie (2005) show in essence that, when the density of the random walk has tails that are not heavy enough, the corresponding Markov chain is not geometrically ergodic. The two previous categories are the most common choices for the proposal density r, but they are by no means the only or best choices. For instance, in a large-dimension compact state space with a concentrated target distribution π, the uniform proposal is very inefficient in that it leads to a very low average acceptance probability; this translates, in practice, to the chain {ξ^i}_{i≥1} being essentially constant. Similarly, using the random walk proposal with a small scale parameter while the target is multimodal with a very low density in between the modes may result in the chain never leaving its initial mode.


Fig. 6.6. Illustration of Example 6.2.11. Histogram of the 10,000 first steps of a random walk Metropolis-Hastings Markov chain using a Gaussian proposal with scale 1 and Cauchy target distribution.

6.2.4 Hybrid Algorithms

Although the Metropolis-Hastings rule of Algorithm 6.2.5 is our first effective approach for constructing MCMC samplers, we already have a number of available options, as we may freely choose the proposal distribution r. A natural question to ask in this context is to know whether it is possible to build new samplers from existing ones. It turns out that there are two generic and easily implemented ways of combining several MCMC samplers into a new one, which we shall refer to as a hybrid sampler. The following lemma is easy to prove from the corresponding definitions of Chapters 2 and 14.

Lemma 6.2.12 (Hybrid Kernels). Assume that K_1, . . . , K_m are Markov transition kernels that all admit π as stationary distribution. Then

(a) K_syst = K_1 K_2 ⋯ K_m and
(b) K_rand = Σ_{i=1}^m α_i K_i, with α_i > 0 for i = 1, . . . , m and Σ_{i=1}^m α_i = 1,

also admit π as stationary distribution. If in addition K_1, . . . , K_m are reversible, K_rand also is reversible but K_syst need not be.

Both of these constructions are easily implemented in practice: in (a), each iteration of the hybrid sampler consists in systematically cycling through the m available MCMC kernels; in (b), at each iteration we first toss an m-ary coin with probability of turning up i equal to α_i and then apply the MCMC kernel K_i. The additional warning that K_syst may not be reversible (even if all the individual kernels K_i are) is not a problem per se. Reversibility is not a necessary condition for MCMC, it is only prevalent because it is easier to devise rules that enforce the (stronger) detailed balance condition. Note also


that it is always possible to induce reversibility by appropriate modifications of the cycling strategy. For instance, the symmetric combination K_syst K_rev with K_rev = K_m K_{m−1} ⋯ K_1 is easily checked to be reversible. In practice, it means that the cycle through the various available MCMC kernels K_i has to be done in descending and then ascending order. Regarding irreducibility, it is clear that the random scan kernel K_rand is guaranteed to be phi-irreducible if at least one of the kernels K_i is. For the systematic scan strategy, the situation is more complex and K_syst may fail to be phi-irreducible even in cases where all the individual kernels K_i are phi-irreducible (with a common irreducibility measure). A more useful remark is that if K_1, . . . , K_m all admit π as stationary distribution but are not phi-irreducible, meaning that they do not yet correspond to fully functional converging MCMC algorithms, there are cases where both K_syst and K_rand are phi-irreducible. It is thus possible to build viable sampling strategies from individual MCMC transitions that are not in themselves fully functional. The main application of this remark is to break large-dimensional problems into smaller ones by modifying only one part of the state at a time.

6.2.5 Gibbs Sampling

When the distribution of interest is multivariate, it may be the case that for each particular variable, its conditional distribution given all remaining variables has a simple form. This is in particular the case for models specified using conditional independence relations like HMMs and more general latent variable models. In this case, a natural MCMC algorithm is the so-called Gibbs sampler, which we now describe. Its name somewhat inappropriately stems from its use for the simulation of Gibbs Markov random fields by Geman and Geman (1984).

6.2.5.1 A Generic Conditional Algorithm

Suppose we are given a joint distribution with probability density function π on a space X such that x ∈ X may be decomposed into m components x = (x_1, . . . , x_m), where x_k ∈ X_k. If k is an index in {1, . . . , m}, we shall denote by x_k the kth component of x and by x_{−k} = {x_l}_{l≠k} the collection of remaining components. We further denote by π_k(·|x_{−k}) the conditional probability density function of X_k given {X_l}_{l≠k} and assume that simulation from this conditional distribution is feasible (for k = 1, . . . , m). Note that x_k is not necessarily scalar but may itself be vector-valued.

Algorithm 6.2.13 (Gibbs Sampler). Starting from an arbitrary initial state ξ^1, update the current state ξ^i = (ξ^i_1, . . . , ξ^i_m) to a new state ξ^{i+1} as follows.
For k = 1, 2, . . . , m: simulate ξ^{i+1}_k from π_k(· | ξ^{i+1}_1, . . . , ξ^{i+1}_{k−1}, ξ^i_{k+1}, . . . , ξ^i_m).


In other words, in the kth round of the cycle needed to simulate ξ^{i+1}, the kth component is updated by simulation from its conditional distribution given all other components (which remain fixed). This new value then supersedes the old one and is used in the subsequent simulation steps. A complete round of m conditional simulations is usually referred to as a sweep of the algorithm. Another representation of the Gibbs sampler is to break the complete cycle as a combination of m individual MCMC steps where only one of the m components is modified according to the corresponding conditional distribution. This approach is easily recognized as the combination of type (a), systematic cycling, in Lemma 6.2.12. Hence we know from Lemma 6.2.12 that the correct behavior of the complete cycle can be inferred from that of the individual updates. The next result is a first step in this direction.

Proposition 6.2.14 (Reversibility of Individual Gibbs Steps). Each of the m individual steps of the Gibbs sampler (Algorithm 6.2.13) is reversible and thus admits π as a stationary probability density function.

Proof. Consider the step that updates the kth component and denote by K_k the corresponding transition kernel. We can always write λ = λ_k ⊗ λ_{−k}, where λ_k and λ_{−k} are measures on X_k and X_{−k}, respectively, such that λ_k dominates π_k(·|x_{−k}) for all values of x_{−k} ∈ X_{−k}. With these notations,

\[ K_k(x, dx') = \delta_{x_{-k}}(dx'_{-k})\, \pi_k(x'_k | x_{-k})\, \lambda_k(dx'_k) \;. \]

Hence, for any functions f_1, f_2 ∈ F_b(X),

\[ \iint f_1(x) f_2(x') \pi(x)\, \lambda(dx)\, K_k(x, dx') = \iiint f_1(x) \pi(x)\, f_2(x_{-k}, x'_k)\, \pi_k(x'_k | x_{-k})\, \lambda_k(dx'_k)\, \lambda_k(dx_k)\, \lambda_{-k}(dx_{-k}) \;, \]

where (x_{−k}, x'_k) refers to the element u of X such that u_{−k} = x_{−k} and u_k = x'_k. Because π(x_{−k}, x_k) π_k(x'_k | x_{−k}) = π_k(x_k | x_{−k}) π(x_{−k}, x'_k), we may also write

\[ \iint f_1(x) f_2(x') \pi(x)\, \lambda(dx)\, K_k(x, dx') = \iiint f_2(x_{-k}, x'_k) \pi(x_{-k}, x'_k)\, f_1(x_{-k}, x_k)\, \pi_k(x_k | x_{-k})\, \lambda_k(dx_k)\, \lambda_k(dx'_k)\, \lambda_{-k}(dx_{-k}) \;, \]

which is the same expression as before where the roles of f_1 and f_2 have been exchanged, thus showing that the detailed balance condition (2.12) holds.

An insightful interpretation of Proposition 6.2.14 is that each step corresponds to a very special type of Metropolis-Hastings move where the acceptance probability is uniformly equal to 1, due to the choice of π_k(·|x_{−k}) as the proposal


distribution. However, Proposition 6.2.14 does not suffice to establish proper convergence of the Gibbs sampler, as none of the individual steps produces a phi-irreducible chain. Only the combination of the m moves in the complete cycle has a chance of producing a chain with the ability to visit the whole space X from any starting point. Of course, one can also adopt the combination of type (b) in Lemma 6.2.12 to obtain the random scan Gibbs sampler as opposed to the systematic scan Gibbs sampler, which corresponds to the solution exposed in Algorithm 6.2.13. We refer to Robert and Casella (2004) and Roberts and Tweedie (2005) for more precise convergence results pertaining to these variants of the Gibbs sampler.

One perspective that is somehow unique to Gibbs sampling is Rao-Blackwellization, named after the Rao-Blackwell theorem used in classical statistics (Lehmann and Casella, 1998) and recalled as Proposition A.2.5. It is in essence a variance reduction technique (see Robert and Casella, 2004, Chapter 4) that takes advantage of the conditioning abilities of the Gibbs sampler. If only a part of the vector x is of interest (as is often the case in latent variable models), say x_k, Rao-Blackwellization consists in replacing the empirical average

\[ \hat{\pi}_N^{\mathrm{MCMC}}(f) = \frac{1}{N} \sum_{i=1}^{N} f(\xi_k^i) \]

with

\[ \hat{\pi}_N^{\mathrm{RB}}(f) = \frac{1}{N} \sum_{i=1}^{N} \operatorname{E}\left[ f(\xi_k^i) \,\middle|\, \xi_{-k}^i \right] \;, \]

where {ξ^i}_{i≥1} denotes the chain produced by Algorithm 6.2.13. This is of course only feasible in cases where the integral of the function of interest f under π_k(·|x_{−k}) may be easily evaluated for all x ∈ X. In i.i.d. settings, π̂_N^MCMC(f) would be more variable than π̂_N^RB(f) by Proposition A.2.5. For Markov chain simulations {ξ^i}_{i≥1}, this is not necessarily the case, and it is only in specific situations (see Robert and Casella, 2004, Sections 9.3 and 10.4.3) that the latter estimate can be shown to be less variable. Another substantial benefit of Rao-Blackwellization is to provide an elegant method for the approximation of probability density functions of the different components of x. Indeed,

\[ \frac{1}{N} \sum_{i=1}^{N} \pi_k\left( x_k \,\middle|\, \xi_{-k}^i \right) \]

is unbiased and converges to the marginal density of the kth component, under the target distribution. If the conditional probability density functions are available in closed form, it is unnecessary (and inefficient) to use nonparametric density estimation methods such as kernel methods for postprocessing the output of Gibbs sampling. We now discuss a clever use of the Gibbs sampling principle, known as the slice sampler, which is of interest in its own right.
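Before moving on, here is a small sketch (ours) of Rao-Blackwellization on a toy two-component Gibbs sampler for a bivariate Gaussian with an assumed correlation rho, for which the full conditionals are N(rho x, 1 − rho²); it contrasts the plain and Rao-Blackwellized estimates of E[X_1] and accumulates the Rao-Blackwellized marginal density estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.8                     # assumed correlation of the bivariate normal target
n_iter = 20_000

x1, x2 = 0.0, 0.0
draws_x1 = np.empty(n_iter)
cond_means = np.empty(n_iter)
grid = np.linspace(-4, 4, 201)
density_rb = np.zeros_like(grid)
sd_cond = np.sqrt(1.0 - rho ** 2)

for i in range(n_iter):
    # systematic scan Gibbs sweep: both full conditionals are Gaussian here
    x1 = rho * x2 + sd_cond * rng.standard_normal()
    x2 = rho * x1 + sd_cond * rng.standard_normal()
    draws_x1[i] = x1
    cond_means[i] = rho * x2                       # E[X1 | X2 = x2]
    # Rao-Blackwellized marginal density estimate of X1: average of the
    # conditional densities pi_1(grid | x2) over the Gibbs output
    density_rb += np.exp(-0.5 * ((grid - rho * x2) / sd_cond) ** 2) / (
        sd_cond * np.sqrt(2 * np.pi))

density_rb /= n_iter
print("plain estimate of E[X1]:            ", draws_x1.mean())
print("Rao-Blackwellized estimate of E[X1]:", cond_means.mean())
```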


6.2.5.2 The Slice Sampler

Proposition 6.2.2 asserts that the bivariate random variable (X, U) whose distribution is uniform on

\[ S_\pi = \left\{ (x, u) \in \mathsf{X} \times \mathbb{R}^+ : 0 \leq u \leq \pi(x) \right\} \;, \]

is such that the marginal distribution of X is π. This observation is at the core of the accept-reject algorithm discussed in Section 6.2.1. We will use the letter U to denote uniform distributions on sets, writing, for instance, (X, U) ∼ U(S_π). From the perspective of MCMC algorithms, we can consider using a random walk on S_π to produce a Markov chain with stationary distribution equal to this uniform distribution on S_π. There are many ways of implementing a random walk on this set, but a natural solution is to go one direction at a time, that is, to move iteratively along the u-axis and then along the x-axis. Furthermore, we can use uniform moves in both directions; that is, starting from a point (x, u) in S_π, the move along the u-axis will correspond to the conditional distribution

\[ \mathrm{U}\left( \{ u' : u' \leq \pi(x) \} \right) \;, \tag{6.10} \]

resulting in a change from point (x, u) to point (x, u'), still in S_π, and then the move along the x-axis to the conditional distribution

\[ \mathrm{U}\left( \{ x' : \pi(x') \geq u' \} \right) \;, \tag{6.11} \]

resulting in a change from point (x, u') to point (x', u'). This set of proposals is the basis chosen for the original slice sampler of Damien and Walker (1996), Neal (1997) (published as Neal, 2003), and Damien et al. (1999).

Algorithm 6.2.15 (Slice Sampler). Starting from an arbitrary point (ξ^1, U^1) in S_π, simulate for i ≥ 1,
1. U^{i+1} ∼ U[0, π(ξ^i)];
2. ξ^{i+1} ∼ U(S(U^{i+1})), with S(u) = {x : π(x) ≥ u}.

The important point here is that Algorithm 6.2.15 is validated as a Gibbs sampling method, as steps 1 and 2 above are simply the conditional distributions of U and ξ associated with the joint distribution U(S_π). Obviously, this does not make the slice sampler a universal generator: in many settings, simulating from the uniform U(S(u)) is just as hard (if not impossible) as generating directly from π, and extensions are often necessary (Robert and Casella, 2004, Chapter 8). Still, this potential universality shows that Gibbs sampling does not only pertain to a special category of hierarchical models.

Example 6.2.16 (Single Site Conditional Distribution in Stochastic Volatility Model). To illustrate the slice sampler, we consider the stochastic


volatility model discussed in Example 1.3.13 whose state-space form is as follows:

\[ X_{k+1} = \phi X_k + \sigma U_k \;, \qquad Y_k = \beta \exp(X_k/2)\, V_k \;, \]

where {U_k}_{k≥0} and {V_k}_{k≥0} are independent standard Gaussian white noise processes. In this model, β² exp(X_k) is referred to as the volatility, and its estimation is one of the purposes of the analysis (see Example 1.3.13 for details). As in Example 6.2.8 above, we consider the conditional distribution of X_k given X_{k−1}, X_{k+1} and Y_k, whose transition density function π_k(x|x_{k−1}, x_{k+1}) is proportional to

\[ \exp\left( -\frac{(x - \phi x_{k-1})^2 + (x_{k+1} - \phi x)^2}{2\sigma^2} \right) \frac{1}{\exp(x/2)} \exp\left( -\frac{y_k^2}{2\beta^2 \exp(x)} \right) \;, \tag{6.12} \]

ignoring constants. In fact, terms that do not depend on x can be ignored as well, and we may complete the square (in x) to obtain

\[ \pi_k(x|x_{k-1}, x_{k+1}) \propto \exp\left( -\frac{1+\phi^2}{2\sigma^2}\, (x - \mu_k)^2 - \frac{y_k^2}{2\beta^2 \exp(x)} \right) \;, \]

where

\[ \mu_k = \frac{\phi (x_{k+1} + x_{k-1}) - \sigma^2/2}{1 + \phi^2} \;. \tag{6.13} \]

Defining

\[ \beta_k = \frac{y_k^2\, \sigma^2}{\beta^2 \exp(\mu_k)\, (1 + \phi^2)} \quad \text{and} \quad \alpha = \frac{1 + \phi^2}{2\sigma^2} \;, \tag{6.14} \]

π_k(x|x_{k−1}, x_{k+1}) is thus proportional to

\[ \exp\left( -\alpha\left[ (x - \mu_k)^2 + \beta_k \exp\{-(x - \mu_k)\} \right] \right) \;. \]

The parameter μ_k corresponds to a simple shift that poses no simulation problem. Hence, the general form of the conditional probability density function from which simulation is required is exp[−α{x² + β exp(−x)}] for positive values of α and β. Shephard and Pitt (1997) (among others) discuss an approach based on accept-reject ideas for carrying out this conditional simulation, but we may also use the slice sampler for this purpose. The second step of Algorithm 6.2.15 then requires simulation from the uniform distribution on the set

\[ S(u) = \left\{ x : \exp\left[ -\alpha\{x^2 + \beta \exp(-x)\} \right] \geq u \right\} = \left\{ x : x^2 + \beta \exp(-x) \leq \rho \right\} \;, \]

setting ρ = −α^{−1} log u. Now, while the inversion of x² + β exp(−x) = ρ is not possible analytically, the fact that this function is convex (for β > 0) and that the previous value of x belongs to the set S(u) help in solving this equation by numerical trial-and-error or more elaborate zero-finding algorithms.
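A minimal sketch (ours) of one slice sampling update for a density proportional to exp[−α{x² + β exp(−x)}], bracketing the two endpoints of S(u) around the current point and solving x² + β exp(−x) = ρ with a standard zero-finding routine; the values α = 5 and β = 1 are purely illustrative.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(3)
alpha, beta = 5.0, 1.0        # illustrative values of the two positive parameters

def h(x):
    # convex function whose level sets define the slice S(u)
    return x ** 2 + beta * np.exp(-x)

def slice_update(x):
    # Step 1 of Algorithm 6.2.15: draw u ~ U[0, pi(x)]; equivalently
    # rho = -log(u)/alpha = h(x) + Exp(1)/alpha
    rho = h(x) + rng.exponential() / alpha
    # Step 2: S(u) = {x' : h(x') <= rho} is an interval containing the current x
    # because h is convex; bracket each endpoint and solve h = rho numerically
    step, lo_brk = 1.0, x - 1.0
    while h(lo_brk) <= rho:
        step *= 2
        lo_brk = x - step
    lo = brentq(lambda z: h(z) - rho, lo_brk, x)
    step, hi_brk = 1.0, x + 1.0
    while h(hi_brk) <= rho:
        step *= 2
        hi_brk = x + step
    hi = brentq(lambda z: h(z) - rho, x, hi_brk)
    return rng.uniform(lo, hi)

x = 0.0
chain = np.empty(10_000)
for i in range(chain.size):
    x = slice_update(x)
    chain[i] = x
print("sample mean and std:", chain.mean(), chain.std())
```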


As pointed out by Neal (2003), there is also no need to solve this equation precisely, as knowledge of an interval that contains the set S(u) is enough to simulate from the uniform distribution on S(u): it then suffices to simulate candidates uniformly from the larger set and accept them only if they belong to S(u) (which is also the accept-reject method, but with a high acceptance rate that is controlled by the accuracy of the zero-finding algorithm). Figure 6.7 (top plot) shows that the fit between the histogram of 10,000 consecutive values produced by the slice sampler and the true distribution is quite satisfactory. In addition, the bottom plot shows that the autocorrelation between successive values of ξ^i is quite modest. This fast mixing of the one-dimensional slice sampler is an appealing feature that has been shown to hold under fairly general assumptions on the target distribution (Roberts and Rosenthal, 1998; Robert and Casella, 2004, Chapter 8).


Fig. 6.7. Illustration of Example 6.2.16. Top: histogram of a Markov chain produced by the slice sampler for α = 5 and β = 1 with target distribution in overlay. Bottom: correlogram with 95% confidence interval corresponding to the assumption of white noise.

6.2.6 Stopping an MCMC Algorithm

There is an intrinsic difficulty with using Markov chain Monte Carlo methods for simulation purposes in that, were we to stop the iterations too early,


we would still be influenced by the (arbitrary) starting value of the chain, and were we to stop the iterations too late, we would be wasting simulation time. In contrast with what happens for independent Monte Carlo, where (6.1) may be used to obtain confidence intervals, it is fairly difficult to estimate the accuracy of estimates derived from the MCMC sample because of the unknown correlation structure of the simulated ξ^i. Apart from often useful graphical diagnostics (trace of the samples, correlograms, comparison of histograms obtained with different starting points...), there exist (more or less) empirical rules that provide hints on when an MCMC sampler should be stopped. A branch of MCMC, known as perfect sampling, corresponds to a refinement of these rules in which the aim is to guarantee that the Markov chain, when observed at appropriate times, is exactly distributed from the stationary distribution. Not surprisingly, these methods are very difficult to devise and equally costly to implement. Another direction, generally referred to as computable bounds, consists in obtaining bounds on the convergence speed of MCMC-generated Markov chains. When available, such results are very powerful, as they do not require any empirical estimation, and the number of required MCMC simulations may be calibrated beforehand. Of course, the drawback here is that for complex samplers, typically hybrid samplers that incorporate several different MCMC sampling steps, such results are simply not available (Robert and Casella, 2004).

6.3 Applications to Hidden Markov Models


This section describes methods that may be used to simulate the unobservable sequence of states X_{0:n} given the corresponding observations Y_{0:n} in HMMs for which the direct (independent) Monte Carlo simulation methods discussed in Section 6.1.2 are not applicable. We start from the most generic and easily implementable approaches, in which each individual hidden state X_k is simulated conditionally on all X_l except itself. We then move to a more specific sampling technique that takes advantage of the structure found in conditionally Gaussian linear state-space models (see, in particular, Definition 2.2.6 and Sections 4.2.3 and 5.2.6).

6.3.1 Generic Sampling Strategies

6.3.1.1 Single Site Sampling

We now formalize an argument that was underlying Examples 6.2.8 and 6.2.16. Starting from the joint conditional distribution of X_{0:n} given Y_{0:n} defined (up to a proportionality constant) by (6.7), the conditional probability density function of a single variable in the hidden chain, X_k say, given Y_{0:n} and its two neighbors X_{k−1} and X_{k+1} is such that


\[ \phi_{k-1:k+1|n}(x_k | x_{k-1}, x_{k+1}) \propto \phi_{k-1:k+1|n}(x_{k-1}, x_k, x_{k+1}) \propto q(x_{k-1}, x_k)\, q(x_k, x_{k+1})\, g_k(x_k) \;. \tag{6.15} \]

At the two endpoints k = 0 and k = n, we have the obvious corrections

\[ \phi_{0:1|n}(x_0 | x_1) \propto \nu(x_0)\, q(x_0, x_1)\, g_0(x_0) \quad \text{and} \quad \phi_{n-1:n|n}(x_n | x_{n-1}) \propto q(x_{n-1}, x_n)\, g_n(x_n) \;. \]

Therefore, if we aim at simulating the whole vector X_{0:n} by the most basic Gibbs sampler that simulates one component of the vector at a time, φ_{k−1:k+1|n}(x_k | x_{k−1}, x_{k+1}) is given by (6.15) in a simple closed-form expression. Remember that the expression looks simple only because knowledge of the normalization factor is not required for performing MCMC simulations. In the case where X is finite, the simulation of X_{0:n} by this Gibbs sampling approach is rather straightforward, as the only operations that are required (for k = 0, . . . , n) are computing q(x_{k−1}, x) q(x, x_{k+1}) g_k(x) for all values of x ∈ X, normalizing these quantities to form a probability vector, and simulating a value of the state according to this probability vector.
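A sketch (ours) of one complete sweep of this single site Gibbs sampler for a finite state space; the initial distribution nu, the transition matrix Q and the log g_k values are placeholders to be supplied by the model at hand.

```python
import numpy as np

def gibbs_sweep(x, nu, Q, log_g, rng):
    """One systematic-scan sweep of single site Gibbs sampling for a finite HMM.

    x      : current state sequence, integer array of length n+1 (modified in place)
    nu     : initial distribution, shape (r,)
    Q      : transition matrix, shape (r, r)
    log_g  : array of shape (n+1, r), log_g[k, c] = log g_k(c)
    """
    n = len(x) - 1
    for k in range(n + 1):
        # unnormalized log conditional (6.15), with the endpoint corrections
        logp = log_g[k].copy()
        if k == 0:
            logp += np.log(nu) + np.log(Q[:, x[1]])
        elif k == n:
            logp += np.log(Q[x[n - 1], :])
        else:
            logp += np.log(Q[x[k - 1], :]) + np.log(Q[:, x[k + 1]])
        p = np.exp(logp - logp.max())
        p /= p.sum()
        x[k] = rng.choice(len(p), p=p)
    return x

# toy usage with r = 2 states and arbitrary emission terms
rng = np.random.default_rng(4)
nu = np.array([0.5, 0.5])
Q = np.array([[0.9, 0.1], [0.2, 0.8]])
log_g = rng.normal(size=(11, 2))
x = np.zeros(11, dtype=int)
for _ in range(100):
    x = gibbs_sweep(x, nu, Q, log_g, rng)
```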

It is interesting to contrast this Gibbs sampling algorithm with the simpler Monte Carlo approach of Algorithm 6.1.1. A complete sweep of the Gibbs sampler is simpler to implement, as each Gibbs simulation step requires that r products be computed (where r is the cardinality of X). Hence, the complete Gibbs sweep requires O(r(n + 1)) operations compared to O(r²(n + 1)) for Algorithm 6.1.1, due to the necessity of computing all the filtering distributions by Algorithm 5.1.1. On the other hand, the Monte Carlo simulations obtained by Algorithm 6.1.1 are independent, which is not the case for those produced by Gibbs sampling. For a comparable computational effort, we may thus perform r times as many simulations by Gibbs sampling as by independent Monte Carlo. This does not necessarily correspond to a gain though, as the variance of MCMC estimates is most often larger than that of Monte Carlo ones due to the Markov dependence between successive samples. It remains that if the number of possible values of X_k is very large (a case usually found in related models used in applications such as image processing), it may be the case that implementing Monte Carlo simulation is overwhelming while the Gibbs sampler is still feasible. It is generally true that, apart from this case (finite but very large state space), there are very few examples of hidden Markov models where the Gibbs sampling approach is applicable and the general Monte Carlo approach of Section 6.1.2.1 is not. This has to do with the fact that determining φ_{k−1:k+1|n}(·|x_{k−1}, x_{k+1}) exactly, not only up to a constant, involves exactly the same type of marginalization operation involved in the implementation


of the filtering recursion. An important point to stress here is that replacing an exact simulation by a Metropolis-Hastings step in a general MCMC algorithm does not jeopardize its validity as long as the Metropolis-Hastings step is associated with the correct stationary distribution. Hence, the most natural alternative to the Gibbs sampler in cases where sampling from the full conditional distribution is not directly feasible is the one-at-a-time Metropolis-Hastings algorithm that combines successive Metropolis-Hastings steps that update only one of the variables. For k = 0, . . . , n, we thus update the kth component x^i_k of the current simulated sequence of states x^i by proposing a new candidate for x^{i+1}_k and accepting it according to (6.5), using (6.15) as the target.

Example 6.3.1 (Single Site Conditional Distribution in Stochastic Volatility Model, Continued). We return to the stochastic volatility model already examined in Example 6.2.16 but with the aim of simulating complete sequences under the posterior distribution rather than just individual states. From the preceding discussion, we may use the algorithm described in Example 6.2.16 for each index (k = 0, . . . , n) in the sequence of states to simulate. Although the algorithm itself applies to all indices, the expressions of μ_k, β_k and α in (6.13)-(6.14) need to be modified for the two endpoints as follows. For k = 0, the first term in (6.12) should be replaced by

\[ \exp\left( -\frac{(1 - \phi^2)\, x^2}{2\sigma^2} - \frac{(x_1 - \phi x)^2}{2\sigma^2} \right) \;, \tag{6.16} \]

as it is sensible to assume that the initial state X_0 is a priori distributed as the stationary distribution of the AR(1) process, that is, N(0, σ²/(1 − φ²)). Hence for k = 0, (6.13) and (6.14) should be replaced by

\[ \mu_0 = \phi x_1 - \sigma^2/2 \;, \qquad \beta_0 = Y_0^2\, \sigma^2 \exp(-\mu_0)/\beta^2 \;, \qquad \alpha_0 = 1/(2\sigma^2) \;. \tag{6.17} \]

For k = n, the first term in (6.12) reduces to

\[ \exp\left( -\frac{(x - \phi x_{n-1})^2}{2\sigma^2} \right) \;, \tag{6.18} \]

and thus

\[ \mu_n = \phi x_{n-1} - \sigma^2/2 \;, \qquad \beta_n = Y_n^2\, \sigma^2 \exp(-\mu_n)/\beta^2 \;, \qquad \alpha_n = 1/(2\sigma^2) \;, \tag{6.19} \]

replace (6.13) and (6.14). An iteration of the complete algorithm thus proceeds by computing, for each index k = 0, . . . , n in turn, μ_k, β_k and α according to (6.13) and (6.14), or to (6.17) or (6.19) if k = 0 or n. Then one iteration of the slice sampling algorithm discussed in Example 6.2.16 is applied.
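For concreteness, the computation of the site-wise quantities μ_k, β_k and α_k of (6.13), (6.14), (6.17) and (6.19) may be sketched as follows (our code); the slice sampling update of Example 6.2.16 would then be applied to the density proportional to exp[−α_k{(x − μ_k)² + β_k exp(−(x − μ_k))}].

```python
import numpy as np

def site_parameters(k, x, y, phi, sigma, beta):
    """Return (mu_k, beta_k, alpha_k) for the conditional of X_k given its
    neighbors and Y_k in the stochastic volatility model."""
    n = len(y) - 1
    if k == 0:                                   # endpoint correction (6.17)
        mu = phi * x[1] - sigma ** 2 / 2
        alpha = 1.0 / (2 * sigma ** 2)
    elif k == n:                                 # endpoint correction (6.19)
        mu = phi * x[n - 1] - sigma ** 2 / 2
        alpha = 1.0 / (2 * sigma ** 2)
    else:                                        # interior sites, (6.13)-(6.14)
        mu = (phi * (x[k + 1] + x[k - 1]) - sigma ** 2 / 2) / (1 + phi ** 2)
        alpha = (1 + phi ** 2) / (2 * sigma ** 2)
    # in all three cases alpha * beta_k = y_k^2 / (2 beta^2 exp(mu_k))
    beta_k = y[k] ** 2 / (2 * beta ** 2 * np.exp(mu) * alpha)
    return mu, beta_k, alpha
```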


For comparison purposes, we also consider a simpler alternative that consists in using a random walk Metropolis-Hastings proposal for the simulation of each individual site. As discussed in Section 6.2.3.2, the acceptance probability of the move at index k is given by

\[ \alpha_k(x, x') = \frac{\pi_k(x')}{\pi_k(x)} \wedge 1 \;, \]

where π_k is defined in (6.12), with the modifications mentioned in (6.16) and (6.18) for the two particular cases k = 0 and k = n. Remember that for random walk proposals, we are still free to choose the proposal density itself because, as long as it is of random walk type, it does not affect the acceptance ratio. Because the positive tail of π_k is equivalent to that of a Gaussian distribution with variance (2α)^{−1} = σ²/(1 + φ²) and the negative one decays much faster, it seems reasonable to use a Gaussian random walk proposal with a standard deviation of about 2.4 σ/√(1 + φ²), based on Roberts and Rosenthal (2001); see also the discussion in Section 6.2.3.2 above about setting the scale of random walk proposals. To compare the relative efficiency of these approaches, we use data simulated from the stochastic volatility model with parameter values corresponding to those fitted by Shephard and Pitt (1997) on log-returns of a historical daily exchange rate series, that is, φ = 0.98, σ = 0.14, and β = 0.66.

190

6 Monte Carlo Methods

10

12

14

16

18

4 20

Time Index

Fig. 6.8. Illustration of Example 6.3.1. Simulated data: values of X_k (black circles) and log(Y_k²/β²) (crosses). Note that the ordinates (y-axis) run from top to bottom.


Fig. 6.9. Illustration of Example 6.3.1. Waterfall representation of the marginal smoothing distributions estimated from 50,000 iterations of the single site slice sampler (densities estimated with Epanechnikov kernel, bandwidth 0.05). The bullets show the true simulated state sequence.


Fig. 6.10. Correlogram of the values simulated at index k = 10: solid line, single site slice sampler; dashed line, single site random walk Metropolis-Hastings.

posterior distributions and the absence of clearly marked posterior modes, their marginal averages produce the very smooth curves displayed here. In this example, the efficiency of the simulation algorithm itself is reasonable. To obtain Figure 6.9, for instance, 15,000 iterations would already have been sufficient, in the sense of producing no visible difference, showing that the sampler has converged to the stationary distribution. Figures such as 50,000 or even 15,000 may seem frightening, but they are rather moderate in MCMC applications. Figure 6.10 is the analog of the bottom plot in Figure 6.7, displaying the empirical autocorrelations of the sequence of simulated values for the state with index k = 10 (in the center of the sequence). It is interesting to note that while the single site slice sampler (Figure 6.7) produces a sequence of values that are almost uncorrelated, Figure 6.10 exhibits a strong positive correlation due to the interaction between neighboring sites. Also shown in Figure 6.10 (dashed line) is the autocorrelation for the other algorithm discussed above, based on Gaussian random walk proposals for the simulation of each individual site. This second algorithm has a tuning parameter that corresponds to the standard deviation of the proposals. In the case shown in Figure 6.10, this standard deviation was set to 2.4 σ/√(1 + φ²) as previously discussed. With this choice, the acceptance rates are of the order of 50%, ranging from 65% for edge sites (k = 0 and k = n) to 45% at the center of the simulated sequence. Figure 6.10 shows that this second algorithm produces successive draws that are more correlated (with positive correlation) than the single site slice sampling approach. A frequently used numerical measure of the performance of an MCMC sampler is twice the sum of the autocorrelations, over all the range of indices where the estimation is accurate (counting the value one that corresponds to the index zero only


once). This number is equal to the ratio of the asymptotic variance of the sample mean of the simulated values, say N^{−1} Σ_{i=1}^N x^i_{10} in our case, to the corresponding Monte Carlo variance for independent simulations under the target distribution (Meyn and Tweedie, 1993, Theorem 17.5.3; Robert and Casella, 2004, Theorem 6.65). Thus this ratio, which is sometimes referred to as the integrated autocorrelation time, may be interpreted as the price to pay (in terms of extra simulations) for using correlated draws. For the approach based on slice sampling this factor is equal to 120, whereas it is about 440 when using random walk proposals. Hence the method based on random walk is about four times less efficient, or more appropriately requires about four times as many iterations to obtain comparable results in terms of variance of the estimates. Note that this measure should not be over-interpreted, as the asymptotic variance of estimates of the form N^{−1} Σ_{i=1}^N f(x^i_{0:n}) will obviously depend on the function f as well. In addition, each iteration of the random walk sampler runs faster than for the sampler based on slice sampling. It is important to understand that the performance of a sampler depends crucially on the characteristics of the target distribution. More specifically, in our example it depends on the values of the parameters of the model, (φ, σ, β), but also on the particular observed sequence Y_{0:n} under consideration. This is a serious concern in contexts such as those of Chapters 11 and 13, where it is required to simulate sequences of states under widely varying, and sometimes very unlikely, choices of the parameters. To illustrate this point, we replaced Y_10 by the value exp(5/2), which corresponds to a rather significant positive (and hence very informative) outlier in Figure 6.8. Figure 6.11 shows the effect of this modification on the marginal smoothing distributions. For this particular data set, the integrated autocorrelation time at index k = 10 increases only slightly (140 versus 120 above) for the sampler based on single site slice sampling but more significantly (450 versus 220) for the sampler that uses random walk proposals. In Figures 6.9 and 6.11, the length of the sequence to simulate was indeed quite short (n = 20). An important issue in many applications is to know whether or not the efficiency of the sampler will deteriorate significantly when moving to longer sequences. Loosely speaking, the answer is no, in general, for HMMs, due to the forgetting properties of the posterior distribution. When the conditions discussed in Section 4.3 hold, the posterior correlation between distant sites is indeed low and thus single site sampling does not really become worse as the overall length of the sequence increases. Figure 6.12 for instance shows the results obtained for n = 200 with the same number of MCMC iterations. For the slice sampling based approach, the integrated autocorrelation time at index k = 100 is about 90, that is, comparable to what was observed for the shorter observation sequence² (see also Figure 8.6 and related comments for further discussion of this issue).
² It is indeed even slightly lower due to the fact that mixing is somewhat better far from the edges of the sequence to be simulated. The value measured at index k = 10 is equal to 110, that is, similar to what was observed for the shorter (n = 20) sequence.
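The integrated autocorrelation time mentioned above can be estimated from the MCMC output along the following lines (a rough sketch, ours; the truncation rule for the sum of empirical autocorrelations is a simple heuristic).

```python
import numpy as np

def integrated_autocorrelation_time(x, max_lag=300):
    """Estimate 1 + 2 * (sum of autocorrelations) of the one-dimensional chain x."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    var = np.dot(x, x) / n
    iact = 1.0                      # the value one at lag zero is counted once
    for lag in range(1, min(max_lag, n - 1)):
        rho = np.dot(x[:-lag], x[lag:]) / (n * var)
        if rho < 0:                 # crude truncation once the estimates become noisy
            break
        iact += 2.0 * rho
    return iact

# usage (hypothetical): iact = integrated_autocorrelation_time(chain_of_x10_values)
```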



Fig. 6.11. Same plot as Figure 6.9 where Y10 has been replaced by a positive outlier.

Fig. 6.12. Illustration of Example 6.3.1. Grey level representation of the smoothing distributions estimated from 50,000 iterations of the single site slice sampler (densities estimated with Epanechnikov kernel, bandwidth 0.05). The bold line shows the true simulated state sequence.



We conclude this example by noting that slice sampling is obviously not the only available approach to tackle posterior simulation in this model, and we do not claim that it is necessarily the best one either. Because of its practical importance in econometric applications, MCMC approaches suitable for this model have been considered by several authors including Jacquier et al. (1994), Shephard and Pitt (1997) and Kim et al. (1998).

6.3.1.2 Block Sampling Strategies

In some cases, single site updating can be painfully slow. It is thus of interest to try to speed up the simulation by breaking some of the dependence involved in single site updating. A natural solution is to propose a joint update of a group of X_k, as this induces more variability in the simulated values. This strategy has been shown to be successful in some particular models (Liu et al., 1994). The drawback of this approach however is that when the size of the blocks increases, it is sometimes difficult to imagine efficient proposal strategies in larger dimensional spaces. For the stochastic volatility model discussed above for instance, Shephard and Pitt (1997) discuss the use of approximations based on Gaussian expansions. There are no general rules here, however, and the eventual improvements in mixing speed have to be gauged in the light of the extra computational effort required to simulate larger blocks. In the case of multivariate Gaussian distributions for instance, simulating in blocks of size m involves computing the Cholesky factorization of m by m matrices, an operation whose cost is of order m³. Hence moving to block simulations will be most valuable in cases where single site sampling is pathologically slow.

6.3.2 Gibbs Sampling in CGLSSMs

For the stochastic volatility model, Kim et al. (1998) (among others) advocate the use of a specific technique that consists in approximating the behavior of the model by a conditionally Gaussian linear state-space structure. This makes sense as there are simulation techniques specific to CGLSSMs that are usually more efficient than generic simulation methods. This is also the main reason why CGLSSMs are often preferred to less structured (but perhaps more accurate) alternative models in a variety of situations such as heavy-tailed noise or outliers as in Examples 1.3.11 and 1.3.10, non-Gaussian observation noise (Kim et al., 1998), or signals (Cappé et al., 1999). Not surprisingly, efficient simulation in CGLSSMs is a topic that has been considered by many authors, including Carter and Kohn (1994), De Jong and Shephard (1995), Carter and Kohn (1996), and Doucet and Andrieu (2001).


In this context, the most natural approach to simulation consists in adequately combining the two specific Monte Carlo techniques discussed in Section 6.1.2 (for the finite state space case and Gaussian linear state-space models). Indeed, if we assume knowledge of the indicator sequence C_{0:n}, the continuous component of the state, {W_k}_{0≤k≤n}, follows a non-homogeneous Gaussian linear state-space model from which one can sample (block-wise) by Algorithms 6.1.2 or 6.1.3. If we now assume that W_{0:n} is known, Figure 1.6 clearly corresponds to a (non-homogeneous) finite state space hidden Markov model for which we may use Algorithm 6.1.1. To illustrate this conditional two-step block simulation approach, we consider an illustrative example.

Example 6.3.2 (Non-Gaussian Autoregressive Process Observed in Noise). Example 1.3.8 dealt with the case of a Gaussian autoregressive process observed in noise. When the state and/or observation noises are non-Gaussian, a possible solution is to represent the corresponding distributions by mixtures of Gaussians. The model then becomes a CGLSSM according to
\[ W_{k+1} = A W_k + \underbrace{\begin{pmatrix} \sigma(C_{k+1}) \\ 0 \\ \vdots \\ 0 \end{pmatrix}}_{R(C_{k+1})} U_k \;, \tag{6.20} \]

\[ Y_k = \begin{pmatrix} 1 & 0 & \cdots & 0 \end{pmatrix} W_k + S(C_k)\, V_k \;, \tag{6.21} \]
where the matrix A is the companion matrix defined in (1.11), which is such that W_{k+1}(1) (the first coordinate of W_{k+1}) is the regression Σ_{i=1}^p φ_i W_k(i), whereas the rest of the vector W_{k+1} is simply a copy of the first p − 1 coordinates of W_k. By allowing σ and S to depend on the indicator sequence C_k, either the state or the observation noise (or both) can be represented as finite scale mixtures of Gaussians. We will assume in the following that {C_k}_{k≥0} is a Markov chain taking values in the finite set {1, . . . , r}; the initial distribution is denoted by ν_C and the transition matrix by Q_C. In addition, we will assume that W_0 is N(0, Σ_{W_0}) distributed, where Σ_{W_0} does not depend on the indicator C_0. The simulation of the continuous component of the state W_k for k = 0, . . . , n, conditionally on C_{0:n}, is straightforward: for a specified sequence of indicators, (6.20) and (6.21) are particular instances of a non-homogeneous Gaussian linear state-space model for which Algorithm 6.1.3 applies directly. Recall that due to the particular structure of the matrix A in (6.20), the noisy AR model is typically an example for which disturbance smoothing (Algorithm 5.2.15) will be more efficient. For the simulation of the indicator variables given the disturbances U_{0:n−1}, two different situations can be distinguished.


Indicators in the Observation Equation: If σ is constant (does not depend on C_k), only the terms related to the observation equation (6.21) contribute to the posterior joint distribution of the indicators C_{0:n}, whose general expression is given in (4.10). Hence the joint posterior distribution of the indicators satisfies

\[ \phi_{0:n|n}(c_{0:n} | w_{0:n}, y_{0:n}) \propto \nu_C(c_0) \prod_{k=0}^{n-1} Q_C(c_k, c_{k+1}) \prod_{k=0}^{n} \frac{1}{S(c_k)} \exp\left( -\frac{(y_k - w_k)^2}{2 S^2(c_k)} \right) \;, \tag{6.22} \]

where factors that do not depend on the indicator variables have been omitted. Equation (6.22) clearly has the same structure as the joint distribution of the states in an HMM given by (3.13). Because C_k is finite-valued, we may use Algorithm 5.1.1 for filtering and then Algorithm 6.1.1 for sampling, granted that the function g_k be defined as

\[ g_k(c) = \frac{1}{S(c)} \exp\left( -\frac{(y_k - w_k)^2}{2 S^2(c)} \right) \;. \tag{6.23} \]

Indicators in the Dynamic Equation: In the opposite case, S is constant but σ is a function of the indicator variables. The joint distribution of the indicators C_{0:n} given W_{0:n} and Y_{0:n} depends on the quantities defining the dynamic equation (6.20) only, according to

\[ \phi_{0:n|n}(c_{0:n} | w_{0:n}, y_{0:n}) \propto \nu_C(c_0) \prod_{k=0}^{n-1} Q_C(c_k, c_{k+1}) \prod_{k=1}^{n} \frac{1}{\sigma(c_k)} \exp\left( -\frac{u_{k-1}^2}{2 \sigma^2(c_k)} \right) \;, \tag{6.24} \]

where u_k \overset{\mathrm{def}}{=} w_{k+1} − A w_k. Algorithms 5.1.1 and 6.1.1 once again apply, with g_k defined as

\[ g_k(c) = \frac{1}{\sigma(c)} \exp\left( -\frac{u_{k-1}^2}{2 \sigma^2(c)} \right) \tag{6.25} \]

for k = 1, . . . , n and g_0 ≡ 1. Note that in this second case, we do not need to condition on the sequence of states W_{0:n}, and knowledge of the disturbances U_{0:n−1} is sufficient. In particular, when using Algorithm 6.1.3 (conditionally given C_{0:n}), one can omit the last two steps to keep track only of the simulated disturbance sequence
\[ \widehat{U}_{k|n} + U_k - \widehat{U}^{\,*}_{k|n} \quad (\text{for } k = 0, \ldots, n-1) \;, \]

using the notations introduced in Algorithm 6.1.3.
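To fix ideas, the simulation of the indicators in this second case may be sketched as follows (our code): the g_k of (6.25) are evaluated on a log-scale, a forward filtering pass in the spirit of Algorithm 5.1.1 is performed, and the indicator sequence is then drawn backwards as in Algorithm 6.1.1; variable names and the exact interfaces of those algorithms are assumptions.

```python
import numpy as np

def sample_indicators(u, nu_C, Q_C, sigma, rng):
    """Draw C_{0:n} given the disturbances u = (u_0, ..., u_{n-1}).

    sigma : array of length r with the state noise standard deviations sigma(c)
    """
    n = len(u)                     # disturbances u_0..u_{n-1}, indicators C_0..C_n
    r = len(nu_C)
    # log g_k(c) as in (6.25), with g_0 = 1
    log_g = np.zeros((n + 1, r))
    for k in range(1, n + 1):
        log_g[k] = -np.log(sigma) - u[k - 1] ** 2 / (2 * sigma ** 2)
    # forward filtering
    filt = np.empty((n + 1, r))
    p = nu_C * np.exp(log_g[0] - log_g[0].max())
    filt[0] = p / p.sum()
    for k in range(1, n + 1):
        p = (filt[k - 1] @ Q_C) * np.exp(log_g[k] - log_g[k].max())
        filt[k] = p / p.sum()
    # backward sampling
    c = np.empty(n + 1, dtype=int)
    c[n] = rng.choice(r, p=filt[n])
    for k in range(n - 1, -1, -1):
        p = filt[k] * Q_C[:, c[k + 1]]
        c[k] = rng.choice(r, p=p / p.sum())
    return c
```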


Of course, in cases where the indicator variables modify the variances of both the state noise and the observation noise, the two cases considered above should be merged, which means in particular that the functions g_k should be defined as the product of the expressions given in (6.23) and (6.25), respectively. In general, the algorithm described above is reasonably successful. However, the rate of convergence of the MCMC sampler typically depends on the values of the parameters and the particular data under consideration: it can be slow in adverse situations, making it difficult to reach general conclusions. There are however a number of cases of practical importance where the algorithm fails. This has to do with the fact that in some situations, there is a very close association between the admissible values of the continuous component {W_k}_{k≥0} and the indicator variables {C_k}_{k≥0}, leading to a very slow exploration of the space by the MCMC simulations. This happens in particular when using so-called Bernoulli-Gaussian noises (Kormylo and Mendel, 1982; Lavielle, 1993; Doucet and Andrieu, 2001). In the model of Example 6.3.2 for instance, if we just want to model outlying values, a model of interest in audio restoration applications (O Ruanaidh and Fitzgerald, 1996; Godsill and Rayner, 1998), we could set S = 0 in the absence of outliers (say if C_k = 1) and, in the opposite case (C_k = 2), set S to a value whose square is large compared to the variance of {W_k}_{k≥0}. In this case, however, it is easily seen from (6.23) that C_k = 1 is only possible if W_k = Y_k and, conversely, W_k = Y_k has zero probability (remember that it is a continuous variable) unless C_k = 1. Hence the above algorithm would be fully stuck in that case. Not surprisingly, if S²(1) is not exactly equal to 0 but still very small (compared to the variance of {W_k}_{k≥0}), the Gibbs sampling approach, which simulates W_{0:n} and then C_{0:n} conditionally on each other, both block-wise, is not very efficient. We illustrate this situation with a very simple instance of Example 6.3.2.
def

and

R=

1 2 ,

so that the stationary distribution of {Wk }k0 is Gaussian with unit variance. We rst assume that S = 3 in the presence of outliers and 0.2 otherwise, corresponding to a moderately noisy signal in the absence of outliers. By convention, Ck = 2 will correspond to the presence of an outlier at index k and we set Ck = 1 otherwise. The light curve in the top plot of Figure 6.13 displays the corresponding simulated observations, where outliers have been generated at (arbitrarily selected) indices 25, 50, and 75. For modeling purposes, we assume that outliers occur independently of each other and with probability 0.95. The alternating block sampling algorithm discussed above is applied by initially setting 1 i Ck = 1 for k = 0, . . . , n, thus assuming that there are no outliers. Then W0:n

198

6 Monte Carlo Methods


3 2 1

Data

0 1 2 3 4 0 10 20 30 40 50 60 70 80 90 100

1 0.8

Probability

0.6 0.4 0.2 0

10

20

30

40

50

60

70

80

90

100

Time Index

Fig. 6.13. Top plot: observed signal (light curve) and estimated state sequence (bold curve) as estimated after 500 iterations of alternating block sampling from C0:n and W0:n . Bottom plot: estimated probability of presence of an outlier; S(1) = 0.2 in this case.
is simulated (as a block) conditionally on C^i_{0:n}, and C^{i+1}_{0:n} conditionally on W^i_{0:n}, for i = 1 to 500. The bottom plot in Figure 6.13 displays estimates of the probability of the presence of an outlier at index k, obtained by counting the number of times where C^i_k = 2. Not surprisingly, the three outliers are clearly localized, although there are two or three other points that could also be considered as outliers given the model, with some degree of plausibility. The bold curve in the top plot of Figure 6.13 shows the average of the simulated state sequences W^i_{0:n}. This is in fact a very good approximation of the actual state sequence, which is not shown here because it would be nearly indiscernible from the estimated state sequence in this case. We now keep the same sequence of states and observation noises but consider the case where S(1) = 0.02, that is, ten times smaller than before. In some sense, the task is easier now because there is almost no observation noise except for the outliers, so that localizing them should be all the easier. Figure 6.14, which is the analog of the top plot in Figure 6.13, shows that this is indeed not the case, as the outlier located at index 25 is visibly not detected, resulting in a grossly incorrect estimation of the underlying state at index 25. The source of the problem is transparent: because initially C^1_k = 1 for all indices, simulated values of W_k are very close to the observation Y_k because


Fig. 6.14. Observed signal (light curve) and estimated state sequence (bold curve) as estimated after 500 iterations of alternating block sampling from C0:n and W0:n when S(1) = 0.02.

Fig. 6.15. Number of outliers as a function of the iteration index when S(1) = 0.2.

Fig. 6.16. Number of outliers as a function of the iteration index when S(1) = 0.02.


S(1) is very small, in turn making it very difficult to reach configurations with C_k = 2. This lack of convergence when S(1) = 0.02 is also patent when comparing Figures 6.15 and 6.16: both figures show the simulated number of outliers, that is, the number of indices k for which C^i_k = 2, as a function of the iteration index i. In Figure 6.15, this number directly jumps from 0 initially to reach the most likely values (between 3 and 6) and moves very quickly in subsequent iterations. In contrast, in Figure 6.16 the estimated number of outliers varies only very slowly, with very long steady periods. A closer examination of the output reveals that in the second case, it is only after 444 iterations that C_26 is finally simulated as 2, which explains why the estimated sequence of states is still grossly wrong after 500 simulations. The moral of Example 6.3.3 is by no means that the case where S(1) = 0.02 is desperate. Running the simulation for much longer than 500 iterations (and once again, 500 is not considered as a big number in the MCMC world) does produce the expected results. On the other hand, the observation that the same sampling algorithm performs significantly worse in a task that is arguably easier is not something that can easily be swept under the carpet. At the risk of frightening newcomers to the field, it is important to underline that this is not an entirely lonely observation, as it is often difficult to sample efficiently from very concentrated distributions. In Example 6.3.3, the subsets of (C × W)^{n+1} that have non-negligible probability under the posterior distribution are very narrow (in some suitable sense) and thus hard to explore with generic MCMC approaches. To overcome the limits of the method used so far, we can however take advantage of the remark that in CGLSSMs, the conditional distribution of the continuous component of the state, W_{0:n}, given both the observations Y_{0:n} and the sequence of indicators C_{0:n}, is multivariate Gaussian and can be fully characterized using the algorithms discussed in Section 5.2. Hence the idea is to devise MCMC algorithms that target the conditional distribution of C_{0:n} given Y_{0:n}, where the continuous part W_{0:n} is marginalized out, rather than the joint distribution of C_{0:n} and W_{0:n}. This is the principle of the approaches proposed by Carter and Kohn (1996) and Doucet and Andrieu (2001). The specific contribution of Doucet and Andrieu (2001) was to remark that using the information form of the backward smoothing recursion (discussed in Section 5.2.5) is preferable because it is more generally applicable. The main tool here is Lemma 5.2.24, which makes it possible to evaluate the likelihood of the observations, marginalizing with respect to the continuous part of the state sequence, where all indicators except one are fixed. Combined with the information provided by the prior distribution of the sequence of indicators, this is all we need for sampling an indicator given all its neighbors, which is the Gibbs sampling strategy discussed in full generality in Section 6.2.5. There is however one important detail concerning the application of Lemma 5.2.24 that needs to be clarified. To apply Lemma 5.2.24


at index k, it is required that the results of both the filtering recursion at index k − 1, {Ŵ_{k−1|k−1}(C_{0:k−1}), Σ_{k−1|k−1}(C_{0:k−1})}, as well as those of the backward information recursion at index k, {γ_{k|n}(C_{k+1:n}), κ_{k|n}(C_{k+1:n})}, be available. Neither of these two recursions is particularly simple, as each step of each recursion involves in particular the inversion of a square matrix whose dimension is that of the continuous component of the state. The important point noted by Carter and Kohn (1996) and Doucet and Andrieu (2001) is that because the forward quantities at index k depend on indicators C_l for l ≤ k only and, conversely, the backward quantities depend on indicators C_l such that l > k only, it is advantageous to use a systematic scan Gibbs sampler that simulates C_k given its neighbors for k = 0, . . . , n (or in reverse order) so as to avoid multiple evaluations of identical quantities. This however makes the overall algorithm somewhat harder to describe, because it is necessary to carry out the Gibbs simulations and the forward (or backward) recursion simultaneously. The overall computational complexity of a complete sweep of the Gibbs sampler is then only of the order of what it takes to implement Algorithm 5.2.13 or Proposition 5.2.21 for all indices k between 0 and n, times the number r of possible values of the indicator, as these need to be enumerated exhaustively at each index. We now describe the version of the systematic scan Gibbs sampler that uses the result previously obtained in Section 5.2.6.

Algorithm 6.3.4 (Gibbs Sampler for Indicators in Conditional Gaussian Linear State-Space Model). Consider a conditionally Gaussian linear state-space model (Definition 2.2.6) with indicator-dependent matrices A, R, B, and S, for which the covariance Σ_ν(C_0) of the initial state may depend on C_0, and denote by ν_C and Q_C, respectively, the initial distribution and transition matrix of {C_k}_{k≥0}. Assuming that a current simulated sequence of indicators C^i_{0:n} is available, draw C^{i+1}_{0:n} as follows.

Backward Recursion: Apply Proposition 5.2.21 for k = n down to 0 with A_k = A(C^i_{k+1}), R_k = R(C^i_{k+1}), B_k = B(C^i_k), and S_k = S(C^i_k). Store the computed quantities γ_{k|n} and κ_{k|n} for k = n down to 0.

Initial State: For c = 1, . . . , r, compute
0

= Y0 ,

0 (c) = B(c) (c)B t (c) + S(c)S t (c) , W0|0 (c) = (c)B t (c) 1 (c) 0 ,
0 1 0|0 (c) = (c) (c)B t (c)0 (c)B t (c) (c) , 0 (c)

= log |0 (c)| +

t 1 0 0 (c) 0

/2
1 1

W0|n (c) = W0|0 (c) + 0|0 (c) I + 0|n 0|0 (c) 0|n (c) = 0|0 (c) 0|0 (c) I + 0|n 0|0 (c)

0|n 0|n W0|0 (c) ,

0|n 0|0 (c) ,

202

6 Monte Carlo Methods


1 t m0 (c) = log |0|0 (c)| + W0|0 (c)0|0 (c)W0|0 (c) /2 1 t + log |0|n (c)| + W0|n (c)0|n (c)W0|n (c) /2 , i p0 (c) = exp [ 0 (c) + m0 (c)] C (c)QC (c, C1 ) .

Normalize the vector $\tilde p_0$ by computing $p_0(c) = \tilde p_0(c)/\sum_{c'=1}^r \tilde p_0(c')$ for c = 1, ..., r and sample $C_0^{i+1}$ from the probability distribution $p_0$ on {1, ..., r}. Then store the Kalman filter variables corresponding to $c = C_0^{i+1}$ (that is, $W_{0|0}(C_0^{i+1})$ and $\Sigma_{0|0}(C_0^{i+1})$) for the next iteration.

For k = 1, ..., n: For c = 1, ..., r, compute

$W_{k|k-1}(c) = A(c)\,W_{k-1|k-1}(C_{k-1}^{i+1})$ ,
$\Sigma_{k|k-1}(c) = A(c)\,\Sigma_{k-1|k-1}(C_{k-1}^{i+1})\,A^t(c) + R(c)R^t(c)$ ,
$\epsilon_k(c) = Y_k - B(c)W_{k|k-1}(c)$ ,
$\Gamma_k(c) = B(c)\Sigma_{k|k-1}(c)B^t(c) + S(c)S^t(c)$ ,
$W_{k|k}(c) = W_{k|k-1}(c) + \Sigma_{k|k-1}(c)B^t(c)\Gamma_k^{-1}(c)\,\epsilon_k(c)$ ,
$\Sigma_{k|k}(c) = \Sigma_{k|k-1}(c) - \Sigma_{k|k-1}(c)B^t(c)\Gamma_k^{-1}(c)B(c)\Sigma_{k|k-1}(c)$ ,
$\ell_k(c) = -\left[\log|\Gamma_k(c)| + \epsilon_k^t(c)\Gamma_k^{-1}(c)\epsilon_k(c)\right]/2$ ,
$W_{k|n}(c) = W_{k|k}(c) + \Sigma_{k|k}(c)\left[I + \Pi_{k|n}\Sigma_{k|k}(c)\right]^{-1}\left[\kappa_{k|n} - \Pi_{k|n}W_{k|k}(c)\right]$ ,
$\Sigma_{k|n}(c) = \Sigma_{k|k}(c) - \Sigma_{k|k}(c)\left[I + \Pi_{k|n}\Sigma_{k|k}(c)\right]^{-1}\Pi_{k|n}\Sigma_{k|k}(c)$ ,
$m_k(c) = -\left[\log|\Sigma_{k|k}(c)| + W_{k|k}^t(c)\Sigma_{k|k}^{-1}(c)W_{k|k}(c)\right]/2 + \left[\log|\Sigma_{k|n}(c)| + W_{k|n}^t(c)\Sigma_{k|n}^{-1}(c)W_{k|n}(c)\right]/2$ ,

$\tilde p_k(c) = \exp\left[\ell_k(c) + m_k(c)\right]\,Q_C(C_{k-1}^{i+1}, c)\,Q_C(c, C_{k+1}^i)$ for k < n ,
$\tilde p_n(c) = \exp\left[\ell_n(c) + m_n(c)\right]\,Q_C(C_{n-1}^{i+1}, c)$ for k = n .

Set $p_k(c) = \tilde p_k(c)/\sum_{c'=1}^r \tilde p_k(c')$ (for c = 1, ..., r) and sample $C_k^{i+1}$ from $p_k$. If k < n, the corresponding Kalman filter variables $W_{k|k}(C_k^{i+1})$ and $\Sigma_{k|k}(C_k^{i+1})$ are stored for the next iteration.

Despite the fact that it is perhaps the most complex algorithm to be met in this book, Algorithm 6.3.4 deserves no special comment, as it simply combines the results obtained in Chapter 5 (Algorithms 5.2.13 and 5.2.22, Lemma 5.2.24) with the principle of the Gibbs sampler exposed in Section 6.2.5 and the clever remark that using a systematic scanning order of the simulation sites (here in ascending order) greatly reduces the computational load. Algorithm 6.3.4 is similar to the method described by Doucet and Andrieu (2001), but here the expression used for evaluating $m_k(c)$ has been made more transparent by use of the smoothing moments $W_{k|n}(c)$ and $\Sigma_{k|n}(c)$.
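To make the per-site sampling step concrete, the following minimal sketch (not taken from the book; function and variable names are ours) shows how the unnormalized terms of Algorithm 6.3.4 can be combined and normalized on the log scale, anticipating Remark 6.3.5 below.

```python
import numpy as np

def sample_indicator(log_l, log_m, log_prior, rng):
    """Draw one indicator C_k given per-value log-terms.

    log_l[c], log_m[c] : the terms written l_k(c) and m_k(c) in Algorithm 6.3.4
    log_prior[c]       : log of the prior factor, e.g. log Q_C(C_{k-1}, c) + log Q_C(c, C_{k+1})
    All computations stay on the log scale until the final normalization.
    """
    log_p = log_l + log_m + log_prior      # unnormalized log-probabilities
    log_p -= log_p.max()                   # guard against numerical under/overflow
    p = np.exp(log_p)
    p /= p.sum()
    return rng.choice(len(p), p=p)

# Hypothetical usage with r = 2 indicator values:
rng = np.random.default_rng(0)
c = sample_indicator(np.array([-3.2, -1.0]), np.array([-0.5, -0.7]),
                     np.log(np.array([0.99, 0.01])), rng)
```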


Remark 6.3.5. Note that in Algorithm 6.3.4, the quantities $\ell_k(c)$ and, most importantly, $m_k(c)$ are evaluated on a log-scale. Only when computation of the probabilities $p_k(c)$ is necessary are those converted back to the linear scale using the exponential function. Although rarely explicitly mentioned, this remark is of some importance in many practical applications of MCMC methods (and particularly those to be discussed in Section 13.2) that involve ratios of likelihood terms, each of which may well exceed the machine precision. In the case of Algorithm 6.3.4, remember that these terms need to be evaluated for all possible values of c, and hence their range of variation is all the larger as some of these indicator configurations may be particularly unlikely.
To illustrate the behavior of Algorithm 6.3.4, we consider again the noisy AR(1) model with outliers in the case where S(1) = 0.02, which led to the poor mixing illustrated in Figure 6.16.
Fig. 6.17. Number of outliers as a function of the iteration index when S(1) = 0.02 for the systematic scan Gibbs sampler.

Example 6.3.6 (Gaussian AR Process with Outliers, Continued). Applying Algorithm 6.3.4 to the model of Example 6.3.3 provides the result shown in Figure 6.17, this figure being the exact analog of Figure 6.16. Figure 6.17 shows that with Algorithm 6.3.4, only configurations with at least three outliers are ever visited. This is logical, as with such a low value of the observation noise (S(1) = 0.02), the values observed at indices 25, 50, and 75 can only correspond to outliers. A closer examination of the simulation shows that all simulated sequences $C_{0:n}^i$ except the initial one (we are still initializing the sampler with the configuration such that $C_k^1 = 1$ for all k = 0, ..., n) are such that $C_{25}^i = C_{50}^i = C_{75}^i = 2$. From the simulation, we are thus as certain as we can be that there are indeed outliers at these locations and most probably there are no others (the configuration with exactly these three outliers is selected about 67% of the time, and no individual site, other than those with indices 25, 50, and 75, is selected more than 15 times


out of the 500 iterations). Figure 6.17 also shows that the other configurations that are explored are visited rather rapidly, rather than with long idle periods as in Figure 6.16, which also suggests good mixing properties of the Gibbs sampling algorithm.
To conclude this section with a more realistic example of the use of CGLSSMs and MCMC techniques, we consider the change point model and the well-log data already discussed in Example 1.3.10.

Example 6.3.7 (Gibbs Sampler for the Well-Log Data). To analyze the well-log data shown in Figure 1.7, we consider the conditionally Gaussian state-space model

$W_{k+1} = A(C_{k+1,1})W_k + R(C_{k+1,1})U_k$ ,  $U_k \sim \mathrm{N}(0, 1)$ ,
$Y_k = \mu_Y(C_{k,2}) + B(C_{k,2})W_k + S(C_{k,2})V_k$ ,  $V_k \sim \mathrm{N}(0, 1)$ ,

where $C_{k,1} \in \{1, 2\}$ and $C_{k,2} \in \{1, 2\}$ are indicator variables indicating, respectively, the presence of a jump in the level of the underlying signal and that of an outlier in the measurement noise, as discussed in Examples 1.3.10 and 1.3.11. For comparison purposes, we use exactly the same model specification as the one advocated by Fearnhead and Clifford (2003). The data shown in Figure 1.7 is first centered (approximately) by subtracting 115,000 from each observation; in the following, $Y_k$ refers to the data with this average level subtracted. When $C_{k,1} = 1$ the underlying signal level is constant, and we set A(1) = 1 and R(1) = 0. When $C_{k,1} = 2$, the occurrence of a jump is modeled by A(2) = 0, R(2) = 10,000, which is an informative prior on the size of the jump. Though, as explained in the introduction, this is presumably an oversimplified assumption, we assume a constant probability for the presence of a jump or, equivalently, that $\{C_{k,1}\}_{k\geq 0}$ is an i.i.d. sequence of Bernoulli random variables with constant probability of success p. The jump positions then form a discrete renewal sequence whose increment distribution is geometric with expectation 1/p. Because there are about 16 jumps in a sequence of 4,000 samples, the average of the increment distribution is about 250, suggesting p = 1/250. When $C_{k,2} = 1$, the observation is modeled as the true state corrupted by additive noise, so that B(1) = 1, where S(1) is set to 2,500 based on the empirical standard deviation of the median filter residual shown in the right plot of Figure 1.7. When $C_{k,2} = 2$, the occurrence of an outlier is modeled by a Gaussian random variable whose parameters are independent of the true state, so that B(2) = 0, and the outlier is assumed to have mean $\mu_Y(2) = -30{,}000$ and standard deviation S(2) = 12,500. The outliers appear to be clustered in time, with a typical cluster size of four samples. Visual inspection shows that there are about 16 clusters of noise, which suggests modeling the sequence $\{C_{k,2}\}_{k\geq 0}$ as a Markov chain with transition probabilities $\mathrm{P}(C_{k+1,2} = 2 \mid C_{k,2} = 1) = 1/250$ and $\mathrm{P}(C_{k+1,2} = 1 \mid C_{k,2} = 2) = 1/4$. The initial $C_{0,2}$ is assumed to be distributed according to the stationary distribution of this chain, which is $\mathrm{P}(C_{0,2} = 1) = 125/127$. The initial distribution of $W_0$ is assumed to have zero mean with a very large variance, which corresponds to an approximation of the so-called diffuse (or improper flat, following the terminology of Section 5.2.5) prior. Note that because $B(C_0)$ may be null (when $C_0 = 2$), using a truly diffuse prior (with infinite covariance matrix) cannot be done in this case by simply computing $W_{0|0}$ as in (5.109), which is customary. In the case under consideration, however, the prior on $W_0$ is non-essential because the initial state is very clearly identified from the data anyway.
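For concreteness, the numbers given above can be collected in one place; the following sketch simply records the specification (variable names are ours, not from the book).

```python
# Jump indicator C_{k,1}: i.i.d. Bernoulli, P(jump) = 1/250
A = {1: 1.0, 2: 0.0}            # state transition scalar A(C_{k,1})
R = {1: 0.0, 2: 10_000.0}       # state noise scale R(C_{k,1})
p_jump = 1.0 / 250.0

# Outlier indicator C_{k,2}: two-state Markov chain
B   = {1: 1.0, 2: 0.0}              # observation coefficient B(C_{k,2})
S   = {1: 2_500.0, 2: 12_500.0}     # observation noise scale S(C_{k,2})
muY = {1: 0.0, 2: -30_000.0}        # observation mean offset
Q_C2 = [[1 - 1/250, 1/250],         # P(C_{k+1,2} = . | C_{k,2} = 1)
        [1/4,       3/4]]           # P(C_{k+1,2} = . | C_{k,2} = 2)
pi_C2 = [125/127, 2/127]            # stationary distribution of C_{k,2}

level_offset = 115_000.0            # subtracted from the raw well-log data
```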

Note that in the model above, the presence of outliers induces non-zero means in the observation equation. As discussed in Remark 5.2.14, however, this does not necessitate significant modifications, and we just need to apply Algorithm 6.3.4 using as observation $Y_k - \mu_Y(c_k)$ rather than $Y_k$, where $\mu_Y(1) = 0$ and $\mu_Y(2) = -30{,}000$. Because R(1) = 0 implies that the continuous component of the state $W_k$ stays exactly constant between two jump points, this model belongs to the category discussed earlier for which the alternating block sampling algorithm cannot be applied at all. We thus consider the result of the Gibbs sampler that operates on the indicator variables only.
Figure 6.18 displays the results obtained by application of Algorithm 6.3.4 after 5,000 iterations, one iteration referring to a complete cycle of the Gibbs sampler through all the n + 1 sites. Initially, $C_{k,1}^1$ and $C_{k,2}^1$ are both set to 1 for all sites k = 0, ..., n, which corresponds to the (very improbable) configuration in which there are neither jumps nor outliers. Clearly, after 5,000 iterations both the jump and outlier positions are located very clearly. There is however a marked difference, which is that whereas the outliers (middle plot) are located with posterior probabilities very close to 1, the jumps are only located with probabilities between 0.3 and 0.6. There are two reasons for this behavior, the second being more fundamental. First, the model for the distribution of outliers is more precise and incorporates in particular the fact that outliers systematically induce a downward bias. The second reason is a slightly deficient modeling of the occurrence of jumps. For the outliers, the selected Markov transition kernel implies that outlier periods are infrequent (occurring 2/127 of the time on average) but have durations that are geometric with average duration 4. This is a crucial feature, as a closer examination of the data reveals that some of these periods of outliers last for 10 or even 20 consecutive samples. In contrast, our model for jumps implies that jumps are infrequent (occurring in one sample out of 250 on average) and isolated. For instance, a sequence of four consecutive jumps is, a priori, judged as being $6.2 \times 10^7$ times less probable than the occurrence of just one jump in one of these four positions. The real data however, cf. Figure 6.19, shows that the actual jumps are not abrupt and involve at least

Fig. 6.18. From top to bottom: original data, posterior probability of the presence of outliers, and jumps estimated from 5,000 iterations of Algorithm 6.3.4.

Fig. 6.19. From top to bottom: original data and posterior probability of the presence of jumps (zoom on a detail of Figure 6.18).

Fig. 6.20. Jump detection indicators (indices such that $C_k^i = 2$) for the first 250 iterations.

two and sometimes as many as five consecutive points. Because the modeling assumptions do not allow all of these points to be marked as jumps, the result tends to identify only one of these as the preferred jump location, whence the larger uncertainty (lower posterior probability) concerning which one is selected. Interestingly, the picture will be very different when we consider the filtering distributions (that is, the distribution of $C_k$ given data up to index k only) in Example 8.2.10 of Chapter 8.
Figure 6.20 gives an idea of the way the simulation visits the configurations of indicators (for the jumps), showing that the algorithm almost instantaneously forgets its erroneous initial state. Consequently, the configurations change at a rather fast pace, suggesting good mixing behavior of the sampler. Note that those time indices for which jumps are detected in the bottom plot of Figure 6.18 correspond to abscissas for which the indicators of jump stay on very systematically through the simulation.
To conclude this section on MCMC sampling in conditionally Gaussian linear state-space models, we note that there is an important and interesting literature that discusses the best use of simulations for the purpose of estimating the unobservable state sequence $\{W_k, C_k\}_{k\geq 0}$. To estimate a function f of the unobserved sequence of states $W_{0:n}$, the most natural options are the straightforward MCMC estimate
$N^{-1} \sum_{i=1}^N f(W_{0:n}^i)$ ,
directly available with alternating block sampling (as in Example 6.3.3), or its Rao-Blackwellized version
$N^{-1} \sum_{i=1}^N \mathrm{E}\left[f(W_{0:n}) \mid Y_{0:n}, C_{0:n}^i\right]$ ,
which can easily be computed when using Algorithm 6.3.4, at least for linear and quadratic functions f, as the smoothing moments $W_{k|n}(C_{0:k}^{i+1}, C_{k+1:n}^i)$ and $\Sigma_{k|n}(C_{0:k}^{i+1}, C_{k+1:n}^i)$ are evaluated at each iteration i and for all sites k. But both of these alternatives are estimates of $\mathrm{E}[f(W_{0:n}) \mid Y_{0:n}]$, which, in some applications, is perhaps not what is regarded as the best estimate of the states. In the change point application discussed in Example 6.3.7 in particular, $\mathrm{E}[f(W_{0:n}) \mid Y_{0:n}]$ does not correspond to a piecewise constant trajectory, especially if some jump locations are only detected with some ambiguity. If one really believes that the model is correct, it may thus make more sense to estimate first the best sequence of indicators $\hat c_{0:n}$, that is, the one that maximizes $\mathrm{P}(C_{0:n} = c_{0:n} \mid Y_{0:n})$, and then use $\mathrm{E}[f(W_{0:n}) \mid Y_{0:n}, C_{0:n} = \hat c_{0:n}]$ as the estimate of the continuous part of the state sequence. In the change point model, this third way of proceeding is guaranteed to return a piecewise constant sequence. This is not an easy task, however, because finding the indicator sequence $\hat c_{0:n}$ that maximizes the posterior probability is a difficult combinatorial optimization problem, especially given the fact that we cannot evaluate $\mathrm{P}(C_{0:n} = c_{0:n} \mid Y_{0:n})$ directly. We refer to Lavielle and Lebarbier (2001), Doucet and Andrieu (2001), and references therein for further reading on this issue.
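As a small illustration of the two estimates just discussed (a sketch under the assumption that the simulated trajectories, respectively the per-iteration smoothing means, have been stored as rows of an array):

```python
import numpy as np

def plain_mc_estimate(W_draws):
    """Average of simulated trajectories W_{0:n}^i (alternating block sampling)."""
    return W_draws.mean(axis=0)

def rao_blackwellized_estimate(W_smoothed_means):
    """Average of E[f(W_{0:n}) | Y_{0:n}, C_{0:n}^i], available with Algorithm 6.3.4."""
    return W_smoothed_means.mean(axis=0)
```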

7 Sequential Monte Carlo Methods

The use of Monte Carlo methods for non-linear filtering can be traced back to the pioneering contributions of Handschin and Mayne (1969) and Handschin (1970). These early attempts were based on sequential versions of the importance sampling paradigm, a technique that amounts to simulating samples under an instrumental distribution and then approximating the target distributions by weighting these samples using appropriately defined importance weights. In the non-linear filtering context, importance sampling algorithms can be implemented sequentially in the sense that, by defining carefully a sequence of instrumental distributions, there is no need to regenerate the population of samples from scratch upon the arrival of each new observation. This algorithm is called sequential importance sampling, often abbreviated SIS.
Although the SIS algorithm has been known since the early 1970s, its use in non-linear filtering problems was rather limited at that time. Most likely, the available computational power was then too limited to allow convincing applications of these methods. Another less obvious reason is that the SIS algorithm suffers from a major drawback that was not clearly identified and properly cured until the seminal paper by Gordon et al. (1993). As the number of iterations increases, the importance weights tend to degenerate, a phenomenon known as sample impoverishment or weight degeneracy. Basically, in the long run most of the samples have very small normalized importance weights and thus do not significantly contribute to the approximation of the target distribution. The solution proposed by Gordon et al. (1993) is to allow rejuvenation of the set of samples by duplicating the samples with high importance weights and, on the contrary, removing samples with low weights.
The particle filter of Gordon et al. (1993) was the first successful application of sequential Monte Carlo techniques to the field of non-linear filtering. Since then, sequential Monte Carlo (or SMC) methods have been applied in many different fields including computer vision, signal processing, control, econometrics, finance, robotics, and statistics (Doucet et al., 2001a; Ristic et al., 2004). This chapter reviews the basic building blocks that are needed to implement a sequential Monte Carlo algorithm, starting with concepts related to the importance sampling approach. More specific aspects of sequential Monte Carlo techniques will be further discussed in Chapter 8, while convergence issues will be dealt with in Chapter 9.

7.1 Importance Sampling and Resampling


7.1.1 Importance Sampling

Importance sampling is a method that dates back to, at least, Hammersley and Handscomb (1965) and that is commonly used in several fields (for general references on importance sampling, see Glynn and Iglehart, 1989, Geweke, 1989, Evans and Swartz, 1995, or Robert and Casella, 2004). Throughout this section, $\mu$ will denote a probability measure of interest on a measurable space $(X, \mathcal{X})$, which we shall refer to as the target distribution. As in Chapter 6, the aim is to approximate integrals of the form $\mu(f) = \int_X f(x)\,\mu(dx)$ for real-valued measurable functions f. The Monte Carlo approach exposed in Section 6.1 consists in drawing an i.i.d. sample $\xi^1, \dots, \xi^N$ from the probability measure $\mu$ and then evaluating the sample mean $N^{-1}\sum_{i=1}^N f(\xi^i)$. Of course, this technique is applicable only when it is possible (and reasonably simple) to sample from the target distribution $\mu$.
Importance sampling is based on the idea that in certain situations it is more appropriate to sample from an instrumental distribution $\nu$, and then to apply a change-of-measure formula to account for the fact that the instrumental distribution is different from the target distribution. More formally, assume that the target probability measure $\mu$ is absolutely continuous with respect to an instrumental probability measure $\nu$ from which sampling is easily feasible. Denote by $d\mu/d\nu$ the Radon-Nikodym derivative of $\mu$ with respect to $\nu$. Then for any $\mu$-integrable function f,
$\mu(f) = \int f(x)\,\mu(dx) = \int f(x)\,\frac{d\mu}{d\nu}(x)\,\nu(dx)$ .   (7.1)
In particular, if $\xi^1, \xi^2, \dots$ is an i.i.d. sample from $\nu$, (7.1) suggests the following estimator of $\mu(f)$:
$\hat\mu^{\mathrm{IS}}_{\nu,N}(f) = N^{-1}\sum_{i=1}^N f(\xi^i)\,\frac{d\mu}{d\nu}(\xi^i)$ .   (7.2)
Because this estimator is the sample mean of independent random variables, there is a range of results available to assess the quality of $\hat\mu^{\mathrm{IS}}_{\nu,N}(f)$ as an estimator of $\mu(f)$. First of all, the strong law of large numbers implies that $\hat\mu^{\mathrm{IS}}_{\nu,N}(f)$ converges to $\mu(f)$ almost surely as N tends to infinity. In addition, the central limit theorem for i.i.d. variables (or deviation inequalities) may serve as a guide for selecting the proposal distribution $\nu$, beyond the obvious requirement that it should dominate the target distribution $\mu$. We postpone this issue and, more generally, considerations that pertain to the behavior of the approximation for large values of N to Chapter 9.
In many situations, the target probability measure $\mu$ or the instrumental probability measure $\nu$ is known only up to a normalizing factor. As already discussed in Remark 6.2.7, this is particularly true when applying importance sampling ideas to HMMs and, more generally, in Bayesian statistics. The Radon-Nikodym derivative $d\mu/d\nu$ is then known up to a (constant) scaling factor only. It is however still possible to use the importance sampling paradigm in that case, by adopting the self-normalized form of the importance sampling estimator,
$\tilde\mu^{\mathrm{IS}}_{\nu,N}(f) = \dfrac{\sum_{i=1}^N f(\xi^i)\,\frac{d\mu}{d\nu}(\xi^i)}{\sum_{i=1}^N \frac{d\mu}{d\nu}(\xi^i)}$ .   (7.3)
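As a minimal numerical sketch of (7.2) and (7.3), with an arbitrary toy choice of target and instrumental densities (not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

# Target mu = N(0, 1); instrumental nu = N(0, 2^2) (arbitrary toy choice)
def target_pdf(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

def instr_pdf(x):
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

def f(x):
    return x ** 2                       # estimate mu(f) = E[X^2] = 1

N = 10_000
xi = 2.0 * rng.standard_normal(N)       # draws from nu
w = target_pdf(xi) / instr_pdf(xi)      # dmu/dnu evaluated at the draws

est_is = np.mean(f(xi) * w)                       # plain estimator (7.2)
est_self_norm = np.sum(f(xi) * w) / np.sum(w)     # self-normalized estimator (7.3)
```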

This quantity is obviously free from any scale factor in $d\mu/d\nu$. The self-normalized importance sampling estimator $\tilde\mu^{\mathrm{IS}}_{\nu,N}(f)$ is defined as a ratio of the sample means of the functions $f_1 = f\,(d\mu/d\nu)$ and $f_2 = d\mu/d\nu$. The strong law of large numbers thus implies that $N^{-1}\sum_{i=1}^N f_1(\xi^i)$ and $N^{-1}\sum_{i=1}^N f_2(\xi^i)$ converge almost surely to $\nu(f_1)$ and $\nu(d\mu/d\nu) = 1$, respectively, showing that $\tilde\mu^{\mathrm{IS}}_{\nu,N}(f)$ is a consistent estimator of $\mu(f)$. Again, more precise results on the behavior of this estimator will be given in Chapter 9. In the following, the term importance sampling usually refers to the self-normalized form (7.3) of the importance sampling estimate.

7.1.2 Sampling Importance Resampling

Although importance sampling is primarily intended to overcome difficulties with direct sampling from $\mu$ when approximating integrals of the form $\mu(f)$, it can also be used for (approximate) sampling from the distribution $\mu$. The latter can be achieved by the sampling importance resampling (or SIR) method due to Rubin (1987, 1988). Sampling importance resampling is a two-stage procedure in which importance sampling as discussed above is followed by an additional random sampling step. In the first stage, an i.i.d. sample $(\tilde\xi^1, \dots, \tilde\xi^M)$ is drawn from the instrumental distribution $\nu$, and one computes the normalized version of the importance weights,
$\omega^i = \dfrac{\frac{d\mu}{d\nu}(\tilde\xi^i)}{\sum_{j=1}^M \frac{d\mu}{d\nu}(\tilde\xi^j)}$ ,  i = 1, ..., M .   (7.4)

In the second stage, the resampling stage, a sample of size N denoted by $\xi^1, \dots, \xi^N$ is drawn from the intermediate set of points $\tilde\xi^1, \dots, \tilde\xi^M$, taking into account the weights computed in (7.4). The rationale is that points $\tilde\xi^i$ for which $\omega^i$ in (7.4) is large are most likely under the target distribution and should thus be selected with higher probability during the resampling than points with low (normalized) importance weights. This principle is illustrated in Figure 7.1.

Fig. 7.1. Principle of resampling. Top plot: the sample drawn from $\nu$ with associated normalized importance weights depicted by bullets with radii proportional to the normalized weights (the target density corresponding to $\mu$ is plotted in solid line). Bottom plot: after resampling, all points have the same importance weight, and some of them have been duplicated (M = N = 7).

There are several ways of implementing this basic idea, the most obvious approach being sampling with replacement, with the probability of sampling each $\tilde\xi^i$ equal to the importance weight $\omega^i$. Hence the number of times $N^i$ each particular point $\tilde\xi^i$ in the first-stage sample is selected follows a binomial $\mathrm{Bin}(N, \omega^i)$ distribution. The vector $(N^1, \dots, N^M)$ is distributed from $\mathrm{Mult}(N, \omega^1, \dots, \omega^M)$, the multinomial distribution with parameter N and probabilities of success $(\omega^1, \dots, \omega^M)$. In this resampling step, the points in the first-stage sample that are associated with small normalized importance weights are most likely to be discarded, whereas the best points in the sample are duplicated in proportion to their importance weights. In most applications, it is typical to choose M, the size of the first-stage sample, larger (and sometimes much larger) than N. The SIR algorithm is summarized below.

Algorithm 7.1.1 (SIR: Sampling Importance Resampling).
Sampling: Draw an i.i.d. sample $\tilde\xi^1, \dots, \tilde\xi^M$ from the instrumental distribution $\nu$.
Weighting: Compute the (normalized) importance weights
$\omega^i = \dfrac{\frac{d\mu}{d\nu}(\tilde\xi^i)}{\sum_{j=1}^M \frac{d\mu}{d\nu}(\tilde\xi^j)}$ for i = 1, ..., M .
Resampling: Draw, conditionally independently given $(\tilde\xi^1, \dots, \tilde\xi^M)$, N discrete random variables $(I^1, \dots, I^N)$ taking values in the set {1, ..., M} with probabilities $(\omega^1, \dots, \omega^M)$, i.e.,
$\mathrm{P}(I^1 = j) = \omega^j$ ,  j = 1, ..., M .   (7.5)
Set, for i = 1, ..., N, $\xi^i = \tilde\xi^{I^i}$.
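A compact numpy sketch of Algorithm 7.1.1 with multinomial resampling follows; the sampler and the Radon-Nikodym derivative are placeholders to be supplied by the user.

```python
import numpy as np

def sir(sample_instrumental, dmu_dnu, M, N, rng):
    """Sampling importance resampling (a sketch of Algorithm 7.1.1).

    sample_instrumental(M, rng) -> array of M draws from nu
    dmu_dnu(x)                  -> dmu/dnu evaluated at x, possibly up to a constant
    """
    xi_tilde = sample_instrumental(M, rng)      # first-stage sample from nu
    w = dmu_dnu(xi_tilde)
    w = w / w.sum()                             # normalized importance weights (7.4)
    idx = rng.choice(M, size=N, p=w)            # multinomial resampling, cf. (7.5)
    return xi_tilde[idx]                        # approximate draws from mu

# Hypothetical usage: resample from a standard normal target via a wider normal proposal
rng = np.random.default_rng(2)
draws = sir(lambda M, r: 2.0 * r.standard_normal(M),
            lambda x: np.exp(-0.5 * x**2) / np.exp(-0.5 * (x / 2.0)**2),
            M=5000, N=1000, rng=rng)
```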

The set $(I^1, \dots, I^N)$ is thus a multinomial trial process. Hence, this method of selection is known as the multinomial resampling scheme.
At this point, it may not be obvious that the sample $\xi^1, \dots, \xi^N$ obtained from Algorithm 7.1.1 is indeed (approximately) i.i.d. from $\mu$ in any suitable sense. In Chapter 9, it will be shown that the sample mean of the draws obtained using the SIR algorithm,
$\hat\mu^{\mathrm{SIR}}_{\nu,M,N}(f) = \frac{1}{N}\sum_{i=1}^N f(\xi^i)$ ,   (7.6)
is a consistent estimator of $\mu(f)$ for all functions f satisfying $\mu(|f|) < \infty$. The resampling step might thus be seen as a means to transform the weighted importance sampling estimate $\tilde\mu^{\mathrm{IS}}_{\nu,M}(f)$ defined by (7.3) into an unweighted sample average. Recall that $N^i$ is the number of times that the element $\tilde\xi^i$ is resampled. Rewriting
$\hat\mu^{\mathrm{SIR}}_{\nu,M,N}(f) = \frac{1}{N}\sum_{i=1}^N f(\xi^i) = \sum_{i=1}^M \frac{N^i}{N}\, f(\tilde\xi^i)$ ,
it is easily seen that the sample mean $\hat\mu^{\mathrm{SIR}}_{\nu,M,N}(f)$ of the SIR sample is, conditionally on the first-stage sample $(\tilde\xi^1, \dots, \tilde\xi^M)$, equal to the importance sampling estimator $\tilde\mu^{\mathrm{IS}}_{\nu,M}(f)$ defined in (7.3),
$\mathrm{E}\left[\hat\mu^{\mathrm{SIR}}_{\nu,M,N}(f) \,\middle|\, \tilde\xi^1, \dots, \tilde\xi^M\right] = \tilde\mu^{\mathrm{IS}}_{\nu,M}(f)$ .
As a consequence, the mean squared error of the SIR estimator $\hat\mu^{\mathrm{SIR}}_{\nu,M,N}(f)$ is always larger than that of the importance sampling estimator (7.3) due to the well-known variance decomposition
$\mathrm{E}\left[\left(\hat\mu^{\mathrm{SIR}}_{\nu,M,N}(f) - \mu(f)\right)^2\right] = \mathrm{E}\left[\left(\hat\mu^{\mathrm{SIR}}_{\nu,M,N}(f) - \tilde\mu^{\mathrm{IS}}_{\nu,M}(f)\right)^2\right] + \mathrm{E}\left[\left(\tilde\mu^{\mathrm{IS}}_{\nu,M}(f) - \mu(f)\right)^2\right]$ .


The variance $\mathrm{E}[(\hat\mu^{\mathrm{SIR}}_{\nu,M,N}(f) - \tilde\mu^{\mathrm{IS}}_{\nu,M}(f))^2]$ may be interpreted as the price to pay for converting the weighted importance sampling estimate into an unweighted approximation. Showing that the SIR estimate (7.6) is a consistent and asymptotically normal estimator of $\mu(f)$ is not a trivial task, as $\xi^1, \dots, \xi^N$ are no longer independent due to the normalization of the weights followed by resampling. As such, the elementary i.i.d. convergence results that underlie the theory of the importance sampling estimator are of no use, and we refer to Section 9.2 for the corresponding proofs.

Remark 7.1.2. A closer examination of the numerical complexity of Algorithm 7.1.1 reveals that whereas all steps of the algorithm have a complexity that grows in proportion to M and N, this is not quite true for the multinomial sampling step, whose numerical complexity is, a priori, growing faster than N (about $N \log_2 M$; see Section 7.4.1 below for details). This is very unfortunate, as we know from elementary arguments discussed in Section 6.1 that Monte Carlo methods are most useful when N is large (or, more appropriately, that the quality of the approximation improves rather slowly as N grows). A clever use of elementary probabilistic results however makes it possible to devise methods for sampling N times from a multinomial distribution with M possible outcomes using a number of operations that grows only linearly with the maximum of N and M. In order not to interrupt our exposition of sequential Monte Carlo, the corresponding algorithms are discussed in Section 7.4.1 at the end of this chapter. Note that we are here only discussing implementation issues. There are however different motivations, also discussed in Section 7.4.2, for adopting sampling schemes other than multinomial sampling.
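One classical construction achieving linear complexity generates the order statistics of N uniforms directly (via normalized exponential spacings) and then sweeps once through the cumulative weights; the following sketch illustrates the idea, without claiming that it is the exact scheme described in Section 7.4.1.

```python
import numpy as np

def multinomial_resample_linear(weights, N, rng):
    """Draw N indices from Mult(N, weights) in O(M + N) operations.

    weights is assumed to be normalized (summing to 1).
    """
    e = rng.exponential(size=N + 1)
    u_sorted = np.cumsum(e[:-1]) / e.sum()      # sorted uniforms, no explicit sort
    cum_w = np.cumsum(weights)
    cum_w[-1] = 1.0                             # guard against rounding
    idx = np.empty(N, dtype=int)
    j = 0
    for i in range(N):                          # single merge pass through cum_w
        while u_sorted[i] > cum_w[j]:
            j += 1
        idx[i] = j
    return idx
```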

7.2 Sequential Importance Sampling


7.2.1 Sequential Implementation for HMMs

We now specialize the sampling techniques considered above to hidden Markov models. As in previous chapters, we adopt the hidden Markov model as specified by Definition 2.2.2, where Q denotes the Markov transition kernel of the hidden chain, $\nu$ is the distribution of the initial state $X_0$, and g(x, y) (for $x \in X$, $y \in Y$) denotes the transition density function of the observation given the state, with respect to the measure $\mu$ on $(Y, \mathcal{Y})$. To simplify the mathematical expressions, we will also use the shorthand notation $g_k(\cdot) = g(\cdot, Y_k)$ introduced in Section 3.1.4. We denote the joint smoothing distribution by $\phi_{0:k|k}$, omitting the dependence with respect to the initial distribution $\nu$, which does not play an important role here. According to (4.1), the joint smoothing distribution may be updated recursively in time according to the relations
$\phi_0(f) = \dfrac{\int f(x_0)\,g_0(x_0)\,\nu(dx_0)}{\int g_0(x_0)\,\nu(dx_0)}$ for all $f \in \mathrm{F}_b(X)$ ,
$\phi_{0:k+1|k+1}(f_{k+1}) = \int\!\!\int f_{k+1}(x_{0:k+1})\,\phi_{0:k|k}(dx_{0:k})\,T_k^u(x_k, dx_{k+1})$ for all $f_{k+1} \in \mathrm{F}_b(X^{k+2})$ ,   (7.7)
where $T_k^u$ is the transition kernel on $(X, \mathcal{X})$ defined by
$T_k^u(x, f) = \dfrac{L_k}{L_{k+1}} \int f(x')\,Q(x, dx')\,g_{k+1}(x')$ for all $x \in X$, $f \in \mathrm{F}_b(X)$ .   (7.8)
The superscript u (for unnormalized) in the notation $T_k^u$ is meant to highlight the fact that $T_k^u$ is not a probability transition kernel. This distinction is important here because the normalized version $T_k = T_k^u/T_k^u(\cdot, 1)$ of the kernel will play an important role in the following. Note that except in some special cases discussed in Chapter 5, the likelihood ratio $L_{k+1}/L_k$ can generally not be computed in closed form, rendering analytic evaluation of $T_k^u$ or $\phi_{0:k|k}$ hopeless.
The rest of this section reviews importance sampling methods that make it possible to approximate $\phi_{0:k|k}$ recursively in k. First, because importance sampling can be used when the target distribution is known only up to a scaling factor, the presence of non-computable constants such as $L_{k+1}/L_k$ does not preclude the use of the algorithm. Next, it is convenient to choose the instrumental distribution as the probability measure associated with a possibly non-homogeneous Markov chain on X. As seen below, this will make it possible to derive a sequential version of the importance sampling technique. Let $\{R_k\}_{k\geq 0}$ denote a family of Markov transition kernels on $(X, \mathcal{X})$ and let $\rho_0$ denote a probability measure on $(X, \mathcal{X})$. Further denote by $\{\rho_{0:k}\}_{k\geq 0}$ the family of probability measures associated with the inhomogeneous Markov chain with initial distribution $\rho_0$ and transition kernels $\{R_k\}_{k\geq 0}$,
$\rho_{0:k}(f_k) \stackrel{\mathrm{def}}{=} \int f_k(x_{0:k})\,\rho_0(dx_0)\prod_{l=0}^{k-1} R_l(x_l, dx_{l+1})$ .

In this context, the kernels Rk will be referred to as the instrumental kernels. The term importance kernel is also used. The following assumptions will be adopted in the sequel.


Assumption 7.2.1 (Sequential Importance Sampling).
1. The target distribution $\phi_0$ is absolutely continuous with respect to the instrumental distribution $\rho_0$.
2. For all $k \geq 0$ and all $x \in X$, the measure $T_k^u(x, \cdot)$ is absolutely continuous with respect to $R_k(x, \cdot)$.

Then for any $k \geq 0$ and any function $f_k \in \mathrm{F}_b(X^{k+1})$,
$\phi_{0:k|k}(f_k) = \int f_k(x_{0:k})\,\frac{d\phi_0}{d\rho_0}(x_0)\prod_{l=0}^{k-1}\frac{dT_l^u(x_l, \cdot)}{dR_l(x_l, \cdot)}(x_{l+1})\;\rho_{0:k}(dx_{0:k})$ ,   (7.9)
which implies that the target distribution $\phi_{0:k|k}$ is absolutely continuous with respect to the instrumental distribution $\rho_{0:k}$ with Radon-Nikodym derivative given by
$\dfrac{d\phi_{0:k|k}}{d\rho_{0:k}}(x_{0:k}) = \dfrac{d\phi_0}{d\rho_0}(x_0)\prod_{l=0}^{k-1}\dfrac{dT_l^u(x_l, \cdot)}{dR_l(x_l, \cdot)}(x_{l+1})$ .   (7.10)
It is thus legitimate to use $\rho_{0:k}$ as an instrumental distribution to compute importance sampling estimates for integrals with respect to $\phi_{0:k|k}$. Denoting by $\xi_{0:k}^1, \dots, \xi_{0:k}^N$ N i.i.d. random sequences with common distribution $\rho_{0:k}$, the importance sampling estimate of $\phi_{0:k|k}(f_k)$ for $f_k \in \mathrm{F}_b(X^{k+1})$ is defined as
$\hat\phi^{\mathrm{IS}}_{0:k|k}(f_k) = \dfrac{\sum_{i=1}^N \omega_k^i\, f_k(\xi_{0:k}^i)}{\sum_{i=1}^N \omega_k^i}$ ,   (7.11)
where $\omega_k^i$ are the unnormalized importance weights defined recursively by
$\omega_0^i = \dfrac{d\phi_0}{d\rho_0}(\xi_0^i)$ for i = 1, ..., N ,   (7.12)
and, for $k \geq 0$,
$\omega_{k+1}^i = \omega_k^i\,\dfrac{dT_k^u(\xi_k^i, \cdot)}{dR_k(\xi_k^i, \cdot)}(\xi_{k+1}^i)$ for i = 1, ..., N .   (7.13)

The multiplicative decomposition of the (unnormalized) importance weights in (7.13) implies that these weights may be computed recursively in time as successive observations become available. In the sequential Monte Carlo literature, the update factor $dT_k^u/dR_k$ is often called the incremental weight. As discussed previously in Section 7.1.1, the estimator in (7.11) is left unmodified if the weights, or equivalently the incremental weights, are evaluated up to a constant only. In particular, one may omit the problematic scaling factor $L_{k+1}/L_k$ that we met in the definition of $T_k^u$ in (7.8). The practical implementation of sequential importance sampling thus goes as follows.


Algorithm 7.2.2 (SIS: Sequential Importance Sampling).


Initial State: Draw an i.i.d. sample $\xi_0^1, \dots, \xi_0^N$ from $\rho_0$ and set
$\omega_0^i = g_0(\xi_0^i)\,\dfrac{d\nu}{d\rho_0}(\xi_0^i)$ for i = 1, ..., N .
Recursion: For k = 0, 1, ...,
Draw $(\xi_{k+1}^1, \dots, \xi_{k+1}^N)$ conditionally independently given $\{\xi_{0:k}^j,\ j = 1, \dots, N\}$ from the distributions $\xi_{k+1}^i \sim R_k(\xi_k^i, \cdot)$. Append $\xi_{k+1}^i$ to $\xi_{0:k}^i$ to form $\xi_{0:k+1}^i = (\xi_{0:k}^i, \xi_{k+1}^i)$. Compute the updated importance weights
$\omega_{k+1}^i = \omega_k^i\, g_{k+1}(\xi_{k+1}^i)\,\dfrac{dQ(\xi_k^i, \cdot)}{dR_k(\xi_k^i, \cdot)}(\xi_{k+1}^i)$ ,  i = 1, ..., N .

At any iteration index k, importance sampling estimates may be evaluated according to (7.11).
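The structure of Algorithm 7.2.2 is generic; the following sketch (names and interfaces are ours) keeps the instrumental kernel and the incremental weight as user-supplied functions.

```python
import numpy as np

def sis(n_obs, sample_rho0, dnu_drho0, g0, sample_Rk, incr_weight, N, rng):
    """Sequential importance sampling skeleton (a sketch of Algorithm 7.2.2).

    sample_rho0(N, rng)       -> initial particles xi_0^i
    dnu_drho0(x)              -> d(nu)/d(rho_0) evaluated at x
    g0(x)                     -> g_0(x) = g(x, Y_0)
    sample_Rk(k, x, rng)      -> one draw from R_k(x, .) for each particle in x
    incr_weight(k, x, x_new)  -> g_{k+1}(x_new) * dQ(x,.)/dR_k(x,.)(x_new)
    """
    xi = sample_rho0(N, rng)
    w = g0(xi) * dnu_drho0(xi)
    trajectories = [xi]
    for k in range(n_obs - 1):
        xi_new = sample_Rk(k, xi, rng)
        w = w * incr_weight(k, xi, xi_new)
        xi = xi_new
        trajectories.append(xi)
    return np.array(trajectories), w    # weights are unnormalized, cf. (7.11)
```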

Fig. 7.2. Principle of sequential importance sampling (SIS). Upper plot: the curve represents the filtering distribution, and the particles with weights are represented along the axis by bullets, the radii of which are proportional to the normalized weight of the particle. Middle plot: the instrumental distribution with resampled particle positions. Bottom plot: filtering distribution at the next time index with updated particle weights. The case depicted here corresponds to the choice $R_k = Q$.


An important feature of Algorithm 7.2.2, which corresponds to the method originally proposed in Handschin and Mayne (1969) and Handschin (1970), is that the N trajectories $\xi_{0:k}^1, \dots, \xi_{0:k}^N$ are independent and identically distributed for all time indices k. Following the terminology in use in the non-linear filtering community, we shall refer to the sample at time index k, $\xi_k^1, \dots, \xi_k^N$, as the population (or system) of particles and to $\xi_{0:k}^i$ for a specific value of the particle index i as the history (or trajectory) of the ith particle. The principle of the method is illustrated in Figure 7.2.

7.2.2 Choice of the Instrumental Kernel

Before discussing in Section 7.3 a serious drawback of Algorithm 7.2.2 that needs to be fixed in order for the method to be applied to any problem of practical interest, we examine strategies that may be helpful in selecting proper instrumental kernels $R_k$ in several models (or families of models) of interest.

7.2.2.1 Prior Kernel

The first obvious and often very simple choice of instrumental kernel $R_k$ is that of setting $R_k = Q$ (irrespectively of k). In that case, the instrumental kernel simply corresponds to the prior distribution of the new state in the absence of the corresponding observation. The incremental weight then simplifies to
$\dfrac{dT_k^u(x, \cdot)}{dQ(x, \cdot)}(x') = \dfrac{L_k}{L_{k+1}}\,g_{k+1}(x') \propto g_{k+1}(x')$ for all $(x, x') \in X^2$ .   (7.14)

A distinctive feature of the prior kernel is that the incremental weight in (7.14) does not depend on x, that is, on the previous position. The use of the prior kernel $R_k = Q$ is popular because sampling from the prior kernel Q is often straightforward, and computing the incremental weight simply amounts to evaluating the conditional likelihood of the new observation given the current particle position. The prior kernel also satisfies the minimal requirement of importance sampling as stated in Assumption 7.2.1. In addition, because the importance function reduces to $g_{k+1}$, it is upper-bounded as soon as one can assume that $\sup_{x \in X, y \in Y} g(x, y)$ is finite, which (often) is a very mild condition (see also Section 9.1). Despite these appealing properties, the use of the prior kernel can sometimes lead to poor performance, often manifesting itself as a lack of robustness with respect to the values taken by the observed sequence $\{Y_k\}_{k\geq 0}$. The following example illustrates this problem in a very simple situation.

Example 7.2.3 (Noisy AR(1) Model). To illustrate the potential problems associated with the use of the prior kernel, Pitt and Shephard (1999) consider the simple model where the observations arise from a first-order linear autoregression observed in noise,
$X_{k+1} = \phi X_k + \sigma_U U_k$ ,  $U_k \sim \mathrm{N}(0, 1)$ ,
$Y_k = X_k + \sigma_V V_k$ ,  $V_k \sim \mathrm{N}(0, 1)$ ,
where $\phi = 0.9$, $\sigma_U^2 = 0.01$, $\sigma_V^2 = 1$, and $\{U_k\}_{k\geq 0}$ and $\{V_k\}_{k\geq 0}$ are independent Gaussian white noise processes. The initial distribution is the stationary distribution of the Markov chain $\{X_k\}_{k\geq 0}$, that is, normal with zero mean and variance $\sigma_U^2/(1 - \phi^2)$. In the following, we assume that n = 5 and simulate the first five observations from the model, whereas the sixth observation is set to the arbitrary value 20. The observed series is
(0.652, 0.345, 0.676, 1.142, 0.721, 20) .
The last observation is located 20 standard deviations away from the mean (zero) of the stationary distribution, which definitely corresponds to an aberrant value from the model's point of view. In a practical situation however, we would of course like to be able to handle also data that does not necessarily come from the model under consideration. Note also that in this toy example, one can evaluate the exact smoothing distributions by means of the Kalman filtering recursion discussed in Section 5.2.
Figure 7.3 displays box and whisker plots for the SIS estimate of the posterior mean of the final state $X_5$ as a function of the number N of particles when using the prior kernel. These plots have been obtained from 125 independent replications of the SIS algorithm. The vertical line corresponds to the true posterior mean of $X_5$ given $Y_{0:5}$, computed using the Kalman filter.
Fig. 7.3. Box and whisker plot of the posterior mean estimate of $X_5$ obtained from 125 replications of the SIS filter using the prior kernel and increasing numbers of particles. The horizontal line represents the true posterior mean.


The figure shows that the SIS algorithm with the prior kernel grossly underestimates the values of the state even when the number of particles is very large. This is a case where there is a conflict between the prior distribution and the posterior distribution: under the instrumental distribution, all particles are proposed in a region where the conditional likelihood function $g_5$ is extremely low. In that case, the renormalization of the weights used to compute the filtered mean estimate according to (7.11) may even have unexpectedly adverse consequences: a weight close to 1 does not necessarily correspond to a simulated value that is important for the distribution of interest. Rather, it is a weight that is large relative to other, even smaller weights (of particles even less important for the filtering distribution). This is a logical consequence of the fact that the weights must sum to one.
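A sketch of this experiment, SIS with the prior kernel for the noisy AR(1) model and the observed series given above (a single run, whereas the figure summarizes 125 replications):

```python
import numpy as np

rng = np.random.default_rng(0)
phi, sig_u, sig_v = 0.9, np.sqrt(0.01), 1.0
y = np.array([0.652, 0.345, 0.676, 1.142, 0.721, 20.0])
N = 1000

# initial particles from the stationary distribution, initial weights from g_0
xi = rng.normal(0.0, sig_u / np.sqrt(1 - phi**2), size=N)
logw = -0.5 * ((y[0] - xi) / sig_v) ** 2
for k in range(1, len(y)):
    xi = phi * xi + sig_u * rng.normal(size=N)     # prior kernel R_k = Q
    logw += -0.5 * ((y[k] - xi) / sig_v) ** 2      # incremental weight g_k, cf. (7.14)
w = np.exp(logw - logw.max())
w /= w.sum()
print("estimated posterior mean of X_5:", np.sum(w * xi))
```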

7.2.2.2 Optimal Instrumental Kernel

The mismatch between the instrumental distribution and the posterior distribution observed in the previous example is the type of problem that one should try to alleviate by a proper choice of the instrumental kernel. An interesting choice to address this problem is the kernel
$T_k(x, f) = \dfrac{\int f(x')\,Q(x, dx')\,g_{k+1}(x')}{\int Q(x, dx')\,g_{k+1}(x')}$ for $x \in X$, $f \in \mathrm{F}_b(X)$ ,   (7.15)

which is just $T_k^u$ defined in (7.8) properly normalized to correspond to a Markov transition kernel (that is, $T_k(x, 1) = 1$ for all $x \in X$). The kernel $T_k$ may be interpreted as a regular version of the conditional distribution of the hidden state $X_{k+1}$ given $X_k$ and the current observation $Y_{k+1}$. In the sequel, we will refer to this kernel as the optimal kernel, following the terminology found in the sequential importance sampling literature. This terminology dates back probably to Zaritskii et al. (1975) and Akashi and Kumamoto (1977) and is largely adopted by authors such as Liu and Chen (1995), Chen and Liu (2000), Doucet et al. (2000a), Doucet et al. (2001a), and Tanizaki (2003). The word optimal is somewhat misleading, and we refer to Chapter 9 for a more precise discussion of optimality of the instrumental distribution in the context of importance sampling (which generally has to be defined for a specific choice of the function f of interest). The main property of $T_k$ as defined in (7.15) is that
$\dfrac{dT_k^u(x, \cdot)}{dT_k(x, \cdot)}(x') = \dfrac{L_k}{L_{k+1}}\,\gamma_k(x) \propto \gamma_k(x)$ for $(x, x') \in X^2$ ,   (7.16)
where $\gamma_k(x)$ is the denominator of $T_k$ in (7.15):
$\gamma_k(x) \stackrel{\mathrm{def}}{=} \int Q(x, dx')\,g_{k+1}(x')$ .   (7.17)
7.2 Sequential Importance Sampling

221

Equation (7.16) means that the incremental weight in (7.13) now depends on the previous position of the particle only (and not on the new position proposed at index k + 1). This is the exact opposite of the situation observed previously for the prior kernel. The optimal kernel (7.15) is attractive because it incorporates information both on the state dynamics and on the current observation: the particles move blindly with the prior kernel, whereas they tend to cluster into regions where the current local likelihood $g_{k+1}$ is large when using the optimal kernel.
There are however two problems with using $T_k$ in practice. First, drawing from this kernel is usually not directly feasible. Second, calculation of the incremental importance weight $\gamma_k$ in (7.17) may be analytically intractable. Of course, the optimal kernel takes a simple form, with easy simulation and explicit evaluation of (7.17), in the particular cases discussed in Chapter 5. It turns out that it can also be evaluated for a slightly larger class of non-linear Gaussian state-space models, as soon as the observation equation is linear (Zaritskii et al., 1975). Indeed, consider the state-space model with non-linear state evolution equation
$X_{k+1} = A(X_k) + R(X_k)U_k$ ,  $U_k \sim \mathrm{N}(0, I)$ ,   (7.18)
$Y_k = BX_k + SV_k$ ,  $V_k \sim \mathrm{N}(0, I)$ ,   (7.19)
where A and R are matrix-valued functions of appropriate dimensions. By application of Proposition 5.2.2, the conditional distribution of the state vector $X_{k+1}$ given $X_k = x$ and $Y_{k+1}$ is multivariate Gaussian with mean $m_{k+1}(x)$ and covariance matrix $\Sigma_{k+1}(x)$, given by
$K_{k+1}(x) = R(x)R^t(x)B^t\left[BR(x)R^t(x)B^t + SS^t\right]^{-1}$ ,
$m_{k+1}(x) = A(x) + K_{k+1}(x)\left[Y_{k+1} - BA(x)\right]$ ,
$\Sigma_{k+1}(x) = \left[I - K_{k+1}(x)B\right]R(x)R^t(x)$ .   (7.20)
Hence new particles $\xi_{k+1}^i$ need to be simulated from the distribution $\mathrm{N}\left(m_{k+1}(\xi_k^i), \Sigma_{k+1}(\xi_k^i)\right)$, and the incremental weight for the optimal kernel is proportional to
$\gamma_k(x) = \int q(x, x')\,g_{k+1}(x')\,dx' \propto |\Gamma_{k+1}(x)|^{-1/2}\exp\left(-\tfrac{1}{2}\left[Y_{k+1} - BA(x)\right]^t\Gamma_{k+1}^{-1}(x)\left[Y_{k+1} - BA(x)\right]\right)$ ,
where $\Gamma_{k+1}(x) = BR(x)R^t(x)B^t + SS^t$.
In other situations, sampling from the kernel $T_k$ and/or computing the normalizing constant $\gamma_k$ is a difficult task. There is no general recipe to solve this problem, but rather a set of possible solutions that should be considered.


Example 7.2.4 (Noisy AR(1) Model, Continued). We consider the noisy AR(1) model of Example 7.2.3 again using the optimal importance kernel, which corresponds to the particular case where all variables are scalar and A and R are constant in (7.18)-(7.19) above. Thus, the optimal instrumental transition density is given by
$t_k(x, \cdot) = \mathrm{N}\left(\dfrac{\sigma_U^2 Y_k + \sigma_V^2\,\phi x}{\sigma_U^2 + \sigma_V^2},\; \dfrac{\sigma_U^2\,\sigma_V^2}{\sigma_U^2 + \sigma_V^2}\right)$
and the incremental importance weights are proportional to
$\gamma_k(x) \propto \exp\left(-\dfrac{1}{2}\,\dfrac{(Y_k - \phi x)^2}{\sigma_U^2 + \sigma_V^2}\right)$ .
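In code, one SIS step with this optimal kernel can be written as follows (a self-contained sketch; variable names are ours).

```python
import numpy as np

def sis_step_optimal(xi, logw, y_k, phi, sig2_u, sig2_v, rng):
    """One SIS step with the optimal kernel for the noisy AR(1) model."""
    mean = (sig2_u * y_k + sig2_v * phi * xi) / (sig2_u + sig2_v)
    var = sig2_u * sig2_v / (sig2_u + sig2_v)
    xi_new = mean + np.sqrt(var) * rng.normal(size=xi.shape)
    # the incremental weight gamma_k depends on the *previous* position only
    logw_new = logw - 0.5 * (y_k - phi * xi) ** 2 / (sig2_u + sig2_v)
    return xi_new, logw_new
```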

Figure 7.4 is the exact analog of Figure 7.3, also obtained from 125 independent runs of the algorithm, for this new choice of instrumental kernel. The figure shows that whereas the SIS estimate of the posterior mean is still negatively biased, the optimal kernel tends to reduce the bias compared to the prior kernel. It also shows that as soon as N = 400, there are at least some particles located around the true filtered mean of the state, which means that the method should not get entirely lost as subsequent new observations arrive.

Fig. 7.4. Box and whisker plot of the posterior mean estimate for $X_5$ obtained from 125 replications of the SIS filter using the optimal kernel and increasing numbers of particles. Same data and axes as Figure 7.3.

To illustrate the advantages of the optimal kernel with respect to the prior kernel graphically, we consider the model (7.18)-(7.19) again with $\phi = 0.9$, $\sigma_u^2 = 0.4$, $\sigma_v^2 = 0.6$, and (0, 2.6, 0.6) as observed series (of length 3). The initial distribution is a mixture 0.6 N(1, 0.3) + 0.4 N(1, 0.4) of two Gaussians, for which it is still possible to evaluate the exact filtering distributions as the mixture of two Kalman filters using, respectively, N(1, 0.3) and N(1, 0.4) as the initial distribution of $X_0$. We use only seven particles to allow for an interpretable graphical representation. Figures 7.5 and 7.6 show the positions of the particles propagated using the prior kernel and the optimal kernel, respectively. At time 1, there is a conflict between the prior and the posterior as the observation does not agree with the particle approximation of the predictive distribution. With the prior kernel (Figure 7.5), the mass becomes concentrated on a single particle, with several particles lost out in the left tail of the distribution with negligible weights. In contrast, in Figure 7.6 most of the particles stay in high probability regions through the iterations, with several distinct particles having non-negligible weights. This is precisely because the optimal kernel pulls particles toward regions where the current local likelihood $g_k(x) = g(x, Y_k)$ is large, whereas the prior kernel does not.

Fig. 7.5. SIS using the prior kernel. The positions of the particles are indicated by circles whose radii are proportional to the normalized importance weights. The solid lines show the filtering distributions for three consecutive time indices.

Fig. 7.6. SIS using the optimal kernel (same data and display as in Figure 7.5).

7.2.2.3 Accept-Reject Algorithm

Because drawing from the optimal kernel $T_k$ is most often not feasible, a first natural idea consists in trying the accept-reject method (Algorithm 6.2.1), which is a versatile approach to sampling from general distributions. To sample from the optimal importance kernel $T_k(x, \cdot)$ defined by (7.15), one needs an instrumental kernel $R_k(x, \cdot)$ from which it is easy to sample and such that there exists M satisfying $\frac{dQ(x,\cdot)}{dR_k(x,\cdot)}(x')\,g_k(x') \leq M$ (for all $x \in X$). Note that because it is generally impossible to evaluate the normalizing constant $\gamma_k$ of $T_k$, we must resort here to the unnormalized version of the accept-reject algorithm (see Remark 6.2.4). The algorithm consists in generating pairs $(\xi, U)$ of independent random variables with $\xi \sim R_k(x, \cdot)$ and U uniformly distributed on [0, 1], and accepting $\xi$ if
$U \leq \dfrac{1}{M}\,\dfrac{dQ(x, \cdot)}{dR_k(x, \cdot)}(\xi)\,g_k(\xi)$ .
Recall that the distribution of the number of simulations required is geometric with parameter
$p(x) = \dfrac{\int Q(x, dx')\,g_k(x')}{M}$ .
The strength of the accept-reject technique is that, using any instrumental kernel $R_k$ satisfying the domination condition, one can obtain independent samples from the optimal importance kernel $T_k$. When the conditional likelihood of the observation $g_k(x)$, viewed as a function of x, is bounded, one can for example use the prior kernel Q as the instrumental distribution. In that case
$\dfrac{dT_k(x, \cdot)}{dQ(x, \cdot)}(x') = \dfrac{g_k(x')}{\int g_k(u)\,Q(x, du)} \leq \dfrac{\sup_{x'\in X} g_k(x')}{\int g_k(u)\,Q(x, du)}$ .


The algorithm then consists in drawing $\xi$ from the prior kernel $Q(x, \cdot)$ and U uniformly on [0, 1], and accepting the draw if $U \leq g_k(\xi)/\sup_{x\in X} g_k(x)$. The acceptance rate of this algorithm is then given by
$p(x) = \dfrac{\int_X Q(x, dx')\,g_k(x')}{\sup_{x'\in X} g_k(x')}$ .
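A sketch of this rejection sampler, drawing from $T_k(x, \cdot)$ with the prior kernel as the instrumental distribution (assuming $g_k$ is bounded with known supremum; names are ours):

```python
import numpy as np

def sample_optimal_via_prior(x, sample_Q, g_k, g_k_sup, rng, max_tries=10_000):
    """Draw from T_k(x, .) by accept-reject with the prior kernel Q(x, .)."""
    for _ in range(max_tries):
        xi = sample_Q(x, rng)                     # proposal from Q(x, .)
        if rng.uniform() <= g_k(xi) / g_k_sup:    # accept with probability g_k(xi)/sup g_k
            return xi
    raise RuntimeError("acceptance rate too low")
```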

Unfortunately, it is not always possible to design an importance kernel $R_k(x, \cdot)$ that is easy to sample from, for which the bound M is indeed finite, and such that the acceptance rate p(x) is reasonably large.

7.2.2.4 Local Approximation of the Optimal Importance Kernel

A different option consists in trying to approximate the optimal kernel $T_k$ by a simpler proposal kernel $R_k$ that is handy for simulating. Ideally, $R_k$ should be such that $R_k(x, \cdot)$ both has heavier tails than $T_k(x, \cdot)$ and is close to $T_k(x, \cdot)$ around its modes, with the aim of keeping the ratio $\frac{dT_k(x,\cdot)}{dR_k(x,\cdot)}(x')$ as small as possible. To do so, authors such as Pitt and Shephard (1999) and Doucet et al. (2000a) suggest to first locate the high-density regions of the optimal distribution $T_k(x, \cdot)$ and then use an over-dispersed (that is, with sufficiently heavy tails) approximation of $T_k(x, \cdot)$. The first part of this program mostly applies to the case where the distribution $T_k(x, \cdot)$ is known to be unimodal with a mode that can be located in some way. The overall procedure will need to be repeated N times, with x corresponding in turn to each of the current particles. Hence the method used to construct the approximation should be reasonably simple if the potential advantages of using a good proposal kernel are not to be offset by an unbearable increase in computational cost.
A first remark of interest is that there is a large class of state-space models for which the distribution $T_k(x, \cdot)$ can effectively be shown to be unimodal using convexity arguments. In the remainder of this section, we assume that $X = \mathbb{R}^d$ and that the hidden Markov model is fully dominated (in the sense of Definition 2.2.3), denoting by q the transition density function associated with the hidden chain. Recall that for a certain form of non-linear state-space models given by (7.18)-(7.19), we were able to derive the optimal kernel and its normalization constant explicitly. Now consider the case where the state evolves according to (7.18), so that
$q(x, x') \propto \exp\left(-\tfrac{1}{2}\,(x' - A(x))^t\left[R(x)R^t(x)\right]^{-1}(x' - A(x))\right)$ ,
and g(x, y) is simply constrained to be a log-concave function of its x argument. This of course includes the linear Gaussian observation model considered previously in (7.19), but also many other cases like the non-linear observation considered below in Example 7.2.5. Then the optimal transition density $t_k^u(x, x') = (L_{k+1}/L_k)^{-1}\,q(x, x')\,g_k(x')$ is also a log-concave function of


its x' argument, as its logarithm is the sum of two concave functions (and a constant term). This implies in particular that $x' \mapsto t_k^u(x, x')$ is unimodal and that its mode may be located using computationally efficient techniques such as Newton iterations.
The instrumental transition density function is usually chosen from a parametric family $\{r_\theta\}$ of densities indexed by a finite-dimensional parameter $\theta$. An obvious choice is the multivariate Gaussian distribution with mean m and covariance matrix $\Gamma$, in which case $\theta = (m, \Gamma)$. A better choice is a multivariate t-distribution with $\eta$ degrees of freedom, location m, and scale matrix $\Gamma$. Recall that the density of this distribution is proportional to
$r_\theta(x) \propto \left[\eta + (x - m)^t\Gamma^{-1}(x - m)\right]^{-(\eta + d)/2}$ .
The choice $\eta = 1$ corresponds to a Cauchy distribution. This is a conservative choice that ensures over-dispersion, but if X is high-dimensional, most draws from a multivariate Cauchy might be too far away from the mode to reasonably approximate the target distribution. In most situations, values such as $\eta = 4$ (three finite moments) are more reasonable, especially if the underlying model does not feature heavy-tailed distributions. Recall also that simulation from the multivariate t-distribution with $\eta$ degrees of freedom, location m, and scale $\Gamma$ can easily be achieved by first drawing from a multivariate Gaussian distribution with mean m and covariance $\Gamma$ and then dividing the outcome by the square root of an independent chi-square draw with $\eta$ degrees of freedom divided by $\eta$.
To choose the parameter $\theta$ of the instrumental distribution $r_\theta$, one should try to minimize the supremum of the importance function,
$\min_\theta \sup_{x'\in X}\,\dfrac{q(x, x')\,g_k(x')}{r_\theta(x')}$ .   (7.21)
This is a minimax guarantee by which $\theta$ is chosen to minimize an upper bound on the importance weights. Note that if $r_\theta$ was to be used for sampling from $t_k(x, \cdot)$ by the accept-reject algorithm, the value of $\theta$ for which the minimum is achieved in (7.21) is also the one that would make the acceptance probability maximal (see Section 6.2.1). In practice, solving the optimization problem in (7.21) is often too demanding, and a more generic strategy consists in locating the mode of $x' \mapsto t_k(x, x')$ by an iterative algorithm and evaluating the Hessian of its logarithm at the mode. The parameter $\theta$ is then selected in the following way.
Multivariate normal: fit the mean of the normal distribution to the mode of $t_k(x, \cdot)$ and fit the covariance to minus the inverse of the Hessian of $\log t_k(x, \cdot)$ at the mode.
Multivariate t-distribution: fit the location and scale parameters as the mean and covariance parameters in the normal case; the number of degrees of freedom is usually set arbitrarily (and independently of x) based on the arguments discussed above.
We discuss below an important model for which this strategy is successful.


Example 7.2.5 (Stochastic Volatility Model). We return to the stochastic volatility model introduced as Example 1.3.13 and considered previously in the context of MCMC methods as Example 6.2.16. From the state-space equations that define the model,
$X_{k+1} = \phi X_k + \sigma U_k$ ,
$Y_k = \beta\exp(X_k/2)\,V_k$ ,
we directly obtain
$q(x, x') = \dfrac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\dfrac{(x' - \phi x)^2}{2\sigma^2}\right)$ ,
$g_k(x') = \dfrac{1}{\sqrt{2\pi\beta^2}}\exp\left(-\dfrac{Y_k^2}{2\beta^2}\exp(-x') - \dfrac{x'}{2}\right)$ .
Simulating from the optimal transition kernel $t_k(x, x')$ is difficult, but the function $x' \mapsto \log(q(x, x')\,g_k(x'))$ is indeed (strictly) concave. The mode $m_k(x)$ of $x' \mapsto t_k(x, x')$ is the unique solution of the non-linear equation
$-\dfrac{1}{\sigma^2}(x' - \phi x) - \dfrac{1}{2} + \dfrac{Y_k^2}{2\beta^2}\exp(-x') = 0$ ,   (7.22)
which can be found using Newton iterations. Once at the mode, the (squared) scale $\sigma_k^2(x)$ is set as minus the inverse of the second-order derivative of $x' \mapsto \log(q(x, x')\,g_k(x'))$ evaluated at the mode $m_k(x)$. The result is
$\sigma_k^2(x) = \left[\dfrac{1}{\sigma^2} + \dfrac{Y_k^2}{2\beta^2}\exp\left[-m_k(x)\right]\right]^{-1}$ .   (7.23)

In this example, a t-distribution with $\eta = 5$ degrees of freedom was used, with location $m_k(x)$ and scale $\sigma_k(x)$ obtained as above. The incremental importance weight is then proportional to
$\sigma_k(x)\,\exp\left(-\dfrac{(x' - \phi x)^2}{2\sigma^2} - \dfrac{Y_k^2}{2\beta^2}\exp(-x') - \dfrac{x'}{2}\right)\left[1 + \dfrac{[x' - m_k(x)]^2}{\eta\,\sigma_k^2(x)}\right]^{(\eta+1)/2}$ .

As in the case of Example 6.2.16, the first time index (k = 0) is particular, and it is easily checked that $m_0$ is the solution of
$-\dfrac{1 - \phi^2}{\sigma^2}\,x - \dfrac{1}{2} + \dfrac{Y_0^2}{2\beta^2}\exp(-x) = 0$ ,
and $\sigma_0^2$ is given by
$\sigma_0^2 = \left[\dfrac{1 - \phi^2}{\sigma^2} + \dfrac{Y_0^2}{2\beta^2}\exp(-m_0)\right]^{-1}$ .
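A sketch of the mode search (7.22), the scale (7.23), and a draw from the resulting t proposal for one particle (parameter and function names are ours):

```python
import numpy as np

def sv_proposal(x_prev, y_k, phi, sigma, beta, rng, eta=5, n_newton=10):
    """Local t-approximation of the optimal kernel for the stochastic volatility model."""
    c = y_k ** 2 / (2 * beta ** 2)
    m = phi * x_prev                                      # starting point for Newton
    for _ in range(n_newton):                             # solve (7.22)
        grad = -(m - phi * x_prev) / sigma ** 2 - 0.5 + c * np.exp(-m)
        hess = -1.0 / sigma ** 2 - c * np.exp(-m)
        m -= grad / hess
    scale2 = 1.0 / (1.0 / sigma ** 2 + c * np.exp(-m))    # (7.23)
    # t-distributed draw: Gaussian divided by sqrt(chi-square / eta), shifted by the mode
    z = rng.normal() / np.sqrt(rng.chisquare(eta) / eta)
    return m + np.sqrt(scale2) * z
```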


Fig. 7.7. Waterfall representation of filtering distributions as estimated by SIS with N = 1,000 particles (densities estimated with Epanechnikov kernel, bandwidth 0.2). Data is the same as in Figure 6.8.

Figure 7.7 shows a typical example of the type of fit that can be obtained for the stochastic volatility model with this strategy using 1,000 particles. Note that although the data used is the same as in Figure 6.8, the estimated distributions displayed in both figures are not directly comparable, as the MCMC method in Figure 6.9 approximates the marginal smoothing distribution, whereas the sequential importance sampling approach used for Figure 7.7 provides a (recursive) approximation to the filtering distributions.
When there is no easy way to implement the local linearization technique, a natural idea explored by Doucet et al. (2000a) and Van der Merwe et al. (2000) consists in using classical non-linear filtering procedures to approximate $t_k$. These include in particular the so-called extended Kalman filter (EKF), which dates back to the 1970s (Anderson and Moore, 1979, Chapter 10), as well as the unscented Kalman filter (UKF) introduced by Julier and Uhlmann (1997); see, for instance, Ristic et al. (2004, Chapter 2) for a recent review of these techniques. We illustrate below the use of the extended Kalman filter in the context of sequential importance sampling. We now consider the most general form of the state-space model with Gaussian noises:


$X_{k+1} = a(X_k, U_k)$ ,  $U_k \sim \mathrm{N}(0, I)$ ,   (7.24)
$Y_k = b(X_k, V_k)$ ,  $V_k \sim \mathrm{N}(0, I)$ ,   (7.25)
where a, b are vector-valued measurable functions. It is assumed that $\{U_k\}_{k\geq 0}$ and $\{V_k\}_{k\geq 0}$ are independent white Gaussian noises. As usual, $X_0$ is assumed to be $\mathrm{N}(0, \Sigma_\nu)$ distributed and independent of $\{U_k\}$ and $\{V_k\}$. The extended Kalman filter proceeds by approximating the non-linear state-space equations (7.24)-(7.25) by a non-linear Gaussian state-space model with linear measurement equation. We are then back to a model of the form (7.18)-(7.19) for which the optimal kernel may be determined exactly using Gaussian formulas. We will adopt the approximation
$X_k \approx a(X_{k-1}, 0) + R(X_{k-1})U_{k-1}$ ,   (7.26)
$Y_k \approx b\left[a(X_{k-1}, 0), 0\right] + B(X_{k-1})\left[X_k - a(X_{k-1}, 0)\right] + S(X_{k-1})V_k$ ,   (7.27)
where R(x) is the $d_x \times d_u$ matrix of partial derivatives of a(x, u) with respect to u, evaluated at (x, 0),
$[R(x)]_{i,j} \stackrel{\mathrm{def}}{=} \dfrac{\partial [a(x, 0)]_i}{\partial u_j}$ for $i = 1, \dots, d_x$ and $j = 1, \dots, d_u$ ;
B(x) and S(x) are the $d_y \times d_x$ and $d_y \times d_v$ matrices of partial derivatives of b(x, v) with respect to x and v, respectively, evaluated at (a(x, 0), 0),
$[B(x)]_{i,j} = \dfrac{\partial\{b\left[a(x, 0), 0\right]\}_i}{\partial x_j}$ for $i = 1, \dots, d_y$ and $j = 1, \dots, d_x$ ,
$[S(x)]_{i,j} = \dfrac{\partial\{b\left[a(x, 0), 0\right]\}_i}{\partial v_j}$ for $i = 1, \dots, d_y$ and $j = 1, \dots, d_v$ .
It should be stressed that the measurement equation in (7.27) differs from (7.19) in that it depends both on the current state $X_k$ and on the previous one $X_{k-1}$. The approximate model specified by (7.26)-(7.27) thus departs from the HMM assumptions. On the other hand, when conditioning on the value of $X_{k-1}$, the structures of both models, (7.18)-(7.19) and (7.26)-(7.27), are exactly similar. Hence the posterior distribution of the state $X_k$ given $X_{k-1} = x$ and $Y_k$ is a Gaussian distribution with mean $m_k(x)$ and covariance matrix $\Sigma_k(x)$, which can be evaluated according to
$K_k(x) = R(x)R^t(x)B^t(x)\left[B(x)R(x)R^t(x)B^t(x) + S(x)S^t(x)\right]^{-1}$ ,
$m_k(x) = a(x, 0) + K_k(x)\left\{Y_k - b\left[a(x, 0), 0\right]\right\}$ ,
$\Sigma_k(x) = \left[I - K_k(x)B(x)\right]R(x)R^t(x)$ .
The Gaussian distribution with mean $m_k(x)$ and covariance $\Sigma_k(x)$ may then be used as a proxy for the optimal transition kernel $T_k(x, \cdot)$.

230

7 Sequential Monte Carlo Methods

the robustness of the method, it is safe to increase the variance, that is, to use cΓ_k(x) as the simulation variance, where c is a scalar larger than one. A perhaps more recommendable option consists in using, as previously, a proposal distribution with tails heavier than the Gaussian, for instance, a multivariate t-distribution with location m_k(x), scale Γ_k(x), and four or five degrees of freedom.

Example 7.2.6 (Growth Model). We consider the univariate growth model discussed by Kitagawa (1987) and Polson et al. (1992) given, in state-space form, by

    X_k = a_{k-1}(X_{k-1}) + σ_u U_{k-1},   U_k ~ N(0, 1),   (7.28)
    Y_k = b X_k² + σ_v V_k,   V_k ~ N(0, 1),   (7.29)

where {U_k}_{k≥0} and {V_k}_{k≥0} are independent white Gaussian noise processes and

    a_{k-1}(x) = α_0 x + α_1 x/(1 + x²) + α_2 cos[1.2(k − 1)]   (7.30)

with α_0 = 0.5, α_1 = 25, α_2 = 8, b = 0.05, and σ_v² = 1 (the value of σ_u² will be discussed below). The initial state is known deterministically and set to X_0 = 0.1. This model is non-linear both in the state and in the measurement equation. Note that the form of the likelihood adds an interesting twist to the problem: whenever Y_k ≤ 0, the conditional likelihood function

    g_k(x) = g(x; Y_k) ∝ exp[ −b² (x² − Y_k/b)² / (2σ_v²) ]

is unimodal and symmetric about 0; when Y_k > 0 however, the likelihood g_k is symmetric about 0 with two modes located at ±(Y_k/b)^{1/2}. The EKF approximation to the optimal transition kernel is a Gaussian distribution with mean m_k(x) and variance γ_k(x) given by

    K_k(x) = 2σ_u² b a_{k-1}(x) [4σ_u² b² a²_{k-1}(x) + σ_v²]^{-1},
    m_k(x) = a_{k-1}(x) + K_k(x) [Y_k − b a²_{k-1}(x)],
    γ_k(x) = σ_v² σ_u² / [4σ_u² b² a²_{k-1}(x) + σ_v²].
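As an illustration of how this proposal is used in practice, here is a minimal sketch (Python with NumPy; the function names are ours, the parameter values are those listed above) that computes m_k(x) and γ_k(x) for the growth model and draws the corresponding particle moves, with an optional variance inflation factor c > 1 as recommended in the text.

```python
import numpy as np

# Growth model parameters from Example 7.2.6
alpha0, alpha1, alpha2, b, sigma_v2 = 0.5, 25.0, 8.0, 0.05, 1.0

def a_prev(x, k):
    """State transition mean a_{k-1}(x) of (7.30)."""
    return alpha0 * x + alpha1 * x / (1.0 + x ** 2) + alpha2 * np.cos(1.2 * (k - 1))

def ekf_proposal(x, y, k, sigma_u2):
    """EKF approximation of the optimal kernel: mean m_k(x) and variance gamma_k(x)."""
    a = a_prev(x, k)
    denom = 4.0 * sigma_u2 * b ** 2 * a ** 2 + sigma_v2
    gain = 2.0 * sigma_u2 * b * a / denom          # Kalman gain K_k(x)
    m = a + gain * (y - b * a ** 2)                # proposal mean m_k(x)
    gamma = sigma_v2 * sigma_u2 / denom            # proposal variance gamma_k(x)
    return m, gamma

def ekf_move(particles, y, k, sigma_u2, rng, c=1.2):
    """Propagate an array of particles using the (slightly over-dispersed) EKF proposal."""
    m, gamma = ekf_proposal(particles, y, k, sigma_u2)
    return m + np.sqrt(c * gamma) * rng.standard_normal(particles.shape)
```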

In Figure 7.8, the optimal kernel, the EKF approximation to the optimal kernel, and the prior kernel are compared for two different values of the state noise variance. This figure corresponds to the time index one, and Y_1 is set to 6 (recall that the initial state X_0 is equal to 0.1). In the case where σ_u² = 1 (left plot in Figure 7.8), the prior distribution of the state, N(a_0(X_0), σ_u²), turns out to be more informative (more peaky, less diffuse) than the conditional likelihood g_1. In other words, the observed Y_1 does not carry a lot of information about the state X_1, compared to the information provided by X_0.


Fig. 7.8. Log-density of the optimal kernel (solid line), EKF approximation of the optimal kernel (dashed-dotted line), and the prior kernel (dashed line) for two different values of the state noise variance σ_u²: left, σ_u² = 1; right, σ_u² = 10.
This is because the measurement variance σ_v² is not small compared to σ_u². The optimal transition kernel, which does take Y_1 into account, is then very close to the prior kernel, and the differences between the three kernels are minor. In such a situation, one should not expect much improvement with the EKF approximation compared to the prior kernel.

In the case shown in the right plot of Figure 7.8 (σ_u² = 10), the situation is reversed. Now σ_v² is relatively small compared to σ_u², so that the information about X_1 contained in g_1 is large compared to that provided by the prior information on X_0. This is the kind of situation where we expect the optimal kernel to improve considerably on the prior kernel. Indeed, because Y_1 > 0, the optimal kernel is bimodal, with the second mode far smaller than the first one (recall that the plots are on log-scale); the EKF kernel correctly picks the dominant mode. Figure 7.8 also illustrates the fact that, in contrast to the prior kernel, the EKF kernel does not necessarily dominate the optimal kernel in the tails; hence the need to simulate from an over-dispersed version of the EKF approximation as discussed above.

7.3 Sequential Importance Sampling with Resampling


Despite quite successful results for short data records, as was observed in Example 7.2.5, it turns out that the sequential importance sampling approach discussed so far is bound to fail in the long run. We first substantiate this claim with a simple illustrative example before examining solutions to this shortcoming based on the concept of resampling introduced in Section 7.1.2.

7.3.1 Weight Degeneracy
The intuitive interpretation of the importance sampling weight ω_k^i is as a measure of the adequacy of the simulated trajectory ξ_{0:k}^i to the target distribution φ_{0:k|n}. A small importance weight implies that the trajectory is drawn far from the main body of the posterior distribution φ_{0:k|n} and will contribute only moderately to the importance sampling estimates of the form (7.11). Indeed, a particle such that the associated weight ω_k^i is orders of magnitude smaller than the sum Σ_{i=1}^N ω_k^i is practically ineffective. If there are too many ineffective particles, the particle approximation becomes both computationally and statistically inefficient: most of the computing effort is put on updating particles and weights that do not contribute significantly to the estimator; the variance of the resulting estimator will not reflect the large number of terms in the sum but only the small number of particles with non-negligible normalized weights.

Unfortunately, the situation described above is the rule rather than the exception, as the importance weights will (almost always) degenerate as the time index k increases, with most of the normalized importance weights ω_k^i / Σ_{j=1}^N ω_k^j close to 0 except for a few ones. We consider below the case of i.i.d. models for which it is possible to show, using simple arguments, that the large sample variance of the importance sampling estimate can only increase with the time index k.

Example 7.3.1 (Weight Degeneracy in the I.I.D. Case). The simplest case of application of the sequential importance sampling technique is when μ is a probability distribution on (X, X) and the sequence of target distributions corresponds to the product distributions, that is, the sequence of distributions on (X^{k+1}, X^{⊗(k+1)}) defined recursively by μ_0 = μ and μ_k = μ_{k-1} ⊗ μ for k ≥ 1. Let ν be another probability distribution on (X, X) and assume that μ is absolutely continuous with respect to ν and that

    ∫ [dμ/dν(x)]² ν(dx) < ∞.   (7.31)

Finally, let f be a bounded measurable function that is not (ν-a.s.) constant, so that its variance under μ, μ(f²) − μ²(f), is strictly positive. Consider the sequential importance sampling estimate given by

    μ̂_{k,N}^{IS}(f) = Σ_{i=1}^N f(ξ_k^i) [ Π_{l=0}^k (dμ/dν)(ξ_l^i) ] / [ Σ_{j=1}^N Π_{l=0}^k (dμ/dν)(ξ_l^j) ],   (7.32)

where the random variables {ξ_l^j}, l = 0, ..., k, j = 1, ..., N, are i.i.d. with common distribution ν. As discussed in Section 7.2, the unnormalized importance weights may be computed recursively, and hence (7.32) really corresponds to an estimator of the form (7.11) in the particular case of a function f_k that depends on the last component only. This is of course a rather convoluted and very inefficient way of constructing an estimate of μ(f), but it still constitutes a valid instance of the sequential importance sampling approach (in a very particular case).
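Before turning to the formal argument, a small simulation makes the phenomenon tangible. In the sketch below (Python with NumPy), the choices μ = N(1, 1), ν = N(0, 4), and f(x) = x are illustrative assumptions of ours satisfying (7.31); the empirical standard deviation of the estimator (7.32) is seen to grow rapidly with the time index k.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_dmu_dnu(x):
    # log of dmu/dnu for mu = N(1, 1) and nu = N(0, 2**2)
    return -0.5 * (x - 1.0) ** 2 + 0.5 * (x / 2.0) ** 2 + np.log(2.0)

N, kmax, nrep = 1000, 30, 200
errors = np.zeros((kmax, nrep))
for r in range(nrep):
    xi = rng.normal(0.0, 2.0, size=(kmax, N))      # xi_l^i drawn i.i.d. from nu
    logw = np.cumsum(log_dmu_dnu(xi), axis=0)      # log prod_{l<=k} dmu/dnu(xi_l^i)
    for k in range(kmax):
        w = np.exp(logw[k] - logw[k].max())        # stabilized unnormalized weights
        errors[k, r] = np.sum(w * xi[k]) / np.sum(w) - 1.0   # estimate minus mu(f) = 1
print(np.round(np.std(errors, axis=1), 3))         # std of the estimate vs. time index k
```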


Now let k be fixed and write

    N^{1/2} [ μ̂_{k,N}^{IS}(f) − μ(f) ] = { N^{-1/2} Σ_{i=1}^N Π_{l=0}^k (dμ/dν)(ξ_l^i) [ f(ξ_k^i) − μ(f) ] } / { N^{-1} Σ_{i=1}^N Π_{l=0}^k (dμ/dν)(ξ_l^i) }.   (7.33)

Because

    E[ Π_{l=0}^k (dμ/dν)(ξ_l^i) ] = 1,

the weak law of large numbers implies that the denominator of the right-hand side of (7.33) converges to 1 in probability as N increases. Likewise, under (7.31), the central limit theorem shows that the numerator of the right-hand side of (7.33) converges in distribution to the normal N(0, σ_k²(f)) distribution, where

    σ_k²(f) = E[ Π_{l=0}^k ((dμ/dν)(ξ_l^1))² ( f(ξ_k^1) − μ(f) )² ]
            = [ ∫ ((dμ/dν)(x))² ν(dx) ]^k ∫ ((dμ/dν)(x))² [ f(x) − μ(f) ]² ν(dx).   (7.34)

Slutsky's lemma then implies that (7.33) also converges in distribution to the same N(0, σ_k²(f)) limit as N grows. Now Jensen's inequality implies that

    1 = [ ∫ (dμ/dν)(x) ν(dx) ]² ≤ ∫ ((dμ/dν)(x))² ν(dx),

with equality if and only if μ = ν. Therefore, if μ ≠ ν, the asymptotic variance σ_k²(f) grows exponentially with the iteration index k for all functions f such that

    ∫ ((dμ/dν)(x))² [ f(x) − μ(f) ]² ν(dx) = ∫ (dμ/dν)(x) [ f(x) − μ(f) ]² μ(dx) ≠ 0.

Because μ is absolutely continuous with respect to ν, μ({x ∈ X : (dμ/dν)(x) = 0}) = 0 and the last integral is null if and only if f has zero variance under μ. Thus in the i.i.d. case, the asymptotic variance of the importance sampling estimate (7.32) increases exponentially with the time index k as soon as the proposal and target differ (except for constant functions).

It is more difficult to characterize the degeneracy of the weights for general target and instrumental distributions. There have been some limited attempts to study this phenomenon more formally in some specific scenarios. In particular, Del Moral and Jacod (2001) have shown the degeneracy of the sequential importance sampling estimator of the posterior mean in Gaussian linear models when the instrumental kernel is the prior kernel. Such results


are in general difficult to derive (even in Gaussian linear models where most of the derivations can be carried out explicitly) and do not provide much additional insight. Needless to say, in practice, weight degeneracy is a prevalent and serious problem, making the vanilla sequential importance sampling method discussed so far almost useless. The degeneracy can occur after a very limited number of iterations, as illustrated by the following example.


Fig. 7.9. Histograms of the base 10 logarithm of the normalized importance weights after (from top to bottom) 1, 10, and 100 iterations for the stochastic volatility model of Example 7.2.5. Note that the vertical scale of the bottom panel has been multiplied by 10.

Example 7.3.2 (Stochastic Volatility Model, Continued). Figure 7.9 displays the histogram of the base 10 logarithm of the normalized importance weights after 1, 10, and 100 time indices for the stochastic volatility model considered in Example 7.2.5 (using the same instrumental kernel). The number of particles is set to 1,000. Figure 7.9 shows that, despite the choice of a reasonably good approximation to the optimal importance kernel, the normalized importance weights quickly degenerate as the number of iterations of the SIS algorithm increases. Clearly, the results displayed in Figure 7.7 still are reasonable for k = 20 but would be disastrous for larger time horizons such as k = 100.

7.3 Sequential Importance Sampling with Resampling

235

Because the weight degeneracy phenomenon is so detrimental, it is of great practical significance to set up tests that can detect it. A simple criterion is the coefficient of variation of the normalized weights used by Kong et al. (1994), which is defined by

    CV_N = [ (1/N) Σ_{i=1}^N ( N ω^i / Σ_{j=1}^N ω^j − 1 )² ]^{1/2}.   (7.35)

The coefficient of variation is minimal when the normalized weights are all equal to 1/N, and then CV_N = 0. The maximal value of CV_N is √(N − 1), which corresponds to one of the normalized weights being one and all others being null. Therefore, the coefficient of variation is often interpreted as a measure of the number of ineffective particles (those that do not significantly contribute to the estimate). A related criterion with a simpler interpretation is the so-called effective sample size N_eff (Liu, 1996), defined as

    N_eff = [ Σ_{i=1}^N ( ω^i / Σ_{j=1}^N ω^j )² ]^{-1},   (7.36)

which varies between 1 (all weights null but one) and N (equal weights). It is straightforward to verify the relation

    N_eff = N / (1 + CV_N²).

Some additional insights and heuristics about the coefficient of variation are given by Liu and Chen (1995). Yet another possible measure of the weight imbalance is the Shannon entropy of the importance weights,

    Ent = − Σ_{i=1}^N ( ω^i / Σ_{j=1}^N ω^j ) log₂ ( ω^i / Σ_{j=1}^N ω^j ).   (7.37)

When all the normalized importance weights are null except for one of them, the entropy is null. On the contrary, if all the weights are equal to 1/N, then the entropy is maximal and equal to log₂ N.

Example 7.3.3 (Stochastic Volatility Model, Continued). Figure 7.10 displays the coefficient of variation (left) and Shannon entropy (right) as a function of the time index k under the same conditions as for Figure 7.9, that is, for the stochastic volatility model of Example 7.2.5. The figure shows that the distribution of the weights steadily degenerates: the coefficient of variation increases and the entropy of the importance weights decreases. After 100 iterations, there are fewer than 50 particles (out of 1,000) significantly contributing to the importance sampling estimator.



Fig. 7.10. Coefficient of variation (left) and entropy (right) of the normalized importance weights as a function of the number of iterations for the stochastic volatility model of Example 7.2.5. Same model and data as in Figure 7.9.

Most particles have importance weights that are zero to machine precision, which is of course a tremendous waste of computational resources.
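These diagnostics are immediate to compute from the weights. A minimal helper (Python with NumPy; the function name is ours) implementing (7.35)-(7.37) and the relation N_eff = N/(1 + CV_N²):

```python
import numpy as np

def weight_diagnostics(w):
    """Coefficient of variation (7.35), effective sample size (7.36) and
    Shannon entropy (7.37) of a vector of unnormalized importance weights."""
    w = np.asarray(w, dtype=float)
    wbar = w / w.sum()                        # normalized weights
    N = wbar.size
    cv = np.sqrt(np.mean((N * wbar - 1.0) ** 2))
    n_eff = 1.0 / np.sum(wbar ** 2)           # equals N / (1 + cv**2)
    logs = np.log2(wbar, where=wbar > 0, out=np.zeros_like(wbar))
    ent = -np.sum(wbar * logs)
    return cv, n_eff, ent
```

Resampling can then be triggered, for instance, whenever the coefficient of variation exceeds 1, which is the rule used in Example 7.3.6 below.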

7.3.2 Resampling

The solution proposed by Gordon et al. (1993) to reduce the degeneracy of the importance weights is based on the concept of resampling already discussed in the context of importance sampling in Section 7.1.2. The basic method consists in resampling in the current population of particles using the normalized weights as probabilities of selection. Thus, trajectories with small importance weights are eliminated, whereas those with large importance weights are duplicated. After resampling, all importance weights are reset to one. Up to the first instant when resampling occurs, the method can really be interpreted as an instance of the sampling importance resampling (SIR) technique discussed in Section 7.1.2. In the context of sequential Monte Carlo, however, the main motivation for resampling is to avoid future weight degeneracy by resetting (periodically) the weights to equal values.

The resampling step has a drawback however: as emphasized in Section 7.1.2, resampling introduces additional variance in Monte Carlo approximations. In some situations, the additional variance may be far from negligible: when the importance weights are already nearly equal, for instance, resampling can only reduce the number of distinct particles, thus degrading the accuracy of the Monte Carlo approximation. The one-step effect of resampling is thus negative but, in the long term, resampling is required to guarantee a stable behavior of the algorithm. This interpretation suggests that it may be advantageous to restrict the use of resampling to cases where the importance weights are becoming very uneven. The criteria defined in (7.35), (7.36), or (7.37) are of course helpful for that


purpose. The resulting algorithm, which is generally known under the name of sequential importance sampling with resampling (SISR), is summarized below.

Algorithm 7.3.4 (SISR: Sequential Importance Sampling with Resampling). Initialize the particles as in Algorithm 7.2.2, optionally applying the resampling step below. For subsequent time indices k ≥ 0, do the following.

Sampling:
  Draw (ξ̃_{k+1}^1, ..., ξ̃_{k+1}^N) conditionally independently given {ξ_{0:k}^j, j = 1, ..., N} from the instrumental kernel: ξ̃_{k+1}^i ~ R_k(ξ_k^i, ·), i = 1, ..., N.
  Compute the updated importance weights

      ω_{k+1}^i = ω_k^i g_{k+1}(ξ̃_{k+1}^i) [dQ(ξ_k^i, ·)/dR_k(ξ_k^i, ·)](ξ̃_{k+1}^i),   i = 1, ..., N.

Resampling (Optional):
  Draw, conditionally independently given {(ξ_{0:k}^i, ξ̃_{k+1}^j), i, j = 1, ..., N}, the multinomial trial (I_{k+1}^1, ..., I_{k+1}^N) with probabilities of success

      ω_{k+1}^1 / Σ_{j=1}^N ω_{k+1}^j, ..., ω_{k+1}^N / Σ_{j=1}^N ω_{k+1}^j.

  Reset the importance weights ω_{k+1}^i to a constant value for i = 1, ..., N.
  If resampling is not applied, set I_{k+1}^i = i for i = 1, ..., N.

Trajectory update: for i = 1, ..., N,

      ξ_{0:k+1}^i = ( ξ_{0:k}^{I_{k+1}^i}, ξ̃_{k+1}^{I_{k+1}^i} ).   (7.38)
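As a minimal concrete sketch (Python with NumPy; sample_prior and loglik are model-specific placeholders of ours), the following function performs one step of the algorithm with the prior kernel Q used as instrumental kernel, so that the incremental weight reduces to the conditional likelihood; only the current generation of particles is stored, and resampling is triggered by the coefficient of variation (7.35).

```python
import numpy as np

def sisr_step(particles, logw, y_next, sample_prior, loglik, rng, cv_threshold=1.0):
    """One SISR step with the prior kernel as instrumental kernel.
    sample_prior(x, rng) draws from Q(x, .), loglik(x, y) returns log g(x; y)."""
    # Sampling: propagate each particle through the prior kernel
    new_particles = sample_prior(particles, rng)
    # Weight update: incremental weight is the conditional likelihood
    logw = logw + loglik(new_particles, y_next)
    w = np.exp(logw - logw.max())
    wbar = w / w.sum()
    # Optional resampling, triggered by the coefficient of variation (7.35)
    N = wbar.size
    cv = np.sqrt(np.mean((N * wbar - 1.0) ** 2))
    if cv > cv_threshold:
        idx = rng.choice(N, size=N, p=wbar)       # multinomial selection
        new_particles = new_particles[idx]
        logw = np.zeros(N)                        # weights reset to a constant value
    return new_particles, logw
```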

As discussed previously, the resampling step in the algorithm above may be used systematically (for all indices k), but it is often preferable to perform resampling from time to time only. Usually, resampling is either used systematically but at a lower rate (for one index out of m, where m is fixed) or at random instants based on the values of the coefficient of variation or the entropy criteria defined in (7.35) and (7.37), respectively. Note that in addition to arguments based on the variance of the Monte Carlo approximation, there is usually also a computational incentive for limiting the use of resampling; indeed, except in models where the evaluation of the incremental weights is costly (think of large-dimensional multivariate observations for instance), the computational cost of the resampling step is not negligible. Both Sections 7.4.1 and 7.4.2 discuss several implementations and variants of the resampling step that may render the latter argument less compelling.

The term particle filter is often used to refer to Algorithm 7.3.4, although the terminology SISR is preferable, as particle filtering is sometimes also used more generically for any sequential Monte Carlo method. Gordon et al. (1993) actually proposed a specific instance of Algorithm 7.3.4 in which resampling


is done systematically at each step and the instrumental kernel is chosen as the prior kernel R_k = Q. This particular algorithm, commonly known as the bootstrap filter, is most often very easy to implement because it only involves simulating from the transition kernel Q of the hidden chain and evaluating the conditional likelihood function g. There is of course a whole range of variants and refinements of Algorithm 7.3.4, many of which will be covered in some detail in the next chapter. A simple remark though is that, as in the case of the simplest SIR method discussed in Section 7.1.2, it is possible to resample N times from a larger population of M intermediate samples. In practice, it means that Algorithm 7.3.4 should be modified as follows at indices k for which resampling is to be applied.

SIS: For i = 1, ..., N, draw candidates ξ̃_{k+1}^{i,1}, ..., ξ̃_{k+1}^{i,α} from each proposal distribution R_k(ξ_k^i, ·).
Resampling: Draw (N_{k+1}^{1,1}, ..., N_{k+1}^{1,α}, ..., N_{k+1}^{N,1}, ..., N_{k+1}^{N,α}) from the multinomial distribution with parameter N and probabilities

    ω_{k+1}^{i,j} / Σ_{l=1}^N Σ_{m=1}^α ω_{k+1}^{l,m}   for i = 1, ..., N, j = 1, ..., α.

Hence, while this form of resampling keeps the number of particles fixed and equal to N after resampling, the intermediate population (before resampling) has size M = αN. Although obviously heavier to implement, the use of α larger than one may be advantageous in some models. In particular, we will show in Chapter 9 that using α larger than one effectively reduces the variance associated with the resampling operation in a proportion that may be significant.

Remark 7.3.5 (Marginal Interpretation of SIS and SISR). Both Algorithms 7.2.2 and 7.3.4 have been introduced as methods to simulate whole trajectories {ξ_{0:k}^i}_{1≤i≤N} that approximate the joint smoothing distribution φ_{0:k|k}. This was done quite easily in the case of sequential importance sampling (Algorithm 7.2.2), as the trajectories are simply extended independently of one another as new samples arrive. When using resampling however, the process is more involved because it becomes necessary to duplicate or discard some trajectories according to (7.38). This presentation of the SIS and SISR methods has been adopted because it is the most natural way to introduce sequential Monte Carlo methods. It does not mean that, when implementing the SISR algorithm, storing the whole trajectories is required. Neither do we claim that, for large k, the approximation of the complete joint distribution φ_{0:k|k} provided by the particle trajectories {ξ_{0:k}^i}_{1≤i≤N} is accurate (this point will be discussed in detail in Section 8.3). Most often, Algorithm 7.3.4 is implemented storing only the current generation of particles {ξ_k^i}_{1≤i≤N}, and (7.38) simplifies to

    ξ_{k+1}^i = ξ̃_{k+1}^{I_{k+1}^i},   i = 1, ..., N.


In that case, the system of particles {ξ_k^i}_{1≤i≤N} with associated weights {ω_k^i}_{1≤i≤N} provides an approximation to the filtering distribution φ_k, which is the marginal of the joint smoothing distribution φ_{0:k|k}.

The notation ξ_k^i could be ambiguous when resampling is applied, as the first k + 1 elements of the ith trajectory ξ_{0:k+1}^i at time k + 1 do not necessarily coincide with the ith trajectory ξ_{0:k}^i at time k. By convention, ξ_k^i always refers to the last point in the ith trajectory, as simulated at index k. Likewise, ξ_{l:k}^i is the portion of the same trajectory that starts at index l and ends at the last index (that is, k). When needed, we will use the notation ξ_{0:k}^i(l) for the element of index l in the ith particle trajectory at time k to avoid ambiguity.

To conclude this section on the SISR algorithm, we briefly revisit two of the examples already considered previously to contrast the results obtained with the SIS and SISR approaches.

Example 7.3.6 (Stochastic Volatility Model, Continued). To illustrate the effectiveness of the resampling strategy, we consider once again the stochastic volatility model introduced in Example 7.2.5, for which the weight degeneracy phenomenon (in the basic SIS approach) was patent in Figures 7.9 and 7.10. Figures 7.11 and 7.12 are the counterparts of Figures 7.10 and 7.9, respectively, when resampling is applied whenever the coefficient of variation (7.35) of the normalized weights exceeds one. Note that Figure 7.11 displays the coefficient of variation and Shannon entropy computed, for each index k, before resampling at indices for which resampling does occur. Contrary to what happened in plain importance sampling, the histograms of the normalized importance weights shown in Figure 7.12 are remarkably similar, showing that the weight degeneracy phenomenon is now under control.


Fig. 7.11. Coefficient of variation (left) and entropy (right) of the normalized importance weights as a function of the number of iterations in the stochastic volatility model of Example 7.2.5. Same model and data as in Figure 7.10. Resampling occurs when the coefficient of variation gets larger than 1.




Fig. 7.12. Histograms of the base 10 logarithm of the normalized importance weights after (from top to bottom) 1, 10, and 100 iterations in the stochastic volatility model of Example 7.2.5. Same model and data as in Figure 7.9. Resampling occurs when the coefficient of variation gets larger than 1.

Another important remark in this example is that both criteria (the coefficient of variation and the entropy) are strongly correlated. Triggering resampling whenever the entropy gets below, say, 9.2 would thus be nearly equivalent, with resampling occurring, on average, once every ten time indices. The Shannon entropy of the normalized importance weights evolves between 9 and 10, suggesting that there are at least 500 particles (out of 1,000) that are significantly contributing to the importance sampling estimate.

Example 7.3.7 (Growth Model, Continued). Consider again the non-linear state-space model of Example 7.2.6, with the variance σ_u² of the state noise set to 10; this makes the observations very informative relative to the prior distribution on the hidden states. Figures 7.13 and 7.14 display the filtering distributions estimated for the first 31 time indices when using the SIS method with the prior kernel Q as instrumental kernel (Figure 7.13), and the corresponding SISR algorithm with systematic resampling (that is, the bootstrap filter) in Figure 7.14. Both algorithms use 500 particles. For each time index, the top plots of Figures 7.13 and 7.14 show the highest posterior density (HPD) regions corresponding to the estimated filtering distribution, where the lighter grey zone contains 95% of the probability mass and the darker area corresponds to 50% of the probability mass.


Fig. 7.13. SIS estimates of the filtering distributions in the growth model with instrumental kernel being the prior one and 500 particles. Top: true state sequence (crosses) and 95%/50% HPD regions (light/dark grey) of estimated filtered distribution. Bottom: coefficient of variation of the normalized importance weights.

Fig. 7.14. Same legend as for Figure 7.13, but with results for the corresponding bootstrap filter.


These HPD regions are based on a kernel density estimate (using the Epanechnikov kernel with bandwidth 0.2) computed from the weighted particles (that is, before resampling in the case of the bootstrap filter). Up to k = 8, the two methods yield very similar results. With the SIS algorithm however, the bottom panel of Figure 7.13 shows that the weights degenerate quickly. Remember that the maximal value of the coefficient of variation (7.35) is √(N − 1), that is, about 22.3 in the case of Figure 7.13. Hence for k = 6 and for all indices after k = 12, the bottom panel of Figure 7.13 indeed means that almost all normalized weights but one are null: the filtered estimate is concentrated at one point, which sometimes severely departs from the actual state trajectory shown by the crosses. In contrast, the bootstrap filter (Figure 7.14) appears to be very stable and provides reasonable state estimates even at indices for which the filtering distribution is strongly bimodal (see Example 7.2.6 for an explanation of this latter feature).

7.4 Complements
As discussed above, resampling is a key ingredient of the success of sequential Monte Carlo techniques. We discuss below two separate aspects related to this issue. First, we show that there are several schemes based on clever probabilistic results that may be exploited to reduce the computational load associated with multinomial resampling. Next, we examine some variants of resampling that achieve lower conditional variance than multinomial resampling. In this latter case, the aim is of course to be able to decrease the number of particles without losing too much on the quality of the approximation.

Throughout this section, we will assume that it is required to draw N samples ξ̃^1, ..., ξ̃^N out of a, usually larger, set {ξ^1, ..., ξ^M} according to the normalized importance weights {ω^1, ..., ω^M}. We denote by G a σ-field such that both ξ^1, ..., ξ^M and ω^1, ..., ω^M are G-measurable.

7.4.1 Implementation of Multinomial Resampling

Drawing from the multinomial distribution is equivalent to drawing N random indices I^1, ..., I^N conditionally independently given G from the set {1, ..., M} and such that P(I^j = i | G) = ω^i. This is of course the simplest example of use of the inversion method, and each index may be obtained by first simulating a random variable U with uniform distribution on [0, 1] and then determining the index I such that U ∈ (Σ_{j=1}^{I-1} ω^j, Σ_{j=1}^{I} ω^j] (see Figure 7.15). Determining the appropriate index I thus requires on average log₂ M comparisons (using a simple binary tree search). Therefore, the naive technique to implement multinomial resampling requires the simulation of N independent uniform random variables and, on average, of the order of N log₂ M comparisons.


Fig. 7.15. Multinomial sampling from uniform distribution by the inversion method.

A nice solution to avoid the repeated sorting operations consists in pre-sorting the uniform variables. Because the resampling is to be repeated N times, we need N uniform random variables, which will be denoted by U_1, ..., U_N, with U_(1) ≤ U_(2) ≤ ... ≤ U_(N) denoting the associated order statistics. It is easily checked that applying the inversion method from the ordered uniforms {U_(i)} requires, in the worst case, only M comparisons. The problem is that determining the order statistics from the unordered uniforms {U_i} by sorting algorithms such as Heapsort or Quicksort is an operation that requires, at best, of the order of N log₂ N comparisons (Press et al., 1992, Chapter 8). Hence, except in cases where N is much smaller than M, we have not gained anything yet by pre-sorting the uniform variables prior to using the inversion method. It turns out however that two distinct algorithms are available to sample directly the ordered uniforms {U_(i)} with a number of operations that scales linearly with N. Both of these methods are fully covered by Devroye (1986, Chapter 5), and we only cite here the appropriate results, referring to Devroye (1986, pp. 207-215) for proofs and further references on the methods.

Proposition 7.4.1 (Uniform Spacings). Let U_(1) ≤ ... ≤ U_(N) be the order statistics associated with an i.i.d. sample from the U([0, 1]) distribution. Then the increments

    S_i = U_(i) − U_(i-1),   i = 1, ..., N,   (7.39)

(where by convention S_1 = U_(1)) are called the uniform spacings and are distributed as

    ( E_1 / Σ_{i=1}^{N+1} E_i, ..., E_N / Σ_{i=1}^{N+1} E_i ),

where E_1, ..., E_{N+1} is a sequence of i.i.d. exponential random variables.

Proposition 7.4.2 (Malmquist, 1950). Let U_(1) ≤ ... ≤ U_(N) be the order statistics of U_1, U_2, ..., U_N, a sequence of i.i.d. uniform [0, 1] random variables. Then

    ( U_N^{1/N}, U_N^{1/N} U_{N-1}^{1/(N-1)}, ..., U_N^{1/N} U_{N-1}^{1/(N-1)} ⋯ U_1^{1/1} )

is distributed as (U_(N), U_(N-1), ..., U_(1)).

The two sampling algorithms associated with these probabilistic results may be summarized as follows.

Algorithm 7.4.3 (After Proposition 7.4.1).
  For i = 1, ..., N + 1: simulate U_i ~ U([0, 1]) and set E_i = −log U_i.
  Set G = Σ_{i=1}^{N+1} E_i and U_(1) = E_1 / G.
  For i = 2, ..., N: U_(i) = U_(i-1) + E_i / G.

Algorithm 7.4.4 (After Proposition 7.4.2).
  Generate V_N ~ U([0, 1]) and set U_(N) = V_N^{1/N}.
  For i = N − 1 down to 1: generate V_i ~ U([0, 1]) and set U_(i) = V_i^{1/i} U_(i+1).
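A minimal sketch of the overall procedure (Python with NumPy; the function names are ours): ordered uniforms are generated by the spacings method of Algorithm 7.4.3 and then mapped to indices by a single linear sweep through the cumulative weights, so the whole resampling step runs in O(N + M) operations.

```python
import numpy as np

def sorted_uniforms(n, rng):
    """Algorithm 7.4.3: n ordered uniforms via exponential spacings, in O(n)."""
    e = -np.log(rng.random(n + 1))        # i.i.d. exponential random variables
    return np.cumsum(e[:n]) / e.sum()

def multinomial_resample(weights, n, rng):
    """Draw n indices with P(I = i) = weights[i] by applying the inversion
    method to pre-sorted uniforms (single pass over the cumulative weights)."""
    u = sorted_uniforms(n, rng)
    cumw = np.cumsum(weights)
    indices = np.empty(n, dtype=int)
    j = 0
    for i in range(n):
        while j < len(cumw) - 1 and u[i] > cumw[j]:
            j += 1
        indices[i] = j
    return indices
```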

Note that Devroye (1986) also discusses a third, slightly more complicated algorithm, the bucket sort method of Devroye and Klincsek (1981), which also has an expected computation time of order N. Using any of these methods, the computational cost of multinomial resampling scales only linearly in N and M, which makes the method practicable even when a large number of particles is used.

7.4.2 Alternatives to Multinomial Resampling

Instead of using the multinomial sampling scheme, it is also possible to use a different resampling (or reallocation) scheme. For i = 1, ..., M, denote by N^i the number of times the ith element ξ^i is selected. A resampling scheme will be said to be unbiased with respect to G if

    Σ_{i=1}^M N^i = N,   (7.40)
    E[ N^i | G ] = N ω^i,   i = 1, ..., M.   (7.41)

We focus here on resampling techniques that keep the number of particles constant (see for instance Crisan et al., 1999, for unbiased sampling with a random number of particles). There are many different schemes satisfying these unbiasedness conditions. The simplest one is multinomial resampling, for which (N^1, ..., N^M), conditionally on G, has the multinomial distribution Mult(N, ω^1, ..., ω^M). Because I^1, ..., I^N are conditionally i.i.d. given G, it is easy to evaluate the conditional variance in the multinomial resampling scheme:


    Var( (1/N) Σ_{i=1}^N f(ξ^{I^i}) | G ) = (1/N) Σ_{i=1}^M ω^i [ f(ξ^i) − Σ_{j=1}^M ω^j f(ξ^j) ]²
                                          = (1/N) { Σ_{i=1}^M ω^i f²(ξ^i) − [ Σ_{i=1}^M ω^i f(ξ^i) ]² }.   (7.42)

A sensible objective is to try to construct resampling schemes for which the conditional variance Var( Σ_{i=1}^M (N^i/N) f(ξ^i) | G ) is as small as possible and, in particular, smaller than (7.42), preferably for any choice of the function f.

7.4.2.1 Residual Resampling

Residual resampling, or remainder resampling, is mentioned by Whitley (1994) (see also Liu and Chen, 1998) as a simple means to decrease the variance incurred by the sampling step. In this scheme, for i = 1, ..., M we set

    N^i = ⌊N ω^i⌋ + Ñ^i,   (7.43)

where Ñ^1, ..., Ñ^M are distributed, conditionally on G, according to the multinomial distribution Mult(N − R, ω̄^1, ..., ω̄^M) with R = Σ_{i=1}^M ⌊N ω^i⌋ and

    ω̄^i = (N ω^i − ⌊N ω^i⌋) / (N − R),   i = 1, ..., M.   (7.44)

This scheme is obviously unbiased with respect to G. Equivalently, for any measurable function f, the residual sampling estimator is

    (1/N) Σ_{i=1}^N f(ξ̃^i) = Σ_{i=1}^M (⌊N ω^i⌋/N) f(ξ^i) + (1/N) Σ_{i=1}^{N-R} f(ξ^{J^i}),   (7.45)

where J^1, ..., J^{N-R} are conditionally independent given G with distribution P(J^i = k | G) = ω̄^k for i = 1, ..., N − R and k = 1, ..., M. Because the residual resampling estimator is the sum of one term that, given G, is deterministic and one term that involves conditionally i.i.d. labels, the variance of residual resampling is given by

    Var( (1/N) Σ_{i=1}^{N-R} f(ξ^{J^i}) | G ) = ((N − R)/N²) Var( f(ξ^{J^1}) | G )
        = ((N − R)/N²) Σ_{i=1}^M ω̄^i [ f(ξ^i) − Σ_{j=1}^M ω̄^j f(ξ^j) ]²
        = (1/N) Σ_{i=1}^M ω^i f²(ξ^i) − Σ_{i=1}^M (⌊N ω^i⌋/N²) f²(ξ^i) − ((N − R)/N²) [ Σ_{i=1}^M ω̄^i f(ξ^i) ]².   (7.46)


Residual sampling dominates multinomial sampling in the sense of having smaller conditional variance. Indeed, first write

    Σ_{i=1}^M ω^i f(ξ^i) = Σ_{i=1}^M (⌊N ω^i⌋/N) f(ξ^i) + ((N − R)/N) Σ_{i=1}^M ω̄^i f(ξ^i).

Then note that the sum of the M numbers ⌊N ω^i⌋/N plus (N − R)/N equals one, whence this sequence of M + 1 numbers can be viewed as a probability distribution. Thus Jensen's inequality applied to the square of the right-hand side of the above display yields

    [ Σ_{i=1}^M ω^i f(ξ^i) ]² ≤ Σ_{i=1}^M (⌊N ω^i⌋/N) f²(ξ^i) + ((N − R)/N) [ Σ_{i=1}^M ω̄^i f(ξ^i) ]².

Combining with (7.46) and (7.42), this shows that the conditional variance of residual sampling is always smaller than that of multinomial sampling.

7.4.2.2 Stratified Resampling

The inversion method for sampling a multinomial sequence of trials maps uniform (0, 1) random variables U^1, ..., U^N into indices I^1, ..., I^N through a deterministic function. For any function f,

    Σ_{i=1}^N f(ξ^{I^i}) = Σ_{i=1}^N f̄(U^i),

where the function f̄ (which depends on both f and {ω^i}) is defined, for any u ∈ (0, 1], by

    f̄(u) = f(ξ^{I(u)}),   I(u) = Σ_{i=1}^M i 1_{( Σ_{j=1}^{i-1} ω^j, Σ_{j=1}^{i} ω^j ]}(u).   (7.47)

Note that, by construction, ∫_0^1 f̄(u) du = Σ_{i=1}^M ω^i f(ξ^i). To reduce the conditional variance of Σ_{i=1}^N f(ξ^{I^i}), we may change the way in which the sample U^1, ..., U^N is drawn. A possible solution, commonly used in survey sampling, is based on stratification (see Kitagawa, 1996, and Fearnhead, 1998, Section 5.3, for discussion of the method in the context of particle filtering). The interval (0, 1] is partitioned into different strata, assumed for simplicity to be intervals, (0, 1] = (0, 1/N] ∪ (1/N, 2/N] ∪ ... ∪ ({N − 1}/N, 1]. More general partitions could have been considered as well; in particular, the number of partitions does not have to equal N, and the interval lengths could be made dependent on the ω^i.

Fig. 7.16. Stratified sampling: the interval (0, 1] is divided into N intervals ((i − 1)/N, i/N]. One sample is drawn uniformly from each interval, independently of samples drawn in the other intervals.

One then draws a sample U^1, ..., U^N conditionally independently given G from the distributions U^i ~ U(({i − 1}/N, i/N]) for i = 1, ..., N, and lets I^i = I(U^i) with I as in (7.47) (see Figure 7.16). By construction, the difference between N^i = Σ_{j=1}^N 1{I^j = i} and the target (non-integer) value N ω^i is less than one in absolute value. It also follows that

    E[ Σ_{i=1}^N f(ξ^{I^i}) | G ] = E[ Σ_{i=1}^N f̄(U^i) | G ] = N Σ_{i=1}^N ∫_{(i-1)/N}^{i/N} f̄(u) du = N ∫_0^1 f̄(u) du = N Σ_{i=1}^M ω^i f(ξ^i),

showing that the stratified sampling scheme is unbiased. Because U^1, ..., U^N are conditionally independent given G,

    Var( (1/N) Σ_{i=1}^N f(ξ^{I^i}) | G ) = Var( (1/N) Σ_{i=1}^N f̄(U^i) | G )
        = (1/N²) Σ_{i=1}^N Var( f̄(U^i) | G )
        = (1/N) Σ_{i=1}^M ω^i f²(ξ^i) − Σ_{i=1}^N [ ∫_{(i-1)/N}^{i/N} f̄(u) du ]²;

here we used that ∫_0^1 f̄²(u) du = Σ_{i=1}^M ω^i f²(ξ^i). By Jensen's inequality,

    (1/N) [ Σ_{i=1}^M ω^i f(ξ^i) ]² = (1/N) [ Σ_{i=1}^N ∫_{(i-1)/N}^{i/N} f̄(u) du ]² ≤ Σ_{i=1}^N [ ∫_{(i-1)/N}^{i/N} f̄(u) du ]²,


showing that the conditional variance of stratified sampling is always smaller than that of multinomial sampling.

Remark 7.4.5. Note that stratified sampling may be coupled with the residual sampling method discussed previously: the proof above shows that using stratified sampling on the N − R residual indices that are effectively drawn randomly can only decrease the conditional variance.

7.4.2.3 Systematic Resampling

Stratified sampling aims at reducing the discrepancy

    D_N(U^1, ..., U^N) = sup_{a ∈ (0,1]} | (1/N) Σ_{i=1}^N 1_{(0,a]}(U^i) − a |

of the sample U^1, ..., U^N from the uniform distribution on (0, 1]. This is simply the Kolmogorov-Smirnov distance between the empirical distribution function of the sample and the distribution function of the uniform distribution. The Koksma-Hlawka inequality (Niederreiter, 1992) shows that for any function f having bounded variation on [0, 1],

    | (1/N) Σ_{i=1}^N f(u^i) − ∫_0^1 f(u) du | ≤ C(f) D_N(u^1, ..., u^N),

where C(f) is the variation of f. This inequality suggests that it is desirable to design random sequences U^1, ..., U^N whose expected discrepancy is as low as possible. This provides another explanation of the improvement brought by stratified resampling (compared to multinomial resampling).

Fig. 7.17. Systematic sampling: the unit interval is divided into N intervals ((i − 1)/N, i/N] and one sample is drawn from each of them. Contrary to stratified sampling, each sample has the same relative position within its stratum.


Pursuing in this direction, it makes sense to look for sequences with even smaller average discrepancy. One such sequence is U^i = U + (i − 1)/N, where U is drawn from the uniform U((0, 1/N]) distribution. In survey sampling, this method is known as systematic sampling. It was introduced in the particle filter literature by Carpenter et al. (1999) but is mentioned by Whitley (1994) under the name of universal sampling. The interval (0, 1] is still divided into N sub-intervals ({i − 1}/N, i/N] and one sample is taken from each of them, as in stratified sampling. However, the samples are no longer independent, as they have the same relative position within each stratum (see Figure 7.17). This sampling scheme is obviously still unbiased. Because the samples are not taken independently across strata, it is however not possible to obtain simple formulas for the conditional variance (Künsch, 2003). It is often conjectured that the conditional variance of systematic resampling is always lower than that of multinomial resampling. This is not correct, as demonstrated by the following example.

Example 7.4.6. Consider the case where the initial population of particles {ξ^i}_{1≤i≤N} is composed of the interleaved repetition of only two distinct values x0 and x1, with identical multiplicities (assuming N to be even). In other words,

    {ξ^i}_{1≤i≤N} = {x0, x1, x0, x1, ..., x0, x1}.

We denote by 2ε/N the common value of the normalized weight ω^i associated to the N/2 particles ξ^i that satisfy ξ^i = x1, so that the remaining ones (which are such that ξ^i = x0) share a common weight of 2(1 − ε)/N. Without loss of generality, we assume that 1/2 < ε < 1 and that the function of interest f is such that f(x0) = 0 and f(x1) = F. Under multinomial resampling, (7.42) shows that the conditional variance of the estimate N^{-1} Σ_{i=1}^N f(ξ̃^i) is given by

    Var( (1/N) Σ_{i=1}^N f(ξ̃^i_mult) | G ) = (1/N) ε (1 − ε) F².   (7.48)

Because the value 2/N is assumed to be larger than 1/N , it is easily checked that systematic resampling deterministically sets N/2 of the i to be equal to x1 . Depending on the draw of the initial shift, all the N/2 remaining particles are either set to x1 , with probability 21, or to x0 , with probability 2(1 ). Hence the variance is that of a single Bernoulli draw scaled by N/2, that is, N 1 i Var f (syst ) G = ( 1/2)(1 )F 2 . N i=1 Note that in this case, the conditional variance of systematic resampling is not only larger than (7.48) for most values of (except when is very close to 1/2), but it does not even decrease to zero as N grows! Clearly, this observation is very dependent on the order in which the initial population of particles


                                            ε
                                    0.51    0.55    0.6     0.65    0.70    0.75
Multinomial                         0.050   0.049   0.049   0.048   0.046   0.043
Residual, stratified                0.010   0.021   0.028   0.032   0.035   0.035
Systematic                          0.070   0.150   0.200   0.229   0.245   0.250
Systematic with prior random
  shuffling                         0.023   0.030   0.029   0.029   0.028   0.025

Table 7.1. Standard deviations of various resampling methods for N = 100 and F = 1. The bottom line has been obtained by simulations, averaging 100,000 Monte Carlo replications.

Clearly, this observation is very dependent on the order in which the initial population of particles is presented. Interestingly, this feature is common to the systematic and stratified sampling schemes, whereas the multinomial and residual approaches are unaffected by the order in which the particles are labelled. In this particular example, it is straightforward to verify that residual and stratified resampling are equivalent (which is not the case in general) and amount to deterministically setting N/2 particles to the value x1, whereas the N/2 remaining ones are drawn by N/2 conditionally independent Bernoulli trials with probability of picking x1 equal to 2ε − 1. Hence the conditional variance, for both the residual and stratified schemes, is equal to N^{-1} (2ε − 1)(1 − ε) F². It is hence always smaller than (7.48), as expected from the general study of these two methods.

Once again, the failure of systematic resampling in this example is entirely due to the specific order in which the particles are labelled: it is easy to verify, at least empirically, that the problem vanishes upon randomly permuting the initial particles before applying systematic resampling. Table 7.1 also shows that a common feature of the residual, stratified, and systematic resampling procedures is to become very efficient in some particular configurations of the weights, such as when ε = 0.51, for which the probabilities of selecting the two types of particles are almost equal and the selection becomes quasi-deterministic. Note also that prior random shuffling does somewhat compromise this ability in the case of systematic resampling.

In practical applications of sequential Monte Carlo methods, residual, stratified, and systematic resampling are generally found to provide comparable results. Despite the lack of complete theoretical analysis of its behavior, systematic resampling is often preferred because it is the simplest method to implement. Note that there are specific situations, to be discussed in Section 8.2, where more subtle forms of resampling (which do not necessarily bring back all the weights to equal values) are advisable.
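For reference, here is a minimal sketch of the three alternative schemes (Python with NumPy; the function names are ours, and the sketch assumes the common case where the resampled population has the same size N as the weight vector). Each function returns N indices into the weight vector and is unbiased in the sense of (7.40)-(7.41); the short experiment at the end re-creates the two-point configuration of Example 7.4.6 with ε = 0.6 and should print standard deviations close to the corresponding column of Table 7.1 (about 0.028 for residual and stratified, 0.2 for systematic).

```python
import numpy as np

def residual_resample(w, rng):
    """Residual resampling (7.43)-(7.44): deterministic counts plus multinomial residuals."""
    N = len(w)
    counts = np.floor(N * np.asarray(w)).astype(int)
    n_rest = N - counts.sum()
    if n_rest > 0:
        residual = N * np.asarray(w) - counts
        counts += rng.multinomial(n_rest, residual / residual.sum())
    return np.repeat(np.arange(N), counts)

def stratified_resample(w, rng):
    """Stratified resampling: one uniform drawn in each stratum ((i-1)/N, i/N]."""
    N = len(w)
    u = (np.arange(N) + rng.random(N)) / N
    cumw = np.cumsum(w)
    cumw[-1] = 1.0                      # guard against round-off in the last bin
    return np.searchsorted(cumw, u)

def systematic_resample(w, rng):
    """Systematic resampling: a single uniform shift shared by all N strata."""
    N = len(w)
    u = (np.arange(N) + rng.random()) / N
    cumw = np.cumsum(w)
    cumw[-1] = 1.0
    return np.searchsorted(cumw, u)

# Two-point configuration of Example 7.4.6 with interleaved x0 = 0 and x1 = F = 1
N, eps, F = 100, 0.6, 1.0
rng = np.random.default_rng(0)
f = np.tile([0.0, F], N // 2)
w = np.tile([2 * (1 - eps) / N, 2 * eps / N], N // 2)
for scheme in (residual_resample, stratified_resample, systematic_resample):
    est = [f[scheme(w, rng)].mean() for _ in range(5000)]
    print(scheme.__name__, round(float(np.std(est)), 3))
```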

8 Advanced Topics in Sequential Monte Carlo

This chapter deals with three disconnected topics that correspond to variants and extensions of the sequential Monte Carlo framework introduced in the previous chapter. Remember that we have already examined in Section 7.2 a first and very important degree of freedom in the application of sequential Monte Carlo methods, namely the choice of the instrumental kernel Rk used to simulate the trajectories of the particles. We now consider solutions that depart, more or less significantly, from the sequential importance sampling with resampling (SISR) method of Algorithm 7.3.4. The first section covers a far-reaching revision of the principles behind the SISR algorithm in which sequential Monte Carlo is interpreted as a repeated sampling task. This reinterpretation suggests several other sequential Monte Carlo schemes that differ, sometimes significantly, from the SISR approach. Section 8.2 reviews methods that exploit the specific hierarchical structure found in some hidden Markov models, and in particular in conditionally Gaussian linear state-space models (CGLSSMs). The algorithms to be considered there combine the sequential simulation approach presented in the previous chapter with the Kalman filtering recursion discussed in Chapter 5. Finally, Section 8.3 discusses the use of sequential Monte Carlo methods for approximating smoothed quantities of the form introduced in Section 4.1.

8.1 Alternatives to SISR


We first present a reinterpretation of the objectives of the sequential importance sampling with resampling (SISR) algorithm in Section 7.3. This new interpretation suggests a whole range of different approaches that combine more closely the sampling (trajectory update) and resampling (weight reset) operators involved in the SISR algorithm. In the basic SISR approach (Algorithm 7.3.4), we expect that after a resampling step, say at index k, the particle trajectories ξ_{0:k}^1, ..., ξ_{0:k}^N approximately form an i.i.d. sample of size N from the distribution φ_{0:k|k}. We will



discuss more precisely in Chapter 9 the degree to which this assertion is correct but assume for the moment that the general intuition is justifiable. Even in the absence of resampling at index k, in which case the weights ω_k^1, ..., ω_k^N are not identical, the expectation of any function f_k ∈ F_b(X^{k+1}) under φ_{0:k|k} may be approximated, following (7.11), by
N i k N j=1 j k i fk (0:k ) .

i=1

This behavior may indeed be adopted as a general principle for sequential Monte Carlo techniques, considering that a valid algorithm is such that it is recursive and guarantees that the weighted empirical distribution,
N

0:k|k =
i=1

i k N j=1

j k

i 0:k ,

(8.1)

is a consistent approximation to 0:k|k , in some suitable sense, as the number N of particles increases (the symbol denotes Dirac measures). The particular feature of the sequence of target distributions encountered in the HMM ltering application is the relatively simple recursive form recalled by (7.7): 0:k+1|k+1 (fk+1 ) =
u fk+1 (x0:k+1 )0:k|k (dx0:k ) Tk (xk , dxk+1 ) ,

u for all functions fk+1 Fb Xk+2 , where Tk is the (unnormalized) kernel u dened in (7.8). This relation may be rewritten replacing Tk by its normalized version Tk dened in (7.15), the so-called optimal importance kernel, to obtain

0:k+1|k+1 (fk+1 ) =

fk+1 (x0:k+1 ) 0:k|k (dx0:k ) Lk k (xk ) Tk (xk , dxk+1 ) , Lk+1 (8.2)

where k is the normalizing function dened in (7.17). Because the likelihoods Lk and Lk+1 are precisely the type of quantities that are non-evaluable in contexts where sequential Monte Carlo is useful, it is preferable to rewrite (8.2) in the equivalent auto-normalized form fk+1 (x0:k+1 ) 0:k|k (dx0:k )k (xk ) Tk (xk , dxk+1 ) . 0:k|k (dx0:k )k (xk ) (8.3) A natural idea in the context of sequential Monte Carlo is to plug the approximate empirical distribution dened in (8.1) into the recursive update formula (8.3), which yields 0:k+1|k+1 (fk+1 ) =

8.1 Alternatives to SISR


N def 0:k+1|k+1 (fk+1 ) = i=1 i i k k (k ) N j=1 j j k k (k ) i i fk+1 (0:k , x) Tk (k , dx) .

253

(8.4)

This equation denes a probability distribution 0:k+1|k+1 on Xk+2 , which is a nite mixture distribution and which also has the particularity that its restriction to the rst k + 1 component is a weighted empirical distribution 1 N i i with support 0:k , . . . , 0:k and weights proportional to k k (k ). Following this argument, the updated empirical approximation 0:k+1|k+1 should approximate the distribution dened in (8.4) as closely as possible, but with the constraint that it is supported by N points only. The simplest idea of course consists in trying to obtain a (conditionally) i.i.d. sample from this mixture distribution. This interpretation opens a range of new possibilities, as we are basically faced with a sampling problem for which several methods, including those discussed in Chapter 6, are available. 8.1.1 I.I.D. Sampling As discussed above, the rst obvious idea is to simulate, if possible, the new particle trajectories as N i.i.d. draws from the distribution dened by (8.4). Note that the term i.i.d. is used somewhat loosely here, as the statement obviously refers to the conditional distribution of the new particle trajecto1 N ries 0:k+1 , . . . , 0:k+1 given the current state of the system as dened by the 1 N 1 N particle trajectories 0:k , . . . , 0:k and the weights k , . . . , k . The algorithm obtained when following this principle is distinct from Algorithm 7.3.4, although it is very closely related to SISR when the optimal importance kernel Tk is used as the instrumental kernel. Algorithm 8.1.1 (I.I.D. Sampling or Selection/Mutation Algorithm). Weight computation: For i = 1, . . . , N , compute the (unnormalized) importance weights i i i k = k k (k ) . (8.5)
1 N i Selection: Draw Ik+1 , . . . , Ik+1 conditionally i.i.d. given {0:k }1iN , with probj 1 abilities P(Ik+1 = j) proportional to k , j = 1, . . . , N . Sampling: Draw 1 , . . . , N conditionally independently given { i }1iN k+1 k+1 0:k

and
i

i {Ik+1 }1iN ,

I i i with distribution k+1 Tk (kk+1 , ). Set 0:k+1 =

Ik+1 i i (0:k , k+1 ) and k+1 = 1 for i = 1, . . . , N .

Comparing the above algorithm with Algorithm 7.3.4 for the particular choice Rk = Tk reveals that they dier only by the order in which the sampling and selection operations are performed. Algorithm 7.3.4 prescribes that each trai i i i jectory be rst extended by setting 0:k+1 = (0:k , k+1 ) with k+1 drawn i from Tk (k , ). Then resampling is performed in the population of extended

254

8 Advanced Topics in SMC

trajectories, based on weights given by (8.5) when Rk = Tk . In contrast, Algoi rithm 8.1.1 rst selects the trajectories based on the weights k and then simulates an independent extension for each selected trajectory. This is of course possible only because the optimal importance kernel Tk is used as instrumental kernel, rendering the incremental weights independent of the position of the particle at index k + 1 and thus allowing for early selection. Intuitively, Algorithm 8.1.1 is preferable because it does not simply duplicate trajectories with high weights but rather selects the most promising trajectories at index k using independent extensions (at index k + 1) for each selected trajectory. Following the terminology in use in genetic algorithms1 , Algorithm 8.1.1 is a selection/mutation algorithm, whereas the SISR approach is based on mutation/selection. Recall that the latter is more general, as it does not require that the optimal kernel Tk be used, although we shall see later, in Section 8.1.2, that the i.i.d. sampling approach can be modied to allow for general instrumental kernels. Remark 8.1.2. In Chapter 7 as well as in the exposition above, we considered that the quantity of interest is the joint smoothing measure 0:k|k . It is important however to understand that this focus on the joint smoothing measure 0:k|k is unessential as all the algorithms presented so far only rely on the recursive structure observed in (8.4). Of course, in the case of the joint smoothing measure 0:k|k , the kernel Tk and the function k that appear in (8.4) have a specic form given by (7.15) and (7.17): f (x ) k (x)Tk (x, dx ) = f (x ) gk+1 (x )Q(x, dx ) (8.6)

for functions f Fb (X), where k (x) equals the above expression evaluated for f = 1. However, any of the sequential Monte Carlo algorithms discussed so far can be used for generic choices of the kernel Tk and the function k provided the expression for the incremental weights is suitably modied. The core of SMC techniques is thus the structure observed in (8.4), whose connection with the methods exposed here is worked out in detail in the recent book by Del Moral (2004). As an example, recall from Chapter 3 that the distribution 0:k|k1 diers from 0:k1|k1 only by an application of the prior (or state transition) kernel Q and hence satises a recursion similar to (8.4) with the kernel Tk and the function k replaced by Q and gk , respectively: 0:k+1|k (fk+1 ) = fk+1 (x0:k+1 ) 0:k|k1 (dx0:k )gk (xk ) Q(xk , dxk+1 ) , 0:k|k1 (dx0:k )gk (xk ) (8.7)

1 Genetic algorithms (see, e.g., Whitley, 1994) have much in common with sequential Monte Carlo methods. Their purpose is dierent however, with an emphasis on optimization rather than, as for SMC, simulation. Both elds do share a lot of common terminology.

i i k Q(k1 , ) i i k gk (k )

resampling

i i k+1 Q(k , ) 1 N

i k1 ,
1 N

1iN 1iN 1iN

i k , 1 N

i i k , k

i k ,

1iN

i k+1 ,

1 N

1iN

mutation selection

mutation

 
SISR with R = Q

-

-
I.I.D. sampling for the predictor recursion (8.7)

8.1 Alternatives to SISR

255

Fig. 8.1. The bootstrap lter decomposed into elementary mutation/selection steps.

256

8 Advanced Topics in SMC

where the denominator could be written more compactly as k|k1 (gk ). The recursive update formula obtained for the (joint) predictive distribution is much simpler than (8.4), as (8.7) features the prior kernel Qfrom which we generally assume that sampling is feasibleand the conditional likelihood function gk whose analytical expression is known. In particular, it is straightforward to apply Algorithm 8.1.1 in this case by selecting with weights 1 N gk (k ), . . . , gk (k ) and mutating the selected particles using the kernel Q. This is obviously equivalent to the bootstrap lter (Algorithm 7.3.4 with Q as the instrumental kernel) viewed at a dierent stage: just after the selection step for Algorithm 7.3.4 and just after the mutation step for Algorithm 8.1.1 applied to the predictive distribution (see Figure 8.1 for an illustration). The previous interpretation however suggests that the bootstrap lter operates very dierently on the ltering and predictive approximations, either according to Algorithm 7.3.4 or to Algorithm 8.1.1. We shall see in the next chapter (Section 9.4) that this observation has important implications when it comes to evaluating the asymptotic (for large N ) performance of the method. Coming back to the joint smoothing distribution 0:k|k , Algorithm 8.1.1 is generally not applicable directly as it involves sampling from Tk and evaluation of the normalization function k (see also the discussion in Section 7.2.2 on this point). In the remainder of this section, we will examine a number of more practicable options that keep up with the general objective of sampling from the distribution dened in (8.4). The rst section below presents a method that is generally known under the name auxiliary particle lter after Pitt and Shephard (1999) (see also Liu and Chen, 1998). The way it is presented here however diers notably from the exposition of Pitt and Shephard (1999), whose original argument will be discussed in Section 8.1.3. 8.1.2 Two-Stage Sampling We now consider using the sampling importance resampling method intro duced in Section 7.1.2 to sample approximately from 0:k+1|k+1 . Recall that SIR sampling proceeds in two steps: in a rst step, a new population is drawn according to an instrumental distribution, say 0:k+1 ; then, in a second step, the points are selected with probabilities proportional to the importance ratio between the target (here 0:k+1|k+1 ) and the instrumental distribution 0:k+1 . Our aim is to nd an instrumental distribution 0:k+1 that is as close as possible to 0:k+1|k+1 as dened in (8.4), yet easy to sample from. A sensible option is provided by mixture distributions such that for all functions fk+1 Fb Xk+2 ,
N

0:k+1 (fk+1 ) =
i=1

i i k k N j=1 j j k k

i i fk+1 (0:k , x) Rk (k , dx) .

(8.8)

1 N Here, k+1 , . . . , k+1 are positive numbers, called adjustment multiplier weights by Pitt and Shephard (1999), and Rk is a transition kernel on X. Both the

8.1 Alternatives to SISR

257

adjustment multiplier weights and the instrumental kernel may depend on the new observation $Y_{k+1}$ although, as always, we do not explicitly mention it in our notation. To ensure that the importance ratio is well-defined, we require that the adjustment multiplier weights be strictly positive and that $T_k(x,\cdot)$, or equivalently $T_k^u(x,\cdot)$, be absolutely continuous with respect to $R_k(x,\cdot)$, for all $x \in \mathsf{X}$. These assumptions imply that the target distribution $\phi_{0:k+1|k+1}$ defined in (8.4) is dominated by the instrumental distribution $\rho_{0:k+1}$, with importance function given by the Radon-Nikodym derivative
\[
\frac{d\phi_{0:k+1|k+1}}{d\rho_{0:k+1}}(x_{0:k+1}) = C_k \sum_{i=1}^{N} \mathbb{1}_{\{\xi_{0:k}^i\}}(x_{0:k})\, \frac{\gamma_k(\xi_k^i)}{\tau_{k+1}^i}\, \frac{dT_k(\xi_k^i,\cdot)}{dR_k(\xi_k^i,\cdot)}(x_{k+1}) \;, \tag{8.9}
\]
where
\[
C_k = \frac{\sum_{i=1}^{N} \omega_k^i\, \tau_{k+1}^i}{\sum_{i=1}^{N} \omega_k^i\, \gamma_k(\xi_k^i)} \;.
\]
Because the factor $C_k$ is a normalizing constant that does not depend on $x_{0:k+1}$, it is left here only for reference; its evaluation is never required when using the SIR approach. In order to obtain (8.9), we used the fundamental observation that a set $A_{k+1} \in \mathcal{X}^{\otimes(k+2)}$ can have non-null probability under both (8.4) and (8.8) only if there exists an index $i$ and a measurable set $A \in \mathcal{X}$ such that $\{\xi_0^i\} \times \cdots \times \{\xi_k^i\} \times A \subseteq A_{k+1}$, that is, $A_{k+1}$ must contain (at least) one of the current particle trajectories. Recall that
\[
\gamma_k(\xi_k^i)\, T_k(\xi_k^i, dx) = g_{k+1}(x)\, Q(\xi_k^i, dx) \;,
\]
and hence (8.9) may be rewritten as
\[
\frac{d\phi_{0:k+1|k+1}}{d\rho_{0:k+1}}(x_{0:k+1}) = C_k \sum_{i=1}^{N} \mathbb{1}_{\{\xi_{0:k}^i\}}(x_{0:k})\, \frac{g_{k+1}(x_{k+1})}{\tau_{k+1}^i}\, \frac{dQ(\xi_k^i,\cdot)}{dR_k(\xi_k^i,\cdot)}(x_{k+1}) \;. \tag{8.10}
\]
Thanks to the relatively simple expression of the importance function in (8.10), the complete SIR algorithm is straightforward provided that we can simulate from the instrumental kernel $R_k$.

Algorithm 8.1.3 (Two-Stage Sampling).
First-Stage Sampling:
Draw $I_k^1, \ldots, I_k^M$ conditionally i.i.d. given $\{\xi_{0:k}^i\}_{1\le i\le N}$, with probabilities $\mathrm{P}(I_k^i = j)$ proportional to the (unnormalized) first-stage weights $\omega_k^j\, \tau_{k+1}^j$, $j = 1, \ldots, N$.
Draw $\tilde\xi_{k+1}^1, \ldots, \tilde\xi_{k+1}^M$ conditionally independently given $\{\xi_{0:k}^l\}_{1\le l\le N}$ and $\{I_k^i\}_{1\le i\le M}$, with distribution $\tilde\xi_{k+1}^i \sim R_k(\xi_k^{I_k^i}, \cdot)$. Set $\tilde\xi_{0:k+1}^i = (\xi_{0:k}^{I_k^i}, \tilde\xi_{k+1}^i)$ for $i = 1, \ldots, M$.
Weight Computation: For $i = 1, \ldots, M$, compute the (unnormalized) second-stage weights
\[
\omega_{k+1}^i = \frac{g_{k+1}(\tilde\xi_{k+1}^i)}{\tau_{k+1}^{I_k^i}}\, \frac{dQ(\xi_k^{I_k^i},\cdot)}{dR_k(\xi_k^{I_k^i},\cdot)}(\tilde\xi_{k+1}^i) \;. \tag{8.11}
\]
Second-Stage Resampling:
Draw $J_{k+1}^1, \ldots, J_{k+1}^N$ conditionally i.i.d. given $\{\tilde\xi_{0:k+1}^i\}_{1\le i\le M}$, with probabilities $\mathrm{P}(J_{k+1}^i = j)$ proportional to the second-stage weights $\omega_{k+1}^j$, $j = 1, \ldots, M$.
For $i = 1, \ldots, N$, set $\xi_{0:k+1}^i = \tilde\xi_{0:k+1}^{J_{k+1}^i}$ and $\omega_{k+1}^i = 1$.
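To make the two-stage mechanism concrete, here is a minimal Python sketch of one step of Algorithm 8.1.3 for a generic model with scalar states; it propagates only the current particle positions (the trajectory bookkeeping of Remark 7.3.5 is omitted). The callables tau_fn, sample_R, dQ_dR, and g are assumed to be supplied by the user—tau_fn may for instance implement the choice (8.12)—and their names are purely illustrative.

import numpy as np

def two_stage_step(xi, w, y_next, tau_fn, sample_R, dQ_dR, g, M, N, rng):
    # First-stage sampling: draw M ancestor indices with probabilities
    # proportional to the first-stage weights w * tau (cf. Algorithm 8.1.3).
    tau = tau_fn(xi, y_next)                      # adjustment multiplier weights
    first_stage = w * tau
    idx = rng.choice(len(xi), size=M, p=first_stage / first_stage.sum())
    # Mutation: propose new positions from the instrumental kernel R_k.
    prop = sample_R(xi[idx], y_next, rng)
    # Second-stage weights (8.11): g_{k+1}(prop) * (dQ/dR)(prop) / tau of the ancestor.
    w2 = g(prop, y_next) * dQ_dR(xi[idx], prop, y_next) / tau[idx]
    # Second-stage resampling: keep N particles and reset the weights.
    keep = rng.choice(M, size=N, p=w2 / w2.sum())
    return prop[keep], np.full(N, 1.0 / N)

Taking sample_R to draw from the prior kernel Q and dQ_dR identically equal to one yields the variant used in Example 8.1.4 below.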

The adjustment multiplier weights should be chosen to sample preferentially (in the first stage) the particle trajectories that are most likely under $\phi_{0:k+1|k+1}$. Usually the multiplier weight $\tau_{k+1}^i$ depends on the new observation $Y_{k+1}$ and on the position $\xi_k^i$ of the particle at index $k$, but more general choices can be considered as well. If one can guess, based on the new observation, which particle trajectories are most likely to survive or die, the resampling stage may be anticipated by increasing (or decreasing) the importance weights. As such, the use of adjustment multiplier weights is a mechanism to prevent sample impoverishment.

The expression for the second-stage weights in (8.11) provides additional insights on how to choose the adjustment multiplier weights. The efficiency of the SIR procedure is best when the importance weights are well-balanced, that is, when the total mass is spread over a large number of particles. The multiplier adjustment weights $\tau_{k+1}^i$ should thus be chosen to render the second-stage weights as evenly distributed as possible. In the particular case where sampling is done from the prior (or state transition) kernel, that is, if $R_k = Q$, the expression of the second-stage weight simplifies to
\[
\omega_{k+1}^i = g_{k+1}(\tilde\xi_{k+1}^i) \,/\, \tau_{k+1}^{I_k^i} \;.
\]
Although it is not possible to equate this expression to one, as $\tau_{k+1}^i$ cannot depend on $\tilde\xi_{k+1}^i$, it is easy to imagine strategies that reach this objective on average. Pitt and Shephard (1999) suggest that the adjustment multiplier weights be set as the likelihood of the mean of the predictive distribution corresponding to each particle,
\[
\tau_{k+1}^i = g_{k+1}\!\left( \int x\, Q(\xi_k^i, dx) \right) \;. \tag{8.12}
\]
In particular, in examples where $Q$ corresponds to a random walk move, the adjustment multiplier weight $\tau_{k+1}^i$ is thus equal to $g_{k+1}(\xi_k^i)$, the conditional likelihood of the new observation given the current position, which is quite natural. In general situations, the success of this approach depends on our ability to choose the adjustment multiplier weights in a way that makes the first sampling stage effective.


Table 8.1. Approximations of the posterior mean $\hat X_{5|5}$ in the noisy AR(1) model, obtained using the bootstrap filter and the auxiliary particle filter. The model and observations $Y_{0:5}$ are given in Example 7.2.3. Results are reported for different values of $M$ (size of the first-stage sample) and $N$ (number of particles retained in the second stage). The figures are means and standard errors from 500 independent replications for each pair of $M$ and $N$. The column "Ref." displays the true posterior mean computed by Kalman filtering.

                        Bootstrap filter                            Auxiliary particle filter
   N       Ref.   M = 100       1,000         10,000        M = 100       1,000         10,000
   100     0.91   0.49 (0.12)   0.57 (0.10)   0.61 (0.11)   0.56 (0.11)   0.62 (0.11)   0.62 (0.10)
   1,000   0.91   --            0.64 (0.10)   0.71 (0.09)   0.59 (0.11)   0.71 (0.10)   0.74 (0.09)
   10,000  0.91   --            --            0.75 (0.09)   0.60 (0.12)   0.73 (0.09)   0.80 (0.08)

Example 8.1.4 (Noisy AR(1) Model, Continued). To illustrate the behavior of the method, we consider again the simple noisy AR(1) model of Example 7.2.3, which has the advantage that exact filtering quantities may be computed by the Kalman recursions. In Example 7.2.3, we approximated the posterior mean of $X_5$ given the observed $Y_{0:5}$ using sequential importance sampling with the prior kernel $Q$ as instrumental kernel and found that this approximation grossly underestimates the true posterior mean, which evaluates (by Kalman filtering) to 0.91. The situation improves somewhat when using the optimal kernel $T_k$ (Example 7.2.4). Because there are only six observations, the differences between the results of SIS and SISR are small, as the weights do not have the time to degenerate (given that, in addition, the outlier occurs at the last time index). In Table 8.1, we compare the results of the SISR algorithm with $Q$ as the instrumental kernel (also known as the bootstrap filter) and the two-stage algorithm. Following (8.12), the adjustment multiplier weights were set to
\[
\tau_{k+1}^i = \mathrm{N}(Y_{k+1}\,;\, \phi\,\xi_k^i,\, \upsilon^2) \;,
\]
that is, to the likelihood of $Y_{k+1}$ evaluated at the predictive mean $\phi\,\xi_k^i$ of each particle, $\upsilon^2$ denoting the measurement noise variance; see Example 7.2.3 for details on the notation. This second algorithm is usually referred to as the (or an) auxiliary particle filter. The table shows that for all values of $M$ (the size of the first-stage sampling population) and $N$ (the number of particles retained in the second stage), the auxiliary particle filter outperforms the bootstrap filter. The auxiliary filter effectively reduces the bias to a level that is, in this case, comparable (albeit slightly larger) to that obtained when using the optimal kernel $T_k$ as instrumental kernel (see Figure 7.4). For the bootstrap filter (Algorithm 7.3.4), only values of $M$ larger than $N$ have been considered. Indeed, because the algorithm operates by first extending the trajectories and then resampling, it does not apply directly when $M < N$. Note however that the examination of the figures obtained for the auxiliary filter (Algorithm 8.1.3), for which both $M$ and $N$ may be chosen


freely, suggests that it is more efficient to use $M$ larger than $N$ than the converse. The payoff for using $M$ larger than $N$, compared to the base situation where $M = N$, is also much more significant in the case of the bootstrap filter—whose baseline performance is worse—than for the auxiliary particle filter.

8.1.3 Interpretation with Auxiliary Variables

We now discuss another interpretation of Algorithm 8.1.3, which is more in the spirit of Pitt and Shephard (1999). This alternative perspective on Algorithm 8.1.3 is based on the observation that although we generally consider our target distributions to be the joint smoothing distributions $\phi_{0:k|k}$, the obtained algorithms are directly applicable for approximating the filtering distribution $\phi_k$ simply by dropping the history of the particles (Remark 7.3.5). In particular, if we now consider that only the current system of particles $\{\xi_k^i\}_{1\le i\le N}$, with associated weights $\{\omega_k^i\}_{1\le i\le N}$, is available, (8.3) should be replaced by the marginal relation
\[
\hat\phi_{k+1}(f) \overset{\mathrm{def}}{=} \sum_{i=1}^{N} \frac{\omega_k^i\, \gamma_k(\xi_k^i)}{\sum_{j=1}^{N} \omega_k^j\, \gamma_k(\xi_k^j)} \int f(x)\, T_k(\xi_k^i, dx) \;, \qquad f \in \mathrm{F}_b(\mathsf{X}) \;, \tag{8.13}
\]
which thus defines our target distribution for updating the system of particles. For the same reason as above, it makes sense to select a proposal distribution (this time on $\mathsf{X}$) closely related to (8.13). Indeed, we consider the $N$-component mixture
\[
\rho_{k+1}(f) = \sum_{i=1}^{N} \frac{\omega_k^i\, \tau_{k+1}^i}{\sum_{j=1}^{N} \omega_k^j\, \tau_{k+1}^j} \int f(x)\, R_k(\xi_k^i, dx) \;. \tag{8.14}
\]

Proceeding as in (8.9)–(8.10), the Radon-Nikodym derivative is now given by
\[
\frac{d\hat\phi_{k+1}}{d\rho_{k+1}}(x) = C_k\, \frac{d\left( \sum_{i=1}^{N} \omega_k^i\, T_k^u(\xi_k^i, \cdot) \right)}{d\left( \sum_{i=1}^{N} \omega_k^i\, \tau_{k+1}^i\, R_k(\xi_k^i, \cdot) \right)}(x) \;. \tag{8.15}
\]
Compared to (8.10), this marginal importance ratio would be costly to evaluate as such, as both its numerator and its denominator involve summing over $N$ terms. This difficulty can be overcome by data augmentation, introducing an auxiliary variable that corresponds to the mixture component that is selected when drawing the new particle position. Consider the following distribution $\hat\phi^{\mathrm{aux}}_{k+1}$ on the product space $\{1, \ldots, N\} \times \mathsf{X}$:
\[
\hat\phi^{\mathrm{aux}}_{k+1}(\{i\} \times A) = \frac{\omega_k^i \int_A g_{k+1}(x)\, Q(\xi_k^i, dx)}{\sum_{j=1}^{N} \omega_k^j\, \gamma_k(\xi_k^j)} \;, \qquad A \in \mathcal{X} \;,\ i = 1, \ldots, N \;. \tag{8.16}
\]


Because
\[
\hat\phi^{\mathrm{aux}}_{k+1}(\{1, \ldots, N\} \times A) = \sum_{i=1}^{N} \hat\phi^{\mathrm{aux}}_{k+1}(\{i\} \times A) = \hat\phi_{k+1}(A) \;, \qquad A \in \mathcal{X} \;,
\]
$\hat\phi_{k+1}$ is the marginal distribution of $\hat\phi^{\mathrm{aux}}_{k+1}$, and we may sample from $\hat\phi_{k+1}$ by sampling from $\hat\phi^{\mathrm{aux}}_{k+1}$ and discarding the auxiliary index. To sample from $\hat\phi^{\mathrm{aux}}_{k+1}$ using the SIR method, we can then use the following instrumental distribution on the product space $\{1, \ldots, N\} \times \mathsf{X}$:
\[
\rho^{\mathrm{aux}}_{k+1}(\{i\} \times A) = \frac{\omega_k^i\, \tau_{k+1}^i}{\sum_{j=1}^{N} \omega_k^j\, \tau_{k+1}^j}\, R_k(\xi_k^i, A) \;, \qquad A \in \mathcal{X} \;. \tag{8.17}
\]
This distribution may be interpreted as the joint distribution of the selection index $I_k^i$ and the proposed new particle position $\tilde\xi_{k+1}^i$ in Algorithm 8.1.3. This time, the importance function is very simple and similar to (8.10),
\[
\frac{d\hat\phi^{\mathrm{aux}}_{k+1}}{d\rho^{\mathrm{aux}}_{k+1}}(i, x) = C_k\, \frac{g_{k+1}(x)}{\tau_{k+1}^i}\, \frac{dQ(\xi_k^i, \cdot)}{dR_k(\xi_k^i, \cdot)}(x) \;, \qquad i = 1, \ldots, N \;,\ x \in \mathsf{X} \;. \tag{8.18}
\]

Hence Algorithm 8.1.3 may also be understood in terms of auxiliary sampling.

8.1.4 Auxiliary Accept-Reject Sampling

Rather than using the SIR method, simulating from (8.17) and using the importance ratio defined in (8.18), we may consider other methods for simulating directly from (8.16). An option, already discussed in the context of sequential importance sampling in Section 7.2.2, consists in using the accept-reject method (defined in Section 6.2.1). The accept-reject method may be used to generate a truly i.i.d. sample from the target distribution. The price to pay compared to the SIR algorithm is a typically higher computational cost, especially when the acceptance probability is low. In addition, the number of simulations needed is itself random and the computation time cannot be predicted beforehand, especially when there are unknown normalizing constants (Remark 6.2.4). The method has nonetheless been studied for sequential simulation by several authors including Tanizaki (1996), Tanizaki and Mariano (1998), and Hürzeler and Künsch (1998) (see also Pitt and Shephard, 1999, and Liu and Chen, 1998).

In auxiliary accept-reject, the idea is to find an instrumental distribution $\rho^{\mathrm{aux}}_{k+1}$ that dominates the target $\hat\phi^{\mathrm{aux}}_{k+1}$ and is such that the Radon-Nikodym derivative $d\hat\phi^{\mathrm{aux}}_{k+1}/d\rho^{\mathrm{aux}}_{k+1}$ is bounded. Indeed, proposals of the form given in (8.8) still constitute an appropriate choice granted that we strengthen somewhat the assumptions that were needed for applying the SIR method.

Assumption 8.1.5. For any $k \ge 0$ and $x \in \mathsf{X}$,
\[
\sup_{x' \in \mathsf{X}} g_{k+1}(x')\, \frac{dQ(x, \cdot)}{dR_k(x, \cdot)}(x') < \infty \;. \tag{8.19}
\]


Because the index $i$ runs over a finite set $\{1, \ldots, N\}$, we may define
\[
M_k = \max_{1 \le i \le N} \frac{A_k^i}{\tau_{k+1}^i} \;, \qquad \text{where} \quad A_k^i \overset{\mathrm{def}}{=} \sup_{x \in \mathsf{X}} g_{k+1}(x)\, \frac{dQ(\xi_k^i, \cdot)}{dR_k(\xi_k^i, \cdot)}(x) \;. \tag{8.20}
\]
With these definitions, the Radon-Nikodym derivative $d\hat\phi^{\mathrm{aux}}_{k+1}/d\rho^{\mathrm{aux}}_{k+1}$ given by (8.18) is bounded by
\[
\frac{d\hat\phi^{\mathrm{aux}}_{k+1}}{d\rho^{\mathrm{aux}}_{k+1}}(i, x) \le M_k\, \frac{\sum_{i=1}^{N} \omega_k^i\, \tau_{k+1}^i}{\sum_{i=1}^{N} \omega_k^i\, \gamma_k(\xi_k^i)} \;, \tag{8.21}
\]

and hence the use of the accept-reject algorithm is valid. The complete algorithm proceeds as follows.

Algorithm 8.1.6 (Auxiliary Accept-Reject). For $i = 1, \ldots, N$,
Repeat:
Draw an index $I_k^i \in \{1, \ldots, N\}$ with probabilities proportional to the first-stage weights $\omega_k^1 \tau_{k+1}^1, \ldots, \omega_k^N \tau_{k+1}^N$.
Conditionally on $I_k^i$, draw a proposal $\tilde\xi_{k+1}^i$ from the instrumental transition kernel $R_k(\xi_k^{I_k^i}, \cdot)$ and $U^i$ from a uniform distribution on $[0, 1]$.
Until:
\[
U^i \le \frac{1}{M_k\, \tau_{k+1}^{I_k^i}}\, g_{k+1}(\tilde\xi_{k+1}^i)\, \frac{dQ(\xi_k^{I_k^i}, \cdot)}{dR_k(\xi_k^{I_k^i}, \cdot)}(\tilde\xi_{k+1}^i) \;.
\]
Update: Set $\xi_{k+1}^i = \tilde\xi_{k+1}^i$.
When done, reset all weights $\{\omega_{k+1}^i\}_{1\le i\le N}$ to a (common) constant value.
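As a complement to Algorithm 8.1.6, the following Python fragment sketches the accept-reject loop under the same assumptions (and with the same illustrative, user-supplied callables) as the two-stage sketch above; the array A contains the per-particle bounds $A_k^i$ of (8.20), which must be computed beforehand for the model at hand.

import numpy as np

def aux_accept_reject(xi, w, y_next, tau, A, sample_R, dQ_dR, g, N, rng):
    # A[i] are the bounds A_k^i of (8.20); M_k = max_i A[i] / tau[i].
    Mk = np.max(A / tau)
    first_stage = w * tau
    p = first_stage / first_stage.sum()
    out = np.empty(N)
    for i in range(N):
        while True:
            j = rng.choice(len(xi), p=p)          # index from the first-stage weights
            x = sample_R(xi[j], y_next, rng)      # proposal from R_k(xi_j, .)
            u = rng.uniform()
            # accept with probability g(x) (dQ/dR)(x) / (M_k tau_j)
            if u * Mk * tau[j] <= g(x, y_next) * dQ_dR(xi[j], x, y_next):
                out[i] = x
                break
    return out, np.full(N, 1.0 / N)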

Because the joint distribution of the accepted pairs is $\hat\phi^{\mathrm{aux}}_{k+1}$, as defined by (8.16), the marginal distribution of the accepted draws (forgetting about the index) is (8.13), as required. One should typically try to increase the acceptance rate by proper choices of the adjustment multiplier weights $\tau_{k+1}^i$ and, whenever possible, by also choosing the instrumental kernel $R_k$ in an appropriate fashion. The user should also determine the upper bounds $A_k^i$ in (8.20) as tightly as possible. The following lemma, due to Künsch (2003), gives some indications on how the multiplier weights should be chosen to maximize the acceptance ratio.

Lemma 8.1.7. For a given choice of instrumental kernels $R_k$ and upper bounds $A_k^i$, the average acceptance probability is maximal when the incremental adjustment weights $\tau_{k+1}^i$ are proportional to $A_k^i$ for $i = 1, \ldots, N$.

Proof. Recall from Remark 6.2.4 that because of the presence of unknown normalization constants, the acceptance probability of the accept-reject method is not $1/M_k$ but rather the inverse of the upper bound on the importance function, that is, of the right-hand side of (8.21). Because

\[
M_k\, \frac{\sum_{i=1}^{N} \omega_k^i\, \tau_{k+1}^i}{\sum_{i=1}^{N} \omega_k^i\, \gamma_k(\xi_k^i)} \;\ge\; \frac{\sum_{i=1}^{N} \omega_k^i\, \tau_{k+1}^i\, \dfrac{A_k^i}{\tau_{k+1}^i}}{\sum_{i=1}^{N} \omega_k^i\, \gamma_k(\xi_k^i)} \;=\; \frac{\sum_{i=1}^{N} \omega_k^i\, A_k^i}{\sum_{i=1}^{N} \omega_k^i\, \gamma_k(\xi_k^i)} \;,
\]
the acceptance probability is bounded by
\[
\frac{\sum_{i=1}^{N} \omega_k^i\, \gamma_k(\xi_k^i)}{\sum_{i=1}^{N} \omega_k^i\, A_k^i} \;. \tag{8.22}
\]

The bound is attained when $A_k^i / \tau_{k+1}^i = M_k$ for all $i$.

Tanizaki and Mariano (1998) and Hrzeler and Knsch (1998) both conu u sider the particular choice Rk = Q. Lemma 8.1.7 shows that the optimal i adjustment multiplier weights are then constant, k = 1 for all i. This is somewhat surprising in light of the discussion in Section 8.1.2, as one could conjecture heuristically that it is more appropriate to favor particles that agree with the next observations. Lemma 8.1.7 however shows that the only means to improve the acceptance rate is, whenever possible, to properly optimize the instrumental kernel. 8.1.5 Markov Chain Monte Carlo Auxiliary Sampling Rather than using the accept-reject algorithm to sample exactly from (8.16), Berzuini et al. (1997) suggest that a few iterations of a Markov chain Monte Carlo sampler with target distribution (8.16) be used. The algorithm proposed by Berzuini et al. (1997) is based on the independent Metropolis-Hastings sampler discussed in Section 6.2.3.1. Once again, we use a distribution aux k+1 of the form dened in (8.8) as the proposal, but this time the chain moves from (i, x) to (i , x ) with a probability given by A[(i, x), (i , x )] 1 where A [(i, x), (i , x )] =
\[
A[(i, x), (i', x')] = \frac{\dfrac{g_{k+1}(x')}{\tau_{k+1}^{i'}}\, \dfrac{dQ(\xi_k^{i'}, \cdot)}{dR_k(\xi_k^{i'}, \cdot)}(x')}{\dfrac{g_{k+1}(x)}{\tau_{k+1}^{i}}\, \dfrac{dQ(\xi_k^{i}, \cdot)}{dR_k(\xi_k^{i}, \cdot)}(x)} \;. \tag{8.23}
\]
In case of rejection, the chain stays in $(i, x)$. This update step is then repeated independently $N$ times.

Algorithm 8.1.8 (Auxiliary MCMC). For $i = 1, \ldots, N$,
Initialization: Draw an index $I_k^{i,1} \in \{1, \ldots, N\}$ with probabilities proportional to the first-stage weights $\omega_k^1 \tau_{k+1}^1, \ldots, \omega_k^N \tau_{k+1}^N$, and $\xi_{k+1}^{i,1}$ from the instrumental transition kernel $R_k(\xi_k^{I_k^{i,1}}, \cdot)$. Set $\xi_{k+1}^i = \xi_{k+1}^{i,1}$ and $I_k^i = I_k^{i,1}$.
For $j = 2$ to $J_{\max}$: Draw an index $I_k^{i,j} \in \{1, \ldots, N\}$ with probabilities proportional to the first-stage weights $\omega_k^1 \tau_{k+1}^1, \ldots, \omega_k^N \tau_{k+1}^N$, draw $\xi_{k+1}^{i,j}$ from the instrumental transition kernel $R_k(\xi_k^{I_k^{i,j}}, \cdot)$ and a $\mathrm{U}([0,1])$ variable $U^j$. If
\[
U^j \le A[(I_k^i, \xi_{k+1}^i), (I_k^{i,j}, \xi_{k+1}^{i,j})] \;,
\]
set $\xi_{k+1}^i = \xi_{k+1}^{i,j}$ and $I_k^i = I_k^{i,j}$.


When done, all weights $\{\omega_{k+1}^i\}_{1\le i\le N}$ are reset to a constant value.

In the above algorithm, $\rho^{\mathrm{aux}}_{k+1}$ is used both as proposal distribution for the independent Metropolis-Hastings sampler and for generating the initial values $I_k^{i,1}$ and $\xi_{k+1}^{i,1}$. Compared to the accept-reject approach of the previous section, Algorithm 8.1.8 is appealing, as it is associated with a deterministic computation time that scales like the product of $N$ and $J_{\max}$. On the other hand, the method can only be useful if $J_{\max}$ is small, which in turn is legitimate only if the independent Metropolis-Hastings chain is fast mixing. As discussed in Section 6.2.3.1, the mixing of each individual chain is governed by the behavior of the quantity
\[
M_k^i = \sup_{x \in \mathsf{X}} \frac{g_{k+1}(x)}{\tau_{k+1}^i}\, \frac{dQ(\xi_k^i, \cdot)}{dR_k(\xi_k^i, \cdot)}(x) \;,
\]

and the chain is uniformly (geometrically) ergodic, at rate $(1 - 1/M_k^i)$, only if $M_k^i$ is finite. Not surprisingly, this approach thus shares many common features and properties with the accept-reject algorithm discussed in the previous section. It is of course possible to combine both methods (Tanizaki, 2003) or to resort to other types of MCMC samplers. We refer to Berzuini and Gilks (2001) for a full discussion of this approach together with some examples where it is particularly useful.
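The following sketch, again under the conventions of the previous code fragments (user-supplied, purely illustrative callables), shows how the independent Metropolis-Hastings moves of Algorithm 8.1.8 can be implemented; the unnormalized target ratio is exactly the quantity appearing in (8.23).

import numpy as np

def aux_mcmc_step(xi, w, y_next, tau, sample_R, dQ_dR, g, N, J_max, rng):
    p = w * tau
    p = p / p.sum()
    def ratio(j, x):
        # unnormalized importance ratio g(x) (dQ/dR)(x) / tau_j, cf. (8.18)
        return g(x, y_next) * dQ_dR(xi[j], x, y_next) / tau[j]
    new_xi = np.empty(N)
    for i in range(N):
        j = rng.choice(len(xi), p=p)              # initialization from the proposal
        x = sample_R(xi[j], y_next, rng)
        for _ in range(1, J_max):
            j_prop = rng.choice(len(xi), p=p)
            x_prop = sample_R(xi[j_prop], y_next, rng)
            # independent Metropolis-Hastings acceptance ratio (8.23)
            if rng.uniform() <= min(1.0, ratio(j_prop, x_prop) / ratio(j, x)):
                j, x = j_prop, x_prop
        new_xi[i] = x
    return new_xi, np.full(N, 1.0 / N)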

8.2 Sequential Monte Carlo in Hierarchical HMMs


In Section 4.2, we examined a general class of HMMs, referred to as hierarchical HMMs, for which the state can be partitioned into two components, one of which can be analytically integrated out—or marginalized—conditionally on the other component. When marginalization is feasible, one may derive computationally efficient sampling procedures that focus their full attention on a state space whose dimension is smaller—and in most applications, much smaller—than the original one. As a result, when marginalization is feasible, it usually significantly improves the performance of particle filtering, allowing in particular a drastic reduction of the number of particles needed to achieve a given level of accuracy of the estimates (Akashi and Kumamoto, 1977; Liu and Chen, 1998; MacEachern et al., 1999; Doucet et al., 2000a,b). One should however keep in mind that marginalization requires the use of rather sophisticated algorithms, and that the computations necessary to update each marginal particle can be much more demanding than for an unstructured particle that lives in the complete state space. Marginalizing out some of the variables is an example of a classical technique in computational statistics referred to as Rao-Blackwellization, because it is related to the Rao-Blackwell risk reduction principle in statistics. Rao-Blackwellization is an important ingredient of simulation-based methods that we already met in the context of MCMC methods in Chapter 6.


In the hierarchical hidden Markov model introduced in Section 1.3.4, the state variable $X_k$ can be decomposed in two parts $(C_k, W_k)$, where $C_k$ is called the indicator variable (or the regime) and $W_k$ is the partial state, which can be marginalized out conditionally on the regime. We will focus on the special case where the indicator variables are discrete and finite. Although it is possible to use the marginalization principle in a more general setting (see, e.g., Doucet et al., 2001b, or Andrieu et al., 2003), the case of discrete indicator variables remains the most important in practical applications.

8.2.1 Sequential Importance Sampling and Global Sampling

Assume that the indicator variables take their values in the finite set $\mathsf{C} = \{1, \ldots, r\}$. We consider here, as previously, that the goal is to simulate from the sequence of joint probability measures $\{\phi_{0:k|k}\}_{k\ge 0}$ of $C_{0:k}$ given $Y_{0:k}$. For the moment, the details of the structure of $\phi_{0:k|k}$ do not matter and we simply assume that there exists an (unnormalized) transition kernel $T_k^u: \mathsf{C}^{k+1} \times \mathsf{C} \to \mathbb{R}^+$ such that
\[
\phi_{0:k+1|k+1}(c_{0:k+1}) = \phi_{0:k|k}(c_{0:k})\, T_k^u(c_{0:k}, c_{k+1}) \;. \tag{8.24}
\]

Note that, as usual for probabilities on discrete spaces, we use the notation $\phi_{0:k|k}(c_{0:k})$ rather than $\phi_{0:k|k}(\{c_{0:k}\})$. This definition should be compared to (7.8). Indeed, $T_k^u$ is an unnormalized kernel similar to that which appears in (7.8), although it does not depend—as a function of $c_{0:k}$—on $c_k$ only. This modification is due to the fact that the structure of the joint smoothing distribution in hierarchical HMMs, when marginalizing with respect to the intermediate component $\{W_k\}$, is more complex than in the models that we have met so far in this chapter (see Section 4.2.3). Once again, these considerations are not important for the moment, and the reader should consider (8.24) as the definition of a (generic) sequence of probability distributions over increasing spaces.

8.2.1.1 Sequential Importance Sampling

In the sequential importance sampling framework, the target distribution at time $k$ is approximated by independent path particles denoted, as previously, by $\xi_{0:k}^1, \ldots, \xi_{0:k}^N$, associated with non-negative (normalized) weights $\omega_k^1, \ldots, \omega_k^N$ such that
\[
\hat\phi_{0:k|k}(c_{0:k}) = \sum_{i=1}^{N} \omega_k^i\, \mathbb{1}_{\xi_{0:k}^i}(c_{0:k}) \;. \tag{8.25}
\]

These particles and weights are updated sequentially by drawing from an instrumental distribution over sequences in $\mathsf{C}^{\mathbb{N}}$ defined by an initial probability


distribution $\rho_0$ on $\mathsf{C}$ and a family of transition kernels $R_k: \mathsf{C}^{k+1} \times \mathsf{C} \to \mathbb{R}^+$, for $k \ge 0$, such that
\[
\rho_{0:k+1}(c_{0:k+1}) = \rho_{0:k}(c_{0:k})\, R_k(c_{0:k}, c_{k+1}) \;, \tag{8.26}
\]
where $\rho_{0:k}$ denotes the joint distribution of $\xi_{0:k}^1$. It is assumed that for each $k$, the instrumental kernel $R_k$ dominates the transition $T_k^u$ in the sense that for any $c_{0:k}$ and any $c = 1, \ldots, r$, the condition $T_k^u(c_{0:k}, c) > 0$ implies $R_k(c_{0:k}, c) > 0$. In words, all transitions that are permitted (have positive probability) under the model are permitted also under the instrumental kernel.

In the sequential importance sampling procedure, one draws exactly one successor for each path particle $\xi_{0:k}^i$, $i = 1, \ldots, N$. More precisely, an $N$-tuple $I_{k+1}^1, \ldots, I_{k+1}^N$ is drawn conditionally independently given the past and with probabilities proportional to the weights $R_k(\xi_{0:k}^i, 1), \ldots, R_k(\xi_{0:k}^i, r)$. The particle system is then updated according to $\xi_{0:k+1}^i = (\xi_{0:k}^i, I_{k+1}^i)$. If $\xi_0^1, \ldots, \xi_0^N$ are drawn independently from a probability distribution $\rho_0$, the particle system $\xi_{0:k}^1, \ldots, \xi_{0:k}^N$ consists of $N$ independent draws from the instrumental distribution $\rho_{0:k}$. As in (7.13), the associated (unnormalized) importance weights can be written as a product of incremental weights,
\[
\omega_{k+1}^i = \omega_k^i\, \frac{T_k^u(\xi_{0:k}^i, I_{k+1}^i)}{R_k(\xi_{0:k}^i, I_{k+1}^i)} \;. \tag{8.27}
\]
The instrumental transition kernel that minimizes the variance of the importance weights conditionally on the history of the particle system will be denoted by $T_k$ and is given by the analog of (7.15):
\[
T_k(c_{0:k}, c) = \frac{T_k^u(c_{0:k}, c)}{T_k^u(c_{0:k}, \mathsf{C})} \;, \qquad c_{0:k} \in \mathsf{C}^{k+1} \;,\ c \in \mathsf{C} \;. \tag{8.28}
\]

This kernel is referred to as the optimal instrumental kernel. The importance weights (8.27) associated with this kernel are updated according to
\[
\omega_{k+1}^i = \omega_k^i\, T_k^u(\xi_{0:k}^i, \mathsf{C}) \;. \tag{8.29}
\]
As before, these incremental importance weights do not depend on the descendant of the particle. The SIS algorithm using the optimal importance kernel is equivalent to the random sampling algorithm of Akashi and Kumamoto (1977). In this scheme, resampling is stochastic, with precisely one descendant of each particle at time $k$ being kept. For each particle, a descendant is chosen with probabilities proportional to the descendants' weights $T_k^u(\xi_{0:k}^i, 1), \ldots, T_k^u(\xi_{0:k}^i, r)$. The weight of the chosen particle is set to the product of its parent's weight and the sum $\sum_{c=1}^{r} T_k^u(\xi_{0:k}^i, c)$.
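In this finite-state setting the optimal-kernel SIS update takes a particularly simple form. The sketch below (Python, with an assumed user-supplied function Tu_row returning the row $T_k^u(\xi_{0:k}^i, 1), \ldots, T_k^u(\xi_{0:k}^i, r)$) draws one descendant per particle and updates the weight according to (8.29); regimes are indexed from 0 in the code.

import numpy as np

def sis_optimal_step(paths, w, Tu_row, rng):
    # paths: list of N indicator trajectories (lists); w: array of N weights.
    N = len(paths)
    new_paths, new_w = [], np.empty(N)
    for i in range(N):
        tu = Tu_row(paths[i])                      # unnormalized optimal kernel (8.28)
        c = rng.choice(len(tu), p=tu / tu.sum())   # draw one descendant per particle
        new_paths.append(paths[i] + [c])
        new_w[i] = w[i] * tu.sum()                 # incremental weight (8.29)
    return new_paths, new_w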


8.2.1.2 Global Sampling

As in the previous chapter, the particle system produced by sequential importance sampling degenerates, and the way to fight this degeneracy is resampling. Because the state space is finite however, we now can probe the whole state space, because each particle has a finite number ($r$) of possible descendants. The sampling and resampling steps may then be combined into a single random draw. Recall that a natural estimator of the target distribution $\phi_{0:k|k}$ at time $k$ is the empirical distribution of the particles defined in (8.25). Equation (8.24) suggests to estimate the probability distribution $\phi_{0:k+1|k+1}$ by
\[
\hat\phi_{0:k+1|k+1}(c_{0:k+1}) = \frac{\sum_{i=1}^{N} \omega_k^i\, \mathbb{1}_{\xi_{0:k}^i}(c_{0:k})\, T_k^u(\xi_{0:k}^i, c_{k+1})}{\sum_{i=1}^{N} \omega_k^i\, T_k^u(\xi_{0:k}^i, \mathsf{C})} \;. \tag{8.30}
\]
This equation corresponds to (8.4) in the current discrete setting. The support of this distribution is included in the set of all the possible descendants of the current system of particles. Each particle has at most $r$ possible descendants, and thus the support of this distribution has at most $Nr$ points. A straightforward solution (see for instance Fearnhead and Clifford, 2003) to sample from this distribution is as follows.

Algorithm 8.2.1 (Global Sampling).
Weighting: For $i = 1, \ldots, N$ and $j = 1, \ldots, r$, compute the (normalized) weights
\[
\bar\omega_{k+1}^{i,j} = \frac{\omega_k^i\, T_k^u(\xi_{0:k}^i, j)}{\sum_{l=1}^{N} \sum_{c=1}^{r} \omega_k^l\, T_k^u(\xi_{0:k}^l, c)} \;. \tag{8.31}
\]

Sampling: Conditionally independently of the particle system history, draw $N$ identically distributed pairs $(I_k^i, J_{k+1}^i) \in \{1, \ldots, N\} \times \{1, \ldots, r\}$, for $i = 1, \ldots, N$, such that $\mathrm{P}[(I_k^i, J_{k+1}^i) = (l, j) \mid \mathcal{G}_k] = \bar\omega_{k+1}^{l,j}$, where $\mathcal{G}_k$ is the $\sigma$-field generated by the history of the particle system up to time $k$.
Update: Set $\xi_{0:k+1}^i = (\xi_{0:k}^{I_k^i}, J_{k+1}^i)$ and $\omega_{k+1}^i = 1/N$ for $i = 1, \ldots, N$.

Remark 8.2.2. There are several closely related algorithms that have appeared in the literature, in particular the detection estimation algorithm of Tugnait (1984). In this algorithm, the resampling stage is deterministic, with the $N$ particles having largest weights being kept. The application of such ideas has been especially investigated in digital communication applications and is discussed, for instance, by Punskaya et al. (2002) and Bertozzi et al. (2003).

8.2.2 Optimal Sampling

As stressed in Section 7.4.2, there are other options to draw the reallocation variables, such as residual, stratified, or systematic resampling. Although these


can certainly be useful in this context, the discrete nature of the state space has an unexpected consequence that is not addressed properly by the resampling techniques discussed so far. For problems in which the state space is continuous, having multiple copies of particles is not detrimental. After resampling, each copy of a given duplicated particle will evolve independently from the others. Therefore, a particle with a large importance weight that is replicated many times in the resampling stage may, in the future, have a large number of distinct descendants. When the state space is nite however, each particle can i u i probe all its possible descendants (0:k , j) such that Tk (0:k , j) > 0. Hence, if the resampling procedure replicates a particle at time k, the replications of this particle will probe exactly the same congurations in the future. Having multiple copies of the same path particle in nite state space models is thus particularly wasteful. A possible solution to this problem has been suggested by Fearnhead and Cliord (2003) under the name optimal sampling. Instead of drawing reallocai i tion variables {(Ik , Jk+1 )}1iN , we sample non-negative importance weights i,j {Wk+1 } satisfying the constraints
N r
i,j 1{Wk+1 >0} N

(8.32) i = 1, . . . , N, j = 1, . . . , r, (8.33)

i=1 j=1 i,j i,j E[Wk+1 | Gk ] = k+1 ,

i,j where the weights k+1 are dened in (8.31). The rst constraint is that there are at most N particles with non-zero weights. The second constraint is that the importance weights be unbiasedin the terminology of Liu and Chen (1998) or Liu et al. (2001), the new sample is the said to be properly weighted. A word of caution is needed here: despite the fact that the unbiasedness condition is very sensible in the context of resampling, it does not, in itself, guarantee a proper behavior of the algorithm (more on this will be said in Chapter 9). Conversely, exact unbiasedness is not absolutely necessary, and it is perfectly possible to consider algorithms that exhibit a low, and controllable, bias. The problem reduces to that of approximating a probability distribution having M = N r points of support by a random probability distribution having at most N points of support. Resampling is equivalent to assigning a new, random weight to each of the M = N r particles. If the weight is zero the particle is removed, whereas if the weight is non-zero the particle is i,j kept; the non-zero random variables Wk+1 represent the new weights of the descendants of the particle system. In a more general perspective, the problem can be formulated as follows. Let be a discrete probability distribution with M points of support M

= (1 , . . . , M ),

i 0,
i=1

i = 1 .

(8.34)



We want to nd a random probability distribution W = (W1 , . . . , WM ) on {1, . . . , M } with at most N M points of support,
M M

Wi 0 ,
i=1

Wi = 1 ,
i=1

1{Wi >0} N ,

(8.35)

satisfying E[Wi ] = i , i = 1, . . . , M . (8.36) There are of course a number of dierent ways to achieve (8.35) and (8.36). In particular, all the resampling methods discussed in Section 7.4.2 (as well as multinomial resampling) draw integer counts Ni , which are such that Wi = Ni /N satisfy the above requirements, with equality for the last condition in (8.35). The optimal solution is the one that guarantees that the random distribution W is close, in some suitable sense, to the target distribution . We follow the suggestion of Fearnhead and Cliord (2003) and use the average L2 distance. The problem then becomes equivalent to nding a random probability distribution W = (W1 , . . . , WM ) that minimizes
M

E(Wi i )2
i=1

(8.37)

subject to (8.35) and (8.36). To compute the solution we rely on two lemmas. Lemma 8.2.3. Let 0 and p (0, 1]. If W is a non-negative random variable satisfying E[W ] = then E(W )2 and P(W > 0) = p , 1p 2 . p (8.38)

(8.39)

The lower bound is attained by any random variable W such that W equals /p on the subset of the sample space where W > 0. Proof. By decomposing the sample space into {W > 0} and {W = 0}, we obtain = E[W ] = E[W | W > 0] P(W > 0) = E[W | W > 0]p , and by a similar decomposition, E(W )2 = E[(W )2 | W > 0]p + 2 (1 p) . A bias-variance decomposition of E[(W )2 | W > 0] then gives (8.41) (8.40)



E[(W )2 | W > 0] = E[(W E[W |W > 0])2 | W > 0] + (E[W | W > 0] )2 = E[(W E[W |W > 0])2 | W > 0] + 2 (1 p)2 , p2

where we used (8.40) to obtain the second equality. The right-hand side of this display is bounded from below by 2 (1 p)2 /p2 , and inserting this into the right-hand side of (8.41) we obtain (8.39). Using the last display once again, we also see that the bound is attained if and only if W equals E[W |W > 0] = /p on the set where W > 0. Lemma 8.2.4. Let N < M be integers and let 1 , . . . , M be non-negative numbers. Consider the problem
M

minimize
j=1 M

j pj pj N ,

subject to
j=1

0 pj 1, This problem has a unique solution given by pj = j 1,

j = 1, . . . , M .

j = 1, . . . , M ,

(8.42)

where the constant is the unique solution of the equation


M

j 1 = N .
i=1

(8.43)

Proof. Denote by and i the Lagrange multipliers associated respectively M with the inequality constraints i=1 pi N and pi 1, i = 1, . . . , M . The Karush-Kuhn-Tucker conditions (see Boyd and Vandenberghe, 2004) for the primal p1 , . . . , pM and dual , 1 , . . . , M optimal points are given by
M

pi N,
i=1

pi 1, i 0,

i = 1, . . . , M , i = 1, . . . , M , i = 1, . . . , M , i = 1, . . . , M .

(8.44) (8.45) (8.46) (8.47)

0,
M

i=1

pi N

= 0,

i (pi 1) = 0,

i + + i = 0, p2 i



The complementary slackness condition (8.46) implies that for all indices i such that pi < 1, the corresponding multiplier i is zero. Hence, using (8.47), pi = i 1, i = 1, . . . , M . (8.48)

From this we see that if = 0, then pi = 1 for all i and (8.44) cannot be satised. Thus > 0, and the complementary slackness condition (8.46) therefore M implies that i=1 pi = N . Plugging (8.48) into this equation determines the M i / 1 = N . multiplier by solving for in the equation 1 By combining these two lemmas, we readily obtain a characterization of the random distribution achieving the minimal average divergence (8.37) subM ject to the support constraint i=1 P(Wi > 0) N and the unbiasedness constraint (8.36). Proposition 8.2.5. Let W = (W1 , . . . , WM ) be a random vector with nonnegative entries. This vector is a solution to the problem
M

minimize
i=1 M

E(Wi i )2 P(Wi > 0) N ,


i=1

subject to

E[Wi ] = i , if and only if for any i = 1, . . . , M , Wi = i /pi 0

i = 1, . . . , M ,

with probability pi = i 1 , otherwise ,

def

(8.49)

where is the unique solution of the equation


M

i 1 = N .
i=1

(8.50)

Proof. Put pi = P(Wi > 0). By Lemma 8.2.3,


M M

E(Wi i )2
i=1 i=1

2 i 2 i . pi i=1

(8.51)

The proof follows from Lemma 8.2.4. Remark 8.2.6. Note that if i 1, then pi = 1 and i /pi = i . Thus (8.49) implies that weights exceeding a given threshold (depending on the weights



themselves) are left unchanged. For a particle i whose weight falls below this threshold, the algorithm proceeds as follows. With probability 1 pi > 0, the weight is set to zero; otherwise it is set (and thus increased) to i /pi = 1/ in order to satisfy the unbiasedness condition. The algorithm is related to the procedure proposed in Liu et al. (2001) under the name partial rejection control. The above proposition describes the marginal distribution of the Wi that solves (8.37). The following result proposes a simple way to draw random M weights (W1 , . . . , WM ) that satisfy (8.49) with i=1 1{Wi >0} = N . Proposition 8.2.7. Let be the solution of (8.50), S = {i {1, . . . , M } : i 1}
def

(8.52)

and pi = i 1. Let U be a uniform random variable on (0, 1) and set Ni =


jS, ji

pj + U

jS, j<i

pj + U ,

i = 1, . . . , M ,

with being the integer part. Dene the random vector W = (W1 , . . . , WM ) by i if i S , (8.53) Wi = 1/ if i S and Ni > 0 , 0 if i S and Ni = 0 . Then W satises (8.49) and
M

1{Wi >0} = N ,
Wi = 1 .

(8.54)

i=1 M

(8.55)

i=1

Proof. We rst show that P(Wi > 0) = pi . For i S this is immediate, with pi = 1. Thus pick i S. Then Ni sup( x + pi x ) 1 .
x0

Therefore Ni = 1{Wi >0} , which implies P(Wi > 0) = P(Ni > 0) = E[Ni ]. It is straightforward to check that the expectation of Ni is the dierence of the two sums involved in its denition, whence E[Ni ] = pi . Thus P(Wi > 0) = pi , showing that (8.49) is satised. M Next observe that 1 1{Wi >0} = |S| + iS Ni . The sum of Ni over all i S is a telescoping one, whence

8.2 Sequential Monte Carlo in Hierarchical HMMs


M

273

1{Wi >0} = |S| +

iS

pi + U U

i=1

= |S| + N |S| + U U = |S| (N |S|) = N , where we used iS pi = Thus we have (8.54). Finally,
M M 1

pi

iS

pi = N |S| for the second equality.

Wi =
i=1 iS

i +
iS

Ni / .

From the above, we know that the second sum on the right-hand side equals (N |S|)/c. Because, by denition, i /pi = 1/ for i S, the rst sum is i = 1
iS iS M 1

i = 1 1
iS

pi = 1

N |S| .

We conclude that

Wi = 1, that is, (8.55) holds.

Back to our original problem, Proposition 8.2.7 suggests the following sampling algorithm. Algorithm 8.2.8 (Optimal Sampling). Weighting : For i = 1, . . . , N and j = 1, . . . , r, compute the weights
i,j k+1 = i u i k Tk (0:k , j) N l=1 r c=1 l u i k Tk (0:k , c)

(8.56)

Sampling: Determine the solution k+1 of the equation


N r i,j k+1 k+1 1 = N . i=1 j=1

Draw U U([0, 1]) and set S = 0. For i = 1, . . . , N and j = 1, . . . , r, i,j i,j i,j If k+1 k+1 1, then set Wk+1 = k+1 . i,j If k+1 k+1 < 1, then set
i,j Wk+1 =

1 k+1 0

i,j if k+1 (S + k+1 ) + U k+1 S + U > 0 , otherwise ,

i,j and set S = S + k+1 . i,j Update: For i = 1, . . . , N and j = 1, . . . , r, if Wk+1 > 0 set i 0:k+1 = (0:k , j) , i i,j k+1 = Wk+1 , I(i,j) j1 I(i,j)

where

I(i, j) =
l=1 c=1

1{Wk+1 >0} . l,c



8.2.3 Application to CGLSSMs In this section, we consider conditionally Gaussian linear state-space models (CGLSSMs), introduced in Section 1.3.4 and formally dened in Section 2.2.3. Recall that a CGLSSM is such that Wk+1 = A(Ck+1 )Wk + R(Ck+1 )Uk , Yk = B(Ck )Wk + S(Ck )Vk , where {Ck }k0 is a Markov chain on the nite set C = {1, . . . , r}, with transition kernel QC and initial distribution C ; the state noise {Uk }k0 and measurement noise {Vk }k0 are independent multivariate Gaussian white noises with zero mean and identity covariance matrices; the initial partial state W0 is assumed to be independently N( , ) distributed; A, B, R, and S are known matrix-valued functions of appropriate dimensions. (8.57)

Ecient recursive procedures, presented in Section 5.2.6, are available to compute the ltered or predicted estimate of the partial state and the associated error covariance matrix conditionally on the indicator variables and observations. By embedding these algorithms in the sequential importance sampling resampling framework, it is possible to derive computationally efcient sampling procedures that operate in the space of indicator variables (Doucet et al., 2000a; Chen and Liu, 2000). Recall in particular that the keru nel Tk in (8.24) has an expression given by (4.11), which we repeat below.
u Tk (c0:k , ck+1 ) =

Lk+1 Lk
W

QC (ck , ck+1 )

gk+1 (ck+1 , wk+1 )k+1|k (c0:k+1 , wk+1 ) dwk+1 , (8.58) for c0:k+1 Ck+2 , where Lk is the likelihood of the observations up to time k; gk+1 (ck+1 , wk+1 ) = g [(ck+1 , wk+1 ), Yk+1 ] is the value of the transition density function of the observation Yk+1 given the state and indicator variables, that is, gk+1 (ck+1 , wk+1 ) = N(Yk+1 ; B(ck+1 )wk+1 , S(ck+1 )S t (ck+1 )) , (8.59)

with N(; , ) being the density of the Gaussian multivariate distribution with mean and covariance matrix ;



k+1|k (c0:k+1 , wk+1 ) is the density of the predictive distribution of the partial state Wk+1 given the observations up to time k and the indicator variables up to time k + 1: k+1|k (c0:k+1 , wk+1 ) = N wk+1 ; Wk+1|k (c0:k+1 ), k+1|k (c0:k+1 ) ,

(8.60) where Wk+1|k (c0:k+1 ) and k+1|k (c0:k+1 ) denote respectively the conditional mean and error covariance matrix of the prediction of the partial state Wk+1 in terms of the observations Y0:k and indicator variables C0:k+1 = c0:k+1 these quantities can be computed recursively using the Kalman one-step prediction/correction formula (see Section 5.2.3). As discussed in Section 4.2.3, the distribution of the partial state Wn conditional on the observations up to time n is a mixture of rn+1 componentshere, Gaussian componentswith weights given by 0:n|n . In the particle approxi imation, each particle 0:n relates to a single term in this mixture. Particle approximation of the ltering distribution n|n of the partial state Wn thus consists in recursively choosing N components out of a growing mixture of rn+1 components and adjusting accordingly the weights of the components which are kept; hence the name mixture Kalman lter proposed by Chen and Liu (2000) to describe this approach. Algorithm 8.2.9 (Mixture Kalman Filter). Initialization: For i = 1, . . . , r, compute
i 0 = i , i 0 = N(Y0 ; B(i) , B(i) B t (i) + S(i)S t (i)) C (c0 ) , i K0 (0 ) = B t (i) B(i) B t (i) + S(i)S t (i) i i W0|0 (0 ) = + K0 (0 ) [Y0 B(i) ] , i i 0|0 (0 ) = K0 (0 )B(i) . 1

Recursion: Computation of weights: For i = 1, . . . , N and j = 1, . . . , r, compute


i i Wk+1|k (0:k , j) = A(j)Wk|k (0:k ) , i i k+1|k (0:k , j) = A(j)k|k (0:k )At (j) + R(j)Rt (j) , i i Yk+1|k (0:k , j) = B(j)Wk+1|k (0:k , j) , i i k+1 (0:k , j) = B(j)k+1|k (0:k , j)B t (j) + S(j)S t (j) , i i i i i,j = k N(Yk+1 ; Yk+1|k (0:k , j), k+1 (0:k , j)) QC (k , j) . k+1 i (First Option) Importance Sampling Step: For i = 1, . . . , N , draw Jk+1 in i,1 i,r {1, . . . , r} with probabilities proportional to k , . . . , k , conditionally independently of the particle history, and set




i i i 0:k+1 = (0:k , Jk+1 ) , r i k+1 = j=1 i Kk+1 (0:k+1 ) N r

k+1 i,j
i=1 j=1

k+1 , i,j

1 i i i i i = k+1|k (0:k , Jk+1 )B t (Jk+1 )k+1 (0:k+1 , Jk+1 ) , i i i Wk+1|k+1 (0:k+1 ) = Wk+1|k (0:k , Jk+1 ) i i i + Kk+1 (0:k+1 ) Yk+1 Yk+1|k (0:k , Jk+1 ) , i i i i k+1|k+1 (0:k+1 ) = I Kk+1 (0:k+1 )B(Jk+1 ) k+1|k (0:k , Jk+1 ) .

(Second Option) Optimal Sampling Step: i,j Draw importance weights Wk for i = 1, . . . , N and j = 1, . . . , r using Algorithm 8.2.8. i,j Set I = 0. For i = 1, . . . , N and j = 1, . . . , r, if Wk+1 > 0 then
I i 0:k+1 = (0:k , j) , i,j I k+1 = Wk+1 , 1 I i i Kk+1 (0:k+1 ) = k+1|k (0:k , j)B t (j)k+1 (0:k , j) , I i Wk+1|k+1 (0:k+1 ) = Wk+1|k (0:k , j) I i + Kk+1 (0:k+1 ) Yk+1 Yk+1|k (0:k , j) , I I i k+1|k+1 (0:k+1 ) = I Kk+1 (0:k+1 )B(j) k+1|k (0:k , j) ,

I=I+1.
Note that in the algorithm above, $\{W_k^{i,j}\}$ are the weights drawn according to Algorithm 8.2.8. These have nothing to do with the state variable $W_k$ and should not be mistaken for the corresponding predictor denoted by $W_{k+1|k}(\xi_{0:k}^i, j)$. The first option corresponds to the basic importance sampling strategy—without resampling—and is thus analogous to the SIS approach of Algorithm 7.2.2. As usual, after several steps without resampling, the particle system quickly degenerates into a situation where the discrepancy between the weights $\{\omega_k^i\}_{1\le i\le N}$ becomes more and more pronounced as $k$ grows. The second option corresponds to a resampling step based on Algorithm 8.2.8, which avoids particle duplication in the situation where $C_k$ is finite-valued.
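To fix ideas, the following Python sketch implements the weighting part of the recursion of Algorithm 8.2.9 (a Kalman prediction for every particle-regime pair and evaluation of the candidate weights); the Kalman correction of the selected pairs, which completes the recursion, is omitted for brevity. The argument names (A, B, R, S for the regime-indexed system matrices, Q_C for the regime transition matrix) are illustrative, and regimes are indexed from 0.

import numpy as np

def mkf_predict_weights(W_filt, P_filt, w, xi_paths, y_next, Q_C, A, B, R, S):
    # W_filt: (N, d) filtered partial-state means; P_filt: (N, d, d) covariances;
    # w: (N,) particle weights; xi_paths[i][-1]: current regime of particle i.
    N, r = len(w), Q_C.shape[0]
    cand_w = np.zeros((N, r))
    W_pred = np.zeros((N, r) + W_filt.shape[1:])
    P_pred = np.zeros((N, r) + P_filt.shape[1:])
    for i in range(N):
        c_prev = xi_paths[i][-1]
        for j in range(r):
            Wp = A[j] @ W_filt[i]                          # predicted mean
            Pp = A[j] @ P_filt[i] @ A[j].T + R[j] @ R[j].T # predicted covariance
            y_hat = B[j] @ Wp
            Sigma = B[j] @ Pp @ B[j].T + S[j] @ S[j].T
            resid = y_next - y_hat
            d = len(resid)
            # Gaussian predictive likelihood N(y_next; y_hat, Sigma)
            like = np.exp(-0.5 * resid @ np.linalg.solve(Sigma, resid)) / \
                   np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
            cand_w[i, j] = w[i] * like * Q_C[c_prev, j]
            W_pred[i, j], P_pred[i, j] = Wp, Pp
    return cand_w, W_pred, P_pred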

Example 8.2.10. To illustrate the previous algorithm, we consider once more the well-log data of Example 1.3.10, using the same modeling assumptions as in Example 6.3.7. In contrast to Example 6.3.7 however, we now consider sequential approximation of the filtering (or fixed-lag smoothing) distributions of the jump and outlier indicators rather than the block (non-sequential) approximation of the joint smoothing distributions of these variables.


Fig. 8.2. On-line analysis of the well-log data, using 100 particles with detection delay = 0. Top: data; middle: posterior probability of a jump; bottom: posterior probability of an outlier.
4000

Fig. 8.3. On-line analysis of the well-log data, using 100 particles with detection delay = 5 (same display as above).



The main aim of analyzing well-log data is the on-line detection of abrupt changes in the level of the response. The detection delay, defined as the number of samples that are processed before a decision is taken, should be kept as small as possible. Here the detection delay has been set to 0 and to 5: after processing each observation $Y_k$, the probability of a jump having occurred at the time index located that many lags back was estimated by averaging the corresponding jump indicators over the particles (see Example 6.3.7 for the details of the parameterization used in this example). The results of a single on-line analysis of the well-log data using the optimal sampling strategy (at each step) are shown in Figures 8.2 (delay 0) and 8.3 (delay 5). In both cases, $N = 100$ particles are used. For zero delay, the particle filter has performed reasonably well: most of the obvious jumps in the level of the data have a posterior probability close to 1, although some of them are obviously missing (around time index 2000 for instance). In addition, differentiating jumps from outliers is particularly difficult in this case, and the filter has misclassified outliers as change points (at time index 700 for instance). In Figure 8.3 (delay 5), most of the misclassification errors have disappeared and the overall result is quite good (although some points are still detected both as change points and as outliers, as at index 1200). Because the typical length of an outlier is about four, five samples are usually enough to tell whether a change in the level has occurred.

8.3 Particle Approximation of Smoothing Functionals


As emphasized in Section 4.1, it is often of interest to approximate the expectation of some statistic $t_n(x_{0:n})$ under the joint smoothing distribution $\phi_{0:n|n}$,
\[
\int t_n(x_{0:n})\, \phi_{0:n|n}(dx_{0:n}) \;.
\]

This difficult problem admits a computationally simpler solution in cases where the statistic has the specific form—which we called a smoothing functional in Section 4.1—given by (see Definition 4.1.2):
\[
t_{n+1}(x_{0:n+1}) = m_n(x_n, x_{n+1})\, t_n(x_{0:n}) + s_n(x_n, x_{n+1}) \;, \qquad n \ge 0 \;, \tag{8.61}
\]

for all $x_{0:n+1} \in \mathsf{X}^{n+2}$. Here $\{m_n\}_{n\ge 0}$ and $\{s_n\}_{n\ge 0}$ are two sequences of real measurable functions on $\mathsf{X} \times \mathsf{X}$. Examples include the sample mean $t_n(x_{0:n}) = (n+1)^{-1} \sum_{k=0}^{n} x_k$, the first-order sample autocovariance coefficient $t_n(x_{0:n}) = n^{-1} \sum_{k=1}^{n} x_{k-1} x_k$, etc. Other important examples of smoothing functionals arise in parameter estimation when using the EM algorithm or when computing the gradient of the log-likelihood function (see Chapters 10 and 11 for details). Define the finite signed measure $\tau_n$ on $(\mathsf{X}, \mathcal{X})$ by
\[
\tau_n(f) \overset{\mathrm{def}}{=} \int f(x_n)\, t_n(x_{0:n})\, \phi_{0:n|n}(dx_{0:n}) \;, \qquad f \in \mathrm{F}_b(\mathsf{X}) \;. \tag{8.62}
\]



Note that by construction, n (X) = 0:n|n (tn ), that is, the quantity of interest. By Proposition 4.1.3, the measures {n }n0 may be updated recursively according to 0 (f ) = {(g0 )} and n+1 (f ) = c1 n+1 f (xn+1 ) n (dxn )Q(xn , dxn+1 )gn+1 (xn+1 )mn (xn , xn+1 ) + n (dxn ) Q(xn , dxn+1 )gn+1 (xn+1 )sn (xn , xn+1 ) , (8.63) where the normalizing constant cn+1 is given by (3.22) as cn+1 = n Qgn+1 . It is easily seen that n is absolutely continuous with respect to the ltering measure n . Hence (8.63) may be rewritten as n+1 (f ) = f (xn+1 )
1

f (x0 ) t0 (x0 )g0 (x0 ) (dx0 )

dn (xn )mn (xn , xn+1 ) + sn (xn , xn+1 ) n:n+1|n+1 (dxn:n+1 ) . (8.64) dn In SISR algorithms, the joint smoothing distribution 0:n+1|n+1 at time n+1 is i approximated by a set {0:n+1 }1iN of particles with associated importance i weights {n+1 }1iN . Due to the sequential update of the particle trajecto1 N ries, there exist indices In+1 , . . . , In+1 (see Algorithm 7.3.4) such that
n+1 i i 0:n+1 = (0:n , n+1 ) ,

Ii

meaning that the rst n + 1 coordinates of the path are simply copied from the previous generation of particles. Because n is absolutely continuous with respect to n for any n, it seems reasonable to approximate n using the same system of particles as that used to approximate n . That is, for any n we approximate n by N i n (8.65) n = i i , N j n n n j=1 i=1
i where n , i = 1, . . . , N , are signed weights. Such approximations have been considered in dierent settings by Capp (2001a), Crou et al. (2001), Doucet e e and Tadi (2003), and Fichou et al. (2004). This approximation of n yields c the following estimator of 0:n|n (tn ) = n (X): N i n N j=1 j n

0:n|n (tn ) =
i=1

i n .

(8.66)



The two measures n and n have the same support, which implies that n is absolutely continuous with respect to n ; in addition, for any x 1 N {n , . . . , n }, j j dn jIn (x) n n , (8.67) (x) = j n d n
jIn (x)

where In (x) = {j = 1, . . . , N : = x}. In cases where there are no ties (all particle locations are distinct), we simply have dn dn
i i (n ) = n .

def

j n

(8.68)

To derive a recursive approximation of n , it is only needed to derive upi date equations for the signed weights n . Plugging the particle approximation N i n:n+1|n+1 i i=1 n+1 n:n+1 of the retrospective smoothing distribution n:n+1|n+1 into the update equation (8.64) yields the following approximation of the measure n+1 :
i dn In+1 Ii Ii i i i (n )mn (nn+1 , n+1 ) + sn (nn+1 , n+1 ) n+1 . dn i=1 (8.69) d j Using the approximation (8.68) of dn (n ), the latter relation suggests the n i following recursion for the weights {n }1iN :

i n+1 N j j=1 n+1

i i 0 = t0 (0 ) , i n+1 = Ii Ii i nn+1 mn (nn+1 , n+1 )

(8.70) +
Ii i sn (nn+1 , n+1 )

(8.71)

This relation, originally derived by Capp (2001a)2 , is computationally ate tractive because the approximation uses the same set particles and weights as those used to approximate the ltering distribution; only the incremental signed weights need to be computed recursively. Also, it mimics the exact recursion for n and therefore seems like a good way to approximate this sequence of measures. To get a better understanding of the behavior of the algorithm, we will derive the recursion (8.71) from a dierent (admittedly more elementary) perspective. The sequential importance sampling approximation of the joint smoothing distribution 0:n|n amounts to approximate, for any statistic tn (x0:n ), 0:n|n (tn ) by
N

0:n|n (tn ) =
i=1

i n N j=1 j n

i tn (0:n ) .

(8.72)

2 The recursion obtained by Crou et al. (2001) is based on a very dierent e argument but turns out to be equivalent in the case where the functional of interest corresponds to the gradient of the log-likelihood function (see Section 10.2.4 for details).



If the statistic $t_n$ is a smoothing functional as defined in (8.61), this quantity can be evaluated sequentially so that storing the whole particle paths is avoided. Denote by $\{t_n^i\}_{1\le i\le N}$ the current values of the smoothing functional $t_n$ along the particle paths $\xi_{0:n}^i$: $t_n^i = t_n(\xi_{0:n}^i)$. This quantity may be updated according to the recursion $t_0^i = t_0(\xi_0^i)$ and
\[
t_{n+1}^i = t_n^{I_{n+1}^i}\, m_n(\xi_n^{I_{n+1}^i}, \xi_{n+1}^i) + s_n(\xi_n^{I_{n+1}^i}, \xi_{n+1}^i) \;, \qquad i = 1, \ldots, N \;. \tag{8.73}
\]

Perhaps surprisingly, because the two approximations have been derived from two different perspectives, (8.73) and (8.71) are identical. This means that both equations are recursive ways to compute the approximation (8.72) of expectations with respect to the joint smoothing distribution. The second reasoning, which led to recursion (8.73), however raises some concern about the practical use of this approximation. Because the path particles $\{(\xi_{0:n}^i, \omega_n^i)\}_{1\le i\le N}$ are targeted to approximate a probability distribution over the space $\mathsf{X}^{n+1}$, whose dimension grows with $n$, it is to be expected that the curse of dimensionality can only be fought by increasing the number $N$ of path particles as $n$ increases (Del Moral, 2004). A worst case analysis suggests that the number $N$ of path particles should grow exponentially with $n$, which is of course unrealistic. This assertion should however be taken with some care because we are in general interested only in low-dimensional statistical summaries of the particle paths. Hence, the situation usually is more contrasted, as illustrated below on an example.
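As a minimal illustration of recursion (8.73), the following Python function propagates the statistics $t_n^i$ along the particle genealogy; the array layout and the vectorized callables are assumptions chosen for the sketch.

import numpy as np

def update_smoothing_statistic(t_prev, xi_prev, xi_new, ancestors, m_fn, s_fn):
    # Recursion (8.73): t_prev[i] = t_n(xi_{0:n}^i), ancestors[i] = I_{n+1}^i,
    # xi_prev/xi_new hold the particle positions at times n and n+1,
    # and m_fn, s_fn are the functions m_n and s_n of (8.61).
    xp = xi_prev[ancestors]
    return t_prev[ancestors] * m_fn(xp, xi_new) + s_fn(xp, xi_new)

# Example: running sum of squares t_n(x_{0:n}) = sum_k x_k^2, i.e. m_n = 1 and
# s_n(x, x') = x'**2; the smoothed expectation is then sum_i w_n^i t_n^i.
# t = update_smoothing_statistic(t, xi_prev, xi_new, ancestors,
#                                lambda x, xp: np.ones_like(xp),
#                                lambda x, xp: xp ** 2)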

Here the observations $\{Y_k\}_{k\ge 0}$ are the log-returns, $\{X_k\}_{k\ge 0}$ is the log-volatility, and $\{U_k\}_{k\ge 0}$ and $\{V_k\}_{k\ge 0}$ are independent sequences of standard white Gaussian noise. We use the SISR algorithm with systematic resampling and instrumental kernel being a t-distribution with 5 degrees of freedom, with mode and scale adjusted to the mode and curvature of the optimal instrumental kernel (see Example 7.2.5). We consider the daily log-returns, that is, the difference of the log of the series, on the British pound/US dollar exchange rate from October 1, 1981, to June 28, 1985 (the data is scaled by 100 and mean-corrected—see Kim et al., 1998, and Shephard and Pitt, 1997, for details). The number of samples is $n = 945$, and we used the stochastic volatility model with parameters $\phi = 0.975$, $\beta = 0.63$, and $\sigma = 0.16$; these are the maximum likelihood estimates reported by Sandmann and Koopman (1998) on this data set. The path particles after 70 iterations are plotted in Figure 8.4. The figure clearly shows that the selection mechanism implies that for any given time index $k \le n$, the number of ancestors, at that time, of the particle trajectories





Time Index

Fig. 8.4. Particle trajectories at time n = 70 for the stochastic volatility model using the algorithm of Example 7.2.5 with N = 100 particles and systematic resampling.

ending in index $n$ becomes small as the difference between $n$ and $k$ grows. It is therefore to be expected that estimation of the expectation under the joint smoothing distribution of statistics involving the first time lags will typically display large fluctuations, and that these fluctuations will get larger when $n$ increases. This behavior is indeed illustrated in Figure 8.5, which shows particle estimates of $\int x_0^2\, \phi_{0|n}(dx_0)$ for different values of $n$ and $N$. The variance of the particle estimate steadily increases with $n$ for all values of $N$. In addition, a fairly large number $N$ of particles is needed to obtain reliable estimates for larger values of $n$, although the value to be estimated does not change much when $n$ gets larger than, say, $n = 20$.

It is interesting to contrast the results of particle methods with those that can be obtained with the (non-sequential) Markov chain Monte Carlo (MCMC) methods of Chapter 6. For MCMC methods, because the target distribution is static and equal to the joint distribution $\phi_{0:n|n}$, we simply ran 100 instances of the sampler of Example 6.3.1 for each value of $n$ and recorded the averaged value of the first component (squared) in each sample. Here a sweep refers to the successive updates of each of the $n+1$ sites of the simulated sequence $X_{0:n}^i$ (see Example 6.3.1 for details). The computation costs of the MCMC and particle approaches, with comparable values of $n$ and $N$, are thus roughly the same. Remember however that in the particle approach, estimated values of $\int x_0^2\, \phi_{0|n}(dx_0)$ for different values of $n$ may be



20 100 500

Fig. 8.5. Box and whisker plots of particle estimates of $\int x_0^2\, \phi_{0|n}(dx_0)$ for $n$ = 1, 5, 20, 30, and 500, and particle population sizes $N = 10^2$, $10^3$, and $10^4$. The plots are based on 100 independent replications.

20 100 500

Fig. 8.6. Same figure as above for MCMC estimates of $\int x_0^2\, \phi_{0|n}(dx_0)$, where $N$ refers to the number of MCMC sweeps through the data, using the MCMC sampler of Example 6.3.1.

obtained in a single run of the algorithm due to the sequential nature of the computations. Observe first on the leftmost display of Figure 8.6 that the MCMC estimates obtained with just $N = 100$ sweeps are severely downward biased: this is due to the fact that the sequence of states $X_{0:n}^1$ is initialized with zero values and $N = 100$ sweeps are insufficient to forget this initialization, due to the correlation between successive MCMC simulations (see Figure 6.10). On this data set (and with those parameter values), about 200 iterations are indeed needed to obtain reasonably unbiased estimates. The next important observation about Figure 8.6 is that the variance of the estimate does not vary much with $n$. This is of course connected to the observation, made in Example 6.3.1, that the correlation between successive MCMC simulations does not change (significantly) as $n$ increases. For smaller values of $n$, the



existence of correlation makes the MCMC approach far less reliable than the particle method. But for larger values of $n$, the degradation of the results previously observed for the particle method—with a fixed value of $N$ and as $n$ increases—kicks in and the comparison is more balanced (compare the fifth boxes in the rightmost displays of Figures 8.5 and 8.6). In some sense, the degradation observed in Figure 8.5 as $n$ grows ($N$ being fixed) is all the more disturbing in that we expect the result to be nearly independent of $n$ once it exceeds a given value (which is clearly the case in Figures 8.5 and 8.6). Indeed, the forgetting property of the smoothing distributions discussed in Section 4.3 implies that the posterior distribution of the state $x_0$ depends predominantly on the observations $Y_k$ with time indices close to $k = 0$ (see, e.g., Polson et al., 2002, for a related use of the forgetting property; see also footnote 3 below). For large values of $n$, it is thus reasonable to approximate the expectation of $t_{n,0}(x_{0:n}) = x_0$ under $\phi_{0|n}$ by that of the same quantity under $\phi_{0|k}$ for $k$ large enough, but still much smaller than $n$. Of course it is to be expected that the bias of the approximation decreases when increasing the number of lags $k$. On the other hand, as mentioned above, the dispersion of the particle estimator of the expectation under the reduced-lag smoothing distribution, $\phi_{0|k}(t_{n,0})$, increases with $k$. We are thus faced with a classical bias-variance trade-off problem: when $k$ is large the bias is small but the dispersion is large, and vice versa. Setting $k$ smaller than $n$ is thus an effective way of robustifying the estimator without any modification of the sequential Monte Carlo procedure. To give an idea of how large $k$ should be for the example under consideration, the difference between the means of the particle estimates (obtained using $N = 10^5$ particles) of $\phi_{0|n}(t_{n,0})$ and $\phi_{0|k}(t_{n,0})$ is less than $10^{-3}$ for $n = 100$ and $k = 20$. For $k = 1$ and $k = 10$, the corresponding differences are 0.2 and 0.12, respectively. This means that we can safely estimate $\phi_{0|n}(t_{n,0})$ by $\phi_{0|k}(t_{n,0})$ if we take $k \ge 20$. The standard error of the reduced-lag smoothing estimator $\phi_{0|20}(t_{n,0})$ is at least three times less than that of $\phi_{0|500}(t_{n,0})$. As a consequence, we can achieve the same level of performance using reduced-lag smoothing with about 10 times fewer particles (compare, in Figure 8.5, the third box in the second display with the fifth one in the third display). This naturally raises the question whether the same conclusion can be drawn for other statistics of interest. Suppose that we want to approximate the expectations of $t_{n,1}(x_{0:n}) = \sum_{l=0}^{n-1} x_l^2$ and $t_{n,2}(x_{0:n}) = \sum_{l=1}^{n} x_{l-1} x_l$ under the joint smoothing distribution $\phi_{0:n|n}$ (see footnote 4 below). These two statistics may be written as time averages, immediately suggesting the fixed-lag approximations $\sum_{l=0}^{n-1} \int x_l^2\, \phi_{l|(l+k)\wedge n}(dx_l)$ and $\sum_{l=1}^{n} \int x_{l-1} x_l\, \phi_{l-1:l|(l+k)\wedge n}(dx_{l-1:l})$ for some
3. Note that we invoke here the spirit of Section 4.3 rather than an exact result, as we are currently unable to prove that the forgetting property holds for the stochastic volatility model (see the discussion at the end of Section 4.3), although empirical evidence suggests that it does.
4. These statistics need to be evaluated in order to estimate the intermediate quantity of the Expectation-Maximization algorithm; see Example 11.1.2 for details.

Fig. 8.7. Box and whisker plots of particle estimators of the expectations of the two statistics t_{n,1}(x_{0:n}) = \sum_{k=0}^{n-1} x_k^2 (top) and t_{n,2}(x_{0:n}) = \sum_{k=1}^{n} x_k x_{k-1} (bottom) for n = 945; from left to right, increasing particle population sizes of N = 10^2, 10^3, and 10^4; on each graph, fixed-lag smoothing approximation for smoothing delays k = 10 and 20 and full path joint particle approximation. The plots are based on 100 independent replications.

lag k, where the term "fixed-lag" refers to the fact that k is fixed and does not vary with n. To approximate both of these sums, one can use a variant of (8.73) in which only the part of the sum that pertains to indices l located less than k lags away from the current time index is updated, while the contribution of indices further back in the past is fixed. A little thought should convince the reader that this can be achieved by storing the cumulative contribution of the past sections of the trajectories that do not get resampled anymore, \sum_{l=0}^{n-k-1} s(\xi^i_{0:n}(l)), as well as the recent history of the particles \xi^i_{0:n}(l) for l = n-k, ..., n and i = 1, ..., N; here s is the function of interest, say s(x) = x^2 in the case of t_{n,1}, and \xi^i_{0:n}(l) denotes the element of index l in the path \xi^i_{0:n}. As above, it is expected that increasing the number of lags k will increase the dispersion but decrease the bias. This is confirmed by the results displayed in Figure 8.7. Again, the use of fixed-lag instead of joint smoothing provides more accurate estimators. To conclude this section, we would like to stress again the difference between fixed-dimensional statistics like t_{n,0}(x_{0:n}) = x_0 and smoothing functionals, in the sense of Definition 4.1.2, which depend on the complete collection of hidden states up to time n (for instance, t_{n,1}(x_{0:n}) = \sum_{l=0}^{n-1} x_l^2). Although


the latter case does seem to be more challenging, the averaging effect due to n should not be underestimated: even crude approximations of the individual terms, say \int x_l^2 \, \phi_{l|n}(dx_l) in the case of t_{n,1}, may add up to provide a reliable approximation of the conditional expectation of t_{n,1}. In our experience, the strategy discussed above is usually successful with rather moderate values of the lag k and the number N of particles, as will be illustrated in Chapter 11. In the case of fixed-dimensional statistics, more elaborate smoothing algorithms may be preferable, particularly in situations where relying on forgetting properties might be questionable (Kitagawa, 1996; Fong et al., 2002; Briers et al., 2004).
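To make the bookkeeping of the fixed-lag approximation concrete, the following Python sketch shows one way to organize the computation for an additive statistic such as t_{n,1}. It is only an illustration under simplifying assumptions: the routines `mutate` and `resample_indices` are hypothetical user-supplied mutation and multinomial selection steps, and each frozen slice is evaluated with the weights available at the time it leaves the lag window.

```python
import numpy as np

def fixed_lag_smoothed_sum(y, mutate, resample_indices, N, k, s=lambda x: x**2):
    """Sketch of the fixed-lag approximation of sum_l int s(x_l) phi_{l|(l+k)^n}(dx_l).

    `mutate(particles, y_t)` returns new particle positions and (unnormalized)
    weights; `resample_indices(weights)` returns the ancestor index of each
    offspring.  Both are placeholders for the user's SISR implementation.
    """
    n = len(y)
    particles, weights = mutate(np.zeros(N), y[0])
    window = [particles.copy()]     # particle values for the last <= k time slices
    frozen_total = 0.0              # frozen contributions of slices that left the window

    for t in range(1, n):
        idx = resample_indices(weights)          # selection step
        window = [h[idx] for h in window]        # only the recent path sections are resampled
        particles, weights = mutate(particles[idx], y[t])
        window.append(particles.copy())

        if len(window) > k:                      # slice t-k leaves the lag window:
            oldest = window.pop(0)               # freeze its contribution using the
            w = weights / weights.sum()          # current (time-t) weighted particles
            frozen_total += np.sum(w * s(oldest))

    w = weights / weights.sum()
    in_window = sum(np.sum(w * s(h)) for h in window)
    return frozen_total + in_window              # fixed-lag estimate of the smoothed sum
```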

9 Analysis of Sequential Monte Carlo Methods

The previous chapters have described many algorithms to approximate prediction, filtering, and smoothing distributions. The development of these algorithms was motivated mainly on heuristic grounds, and the validity of these approximations is of course a question of central interest. In this chapter, we analyze these methods, mainly from an asymptotic perspective. That is, we study the behavior of the estimators in situations where the number of particles gets large. Asymptotic analysis provides approximations that in many circumstances have proved to be relatively robust. Most importantly, asymptotic arguments provide insight into the sampling methodology by verifying that the procedures are sensible, by providing a framework for comparing competing procedures, and by providing understanding of the impact of different options (choice of importance kernel, etc.) on the overall performance of the samplers.

9.1 Importance Sampling


9.1.1 Unnormalized Importance Sampling

Let (X, \mathcal{X}) be a measurable space. Define on (X, \mathcal{X}) two probability distributions: the target distribution \mu and the instrumental distribution \nu.

Assumption 9.1.1. The target distribution \mu is absolutely continuous with respect to the instrumental distribution \nu, \mu << \nu, and d\mu/d\nu > 0 \nu-a.s.

Let f be a real-valued measurable function on X such that \mu(|f|) = \int |f| \, d\mu < \infty. Denote by \xi^1, \xi^2, \ldots an i.i.d. sample from \nu and consider the estimator

    \hat{\mu}^{IS}_{\nu,N}(f) = \frac{1}{N} \sum_{i=1}^{N} f(\xi^i) \, \frac{d\mu}{d\nu}(\xi^i) .        (9.1)


Because this estimator is the sample average of independent random variables, there is a range of results to assess the accuracy of \hat{\mu}^{IS}_{\nu,N}(f) as an estimator of \mu(f). Some of these results are asymptotic in nature, like the law of large numbers (LLN) and the central limit theorem (CLT). It is also possible to derive non-asymptotic bounds like Berry-Esseen bounds, bounds on error moments E|\hat{\mu}^{IS}_{\nu,N}(f) - \mu(f)|^p for some p > 0, or bounds on the tail probability P(|\hat{\mu}^{IS}_{\nu,N}(f) - \mu(f)| >= t). Instead of covering the full scale of results that can be derived, we establish for the different algorithms presented in the previous chapter a law of large numbers, a central limit theorem, and deviation bounds. A direct application of the LLN and of the CLT yields the following result.

Theorem 9.1.2. Let f be a real measurable function such that \mu(|f|) < \infty, and let \xi^1, \xi^2, \ldots be a sequence of i.i.d. random variables from \nu. Then the unnormalized importance sampling estimator \hat{\mu}^{IS}_{\nu,N}(f) given by (9.1) is strongly consistent, \lim_{N \to \infty} \hat{\mu}^{IS}_{\nu,N}(f) = \mu(f) a.s.
Assume in addition that

    \int f^2 \left( \frac{d\mu}{d\nu} \right)^2 d\nu < \infty .        (9.2)

Then \hat{\mu}^{IS}_{\nu,N}(f) is asymptotically Gaussian,

    \sqrt{N} \left( \hat{\mu}^{IS}_{\nu,N}(f) - \mu(f) \right) \to_D N\!\left( 0, \mathrm{Var}_\nu\!\left[ f \, \frac{d\mu}{d\nu} \right] \right)    as N \to \infty,

where \mathrm{Var}_\nu[f \, d\mu/d\nu] is given by

    \mathrm{Var}_\nu\!\left[ f \, \frac{d\mu}{d\nu} \right] = \int \left[ f \, \frac{d\mu}{d\nu} - \mu(f) \right]^2 d\nu .
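A minimal Python sketch of the unnormalized importance sampling estimator (9.1) is given below. The callables `f`, `d_mu_d_nu`, and `sample_nu` are user-supplied; the closed-form weight used in the usage example corresponds to a standard Gaussian target and standard Cauchy instrumental distribution, as in Example 9.1.4 below (with f(x) = exp(-|x|) as reconstructed there), for which the exact value is about 0.523.

```python
import numpy as np

def unnormalized_is(f, d_mu_d_nu, sample_nu, N, rng):
    """Unnormalized importance sampling estimator (9.1): draw an i.i.d. sample
    from the instrumental distribution nu and average f times the importance
    weight d(mu)/d(nu)."""
    xi = sample_nu(N, rng)                      # xi^1, ..., xi^N  i.i.d. ~ nu
    return np.mean(f(xi) * d_mu_d_nu(xi))       # (1/N) sum_i f(xi^i) dmu/dnu(xi^i)

# Usage: target mu = N(0,1), instrumental nu = Cauchy(0,1), f(x) = exp(-|x|).
rng = np.random.default_rng(0)
w = lambda x: np.sqrt(np.pi / 2) * (1 + x**2) * np.exp(-x**2 / 2)   # dmu/dnu
est = unnormalized_is(lambda x: np.exp(-np.abs(x)), w,
                      lambda n, r: r.standard_cauchy(n), 100_000, rng)
```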

Obviously, while the importance sampling construction (9.1) is universal, the performance of the importance sampling estimator depends heavily on the relation between the target distribution \mu, the instrumental distribution \nu, and the function f. It is also worthwhile to note that for a given function f, it is most often possible to find a distribution \nu that yields an estimate with a lower variance than when using the plain Monte Carlo method, that is, taking \nu = \mu. In some situations the improvement can be striking: this is in particular the case when the function f is non-zero only for values that are in the tails of the target distribution \mu, a situation that occurs for instance when estimating the probability of rare events. The basic idea is to choose the importance distribution \nu so that it generates values in the region where the integrand f \, d\mu/d\nu is large, as this region is where the most important contributions are made to the value of the integral. Notice that

    \mathrm{Var}_\nu\!\left[ f \, \frac{d\mu}{d\nu} \right] \; >= \; [\mu(|f|)]^2 \int \left[ \frac{|f| \, d\mu/d\nu}{\mu(|f|)} - 1 \right]^2 d\nu ,


where the second factor on the right-hand side is the chi-square distance, under \nu, between the densities 1 and |f| (d\mu/d\nu)/\mu(|f|). This factor is of course in general unknown, but may be estimated consistently by computing the (squared) coefficient of variation CV^2, see (7.35), of the importance weights \omega^i = |f(\xi^i)| \frac{d\mu}{d\nu}(\xi^i), i = 1, \ldots, N.

Poor selection of the instrumental distribution can induce large variations in the importance weights d\mu/d\nu and thus unreliable approximations of \mu(f). In many settings, an inappropriate choice of the instrumental distribution might lead to an estimator (9.1) whose variance is infinite (and which therefore does not satisfy the assumptions of the CLT). Here is a simple example of this behavior.

Example 9.1.3 (Importance Sampling with Cauchy and Gaussian Variables). In this example, the target \mu = C(0, 1) is a standard Cauchy distribution, and the instrumental distribution \nu = N(0, 1) is a standard Gaussian distribution. The importance weight function, given by

    \frac{d\mu}{d\nu}(x) = \sqrt{\frac{2}{\pi}} \, \frac{\exp(x^2/2)}{1 + x^2} ,

is obviously badly behaved. In particular,

    \frac{1}{\sqrt{2\pi}} \int \left[ \frac{d\mu}{d\nu}(x) \right]^2 \exp(-x^2/2) \, dx = \infty .

Figure 9.1 illustrates the poor performance of the associated importance sampling estimator for the function f(x) = exp(-|x|). We have displayed the quantile-quantile plot of the sample quantiles of the unnormalized IS estimator \hat{\mu}^{IS}_{\nu,N}(f), obtained from m = 500 independent Monte Carlo experiments, versus the quantiles of a standard normal distribution. In the left panel N = 100 and in the right panel N = 1,000. The quantile-quantile plot shows deviations from the normal distribution in both the lower and the upper tail, for both N = 100 and N = 1,000, indicating that the distribution of \hat{\mu}^{IS}_{\nu,N}(f) does not converge to a Gaussian distribution in the limit.

Example 9.1.4. We now switch the roles of the target and instrumental distributions, taking \mu = N(0, 1) and \nu = C(0, 1). The importance weight is bounded by \sqrt{2\pi/e}, and this time Theorem 9.1.2 can be applied. Quantile-quantile plots of the sample quantiles of the unnormalized IS estimator \hat{\mu}^{IS}_{\nu,N}(f) are shown in Figure 9.2. The fit is good, even when the sample size is small (N = 100). It is worthwhile to investigate the impact of the choice of the scale of the Cauchy distribution. Assume now that \nu = C(0, \sigma), where \sigma > 0 is the scale parameter. The importance weight function is bounded by

    (\sqrt{2\pi}/\sigma) \, e^{\sigma^2/2 - 1}    for \sigma < \sqrt{2} ,
    \sqrt{\pi/2} \, \sigma                        for \sigma >= \sqrt{2} .        (9.3)

Fig. 9.1. Quantile-quantile plot of the sample quantiles of the unnormalized IS estimator of \mu(f) versus the quantiles of a standard normal distribution. The target and instrumental distributions \mu and \nu are standard Cauchy and standard Gaussian, respectively, and f(x) = exp(-|x|). The number of Monte Carlo replications is m = 500. Left panel: sample size N = 100. Right panel: sample size N = 1,000.
Fig. 9.2. Same figure as above with the roles of \mu and \nu switched: the target distribution is standard Gaussian and the instrumental distribution is standard Cauchy.
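The replication experiments behind Figures 9.1 and 9.2 can be sketched as follows (plotting of the Q-Q diagnostics is omitted). The weight functions are the closed-form densities ratios for the two pairings discussed above; f(x) = exp(-|x|) is as reconstructed in the examples.

```python
import numpy as np

def is_replications(f, d_mu_d_nu, sample_nu, N, m, rng):
    """m independent copies of the unnormalized IS estimator with sample size N."""
    xi = sample_nu((m, N), rng)                       # m independent samples of size N
    return np.mean(f(xi) * d_mu_d_nu(xi), axis=1)     # one IS estimate per row

rng = np.random.default_rng(3)
f = lambda x: np.exp(-np.abs(x))
# Example 9.1.3: target Cauchy, instrumental Gaussian (infinite-variance weights).
w_13 = lambda x: np.sqrt(2 / np.pi) * np.exp(x**2 / 2) / (1 + x**2)
reps_13 = is_replications(f, w_13, lambda s, r: r.standard_normal(s), 1_000, 500, rng)
# Example 9.1.4: target Gaussian, instrumental Cauchy (bounded weights).
w_14 = lambda x: np.sqrt(np.pi / 2) * (1 + x**2) * np.exp(-x**2 / 2)
reps_14 = is_replications(f, w_14, lambda s, r: r.standard_cauchy(s), 1_000, 500, rng)
```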


For \sigma < \sqrt{2}, the maximum is attained at x = \pm\sqrt{2 - \sigma^2}, while for \sigma >= \sqrt{2} it is attained at x = 0. The upper bound on the importance weight has a minimum at \sigma = 1.
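The following short numerical check of the bound (9.3), as reconstructed above, confirms that it is minimized close to \sigma = 1, where it equals \sqrt{2\pi/e}.

```python
import numpy as np

def weight_bound(sigma):
    """Bound (9.3) on dmu/dnu for mu = N(0,1) and nu = C(0, sigma), as reconstructed."""
    if sigma < np.sqrt(2.0):
        return np.sqrt(2.0 * np.pi) / sigma * np.exp(sigma**2 / 2.0 - 1.0)
    return np.sqrt(np.pi / 2.0) * sigma

sigmas = np.linspace(0.1, 10.0, 1000)
bounds = np.array([weight_bound(s) for s in sigmas])
sigma_star = sigmas[np.argmin(bounds)]   # close to 1; minimal value close to sqrt(2*pi/e)
```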

Fig. 9.3. Box-and-whisker plots of the unnormalized IS estimator of \mu(f). The target and instrumental distributions \mu and \nu were standard Gaussian and Cauchy with scale \sigma, respectively, and f(x) = exp(-|x|). Left to right: \sigma = 0.1, 1, and 10. The sample size was N = 1,000 and the number of Monte Carlo replications for each plot was m = 500.

Figure 9.3 displays box-and-whisker plots of the unnormalized IS estimator for three different values of the scale: \sigma = 0.1, \sigma = 1, and \sigma = 10. The choice \sigma = 1 leads to estimators that are better behaved than for \sigma = 0.1 and \sigma = 10. In the first case, the values drawn from the instrumental distribution are typically too small to represent the standard Gaussian distribution around 0. In the second case, the values drawn are typically too large, and many draws fall far in the tail of the Gaussian distribution.

9.1.2 Deviation Inequalities

As outlined above, it is interesting to obtain some non-asymptotic control of the fluctuations of the importance sampling estimator. We may either want to compute bounds on moments E|\hat{\mu}^{IS}_{\nu,N}(f) - \mu(f)|^p, or to control the probability P(|\hat{\mu}^{IS}_{\nu,N}(f) - \mu(f)| >= t) for some t > 0. Because \hat{\mu}^{IS}_{\nu,N}(f) is a sum of i.i.d. random variables, there is a variety of probability inequalities that may be applied for this purpose (see Petrov, 1995, Chapter 2). We do not develop this topic in detail, but just mention two inequalities that will be used later in the book.


The first family of inequalities is related to the control of moments of sums of random variables. There is a variety of inequalities of this kind, which are all similar (except for the constants).

Theorem 9.1.5 (Marcinkiewicz-Zygmund Inequality). If X_1, \ldots, X_n is a sequence of independent random variables and p >= 2, then

    E \left| \sum_{i=1}^{n} (X_i - E[X_i]) \right|^p <= C(p) \, n^{p/2 - 1} \sum_{i=1}^{n} E|X_i - E(X_i)|^p        (9.4)

for some positive constant C(p) depending only on p.

The second family of inequalities is related to bounding tail probabilities. There is a large amount of work in this domain too. The archetypal result is the so-called Hoeffding inequality.

Theorem 9.1.6 (Hoeffding Inequality). Let X_1, \ldots, X_n be independent bounded random variables such that P(a_i <= X_i <= b_i) = 1. Then for any t >= 0,

    P\left( \sum_{i=1}^{n} [X_i - E(X_i)] >= t \right) <= \exp\left( -2t^2 \Big/ \sum_{i=1}^{n} (b_i - a_i)^2 \right)

and

    P\left( \sum_{i=1}^{n} [X_i - E(X_i)] <= -t \right) <= \exp\left( -2t^2 \Big/ \sum_{i=1}^{n} (b_i - a_i)^2 \right) .

From these inequalities, it is straightforward to derive non-asymptotic bounds on moments and tail probabilities of the importance sampling estimator. Because the importance ratio is formally not defined on sets A such that \nu(A) = 0, we first need to extend the concept of oscillation, see (4.14), as follows. For any measurable function f and measure \nu, we define the essential oscillation of f with respect to \nu by

    osc_\nu(f) := 2 \inf_{c \in R} \| f - c \|_{\nu,\infty} ,        (9.5)

where \|g\|_{\nu,\infty} denotes the essential supremum of |g| (with respect to \nu), that is, the smallest number a such that {x : |g(x)| > a} has \nu-measure 0. It is easily checked that the above definition implies that for any a and b such that a <= f <= b \nu-a.s., osc_\nu(f) <= (b - a).

Theorem 9.1.7. For p >= 2 and any N >= 1, the estimator \hat{\mu}^{IS}_{\nu,N}(f) defined in (9.1) satisfies

    E\left| \hat{\mu}^{IS}_{\nu,N}(f) - \mu(f) \right|^p <= C(p) \, N^{-p/2} \int \left| f \, \frac{d\mu}{d\nu} - \mu(f) \right|^p d\nu ,

where the constant C(p) < \infty depends only on p. Moreover, for any N >= 1 and any t >= 0,

    P\left( \left| \hat{\mu}^{IS}_{\nu,N}(f) - \mu(f) \right| >= t \right) <= 2 \exp\left( -2N t^2 / osc_\nu^2(f \, d\mu/d\nu) \right) .        (9.6)


9.1.3 Self-normalized Importance Sampling Estimator

When the normalizing constant of the target distribution is unknown, it is customary to use the self-normalized form of the importance sampling estimator,

    \tilde{\mu}^{IS}_{\nu,N}(f) = \frac{ \sum_{i=1}^{N} f(\xi^i) \, \frac{d\mu}{d\nu}(\xi^i) }{ \sum_{i=1}^{N} \frac{d\mu}{d\nu}(\xi^i) } .        (9.7)

This quantity is obviously free from any scale factor in d\mu/d\nu. The properties of this estimator are of course closely related to those of the unnormalized importance sampling estimator.

9.1.3.1 Consistency and Asymptotic Normality

Theorem 9.1.8. Let f be a measurable function such that \mu(|f|) < \infty. Assume that \mu << \nu and let \xi^1, \xi^2, \ldots be an i.i.d. sequence with distribution \nu. Then \tilde{\mu}^{IS}_{\nu,N}(f) \to \mu(f) a.s. as N \to \infty.
Assume in addition that f satisfies

    \int [1 + f^2] \left( \frac{d\mu}{d\nu} \right)^2 d\nu < \infty .        (9.8)

Then the sequence of estimators \tilde{\mu}^{IS}_{\nu,N}(f) is asymptotically Gaussian,

    \sqrt{N} \left( \tilde{\mu}^{IS}_{\nu,N}(f) - \mu(f) \right) \to_D N(0, \sigma^2(\nu, f))    as N \to \infty,

where

    \sigma^2(\nu, f) = \int \left( \frac{d\mu}{d\nu} \right)^2 [f - \mu(f)]^2 \, d\nu .        (9.9)

Proof. Strong consistency follows from

    N^{-1} \sum_{i=1}^{N} f(\xi^i) \, \frac{d\mu}{d\nu}(\xi^i) \to \mu(f)  a.s.    and    N^{-1} \sum_{i=1}^{N} \frac{d\mu}{d\nu}(\xi^i) \to 1  a.s.

Write

    \sqrt{N} \left( \tilde{\mu}^{IS}_{\nu,N}(f) - \mu(f) \right) = \frac{ N^{-1/2} \sum_{i=1}^{N} \frac{d\mu}{d\nu}(\xi^i) \left[ f(\xi^i) - \mu(f) \right] }{ N^{-1} \sum_{i=1}^{N} \frac{d\mu}{d\nu}(\xi^i) } .

By the central limit theorem, the numerator of the right-hand side above converges weakly to N(0, \sigma^2(\nu, f)) as N \to \infty, with \sigma^2(\nu, f) given by (9.9), and as noted above the corresponding denominator converges a.s. to 1. The second part of the theorem then follows by Slutsky's theorem (Billingsley, 1995).
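A sketch of the self-normalized estimator (9.7) follows; the point of the usage example is that the weight only needs to be known up to a multiplicative constant, here the unnormalized ratio for the Gaussian-target/Cauchy-instrumental pairing used above.

```python
import numpy as np

def self_normalized_is(f, weight, sample_nu, N, rng):
    """Self-normalized importance sampling estimator (9.7).  `weight` only needs
    to be proportional to d(mu)/d(nu); the unknown constant cancels."""
    xi = sample_nu(N, rng)
    w = weight(xi)
    return np.sum(w * f(xi)) / np.sum(w)

# Usage: target N(0,1), instrumental Cauchy(0,1), weight known up to a constant.
rng = np.random.default_rng(1)
unnormalized_w = lambda x: (1 + x**2) * np.exp(-x**2 / 2)   # proportional to dmu/dnu
est = self_normalized_is(lambda x: np.exp(-np.abs(x)), unnormalized_w,
                         lambda n, r: r.standard_cauchy(n), 100_000, rng)
```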


9.1.3.2 Deviation Inequalities

Assessing deviation bounds for (9.7) is not a trivial task, because both the numerator and the denominator of \tilde{\mu}^{IS}_{\nu,N}(f) are random. The following elementary lemma plays a key role in deriving such bounds.

Lemma 9.1.9. Let f be a measurable function and assume that \mu << \nu. Let c be a real constant and define \bar{f} = f - c. Then, \nu-a.s.,

    \left| \tilde{\mu}^{IS}_{\nu,N}(f) - \mu(f) \right| <= \left| \frac{1}{N} \sum_{i=1}^{N} \frac{d\mu}{d\nu}(\xi^i) \bar{f}(\xi^i) - \mu(\bar{f}) \right| + \| \bar{f} \|_{\nu,\infty} \left| \frac{1}{N} \sum_{i=1}^{N} \frac{d\mu}{d\nu}(\xi^i) - 1 \right| .        (9.10)

Proof. First note that IS (f ) (f ) = IS f (f ). Next consider the ,N ,N decomposition 1 IS f (f ) = ,N N


N

i=1

d i i ( )(f ( ) (f )) d +
N d i i i=1 d ( )f ( ) N d i i=1 d ( )

1 N

i=1

d i ( ) d

Finally, use the triangle inequality and maximize over \bar{f}(\xi^i) in the second term.

From this result we may obtain moment bounds using the Marcinkiewicz-Zygmund inequality or, under more stringent conditions, exponential bounds on tail probabilities.

Theorem 9.1.10. Assume that \nu[(d\mu/d\nu)^p] < \infty for some p >= 2. Then there exists a constant C < \infty such that for any N >= 1 and measurable function f,

    E\left| \tilde{\mu}^{IS}_{\nu,N}(f) - \mu(f) \right|^p <= C N^{-p/2} osc_\nu^p(f) .        (9.11)

In addition, for any t >= 0,

    P\left( \left| \tilde{\mu}^{IS}_{\nu,N}(f) - \mu(f) \right| >= t \right) <= 4 \exp\left( - \frac{8 N t^2}{9 \, \| d\mu/d\nu \|_{\nu,\infty}^2 \, osc_\nu^2(f)} \right) .        (9.12)

Proof. The bound (9.11) is a direct consequence of Lemma 9.1.9 and the Marcinkiewicz-Zygmund inequality (Theorem 9.1.5). Note that by minimiz ing over c in the right-hand side of (9.10), we may replace f , by (1/2) osc (f ), which is done here. For the second part pick b (0, 1) and write, using Lemma 9.1.9,



N


P |IS (f ) (f )| t P ,N
i=1 N

d i i ( )f ( ) (f ) d

N bt f

+P
i=1

d i ( ) 1 d

N (1 b)t

Next apply Hoedings inequality (Theorem 9.1.6) to both terms on the righthand side to obtain P |IS (f ) (f )| t 2 exp 2N b2 t2 osc2 (d/d)f ,N + 2 exp 2N (1 b)2 t2 d/d
2 ,

2 ,

, (9.13)

where the fact that osc (d/d) d/d , (as d/d is positive) has been used. Now note that when f is such that f , = (1/2) osc (f ), osc (d/d)f d/d , osc (f ). Hence to equate both terms on the right-hand side of (9.13) we set b = 2/3 which gives (9.12).

9.2 Sampling Importance Resampling


9.2.1 The Algorithm

In this section, we study the sampling importance resampling (SIR) technique, introduced by Rubin (1987, 1988). It enables drawing an asymptotically independent sample from a target distribution \mu. The method requires that we know an instrumental distribution \nu satisfying \mu << \nu and such that the Radon-Nikodym derivative d\mu/d\nu is known up to a normalizing factor. Therefore either \mu or \nu, or both, may be known up to a normalizing constant only. A tacit assumption is that sampling from the instrumental distribution \nu is doable. The SIR method proceeds in two steps. In the sampling stage, we draw an i.i.d. sample \xi^1, \ldots, \xi^{\bar{M}} from the instrumental distribution \nu. The size \bar{M} of this intermediate sample is usually taken to be larger, and sometimes much larger, than the size M of the final sample. In the resampling stage, we draw a sample \tilde{\xi}^1, \ldots, \tilde{\xi}^M of size M from the instrumental sample \xi^1, \ldots, \xi^{\bar{M}}. There are several ways of implementing this basic idea, the most obvious approach being to sample with replacement, with a probability of picking each \xi^i, i = 1, \ldots, \bar{M}, that is proportional to its importance weight \frac{d\mu}{d\nu}(\xi^i). That is, \tilde{\xi}^i = \xi^{I^i} for i = 1, \ldots, M, where I^1, \ldots, I^M are conditionally independent given the instrumental sample and with distribution

    P(I^1 = i \mid \xi^1, \ldots, \xi^{\bar{M}}) = \frac{ \frac{d\mu}{d\nu}(\xi^i) }{ \sum_{j=1}^{\bar{M}} \frac{d\mu}{d\nu}(\xi^j) } .


For any measurable real-valued function f, we may associate to this sample an estimator \hat{\mu}^{SIR}_{\nu,M}(f) of \mu(f), defined as the Monte Carlo estimator of \mu(f) associated to the resampled particles \tilde{\xi}^1, \ldots, \tilde{\xi}^M,

    \hat{\mu}^{SIR}_{\nu,M}(f) = \frac{1}{M} \sum_{i=1}^{M} f(\tilde{\xi}^i) = \frac{1}{M} \sum_{i=1}^{\bar{M}} N^i f(\xi^i) .        (9.14)

Here N^i is the total number of times that \xi^i was selected from the instrumental sample. Thus (N^1, \ldots, N^{\bar{M}}) have a multinomial distribution with

    E[N^i \mid \xi^1, \ldots, \xi^{\bar{M}}] = M \, \frac{ \frac{d\mu}{d\nu}(\xi^i) }{ \sum_{j=1}^{\bar{M}} \frac{d\mu}{d\nu}(\xi^j) } ,    i = 1, \ldots, \bar{M} .
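A compact Python sketch of the two-stage SIR procedure and of the estimator (9.14) is given below. The arguments are user-supplied callables; multinomial resampling with replacement is used, as in the description above.

```python
import numpy as np

def sir_sample(weight, sample_nu, M_instr, M_final, rng):
    """Sampling importance resampling: draw an instrumental sample of size
    M_instr from nu, then resample M_final points with replacement with
    probabilities proportional to the importance weights."""
    xi = sample_nu(M_instr, rng)                    # intermediate i.i.d. sample from nu
    w = weight(xi)                                  # (possibly unnormalized) dmu/dnu
    idx = rng.choice(M_instr, size=M_final, replace=True, p=w / w.sum())
    return xi[idx]                                  # approximately distributed as mu

def sir_estimate(f, weight, sample_nu, M_instr, M_final, rng):
    """Estimator (9.14): plain Monte Carlo average over the resampled points."""
    return np.mean(f(sir_sample(weight, sample_nu, M_instr, M_final, rng)))
```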

The conditional expectation of the SIR estimate with respect to the instrumental sample equals the (self-normalized) importance sampling estimator provided by this sample,
M

E[SIR ,M

(f ) | , . . . , ] =
i=1

d i d ( ) f ( i ) M d i ( ) i=1 d

The asymptotic analysis of the SIR estimator involves more sophisti cated arguments however, because 1 , . . . , M is not an i.i.d. sample from . Nevertheless, for any measurable bounded real-valued function f on X and j = 1, . . . , M ,
M

E[f ( j ) | 1 , . . . , M ] =
i=1

d i d ( ) f ( i ) M d j ( ) j=1 d

(f ) ,

where the convergence follows from Theorem 9.1.8. Because the conditional expectation on the left-hand side is bounded by f , we can take expectations of both sides and appeal to dominated convergence to conclude that E[f ( j )] (f ) as M . This shows that, whereas marginally the i are i is asymptotically not distributed according to , the distribution of any correct in the sense that for any i, the marginal distribution of i converges to the target distribution as M . In the same way, for any i = j and f, g Fb (X) we have E[f ( i )g( j )] = E[E[f ( i )g( j ) | 1 , . . . , M ]] = E[E[f ( i ) | 1 , . . . , M ] E[g( j ) | 1 , . . . , M ]] = E[IS (f ) IS (g)] . ,M ,M Repeating the argument above shows that E[f ( i )g( j )] (f )(g). Thus, i and j for i = j are not independent for any whereas the random variables


given sample size M , they are asymptotically independent as the sample size M goes to innity. The estimation error SIR (f ) (f ) can be decomposed into two terms, ,M SIR (f ) (f ) = SIR (f ) IS (f ) + IS (f ) (f ) . ,M ,M ,M ,M (9.15)

The rst term on the right-hand side is the error associated with the approximation of the importance sampling estimator IS (f ) by its sampled version ,M SIR (f ). The second term is the error associated to the importance sampling ,M estimator. To obtain asymptotic results, we now assume that the instrumental and nal sample sizes are non-decreasing sequences of integers, denoted by {MN } and {MN }, respectively, both diverging to innity. As shown in Theorem 9.2.15, when (|f |) < , these two error terms go to zero and therefore SIR (f ) is a consistent estimator of (f ). ,M N The next question to answer in the elementary asymptotic theory developed in this chapter is to nd conditions upon which aN {SIR (f ) (f )} ,M N is asymptotically normal; here {aN }, the rate sequence, is a non-decreasing sequence of positive reals. Again we use the decomposition (9.15). First a conditional central limit theorem shows that, for any f L2 (X, ), 1/2 MN SIR N (f ) IS N (f ) ,M ,M N M D 1/2 f ( i ) E[f ( i ) | 1 , . . . , MN ] N (0, Var (f )) . = MN
i=1

Note that N(0, Var (f )) is the limiting distribution of the plain Monte Carlo estimator of (f ) from an i.i.d. sample from . Theorem 9.1.8 shows that if (1 + f 2 )(d/d)2 is -integrable, then MN
1/2

IS N (f ) (f ) N 0, Var ,M

d [f (f )] d
N

1/2 The key result, shown in Theorem 9.2.15, is that MN {SIR (f )IS N (f )} ,M ,M and MN {IS N (f ) (f )} are asymptotically independent. ,M In many circumstances, and in particular when studying the resampling step in sequential or iterative applications of the SIR algorithm (such as in the sequential Monte Carlo framework), it is convenient to relax the conditions on the instrumental sample 1 , . . . , M . In addition, it is of interest to consider weighted samples ( 1 , 1 ), . . . , ( M , M ), where i are non-negative (importance) weights. We now proceed by introducing precise denitions and notations and then present the main results. 9.2.2 Denitions and Notations Let {MN }N 0 be a sequence of positive integers. Throughout this section, we use the word triangular array to refer to a system {U N,i }1iMN of random
1/2


variables dened on a common probability space (, F, P) and organized as follows: U 1,1 U 1,2 . . . U 1,M1 U 2,1 U 2,2 . . . . . . U 2,M2 U 3,1 U 3,2 . . . . . . . . . U 3,M3 . . . . . .. . . . . . . . . . . . The row index N ranges over 1, 2, 3, . . . while the column index i ranges from 1 to MN , where MN is a sequence of integers satisfying limN MN = . It will usually be the case that M1 < M2 < . . . ; hence the term triangular. It is not necessary to assume this, however. It is not assumed that the random variables within each row are independent nor that they are identically distributed. We assume nothing about the relation between the random variables on dierent rows. Let {G N }N 0 be a sequence of sub--elds of F. We say that a triangular array {U N,i }1iMN is measurable with respect to this sequence if for any N the random variables U N,1 , . . . , U N,MN are G N -measurable. We say that the triangular array {U N,i }1iMN is conditionally independent given {G N } if for any N the random variables U N,1 , . . . , U N,MN are conditionally independent given G N . The term conditionally i.i.d. given {G N } is dened in an entirely similar manner. In the sequel, we will need a number of technical results regarding triangular arrays. To improve readability of the text, however, these results are gathered at the end of the chapter, in Section 9.5.1. Denition 9.2.1 (Weighted Sample). A triangular array of random variables {( N,i , N,i )}1iMN is said to be a weighted sample if for any N 1, MN N,i 0 for i = 1, . . . , MN and i=1 N,i > 0 a.s. Let us now consider specically the case when the variables N,i take values in the space X. Assume that the weighted sample {( N,i , N,i )}1iMN approximates the instrumental distribution in the sense that for any f MN N,i 1 in an appropriately dened class of functions, WN f ( N,i ), with i=1 MN N,i WN = i=1 being the normalization factor, converges in an appropriately dened sense to (f ) as N tends to innity. The most elementary way MN N,i 1 to assess this convergence consists in requiring that WN f ( N,i ) i=1 converges to (f ) in probability for functions f in some class C of real-valued functions on X. Denition 9.2.2 (Consistent Weighted Sample). The weighted sample {( N,i , N,i )}1iMN is said to be a consistent for the probability measure and the set C L1 (X, ) if for any f C,
MN

N,i
MN j=1

i=1

N,j

f N,i (f )

as N .


In order to obtain sensible results, we restrict our attention to classes of sets that are suciently rich. Denition 9.2.3 (Proper Set). A set C of real-valued measurable functions on X is said to be proper if the following conditions are satised. (i) C is a linear space: for any f and g in C and reals and , f + g C. (ii) If |g| C and f is measurable with |f | |g|, then |f | C . For any function f , dene the positive and negative parts of it by f+ = f 0
def

and f = (f ) 0 ,

def

and note that f + and f are both dominated by |f |. Thus, if |f | C, then f + and f both belong to C and so does f = f + f . It is easily seen that for any p 0 and any measure on (X, X ), the set Lp (X, ) is proper. There are many dierent ways to obtain a consistent weighted sample. An i.i.d. sample { N,i }1iMN with common distribution is consistent for , L1 (X, ) , and {( N,i , d ( N,i )}1iMN is consistent for (, L1 (X, )). Of d course, when dealing with such elementary situations, the use of triangular arrays can be avoided. Triangular arrays come naturally into play when considering iterated applications of the SIR algorithm, as in sequential importance sampling techniques. In this case, the weighted sample {( N,i , N,i }1iMN is the result of iterated applications of importance sampling, resampling, and propagation steps. We study several examples of such situations later in this chapter. The notion of sample consistency is weak but is in practice only moderately helpful, because it does not indicate the rate at which the estimator N 1 N,i WN f ( N,i ) converges to (f ). In particular, this denition does i=1 not provide a way to construct an asymptotic condence interval for (f ). A natural way to strengthen it is to consider distributional convergence of the MN N,i normalized dierence aN i=1 WN {f ( N,i ) (f )}. Denition 9.2.4 (Asymptotically Normal Weighted Sample). Let A be a class of real-valued measurable functions on X, let be a real non-negative function on A, and let {aN } be a non-decreasing real sequence diverging to innity. We say that the weighted sample {( N,i , N,i )}1iMN is asymptotically normal for (, A, , {aN }) if for any function f A it holds that (|f |) < , 2 (f ) < and
MN

aN
i=1

N,i
MN j=1

N,j

f ( N,i ) (f ) N(0, 2 (f ))

as N .

Of course, if {( N,i , N,i )}1iMN is asymptotically normal for (, A, , {aN }), then it is also consistent for (, A). If { N,i }1iMN are i.i.d. with


common distribution then for any function f L2 (X, ) and any nondecreasing sequence {MN } such that limN MN = , 1 MN
MN

f ( N,i ) (f ) N 0, [f (f )]2
i=1

Therefore {( N,i , 1)}1iMN is an asymptotically normal weighted sample for , L2 (X, ), , { MN } with 2 (f ) = {f (f )}2 . In the context of importance sampling, for each N we draw { N,i }1iMN independently from the instrumental distribution and assign it weights { d ( N,i )}1iMN . d Using an argument as in the proof of Theorem 9.1.8, it also follows that {( N,i , d N,i ))}1iMN is an asymptotically normal weighted sample for d ( (, A, , { MN }), with A= and 2 (f ) = d [f (f )] d f L2 (X, ) : d [f (f )] d
2 2

<

f A.

When the SIR algorithm is applied sequentially, the rate {aN } can be dif ferent from MN because of the dependence among the random variables { N,i }1iMN introduced by the resampling procedure. 9.2.3 Weighting and Resampling Assume that {( N,i , 1)}1iMN is an i.i.d. sample from the instrumental distribution . In the rst stage of the SIR procedure, we assign to these samples importance weights d ( N,i ), i = 1, . . . , MN , where is the target distribud tion, assumed to be absolutely continuous with respect to . We then draw, conditionally independently given F N = ({ N,1 , . . . , N,MN }), random vari ables I N,1 , . . . , I N,MN with distribution P(I N,k = N,i | F N ) = d ( N,i ) and d N,i let N,i = N,I for i = 1, . . . , MN . Proceeding this way, we thus dene a weighted sample {( N,i , 1)}1iMN . As outlined in the discussion above, N,i we know that {( , 1)}1iMN is consistent for (, L1 (X, )). We have already mentioned that {( N,i , d ( N,i ))}1MN is consistent for (, L1 (X, )); d therefore the weighting operation transforms a weighted sample consistent for (, L1 (X, )) into a weighted sample consistent for (, L1 (X, )). Similarly, in the second step, the resampling operation transforms a weighted sample {( N,i , d ( N,i ))}1iMN into another one {( N,i , 1)}1iMN . It is a natural d question to ask whether the latter one is consistent for and, if so, what an appropriately dened class of functions on X might be. Of course, in this discussion it is also sensible to strengthen the requirement of consistency into


asymptotic normality and again prove that the weighting and resampling operations transform an asymptotically normal weighted sample for \nu into an asymptotically normal sample for \mu (for appropriately defined classes of functions, normalizing factors, etc.). The main purpose of this section is to establish such results. Because we apply these results in a sequential context, we start from a weighted sample {(\xi^{N,i}, \omega^{N,i})}_{1<=i<=M_N}, with weights \omega^{N,i} that are not necessarily identical. Also, we do not assume that {\xi^{N,i}}_{1<=i<=M_N} are conditionally i.i.d. with distribution \nu. In addition, we denote by {G^N} a sequence of sub-\sigma-fields of F. When studying the single-stage SIR estimator, one may simply set, for any N >= 0, G^N equal to the trivial \sigma-field {\emptyset, \Omega}. Indeed, the use of {G^N}_{N>=0} is a provision for situations in which the SIR algorithm is applied sequentially; {G^N}_{N>=0} handles the history of the particle system up to the current iteration.

Algorithm 9.2.5 (Weighting and Resampling).
Resampling: Draw random variables {I^{N,1}, \ldots, I^{N,M_N}} conditionally independently given

    F^N = G^N \vee \sigma\left( (\xi^{N,1}, \omega^{N,1}), \ldots, (\xi^{N,M_N}, \omega^{N,M_N}) \right) ,        (9.16)

with probabilities proportional to \omega^{N,1} \frac{d\mu}{d\nu}(\xi^{N,1}), \ldots, \omega^{N,M_N} \frac{d\mu}{d\nu}(\xi^{N,M_N}). In other words, for k = 1, \ldots, M_N,

    P(I^{N,k} = i \mid F^N) = \frac{ \omega^{N,i} \frac{d\mu}{d\nu}(\xi^{N,i}) }{ \sum_{j=1}^{M_N} \omega^{N,j} \frac{d\mu}{d\nu}(\xi^{N,j}) } ,    i = 1, \ldots, M_N .

Assignment: For i = 1, \ldots, M_N, set

    \tilde{\xi}^{N,i} = \xi^{N, I^{N,i}} .        (9.17)
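A minimal sketch of one application of the weighting and resampling operation above, for a generic weighted sample held as NumPy arrays, might read as follows.

```python
import numpy as np

def weight_and_resample(particles, weights, d_mu_d_nu, rng):
    """One application of Algorithm 9.2.5: reweight a weighted sample targeting
    nu by d(mu)/d(nu) and resample multinomially, returning an unweighted
    sample targeting mu (all weights equal to one).  `particles` and `weights`
    are one-dimensional arrays of equal length."""
    w = weights * d_mu_d_nu(particles)            # omega^{N,i} * dmu/dnu(xi^{N,i})
    idx = rng.choice(len(particles), size=len(particles), replace=True, p=w / w.sum())
    return particles[idx]                         # tilde xi^{N,i}, i = 1, ..., M_N
```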

We now study in which sense the weighted sample {( N,i , 1)}1iMN ap proximates the target distribution . Consider the following assumption. Assumption 9.2.6. {( N,i , N,i )}1iMN is consistent for (, C), where C is a proper set of functions. In addition, d/d C. The following theorem is an elementary extension of Theorem 9.1.8. It shows that the if the original weighted sample of Algorithm 9.2.5 is consistent for , then the reweighted sample is consistent for . Theorem 9.2.7. Assume 9.1.1 and 9.2.6. Then def C = f L1 (X, ) : |f | d C d (9.18)

is a proper set of functions and {( N,i , N,i d ( N,i ))}1iMN is consistent d for (, C).


Proof. It is easy to check that C is proper. Because {( N,i , N,i )}1iMN is consistent for (, C), for any function h C it holds that
MN

N,i
MN j=1

i=1

N,j

h N,i (h) .

By construction h d C for any h C. Therefore d


MN

N,i
MN j=1

i=1

N,j

d N,i d P ( )h( N,i ) h d d

= (h) .

(9.19)

The proof is concluded by applying (9.19) with h 1 and h = f . The next step is to show that the sample { N,i }, which is the result of the resampling operation, is consistent for as well. The key result to proving this is the following theorem, which establishes a conditional weak law of large numbers for conditionally independent random variables under easily veried technical conditions. Theorem 9.2.8. Let be a probability distribution on (X, X ) and let f be in L1 (X, ). Assume that the triangular array { N,i }1iMN is conditionally independent given {F N } and that for any non-negative C, 1 MN Then 1 MN
MN

E |f |( N,i )1{|f |(N,i )C} F N |f |1{|f |C} .


P

(9.20)

i=1

MN

f ( N,i ) E f ( N,i ) F N
i=1

0 .

(9.21)

Proof. We have to check conditions (ii)(iii) of Proposition 9.5.7. Set VN,i = 1 MN f ( N,i ) for any N and i = 1, . . . , MN , By construction, the triangular array {VN,i } is conditionally independent given {F N } and E[|VN,i | F N ] < . Equation (9.20) with C = 0 shows that
MN MN

E |VN,i | | F
i=1

1 MN i=1

E |f ( N,i )| F N (|f |) < ,

N whence the sequence { i=1 E[|VN,i | | F N }N 0 is bounded in probability [condition (ii)]. Next, for any positive and C we have for suciently large N ,



MN


E |VN,i |1{|VN,i | = 1 MN
1 MN i=1 MN

FN FN
P

i=1

E |f ( N,i )|1{|f |(N,i )

MN }

i=1 MN

E |f ( N,i )|1{|f |(N,i )C} F N (|f |1{|f |C} ) .

By dominated convergence, the right-hand side of this display tends to zero as C . Thus, the left-hand side of the display converges to zero in probability, which is condition (iii). We can now prove that the resampled particles are consistent for . Theorem 9.2.9. Let {( N,i , 1)}1iMN be as in Algorithm 9.2.5 and let C be N,i as in (9.18). Then under Assumptions 9.1.1, and 9.2.6, {( , 1)}1iMN is consistent for (, C). Proof. We will apply Theorem 9.2.8 and thus need to verify its assumptions. By construction, { N,i }1iMN is conditionally independent given F N . Pick f in C. Because C is proper, |f |1{|f |C} C for any C 0. Therefore 1 N M
MN

E |f |( N,i )1{|f |(N,i )C} F N N,i d ( N,i ) d


MN j=1

i=1 MN

=
i=1

N,j d ( N,j ) d

|f |( N,i )1{|f |(N,i )C} (|f |1{|f |C} ) ,


P

where the convergence follows from Theorem 9.2.7. Thus Theorem 9.2.8 ap 1 MN f ( N,i ) plies, and taking C = 0, it allows us to conclude that MN 1 converges to (f ) in probability for any non-negative f . By dividing a general f in C into its positive and negative parts, we see that the same conclusion holds true for such f . Our next objective is to establish asymptotic normality of the resampled particles {( N,i , 1)}. Consider the following assumption. Assumption 9.2.10. The weighted sample {( N,i , N,i )}1iMN is asymptotically normal for (, A, , {aN }), where A is a proper set of functions, is a non-negative function on A, and {aN } is a non-decreasing sequence of positive constants diverging to innity. In addition, d A. d We proceed in two steps. In a rst step, we strengthen the conclusions of Theorem 9.1.8 to show that the reweighted sample {( N,i , N,i d ( N,i ))}1iMN d is asymptotically normal. Then we show that the sampling operation preserves asymptotic normality.


Theorem 9.2.11. Assume 9.1.1, 9.2.6, and 9.2.10 and dene def A = f L2 (X, ) : |f | d A d .

Then A is a proper set and the weighted sample {( N,i , N,i d ( N,i ))}1iMN d is asymptotically normal for (, A, , {aN }) with 2 (f ) = 2 d [f (f )] d .

Proof. Once again it is easy to see that A is proper. Pick f in A. Under the stated assumptions, d A and f d A. Therefore (|f |) = (|f | d ) < , d d d showing that f L1 (X, ). In addition, again as A is a proper, h = d {f d (f )} A. By construction, (h) = 0. Write
MN

aN
i=1

N,i d ( N,i ) d
MN j=1

N,j d ( N,j ) d

f ( N,i ) (f )

aN

MN N,i h( N,i ) i=1 N N,i d ( N,i ) i=1 d

Because the weighted sample {( N,i , N,i }1iMN is asymptotically normal for (, A, , {aN }), h A, and (h) = 0, we conclude that
MN

aN
i=1

N,i
MN j=1

N,j

h( N,i ) N 0, 2 (h)

and note that 2 (h) = 2 (f ). Moreover, because the same weighted sample is consistent for ,
MN

N,i
N j=1

i=1

N,j

d N,i P ( ) d

d d

=1.

The proof now follows by Slutskys theorem (Billingsley, 1995). In order to proceed to asymptotic normality after resampling, we need some preparatory results. The following proposition establishes a conditional CLT for triangular arrays of conditionally independent random variables. It is an almost direct application of Theorem 9.5.13, which is stated and proved in Section 9.5.1. Proposition 9.2.12. Assume 9.1.1 and 9.2.6. Then for any u R and any function f such that f 2 d C, d
MN

1/2 E exp iu MN
i=1

{f ( N,i ) E[f ( N,i ) | F N ]} F N exp (u2 /2) Var (f ) , (9.22)


P

where {F N } and { N,i }1iMN are dened in (9.16) and (9.17), respectively.


Corollary 9.2.13. Assume 9.1.1 and 9.2.6. Then


MN

1/2 MN
i=1

{f ( N,i ) E[f ( N,i ) | F N ]} N(0, Var (f )) .

(9.23)

Proof (of Proposition 9.2.12). We will appeal to Theorem 9.5.13 and hence need to check that its conditions (ii) and (iii) are satised. First, Var[f ( N,1 ) | F N ] =
MN

N,i d ( N,i ) d
MN j=1

MN

i=1

N,j d ( N,j ) d

f (

N,i

)
i=1

N,i d ( N,i ) d
MN j=1

N,j d ( N,j ) d

f (

N,i

The assumptions say that {( N,i , N,i )}1iMN is consistent for (, C). Because d C and f 2 d C, the inequality |f | d 1{|f |1} d + f 2 d shows d d d d d that |f | d C. Theorem 9.2.7 then implies that d Var[f ( N,1 ) | F N ] (f 2 ) {(f )}2 = Var (f ) . Condition (ii) follows. Moreover, for any positive constant C,
MN P

1 MN
i=1

E[f 2 ( N,i )1{|f |(N,i )C} | F N ]


MN

=
i=1

N,i d ( N,i ) d
MN j=1

N,j d ( N,j ) d

f 2 ( N,i )1{|f |(N,i )C} .

Because f 2 d belongs to the proper set C, we have f 2 1{|f |C} d C. This d d implies that the right-hand side of the above display converges in probability to (f 2 1|f |C ). Hence condition (iii) also holds. Applying successively Theorem 9.2.11 and Proposition 9.2.12 yields the following result, showing that the resampling preserves asymptotic normality. Theorem 9.2.14. Assume 9.1.1, 9.2.6, and 9.2.10, and that a2 /MN has a N limit, say, possibly innite. Dene def A = f L2 (X, ) : |f | d d A, f 2 C d d , (9.24)

where A and C are as in Assumptions 9.2.10 and 9.2.6, respectively. Then A is a proper set and the following holds true for the resampled system {( N,i , 1)} dened as in Algorithm 9.2.5.
1iMN


(i) If < 1, then {( N,i , 1)} is asymptotically normal for (, A, , {aN }) with 2 (f ) = Var (f ) + 2 d {f (f )} , d f A. (9.25)

1/2 (ii) If 1, then {( N,i , 1)} is asymptotically normal for (, A, , {MN }) with 2 (f ) = Var (f ) + 1 2 d {f (f )} , d f A. (9.26)

Thus, we see that if MN increases much slower than aN , so that = , 1/2 then the rate of convergence is MN and the limiting variance is the basic Monte Carlo variance Var (f ). This means that aN is so large compared to MN that the weighted sample {( N,i , N,i d ( N,i ))} approximates with d negligible error, and the resampled particles can eectively be thought of as an i.i.d. sample from . On the other hand, when MN increases much faster than aN , so that = 0, then the rate of convergence is aN and the limiting variance is that associated with the weighted sample {( N,i , N,i d ( N,i ))} d alone (see Theorem 9.2.11). This means that the size of the resample is so large that the error associated with this part of the overall procedure can be disregarded. 1 Proof (Theorem 9.2.14). Pick f A and write MN AN + BN with
MN MN i=1

f ( N,i ) (f ) =

AN =
i=1

N,i d ( N,i ) d
MN j=1 MN

N,j d ( N,j ) d

{f ( N,i ) (f )} ,

BN

1 = MN
i=1

{f ( N,i ) E[f ( N,i ) | F N ]} .

Under the stated assumptions, Proposition 9.2.11 shows that aN AN N 0, 2


D

d [f (f )] d

Combining this with Proposition 9.2.12, we nd that for any real numbers u and v, 1/2 E exp(i(uMN BN + vaN AN ) = E E exp(iuMN BN ) F N exp(ivaN AN ) exp (u2 /2) Var (f ) exp (v 2 /2) 2 d [f (f )] d .
1/2


Thus the bivariate characteristic function converges to the characteristic function of a bivariate normal, implying that a N AN 1/2 M BN
N

N 0 ,

2
1/2

d d {f

[f ]} 0 0 Var {f }

Put bN = aN if < 1 and bN = MN

if 1. The proof follows from


1/2

bN (AN + BN ) = (bN a1 )aN AN + (bN MN N

)MN BN .

1/2

9.2.4 Application to the Single-Stage SIR Algorithm We now apply the above results to the single-stage SIR algorithm, sampling from an instrumental distribution and then weighting and resampling to obtain an approximately i.i.d. sample from . The procedure is illustrated in Figure 9.4. Thus { N,i }1iMN is an i.i.d. sample from and the weights are set to 1; N,i 1. The LLN shows that Assumption 9.2.6 is satised with C = L1 (X, ). Theorem 9.2.9 shows that for any f C = L1 (X, ) (see the denition in (9.18)), 1 N M
MN P f ( N,i ) (f ) . i=1

Moreover, the weighted sample {( N,i , 1)}1iMN satises Assumption 9.2.10 1/2 with A = L2 (X, ), 2 (f ) = {f (f )}2 , and aN = MN , provided 2 N,i d/d L (X, ). Thus Theorem 9.2.14 shows that {( , 1)}1iMN is asymptotically normal for . We summarize this in the following result. Theorem 9.2.15. Assume 9.1.1 and let { N,i }1iMN be i.i.d. random vari ables with distribution . Then {( N,i , 1)}1iMN given by Algorithm 9.2.5 is 1 consistent for , L (X, ) . Assume in addition that limN MN /MN = for some [0, ] and d 2 = {f L2 (X, ) : f d L2 (X, )}. Then the that d L (X, ). Dene A d following holds true. (i) If < 1, then {( N,i , 1))} is asymptotically normal for (, A, ,
1iMN

{MN }) with 2 (f ) = Var (f ) + Var


def

1/2

d [f (f )] , d

f A.

(ii) If 1, then {( N,i , 1))}1iMN is asymptotically normal for (, A, , 1/2 {M }) with


N

2 (f ) = Var (f ) + 1 Var

def

d [f (f )] , f A . d

Fig. 9.4. The single-stage SIR algorithm.

Without loss of generality, we may assume here that M_N = N. To obtain a rate \sqrt{N} asymptotically normal sample for the target distribution \mu, the cardinality \bar{M}_N of the instrumental sample should grow at least as fast as N, \lim_N \bar{M}_N / N > 0. If \lim_N \bar{M}_N / N = \infty, then

    \sqrt{N} \left[ \hat{\mu}^{SIR}_{\nu,N}(f) - \mu(f) \right] \to_D N(0, \mathrm{Var}_\mu(f)) ,

that is, the SIR estimator and the plain Monte Carlo estimator \hat{\mu}^{MC}_N(f) of \mu(f) (the estimator of \mu(f) obtained by computing the sample average N^{-1} \sum_{i=1}^{N} f(\xi^i) with {\xi^i} being an i.i.d. sample from the target distribution \mu) have the same limiting Gaussian distribution. In practice, this means that a large instrumental sample should be used when one is asking for a sample that behaves like an i.i.d. sample from \mu.

We conclude this section with some elementary deviation inequalities. These inequalities are non-asymptotic and allow evaluating the performance of the SIR estimator for finite sample sizes.

Theorem 9.2.16. Assume 9.1.1 and let {\xi^{N,i}}_{1<=i<=\bar{M}_N} be i.i.d. random variables with distribution \nu. Then for any t > 0, f in F_b(X), a in (0, 1), and N >= 0,




MN


1 MN i=1

f ( N,i ) (f ) t 2 exp 2MN a2 t2 osc2 (f ) + 4 exp 8MN (1 a)2 t2 9 osc2 (f ) d/d


2 ,

1 Proof. Decompose MN terms 1 A (f ) = MN


N

MN N,i ) (f )} i=1 {f (

as a sum AN + B N of the two

MN

{f ( N,i ) E[f ( N,i ) | N,1 , . . . , N,MN ]} ,


i=1 MN

1 B (f ) = MN
N i=1

{E[f ( N,i ) | N,1 , . . . , N,MN ] (f )}


d d f d d

MN d N,i )f ( N,i ) i=1 d ( MN d N,i ) i=1 d (

Hoedings inequality implies that P |AN (f )| at N,1 , . . . , N,MN 2 exp 2MN a2 t2 osc2 (f ) .

The result also holds unconditionally by taking the expectation of the left-hand side. For P(|B^N(f)| >= (1-a)t), use the bound (9.12) of Theorem 9.1.10.

Example 9.2.17 (Importance Sampling with Cauchy and Gaussian Variables, Continued). In this continuation of Example 9.1.3, the target distribution \mu is standard Gaussian and the instrumental distribution \nu is standard Cauchy. In this case d\mu/d\nu is bounded by some finite M, so that

    \int f^2 \left( \frac{d\mu}{d\nu} \right)^2 d\nu <= M \int f^2 \, \frac{d\mu}{d\nu} \, d\nu = M \mu(f^2) .

Hence Theorem 9.2.15 applies to functions f that are square integrable with respect to the standard Gaussian distribution. This condition is also required to establish asymptotic normality of the importance sampling estimator. We set N = 1,000 and investigate the impact of the size M of the instrumental sample on the accuracy of the SIR estimator for f(x) = exp(-x). Figure 9.5 displays the box-and-whisker plots obtained from 500 independent Monte Carlo replications of the IS and SIR estimators of \mu(f), for instrumental sample sizes M = 100, 1,000, 10,000, and 100,000. As expected, the fluctuations of the SIR estimate decrease as the ratio M/N increases. Not surprisingly, when


M = 100 (M/N = 0.1), the fluctuation \tilde{\mu}^{IS}_{\nu,M}(f) - \mu(f) of the importance sampling estimator dominates the resampling fluctuation \hat{\mu}^{SIR}_{\nu,N}(f) - \tilde{\mu}^{IS}_{\nu,M}(f). On the contrary, when M = 10,000 (M/N = 10), the resampling fluctuation is much larger than the error associated with the importance sampling estimate. Likewise, for this M the variance of the SIR estimator is not significantly different from the variance of the plain Monte Carlo estimator using an i.i.d. sample of size N = 1,000 from the target distribution \mu. To judge the ability of the SIR sample to mimic the distribution of an independent sample from \mu, we applied a goodness-of-fit test. Figure 9.5 displays observed p-values and observed rejection probabilities for the Kolmogorov-Smirnov (KS) goodness-of-fit test of the null hypothesis that the distribution is standard Gaussian (with significance level 5%). For M = 100 and 1,000, the p-values are small and the rejection probabilities are large, meaning that the KS test detects a deviation from the null hypothesis of Gaussianity. For M = 10,000 and 100,000, the p-values are much higher and the probabilities of rejection are much smaller.
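A sketch of the experiment of Example 9.2.17 follows: SIR with a standard Cauchy instrumental distribution and a standard Gaussian target, followed by a Kolmogorov-Smirnov test of the resampled points against N(0,1). The numerical values will of course differ from run to run; this is an illustration, not a reproduction of the figures.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(2)
N = 1_000                                   # final sample size
for M in (100, 1_000, 10_000, 100_000):     # instrumental sample sizes
    xi = rng.standard_cauchy(M)
    w = (1 + xi**2) * np.exp(-xi**2 / 2)    # proportional to dmu/dnu
    resampled = xi[rng.choice(M, size=N, replace=True, p=w / w.sum())]
    stat, pval = kstest(resampled, "norm")  # KS goodness-of-fit test against N(0,1)
```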


Fig. 9.5. Simulation results for estimation of the integral \mu(f) with f(x) = exp(-x) and sample size N = 1,000, using importance sampling (IS) and sampling importance resampling (SIR) estimators. The instrumental distribution was standard Cauchy and the target distribution was standard Gaussian. The number of Monte Carlo replications was 500 and the instrumental sample sizes were M = 100, 1,000, 10,000, and 100,000. Top left: box-and-whisker plot of the IS estimates. Top right: box-and-whisker plot of the SIR estimates. Bottom left: observed p-values of the Kolmogorov-Smirnov goodness-of-fit test of the null hypothesis that the distribution after resampling is standard Gaussian. Bottom right: observed rejection probabilities of the null hypothesis at significance level 5%.


9.3 Single-Step Analysis of SMC Methods


We now carry the analysis one step forward to encompass elementary steps of (some of) the sequential Monte Carlo methods discussed in the previous chapters. To do that, we need to consider transformations of the weighted sample that are more sophisticated than weighting and sampling. As outlined in the previous chapter, many different actions might be considered, and it is out of the scope of this chapter to investigate all possible variants. We focus in the following on the SISR approach (Algorithm 7.3.4) and on the variant that we called i.i.d. sampling (Algorithm 8.1.1). As discussed in Section 8.1.1, each iteration of both of these algorithms is composed of two simple procedures, selection and mutation, which we consider separately below.

9.3.1 Mutation Step

To study SISR algorithms, we first need to show that when moving the particles using a Markov transition kernel and then assigning them appropriately defined importance weights, we transform a weighted sample consistent (or asymptotically normal) for one distribution into a weighted sample consistent (or asymptotically normal) for another appropriately defined distribution. As before, we let \nu be a probability measure on (X, \mathcal{X}), L be a finite transition kernel on (X, \mathcal{X}), and R be a probability kernel on (X, \mathcal{X}). Define the probability measure \mu on (X, \mathcal{X}) by

    \mu(A) = \frac{ \int_X \nu(dx) L(x, A) }{ \int_X \nu(dx) L(x, X) } .        (9.27)

We then wish to construct a sample consistent for \mu, given a weighted sample {(\xi^{N,i}, 1)}_{1<=i<=M_N} from \nu. To do so, we move the particles using R as an instrumental kernel and then assign them suitable importance weights. Before writing down the algorithm, we introduce some assumptions.

Assumption 9.3.1. \nu L(X) = \int_X \nu(dx) L(x, X) is positive and finite.

Assumption 9.3.2. {(\xi^{N,i}, 1)}_{1<=i<=M_N} is consistent for (\nu, C), where C is a proper set. In addition, the function x -> L(x, X) belongs to C.

Assumption 9.3.3. For any x in X, L(x, .) is absolutely continuous with respect to R(x, .), and there exists a (strictly) positive version of dL(x, .)/dR(x, .).

Now let {\alpha_N} be a sequence of integers and put \bar{M}_N = \alpha_N M_N. Consider the following algorithm.

Algorithm 9.3.4 (Mutation). Draw \tilde{\xi}^{N,1}, \ldots, \tilde{\xi}^{N,\bar{M}_N} conditionally independently given F^N = G^N \vee \sigma(\xi^{N,1}, \ldots, \xi^{N,M_N}), with distribution

    P(\tilde{\xi}^{N,j} \in A \mid F^N) = R(\xi^{N,i}, A)


for i = 1, \ldots, M_N, j = \alpha_N(i-1) + 1, \ldots, \alpha_N i, and A in \mathcal{X}, and assign \tilde{\xi}^{N,j} the weight

    \tilde{\omega}^{N,j} = \frac{dL(\xi^{N,i}, \cdot)}{dR(\xi^{N,i}, \cdot)}(\tilde{\xi}^{N,j}) .

Thus each particle gives birth to \alpha_N offspring. In many cases, we set \alpha_N = 1; then each particle is propagated forward only once. Increasing the number \alpha_N of offspring increases the particle diversity before the resampling step and is thus a practical means for contending with particle degeneracy. This of course increases the computational complexity of the algorithm.

Theorem 9.3.5. Assume 9.3.1, 9.3.2, and 9.3.3, and define

    \bar{C} := \{ f \in L^1(X, \mu) : x \mapsto L(x, |f|) \in C \} ,        (9.28)

where is given by (9.27). Then C is a proper set and {( N,i , N,i )}1iMN dened by Algorithm 9.3.4 is consistent for (, C). Proof. Checking that C is proper is straightforward, so we turn to the consis tency. We prove this by showing that for any f C, 1 N M
MN P N,j f ( N,j ) L(f ) . j=1

(9.29)

Under the assumptions made, the function x L(x, X) belongs to C, implying 1 MN N,j converges to that the constant function 1 belongs to C; hence MN j=1 L(X) in probability. Then for any f C, the ratio of the two sample means considered tends to L(f )/L(X) = (f ) in probability. This is consistency. To prove (9.29), pick f in C and note that E[ N,j f ( N,j ) | F N ] = L( N,i , f ) for j and i as in Algorithm 9.3.4. Hence
MN MN

1 MN
j=1

E[

N,j

1 f ( N,j ) | F N ] = MN i=1

L( N,i , f ) L(f ) ,

so that it is sucient to show that


MN MN

1 MN
j=1

1 N,j f ( N,j ) MN
j=1

E[ N,j f ( N,j ) | F N ] 0 .

(9.30)

1 For that purpose, we put VN,j = MN N,j f ( N,j ) and appeal to Proposition 9.5.7; we need to check its conditions (i)(iii). The triangular array {VN,j }1jMN is conditionally independent given {F N }; this is condition (i). Next, just as above,



MN 1 E[|VN,j | | F N ] = MN j=1 i=1 MN


L( N,i , |f |) L(|f |) ,

showing condition (ii). We nally need to show that for any positive C,
MN

AN =
j=1

E[|VN,j |1{|VN,j |C} | F N ] 0 .


P

Put h(x, x ) =

dL(x,) dR(x,) (x

)|f |(x ). For any positive C, we then have R(x, dx ) h(x, x ) = L(x, |f |) .

R(x, dx ) h(x, x )1{h(x,x )C}

Because the function x L(x, |f |) C and the set C is proper, this shows that the left-hand side of the above display is in C. Hence for large enough N ,
MN 1 AN M N i=1

R( N,i , dx ) h( N,i , x )1{h(N,i ,x )C}


P

(dx) R(x, dx ) h(x, x )1{h(x,x )C} .

The right-hand side of this inequality is bounded by L(|f |) < (cf. above), so that, by dominated convergence, the right-hand side can be made arbitrarily small by letting C . This shows that AN tends to zero in probability, which is condition (iii). Thus Proposition 9.5.7 applies, (9.30) holds, and the proof is complete. To establish asymptotic normality of the estimators, we must strengthen Assumption 9.3.2 as follows. Assumption 9.3.6. The weighted sample {( N,i , 1)}1iMN is asymptoti1/2 cally normal for (, A, , {MN }), where A is a proper set and is a nonnegative function on A. Theorem 9.3.7. Assume 9.3.1, 9.3.2, 9.3.3, and 9.3.6, and that {N } has a limit , possibly innite. Dene def A = f L2 (X, ) : x L(x, f ) A and x
X

R(x, dx )

dL(x, ) (x )f (x ) dR(x, )

. (9.31)

Then A is a proper set and {( N,i , N,i )}1iMN given by Algorithm 9.3.4 is , {M 1/2 }) with asymptotically normal for (, A,
N



2 (f ) =

def

2 {L[f (f )]} + 1 2 [f (f )] [L(X)]


2

f A,

(9.32)

and 2 dened by 2 (f ) =
def

(dx) R(x, dx )

dL(x, ) (x )f (x ) dR(x, )

(dx) [L(x, f )]2 . (9.33)

Proof. First we note that by denition, is necessarily at least 1. Checking that A is proper is straightforward, so we turn to the asymptotic normality. Pick f A and assume, without loss of generality, that (f ) = 0. Write
MN

N,i
MN j=1

f ( N,i ) =

i=1

N,j

MN MN N,j j=1

(AN + BN ) ,

with
MN MN 1 E[ N,i f ( N,i ) | F N ] = MN i=1 MN i=1

AN

1 = MN

L( N,i , f ) ,

1 B N = MN
i=1 M

{ N,i f ( N,i ) E[ N,i f ( N,i ) | F N ]} .

N Because MN / i=1 N,i converges to 1/L(X) in probability (cf. the proof of Theorem 9.3.5), the conclusion of the theorem follows from Slutskys theorem 1/2 if we prove that MN (AN +BN ) converges weakly to N(0, 2 (Lf )+1 2 (f )). In order to do that, we rst note that as the function x L(x, f ) belongs 1/2 to A and {( N,i , 1)}1iMN is asymptotically normal for (, A, , {MN }),

MN AN N(0, 2 (Lf )) . Next we prove that for any real u,


P 1/2 E exp(iuMN BN ) F N exp (u2 /2) 2 (f ) .

1/2

For that purpose, we use Proposition 9.5.12, and we thus need to check 1/2 its conditions (i)(iii). Set VN,i = MN N,i f ( N,i ). The triangular array {VN,i }1iMN is conditionally independent given {F N } [condition (i)]. More over, the function x belongs to C. Therefore
MN 2 E[VN,i | F N ] i=1 P

R(x, dx ) h2 (x, x ) with h(x, x ) =

dL(x,) dR(x,) (x

)f (x )

(dx) R(x, dx ) h2 (x, x ) ,



MN 2


(E[VN,i | F N ])2
i=1

(dx)

R(x, dx ) h(x, x )

These displays imply that condition (ii) holds. It remains to verify (iii), the Lindeberg condition. For any positive C, the inequality R(x, dx ) h2 (x, x )1{|h(x,x )|C}
X X X

R(x, dx ) h2 (x, x )

shows that the function x This yields


MN 1 MN i=1

R(x, dx )h (x, x )1{|h(x,x )|C} belongs to C.


2

R( N,i , dx ) h2 ( N,i , x )1{h(N,i ,x )C}


P

(dx)R(x, dx ) h2 (x, x )1{h(x,x )C} .

Because (dx)R(x, dx )h2 (x ) < , the right-hand side of this display can be made arbitrarily small by letting C . Therefore
MN 2 E[VN,i 1{|VN,i | } | F N ] 0 , P

i=1

and this is condition (iii). Thus Proposition 9.5.12 applies, and just as in the proof of Theorem 9.2.14 it follows that M N AN 1/2 MN B N
1/2

N 0 ,

2 (Lf ) 0 0 2 (f )
1/2

.
1/2

The proof is now concluded upon writing MN (AN + BN ) = MN AN + 1/2 1/2 N MN BN . 9.3.2 Description of Algorithms It is now time to combine the mutation step and the resampling step. This can be done in two dierent orders, mutation rst or selection rst, leading to two dierent algorithms that we call mutation/selection and selection/mutation, respectively. In the mutation/selection algorithm, we rst apply the muta tion algorithm, 9.3.4, to obtain a weighted sample {( N,i , N,i )}1iMN , and then resample according to the importance weights. The selection/mutation algorithm on the other hand is based on a particular decomposition of , namely



(A) = where

(dx) L(x, A) = L(X)


def A

(dx)

L(x, A) , L(x, X)

(9.34)

(A) =

(dx) L(x, X) , L(X)

AX .

(9.35)

From a sample {(\xi^{N,i}, \omega^{N,i})}_{1<=i<=M_N}, we compute importance weights as L(\xi^{N,i}, X), resample, and finally mutate the resampled system using the Markov kernel (x, A) -> L(x, A)/L(x, X). We now describe the algorithms formally. Let {\alpha_N} be a sequence of integers and set \bar{M}_N = \alpha_N M_N.

Algorithm 9.3.8 (Mutation/Selection).
Mutation: Draw \tilde{\xi}^{N,1}, \ldots, \tilde{\xi}^{N,\bar{M}_N} conditionally independently given F^N = G^N \vee \sigma(\xi^{N,1}, \ldots, \xi^{N,M_N}), with distribution P(\tilde{\xi}^{N,j} \in . \mid F^N) = R(\xi^{N,i}, .) for i = 1, \ldots, M_N and j = \alpha_N(i-1) + 1, \ldots, \alpha_N i. Assign \tilde{\xi}^{N,j} the weight \tilde{\omega}^{N,j} = \frac{dL(\xi^{N,i}, \cdot)}{dR(\xi^{N,i}, \cdot)}(\tilde{\xi}^{N,j}).
Sampling: Draw M_N random variables I^{N,1}, \ldots, I^{N,M_N} conditionally independently given \tilde{F}^N = F^N \vee \sigma(\tilde{\xi}^{N,1}, \ldots, \tilde{\xi}^{N,\bar{M}_N}), with the probability of outcome j, 1 <= j <= \bar{M}_N, being proportional to \tilde{\omega}^{N,j}. Set \check{\xi}^{N,i} = \tilde{\xi}^{N, I^{N,i}} for i = 1, \ldots, M_N.

To avoid notational explosion, it is assumed here that the sample size after the resampling stage is identical to the size of the initial sample. Extensions to general sample sizes are straightforward. The algorithm is illustrated in Figure 9.6. For the selection/mutation algorithm, we have to strengthen the assumption on the transition kernel L.

Assumption 9.3.9. For any x in X, L(x, X) > 0.

Algorithm 9.3.10 (Selection/Mutation).
Selection: Draw random variables I^{N,1}, \ldots, I^{N,M_N} conditionally independently given F^N = G^N \vee \sigma(\xi^{N,1}, \ldots, \xi^{N,M_N}), with the probability of outcome j, 1 <= j <= M_N, being proportional to L(\xi^{N,j}, X). Set \tilde{\xi}^{N,i} = \xi^{N, I^{N,i}} for i = 1, \ldots, M_N.
Mutation: Draw \check{\xi}^{N,1}, \ldots, \check{\xi}^{N,M_N} conditionally independently given \tilde{F}^N = F^N \vee \sigma(I^{N,1}, \ldots, I^{N,M_N}), with distribution P(\check{\xi}^{N,i} \in . \mid \tilde{F}^N) = L(\tilde{\xi}^{N,i}, .)/L(\tilde{\xi}^{N,i}, X).

The algorithm is illustrated in Figure 9.7. As described above, the selection/mutation algorithm requires evaluation of, for any x in X, the normalizing constant L(x, X), and then sampling from the Markov transition kernel L(x, .)/L(x, X). As emphasized in Chapter 7, these steps are not always easy to carry out. In this sense, this algorithm is in general less widely applicable

9.3 Single-Step Analysis of SMC Methods

317

Instrumental distribution

N,1

N,MN

R N,1 N,MN
H I U T E D G F   Q P q p P Y X g f     c b  

Target distribution

dL( N,1 ,) N,1 ) dR( N,1 ,) ( dL( N,MN ,) N,MN ) dR( N,1 ,) (
4 5 W V w v V u t S R 3 2 S R s r 3 2 i h i h y x ) ( 1 0 ( e d a `

N,1 N,MN
% $ 7 6 $ 9 8 " # !     ' &

Fig. 9.6. The mutation/selection algorithm. The gure depicts the transformation of the particle system by application of a mutation step followed by a resampling step. In the rst stage, an intermediate sample is generated using an instrumental kernel R. Each individual particle of the original system has exactly N ospring. In a second step, importance weights taking into account the initial and nal positions of the particles are computed. A resampling step, in accordance with these importance weights, is then applied.

A @

C B

318

9 Analysis of SMC Methods

Initial distribution

N,1

N,MN

L( N,1, X)

L( N,MN , X)

N,1

N,MN Final distribution

N,1

N,MN

Fig. 9.7. The selection/mutation algorithm. The gure depicts the transformation of the particle system by application of a selection step followed by a mutation step. In the rst stage, the importance weights {L( N,i , X)}1iMN are computed and the system of particles is resampled according to these importance weights. In the second stage, each resampled particle { N,i }1iMN is mutated using the kernel L( N,i , )/L( N,i , X).

9.3 Single-Step Analysis of SMC Methods

319

than mutation/selection. However, it is worthwhile to note that the random variables N,1 , . . . , N,MN are conditionally independent given F N and distributed according to the mixture of probability kernels
MN

L( N,i , X)
MN j=1

i=1

L( N,i , A) . L( N,j , X) L( N,i , X)

As pointed out in Section 8.1.4, it is possible to draw from this distribution without having to follow the selection/mutation steps. 9.3.3 Analysis of the Mutation/Selection Algorithm Using the tools derived above we establish the consistency and asymptotic normality of the mutation/selection algorithm, 9.3.8. A direct application of Theorems 9.3.5 and 9.2.9 yields the following result. Theorem 9.3.11. Assume 9.3.1, 9.3.2, and 9.3.3, and dene def C = {f L1 (X, ) : x L(x, |f |) C} . (9.36)

where is given by (9.27). Then C is a proper set and (i) {( N,i , N,i )}1iMN given by Algorithm 9.3.8 is consistent for (, C); N,i , 1)}1iM given by Algorithm 9.3.8 is consistent for (, C). (ii) {(
N

Moreover, Theorems 9.3.7 and 9.2.14 imply the following. Theorem 9.3.12. Assume 9.3.1, 9.3.2, 9.3.3, and 9.3.6, and that {N } has a limit, possibly innite. Dene def A = f L2 (X, ) : x L(x, |f |) A and x R(x, dx ) dL(x, ) (x )f (x ) dR(x, )
2

Then A is a proper set and (i) {( N,i , N,i )}1iMN given by Algorithm 9.3.8 is asymptotically normal , {M 1/2 }) with for (, A,
N

2 (f ) =

def

2 {L[f (f )]} + 1 2 [f (f )] [L(X)]


2

f A,

and 2 being dened in (9.33); (ii) {( N,i , 1)}1iMN given by Algorithm 9.3.8 is asymptotically normal for 1/2 (, A, , {MN }) with 2 (f ) = Var (f ) + 2 (f ) for f A.

320

9 Analysis of SMC Methods

9.3.4 Analysis of the Selection/Mutation Algorithm We now analyze the selection/mutation algorithm, 9.3.10. Theorem 9.3.13. Assume 9.3.2 and 9.3.9. Then (i) {( N,i , L( N,i , X))}1iMN given by Algorithm 9.3.10 is consistent for (, C), where is dened in (9.35) and def C = {f L1 (X, ) : x |f (x)|L(x, X) C} ; (ii) {( N,i , 1)}1iMN given by Algorithm 9.3.10 is consistent for (, C), where C = {f L1 (X, ) : x L(x, |f |) C} . Proof. By construction, is absolutely continuous with respect to and L(x, X) d (x) = , d L(X) xX. (9.37)

The rst assertion follows from Theorem 9.2.7. Theorem 9.2.9 shows that the weighted sample {( N,i , 1)}1iMN is consistent for (, C). Assertion (ii) then follows from the representation (9.34) of and Theorem 9.3.5. We may similarly formulate conditions under which the selection/mutation scheme transforms an asymptotically normal sample from the distribution into an asymptotically normal sample from . Assumption 9.3.14. {( N,i , 1)}1iMN is asymptotically normal for (, A, 1/2 , {MN }), where A is a proper set and is a non-negative function on A. In addition the function x L(x, X) belongs to A. Theorem 9.2.11, Theorem 9.2.14, and Theorem 9.3.7 lead to the following result. Theorem 9.3.15. Assume 9.3.2, 9.3.9, and 9.3.14. Then (i) {( N,i , L( N,i , X))}1iMN given by Algorithm 9.3.10 is asymptotically 1/2 normal for (, A, , {MN }), where is dened in (9.35), 2 A = {f L (X, ) : x |f (x)|L(x, X) A} and 2 (f ) = 2 {L(, X)[f (f )]} [L(X)]
2

f A;

(ii) ( N,i , 1) 1iM given by Algorithm 9.3.10 is asymptotically normal N 1/2 for (, A, , {M }), where
N

A = {f L2 (X, ) : x L(x, |f |) A and x L(x, f 2 ) C} and 2 (f ) = Var (f ) + 2 {L[f (f )]} [L(X)]


2

f A.

9.4 Sequential Monte Carlo Methods

321

9.4 Sequential Monte Carlo Methods


We are now ready to evaluate the performance of repeated applications of the basic procedures studied in the previous section. We begin with the mutation/selection or SISR variant. 9.4.1 SISR Sequential importance sampling with resampling amounts to successively applying the mutation/selection procedure in order to construct a sample approximating the marginal ltering distribution. In this case, the initial and nal probability distributions are the marginal ltering distributions ,k at two successive time instants. As discussed in Chapter 7, these two distributions are related by (7.8), which we recall here: ,0 (A) = ,k+1 (A) =
u Tk (x, A) = A A

(dx ) g0 (x ) , (dx ) g0 (x ) X

AX , A X, k 0 , AX ,

(9.38) (9.39) (9.40)

u ,k (dx) Tk (x, A) , u (dx) Tk (x, X) X ,k

Q(x, dx ) gk+1 (x ) ,

where, as usual, Q stands for the transition kernel of the hidden chain and gk for the likelihood of the current observation, gk (x) = g(x, Yk )1 . The instrumental distributions are dened by a sequence {Rk }k0 of instrumental transition kernels on (X, X ) and a probability distribution 0 on (X, X ). In addition, let {N } denote a sequence of positive integers that control the size of the intermediate populations of particles (see below). We require the following assumptions. Assumption 9.4.1. (i) (g0 ) > 0. (ii) X Q(x, dx )gk (x ) > 0 for all x X and k 0. (iii) supxX gk (x) < for all k 0. Assumption 9.4.2. The instrumental distribution 0 for the initial state dominates the ltering distribution ,0 , ,0 0 . Assumption 9.4.3. For any k 0 and all x X, the instrumental kernel u u Rk (x, ) dominates Tk (x, ), Tk (x, ) Rk (x, ). In addition, for any x there dT u (x,) exists a version of the Radon-Nikodym derivative dRk (x,) that is (strictly) k positive and such that sup(x,x )XX
u dTk (x,) dRk (x,) (x

) < .

1 u Note that in Chapter 7 we dened Tk with a dierent scale factorsee (7.8). As mentioned several times, however, this scale factor plays no role in approaches based on (self-normalized) importance sampling and SIR. For notational simplicity, we thus ignore this scale factor here.

322

9 Analysis of SMC Methods

These conditions are not minimal but are most often satised in practice. The rst assumption, 9.4.1, implies that for any positive integer k,
k k

0<

(dx0 ) g0 (x0 )
i=1

Q(xi1 , dxi ) gi (xi )

sup gi (x) < ,


i=0 xX

so that in particular
u 0 < ,k Tk (X) < .

(9.41)

The SISR approach under study has already been described in Algorithm 7.3.4, which we rephrase below in a more mathematical fashion to underline the conditioning arguments to be used in the following. Algorithm 9.4.4 (SISR).
N N,i Mutation: Draw {k+1 }1iMN conditionally independently given Fk , with dis tribution N,i N N,j P(k+1 A | Fk ) = Rk (k , A)

for i = 1, . . . , MN , j = N (i 1) + 1, . . . , N i and A X , and compute the importance weights


N,i N,j dQ(k , ) ( N,j ) k+1 = gk+1 (k+1 ) N,j k+1 N,i dRk (k , )

for j and i as above. N,1 N,M N N Selection: Draw Ik+1 , . . . , Ik+1 N conditionally independently given Fk = Fk N ( N,1 , . . . , N,M ), with distribution
k+1 k+1 N,i P(Ik+1

N = j | Fk ) =

k+1 N,j
MN j=1

k+1 N,j

N,1 N,M N,i N N and set k+1 = k+1k+1 and Fk+1 = Fk (k+1 , . . . , k+1 N ).

N,I N,i

Two choices, among many others, of the instrumental kernel are the following.
u Prior kernel: Rk = Q. For any (x, x ) X X, [dTk (x, )/dQ(x, )](x ) = N,j N,j gk+1 (x ), showing that the importance weights k+1 = gk+1 (k+1 ) only de pend on the mutated particle positions. Provided Assumption 9.4.1 holds true, so does Assumption 9.4.3 as soon as gk+1 (x) > 0 for all x X. Note N,i that for the prior kernel, {(k+1 , 1)}1iMN is a sample approximating the marginal predictive distribution k+1|k = k Q. Optimal kernel: Rk = Tk , dened by

Tk (x, A) =

u Tk (x, A) . u Tk (x, X)

9.4 Sequential Monte Carlo Methods

323

u u For all (x, x ) X X, [dTk (x, )/dTk (x, )](x ) = Tk (x, X), which implies N,j u N,i that the importance weights k+1 = Tk ( , X), with j and i as above, only depend on the current particle positions. Provided Assumption 9.4.1 holds true, so does Assumption 9.4.3 because, for all (x, x ) X X, u [dTk (x, )/dTk (x, )] > 0 and u dTk (x, ) (x ) = sup xX (x,x )XX dTk (x, )

sup

Q(x, dx ) gk+1 (x ) sup gk+1 (x) < .


X xX

For all other instrumental kernels, the importance weights depend on the initial and nal positions of the particles. Theorem 9.4.5. Assume 9.4.1, 9.4.2, and 9.4.3. Then the following holds true.
N,i (i) If {(0 , 1)}1iMN is consistent for (,0 , L1 (X, ,0 )) then for any k > N,i 0, {(k , 1)}1iMN is consistent for (,k , L1 (X, ,k )). N,i (ii) If in addition {(0 , 1)}1iMN is asymptotically normal for (,0 , 1/2 N,i L2 (X, ,0 ), 0 , {MN }) then for any k > 0, {(k , 1)}1iMN is asymp1/2 totically normal for (,k , L2 (X, ,k ), k , {MN }), where the sequence {k } of functions is dened recursively, for f L2 (X, ,k ), by 2 k+1 (f ) = Var,k+1 (f )

+ with
2 k (f ) =

2 u 2 k (Tk {f ,k+1 (f )}) + 1 k ({f ,k+1 (f )}2 ) u (,k Tk (X)) 2

,k (dx)Rk (x, dx )

u dTk (x, ) (x )f (x ) dRk (x, )

u ,k (dx){Tk (x, f )}2 .

Proof. The proof is by induction over k. Starting with (i), we hence assume that for some k 0, {( N,i , 1)}1iMN is consistent for (,k , L1 (X, ,k )). To prove that consistency then holds for k + 1 as well, we shall employ Theorem 9.3.11 and hence need to verify its underlying assumptions with = ,k u and L = Tk . To start with, Assumption 9.3.1 is (9.41) and Assumption 9.3.3 is implied by Assumption 9.4.3. Assumption 9.3.2 follows from the induction u hypothesis plus the bound Tk (x, X) gk+1 < for all x. Finally, to check that consistency applies over L1 (X, ,k+1 ), we need to verify that for u any f L1 (X, ,k+1 ) the function x Tk (x, |f |) belongs to L1 (X, ,k ). This is indeed true, as

324

9 Analysis of SMC Methods


u u (,k Tk )(|f |) = (,k Tk )(X) ,k+1 (|f |).

Assertion (i) now follows from Theorem 9.3.11 and induction. We proceed to part (ii), modify the induction hypothesis accordingly, and use Theorem 9.3.12 to propagate it from k to k +1. The additional assumption we then need to verify is Assumption 9.3.6, which is the induction hypothesis. Finally, we need to check that asymptotic normality applies over L2 (X, ,k+1 ). Pick f L2 (X, ,k+1 ). Then by Jensens inequality,
u ,k (Tk |f |)2 = ,k [Q(gk+1 |f |)]2 2 ,k Q(gk+1 f 2 ) u = (,k Tk )(X) ,k+1 (gk+1 f 2 ) u (,k Tk )(X) gk+1 u saying that Tk (|f |) is in L2 (X, ,k ). Similarly u dTk (x, ) (x )f (x ) dRk (x, ) u dTk (x, ) u (x ) (,k Tk )(X) ,k+1 (f 2 ) < , sup (x,x )XX dRk (x, ) 2

,k+1 (f 2 ) < ,

,k (dx)
X X

Rk (x, dx )

so that the function that ,k is acting on in the left-hand side belongs to L1 (X, ,k ). Assertion (ii) now follows from Theorem 9.3.12 and induction.

9.4.2 I.I.D. Sampling We now consider successive applications of the selection/mutation procedure. The resulting algorithm, referred to as i.i.d. sampling in Section 8.1.1, is recalled below. Because the mathematical analysis of this algorithm is somewhat simpler, we consider below two additional types of results: uniform (in time) convergence results under appropriate forgetting conditions (as discussed in Section 4.3) and exponential tail inequalities. Recall that although the emphasis is here put on ltering estimates, the selection/mutation algorithm may also be applied to approximate the predictive distributions, in which case it is known as the bootstrap lter (Figure 8.1). Hence all results below also apply to the analysis of the predictive estimates produced by the bootstrap lter, with only minor adjustments. Algorithm 9.4.6 (I.I.D. Sampling).
N,i Selection: Assign to the particle k the importance weight N,i u N,i k+1 = Tk (k , X) = N,i Q(k , dx ) gk+1 (x ) .

9.4 Sequential Monte Carlo Methods

325

N,1 N,M N Draw Ik+1 , . . . , Ik+1 N conditionally independently given Fk , with distribution N,j k N,i N i, j = 1, . . . , MN , P(Ik+1 = j | Fk ) = MN N,j , j=1 k N,I N,i and set k = k k+1 . N,1 N,M N N Mutation: Draw k+1 , . . . , k+1 N conditionally independently given Fk = Fk N,1 N,MN (Ik+1 , . . . , Ik+1 ), with distribution
N,i

T u ( N,i , A) N,i N P(k+1 A | Fk ) = k k = u N,i Tk (k , X)

A X

N,i Q(k , dx ) gk+1 (x ) . Q( N,i , dx ) gk+1 (x )


k

9.4.2.1 Consistency and Asymptotic Normality Theorem 9.4.7. Assume 9.4.1, 9.4.2, and 9.4.3. Then the following holds true.
N,i (i) If {(0 , 1)}1iMN is consistent for (,0 , L1 (X, ,0 )) then for any k > N,i 0, {(k , 1)}1iMN is consistent for (,k , L1 (X, ,k )). N,i (ii) If {(0 , 1)}1iMN is asymptotically normal for (,0 , L2 (X, ,0 ), 0 , 1/2 N,i {MN }), then for any k > 0, {(k , 1)}1iMN is asymptotically normal 1/2 for (,k , L2 (X, ,k ), k , {MN }), where the sequence {k } of functions is dened recursively by 2 k+1 (f ) = Var,k+1 (f ) + 2 u k {Tk [f ,k+1 (f )]} u [,k Tk (X)] 2

f L2 (X, ,k+1 ) . (9.42)

Proof. Again the proof is by induction. Hence assume that for some k 0, N,i {(k , 1)}1iMN is consistent for (,k , L1 (X, ,k )). To carry the induction hypothesis from k to k + 1, we shall employ Theorem 9.3.13 and thus need to check its underlying assumptions. Assumption 9.3.2 was veried in the proof of Theorem 9.4.5, and (9.3.9) is Assumption 9.4.1(ii). What remains to check is that consistency holds over the whole of L1 (X, ,k+1 ), and for that u we must verify that for every f in this space, the function Tk (|f |) belongs to 1 L (X, ,k ). This was also done in the proof of Theorem 9.4.5. Hence assertion (i) follows from Theorem 9.3.13 and induction. We proceed to part (ii), modify the induction hypothesis accordingly, and use Theorem 9.3.15 to propagate it from k to k +1. The additional assumption we then need to verify is Assumption 9.3.14, which follows from the induction u hypothesis and the bound Tk (x, X) gk+1 . Finally, we establish that asymptotic normality applies over L2 (X, ,k+1 ), which amounts to verifying u that for any f L2 (X, ,k+1 ), the function Tk (|f |) belongs to L2 (X, ,k ) and u 2 1 the function Tk (f ) belongs to L (X, ,k+1 ). The rst of these requirements

326

9 Analysis of SMC Methods

is part of the proof of Theorem 9.4.5, and the proof of the second requirement is entirely analogous. Assertion (ii) now follows from Theorem 9.3.15 and induction. It is worthwhile to note that the asymptotic variance of the i.i.d. sampling algorithm is always lower than that of SISR, whatever choice of instrumental kernel for the latter. This indicates that whenever possible, i.i.d. sampling should be preferred. By iterating (9.42), one can obtain an analytic expression for the asymptotic variance.
N,i Proposition 9.4.8. Assume 9.4.1 and 9.4.3 and that {(0 , 1)}1iMN is 1/2 asymptotically normal for (,0 , L2 (X, ,0 ), 0 , {MN }). Then for any k 0 2 and f L (X, ,k ), k 2 k (f ) = l=1 u Var,l Tlu Tk1 [f ,k (f )] u ,l Tlu Tk1 (X) 2 2 u u 0 T0 Tk1 [f ,k (f )] u u ,0 T0 Tk1 (X) 2

where, by convention Tiu Tju (x, A) is the identity transition kernel x (A) for i > j. Proof. The proof is by induction on k. The result holds true for k = 0. Assume now that the result holds true for some k 0. We evaluate the right-hand 2 side of (9.42) with the claimed formula for k . Doing this, we rst note that u u u u Tk [f ,k+1 (f )] ,k Tk [f ,k+1 (f )] = Tk [f ,k+1 (f )], because ,k Tk equals ,k+1 up to a multiplicative constant. Thus the right-hand side of (9.42) evaluates to
k u Var,l {Tlu Tk [f ,k+1 (f )]} u ,l Tlu Tk1 (X) 2 u [,k Tk (X)] 2

Var,k+1 (f ) +
l=1

2 u u 0 {T0 Tk [f ,k (f )]} u u ,0 T0 Tk1 (X) 2 u [,k Tk (X)] 2

2 Comparing this with the claimed expression for k+1 (f ), we see that what remains to verify is that the denominators of the above ratios equal the square u of ,l Tlu Tk (X). To do that, we observe that the denition of the ltering distributionsee for instance (3.13)shows that for any l k 1,

9.4 Sequential Monte Carlo Methods


k

327

,k (h) = L1 ,k

(dx0 ) g0 (x0 )
i=1 k

Q(xi1 , dxi ) gi (xi )h(xk )

= L,l L1 ,k =

,l (dxl )
i=l+1

Q(xi1 , dxi ) gi (xi )h(xk )

u ,l Tlu Tk1 f . u ,l Tlu Tk1 (X)

u u u u Setting h = Tk (X) yields [,k Tk (X)] ,l Tlu Tk1 (X) = ,l Tlu Tk (X). The proof now follows by induction.

The expression for the asymptotic variance is rather involved, and it is difcult in general to make simple statements on this quantity. There is however a situation in which some interesting conclusions can be drawn. Consider the following assumption (cf. Lemma 4.3.25). Assumption 9.4.9. There exist positive constants and + and a probability distribution such that 0 < (A) Q(x, A) + (A) < for all x X and A X . Also recall the notation = 1 / + . Under this condition, it has been shown that the posterior chain is uniformly geometrically mixing, that is, it forgets its initial condition uniformly and at a geometric (or exponential) rate. Exponential forgetting allows us to prove that the asymptotic variance of the selection/mutation algorithm remains bounded. Proposition 9.4.10. Assume 9.4.1, 9.4.3, and 9.4.9. Then for any f 2 2 Fb (X), it holds that supk0 k (f ) < , where k is dened in (9.42). Proof. Consider the numerators of the ratios of the expression for k in Proposition 9.4.8. Proposition 3.3.2 shows that for any integers l < k,
u Tlu Tk1 (x, A) = l|k (x)Fl|k Fk1|k (x, A) , def

x X, A X ,

where the Fl|k are forward smoothing kernels (see Denition 3.3.1) and l|k is the backward function (see Denition 3.1.6). Therefore
u u Tlu Tk1 f (x) Tlu Tk1 (x, X),k (f )

= l|k (x) Fl|k Fk1|k f (x) ,k (f ) . (9.43) Next we consider the denominators of the expression for k . We have k u ,l Tlu Tk1 (X) = ,l (l|k ) = j=l+1 c,j , where the rst equality follows from the above and the second one from Proposition 3.2.5, and where the constants c,j are dened recursively in (3.22). Moreover, by (3.26) L,k = k j=0 c,j , and hence

328

9 Analysis of SMC Methods


u ,l Tlu Tk1 (X) =

L,k . L,l

(9.44)

Combining (9.43) and (9.44) yields for any integers l k,


u Var,l Tlu Tk1 [f ,k (f )] u ,l Tlu Tk1 (X) 2

= Var,l

l|k

L,l {Fl|k Fk1|k f ,k (f )} L,k

. (9.45)

In order to bound this variance, we rst notice that Lemma 4.3.22(ii) shows that Q(x, dx ) gl+1 (x )l+1|k (x ) 1 + = . 1 ,l (dx) Q(x, dx ) gl+1 (x )l+1|k (x ) (9.46) Next, Proposition 3.3.4 shows that ,k (f ) = ,l|k Fl|k Fk1|k f , where ,l|k is a smoothing distribution. In addition, by Lemma 4.3.22 again, for any probability measures and on (X, X ), l|k (x) L,l = L,k Fl|k Fk1|k Fl|k Fk1|k
TV

kl

TV

Applying this bound with = x and = ,l|k shows that |Fl|k Fk1|k (x, f ) ,l|k (f )| 2kl f Finally, combining with (9.45) and (9.46) shows that
u Var,l Tlu Tk1 [f ,k (f )] u ,l Tlu Tk1 (X) 2

4(1 )2 2(kl) f

This bound together with Proposition 9.4.8 completes the proof. 9.4.2.2 Exponential Inequalities The induction argument previously used for the central limit theorem may also be used to derive exponential inequalities for the tail probabilities. Theorem 9.4.11. Assume 9.4.1 and that there exist some constants a(0) and b(0) such that for any t 0 and f Fb (X),
MN

1 MN i=1

N,i f (0 ) ,0 (f ) t a(0) exp

2MN t2 b(0)2 osc2 (f )

(9.47)

Then for any k > 0, t > 0 and f Fb (X),

9.4 Sequential Monte Carlo Methods


MN

329

1 MN i=1

N,i f (k ) ,k (f ) t a(k) exp

2MN t2 b(k)2 osc2 (f )

, (9.48)

where the constants a(k) and b(k) are dened recursively through a(k + 1) = 2 (1 + a(k)) , b(k + 1) =
u (3/2) gk+1 b(k) + ,k Tk (X) . u ,k Tk (X)

Proof. The proof is by induction; assume that the claim is true for some k 0. MN N,i 1 N N Decompose MN k=1 f (k+1 ) ,k+1 (f ) in two terms Ak+1 (f ) + Bk+1 (f ), where
MN 1 AN (f ) = MN k+1 i=1 MN 1 N Bk+1 (f ) = MN N,i N E[f (k+1 ) | Fk ] ,k+1 (f ) u ,k Tk f u (X) . ,k Tk N,i N,i N (f (k+1 ) E[f (k+1 ) | Fk ])

i=1 MN N,i u k=1 Tk f (k ) MN u N,i k=1 Tk (k , X)

Proceeding like in Theorem 9.2.16, for any a (0, 1) and t 0, P(|AN (f )| at) 2 exp 2a2 t2 MN / osc2 (f ) . k+1 (9.49)

N N We now bound Bk+1 (f ). First note rst for any constant c, Bk+1 (f ) = N Bk+1 (f c). We choose c in such a way that f c = (1/2) osc (f ) and set f = f c. Writing 1 MN Mn u N,i u i=1 {Tk f (k ) ,k Tk f } u ,k Tk (X) MN u N,i MN 1 u N,i i=1 Tk f (k ) MN i=1 {Tk (k , X) u MN u N,i ,k Tk (X) i=1 Tk (k , X)

N Bk+1 (f ) =

u ,k Tk (X)}

(9.50)

and using the induction assumption, it holds that for any b (0, 1),
N P |Bk+1 (f )| (1 a)t a(k) exp u 2MN (1 a)2 b2 t2 [,k Tk (X)]2 u b2 (k) osc2 Tk f

+ a(k) exp

u 2MN (1 a)2 (1 b)2 t2 [,k Tk (X)]2 2 osc2 (T u 1) b2 (k) f k

By Lemma 4.3.4, for any (x, x ) X X,

330

9 Analysis of SMC Methods


u u |Tk f (x) Tk f (x )| = |Q(x, gk+1 f ) Q(x , gk+1 f )|

(1/2) Q(x, ) Q(x , ) gk+1 and similarly,

TV

osc gk+1 f

osc (f ) ,

u u |Tk (x, X) Tk (x , X)| = |Q(x, gk+1 ) Q(x , gk+1 )| gk+1

u u Thus, osc Tk f and osc (Tk 1) are bounded by gk+1 osc (f ) and gk+1 , respectively. The result follows by choosing b = 2/3 as in the proof of TheoN rem 9.1.10 and then setting a to equate the bounds on AN (f ) and Bk+1 (f ). k+1

The bound is still of Hoeding type, but at each iteration the constants a(k) and b(k) increase. Hence, the obtained bound is almost useless in practice for large k, except when the number of iterations is small or the number of particles is large (compared to the iteration index). It would of course be more appropriate to derive an exponential bound with constants that do not depend on the iteration index. Such results hold true when Q satises the strong mixing condition. Theorem 9.4.12. Assume 9.4.1, 9.4.9, and (9.47). Then there exist constants a and b such that for any n 0, t 0 and f Fb (X),
MN

1 MN i=1

N,i f (n ) ,n (f ) t a exp

2MN t2 b2 osc2 (f )

MN 1 N N,i Proof. Dene N = MN k i=1 k . The dierence n (f ) ,n (f ) may be expressed as the telescoping sum u N T u Tn1 f ,n (f )+ N (f ) ,n (f ) = 0 0 n N T u T u (X) 0 0 n1 n u u u N T u Tn1 f N Tk Tn1 f k k1 k1 u u u u N Tk Tn1 (X) N Tk1 Tn1 (X) k k1

, (9.51)

k=1

u u with the convention that Tk Tn1 is the identity mapping when k = n. We shall show that the tail probabilities of each of the terms on the right-hand side of (9.51) are exponentially small. Put u u N T0 Tn1 f 0 ,n (f ) N T u T u (X) 0 0

AN (f ) = n =

(9.52) ,n (f )} , (9.53)

n1 MN N,i N,i i=1 0|n (0 ){F0|n Fn1|n f (0 ) MN N,i i=1 0|n (0 )

9.4 Sequential Monte Carlo Methods

331

u u where ,n (f ) could also be rewritten as ,0 T0 Tn1 (f ) (see Section 3.3.1). Thus by Lemma 4.3.4 and Proposition 4.3.23(i),

F0|n Fn1|n (, f ) ,n (f ) and

n osc (f )

(9.54) (9.55) (9.56)

osc F0|n Fn1|n (, f ) n osc (f ) . In addition ,0 (0|n ) ,0 (0|n ) + =1, osc 0|n () 2 0|n ()

where Lemma 4.3.22(ii) was used for the second inequality. Writing AN (f ) n =
1 MN MN i=1 MN i=1 N,i N,i 0|n (0 ){F0|n Fn1|n f (0 ) ,n (f )} ,0 (0|n )

N,i N,i 0|n (0 ){F0|n Fn1|n f (0 ) ,n (f )} MN i=1 N,i 0|n (0 ) 1 1 MN MN i=1 N,i 0|n (0 ) ,0 (0|n )

we have, using (9.54) and the triangle inequality,


1 AN (f ) MN n MN i=1 N,i N,i 0|n (0 ){F0|n Fn1|n f (0 ) ,n (f )} ,0 (0|n ) 1 + n osc (f ) MN MN i=1 N,i 0|n (0 ) ,0 (0|n ) . ,0 (0|n )

Using (9.56) as well as (9.47) twice (for the functions F0|n Fn1|n f and 0|n ) shows that for any t 0, P |AN (f )| t 2a(0) exp n For 1 k n, put N (f ) = k,n
u u u N T u Tn1 f N Tk Tn1 f k k1 k1 . u u u u N Tk Tn1 (X) N Tk1 Tn1 (X) k k1

MN t2 (1 )2 2b2 (0) osc2 (f ) 2n

(9.57)

(9.58)

u u Proposition 3.3.2 shows that Tk Tn1 (x, A) = k|n (x)Fk|n Fn1|n (x, A). Pick x0 X. Then u u N Tk Tn1 f k Fk|n Fn1|n (x0 ) = N T u T u (X) k k n1 MN i=1 N,i N,i k|n (k )k|n (k ) MN i=1 N,i k|n (k )

(9.59)

332

9 Analysis of SMC Methods

where k|n (x) = Fk|n Fn1|n f (x) Fk|n Fn1|n f (x0 ). Set N = k Then N k|n
u N Tk1 k1 u N Tk1 (X) k1

and N (A) = k|n

A X

N (dx) k|n (x) k . N (dx) k|n (x)


k

N , with Radon-Nikodym derivative k dN k|n k|n (x) (x) = N . N (k|n ) dk k

Using these notations,


u u N Tk1 Tn1 f k1 Fk|n Fn1|n f (x0 ) = N T u T u (X) k1 k1 n1

N k|n {Fk|n Fn1|n f Fk|n Fn1|n f (x0 )} k = N (k|n ) . (9.60) k|n N (k|n )
k

Combining (9.59) and (9.60), we may express N (f ) as k,n N (f ) = k,n


N MN d k|n N,i N,i i=1 dN (k )k|n (k ) k N MN d k|n N,i i=1 dN (k ) k

N (k|n ) . k|n

N,i N Because {k }1iMN are conditionally i.i.d. given Fk1 with common distribution N , the rst term in the above expression may be seen as an imk portance sampling estimator of N (k|n ). By Lemma 4.3.22(ii), the Radonk|n N N Nikodym derivative d /d (x) is bounded uniformly in k, N and x as k|n k

dN k|n + 1 . (x) = N 1 dk Proceeding as above, the Hoeding inequality implies that for any t 0, P N (f ) t 2 exp k,n MN t2 (1 )2 2 osc2 (f ) 2(nk) .

Hence the probability that the sum on the right-hand side of (9.51) is (in absolute value) at least t is bounded by
n1

2
k=0

exp

MN t2 (1 )2 b2 k 2 osc2 (f ) 2k

(9.61)

for any sequence {bk }0kn1 of positive numbers summing to one. To obtain a bound that does not depend on n, take bk = k (1)/(1n ) with < < 1. This choice proves that (9.61) is bounded by

9.5 Complements

333

a exp

MN t2 (1 )2 (1 2 ) 2 osc2 (f )

where a is a constant that depends only on and .

9.5 Complements
9.5.1 Weak Limits Theorems for Triangular Array This section summarizes various basic results on the asymptotics of triangular arrays that are used in the proofs of this chapter. 9.5.1.1 Law of Large Numbers Throughout this section, {MN }N 0 denotes a sequence of integers. All random variables are assumed to be dened on a common probability space (, F, P). Proposition 9.5.1. Let {UN,i }1iMN be a triangular array of random variables and let {F N }N 0 be a sequence of sub--elds of F. Assume that the following conditions hold true. (i) The triangular array is conditionally independent given {F N } and for any N and i = 1, . . . , MN , E[|UN,i | | F N ] < and E[UN,i | F N ] = 0. (ii) For some positive ,
MN 2 E[UN,i 1{|UN,i |< } | F N ] 0 , P

(9.62)

i=1 MN

E[|UN,i |1{|UN,i | } | F N ] 0 .
P

(9.63)

i=1

Then

MN

UN,i 0 .
i=1

Proof. Consider the truncated random variable UN,i = UN,i 1{|UN,i |< } . Using N (9.63) and E[UN,i | F ] = 0, we nd that
MN

E[UN,i | F N ] 0 .
i=1

(9.64)

By Chebyshevs inequality, it follows that for any > 0,

334

9 Analysis of SMC Methods


MN MN

AN () = P
i=1

UN,i
i=1 MN

E[UN,i | F N ] F N

2 Var
i=1

UN,i F N

and hence (9.62) shows that AN () 0 in probability. Because AN () is obviously bounded, we also have E[AN ()] 0, that is,
MN MN

UN,i
i=1 i=1

E[UN,i | F N ] 0 .

(9.65)

Moreover, for any > 0,


MN MN MN

P
i=1

UN,i
i=1

UN,i F N

P
i=1 MN

|UN,i |1{|UN,i |

FN
P

1
i=1

E[|UN,i |1{|UN,i | } | F N ] 0 .

MN MN Thus, i=1 UN,i i=1 UN,i 0 in probability. Combining with (9.64) and (9.65), the proof is complete.

Denition 9.5.2 (Bounded in Probability). A sequence {ZN }N 0 of random variables is said to be bounded in probability if
C N 0

lim sup P(|ZN | C) = 0.

Often the term tight, or asymptotically tight, is used instead of bounded in probability. We recall without proof the following elementary properties. Lemma 9.5.3. 1. Let {UN }N 0 and U be random variables. If UN U , then {UN } is bounded in probability. 2. Let {UN }N 0 and {VN }N 0 be two sequences of random variables. If {VN } is bounded in probability and |UN | |VN | for any N , then {UN } is bounded in probability. 3. Let {UN }N 0 and {VN }N 0 be two sequences of random variables. If {UN }N 0 is bounded in probability and VN 0 in probability, then UN VN 0 in probability. 4. Let {UN }N 0 be a sequence of random variables and let {MN }N 0 be a non-decreasing deterministic sequence diverging to innity. If {UN } is bounded in probability, then 1{UN MN } 0 in probability.
D

9.5 Complements

335

The following elementary lemma is repeatedly used in the sequel. Lemma 9.5.4. Let {UN }N 0 and {VN }N 0 be two sequences of random variables such that {VN } is bounded in probability. Assume that for any positive there exists a sequence {WN ()}N 0 of random variables such that WN () 0 as N and |UN | VN + WN () . Then UN 0. Proof. For any > 0, P(|UN | ) P[VN /(2)] + P[WN () /2] . This implies that for any > 0, lim sup P(|UN | ) sup P[VN /(2)] .
N N 0 P P

Because the right-hand side can be made arbitrarily small by letting 0, the result follows. Proposition 9.5.5. Let {UN,i }1iMN be a triangular array of random variables and let {F N }N 0 be a sequence of sub--elds of F. Assume that the following conditions hold true. (i) The triangular array is conditionally independent given {F N } and for any N and i = 1, . . . , MN , E[|UN,i | | F N ] < and E[UN,i | F N ] = 0. (ii) The sequence of random variables
MN

E[|UN,i | | F N ]
i=1 N 0

(9.66)

is bounded in probability. (iii) For any positive ,


MN

E[|UN,i |1{|UN,i |} | F N ] 0 .
P

(9.67)

i=1

Then

MN

UN,i 0 .
i=1

Proof. We employ Proposition 9.5.1 and then need to check its condition (ii). The current condition (iii) is (9.63), so it suces to prove that (9.62) holds for some (arbitrary) > 0. To do that, note that for any (0, ),

336

9 Analysis of SMC Methods


MN 2 E[UN,i 1{|UN,i |< } | F N ] MN MN

i=1

i=1

2 E[UN,i 1{|UN,i |<} | F N ] + MN

2 E[UN,i 1{|UN,i |< } | F N ]

i=1

MN

i=1

E[|UN,i | | F N ] +
i=1

E[|UN,i |1{|UN,i |} | F N ] .

Now (9.62) follows from Lemma 9.5.4. In the special case where the random variables {UN,i }1iMN , for any N , are conditionally i.i.d. given {F N }, Proposition 9.5.5 admits a simpler formulation. Corollary 9.5.6. Let {VN,i }1iMN be a triangular array of random variables and let {F N }N 0 be a sequence of sub--elds of F. Assume that the following conditions hold true. (i) The triangular array is conditionally i.i.d. given F N and for any N , E[|VN,1 | | F N ] < and E[VN,1 | F N ] = 0. (ii) The sequence {E[|VN,1 | | F N ]}N 0 is bounded in probability. (iii) For any positive , E[|VN,1 |1{|VN,1 |MN } | F N ] 0 in probability. Then
1 MN i=1 MN P

VN,i 0 .

Proposition 9.5.7. Let {VN,i }1iMN be a triangular array of random variables and let {F N } be a sequence of sub--elds of F. Assume that the following conditions hold true. (i) The triangular array is conditionally independent given {F N } and for any N and i = 1, . . . , MN , E[|VN,i | | F N ] < . MN (ii) The sequence { i=1 E[|VN,i | | F N ]}N 0 is bounded in probability, (iii) For any positive ,
MN

E[|VN,i |1{|VN,i | } | F N ] 0 .
P

(9.68)

i=1

Then

MN

{VN,i E[VN,i | F N ]} 0 .
i=1

Proof. We check that the triangular array UN,i = VN,i E[VN,i | F N ] satises conditions (i)(iii) of Proposition 9.5.5. This triangular array is conditionally

9.5 Complements

337

independent given F N , and for any N and any i = 1, . . . , MN , E[|UN,i | | F N ] 2 E[|VN,i | | F N ] < and E[UN,i | F N ] = 0, showing condition (i). In addition
MN MN

E[|UN,i | | F N ] 2
i=1 M i=1

E[|VN,i | | F N ] ,

N showing that the sequence { i=1 E[|UN,i | | F N ]}N 0 is bounded in probability. Hence condition (ii) holds. We now turn to the nal condition of Proposition 9.5.5, (9.67). With the bounds |UN,i | |VN,i | + E[|VN,i | | F N ] and 1{|UN,i | } 1{|VN,i | /2} + 1{E[|VN,i | | F N ] /2} and in view of the assumed condition (iii), it suces to prove that for any positive ,

MN

AN =
i=1 MN

E[|VN,i | | F N ] P(|VN,i | | F N ) 0 , E[|VN,i | | F N ]1{E[|VN,i | | F N ] 0 .


P

(9.69)

BN =
i=1

(9.70)

Bound AN as
MN

AN P

1iMN

max |VN,i |

FN
i=1

E[|VN,i | | F N ] .

Considering the assumed condition (ii), it is sucient to prove that the conditional probability of the display tends to zero in probability. To do that, notice that
MN 1iMN

max |VN,i | /2 +
i=1

|VN,i |1{|VN,i |

/2}

whence, using condition (iii),


MN

1iMN

max |VN,i |

P
i=1

|VN,i |1{|VN,i |

/2}

/2 F N | F N ] 0 .
P

MN

(2/ )
i=1

E[|VN,i |1{|VN,i |

/2}

Thus (9.69) holds. Now bound BN as BN 1{max1iMN E[ |VN,i | | F N ]


MN } i=1

E[|VN,i | | F N ] .

To show that BN 0 in probability, it is again sucient to prove that so does the rst factor. In a similar fashion as above we have

338

9 Analysis of SMC Methods

1iMN

max E[|VN,i | | F N ]

MN

E[|VN,i |1{|VN,i |

/2}

| F N ] /2 | F N ] 0 .
P

i=1 MN

(2/ )
i=1

E[|VN,i |1{|VN,i |

/2}

Thus (9.70) holds. By combining (9.68), (9.69), and (9.70) we nd that (9.67) holds, concluding the proof.

9.5.1.2 Central Limit Theorems Lemma 9.5.8. Let z1 , . . . , zm and z1 , . . . , zm be complex numbers of modulus at most 1. Then
m

|z1 zm z1 zm |
i=1

|zi zi | .

Proof. This follows by induction from z1 zm z1 zm = (z1 z1 )z2 zm + z1 (z2 zm z1 zm ) .

In the investigation of the central limit theorem for triangular arrays, the so-called Lindeberg condition plays a fundamental role. Proposition 9.5.9. Let {UN,i }1iMN be a triangular array of random variables and let {F N }N 0 be a sequence of sub--elds of F. Assume that the following conditions hold true. (i) The triangular array is conditionally independent given {F N } and for 2 any N and i = 1, . . . , MN , E[UN,i | F N ] < , and E[UN,i | F N ] = 0. 2 2 (ii) There exists a positive constant 2 such that with N,i = E[UN,i | F N ],
MN 2 N,i 2 . i=1 P

(9.71)

(iii) For all

> 0,
MN 2 E[UN,i 1{|UN,i | } | F N ] 0. P

(9.72)

i=1

9.5 Complements

339

Then for any real u,


MN

E exp

iu
i=1

UN,i

F N exp 2 u2 /2 .

(9.73)

Remark 9.5.10. The condition (9.72) is often referred to as the Lindeberg condition. If this condition is satised, then the triangular array also satises 2 the uniform smallness condition, max1iMN E[UN,i | F N ] 0 in probability. Indeed, for any > 0,
2 2 2 N,i = E[UN,i 1{|UN,i |< } | F N ] + E[UN,i 1{|UN,i | } | F N ] 2 + E[UN,i 1{|UN,i | } | F N ] ,

which implies that

MN 1iMN 2 max E[UN,i | F N ] 2

+
i=1

2 E[UN,i 1|UN,i | | F N ] .

Because is arbitrary, the uniform smallness condition is satised. The Lindeberg condition guarantees that large values (of the same order as the square root of the variance of the sum) have a negligible inuence in the central limit theorem. Such extremely large values have a small inuence both on the variance and on the distribution of the sum we investigate. Proof (of Proposition 9.5.9). The proof is adapted from Billingsley (1995, P N 2 Theorem 27.1). Because i=1 N,i 2 ,
MN

exp (u2 /2)


i=1

2 N,i

exp 2 u2 /2 .

Thus it suces to prove that


MN

E exp iu
i=1

UN,i

F N exp

u2 2

MN 2 N,i i=1

0 .

(9.74)

To start with, using the conditional independence of the triangular array and Lemma 9.5.8, it follows that the left-hand side of this display is bounded by
MN 2 E[exp (iuUN,i ) | F N ] exp(u2 N,i /2) . i=1

From here we proceed in two steps, showing that both


MN

AN =
i=1

2 E[exp (iuUN,i ) | F N ] (1 u2 N,i /2) 0

340

9 Analysis of SMC Methods


MN

and BN =

2 2 exp(u2 N,i /2) (1 u2 N,i /2) 0 . i=1

These two result suce to nish the proof. Now, by Taylors inequality, 1 eitx 1 + itx t2 x2 2 min |tx|2 , |tx|3 ,

so that the characteristic function of UN,i satises


2 E[exp(iuUN,i ) | F N ] (1 u2 N,i /2) E[min(|uUN,i |2 , |uUN,i |3 ) | F N ] .

Note that this expectation is nite. For positive , the right-hand side of the inequality is at most E[|uUN,i |3 1{|UN,i |< } | F N ] + E[|uUN,i |2 1{|UN,i | } | F N ]
2 |u|3 N,i + u2 E[|UN,i |2 1{|UN,i | } | F N ] .

Summing up the right-hand side over 1 i MN , using the assumed conditions (ii) and (iii) and recalling that was arbitrary, we nd that AN 0 in probability. We now turn to BN . For positive x, |ex 1 + x| x2 /2. Thus BN u4 8
MN 4 N,i i=1

u4 8

MN 1iMN 2 max N,i i=1 2 N,i .

Here the sum on the right-hand side converges in probability and, as remarked above, the maximum tends to zero in probability (the uniform smallness condition). Thus BN 0 in probability and the proof is complete. In the special case where the random variables {UN,i }1iMN , for any N , are conditionally i.i.d. given {F N }, Proposition 9.5.9 admits a simpler formulation. Corollary 9.5.11. Let {VN,i }1iMN be a triangular array of random variables and let {F N }N 0 be a sequence of sub--elds of F. Assume that the following conditions hold true. (i) The triangular array is conditionally i.i.d. given {F N } and for any N , 2 E[VN,1 | F N ] < and E[VN,1 | F N ] = 0.
2 (ii) There exists a positive constant 2 such that E[VN,1 | F N ] 2 . 2 (iii) For any positive , E[VN,1 1{|VN,1 | MN } P

| F N ] 0.

Then for any real u,


MN

E exp

iuMN

1/2 i=1

VN,i

F N exp( 2 u2 /2) .

(9.75)

9.5 Complements

341

Proposition 9.5.12. Let {VN,i }1iMN be a triangular array of random variables and let {F N }N 0 be a sequence of sub--elds of F. Assume that the following conditions hold true. (i) The triangular array is conditionally independent given {F N } and for 2 any N and i = 1, . . . , MN , E[VN,i | F N ] < . (ii) There exists a constant 2 > 0 such that
MN 2 {E[VN,i | F N ] (E[VN,i | F N ])2 } 2 . i=1 P

(iii) For all

> 0,
MN 2 E[VN,i 1{|VN,i | } | F N ] 0 . P

i=1

Then for any real u,


MN

E exp iu
i=1

{VN,i E[VN,i | F N ]}

F N exp((u2 /2) 2 ) .

Proof. We check that the triangular array UN,i = VN,i E[VN,i | F N ] satises conditions (i)(iii) of Proposition 9.5.9. This triangular array is conditionally independent given {F N } and by construction E[UN,i | F N ] = 0 and 2 2 E[UN,i | F N ] = E[VN,i | F N ] {E[VN,i | F N ]}2 . Therefore, conditions (i) and (ii) are fullled. It remains to check that for any > 0, (9.72) holds true. By Jensens inequality,
2 2 1{|UN,i | } 1{VN,i 2 /4} + 1{E[VN,i | F N ] 2 /4} ,

2 2 2 UN,i 2(VN,i + E[VN,i | F N ]) ,

so that the left-hand side of (9.72) is bounded by


MN

2
i=1

2 2 E[VN,i 1{VN,i

MN
2 /4}

| FN] + 2
i=1 MN

2 2 E[VN,i | F N ] P(VN,i

/4 | F N )

+4
i=1

2 2 E[VN,i | F N ]1{E[VN,i | F N ]

2 /4}

The proof is concluded using the same arguments as in the proof of Proposition 9.5.7. Theorem 9.5.13. Let { N,i }1iMN be a triangular array of X-valued random variables, let {F N }N 0 be a sequence of sub--elds of F, and let f be a real-valued function on X. Assume that the following conditions hold true. (i) The triangular array is conditionally independent given {F N } and for any N and i = 1, . . . , MN , E[f 2 ( N,i ) | F N ] < ,

342

9 Analysis of SMC Methods

(ii) There exists a constant 2 > 0 such that


MN 1 MN i=1

{E[f 2 ( N,i ) | F N ] (E[f ( N,i ) | F N ])2 } 2 .

(iii) There exists a probability measure on (X, X ) such that f L2 (X, ) and for any positive C,
MN 1 MN i=1

E[f 2 ( N,i )1{|f (N,i )|C} | F N ] (f 2 1{|f |C} ) .


P

Then for any real u,


MN 1/2 i=1

E exp iuMN

f ( N,i ) E[f ( N,i ) | F N ]


P

FN

exp( 2 u2 /2) . (9.76) Proof. Set VN,i = MN f ( N,i ). We prove the theorem by checking conditions (i)(iii) of Proposition 9.5.12. Of these conditions, the rst two are immediate, so it remains to verify the Lindeberg condition (iii). Pick > 0. Then for any positive C
MN 2 E[VN,i 1{|VN,i | } | F N ] MN 1 MN i=1 1/2

i=1

E[f 2 ( N,i )1{|f (N,i )|C} | F N ] (f 2 1{|f |C} ) ,


P

where the inequality holds for suciently large N . Because f L2 (X, ) the right-hand side of this display tends to zero as C , so that the Lindeberg condition is satised.

9.5.2 Bibliographic Notes Convergence of interacting particle systems has been considered by many authors in the last decade, triggered by the seminal papers of Del Moral (1996, 1998). Most of the results presented in this chapter have already appeared in the literature, perhaps in a slightly dierent form. We have focused here on the most elementary convergence properties, the law of large numbers, and the central limit theorem. More sophisticated convergence results are available, covering for instance large deviations (Del Moral and Guionnet, 1998), empirical processes (Del Moral and Ledoux, 2000), propagation of chaos (Del

9.5 Complements

343

Moral and Miclo, 2001), and rate of convergence in the central limit theorem. The ultimate reference for convergence analysis of interacting particle systems is Del Moral (2004), which summarizes most of these eorts. An elementary but concise survey of available results is given in Crisan and Doucet (2002). The approach developed here has been inspired by Knsch (2003). u

Part II

Parameter Inference

10 Maximum Likelihood Inference, Part I: Optimization Through Exact Smoothing

In previous chapters, we have focused on structural results and methods for HMMs, considering in particular that the models under consideration were always perfectly known. In most situations, however, the model cannot be fully specied beforehand, and some of its parameters need to be calibrated based on observed data. Except for very simplistic instances of HMMs, the structure of the model is suciently complex to prevent the use of direct estimators such as those provided by moment or least squares methods. We thus focus in the following on computation of the maximum likelihood estimator. Given the specic structure of the likelihood function in HMMs, it turns out that the key ingredient of any optimization method applicable in this context is the ability to compute smoothed functionals of the unobserved sequence of states. Hence the methods discussed in the second part of the book for evaluating smoothed quantities are instrumental in devising parameter estimation strategies. This chapter only covers the class of HMMs discussed in Chapter 5, for which the smoothing recursions described in Chapters 3 and 4 may eectively be implemented on computers. For such models, the likelihood function is computable, and hence our main task will be to optimize a possibly complex but entirely known function. The topic of this chapter thus relates to the more general eld of numerical optimization. For models that do not allow for exact numerical computation of smoothing distributions, this chapter provides a framework from which numerical approximations can be built. Those will be discussed in Chapter 11.

10.1 Likelihood Optimization in Incomplete Data Models


To describe the methods as concisely as possible, we adopt a very general viewpoint in which we only assume that the likelihood function of interest may be written as the marginal of a higher dimensional function. In the terminology introduced by Dempster et al. (1977), this higher dimensional function is

348

10 Maximum Likelihood Inference, Part I

described as the complete data likelihood; in this framework, the term incomplete data refers to the actual observed data while the complete data is a (not fully observable) higher dimensional random variable. In Section 10.2, we will exploit the specic structure of the HMM, and in particular the fact that it corresponds to a missing data model in which the observations simply are a subset of the complete data. We ignore these specics for the moment however and consider the general likelihood optimization problem in incomplete data models. 10.1.1 Problem Statement and Notations Given a -nite measure on (X, X ), we consider a family {f (; )} of nonnegative -integrable functions on X. This family is indexed by a parameter , where is a subset of Rd (for some integer d ). The task under consideration is the maximization of the integral L() =
def

f (x ; ) (dx)

(10.1)

with respect to the parameter . The function f ( ; ) may be thought of as an unnormalized probability density with respect to . Thus L() is the normalizing constant for f ( ; ). In typical examples, f ( ; ) is a relatively simple function of . In contrast, the quantity L() usually involves highdimensional integration and is therefore suciently complex to prevent the use of simple maximization approaches; even the direct evaluation of the function might turn out to be non-feasible. In Section 10.2, we shall consider more specically the case where f is the joint probability density function of two random variables X and Y , the latter being observed while the former is not. Then X is referred to as the missing data, f is the complete data likelihood, and L is the density of Y alone, that is, the likelihood available for estimating . Note however that thus far, the dependence on Y is not made explicit in the notation; this is reminiscent of the implicit conditioning convention discussed in Section 3.1.4 in that the observations do not appear explicitly. Having sketched these statistical ideas, we stress that we feel it is actually easier to understand the basic mechanisms at work without relying on the probabilistic interpretation of the above quantities. In particular, it is not required that L be a likelihood, as any function satisfying (10.1) is a valid candidate for the methods discussed here (cf. Remark 10.2.1). In the following, we will assume that L() is positive, and thus maximizing L() is equivalent to maximizing () = log L() .
def

(10.2)

In a statistical setting, is the log-likelihood. We also associate to each function f ( ; ) the probability density function p( ; ) (with respect to the dominating measure ) dened by

10.1 Likelihood Optimization in Incomplete Data Models

349

p(x ; ) = f (x ; )/L() .

def

(10.3)

In the statistical setting sketched above, p(x; ) is the conditional density of X given Y . 10.1.2 The Expectation-Maximization Algorithm The most popular method for solving the general optimization problem outlined above is the EM (for expectation-maximization) algorithm introduced, in its full generality, by Dempster et al. (1977) in their landmark paper. Given the literature available on the topic, our aim is not to provide a comprehensive review of all the results related to the EM algorithm but rather to highlight some of its key features and properties in the context of hidden Markov models. 10.1.2.1 The Intermediate Quantity of EM The central concept in the framework introduced by Dempster et al. (1977) is an auxiliary function (or, more precisely, a family of auxiliary functions) known as the intermediate quantity of EM. Denition 10.1.1 (Intermediate Quantity of EM). The intermediate quantity of EM is the family {Q( ; )} of real-valued functions on , indexed by and dened by Q( ; ) =
def

log f (x ; )p(x ; ) (dx) .

(10.4)

Remark 10.1.2. To ensure that Q( ; ) is indeed well-dened for all values of the pair (, ), one needs regularity conditions on the family of functions {f ( ; )} , which will be stated below (Assumption 10.1.3). To avoid trivial cases however, we use the convention 0 log 0 = 0 in (10.4) and in similar relations below. In more formal terms, for every measurable set N such that both f (x ; ) and p(x ; ) vanish -a.e. on N , set log f (x ; )p(x ; ) (dx) = 0 .
N def

With this convention, Q( ; ) stays well-dened in cases where there exists a non-empty set N such that both f (x ; ) and f (x ; ) vanish -a.e. on N . The intermediate quantity Q( ; ) of EM may be interpreted as the expectation of the function log f (X ; ) when X is distributed according to the probability density function p( ; ) indexed by a, possibly dierent, value of the parameter. Using (10.2) and (10.3), one may rewrite the intermediate quantity of EM in (10.4) as

350

10 Maximum Likelihood Inference, Part I

Q( ; ) = () H( ; ) , where H( ; ) =
def

(10.5) (10.6)

log p(x ; )p(x ; ) (dx) .

Equation (10.5) states that the intermediate quantity Q( ; ) of EM diers from (the log of) the objective function () by a quantity that has a familiar form. Indeed, H( ; ) is recognized as the entropy of the probability density function p( ; ) (see for instance Cover and Thomas, 1991). More importantly, the increment of H( ; ), H( ; ) H( ; ) = log p(x ; ) p(x ; ) (dx) , p(x ; ) (10.7)

is recognized as the Kullback-Leibler divergence (or relative entropy) between the probability density functions p indexed by and , respectively. The last piece of notation needed is the following: the gradient and Hessian of a function, say L, at will be denoted by L( ) and 2 L( ), respec tively. To avoid ambiguities, the gradient of H( ; ) with respect to its rst argument, evaluated at , will be denoted by H( ; )|= (where the same convention will also be used, if needed, for the Hessian). We conclude this introductory section by stating a minimal set of assumptions that guarantee that all quantities introduced so far are indeed well-dened. Assumption 10.1.3. (i) The parameter set is an open subset of Rd (for some integer d ). (ii) For any , L() is positive and nite. (iii) For any (, ) , | log p(x ; )|p(x ; ) (dx) is nite. . Assumption 10.1.3(iii) implies in particular that the probability distributions in the family {p( ; ) d} are all absolutely continuous with respect to one another. Any individual distribution p( ; ) d can only vanish on sets that are assigned null probability by all other probability distributions in the family. Thus both H( ; ) and Q( ; ) are well-dened for all pairs of parameters. 10.1.2.2 The Fundamental Inequality of EM We are now ready to state the fundamental result that justies the standard construction of the EM algorithm. Proposition 10.1.4. Under Assumption 10.1.3, for any (, ) , () ( ) Q( ; ) Q( ; ) , where the inequality is strict unless p( ; ) and p( ; ) are equal -a.e. Assume in addition that (10.8)

10.1 Likelihood Optimization in Incomplete Data Models

351

(a) L() is continuously dierentiable on ; (b) for any , H( ; ) is continuously dierentiable on . Then for any , Q( ; ) is continuously dierentiable on and

( ) =

Q( ;

)|=

(10.9)

Proof. The dierence between the left-hand side and the right-hand side of (10.8) is the quantity dened in (10.7), which we already recognized as a Kullback-Leibler distance. Under Assumption 10.1.3(iii), this latter term is well-dened and known to be strictly positive (by direct application of Jensens inequality) unless p( ; ) and p( ; ) are equal -a.e. (Cover and Thomas, 1991; Lehmann and Casella, 1998). For (10.9), rst note that Q( ; ) is a dierentiable function of , as it is the dierence of two functions that are dierentiable under the additional assumptions (a) and (b). Next, the previous discussion implies that H( ; ) is minimal for = , although this may not be the only point where the minimum is achieved. Thus its gradient vanishes at , which proves (10.9).

10.1.2.3 The EM Algorithm The essence of the EM algorithm, which is suggested by (10.5), is that Q( ; ) may be used as a surrogate for (). Both functions are not necessarily comparable but, in view of (10.8), any value of such that Q( ; ) is increased over its baseline Q( ; ) corresponds to an increase of (relative to ( )) that is at least as large. The EM algorithm as proposed by Dempster et al. (1977) consists in iteratively building a sequence {i }i1 of parameter estimates given an initial guess 0 . Each iteration is classically broken into two steps as follows. E-Step: Determine Q( ; i ); M-Step: Choose i+1 to be the (or any, if there are several) value of that maximizes Q( ; i ). It is certainly not obvious at this point that the M-step may be in practice easier to perform than the direct maximization of the function of interest itself. We shall return to this point in Section 10.1.2.4 below. Proposition 10.1.4 provides the two decisive arguments behind the EM algorithm. First, an immediate consequence of (10.8) is that, by the very denition of the sequence {i }, the sequence { (i )}i0 of log-likelihood values is non-decreasing. Hence EM is a monotone optimization algorithm. Second, if the iterations ever stop at a point , then Q( ; ) has to be maximal at (otherwise it would still be possible to improve over ), and hence is such that L( ) = 0, that is, this is a stationary point of the likelihood. Although this picture is largely correct, there is a slight aw in the second half of the above intuitive reasoning in that the if part (if the iterations ever

352

10 Maximum Likelihood Inference, Part I

stop at a point) may indeed never happen. Stronger conditions are required to ensure that the sequence of parameter estimates produced by EM from any starting point indeed converges to a limit . However, it is actually true that when convergence to a point takes place, the limit has to be a stationary point of the likelihood. In order not to interrupt our presentation of the EM framework, convergence results pertaining to the EM algorithm are deferred to Section 10.5 at the end of this chapter; see in particular Theorems 10.5.3 and 10.5.4. 10.1.2.4 EM in Exponential Families The EM algorithm dened in the previous section will only be helpful in situations where the following general conditions hold. E-Step: It is possible to compute, at reasonable computational cost, the intermediate quantity Q( ; ) given a value of . M-Step: Q( ; ), considered as a function of its rst argument , is suciently simple to allow closed-form maximization. A rather general context in which both of these requirements are satised, or at least are equivalent to easily interpretable necessary conditions, is when the functions {f ( ; )} belong to an exponential family. Denition 10.1.5 (Exponential Family). The family {f ( ; )} denes an exponential family of positive functions on X if f (x ; ) = exp{()t S(x) c()}h(x) , (10.10)

where S and are vector-valued functions (of the same dimension) on X and respectively, c is a real-valued function on and h is a non-negative real-valued function on X. Here S(x) is known as the vector of natural sucient statistics, and = () is the natural parameterization. If {f ( ; )} is an exponential family and if |S(x)|f (x ; ) (dx) is nite for any , the intermediate quantity of EM reduces to Q( ; ) = ()t S(x)p(x ; ) (dx) c() + p(x ; ) log h(x) (dx) .

(10.11) Note that the right-most term does not depend on and thus plays no role in the maximization. It may as well be ignored, and in practice it is not required to compute it. Except for this term, the right-hand side of (10.11) has an explicit form as soon as it is possible to evaluate the expectation of the vector of sucient statistics S under p( ; ). The other important feature of (10.11), ignoring the rightmost term, is that Q( ; ), viewed as a function of , is similar to the logarithm of (10.10) for the particular value S = S(x)p(x ; ) (dx) of the sucient statistic.

10.1 Likelihood Optimization in Incomplete Data Models

353

In summary, if {f ( ; )} is an exponential family, the two above general conditions needed for the EM algorithm to be practicable reduce to the following. E-Step: The expectation of the vector of sucient statistics S(X) under p( ; ) must be computable. M-Step: Maximization of ()t s c() with respect to must be feasible in closed form for any s in the convex hull of S(X) (that is, for any valid value of the expected vector of sucient statistics). For the sake of completeness, it should be mentioned that there are variants of the EM algorithm that are handy in cases where the maximization required in the M-step is not directly feasible (see Section 10.5.3 and further references in Section 10.5.4). In the context of HMMs, the main limitation of the EM algorithm rather appears in cases where the E-step is not feasible. This latter situation is the rule rather than the exception in models for which the state space X is not nite. For such cases, approaches that build on the EM concepts introduced in the current chapter will be fully discussed in Chapter 11. 10.1.3 Gradient-based Methods A frequently ignored observation is that in any model where the EM strategy may be applied, it is also possible to evaluate derivatives of the objective function () with respect to the parameter . This is obvious from (10.9), and we will expand on this matter below. As a consequence, instead of resorting to a specic algorithm such as EM, one may borrow tools from the (comprehensive and well-documented) toolbox of gradient-based optimization methods. 10.1.3.1 Computing Derivatives in Incomplete Data Models A rst remark is that in cases where the EM algorithm is applicable, the objective function () is actually computable: because the EM requires the computation of expectations under the conditional density p( ; ), it is restricted to cases where the normalizing constant L()and hence () = log L()is available. The two equalities below show that it is indeed also the case for the rst- and second-order derivatives of (). Proposition 10.1.6 (Fishers and Louis Identities). Assume 10.1.3 and that the following conditions hold. (a) L() is twice continuously dierentiable on . (b) For any , H( ; ) is twice continuously dierentiable on . In addition, | k log p(x ; )|p(x ; ) (dx) is nite for k = 1, 2 and any (, ) , and
k

log p(x ; )p(x ; ) (dx) =

log p(x ; )p(x ; ) (dx) .

354

10 Maximum Likelihood Inference, Part I

Then the following identities hold:

( ) =

log f (x ; )|= p(x ; ) (dx) ,

(10.12)

( ) =

log f (x ; ) +

= 2

p(x ; ) (dx)
=

log p(x ; )

p(x ; ) (dx) . (10.13)

The second equality may be rewritten in the equivalent form


2

( ) + { +{

( )} {

( )} =

log f (x ; )
t

log f (x ; )|= } {

log f (x ; )|= } p(x ; ) (dx) . (10.14)

Equation (10.12) is sometimes referred to as Fishers identity (see the comment by B. Efron in the discussion of Dempster et al., 1977, p. 29). In cases where the function L may be interpreted as the likelihood associated with some statistical model, the left-hand side of (10.12) is the score function (gradient of the log-likelihood). Equation (10.12) shows that the score function may be evaluated by computing the expectation, under p( ; ), of the function log f (X ; )|= . This latter quantity, in turn, is referred to as the complete score function in a statistical context, as log f (x ; ) is the joint log-likelihood of the complete data (X, Y ); again we remark that at this stage, Y is not explicit in the notation. Equation (10.13) is usually called the missing information principle after Louis (1982) who rst named it this way, although it was mentioned previously in a slightly dierent form by Orchard and Woodbury (1972) and implicitly used in Dempster et al. (1977). In cases where L is a likelihood, the left-hand side of (10.13) is the associated observed information matrix, and the second term on the right-hand side is easily recognized as the (negative of the) Fisher information matrix associated with the probability density function p( ; ). Finally (10.14), which is here written in a form that highlights its symmetry, was also proved by Louis (1982) and is thus known as Louis identity. Together with (10.12), it shows that the rst- and second-order derivatives of may be evaluated by computing expectations under p( ; ) of quantities derived from f ( ; ). We now prove these three identities. Proof (of Proposition 10.1.6). Equations (10.12) and (10.13) are just (10.5) where the right-hand side is dierentiated once, using (10.9), and then twice under the integral sign. To prove (10.14), we start from (10.13) and note that the second term on its right-hand side is the negative of an information matrix for the parameter

10.1 Likelihood Optimization in Incomplete Data Models

355

associated with the probability density function p( ; ) and evaluated at . We rewrite this second term using the well-known information matrix identity
2

log p(x ; ) = {

p(x ; ) (dx)

log p(x ; )|= } {

log p(x ; )|= } p(x ; ) (dx) .

This is again a consequence of assumption (b) and the fact that p( ; ) is a probability density function for all values of , implying that

log p(x ; )|= p(x ; ) (dx) = 0 .

Now use the identity log p(x ; ) = log f (x ; ) () and (10.12) to conclude that {

log p(x ; )|= } { {

log p(x ; )|= } p(x ; ) (dx)

log f (x ; )|= } {

log f (x ; )|= } p(x ; ) (dx) {

( )} {

( )} ,

which completes the proof. Remark 10.1.7. As was the case for the intermediate quantity of EM, Fishers and Louis identities only involve expectations under p( ; ) of quantities derived from f ( ; ). In particular, when the functions f ( ; ) belong to an exponential family (see Denition 10.1.5), Fishers identity, for instance, may be rewritten as

( ) = {

)}

S(x)p(x ; ) (dx)

c(

),

with the convention that ( ) is the d d matrix containing the partial derivatives [ ( )]ij = i ( )/j . As a consequence, the only practical requirement for using Fishers and Louis identities is the ability to compute expectations of the sucient statistic S(x) under p( ; ) for any . 10.1.3.2 The Steepest Ascent Algorithm We briey discuss the main features of gradient-based iterative optimization algorithms, starting with the simplest, but certainly not most ecient, approach. We restrict ourselves to the case where the optimization problem is unconstrained in the sense that = Rd , so that any parameter value produced by the algorithms below is valid. For an in-depth coverage of the subject, we recommend the monographs by Luenberger (1984) and Fletcher (1987).

356

10 Maximum Likelihood Inference, Part I

The simplest method is the steepest ascent algorithm in which the current value of the estimate i is updated by adding a multiple of the gradient i ( ), referred to as the search direction: i+1 = i + i

(i ) .

(10.15)

Here the multiplier i is a non-negative scalar that needs to be adjusted at each iteration to ensure, a minima, that the sequence { (i )} is non-decreasingas was the case for EM. The most sensible approach consists in choosing i as to maximize the objective function in the search direction: i = arg max0 [i +

(i )] .

(10.16)

It can be shown (Luenberger, 1984, Chapter 7) that under mild assumptions, the steepest ascent method with multipliers (10.16) is globally convergent, with a set of limit points corresponding to the stationary points of (see Section 10.5 for precise denitions of these terms and a proof that this property holds for the EM algorithm). It remains that the use of the steepest ascent algorithm is not recommended, particularly in large-dimensional parameter spaces. The reason for this is that its speed of convergence linear in the sense that if the sequence {i }i0 converges to a point such that the Hessian 2 ( ) is negative denite (see Section 10.5.2), then lim i+1 (k) (k) = k < 1 ; |i (k) (k)| (10.17)

here (k) denotes the kth coordinate of the parameter vector. For largedimensional problems it frequently occurs that, at least for some components k, the factor k is close to one, resulting in very slow convergence of the algorithm. It should be stressed however that the same is true for the EM algorithm, which also exhibits speed of convergence that is linear, and often very poor (Dempster et al., 1977; Jamshidian and Jennrich, 1997; Meng, 1994; Lange, 1995; Meng and Van Dyk, 1997). For gradient-based methods however, there exists a whole range of approaches, based on the second-order properties of the objective function, to guarantee faster convergence. 10.1.3.3 Newton and Second-order Methods The prototype of second-order methods is the Newton, or Newton-Raphson, algorithm: i+1 = i H 1 (i ) (i ) , (10.18) where H(i ) = 2 (i ) is the Hessian of the objective function. The Newton iteration is based on the second-order approximation () ( ) + ( ) ( ) + 1 t ( ) H( ) ( ) . 2

10.1 Likelihood Optimization in Incomplete Data Models

357

If the sequence {i }i0 produced by the algorithm converges to a point at which the Hessian is negative denite, the convergence is, at least, quadratic in the sense that for suciently large i there exists a positive constant such that i+1 i 2 . Therefore the procedure can be very ecient. The practical use of the Newton algorithm is however hindered by two serious diculties. The rst is analogous to the problem already encountered for the steepest ascent method: there is no guarantee that the algorithm meets the minimal requirement to provide a nal parameter estimate that is at least as good as the starting point 0 . To overcome this diculty, one may proceed as for the steepest ascent method and introduce a multiplier i controlling the step-length in the search direction, so that the method takes the form i+1 = i i H 1 (i )

(i ) .

(10.19)

Again, i may be set to maximize (i+1 ). In practice, it is most often impossible to obtain the exact maximum point called for by the ideal line-search, and one uses approximate directional maximization procedures. Generally speaking, a line-search algorithm is an algorithm to nd a reasonable multiplier i in a step of the form (10.19). A frequently used algorithm consists in determining the (approximate) maximum based on a polynomial interpolation of () along the line-segment between the current point i and the proposed update given by (10.18). A more serious problem is that except in the particular case where the function () is strictly concave, the direct implementation of (10.18) is prone to numerical instabilities: there may well be whole regions of the parameter space where the Hessian H() is either non-invertible (or at least very badly conditioned) or not negative semi-denite (in which case H 1 (i ) (i ) is not necessarily an ascent direction). To combat this diculty, Quasi-Newton methods1 use the modied recursion i+1 = i + i W i (i ) ; (10.20)

here W i is a weight matrix that may be tuned at each iteration, just like the multiplier i . The rationale is that if W i becomes close to H 1 (i ) when convergence occurs, the modied algorithm will share the favorable convergence properties of the Newton algorithm. On the other hand, by using a weight matrix W i dierent from H 1 (i ), numerical issues associated with the matrix inversion may be avoided. We again refer to Luenberger (1984) and Fletcher (1987) for a more precise discussion of the available approaches and simply mention here the fact that usually the methods only take prot of gradient information to construct W i , for instance using nite dierence calculations, without requiring the direct evaluation of the Hessian H(). In some contexts, it may be possible to build explicit strategies that are not as good as the Newton algorithmfailing in particular to reach quadratic
Conjugate gradient methods are another alternative approach that we do not discuss here.
1

358

10 Maximum Likelihood Inference, Part I

convergence ratesbut yet signicantly faster at converging than the basic steepest ascent approach. For incomplete data models, Lange (1995) suggested 1 to use in (10.20) a weight matrix Ic (i ) given by Ic ( ) =
2

log f (x ; )

p(x ; ) (dx) .

(10.21)

This is the rst term on the right-hand side of (10.13). In many models of interest, this matrix is positive denite for all , and thus its inversion is not subject to numerical instabilities. Based on (10.13), it is also to be expected that in some circumstances, Ic ( ) is a reasonable approximation to the Hessian 2 ( ) and hence that the weighted gradient algorithm converges faster than the steepest ascent or EM algorithms (see Lange, 1995, for further results and examples). In a statistical context, where f (x ; ) is the joint density of two random variables X and Y , Ic ( ) is the conditional expectation given Y of the observed information matrix of associated with this pair. 10.1.4 Pros and Cons of Gradient-based Methods A quick search through the literature shows that for HMMs in particular and incomplete data models in general, the EM algorithm is much more popular than are gradient-based alternatives. A rst obvious reason for this is that the EM approach is more generally known than its gradient-based counterparts. We list below a number of additional signicant dierences between both approaches, giving rst the arguments in favor of the EM algorithm. The EM algorithm is usually very simple to implement from scratch. This is not the case for gradient-based methods, which require several specialized routines, for Hessian approximation, line-search, etc. This argument is however made less pregnant by the wide availability of generic numerical optimization code, so that implementing a gradient-based method usually only requires the computation of the objective function and its gradient. In most situations, this is not more complicated than is implementing EM. The EM algorithm often deals with parameter constraints implicitly. It is generally the case that the M-step equations are so simple that they can be solved even for parameters that are subject to constraints (see the case of normal HMMs in Section 10.3 for an example). For gradient-based methods this is not the case, and parameter constraints have to be dealt with explicitly, either through reparameterization (see Example 10.3.2) or using constrained optimization routines. The EM algorithm is parameterization independent. Because the M-step is dened by a maximization operation, it is independent of the way the parameters are represented, as is the maximum likelihood estimator for instance. Thus any (invertible) transformation of the parameter vector leaves the EM recursion unchanged. This is obviously not the case for gradient-based methods for which reparameterization will change the gradient and Hessian, and hence the convergence behavior of the algorithm.

10.2 Application to HMMs

359

In contrast, gradient-based methods may be preferred for the following reasons. Gradient-based methods do not require the M-step. Thus they may be applied to models for which the M-step does not lead to simple closed-form solutions. Gradient-based methods converge faster. As discussed above, gradientbased methods can reach quadratic convergence whereas EM usually converges only linearly, following (10.17)see Example 10.3.2 for an illustration and further discussion of this aspect.

10.2 Application to HMMs


We now return to our primary focus and discuss the application of the previous methods to the specic case of hidden Markov models. 10.2.1 Hidden Markov Models as Missing Data Models HMMs correspond to a sub-category of incomplete data models known as missing data models. In missing data models, the observed data Y is a subset of some not fully observable complete data (X, Y ). We here assume that the joint distribution of X and Y , for a given parameter value , admits a joint probability density function f (x, y ; ) with respect to the product measure . As mentioned in Section 10.1.1, the function f is sometimes referred to as the complete data likelihood. It is important to understand that f is a probability density function only when considered as a function of both x and y. For a xed value of y and considered as a function of x only, f is a positive integrable function. Indeed, the actual likelihood of the observation, which is dened as the probability density function of Y with respect to , is obtained by marginalization as L(y ; ) = f (x, y ; ) (dx) . (10.22)

For a given value of y this is of course a particular case of (10.1), which served as the basis for developing the EM framework in Section 10.1.2. In missing data models, the family of probability density functions {p( ; )} dened in (10.3) may thus be interpreted as p(x|y ; ) = f (x, y ; ) , f (x, y ; ) (dx) (10.23)

the conditional probability density function of X given Y . In the last paragraph, slightly modied versions of the notations introduced in (10.1) and (10.3) were used to reect the fact that the quantities of interest now depend on the observed variable Y . This is obviously

360

10 Maximum Likelihood Inference, Part I

mostly a change regarding terminology, with no impact on the contents of Section 10.1.2, except that we may now think of integrating with respect to p( ; ) d as taking the conditional expectation with respect to the missing data X, given the observed data Y , in the model indexed by the parameter value . Remark 10.2.1. Applying the EM algorithm dened in Section 10.1.2 in the case of (10.22) yields a sequence of parameter values {i }i0 whose likelihoods L(y ; i ) cannot decrease with the iteration index i. Obviously, this connects to maximum likelihood estimation. Another frequent use of the EM algorithm is for maximum a posteriori (MAP) estimation, in which the objective function to be maximized is a Bayesian posterior (Dempster et al., 1977). Indeed, we may replace (10.22) by L(y ; ) = () f (x, y ; ) (dx) , (10.24)

where is a positive function on . In the Bayesian framework (see Section 13.1 for a brief presentation of the Bayesian approach), is usually selected to be a probability density function (with respect to some measure on ) and (10.24) is then interpreted as being proportional, up to a factor that depends on y only, to the posterior probability density function of the unknown parameter , conditional on the observation Y . In that case, is referred to as a prior density on the parameter . But in (10.24) may also be thought of as a regularization functional (sometimes also called a penalty) that may not have a probabilistic interpretation (Green, 1990). Whether L is dened according to (10.22) or to (10.24) does not modify the denition of p( ; ) in (10.23), as the factor () cancels out in the renormalization. Thus the E-step in the EM algorithm is left unchanged and only the M-step depends on the precise choice of . 10.2.2 EM in HMMs We now consider more specically hidden Markov models using the notations introduced in Section 2.2, assuming that observations Y0 to Yn (or, in short, Y0:n ) are available. Because we only consider HMMs that are fully dominated in the sense of Denition 2.2.3, we will use the notations and k|n to refer to the probability density functions of these distributions (of X0 and of Xk given Y0:n ) with respect to the dominating measure . The joint probability density function of the hidden states X0:n and associated observations Y0:n , with respect to the product measure (n+1) (n+1) , is given by fn (x0:n , y0:n ; ) = (x0 ; )g(x0 , y0 ; )q(x0 , x1 ; )g(x1 , y1 ; ) q(xn1 , xn ; )g(xn , yn ; ) , (10.25) where we used the same convention as above to indicate dependence with respect to the parameter .

10.2 Application to HMMs

361

Because we mainly consider estimation of the HMM parameter vector from a single sequence of observations, it does not make much sense to consider as an independent parameter. There is no hope to estimate consistently, as there is only one random variable X0 (that is not even observed!) drawn from this density. In the following, we shall thus consider that is either xed (and known) or fully determined by the parameter that appears in q and g. A typical example of the latter consists in assuming that is the stationary distribution associated with the transition function q(, ; ) (if it exists). This option is generally practicable only in very simple models (see Example 10.3.3 below for an example) because of the lack of analytical expressions relating the stationary distribution of q(, ; ) to for general parameterized hidden chains. Irrespective of whether is xed or determined by , it is convenient to omit dependence with respect to in our notations, writing, for instance, E for expectations under the model parameterized by (, ). Note that for left-to-right HMMs (discussed Section 1.4), the case is rather dierent as the model is trained from several independent sequences and the initial distribution is often a key parameter. Handling the case of multiple training sequences is straightforward as the quantities corresponding to different sequences simply need to be added together due to the independence assumption (see Section 10.3.2 below for the details in the normal HMM case). The likelihood of the observations Ln (y0:n ; ) is obtained by integrating (10.25) with respect to the x (state) variables under the measure (n+1) . Note that here we use yet another slight modication of the notations adopted in Section 10.1 to acknowledge that both the observations and the hidden states are indeed sequences with indices ranging from 0 to n (hence the subscript n). Upon taking the logarithm in (10.25),
n1

log fn (x0:n , y0:n ; ) = log (x0 ; ) +


k=0 n

log q(xk , xk+1 ; ) log g(xk , yk ; ) ,


k=0

and hence the intermediate quantity of EM has the additive structure


n1

Q( ; ) = E [log (X0 ; ) | Y0:n ] +


k=0 n

E [log q(Xk , Xk+1 ; ) | Y0:n ] E [log g(Xk , Yk ; ) | Y0:n ] .


k=0

In the following, we will adopt the implicit conditioning convention that we have used extensively from Section 3.1.4 and onwards, writing gk (x ; ) instead of g(x, Yk ; ). With this notation, the intermediate quantity of EM may be rewritten as

362

10 Maximum Likelihood Inference, Part I


n

Q( ; ) = E [log (X0 ; ) | Y0:n ] +


k=0 n1

E [log gk (Xk ; ) | Y0:n ] E [log q(Xk , Xk+1 ; ) | Y0:n ] . (10.26)

+
k=0

Equation (10.26) shows that in great generality, evaluating the intermediate quantity of EM only requires the computation of expectations under the marginal k|n ( ; ) and bivariate k:k+1|n ( ; ) smoothing distributions, given the parameter vector . The required expectations may thus be computed using either any of the variants of the forward-backward approach presented in Chapter 3 or the recursive smoothing approach discussed in Section 4.1. To make the connection with the recursive smoothing approach of Section 4.1, we simply rewrite (10.26) as E [tn (X0:n ; ) | Y0:n ], where t0 (x0 ; ) = log (x0 ; ) + log g0 (x0 ; ) and tk+1 (x0:k+1 ; ) = tk (x0:k ; ) + {log q(xk , xk+1 ; ) + log gk+1 (xk+1 ; )} . (10.28) Proposition 4.1.3 may then be applied directly to obtained the smoothed expectation of the sum functional tn . Although the exact form taken by the M-step will obviously depend on the way g and q depend on , the EM update equations follow a very systematic scheme that does not change much with the exact model under consideration. For instance, all discrete state space models for which the transition matrix q is parameterized by its r r elements and such that g and q do not share common parameters (or parameter constraints) give rise to the same update equations for q, given in (10.43) below. Several examples of the EM update equations will be reviewed in Sections 10.3 and 10.4. 10.2.3 Computing Derivatives Recall that the Fisher identity(10.12)provides an expression for the gradient of the log-likelihood n () with respect to the parameter vector , closely related to the intermediate quantity of EM. In the HMM context, (10.12) reduces to
n n ()

(10.27)

= E [

log (X0 ; ) | Y0:n ] +


k=0 n1

E [

log gk (Xk ; ) | Y0:n ]

+
k=0

E [

log q(Xk , Xk+1 ; ) | Y0:n ] . (10.29)

10.2 Application to HMMs

363

Hence the gradient of the log-likelihood may also be evaluated using either the forward-backward approach or the recursive technique discussed in Chapter 4. For the latter, we only need to redene the functional of interest, replacing (10.27) and (10.28) by their gradients with respect to . Louis identity (10.14) gives rise to more complicated expressions, and we only consider here the case where g does depend on , whereas the state transition density q and the initial distribution are assumed to be xed and known (the opposite situation is covered in detail in a particular case in Section 10.3.4). In this case, (10.14) may be rewritten as
2 n ()

+{
n

n ()} { 2 n

n ()}

(10.30)

=
k=0

E [
n

log gk (Xk ; ) Y0:n ] E { log gk (Xk ; )} { log gj (Xj ; )}


t

+
k=0 j=0

Y0:n .

The rst term on the right-hand side of (10.30) is obviously an expression that can be computed proceeding as for (10.29), replacing rst- by second-order derivatives. The second term is however more tricky because it (seemingly) requires the evaluation of the joint distribution of Xk and Xj given the observations Y0:n for all pairs of indices k and j, which is not obtainable by the smoothing approaches based on some form of the forward-backward decomposition. The rightmost term of (10.30) is however easily recognized as a squared sum functional similar to (4.4), which can thus be evaluated recursively (in n) proceeding as in Example 4.1.4. Recall that the trick consists in observing that if
n

n,1 (x0:n ; ) =

def k=0 n

log gk (xk ; ) ,
n t k=0

n,2 (x0:n ; ) = then

def k=0

log gk (xk ; )

log gk (xk ; )

n,2 (x0:n ; ) = n1,2 (x0:n1 ; ) + { +

log gn (xn ; )} {

log gn (xn ; )}
t

+ n1,1 (x0:n1 ; ) {

log gn (xn ; )}

log gn (xn ; ) {n1,1 (x0:n1 ; )} .

This last expression is of the general form given in Denition 4.1.2, and hence Proposition 4.1.3 may be applied to update recursively in n E [n,1 (X0:n ; ) | Y0:n ] and E [n,2 (X0:n ; ) | Y0:n ] .

To make this approach more concrete, we will describe below, in Section 10.3.4, its application to a very simple nite state space HMM.

364

10 Maximum Likelihood Inference, Part I

10.2.4 Connection with the Sensitivity Equation Approach The method outlined above for evaluating the gradient of the likelihood is coherent with the general approach of Section 4.1. There is however a (seemingly) distinct approach for evaluating the same quantity, which does not require the use of Fishers identity, and has been used for a very long time in the particular case of Gaussian linear state-space models. The method, known under the name of sensitivity equations (see for instance Gupta and Mehra, 1974), postulates that since the log-likelihood can be computed recursively based on the Kalman prediction recursion, its derivatives can also be computed by a recursionthe so-called sensitivity equationswhich is obtained by dierentiating the Kalman relations with respect to the model parameters. For such models, the remark that the gradient of the log-likelihood may also be obtained using Fishers identity was made by Segal and Weinstein (1989); see also Weinstein et al. (1994). The sensitivity equations approach is in no way limited to Gaussian linear state-space models but may be applied to HMMs in general. This remark, put forward by Campillo and Le Gland (1989) and Le Gland and Mevel (1997), has been subsequently used for nite state-space HMMs (Capp et al., 1998; e Collings and Rydn, 1998) as well as for general HMMs (Crou et al., 2001; e e Doucet and Tadi, 2003). In the latter case, it is necessary to resort to some c form of sequential Monte Carlo approach discussed in Chapter 7 because exact ltering is not available. It is interesting that the sequential Monte Carlo approximation method used by both Crou et al. (2001) and Doucet and Tadi e c (2003) has also been derived by Capp (2001a) using Fishers identity and the e smoothing framework discussed in Section 4.1. Indeed, we show below that the sensitivity equation approach is exactly equivalent to the use of Fishers identity. Recall that the log-likelihood may be written according to (3.29) as a sum of terms that only involve the prediction density,
n n () = k=0 ck ()

log

k|k1 (xk ; )gk (xk ; ) (dxk ) ,

(10.31)

where the integral is also the normalizing constant that appears in the prediction and ltering recursion (Remark 3.2.6), which we denoted by ck (). The ltering recursion as given by (3.27) implies that k+1 (xk+1 ; ) = c1 () k+1 k (xk ; )q(xk , xk+1 ; )gk+1 (xk+1 ; ) (dxk ) .

(10.32) To dierentiate (10.32) with respect to , we assume that ck+1 () does not vanish and we use the obvious identity

u() = v 1 () v()

u()

u() v()

log v()

10.2 Application to HMMs

365

to obtain
k+1 (xk+1

; ) = k+1 (xk+1 ; ) k+1 (xk+1 ; )

log ck+1 () , (10.33)

where k+1 (xk+1 ; ) = c1 () k+1


def

k (xk ; )q(xk , xk+1 ; )gk+1 (xk+1 ; ) (dxk ) .

(10.34) We further assume that as in Proposition 10.1.6, we may interchange integration with respect to and dierentiation with respect to . Because k+1 ( ; ) is a probability density function, k+1 (xk+1 ; ) (dxk+1 ) = 1 and k+1 (xk+1 ; ) (dxk+1 ) = k+1 (xk+1 ; ) (dxk+1 ) = 0. Therefore, integration of both sides of (10.33) with respect to (dxk+1 ) yields 0= k+1 (xk+1 ; ) (dxk+1 )

log ck+1 () .

Hence, we may evaluate the gradient of the incremental log-likelihood in terms of k+1 according to

log ck+1 () =

def

( k+1 () k ())

k+1 (xk+1 ; ) (dxk+1 ) . (10.35)

Now we evaluate the derivative in (10.34) assuming also that q and gk are non-zero to obtain k+1 (xk+1 ; ) = c1 () k+1 k (xk ; ) + [

log q(xk , xk+1 ; ) +

log gk+1 (xk+1 ; )]

k (xk

; ) q(xk , xk+1 ; )gk+1 (xk+1 ; ) (dxk ) .

Plugging (10.33) into the above equation yields an update formula for k+1 , k+1 (xk+1 ; ) = c1 () k+1 [

log q(xk , xk+1 ; ) +

log gk+1 (xk+1 ; )]

k (xk ; ) + k (xk ; ) q(xk , xk+1 ; )gk+1 (xk+1 ; ) (dxk ) k+1 (xk+1 ; )

log ck () , (10.36)

where (10.32) has been used for the last term on the right-hand side. We collect these results in the form of the algorithm below. Algorithm 10.2.2 (Sensitivity Equations). In addition to the usual ltering recursions, do: Initialization: Compute (x0 ) = [ and
0 ()

log (x0 ; ) +

log q0 (x0 ; )] 0 (x0 ; )

(x0 ) (dx0 ).

366

10 Maximum Likelihood Inference, Part I

Recursion: For k = 0, 1 . . . , use (10.36) to compute k+1 and (10.35) to evaluate k+1 () k (). Algorithm 10.2.2 updates the intermediate function k ( ; ), dened in (10.34), whose integral is the quantity of interest log ck (). Obviously, one can equivalently use as intermediate quantity the derivative of the ltering probability density function k ( ; ), which is directly related to k ( ; ) by (10.33). The quantity k ( ; ), which is referred to as the tangent lter by Le Gland and Mevel (1997), is also known as the lter sensitivity and may be of interest in its own right. Using k ( ; ) instead of k ( ; ) does not however modify the nature of algorithm, except for slightly more involved mathematical expressions. It is interesting to contrast Algorithm 10.2.2 with the smoothing approach based on Fishers identity (10.29). Recall from Section 4.1 that in order to evaluate (10.29), we recursively dene a sequence of functions by t0 (x0 ) = and tk+1 (x0:k+1 ) = tk (x0:k ) +

log (x0 ; ) +

log g0 (x0 ; ) ,

log q(xk , xk+1 ; ) +

log gk+1 (xk ; ) k (xk ; ) (dxk ),

for k 0. Proposition 4.1.3 asserts that E [tk (X0:k ) | Y0:k ] = where k may be updated according to the recursion k+1 (xk+1 ; ) = c1 () k+1 [

log q(xk , xk+1 ; ) +

log gk+1 (xk+1 ; )] (10.37)

k (xk ; ) + k (xk ; ) q(xk , xk+1 ; )gk+1 (xk+1 ; ) (dxk )

for k 0, where 0 (x0 ; ) = c0 ()1 (x0 )t0 (x0 )g0 (x0 ). Comparing (10.37) and (10.36), it is easily established by recurrence on k that 0 ( ; ) = 0 ( ; ) and
k1

k ( ; ) = k ( ; )
l=0

log cl () k ( ; )

(10.38)

for k 1. Hence, whereas k (xk ; ) (dxk ) gives access to k (), the gradient of the log-likelihood up to index k, k (xk ; ) (dxk ) equals the gradient of the increment k () k1 (), where the second term is decomposed k1 into the telescoping sum k1 () = l=0 log cl () of increments. The sensitivity equations and the use of Fishers identity combined with the recursive smoothing algorithm of Proposition 4.1.3 are thus completely equivalent. The fundamental reason for this rather surprising observation is that whereas the log-likelihood may be written, according to (10.31), as a

10.3 The Example of Normal Hidden Markov Models

367

sum of integrals under the successive prediction distributions, the same is no more true when dierentiating with respect to . To compute the gradient of (10.31), one needs to evaluate k ( ; )or, equivalently, k ( ; )which k1 depends on all the previous values of cl () through the sum l=0 log cl (). To conclude this section, let us stress again that there are only two dierent options for computing the gradient of the log-likelihood. Forward-backward algorithm: based on Fishers identity (10.29) and forwardbackward smoothing. Recursive algorithm: which can be equivalently derived either through the sensitivity equations or as an application of Proposition 4.1.3 starting from Fishers identity. Both arguments give rise to the same algorithm. These two options only dier in the way the computations are organized, as both evaluate exactly the sum of terms appearing in (10.29). In considering several examples below, we shall observe that the former solution is generally more ecient from the computational point of view.

10.3 The Example of Normal Hidden Markov Models


In order to make the general principles outlined in the previous section more concrete, we now work out the details on selected examples of HMMs. We begin with the case where the state space is nite and the observation transition function g corresponds to a (univariate) Gaussian distribution. Only the most standard case where the parameter vector is split into two sub-components that parameterize, respectively, g and q, is considered. 10.3.1 EM Parameter Update Formulas In the widely used normal HMM discussed in Section 1.3.2, X is a nite set, identied with {1, . . . , r}, Y = R, and g is a Gaussian probability density function (with respect to Lebesgue measure) given by g(x, y ; ) = (y x )2 1 exp 2x 2x .

By denition, gk (x ; ) is equal to g(x, Yk ; ). We rst assume that the initial distribution is known and xed, before examining the opposite case briey in Section 10.3.2 below. The parameter vector thus encompasses the transition probabilities qij for i, j = 1, . . . , r as well as the means i and variances i for i = 1, . . . , r. Note that in this section, because we will often need to dierentiate with respect to i , it is simpler to use the variances i = 2 i rather than the standard deviations i as parameters. The means and variances are unconstrained, except for the positivity of the latter, but the r transition probabilities are subject to the equality constraints j=1 qij = 1

368

10 Maximum Likelihood Inference, Part I

for i = 1, . . . , r (in addition to the obvious constraint that qij should be non-negative). When considering the parameter vector denoted by , we will denote by i , i , and qij its various elements. For the model under consideration, (10.26) may be rewritten as Q( ; ) = C st 1 2
n r

E
k=0 n i=1

1{Xk = i} log i +
r

(Yk i )2 i

Y0:n

+
k=1

1{(Xk1 , Xk ) = (i, j)} log qij Y0:n ,

i=1 j=1

where the leading term does not depend on . Using the notations introduced in Section 3.1 for the smoothing distributions, we may write Q( ; ) = C st 1 2
n r

k|n (i ; ) log i +
k=0 i=1 n r r

(Yk i )2 i

+
k=1 i=1 j=1

k1:k|n (i, j ; ) log qij . (10.39)

In the above expression, we use the same convention as in Chapter 5 and denote the smoothing probability P (Xk = i | Y0:n ) by k|n (i ; ) rather than by k|n ({i} ; ). The variable is there to recall the dependence of the smoothing probability on the unknown parameters. Now, given the initial distribution and parameter , the smoothing distributions appearing in (10.39) can be evaluated by any of the variants of forward-backward smoothing discussed in Chapter 3. As already explained above, the E-step of EM thus reduces to solving the smoothing problem. The M-step is specic and depends on the model parameterization: the task consists in nding a global optimum of Q( ; ) that satises the constraints mentioned above. For this, simply introduce the Lagrange multipliers 1 , . . . , r r that correspond to the equality constraints j=1 qij = 1 for i = 1, . . . , r (Luenberger, 1984, Chapter 10). The rst-order partial derivatives of the Lagrangian
r r

L(, ; ) = Q( ; ) +
i=1

i 1
j=1

qij

are given by 1 L(, ; ) = i i


n

k|n (i ; )(Yk i ) ,
k=0 n

1 L(, ; ) = i 2

k|n (i ; )
k=0

1 (Yk i )2 2 i i

10.3 The Example of Normal Hidden Markov Models

369

L(, ; ) = qij

k=1

k1:k|n (i, j ; ) i , qij


r

L(, ; ) = 1 qij . i j=1

(10.40)

Equating all expressions in (10.40) to zero yields the parameter vector


= ( )i=1,...,r , (i )i=1,...,r , (qij )i,j=1,...,r i

which achieves the maximum of Q( ; ) under the applicable parameter constraints: = i


i = qij = n k=0 k|n (i ; )Yk , n k=0 k|n (i ; ) n 2 k=0 k|n (i ; )(Yk i ) , n k=0 k|n (i ; ) n k=1 k1:k|n (i, j ; ) n r k=1 l=1 k1:k|n (i, l ; )

(10.41) (10.42) (10.43)

for i, j = 1, . . . , r, where the last equation may be rewritten more concisely as


qij = n k=1 k1:k|n (i, j ; n k=1 k1|n (i ; )

(10.44)

Equations (10.41)(10.43) are emblematic of the intuitive form taken by the parameter update formulas derived though the EM strategy. These equations are simply the maximum likelihood equations for the complete model in which both {Xk }0kn and {Yk }0kn would be observed, except that the functions 1{Xk = i} and 1{Xk1 = i, Xk = j} are replaced by their conditional expectations, k|n (i ; ) and k1:k|n (i, j ; ), given the actual observations Y0:n and the available parameter estimate . As discussed in Section 10.1.2.4, this behavior is fundamentally due to the fact that the probability density functions associated with the complete model form an exponential family. As a consequence, the same remark holds more generally for all discrete HMMs for which the conditional probability density functions g(i, ; ) belong to an exponential family. A nal word of warning about the way in which (10.42) is written: in order to obtain a concise and intuitively interpretable expression, (10.42) features the value of as given by (10.41). It is of course possible i to rewrite (10.42) in a way that only contains the current parameter value and the observations Y0:n by combining (10.41) and (10.42) to obtain
i

n 2 k=0 k|n (i ; )Yk n k=0 k|n (i ; )

n k=0 k|n (i ; )Yk n k=0 k|n (i ; )

(10.45)

For normal HMMs, the M-step thus reduces to computing averages and ratios of simple expressions that involve the marginal and bivariate smoothing

370

10 Maximum Likelihood Inference, Part I

probabilities evaluated during the E-step. The number of operations associated with the implementation of these expressions scales with respect to r and n like r2 n, which is similar to the complexity of forward-backward smoothing (see Chapter 5). In practice however, the M-step is usually faster than the E-step because operations such as sums, products, or squares are carried out faster than the exponential (recall that forward-backward smoothing requires the computation of g (i, yk ) for all i = 1, . . . , r and k = 0, . . . , n). Although the dierence may not be very signicant for scalar models, it becomes more and more important for high-dimensional multivariate generalizations of the normal HMM, such as those used in speech recognition. 10.3.2 Estimation of the Initial Distribution As mentioned above, in this chapter we generally assume that the initial distribution , that is, the distribution of X0 , is xed and known. There are cases when one wants to treat this as an unknown parameter however, and we briey discuss below this issue in connection with the EM algorithm for the normal HMM. We shall assume that = (i )1ir is an unknown probability vector (that is, with non-negative entries summing to unity), which we accommodate within the parameter vector . The complete log-likelihood will then be as above, where the initial term
r

log X0 =
i=1

1{X0 = i} log i

goes into Q( ; ) as well, giving the additive contribution


r

0|n (i ; ) log i
i=1

to (10.39). This sum is indeed part of (10.39) already, but hidden within C st when is not a parameter to be estimated. Using Lagrange multipliers as above, it is straightforward to show that the M-step update of is i = 0|n (i ; ). It was also mentioned above that sometimes it is desirable to link to q as being the stationary distribution of q . Then there is an additive contribution to Q( ; ) as above, with the dierence that can now not be chosen freely but is a function of q . As there is no simple formula for the stationary distribution of q , the M-step is no longer explicit. However, once the sums (over k) in (10.39) have been computed for all i and j, we are left with an optimization problem over the qij for which we have an excellent initial guess, namely the standard update (ignoring ) (10.43). A few steps of a standard numerical optimization routine (optimizing over the qij ) is then often enough to nd the maximum of Q( ; ) under the stationarity assumption. Variants of the basic EM strategy, to be discussed in Section 10.5.3, may also be useful in this situation.

10.3 The Example of Normal Hidden Markov Models

371

10.3.3 Recursive Implementation of E-Step An important observation about (10.41)(10.43) is that all expressions are ratios in which both the numerator and the denominator may be interpreted as smoothed expectations of simple additive functionals. As a consequence, the recursive smoothing techniques discussed in Chapter 4 may be used to evaluate separately the numerator and denominator of each expression. The important point here is that to implement the E-step of EM, forward-backward smoothing is not strictly required and it may be replaced by a purely recursive evaluation of the quantities involved in the M-step update. As an example, consider the case of the rst update equation (10.41) that pertains to the means i . For each pre-specied state i, say i = i0 , one can devise a recursive lter to compute the quantities needed to update i0 as follows. First dene the two functionals
n

tn,1 (X0:n ) =
k=0 n

1{Xk = i0 }Yk , 1{Xk = i0 } .


(10.46)

tn,2 (X0:n ) =
k=0

Comparing with the general form considered in Chapter 4, the two functionals above are clearly of additive type. Hence the multiplicative functions {mk }0kn that appear in Denition 4.1.2 are constant and equal to one in this case. Proceeding as in Chapter 4, we associate with the functionals dened in (10.46) the sequence of signed measures n,1 (i ; ) = E [1{Xn = i}tn,1 (X0:n ) | Y0:n ] , n,2 (i ; ) = E [1{Xn = i}tn,2 (X0:n ) | Y0:n ] ,

(10.47)

for i = 1, . . . , r. Note that we adopt here the same convention as for the smoothing distributions, writing n,1 (i ; ) rather than n,1 ({i} ; ). In this context, the expression signed measure is also somewhat pompous because the state space X is nite and n,1 and n,2 can safely be identied with vectors in Rr . The numerator and denominator of (10.41) for the state i = i0 are given by, respectively,
r r

n,1 (i ; )
i=1

and
i=1

n,2 (i ; ) ,
r

which can also be checked directly from (10.47), as i=1 1{Xn = i} is identically equal to one. Recall from Chapter 4 that n,1 and n,2 are indeed quantities that may be recursively updated following the general principle of Proposition 4.1.3. Algorithm 10.3.1 below is a restatement of Proposition 4.1.3 in the context of the nite normal hidden Markov model.

372

10 Maximum Likelihood Inference, Part I

Algorithm 10.3.1 (Recursive Smoothing for a Mean). Initialization: Compute the rst ltering distribution according to 0 (i ; ) = for i = 1, . . . , r, where c0 ( ) = 0,1 (i0 ; ) = 0 (i0 ; )Y0 (i)g0 (i ; ) , c0 ( ) (j)g0 (j ; ). Then

r j=1

and 0,2 (i0 ; ) = 0 (i0 ; ) ,

and both 0,1 (i ; ) and 0,2 (i ; ) are set to zero for i = i0 . Recursion: For k = 0, . . . , n 1, update the ltering distribution k+1 (j ; ) = for j = 1, . . . , r, where
r r r i=1

k (i ; ) qij gk+1 (j ; ) ck+1 ( )

ck+1 ( ) =
j=1 i=1

k (i ; ) qij gk+1 (j ; ) .

Next, k+1,1 (j ; ) =
r i=1 k,1 (i ;

) qij gk+1 (j ; ) ck+1 ( ) + Yk+1 k+1 (i0 ; )i0 (j)

(10.48)

for j = 1, . . . , r, where i0 (j) is equal to one when j = i0 and zero otherwise. Likewise, k+1,2 (j ; ) =
r i=1 k,2 (i ;

) qij gk+1 (j ; ) + k+1 (i0 ; )i0 (j) ck+1 ( ) (10.49)

for j = 1, . . . , r. Parameter Update: When the nal observation index n is reached, the updated mean 0 is obtained as i 0 = i
r i=1 n,1 (i ; r i=1 n,2 (i ;

) . )

It is clear that one can proceed similarly for parameters other than the means. For the same given state i = i0 , the alternative form of the variance update equation given in (10.45) shows that, in addition to tn,1 and tn,2 dened in (10.46), the functional
n

tn,3 (X0:n ) =
k=0

1{Xk = i0 }Yk2

10.3 The Example of Normal Hidden Markov Models

373

is needed to compute the updated variance i0 . The recursive smoother associated with this quantity is updated as prescribed by Algorithm 10.3.1 for 2 tn,1 by simply replacing Yk by Yk . In the case of the transition probabilities, considering a xed pair of states (i0 , j0 ), (10.44) implies that in addition to evaluating n1,2 , one needs to derive a smoother for the functional n

tn,4 (X0:n ) =
k=1

1{Xk1 = i0 , Xk = j0 } ,

(10.50)

where t0,4 (X0 ) is dened to be null. Following Proposition 4.1.3, the associated smoothed quantity n,4 (i ; ) = E [1{Xn = i}tn,4 (X0:n ) | Y0:n ] may be updated recursively according to k+1,4 (j ; ) =
r i=1 k,4 (i ;

) qij gk+1 (j ; ) ck+1 ( ) k (i0 ; ) qi0 j0 gk+1 (j0 ; )j0 (j) + , (10.51) ck+1 ( )

where j0 (j) equal to one when j = j0 and zero otherwise, and ck+1 and k should be computed recursively as in Algorithm 10.3.1. Because 0,4 is null, the recursion is initialized by setting 0,4 (i ; ) = 0 for all states i = 1, . . . , r. The case of the transition probabilities clearly illustrates the main weakness of the recursive approach, namely that a specic recursive smoother must be implemented for each statistic of interest. Indeed, for each time index k, (10.48), (10.49), or (10.51) require of the order of r2 operations, which is comparable with the computational cost of the (normalized) forward or ltering recursion (Algorithm 5.1.1). The dierence is that after application of the complete forward-backward recursions, one may compute all the statistics involved in the EM re-estimation equations (10.41)(10.43). In contrast, the recursive smoothing recursion only provides the smoothed version of one particular statistic: in the case of (10.51) for instance, this is (10.50) with a xed choice of the pair i0 , j0 . Hence implementing the EM algorithm with recursive smoothing requires the order of r2 (n+1)dim() operations, where dim() refers to the number of parameters. In the case of the complete (scalar) normal HMM, dim() equals 2r for the means and variances, plus r (r 1) for the transition probabilities. Hence recursive smoothing is clearly not competitive with approaches based on the forward-backward decomposition. To make it short, the recursive smoothing approach is not a very attractive option in nite state space HMMs and normal HMMs in particular. More precisely, both the intermediate quantity of EM in (10.26) and the gradient of the log-likelihood in (10.29) are additive. In the terminology used in Section 4.1.2, they both correspond to additive functionals of

374

10 Maximum Likelihood Inference, Part I

the form tn+1 (x0:n+1 ) = tn (x0:n ) + sn (xn , xn+1 ). In such cases, smoothing approaches based on the forward-backward decompositions such as Algorithms 5.1.2 or 5.1.3 that evaluate the bivariate smoothing distributions k:k+1|n for k = 0, . . . , n 1 are more ecient because they do not require that the functions {sk }k=0,...,n1 be specied. There are however some situations in which the recursive smoothing approach developed in Section 4.1 and illustrated above in the case of normal HMMs may be useful. First, because it is recursive, it does not require that the intermediate computation results be stored, which is in sharp contrast with the other smoothing approaches where either the forward or backward variables need to be stored. This is of course of interest when processing very large data sets. When the functional whose conditional expectation is to be evaluated is not of the additive type, approaches based on the evaluation of bivariate smoothing distributions are not applicable anymore. In contrast, recursive smoothing stays feasible as long as the functional follows the general pattern of Denition 4.1.2. The most signicant functional of practical interest that is not additive is the second-order derivative of the log-likelihood function. The use of recursive smoothing for this purpose will be illustrated on a simple example in Section 10.3.4.

Finally, another dierent motivation for computing either the intermediate quantity of EM or the gradient of the log-likelihood recursively has to do with recursive estimation. As noted by several authors, including Le Gland and Mevel (1997), Collings and Rydn (1998), and Krishnamurthy and Yin e (2002), being able to compute recursively the intermediate quantity of EM or the gradient of the log-likelihood is a key step toward ecient recursive (also called on-line or adaptive) parameter estimation approaches. It is important however to understand that recursive computation procedures do not necessarily directly translate into recursive estimation approaches. Algorithm 10.3.1 for instance describes how to compute the EM update of the mean i given some observations Y0 , . . . , Yn and a xed current parameter value = . In recursive estimation on the other hand, once a new observation Yk is collected, the parameter estimate, k say, needs to be updated. Using the equations of k substituted for is of course a natural idea, but Algorithm 10.3.1 with not one that is guaranteed to produce the desired result. This is precisely the objective of works such as Le Gland and Mevel (1997) and Krishnamurthy and Yin (2002), to study if and when such recursive approaches do produce expected results. It is fair to say that, as of today, this remains a largely open issue. 10.3.4 Computation of the Score and Observed Information For reasons discussed above, computing the gradient of the log-likelihood is not a dicult task in nite state space HMMs and should preferably be done

10.3 The Example of Normal Hidden Markov Models

375

using smoothing algorithms based on the forward-backward decomposition. The only new requirement is to evaluate the derivatives with respect to that appear in (10.29). In the case of the normal HMM, we already met the appropriate expressions in (10.40), as Fishers identity (10.12) implies that the gradient of the intermediate quantity at the current parameter estimate coincides with the gradient of the log-likelihood. Hence i i qij
n ()

1 i

k|n (i ; )(Yk i ) ,
k=0 n

n ()

1 2

k|n (i ; )
k=0

1 (Yk i )2 2 i i

n n () = k=1

k1:k|n (i, j ; ) . qij

Recall also that the log-likelihood itself is directly available from the ltering recursion, following (5.4). Before considering the computation of the Hessian, we rst illustrate the performance of the optimization methods introduced in Section 10.1.3, which only require the evaluation of the log-likelihood and its gradient. Example 10.3.2 (Binary Deconvolution Model). We consider the simple binary deconvolution model of Capp et al. (1999), which is somewhat e related to the channel coding situation described in Example 1.3.2, except that the channel is unknown. This model is of interest in digital communications (see for instance Krishnamurthy and White, 1992; Kaleh and Vallet, 1994; Fonollosa et al., 1997). It is given by
p

Yk =
i=0

hi Bki + Nk ,

(10.52)

where {Yk }k0 is the observed sequence, {Nk }k0 is a stationary sequence of white Gaussian noise with zero mean and variance , and {Bk }k0 is a sequence of transmitted symbols. For simplicity, we assume that {Bk }k0 is a binary, i.e., Bk {1, 1}, sequence of i.i.d. fair Bernoulli draws. We consider below that p = 1, so that to cast the model into the HMM framework, we only need to dene the state as the vector Xk = (Bk , Bk1 )t , which takes one of the four possible values s1 =
def

1 1

s2 =

def

1 1

,
def

s3 =

def

1 1

s4 =

def

1 1

Hence, upon dening the vector h = (h0 h1 )t of lter coecients, we may view (10.52) as a four-states normal HMM such that i = st h and i = for i i = 1, . . . , 4. The transition matrix Q is entirely xed by our assumption that the binary symbols are equiprobable, and is given by

376

10 Maximum Likelihood Inference, Part I

1/2 1/2 Q= 0 0

0 1/2 0 0 1/2 0 . 1/2 0 1/2 1/2 0 1/2

The model parameters to be estimated are thus the vector h of lter coecients and the common variance . For simplicity, we assume that the distribution of the initial state X0 is known. To make the connection with the general (unconstrained) normal hidden Markov model discussed previously, we need only take into account the facts that h i = si and i / = 1, as all variances are equal. Hence, using the chain rule, the gradient of the intermediate quantity of EM may be evaluated from (10.40) as
4 h Q( ;

)=
i=1

Q( ; ) i
4 n

h i

= and Q( ; ) =
4

k|n (i ; )(Yk si si st h) i
i=1 k=0

(10.53)

i=1

i Q( ; ) i
4 n

1 n 2 i=1

k|n (i ; )
k=0

(Yk st h)2 i 2

(10.54)

The M-step update equations (10.41) and (10.42) of the EM algorithm should thus be replaced by
4 n 1 4 n

h =
i=1 k=0

k|n (i ;

)si st i
i=1 k=0

k|n (i ; )Yk si

1 n

k|n (i ; )(Yk st h )2 i
i=1 k=0 n 2 Yk k=0 i=1 k=0 4 n t

1 = n

k|n (i ; )Yk si

For computing the log-likelihood gradient, we may resort to Fishers iden tity, setting = in (10.53) and (10.54) to obtain h n ( ) and n ( ) , respectively.

10.3 The Example of Normal Hidden Markov Models

377

We now compare the results of the EM algorithm and of a quasi-Newton method for this model. In both cases, the forward-backward recursions are used to compute the smoothing probabilities k|n (i ; ) for k = 0, . . . , n and i = 1, . . . , 4. To avoid parameter constraints, we compute the partial derivative with respect to log rather than with respect to , as the parameter log is unconstrained. This modication is not needed for the EM algorithm, which is parameterization independent. The quasi-Newton optimization is performed using the so-called BFGS weight update and cubic line-searches (see Fletcher, 1987, for details concerning the former). The data set under consideration is the same as in Capp et al. (1999) and e consists of 150 synthetic observations generated with the model corresponding to h0 = 1.3, h1 = 0.6 and = (h2 + h2 )/4 (6 dB signal to noise ratio). There 0 1 are three parameters for this model, and Figures 10.1 and 10.2 show plots of the prole log-likelihood for values of h0 and h1 on a regular grid. The prole log-likelihood is n (h, (h)) with (h) = arg maxv n (h, ), that is, the largest possible log-likelihood for a xed value of h. The gures show that the prole log-likelihood has a global maximum, the MLE, as well as a local one. The location of the local maximum (or maxima) as well as its presence obviously depends on the particular outcome of the simulated noise {Nk }. It is

260 270 280 290 loglikelihood 300 310 320 330 340 350 2 1.5 1 0.5 h0 0 1.5

MLE
3 2

LOC

0.5 h
1

0.5

1.5

Fig. 10.1. Prole log-likelihood surface over (h0 , h1 ) for a particular realization of the binary deconvolution model. The true model parameters are h0 = 1.3 and h1 = 0.6, and 150 observations were taken. The two circled positions labeled MLE and LOC are, respectively, the global maximum of the prole log-likelihood and a local maximum. Also shown are trajectories of 35 iterations of the EM algorithm, initialized at four dierent points marked 14.

378

10 Maximum Likelihood Inference, Part I

260 270 280 290 loglikelihood 300 310 320 330 340 350 2 1.5 1 0.5 h0 0 1.5

MLE
3 2

LOC

0.5 h1

0.5

1.5

Fig. 10.2. Same prole log-likelihood surface as in Figure 10.1. Also shown are trajectories of 5 iterations of a quasi-Newton algorithm, initialized at the same four dierent points marked 14 as in Figure 10.1.

a fundamental feature of the model however that the parameters h = (h0 h1 )t and h = (h1 h0 )t , which govern identical second-order statistical properties of the model, are dicult to discriminate, especially with few observations. Note that as swapping the signs of both h0 and h1 leaves the model unchanged, the prole log-likelihood surface is symmetric, and only the half corresponding to positive values of h0 is shown here. A rst remark is that even in such a simplistic model, there is a local maximum and, depending on the initialization, both algorithms may converge to this point. Because the algorithms operate dierently, it may even occur that the EM and quasi-Newton algorithms initialized at the same point eventually converge to dierent values, as in the case of initialization at point 1. The other important remark is that the EM algorithm (Figure 10.1) shows very dierent convergence behavior depending on the region of the parameter space where it starts: when initialized at point 4, the algorithm gets real close to the MLE in about seven iterations, whereas when initialized at point 1 or 2, it is still far from having reached convergence after 20 iterations. In contrast, the quasi-Newton method (Figure 10.2) updates the parameter by doing steps that are much larger than those of EM, especially during the rst iterations, and provides very accurate parameter estimates with as few as ve iterations. It is fair to say that due to the necessity of evaluating the weight matrix (with nite dierence computations) and to the cubic line-search procedure, each iteration of the quasi-Newton method requires on average seven evaluations of

10.3 The Example of Normal Hidden Markov Models

379

the log-likelihood and its gradient, which means in particular seven instances of the forward-backward procedure. From a computational point of view, the time needed to run the 5 iterations of the quasi-Newton method in this example is thus roughly equivalent to that required for 35 iterations of the EM algorithm. As discussed earlier, computing the observed information in HMMs is more involved, as the only computationally feasible option consists in adopting the recursive smoothing framework of Proposition 4.1.3. Rather than embarking into the general normal HMM case, we consider another simpler illustrative example where the parameter of interest is scalar. Example 10.3.3. Consider a simplied version of the ion channel model (Example 1.3.5) in which the state space X is composed of two states that are (by convention) labeled 0 and 1, and g(0, y) and g(1, y) respectively correspond to the N(0, ) and N(1, ) distributions. This model may also be interpreted as a state space model in which Yk = Xk + Vk , where {Vk } is an i.i.d. N(0, )-distributed sequence, independent of the Markov chain {Xk }, which takes its values in the set {0, 1}. The transition matrix Q of {Xk } is parameterized in the form Q= 0 1 0 1 1 1 .

It is also most logical in this case to assume that the initial distribution of X0 coincides with the stationary distribution associated with Q, that is, (0) = 0 /(0 + 1 ) and (1) = 1 /(0 + 1 ). In this model, the distributions of holding times (number of consecutive steps k for which Xk stays constant) have geometric distributions with expectations (1 0 )1 and (1 1 )1 for states 0 and 1, respectively. We now focus on the computation of the derivatives of the log-likelihood in the model of Example 10.3.3 with respect to the transition parameters 0 and 1 . As they play a symmetric role, it is sucient to consider, say, 0 only. The variance is considered as xed so that the only quantities that depend on the parameter 0 are the initial distribution and the transition matrix Q. We will, as usual, use the simplied notation gk (x) rather than g(x, Yk ) to denote the Gaussian density function (2)1/2 exp{(Yk x)2 /(2)} for x {0, 1}. Furthermore, in order to simplify the expressions below, we also omit to indicate explicitly the dependence with respect to 0 in the rest of this section. Fishers identity (10.12) reduces to 0 =E log (X0 ) + 0
n1

k=0

log qXk Xk+1 Y0:n 0

380

10 Maximum Likelihood Inference, Part I

where the notation qij refers to the element in the (1+i)-th row and (1+j)-th column of the matrix Q (in particular, q00 and q11 are alternative notations for 0 and 1 ). We are thus in the framework of Proposition 4.1.3 with a smoothing functional tn,1 dened by log (x) , 0 sk,1 (x, x ) = log qxx 0 t0,1 (x) =

for k 0 ,

where the multiplicative functions {mk,1 }k0 are equal to 1. Straightforward calculations yield 1 0 (x) 1 (x) , 0 1 1 (0,1) (x, x ) . sk,1 (x, x ) = (0,0) (x, x ) 0 1 0 t0,1 (x) = (0 + 1 )1 Hence a rst recursion, following Proposition 4.1.3. Algorithm 10.3.4 (Computation of the Score in Example 10.3.3). Initialization: Compute c0 =
1 i=0

(i)g0 (i) and, for i = 0, 1,

k (i) = c1 (i)g0 (i) , 0 0,1 (i) = t0,1 (i)0 (i) . Recursion: For k = 0, 1, . . . , compute ck+1 = j = 0, 1,
1 1 i=0 1 j=0

k (i)qij gk (j) and, for

k+1 (j) = c1 k+1


i=0 1

k (i)qij gk (j) ,

k+1,1 (j) = c1 k+1


i=0

k,1 (i)qij gk+1 (j) .

+ k (0)gk+1 (0)0 (j) k (0)gk+1 (1)1 (j) At each index k, the log-likelihood is available via derivative with respect to 0 may be evaluated as 0
1 k k

k l=0

log cl , and its

=
i=0

k,1 (i) .

10.3 The Example of Normal Hidden Markov Models

381

For the second derivative, Louis identity (10.14) shows that 2 2 0 n 0 + E


2

2 =E log (X0 ) + 2 0
n1

n1

k=0

2 log qXk Xk+1 Y0:n 2 0 2 Y0:n . (10.55)

log (X0 ) + 0

k=0

log qXk Xk+1 0

The rst term on the right-hand side of (10.55) is very similar to the case of n,1 considered above, except that we now need to dierentiate the func tions twice, replacing t0,1 and sk,1 by 0 t0,1 and 0 sk,1 , respectively. The corresponding smoothing functional tn,2 is thus now dened by 1 (20 + 1 ) 1 0 (x) + 1 (x) , 2 (0 + 1 )2 (0 + 1 )2 0 1 1 sk,2 (x, x ) = 2 (0,0) (x, x ) (0,1) (x, x ) . 0 (1 0 )2 t0,2 (x) = The second term on the right-hand side of (10.55) is more dicult, and we need to proceed as in Example 4.1.4: the quantity of interest may be rewritten as the conditional expectation of
n1 2

tn,3 (x0:n ) = t0,1 (x0 ) +


k=0

sk,1 (xk , xk+1 )

Expanding the square in this equation yields the update formula tk+1,3 (x0:k+1 ) = tk,3 (x0:k ) + s2 (xk , xk+1 ) + 2tk,1 (x0:k )sk,1 (xk , xk+1 ) . k,1 Hence tk,1 and tk,3 jointly are of the form prescribed by Denition 4.1.2 with incremental additive functions sk,3 (x, x ) = s2 (x, x ) and multiplicative upk,1 dates mk,3 (x, x ) = 2sk,1 (x, x ). As a consequence, the following smoothing recursion holds. Algorithm 10.3.5 (Computation of the Observed Information in Example 10.3.3). Initialization: For i = 0, 1, 0,2 (i) = t0,2 (i)0 (i) . 0,3 (i) = t2 (i)0 (i) . 0,1 Recursion: For k = 0, 1, . . . , compute for j = 0, 1,

382

10 Maximum Likelihood Inference, Part I


1

k+1,2 (j) = c1 k+1


i=0

k,2 (i)qij gk+1 (j) 1 1 k (0)gk+1 (0)0 (j) k (0)gk+1 (1)1 (j) 0 (1 0 )
1

0,3 (j) = c1 k+1

k,3 (i)qij gk+1 (j)


i=0

+ 2 [k,1 (0)gk+1 (0)0 (j) k,1 (0)gk+1 (1)1 (j)] + 1 1 k (0)gk+1 (0)0 (j) + k (0)gk+1 (1)1 (j) 0 (1 0 ) .

At each index k, the second derivative of the log-likelihood satises 2 2 0


k +

2 k

=
i=0

k,2 (i) +
i=0

k,3 (i) ,

where the second term on the left-hand side may be evaluated in the same recursion, following Algorithm 10.3.4. To illustrate the results obtained with Algorithms 10.3.410.3.5, we consider the model with parameters 0 = 0.95, 1 = 0.8, and = 0.1 (using the notations introduced in Example 10.3.3). Figure 10.3 displays the typical aspect of two sequences of length 200 simulated under slightly dierent values of 0 . One possible use of the output of Algorithms 10.3.410.3.5 consists in testing for changes in the parameter values. Indeed, under conditions to be detailed in Chapter 12 (and which hold here), the normalized score n1/2 0 n satises a central limit theorem with variance given by the limit of the normalized information n1 ( 2 /2 ) n . Hence it is expected that 0 Rn =
0 n 2
0 2

be asymptotically N(0, 1)-distributed under the null hypothesis that 0 is indeed equal to the value used for computing the score and information recursively with Algorithms 10.3.410.3.5. Figure 10.4 displays the empirical quantiles of Rn against normal quantiles for n = 200 and n =1,000. For the longer sequences (n =1,000), the result is clearly as expected with a very close t to the normal quantiles. When n = 200, asymptotic normality is not yet reached and there is a signicant bias toward high values of Rn . Looking back at Figure 10.3, even if was equal to zeroor in other words, if we were able to identify without ambiguity the 0 and 1 states from the datathere would not be much information about 0 to be gained from runs of length 200: when 0 = 0.95 and 1 = 0.8, the

10.3 The Example of Normal Hidden Markov Models


0 = 0.95
2 1.5 1

383

Data

0.5 0 0.5 1 0 20 40 60 80 100 120 140 160 180 200

= 0.92
0
2 1.5 1

Data

0.5 0 0.5 1 0 20 40 60 80 100 120 140 160 180 200

Time Index

Fig. 10.3. Two simulated trajectories of length n = 200 from the simplied ion channel model of Example 10.3.3 with 0 = 0.95, 1 = 0.8, and 2 = 0.1 (top), and 0 = 0.92, 1 = 0.8, and 2 = 0.1 (bottom).
n = 200
0.999 0.99 0.999 0.99 0.90 0.5 0.1 0.01 0.001 2 0 2 4 6 8 2 0 2 4

n = 1000

Probability

0.90 0.5 0.1 0.01 0.001

Fig. 10.4. QQ-plot of empirical quantiles of the test statistic Rn (abscissas) for the simplied ion channel model of Example 10.3.3 with 0 = 0.95, 1 = 0.8, and 2 = 0.1 vs. normal quantiles (ordinates). Samples sizes were n = 200 (left) and n =1,000 (right), and 10,000 independent replications were used to estimate the empirical quantiles.

average number of distinct runs of 0s that one can observe in 200 consecutive data points is only about 200/(20 + 5) = 8. To construct a goodness of t test from Rn , one can monitor values of R2 , which asymptotically has a chin square distribution with one degree of freedom. Testing the null hypothesis 0 = 0.95 gives p-values of 0.87 and 0.09 for the two sequences in the top and bottom plots, respectively, of Figure 10.3. When testing at the 10% level, both

384

10 Maximum Likelihood Inference, Part I

sequences thus lead to the correct decision: no rejection and rejection of the null hypothesis, respectively. Interestingly, testing the other way around, that is, postulating 0 = 0.92 as the null hypothesis, gives p-values of 0.20 and 0.55 for the top and bottom sequences of Figure 10.3, respectively. The outcome of the test is now obviously less clear-cut, which reveals an asymmetry in its discrimination ability: it is easier to detect values of 0 that are smaller than expected than the converse. This is because smaller values of 0 means more changes (on average) in the state sequence and hence more usable information about 0 to be obtained from a xed size record. This asymmetry is connected to the upward bias visible in the left plot of Figure 10.4.

10.4 The Example of Gaussian Linear State-Space Models


We now consider more briey the case of Gaussian linear state-space models that form the other major class of hidden Markov models for which the methods discussed in Section 10.1 are directly applicable. It is worth mentioning that Gaussian linear state-space models are perhaps the only important subclass of the HMM family for which there exist reasonable simple noniterative parameter estimation algorithms not based on maximum likelihood arguments but are nevertheless useful in practical applications. These suboptimal algorithms, proposed by Van Overschee and De Moor (1993), rely on the linear structure of the model and use only eigendecompositions of empirical covariance matricesa general principle usually referred to under the denomination of subspace methods (Van Overschee and De Moor, 1996). Keeping in line with the general topic of this chapter, we nonetheless consider below only algorithms for maximum likelihood estimation in Gaussian linear state-space models. The Gaussian linear state-space model introduced in Section 1.3.3 is given in so-called state-space form by (1.7)(1.8), which we recall here: Xk+1 = AXk + RUk , Yk = BXk + SVk , where X0 , {Uk }k0 and {Vk }k0 are jointly Gaussian. The parameters of the model are the four matrices A, R, B, and S. Note that except for scalar models, it is not possible to estimate R and S because both {Uk } and {Vk } are unobservable and hence R and S are only identiable up to an orthonormal matrix. In other words, multiplying R or S by any orthonormal matrix of suitable dimension does not modify the distribution of the observations. Hence the parameters that are identiable are the covariance matrices R = RRt and S = SS t , which we consider below. Likewise, the matrices A and B are identiable up to a similarity transformation only. Indeed, setting Xk = T Xk for some invertible matrix T , that is, making a change of basis for the

10.4 The Example of Gaussian Linear State-Space Models

385

state process, it is straightforward to check that the joint process {(Xk , Yk )} satises the model assumptions with T AT 1 , BT 1 , and T R replacing A, B, and R, respectively. Nevertheless, we work with A and B in the algorithm below. If a unique representation is desired, one may use, for instance, the companion form of A given its eigenvalues; this matrix may contain complex entries though. As in the case of nite state space HMMs (Section 10.2.2), it is not sensible to consider the initial covariance matrix as an independent parameter when using a single observed sequence. On the other hand, for such models it is very natural to assume that is associated with the stationary distribution of {Xk }. Except for the particular case of the scalar AR(1) model however (to be discussed in Example 11.1.2), this option typically renders the EM update equations non-explicit and it is thus standard practice to treat as a xed matrix unrelated to the parameters (Ghosh, 1989). We shall also assume that both R and S are full rank covariance matrices so that all Gaussian distributions admit densities with respect to (multi-dimensional) Lebesgue measure. 10.4.1 The Intermediate Quantity of EM With the previous notations, the intermediate quantity Q( ; ) of EM, dened in (10.26), may be expressed as 1 E n log |R | + 2
n1 1 (Xk+1 AXk )t R (Xk+1 AXk ) Y0:n k=0 n 1 (Yk BXk )t S (Yk BXk ) Y0:n , k=0

1 E (n + 1) log |S | + 2

(10.56) up to terms that do not depend on the parameters. In order to elicit the M-step equations or to compute the score, we dierentiate (10.56) using elementary perturbation calculus as well as the identity C log |C| = C t for an invertible matrix Cwhich is a consequence of the adjoint representation of the inverse (Horn and Johnson, 1985, Section 0.8.2):
n1 A Q( ; 1 ) = R E k=0
1 R Q( ;

t t (AXk Xk Xk+1 Xk ) Y0:n

(10.57)

)=

1 nR 2
n1

(10.58) (Xk+1 AXk )(Xk+1 AXk )t Y0:n

+ E
B Q( ; 1 ) = S E

k=0 n t t (BXk Xk Yk Xk ) Y0:n k=0

(10.59)

386

10 Maximum Likelihood Inference, Part I


1 S Q( ;

)=

1 (n + 1)S 2
n

(10.60)

+ E

(Yk BXk )(Yk BXk )t Y0:n


k=0

Note that in the expressions above, we dierentiate with respect to the inverses of R and S rather than with respect to the covariance matrices themselves, which is equivalent, because we assume both of the covariance matrices to be positive denite, but yields simpler formulas. Equating all derivatives simultaneously to zero denes the EM update of the parameters. We will denote these updates by A , B , R , and S , respectively. To write them down, we will use the notations introduced in Chapter 5: Xk|n ( ) = E [Xk | Y0:n ] and t k|n ( ) = E [Xk Xk | Y0:n ] Xk|n ( )Xk|n ( ), where we now indicate explicitly that these rst two smoothing moments indeed depend on the current estimates of the model parameters (they also depend on the initial covariance matrix , but we ignore this fact here because this quantity is considered as being xed). We also need to evaluate the conditional covariances Ck,k+1|n ( ) = Cov [Xk , Xk+1 | Y0:n ] t t = E [Xk Xk+1 | Y0:n ] Xk|n ( )Xk+1|n ( ) . For Gaussian models, the latter expression coincides with the denition given in (5.99), and hence one may use expression (5.100) to evaluate Ck,k+1|n ( ) during the nal forward recursion of Algorithm 5.2.15. With these notations, the EM update equations are given by
n1 t def

A =
k=0

t Ck,k+1|n ( ) + Xk|n ( )Xk+1|n ( )


n1 1

(10.61)

t k|n ( ) + Xk|n ( )Xk|n ( )


k=0 R =

1 n

n1

t k+1|n ( ) + Xk+1|n ( )Xk+1|n ( )


k=0

(10.62) , (10.63)

t A Ck,k+1|n ( ) + Xk|n ( )Xk+1|n ( )


n t t Xk|n ( )Yk k=0 n 1

B =

t k|n ( ) + Xk|n ( )Xk|n ( )


k=0

10.4 The Example of Gaussian Linear State-Space Models


S

387

1 = n+1

n t t Yk Yk B Xk|n ( )Yk . k=0

(10.64)

In obtaining the covariance update, we used the same remark that made it possible to rewrite, in the case of normal HMMs, (10.42) as (10.45). 10.4.2 Recursive Implementation As in the case of nite state space HMMs, it is possible to implement the parameter update equations (10.61)(10.64) or to compute the gradient (10.57) (10.60) of the log-likelihood recursively in n. Here we only sketch the general principles and refer to the paper by Elliott and Krishnamurthy (1999) in which the details of the EM re-estimation equations are worked out. Proceeding as in Section 4.1, it is clear that all expressions under consideration may be rewritten term by term as the expectation2 E[tn (X0:n ) | Y0:n ] of well chosen additive functionals tn . More precisely, the functionals of interest are of the n1 form tn (x0:n ) = t0 (x0 ) + k=0 sk (xk , xk+1 ), where the individual terms in the sum are of one of the types sk1,1 (xk ) = ht xk , k sk1,2 (xk ) = sk1,3 (xk1 , xk ) = x t Mk x k , k xt Tk1 xk k1 , (10.65) (10.66) (10.67)

and {hk }k0 , {Mk }k0 , and {Tk }k0 , respectively, denote sequences of vectors and matrices with dimension that of the state vectors (dx ) and which may depend on the model parameters or on the observations. For illustration purposes, we focus on the example of (10.63): the rst factor on the right-hand side of (10.63) is a matrix whose ij elements (ith n row, jth column) corresponds to E[ k=0 ht Xk | Y0:n ] for the particular choice k hk = 0 . . . 0 Yk (i) 0 . . . 0 1 . . . j 1 j j + 1 . . . dx
t

(10.68)

Likewise, the ij element of the second factor on the right-hand side of (10.63) before inverting the matrixcorresponds to the expectation of a functional of the second of the three types above with Mk being a matrix of zeros except for a unit entry at position ij. n Let n,1 denote the expectation E[ k=0 ht Xk | Y0:n ] for an additive funck tional of the rst type given in (10.65). To derive a recursion for n,1 , we use the innovation decomposition (Section 5.2.2) to obtain

Note that in this section, we omit to indicate explicitly the dependence with respect to the model parameters to alleviate the notations.

388

10 Maximum Likelihood Inference, Part I


n+1

n+1,1 = E

def

ht Xk Y0:n+1 k

k=0 t n+1|n+1 = hn+1 X n + ht Xk|n k k=0

+ E[Xk
n

1 t n+1 ]n+1 n+1

= ht Xn+1|n+1 + E n+1
k=0 n

ht Xk Y0:n k
1 B t n+1

+
k=0

t ht k|k1 t t k k k+1 . . . n

n+1

rn+1

where (5.93) was used to obtain the last expression, which also features the notation k = A Hk B with Hk being the Kalman (prediction) gain introduced in the statement of Algorithm 5.2.15. The term that we denoted by rn+1 is an intermediate quantity that has some similarities with the variable pk (or more precisely p0 ) that is instrumental in the disturbance smoothing algorithm (Algorithm 5.2.15). The same key remark applies here as rn can be computed recursively (in n) according to the equations r0 = 0 , rn+1 = rn + hn n|n1 t n for n 0 .

Hence the following recursive smoothing algorithm, which collects all necessary steps. Algorithm 10.4.1 (Recursive Smoothing for a Linear Sum Functional). Initialization: Apply the Kalman ltering recursion for k = 0 (Algorithm 5.2.13) and set r0 = 0 , 0 = E[ht X0 | Y0 ] = ht X0|0 . 0 0 Recursion: For n = 1, 2, . . . , run one step of the Kalman ltering and prediction recursions (Algorithms 5.2.9 and 5.2.13) and compute rn = rn1 + hn1 n1|n2 t n1 ,
n

n = E
k=0

1 ht Xk Y0:n = ht Xn|n + n1 + rn B t n k n

10.5 Complements

389

Algorithm 10.4.1 illustrates the fact that as in the case of nite state space models, recursive computation is in general less ecient than is forwardbackward smoothing from a computational point of view: although Algorithm 10.4.1 capitalizes on a common framework formed by the Kalman ltering and prediction recursions, it does however require the update of a quantity (rn ) that is specic to the choice of the sequence of vectors {hk }k0 . To compute the rst factor on the right-hand side of (10.63) for instance, one needs to apply the recursion of Algorithm 10.4.1 for the dy dx possible choices of {hk }k0 given by (10.68). Thus, except for low-dimensional models or particular cases in which the system matrices A, R , B, and S are very sparse, recursive computation is usually not the method of choice for Gaussian linear state-space models (see Elliott and Krishnamurthy, 1999, for a discussion of the complexity of the complete set of equations required to carry out the EM parameter update).

10.5 Complements
To conclude this chapter, we briey return to an issue mentioned in Section 10.1.2 regarding the conditions that ensure that the EM iterations indeed converge to stationary points of the likelihood. 10.5.1 Global Convergence of the EM Algorithm As a consequence of Proposition 10.1.4, the EM algorithm described in Section 10.1.2 has the property that the log-likelihood function can never decrease in an iteration. Indeed, (i+1 ) (i ) Q(i+1 ; i ) Q(i ; i ) 0 . This class of algorithms, sometimes referred to as ascent algorithms (Luenberger, 1984, Chapter 6), can be treated in a unied manner following a theory developed mostly by Zangwill (1969). Wu (1983) showed that this general theory applies to the EM algorithm as dened above, as well as to some of its variants that he calls generalized EM (or GEM). The main result is a strong stability guarantee known as global convergence, which we discuss below. We rst need a mathematical formalism that describes the EM algorithm. This is done by identifying any homogeneous (in the iterations) iterative algorithm with a specic choice of a mapping M that associates i+1 to i . In the theory of Zangwill (1969), one indeed considers families of algorithms by allowing for point-to-set maps M that associate a set M ( ) to each parameter value . A specic algorithm in the family is such that i+1 is selected in M (i ). In the example of EM, we may dene M as M ( ) = : Q( ; ) Q( ; ) for all , (10.69)

390

10 Maximum Likelihood Inference, Part I

that is, M ( ) is the set of values that maximize Q( ; ) over . Usually M ( ) reduces to a singleton, and the mapping M is then simply a point-topoint map (a usual function from to ). But the use of point-to-set maps makes it possible to deal also with cases where the intermediate quantity of EM may have several global maxima, without going into the details of what is done in such cases. We next need the following denition before stating the main convergence theorem. Denition 10.5.1 (Closed Mapping). A map T from points of to subsets of is said to be closed on a set S if for any converging sequences {i }i0 and {i }i0 , the conditions (a) i S, (b) i with i T (i ) for all i 0, imply that T (). Note that for point-to-point maps, that is, if T () is a singleton for all , the denition above is equivalent to the requirement that T be continuous on S. Denition 10.5.1 is thus a generalization of continuity for general (pointto-set) maps. We are now ready to state the main result, which is proved in Zangwill (1969, p. 91) or Luenberger (1984, p. 187). Theorem 10.5.2 (Global Convergence Theorem). Let be a subset of Rd and let {i }i0 be a sequence generated by i+1 T (i ) where T is a point-to-set map on . Let S be a given solution set and suppose that (1) the sequence {i }i0 is contained in a compact subset of ; (2) T is closed over \ S (the complement of S); (3) there is a continuous ascent function s on such that s() s( ) for all T ( ), with strict inequality for points that are not in S. Then the limit of any convergent subsequence of {i } is in the solution set S. In addition, the sequence of values of the ascent function, {s(i )}i0 , converges monotonically to s( ) for some S. The nal statement of Theorem 10.5.2 should not be misinterpreted: that {s(i )} converges to a value that is the image of a point in S is a simple consequence of the rst and third assumptions. It does however not imply that the sequence of parameters {i } is itself convergent in the usual sense, but only that the limit points of {i } have to be in the solution set S. An important property however is that because {s(i(l) )}l0 converges to s( ) for any convergent subsequence {i(l) }, all limit points of {i } must be in the set S = { : s() = s( )} (in addition to being in S). This latter statement means that the sequence of iterates {i } will ultimately approach a set of points that are equivalent as measured by the ascent function s. The following general convergence theorem following the proof by Wu (1983) is a direct application of the previous theory to the case of EM.

10.5 Complements

391

Theorem 10.5.3. Suppose that in addition to the hypotheses of Proposition 10.1.4 (Assumptions 10.1.3 as well as parts (a) and (b) of Proposition 10.1.4), the following hold. (i) H( ; ) is continuous in its second argument, , on . (ii) For any 0 , the level set 0 = : () (0 ) is compact and contained in the interior of . Then all limit points of any instance {i }i0 of an EM algorithm initialized at 0 are in L0 = { 0 : () = 0}, the set of stationary points of with log-likelihood larger than that of 0 . The sequence { (i )} of log-likelihoods converges monotonically to = ( ) for some L0 . Proof. This is a direct application of Theorem 10.5.2 using L0 as the solution set and as the ascent function. The rst hypothesis of Theorem 10.5.2 follows from (ii) and the third one from Proposition 10.1.4. The closedness assumption (2) follows from Proposition 10.1.4 and (i): for the EM mapping M dened in (10.69), i M (i ) amounts to the condition Q(i ; i ) Q( ; i ) for all ,

which is also satised by the limits of the sequences {i } and {i } (if these converge) by continuity of the intermediate quantity Q, which follows from that of and H (note that it is here important that H be continuous with respect to both arguments). Hence the EM mapping is indeed closed on as a whole and Theorem 10.5.3 follows. The assumptions of Proposition 10.1.4 as well as item (i) above are indeed very mild in typical situations. Assumption (ii) however may be restrictive, even for models in which the EM algorithm is routinely used (such as the normal HMMs introduced in Section 1.3.2, for which this assumption does not hold if the variances i are allowed to be arbitrarily small). The practical implication of (ii) being violated is that the EM algorithm may fail to converge to the stationary points of the likelihood for some particularly badly chosen initial points 0 . Most importantly, the fact that i+1 maximizes the intermediate quantity Q( ; i ) of EM does in no way imply that, ultimately, is the global maximum of over . There is even no guarantee that is a local maximum of the loglikelihood: it may well only be a saddle point (Wu, 1983, Section 2.1). Also, the convergence of the sequence (i ) to does not automatically imply the convergence of {i } to a point . Pointwise convergence of the EM algorithm requires more stringent assumptions that are dicult to verify in practice. As an example, a simple corollary of the global convergence theorem states that if the solution set S in Theorem 10.5.2 is a single point, say, then the sequence {i } indeed converges to (Luenberger, 1984, p. 188). The sketch of the proof of this corollary is that every subsequence of {i } has a convergent further subsequence because of the compactness assumption (1), but such a subsequence

392

10 Maximum Likelihood Inference, Part I

admits s as an ascent function and thus converges to by Theorem 10.5.2 itself. In cases where the solution set is composed of several points, further conditions are needed to ensure that the sequence of iterates indeed converges and does not cycle through dierent solution points. In the case of EM, pointwise convergence of the EM sequence may be guaranteed under an additional condition given by Wu (1983) (see also Boyles, 1983, for an equivalent result), stated in the following theorem. Theorem 10.5.4. Under the hypotheses of Theorem 10.5.3, if (iii) i+1 i 0 as i , then all limit points of {i } are in a connected and compact subset of L = { : () = }, where is the limit of the log-likelihood sequence { (i )}. In particular, if the connected components of L are singletons, then {i } converges to some in L . Proof. The set of limit points of a bounded sequence {i } with i+1 i 0 is connected and compact (Ostrowski, 1966, Theorem 28.1). The proof follows because under Theorem 10.5.2, the limit points of {i } must belong to L . 10.5.2 Rate of Convergence of EM Even if one can guarantee that the EM sequence {i } converges to some point , this limiting point can be either a local maximum, a saddle point, or even a local minimum. The proposition below states conditions under which the stable stationary points of EM coincide with local maxima only (see also Lange, 1995, Proposition 1, for a similar statement). We here consider that the EM mapping M is a point-to-point map, that is, that the maximizer in the M-step is unique. To understand the meaning of the term stable, consider the following approximation to the limit behavior of the EM sequence: it is sensible to expect that if the EM mapping M is suciently regular in a neighborhood of the limiting xed point , the asymptotic behavior of the EM sequence {i } follows the tangent linear dynamical system (i+1 ) = M (i ) M ( )
M (

)(i ) .

(10.70)

Here M ( ) is called the rate matrix (see for instance Meng and Rubin, 1991). A xed point is said to be stable if the spectral radius of M ( ) is less than 1. In this case, the tangent linear system is asymptotically stable in the sense that the sequence { i } dened recursively by i+1 = M ( ) i tends to zero as n tends to innity (for any choice of 0 ). The linear rate of convergence of EM is dened as the largest moduli of the eigenvalues of M ( ). This rate is an upper bound on the factors k that appear in (10.17).

10.5 Complements

393

Proposition 10.5.5. Under the assumptions of Theorem 10.1.6, assume that Q( ; ) has a unique maximizer for all and that, in addition, H( ) = and G( ) =
2 2

log f (x ; )

p(x ; ) (dx)

(10.71)

log p(x ; )

p(x ; ) (dx)

(10.72)

are positive denite matrices for all stationary points of EM (i.e., such that M ( ) = ). Then for all such points, the following hold true. (i) M ( ) is diagonalizable and its eigenvalues are positive real numbers. (ii) The point is stable for the mapping M if and only if it is a proper maximizer of () in the sense that all eigenvalues of 2 ( ) are nega tive. Proof. The EM mapping is dened implicitly through the fact that M ( ) maximizes Q( ; ), which implies that

log f (x ; )|=M ( ) p(x ; ) (dx) = 0 ,

using assumption (b) of Theorem 10.1.6. Careful dierentiation of this relation at a point = , which is such that M ( ) = and hence ()|= = 0, gives (Dempster et al., 1977; Lange, 1995, see also)
M (

) = [H( )]1 H( ) +

( ) ,

where H( ) is dened in (10.71). The missing information principleor Louis formula (see Proposition 10.1.6)implies that G( ) = H( )+ 2 ( ) is positive denite under our assumptions. Thus M ( ) is diagonalizable with positive eigenvalues that are the same (counting multiplicities) as those of the matrix A = I +B , where B = [H( )]1/2 2 ( )[H( )]1/2 . Thus M ( ) is stable if and only if B has negative eigenvalues only. The Sylvester law of inertia (see for instance Horn and Johnson, 1985) shows that B has the same inertia (number of positive, negative, and zero eigenvalues) as 2 ( ). Thus all of B s eigenvalues are negative if and only if the same is true for 2 ( ), that is, if is a proper maximizer of . The proof above implies that when is stable, the eigenvalues of M ( ) lie in the interval (0, 1). 10.5.3 Generalized EM Algorithms As discussed above, the type of convergence guaranteed by Theorem 10.5.3 is rather weak but, on the other hand, this result is remarkable as it indeed

394

10 Maximum Likelihood Inference, Part I

covers not only the original EM algorithm proposed by Dempster et al. (1977) but a whole class of variants of the EM approach. One of the most useful extensions of EM is the ECM (for expectation conditional maximization) by Meng and Rubin (1993), which addresses situations where direct maximization of the intermediate quantity of EM is intractable. Assume for instance that the parameter vector consists of two sub-components 1 and 2 , which are such that maximization of Q((1 , 2 ) ; ) with respect to 1 or 2 only (the other sub-component being xed) is easy, whereas joint maximization with respect to = (1 , 2 ) is problematic. One may then use the following algorithm for updating the parameter estimate at iteration i.
i i E-step: Compute Q((1 , 2 ) ; (1 , 2 )); CM-step: Determine i+1 i i i 1 = arg max Q((1 , 2 ) ; (1 , 2 )) , 1

and then
i+1 i+1 i i 2 = arg max Q((1 , 2 ) ; (1 , 2 )) . 2

It is easily checked that for this algorithm, (10.8) is still veried and thus is an ascent function; this implies that Theorem 10.5.3 holds under the same set of assumptions. The example above is only the simplest case where the ECM approach may be applied, and further extensions are discussed by Meng and Rubin (1993) as well as by Fessler and Hero (1995) and Meng and Van Dyk (1997). 10.5.4 Bibliographic Notes The EM algorithm was popularized by the celebrated article of Dempster et al. (1977). It is generally admitted however that several published works predated this landmark paper by describing applications of the EM principle to some specic cases (Meng and Van Dyk, 1997). Interestingly, the earliest example of a complete EM strategy, which also includes convergence proofs (in addition to describing the forward-backward smoothing algorithm discussed in Chapter 3), is indeed the work by Baum et al. (1970) on nite state space HMMs, generalizing the idea put forward by Baum and Eagon (1967). This pioneering contribution has been extended by authors such as Liporace (1982), who showed that the same procedure could be applied to other types of HMMs. The generality of the approach however was not fully recognized until Dempster et al. (1977) and Wu (1983) (who made the connection with the theory of global convergence) showed that the convergence of the EM approach (and its generalizations) is guaranteed in great generality. The fact that the EM algorithm may also be used, with minor modications, for MAP estimation was rst mentioned by Dempster et al. (1977). Green (1990) illustrates a number of practical applications where this option

10.5 Complements

395

plays an important role. Perhaps the most signicant of these is speech processing where MAP estimation, as rst described by Gauvain and Lee (1994), is commonly used for the model adaptation task (that is, re-retraining from sparse data of some previously trained models). The ECM algorithm of Meng and Rubin (1993) (discussed Section 10.5.3) was also studied independently by Fessler and Hero (1995) under the name SAGE (space-alternating generalized EM). Fessler and Hero (1995) also introduced the idea that in some settings it is advantageous to use dierent ways of augmenting the data, that is, dierent ways of writing the likelihood as in (10.1) depending on the parameter subset that one is trying to re-estimate; see also Meng and Van Dyk (1997) for further developments of this idea.

11 Maximum Likelihood Inference, Part II: Monte Carlo Optimization

This chapter deals with maximum likelihood parameter estimation for models in which the smoothing recursions of Chapter 3 cannot be implemented. The task is then considerably more dicult, as it is not even possible to evaluate the likelihood to be maximized. Most of the methods applicable in such cases are reminiscent of the iterative optimization procedures (EM and gradient methods) discussed in the previous chapter but rely on approximate smoothing computations based on some form of Monte Carlo simulation. In this context, the methods covered in Chapters 6 and 7 for simulating the unobservable sequence of states conditionally on the observations play a prominent role. It is important to distinguish the topic of this chapter with a distinct although not entirely disconnectedproblem. The methods discussed in the previous chapters were all based on local exploration (also called hill-climbing strategies) of the likelihood function. Such methods are typically unable to guarantee that the point reached at convergence is a global maximum of the function; indeed, it may well be a local maximum only or even a saddle point see Section 10.5 for details regarding the EM algorithm. Many techniques have been proposed to overcome this signicant diculty, and most of them belong to a class of methods that Geyer (1996) describes as random search optimization. Typical examples are the so-called genetic and simulated annealing algorithms that both involve simulating random moves in the parameter space (see also Section 13.3, which describes a technique related to simulated annealing). In these approaches, the main motivation for using simulations (in parameter space and/or hidden variable space) is the hope to design more robust optimization rules that can avoid local maxima. The focus of the current chapter is dierent, however, as we examine below methods that can be considered as simulation-based extensions of approaches introduced in the previous chapter. The primary objective is here to provide tools for maximum likelihood inference also for the class of HMMs in which exact smoothing is not available.

398

11 Maximum Likelihood Inference, Part II

11.1 Methods and Algorithms


11.1.1 Monte Carlo EM 11.1.1.1 The Algorithm Throughout this section, we use the incomplete data model notations introduced in Section 10.1.2. Recall that the E-step of the EM algorithm amounts to evaluating the function Q( ; ) = log f (x ; )p(x ; ) (dx) (see Denition 10.1.1). We here consider cases where direct numerical evaluation of this expectation under p is not available. The principle proposed by Wei and Tanner (1991)see also Tanner (1993)consists in using the Monte Carlo approach to approximate the intractable E-step with an empirical average based on simulated data:
def 1 Qm ( ; ) = m m

log f ( j ; ) ,
j=1

(11.1)

where 1 , . . . , m are i.i.d. draws from the density p(x ; ). The subscript m in (11.1) reects the dependence on the Monte Carlo sample size. The EM algorithm can thus be modied into the Monte Carlo EM (MCEM) algorithm by replacing Q( ; ) by Qm ( ; ) in the E-step. More formally, the MCEM algorithm consists in iteratively computing a sequence {i } of parameter es timates, given an initial guess 0 , by iterating the following two steps. Algorithm 11.1.1 (MCEM Algorithm). For i = 1, 2, . . . , Simulation step: Draw i,1 , . . . , i,mi conditionally independently given F i1 = (0 , j,l , j = 0, . . . , i 1, l = 1, . . . , mj )
def

(11.2)

from the density p(x ; i1 ). i M-step: Choose to be the (or any, if there are several) value of which m ( ; i1 ), where Qm ( ; i1 ) is as in (11.1) (replacing j by maximizes Q i i i,j ). The initial point is picked arbitrarily and depends primarily on prior belief about the location of the maximum likelihood estimate. Like the EM algorithm, the MCEM algorithm is particularly well suited to problems in which the parametric model {f (x ; ) : } belongs to an exponential family, f (x ; ) = exp( t ()S(x) c())h(x) (see Denition 10.1.5). In this case, the E-step consists in computing a Monte Carlo approximation 1 Si = mi
mi

S( i,j )
j=1

(11.3)

11.1 Methods and Algorithms

399

of the expectation S(x)p(x ; i1 ) (dx). The M-step then consists in opt timizing the function ()S i c(). In many models, this function is convex, and the maximization can be achieved in closed form. In many situations, the simulation of an i.i.d. sample from the density p(x ; i1 ) may turn out dicult. One may then use Markov chain Monte Carlo techniques, in which case i,1 , . . . , i,mi is a sequence generated by an ergodic Markov chain whose stationary distribution is p(x ; i1 ) (see Chapter 6). More precisely, i,j | F i,j1 i1 ( i,j1 , ), j = 2, . . . , mi ,

where, for any , is a Markov transition kernel admitting p(x ; ) as its stationary distribution and F i,j = F i1 ( i,1 , . . . , i,j1 ). Using MCMC complicates the control of the MCEM algorithm because of the nested structure of the iterations: an iterative sampling procedure (MCMC) is used in the inner loop of an iterative optimization procedure (MCEM). Compared to i.i.d. Monte Carlo simulations, MCMC introduces two additional sources of errors. First, for any i and j = 1, . . . , mi , the distribution of i,j is only approximately equal to the density p(x ; i1 ), thus inducing a bias in the estimate. To obtain a reasonably accurate sample, it is customary to include a burn-in period, whose length should ideally depend on the rate at which the MCMC sampler actually mixes, during which the MCMC samples are not used for computing (11.3). The implementation of such procedures typically requires more or less sophisticated schemes to check for convergence. Second, the successive realizations i,1 , . . . , i,mi of the missing data are not independent. This makes the choice of sample size more involved, because the dependence complicates the estimation of the Monte Carlo error. 11.1.1.2 MCEM for HMMs The applications of the MCEM algorithm to HMMs is straightforward. We use the same notations and assumptions as in Section 10.2.2. In this context, Ln (Y0:n ; ) is the likelihood of the observations, log f (x0:n ; ) is the so-called complete data likelihood (10.25), and p(x0:n ; ) is the conditional density of the state sequence X0:n given the observations Y0:n . In this context, MCEM is (at least conceptually) straightforward to implement: one rst simulates mi trajectories of the hidden states X0:n condition ally on the observations Y0:n and given the current parameter estimate i1 ; (11.1) is then computed using the expression of the intermediate quantity of EM given in (10.26). As discussed above, the M-step is usually straightforward at least in exponential families of distributions. To illustrate the method, we consider the following example, which will serve for illustration purposes throughout this section. Example 11.1.2 (MCEM in Stochastic Volatility Model). We consider maximum likelihood estimation in the stochastic volatility model of Example 1.3.13,

400

11 Maximum Likelihood Inference, Part II

Xk+1 = Xk + Uk , Yk = exp(Xk /2)Vk ,

Uk N(0, 1) , Vk N(0, 1) ,

where the observations {Yk }k0 are the log-returns, {Xk }k0 is the logvolatility, and {Uk }k0 and {Vk }k0 are independent sequences of white Gaussian noise with zero mean and unit variance. We analyze daily log-returns, that is, dierences of the log of the series, on the British pound/US dollar exchange rate historical series (from 1 October 1981 to 28 June 1985) already considered in Example 8.3.1. The number of observations is equal to 945. In our analysis, we will assume that the log-volatility process {Xk } is stationary (|| < 1) so that the initial distribution is given by X0 N(0, 2 /(12 )). For this very simple model, the M-step equations are reasonably simple both for the exact likelihoodassuming that the initial state is distributed under the stationary distributionand for the conditional likelihoodassuming that the distribution of X0 does not depend on the parameters. We use the former approach for illustration purposes, although the results obtained on this data set with both methods are equivalent. The stochastic volatility model can naturally be cast into the framework of exponential families. Dene S(X0:n ) = (Si (X0:n ))0i4 by
n1 n

S0 (x0:n ) = x2 , 0 S3 (x0:n ) =

S1 (x0:n ) =
k=0 n

x2 , k

S2 (x0:n ) =
k=1 n

x2 , k

xk xk1 ,
k=1

S4 (x0:n ) =
k=0

2 Yk exp(xk ) . (11.4)

With these notations, the complete data likelihood may be expressed, up to terms not depending on the parameters, as log f (X0:n ; , , ) = F (S(X0:n ) ; , , ) , where the function s = (si )0i4 F (s ; , , ) is given by F (s ; , , ) = 1 n+1 1 n+1 log 2 2 s4 log 2 + log(1 2 ) 2 2 2 2 2 1 (1 )s0 2 s2 2s3 + 2 s1 . 2 2 2

Maximization with respect to yields the update = s4 . n+1 (11.5)

Computing the partial derivative of F (s ; , , ) with respect to 2 yields the relation

11.1 Methods and Algorithms

401

2 (s ; ) =

1 (1 2 )s0 + s2 2s3 + 2 s1 n+1 1 (s0 + s2 ) 2s3 + 2 (s1 s0 ) = n+1

(11.6)

Plugging this value into the partial derivative of F (s ; , , ) with respect to yields an estimation equation for : s0 s3 s1 + 2 + 2 =0. 1 2 (s ; ) (s ; )

The solution of this equation amounts to solving the cubic 3 [n(s1 s0 )] + 2 [(n 1)s3 ] + [s2 + ns0 (n + 1)s1 ] + (n + 1)s3 = 0 . (11.7) Hence the M-step implies the following computations: nd as the root of (11.7), selecting the one that is, in absolute value, smaller than one; determine ( )2 using (11.6); is given by (11.5). To implement the MCEM algorithm, we sampled from the joint smooth ing distribution of X0:n parameterized by i1 using the single-site Gibbs sampler with embedded slice sampler, as described in Example 6.2.16. Initially, the sampler was initialized by setting all Xk = 0, and a burn-in period of 200 sweeps (by a sweep we mean updating every hidden state Xk once in a linear order from X0 to Xn ) was performed before the computation of the samples averages involved in the statistics Sl (for l = 0, . . . , 4) was initialized. Later E-steps did not reset the state variables like this, but rather i1,m started with the nal realization X0:n i1 of the previous E-step (thus done with dierent parameters). The statistics Sl (X0:n ) (for l = 0, . . . , 4) were approximated by averaging over the sampled trajectories letting, for instance, mi n i,j i,j 1 i S3 = mi j=1 k=1 Xk Xk1 . The M-step was carried out as discussed above. Figure 11.1 shows 400 iterations of the MCEM algorithm with 25,000 MCMC sweeps in each step, started from the parameter values = 0.8, = 0.9, and = 0.3. Because the number of sweeps at each step is quite large, the MCEM parameter trajectory can be seen as a proxy for the EM trajectory. It should be noted that the convergence of the EM algorithm is in this case quite slow because the eigenvalues of the rate matrix dened in (10.70) are close to one. The nal estimates are = 0.641, = 0.975, and = 0.165, which agrees with gures given by Sandmann and Koopman (1998) up to the second decimal. A key issue, to be discussed in the following, is whether or not such a large number of MCMC simulation is really needed to obtain the results shown on Figure 11.1. In Section 11.1.2, we will see that by a proper choice of the simulation schedule, that is, of the sequence {mi }i1 , it is possible to obtain equivalent results with far less computational eort.

402

11 Maximum Likelihood Inference, Part II


0.8 0.7

0.6 0.5 0.4 1

0.975 0.95 0.925 0.9 0.3 0.25 0.2 0.15 0.1 0 50 100 150 200 250 300 350 400

Number of Iterations

Fig. 11.1. Trajectory of the MCEM algorithm for the stochastic volatility model and GBP/USD exchange rate data. In the E-step, an MCMC algorithm was used to impute the missing data. The plots show 400 EM iterations with 25,000 MCMC sweeps in each iteration.

11.1.1.3 MCEM Based on Sequential Monte Carlo Simulations The use of Monte Carlo simulationseither Markov chain or i.i.d. ones is not the only available option for approximating the E-step computations. Another approach, suggested by Gelman (1995) (see also Quintana et al., 1999), consists in approximating the intermediate quantity Q( ; i1 ) of EM using importance sampling (see Section 7.1). In this case, we simulate a sample i,1 , . . . , i,mi from an instrumental distribution with density r with respect to the common dominating measure and approximate Q( ; i1 ) by the weighted sum
mi

Qmi ( ;

i1

) =

def j=1

i,j

log f ( ; ) ,

i,j

i,j def

p( i,j ; i1 ) r( i,j ) mi p( i,k ; i1 ) k=1 r( i,k )

. (11.8)

In most implementations of this method reported so far, the instrumental distribution is chosen as the density p(x ; ) for a reference value of the parameter, but other choices can also be valuable. We may keep the same instrumental distribution and therefore the same importance sample during several iterations of the algorithm. Of course, as the iterations go on, the instrumental distribution can become poorly matched to the current target density p(x; i1 ), leading to badly behaved importance sampling estimators. The mismatch between the instrumental and target distributions can be monitored by controlling that the importance weights remain properly balanced.

11.1 Methods and Algorithms

403

For HMMs, importance sampling is seldom a sensible choice unless the number of observations is small (see Section 7.3.1). Natural candidates in this context are the sequential Monte Carlo methods based on resampling ideas discussed in Chapters 7 and 8. In Section 8.3, we considered the general problem of estimating quantities of the form E(tn (X0:n )|Y0:n ; ), when the function tn complies with Denition 4.1.2, based on sequential Monte Carlo simulations. As discussed in Section 10.2.2, the intermediate quantity of EM is precisely of this form with an additive structure given by (10.26). Recall that the same remark also holds for the gradient of the log-likelihood with respect to the parameter vector (Section 10.2.3). For both of these, an approximation of the smoothed expectation can be computed recursively and without storing the complete particle trajectories (see Section 8.3). For the model of Example 11.1.2, the function tn is fully determined by the four statistics dened in (11.4). Recursive particle smoothing for the statistics S0 , S1 , and S3 has already been considered in Example 8.3.1 (see Figures 8.5 and 8.7). The case of the remaining two statistics is entirely similar. Recall from Example 8.3.1 that it is indeed possible to robustify the estimation of such smoothed sum functionals by using xed-lag approximations. The simple method proposed in Example 8.3.1 consists in replacing the smoothing distributions l|n by the xed lag-smoothing distribution l|l+kn for a suitably chosen value of the delay k. The particle approximation to n l=0 s(x)l l|l+kn (dxl ) can be computed recursively using an algorithm n that is only marginally more complex than that used for l=0 s(x)l l|n (dxl ). Results obtained following this approach will be discussed in Example 11.1.3 below. 11.1.2 Simulation Schedules Although the MCEM algorithm provides a solution to intractable E-step, it also raises dicult implementation issues. Intelligent usage of the Monte Carlo simulations is necessary because MCEM can place a huge burden on the users computational resources. Heuristically there is no need to use a large number of simulations during the initial stage of the optimization. Even rather crude estimation of Q( ; i1 ) might suce to drive the parameters toward the region of interest. As the EM iterations go on, the number of simulations should be increased however to avoid zig-zagging when the algorithm approaches convergence. Thus, in making the trade-o between improving accuracy and reducing the computational cost associated with a large sample size, one should favor in creasing the sample size mi as i approaches its limit. Determining exactly how this increase should be accomplished to produce the best possible result is a topic that still attracts much research interest (Booth and Hobert, 1999; Levine and Casella, 2001; Levine and Fan, 2004). Example 11.1.3 (MCEM with Increasing Simulation Schedule). Results comparable to those of the brute force version of the MCEM algorithm

404

11 Maximum Likelihood Inference, Part II


0.8 0.7 40 30 20 10 0.6 0.65 0.7 400 300 0.95 200 100 0.9 0.3 0.25 0.97 0.975 0.98 200 150 100 50 0 50 100 150 200 250 300 350 400 0.15 0.16 0.17 0.18 Density Density Density

0.6 0.5 0.4 1

0.2 0.15 0.1

Number of Iterations

Fig. 11.2. Same model, data, and algorithm as in Figure 11.1, except that the number of MCMC sweeps in the E-step was increased quadratically with the EM iteration number. The plots show results from 400 iterations of the MCEM algorithm with the number of MCMC sweeps ranging from 1 at the rst iteration to 374 at iteration 200 and 1,492 at iteration 400; the total number of sweeps was 200,000. Left: 10 independent trajectories of the MCEM algorithm, with identical initial points. Right: histograms, obtained from 50 independent runs, of the nal values of the parameters.

considered in Example 11.1.2 can in fact can be achieved with a number of sweeps smaller by an order of magnitude. To allow for comparisons with other methods, we set, in the following, the total number of simulations of the missing data sequence to 200,000. Figure 11.2 shows the results when the number of sweeps of the E-step MCMC sampler increases proportionally to the square of the EM iteration number. This increase is quite slow, because many EM iterations are required to reach convergence (see Figure 11.1). The number of sweeps performed during the nal E-step is only about 1500 (compared to the 25,000 for the MCEM algorithm illustrated in Figure 11.1). As a result, the MCEM algorithm is still aected by a signicant fraction of simulation noise in its last iteration. As discussed above, the averaged MCMC simulations may be replaced by time-averages computed from sequential Monte Carlo simulations. To this aim, we consider the SISR algorithm implemented as in Example 8.3.1 with systematic resampling and a t-distribution with 5 degrees of freedom tted to the mode of the optimal instrumental distribution. The SMC approach requires a minimal number of particles to produce sensible output. Hence we cannot adopt exactly the same simulation schedule as in the case of MCMC above, and the number of particles was set to 250 for the rst 100 MCEM

11.1 Methods and Algorithms


0.8 0.7 30 25 15 10 5 0.6 0.65 0.7 150 100 Density 20

405

0.6 0.5 0.4 1

0.95 50 0.9 0.3 0.25

0.97

0.975

0.98 60 50 30 20 10 Density Density Density Density 40

0.2 0.15 0.1 0 50 100 150 200 250 300 350 400 0.15 0.16 0.17

0.18

Number of Iterations

Fig. 11.3. Same model and data as in Figure 11.1. Parameter estimates were computed using an MCEM algorithm employing SISR approximation of the joint smoothing distributions. The plots show results from 400 iterations of the MCEM algorithm. The number of particles was 250 for the rst 100 EM iterations, 500 for iterations 101 to 200, and then increased proportionally to the squared iteration number. The contents of the plots are as in Figure 11.2.
0.8 0.7 100 80 60

0.6 40 0.5 0.4 1 0.6 0.65 20 0.7 400 300

0.95

200 100

0.9 0.3 0.25

0.97

0.975

0.98 150 100 50

0.2 0.15 0.1 0 50 100 150 200 250 300 350 400 0.15 0.16 0.17

0.18

Number of Iterations

Fig. 11.4. Same model and data as in Figure 11.1. Parameter estimates were computed using an MCEM algorithm employing SISR approximation of xed-lag smoothing distributions with delay k = 20. The plots show results from 400 iterations of the MCEM algorithm. The number of particles was as described in Figure 11.3 and the contents of the plots are as in Figure 11.2.

Density

406

11 Maximum Likelihood Inference, Part II

iterations, 500 for iterations 101 to 200, and then increases proportionally to the square of the MCEM iteration number. The total number of simulations is also equal to 200,000 in this case. The MCEM algorithm was run using both the particle approximation of the joint smoothing distributions and that of the xed-lag smoothing distributions. Figure 11.3 shows that the implementation based on joint smoothing produces highly variable parameter estimates. This is coherent with the behavior observed in Example 8.3.1. Given that the number of observations is already quite large, it is preferable to use xed-lag smoothing (here with a lag k = 20), as the bias introduced by this approximation is more than compensated by the reduction in the Monte Carlo error variance. As shown in Figure 11.4, the behavior of the resulting algorithm is very close to what is obtained using the MCEM algorithm with MCMC imputation of the missing data. When comparing to Figure 11.2, the level of the Monte Carlo error appears to be reduced in Figure 11.4, and the bias introduced by the xed-lag smoothing approximation is hardly perceptible. 11.1.2.1 Automatic Schedules From the previous example, it is obvious that it is generally advantageous to vary the precision of the estimate of the intermediate quantity Q( ; i1 ) with i approaches a i, and in particular to increase this precision as i grows and limit. In the example above, this was accomplished by increasing the number of sweeps of the MCMC sampler or by increasing the number of particles of the SMC algorithm. So far, the increase was done in a deterministic fashion, and such deterministic schedules may also be given theoretical support (see Section 11.2.3). Deterministic schemes are appealing because of their simplicity, but it is obvious that because there are only few theoretical guidelines on how to choose mi , nding an appropriate schedule is in general not straightforward. It has often been advocated that using automatic, or adaptive, procedures to choose mi would be more appropriate. To do so, it is required to deter mine, at each iteration, an estimate of the Monte Carlo error Qmi ( ; i1 ) i1 ). The dependence of this error with respect to mi should also be Q( ; known or determined from the output of the algorithm. Such data-driven procedures require gauging the Monte Carlo errors, which is, in general, a complicated task. Booth and Hobert (1999) present an automatic method that requires independent Monte Carlo sample in the E-step. Independent simulations allow for computationally inexpensive and straightforward assessment of Monte Carlo error through an application of the central limit theorem. Such independent sampling routines are often unavailable in practical implementations of the MCEM algorithm however, requiring MCMC or SMC algorithms to obtain relevant Monte Carlo samples. Levine and Casella (2001) present a method for estimating the simulation error of a Monte Carlo E-step using MCMC samples. Their procedure is based on regenerative methods for MCMC simulations and amounts to nding renewal periods across which the

11.1 Methods and Algorithms

407

MCMC trajectories are independent (see for instance Hobert et al., 2002). By subsampling the chain between regeneration times, Monte Carlo error may be assessed through the CLT for independent outcomes in a manner analogous to Booth and Hobert (1999). For phi-irreducible Markov chains, such renewal periods can be obtained using the splitting procedure, which requires determining small sets (see Section 14.2 for denitions of the concepts mentioned here). A drawback of this approach is that it may be dicult, if not impossible, to establish the minorization condition necessary for implementing the regenerative simulation procedure. Once such a minorization condition has been established however, implementing the procedure is nearly trivial. Both of the automatic procedures mentioned above are able to decide when to increase the Monte Carlo sample size, but the choice of sample size at each such instance is arbitrary. Levine and Fan (2004) present a method that overcomes the limitations of the previous algorithm. The Monte Carlo error is gauged directly using a subsampling technique, and the authors use asymptotic results to construct an adaptive rule for updating the Monte Carlo sample size. Despite their obvious appeal, automatic methods suer from some drawbacks. First, the estimation of the Monte Carlo error induces a computational overhead that might be non-negligible. Second, because the number of simulations at each iteration is random, the total amount of computation cannot be xed beforehand; this may be inconvenient. Finally, the convergence of the proposed schemes are based on heuristic arguments and have not been established on rm grounds. 11.1.2.2 Averaging There is an alternative to automatic selection of the Monte Carlo sample size, developed by Fort and Moulines (2003), which is straightforward to implement and most often useful. This method is inspired by the averaging procedure originally proposed by Polyak (1990) to improve the rate of convergence of stochastic approximation procedures. To motivate the construction of the averaging procedure, note that pro vided that the sequence {i } converges to a limit , each value of i may itself be considered as an estimator of the associated limit . Theorem 11.2.14 as serts that the variance of i is of order 1/mi . Thus, in the idealized situation where the random perturbations i would also be uncorrelated, it is well-known that it is possible to obtain an improved estimator of by combining the individual estimates i in proportion of the inverse of their variance (this is the minimum variance estimate of ). This optimal linear combination has a variance that decreases as 1/ i mi , that is, the total number of simulations rather than the nal number of simulations. Although the MCEM perturbations i are not uncorrelated, even when using i.i.d. Monte Carlo simulation, due to the dependence with respect to , Fort and Moulines (2003) suggested using the averaged MCEM estimator

408

11 Maximum Likelihood Inference, Part II


i

def i =
j=i0

mj i k=i0

mk

j ,

for i i0 ,

(11.9)

where i0 is the iteration index at which computation of the average is started. In general, it is not recommended to start averaging too early, when the algorithm is still in its transient phase. Example 11.1.4 (Averaging). In Example 11.1.3, the number of sweeps is increased quite slowly and the number of sweeps during the nal EM iterations is not large (about 1500). This scheme is advantageous in situations when the EM algorithm is slow, because a large number of iterations can be performed while keeping the total of number of simulations moderate. The problem is rather that the simulation noise at convergence is still signicant (see Figure 11.2). This is a typical situation in which averaging can prove to be very helpful. As seen in Figure 11.5, averaging reduces the noise when the parameters are in the neighborhood of their limits. Averaging is also benecial when the EM statistics are estimated using sequential Monte Carlo (see Figure 11.6). 11.1.3 Gradient-based Algorithms As discussed in Section 10.2.3, computation of the gradient of the loglikelihood is very much related to the E-step of EM as a consequence of Fishers identity (Proposition 10.1.6). It is thus rather straightforward to derive Monte Carlo versions of the gradient algorithms introduced in Section 10.1.3. At the ith iteration, one may for example approximate the gradient of the log likelihood (i1 ), where i1 denotes the current parameter estimate, by
mi (

i1 ) = 1 mi

mi j=1

log f ( i,j ; i1 ) ,

(11.10)

where i,1 , . . . , i,mi is an i.i.d. sample from the density p(x ; i1 ) or a rei1 ) as its stationary alization of an ergodic Markov chain admitting p(x ; density. It is also possible to use importance sampling; if i,1 , . . . , i,mi is a sample from the instrumental distribution r, then the IS estimate of (i1 ) is
mi m (

i1 ) =
j=1

i,j

log f ( i,j ; i1 ),

i,j =

p( i,j ; i1 ) r( i,j ) mi p( i,k ; i1 ) k=1 r( i,k )

(11.11) As in the case of MCEM, it is likely that for HMMs, importance sampling strategies become unreliable when the number of observations increases. To circumvent the problem, one may use sequential Monte Carlo methods such

11.1 Methods and Algorithms


0.8 0.7 60 50 30 20 10 0.6 0.65 0.7 800 600 0.95 400 200 0.9 0.3 0.25 0.97 0.975 0.98 200 150 100 50 0 50 100 150 200 250 300 350 400 0.15 0.16 0.17 0.18 Density Density Density 40

409

0.6 0.5 0.4 1

0.2 0.15 0.1

Number of Iterations

Fig. 11.5. Same model, data, and algorithm as in Figure 11.2, except that averaging according to (11.9) was used to smooth the sucient statistics of the E-step; averaging was started after i0 = 200 iterations. The plots show results from 400 iterations of the MCEM algorithm. The contents of the plots are as in Figure 11.2.
0.8 0.7 150 100 50

0.6 0.5 0.4 1 0.6 0.65

0.7 800 600

0.95

400 200

0.9 0.3 0.25

0.97

0.975

0.98 200 150 100 50 Density

0.2 0.15 0.1 0 50 100 150 200 250 300 350 400 0.15 0.16 0.17

0.18

Number of Iterations

Fig. 11.6. Same model, data, and algorithm as in Figure 11.4, except that averaging according to (11.9) was used to smooth the sucient statistics of the E-step; averaging was started after i0 = 200 iterations. The plots show results from 400 iterations of the MCEM algorithm. The contents of the plots are as in Figure 11.2.

Density

Density

410

11 Maximum Likelihood Inference, Part II

as SISR where (11.11) is not computed directly but rather constructed recursively (in time) following the approach discussed in Section 8.3 and used in the case of MCEM above. Details are omitted because the gradient of the log-likelihood (10.29) and the intermediate quantity of EM (10.26) are very similar. For models that belong to exponential families, the only quantities that need to be computed in both cases are the smoothed expectation of the sucient statistics, and hence both computations are exactly equivalent. Louiss identity (see Proposition 10.1.6) suggests an approximation of the Hessian of () at i1 of the form 1 Jmi (i1 ) = mi
mi 2 j=1

log f ( i,j ; i1 ) +
mi (

1 mi
2

mi j=1

log f ( i,j ; i1 )

i1 )

where i,1 , . . . , i,mi are as above, for a vector a we have used the notation a2 = aat , and the estimate of the gradient in the nal term on the righthand side may be chosen, for instance, as in (11.10). Using this approximation of the Hessian, it is possible to formulate a Monte Carlo version of the Newton-Raphson procedure. This algorithm was rst proposed by Geyer and Thompson (1992) in an exponential family setting and then generalized by Gelfand and Carlin (1993). Gelman (1995) proposed a similar algorithm in which importance sampling is used as the Monte Carlo method. Now assume that we have, with the help of a Monte Carlo approximation of the gradient and possibly also the Hessian, selected a search direction. The next step is then to determine an appropriate value of the step size (see Section 10.1.3). This is not a simple task, because the objective function () cannot be evaluated analytically, and therefore it is not possible to implement a line searchat least not in an immediate way. A simple option consists in using a step size that is small but xed (see Dupuis and Simha, 1991), and to let mi as suciently fast as i . If we want to optimize the step size, we have to approximate the objective function in the search direction. We may for example follow the method proposed by Geyer and Thompson (1992), which consists in approximating (locally) the ratio L()/L(i1 ) by
mi

j=1

f ( i,j ; ) , f ( i,j ; i1 )

where { i,j } are the samples from p(x ; i1 ) used to determine the search direction. Under standard assumptions, the sum of this display converges in probability as mi to f (x ; ) L() p(x ; i ) (dx) = . i1 ) f (x ; L(i1 )

11.1 Methods and Algorithms

411

This suggests approximating the dierence () (i1 ) in a neighborhood i1 of by mi 1 f ( i,j ; ) . (11.12) log mi j=1 f ( i,j ; i1 ) This type of approximation nevertheless needs to be considered with some care because the search direction is not necessarily an ascent direction for this approximation of the objective function due to the Monte Carlo errors. To the best of our knowledge, this type of approximation has not been thoroughly investigated in practice. As for the MCEM algorithm, it is not necessary to estimate the objective function and its gradient with high accuracy during the initial optimization steps. Therefore, the Monte Carlo sample sizes should not be taken large at the beginning of the procedure but should be increased when the algorithm approaches convergence. Procedures to adapt the sample size mi at each iteration are discussed and analyzed by Sakalauskas (2000, 2002) for gradient algorithms using a (small enough) xed step size. The suggestion of this author is to increase mi proportionally to the inverse of the squared norm of the (estimated) gradient at the current parameter estimate. If this proportionality factor is carefully adjusted, it may be shown, under a set of restrictive conditions, that the Monte Carlo steepest ascent algorithm converges almost surely to a stationary point of the objective function. It is fair to say that in the case of general state space HMMs, gradientbased methods as less popular than their counterparts based on the EM paradigm. An important advantage of EM based methods in this context is that they are parameterization independent (see Section 10.1.4 for further discussion). This property means that the issue of selecting a proper step size which is problematic in simulation-based approaches as discussed above has no counterpart for EM-based methods, which are scale-free. Remember that it is also precisely the reason why the EM approach sometimes converges much more slowly than gradient-based methods. 11.1.4 Interlude: Stochastic Approximation and the Robbins-Monro Approach Stochastic approximation is a general term for methods that recursively search for an optimum or zero of a function that can only be observed disturbed by some noise. The original work in the stochastic approximation literature was by Robbins and Monro (1951), who developed and analyzed a recursive procedure for nding the root(s) of the equation h() = 0. If the function h was known, a simple procedure to nd a root consists in using the elementary algorithm i = i1 + i h(i1 ) , (11.13)

412

11 Maximum Likelihood Inference, Part II

where {i } is a sequence of positive step sizes. In many applications, the evaluation of h() cannot be performed, either because it is computationally prohibitive or analytical formulas are simply not available, but noise-corrupted observations of the function can be obtained for any value of the parameter Rd . One could then, for instance, consider using the procedure (11.13) but with h() replaced by an accurate estimate of its value obtained by averaging many noisy observations of the function. It was recognized by Robbins and Monro (1951) that averaging a large number of observations of the function at i1 is not always the most ecient solution. Indeed, the value of the function h(i1 ) is only of interest in so far that it leads us in the right direction, and it is not unreasonable to expect that this happens, at least on the average, even if the estimate is not very accurate. Robbins and Monro (1951) rather proposed the algorithm i = i1 + i Y i , where i is a deterministic sequence satisfying i > 0,
i

(11.14)

lim i = 0,
i

i = ,

and Y i is a noisy observation of h(i1 ). Although the analysis of the method is certainly simpler when the noise sequence {Y i h(i1 )}i1 is i.i.d., in many i i1 ) depends on i1 and sometimes practical applications the noise Y h( j and Y j , for j i 1 (see for instance Benveniste et al., on past values of 1990; Kushner and Yin, 2003). Using a decreasing step size implies that the parameter sequence {i } moves slower as i goes to innity; the basic idea is that decreasing step sizes provides an averaging of the random errors committed when evaluating the function h. Ever since the introduction of the now classic Robbins-Monro algorithm, stochastic approximation has been successfully used in many applications and has received wide attention in the literature. The convergence of the stochastic approximation scheme is also a question of importance that has been addressed under a variety of conditions, which cover most of the applications (see for instance Benveniste et al., 1990; Duo, 1997; Kushner and Yin, 2003). 11.1.5 Stochastic Gradient Algorithms We now come back to the generic incomplete data model, considering several ways in which the stochastic approximation approach may be put in use. The rst obvious option is to apply the Robbins-Monro algorithm to determine the roots of the equations () = 0, yielding the following recursions i = i1 + i

log f ( i ; i1 ) ,

(11.15)

where i is a sample from the density p(x ; i1 ). That is, dening the ltration i i1 0 , 0 , . . . , i1 ), {F } such that F = (

11.1 Methods and Algorithms

413

i | F i1 p(; i1 ) . Thus Y i = log f ( i ; i1 ) can be considered as a noisy measurement of i1 ) because of the Fisher identity, E[Y i | F i1 ] = (i1 ). Hence we ( i i1 ) + i , with can write Y = ( i =

log f ( i ; i1 ) E[

log f ( i ; i1 ) | F i1 ] ;

obviously { i } is an {F i }-adapted martingale dierence sequence. Often it is not possible to sample directly from the density p(x ; i1 ). One can then replace this draw by iterations from a Markov chain admitting p(x ; i1 ) as its stationary density. Then E[Y i | F i1 ] does no longer equal i1 ), but rather ( i | F i1 i1 ( i1 , ) , (11.16)

where for any , is a transition kernel of an ergodic Markov chain with stationary density p(x ; ). Such algorithms were considered by Younes (1988, 1989) for maximum likelihood estimation in partially observed Gibbs elds. They were later extended by Gu and Kong (1998) to maximum likelihood estimation in general incomplete data problems by (see also Gu and Li, 1998; Delyon et al., 1999, Section 8). In this case, the noise structure is more complicated and analysis and control of the convergence of such algorithms become intricate (see Andrieu et al., 2005, for results in this direction). Several improvements can be brought to this scheme. First, it is sometimes recommendable to run a certain number, say m, of simulations before updating the value of the parameter. That is, 1 m i,j i1 i = i1 + i ; ) , (11.17) log f ( m
j=1

where i,1 , . . . , i,m are draws from p(x ; i1 ). Choosing m > 1 is generally benecial in that it makes the procedure more stable and saves computational time. The downside is that there are few theoretical guidelines on how to set this number. The above algorithm is very close to the Monte Carlo version of the steepest ascent method. Another possible improvement, much in the spirit of quasi-Newton algorithms, is to modify the search direction by letting 1 m i,j i1 i = i1 + i W i ; ) , (11.18) log f ( m
j=1

where W i is a properly chosen weight matrix (see for instance Gu and Li, 1998; Gu and Kong, 1998). One of the main appeals of stochastic approximation is that, at least in principle, the only decision that has to be made is the choice of the step size

414

11 Maximum Likelihood Inference, Part II

schedule. Although in theory the method converges for a wide variety of step sizes (see Section 11.3), in practice the choice of step sizes inuences the actual number of simulations needed to take the parameter estimate into the neighborhood of the solution (transient regime) and its uctuations around the solution (misadjustment near convergence). Large step sizes generally speed up convergence to a neighborhood of the solution but fail to mitigate simulation noise. Small step sizes reduce noise but cause slow convergence. Heuristically, it is appropriate to use large step sizes until the algorithm reaches a neighborhood of the solution and then to switch to smaller step sizes (see for instance Gu and Zhu, 2001, for applications to the stochastic gradient algorithm). A way to alleviate the step size selection problem is to use averaging as in Section 11.1.2. Polyak (1990) (see also Polyak and Juditsky, 1992) showed that if the sequence of step sizes {i } tends to zero slower than 1/i, yet fast enough to ensure convergence at a given rate, then the running average
i

def i = (i i0 + 1)1
j=i0

j ,

i i0 ,

(11.19)

converges at an optimal rate. Here i0 is an index at which averaging starts, so as to discard the very rst steps. This result implies that one should adopt step sizes larger than usual but in conjunction with averaging (to control the increased noise due to use of the larger step sizes). The practical value of averaging has been reported in many dierent contextssee (Kushner and Yin, 2003, Chapter 11) for a thorough investigation averaging, as well as (Delyon et al., 1999). 11.1.6 Stochastic Approximation EM We now consider a variant of the MCEM algorithm that may also be interpreted as a stochastic approximation procedure. Compared to the stochastic gradient approach discussed in the previous section, this algorithm is scale-free in the sense that the step sizes are positive numbers restricted to the interval [0, 1]. Compared to the MCEM approach, the E-step involves a weighted average of the approximations of the intermediate quantity of EM obtained in the current as well as in the previous iterations. Hence there is no need to increase the number of replications of the missing data as in MCEM. Algorithm 11.1.5 (Stochastic Approximation EM). Given an initial pa rameter estimate 0 and a decreasing sequence of positive step sizes {i }i1 such that 1 = 1, do, for i = 1, 2 . . . , Simulation: Draw i,1 , . . . , i,m from the conditional density p(x ; i1 ). i Maximization: Compute as the maximum of the function Qi () over the feasible set , where

11.1 Methods and Algorithms

415

1 Qi () = Qi1 () + i m

log f ( i,j ; ) Qi1 ()


j=1

(11.20)

This algorithm, called the stochastic approximation EM (SAEM) algorithm, was proposed by Cardoso et al. (1995) and further analyzed by Delyon et al. (1999) and Kuhn and Lavielle (2004). To understand why this algorithm can be cast into the Robbins-Monro framework, consider the simple case where the complete data likelihood is from an exponential family of distributions. In this case, the SAEM algorithm consists in updating, at each iteration, the current estimates (S i , i ) of the complete data sucient statistic and of the parameter. Each iteration of the algorithm is divided into two steps. In a rst step, we draw i,1 , . . . , i,m from the conditional density p(x ; i1 ) and update i according to S m 1 S( i,j ) S i1 . (11.21) S i = S i1 + i m j=1 In a second step, we compute i as the maximum of the function t ()S i c(). Assume that the function t ()s c() has a single global maximum, m denoted (s) for all feasible values of S i . The dierence m1 j=1 S( i,j ) S i1 can then be considered as a noisy observation of a function h(S i1 ), where h(s) = S(x)p(x ; (s)) (dx) s . (11.22) Thus (11.21) ts into the Robbins-Monro when considering the sucient statistic s rather than the associated parameter (s). This Robbins-Monro procedure searches for the roots of h(s) = 0, that is, the values of s satisfying S(x)p(x ; (s)) (dx) = s . Assume that this equation has a solution s and put = (s ). Now note that Q( ; ) = t () S(x)p(x ; ) (dx) c() = t ()s c() ,

and by denition the maximum of the right-hand side of this display is obtained at . Therefore, an iteration of the EM algorithm started at will stay at , and we nd that each root s is associated to a xed point of the EM algorithm. The SAEM algorithm is simple to implement and has proved to be reasonably successful in dierent applications. Compared to the stochastic gradient procedure, SAEM inherits from the expectation-maximization algorithm

416

11 Maximum Likelihood Inference, Part II

most of the properties that made the success of the EM approach (for instance, the simplicity with which it deals with parameter constraints). One of these properties is invariance with respect to the parameterization. With the SAEM algorithm, the scale of the step sizes {i } is xed irrespectively of the parameterization as 1 equals 1. As in the case of the stochastic gradient, however, the rate of decrease of the step sizes strongly inuences the practical performance of the algorithm. In particular, if the convergence rate of the EM algorithm is already slow, it is unwise to choose fast decreasing step sizes, thereby even further slowing down the method. In contrast, if EM converges fast, then large step sizes introduce an unnecessary amount of extra noise, which should be avoided. Here again, the use of averaging is helpful in reducing the impact of the choice of the rate of decrease of the step sizes. Example 11.1.6. We implemented the SAEM algorithm for the stochastic volatility model and data described in Example 11.1.2, and the results are displayed in Figure 11.7. In each iteration of the algorithm, a single realization of the missing data was obtained using a sweep of the Gibbs sampler. This draw was used to update the stochastic approximation estimate of the complete data sucient statistics, which were then used to update the parameter estimate. The only tuning parameter is the sequence of step size n . Here again the theory of stochastic approximation does not tell much about the optimal way to choose this sequence. In view of the above discussion, we used slowly decreasing step sizes (n = n0.6 ) to speed up convergence toward the region of interest. As seen in Figure 11.7, the parameters estimates obtained using this implementation of SAEM are rather noisy. In order to reduce the uctuations, we performed averaging, computing
i

i = (i i0 + 1)1
j=i0

j ,

i i0 ,

(11.23)

where i0 was set to 100,000. Averaging is useful only when the parameter approaches convergence and should be turned o during the initial steps of the algorithm. Figure 11.8 shows results for the SAEM algorithm with averaging. Figures 11.7 and 11.8 should be compared with Figures 11.2 and 11.5, respectively, which involve the same sampler and the same overall number of simulations but were obtained using the MCEM strategy. Both procedures (SAEM and MCEM) provides comparable results. 11.1.7 Stochastic EM The stochastic EM (SEM) algorithm is a method that shares many similarities with the stochastic approximation EM algorithm. The SEM algorithm was initially proposed as a means to estimate parameters of mixtures distributions (Celeux and Diebolt, 1985, 1990), but the concept can easily be generalized to cover more general incomplete data models. The basic idea is

11.1 Methods and Algorithms


0.8 0.7 60 50 30 20 10 0.6 0.65 0.7 600 500 0.95 300 200 100 0.9 0.3 0.25 0.97 0.975 0.98 300 250 150 100 50 0 0.5 1 1.5 x 10 2
5

417

0.6 0.5 0.4 1

0.2 0.15 0.1 0.15 0.16 0.17

0.18

Number of Iterations

Fig. 11.7. Parameter estimation in the stochastic volatility model with GBP/USD exchange rate data, using the SAEM algorithm with MCMC simulations. The plots show results from 200,000 iterations of the SAEM algorithm with step sizes n = n0.6 . The contents of the plots are as in Figure 11.2.
0.8 0.7 60 50 30 20 10 0.6 0.65 0.7 1000 800 600 0.95 400 200 0.9 0.3 0.25 0.97 0.975 0.98 300 250 150 100 50 0 0.5 1 1.5 x 10 2
5

0.6 0.5 0.4 1

Density Density Density

40

200

0.2 0.15 0.1 0.15 0.16 0.17

0.18

Number of Iterations

Fig. 11.8. Same model, data, and algorithm as in Figure 11.7, except that averaging was used starting at 100,000 iterations. The plots show results from 200,000 iterations of the SAEM algorithm. The contents of the plots are as in Figure 11.2.

Density

200

Density

400

Density

40

418

11 Maximum Likelihood Inference, Part II

to construct an ergodic homogeneous Markov chain whose stationary distribution is concentrated around the maximum likelihood estimate. SEM is an iterative algorithm in which each iteration proceeds in two steps. In a rst step, the stochastic imputation step, the missing data is drawn from the con ditional density p(x ; i1 ), where i1 is the current parameter estimate. In a second step, the maximization step, a new parameter estimate i is obtained as the maximizer of the complete data likelihood function with the missing data being that imputed in the simulation step. The algorithm thus alternates between simulating (imputing) missing data and computing parameter estimates. In a more general formulation, one may draw several replications of the missing data in the simulation step and use the average of the corresponding complete data log-likelihood functions to obtain a new parameter estimate. Algorithm 11.1.7 (Stochastic EM Algorithm). Simulation: Draw i,1 , . . . , i,m from the conditional density p(x ; i1 ). i as the maximum of the function Qi () over the feasible Maximization: Compute set , where m 1 log f ( i,j ; ) . (11.24) Qi () = m j=1 The main dierence between SAEM and SEM is the sequence of decreasing step sizes used in the SAEM approach to smooth the intermediate quantities of EM estimated in successive iterations. In the SEM algorithm, these step sizes are non-decreasing, i = 1, so there is no averaging of the Monte Carlo error as the iterations progress. The SEM iteration is also obviously identical to the MCEM iteration (see Algorithm 11.1.1) where the dierence only lies in the fact that the number of simulated replications of the missing data is not increased with the iteration index. If i,1 , . . . , i,m are conditionally independent given F i1 dened in (11.2), with common density p(x; i1 ), then {i } is a homogeneous Markov chain. Under a set of (rather restrictive) technical conditions, this chain can be shown to be ergodic (Diebolt and Ip, 1996; Nielsen, 2000). Then, as the number of iterations i tends to innity, the distribution of i converges in total varia tion distance to the distribution of a random variable . The distribution of this random variable is in general dicult to characterize, but, under additional technical assumptions, this stationary distribution may be shown to converge in the sense that as the number of observations increases, it becomes increasingly concentrated around the maximum likelihood estimator (Nielsen, 2000). With SEM, a point estimate can be obtained, for example, by computing sample averages of the simulated parameter trajectories. The theory of the SEM algorithm is dicult even for elementary models, and the available results are far from covering sophisticated setups like continuous state-space HMMs. This is particularly true in situations where imputation of missing data is done using an MCMC algorithm, which clearly adds an addition level of diculty.

11.2 Analysis of the MCEM Algorithm


1 0.8 6 4 2 0 60 40

419

0.6 0.4 1

0.6

0.8

0.95 20 0.9 0.3 0.2 0 15 10 5 0 0.3

0.9

0.95

0.1 0

0.5

1.5 x 10

2
5

0.1

0.2

Number of iterations

Fig. 11.9. Parameter estimation in the stochastic volatility model with GBP/USD exchange rate data, using an SEM algorithm. The plots show results from 200,000 iterations of the SEM algorithm with a single replication of the missing data imputed in each iteration. Left: 200,000 iterations of a single trajectory of SEM. Right: histograms, computed from the second half of the run, of parameter estimates.

Example 11.1.8. Figure 11.9 displays one trajectory of parameter estimates obtained with the SEM algorithm for the stochastic volatility model and data described in Example 11.1.2, using one sweep of the Gibbs sampler to simulate the unobserved volatility sequence at each iteration. The histograms of the parameters have a single mode but are highly skewed and show great variability (note that the x-scales are here much larger than in previous gures). The empirical averages for the three parameters are = 0.687, = 0.982, = 0.145, which do not coincide with the maximum likelihood estimate previously found with other methods (compare with the numbers given at the end of Example 11.1.2). This remains consistent however with the theory developed in Nielsen (2000), as the mismatch is small and, in the current case, probably even less than the order of the random uctuations due to the use of a nite number of simulations (here 200,000). To conclude this section, we also mention the variant of SEM and MCEM proposed by Doucet et al. (2002). This algorithm, which uses concepts borrowed from the Bayesian paradigm, will be presented in Section 13.3.

11.2 Analysis of the MCEM Algorithm


In Section 10.5, the EM algorithm was analyzed by viewing each of its iterations as a mapping M on the parameter space such that the EM

Density

Density

Density

420

11 Maximum Likelihood Inference, Part II

sequence of estimates is given by the iterates i+1 = M (i ). Under mild conditions, the EM sequence eventually converges to the set of xed points, L = { : = M ()}, of this mapping. EM is an ascent algorithm as each iteration of M increases the observed log-likelihood , that is, M () () for any with equality if and only if L. This ascent property is essential in showing that the algorithm converges: it guarantees that the sequence { (i )} is non-decreasing and, hence, convergent if it is bounded. The MCEM algorithm is an approximation of the EM algorithm. Each iteration of the MCEM algorithm is a perturbed version of an EM iteration, where the typical size of the perturbation is controlled by the Monte Carlo error and thus by the number of simulations. The MCEM sequence may thus be written under the form i+1 = M (i ) + i+1 , where i+1 is the perturbation due to the Monte Carlo approximation. Provided that the number of simulations is increased as the algorithm approaches convergence, the perturbation i vanishes as i . Note that the MCEM algorithm is not an ascent algorithm, which prevents us from using the general convergence results of Section 10.5. It is sensible however to expect that the behavior of the MCEM algorithm closely follows that of the EM algorithm, at least for large i, as the random perturbations vanish in the limit. To prove that this intuition is correct, we rst establish in Section 11.2.1 a stability result for deterministically perturbed dynamical systems and then use this result in Section 11.2.2 to deduce a set of conditions implying almost sure convergence of the MCEM algorithm. To avoid entering into too many technicalities, we study convergence under elementary assumptions that do not cover all possible applications of MCEM to maximum likelihood estimation in partially observed models. We feel however that a rst exposure to this theory should not be obscured by too many distracting details that will almost inevitably arise when trying to cover more sophisticated cases. Remark 11.2.1 (Stability in Stochastic Algorithms). One topic of importance that we entirely avoid here is the stability issue. We always assume that it can be independently guaranteed that the sequence of estimates produced by the algorithm deterministically stays in a compact set. Although this will obviously be the case where the parameter space is compact, this assumption may fail to hold in more general settings where the algorithms under study can generate sequences of parameters that either diverge erratically or converge toward the boundary of the parameter space. To circumvent this problem, from both practical and theoretical points of view, it is necessary to modify the elementary recursion of the algorithm, for instance using reprojections (Kushner and Yin, 2003; Fort and Moulines, 2003; Andrieu et al., 2005). 11.2.1 Convergence of Perturbed Dynamical Systems Let T : be a (point-to-point) map on . We study in this section the convergence of the -valued discrete time dynamical system i+1 = T (i )

11.2 Analysis of the MCEM Algorithm

421

and the perturbed dynamical system i+1 = T (i ) + i+1 , where { i } is a deterministic sequence converging to zero. The study of such perturbed dynamical systems was initiated by Kesten (1972), and these results have later been extended by Pierre-Loti-Viaud (1995), Brandi`re (1998), and Bonnans e and Shapiro (1998). To study the convergence, it is useful to introduce Lyapunov functions associated with the mapping T . A Lyapunov function, as dened below, is equivalent to the concept of ascent function that we met in Section 10.5 when discussing the convergence of EM. The terminology Lyapunov function is however more standard, except in numerical optimization texts. Note that Lyapunov functions are traditionally dened as descent functions rather than ascent functions. We reverse this convention to be consistent with the fact that the MLE estimator is dened as the maximum of the (log-)likelihood function. Denition 11.2.2 (Lyapunov Function). T : be a map as above and let def L = { : = T ()} (11.25) be the set of xed points of this map. A function W : R is said to be a Lyapunov function relative to (T, ) if W is continuous and W T () W () for all , with equality if and only if L. In other words, the map T is an ascent algorithm for the function W . Theorem 11.2.3. Let be an open subset of Rd and let T : be a continuous map with set L of xed points. Assume that there exists a Lyapunov function W relative to (T, ) such that W (L) is a nite set of points. Let K be a compact set and {i } a K-valued sequence satisfying
i

lim |W (i+1 ) W T (i )| = 0 .

(11.26)

Then the set L K is non-empty, the sequence {W (i )} converges to a point w W (LK), and the sequence {i } converges to the set Lw = { LK : W () = w }. The proof of the theorem is based on the following result. Lemma 11.2.4. Let > 0 be a real constant, let n 1 be an integer, and let < a1 < b1 < . . . < an < bn < be real numbers. Let {wj } and {ej } be two sequences such that lim supj wj < , limj ej = 0 and wj+1 wj +

1Ac (wj ) + ej , where A def =

[ai , bi ] .
i=1

(11.27)

Then there exists an index k lim sup wj bk .

{1, . . . , n} such that ak

lim inf wj

422

11 Maximum Likelihood Inference, Part II

Proof. First note that (11.27) implies that the sequence {wj } is innitely often in the set A (otherwise it would tend to innity, contradicting the assumptions). Thus it visits innitely often at least one of the intervals [ak , bk ] for some k. Choose < inf 1in1 (ai+1 bi )/2 and set j0 such that |ej | for j j0 . Let p j0 such that wp [ak , bk ]. We will show that for any j p , wj ak . (11.28)

The property is obviously true for j = p. Assume now that the property holds true for some j p. If wj ak , then (11.27) shows that wj+1 ak . If ak wj < ak , then wj+1 wj + ak . Therefore wj+1 ak , and (11.28) follows by induction. Because was arbitrary, we nd that lim inf wj ak . Using a similar induction argument, one may show that lim sup wj bk , which concludes the proof. Proof (of Theorem 11.2.3). If L K was empty, then min K W T () W () > 0, which would contradict (11.26). Hence L K is non-empty. For simplicity, we assume in the following that L K, if not, simply replace L by L K. def For any > 0, let [W (L)] = {x R : inf yW (L) |x y| < }. Because W (L) is bounded, the set [W (L)] is a nite union of disjoint bounded open intervals of length at least equal to 2. Thus there exists an integer n 0 and real numbers a (1) < b (1) < . . . < a (n ) < b (n ) such that
n

[W (L)] =
k=1

(a (k), b (k)) .

(11.29)

Note that W 1 ([W (L)] ) is an open neighborhood of L, and dene = Write W (i+1 ) W (i ) = W T (i ) W (i ) + W (i+1 ) W T (i ) . (11.31) Because W (i ) [W (L)] implies i W 1 ([W (L)] ), we obtain W (i+1 ) W (i ) + 1[W (L)]c W (i ) + W (i+1 ) W T (i ) . (11.32)
def {K\W 1 ([W (L)] )}

inf

{W T () W ()} > 0 .

(11.30)

By (11.26), W (i+1 ) W T (i ) 0 as i . Thus by Lemma 11.2.4, the set of limit points of the sequence {W (i )} belongs to one of the intervals [a (k), b (k)]. Because W (L) = >0 [W (L)] and W (L) is a nite set, the sequence {W (i )} must be convergent with a limit that belongs to W (L). Using (11.31) and (11.26) again, this implies that W T (i ) W (i ) 0 as i , showing that all limit points of the sequence {i } belongs to L. The proof of Theorem 11.2.3 follows.

11.2 Analysis of the MCEM Algorithm

423

11.2.2 Convergence of the MCEM Algorithm Throughout this section, we focus on the case where the complete data likelihood is from an exponential family of distributions. To keep the discussion short, we also consider only the simplest mechanism to draw the missing data, that is conditionally i.i.d. simulations. Many of the assumptions below can be relaxed, but the proof of convergence then becomes more cumbersome and technical (Fort and Moulines, 2003; Kuhn and Lavielle, 2004). We recall the notations f (x; ) for the complete data likelihood, L() = f (x; ) (dx) for the likelihood, and p(x; ) = f (x; )/L() for the conditional density of the missing data. We will also need the function
def S() =

S(x)p(x ; )(dx) ,

(11.33)

where S(x) is the (vector of) sucient statistic(s) dened below. Assumption 11.2.5. (i) is an open subset of Rd and {f (; )} denes an exponential family of positive functions on X, that is, f (x ; ) = exp[ t ()S(x) c()]h(x) (11.34)

for some functions : Rd Rds , S : X Rds , c : R, and h : X R+ . (ii) The function L is positive and continuous on . (iii) For any , |S(x)|p(x ; ) (dx) < , and the function S is continuous on . (iv) There exists a closed subset S Rds that contains the convex hull of S(X) and is such that for any s S, the function t ()s c() has a unique global maximum (s) . In addition, the function (s) is continuous on S. Under the assumptions and denitions given above, the EM and the MCEM recursions may be expressed as EM: i+1 = T (i ) = S(i ) ,
def

MCEM: i+1 = (S i+1 ) ,

(11.35)

where {S i } are the estimates of the complete data sucient statistics given, for instance, by (11.3) or by an importance sampling estimate of the same quantity. Assumption 11.2.6. With
def L = { : S() = }

(11.36)

being the set of xed points of the EM algorithm, the image by the function L of this set L is a nite set of points.

424

11 Maximum Likelihood Inference, Part II

Recall that if the function L is continuously dierentiable, then L coincides with the set of stationary points of the log-likelihood. That is, L = { : L() = 0} (see in particular Theorem 10.5.3). To study the MCEM algorithm, we now state conditions that specify how S i+1 approximates S(i ). Assumption 11.2.7. L[(S i+1 )] L[ S(i )] 0 a.s. as i . Theorem 11.2.8. Assume 11.2.5, 11.2.6, and 11.2.7. Assume in addition that, almost surely, the closure of the set {i } is a compact subset of . i } converges to the set L and the sequence Then, almost surely, the sequence { {L(i )} has a limit. Proof. From Proposition 10.1.4, each iteration of the EM algorithm increases the log-likelihood, L( S()) L(), with equality if and only if L (see (11.36)). Thus L is a Lyapunov function for T = S on . Because T is continuous by assumption, the proof follows from Theorem 11.2.3. Assumption 11.2.7 is not a low-level assumption. It may be expressed dierently, using the conditional version of the Borel-Cantelli Lemma. Lemma 11.2.9 (Conditional Borel-Cantelli Lemma). Let {Gk } be a ltration and let {k } be an {Gk }-adapted sequence of random variables. Assume that there exists a constant C such that for any k, 0 k C. Then if k=1 E[k | Gk1 ] < a.s., it holds that k=1 k < a.s. Proof. Set Mn = k=1 {k E[k | Gk1 ]}. Then {Mn } is a square-integrable {Gn }-adapted martingale. The angle-bracket process of this martingale (see Dacunha-Castelle and Duo, 1986, Section 2.6) is bounded by
n n 2 2 E[Mk | Gk1 ] Mk1 = k=1 k=1 n n

def n

E[(k E[k | Gk1 ])2 | Gk1 ] C


k=1

E[k | Gk1 ] < P-a.s.

The proof is concluded by applying Proposition 2.6.29 of Dacunha-Castelle and Duo (1986), which shows that {Mn } converges a.s. to an a.s. nite random variable. We may use the conditional Borel-Cantelli lemma to show that Assumption 11.2.7 is implied by the following sucient condition, which turns out to be more convenient to check. Lemma 11.2.10. Assume 11.2.5 and that the following conditions hold. (i) The closure of the set {i } is, almost surely, a compact subset of .

11.2 Analysis of the MCEM Algorithm

425

(ii) For any

> 0 and any compact set K , P{|S i S(i1 )| | F i1 }1K (i1 ) <
def

a.s. ,

(11.37)

i=1

where F j = (0 , S 1 , . . . , S j ). Then Assumption 11.2.7 is satised. Note that the indicator random variable is F i1 -measurable, as i1 is a deteri1 of the sucient ministic function (the M-step) of the previous estimate S statistic. Proof. We rst prove that for any

> 0 and any compact set K , a.s. (11.38)

P{|L[(S i )] L[ S(i1 )]| | F i1 }1K (i1 ) < > 0,

i=1

In order to do so, note that for any > 0 and

P{|L[(S i )] L[ S(i1 )]| | F i1 } P{|S i S(i1 )| | F i1 } + P{|L[(S i )] L[ S(i1 )]| , |S i S(i1 )| | F i1 } . In particular, this inequality holds true on the event {i1 K}. Now dene is assumed continuous the set T = S {|s| supK S() + }. Because S this set is compact, and therefore the function L is uniformly continuous L (s )| for any on T . Hence we can nd an > 0 such that |L (s) (s, s ) T T such that |s s | . We thus see that on the on the event {i1 K}, P{|L[(S i )] L[ S(i1 )]| , |S i S(i1 )| | F i1 } P{|S i S(i1 )| | F i1 } . In view of assumption (ii), (11.38) follows. Combining (11.38) with Lemma 11.2.9 shows that for any compact set K , lim |L[(S i )] L[ S(i1 )]|1K (i1 ) = 0 a.s.
i

The proof is concluded by noting that there exists an increasing sequence K1 K2 of compact subsets of such that = n=0 Kn . As discussed previously, there are many dierent ways to approximate S(). To simplify the discussion, we concentrate below on the simple situation of plain Monte Carlo approximation, assuming that

426

11 Maximum Likelihood Inference, Part II


mi

S i = m1 i
j=1

S( i,j ) ,

i1,

(11.39)

where mi is the number of replications in the ith iteration and i,1 , . . . , i,mi are conditionally i.i.d. given the -eld F i1 with common density p(x; i1 ). Lemma 11.2.11. Assume 11.2.5 and that the closure of the set {i } is, alr/2 most surely, a compact subset of . Assume in addition that i=1 mi < for some r 2 and that supK |S(x)|r p(x ; ) (dx) < for any compact set K . Then the MCEM sequence {i } based on the estimators {S i } of the sucient statistics given by (11.39) satises Assumption 11.2.7. Proof. The Markov and the Marcinkiewicz-Zygmund (Theorem 9.1.5) inequalities state that for any r 2 and any > 0,

P{|S i S(i1 )| | F i1 }1K (i1 )

i=1

r i=1

E[|S i S(i1 )|r | F i1 ]1K (i1 )

C(r) C(r)

r i=1

mi sup
K

r/2

|S(x)|r p(x ; i1 )(dx) 1K (i1 )

|S(x)|r p(x ; )(dx)


i=1

mi

r/2

where C(r) is a universal constant. The right-hand side is nite by assumption, so that the conditions of Lemma 11.2.10 are satised. The situation is slightly more complicated when instead of drawing i.i.d. random variables from the density p(x ; i1 ), we run an ergodic Markov chain i1 ). We then need a version of Marcinkiewiczwith stationary density p(x ; Zygmund inequality for ergodic Markov chains (see for instance Fort and Moulines, 2003, Section 6). We will not develop further the theory in this direction. All we need to know at this point is that Assumption 11.2.7 still holds true in this case under reasonable conditions. 11.2.3 Rate of Convergence of MCEM Recall from Section 10.5.2 that the asymptotic behavior of an EM sequence {i } that converges to a local maximum may be (approximately) described by the linear dynamical system (i+1 ) = M (i ) M ( )
M (

)(i ) ,

(11.40)

where the eigenvalues of M ( ) lie in the interval (0, 1) (Proposition 10.5.5). To use this decomposition, we require some additional regularity assumptions.

11.2 Analysis of the MCEM Algorithm

427

Assumption 11.2.12. (i) The functions and c of the exponential family characterization, S and , are twice continuously dierentiable on . (ii) is twice continuously dierentiable on the interior of S. (iii) The set L of stationary points of is reduced to a single point , which is a proper maximizer of and such that s = S( ) lies in the interior of S; the matrices H( ) and G( ) dened by (10.71) and (10.72) are positive denite. Note that in exponential families, the form taken by () (see Denition 10.1.5) and the rst assumption above imply that the technical condition (b) in Proposition 10.1.6 holds so that Proposition 10.5.5 applies and is a stable stationary point of the EM mapping. The third condition above is overly restrictive and is adopted only to allow for simpler statements. It is possible to obtain similar results assuming only that L consists of isolated points by properly conditioning on the events {|i | < } for L and arbitrary values of > 0 (see Fort and Moulines, 2003, for details). It is useful in the following to consider the EM algorithm not directly in the parameter space but in the space S of the complete data sucient statistic. In this space, the EM recursion may be written as
def S i+1 = S (S i ) = G(S i ),

i+1 = (S i+1 ) .

(11.41)

def If is a xed point of M , then s = S( ) is a xed point of G, that is, s = G(s ) = S (s ). In addition, M ( ) = s (s ) S( ) and ) s (s ), so that s G(s ) and M ( ) have the same s G(s ) = S( eigenvalues (counting multiplicities). We now apply this principle to the MCEM algorithm, letting again S i be the estimate of the sucient statistic at the ith iteration. The dierence S i s , where s = S( ), may be expressed as

S i s = [G(S i1 ) G(s )] + [S i G(S i1 )] i1 s ) + (S i E[S i | F i1 ]) + Qi , = s G(s )(S where F i1 is as in Lemma 11.2.10 and Qi is a remainder term. For con ditionally i.i.d. simulations, S i is given by (11.39) and hence E(S i | F i1 ) = S(x)p(x ; (S i1 ))(dx) = G(S i1 ). Thus the remainder term Qi is equal to the dierence between G(S i1 ) G(s ) and its rst-order approximation i1 s ), which we expect to be small for large values of the iters G(s )(S ation index i when S i converges to s . For technical reasons, we consider instead the equivalent error decomposi tion S i s = M i + Ri , where M i obeys a linear dierence equation driven by the martingale dierence, M0 = 0 and M i =
s G(s

)M i1 + (S i E[S i | F i1 ])1C (i1 ) , (11.42)

428

11 Maximum Likelihood Inference, Part II

C being a compact neighborhood of = (s ) and Ri is the remainder term. Because the stationary point s is stable, all eigenvalues of s G(s ) have modulus less than 1, implying that the linear dierence equation (11.42) is stable. To go further, we need to strengthen the assumption on the Monte Carlo perturbation. Assumption 11.2.13. mi < for some r 2 and for any compact 1/2 i E[S i | F i1 ]|r 1K (i1 ))1/r < . set K , lim supi mi (E |S This condition implies that
r/2

E[|S i E{S i | F i1 }|r | F i1 ]1K (i1 ) <

a.s.

j=1

Hence by Markov inequality and Lemma 11.2.10, Assumption 11.2.13 implies Assumption 11.2.7. The following result (adapted from Fort and Moulines, 2003, Theorem 6), which we state without proof, establishes the rate of convergence of M i and Ri . Theorem 11.2.14. Assume 11.2.5, 11.2.7, 11.2.12, 11.2.13, and that S i 2 s a.s. Assume in addition that 1 limi mi+1 /mi < |max ( s G(s ))| . Then 1/2 1/2 there exists a constant C such that (E M i r )1/r Cmi and mi (S i i i s M ) 0 a.s., where M is as in (11.42). To understand the impact of the schedule {mi } on the dispersion of the MCEM estimate, it is appropriate to evaluate the rate of convergence as a function of the total number of simulations. For any sequence {ai }, we dene the interpolated sequence ai = a(i) , where for any integer i, (i) is the largest integer such that
(i) (i)+1

mk < i
k=0 k=0

mk .

Hence ai is the original sequence reindexed by simulation number rather than by iteration number. In particular, i denotes the t of the parameter after i is the t of the parameter after the ith the ith simulation while, as usual, iteration. Assume rst that the number of simulations increases at a polynomial rate, mi i , for some > 0. Then (i) [(1 + )i]1/(1+) and i = + OP (i 2(1+) ). Whatever the value of , the rate of convergence is slower than i1/2 . It is worthwhile to note that the rate improves by choosing large values of ; on the simulation scale, the dispersion of the estimator decreases when increasing . Assume now that the schedule is exponential, mi i for some > 1. This choice has been advocated by Chan and Ledolter (1995) and in several earlier works on the subject. We obtain similarly that i = + OP (i1/2 ) whenever 1 < < |max [ s G(s )]|2 . This analysis

11.3 Analysis of Stochastic Approximation Algorithms

429

suggests that the optimal schedule is exponential, yet the choice of is not obvious as max [ s G(s )] is in general unknown. We now study the averaged algorithm based on the use of (11.9). Then Si s may be decomposed as Si s = Mi + Ri , where the leading term M i is given by 1
i i ik

def Mi =
j=0

mj
k=0

j=0

mj+k

s G(s

)j (S k E[S k | F k1 ]) .

Fort and Moulines (2003, Theorem 8) shows that the following result holds true. Theorem 11.2.15. Assume 11.2.5, 11.2.7, 11.2.12, 11.2.13, and that S i s a.s. Assume in addition that the following conditions hold true. (i) 1 limi mi+1 /mi < |max [ s G(s )]|2 . i (ii) limi i( j=0 mj )1/2 = 0. Then there is a constant C such that (E |Mi |r )1/r C
j=0 i

1/2 mj ,

and
j=0 i

1/2 mj (S i s Mi ) 0 a.s.

The Lr -norm of the leading term Mi of the error S i s thus decreases as the inverse square root of the total number of simulations up to iteration i, both for subexponential and exponential schedules. This implies that the estimator i = (S i ) converges to at a rate inversely proportional to the square root of the total number of simulations up to iteration i. When expressed on the simulation timescale, the previous result shows that the rate of convergence of the interpolated sequence i is proportional to i1/2 , the total number of simulations up to time i. Hence the averaging procedure improves the rate of convergence and makes the choice of the sequence {mi } less sensitive.

11.3 Analysis of Stochastic Approximation Algorithms


11.3.1 Basic Results for Stochastic Approximation Algorithms Since the early work by Kushner and Clark (1978), convergence of stochastic approximation procedures has been thoroughly studied under various sets

430

11 Maximum Likelihood Inference, Part II

of assumptions. For a good summary of available results, we recommend in particular the books by Benveniste et al. (1990), Duo (1997), and Kushner and Yin (2003). In the following, we follow the approach recently proposed by Andrieu et al. (2005), which is of interest here because it parallels the method adopted in the previous section for the MCEM algorithm. The analysis again consists in decomposing the study of the convergence of stochastic approximation algorithms in two distinct steps. In the rst step, we establish deterministic conditions on a noise sequence { i } and a step size sequence {i } under which a deterministic sequence {i } dened as 0 , i+1 = i + i+1 (h(i ) + i+1 ) , i0, (11.43)

converges to the set of stationary points of h. This rst result (Theorem 11.3.2 below) is the analogy of Theorem 11.2.3, which was instrumental in analyzing the convergence of the MCEM algorithm. Because the proof of Theorem 11.3.2 is more technical, however, it is postponed to Section 11.4 and may be omitted in a rst reading. In a second step, which is probabilistic in nature and depends on the distribution of the process { i }, we check that these conditions are satised with probability one. In order to state Theorem 11.3.2, we rst need to adopt a strengthened version of Denition (11.2.2). Denition 11.3.1 (Dierential Lyapunov Function). Let be a subset of Rd , let w be a real function on , and let h : Rd be a vector-valued function. The function w is said to be a Lyapunov function relative to (h, ) if w is continuously dierentiable on and w(), h() 0 for any , with equality if and only if is such that h() = 0. In this context, the function h is usually referred to as the mean eld and the points such that h() = 0 are called stationary points (of the mean eld). We will denote by L the set of such points, that is, L = { : h() = 0} .
def

(11.44)

To make the connection with Denition (11.2.2), note that if W is a Lyapunov function relative to T in the sense of Denition (11.2.2) and that both functions are continuously dierentiable on , then W also is a (dierential) Lyapunov function in the sense of Denition 11.3.1 relative to the gradient eld h = T . Recall that we adopt in this chapter a denition that is compatible with maximization tasks, whereas the tradition is to consider Lyapunov functions as descent functions (hence replacing by in Denition 11.3.1). Theorem 11.3.2. Assume that is an open subset of Rd and let h : Rd be continuous. Let {i } be a positive sequence such that i 0 and i =

11.3 Analysis of Stochastic Approximation Algorithms


l

431

, and let { i } be a sequence in Rd satisfying limk suplk | i=k i i | = 0. Assume that there exists a Lyapunov function w relative to (h, ) such that w(L) is nite, where L is as in (11.44). Finally, assume that the sequence {i }i0 given by i = i1 + i h(i1 ) + i i is such that {i } K for some compact subset K of satisfying L K. Then the sequence {w(i )} converges to some w in w(L) and the sequence i { } converges to the set Lw = { L : w() = w }. 11.3.2 Convergence of the Stochastic Gradient Algorithm We consider the stochastic gradient algorithm dened by (11.17). For simplicity, we set the number of simulations m in each iteration to one, bringing us back to the basic form (11.15). This recursion may be rewritten in Robbins Monro form i = i1 + i h(i1 ) + i i , where h() =

() ,

i =

log f ( i ; i1 ) h(i1 ) .

(11.45)

Because the mean eld h is a gradient, the function w = is a Lyapunov function relative to (, h). To proceed, one needs to specify how the missing data is simulated. We consider the following simple assumption. Assumption 11.3.3. For any i 1, given F i1 = (0 , 1 , . . . , i1 ), the i simulated missing data is drawn from the density p(x ; i1 ). In addition, for some r > 2, the function |S(x)|r p(x ; ) (dx) is nite and continuous on . This assumption can be relaxed to allow for Markovian dependence, a situation that is typical when MCMC methods are used for simulation of the missing data (Andrieu et al., 2005). We may now formulate a general convergence result for the stochastic gradient algorithm under the assumption that the complete data likelihood is from an exponential family of distributions. Note that in the latter case, the representation f (x ; ) = exp[ t ()S(x) c()]h(x) implies that the perturbation i dened in (11.45) may be rewritten as i = [ (i1 )]t (S i E[S i | F i1 ]), where () is the Jacobian matrix i = S( i ) is a simulation of the complete data sucient statistics of and S under the density p(x ; i1 ). Theorem 11.3.4. Assume 11.2.5, 11.2.6, and 11.3.3. Assume in addition that () is a continuously dierentiable function of , that k 0 , k = and
2 k < ,

and that the closure of the set {i } is a compact subset of . Then, almost i surely, the sequence given by (11.15) satises limk (k ) = 0.

432

11 Maximum Likelihood Inference, Part II


i

j Proof. Put M i = j=1 j . The result will follow from Theorem 11.3.2 i provided {M } has a nite limit a.s., so this is what we will prove. Using the form of i given above, we see that the sequence {M i } is an i {F }-martingale satisfying

E[|M i+1 M i |2 | F i ]
i=1 i=1

2 i+1

i )

|S(x)|2 p(x; i ) (dx) .

Under the stated assumptions the sequence {i } a.s. belongs to a compact subset of . Therefore, by Assumption 11.3.3, the right-hand side of the above display is nite a.s., and Dacunha-Castelle and Duo (1986, Proposition 2.6.29) then shows that M i has a nite limit almost surely. 11.3.3 Rate of Convergence of the Stochastic Gradient Algorithm The results above are of little help in selecting the step size sequence, because they do not tell much about the behavior of the sequence {i } when the algorithm approaches convergence. This section is concerned with the rate of convergence, assuming that convergence occurs. To simplify the discussion it is assumed here that, as in Section 11.2.3, i , which is a stable stationary point. That is, a point in satisfying the following conditions: (i) h( ) = 0, (ii) h is twice dierentiable in a neighborhood of and (iii) J( ), the Jacobian matrix of h, or, in other words, the Hessian of ( ), is negative denite. All this is guaranteed by Assumption 11.2.12, under which is a proper maximizer of . Write the dierence i as i = (i1 ) + i [h(i1 ) h( )] + i i i1 ) + i J( )(i1 ) + i i + i Qi , = ( where Qi = [h(i1 ) h( )] J( )(i1 ) is the remainder term. This i = M i + Ri , where M i obeys a linear suggests the error decomposition dierence equation driven (under Assumption 11.3.3) by a martingale dierence; M 0 = 0 and, for i 1,
i i

M i = [I + i J( )]M i1 + i i =
j=0

[I + l J( )] j .
l=j+1

(11.46)

The following result is adapted from Delyon et al. (1999, Lemma 6) (see also Kushner and Yin, 2003, Chapter 10). Theorem 11.3.5. Assume 11.2.5, 11.2.12, 11.3.3, and that i a.s. As 1 1 2 sume in addition that i=0 i = , i=0 i < and that i+1 i 0. Then there exists a constant C such that (E[ M i r ])1/r Ci and 1/2 i i ( M i ) 0 a.s., where M i is as in (11.46).

11.3 Analysis of Stochastic Approximation Algorithms

433

Hence M i is the leading term of the error and Ai is a remainder term. Because the variance of the leading term M i is proportional to the step size i , this result suggests taking the smallest possible step size compatible with the assumptions. Using small step sizes is however clearly not a recommendable practice. Indeed, if the step sizes are not sucient, it is likely that the algorithm will get stuck at an early stage, failing to come close to the target point. In addition, it is dicult to detect that the step size is converging too quickly to zero, or that it is too small, and therefore there is a substantial ambiguity on how to select an appropriate sequence of step sizes. This diculty has long been considered as a serious handicap for practical applications of stochastic approximation procedures. Note that it is possible to carry out a dierent analysis of stochastic ap proximation procedures in which the error i is normalized by the square root of the inverse of the step size i . One may for example prove conver1/2 i gence in distribution of the centered and normalized iterate i ( ), with the variance of the limiting distribution taken as a measure of how fast convergence occurs (Benveniste et al., 1990; Duo, 1997). It is also possible to analyze scenarios in which the step sizes are essentially constant but assumed suciently small (Kushner and Yin, 2003) or to use approaches based on large deviations (Dupuis and Ellis, 1997). As in the case of MCEM, the averaging procedure partly raises the di culty discussed above: for the averaged sequence {i } dened in (11.19), the following result, adapted from Delyon et al. (1999, Theorem 4), holds. Theorem 11.3.6. Under the assumptions of Theorem 11.3.5, i D i( ) N(0, J( )1 J( )1 ) , where = t ( ) [S(x) S( )][S(x) S( )]t p(x ; ) (dx) ( ) .

(11.47)

As shown by Poznyak and Chikin (1984) and Chikin (1988), the rate 1/ i and the asymptotic variance of (11.47) are optimal. This performance may also be achieved using a Gauss-Newton type stochastic approximation algorithm. Such an algorithm would however require knowledge, or estimates of J( ), whereas averaging circumvents such diculties. This result suggests a rather dierent philosophy for setting the step sizes: because the optimal rate of 1/ i can be achieved by averaging, the step sizes {i } should decrease as slowly as permitted by the assumptions of Theorem 11.3.5 to ensure fast convergence toward the region of interest (hence the choice of a rate n0.6 adopted in Example 11.1.6). 11.3.4 Convergence of the SAEM Algorithm We consider the stochastic approximation EM (SAEM) algorithm (11.21) and again, for simplicity, with m = 1 replication of the missing data in each

434

11 Maximum Likelihood Inference, Part II

iteration. In Robbins-Monro form, this algorithm is dened as S i = S i1 + i1 ) + i i , where the mean eld h and the perturbation i are given by i h(S h(s) = S (s) s , i = S( i ) S (S i1 ) . (11.48)

The log-likelihood function () is increased at each iteration of the EM algorithm. We show in the following lemma that this property, in the domain of complete data sucient statistics, implies that is a Lyapunov function for the mean eld h. Lemma 11.3.7. Assume 11.2.5, items (i) and (ii) of 11.2.12 and set w = . Then s w(s), h(s) 0 for any s S, where h is the mean eld of (11.48). Moreover, {s S : ({s S :
s w(s), h(s) def

= 0} = {s S :

s w(s)

= 0} ,

(11.49) (11.50)

s w(s), h(s) = 0}) = { :

() = 0} .

Proof. We start by working out an expression for the gradient of w. Under Assumption 11.2.12, the function S is continuously dierentiable on and the function is continuously dierentiable on S. Hence h is continuously dierentiable on S, so that h is bounded on every compact subset of S. By construction for any s S, the function satises
[(s)]s

c[(s)]

=0.

(11.51)

Put F (s, ) = {()}t s c(), so that this relation reads F [s ; (s)] = 0. Under the assumptions made, we may dierentiate this relation with respect to s to obtain 2 (11.52) F [s ; (s)] s (s) = [(s)] . On the other hand, the Fisher identity implies that for any ,

() =

()S()

c()

Evaluating this equality at (s) and using (11.51) yields

[(s)] = =

[(s)]

S[(s)] s =
2 F [s ; (s)] s (s)h(s)

[(s)]h(s)

(11.53)

whence
s

[(s)] =

s (s)

2 F [s ; (s)]

s (s) h(s)

(11.54)
2 F [s ; (s)]

Because the F (s; ) as a unique proper maximizer in = (s), is negative denite implying that
s w(s), h(s)

= {h(s)}t {

t s (s)}

2 F [s ; (s)]

s (s)h(s)

0 . (11.55)

This is the rst claim of the lemma.

11.4 Complements

435

Now pick s S to be such that w(s ), h(s ) = 0. Under Assump tion 11.2.12, the matrix 2 F [s ; (s )] is negative denite, whence (11.55) shows that s (s ) h(s ) = 0. Inserting this into (11.54) yields s w(s ) = 0, so that {s S :
s w(s), h(s)

= 0} {s S :

s w(s)

= 0} .

The reverse inclusion is trivial, and the second claim of the lemma follows. For the nal claim, use a similar argument and (11.53) as well as the fact that if ( ) = 0 then h(s ) = S (s ) s = 0 (for the point s = S( )). We may now formulate a result that is the stochastic counterpart of the general convergence theorem for the EM sequence. Theorem 11.3.8. Let {i } and {S i } be sequences of parameters and complete sucient statistics, respectively, of the SAEM algorithm (11.21). Assume 11.2.5, 11.2.6, and items (i) and (ii) of 11.2.12 and 11.3.3. Assume in addition that 2 k 0 , k = and k < , and that the closure of the set {S i } is a compact subset of S. Then, almost i ) = 0 and limi (i ) = 0. surely, limi h(S The proof is similar to the one of Theorem 11.3.4 and is omitted.

11.4 Complements
We give below the proof of Theorem 11.3.2, which was omitted in Section 11.3. We rst need three lemmas for which the assumptions of Theorem 11.3.2 are assumed to hold. Lemma 11.4.1. Let J be a compact subset of such that 0 < inf J w(), h() . Then, for any 0 < < inf J w(), h() , there exist constants > 0 and > 0, such that, for any , 0 , , || , and J , w[ + h() + ] w() + . Proof. For any 0 < < inf J w, h , there exist > 0 and > 0 such that for all , 0 , , || and t, 0 t 1, we have for all J , + th() + t and |
w(), h()

w[

+ th() + t], h() + |

Rd \J

inf

w, h

Then, for any , 0 and , || ,

436

11 Maximum Likelihood Inference, Part II


w(), h() w(), h()

w( + h() + ) w() =
1

+
0

{ inf

w[

+ th() + t], h() + inf |


w, h

} dt

Rd \J

w, h

Rd \J

= .

Lemma 11.4.2. Let N be an open neighborhood of L. There exist positive constants , C, , and (depending only on the sets N and K), such that for any , 0 < , , 0 < , one can nd an integer N and a sequence {j }jN satisfying j for any j N and sup |j j | ,
jN

sup j ,
jN

and

sup |w(j ) w(j )| ,


jN

(11.56) w(j ) w(j1 ) + j 1N c (j1 ) j C 1N (j1 ) for j N + 1. (11.57)

Proof. Let us choose 0 > 0 small enough so that K0 = { , inf | | 0 } .


K def

The set K0 \ N is compact and inf K0 \N w, h > 0. By Lemma 11.4.1, for any , 0 < < inf K0 \N w(), h() , one may choose > 0 and > 0 small enough so that for any , 0 , , || and K0 \ N , + h() + and w[ + h() + ] w() + . (11.58)

Because the function h is continuous on , it is uniformly continuous on each compact subset of , i.e., for any > 0 one may choose , 0 < h1K 0 so that for all (, ) K0 K0 satisfying | | , |h() h()| and |w() w()| . (11.59)

Under the stated conditions for any , 0 < and , 0 < there j i exists an integer N such that for any j N + 1, j and i=N +1 i . Dene recursively for j N the sequence {j }jN as follows: N = N and for j N + 1, j = j1 + j h(j1 ) . (11.60)
j By construction, for j N + 1, j j = i=N +1 i i , which implies that supjN |j j | and thus, for all j N , j K0 and |w(j )w(j )| . On the other hand, for j N + 1,

11.4 Complements

437

j = j1 + j h(j1 ) + j [h(j1 ) h(j1 )] ,

(11.61)

and because |j1 j1 | , (11.59) shows that |h(j1 )h(j1 )| . Thus, (11.58) implies that, whenever j1 K0 \ N , w(j ) w(j1 ) + j . Now (11.59) and (11.60) imply that, for any j N , |w(j ) w(j1 )| j
w K

h1K

Lemma 11.4.3. Let and C be real constants, n be an integer and let < a1 < b1 < < an < bn < be real numbers. Let {uj } be a sequence such that lim sup uj < and, for any j, uj uj1 + j 1Ac (uj1 ) j C 1A (uj1 )
n

A=
i=1

[ai , bi ] .

(11.62)

Then, the limit points of the sequence {uj } are included in A. Proof. As lim sup uj < is bounded, {uj } is innitely often in the set A and thus in at least one of the intervals [ak , bk ], k = 1, . . . , n. Choose , 0 < < inf 1in1 (ai+1 bi )/2 and let J be suciently large so that, for all j J, j C . Assume that {ui } is innitely often in the interval [ak , bk ], for some k = 1, . . . , n. Let p J be such that up [ak , bk ]. We will show by induction that, for any j p , uj ak . (11.63) The property is obviously true for j = p. Assume now that the property holds true for some j p. If uj ak , then, uj+1 ak . If ak uj ak , then uj+1 uj + j ak , showing (11.63). Because is arbitrary, lim inf uj ak , showing that the sequence {uj } is innitely often in only one of the intervals. Hence, there exists an index j0 such that, for any j j0 , uj < ak+1 (with the convention that an+1 = ), which is possible only if, for any j j0 , uj < bk . As a consequence, there cannot be an accumulation point in an interval other than [ak , bk ]. Proof (Theorem 11.3.2). We rst prove that limj w(j ) exists. For any > 0, dene the set [w(L)] = {x R : inf yw(L) |x y| < }. Because w1L < , [w(L)] is a nite union of disjoint intervals of length at least equal to 2. By applying Lemma 11.4.2 with N = w1 ([w(L)] ), there exist positive constants C, , , , such that for any , 0 < , , 0 < and > 0, one may nd an integer N and a sequence {j }jN such that, sup |j j | , sup j
jN jN

and

sup |w(j ) w(j )|


jN

and, for any j N + 1,

438

11 Maximum Likelihood Inference, Part II

w(j ) w(j1 ) + j 1[w(L)]c [w(j1 )] j C 1[w(L)] [w(j1 )] , By Lemma 11.4.3, the limit points of the sequence {w(j )} are in [w(L)] and j j )| , the limit points of the sequence {w(j )} because supjN |w( ) w( belong to [w(L)]+ . Because and are arbitrary, this implies that the limit points of the sequence {w(j )} are included in >0 [w(L)] . Because w(L) is nite, w(L) = >0 [w(L)] showing that the limit points of {w(j )} belong to the set w(L). On the other hand, lim supj |w(j ) w(j1 )| = 0, which implies that the set of limit points of {w(j )} is an interval. Because w(L) is nite, the only intervals included in w(L) are isolated points, which shows that the limit limj w(j ) exists. We now proceed to proving that all the limit points of the sequence {j } belong to L. Let N be an arbitrary neighborhood of L. From Lemma 11.4.2, there exist constants C, > 0, > 0, > 0 such that for any , 0 < , , 0 < , and > 0, one may nd an integer N and a sequence {j }jN such that sup |j j | ,
jN

sup j
jN

and

sup |w(j ) w(j )|


jN

and, for any j N + 1, w(j ) w(j1 ) + j 1N c (j1 ) j C 1N (j1 ) . For j N , dene (j) = inf k 0, k+j N . For any integer p, dene p (j) = (j) p, where a b = min(a, b).
j+ p (j) j+ p (j) def

p w(j+ (j) ) w(j ) =


i=j+1

[w(i ) w(i1 )]
i=j+1

i ,

(11.64)
l i=l+1

with the convention that, for any sequence {ai } and any integer l, 0. Therefore, w(j+
p

ai =

(j)

) w(j ) = w(j+ w(j+


p

(j)

p ) w(j+ (j) )+
j+ p (j)

(j)

) w(j ) + w(j ) w(j ) 2 +


i=j+1

i .

Because {w(j )} converges, there exists N > N such that, for all j N ,
j+ p (j)

w(

j+ p (j)

) w( ) 2 +
i=j+1

i .

This implies that, for all j N and all integer p 0,

11.4 Complements
j+ p (j)

439

i 3/ .
i=j+1 j+ (j) i=j+1 j+ p (j) i=j+1 i=1 j+p i=j+1

(11.65)

Because

i = limp =
j j+p i=j+1

i and )+

i = , the previous
j+ (j) i=j+1 i

relation implies that, for all j N , (j) < and any integer p,
j+p

i 3/. For

i h(

i1

i , which implies that i i .

j+p

j+p

j+p j h1K

i=j+1

i +
i=j+1

Applying this inequality for j N and p = (j) and using that, by denition, j+ (j) N , j j+ (j) |j+ (j) j+ (j) | + |j+ (j) j |
j+ (j)

+ h1K

3/ +
i=j+1 l

i i .

Because , , and can be arbitrarily small, and suplk | i=k i i | tends to zero, the latter inequality shows that all limit points of the sequence {j } belong to N . Because N is arbitrary, all limit points of {j } belong to L.

12 Statistical Properties of the Maximum Likelihood Estimator

The maximum likelihood estimator (MLE) is one of the backbones of statistics, and as we have seen in previous chapters, it is very much appropriate also for HMMs, even though numerical approximations are required when the state space is not nite. A standard result in statistics says that, except for atypical cases, the MLE is consistent, asymptotically normal with asymptotic (scaled) variance equal to the inverse Fisher information matrix, and ecient. The purpose of the current chapter is to show that these properties are indeed true for HMMs as well, provided some conditions of rather standard nature hold. We will also employ the asymptotic results obtained to verify the validity of certain likelihood-based tests. Recall that the distribution (law) P of {Yk }k0 depends on a parameter that lies in a parameter space , which we assume is a subset of Rd for some d . Commonly, is a vector containing some components that parameterize the transition kernel of the hidden Markov chainsuch as the transition probabilities if the state space X is niteand other components that parameterize the conditional distributions of the observations given the states. Throughout the chapter, it is assumed that the HMM model is, for all , fully dominated in the sense of Denition 2.2.3 and that the underlying Markov chain is positive (see Denition 14.2.26). Assumption 12.0.1. (i) There exists a probability measure on (X, X ) such that for any x X and any , Q (x, ) with transition density q . That is, Q (x, A) = q (x, x ) (dx ) for A X . (ii) There exists a probability measure on (Y, Y) such that for any x X and any , G (x, ) with transition density function g . That is, G (x, A) = g (x, y) (dy) for A Y. (iii) For any , Q is positive, that is, Q is phi-irreducible and admits a (necessarily unique) invariant distribution denoted by .

442

12 Statistical Properties of the MLE

In this chapter, we will generally assume that is compact. Furthermore, is used to denote the true parameter, that is, the parameter corresponding to the data that we actually observe.

12.1 A Primer on MLE Asymptotics


The standard asymptotic properties of the MLE hinge on three basic results: a law of large numbers for the log-likelihood, a central limit theorem for the score function, and a law of large of numbers for the observed information. More precisely, (i) for all , n1 n () () P -a.s. uniformly over compact subsets of , where n () is the log-likelihood of the parameter given the rst n observations and () is a continuous deterministic function with a unique global maximum at ; (ii) n1/2 n ( ) N(0, J ( )) P -weakly, where J () is the Fisher information matrix at (we do not provide a more detailed denition at the moment); (iii) lim0 limn sup| | n1 2 n () J ( ) = 0 P -a.s. The function in (i) is sometimes referred to as the contrast function. We note that n1 2 n () in (iii) is the observed information matrix, so that (iii) says that the observed information should converge to the Fisher information in a certain uniform sense. This uniformity may be replaced by conditions on the third derivatives of the log-likelihood, which is common in statistical textbooks, but as we shall see, it is cumbersome enough even to deal with second derivatives of the log-likelihood for HMMs, whence avoiding third derivatives is preferable. Condition (i) assures strong consistency of the MLE, which can be shown using an argument that goes back to Wald (1949). The idea of the argument is as follows. Denote by n the maximum the ML estimator; n (n ) n () for any . Because has a unique global maximum at , ( ) () 0 for any and, in particular, ( ) (n ) 0. We now combine these two inequalities to obtain 0 ( ) (n ) ( ) n1
n ( 1

) + n1
n ()|

n (

) n1

n (n )

+ n1

n (n )

(n )

2 sup | () n

Therefore, by taking the compact subset in (i) above as itself, (n ) ( ) P -a.s. as n , which in turn implies, as is continuous with a unique global maximum at , that the MLE converges to P -a.s.. In other words, the MLE is strongly consistent.

12.2 Stationary Approximations

443

Provided strong consistency holds, properties (ii) and (iii) above yield asymptotic normality of the MLE. In fact, we must also assume that is an interior point of and that the Fisher information matrix J ( ) is nonsingular. Then we can for suciently large n make a Taylor expansion around , noting that the gradient of n vanishes at the MLE n because is maximal there,
1

0=

n (n )

n (

)+
0

2 n [

+ t(n )] dt

(n ) .

From this expansion we obtain


1 1 2 n [ 0

n1/2 (n ) =

n1

+ t(n )] dt

n1/2

n (

).

Now n converges to P -a.s. and so, using (iii), the rst factor on the righthand side tends to J ( )1 P -a.s. The second factor converges weakly to N(0, J ( )); this is (ii). Cramr-Slutskys theorem hence tells us that n1/2 (n e ) tends P -weakly to N(0, J 1 ( )), and this is the standard result on asymptotic normality of the MLE. In an entirely similar way properties (ii) and (iii) also show that for any u Rd (recall that is a subset of Rd ),
n (

+n1/2 u)

n (

) = n1/2 uT

n (

1 )+ uT [n1 2

2 n (

)]u+Rn (u) ,

where n1/2 n ( ) and n1 2 n ( ) converge as described above, and where Rn (u) tends to zero P -a.s. Such an expansion is known as local asymptotic normality (LAN) of the model, cf. Ibragimov and Hasminskii (1981, Denition II.2.1). Under this condition, it is known that so-called regular estimators (a property possessed by all sensible estimators) cannot have an asymptotic covariance matrix smaller than J 1 ( ) (Ibragimov and Hasminskii, 1981, p. 161). Because this limit is obtained by the MLE, this estimator is ecient. Later on in this chapter, we will also exploit properties (i)(iii) to derive asymptotic properties of likelihood ratio and other tests for lower dimensional hypotheses regarding .

12.2 Stationary Approximations


In this section, we will introduce a way of obtaining properties (i)(iii) for HMMs; more detailed descriptions are given in subsequent sections. Before proceeding, we will be precise on the likelihood we shall analyze. In this chapter, we generally make the assumption that the sequence {Xk }k0 is stationary; then {Xk , Yk }k0 is stationary as well. Then there is obviously a

444

12 Statistical Properties of the MLE

corresponding likelihood. However, it is sometimes convenient to work with a likelihood Lx0 ,n () that is conditional on an initial state x0 ,
n

Lx0 ,n () =

g (x0 , Y0 )
i=1

q (xi1 , xi )g (xi , Yi ) (dxi ) .

(12.1)

We could also want to replace the xed initial state by an initial distribution on (X, X ), giving L,n () =
X

Lx0 ,n () (dx0 ) .

The stationary likelihood is then L ,n (), which we will simply denote by Ln (). The advantage of working with the stationary likelihood is of course that it is the correct likelihood for the model and may hence be expected to provide better nite-sample performance. The advantage of assuming a xed initial state x0 and hence adopting the likelihood Lx0 ,n ()is that the stationary distribution is not always available in closed form when X is not nite. It is however important that g (x0 , Y0 ) is positive P -a.s.; otherwise the log-likelihood may not be well-dened. In fact, we shall require that g (x0 , Y0 ) is, P -a.s., bounded away from zero. In the following, we always assume that this condition is fullled. A further advantage of Lx0 ,n () is that the methods described in the current chapter may be extended to Markov-switching autoregressions (Douc et al., 2004), and then the stationary likelihood is almost never computable, not even when X is nite. Throughout the rest of this chapter, we will work with Lx0 ,n () unless noticed, where x0 X is chosen to satisfy the above positivity assumption but otherwise arbitrarily. The MLE arising from this likelihood has the same asymptotic properties as has the MLE arising from Ln (), provided the initial stationary distribution has smooth second-order derivatives (cf. Bickel et al., 1998), whence from an asymptotic point of view there is no loss in using the incorrect likelihood Lx0 ,n (). We now return to the analysis of log-likelihood and items (i)(iii) above. In the setting of i.i.d. observations, the log-likelihood n () is a sum of i.i.d. terms, and so (i) and (iii) follow from uniform versions of the strong law of large numbers and (ii) is a consequence of the simplest central limit theorem. In the case of HMMs, we can write x0 ,n () as a sum as well:
n x0 ,n ()

=
k=0 n

log log
k=0

g (xk , Yk ) x0 ,k|k1 [Y0:k1 ](dxk ; ) g (xk , Yk ) P (Xk dxk | Y0:k1 , X0 = x0 ) ,

(12.2) (12.3)

where x0 ,k|k1 [Y0:k1 ]( ; ) is the predictive distribution of the state Xk given the observations Y0:k1 and X0 = x0 . These terms do not form a stationary sequence however, so the law of large numbersor rather the ergodic

12.2 Stationary Approximations

445

theoremdoes not apply directly. Instead we must rst approximate x0 ,n () by the partial sum of a stationary sequence. When the joint Markov chain {Xk , Yk } has an invariant distribution, this chain is stationary provided it is started from its invariant distribution. In this case, we can (and will!) extend it to a stationary sequence {Xk , Yk }<k< with doubly innite time, as we can do with any stationary sequence. Having done this extension, we can imagine a predictive distribution of the state Xk given the innite past Y:k1 of observations. A key feature of these variables is that they now form a stationary sequence, whence the ergodic theorem applies. Furthermore we can approximate x0 ,n () by
n s n ()

=
k=0

log

g (xk , Yk ) P (Xk dxk | Y:k1 ) ,

(12.4)

where superindex s stands for stationary. Heuristically, one would expect this approximation to be good, as observations far in the past do not provide much information about the current one, at least not if the hidden Markov chain enjoys good mixing properties. What we must do is thus to give a precise denition of the predictive distribution P (Xk | Y:k1 ) given the innite past, and then show that it approximates the predictive distribution x0 ,k|k1 ( ; ) well enough that the two sums (12.2) and (12.4), after normalization by n, have the same asymptotic behavior. We can treat the score function similarly by dening a sequence that forms a stationary martingale increment sequence; for sums of such sequences there is a central limit theorem. The cornerstone in this analysis is the result on conditional mixing stated in Section 4.3. We will rephrase it here, but before doing so we state a rst assumption. It is really a variation of Assumption 4.3.24, adapted to the dominated setting and uniform in . Assumption 12.2.1. (i) The transition density q (x, x ) of {Xk } satises 0 < q (x, x ) + < for all x, x X and all , and the measure is a probability measure. (ii) For all y Y, the integral X g (x, y) (dx) is bounded away from 0 and on . Part (i) of this assumption often, but not always holds when the state space X is nite or compact. Note that Assumption 12.2.1 says that for all , the whole state space X is a 1-small set for the transition kernel Q , which implies that for all , the chain is phi-irreducible and strongly aperiodic (see Section 14.2 for denitions). It also ensures that there exists a stationary distribution for Q . In addition, the chain is uniformly geometrically ergodic in the sense that for any x X and n 0, Qn (x, ) TV (1 )n . Under Assumption 12.0.1, it holds that , and we use the same notation

446

12 Statistical Properties of the MLE

for this distribution and its density with respect to the dominating measure . Using the results of Section 14.3, we conclude that the state space X Y is 1-small for the joint chain {Xk , Yk }. Thus the joint chain is also phi-irreducible and strongly aperiodic, and it admits a stationary distribution with density (x)g (x, y) with respect to the product measure on (X Y, X Y) The joint chain also is uniformly geometrically ergodic. Put = 1 / + ; then 0 < 1. The important consequence of Assumption 12.2.1 that we need in the current chapter is Proposition 4.3.26. It says that if Assumption 12.2.1 holds true, then for all k 1, all y0:n and all initial distributions and on (X, X ), P (Xk | X0 = x, Y0:n = y0:n ) [(dx) (dx)]
X TV

k .

(12.5)

12.3 Consistency
12.3.1 Construction of the Stationary Conditional Log-likelihood We shall now construct P (Xk dxk | Y:k1 ) and g (xk , Yk ) P (Xk dxk | Y:k1 ). The latter variable will be dened as the limit of Hk,m,x () =
def

g (xk , Yk ) P (Xk dxk | Ym+1:k1 , Xm = x)

(12.6)

as m . Note that Hk,m,x () is the conditional density of Yk given Ym+1:k1 and Xm = x, under the law P . Put hk,m,x () = log Hk,m,x () and consider the following assumption. Assumption 12.3.1. b+ = sup supx,y g (x, y) < and E |log b (Y0 )| < , where b (y) = inf X g (x, y) (dx). Lemma 12.3.2. The following assertions hold true P -a.s. for all indices k, m and m such that k > (m m ): sup sup |hk,m,x () hk,m ,x ()|
x,x X def

(12.7)

k+(mm )1 , 1

(12.8) (12.9)

sup

sup

sup |hk,m,x ()| |log b+ | |log( b (Yk ))| .

m(k1) xX

12.3 Consistency

447

Proof. Assume that m m and write Hk,m,x () = g (xk , Yk )q (xk1 , xk ) (dxk )

P (Xk1 dxk1 | Ym+1:k1 , Xm = xm ) x (dxm ) , (12.10) Hk,m ,x () = g (xk , Yk )q (xk1 , xk ) (dxk ) P (Xk1 dxk1 | Ym+1:k1 , Xm = xm ) P (Xm dxm | Ym +1:k1 , Xm = x ) , and invoke (12.5) to see that |Hk,m,x () Hk,m ,x ()| k+m1 sup
xk1

(12.11)

g (xk , Yk )q (xk1 , xk ) (dxk ) g (xk , Yk ) (dxk ) . (12.12)

k+m1 +

Note that the step from the total variation bound to the bound on the dierence between the integrals does not need a factor 2, because the integrands are non-negative. Also note that (12.5) is stated for m = m = 0, but its initial time index is of course arbitrary. The integral in (12.10) can be bounded from below as Hk,m,x () g (xk , Yk ) (dxk ) , (12.13) and the same lower bound holds for (12.11). Combining (12.12) with these lower bounds and the inequality |log x log y| |x y|/(x y) shows that |hk,m,x () hk,m ,x ()| + k+m1 k+m1 = , 1

which is the rst assertion of the lemma. Furthermore note that (12.10) and (12.13) yield b (Yk ) Hk,m,x () b+ , (12.14) which implies the second assertion. Equation (12.8) shows that for any given k and x, {hk,m,x ()}m(k1) is a uniform (in ) Cauchy sequence as m , P -a.s., whence there is a P -a.s. limit. Moreover, again by (12.8), this limit does not depend on x, so we denote it by hk, (). Our interpretation of this limit is as log E [ g (Xk , Yk ) | Y:k1 ]. Furthermore (12.9) shows that provided Assumption 12.3.1 holds, {hk,m,x ()}m(k1) is uniformly bounded in L1 (P ), so that hk, () is in L1 (P ) and, by the dominated convergence theorem, the limit holds in this mode as well. Finally, by its denition {hk, ()}k0 is a stationary process, and it is ergodic because {Yk }<k< is. We summarize these ndings.

448

12 Statistical Properties of the MLE

Proposition 12.3.3. Assume 12.0.1, 12.2.1, and 12.3.1 hold. Then for each and x X, the sequence {hk,m,x ()}m(k1) has, P -a.s., a limit hk, () as m . This limit does not depend on x. In addition, for any , hk, () belongs to L1 (P ), and {hk,m,x ()}m(k1) also converges to hk, () in L1 (P ) uniformly over and x X. Having come thus far, we can quantify the approximation of the loglikelihood x0 ,n () by s (). n Proposition 12.3.4. For all n 0 and , |
x0 ,n ()

s n ()|

|log g (x0 , Y0 )| + h0, () +

1 (1 )2

P -a.s.

Proof. Letting m in (12.8) we obtain |hk,0,x0 () hk, ()| k1 /(1 ) for k 1. Therefore, P -a.s.,
n n

x0 ,n ()

s n ()|

=
k=0

hk,0,x0 ()
k=0

hk, ()
n

|log g (x0 , Y0 )| + h0, () +


k=1

k1 . 1

12.3.2 The Contrast Function and Its Properties Because hk, () is in L1 (P ) under the assumptions made above, we can dene the real-valued function () = E [hk, ()]. It does not depend on k, by stationarity. This is the contrast function () referred to above. By the ergodic theorem n1 s () () P -a.s., and by Proposition 12.3.4, n n1 x0 ,n () () P -a.s. as well. As noted above, however, we require this convergence to be uniform in , which is not guaranteed so far. In addition, we require () to be continuous and possess a unique global maximum at ; the latter is an identiability condition. In the rest of this section, we address continuity and convergence; identiability is addressed in the next one. To ensure continuity we need a natural assumption on continuity of the building blocks of the likelihood. Assumption 12.3.5. For all (x, x ) X X and y Y, the functions q (x, x ) and g (x, y) are continuous. The following result shows that hk, () is then continuous in L1 (P ). Proposition 12.3.6. Assume 12.0.1, 12.2.1, 12.3.1, and 12.3.5. Then for any ,
def

12.3 Consistency

449

sup
: | |

|h0, ( ) h0, ()| 0

as 0 ,

and () is continuous on . Proof. Recall that h0, () is the limit of h0,m,x () as m . We rst prove that for any x X and any m > 0, the latter quantity is continuous in and then use this to show continuity of the limit. Recall the interpretation of H0,m,x () as a conditional density and write H0,m,x () =
0 i=m+1 q (xi1 , xi )g (xi , Yi ) (dxm+1 ) (dx0 ) 1 i=m+1 q (xi1 , xi )g (xi , Yi ) (dxm+1 ) (dx1 )

(12.15)

The integrand in the numerator is, by assumption, continuous and bounded by ( + b+ )m , whence dominated convergence shows that the numerator is continuous with respect to (recall that is assumed nite). Likewise the denomina1 tor is continuous, and it is bounded from below by ( )m1 m+1 b (Yi ) > 0 P -a.s. Thus H0,m,x () and h0,m,x () are continuous as well. Because h0,m,x () converges to h0, () uniformly in as m , P -a.s., h0, () is continuous P -a.s. The uniform bound (12.9) assures that we can invoke dominated convergence to obtain the rst part of the proposition. The second part is a corollary of the rst one, as sup
: | |

| ( ) ()| = E

sup
: | |

| E [h0, ( ) h0, ()]| sup |h0, ( ) h0, ()| .

: | |

We can now proceed to show uniform convergence of n1

x0 ,n ()

to ().

Proposition 12.3.7. Assume 12.0.1, 12.2.1, 12.3.1, and 12.3.5. Then sup |n1
x0 ,n ()

()| 0

P -a.s. as n .

Proof. First note that because is compact, it is sucient to prove that for all , lim sup lim sup
0

sup

|n1

n : | |

x0 ,n (

) ()| = 0

P -a.s.

Now write

450

12 Statistical Properties of the MLE

lim sup lim sup


0

sup

|n1 sup sup sup

n : | |

x0 ,n (

) ()|
x0 ,n ( x0 ,n (

= lim sup lim sup


0

|n1 n1 |

) n1 )
s n (

n : | | n : | | n : | |

s n ()|

lim sup lim sup


0

)| .

+ lim sup lim sup


0

n1 | s ( ) n

s n ()|

The rst term on the right-hand side vanishes by Proposition 12.3.4 (note that Lemma 12.3.2 shows that sup |h0, ( )| is in L1 (P ) and hence nite P -a.s.). The second term is bounded by
n

lim sup lim sup


0

sup

n1

(hk, ( ) hk, ())


k=0 n

n : | |

lim sup lim sup n1


0 n

sup
k=0 : | |

|hk, ( ) hk, ()|

= lim sup E
0

sup
: | |

|h0, ( ) h0, ()| = 0 ,

with convergence P -a.s. The two nal steps follow by the ergodic theorem and Proposition 12.3.6 respectively. The proof is complete. At this point, we thus know that n1 x0 ,n converges uniformly to . The same conclusion holds when other initial distributions are put on X0 , provided sup |log g (x, Y0 ) (dx)| is nite P -a.s. When is the stationary distribution , uniform convergence can in fact be proved without this extra regularity assumption by conditioning on the previous state X1 to get rid of the rst two terms in the bound of Proposition 12.3.4; cf. Douc et al. (2004). The uniform convergence of n1 x0 ,n () to () can be usedwith an argument entirely similar to the one of Wald outlined in Section 12.1to show that the MLE converges a.s. to the set, say, of global maxima of . Because is continuous, we know that is closed and hence also compact. More precisely, for any (open) neighborhood of , the MLE will be in that neighborhood for large n, P -a.s. We say that the MLE converges to in the quotient topology. This way of describing convergence was used, in the context of HMMs, by Leroux (1992). The purpose of the identiability constraint, that () has a unique global maximum at , is thus to ensure that consists of the single point so that the MLE indeed converges to the point .

12.4 Identiability
As became obvious in the previous section, the set of global maxima of is of intrinsic importance, as this set constitutes the possible limit points of the

12.4 Identiability

451

MLE. The denition of () as a limit is however usually not suitable for extracting relevant information about the set of maxima, and the purpose of this section is to derive a dierent characterization of the set of global maxima of . 12.4.1 Equivalence of Parameters We now introduce the notion of equivalence of parameters. Denition 12.4.1. Two points , are said to be equivalent if they govern identical laws for the process {Yk }k0 , that is, if P = P . We note that, by virtue of Kolmogorovs extension theorem, and are equivalent if and only if the nite-dimensional distributions P (Y1 , Y2 , . . . , Yn ) and P (Y1 , Y2 , . . . , Yn ) agree for all n 1. We will show that a parameter is a global maximum point of if and only if is equivalent to . This implies that the limit points of the MLE are those points that govern the same law for {Yk }k0 as does . This is the best we can hope for because there is no wayeven with an innitely large sample of Y s!to distinguish between the true parameter and a dierent but equivalent parameter . Naturally we would like to conclude that no parameter other than itself is equivalent to . This is not always the case however, in particular when X is nite and we can number the states arbitrarily. We will discuss this matter further after proving the following result. Theorem 12.4.2. Assume 12.0.1, 12.2.1, and 12.3.1. Then a parameter is a global maximum of if and only if is equivalent to . An immediate implication of this result is that is a global maximum of . Proof. By the denition of () and Proposition 12.3.3, ( ) () = E
m

lim h1,m,x ( ) E
m

lim h1,m,x ()

= lim E [h1,m,x ( )] lim E [h1,m,x ()]


m

= lim E [h1,m,x ( ) h1,m,x ()] ,


m

where hk,m,x () is given in (12.7). Next, write E [h1,m,x ( ) h1,m,x ()] = E E log H1,m,x ( ) Ym+1:0 , Xm = x H1,m,x () ,

where Hk,m,x () is given in (12.6). Recalling that H1,m,x () is the conditional density of Y1 given Ym+1:0 and Xm = x, we see that the inner (conditional)

452

12 Statistical Properties of the MLE

expectation on the right-hand side is a Kullback-Leibler divergence and hence non-negative. Thus the outer expectation and the limit ( ) () are nonnegative as well, so that is a global mode of . Now pick such that () = ( ). Throughout the remainder of the proof, we will use the letter p to denote (possibly conditional) densities of random variables, with the arguments of the density indicating which random variables are referred to. For any k 1, E [log p (Y1:k |Ym+1:0 , Xm = x)]
k

=
i=1 k

E [log p (Yi |Ym+1:i1 , Xm = x)]

=
i=1

E [hi,m,x ()]

so that, employing stationarity,


m

lim E [log p (Y1:k |Ym+1:0 , Xm = x)] = k () .

Thus for any positive integer n < k, 0 = k( ( ) ()) p (Y1:k |Ym+1:0 , Xm = x) = lim E log m p (Y1:k |Ym+1:0 , Xm = x) p (Ykn+1:k |Ym+1:0 , Xm = x) = lim E log m p (Ykn+1:k |Ym+1:0 , Xm = x) p (Y1:kn |Ykn+1:k , Ym+1:0 , Xm = x) + E log p (Y1:kn |Ykn+1:k , Ym+1:0 , Xm = x) p (Y1:n |Ynkm+1:nk , Xnkm = x) lim sup E log , p (Y1:n |Ynkm+1:nk , Xnkm = x) m where the inequality follows by using stationarity for the rst term and noting that the second term is non-negative as an expectation of a (conditional) Kullback-Leibler divergence as above. Hence we have inserted a gap between the variables Y1:n whose density we examine and the variables Ynkm+1:nk and Xnkm that appear as a condition. The idea is now to let this gap tend to innity and to show that in the limit the condition has no eect. Next we shall thus show that
k mk

lim sup E

log

p (Y1:n |Ym+1:k , Xm = x) p (Y1:n |Ym+1:k , Xm = x) E log p (Y1:n ) p (Y1:n ) = 0 . (12.16)

Combining (12.16) with the previous inequality, it is clear that if () = ( ), then E {log[p (Y1:n )/p (Y1:n )]} = 0, that is, the Kullback-Leibler divergence

12.4 Identiability

453

between the n-dimensional densities p (y1:n ) and p (y1:n ) vanishes. This implies, by the information inequality, that these densities coincide except on a set with n -measure zero, so that the n-dimensional laws of P and P agree. Because n was arbitrary, we nd that and are equivalent. What remains to do is thus to prove (12.16). To that end, put Uk,m () = log p (Y1:n |Ym+1:k , Xm = x) and U () = log p (Y1:n ). Obviously, it is enough to prove that for all , lim E sup |Uk,m () U ()| = 0 .
mk

(12.17)

To do that we write p (Y1:n |Ym+1:k , Xm = x) = p (Y1:n |X0 = x0 ) Qk (xk , dx0 ) P (Xk dxk | Ym+1:k , Xm = x) and p (Y1:n ) = p (Y1:n |X0 = x0 ) Qk (xk , dx0 ) (dxk ) ,

where is the stationary distribution of {Xk }. Realizing that p (Y1:n |X0 = x0 ) is bounded from above by (b+ )n (condition on X1:n !) and that the transition kernel Q satises the Doeblin condition (see Denition 4.3.12) and is thus uniformly geometrically ergodic (see Denition 4.3.15 and Lemma 4.3.13), we obtain sup |p (Y1:n |Ym+1:k , Xm = x) p (Y1:n )| (b+ )n (1 )k
mk

(12.18)

P -a.s.. Moreover, the bound


n

p (Y1:n |X0 = x0 ) =

i=1 n

q (xi1 , xi )g (xi , Yi ) (dxi ) b (Yi )


i=1

( )n

implies that p (Y1:n |Ym+1:k , Xm = x) and p (Y1:n ) both obey the same lower bound. Combined with the observation b (Yi ) > 0 P -a.s., which follows from Assumption 12.3.1, and the bound |log(x) log(y)| |x y|/x y, (12.18) shows that
k mk

lim sup |Uk,m () U ()| 0

P -a.s.

Now (12.17) follows from dominated convergence provided E sup sup Uk,m () < .
k mk

454

12 Statistical Properties of the MLE

Using the aforementioned bounds, we conclude that this expectation is indeed nite. We remark that the basic structure of the proof is potentially applicable also to models other than HMMs. Indeed, using the notation of the proof, we may dene as () = limm E [log p (Y1 |Ym:1 )], a denition that does not exploit the HMM structure. Then the rst part of the proof, up to (12.16), does not use the HMM structure either, so that all that is needed, in a more general framework, is to verify (12.16) (or, more precisely, a version thereof not containing Xm ). For particular other processes, this could presumably be carried out using, for instance, suitable mixing properties. The above theorem shows that the points of global maxima of forming the set of possible limit points of the MLEare those that are statistically equivalent to . This result, although natural and important (but not trivial!), is however yet of a somewhat high level character, that is, not veriable in terms of low level conditions. We would like to provide some conditions, expressed directly in terms of the Markov chain and the conditional distributions g (x, y), that give information about parameters that are equivalent to and, in particular, when there is no other such parameter than . We will do this using the framework of mixtures of distributions. 12.4.2 Identiability of Mixture Densities We rst dene what is meant by a mixture density. Denition 12.4.3. Let f (y) be a parametric family of densities on Y with respect to a common dominating measure and parameter in some set . If is a probability measure on , then the density f (y) =

f (y) (d)

is called a mixture density; the distribution is called the mixing distribution. We say that the class of (all) mixtures of (f ) is identiable if f = f -a.e. if and only if = . Furthermore we say that the class of nite mixtures of (f ) is identiable if for all probability measures and with nite support, f = f -a.e. if and only if = . In other words, the class of all mixtures of (f ) is identiable if the two distributions with densities f and f respectively agree only when = . Yet another way to put this property is to say that identiability means that the mapping f is one-to-one (injective). A way, slightly Bayesian, of thinking of a mixture distribution that is often intuitive and fruitful is the following. Draw with distribution and then Y from the density f . Then, Y has density f . Many important and commonly used parametric classes of densities are identiable. We mention the following examples.

12.4 Identiability

455

(i) The Poisson family (Feller, 1943). In this case, Y = Z+ , = R+ , is the mean of the Poisson distribution, is counting measure, and f (y) = y e /y!. (ii) The Gamma family (Teicher, 1961), with the mixture being either on the scale parameter (with a xed form parameter) or on the form parameter (with a xed scale parameter). The class of joint mixtures over both parameters is not identiable however, but the class of joint nite mixtures is identiable. (iii) The normal family (Teicher, 1960), with the mixture being either on the mean (with xed variance) or on the variance (with xed mean). The class of joint mixtures over both mean and variance is not identiable however, but the class of joint nite mixtures is identiable. (iv) The Binomial family Bin(N, p) (Teicher, 1963), with the mixture being on the probability p. The class of nite mixtures is identiable, provided the number of components k of the mixture satises 2k 1 N . Further reading on identiability of mixtures is found, for instance, in Titterington et al. (1985, Section 3.1). A very useful result on mixtures, taking identiability in one dimension into several dimensions, is the following. Theorem 12.4.4 (Teicher, 1967). Assume that the class of all mixtures of the family (f ) of densities on Y with parameter is identiable. (n) Then the class of all mixtures of the n-fold product densities f (y) = n n f1 (y1 ) fn (yn ) on y Y with parameter is identiable. The same conclusion holds true when all mixtures is replaced by nite mixtures. 12.4.3 Application of Mixture Identiability to Hidden Markov Models Let us now explain how identiability of mixture densities applies to HMMs. Assume that {Xk , Yk } is an HMM such that the conditional densities g (x, y) all belong to a single parametric family. Then given Xk = x, Yk has conditional density g(x) say, where (x) is a function mapping the current state x into the parameter space of the parametric family of densities. Now assume that the class of all mixtures of this family of densities is identiable, and that we are given a true parameter of the model as well as an equivalent other parameter . Associated with these two parameters are two mappings (x) and (x), respectively, as above. As and are equivalent, the ndimensional restrictions of P and P coincide; that is, P (Y1:n ) and P (Y1:n ) agree. Because the class of all mixtures of (g ) is identiable, Theorem 12.4.4 tells us that the n-dimensional distributions of the processes { (Xk )} and {(Xk )} agree. That is, for all subsets A n , P {( (X1 ), (X2 ), . . . , (Xn )) A} = P {((X1 ), (X2 ), . . . , (Xn )) A} .

456

12 Statistical Properties of the MLE

This condition is often informative for concluding = . Example 12.4.5 (Normal HMM). Assume that X is nite, say X = {1, 2, . . . , r}, and that Yk |Xk = i N(i , 2 ). The parameters of the model are the transition probabilities qij of {Xk }, the i and 2 . We thus identify (x) = x . If and are two equivalent parameters, the laws of the processes { Xk } and {Xk } are thus the same, and in addition 2 = 2 . Here i denotes the i -component of , etc. Assuming the i to be distinct, this can only happen if the sets { 1 , . . . , r } and {1 , . . . , r } are identical. We may thus conclude that the sets of means must be the same for both parameters, but they need not be enumerated in the same order. Thus there is a permutation {c(1), c(2), . . . , c(r)} of {1, 2, . . . , r} such that c(i) = i for all i X. Now because the laws of { Xk } under P and {c(Xk ) } under P coincide with the i s being distinct, we conclude that the laws of {Xk } under P and of {c(Xk )} under P also agree, which in turn implies q ij = qc(i),c(j) for all i, j X. Hence any parameter that is equivalent to is in fact identical, up to a permutation of state indices. Sometimes the parameter space is restricted by, for instance, requiring the means i to be sorted: 1 < 2 < . . . < r , which removes the ambiguity. Such a restriction is not always desirable though; for example, in a Bayesian framework, it destroys exchangeability of the parameter in the posterior distribution (see Chapter 13). In the current example, we could also have allowed the variance 2 to 2 depend on the state, Yk |Xk = i N(i , i ), reaching the same conclusion. The assumption of conditional normality is of course not crucial either; any family of distributions for which nite mixtures are identiable would do. Example 12.4.6 (General Stochastic Volatility). In this example, we consider a stochastic volatility model of the form Yk |Xk = x N(0, 2 (x)), where 2 (x) is a mapping from X to R+ . Thus, we identify (x) = 2 (x). Again assume that we are given a true parameter as well as another parameter , which is equivalent to . Because all variance mixtures of normal distributions are identiable, the laws of { 2 (Xk )} under P and of { 2 (Xk )} under P agree. Assuming for instance that 2 (x) = 2 (x) = x (and hence also X R+ ), we conclude that the laws of {Xk } under P and P , respectively, agree. For particular models of the transition kernel Q of {Xk }, such as the nite case of the previous example, we may then be able to show that = , possibly up to a permutation of state indices. Example 12.4.7. Sometimes a model with nite state space is identiable even though the conditional densities g(x, ) are identical for several x. For instance, consider a model on the state space X = {0, 1, 2} with Yk |Xk = i N(i , 2 ), the constraints 0 = 1 < 2 , and transition probability matrix q00 q01 0 Q = q10 q11 q12 . 0 q21 q22

12.5 Asymptotics of the Score and Observed Information

457

The Markov chain {Xk } is thus a (discrete-time) birth-and-death process in the sense that it can change its state index by at most one in each step. This model is similar to models used in modeling ion channel dynamics (cf. Fredkin and Rice, 1992). Because 1 < 2 , we could then think of states 0 and 1 as closed and of state 2 as open. Now assume that is equivalent to . Just as in Example 12.4.5, we may then conclude that the law of { Xk } under P and that of {Xk } under P agree, and hence, because of the constraints on the s, that the laws of {1(Xk {0, 1}) + 1(Xk = 2)} under P and P agree. In other words, after lumping states 0 and 1 of the Markov chain we obtain processes with identical laws. This in particular implies that the distributions under P and P of the sojourn times in the state aggregate {0, 1} coincide. The probability of such a sojourn having length 1 is q12 , whence q12 = q 12 must hold. For length 2, the corresponding probability is q11 q12 , whence q11 = q 11 follows and then also q10 = q 10 as rows of Q sum up to unity. For length 3, the 2 probability is q11 q12 + q10 q01 q12 , so that nally q01 = q 01 and q00 = q 00 . We may thus conclude that = , that is, the model is identiable. The reason that identiability holds despite the means i being non-distinct is the special structure of Q. For further reading on identiability of lumped Markov chains, see Ito et al. (1992).

12.5 Asymptotic Normality of the Score and Convergence of the Observed Information
We now turn to asymptotic properties of the score function and the observed information. The score function will be discussed in some detail, whereas for the information matrix we will just state the results. 12.5.1 The Score Function and Invoking the Fisher Identity Dene the score function
n x0 ,n () = k=0

log

g (xk , Yk ) P (Xk dxk | Y0:k1 , X0 = x0 ) .

(12.19) To make sure that this gradient indeed exists and is well-behaved enough for our purposes, we make the following assumptions. Assumption 12.5.1. There exists an open neighborhood U = { : | | < } of such that the following hold. (i) For all (x, x ) X X and all y Y, the functions q (x, x ) and g (x, y) are twice continuously dierentiable on U.

458

12 Statistical Properties of the MLE

(ii) sup sup


U x,x

log q (x, x ) <

and sup sup


U x,x 2

log q (x, x ) < .

(iii) E and E sup sup


U x 2

sup sup
U x

log g (x, Y1 )

<

log g (x, Y1 )

<.

(iv) For -almost all y Y, there exists a function fy : X R+ in L1 () such that supU g (x, y) fy (x). 2 1 (v) For -almost all x X, there exist functions fx : Y R+ and fx : Y 1 1 2 2 R+ in L () such that g (x, y) fx (y) and g (x, y) fx (y) for all U. These assumptions assure that the log-likelihood is twice continuously dierentiable, and also that the score function and observed information have nite moments of order two and one, respectively, under P . The assumptions are natural extensions of standard assumptions that are used to prove asymptotic normality of the MLE for i.i.d. observations. The asymptotic results to be derived below are valid also for likelihoods obtained using a distribution for X0 (such as the stationary one), provided this distribution satises conditions similar to the above ones: for all x X, (x) is twice continuously dierentiable on U, and the rst and second derivatives of log (x) are bounded uniformly over U and x X. We shall now study the score function and its asymptotics in detail. Even though the log-likelihood is dierentiable, one must take some care to arrive at an expression for the score function that is useful. A tool that is often useful in the context of models with incompletely observed data is the socalled Fisher identity, which we encountered in Section 10.1.3. Invoking this identity, which holds in a neighborhood of under Assumption 12.5.1, we nd that (cf. (10.29))
n x0 ,n ()

log g (x0 , Y0 ) + E
k=1

(Xk1 , Xk , Yk ) Y0:n , X0 = x0

(12.20) where (x, x , y ) = log[q (x, x )g (x , y )]. However, just as when we obtained a law of large numbers for the normalized log-likelihood, we want to express the score function as a sum of increments, conditional scores. For that purpose we write

12.5 Asymptotics of the Score and Observed Information


n x0 ,n () n

459

x0 ,0 ()

+
k=1

x0 ,k ()

x0 ,k1 ()}

=
k=0

hk,0,x0 () , (12.21)

where h0,0,x0 =

log g (x0 , Y0 ) and, for k 1,


k

hk,0,x () = E
i=1

(Xi1 , Xi , Yi ) Y0:k , X0 = x
k1

E
i=1

(Xi1 , Xi , Yi ) Y0:k1 , X0 = x

Note that hk,0,x () is the gradient with respect to of the conditional loglikelihood hk,0,x () as dened in (12.7). It is a matter of straightforward algebra to check that (12.20) and (12.21) agree. 12.5.2 Construction of the Stationary Conditional Score We can extend, for any integers k 1 and m 0, the denition of hk,0,x () to
k

hk,m,x () = E
i=m+1

(Xi1 , Xi , Yi ) Ym+1:k , Xm = x
k1

E
i=m+1

(Xi1 , Xi , Yi ) Ym+1:k1 , Xm = x

with the aim, just as before, to let m . This will yield a denition of hk, (); the dependence on x will vanish in the limit. Note however that the construction below does not show that this quantity is in fact the gradient of hk, (), although one can indeed prove that this is the case. As noted in Section 12.1, we want to prove a central limit theorem (CLT) for the score function evaluated at the true parameter. A quite general way to do that is to recognize that the corresponding score increments form, under reasonable assumptions, a martingale increment sequence with respect to the ltration generated by the observations. This sequence is not stationary though, so one must either use a general martingale CLT or rst approximate the sequence by a stationary martingale increment sequence. We will take the latter approach, and our approximating sequence is nothing but {hk, ( )}. k, (). First write hk,m,x () as We now proceed to the construction of h hk,m,x () = E [ (Xk1 , Xk , Yk ) | Ym+1:k , Xm = x]
k1

(E [ (Xi1 , Xi , Yi ) | Ym+1:k , Xm = x]
i=m+1

E [ (Xi1 , Xi , Yi ) | Ym+1:k1 , Xm = x]) . (12.22)

460

12 Statistical Properties of the MLE

The following result shows that it makes sense to take the limit as m in the previous display. Proposition 12.5.2. Assume 12.0.1, 12.2.1, and 12.5.1 hold. Then for any integers 1 i k, the sequence {E [ (Xi1 , Xi , Yi ) | Ym+1:k , Xm = x]}m0 converges P -a.s. and in L2 (P ), uniformly with respect to U and x X, as m . The limit does not depend on x. We interpret and write this limit as E [ (Xi1 , Xi , Yi ) | Y:k ]. Proof. The proof is entirely similar to that of Proposition 12.3.3. For any (x, x ) X X and non-negative integers m m, E [ (Xi1 , Xi , Yi ) | Ym+1:k , Xm = x] E [ (Xi1 , Xi , Yi ) | Ym +1:k , Xm = x ] = (xi1 , xi , Yi ) Q (xi1 , dxi ) P (Xi1 dxi1 | Ym+1:k , Xm = xm ) [x (dxm ) P (Xm dxm | Ym +1:k , Xm = x )] 2 sup (x, x , Yi ) (i1)+m ,
x,x

(12.23)

where the inequality stems from (12.5). Setting x = x in this display shows that {E [ (Xi1 , Xi , Yi ) | Ym+1:k , Xm = x]}m0 is a Cauchy sequence, thus converging P -a.s. The inequality also shows that the limit does not depend on x. Moreover, because for any non-negative integer m, x X and U, E [ (Xi1 , Xi , Yi ) | Ym+1:k , Xm = x] sup (x, x , Yi )
x,x

with the right-hand side belonging to L2 (P ). The inequality (12.23) thus also shows that {E [ (Xi1 , Xi , Yi ) | Ym+1:k , Xm = x]}m0 is a Cauchy sequence in L2 (P ) and hence converges in L2 (P ). With the sums arranged as in (12.22), we can let m and dene, for k 1, hk, () = E [ (Xk1 , Xk , Yk ) | Y:k ]
k1

(E [ (Xi1 , Xi , Yi ) | Y:k ] E [ (Xi1 , Xi , Yi ) | Y:k1 ]) .


i=

The following result gives an L2 -bound on the dierence between hk,m,x () k, (). and h

12.5 Asymptotics of the Score and Observed Information

461

Lemma 12.5.3. Assume 12.0.1, 12.2.1, 12.3.1, and 12.5.1 hold. Then for k 1, (E hk,m,x () hk, () 2 )1/2
1/2

12 E

sup
x,x X

(x, x , Y1 )

(k+m)/21 . 1

Proof. The idea of the proof is to match, for each index i of the sums express ing hk,m,x () and hk, (), pairs of terms that are close. To be more precise, we match 1. The rst terms of hk,m,x () and hk, (); 2. For i close to k, E [ (Xi1 , Xi , Yi ) | Ym+1:k , Xm = x] and E [ (Xi1 , Xi , Yi ) | Y:k ] , and similarly for the corresponding terms conditioned on Ym+1:k1 and Y:k1 , respectively; 3. For i far from k, E [ (Xi1 , Xi , Yi ) | Ym+1:k , Xm = x] and E [ (Xi1 , Xi , Yi ) | Ym+1:k1 , Xm = x] , and similarly for the corresponding terms conditioned on Y:k and Y:k1 , respectively. We start with the second kind of matches (of which the rst terms are a special case). Taking the limit in m in (12.23), we see that E [ (Xi1 , Xi , Yi ) | Ym+1:k , Xm = x] E [ (Xi1 , Xi , Yi ) | Y:k ] 2 sup
x,x X

(x, x , Yi ) (i1)+m .

This bound remains the same if k is replaced by k 1. Obviously, it is small if i is far away from m, that is, close to k. For the third kind of matches, we need a total variation bound that works backwards in time. Such a bound reads P (Xi | Ym+1:k , Xm = x) P (Xi | Ym+1:k1 , Xm = x)
TV

k1i .

The proof of this bound is similar to that of Proposition 4.3.23 and uses the time-reversed process. We postpone the proof to the end of this section. We

462

12 Statistical Properties of the MLE

may also let m and omit the condition on Xm without aecting the bound. As a result of these bounds, we have E [ (Xi1 , Xi , Yi ) | Ym+1:k , Xm = x] E [ (Xi1 , Xi , Yi ) | Ym+1:k1 , Xm = x] 2 sup
x,x X

(x, x , Yi ) k1i ,

with the same bound being valid if the conditioning is on Y:k and Y:k1 , respectively. This bound is small if i is far away from k. Combining these two kinds of bounds and using Minkowskis inequality for the L2 -norm, we nd that (E hk,m,x () hk, () 2 )1/2 is bounded by
k1 m

2k+m1 + 2 2 4
k+m1

(ki1 i+m1 ) + 2
i=m+1 i=

ki1 i+m1

+4
<i(km)/2

ki1 + 4
(km)/2i<

(k+m)/21

12

up to the factor (E supx,x X (x, x , Yi ) 2 )1/2 . The proof is complete. We now establish the backwards in time uniform forgetting property, which played a key role in the above proof. Proposition 12.5.4. Assume 12.0.1, 12.2.1, and 12.3.1 hold. Then for any integers i, k, and m such that m 0 and m < i < k, any xm X, ym+1:k Yk+m , and U, P (Xi | Ym+1:k = ym+1:k , Xm = xm ) P (Xi | Ym+1:k1 = ym+1:k1 , Xm = xm )
TV

k1i .

Proof. The cornerstone of the proof is the observation that conditional on Ym+1:k and Xm , the time-reversed process X with indices from k down to m is a non-homogeneous Markov chain satisfying a uniform mixing condition. We shall indeed use a slight variant of the backward decomposition developed in Section 3.3.2. For any j = m + 1, . . . , k 1, we thus dene the backward kernel (cf. (3.39)) by Bxm ,j [ym+1:j ](x, f ) =
j u=m+1 q(xu1 , xu )g(xu , yu ) (dxu ) f (xj )q(xj , x) j u=m+1 q(xu1 , xu )g(xu , yu ) (dxu ) q(xj , x)

(12.24)

12.5 Asymptotics of the Score and Observed Information

463

for any f Fb (X). For brevity, we do not indicate the dependence of the quantities involved on . We note that the integral of the denominator of this j display is bounded from below by ( )m+j m+1 g (xu , yu ) (dxu ), and is hence positive P -a.s. under Assumption 12.3.1. It is trivial that for any x X,
j

u=m+1 j

q(xu1 , xu )g(xu , yu ) (dxu ) f (xj )q(xj , x) =

u=m+1

q(xu1 , xu )g(xu , yu ) (dxu )q(xj , x)Bxm ,j [ym+1:j ](x, f ) ,

which implies that E [f (Xj ) | Xj+1:k , Ym+1:k = ym+1:k ,Xm = x] = Bxm ,j [ym+1:j ](Xj+1 , f ) . This is the desired Markov property referred to above. Along the same lines as in the proof of Proposition 4.3.26, we can show that the backward kernels satisfy a Doeblin condition, + xm ,j [ym+1:j ] Bxm ,j [ym+1:j ](x, ) xm ,j [ym+1:j ] , + where for any f Fb (X), xm ,j [ym+1:j ](f ) =
j u=m+1 q (xu1 , xu )g (xu , yu ) (dxu ) f (xj ) j u=m+1 q (xu1 , xu )g (xu , yu ) (dxu )

Thus Lemma 4.3.13 shows that the Dobrushin coecient of each backward kernel is bounded by = 1 / + . Finally P (Xi | Ym+1:k1 = ym+1:k1 , Xm = xm ) = P (Xi | Ym+1:k1 = ym+1:k1 , Xm = xm , Xk1 = xk1 ) P (Xk1 dxk1 | Ym+1:k1 = ym+1:k1 , Xm = xm ) and P (Xi | Ym+1:k = ym+1:k , Xm = xm ) = P (Xi | Ym+1:k1 = ym+1:k1 , Xm = xm , Xk1 = xk1 ) P (Xk1 dxk1 | Ym+1:k = ym+1:k , Xm = xm ) , so that the two distributions on the left-hand sides can be considered as the result of running the above-described reversed conditional Markov chain from index k 1 down to index i, using two dierent initial conditions. Therefore, by Proposition 4.3.10, they dier by at most k1i in total variation distance. The proof is complete.

464

12 Statistical Properties of the MLE

12.5.3 Weak Convergence of the Normalized Score We now return to the question of a weak limit of the normalized score n n1/2 k=0 hk,0,x0 ( ). Using Lemma 12.5.3 and Minkowskis inequality, we see that 1/2
n 2

1/2

(hk,0,x0 ( ) hk, ( ))
k=0 n

2 1/2

n1/2
k=0

hk,0,x0 ( ) hk, ( )

as n ,

whence the limiting behavior of the normalized score agrees with that of n n1/2 k=0 hk, ( ). Now dene the ltration F by Fk = (Yi , < i k) for all integer k. By conditional dominated convergence,
k1

E
i=

(E [ (Xi1 , Xi , Yi ) | Y:k ] E [ (Xi1 , Xi , Yi ) | Y:k1 ]) | Fk1 ] = 0 ,

and Assumption 12.5.1 implies that E [ (Xk1 , Xk , Yk ) | Y:k1 ] = E [E [ (Xk1 , Xk , Yk ) | Y:k1 , Xk1 ] | Fk1 ] = 0 . It is also immediate that hk, ( ) is Fk -measurable. Hence the sequence {hk, ( )}k0 is a P -martingale increment sequence with respect to the ltration {Fk }k0 in L2 (P ). Moreover, this sequence is stationary because {Yk }<k< is. Any stationary martingale increment sequence in n L2 (P ) satises a CLT (Durrett, 1996, p. 418), that is, n1/2 0 hk, ( ) N(0, J ( )) P -weakly, where J ( ) = E [h1, ( )ht ( )] 1,
def

(12.25)

is the limiting Fisher information. Because the normalized score function has the same limiting behavior, the following result is immediate. Theorem 12.5.5. Under Assumptions 12.0.1, 12.2.1, 12.3.1, and 12.5.1, n1/2
x0 ,n (

) N(0, J ( ))

P -weakly

for all x0 X, where J ( ) is the limiting Fisher information as dened above. We remark that above, we have normalized sums with indices from 0 to n, that is, with n + 1 terms, by n1/2 rather than by (n + 1)1/2 . This of course does not aect the asymptotics. However, if J ( ) is estimated for the purpose of making a condence interval for instance, then one may well normalize it using the number n + 1 of observed data.

12.5 Asymptotics of the Score and Observed Information

465

12.5.4 Convergence of the Normalized Observed Information We shall now very briey discuss the asymptotics of the observed information matrix, 2 x0 ,n (). To handle this matrix, one can employ the so-called missing information principle (see Section 10.1.3 and (10.30)). Because the complete information matrix, just as the complete score, has a relatively simple form, this principle allows us to study the asymptotics of the observed information in a fashion similar to what was done above for the score function. The analysis becomes more dicult however, as covariance terms, arising from the conditional variance of the complete score, also need to be accounted for. In addition, we need the convergence to be uniform in a certain sense. We state the following theorem, whose proof can be found in Douc et al. (2004). Theorem 12.5.6. Under Assumptions 12.0.1, 12.2.1, 12.3.1, and 12.5.1,
0 n | |

lim lim

sup

(n1

2 x0 ,n ())

J ( ) = 0

P -a.s.

for all x0 X. 12.5.5 Asymptotics of the Maximum Likelihood Estimator The general arguments in Section 12.1 and the theorems above prove the following result. Theorem 12.5.7. Assume 12.0.1, 12.2.1, 12.3.1, 12.3.5, and 12.5.1, and that is identiable, that is, is equivalent to only if = (possibly up to a permutation of states if X is nite). Then the following hold true. (i) The MLE n = x0 ,n is strongly consistent: n P -a.s. as n . (ii) If the Fisher information matrix J ( ) dened above is non-singular and is an interior point of , then the MLE is asymptotically normal: n1/2 (n ) N(0, J ( )1 ) P -weakly as n

for all x0 X. (iii) The normalized observed information at the MLE is a strongly consistent estimator of J ( ): n1
2 x0 ,n (n )

J ( )

P -a.s. as n .

As indicated above, the MLE n depends on the initial state x0 , but that dependence will generally not be included in the notation. The last part of the result is important, as is says that condence intervals or regions and hypothesis tests based on the estimate (n+1)1 2 x0 ,n (n ) of J ( ) will asymptotically be of correct size. In general, there is no closed-form

466

12 Statistical Properties of the MLE

expression for J ( ), so that it needs to be estimated in one way or another. The observed information is obviously one way to do that, while another one is to simulate data Y1:N from the HMM, using the MLE, and then computing 1 2 (N +1) x0 ,N (n ) for this set of simulated data and some x0 . An advantage of this approach is that N can be chosen arbitrarily large. Yet another approach, motivated by (12.25), is to estimate the Fisher information by the empirical covariance matrix of the conditional scores of (12.19) at the MLE, that n is, by (n + 1)1 0 [Sk|k1 (n ) S(n )][Sk|k1 (n ) S(n )]t with Sk|k1 () = n 1 log g (x, Yk ) x0 ,k|k1 [Y0:k1 ](dx ; ) and S() = (n + 1) 0 Sk|k1 (). This estimate can of course also be computed from estimated data, then using an arbitrary sample size. The conditional scores may be computed as Sk|k1 () = x0 ,k () x0 ,k1 (), where the scores are computed using any of the methods of Section 10.2.3.

12.6 Applications to Likelihood-based Tests


The asymptotic properties of the score function and observed information have immediate implications for the asymptotics of the MLE, as has been described in previous sections. However, there are also other conclusions that can be drawn from these convergence results. One such application is the validity of some classical procedures for testing whether lies in some subset, 0 say, of the parameter space . Suppose that 0 is an (d s)-dimensional subset that may be expressed in terms of constraints Ri () = 0, i = 1, 2, . . . , s, and that there is an equivalent formulation i = bi (), i = 1, 2, . . . , d , where is the constrained parameter lying in a subset of Rd s . We also let be a point such that = b( ). Each function Ri and bi is assumed to be continuously dierentiable and such that the matrices C = Ri j and
sd

D =

bi j

d (d s)

have full rank (s and d s respectively) in a neighborhood of and , respectively. Perhaps the simplest example is when we want to test a simple (point) null hypothesis = 0 versus the alternative = 0 . Then, we take Ri () = i 0i and bi () = i0 for i = 1, 2, . . . , d . In this case, is void as s = d and hence d s = 0. Furthermore, C is the identity matrix and D is void. Now suppose that we want to test the equality i = i0 only for i in a subset K of the d coordinates of , where K has cardinality s. The constraints we employ are then Ri () = i 0i for i K; furthermore, comprises i for i K and, using the d s indices not in K for , bi () = 0i for i K and bi () = i otherwise. Again it is easy to check that C and D are constant and of full rank.

12.6 Applications to Likelihood-based Tests

467

Example 12.6.1 (Normal HMM). A slightly more involved example concerns the Gaussian hidden Markov model with nite state space {1, 2, . . . , r} 2 and conditional distributions Yk |Xk = i N(i , i ). Suppose that we want 2 to test for equality of all of the r component-wise conditional variances i : 2 2 2 2 2 1 = 2 = . . . = r . Then, the R-functions are for instance i r for 2 i = 1, 2, . . . , r 1. The parameter is obtained by removing from all i and 2 then adding a common conditional variance ; those b-functions referring to 2 any of the i evaluate to 2 . The matrices C and D are again constant and of full rank. A further application, to test the structure of conditional covariance matrices in a conditionally Gaussian HMM with multivariate output, can be found in Giudici et al. (2000). There are many dierent tests available for testing the null hypothesis 0 versus the alternative \ 0 . One is the generalized likelihood ratio test, which uses the test statistic n = 2 sup
x0 ,n ()

sup
0

x0 ,n ()

Another one is the Wald test, which uses the test statistic
t Wn = nR(n )t [Cn Jn (n )1 C ]1 R(n ) ,
n

where R() is the s 1 vector of R-functions evaluated at , and Jn () = n1 2 x0 ,n () is the observed information evaluated at . Yet another test is based on the Rao statistic, dened as
0 0 0 Vn = n1 Sn (n )Jn (n )1 Sn (n )t , 0 where n is the MLE over 0 , that is, the point where x0 ,n () is maximized subject to the constraint Ri () = 0, 1 i s, and Sn () = x0 ,n () is the score function at . This test is also known under the names ecient score test and Lagrange multiplier test. The Wald and Rao test statistics are usually dened using the true Fisher information J () rather than the observed one, but as J () is generally infeasible to compute for HMMs, we replace it by the observed counterpart. Statistical theory for i.i.d. data suggests that the likelihood ratio, Wald and Rao test statistics should all converge weakly to a 2 distribution with s degrees of freedom provided 0 holds true, so that an approximate p-value of the test of this null hypothesis can be computed by evaluating the complementary distribution function of the 2 distribution at the point n , s Wn , or Vn , whichever is preferred. We now state formally that this procedure is indeed correct.

Theorem 12.6.2. Assume 12.0.1, 12.2.1, 12.3.1, 12.3.5, and 12.5.1 as well as the conditions stated on the functions Ri and bi above. Also assume that

468

12 Statistical Properties of the MLE

is identiable, that is, is equivalent to only if = (possibly up to a permutation of states if X is nite), that J ( ) is non-singular, and that and are interior points of and , respectively. Then if 0 holds true, each of the test statistics n , Wn , and Vn converges P -weakly to the 2 distribution as n . s The proof of this result follows, for instance, Sering (1980, Section 4.4). The important observation is that the validity of the proof does not hinge on independence of the data but on asymptotic properties of the score function and the observed information, properties that have been established for HMMs in this chapter. It is important to realize that a key assumption for Theorem 12.6.2 to hold is that is identiable, so that n converges to a unique point . As a result, the theorem does not apply to the problem of testing the number of components of a nite state HMM. In the normal HMM for instance, with 2 Yk |Xk = i N(i , i ), one can indeed eectively remove one component by 2 2 invoking the constraints 1 2 = 0 and 1 2 = 0, say. In this way, within 0 , components 1 and 2 collapse into a single one. However, any 0 is then non-identiable as the transition probabilities q12 and q21 , among others, can be chosen arbitrarily without changing the dynamics of the model. Hence Theorem 12.6.2 does not apply, and in fact we know from Chapter 15 that the limiting distribution of the likelihood ratio test statistic for selecting the number of components in a nite state HMM is much more complex than a 2 distribution. The reason that Theorem 12.6.2 fails is that its proof crucially depends on a unique point to which n converges and around which loglikelihoods can be Taylor-expanded.

12.7 Complements
The theoretical statistical aspects of HMMs and related models have essentially been developed since 1990. The exception is the seminal paper Baum and Petrie (1966) and the follow-up Petrie (1969), which both consider HMMs for which X and Y are nite. Such HMMs can be viewed as a process obtained by lumping states of a Markov chain living on a larger set X Y, and this idea lies behind much of the analysis in these early papers. Yet Baum and Petrie (1966) contains the basic idea used in the current chapter, namely that of dening log-likelihoods, score functions, etc., conditional on the innite past, and bounds that quantify how far these variables are from their counterparts conditional on a nite past. Baum and Petrie (1966) established consistency and asymptotic normality of the MLE, while Petrie (1969) took a closer look at identiability, and in fact a lot more, which was not studied in detail in the rst paper. Leroux (1992) was the rst to carry out some analysis on more general HMMs, with nite X but general Y. He proved consistency of the MLE by an

12.7 Complements

469

approach based on Kingmans subadditive ergodic theorem and did also provide a very useful discussion on identiability on which much of the above one (Section 12.4) is based. Lerouxs approach was thus not based on conditioning on the innite past; the subadditive ergodic approach however has the drawback that it applies to analysis of the log-likelihood only and not to the score function or observed information. A few years later, Bickel and Ritov (1996) took the rst steps toward an analysis of the MLE for models of the kind studied by Leroux. Their results imply so-called local asymptotic normality (LAN) of the log-likelihood, but not asymptotic normality of the MLE without some extra assumptions. This result was instead obtained by Bickel et al. (1998), who based their analysis on the innite past approach almost entirely, employing bounds on conditional mixing rates similar to those of Baum and Petrie (1966). This analysis was generalized to models with compact X by Jensen and Petersen (1999). Finally, as mentioned above, Douc et al. (2004) took this approach to the point where autoregression is also allowed, using the mixing rate bound of Proposition 4.3.23. Neither Bickel et al. (1998) nor Jensen and Petersen (1999) used uniform forgetting to derive their bounds, but both of them can easily be stated in such terms. Higher order derivatives of the log-likelihood are studied in Bickel et al. (2002). A quite dierent approach to studying likelihood asymptotics is to express the log-likelihood through the predictor,
n x0 ,n ()

=
k=1

log
X

g (x, Yk ) x0 ,k|k1 (dx; ) ,

cf. Chapter 3, and then dierentiating the recursive formula (3.27) for x0 ,k|k1 with respect to to obtain recursive expressions for the score function and observed information. This approach is technically more involved than that using the innite past but does allow for analysis of recursive estimators such as recursive maximum likelihood. Le Gland and Mevel (2000) studied the recursive approach for HMMs with nite state space, and Douc and Matias (2002) extended the results to HMMs on compact state spaces. As good as all of the results above can be extended to Markov-switching autoregressions; see Douc et al. (2004). Under Assumption 12.2.1, the conditional chain then still satises the same favorable mixing properties as in Section 4.3. The log-likelihood, score function, and observed observation can be analyzed using the ideas exposed in this chapter; we just need to replace some of the assumptions by analogs including regressors (lagged Y s). Other papers that examine asymptotics of estimators in Markov-switching autoregressions include Francq and Roussignol (1997), Krishnamurthy and Rydn e (1998), and Francq and Roussignol (1998). Markov-switching GARCH models were studied by Francq et al. (2001).

13 Fully Bayesian Approaches

Some previous chapters have already mentioned MCMC and conditional (or posterior) distributions, especially in the set-up of posterior state estimation and simulation. The spirit of this chapter is obviously dierent in that it covers the fully Bayesian processing of HMMs, which means that, besides the hidden states and their conditional (or parameterized) distributions, the model parameters are assigned probability distributions, called prior distributions, and the inference on these parameters is of Bayesian nature, that is, conditional on the observations (or the data). Because more advanced Markov chain Monte Carlo methodology is also needed for this fully Bayesian processing, additional covering of MCMC methods, like reversible jump techniques, will be given in this chapter (Section 13.2). The emphasis is put on HMMs with nite state space (X is nite), but some facts are general and the case of continuous state space is addressed at some points.

13.1 Parameter Estimation


13.1.1 Bayesian Inference Although the whole apparatus of modern Bayesian inference cannot be discussed here (we refer the reader to, e.g., Robert, 2001, or Gelman et al., 1995), we briey recall the basics of a Bayesian analysis of a statistical model, and we also introduce some notation not used in earlier chapters. Given a general parameterized model Y p(y|), ,

where p(y|) thus denotes a parameterized density, the idea at the core of Bayesian analysis is to provide an inferential assessment (on ) conditional on the realized value of Y , which we denote (as usual) by y. Obviously, to give a proper probabilistic meaning to this conditioning, itself must be embedded with a probability distribution called the prior distribution, which

472

13 Fully Bayesian Approaches

is denoted by (d). The choice of this prior distribution is often decided on practicality grounds rather than strong subjective belief or overwhelming prior information, but there also exist less disturbing (or subjective) choices called non-informative priors, as we will discuss below. Denition 13.1.1 (Bayesian Model). A Bayesian model is given by the completion of a statistical model Y p(y|), ,

with a probability distribution (d), called the prior distribution, on the parameter space . The associated posterior distribution is given by Bayes theorem as the conditional distribution of given the observation y, (d|y) = p(y|)(d) . p(y|) (d) (13.1)

The density p(y|) is the likelihood of the model and will also be denoted by L(y|) as in previous chapters. Note that in this chapter, we always assume that both the prior and the posterior distributions admits densities that we denote by () and (|y), respectively. For the sake of notational simplicity, the dominating measure for both of these densities, whose exact specication is not important here, is denoted by d. Once the prior distribution is selected, Bayesian inference is, in principle, over, that is, completely determined, as the estimation, testing, and evaluation procedures are provided by the prior and the associated loss function. For instance, if the loss function for the evaluation of estimators is the quadratic loss function loss(, ) = 2 , the corresponding Bayes procedure is the expected value of , either under the prior distribution (when no observation is available), or under the posterior distribution, p(y|) (d) = (d|y) = . p(y|) (d) When no specic loss function is available, this estimator is often used as the default estimator, although alternatives also are available. A specic alternative is the maximum marginal posterior estimator, dened as i = arg maxi i (i |y) for each component i of the vector . A diculty with this estimator is that the marginal posteriors i (i |y) = (|y) di ,

13.1 Parameter Estimation

473

where i = {j , j = i}, are often intractable, especially in the setting of latent variable models like HMMs. Another alternative, not to be confused with the previous one, is the maximum a posteriori estimator (MAP), = arg max (|y) = arg max ()p(y|) , (13.2)

which is thus in principle easier to compute because the function to maximize is usually provided in closed form. However, numerical problems make the optimization involved in nding the MAP far from trivial. Note also here the similarity of (13.2) with the maximum likelihood estimator: the inuence of the prior distribution () progressively disappears with the number of observations and the MAP estimator recovers the asymptotic properties of the MLE. This is, of course, only true if the support of the distribution contains the true value, and if latent variables like the hidden states of the HMMthe number of which grows linearly with nare not adjoined to the parameter vector . See Schervish (1995) for more details on the asymptotics of Bayesian estimators. We will discuss in more detail the important issue of selection of the prior distribution for HMMs in Section 13.1.2, but at this point we note that when the model is from an exponential family of distributions, in so-called natural parameterization (which corresponds to the case () = in Denition 10.1.5), p(y|) = exp t S(y) c() h(y) , there exists a generic class of priors called the class of conjugate priors, (|, ) exp t c() ,

which are parameterized by a positive real value and a vector of the same dimension as the sucient statistic S(y). These parameterized prior distributions on are thus such that the posterior distribution can be written as (|, , y) = [| (y), (y)] . (13.3) Equation (13.3) simply says that the conjugate prior is such that the prior and posterior densities belong to the same parametric family of densities, but with dierent parameters. Indeed, the parameters of the posterior density are updated, using the observations, relative to the prior parameters. To avoid confusion, the parameters involved in the prior distribution on the model parameter are usually called hyperparameters. Example 13.1.2 (Normal Distribution). Consider a normal N(, 2 ) distribution for Y and assume we have i.i.d. observations y0 , y1 , . . . , yn . Assuming is to be estimated, the conjugate prior associated with this distribution is, again, normal N(, ), as then

474

13 Fully Bayesian Approaches


n

(|y0:n ) exp{( ) /2}


k=0

exp{(yk )2 /2 2 } 2 S + 2 ,

1 exp 2 2

1 n+1 + 2

where S is the sum of the yk . Inspecting the right-hand side shows that it is proportional (in ) to the density of a normal distribution with mean (S + 2 /)/[(n + 1) + 2 /] and variance 2 /[(n + 1) + 2 /]. In the case where 2 is to be estimated and is known, the conjugate prior is instead the inverse Gamma distribution IG(, ), with density ( 2 |, ) = Indeed, with this prior, ( 2 |y1:n ) ( 2 )(+1) e/
2 2 e/ . 2 )+1 ()(

1 exp{(yk )2 /2 2 } 2 k=0

= ( 2 )[(n+1)/2++1] exp{(S (2) /2 + )/ 2 } , where S (2) = k=0 (yk )2 . Hence, the posterior distribution of 2 is the inverse gamma distribution IG((n + 1)/2 + , S (2) /2 + ). As argued in Robert (2001), there is no compelling reason to choose these priors, except for their simplicity, but the restrictive aspect of conjugate priors can be attenuated by using hyperpriors on the hyperparameters. Those hyperpriors can be chosen amongst so-called non-informative (or vague) priors to attenuate the impact on the resulting inference. As an aside related to this point, let us recall that the introduction of vague priors within the Bayesian framework allows for a closure of this framework, in the sense that limits of Bayes procedures are also Bayes procedures for non-informative priors. Example 13.1.3 (Normal Distribution, Continued). A limiting case of the conjugate N(, ) prior is obtained when letting go to innity. In this case, the posterior (|y) is the same as the posterior obtained with the at prior () = 1, which is not the density of a probability distribution but simply the density of Lebesgue measure! Although this sounds like an invalid extension of the probabilistic framework, it is quite correct to dene posterior distributions associated with positive -nite measures , then viewing (13.1) as a formal expression valid as long as the integral in the denominator is nite (almost surely). More detailed accounts are provided in Hartigan (1983), Berger (1985), or Robert (2001, Section 1.5) about this possibility of using -nite measures (sometimes called improper priors) in settings where true probability prior distributions are too
n

13.1 Parameter Estimation

475

dicult to come up with or too subjective to be accepted by all. Let us conclude this aside with the remark that location models y p(y ) are usually associated with at priors () = 1, whereas scale models y 1 f 1

are usually associated with the log-transform of a at prior, that is, () = 1 .

13.1.2 Prior Distributions for HMMs In the specic set-up of HMMs, there are typically two separate entities of the parameter vector . That is, can be decomposed as = (, ) , where parameterizes the transition pdf q(, ) = q (, ) and parameterizes the conditional distribution of Y0:n given X0:n , with marginal conditional pdf g(, ) = g (, ). The reason for this decomposition should be clear from Chapter 10 on the EM framework: when conditioned on the (latent) chain X0:n , the parameter is estimated as in a regular (non-latent) model, whereas the parameter only depends on the chain X0:n . A particular issue is the distribution of the initial state X0 . In general, it is assumed either that X0 is xed and known ( is then degenerate); or that X0 is random, unknown, and is parameterized by a separate parameter; or that X0 is random, unknown, and with being parameterized by . In the latter case, a standard setting is that {Xk }k0 is assumed stationaryso that the HMM as a whole is stationaryand is then the stationary distribution of the transition kernel Q = Q . A particular instance of the second case is to assume that is xed, for example uniform on X. We remark that if is parameterized by a separate parameter, for instance the probabilities (1 , . . . , r ) themselves, there is of course no hope of being able to estimate this parameter consistently, as there is only one variable X0 that we do not even observe!whose distribution is given by . The above is formalized in the following separation lemma about . Lemma 13.1.4. Assume that the prior distribution () is such that () = () () (13.4)

and that the distribution of X0 depends on or on another separate parameter. Then, given x0:n and y0:n , and are conditionally independent, and the conditional posterior distribution of does not depend on the observations y0:n .

476

13 Fully Bayesian Approaches

Proof. The proof is straightforward: given that the posterior distribution (|x0:n , y0:n ) factorizes as
n n

() () (x0 )
k=1 n

q (xk1 , xk )
k=0

g (xk , yk )
n

= () (x0 )
k=1

q (xk1 , xk ) ()
k=0

g (xk , yk )

(13.5)

up to a normalizing constant, the two subvectors and are indeed conditionally independent. Independence of the conditional distribution of from y0:n is obvious from (13.5). A practical consequence of Lemma 13.1.4 is therefore that we can conduct Bayesian inference about and separately, conditional on the (latent) chain X0:n (and of course on the observables Y0:n ). Conditional inference is of interest because of its relation with the Gibbs sampler (see Chapter 6) associated with this model, as will be made clearer in Section 13.1.4. In the case where the latent variables are nite, that is, when X is nite, a reparameterization of X into {1, . . . , r} allows for use of the classical conjugate Dirichlet prior on the transition probability matrix q = (qij ), Dirr (1 , . . . , r ). These priors generalize the Beta (of type one) distribution as priors on the simplex of Rr . Denition 13.1.5 (Dirichlet Distribution). A Dirichlet Dirr (1 , . . . , r ) distribution is a distribution on the subset q1 + . . . + qr = 1 of Rr , given by the density (q1 , . . . , qr ) = where all i > 0. We remark that the above density is with respect to Lebesgue measure on the subset that supports the distribution. Of particular interest is the choice i = 1 for all i, in which case the density is constant and hence the distribution uniform. Under the assumption that is known or with a distribution parameterized by a separate parameter, we then have the following conjugacy result. Lemma 13.1.6. The Dirichlet prior is a conjugate distribution for the transition probability matrix Q of the Markov chain X1:n in the following sense. Assume that each row of Q has a prior distribution that is Dirichlet, (qi1 , . . . , qir ) Dirr (1 , . . . , r ) , with the rows being a priori independent, and that the distribution of X0 is either xed or parameterized by a separate parameter. Then, given the Markov chain, the rows of Q are conditionally independent and (1 + . . . + r ) 1 1 q qrr 1 1{q1 + . . . + qr = 1} , (1 ) (r ) 1

13.1 Parameter Estimation

477

(qi1 , . . . , qir )|x1:n Dirr (1 + ni1 , . . . , r + nir ) , where nij denotes the number of transitions from i to j in the sequence x0:n . Proof. Given that the parameters of Q only depend on X0:n , we have
n

(Q|x0:n ) (Q)
k=1

qxk1 xk
i,j

j qij

+nij 1

We remark that in the case where the distribution of X0 is the stationary distribution of Q, there is no conjugate distribution because of the non-exponential relation between this stationary distribution and Q. This does not mean that Bayesian inference is not possible, but simulation from the posterior distribution of Q is less straightforward in this case. Simulation from a Dirichlet distribution is easy: if 1 , . . . , r are independent with i having a Ga(i , 1) distribution, then the r-tuple 1 , r i=1 i has a Dirr (1 , . . . , r ) distribution. Example 13.1.7 (Normal HMM). Assume that {Xk }k0 is a nite Markov chain on X = {1, . . . , r} and that, conditional on Xk = i, Yk has 2 a N(i , i ) distribution. A typical prior for this model may look as follows. On the transition probability matrix Q we put a Dirr (1 , . . . , r ) distribution on each row, with independence between rows. A standard choice is to set the j equal; often j = 1. The means and variances of the normal distributions are assumed a priori independent and with conjugate priors, that is, a N(, ) prior for each 2 mean i and a IG(, ) prior for each variance i (cf. Example 13.1.2). The joint prior thus becomes
2 2 () = (Q, 1 , . . . , r , 1 , . . . , r ) r r

2 , r i=1 i

r r i=1 i

(1 + . . . + r ) 1 qj = (1 ) (r ) j=1 ij i=1
r

i=1 r

2 1 e(i ) /2 2

( 2 )(+1) /2 e . () i=1

It is often appropriate to consider one or several of , , , and as unknown random quantities themselves, and hence put hyperpriors on them. These

478

13 Fully Bayesian Approaches

quantities are then adjoined to , and their prior densities are adjoined to the above prior. Richardson and Green (1997) and Robert et al. (2000), for instance, contain such examples. In the above example, the initial distribution was not mentioned. Indeed, it was tacitly assumed that the initial distribution is given by Q, for example as the stationary distribution. From a simulation point of view this is inconvenient however, as the posterior distributions of the rows of Q are then no longer Dirichlet; cf. the remark below Lemma 13.1.6. A dierent assumption, more appealing from this simulation point of view, is to assume that is xed, typically uniform on {1, . . . , r}. We may also assume that is unknown and equip it with a Dir(1 , . . . , r ) prior, usually with all i equal. Then is adjoined to and the Dirichlet density goes into the prior. Finally, we may also assume that X0 is xed and known, equal to 1, say. This implies that the prior is not exchangeable though, and the structure of the implied non-exchangeability is dicult to describe (see below). Therefore, in practice the two alternatives of setting as the uniform distribution or assigning it a Dirichlet prior are the most appealing. In the latter case, as remarked above Lemma 13.1.4, cannot be estimated consistently. 13.1.3 Non-identiability and Label Switching An issue of particular interest for the choice of the loss function or, correspondingly, of the Bayes estimator, is non-identiability. This is a problem that primarily arises in the case of nite state space X. Hence, assume X = {1, . . . , r}. To start with, we will make assumptions about the parameterization of the HMM. We assume that decomposes into (, ) as in (13.4), that simply comprises the transition probabilities qij themselves, and that further decomposes as = (1 , . . . , r ), where i parameterizes the conditional density g(i, ) in a way that is identical for each i. Hence, all g(i, ) belong to the same parametric family. A typical example is to take, as in the above example, the 2 2 g(i, ) as normal distributions N(i , i ), in which case i = (i , i ). The initial distribution is assumed to be the stationary distribution of Q, or to be xed and uniform on X, or to be given by a separate set (1 , . . . , r ) of probabilities. Under these conditions, the likelihood L(y0:n |) is invariant under permutation of state indices. More precisely, if (s1 , . . . , sr ) is a permutation of {1, . . . , r}, then L[y0:n |(i ), (qij ), (i )] = L[y0:n |(si ), (qsi ,sj ), (si )] . This equality simply says that if we renumber the states in X and permute the parameter indices accordingly, the likelihood remains unchanged. We now turn to a second set of assumptions. A density on Rr is said to be exchangeable if it is invariant under permutations of the components. We will assume that the joint prior for (q(i, j)), (i ), and (i ) is exchangeable,

13.1 Parameter Estimation

479

[(i ), (qij ), (i )] = [(si ), (qsi ,sj ), (si )] . This exchangeability condition is very often occurring in practice. It holds, for instance, if the three entities involved are a priori independent with an independent Dirichlet Dirr (, . . . , ) prior on each row of the transition probability matrix, independent identical priors on the i and, when applicable, a Dirichlet Dirr ( , . . . , ) prior on (i ). Under the above two sets of assumptions, because (|y0:n ) is proportional to ()L(y0:n |) in , the posterior will also be exchangeable, [(i ), (q(i, j)), (i )|y0:n ] = [(si ), (q(si , sj )), (si )|y0:n ] . This non-identiability feature has the serious consequence that, from a Bayesian point of view, within each block of parameters all marginals are the same! Indeed, for example, (1 , . . . , r |y0:n ) = (s1 , . . . , sr |y0:n ) . Thus, for 1 i r, the density i dened as i (i |y0:n ) = (1 , . . . , r |y0:n ) di , (13.6)

is independent of i. Therefore, both the posterior mean and the maximum marginal posterior estimators are ruled out in exchangeable settings, as they only depend on the marginals. A practical consequence of this lack of identiability is so-called label switching, illustrated in Figure 13.1. This gure provides an MCMC sequence for both the standard deviations i and the stationary probabilities of Q for an 2 HMM with three Gaussian components N(0, i ). The details will be discussed below, but the essential feature of this graph is the continuous shift between the three levels of each component i , which translates the equivalence between (1 , 2 , 3 ) and any of its permutations for the posterior distribution. As discussed by Celeux et al. (2000), this behavior does not always occur in a regular MCMC implementation. In the current case, it is induced by the underlying reversible jump algorithm (see Section 13.2.3). We stress that label switching as such is not a result of exploring the posterior surface by simulation but is rather an intrinsic property of the model and its prior. Lack of identiability also creates a diculty with the maximum a posteriori estimator in that the exchangeability property implies that there are a multiple of r! (local and global) modes of the posterior surface, given (13.6). It is therefore dicult to devise ecient algorithms that can escape a particular mode to provide a fair picture of the overall, multimodal posterior surface. For instance, Celeux et al. (2000) had to resort to simulated tempering, a sort of inverted simulated annealing, to achieve a proper exploration. A common approach to combat problems caused by lack of identiability is to put constraints on the prior, in that certain parameters are required to appear in ascending or descending order. For instance, in the above

480
0.0 0.2 0.4 0.6 0.8 1.0

13 Fully Bayesian Approaches

pi

500

1000

1500

sigma

500

1000

1500

Fig. 13.1. Representation of an MCMC sequence simulated from the posterior distribution associated with a Gaussian HMM with three hidden states, Gaussian 2 components N(0, i ), and a data set made of a sequence of wind intensities in Athens (Greece). The top graph plots the sequence of stationary probabilities of the transition probability matrix Q and the bottom graph the sequence of i . Source: Capp et al. (2003). e

example, we could set the prior density to zero outside the region where 1 < 2 < . . . < r . That is, we require the normal means to appear in ascending order. Such a constraint does not aect the MAP, but it does aect the marginal posterior distributionsobviously, the marginal posterior distribution functions of the i become stochastically orderedand hence, for instance, the posterior means of individual parameters. It is important to realize that marginal posterior distributions of parameters not directly involved 2 in the constraint, for instance the i in the current example, are also aected. Even more importantly, if an ordering constraint is put on a dierent set of pa2 2 2 rameters, 1 < 2 < . . . < r for example, then the marginal posterior distributions will be aected in a dierent way. Hence, ordering constraints are not a tool that is unambiguous in the sense that any constraint leads to the same marginal posterior distributions. This is illustrated in Richardson and Green (1997). From a practical point of view, in an MCMC simulation, ordering can be imposed at each step of the sampler, but we could also design a sampler without such constraints and do the sorting as a part of post-processing of the sampler output. This approach obviously greatly simplies investigations of how constraints on dierent sets of parameters aect the results. Stephens (2000b) discusses the label switching problem in a general decision theoretic framework. In particular, he demonstrates that sorting means, variances, etc.,

13.1 Parameter Estimation

481

sometimes gives results that are dicult to interpret, and he suggests, in the contexts of i.i.d. observations from a nite mixture, a relabeling algorithm based on probabilities of the each observation belonging to a certain mixture component. If we put a sorting constraint on the parameters, we implicitly construct a new prior that is zero in regions where the constraint does not hold. Moreover, because a parameter can be permuted in r! dierent ways, the new prior is equal to the original prior multiplied by r! in the region where the constraint does hold, in order to make it integrate to unity (over the constrained space). A similar but slightly dierent view, suggested by Stephens (2000a), is to think of the r! permutations of a given parameter as a single element of an equivalence class of parameters; the eective parameter space is then the space of such equivalence classes. Again, because a parameter of order r can be permuted in r! dierent ways, each element of the equivalence class [] has a prior that is r! times the prior () of any of its particular representations . This distinction between a parameter and its corresponding equivalence class and the factor r! are not important when r is xed, but it becomes important when r is variable, and we attempt to estimate it, as discussed in Section 13.2. Lack of identiability can also be circumvented by using a loss function that is impervious to label switching, that is, invariant under permutation of the label indices. For instance, in the case of mixtures, Celeux et al. (2000) employed a loss function for the estimation of the parameter may based on the Kullback-Leibler divergence, loss(, ) = log p(y0:n |) p(y0:n |) dy0:n . p(y0:n |)

13.1.4 MCMC Methods for Bayesian Inference Analytic computation of Bayesian estimates like the posterior mean or posterior mode is most generally infeasible for HMMs, except for the simplest models. We now review simulation-based methods that follow the general MCMC scheme introduced in Chapter 6 and provide Monte Carlo approximations of the posterior distribution of the parameters given the observable Y0:n . As noted in Chapter 6, the distribution of X0:n given both Y0:n and is often manageable (when X is nite notably). Likewise, the conditional distribution of the parameters given Y0:n and X0:n is usually simple enough in HMMs, especially when conjugate priors are used (as in Example 13.1.7). What remains to be exposed here is how to bridge the gap between these two conditionals. The realization that for HMMs, the distribution of interest involves two separate entities, and X0:n , for which the two conditional distributions (|x0:n , y0:n ) and (x0:n |, y0:n ) are available or may be sampled from, suggests the use of a two stage Gibbs sampling strategy as dened in Chapter 6 (see Algorithm 6.2.13). The simplest version of the Gibbs sampler, which will be referred to as global updating of the hidden chain, goes as follows.

482

13 Fully Bayesian Approaches

Algorithm 13.1.8. Iterate: 1. Simulate from (|x1:n , y0:n ). 2. Simulate X0:n from (x0:n |, y0:n ). This means that, if we can simulate the parameters based on the completed model (and this is usually the case, see Example 13.1.10 for instance) and the missing states X0:n conditionally on the parameters and Y0:n (see Chapter 6), we can implement this two-stage Gibbs sampler, also called data augmentation by Tanner and Wong (1987). We note that typically is multivariate, and it is then often broken down into several components; accordingly, the rst step above then breaks down into several sub-steps. Similar comments apply if there are hyperparameters with their own priors in the model; we can view them as part of even though they are often updated separately. By global updating we mean that the trajectory of the hidden chain is updated as a whole from its joint conditional distribution given the parameter and the data Y0:n . This corresponds to the partitioning (, X0:n ) of the state space of the Gibbs sampler. Another possible partitioning is (, X0 , X1 , . . . , Xn ), which leads to an earlier and more rudimentary version of the Gibbs sampler (Robert et al., 1993). In this algorithm, only one hidden variable Xk is updated at a time, and we refer to this scheme as local updating of the hidden chain. The algorithm thus looks as follows. Algorithm 13.1.9. Iterate: 1. Simulate from (|x1:n , y1:n ). 2. For k = 0, 1, . . . , n, simulate Xk from (xk |, y1:n , x1:k1 , xk+1:n ). This algorithm only updates one state at a time, and, because (xk |, y0:n , x0:k1 , xk+1:n ) reduces to (xk |, yk , xk1 , xk+1 ) q (xk1 , xk )q (xk , xk+1 )g (xk , yk ) where the rst factor on the right-hand side is replaced by (x0 ) for k = 0 and the second factor is replaced by unity for k = n; this means that each Xk is updated conditional upon its neighbors, as seen in Chapter 6. In the above algorithm, the Xk are updated in a xed linear order, but there is nothing that prevents us from using a dierent order or from picking the variable Xk to be updated at random. Of course there are schemes intermediate between the extremes global and local updating. We might, for example, update blocks of Xk ; like for local updating, these blocks may be of xed size and updated in a specic order, but size and order may also be chosen at random as in (Shephard and Pitt, 1997).

13.1 Parameter Estimation

483

Example 13.1.10 (Normal HMM, Continued). Let us return to the HMM and prior given in Example 13.1.7. To compute the respective full conditionals in the Gibbs sampler, we note again that each such distribution, or density, is proportional (in the component to be updated) to the product of the prior and the likelihood. For example,
2 2 (1 , . . . , r |Q, 1 , . . . , r , x1:n , y0:n ) 2 2 (Q, 1 , . . . , r , 1 , . . . , r ) 2 2 p(x0:n |Q)L(y0:n |x0:n , 1 , . . . , r , 1 , . . . , r ) n

2 2 (Q)(1 ) (r )(1 ) (r )p(x0:n |Q) k=0

g(,) (xk , yk ) .

By picking out the factors on the right-hand side that contain the appropriate variables, we can nd their full conditional. We now detail this process for each of the variables involved. The conditional pdf of 1 , . . . , r is proportional to
r n

exp (i )2 /2
i=1 r k=0

2 exp (yk xk )/2xk

1 2 2 exp [2 ( 1 + ni i ) 2i ( 1 + Si i )] 2 i i=1

where ni is the number of xk with xk = i and Si is the sum of the corresponding yk ; Si = {k: xk =i} yk . We can conclude that the full conditional distribution of 1 , . . . , r is such that these variables are conditionally independent and
2 2 i | Q, 1 , . . . , r , x0:n , y0:n N 2 i / + Si 1 2 / + n , 1/ + n / 2 i i i i

(13.7)

This can also be understood in the following way: given X0:n all the observations are independent, and to obtain the posterior for i we only need to consider observations governed by this regime. As the i are a priori independent, they will be so a posteriori as well. The above formula is then a standard result of Bayesian statistics (cf. Example 13.1.2). In a similar fashion, one nds that
2 2 (1 , . . . , r |Q, 1 , . . . , r , x0:n , y0:n ) r

(2)

2 2 (i )(+ni /2+1) exp ( + Si /2)/i i=1

(2)

where Si = {k: xk =i} (yk i )2 . Hence, the full conditional distribution of 2 2 1 , . . . , 1 is such that these variables are conditionally independent, and

484

13 Fully Bayesian Approaches


2 i | Q, 1 , . . . , r , x0:n , y0:n IG( + ni /2, ( + Si /2)) . (2)

(13.8)

This result is indeed also an immediate consequence of Example 13.1.2. The full conditional distribution of the transition matrix Q was essentially derived in Lemma 13.1.6; the rows are conditionally independent with the ith row following a Dirichlet distribution Dirr (1 + nij , . . . , r + nir ). Here nij is the number of transitions from state i to j, that is, nij = #{0 k n 1 : xk = i, xk+1 = j}. Several types of MCMC moves are typically put together in what is often called a sweep of the algorithm. Thus, one sweep of the Gibbs sampler with local updating for the present model looks as follows. Algorithm 13.1.11. 1. Simulate the i independently according to (13.7). 2 2. Simulate the i independently according to (13.8). 3. Simulate the rows of Q independently, with the ith row from Dirr (1 + ni1 , . . . , r + nir ). 4. For k = 0, 1, . . . , n, simulate Xk with unnormalized probabilities P(Xk = i | , yk , xk1 , xk+1 ) q(xk1 , i)q(i, xk+1 )
2 1 (yk i )2 /2i e ; i

for k = 0 the rst factor is replaced by (x0 ), and for k = n the factor q(i, xk+1 ) is replaced by unity. If is the stationary distribution of Q, simulation of Q requires a Metropolis-Hastings step; a sensible proposal is then the same Dirichlet as above. If is rather a separate parameter, Q is updated as above and, provided the prior on (1 , . . . , r ) is a Dirichlet as in Example 13.1.7, this vector is updated with full conditional distribution Dirr (1 + t1 , . . . , r + tr ) with ti = 1{x0 = i}. Of course, global updating of X0:n could have been used as well, which would modify step 4 of the algorithm only. The Gibbs sampler with local updating should mix and explore the posterior surface much more slowly than when global updating is used. It must be considered, however, that the simulation of the whole vector of states, X1:n , is more time-consuming in that it requires the use of the forward or backward formulas (Section 6.1.2). A numerical comparison of the two approaches by Robert et al. (1999), using several specially designed convergence monitoring tools, did not exhibit an overwhelming advantage in favor of global updating, even without taking into account the additional O(n2 ) computational time required by this approach. On the other hand, Scott (2002) provided an example showing a signicant advantage for global updating in terms of autocovariance decay. It is thus dicult to make a rm recommendation on which updating scheme to use. One may start by running local updating, and if its mixing behavior is poor, try global updating as well. We do remark, however, that when the state space X is continuous, there is seldom any alternative to local

13.1 Parameter Estimation

485

updating. In addition, with continuous X, local updating must in general be carried out by a Metropolis-Hastings step, as the full conditional distribution seldom lends itself to direct simulation (see Section 6.3). The next example demonstrates a somewhat more complicated use of the single site Gibbs sampling strategy. Example 13.1.12 (Capture-Recapture, Continued). Let us now consider Gibbs simulation from the posterior distribution of the parameters in the capture-recapture model of Example 1.3.4. The parameters are divided into (a) the capture probabilities pk (i), indexed by the capture zone i (i = 1, 2, 3), and (b) the movement probabilities qk (i, j) (i, j = 1, 2, 3, ), which are the probabilities that the lizard is in zone j at time k + 1 given that it is in zone i at time k. For instance, the probability qk (, ) is equal to 1, because of the absorbing nature of . We also denote by k (i) the survival probability at time k in zone i, that is, k (i) = 1 qk (i, ) , and by k (i, j) the eective probability of movement for the animals remaining in the system, that is, k (i, j) = qk (i, j)/k (i) . If we denote k (i) = (k (i, 1), k (i, 2), k (i, 3)), the prior distributions are chosen to be pk (i) Be(a, b), k (i) Be(, ), k (i) Dir3 (1 , 2 , 3 ) , where the hyperparameters a, b, 1 , 2 , 3 are known. The probabilities of capture pk (i) depend on the zone of capture i and the missing data structure of the model, which must be taken into account. Slightly modifying the notations of Example 1.3.4, we let ykm be the position of animal m at time k and xkm its capture indicator, the observations can be written in the form ykm = xkm ykm , where ykm = 0 corresponds to a missing observation. The sequence of ykm for a given m then corresponds to a non-homogeneous Markov chain, with transition matrix Qk = (qk (i, j)). Conditionally on ykm , the Xkm then are Bernoulli variables with probability of success pk (ykm ). The Gibbs sampler associated with this model has the following steps. Algorithm 13.1.13.
1. Simulate ykm for sites such that xkm = 0. 2. Generate (0 k n)

pk (i) Be(a + uk (i), b + vk (i)) , k (m) Be( + wk (i), + wk (i, )) , k (i) Dir3 (1 + wk (i, 1), 2 + wk (i, 2), 3 + wk (i, 3)) ,

486

13 Fully Bayesian Approaches

where uk (i) denotes the number of captures in i at time k, vk (i) the number of animals unobserved at time i for which the simulated ykm is equal to i, wk (i, j) the number of passages (observed or simulated) from i to j, wk (i, ) the number of (simulated) passages from i to , and wk (i) = wk (i, 1) + wk (i, 2) + wk (i, 3) . Step 1. must be decomposed into conditional sub-steps to account for the Markovian nature of the observations; in a full Gibbs strategy, ykm can be simulated conditionally on y(k1)m and y(k+1)k when xkm = 0. If k = n, the missing data are simulated according to
P(ykm = j | y(k1)m = i, y(k+1)m = , xkm = 0) qk1 (i, j)(1 pk (j))qk (j, )

and
P(ynm = j | y(n1)m = i, xnm = 0) qn1 (i, j)(1 pn (j)) .

So far, we have dealt with MCMC algorithms for which the state space of the sampler consists of the parameter and the hidden chain X0:n ; both are random, unobserved quantities because we are in a Bayesian framework and X0:n because of its role in the model as a latent variable. However, it is quite possible to devise MCMC algorithms for which the sampler state space comprises alone and not the hidden chain. In particular, when the state space X of the hidden chain is nite, we know that the likelihood may be computed exactly. In such a case the completion step, that is, the simulation of X0:n , does not appear as a necessity any longer, and alternative Metropolis-Hastings steps can be used instead. Example 13.1.14 (Normal HMM, Continued). In Capp et al. (2003), e the simulation of the parameters of the normal components, as well as of the parameters of the transition probability matrix, was done through simple random walk proposals: for the means j the proposed move is j = j + i , where i N(0, ) and is a parameter that may be adjusted to optimize performance of the sampler. Because the proposal is symmetric, the acceptance ratio is simple; it is ( )L(y0:n | ) , ()L(y0:n |) where L is the likelihood computed via the forward algorithm (Section 5.1.1). 2 For the variances j , the proposed move is a multiplicative random walk log j = log j + j , where j N(0, ), with acceptance ratio

13.1 Parameter Estimation

487

( )L(y0:n | ) ()L(y0:n |)

j j

the last term being the ratio of the Jacobians incurred by working on the logscale. To describe the above proposal, we also sometimes say that j follows a log-normal LN(log j , ) distribution. In the case of the transition probability matrix, Q, the move is slightly more involved due to the constraint on the sums of the rows, Q being a stochastic matrix. Capp et al. (2003) solved this diculty by reparameterize ing each row (qi1 , . . . , qir ) as qij = ij , i ij > 0 ,

so that the summation constraint on the qij does not hinder the random walk. Obviously the ij are not identiable, but as we are only interested in the qij , this is not a true diculty. On the opposite, using overparameterized representations often helps with the mixing of the corresponding MCMC algorithms, as they are less constrained by the data set or the likelihood. The proposed move on the ij is log ij = log ij + ij , where ij N(0, ), with acceptance ratio ( )L(y0:n | ) ()L(y0:n |) ij . ij

i,j

Note that this reparameterization of the model forces us to select a prior distribution on the ij rather than on the qij . The choice ij Ga(j , 1) is natural in that it gives a Dirr (1 , . . . , r ) distribution on the corresponding (qi1 , . . . , qir ). We also note that it is not dicult to show that if (i1 , . . . , ir ) is r reparameterized into Si = 1 ij and (qi1 , . . . , qir ), then, given x0:n , Si and r (qi1 , . . . , qir ) are conditionally independent and distributed as Ga( 1 j , 1) and Dirr (1 +ni1 , . . . , r +nir ) respectively. This proves that the -parameterization does nothing but introduce a new parameter for each row, the sum Si , that is independent of everything else and hence totally irrelevant for the inference. The point of introducing this extra variable is only to simplify the design of Metropolis-Hastings moves. If the initial distribution is also a parameter of the model, it can be recast in a similar fashion. Figure 13.1 provides an illustration of this simulation scheme in the special case of a Gaussian HMM with zero means. Over the 2,000 MCMC iterations represented on both graphs, there are periods where the value of the i or of the stationary probabilities of Q do not change: these periods correspond to sequences of proposed values that are rejected at the Metropolis-Hastings stage. Note that the rejection periods are not the same for the i and the stationary probabilities. This is due to the fact that there is a MetropolisHastings stage for each group of parameters.

488

13 Fully Bayesian Approaches

Another alternative stands at the opposite end of the range of possibilities: the parameters of the model can be integrated out when conjugate priors are used, as demonstrated by Liu (1994), Chen and Liu (1996), and Casella et al. (2000) in the case of mixture and switching regression models. In such schemes, each site Xk is typically sampled conditionally on all the other sites, with the model parameters fully integrated out.

13.2 Reversible Jump Methods


So far we have not touched upon the topic of the unknown number of states in an HMM and of the estimation of this number via Bayesian procedures. After a short presentation of variable dimension models and of their meaning, we introduce the adequate MCMC methodology to deal with this additional level of complexity. 13.2.1 Variable Dimension Models In general, a variable dimension model is, to quote Peter Green, a model where one of the things you do not know is the number of things you do not know. In other words, this pertains to a statistical model where the dimension of the parameter space is not known. This is not a formal enough denition, obviously, and we need to provide a more rigorous perspective. Denition 13.2.1 (Variable Dimension Model). A variable dimension model is dened as a collection of models (or parameter spaces), r , r = 1, . . . , R ,

associated with a collection of priors on these spaces, r (r ), r = 1, . . . , R ,

and a prior distribution on (the indices of ) these spaces, (r), r = 1, . . . , R .

In the following, we shall consider that a variable dimension model is associated with a probability distribution on the space
R

{r} r ,
r=1

(13.9)

where the union is of course one of disjoint sets. An element of may thus always be written as = (r, r ), where r is an element of r . Obviously, this convention is somewhat redundant, as we generally know by looking at the

13.2 Reversible Jump Methods

489

second component of $\theta$ to which of the sets in (13.9) $\theta$ belongs, but it will greatly simplify matters from a notational point of view. The target density will be denoted by
$$\pi(\theta) = \pi(r, \theta_r) = \pi(r)\, \pi_r(\theta_r) \;.$$
In order to avoid tedious (but straightforward) constructions, we do not fully specify the dominating measure used for defining the above density, and we will also, when needed and unambiguous from the context, use the notation $\pi(d\theta)$ to refer to the probability measure itself. On the individual parameter spaces $\Theta_r$, we denote the dominating measure by $d\theta_r$ as previously. For HMMs, the space $\Theta_r$ is in general that of parameters for HMMs with $r$ states for the hidden Markov chain. We remark that strictly speaking, a model is not identical to a parameter space, as the parameter space alone does not tell anything about the model structure. Two completely different models could well have identical parameter spaces. In the development below, this distinction between model and parameter space is not important however, and we will work with the parameter spaces only.

In the Bayesian framework exposed above, the dimension $r$ of the model is treated as a usual parameter. The aim is to address the two problems of testing (deciding which model is best) and estimation (determining the parameters of the best fitting model) simultaneously. Conceptually, a variable dimension model is more complicated only because the prior and posterior distributions live in the space defined in (13.9), whose structure is quite complex. Interestingly, by integrating out the index part of the model, we simply end up with mixture representations both for the distribution of the data,
$$\sum_{r=1}^{R} \pi(r)\, p(y \mid r) \;,$$
and for the predictive distribution (given observations yobs )


$$\sum_{r=1}^{R} \pi(r \mid y_{\mathrm{obs}}) \int p(y \mid \theta_r)\, \pi_r(\theta_r \mid y_{\mathrm{obs}})\, d\theta_r \;.$$

This mixture representation, called model averaging in the Bayesian literature, is interesting because it suggests the use of predictors that are not obtained by selecting a particular model from the R possible ones but rather consist in taking all the options into account simultaneously, weighting them by their posterior odds (r|yobs ). The variability due to the selection of the model is thus accounted for. Note also that in dening the variable dimension model, we have chosen a completely new set of parameters for each model r and set the parameter space as the union of the model parameter spaces r , even though some parameters may have a similar meaning in two dierent models. For instance, when comparing an AR(p) and an AR(p + 1) model, it could be posited that


the first p autoregressive coefficients would remain the same for the AR(p) and AR(p+1) models, i.e., that an AR(p) model is simply an AR(p+1) model with an extra zero coefficient. We argue on the opposite that they should be distinguished as entities because the models are different and also because, for instance, the best fitting AR(p+1) model is not necessarily a straight modification of the best fitting AR(p) model by adding an extra term while keeping the other ones fixed. Similarly, even though the variance $\sigma^2$ has the same formal meaning for all values of p in the autoregressive case, we insist on using a different variance parameter for each value of p. This is not the only possible perspective on this problem however, and many prefer to use some parameters common to all models in order to reduce model and computational complexity. In some sense, the reversible jump technique to be discussed in Section 13.2.3 is based on this assumption of exchangeable parameters between models, using proposal distributions that modify only a part of the parameter vector to move between models.

Given a variable dimension model, there is an additional computational difficulty in representing, or simulating from, the posterior distribution in that a sampler must move both within and between models $\Theta_r$. Although the former pertains to previous developments (Section 13.1.4), the latter requires a sound measure-theoretic basis to lead to correct MCMC moves, that is, to moves that validate $\pi(\cdot \mid y_{0:n})$ as the stationary distribution of the simulated Markov chain. There have been several earlier approaches in the literature, using for instance birth-and-death processes (Geyer and Møller, 1994) or pseudo-priors (Carlin and Chib, 1995), but the general formalization of this problem has been realized by Green (1995).

13.2.2 Green's Reversible Jump Algorithm

Green's (1995) algorithm is basically of Metropolis-Hastings type with specific trans-dimensional proposals carefully designed to move between different models in a way that is consistent with the desired stationary distribution of the MCMC algorithm. We discuss here only the simplest, and most common, application of Green's ideas in which the moves from higher to lower dimensional models are deterministic and refer to Green (1995) or Richardson and Green (1997) for more involved proposals. We describe below the structure of moves between two different models $\Theta_s$ and $\Theta_l$, where $\Theta_l$ say is of larger dimension than is $\Theta_s$ ($s$ is for "small" and $l$ for "large"). If the Markov chain is currently in state $\theta_s \in \Theta_s$, Green's algorithm uses an auxiliary random variable, which we denote by $v$, and a function $m$ that maps the pair $(\theta_s, v)$ into a proposed new state $\theta_l \in \Theta_l$. The only requirement is that $m$ be differentiable with an inverse mapping $m^{-1}$ that is also differentiable. If $(\theta_s, v)$ is the point that corresponds to $\theta_l$ through $m^{-1}$, we will use the notations
$$\theta_s = m^{-1}_{\mathrm{param}}(\theta_l) \qquad\text{and}\qquad v = m^{-1}_{\mathrm{aux}}(\theta_l)$$


for the associated projections of $m^{-1}(\theta_l)$. The reverse move from $\Theta_l$ to $\Theta_s$ is deterministic and simply consists in jumping back to the point $\theta_s = m^{-1}_{\mathrm{param}}(\theta_l)$. Obviously, this dimension-changing move alone may fail to explore the whole space, and it is necessary to propose usual fixed dimension moves as well as these trans-dimensional moves. For the moment we can ignore this fact however, as we are going to show that the trans-dimensional move alone is reversible. We shall assume that when in state $\theta_s \in \Theta_s$, the move to $\Theta_l$ is attempted with probability $P_{s,l}$ and that the auxiliary variable $v$ has a density $p$. Conversely, when in $\Theta_l$, the move to $\Theta_s$ is attempted with probability $P_{l,s}$. The moves are then accepted with probability $\alpha(\theta_s, \theta_l)$ in the first case and $\alpha(\theta_l, \theta_s)$ in the second one, where it is understood that the chain stays in its current state in case of rejection. To determine the correct form of the acceptance probability $\alpha$, we will check that the transition kernel corresponding to the mechanism described above does satisfy the detailed balance condition (2.12) for the target $\pi$. A first remark is that given the structure of the state space $\Theta$, which is a union of disjoint sets, one can fully specify probability distributions on $\Theta$ by their operation on test functions $f_q$ of the form
$$f_q(\theta) = f_q(r, \theta_r) = \begin{cases} 0 & \text{if } r \neq q \;, \\ \bar f_q(\theta_q) & \text{otherwise} \;, \end{cases} \qquad (13.10)$$

for some $q = 1, \ldots, R$ and $\bar f_q \in \mathrm{F}_b(\Theta_q)$. For such a test function,
$$\mathrm{E}_\pi(f_q) = \pi(q) \int_{\Theta_q} \bar f_q(\theta_q)\, \pi_q(\theta_q)\, d\theta_q \;.$$

The second important remark is that when examining the proof of the reversibility of the usual Metropolis-Hastings algorithm (Proposition 6.2.6), it is seen that the form of the acceptance probability is entirely determined by what happens when the chain really moves. The part that concerns rejection is fully determined by the fact that the transition kernel must be a probability kernel, that is, integrate to unity. Hence, in the case under consideration, we may check the detailed balance condition for test functions of the form given in (13.10) only, with $q = s$ and $q = l$. We will denote these functions by $f_s$ and $f_l$ respectively (with associated functions $\bar f_s \in \mathrm{F}_b(\Theta_s)$ and $\bar f_l \in \mathrm{F}_b(\Theta_l)$). Denoting by $K$ the transition kernel associated with the move between $\Theta_s$ and $\Theta_l$ described above, we have
$$\iint f_s(\theta)\, \pi(d\theta)\, K(\theta, d\theta')\, f_l(\theta') = \pi(s) \iint \pi_s(\theta_s)\, \bar f_s(\theta_s)\, P_{s,l}\, \alpha[\theta_s, m(\theta_s, v)]\, p(v)\, \bar f_l[m(\theta_s, v)]\; dv\, d\theta_s \;.$$

Now apply the change of variables formula to replace the pair $(\theta_s, v)$ by $\theta_l$. This yields


$$\iint f_s(\theta)\, \pi(d\theta)\, K(\theta, d\theta')\, f_l(\theta') = \int \bar f_s[m^{-1}_{\mathrm{param}}(\theta_l)]\, \bar f_l(\theta_l)\, \pi(s)\, \pi_s[m^{-1}_{\mathrm{param}}(\theta_l)]\, P_{s,l}\, \alpha(\theta_s, \theta_l)\, \frac{p[m^{-1}_{\mathrm{aux}}(\theta_l)]}{J_{s,l}(\theta_l)}\; d\theta_l \;, \qquad (13.11)$$

where $J_{s,l}(\theta_l)$ is the absolute value of the determinant of the Jacobian matrix associated with the mapping $m$. It may be evaluated either as
$$J_{s,l}(\theta_l) = \left| \det \frac{\partial m(\theta_s, v)}{\partial (\theta_s, v)} \right|_{(\theta_s, v) = m^{-1}(\theta_l)} \qquad\text{or}\qquad J_{s,l}(\theta_l) = \left| \det \frac{\partial m^{-1}(\theta_l)}{\partial \theta_l} \right|^{-1} \;.$$

Because the reverse move is deterministic, the opposite case is much simpler and
$$\iint f_l(\theta)\, \pi(d\theta)\, K(\theta, d\theta')\, f_s(\theta') = \int \pi(l)\, \pi_l(\theta_l)\, \bar f_l(\theta_l)\, P_{l,s}\, \alpha[\theta_l, m^{-1}_{\mathrm{param}}(\theta_l)]\, \bar f_s[m^{-1}_{\mathrm{param}}(\theta_l)]\; d\theta_l \;. \qquad (13.12)$$
To ensure that (13.11) and (13.12) coincide for all choices of the functions $\bar f_s$ and $\bar f_l$, the acceptance probability must satisfy
$$\pi(s)\, \pi_s(\theta_s)\, P_{s,l}\, p(v)\, \frac{\alpha(\theta_s, \theta_l)}{J_{s,l}(\theta_l)} = \pi(l)\, \pi_l(\theta_l)\, P_{l,s}\, \alpha(\theta_l, \theta_s) \;, \qquad (13.13)$$

where it is understood that $\theta_s$, $\theta_l$ and $v$ satisfy $\theta_l = m(\theta_s, v)$. By analogy with the case of the usual Metropolis-Hastings algorithm, it is possible to find a solution to the above equation of the form
$$\alpha(\theta_s, \theta_l) = A(\theta_s, \theta_l) \wedge 1 \qquad\text{and}\qquad \alpha(\theta_l, \theta_s) = A^{-1}(\theta_s, \theta_l) \wedge 1$$
by setting
$$A(\theta_s, \theta_l) = \frac{\pi(l)\, \pi_l(\theta_l)\, P_{l,s}\, J_{s,l}(\theta_l)}{\pi(s)\, \pi_s(\theta_s)\, P_{s,l}\, p(v)} \;. \qquad (13.14)$$

Indeed, with this choice both sides of (13.13) evaluate to
$$\pi(l)\, \pi_l(\theta_l)\, P_{l,s} \,\wedge\, \pi(s)\, \pi_s(\theta_s)\, P_{s,l}\, \frac{p(v)}{J_{s,l}(\theta_l)} \;.$$

Thus (13.14) defines the applicable acceptance ratio to be used with Green's reversible jump move. At this level the formulation of Green's algorithm is rather abstract, but we hope it will become clearer after studying the following example.
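Before turning to the example, here is a minimal Python sketch (with names of our own choosing, not taken from the text) of how the acceptance probability associated with (13.14) is typically evaluated on the log scale; the log target, auxiliary density, Jacobian, and move probabilities are assumed to be supplied by the surrounding sampler.

```python
import numpy as np

def rj_accept_prob(log_target_small, log_target_large, log_p_aux,
                   log_abs_det_jacobian, prob_up, prob_down):
    """Acceptance probability of a dimension-increasing reversible jump move,
    cf. (13.14).  All arguments are scalars:
      log_target_small      log pi(s) + log pi_s(theta_s) (plus log-likelihood if needed)
      log_target_large      log pi(l) + log pi_l(theta_l) (plus log-likelihood if needed)
      log_p_aux              log p(v) evaluated at the drawn auxiliary variable
      log_abs_det_jacobian   log J_{s,l}(theta_l)
      prob_up, prob_down     P_{s,l} and P_{l,s}
    The reverse (down) move uses the reciprocal of the same ratio."""
    log_A = (log_target_large + np.log(prob_down) + log_abs_det_jacobian
             - log_target_small - np.log(prob_up) - log_p_aux)
    return min(1.0, np.exp(log_A))
```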


Example 13.2.2 (Normal HMM, Continued). We shall extend Example 13.1.14 to allow for moving between HMMs of different orders using reversible jump MCMC. We will discuss two different kinds of dimension-changing moves, or, rather, pairs of moves: birth/death and split/combine. In a birth move, the order of the Markov chain is increased by one by adding a new state, and the death move works in the reverse way by deleting an existing state. The split move takes an existing state and splits it in two, whereas the combine (also called merge) move takes a pair of states and tries to combine them into one. We will now describe these moves in detail and how their acceptance ratios are computed.

We start with the birth move. Suppose that the current MCMC state is $(r, \theta_r)$, and that we attempt to add a new state, that we denote by $i_0$, to the HMM. We first draw the random variables
$$\mu_{i_0} \sim \mathrm{N}(\zeta, \kappa) \;, \qquad \sigma^2_{i_0} \sim \mathrm{IG}(\alpha, \beta) \;,$$
$$\omega_{i_0,j} \sim \mathrm{Ga}(\delta_j, 1) \text{ for } j = 1, \ldots, r \;, \qquad \omega_{i_0,i_0} \sim \mathrm{Ga}(\delta_{i_0}, 1) \;, \qquad \omega_{i,i_0} \sim \mathrm{Ga}(\delta_{i_0}, 1) \text{ for } i = 1, \ldots, r \;,$$
all independently. In other words, the parameters that go with the new state are drawn from their respective priors. These parameters correspond to the auxiliary variable $v_{\mathrm{birth}}$ for the birth move. The remaining parameters, that is, the components of $\theta_r$, are simply copied to the proposed new state $\theta_{r+1}$. Therefore, the corresponding mapping $m_{\mathrm{birth}}$ is simply the identity; no particular transformation is required to obtain the proposed new state in $\Theta_{r+1}$. In the death move, the attempted move is to delete a state, denoted by $i_0$, that is chosen at random. The auxiliary variables $\mu_{i_0}$, etc., of the associated birth move are trivially recovered; they are just components of the state $i_0$ that is proposed to be deleted!

Next in turn is the computation of the acceptance ratio, which is in fact quite simple in this particular case. Because the mapping $m_{\mathrm{birth}}$ is the identity mapping, its Jacobian is the identity matrix, with determinant one. The remaining factors of (13.14) become
$$\frac{\pi(r+1)\, \pi_{r+1}(\theta_{r+1})\, L(y_{0:n} \mid \theta_{r+1})\, (r+1)!}{\pi(r)\, \pi_r(\theta_r)\, L(y_{0:n} \mid \theta_r)\, r!} \times \frac{P_d(r+1)/(r+1)}{P_b(r)} \times \frac{1}{p_\mu(\mu_{i_0})\, p_{\sigma^2}(\sigma^2_{i_0}) \prod_{i=1}^{r} p_\omega(\omega_{i,i_0}) \prod_{j=1}^{r} p_\omega(\omega_{i_0,j})\, p_\omega(\omega_{i_0,i_0})} \;. \qquad (13.15)$$
This ratio deserves some further comments. The first factor is the ratio of posterior densities. The factorials arise from the fact that, as the prior is exchangeable (the prior as well as the posterior are invariant under permutations of states), we cannot distinguish between parameters that are identical up to such permutations. Thus our effective parameter space for $r$-order HMMs is that of equivalence classes of parameters that are identical up to


permutations, and the prior of such an equivalence class is $r!$ times the original prior of one of its representations (cf. Section 13.1.3). When $r$ stays put, this distinction between a parameter and its equivalence class is unimportant, but it becomes important when $r$ is allowed to vary, as ignoring it would lead to incorrect acceptance ratios. The remaining factors in (13.15) are as follows: $P_b(r)$ is the probability of proposing a birth move when the current state is of order $r$, $P_d(r+1)$ is the probability of proposing a death move when the current state is of order $r+1$, so that $P_d(r+1)/(r+1)$ is the probability of proposing to kill the specific state $i_0$ of $\theta_{r+1}$, and the product of densities $p_\mu$, $p_{\sigma^2}$ and $p_\omega$ forms the joint proposal density $p_{\mathrm{birth}}$ of the birth move. Now, because the proposal densities $p_\mu$, etc., are identical to the priors of the corresponding parameters, and because the components in $\theta_r$ remain the same in $\theta_{r+1}$, there will be cancellations in (13.15), leading to the simplified expression
$$\frac{\pi(r+1)\, L(y_{0:n} \mid \theta_{r+1})}{\pi(r)\, L(y_{0:n} \mid \theta_r)} \times \frac{P_d(r+1)}{P_b(r)} \;. \qquad (13.16)$$
The acceptance ratio for the death move is the inverse of the above, which completes the description of the birth/death move.

We now turn to the split/combine move. Starting with the split move, suppose that the current MCMC state is $\theta_r$, of order $r$. The split move selects a state, $i_0$ say, and attempts to split it into two new ones, $i_1$ and $i_2$. The parameters of the corresponding normal distribution must be split as well. This can be done as follows.

(i) Split $\mu_{i_0}$ as
$$\mu_{i_1} = \mu_{i_0} - \varepsilon\, \sigma_{i_0} \;, \qquad \mu_{i_2} = \mu_{i_0} + \varepsilon\, \sigma_{i_0} \;, \qquad \text{with } \varepsilon \sim \mathrm{N}(0, \rho_\varepsilon) \;,$$
and split $\sigma^2_{i_0}$ as
$$\sigma^2_{i_1} = \sigma^2_{i_0}\, \xi \;, \qquad \sigma^2_{i_2} = \sigma^2_{i_0} / \xi \;, \qquad \text{with } \xi \sim \mathrm{LN}(0, \rho_\xi) \;.$$
(ii) Split column $i_0$ as
$$\omega_{i,i_1} = \omega_{i,i_0}\, u_i \;, \qquad \omega_{i,i_2} = \omega_{i,i_0}\, (1 - u_i) \;, \qquad \text{with } u_i \sim \mathrm{U}(0,1) \text{ for } i \neq i_0 \;.$$
(iii) Split row $i_0$ as
$$\omega_{i_1,j} = \omega_{i_0,j}\, \beta_j \;, \qquad \omega_{i_2,j} = \omega_{i_0,j} / \beta_j \;, \qquad \text{with } \beta_j \sim \mathrm{LN}(0, \rho_\beta) \text{ for } j \neq i_0 \;.$$
(iv) Split $\omega_{i_0,i_0}$ as
$$\omega_{i_1,i_1} = \omega_{i_0,i_0}\, u_{i_0}\, \beta_{i_1} \;, \qquad \omega_{i_1,i_2} = \omega_{i_0,i_0}\, (1 - u_{i_0})\, \beta_{i_2} \;,$$
$$\omega_{i_2,i_1} = \omega_{i_0,i_0}\, u_{i_0} / \beta_{i_1} \;, \qquad \omega_{i_2,i_2} = \omega_{i_0,i_0}\, (1 - u_{i_0}) / \beta_{i_2} \;,$$
where $u_{i_0} \sim \mathrm{U}(0,1)$ and $\beta_{i_1}, \beta_{i_2} \sim \mathrm{LN}(0, \rho_\beta)$.


These formulas deserve some comments. Step (ii) is sensible in the way that the transition probability of moving from state $i$ to $i_0$ is distributed between the probabilities of moving to the new states $i_1$ and $i_2$, respectively. We note that state $i_0$ can be split into states $(i_1, i_2)$ with corresponding normal parameters $(\mu_{i_1}, \sigma^2_{i_1})$ and $(\mu_{i_2}, \sigma^2_{i_2})$, but also into the same pair but in reverse order (the corresponding $\omega$ are then also reversed). This gives an identical parameter in terms of equivalence classes as defined above. In fact, the densities of these two proposals are identical, as $u$ and $1-u$ have the same distribution, and likewise for $\varepsilon$ and $-\varepsilon$, and $\xi$ and $1/\xi$, respectively (here subscripts on these variables are omitted). The move that reverses the above operations, that is, the combine move, goes as follows. Select two distinct states $i_1$ and $i_2$ at random, and attempt to combine them into a single state $i_0$ as follows.

(i) Let $\mu_{i_0} = (\mu_{i_1} + \mu_{i_2})/2$ and let $\sigma^2_{i_0} = (\sigma^2_{i_1}\, \sigma^2_{i_2})^{1/2}$.
(ii) Let $\omega_{i,i_0} = \omega_{i,i_1} + \omega_{i,i_2}$ for $i \neq i_0$.
(iii) Let $\omega_{i_0,j} = (\omega_{i_1,j}\, \omega_{i_2,j})^{1/2}$ for $j \neq i_0$.
(iv) Let $\omega_{i_0,i_0} = (\omega_{i_1,i_1}\, \omega_{i_2,i_1})^{1/2} + (\omega_{i_1,i_2}\, \omega_{i_2,i_2})^{1/2}$.
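As an illustration, a minimal Python sketch (with variable names of our own choosing) of the mean/variance part of the split move, step (i), together with its deterministic inverse used by the combine move and the recovery of the auxiliary variables:

```python
import numpy as np

rng = np.random.default_rng(1)

def split_mean_variance(mu_i0, sigma2_i0, rho_eps=0.5, rho_xi=0.5):
    """Step (i) of the split move: draw the auxiliary variables and split the
    normal parameters of state i0 into two new states."""
    eps = rng.normal(0.0, rho_eps)              # eps ~ N(0, rho_eps)
    xi = np.exp(rng.normal(0.0, rho_xi))        # xi  ~ LN(0, rho_xi)
    sd = np.sqrt(sigma2_i0)
    mu_i1, mu_i2 = mu_i0 - eps * sd, mu_i0 + eps * sd
    s2_i1, s2_i2 = sigma2_i0 * xi, sigma2_i0 / xi
    return (mu_i1, s2_i1), (mu_i2, s2_i2), (eps, xi)

def combine_mean_variance(mu_i1, s2_i1, mu_i2, s2_i2):
    """Combine move, step (i): arithmetic mean of the means, geometric mean of
    the variances, and recovery of the auxiliary variables (eps, xi)."""
    mu_i0 = 0.5 * (mu_i1 + mu_i2)
    sigma2_i0 = np.sqrt(s2_i1 * s2_i2)
    eps = (mu_i2 - mu_i1) / (2.0 * np.sqrt(sigma2_i0))
    xi = np.sqrt(s2_i1 / s2_i2)
    return mu_i0, sigma2_i0, (eps, xi)

# round trip: splitting then combining returns the original state
new1, new2, aux = split_mean_variance(1.0, 0.5)
print(combine_mean_variance(*new1, *new2))
```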

Along the way, we recover the values of the auxiliary variables of the split move. The auxiliary variables $\varepsilon$, $\xi$, etc., constitute the vector $v_{\mathrm{split}}$ of the split move. The mapping $m_{\mathrm{split}}$ is not the identity, as for the birth move, but rather given by steps (i)-(iv) above. We will now detail the computation of the corresponding Jacobian and its determinant. The transformation we need to examine is thus the one taking the components of an $r$th order parameter $\theta_r$ and the auxiliary variables into an $(r+1)$-th order parameter $\theta_{r+1}$ by a split move. In this transformation most components, namely all that are not associated with the state $i_0$ that is split, are simply copied to the new parameter $\theta_{r+1}$, and they do not affect any of the other components of $\theta_{r+1}$. Thus the Jacobian will be block diagonal with respect to these components, and the block corresponding to the copied components is an identity matrix. In effect, this means that the Jacobian determinant equals the Jacobian determinant associated with the components actually involved in the split only. Analyzing this part closer, we find further structure implying diagonal blocks, namely the structure found in steps (i)-(iv) above. The sets of parameters and auxiliary variables involved in each of these steps are disjoint, meaning that the Jacobian will be block diagonal with respect to the structure of the steps and its determinant will be the product of the determinants given by each of the steps.

(i) For this step, taking $(\mu_{i_0}, \varepsilon, \sigma^2_{i_0}, \xi)$ into $(\mu_{i_1}, \mu_{i_2}, \sigma^2_{i_1}, \sigma^2_{i_2})$, the Jacobian is
$$\begin{pmatrix} 1 & -\sigma_{i_0} & -\varepsilon/(2\sigma_{i_0}) & 0 \\ 1 & \sigma_{i_0} & \varepsilon/(2\sigma_{i_0}) & 0 \\ 0 & 0 & \xi & \sigma^2_{i_0} \\ 0 & 0 & 1/\xi & -\sigma^2_{i_0}/\xi^2 \end{pmatrix} \;,$$


given that we differentiate with respect to $\sigma^2_{i_0}$, not $\sigma_{i_0}$. The (modulus of the) determinant of this matrix is $4\sigma^3_{i_0}/\xi$.

(ii) For this step, the Jacobian is further block diagonal with respect to each $i \neq i_0$. For each such $i$, the step takes $(\omega_{i,i_0}, u_i)$ into $(\omega_{i,i_1}, \omega_{i,i_2})$, with Jacobian
$$\begin{pmatrix} u_i & 1 - u_i \\ \omega_{i,i_0} & -\omega_{i,i_0} \end{pmatrix}$$
and (modulus of the) determinant $\omega_{i,i_0}$. The overall Jacobian determinant of this step is thus $\prod_{i \neq i_0} \omega_{i,i_0}$.

(iii) For this step, the Jacobian is also further block diagonal with respect to $j \neq i_0$. For a specific $j$, the step takes $(\omega_{i_0,j}, \beta_j)$ into $(\omega_{i_1,j}, \omega_{i_2,j})$, with Jacobian
$$\begin{pmatrix} \beta_j & 1/\beta_j \\ \omega_{i_0,j} & -\omega_{i_0,j}/\beta_j^2 \end{pmatrix}$$
and (modulus of the) determinant $2\omega_{i_0,j}/\beta_j$. The overall Jacobian determinant of this step is thus $2^{r-1} \prod_{j \neq i_0} \omega_{i_0,j}/\beta_j$.

(iv) For this step, taking $(\omega_{i_0,i_0}, u_{i_0}, \beta_{i_1}, \beta_{i_2})$ into $(\omega_{i_1,i_1}, \omega_{i_1,i_2}, \omega_{i_2,i_1}, \omega_{i_2,i_2})$, the Jacobian is
$$\begin{pmatrix} u_{i_0}\beta_{i_1} & (1-u_{i_0})\beta_{i_2} & u_{i_0}/\beta_{i_1} & (1-u_{i_0})/\beta_{i_2} \\ \omega_{i_0,i_0}\beta_{i_1} & -\omega_{i_0,i_0}\beta_{i_2} & \omega_{i_0,i_0}/\beta_{i_1} & -\omega_{i_0,i_0}/\beta_{i_2} \\ \omega_{i_0,i_0}u_{i_0} & 0 & -\omega_{i_0,i_0}u_{i_0}/\beta_{i_1}^2 & 0 \\ 0 & \omega_{i_0,i_0}(1-u_{i_0}) & 0 & -\omega_{i_0,i_0}(1-u_{i_0})/\beta_{i_2}^2 \end{pmatrix} \;.$$
Some algebra shows that the (modulus of the) determinant of this matrix is $4\omega^3_{i_0,i_0}\, u_{i_0}(1-u_{i_0}) / (\beta_{i_1}\beta_{i_2})$.

Finally we arrive at the overall Jacobian determinant (in absolute value) of the split move,
$$J_{\mathrm{split}} = 2^{r+3}\, \frac{\sigma^3_{i_0}\, \omega^3_{i_0,i_0}\, u_{i_0}(1-u_{i_0})}{\xi\, \beta_{i_1}\beta_{i_2}} \prod_{i \neq i_0} \omega_{i,i_0} \prod_{j \neq i_0} \frac{\omega_{i_0,j}}{\beta_j} \;.$$
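For concreteness, a small Python sketch (variable names are ours) assembling $\log J_{\mathrm{split}}$ from the four block determinants derived above:

```python
import numpy as np

def log_jacobian_split(sigma2_i0, xi, omega_i0_i0, u_i0, beta_i1, beta_i2,
                       omega_col_i0, omega_row_i0, beta_row):
    """Log |J_split| for the split move, assembled from the determinants of
    steps (i)-(iv); omega_col_i0 = (omega_{i,i0})_{i != i0},
    omega_row_i0 = (omega_{i0,j})_{j != i0}, beta_row = (beta_j)_{j != i0}."""
    r = len(omega_col_i0) + 1                       # number of states before the split
    log_j = (r + 3) * np.log(2.0)                   # overall power of two
    log_j += 1.5 * np.log(sigma2_i0) - np.log(xi)   # step (i): sigma_{i0}^3 / xi
    log_j += np.sum(np.log(omega_col_i0))           # step (ii)
    log_j += np.sum(np.log(omega_row_i0) - np.log(beta_row))   # step (iii)
    log_j += (3 * np.log(omega_i0_i0)               # step (iv)
              + np.log(u_i0) + np.log(1.0 - u_i0)
              - np.log(beta_i1) - np.log(beta_i2))
    return log_j

# example call with arbitrary values
print(log_jacobian_split(0.5, 1.2, 0.8, 0.3, 1.1, 0.9,
                         np.array([0.7, 1.3]), np.array([0.6, 0.4]),
                         np.array([1.05, 0.95])))
```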

The acceptance ratio for the split/combine move is thus
$$\frac{\pi(r+1)\, \pi_{r+1}(\theta_{r+1})\, L(y_{0:n} \mid \theta_{r+1})\, (r+1)!}{\pi(r)\, \pi_r(\theta_r)\, L(y_{0:n} \mid \theta_r)\, r!} \times \frac{P_c(r+1)/[(r+1)r/2]}{P_s(r)/r} \times \frac{J_{\mathrm{split}}}{2\, p_\varepsilon(\varepsilon)\, p_\xi(\xi)\, p_\beta(\beta_{i_1})\, p_\beta(\beta_{i_2}) \prod_{j \neq i_0} p_\beta(\beta_j)}$$
$$= \frac{\pi(r+1)\, \pi_{r+1}(\theta_{r+1})\, L(y_{0:n} \mid \theta_{r+1})}{\pi(r)\, \pi_r(\theta_r)\, L(y_{0:n} \mid \theta_r)} \times \frac{P_c(r+1)}{P_s(r)} \times \frac{J_{\mathrm{split}}}{p_\varepsilon(\varepsilon)\, p_\xi(\xi)\, p_\beta(\beta_{i_1})\, p_\beta(\beta_{i_2}) \prod_{j \neq i_0} p_\beta(\beta_j)} \;.$$


Here $P_s(r)/r$ and $P_c(r+1)/[(r+1)r/2]$ are the probabilities to propose to split a specific component out of $r$ and to propose to combine a specific pair out of $(r+1)r/2$ (the number of pairs selected from $r+1$ items) possible ones, respectively. For the auxiliary variable densities, we note that the uniform variables involved have densities equal to unity, and that the factor 2 arises from the above observation that there are two different combinations of auxiliary variables that have equal density and that result in identical parameters after the split. The acceptance rate for the combine move is the inverse of the above.

Just as for MCMC algorithms with fixed $r$, several types of moves are typically put together into a sweep. For the current algorithm, a sweep may look as follows.

(a) Update the means $\mu_i$ while letting $r$ stay fixed.
(b) Update the variances $\sigma^2_i$ while letting $r$ stay fixed.
(c) Update the $\omega_{ij}$ while letting $r$ stay fixed.
(d) Propose a birth move or a death move, with probabilities $P_b(r)$ and $P_d(r)$, respectively.
(e) Propose a split move or a combine move, with probabilities $P_s(r)$ and $P_c(r)$, respectively.

Obviously, $P_b(r) + P_d(r) = 1$ and $P_s(r) + P_c(r) = 1$ must hold for all $r$. Typically, all these probabilities are set to 1/2, except for $P_b(1) = P_s(1) = 1$, $P_d(1) = P_c(1) = 0$, $P_b(R) = P_s(R) = 0$, and $P_d(R) = P_c(R) = 1$, where $R$ is the maximum number of states allowed by the prior. Steps (a)-(c) above may be accomplished by Metropolis-Hastings steps as in Example 13.1.14 but may also be done by completing the data through simulation of the hidden chain $X_{0:n}$ followed by a Gibbs step for updating $\mu_i$ and $\sigma^2_i$ conditional on both the data and the hidden chain. The $\omega_{ij}$ may also be updated this way, by simulating the row sums and the $q_{ij}$ separately and then computing the corresponding $\omega_{ij}$.

The above reversible jump MCMC algorithm was implemented and run on a data set consisting of 600 monthly returns (in percent) from the Japanese stock index Nikkei over the time period 1950-1999; Graflund and Nilsson (2003) contains a fuller description of this time series as well as an ML-based statistical analysis using normal HMMs. The mean of the data was 1.14, and its minimal and maximal values were -29.8 and 24.6, respectively. In our implementation, we put a uniform prior on $r$ over the range $1, 2, \ldots, R$ with $R = 10$, and took $\zeta = 0$, $\kappa = 40$, $\alpha = 1$, $\beta = 2$, and $\delta_j = 1$ for all $j$. Updating of the $\mu_i$ and the $\sigma^2_i$ for fixed $r$ was done through imputation of the hidden chain followed by Gibbs sampling, whereas the $\omega_{ij}$ were updated through a $\mathrm{N}(0, 0.1^2)$ increment random walk Metropolis-Hastings proposal on each $\log \omega_{ij}$. The birth, death, split, and combine proposal probabilities $P_b(r)$, etc., were all set to 1/2 with the aforementioned modifications at the boundaries $r = 1$ and $r = R$. In the split move, we used $\rho_\varepsilon = \rho_\xi = \rho_\beta = 0.5$.


The algorithm was run for 100,000 burn-in sweeps and then for another 2,000,000 sweeps during which its output was monitored. The acceptance rate for the update-$\omega_{ij}$ move, the split/combine move, and the birth/death move was about 34%, 1.8%, and 1.4%, respectively. A higher rate for the dimension-changing moves would indeed be desirable, and this could perhaps be achieved with modified moves. We did some experimentation with other values for $\rho_\varepsilon$, $\rho_\xi$, and $\rho_\beta$, but without obtaining much variation in the acceptance rates. The estimated posterior probabilities for $r$ were 0.000, 0.307, 0.500, 0.156, 0.029, 0.006, and 0.001 for $r = 1, 2, \ldots, 7$ and below 0.001 for larger $r$. Graflund and Nilsson (2003) estimated the same kind of HMM from the data but using ML implemented through simulated annealing, arriving at the estimated p-value 0.60 for testing $r = 2$ vs. $r = 3$. They thus adopted $r = 2$ as their order estimate, whereas the reversible jump MCMC analysis above gives the largest posterior probability for $r = 3$. However, our particular choice of prior may have a substantial effect on the posterior for $r$, and a Bayes factor analysis, which we did not carry out, may also give a different conclusion. Indeed, hierarchical priors are often used to attenuate the effect of the prior on the posterior (Richardson and Green, 1997; Robert et al., 2000). We stress that the algorithm outlined above should be viewed as an example of a reversible jump MCMC algorithm that may be modified and tuned for different applications, rather than as a ready-to-use algorithm that suits every need.

As another example of posterior analysis, we extracted the MCMC samples with $r = 2$ components, permuted the component indices for each such sample to make the means $\mu_i$ sorted (there was label switching in the MCMC output), and computed the posterior means: $\hat\mu_1 = 0.755$ and $\hat\mu_2 = 1.568$. This is to be compared to the MLEs $\hat\mu_1 = 0.847$ and $\hat\mu_2 = 1.531$ reported by Graflund and Nilsson (2003). The credibility intervals we obtained were quite wide; the 95% intervals for $\mu_1$ and $\mu_2$ (after sorting) read (0.213, 1.460) and (1.102, 2.074) respectively, both covering the respective MLE.

13.2.3 Alternative Sampler Designs

Reversible jump MCMC algorithms have in common with more conventional Metropolis-Hastings algorithms that they generally contain some parameters that need to be fine tuned in order to optimize their performance. In the example above, these parameters are $\rho_\varepsilon$, $\rho_\xi$ and $\rho_\beta$. Often the only way to do this fine tuning is through a set of pilot runs during which acceptance probabilities and other statistics related to the mixing of the algorithm are monitored. For any particular variable-dimension statistical model, there is an infinite number of ways of designing reversible jump algorithms. The above example is only one of them for the normal HMM. Other structures of the split/combine move, for instance, may prove more efficient with certain combinations of priors and/or data. Designing a reversible jump algorithm is by no means an automated procedure but needs to be guided by experimentation and, when


available, experience. The recent paper by Brooks et al. (2003) does outline, however, some general ideas about how to construct ecient reversible jump algorithms by setting up rules to calibrate the jump proposals. Above, we motivated the factorial r! that is adjoined to the posterior density by an argument based on equivalence classes of parameters. Richardson and Green (1997) motivated them by saying that the actual parameter space is the one only containing parameters such that the normal means, for instance, appear in ascending order: 1 < 2 < . . . < r , cf. Section 13.1.3. We note that sorting of this kind may become necessary even without restrictions on the prior, as we have seen that with an exchangeable prior, the marginal posterior densities of the means, for example, are generally identical. We prefer to view such sorting as a part of the post-processing of the MCMC sampler output, however, rather than as an intrinsic property of the algorithm itself. Sorting afterwards simplies, for example, examination of how sorting with respect to dierent sets of parameters (means or variances, for example) aect the inference. As a consequence of the assumption of sorted means, Richardson and Green (1997) also restrict the split move, disallowing it to separate the normal means so far apart that the ordering is violated, and the combine move is restricted accordingly in that it may only attempt to combine states with adjacent normal means. We make some comments on this approach. The rst is that this restriction on the split/combine move is by no means necessary; if a split move violates the ordering, we can view that parameter as the equivalent one obtained upon sorting the means followed by a corresponding permutation of the remaining coordinates. The combine move is then allowed to attempt merging any pair of states. A second comment is that the above restriction on the split and combine moves may prove useful, even when we do not make any restrictions on the prior. With r states, there are r(r 1)/2 dierent pairs to combine, and one can imagine that pairs with means (or variances) far apart are less likely to generate a successful combine move. Therefore, restricting the combine move to consider states with adjacent means (or variances) only may lead to an increased acceptance probability for this move. If this strategy is adopted, the split move must be restricted accordingly, as the split/combine pair (as all other pairs) must be reversible: what one move may do the other one must be able to undo. We also mention the option to include the hidden chain {Xk }k0 in the MCMC state space, that is, adjoining it to the parameter . This choice was made by Richardson and Green (1997) in the setting of mixtures, and followed up for HMMs by Robert et al. (2000). These papers also provide suggestions for other designs of split/combine moves. In addition, the latter paper contains a lot of ne tuning done in the process of increasing acceptance rates. Including the hidden chain in the MCMC sampler simplies the computation of the posterior density, as the likelihood involved is then L(y0:n |x0:n , r ) rather than L(y0:n |r ), and the former is simply a product of scalars. On the other hand, in the birth move the new state i0 must be assigned to some Xk and, similarly,


in the split move each $X_k$ equal to $i_0$ must be relabeled either $i_1$ or $i_2$. The simulation mechanisms for doing so may be quite complex, cf. (Robert et al., 2000), and computationally demanding.

13.2.4 Alternatives to Reversible Jump MCMC

Reversible jump MCMC has had a vast impact on variable-dimension Bayesian inference, but there certainly are some other approaches that deserve to be discussed. Brooks et al. (2003) reassess the reversible jump methodology through a global saturation scheme. They consider a series of models $\Theta_r$ ($r = 1, \ldots, R$) such that $\max_r \dim(\Theta_r) = r_{\max} < \infty$. The parameter $\theta_r \in \Theta_r$ is then completed with an auxiliary variable $U_r$ such that $\dim(\theta_r, u_r) = r_{\max}$ and $U_r \sim q_r(u_r)$. Brooks et al. (2003) define in addition a vector $\gamma$ of dimension $r_{\max}$ with i.i.d. components, distributed from $\psi(\gamma)$, and assign the following joint prior to a parameter in $\Theta_r$,
$$\pi(r, \theta_r)\, q_r(u_r) \prod_{i=1}^{r_{\max}} \psi(\gamma_i) \;.$$

Within this augmented (or saturated) framework, there is no varying dimension anymore because, for all models, the whole vector $(\theta_r, u_r, \gamma)$ is of fixed dimension. Therefore, moves between models can be defined just as freely as moves between points of each model; see also (Godsill, 2001) for a similar development. Brooks et al. (2003) propose a three stage MCMC update.

Algorithm 13.2.3.
1. Update the current value of the parameter, $\theta_r$.
2. Update $u_r$ and $\gamma$ conditional on $\theta_r$.
3. Update the model index $r$ into $r'$ using the bijection $(\theta_{r'}, u_{r'}) = m(\theta_r, u_r)$.

Note that, for specific models, saturation schemes appear rather naturally. For instance, the case of a noisily observed time series with abrupt changes corresponds to a variable dimension model, when considered in continuous time (Green, 1995; Hodgson, 1998). Its discrete time counterpart however may be reparameterized by using indicators $X_k$ that a change occurs at index $k$ (for all indices) rather than the indices of change points (Chib, 1998; Lavielle and Lebarbier, 2001). The resulting model is then a fixed dimension model, whatever the number of change points in the series.

Petris and Tardella (2003) devised an approach that is close to a saturation scheme in the sense that it constructs a density on the subspace of largest


dimension. However, it does not construct the extra variables uk explicitly but rather embeds the densities on lower dimensional subspaces into a function on the subspace of largest dimension that eectively incorporates all densities. This approach has not yet been tested on HMMs. Reversible jump algorithms operate in discrete time, but similar algorithms may be formulated in continuous time. Stephens (2000a) suggested such an algorithm, built on birth/death moves only, for mixture distribution, and Capp e et al. (2003) extended the framework to allow for other kinds of dimensionchanging moves like split/combine. In this continuous time approach, there are no acceptance probabilities and birth moves are always accepted, but model parameters that are unlikely, in the sense of having low posterior density, are assigned large death rates and are hence abandoned quickly. Similar remarks apply to split/combine moves. Moves that update model parameters without changing its dimension may also be incorporated. Capp et al. (2003) also e compared the discrete and continuous time approaches and concluded that the dierences between them are very minor, but with the continuous time approach generally requiring more computing time.

13.3 Multiple Imputations Methods and Maximum a Posteriori


We consider in this last section a class of methods, which methods are arguably less directly connected with the Bayesian framework and which may also be envisioned as extensions or variants of the approaches discussed in Chapter 11. Rather than simulating from the posterior distribution of the parameters, we now consider maximizing it to determine the so-called maximum a posteriori (or MAP) point estimate. In contrast to the methods of Chapters 1011, which could also be used in this context (Remark 10.2.1), the techniques to be discussed below explicitly use parameter simulation in addition to hidden state simulation. The primary objective of these techniques is not (only) to compensate for the lack of exact smoothing computations in many models of interest, but also to perform some form of random search optimization see discussion in the introduction of Chapter 11which is (hopefully) more robust to the presence of local maxima in the function to be optimized. We already mentioned, in conjunction with identiability issues, the diculties in using, in a Bayesian context, marginal posterior means as parameter estimates in HMMs. Identiability can be forced upon the parameter by imposing some articial identifying constraint such as ascending means, as mentioned above, or as in Robert and Titterington (1998) for instance. Even in that case, the posterior mean is a poor candidate for Bayesian inference, given that it heavily depends on the identifying constraints (see Celeux et al., 2000, for an illustration in the setting of mixtures). Therefore in many cases, the remaining candidate is the MAP estimate,


$$\hat\theta_{\mathrm{MAP}} = \arg\max_\theta \int \pi(\theta, x_{0:n} \mid y_{0:n})\; dx_{0:n} = \arg\max_\theta \pi(\theta \mid y_{0:n}) \;. \qquad (13.17)$$

As previously discussed, the methods of either Chapter 10 or 11 may be used to determine the MAP estimator, depending on whether or not the marginalization in (13.17) can be performed exactly. The structure of (13.17) also suggests a specific class of optimization algorithms which implement the simulated annealing principle originally proposed by Metropolis et al. (1953).

13.3.1 Simulated Annealing

Simulated annealing methods are a non-homogeneous variant of MCMC algorithms used to perform global optimization. The word "global" is used to emphasize that the ultimate goal is convergence to the actual maxima of the function of interest (the so-called global maxima), whether or not the function does possess local maxima. The terminology is borrowed from metallurgy where a slow decrease of the temperature of a metal (the annealing process) is used to obtain a minimum energy crystalline structure. By analogy, simulated annealing is a random search technique that explores the parameter space $\Theta$, using a non-homogeneous Markov chain $\{\theta^i\}_{i \geq 0}$ whose transition kernels $K_i$ are tailored to have invariant probability density functions
$$\pi_{M_i}(\theta \mid y_{0:n}) \propto \pi^{M_i}(\theta \mid y_{0:n}) \;, \qquad (13.18)$$

$\{M_i\}_{i \geq 1}$ being a positive increasing sequence tending to infinity. The intuition behind simulated annealing is that as $M_i$ tends to infinity, $\pi_{M_i}(\theta \mid y_{0:n})$ concentrates itself upon the set of global modes of the posterior distribution. It has been shown under various assumptions that convergence to the set of global maxima is indeed ensured for sequences $\{M_i\}_{i \geq 1}$ growing at a logarithmic rate (Laarhoven and Arts, 1987). Using the metallurgic analogy again, the sequence $\{M_i\}_{i \geq 1}$ is often called a cooling schedule, and the reciprocal of $M_i$ is known as the temperature. In simple situations where the posterior $\pi(\theta \mid y_{0:n})$ is known (up to a constant), sampling from a kernel $K_i$ that has (13.18) as invariant density may be done using the Metropolis-Hastings algorithm (see Section 6.2.3). For HMMs however, this situation is the exception rather than the rule, and the posterior is only available in closed form in models where exact smoothing is feasible, such as normal HMMs with finite state space. To overcome this difficulty, Doucet et al. (2002) developed a novel approach named SAME (for state augmentation for marginal estimation), also studied by Gaetan and Yao (2003) under the name MEM (described as a multiple-imputation Metropolis version of the EM algorithm). We adopt here the terminology proposed by Doucet et al. (2002).
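To fix ideas, here is a toy Python sketch (ours, not from the original) of a single random walk Metropolis step targeting the tempered density $\pi^M$ when the log-posterior is available up to a constant, together with a small bimodal example in which increasing $M$ makes the chain settle in the dominant mode.

```python
import numpy as np

rng = np.random.default_rng(4)

def sa_mh_step(theta, log_post, M, step=0.1):
    """One random walk Metropolis step with invariant density proportional to
    pi^M, i.e. a simulated annealing move at inverse temperature M."""
    prop = theta + step * rng.normal()
    log_ratio = M * (log_post(prop) - log_post(theta))
    return prop if np.log(rng.uniform()) < log_ratio else theta

# toy target: unnormalized mixture of two normals with unequal weights
log_post = lambda x: np.logaddexp(-0.5 * (x - 1.0) ** 2 / 0.04,
                                  np.log(0.2) - 0.5 * (x + 1.0) ** 2 / 0.04)

theta = 0.0
for i in range(2000):
    theta = sa_mh_step(theta, log_post, M=1 + i // 200)  # slowly increasing M_i
print(theta)
```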


13.3.2 The SAME Algorithm

The key argument behind SAME is that upon restricting the $M_i$ to be integers, the probability density function $\pi_{M_i}$ in (13.18) may be viewed as the marginal posterior in an artificially augmented probability model. Hence one may use standard MCMC techniques to draw from this augmented probability model, and therefore the simulated annealing strategy is feasible for general missing data models. The concentrated distribution $\pi_{M_i}$ is obtained by artificially replicating the latent variables in the model, in our case the hidden states $X_{0:n}$. To make the argument more precise, denote by $M$ the current value of $M_i$ and consider $M$ artificial copies of the hidden state sequence, denoted by $X_{0:n}(1), \ldots, X_{0:n}(M)$. The fictitious probability model postulates that these sequences are a priori independent with common parameter $\theta$ and observed sequence $Y_{0:n}$, leading to a posterior joint density defined by
$$\pi_M[\theta, x_{0:n}(1), \ldots, x_{0:n}(M) \mid y_{0:n}] \propto \prod_{m=1}^{M} \pi[\theta, x_{0:n}(m) \mid y_{0:n}] \propto \prod_{m=1}^{M} p[x_{0:n}(m), y_{0:n} \mid \theta]\; \pi(\theta)^M \;, \qquad (13.19)$$

where $\pi(\cdot, \cdot \mid y_{0:n})$ is the joint posterior distribution of the parameter and hidden states corresponding to the model, $p(\cdot, y_{0:n} \mid \theta)$ the complete data likelihood, and $\pi(\theta)$ is the prior. This distribution does not correspond to a real phenomenon but it is a properly defined density in that it is positive, and the right-hand side can be normalized so that (13.19) integrates to unity. Now the marginal distribution of $\theta$ in (13.19), obtained by integration over all replications of $X_{0:n}$, is
$$\pi_M(\theta \mid y_{0:n}) = \int \cdots \int \pi_M[\theta, x_{0:n}(1), \ldots, x_{0:n}(M) \mid y_{0:n}]\; dx_{0:n}(1) \cdots dx_{0:n}(M)$$
$$\propto \int \cdots \int \prod_{m=1}^{M} \pi[\theta, x_{0:n}(m) \mid y_{0:n}]\; dx_{0:n}(1) \cdots dx_{0:n}(M)$$

$$= \pi^M(\theta \mid y_{0:n}) \;.$$

Hence an MCMC algorithm in the augmented space, with invariant distribution $\pi_M[\theta, x_{0:n}(1), \ldots, x_{0:n}(M) \mid y_{0:n}]$, is such that the simulated sequence of parameters $\{\theta^i\}_{i \geq 0}$ marginally admits $\pi_{M}$ in (13.18) as invariant distribution. An important point here is that when an MCMC sampler is available for the density $\pi(\theta, x_{0:n} \mid y_{0:n})$, it is usually easy to construct an MCMC sampler with target density (13.19), as the replications of $X_{0:n}$ are statistically independent conditional on $\theta$ in this fictitious model, that is,
$$\pi_M[x_{0:n}(1), \ldots, x_{0:n}(M) \mid y_{0:n}, \theta] = \prod_{m=1}^{M} \pi[x_{0:n}(m) \mid y_{0:n}, \theta] \;, \qquad (13.20)$$
and for , the full conditional distribution satises


M

M [|y0:n , x0:n (1), . . . , x0:n (M )]


m=1

[|y0:n , x0:n (m)] .

(13.21)

According to (13.20), the sampling step for $x_{0:n}(k)$ is identical to its counterpart in a standard data augmentation sampler with target distribution $\pi[\theta, x_{0:n}(k) \mid y_{0:n}]$, whereas the sampling step for $\theta$ involves a draw from (13.21). If $\pi(\theta \mid y_{0:n}, x_{0:n})$ belongs to an exponential family of densities, then sampling from (13.21) is straightforward, as the product of conditionals in (13.21) is also a member of this exponential family. In other cases, (13.21) can be simulated using a Metropolis-Hastings step; Gaetan and Yao (2003) for instance used random walk Metropolis-Hastings proposals. For normal HMMs, the SAME algorithm may be implemented as follows.

Example 13.3.1 (SAME for Normal HMMs). Assume that the state space X is $\{1, \ldots, r\}$ and that the conditional distributions are normal, $Y_k \mid X_k = j \sim \mathrm{N}(\mu_j, \sigma^2_j)$. Conjugate priors are assumed, that is, $\mu_j \sim \mathrm{N}(\zeta, \kappa)$, $\sigma^2_j \sim \mathrm{IG}(\alpha, \beta)$ and $q_{j,\cdot} \sim \mathrm{Dir}_r(\delta, \ldots, \delta)$, with independence between the $\mu_j$, the $\sigma^2_j$, and the rows of $Q$. We assume (for simplicity) that the initial distribution is fixed and known. To avoid confusion with simulation indices (which are indicated by superscripts), we will use the notation $\upsilon_j$ rather than $\sigma^2_j$ for the component variances. Examining Example 13.1.10, we find that the full conditional distribution of the means $\mu_j$ is such that they are conditionally independent with
$$\mu_j \mid \upsilon_j, x_{0:n}(1), \ldots, x_{0:n}(M), y_{0:n} \sim \mathrm{N}\!\left( \frac{M \zeta \upsilon_j / \kappa + \sum_{m=1}^{M} S_j(m)}{M \upsilon_j / \kappa + \sum_{m=1}^{M} n_j(m)} \,,\; \frac{1}{M/\kappa + \sum_{m=1}^{M} n_j(m)/\upsilon_j} \right) \;, \qquad (13.22)$$

where $S_j(m) = \sum_{0 \leq k \leq n:\, x_k(m) = j} y_k$ is the sum statistic associated with the $m$th replication of $X_{0:n}$ and state $j$ and, similarly, $n_j(m) = \#\{0 \leq k \leq n : x_k(m) = j\}$ is the number of $x_k(m)$ with $x_k(m) = j$. In an analogous way, we find that the full conditional distribution of the variances $\upsilon_j$ is such that they are conditionally independent with
$$\upsilon_j \mid \mu_j, x_{0:n}(1), \ldots, x_{0:n}(M), y_{0:n} \sim \mathrm{IG}\!\left( M(\alpha + 1) - 1 + \frac{1}{2} \sum_{m=1}^{M} n_j(m) \,,\; M\beta + \frac{1}{2} \sum_{m=1}^{M} S^{(2)}_j(m) \right) \;, \qquad (13.23)$$

where $S^{(2)}_j(m) = \sum_{0 \leq k \leq n:\, x_k(m) = j} (y_k - \mu_j)^2$, and that the full conditional distribution of $Q$ is such that the rows are conditionally independent with


$$(q_{j1}, \ldots, q_{jr}) \mid x_{0:n}(1), \ldots, x_{0:n}(M) \sim \mathrm{Dir}_r\!\left( M(\delta - 1) + 1 + \sum_{m=1}^{M} n_{j1}(m), \ldots, M(\delta - 1) + 1 + \sum_{m=1}^{M} n_{jr}(m) \right) \;, \qquad (13.24)$$

where $n_{jl}(m) = \#\{0 \leq k \leq n-1 : x_k(m) = j,\, x_{k+1}(m) = l\}$ is the number of transitions from state $j$ to $l$ in the $m$th replication. Hence the SAME algorithm looks as follows.

Algorithm 13.3.2. Initialize the algorithm with $\theta^0 = (\{\mu^0_j, \upsilon^0_j\}_{j=1,\ldots,r}, Q^0)$ and select a schedule $\{M_i\}_{i \geq 0}$. Then for $i \geq 1$,

- Simulate the $M_i$ missing data replications $X^i_{0:n}(1), \ldots, X^i_{0:n}(M_i)$ independently under the common distribution $\pi(x_{0:n} \mid y_{0:n}, \theta^{i-1})$;
- Simulate $\mu^i_1, \ldots, \mu^i_r$ independently from the normal distributions (13.22);
- Simulate $\upsilon^i_1, \ldots, \upsilon^i_r$ independently from the inverse gamma distributions (13.23), using the newly simulated $\mu^i_j$ to evaluate $S^{(2)}_j(m)$ for $j = 1, \ldots, r$ and $m = 1, \ldots, M_i$;
- Simulate the rows of $Q^i$ independently from the Dirichlet distributions (13.24).
The simulation of the replications $X^i_{0:n}(m)$ can be carried out using the forward filtering-backward sampling recursion developed in Section 6.1.2.
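As an illustration, the parameter simulation step of this algorithm can be sketched in Python as follows; this is a minimal sketch under the reconstructed hyperparameter names $\zeta, \kappa, \alpha, \beta, \delta$ used above, with the per-replication statistics assumed to have been accumulated from the simulated hidden chains.

```python
import numpy as np

rng = np.random.default_rng(3)

def same_parameter_step(sum_y, sum_y2, n, ntrans, upsilon,
                        zeta=0.0, kappa=40.0, alpha=1.0, beta=2.0, delta=1.0):
    """Draws (mu, upsilon, Q) from the replicated full conditionals
    (13.22)-(13.24) of the SAME algorithm for a normal HMM.
      sum_y, sum_y2, n : (M, r) arrays of sum_k y_k, sum_k y_k^2 and counts,
                         restricted to x_k(m) = j, for each replication m
      ntrans           : (M, r, r) array of transition counts n_{jl}(m)
      upsilon          : (r,) current variances, used in (13.22)"""
    M, r = n.shape
    # (13.22): means
    prec = M / kappa + n.sum(axis=0) / upsilon
    mean = (M * zeta / kappa + sum_y.sum(axis=0) / upsilon) / prec
    mu = rng.normal(mean, np.sqrt(1.0 / prec))
    # (13.23): variances, with S^(2)_j(m) evaluated at the new means
    S2 = (sum_y2 - 2.0 * mu * sum_y + n * mu**2).sum(axis=0)
    shape = M * (alpha + 1.0) - 1.0 + 0.5 * n.sum(axis=0)
    scale = M * beta + 0.5 * S2
    upsilon_new = scale / rng.gamma(shape)          # inverse gamma draw
    # (13.24): rows of the transition matrix
    Q = np.empty((r, r))
    for j in range(r):
        conc = M * (delta - 1.0) + 1.0 + ntrans[:, j, :].sum(axis=0)
        Q[j] = rng.dirichlet(conc)
    return mu, upsilon_new, Q
```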

It should be clear from the above example that the SAME approach is strikingly close to the SEM and MCEM methods discussed in Sections 11.1.7 and 11.1.1, respectively. Indeed, taking the log, (13.19) may be rewritten as
$$\log \pi_M[\theta, x_{0:n}(1), \ldots, x_{0:n}(M) \mid y_{0:n}] = C^{\mathrm{st}} + M \left\{ \frac{1}{M} \sum_{m=1}^{M} \log p[x_{0:n}(m), y_{0:n} \mid \theta] + \log \pi(\theta) \right\} \;, \qquad (13.25)$$
where the constant does not depend on the parameter . The term in braces in (13.25) is recognized as a Monte Carlo approximation of the intermediate quantity of EM for this problem, with the addition of the prior term (see Remark 10.2.1). Hence replacing the parameter simulation step in the SAME algorithm by a maximization step lead us back to the MCEM approach. In the example of Algorithm 13.3.2, the MCEM update can be obtained by setting the new values of the parameter to the modes of (13.22)(13.24), that is, = j
j = M m=1 Sj (m) , M 1 j / + M m=1 nj (m) (2) M + (1/2)M 1 m=1 Sj (m) , M ( + 1) + (1/2)M 1 m=1 nj (m) M ( 1) + M 1 m=1 njl (m) r M r( 1) + M 1 l=1 m=1 njl (m)

j / + M 1

qjl =

506

13 Fully Bayesian Approaches

These equations can also be obtained from the M-step update equations (10.41)(10.43) of the EM algorithm for the normal HMM, taking into account the prior terms and replacing the posterior expectations by their Monte Carlo approximation. It is also of interest that the distributions (13.22)(13.24), from which simulation is done in the SAME approach, have variances that decrease proportionally to 1/M ; hence the distributions get more and more concentrated around the modes given above as the number of replications increases. The interest of SAME, however, is that it exactly implements the simulated annealing principle for which a number of convergence results have been obtained in the literature. In particular, both Doucet and Robert (2002) and Gaetan and Yao (2003) provide some conditions under which the distribution of the ith parameter estimate i converges to a measure that is concentrated on the set of global maxima of the marginal posterior. Although very appealing, these results do imply restrictive conditions on the model, requiring in particular that the likelihood be bounded from above and below. In addition, those results apply only for very slow logarithmic rates of increase of {Mi }i1 , with appropriate choice of multiplicative constants. Many authors, among which are Doucet et al. (2002), recommend using faster schedules in practice, reporting for instance good results with sequences {Mi }i1 that grow linearly. We conclude this brief exposition with an example that illustrates the importance of the choice of a proper schedulesee Doucet et al. (2002), Gaetan and Yao (2003), and Jacquier et al. (2004) for further applications of the method. Example 13.3.3 (Binary Deconvolution Model, Continued). We consider again the noisy binary deconvolution model of Example 10.3.2, which served for illustrating the EM and quasi-Newton methods. Recall that this model is a four-state normal HMM for which the transition parameters are known, the variances j are constrained to equal a common value that we denote by , the means are given by j = st h where h is a two-dimensional j vector of unknown lter coecients, and s1 to s4 are xed two-dimensional vectors. For easier comparison with the results discussed in Example 10.3.2, we select improper priors for the parameters, which amounts to setting = 0 and = in (13.22) and = 1 and = 0 in (13.23). Hence the SAME algorithm will directly maximize the likelihood. Taking into account the constraints mentioned above, the posteriors in (13.22) and (13.23) should then be replaced by h | , x0:n (1), . . . , x0:n (M ), y0:n
M n

N [x0:n (1 : M )]
m=1 k=0

yk xk (m), [x0:n (1 : M )]

where

13.3 Multiple Imputations Methods and Maximum a Posteriori


M n 1 t

507

[x0:n (1 : M )] =
m=1 k=0

xk (m)xk (m)

and | h, x0:n (1), . . . , x0:n (M ), y0:n IG M (n + 1) 1 1, 2 2 m=1


M n

[yk xk (m)xk (m)t ]2


k=0

Note that for this discrete-state space model, the likelihood is indeed computable exactly for all values of the parameters h and . Hence we could also imagine implementing the simulated annealing approach directly, without resorting to the SAME completion mechanism. This example nonetheless constitutes a realistic testbed for the SAME algorithm with the advantage that the likelihood can be plotted exactly and its maximum determined with high precision by the deterministic methods discussed in Example 10.3.2. The data is the same as in Example 10.3.2, leading to the prole likelihood surface shown in Figure 10.1. Recall that for the sake of clarity, we only consider the estimated values of h although the variance is also treated as a parameter. For this problem, we xed the total number of simulations of the missing state trajectories X0:n to 10,000 and then evaluated dierent schedules of the form Mi = 1 + ai for various values of a and such that imax the overall number of simulations, i=1 Mi , equals 10,000. Hence imax is not xed and varies depending on the cooling schedule. These choices will be discussed below, but we can already note that 10,000 is a rather large number of simulations for this problem. Recall for instance from Figure 10.1 that the convergence of EM is quite fast in this problem (compared with the model of Example 11.1.2 for instance), although it sometimes converges to a local mode that, as we will see below, is very unlikely compared to the MLE. Table 13.1 summarizes the results obtained over 100 independent replications of the SAME trajectories started from the rst two starting points considered in Figure 10.1. The rst column shows that the simple MCMC simulation without cooling schedule (Mi = 1) is indeed very ecient at nding the global mode of the likelihood. Indeed, once in its steady-state, the MCMC simulations spend about 640 times more time in the vicinity of the global mode than in the local mode. This nding is coherent with the loglikelihood dierence between the two points (labeled MLE and LOC, respectively) in Figure 10.1, which corresponds to a factor 937 once converted back to a linear scale. Hence the likelihood indeed has a local mode but one that is very unlikely compared to the MLE. Letting a simple MCMC chain run long enough is thus sucient to end up in the vicinity of the global mode with high probability (640/641). Because of the correlation between successive values of the parameters however, this phenomenon does not manifest itself as fast as expected and 210 iterations are necessary to ensure that 95% out of the

508

13 Fully Bayesian Approaches a imax Mimax 0 1/72 1/12 1/2 10000 1163 483 198 1 17 41 100 Starting from point 1 in Figure 10.1 # converged 99 92 78 79 std. error 0.122 0.028 0.017 0.014 Starting from point 2 in Figure 10.1 # converged 100 87 61 52 std. error 0.121 0.029 0.018 0.013 1 140 141 95 0.010 36 0.009

Table 13.1. Summary of results of the SAME algorithm for 100 runs and dierent rates of increase a. The upper part of the table pertains to trajectories started from the point labeled 1 in Figure 10.1 and the lower part to those started from the point labeled 2 in Figure 10.1. # converged is the number of sequences that converged to the MLE and not to the local mode, and std. error is the average L2 -norm of the distance to the MLE for those trajectories (for comparison purposes, the L2 -norm of the MLE itself is 1.372). The random seeds used for the simulations were the same for all values of a.

200 trajectories started from either of the two starting points indeed visit the neighborhood of the global mode. Likewise, although some of the trajectories do visit the mirror modes that have identical likelihood for negative values of h0 (see Example 10.3.2), none of the trajectories was found to switch between positive and negative values of h0 once converged1 . The Gibbs sampler is thus unable to connect these two regions of the posterior, which are however equally probable. This phenomenon has been observed in various other missing data settings by Celeux et al. (2000). In this example these mixing problems rapidly get more severe as Mi increases. Accordingly, the number of trajectories in Table 13.1 that do eventually reach the MLE drops down as the linear factor a is set to higher values. The picture is somewhat more complicated in the case of the rst starting point, as the number of trajectories that reach the MLE rst decreases (a = 1/72, 1/12) before increasing again. The explanation for this behavior is to be found in Figure 10.1, which shows that the trajectory of the EM algorithm started from this point does converge to the MLE, in contrast with what happens for the second starting point. Hence for this rst starting point, when Mi increases suciently rapidly, the SAME algorithm mimics the EM trajectory (with some random uctuations) and eventually converges to the MLE. This behavior is illustrated in Figure 13.2. In this example, it turns out that in order to guarantee that the SAME algorithm eectively reaches the MLE, it is very important that Mi stays exactly equal to one for a large number of iterations, preferably a few hundreds, but fty is really a minimum. The logarithmic rates of increase of Mi that
1 In Table 13.1, the trajectories that converge to minus the MLE are counted as having converged, as we know that it corresponds to an identiability issue inherent to the model.

13.3 Multiple Imputations Methods and Maximum a Posteriori

509

260 270 280 290 loglikelihood 300 310 320 330 340 350 2 1.5 1 0.5 h0 0 1.5

MLE

LOC

0.5 h
1

0.5

1.5

260 270 280 290 loglikelihood 300 310 320 330 340 350 2 1.5 1 0.5 h0 0 1.5

MLE

LOC

0.5 h1

0.5

1.5

260 270 280 290 loglikelihood 300 310 320 330 340 350 2 1.5 1 0.5 h0 0 1.5

MLE

LOC

0.5 h1

0.5

1.5

Fig. 13.2. Same prole log-likelihood surface as in Figure 10.1. The trajectories show the rst 200 SAME estimates for, from top to bottom, a = 0, a = 1/12, and a = 1, started at the point labeled 1 in Figure 10.1. The same random seed was used for all three cases.

510

13 Fully Bayesian Approaches

are compatible with this constraint and with the objective of using an overall number of simulations equal to 10,000 typically end up with Mimax being of the order three and are thus roughly equivalent to the MCMC run (a = 0) in Table 13.1. Note that the error obtained with this simple scheme is not that bad, being about ten times smaller than the L2 norm of the MLE. The factor a = 1/72, which gives a reasonable probability of convergence to the MLE from both points, provides an error that is further reduced by a factor of ten. We would like to point out thatespecially when the answer is known as in this toy example!it is usually possible to nd out by trial-and-error cooling schedules that are ecient for the problem (data and model) at hand. In the case of Example 13.3.3, setting Mi = 1 for the rst 280 iterations and letting Mi = 4, 16, 36, 64, 100 for the last ve iterations (500 simulations in total) is very successful with 98 (resp. 96) trajectories converging to the MLE and an average error of 0.018 (resp. 0.020) when started from the two initial points under consideration. The last ve iterations in this cooling schedule follow a square progression that was used for the MCEM algorithm in Example 11.1.3. Note that rather than freezing the parameter by abruptly increasing Mi , one could use instead the averaging strategy (see Section 11.1.2) advocated by Gaetan and Yao (2003). Clearly, one-size-ts-all cooling schedules such as linear or logarithmic rates of increase may be hard to adjust to a particular problem, at least when the overall number of simulations is limited to a reasonable amount. This observation contrasts with the behavior observed for the MCEM and SAEM algorithms in Chapter 11, which are more robust in this respect, particularly for the latter. Remember however that we are here tackling a much harder problem in trying not only to avoid all local maxima but also to ensure that the parameter estimate eventually gets reasonably close to the actual global maximum. There is no doubt that simulated annealing strategies in general, and SAME in particular, are very powerful tools for global maximization of the likelihood or marginal posterior in HMMs. Their usefulness in practical situations however depends crucially on the ability to select proper nite-eort cooling schedules, which may itself be a dicult issue.

Part III

Background and Complements

14 Elements of Markov Chain Theory

14.1 Chains on Countable State Spaces


We review the key elements of the mathematical theory developed for studying the limiting behavior of Markov chains. In this rst section, we restrict ourselves to the case where the state space X is countable, which is conceptually simpler. On our way, we will also meet a number of important concepts to be used in the next section when dealing with Markov chains on general state spaces. 14.1.1 Irreducibility Let {Xk }k0 be a Markov chain on a countable state space X with transition matrix Q. For any x X, we dene the rst hitting time x on x and the return time x to x respectively as x = inf{n 0 : Xn = x} , x = inf{n 1 : Xn = x} , where, by convention, inf = +. The successive hitting times x (n) return times x , n 0, are dened inductively by
(0) (1) (n+1) (n) x = 0, x = x , x = inf{k > x : Xk = x} , (0) (1) (n+1) (n) x = 0, x = x , x = inf{k > x : Xk = x} .

(14.1) (14.2)
(n)

and

For two states x and y, we say that state x leads to state y, which we write x y, if Px (y < ) > 0. In words, x leads to y if the state y can be reached from x. An alternative, equivalent denition is that there exists some integer n 0 such that the n-step transition probability Qn (x, y) > 0. If both x leads to y and y leads to x, then we say that the x and y communicate, which we write x y.

514

14 Elements of Markov Chain Theory

Theorem 14.1.1. The relation is an equivalence relation on X. Proof. We need to prove that the relation is reexive, symmetric, and transitive. The rst two properties are immediate because, by denition, for all x, y X, x x (reexivity), and x y if and only if y x (symmetry). For any pairwise distinct x, y, z X, {y + z y < } {z < } (if the chain reaches y at some time and later z, it certainly reaches z). The strong Markov property (Theorem 2.1.6) implies that Px (z < ) Px (y + z y < ) = Ex [1{y <} 1{z <} y ] = Ex [1{y <} PXy (z < )] = Px (y < ) Py (z < ) . In words, if the chain can reach y from x and z from y, it can reach z from x by going through y. Hence if x y and y z, then x z (transitivity). For x X, we denote the equivalence class of x with respect to the relation by C(x). Because is an equivalence relation, there exists a collection {xi } of states, which may be nite or innite, such that the classes {C(xi )} form a partition of the state space X. Denition 14.1.2 (Irreducibility). If C(x) = X for some x X (and then for all x X), the Markov chain is called irreducible. 14.1.2 Recurrence and Transience When a state is visited by the Markov chain, it is natural to ask how often the state is visited in the long-run. Dene the occupation time of the state x as

x =

def n=0

1x (Xn ) =

1{x <} . (j)

j=1

If the expected number of visits to x starting from x is nite, that is, if Ex [x ] < , then the state x is called transient. Otherwise, if Ex [x ] = , x is said to be recurrent. When X is countable, the recurrence or transience of a state x can be expressed in terms of the probability Px (x < ) that the chain started in x ever returns to x. Proposition 14.1.3. For any x X the following hold true, (i) If x is recurrent, then Px (x = ) = 1 and Px (x < ) = 1. (ii) If x is transient, then Px (x < ) = 1 and Px (x < ) < 1. (iii) Ex [x ] = 1/[1 Px (x < )], with 1/0 = . Proof. By construction,

Ex [x ] =
k=1

Px (x k) =
k=1

(k) Px (x < ) .

14.1 Chains on Countable State Spaces

515

Applying strong Markov property (Theorem 2.1.6) for n > 1, we obtain


(n) (n1) Px (x < ) = Px (x < , x x
(n1)

< )

= Ex [1{(n1) <} PX
x

(n1) x

(x < )] .

If x

(n1)

< , then X(n1) = x Px -a.s., so that


x

(n) Px (x

(n1) < ) = Px (x < ) Px (x < ) . (n)

By denition Px (x < ) = 1, whence Px (x and

< ) = [Px (x < )]n1

Ex [x ] =

[Px (x < )]n1 .


n=1

This proves part (iii). Now assume x is recurrent. Then by denition Ex [x ] = , and hence (n) Px (x < ) = 1 and Px (x < ) = 1 for all n 1. Thus x = Px -a.s. If x is transient then Ex [x ] < , which implies Px (x < ) < 1. For a recurrent state x, the occupation time of x is innite with probability one under Px ; essentially, once the chain started from x returns to x with probability one, it returns a second time with probability one, and so on. Thus the occupation time of a state has a remarkable property, not shared by all random variables: if the expectation of the occupation time is innite, then the actual number of returns is innite with probability one. The mean of the occupation time of a state obeys the so-called maximum principle. Proposition 14.1.4. For all x and y in X, Ex [y ] = Px (y < ) Ey [y ] , with the convention 0 = 0. Proof. It follows from the denition that y 1{y =} = 0 and y 1{y <} = y y 1{y <} . Thus, applying the strong Markov property, Ex [y ] = Ex [1{y <} y ] = Ex [1{y <} y y ] = Ex [1{y <} EXy [y ]] = Px (y < ) Ey [y ] . (14.3)

Corollary 14.1.5. If Ex [y ] = for some x, then y is recurrent. If X is nite, then there exists at least one recurrent state. Proof. By Proposition 14.1.4, Ey [y ] Ex [y ], so that Ex [y ] = implies that Ey [y ] = , that is, y is recurrent. Next, obviously yX y = and thus for all x X, yX Ex [y ] = . Hence if X is nite, given x X there necessarily exists at least one y X such that Ex [y ] = , which implies that y is recurrent.


Our next result shows that a recurrent state can only lead to another recurrent state.

Proposition 14.1.6. Let x be a recurrent state. Then for $y \in X$, either of the following two statements holds true.
(i) x leads to y, $E_x[\eta_y] = \infty$, y is recurrent and leads to x, and $P_x(\sigma_y < \infty) = P_y(\sigma_x < \infty) = 1$;
(ii) x does not lead to y and $E_x[\eta_y] = 0$.

Proof. Assume that x leads to y. Then there exists an integer k such that $Q^k(x, y) > 0$. Applying the Chapman-Kolmogorov equations, we obtain $Q^{n+k}(x, y) \geq Q^n(x, x)\, Q^k(x, y)$ for all n. Hence
\[
E_x[\eta_y] \geq \sum_{n=1}^{\infty} Q^{n+k}(x, y) \geq \sum_{n=1}^{\infty} Q^n(x, x)\, Q^k(x, y) = \infty ,
\]
because $E_x[\eta_x] = \sum_{n=0}^{\infty} Q^n(x, x) = \infty$.

Thus y is also recurrent by Corollary 14.1.5. Because x is recurrent, the strong Markov property implies that
\[
0 = P_x(\eta_x < \infty) \geq P_x(\sigma_y < \infty,\ \sigma_x \circ \theta_{\sigma_y} = \infty) = P_x(\sigma_y < \infty)\, P_y(\sigma_x = \infty) .
\]
Because x leads to y, $P_x(\sigma_y < \infty) > 0$, whence $P_y(\sigma_x = \infty) = 0$. Thus y leads to x and moreover $P_y(\sigma_x < \infty) = 1$. By symmetry, $P_x(\sigma_y < \infty) = 1$. If x does not lead to y then Proposition 14.1.4 shows that $E_x[\eta_y] = 0$.

For a recurrent state x, the equivalence class C(x) (with respect to the relation of communication defined in Section 14.1.1) may thus be equivalently defined as
\[
C(x) = \{y \in X : E_x[\eta_y] = \infty\} = \{y \in X : P_x(\sigma_y < \infty) = 1\} . \tag{14.4}
\]

If $y \notin C(x)$, then $P_x(\eta_y = 0) = 1$, which implies that $P_x(X_n \in C(x) \text{ for all } n \geq 0) = 1$. In words, the chain started from the recurrent state x forever stays in C(x) and visits each state of C(x) infinitely many times. The behavior of a Markov chain can thus be described as follows. If a chain is not irreducible, there may exist several equivalence classes of communication. Some of them contain only transient states, and some contain only recurrent states. The latter are then called recurrence classes. If a chain starts from a recurrent state, then it remains in its recurrence class forever. If it starts from a transient state, then either it stays in the class of transient states forever, which implies that there exist infinitely many transient states, or it reaches a recurrent state and then remains in its recurrence class forever. In contrast, if the chain is irreducible, then all the states are either transient or recurrent. This is called the solidarity property of an irreducible chain. We now summarize the previous results.


Theorem 14.1.7. Consider an irreducible Markov chain on a countable state space X. Then either every state is transient, and the chain is called transient, or every state is recurrent, and the chain is called recurrent. Moreover, either of the following two statements holds true for all x and y in X.
(i) $P_x(\sigma_y < \infty) = 1$, $E_x[\eta_y] = \infty$, and the chain is recurrent.
(ii) $P_x(\tau_x < \infty) < 1$, $E_x[\eta_y] < \infty$, and the chain is transient.

Remark 14.1.8. Note that in the transient case, we do not necessarily have $P_x(\sigma_y < \infty) < 1$ for all x and y in X. For instance, if Q is a transition matrix on $\mathbb{N}$ such that $Q(n, n+1) = 1$ for all n, then $P_k(\sigma_n < \infty) = 1$ for all $k < n$. Nevertheless all states are obviously transient because $X_n = X_0 + n$.

14.1.3 Invariant Measures and Stationarity

For many purposes, we might want the marginal distribution of $\{X_k\}$ not to depend on k. If this is the case, then by the Markov property it follows that the finite-dimensional distributions of $\{X_k\}$ are invariant under translation in time, and $\{X_k\}$ is thus a stationary process. Such considerations lead us to invariant distributions. A non-negative vector $\{\mu(x)\}_{x \in X}$ with the property
\[
\mu(y) = \sum_{x \in X} \mu(x)\, Q(x, y) ,\qquad y \in X ,
\]

will be called invariant. If the invariant vector is summable, then we assume it is a probability distribution, that is, it sums to one. Such distributions are also called stationary distributions or stationary probability measures. The key result concerning the existence of invariant vectors is the following.

Theorem 14.1.9. Consider an irreducible recurrent Markov chain $\{X_k\}_{k \geq 0}$ on a countable state space X. Then there exists a unique (up to a scaling factor) invariant measure $\mu$. Moreover $0 < \mu(x) < \infty$ for all $x \in X$. This measure is summable if and only if there exists a state x such that
\[
E_x[\tau_x] < \infty . \tag{14.5}
\]

In this case, $E_y[\tau_y] < \infty$ for all $y \in X$ and the unique invariant probability measure $\pi$ is given by
\[
\pi(x) = 1 / E_x[\tau_x] ,\qquad x \in X . \tag{14.6}
\]
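Before turning to the proof, here is a quick simulation-based sanity check of (14.6); the two-state transition matrix is an arbitrary illustrative choice, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two-state chain whose invariant distribution is pi = (0.8, 0.2) (illustrative numbers).
Q = np.array([[0.9, 0.1],
              [0.4, 0.6]])

def mean_return_time(x, n_rep=20000):
    """Monte Carlo estimate of E_x[tau_x], the expected return time to x."""
    total = 0
    for _ in range(n_rep):
        y, t = x, 0
        while True:
            y = rng.choice(2, p=Q[y])
            t += 1
            if y == x:
                break
        total += t
    return total / n_rep

for x in (0, 1):
    print("state", x, " 1/E_x[tau_x] ~", round(1.0 / mean_return_time(x), 3))
# The printed values should be close to pi = (0.8, 0.2), in line with (14.6).
```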

Proof. Let Q be the transition matrix of the chain. Pick an arbitrary state $x \in X$ and define the measure $\lambda_x$ by
\[
\lambda_x(y) = E_x\Bigl[\sum_{k=0}^{\tau_x - 1} \mathbb{1}_y(X_k)\Bigr] = E_x\Bigl[\sum_{k=1}^{\tau_x} \mathbb{1}_y(X_k)\Bigr] . \tag{14.7}
\]


That is, $\lambda_x(y)$ is the expected number of visits to the state y before the first return to x, given that the chain starts at x. Let f be a non-negative function on X. Then
\[
\lambda_x(f) = E_x\Bigl[\sum_{k=0}^{\tau_x - 1} f(X_k)\Bigr] = \sum_{k=0}^{\infty} E_x\bigl[\mathbb{1}_{\{\tau_x > k\}} f(X_k)\bigr] .
\]

Using this identity and the fact that $Qf(X_k) = E_x[f(X_{k+1}) \mid \mathcal{F}_k^X]$ $P_x$-a.s. for all $k \geq 0$, we find that
\[
\lambda_x(Qf) = \sum_{k=0}^{\infty} E_x\bigl[\mathbb{1}_{\{\tau_x > k\}} Qf(X_k)\bigr] = \sum_{k=0}^{\infty} E_x\bigl\{\mathbb{1}_{\{\tau_x > k\}} E_x[f(X_{k+1}) \mid \mathcal{F}_k^X]\bigr\} = \sum_{k=0}^{\infty} E_x\bigl[\mathbb{1}_{\{\tau_x > k\}} f(X_{k+1})\bigr] = E_x\Bigl[\sum_{k=1}^{\tau_x} f(X_k)\Bigr] ,
\]
showing that $\lambda_x(Qf) = \lambda_x(f) - f(x) + E_x[f(X_{\tau_x})] = \lambda_x(f)$. Because f was arbitrary, we see that $\lambda_x Q = \lambda_x$; the measure $\lambda_x$ is invariant. For any other state y, the chain may reach y before returning to x when starting at x, as it is irreducible. This proves that $\lambda_x(y) > 0$. Moreover, again by irreducibility, we can pick an $m > 0$ such that $Q^m(y, x) > 0$. By invariance $\lambda_x(x) = \sum_{z \in X} \lambda_x(z) Q^m(z, x) \geq \lambda_x(y) Q^m(y, x)$, and as $\lambda_x(x) = 1$, we see that $\lambda_x(y) < \infty$.

We now prove that the invariant measure is unique up to a scaling factor. The first step consists in proving that if $\mu$ is an invariant measure such that $\mu(x) = 1$, then $\mu \geq \lambda_x$. It suffices to show that, for any $y \in X$ and any integer n,
\[
\mu(y) \geq \sum_{k=1}^{n} E_x\bigl[\mathbb{1}_y(X_k)\, \mathbb{1}_{\{\tau_x \geq k\}}\bigr] . \tag{14.8}
\]

The proof is by induction. The inequality is immediate for $n = 1$. Assume that (14.8) holds for some $n \geq 1$. Then
\[
\mu(y) = Q(x, y) + \sum_{z \neq x} \mu(z) Q(z, y) \geq Q(x, y) + \sum_{k=1}^{n} E_x\bigl[Q(X_k, y)\, \mathbb{1}_{\{x\}^c}(X_k)\, \mathbb{1}_{\{\tau_x \geq k\}}\bigr] = Q(x, y) + \sum_{k=1}^{n} E_x\bigl[\mathbb{1}_y(X_{k+1})\, \mathbb{1}_{\{\tau_x \geq k+1\}}\bigr] = \sum_{k=1}^{n+1} E_x\bigl[\mathbb{1}_y(X_k)\, \mathbb{1}_{\{\tau_x \geq k\}}\bigr] ,
\]

showing the induction. We will now show that $\mu = \lambda_x$. The proof is by contradiction. Assume that $\mu(z) > \lambda_x(z)$ for some $z \in X$. Then
\[
1 = \mu(x) = \mu Q(x) = \sum_{z \in X} \mu(z) Q(z, x) > \sum_{z \in X} \lambda_x(z) Q(z, x) = \lambda_x Q(x) = \lambda_x(x) = 1 ,
\]

which cannot be true. The measure $\lambda_x$ is summable if and only if
\[
\infty > \sum_{y \in X} \lambda_x(y) = \sum_{y \in X} E_x\Bigl[\sum_{k=0}^{\tau_x - 1} \mathbb{1}_{\{X_k = y\}}\Bigr] = E_x[\tau_x] .
\]

Thus the unique invariant measure is summable if and only if a state x satisfying this relation exists. On the other hand, if such a state x exists then, by uniqueness of the invariant measure, $E_y[\tau_y] < \infty$ must hold for all states y. In this case, the invariant probability measure $\pi$, say, satisfies $\pi(x) = \lambda_x(x)/\lambda_x(X) = 1/E_x[\tau_x]$. Because the reference state x was in fact arbitrary, we find that $\pi(y) = 1/E_y[\tau_y]$ for all states y.

It is natural to ask what can be inferred from the knowledge that a chain possesses an invariant probability measure. The next proposition gives a partial answer.

Proposition 14.1.10. Let Q be a transition matrix and $\pi$ an invariant probability measure. Then every state x such that $\pi(x) > 0$ is recurrent. If Q is irreducible, then it is recurrent.

Proof. Let $y \in X$. If $\pi(y) > 0$ then $\sum_{n=0}^{\infty} \pi Q^n(y) = \sum_{n=0}^{\infty} \pi(y) = \infty$. On the other hand, by Proposition 14.1.4,
\[
\sum_{n=0}^{\infty} \pi Q^n(y) = \sum_{x \in X} \pi(x) \sum_{n=0}^{\infty} Q^n(x, y) = \sum_{x \in X} \pi(x)\, E_x[\eta_y] \leq E_y[\eta_y] \sum_{x \in X} \pi(x) = E_y[\eta_y] . \tag{14.9}
\]

Thus $\pi(y) > 0$ implies $E_y[\eta_y] = \infty$, that is, y is recurrent.

Let $\{X_k\}$ be an irreducible Markov chain. If there exists an invariant probability measure, the chain is called positive recurrent; otherwise it is called null. Note that null chains can be either null recurrent or transient. Transient chains are always null, though they may admit an invariant measure.

14.1.4 Ergodicity

A key result for positive recurrent irreducible chains is that the transition laws converge, in a suitable sense, to the invariant vector $\pi$. The classical result is the following.


Proposition 14.1.11. Consider an irreducible and positive recurrent Markov chain on a countable state space. Then for any states x and y,
\[
n^{-1} \sum_{i=1}^{n} Q^i(x, y) \to \pi(y) \quad \text{as } n \to \infty . \tag{14.10}
\]

The use of the Cesàro limit can be avoided if the chain is aperiodic. The simplest definition of aperiodicity is that a state x is aperiodic if $Q^k(x, x) > 0$ for all k sufficiently large or, equivalently, that the period of the state x is one. The period of x is defined as the greatest common divisor of the set $I(x) = \{n > 0 : Q^n(x, x) > 0\}$. For irreducible chains, the following result holds true.

Proposition 14.1.12. If the chain is irreducible, then all states have the same period. If the transition matrix Q is irreducible and aperiodic, then for all x and y in X, there exists $n(x, y) \in \mathbb{N}$ such that $Q^k(x, y) > 0$ for all $k \geq n(x, y)$.

Thus, an irreducible chain can be said to be aperiodic if the common period of all states is one. The traditional pointwise convergence (14.10) of transition probabilities has been replaced in more recent research by convergence in total variation (see Definition 4.3.1). The convergence result may then be formulated as follows.

Theorem 14.1.13. Consider an irreducible and aperiodic positive recurrent Markov chain on a countable state space X with transition matrix Q and invariant probability distribution $\pi$. Then for all initial distributions $\nu$ and $\nu'$ on X,
\[
\| \nu Q^n - \nu' Q^n \|_{\mathrm{TV}} \to 0 \quad \text{as } n \to \infty . \tag{14.11}
\]
In particular, for any $x \in X$ we may set $\nu = \delta_x$ and $\nu' = \pi$ to obtain
\[
\| Q^n(x, \cdot) - \pi \|_{\mathrm{TV}} \to 0 \quad \text{as } n \to \infty . \tag{14.12}
\]

The proof of this result, and indeed the focus on convergence in total variation, follows from the use of the coupling technique. We postpone the presentation of this technique to Section 14.2.4 because essentially the same ideas can be applied to Markov chains on general state spaces.
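Theorem 14.1.13 is easy to visualize numerically. The sketch below is an illustration under assumed values, not part of the original text: it iterates a small aperiodic, irreducible transition matrix and prints the total variation distance between $Q^n(x, \cdot)$ and $\pi$, which decreases to zero.

```python
import numpy as np

# Small aperiodic, irreducible transition matrix (illustrative values).
Q = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])

# Stationary distribution pi: left eigenvector of Q for eigenvalue 1.
w, v = np.linalg.eig(Q.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()

row = np.zeros(3)
row[0] = 1.0                              # start at x = 0, i.e. nu = delta_x
for n in range(1, 11):
    row = row @ Q                         # row = Q^n(x, .)
    tv = np.abs(row - pi).sum()           # ||.||_TV with the convention sup over |f| <= 1
    print(n, round(tv, 6))
```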

14.2 Chains on General State Spaces


In this section, we extend the concepts and results pertaining to countable state spaces to general ones. In the following, X is an arbitrary set, and we just require that it is equipped with a countably generated -eld X . By {Xk }k0 we denote an X-valued Markov chain with transition kernel Q. It


is defined on a probability space $(\Omega, \mathcal{F}, P)$, and $\mathcal{F}^X = \{\mathcal{F}_k^X\}_{k \geq 0}$ denotes the natural filtration of $\{X_k\}$. For any set $A \in \mathcal{X}$, we define the first hitting time $\sigma_A$ and return time $\tau_A$ respectively by
\[
\sigma_A = \inf\{n \geq 0 : X_n \in A\} , \tag{14.13}
\]
\[
\tau_A = \inf\{n \geq 1 : X_n \in A\} , \tag{14.14}
\]
where, by convention, $\inf \emptyset = +\infty$. The successive hitting times $\sigma_A^{(n)}$ and return times $\tau_A^{(n)}$, $n \geq 0$, are defined inductively by $\sigma_A^{(0)} = \tau_A^{(0)} = 0$, $\sigma_A^{(1)} = \sigma_A$, $\tau_A^{(1)} = \tau_A$, and
\[
\sigma_A^{(n+1)} = \inf\{k > \sigma_A^{(n)} : X_k \in A\} ,\qquad \tau_A^{(n+1)} = \inf\{k > \tau_A^{(n)} : X_k \in A\} .
\]
We again define the occupation time $\eta_A$ as the number of visits by $\{X_k\}$ to A,
\[
\eta_A \overset{\mathrm{def}}{=} \sum_{k=0}^{\infty} \mathbb{1}_A(X_k) . \tag{14.15}
\]

14.2.1 Irreducibility The rst step to develop a theory on general state spaces is to dene a suitable concept of irreducibility. The denition of irreducibility adopted for countable state spaces does not extend to general ones, as the probability of reaching single point x in the state space is typically zero. Denition 14.2.1 (Phi-irreducibility). The transition kernel Q, or the Markov chain {Xk }k0 with transition kernel Q, is said to be phi-irreducible if there exists a measure on (X, X ) such that for any A X with (A) > 0, Px (A < ) > 0 for all x X. Such a measure is called an irreducibility measure for Q. Phi-irreducibility is a weaker property than irreducibility of a transition kernel on a countable state space. If a transition kernel on a countable state space is irreducible, then it is phi-irreducible, and any measure is an irreducibility measure. The converse is not true. For instance, the transition kernel Q= 01 01

on {0, 1} is phi-irreducible (1 is an irreducibility measure for Q) but not irreducible. In general, there are innitely many irreducibility measures, and two irreducibility measures are not necessarily equivalent. For instance, if is an irreducibility measure and is absolutely continuous with respect to , then


is also an irreducibility measure. Nevertheless, as shown in the next result, there exist maximal irreducibility measures , which are such that any irreducibility measure is absolutely continuous with respect to . Theorem 14.2.2. Let Q be a phi-irreducible transition kernel on (X, X ). Then there exists an irreducibility measure such that all irreducibility measures are absolutely continuous with respect to and for all A X , (A) > 0 Px (A < ) > 0 for all x X . (14.16)

Proof. Let $\phi$ be an irreducibility measure and $\alpha \in (0, 1)$. Let $\psi$ be the measure defined by $\psi = \phi K_\alpha$, where $K_\alpha$ is the resolvent kernel defined by
\[
K_\alpha(x, A) \overset{\mathrm{def}}{=} (1 - \alpha) \sum_{k \geq 0} \alpha^k Q^k(x, A) ,\qquad x \in X,\ A \in \mathcal{X} . \tag{14.17}
\]

We will rst show that is an irreducibility measure. Let A X be such that (A) > 0 and dene A = {x X : Px (A < ) > 0} = {x X : K (x, A) > 0} . (14.18)

By denition, (A) > 0 implies that (A) > 0. Dene Am = {x X : = m , and because (A) > Px (A < ) 1/m}. By construction, A m>0 A m ) > 0. Because is an irreducibility measure, 0, there exists m such that (A Px (Am < ) > 0 for all x X. Hence by the strong Markov property, for all x X,
Px (A < ) Px (Am + A Am < , Am < ) 1 Px (Am < ) > 0 , = Ex [1{Am <} PXA (A < )] m m

showing that is an irreducibility measure. Now for m 0 the Chapman-Kolmogorov equations imply

(dx)
X

Qm (x, A) = (1 )
X n=m

Qn (x, A) (dx) (A) .

Therefore, if (A) = 0 then K (A) = 0, which in turn implies (A) = 0. Summarizing the results above, for any A X , (A) > 0 ({x X : Px (A < ) > 0}) > 0 . (14.19)

This proves (14.16) To conclude we must show that all irreducibility measures are absolutely continuous with respect to . Let be an irreducibility measure and let C X be such that (C) > 0. Then ({x X : Px (C < ) > 0}) = (X) > 0, which, by (14.19), implies that (C) > 0 . This exactly says that is absolutely continuous with respect to .


A set $A \in \mathcal{X}$ is said to be accessible for the kernel Q (or Q-accessible, or simply accessible if there is no risk of confusion) if $P_x(\sigma_A < \infty) > 0$ for all $x \in X$. The family of accessible sets is denoted $\mathcal{X}^+$. If $\psi$ is a maximal irreducibility measure, the set A is accessible if and only if $\psi(A) > 0$.

Example 14.2.3 (Autoregressive Model). The first-order autoregressive model on $\mathbb{R}$ is defined iteratively by $X_n = \phi X_{n-1} + U_n$, where $\phi$ is a real number and $\{U_n\}$ is an i.i.d. sequence. If $\mu$ denotes the probability distribution of the noise sequence $\{U_n\}$, the transition kernel of this chain is given by $Q(x, A) = \mu(A - \phi x)$. The autoregressive model is phi-irreducible provided that the noise distribution has an everywhere positive density with respect to Lebesgue measure $\lambda^{\mathrm{Leb}}$. If we take $\lambda^{\mathrm{Leb}}$ as irreducibility measure, it is easy to see that whenever $\lambda^{\mathrm{Leb}}(A) > 0$, we have $\mu(A - \phi x) > 0$ for any x, and so $Q(x, A) > 0$ in just one step.

Example 14.2.4. The Metropolis-Hastings algorithm, introduced in Chapter 6, provides another typical example of a general state-space Markov chain. For simplicity, we assume here that $X = \mathbb{R}^d$, which we equip with the Borel $\sigma$-field $\mathcal{X} = \mathcal{B}(\mathbb{R}^d)$. Assume that we are given a probability density function $\pi$ on X with respect to Lebesgue measure $\lambda^{\mathrm{Leb}}$. Let r be a transition density kernel. Starting from $X_n = x$, a candidate transition $x'$ is generated from $r(x, \cdot)$ and accepted with probability
\[
\alpha(x, x') = \frac{\pi(x')\, r(x', x)}{\pi(x)\, r(x, x')} \wedge 1 . \tag{14.20}
\]

The transition kernel of the Metropolis-Hastings chain is given by
\[
Q(x, A) = \int_A \alpha(x, x')\, r(x, x')\, \lambda^{\mathrm{Leb}}(dx') + \mathbb{1}_x(A) \int [1 - \alpha(x, x')]\, r(x, x')\, \lambda^{\mathrm{Leb}}(dx') . \tag{14.21}
\]
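As an illustration, here is a minimal random-walk Metropolis sketch in Python; the target density, the proposal scale, and all numerical values are assumptions made only for this example. With a symmetric proposal the ratio $r(x', x)/r(x, x')$ in (14.20) cancels, so only the target densities enter the acceptance probability.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_target(x):
    # Unnormalized log-density: a two-component normal mixture, chosen only for illustration.
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def metropolis_hastings(x0, n_iter, scale=1.0):
    """Random-walk Metropolis: proposal r(x, .) = N(x, scale^2), which is symmetric."""
    xs = np.empty(n_iter)
    x = x0
    for n in range(n_iter):
        prop = x + scale * rng.standard_normal()
        # Accept with probability alpha(x, prop) = min(1, target(prop)/target(x)).
        if np.log(rng.random()) < log_target(prop) - log_target(x):
            x = prop
        xs[n] = x
    return xs

chain = metropolis_hastings(0.0, 5000)
print("sample mean/var:", chain.mean().round(3), chain.var().round(3))
```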

There are various sucient conditions for the Metropolis-Hastings algorithm to be phi-irreducible (Roberts and Tweedie, 1996; Mengersen and Tweedie, 1996). For the Metropolis-Hastings chain, it is simple to check that the chain is phi-irreducible if for Leb -almost all x X, the condition (x ) > 0 implies that r(x, x ) > 0 for any x X. 14.2.2 Recurrence and Transience In view of the discussion above, it is not sensible to dene recurrence and transience in terms of the expectation of the occupation measure of a state, but for phi-irreducible chains it makes sense to consider the occupation measure of accessible sets.


Denition 14.2.5 (Uniform Transience and Recurrence). A set A X is called uniformly transient if supxA Ex [A ] < . A set A X is called recurrent if Ex [A ] = + for all x A. Obviously, if supxX Ex [A ] < , then A is uniformly transient. In fact the reverse implication holds true too, because if the chain is started outside A it cannot hit A more times, on average, than if it is started at the most favorable location in A. Thus an alternative denition of a uniformly transient set is supxX Ex [A ] < . The main result on phi-irreducible transition kernels is the following recurrence/transience dichotomy, which parallels Theorem 14.1.7 for countable state-space Markov chains. Theorem 14.2.6. Let Q be a phi-irreducible transition kernel (or Markov chain). Then either of the following two statements holds true. (i) Every accessible set is recurrent, in which case we call Q recurrent. (ii) There is a countable cover of X with uniformly transient sets, in which case we call Q transient. In the next section, we will prove Theorem 14.2.6 in the particular case where the chain possesses an accessible atom (see Denition 14.2.7); the proof is then very similar to that for countable state space. In the general case, the proof is more involved. It is necessary to introduce small sets and the socalled splitting construction, which relates the chain to one that does possess an accessible atom. 14.2.2.1 Transience and Recurrence for Chains Possessing an Accessible Atom Denition 14.2.7 (Atom). A set X is called an atom if there exists a probability measure on (X, X ) such that Q(x, A) = (A) for all x and A X. Atoms behave the same way as do individual states in the countable state space case. Although any singleton {x} is an atom, it is not necessarily accessible, so that Markov chain theory on general state spaces diers from the theory of countable state space chains. If is an atom for Q, then for any m 1 it is an atom for Qm . Therefore we denote by Qm (, ) the common value of Qm (x, ) for all x . This implies that if the chain starts from within the atom, the distribution of the whole chain does not depend on the precise starting point. Therefore we will also use the notation P instead of Px for any x . Example 14.2.8 (Random Walk on the Half-Line). The random walk on the half-line (RWHL) is dened by an initial condition X0 0 and the recursion


\[
X_{k+1} = (X_k + W_{k+1})^+ ,\qquad k \geq 0 , \tag{14.22}
\]
where $\{W_k\}_{k \geq 1}$ is an i.i.d. sequence of random variables, independent of $X_0$, with distribution $\Gamma$ on $\mathbb{R}$. This process is a Markov chain with transition kernel Q defined by
\[
Q(x, A) = \Gamma(A - x) + \Gamma((-\infty, -x])\, \mathbb{1}_A(0) ,\qquad x \in \mathbb{R}^+ ,\ A \in \mathcal{B}(\mathbb{R}^+) ,
\]

where A x = {y x : y A}. The set {0} is an atom, and it is accessible if and only if (( , 0]) > 0. We now prove Theorem 14.2.6 when there exists an accessible atom. Proposition 14.2.9. Let {Xk }k0 be a Markov chain that possesses an accessible atom , with associated probability measure . Then the chain is phiirreducible, is an irreducibility measure, and a set A X is accessible if and only if P (A < ) > 0. Moreover, is recurrent if and only if P ( < ) = 1 and (uniformly) transient otherwise, and the chain is recurrent if is recurrent and transient otherwise. Proof. For all A X and x X, the strong Markov property yields Px (A < ) Px ( + A < , < ) = Ex [PX (A < )1{ <} ] = P (A < ) Px ( < ) (A) Px ( < ) . Because is accessible, Px ( < ) > 0 for all x X. Thus for any A X satisfying (A) > 0, it holds that Px (A < ) > 0 for all x X, showing that is an irreducibility measure. The above display also shows that A is accessible if and only if P (A < ). (n) Now let be the successive hitting times of (see (14.13)). The strong Markov property implies that for any n > 1,
(n) (n1) P ( < ) = P ( < ) P ( < ) .

Hence, as for discrete state spaces, P ( < ) = [P ( < )]n1 and E [ ] = 1/[1 P ( < )]. This proves that is recurrent if and only if P ( < ) = 1. Assume that is recurrent. Because the atom is accessible, for any x X, there exists r such that Qr (x, ) > 0. If A X + there exists s such that Qs (, A) > 0. By the Chapman-Kolmogorov equations, Qr+s+n (x, A) Qr (x, )
n1 n1

(n)

Qn (, ) Qs (, A) = .


Hence Ex [A ] = for all x X and A is recurrent. Because A was an arbitrary accessible set, the chain is recurrent. Assume now that is transient, in which case E ( ) < . Then, following the same line of reasoning as in the discrete state space case (proof of Proposition 14.1.4), we obtain that for all x X, Ex [ ] = Px ( < ) E [ ] E [ ] .
j

(14.23)

Dene Bj = {x : n=1 Qn (x, ) 1/j}. Then Bj = X because is acj=1 cessible. Applying the denition of the sets Bj and the Chapman-Kolmogorov equations, we nd that
j

Qk (x, Bj )
k=1 j k=1

Qk (x, Bj ) inf j
yBj =1

Q (y, )

j
k=1 =1 Bj

Qk (x, dy) Q (y, ) j 2


k=1

Qk (x, ) = j 2 Ex [ ] < .

The sets Bj are thus uniformly transient. The proof is complete. 14.2.2.2 Small Sets and the Splitting Construction We now return to the general phi-irreducible case. In order to prove Theorem 14.2.6, we need to introduce the splitting technique. To do so, we need to dene a class of sets (containing accessible sets) that behave the same way in many respects as do atoms. We shall see this in many of the results below, which exactly mimic the atomic case results they generalize. These sets are called small sets. Denition 14.2.10 (Small Set). Let Q and be a transition kernel and a probability measure, respectively, on (X, X ), let m be a positive integer and (0, 1]. A set C X is called a (m, , )-small set for Q, or simply a small set, if (C) > 0 and for all x C and A X , Qm (x, A) (A) . If = 1 then C is an atom for the kernel Qm .

Trivially, any individual point is a small set, but small sets that are not accessible are of limited interest. If the state space is countable and Q is irreducible, then every nite set is small. The minorization measure associated to an accessible small set provides an irreducibility measure. Proposition 14.2.11. Let C be an accessible (m, , )-small set for the transition kernel Q on (X, X ). Then is an irreducibility measure.


Proof. Let A X be such that (A) > 0. The strong Markov property yields Px (A < ) Px (C < , A C < ) = Ex [1{C <} PXC (A < )] . Because C is a small set, for all y C it holds that Py (A < ) Py (Xm A) = Qm (y, A) (A) . Because C is accessible and (A) > 0, for all x X it holds that Px (A < ) (A) Px (C < ) > 0 . Thus A is accessible, whence is an irreducibility measure. An important result due to Jain and Jamison (1967) states that if the transition kernel is phi-irreducible, then small sets do exist. For a proof see Nummelin (1984, p. 16) or Meyn and Tweedie (1993, Theorem 5.2.2). Proposition 14.2.12. If the transition kernel Q on (X, X ) is phi-irreducible, then every accessible set contains an accessible small set. Given the existence of just one small set from Proposition 14.2.12, we may show that it is possible to cover X with a countable number of small sets in the phi-irreducible case. Proposition 14.2.13. Let Q be a phi-irreducible transition kernel on (X, X ). (i) If C X is an (m, , )-small set and for any x D we have Qn (x, C) , then D is (m + n, , )-small set. (ii) If Q is phi-irreducible then there exists a countable collection of small sets Ci such that X = i Ci . Proof. Using the Chapman-Kolmogorov equations, we nd that for any x D, Qn+m (x, A)
C

Qn (x, dy) Qm (y, A) Qn (x, C)(A) (A) ,

showing part (i). Because Q is phi-irreducible, by Proposition 14.2.12 there exists an accessible (m, , )-small set C. Moreover, by the denition of phiirreducibility, the sets C(n, m) = {x : Qn (x, C) 1/m} cover X and, by part (i), each C(n, m) is small. Proposition 14.2.14. If Q is phi-irreducible and transient, then every accessible small set is uniformly transient. Proof. Let C be an accessible (m, , )-small set. If Q is transient, there exists at least one A X + that is uniformly transient. For (0, 1), by the Chapman-Kolmogorov equations,



Ex [A ] =
k=0

Q (x, A) (1 )
p=0

p k=0

Qk+m+p (x, A) Qm (x , dx ) Qp (x , A)

(1 )
p=0

p
k=0 C

Qk (x, dx )

k=0

Qk (x, C) (1 )
p=0

p Qp (A) = Ex [C ] K (A) ,

where K is the resolvent kernel (14.17). Because C is an accessible small set, Proposition 14.2.11 shows that is an irreducibility measure. By Theorem 14.2.2, K is a maximal irreducibility measure, so that K (A) > 0. Thus supxX Ex [C ] < and we conclude that C is uniformly transient (see the remark following Denition 14.2.5). Example 14.2.15 (Autoregressive Process, Continued). Suppose that the noise distribution in Example 14.2.3 has an everywhere positive continuous density with respect to Lebesgue measure Leb . If C = [M, M ] and = inf |x|(1+)M (u), then for A C, Q(x, A) =
A

(x x) dx Leb (A) .

Hence the compact set C is small. Obviously R is covered by a countable collection of small sets and every accessible set (here sets with non-zero Lebesgue measure) contains a small set. Example 14.2.16 (Metropolis-Hastings Algorithm, Continued). Similar results hold for the Metropolis-Hastings algorithm of Example 14.2.4 if (x) and r(x, x ) are positive and continuous for all (x, x ) X X. Suppose that C is compact with Leb (C) > 0. By positivity and continuity, we then have d = supxC (x) < and = inf (x,x )CC q(x, x ) > 0. For any A C, dene Rx (A) =
def

x A:

(x )q(x , x) <1 (x)q(x, x )

the region of possible rejection. Then for any x C, Q(x, A)


A

q(x, x )(x, x ) dx

q(x , x) (x ) dx + q(x, x ) dx Rx (A) (x) A\Rx (A) (x ) dx + (x ) dx d Rx (A) d A\Rx (A) = (x ) dx . d A


Thus C is small and, again, X can be covered by a countable collection of small sets.

We now show that it is possible to define a Markov chain with an atom, the so-called split chain, whose properties are directly related to those of the original chain. This technique was introduced by Nummelin (1978) (Athreya and Ney, 1978, introduced, independently, a virtually identical concept) and allows extending results valid for Markov chains possessing an accessible atom to irreducible Markov chains that only possess small sets. The basic idea is as follows. Suppose the chain admits a $(1, \epsilon, \nu)$-small set C. Then as long as the chain does not enter C, the transition kernel Q is used to generate the trajectory. However, as soon as the chain hits C, say $X_n \in C$, a zero-one random variable $d_n$ is drawn, independent of everything else. The probability that $d_n = 1$ is $\epsilon$, and hence $d_n = 0$ with probability $1 - \epsilon$. Then if $d_n = 1$, the next value $X_{n+1}$ is drawn from $\nu$; otherwise $X_{n+1}$ is drawn from the kernel
\[
R(x, A) = [1 - \epsilon \mathbb{1}_C(x)]^{-1} [Q(x, A) - \epsilon \mathbb{1}_C(x) \nu(A)] ,
\]
with $x = X_n$. It is immediate that $\epsilon \nu(A) + (1 - \epsilon) R(x, A) = Q(x, A)$ for all $x \in C$, so $X_{n+1}$ is indeed drawn from the correct (conditional) distribution. Note also that $R(x, \cdot) = Q(x, \cdot)$ for $x \notin C$. So, what is gained by this approach? What is gained is that whenever $X_n \in C$ and $d_n = 1$, the next value of the chain will be independent of $X_n$ (because it is drawn from $\nu$). This is often called a regeneration time, as the joint chain $\{(X_k, d_k)\}$ in a sense restarts and forgets its history. In technical terms, the state $C \times \{1\}$ in the extended state space is an atom, and it will be accessible provided C is. We now make this formal.

Thus we define the so-called extended state space as $\check{\mathrm{X}} = X \times \{0, 1\}$ and let $\check{\mathcal{X}}$ be the associated product $\sigma$-field. We associate to every measure $\mu$ on $(X, \mathcal{X})$ the split measure $\mu^*$ on $(\check{\mathrm{X}}, \check{\mathcal{X}})$ as the unique measure satisfying, for $A \in \mathcal{X}$,
\[
\mu^*(A \times \{0\}) = (1 - \epsilon)\, \mu(A \cap C) + \mu(A \cap C^c) ,\qquad \mu^*(A \times \{1\}) = \epsilon\, \mu(A \cap C) .
\]
If Q is a transition kernel on $(X, \mathcal{X})$, we define the kernel $Q^*$ on $X \times \check{\mathcal{X}}$ by $Q^*(x, \check A) = [Q(x, \cdot)]^*(\check A)$ for $x \in X$ and $\check A \in \check{\mathcal{X}}$. Assume now that Q is a phi-irreducible transition kernel and let C be a $(1, \epsilon, \nu)$-small set. We define the split transition kernel $\check Q$ on $\check{\mathrm{X}} \times \check{\mathcal{X}}$ as follows. For any $x \in X$ and $\check A \in \check{\mathcal{X}}$,
\[
\check Q((x, 0), \check A) = R^*(x, \check A) , \tag{14.24}
\]
\[
\check Q((x, 1), \check A) = \nu^*(\check A) . \tag{14.25}
\]

Examining the above technicalities, we nd that transitions into C c {1} have zero probability from everywhere, so that dn = 1 can only occur if Xn C. Because dn = 1 indicates a regeneration time, from within C, this is


logical. Likewise we find that given a transition to some $y \in C$, the conditional probability that $d_n = 1$ is $\epsilon$, wherever the transition took place from. Thus the above split transition kernel corresponds to the following simulation scheme for $\{(X_k, d_k)\}$. Assume $(X_k, d_k)$ are given. If $X_k \notin C$, then draw $X_{k+1}$ from $Q(X_k, \cdot)$. If $X_k \in C$ and $d_k = 1$, then draw $X_{k+1}$ from $\nu$, otherwise from $R(X_k, \cdot)$. If the realized $X_{k+1}$ is not in C, then set $d_{k+1} = 0$; if $X_{k+1}$ is in C, then set $d_{k+1} = 1$ with probability $\epsilon$, and otherwise set $d_{k+1} = 0$. Split measures operate on the split kernel in the following way. For any measure $\mu$ on $(X, \mathcal{X})$,
\[
\mu^* \check Q = (\mu Q)^* . \tag{14.26}
\]
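The simulation scheme above is straightforward to implement whenever the minorization $Q(x, \cdot) \geq \epsilon \nu(\cdot)$ on C is available explicitly. The sketch below is purely illustrative and not part of the original text: the three-state kernel, the choice of C, and the minorization obtained from column-wise minima are assumptions made for the demo. It runs the split chain $\{(X_k, d_k)\}$; the first component is then a Markov chain with kernel Q, and the times at which $d_k = 1$ are regeneration times.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative three-state kernel; C and the minorization below are demo assumptions.
Q = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
C = [0, 1]
nu = Q[C].min(axis=0)                # column-wise minimum over the rows indexed by C
eps = nu.sum()                       # minorization constant: Q(x, .) >= eps * nu(.) for x in C
nu = nu / eps
R = (Q[C] - eps * nu) / (1.0 - eps)  # residual kernel used from C when the bell does not ring

def split_step(x, d):
    """One transition of the split chain (X_k, d_k), following the scheme described above."""
    if x not in C:
        x_new = rng.choice(3, p=Q[x])            # outside C: move with Q
    elif d == 1:
        x_new = rng.choice(3, p=nu)              # bell rang: regenerate from nu
    else:
        x_new = rng.choice(3, p=R[C.index(x)])   # in C, no bell: move with the residual kernel
    d_new = 1 if (x_new in C and rng.random() < eps) else 0
    return x_new, d_new

x, d = 2, 0
regenerations = []
for k in range(200):
    x, d = split_step(x, d)
    if d == 1:
        regenerations.append(k + 1)
print("first regeneration times:", regenerations[:10])
```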

For any probability measure on X , we denote by P and E , respectively, the probability distribution and the expectation on the canonical space (XN , X N ) such that the coordinate process, denoted {(Xk , dk )}k0 , is a Markov chain with initial probability measure and transition kernel Q. We also denote by X {Fk }k0 the natural ltration of this chain and, as usual, by {Fk }k0 the natural ltration of {Xk }k0 . Proposition 14.2.17. Let Q be a phi-irreducible transition kernel on (X, X ), let C be an accessible (1, , )-small set for Q and let be a probability measure on (X, X ). Then for any bounded X -measurable function f and any k 1,
X E [f (Xk ) | Fk1 ] = Qf (Xk1 )

P -a.s.

(14.27)

Before giving the proof, we discuss the implications of this result. It implies that under P , {Xk }k0 is a Markov chain (with respect to its natural ltration) with transition kernel Q and initial distribution . By abuse of notation, we can identify {Xk } with the coordinate process associated to the canonical space XN . Denote by P the probability measure on (XN , X N ) such that {Xk }k0 is a Markov chain with transition kernel Q and initial distribution (see Section 2.1.2.1) and denote by E the associated expectation operator. Then Proposition 14.2.17 yields the following identity. For any bounded X F -measurable random variable Y , E [Y ] = E [Y ] . Proof (of Proposition 14.2.17). We have, -a.s., E [f (Xk ) | Fk1 ] = 1{dk1 =1} (f ) + 1{dk1 =0} Rf (Xk1 ) .
X Because P (dk1 = 1 | Fk1 ) =

(14.28)

1C (Xk1 ) P -a.s., it holds that

X X E [f (Xk ) | Fk1 ] = E {E[f (Xk ) | Fk1 ] | Fk1 }

1C (Xk1 )(f ) + [1 1C (Xk1 )]Rf (Xk1 )

= Qf (Xk1 ) .


Corollary 14.2.18. Under the assumptions of Proposition 14.2.17, X {1} is an accessible atom and is an irreducibility measure for the split kernel Q. More generally, if B X is accessible for Q, then B {0, 1} is accessible for the split kernel. Proof. Because = X {1} is an atom for the split kernel Q, Proposi tion 14.2.9 shows that is an irreducibility measure if is accessible. Ap plying (14.28) we obtain for x X, P(x,1) ( < ) = P(x,1) (dn = 1 for some n 1) P(x,1) (d1 = 1) = (C) > 0 , P(x,0) ( < ) = P(x,0) ((Xn , dn ) C {1} for some n 1) P(x,0) (C{0,1} < , dC{0,1} = 1) = Px (C < ) > 0 . Thus is accessible and is an irreducibility measure for Q. This implies, by is a maximal irreducibility meaTheorem 14.2.2, that for all (0, 1), K sure for the split kernel Q; here K is the resolvent kernel (14.17) associated By straightforward applications of the denitions, it is easy to check to Q. that K = (K ) . Moreover, is an irreducibility measure for Q, and K is a maximal irreducibility measure for Q (still by Proposition 14.2.11 and Theorem 14.2.2). If B is accessible, then K (B) > 0 and K (B {0, 1}) = (K ) (B {0, 1}) = K (B) > 0. Thus B {0, 1} is accessible for Q. 14.2.2.3 Transience/Recurrence Dichotomy for General Phi-irreducible Chains Using the splitting construction, we are now able to prove Theorem 14.2.6 for chains not possessing accessible atoms. We rst consider the simple case in which the chain possesses a 1-small set. Proposition 14.2.19. Let Q be a phi-irreducible transition kernel that admits an accessible (1, , )-small set C. Then Q is either recurrent or transient. It is recurrent if and only if the small set C is recurrent. Proof. Because the split chain possesses an accessible atom, by Proposition 14.2.9 the split chain is phi-irreducible and either recurrent or transient. Applying (14.28) we can write Ex [B{0,1} ] = Ex [B ] . (14.29)

Assume rst that the split chain is recurrent. Let B be an accessible set for Q. By Proposition 14.2.17, B {0, 1} is accessible for the split chain. Hence


Ex [B{0,1} ] = for all x B, so that, by (14.29), Ex [B ] = for all x B. Conversely, if the split chain is transient, then by Proposition 14.2.9 the j atom is transient. For j 1, dene Bj = {x : l=1 Ql ((x, 0), ) 1/j}. Because is accessible, j=1 Bj = X. By the same argument as in the proof of Proposition 14.2.9, the sets Bj {0, 1} are uniformly transient for the split chain. Hence, by (14.29), the sets Bj are uniformly transient for Q. It remains to prove that if the small set C is recurrent, then the chain is recurrent. We have just proved that Q is recurrent if and only if Q is recurrent and, by Proposition 14.2.9, this is true if and only if the atom is recurrent. Thus we only need to prove that if C is recurrent then is recurrent. If C is recurrent, then (14.29) yields for all x C, Ex [ ] Ex [C{0,1} ] = Ex [C ] = . Using the denition of x , this implies that there exists x X such that Ex [ ] = . This observation and (14.23) imply that E [ ] = , that is, the atom is recurrent. Using the resolvent kernel, the previous results can be extended to the general case where an accessible small set exists, but not necessarily a 1-small one. Proposition 14.2.20. Let Q be transition kernel. (i) If Q is phi-irreducible and admits an accessible (m, , )-small set C, then for any (0, 1), C is an accessible (1, , )-small set for the resolvent kernel K = (1 ) k=0 k Qk with = (1 ) m . (ii) A set is recurrent (resp. uniformly transient) for Q if and only if it is recurrent (resp. uniformly transient) for K for some (hence for all) (0, 1). (iii) Q is recurrent (resp. transient) if and only if K is recurrent (resp. transient) for some (hence for all) (0, 1). Proof. For any > 0, x C, and A X , K (x, A) (1 ) m Qm (x, A) (1 ) m (A) = (A) . Thus C is a (1, , )-small set for K , showing part (i). The remaining claims follow from the identity
n K = n1

Qn .
n0


14.2.2.4 Harris Recurrence As for countable state spaces, it is sometimes useful to consider stronger recurrence properties, expressed in terms of return probabilities rather than mean occupation times. Denition 14.2.21 (Harris Recurrence). A set A X is said to be Harris recurrent if Px (A < ) = 1 for any x X. A phi-irreducible Markov chain is said to be Harris (recurrent) if any accessible set is Harris recurrent. It is intuitively obvious that, as for countable state spaces, Harris recurrence implies recurrence. Proposition 14.2.22. A Harris recurrent set is recurrent. Proof. Let A be a Harris recurrent set. Because for j 1, A = A A (j) on the set {A < }, the strong Markov property implies that for any x A, Px (A
(j+1) (j+1)
(j)

< ) = Ex PX

(j) A

(A < )1{(j) <} = Px (A < ) .


(j)
A

Because Px (A < ) = 1 for x A, we obtain that for all x A and all (j) (j) j 1, Px (A = 1) and Ex [A ] = j=1 Px (A < ) = . Even though all transition kernels may not be Harris recurrent, the following theorem provides a very useful decomposition of the state space of a recurrent phi-irreducible transition kernel. For a proof of this result, see Meyn and Tweedie (1993, Theorem 9.1.5) Theorem 14.2.23. Let Q be a phi-irreducible recurrent transition kernel on a state space X and let be a maximal irreducibility measure. Then X = NH, where N is covered by a countable family of uniformly transient sets, (N) = 0 and every accessible subset of H is Harris recurrent. As a consequence, if A is an accessible set of a recurrent phi-irreducible chain, then there exists a set A A such that (A \ A ) = 0 for any maximal irreducibility measure , and Px (A < ) = 1 for all x A . Example 14.2.24. To understand why a recurrent Markov chain can fail to be Harris, consider the following elementary example of a chain on X = N. Let the transition kernel Q be given by Q(0, 0) = 1 and for x 1, Q(x, x+1) = 1 1/x2 and Q(x, 0) = 1/x2 . Thus the state 0 is absorbing. Because Q(x, 0) > 0 for any x X, 0 is an irreducibility measure. In fact, by application of Theorem 14.2.2, this measure is maximal. The set {0} is an atom and because P0 ({0} < ) = 1, the chain is recurrent by Proposition 14.2.9. The chain is not Harris recurrent, however. Indeed, for any x 1 we have

(1)



x+k1

Px (0 k) = Px (X1 = 0, . . . , Xk1 = 0) =
j=x

(1 1/j 2 ) .

Because j=2 (1 1/j 2 ) > 0, we obtain that Px (0 = ) = limk Px (0 k) > 0 for any x 2, so that the accessible state 0 is not certainly reached from such an initial state. Comparing to Theorem 14.2.23, we see that the decomposition of the state space is given by H = {0} and N = {1, 2, . . .}. 14.2.3 Invariant Measures and Stationarity On general state spaces, we again further classify chains using invariant measures. A -nite measure is called Q-sub-invariant if Q and Qinvariant if = Q. Theorem 14.2.25. A phi-irreducible recurrent transition kernel (or Markov chain) admits a unique (up to a multiplicative constant) invariant measure which is also a maximal irreducibility measure. This result leads us to dene the following classes of chains. Denition 14.2.26 (Positive and Null Chains). A phi-irreducible transition kernel (or Markov chain) is called positive if it admits an invariant probability measure; otherwise it is called null. We now prove the existence of an invariant measure when the chain admits an accessible atom. The invariant measure is dened as for countable state spaces, by replacing an individual state by the atom. Thus dene the measure on X by

(A) = E
n=1

1A (Xn ) ,

AX .

(14.30)

Proposition 14.2.27. Let be an accessible atom for the transition kernel Q. Then is Q-sub-invariant. It is invariant if and only if the atom is recurrent. In that case, any Q-invariant measure is proportional to , and is a maximal irreducibility measure. Proof. By the denition of and the strong Markov property,
+1

Q(A) = E
k=1

Q(Xk , A) = E
k=2

1A (Xk )

= (A) P (X1 A) + E [1A (X +1 )1{ <} ] . Applying the strong Markov property once again yields



X E [1A (X +1 )1{ <} ] = E {E [1A (X1 ) | F ]1{ <} }


= E [PX (X1 A)1{ <} ] = P (X1 A) P ( < ) .

Thus Q(A) = (A) P (X1 A)[1 P ( < )]. This proves that is sub-invariant, and invariant if and only if P ( < ) = 1. Now let be an invariant non-trivial measure and let A be an accessible set such that (A) < . Then there exists an integer n such that Qn (, A) > 0. Because is invariant, it holds that = Qn , so that > (A) = Qn (A) ()Qn (, A) . This implies that () < . Without loss of generality, we can assume () > 0; otherwise we replace by + . Assuming () > 0, there is then no loss of generality in assuming () = 1. The next step is to prove that if is an invariant measure such that () = 1, then . To prove this it suces to prove that for all n 1,
n

(A)
k=1

P (Xk A, k) .

We prove this inequality by induction. For n = 1 we can write (A) = Q(A) ()Q(, A) = Q(, A) = P (X1 A) . Now assume now that the inequality holds for some n 1. Then (A) = Q(, A) +
c n

(dy) Q(y, A) E [Q(Xk , A)1{ k} 1{Xk } ] / E [Q(Xk , A)1{ k+1} ] .

Q(, A) +
k=1 n

Q(, A) +
k=1

X Because { k + 1} Fk , the Markov property yields

E [Q(Xk , A)1{ k+1} ] = P (Xk+1 A, k + 1) , whence


n+1 n+1

(A) Q(, A) +
k=2

P (Xk A, k) =
k=1

P (Xk A, k) .

This completes the induction, and we conclude that . Assume that there exists a set A such that (A) > (A). It is straightforward that and are both invariant for the resolvent kernel K (see


(14.17)), for any (0, 1). Because is accessible, K (x, ) > 0 for all x X. Hence A (dx) Q(x, ) > A (dx) Q(x, ), which implies that 1 = () = K () =
A

(dx) K (x, ) +
Ac

(dx) K (x, )

>
A

(dx) K (x, ) +
Ac

(dx) K (x, ) = K () = () = 1.

This contradiction shows that = . We nally prove that is a maximal irreducibility measure. Let be a maximal irreducibility measure and assume that (A) = 0. Then Px (A < ) = 0 for -almost all x X. This obviously implies that Px (A < ) = 0 for -almost all x . Because Px (A < ) is constant over , we nd that Px (A < ) = 0 for all x , and this yields (A) = 0. Thus is absolutely continuous with respect to , hence an irreducibility measure. Let again K be the resolvent kernel. By Theorem 14.2.2, K is a maximal irreducibility measure. But, as noted above, K = , and therefore is a maximal irreducibility measure. Proposition 14.2.28. Let Q be a recurrent phi-irreducible transition kernel that admits an accessible (1, , )-small set C. Then it admits a non-trivial invariant measure, unique up to multiplication by a constant and such that 0 < (C) < , and any invariant measure is a maximal irreducibility measure. Proof. By (14.26), (Q) = Q, so that is Q-invariant if and only if is Q-invariant. Let be a Q-invariant measure and dene =
C{0}

(d) R(x, ) + x
C c {0}

(d) Q(x, ) + (X {1}) . x

By application of the denition of the split kernel and measures, it can be checked that Q = . Hence = Q = . We thus see that is Q invariant, which, as noted above, implies that is Q-invariant. Hence we have shown that there exists a Q-invariant measure if and only if there exists a Q-invariant one. If Q is recurrent then C is recurrent, and as appears in the proof of Proposition 14.2.28 this implies that the atom is recurrent for the split chain Q. Thus, by Proposition 14.2.9 the kernel Q is recurrent, and by Proposition 14.2.27 it admits an invariant measure that is unique up to a scaling factor. Hence Q also admits an invariant measure, unique up to a scaling factor and such that 0 < (C) < . Let be Q-invariant. Then is Q-invariant and hence, by Proposition 14.2.27, a maximal irreducibility measure. If (A) > 0, then (A {0, 1}) = (A) > 0. Thus A {0, 1} is accessible, and this implies that A is accessible. We conclude that is an irreducibility measure, and it is maximal because it is K -invariant.


If the kernel Q is phi-irreducible and admits an accessible (m, , )-small set C, then, by Proposition 14.2.20, for any (0, 1) the set C is an accessible (1, , )-small set for the resolvent kernel K . If C is recurrent for Q, it is also recurrent for K and therefore, by Proposition 14.2.19, K has a unique invariant probability measure. The following result shows that this probability measure is invariant also for Q. Lemma 14.2.29. A measure on (X, X ) is Q-invariant if and only if is K -invariant for some (hence for all) (0, 1). Proof. If Q = , then obviously Qn = for all n 0, so that K = . Conversely, assume that K = . Because K = QK + (1 )Q0 and QK = K Q, it holds that = K = QK + (1 ) = K Q + (1 ) = Q + (1 ) . Hence Q = , which concludes the proof. 14.2.3.1 Drift Conditions We rst give a sucient condition for a chain to be positive, based on the expectation of the return time to an accessible small set. Proposition 14.2.30. Let Q be a transition kernel that admits an accessible small set C such that sup Ex [C ] < .
xC

(14.31)

Then the chain is positive and the invariant probability measure satises, for all A X ,
C 1

(A) =
C

(dy) Ey
k=0

1A (Xk ) =
C

(dy) Ey
k=1

1A (Xk ) . (14.32)

If f is a non-negative measurable function such that


C 1

sup Ex
xC k=0

f (Xk ) < ,

(14.33)

then f is integrable with respect to and


C 1 C

(f ) =
C

(dy) Ey
k=0

f (Xk ) =
C

(dy) Ey
k=1

f (Xk )


Proof. First note that by Proposition 14.2.11, Q is phi-irreducible. Equation (14.31) implies that for all Px (C < ) = 1 x C, that is, C is Harris recurrent. By Proposition 14.2.22, C is recurrent, and so, by Proposition 14.2.19, Q is recurrent. Let be an invariant measure such that 0 < (C) < , the existence of which is given by Proposition 14.2.28. Then dene a measure C on X by
C

C (A) =

def C

(dy) Ey
k=1

1A (Xk ) .

Because C < Py -a.s. for all y C, it holds that C (C) = (C). Then we can show that C (A) = (A) for all A X . The proof is along the same lines as the proof of Proposition 14.2.27 and is therefore omitted. Thus, C is invariant. In addition, we obtain that for any measurable set A, (dy) Ey [1A (X0 )] = (A C) = C (A C) =
C C

(dy) Ey [1A (XC )] ,

and this yields


C

C (A) =
C

(dy) Ey
k=1

1A (Xk ) =
C

C 1

(dy) Ey
k=0

1A (Xk ) .

We thus obtain the following equivalent expressions for C :


C 1

C (A) =
C

(dy) Ey
k=0 C

1A (Xk ) =
C

C 1

C (dy) Ey
k=0 C

1A (Xk )

=
C

C (dy) Ey
k=1

1A (Xk ) =
C

(dy) Ey
k=1

1A (Xk )

= (A) . Hence
C 1

(X) =
C

(dy) Ey
k=0

1X (Xk ) (C) sup Ey [C ] < ,


yC

so that any invariant measure is nite and the chain is positive. Finally, under (14.33) we obtain that
C 1 C 1

(f ) =
C

(dy) Ey
k=0

f (Xk ) (C) sup Ey


yC k=1

f (Xk ) < .

Except in specic examples (where, for example, the invariant distribution is known in advance), it may be dicult to decide if a chain is positive or null. To check such properties, it is convenient to use drift conditions.


Proposition 14.2.31. Assume that there exist a set $C \in \mathcal{X}$, two measurable functions $1 \leq f \leq V$, and a constant $b > 0$ such that
\[
QV \leq V - f + b \mathbb{1}_C . \tag{14.34}
\]
Then
\[
E_x[\tau_C] \leq V(x) + b \mathbb{1}_C(x) , \tag{14.35}
\]
\[
E_x[V(X_{\tau_C})] + E_x\Bigl[\sum_{k=0}^{\tau_C - 1} f(X_k)\Bigr] \leq V(x) + b \mathbb{1}_C(x) . \tag{14.36}
\]

If C is an accessible small set and V is bounded on C, then the chain is positive recurrent and (f ) < . Proof. Set for n 1,
n1

Mn = V (Xn ) +
k=0

f (Xk )

1{C n} .

Then
n

E[Mn+1 | Fn ] = QV (Xn ) +
k=0

f (Xk )

1{C n+1}
n

V (Xn ) f (Xn ) + b1C (Xn ) +


n1

f (Xk )
k=0

1{C n+1}

= V (Xn ) +
k=0

f (Xk )

1{C n+1} Mn ,

as 1C (Xn )1{C n+1} = 0. Hence {Mn }n1 is a non-negative super-martingale. For any integer n, C n is a bounded stopping time, and Doobs optional stopping theorem shows that for any x X, Ex [MC n ] Ex [M1 ] V (x) + b1C (x) . Applying this relation with f 1 yields for any x X and n 0, Ex [C n] V (x) + b1C (x) , and (14.35) follows using monotone convergence. This implies in particular that Px (C < ) = 1 for any x X. The proof of (14.36) follows similarly from (14.37) by the letting n and (f ) is nite by (14.33). Example 14.2.32 (Random Walk on the Half-Line, Continued). Consider again the model of Example 14.2.8. Previously we have seen that sets of the form [0, c] are small. If (( , c]) > 0, then for x [0, c], (14.37)


Q(x, A) (( , c])1A (0) ; otherwise there exists an integer m such that m (( , c]) > 0, whence Qm (x, A) m (( , c])1A (0) . To prove recurrence for < 0, we apply Proposition 14.2.31. Because < 0, there exists c > 0 such that c w (dw) /2 < 0. Thus taking V (x) = x for x > c,

QV (x) V (x) =

[(x + w)+ x] (dw)

= x (( , x]) +
x

w (dw) /2 .

Hence the chain is positive recurrent. Consider now the case > 0. In view of Proposition 14.2.9, we have to n show that the atom {0} is transient. For any n, Xn X0 + i=1 Wi . Dene n Cn = n1 i=1 Wi /2 and write Dn for {Xn = 0}. The strong law of large numbers implies that P0 (Dn i.o.) P0 (Cn i.o.) = 0. Hence the atom {0} is transient, and so is the chain. When = 0, additional assumptions on are needed to prove the recurrence of the RWHL (see for instance Meyn and Tweedie, 1993, Lemma 8.5.2). Example 14.2.33 (Autoregressive Model, Continued). Consider again the model of Example 14.2.3 and assume that the noise process has zero mean and nite variance. Choosing V (x) = x2 we have
2 P V (x) = E[(x + U1 )2 ] = 2 V (x) + E[U1 ] ,

so that (14.34) holds when C = [M, M ] for some large enough M , provided || < 1. Because we know that every compact set is small if the noise process has an everywhere continuous positive density, Proposition 14.2.31 shows that the chain is positive recurrent. Note that this approach provides an existence result but does not help us to determine . If {Uk } are Gaussian with zero mean and variance 2 , then one can check that the invariant distribution also is Gaussian with zero mean and variance 2 /(1 2 ). Theorem 14.2.25 shows that if a chain is phi-irreducible and recurrent then the chain is positive, that is, it admits a unique invariant probability measure . In certain situations, and in particular when dealing with MCMC procedures, it is known that Q admits an invariant probability measure, but it is not known, a priori, that the chain is recurrent. The following result shows that positivity implies recurrence. Proposition 14.2.34. If the Markov kernel Q is positive, then it is recurrent.


Proof. Suppose that the chain is positive and let be an invariant probability measure. If Q is transient, the state space X is covered by a countable family {Aj } of uniformly transient subsets (see Theorem 14.2.6). For any j and k,
k

k(Aj ) =
n=1

Qn (Aj )

(dx) Ex [Aj ] sup Ex [Aj ] .


xX

(14.38)

The strong Markov property implies that Ex [Aj ] = Ex [Aj 1{Aj <} ] Ex {1{Aj <} EXA [Aj ]} sup Ex [Aj ] Px (Aj < ) .
j

xAj

Thus, the left-hand side of (14.38) is bounded as k . This implies that (Aj ) = 0, and hence (X) = 0. This is a contradiction so the chain cannot be transient.

14.2.4 Ergodicity In this section, we study the convergence of iterates Qn of the transition kernel to the invariant distribution. As for discrete state spaces case, we rst need to avoid periodic behavior that prevents the iterates to converge. In the discrete case, the period of a state x is dened as the greatest common divisor of the set of time points {n 0 : Qn (x, x) > 0}. Of course this notion does not extend to general state spaces, but for phi-irreducible chains we may dene the period of accessible small sets. More precisely, let Q be a phi-irreducible transition kernel with maximal irreducibility measure . By Theorem 14.2.11, there exists an accessible (m, , )-small set C. Because is a maximal irreducibility measure, (C) > 0, so that when the chain starts from C there is a positive probability that the it will return to C at time m. Let EC = {n 1 : the set C is (n,
def n , )-small

for some

> 0}

(14.39)

be the set of time points for which C is small with minorizing measure . Note that for n and m in EC , B X + and x C, Qn+m (x, B)
C

Qm (x, dx ) Qn (x , B)

m n (C)(B)

>0,

showing that EC is closed under addition. There is thus a natural period for EC , given by the greatest common divisor. Similar to the discrete case (see Proposition 14.1.12), this period d may be shown to be independent of the particular choice of the small set C (see for instance Meyn and Tweedie, 1993, Theorem 5.4.4).


Proposition 14.2.35. Suppose that Q is phi-irreducible with maximal irreducibility measure . Let C be an accessible (m, , )-small set and let d be the greatest common divisor of the set EC , dened in (14.39). Then there exist disjoint sets D1 , . . . , Dd (a d-cycle) such that (i) for x Di , Q(x, Di+1 ) = 1, i = 0, . . . , d 1 (mod d); (ii) the set N = (d Di )c is -null. i=1 The d-cycle is maximal in the sense if D1 , . . . , Dd is a d -cycle, then d divides d, and if d = d , then up to a permutation of indices Di and Di are -almost equal. It is obvious from the this theorem that the period d does not depend on the choice of the small set C and that any small set must be contained (up to -null sets) inside one specic member of a d-cycle. This in particular implies that if there exists an accessible (1, , )-small set C, then d = 1. This suggests the following denition Denition 14.2.36 (Aperiodicity). Suppose that Q is a phi-irreducible transition kernel with maximal irreducibility measure . The largest d for which a d-cycle exists is called the period of Q. When d = 1, the chain is called aperiodic. When there exists a (1, , )-small set C, the chain is called strongly aperiodic. In all the examples considered above, we have shown the existence of a 1-small set; therefore all these Markov chains are strongly aperiodic. Now we can state the main convergence result, formulated and proved by Athreya et al. (1996). It parallels Theorem 14.1.13. Theorem 14.2.37. Let Q be a phi-irreducible positive aperiodic transition kernel. Then for -almost all x,
n

lim

Qn (x, )

TV

=0.

(14.40)

If Q is Harris recurrent, the convergence occurs for all x X. Although this result does not provide information on the rate of convergence to the invariant distribution, its assumptions are quite minimal. In fact, it may be shown that these assumptions are essentially necessary and sufcient. If Qn (x, ) TV 0 for any x X, then by Nummelin (1984, Proposition 6.3), the chain is -irreducible, aperiodic, positive Harris, and is an invariant distribution. This form of the ergodicity theorem is of particular interest in cases where the invariant distribution is explicitly known, as in Markov chain Monte Carlo. It provides conditions that are simple and easy to verify, and under which an MCMC algorithm converges to its stationary distribution. Of course the exceptional null set for non-Harris recurrent chain is a nuisance. The example below however shows that there is no way of getting rid of it.


Example 14.2.38. In the model of Example 14.2.24, $\pi = \delta_0$ is an invariant probability measure. Because $Q^n(x, \{0\}) = P_x(\sigma_{\{0\}} \leq n)$ for any $n \geq 0$, $\lim_n Q^n(x, \{0\}) = P_x(\sigma_{\{0\}} < \infty)$. We have previously shown that $P_x(\sigma_{\{0\}} < \infty) = 1 - P_x(\sigma_{\{0\}} = \infty) < 1$ for $x \geq 2$, whence $\limsup_n \| Q^n(x, \cdot) - \pi \|_{\mathrm{TV}} > 0$ for such x. Fortunately, in many cases it is not hard to show that a chain is Harris.

A proof of Theorem 14.2.37 from first principles is given by Athreya et al. (1996). We give here a proof due to Rosenthal (1995), based on pathwise coupling (see Rosenthal, 2001; Roberts and Rosenthal, 2004). The same construction is used to compute bounds on $\| Q^n(x, \cdot) - \pi \|_{\mathrm{TV}}$. Before proving the theorem, we briefly introduce the pathwise coupling construction for phi-irreducible Markov chains and present the associated Lindvall inequalities.

14.2.4.1 Pathwise Coupling and Coupling Inequalities

Suppose that we have two probability measures $\xi$ and $\xi'$ on $(X, \mathcal{X})$ such that $\tfrac{1}{2} \| \xi - \xi' \|_{\mathrm{TV}} \leq 1 - \epsilon$ for some $\epsilon \in (0, 1]$ or, equivalently (see (4.19)), that there exists a probability measure $\nu$ such that $\xi \geq \epsilon \nu$ and $\xi' \geq \epsilon \nu$. Because $\xi$ and $\xi'$ are probability measures, we may construct a probability space $(\Omega, \mathcal{F}, P)$ and X-valued random variables X and X' such that $P(X \in \cdot) = \xi(\cdot)$ and $P(X' \in \cdot) = \xi'(\cdot)$, respectively. By definition, for any $A \in \mathcal{X}$,
\[
|\xi(A) - \xi'(A)| = | P(X \in A) - P(X' \in A)| = | E[\mathbb{1}_A(X) - \mathbb{1}_A(X')]| \tag{14.41}
\]
\[
= | E[(\mathbb{1}_A(X) - \mathbb{1}_A(X')) \mathbb{1}_{\{X \neq X'\}}]| \leq P(X \neq X') , \tag{14.42}
\]
so that the total variation distance between the laws of two random elements is bounded by the probability that they are unequal. Of course, this inequality is not in general sharp, but we can construct on an appropriately defined probability space $(\Omega, \mathcal{F}, P)$ two X-valued random variables X and X' with laws $\xi$ and $\xi'$ such that $P(X \neq X') \leq 1 - \epsilon$. The construction goes as follows. We draw a Bernoulli random variable d with probability of success $\epsilon$. If $d = 0$, we then draw X and X' independently from the distributions $(1 - \epsilon)^{-1}(\xi - \epsilon \nu)$ and $(1 - \epsilon)^{-1}(\xi' - \epsilon \nu)$, respectively. If $d = 1$, we draw X from $\nu$ and set $X' = X$. Note that for any $A \in \mathcal{X}$,
\[
P(X \in A) = P(X \in A \mid d = 0) P(d = 0) + P(X \in A \mid d = 1) P(d = 1) = (1 - \epsilon) \{ (1 - \epsilon)^{-1} [\xi(A) - \epsilon \nu(A)] \} + \epsilon \nu(A) = \xi(A) ,
\]
and, similarly, $P(X' \in A) = \xi'(A)$. Thus, marginally the random variables X and X' are distributed according to $\xi$ and $\xi'$. By construction, $P(X = X') \geq P(d = 1) = \epsilon$, showing that X and X' are equal with probability at least $\epsilon$. Therefore the coupling bound (14.41) can be made sharp by using an appropriate construction. Note that this construction may be used to derive bounds on distances between probability measures that generalize the total variation; we will consider in the sequel the V-total variation.
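For discrete distributions the construction just described is easy to code. The following sketch is an illustration with assumed toy distributions, not part of the original text: it builds the coupling with the largest possible $\epsilon$, namely $\epsilon = \sum_x \xi(x) \wedge \xi'(x)$, and checks empirically that $P(X \neq X')$ matches $1 - \epsilon$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two toy distributions on {0,1,2,3}; purely illustrative.
xi  = np.array([0.4, 0.3, 0.2, 0.1])
xi2 = np.array([0.1, 0.2, 0.3, 0.4])

overlap = np.minimum(xi, xi2)        # xi ^ xi2
eps = overlap.sum()                  # largest eps with xi >= eps*nu and xi2 >= eps*nu
nu = overlap / eps                   # common part
res1 = (xi - overlap) / (1 - eps)    # residual of xi  (assumes eps < 1)
res2 = (xi2 - overlap) / (1 - eps)   # residual of xi2

def maximal_coupling():
    """Draw (X, X') with marginals xi, xi2 and P(X != X') = 1 - eps."""
    if rng.random() < eps:
        x = rng.choice(4, p=nu)
        return x, x
    return rng.choice(4, p=res1), rng.choice(4, p=res2)

draws = [maximal_coupling() for _ in range(20000)]
p_diff = np.mean([x != y for x, y in draws])
print("empirical P(X != X'):", round(p_diff, 3), " 1 - eps =", round(1 - eps, 3))
```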


Denition 14.2.39 (V-Total Variation). Let V : X [1, ) be a measurable function. The V -total variation distance between two probability measures and on (X, X ) is If V 1,
1 def V

= sup |(f ) (f )| .
|f |V

is the total variation distance.

When applied to Markov chains, the whole idea of coupling is to construct on an appropriately dened probability space two Markov chains {Xk } and {Xk } with transition kernel Q and initial distributions and , respectively, in such a way that Xn = Xn for all indices n after a random time T , referred to as the coupling time. The coupling procedure attempts to couple the two Markov chains when they simultaneously enter a coupling set. Denition 14.2.40 (Coupling Set). Let C X X, (0, 1] and let = {x,x , x, x X} be transition kernels from C (endowed with the trace is a (1, , )-coupling set if for all (x, x ) C -eld) to (X, X ). The set C and all A X , Q(x, A) Q(x , A) x,x (A) . (14.43)

By applying Lemma 4.3.5, this condition can be stated equivalently as: there exists $\epsilon \in (0, 1]$ such that for all $(x, x') \in \bar{C}$,
\[
\frac{1}{2}\, \| Q(x, \cdot) - Q(x', \cdot) \|_{\mathrm{TV}} \le 1 - \epsilon \;.
\tag{14.44}
\]

For simplicity, only one-step minorization is considered in this chapter. Adaptations to m-step minorization (replacing Q by Qm in (14.43)) can be carried out as in Rosenthal (1995). Condition (14.43) is often satised by setting C = C C for a (1, , )-small set C. Indeed, in that case, for all (x, x ) C C and A X , Q(x, A) Q(x , A) (A) . The case = 1 needs some consideration. If there exists an atom, say , i.e., there exists a probability measure such that for all x and A X , Q(x, A) = (A), then C = is a (1, 1, )-coupling set with Conversely, assume that C is a (1, 1, )-coupling x,x = for all (x, x ) C. set. The alternative characterization (14.44) shows that Q(x, ) = Q(x , ) for all (x, x ) C, that is, C is an atom. This also implies that the set C contains a set 1 2 , where 1 and 2 are atoms for Q. We now introduce the coupling construction. Let C be a (1, , )-coupling = X X and X = X X . Let Q be a transition kernel on (X, X ) set. Dene X given for all A and A in X by


Q(x, x ; A A ) = Q(x, A)Q(x , A )1C c (x, x )+ (1 )2 [Q(x, A) x,x (A)][Q(x , A ) x,x (A )]1C (x, x ) (14.45) if < 1 and Q = Q Q if = 1. For any probability measure on (X, X ), let P be the probability measure on the canonical space (XN , X N ) such that the coordinate process {Xk } is a Markov chain with respect to its natural ltration and with initial distribution and transition kernel Q. As usual, denote the associated expectation operator by E . def We now dene a transition kernel Q on the space X = X X {0, 1} endowed with the product -eld X by, for any x, x X and A, A X , Q ((x, x , 0), A A {0}) = [1 1C (x, x )]Q((x, x ), A A ) , Q ((x, x , 0), A A {1}) = 1C (x, x )x,x (A A ) , ((x, x , 1), A A {1}) = Q(x, A A ) . Q (14.46) (14.47) (14.48)

For any probability measure on (X, X ), let P be the probability measure N N on the canonical space (X , X ) such that the coordinate process {Xk } is a Markov chain with transition kernel Q and initial distribution . The corre sponding expectation operator is denoted by E . can be described algorithmically. Given X0 = The transition kernel Q 1 = (X1 , X , d1 ) is obtained as follows. (X0 , X0 , d0 ) = (x, x , d), X 1 If d = 1, then draw X1 from Q(x, ) and set X1 = X1 , d1 = 1. If d = 0 and (x, x ) C, ip a coin with probability of heads . If the coin comes up heads, draw X1 from x,x and set X1 = X1 and d1 = 1. If the coin comes up tails, draw (X1 , X1 ) from Q(x, x ; ) and set d1 = 0. draw (X1 , X ) from Q(x, x ; ) and set d1 = 0. If d = 0 and (x, x ) C, 1
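The following Python sketch renders this one-step update for a finite state space, assuming the coupling set is $C \times C$ for a $(1, \epsilon, \nu)$-small set $C$ and $\nu_{x,x'} \equiv \nu$; the helper name and the representation of $Q$ as a row-stochastic matrix are our own choices. Iterating it until $d = 1$ produces a realization of the coupling time $T$ used in the bounds below.

import numpy as np

def coupled_step(x, xp, d, Q, C, eps, nu, rng):
    # One transition of the trivariate chain (X_k, X'_k, d_k).  Q is a finite
    # row-stochastic matrix and C a set of states with Q[x] >= eps * nu
    # componentwise for every x in C (the small-set assumption).
    n = Q.shape[0]
    if d == 1:                                      # already coupled: move together
        x = rng.choice(n, p=Q[x])
        return x, x, 1
    if x in C and xp in C and rng.random() < eps:   # heads: couple now
        x = rng.choice(n, p=nu)
        return x, x, 1
    if x in C and xp in C:                          # tails: independent residual moves
        x  = rng.choice(n, p=(Q[x]  - eps * nu) / (1.0 - eps))
        xp = rng.choice(n, p=(Q[xp] - eps * nu) / (1.0 - eps))
    else:                                           # outside the coupling set
        x, xp = rng.choice(n, p=Q[x]), rng.choice(n, p=Q[xp])
    return x, xp, 0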

The variable dn is called the bell variable; it indicates whether coupling has occurred by time n (dn = 1) or not (dn = 0). The rst index n at which dn = 1 is the coupling time; T = inf{k 1 : dk = 1}. If dn = 1, then Xk = Xk for all k n. The coupling construction is carried out in such a way that under P 0 , {Xk } and {Xk } are Markov chains with transition kernel Q with initial distributions and , respectively. The coupling construction allows deriving quantitative bounds on the (V -)total variation distance in terms of the tail probability of the coupling time. Proposition 14.2.41. Assume that the transition kernel Q admits a (1, , )coupling set. Then for any probability measures and on (X, X ) and any measurable function V : X [1, ),


\[
\|\xi Q^n - \xi' Q^n\|_{\mathrm{TV}} \le 2\, \widetilde{\mathrm{P}}_{\xi \otimes \xi' \otimes \delta_0}(T > n) \;,
\tag{14.49}
\]
\[
\|\xi Q^n - \xi' Q^n\|_{V} \le 2\, \widetilde{\mathrm{E}}_{\xi \otimes \xi' \otimes \delta_0}\big[\bar{V}(X_n, X'_n)\, \mathbb{1}\{T > n\}\big] \;,
\tag{14.50}
\]

where $\bar{V} : \mathsf{X} \times \mathsf{X} \to [1, \infty)$ is defined by $\bar{V}(x, x') = \{V(x) + V(x')\}/2$.

Proof. We only need to prove (14.50) because (14.49) is obtained by setting $V \equiv 1$. Pick a function $f$ such that $|f| \le V$ and note that $[f(X_n) - f(X'_n)]\, \mathbb{1}\{d_n = 1\} = 0$. Hence
\[
|\xi Q^n f - \xi' Q^n f| = \big| \widetilde{\mathrm{E}}_{\xi \otimes \xi' \otimes \delta_0}[f(X_n) - f(X'_n)] \big|
= \big| \widetilde{\mathrm{E}}_{\xi \otimes \xi' \otimes \delta_0}[(f(X_n) - f(X'_n))\, \mathbb{1}\{d_n = 0\}] \big|
\le 2\, \widetilde{\mathrm{E}}_{\xi \otimes \xi' \otimes \delta_0}[\bar{V}(X_n, X'_n)\, \mathbb{1}\{d_n = 0\}] \;.
\]
We now provide an alternative expression of the coupling inequality that only involves the process $\{\bar{X}_k\}$. Let $\sigma_{\bar{C}}$ be the hitting time of the coupling set $\bar{C}$ by this process, define $K_0(\sigma_{\bar{C}}) = 1$, and for all $n \ge 1$, set
\[
K_n(\sigma_{\bar{C}}) =
\begin{cases}
\mathbb{1}\{\sigma_{\bar{C}} \ge n\} & \text{if } \epsilon = 1 \;;\\
\prod_{j=0}^{n-1} [1 - \epsilon\, \mathbb{1}_{\bar{C}}(\bar{X}_j)] & \text{if } \epsilon \in (0, 1) \;.
\end{cases}
\tag{14.51}
\]

Proposition 14.2.42. Assume that the transition kernel $Q$ admits a $(1, \epsilon, \bar{\nu})$-coupling set. Let $\xi$ and $\xi'$ be probability measures on $(\mathsf{X}, \mathcal{X})$ and let $V : \mathsf{X} \to [1, \infty)$ be a measurable function. Then
\[
\|\xi Q^n - \xi' Q^n\|_{V} \le 2\, \bar{\mathrm{E}}_{\xi \otimes \xi'}[\bar{V}(X_n, X'_n)\, K_n(\sigma_{\bar{C}})] \;,
\tag{14.52}
\]

with V (x, x ) = [V (x) + V (x )]/2. Proof. We show that for any probability measure on (X, X ), E0 [V (Xn , Xn )1{T >n} ] = E [V (Xn , Xn )Kn ( )] . To do this, we shall prove by induction that for any n 0 and any bounded X -measurable functions {fj }j0 ,
\[
\widetilde{\mathrm{E}}_{\xi \otimes \xi' \otimes \delta_0}\Big[ \prod_{j=0}^{n} f_j(X_j, X'_j)\, \mathbb{1}\{T > n\} \Big]
= \bar{\mathrm{E}}_{\xi \otimes \xi'}\Big[ \prod_{j=0}^{n} f_j(X_j, X'_j)\, K_n(\sigma_{\bar{C}}) \Big] \;.
\tag{14.53}
\]

This is obviously true for $n = 0$. For $n \ge 0$, put $\Pi_n = \prod_{j=0}^{n} f_j(X_j, X'_j)$. The induction assumption and the identity $\{T > n+1\} = \{d_{n+1} = 0\}$ yield
\[
\widetilde{\mathrm{E}}_{\xi \otimes \xi' \otimes \delta_0}[\Pi_{n+1}\, \mathbb{1}\{T > n+1\}]
= \widetilde{\mathrm{E}}_{\xi \otimes \xi' \otimes \delta_0}[\Pi_n\, f_{n+1}(X_{n+1}, X'_{n+1})\, \mathbb{1}\{d_{n+1} = 0\}]
= \widetilde{\mathrm{E}}_{\xi \otimes \xi' \otimes \delta_0}\big\{ \Pi_n\, \widetilde{\mathrm{E}}[f_{n+1}(X_{n+1}, X'_{n+1})\, \mathbb{1}\{d_{n+1} = 0\} \mid \widetilde{\mathcal{F}}_n]\, \mathbb{1}\{d_n = 0\} \big\}
\]
\[
= \widetilde{\mathrm{E}}_{\xi \otimes \xi' \otimes \delta_0}\big\{ \Pi_n\, [1 - \epsilon\, \mathbb{1}_{\bar{C}}(X_n, X'_n)]\, \bar{Q} f_{n+1}(X_n, X'_n)\, \mathbb{1}\{d_n = 0\} \big\}
= \bar{\mathrm{E}}_{\xi \otimes \xi'}[\Pi_n\, \bar{Q} f_{n+1}(\bar{X}_n)\, K_{n+1}(\sigma_{\bar{C}})]
= \bar{\mathrm{E}}_{\xi \otimes \xi'}[\Pi_{n+1}\, K_{n+1}(\sigma_{\bar{C}})] \;.
\]
This concludes the induction and the proof.


14.2.4.2 Proof of Theorem 14.2.37

We preface the proof of Theorem 14.2.37 by two technical lemmas that establish some elementary properties of a chain on the product space with transition kernel $Q \otimes Q$.

Lemma 14.2.43. Suppose that $Q$ is a phi-irreducible aperiodic transition kernel. Then for any $n$, $Q^n$ is phi-irreducible and aperiodic.

Proof. Propositions 14.2.11 and 14.2.12 show that there exists an accessible $(m, \epsilon, \nu)$-small set $C$ and that $\nu$ is an irreducibility measure. Because $Q$ is aperiodic, there exists a sequence $\{\epsilon_k\}$ of positive numbers and an integer $n_C$ such that for all $n \ge n_C$, $x \in C$, and $A \in \mathcal{X}$, $Q^n(x, A) \ge \epsilon_n \nu(A)$. In addition, because $C$ is accessible, there exists $p$ such that $Q^p(x, C) > 0$ for any $x \in \mathsf{X}$. Therefore for any $n \ge n_C$ and any $A \in \mathcal{X}$ such that $\nu(A) > 0$,
\[
Q^{n+p}(x, A) \ge \int_C Q^p(x, dx')\, Q^n(x', A) \ge \epsilon_n\, \nu(A)\, Q^p(x, C) > 0 \;.
\tag{14.54}
\]

Lemma 14.2.44. Let $Q$ be an aperiodic positive transition kernel with invariant probability measure $\pi$. Then $Q \otimes Q$ is phi-irreducible, $\pi \otimes \pi$ is $Q \otimes Q$-invariant, and $Q \otimes Q$ is positive. If $C$ is an accessible $(m, \epsilon, \nu)$-small set for $Q$, then $C \times C$ is an accessible $(m, \epsilon^2, \nu \otimes \nu)$-small set for $Q \otimes Q$.

Proof. Because $Q$ is phi-irreducible and admits $\pi$ as an invariant probability measure, $\pi$ is a maximal irreducibility measure for $Q$. Let $C$ be an accessible $(m, \epsilon, \nu)$-small set for $Q$. Then for $(x, x') \in C \times C$ and $A \in \mathcal{X} \otimes \mathcal{X}$,
\[
(Q \otimes Q)^m(x, x'; A) = \iint_A Q^m(x, dy)\, Q^m(x', dy') \ge \epsilon^2\, \nu \otimes \nu(A) \;.
\]

Because (C C) = [(C)]2 > 0, this shows that C C is a (1, 2 , )small set for Q Q. By (14.54) there exists an integer nx such that for any n nx , Qn (x, C) > 0. This implies that for any (x, x ) X X and any n nx nx , (Q Q)n (x, x ; C C) = Qn (x, C)Qn (x , C) > 0 , showing that C C is accessible. Because C C is a small set, Proposition 14.2.11 shows that Q Q is phi-irreducible. In addition, is invariant for Q Q, so that is a maximal irreducibility measure and Q Q is positive. We have now all the necessary ingredients to prove Theorem 14.2.37.


Proof (of Theorem 14.2.37). By Lemma 14.2.43, $Q^m$ is phi-irreducible for any integer $m$. By Proposition 14.2.12, there exists an accessible $(m, \epsilon, \nu)$-small set $C$ with $\nu(C) > 0$. Lemma 4.3.8 shows that for all integers $n$,
\[
\|Q^n(x, \cdot) - Q^n(x', \cdot)\|_{\mathrm{TV}} \le \|Q^{m[n/m]}(x, \cdot) - Q^{m[n/m]}(x', \cdot)\|_{\mathrm{TV}} \;.
\]

Hence it suces to prove that (14.40) holds for Qm and we may thus without loss of generality assume that m = 1. For any probability measure on (X X, X X ), let P denote the probability measure on the canonical space ((X X)N , (X X )N ) such that the canonical process {(Xk , Xk )}k0 is a Markov chain with transition kernel Q Q and initial distribution . By Lemma 14.2.44, Q Q is positive, and it is recurrent by Proposition 14.2.34. Because (C C) = 2 (C) > 0, by Theorem 14.2.23 there exist two measurable sets C C C and H X X such that (C C \ C) = 0, (H) = 1, and for all (x, x ) H, Px,x (C < ) = 1. Moreover, the set C is a (1, , )-coupling set with x,x = for all (x, x ) C. Let the transition kernel Q be dened by (14.45) if < 1 and by Q = Q Q if = 1. For = 1, Px,x = Px,x . Now assume that (0, 1). For (x, x ) C, Px,x (C = ) = Px,x (C = ). For (x, x ) C, noting that x , A) (1 )2 Q Q(x, x , A) we obtain Q(x, Px,x (C = ) = Px,x (C = | (X1 , X1 ) C C) Q(x, x , C c ) / 2 c / (1 ) Q Q(x, x , C ) Px,x (C = | X1 C) = (1 )2 Px,x (C = ) = 0 . Thus, for all (0, 1] the set C is Harris-recurrent for the kernel Q. This implies that limn Ex,x [Kn ( )] = 0 for all (x, x ) H and, using Proposition 14.2.42, we conclude that (14.40) is true. 14.2.5 Geometric Ergodicity and Foster-Lyapunov Conditions Theorem 14.2.37 implies forgetting of the initial distribution and convergence to stationarity but does not provide us with rates of convergence. In this section, we show how to adapt the construction above to derive explicit bounds on Qn Qn V . We focus on conditions that imply geometric convergence. Denition 14.2.45 (Geometric Ergodicity). A positive aperiodic transition kernel Q with invariant probability measure is said to be V -geometrically ergodic if there exist constants (0, 1) and M < such that Qn (x, )
$- \pi\|_{V} \le M V(x) \rho^n$ for $\pi$-almost all $x$. \hfill (14.55)

We now present conditions that ensure geometric ergodicity.


Definition 14.2.46 (Foster-Lyapunov Drift Condition). A transition kernel $Q$ is said to satisfy a Foster-Lyapunov drift condition outside a set $C \in \mathcal{X}$ if there exists a measurable function $V : \mathsf{X} \to [1, \infty]$, bounded on $C$, and non-negative constants $\lambda < 1$ and $b < \infty$ such that
\[
QV \le \lambda V + b\, \mathbb{1}_C \;.
\tag{14.56}
\]
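As a concrete, purely illustrative check, the drift condition (14.56) can be verified numerically for the first-order autoregression of Example 14.2.3 with Gaussian noise and $V(x) = 1 + x^2$, for which $QV$ has the closed form used below; parameter values are arbitrary.

import numpy as np

# Hypothetical illustration for the AR(1) chain X_{k+1} = phi*X_k + sigma*U_k,
# U_k ~ N(0,1), with drift function V(x) = 1 + x^2 (cf. Examples 14.2.3/14.2.52).
phi, sigma = 0.9, 1.0
V  = lambda x: 1.0 + x**2
QV = lambda x: 1.0 + (phi * x)**2 + sigma**2       # E[V(phi*x + sigma*U)]

lam = (1.0 + phi**2) / 2.0                          # any lambda in (phi^2, 1) works
xs = np.linspace(-20, 20, 2001)
slack = QV(xs) - lam * V(xs)                        # <= 0 outside a compact set C
C_radius = np.max(np.abs(xs[slack > 0]))            # drift holds outside [-r, r]
b = np.max(slack)                                   # b >= sup_C (QV - lam*V)
print(f"drift holds outside [-{C_radius:.2f}, {C_radius:.2f}] with b <= {b:.2f}")
assert np.all(QV(xs) <= lam * V(xs) + b * (np.abs(xs) <= C_radius))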

If Q is phi-irreducible and satises a Foster-Lyapunov condition outside a small set C, then C is accessible and, writing QV V (1 )V + b1C , Proposition 14.2.31 shows that Q is positive and (V ) < . Example 14.2.47 (Random Walk on the Half-Line, Continued). Assume that for the model of Example 14.2.8 there exists z > 0 such that E[ezW1 ] < . Then because < 0, there exists z > 0 such that E[ezW1 ] < 1. Dene z0 = arg minz>0 E[ezW1 ] and V (x) = ez0 x , and choose x0 > 0 such that = E[ez0 W1 ] + P(W1 < x0 ) < 1. Then for x > x0 , QV (x) = E[ez0 (x+W1 )+ ] = P(W1 x) + ez0 x E[ez0 W1 1{W1 >x} ] V (x) . Hence the Foster-Lyapunov drift condition holds outside the small set [0, x0 ], and the RWHL is geometrically ergodic. For a sharper choice of the constants z0 and , see Scott and Tweedie (1996, Theorem 4.1). Example 14.2.48 (Metropolis-Hastings Algorithm, Continued). Consider the Metropolis-Hastings algorithm of Example 14.2.4 with random walk proposal kernel r(x, x ) = r(|x x |). Geometric ergodicity of the MetropolisHastings algorithm on Rd is largely a property of the tails of the stationary distribution . Conditions for geometric ergodicity can be shown to be, essentially, that the tails are exponential or lighter (Mengersen and Tweedie, 1996) and that in higher dimensions the contours of are regular near (see for instance Jarner and Hansen, 2000). To understand how the tail conditions come into play, consider the case where is a probability density on X = R+ . We suppose that is log-concave in the upper tail, that is, that there exists > 0 and M such that for all x x M , log (x) log (x ) (x x) . (14.57)

To simplify the proof, we assume that is non-increasing, but this assumption is unnecessary. Dene Ax = {x R+ : (x ) (x)} and Rx = {x R+ , (x) > (x )}, the acceptance and (possible) rejection regions for the chain started from x. Because is non-increasing, these sets are simple: Ax = [0, x] and Rx = (x, ) (, 0). If we relax the monotonicity conditions, the acceptance and rejection regions become more involved, but because is log-concave and thus in particular monotone in the upper tail, Ax and Rx are essentially intervals when x is suciently large. For any function V : R+ [1, +) and x R+ ,


\[
\frac{QV(x)}{V(x)} = 1 + \int_{A_x} r(x' - x)\, \Big[ \frac{V(x')}{V(x)} - 1 \Big]\, dx'
+ \int_{R_x} r(x' - x)\, \frac{\pi(x')}{\pi(x)}\, \Big[ \frac{V(x')}{V(x)} - 1 \Big]\, dx' \;.
\]
We set $V(x) = e^{s x}$ for some $s \in (0, \alpha)$. Because $\pi$ is log-concave, $\pi(x')/\pi(x) \le e^{-\alpha(x' - x)}$ when $x' \ge x \ge M$. For $x \ge M$, it follows from elementary calculations that
\[
\limsup_{x \to \infty} \frac{QV(x)}{V(x)} \le 1 - \int_0^{\infty} r(u)\, (1 - e^{-s u})\, [1 - e^{-(\alpha - s) u}]\, du < 1 \;,
\]
showing that the random walk Metropolis-Hastings algorithm on the positive real line satisfies the Foster-Lyapunov condition when $\pi$ is monotone and log-concave in the upper tail.

The main result guaranteeing geometric ergodicity is the following.

Theorem 14.2.49. Let $Q$ be a phi-irreducible aperiodic positive transition kernel with invariant distribution $\pi$. Also assume that $Q$ satisfies a Foster-Lyapunov drift condition outside a small set $C$ with drift function $V$. Then $\pi(V)$ is finite and $Q$ is $V$-geometrically ergodic.

In fact, it follows from Meyn and Tweedie (1993, Theorems 15.0.1 and 16.0.1) that the converse is also true: if a phi-irreducible aperiodic kernel is $V$-geometrically ergodic, then there exists an accessible small set $C$ such that $V$ is a drift function outside $C$. For the sake of brevity and simplicity, we now prove Theorem 14.2.49 under the additional assumption that the level sets of $V$ are all $(1, \epsilon, \nu)$-small. In that case, it is possible to define a coupling set $\bar{C}$ and a transition kernel $\bar{Q}$ that satisfies a (bivariate) Foster-Lyapunov drift condition outside $\bar{C}$. The geometric ergodicity of the transition kernel $Q$ is then proved under this assumption. This is the purpose of the following propositions.

Proposition 14.2.50. Let $Q$ be a kernel that satisfies the Foster-Lyapunov drift condition (14.56) with respect to a $(1, \epsilon, \nu)$-small set $C$ and a function $V$ whose level sets are $(1, \epsilon, \nu)$-small. Then for any $d > 1$, the set $\tilde{C} = C \cup \{x \in \mathsf{X} : V(x) \le d\}$ is small, $\tilde{C} \times \tilde{C}$ is a $(1, \epsilon, \nu)$-coupling set, and the kernel $\bar{Q}$, defined as in (14.45), satisfies the drift condition (14.58) with $\bar{C} = \tilde{C} \times \tilde{C}$, $\bar{V}(x, x') = (1/2)[V(x) + V(x')]$, and $\bar{\lambda} = \lambda + b/(1 + d)$ provided $\bar{\lambda} < 1$.

Proof. For $(x, x') \notin \bar{C}$ we have $(1 + d)/2 \le \bar{V}(x, x')$. Therefore
\[
\bar{Q}\bar{V}(x, x') \le \lambda \bar{V}(x, x') + \frac{b}{2} \le \Big( \lambda + \frac{b}{1 + d} \Big) \bar{V}(x, x') \;,
\]
and for $(x, x') \in \bar{C}$ it holds that
\[
\bar{Q}\bar{V}(x, x') = \frac{1}{2(1 - \epsilon)}\, \big[ QV(x) + QV(x') - 2\epsilon\, \nu(V) \big]
\le \frac{\lambda}{1 - \epsilon}\, \bar{V}(x, x') + \frac{b - \epsilon\, \nu(V)}{1 - \epsilon} \;.
\]

Proposition 14.2.51. Assume that $Q$ admits a $(1, \epsilon, \bar{\nu})$-coupling set $\bar{C}$ and that there exists a choice of the kernel $\bar{Q}$ for which there is a measurable function $\bar{V} : \mathsf{X} \times \mathsf{X} \to [1, \infty)$, $\bar{\lambda} \in (0, 1)$ and $\bar{b} > 0$ such that
\[
\bar{Q}\bar{V} \le \bar{\lambda} \bar{V} + \bar{b}\, \mathbb{1}_{\bar{C}} \;.
\tag{14.58}
\]
Let $W : \mathsf{X} \to [1, \infty)$ be a measurable function such that $W(x) + W(x') \le 2 \bar{V}(x, x')$ for all $(x, x') \in \mathsf{X} \times \mathsf{X}$. Then there exist $\rho \in (0, 1)$ and $c > 0$ such that for all $(x, x') \in \mathsf{X} \times \mathsf{X}$,
\[
\|Q^n(x, \cdot) - Q^n(x', \cdot)\|_{W} \le c\, \bar{V}(x, x')\, \rho^n \;.
\tag{14.59}
\]

Proof. By Proposition 14.2.42, proving (14.59) amounts to proving the requested bound for $\bar{\mathrm{E}}_{x,x'}[\bar{V}(\bar{X}_n)\, K_n(\sigma_{\bar{C}})]$. We only consider the case $\epsilon \in (0, 1)$, the case $\epsilon = 1$ being easier. Write $\bar{x} = (x, x')$. By induction, the drift condition (14.58) implies that
\[
\bar{\mathrm{E}}_{\bar{x}}[\bar{V}(\bar{X}_n)] = \bar{Q}^n \bar{V}(\bar{x}) \le \bar{\lambda}^n \bar{V}(\bar{x}) + \bar{b} \sum_{j=0}^{n-1} \bar{\lambda}^j \le \bar{V}(\bar{x}) + \bar{b}/(1 - \bar{\lambda}) \;.
\tag{14.60}
\]
Recall that $K_n(\sigma_{\bar{C}}) = (1 - \epsilon)^{\eta_n(\bar{C})}$ for $\epsilon \in (0, 1)$, where $\eta_n(\bar{C}) = \sum_{j=0}^{n-1} \mathbb{1}_{\bar{C}}(\bar{X}_j)$ is the number of visits to the coupling set $\bar{C}$ before time $n$. Hence $K_n(\sigma_{\bar{C}})$ is $\bar{\mathcal{F}}_{n-1}$-measurable. Let $j \le n + 1$ be an arbitrary positive integer to be chosen later. Then (14.60) yields
\[
\bar{\mathrm{E}}_{\bar{x}}[\bar{V}(\bar{X}_n)\, K_n(\sigma_{\bar{C}})\, \mathbb{1}\{\eta_n(\bar{C}) \ge j\}]
\le (1 - \epsilon)^{j}\, \bar{\mathrm{E}}_{\bar{x}}[\bar{V}(\bar{X}_n)]\, \mathbb{1}\{j \le n\}
\le [\bar{V}(\bar{x}) + \bar{b}/(1 - \bar{\lambda})]\, (1 - \epsilon)^{j}\, \mathbb{1}\{j \le n\} \;.
\tag{14.61}
\]

Put $M = \sup_{\bar{x} \in \bar{C}} \bar{Q}\bar{V}(\bar{x})/\bar{V}(\bar{x})$ and $B = 1 \vee [M(1 - \epsilon)/\bar{\lambda}]$. For $k = 0, \ldots, n$, define $Z_k = \bar{\lambda}^{-k} [(1 - \epsilon)/B]^{\eta_k(\bar{C})}\, \bar{V}(\bar{X}_k)$. Because $\eta_n(\bar{C})$ is $\bar{\mathcal{F}}_{n-1}$-measurable, we obtain
\[
\bar{\mathrm{E}}_{\bar{x}}[Z_n \mid \bar{\mathcal{F}}_{n-1}]
= \bar{\lambda}^{-n}\, \bar{Q}\bar{V}(\bar{X}_{n-1})\, [(1 - \epsilon)/B]^{\eta_n(\bar{C})}
\le \bar{\lambda}^{-n+1}\, \bar{V}(\bar{X}_{n-1})\, [(1 - \epsilon)/B]^{\eta_n(\bar{C})}\, \mathbb{1}_{\bar{C}^c}(\bar{X}_{n-1})
+ \bar{\lambda}^{-n} M\, \bar{V}(\bar{X}_{n-1})\, [(1 - \epsilon)/B]^{\eta_n(\bar{C})}\, \mathbb{1}_{\bar{C}}(\bar{X}_{n-1}) \;.
\]
Using the relations $\eta_n(\bar{C}) = \eta_{n-1}(\bar{C}) + \mathbb{1}_{\bar{C}}(\bar{X}_{n-1})$ and $M(1 - \epsilon) \le \bar{\lambda} B$, we find that $\bar{\mathrm{E}}_{\bar{x}}[Z_n \mid \bar{\mathcal{F}}_{n-1}] \le Z_{n-1}$ and, by induction, $\bar{\mathrm{E}}_{\bar{x}}[Z_n] \le \bar{\mathrm{E}}_{\bar{x}}[Z_0] = \bar{V}(\bar{x})$. Hence, as $B \ge 1$,


\[
\bar{\mathrm{E}}_{\bar{x}}[\bar{V}(\bar{X}_n)\, K_n(\sigma_{\bar{C}})\, \mathbb{1}\{\eta_n(\bar{C}) < j\}] \le \bar{\lambda}^n B^j\, \bar{\mathrm{E}}_{\bar{x}}[Z_n] \le \bar{\lambda}^n B^j\, \bar{V}(\bar{x}) \;.
\tag{14.62}
\]
Gathering (14.61) and (14.62) yields
\[
\bar{\mathrm{E}}_{\bar{x}}[\bar{V}(\bar{X}_n)\, K_n(\sigma_{\bar{C}})] \le [\bar{V}(\bar{x}) + \bar{b}/(1 - \bar{\lambda})]\, \big[ (1 - \epsilon)^j\, \mathbb{1}\{j \le n\} + \bar{\lambda}^n B^j \big] \;.
\]
If $B = 1$, choosing $j = n + 1$ yields (14.59) with $\rho = \bar{\lambda}$, and if $B > 1$ then set $j = [\beta n]$ with $\beta \in (0, 1)$ such that $\log(\bar{\lambda}) + \beta \log(B) < 0$; this choice yields (14.59) with $\rho = (1 - \epsilon)^{\beta} \vee (\bar{\lambda} B^{\beta}) < 1$.

Example 14.2.52 (Autoregressive Model, Continued). In the model of Example 14.2.3, we have verified that $V(x) = 1 + x^2$ satisfies (14.56) when the noise variance is finite. We can deduce from Theorem 14.2.49 a variety of results: the stationary distribution has finite variance and the iterates $Q^n(x, \cdot)$ of the transition kernel converge to the stationary distribution geometrically fast in $V$-total variation distance. Thus there exist constants $C$ and $\rho < 1$ such that for any $x \in \mathsf{X}$, $\|Q^n(x, \cdot) - \pi\|_V \le C (1 + x^2) \rho^n$. This implies in particular that for any $x \in \mathsf{X}$ and any function $f$ such that $\sup_{x \in \mathsf{X}} (1 + x^2)^{-1} |f(x)| < \infty$, $\mathrm{E}_x[f(X_n)]$ converges to the limiting value
\[
\mathrm{E}_{\pi}[f(X_n)] = \frac{1}{\sqrt{2\pi\sigma^2/(1 - \phi^2)}} \int \exp\Big( -\frac{(1 - \phi^2) x^2}{2\sigma^2} \Big) f(x)\, dx
\]

geometrically fast. This applies for the mean, f (x) = x, and the second moment, f (x) = x2 (though in this case convergence can be derived directly from the autoregression). 14.2.6 Limit Theorems One of the most important problems in probability theory is the investigation of limit theorems for appropriately normalized sums of random variables. The case of independent random variables is fairly well understood, but less is known about dependent random variables such as Markov chains. The purpose of this section is to study several basic limit theorems for additive functionals of Markov chains. 14.2.6.1 Law of Large Numbers Suppose that {Xk } is a Markov chain with transition kernel Q and initial distribution . Assume that Q is phi-irreducible and aperiodic and has a stationary distribution . Let f be a -integrable function; (|f |) < . We say that the sequence {f (Xk )} satises a law of large numbers (LLN) if for any n initial distribution on (X, X ), the sample mean n1 k=1 f (Xk ) converges to (f ) P -a.s. For i.i.d. samples, classical theory shows that the LLN holds provided (|f |) < . The following theorem shows that the LLN holds for ergodic Markov chains; it does not require any conditions on the rate of convergence to the stationary distribution.


Theorem 14.2.53. Let Q be a positive Harris recurrent transition kernel with invariant distribution . Then for any real -integrable function f on X and any initial distribution on (X, X ),
\[
\lim_{n \to \infty} n^{-1} \sum_{k=1}^{n} f(X_k) = \pi(f) \quad \mathrm{P}_{\nu}\text{-a.s.}
\tag{14.63}
\]
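For illustration, the following sketch checks (14.63) by simulation for the AR(1) chain of Example 14.2.3 (our choice of example and parameters), with $f(x) = x^2$; the limit $\pi(f)$ is $\sigma^2/(1 - \phi^2)$ since the stationary law is Gaussian.

import numpy as np

rng = np.random.default_rng(1)
phi, sigma, n = 0.8, 1.0, 200_000
x, running_sum = 50.0, 0.0            # start far from stationarity on purpose
for _ in range(n):
    x = phi * x + sigma * rng.standard_normal()
    running_sum += x**2
print(running_sum / n, sigma**2 / (1 - phi**2))   # both close to 2.78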

The LLN can be obtained from general ergodic theorems for stationary processes. An elementary proof can be given when the chain possesses an accessible atom. The basic technique is then the regeneration method, which consists in dividing the chain into blocks between the chains successive returns to the atom. These blocks are independent (see Lemma 14.2.54 below) and standard limit theorems for i.i.d. random variables yield the desired result. When the chain has no atom, one may still employ this technique by replacing the atom by a suitably chosen small set and using the splitting technique (see for instance Meyn and Tweedie, 1993, Chapter 17). Lemma 14.2.54. Let Q be a positive Harris recurrent transition kernel that admits an accessible atom . Dene for any measurable function f ,

\[
s_j(f) = \sum_{k=\sigma^{(j-1)}+1}^{\sigma^{(j)}} f(X_k) \;, \qquad j \ge 1 \;.
\tag{14.64}
\]

Then for any initial distribution $\nu$ on $(\mathsf{X}, \mathcal{X})$, $k \ge 0$ and functions $\{\phi_j\}$ in $\mathrm{F}_b(\mathbb{R})$,
\[
\mathrm{E}_{\nu}\Big[ \prod_{j=1}^{k} \phi_j(s_j(f)) \Big] = \mathrm{E}_{\nu}[\phi_1(s_1(f))]\, \prod_{j=2}^{k} \mathrm{E}_{\alpha}[\phi_j(s_j(f))] \;.
\]

Proof. Because the atom is accessible and the chain is Harris recurrent, (k) Px ( < ) = 1 for any x X. By the strong Markov property, for any integer k, E [1 (s1 (f )) k (sk (f ))] = E [1 (s1 (f )) k1 (sk1 (f )) E [k (sk (f )) | F (k1) ]1{ (k1) <} ]

= E [1 (s1 (f )) k1 (sk1 (f ))] E [k (s1 (f ))] . The desired result in then obtained by induction. Proof (of Theorem 14.2.53 when there is an accessible atom). First assume that f is non-negative. Denote the accessible atom by and dene
\[
\eta_n = \sum_{k=1}^{n} \mathbb{1}_{\alpha}(X_k) \;,
\tag{14.65}
\]


the occupation time of the atom up to time $n$. We now split the sum $\sum_{k=1}^{n} f(X_k)$ into sums over the excursions between successive visits to $\alpha$,
\[
\sum_{k=1}^{n} f(X_k) = \sum_{j=1}^{\eta_n} s_j(f) + \sum_{k=\sigma^{(\eta_n)}+1}^{n} f(X_k) \;.
\]
This decomposition shows that
\[
\sum_{j=1}^{\eta_n} s_j(f) \le \sum_{k=1}^{n} f(X_k) \le \sum_{j=1}^{\eta_n + 1} s_j(f) \;.
\tag{14.66}
\]

Because Q is Harris recurrent and is accessible, n P -a.s. as n . Hence s1 (f )/n 0 and (n 1)/n 1 P -a.s. By Lemma 14.2.54 the variables {sj (f )}j2 are i.i.d. under P . In addition E [sj (f )] = (f ) for j 2 with , dened in (14.30), being an invariant measure. Because all invariant measures are constant multiples of and (|f |) < , E [sj (f )] is nite. Writing 1 n
n

sj (f ) =
j=1

s1 (f ) n 1 1 + n n n 1

sj (f ) ,
j=2

the LLN for i.i.d. random variables shows that 1 n n lim


n

sj (f ) = (f )
j=1

P -a.s. ,
n

1 whence, by (14.66), the same limit holds for n 1 f (Xk ). Because (1) = 1, (1) is nite too. Applying the above result with f 1 yields n/n (1), n so that n1 1 f (Xk ) (f )/ (1) = (f ) P -a.s. This is the desired result when f 0. The general case is is handled by splitting f into its positive and negative parts.

14.2.6.2 Central Limit Theorems We say that {f (Xk )} satises a central limit theorem (CLT) if there is a conn stant 2 (f ) 0 such that the normalized sum n1/2 k=1 {f (Xk ) (f )} converges P -weakly to a Gaussian distribution with zero mean and variance 2 (f ) (we allow for the special case 2 (f ) = 0 corresponding to weak convergence to the constant 0). CLTs are essential for understanding the error n occurring when approximating (f ) by the sample mean n1 k=1 f (Xk ) and are thus a topic of considerable importance. For i.i.d. samples, classical theory guarantees a CLT as soon as (|f |2 ) < . This is not true in general for Markov chains; the CLTs that are available do require some additional assumptions on the rate of convergence and/or the existence of higher order moments of f under the stationary distribution.


Theorem 14.2.55. Let Q be a phi-irreducible aperiodic positive Harris recurrent transition kernel with invariant distribution . Let f be a measurable function and assume that there exists an accessible small set C satisfying
\[
\int_C \pi(dx)\, \mathrm{E}_x\Big[ \Big( \sum_{k=1}^{\sigma_C} |f|(X_k) \Big)^2 \Big] < \infty
\quad \text{and} \quad
\int_C \pi(dx)\, \mathrm{E}_x[\sigma_C^2] < \infty \;.
\tag{14.67}
\]
Then $\pi(f^2) < \infty$ and $\{f(X_k)\}$ satisfies a CLT.

Proof. To start with, it follows from the expression (14.32) for the stationary distribution that
\[
\pi(f^2) = \int_C \pi(dx)\, \mathrm{E}_x\Big[ \sum_{k=1}^{\sigma_C} f^2(X_k) \Big]
\le \int_C \pi(dx)\, \mathrm{E}_x\Big[ \Big( \sum_{k=1}^{\sigma_C} |f(X_k)| \Big)^2 \Big] < \infty \;.
\]

We now prove the CLT under the additional assumption that the chain admits an accessible atom . The proof in the general phi-irreducible case can be obtained using the splitting construction. The proof is along the same lines n as for the LLN. Put f = f (f ). By decomposing the sum k=1 f (Xk ) into excursions between successive visits to the atom , we obtain
\[
\Big| n^{-1/2} \sum_{k=1}^{n} \bar{f}(X_k) - n^{-1/2} \sum_{j=2}^{\eta_n} s_j(\bar{f}) \Big|
\le n^{-1/2} s_1(|\bar{f}|) + n^{-1/2} s_{\eta_n + 1}(|\bar{f}|) \;,
\tag{14.68}
\]

where n and sj (f ) are dened in (14.65) and (14.64). It is clear that the rst term on the right-hand side of this display vanishes (in P -probability) as n . For the second one, the strong LLN (Theorem 14.2.53) shows that n n1 1 s2 (|f |) has an P -a.s. nite limit, whence, P -a.s., j n n+1 2 1 n+1 1 s (|f |) s2 (|f |) s2 (|f |) = 0 . = lim sup lim sup n n n j=1 j n n + 1 j=1 j n n The strong LLN with f = 1 also shows that n /n () P -a.s., so that s2n (|f |)/n 0 and n1/2 sn +1 (|f |) 0 P -a.s. n Thus n1/2 1 f (Xk ) and n1/2 2n sj (f ) have the same limiting be2 havior. By Lemma 14.2.54, the blocks {sj (|f |)}j2 are i.i.d. under P . Thus, n by the CLT for i.i.d. random variables, n1/2 2 sj (f ) converges P -weakly to a Gaussian law with zero mean and some variance 2 < ; that the variance is indeed nite follows as above with the small set C being the accessible atom . The so-called Ascombes theorem (see for instance Gut, 1/2 n 1988, Theorem 3.1) then implies that n 2 f (Xk ) converges P -weakly to the same Gaussian law. Thus we may conclude that n1/2 2n f (Xk ) = 1/2 n (n /n)1/2 n 2 f (Xk ) converges P -weakly to a Gaussian law with zero n mean and variance () 2 . By (14.68), so does n1/2 1 f (Xk ).
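A quick Monte Carlo illustration of the CLT (again for the AR(1) chain of Example 14.2.3, an assumption of ours): for $f(x) = x$ the asymptotic variance can be computed from the autocovariances as $\sigma^2/(1-\phi)^2$, and the empirical variance of the normalized sums should be close to it.

import numpy as np

rng = np.random.default_rng(2)
phi, sigma, n, reps = 0.5, 1.0, 5_000, 2_000
x = np.zeros(reps)                     # reps independent chains, run in parallel
s = np.zeros(reps)
for _ in range(n):
    x = phi * x + sigma * rng.standard_normal(reps)
    s += x
vals = s / np.sqrt(n)                  # n^{-1/2} * sum_k f(X_k) for each replication
print(vals.var(), sigma**2 / (1 - phi)**2)   # empirical vs asymptotic variance (~4)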


The condition (14.67) is stated in terms of the second moment of the excursion between two successive visits to a small set and appears rather difficult to verify directly. More explicit conditions can be obtained, in particular if we assume that the chain is $V$-geometrically ergodic.

Proposition 14.2.56. Let $Q$ be a phi-irreducible, aperiodic, positive Harris recurrent kernel that satisfies a Foster-Lyapunov drift condition (see Definition 14.2.46) outside an accessible small set $C$, with drift function $V$. Then any measurable function $f$ such that $|f|^2 \le V$ satisfies a CLT.

Proof. Minkowski's inequality implies that
\[
\Big( \mathrm{E}_x\Big[ \Big( \sum_{k=0}^{\sigma_C - 1} |f(X_k)| \Big)^2 \Big] \Big)^{1/2}
\le \sum_{k=0}^{\infty} \big( \mathrm{E}_x[f^2(X_k)\, \mathbb{1}\{\sigma_C > k\}] \big)^{1/2}
\le \sum_{k=0}^{\infty} \big( \mathrm{E}_x[V(X_k)\, \mathbb{1}\{\sigma_C > k\}] \big)^{1/2} \;.
\]
Put $M_k = \lambda^{-k} V(X_k)\, \mathbb{1}\{\sigma_C \ge k\}$, where $\lambda$ is as in (14.56). Then for $k \ge 1$,
\[
\mathrm{E}[M_{k+1} \mid \mathcal{F}_k] \le \lambda^{-(k+1)}\, \mathrm{E}[V(X_{k+1}) \mid \mathcal{F}_k]\, \mathbb{1}\{\sigma_C \ge k+1\}
\le \lambda^{-k} V(X_k)\, \mathbb{1}\{\sigma_C \ge k+1\} \le M_k \;,
\]
showing that $\{M_k\}$ is a super-martingale. Thus $\mathrm{E}_x[M_k] \le \mathrm{E}_x[M_1]$ for any $x \in C$, which implies that for $k \ge 1$,
\[
\sup_{x \in C} \mathrm{E}_x[V(X_k)\, \mathbb{1}\{\sigma_C \ge k\}] \le \lambda^{k} \sup_{x \in C} V(x) + b \;.
\]

14.3 Applications to Hidden Markov Models


As discussed in Section 2.2, an HMM is best dened as a Markov chain {Xk , Yk }k0 on the product space (X Y, X Y). The transition kernel of this joint chain has a simple structure reecting the conditional independence assumptions that are imposed. Let Q and G denote, respectively, a Markov transition kernel on (X, X ) and a transition kernel from (X, X ) to (Y, Y). The transition kernel of the joint chain {Xk , Yk }k0 is given by, for any (x, y) X Y, T [(x, y), C] =
\[
\iint_C Q(x, dx')\, G(x', dy') \;, \qquad (x, y) \in \mathsf{X} \times \mathsf{Y},\ C \in \mathcal{X} \otimes \mathcal{Y} \;.
\tag{14.69}
\]


This chain is said to be hidden because only a component (here {Yk }k0 ) is observed. Of course, the process {Yk } is not a Markov chain, but nevertheless most of the properties of this process are inherited from stability properties of the hidden chain. In this section, we establish stability properties of the kernel T of the joint chain. 14.3.1 Phi-irreducibility Phi-irreducibility of the joint chain T is inherited from irreducibility of the hidden chain, and the maximal irreducibility measures of the joint and hidden chains are related in a simple way. Before stating the precise result, we recall (see Section 2.1.1) that if is a measure on (X, X ), we dene the measure G on (X Y, X Y) by G(A) =
\[
\iint_A \mu(dx)\, G(x, dy) \;, \qquad A \in \mathcal{X} \otimes \mathcal{Y} \;.
\]

Proposition 14.3.1. Assume that Q is phi-irreducible, and let be an irreducibility measure for Q. Then G is an irreducibility measure for T . If is a maximal irreducibility measure for Q, then G is a maximal irreducibility measure for T . Proof. Let A X Y be a set such that G(A) > 0. Denote by A the function A (x) = Y G(x, dy) 1A (x, y) for x X. By Fubinis theorem, G(A) = (dx) G(x, dy) 1A (x, y) = (dx) A (x) ,

and the condition G(A) > 0 implies that ({A > 0}) > 0. Because {A > 0} = m=0 {A 1/m}, we have ({A 1/m}) > 0 for some integer m. Because is an irreducibility measure, for any x X there exists an integer k 0 such that Qk (x, {A 1/m}) > 0. Therefore for any y Y, T k [(x, y), A] =
{A 1/m}

Qk (x, dx ) G(x , dy ) 1A (x , y ) = Qk (x, dx ) A (x )

Qk (x, dx ) A (x )

1 k Q (x, {A 1/m}) > 0 , m

showing that G is an irreducibility measure for T . Morever, using Theorem 14.2.2, we see that a maximal irreducibility measure T for T is given by, for any (0, 1) and A X Y,

T (A) = = =

(dx) G(x, dy) (1 )


m=0

m T m [(x, y), A]

(1 )
m=0

(dx) Qm (x, dx ) G(x , dy ) 1A (x , y )

(dx ) G(x , dy ) 1A (x , y ) = G(A) ,


where (B) = (dx) (1 )

m Qm (x, B) ,
m=0

BX .

By Theorem 14.2.2, is a maximal irreducibility measure for Q. In addition, if is a maximal irreducibility measure for Q, then is equivalent to . Because for any A X Y, G(A) = (dx) G(x, dy) 1A (x, y) = G(dx, dy) d (x)1A (x, y) , d

G(A) = 0 whenever G(A) = 0. Thus G G. Exchanging shows that G and G are indeed equivalent, which concludes and the proof. Example 14.3.2 (Normal HMM). Consider a normal HMM (see Section 1.3.2). In this case, the state space X of the hidden chain is nite, X = {1, 2, . . . , r} and Y = R. The hidden chain is governed by a transition matrix Q = [Q(x, y)]1x,yr . Conditionally on the state x X, the distribution 2 of the observation is Gaussian with mean x and variance x . Hence the transition kernel T for the joint Markov chain is given by, for any (x, y) X Y and A B(R), T [(x, y), {x } A] = Q(x, x )
\[
\int_A \frac{1}{\sqrt{2\pi\sigma_{x'}^2}} \exp\Big( -\frac{(y' - \mu_{x'})^2}{2\sigma_{x'}^2} \Big)\, dy' \;.
\]

If the transition matrix $Q$ is irreducible (all states in $\mathsf{X}$ communicate), then $Q$ is positive. For any $x \in \mathsf{X}$, $\delta_x$ is an irreducibility measure for $Q$ and $T$ is phi-irreducible with irreducibility measure $\delta_x \otimes G$. Denote by $\pi$ the unique invariant probability measure for $Q$. Then $\pi$ is also a maximal irreducibility measure, whence $\pi \otimes G$ is a maximal irreducibility measure for $T$.

Example 14.3.3 (Stochastic Volatility Model). The canonical stochastic volatility model (see Example 1.3.13) is given by
\[
X_{k+1} = \phi X_k + \sigma U_k \;, \qquad U_k \sim \mathrm{N}(0, 1) \;,
\]
\[
Y_k = \beta \exp(X_k / 2)\, V_k \;, \qquad V_k \sim \mathrm{N}(0, 1) \;.
\]
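A minimal simulation sketch of the joint chain $\{X_k, Y_k\}$, sampling according to the kernel $T$ of (14.69): first the hidden move $X_{k+1} \sim Q(X_k, \cdot)$, then the observation $Y_{k+1} \sim G(X_{k+1}, \cdot)$. Parameter values and the function name are illustrative only.

import numpy as np

def simulate_sv(n, phi=0.95, sigma=0.3, beta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x, y = np.empty(n), np.empty(n)
    x[0] = rng.standard_normal() * sigma / np.sqrt(1 - phi**2)  # AR(1) stationary start
    y[0] = beta * np.exp(x[0] / 2) * rng.standard_normal()
    for k in range(1, n):
        x[k] = phi * x[k - 1] + sigma * rng.standard_normal()   # hidden AR(1) move
        y[k] = beta * np.exp(x[k] / 2) * rng.standard_normal()  # observation given new state
    return x, y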

We have established (see Example 14.2.3) that because $\{U_k\}$ has a positive density on $\mathbb{R}$, the chain $\{X_k\}$ is phi-irreducible and $\lambda^{\mathrm{Leb}}$ is an irreducibility measure. Therefore $\{X_k, Y_k\}$ is also phi-irreducible and $\lambda^{\mathrm{Leb}} \otimes \lambda^{\mathrm{Leb}}$ is a maximal irreducibility measure.

14.3.2 Atoms and Small Sets

It is possible to relate atoms and small sets of the joint chain to those of the hidden chain. Examples of HMMs possessing accessible atoms are numerous, even when the state space of the joint chain is general. They include in particular the Markov chains whose hidden state space $\mathsf{X}$ is finite.


Example 14.3.4 (Normal HMM, Continued). For the normal HMM (see Example 14.3.2), it holds that T [(x, y), ] = T [(x, y ), ] for any (y, y ) R R. Hence {x} R is an atom for T . When accessible atoms do not exist, it is important to determine small sets. Here again the small sets of the joint chain can easily be related to those of the hidden chain. Lemma 14.3.5. Let m be a positive integer, > 0 and let be a probability measure on (X, X ). Let C X be an (m, , )-small set for the transition kernel Q, that is, Qm (x, A) 1C (x)(A) for all x X and A X . Then C Y is an (m, , G)-small set for the transition kernel T dened in (2.14), that is, T m [(x, y), A]

$\ge \epsilon\, \mathbb{1}_C(x)\, \nu \otimes G(A) \;, \qquad (x, y) \in \mathsf{X} \times \mathsf{Y},\ A \in \mathcal{X} \otimes \mathcal{Y} \;.$

Proof. Pick $(x, y) \in C \times \mathsf{Y}$. Then
\[
T^m[(x, y), A] = \iint Q^m(x, dx')\, G(x', dy')\, \mathbb{1}_A(x', y') \ge \epsilon \iint \nu(dx')\, G(x', dy')\, \mathbb{1}_A(x', y') \;.
\]

If the Markov transition kernel Q on (X, X ) is phi-irreducible (with maximal irreducibility measure ), then we know from Proposition 14.2.12 that there exists an accessible small set C. That is, there exists a set C X with Px (C < ) > 0 for all x X and such that C is (m, , )-small for some triple (m, , ) with (C) > 0. Then Lemma 14.3.5 shows that C Y is an (m, , G)-small set for the transition kernel T . Example 14.3.6 (Stochastic Volatility Model, Continued). We have shown in Example 14.2.3 that any compact set K R is small for the rstorder autoregression constituting the hidden chain of the stochastic volatility model of Example 14.3.3. Therefore any set K R, where K a compact subset of R, is small for the joint chain {Xk , Yk }. The simple relations between the small sets of the joint chain and those of the hidden chain immediately imply that T and Q have the same period. Proposition 14.3.7. Suppose that Q is phi-irreducible and has period d. Then T is phi-irreducible and has the same period d. In particular, if Q is aperiodic, then so is T . Proof. Let C be an accessible (m, , )-small set for Q with (C) > 0. Dene EC as the set of time indices for which C is a small set with minorizing probability measure ,


\[
E_C \stackrel{\mathrm{def}}{=} \{ n \ge 0 : C \text{ is } (n, \epsilon_n, \nu)\text{-small for some } \epsilon_n > 0 \} \;.
\]

The period of the set C is given by the greatest common divisor of EC . Proposition 14.2.35 shows that this value is in fact common to the chain as such and does not depend on the particular small set chosen. Lemma 14.3.5 shows that C Y is an (m, , G)-small set for the joint Markov chain with transition kernel T , and that G(C Y) = (C) > 0. The set ECY of time indices for which C Y is a small set for T with minorizing measure G is thus, using Lemma 14.3.5 again, equal to EC . Thus the period of the set C is also the period of the set C Y. Because the period of T does not depend on the choice of the small set C Y, it follows that the periods of Q and T coincide. 14.3.3 Recurrence and Positive Recurrence As the following result shows, recurrence and transience of the joint chain follows directly from the corresponding properties of the hidden chain. Proposition 14.3.8. Assume that the hidden chain is phi-irreducible. Then the following statements hold true. (i) The joint chain is transient (recurrent) if and only if the hidden chain is transient (recurrent). (ii) The joint chain is positive if and only if the hidden chain is positive. In addition, if the hidden chain is positive with stationary distribution , then G is the stationary distribution of the joint chain. Proof. First assume that the transition kernel Q is transient, that is, that there is a countable cover X = i Ai of X with uniformly transient sets,

\[
\sup_{x \in A_i} \mathrm{E}_x\Big[ \sum_{n=1}^{\infty} \mathbb{1}_{A_i}(X_n) \Big] < \infty \;.
\]

Then the sets {Ai Y}i1 form a countable cover of X Y, and these sets are uniformly transient because

\[
\mathrm{E}_x\Big[ \sum_{n=1}^{\infty} \mathbb{1}_{A_i \times \mathsf{Y}}(X_n, Y_n) \Big] = \mathrm{E}_x\Big[ \sum_{n=1}^{\infty} \mathbb{1}_{A_i}(X_n) \Big] \;.
\tag{14.70}
\]

Thus the joint chain is transient. Conversely, assume that the joint chain is transient. Because the hidden chain is phi-irreducible, Proposition 14.2.13 shows that there is a countable cover X = i Ai of X with sets that are small for Q. At least one of these, say A1 , is accessible for Q. By Lemma 14.3.5, the sets Ai Y are small. By Proposition 14.3.1, A1 Y is accessible and, because T is transient, Proposition 14.2.14 shows that A1 Y is uniformly transient. Equation (14.70) then


shows that A1 is uniformly transient, and because A1 is accessible, we conclude that Q is transient. Thus the hidden chain is transient if and only if the joint chain is so. The transience/recurrence dichotomy (Theorem 14.2.6) then implies that the hidden chain is recurrent if and only if the joint chain is so, which completes the proof of (i). We now turn to (ii). First assume that the hidden chain is positive recurrent, that is, that there exists a unique stationary probability measure satisfying Q = . Then the probability measure G is stationary for the transition kernel T of the joint chain, because ( G)T (A) = = = (dx) G(x, dy) Q(x, dx ) G(x , dy ) 1A (x , y ) (dx) Q(x, dx ) G(x , dy ) 1A (x , y ) (dx ) G(x , dy ) 1A (x , y ) = G(A) .

Because the joint chain admits a stationary distribution it is positive, and by Proposition 14.2.34 it is recurrent. Conversely, assume that the joint chain is positive. Denote by the (unique) stationary probability measure of T . Thus for any A X Y, we have (dx, dy) Q(x, dx ) G(x , dy ) 1A (x , y ) = (dx, Y) Q(x, dx ) G(x , dy ) 1A (x , y ) = (A) .

Setting A = A Y for A X , this display implies that (dx, Y) Q(x, A) = (A Y) . This shows that (A) = (A Y) is a stationary distribution for the hidden chain. Hence the hidden chain is positive and recurrent. When the joint (or hidden) chain is positive, it is natural to study the rate at which it converges to stationarity. Proposition 14.3.9. Assume that the hidden chain satises a uniform Doeblin condition, that is, there exists a positive integer m, > 0 and a family {x,x , (x, x ) X X} of probability measures such that Qm (x, A) Qm (x , A) x,x (A), A X , (x, x ) X X .

Then the joint chain also satises a uniform Doeblin condition. Indeed, for all (x, y) and (x , y ) in X Y and all A X Y,


T m [(x, y), A] T m [(x , y ), A] x,x (A) , where x,x (A) = x,x (dx) G(x, dy) 1A (x, y) .

The proof is along the same lines as the proof of Lemma 14.3.5 and is omitted. This proposition in particular implies that the ergodicity coefficients for the kernels $T^m$ and $Q^m$ coincide. A straightforward but useful application of this result is when the hidden Markov chain is defined on a finite state space. If the transition matrix $Q$ of this chain is primitive, that is, there exists a positive integer $m$ such that $Q^m(x, x') > 0$ for all $(x, x') \in \mathsf{X} \times \mathsf{X}$ (or, equivalently, if the chain $Q$ is irreducible and aperiodic), then the joint Markov chain satisfies a uniform Doeblin condition and the ergodicity coefficient of the joint chain is bounded by $1 - \epsilon$ with
\[
\epsilon = \inf_{(x, x') \in \mathsf{X} \times \mathsf{X}}\ \sup_{x'' \in \mathsf{X}}\ [Q^m(x, x'') \wedge Q^m(x', x'')] \;.
\]
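For a finite hidden state space, such a minorization constant can be computed directly from the $m$-step transition matrix; the sketch below (our naming, and assuming the bound reads as reconstructed above) does this for an arbitrary primitive matrix.

import numpy as np

def doeblin_eps(Qm):
    # epsilon = min over state pairs (x, x') of max over x'' of min(Qm[x,x''], Qm[x',x''])
    n = Qm.shape[0]
    eps = np.inf
    for x in range(n):
        for xp in range(n):
            eps = min(eps, np.max(np.minimum(Qm[x], Qm[xp])))
    return eps

Q = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
print(doeblin_eps(np.linalg.matrix_power(Q, 2)))   # epsilon for m = 2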

A similar result holds when the hidden chain satises a Foster-Lyapunov drift condition instead of a uniform Doeblin condition. This result is of particular interest when dealing with hidden Markov models on state spaces that are not nite or bounded. Proposition 14.3.10. Assume that Q is phi-irreducible, aperiodic, and satises a Foster-Lyapunov drift condition (Denition 14.2.46) with drift function V outside a set C. Then the transition kernel T also satises a FosterLyapunov drift condition with drift function V outside the set C Y, T [(x, y), V ] V (x) + b1CY (x, y) . Here on the left-hand side, we wrote V also for a function on X Y dened by V (x, y) = V (x). The proof is straightforward. Proposition 14.2.50 yields an explicit bound on the rate of convergence of the iterates of the Markov chain to the stationary distribution. This result has a lot of interesting consequences. Proposition 14.3.11. Suppose that Q is phi-irreducible, aperiodic, and satises a Foster-Lyapunov drift condition with drift function V outside a small set C. Then the transition kernel T is positive and aperiodic with invariant distribution G, where is the invariant distribution of Q. In addition, for any measurable function f : X Y R, the following statements hold true. (i) If supxX [V (x)]1 G(x, dy) |f (x, y)| < , then there exist (0, 1) and K < (not depending on f ) such that for any n 0 and (x, y) X Y, |T n f (x, y) G(f )| Kn V (x) sup [V (x )]1 intG(x , dy) |f (x , y)| .
x X


(ii) If supxX [V (x)]1 G(x, dy) f 2 (x, y) < , then EG [f 2 (X0 , Y0 )] < and there exist (0, 1) and K < (not depending on f ) such that for any n 0, |Cov [f (Xn , Yn ), f (X0 , Y0 )]|
\[
\le K \rho^n\, \pi(V)\, \Big( \sup_{x \in \mathsf{X}} [V(x)]^{-1/2} \int G(x, dy)\, |f(x, y)| \Big)^{2} \;.
\]

Proof. First note that |T n f (x, y) G(f )| = [Qn (x, dx ) (dx )]G(x , dy ) f (x , y )
\[
\le \|Q^n(x, \cdot) - \pi\|_{V}\ \sup_{x' \in \mathsf{X}} [V(x')]^{-1} \int G(x', dy)\, |f(x', y)| \;.
\]

Now part (i) follows from the geometric ergodicity of $Q$ (Theorem 14.2.49). Next, because $\pi(V) < \infty$,
\[
\mathrm{E}_{\pi \otimes G}[f^2(X_0, Y_0)] = \iint \pi(dx)\, G(x, dy)\, f^2(x, y)
\le \pi(V)\, \sup_{x \in \mathsf{X}} [V(x)]^{-1} \int G(x, dy)\, f^2(x, y) < \infty \;,
\]

implying that | Cov [|f (Xn , Yn )|, |f (X0 , Y0 )|]| Var [f (X0 , Y0 )] < . In addition Cov [f (Xn , Yn ), f (X0 , Y0 )] = E {E[f (Xn , Yn ) G(f ) | F0 ]f (X0 , Y0 )} = G(dx, dy) f (x, y) [Qn (x, dx ) (dx )] G(x , dy ) f (x , y ) . (14.71) By Jensens inequality G(x, dy) |f (x, y)| [ G(x, dy) f 2 (x, y)]1/2 and

QV 1/2 (x) [QV (x)]1/2 [V (x) + b1C (x)]1/2 1/2 V 1/2 (x) + b1/2 1C (x) , showing that Q also satises a Foster-Lyapunov condition outside C with drift function V 1/2 . By Theorem 14.2.49, there exists (0, 1) and a constant K such that [Qn (x, dx ) (dx)] G(x , dy ) f (x , y ) Qn (x, )
V 1/2

sup V 1/2 (x)


x X

G(x , dy) |f (x , y)|

Kn V 1/2 (x) sup V 1/2 (x )


x X

G(x , dy) |f (x , y)| .

Part (ii) follows by plugging this bound into (14.71).


Example 14.3.12 (Stochastic Volatility Model, Continued). In the 2 2 model of Example 14.3.3, we set V (x) = ex /2 for > U . It is easily shown that x2 2 (2 + 2 ) , exp QV (x) = U 2 2 2
2 2 where 2 = U 2 /( 2 U ). We may choose large enough that 2 (2 + 2 2 )/ < 1. Then lim sup|x| QV (x)/V (x) = 0 so that Q satises a Foster-

Lyapunov condition with drift function V (x) = ex /2 outside a compact set [M, +M ]. Because every compact set is small, the assumptions of Proposition 14.3.11 are satised, showing that the joint chain is positive. Set f (x, y) = |y|. Then G(x, dy) |y| = ex/2 2/. Proposition 14.3.11(ii) shows that Var (Y0 ) < and that the autocovariance function Cov(|Yn |, |Y0 |) decreases to zero exponentially fast.
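The exponential decay of $\mathrm{Cov}(|Y_n|, |Y_0|)$ asserted above can be checked empirically by simulating a long path of the stochastic volatility chain; parameters below are illustrative and the estimates are of course noisy at large lags.

import numpy as np

rng = np.random.default_rng(3)
phi, sigma, beta, n = 0.9, 0.5, 1.0, 500_000
x = np.empty(n)
x[0] = rng.standard_normal() * sigma / np.sqrt(1 - phi**2)
for k in range(1, n):
    x[k] = phi * x[k - 1] + sigma * rng.standard_normal()
y = beta * np.exp(x / 2) * rng.standard_normal(n)
a = np.abs(y) - np.abs(y).mean()
for lag in (1, 2, 5, 10, 20):
    print(lag, np.mean(a[:-lag] * a[lag:]))   # magnitudes shrink roughly geometrically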

15 An Information-Theoretic Perspective on Order Estimation

Statistical inference in hidden Markov models with nite state space X has to face a serious problem: order identication. The order of an HMM {Yk }k1 over Y (in this chapter, we let indices start at 1) is the minimum size of the hidden state space X of an HMM over (X, Y) that can generate {Yk }k1 . In many real-life applications of HMM modeling, no hints about this order are available. As order misspecication is an impediment to parameter estimation, consistent order identication is a prerequisite to HMM parameter estimation. Furthermore, HMM order identication is a distinguished representative of a family of related problems that includes Markov order identication. In all those problems, a nested family of models is given, and the goal is to identify the smallest model that contains the distribution that has generated the data. Those problems dier in an essential way according to whether identiability does or does not depend on correct order specication. Order identication problems are related to composite hypothesis testing. As the performance of generalized likelihood ratio testing in this framework is still a matter of debate, order identication problems constitute benchmarks for which the performance of generalized likelihood ratio testing can be investigated (see Zeitouni et al., 1992). As a matter of fact, analyzing order identication issues boils down to understanding the simultaneous behavior of (possibly innitely) many maximum likelihood estimators. When identiability depends on correct order specication, universal coding arguments have proved to provide very valuable insights into the behavior of likelihood ratios. This is the main reason why source coding concepts and techniques have become a standard tool in the area. This chapter presents four kinds of results: rst, in a Bayesian setting, a general consistency result provides hints about the ideal penalties that could be used in penalized maximum likelihood order estimation. Then universal coding arguments are shown to provide a general construction of strongly consistent order estimators. Afterwards, a general framework for analyzing the Bahadur eciency of order estimation procedures is presented, following the lines of Gassiat and Boucheron (2003). Consistency and eciency results


hold for HMMs. As explained below, refining those consistency and efficiency results requires a precise understanding of the behavior of likelihood ratios. As of writing this text, in the HMM setting, this precise picture is beyond our understanding. But such work has recently been carried out for Markov order estimation. In order to give a flavor of what remains to be done concerning HMMs, this chapter reports in detail the recent tour de force by Csiszár and Shields (2000), who show that the Bayesian information criterion provides a strongly consistent Markov order estimator.

15.1 Model Order Identication: What Is It About?


In preceding chapters, we have been concerned with inference problems in HMMs for which the hidden state space is known in advance: it might be either nite with known cardinality or compact under restrictive conditions; see the assumptions on the transition kernel of the hidden chain to ensure consistency of the MLE in Chapter 12. In this chapter, we focus on HMMs with nite state space of unknown cardinality. Moreover, the set Y in which the observations {Yk }k1 take values is assumed to be nite and xed. Let Mr denote the set of distributions of Y-valued processes {Yk }k1 that can be generated by an HMM with hidden state space X of cardinality r. The parameter space associated with Mr is r . Note that even if all nite-dimensional distributions of {Yk }k1 are known, deciding whether the distribution of {Yk }k1 belongs to Mr or even to r Mr is not trivial (Finesso, 1991, Chapter 1). Elementary arguments show that Mr Mr+1 ; further reection veries that this inclusion is strict. Hence for a xed observation set Y, the sequence (Mr )r1 denes a nested sequence of models. We may now dene the main topic of this chapter: the order of an HMM. Denition 15.1.1. The order of an HMM {Yk }k1 over Y is the smallest integer r such that the distribution of {Yk }k1 belongs to Mr . Henceforth, when dealing with an HMM {Yk }k1 , its order will be denoted by r , and will denote a parameterization of this distribution in r . The distribution of the process will be denoted by P . Assume for a moment that we are given an innite sequence of observations of an HMM {Yk }k1 : y1 , . . . , yk , . . ., that we are told that the order of {Yk }k1 is at most some r0 , and that we are asked to estimate a parameterization of the distribution of {Yk }k1 . It might seem that the MLE in r0 would perform well in such a situation. Unfortunately, if the order of {Yk }k1 is strictly smaller than r0 , maximum likelihood estimation will run into trouble. As a matter of fact, if r < r0 , then is not identiable in r0 . Hence, when confronted with such an estimation problem, it is highly reasonable to rst estimate r and then to proceed to maximum likelihood estimation of . The order estimation question is then the following: given an outcome y1:n of the process {Yk }k1 with distribution in r Mr , can we identify r ?


Denition 15.1.2. An order estimation procedure is a sequence of estimators r1 , . . . , rn , . . . that, given input sequences of length 1, . . . n, . . ., outputs estimates rn (y1:n ) of r . A sequence of estimators is strongly consistent if the sequence r1 , . . . rn , . . . converges to r P -a.s.

15.2 Order Estimation in Perspective


The ambition of this chapter is not only to provide a state-of-the-art exposition of order estimation in HMMs but also to provide a perspective. There are actually many other order estimation problems in the statistical or the information-theoretical literature. All pertain to the estimation of the dimension of a model. We may quote for example the following. Estimating the order of a Markov process. In that case, the order should be understood as the Markov order of the process (Finesso, 1991; Finesso et al., 1996; Csiszr and Shields, 2000; Csiszr, 2002). See Section 15.8 for a a precise denitions and recent advances on this topic. Estimating the order of semi-Markov models, which have proved to be valuable tools in telecommunication engineering. Estimating the order in stochastic context-free grammars, which are currently considered in genomics (Durbin et al., 1998). Estimating the number of populations in a mixture (Dacunha-Castelle and Gassiat, 1997a,b, 1999; Gassiat, 2002). Estimating the number of change points in detection problems. Estimating the order of ARMA models (Azencott and Dacunha-Castelle, 1984; Dacunha-Castelle and Gassiat, 1999; Boucheron and Gassiat, 2004).

Hence, HMM order estimation is both interesting per se and as a paradigm of a rich family of statistical problems for which the general setting is the following. Let {Mr }r1 be a nested sequence of models (sets of probability distributions) for sequences {Yk }k1 on a set Y. For any P in r Mr , the order is the smallest integer r such that P belongs to Mr . Our two technical questions will be the following. (i) Does there exist (strongly) consistent order estimators? Is it possible to design generic order estimation procedures? (ii) How ecient are the putative consistent order estimators? The analysis of order estimation problems is currently inuenced by the theory of universal coding from information theory and by the theory of composite hypothesis testing from plain old statistics. The rst perspective provides a convenient framework for designing consistent order estimators, and the second provides guidelines in the analysis of the performance of order estimators. As a matter of fact, code-based order estimators turn out to be analyzed as penalized maximum likelihood estimators.


Definition 15.2.1. Let $\{\mathrm{pen}(n, r)\}_{n,r}$ denote a family of non-negative numbers. A penalized maximum likelihood (PML) order estimator is defined by
\[
\hat{r}_n \stackrel{\mathrm{def}}{=} \arg\max_r \Big[ \sup_{\mathrm{P} \in \mathcal{M}_r} \log \mathrm{P}(y_{1:n}) - \mathrm{pen}(n, r) \Big] \;.
\]
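Schematically, a PML order estimator only needs, for each candidate order $r$, the maximized log-likelihood of the data within $\mathcal{M}_r$ (for HMMs this would typically be computed by EM, which is outside the scope of this sketch) and a penalty function. The helper below and the BIC-style penalty with dimension $r(r-1) + r(q-1)$ for an HMM with $r$ states over an alphabet of size $q$ are our illustrative assumptions, not a prescription of this chapter.

import numpy as np

def pml_order(log_lik_by_order, n, pen):
    # log_lik_by_order[r-1] is assumed to hold sup_{P in M_r} log P(y_{1:n}),
    # computed elsewhere; pen(n, r) is the penalty of Definition 15.2.1.
    scores = [ll - pen(n, r + 1) for r, ll in enumerate(log_lik_by_order)]
    return int(np.argmax(scores)) + 1

# Hypothetical BIC-style penalty for an HMM with r states over q output symbols.
bic_pen = lambda n, r, q=4: 0.5 * (r * (r - 1) + r * (q - 1)) * np.log(n)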

The main point now becomes the choice of the penalty $\mathrm{pen}(n, r)$. To ensure consistency and/or efficiency,
\[
\sup_{\mathrm{P} \in \mathcal{M}_r} \log \mathrm{P}(y_{1:n}) - \sup_{\mathrm{P} \in \mathcal{M}_{r^*}} \log \mathrm{P}(y_{1:n})
\tag{15.1}
\]

has to be compared with pen(n, r) pen(n, r ) . In case r < r , this is related to Shannon-McMillan-Breiman theorems (see Section 15.4.2), and if the penalty grows slower than n, PML order estimators do not underestimate the order (see Lemma 15.6.2). Moreover the probability of underestimating the order decreases exponentially with rate proportional to n, and the better the constant is, the more ecient is the estimation. Asymptotic behavior of this error thus comes from a large deviations analysis of the likelihood process (see Theorem 15.7.2 and 15.7.7). The analysis of the overestimation error follows dierent considerations. A rst simple remark is that it depends on whether the parameter describing the distribution of the observations is or is not identiable as an element of a model of larger order. When the parameter is still identiable in larger models, stochastic behavior of the maximum likelihood statistic is well understood and can be cast into the old framework created by Wilks, Wald, and Cherno. In this case, weak consistency of PML order estimators is achieved as soon as the penalties go to innity with n and the set of possible orders is bounded. When the parameter is no longer identiable in larger models, stochastic description of the maximum likelihood statistic has to be investigated on an ad hoc basis. Indeed, for general HMMs, the likelihood ratio statistic is stochastically unbounded even for bounded parameters (see Kribin and Gassiat, 2000), and we e are not even aware of a candidate for penalties warranting weak consistency of PML order estimators. Note that one can however use marginal likelihoods to build weakly consistent order estimators (see Gassiat, 2002). From now on, we will mainly focus on nite sets Y. In this case, ideas and results from information theory may be used to build consistent order estimators, without assuming any a priori upper bound on the order (see Lemma 15.6.3). Though the likelihood ratio (15.1) may be unbounded for r > r , its rate of growth is smaller than n. The asymptotic characterization of the decay of the overestimation error should thus resort to a moderate deviations analysis of the likelihood process. Consistency and eciency theorems are stated in Sections 15.6 and 15.7. Although they apply to HMMs, in order to outline the key ingredients of the


proofs, those theorems are stated and derived in a general setting, Though the results might seem satisfactory, they fall short of closing the story. Indeed, for example, lower bounds on penalties warranting strongly consistent order identication for HMMs has only received very partial (and far too conservative) answers . In practice, the question is important when underestimation has to be avoided at (almost) any price. The theoretical counterpart is also fascinating, as it is connected to non-asymptotic evaluation of stochastic deviations of likelihoods (in the range of large and moderate deviations). This is why we shall also consider in more detail the problem of Markov order estimation. A process {Yk }k1 with distribution P is said to be Markov of order r if for every y1:n+1 Yn+1 , P (yn+1 | y1:n ) = P (yn+1 | ynr+1:n ) . For Markov models, whatever the value of r, the maximum likelihood estimator is uniquely dened and it can be computed easily from a (r-dependent) nite-dimensional sucient statistic. Martingale tools may be used to obtain non-asymptotic tail inequalities for maximum likelihoods. Section 15.8 reports a recent tour de force by Csiszr and Shields (2000), who show that a the Bayesian information criterion provides a strongly consistent Markov order estimator. Of course, though this order estimation problem is apparently very similar to the HMM order estimation problem, this similarity should be taken cautiously. Indeed, maximum likelihood estimators in an HMM may not be computed directly using nite-dimensional statistics. However, we believe that our current understanding of Markov order estimation will provide insights into the HMM order estimation problem. Moreover, designing the right non-asymptotic deviation inequalities has become a standard approach in the analysis of model selection procedures (see Barron et al., 1999). This work still has to be done for HMMs. We will start the technical exposition by describing the relationship between order estimation and hypothesis testing.

15.3 Order Estimation and Composite Hypothesis Testing


If we have a consistent order estimation procedure, we should be able to manufacture a sequence of consistent tests for the following questions: is the true order larger than 1, . . . , r, . . .? We may indeed phrase the following composite hypothesis testing problem: H0: The source belongs to Mr0 ; H1: The source belongs to (r Mr ) \ Mr0 . To put things in perspective, in this paragraph we will focus on testing whether some probability distribution P belongs to some subset M0 (H0) of


some set M of distributions over Y . Hypothesis H1 corresponds to P M1 = M \ M0 . A test on samples of length n is a function Tn that maps Yn on {0, 1}. If Tn (y1:n ) = 1, the test rejects H0 in favor of H1, otherwise the test does not reject. The region Kn on which the test rejects H0 is called the critical region. The power function n of the test maps distributions P to the probability of the critical region, def n (P) = P(Y1:n Kn ) . If n (P) for all P M0 , the test Tn is said to be of level . The goal of test design is to achieve high power at low level. In many settings of interest, the determination of the highest achievable power at a given level for a given sample size n is beyond our capabilities. This motivates asymptotic analysis. A sequence of tests Tn is asymptotically of level if for all P M0 , lim sup P(Kn ) .
n

A sequence of tests Tn with power functions n is consistent at level if all but nitely many Tn have level , and if n (P) 1 for all P M1 . When comparing two simple hypotheses, the question is solved by the Neyman-Pearson lemma. This result asserts that it is enough to compare the ratio of likelihoods of observations according to the two hypotheses with a threshold. When dealing with composite hypotheses, things turn out to be more dicult. In the context of nested models, the generalized likelihood ratio test is dened in the following way. Denition 15.3.1. Let M0 and M denote two sets of distributions on Y , with M0 M. Then the nth likelihood ratio test between M0 and M \ M0 has critical region Kn =
\[
\Big\{ y_{1:n} : \sup_{\mathrm{P} \in \mathcal{M}_0} \log \mathrm{P}(y_{1:n}) \le \sup_{\mathrm{P} \in \mathcal{M}} \log \mathrm{P}(y_{1:n}) - \mathrm{pen}(n) \Big\} \;,
\]

where the penalty pen(n) denes an n-dependent threshold. Increasing the penalty shrinks the critical region and tends to diminish the level of the test. As a matter of fact, in order to get a non-trivial level, pen(n) should be positive. The denition of the generalized likelihood ratio test raises two questions. 1. How should pen(n) be chosen to warrant strong consistency? 2. Is generalized likelihood ratio testing the best way to design a consistent test? It turns out that the answers to these two questions depend on the properties of maximum likelihood in the models M0 and M. Moreover, the way to get the answers depends on the models under consideration. In order to answer the rst question, we need to understand the behavior of


\[
\sup_{\mathrm{P} \in \mathcal{M}} \log \mathrm{P}(Y_{1:n}) - \sup_{\mathrm{P} \in \mathcal{M}_0} \log \mathrm{P}(Y_{1:n})
\]

under the two hypotheses. Let M0 denote Markov chains of order r and let M denote Markov chains of order r+1. If P denes a Markov chain of order r, then as n tends to innity, 2[supPM log P(Y1:n ) supPM0 log P(Y1:n )] converges in distribution to a 2 random variable with |Y|r (|Y| 1)2 degrees of freedom. As a consequence of the law of the iterated logarithm, P -a.s., it should be of order log log n as n tends to innity (see Finesso, 1991, and Section 15.8). Hence in such a case, a good understanding of the behavior of maximum likelihood estimates provides hints for designing consistent testing procedures. As already pointed out such a knowledge is not available for HMMs. As of this writing the best and most useful insights into the behavior of supPM log P(Y1:n ) supPM0 log P(Y1:n ) when M denotes HMMs of order r and M0 denotes HMMs of order r < r, can be found in the universal coding literature.

15.4 Code-based Identication


15.4.1 Denitions The pervasive inuence of concepts originating from universal coding theory in the literature dedicated to Markov order or HMM order estimation should not be a surprise. Recall that by the Kraft-McMillan inequality (Cover and Thomas, 1991), a uniquely decodable code on Yn denes a (sub)-probability on Yn , and conversely, for any probability distribution P on Yn , there exists a uniquely decodable code for Yn such that the length of the codeword associated with y1:n is upper-bounded by log P{y1:n } + 1. Henceforth, the probability associated with a code will be called the coding probability, and the logarithm of the coding probability will represent the ideal codeword length associated with the coding probability. For each n, let Rn denote a coding probability for Yn . The family (Rn ) is not necessarily compatiblein other words it is not necessarily the nth dimensional marginal of a distribution on Y . We shall denote by subscripts the marginals: for a probability P on Y , Pn is the marginal distribution of Y1:n . The redundancy of Rn with respect to P M is dened as the Kullback divergence between Pn and Rn , denoted by D(Pn | Rn ) . The family (Rn ) of coding probabilities is a universal coding probability for model M if and only if
PM n

sup lim n1 D(Pn | Rn ) = 0 .

572

15 Order Estimation

The quantity supPM D(Pn | Rn ) is called the redundancy rate of the family (Rn ) with respect to M. The following coding probability has played a distinguished role in the areas of universal coding and prediction of individual sequences. Denition 15.4.1. Given a model M of probability distributions over Yn , the normalized maximum likelihood (NML) coding probability induced by M on Yn is dened by NMLn (y1:n ) = where
def y1:n Yn

supPM P(y1:n ) , Cn sup P(y1:n ) .


PM

Cn =

The maximum point-wise regret of a coding probability Rn with respect to the model M is dened as
y1:n Y PM

max n sup log

P(y1:n ) . Rn (y1:n )

Note that NMLn achieves the same regret log Cn over all strings from Yn . No coding probability can achieve a smaller maximum point-wise regret. This is why NML coders are said to achieve minimax point-wise regret over M. During the last two decades, precise bounds on Cn have been determined for dierent kinds of models, notably for the class of product distributions (memoryless sources), for the class of Markov chains of order r (Markov sources), and for the class of hidden Markov sources of order r. The relevance of bounds on Cn to our problem is immediate. Let Cn be dened with respect to M and let P denote the true distribution, which is assumed to belong to M. Then sup log P(y1:n ) log P (y1:n ) = log NMLn (y1:n ) log P (y1:n ) + log Cn .
PM

On the right-hand side of this inequality, the two quantities that show up refer to two xed probabilities. After exponentiation, those two quantities may take part into summations over y1:n as will be seen for example when proving consistency of penalized maximum likelihood order estimators (see Lemma 15.6.3). One possible (conservative) choice of the penalty term will be made by comparison with normalizing constants Cn The NML coding probability is one among many universal coding probabilities that have been investigated in the literature. For models like HMMs with xed order r, the parameter space r can be endowed with a probability space structure. A prior probability can be dened on r , and under mild measurability assumptions this in turn denes a probability distribution P on Y ,

15.4 Code-based Identication

573

P=
r

P (d) ,

(15.2)

where P is the probability distribution on Y of the HMM with parameter . Such coding probabilities are called mixture coders. Historically, several prior probabilities on have been considered. Uniform (or Laplace) priors were considered rst, but Dirichlet distributions soon gained much attention. Denition 15.4.2. A Dirichlet-(1 , . . . , r ) distribution is a distribution on the simplex of Rr given by the density (q1 , . . . , qr |1 , . . . , r ) = where the i are all positive. Though the Dirichlet prior has a venerable history in Bayesian inference, in this chapter we will stick to the information-theoretical tradition and call the resulting coding probability the Krichevsky-Tromov mixture. Denition 15.4.3. The Krichevsky-Tromov mixture (KT) is dened by providing r with a product of Dirichlet-(1/2, . . . 1/2) distributions. More precisely, such a distribution is assigned to () in the simplex of Rr , to each row G (i, ), in the simplex of Rs where s = |Y|, and to each row Q (i, ) in the simplex of Rr , KT (d) =
r def

(1 + . . . + r ) 1 1 q qr r 1 (1 ) (r ) 1

1q1 +...+qr =1 ,

r r 2 (i)1/2 1 r 2 i=1

i=1

r r 2 Q (i, j)1/2 1 r 2 j=1

d d 2 G (i, j)1/2 1 d j=1 2

. (15.3)

Krichevsky-Tromov mixtures dene a compatible family of probability distributions over Yn for n 1. This is in sharp contrast with NML distributions and is part of the reason why KT mixtures became so popular in source coding theory. Resorting to coding-theoretical concepts provides a framework for dening an order estimation procedure known as minimum description length (MDL) order estimation. MDL was introduced and popularized by J. Rissanen in the late 1970s. Although MDL has often been promoted by borrowing material from medieval philosophy, we will see later that it can be justied using some non-trivial mathematics for Markov order estimation. Denition 15.4.4. Assume that is a probability distribution on the set of possible orders and that for each order r and n 1, Rn denes a coding probr ability for Yn with respect to Mr . Then the MDL order estimator is dened by def r = arg maxr [log Rn (y1:n ) + log (r)] . r

574

15 Order Estimation

n Note that if the coding probability BBrr turns out to be the normalized maximum likelihood distribution, the MDL order estimator is a special kind of penalized maximum likelihood (PML) order estimator. The Bayesian information criterion (BIC) order estimator is nothing but another distinguished member of the family of penalized maximum likelihood order estimators. It is closely related to but dierent from the MDL order estimator derived from the NML coding probability.

Denition 15.4.5. Let dim(r) be the dimension of the parameter space r in Mr . Then the BIC order estimator is dened by r = arg maxr
def

sup log P(y1:n )


PMr

dim(r) log n . 2

Schwarz introduced the BIC in the late 1970s using Bayesian reasoning, and using Laplaces trick to simplify high-dimensional integrals. The validity of this trick and the relevance of Bayesian reasoning to the minimax framework has to be checked on an ad hoc basis. 15.4.2 Information Divergence Rates The order estimators we have in mind (MDL, BIC, PML) are related to generalized likelihood ratio testing. In order to prove their consistency, we need strong laws of large numbers concerning logarithms of likelihood ratios. In the stationary independent case, those laws of large numbers reduce to the classical laws of large numbers for sums of independent random variables. Such strong laws have proved to be fundamental tools both in statistics and in information theory. In general (that is, not necessarily i.i.d. settings), the laws of large numbers we are looking for have been called asymptotic equipartition principles for information in information theory or Shannon-McMillanBreiman (SMB) theorems in ergodic theory (Barron, 1985). Before formulating SMB theorems in a convenient form, let us recall some basic facts about likelihood ratios. Let P and P denote two probabilities over Y such that for every n, Pn is absolutely continuous with respect to Pn . Then under P, the ratio Pn / Pn is a martingale with expectation less than or equal than 1. By monotonicity and concavity of the logarithm, log Pn / Pn is a super-martingale with non-positive expectation. It follows from a theorem due to Doob that this super-martingale converges a.s. to an integrable random variable. If the expectation of the latter random variable is innite, P is singular with respect to P . In such a setting, the rate of growth of log Pn / Pn is a matter of concern. If the two distributions are product probabilities, the log-likelihood ratio is a sum of independent random variables and grows linearly with n if the factors are identical. Moreover, the strong law of large numbers tells us that n1 log Pn / Pn converges a.s. to a xed value, which is called the information divergence rate between the two distributions. How robust is this observation? This is precisely the topic of SMB theorems.

15.4 Code-based Identication

575

Denition 15.4.6. A set M of process laws over Y is said to satisfy a generalized AEP if the following holds. (i) For every pair of laws P and P from M, the relative entropy rate (information divergence rate) between P and P , 1 D(Pn | Pn ) , n n lim exists. It is denoted by D (P | P ). (ii) Furthermore, if P and P are stationary ergodic, then P(Y1:n ) 1 = D (P | P ) log n n P (Y1:n ) lim P-a.s.

Remark 15.4.7. In the i.i.d. setting, the AEP boils down to the usual strong law of large numbers. The cases of Markov models and hidden Markov models can be dealt with using Barrons generalized Shannon-McMillan-Breiman theorem, which we state here. Theorem 15.4.8. Let Y be a standard Borel space and let {Yk }k1 be a Yvalued stochastic ergodic process distributed according to P . Let P denote a distribution over Y , which is assumed to be Markovian of order r, and such that for each n, Pn has a density with respect to Pn . Then n1 log dP (Y1:n ) dP

converges P-a.s. to the relative entropy rate between the two distributions, D (P | P ) = lim n1 D(Pn | Pn ) = sup n1 D(Pn | Pn ) .
n n

From Barrons theorem, it is immediate that the collection of Markov models satises the generalized AEP. The status of HMMs is less straightforward. There are actually several proofs that HMMs satisfy the generalized AEP (see Finesso, 1991). The argument we present here simply resorts to the extended chain device. Theorem 15.4.9. The collection of HMMs over some nite observation alphabet Y satises the generalized AEP. Proof. Let P and P denote two HMMs over some nite observation alphabet Y. Let n and n denote the associated prediction lters. Then under P and P the sequence {Yn , n , n } is a Markov chain over Y Rr Rr , which may be regarded as a standard Borel space. Moreover log P(y1:n ) = log P(y1:n , 1:n , 1:n ) . Applying Theorem 15.4.8 to the sequence {Yn , n , n } nishes the proof.

576

15 Order Estimation

Knowing that some collection of models satises the generalized AEP allows us to test between two elements picked from the collection. When performing order estimation, we need more than that. If ML estimation is consistent, we need to have for every P Mr \ Mr 1 , lim sup
n

sup
PMr
1

n1 log

P(Y1:n ) <0 P (Y1:n )

P -a.s.

If the collection of models satises the generalized AEP, this should at least imply that inf 1 D (P | P) > 0 . r
PM

We recall here some results concerning divergence rates of stationary HMMs that may be found in Gassiat and Boucheron (2003). Here Mr is the set of stationary HMMs of order at most r. Lemma 15.4.10. D ( | ) is lower semi-continuous on r Mr r Mr . Lemma 15.4.11. If P is a stationary but not necessarily ergodic HMM of order r, it can be represented as a mixture of ergodic HMMs (Pi )ii(r) having disjoint supports on X Y,
d

P=
i=1

i Pi ,

where i i = 1, i 0 and i(r) depends on r only. If P is a stationary ergodic HMM then D (P | P ) =


i

i D (Pi | P ) ,

D (P | P) = inf D (P | Pi ) .
i

Lemma 15.4.12. If P is a stationary ergodic HMM of order r and r < r , then inf r D (P | P ) > 0 and inf r D (P | P) > 0 .
PM PM

15.5 MDL Order Estimators in Bayesian Settings


Under mild but non-trivial conditions on universal redundancy rates, the above-described order estimators are strongly consistent in a minimax setting. In this section, we will present a result that might seem to be a denitive one. Recall that two probability distributions Q and Q are orthogonal or mutually singular if there exists a set A such that Q(A) = 1 = Q (Ac ).

15.6 Strongly Consistent Penalized Maximum Likelihood Estimators

577

Theorem 15.5.1. Let {r }r1 denote a collection of models and let Qr denote coding probabilities dened by (15.2) with prior probabilities r . Let L(r) denote the length of a prex binary encoding of the integer r. Assume that the probabilities Qr are mutually singular on the asymptotic -eld. If the order estimator is dened as rn = arg minr log2 Qr (y1:n ) + L(r) , then for all r and r -almost all , rn converges to r a.s. Proof. Dene Q as the double mixture Q =C
r=r def

2L(r) Qr ,

where C 1 is a normalization factor. Under the assumptions of the theorem Q and Qr are mutually singular on the asymptotic -eld. Moreover for all y1:n , Q (y1:n ) C sup 2L(r) Qr (y1:n ) ,
r=r

which is equivalent to log2 Q (y1:n ) log2 C + inf [L(r) log2 Qr (y1:n )] .


r=r

On the other hand, a standard martingale argument tells us that Qr -a.s., log2 Qr (y1:n ) Q (y1:n )

converges to a limit, and the fact that Qr and Q are mutually singular entails that this limit is innite Qr -a.s. Hence Qr -a.s., for all suciently large n log2 Qr (y1:n ) + L(r ) < inf [L(r) log2 Qr (y1:n )] .
r=r

This implies that Qr -a.s., for all suciently large n, rn = r , which is the desired result. Remark 15.5.2. Theorem 15.5.1 should not be misinterpreted. It does not prevent the fact that for some in a set with null r probability, the order estimator might be inconsistent. Neither does the theorem give a way to identify those for which the order estimator is consistent.

15.6 Strongly Consistent Penalized Maximum Likelihood Estimators for HMM Order Estimation
In this section, we give general results concerning order estimation in the framework of nested sequences of models, and we then state their application to stationary HMMs. We shall consider penalized ML estimators rn .

578

15 Order Estimation

Assumption 15.6.1. (i) The sequence of models satises the generalized AEP (Denition 15.4.6). (ii) Whenever P is stationary ergodic of order r and r < r ,
PMr

inf D (P | P) > 0 .

(iii) For any > 0 and any r, there exists a sieve (Pi )iI r , that is, a nite set I r such that Pi Mr with all Pi being stationary ergodic, and a nr such that for all P Mr there is an i I r such that n1 | log P(y1:n ) log Pi (y1:n )| for all n nr and all y1:n . Non-trivial upper bounds on point-wise minimax regret for the dierent models at hand will enable us to build strongly consistent code-based order estimators. Lemma 15.6.2. Let the penalty function pen(n, r) be non-decreasing in r and such that pen(n, r)/n 0. Let {rn } denote the sequence of penalized maximum likelihood order estimators dened by pen(). Then under Assumption 15.6.1, P -a.s., rn r eventually. Proof. Throughout innitely often will be abbreviated i.o. Write {rn < r i.o.} =
r<r

{rn = r i.o.}

and note that {rn = r i.o.}


iI r

sup log P(y1:n ) log P (y1:n ) pen(n, r ) i.o.


PMr

max log Pi (y1:n ) log P (y1:n ) n pen(n, r ) i.o. r


iI

lim sup n1 [log Pi (y1:n ) log P (y1:n )]

where (Pi )iI r is the sieve for Mr given by Assumption 15.6.1(iii). Now, by Assumption 15.6.1(i), n1 [log Pi (y1:n ) log P (y1:n )] converges P -a.s. to D (P | Pi ), and by Assumption 15.6.1(ii), as soon as < min inf r D (P | P) ,
r<r PM

one obtains P (r < r i.o.) = 0. A possibly very conservative way of choosing penalties may be justied r in a straightforward way by universal coding arguments. Let Cn denote the normalizing constant in the denition of the NML coding probability induced by Mr on Yn .

15.6 Strongly Consistent Penalized Maximum Likelihood Estimators


r

579

r Lemma 15.6.3. Let the penalty function be pen(n, r) = r =0 (log Cn + 2 log n) and let {rn } denote the sequence of penalized maximum likelihood order estimators dened by pen(). Then P -a.s., rn r eventually.

Proof. Let r denote an integer larger than r . Then P (rn = r) P P


y1:n

log P (Y1:n ) sup log P(Y1:n ) pen(n, r) + pen(n, r )


PMr r1

log P (Y1:n ) log NMLr (Y1:n ) n


r =r +1

r log Cn 2(r r ) log n

exp[log P (y1:n )] 1{log P


(y1:n )log NMLr (y1:n ) n r1
r1 r =r +1 r log Cn 2(rr ) log n}

y1:n

NMLr (y1:n ) exp n


r =r +1 r1

r log Cn 2(r r ) log n

exp
r =r +1

r log Cn 2(r r ) log n

2(rr )

r because log Cn = 0 for r = r + 1. By the union bound,

r1 r =r +1

P (rn > r ) =
r>r

P (rn = r)

n2 , 1 n2

whence P (rn > r )


n n

n2 <. 1 n2

Applying the Borel-Cantelli lemma, we may now conclude that P -a.s., order over-estimation occurs only nitely many times. In order to show the existence of strongly consistent order estimators for HMMs, it remains to check that Assumption 15.6.1 holds and that the penalties used in the statement of Lemma 15.6.3 satisfy the conditions stated in Lemma 15.6.2, that is, for all r 1, lim
n

1 n

r log Cn + 2 log n = 0 . r r

This last point follows immediately from the following result from universal coding theory.

580

15 Order Estimation

Lemma 15.6.4. For all r, all n > r and all y1:n ,


r log Cn = log

supPMr P(y1:n ) r(r + d 2) log n + cr,d (n), NMLr (y1:n ) 2 n

where for n 4, cr,d (n) may be chosen as cr,d (n) = log r + r log
r 2 1 2

d 2 1 2

r2 + d2 1 + 4n 6n

Concerning Assumption 15.6.1, part (i) is Theorem 15.4.9 and part(ii) is r Lemma 15.4.12. Now for any positive , let us denote by the set of HMM parameters in r such that each coordinate is lower-bounded by . r For any r , there exists such that for any n and any y1:n , n1 | log P (y1:n ) log P (y1:n )| r2 + d2 . 2

A glimpse at the proof of this fact in Liu and Narayan (1994) reveals that this r statement still holds when is constrained to lie in a sieve for , dened as a nite subset (i )iI such that for all r , at least one i in the sieve is within L -distance smaller than away from . This may be summarized in the following way. Corollary 15.6.5. Let P be an HMM of order r and let {rn } be the sequence of penalized ML order estimators dened in Lemma 15.6.3. Then P -a.s., rn = r eventually. Remark 15.6.6. Resorting to universal coding arguments to cope with our poor understanding of the maximum likelihood in misspecied HMMs provides us with a Janus-faced result: on one hand it allows us to describe a family of strongly consistent order estimators that will prove to be optimal as far as under-estimation is concerned; on the other hand the question raised by Kieer (1993) about the consistency of BIC and MDL for HMM order estimation remains open.

15.7 Eciency Issues


How ecient are the aforementioned order estimation procedures? The notions of eciency that have been considered in the order estimation literature have been shaped on the testing theory setting. As a matter of fact, the classical eciency notions have emerged from the analysis of the simple hypotheses testing problem. Determining how those notions could be tailored to the nested composite hypothesis testing problem is still a subject of debate. Among the various notions of eciency, or even of asymptotic relative eciency that are regarded as relevant in testing theory, Pitmans eciency

15.7 Eciency Issues

581

focuses on the minimal sample size that is required to achieve simultaneously a given level and a given power at alternatives. Up to our knowledge, Pitmans eciency for Markov order or HMM order estimation related problems has not been investigated. This is due to the lack of non-asymptotic results concerning estimation procedures for HMM and Markov chains. The notion of eciency that has been assessed in the order estimation literature is rather called Bahadur relative eciency in the statistical literature and error exponents in the information-theoretical literature. When testing a simple hypothesis against another simple hypothesis in the memoryless setting, a classical result by Cherno tells us that comparing likelihood ratios to a xed threshold, both level and power may decay exponentially fast with respect to the number of observations. In that setting, Bahadur-ecient testing procedures are those that achieve the largest exponents. Viewing that set of circumstances, there have been several attempts to generalize those results to the composite hypothesis setting. Part of the diculty lies in stating the proper questions. Although consistency issues concerning the BIC and MDL criteria for HMM order estimation have not yet been claried, our understanding of eciency issues concerning HMM order identication recently underwent significant progress. In this section, we give general results concerning eciency of order estimation in the framework of nested sequences of models; these results apply to stationary HMMs. 15.7.1 Variations on Steins Lemma The next theorems are extensions of Steins lemma to the order estimation problem. Theorem 15.7.2 aims at determining the best underestimation exponent for a class of order estimators that ultimately overestimate the order with a probability bounded away from 1. Theorem 15.7.4 aims at proving that the best overestimation exponent should be trivial in most cases of interest. Assumption 15.7.1. (i) The sequence of models satises the general AEP (Denition 15.4.6). (ii) For any r, there exists Mr Mr such that any P in Mr is stationary 0 0 ergodic and has true order at most r, and such that for any P Mr , 0
PMr

inf D (P | P ) = inf r D (P | P ) .
PM0

Versions of the following theorem have been proved iby Finesso et al. (1996) for Markov chains and by Gassiat and Boucheron (2003) for HMMs. Theorem 15.7.2. Let the sequence {Mr }r1 of nested models satisfy Assumption 15.7.1. Let {rn }n1 denote a sequence of order estimators such that for some < 1, all r and all P Mr , 0 P (rn (Y1:n ) > r )

582

15 Order Estimation

for n T1 (P , , r ). Then for all r and all P Mr , 0 lim inf n1 log P (rn (Y1:n ) < r ) min
n r <r PMr

inf D (P | P ) .

Proof. Fix P Mr . Let P Mr with r < r and dene 0 0 An (P ) = {y1:n : rn (y1:n ) r } , P (y1:n ) def Bn (P ) = {y1:n : n1 log D (P | P ) + } . P (y1:n ) For n > T1 (P , , r ), P (An (P )) > 1 , and as r M is assumed to satisfy the generalized AEP, for all n > T3 ( , P , P ) it holds that P (Bn (P )) > 1 . If n > T2 (, , P ) = max[T1 (, r ), T3 ( , P , P )], then P (rn (Y1:n ) < r ) = EP [1{rn <r } ] is an equality if P and P have the same support set for nite marginals P (Y1:n ) 1{rn <r } EP P (Y1:n ) as r < r P (Y1:n ) EP 1A (P ) P (Y1:n ) n from the denition of Bn (P ) EP EP (15.4)
r def

1An (P ) 1Bn (P ) en[D(P 1An (P ) 1Bn (P ) en[D(P


| P )+ ]

| P )+ ]

| P )+ ]

from the union bound, and by the AEP (1 )en[D(P . tend to zero, the

Now optimizing with respect to and r and letting theorem follows.

Remark 15.7.3. Assessing that the upper bound on underestimation exponent is positive amounts to checking properties of relative entropy rates. Theorem 15.7.2 holds for stationary HMMs. Assumption 15.7.1(i) is Theorem 15.4.9, and part (ii) is veried by taking Mr as the distributions of 0

15.7 Eciency Issues

583

stationary ergodic HMMs with order at most r. Then Theorem 15.7.2 follows using Lemmas 15.4.10 and 15.4.11. Another Stein-like argument provides an even more clear-cut statement concerning possible overestimation exponents. Such a statement seems to be a hallmark of a family of embedded composite testing problems. It shows that in many circumstances of interest, we cannot hope to achieve both nontrivial under- and overestimation exponents. Versions of this theorem have been proved by Finesso et al. (1996) for Markov chains and by Gassiat and Boucheron (2003) for HMMs. Theorem 15.7.4. Let the sequence {Mr }r1 of nested models satisfy Assumption 15.7.1. Assume also that for P Mr Mr there exists a sequence 0 {Pm }m of elements in Mr+1 \ Mr such that 0
m

lim D (Pm | P) = 0 .

Assume that {rn }n is a consistent order estimation procedure. Then for all P Mr having order r , 0 lim inf
n

1 log P(rn > r ) = 0 . n

The change of measure argument that proved eective in the proof of Theorem 15.7.2 can now be applied for each P Mr . 0 Proof. Let P denote a distribution in Mr having order r and let {Pm } de0 note a sequence as above. Let denote a small positive real. Fix m suciently large that D (Pm | P) and then n suciently large that Pm n1 log while Pm (rn = r + 1) 1 . n We may now lower bound the overestimation probability as P(rn > r ) P(rn = r + 1) dP EPm 1{rn =r +1} d Pn d Pn EPm 1{rn =r +1} d Pm n d Pm n EPm exp log d Pn e2n (1 2 ) . Hence lim inf n n1 log Pn (rn > r ) 2 . As may be arbitrarily small, this nishes the proof. d Pm (Y1:n ) D (Pm | P) + dP

1{rn =r

+1}

584

15 Order Estimation

This theorem holds for stationary HMMs; see Gassiat and Boucheron (2003). The message of this section is rather straightforward: in order estimation problems like HMM order estimation, underestimation corresponds to large deviations of the likelihood process, whereas overestimation corresponds to moderate deviations of the likelihood process. In the Markov order estimation problem, the large-scale typicality theorem of Csiszr and Shields allows us a to assign a quantitative meaning to this statement. 15.7.2 Achieving Optimal Error Exponents Stein-like theorems (Theorems 15.7.2 and 15.7.4) provide a strong incentive to investigate underestimation exponents of the consistent order estimators that have been described in Section 15.6. As those estimators turn out to be penalized maximum likelihood estimators, what is at stake here is the (asymptotic) optimality of generalized likelihood ratio testing. In some situations, generalized likelihood ratio testing fails to be optimal. We will show that this is not the case in the order estimation problems we have in mind. As will become clear from the proof, as soon as the NML normalizing r constant log Cn /n tends to 0 as n tends to innity, NML code-based order estimators exhibit the same property. Assumption 15.7.5. (i) The sequence of models satises the AEP. (ii) Each model Mr can be endowed with a topology under which it is sequentially compact. (iii) Relative entropy rates satisfy the semi-continuity property: if Pm and P ,m are stationary ergodic and converge respectively to P and P , then D (P | P ) lim inf m D (Pm | P ,m ). (iv) For any > 0 and any r, there exists a sieve (Pi )iI r , that is, a nite set I r such that Pi Mr with all Pi ergodic and such that the following hold true. (a) Assumption 15.6.1(iii) is satised. (b) For each stationary ergodic distribution P r Mr with order r and for every nite subset P of the union {Pi : i I r } Mr of all sieves, the log-likelihood process {log P(Y1:n )}PP satises a large deviation principle with good rate function JP and rate n. Moreover, any sample path {u(P)}PP of the log-likelihood process indexed by P that satises JP (u) < enjoys the representation property that there exists a distribution Pu Mr such that u(P) = lim n1 EPu [log P(Y1:n )] ,
n

PP,

JP (u) D (Pu | P ) . (v) For any r1 < r2 , if P1 Mr1 and P2 Mr2 satisfy D (P2 | P1 ) = 0, then P2 = P1 Mr1 .

15.7 Eciency Issues

585

(vi) If P Mr is not stationary ergodic, it can be represented as a nite mixture of ergodic components (Pi )ii(r ) (where i(r ) depends only on r ) in Mr , i i Pi = P, and for all ergodic P in M, D (P | P ) =
ii(r )

i D (Pi | P ) .

Remark 15.7.6. Assumption 15.7.5 holds for HMMs. This is not obvious at all and follows from available LDPs for additive functionals of Markov chains, the extended chain device, and ad hoc considerations. The interested reader may nd complete proofs and relevant information in Gassiat and Boucheron (2003). Theorem 15.7.7. Assume that the sequence of nested models (Mr ) satises Assumptions 15.7.1 and 15.7.5. If pen(n, r) is non-negative and for each r, pen(n, r)/n 0 as n , the penalized maximum likelihood order estimators achieve the optimal underestimation exponent,
r<r PMr

min inf D (P | P ) .

The optimality of this exponent comes from Theorem 15.7.2, which holds under Assumption 15.7.1. Hence the proof of Theorem 15.7.7 consists in proving that the exponent is achievable. Proof. An application of the union bound entails that lim sup n1 log P (rn < r ) max lim sup n1 log P (rn = r) .
r<r

Hence the problem reduces to checking that for each r < r , lim sup 1 log P (rn = r) inf r D (P | P ) . PM n

Fix r < r . The proof will be organized in two steps. First, we will check that for each > 0 we can nd some P I r and some P such that D (P | P ) 3 , lim sup n
n 1

log P (rn = r) D (P | P ) .

In the second step, we let tend to 0 to check that there exists some P in Mr such that lim n1 log P (rn = r) D (P | P ) .
n

Let us choose > 0 and n large enough that pen(n, r ) n for n n . Under Assumption 15.7.5(iv)(a), we get for n n nr ,

586

15 Order Estimation

log P (rn = r) log P log P sup log P(Y1:n ) sup log P(Y1:n ) pen(n, r) pen(n, r )
PMr iI PMr

max n1 log Pi (Y1:n ) max n1 log P(Y1:n ) 2 r r


iI

We may divide by n, take the lim sup of the two expressions as n tends to innity, and use Assumption 15.7.5(iv)(b) to obtain lim sup n1 log P (rn = r) inf JP (u) : sup u(Pi ) sup u(Pi ) 2
iI r iI r

with P = {Pi : i I r } {Pi : i I r } . The inmum on the right-hand side of the inequality is attained at some path u . Hence, using again Assumption 15.7.5(iv)(b), lim sup n1 log P (rn = r) D (P | P ) , where P Mr , u (P) = lim n1 EP [log P(Y1:n )] , and sup u (Pi ) sup u (Pi ) 2 .
iI r iI r

(15.5)

PP,

(15.6) (15.7)

Pick P {Pi }iI r such that for n nr , n1 | log P (y1:n ) log P (y1:n )| and P such that
iI r

sup u (Pr ) = u (P ) . i

(15.8)

Then lim sup n1 EP [log P (Y1:n )] lim sup n1 EP [log P (Y1:n )] + = u (P ) + u (P ) + 3 = lim n1 EP [log P (Y1:n )] + 3 . Here we used (15.6) for the second step, then (15.8) and (15.7), and nally (15.6) again. Using Assumption 15.7.5(i) we thus nally obtain D (P | P ) 3 .

15.8 Consistency of BIC for Markov Order Estimation

587

Let us now proceed to the second step. It remains to check that if we let tend to 0, the sequence (P ) obtained in (15.5) has an accumulation point in Mr . Note that P is ergodic and let i i, Pi, denote the ergodic decomposition of P . Then D (P | P ) =
i

i, D (Pi, | P ) .

Extract a subsequence of (i, ) and (Pi, ) converging to i and Pi , respectively, and such that P = i i Pi , while P is the corresponding accumulation point of the sequence P . We may then apply the semi-continuity property to obtain i D (Pi | P) = 0 .
i

This leads, using Assumption 15.7.5(v) and (vi), to i i Pi = P, that is, P = P Mr . Using the semi-continuity property again we nd that lim D (P | P ) = lim
i

i, D (Pi, | P ) D (P | P ) ,

whence lim sup n1 P (rn = r) inf r D (P | P ) .


PM

15.8 Consistency of the BIC Estimator in the Markov Order Estimation Problem
Though consistency of the BIC estimator for HMM order is still far from being established, recent progress concerning the Markov order estimation problem raises great expectations. As a matter of fact, the following was established by Csiszr and Shields and recently rened by Csiszr (Csiszr and Shields, a a a 2000; Csiszr, 2002). a Theorem 15.8.1. For any stationary irreducible Markov process with distribution P over the nite set Y and of order r , the BIC order estimator converges to r P -a.s. The proof of this remarkable theorem follows from a series of technical lemmas concerning the behavior of maximum likelihood estimators in models Mr for r r . In the Markov order estimation problem, such precise results can be obtained at a reasonable price, thanks to the fact that maximum likelihood estimates coincide with simple functions of empirical measures. Here we follow the argument presented by Csiszr (2002). a

588

15 Order Estimation

First note that underestimation issues are dealt with using Lemma 15.6.2. Theorem 15.8.1 actually follows almost directly from the following result. Let r P denote the MLE of the probability distribution in Mr on the sample y1:n . Theorem 15.8.2. For any stationary irreducible Markov process with distribution P of order r over the nite set Y, sup
rr r 1 1 log P (y1:n ) log P (y1:n ) 0 |Sr | log n

P -a.s.

Here Sr denotes the subset of patterns from |Y|r that have non-zero stationary probability. To emphasize the power of this theorem, let us rst use it to derive Theorem 15.8.1. Proof (of Theorem 15.8.1). The event {rn > r i.o.} equals the event {r > r : log P (y1:n ) log P (y1:n ) pen(n, r) pen(n, r ) i.o.} , which is included in {r > r : log P (y1:n ) log P (y1:n ) pen(n, r) pen(n, r ) i.o.} . By Theorem 15.8.2, it follows that for any > 0, P -a.s., sup
rr r 1 1 log P (y1:n ) log P (y1:n ) < . |Sr | log n r r r

Finally, for large n, for the BIC criterion, pen(n, r) (1/2)|Sr |(|Y|1) log n. Remark 15.8.3. Viewing the proof of strong consistency of the BIC Markov order estimator, one may wonder whether an analogous result holds for MDL order estimators derived from NML coding probabilities or KT coding probabilities. If no a priori restriction on the order is enforced, the answer is negative: there exists at least one stationary ergodic Markov chain (the uniform memoryless source) for which unrestricted MDL order estimators overestimate the order innitely often with probability one. But if the search for r in maxr { log Qn,r (y1:n ) log (r)} is restricted to some nite range {0, . . . , log n} where is small enough (depending on the unknown P ) and does not depend on n, then the MDL order estimator derived by taking NMLn,r as the rth coding probability turns out to be strongly consistent. The reason why this holds is that in order to prove strong consistency, we need to control |Sr+1 | |Sr | log n 2 over a large range of values of r for all suciently large n. Sharp estimates of the minimax point-wise regret of NML for Markov sources of order r have recently been obtained. It is not clear whether such precise estimates can be obtained for models like HMMs where maximum likelihood is not as wellbehaved as in the Markov chain setting.
r log Cn

15.8 Consistency of BIC for Markov Order Estimation

589

Throughout this section, P denotes the distribution of a stationary irreducible Markov chain of order r over Y. For all r and all a1:r Yr ,
n+1r

Nn (a1:r ) =

def i=1

1r {Yi+j1 =aj } j=1

is the number of times the pattern a1:r occurs in the sequence y1:n . The MLE of the conditional distribution in Mr (r-transitions) is P (ar+1 | a1:r ) =
r

Nn (a1:r+1 ) Nn1 (a1:r )

for all a1:r+1 Yr+1 , whenever Nn1 (a1:r ) > 0. The proof of Theorem 15.8.2 is decomposed into two main parts. The r easiest part relates log P (y1:n ) log P (y1:n ) and a 2 distance between the r empirical transition kernel Pn and P , under conditions that aver to be almost surely satised by sample paths of irreducible Markov chains. This relationship (Lemma 15.8.4) is a quantitative version of the asymptotic equivalence between relative entropy and 2 distance (see Csiszr, 1990, for more infora mation on this topic). The most original part actually proves that the almost r sure convergence of P to P is uniform over r r . Lemma 15.8.4. Let P and P be two probability distributions on {1, . . . , m}. If P (i)/2 P (i) 2P (i) for all i then D(P | P ) 2 (P, P ), where m 2 (P, P ) = i=1 {P (i) P (i)}2 /P (i). A simple corollary of this lemma is the following. Corollary 15.8.5. Let r be an integer such that r r . If y1:n is such that for all a1:r+1 Sr+1 , Nn (a1:r+1 ) 1 P (ar+1 | a1:r ) 2 P (ar+1 | a1:r ) , 2 Nn1 (a1:r ) then log P (y1:n ) log P (y1:n )
a1:r Sr r

Nn (a1:r )2 (P ( | a1:r ), P ( | a1:r )) .

15.8.1 Some Martingale Tools The proof of Theorem 15.8.2 relies on martingale arguments. The basic tools of martingale theory we need are gathered here. def In the sequel, denotes the convex function (x) = exp(x) x 1 and its convex dual, (y) = supx (yx (x)) = (y + 1) log(y + 1) y for y 1 and otherwise. We will use repeatedly the classical inequality

590

15 Order Estimation

(x)

x2 , 1 + x/3

x0.

The following lemma is usually considered as an extension of the Bennett inequality to martingales with bounded increments. Various proofs may be found in textbooks on probability theory such as Neveu (1975) or DacunhaCastelle and Duo (1986). Lemma 15.8.6. Let {Fn }n1 denote a ltration and let {Zn }n1 denote a centered square-integrable martingale with respect to this ltration, with incredef n 2 ments bounded by 1. Let Z n = s=1 E[(Zs Zs1 ) | Fs1 ] be its bracket. Then for all , the random variables exp[Zn () Z form an {Fn }-adapted super-martingale. Let us now recall Doobs maximal inequality and the optional sampling principle. Doobs maximal inequality asserts that if {Zn } is a a supermartingale, then for all n0 and all x > 0, P sup Zn x
nn0 n]

E[(Zn0 )+ ] . x

(15.9)

Recall that a random variable T is a stopping time with respect to a ltration {Fn } if for each n the event {T n} is Fn -measurable. The optional sampling theorem asserts that if T1 , T2 , . . . , Tk , . . . form an increasing sequence of stopping times with respect to {Fn }, then the sequence {ZTi } is a {FTi }-adapted super-martingale. Considering a stopping time T and the increasing sequence {T n} of stopping times, it follows from Lemma 15.8.6, Doobs maximal inequality, and the optional sampling theorem that if {Zn } is a martingale with increments bounded by 1, then for any stopping time T , P n T : |Zn | > () Zn + 2 exp() . (15.10)

Let B1 B2 be two numbers. If the stopping times T1 and T2 are dened by T1 = inf{n : Z n B1 } and T2 = inf{n : Z n B2 }, (15.10) entails that for any x > 0, P n {T1 , . . . T2 } : |Zn | > x 2 exp = 2 exp 2 exp B2 sup

x () B2

B2

x B2 . (15.11)

x2 2 B2 + x/3

This inequality will aver to be the workhorse in the proof of Theorem 15.8.2.

15.8 Consistency of BIC for Markov Order Estimation

591

15.8.2 The Martingale Approach The following observation has proved to be crucial in the developments that started with Finesso (1991) and culminated in Csiszr (2002). For each r > r a and a1:r Yr , the random variables Zn (a1:r ) dened by Zn (a1:r ) = Nn (a1:r ) Nn1 (a1:r1 ) P (ar | a1:r1 ) form an {Fn }-adapted martingale. Moreover, this martingale has increments bounded by 1, and the associated bracket has the form Z(a1:r )
n def

= Nn1 (a1:r1 ) P (ar | a1:r1 )[1 P (ar | a1:r1 )] .

(15.12)

Note that |Zn (a1:r )| < x implies that |P


r1

(ar | a1:r1 ) P (ar | a1:r1 )| <

x . Nn1 (a1:r1 )
r1

Hence bounds on the deviations of the martingales Zn (a1:r ) for a1:r Sr Yr are of immediate relevance to the characterization of P . The following lemma will be the fundamental bridging block in the proof of the large scale typicality Theorem 15.8.1. Lemma 15.8.7. Let and be two positive reals, r > r , a1:r Sr and let Zn denote the martingale associated with a1:r . Then for any > 1 and any integer m 0, P n : m Z 2 exp
n

m+1 , |Zn |

max[r, log log( Z

n )]

max[r, log log(m )] 2{1 + (1/3) max[r, log log(m )]/m+2 }

. (15.13)

Proof. Let the stopping time Tm be dened as the rst instant n such that Z n m . Note that Z n m for n between Tm and Tm+1 , and we may take x = m max[r, log log m ] and B2 = m+1 in (15.11). Remark 15.8.8. If a1:r Sr , ergodicity implies that P -a.s., Z(a1:r ) n converges to innity. Choosing = 0 and taking = 2(1 + ) with > 0, the previous lemma asserts that P n : m Z
n

m+1 , |Zn |

2(1 + ) Z

log log( Z

n)

2 exp

(1 + ) log log 1+
1 3

2(1+) log log m m+1

592

15 Order Estimation

The sum over m of the right-hand side is nite. Thus by the Borel-Cantelli lemma, P -a.s., the event on the left-hand side only occurs for nitely many m. Combining these two observations and letting tend to 1 and tend to 0 completes the proof that P -a.s., lim sup
n

|Zn | 2Z
n

log log Z

1.
n

(15.14)

Note that by Corollary 15.8.5 this entails that for some xed r > r , P -a.s., eventually for all a1:r Sr , Nn1 (a1:r ) 2 r [P ( | a1:r ), P ( | a1:r )] 2 log log Nn1 (a1:r ) |Y| and
r 1 [log P (a1:r ) log P (a1:r )] 2 log log n . |Y||Sr |

If we were ready to assume that r is smaller than some given upper bound on the true order, this would be enough to ensure almost sure consistency of penalized maximum likelihood order estimators by taking pen(n, r) = 2|Y|r+1 log log n .

15.8.3 The Union Bound Meets Martingale Inequalities The following lemma will allow us to control supr:r
r r log n {log P log P

}.

Lemma 15.8.9. For every > 0 there exists > 0 (depending on P ) such that eventually almost surely as n , for all a1:r in Sr with r < r log n, |Zn (a1:r )| Z(a1:r ) n log Z(a1:r ) n .
,c, Let the event Dn (a1:r ) be dened by ,c, Dn (a1:r ) = def

y1:n : Z(a1:r ) |Zn (a1:r )|

> cr,
n

Z(a1:r )

max[r, log log( Z(a1:r ) n )] .

Lemma 15.8.10. Let , and c be chosen in a way that there exists > 1 such that > 2 log |Y| + max(c1/2 , 1 (15.15) 3 and > . (15.16) log |Y| 2[+ /3 max(c1/2 ,1)]

15.8 Consistency of BIC for Markov Order Estimation

593

Then lim sup


n rr a1:r Sr

1Dn (a1:r ) = 0 ,c,

P -a.s.

Proof. Fix > 1 in such a way that (15.15) and (15.16) are satised. For each ,c, integer m, let the event Em (a1:r ) be dened by
,c, Em (a1:r ) = def

y1: : m > cr, a1:r , n {Tm (a1:r ), . . . , Tm+1 (a1:r )} , |Zn (a1:r )| Z(a1:r )
n

max[r, log log( Z(a1:r ) n )] .

The lemma will be proved in two steps. We will rst check that P -a.s., ,c, only nitely many events Em (a1:r ) occur. Then we will check that on a set of sample paths that has probability 1, this entails that only nitely many ,c, events Dn (a1:r ) occur. Note that max[r, log log( )] =
m

if r

log log m ,

log log(m ) otherwise .

To alleviate notations, let be dened as = 2 + Then E


m r a1:r

log |Y| .

max(c1/2 , 1

1Em (a1:r ) ,c,



m

r 2 +
1 3 r m

|Y|r exp
log log m r m /c

+
r <r log log m

|Y|r exp exp

log log m r 2 1+
1 3 log log m m

1 1 log log m + . |Y| 1 1 exp()

Note that as > , by (15.15), the last sum is nite. This shows that our rst goal is attained. Now as P is assumed to be ergodic, P -a.s., for all r > r and all a1:r Sr , Z(a1:r ) n tends to innity. Let us consider such a sample path. Then if ,c, innitely many events of the form Dn (a1:r ) occur for a xed pattern a1:r , ,c, also innitely many events of the form Em (a1:r ) occur for the same xed pattern.

594

15 Order Estimation

If there exists an innite sequence {a1:rn } of patterns such that the events ,c, Dn (a1:rn ) occur for innitely many n, then innitely many events of the ,c, form Emn (a1:rn ) also occur. In order to prove Lemma 15.8.9, we will need lower bounds on P {a1:r } for r rn and a1:r Sr . As P has Markov order r we have
r

P (a1:r ) = P (a1:r )
j=r +1

P (aj | aj1:jr ) . P (ar | a1:r ). (15.17)

Now let = mina1:r Then

Sr

P (a1:r ) and = mina1:r


a1:r Sr

+1 Sr +1

+1

min P (a1:r ) rr .

Proof (of Lemma 15.8.9). We will rely on Lemma 15.8.10 and we thus x , and c to satisfy the conditions of this lemma. The challenge will consist in checking that for every > 0 we can nd some > 0 such that (i) P -a.s. all the clocks associated with patterns in r{r move suciently fast, that is, for all suciently large n, Z(a1:r )
n ,..., log n} Sr

>r

for all a1:r r{r

,..., log n} Sr

(ii) For all suciently large n, max[r, log log Z(a1:r ) n ] log n for all a1:r r{r
r1 ,..., log n} Sr

Let us rst make a few observations. If 1 1) P (a1:r1 )| < 1 + r1 and |Zn (a1:r )| < then Nn (a1:r ) > Nn1 (a1:r1 ) P (ar | a1:r1 ) Z(a1:r )
n

< |Nn1 (a1:r1 )/(n r +

Z(a1:r )

max[r, log log Z(a1:r ) n ] ,

max[r, log log Z(a1:r ) n ]

> (n r + 1) P (a1:r ) 1
r1

(1 +

r1 ) max[r, log log(2(n

r + 1)) P (a1:r1 )]

(n r + 1) P (a1:r ) 2 max[r, log log(2n)] nrr

> (n r + 1) P (a1:r ) 1 and

r1

15.8 Consistency of BIC for Markov Order Estimation

595

Nn (a1:r ) < Nn1 (a1:r1 ) P (a1:r1 ) + Z(a1:r )


n

max[r, log log Z(a1:r ) n ]


r1

< (n r + 1) P (a1:r ) 1 +

max[r, log log(2n)] nrr

Now P -a.s., for n large enough and all a1:r Sr , 1


r

< |Nn (a1:r )/(n r + 1) P (a1:r )| < 1 +

Let be such that < 1/ log(1/). Then for r < (/) log log n, we may choose r (n) in such a way that 2 log log 2n log n + 2 log n 1/4 log log(2n) r (n) r (n) + (/) log log(2n) n n for all r log n. Hence for suciently large n, we have r (n) 1/2 for all r log n. This however implies that P -a.s. for all suciently large n, all r log n and all a1:r Sr , Z(a1:r )
n

1 (n r + 1)r > cr . 2

By Lemma 15.8.10, this renders that P -a.s., for all suciently large n, all r log n and all a1:r Sr , |Zn (a1:r )| Z(a1:r )
n

max[r, log log Z(a1:r ) n ] .

If is suciently small, the right-hand side of this display is smaller than Z(a1:r ) n log Z(a1:r ) n in the range of r considered. The next lemma will prove crucial when checking the most delicate part of the BIC consistency theorem. It will allow us to rule out (almost surely) the possibility that the BIC order estimator jitters around log n for innitely many values of n. ,c For any > 0, any c > 0 and any a1:r , dene the event Bn (a1:r ) by
,c Bn (a1:r ) = def

y1:n : Z(a1:r ) |Zn (a1:r )|

> cr and Z(a1:r )


n

max[r, 4 log log Z(a1:r ) n ] . < 3/2. Then P -a.s.

Lemma 15.8.11. Let > 0 and c > 0 be such that lim sup sup
n r>r

1 |Sr |

1Bn (a1:r ) = 0 ,c
a1:r Sr

596

15 Order Estimation
1 3 ) 4 log log m m+2

Proof. Choose > 1 such that (1 + consider those m such that 1 +


,c Cm (a1:r ) = def 1 3

3/2 . In the sequel, we only 3/2. Put

y1: : n : m Z(a1:r ) and |Zn (a1:r )| Z(a1:r )

n n

m+1 , m > cr max[r, 4 log log Z(a1:r ) n ] .

The proof is carried in two steps: (i) Proving that P -a.s., lim sup
M m>M r>r

1 |Sr |

1Cm (a1:r ) = 0 ; ,c
a1:r Sr

(15.18)

(ii) Proving that this entails lim sup sup


n r>r

1 |Sr |

1Bn (a1:r ) = 0 . ,c
a1:r Sr

(15.19)

Note that when dealing with |Sr |1 a1:r Sr 1Cm (a1:r ) , we adapt the time,c scale at which we analyze Zn (a1:r ) to the pattern. This allows us to formulate a rather strong statement: not only does um =
r>r

1 |Sr |

1Cm (a1:r ) ,c
a1:r Sr

tend to 0 as m tends to innity, but the series m um is convergent. Let us start with the rst step. Thanks to our assumptions on the values of and m, E
r>r

1 |Sr |

1Cm (a1:r ) ,c
a1:r Sr

exp
4

r 2 1 +

log log m <r< c

3 m

+
4 r< log log m

exp 4 log log m 3 1 |Sr |

4 log log 2 1 +
1 3

4 log log m m+2

exp Hence E
m>M

1 4 + log log m . 1 exp(1/3)

1Cm (a1:r ) < , ,c


a1:r Sr

r>r

15.8 Consistency of BIC for Markov Order Estimation

597

which shows that (15.18) holds P -a.s. Let us now proceed to the second step. As P is assumed ergodic, it is enough to consider sequences y1: such that Z(a1:r ) n tends to innity for all a1:r . Assume that there exists a sequence {rn } such that for some > 0, for innitely many n, 1 1Bn (a1:rn ) > . ,c |Srn |
a1:rn Srn

If the sequence rn has an accumulation point r, then there exists some a1:r ,c such that Bn (a1:rn ) occurs for innitely many n. This however implies that ,c innitely many events Cm (a1:r ) occur, which means that whatever M , 1 1 ,r =. |Sr | Cm (a1:r )

m>M

If the sequence rn is increasing then for each n such that 1 |Srn | holds, also 1 |Srn | a Hence, whatever M , 1 |Sr |

1Bn (a1:rn ) > ,c


a1:rn Srn

1Cm (a1:rn ) > . ,r


m>log (crn )

1:rn

1Cm (a1:r ) > . ,c


a1:r Sr

m>M r> m /c

Remark 15.8.12. Lemmas 15.8.10 and 15.8.11 are proved in a very similar way, they have a similar form, but convey a dierent message. In Lemma 15.8.10, the constant may be taken rather close to 2 and the constants in the lemma may be considered as trade-os between the constants that show up in the law of the iterated logarithm and the constants that may be obtained if the union bound has to be used repeatedly. Note that if the conditions of Lemma 15.8.10 are to be met, for a given we cannot look for arbitrarily small c. This is sharp contrast with the setting of Lemma 15.8.11. There the constant was deliberately set to 4, and the freedom allowed by this convention, as well as by the normalizing factors 1/|Sr |, allows us to consider arbitrarily small c. Proof (of Theorem 15.8.2). First note that if |Sr | does not grow exponentially fast in r, then the Markov chain has zero entropy rate, it is a deterministic

598

15 Order Estimation

process and the likelihood ratios of interest are equal to 1. Thus there is nothing to do. Let us hence thus assume that there exists some h > 0 such that for all suciently large r, log |Sr | hr. Then
r 1 1 1 [log P (y1:n ) log P (y1:n )] ehr log n . |Sr | log n

Hence for r (C/h) log n with C > log , the quantity tends to 0 as n tends to innity. It thus remains to prove that for every > 0, sup
r r C h r 1 1 [log P (y1:n ) log P (y1:n )] |Sr | log n log n

occurs only nitely many times. Assume < 1/4. Then by Lemma 15.8.9 there exists some > 0 depending on P and such that for all suciently large n, all r such that r < r < log n and all a1:r Sr , |Zn (a1:r )| < But this inequality shows that |P (ar | a1:r1 ) P (ar | a1:r1 )|
r

Z(a1:r )

log Z(a1:r )

(15.20)

P (ar | a1:r1 ) log Nn1 (a1:r1 ) . Nn1 (a1:r1 )

Hence P -a.s., for all suciently large n and all r < r < log n, Nn1 (a1:r1 ) 2 r [P ( | a1:r1 ), P ( | a1:r1 )] log n . |Y| On the other hand, notice that if |Zn (a1:r )| then
r

(15.21)

1 Z(a1:r ) 2

1 P (ar | a1:r1 ) . 2 Hence by Corollary 15.8.5, as log u < u/4, P -a.s., for all suciently large n and all r < r < log n, |Pn (ar | a1:r1 ) P (ar | a1:r1 )|
r 1 [log Pn (y1:n ) log P (y1:n )] . |Sr | log n

Thus P -a.s., for suciently large n, sup


r<r r 1 [log Pn (y1:n ) log P (y1:n )] . |Sr | log n < log n

15.8 Consistency of BIC for Markov Order Estimation

599

Let us now consider those r such that log n r (C/h) log n. Choose 2 and c2 such that for some (irrelevant) > 2, the conditions of Lemma 15.8.10 are satised. Note that for n suciently large, for all r such that log n r (C/h) log n, max(2 r, log log n) = 2 r. Let 1 > 0 and c1 > 0 be chosen in such a way that c1 + 1 < h/C. We will use Lemma 15.8.11 with those constants. Recall that c1 and 1 may be chosen arbitrarily close to 0 (see Remark 15.8.12). Let Gr,n , Gr,n , Gr,n and Gr,n be dened by 4 3 2 1 Gr,n = {a1:r1 : Nn1 (a1:r1 ) < c1 r} Sr1 , 1 Gr,n = {a1:r1 : c1 r Nn1 (a1:r1 ) 2 and for all a Y, |Zn (a1:r1 , a)| < Gr,n 3 Gr,n 4 = {a1:r1 : c1 r Nn1 (a1:r1 ) < c2 r and for some a Y, |Zn (a1:r1 , a)| < = {a1:r1 : c2 r < Nn1 (a1:r1 ) and for all a Y, |Zn (a1:r1 , a)| < 2 r Z(a1:r1 , a)
n

1 r Z(a1:r1 , a) n } , 1 r Z(a1:r1 , a) n } , \ Gr,n . 2

By Lemma 15.8.10, P -a.s., for suciently large n and all r such that log n r (C/h) log n, Gr,n Gr,n Gr,n Gr,n = Sr1 . 1 2 3 4 Moreover by Lemma 15.8.11, P -a.s., for suciently large n and the same r, |Gr,n | + |Gr,n | 3 4 <. |Sr1 | By the denition of Gr,n and Gr,n , we are in a position to use Corollary 15.8.5 2 4 to obtain Nn1 (a1:r1 )D(Pn ( | a1:r1 ) | P ( | a1:r1 )) 1 r if a1:r1 Gr,n , 2 2 r if a1:r1 Gr,n . 4 (15.22)

Thus P -a.s., for suciently large n and all r such that log n r (C/h) log n, log P (y1:n ) log P (y1:n )
iGr,n a1:r1 Gr,n i i r

Nn1 (a1:r1 )D(Pn ( | a1:r1 ) | P ( | a1:r1 )) |Gr,n |c1 r log 1 1 1 + |Gr,n |1 r + |Gr,n |c2 r log + |Gr,n |2 r . 2 3 4

Dividing both sides by |Sr | log n, we nd for the range of r of interest that

600

15 Order Estimation
r 1 [log P (y1:n ) log P (y1:n )] |Sr | log n C |Gr,n | |Gr,n | c1 + 1 + c2 3 + 4 2 h |Sr | |Sr |

As we may choose c1 + 1 h/C, P -a.s., for suciently large n, sup


r: log nr C h r 1 [log P (y1:n ) log P (y1:n )] . |Sr | log n log n

15.9 Complements
The order estimation problem for HMMs and Markov processes became an active topic in the information theory literature in the late 1980s. Early references can be found in Finesso (1991) and Ziv and Merhav (1992). Other versions of the order estimation problem had been tackled even earlier, see Haughton (1988). We refer to Chambaz (2003, Chapter 7) for a brief history of order identication. The denition of HMM order used in this chapter is classical. A general discussion concerning HMM order and related notions like rank can be found in Finesso (1991). An early discussion of order estimation issues in ARMA modeling is presented in Azencott and Dacunha-Castelle (1984). Finesso (1991) credits the latter reference for major inuence on his work on Markov order estimation. The connections between the performance of generalized likelihood ratio testing and the behavior of maximum likelihood ratios was outlined in Finesso (1991). Using the law of iterated logarithms for the empirical measure of Markov chains in order to identify small penalties warranting consistency in Markov order estimation also goes back to Finesso (1991) The connections between order estimation and hypothesis testing has been emphasized in the work of Merhav and collaborators (Zeitouni and Gutman, 1991; Zeitouni et al., 1992; Ziv and Merhav, 1992; Feder and Merhav, 2002). Those papers present various settings for composite hypothesis testing in which generalized likelihood ratio testing may or may not be asymptotically optimal. Though the use of universal coding arguments in order identication is already present in Finesso (1991), Zeitouni and Gutman (1991), and Ziv and Merhav (1992), the paper by Kieer (1993) provides the most striking exposition of the connections between order identication and universal coding. Versions of Lemmas 15.6.2 and 15.6.3 are at least serendipitous in Kieer (1993). Results of Section 15.6 can be regarded as elaboration of ideas exposed by Kieer.

15.9 Complements

601

The proof of the rst inequality in Lemma 15.6.4 goes back to Shtarkov (1987). The proof of the second inequality for HMMs is due to Csiszr (1990). a Variants of the result have been used by Finesso (1991) and Liu and Narayan (1994). Section 15.8 is mainly borrowed from Csiszr (2002), although the rea sults presented here were already contained in Csiszr and Shields (2000) a but justied with dierent proofs. The use of non-asymptotic tail inequalities (concentration inequalities) for the analysis of model selection procedure has become a standard approach in modern statistics (see Bartlett et al., 2002, and references therein for more examples on this topic). Section 15.7 is largely inspired by Gassiat and Boucheron (2003), and further results in this direction can be found in Chambaz (2003) and Boucheron and Gassiat (2004).

Part IV

Appendices

A Conditioning

A.1 Probability and Topology Terminology and Notation


By a measurable space is meant a pair (X, X ) with X being a set and X being a -eld of subsets of X. The sets in the -eld are called measurable sets. We will always assume that for any x X, the singleton set {x} is measurable. Typically, if X is a topological space, then X is the Borel -eld, that is, the -eld generated by the open subsets of X. If X is a discrete set (that is, nite or countable), then X is the power set P(X), the collection of all subsets of X. A positive measure on a measurable space (X, X )1 is a measure such that (A) 0, for all A X , and (X) > 0. A probability measure is a positive measure with unit total mass, (X) = 1. All measures will be assumed to be -nite. Let (, F) and (X, X ) be two measurable spaces. A function X : X is said to be measurable if the set X 1 (A) F for all A X . If (X, X ) = (R, B(R)) where B(R) is the Borel -eld, X is said to be real-valued random variable. By abuse of notation, but in accordance with well-established traditions, the phrase random variable usually refers to a real-valued random variable. If X is not the real numbers R, we often write X-valued random variable. A -eld G on such that G F is called a sub--eld of F. If X is a random variable (real-valued or not) such that X 1 (A) G for all A X for such a sub--eld G, then X is said to be G-measurable. If X denotes an X-valued mapping on , then the -eld generated by X, denoted by (X), is the smallest -eld on that makes X measurable. It can be expressed as (X) = X 1 (X ) = {X 1 (B) : B X }. Typically it is assumed that X is a random variable, that is, X is F-measurable, and then (X) is a sub--eld of

1 In some situations, such as when X is a countable set, the -eld under consideration is unambiguous and essentially unique and we may omit X for notational simplicity.

606

A Conditioning

F. If Z is a real-valued random variable that is (X)-measurable, then there exists a measurable function g : X R such that Z = g X = g(X). If (, F) is a measurable space and P is a probability measure on F, the triplet (, F, P) is called a probability space. We then write E[X] for the expectation of a random variable X on (, F), meaning the (Lebesgue) integral X d P. The image of P by X, denoted by PX , is the probability measure dened by PX (B) = P(X 1 (B)). As good as all random variables (real-valued or not) in this book are assumed to be dened on a probability space denoted by (, F, P), and in most cases this probability space is not mentioned explicitly. The space is sometimes called the sample space. Finally, a few words on topological spaces. A topological space is a set Y equipped with a topology T . A topological space (Y, T ) is called metrizable if there exists a metric d : Y Y [0, ] such that the topology induced by d is T . If (Y, d) is a metric space, a Cauchy sequence in this space is a sequence {yn }n0 in Y such that d(yn , ym ) 0 as n, m . A metric space (Y, d) is called complete if every Cauchy sequence in Y has a limit in Y. A topological space (Y, T ) is called a Polish space if (Y, T ) is separable (i.e., it admits a countable dense subset) and metrizable for some metric d such that the metric space (Y, d) is complete. As a trivial example, Rn equipped with the Euclidean distance is the most elementary example of a Polish space.

A.2 Conditional Expectation


Let (Ω, F, P) be a probability space. For p > 0 we denote by L^p(Ω, F, P) the space of random variables X such that E|X|^p < ∞, and by L⁺(Ω, F, P) the space of random variables X such that X ≥ 0 P-a.s. If we identify random variables that are equal P-a.s., we obtain the corresponding spaces of equivalence classes, still denoted by L^p(Ω, F, P) and L⁺(Ω, F, P). We allow random variables to assume the values ±∞.

Lemma A.2.1. Let (Ω, F, P) be a probability space, let X ∈ L⁺(Ω, F, P), and let G be a sub-σ-field of F. Then there exists Y ∈ L⁺(Ω, G, P) such that

    E[XZ] = E[Y Z]   (A.1)

for all Z ∈ L⁺(Ω, G, P). If Y′ ∈ L⁺(Ω, G, P) also satisfies (A.1), then Y = Y′ P-a.s.

A random variable with the above properties is called a version of the conditional expectation of X given G, and we write Y = E[X | G]. Conditional expectations are thus defined up to P-almost sure equality. Hence, when writing E[X | G] = Y for instance, we always mean that this relation holds P-a.s., that is, Y is a version of the conditional expectation. One can indeed extend the definition of the conditional expectation to random variables that do not belong to L⁺(Ω, F, P). We follow here the approach outlined in Shiryaev (1996, Section II.7).
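As a quick numerical illustration of the defining property (A.1), the following sketch is offered; it is not part of the original text, it assumes Python with NumPy, and the toy model (G = σ(W), X = W² + ε with ε independent of W and centered, so that a version of E[X | G] is W²) is an arbitrary choice made for this example only.

import numpy as np

# Illustrative sketch (not from the text): with G = sigma(W), X = W**2 + eps,
# and eps independent of W with mean zero, a version of E[X | G] is W**2.
# Property (A.1) states E[X Z] = E[E(X | G) Z] for G-measurable Z; we check
# it by Monte Carlo with Z = cos(W).
rng = np.random.default_rng(0)
n = 1_000_000
W = rng.normal(size=n)
eps = rng.normal(size=n)       # independent of W, mean zero
X = W**2 + eps
cond_exp = W**2                # version of E[X | sigma(W)]
Z = np.cos(W)                  # a bounded sigma(W)-measurable random variable

print(np.mean(X * Z))          # Monte Carlo estimate of E[X Z]
print(np.mean(cond_exp * Z))   # estimate of E[E(X | G) Z]; agrees up to MC error

Any other bounded σ(W)-measurable variable Z could be used in place of cos(W); the two averages differ only by Monte Carlo error.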


Definition A.2.2 (Conditional Expectation). Let (Ω, F, P) be a probability space, let X be a random variable, and let G be a sub-σ-field of F. Define X⁺ = max(X, 0) and X⁻ = −min(X, 0). If

    min{E[X⁺ | G], E[X⁻ | G]} < ∞   P-a.s. ,

then (a version of) the conditional expectation of X given G is defined by E[X | G] = E[X⁺ | G] − E[X⁻ | G]; on the set of probability 0 of sample points ω where E[X⁺ | G] and E[X⁻ | G] are both infinite, the above difference is assigned an arbitrary value, for instance, zero. In particular, if E[|X| | G] < ∞ P-a.s., then E[X⁺ | G] < ∞ and E[X⁻ | G] < ∞ P-a.s., and we may always define the conditional expectation in this context. Note that for X ∈ L¹(Ω, F, P), E[X⁺] < ∞ and E[X⁻] < ∞. By applying (A.1) with Z ≡ 1, E[E(X⁺ | G)] = E[X⁺] < ∞ and E[E(X⁻ | G)] = E[X⁻] < ∞. Therefore, E[X⁺ | G] < ∞ and E[X⁻ | G] < ∞ P-a.s., and thus the conditional expectation is always defined for X ∈ L¹(Ω, F, P).

Let Y be a random variable and let σ(X) be the sub-σ-field generated by a random variable X. If E[Y | σ(X)] is well-defined, we write E[Y | X] rather than E[Y | σ(X)]. This is called the conditional expectation of Y given X. By construction, E[Y | X] is a σ(X)-measurable random variable. Thus (cf. Section A.1), there exists a real measurable function g on X such that E[Y | X] = g(X). The choice of g is unambiguous in the sense that any two functions g and g′ satisfying this equality must be equal P^X-a.s. We sometimes write E[Y | X = x] for such a g(x).

Many of the useful properties of expectations extend to conditional expectations. We state below some of these useful properties. In the following statements, all equalities and inequalities between random variables, and convergence of such, should be understood to hold P-a.s.

Proposition A.2.3 (Elementary Properties of Conditional Expectation).
(a) If X ≤ Y and, either, X ≥ 0 and Y ≥ 0, or E[|X| | G] < ∞ and E[|Y| | G] < ∞, then E[X | G] ≤ E[Y | G].
(b) If E[|X| | G] < ∞, then |E[X | G]| ≤ E[|X| | G].
(c) If X ≥ 0 and Y ≥ 0, then for any non-negative real numbers a and b,

    E[aX + bY | G] = a E[X | G] + b E[Y | G] .

If E[|X| | G] < ∞ and E[|Y| | G] < ∞, the same equality holds for arbitrary real numbers a and b.
(d) If G = {∅, Ω} is the trivial σ-field and X ≥ 0 or E|X| < ∞, then E[X | G] = E[X].


(e) If H is a sub-σ-field of F such that G ⊆ H and X ≥ 0, then

    E[E(X | H) | G] = E[X | G] .   (A.2)

If E[|X| | G] < ∞, then E[|X| | H] < ∞ and (A.2) holds.
(f) Assume that X is independent of G, in the sense that E[XY] = E[X] E[Y] for all G-measurable random variables Y. If, in addition, either X ≥ 0 or E|X| < ∞, then

    E[X | G] = E[X] .   (A.3)

(g) If X is G-measurable, X ≥ 0, and Y ≥ 0, then

    E[XY | G] = X E[Y | G] .   (A.4)

The same conclusion holds if E[|XY| | G], |X|, and E[|Y| | G] are all finite.

Proof. (a): Assume that X and Y are non-negative. By (A.1), for any A ∈ G,

    E[E(X | G) 1_A] = E[X 1_A] ≤ E[Y 1_A] = E[E(Y | G) 1_A] .

Setting, for any M > 0, A_M = {E[X | G] − E[Y | G] ≥ 1/M}, the above relation implies that P(A_M) = 0. Therefore, P{E[X | G] − E[Y | G] > 0} = 0. For general X and Y, the condition X ≤ Y implies that X⁺ ≤ Y⁺ and Y⁻ ≤ X⁻; therefore E[X⁺ | G] ≤ E[Y⁺ | G] and E[Y⁻ | G] ≤ E[X⁻ | G], which proves the desired result.

(b): This part follows from the preceding property, on observing that −|X| ≤ X ≤ |X|.

(c): Assume first that X, Y, a, and b are all non-negative. Then, for any A ∈ G,

    E[E(aX + bY | G) 1_A] = E[(aX + bY) 1_A] = a E[X 1_A] + b E[Y 1_A]
        = a E[E(X | G) 1_A] + b E[E(Y | G) 1_A] = E{[a E(X | G) + b E(Y | G)] 1_A} ,

which establishes the first part of (c). For arbitrary reals a and b, and X and Y such that E[|X| | G] < ∞ and E[|Y| | G] < ∞, (b) and the first part of (c) show that E[|aX + bY| | G] ≤ |a| E[|X| | G] + |b| E[|Y| | G] < ∞, whence E[(aX + bY) | G] is well-defined. We will now show that, for two non-negative random variables U and V satisfying E[U | G] < ∞ and E[V | G] < ∞,

    E[U − V | G] = E[U | G] − E[V | G] .   (A.5)

Applying again the first part of (c) and noting that (U − V)⁺ = (U − V) 1{U ≥ V} and (U − V)⁻ = (V − U) 1{V > U}, we find that


    E[U − V | G] + E[V 1{U ≥ V} | G] − E[U 1{V > U} | G]
        = E[(U − V) 1{U ≥ V} | G] + E[V 1{U ≥ V} | G]
            − {E[(V − U) 1{V > U} | G] + E[U 1{V > U} | G]}
        = E[U 1{U ≥ V} | G] − E[V 1{V > U} | G] .

Moving the two last terms on the left-hand side to the right-hand side establishes (A.5). Finally, the second part of (c) follows by splitting aX and bY into their positive and negative parts (aX)⁺ and (aX)⁻ etc., and using the above linearity.

(e): Suppose first that X ≥ 0, and pick A ∈ G. Then A is in H as well, so that, using (A.1) repeatedly,

    E(1_A E[E(X | H) | G]) = E[1_A E(X | H)] = E[1_A X] = E[1_A E(X | G)] .

This establishes (e) for non-negative random variables. Suppose now that E[|X| | G] < ∞. For any integer M ≥ 0, put A_M = {E[|X| | H] > M}, and put A = {E[|X| | H] = ∞}. Then A_M is in H, and so is A = ∩_{M≥0} A_M. Moreover,

    M E[1_A | G] ≤ E[M 1_{A_M} | G] ≤ E[E(|X| | H) 1_{A_M} | G] ≤ E[E(|X| | H) | G] = E[|X| | G] < ∞ .

Because M is arbitrary in this display, E[1_A | G] = 0, implying that E[1_A] = 0. Hence, P(A) = 0, that is, E[|X| | H] < ∞. The second part of (e) now follows from (c) applied to E[X⁺ | H] and E[X⁻ | H].

(f): If X ≥ 0, then (A.1) implies that for any A ∈ G,

    E[1_A E(X | G)] = E[1_A X] = E[1_A E(X)] .

This proves the first part of (f). If E|X| < ∞, then E[X⁺] < ∞ and E[X⁻] < ∞, and the proof follows by linearity.

(g): For X ≥ 0 and Y ≥ 0, (A.1) shows that, for any A ∈ G,

    E[1_A E(XY | G)] = E[1_A XY] = E[1_A X E(Y | G)] .

Thus, the first part of (g) follows. For X and Y such that |X|, E[|Y| | G], and E[|XY| | G] are all finite, the random variables E[X⁺Y⁺ | G], E[X⁺Y⁻ | G], E[X⁻Y⁺ | G], and E[X⁻Y⁻ | G] are finite too. Therefore, applying (c),

    E[XY | G] = E[X⁺Y⁺ | G] + E[X⁻Y⁻ | G] − E[X⁺Y⁻ | G] − E[X⁻Y⁺ | G] .

The preceding result shows that the four terms on the right-hand side equal X⁺ E[Y⁺ | G], X⁻ E[Y⁻ | G], X⁺ E[Y⁻ | G], and X⁻ E[Y⁺ | G], respectively. Because these four random variables are finite, the result follows.

Proposition A.2.4. Let {X_n}_{n≥0} be a sequence of random variables.
(i) If X_n ≥ 0 and X_n ↑ X, then E[X_n | G] ↑ E[X | G].


(ii) If X_n ≤ Y, E[|Y| | G] < ∞, and X_n ↓ X with E[|X| | G] < ∞, then E[X_n | G] ↓ E[X | G].
(iii) If |X_n| ≤ Z, E[Z | G] < ∞, and X_n → X, then E[X_n | G] → E[X | G] and E[|X_n − X| | G] → 0.

Proof. (i): Proposition A.2.3(a) shows that E[X_n | G] ≤ E[X_{n+1} | G]; hence, lim_n E[X_n | G] exists P-a.s. Because lim_n E[X_n | G] is a limit of G-measurable random variables, it is G-measurable. By (A.1) and the monotone convergence theorem, for any A ∈ G,

    E[1_A lim E(X_n | G)] = lim E[1_A E(X_n | G)] = lim E[1_A X_n] = E[1_A X] .

Because the latter relation holds for all A ∈ G, Lemma A.2.1 shows that lim E(X_n | G) = E(X | G).

(ii): First note that, as {X_n} decreases to X, we have X ≤ X_n ≤ Y for all n. This implies |X_n| ≤ |X| + |Y|, and we conclude that E[|X_n| | G] < ∞ for all n. Now set Z_n = Y − X_n. Then Z_n ≥ 0 and Z_n ↑ Y − X. Therefore, using (i) and Proposition A.2.3(c),

    E[Y | G] − E[X_n | G] = E[Z_n | G] ↑ E[lim Z_n | G] = E[Y − X | G] = E[Y | G] − E[X | G] .

(iii): Set Z_n = sup_{m≥n} |X_m − X|. Because X_n → X, Z_n ↓ 0. By Proposition A.2.3(b) and (c),

    |E(X_n | G) − E(X | G)| ≤ E[|X_n − X| | G] ≤ E[Z_n | G] .

Because Z_n ↓ 0 and Z_n ≤ 2Z, (ii) shows that E[Z_n | G] ↓ 0.

The following equality plays a key role in several parts of the book, and we thus provide a simple proof of this result.

Proposition A.2.5 (Rao-Blackwell Inequality). Let (Ω, F, P) be a probability space, let X be a random variable such that E[X²] < ∞, and let G be a sub-σ-field of F. Then

    Var[X] = Var[E(X | G)] + E[Var(X | G)] ,   (A.6)

where the conditional variance Var(X | G) is defined as

    Var(X | G) = E[(X − E[X | G])² | G] .   (A.7)

This implies in particular that Var[E(X | G)] ≤ Var[X], where the inequality is strict unless X is G-measurable.

Proof. Without loss of generality, we may assume that E[X] = 0. Write

    E[(X − E[X | G])² | G] = E[X² | G] − (E[X | G])² .

Taking expectations on both sides and noting that E[E(X | G)] = E[X] = 0 yields (A.6).
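The decomposition (A.6) is easy to check numerically. The following sketch is illustrative only (it is not from the text) and assumes Python with NumPy; it uses an arbitrary two-component Gaussian mixture with G = σ(Z), for which E[X | G] = µ_Z and Var(X | G) = σ_Z² are explicit.

import numpy as np

# Illustrative sketch (not from the text): variance decomposition (A.6) for a
# two-component Gaussian mixture, conditioning on the component label Z.
rng = np.random.default_rng(1)
n = 1_000_000
mu = np.array([-1.0, 2.0])
sigma = np.array([0.5, 1.5])
p = 0.3                                      # P(Z = 1)

Z = (rng.random(n) < p).astype(int)          # component labels in {0, 1}
X = rng.normal(mu[Z], sigma[Z])

var_total = X.var()                          # Var[X]
var_of_cond_mean = mu[Z].var()               # Var[E(X | G)], since E[X | G] = mu_Z
mean_of_cond_var = (sigma[Z] ** 2).mean()    # E[Var(X | G)], since Var(X | G) = sigma_Z**2

print(var_total)
print(var_of_cond_mean + mean_of_cond_var)   # matches Var[X] up to Monte Carlo error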


A.3 Conditional Distribution


Definition A.3.1 (Version of Conditional Probability). Let (Ω, F, P) be a probability space and let G be a sub-σ-field of F. For any event F ∈ F, P(F | G) = E[1_F | G] is called a version of the conditional probability of F with respect to G.

We might expect a version of the conditional probability F ↦ P(F | G) to be a probability measure on F. If {F_n}_{n≥0} is a sequence of disjoint sets in F, then Propositions A.2.3(c) and A.2.4(i) show that

    P( ∪_{n=0}^∞ F_n | G ) = ∑_{n=0}^∞ P(F_n | G) ,

or, more precisely, that ∑_{n=0}^∞ P(F_n | G) is a version of the conditional expectation of 1_{∪_{n=0}^∞ F_n} given G. This version is defined up to a P-null set. However, this null set may depend on the sequence {F_n}_{n≥0}. Because, except in very specific cases, the σ-field F is not countable, there is no guarantee that it is possible to choose versions of the conditional probability for each set F that are such that the countable additivity property holds for all sequences {F_n}_{n≥0} except on a single P-null set. This leads to the need for, and the definition of, regular conditional probabilities.

Definition A.3.2 (Regular Conditional Probability). Let (Ω, F, P) be a probability space and let G be a sub-σ-field of F. A regular version of the conditional probability of P given G is a function P^G : Ω × F → [0, 1] such that
(i) For all F ∈ F, P^G(·, F) is G-measurable and is a version of the conditional probability of F given G, P^G(·, F) = P[F | G];
(ii) For P-almost every ω ∈ Ω, the mapping F ↦ P^G(ω, F) is a probability measure on F.

Closely related to regular conditional probabilities is the notion of regular conditional distribution.

Definition A.3.3 (Regular Conditional Distribution of Y Given G). Let (Ω, F, P) be a probability space and let G be a sub-σ-field of F. Let (Y, Y) be a measurable space and let Y be a Y-valued random variable. A regular version of the conditional distribution of Y given G is a function P_{Y|G} : Ω × Y → [0, 1] such that


(i) For all E ∈ Y, P_{Y|G}(·, E) is G-measurable and is a version of the conditional probability of {Y ∈ E} given G, P_{Y|G}(·, E) = E[1_E(Y) | G];
(ii) For P-almost every ω ∈ Ω, E ↦ P_{Y|G}(ω, E) is a probability measure on Y.

In the sequel, we will focus exclusively on regular conditional distributions. When a regular version of the conditional distribution of Y given G exists, conditional expectations can be written as integrals for each ω.

Theorem A.3.4. Let (Ω, F, P) be a probability space and let G be a sub-σ-field of F. Let (Y, Y) be a measurable space, let Y be a Y-valued random variable, and let P_{Y|G} be a regular version of the conditional distribution of Y given G. Then for any real-valued measurable function g on Y such that E|g(Y)| < ∞, g is integrable with respect to P_{Y|G}(ω, ·), that is, ∫_Y |g(y)| P_{Y|G}(ω, dy) < ∞ for P-almost every ω, and

    E[g(Y) | G] = ∫_Y g(y) P_{Y|G}(·, dy) .   (A.8)

That is, ∫_Y g(y) P_{Y|G}(·, dy) is a version of the conditional expectation of g(Y) given G.

The key question is now the existence of regular conditional probabilities. It is known that regular conditional probabilities exist under most conditions encountered in practice, but we should keep in mind that they do not always exist. This topic requires some care, because the existence of these regular versions requires additional assumptions on the topology of the probability space (see Dudley, 2002, Chapter 10). Here is the main theorem on existence and uniqueness of regular conditional probabilities. It is not stated under the weakest possible topological assumptions, but the assumptions of this theorem are nevertheless mild and are verified in all situations considered in this book.

Theorem A.3.5. Let (Ω, F, P) be a probability space and let G be a sub-σ-field of F. Let Y be a Polish space, let Y be its Borel σ-field, and let Y be a Y-valued random variable. Then there exists a regular version of the conditional distribution of Y given G, P_{Y|G}, and this version is unique in the sense that for any other regular version P̃_{Y|G} of this distribution, for P-almost every ω it holds that

    P_{Y|G}(ω, F) = P̃_{Y|G}(ω, F)   for all F ∈ Y .

For a proof, see Dudley (2002, Theorem 10.2.2). Finally, it is of interest to define the regular conditional distribution of a random variable Y given another random variable X.

Definition A.3.6 (Regular Conditional Distribution of Y Given X). Let (Ω, F, P) be a probability space and let X and Y be random variables with


values in the measurable spaces (X, X) and (Y, Y), respectively. Then a regular version of the conditional distribution of Y given σ(X) is a function P_{Y|X} : X × Y → [0, 1] such that
(i) For all E ∈ Y, x ↦ P_{Y|X}(x, E) is X-measurable and

    P_{Y|X}(x, E) = E[1_E(Y) | X = x] ;   (A.9)

(ii) For P^X-almost every x ∈ X, E ↦ P_{Y|X}(x, E) is a probability measure on Y.

When a regular version of the conditional distribution of Y given X exists, conditional expectations can be written as integrals for each x.

Theorem A.3.7. Let (Ω, F, P) be a probability space, let X and Y be random variables with values in the measurable spaces (X, X) and (Y, Y), respectively, and let P_{Y|X} be a regular version of the conditional distribution of Y given X. Then for any real-valued measurable function g on Y such that E|g(Y)| < ∞, g is integrable with respect to P_{Y|X}(x, ·) for P^X-almost every x, and

    E[g(Y) | X = x] = ∫_Y g(y) P_{Y|X}(x, dy) .   (A.10)

Moreover, for any real-valued measurable function g on the measurable space (X × Y, X ⊗ Y) such that E|g(X, Y)| < ∞, g(x, ·) is integrable with respect to P_{Y|X}(x, ·) for P^X-almost every x, and

    E[g(X, Y)] = ∫_X ∫_Y g(x, y) P_{Y|X}(x, dy) P^X(dx) ,   (A.11)

    E[g(X, Y) | X = x] = ∫_Y g(x, y) P_{Y|X}(x, dy) .   (A.12)

We conclude this section by stating conditions under which there exists a regular conditional probability of Y given X.

Theorem A.3.8. Let (Ω, F, P) be a probability space and let X and Y be random variables with values in the measurable spaces (X, X) and (Y, Y), respectively, with Y being a Polish space and Y being its Borel σ-field. Then there exists a regular version P_{Y|X} of the conditional distribution of Y given X, and this version is unique.
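To make Theorems A.3.7 and A.3.8 concrete, here is a small sketch; it is not from the text, it assumes Python with NumPy, and the standard bivariate Gaussian pair with g(y) = y² is an arbitrary illustrative choice. For such a pair, a regular version of the conditional distribution of Y given X = x is the N(ρx, 1 − ρ²) law, so E[g(Y) | X = x] can be evaluated both as the integral in (A.10) and in closed form.

import numpy as np

# Illustrative sketch (not from the text): (X, Y) standard bivariate Gaussian
# with correlation rho; P_{Y|X}(x, .) is the N(rho * x, 1 - rho**2) law.
# We evaluate E[g(Y) | X = x] as the integral of g(y) against P_{Y|X}(x, dy)
# numerically, and compare with the closed form for g(y) = y**2.
rho, x = 0.7, 1.3
m, s2 = rho * x, 1.0 - rho**2

y = np.linspace(m - 10 * np.sqrt(s2), m + 10 * np.sqrt(s2), 20001)
dy = y[1] - y[0]
density = np.exp(-((y - m) ** 2) / (2 * s2)) / np.sqrt(2 * np.pi * s2)
g = y**2

print(np.sum(g * density) * dy)   # numerical value of the integral in (A.10)
print(m**2 + s2)                  # closed-form E[Y**2 | X = x]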


A.4 Conditional Independence


Concepts of conditional independence play an important role in hidden Markov models and, more generally, in all models involving complex dependence structures among sets of random variables. This section covers the general definition of conditional independence as well as some basic properties. Further reading on this topic includes the seminal paper by Dawid (1980) as well as more condensed expositions such as Cowell et al. (1999, Chapter 5).

Definition A.4.1 (Conditional Independence). Let (Ω, F, P) be a probability space and let G and G_1, . . . , G_n be sub-σ-fields of F. Then G_1, . . . , G_n are said to be P-conditionally independent given G if for any bounded random variables X_1, . . . , X_n measurable with respect to G_1, . . . , G_n, respectively,

    E[X_1 · · · X_n | G] = E[X_1 | G] · · · E[X_n | G] .

If Y_1, . . . , Y_n and Z are random variables, then Y_1, . . . , Y_n are said to be conditionally independent given Z if the sub-σ-fields σ(Y_1), . . . , σ(Y_n) are P-conditionally independent given σ(Z).

Intuition suggests that if two random variables X and Y are independent given a third one, Z say, then the conditional distribution of X given Y and Z should be governed by the value of Z alone, further information about the value of Y being irrelevant. The following result shows that this intuition is not only correct but could in fact serve as an alternative definition of conditional independence of two variables given a third one.

Proposition A.4.2. Let (Ω, F, P) be a probability space and let A, B, and C be sub-σ-fields of F. Then A and B are P-conditionally independent given C if and only if for any bounded A-measurable random variable X,

    E[X | B ∨ C] = E[X | C] ,

where B ∨ C denotes the σ-field generated by B ∪ C.

Proposition A.4.2 is sometimes used as an alternative definition of conditional independence: it is said that A and B are P-conditionally independent given C if for all A-measurable non-negative random variables X there exists a version of the conditional expectation E[X | B ∨ C] that is C-measurable (Dawid, 1980, Definition 5.1). Following the suggestion of Dawid (1980), the notation

    A ⊥ B | C [P]   (A.13)

is used to denote that the sub-σ-fields A and B are conditionally independent given C, under the probability P. In the case where A = σ(X), B = σ(Y), and C = σ(Z) with X, Y, and Z being random variables, the simplified notation


    X ⊥ Y | Z [P]

will be used. In accordance with Definition A.4.1, we shall then say that X and Y are conditionally independent given Z under P.

The following proposition states a number of useful properties of conditional independence.

Proposition A.4.3. Let (Ω, F, P) be a probability space and let A, B, C, and D be sub-σ-fields of F. Then the following properties hold true.
1. (Symmetry) If A ⊥ B | C [P], then B ⊥ A | C [P].
2. (Decomposition) If A ⊥ (B ∨ C) | D [P], then A ⊥ B | D [P] and A ⊥ C | D [P].
3. (Weak Union) If A ⊥ (B ∨ D) | C [P], then A ⊥ D | B ∨ C [P].
4. (Contraction) If A ⊥ D | B ∨ C [P] and A ⊥ B | C [P], then A ⊥ (B ∨ D) | C [P].

In the theory of Bayesian networks (also called graphical models), as introduced by Pearl (1988), these four properties are referred to as the semi-graphoid inference axioms (Cowell et al., 1999).
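The following sketch illustrates Definition A.4.1 in the simplest discrete setting; it is not taken from the text, it assumes Python with NumPy, and the probability tables are arbitrary. A joint law built as p(x, y, z) = p(z) p(x | z) p(y | z) makes X and Y conditionally independent given Z, even though X and Y are dependent marginally.

import numpy as np

# Illustrative sketch (not from the text): conditional independence for a
# discrete joint law p(x, y, z) = p(z) p(x | z) p(y | z).
p_z = np.array([0.4, 0.6])                  # P(Z = z)
p_x_given_z = np.array([[0.9, 0.1],         # row z, column x: P(X = x | Z = z)
                        [0.2, 0.8]])
p_y_given_z = np.array([[0.7, 0.3],         # row z, column y: P(Y = y | Z = z)
                        [0.1, 0.9]])

# joint[x, y, z] = P(X = x, Y = y, Z = z)
joint = (p_z[None, None, :]
         * p_x_given_z.T[:, None, :]
         * p_y_given_z.T[None, :, :])

# Conditional independence: P(X, Y | Z = z) factorizes for every z.
for z in range(2):
    p_xy_given_z = joint[:, :, z] / p_z[z]
    assert np.allclose(p_xy_given_z, np.outer(p_x_given_z[z], p_y_given_z[z]))

# Marginally, X and Y are dependent: P(X, Y) differs from P(X) P(Y).
p_xy = joint.sum(axis=2)
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
print(np.allclose(p_xy, np.outer(p_x, p_y)))   # prints False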

B Linear Prediction

This appendix provides a brief introduction to the theory of linear prediction of random variables. Further reading includes Brockwell and Davis (1991, Chapter 2), which provides a proof of the projection theorem (Theorem B.2.4 below), as well as Williams (1991) or Jacod and Protter (2000, Chapter 22). The results below are used in Chapter 5 to derive the particular form taken by the filtering and smoothing recursions in linear state-space models.

B.1 Hilbert Spaces


Definition B.1.1 (Inner Product Space). A real linear space H is said to be an inner product space if for each pair of elements x and y in H there is a real number ⟨x, y⟩, called the inner product (or, scalar product) of x and y, such that
(a) ⟨x, y⟩ = ⟨y, x⟩,
(b) ⟨αx + βy, z⟩ = α⟨x, z⟩ + β⟨y, z⟩ for z in H and real α and β,
(c) ⟨x, x⟩ ≥ 0 and ⟨x, x⟩ = 0 if and only if x = 0.

Two elements x and y such that ⟨x, y⟩ = 0 are said to be orthogonal. The norm ‖x‖ of an element x of an inner product space is defined as

    ‖x‖ = √⟨x, x⟩ .   (B.1)

The norm satisfies
(a) ‖x + y‖ ≤ ‖x‖ + ‖y‖ (triangle inequality),
(b) ‖αx‖ = |α| ‖x‖ for real α,
(c) ‖x‖ ≥ 0 and ‖x‖ = 0 if and only if x = 0.

These properties justify the use of the terminology norm for ‖·‖. In addition, the Cauchy-Schwarz inequality |⟨x, y⟩| ≤ ‖x‖ ‖y‖ holds, with equality if and only if y = αx for some real α.


Definition B.1.2 (Convergence in Norm). A sequence {x_n}_{n≥0} of elements of an inner product space H is said to converge in norm to x ∈ H if ‖x_n − x‖ → 0 as n → ∞.

It is readily verified that a sequence {x_n}_{n≥0} that converges in norm to some element x satisfies lim_{n→∞} sup_{m≥n} ‖x_m − x_n‖ = 0. Any sequence, convergent or not, with this property is said to be a Cauchy sequence. Thus any convergent sequence is a Cauchy sequence. If the reverse implication holds true as well, that any Cauchy sequence is convergent (in norm), then the space is said to be complete. A complete inner product space is called a Hilbert space.

Definition B.1.3 (Hilbert Space). A Hilbert space H is an inner product space that is complete, that is, an inner product space in which every Cauchy sequence converges in norm to some element in H.

It is well-known that ℝ^k equipped with the inner product ⟨x, y⟩ = ∑_{i=1}^k x_i y_i, where x = (x_1, . . . , x_k) and y = (y_1, . . . , y_k), is a Hilbert space. A more sophisticated example is the space of square integrable random variables. Let (Ω, F, P) be a probability space and let L²(Ω, F, P) be the space of square integrable random variables on (Ω, F, P). For any two elements X and Y in L²(Ω, F, P) we define

    ⟨X, Y⟩ = E(XY) .   (B.2)

It is easy to check that ⟨X, Y⟩ satisfies all the properties of an inner product except for the last one: if ⟨X, X⟩ = 0, then it does not follow that X(ω) = 0 for all ω, but only that P{ω : X(ω) = 0} = 1. This difficulty is circumvented by saying that the random variables X and Y are equivalent if P(X = Y) = 1. This equivalence relation partitions L²(Ω, F, P) into classes of random variables such that any two random variables in the same class are equal with probability one. The space L²(Ω, F, P) is the set of these equivalence classes with inner product still defined by (B.2). Because each class is uniquely determined by specifying any one of the random variables in it, we shall continue to use the notation X and Y for the elements in L² and to call them random variables, although it is sometimes important to keep in mind that X stands for an equivalence class of random variables. A well-known result in functional analysis is the following.

Proposition B.1.4. The space H = L²(Ω, F, P) equipped with the inner product (B.2) is a Hilbert space.

Norm convergence of a sequence {X_n} in L²(Ω, F, P) to a limit X means that ‖X_n − X‖² = E|X_n − X|² → 0 as n → ∞. Norm convergence of X_n to X in an L²-space is often called mean square convergence.


B.2 The Projection Theorem


Before introducing the notion of projection in Hilbert spaces in general and in L²-spaces in particular, some definitions are needed.

Definition B.2.1 (Closed Subspace). A linear subspace M of a Hilbert space H is said to be closed if M contains all its limit points. That is, if {x_n} is a sequence in M converging to some element x ∈ H, then x ∈ M.

The lemma below is a direct consequence of the fact that the inner product is a continuous mapping from H to ℝ.

Lemma B.2.2 (Closedness of Finite Spans). If y_1, . . . , y_n is a finite family of elements of H, then the linear subspace spanned by y_1, . . . , y_n,
    span(y_1, . . . , y_n) := { x ∈ H : x = ∑_{i=1}^n α_i y_i for some α_1, . . . , α_n ∈ ℝ } ,

is a closed subspace of H.

Definition B.2.3 (Orthogonal Complement). The orthogonal complement M⊥ of a subset M of H is the set of all elements of H that are orthogonal to every element of M: x ∈ M⊥ if and only if ⟨x, y⟩ = 0 for every y ∈ M.

Theorem B.2.4 (The Projection Theorem). Let M be a closed linear subspace of a Hilbert space H and let x ∈ H. Then the following hold true.
(i) There exists a unique element x̂ ∈ M such that

    ‖x − x̂‖ = inf_{y ∈ M} ‖x − y‖ .

(ii) x̂ is the unique element of M such that (x − x̂) ∈ M⊥.

The element x̂ is referred to as the projection of x onto M.

Corollary B.2.5 (The Projection Mapping). If M is a closed linear subspace of the Hilbert space H and I is the identity mapping on H, then there is a unique mapping from H onto M, denoted proj(·|M), such that I − proj(·|M) maps H onto M⊥. proj(·|M) is called the projection mapping onto M.

The following properties of the projection mapping can be readily obtained from Theorem B.2.4.

Proposition B.2.6 (Properties of the Projection Mapping). Let H be a Hilbert space and let proj(·|M) denote the projection mapping onto a closed linear subspace M. Then the following properties hold true.


(i) For all x, y in H and real α, β,

    proj(αx + βy|M) = α proj(x|M) + β proj(y|M) .

(ii) x = proj(x|M) + proj(x|M⊥).
(iii) ‖x‖² = ‖proj(x|M)‖² + ‖proj(x|M⊥)‖².
(iv) x ↦ proj(x|M) is continuous.
(v) x ∈ M if and only if proj(x|M) = x, and x ∈ M if and only if proj(x|M⊥) = 0.
(vi) If M_1 and M_2 are two closed linear subspaces of H, then M_1 ⊆ M_2 if and only if for all x ∈ H,

    proj(proj(x|M_2) | M_1) = proj(x|M_1) .

When the space H is an L²-space, the following terminology is often preferred.

Definition B.2.7 (Best Linear Prediction). If M is a closed subspace of L²(Ω, F, P) and X ∈ L²(Ω, F, P), then the best linear predictor (also called minimum mean square error linear predictor) of X in M is the element X̂ ∈ M such that

    ‖X − X̂‖² = E(X − X̂)² ≤ E(X − Y)²   for all Y ∈ M .

The best linear predictor is clearly just an alternative denomination for proj(X|M), taking the probabilistic context into account. Interestingly, the projection theorem implies that X̂ is also the unique element in M such that

    ⟨X − X̂, Y⟩ = E[(X − X̂)Y] = 0   for all Y ∈ M .

An immediate consequence of Proposition B.2.6(iii) is that the mean square prediction error ‖X − X̂‖² may be written in two other equivalent and often useful ways, namely

    ‖X − X̂‖² = E[(X − X̂)²] = E[X(X − X̂)] = E[X²] − E[X̂²] .
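As an illustration of Definition B.2.7 and of the identities above, the following sketch (not from the text; Python with NumPy assumed, and the linear model generating X, Y_1, Y_2 is an arbitrary example) computes the best linear predictor of X in span(1, Y_1, Y_2) by least squares and then checks the orthogonality relation and the prediction-error identity on the simulated sample.

import numpy as np

# Illustrative sketch (not from the text): best linear prediction of X in
# M = span(1, Y1, Y2), obtained by projecting X onto M via least squares.
rng = np.random.default_rng(2)
n = 1_000_000
Y1 = rng.normal(size=n)
Y2 = 0.5 * Y1 + rng.normal(size=n)
X = 1.0 + 2.0 * Y1 - Y2 + rng.normal(size=n)       # variable to be predicted

M = np.column_stack([np.ones(n), Y1, Y2])          # spanning variables
coef, *_ = np.linalg.lstsq(M, X, rcond=None)       # projection coefficients
Xhat = M @ coef                                    # best linear predictor
resid = X - Xhat

print(np.mean(resid * Y1), np.mean(resid * Y2))    # orthogonality: both are ~ 0
print(np.mean(resid ** 2))                         # E[(X - Xhat)^2]
print(np.mean(X ** 2) - np.mean(Xhat ** 2))        # same value, by the identity above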

C Notations

C.1 Mathematical
i : imaginary unit, i² = −1
e : base of natural logarithm, e = 2.7182818 . . .
⌊x⌋ : largest integer less than or equal to x (integer part)
⌈x⌉ : smallest integer larger than or equal to x
x ∧ y : minimum of x and y
x ∨ y : maximum of x and y
⟨u, v⟩ : scalar product of vectors u and v
z_{k:l} : collection z_k, z_{k+1}, . . . , z_l
A^t : transpose of matrix A
|S| : cardinality of (finite) set S
1_A : indicator function of set A
|f|_∞ : supremum of function f
osc(f) : oscillation (global modulus of continuity) of f
f′ : derivative of (real-valued) f
∇_θ f(θ)|_{θ=θ′} or ∇f(θ′) : gradient of f at θ′
∇²_θ f(θ)|_{θ=θ′} or ∇²f(θ′) : Hessian of f at θ′
(Z, Z) : measurable space
F_b(Z) : bounded measurable functions on (Z, Z)
G ∨ F : minimal σ-field generated by σ-fields G and F
µ ⊗ ν, µ^{⊗2} : product measures
G^{⊗n} : product σ-field
‖·‖_TV : total variation norm of a signed measure
|f|_{µ,∞} : essential supremum of a measurable function f (with respect to the measure µ)
osc_µ(f) : essential oscillation semi-norm


C.2 Probability
P, E : probability, expectation
→^D : convergence in distribution
→^P : convergence in probability
→^{a.s.} : almost sure convergence
L¹, L² : integrable and square integrable functions
‖X‖_p : L^p norm of X ([E|X|^p]^{1/p})
span(X_1, X_2) : linear span in Hilbert space, usually L²(Ω, F, P)
proj(X|M) : projection onto a linear subspace
X ⊥ Y | Z [P] : X and Y are conditionally independent given Z (with respect to the probability P)
N : Gaussian distribution, N(µ, σ²)
LN : log-normal distribution, LN(log(µ), σ²)
Dir : Dirichlet distribution, Dir_r(α_1, . . . , α_r)
Ga : gamma distribution, Ga(α, β)
IG : inverse gamma distribution
U : uniform distribution, U([a, b])
Bin : binomial distribution, Bin(n, p)
Be : beta distribution, Be(α, β)
Mult : multinomial distribution, Mult(n, (π_1, . . . , π_N))

C.3 Hidden Markov Models


{X_k}_{k≥0} : hidden states
(X, X) : state space of the hidden states
Q(x, dx′) : transition kernel of the hidden chain
q(x, x′) : idem, in fully dominated models
ν(dx) : initial distribution (probability density function with respect to the dominating measure in fully dominated models)
π : stationary distribution of {X_k}_{k≥0} (if any)
r : |X| in finite HMMs
{Y_k}_{k≥0} : observations
(Y, Y) : observation space
G(x, dy) : conditional likelihood kernel
g(x, y)µ(dy) : idem, in partially dominated models
g_k(x) : g(x, Y_k) (implicit conditioning convention)
P_ν, E_ν : probability, expectation under the model, assuming initial distribution ν


Smoothing ,k or ,k|k ,k|k1 c,k L,n


,n

,k|n , ,k:l|n ,k k|n ,k k|n Fk|n B,n ,n

filtering distribution predictive distribution normalization constant for the filter likelihood log-likelihood marginal of joint smoothing distribution forward measure backward function normalized forward measure normalized backward function forward smoothing kernel backward smoothing kernel recursive smoother

In several chapters, explicit dependence with respect to the initial distribution is omitted; in a few others, the above notations are followed by an expression of the form [Yk:l ] to highlight dependence with respect to the relevant observations. Parametric HMMs d J () s n () () Q( ; ) S ds State-Space Models Xk+1 = Ak Xk + Rk Uk Yk = Bk Xk + Sk Vk dx , du , dy , dv Xk|k , k|k k|k1 , k|k1 X Xk|n , k|n k|n , k|n k , k Hk Kk state (dynamic) equation observation equation dimensions of Xk , Uk , Yk and Vk ltered moments predicted moments smoothed moments idem in information parameterization innovation and associated covariance matrix Kalman gain (prediction) Kalman gain (ltering) parameter vector dimension of the parameter actual (true) value of parameter Fisher information matrix stationary version of the log-likelihood limiting contrast [of n1 ,n ()] intermediate quantity of EM complete-data sucient statistic in exponential family dimension of S


Hierarchical HMMs hierarchic component of the states (usually indicator variables) (C, C) space of hierarchic component QC transition kernel of {Ck }k0 C distribution of C0 {Wk }k0 intermediate component of the states (W, W) space of intermediate component QW [(w, c), w )] conditional transition kernel of {Wk }k0 given {Ck }k0 ,k:l|n distribution of Ck:l given Y0:n k+1|k predictive distribution of Wk+1 given Y0:n and C0:k+1 {Ck }k0

C.4 Sequential Monte Carlo


MC (f ) N IS (f ) ,N IS (f ) ,N SIR (f ) ,N u Tk (x, dx ) Tk k i {k }i=1,...,N i {k }i=1,...,N i i 0:k , 0:k (l) Monte Carlo estimate of (f ) (from N i.i.d. draws) unnormalized importance sampling estimate (using as instrumental distribution) importance sampling estimate sampling importance resampling estimate (Lk+1 /Lk )1 Q(x, dx ) gk+1 (x ) Q(x, dx ) gk+1 (x ) u optimal instrumental kernel (Tk normalized) u normalization function of Tk population of particles at time index k associated importance weights (usually unnormalized) path particle and lth element in the trajectory i i [by convention k = 0:k (k)]

References

Akashi, H. and Kumamoto, H. (1977) Random sampling approach to state estimation in switching environment. Automatica, 13, 429434. Anderson, B. D. O. and Moore, J. B. (1979) Optimal Filtering. Prentice-Hall. Andrews, D. F. and Mallows, C. L. (1974) Scale mixtures of normal distributions. J. Roy. Statist. Soc. Ser. B, 36, 99102. Andrieu, C., Davy, M. and Doucet, A. (2003) Ecient particle ltering for jump Markov systems. Application to time-varying autoregressions. IEEE Trans. Signal Process., 51, 17621770. Andrieu, C., Moulines, E. and Priouret, P. (2005) Stability of stochastic approximation under veriable conditions. SIAM J. Control Optim. To appear. Askar, M. and Derin, H. (1981) A recursive algorithm for the Bayes solution of the smoothing problem. IEEE Trans. Automat. Control, 26, 558561. Atar, R. and Zeitouni, O. (1997) Exponential stability for nonlinear ltering. Ann. Inst. H. Poincar Probab. Statist., 33, 697725. e Athreya, K. B., Doss, H. and Sethuraman, J. (1996) On the convergence of the Markov chain simulation method. Ann. Statist., 24, 69100. Athreya, K. B. and Ney, P. (1978) A new approach to the limit theory of recurrent Markov chains. Trans. Am. Math. Soc., 245, 493501. Azencott, R. and Dacunha-Castelle, D. (1984) Sries dobservations irre e guli`res. Masson. e Bahl, L., Cocke, J., Jelinek, F. and Raviv, J. (1974) Optimal decoding of linear codes for minimizing symbol error rate. IEEE Trans. Inform. Theory, 20, 284287. Baldi, P. and Brunak, S. (2001) Bioinformatics. The Machine Learning Approach. MIT Press. Ball, F. G., Cai, Y., Kadane, J. B. and OHagan, A. (1999) Bayesian inference for ion channel gating mechanisms directly from single channel recordings, using Markov chain Monte Carlo. Proc. Roy. Soc. London A, 455, 2879 2932.


Ball, F. G. and Rice, J. H. (1992) Stochastic models for ion channels: Introduction and bibliography. Math. Biosci., 112, 189206. Barron, A. (1985) The strong ergodic theorem for densities; generalized Shannon-McMillan-Breiman theorem. Ann. Probab., 13, 12921303. Barron, A., Birg, L. and Massart, P. (1999) Risk bounds for model selection e via penalization. Probab. Theory Related Fields, 113, 301413. Bartlett, P., Boucheron, S. and Lugosi, G. (2002) Model selection and error estimation. Machine Learning, 48, 85113. Baum, L. E. and Eagon, J. A. (1967) An inequality with applications to statistical estimation for probalistic functions of Markov processes and to a model for ecology. Bull. Am. Math. Soc., 73, 360363. Baum, L. E. and Petrie, T. P. (1966) Statistical inference for probabilistic functions of nite state Markov chains. Ann. Math. Statist., 37, 15541563. Baum, L. E., Petrie, T. P., Soules, G. and Weiss, N. (1970) A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist., 41, 164171. Benveniste, A., Mtivier, M. and Priouret, P. (1990) Adaptive Algorithms and e Stochastic Approximations, vol. 22. Springer. Translated from the French by Stephen S. S. Wilson. Berger, J. O. (1985) Statistical Decision Theory and Bayesian Analysis. Springer, 2nd ed. Bertozzi, T., Le Ruyet, D., Rigal, G. and Han, V.-T. (2003) Trellis-based search of the maximum a posteriori sequence using particle ltering. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. 6, 693696. Berzuini, C., Best, N., Gilks, W. R. and Larizza, C. (1997) Dynamic conditional independence models and Markov Chain Monte Carlo methods. J. Am. Statist. Assoc., 92, 14031412. Berzuini, C. and Gilks, W. R. (2001) Resample-move ltering with crossmodel jumps. In Sequential Monte Carlo Methods in Practice (eds. A. Doucet, N. De Freitas and N. Gordon). Springer. Besag, J. (1989) Towards Bayesian image analysis. J. Applied Statistics, 16, 395407. Bickel, P. J. and Doksum, K. A. (1977) Mathematical Statistics. Prentice-Hall. Bickel, P. J. and Ritov, Y. (1996) Inference in hidden Markov models I. Local asymptotic normality in the stationary case. Bernoulli, 2, 199228. Bickel, P. J., Ritov, Y. and Rydn, T. (1998) Asymptotic normality of the e maximum likelihood estimator for general hidden Markov models. Ann. Statist., 26, 16141635. (2002) Hidden Markov model likelihoods and their derivatives behave like i.i.d. ones. Ann. Inst. H. Poincar Probab. Statist., 38, 825846. e Billingsley, P. (1995) Probability and Measure. Wiley, 3rd ed. Bollerslev, T., Engle, R. F. and Nelson, D. (1994) ARCH models. In Handbook of Econometrics (eds. R. F. Engle and D. McFadden). North-Holland. Bonnans, J. F. and Shapiro, A. (1998) Optimization problems with perturbations: a guided tour. SIAM Rev., 40, 228264.


Booth, J. and Hobert, J. (1999) Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. J. Roy. Statist. Soc. Ser. B, 61, 265–285.
Borovkov, A. A. (1998) Ergodicity and Stability of Stochastic Systems. Wiley.
Boucheron, S. and Gassiat, E. (2004) Error exponents in AR order testing. Preprint.
Boyd, S. and Vandenberghe, L. (2004) Convex Optimization. Cambridge University Press.
Boyles, R. (1983) On the convergence of the EM algorithm. J. Roy. Statist. Soc. Ser. B, 45, 47–50.
Brandière, O. (1998) The dynamic system method and the traps. Adv. Appl. Probab., 30, 137–151.
Briers, M., Doucet, A. and Maskell, S. (2004) Smoothing algorithms for state-space models. Tech. Rep., University of Cambridge, Department of Engineering.
Brockwell, P. J. and Davis, R. A. (1991) Time Series: Theory and Methods. Springer, 2nd ed.
Brooks, S. P., Giudici, P. and Roberts, G. O. (2003) Efficient construction of reversible jump Markov chain Monte Carlo proposal distributions. J. Roy. Statist. Soc. Ser. B, 65, 1–37.
Bryson, A. and Frazier, M. (1963) Smoothing for linear and nonlinear dynamic systems. Tech. Rep., Aero. Sys. Div. Wright-Patterson Air Force Base.
Budhiraja, A. and Ocone, D. (1997) Exponential stability of discrete-time filters for bounded observation noise. Systems Control Lett., 30, 185–193.
Bunke, H. and Caelli, T. (eds.) (2001) Hidden Markov Models: Applications in Computer Vision. World Scientific.
Burges, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268, 78–94.
Caines, P. E. (1988) Linear Stochastic Systems. Wiley.
Campillo, F. and Le Gland, F. (1989) MLE for partially observed diffusions: direct maximization vs. the EM algorithm. Stoch. Proc. App., 33, 245–274.
Cappé, O. (2001a) Recursive computation of smoothed functionals of hidden Markovian processes using a particle approximation. Monte Carlo Methods Appl., 7, 81–92.
Cappé, O. (2001b) Ten years of HMMs (online bibliography 1989–2000). URL http://www.tsi.enst.fr/~cappe/docs/hmmbib.html.
Cappé, O., Buchoux, V. and Moulines, E. (1998) Quasi-Newton method for maximum likelihood estimation of hidden Markov models. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. 4, 2265–2268.
Cappé, O., Doucet, A., Lavielle, M. and Moulines, E. (1999) Simulation-based methods for blind maximum-likelihood filter identification. Signal Process., 73, 3–25.
Cappé, O., Robert, C. P. and Rydén, T. (2003) Reversible jump, birth-and-death and more general continuous time Markov chain Monte Carlo samplers. J. Roy. Statist. Soc. Ser. B, 65, 679–700.


Cardoso, J.-F., Lavielle, M. and Moulines, E. (1995) Un algorithme d'identification par maximum de vraisemblance pour des données incomplètes. C. R. Acad. Sci. Paris Série I Statistique, 320, 363–368.
Carlin, B. P. and Chib, S. (1995) Bayesian model choice via Markov chain Monte Carlo. J. Roy. Statist. Soc. Ser. B, 57, 473–484.
Carpenter, J., Clifford, P. and Fearnhead, P. (1999) An improved particle filter for non-linear problems. IEE Proc., Radar Sonar Navigation, 146, 2–7.
Carter, C. K. and Kohn, R. (1994) On Gibbs sampling for state space models. Biometrika, 81, 541–553.
Carter, C. K. and Kohn, R. (1996) Markov chain Monte Carlo in conditionally Gaussian state space models. Biometrika, 83, 589–601.
Casella, G., Robert, C. P. and Wells, M. T. (2000) Mixture models, latent variables and partitioned importance sampling. Tech. Rep., CREST, INSEE, Paris.
Castledine, B. (1981) A Bayesian analysis of multiple-recapture sampling for a closed population. Biometrika, 67, 197–210.
Celeux, G. and Diebolt, J. (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput. Statist., 2, 73–82.
Celeux, G. and Diebolt, J. (1990) Une version de type recuit simulé de l'algorithme EM. C. R. Acad. Sci. Paris Sér. I Math., 310, 119–124.
Celeux, G., Hurn, M. and Robert, C. P. (2000) Computational and inferential difficulties with mixture posterior distributions. J. Am. Statist. Assoc., 95, 957–979.
Cérou, F., Le Gland, F. and Newton, N. (2001) Stochastic particle methods for linear tangent filtering equations. In Optimal Control and PDEs - Innovations and Applications, in Honor of Alain Bensoussan's 60th Anniversary (eds. J.-L. Menaldi, E. Rofman and A. Sulem), 231–240. IOS Press.
Chambaz, A. (2003) Segmentation spatiale et sélection de modèle. Ph.D. thesis, Université Paris-Sud.
Chan, K. S. and Ledolter, J. (1995) Monte Carlo EM estimation for time series models involving counts. J. Am. Statist. Assoc., 90, 242–252.
Chang, R. and Hancock, J. (1966) On receiver structures for channels having memory. IEEE Trans. Inform. Theory, 12, 463–468.
Chen, M. H. and Shao, Q. M. (2000) Monte Carlo Methods in Bayesian Computation. Springer.
Chen, R. and Liu, J. S. (1996) Predictive updating method and Bayesian classification. J. Roy. Statist. Soc. Ser. B, 58, 397–415.
Chen, R. and Liu, J. S. (2000) Mixture Kalman filter. J. Roy. Statist. Soc. Ser. B, 62, 493–508.
Chib, S. (1998) Estimation and comparison of multiple change point models. J. Econometrics, 86, 221–241.
Chigansky, P. and Lipster, R. (2004) Stability of nonlinear filters in non-mixing case. Ann. Appl. Probab., 14, 2038–2056.
Chikin, D. O. (1988) Convergence of stochastic approximation procedures in the presence of dependent noise. Autom. Remote Control, 1, 50–61.


Churchill, G. (1992) Hidden Markov chains and the analysis of genome structure. Computers & Chemistry, 16, 107–115.
Collings, I. B. and Rydén, T. (1998) A new maximum likelihood gradient algorithm for on-line hidden Markov model identification. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. 4, 2261–2264.
Cover, T. M. and Thomas, J. A. (1991) Elements of Information Theory. Wiley.
Cowell, R. G., Dawid, A. P., Lauritzen, S. L. and Spiegelhalter, D. J. (1999) Probabilistic Networks and Expert Systems. Springer.
Crisan, D., Del Moral, P. and Lyons, T. (1999) Discrete filtering using branching and interacting particle systems. Markov Process. Related Fields, 5, 293–318.
Crisan, D. and Doucet, A. (2002) A survey of convergence results on particle filtering methods for practitioners. IEEE Trans. Signal Process., 50, 736–746.
Csiszár, I. (1990) Class notes on information theory and statistics. University of Maryland.
Csiszár, I. (2002) Large-scale typicality of Markov sample paths and consistency of MDL order estimators. IEEE Trans. Inform. Theory, 48, 1616–1628.
Csiszár, I. and Shields, P. (2000) The consistency of the BIC Markov order estimator. Ann. Statist., 28, 1601–1619.
Dacunha-Castelle, D. and Duflo, M. (1986) Probability and Statistics. Vol. II. Springer. Translated from the French by D. McHale.
Dacunha-Castelle, D. and Gassiat, E. (1997a) The estimation of the order of a mixture model. Bernoulli, 3, 279–299.
Dacunha-Castelle, D. and Gassiat, E. (1997b) Testing in locally conic models and application to mixture models. ESAIM Probab. Statist., 1, 285–317.
Dacunha-Castelle, D. and Gassiat, E. (1999) Testing the order of a model using locally conic parametrization: population mixtures and stationary ARMA processes. Ann. Statist., 27, 1178–1209.
Damien, P., Wakefield, J. and Walker, S. (1999) Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables. J. Roy. Statist. Soc. Ser. B, 61, 331–344.
Damien, P. and Walker, S. (1996) Sampling probability densities via uniform random variables and a Gibbs sampler. Tech. Rep., Business School, University of Michigan.
Dawid, A. P. (1980) Conditional independence for statistical operations. Ann. Statist., 8, 598–617.
Del Moral, P. (1996) Nonlinear filtering: interacting particle solution. Markov Process. Related Fields, 2, 555–579.
Del Moral, P. (1998) Measure-valued processes and interacting particle systems. Application to nonlinear filtering problems. Ann. Appl. Probab., 8, 69–95.
Del Moral, P. (2004) Feynman-Kac Formulae. Genealogical and Interacting Particle Systems with Applications. Springer.


Del Moral, P. and Guionnet, A. (1998) Large deviations for interacting particle systems: applications to non-linear ltering. Stoch. Proc. App., 78, 6995. Del Moral, P. and Jacod, J. (2001) Interacting particle ltering with discretetime observations: Asymptotic behaviour in the Gaussian case. In Stochastics in Finite and Innite Dimensions: In Honor of Gopinath Kallianpur (eds. T. Hida, R. L. Karandikar, H. Kunita, B. S. Rajput, S. Watanabe and J. Xiong), 101122. Birkhuser. a Del Moral, P. and Ledoux, M. (2000) Convergence of empirical processes for interacting particle systems with applications to nonlinear ltering. J. Theoret. Probab., 13, 225257. Del Moral, P., Ledoux, M. and Miclo, L. (2003) On contraction properties of Markov kernels. Probab. Theory Related Fields, 126, 395420. Del Moral, P. and Miclo, L. (2001) Genealogies and increasing propagation of chaos for feynman-kac and genetic models. Ann. Appl. Probab., 11, 1166 1198. Delyon, B., Lavielle, M. and Moulines, E. (1999) On a stochastic approximation version of the EM algorithm. Ann. Statist., 27. Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B, 39, 138 (with discussion). Devroye, L. (1986) Non-Uniform Random Variate Generation. Springer. URL http://cgm.cs.mcgill.ca/~luc/rnbookindex.html. Devroye, L. and Klincsek, T. (1981) Average time behavior of distributive sorting algorithms. Computing, 26, 17. Diaconis, P. and Freedman, D. (1999) Iterated random functions. SIAM Rev., 47, 4576. Diebolt, J. and Ip, E. H. S. (1996) Stochastic EM: method and application. In Markov Chain Monte Carlo in Practice (eds. W. R. Gilks, S. Richardson and D. J. Spiegelhalter), 259273. Chapman. Dobrushin, R. (1956) Central limit theorem for non-stationary Markov chains. I. Teor. Veroyatnost. i Primenen., 1, 7289. Doob, J. L. (1953) Stochastic Processes. Wiley. Douc, R. and Matias, C. (2002) Asymptotics of the maximum likelihood estimator for general hidden Markov models. Bernoulli. Douc, R., Moulines, E. and Rydn, T. (2004) Asymptotic properties of e the maximum likelihood estimator in autoregressive models with Markov regime. Ann. Statist., 32, 22542304. Doucet, A. and Andrieu, C. (2001) Iterative algorithms for state estimation of jump Markov linear systems. IEEE Trans. Signal Process., 49, 12161227. Doucet, A., De Freitas, N. and Gordon, N. (eds.) (2001a) Sequential Monte Carlo Methods in Practice. Springer. Doucet, A., Godsill, S. and Andrieu, C. (2000a) On sequential Monte-Carlo sampling methods for Bayesian ltering. Stat. Comput., 10, 197208.


Doucet, A., Godsill, S. and Robert, C. P. (2002) Marginal maximum a posteriori estimation using Markov chain Monte Carlo. Stat. Comput., 12, 77–84.
Doucet, A., Gordon, N. and Krishnamurthy, V. (2001b) Particle filters for state estimation of jump Markov linear systems. IEEE Trans. Signal Process., 49, 613–624.
Doucet, A., Logothetis, A. and Krishnamurthy, V. (2000b) Stochastic sampling algorithms for state estimation of jump Markov linear systems. IEEE Trans. Automat. Control, 45, 188–202.
Doucet, A. and Robert, C. P. (2002) Marginal maximum a posteriori estimation for hidden Markov models. Tech. Rep., CEREMADE, Université Paris Dauphine.
Doucet, A. and Tadić, V. B. (2003) Parameter estimation in general state-space models using particle methods. Ann. Inst. Statist. Math., 55, 409–422.
Dudley, R. M. (2002) Real Analysis and Probability. Cambridge University Press.
Duflo, M. (1997) Random Iterative Models, vol. 34. Springer. Translated from the 1990 French original by S. S. Wilson and revised by the author.
Dupuis, J. A. (1995) Bayesian estimation of movement probabilities in open populations using hidden Markov chains. Biometrika, 82, 761–772.
Dupuis, P. and Ellis, R. S. (1997) A Weak Convergence Approach to the Theory of Large Deviations. Wiley.
Dupuis, P. and Simha, R. (1991) On sampling controlled stochastic approximation. IEEE Trans. Automat. Control, 36, 915–924.
Durbin, J. and Koopman, S. J. (2000) Time series analysis of non-Gaussian observations based on state space models from both classical and Bayesian perspectives. J. Roy. Statist. Soc. Ser. B, 62, 3–29.
Durbin, J. and Koopman, S. J. (2002) A simple and efficient simulation smoother for state space time series analysis. Biometrika, 89, 603–616.
Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.
Durrett, R. (1996) Probability: Theory and Examples. Duxbury Press, 2nd ed.
Elliott, E. O. (1963) Estimates of error rates for codes on burst-noise channels. Bell System Tech. J., 1977–1997.
Elliott, R. J. (1993) New finite dimensional filters and smoothers for Markov chains observed in Gaussian noise. IEEE Trans. Signal Process., 39, 265–271.
Elliott, R. J., Aggoun, L. and Moore, J. B. (1995) Hidden Markov Models: Estimation and Control. Springer.
Elliott, R. J. and Krishnamurthy, V. (1999) New finite-dimensional filters for parameter estimation of discrete-time linear Gaussian models. IEEE Trans. Automat. Control, 44.


Engle, R. F. (1982) Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50, 987–1007.
Ephraim, Y. and Merhav, N. (2002) Hidden Markov processes. IEEE Trans. Inform. Theory, 48, 1518–1569.
Evans, M. and Swartz, T. (1995) Methods for approximating integrals in Statistics with special emphasis on Bayesian integration problems. Statist. Sci., 10, 254–272.
Evans, M. and Swartz, T. (2000) Approximating Integrals via Monte Carlo and Deterministic Methods. Oxford University Press.
Fearnhead, P. (1998) Sequential Monte Carlo methods in filter theory. Ph.D. thesis, University of Oxford.
Fearnhead, P. and Clifford, P. (2003) On-line inference for hidden Markov models via particle filters. J. Roy. Statist. Soc. Ser. B, 65, 887–899.
Feder, M. and Merhav, N. (2002) Universal composite hypothesis testing: a competitive minimax and its applications. IEEE Trans. Inform. Theory, 48, 1504–1517.
Feller, W. (1943) On a general class of contagious distributions. Ann. Math. Statist., 14, 389–399.
Feller, W. (1971) An Introduction to Probability Theory and its Applications. Wiley.
Fessler, J. A. and Hero, A. O. (1995) Penalized maximum-likelihood image reconstruction using space-alternating generalized EM algorithms. IEEE Trans. Image Process., 4, 1417–1429.
Fichou, J., Le Gland, F. and Mevel, L. (2004) Particle based methods for parameter estimation and tracking: numerical experiments. Tech. Rep., INRIA.
Finesso, L. (1991) Consistent estimation of the order for Markov and hidden Markov chains. Ph.D. thesis, Maryland University.
Finesso, L., Liu, C. and Narayan, P. (1996) The optimal error exponent for Markov order estimation. IEEE Trans. Inform. Theory, 42, 1488–1497.
Fletcher, R. (1987) Practical Methods of Optimization. Wiley.
Fong, W., Godsill, S., Doucet, A. and West, M. (2002) Monte Carlo smoothing with application to audio signal enhancement. IEEE Trans. Signal Process., 50, 438–449.
Fonollosa, J. A. R., Anton-Haro, C. and Fonollosa, J. R. (1997) Blind channel estimation and data detection using hidden Markov models. IEEE Trans. Signal Process., 45, 241–246.
Fort, G. and Moulines, E. (2003) Convergence of the Monte Carlo expectation maximization for curved exponential families. Ann. Statist., 31, 1220–1259.
Francq, C. and Roussignol, M. (1997) On white noises driven by hidden Markov chains. J. Time Ser. Anal., 18, 553–578.
Francq, C. and Roussignol, M. (1998) Ergodicity of autoregressive processes with Markov-switching and consistency of the maximum-likelihood estimator. Statistics, 32, 151–173.


Francq, C., Roussignol, M. and Zakoian, J.-M. (2001) Conditional heteroskedasticity driven by hidden Markov chains. J. Time Ser. Anal., 2, 197–220.
Fraser, D. and Potter, J. (1969) The optimum linear smoother as a combination of two optimum linear filters. IEEE Trans. Automat. Control, 4, 387–390.
Fredkin, D. R. and Rice, J. A. (1992) Maximum-likelihood-estimation and identification directly from single-channel recordings. Proc. Roy. Soc. London Ser. B, 249, 125–132.
Frey, B. J. (1998) Graphical Models for Machine Learning and Digital Communication. MIT Press.
Frühwirth-Schnatter, S. (1994) Data augmentation and dynamic linear models. J. Time Ser. Anal., 15.
Gaetan, C. and Yao, J.-F. (2003) A multiple-imputation Metropolis version of the EM algorithm. Biometrika, 90, 643–654.
Gassiat, E. (2002) Likelihood ratio inequalities with applications to various mixtures. Ann. Inst. H. Poincaré Probab. Statist., 38, 887–906.
Gassiat, E. and Boucheron, S. (2003) Optimal error exponents in hidden Markov models order estimation. IEEE Trans. Inform. Theory, 49, 964–980.
Gauvain, J.-L. and Lee, C.-H. (1994) Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process., 2, 291–298.
Gelfand, A. E. and Carlin, B. P. (1993) Maximum-likelihood estimation for constrained or missing-data models. Can. J. Statist., 21, 303–311.
Gelfand, A. E. and Smith, A. F. M. (1990) Sampling based approaches to calculating marginal densities. J. Am. Statist. Assoc., 85, 398–409.
Gelman, A. (1995) Methods of moments using Monte Carlo simulation. J. Comput. Graph. Statist., 4, 36–54.
Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (1995) Bayesian Data Analysis. Chapman.
Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6, 721–741.
Gentle, J. E. (1998) Random Number Generation and Monte Carlo Methods. Springer.
Geweke, J. (1989) Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57, 1317–1339.
Geyer, C. J. (1996) Estimation and optimization of functions. In Markov Chain Monte Carlo in Practice (eds. W. R. Gilks, S. Richardson and D. J. Spiegelhalter). Chapman.
Geyer, C. J. and Møller, J. (1994) Simulation procedures and likelihood inference for spatial point processes. Scand. J. Statist., 21, 359–373.
Geyer, C. J. and Thompson, E. A. (1992) Constrained Monte Carlo maximum likelihood for dependent data. J. Roy. Statist. Soc. Ser. B, 54, 657–699.


Ghosh, D. (1989) Maximum likelihood estimation of the dynamic shock-error model. J. Econometrics, 41. Gilbert, E. N. (1960) Capacity of a burst-noise channel. Bell System Tech. J., 12531265. Giudici, P., Rydn, T. and Vandekerkhove, P. (2000) Likelihood-ratio tests e for hidden Markov models. Biometrics, 56, 742747. Glynn, P. W. and Iglehart, D. (1989) Importance sampling for stochastic simulations. Management Science, 35, 13671392. Godsill, S. J. (2001) On the relationship between MCMC methods for model uncertainty. J. Comput. Graph. Statist., 10, 230248. Godsill, S. J. and Rayner, P. J. W. (1998) Digital Audio Restoration: A Statistical Model-Based Approach. Springer. Gordon, N., Salmond, D. and Smith, A. F. (1993) Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. F, Radar Signal Process., 140, 107113. Graund, A. and Nilsson, B. (2003) Dynamic portfolio selection: the relevance of switching regimes and investment horizon. Eur. Financial Management, 9, 4768. Green, P. J. (1990) On use of the EM algorithm for penalized likelihood estimation. J. Roy. Statist. Soc. Ser. B, 52, 443452. (1995) Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711732. Gu, M. G. and Kong, F. H. (1998) A stochastic approximation algorithm with Markov chain Monte-Carlo method for incomplete data estimation problems. Proc. Natl. Acad. Sci. USA, 95, 72707274. Gu, M. G. and Li, S. (1998) A stochastic approximation algorithm for maximum-likelihood estimation with incomplete data. Can. J. Statist., 26, 567582. Gu, M. G. and Zhu, H.-T. (2001) Maximum likelihood estimation for spatial models by Markov chain Monte Carlo stochastic approximation. J. Roy. Statist. Soc. Ser. B, 63, 339355. Gupta, N. and Mehra, R. (1974) Computational aspects of maximum likelihood estimation and reduction in sensitivity function calculations. IEEE Trans. Automat. Control, 19, 774783. Gut, A. (1988) Stopped Random Walks. Springer. Hamilton, J. and Susmel, R. (1994) Autoregressive conditional heteroskedasticity and changes of regime. J. Econometrics, 64, 307333. Hamilton, J. D. (1989) A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57, 357384. (1994) Time Series Analysis. Princeton University Press. Hamilton, J. D. and Raj, B. (eds.) (2003) Advances in Markov-Switching Models: Applications in Business Cycle Research and Finance (Studies in Empirical Economics). Springer. Hammersley, J. M. and Handscomb, D. C. (1965) Monte Carlo Methods. Methuen & Co.


Handschin, J. (1970) Monte Carlo techniques for prediction and ltering of non-linear stochastic processes. Automatica, 6, 555563. Handschin, J. and Mayne, D. (1969) Monte Carlo techniques to estimate the conditionnal expectation in multi-stage non-linear ltering. In Int. J. Control, vol. 9, 547559. Hartigan, J. A. (1983) Bayes Theory. Springer. Hastings, W. K. (1970) Monte Carlo sampling methods using Markov chains and their application. Biometrika, 57, 97109. Haughton, D. M. (1988) On the choice of a model to t data from an exponential family. Ann. Statist., 16, 342355. Ho, Y. C. and Lee, R. C. K. (1964) A Bayesian approach to problems in stochastic estimation and control. IEEE Trans. Automat. Control, 9, 333 339. Hobert, J. P., Jones, G. L., Presnell, B. and Rosenthal, J. S. (2002) On the applicability of regenerative simulation in Markov chain Monte Carlo. Biometrika, 89, 731743. Hodgson, M. E. A. (1998) Reversible jump Markov chain Monte Carlo and inference for ion channel data. Ph.D. thesis, University of Bristol. Horn, R. A. and Johnson, C. R. (1985) Matrix Analysis. Cambridge University Press. Hull, J. and White, A. (1987) The pricing of options on assets with stochastic volatilities. J. Finance, 42, 281300. Hrzeler, M. and Knsch, H. R. (1998) Monte Carlo approximations for genu u eral state-space models. J. Comput. Graph. Statist., 7, 175193. Ibragimov, I. A. and Hasminskii, R. Z. (1981) Statistical Estimation. Asymptotic Theory. Springer. Ito, H., Amari, S. I. and Kobayashi, K. (1992) Identiability of hidden Markov information sources and their minimum degrees of freedom. IEEE Trans. Inform. Theory, 38, 324333. Jacod, J. and Protter, P. (2000) Probability Essentials. Springer. Jacquier, E., Johannes, M. and Polson, N. G. (2004) MCMC maximum likelihood for latent state models. Tech. Rep., Columbia University. Jacquier, E., Polson, N. G. and Rossi, P. E. (1994) Bayesian analysis of stochastic volatility models (with discussion). J. Bus. Econom. Statist., 12, 371417. Jain, N. and Jamison, B. (1967) Contributions to Doeblins theory of Markov processes. Z. Wahrsch. Verw. Geb., 8, 1940. Jamshidian, M. and Jennrich, R. J. (1997) Acceleration of the EM algorithm using quasi-Newton methods. J. Roy. Statist. Soc. Ser. B, 59, 569587. Jarner, H., larsen, T. S., Krogh, A., Saxild, H. H., Brunak, S. and Knudsen, S. (2001) Sigma A recognition sites in the Bacilius subtilis genome. Microbiology, 147, 24172424. Jarner, S. and Hansen, E. (2000) Geometric ergodicity of Metropolis algorithms. Stoch. Proc. App., 85, 341361. Jelinek, F. (1997) Statistical Methods for Speech Recognition. MIT Press.

636

References

Jensen, F. V. (1996) An Introduction to Bayesian Networks. UCL Press. Jensen, J. L. and Petersen, N. V. (1999) Asymptotic normality of the maximum likelihood estimator in state space models. Ann. Statist., 27, 514535. De Jong, P. (1988) A cross validation lter for time series models. Biometrika, 75, 594600. De Jong, P. and Shephard, N. (1995) The simulation smoother for time series models. Biometrika, 82, 339350. Jordan, M. I. (ed.) (1999) Learning in Graphical Models. MIT Press. Jordan, M. I. (2004) Graphical models. Statist. Sci., 19, 140155. Julier, S. J. and Uhlmann, J. K. (1997) A new extension of the Kalman lter to nonlinear systems. In AeroSense: The 11th International Symposium on Aerospace/Defense Sensing, Simulation and Controls. Kaijser, T. (1975) A limit theorem for partially observed Markov chains. Ann. Probab., 3, 677696. Kailath, T. and Frost, P. A. (1968) An innovations approach to least-squares estimationPart II: Linear smoothing in additive white noise. IEEE Trans. Automat. Control, 13, 655660. Kailath, T., Sayed, A. and Hassibi, B. (2000) Linear Estimation. PrenticeHall. Kaleh, G. K. and Vallet, R. (1994) Joint parameter estimation and symbol detection for linear or nonlinear unknown channels. IEEE Trans. Commun., 42, 24062413. Kalman, R. E. and Bucy, R. (1961) New results in linear ltering and prediction theory. J. Basic Eng., Trans. ASME, Series D, 83, 95108. Kribin, C. and Gassiat, E. (2000) The likelihood ratio test for the number of e components in a mixture with Markov regime. ESAIM Probab. Statist., 4, 2552. Kesten, H. (1972) Limit theorems for stochastic growth models. I, II. Adv. Appl. Probab., 4, 193232. Kieer, J. C. (1993) Strongly consistent code-based identication and order estimation for constrained nite-state model classes. IEEE Trans. Inform. Theory, 39, 893902. Kim, C. and Nelson, C. (1999) State-Space Models with Regime Switching: Classical and Gibbs-Sampling Approaches with Applications. MIT Press. Kim, S., Shephard, N. and Chib, S. (1998) Stochastic volatility: Likelihood inference and comparison with ARCH models. Rev. Econom. Stud., 65, 361394. Kitagawa, G. (1987) Non-Gaussian state space modeling of nonstationary time series. J. Am. Statist. Assoc., 82, 10231063. (1996) Monte-Carlo lter and smoother for non-Gaussian nonlinear state space models. J. Comput. Graph. Statist., 1, 125. Kohn, R. and Ansley, C. F. (1989) A fast algorithm for signal extraction, inuence and cross-validation in state space models. Biometrika, 76, 65 79.

References

637

Kong, A., Liu, J. S. and Wong, W. (1994) Sequential imputation and Bayesian missing data problems. J. Am. Statist. Assoc., 89. Koopman, S. J. (1993) Disturbance smoother for state space models. Biometrika, 80, 117126. Kormylo, J. and Mendel, J. M. (1982) Maximum-likelihood detection and estimation of Bernoulli-Gaussian processes. IEEE Trans. Inform. Theory, 28, 482488. Koski, T. (2001) Hidden Markov Models for Bioinformatics. Kluwer. Krishnamurthy, V. and Rydn, T. (1998) Consistent estimation of linear and e non-linear autoregressive models with Markov regime. J. Time Ser. Anal., 19, 291307. Krishnamurthy, V. and White, L. B. (1992) Blind equalization of FIR channels with Markov inputs. In Proc. IFAC Int. Conf. Adapt. Systems Control Signal Process. Krishnamurthy, V. and Yin, G. G. (2002) Recursive algorithms for estimation of hidden Markov models and autoregressive models with Markov regime. IEEE Trans. Inform. Theory, 48, 458476. Krogh, A., Mian, I. S. and Haussler, D. (1994) A hidden Markov model that nds genes in E. coli DNA. Nucleic Acids Res., 22, 47684778. Krolzig, H.-M. (1997) Markov-switching Vector Autoregressions. Modelling, Statistical Inference, and Application to Business Cycle Analysis. Springer. Kuhn, E. and Lavielle, M. (2004) Coupling a stochastic approximation version of EM with an MCMC procedure. ESAIM Probab. Statist., 8, 115131. Kukashin, A. V. and Borodovsky, M. (1998) GeneMark.hmm: new solutions for gene nding. Nucleic Acids Res., 26, 11071115. Knsch, H. R. (2000) State space and hidden Markov models. In Complex u Stochastic Systems (eds. O. E. Barndor-Nielsen, D. R. Cox and C. Kluppelberg). CRC Press. (2003) Recursive Monte-Carlo lters: algorithms and theoretical analysis. Preprint ETHZ, seminar fr statistics. u Kushner, H. J. and Clark, D. S. (1978) Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer. Kushner, H. J. and Yin, G. G. (2003) Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2nd ed. Laarhoven, P. J. V. and Arts, E. H. L. (1987) Simulated Annealing: Theory and Applications. Reidel Publisher. Lange, K. (1995) A gradient algorithm locally equivalent to the EM algorithm. J. Roy. Statist. Soc. Ser. B, 57, 425437. Lauritzen, S. L. (1996) Graphical Models. Oxford University Press. Lavielle, M. (1993) Bayesian deconvolution of Bernoulli-Gaussian processes. Signal Process., 33, 6779. Lavielle, M. and Lebarbier, E. (2001) An application of MCMC methods to the multiple change-points problem. Signal Process., 81, 3953. Le Gland, F. and Mevel, L. (1997) Recursive estimation in HMMs. In Proc. IEEE Conf. Decis. Control, 34683473.

638

References

(2000) Exponential forgetting and geometric ergodicity in hidden Markov models. Math. Control Signals Systems, 13, 6393. Le Gland, F. and Oudjane, N. (2004) Stability and uniform approximation of nonlinear lters using the hilbert metric and application to particle lters. Ann. Appl. Probab., 14, 144187. Lehmann, E. L. and Casella, G. (1998) Theory of Point Estimation. Springer, 2nd ed. Leroux, B. G. (1992) Maximum-likelihood estimation for hidden Markov models. Stoch. Proc. Appl., 40, 127143. Levine, R. A. and Casella, G. (2001) Implementations of the Monte Carlo EM algorithm. J. Comput. Graph. Statist., 10, 422439. Levine, R. A. and Fan, J. (2004) An automated (Markov chain) Monte Carlo EM algorithm. J. Stat. Comput. Simul., 74, 349359. Levinson, S. E., Rabiner, L. R. and Sondhi, M. M. (1983) An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell System Tech. J., 62, 10351074. Liporace, L. A. (1982) Maximum likelihood estimation of multivariate observations of Markov sources. IEEE Trans. Inform. Theory, 28, 729734. Lipster, R. S. and Shiryaev, A. N. (2001) Statistics of Random Processes: I. General theory. Springer, 2nd ed. Liu, C. and Narayan, P. (1994) Order estimation and sequential universal data compression of a hidden Markov source by the method of mixtures. IEEE Trans. Inform. Theory, 40, 11671180. Liu, J. and Chen, R. (1995) Blind deconvolution via sequential imputations. J. Am. Statist. Assoc., 430, 567576. (1998) Sequential Monte-Carlo methods for dynamic systems. J. Am. Statist. Assoc., 93, 10321044. Liu, J., Chen, R. and Logvinenko, T. (2001) A theoretical framework for sequential importance sampling and resampling. In Sequential Monte Carlo Methods in Practice (eds. A. Doucet, N. De Freitas and N. Gordon). Springer. Liu, J., Wong, W. and Kong, A. (1994) Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes. Biometrika, 81, 2740. Liu, J. S. (1994) The collapsed Gibbs sampler with applications to a gene regulation problem. J. Am. Statist. Assoc., 89, 958966. (1996) Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Stat. Comput., 6, 113119. (2001) Monte Carlo Strategies in Scientic Computing. Springer. Louis, T. A. (1982) Finding the observed information matrix when using the EM algorithm. J. Roy. Statist. Soc. Ser. B, 44, 226233. Luenberger, D. G. (1984) Linear and Nonlinear Programming. AddisonWesley, 2nd ed. MacDonald, I. and Zucchini, W. (1997) Hidden Markov and Other Models for Discrete-Valued Time Series. Chapman.

References

639

MacEachern, S. N., Clyde, M. and Liu, J. (1999) Sequential importance sampling for nonparametric bayes models: The next generation. Can. J. Statist., 27, 251267. Mayne, D. Q. (1966) A solution of the smoothing problem for linear dynamic systems. Automatica, 4, 7392. Meng, X.-L. (1994) On the rate of convergence of the ECM algorithm. Ann. Statist., 22, 326339. Meng, X.-L. and Rubin, D. B. (1991) Using EM to obtain asymptotic variancecovariance matrices: The SEM algorithm. J. Am. Statist. Assoc., 86, 899 909. (1993) Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80, 267278. Meng, X.-L. and Van Dyk, D. (1997) The EM algorithman old folk song sung to a fast new tune. J. Roy. Statist. Soc. Ser. B, 59, 511567. Mengersen, K. and Tweedie, R. L. (1996) Rates of convergence of the Hastings and Metropolis algorithms. Ann. Statist., 24, 101121. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953) Equations of state calculations by fast computing machines. J. Chem. Phys., 21, 10871092. Meyn, S. P. and Tweedie, R. L. (1993) Markov Chains and Stochastic Stability. Springer. Neal, R. M. (1997) Markov chain Monte Carlo methods based on slicing the density function. Tech. Rep., University of Toronto. (2003) Slice sampling (with discussion). Ann. Statist., 31, 705767. Neveu, J. (1975) Discrete-Time Martingales. North-Holland. Niederreiter, H. (1992) Random Number Generation and Quasi-Monte Carlo Methods. SIAM. Nielsen, S. F. (2000) The stochastic EM algorithm: estimation and asymptotic results. Bernoulli, 6, 457489. Nummelin, E. (1978) A splitting technique for Harris recurrent Markov chains. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 4, 309318. (1984) General Irreducible Markov Chains and Non-Negative Operators. Cambridge University Press. Orchard, T. and Woodbury, M. A. (1972) A missing information principle: Theory and applications. In Proceedings of the 6th Berkeley Symposium on Mathematical Statistics, vol. 1, 697715. O Ruanaidh, J. J. K. and Fitzgerald, W. J. (1996) Numerical Bayesian Methods Applied to Signal Processing. Springer. Ostrowski, A. M. (1966) Solution of Equations and Systems of Equations. Academic Press, 2nd ed. Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann. Peskun, P. H. (1973) Optimum Monte Carlo sampling using Markov chains. Biometrika, 60, 607612.

640

References

(1981) Guidelines for chosing the transition matrix in Monte Carlo methods using Markov chains. J. Comput. Phys., 40, 327344. Petrie, T. (1969) Probabilistic functions of nite state Markov chains. Ann. Math. Statist., 40, 97115. Petris, G. and Tardella, L. (2003) A geometric approach to transdimensional Markov chain Monte Carlo. Can. J. Statist., 31, 469482. Petrov, V. V. (1995) Limit Theorems of Probability Theory. Oxford University Press. Pierre-Loti-Viaud, D. (1995) Random perturbations of recursive sequences with an application to an epidemic model. J. Appl. Probab., 32, 559578. Pitt, M. K. and Shephard, N. (1999) Filtering via simulation: Auxiliary particle lters. J. Am. Statist. Assoc., 94, 590599. Polson, N. G., Carlin, B. P. and Stoer, D. S. (1992) A Monte Carlo approach to nonnormal and nonlinear state-space modeling. J. Am. Statist. Assoc., 87, 493500. Polson, N. G., Stroud, J. R. and Mller, P. (2002) Practical ltering with u sequential parameter learning. Tech. Rep., University of Chicago. Polyak, B. T. (1990) A new method of stochastic approximation type. Autom. Remote Control, 51, 98107. Polyak, B. T. and Juditsky, A. B. (1992) Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30, 838855. Poznyak, A. S. and Chikin, D. O. (1984) Asymptotic properties of procedures of stochastic approximation with dependent noise. Autom. Remote Control, 1, 7893. Press, W., Teukolsky, S., Vetterling, W. and Flannery, B. (1992) Numerical Recipes in C: The Art of Scientic Computing. Cambridge University Press, 2nd ed. URL http://www.numerical-recipes.com/. Proakis, J. G. (1995) Digital Communications. McGraw-Hill. Punskaya, E., Doucet, A. and Fitzgerald, W. (2002) On the use and misuse of particle ltering in digital communications. In Proc. Eur. Signal Process. Conf., vol. 2, 173176. Quintana, F. A., Liu, J. and del Pino, G. (1999) Monte-Carlo EM with importance reweighting and its applications in random eects models. Comput. Statist. Data Anal., 29, 429444. Rabiner, L. R. (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257285. Rabiner, L. R. and Juang, B.-H. (1993) Fundamentals of Speech Recognition. Prentice-Hall. Raj, B. (2002) Asymmetry of business cycles: the Markov-switching approach. In Handbook of Applied Econometrics and Statistical Inference (eds. A. Ullah, A. T. K. Wan and A. Chaturvedi), 687710. Dekker. Rauch, H., Tung, F. and Striebel, C. (1965) Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3, 14451450.

References

641

Richardson, S. and Green, P. J. (1997) On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. Roy. Statist. Soc. Ser. B, 59, 731792. Ripley, B. (1987) Stochastic Simulation. Wiley. Ristic, B., Arulampalam, M. and Gordon, A. (2004) Beyond Kalman Filters: Particle Filters for Target Tracking. Artech House. Robbins, H. and Monro, S. (1951) A stochastic approximation method. Ann. Math. Statist., 22, 400407. Robert, C. P. (2001) The Bayesian Choice. Springer, 2nd ed. Robert, C. P. and Casella, G. (2004) Monte Carlo Statistical Methods. Springer, 2nd ed. Robert, C. P., Celeux, G. and Diebolt, J. (1993) Bayesian estimation of hidden Markov chains: A stochastic implementation. Statist. Probab. Lett., 16, 77 83. Robert, C. P., Rydn, T. and Titterington, M. (1999) Convergence controls for e MCMC algorithms, with applications to hidden Markov chains. J. Comput. Graph. Statist., 64, 327355. (2000) Bayesian inference in hidden Markov models through reversible jump Markov chain Monte Carlo. J. Roy. Statist. Soc. Ser. B, 62, 5775. Robert, C. P. and Titterington, M. (1998) Reparameterisation strategies for hidden Markov models and Bayesian approaches to maximum likelihood estimation. Stat. Comput., 8, 145158. Roberts, G. O. and Rosenthal, J. S. (1998) Markov chain Monte Carlo: Some practical implications of theoretical results. Canad. J. Statist., 26, 532. (2001) Optimal scaling for various Metropolis-Hastings algorithms. Statist. Sci., 16, 351367. (2004) General state space Markov chains and MCMC algorithms. Probab. Surv., 1, 2071. Roberts, G. O. and Tweedie, R. L. (1996) Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika, 83, 95110. (2005) Understanding MCMC. In preparation. Rosenthal, J. S. (1995) Minorization conditions and convergence rates for Markov chain Monte Carlo. J. Am. Statist. Assoc., 90, 558566. (2001) A review of asymptotic convergence for general state space Markov chains. Far East J. Theor. Stat., 5, 3750. Rubin, D. B. (1987) A noniterative sampling/importance resampling alternative to the data augmentation algorithm for creating a few imputations when the fraction of missing information is modest: the SIR algorithm (discussion of Tanner and Wong). J. Am. Statist. Assoc., 82, 543546. (1988) Using the SIR algorithm to simulate posterior distribution. In Bayesian Statistics 3 (eds. J. M. Bernardo, M. H. DeGroot, D. Lindley and A. Smith), 395402. Clarendon Press. Sakalauskas, L. (2000) Nonlinear stochastic optimization by the Monte-Carlo method. Informatica (Vilnius), 11, 455468.

642

References

(2002) Nonlinear stochastic programming by Monte-Carlo estimators. European J. Oper. Res., 137, 558573. Sandmann, G. and Koopman, S. J. (1998) Estimation of stochastic volatility models via Monte Carlo maximum likelihood. J. Econometrics, 87, 271 301. Schervish, M. J. (1995) Theory of Statistics. Springer. Schick, I. C. and Mitter, S. K. (1994) Robust recursive estimation in the presence of heavy-tailed observation noise. Ann. Statist., 22, 10451080. Scott, D. J. and Tweedie, R. L. (1996) Explicit rates of convergence of stochastically ordered Markov chains. In Athens Conference on Applied Probability and Time Series: Applied Probability in Honor of J. M. Gani, vol. 114 of Lecture Notes in Statistics. Springer. Scott, S. L. (2002) Bayesian methods for hidden Markov models: recursive computing in the 21st century. J. Am. Statist. Assoc., 97, 337351. Seber, G. A. F. (1983) Capture-recapture methods. In Encyclopedia of Statistical Science (eds. S. Kotz and N. Johnson). Wiley. Segal, M. and Weinstein, E. (1989) A new method for evaluating the loglikelihood gradient, the Hessian, and the Fisher information matrix for linear dynamic systems. IEEE Trans. Inform. Theory, 35, 682687. Sering, R. J. (1980) Approximation Theorems of Mathematical Statistics. Wiley. Shephard, N. and Pitt, M. (1997) Likelihood analysis of non-Gaussian measurement time series. Biometrika, 84, 653667. Erratum in 91:249250, 2004. Shiryaev, A. N. (1966) On stochastic equations in the theory of conditional Markov process. Theory Probab. Appl., 11, 179184. (1996) Probability. Springer, 2nd ed. Shtarkov, Y. M. (1987) Universal sequential coding of messages. Probl. Inform. Transmission, 23, 317. Shumway, R. and Stoer, D. (1991) Dynamic linear models with switching. J. Am. Statist. Assoc., 86, 763769. Stephens, M. (2000a) Bayesian analysis of mixture models with an unknown number of components - an alternative to reversible jump methods. Ann. Statist., 28, 4074. (2000b) Dealing with label switching in mixture models. J. Roy. Statist. Soc. Ser. B, 62, 795809. Stratonovich, R. L. (1960) Conditional Markov processes. Theory Probab. Appl., 5, 156178. Tanizaki, H. (1996) Nonlinear Filters: Estimation and Applications. Springer. (2003) Nonlinear and non-Gaussian state-space modeling with Monte-Carlo techniques: a survey and comparative study. In Handbook of Statistics 21. Stochastic processes: Modelling and Simulation (eds. D. N. Shanbhag and C. R. Rao), 871929. Elsevier. Tanizaki, H. and Mariano, R. (1998) Nonlinear and non-Gaussian state-space modeling with Monte-Carlo simulations. J. Econometrics, 83, 263290.

References

643

Tanner, M. and Wong, W. (1987) The calculation of posterior distributions by data augmentation. J. Am. Statist. Assoc., 82, 528550. Tanner, M. A. (1993) Tools for Statistical Inference. Springer, 2nd ed. Teicher, H. (1960) On the mixture of distributions. Ann. Math. Statist., 31, 5573. (1961) Identiability of mixtures. Ann. Math. Statist., 32, 244248. (1963) Identiability of nite mixtures. Ann. Math. Statist., 34, 12651269. (1967) Identiability of mixtures of product measures. Ann. Math. Statist., 38, 13001302. Titterington, D. M., Smith, A. F. M. and Makov, U. E. (1985) Statistical Analysis of Finite Mixture Distributions. Wiley. Tugnait, J. (1984) Adaptive estimation and identication for discrete systems with Markov jump parameters. IEEE Trans. Automat. Control, 27, 1054 1065. Van der Merwe, R., Doucet, A., De Freitas, N. and Wan, E. (2000) The unscented particle lter. In Adv. Neural Inf. Process. Syst. (eds. T. K. Leen, T. G. Dietterich and V. Tresp), vol. 13. MIT Press. Van Overschee, P. and De Moor, B. (1993) Subspace algorithms for the stochastic identication problem. Automatica, 29, 649660. (1996) Subspace Identication for Linear Systems. Theory, Implementation, Applications. Kluwer. Viterbi, A. J. (1967) Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Trans. Inform. Theory, 13, 260269. Wald, A. (1949) Note on the consistency of the maximum likelihood estimate. Ann. Math. Statist., 20, 595601. Wei, G. C. G. and Tanner, M. A. (1991) A Monte-Carlo implementation of the EM algorithm and the poor mans Data Augmentation algorithms. J. Am. Statist. Assoc., 85, 699704. Weinstein, E., Oppenheim, A. V., Feder, M. and Buck, J. R. (1994) Iterative and sequential algorithms for multisensor signal enhancement. IEEE Trans. Acoust., Speech, Signal Process., 42, 846859. Welch, L. R. (2003) Hidden Markov models and the Baum-Welch algorithm. IEEE Inf. Theory Soc. Newslett., 53. West, M. and Harrison, J. (1989) Bayesian Forecasting and Dynamic Models. Springer. Whitley, D. (1994) A genetic algorithm tutorial. Stat. Comput., 4, 6585. Williams, D. (1991) Probability with Martingales. Cambridge University Press. Wonham, W. M. (1965) Some applications of stochastic dierential equations to optimal nonlinear ltering. SIAM J. Control, 2, 347369. Wu, C. F. J. (1983) On the convergence properties of the EM algorithm. Ann. Statist., 11, 95103. Younes, L. (1988) Estimation and annealing for Gibbsian elds. Ann. Inst. H. Poincar Probab. Statist., 24, 269294. e

644

References

(1989) Parametric inference for imperfectly observed Gibbsian elds. Probab. Theory Related Fields, 82, 625645. Young, S. (1996) A review of large-vocabulary continuous-speech recognition. IEEE Signal Process. Mag., 13. Zangwill, W. I. (1969) Nonlinear Programming: A Unied Approach. PrenticeHall. Zaritskii, V., Svetnik, V. and Shimelevich, L. (1975) Monte-Carlo techniques in problems of optimal data processing. Autom. Remote Control, 12, 2015 2022. Zeitouni, O. and Dembo, A. (1988) Exact lters for the estimation of the number of transitions of nite-state continuous-time Markov processes. IEEE Trans. Inform. Theory, 34. Zeitouni, O. and Gutman, M. (1991) On universal hypothesis testing via large deviations. IEEE Trans. Inform. Theory, 37, 285290. Zeitouni, O., Ziv, J. and Merhav, N. (1992) When is generalized likelihood ratio test optimal? IEEE Trans. Inform. Theory, 38, 15971602. Ziv, J. and Merhav, N. (1992) Estimating the number of states of a nite-state source. IEEE Trans. Inform. Theory, 38, 6165.

Index

Absorbing state, 12 Accept-reject algorithm, 166–170, 173 in sequential Monte Carlo, 224, 261 Acceptance probability in accept-reject, 169 in Metropolis-Hastings, 171 Acceptance ratio in Metropolis-Hastings, 171 in reversible jump MCMC, 492 Accessible set, 523 AEP, see Asymptotic equipartition property Asymptotic equipartition property, see Shannon-McMillan-Breiman theorem, 574 Asymptotically tight, see Bounded in probability Atom, 524 Auxiliary variable, 260 in sequential Monte Carlo, 256–264 Averaging in MCEM, 407, 429 in SAEM, 416 in stochastic approximation, 414, 433 Backward smoothing decomposition, 70 kernels, 70–71, 125, 130 Bahadur efficiency, 565 Balance equations detailed, 41 global, 41 local, 41 Baum-Welch, see Forward-backward

Bayes formula, 71 operator, 102 rule, 64, 157 theorem, 172 Bayesian decision procedure, 472 estimation, 360, 471 model, 71, 472 network, see Graphical model posterior, see Posterior prior, see Prior Bayesian information criterion, 566, 569, 574 BCJR algorithm, 74 Bearings-only tracking, 24 Bennett inequality, 590 Bernoulli-Gaussian model, 197 BIC, see Bayesian information criterion Binary deconvolution model, 375 estimation using EM, 376 estimation using quasi-Newton, 376 estimation using SAME, 506 Binary symmetric channel, 7, 8 Bootstrap filter, 238, 254–256, 259 Bounded in probability, 334 Bryson-Frazier, see Smoothing Burn-in, 399, 498 Canonical space, 38 Capture-recapture model, 12, 485 Cauchy sequence, 606 CGLSSM, see State-space model


Index in exponential family, 352 intermediate quantity of, 349 SAGE, 395 Exponential family, 352 natural parameterization, 473 of the Normal, 150 Exponential forgetting, see Forgetting Filtered space, 37 Filtering, 54 Filtration, 37 natural, 38 Fisher identity, 354, 362, 458 Forgetting, 100120 exponential, 109, 446 of time-reversed chain, 461 strong mixing condition, 105, 108 uniform, 100, 105110 Forward smoothing decomposition, 66 kernels, 66, 101, 327 Forward-backward, 5666 , see forward variable , see backward variable backward variable, 57 Baum-Welch denomination, 74 decomposition, 57 forward variable, 57 in nite state space HMM, 123124 in state-space model, 154 scaling, 62, 75 Gaussian linear model, 128, 150 Generalized likelihood ratio test, see Likelihood ratio test Gibbs sampler, 180182 in CGLSSM, 194 in hidden Markov model, 481486 random scan, 182 sweep of, 181, 401, 484 systematic scan, 182 Gilbert-Elliott channel, 6 Global sampling, see Resampling, global Global updating, see Updating of hidden chain Gram-Schmidt orthogonalization, 136 Graphical model, 1, 4 Growth model comparison of SIS kernels, 230231

Chapman-Kolmogorov equations, 36 Coding probability, 571, 574 mixture, 573 normalized maximum likelihood, 572 universal, 572 Communicating states, 513 Companion matrix, 17, 30 Computable bounds, 186 Conditional likelihood function, 218 log-concave, 225 Contrast function, 442 Coordinate process, 38 Coupling inequality, 543 of Markov chains, 543545 set, 544 Critical region, 570 Darroch model, 13 Data augmentation, 482 Dirichlet distribution, 476, 573 Disturbance noise, 127 Dobrushin coecient, 96 Doeblin condition, 97 for hidden Markov model, 561 Drift conditions for hidden Markov model, 562 for Markov chain, 537541, 548552 Foster-Lyapunov, 549 ECM, see Expectation-maximization Eective sample size, 235 Eciency, 580 Bahadur, 581 Pitman, 580 Ecient score test, 467 EKF, see Kalman, extended lter EM, see Expectation-maximization Equivalent parameters, 451 Error exponent, 581 overestimation, 568 underestimation, 568 Exchangeable distribution, 478 Expectation-maximization, 349353 convergence of, 389394 ECM, 394 for MAP estimation, 360 for missing data models, 359

Index performance of bootstrap lter, 240242 Hahn-Jordan decomposition, 91 Harris recurrent chain, see Markov chain, Harris recurrent Harris recurrent set, 533 Hidden Markov model, 16, 4244 aperiodic, 559 discrete, 43 ergodic, 33 nite, 613 fully dominated, 43 hierarchical, 4647 in biology, 10 in ion channel modelling, 13 in speech recognition, 14 left-to-right, 33 likelihood, 53 log-likelihood, 53 normal, see Normal hidden Markov model partially dominated, 43 phi-irreducible, 559 positive, 560 recurrent, 560 transient, 560 with nite state space, 121127 Hilbert space, 618 Hitting time, 513, 521 HMM, see Hidden Markov model Hoeding inequality, 292 Homogeneous, see Markov chain HPD (highest posterior density) region, 240 Hybrid MCMC algorithms, 179 Hyperparameter, see Prior Hypothesis testing composite, 565, 567, 569, 581 simple, 570 Ideal codeword length, 571 Identiability, 450457, 468, 478, 565, 568 in Gaussian linear state-space model, 384 of nite mixtures, 454 of mixtures, 454455 Implicit conditioning convention, 58


Importance kernel, see Instrumental kernel Importance sampling, 173, 210211, 287295 self-normalized, 211, 293295 asympotic normality, 293 consistency, 293 deviation bound, 294 sequential, see Sequential Monte Carlo unnormalized, 210, 287292 asymptotic normality, 288 consistency, 288 deviation bound, 292 Importance weights, 173 normalized, 211 coecient of variation of, 235 Shannon entropy of, 235 Incremental weight, 216 Information divergence rate, 574 Information matrix, 464 observed, 442 convergence of, 465 Information parameterization, 148150 Initial distribution, 38 Innovation sequence, 136 Instrumental distribution, 210 Instrumental kernel, 215 choice of, 218 optimal, 220224 local approximation of, 225231 prior kernel, 218 Integrated autocorrelation time, 192 Invariant measure, 517, 534 sub-invariant measure, 534 Inversion method, 242 Irreducibility measure maximal, 522 of hidden Markov model, 557 of Markov chain, 521 Jacobian, 487, 492, 495496 Kalman extended lter, 228 lter, 141143 gain, 142 ltering with non-zero means, 143 predictor, 137140


Index reverse, 40 reversible, 41 solidarity property, 516 strongly aperiodic, 542 transient, 517 Markov chain Monte Carlo, 170186 Markov jump system, see Markovswitching model Markov property, 39 strong, 40 Markov-switching model, 5 maximum likelihood estimation, 469 smoothing, 86 Matrix inversion lemma, 149, 152 Maximum a posteriori, 360, 473, 501510 state estimation, 125, 208 Maximum likelihood estimator, 360, 441 asymptotic normality, 443, 465 asymptotics, 442443 consistency, 442, 446450, 465 convergence in quotient topology, 450 eciency, 443 Maximum marginal posterior estimator, 472 in CGLSSM, 208 MCEM, see Monte Carlo EM MCMC, see Markov chain Monte Carlo MDL, see Minimum description length Mean eld in stochastic approximation, 430 Mean square convergence, 618 error, 620 prediction, 620 Measurable function, 605 set, 605 space, 605 Measure positive, 605 probability, 605 MEM algorithm, see SAME algorithm Metropolis-Hastings algorithm, 171 one-at-a-time, 188 geometric ergodicity, 549 independent, 173 phi-irreducibility, 523 random walk, 176

gain, 138 unscented lter, 228 Kernel, see Transition Kraft-McMillan inequality, 571 Krichevsky-Tromov mixture, 573 Kullback-Leibler divergence, 350 Label switching, 479 Lagrange multiplier test, 467 Large deviations, 584 Latent variable model, 2 Law of iterated logarithm, 571 Level, 570 asymptotic, 570 Likelihood, 53, 359, 443445 conditional, 65, 66, 444 in state-space model, 140 Likelihood ratio test, 466468 generalized, 467, 565, 570, 574, 584 Linear prediction, 131137 Local asymptotic normality, 443 Local updating, see Updating of hidden chain Log-likelihood, see Likelihood Log-normal distribution, 487 Louis identity, 354 Lyapunov function, 421 dierential, 430 MAP, see Maximum a posteriori Marcinkiewicz-Zygmund inequality, 292 Markov chain aperiodic, 520, 542 canonical version, 39 central limit theorem, 555, 556 ergodic theorem, 520, 542 geometrically ergodic, 548 Harris recurrent, 533 homogeneous, 2 irreducible, 514 law of large numbers, 553 non-homogeneous, 40, 163 null, 519, 534 on countable space, 513520 on general space, 520556 phi-irreducible, 521 positive, 534 positive recurrent, 519 recurrent, 517

Index Minimum description length, 573 Missing information principle, 465 Mixing distribution, 454 Mixture density, 454 Mixture Kalman lter, 275 ML, MLE, see Maximum likelihood estimator Model averaging, 489 Moderate deviations, 568, 584 Monte Carlo estimate, 162 integration, 162 Monte Carlo EM, 398399 analysis of, 419429 averaging in, 407 in hidden Markov model, 399 rate of convergence, 426429 simulation schedule, 403408 with importance sampling, 402 with sequential Monte Carlo, 402 Monte Carlo steepest ascent, 408 Neyman-Pearson lemma, 570 NML, see Coding probability Noisy AR(1) model SIS with optimal kernel, 222224 SIS with prior kernel, 218220 Non-deterministic process, 136 Normal hidden Markov model, 1315 Gibbs sampling, 483 identiability, 456 likelihood ratio testing in, 467 Metropolis-Hastings sampling, 486 prior for, 477 reversible jump MCMC, 493 SAME algorithm, 504 Normalizing constant, 211 in accept-reject, 169 in Metropolis-Hastings, 172173 Occupation time of set, 521 of state, 514 Optional sampling, 590 Order, 565 estimator BIC, 587 MDL, 576 PML, 577 identication, 565 Markov, 566, 567, 569, 587 of hidden Markov model, 566, 567 Oscillation semi-norm, 92 essential, 292


Particle lter, 209, 237 Penalized maximum likelihood, 565, 568, 574 Perfect sampling, 186 Period of irreducible Markov chain, 520 of phi-irreducible HMM, 559 of phi-irreducible Markov chain, 542 of state in Markov chain, 520 PML, see Penalized maximum likelihood Polish space, 606 Posterior, 65, 71, 360, 472 Power, 570 function, 570 Precision matrix, 149 Prediction, 54 Prior, 65, 71, 360 conjugate, 473 diuse, 148 Dirichlet, 573 distribution, 471 at, 151, 475 for hidden Markov model, 475478 hyper-, 474 hyperparameter, 473 improper, 151, 474 non-informative, 472, 474 regularization, 360 selection, 473 subjective, 472 Probability space, 606 ltered, 37 Projection theorem, 619 Proper set, 299 Properly weighted sample, 268 Radon-Nikodym derivative, 210 Rao test, 467 Rao-Blackwellization, 182 Rauch-Tung-Striebel, see Smoothing Rayleigh-fading channel, 18 Recurrent


Index Sampling importance resampling, 211214, 295310 asymptotic normality, 307 consistency, 307 deviation bound, 308 estimator, 213 mean squared error of, 214 unbiasedness, 213 Score function, 457 asymptotic normality, 457464 SEM, see Stochastic EM Sensitivity equations, 363367 Sequential Monte Carlo, 209, 214231 i.i.d. sampling, 253, 324 analysis of, 324333 asymptotic normality, 325 asymptotic variance, 326 consistency, 325 deviation bound, 328, 330 for smoothing functionals, 278286 implementation in HMM, 214218 mutation step, 311315 asymptotic normality, 313 consistency, 312 mutation/selection, 255, 316 analysis of, 319 asymptotic normality, 319 consistency, 319 optimal kernel, 322 prior kernel, 322 selection/mutation, 253, 255, 316 analysis of, 320 asymptotic normality, 320 consistency, 320 SISR, 322 analysis of, 321324 asymptotical normality, 323 consistency, 323 with resampling, 231242 Shannon-McMillan-Breiman theorem, 61, 568, 574, 575 Shift operator, 39 Sieve, 577 Simulated annealing, 502 cooling schedule, 502 SIR, see Sampling importance resampling SIS, see Importance sampling SISR, see Sequential Monte Carlo

set, 524 state, 514 Recursive estimation, 374 Regeneration time, 529 Regret, 572 Regularization, 360 Reprojection, 420 Resampling asymptotic normality, 306 consistency, 303 global, 267 in SMC, 236242 multinomial, 212213 alternatives to, 244250 implementation of, 242244 optimal, 267273 remainder, see residual residual, 245246 stratied, 246248 systematic, 248250 unbiased, 244, 268 Resolvent kernel, see Transition Return time, 513, 521 Reversibility, 41 in Gibbs sampler, 181 of Metropolis-Hastings, 171 of reversible jump MCMC, 491 Reversible jump MCMC, 488, 490 acceptance ratio, 492 birth move, 493 combine move, 493495 death move, 493 merge move, 493 split move, 493495 Riccati equation, 139 algebraic, 141 Robbins-Monro, see Stochastic approximation RTS, see Smoothing SAEM, see Stochastic approximation EM SAGE, see Expectation-maximization SAME algorithm, 503 for normal HMM, 504 in binary deconvolution model, 506 Sample impoverishment, see Weight degeneracy

Index Slice sampler, 183 Small set existence, 527 of hidden Markov model, 559 of Markov chain, 526 SMC, see Sequential Monte Carlo Smoothing, 51, 54 Bryson-Frazier, 143 disturbance, 143147 xed-interval, 51, 5976 xed-point, 7879 forward-backward, 59 functional, 278 in CGLSSM, 157159 in hierarchical HMM, 8789 in Markov-switching model, 86 Rauch-Tung-Striebel, 66, 131 recursive, 7985 smoothing functional, 80 two-lter formula, 76, 148155 with Markovian decomposition backward, 70, 124, 131 forward, 66 Source coding, 565 Splitting construction, 529530 split chain, 529 Stability in stochastic algorithms, 420 State space, 38 State-space model, 3 conditionally Gaussian linear, 1723, 46, 194208, 274278 Gaussian linear, 1517, 127155 Stationary distribution of hidden Markov model, 560 of Markov chain, 517 Steins lemma, 581, 584 Stochastic approximation, 411 analysis of, 429433 gradient algorithm, 412 rate of convergence, 432433 Robbins-Monro form, 412 Stochastic approximation EM, 414 convergence of, 433435 Stochastic EM, 416 Stochastic process, 37 adapted, 38 stationary, 41 Stochastic volatility model, 2529 approximation of optimal kernel, 227228 EM algorithm, 399 identiability, 456 one-at-a-time sampling, 188194 performance of SISR, 239240 single site sampling, 183185 smoothing with SMC, 281 weight degeneracy, 234236 Stopping time, 39 Strong mixing condition, 105, 108 Subspace methods, 384 Sucient statistic, 352 Sweep, see Gibbs sampler Tangent lter, 366 Target distribution, 171 Tight, see Bounded in probability Total variation distance, 91, 93 V -total variation, 544 Transient set (uniformly), 524 state, 514 Transition density function, 35 kernel, 35 Markov, 35 resolvent, 522 reverse, 37 unnormalized, 35 matrix, 35 Triangular array, 297 central limit theorems, 338342 conditionally independent, 298 conditionally i.i.d., 298 laws of large numbers, 333338 Two-lter formula, see Smoothing UKF, see Kalman, unscented lter Uniform spacings, 243 Universal coding, 565, 567, 571 Updating of hidden chain global, 481 local, 482


V-total variation distance, see Total variation distance Variable dimension model, 488 Viterbi algorithm, 125


Weighting and resampling algorithm, 301 Well-log data model, 21–22 with Gibbs sampler, 204 with mixture Kalman filter, 276

Wald test, 467 Weight degeneracy, 209, 231–236 Weighted sample, 298 asymptotic normality, 299, 304 consistency, 298, 301
