Markov Chains and Mixing Times


Second Edition

David A. Levin
University of Oregon

Yuval Peres
Microsoft Research

With contributions by
Elizabeth L. Wilmer

With a chapter on “Coupling from the Past” by

James G. Propp and David B. Wilson

Providence, Rhode Island

Copyright no copyright American Mathematical Society.
2010 Mathematics Subject Classification. Primary 60J10, 60J27, 60B15, 60C05, 65C05,
60K35, 68W20, 68U20, 82C22.
FRONT COVER: The figure on the bottom left of the front cover, courtesy of
David B. Wilson, is a uniformly random lozenge tiling of a hexagon (see Section
25.2). The figure on the bottom center, also from David B. Wilson, is a random
sample of an Ising model at its critical temperature (see Sections 3.3.5 and 25.2)
with mixed boundary conditions. The figure on the bottom right, courtesy of Eyal
Lubetzky, is a portion of an expander graph (see Section 13.6).

For additional information and updates on this book, visit

Library of Congress Cataloging-in-Publication Data

Names: Levin, David Asher, 1971- | Peres, Y. (Yuval) | Wilmer, Elizabeth L. (Elizabeth Lee),
1970- | Propp, James, 1960- | Wilson, David B. (David Bruce)
Title: Markov chains and mixing times / David A. Levin, Yuval Peres ; with contributions by
Elizabeth L. Wilmer.
Description: Second edition. | Providence, Rhode Island : American Mathematical Society, [2017]
| “With a chapter on Coupling from the past, by James G. Propp and David B. Wilson.” |
Includes bibliographical references and indexes.
Identifiers: LCCN 2017017451 | ISBN 9781470429621 (alk. paper)
Subjects: LCSH: Markov processes–Textbooks. | Distribution (Probability theory)–Textbooks.
| AMS: Probability theory and stochastic processes – Markov processes – Markov chains
(discrete-time Markov processes on discrete state spaces). msc | Probability theory and sto-
chastic processes – Markov processes – Continuous-time Markov processes on discrete state
spaces. msc | Probability theory and stochastic processes – Probability theory on algebraic
and topological structures – Probability measures on groups or semigroups, Fourier transforms,
factorization. msc | Probability theory and stochastic processes – Combinatorial probability
– Combinatorial probability. msc | Numerical analysis – Probabilistic methods, simulation
and stochastic differential equations – Monte Carlo methods. msc | Probability theory and
stochastic processes – Special processes – Interacting random processes; statistical mechan-
ics type models; percolation theory. msc | Computer science – Algorithms – Randomized
algorithms. msc | Computer science – Computing methodologies and applications – Simula-
tion. msc | Statistical mechanics, structure of matter – Time-dependent statistical mechanics
(dynamic and nonequilibrium) – Interacting particle systems. msc
Classification: LCC QA274.7 .L48 2017 | DDC 519.2/33–dc23 LC record available at

Copying and reprinting. Individual readers of this publication, and nonprofit libraries
acting for them, are permitted to make fair use of the material, such as to copy select pages for
use in teaching or research. Permission is granted to quote brief passages from this publication in
reviews, provided the customary acknowledgment of the source is given.
Republication, systematic copying, or multiple reproduction of any material in this publication
is permitted only under license from the American Mathematical Society. Permissions to reuse
portions of AMS publication content are handled by Copyright Clearance Center’s RightsLink
service. For more information, please visit:
Send requests for translation rights and licensed reprints to [email protected].
Excluded from these provisions is material for which the author holds copyright. In such cases,
requests for permission to reuse or reprint material should be addressed directly to the author(s).
Copyright ownership is indicated on the copyright page, or on the lower right-hand corner of the
first page of each article within proceedings volumes.
Second edition © 2017 by the authors. All rights reserved.
First edition © 2009 by the authors. All rights reserved.
Printed in the United States of America.

∞ The paper used in this book is acid-free and falls within the guidelines
established to ensure permanence and durability.
Visit the AMS home page at

Contents

Preface
Preface to the Second Edition
Preface to the First Edition
Overview
For the Reader
For the Instructor
For the Expert

Acknowledgements

Part I: Basic Methods and Examples

Chapter 1. Introduction to Finite Markov Chains

1.1. Markov Chains
1.2. Random Mapping Representation
1.3. Irreducibility and Aperiodicity
1.4. Random Walks on Graphs
1.5. Stationary Distributions
1.6. Reversibility and Time Reversals
1.7. Classifying the States of a Markov Chain*
Exercises
Notes

Chapter 2. Classical (and Useful) Markov Chains

2.1. Gambler’s Ruin
2.2. Coupon Collecting
2.3. The Hypercube and the Ehrenfest Urn Model
2.4. The Pólya Urn Model
2.5. Birth-and-Death Chains
2.6. Random Walks on Groups
2.7. Random Walks on Z and Reflection Principles
Exercises
Notes

Chapter 3. Markov Chain Monte Carlo: Metropolis and Glauber Chains

3.1. Introduction
3.2. Metropolis Chains
3.3. Glauber Dynamics
Exercises
Notes

Chapter 4. Introduction to Markov Chain Mixing

4.1. Total Variation Distance
4.2. Coupling and Total Variation Distance
4.3. The Convergence Theorem
4.4. Standardizing Distance from Stationarity
4.5. Mixing Time
4.6. Mixing and Time Reversal
4.7. $\ell^p$ Distance and Mixing
Exercises
Notes

Chapter 5. Coupling

5.1. Definition
5.2. Bounding Total Variation Distance
5.3. Examples
5.4. Grand Couplings
Exercises
Notes

Chapter 6. Strong Stationary Times

6.1. Top-to-Random Shuffle
6.2. Markov Chains with Filtrations
6.3. Stationary Times
6.4. Strong Stationary Times and Bounding Distance
6.5. Examples
6.6. Stationary Times and Cesàro Mixing Time
6.7. Optimal Strong Stationary Times*
Exercises
Notes

Chapter 7. Lower Bounds on Mixing Times

7.1. Counting and Diameter Bounds
7.2. Bottleneck Ratio
7.3. Distinguishing Statistics
7.4. Examples
Exercises
Notes

Chapter 8. The Symmetric Group and Shuffling Cards

8.1. The Symmetric Group
8.2. Random Transpositions
8.3. Riffle Shuffles
Exercises
Notes

Chapter 9. Random Walks on Networks

9.1. Networks and Reversible Markov Chains
9.2. Harmonic Functions
9.3. Voltages and Current Flows
9.4. Effective Resistance
9.5. Escape Probabilities on a Square
Exercises
Notes

Chapter 10. Hitting Times

10.1. Definition
10.2. Random Target Times
10.3. Commute Time
10.4. Hitting Times on Trees
10.5. Hitting Times for Eulerian Graphs
10.6. Hitting Times for the Torus
10.7. Bounding Mixing Times via Hitting Times
10.8. Mixing for the Walk on Two Glued Graphs
Exercises
Notes

Chapter 11. Cover Times

11.1. Definitions
11.2. The Matthews Method
11.3. Applications of the Matthews Method
11.4. Spanning Tree Bound for Cover Time
11.5. Waiting for All Patterns in Coin Tossing
Exercises
Notes

Chapter 12. Eigenvalues

12.1. The Spectral Representation of a Reversible Transition Matrix
12.2. The Relaxation Time
12.3. Eigenvalues and Eigenfunctions of Some Simple Random Walks
12.4. Product Chains
12.5. Spectral Formula for the Target Time
12.6. An $\ell^2$ Bound
12.7. Time Averages
Exercises
Notes

Part II: The Plot Thickens

Chapter 13. Eigenfunctions and Comparison of Chains

13.1. Bounds on Spectral Gap via Contractions
13.2. The Dirichlet Form and the Bottleneck Ratio
13.3. Simple Comparison of Markov Chains
13.4. The Path Method
13.5. Wilson’s Method for Lower Bounds
13.6. Expander Graphs*
Exercises
Notes

Chapter 14. The Transportation Metric and Path Coupling

14.1. The Transportation Metric
14.2. Path Coupling
14.3. Rapid Mixing for Colorings
14.4. Approximate Counting
Exercises
Notes

Chapter 15. The Ising Model

15.1. Fast Mixing at High Temperature
15.2. The Complete Graph
15.3. The Cycle
15.4. The Tree
15.5. Block Dynamics
15.6. Lower Bound for the Ising Model on Square*
Exercises
Notes

Chapter 16. From Shuffling Cards to Shuffling Genes

16.1. Random Adjacent Transpositions
16.2. Shuffling Genes
Exercises
Notes

Chapter 17. Martingales and Evolving Sets

17.1. Definition and Examples
17.2. Optional Stopping Theorem
17.3. Applications
17.4. Evolving Sets
17.5. A General Bound on Return Probabilities
17.6. Harmonic Functions and the Doob h-Transform
17.7. Strong Stationary Times from Evolving Sets
Exercises
Notes

Chapter 18. The Cutoff Phenomenon

18.1. Definition
18.2. Examples of Cutoff
18.3. A Necessary Condition for Cutoff
18.4. Separation Cutoff
Exercises
Notes

Chapter 19. Lamplighter Walks

19.1. Introduction
19.2. Relaxation Time Bounds
19.3. Mixing Time Bounds
19.4. Examples
Exercises
Notes

Chapter 20. Continuous-Time Chains*

20.1. Definitions
20.2. Continuous-Time Mixing
20.3. Spectral Gap
20.4. Product Chains
Exercises
Notes

Chapter 21. Countable State Space Chains*

21.1. Recurrence and Transience
21.2. Infinite Networks
21.3. Positive Recurrence and Convergence
21.4. Null Recurrence and Convergence
21.5. Bounds on Return Probabilities
Exercises
Notes

Chapter 22. Monotone Chains

22.1. Introduction
22.2. Stochastic Domination
22.3. Definition and Examples of Monotone Markov Chains
22.4. Positive Correlations
22.5. The Second Eigenfunction
22.6. Censoring Inequality
22.7. Lower Bound on $\bar{d}$
22.8. Proof of Strassen’s Theorem
Exercises
Notes

Chapter 23. The Exclusion Process

23.1. Introduction
23.2. Mixing Time of k-Exclusion on the n-Path
23.3. Biased Exclusion
Exercises
Notes

Chapter 24. Cesàro Mixing Time, Stationary Times, and Hitting Large Sets

24.1. Introduction
24.2. Equivalence of $t_{\mathrm{stop}}$, $t_{\mathrm{Ces}}$, and $t_G$ for Reversible Chains
24.3. Halting States and Mean-Optimal Stopping Times
24.4. Regularity Properties of Geometric Mixing Times*
24.5. Equivalence of $t_G$ and $t_H$
24.6. Upward Skip-Free Chains
24.7. $t_H(\alpha)$ Are Comparable for $\alpha \le 1/2$
24.8. An Upper Bound on $t_{\mathrm{rel}}$
24.9. Application to Robustness of Mixing
Exercises
Notes

Chapter 25. Coupling from the Past

25.1. Introduction
25.2. Monotone CFTP
25.3. Perfect Sampling via Coupling from the Past
25.4. The Hardcore Model
25.5. Random State of an Unknown Markov Chain
Exercise
Notes

Chapter 26. Open Problems

26.1. The Ising Model
26.2. Cutoff
26.3. Other Problems
26.4. Update: Previously Open Problems

Appendix A. Background Material

A.1. Probability Spaces and Random Variables
A.2. Conditional Expectation
A.3. Strong Markov Property
A.4. Metric Spaces
A.5. Linear Algebra
A.6. Miscellaneous
Exercise

Appendix B. Introduction to Simulation

B.1. What Is Simulation?
B.2. Von Neumann Unbiasing*
B.3. Simulating Discrete Distributions and Sampling
B.4. Inverse Distribution Function Method
B.5. Acceptance-Rejection Sampling
B.6. Simulating Normal Random Variables
B.7. Sampling from the Simplex
B.8. About Random Numbers
B.9. Sampling from Large Sets*
Exercises
Notes

Appendix C. Ergodic Theorem

C.1. Ergodic Theorem*
Exercise

Appendix D. Solutions to Selected Exercises

Bibliography

Notation Index

Index

Preface to the Second Edition

Since the publication of the first edition, the field of mixing times has continued
to enjoy rapid expansion. In particular, many of the open problems posed in the
first edition have been solved. The book has been used in courses at numerous
universities, motivating us to update it.
In the eight years since the first edition appeared, we have made corrections
and improvements throughout the book. We added three new chapters: Chapter 22
on monotone chains, Chapter 23 on the exclusion process, and Chapter 24, which
relates mixing times and hitting time parameters to stationary stopping times.
Chapter 4 now includes an introduction to mixing times in $\ell^p$, which reappear in
Chapter 10. The latter chapter has several new topics, including estimates for hit-
ting times on trees and Eulerian digraphs. A bound for cover times using spanning
trees has been added to Chapter 11, which also now includes a general bound on
cover times for regular graphs. The exposition in Chapter 6 and Chapter 17 now
employs filtrations rather than relying on the random mapping representation. To
reflect the key developments since the first edition, especially breakthroughs on the
Ising model and the cutoff phenomenon, the Notes at the end of chapters and the
open problems have been updated.
We thank the many careful readers who sent us comments and corrections:
Anselm Adelmann, Amitabha Bagchi, Nathanael Berestycki, Olena Bormashenko,
Krzysztof Burdzy, Gerandy Brito, Darcy Camargo, Varsha Dani, Sukhada Fad-
navis, Tertuliano Franco, Alan Frieze, Reza Gheissari, Jonathan Hermon, Ander
Holroyd, Kenneth Hu, John Jiang, Svante Janson, Melvin Kianmanesh Rad, Yin
Tat Lee, Zhongyang Li, Eyal Lubetzky, Abbas Mehrabian, R. Misturini, L. Mor-
gado, Asaf Nachmias, Fedja Nazarov, Joe Neeman, Ross Pinsky, Anthony Quas,
Miklos Racz, Dinah Shender, N. J. A. Sloane, Jeff Steif, Izabella Stuhl, Jan Swart,
Ryokichi Tanaka, Daniel Wu, and Zhen Zhu. We are particularly grateful to Daniel
Jerison, Pawel Pralat, and Perla Sousi, who sent us long lists of insightful comments.

Preface to the First Edition

Markov first studied the stochastic processes that came to be named after him
in 1906. Approximately a century later, there is an active and diverse interdisci-
plinary community of researchers using Markov chains in computer science, physics,
statistics, bioinformatics, engineering, and many other areas.
The classical theory of Markov chains studied fixed chains, and the goal was
to estimate the rate of convergence to stationarity of the distribution at time t, as
t → ∞. In the past two decades, as interest in chains with large state spaces has
increased, a different asymptotic analysis has emerged. Some target distance to
the stationary distribution is prescribed; the number of steps required to reach this
target is called the mixing time of the chain. Now, the goal is to understand how
the mixing time grows as the size of the state space increases.
The modern theory of Markov chain mixing is the result of the convergence, in
the 1980s and 1990s, of several threads. (We mention only a few names here; see
the chapter Notes for references.)
For statistical physicists Markov chains become useful in Monte Carlo simu-
lation, especially for models on finite grids. The mixing time can determine the
running time for simulation. However, Markov chains are used not only for sim-
ulation and sampling purposes, but also as models of dynamical processes. Deep
connections were found between rapid mixing and spatial properties of spin systems,
e.g., by Dobrushin, Shlosman, Stroock, Zegarlinski, Martinelli, and Olivieri.
In theoretical computer science, Markov chains play a key role in sampling and
approximate counting algorithms. Often the goal was to prove that the mixing
time is polynomial in the logarithm of the state space size. (In this book, we are
generally interested in more precise asymptotics.)
At the same time, mathematicians including Aldous and Diaconis were inten-
sively studying card shuffling and other random walks on groups. Both spectral
methods and probabilistic techniques, such as coupling, played important roles.
Alon and Milman, Jerrum and Sinclair, and Lawler and Sokal elucidated the con-
nection between eigenvalues and expansion properties. Ingenious constructions of
“expander” graphs (on which random walks mix especially fast) were found using
probability, representation theory, and number theory.
In the 1990s there was substantial interaction between these communities, as
computer scientists studied spin systems and as ideas from physics were used for
sampling combinatorial structures. Using the geometry of the underlying graph to
find (or exclude) bottlenecks played a key role in many results.
There are many methods for determining the asymptotics of convergence to
stationarity as a function of the state space size and geometry. We hope to present
these exciting developments in an accessible way.
We will only give a taste of the applications to computer science and statistical
physics; our focus will be on the common underlying mathematics. The prerequi-
sites are all at the undergraduate level. We will draw primarily on probability and
linear algebra, but we will also use the theory of groups and tools from analysis
when appropriate.
Why should mathematicians study Markov chain convergence? First of all, it is
a lively and central part of modern probability theory. But there are ties to several
other mathematical areas as well. The behavior of the random walk on a graph
reveals features of the graph’s geometry. Many phenomena that can be observed in
the setting of finite graphs also occur in differential geometry. Indeed, the two fields
enjoy active cross-fertilization, with ideas in each playing useful roles in the other.
Reversible finite Markov chains can be viewed as resistor networks; the resulting
discrete potential theory has strong connections with classical potential theory. It
is amusing to interpret random walks on the symmetric group as card shuffles—and
real shuffles have inspired some extremely serious mathematics—but these chains
are closely tied to core areas in algebraic combinatorics and representation theory.

In the spring of 2005, mixing times of finite Markov chains were a major theme
of the multidisciplinary research program Probability, Algorithms, and Statistical
Physics, held at the Mathematical Sciences Research Institute. We began work on
this book there.

Overview

We have divided the book into two parts.
In Part I, the focus is on techniques, and the examples are illustrative and
accessible. Chapter 1 defines Markov chains and develops the conditions necessary
for the existence of a unique stationary distribution. Chapters 2 and 3 both cover
examples. In Chapter 2, they are either classical or useful—and generally both;
we include accounts of several chains, such as the gambler’s ruin and the coupon
collector, that come up throughout probability. In Chapter 3, we discuss Glauber
dynamics and the Metropolis algorithm in the context of “spin systems.” These
chains are important in statistical mechanics and theoretical computer science.
Chapter 4 proves that, under mild conditions, Markov chains do, in fact, con-
verge to their stationary distributions and defines total variation distance and
mixing time, the key tools for quantifying that convergence. The techniques of
Chapters 5, 6, and 7, on coupling, strong stationary times, and methods for lower
bounding distance from stationarity, respectively, are central to the area.
In Chapter 8, we pause to examine card shuffling chains. Random walks on the
symmetric group are an important mathematical area in their own right, but we
hope that readers will appreciate a rich class of examples appearing at this stage
in the exposition.
Chapter 9 describes the relationship between random walks on graphs and
electrical networks, while Chapters 10 and 11 discuss hitting times and cover times.
Chapter 12 introduces eigenvalue techniques and discusses the role of the re-
laxation time (the reciprocal of the spectral gap) in the mixing of the chain.
In Part II, we cover more sophisticated techniques and present several detailed
case studies of particular families of chains. Much of this material appears here for
the first time in textbook form.
Chapter 13 covers advanced spectral techniques, including comparison of Dirich-
let forms and Wilson’s method for lower bounding mixing.
Chapters 14 and 15 cover some of the most important families of “large” chains
studied in computer science and statistical mechanics and some of the most impor-
tant methods used in their analysis. Chapter 14 introduces the path coupling
method, which is useful in both sampling and approximate counting. Chapter 15
looks at the Ising model on several different graphs, both above and below the
critical temperature.
Chapter 16 revisits shuffling, looking at two examples—one with an application
to genomics—whose analysis requires the spectral techniques of Chapter 13.
Chapter 17 begins with a brief introduction to martingales and then presents
some applications of the evolving sets process.
Chapter 18 considers the cutoff phenomenon. For many families of chains where
we can prove sharp upper and lower bounds on mixing time, the distance from
stationarity drops from near 1 to near 0 over an interval asymptotically smaller
than the mixing time. Understanding why cutoff is so common for families of
interest is a central question.

Chapter 19, on lamplighter chains, brings together methods presented throughout
the book. There are many bounds relating parameters of lamplighter chains
to parameters of the original chain: for example, the mixing time of a lamplighter
chain is of the same order as the cover time of the base chain.
Chapters 20 and 21 introduce two well-studied variants on finite discrete time
Markov chains: continuous time chains and chains with countable state spaces.
In both cases we draw connections with aspects of the mixing behavior of finite
discrete-time Markov chains.
Chapter 25, written by Propp and Wilson, describes the remarkable construction
of coupling from the past, which can provide exact samples from the stationary
distribution.
Chapter 26 closes the book with a list of open problems connected to material
covered in the book.

For the Reader

Starred sections, results, and chapters contain material that either digresses
from the main subject matter of the book or is more sophisticated than what
precedes them and may be omitted.
Exercises are found at the ends of chapters. Some (especially those whose
results are applied in the text) have solutions at the back of the book. We of course
encourage you to try them yourself first!
The Notes at the ends of chapters include references to original papers, sugges-
tions for further reading, and occasionally “complements.” These generally contain
related material not required elsewhere in the book—sharper versions of lemmas or
results that require somewhat greater prerequisites.
The Notation Index at the end of the book lists many recurring symbols.
Much of the book is organized by method, rather than by example. The reader
may notice that, in the course of illustrating techniques, we return again and again
to certain families of chains—random walks on tori and hypercubes, simple card
shuffles, proper colorings of graphs. In our defense we offer an anecdote.
In 1991 one of us (Y. Peres) arrived as a postdoc at Yale and visited Shizuo
Kakutani, whose rather large office was full of books and papers, with bookcases
and boxes from floor to ceiling. A narrow path led from the door to Kakutani’s desk,
which was also overflowing with papers. Kakutani admitted that he sometimes had
difficulty locating particular papers, but he proudly explained that he had found a
way to solve the problem. He would make four or five copies of any really interesting
paper and put them in different corners of the office. When searching, he would be
sure to find at least one of the copies. . . .
Cross-references in the text and the Index should help you track earlier occurrences
of an example. You may also find the chapter dependency diagrams below
useful.
We have included brief accounts of some background material in Appendix A.
These are intended primarily to set terminology and notation, and we hope you
will consult suitable textbooks for unfamiliar material.
Be aware that we occasionally write symbols representing a real number when
an integer is required (see, e.g., the $\binom{n}{\delta k}$’s in the proof of Proposition 13.37). We
hope the reader will realize that this omission of floor or ceiling brackets (and the
details of analyzing the resulting perturbations) is in her or his best interest as
much as it is in ours.

For the Instructor

The prerequisites this book demands are a first course in probability, linear
algebra, and, inevitably, a certain degree of mathematical maturity. When intro-
ducing material which is standard in other undergraduate courses—e.g., groups—we
provide definitions but often hope the reader has some prior experience with the
concepts.
In Part I, we have worked hard to keep the material accessible and engaging
for students. (Starred material is more sophisticated and is not required for what
follows immediately; it can be omitted.)
Here are the dependencies among the chapters of Part I:

Chapters 1 through 7, shown in gray, form the core material, but there are
several ways to proceed afterwards. Chapter 8 on shuffling gives an early rich
application but is not required for the rest of Part I. A course with a probabilistic
focus might cover Chapters 9, 10, and 11. To emphasize spectral methods and
combinatorics, cover Chapters 8 and 12 and perhaps continue on to Chapters 13
and 16.
While our primary focus is on chains with finite state spaces run in discrete time,
continuous-time and countable-state-space chains are both discussed—in Chapters
20 and 21, respectively.
We have also included Appendix B, an introduction to simulation methods, to
help motivate the study of Markov chains for students with more applied interests.
A course leaning towards theoretical computer science and/or statistical mechan-
ics might start with Appendix B, cover the core material, and then move on to
Chapters 14, 15, and 25.
Of course, depending on the interests of the instructor and the ambitions and
abilities of the students, any of the material can be taught! Below we include
a full diagram of dependencies of chapters. Its tangled nature results from the
interconnectedness of the area: a given technique can be applied in many situations,
while a particular problem may require several techniques for full analysis.

The logical dependencies of chapters. The core Chapters 1
through 7 are in dark gray, the rest of Part I is in light gray,
and Part II is in white.

For the Expert

Several other recent books treat Markov chain mixing. Our account is more
comprehensive than those of Häggström (2002), Jerrum (2003), or Montene-
gro and Tetali (2006), yet not as exhaustive as Aldous and Fill (1999). Nor-
ris (1998) gives an introduction to Markov chains and their applications but does
not focus on mixing. Since this is a textbook, we have aimed for accessibility and
comprehensibility, particularly in Part I.
What is different or novel in our approach to this material?
– Our approach is probabilistic whenever possible. We also integrate “classi-
cal” material on networks, hitting times, and cover times and demonstrate
its usefulness for bounding mixing times.
– We provide an introduction to several major statistical mechanics models,
most notably the Ising model, and collect results on them in one place.

– We give expository accounts of several modern techniques and examples,
including evolving sets, the cutoff phenomenon, lamplighter chains, and
the L-reversal chain.
– We systematically treat lower bounding techniques, including several ap-
plications of Wilson’s method.
– We use the transportation metric to unify our account of path coupling
and draw connections with earlier history.
– We present an exposition of coupling from the past by Propp and Wilson,
the originators of the method.

Acknowledgements

The authors thank the Mathematical Sciences Research Institute, the National
Science Foundation VIGRE grant to the Department of Statistics at the University
of California, Berkeley, and National Science Foundation grants DMS-0244479 and
DMS-0104073 for support. We also thank Hugo Rossi for suggesting we embark on
this project. Thanks to Blair Ahlquist, Tonci Antunovic, Elisa Celis, Paul Cuff,
Jian Ding, Ori Gurel-Gurevich, Tom Hayes, Itamar Landau, Yun Long, Karola
Mészáros, Shobhana Murali, Weiyang Ning, Tomoyuki Shirai, Walter Sun, Sith-
parran Vanniasegaram, and Ariel Yadin for corrections to an earlier version and
making valuable suggestions. Yelena Shvets made the illustration in Section 6.5.4.
The simulations of the Ising model in Chapter 15 are due to Raissa D’Souza. We
thank László Lovász for useful discussions. We are indebted to Alistair Sinclair for
his work co-organizing the M.S.R.I. program Probability, Algorithms, and Statisti-
cal Physics in 2005, where work on this book began. We thank Robert Calhoun
for technical assistance.
Finally, we are greatly indebted to David Aldous and Persi Diaconis, who initi-
ated the modern point of view on finite Markov chains and taught us much of what
we know about the subject.


Part I: Basic Methods and Examples

Everything should be made as simple as possible, but not simpler.

–Paraphrase of a quotation from Einstein (1934).



Chapter 1. Introduction to Finite Markov Chains

1.1. Markov Chains

A Markov chain is a process which moves among the elements of a set $\mathcal{X}$ in
the following manner: when at $x \in \mathcal{X}$, the next position is chosen according to
a fixed probability distribution $P(x, \cdot)$ depending only on $x$. More precisely, a
sequence of random variables $(X_0, X_1, \ldots)$ is a Markov chain with state space
$\mathcal{X}$ and transition matrix $P$ if for all $x, y \in \mathcal{X}$, all $t \ge 1$, and all events
$H_{t-1} = \bigcap_{s=0}^{t-1} \{X_s = x_s\}$ satisfying $\mathbf{P}(H_{t-1} \cap \{X_t = x\}) > 0$, we have

$$\mathbf{P}\{X_{t+1} = y \mid H_{t-1} \cap \{X_t = x\}\} = \mathbf{P}\{X_{t+1} = y \mid X_t = x\} = P(x, y). \tag{1.1}$$

Equation (1.1), often called the Markov property, means that the conditional
probability of proceeding from state $x$ to state $y$ is the same, no matter what
sequence $x_0, x_1, \ldots, x_{t-1}$ of states precedes the current state $x$. This is exactly why
the $|\mathcal{X}| \times |\mathcal{X}|$ matrix $P$ suffices to describe the transitions.
The $x$-th row of $P$ is the distribution $P(x, \cdot)$. Thus, $P$ is stochastic; that is,
its entries are all non-negative and

$$\sum_{y \in \mathcal{X}} P(x, y) = 1 \quad \text{for all } x \in \mathcal{X}.$$
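As a minimal sketch of the definition (not taken from the text), the following Python code checks that a matrix is stochastic and samples a trajectory in which each step is drawn from the row of the current state only. The three-state matrix `P`, the helper names `is_stochastic` and `sample_chain`, and the fixed seed are illustrative assumptions.

```python
import random


def is_stochastic(P, tol=1e-12):
    """Return True if every entry is non-negative and every row sums to 1."""
    return all(
        all(entry >= 0 for entry in row) and abs(sum(row) - 1.0) <= tol
        for row in P
    )


def sample_chain(P, x0, steps, rng):
    """Sample (X_0, ..., X_steps); the next state is drawn from row P[x] only."""
    path = [x0]
    for _ in range(steps):
        x = path[-1]
        # The law of the next state is P(x, .), whatever the earlier history was.
        nxt = rng.choices(range(len(P)), weights=P[x])[0]
        path.append(nxt)
    return path


# Hypothetical transition matrix on the state space {0, 1, 2}.
P = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.5, 0.5],
]

rng = random.Random(0)  # fixed seed so that a run is reproducible
path = sample_chain(P, x0=0, steps=10, rng=rng)
print(is_stochastic(P), path)
```

Because the sampler never looks at anything but `path[-1]`, the matrix alone determines the law of the whole trajectory once the starting state is fixed.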

Example 1.1. A certain frog lives in a pond with two lily pads, east and west.
A long time ago, he found two coins at the bottom of the pond and brought one
up to each lily pad. Every morning, the frog decides whether to jump by tossing
the current lily pad’s coin. If the coin lands heads up, the frog jumps to the other
lily pad. If the coin lands tails up, he remains where he is.

Figure 1.1. A randomly jumping frog. Whenever he tosses heads,
he jumps to the other lily pad.

Let $\mathcal{X} = \{e, w\}$, and let $(X_0, X_1, \ldots)$ be the sequence of lily pads occupied
by the frog on Sunday, Monday, $\ldots$. Given the source of the coins, we should not
assume that they are fair! Say the coin on the east pad has probability $p$ of landing
heads up, while the coin on the west pad has probability $q$ of landing heads up.
The frog’s rules for jumping imply that if we set
$$P = \begin{pmatrix} P(e, e) & P(e, w) \\ P(w, e) & P(w, w) \end{pmatrix} = \begin{pmatrix} 1 - p & p \\ q & 1 - q \end{pmatrix}, \tag{1.2}$$
then (X0 , X1 , . . . ) is a Markov chain with transition matrix P . Note that the first
row of P is the conditional distribution of Xt+1 given that Xt = e, while the second
row is the conditional distribution of Xt+1 given that Xt = w.
Assume that the frog spends Sunday on the east pad. When he awakens Mon-
day, he has probability p of moving to the west pad and probability 1 − p of staying
on the east pad. That is,
P{X1 = e | X0 = e} = 1 − p, P{X1 = w | X0 = e} = p. (1.3)
What happens Tuesday? By considering the two possibilities for X1 , we see that
$$\mathbf{P}\{X_2 = e \mid X_0 = e\} = (1 - p)(1 - p) + pq \tag{1.4}$$
and
$$\mathbf{P}\{X_2 = w \mid X_0 = e\} = (1 - p)p + p(1 - q). \tag{1.5}$$

While we could keep writing out formulas like (1.4) and (1.5), there is a more
systematic approach. We can store our distribution information in a row vector
μt := (P{Xt = e | X0 = e}, P{Xt = w | X0 = e}) .
Our assumption that the frog starts on the east pad can now be written as μ0 =
(1, 0), while (1.3) becomes μ1 = μ0 P .
Multiplying by P on the right updates the distribution by another step:
μt = μt−1 P for all t ≥ 1. (1.6)
Indeed, for any initial distribution μ0 ,
$$\mu_t = \mu_0 P^t \quad \text{for all } t \ge 0. \tag{1.7}$$
Figure 1.2. The probability of being on the east pad (started from the east pad) plotted versus time, for times 0 through 20, for (a) p = q = 1/2, (b) p = 0.2 and q = 0.1, (c) p = 0.95 and q = 0.7. The long-term limiting probabilities are 1/2, 1/3, and 14/33 ≈ 0.42, respectively.

How does the distribution μt behave in the long term? Figure 1.2 suggests that
μt has a limit π (whose value depends on p and q) as t → ∞. Any such limit
distribution π must satisfy
π = πP,
which implies (after a little algebra) that
$$\pi(e) = \frac{q}{p + q}, \qquad \pi(w) = \frac{p}{p + q}.$$
If we define
$$\Delta_t = \mu_t(e) - \frac{q}{p + q} \quad \text{for all } t \ge 0,$$
then by the definition of μt+1 the sequence (Δt ) satisfies
$$\Delta_{t+1} = \mu_t(e)(1 - p) + (1 - \mu_t(e))\,q - \frac{q}{p + q} = (1 - p - q)\,\Delta_t. \tag{1.8}$$
We conclude that when 0 < p < 1 and 0 < q < 1,
$$\lim_{t \to \infty} \mu_t(e) = \frac{q}{p + q} \quad \text{and} \quad \lim_{t \to \infty} \mu_t(w) = \frac{p}{p + q} \tag{1.9}$$
for any initial distribution μ0 . As we suspected, μt approaches π as t → ∞.
Remark 1.2. The traditional theory of finite Markov chains is concerned with
convergence statements of the type seen in (1.9), that is, with the rate of conver-
gence as t → ∞ for a fixed chain. Note that 1 − p − q is an eigenvalue of the
frog’s transition matrix P . Note also that this eigenvalue determines the rate of
convergence in (1.9), since by (1.8) we have
$$\Delta_t = (1 - p - q)^t \Delta_0.$$
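The recursion (1.6) and the geometric decay of Δt can be checked numerically. The following is a minimal sketch, not part of the original text; the function names `step`, `frog_matrix`, and `east_probabilities` are illustrative choices, and the values p = 0.2, q = 0.1 are those of panel (b) of Figure 1.2.

```python
def step(mu, P):
    """Advance a row-vector distribution one step: return mu P."""
    n = len(P)
    return [sum(mu[x] * P[x][y] for x in range(n)) for y in range(n)]


def frog_matrix(p, q):
    """Transition matrix (1.2); index 0 is the east pad, index 1 the west pad."""
    return [[1 - p, p], [q, 1 - q]]


def east_probabilities(p, q, steps):
    """Return [mu_0(e), ..., mu_steps(e)] for a frog started on the east pad."""
    P = frog_matrix(p, q)
    mu = [1.0, 0.0]
    history = [mu[0]]
    for _ in range(steps):
        mu = step(mu, P)
        history.append(mu[0])
    return history


p, q = 0.2, 0.1                      # the values used in panel (b) of Figure 1.2
pi_e = q / (p + q)                   # limiting probability of the east pad
history = east_probabilities(p, q, 20)
for t, mu_e in enumerate(history):
    # Closed form: Delta_t = (1 - p - q)^t * Delta_0 with Delta_0 = 1 - pi(e).
    closed_form = pi_e + (1 - p - q) ** t * (1.0 - pi_e)
    print(t, round(mu_e, 6), round(closed_form, 6))
```

The two printed columns should agree up to floating-point rounding, which is the content of the identity for Δt above.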
The computations we just did for a two-state chain generalize to any finite
Markov chain. In particular, the distribution at time t can be found by matrix
multiplication. Let (X0 , X1 , . . . ) be a finite Markov chain with state space X and
transition matrix P , and let the row vector μt be the distribution of Xt :
μt (x) = P{Xt = x} for all x ∈ X .
By conditioning on the possible predecessors of the (t + 1)-st state, we see that
$$\mu_{t+1}(y) = \sum_{x \in \mathcal{X}} \mathbf{P}\{X_t = x\}\, P(x, y) = \sum_{x \in \mathcal{X}} \mu_t(x) P(x, y) \quad \text{for all } y \in \mathcal{X}.$$

Rewriting this in vector form gives

μt+1 = μt P for t ≥ 0
and hence
$$\mu_t = \mu_0 P^t \quad \text{for } t \ge 0. \tag{1.10}$$
Since we will often consider Markov chains with the same transition matrix but
different starting distributions, we introduce the notation Pμ and Eμ for probabil-
ities and expectations given that μ0 = μ. Most often, the initial distribution will
be concentrated at a single definite starting state x. We denote this distribution
by δx :
$$\delta_x(y) = \begin{cases} 1 & \text{if } y = x, \\ 0 & \text{if } y \ne x. \end{cases}$$
We write simply Px and Ex for Pδx and Eδx , respectively.

These definitions and (1.10) together imply that

$$\mathbf{P}_x\{X_t = y\} = (\delta_x P^t)(y) = P^t(x, y).$$
That is, the probability of moving in t steps from x to y is given by the (x, y)-th
entry of $P^t$. We call these entries the t-step transition probabilities.
Notation. A probability distribution μ on X will be identified with a row
vector. For any event A ⊂ X , we write

$$\mu(A) = \sum_{x \in A} \mu(x).$$

For x ∈ X , the row of P indexed by x will be denoted by P (x, ·).

Remark 1.3. The way we constructed the matrix P has forced us to treat
distributions as row vectors. In general, if the chain has distribution μ at time t,
then it has distribution μP at time t + 1. Multiplying a row vector by P on the
right takes you from today’s distribution to tomorrow’s distribution.
What if we multiply a column vector f by P on the left? Think of f as a
function on the state space X . (For the frog of Example 1.1, we might take f (x)
to be the area of the lily pad x.) Consider the x-th entry of the resulting vector:
$$P f(x) = \sum_{y} P(x, y) f(y) = \sum_{y} f(y)\, \mathbf{P}_x\{X_1 = y\} = \mathbf{E}_x(f(X_1)).$$

That is, the x-th entry of P f tells us the expected value of the function f at
tomorrow’s state, given that we are at state x today.
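The two kinds of multiplication in Remark 1.3 can be compared side by side. This is a small sketch under stated assumptions: the coin biases p = 1/2, q = 1/4 and the lily pad areas 3 and 5 are made-up numbers, and the helper names are illustrative.

```python
from fractions import Fraction as F


def advance(mu, P):
    """Row vector times matrix: the distribution one step later, mu P."""
    n = len(P)
    return [sum(mu[x] * P[x][y] for x in range(n)) for y in range(n)]


def expect_next(P, f):
    """Matrix times column vector: (P f)(x) = E_x f(X_1)."""
    n = len(P)
    return [sum(P[x][y] * f[y] for y in range(n)) for x in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical numbers: p = 1/2, q = 1/4, lily pad areas 3 (east) and 5 (west).
P = [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]
area = [F(3), F(5)]
mu = [F(1), F(0)]                      # start on the east pad

print(advance(mu, P))                  # distribution of X_1
print(expect_next(P, area))            # expected area of tomorrow's pad, per starting pad
# Both groupings of the product mu P f give the expected area at time 1.
assert dot(advance(mu, P), area) == dot(mu, expect_next(P, area))
```

Exact rational arithmetic is used here only so that the final equality can be asserted without rounding concerns.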

1.2. Random Mapping Representation

We begin this section with an example.
Example 1.4 (Random walk on the n-cycle). Let X = Zn = {0, 1, . . . , n − 1},
the set of remainders modulo n. Consider the transition matrix

$$P(j, k) = \begin{cases} 1/2 & \text{if } k \equiv j + 1 \pmod{n}, \\ 1/2 & \text{if } k \equiv j - 1 \pmod{n}, \\ 0 & \text{otherwise.} \end{cases} \tag{1.11}$$
The associated Markov chain (Xt ) is called random walk on the n-cycle. The
states can be envisioned as equally spaced dots arranged in a circle (see Figure 1.3).

Figure 1.3. Random walk on Z10 is periodic, since every step goes from an even state to an odd state, or vice versa. Random walk on Z9 is aperiodic.

Rather than writing down the transition matrix in (1.11), this chain can be
specified simply in words: at each step, a coin is tossed. If the coin lands heads up,
the walk moves one step clockwise. If the coin lands tails up, the walk moves one
step counterclockwise.
More precisely, suppose that Z is a random variable which is equally likely to
take on the values −1 and +1. If the current state of the chain is j ∈ Zn , then the
next state is j + Z mod n. For any k ∈ Zn ,
P{(j + Z) mod n = k} = P (j, k).
In other words, the distribution of (j + Z) mod n equals P (j, ·).
A random mapping representation of a transition matrix P on state space
X is a function f : X ×Λ → X , along with a Λ-valued random variable Z, satisfying
P{f (x, Z) = y} = P (x, y).
The reader should check that if Z1 , Z2 , . . . is a sequence of independent random
variables, each having the same distribution as Z, and the random variable X0
has distribution μ and is independent of (Zt )t≥1 , then the sequence (X0 , X1 , . . . )
defined by
Xn = f (Xn−1 , Zn ) for n ≥ 1
is a Markov chain with transition matrix P and initial distribution μ.
For the example of the simple random walk on the cycle, setting Λ = {1, −1},
each Zi uniform on Λ, and f (x, z) = x + z mod n yields a random mapping representation.

Proposition 1.5. Every transition matrix on a finite state space has a random
mapping representation.

Proof. Let P be the transition matrix of a Markov chain with state space
X = {x1 , . . . , xn }. Take Λ = [0, 1]; our auxiliary random variables Z, Z1 , Z2 , . . .
will be uniformly chosen in this interval. Set $F_{j,k} = \sum_{i=1}^{k} P(x_j, x_i)$ and define
$$f(x_j, z) := x_k \quad \text{when } F_{j,k-1} < z \le F_{j,k}.$$
We have
$$\mathbf{P}\{f(x_j, Z) = x_k\} = \mathbf{P}\{F_{j,k-1} < Z \le F_{j,k}\} = P(x_j, x_k).$$
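The construction in this proof is essentially the standard way to simulate a finite chain from uniform random numbers. The sketch below is one possible rendering, with assumptions stated plainly: states are the indices 0, ..., n − 1, the names `make_random_map` and `simulate` are illustrative, and the two-state matrix is a made-up example.

```python
import random
from itertools import accumulate


def make_random_map(P):
    """Build f(j, z) as in the proof of Proposition 1.5; states are indices 0..n-1."""
    partial_sums = [list(accumulate(row)) for row in P]    # F_{j,k} for k = 1..n

    def f(j, z):
        last_positive = None
        for k, F_jk in enumerate(partial_sums[j]):
            if P[j][k] > 0:              # skip empty intervals (F_{j,k-1} = F_{j,k})
                last_positive = k
                if z <= F_jk:
                    return k
        # Reached only if round-off leaves the final partial sum slightly below z.
        return last_positive

    return f


def simulate(P, x0, steps, rng):
    """Generate X_0, ..., X_steps via X_n = f(X_{n-1}, Z_n) with Z_n uniform on [0, 1)."""
    f = make_random_map(P)
    path = [x0]
    for _ in range(steps):
        path.append(f(path[-1], rng.random()))
    return path


P = [[0.5, 0.5], [0.25, 0.75]]           # hypothetical two-state chain
print(simulate(P, 0, 10, random.Random(2017)))
```

The guard for zero-probability states matters only on the null event Z = 0 and for floating-point round-off; it does not change the distribution of f(x_j, Z).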

Note that, unlike transition matrices, random mapping representations are far
from unique. For instance, replacing the function f (x, z) in the proof of Proposition
1.5 with f (x, 1 − z) yields a different representation of the same transition matrix.
Random mapping representations are crucial for simulating large chains. They
can also be the most convenient way to describe a chain. We will often give rules for
how a chain proceeds from state to state, using some extra randomness to determine
where to go next; such discussions are implicit random mapping representations.
Finally, random mapping representations provide a way to coordinate two (or more)
chain trajectories, as we can simply use the same sequence of auxiliary random
variables to determine updates. This technique will be exploited in Chapter 5, on
coupling Markov chain trajectories, and elsewhere.
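To illustrate how a shared sequence of auxiliary variables coordinates two trajectories, here is a minimal sketch for random walk on the n-cycle using f(x, z) = x + z mod n. The function names and the starting points 0 and 3 are illustrative assumptions. With this particular f the two walks keep a constant offset, so the example shows only the mechanics of shared randomness, not a coupling that brings the chains together.

```python
import random


def cycle_step(n):
    """Random mapping f(x, z) = (x + z) mod n for random walk on the n-cycle."""
    return lambda x, z: (x + z) % n


def coupled_paths(f, x0, y0, steps, rng):
    """Drive two trajectories with one shared sequence Z_1, Z_2, ... uniform on {1, -1}."""
    xs, ys = [x0], [y0]
    for _ in range(steps):
        z = rng.choice([1, -1])          # a single update variable used by both chains
        xs.append(f(xs[-1], z))
        ys.append(f(ys[-1], z))
    return xs, ys


n = 10
xs, ys = coupled_paths(cycle_step(n), 0, 3, 25, random.Random(5))
# Each path alone is a random walk on the cycle; together they keep a fixed offset.
print([(y - x) % n for x, y in zip(xs, ys)])
```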

1.3. Irreducibility and Aperiodicity

We now make note of two simple properties possessed by most interesting
chains. Both will turn out to be necessary for the Convergence Theorem (The-
orem 4.9) to be true.
A chain P is called irreducible if for any two states x, y ∈ X there exists
an integer t (possibly depending on x and y) such that $P^t(x, y) > 0$. This means
that it is possible to get from any state to any other state using only transitions of
positive probability. We will generally assume that the chains under discussion are
irreducible. (Checking that specific chains are irreducible can be quite interesting;
see, for instance, Section 2.6 and Example B.5. See Section 1.7 for a discussion of
all the ways in which a Markov chain can fail to be irreducible.)
Let $T(x) := \{t \ge 1 : P^t(x, x) > 0\}$ be the set of times when it is possible for
the chain to return to starting position x. The period of state x is defined to be
the greatest common divisor of T (x).
Lemma 1.6. If P is irreducible, then gcd T (x) = gcd T (y) for all x, y ∈ X .
Proof. Fix two states x and y. There exist non-negative integers r and ℓ such
that $P^r(x, y) > 0$ and $P^\ell(y, x) > 0$. Letting m = r + ℓ, we have m ∈ T (x) ∩ T (y) and
T (x) ⊂ T (y) − m, whence gcd T (y) divides all elements of T (x). We conclude that
gcd T (y) ≤ gcd T (x). By an entirely parallel argument, gcd T (x) ≤ gcd T (y). ∎
For an irreducible chain, the period of the chain is defined to be the period
which is common to all states. The chain will be called aperiodic if all states have
period 1. If a chain is not aperiodic, we call it periodic.
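The period of a state can be explored computationally by listing return times up to a cutoff. This is a hedged sketch: the cutoff `horizon` is an assumption of the example (the gcd of a truncated set of return times is only guaranteed to be a multiple of the true period), and the function names are illustrative.

```python
from math import gcd


def return_times(P, x, horizon):
    """Times t in 1..horizon with P^t(x, x) > 0, found by tracking reachable sets."""
    n = len(P)
    reachable = {x}                      # states reachable from x in exactly t steps
    times = []
    for t in range(1, horizon + 1):
        reachable = {y for w in reachable for y in range(n) if P[w][y] > 0}
        if x in reachable:
            times.append(t)
    return times


def period(P, x, horizon):
    """gcd of the return times seen up to the horizon (0 if none were seen)."""
    g = 0
    for t in return_times(P, x, horizon):
        g = gcd(g, t)
    return g


def cycle_walk(n):
    """Transition matrix (1.11) of random walk on the n-cycle."""
    P = [[0.0] * n for _ in range(n)]
    for j in range(n):
        P[j][(j + 1) % n] += 0.5
        P[j][(j - 1) % n] += 0.5
    return P


print(period(cycle_walk(10), 0, 40), period(cycle_walk(9), 0, 40))
```

For the 9-cycle the return times include both 2 (step out and back) and 9 (once around), whose gcd is 1; for the 10-cycle every return time is even.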
Proposition 1.7. If P is aperiodic and irreducible, then there is an integer r0
such that $P^r(x, y) > 0$ for all x, y ∈ X and r ≥ r0 .
Proof. We use the following number-theoretic fact: any set of non-negative
integers which is closed under addition and which has greatest common divisor 1
must contain all but finitely many of the non-negative integers. (See Lemma 1.30
in the Notes of this chapter for a proof.) For x ∈ X , recall that $T(x) = \{t \ge 1 : P^t(x, x) > 0\}$. Since the chain is aperiodic, the gcd of T (x) is 1. The set T (x)
is closed under addition: if s, t ∈ T (x), then $P^{s+t}(x, x) \ge P^s(x, x) P^t(x, x) > 0$,
and hence s + t ∈ T (x). Therefore there exists a t(x) such that t ≥ t(x) implies
t ∈ T (x). By irreducibility we know that for any y ∈ X there exists r = r(x, y)
such that $P^r(x, y) > 0$. Therefore, for t ≥ t(x) + r,
$$P^t(x, y) \ge P^{t-r}(x, x)\, P^r(x, y) > 0.$$
For $t \ge t'(x) := t(x) + \max_{y \in \mathcal{X}} r(x, y)$, we have $P^t(x, y) > 0$ for all y ∈ X . Finally,
if $t \ge \max_{x \in \mathcal{X}} t'(x)$, then $P^t(x, y) > 0$ for all x, y ∈ X . ∎
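For a concrete chain, the threshold in Proposition 1.7 can be located by direct search. The sketch below assumes a search cutoff `max_power` and uses illustrative function names. It returns the first r with all entries of P^r positive; since every row of P has a positive entry, positivity of P^r implies positivity of P^(r+1), so this first r serves as the smallest valid r0.

```python
def first_positive_power(P, max_power):
    """Smallest r <= max_power with P^r(x, y) > 0 for all x, y, or None if not found."""
    n = len(P)
    # support[x] holds the states reachable from x in exactly r steps.
    support = [{x} for x in range(n)]
    for r in range(1, max_power + 1):
        support = [
            {y for w in support[x] for y in range(n) if P[w][y] > 0}
            for x in range(n)
        ]
        if all(len(s) == n for s in support):
            return r
    return None


def cycle_walk(n):
    """Transition matrix (1.11) of random walk on the n-cycle."""
    P = [[0.0] * n for _ in range(n)]
    for j in range(n):
        P[j][(j + 1) % n] += 0.5
        P[j][(j - 1) % n] += 0.5
    return P


print(first_positive_power(cycle_walk(9), 50))    # aperiodic: some power is positive
print(first_positive_power(cycle_walk(10), 50))   # period two: no power is positive
```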
Suppose that a chain is irreducible with period two, e.g., the simple random
walk on a cycle of even length (see Figure 1.3). The state space X can be partitioned
into two classes, say even and odd , such that the chain makes transitions only
between states in complementary classes. (Exercise 1.6 examines chains with period b.)
Let P have period two, and suppose that x0 is an even state. The probability
distribution of the chain after 2t steps, $P^{2t}(x_0, \cdot)$, is supported on even states,
while the distribution of the chain after 2t + 1 steps is supported on odd states. It
is evident that we cannot expect the distribution $P^t(x_0, \cdot)$ to converge as t → ∞.

Fortunately, a simple modification can repair periodicity problems. Given an
arbitrary transition matrix P , let $Q = \frac{I + P}{2}$ (here I is the |X |×|X | identity matrix).
(One can imagine simulating Q as follows: at each time step, flip a fair coin. If it
comes up heads, take a step in P ; if tails, then stay at the current state.) Since
Q(x, x) > 0 for all x ∈ X , the transition matrix Q is aperiodic. We call Q a lazy
version of P . It will often be convenient to analyze lazy versions of chains.
Example 1.8 (The n-cycle, revisited). Recall random walk on the n-cycle,
defined in Example 1.4. For every n ≥ 1, random walk on the n-cycle is irreducible.
Random walk on any even-length cycle is periodic, since $\gcd\{t : P^t(x, x) > 0\} = 2$ (see Figure 1.3). Random walk on an odd-length cycle is aperiodic.
For n ≥ 3, the transition matrix Q for lazy random walk on the n-cycle is

$$Q(j, k) = \begin{cases} 1/4 & \text{if } k \equiv j + 1 \pmod{n}, \\ 1/2 & \text{if } k \equiv j \pmod{n}, \\ 1/4 & \text{if } k \equiv j - 1 \pmod{n}, \\ 0 & \text{otherwise.} \end{cases} \tag{1.12}$$
Lazy random walk on the n-cycle is both irreducible and aperiodic for every n.
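The lazy matrix of Example 1.8 can be produced mechanically from P. The following sketch uses exact rational arithmetic; the function names and the choice n = 6 are illustrative assumptions.

```python
from fractions import Fraction


def cycle_walk(n):
    """Transition matrix (1.11) of random walk on the n-cycle, with exact entries."""
    P = [[Fraction(0)] * n for _ in range(n)]
    for j in range(n):
        P[j][(j + 1) % n] += Fraction(1, 2)
        P[j][(j - 1) % n] += Fraction(1, 2)
    return P


def lazy(P):
    """Return Q = (I + P) / 2."""
    n = len(P)
    return [[(P[x][y] + (1 if x == y else 0)) / 2 for y in range(n)] for x in range(n)]


Q = lazy(cycle_walk(6))
print([str(q) for q in Q[0]])
assert all(Q[x][x] > 0 for x in range(6))       # holding probability at every state
assert all(sum(row) == 1 for row in Q)          # Q is still stochastic
```

For n ≥ 3 the first row should show 1/2 on the diagonal and 1/4 at each of the two neighbors, in agreement with (1.12).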
Remark 1.9. Establishing that a Markov chain is irreducible is not always
trivial; see Example B.5 and also Thurston (1990).

1.4. Random Walks on Graphs

Random walk on the n-cycle, which is shown in Figure 1.3, is a simple case of
an important type of Markov chain.
A graph G = (V, E) consists of a vertex set V and an edge set E, where
the elements of E are unordered pairs of vertices: $E \subset \{\{x, y\} : x, y \in V,\ x \ne y\}$.
We can think of V as a set of dots, where two dots x and y are joined by a line if
and only if {x, y} is an element of the edge set. When {x, y} ∈ E, we write x ∼ y
and say that y is a neighbor of x (and also that x is a neighbor of y). The degree
deg(x) of a vertex x is the number of neighbors of x.
Given a graph G = (V, E), we can define simple random walk on G to be
the Markov chain with state space V and transition matrix

$$P(x, y) = \begin{cases} \dfrac{1}{\deg(x)} & \text{if } y \sim x, \\ 0 & \text{otherwise.} \end{cases} \tag{1.13}$$
That is to say, when the chain is at vertex x, it examines all the neighbors of x,
picks one uniformly at random, and moves to the chosen vertex.
Example 1.10. Consider the graph G shown in Figure 1.4. The transition
matrix of simple random walk on G is
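$$P = \begin{pmatrix}
0 & \frac{1}{2} & \frac{1}{2} & 0 & 0 \\
\frac{1}{3} & 0 & \frac{1}{3} & \frac{1}{3} & 0 \\
\frac{1}{4} & \frac{1}{4} & 0 & \frac{1}{4} & \frac{1}{4} \\
0 & \frac{1}{2} & \frac{1}{2} & 0 & 0 \\
0 & 0 & 1 & 0 & 0
\end{pmatrix}.$$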

Figure 1.4. An example of a graph with vertex set {1, 2, 3, 4, 5} and 6 edges.
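The matrix of Example 1.10 can be rebuilt from an edge list using (1.13). In this sketch the edge list is an assumption read off from the nonzero entries of the matrix above (the figure itself is not reproduced here), and `simple_random_walk` is an illustrative name.

```python
from fractions import Fraction


def simple_random_walk(vertices, edges):
    """Transition matrix (1.13) as a dict of dicts; assumes every vertex has a neighbor."""
    neighbors = {v: set() for v in vertices}
    for x, y in edges:
        neighbors[x].add(y)
        neighbors[y].add(x)
    return {
        x: {y: Fraction(1, len(neighbors[x])) if y in neighbors[x] else Fraction(0)
            for y in vertices}
        for x in vertices
    }


# Edge list inferred from the nonzero entries of the matrix in Example 1.10.
vertices = [1, 2, 3, 4, 5]
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (3, 5)]
P = simple_random_walk(vertices, edges)
for x in vertices:
    print(x, [str(P[x][y]) for y in vertices])
```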

Remark 1.11. We have chosen a narrow definition of “graph” for simplicity.

It is sometimes useful to allow edges connecting a vertex to itself, called loops. It
is also sometimes useful to allow multiple edges connecting a single pair of vertices.
Loops and multiple edges both contribute to the degree of a vertex and are counted
as options when a simple random walk chooses a direction. See Section 6.5.1 for an example.
We will have much more to say about random walks on graphs throughout this
book, but especially in Chapter 9.

1.5. Stationary Distributions

1.5.1. Definition. We saw in Example 1.1 that a distribution π on X satisfying
$$\pi = \pi P \tag{1.14}$$
can have another interesting property: in that case, π was the long-term limiting
distribution of the chain. We call a probability π satisfying (1.14) a stationary
distribution of the Markov chain. Clearly, if π is a stationary distribution and
μ0 = π (i.e., the chain is started in a stationary distribution), then μt = π for all
t ≥ 0.
Note that we can also write (1.14) elementwise. An equivalent formulation is

$$\pi(y) = \sum_{x \in \mathcal{X}} \pi(x) P(x, y) \quad \text{for all } y \in \mathcal{X}. \tag{1.15}$$

Example 1.12. Consider simple random walk on a graph G = (V, E). For any
vertex y ∈ V ,
$$\sum_{x \in V} \deg(x) P(x, y) = \sum_{x \sim y} \frac{\deg(x)}{\deg(x)} = \deg(y). \tag{1.16}$$

To get a probability, we simply normalize by $\sum_{y \in V} \deg(y) = 2|E|$ (a fact the reader
should check). We conclude that the probability measure
$$\pi(y) = \frac{\deg(y)}{2|E|} \quad \text{for all } y \in \mathcal{X},$$
which is proportional to the degrees, is always a stationary distribution for the
walk. For the graph in Figure 1.4,
$$\pi = \left( \tfrac{2}{12}, \tfrac{3}{12}, \tfrac{4}{12}, \tfrac{2}{12}, \tfrac{1}{12} \right).$$

If G has the property that every vertex has the same degree d, we call G d-regular .
In this case, 2|E| = d|V | and the uniform distribution π(y) = 1/|V | for every y ∈ V
is stationary.
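The claim of Example 1.12 can be verified exactly for the graph of Figure 1.4. This is a sketch only: the edge list is an assumption inferred from the transition matrix in Example 1.10, and the helper `P` is an illustrative function rather than notation from the text.

```python
from fractions import Fraction

edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (3, 5)]   # assumed edges of Figure 1.4

degree = {}
for x, y in edges:
    degree[x] = degree.get(x, 0) + 1
    degree[y] = degree.get(y, 0) + 1

adjacent = {frozenset(e) for e in edges}


def P(x, y):
    """Simple random walk: 1/deg(x) if y ~ x, else 0."""
    return Fraction(1, degree[x]) if frozenset((x, y)) in adjacent else Fraction(0)


# Degrees sum to 2|E|, so this is a probability distribution.
pi = {v: Fraction(d, 2 * len(edges)) for v, d in degree.items()}

# Stationarity written elementwise, as in (1.15).
for y in degree:
    assert sum(pi[x] * P(x, y) for x in degree) == pi[y]
print({v: str(pi[v]) for v in sorted(pi)})
```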

A central goal of this chapter and of Chapter 4 is to prove a general yet precise
version of the statement that “finite Markov chains converge to their stationary
distributions.” Before we can analyze the time required to be close to stationar-
ity, we must be sure that it is finite! In this section we show that, under mild
restrictions, stationary distributions exist and are unique. Our strategy of building
a candidate distribution, then verifying that it has the necessary properties, may
seem cumbersome. However, the tools we construct here will be applied in many
other places. In Section 4.3, we will show that irreducible and aperiodic chains do,
in fact, converge to their stationary distributions in a precise sense.

1.5.2. Hitting and first return times. Throughout this section, we assume
that the Markov chain (X0 , X1 , . . . ) under discussion has finite state space X and
transition matrix P . For x ∈ X , define the hitting time for x to be

τx := min{t ≥ 0 : Xt = x},

the first time at which the chain visits state x. For situations where only a visit to
x at a positive time will do, we also define

τx+ := min{t ≥ 1 : Xt = x}.

When X0 = x, we call τx+ the first return time.

Lemma 1.13. For any states x and y of an irreducible chain, Ex (τy+ ) < ∞.

Proof. The definition of irreducibility implies that there exist an integer r > 0
and a real ε > 0 with the following property: for any states z, w ∈ X , there exists a
j ≤ r with $P^j(z, w) > \varepsilon$. Thus, for any value of Xt , the probability of hitting state
y at a time between t and t + r is at least ε. Hence for k > 0 we have

$$\mathbf{P}_x\{\tau_y^+ > kr\} \le (1 - \varepsilon)\, \mathbf{P}_x\{\tau_y^+ > (k - 1)r\}. \tag{1.17}$$

Repeated application of (1.17) yields

$$\mathbf{P}_x\{\tau_y^+ > kr\} \le (1 - \varepsilon)^k. \tag{1.18}$$

Recall that when Y is a non-negative integer-valued random variable, we have

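$$\mathbf{E}(Y) = \sum_{t \ge 0} \mathbf{P}\{Y > t\}.$$

Since $\mathbf{P}_x\{\tau_y^+ > t\}$ is a decreasing function of t, (1.18) suffices to bound all terms of
the corresponding expression for $\mathbf{E}_x(\tau_y^+)$:
$$\mathbf{E}_x(\tau_y^+) = \sum_{t \ge 0} \mathbf{P}_x\{\tau_y^+ > t\} \le \sum_{k \ge 0} r\, \mathbf{P}_x\{\tau_y^+ > kr\} \le r \sum_{k \ge 0} (1 - \varepsilon)^k < \infty.$$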

1.5.3. Existence of a stationary distribution. The Convergence Theorem

(Theorem 4.9 below) implies that the long-term fraction of time a finite irreducible
aperiodic Markov chain spends in each state coincides with the chain’s stationary
distribution. However, we have not yet demonstrated that stationary distributions exist.
We give an explicit construction of the stationary distribution π, which in
the irreducible case gives the useful identity $\pi(x) = [\mathbf{E}_x(\tau_x^+)]^{-1}$. We consider a
sojourn of the chain from some arbitrary state z back to z. Since visits to z break
up the trajectory of the chain into identically distributed segments, it should not
be surprising that the average fraction of time per segment spent in each state y
coincides with the long-term fraction of time spent in y.
Let z ∈ X be an arbitrary state of the Markov chain. We will closely examine
the average time the chain spends at each state in between visits to z. To this end,
we define

$$\tilde{\pi}(y) := \mathbf{E}_z(\text{number of visits to } y \text{ before returning to } z) = \sum_{t=0}^{\infty} \mathbf{P}_z\{X_t = y,\ \tau_z^+ > t\}. \tag{1.19}$$

Proposition 1.14. Let π̃ be the measure on X defined by (1.19).

(i) If $\mathbf{P}_z\{\tau_z^+ < \infty\} = 1$, then π̃ satisfies π̃P = π̃.
(ii) If $\mathbf{E}_z(\tau_z^+) < \infty$, then $\pi := \tilde{\pi} / \mathbf{E}_z(\tau_z^+)$ is a stationary distribution.

Remark 1.15. Recall that Lemma 1.13 shows that if P is irreducible, then
Ez (τz+ ) < ∞. We will show in Section 1.7 that the assumptions of (i) and (ii) are
always equivalent (Corollary 1.27) and there always exists z satisfying both.

Proof. For any state y, we have π̃(y) ≤ Ez τz+ . Hence Lemma 1.13 ensures
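that π̃(y) < ∞ for all y ∈ X . We check that π̃ is stationary, starting from the definition:

$$\sum_{x \in \mathcal{X}} \tilde{\pi}(x) P(x, y) = \sum_{x \in \mathcal{X}} \sum_{t=0}^{\infty} \mathbf{P}_z\{X_t = x,\ \tau_z^+ > t\}\, P(x, y). \tag{1.20}$$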

Because the event {τz+ ≥ t + 1} = {τz+ > t} is determined by X0 , . . . , Xt ,

Pz {Xt = x, Xt+1 = y, τz+ ≥ t + 1} = Pz {Xt = x, τz+ ≥ t + 1}P (x, y). (1.21)

Reversing the order of summation in (1.20) and using the identity (1.21) shows that


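$$\sum_{x \in \mathcal{X}} \tilde{\pi}(x) P(x, y) = \sum_{t=0}^{\infty} \mathbf{P}_z\{X_{t+1} = y,\ \tau_z^+ \ge t + 1\} = \sum_{t=1}^{\infty} \mathbf{P}_z\{X_t = y,\ \tau_z^+ \ge t\}. \tag{1.22}$$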

The expression in (1.22) is very similar to (1.19), so we are almost done. In fact,
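$$\begin{aligned}
\sum_{t=1}^{\infty} \mathbf{P}_z\{X_t = y,\ \tau_z^+ \ge t\}
&= \tilde{\pi}(y) - \mathbf{P}_z\{X_0 = y,\ \tau_z^+ > 0\} + \sum_{t=1}^{\infty} \mathbf{P}_z\{X_t = y,\ \tau_z^+ = t\} \\
&= \tilde{\pi}(y) - \mathbf{P}_z\{X_0 = y\} + \mathbf{P}_z\{X_{\tau_z^+} = y\} \qquad \text{(1.23)} \\
&= \tilde{\pi}(y). \qquad \text{(1.24)}
\end{aligned}$$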
The equality (1.24) follows by considering two cases:
y = z: Since X0 = z and $X_{\tau_z^+} = z$, the last two terms of (1.23) are both 1, and
they cancel each other out.
y ≠ z: Here both terms of (1.23) are 0.
Therefore, combining (1.22) with (1.24) shows that π̃ = π̃P .
Finally, to get a probability measure, we normalize by $\sum_x \tilde{\pi}(x) = \mathbf{E}_z(\tau_z^+)$:
$$\pi(x) = \frac{\tilde{\pi}(x)}{\mathbf{E}_z(\tau_z^+)} \quad \text{satisfies } \pi = \pi P. \tag{1.25}$$
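As a sanity check on this construction (a worked instance added for illustration, not part of the proof), take the frog chain of Example 1.1 with z = e and assume q > 0. Started from e, the only visit to e before the first return is at time 0, and the frog is at w at time t without having returned exactly when it jumped at the first step and stayed on the west pad for the next t − 1 steps:

$$\begin{aligned}
\tilde{\pi}(e) &= 1, \\
\tilde{\pi}(w) &= \sum_{t=1}^{\infty} \mathbf{P}_e\{X_1 = \cdots = X_t = w\} = \sum_{t=1}^{\infty} p(1 - q)^{t-1} = \frac{p}{q}, \\
\mathbf{E}_e(\tau_e^+) &= \tilde{\pi}(e) + \tilde{\pi}(w) = \frac{p + q}{q}, \\
\pi &= \frac{\tilde{\pi}}{\mathbf{E}_e(\tau_e^+)} = \left( \frac{q}{p + q},\ \frac{p}{p + q} \right).
\end{aligned}$$

This agrees with the stationary distribution found for the frog in Example 1.1, and one can confirm directly that $(1,\ p/q)\,P = (1,\ p/q)$ for the matrix P in (1.2).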

The computation at the heart of the proof of Proposition 1.14 can be gen-
eralized; see Lemma 10.5. Informally speaking, a stopping time τ for (Xt ) is a
$\{0, 1, \ldots\} \cup \{\infty\}$-valued random variable such that, for each t, the event {τ = t}
is determined by X0 , . . . , Xt . (Stopping times are defined precisely in Section 6.2.)
If a stopping time τ replaces τz+ in the definition (1.19) of π̃, then the proof that
π̃ satisfies π̃ = π̃P works, provided that τ satisfies both Pz {τ < ∞} = 1 and
Pz {Xτ = z} = 1.
1.5.4. Uniqueness of the stationary distribution. Earlier in this chapter
we pointed out the difference between multiplying a row vector by P on the right
and a column vector by P on the left: the former advances a distribution by one
step of the chain, while the latter gives the expectation of a function on states, one
step of the chain later. We call distributions invariant under right multiplication by
P stationary . What about functions that are invariant under left multiplication?
Call a function h : X → R harmonic at x if

$$h(x) = \sum_{y \in \mathcal{X}} P(x, y) h(y). \tag{1.26}$$

A function is harmonic on D ⊂ X if it is harmonic at every state x ∈ D. If h is

regarded as a column vector, then a function which is harmonic on all of X satisfies
the matrix equation P h = h.
Lemma 1.16. Suppose that P is irreducible. A function h which is harmonic
at every point of X is constant.
Proof. Since X is finite, there must be a state x0 such that h(x0 ) = M is
maximal. If for some state z such that P (x0 , z) > 0 we have h(z) < M , then

$$h(x_0) = P(x_0, z) h(z) + \sum_{y \ne z} P(x_0, y) h(y) < M, \tag{1.27}$$

a contradiction. It follows that h(z) = M for all states z such that P (x0 , z) > 0.

For any y ∈ X , irreducibility implies that there is a sequence x0 , x1 , . . . , xn = y

with P (xi , xi+1 ) > 0. Repeating the argument above tells us that h(y) = h(xn−1 ) =
· · · = h(x0 ) = M . Thus, h is constant. 

Corollary 1.17. Let P be the transition matrix of an irreducible Markov

chain. There exists a unique probability distribution π satisfying π = πP .
Proof. By Proposition 1.14 there exists at least one such measure. Lemma
1.16 implies that the kernel of P − I has dimension 1, so the column rank of P − I is
|X |−1. Since the row rank of any matrix is equal to its column rank, the row-vector
equation ν = νP also has a one-dimensional space of solutions. This space contains
only one vector whose entries sum to 1. 
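The linear-algebra argument suggests a direct way to compute π for a small irreducible chain: solve ν(P − I) = 0 together with the normalization that the entries sum to 1. The sketch below is one possible implementation using exact Gauss-Jordan elimination; the function name is illustrative, and it assumes P is irreducible so that the system has a unique solution (otherwise the pivot search may fail).

```python
from fractions import Fraction


def stationary_distribution(P):
    """Solve pi P = pi, sum(pi) = 1 exactly; assumes P is irreducible."""
    n = len(P)
    # Rows 0..n-2: stationarity equations for columns y = 0..n-2 (one equation of
    # pi (P - I) = 0 is redundant because the rows of P sum to 1).
    A = [[Fraction(P[x][y]) - (1 if x == y else 0) for x in range(n)] + [Fraction(0)]
         for y in range(n - 1)]
    A.append([Fraction(1)] * n + [Fraction(1)])      # normalization row
    for col in range(n):
        # Raises StopIteration if no pivot exists (system not uniquely solvable).
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [a / pivot_value for a in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[r][n] for r in range(n)]


F = Fraction
graph_walk = [                       # simple random walk of Example 1.10
    [0, F(1, 2), F(1, 2), 0, 0],
    [F(1, 3), 0, F(1, 3), F(1, 3), 0],
    [F(1, 4), F(1, 4), 0, F(1, 4), F(1, 4)],
    [0, F(1, 2), F(1, 2), 0, 0],
    [0, 0, 1, 0, 0],
]
print([str(v) for v in stationary_distribution(graph_walk)])
```

For the walk of Example 1.10 the output should be proportional to the degrees, matching Example 1.12.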

Remark 1.18. Another proof of Corollary 1.17 follows from the Convergence
Theorem (Theorem 4.9, proved below). Another simple direct proof is suggested in
Exercise 1.11.
Proposition 1.19. If P is an irreducible transition matrix and π is the unique
probability distribution solving π = πP , then for all states z,
$$\pi(z) = \frac{1}{\mathbf{E}_z \tau_z^+}. \tag{1.28}$$
Proof. Let π̃z (y) equal π̃(y) as defined in (1.19), and write πz (y) = π̃z (y)/Ez τz+ .
Proposition 1.14 implies that πz is a stationary distribution, so πz = π. Therefore,
$$\pi(z) = \pi_z(z) = \frac{\tilde{\pi}_z(z)}{\mathbf{E}_z \tau_z^+} = \frac{1}{\mathbf{E}_z \tau_z^+}.$$
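Identity (1.28) can be checked numerically by computing the expected return time from the tail-sum formula used in the proof of Lemma 1.13. This is a rough sketch under stated assumptions: the tolerance `tol` and step cap `max_steps` are arbitrary choices, the function name is illustrative, and the chain is the simple random walk of Example 1.10 with vertices 1..5 stored as indices 0..4.

```python
def expected_return_time(P, z, tol=1e-13, max_steps=100_000):
    """Approximate E_z(tau_z^+) as the sum over t >= 0 of P_z{tau_z^+ > t}."""
    n = len(P)
    # mass[y] = P_z{X_t = y, tau_z^+ > t}; at t = 0 this is the point mass at z.
    mass = [1.0 if x == z else 0.0 for x in range(n)]
    total = 0.0
    for _ in range(max_steps):
        tail = sum(mass)                 # P_z{tau_z^+ > t}
        total += tail
        if tail < tol:
            break
        # One more step; paths arriving at z have returned and are dropped.
        mass = [0.0 if y == z else sum(mass[x] * P[x][y] for x in range(n))
                for y in range(n)]
    return total


P = [
    [0, 1 / 2, 1 / 2, 0, 0],
    [1 / 3, 0, 1 / 3, 1 / 3, 0],
    [1 / 4, 1 / 4, 0, 1 / 4, 1 / 4],
    [0, 1 / 2, 1 / 2, 0, 0],
    [0, 0, 1, 0, 0],
]
pi = [2 / 12, 3 / 12, 4 / 12, 2 / 12, 1 / 12]    # from Example 1.12
for z in range(5):
    print(z + 1, round(expected_return_time(P, z), 6), round(1 / pi[z], 6))
```

The two printed columns should agree to the displayed precision, since the expected return time to z is 1/π(z) = 2|E|/deg(z) for this walk.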

1.6. Reversibility and Time Reversals

Suppose a probability distribution π on X satisfies
π(x)P (x, y) = π(y)P (y, x) for all x, y ∈ X . (1.29)
The equations (1.29) are called the detailed balance equations.
Proposition 1.20. Let P be the transition matrix of a Markov chain with
state space X . Any distribution π satisfying the detailed balance equations (1.29)
is stationary for P .
Proof. Sum both sides of (1.29) over all y:
$$\sum_{y \in \mathcal{X}} \pi(y) P(y, x) = \sum_{y \in \mathcal{X}} \pi(x) P(x, y) = \pi(x),$$

since P is stochastic. 
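Proposition 1.20 gives a sufficient condition only, which a short computation can illustrate. In this sketch the helper names, the frog parameters p = 1/5 and q = 1/10, and the biased walk on the 3-cycle are all choices made for the example: the frog chain satisfies detailed balance, while the biased walk has the uniform distribution as a stationary distribution without satisfying (1.29).

```python
from fractions import Fraction as F


def detailed_balance_holds(pi, P):
    """Check pi(x) P(x, y) == pi(y) P(y, x) for all pairs, as in (1.29)."""
    n = len(P)
    return all(pi[x] * P[x][y] == pi[y] * P[y][x] for x in range(n) for y in range(n))


def is_stationary(pi, P):
    """Check pi P == pi."""
    n = len(P)
    return all(sum(pi[x] * P[x][y] for x in range(n)) == pi[y] for y in range(n))


# Frog chain with hypothetical coin biases p = 1/5 and q = 1/10.
p, q = F(1, 5), F(1, 10)
frog = [[1 - p, p], [q, 1 - q]]
frog_pi = [q / (p + q), p / (p + q)]

# Biased walk on the 3-cycle: clockwise with probability 2/3, else counterclockwise.
biased = [[0, F(2, 3), F(1, 3)], [F(1, 3), 0, F(2, 3)], [F(2, 3), F(1, 3), 0]]
uniform = [F(1, 3)] * 3

print(detailed_balance_holds(frog_pi, frog), is_stationary(frog_pi, frog))
print(detailed_balance_holds(uniform, biased), is_stationary(uniform, biased))
```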

Checking detailed balance is often the simplest way to verify that a particular
distribution is stationary. Furthermore, when (1.29) holds,
π(x0 )P (x0 , x1 ) · · · P (xn−1 , xn ) = π(xn )P (xn , xn−1 ) · · · P (x1 , x0 ). (1.30)
We can rewrite (1.30) in the following suggestive form:
Pπ {X0 = x0 , . . . , Xn = xn } = Pπ {X0 = xn , X1 = xn−1 , . . . , Xn = x0 }. (1.31)

