
ESE 650: Learning in Robotics

Spring 2023

Instructor
Pratik Chaudhari [email protected]

Teaching Assistants
Jianning Cui (cuijn)
Swati Gupta (gswati)
Chris Hsu (chsu8)
Gaurav Kuppa (gakuppa)
Alice Kate Li (alicekl)
Pankti Parekh (pankti81)
Aditya Singh (adiprs)
Haoxiang You (youhaox)

April 19, 2023


Contents

1 What is Robotics?
1.1 Perception-Learning-Control
1.2 Goals of this course
1.3 Some of my favorite robots

2 Introduction to State Estimation
2.1 A review of probability
2.1.1 Random variables
2.2 Using Bayes rule for combining evidence
2.2.1 Coherence of Bayes rule
2.3 Markov Chains
2.4 Hidden Markov Models (HMMs)
2.4.1 The forward algorithm
2.4.2 The backward algorithm
2.4.3 Bayes filter
2.4.4 Smoothing
2.4.5 Prediction
2.4.6 Decoding: Viterbi's Algorithm
2.4.7 Shortest path on a Trellis graph
2.5 Learning an HMM from observations

3 Kalman Filter and its variants
3.1 Background
3.2 Linear state estimation
3.2.1 One-dimensional Gaussian random variables
3.2.2 General case
3.2.3 Incorporating Gaussian observations of a state
3.2.4 An example
3.3 Background on linear and nonlinear dynamical systems
3.3.1 Linear systems
3.3.2 Linear Time-Invariant (LTI) systems
3.3.3 Nonlinear systems
3.4 Markov Decision Processes (MDPs)
3.4.1 Back to Hidden Markov Models
3.5 Kalman Filter (KF)
3.5.1 Step 0: Observing that the state estimate at any timestep should be a Gaussian
3.5.2 Step 1: Propagating the dynamics by one timestep
3.5.3 Step 2: Incorporating the observation
3.5.4 Discussion
3.6 Extended-Kalman Filter (EKF)
3.6.1 Propagation of statistics through a nonlinear transformation
3.6.2 Extended Kalman Filter
3.7 Unscented Kalman Filter (UKF)
3.7.1 Unscented Transform
3.7.2 The UT with tuning parameters
3.7.3 Unscented Kalman Filter (UKF)
3.7.4 UKF vs. EKF
3.8 Particle Filters (PFs)
3.8.1 Importance sampling
3.8.2 Resampling particles to make the weights equal
3.8.3 Particle filtering: the algorithm
3.8.4 Example: Localization using particle filter
3.8.5 Theoretical insight into particle filtering
3.9 Discussion

4 Rigid-body transforms and mapping
4.1 Rigid-Body Transformations
4.1.1 3D transformations
4.1.2 Rodrigues' formula: an alternate view of rotations
4.2 Quaternions
4.3 Occupancy Grids
4.3.1 Estimating the map from the data
4.3.2 Sensor models
4.3.3 Back to sensor modeling
4.4 3D occupancy grids
4.5 Local Map
4.6 Discussion

5 Dynamic Programming
5.1 Formulating the optimal control problem
5.2 Dijkstra's algorithm
5.2.1 Dijkstra's algorithm in the backwards direction
5.3 Principle of Dynamic Programming
5.3.1 Q-factor
5.4 Stochastic dynamic programming: Value Iteration
5.4.1 Infinite-horizon problems
5.4.2 Dynamic programming for infinite-horizon problems
5.4.3 An example
5.4.4 Some theoretical results on value iteration
5.5 Stochastic dynamic programming: Policy Iteration
5.5.1 An example

6 Linear Quadratic Regulator (LQR)
6.1 Discrete-time LQR
6.1.1 Solution of the discrete-time LQR problem
6.2 Hamilton-Jacobi-Bellman equation
6.2.1 Infinite-horizon HJB
6.2.2 Solving the HJB equation
6.2.3 Continuous-time LQR
6.3 Stochastic LQR
6.4 Linear Quadratic Gaussian (LQG)
6.4.1 (Optional material) The duality between the Kalman Filter and LQR
6.5 Iterative LQR (iLQR)
6.5.1 Iterative LQR (iLQR)

7 Imitation Learning
7.1 A crash course in supervised learning
7.1.1 Fitting a machine learning model
7.1.2 Deep Neural Networks
7.2 Behavior Cloning
7.2.1 Behavior cloning with a stochastic controller
7.2.2 KL-divergence form of Behavior Cloning
7.2.3 Some remarks on Behavior Cloning
7.3 DAgger: Dataset Aggregation

8 Policy Gradient Methods
8.1 Standard problem setup in RL
8.2 Cross-Entropy Method (CEM)
8.2.1 Some remarks on sample complexity of simulation-based methods
8.3 The Policy Gradient
8.3.1 Reducing the variance of the policy gradient
8.4 An alternative expression for the policy gradient
8.4.1 Implementing the new expression
8.5 Actor-Critic methods
8.5.1 Advantage function
8.6 Discussion

9 Q-Learning
9.1 Tabular Q-Learning
9.1.1 How to perform exploration in Q-Learning
9.2 Function approximation (Deep Q Networks)
9.2.1 Embellishments to Q-Learning
9.3 Q-Learning for continuous control spaces

10 Model-based Reinforcement Learning
10.1 Learning a model of the dynamics
10.2 Some model-based methods
10.2.1 Bagging multiple models of the dynamics
10.2.2 Model-based RL in the latent space

11 Offline Reinforcement Learning
11.1 Why is offline reinforcement learning difficult?
11.2 Regularized Bellman iteration
11.2.1 Changing the fixed point of the Bellman iteration to be more conservative
11.2.2 Estimating the uncertainty of the value function

12 Meta-Learning
12.1 Problem formulation for image classification
12.1.1 Fine-tuning
12.1.2 Prototypical networks
12.1.3 Model-agnostic meta-learning (MAML)
12.2 Problem formulation for meta-RL
12.2.1 A context variable
12.2.2 Discussion

Bibliography
Chapter 1

What is Robotics?

Reading
1. Computing machinery and intelligence, Turing (2009)

2. Thrun Chapter 1

3. Barfoot Chapter 1

The word robot was first used by the Czech writer Karel Capek in a play named "Rossum's Universal Robots", in which the owner of the company, Mr. Rossum, builds robots, i.e., agents who do forced labor, effectively artificial humans. The word was popularized by Isaac Asimov in one of his short stories named Liar!. It is about a robot named RB-34 which, through a manufacturing fault, happens to be able to read the minds of humans around it. Around 1942, Isaac Asimov began using the word robotics in his writings. This is also when he introduced the Three Laws of Robotics as the theme for how robots would interact with others in his stories/books. These are as follows.

1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.

2. A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.

3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

Asimov would go on to base his stories on the counter-intuitive ways in which robots could apply these laws. In this case, RB-34 adheres to the First Law and, in order to not hurt the feelings of humans and make them happy, it deliberately lies to them. It tells the robopsychologist Susan Calvin that one of her co-workers is infatuated with her. However, when she confronts RB-34 later by pointing out that lying to people can end up hurting them, the robot experiences a logical conflict within its laws and becomes unresponsive.

This is, after all, science fiction, but these laws give us insight into what robots are. Let's see what modern roboticists have to say.

"Robotics is the science of perceiving and manipulating the physical world through computer-controlled mechanical devices." — Sebastian Thrun in Probabilistic Robotics

"EVERYTHING comes together in the field of robotics. The design of an autonomous robot involves: the choice of the mechanical platform, the choice of actuators, the choice of sensors, the choice of the energy source, the choices of algorithms (perception, planning, and control). Each of these subproblems corresponds to a discipline in itself, with its design trade-offs of achievable performance vs limited resources." — Andrea Censi in Censi (2016).

I find the Third Law really insightful for understanding intelligence as well. Let us define intelligence as the ability of an organism to survive¹. We will all agree that trees are less intelligent than animals, an ant is less intelligent than a dog, which is less intelligent than a human. A program like AlphaGo is not very intelligent because you can disable it by simply switching it off. A key indicator of intelligence is the ability to sense possible harm and take actions to change the outcome.

Robotics is Embodied Artificial Intelligence.

A robot is a machine that senses its environment using sensors, interacts with this environment using actuators to perform a given task, and does so efficiently using previous experience of performing similar tasks.

We will cover the fundamentals of these three aspects of robotics: perception, planning and learning.

1.1 Perception-Learning-Control

Perception refers to the sensory mechanisms used to gain information about the environment (eyes, ears, tactile input, etc.). Action refers to your hands, legs, or the motors/engines in machines that help you move on the basis of this information. Learning is the glue in between: it helps crunch the information from your sensors quickly, compares it with past data, guesses what future data may look like, and computes actions that are likely to succeed. The three facets of intelligence are not sequential and robotics is not merely a feed-forward process. Your sensory inputs depend on the previous action you took.

¹Feel free to come up with another definition.

1.2 Goals of this course

The goal of this course is to develop the main ideas in robotic perception, learning and control. Robotics is everything, so we will focus on understanding how these pieces are combined together to build a typical robot. After this course, we expect you to be able to choose among the different robotics algorithms to perform a particular task, think critically about these algorithms, and build new ones.

Other courses Some other courses at Penn that address various aspects of robotics are

• Perception: CIS 580, CIS 581, CIS 680

• Learning: CIS 520, CIS 521, CIS 522, CIS 620, CIS 700, ESE 545, ESE 546

• Control: ESE 650, MEAM 520, MEAM 620, ESE 500, ESE 505, ESE 619

1.3 Some of my favorite robots

These videos should give you an idea of what the everyday life of a roboticist looks like: Kiva's robots, Waymo's 360 experience, Boston Dynamics' Spot, the JPL-MIT team at the DARPA Sub-T Challenge, Romeo and Juliet at Ferrari's factory, Anki's Vector, and the DARPA Humanoid Challenge.
Chapter 2

Introduction to State Estimation

Reading
1. Barfoot, Chapter 2.1-2.2

2. Thrun, Chapter 2

3. Russell Chapter 15.1-15.3

2.1 A review of probability

Probability is a very useful construct to reason about real systems which we cannot model at all scales. It is a fundamental part of robotics. No matter how sophisticated your camera, it will have noise in how it measures the real world around it. No matter how good your model for a motor is, there will be unmodeled effects which make it move a little differently than you expect. We begin with a quick review of probability; you can read more at many sources, e.g., MIT's OCW.

An experiment is a procedure which can be repeated infinitely and has a well-defined set of possible outcomes, e.g., the toss of a coin or the roll of a die. The outcome itself need not always be deterministic, e.g., depending upon your experiment, the coin may come up heads or tails. We call the set Ω the sample space; it is the set of all possible outcomes of an experiment. For two coins, this set would be

Ω = {HH, HT, TH, TT}.

We want to pick this set at the right granularity to answer relevant questions, e.g., it is correct but not very useful for Ω to be the positions of all the molecules in the coin. After every experiment, in this case tossing the two coins once each, we obtain an event, a subset A ⊆ Ω of the sample space, e.g.,

A = {HH}.
Probability theory is a mathematical framework that allows us to reason about phenomena or experiments whose outcome is uncertain. The probability of an event,

P(A),

is a function that maps each event A to a number between 0 and 1: the closer this number is to 1, the stronger our belief that the outcome of the experiment is going to be A.

Axioms Probability is formalized using a set of three basic axioms that are intuitive and yet very powerful. They are known as Kolmogorov's axioms:

• Non-negativity: P(A) ≥ 0

• Normalization: P(Ω) = 1

• Additivity: If two events A, B are such that A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B).

You can use these axioms to show things like P(∅) = 0, P(Aᶜ) = 1 − P(A), or that if A ⊆ B then P(A) ≤ P(B).

Conditioning on events Conditioning helps us answer questions like

P(A | B) := probability of A given that B occurred.

Effectively, the sample space has now shrunk from Ω to the event B. It would be silly to have a null sample space, so let us say that P(B) ≠ 0. We define conditional probability as

P(A | B) = P(A ∩ B) / P(B);     (2.1)

the probability is undefined if P(B) = 0. Using this definition, we can compute the probability of events like "what is the probability of rolling a 2 on a die given that an even number was rolled".

We can use this trick to get the law of total probability (think of it as partitioning the sample space): if a finite number of events {Ai} form a partition of Ω, i.e.,

Ai ∩ Aj = ∅ for all i ≠ j, and ∪i Ai = Ω,

then

P(B) = Σi P(B | Ai) P(Ai).     (2.2)

Bayes' rule Imagine that instead of someone telling us that the conditioning event actually happened, we simply had a belief

P(Ai)

about the possibility of such events {Ai}. For each Ai, we can compute the conditional probability P(B | Ai) using (2.1). Say we run our experiment and observe that B occurred; how would our belief about the events Ai change? In other words, we wish to compute

P(Ai | B).

This is the subject of Bayes' rule:

P(Ai | B) = P(Ai ∩ B) / P(B)
          = P(Ai) P(B | Ai) / P(B)     (2.3)
          = P(Ai) P(B | Ai) / Σj P(Aj) P(B | Aj).

Bayes' rule naturally leads to the concept of independent events. Two events A, B ⊆ Ω are independent if observing one does not give us any information about the other:

P(A ∩ B) = P(A) P(B).     (2.4)

This is different from disjoint events. Disjoint events never co-occur, i.e., observing one tells us that the other one did not occur.
P(A ∩ B) = P(A) P(B). (2.4)

11 This is different from disjoint events. Disjoint events never co-occur, i.e.,
12 observing one tells us that the other one did not occur.

13 Probability for experiments with real-valued outcomes We need some


14 more work in defining probability for events with real-valued outcomes.
15 The sample space is easy enough to understand, e.g., Ω = [0, 1] for your
16 score at the end of this course. We however run into difficulties if we
17 define the probability of general subsets of Ω in terms of the probabilities
18 of elementary outcomes (elements of Ω). For instance, if we wish to
19 model all elements ω ∈ Ω to be equally likely, we are forced to assign each
20 element ω a probability of zero (to be consistent with the second axiom of
21 probability). This is not very helpful in determining the probability of the
22 score being 0.9. If you instead assigned some small non-zero number to
23 P(ωi ), then we have undesirable conclusions such as

P({1, 1/2, 1/3, . . .}) = ∞.

24 The way to fix this is to avoid defining the probability of a set in terms
25 of the probability of elementary outcomes and work with more general
26 sets. While we would ideally like to be able to specify the probability of
27 every subset of Ω, it turns out that we cannot do so in a mathematically
28 consistent way. The trick then is to work with a smaller object known as a
29 σ-algebra, that is the set of “nice” subsets of Ω.
12

1 Given a sample space Ω, a σ-algebra F (also called a σ-field) is a


2 collection of subsets of Ω such that

3 • ∅∈F

4 • If A ∈ F, then Ac ∈ F.

5 • If Ai ∈ F for every i ∈ N, then ∪∞


i=1 Ai ∈ F.

6 In short, σ-algebra is a collection of subsets of Ω that is closed under com-


7 plement and countable unions. The pair (Ω, F), also called a measurable
8 space, is now used to define probability of events. A set A that belongs to
9 F is called an event. The probability measure

P : F → [0, 1].

10 assigns a probability to events in F. We cannot take F to be too small,


11 e.g., elements of F = {∅, Ω} are easy to construct our P but are not very
12 useful. For technical reasons, the σ-algebra cannot be too large; notice
13 that we used this concept to avoid considering every subset of the sample
14 space F = 2Ω . Modern probability is defined using a Borel σ-algebra.
15 Roughly speaking, this is an F that is just large enough to do interesting
16 things but small enough that mathematical technicalities do not occur.

2.1.1 Random variables

A random variable is an assignment of a value to every possible outcome. Mathematically, in our new language of a measurable space, a random variable is a function

X : Ω → R

such that the set {ω : X(ω) ≤ c} is F-measurable for every number c ∈ R. This is equivalent to saying that every preimage of the Borel σ-algebra on the reals B(R) is in F. A statement X(ω) = x = 5 means that the outcome of our experiment happens to be ω ∈ Ω where the realized value of the random variable is a particular number x equal to 5.

Random variables are typically denoted using capital letters X, Y, Z, although we will be sloppy and not always do so in this course to avoid complicated notation. The distinction between a random variable and the value that it takes will be clear from context.

We can now define functions of random variables, e.g., if X is a random variable, the function Y = X³(ω) for every ω ∈ Ω, or Y = X³ for short, is a new random variable. (Let us check that Y satisfies our definition of a random variable: the set {ω : Y(ω) ≤ c} = {ω : X(ω) ≤ c^(1/3)}, which lies in F because X is a random variable.) An indicator random variable is special. If A ⊂ Ω, let I_A : Ω → {0, 1} be the indicator function of this set A, i.e., I_A(ω) = 1 if ω ∈ A and zero otherwise. If our set A ∈ F, then I_A is an indicator random variable. The function I_A is not a random variable if A ∉ F, but this is, as we said in the previous section, a mathematical corner case; most subsets of Ω belong to F.

Probability mass functions The probability law, or probability distribution, of a random variable X is denoted by

pX(x) := P(X = x) = P({ω ∈ Ω : X(ω) = x}).

We denote a probability distribution using a lower-case p. It is a function of the realized value x in the range of the random variable, with pX(x) ≥ 0 (probabilities are non-negative) and Σx pX(x) = 1 if X takes on a discrete number of values. For instance, if X is the number of coin tosses until the first head, and we assume that our tosses are independent with P(H) = p > 0, then we have

pX(k) = P(X = k) = P(TT···TH) = (1 − p)^(k−1) p

for all k = 1, 2, . . .. This is what is called a geometric probability mass function.

Cumulative distribution function A cumulative distribution function (CDF) is the probability of a random variable X taking a value less than a particular x ∈ R, i.e.,

FX(x) = P(X ≤ x).

(Figure: the CDF of a geometric random variable for different values of p.)

The CDF FX(x) is a non-decreasing function of x. It converges to zero as x → −∞ and goes to 1 as x → ∞. Note that CDFs need not be continuous: in the case of a geometric random variable, since the values that X takes belong to the set of integers, the CDF is constant between any two integers.

Probability density functions A continuous random variable, i.e., one that takes values in R, is described by a probability density function. If FX(x) is the CDF of an r.v. X and X takes values in R, the probability density function (PDF) fX(x) (sometimes also denoted by pX(x)) is defined via

P(a ≤ X ≤ b) = ∫ₐᵇ fX(x) dx.

We also have the following relationship between the CDF and the PDF; the former is the integral of the latter:

P(−∞ ≤ X ≤ x) = FX(x) = ∫_{−∞}^{x} fX(x') dx'.

This leads to the following interpretation of the probability density function:

P(x ≤ X ≤ x + δ) ≈ fX(x) δ.

Expectation and Variance The expected value of a random variable X is

E[X] = Σx x pX(x)

and denotes the center of gravity of the probability mass function. Roughly speaking, it is the average of a large number of repetitions of the same experiment. Expectation is linear, i.e.,

E[aX + b] = a E[X] + b

for any constants a, b. For two independent random variables X, Y we have

E[XY] = E[X] E[Y].

We can also compute the expected value of any function g(X) using the same formula

E[g(X)] = Σx g(x) pX(x).

In particular, if g(x) = x², we have the second moment E[X²]. The variance is defined to be

Var(X) = E[(X − E[X])²]
       = Σx (x − E[X])² pX(x)
       = E[X²] − (E[X])².

The variance is always non-negative, Var(X) ≥ 0. For an affine function of X, we have

Var(aX + b) = a² Var(X).

For continuous-valued random variables, the expectation is defined as

E[X] = ∫_{−∞}^{∞} x pX(x) dx;

the definition of the variance remains the same.

Joint distributions We often wish to think of the joint probability distribution of multiple random variables, say the location of an autonomous car in all three dimensions. The cumulative distribution function associated with this is

F_{X,Y,Z}(x, y, z) = P(X ≤ x, Y ≤ y, Z ≤ z).

Just like we have the probability density of a single random variable, we can also write the joint probability density of multiple random variables, f_{X,Y,Z}(x, y, z). In this case we have

F_{X,Y,Z}(x, y, z) = ∫_{−∞}^{x} ∫_{−∞}^{y} ∫_{−∞}^{z} f_{X,Y,Z}(x', y', z') dz' dy' dx'.

The joint probability density factorizes if the two random variables are independent:

f_{X,Y}(x, y) = fX(x) fY(y) for all x, y.

Two random variables are uncorrelated if and only if

E[XY] = E[X] E[Y].

Note that while independence implies uncorrelatedness, the two are not equivalent. The covariance is defined as

Cov(X, Y) = E[XY] − E[X] E[Y].

Conditioning As we saw before, for a single random variable X we have

P(x ≤ X ≤ x + δ) ≈ fX(x) δ.

For two random variables, by analogy we would like

P(x ≤ X ≤ x + δ | Y ≈ y) ≈ f_{X|Y}(x | y) δ.

The conditional probability density function of X given Y is defined to be

f_{X|Y}(x | y) = f_{X,Y}(x, y) / fY(y)   if fY(y) > 0.

For any given y, the conditional PDF is a normalized section of the joint PDF.

Continuous form of Bayes rule We can show using the definition of conditional probability that

f_{Y|X}(y | x) = f_{X|Y}(x | y) fY(y) / fX(x).     (2.5)

Similarly, we also have the law of total probability in the continuous form

fX(x) = ∫_{−∞}^{∞} f_{X|Y}(x | y) fY(y) dy.

2.2 Using Bayes rule for combining evidence

We now study a prototypical state estimation problem. Let us consider a robot that is trying to check whether the door to a room is open or not.

We will abstract each observation by the sensors of the robot as a random variable Y. This could be the image from its camera after running some algorithm to check the state of the door, the reading from a laser sensor (if the time-of-flight of the laser is very large then the door is open), or any other mechanism. We have two kinds of conditional probabilities in this problem:

P(open | Y) is a diagnostic quantity, while
P(Y | open) is a causal quantity.

The second one is called a causal quantity because the specific Y we observe depends upon whether the door is open or not. The first one is called a diagnostic quantity because using this observation Y we can infer the state of the environment, i.e., whether the door is open or not. Next imagine how you would calibrate the sensor in a lab: for each value of the state of the door (open, not open) you would record all the different observations Y received and calculate the conditional probabilities. The causal probability is much easier to calculate in this context; one may even use some knowledge of elementary physics to model the probability P(Y | open), or one may count the number of times the observation is Y = y for a given state of the door.

Bayes rule allows us to transform causal knowledge into diagnostic knowledge:

P(open | Y) = P(Y | open) P(open) / P(Y).

Remember that the left-hand side (diagnostic) is typically what we desire to calculate. Let us put some numbers into this formula. Let P(Y | open) = 0.6 and P(Y | not open) = 0.3. We will imagine that the door is open or closed with equal probability: P(open) = P(not open) = 0.5. We then have

P(open | Y) = P(Y | open) P(open) / P(Y)
            = P(Y | open) P(open) / [P(Y | open) P(open) + P(Y | not open) P(not open)]
            = (0.6 × 0.5) / (0.6 × 0.5 + 0.3 × 0.5) = 2/3.

Notice something very important: the original (prior) probability of the state of the door was 0.5. If we have a sensor that fires with higher likelihood if the door is open, i.e., if

P(Y | open) / P(Y | not open) > 1,

then the probability of the door being open after receiving an observation increases. If the likelihood ratio were less than 1, then observing a realization of Y would reduce our estimate of the probability of the door being open.

Combining evidence for Markov observations Say we updated the prior probability using our first observation Y1; let us take another observation Y2. How can we integrate this new observation? It is again an application of Bayes rule using two observations, or in general multiple observations Y1, . . . , Yn. (The denominator in Bayes rule, i.e., P(Y), is called the evidence in statistics.) Let us imagine this time that X = open. Then

P(X | Y1, . . . , Yn) = P(Yn | X, Y1, . . . , Yn−1) P(X | Y1, . . . , Yn−1) / P(Yn | Y1, . . . , Yn−1).

Let us make the very natural assumption that our observations from the sensor Y1, . . . , Yn are independent given the state of the door X. This is known as the Markov assumption. We now have

P(X | Y1, . . . , Yn) = P(Yn | X) P(X | Y1, . . . , Yn−1) / P(Yn | Y1, . . . , Yn−1)
                     = η P(Yn | X) P(X | Y1, . . . , Yn−1),

where

η⁻¹ = P(Yn | Y1, . . . , Yn−1)

is the denominator. We can now expand the diagnostic probability on the right-hand side recursively to get

P(X | Y1, . . . , Yn) = [∏ᵢ₌₁ⁿ ηi P(Yi | X)] P(X),     (2.6)

where ηi⁻¹ = P(Yi | Y1, . . . , Yi−1).

The calculation in (2.6) is very neat and you should always remember it. Given multiple observations Y1, . . . , Yn of the same quantity X, we can compute the conditional probability P(X | Y1, . . . , Yn) if we code up two functions to compute

• the causal probability (also called the likelihood of an observation) P(Yi | X), and

• the denominator ηi⁻¹.

Given these two functions, we can use the recursion to incorporate multiple observations. The same basic idea also holds if you have two quantities to estimate, e.g., X1 = open door and X2 = color of the door. The recursive application of Bayes rule lies at the heart of all state estimation methods.

Let us again put some numbers into these formulae. Imagine that the observation Y2 was taken using a different sensor which has

P(Y2 | open) = 0.5 and P(Y2 | not open) = 0.6.

We have from our previous calculation that P(open | Y1) = 2/3 and

P(open | Y1, Y2) = P(Y2 | open) P(open | Y1) / [P(Y2 | open) P(open | Y1) + P(Y2 | not open) P(not open | Y1)]
                 = (0.5 × 2/3) / (0.5 × 2/3 + 0.6 × 1/3) = 5/8 = 0.625.

Notice in this case that the probability that the door is open has been reduced from P(open | Y1) = 2/3.
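
The recursion (2.6) is easy to code up. Below is a minimal Python sketch (the function name and argument layout are our own choices, not from the notes) that reproduces the two numbers above, P(open | Y1) = 2/3 and P(open | Y1, Y2) = 0.625.

```python
def bayes_update(prior_open, lik_open, lik_not_open):
    """One step of the recursive Bayes update for the door example.

    prior_open   : current belief P(open | Y_1..Y_{i-1})
    lik_open     : likelihood P(Y_i | open) of the received observation
    lik_not_open : likelihood P(Y_i | not open)
    """
    evidence = lik_open * prior_open + lik_not_open * (1.0 - prior_open)
    return lik_open * prior_open / evidence

p = 0.5                          # prior P(open)
p = bayes_update(p, 0.6, 0.3)    # first sensor  -> 2/3
p = bayes_update(p, 0.5, 0.6)    # second sensor -> 0.625
```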

2.2.1 Coherence of Bayes rule

Would the probability change if we used sensor Y2 before using Y1? In this case, the answer is no and you are encouraged to perform this computation for yourselves. Bayes rule is coherent: it will give the same result regardless of the order of observations.

? Can you think of a situation where the order of incorporating observations matters?

The order of incorporating observations matters if the state of the world changes while we make observations, e.g., if we have a sensor that tracks the location of a car, the car presumably moves in between two observations and we would get the wrong answer if our question was "is there a car at this location".

As we motivated in the previous chapter, movement is quite fundamental to robotics and we are typically concerned with estimating the state of a dynamic world around us using our observations. We will next study the concept of a Markov Chain, which is a mathematical abstraction for the evolution of the state of the world.

2.3 Markov Chains

Consider the Whack-The-Mole game: a mole has burrowed a network of three holes x1, x2, x3 into the ground. It keeps going in and out of the holes and we are interested in finding which hole it will show up in next so that we can give it a nice whack.

This is an example of a Markov chain. There is a transition matrix T which determines the probability Tij of the mole resurfacing at a given hole xj given that it resurfaced at hole xi the last time. The matrix T^k is the k-step transition matrix,

T^k_{ij} = P(Xk = xj | X0 = xi).

You can see the animations at https://setosa.io/ev/markov-chains to build more intuition.

The key property of a Markov chain is that the next state Xk+1 is independent of all the past states X1, . . . , Xk−1 given the current state Xk:

Xk+1 ⊥⊥ X1, . . . , Xk−1 | Xk.

This is known as the Markov property, and all systems where we can define a "state" which governs their evolution have this property. Markov chains form a very broad class of systems. For example, all of Newtonian physics fits this assumption. What is the state of the following systems?

? Does a deterministic dynamical system, e.g., a simple pendulum, also satisfy the Markov assumption? What is the transition matrix in this case?

? Can you think of a system which does not have the Markov property?

Consider a paramecium. Its position depends upon a large number of factors: its own motion from the previous time-step but also the viscosity of the material in which it is floating around. One may model the state of the environment around the paramecium as a liquid whose molecules hit it thousands of times a second, essentially randomly, and cause disturbances in how the paramecium moves. Let us call this disturbance "noise in the dynamics". If the motion of the molecules of the liquid has some correlations (does it, usually?), this induces correlations in the position of the paramecium. The position of the organism is no longer Markov. This example is important to remember: the Markov property defined above also implies that the noise in the state transitions is independent across time.

Evolution of a Markov chain The probability of being in state xi at time k + 1 can be written as

P(Xk+1 = xi) = Σⱼ₌₁ᴺ P(Xk+1 = xi | Xk = xj) P(Xk = xj).

This equation governs how the probabilities P(Xk = xi) change with time k. Let us do the calculations for the Whack-The-Mole example. Say the mole was at hole x1 at the beginning, so the probability distribution of its presence,

π(k) = [P(Xk = x1), P(Xk = x2), P(Xk = x3)],

is such that

π(1) = [1, 0, 0]⊤.

We can now write the above formula as

π(k+1) = T′ π(k)     (2.7)

(we denote the transpose of the matrix T using the Matlab notation T′ instead of T⊤ for clarity) and compute the distribution π(k) for all times:

π(2) = T′ π(1) = [0.1, 0.4, 0.5]⊤;
π(3) = T′ π(2) = [0.17, 0.34, 0.49]⊤;
π(4) = T′ π(3) = [0.153, 0.362, 0.485]⊤;
. . .
π(∞) = lim_{k→∞} T′^k π(1) = [0.158, 0.355, 0.487]⊤.

The numbers P(Xk = xi) stop changing with time k. Under certain technical conditions (a single communicating class for a Markov chain with a finite number of states), the distribution π(∞) is unique. We can compute this invariant distribution by writing

π(∞) = T′ π(∞).

We can also compute the distribution π(∞) directly: the invariant distribution is the right-eigenvector of the matrix T′ corresponding to the eigenvalue 1.

? Do we always know that the transition matrix has an eigenvalue that is 1?

Example 2.1. Consider a Markov chain on two states where the transition matrix is given by

T = [0.5 0.5; 0.4 0.6].

The invariant distribution satisfies

π(1) = 0.5 π(1) + 0.4 π(2)
π(2) = 0.5 π(1) + 0.6 π(2).

These two equations are not independent (each reduces to 0.5 π(1) = 0.4 π(2)), so we also use the constraint that π is a probability distribution, i.e., π(1) + π(2) = 1. Solving for π(1), π(2) gives

π(1) = 4/9, π(2) = 5/9.
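
A quick way to check such a calculation numerically is to extract the eigenvector of T′ with eigenvalue 1; here is a minimal numpy sketch (our own helper function, not course-provided code).

```python
import numpy as np

def stationary_distribution(T):
    """Right-eigenvector of T' (eigenvalue 1), normalized to sum to one."""
    eigvals, eigvecs = np.linalg.eig(T.T)
    v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    return v / v.sum()

T = np.array([[0.5, 0.5],
              [0.4, 0.6]])
print(stationary_distribution(T))   # approximately [0.444, 0.556] = [4/9, 5/9]
```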

2.4 Hidden Markov Models (HMMs)²

Markov chains are a good model for how the state of the world evolves with time. We may not always know the exact state of these systems and only have sensors, e.g., cameras, LiDARs, and radars, to record observations. These sensors are typically noisy, so we model the observations as random variables.

Hidden Markov Models (HMMs) are an abstraction to reason about observations of the state of a Markov chain. An HMM is a sequence of random variables Y1, Y2, . . . , Yn such that the distribution of Yk only depends upon the hidden state Xk of the associated Markov chain.

²Parts of this section closely follow Emilio Frazzoli's course notes at https://ocw.mit.edu/courses/aeronautics-and-astronautics/16-410-principles-of-autonomy-and-decision-making-fall-2010/lecture-notes/MIT16_410F10_lec20.pdf and https://ocw.mit.edu/courses/aeronautics-and-astronautics/16-410-principles-of-autonomy-and-decision-making-fall-2010/lecture-notes/MIT16_410F10_lec21.pdf

Figure 2.1: A Hidden Markov Model with the underlying Markov chain; the observation at time k only depends upon the hidden state at that time instant. (Ignore the notation Z1, . . . , Zt; we will denote the observations by Yk.)

Notice that an HMM always has an underlying Markov chain behind it. For example, if we model the position of a car Xk as a Markov chain, our observation of the position at time k would be Yk. In our example of the robot sensing whether the door is open or closed using multiple observations across time, the Markov chain is trivial: it is simply the transition matrix P(not open | not open) = P(open | open) = 1. Just like Markov chains, HMMs are a very general class of mathematical models that allow us to think about multiple observations across time of a Markov chain.

Let us imagine that the observations of our HMM are also finite in number, e.g., your score in this course ∈ [0, 100], where the associated state of the Markov chain is your expertise in the subject matter. We will write a matrix of observation probabilities

Mij = P(Yk = yj | Xk = xi).     (2.8)

The matrix M has non-negative entries; after all, each entry is a probability. Since each state has to result in some observation, we also have

Σj Mij = 1.

The state transition probabilities of the associated Markov chain are

Tij = P(Xk+1 = xj | Xk = xi).

Given the abstraction of an HMM, we may be interested in solving a number of problems. We will consider the problem where the state Xk is the position of a car (which could be stationary or moving) and the observations Yk give us some estimate of the position.

1. Filtering: Given observations up to time k, compute the distribution of the state at time k,

P(Xk | Y1, . . . , Yk).

This is the most natural problem to understand: we want to find the probability of the car being at a location at time k given all previous observations. This is a temporally causal prediction, i.e., we are not using any information from the future to reason about the present.

2. Smoothing: Given observations up to time k, compute the distribution of the state at any time j < k,

P(Xj | Y1, . . . , Yk) for j < k.

The observation at a future time Yj+1 gives us some indication of where the car might have been at time j. In this case we are interested in using the entire set of observations, from the past, Y1, . . . , Yj, and from the future, Yj+1, . . . , Yk, to estimate the position of the car. Of course, this problem can only be solved ex post facto, i.e., after the time instant j. An important thing to remember is that in smoothing we are interested in the position of the car for all j < k.

3. Prediction: Given observations up to time k, compute the distribution of the state at a time j > k,

P(Xj | Y1, . . . , Yk) for j > k.

This is the case when we wish to make predictions about the state of the car at time j > k given only observations until time k. If we knew the underlying Markov chain for the HMM and its transition matrix T, this would amount to running (2.7) forward using the output of the filtering problem as the initial distribution of the state.

? Why is this true?

4. Decoding: Find the most likely state trajectory X1, . . . , Xk, i.e., the one that maximizes the probability

P(X1, . . . , Xk | Y1, . . . , Yk),

given observations Y1, . . . , Yk. Observe that the smoothing problem is essentially solved independently for all time-steps j < k. It stands to reason that if we knew a certain state (say the car made a right turn) was likely given observations at time k + 1, and that the traffic light was green at time k (given our observations of the traffic light), then we know that the car did not stop at the intersection at time k. The decoding problem allows us to reason about the joint probability of the states and outputs the most likely trajectory given all observations.

5. Likelihood of observations: Given the observation trajectory Y1, . . . , Yk, compute the probability

P(Y1, . . . , Yk).

As you may recall, this is the denominator that we need for the recursive application of Bayes rule. It is made difficult by the fact that we do not know the state trajectory X1, . . . , Xk corresponding to these observations.

These problems are closely related to each other and we will next dig deeper into them. We will first discuss two building blocks, called the forward and backward algorithms, that together help solve all the above problems.

2.4.1 The forward algorithm

Consider the problem of computing the likelihood of observations. We can certainly write

P(Y1, . . . , Yk)
  = Σ_{all (x1,...,xk)} P(Y1, . . . , Yk | X1, . . . , Xk) P(X1, . . . , Xk)
  = Σ_{all (x1,...,xk)} [∏ᵢ₌₁ᵏ P(Yi = yi | Xi = xi)] P(X1 = x1) ∏ᵢ₌₂ᵏ P(Xi = xi | Xi−1 = xi−1)
  = Σ_{all (x1,...,xk)} M_{x1 y1} M_{x2 y2} · · · M_{xk yk} π_{x1} T_{x1 x2} · · · T_{xk−1 xk}.

But this is a very large computation: for each possible trajectory (x1, . . . , xk) that the states could have taken, we need to multiply roughly 2k numbers together.

? How many possible state trajectories are there? What is the total cost of computing the likelihood of observations?

Forward algorithm We can simplify the above computation using the Markov property of the HMM as follows. We will define a quantity known as the forward variable

αk(x) = P(Y1, . . . , Yk, Xk = x),     (2.9)

where Y1, . . . , Yk is our observation sequence up to time k. Observe now that

1. We can initialize

α1(x) = πx Mx,y1 for all x.

2. For each time i = 1, . . . , k − 1, for all states x, we can compute

αi+1(x) = Mx,yi+1 Σ_{x′} αi(x′) Tx′x

using the law of total probability.

3. Finally, we have

P(Y1, . . . , Yk) = Σx αk(x)

by marginalizing over the state variable Xk.

This recursion in the forward algorithm is a powerful idea and is much faster than our naive summation above.

? What is the computational complexity of the forward algorithm?
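
Here is a minimal numpy sketch of the boxed recursion (the function name, the array layout, and the use of zero-indexed observation symbols are our own choices, not from the notes).

```python
import numpy as np

def forward(pi, T, M, y):
    """Forward algorithm: alpha[k, x] = P(Y_1..Y_k, X_k = x).

    pi : (N,)   initial distribution of the Markov chain
    T  : (N, N) transition matrix, T[i, j] = P(X_{k+1}=j | X_k=i)
    M  : (N, K) observation matrix, M[i, j] = P(Y_k=j | X_k=i)
    y  : list of observed symbols, integers in 0..K-1
    """
    N, t = len(pi), len(y)
    alpha = np.zeros((t, N))
    alpha[0] = pi * M[:, y[0]]                      # initialization
    for k in range(1, t):
        alpha[k] = M[:, y[k]] * (alpha[k - 1] @ T)  # recursion
    return alpha

# Likelihood of the whole observation sequence: alpha[-1].sum()
```
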
2.4.2 The backward algorithm

Just like the forward algorithm performs the computation recursively in the forward direction, we can also perform a backward recursion to obtain the probability of the observations. Let us imagine that we have an observation trajectory

Y1, . . . , Yt

up to some time t. We first define the so-called backward variables, which are the probability of a future trajectory given the state of the Markov chain at a particular time instant:

βk(x) = P(Yk+1, Yk+2, . . . , Yt | Xk = x).     (2.10)

Notice that the backward variables βk, with their conditioning on Xk = x, are slightly different from the forward variables αk, which are the joint probability of the observation trajectory and Xk = x.

Backward algorithm We can compute the variables βk(x) recursively as follows.

1. Initialize

βt(x) = 1 for all x.

This simply indicates that since we are at the end of the trajectory, the future trajectory Yt+1, . . . does not exist.

2. For all k = t − 1, t − 2, . . . , 1, for all x, update

βk(x) = Σ_{x′} βk+1(x′) Txx′ Mx′,yk+1.

3. We can now compute

P(Y1, . . . , Yt) = Σx β1(x) πx Mx,y1.

? What is the computational complexity of running the backward algorithm?
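
A matching numpy sketch of the backward recursion, under the same (assumed) conventions as the forward sketch above:

```python
import numpy as np

def backward(T, M, y):
    """Backward algorithm: beta[k, x] = P(Y_{k+1}..Y_t | X_k = x)."""
    N, t = T.shape[0], len(y)
    beta = np.ones((t, N))                            # beta_t(x) = 1
    for k in range(t - 2, -1, -1):
        beta[k] = T @ (M[:, y[k + 1]] * beta[k + 1])  # recursion
    return beta

# P(Y_1..Y_t) = (beta[0] * pi * M[:, y[0]]).sum()
```
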
Implementing the forward and backward algorithms in practice The update equations for both αk and βk can be written using a matrix-vector multiplication. We maintain the vectors

αk := [αk(x1), αk(x2), . . . , αk(xN)]
βk := [βk(x1), βk(x2), . . . , βk(xN)]

and can write the updates as

α⊤k+1 = M⊤·,yk+1 ⊙ (α⊤k T),

where ⊙ denotes the element-wise product and M·,yk+1 is the yk+1-th column of the matrix M. The update equation for the backward variables is

βk = T (βk+1 ⊙ M·,yk+1).

You must be careful about directly implementing these recursions, however: because we are iteratively multiplying by matrices T, M whose entries are all smaller than 1 (they are all probabilities after all), we can quickly run into difficulties where αk, βk become too small for some states and we get numerical underflow. You can implement these algorithms in log-space, by writing similar update equations for log αk and log βk, to avoid such numerical issues.
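
For instance, a log-space version of the forward recursion can be written with a log-sum-exp; a sketch under the same array conventions as before (assuming strictly positive entries so that the logarithms are finite):

```python
import numpy as np
from scipy.special import logsumexp

def forward_log(log_pi, log_T, log_M, y):
    """Forward recursion in log-space to avoid numerical underflow."""
    t, N = len(y), len(log_pi)
    log_alpha = np.zeros((t, N))
    log_alpha[0] = log_pi + log_M[:, y[0]]
    for k in range(1, t):
        # log sum_{x'} alpha_{k-1}(x') T_{x'x}, computed stably
        log_alpha[k] = log_M[:, y[k]] + logsumexp(
            log_alpha[k - 1][:, None] + log_T, axis=0)
    return log_alpha   # log P(Y_1..Y_t) = logsumexp(log_alpha[-1])
```
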

2.4.3 Bayes filter

Let us now use the forward and backward algorithms to solve the filtering problem. We want to compute

P(Xk = x | Y1, . . . , Yk)

for all states x in the Markov chain. We have that

P(Xk = x | Y1, . . . , Yk) = P(Xk = x, Y1, . . . , Yk) / P(Y1, . . . , Yk) = η αk(x),     (2.11)

where, since P(Xk = x | Y1, . . . , Yk) is a legitimate probability distribution on x, we have

η = (Σx αk(x))⁻¹.

As simple as that. In order to estimate the state at time k, we run the forward algorithm to update the variables αi(x) for i = 1, . . . , k. We can implement this using the matrix-vector multiplication in the previous section.

This is a commonly used algorithm known as the Bayes filter and is our first insight into state estimation.
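
Concretely, (2.11) says the filtering distribution is just the normalized forward variable. A small standalone sketch (again our own function, with the forward recursion normalized at every step, which leaves the filtering distributions unchanged):

```python
import numpy as np

def bayes_filter(pi, T, M, y):
    """Filtering distributions P(X_k | Y_1..Y_k) for every k."""
    p = pi * M[:, y[0]]
    out = [p / p.sum()]
    for obs in y[1:]:
        p = M[:, obs] * (T.T @ out[-1])   # propagate with T', weight by likelihood
        out.append(p / p.sum())           # normalize, i.e., multiply by eta
    return np.array(out)
```
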

An important fact Even if the filtering estimate is computed recursively, using each observation as it arrives, the estimate is actually the probability of the current state given all past observations:

P(Xk = x | Y1, . . . , Yk) ≠ P(Xk = x | Yk).

This is an extremely important concept to remember: in state estimation we are always interested in computing the state given all available observations. In the same context, is the following statement true?

P(Xk = x | Y1, . . . , Yk) = P(Xk = x | Yk, Xk−1)

2.4.4 Smoothing

Given observations up to time t, we would like to compute

P(Xk = x | Y1, . . . , Yt)

for all time instants k = 1, . . . , t. Observe that

P(Xk = x | Y1, . . . , Yt)
  = P(Xk = x, Y1, . . . , Yt) / P(Y1, . . . , Yt)
  = P(Xk = x, Y1, . . . , Yk, Yk+1, . . . , Yt) / P(Y1, . . . , Yt)
  = P(Yk+1, . . . , Yt | Xk = x, Y1, . . . , Yk) P(Xk = x, Y1, . . . , Yk) / P(Y1, . . . , Yt)
  = P(Yk+1, . . . , Yt | Xk = x) P(Xk = x, Y1, . . . , Yk) / P(Y1, . . . , Yt)
  = βk(x) αk(x) / P(Y1, . . . , Yt).     (2.12)

Study the first step carefully: the numerator is not equal to αk(x) because the observations go all the way to time t. The penultimate step uses both the Markov and the HMM properties: future observations Yk+1, . . . , Yt depend only upon future states Xk+1, . . . , Xt (HMM property), which are independent of the past observations and states given the current state Xk = x (Markov property).

? Both the filtering problem and the smoothing problem give us the probability of the state given observations. Discuss which one we should use in practice and why.

Smoothing can therefore be implemented by running the forward algorithm to update αk from k = 1, . . . , t and the backward algorithm to update βk from k = t, . . . , 1.

To see an example of smoothing in action, see ORB-SLAM 2. What do you think is the state of the Markov chain in this video?

Example for the Whack-the-mole problem Let us assume that we do not see which hole the mole surfaces from (say it is dark outside) but we can hear it. Our hearing is not very precise, so we have observation probabilities

M = [0.6 0.2 0.2; 0.2 0.6 0.2; 0.2 0.2 0.6].

Assume that the mole surfaces three times and we make the measurements

Y1 = 1, Y2 = 3, Y3 = 3.

We want to compute the distribution of the states the mole could be in at each time. Assume that we know that the mole was in hole 1 at the first step, i.e., π1 = (1, 0, 0) for the Markov chain, like we had in Section 2.3. Run the forward-backward algorithm and see that

α1 = (0.6, 0, 0), α2 = (0.012, 0.048, 0.18), α3 = (0.0041, 0.0226, 0.0641),

and

β3 = (1, 1, 1), β2 = (0.4, 0.44, 0.36), β1 = (0.1512, 0.1616, 0.1392).

Using these, we can now compute the filtering and the smoothing state distributions; let us denote them by πf and πs respectively:

π1f = (1, 0, 0), π2f = (0.05, 0.2, 0.75), π3f = (0.045, 0.2487, 0.7063)

and

π1s = (0.999, 0, 0), π2s = (0.0529, 0.2328, 0.7143), π3s = (0.045, 0.2487, 0.7063).

? Do you notice any pattern in the solutions returned by the filtering and the smoothing problems? Explain why that is the case.
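
These numbers can be reproduced with the forward and backward sketches from above. The transition matrix below is not printed in this text (it appeared in the figure in Section 2.3), so take it as our reconstruction from the distributions π(2), π(3), π(4) given there, i.e., an assumption rather than the official values.

```python
import numpy as np

T = np.array([[0.1, 0.4, 0.5],    # reconstructed whack-the-mole transitions
              [0.4, 0.0, 0.6],
              [0.0, 0.6, 0.4]])
M = np.array([[0.6, 0.2, 0.2],
              [0.2, 0.6, 0.2],
              [0.2, 0.2, 0.6]])
pi = np.array([1.0, 0.0, 0.0])
y = [0, 2, 2]                      # observations (1, 3, 3), zero-indexed

alpha = forward(pi, T, M, y)       # sketches defined earlier in this chapter
beta = backward(T, M, y)

filtering = alpha / alpha.sum(axis=1, keepdims=True)
smoothing = (alpha * beta) / (alpha * beta).sum(axis=1, keepdims=True)
# filtering[1] -> [0.05, 0.2, 0.75]; smoothing[1] -> [0.0529, 0.2328, 0.7143]
```
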

2.4.5 Prediction

We would like to compute the future probability of the state given observations up to some time:

P(Xk = x | Y1, . . . , Yt) for t < k.

Here is a typical scenario when you would need this estimate. Imagine that you are tracking the position of a car using images from your camera. You are using a deep network to detect the car in each image Yk and, since the neural network is quite slow, the car moves multiple time-steps forward before you get the next observation. As you can appreciate, we would compute a more accurate estimate of the conditional probability of Xk = x if we propagated the position of the car in between successive observations using our Markov chain. This is easy to do.

1. Compute the filtering estimate πtf = P(Xt = x | Y1, . . . , Yt) using the forward algorithm.

2. Propagate the Markov chain forward for k − t time-steps using πtf as the initial condition:

πi+1 = T′ πi.
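
In code, the two steps above amount to a handful of lines (a sketch, with our own function name):

```python
import numpy as np

def predict(pi_filtered, T, steps):
    """Propagate a filtering estimate forward with no new observations."""
    p = pi_filtered.copy()
    for _ in range(steps):
        p = T.T @ p            # pi_{i+1} = T' pi_i
    return p

# e.g., the mole's distribution two steps after the last measurement:
# predict(filtering[-1], T, 2)
```
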

2.4.6 Decoding: Viterbi's Algorithm

Both filtering and smoothing calculate the probability distribution of the state at time k. For instance, after recording a few observations, we can compute the probability distribution of the position of the car at each time instant. How do we get the most likely trajectory of the car? One option is to choose

X̂k = argmax_x P(Xk = x | Y1, . . . , Yt)

at each instant and output

(X̂1, . . . , X̂t)

as the answer. This is however only the point-wise best estimate of the state. This sequence may not be the most likely trajectory of the Markov chain underlying our HMM. In the decoding problem, we are interested in computing the most likely state trajectory, not the point-wise most likely sequence of states. Let us take the example of Whack-the-mole again; we will use a slightly different Markov chain for this example.

There are three states x1, x2, x3 with known initial distribution π = (1, 0, 0) and transition probabilities and observations given by matrices T, M respectively. Let us say that we only have two possible observation symbols {y2, y3} this time and get the observation sequence

(2, 3, 3, 2, 2, 2, 3, 2, 3)

from our sensor. Computing the filtering estimates and picking the most likely state at each instant, the point-wise most likely sequence of states is

(1, 3, 3, 3, 3, 2, 3, 2, 3).

Observe that this is not even feasible for the Markov chain: the transition from x3 → x2 is not possible, so this answer is clearly wrong. If we instead pick the point-wise most likely states from the smoothing estimates, the result is feasible:

(1, 2, 2, 2, 2, 2, 3, 3, 3).

Because the smoothing estimate at time k also takes into account the observations from the future t > k, it effectively eliminates the impossible transition from x3 → x2. This is still not, however, the most likely trajectory.

We will exploit the Markov property again to calculate the most likely state trajectory recursively. Let us define the "decoding variables" as

δk(x) = max_{(x1,...,xk−1)} P(X1 = x1, . . . , Xk−1 = xk−1, Xk = x, Y1, . . . , Yk);     (2.13)

this is the joint probability of the most likely state trajectory that ends at the state x at time k while generating observations Y1, . . . , Yk. We can now see that

δk+1(x) = max_{x′} δk(x′) Tx′x Mx,yk+1;     (2.14)

the joint probability that the most likely trajectory ends up at state x at time k + 1 is the maximum, among the joint probabilities that end up at any state x′ at time k, multiplied by the one-step state transition Tx′x and observation Mx,yk+1 probabilities. We would like to iterate upon this identity to find the most likely path. The key idea is to maintain a pointer to the parent state parentk(x) of the most likely trajectory, i.e., the state from which you could have reached Xk = x given observations. Let us see how.

Viterbi's algorithm First initialize

δ1(x) = πx Mx,y1
parent1(x) = null

for all states x. For all times k = 1, . . . , t − 1, for all states x, update

δk+1(x) = max_{x′} δk(x′) Tx′x Mx,yk+1
parentk+1(x) = argmax_{x′} (δk(x′) Tx′x).

The most likely final state is

x̂t = argmax_{x′} δt(x′)

and we can now backtrack using our parent pointers to find the most likely trajectory that leads to this state:

x̂k = parentk+1(x̂k+1).

The most likely trajectory given the observations is

x̂1, x̂2, . . . , x̂t

and the joint probability of this trajectory and all observations is

P(X1 = x̂1, . . . , Xt = x̂t, Y1 = y1, . . . , Yt = yt) = δt(x̂t).
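
A minimal numpy sketch of this procedure (our own function; in practice you would run it on log δk, as discussed below, to avoid underflow):

```python
import numpy as np

def viterbi(pi, T, M, y):
    """Most likely state trajectory given observations (decoding)."""
    N, t = len(pi), len(y)
    delta = np.zeros((t, N))
    parent = np.zeros((t, N), dtype=int)
    delta[0] = pi * M[:, y[0]]
    for k in range(1, t):
        scores = delta[k - 1][:, None] * T          # scores[x', x] = delta_k(x') T_{x'x}
        parent[k] = scores.argmax(axis=0)
        delta[k] = scores.max(axis=0) * M[:, y[k]]
    path = [int(delta[-1].argmax())]                # most likely final state
    for k in range(t - 1, 0, -1):                   # backtrack with parent pointers
        path.append(int(parent[k][path[-1]]))
    return path[::-1], delta[-1].max()
```
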

This is a very widely used algorithm, both in robotics and in other areas such as speech recognition (given audio, find the most likely sentence spoken by the person), wireless transmission and reception, and DNA analysis (e.g., the state of the Markov chain is the sequence ACTG. . . and our observations are functions of these states at periodic intervals). Its name comes from Andrew Viterbi, who developed the algorithm in the late 60s; he is one of the founders of Qualcomm Inc.

Here is how Viterbi's algorithm looks for our whack-the-mole example:

δ1 = (0.6, 0, 0), δ2 = (0.012, 0.048, 0.18), δ3 = (0.0038, 0.0216, 0.0432),
parent1 = (null, null, null), parent2 = (1, 1, 1), parent3 = (2, 3, 3).

The most likely path is the one that ends in state 3, with joint probability 0.0432. This path is (1, 3, 3).

Let us also compute Viterbi's algorithm for a longer observation sequence. Just like the Bayes filter, Viterbi's algorithm is typically implemented using log δk(x) to avoid numerical underflow. This is particularly important for Viterbi's algorithm: since δk(x) is the probability of an entire state and observation trajectory, it can get small very quickly for unlikely states (as we see in this example). The most likely trajectory is

(1, 3, 3, 3, 3, 3, 3, 3, 3).

Notice that if we had only 8 observations, the most likely trajectory would be

(1, 2, 2, 2, 2, 2, 2, 2, 2).

What is the computational complexity of Viterbi's algorithm? It is linear in the time-horizon t and quadratic in the number of states of the Markov chain. We are plucking out the most likely trajectory out of card(X)^t possible trajectories using the δk variables. Does this remind you of some other problem that you may have seen before?

2.4.7 Shortest path on a Trellis graph

You may have seen Dijkstra's algorithm before; it computes the shortest path to reach a node in a graph given the costs of traversing every edge.

Figure 2.2: A graph with costs assigned to every edge. Dijkstra's algorithm finds the shortest path in this graph between nodes A and B using dynamic programming.

In the case of Viterbi's algorithm, we are also interested in finding the most likely path. For example, we can write our joint probabilities as

P(X1, X2, X3 | Y1, Y2, Y3) = P(Y1 | X1) P(Y2 | X2) P(Y3 | X3) P(X1) P(X2 | X1) P(X3 | X2) / P(Y1, Y2, Y3)

⇒ log P(X1, X2, X3 | Y1, Y2, Y3) = log P(Y1 | X1) + log P(Y2 | X2) + log P(Y3 | X3)
   + log P(X1) + log P(X2 | X1) + log P(X3 | X2) − log P(Y1, Y2, Y3).

To find the most likely trajectory, we want to minimize − log P(X1, X2, X3 | Y1, Y2, Y3). The term log P(Y1, Y2, Y3) does not depend on X1, X2, X3 and is a constant as far as the most likely path given observations is concerned. We can now write down the "Trellis" graph as shown below.

Figure 2.3: A Trellis graph for a 3-state HMM for a sequence of three observations. Disregard the subscript x0.

Each edge carries either the negative log-probability of a transition of the Markov chain, or the negative log-probability of receiving the observation given a state. We create a dummy initial node A and a dummy terminal node B. The edge costs of the final three states, in this case sunny/cloudy/rainy, are zero. The costs from node A to the respective states are the negative log-probabilities of the initial state distribution. Dijkstra's algorithm, which we will study in Module 2 in more detail, now gives the shortest path on the Trellis graph. This approach is the same as that of Viterbi's algorithm: our parent pointers parentk(x) are the parent nodes in Dijkstra's algorithm, and our variables δk(x) correspond to the node costs in the Trellis graph maintained by Dijkstra's algorithm.
17 2.5 Learning an HMM from observations


18 In the previous sections, given an HMM that had an initial distribution π
19 for the Markov chain, a transition matrix T for the Markov chain and an
20 observation matrix M
λ = (π, T, M )
21 we computed various quantities such as

P(Y1 , . . . , Yt ; λ)
34

1 for an observation sequence Y1 , . . . , Yt of the HMM. Given an observation


2 sequence, we can also go back and update our HMM to make this
3 observation sequence more likely. This is the simplest instance of learning
4 an HMM. The prototypical problem to imagine that our original HMM λ
5 comes from is our knowledge of the original problem (say a physics model
6 of the dynamics of a robot and its sensors). Given more data, namely
7 the observations, we want to update this model. The most natural way to
8 update the model is to maximize the likelihood of observations given our
9 model, i.e.,
λ∗ = argmax_λ P(Y1, . . . , Yt ; λ).

10 This is known as maximum-likelihood estimation (MLE). In this section


11 we will look at the Baum-Welch algorithm which solves the MLE problem
12 iteratively. Given λ, it finds a new HMM λ′ = (π ′ , T ′ , M ′ ) (the ′ denotes
13 a new matrix, not the transpose here) such that

P(Y1 , . . . , Yt ; λ′ ) > P(Y1 , . . . , Yt ; λ).

14 Let us consider a simple problem. We are going to imagine that the


15 FBI is trying to catch the dangerous criminal Keyser Soze who is known
16 to travel between two cities Los Angeles (LA) which will be state x1 and
17 New York City (NY) which will be state x2 . The FBI initially have no clue
18 about his whereabouts, so their initial belief on his location is uniform
19 π = [0.5, 0.5]. His movements are modeled using a Markov chain
 
T = [0.5 0.5; 0.5 0.5],

20 e.g., if Soze is in LA, he is likely to stay in LA or go to NY with equal


21 probability. The FBI can make observations about him, they either observe
22 him to be in LA (y1 ), NY (y2 ) or do not observe anything at all (null, y3 ).
 
M = [0.4 0.1 0.5; 0.1 0.5 0.4].

23 Say that they received an observation sequence of 20 periods

(null, LA, LA, null, NY, null, NY, NY, NY, null, NY, NY, NY, NY, NY, null, null, LA, LA, NY).

24 Can we say something about the probability of Soze’s movements? At


25 each time k we can compute

γk (x) := P(Xk = x | Y1 , . . . , Yt )

26 the smoothing probability. We can also compute the most likely state
27 trajectory he could have taken given our observations using decoding. Let
28 us focus on the smoothing probabilities γk (x) as shown below.

2 The point-wise most likely sequence of states after doing so turns out to be

(LA, LA, LA, LA, NY, LA, NY, NY, NY, LA, NY, NY, NY, NY, NY, LA, LA, LA, LA, NY).

3 Notice how smoothing fills in the missing observations above.

Expected state visitation counts The next question we should ask is how we should update the model λ given this data. We are going to learn the entries of the state-transition matrix using

T′_{x,x′} = E[number of transitions from x to x′] / E[number of times the Markov chain was in state x].

What is the denominator? It is simply the sum of the probabilities that the Markov chain was at state x at times 1, 2, . . . , t − 1 given our observations, i.e.,

E[number of times the Markov chain was in state x] = ∑_{k=1}^{t−1} γk(x).

The numerator is given in a similar fashion. We will define a quantity

ξk(x, x′) := P(Xk = x, Xk+1 = x′ | Y1, . . . , Yt)
           = η αk(x) T_{x,x′} M_{x′, y_{k+1}} β_{k+1}(x′),          (2.15)

where η is a normalizing constant such that ∑_{x,x′} ξk(x, x′) = 1. Observe that ξk is the joint probability of Xk and Xk+1 given all the observations; it is not simply the transition probability times the smoothing probability:

ξk(x, x′) = P(Xk+1 = x′ | Xk = x, Y1, . . . , Yt) γk(x)
          ≠ T_{x,x′} γk(x) = P(Xk+1 = x′ | Xk = x) P(Xk = x | Y1, . . . , Yt).

? Derive the expression for ξk(x, x′) for yourself.

The expected number of transitions between states x and x′ is

E[number of transitions from x to x′] = ∑_{k=1}^{t−1} ξk(x, x′).

This gives us our new state transition matrix; you will see in the homework that it comes out to be

T′ = [0.47023 0.52976; 0.35260 0.64739].

This is a much better informed FBI than the one we had before beginning the problem, where the transition matrix was all 0.5s.
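As a concrete illustration, here is a rough sketch (Python/numpy, with hypothetical variable names, and without the scaling of α and β that a practical implementation would use to avoid underflow) of how the forward-backward quantities give the smoothing probabilities γk and the pairwise probabilities ξk, and how these yield the updated transition matrix; the updates of π′ and M′ described next follow the same pattern.

```python
import numpy as np

def baum_welch_T_update(pi, T, M, ys):
    """One Baum-Welch update of the transition matrix from one observation sequence ys."""
    t, n = len(ys), len(pi)
    alpha = np.zeros((t, n)); beta = np.zeros((t, n))
    alpha[0] = pi * M[:, ys[0]]
    for k in range(1, t):                      # forward pass
        alpha[k] = (alpha[k - 1] @ T) * M[:, ys[k]]
    beta[-1] = 1.0
    for k in range(t - 2, -1, -1):             # backward pass
        beta[k] = T @ (M[:, ys[k + 1]] * beta[k + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)  # smoothing probabilities gamma_k(x)
    xi_sum = np.zeros((n, n))
    for k in range(t - 1):
        # xi_k(x, x') ~ alpha_k(x) T[x, x'] M[x', y_{k+1}] beta_{k+1}(x'), cf. (2.15)
        xi = alpha[k][:, None] * T * (M[:, ys[k + 1]] * beta[k + 1])[None, :]
        xi_sum += xi / xi.sum()
    # expected transition counts divided by expected visitation counts
    T_new = xi_sum / gamma[:-1].sum(axis=0)[:, None]
    return T_new, gamma
```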

5 The new initial distribution What is the new initial distribution for
6 the HMM? Recall that we are trying to compute the best HMM given the
7 observations, so if the initial distribution was

π = P(X1 )

8 before receiving any observations from the HMM, it is now

π ′ = P(X1 | Y1 , . . . , Yt ) = γ1 (x);

9 the smoothing estimate at the first time-step.

Updating the observation matrix We can use the same logic as for the expected state visitation counts to write

M′_{x,y} = E[number of times in state x when the observation was y] / E[number of times the Markov chain was in state x]
        = ∑_{k=1}^{t} γk(x) 1{yk = y} / ∑_{k=1}^{t} γk(x).

You will see in your homework problem that this matrix comes out to be

M′ = [0.39024 0.20325 0.40650; 0.06779 0.706214 0.2259].

Notice how the observation probabilities for the null observation y3 have gone down; the Markov chain does not have a state corresponding to this observation.
The ability to start with a rudimentary model of the HMM and update it using observations is quite revolutionary. Baum et al. proved that each such iteration increases the likelihood of the observations in the paper Baum, Leonard E., et al. "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains." The Annals of Mathematical Statistics 41.1 (1970): 164-171. Discuss the following questions:

• When do we stop in our iterated application of the Baum-Welch algorithm?
• Are we always guaranteed to find the same HMM irrespective of our initial HMM?
• If our initial HMM λ is the same, are we guaranteed to find the same HMM λ′ across two different iterations of the Baum-Welch algorithm?
• How many observations should we use to update the HMM?
Chapter 3

Kalman Filter and its variants

Reading
1. Barfoot, Chapter 3, 4 for Kalman filter

2. Thrun, Chapter 3 for Kalman filter, Chapter 4 for particle filters

3. Russell Chapter 15.4 for Kalman filter

4 Hidden Markov Models (HMMs) which we discussed in the previous


5 chapter were a very general class of models. As a consequence algorithms
6 for filtering, smoothing and decoding that we prescribed for the HMM are
7 also very general. In this chapter we will consider the situation when we
8 have a little more information about our system. Instead of writing the
9 state transition and observation matrices as arbitrary matrices, we will use
10 the framework of linear dynamical systems to model them better. Since
11 we know the system a bit better, algorithms that we prescribe for these
12 models for solving filtering, smoothing and decoding will also be more
13 efficient. We will almost exclusively focus on the filtering problem in
14 this chapter. The other two, namely smoothing and decoding, can also
15 be solved easily using these ideas but are less commonly used for these
16 systems.

17 3.1 Background
18 Multi-variate random variables and linear algebra For d-dimensional
19 random variables X, Y ∈ Rd we have

E[X + Y ] = E[X] + E[Y ];


1 this is actually more surprising than it looks, it is true regardless of


2 whether X, Y are correlated. The covariance matrix of a random variable
3 is defined as

Cov(X) = E[(X − E[X]) (X − E[X])⊤ ];

4 we will usually denote this by Σ ∈ Rd×d . Note that the covariance matrix
5 is, by construction, symmetric and positive semi-definite. This means it
6 can be factorized as
Σ = U ΛU ⊤
7 where U ∈ Rd×d is an orthonormal matrix (i.e., U U ⊤ = I) and Λ is a
8 diagonal matrix with non-negative entries. The trace of a matrix is the
9 sum of its diagonal entries. It is also equal to the sum of its eigenvalues,
i.e.,

tr(Σ) = ∑_{i=1}^{d} Σii = ∑_{i=1}^{d} λi(Σ),

where λi(Σ) ≥ 0 is the i-th eigenvalue of the covariance matrix Σ. The
12 trace is a measure of the uncertainty in the multi-variate random variable
13 X, if X is a scalar and takes values in the reals then the covariance matrix
14 is also, of course, a scalar Σ = σ 2 .
15 A few more identities about the matrix trace that we will often use in
16 this chapter are as follows.
17 • For matrices A, B we have

tr(AB) = tr(BA);

18 the two matrices need not be square themselves, only their product
19 does.
• For A, B ∈ Rm×n

  tr(A⊤B) = tr(B⊤A) = ∑_{i=1}^{m} ∑_{j=1}^{n} Bij Aij.

21 This operation can be thought of as taking the inner product between


22 two matrices.

23 Gaussian/Normal distribution We will spend a lot of time working


with the Gaussian/Normal distribution. The multi-variate d-dimensional Normal distribution has the probability density

f(x) = (1/√det(2πΣ)) exp( −(1/2) (x − µ)⊤ Σ^{−1} (x − µ) ),

where µ ∈ Rd, Σ ∈ Rd×d denote the mean and covariance respectively.

? Why is it so ubiquitous?

You should commit this formula to memory. In particular remember that

∫_{x ∈ Rd} exp( −(1/2) (x − µ)⊤ Σ^{−1} (x − µ) ) dx = √det(2πΣ),

1 which is simply expressing the fact that the probability density function
2 integrates to 1.

Figure 3.1: Probability density (left) and iso-probability contours (right) of a


bi-variate Normal distribution. Warm colors denote regions of high probability.

3 Given two Gaussian rvs. X, Y ∈ Rd and Z = X + Y we have

E[Z] = E[X + Y ] = E[X] + E[Y ]

4 with covariance

Cov(Z) = ΣZ = ΣX + ΣY + ΣXY + ΣY X

5 where
Rd×d ∋ ΣXY = E[ (X − E[X]) (Y − E[Y])⊤ ];

the matrix ΣYX is defined similarly. If X, Y are independent (or uncorrelated) the covariance simplifies to

ΣZ = ΣX + ΣY .

8 If we have a linear function of a Gaussian random variable X given by


9 Y = AX for some deterministic matrix A then Y is also Gaussian with
10 mean
E[Y ] = E[AX] = A E[X] = AµX
11 and covariance

Cov(Y) = E[(AX − AµX)(AX − AµX)⊤]
       = E[A(X − µX)(X − µX)⊤ A⊤]
       = A E[(X − µX)(X − µX)⊤] A⊤          (3.1)
       = A ΣX A⊤.

12 This is an important result that you should remember.
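As a quick numerical sanity check of this fact, here is a small sketch (numpy; the particular A and ΣX are arbitrary choices made only for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.5], [0.0, 2.0]])        # an arbitrary linear map
Sigma_x = np.array([[2.0, 0.3], [0.3, 1.0]])  # an arbitrary covariance
X = rng.multivariate_normal(np.zeros(2), Sigma_x, size=200000)
Y = X @ A.T                                    # samples of Y = A X
print(np.cov(Y.T))                             # empirical covariance of Y
print(A @ Sigma_x @ A.T)                       # should closely match A Sigma_x A^T
```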

13 3.2 Linear state estimation


14 With that background, let us now look at the basic estimation problem.
15 Let X ∈ Rd denote the true state of a system. We would like to build an

estimator for this state, which we denote by

X̂.

2 An estimator is any quantity that indicates our belief of what X is. The
3 estimator is created on the basis of observations and we will therefore
4 model it as a random variable. We would like the estimator to be unbiased,
5 i.e.,
E[X̂] = X;
6 this expresses the concept that if we were to measure the state of the
7 system many times, say using many sensors or multiple observations from
8 the same sensor, the resultant estimator X̂ is correct on average. The error
9 in our belief is
X̃ = X̂ − X.
10 The error is zero-mean E[X̃] = 0 and its covariance ΣX̃ is called the
11 covariance of the estimator.

Optimally combining two estimators Let us now imagine that we have two estimators X̂1 and X̂2 for the same true state X (conditionally independent observations from one true state). We will assume that the two estimators were created independently (say from different sensors) and are therefore conditionally independent random variables given the true state X. Say both of them are unbiased but each of them has a certain covariance of the error,

ΣX̃1 and ΣX̃2.
18 We would like to combine the two to obtain a better estimate of what the
19 state could be. Better can mean many different quantities depending upon
20 the problem but in general in this course we are interested in improving
21 the error covariance. Our goal is then

Given two estimators X̂1 and X̂2 of the true state X combine
them to obtain a new estimator

X̂ = some function(X̂1 , X̂2 )

which has the best error covariance tr(ΣX̃ ).

22 3.2.1 One-dimensional Gaussian random variables


23 Consider the case when X̂1 , X̂2 ∈ R are Gaussian random variables with
24 means µ1 , µ2 and variances σ12 , σ22 respectively. Assume that both are
25 unbiased estimators of X ∈ R. Let us combine them linearly to obtain a
26 new estimator
X̂ = k1 X̂1 + k2 X̂2 .

1 How should we pick the coefficients k1 , k2 ? We would of course like the


2 new estimator to be unbiased, so

E[X̂] = E[k1 X̂1 + k2 X̂2 ] = (k1 + k2 )X = X


⇒ k1 + k2 = 1.

The variance of X̂ is

Var(X̂) = k1² σ1² + k2² σ2² = k1² σ1² + (1 − k1)² σ2².

The optimal k1 that leads to the smallest variance is thus given by

k1 = σ2² / (σ1² + σ2²);

we set the derivative of Var(X̂) with respect to k1 to zero to get this. The final estimator is

X̂ = σ2²/(σ1² + σ2²) X̂1 + σ1²/(σ1² + σ2²) X̂2.          (3.2)

It is unbiased of course and has variance

σX̃² = σ1² σ2² / (σ1² + σ2²).

Notice that since σ2²/(σ1² + σ2²) < 1, the variance of the new estimator is smaller than either of the original estimators. This is an important fact to remember: combining two estimators always results in a better estimator.
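A one-function sketch of this combination (Python; the estimates and variances passed in are placeholders):

```python
def fuse_scalar(x1, var1, x2, var2):
    """Minimum-variance unbiased combination of two scalar estimates, cf. (3.2)."""
    k1 = var2 / (var1 + var2)          # the less noisy estimate gets the larger weight
    x = k1 * x1 + (1 - k1) * x2
    var = var1 * var2 / (var1 + var2)  # always smaller than min(var1, var2)
    return x, var
```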

11 Some comments about the optimal combination.

12 • It is easy to see that if σ2 ≫ σ1 then the corresponding estimator,


13 namely X̂2 gets less weight in the combination. This is easy to
14 understand, if one of our estimates is very noisy, we should rely less
15 upon it to obtain the new estimate. In the limit that σ2 → ∞, the
16 second estimator is not considered at all in the combination.

17 • If σ1 = σ2 , the two estimators are weighted equally and since


2
18 σX̃ = σ12 /2 the variance reduces by half after combination.

19 • The minimal variance of the combined estimator is not zero. This


20 is easy to see because if we have two noisy estimates of the state,
21 combining them need not lead to us knowing the true state with
22 certainty.

23 3.2.2 General case


24 Let us now perform the same exercise for multi-variate Gaussian random
25 variables. We will again combine the two estimators linearly to get

X̂ = K1 X̂1 + K2 X̂2

1 where K1 , K2 ∈ Rd×d are matrices that we would like to choose. In order


2 for the estimator to be unbiased we again have the condition

E[X̂] = E[K1 X̂1 + K2 X̂2 ] = (K1 + K2 )X = X


⇒K1 + K2 = Id×d .

3 The covariance of X̂ is

ΣX̃ = K1 Σ1 K1⊤ + K2 Σ2 K2⊤


= K1 Σ1 K1⊤ + (I − K1 )Σ2 (I − K1 )⊤ .

Just like we minimized the variance in the scalar case, we will minimize the trace of this covariance matrix. We know that the original covariances Σ1 and Σ2 are symmetric. We will use the following identity for the partial derivative of the trace of a matrix product,

∂/∂A tr(ABA⊤) = 2AB,          (3.3)

for a symmetric matrix B. Minimizing tr(ΣX̃) with respect to K1 amounts to setting

∂/∂K1 tr(ΣX̃) = 0,

which yields

0 = K1 Σ1 − (I − K1)Σ2
⇒ K1 = Σ2(Σ1 + Σ2)^{−1} and K2 = Σ1(Σ1 + Σ2)^{−1}.

11 The optimal way to combine the two estimators is thus

X̂ = Σ2 (Σ1 + Σ2 )−1 X̂1 + Σ1 (Σ1 + Σ2 )−1 X̂2 . (3.4)

12 You should consider the similarities of this expression with the one for the
13 scalar case in (3.2). The same broad comments hold, i.e., if one of the
14 estimators has a very large variance, that estimator is weighted less in the
15 combination.

16 3.2.3 Incorporating Gaussian observations of a state


17 Let us now imagine that we have a sensor that can give us observations of
18 the state. The development in this section is analogous to our calculations
19 in Chapter 2 with the recursive application of Bayes rule or the observation
20 matrix of the HMM. We will consider a special type of sensor that gives
21 observations
Rp ∋ Y = CX + ν (3.5)
22 which is a linear function of the true state X ∈ Rd with the matrix
23 C ∈ Rp×d being something that is unique to the particular sensor. This
24 observation is not precise and we will model the sensor as having zero-
25 mean Gaussian noise
ν ∼ N (0, Q)

of covariance Q ∈ Rp×p. Notice something important here: the dimensionality of the observations need not be the same as the dimensionality of the state. This should not be surprising; after all, the number of observations in the HMM need not be the same as the number of states in the Markov chain.
6 We will solve the following problem. Given an existing estimator X̂ ′
7 we want to combine it with the observation Y to update the estimator to
8 X̂, in the best way, i.e., in a way that gives the minimal variance. We will
9 again use a linear combination

X̂ = K ′ X̂ ′ + KY.

10 Again we want the estimator to be unbiased, so we set

E[X̂] = E[K ′ X̂ ′ + KY ]
= K ′ X + K E[Y ]
= K ′ X + K E[CX + ν]
= K ′ X + KCX
= X.

to get that

I = K′ + KC
⇒ X̂ = (I − KC)X̂′ + KY          (3.6)
     = X̂′ + K(Y − C X̂′).

This is a special form which you will do well to remember. The old estimator X̂′ gets an additive term K(Y − C X̂′). For reasons that will soon become clear, we call this term the

innovation = Y − C X̂′.

15 Let us now optimize K as before to compute the estimator with minimal


16 variance. We will make the following important assumption in this case.

We will assume that the observation Y is independent of the esti-


mator X̂ ′ given X. This is a natural assumption because presumably
our original estimator X̂ ′ was created using past observations and the
present observation Y is therefore independent of it given the state
X.

17 The covariance of X̂ is

ΣX̃ = (I − KC) ΣX̃′ (I − KC)⊤ + K Q K⊤.

We optimize the trace of ΣX̃ with respect to K to get

0 = ∂/∂K tr(ΣX̃)
0 = −2(I − KC) ΣX̃′ C⊤ + 2KQ
⇒ ΣX̃′ C⊤ = K(C ΣX̃′ C⊤ + Q)
⇒ K = ΣX̃′ C⊤ (C ΣX̃′ C⊤ + Q)^{−1}.

The matrix K ∈ Rd×p is called the “Kalman gain” after Rudolf Kalman, who developed this method in the 1960s.

Kalman gain This is an important formula and it helps to have a


mnemonic and a slightly simpler notation to remember it by. If Σ′ is
the covariance of the previous estimator, Q is the covariance of the
zero-mean observation and C is the matrix that gives the observation
from the state, then the Kalman gain is

K = ΣX̃ ′ C ⊤ (CΣX̃ ′ C ⊤ + Q)−1 . (3.7)

and the new estimator for the state is

X̂ = X̂ ′ + K(Y − C X̂ ′ ).

The covariance of the updated estimator X̂ is given by

ΣX̃ = (I − KC) ΣX̃′ (I − KC)⊤ + K Q K⊤
    = ( ΣX̃′^{−1} + C⊤ Q^{−1} C )^{−1}.          (3.8)

If C = I, the Kalman gain is the same expression as the optimal


coefficient in (3.4). This should not be surprising because the
observation is an estimator for the state.
The second expression for ΣX̃ follows by substituting the value of the Kalman gain K. Yet another way of remembering this equation is to notice that

ΣX̃^{−1} = ΣX̃′^{−1} + C⊤ Q^{−1} C
K = ΣX̃ C⊤ Q^{−1}          (3.9)
X̂ = X̂′ + ΣX̃ C⊤ Q^{−1} (Y − C X̂′).

Derive these expressions for the Kalman


gain and the covariance yourself.
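A minimal sketch of this update as a function (numpy; the arguments are the prior estimate and its covariance, the observation y, and the sensor model C, Q from (3.5); all names are placeholders):

```python
import numpy as np

def measurement_update(x_prior, Sigma_prior, y, C, Q):
    """Incorporate a linear Gaussian observation y = C x + v, v ~ N(0, Q)."""
    S = C @ Sigma_prior @ C.T + Q                    # innovation covariance
    K = Sigma_prior @ C.T @ np.linalg.inv(S)         # Kalman gain, (3.7)
    x_post = x_prior + K @ (y - C @ x_prior)         # add K times the innovation
    I = np.eye(len(x_prior))
    # covariance update in the symmetric form of (3.8)
    Sigma_post = (I - K @ C) @ Sigma_prior @ (I - K @ C).T + K @ Q @ K.T
    return x_post, Sigma_post
```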

1 3.2.4 An example
2 Consider the scalar case when we have multiple measurements of some
3 scalar quantity x ∈ R corrupted by noise.

yi = x + ν i

4 where yi ∈ R and the scalar noise νi ∼ N (0, 1) is zero-mean and


5 standard Gaussian. Find the updated estimate of the state x after k such
6 measurements; this means both the mean and the covariance of the state.
You can solve this in two ways; one way is to take the measurement matrix C = 1k = [1, . . . , 1]⊤, a vector of all ones, and apply the formulae in (3.7) and (3.8). Show that the estimate x̂k after k measurements has mean and covariance

E[x̂k] = (1/k) ∑_{i=1}^{k} yi
Cov(x̂k) = (C⊤ C)^{−1} = 1/k.

If we take one more measurement yk+1 = x + νk+1 with noise νk+1 ∼ N(0, σ²), show using (3.9) that

Cov(x̂k+1)^{−1} = Cov(x̂k)^{−1} + 1/σ²
⇒ Cov(x̂k+1) = σ²/(σ²k + 1).

The updated mean, using (3.9) again, is

E[x̂k+1] = x̂k + Cov(x̂k+1) (1/σ²)(yk+1 − x̂k)
        = x̂k + (yk+1 − x̂k)/(σ²k + 1).

14 You will notice that if the noise on the k + 1th observation is very small,
15 even after k observations, the new estimate fixates on the latest observation

σ → 0 ⇒ x̂k+1 → yk+1 .

16 Similarly, if the latest observation is very noisy, the estimate does not
17 change much
σ → ∞ ⇒ x̂k+1 → x̂k .

3.3 Background on linear and nonlinear dynamical systems

The true state X need not be static. We will next talk about models
for how the state of the world evolves using ideas in dynamical systems.

A continuous-time signal is a function that associates to each time t ∈ R a real number y(t). We denote signals by

y : t ↦ y(t).

A continuous-time signal y(t) and a discrete-time signal yk.

Similarly, a discrete-time signal is a function that associates to each integer k a real number y(k); we have been denoting quantities like this by yk.
5 A dynamical system is an operator (a box) that transforms an input
6 signal u(t) or uk to an output y(t) or yk respectively. We call the former
7 a continuous-time system and the latter a discrete-time system.

Almost always in robotics, we will be interested in systems that are temporally causal, i.e., the output at time t0 is only a function of the input up to time t0. Analogously, the output at time k0 for a discrete-time system depends only on the input up to time k0. Most systems in the physical world are temporally causal.

? Can you give an example of a dynamical system that is non-causal? Think of how a DVD Ripper, or a pre-programmed acrobatic maneuver on a plane, works.

State of a system We know that if the system is causal, in order to
10 compute its output at a time t0 , we only need to know all the input from
11 time t = (−∞, t0 ]. This is a lot of information. The concept of a state,
12 about which we have been cavalier until now helps with this. The state
13 x(t1 ) of a causal system at time t1 is the information needed, together
14 with the input u between times t1 and t2 to uniquely compute the output
15 y(t2 ) at time t2 , for all times t2 ≥ t1 . In other words, the state of a system
16 summarizes the whole history of what happened between (−∞, t1 ).
17 Typically the state of a system is a d-dimensional vector in Rd . The ? Discuss some examples of the state.
18 dimension of a system is the minimum d required to define a state.

3.3.1 Linear systems

A system is called a linear system if for any two input signals u1 and u2 and any two real numbers a, b,

u1 → y1
u2 → y2
a u1 + b u2 → a y1 + b y2.

? Is the state of a system uniquely defined?

Linearity is a very powerful property. For instance, it suggests that if we can decompose a complicated input into the sum of simple signals, then the output of the system is also a sum of the outputs of these simple signals. For example, if we can write the input as a Fourier series u(t) = ∑_{i=0}^{∞} ai cos(it) + bi sin(it), we can pass each of the terms in this summation to the system and get the output for u(t) by summing up the individual outputs.

Finite-dimensional systems can be written using a set of differential equations as follows. Consider the spring-mass system (a second-order spring-mass system). If z(t) denotes the position of the mass at time t and u(t) is the force that we are applying upon it at time t, the position of the mass satisfies the differential equation

m d²z(t)/dt² + c dz(t)/dt + k z(t) = u(t), or m z̈ + c ż + k z = u

in short. Here m is the mass of the block, c is the damping coefficient of the spring and k is the spring force constant. Let us define

z1(t) := z(t)
z2(t) := dz(t)/dt.

We can now rewrite the dynamics as

[ż1; ż2] = [0 1; −k/m −c/m] [z1; z2] + [0; 1/m] u.

15 3.3.2 Linear Time-Invariant (LTI) systems


 
If we define the state x(t) = [z1(t); z2(t)], then the above equation can be
17 written as
ẋ(t) = Ax(t) + Bu(t). (3.10)
18 This is a linear system that takes in the input u(t) and has a state x(t).
19 You can check the conditions for linearity to be sure. It is also a linear
20 time-invariant (LTI) system because the matrices A, B do not change with
21 time t. The input u(t) is also typically called the control (or action,
22 or the control input) and essentially the second half of the course is
23 about computing good controls.
24 Since the state at time t encapsulates everything that happened to
25 the system due to the inputs {u(−∞), u(t)}, we can say that the system
26 computes its output y(t) as a function of the state x(t) and the latest input
27 u(t)
y(t) = function(x(t), u(t))
28 If this function is linear we have

y(t) = Cx(t) + Du(t). (3.11)



1 The pair of equations (3.10) and (3.11) together are the so-called state-
2 space model of an LTI system. The development for discrete-time systems
3 is completely analogous, we will have

xk+1 = A∆t xk + B∆t uk


(3.12)
yk = Cxk + Duk .

We have used the subscript ∆t to denote that these are discrete-


time matrices and are different from the continuous-time ones in (3.10)
and (3.11). This is an important point to keep in mind.

4 If the dynamics matrices A, B, C, D change with time, we have a


5 time-varying system.
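As a sketch of how the discrete-time matrices relate to the continuous-time ones, here is an Euler discretization of the spring-mass system (Python; the parameters and the step ∆t are arbitrary choices, and a more accurate discretization would use the matrix exponential):

```python
import numpy as np

m, c, k, dt = 1.0, 0.5, 2.0, 0.01
A = np.array([[0.0, 1.0], [-k / m, -c / m]])   # continuous-time dynamics matrix
B = np.array([[0.0], [1.0 / m]])
A_dt = np.eye(2) + dt * A                      # Euler approximation of exp(A dt)
B_dt = dt * B

x = np.array([1.0, 0.0])                       # initial position and velocity
for _ in range(1000):                          # simulate 10 seconds
    u = np.array([0.0])                        # no external force
    x = A_dt @ x + B_dt @ u                    # x_{k+1} = A_dt x_k + B_dt u_k
print(x)
```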

6 3.3.3 Nonlinear systems


7 Nonlinear systems are defined entirely analogously as linear systems.
8 Imagine if we had a non-linear spring in the spring-mass system whereby
9 the dynamics of the block was given by

mz̈ + cż + (k1 z + k2 z 2 ) = u.



The state of the system is still x = [z1, z2]⊤. But we cannot write this second-order differential equation as two first-order linear differential equations. We are forced to write

ẋ = [ż1; ż2] = [0 1; −(k1 + k2 z1)/m −c/m] [z1; z2] + [0; 1/m] u,

where the entry −(k1 + k2 z1)/m itself depends on the state z1.

13 Such systems are called nonlinear systems. We will write them succinctly
14 as
ẋ = f (x, u)
(3.13)
y = g(x, u).
15 The function f : X × U → X that maps the state-space and the input
16 space to the state-space is called the dynamics of the system. Analogously,
17 for discrete-time nonlinear systems we will have

xk+1 = f∆t (xk , uk )


yk = g(xk , uk ).

Again, the discrete-time nonlinear dynamics has a different equation than the corresponding one in (3.13).

? Is the nonlinear spring-mass system time-invariant?

20 3.4 Markov Decision Processes (MDPs)


21 Let us now introduce a concept called MDPs which is very close to
22 Markov chains that we saw in the previous chapter. In fact, you are already
23 implementing an MDP in your HW 1 problem on the Bayes filter.

MDPs are a model for the scenario when we do not completely


know the dynamics f (xk , uk ).

1 This may happen for a number of reasons and it is important to


2 appreciate them in order to understand the widespread usage of MDPs.

3 1. We did not do a good job of identifying the function f : X ×U → X .


4 This may happen when you are driving a car on an icy road, if you
5 undertake the same control as you do on a clean road, you might
6 reach a different future state xk+1 .

7 2. We did not use the correct state-space X . You could write down
8 the state of the car as given by (x, y, θ, ẋ, ẏ, θ̇) where x, y are the
9 Euclidean co-ordinates of the car and θ is its orientation. This is
not a good model for studying high-speed turns, which are affected by
11 other quantities like wheel slip, the quality of the suspension etc.
12 We may not even know the full state sometimes. This occurs when
13 you are modeling how users interact with an online website like
14 Amazon.com, you’d like to model the change in state of the user
15 from “perusing stuff” to “looking stuff to buy it” to “buying it”
16 but there are certainly many other variables that affect the user’s
17 behavior. As another example, consider the path that an airplane
18 takes to go from Philadelphia to Los Angeles. This path is affected
19 by the weather at all places along the route, it’d be cumbersome
20 to incorporate the weather to find the shortest-time path for the
21 airplane.

22 3. We did not use the correct control-space U for the controller. This
23 is akin to the second point above. The gas pedal which one may
24 think of as the control input to a car is only one out of the large
25 number of variables that affect the running of the car’s engine.

MDPs are a drastic abstraction of all the above situations. We



write
xk+1 = f (xk , uk ) + ϵk (3.14)
where the “noise” ϵk is not under our control. The quantity ϵk is not
arbitrary however, we will assume that

1. noise ϵk is a random variable and we know its distribution.


For example, you ran your car lots of times on icy road and
measured how the state xk+1 deviates from similar runs on a
clean road. The difference between the two is modeled as ϵk .
Note that the distribution of ϵk may be a function of time k.

2. noise at different timesteps ϵ1 , ϵ2 , . . . , is independent.

Instead of a deterministic transition for our system from xk to xk+1, we now have

xk+1 ∼ P(xk+1 | xk, uk),

which is just another way of writing (3.14). The latter is a probability table of size |X| × |U| × |X|, akin to the transition matrix of a Markov chain except that there is a different transition matrix for every control u ∈ U. The former version (3.14) is more amenable to analysis. MDPs can alternatively be called stochastic dynamical systems; we will use either name for them in this course. For completeness, let us note down that linear stochastic systems will be written as

xk+1 = A xk + B uk + ϵk.

? You should think about the state-space, control-space and the noise in the MDP for the Bayes filter problem in HW 1. Where do we find MDPs in real-life? There are lots of expensive robots in GRASP, e.g., a Kuka manipulator such as this https://www.youtube.com/watch?v=ym64NFCWOR costs upwards of $100,000. Would you model it as a stochastic dynamical system?

The moral of this section is to remember that as pervasive as noise


seems in all problem formulations in this course, it models different
situations depending upon the specific problem. Understanding where
noise comes from is important for real-world applications.

1 Noise in continuous-time systems You will notice that we only talked


2 about discrete-time systems with noise in (3.14). We can also certainly
3 talk about continuous-time systems whose dynamics f we do not know
4 precisely
ẋ(t) = f (x(t), u(t)) + ϵ(t) (3.15)
5 and model the gap in our knowledge as noise ϵ(t). While this may seem
6 quite natural, it is mathematically very problematic. The hurdle stems
7 from the fact that if we want ϵ(t) to be a random variable at each time
8 instant, then the signal ϵ(t) may not actually exist, e.g., it may not even
9 be continuous. Signals like ϵ(t) exist only in very special cases, one of
10 them is called “Brownian motion” where the increment of the signal after

infinitesimal time ∆t is a Gaussian random variable

ϵ(t + ∆t) − ϵ(t) ∼ N(0, ∆t).

Figure 3.2: A typical Brownian motion signal ϵ(t). You can also see an animation at https://en.wikipedia.org/wiki/File:Brownian_Motion.ogv

? Do continuous-time systems, stochastic or non-stochastic, exist in the real world? Consider the Kuka manipulator again: do you think the dynamics of this robot is a continuous-time system? Would you model it so?

We will not worry about this technicality in this course. We will talk about continuous-time systems with noise but with the implicit understanding that there is some underlying real-world discrete-time system and the continuous-time system is only an abstraction of it.

6 3.4.1 Back to Hidden Markov Models


Since our sensors measure the state x of the world, it will be useful to think of the output y of a dynamical system as the observations from Chapter 2. This idea neatly ties back our development of dynamical systems to observations. Just like we considered an HMM with observation probability

P(Yk = y | Xk = x),

we will consider dynamical systems for which we do not precisely know how the output is computed. We will model the gap in our knowledge of the exact observation mechanism as the output being a noisy function of the state. This is denoted as

yk = g(xk) + νk.          (3.16)

The noise νk is similar to the noise in the dynamics ϵk in (3.14). Analogously, we can also have noise in the observations of a linear system,

yk = C xk + D uk + νk.

Observation noise and dynamics noise are different in subtle ways. The former may not always be due to our poor modeling. For instance, the process by which a camera acquires its images has some inherent noise; you may have seen a side-by-side comparison of different cameras using their ISOs. An image taken from a camera with low lighting has a lot of “noise”. What causes this noise?

Hidden Markov Models with underlying MDPs/Markov chains

and stochastic dynamical systems with noisy observations are two dif-
ferent ways to think of the same concept, namely getting observations
across time about the true state of a dynamic world.
In the former we have

(state transition matrix) P(Xk+1 = x′ | Xk = x, uk = u)


(observation matrix) P(Yk = y | Xk = x),

while in the latter we have

(nonlinear dynamics) xk+1 = f (xk , uk ) + ϵk


(nonlinear observation model) yk = g(xk ) + νk ,
or

(linear dynamics) xk+1 = Axk + Buk + ϵk


(linear observation model) yk = Cxk + Duk + νk .

HMMs are easy to use for certain kinds of problems, e.g., speech-to-text, or a robot wandering in a grid world (like the Bayes filter problem in HW 1). Dynamical systems are more useful for certain other kinds of problems, e.g., a Kuka manipulator where you can use Newton's laws to simply write down the functions f, g.

You will agree that creating the state-transition matrix for the Bayes filter problem in HW 1 was really the hardest part of the problem. If the state-space were continuous and not a discrete cell-based world, you could have written the dynamics very easily in one line of code.

1 3.5 Kalman Filter (KF)


We will now introduce the Kalman Filter. It is the analog of the Bayes filter from the previous chapter. This is by far the most important algorithm in robotics and it is hard to imagine running any robot without the Kalman Filter or some variant of it.
6 Consider a linear dynamical system with linear observations

xk+1 = Axk + Buk + ϵk


(3.17)
yk = Cxk + νk .

where the noise vectors

ϵk ∼ N(0, R)
νk ∼ N(0, Q)

are both zero-mean and Gaussian with covariances R and Q respectively. We have also assumed that D = 0 because typically the observations do not depend on the control.

We will assume that the distribution of the noise ϵk, νk does not change with time k. If it does change in your problem, you will see that the following equations are quite easy to modify.

Our goal is to compute the best estimate of the state after multiple observations,

P(xk | y1, . . . , yk).
This is the same as the filtering problem that we solved for Hidden
Markov Models. Just like we used the forward algorithm to compute
the filtering estimate recursively, we are going to use our development
of the Kalman gain to incorporate a new observation recursively.

1 3.5.1 Step 0: Observing that the state estimate at any


2 timestep should be a Gaussian
3 Maintaining the entire probability distribution P(xk | y1 , . . . , yk ) is
4 difficult now, as opposed to the HMM with a finite number of states. We
5 will exploit the following important fact. If we assume that the initial
6 distribution of x0 was a Gaussian, since all operations in (3.17) are linear,
7 our new estimate of the state x̂k at time k is also a Gaussian

x̂k|k ∼ P(xk | y1 , . . . , yk ) ≡ N (µk|k , Σk|k ).

8 The subscript
x̂k+1|k
9 denotes that the quantity being talked about, i.e., x̂k+1|k , or others like
10 µk+1|k , is of the (k + 1)th timestep and was calculated on the basis of
11 observations up to (and including) the k th timestep. We will therefore
12 devise recursive updates to obtain µk+1|k+1 , Σk+1|k+1 using their old
13 values µk|k , Σk|k . We will imagine that our initial estimate for the state
14 x̂0|0 has a known distribution

x̂0|0 ∼ N (µ0|0 , Σ0|0 ).


16 3.5.2 Step 1: Propagating the dynamics by one timestep


17 Suppose we had an estimate x̂k|k after k observations/time-steps. Since
18 the dynamics is linear, we can use the prediction problem to compute the
19 estimate of the state at time k + 1 before the next observation arrives

P(xk+1 | y1 , . . . , yk ).

20 From the first equation of (3.17), this is given by

x̂k+1|k = Ax̂k|k + Buk + ϵk



1 Notice that the subscript on the left-hand side is k + 1 | k because we did


2 not take into account the observation at timestep k + 1 yet. The mean and
3 covariance of this estimate are given by

µk+1|k = E[x̂k+1|k ] = E[Ax̂k|k + Buk + ϵk ]


(3.18)
= Aµk|k + Buk .

4 We can also calculate the covariance of the estimate x̂k+1|k to see that

Σk+1|k = Cov(x̂k+1|k ) = Cov(Ax̂k|k + Buk + ϵk )


(3.19)
= AΣk|k A⊤ + R,

using our calculation in (3.1). Observe that even if we knew the state dynamics precisely, i.e., if R = 0, we would still have a non-trivial propagation equation for Σk+1|k.

3.5.3 Step 2: Incorporating the observation
7 After the dynamics propagation step, our estimate of the state is x̂k+1|k ,
8 this is the state of the system that we believe is true after k observations.
9 We should now incorporate the latest observation yk+1 to update this
10 estimate to get
P(xk+1 | y1 , . . . , yk , yk+1 ).
11 This is exactly the same problem that we saw in Section 3.2.3. Given the
12 measurement
yk+1 = Cxk+1 + νk+1
13 we first compute the Kalman gain Kk+1 and the updated mean of the
14 estimate as
Kk+1 = Σk+1|k C⊤ (C Σk+1|k C⊤ + Q)^{−1}          (3.20)
µk+1|k+1 = µk+1|k + Kk+1 (yk+1 − C µk+1|k).

The covariance is given by our same calculation again:

Σk+1|k+1 = (I − Kk+1 C) Σk+1|k, or
         = (I − Kk+1 C) Σk+1|k (I − Kk+1 C)⊤ + Kk+1 Q Kk+1⊤, or
         = ( Σk+1|k^{−1} + C⊤ Q^{−1} C )^{−1}.          (3.21)

The second expression is known as Joseph's form and is numerically more stable than the other expressions.

The new estimate of the state is

x̂k+1|k+1 ∼ P(xk+1 | y1 , . . . , yk+1 ) ≡ N (µk+1|k+1 , Σk+1|k+1 ).

and we can again proceed to Step 1 for the next timestep.
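Putting the two steps together, here is a minimal sketch of one Kalman Filter timestep (numpy; the matrices A, B, C, R, Q and the data uk, yk+1 are assumed to be given, and the function name is a placeholder):

```python
import numpy as np

def kalman_step(mu, Sigma, u, y, A, B, C, R, Q):
    """One KF timestep: dynamics propagation followed by a measurement update."""
    # Step 1: propagate the dynamics, (3.18)-(3.19)
    mu_pred = A @ mu + B @ u
    Sigma_pred = A @ Sigma @ A.T + R
    # Step 2: incorporate the observation, (3.20)-(3.21)
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + Q)
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_pred
    return mu_new, Sigma_new
```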



1 3.5.4 Discussion
2 There are several important observations to make and remember about
3 the Kalman Filter (KF).

4 • Recursive updates to compute the best estimate given all past


5 observations. The KF is a recursive filter (just like the forward
6 algorithm for HMMs) and incorporates observations one by one.
7 The estimate that it maintains, namely x̂k+1|k+1 , depends upon all
8 past observations

x̂k+1|k+1 ∼ P(xk+1 | y1 , . . . , yk+1 ).

9 We have simply computed the estimate recursively.

10 • Optimality of the KF for linear systems with Gaussian noise.


11 The KF is optimal in the following sense. Imagine if we had access
12 to all the observations y1 , . . . , yk beforehand and computed some
other estimate

x̂^{fancy filter}_{k|k} = some function(x̂0|0, y1, . . . , yk).

We use some other fancy method to design this estimator, e.g., a nonlinear combination of the observations, or incorporating observations across multiple timesteps together, etc., to obtain something that has the smallest error with respect to the true state xk,

tr( E_{ϵ1,...,ϵk, ν1,...,νk}[ (x̂^{fancy filter}_{k|k} − xk)(x̂^{fancy filter}_{k|k} − xk)⊤ ] ).          (3.22)

Then this estimate would be exactly the same as that of the KF:

x̂^{fancy filter}_{k|k} = x̂^{KF}_{k|k}.
= x̂KF
k|k .

19 This is a deep fact. First, the KF estimate was created recursively


20 and yet we can do no better than it with our fancy estimator. This
21 is analogous to the fact that the forward algorithm computes the
22 correct filtering estimate even if it incorporates observations one by
23 one recursively. Second, the KF combines the new observation and
24 the old estimate linearly in (3.20). You could imagine that there is
25 some other way to incorporate new observations, but it turns out
26 that for linear dynamical systems with Gaussian noise, the KF is
27 the best solution, we can do no better.

28 • The KF is the best linear filter. If we had a nonlinear dynamical


29 system or a non-Gaussian noise with a linear dynamics/observations,
30 there are other filters that can give a smaller error (3.22) than the
31 KF. In the next section, we will take a look at one such example.
32 However, even in these cases, the KF is the best linear filter.

• Assumptions that are implicit in the KF. We assumed that both the dynamics noise ϵk and the observation noise νk+1 are uncorrelated with the estimate x̂k+1|k computed prior to them (where did we use these assumptions?). This implicitly assumes that the dynamics and observation noise are “white”, i.e., uncorrelated in time:

  E[ϵk ϵk′⊤] = 0 and E[νk νk′⊤] = 0 for all k ≠ k′.

The Wikipedia webpage at https://en.wikipedia.org/wiki/Kalman_filter#Example_application,_technical gives a simple example of a Kalman Filter.

? How should one modify the KF equations if we have multiple sensors in a robot, each coming in at different frequencies?

3.6 Extended-Kalman Filter (EKF)

The KF heavily exploits the fact that our dynamics/measurements are linear. For most robots, both of these are nonlinear. The Extended-Kalman Filter (EKF) is a modification of the KF to handle such situations.

10 Example of a nonlinear dynamical system The state of most real


11 problems evolves as a nonlinear function of their current state and control.
This is the same for sensors: cameras, for example, measure a nonlinear function
13 of the state. We will first see how to linearize a given nonlinear system
14 shown below.


16 We have a radar sensor that measures the distance of the plane r from the
17 radar trans-receiver up to noise ν. We would like to measure its distance
18 x and height h. If the plane travels with a constant velocity, we have

ẋ = v, and v̇ = 0,

and

r = √(x² + h²).
20 Since we do not really know how the plane might change its altitude, let’s
21 assume that it maintains a constant altitude

ḣ = 0.

22 The above equations are our model for how the state of the airplane evolves
23 and could of course be wrong. As we discussed, we will model the

1 discrepancy as noise.
     
x˙1 0 1 0 x1
x˙2  = 0 0 0 x2  + ϵ;
x˙3 0 0 0 x3
q
r = x21 + x23 + ν;

2 here x1 ≡ x, x2 ≡ v and x3 = h, and ϵ ∈ R3 , ν ∈ R are zero-mean


3 Gaussian noise. The dynamics in this case is linear but the observations
4 are a nonlinear function of the state.
5 One way to use the Kalman Filter for this problem is to linearize the
6 observation equation around some state, say x1 = x2 = x3 = 0 using the
7 Taylor series

∂r ∂r
rlinearized = r(0, 0, 0) + (x1 − 0) + (x3 − 0) ? You can try to perform a similar
∂x1 x1 =0,x3 =0 ∂x3 x1 =0,x3 =0
linearization for a simple model of a car
2x1 2x3
=0+ p 2 x1 + p 2 x3
2 x1 + x23 x1 =0,x3 =0 2 x1 + x23 x1 =0,x3 =0 ẋ = cos θ
= x1 + x3 . ẏ = sin θ
θ̇ = u.
8 In other words, upto first order in x1 , x3 , the observations are linear and
9 we can therefore run the KF for computing the state estimate after k where x, y, θ are the XY-coordinates and the
10 observations. angle of the steering wheel respectively. This
model is known as a Dubins car.
3.6.1 Propagation of statistics through a nonlinear transformation
13 Given a Gaussian random variable Rd ∋ x ∼ N (µx , Σx ), we saw how to
14 compute the mean and covariance after an affine transformation y = Ax

E[y] = A E[x], and Σy = AΣx A⊤ .

15 If we had a nonlinear function of x

Rp ∋ y = f (x)

16 we can use the Taylor series by linearizing around the mean of x to


17 approximate the first and second moments of y as follows.

y = f(x) ≈ f(µx) + df/dx |_{x=µx} (x − µx)
         = Jx + (f(µx) − Jµx),

where we have defined the Jacobian matrix

Rp×d ∋ J = df/dx |_{x=µx}.          (3.23)

This gives

E[y] ≈ E[Jx + (f(µx) − Jµx)] = f(µx)          (3.24)
Σy = E[(y − E[y])(y − E[y])⊤] ≈ J Σx J⊤.

2 Observe how, up to first order, the mean µx is directly transformed by the


3 nonlinear function f while the covariance Σx is transformed as if there
4 were a linear operation y ≈ Jx.

A simple example

y = [y1; y2] = f([x1; x2; x3]) = [x1² + x2 x3; sin x2 + cos x3].

We have

df/dx = ∇f(x) = [2x1  x3  x2; 0  cos x2  −sin x3].

The Jacobian at µx = [µx1, µx2, µx3] is

J = ∇f(x)|_{x=µx} = [2µx1  µx3  µx2; 0  cos µx2  −sin µx3].

It is very important to remember that we are approximating the


distribution of P(f (x)) as a Gaussian. Even if x is a Gaussian random
variable, the distribution of y = f (x) need not be Gaussian. Indeed
y is only Gaussian if f is an affine function of x.

1 3.6.2 Extended Kalman Filter


? Can you say where our linearized observation equation will incur the most error?

The above approach of linearizing the observations of the plane around the origin may lead to a lot of errors. This is because the point about which we linearize the system is fixed. We can do better by linearizing the system at each timestep. Let us say that we are given a nonlinear system

xk+1 = f (xk , uk ) + ϵ
yk = g(xk ) + ν.

The central idea of the Extended Kalman Filter (EKF) is to


linearize a nonlinear system at each timestep k around the latest state
estimate given by the Kalman Filter and use the resultant linearized
dynamical system in the KF equations for the next timestep.

6 Step 1: Propagating the dynamics by one timestep


We will linearize the dynamics equation around the mean of the previous state estimate µk|k:

xk+1 = f(xk, uk) + ϵk
     ≈ f(µk|k, uk) + ∂f/∂x |_{x=µk|k} (xk − µk|k) + ϵk.

Let the Jacobian be

A(µk|k) = ∂f/∂x |_{x=µk|k}.          (3.25)

10 The mean and covariance of the EKF after the dynamics propagation step
11 is therefore given by

µk+1|k = f (µk|k , uk )
(3.26)
Σk+1|k = AΣk|k A⊤ + R.

12 It is worthwhile to notice the similarities of the above set of equations


13 with (3.18) and (3.19). The mean µk|k is propagated using a nonlinear
14 function f to get µk+1|k , the covariance is propagated using the Jacobian
15 A(µk|k ) which is recomputed using (3.25) at each timestep.

16 Step 2: Incorporating the observation


17 We have access to µk+1|k after Step 1, so we can linearize the nonlinear
18 observations at this state.

yk+1 = g(xk+1) + ν
     ≈ g(µk+1|k) + dg/dx |_{x=µk+1|k} (xk+1 − µk+1|k) + ν.

Again define the Jacobian

C(µk+1|k) = ∂g/∂x |_{x=µk+1|k}.          (3.27)

Consider the fake observation ȳk+1, which is a transformed version of the actual observation yk+1 (think of this as a new sensor, or a post-processed version of the original sensor),

ȳk+1 = yk+1 − g(µk+1|k) + C µk+1|k ≈ C xk+1.

Our fake observation is a nice linear function of the state xk+1 and we can therefore use the Kalman Filter equations to incorporate it:

µk+1|k+1 = µk+1|k + K(ȳk+1 − C µk+1|k)
where K = Σk+1|k C⊤ (C Σk+1|k C⊤ + Q)^{−1}.

Let us resubstitute our fake observation in terms of the actual observation yk+1,

ȳk+1 − C µk+1|k = yk+1 − g(µk+1|k),

to get the EKF equations for incorporating one observation:

µk+1|k+1 = µk+1|k + K(yk+1 − g(µk+1|k))          (3.28)
Σk+1|k+1 = (I − KC) Σk+1|k.

The Extended Kalman Filter estimates the state of a nonlinear


system by linearizing the dynamics and observation equations at each
timestep.
1. Say we have the current estimate µk|k and Σk|k .
2. After a control input uk the new estimate is

µk+1|k = f (µk|k , uk )
Σk+1|k = AΣk|k A⊤ + R.

where A depends on µk|k .


3. We next incorporate an observation by linearizing the observation equations around µk+1|k:

   K = Σk+1|k C⊤ (C Σk+1|k C⊤ + Q)^{−1}
   µk+1|k+1 = µk+1|k + K(yk+1 − g(µk+1|k))
   Σk+1|k+1 = (I − KC) Σk+1|k,

   where again C depends on µk+1|k.
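A sketch of one EKF timestep in code (Python; f and g are the user-supplied dynamics and observation functions, and jac_f, jac_g return their Jacobians at a given point; all of these names are placeholders):

```python
import numpy as np

def ekf_step(mu, Sigma, u, y, f, g, jac_f, jac_g, R, Q):
    """One EKF step for x_{k+1} = f(x_k, u_k) + eps, y_k = g(x_k) + nu."""
    # Step 1: propagate the mean through f and the covariance through the Jacobian A
    A = jac_f(mu, u)                          # (3.25), evaluated at mu_{k|k}
    mu_pred = f(mu, u)
    Sigma_pred = A @ Sigma @ A.T + R
    # Step 2: linearize the observation around mu_pred and apply the KF update
    C = jac_g(mu_pred)                        # (3.27), evaluated at mu_{k+1|k}
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + Q)
    mu_new = mu_pred + K @ (y - g(mu_pred))   # note g(mu_pred), not C mu_pred
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_pred
    return mu_new, Sigma_new
```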



1 Discussion
1. The EKF dramatically expands the applicability of the Kalman Filter. It can be used for most real systems, even with very complex models f, g. It is very commonly used in robotics and can
5 handle nonlinear observations from complex sensors such as a
6 LiDAR and camera easily. For instance, sophisticated augment-
7 ed/virtual reality systems like Google ARCore/Snapchat/iPhone
8 etc. (https://www.youtube.com/watch?v=cape_Af9j7w) run EKF
9 to track the motion of the phone or of the objects in the image.

10 2. The KF was special because it is the optimal linear filter, i.e., KF


11 estimates have the smallest mean squared error with respect to the
12 true state for linear dynamical systems with Gaussian. The EKF is
13 a clever application of KF to nonlinear systems but it no longer has
14 this property. There do exist filters for nonlinear systems that will
15 have a smaller mean-squared error than the EKF. We will look at
16 some of them in the next section.

17 3. Linearization is the critical step in the implementation of the EKF


18 and EKF state estimate can be quite inaccurate if the system is at
19 a state where the linearized matrix A and the nonlinear dynamics
20 f (xk , uk ) differ significantly. A common trick for handling this is to
21 perform multiple steps of dynamics propagation using a continuous-
22 time model of the system between successive observations. Say we
23 have a system
ẋ = f (x(t), u(t)) + ϵ(t)
24 where ϵ(t + δt) − ϵ(t) is a Gaussian random variable N (0, Rδt) as
25 δ → 0; see the section on Brownian motion for how to interpret
26 noise in continuous-time systems. We can construct a discrete-time
27 system from this as

xt+∆t = x(t) + f (x(t), u(t)) ∆t + ϵ


≡ f discrete-time (x(t), u(t)) + ϵ.

28 where ϵ ∼ N (0, R∆t) is noise. This is now a discrete-time


29 dynamics and we can perform Step 1 of the EKF multiple times to
30 obtain a more accurate estimate of µk+1|k and Σk+1|k .

31 3.7 Unscented Kalman Filter (UKF)


32 Linearization of the dynamics in the EKF is a neat trick to use the KF
33 equations. But as we said, this can cause severe issues in problems
34 where the dynamics is very nonlinear. In this section, we will take a look
35 at a powerful method to handle nonlinear dynamics that is better than
36 linearization.
37 Let us focus on Step 1 which propagates the dynamics in the EKF.

We know that even if x is Gaussian (faint blue points in top left


picture), the transformed variable y = f (x) need not be Gaussian
(faint blue points in bottom left). The EKF is really approximating
the probability distribution P(xk+1 | y1 , . . . , yk ) as a Gaussian;
this distribution could be very different from a Gaussian. This is
really the crux of the issue in filtering for nonlinear systems. This approximation error arises because we are linearizing about the mean µk|k.

2 Let us instead do the following:

3 1. Sample a few points from the Gaussian N (µk|k , Σk|k ) (red points
4 in top right).

5 2. Transform each of the points using the nonlinear dynamics f (red


6 points in bottom right).

7 3. Compute their mean and covariance to get µk+1|k and Σk+1|k .


8 Notice how the green ellipse is slightly different than the black
9 ellipse (which is the true mean and covariance). Both of these would
10 be different from the mean and covariance obtained by linearization
11 of f (middle column) but the green one is more accurate.

In general, we would need a large number of sample points (red) to



accurately get the mean and covariance of y = f (x). The Unscented


Transform (UT) uses a special set of points known as “sigma points”
(these are the ones actually shown in red above) and transforms those
points. Sigma points have the special property that the empirical
mean of the transformed distribution (UT mean in the above picture)
is close to the true mean up to third order; linearization is only
accurate up to first order. The covariance (UT covariance) and true
covariance also match up to third order.

1 3.7.1 Unscented Transform


2 Given a random variable x ∼ N (µx , Σx ), the Unscented Transform
3 (UT) uses sigma points to compute an approximation of the probability
4 distribution of the random variable y = f (x).

5 Preliminaries: matrix square root. Given a symmetric matrix Σ ∈


6 Rn×n , the matrix square root of Σ is a matrix S ∈ Rn×n such that

Σ = SS.

7 We can compute this via diagonalization as follows.

Σ = V D V^{−1}
  = V diag(d11, . . . , dnn) V^{−1}
  = V diag(√d11, . . . , √dnn)² V^{−1}.

We can therefore define

S = V diag(√d11, . . . , √dnn) V^{−1} = V D^{1/2} V^{−1}.

9 Notice that

SS = (V D1/2 V −1 ) (V D1/2 V −1 ) = V DV −1 = Σ.

We can also define the matrix square root using the Cholesky decomposition Σ = LL⊤, which is numerically more stable than computing the square root using the above expression. (Note that the symmetric square root S defined above has the same eigenvectors as Σ; the Cholesky factor L in general does not.) Typical applications of the Unscented Transform will use this method.

Given a random variable Rn ∋ x ∼ N(µ, Σ), we will use the matrix square root to compute the sigma points as

x^(i)   = µ + (√(nΣ))i⊤
x^(n+i) = µ − (√(nΣ))i⊤          for i = 1, . . . , n,          (3.29)

where (√(nΣ))i is the i-th row of the matrix √(nΣ). There are 2n sigma points

{x^(1), . . . , x^(2n)}

for an n-dimensional Gaussian. Each sigma point is assigned a weight

w^(i) = 1/(2n).          (3.30)
5 We then transform each sigma point to get the transformed sigma points

y (i) = f (x(i) ).

The mean and covariance of the transformed random variable y can now be computed as

µy = ∑_{i=1}^{2n} w^(i) y^(i)          (3.31)
Σy = ∑_{i=1}^{2n} w^(i) (y^(i) − µy)(y^(i) − µy)⊤.

 
r
Example Say we have x = [r; θ] with µx = [1, π/2] and Σx = [σr² 0; 0 σθ²]. We would like to compute the probability distribution of

y = f(x) = [r cos θ; r sin θ],

which is a polar transformation. Since x is two-dimensional, we will have 4 sigma points with equal weights w^(i) = 0.25. The square root in the sigma point expression is

√(nΣ) = [√2 σr  0; 0  √2 σθ]

and the sigma points are

x^(1) = [1; π/2] + [√2 σr; 0],    x^(3) = [1; π/2] − [√2 σr; 0],
x^(2) = [1; π/2] + [0; √2 σθ],    x^(4) = [1; π/2] − [0; √2 σθ].

? Compute the mean and covariance of y by linearizing the function f(x).

The transformed sigma points are

y^(1) = [r^(1) cos θ^(1); r^(1) sin θ^(1)] = [0; 1 + √2 σr]
y^(2) = [r^(2) cos θ^(2); r^(2) sin θ^(2)] = [cos(π/2 + √2 σθ); sin(π/2 + √2 σθ)]
y^(3) = [0; 1 − √2 σr]
y^(4) = [cos(π/2 − √2 σθ); sin(π/2 − √2 σθ)].
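The same computation can be written as a short function (numpy; the Cholesky factor is used as the matrix square root, as discussed above, f is any nonlinear function such as the polar transformation in this example, and the numerical values of σr, σθ below are arbitrary choices):

```python
import numpy as np

def unscented_transform(mu, Sigma, f):
    """Approximate mean and covariance of y = f(x) for x ~ N(mu, Sigma) using 2n sigma points."""
    n = len(mu)
    S = np.linalg.cholesky(n * Sigma)                  # matrix square root of n*Sigma
    sigma_pts = np.concatenate([mu + S.T, mu - S.T])   # rows are the 2n sigma points
    Y = np.array([f(x) for x in sigma_pts])            # transformed sigma points
    mu_y = Y.mean(axis=0)                              # equal weights 1/(2n), cf. (3.31)
    Sigma_y = (Y - mu_y).T @ (Y - mu_y) / (2 * n)
    return mu_y, Sigma_y

# polar transformation from the example; sigma_r = 0.1, sigma_theta = 0.25 chosen arbitrarily
mu_y, Sigma_y = unscented_transform(
    np.array([1.0, np.pi / 2]),
    np.diag([0.1 ** 2, 0.25 ** 2]),
    lambda x: np.array([x[0] * np.cos(x[1]), x[0] * np.sin(x[1])]),
)
print(mu_y, Sigma_y)
```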

Figure 3.3: Note that the true mean is being predicted very well by the UT and is
clearly a better estimate than the linearized mean.

3.7.2 The UT with tuning parameters

The UT is a basic template for a large suite of techniques that capture the covariance Σx as a set of points and transform those points through the nonlinearity. You will see many alternative implementations of the UT that allow for user-tunable parameters. For instance, sometimes the UT is implemented with an additional sigma point x^(0) = µ with weight w^(0) = λ/(n + λ), and the weights of the other points are adjusted to be w^(i) = 1/(2(n + λ)) for a user-chosen parameter λ. You may also see people using one set of weights w^(i) for computing the mean µy and another set of weights for computing the covariance Σy.

? Are the transformed sigma points y^(i) the sigma points of P(y) = N(µy, Σy)?

? We are left with a big lingering question. Why do you think this method is called the “unscented transform”?

1 3.7.3 Unscented Kalman Filter (UKF)


2 The Unscented Transform gives us a way to accurately estimate the mean
3 and covariance of the transformed distribution through a nonlinearity.
4 We can use the UT to modify the EKF to make it a more accurate state
5 estimator. The resultant algorithm is called the Unscented Kalman Filter
6 (UKF).

7 Step 1: Propagating the dynamics by one timestep Given our current


8 state estimate µk|k and Σk|k , we use the UT to obtain the updated estimates
9 µk+1|k and Σk+1|k . If x(i) are the sigma points with corresponding weights
10 w(i) for the Gaussian N (µk|k , Σk|k ), we set

µk+1|k := ∑_{i=1}^{2n} w^(i) f(x^(i), uk)          (3.32)
Σk+1|k := R + ∑_{i=1}^{2n} w^(i) (f(x^(i), uk) − µk+1|k)(f(x^(i), uk) − µk+1|k)⊤.

11 Step 2.1: Incorporating one observation The observation step is also


12 modified using the UT. The key issue in this case is that we need a way
13 to compute the Kalman gain in terms of the sigma points in the UT. We
14 proceed as follows.
Using new sigma points x^(i) for the updated state distribution N(µk+1|k, Σk+1|k) with equal weights w^(i) = 1/(2n), we first compute their mean after the transformation,

ŷ = ∑_{i=1}^{2n} w^(i) g(x^(i)),          (3.33)

and covariances

Σyy := Q + ∑_{i=1}^{2n} w^(i) (g(x^(i)) − ŷ)(g(x^(i)) − ŷ)⊤          (3.34)
Σxy := ∑_{i=1}^{2n} w^(i) (x^(i) − µk+1|k)(g(x^(i)) − ŷ)⊤.

19 Step 2.2: Computing the Kalman gain Until now we have written the
20 Kalman gain using the measurement matrix C. We will now discuss a
21 more abstract formulation that gives the same expression.
22 Say we have a random variable x with known µx , Σx and get a new
23 observation y. We saw how to incorporate this new observation to obtain
24 a better estimator for x in Section 3.2.3. We will go through a similar
25 analysis as before but in a slightly different fashion, one that does not
involve the matrix C. Let

z = [x; y]

with µz = [µx; µy] and

Σz = [Σxx  Σxy; Σyx  Σyy].

Finding the best (minimum variance) estimator x̂ = µx + K(y − µy) amounts to minimizing

min_K E[ tr( (x̂ − x)(x̂ − x)⊤ ) ].

This is called the least squares problem, which you have perhaps seen before in slightly different notation. You can solve this problem to see that the best gain K is given by

K* = Σxy Σyy^{−1},          (3.35)

and this gain leads to the new covariance

E[(x̂ − x)(x̂ − x)⊤] = Σxx − Σxy Σyy^{−1} Σyx = Σxx − K* Σyy K*⊤.

The nice thing about the Kalman gain in (3.35) is that we can compute it now using the expressions for Σxy and Σyy in terms of the sigma points. This goes as follows:

K* = Σxy Σyy^{−1}
µk+1|k+1 = µk+1|k + K*(yk+1 − ŷ)          (3.36)
Σk+1|k+1 = Σk+1|k − Σxy Σyy^{−1} Σyx
         = Σk+1|k − K* Σyy K*⊤.

Summary of UKF

1. The Unscented Transform (UT) is an alternative to linearization.


It gives a better approximation of the mean and covariance of
the random variable after being transformed using a nonlinear
function than taking the Taylor series approximation.

2. The UKF uses the UT and its sigma points for propagation
of uncertainty through the dynamics (3.32) and observation
nonlinearities (3.36).

11 3.7.4 UKF vs. EKF


As compared to the Extended Kalman Filter, the UKF is a better approximation
for nonlinear systems. Of course, if the system is linear, both EKF
and UKF are equivalent to a Kalman Filter.
In practice, we typically use the UKF with some tuning parameters in
the Unscented Transform as discussed in Section 3.7.2. The
EKF also has tuning parameters: we may wish to perform multiple
updates of the dynamics equations with a smaller time-discretization
before the next observation comes in, to alleviate the effect of linearizing
the dynamics. A well-tuned EKF is often only marginally worse than a
UKF: the former requires us to compute Jacobians at each step which the
latter does not, but the latter is often a more involved implementation.

UKF/EKF approximate filtering distribution as a Gaussian An


important point to remember about both the UKF and EKF is that
even if they can handle nonlinear systems, they still approximate the
filtering distribution

P(xk | y1 , . . . , yk )

as a Gaussian.

8 3.8 Particle Filters (PFs)


9 We next look at particle filters (PFs) which are a generalization of the
10 UKF and can handle non-Gaussian filtering distributions. Just like the UT
11 forms the building block of the UKF, the building block of a particle filter
12 is the idea of importance sampling.

13 3.8.1 Importance sampling


Consider the following problem: given a probability distribution p(x), we
want to approximate it as a sum of Dirac-delta distributions at points x(i),
also called “particles”, each with weight w(i)

p(x) \approx \sum_{i=1}^{n} w^{(i)} \delta_{x^{(i)}}(x).

Say all weights are equal to 1/n. Depending upon how we pick the samples
x(i), we can get very different approximations.

Figure 3.4: Black lines denote particles x(i) , while red and blue curves denote the
approximations obtained using them. If there are a large number of particles in a
given region, the approximated probability density of that region is higher.

We see in Figure 3.4 that depending upon the samples, the


approximated probability distributions p̂(x) can be quite different.
Importance sampling is a technique to sample the particles to ap-
proximate a given probability distribution p(x). The main idea is
to use another known probability distribution, let us call it q(x) to
generate particles x(i) and account for the differences between the
two by assigning weights to each particle

For i = 1, . . . , n,
    x^{(i)} \sim q
    w^{(i)} = \frac{p(x^{(i)})}{q(x^{(i)})}.

The original distribution p(x) is called the “target” and our chosen
distribution q(x) is called the “proposal”. If the number of particles
n is large, we can expect a better approximation of the target density
p(x).

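A minimal numpy sketch of importance sampling: we estimate an expectation under a target p using samples drawn from a proposal q. The particular densities and the expectation chosen here are only for illustration.

import numpy as np

def gauss(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

# Target p(x): a two-component Gaussian mixture. Proposal q(x): one wide Gaussian.
p = lambda x: 0.5 * gauss(x, -2.0, 0.5) + 0.5 * gauss(x, 2.0, 0.5)
q = lambda x: gauss(x, 0.0, 3.0)

n = 10000
x = 3.0 * np.random.randn(n)        # particles x^(i) drawn from the proposal q
w = p(x) / q(x)                     # importance weights w^(i) = p(x^(i)) / q(x^(i))
w /= w.sum()                        # normalize the weights
print(np.sum(w * x**2))             # e.g., a weighted-particle estimate of E_p[x^2]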

1 3.8.2 Resampling particles to make the weights equal


2 A particle filter modifies the weights of each particle as it goes through the
3 dynamics and observation update steps. This often causes some particles
4 to have very low weights and some others to have very high weights.

Figure 3.5: An example run of a particle filter. The robot is shown by the green
dot in the top right. Observations from a laser sensor (blue rays) attached to the
robot measure its distance in a 360-degree field of view around it. Red dots are
particles, i.e., possible locations of the robot that we need in order to compute
the filtering density P(xk | y1 , . . . , yk ). You should think of this picture as being
similar to Problem 1 in Homework 1 where the robot was traveling on a grid. Just
like the the filtering density in Problem 1 was essentially zero in some parts of the
domain, the particles, say in the bottom left, will have essentially zero weights in
a particle filter once we incorporate multiple observations from the robot in top
right. Instead of having to carry around these null particles with small weights,
the resampling step is used to remove them and sample more particles, say in
the top right, where we can benefit from a more accurate approximation of the
filtering density.

The resampling step takes particles \{ (w^{(i)}, x^{(i)}) \}_{i=1}^{n} which approximate
a probability density p(x)

p(x) = \sum_{i=1}^{n} w^{(i)} \delta_{x^{(i)}}(x)

and returns a new set of particles x'^{(i)} with equal weights w'^{(i)} = 1/n
that approximate the same probability density

p(x) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x'^{(i)}}(x).

The goal of the resampling step is to avoid particle degeneracy, i.e.,


remove unlikely particles with very low weights and effectively split
the particles with very large weights into multiple particles.


Consider the weights of particles w(i) arranged in a roulette wheel as
shown above. We perform the following procedure: we start at some
location, say θ = 0, and move along the wheel in random increments
of the angle. After each random increment, we add the corresponding
particle into our set {x'(i)}. Since particles with higher weights take up
a larger angle in the circle, this procedure will often pick those particles
and quickly move across particles with small weights without picking
them too often. We perform this procedure n times for n particles. As an
algorithm (a code sketch is given below):

1. Let r be a uniform random variable in the interval [0, 1/n]. Pick
   c = w(1) and initialize i = 1.

2. For each m = 1, . . . , n, let u = r + (m − 1)/n. Increment
   i ← i + 1 and c ← c + w(i) while u > c, and set the new particle
   location x'(m) = x(i).

It is important to notice that the resampling procedure does not actually
change the locations of particles. Particles with weights much lower than
1/n will be eliminated while particles with weights much higher than 1/n
will be “cloned” into multiple particles, each of weight 1/n.

There are many other methods of resampling. We have discussed here
something known as “low variance resampling”, which is easy to remember and
code up. Fancier resampling methods also change the locations of the particles.
The goal remains the same, namely to eliminate particles with low weights.
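Here is a minimal numpy sketch of the low-variance resampling procedure above; the variable names mirror r, c, u in the algorithm, and the example weights are made up.

import numpy as np

def low_variance_resample(x, w):
    # x: (n, d) particle locations, w: (n,) weights that sum to 1.
    n = len(w)
    r = np.random.uniform(0.0, 1.0 / n)   # single random offset
    c = w[0]                               # running cumulative sum of weights
    i = 0
    x_new = np.empty_like(x)
    for m in range(n):
        u = r + m / n
        while u > c:
            i += 1
            c += w[i]
        x_new[m] = x[i]
    return x_new                           # new particles, each with weight 1/n

# Example: five particles on a line with very unequal weights.
x = np.arange(5.0).reshape(-1, 1)
w = np.array([0.01, 0.01, 0.9, 0.04, 0.04])
print(low_variance_resample(x, w).ravel())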

Figure 3.6: A cartoon depicting resampling. Disregard the different notation in


this cartoon. Resampling does not change the probability distribution that we wish
to approximate; it simply changes the particles and their weights.

1 3.8.3 Particle filtering: the algorithm


2 The basic template of a PF is similar to that of the UKF and involves
3 two steps, the first where we propagate particles using the dynamics to
4 estimate P(xk+1 | y1 , . . . , yk ) and a second step where we incorporate the
5 observation to compute the updated distribution P(xk+1 | y1 , . . . , yk+1 ).
Before we look at the theoretical derivation of a particle filter, it will
help to go through the algorithm as you would implement it on a computer.
We assume that we have access to particles x^{(i)}_{k|k}

P(x_k | y_1, \ldots, y_k) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x^{(i)}_{k|k}}(x),

all with equal weights w^{(i)}_{k|k} = 1/n.

1. Step 1: Propagating the dynamics. Each particle i = 1, . . . , n is updated
   by one timestep

   x^{(i)}_{k+1|k} = f(x^{(i)}_{k|k}, u_k) + \epsilon_k

   where f is the system dynamics with Gaussian noise \epsilon_k \sim N(0, R).
   Weights of particles are unchanged, w^{(i)}_{k+1|k} = w^{(i)}_{k|k} = 1/n.

2. Step 2: Incorporating the observation. Given a new observation y_{k+1}, we
   update the weight of each particle using the likelihood of receiving that
   observation

   w^{(i)}_{k+1|k+1} \propto P(y_{k+1} | x^{(i)}_{k+1|k}) \; w^{(i)}_{k+1|k}.

   Note that P(y_{k+1} | x^{(i)}_{k+1|k}) is a Gaussian and depends upon the
   Gaussian observation noise \nu_{k+1}. The mean of this Gaussian is
   g(x^{(i)}_{k+1|k}) and its variance is equal to Q, i.e.,

   P(y_{k+1} | x^{(i)}_{k+1|k}) = P(\nu_{k+1} \equiv y_{k+1} - g(x^{(i)}_{k+1|k}))
        = \frac{1}{\sqrt{(2\pi)^p \det(Q)}} \exp\left( -\frac{\nu_{k+1}^\top Q^{-1} \nu_{k+1}}{2} \right).

   Normalize the weights w^{(i)}_{k+1|k+1} to sum up to 1.

3. Step 3: Resampling step. Perform the resampling step to obtain new particle
   locations x^{(i)}_{k+1|k+1} with uniform weights w^{(i)}_{k+1|k+1} = 1/n.
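Putting the three steps together, here is a minimal numpy sketch of one particle filter update. The arguments f, g, R, Q are the dynamics, observation map and noise covariances from the text; the resample argument is a resampling routine such as the low-variance resampler sketched earlier; all names are illustrative.

import numpy as np

def particle_filter_step(X, u, y, f, g, R, Q, resample):
    # X: (n, d) particles with equal weights 1/n approximating P(x_k | y_1, ..., y_k).
    n, d = X.shape
    # Step 1: propagate each particle through the dynamics with sampled noise.
    noise = np.random.multivariate_normal(np.zeros(d), R, size=n)
    Xp = np.array([f(X[i], u) for i in range(n)]) + noise
    # Step 2: weight each particle by the likelihood of the observation y.
    # (The Gaussian normalization constant cancels when the weights are normalized.)
    Qinv = np.linalg.inv(Q)
    nu = y - np.array([g(x) for x in Xp])              # innovation for each particle
    w = np.exp(-0.5 * np.einsum('ij,jk,ik->i', nu, Qinv, nu))
    w /= w.sum()
    # Step 3: resample so that all particles have equal weights 1/n again.
    return resample(Xp, w)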

1 3.8.4 Example: Localization using particle filter

[Figure: a sequence of snapshots from a particle filter localization run, cycling through
initialization, observation, weight update, resampling, and motion update steps; the
particle set gradually concentrates around the true pose of the robot.]

6 3.8.5 Theoretical insight into particle filtering


7 Step 1: Propagating the dynamics As we introduced in the section on
8 Markov Decision Processes (MDPs), a stochastic dynamical system

xk+1 = f (xk , uk ) + ϵk

is equivalent to a probability transition matrix xk+1 ∼ P(xk+1 | xk , uk ).


Our goal is to approximate the distribution of xk+1|k using particles.
What proposal distribution should we choose? The “closest” probability
distribution to xk+1|k that we have available is xk|k. So we set

target : P(xk+1 | y1, . . . , yk)
proposal : P(xk | y1, . . . , yk)

In this sense, picking a proposal distribution to draw particles from is like
linearization: the better the match between the proposal and the target, the
fewer samples we need to approximate the target.

Suppose we had performed resampling on our particle set from the
distribution xk|k and have a set of n particles \{x^{(i)}_{k|k}\} with equal weights 1/n

P(x_k | y_1, \ldots, y_k) \approx \frac{1}{n} \sum_{i=1}^{n} \delta_{x^{(i)}_{k|k}}(x).
5 Propagating the dynamics in a PF involves computing importance sam-
6 pling weights. If we had a particle at location x that was supposed to
7 approximate the distribution of xk+1|k , as we saw for importance sampling,
8 its importance weight is the ratio of the target and proposal densities at
9 that location
w_{k+1|k}(x) = \frac{P(x_{k+1} = x \mid y_1, \ldots, y_k)}{P(x_k = x \mid y_1, \ldots, y_k)}.

Let us focus on the numerator. We have

P(x_{k+1} = x \mid y_1, \ldots, y_k) = \int P(x_{k+1} = x, x_k = x' \mid y_1, \ldots, y_k) \, dx'
  = \int P(x_{k+1} = x \mid x_k = x', y_1, \ldots, y_k) \, P(x_k = x' \mid y_1, \ldots, y_k) \, dx'
  = \int P(x_{k+1} = x \mid x_k = x') \, P(x_k = x' \mid y_1, \ldots, y_k) \, dx'
  \approx \int P(x_{k+1} = x \mid x_k = x') \, \frac{1}{n} \sum_{i=1}^{n} \delta_{x^{(i)}_{k|k}}(x') \, dx'
  = \frac{1}{n} \sum_{i=1}^{n} P(x_{k+1} = x \mid x_k = x^{(i)}_{k|k}, u = u_k),

where the system dynamics is f(x_k, u_k) + \epsilon_k and u_k is the control at
time k. The denominator P(x_k = x^{(i)}_{k|k} \mid y_1, \ldots, y_k), when evaluated
at the particles x^{(i)}_{k|k}, is simply 1/n. This gives us the weights

w_{k+1|k}(x) = \sum_{i=1}^{n} P(x_{k+1} = x \mid x_k = x^{(i)}_{k|k}, u = u_k).     (3.37)

Let us now think about what particles we should pick for xk+1|k. We
have from (3.37) a function that lets us compute the correct weight for any
particle we may choose to approximate xk+1|k.
Say we keep the particle locations unchanged, i.e., x^{(i)}_{k+1|k} = x^{(i)}_{k|k}.
We then have

P(x_{k+1} = x \mid y_1, \ldots, y_k) \approx \sum_{i=1}^{n} w_{k+1|k}(x^{(i)}_{k|k}) \, \delta_{x^{(i)}_{k|k}}(x).     (3.38)

? Draw a picture of how this approximation looks.

You will notice that keeping the particle locations unchanged may be a very
poor approximation. After all, the probability density P(xk+1 | y1, . . . , yk)
is large not at the particles x^{(i)}_{k|k} (which were a good approximation of xk|k),
but rather at the transformed locations of these particles, f(x^{(i)}_{k|k}, u_k).
We will therefore update the locations of the particles to be

x^{(i)}_{k+1|k} = f(x^{(i)}_{k|k}, u_k)     (3.39)

with the weight of the ith particle given by

w^{(i)}_{k+1|k} := w_{k+1|k}(x^{(i)}_{k+1|k}) = \sum_{j=1}^{n} P(x_{k+1} = x^{(i)}_{k+1|k} \mid x_k = x^{(j)}_{k|k}, u = u_k)
     \approx P(x_{k+1} = x^{(i)}_{k+1|k} \mid x_k = x^{(i)}_{k|k}, u = u_k).     (3.40)

The approximation in the above equation is very crude: we are essentially
saying that each particle x^{(i)}_{k|k} is transformed independently of the other
particles to a new location x^{(i)}_{k+1|k} = f(x^{(i)}_{k|k}, u_k). This completes the first
step of a particle filter and we have

P(x_{k+1} = x \mid y_1, \ldots, y_k) \approx \sum_{i=1}^{n} w^{(i)}_{k+1|k} \, \delta_{x^{(i)}_{k+1|k}}(x).

12 Step 2: Incorporating the observation The target and proposal distri-


13 butions in this case are

target : P(xk+1 | y1 , . . . , yk , yk+1 )


proposal : P(xk+1 | y1 , . . . , yk ).

Since we have particles x^{(i)}_{k+1|k} with weights w^{(i)}_{k+1|k} for the proposal
distribution obtained from the propagation step, we would now like to update
them to incorporate the latest observation yk+1. Let us imagine for a
moment that the weights w^{(i)}_{k+1|k} are uniform. We would then set weights

w(x) = \frac{P(x_{k+1} = x \mid y_1, \ldots, y_k, y_{k+1})}{P(x_{k+1} = x \mid y_1, \ldots, y_k)}
     \propto \frac{P(y_{k+1} \mid x_{k+1} = x) \, P(x_{k+1} = x \mid y_1, \ldots, y_k)}{P(x_{k+1} = x \mid y_1, \ldots, y_k)}   \text{(by Bayes rule)}
     = P(y_{k+1} \mid x_{k+1} = x)
for each particle x = x^{(i)}_{k+1|k}, to get the approximated distribution as

P(x_{k+1} = x \mid y_1, \ldots, y_{k+1}) \approx \sum_{i=1}^{n} P(y_{k+1} \mid x^{(i)}_{k+1|k}) \, w^{(i)}_{k+1|k} \, \delta_{x^{(i)}_{k+1|k}}(x)     (3.41)

You will notice that the right hand side is not normalized and the distribution
does not integrate to 1 (why? because we did not write the proportionality
constant in the Bayes rule above). This is easily fixed by normalizing the
coefficients P(y_{k+1} \mid x^{(i)}_{k+1|k}) \, w^{(i)}_{k+1|k} to sum to 1 as follows

w^{(i)}_{k+1|k+1} := \frac{P(y_{k+1} \mid x^{(i)}_{k+1|k}) \, w^{(i)}_{k+1|k}}{\sum_j P(y_{k+1} \mid x^{(j)}_{k+1|k}) \, w^{(j)}_{k+1|k}}.

6 Step 3: Resampling step As we discussed in the previous section, after


7 incorporating the observation, some particles may have very small weights.
8 The resampling procedure resamples particles so that all of them have
9 equal weights 1/n.
\left\{ \left( x^{(i)}_{k+1|k+1}, \, 1/n \right) \right\}_{i=1}^{n} = \mathrm{resample}\left( \left\{ \left( x^{(i)}_{k+1|k+1}, \, w^{(i)}_{k+1|k+1} \right) \right\}_{i=1}^{n} \right).

10 3.9 Discussion
11 This brings our study of filtering to a close. We have looked at some of
12 the most important algorithms for a variety of dynamical systems, both
linear and nonlinear. Although we focused on filtering in this chapter,
14 all these algorithms have their corresponding “smoothing” variants, e.g.,
15 you can read about how a typical Kalman smoother is implemented at
16 https://en.wikipedia.org/wiki/Kalman_filter#Fixed-lag_smoother. Filter-
17 ing, and state estimation, is a very wide area of research even today and
18 you will find variants of these algorithms in almost every device which
19 senses the environment.
1 Chapter 4

2 Rigid-body transforms and


3 mapping

Reading
1. LaValle Chapter 3.2 for rotation matrices, Chapter 4.1-4.2 for
quaternions

2. Thrun Chapter 9.1-9.2 for occupancy grids

3. OctoMap: An Efficient Probabilistic 3D Mapping Framework


Based on Octrees
http://www.arminhornung.de/Research/pub/hornung13auro.pdf,
also see https://octomap.github.io.

4. Robot Operating System


http://www.willowgarage.com/sites/default/files/icraoss09-
ROS.pdf, Optional: Lightweight Communications and
Marshalling (LCM) system
https://people.csail.mit.edu/albert/pubs/2010-huang-olson-
moore-lcm-iros.pdf

5. A Perception-Driven Autonomous Urban Vehicle


https://april.eecs.umich.edu/media/pdfs/mitduc2009.pdf

6. Optional reading: Thrun Chapter 10 for simultaneous localiza-


tion and mapping

4 In the previous chapter, we looked at ways to estimate the state of


5 the robot in the physical world. We kept our formulation abstract, e.g.,
6 the way the robot moves was captured by an abstract expression like
7 xk+1 = f (xk , uk ) + ϵ and observations yk = g(xk ) + ν were similarly
opaque. In order to actually implement state estimation algorithms on real
9 robots, we need to put concrete functions in place of f, g.


1 This is easy to do for some robots, e.g., the robot in Problem 1 in


2 Homework 1 moved across cells. Of course real robots are a bit more
3 complicated, e.g., a car cannot move sideways (which is a huge headache
4 when you parallel park). In the first half of this chapter, we will look at
5 how to model the dynamics f using rigid-body transforms.
The story of measurement models and sensors is similar, although we
need to write explicit formulae in place of the abstract function g. In the
8 second half, we will study occupancy grids and dig deeper into a typical
9 state-estimation problem in robotics, namely that of mapping the location
10 of objects in the world around the robot.

11 4.1 Rigid-Body Transformations


Let us imagine that the robot has a rigid body, which we think of as a subset
A ⊂ R2. Say the robot is a disc

A = \{ (x, y) \in \mathbb{R}^2 : x^2 + y^2 \leq 1 \}.


14 This set A changes as the robot moves around, e.g., if the center of mass
15 of the robot is translated by xt , yt ∈ R the set A changes to

A′ = {(x + xt , y + yt ) : (x, y) ∈ A} .

16 The concept of “degrees of freedom” denotes the maximum number of


17 independent parameters needed to completely characterize the transfor-
18 mation applied to a robot. Since the set of allowed values (xt , yt ) is a
19 two-dimensional subset of R2 , then the degrees of freedom available to a
20 translating robot is two.

21

22 As the above figure shows, there are two ways of thinking about this
23 transformation. We can either think of the robot transforming while the
24 co-ordinate frame of the world is fixed, or we can think of it as the robot
25 remaining stationary and the co-ordinate frame undergoing a translation.
26 The second style is useful if you want to imagine things from the robot’s
27 perspective. But the first one feels much more natural and we will therefore
28 exclusively use the first notion.
If the same robot were rotated counterclockwise by some angle
30 θ ∈ [0, 2π], we would map

(x, y) 7→ (x cos θ − y sin θ, x sin θ + y cos θ).


80

1 Such a map can be written as multiplication by a 2×2 rotation matrix


 
cos θ − sin θ
R(θ) = . (4.1)
sin θ cos θ

2 to get    
x cos θ − y sin θ x
= R(θ) .
x sin θ + y cos θ y
3 The transformed robot is thus given by
   
x
A′ = R : (x, y) ∈ A .
y

If we perform both rotation and translation, we can write the transformation
using a single matrix
 
cos θ − sin θ xt
T =  sin θ cos θ yt  (4.2)
0 0 1

6 and this transformation looks like


   
x cos θ − y sin θ + xt x
 x sin θ + y cos θ + yt  = T y  .
1 1

The point (x, y, 1) ∈ R3 is called the homogeneous coordinate corresponding
to (x, y) ∈ R2 and the matrix T is called a homogeneous
transformation matrix. The peculiar name comes from the fact that even
if the matrix T maps rotations and translations of rigid bodies A ⊂ R2, it
is just a linear transformation of the point (x, y, 1) if viewed in the larger
space R3.

It is important to remember that T represents a rotation followed by a translation,
not the other way around.
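A minimal numpy sketch of the 2D homogeneous transformation in (4.2), applied to a few points of a robot body; the example values of θ, xt, yt and the sample points are arbitrary.

import numpy as np

def se2(theta, xt, yt):
    # Homogeneous transform (4.2): rotation by theta followed by translation (xt, yt).
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, xt],
                     [s,  c, yt],
                     [0,  0,  1]])

# Transform a few points of the robot body A, written in homogeneous coordinates.
A = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]]).T          # columns are points (x, y, 1)
T = se2(np.pi / 4, 2.0, 1.0)
A_prime = T @ A                            # transformed body, still in homogeneous form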

13 Rigid-body transformations The transformations R ∈ R2×2 or T ∈


14 R3×3 are called rigid-body transformations. Mathematically, it means
15 that they do not cause the distance between any two points inside the set A
16 to change. Rigid-body transformations are what are called an orthogonal
17 group in mathematics.

18 A group is a mathematical object which imposes certain conditions


19 upon how two operations, e.g., rotations, can be composed together. For
20 instance, if G is the group of rotations, then (i) the composition of two
21 rotations is a rotation, we say that it satisfies closure R(θ1 )R(θ2 ) ∈ G,
22 (ii) rotations are associative

R(θ1 ) {R(θ2 )R(θ3 )} = {R(θ1 )R(θ2 )} R(θ3 ),

23 and, (iii) there exists an identity and inverse rotation

R(0), R(−θ) ∈ G.
81

1 An orthogonal group is a group whose operations preserve distances


in Euclidean space, i.e., if g ∈ G is an element of the group that acts on two
points x, y ∈ Rd, then

∥g(x) − g(y)∥ = ∥x − y∥.

4 If we identify the basis in Euclidean space to be the set of orthonormal


5 vectors {e1 , . . . , ed }, then equivalently, the orthogonal group O(d) is the
6 set of orthogonal matrices

O(d) := O ∈ Rd×d : OO⊤ = O⊤ O = I .




7 This implies that the square of the determinant of any element a ∈ O(d)
is 1, i.e., det(a) = ±1.

? Check that any rotation matrix R belongs to an orthogonal group.
9 The Special Orthogonal Group is a sub-group of the orthogonal group
10 where the determinant of each element is +1. You can see that rotations
11 are a special orthogonal group. We denote rotations of objects in R2 as

SO(2) := R ∈ R2×2 : R⊤ R = RR⊤ = I, det(R) = 1 .



(4.3)

12 Each group element g ∈ SO(2) denotes a rotation of the XY -plane about


13 the Z-axis. The group of 3D rotations is called the Special Orthogonal
14 Group SO(3) and is defined similarly

SO(3) := R ∈ R3×3 : R⊤ R = RR⊤ = I, det(R) = 1 .



(4.4)

The Special Euclidean Group SE(2) is simply a composition of a 2D
rotation R ∈ SO(2) and a 2D translation R2 ∋ v ≡ (xt, yt)

SE(2) = \left\{ \begin{bmatrix} R & v \\ 0 & 1 \end{bmatrix} : R \in SO(2), \; v \in \mathbb{R}^2 \right\} \subset \mathbb{R}^{3\times3}.     (4.5)

The Special Euclidean Group SE(3) is defined similarly as

SE(3) = \left\{ \begin{bmatrix} R & v \\ 0 & 1 \end{bmatrix} : R \in SO(3), \; v \in \mathbb{R}^3 \right\} \subset \mathbb{R}^{4\times4};     (4.6)

again, remember that it is a rotation followed by a translation.

19 4.1.1 3D transformations
20 Translations and rotations in 3D are conceptually similar to the two-
21 dimensional case; however the details appear a bit more difficult because
22 rotations in 3D are more complicated.
Figure 4.1: Any three-dimensional rotation can be described as a sequence of
rotations about each of the cardinal axes. We usually give these specific names:
rotation about the Z-axis is called yaw, rotation about the X-axis is called roll and
rotation about the Y-axis is called pitch. You should commit this picture and these
names to memory because it will be of enormous help to think about these rotations
intuitively.

Here is how I remember these names. Say you are driving a car; usually in robotics we
take the X-axis to be longitudinally forward, the Y-axis is your left hand if you are in
the driver's seat, and the Z-axis points up by the right-hand thumb rule. Roll is what a
dog does when it rolls, it rotates about the X-axis. Pitch is what a plane does when it
takes off, its nose lifts up and it rotates about the Y-axis. Yaw is the one which is not
these two.

Euler angles  We know that a pure counter-clockwise rotation about one
of the axes is written in terms of a matrix, say a yaw of α radians about the
Z-axis

R_z(\alpha) = \begin{bmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{bmatrix}.

Notice that this is a 3×3 matrix that keeps the Z-coordinate unchanged
and only affects the other two coordinates. Similarly we have for pitch (β
about the Y-axis) and roll (γ about the X-axis)

R_y(\beta) = \begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix}, \qquad
R_x(\gamma) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\gamma & -\sin\gamma \\ 0 & \sin\gamma & \cos\gamma \end{bmatrix}.

By convention, a rotation matrix in three dimensions is understood as a
sequential application of rotations, first roll, then pitch, and then yaw

R_{3\times3} = R(\gamma, \beta, \alpha) = R_z(\alpha) R_y(\beta) R_x(\gamma).     (4.7)

The angles (γ, β, α) (in order: roll, pitch, yaw) are called Euler angles.
Imagine how the body frame of the robot changes as successive rotations
are applied. If you were sitting in a car, a pure yaw would be similar to
the car turning left; the Z-axis corresponding to this yaw would however
only be pointing straight up perpendicular to the ground if you had not
performed a roll/pitch before. If you had done so, the Z-axis of the body
frame with respect to the world will be tilted.
Another important thing to note is that a single parameter determines
all possible rotations about one axis, i.e., SO(2). But three Euler angles
are used to parameterize general rotations in three dimensions. You
can watch https://www.youtube.com/watch?v=3Zjf95Jw2UE to get more
intuition about Euler angles.
Rotation matrices to Euler angles  We can back-calculate the Euler
angles from a rotation matrix as follows. Given an arbitrary matrix

R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix},

we set

\alpha = \tan^{-1}(r_{21}/r_{11})
\beta = \tan^{-1}\left( -r_{31} \big/ \sqrt{r_{32}^2 + r_{33}^2} \right)     (4.8)
\gamma = \tan^{-1}(r_{32}/r_{33}).

For each angle, the corresponding quadrant of the Euler angle is determined
using the signs of the numerator and the denominator. So you should use the
function atan2 in Python/C++ to implement these expressions correctly. Notice
that some of the expressions have r11 and r33 in the denominator; this means
that we need r11 = cos α cos β ≠ 0 and r33 = cos β cos γ ≠ 0. A particular
physical rotation can be parameterized in many different ways using Euler angles
(depending upon the order in which roll, pitch and yaw are applied), so the map
from rotation matrices to Euler angles is not unique.

In practice, e.g., if we run a Kalman filter to estimate the Euler angles, we need to be
careful in cases when α, β or γ ≈ π/2. Consider when β crosses π/2, i.e., it goes
from π/2 − ϵ to π/2 + ϵ for some small value of ϵ. In this case, α = tan−1(r21/r11),
assuming r21 > 0, will jump from tan−1(∞) = π/2 to tan−1(−∞) = −π/2, a jump of
180 degrees.

Another classic problem when using Euler angles occurs in what is called “gimbal
lock”. This refers to the situation when one of the angles, say β, equals π/2 (pitch up
by 90 degrees). In this case, the SO(3) rotation matrix is

R = \begin{bmatrix} 0 & 0 & 1 \\ \sin(\alpha + \gamma) & \cos(\alpha + \gamma) & 0 \\ -\cos(\alpha + \gamma) & \sin(\alpha + \gamma) & 0 \end{bmatrix}.

Notice here that changing α (yaw) and γ (roll) have the same effect on the rotation. We
cannot distinguish the effect of yaw from roll if the pitch is 90 degrees, and such a “lock”
persists as long as β = π/2. Such a gimbal lock happened on Apollo 11; the mechanism
that the engineers had designed to flip the orientation by 180 degrees and escape this
degeneracy did not work. These kinds of things make it very cumbersome to work with
Euler angles in computer code. They are best used for visualization.

Homogeneous coordinates in three dimensions  Just like the 2D case,
we can define a 4×4 matrix that transforms points (x, y, z) ∈ R3 to their
new locations after a rotation by Euler angles (γ, β, α) and a translation
by a vector v = (xt, yt, zt) ∈ R3

T = \begin{bmatrix} R(\gamma, \beta, \alpha) & v \\ 0 & 1 \end{bmatrix}.
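A minimal numpy sketch of the two maps (4.7) and (4.8); np.arctan2 plays the role of atan2 and handles the quadrants, and the code does not guard against the degenerate cases discussed above.

import numpy as np

def euler_to_R(roll, pitch, yaw):
    # R = Rz(yaw) Ry(pitch) Rx(roll), eq. (4.7).
    cg, sg = np.cos(roll), np.sin(roll)
    cb, sb = np.cos(pitch), np.sin(pitch)
    ca, sa = np.cos(yaw), np.sin(yaw)
    Rz = np.array([[ca, -sa, 0], [sa, ca, 0], [0, 0, 1]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rx = np.array([[1, 0, 0], [0, cg, -sg], [0, sg, cg]])
    return Rz @ Ry @ Rx

def R_to_euler(R):
    # Back-calculate (roll, pitch, yaw) using eq. (4.8); r_ij is R[i-1, j-1].
    yaw = np.arctan2(R[1, 0], R[0, 0])
    pitch = np.arctan2(-R[2, 0], np.sqrt(R[2, 1]**2 + R[2, 2]**2))
    roll = np.arctan2(R[2, 1], R[2, 2])
    return roll, pitch, yaw

print(R_to_euler(euler_to_R(0.1, -0.2, 0.3)))   # recovers (0.1, -0.2, 0.3)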
0 0 1
R =  sin(α + γ) cos(α + γ) 0 .
17 4.1.2 Rodrigues’ formula: an alternate view of rotations
− cos(α + γ) sin(α + γ) 0
18 Consider a point r(t) ∈ R3 that is being rotated about an axis denoted
19 by a unit vector ω ∈ R3 with an angular velocity of 1 radian/sec. The Notice here that changing α (yaw) and γ (roll)
20 instantaneous linear velocity of the head of the vector is have the same effect on the rotation. We
cannot distinguish the effect of yaw from roll
ṙ(t) = ω × r(t) ≡ ω̂r(t) (4.9) if the pitch is 90 degrees. And such a “lock”
persists until β = π/2. Such a gimbal lock
21 where the × denotes the cross-product of the two vectors a, b ∈ R3 happened on Apollo 11; the mechanism that
  the engineers had designed to flip the
a2 b3 − a3 b2 orientation by 180 degrees and escape this
a × b = a3 b1 − a1 b3  degeneracy did not work.
a1 b2 − a2 b1
These kind of things make it very
cumbersome to work with Euler angles in
computer code. They are best used for
visualization.
84

which we can equivalently denote as a matrix-vector multiplication
a × b = âb where

\hat{a} = \begin{bmatrix} 0 & -a_3 & a_2 \\ a_3 & 0 & -a_1 \\ -a_2 & a_1 & 0 \end{bmatrix}     (4.10)

is a skew-symmetric matrix. The solution of the differential equation (4.9)
at time t = θ is

r(\theta) = \exp(\hat{\omega} \theta) \, r(0)

where the matrix exponential of a matrix A is defined as

\exp(A) = I + A + \frac{A^2}{2!} + \frac{A^3}{3!} + \ldots = \sum_{k=0}^{\infty} \frac{A^k}{k!}.

This is an interesting observation: a rotation about a fixed axis ω by an
angle θ can be represented by the matrix

R = \exp(\hat{\omega} \theta).     (4.11)

You can check that this matrix is indeed a rotation by showing that
R⊤R = I and that det(R) = +1. We can expand the matrix exponential
and collect odd and even powers of ω̂ to get

R = I + \sin\theta \, \hat{\omega} + (1 - \cos\theta) \, \hat{\omega}^2,     (4.12)

which is the Rodrigues’ formula that relates the angle θ and the axis ω to
the rotation matrix. We can also go in the opposite direction, i.e., given a
matrix R calculate what angle θ and axis ω it corresponds to using

\cos\theta = \frac{\mathrm{tr}(R) - 1}{2}, \qquad \hat{\omega} = \frac{R - R^\top}{2 \sin\theta}.     (4.13)

Note that both the above formulae make sense only for θ ≠ 0.

Groups such as SO(2) and SO(3) are topological spaces (viewed as subsets of R^{n^2})
and operations such as multiplication and inverses are continuous functions on these
groups. These groups are also smooth manifolds (a manifold is a generalization of a
curved surface) and that is why they are called Lie groups (after Sophus Lie). Associated
to each Lie group is a Lie algebra, which is the tangent space of the manifold at the
identity. The Lie algebra of SO(3) is denoted by so(3) and likewise we have so(2). In a
sense, the Lie algebra achieves a “linearization” of the Lie group, and the exponential
map undoes this linearization, i.e., it takes objects in the Lie algebra to objects in the
Lie group

exp : so(3) → SO(3).

What we have written in (4.11) is really just this map:

so(n) ∋ ω ≡ ω̂θ,    SO(n) ∋ R = exp(ω) = exp(ω̂θ).

Therefore, if an object whose frame has a rotation matrix R with respect to the origin
were rotating with an angular velocity ω (remember that angular velocity is a vector
whose magnitude is the rate of rotation and whose direction is the axis about which the
object rotates), then the rate of change of R would be given by

Ṙ = ω̂R.

If we were to implement a Kalman filter whose state is the rotation matrix R, then this
would be the dynamics equation and one would typically have an observation for the
velocity ω using a gyroscope.
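A minimal numpy sketch of Rodrigues' formula (4.12) and its inverse (4.13); the axis-angle pair in the example is arbitrary.

import numpy as np

def hat(w):
    # Skew-symmetric matrix of eq. (4.10), so that hat(w) @ v equals np.cross(w, v).
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def axis_angle_to_R(w, theta):
    # Rodrigues' formula (4.12) for a unit axis w.
    W = hat(w)
    return np.eye(3) + np.sin(theta) * W + (1 - np.cos(theta)) * (W @ W)

def R_to_axis_angle(R):
    # Inverse map (4.13); valid only away from theta = 0 (and theta = pi, where sin(theta) = 0).
    theta = np.arccos((np.trace(R) - 1) / 2)
    W = (R - R.T) / (2 * np.sin(theta))
    return np.array([W[2, 1], W[0, 2], W[1, 0]]), theta

w = np.array([0.0, 0.0, 1.0])                 # rotate about the Z-axis
R = axis_angle_to_R(w, np.pi / 3)
print(R_to_axis_angle(R))                     # recovers the axis and the angle pi/3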
4.2 Quaternions

We know two ways to think about rotations: we can either think in terms
of the three Euler angles (γ, β, α), or we can consider a rotation matrix
R ∈ R3×3. We also know ways to go back and forth between these two
forms, with the caveat that solving for Euler angles using (4.8) may be
degenerate in some cases. While rotation matrices are the most general
representation of rotations, using them in computer code is cumbersome
(it is, after all, a matrix of 9 elements). So while we can build an EKF
where the state is a rotation matrix, it would be a bit more expensive to
run. We can also implement the same filter using Euler angles but doing
so will require special care due to the degeneracies.

Quaternions were invented by the British mathematician William Rowan Hamilton
while walking along a bridge with his wife. He was quite excited by this discovery and
promptly graffitied the expression into the stone of the bridge.
Quaternions are a neat way to avoid the problems with both
the rotation matrix and Euler angles: they parametrize the space of
rotations using 4 numbers. The central idea behind quaternions is
Euler’s theorem, which says that any 3D rotation can be considered
as a pure rotation by an angle θ ∈ R about an axis given by the
unit vector ω. This is the result that we also exploited in Rodrigues’
formula.

Figure 4.2: Any rotation in 3D can be represented using a unit vector ω and an
angle θ ∈ R. Notice that there are two ways to encode the same rotation: the unit
vector −ω and angle 2π − θ would give the same rotation. Mathematicians express
this by saying that quaternions are a double-cover of SO(3).

As you see in the adjoining figure, quaternions also have degeneracies, but they
are rather easy ones.

A quaternion q is a four-dimensional vector q ≡ (u0, u1, u2, u3) and
we write it as

q ≡ (u_0, u), \quad \text{or} \quad q = u_0 + u_1 i + u_2 j + u_3 k,     (4.14)

with i, j, k being three “imaginary” components of the quaternion with
“complex-numbers like” relationships

i^2 = j^2 = k^2 = ijk = -1.     (4.15)

It follows from these relationships that

ij = -ji = k, \quad ki = -ik = j, \quad \text{and} \quad jk = -kj = i.

Although you may be tempted to think of them this way, these imaginary
components i, j, k have no relationship with the square roots of negative
unity used to define standard complex numbers. You should simply think
of the quaternion as a four-dimensional vector. A unit quaternion, i.e., one
with

u_0^2 + u_1^2 + u_2^2 + u_3^2 = 1,

is special: unit quaternions can be used to represent rotations in 3D.

12 Quaternion to axis-angle representation The quaternion q = (u0 , u)


corresponds to a counterclockwise rotation of angle θ about a unit
14 vector ω where θ and ω are such that

u0 = cos(θ/2), and u = ω sin (θ/2) . (4.16)


86

1 So given an axis-angle representation of rotation like in Rodrigues’ formula


2 (θ, ω) we can write the quaternion as

q = (cos(θ/2), ω sin(θ/2)) .

3 Using this, we can also compute the inverse of a quaternion (rotation of


4 angle θ about the opposite axis −ω) as

q −1 := (cos(θ/2), −ω sin(θ/2)) .

5 The inverse quaternion is therefore the quaternion where all entries except
6 the first have their signs flipped.

Multiplication of quaternions  Just like two rotation matrices multiply
together to give a new rotation, quaternions are also a representation
for the group of rotations and we can also multiply two quaternions
q1 = (u0, u), q2 = (v0, v) together using the quaternion identities for
i, j, k in (4.15) to get a new quaternion

q_1 q_2 \equiv (u_0, u) \cdot (v_0, v) = (u_0 v_0 - u^\top v, \; u_0 v + v_0 u + u \times v).

Quaternions belong to a larger group than rotations, called the Symplectic Group Sp(1).

13 Pure quaternions A pure quaternion is a quaternion with a zero scalar


14 value u0 = 0. This is very useful to simply store a standard 3D vector
15 u ∈ R3 as a quaternion (0, u). We can then rotate points easily between
16 different frames as follows. Given a vector x ∈ R3 we can form a
17 quaternion (0, x). It turns out that

q · (0, x) · q ∗ = (0, R(q)x). (4.17)

18 where q ∗ = (u0 , −u) is the conjugate quaternion of q = (u0 , u); the


19 conjugate is the same as the inverse for unit quaternions. Notice how
the right-hand side is the vector R(q)x corresponding to the vector x
rotated by the matrix R(q). This is a very useful trick to transform points
22 across coordinate frames instead of multiplying each point x ∈ R3 by the
23 corresponding SE(3) matrix element.
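A minimal numpy sketch of quaternion multiplication, conjugation and the rotation trick in (4.17); quaternions are stored as (u0, u1, u2, u3), and the example rotates a point by 90 degrees about the Z-axis.

import numpy as np

def qmul(q1, q2):
    # (u0, u)(v0, v) = (u0 v0 - u.v, u0 v + v0 u + u x v)
    u0, u = q1[0], q1[1:]
    v0, v = q2[0], q2[1:]
    return np.concatenate(([u0 * v0 - u @ v], u0 * v + v0 * u + np.cross(u, v)))

def qconj(q):
    # Conjugate (equal to the inverse for unit quaternions).
    return np.concatenate(([q[0]], -q[1:]))

def qrotate(q, x):
    # Rotate a 3D point x by the unit quaternion q using q (0, x) q*, eq. (4.17).
    return qmul(qmul(q, np.concatenate(([0.0], x))), qconj(q))[1:]

theta, w = np.pi / 2, np.array([0.0, 0.0, 1.0])
q = np.concatenate(([np.cos(theta / 2)], np.sin(theta / 2) * w))
print(qrotate(q, np.array([1.0, 0.0, 0.0])))    # approximately (0, 1, 0)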

Quaternions to rotation matrix  The rotation matrix corresponding to
a unit quaternion q = (u0, u) is

R(q) = (u_0^2 - u^\top u) \, I_{3\times3} + 2 u_0 \hat{u} + 2 u u^\top
     = \begin{bmatrix}
         2(u_0^2 + u_1^2) - 1 & 2(u_1 u_2 - u_0 u_3) & 2(u_1 u_3 + u_0 u_2) \\
         2(u_1 u_2 + u_0 u_3) & 2(u_0^2 + u_2^2) - 1 & 2(u_2 u_3 - u_0 u_1) \\
         2(u_1 u_3 - u_0 u_2) & 2(u_2 u_3 + u_0 u_1) & 2(u_0^2 + u_3^2) - 1
       \end{bmatrix},     (4.18)

where û is the skew-symmetric matrix of u as defined in (4.10).
Using this you can show the identity that the rotation matrix corresponding to
the product of two quaternions is the product of the individual rotation matrices

R(q_1 q_2) = R(q_1) R(q_2).

Rotation matrix to quaternion  We can also go in the reverse direction.
Given a rotation matrix R, the quaternion is

u_0 = \frac{1}{2} \sqrt{r_{11} + r_{22} + r_{33} + 1}

\text{if } u_0 \neq 0: \quad
u_1 = \frac{r_{32} - r_{23}}{4 u_0}, \quad
u_2 = \frac{r_{13} - r_{31}}{4 u_0}, \quad
u_3 = \frac{r_{21} - r_{12}}{4 u_0}     (4.19)

\text{if } u_0 = 0: \quad
u_1 = \frac{r_{13} r_{12}}{\sqrt{r_{12}^2 r_{13}^2 + r_{12}^2 r_{23}^2 + r_{13}^2 r_{23}^2}}, \quad
u_2 = \frac{r_{12} r_{23}}{\sqrt{r_{12}^2 r_{13}^2 + r_{12}^2 r_{23}^2 + r_{13}^2 r_{23}^2}}, \quad
u_3 = \frac{r_{13} r_{23}}{\sqrt{r_{12}^2 r_{13}^2 + r_{12}^2 r_{23}^2 + r_{13}^2 r_{23}^2}}.

There is little need to memorize these expressions or to try to find the patterns
between them. While building a new code base for your robot, you will usually code up
these formulae once and all your code will use them again and again.

4.3 Occupancy Grids

Rotation matrices and quaternions let us capture the dynamics of a rigid
7 robot body. We will next look at how to better understand observations.

8 What is location and what is mapping? Imagine a robot that is


9 moving around in a house. A natural representation of the state of this
10 robot is the 3D location of all the interesting objects in the room, e.g.,
11 https://www.youtube.com/watch?v=Qe10ExwzCqk. At each time-instant,
12 we record an observation from our sensor (in this case, a camera) that
13 indicates how far an object is from the robot. This helps us discover the
14 location of the objects in the room. After gathering enough observations,
15 we would have created a map of the entire house. This map is the set of
16 positions of all interesting objects in the room. Such a map is called a
17 “feature map”, these are all the green points in the image below
The main point to understand about a feature map is that we can hand
3 over this map to another robot that comes to the same house. The robot
4 compares images from its camera and if it finds one of the objects inside
5 the map, it can get an estimate of its location/orientation in the room with
6 respect to the known location of the object in the map. The map is just
7 a set of “features” that help identify salient objects in the room (objects
8 which can be easily detected in images and relatively uniquely determine
9 the location inside the room). The second robot using this map to estimate
10 its position/orientation in the room is called the localization problem. We
11 already know how to solve the localization problem using filtering.
12 The first robot was solving a harder problem called Simultaneous
13 Localization And Mapping (SLAM): namely that of discovering the location
14 of both itself and the objects in the house. This is a very important and
challenging problem in robotics but we will not discuss it further. MEAM
16 620 digs deeper into it.

In this section, we will solve a part of the SLAM problem,


namely the mapping problem. We will assume that we know the
position/orientation of the robot in the 3D world, and want to build a
map of the objects in the world. We will discuss grid maps, which
are a more crude way of representing maps than feature maps but can
be used easily even if there are lots of objects.

17 Grid maps We will first discuss two-dimensional grid maps, they look
18 as follows.
89

Figure 4.3: A grid map (also called an occupancy grid) is a large gray-scale image,
each pixel represents a cell in the physical world. In this picture, cells that are
occupied are colored black and empty cells represent free space. A grid map is a
useful representation for a robot to localize in this house using observations from
its sensors and comparing those to the map.

1 To get a quick idea of what we want to do, you can watch the mapping
2 being performed in https://www.youtube.com/watch?v=JJhEkIA1xSE.
3 We are interested in learning such maps from the observations that a
4 robot collects as it moves around the physical space. Let us make two
5 simplifying assumptions.

6 Assumption 1: each cell is either free or occupied

This is neat: we can now model each cell as a binary random variable that
indicates occupancy. Let the probability that the cell mi is occupied be
p(mi).

If we have p(mi) = 0, then the cell is not occupied, and if we have
p(mi) = 1, then the cell is occupied. A priori, we do not know the state
of the cell so we will set the prior probability to be p(mi) = 0.5.

15 Assumption 2: the world is static Objects in the world do not move.


This is reasonable if we are interested in building a map of
17 the walls inside the room. Note that it is not a reasonable assumption if
90

1 there are moving people inside the room. We will see a clever hack where
2 the Bayes rule helps automatically disregard such moving objects in this
3 section.

4 Assumption 3: cells are independent of each other This is another


5 drastic simplification. The state of our system is the occupancy of each cell
6 in the grid map. We assume that before receiving any observations, the
7 occupancy of each individual cell is independent; it is a Bernoulli variable
8 with probability 1/2 since we have assumed the prior to be uniform in
9 Assumption 1.

This means that if the cells in the map are denoted by a vector m = (m1, m2, . . .),
then the probability of the cells being occupied/not-occupied can be written
as

p(m) = \prod_i p(m_i).     (4.20)

15 4.3.1 Estimating the map from the data


16 Say that the robot pose (position and orientation) is given by the sequence
17 x1 , . . . , xk . While proceeding along this sequence, the robot receives
18 observations y1 , . . . , yk . Our goal is to estimate the state of each cell
mi ∈ {0, 1} (aka “the map” m = (m1, m2, . . .))

P(m \mid x_1, \ldots, x_k, y_1, \ldots, y_k) = \prod_i P(m_i \mid x_1, \ldots, x_k, y_1, \ldots, y_k).     (4.21)
20 This is called the “static state” Bayes filter and is conceptually exactly the
21 same as the recursive application of Bayes rule in Chapter 2 for detecting
22 whether the door was open or closed.
23 We will use a short form to keep the notation clear

y1:k = (y1 , y2 , . . . , yk );

24 the quantity x1:k is defined similarly. As usual we will use a recursive


Bayes filter to compute this probability as follows.

P(m_i \mid x_{1:k}, y_{1:k})
  \overset{\text{Bayes rule}}{=} \frac{P(y_k \mid m_i, y_{1:k-1}, x_{1:k}) \; P(m_i \mid y_{1:k-1}, x_{1:k})}{P(y_k \mid y_{1:k-1}, x_{1:k})}
  \overset{\text{Markov}}{=} \frac{P(y_k \mid m_i, x_k) \; P(m_i \mid y_{1:k-1}, x_{1:k-1})}{P(y_k \mid y_{1:k-1}, x_{1:k})}
  \overset{\text{Bayes rule}}{=} \frac{P(m_i \mid y_k, x_k) \; P(y_k \mid x_k) \; P(m_i \mid y_{1:k-1}, x_{1:k-1})}{P(m_i \mid x_k) \; P(y_k \mid y_{1:k-1}, x_{1:k})}
  \overset{\text{Independence}}{=} \frac{P(m_i \mid y_k, x_k) \; P(y_k \mid x_k) \; P(m_i \mid y_{1:k-1}, x_{1:k-1})}{P(m_i) \; P(y_k \mid y_{1:k-1}, x_{1:k})}.
     (4.22)

We have a similar expression for the opposite probability

P(\neg m_i \mid x_{1:k}, y_{1:k}) = \frac{P(\neg m_i \mid y_k, x_k) \; P(y_k \mid x_k) \; P(\neg m_i \mid y_{1:k-1}, x_{1:k-1})}{P(\neg m_i) \; P(y_k \mid y_{1:k-1}, x_{1:k})}.

Let us take the ratio of the two to get

\frac{P(m_i \mid x_{1:k}, y_{1:k})}{P(\neg m_i \mid x_{1:k}, y_{1:k})}
  = \frac{P(m_i \mid y_k, x_k)}{P(\neg m_i \mid y_k, x_k)} \; \frac{P(m_i \mid y_{1:k-1}, x_{1:k-1})}{P(\neg m_i \mid y_{1:k-1}, x_{1:k-1})} \; \frac{P(\neg m_i)}{P(m_i)}
  = \underbrace{\frac{P(m_i \mid y_k, x_k)}{1 - P(m_i \mid y_k, x_k)}}_{\text{uses observation } y_k} \; \underbrace{\frac{P(m_i \mid y_{1:k-1}, x_{1:k-1})}{1 - P(m_i \mid y_{1:k-1}, x_{1:k-1})}}_{\text{recursive term}} \; \underbrace{\frac{1 - P(m_i)}{P(m_i)}}_{\text{prior}}.
     (4.23)
This is called the odds ratio. Notice that the first term uses the latest
observation yk, the second term can be updated recursively because it
is a similar expression as the left-hand side, and the third term is a prior
probability of the cell being occupied/not-occupied. Let us rewrite this
formula using the log-odds-ratio, which makes implementing it particularly
easy. The log-odds-ratio of the probability p(x) of a binary variable x is
defined as

l(x) = \log \frac{p(x)}{1 - p(x)}, \qquad \text{and} \qquad p(x) = 1 - \frac{1}{1 + e^{l(x)}}.

The product in (4.23) now turns into a sum

l(m_i \mid y_{1:k}, x_{1:k}) = l(m_i \mid y_k, x_k) + l(m_i \mid y_{1:k-1}, x_{1:k-1}) - l(m_i).     (4.24)

This expression is used to update the occupancy of each cell; a minimal code
sketch of this update is given below. The term

sensor model = l(m_i \mid y_k, x_k)

is different for different sensors and we will investigate it next.

? We assumed that the map was static. Can you think of why (4.24) automatically
lets us handle some moving objects? Think of what the prior odds l(mi) does to the
log-odds-ratio l(mi | y1:k, x1:k).
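A minimal sketch of the log-odds update (4.24) for a single cell. We assume the sensor model returns l(mi | yk, xk) directly, the prior is p(mi) = 0.5 so that l(mi) = 0, and the numerical sensor log-odds values in the example are made up for illustration.

import numpy as np

L_PRIOR = 0.0                      # log-odds of the prior p(m_i) = 0.5

def update_cell(l_cell, l_sensor):
    # Eq. (4.24): recursive log-odds update for one cell.
    return l_sensor + l_cell - L_PRIOR

def probability(l_cell):
    # Recover p(m_i | y_1:k, x_1:k) from the log-odds ratio.
    return 1.0 - 1.0 / (1.0 + np.exp(l_cell))

# A cell observed occupied three times (l_sensor = +0.85) and free once (-0.4).
l = 0.0
for l_sensor in [0.85, 0.85, -0.4, 0.85]:
    l = update_cell(l, l_sensor)
print(probability(l))              # well above 0.5: the cell is likely occupied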
4.3.2 Sensor models
15 Sonar This works by sending out an ultrasonic chirp and measuring the
16 time between emission and reception of the signal. The time gives an
92

1 estimate of the distance of an object to the robot.

3 The figure above shows a typical sonar sensor (the two “eyes”) on a
4 low-cost robot. Data from the sensor is shown on the right, a sonar is a
5 very low resolution sensor and has a wide field of view, say 15 degrees,
6 i.e., it cannot differentiate between objects that are within 15 degrees
7 of each other and registers them as the same point. Sophisticated sonar
8 technology is used today in marine environments (submarines, fish finders,
9 detecting mines etc.).

10 Radar works in much the same way as a sonar except that it uses
11 pulses of radio waves and measures the phase difference between the
12 transmitted and the received signal. This is a very versatile sensor
13 (it was invented by the US army to track planes and missiles during
14 World War II) but is typically noisy and requires sophisticated process-
15 ing to be used for mainstream robotics. Autonomous cars, collision
16 warning systems on human-driven cars, weather sensing, and certainly
17 the military use the radar today. The following picture and the video
18 https://www.youtube.com/watch?v=hwKUcu_7F9E will give you an ap-
19 preciation of the kind of data that a radar records. Radar is a very long
20 range sensor (typically 150 m) and works primarily to detect metallic
21 objects.

22

23 LiDAR LiDAR, which is short for Light Detection and Ranging,


24 (https://en.wikipedia.org/wiki/Lidar) is a portmanteau of light and radar.
25 It is a sensor that uses a pulsed laser as the source of illumination and
26 records the time it takes (nanoseconds typically) for the signal to return
27 to the source. See https://www.youtube.com/watch?v=NZKvf1cXe8s
for how the data from a typical LiDAR (Velodyne) looks. While a
Velodyne contains an intricate system of rotating mirrors and circuitry to
measure the time elapsed, there are new solid-state LiDARs that are rapidly
evolving to match the needs of the autonomous driving industry. Most
LiDARs have a usable range of about 100 m.

[Slide: “Basic Driving”: safe driving by default for various driving conditions;
behaviors naturally emerge from the planning system: slowing down near turns,
yielding and merging into traffic, passing other vehicles, three-point turns to change
direction, parking, etc.]

A typical autonomous car  This is a picture of MIT’s entry named Talos
in the DARPA Urban Challenge (https://en.wikipedia.org/wiki/DARPA_Grand_Challenge_(2007)),
which was a competition where teams had to traverse a 60 mile urban
route within 6 hours, while obeying traffic laws, understanding oncoming
vehicles etc. Successful demonstrations by multiple teams (led by CMU,
Stanford, Virginia Tech's Odin, MIT, Penn and Cornell) in this competition
jump-started the wave of autonomous driving. While the number of sensors
necessary to drive well has come down (Tesla famously does not like
to use LiDARs and relies exclusively on cameras and radars), the type of
sensors and challenges associated with them remain essentially the same.

(Margin figure: Waymo’s autonomous car.)

17 4.3.3 Back to sensor modeling


18 Let us go back to understanding our sensor model l(mi | yk , xk ) where mi
19 is a particular cell of the occupancy grid, yk and xk are the observations
20 and robot position/orientation at time k.
94

Figure 4.4: Model for sonar data. (Top) A sonar gives one real-valued reading
corresponding to the distance measured along the red axis. (Bottom) if we travel
along the optical axis, the occupancy probability P(mi | yk = z, xk ) can be
modeled as a spike around the measured value z. It is very important to remember
that range sensors such as a sonar give us three kinds of information about this ray:
(i) all parts of the environment up to ≈ z are unoccupied (otherwise we would
not have recorded z), (ii) there is some object at z which resulted in the return,
(iii) but we do not know anything about what is behind z. So incorporating a
measurement yk from a sonar/radar/lidar involves not just updating the cell which
corresponds to the return, but also updating the occupancy probabilities of every
grid cell along the axis.
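A minimal sketch of incorporating one range return into a 2D grid along the lines of Figure 4.4: cells between the sensor and the return are pushed towards free, and the cell at the return is pushed towards occupied. The grid resolution and the log-odds increments are assumptions, and the simple line-stepping stands in for a proper ray-tracing routine such as Bresenham's algorithm.

import numpy as np

def ray_update(logodds, x0, y0, x1, y1, res=0.05, l_free=-0.4, l_occ=0.85):
    # logodds: 2D array of cell log-odds; (x0, y0) sensor and (x1, y1) return, in meters.
    dist = np.hypot(x1 - x0, y1 - y0)
    n_steps = max(int(dist / res), 1)
    for t in np.linspace(0.0, 1.0, n_steps, endpoint=False):
        i = int((x0 + t * (x1 - x0)) / res)
        j = int((y0 + t * (y1 - y0)) / res)
        logodds[i, j] += l_free                          # cells before the return are likely free
    logodds[int(x1 / res), int(y1 / res)] += l_occ       # the cell of the return is likely occupied
    return logodds

grid = np.zeros((200, 200))              # a 10 m x 10 m grid at 5 cm resolution
grid = ray_update(grid, 1.0, 1.0, 4.0, 3.0)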

Figure 4.5: (Left) A typical occupancy grid created using a sonar sensor by
updating the log-odds-ratio l(mi | x1:k , y1:k ) for all cells i for multiple time-steps
k. At the end of the map building process, if l(mi | x1:k , y1:k ) > 0 for a particular
cell, we set its occupancy to 1 and to zero otherwise, to get the maximum-likelihood
estimate of the occupancy grid on the right.

1 LiDAR model When we say that a LiDAR is a more accurate sensor


2 than the sonar, what we really mean is that the sensor model P(mi | yk , xk )
3 looks as follows.
95

2 As a result, we can create high-resolution occupancy grids using a LiDAR.

? How will you solve the localization problem given the map? In other words, if we
know the occupancy grid of a building as estimated in a prior run, and we now want to
find the position/orientation of the robot traveling in this building, how should we use
these sensors?

5 4.4 3D occupancy grids


6 Two-dimensional occupancy grids are a fine representation for toy problems
7 but they run into some obvious issues. Since the occupancy grid is a “top
8 view” of the world, we cannot represent non-trivial objects in it correctly
9 (a large tree with a thin trunk eats up all the free space). We often desire a
10 fundamentally three-dimensional representation of the physical world.
96

2 We could simply create cells in 3D space and our method for occupancy
3 grid would work but this is no longer computationally cheap. For instance,
4 if we want to build a map of Levine Hall (say 100 m × 100 m area and
height of 25 m), a 3D grid map with a resolution of 5 cm × 5 cm × 5 cm
6 would have about 2 billion cells (if we store a float in each cell this map will
7 require about 8 GB memory). It would be cumbersome to carry around
8 so many cells and update their probabilities after each sensor reading (a
9 Velodyne gives data at about 30 Hz). More importantly, observe that most
10 of the volume inside Levine is free space (inside of offices, inner courtyard
11 etc.) so we do not really need fine resolution in those regions.

12 Octrees We would ideally have an occupancy grid whose resolution


13 adapts with the kind of objects that are detected by the sensors. If nearby
14 cells are empty we want to collapse them together to save on memory and
15 computation, on the other hand, if nearby cells are all occupied, we want
16 to refine the resolution in that area so has to more accurately discern the
17 shape of the underlying objects. Octrees are an efficient representation for
18 3D volumes.

19

20 An octree is a hierarchical data structure that recursively sub-divides the


21 3D space into octants and allocates volumes as needed for a particular data
22 point observed by a range sensor. It is analogous to a kd-tree. Imagine if
23 the entire space in the above picture were empty (the tree only has a root
24 node), and we receive a reading corresponding to the dark shaded region.
25 An octree would sub-divide the space starting from the root (each node
in the tree is the parent of its eight child octants) recursively
until some pre-determined minimum resolution is reached. This leaf
node is a grid cell; notice how different cells in the octree have different
resolutions. Occupancy probabilities of each leaf node are updated using
the same formula as that of (4.24). A key point here is that octrees are
designed for accurate sensors (LiDARs) where there is not much noise
in the observations returned by the sensor (and thereby we do not refine
unnecessary parts of the space).
Octrees are very efficient at storing large maps; I expect you can store
the entire campus of Penn in about a gigabyte. Ray tracing (following all
the cells mi in the tree along the axis of the sensor in Figure 4.4) is harder
in this case but there are efficient algorithms devised for this purpose.
An example OctoMap (an occupancy map created using an Octree) of a
building on the campus of the University of Freiburg is shown below.

You can find LiDAR maps of the entire United States (taken from a plane) at
https://www.usgs.gov/core-science-systems/ngp/3dep

12

13 4.5 Local Map


14 In this chapter, we primarily discussed occupancy grids of static environ-
15 ments as the robot moves around in the environment. The purpose of doing
16 so is localization, namely, finding the pose of the robot by comparing
17 the observations of the sensors with the map (think of the particle filter
18 localization example in Chapter 3). In typical problems, we often maintain
19 two kinds of maps, (i) a large occupancy grid for localization (say as big
20 as city), and (ii) another smaller map, called the local map, that is used to
21 maintain the locations of objects (typically objects that can move) in the
22 vicinity of the robot, say a 100 m × 100 m area.
23 The local map is used for planning and control purposes, e.g., to check
that the planned trajectory of the robot does not collide with any known
25 obstacles. See an example of the local map at the 1:42 min mark at
26 https://www.youtube.com/watch?v=2va15BE-7lQ. Some people also call
27 the local map a “cost map” because occupied cells in the local map indicate
28 a high collision cost of moving through that cell. The local map is typically
98

1 constructed in the body frame and evolves as the robot moves around
2 (objects appear in the front of the robot and are spawned in the local map
3 and disappear from the map at the back as the robot moves forward).

You should think of the map (and especially the local map) as the
filtering estimate of the locations of various objects in the vicinity of
the robot computed on the basis of multiple observations received
from the robot’s sensors.

Figure 4.6: The output of perception modules for a typical autonomous vehicle
(taken from https://www.youtube.com/watch?v=tiwVMrTLUWg). The global
occupancy grid is shown in gray (see the sides of the road). The local map is
not shown in this picture but you can imagine that it has occupied voxels at all
places where there are vehicles (purple boxes) and other stationary objects such
as traffic light, nearby buildings etc. Typically, if we know that so and so voxel
corresponds to a vehicle, we run an Extended Kalman Filter for that particular
vehicle to estimate the voxels in the local map that it is likely to be in, in the
next time-instant. The local map is a highly dynamic data structure that is rich in
information necessary for planning trajectories of the robot.

4 4.6 Discussion
5 Occupancy grids are a very popular approach to represent the environment
6 given the poses of the robot as it travels in this environment. We can also
7 use occupancy grids to localize the robot in a future run (which is usually
8 the purpose of creating them). Each cell in an occupancy grid stores the
9 posterior probability of the cell being occupied on the basis of multiple
10 observations {y1 , . . . , yk } from respective poses {x1 , . . . , xk }. This is
11 a very efficient representation of the 3D world around us with the one
12 caveat that each cell is updated independently of the others. But since
13 one gets a large amount of data from typical range senors (a 64 beam
14 Velodyne (https://velodynelidar.com/products/hdl-64e) returns about a
15 2 million points/sec and cheaper versions of this sensor will cost about
16 $100), this caveat does not hurt us much in practice. You can watch this talk
99

1 (https://www.youtube.com/watch?v=V8JMwE_L5s0) by the head of Uber’s


2 autonomous driving group to get more perspective about localization and
3 mapping.
1 Chapter 5

2 Dynamic Programming

Reading
1. (Thrun) Chapter 15

2. (Sutton & Barto) Chapters 3–4

3. Optional: (Bertsekas) Chapter 1 and 4

This is the beginning of Module 2; this module is about “how to act”.


4 The first module was about “how to sense”. The prototypical problem in
5 the first module was how to assimilate the information gathered by all the
6 sensors into some representation of the world. In the next few lectures,
7 we will assume that this representation is good, that it is accurate in terms
8 of its geometry (small variance of the occupancy grid) and in terms of
9 its information (small innovation in the Kalman filter etc.). Let us also
10 assume that it has all the necessary semantics, e.g., objects are labeled as
11 cars, buses, pedestrians etc (we will talk about how to do this in Module
12 4).
13 The prototypical problem investigated in the next few chapters is how
14 to move around in this world, or affect the state of this world to achieve a
15 desired outcome, e.g., drive a car from some place A to another place B.

16 Our philosophy about notation Material on Dynamic Programming


17 and Reinforcement Learning (RL), which we will cover in the following
18 chapters, contains a lot of tiny details (much more than other areas in
19 robotics/machine learning). These details are usually glossed over in most
20 treatments. In the interest of simplicity, other courses or most research
21 papers these days, develop an imprecise notation and terminology to focus
22 on the problem. However, these details of RL matter enormously when
23 you try to apply these techniques to real-world problems. Not knowing all
24 the details or using imprecise terminology to think about RL is unlikely to
25 make us good at real-world applications.

100
101

1 For this reason, the notation and the treatment in this chapter, and the
2 following ones, will be a bit pedantic. We will see complicated notation
3 and terminology for quantities, e.g., the value function, that you might
4 see being written very succinctly in other places. We will mostly follow
5 the notation of Dmitri Bertsekas’ book on “Reinforcement Learning and
6 Optimal Control” (http://www.mit.edu/ dimitrib/RLbook.html). You will
7 get used to the extra notation and it will become second nature once you
8 become more familiar with the concepts.

9 5.1 Formulating the optimal control problem


10 Let us denote the state of a robot (and the world) by xk ∈ X ⊂ Rn at the
11 k th timestep. We can change this state using a control input uk ∈ U ⊂ Rp
12 and this change is written as

xk+1 = fk (xk , uk ) (5.1)

13 for k = 0, 1, . . . , T − 1 starting from some initial given state x0 . This is


a deterministic nonlinear dynamical system (no noise ϵ in the dynamics).
15 We will let the dynamics fk also be a function of time k. The time T is
16 some time-horizon up to which we care about running the system. The
17 state-space is X (which we will assume does not change with time k) and
18 the control-space is U .
Recall that we can safely assume that the system is Markov. The
20 reason for it is as follows. If it is not, and say if xk+1 depends upon
21 both xk and the previous step xk−1 , then we can expand the state-space
22 to write a new dynamics in the expanded state-space. We will follow
23 a similar program as that of Module 1: we first describe very general
24 algorithms (dynamic programming) for general systems (Markov Decision
25 Processes), then specialize our methods to a restricted class of systems that
26 are useful in practice (linear dynamical systems) and then finally discuss a
27 very general class of systems again with more sophisticated algorithms
28 (motion-planning).

The central question in this chapter is how to pick a control uk .


We want to pick controls that lead to desirable trajectories of the system, e.g., ones that result in a parallel-parked car at time T and do not collide with any other object for all times k ∈ {1, 2, . . . , T}. We may also want to minimize some chosen quantity, e.g., when you walk to School, you find a trajectory that avoids a certain street with a steep uphill.

Finite, discrete state and control-space In this chapter we will only consider problems with finitely-many states and controls, i.e., we will assume that the state-space X and the control-space U are finite, discrete sets.

1 Run-time cost and terminal cost We will take a very general view of
2 the above problem and formalize it as follows. Consider a cost function

qk (xk , uk ) ∈ R

3 which gives a scalar real-valued output for every pair (xk , uk ). This
4 models the fact that you do not want to walk more than you need to get to
School, i.e., we would like to minimize qk. You also want to make sure the trajectory actually reaches the lecture venue; we write this down as another cost qf (xT ). We want to pick control inputs (u0 , u1 , . . . , uT −1 ) such that
$$J(x_0; u_0, u_1, \ldots, u_{T-1}) = q_f(x_T) + \sum_{k=0}^{T-1} q_k(x_k, u_k) \qquad (5.2)$$

is minimized. The cost qf (xT ) is called the terminal cost; it is high if xT is not the lecture room and small otherwise. The cost qk is called the run-time cost; it is high, for instance, if you have to use large control inputs, e.g., if xk is a climb.

The optimal control problem Given a system xk+1 = fk (xk , uk ),


we want to find control sequences that minimize the total cost J
above, i.e., we want to solve

$$J^*(x_0) = \min_{u_k \in U,\; k=0,\ldots,T-1} J(x_0; u_0, \ldots, u_{T-1}) \qquad (5.3)$$

It is important to realize that the function J(x0 ; u0 , . . . , uT −1 ) de-


pends upon an entire sequence of control inputs and we need to find
them all to find the optimal cost J ∗ (x0 ) of, say reaching the School
from your home x0 .

13 5.2 Dijkstra’s algorithm


14 If the state-space X and control-space U are discrete and finite sets, we
15 can solve (5.3) as a shortest path problem using very fast algorithms.
16 Consider the following picture. This is what would be called a transition
17 graph for a deterministic finite-state dynamics.

Figure 5.1: Transition graph for Dijkstra’s algorithm

1 The graph has one source node x0 . Each node in the graph is xk , each
2 edge depicts taking a certain control uk . Depending on which control we
3 pick, we move to some other node xk+1 given by the dynamics f (xk , uk ).
4 Note that this is not a transition like that of a Markov chain, everything is
5 deterministic in this graph. On each edge we write down the cost

cost(xk , xk+1 ) := qk (xk , uk )

where xk+1 = fk (xk , uk ), and "close" the graph with an artificial terminal node (sink), with the cost qf (xT ) on every edge leading to it.
Minimizing the cost in (5.3) is now the same as finding the shortest path in this graph from the source to the sink. The algorithm to do so is quite simple and is called Dijkstra's algorithm after Edsger Dijkstra, who used it around 1956 as a test program for a new computer named ARMAC (http://www-set.win.tue.nl/UnsungHeroes/machines/armac.html).

14 1. Let Q be the set of nodes that are currently unvisited; all nodes in
15 the graph are added to it at the beginning. S is an empty set. An
16 array called dist maintains the distance of every node in the graph
17 from the source node x0 . Initialize dist(x0 ) = 0 and dist = ∞ for
18 all other nodes.

2. At each step, if Q is not empty, pop the node v ∈ Q with v ∉ S that has the smallest dist(v). Add v to S. Update the dist of all nodes u connected to v. For each u, if

dist(u) > dist(v) + cost(u, v)

update the distance of u to be dist(v) + cost(u, v). If the above condition is not true, do nothing.

? Shortest path algorithms do not work if there are cycles in the graph because the shortest path is not unique. Are there cycles in the above graph?

The algorithm terminates when the set Q is empty.
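To make the steps above concrete, here is a minimal sketch of Dijkstra's algorithm in Python using a priority queue (the heapq module); the graph, the node names and the edge costs below are made up purely for illustration.

```python
import heapq

def dijkstra(source, edges):
    """Shortest paths from `source` on a directed graph.
    `edges[v]` is a list of (u, cost) pairs for edges v -> u."""
    dist = {v: float('inf') for v in edges}
    dist[source] = 0.0
    visited = set()                      # the set S in the text
    pq = [(0.0, source)]                 # priority queue keyed by dist
    while pq:
        d, v = heapq.heappop(pq)
        if v in visited:
            continue
        visited.add(v)
        for u, c in edges[v]:
            if dist[u] > d + c:          # the update condition in step 2
                dist[u] = d + c
                heapq.heappush(pq, (dist[u], u))
    return dist

# A small made-up transition graph; 'goal' plays the role of the sink.
edges = {
    'x0': [('a', 1.0), ('b', 4.0)],
    'a':  [('b', 2.0), ('goal', 6.0)],
    'b':  [('goal', 1.0)],
    'goal': [],
}
print(dijkstra('x0', edges))   # {'x0': 0.0, 'a': 1.0, 'b': 3.0, 'goal': 4.0}
```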


25 You might know that there are many other variants of Dijkstra’s
26 algorithm, e.g., the A∗ algorithm that are quicker to find shortest paths.
27 We will look at some of these in the next chapter.
? What should one do if the state/control
space is not finite? Can we still use Dijkstra’s
algorithm?

The quantity dist is quite special: observe that after Dijkstra's algorithm finishes running and the set Q is empty, the dist function gives the optimal cost of going from the source node to each node in the graph. We wanted to only find the cost to go from the source x0 to the sink node but ended up computing the cost from the source to every node in the graph.

1 5.2.1 Dijkstra’s algorithm in the backwards direction


We can run Dijkstra's algorithm in the backwards direction to get the same answer as well. The sets Q and S are initialized as before. In this case we will let dist(v) denote the distance of a node v to the sink node. The algorithm proceeds in the same fashion: it pops a node v ∈ Q, v ∉ S and updates the dist of all nodes u connected to v. For each u, if

dist(u) > dist(v) + cost(u, v)

then we update dist(u) to be the right-hand side of this inequality. Running Dijkstra's algorithm in reverse (from the sink to the source) is completely equivalent to running it in the forward direction (from the source to the sink).

If Dijkstra's algorithm (forwards or backwards) is run on a graph with n vertices and m edges, its computational complexity is O(m + n log n) if we use a priority queue to find the node v ∈ Q, v ∉ S with the smallest dist. The number of edges in the transition graph in Figure 5.1 is m = O(T |X|).
10 5.3 Principle of Dynamic Programming

The principle of dynamic programming is a formalization of


the idea behind Dijkstra’s algorithm. It was discovered by Richard
Bellman in the 1940s. The idea behind dynamic programming is
quite intuitive: it says that the remainder of an optimal trajectory is
optimal.

11 We can prove this as follows. Suppose that we find the optimal control
12 sequence (u∗0 , u∗1 , . . . , u∗T −1 ) for the problem in (5.3). Our system is
13 deterministic, so this control sequence results in a unique sequence of states
14 (x0 , x∗1 , . . . , x∗T ). Each successive state is given by x∗k+1 = fk (x∗k , u∗k )
15 with x∗0 = x0 . The principle of optimality, or the principle of dynamic
16 programming, states that if one starts from a state x∗k at time k and wishes
17 to minimize the “cost-to-go”
$$q_f(x_T) + q_k(x_k^*, u_k) + \sum_{i=k+1}^{T-1} q_i(x_i, u_i)$$

18 over the (now assumed unknown) sequence of controls (uk , uk+1 , . . . , uT −1 ),


19 then the optimal control sequence for this truncated problem is exactly
20 (u∗k , . . . , u∗T −1 ).
21 The proof of the above assertion is an easy case of proof by contradic-
22 tion: if the truncated sequence were not optimal starting from x∗k there

1 exists some other optimal sequence of controls for the truncated problem,
2 say (vk∗ , . . . , vT∗ −1 ). If so, the solution of the original problem where one
3 takes controls vk∗ from this new sequence for time-steps k, k + 1, . . . , T − 1
4 would have a lower cost. Hence the original sequence of controls would
5 not have been optimal.

Principle of dynamic programming. The essence of dynamic


programming is to solve the larger, original problem by sequentially
solving the truncated sub-problems. At each iteration, Dijkstra’s
algorithm constructs the functions

JT∗ (xT ), JT∗ −1 (xT −1 ), . . . , J0∗ (x0 )

starting from JT∗ and proceeding backwards to JT∗ −1 , JT∗ −2 . . .. The


function JT∗ −k (v) is just the array dist(v) at iteration k of the back-
wards implementation of Dijkstra’s algorithm. Mathematically,
dynamic programming looks as follows.

1. Initialize JT∗ (x) = qf (x) for all x ∈ X.

2. For iteration k = T − 1, . . . , 0, set

$$J_k^*(x) = \min_{u_k \in U} \left\{ q_k(x, u_k) + J_{k+1}^*(f_k(x, u_k)) \right\} \qquad (5.4)$$

for all x ∈ X.

6 After running the above algorithm we have the optimal cost-to-go J0∗ (x)
7 for each state x ∈ X, in particular, we have the cost-to-go for the initial
8 state J0∗ (x0 ). If we remember the minimizer u∗k in (5.4) while running the
9 algorithm, we also have the optimal sequence (u∗0 , u∗1 , . . . , u∗T −1 ). The
10 function J0∗ (x) (often shortened to simply J ∗ (x)) is the optimal cost-to-go
11 from the state x ∈ X.
12 Again, we really only wanted to calculate J0∗ (x0 ) but had to do all this
13 extra work of computing Jk∗ for all the states.
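Because everything here is a finite table, the backward recursion in (5.4) is only a few lines of code. The following is a minimal sketch on a randomly generated problem; the sizes, the dynamics f and the costs q, qf are made up, and the run-time cost is taken to be the same at every time-step only for brevity.

```python
import numpy as np

# States and controls are indexed 0..n-1 and 0..m-1; f[x, u] gives the next state.
n, m, T = 5, 2, 10
rng = np.random.default_rng(0)
f = rng.integers(0, n, size=(n, m))      # deterministic dynamics x' = f(x, u)
q = rng.uniform(0, 1, size=(n, m))       # run-time cost q(x, u)
qf = rng.uniform(0, 1, size=n)           # terminal cost q_f(x)

J = np.zeros((T + 1, n))
policy = np.zeros((T, n), dtype=int)
J[T] = qf                                # step 1: J_T^*(x) = q_f(x)
for k in range(T - 1, -1, -1):           # step 2: backward recursion (5.4)
    Q = q + J[k + 1][f]                  # Q[x, u] = q(x, u) + J_{k+1}^*(f(x, u))
    policy[k] = Q.argmin(axis=1)         # minimizing control u_k^*(x)
    J[k] = Q.min(axis=1)                 # J_k^*(x)

print(J[0])                              # optimal cost-to-go from every initial state
```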

14 Curse of dimensionality What is the complexity of running dynamic


15 programming? The cost of the minimization over U is O(|U |), it is a
16 bunch of comparisons between floats. The number of operations at each
17 iteration for setting the values Jk∗ (x) for all x ∈ X is |X|. So the total
18 complexity is O(T |X| |U |).
19 The terms |X| and |U | are often the hurdle in implementing dynamic
20 programming or any variant of it. Think of the grid-world in Problem 1 in
21 HW 1, it had 200×200 cells which amounts to |X| = 40, 000. This may
22 seem a reasonable number but it explodes quickly as the dimensionality
23 of the state-space grows. For a robot manipulator with six degrees-of-
24 freedom, if we discretize each joint angle into 5 degree cells, the number
25 of states is |X| ≈ 140 billion. The number of states |X| is exponential in

1 the dimensionality of the state-space and dynamic programming quickly


2 becomes prohibitive beyond 4 dimensions or so. Bellman called this the
3 curse of dimensionality.

Cost of dynamic programming is linear in the time-horizon Notice a very important difference between (5.4)

$$J_k^*(x) = \min_{u_k \in U} \left\{ q_k(x, u_k) + J_{k+1}^*(f_k(x, u_k)) \right\}$$

for iterations k = T − 1, . . . , 0 and (5.3)

$$J^*(x_0) = \min_{u_k \in U,\; k=0,\ldots,T-1} J(x_0; u_0, \ldots, u_{T-1}).$$

The latter has a minimization over a sequence of controls (u0 , u1 , . . . , uT −1 ) while the former has a minimization over only the control uk at time k, repeated over T iterations. The former is much, much easier to solve because it is a sequence of O(T) smaller optimization problems: it is far easier to compute the minimization over uk ∈ U for each state x separately than to solve the gigantic minimization problem in (5.3), because in the latter case the variable of optimization is the entire control trajectory and has size |U|^T.

? The principle of dynamic programming gives us a way to solve an optimization problem (5.3) over a really large space (the space of all control trajectories) using a number of optimization problems (5.4) that is linear in the time-horizon. Can we split any optimization problem into sub-problems like this?

14 Dynamic programming and Viterbi’s algorithm We have seen the


15 principle of dynamic programming in action before in Viterbi’s algorithm
16 in Chapter 2. The transition graph in Figure 5.1 is the same as the Trellis
17 graph for Viterbi’s algorithm, the run-time cost was

qk (xk , uk ) := − log P(Yk | Xk ) − log P(Xk+1 | Xk )

18 and instead of a terminal cost qf , we had an initial cost − log P(X1 ).


Viterbi's algorithm computed the most likely path given observations of the HMM, i.e., the path (X1 , . . . , XT ) that maximizes the probability P(X1 , . . . , XT | Y1 , . . . , YT ) is simply the solution of dynamic programming for the Trellis graph.

? How should one modify dynamic programming if we have a non-additive cost, e.g., the runtime cost at time k given by qk is a function of both xk and xk−1?

23 5.3.1 Q-factor
24 The quantity

$$Q_k^*(x, u) := q_k(x, u) + J_{k+1}^*(f_k(x, u))$$

is called the Q-factor. It is simply the expression that is minimized on the right-hand side of (5.4) and denotes the cost-to-go if control u was picked at state x (corresponding to cost qk (x, u)) and the optimal control trajectory was followed after that (corresponding to cost J∗k+1(fk (x, u)) from state x′ = fk (x, u)). This nomenclature was introduced by Watkins in his thesis.

1 Q-factors and the cost-to-go are equivalent ways of thinking about


2 dynamic programming. Given the Q-factor, we can obtain the cost-to-go
3 Jk∗ as
$$J_k^*(x) = \min_{u_k \in U} Q_k^*(x, u_k), \qquad (5.5)$$

which is precisely the dynamic programming update (by definition) in (5.4).


5 We can also write dynamic programming completely in terms of Q-factors
6 as follows.

Dynamic programming written in terms of the Q-factor

1. Initialize Q∗T (x, u) = qf (x) for all x ∈ X and all u ∈ U .

2. For iteration k = T − 1, . . . , 0, set

$$Q_k^*(x, u) = q_k(x, u) + \min_{u' \in U} Q_{k+1}^*(f_k(x, u), u'). \qquad (5.6)$$

for all x ∈ X and all u ∈ U .

As yet, it may seem unnecessary to think of the Q-factor (which is


a larger array with |X| × |U | entries) instead of the cost-to-go (which
only has |X| entries in the array).

Value function The following terminology is commonly used in


the literature

value function ≡ cost-to-go J ∗ (x)


action-value function ≡ Q-factor Q∗ (x, u).

Since the two functions are equivalent, we will call both as “value
functions”. The difference will be clear from context.

7 5.4 Stochastic dynamic programming: Value


8 Iteration
9 Let us now see how dynamic programming looks for a Markov Decision
10 Process (MDP). As we saw in Chapter 3, we can think of MDPs as
11 stochastic dynamical systems denoted by

xk+1 = fk (xk , uk ) + ϵk ; x0 is given.

12 We will assume that we know the statistics of the noise ϵk at each time-step
13 (say it is a Gaussian). Stochastic dynamical systems are very different from
14 deterministic dynamical systems, given the same sequence of controls
15 (u0 , . . . , uT −1 ), we may get different state trajectories (x0 , x1 , . . . , xT )

1 depending upon the realization of noise (ϵ0 , . . . , ϵT −1 ). How should


2 we find a good control trajectory then? One idea is to modify (5.3) to
3 minimize the expected value of the cost over all possible state-trajectories
$$J(x_0; u_0, \ldots, u_{T-1}) = \mathop{\mathrm{E}}_{(\epsilon_0, \ldots, \epsilon_{T-1})}\left[ q_f(x_T) + \sum_{k=0}^{T-1} q_k(x_k, u_k) \right] \qquad (5.7)$$

4 Suppose we minimized the above expectation and obtained the value


function J∗(x0) and the optimal control trajectory (u∗0 , . . . , u∗T −1 ). As the robot starts executing this trajectory, the realized versions of the noise ϵk might differ a lot from their expected value, and the robot may find itself in very different states xk than the average-case states considered in (5.10).

Draw the picture of a one-dimensional stochastic dynamical system (random walk on a line) and see that the realized trajectory of the system can be very different from the average trajectory.

10 Feedback controls The concept of feedback control is a powerful way


11 to resolve this issue. Instead of seeking u∗k ∈ U as the solutions of (5.10),
12 we instead seek a function

uk (x) : X 7→ U (5.8)

13 that maps the state-space X to a control U . Effectively, given a feedback


14 control uk (x) the robot knows what control to apply at its current realized
15 state xk ∈ X, namely uk (xk ), even if the realized state xk is very different
16 from the average-case state. Feedback controls are everywhere and are
17 critical to using controls in the real world. For instance, when you tune
18 the shower faucet to give you a comfortable water temperature, you
19 are constantly estimating the state (feedback using the temperature) and
20 turning the faucet accordingly. Doing this without feedback would leave
21 you terribly cold or scalded. We will denote the space of all feedback
22 controls uk (·) that depend on the state x ∈ X by

uk (·) ∈ U(X).

23 Control policy A sequence of feedback controls

π = (u0 (·), u1 (·), . . . , uT −1 (·)). (5.9)

is called a control policy. This is an object that we will talk about often. It is important to remember that a control policy is a set of controllers (usually feedback controls) that are executed at each time-step of a dynamic programming problem.

The stochastic optimal control problem finds a sequence of feedback controls (u0(·), u1(·), . . . , uT−1(·)) that minimizes

$$J(x_0; u_0(\cdot), \ldots, u_{T-1}(\cdot)) = \mathop{\mathrm{E}}_{(\epsilon_0, \ldots, \epsilon_{T-1})}\left[ q_f(x_T) + \sum_{k=0}^{T-1} q_k(x_k, u_k(x_k)) \right].$$

The value function is given by

$$J^*(x_0) = \min_{u_k(\cdot) \in U(X),\; k=0,\ldots,T-1} J(x_0; u_0(\cdot), \ldots, u_{T-1}(\cdot)) \qquad (5.10)$$

The optimal sequence of feedback controls (in short, the optimal control trajectory) is the one that achieves this minimum.

All this sounds very tricky and abstract but you will quickly get used to the idea of feedback control because it is quite natural. You can think of feedback control as being analogous to the innovation term in the Kalman filter K(yk − Cµk+1|k) which corrects the estimate µk+1|k to get a new estimate µk+1|k+1 using the current observation yk. Filtering would not work at all if the innovation term did not depend upon the actual observation yk and only depended upon some average observation.
Dijkstra's algorithm no longer works, as is, if the edges in the graph are stochastic but we can use the principle of dynamic programming to write the solution for the stochastic optimal control problem. The idea remains the same: we compute a sequence of cost-to-go functions JT∗(x), JT∗−1(x), . . . , J0∗(x), and in particular J0∗(x0), proceeding backwards.

Finite-horizon dynamic programming for stochastic systems.

1. Initialize JT∗ (x) = qf (x) for all x ∈ X.

2. For all times k = T − 1, . . . , 0, set

$$J_k^*(x) = \min_{u_k(\cdot) \in U(X)} \left\{ q_k(x, u_k(x)) + \mathop{\mathrm{E}}_{\epsilon_k}\left[ J_{k+1}^*(f_k(x, u_k(x)) + \epsilon_k) \right] \right\} \qquad (5.11)$$
for all x ∈ X.

7 Just like (5.4), we solve a sub-problem for one time-instant at each


8 iteration. But observe a few importance differences in (5.11) compared
9 to (5.4).
10 1. There is an expectation over the noise ϵk in the second term in the
11 curly brackets. The second term in the curly brackets is the average of
12 the cost-to-go of the truncated sub-problems from time k + 1, . . . , T
13 over all possible starting states x′ = fk (xk , uk (xk )) + ϵk . This
14 makes sense, after taking the control uk (xk ), we may find the robot
15 at any of the possible states x′ ∈ X depending upon different
16 realizations of noise ϵk and the cost-to-go from xk is therefore the
17 average of the cost-to-go from each of those states (according to the
principle of dynamic programming).

19 2. The minimization in (5.11) is performed over a function

U(X) ∋ uk (·) : X 7→ U.

1 Since our set of states and controls is finite, this involves finding
2 a table of size |X| × |U | for each iteration. In (5.4), we only had
3 to search over a set of values uk ∈ U of size |U |. At the end of
4 dynamic programming, we have a sequence of feedback controls

(u∗0 (·), u∗1 (·), . . . , u∗T −1 (·)).

5 Each feedback control u∗k (x) tells us what control the robot should
6 pick if it finds itself at a state x at time k.

3. If we know the dynamical system, not in its functional form xk+1 = fk (xk , uk ) + ϵk but rather as a transition matrix P(xk+1 | xk , uk ) (like we had in Chapter 2), then the expression in (5.11) simply becomes (see the short sketch after this list)

$$J_k^*(x) = \min_{u_k(\cdot) \in U(X)} \left\{ q_k(x, u_k(x)) + \mathop{\mathrm{E}}_{x' \sim P(\cdot \mid x,\, u_k(x))}\left[ J_{k+1}^*(x') \right] \right\} \qquad (5.12)$$

? Why should we only care about minimizing the average cost in the objective in (5.10)? Can you think of any other objective we may wish to use?
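As a quick illustration of the backup in (5.12), the sketch below performs one backward step; the sizes, costs and transition probabilities are made up, and P[u] stacks the |X| × |X| transition matrices, one per control.

```python
import numpy as np

n, m = 5, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n), size=(m, n))   # P[u, x, :] = P(x' | x, u)
q = rng.uniform(0, 1, size=(n, m))           # run-time cost q(x, u)
J_next = rng.uniform(0, 1, size=n)           # J_{k+1}^*, assumed already computed

# Q[x, u] = q(x, u) + E_{x' ~ P(.|x,u)} [ J_{k+1}^*(x') ]
Q = q + np.einsum('uxy,y->xu', P, J_next)
J_k = Q.min(axis=1)                          # J_k^*(x)
u_k = Q.argmin(axis=1)                       # feedback control u_k^*(x)
```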

11 Computational complexity The form in (5.12) helps us understand the


12 computational complexity, each sub-problem performs |X| × |X| × |U |
13 amount of work and therefore the total complexity of stochastic dynamic
programming is O(T |X|² |U|).
15 Naturally, the quadratic dependence on the size of the state-space is an even
16 bigger hurdle while implementing dynamic programming for stochastic
17 systems.

18 5.4.1 Infinite-horizon problems


19 In the previous section, we put a lot of importance on the horizon T
20 for dynamic programming. This is natural: if the horizon T changes,
21 say you are in a hurry to get to school, the optimal trajectory may take
22 control inputs that incur a lot of runtime cost simply to reach closer to
23 the goal state (something that keeps the terminal cost small). In most,
24 real-world problems, it is not very clear what value of T we should pick.
25 We therefore formulate the dynamic programming problem as something
26 that also allows a trajectory of infinite steps but also encourages the length
27 of the trajectory to be small enough in order to be meaningful. Such
28 problems are called infinite-horizon problems (T → ∞).

29 Stationary dynamics and run-time cost We think of infinite-horizon


30 problems in the following way: at any time-step, the length of the trajectory
31 remaining for the robot to traverse is infinite. It helps in this case to solve
32 a restricted set of problems where the system dynamics and run-time cost
33 do not change as a function of time (they only change as a function of the

1 state and the control). We will set

q(x, u) ≡ qk (x, u),


f (x, u) ≡ fk (x, u)

2 for all x ∈ X and u ∈ U . Such a condition is called stationarity. If the


3 system is stochastic, we also require that the distribution of noise ϵk does
4 not change as a function of time (it could change in (5.11) but we did not
5 write it so). The infinite-horizon setting is never quite satisfied in practice
6 but it is a reasonable formulation for problems that run for a long length
7 of time.

8 Infinite-horizon objective The objective that we desire be minimized


9 by an infinite-horizon control policy

π = (u0 (·), u1 (·), . . . , uT (·), uT +1 (·), . . . , )

is defined in terms of an asymptotic limit

$$J(x_0; \pi) = \lim_{T \to \infty} \mathop{\mathrm{E}}_{(\epsilon_0, \ldots, \epsilon_{T-1})}\left[ \sum_{k=0}^{T-1} \gamma^k\, q(x_k, u_k(x_k)) \right]. \qquad (5.13)$$

11 and we again wish to solve for the optimal cost-to-go

$$J^*(x_0) = \min_{\pi} J(x_0; \pi). \qquad (5.14)$$

Thus the infinite-horizon cost of a policy is the limit of its finite-horizon
13 costs as the horizon tends to infinity. Notice a few important differences
14 when compared to (5.7).

15 1. The objective is a limit, it is effectively the cost of the trajectory as


16 it is allowed to stretch for a larger and larger time-horizon.

17 2. There is no terminal cost in the objective function; this makes sense


18 because an explicit terminal state xT does not exist anymore. In
19 infinite-horizon problems, you should think of the terminal cost
20 as being incorporated inside the run-time cost q(x, u) itself, e.g.,
21 move the robot to minimize the fuel used at this time instant but also
22 move it in a way that it reaches the goal at some time in the future.

23 3. Discount factor— Depending upon what controls we pick, the


24 summation
$$\sum_{k=0}^{T} q(x_k, u_k(x_k))$$

25 can diverge to infinity as T → ∞ and thereby a meaningful solution


26 to the infinite-horizon problem may not exist. In order to avoid this,
27 we use a scalar
γ ∈ (0, 1)
28 known as the discount factor in the formulation. It puts more

1 emphasis on costs incurred earlier in the trajectory than later ones


2 and thereby encourages the length of the trajectory to be small.
Notice that $\sum_{k=0}^{\infty} \alpha^k = 1/(1-\alpha)$ if $|\alpha| < 1$, so if the cost |q(xk , uk (xk ))| < 1, then we know that the objective in (5.13) always converges.

6 Stochastic shortest path problems It is important to remember that


7 the discount factor is chosen by the user, no one prescribes it. There is
8 also a class of problems where we may choose γ = 1 but in these cases,
9 there should exist some essentially terminal state in the state space where
10 we can keep taking a control such that the runtime cost q(x, u) is zero.
11 Otherwise, the objective will diverge. The goal region in the grid-world
12 problem could be an example of such state. Such problems are called
13 stochastic shortest path problems because the time-horizon is not actually
14 infinite, we just do not know how many time-steps it will take for the robot
15 to go to the goal location. Naturally, stochastic shortest path problems
16 are a generalization of the shortest path problem solved by Dijkstra’s
17 algorithm. The algorithms we discuss next will work for such problems.

18 Stationary policy It seems a bit cumbersome to carry around an infinitely


19 long sequence of feedback controls in infinite-horizon problems. Since
20 there is an infinitely-long trajectory yet to be traveled at any given time-
21 step, the optimal control action that we take should only depend upon the
22 current state. This is indeed true mathematically. If J ∗ (x) is the optimal
23 cost-to-go in the infinite-horizon problem starting from a state x, using
24 the principle of dynamic programming, we should also have that we can
25 split this cost as the best one-step cost of the current state x added to the
26 optimal cost-to-go from the state f (x, u) realized after taking the optimal
27 control u:

$$J^*(x) = \min_{u(x) \in U(X)} \mathop{\mathrm{E}}_{\epsilon}\left[ q(x, u(x)) + \gamma\, J^*(f(x, u(x)) + \epsilon) \right]. \qquad (5.15)$$

28 We will study this equation in depth soon. But if we find the minimum at
29 u∗ (x) for this equation, then we can run the policy

π ∗ = (u∗ (·), u∗ (·), . . . , u∗ (·), . . .)

30 for the entire infinite horizon. Such a policy is called a stationary


31 policy. Intuitively, since the future optimization problem (tail of dynamic
32 programming) from a given state x looks the same regardless of the time
33 at which we start, optimal policies for the infinite-horizon problem can
34 be found even inside the restricted class of policies where the feedback
35 control does not change with time k.
36 We will almost exclusively deal with stationary policies in this course.

1 5.4.2 Dynamic programming for infinite-horizon prob-


2 lems
3 We wish to compute the optimal cost-to-go of starting from a state x and
4 taking an infinitely long trajectory that minimizes the objective (5.13).
5 We will exploit the equation in (5.15) and develop an iterative algorithm
6 to compute the optimal cost-to-go J ∗ (x).

Value Iteration . The algorithm proceeds iteratively to maintain a


sequence of approximations

∀x ∈ X, J (0) (x), J (1) (x), J (2) (x), . . . ,

to the optimal value function J ∗ (x). Such an algorithm is called


“value iteration”.

1. Initialize J (0) (x) = 0 for all x ∈ X.

2. Update using the Bellman equation at each iteration, i.e., for


i = 1, 2, . . . , N , set
$$J^{(i+1)}(x) = \min_{u \in U} \mathop{\mathrm{E}}_{\epsilon}\left[ q(x, u) + \gamma J^{(i)}(f(x, u) + \epsilon) \right]. \qquad (5.16)$$

for all x ∈ X until the value function converges at all states,


e.g.,

$$\forall x \in X, \quad \left| J^{(i)}(x) - J^{(i+1)}(x) \right| < \text{small tolerance}.$$

3. Compute the feedback control and the stationary policy π∗ = (u∗(·), . . . , ) corresponding to the value function estimate J^(N) as

$$u^*(x) = \mathop{\mathrm{argmin}}_{u \in U} \mathop{\mathrm{E}}_{\epsilon}\left[ q(x, u) + \gamma J^{(N)}(f(x, u) + \epsilon) \right] \qquad (5.17)$$

for all x ∈ X.

If the dynamics is given as a transition matrix, we can replace the expectation over noise Eϵ with an expectation over the next state x′ ∼ P(x′ | x, u(x)) in (5.16) to run value iteration. Everything else remains the same.

7 Let us observe a few important things in the above sequence of updates.


8 First, at each iteration, we are updating the values of all |X| states. This
2
9 involves |X| |U | amount of work per iteration. How many such iterations
10 N do we need until the value function converges? We will see in a bit, that

∀x ∈ X, J ∗ (x) = lim J (N ) (x).


N →∞

11 Again, we really only wanted to compute the cost-to-go J ∗ (x0 ) from some
12 initial state x0 but computed the value function at all states x ∈ X.
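Written with a transition matrix (as in the margin note above), value iteration is a short loop. The following is a minimal sketch; the sizes, costs and transition probabilities below are made up.

```python
import numpy as np

n, m, gamma = 25, 4, 0.95
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n), size=(m, n))       # P[u, x, :] = P(x' | x, u)
q = rng.uniform(0, 1, size=(n, m))               # run-time cost q(x, u)

J = np.zeros(n)                                  # J^(0)(x) = 0
for i in range(10_000):
    Q = q + gamma * np.einsum('uxy,y->xu', P, J) # Bellman backup (5.16)
    J_new = Q.min(axis=1)
    done = np.max(np.abs(J_new - J)) < 1e-8      # small tolerance
    J = J_new
    if done:
        break

u_star = Q.argmin(axis=1)                        # stationary feedback control (5.17)
```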

1 Q-Iteration Just like we wrote dynamic programming in terms of the


2 Q-factor, we can also write value iteration to find the optimal Q-factor
3 Q∗ (x, u), i.e., the optimal cost-to-go of the infinitely-long trajectory that
4 starts at state x, takes a control u at the first time-step and therefore follows
5 the optimal policy.

6 1. We can again initialize Q(0) (x, u) = q(x, u) for all x ∈ X and


7 u ∈ U.

2. The Bellman update in terms of the Q-factor becomes

$$Q^{(i+1)}(x, u) = \mathop{\mathrm{E}}_{\epsilon}\left[ q(x, u) + \gamma \min_{u' \in U} Q^{(i)}(f(x, u) + \epsilon, u') \right] \qquad (5.18)$$
9 and update this for all x ∈ X and all u ∈ U .

10 3. The feedback control is the control at that state that minimizes the
11 Q-factor

∀x ∈ X, u∗ (x) = argmin Q(N ) (x, u′ ) (5.19)


u′ ∈U

12 and the control policy is

π ∗ = (u∗ , u∗ , . . . , )

13 Notice how we can directly find the u′ that has the smallest value of
14 Q(N ) and set it to be our feedback control.
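The corresponding sketch written in terms of the Q-factor, following (5.18)-(5.19), looks almost identical; again all numbers below are made up.

```python
import numpy as np

n, m, gamma = 25, 4, 0.95
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n), size=(m, n))     # P[u, x, :] = P(x' | x, u)
q = rng.uniform(0, 1, size=(n, m))

Q = q.copy()                                   # Q^(0)(x, u) = q(x, u)
for i in range(10_000):
    Q_new = q + gamma * np.einsum('uxy,y->xu', P, Q.min(axis=1))
    done = np.max(np.abs(Q_new - Q)) < 1e-8
    Q = Q_new
    if done:
        break

u_star = Q.argmin(axis=1)                      # feedback control from (5.19)
```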

15 5.4.3 An example
16 Let us consider a grid-world example. A robot would like to reach a
17 goal region (marked in green) and we are interested in computing the
18 cost-to-go from different parts of the domain. Gray cells are obstacles that
19 the robot cannot enter. At each step the robot can move in four directions
20 (north, east, west, south) with a small dynamics noise which keeps it at
21 the original cell in spite of taking the control. These pictures are when
22 the run-time-cost is negative, i.e., the robot gets a certain reward q(x, u)
23 for taking the control u at cell x. Dynamic programming (and value
24 iteration) also works in this case and we simply replace all minimizations
25 by maximizations in the equations.


4 The final value function after 50 iterations looks as follows.

6 5.4.4 Some theoretical results on value iteration


7 We list down some very powerful theoretical results for value iteration.
8 These results are valid under a very general set of conditions and make
9 value iteration work for a large number of real-world problems; they are
at the heart of all modern algorithms. We will not derive them (it is

1 easy but cumbersome) but you should commit them to memory and try to
2 understand them intuitively.

3 Value iteration converges. Given any initialization J (0) (x) for all
4 x ∈ X, the sequence of value iteration estimates J (i) (x) converges to the
5 optimal cost
∀x ∈ X, J ∗ (x) = lim J (N ) (x)
N →∞

6 The solution is unique. The optimal cost-to-go J ∗ (x) of (5.14) satisfies


7 the Bellman equation

$$J^*(x) = \min_{u \in U} \mathop{\mathrm{E}}_{\epsilon}\left[ q(x, u) + \gamma J^*(f(x, u) + \epsilon) \right].$$

8 The function J ∗ is also the unique solution of this equation. In other


9 words, if we find some other function J ′ (x) that satisfies the Bellman
10 equation, we are guaranteed that J ′ is indeed the optimal cost-to-go.

11 Policy evaluation: Bellman equation for a particular policy. Consider


12 a stationary policy π = (u(·), u(·), . . .). The cost of executing this policy
13 starting from a state x, is J(x; π) from (5.13), also denoted by J π (x) for
14 short. It satisfies the equation

$$J^\pi(x) = q(x, u(x)) + \gamma \mathop{\mathrm{E}}_{\epsilon}\left[ J^\pi(f(x, u(x)) + \epsilon) \right] \qquad (5.20)$$

15 and is also the unique solution of this equation. In other words, if we


16 have a policy in hand, and wish to find the cost-to-go of this policy, i.e.,
17 “evaluate the policy” we can initialize J (0) (x) = 0 for all x ∈ X and
18 perform the sequence of iterative updates to this initialization
$$J^{(i+1)}(x) = q(x, u(x)) + \gamma \mathop{\mathrm{E}}_{\epsilon}\left[ J^{(i)}(f(x, u(x)) + \epsilon) \right]. \qquad (5.21)$$

19 As the number of updates goes to infinity, the iterate converges to J π (x)

∀x ∈ X, J π (x) = lim J (N ) (x).


N →∞

20 Policy evaluation is equivalent to solving a linear system of equations.


Observe that the corresponding equation for policy evaluation (5.20) does
22 not have the minimization over controls. This allows us to write the
23 updates in (5.21) as the solution of a linear system of equations. Since we
24 are in a finite state-space, we can write the cost-to-go as a large vector

J π := [J π (x1 ), J π (x2 ), . . . , J π (xn )]

25 where n is the number of total states in the state-space. We create a similar


26 vector for the run-time cost term

q u := [q(x1 , u(x1 )), q(x2 , u(x2 )), . . . , q(xn , u(xn ))] .



1 We know that the expectation over noise ϵ is equivalent to an expecta-


2 tion over the next state of the system, let us rewrite the dynamics part
3 f (x, u(x)) + ϵ in terms of the Markov transition matrix

Tx,x′ = P(x′ | x, u(x))

as

$$\gamma \mathop{\mathrm{E}}_{\epsilon}\left[ J^\pi(f(x, u(x)) + \epsilon) \right] = \gamma \sum_{x'} T_{x,x'}\, J^\pi(x') = \gamma\, T J^\pi$$

5 to get a linear system


J π = q u + γT J π (5.22)
which can be solved easily for $J^\pi = (I - \gamma T)^{-1} q^u$ to get the cost-to-go of a particular control policy π.
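This linear-algebraic form of policy evaluation is essentially one line of code. A minimal sketch, with a made-up row-stochastic transition matrix T and cost vector q^u for some fixed policy u(·):

```python
import numpy as np

n, gamma = 25, 0.95
rng = np.random.default_rng(2)
T = rng.dirichlet(np.ones(n), size=n)    # T[x, :] = P(x' | x, u(x)) under the policy
q_u = rng.uniform(0, 1, size=n)          # q(x, u(x))

# Solve (I - gamma T) J = q_u, i.e., equation (5.22).
J_pi = np.linalg.solve(np.eye(n) - gamma * T, q_u)
```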

8 5.5 Stochastic dynamic programming: Policy


9 Iteration
10 Value iteration converges exponentially quickly, but asymptotically. Note
11 that the number of states |X| = n is finite and so is the number of controls
12 |U |. So this should seem funny, one would expect that we should be able
13 to find the optimal cost J ∗ (x) in finite time if the problem is finite. After
14 all we need to find |X| numbers J ∗ (x1 ), . . . , J ∗ (xn ). This intuition is
15 correct and, in this section, we will discuss an algorithm called policy
16 iteration which is a more efficient version of value iteration.
17 The idea behind policy iteration is quite simple: given a stationary
18 policy for an infinite-horizon problem π = (u(·), . . . , u(·)), we can
19 evaluate this policy to obtain its cost-to-go J π (x). If we now set the
20 feedback control to be

ũ(x) = argmin E [q(x, u) + γJ π (f (x, u) + ϵ)] , (5.23)


u∈U ϵ

21 i.e., we construct a new control policy that finds the best control to execute
? Why? It is simply because (5.23) is at least
22 in the first step ũ(·) and thereafter it executes the old feedback control u(·)
an improvement upon the feedback control
π (1) = (ũ(·), u(·), . . .), u(·). The cost-to-go cannot improve only if
the old feedback control u(·) where optimal
23 then the cost-to-go of the new policy π (1) has to be better: to begin with.
(1)
∀x ∈ X, Jπ (x) ≤ J π (x).

24 We don’t have to stop at one time-step, we can patch the old policy at the
25 first two time-steps to get

π (2) = (ũ(·), ũ(·), . . .),

26 and have by the same logic


$$\forall x \in X, \quad J^{\pi^{(2)}}(x) \le J^{\pi^{(1)}}(x) \le J^{\pi}(x).$$

1 If we build a new stationary policy

π̃ = (ũ(·), ũ(·), ũ(·), . . .), (5.24)

2 we similarly have

∀x ∈ X, J π̃ (x) ≤ J π (x).

3 This suggests an iterative way to compute the optimal stationary policy


4 π ∗ starting from some initial stationary policy.

Policy Iteration The algorithm proceeds to maintain a sequence of


stationary policies

π (k) = (u(k) (·), u(k) (·), u(k) (·), . . .)

that converges to the optimal policy π ∗ .


Initialize u(0) (x) = 0 for all x ∈ X. This gives the initial
stationary policy π (0) . At each iteration k = 1, . . . , we do the
following two things.

1. Policy evaluation Use multiple iterations of (5.21) to evaluate


the old policy π (k) . In other words, initialize J (0) (x) = 0 for
all x ∈ X and iterate upon
$$J^{(i+1)}(x) = q(x, u^{(k)}(x)) + \gamma \mathop{\mathrm{E}}_{\epsilon}\left[ J^{(i)}(f(x, u^{(k)}(x)) + \epsilon) \right]$$

for all x ∈ X until convergence. In practice, we can use the linear system of equations in (5.22) to solve for $J^{\pi^{(k)}}$ directly.

For large problems, we use methods for solving large linear systems such as the Lanczos iteration. Typical policy evaluation problems are also sparse (why?) so we can use things like the Kaczmarz method to solve the linear system.
2. Policy improvement Update the feedback controller us-
ing (5.23) to be
$$u^{(k+1)}(x) = \mathop{\mathrm{argmin}}_{u \in U} \mathop{\mathrm{E}}_{\epsilon}\left[ q(x, u) + \gamma J^{\pi^{(k)}}(f(x, u) + \epsilon) \right]$$

for all x ∈ X and compute the updated stationary policy

π (k+1) = (u(k+1) (·), u(k+1) (·), . . .)

The algorithm terminates when the controller does not change at any
state, i.e., when the following condition is satisfied

∀x ∈ X, u(k+1) (x) = u(k) (x).

5 Just like value iteration converges to the optimal value function, it can
6 be shown that policy iteration produces a sequence of improved policies

$$\forall x \in X, \quad J^{\pi^{(k+1)}}(x) \le J^{\pi^{(k)}}(x)$$

and converges to the optimal cost-to-go

$$\forall x \in X, \quad J^*(x) = \lim_{N \to \infty} J^{\pi^{(N)}}(x).$$

The key property of policy iteration is that we need a finite number of updates to the policy to find the optimal policy. Notice that this does not always mean that we are doing less work than value iteration in policy iteration. Observe that the policy evaluation step in the policy iteration algorithm performs a number of Bellman equation updates. But typically, it is observed in practice that policy iteration is much cheaper computationally than value iteration.
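A minimal sketch of policy iteration on a made-up tabular problem, using the linear system (5.22) for exact policy evaluation and a greedy policy improvement step:

```python
import numpy as np

n, m, gamma = 25, 4, 0.95
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(n), size=(m, n))       # P[u, x, :] = P(x' | x, u)
q = rng.uniform(0, 1, size=(n, m))               # run-time cost q(x, u)

u = np.zeros(n, dtype=int)                       # initial stationary policy u^(0)
while True:
    # Policy evaluation: solve J = q_u + gamma * T J for the current policy.
    T = P[u, np.arange(n), :]                    # T[x, :] = P(. | x, u(x))
    q_u = q[np.arange(n), u]
    J = np.linalg.solve(np.eye(n) - gamma * T, q_u)
    # Policy improvement: greedy with respect to J, cf. (5.23).
    Q = q + gamma * np.einsum('uxy,y->xu', P, J)
    u_new = Q.argmin(axis=1)
    if np.array_equal(u_new, u):                 # controller did not change at any state
        break
    u = u_new
```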

9 5.5.1 An example
10 Let us go back to our example for value iteration. In this case, we will
11 visualize the controller u(k) (x) at each cell x as arrows pointing to some
12 other cell. The cells are colored by the value function for that particular
13 stationary policy.


The evaluated value for the policy after 4 iterations is optimal; compare this to the example for value iteration.

Chapter 6

Linear Quadratic Regulator (LQR)

Reading
1. http://underactuated.csail.mit.edu/lqr.html, Lectures 3-4 at https://ocw.mit.edu/courses/aeronautics-and-astronautics/16-323-principles-of-optimal-control-spring-2008/lecture-notes

2. Optional: Applied Optimal Control by Bryson & Ho, Chapter


4-5

4 This chapter is the analogue of Chapter 3 on Kalman filtering. Just


5 like Chapter 2, the previous chapter gave us two algorithms, namely value
6 iteration and policy iteration, to solve dynamic programming problems for
7 a finite number of states and a finite number of controls. Solving dynamic
8 programming problems is difficult if the state/control space are infinite.
9 In this chapter, we will look at an important and powerful special case,
10 called the Linear Quadratic Regulator (LQR), when we can solve dynamic
11 programming problems easily. Just like a lot of real-world state-estimation
12 problems can be solved using the Kalman filter and its variants, a lot of
13 real-world control problems can be solved using LQR and its variants.

14 6.1 Discrete-time LQR


15 Consider a deterministic, linear dynamical system given by

xk+1 = Axk + Buk ; x0 is given.

16 where xk ∈ Rd and uk ∈ Rm which implies that A ∈ Rd×d and


17 B ∈ Rd×m . In this chapter, we are interested in calculating a feedback
18 control uk = u(xk ) for such a system. Just like we formulated the problem


1 in dynamic programming, we want to pick a feedback control which leads


2 to a trajectory that achieves a minimum of some run-time cost and a
3 terminal cost. We will assume that both the run-time and terminal costs
4 are quadratic in the state and control input, i.e.,

$$q(x, u) = \frac{1}{2}\, x^\top Q\, x + \frac{1}{2}\, u^\top R\, u \qquad (6.1)$$
5 where Q ∈ Rd×d and R ∈ Rm×m are symmetric, positive semi-definite
6 matrices
Q = Q⊤ ⪰ 0, R = R⊤ ⪰ 0.
Effectively, if Q were a diagonal matrix, a large diagonal entry Qii would model our desire that the trajectory of the system should not have a large value of the state xi along its trajectories. We want these matrices to be positive semi-definite to prevent dynamic programming from picking a trajectory which drives the run-time cost down to negative infinity.

13 Example Consider the discrete-time equivalent of the so-called double


14 integrator z̈(t) = u(t). The linear system in this case (obtained by creating
two states x := [z(t), ż(t)]) is

$$x_{k+1} = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix} x_k + \begin{bmatrix} 0 \\ \Delta t \end{bmatrix} u_k.$$

This system is called the double integrator because of the structure z̈ = u; if z denotes the position of an object, the equation is simply Newton's law which connects the applied force u to the acceleration.

First, note that a continuous-time linear dynamical system ẋ = Ax is asymptotically stable, i.e., from any initial condition x(0) its trajectories go to the equilibrium point x = 0 (x(t) → 0 as t → ∞). Asymptotic stability for continuous-time dynamical systems occurs if all eigenvalues of A have strictly negative real parts. A discrete-time linear dynamical system xk+1 = Axk is asymptotically stable if all eigenvalues of A have magnitude strictly smaller than 1, |λ(A)| < 1.
24 A typical trajectory of the double integrator will look as follows.
25 Suppose we would like to pick a different controller that more quickly
26 brings the system to its equilibrium. One way of doing so is to minimize
$$J = \sum_{k=0}^{T} \|x_k\|^2$$

27 which represents how far away both the position and velocity are from zero
28 over all times k. The following figure shows the trajectory that achieves a
29 small value of J.

Figure 6.1: The trajectory of z(t) as a function of time t for a double integrator
z̈(t) = u where we have chosen a stabilizing (i.e., one that makes the system
asymptotically stable) controller u = −z(t) − ż(t). Notice how the trajectory
starts from some initial condition (in this case z(0) = 1 and ż(0) = 0) and moves
towards its equilibrium point z = ż = 0.


Figure 6.2: The trajectory of z(t) as a function of time t for a double integra-
tor z̈(t) = u where we have chosen a large stabilizing control at each time
u = −5z(t) − 5ż(t). Notice how quickly the state trajectory converges to the
equilibrium without much oscillation as compared to Figure 6.1 but how large the
control input is at certain times.

1 This is obviously undesirable for real systems where we may want the
2 control input to be bounded between some reasonable values (a car cannot
3 accelerate by more than a certain threshold). A natural way of enforcing
this is to modify our desired cost of the trajectory to be

$$J = \sum_{k=0}^{T} \left( \|x_k\|^2 + \rho\, \|u_k\|^2 \right)$$

5 where the value of the parameter ρ is something chosen by the user to


6 give a good balance of how quickly the trajectory reaches the equilibrium
7 point and how much control is exerted while doing so. Linear-Quadratic-
8 Regulator (LQR) is a generalization of this idea, notice that the above
9 example is equivalent to setting Q = Id×d and R = ρIm×m for the
10 run-time cost in (6.1).

1 Back to LQR With this background, we are now ready to formulate


2 the Linear-Quadratic-Regulator (LQR) problem which is simply dynamic
3 programming for a linear dynamical system with quadratic run-time cost.
4 In order to enable the system to reach the equilibrium state even if we have
5 only a finite time-horizon, we also include a quadratic cost

$$q_f(x) = \frac{1}{2}\, x^\top Q_f\, x. \qquad (6.2)$$
6 The dynamic programming problem is now formulated as follows.

Finite time-horizon LQR problem Find a sequence of control


inputs (u0 , u1 , . . . , uT −1 ) such that the function
$$J(x_0; u_0, u_1, \ldots, u_{T-1}) = \frac{1}{2}\, x_T^\top Q_f\, x_T + \frac{1}{2} \sum_{k=0}^{T-1} \left( x_k^\top Q\, x_k + u_k^\top R\, u_k \right) \qquad (6.3)$$
is minimized under the constraint that xk+1 = Axk + Buk for all
times k = 0, . . . , T − 1 and x0 is given.

7 6.1.1 Solution of the discrete-time LQR problem


8 We know the principle of dynamic programming and can apply it to solve
9 the LQR problem. As usual, we will compute the cost-to-go of a trajectory
10 that starts at some state x and goes further by T − k time-steps, Jk (x)
backwards. Set

$$J_T^*(x) = \frac{1}{2}\, x^\top Q_f\, x \quad \text{for all } x.$$
Using the principle of dynamic programming, the cost-to-go JT−1 is given by

$$J_{T-1}^*(x_{T-1}) = \min_u \left\{ \frac{1}{2}\, x_{T-1}^\top Q\, x_{T-1} + \frac{1}{2}\, u^\top R\, u + J_T^*(A x_{T-1} + B u) \right\}$$
$$= \min_u \frac{1}{2} \left\{ x_{T-1}^\top Q\, x_{T-1} + u^\top R\, u + (A x_{T-1} + B u)^\top Q_f (A x_{T-1} + B u) \right\}.$$

14 We can now take the derivative of the right-hand side with respect to u to
15 get
$$0 = \frac{d\,\mathrm{RHS}}{du} = R u + B^\top Q_f (A x_{T-1} + B u) \;\Rightarrow\; u_{T-1}^* = -(R + B^\top Q_f B)^{-1} B^\top Q_f A\, x_{T-1} \equiv -K_{T-1}\, x_{T-1}, \qquad (6.4)$$

where

$$K_{T-1} = (R + B^\top Q_f B)^{-1} B^\top Q_f A$$

1 is (surprisingly) also called the Kalman gain. The second derivative is


2 positive semi-definite

$$\frac{d^2\,\mathrm{RHS}}{du^2} = R + B^\top Q_f B \succeq 0$$
3 so we know that u∗T −1 is a minimum of the convex quantity on the right-
4 hand side. Notice that the optimal control u∗T −1 is a linear function of the
5 state xT −1 . Let us now expand the cost-to-go JT −1 using this optimal
6 value (the subscript T − 1 on the curly bracket simply means that all
7 quantities are at time T − 1)

$$J_{T-1}^*(x_{T-1}) = \frac{1}{2} \left\{ x^\top Q x + u^{*\top} R\, u^* + (A x + B u^*)^\top Q_f (A x + B u^*) \right\}_{T-1}$$
$$= \frac{1}{2}\, x_{T-1}^\top \left\{ Q + K^\top R K + (A - BK)^\top Q_f (A - BK) \right\}_{T-1} x_{T-1}$$
$$\equiv \frac{1}{2}\, x_{T-1}^\top P_{T-1}\, x_{T-1}$$
where we set the stuff inside the curly brackets to the matrix P which is also positive semi-definite. This is great, the cost-to-go is also a quadratic function of the state xT−1. Let us assume that this pattern holds for all time steps and the cost-to-go of the optimal LQR trajectory starting from a state x and proceeding forwards for T − k time-steps is

$$J_k^*(x) = \frac{1}{2}\, x^\top P_k\, x.$$
13 We can now repeat the same exercise to get a recursive formula for Pk in
14 terms of Pk+1 . This is the solution of dynamic programming for the LQR
15 problem and it looks as follows.

$$\begin{aligned}
P_T &= Q_f \\
K_k &= \left( R + B^\top P_{k+1} B \right)^{-1} B^\top P_{k+1} A \qquad (6.5) \\
P_k &= Q + K_k^\top R\, K_k + (A - B K_k)^\top P_{k+1} (A - B K_k),
\end{aligned}$$

16 for k = T −1, T −2, . . . , 0. There are a number of important observations


17 to be made from this calculation:

18 1. The optimal controller u∗k = −Kk xk is a linear function of the state


19 xk . This is only true for linear dynamical systems with quadratic
20 costs. Notice that both the state and control space are infinite sets
21 but we have managed to solve the dynamic programming problem
22 to get the optimal controller. We could not have done it if the run-
23 time/terminal costs were not quadratic or if the dynamical system
24 were not linear. Can you say why?

25 2. The cost-to-go matrix Pk and the Kalman gain Kk do not depend


26 upon the state and can be computed ahead of time if we know what
27 the time horizon T is going to be.

28 3. The Kalman gain changes with time k. Effectively, the LQR



1 controller picks a large control input to quickly reduce the run-time


2 cost at the beginning (if the initial condition were such that the
3 run-time cost of the trajectory would be very large) and then gets
4 into a balancing act where it balances the control effort and the state-
5 dependent part of the run-time cost. LQR is an optimal way to strike
6 a balance between the two examples in Figure 6.1 and Figure 6.2.
The careful reader will notice how the equations in (6.5) and our remarks about them are similar to the update equations of the Kalman filter and our remarks there. In fact we will see shortly how spookily similar the two are. The key difference is that Kalman filter updates run forwards in time and update the covariance while LQR updates run backwards in time and update the cost-to-go matrix P. This is not surprising: because LQR is an optimal control problem, its update equations should run backward in time like Dijkstra's algorithm.

If you are trying this example yourself, I used the formula for continuous-time LQR and then discretized the controller while implementing it. We will see this in Section 6.2.

Figure 6.3: The trajectory of z(t) as a function of time t for a double integrator
z̈(t) = u where we have chosen a controller obtained from LQR with Q = I and
R = 5. This gives the controller to be about u = −0.45z(t) − 1.05ż(t). Notice
how we still get stabilization but the control acts more gradually. Using different
values of R, we can get many different behaviors. Another key aspect of LQR as
compared to Figure 6.1 where the control was chosen in an ad hoc fashion is to let
us prescribe the quality of state trajectories using high-level quantities like Q, R.
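A minimal sketch of the backward recursion (6.5) applied to the double integrator; the horizon, time-step and weights below are chosen only for illustration and are not necessarily the exact values used for the figures.

```python
import numpy as np

dt, T, rho = 0.1, 100, 5.0
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q, R, Qf = np.eye(2), rho * np.eye(1), np.eye(2)

P = Qf
gains = []
for k in range(T - 1, -1, -1):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)        # Kalman gain K_k
    P = Q + K.T @ R @ K + (A - B @ K).T @ P @ (A - B @ K)    # cost-to-go matrix P_k
    gains.append(K)
gains = gains[::-1]                                          # K_0, ..., K_{T-1}

# Roll the controller u_k = -K_k x_k forward from x_0 = [1, 0].
x = np.array([1.0, 0.0])
for k in range(T):
    u = -gains[k] @ x
    x = A @ x + B @ u
print(x)   # the state has moved toward the equilibrium [0, 0]
```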

15 6.2 Hamilton-Jacobi-Bellman equation


16 This section will show how the principle of dynamic programming looks
17 for continuous-time deterministic dynamical systems

ẋ = f (x, u), with x(0) = x0 .

18 As we discussed in Chapter 3, we can think of this as the limit of discrete-


19 time dynamical system xk+1 = f discrete (xk , uk ) as the time discretization
20 goes to zero. Just like we have a sequence of controls in the discrete-time
21 case, we have a continuous curve that determines the control (let us also
22 call it the control sequence)

{u(t) : t ∈ R+ }

1 which gives rise to a trajectory of the states

{x(t) : t ∈ R+ }

2 for the dynamical system. Let us consider the case when we want to
3 find control sequences that minimize the integral of the cost along the
trajectory that stops at some fixed, finite time-horizon T:

$$q_f(x(T)) + \int_0^T q(x(t), u(t))\, dt.$$

This cost is again a function of the run-time cost and a terminal cost.

Since {x(t)}t≥0 and {u(t)}t≥0 are continuous curves and the cost is now a function of a continuous curve, mathematicians say that the cost is a "functional" of the state and control trajectory.

Continuous-time optimal control problem We again want to


solve for
$$J^*(x_0) = \min_{u(t),\; t \in [0,T]} \left\{ q_f(x(T)) + \int_0^T q(x(t), u(t))\, dt \right\} \qquad (6.6)$$

with the system satisfying ẋ = f (x, u) at each time instant. Notice


that the minimization is over a function of time {u(t) : t ∈ [0, T ]} as
opposed to a discrete-time sequence of controls that we had in the
discrete-time case. We will next look at the Hamilton-Jacobi-Bellman
equation which is a method to solve optimal-control problems of this
kind.

The principle of dynamic programming is still valid: if we


7 have an optimal control trajectory {u∗ (t) : t ∈ [0, T ]} we can chop it up
8 into two parts at some intermediate time t ∈ [0, T ] and claim that the tail
9 is optimal. In preparation for this, let us define the cost-to-go of going
forward by T − t time as

$$J^*(x, t) = \min_{u(s),\; s \in [t,T]} \left\{ q_f(x(T)) + \int_t^T q(x(s), u(s))\, ds \right\},$$

the cost incurred if the trajectory starts at state x and goes forward by T − t
time. This is very similar to the cost-to-go Jk∗ (x) we had in discrete-time
dynamic programming. Dynamic programming now gives

$$\begin{aligned}
J^*(x(t), t) &= \min_{u(s),\; t \le s \le T} \left\{ q_f(x(T)) + \int_t^T q(x(s), u(s))\, ds \right\} \\
&= \min_{u(s),\; t \le s \le T} \left\{ q_f(x(T)) + \int_t^{t+\Delta t} q(x(s), u(s))\, ds + \int_{t+\Delta t}^T q(x(s), u(s))\, ds \right\} \\
&= \min_{u(s),\; t \le s \le t+\Delta t} \left\{ J^*(x(t + \Delta t), t + \Delta t) + \int_t^{t+\Delta t} q(x(s), u(s))\, ds \right\}.
\end{aligned}$$

11 We now take the Taylor approximation of the term J ∗ (x(t + ∆t), t + ∆t)

1 as follows

J ∗ (x(t + ∆t), t + ∆t) − J ∗ (x(t), t)


≈ ∂x J ∗ (x(t), t) (x(t + ∆t) − x(t)) + ∂t J ∗ (x(t), t)∆t
≈ ∂x J ∗ (x(t), t) f (x(t), u(t)) ∆t + ∂t J ∗ (x(t), t)∆t

2 where ∂x J ∗ and ∂t J ∗ denote the derivative of J ∗ with respect to its first


3 and second argument respectively. We substitute this into the minimization
4 and collect terms of ∆t to get

$$0 = \partial_t J^*(x(t), t) + \min_{u(t) \in U} \left\{ q(x(t), u(t)) + f(x(t), u(t))\, \partial_x J^*(x(t), t) \right\}. \qquad (6.7)$$
5 Notice that the minimization in (6.7) is only over one control input
6 u(t) ∈ U , this is the control that we should take at time t. (6.7) is called
7 the Hamilton-Jacobi-Bellman (HJB) equation. Just like the Bellman
8 equation
$$J_k^*(x) = \min_{u \in U} \left\{ q_k(x, u) + J_{k+1}^*(f(x, u)) \right\}.$$

9 has two quantities x and the time k, the Hamilton-Jacobi-Bellman equation


10 also has two quantities x and continuous time t. Just like the Bellman
11 equation is solved backwards in time starting from T with Jk∗ (x) = qf (x),
12 the HJB equation is solved backwards in time by setting

J ∗ (x, T ) = qf (x).

You should think of the HJB equation as the continuous-time,


continuous-space analogue of Dijkstra’s algorithm when the number
of nodes in the graph goes to infinity and the length of each edge is
also infinitesimally small.

13 6.2.1 Infinite-horizon HJB


14 The infinite-horizon problem with the HJB equation is easy: since we
15 know that the optimal cost-to-go is not a function of time, we have

∂t J ∗ (x, t) = 0

16 and therefore J ∗ (x) satisfies

$$0 = \min_{u \in U} \left\{ q(x, u) + f(x, u)\, \partial_x J^*(x) \right\}. \qquad (6.8)$$

In this case, the above equation makes sense only if the integral of the run-time cost with the optimal controller, $\int_0^\infty q(x(t), u^*(x(t)))\, dt$, remains
19 bounded and does not diverge to infinity. Therefore typically in this
20 problem we will set q(0, 0) = 0, i.e., there is no cost for the system being
21 at the origin with zero control, otherwise the integral of the run-time cost
22 will never be finite. This also gives the boundary condition J ∗ (0) = 0 for
23 the HJB equation.

1 6.2.2 Solving the HJB equation


2 The HJB equation is a partial differential equation (PDE) because there
3 is one cost-to-go from every state x ∈ X and for every time t ∈ [0, T ].
4 It belongs to a large and important class of PDEs, collectively known
5 as Hamilton-Jacobi-type equations. As you can imagine, since dynamic
6 programming is so pervasive and solutions of DP are very useful in practice
7 for a number of problems, there have been many tools invented to solve the
8 HJB equation. These tools have applications to a wide variety of problems,
9 from understanding how sound travels in crowded rooms to how light
10 diffuses in an animated movie scene, to even obtaining better algorithms
to train deep networks (https://arxiv.org/abs/1704.04932). HJB equations
12 are usually never exactly solvable and a number of approximations need
13 to be made in order to solve it.

In this course, we will not solve the HJB equation. Rather, we are
interested in seeing how the HJB equation looks for continuous-time
linear dynamical systems (both deterministic and stochastic ones) and
LQR problems for such systems, as done in the following section.

14 An example We will look at a classical example of the so-called car-


on-the-hill problem given below. The state of the problem is the position

Figure 6.4: A car whose position is given by z(t) would like to climb the hill to
its right and reach the top with minimal velocity. The car rolls on the hill without
friction. The run-time cost is zero everywhere inside the state-space. Terminal
cost is -1 for hitting the left boundary (z = −1) and −1 − ż/2 for reaching the
right boundary (z = 1). The car is a single integrator, i.e., ż = u with only two
controls (u = 4 and u = −4) and cannot exceed a given velocity (in this case
|ż| ≤ 4. This looks like a simple dynamic programming problem but it is quite
hard due to the constraint on the velocity. The car may need to make multiple
swing ups before it gains enough velocity (but not too much) to climb up the hill.

16 and velocity (z, ż) and we can solve a two-dimensional HJB equation to
17 obtain the optimal cost-to-go from any state, as done by the authors Yuval
18 Tassa and Tom Erez in “Least Squares Solutions of the HJB Equation
19 With Neural Network Value-Function Approximators”
(https://homes.cs.washington.edu/~todorov/courses/amath579/reading/NeuralNet.pdf).
21 In practice, while solving the HJB PDE, one discretizes the state-space at
22 given set of states and solves the HJB equation (6.7) on this grid using

1 numerical methods (these authors used neural networks to solve it). The
end result looks as follows.

Figure 6.5: The left-hand side picture shows the infinite-horizon cost-to-go J ∗ (z, ż)
for the car-on-the-hill problem. Notice how the value function is non-smooth at
various places. This is quite typical of difficult dynamic programming problems.
The right-hand side picture shows the optimal trajectories of the car (z(t), ż(t));
gray areas indicate maximum control and white areas indicate minimum control.
The black lines show a few optimal control sequences taken the car starting from
various states in the state-space. Notice how the optimal control trajectory can
be quite different even if the car starts from nearby states (-0.5,1) and (-0.4,1.2)).
This is also quite typical of difficult dynamic programming problems.

3 6.2.3 Continuous-time LQR


4 Consider a linear continuous-time dynamical system given by

ẋ = A x + B u; x(0) = x0 .

5 In the LQR problem, we are interested in finding a control trajectory that


6 minimizes, as usual, a cost function that is quadratic in states and controls,
7 except that we have an integral of the run-time cost because our system is
8 a continuous-time system
(1/2) x(T)⊤ Qf x(T) + (1/2) ∫_0^T [ x(t)⊤ Q x(t) + u(t)⊤ R u(t) ] dt .

9 This is a very nice setup for using the HJB equation from the previous
10 section.
11 Let us use our intuition from the discrete-time LQR problem and say
12 that the optimal cost is quadratic in the states, namely,

J∗(x, t) = (1/2) x(t)⊤ P(t) x(t);
notice that as usual the optimal cost-to-go is a function of the state x
and the time t because it is the optimal cost of the continuous-time LQR
problem if the system starts at a state x at time t and goes on until time
T ≥ t. We will now check if this J∗ satisfies the HJB equation (we don't
write the arguments x(t), u(t) etc. to keep the notation clear)
 
−∂t J∗(x, t) = min_{u∈U} { (1/2) ( x⊤ Q x + u⊤ R u ) + (A x + B u)⊤ ∂x J∗(x, t) }     (6.9)
5 from (6.7). The minimization is over the control input that we take at time
6 t. Also notice the partial derivatives

∂x J∗(x, t) = P(t) x,
∂t J∗(x, t) = (1/2) x⊤ Ṗ(t) x.
It is convenient to see that in this case the minimization can be performed
using basic calculus (just like the discrete-time LQR problem): we
differentiate the right-hand side with respect to u and set the derivative to zero,

0 = d/du ( RHS of HJB )
⇒ u∗(t) = −R⁻¹ B⊤ P(t) x(t) ≡ −K(t) x(t),     (6.10)

10 where K(t) = R−1 B ⊤ P (t) is the Kalman gain. The controller is again
11 linear in the states x(t) and the expression for the gain is very simple in
12 this case, much simpler than discrete-time LQR. Since R ≻ 0, we also
13 know that u∗ (t) computed here is the global minimum. If we substitute
14 this value of u∗ (t) back into the HJB equation we have

RHS of HJB evaluated at u∗(t) = (1/2) x⊤ ( P A + A⊤ P + Q − P B R⁻¹ B⊤ P ) x.

In order to satisfy the HJB equation, the expression above must be equal
to −∂t J∗(x, t). We therefore have what is called the
Continuous-time Algebraic Riccati Equation (CARE) for the matrix
P(t) ∈ Rd×d:

−Ṗ = P A + A⊤ P + Q − P BR−1 B ⊤ P. (6.11)

This is an ordinary differential equation for the matrix P. The derivative
Ṗ = dP/dt stands for differentiating every entry of P individually with
time t. The terminal cost is (1/2) x(T)⊤ Qf x(T), which gives the boundary
condition for the ODE as

P(T) = Qf.

23 Notice that the ODE for the P (t) travels backwards in time.
24 Continuous-time LQR has particularly easy equations, as you can see
25 in (6.10) and (6.11) compared to those for discrete-time ((6.4) and (6.5)).
26 Special techniques have been invented for solving the Riccati equation. I

1 used the function scipy.linalg.solve_continuous_are to obtain Figure 6.3


2 using the continuous-time equations; the corresponding function for
3 solving Discrete-time Algebraic Riccati Equation (DARE) which is given
4 in (6.5) is scipy.linalg.solve_discrete_are. The continuous-time point-of-
5 view also gives powerful connections to the Kalman filter, where you can
6 show that the Kalman filter and LQR are duals of each other: in fact the
7 equations for the Kalman filter (in continuous-time) and continuous-time
8 LQR turn out to be exactly the same after you interchange appropriate
9 quantities (!).

10 Infinite-horizon LQR Just like the infinite-horizon HJB equation has


11 ∂t J ∗ (x, t) = 0, if we have an infinite-horizon LQR problem, the cost
12 matrix P should not be a function of time

Ṗ = 0.

13 The continuous-time algebraic Riccati equation in (6.11) now becomes

0 = P A + A⊤ P + Q − P B R⁻¹ B⊤ P,

with the cost-to-go being given by J∗(x) = (1/2) x⊤ P x.
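As a concrete illustration, here is a minimal sketch of solving the CARE with scipy.linalg.solve_continuous_are; the double-integrator matrices and cost weights below are assumptions chosen only for illustration (they are not necessarily the values used for Figure 6.3).

```python
# A minimal sketch: infinite-horizon continuous-time LQR for a double integrator.
# The matrices below are illustrative assumptions.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0., 1.],
              [0., 0.]])          # double integrator: z-double-dot = u
B = np.array([[0.],
              [1.]])
Q = np.eye(2)                     # quadratic state cost
R = np.array([[0.1]])             # quadratic control cost

# Solve 0 = P A + A^T P + Q - P B R^{-1} B^T P
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)   # K = R^{-1} B^T P, so that u = -K x
print(P, K)
```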

15 6.3 Stochastic LQR


16 We will next look at a very powerful result. Say we have a stochastic linear
17 dynamical system

ẋ(t) = Ax(t) + Bu(t) + Bϵ ϵ(t); x(0) is given

18 where ϵ(t) is standard Gaussian noise ϵ(t) ∼ N (0, I) that is uncorrelated


19 in time and would like to find a control sequence {u(t) : t ∈ [0, T ]} that
minimizes a quadratic run-time and terminal cost

E_{ϵ(t): t∈[0,T]} [ (1/2) x(T)⊤ Qf x(T) + (1/2) ∫_0^T ( x(t)⊤ Q x(t) + u(t)⊤ R u(t) ) dt ]

21 over a finite-horizon T . Notice that since the system is stochastic now,


22 we should minimize the expected value of the cost over all possible
23 realizations of the noise {ϵ(t) : t ∈ [0, T ]}. This is a very challenging
24 problem, conceptually it is the equivalent of dynamic programming for an
25 MDP with an infinite number of states x(t) ∈ Rd and an infinite number
26 of controls u(t) ∈ Rm .
27 However, it turns out that the optimal controller that we should pick in
28 this case is also given by the standard LQR problem

u∗ (t) = −R−1 B ⊤ P (t) x(t)


with − Ṗ = P A + A⊤ P + Q − P BR−1 B ⊤ P ; P (T ) = Qf .

29 We will not do the proof (it is easy but tedious, you can try to show it

[Plot: left panel "Double integrator (LQR control)" showing z, ż, u; right panel "Stochastic double integrator (LQR control)" showing z_s, ż_s, u; both versus t [s].]

Figure 6.6: Comparison of the state trajectories of deterministic LQR and stochastic
LQR problem with Bϵ = [0.1, 0.1]. The left panel is the same as that in Figure 6.3.
The control input is the same in both cases but notice that the states in the plot
on the right need not converge to the equilibrium due to noise. The cost of the
trajectory will also be higher for the stochastic LQR case due to this. The total cost
is J ∗ (x0 ) = 32.5 for the deterministic case (32.24 for the quadratic state-cost and
0.26 for the control cost). The total cost J ∗ (x0 ) is much higher for the stochastic
case, it is 81.62 (81.36 for the quadratic state cost and 0.26 for the control cost).

1 by writing the HJB equation for the stochastic LQR problem). This is a
2 very surprising result because it says that even if the dynamical system
3 had noise, the optimal control we should pick is exactly the same as the
4 control we would have picked had the system been deterministic. It is a
5 special property of the LQR problem and not true for other dynamical
6 systems (nonlinear ones, or ones with non-Gaussian noise) or other costs.
We know that the control u∗(t) is the same as in the deterministic case.
Is the cost-to-go J∗(x, t) also the same? If you think about it, the
cost-to-go in the stochastic case has to be a bit larger than in the deterministic
case because the noise ϵ(t) is always going to be non-zero when we run the
system; the LQR cost J∗(x0, 0) = (1/2) x0⊤ P(0) x0 is, after all, only the cost
of the deterministic problem. It turns out that the cost for the stochastic
LQR case for an initial state x0 is

J∗(x0, 0) = E_{ϵ(t): t∈[0,T]} [ (1/2) x(T)⊤ Qf x(T) + (1/2) ∫_0^T . . . dt ]
          = (1/2) x0⊤ P(0) x0 + (1/2) ∫_0^T tr( P(t) Bϵ Bϵ⊤ ) dt .

14 The first term is the same as that of the deterministic LQR problem. The
15 second term is the penalty we incur for having a stochastic dynamical
16 system. This is the minimal cost achievable for stochastic LQR but it is
17 not the same as that of the deterministic LQR.
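The following sketch illustrates this decomposition numerically by integrating the Riccati ODE backwards with a simple Euler scheme; all matrices, the initial state, the horizon and the step size are assumptions for illustration.

```python
# A minimal sketch (illustrative matrices and step sizes are assumptions): compute the
# two terms of the stochastic LQR cost by integrating the Riccati ODE backwards in time.
import numpy as np

A = np.array([[0., 1.], [0., 0.]])
B = np.array([[0.], [1.]])
Beps = np.array([[0.1], [0.1]])            # noise input matrix B_eps (assumption)
Q, Qf = np.eye(2), np.eye(2)
R = np.array([[0.1]])
T, dt = 10.0, 1e-3
x0 = np.array([1.0, 0.0])                  # illustrative initial state

P = Qf.copy()
trace_term = 0.0
for _ in range(int(T / dt)):
    # accumulate (1/2) * integral of tr(P(t) B_eps B_eps^T) dt
    trace_term += 0.5 * np.trace(P @ Beps @ Beps.T) * dt
    # -Pdot = P A + A^T P + Q - P B R^{-1} B^T P; step backwards from P(T) = Qf
    riccati = P @ A + A.T @ P + Q - P @ B @ np.linalg.solve(R, B.T @ P)
    P = P + dt * riccati                   # this is P(t - dt)

deterministic_cost = 0.5 * x0 @ P @ x0     # (1/2) x0^T P(0) x0
print(deterministic_cost, trace_term)      # stochastic cost is the sum of the two
```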

18 6.4 Linear Quadratic Gaussian (LQG)


19 Our development in the previous sections and the previous chapter was
20 based on a Markov Decision Process, i.e., we know the state x(t) at each
21 instant in time t even if this state x(t) changes stochastically. We said that
22 the optimal control for the linear dynamics is still u∗ (t) = −K(t) x(t).

1 What should one do if we cannot observe the state exactly?


Imagine a "continuous-time" form of the observation equation in the
Kalman filter where we receive observations of the form

y(t) = Cx(t) + Dν.

4 where ν ∼ N (0, I) is standard Gaussian noise that corrupts our observa-


5 tions y. If we extrapolate the definitions of the Kalman filter mean and
6 covariance to this continuous-time setting, we can write the KF as follows.
7 We know that the Kalman filter is the optimal estimate of the state given
8 all past observations, so it computes

µ(t) = E_{ϵ(s),ν(s): s∈[0,t]} [ x(t) | y(s) : s ∈ [0, t] ] .

There exists a "continuous-time version" of the Kalman filter (which was
actually invented first), called the Kalman-Bucy filter. If the covariance of
the estimate is

Σ(t) = E_{ϵ(s),ν(s): s∈[0,t]} [ x(t) x(t)⊤ | y(s) : s ∈ [0, t] ] ,

the Kalman-Bucy filter updates µ(t), Σ(t) using the differential equations

d/dt µ(t) = A µ(t) + B u(t) + K(t) ( y(t) − C µ(t) )
d/dt Σ(t) = A Σ(t) + Σ(t) A⊤ + Bϵ Bϵ⊤ − K(t) DD⊤ K(t)⊤     (6.12)
where K(t) = Σ(t) C⊤ (DD⊤)⁻¹ .

(Aside: As we discussed while introducing stochastic dynamical systems, there are
various mathematical technicalities associated with conditioning on a continuous-time
signal {y(s) : s ∈ [0, t]}. To be precise, mathematicians define what is called a
"filtration" Y(t), which is the union of the Borel σ-fields constructed using
increasing subsets of the set {y(s) : s ∈ [0, t]}. Let us not worry about this here.)

13 This equation is very close to the Kalman filter equations you saw in
14 Chapter 3. In particular, notice the close similarity of the expression for
15 the Kalman gain K(t) with the Kalman gain of the LQR problem. You
16 can read more at https://en.wikipedia.org/wiki/Kalman_filter.
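Here is a rough sketch of how one might simulate the Kalman-Bucy updates in (6.12) with an Euler discretization; the system matrices are assumptions, and the handling of the continuous-time measurement noise here is deliberately crude.

```python
# A minimal sketch (all matrices and the noise model are illustrative assumptions):
# Euler discretization of the Kalman-Bucy updates in (6.12).
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0., 1.], [0., 0.]])
B = np.array([[0.], [1.]])
Beps = 0.1 * np.eye(2)                 # dynamics noise matrix B_eps
C = np.array([[1., 0.]])               # we only measure position
D = np.array([[0.1]])                  # measurement noise matrix
dt, T = 1e-3, 5.0

x = np.array([1.0, 0.0])               # true (unobserved) state
mu = np.zeros(2)                       # filter mean
Sigma = np.eye(2)                      # filter covariance

for _ in range(int(T / dt)):
    u = np.array([0.0])                # some control input
    # simulate the true system and a (crudely discretized) noisy observation
    x = x + dt * (A @ x + B @ u) + np.sqrt(dt) * Beps @ rng.standard_normal(2)
    y = C @ x + D @ rng.standard_normal(1)
    # Kalman-Bucy updates (6.12)
    K = Sigma @ C.T @ np.linalg.inv(D @ D.T)
    mu = mu + dt * (A @ mu + B @ u + K @ (y - C @ mu))
    Sigma = Sigma + dt * (A @ Sigma + Sigma @ A.T + Beps @ Beps.T
                          - K @ D @ D.T @ K.T)
```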

Linear Quadratic Gaussian (LQG) It turns out that we can plug



in the Kalman filter estimate µ(t) of the state x(t) in order to


compute optimal control for LQR if we know the state only through
observations y(t)
u∗ (t) = −K(t) µ(t). (6.13)
It is almost as if we can blindly run a Kalman Filter in parallel with
the deterministic LQR controller and get the optimal control for the
stochastic LQR problem even if we did not observe the state of the
system exactly. This method is called Linear Quadratic Gaussian
(LQG).
This is a very powerful and surprising result. It is only true for
linear dynamical systems with linear observations, Gaussian noise in
both the dynamics and the observations and quadratic run-time and
terminal costs. It is not true in other cases. However, it is so elegant
and useful that it inspires essentially all other methods that control a
dynamical system using observations from sensors.

1 Certainty equivalence For instance, even if we are using a particle


2 filter to estimate the state of the system, we usually use the mean of the
3 state estimate at time t given by µ(t) “as if” it were the true state of the
4 system. Even if we were using some other feedback control u(x) different
5 than the LQR control (say feedback linearization), we usually plug in this
6 estimate µ(t) in place of x(t). Doing so is called “certainty equivalence”
7 in control theory/robotics, which is a word borrowed from finance where
one takes decisions (controls) directly using the estimate of the state (say
a stock price) while fully knowing that the stock price will change in the
future stochastically.

11 6.4.1 (Optional material) The duality between the Kalman


12 Filter and LQR
We can re-write the covariance update in (6.12) using the identity

d/dt Σ(t)⁻¹ = −Σ(t)⁻¹ Σ̇(t) Σ(t)⁻¹

to get

Ṡ = C⊤ (DD⊤)⁻¹ C − A⊤ S − S A − S Bw Bw⊤ S     (6.14)

where we have defined S := Σ⁻¹ and written Bw ≡ Bϵ for the noise matrix.


16 Notice that the two equations, updates to the LQR cost matrix in (6.11)

−Ṗ = P A + A⊤ P + Q − P BR−1 B ⊤ P

17 look quite similar to this equation. In fact, they are identical and you can
18 substitute the following.

LQR              Kalman-Bucy filter
P                Σ⁻¹
A                −A
B R⁻¹ B⊤         Bw Bw⊤
Q                C⊤ (DD⊤)⁻¹ C
t                T − t
Let us analyze this equivalence. Notice that the inverse of the Kalman
filter covariance plays the role of the cost matrix of LQR. This is conceptually
easy to understand: our figure of merit for filtering is the covariance matrix
(the smaller the better), and our figure of merit for the LQR problem is the
cost matrix P (the smaller the LQR cost, the better the controller). The
"dynamics" of the Kalman filter is the reverse of the dynamics of the LQR
problem; this shows up in the fact that the P matrix is updated backwards in
time while the covariance Σ is updated forwards in time.
10 The next identity
BR−1 B ⊤ = Bw Bw ⊤

11 is very interesting. Imagine a situation where we have a fully-actuated


system with B = I and Bw being a diagonal matrix. This identity suggests
that the larger the control cost Rii of a particular actuator i, the lower the
noise (Bw)ii of using that actuator, and vice-versa. This is how muscles in your
body have evolved: muscles that are cheap to use (low R) are also very
noisy in what they do, whereas muscles that are expensive to use (large
R), which are typically the biggest muscles in the body, are also the least
noisy and most precise. You can read more about this in the paper titled
19 “General duality between optimal control and estimation” by Emanuel
20 Todorov. The next identity
Q = C⊤ (DD⊤)⁻¹ C

is related to the quadratic state-cost in LQR. Imagine the situation where
both Q, D are diagonal matrices. If the noise in the measurements Dii is
large, this is equivalent to the state-cost matrix Qii being small; roughly,
there is no way we can achieve a low state-cost x⊤ Q x in our system that
consists of LQR and a Kalman filter (this combination is known as Linear
Quadratic Gaussian (LQG), as we saw before) if there is lots of noise in the
state measurements. The final identity
27 state measurements. The final identity

t=T −t

28 is the observation that we have made many times before: dynamic


29 programming travels backwards in time and the Kalman filter travels
30 forwards in time.

31 6.5 Iterative LQR (iLQR)


32 This section is analogous to the section on the Extended Kalman Filter.
33 We will study how to solve optimal control problems for a nonlinear

1 dynamical system

ẋ = f (x, u); x(0) = x0 is given.

2 We will consider a deterministic continuous-time dynamical system, the


3 modifications to following section that one would make if the system
4 is discrete-time, or stochastic, are straightforward and follow the same
5 strategy. First consider the problem where the run-time and terminal costs
6 are quadratic
(1/2) x(T)⊤ Qf x(T) + (1/2) ∫_0^T [ x(t)⊤ Q x(t) + u(t)⊤ R u(t) ] dt .

7 Receding horizon control and Model Predictive Control (MPC) One


8 easy way to solve the dynamic programming problem, i.e., find a control
9 trajectory of the nonlinear system that minimizes this cost functional,
10 approximately, is by linearizing the system about the initial state x0 and
11 some reference control u0 (this can usually be zero). Let the linear system
12 be
ż = Ax0 ,u0 z + Bx0 ,u0 v; z(0) = 0; (6.15)
where Ax0,u0 = df/dx |_{x=x0, u=u0} and Bx0,u0 = df/du |_{x=x0, u=u0} are the
Jacobians of the nonlinear function f(x, u) with respect to the state and
control respectively. The state of the linearized dynamics is

z := x − x0 , and v := u − u0 ,

16 We have emphasized the fact that the matrices Ax0 ,u0 , Bx0 ,u0 depend
17 upon the reference state and control using the subscript. Given the above
18 linear system, we can find a control sequence u∗ (·) that minimizes the
cost functional using the standard LQR formulation. Notice that even though
we computed this control trajectory using the approximate linear system,
it can certainly be executed on the nonlinear system, i.e., at run-time we
will simply set u ≡ u∗(z).
23 The linearized dynamics in (6.15) is potentially going to be very
24 different from the nonlinear system. The two are close in the neighborhood
25 of x0 (and u0 ) but as the system evolves using our control input to
26 move further away from x0 , the linearized model no longer is a faithful
27 approximation of the nonlinear model. A reasonable way to fix matters
28 is to linearize about another point, say the state and control after t = 1
29 seconds, x1 , u1 to get a new system

ż = Ax1 ,u1 z + Bx1 ,u1 v; z(0) = 0

30 and take the LQR-optimal control corresponding to this system for the
31 next second.
32 The above methodology is called “receding horizon control”. The
33 idea is that we compute the optimal control trajectory u∗ (·) using an
34 approximation of the original system and recompute this control every few
35 seconds when our approximation is unlikely to be accurate. This is a very

1 popular technique to implement optimal controllers in typical applications.


2 The concept of using an approximate model (almost invariably, a linear
3 model with LQR cost) to plan for the near-term future and resolving the
4 problem in receding horizon fashion once the system is at the end of this
5 short time-horizon is called “Model Predictive Control”.
MPC is, perhaps, the second most common control algorithm implemented
in the world. It is responsible for running most complex engineering systems
that you can think of: power grids, oil refineries, chemical plants, rockets,
aircraft etc. Essentially, one never implements LQR directly, it is always
implemented inside an MPC. For instance, in autonomous driving, the trajectory
that the vehicle plans for traveling between two points A and B depends upon
the current locations of the other cars/pedestrians in its vicinity, and
potentially some prediction model of where they will be in the future. As the
vehicle starts moving along this trajectory, the rest of the world evolves
around it and we recompute the optimal trajectory to take into account the
actual locations of the cars/pedestrians in the future.

? Can you guess what is the most common control algorithm in the world?
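A minimal sketch of the receding-horizon idea is given below; the dynamics function f, the re-planning interval and the use of the infinite-horizon CARE for the local linear model are all simplifying assumptions (a real MPC would also handle constraints and the affine drift of the linearization).

```python
# A minimal sketch of receding-horizon control about successive linearizations.
# The callable f(x, u) -> xdot and the tuning constants are assumptions.
import numpy as np
from scipy.linalg import solve_continuous_are

def numerical_jacobians(f, x, u, eps=1e-5):
    """Finite-difference Jacobians A = df/dx and B = df/du at (x, u)."""
    d, m = x.size, u.size
    A, B = np.zeros((d, d)), np.zeros((d, m))
    for i in range(d):
        dx = np.zeros(d); dx[i] = eps
        A[:, i] = (f(x + dx, u) - f(x - dx, u)) / (2 * eps)
    for j in range(m):
        du = np.zeros(m); du[j] = eps
        B[:, j] = (f(x, u + du) - f(x, u - du)) / (2 * eps)
    return A, B

def receding_horizon(f, x0, Q, R, t_total=10.0, dt=0.01, replan_every=1.0):
    x = x0.copy()
    u_ref = np.zeros(R.shape[0])
    for step in range(int(t_total / dt)):
        if step % int(replan_every / dt) == 0:
            # re-linearize about the current state and solve a local LQR problem
            # (the affine drift f(x_lin, u_lin) is ignored in this simple sketch)
            x_lin, u_lin = x.copy(), u_ref.copy()
            A, B = numerical_jacobians(f, x_lin, u_lin)
            P = solve_continuous_are(A, B, Q, R)
            K = np.linalg.solve(R, B.T @ P)
        u = u_lin - K @ (x - x_lin)          # v = -K z with z = x - x_lin
        x = x + dt * f(x, u)                 # roll the nonlinear system forward
    return x
```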

18 6.5.1 Iterative LQR (iLQR)


19 Now let us consider the situation when in addition to a nonlinear system,

ẋ = f (x, u); x(0) = x0 ,

20 the run-time and terminal cost is also nonlinear


qf(x(T)) + ∫_0^T q(x(t), u(t)) dt .

21 We can solve the dynamic programming problem in this case approximately


22 using the following iterative algorithm.
Assume that we are given an initial control trajectory u(0)(·) =
{ u(0)(t) : t ∈ [0, T] }. Let x(0)(·) be the state trajectory that corresponds
25 to taking this control on the nonlinear system, with of course x(0) (0) = x0 .
26 At each iteration k, the Iterative LQR algorithm performs the following
27 steps.

28 Step 1 Linearize the nonlinear system about the state trajectory x(k) (·)
29 and u(k) (·) using

z(t) := x(t) − x(k) (t), and v(t) := u(t) − u(k) (t)

30 to get a new system

ż = A(k) (t)z + B (k) (t)v; z(0) = 0

where

A(k)(t) = df/dx |_{x(t)=x(k)(t), u(t)=u(k)(t)} ,
B(k)(t) = df/du |_{x(t)=x(k)(t), u(t)=u(k)(t)} ,

and compute the Taylor series approximation of the nonlinear cost up to
the second order:

qf(x(T)) ≈ constant + z(T)⊤ (dqf/dx)|_{x(T)=x(k)(T)} + z(T)⊤ (d²qf/dx²)|_{x(T)=x(k)(T)} z(T),

q(x, u, t) ≈ constant + z(t)⊤ (dq/dx)|_{x(t)=x(k)(t), u(t)=u(k)(t)}        [affine term]
           + v(t)⊤ (dq/du)|_{x(t)=x(k)(t), u(t)=u(k)(t)}                   [affine term]
           + z(t)⊤ (d²q/dx²)|_{x(t)=x(k)(t), u(t)=u(k)(t)} z(t)            [≡ Q]
           + v(t)⊤ (d²q/du²)|_{x(t)=x(k)(t), u(t)=u(k)(t)} v(t).           [≡ R]

? How will you solve for the optimal controller for a linear dynamics with the cost
∫_0^T ( q⊤ x + (1/2) x⊤ Q x ) dt, i.e., when in addition to the quadratic cost, we
also have an affine term?

4 This is an LQR problem with run-time cost that depends on time (like our
5 discrete-time LQR formulation, the continuous-time formulation simply
6 has Q, R to be functions of time t in the Riccati equation) and which also
7 has terms that are affine in the state and control in addition to the usual
8 quadratic cost terms.

9 Step 2 Solve the above linearized problem using standard LQR formula-
10 tion to get the new control trajectory

u(k+1) (t) := u(k) (t) − Kz(t).

11 Simulate the nonlinear system using the control u(k+1) (·) to get the new
12 state trajectory x(k+1) (·).
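Putting the two steps together, here is a sketch of a discrete-time variant of the iLQR iteration for a trajectory-tracking cost; the specific quadratic cost, the Gauss-Newton approximation and the absence of regularization and a line search are simplifying assumptions (the feedforward term k below is what the affine terms of the cost expansion produce).

```python
# A minimal discrete-time iLQR sketch for tracking a goal trajectory xg.
# The dynamics f(x, u) -> x_next, the cost matrices and the lack of a line search
# are simplifying assumptions; a practical implementation regularizes Quu.
import numpy as np

def jacobians(f, x, u, eps=1e-5):
    d, m = x.size, u.size
    A, B = np.zeros((d, d)), np.zeros((d, m))
    for i in range(d):
        dx = np.zeros(d); dx[i] = eps
        A[:, i] = (f(x + dx, u) - f(x - dx, u)) / (2 * eps)
    for j in range(m):
        du = np.zeros(m); du[j] = eps
        B[:, j] = (f(x, u + du) - f(x, u - du)) / (2 * eps)
    return A, B

def ilqr(f, x0, u, xg, Q, R, Qf, num_iters=20):
    """u: initial controls, shape (T, m); xg: goal trajectory, shape (T+1, d)."""
    T, m = u.shape
    x = np.zeros((T + 1, x0.size)); x[0] = x0
    for t in range(T):                                 # rollout of u^(0)(.)
        x[t + 1] = f(x[t], u[t])
    for _ in range(num_iters):
        # backward pass: quadratic approximation of the cost-to-go along (x, u)
        Vx, Vxx = Qf @ (x[T] - xg[T]), Qf
        K, k = np.zeros((T, m, x0.size)), np.zeros((T, m))
        for t in reversed(range(T)):
            A, B = jacobians(f, x[t], u[t])
            Qx  = Q @ (x[t] - xg[t]) + A.T @ Vx
            Qu  = R @ u[t] + B.T @ Vx
            Qxx = Q + A.T @ Vxx @ A
            Quu = R + B.T @ Vxx @ B
            Qux = B.T @ Vxx @ A
            k[t] = -np.linalg.solve(Quu, Qu)           # feedforward
            K[t] = np.linalg.solve(Quu, Qux)           # feedback gain
            Vx  = Qx - K[t].T @ Qu
            Vxx = Qxx - K[t].T @ Quu @ K[t]
        # forward pass: roll out the nonlinear system with the new controller
        xn = np.zeros_like(x); xn[0] = x0
        un = np.zeros_like(u)
        for t in range(T):
            un[t] = u[t] + k[t] - K[t] @ (xn[t] - x[t])
            xn[t + 1] = f(xn[t], un[t])
        x, u = xn, un
    return x, u
```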
13 Some important comments to remember about the iLQR algorithm.

14 1. There are many ways to pick the initial control trajectory u(0) (·), e.g.,
15 using a spline to get an arbitrary control sequence, using a spline
16 to interpolate the states to get a trajectory x(0) (·) and then back-
17 calculate the control trajectory, using the LQR solution based on the
18 linearization about the initial state, feedback linearization/differen-
19 tial flatness (https://en.wikipedia.org/wiki/Feedback_linearization)
20 etc.

21 2. The iLQR algorithm is an approximate solution to dynamic pro-


22 gramming for nonlinear system with general, nonlinear run-time and
23 terminal costs. This is because the the algorithm uses a linearization
24 about the previous state and control trajectory to compute the new

1 control trajectory. iLQR is not guaranteed to find the optimal


2 solution of dynamic programming, although in practice with good
3 implementations, it works excellently.

4 3. We can think of iLQR as an algorithm to track a given state trajectory


5 xg (t) by setting
2
qf = 0, and q(x, u) = ∥xg (t) − x(t)∥ .

6 This is often how iLQR is typically used in practice, e.g., to make


7 an autonomous race car closely follow the racing line (see the paper
8 “BayesRace: Learning to race autonomously using prior experience”
9 https://arxiv.org/abs/2005.04755 and https://www.youtube.com/watch?v=dgIpf0Lg8Ek
10 for a clever application of using MPC to track a challenging race
11 line), or to make a drone follow a given desired trajectory
12 (https://www.youtube.com/watch?v=QREeZvHg0lQ).

13 Differential Dynamic Programming (DDP) is a suite of techniques


14 that is a more powerful version of iterated LQR. Instead of linearizing
15 the dynamics and taking a second order Taylor approximation of the cost,
16 DDP takes a second order approximation of the Bellman equation directly.
17 The two are not the same; DDP is the more correct version of iLQR but is
18 much more challenging computationally.
19 Broadly speaking, iLQR and DDP are used to perform control for some
20 of the most sophisticated robots today, you can see an interesting discussion
21 of the trajectory planning of some of the DARPA Humanoid Robotics
22 Challenge at https://www.cs.cmu.edu/~cga/drc/atlas-control. Techniques
23 like feedback linearization work excellently for drones where we do not
24 really care for optimal cost (see “Minimum snap trajectory generation and
25 control for quadrotors” https://ieeexplore.ieee.org/document/5980409)
26 while LQR and its variants are still heavily utilized for satellites in space.
1 Chapter 7

2 Imitation Learning

Reading
1. The DAGGER algorithm
(https://www.cs.cmu.edu/~sross1/publications/Ross-
AIStats11-NoRegret.pdf)

2. https://www.youtube.com/watch?v=TUBBIgtQL_k

3. An Algorithmic Perspective on Imitation Learning


(https://arxiv.org/pdf/1811.06711.pdf)

3 This is the beginning of Module 3 of the course. The previous two


modules have been about how to estimate the state of the world around
5 the robot (Module 1) and how to move the robot (or the world) to a desired
6 state (Module 2). Both of these required that we maintain a model of the
7 dynamics of the robot; this model may be inaccurate and we fudged over
8 this inaccuracy by modeling the remainder as “noise” in Markov Decision
9 Processes.
10 The next few lectures introduce different aspects of what is called
11 Reinforcement Learning (RL). This is a very large field and you can think
12 of using techniques from RL in many different ways.

13 1. Dynamic programming with function approximation. If we


14 are solving a dynamic programming problem, we can think of
15 writing down the optimal cost-to-go J ∗ (x, t) as a function of some
16 parameters, e.g., the cost-to-go is

Jφ(x, t) = (1/2) x(t)⊤ (some function of A, B, Q, R; parametrized by φ) x(t)

17 for LQR. We know the stuff inside the brackets to be exactly P (t)
18 but, if we did not, it could be written down as some generic function


1 of parameters φ. We know that any cost-to-go that satisfies the


2 Bellman equation is the optimal cost-to-go, so we can now “fit”
3 the candidate function Jφ (x, t) to satisfy the Bellman equation.
4 Similarly, one may also express the optimal feedback control u(·)
5 using some parameters θ as

uθ (·).

6 We will see how to fit such functions in this chapter.

7 2. Learning from data. It may happen that we do not know very


8 much about the dynamical system, e.g., we do not know a good
9 model for what drives customers as they buy items in an online
10 merchandise platform, or a robot traveling in a crowded area may
11 not have a good model for how large crowds of people walk around
it. One may collect data from these systems, fit some model of the
form ẋ = f(x, u) to the data, and then go back to the techniques of
Module 2. It is typically not clear how much data one should collect.
15 RL gives a suite of techniques to learn the cost-to-go in these
16 situations by collecting and assimilating the data automatically.
17 These techniques go under the umbrella of policy gradients, on-
18 policy methods etc. One may also simply “memorize” the data
19 provided by an expert operator, this is called Imitation Learning
20 and we will discuss it next.

21 Some motivation Imitation Learning is also called “learning from


22 demonstrations”. This is in fact one of the earliest successful examples of
23 using a neural network for driving. The ALVINN project at CMU by Dean
24 Pomerleau in 1988 (https://www.youtube.com/watch?v=2KMAAmkz9go)
25 used a two-layer neural network with 5 hidden neurons, about 1000 inputs
26 from the pixels of a camera and 30 outputs. It successfully drove in
27 different parts of the United States and Germany. Imitation learning has
28 also been responsible for numerous other early-successes of RL, e.g.,
acrobatic maneuvers on an RC helicopter
(http://ai.stanford.edu/~acoates/papers/AbbeelCoatesNg_IJRR2010.pdf).

Imitation Learning seeks to record data from experts, e.g., humans,



and reproduce these desired behaviors on robots. The key questions


we should ask, and which we will answer in this chapter, are as
follows.

1. Who should demonstrate (experts, amateurs, or novices) and


how should we record data (what states, controls etc.)?

2. How should we learn from this data? e.g., fit a supervised


regression model for the policy. How should one ignore bad
behaviors in non-expert data?

3. And most importantly, what can we do if the robot encounters


a situation which was not in the dataset.

1 7.1 A crash course in supervised learning


2 Nature gives us data X and targets Y for this data.

X → Y.

3 Nature does not usually tell us what property of a datum x ∈ X results in


4 a particular prediction y ∈ Y . We would like to learn to imitate Nature,
5 namely predict y given x.
6 What does such learning mean? It is simply a notion of being able
7 to identify patterns in the input data without explicitly programming a
8 computer for prediction. We are often happy with a learning process
9 that identifies correlations: if we learn correlations on a few samples
10 (x1 , y 1 ), . . . , (xn , y n ), we may be able to predict the output for a new
11 datum xn+1 . We may not need to know why the label of xn+1 was
12 predicted to be so and so.
13 Let us say that Nature possesses a probability distribution P over
14 (X, Y ). We will formalize the problem of machine learning as Nature
15 drawing n independent and identically distributed samples from this
16 distribution. This is denoted by
Dtrain = { (xi, yi) ∼ P : i = 1, . . . , n }

17 is called the “training set”. We use this data to identify patterns that help
18 make predictions on some future data.

What is the task in machine learning? Suppose Dtrain consists of
n = 50 RGB images of size 100×100 of two kinds, ones with an orange
inside them and ones without. 10⁴ is a large number of pixels, each pixel
taking any of the possible 255³ values. Suppose we discover that one
particular pixel, say at location (25, 45), takes distinct values in all images
inside our training set. We can then construct a predictor based on this
pixel. This predictor, which is a binary classifier, perfectly maps the training
images to their labels (orange: +1 or no orange: -1).

? How many such binary classifiers are there at most?

If xkij is the (ij)th


2 pixel for image xk , then we use the function
f(x) = { yk   if xij = xkij for some k = 1, . . . , n;   −1   otherwise. }

3 This predictor certainly solves the task. It correctly works for all images
4 in the training set. Does it work for images outside the training set?
5 Our task in machine learning is to learn a predictor that works outside
6 the training set. The training set is only a source of information that Nature
7 gives us to find such a predictor.

Designing a predictor that is accurate on Dtrain is trivial. A hash


function that memorizes the data is sufficient. This is NOT our task
in machine learning. We want predictors that generalize to new data
outside Dtrain .

8 Generalization If we never see data from outside Dtrain why should we


9 hope to do well on it? The key is the distribution P . Machine learning is
10 formalized as constructing a predictor that works well on new data that is
11 also drawn independently from the distribution P . We will call this set of
12 data the “test set”
Dtest
13 and it is constructed similarly. This assumption is important. It provides
14 coherence between past and future samples: past samples that were used
15 to train and future samples that we will wish to predict upon. How to find
16 such predictors that work well on new data? The central idea in machine
17 learning is to restrict the set of possible binary functions that we consider.

We are searching for a predictor that generalizes well but only


have the training data to select predictors.

18 The right class of functions f cannot be too large, otherwise we will


19 find our binary classifier above as the solution, and that is not very useful.
20 The class of functions cannot be too small either, otherwise we won’t be
21 able to predict difficult images. If the predictor does not even work well
22 on the training set, there is no reason why we should expect it to work on
23 the test set.

Finding this correct class of functions with the right balance is
what machine learning is all about.

? Can you now think how machine learning is different from other fields you
might know, such as statistics or optimization?

1 7.1.1 Fitting a machine learning model


2 Let us now solve a classification problem. We will again go around
3 the model selection problem and consider the class of linear classifiers.
4 Assume binary labels Y ∈ {−1, 1}. To keep the notation clear, we will
5 use the trick of appending a 1 to the data x and hide the bias term b in the
6 linear classifier. The predictor is now given by

f (x; w) = sign(w⊤ x)
(
+1 if w⊤ x ≥ 0 (7.1)
=
−1 else.

We have used the sign function, denoted as sign, to get binary {−1, +1} outputs
from our real-valued prediction w⊤ x. This is the famous perceptron
model of Frank Rosenblatt.
We want the predictions of the model to match those in the training
data and devise an objective to fit/train the perceptron:
ℓzero-one(w) := (1/n) Σ_{i=1}^n 1{yi ≠ f(xi; w)}.     (7.2)

12 The indicator function inside the summation measures the number of


13 mistakes the perceptron makes on the training dataset. The objective
14 here is designed to find weights w that minimizes the average number of
15 mistakes, also known as the training error. Such a loss that measures the
mistakes is called the zero-one loss; it incurs a penalty of 1 for a mistake
and zero otherwise.

? Can you think of some quantity other than the zero-one error that we may
wish to optimize?

18 Surrogate losses The zero-one loss is the clearest indication of whether


19 the perceptron is working well. It is however non-differentiable, so we
20 cannot use powerful ideas from optimization theory to minimize it. This
21 is why surrogate losses are constructed in machine learning. These are
22 proxies for the loss function, typically for the classification problems and
23 look as follows. The exponential loss is

ℓexp(w) = exp( −y (w⊤ x) )

or the logistic loss is

ℓlogistic(w) = log( 1 + exp( −y w⊤ x ) ).

25 Stochastic Gradient Descent (SGD) SGD is a very general algorithm


26 to optimize objectives typically found in machine learning. We can use
27 it so long as we have a dataset and an objective that is differentiable.
28 Consider an optimization problem where we want to solve for
w∗ = argmin_w (1/n) Σ_{i=1}^n ℓi(w)

1 where the function ℓi denotes the loss on the sample (xi , y i ) and w ∈ Rp
2 denotes the weights of the classifier. Solving this problem using SGD
3 corresponds to iteratively updating the weights using

dℓωt (w)
w(t+1) = w(t) − η ,
dw w=w(t)

4 i.e., we compute the gradient one sample with index ωt in the dataset. The
5 index ωt is chosen uniformly randomly from

ωt ∈ {1, . . . , n} .

6 In practice, at each time-step t, we typically select a few (not just one) input
ωt
7 data ωt from the training dataset and average the gradient dℓ dw(w)
w=w(t)
8 across them; this is known as a “mini-batch”. The gradient of the loss
9 ℓωt (w) with respect to w is denoted by

∇w1 ℓωt (w(t) )


 

dℓωt (w) ∇w2 ℓωt (w(t) )


∇ℓωt (w(t) ) := =  ∈ Rp .
 
..
dw w=w(t)  . 
∇wp ℓωt (w(t) )

10 The gradient ∇ℓωt (w(t) ) is therefore a vector in Rp . We have written

dℓωt (w)
∇w1 ℓωt (w(t) ) =
dw1 w=w(t)

11 for the scalar-valued derivative of the objective ℓωt (w(t) ) with respect to
12 the first weight w1 ∈ R. We can therefore write SGD as

w(t+1) = w(t) − η∇ℓωt (w(t) ). (7.3)

13 The non-negative scalar η ∈ R+ is called the step-size or the learning rate.


14 It governs the distance traveled along the negative gradient −∇ℓωt (w(t) )
15 at each iteration.
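The following sketch runs mini-batch SGD on the logistic loss; the synthetic data, batch size and learning rate are assumptions chosen only for illustration.

```python
# A minimal sketch of mini-batch SGD on the logistic loss.
# The synthetic data, batch size and learning rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 5
X = rng.standard_normal((n, p))
w_true = rng.standard_normal(p)
Y = np.sign(X @ w_true)                      # labels in {-1, +1}

def grad_logistic(w, x, y):
    # d/dw log(1 + exp(-y w^T x)) = -y x / (1 + exp(y w^T x))
    return -y * x / (1.0 + np.exp(y * (w @ x)))

w = np.zeros(p)
eta, batch_size = 0.1, 16
for t in range(2000):
    idx = rng.integers(0, n, size=batch_size)                  # mini-batch indices omega_t
    g = np.mean([grad_logistic(w, X[i], Y[i]) for i in idx], axis=0)
    w = w - eta * g                                            # w^(t+1) = w^(t) - eta * gradient

print(np.mean(np.sign(X @ w) != Y))                            # zero-one training error
```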

16 7.1.2 Deep Neural Networks


17 The Perceptron in (7.1) is a linear model: it computes a linear function
18 of the weights w⊤ x and uses this function to make the predictions
19 f (x; w) = sign(w⊤ x). Linear models try to split the data (say we have
20 binary labels Y = {−1, 1}) using a hyper-plane with w denoting the
21 normal to this hyper-plane. This does not work for all situations of course,
22 as the figure below shows, there is no hyper-plane that cleanly separates
23 the two classes (i.e., achieves zero mis-prediction error) but there is a
24 nonlinear function that can do the job.
25 A deep neural network is one such nonlinear function. First consider
26 a “two-layer” network

f(x; v, S) = sign( v⊤ σ( S⊤ x ) )


Figure 7.1

1 where the matrix S ∈ Rd×p and a vector v ∈ Rp are the parameters or


2 “weights” of the classifier. The “nonlinearity” σ is usually set to be what
3 is called a Rectified Linear Unit (ReLU)

σ(x) := ReLU(x) = |x|₊ = max(0, x).     (7.4)
Just like in the case of the Perceptron, we can use an objective (1/n) Σ_{i=1}^n ℓi(v, S)
5 that depends on both v, S to fit this classifier on training data. A deep
6 neural network takes the idea of a two-layer network to the next step and
7 has multiple “layers”, each with a different weight matrix S1 , . . . , SL .
8 The classifier is therefore given by

f (x; v, S1 , . . . , SL ) = sign v ⊤ σ SL⊤ . . . σ S2⊤ σ(S1⊤ x) . . . . (7.5)


 

We call each operation of the form σ Sk⊤ . . . , as a layer. Consider the



9

10 second layer: it takes the features generated by the first layer, namely
11 σ(S1⊤ x), multiplies these features using its feature matrix S2⊤ and applies
12 a nonlinear function σ(·) to this result element-wise before passing it on
13 to the third layer.

A deep network creates new features by composing older features.

14 This composition is very powerful. Not only do we not have to


15 pick a particular feature vector, we can create very complex features by
16 sequentially combining simpler ones. For example Figure 7.2 shows the
17 features (more precisely, the kernel) learnt by a deep neural network.
18 The first layer of features are called Gabor-like, and incidentally they are
19 similar to the features learned by the human brain in the first part of the
20 visual cortex (the one closest to the eyes). These features are combined
21 linearly along with a nonlinear operation to give richer features (spirals,
22 right angles) in the middle panel. The third layer combines the lower
23 features to get even more complex features, these look like patterns (notice
24 a soccer ball in the bottom left), a box on the bottom right etc.

1 Deep networks are universal function approximators The multi-layer


2 neural network is a powerful class of classifiers: depending upon how many
layers we have and what is the dimensionality of the weight matrices
4 Sk at each layer, we can fit any training data. In fact, this statement,
5 which is called the universal approximation property holds even for a
6 two-layer neural network v ⊤ σ(S ⊤ x) if the number of columns in S is big
7 enough. This property is the central reason why deep networks are so
8 widely applicable, we can model complex machine learning problems if
9 we choose a big enough deep network.

Figure 7.2
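A minimal PyTorch sketch of such a multi-layer network is given below; the layer widths, the number of classes and the use of a cross-entropy loss on the logits are illustrative assumptions.

```python
# A minimal PyTorch sketch of a multi-layer network like (7.5); the layer widths,
# number of classes and the random data are illustrative assumptions.
import torch
import torch.nn as nn

class DeepNet(nn.Module):
    def __init__(self, d=100, p=64, num_layers=3, num_classes=2):
        super().__init__()
        layers, in_dim = [], d
        for _ in range(num_layers):
            layers += [nn.Linear(in_dim, p), nn.ReLU()]   # sigma(S_k^T x) with sigma = ReLU
            in_dim = p
        self.features = nn.Sequential(*layers)
        self.v = nn.Linear(p, num_classes)                # the last layer v produces the logits

    def forward(self, x):
        return self.v(self.features(x))                   # logits y_hat

net = DeepNet()
x = torch.randn(32, 100)                                  # a mini-batch of 32 inputs
logits = net(x)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (32,)))
loss.backward()                                           # backpropagation computes the gradients
```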

10 Logits for multi-class classification. The output

ŷ = v⊤ σ( SL⊤ . . . σ( S2⊤ σ(S1⊤ x) ) . . . )
11 is called the logits corresponding to the different classes. This name


12 comes from logistic regression where logits are the log-probabilities of an
13 input datum belonging to one of the two classes. A deep network provides
14 an easy way to solve a multi-class classification problem, we simply set

v ∈ Rp×C

15 where C is the total number of classes in the data. Just like logistic
16 regression predicts the logits of the two classes, we would like to interpret
17 the vector ŷ as the log-probabilities of an input belonging to one of the
classes.

? What would the shape of w be if you were performing regression using a deep
network?
19 Weights It is customary to not differentiate between the parameters of
20 different layers of a deep network and simply say weights when we want
21 to refer to all parameters. The set

w := {v, S1 , S2 , . . . , SL }

22 is the set of weights. This set is typically stored in PyTorch as a set of


23 matrices, one for each layer. Using this new notation, we will write down
24 a deep neural network classifier as simply

f (x, w) (7.6)

1 and fitting the deep network to a dataset involves the optimization problem
w∗ = argmin_w (1/n) Σ_{i=1}^n ℓ(yi, ŷi).     (7.7)

2 We will also sometimes denote the loss of the ith sample as

ℓi (w) := ℓ(y i , ŷ i ).

3 Backpropagation The Backpropagation algorithm is a method to com-


4 pute the gradient of the objective while fitting a deep network using SGD,
5 i.e., it computes ∇w ℓi (w). For the purposes of this course, the details of
6 how this is done are not essential, so we will skip them. You can read more
7 in the notes of ESE 546 at https://pratikac.github.io/pub/20_ese546.pdf.

8 PyTorch We will use a library called PyTorch (https://pytorch.org) to


9 code up deep neural networks for the reinforcement learning part of this
10 course. You can find some excellent tutorials for it at
11 https://pytorch.org/tutorials/beginner/basics/intro.html. We have also
12 uploaded two recitations from the Fall 2020 offering of ESE 546 on
13 Canvas which guide you through various typical use-cases of PyTorch.
14 You are advised to go through, at least, the first recitation if you are
15 not familiar with PyTorch. For the purposes of this course, you do not
16 need to know the intricacies of PyTorch, we will give you enough code
17 to work with deep networks so that you can focus on implementing the
18 reinforcement learning-specific parts.

19 7.2 Behavior Cloning


20 With that background, we are ready to tackle what is potentially the simplest
21 problem in RL. We will almost exclusively deal with discrete-time systems
22 for RL. Let us imagine that we are given access to n trajectories each of
23 length T + 1 time-steps from an expert demonstrator for our system. We
24 write this as a training dataset

D = { (xit, uit) : t = 0, 1, . . . , T; i = 1, . . . , n }.

25 At each step, we record the state xit ∈ Rd and the control that the expert
26 took at that state uit . We would like to learn a deterministic feedback
27 control for the robot that is parametrized by parameters θ

uθ (x) : X 7→ U ⊂ Rm .

28 using the training data. The idea is that if uθ (xi (t)) ≈ ui (t) for all i
29 and all times t, then we can simply run our learned controller uθ (x) on
the robot instead of having the expert. A simple example is a baby deer
learning to imitate its mother in how to run.

1 Parameterizing the controller Our function uθ may represent many


2 different families of controllers. For example, uθ (x) = θx where θ ∈
3 Rd×p is a linear controller; this is much like the control for LQR except
4 that we can fit θ to the expert’s data instead of solving the LQR problem
5 to find the Kalman gain. We could also think of some other complicated
6 function, e.g., a two-layer neural network,

uθ(x) = v σ( S⊤ x )


where S ∈ Rd×p and v ∈ Rm×p and σ : Rp → Rp is some nonlinearity applied
element-wise, say ReLU. As we did above, we will use

θ := (v, S)

9 to denote all the weights of this two-layer neural network. Multi-layer


neural networks are also another possible avenue. In general, we want
the parameterization of the controller to be rich enough to fit some
complex controller that the expert may have used on the system.

13 How to fit the controller? Given our chosen model for uθ (x), say a
14 two-layer neural network with weights θ, fitting the controller involves
15 finding the best value for the parameters θ such that uθ (xit ) ≈ uit for data
16 in our dataset. There are many ways to do this, e.g., we can solve the
17 following optimization problem
θ̂ = argmin_θ ℓ(θ) := (1/n) Σ_{i=1}^n ℓi(θ),   where   ℓi(θ) = 1/(T+1) Σ_{t=0}^T ∥ uit − uθ(xit) ∥₂² .     (7.8)

18 The difficulty of solving the above problem depends upon how difficult the
19 model uθ (x) is, for instance, if the model is linear θ x, we can solve (7.8)
20 using ordinary least squares. If the model is a neural network, one would
21 have to use SGD to solve the optimization problem above. After fitting
22 this model, we have a new controller

uθb(x) ∈ Rm

23 that we can use anywhere in the domain X ⊂ Rd , even at places where


24 we had no expert data. This is known as Behavior Cloning, i.e., cloning
25 the controls of the expert into a parametric model.
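A minimal PyTorch sketch of behavior cloning with the squared-error objective in (7.8) looks as follows; the expert data arrays and the two-layer architecture are placeholders/assumptions.

```python
# A minimal sketch of behavior cloning (7.8) with a two-layer network in PyTorch.
# The expert data arrays xs (states) and us (controls) are placeholders; in practice
# they come from the expert's recorded trajectories.
import torch
import torch.nn as nn

d, m, p = 4, 2, 64                           # state, control, hidden dimensions (assumptions)
xs = torch.randn(10000, d)                   # placeholder for expert states x^i_t
us = torch.randn(10000, m)                   # placeholder for expert controls u^i_t

policy = nn.Sequential(nn.Linear(d, p), nn.ReLU(), nn.Linear(p, m))   # u_theta(x)
opt = torch.optim.SGD(policy.parameters(), lr=1e-2)

for epoch in range(100):
    perm = torch.randperm(xs.shape[0])
    for idx in perm.split(256):              # mini-batches
        loss = ((policy(xs[idx]) - us[idx]) ** 2).sum(dim=-1).mean()  # squared-error loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```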

26 Generalization performance of behavior cloning Note that the data


27 provided by the expert is not iid, of course the state xit+1 in the expert’s
28 trajectory depends upon the previous state xit . Standard supervised
29 learning makes the assumption that Nature gives training data that is
30 independent and identically distributed from the distribution P . While
31 it is still reasonable to fit the regression loss in (7.8) for such correlated
32 data, one should remember that if the expert trajectories do not go to all
33 parts of the state-space, the learned controller fitted on the training data

1 may not work outside these parts. Of course, if we behavior clone the
2 controls taken by a generic driver, they are unlikely to be competitive for
3 racing, and vice-versa. It is very important to realize that this does not
4 mean that BC does not generalize. Generalization in machine learning is
5 a concept that suggests that the model should work well on data from the
same distribution. What does the "distribution" of the expert mean? In
this case, it simply refers to the distribution of the states that the expert's
trajectories typically visit, e.g., a race driver typically drives at the limits
of tire friction and throttle; this is different from a usual city-driver who
would rather maximize the longevity of their tires and engine-life.

11 7.2.1 Behavior cloning with a stochastic controller


So far, we have always chosen feedback controllers that
are deterministic, i.e., there is a single value of control u that is taken at
14 the state x. Going forward, we will also talk about stochastic controllers,
15 i.e., controllers which sample a control from a distribution. There can
16 be a few reasons of using such a controller. First, we will see in later
17 lectures how this may help in training a reinforcement learning algorithm;
18 this is because in situations where you do not know the system dynamics
19 precisely, it helps to “hedge” the feedback to take a few different control
20 actions instead of simply the one that the value function deems as the
21 maximizing one. This is not very different from having a few different
22 stocks in your portfolio. Second, we benefit from this hedging even at
23 test-time when we run a stochastic feedback control, e.g., in situations
24 where the limited training data may not want to always pick the best
25 control (because the best control was computed using an imprecise model
26 of the system dynamics and could be wrong), but rather hedge our bets by
27 choosing between a few different controls.
28 A stochastic feedback control is denoted by

u ∼ uθ (· | x) = P(· | x)

29 notice that uθ (· | x) is a probability distribution on the control space U


30 that depends on the state x, and in this case the parameters θ. The control
31 taken at a state x is a sample drawn from this probability distribution. The
32 deterministic controller is a special case of this setup where

uθ (u| x) = δuθ (x) (u) ≡ uθ (x)

33 is a Dirac-delta distribution at uθ (x). If the control space U is discrete,


34 then uθ (· | x) could be a categorical distribution. If the control space U
35 is continuous, then you may wish to think of the controls being sampled
36 from a Gaussian distribution with some mean µθ (x) and variance σθ2 (x)

Rm ∋ u ∼ uθ (· | x) = N (µθ (x), Σθ (x)).



1 Maximum likelihood estimation Let’s pick a particular stochastic


2 controller, say a Gaussian. How should we fit the parameters θ for this?
3 We would like to find parameters θ that make the expert’s data in our
4 dataset very likely. The log-likelihood of each datum is

log uθ (uit | xit )

5 and maximizing the log-likelihood of the entire dataset amounts to solving


θ̂ = argmin_θ (1/n) Σ_{i=1}^n ℓi(θ),   where   ℓi(θ) = −1/(T+1) Σ_{t=0}^T log uθ(uit | xit) .     (7.9)

6 Fitting BC with a Gaussian controller Notice that if we use a Gaussian


distribution

uθ(· | x) = N( µθ(x), I )

as our stochastic controller, the objective in (7.9) is the same as that
in (7.8) (up to additive and multiplicative constants). If we instead use a
Gaussian with a state-dependent variance,

uθ(· | x) = N( µθ(x), σθ²(x) I ),

we have that

−log uθ(u | x) = ∥ µθ(x) − u ∥₂² / ( 2 σθ²(x) ) + m log σθ(x) + c ,

where c is a constant and m is the dimension of the control.
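A sketch of this maximum-likelihood objective for a Gaussian stochastic controller with a state-dependent mean and (diagonal) variance is given below; the network architecture and the random placeholder data are assumptions.

```python
# A minimal sketch of the maximum-likelihood objective (7.9) for a Gaussian stochastic
# controller with state-dependent mean and diagonal variance (architecture is an assumption).
import torch
import torch.nn as nn

d, m = 4, 2
backbone = nn.Sequential(nn.Linear(d, 64), nn.ReLU())
mean_head = nn.Linear(64, m)                   # mu_theta(x)
log_std_head = nn.Linear(64, m)                # log sigma_theta(x), one per control dimension

def neg_log_likelihood(x, u):
    h = backbone(x)
    mu, log_std = mean_head(h), log_std_head(h)
    # -log N(u; mu, diag(sigma^2)) up to an additive constant
    return (0.5 * ((u - mu) / log_std.exp()) ** 2 + log_std).sum(dim=-1).mean()

x = torch.randn(256, d)                        # a batch of expert states (placeholder data)
u = torch.randn(256, m)                        # the corresponding expert controls
loss = neg_log_likelihood(x, u)
loss.backward()
```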

12 7.2.2 KL-divergence form of Behavior Cloning


13 Background on KL divergence The Kullback-Leibler (KL) divergence
14 is a quantity to measure the distance between two probability distributions.
15 There are many similar distances, for example, given two probability
16 distributions p(x) and q(x) supported on a discrete set X, the total
17 variation distance between them is
TV(p, q) = (1/2) Σ_{x∈X} | p(x) − q(x) | .

18 Hellinger distance (https://en.wikipedia.org/wiki/Hellinger_distance), f -


19 divergences (https://en.wikipedia.org/wiki/F-divergence) and the Wasser-
20 stein metric
21 (https://en.wikipedia.org/wiki/Wasserstein_metric) are a few other exam-
22 ples of ways to measure how different two probability distributions are
23 from each other.
24 The Kullback-Leibler divergence (KL) between two distributions is
25 given by
KL(p || q) = Σ_{x∈X} p(x) log( p(x) / q(x) ) .     (7.10)

This is a distance and not a metric, i.e., it is always non-negative, and zero
if and only if the two distributions are equal, but the KL-divergence
2 is not symmetric (like a metric has to be). Also, the above formula is
3 well-defined only if for all x where q(x) = 0, we also have p(x) = 0.
4 Notice that it is not symmetric
KL(q || p) = Σ_{x∈X} q(x) log( q(x) / p(x) ) ≠ KL(p || q).

5 The funny notation KL(p || q) was invented by Shun-ichi Amari


6 (https://en.wikipedia.org/wiki/Shun%27ichi_Amari) to emphasize the fact
that the KL-divergence is asymmetric. The KL-divergence is always
non-negative: you can show this using an application of Jensen's inequality.
For distributions with continuous support, we integrate over the entire
space X and define the KL divergence as

KL(p || q) = ∫_X p(x) log( p(x) / q(x) ) dx .
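A tiny numerical example (with made-up categorical distributions) makes the asymmetry of the KL-divergence concrete.

```python
# A small numpy example of the KL-divergence (7.10) between two made-up categorical
# distributions, illustrating that KL(p || q) != KL(q || p).
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

def kl(p, q):
    return np.sum(p * np.log(p / q))

print(kl(p, q), kl(q, p))   # the two values differ
```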

11 Behavior Cloning Let us now imagine the expert is also a parametric


12 stochastic feedback controller uθ∗ (· | x). Our data is therefore drawn by
13 running this controller for n trajectories, T time-steps on the system. This
14 dataset now consists of samples from

puθ∗ (x, u)

15 which is the joint distribution on the state-space X and the control-space U .


16 We have denoted the parameters of the feedback controller which creates
17 this distribution as the subscript uθ∗ . Our behavior cloning controller
18 creates a similar distribution puθ (x, u) and the general version of the
19 objective in (7.9) is therefore

θ̂ = argmin_θ KL( puθ∗ || puθ ) ;     (7.11)

20 The objective in (7.9) corresponds to this for Gaussian stochastic con-


21 trollers, but we can just as easily imagine some other distribution for the
22 stochastic controller of the expert and the robot.

Written this way, BC can be understood as finding a controller θb


whose distribution on the states and controls is close to the distribution
of states and controls of the expert.

23 7.2.3 Some remarks on Behavior Cloning


24 Worst-case performance Performance of Behavior Cloning can be
25 quite bad in the worst case. The authors in “Efficient reductions for
26 imitation learning” (https://www.cs.cmu.edu/~sross1/publications/Ross-
27 AIStats11-NoRegret.pdf) show that if the learned controller uθb differs
28 from the control taken by the expert controller uθ∗ with a probability ϵ at

each time-step, over a horizon of length T time-steps, it can be O(T²ϵ) off
from the cost-to-go of the expert as averaged over states that the learned
3 controller visits. This is because once the robot makes a mistake and goes
4 away from the expert’s part in the state-space, future states of the robot
5 and the expert can be very different.

6 Model-free nature of BC Observe that our learned controller uθb(· | x)


7 is a feedback controller and works for entire state-space X. We did not
8 need to know the dynamics of the system to build this controller. The
9 data from the expert is conceptually the same as the model ẋ = f (x, u) of
10 the dynamics, and you can learn controllers from both. Do you however
11 notice a catch?

12 7.3 DAgger: Dataset Aggregation


13 The expert’s dataset in Behavior Cloning determines the quality of the
14 controller learned. If we collected very few trajectories from the expert,
15 they may not cover all parts of the state-space and the behavior cloned
16 controller has no data to fit the model in those parts.
17 Let us design a simple algorithm, of the same spirit as iterative-LQR,
18 to mitigate this. We start with a candidate controller, say uθ(0) (x); one
19 may also start with a stochastic controller uθ(0) (· | x) instead.

DAgger: Let the dataset D(0) be the data collected from the
expert. Initialize uθ(0) = uθb to be the BC controller learned using
data D(0) . At iteration k

1. The robot queries the expert for a fraction p of the time-steps


and uses its learned controller uθ(k−1) for the other time-steps.
If the expert corresponds to some controller uθ∗ , then the robot
controller at a state x is

u ∼ p δuθ∗ (x) + (1 − p) δuθ(k−1) (x) .


2. Use u(x) to collect a dataset D = (xit , uit )t=0,...,T i=1,...,n
with n trajectories.

3. Set the new dataset to be D(k) = D(k−1) ∪ D

4. Fit a controller uθ(k) using behavior cloning to the new dataset


D(k) .

20 The above algorithm iteratively updates the BC controller uθb by


21 drawing new data from the expert. The robot first bootstraps off the
22 expert’s data, this simply means that it uses the expert’s data to fit its
23 controller uθ(0) (x). As we discussed above, this controller may veer off
24 the expert’s trajectory if the robot starts at states that are different from

1 the dataset, or even if it takes a slightly different control than the expert
2 midway through a trajectory.

4 To fix this, the robot collects more data at each iteration. It uses a
combination of the expert and its controller to collect such data. This
allows collecting a dataset of the expert's controls in states that the robot
7 visits and iteratively expands the dataset D(k) .

In the beginning we may wish to be close to the expert's data and use
a large value of p; as the fitted controller uθ(k+1) becomes good, we can
reduce the value of p and rely less on the expert.
DAgger is an iterative algorithm which expands the controller to handle
larger and larger parts of the state-space. Therefore, the cost-to-go of
the controller learned via DAgger is O(T) off from the cost-to-go of the
expert as averaged over states that the learned controller visits.

? What criterion can we use to stop these iterations? We can stop when the
incremental dataset collected D(k) is not that different from the cumulative
dataset D, so we know that the new controllers are not that different. We
can also stop when the parameters of our learned controller satisfy
θ(k+1) ≈ θ(k).

DAgger with expert annotations at each step DAgger is a conceptual
framework where the expert is queried repeatedly for new control actions.
This is obviously problematic because we need the expert on hand at each
iteration. We can also cook up a slightly different version of DAgger where we
start with the BC controller uθ(0) = uθ̂ and at each step, we run the
controller on the real system and ask the expert to relabel the data after

1 that run. The dataset D(k) collected by the algorithm expands at each
2 iteration and although the states xit are those visited by our controller, their
3 annotations are those given by the expert. This is a much more natural
4 way of implementing DAgger.
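A sketch of the DAgger loop is given below; the callables expert, rollout and fit_bc are placeholders standing in for the expert controller, running a controller on the system for T time-steps, and the supervised behavior-cloning step of Section 7.2.

```python
# A sketch of the DAgger loop. The callables `expert` (the expert controller),
# `rollout` (runs a controller from x0 and returns the visited states) and
# `fit_bc` (the supervised behavior-cloning fit) are placeholders/assumptions.
import numpy as np

def dagger(expert, rollout, fit_bc, x0, num_iters=10, n_traj=10, p0=0.9):
    # initial dataset D^(0) from the expert alone, and the BC controller fit to it
    data_x, data_u = [], []
    for _ in range(n_traj):
        for x in rollout(expert, x0):
            data_x.append(x)
            data_u.append(expert(x))
    policy = fit_bc(data_x, data_u)

    for k in range(num_iters):
        p = p0 * (1.0 - k / num_iters)          # rely less on the expert over time
        def mixed(x):
            # query the expert for a fraction p of the time-steps
            return expert(x) if np.random.rand() < p else policy(x)
        for _ in range(n_traj):
            for x in rollout(mixed, x0):        # states visited by the mixture controller
                data_x.append(x)
                data_u.append(expert(x))        # ... but labeled with the expert's control
        policy = fit_bc(data_x, data_u)         # refit on the aggregated dataset D^(k)
    return policy
```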
1 Chapter 8

2 Policy Gradient Methods

Reading
1. Sutton & Barto, Chapter 9–10, 13

2. Simple random search provides a competitive approach to


reinforcement learning at
https://arxiv.org/abs/1803.07055

3. Proximal Policy Optimization Algorithms


https://arxiv.org/abs/1707.06347

4. Are Deep Policy Gradient Algorithms Truly Policy Gradient


Algorithms? https://arxiv.org/abs/1811.02553

5. Asynchronous Methods for Deep Reinforcement Learning


http://proceedings.mlr.press/v48/mniha16.pdf

3 This chapter discusses methods to learn the controller that minimizes


4 a given cost functional over trajectories of an unknown dynamical system.
5 We will use what is called the “policy gradient” which will be the main
6 section of this chapter.
7 Recall from the last chapter that we were able to fit stochastic controllers
8 of the form uθb(· | x) that is a probability distribution on the control-space
9 U for each x ∈ X. We fitted uθ using data from the expert in imitation
learning. We did not learn the cost-to-go for the fitted controller, like we
did in the lectures on dynamic programming. This is a clever choice: it is
often easier to learn the controller in a typical problem than to compute
the optimal cost-to-go as a parametric function J∗(x).

? Can you give another instance when we have computed a controller previously
in the class without coming up with its cost-to-go?


1 8.1 Standard problem setup in RL


2 Dynamics and rewards In this and the next few chapters we will always
3 consider discrete-time stochastic dynamical systems with a stochastic
4 controller with parameters (weights) θ. We denote them as follows

xk+1 ∼ p(· | xk , uk ) with noise denoted by ϵk


uk ∼ uθ (xk ).

5 We will also change perspective and instead of minimizing the infinite-


6 horizon sum of a runtime cost, maximize the sum of a runtime reward

r(x, u) := −q(x, u).

7 We do so simply to conform to tradition and standard notation in reinforce-


8 ment learning; the two are mathematically completely equivalent. We are
9 interested in maximizing the expected value of the cumulative rewards
10 over infinite-horizon trajectories of the system
 
J(θ; x0) = E_{x1,x2,...} [ Σ_{k=0}^∞ γ^k r(xk, uk) | x0 ] ;     (8.1)
(the sum inside the expectation is the discounted return)

11 where each uk ∼ uθ (· | xk ) and each xk+1 ∼ p(· | xk , uk ).

Trajectory space   Let us write out one trajectory of such a system a bit
more explicitly. We know that the probability of the next state x_{k+1} given
x_k is p(x_{k+1} | x_k, u_k). The probability of taking a control u_k at state x_k
is uθ(u_k | x_k). We denote an infinite trajectory by

    τ = x_0, u_0, x_1, u_1, . . . .

The probability of this entire trajectory occurring is

    pθ(τ) = Π_{k=0}^∞ p(x_{k+1} | x_k, u_k) uθ(u_k | x_k);

we have emphasized that the distribution of trajectories depends on the
weights of the controller θ. If we take the logarithm,

    log pθ(τ) = Σ_{k=0}^∞ [ log p(x_{k+1} | x_k, u_k) + log uθ(u_k | x_k) ].

Given a trajectory τ = x_0, u_0, x_1, u_1, . . ., the sum

    R(τ) = Σ_{k=0}^∞ γ^k r(x_k, u_k)    (8.2)

is called the discounted return of the trajectory τ. Sometimes we will
also talk of the undiscounted return of the trajectory, which is the sum of
the rewards up to some fixed finite horizon T without the discount factor
pre-multiplier. Using this notation, we can write the objective from (8.1)
as

    J(θ; x_0) = E_{τ∼pθ(τ)} [R(τ) | x_0]    (8.3)

where pθ(τ) is the probability distribution of an infinitely long trajectory τ.


7 Observe what is probably the most important point in policy-gradient
8 based reinforcement learning: the probability of trajectory is an infinite
9 product of terms. All terms are smaller than 1 (they are probabilities),
10 so it is essentially zero even if the state-space and the control-space are
11 finite (even if they are small). Any given infinite (or long) trajectory is
12 quite rare under the probability distribution of the stochastic controller.
13 Policy-gradient methods sample lots of trajectories from the system and
14 average the returns across these trajectories. Since the set of trajectories
15 of even a small MDP is so large, sampling lots of trajectories, or even the
16 most likely ones, is also very hard. This is a key challenge in getting RL
17 algorithms to work.

Our goal in this chapter is to compute the best stochastic controller
which maximizes the average discounted return. Mathematically, this
amounts to finding

    θ̂ = argmax_θ J(θ; x_0) := E_{τ∼pθ(τ)} [R(τ) | x_0].    (8.4)

The objective J(θ) is called the average return of the controller uθ.

Computing the average return J(θ)   Before we move on to optimizing
J(θ), let us discuss how to compute it for given weights θ of the stochastic
controller. We can sample n trajectories from the system and compute an
estimate of the expectation

    Ĵ(θ) ≈ (1/n) Σ_{i=1}^n Σ_{k=0}^T γ^k r(x^i_k, u^i_k)    (8.5)

for some large time-horizon T and where each u^i_k ∼ uθ(· | x^i_k).

? Contrast (8.5) with the complexity of policy evaluation which was simply a system
of linear equations. Evaluating the policy without having access to the dynamical
system is harder.
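As a concrete sketch (not part of the original notes), the Monte Carlo estimate Ĵ(θ) in (8.5)
can be computed as follows; the environment interface (env.reset(), env.step() returning the
next state, reward and a termination flag) and the controller function are assumptions here.

    import numpy as np

    def estimate_average_return(env, controller, n=100, T=200, gamma=0.99):
        # Monte Carlo estimate of J(theta) in (8.5): roll out n trajectories
        # of horizon T and average their discounted returns.
        returns = []
        for _ in range(n):
            x = env.reset()
            discounted_return = 0.0
            for k in range(T):
                u = controller(x)            # sample u_k ~ u_theta(. | x_k)
                x, r, done = env.step(u)     # x_{k+1} ~ p(. | x_k, u_k), reward r(x_k, u_k)
                discounted_return += (gamma ** k) * r
                if done:
                    break
            returns.append(discounted_return)
        return np.mean(returns)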
8.2 Cross-Entropy Method (CEM)

Let us first consider a simple method to compute the best controller. The
basic idea is to solve the problem

    θ̂ = argmax_θ J(θ)

using gradient ascent. We would like to update the weights θ iteratively,

    θ(k+1) = θ(k) + η ∇J(θ(k)),

where the step-size is η > 0 and ∇J(θ) is the gradient of the objective
J(θ) with respect to the weights θ. Instead of computing the exact ∇J(θ),
which we will do in the next section, let us simply compute the gradient
using a finite-difference approximation. The ith entry of the gradient is

    (∇J(θ))_i ≈ (J(θ + ϵ e_i) − J(θ − ϵ e_i)) / (2ϵ) ≈ (Ĵ(θ + ϵ e_i) − Ĵ(θ − ϵ e_i)) / (2ϵ) =: (∇̂J(θ))_i,

where e_i = [0, 0, . . . , 0, 1, 0, . . .] is a vector with 1 in the ith entry. Each
quantity Ĵ is computed as the empirical average return of n trajectories
from the system. We compute all entries of the gradient using this
approximation and update the parameters using

    θ(k+1) = θ(k) + η ∇̂J(θ(k)).

A more efficient way to compute the gradient using finite-differences
Instead of picking perturbations e_i along the cardinal directions, let us
sample them from a Gaussian distribution

    ξ^i ∼ N(0, σ² I)

for some user-chosen covariance σ². We can however no longer use
the finite-difference formula to compute the derivative because the noise
ξ is not aligned with the axes. We can however use a Taylor series
approximation as follows. Observe that

    J(θ + ξ) ≈ J(θ) + ⟨∇J(θ), ξ⟩

where ⟨·, ·⟩ is the inner product. Given m samples ξ^1, . . . , ξ^m, observe
that

    Ĵ(θ + ξ^1) = Ĵ(θ) + ⟨∇J(θ), ξ^1⟩
    Ĵ(θ + ξ^2) = Ĵ(θ) + ⟨∇J(θ), ξ^2⟩
    ...                                      (8.6)
    Ĵ(θ + ξ^m) = Ĵ(θ) + ⟨∇J(θ), ξ^m⟩

is a linear system of equations in ∇J(θ) ∈ R^p where θ ∈ R^p. All
quantities Ĵ are estimated as before using trajectories drawn from the
system. We solve this linear system, e.g., using least-squares if m > p, to
get an estimate of the gradient ∇̂J(θ) ∈ R^p.

The Cross-Entropy Method is a cruder but simpler way to
implement the above least-squares formulation. At each iteration it
updates the parameters using the formula

    θ(k+1) = E_{θ∼N(θ(k), σ²I)} [ θ 1{Ĵ(θ) > Ĵ(θ(k))} ].    (8.7)

In simple words, the CEM samples a few stochastic controllers uθ
from a Gaussian (or any other distribution) centered around the
current controller uθ(k) and updates the weights θ(k) in the direction
of the samples that lead to an increase, i.e., Ĵ(θ) > Ĵ(θ(k)).

8.2.1 Some remarks on sample complexity of simulation-based methods

CEM may seem to be a particularly bad method to maximize J(θ); after
all, we are perturbing the weights of the stochastic controller randomly
and updating the weights if they result in a better average return Ĵ(θ).
This is likely to work well if the dimensionality of the weights θ ∈ R^p, i.e.,
p, is not too large. But it is unlikely to work well if we are sampling θ in
high dimensions. Typical applications are actually the latter; remember
that we are interested in using a deep network as a stochastic controller
and θ are the weights of the neural network.

Let us do a quick computation. If the state is x ∈ R^d and u ∈ R^m
with d = 12 (joint angles and velocities) and m = 6 for a six-degree-of-
freedom robot manipulator, and if we use a two-layer neural network with
64 neurons in the hidden layer, the total number of weights θ ∈ R^p for the
function uθ(· | x) = N(µθ(x), σθ²(x) I), where σ²(x) is a vector in R^m, is

    p = (12 × 64 + 64) + (64 × 6 + 6)   [for µθ(x)]
      + (12 × 64 + 64) + (64 × 6 + 6)   [for σθ²(x)]
      = 2,444.

This is a very high-dimensional space to sample exhaustively. Note that
it is quite large even if the input and output dimensions of the neural
network are not too large. To appreciate the complexity of computing the
gradient ∇J(θ), let us think of how to compute it using finite-differences.
We need two estimates Ĵ(θ − ϵe_i) and Ĵ(θ + ϵe_i) for every dimension
i ∈ {1, . . . , p}. Each estimate requires us to obtain n trajectories from
the system. Since the set of trajectories that a robot can take is quite
diverse, we should use a large n, so let's pick n = 100. The total number
of trajectories required to update the parameters θ(k) at each iteration is

    2 p n ≈ 10^6.

This is an absurdly large number, and things are even more daunting
when we realize that each update of the weights requires us to sample
these many trajectories from the system. It is not reasonable to sample
such a large number of trajectories from an actual robot, that too for each
update of the weights.

(For comparison, a busy espresso bar in a city makes about 500 shots per day. The
espresso machine would have to work for 5 years without breaking down to make 10^6
shots.)

Using fast simulators for RL   If we expand our horizon and think
of learning controllers in simulation, things feel much more reasonable.
While running a large number of trajectories may degrade a real robot
beyond use, doing so requires just computation time in a robot simulator.
There is a large number of simulators available with various capabilities,
e.g., Gazebo (http://gazebosim.org) is a sophisticated simulator inside
ROS that uses a number of physics engines such as Bullet
(https://pybullet.org/wordpress), MuJoCo (http://www.mujoco.org) is
incredibly fast although not very good at modeling contact, Unity is
a popular platform to simulate driving and more complicated scenes
(https://docs.nvidia.com/isaac/isaac/doc/simulation/unity3d.html), and Drake
(https://drake.mit.edu) is better at contact modeling but more complex and
slower. Most robotics companies develop their own driving simulators
in-house. The assigned reading (#2) for this chapter is a paper which
develops a very fast implementation of the CEM for use in simulators.

Working well in simulation does not mean that a controller works
well on the real robot   It is important to realize that a simulator is not
equivalent to the physical robot. Each simulator makes certain trade-offs
in capturing the dynamics of the real system and it is not a given that a
controller that was learned using data from a simulator will work well on
a real robot. For instance, OpenAI had to develop a large number of tricks
(which took about a year) to modify the simulator to enable the learned
policy to work well on a robot (https://openai.com/blog/learning-dexterity)
for a fairly narrow set of tasks.

8.3 The Policy Gradient

In this section, we will study how to take the gradient of the
objective J(θ), without using finite-differences.

We would like to solve the optimization problem

    max_θ J(θ) := E_{τ∼pθ(τ)} [R(τ) | x_0].

We will suppress the dependence on x_0 to keep the notation clear. The
expectation is taken over all trajectories starting at state x_0 realized using
the stochastic controller uθ(· | x). We want to update the weights θ using
gradient ascent, which amounts to

    θ(k+1) = θ(k) + η ∇θ E_{τ∼pθ(τ)} [R(τ)].

First let us note that the distribution pθ using which we compute the
expectation also depends on the weights θ. This is why we cannot simply
move the derivative ∇θ inside the expectation:

    ∇θ E_{τ∼pθ(τ)} [R(τ)] ≠ E_{τ∼pθ(τ)} [∇θ R(τ)].

We need to think of a new technique to compute the gradient above.
Essentially, we would like to do the chain rule of calculus but where
one of the functions in the chain is an expectation. The likelihood-ratio
trick described next allows us to take such derivatives. Here is how the
computation goes:

    ∇θ E_{τ∼pθ} [R(τ)] = ∇θ ∫ R(τ) pθ(τ) dτ
                       = ∫ R(τ) ∇θ pθ(τ) dτ
                         (move the gradient inside; the integral is over trajectories τ which do not depend on θ themselves)
                       = ∫ R(τ) pθ(τ) (∇θ pθ(τ) / pθ(τ)) dτ
                       = ∫ R(τ) pθ(τ) ∇θ log pθ(τ) dτ
                       = E_{τ∼pθ(τ)} [R(τ) ∇θ log pθ(τ)]
                       ≈ (1/n) Σ_{i=1}^n R(τ^i) ∇θ log pθ(τ^i).    (8.8)

This is called the likelihood-ratio trick to compute the policy gradient. It
simply multiplies and divides by the term pθ(τ) and rewrites the ratio
∇pθ / pθ = ∇ log pθ. It gives us a neat way to compute the gradient: we
sample n trajectories τ^1, . . . , τ^n from the system and average the return
of each trajectory R(τ^i) weighted by the gradient of the log-likelihood of
the trajectory, ∇ log pθ(τ^i). The central point to remember here is
that the gradient

    ∇θ log pθ(τ^i) = ∇θ Σ_{k=0}^T [ log p(x^i_{k+1} | x^i_k, u^i_k) + log uθ(u^i_k | x^i_k) ]
                   = Σ_{k=0}^T ∇θ log uθ(u^i_k | x^i_k)    (8.9)

is computed using backpropagation for a neural network. This expression
is called the policy gradient because it is the gradient of the objective J(θ)
with respect to the parameters of the controller/policy θ.
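As an illustrative sketch (one possible implementation, not the only one), the sampled
gradient in (8.8)–(8.9) is usually realized by building the surrogate loss
−(1/n) Σ_i R(τ^i) Σ_k log uθ(u^i_k | x^i_k) and letting automatic differentiation compute
its gradient. The PyTorch code below assumes a policy whose forward pass returns a
torch.distributions object over controls (an independent Normal per control dimension),
and trajectories stored as lists of (state, control, reward) tuples; these names are assumptions.

    import torch

    def reinforce_loss(policy, trajectories, gamma=0.99):
        # Surrogate loss whose gradient equals the estimate in (8.8)-(8.9).
        # trajectories: list of lists of (x, u, r) tuples, one list per trajectory.
        losses = []
        for traj in trajectories:
            # discounted return R(tau) of the whole trajectory
            R = sum((gamma ** k) * r for k, (_, _, r) in enumerate(traj))
            logp = sum(policy(x).log_prob(u).sum() for x, u, _ in traj)
            losses.append(-R * logp)        # minus sign: we maximize the return
        return torch.stack(losses).mean()

    # usage sketch:
    # loss = reinforce_loss(policy, trajectories)
    # loss.backward()                       # gradients with respect to theta
    # optimizer.step()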

Variance of the policy gradient   The expression for the policy gradient may
seem like a sleight of hand. It is a clean expression to get the gradient of
the objective but it also comes with a number of problems. Observe that

    ∇ E_{τ∼pθ(τ)} [R(τ)] = E_{τ∼pθ(τ)} [ R(τ) ∇pθ(τ) / pθ(τ) ]
                         ≈ (1/n) Σ_{i=1}^n R(τ^i) ∇pθ(τ^i) / pθ(τ^i).

If we sample trajectories τ^i that are not very likely under the distribution
pθ(τ), the denominator in some of the summands can be very small.
For trajectories that are likely, the denominator is large. The empirical
estimate of the expectation using n trajectories, where some terms are
very small and some others very large, therefore has a large variance. So
one does need lots of trajectories from the system/simulator to compute a
reasonable approximation of the policy gradient.

8.3.1 Reducing the variance of the policy gradient

Control variates   You will perhaps appreciate that computing an accurate
policy gradient is very hard in practice. Control variates are a general
concept from the literature on Monte Carlo integration and are typically
introduced as follows. Say we have a random variable X and we would
like to guess its expected value µ = E[X]. Note that X is an unbiased
estimator of µ but it may have a large variance. If we have another random
variable Y with known expected value E[Y], then

    X̂ = X + c (Y − E[Y])    (8.10)

is also an unbiased estimator for µ for any value of c. The variance of X̂ is

    Var(X̂) = Var(X) + c² Var(Y) + 2c Cov(X, Y),

which is minimized for

    c* = − Cov(X, Y) / Var(Y),

for which we have

    Var(X̂) = Var(X) − (c*)² Var(Y) = ( 1 − Cov(X, Y)² / (Var(X) Var(Y)) ) Var(X).

By subtracting Y − E[Y] from our observed random variable X, we have
reduced the variance of X if the correlation between X and Y is non-zero.
Most importantly, note that no matter what Y we plug into the above
expression, we can never increase the variance of X; the worst that can
happen is that we pick a Y that is completely uncorrelated with X and
end up achieving nothing.

Baseline   We will now use the concept of a control variate to reduce the
variance of the policy gradient. This is known as "building a baseline".
The simplest baseline one can build is to subtract a constant value from
the return. Consider the PG given by

    ∇J(θ) = E_{τ∼pθ} [R(τ) ∇ log pθ(τ)]
          = E_{τ∼pθ(τ)} [(R(τ) − b) ∇ log pθ(τ)].

Observe that

    E_{τ∼pθ(τ)} [b ∇ log pθ(τ)] = ∫ dτ b pθ(τ) ∇ log pθ(τ)
                                = ∫ dτ b ∇pθ(τ) = b ∇ ∫ dτ pθ(τ) = b ∇1 = 0.

Example 1: Using the average returns of a mini-batch as the baseline
What is the simplest baseline b we can cook up? Let us write the mini-batch
version of the policy gradient,

    ∇̂J(θ) := (1/b) Σ_{i=1}^b R(τ^i) ∇ log pθ(τ^i),

where τ^1, . . . , τ^b are trajectories that are a part of our mini-batch. We can
set

    b = (1/b) Σ_{i=1}^b R(τ^i)

and use the variance-reduced gradient

    ∇̂J(θ) = (1/b) Σ_{i=1}^b ( R(τ^i) − b ) ∇ log pθ(τ^i).

This is a one-line change in your code for policy gradient so there is no
reason not to do it.
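In code, this amounts to centering the returns before forming the surrogate loss. The sketch
below reuses the hypothetical names from the earlier reinforce_loss example (policy,
trajectories) and is only one way to write it:

    import torch

    def reinforce_loss_with_baseline(policy, trajectories, gamma=0.99):
        # Same surrogate loss as before, but with the mini-batch average return
        # subtracted as a baseline (a one-line change in spirit).
        returns = [sum((gamma ** k) * r for k, (_, _, r) in enumerate(traj))
                   for traj in trajectories]
        baseline = sum(returns) / len(returns)
        losses = []
        for R, traj in zip(returns, trajectories):
            logp = sum(policy(x).log_prob(u).sum() for x, u, _ in traj)
            losses.append(-(R - baseline) * logp)
        return torch.stack(losses).mean()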

Example 2: A weighted average of the returns using the log-likelihood
of the trajectory   The previous example showed how we can use one
constant baseline, namely the average of the discounted returns of all
trajectories in a mini-batch. What is the best constant b we can use?
We can perform a computation similar to the one in the control variate
introduction to minimize the variance of the policy gradient. The variance
of the ith coordinate of the gradient is

    δ( ∇̂θi J(θ) ) = E_τ [ ( (R(τ) − b_i) ∇θi log pθ(τ) )² ] − ( E_τ [ (R(τ) − b_i) ∇θi log pθ(τ) ] )²
                  = E_τ [ ( (R(τ) − b_i) ∇θi log pθ(τ) )² ] − ( ∇̂θi J(θ) )².

Set

    d δ( ∇̂θi J(θ) ) / d b_i = 0

in the above expression to get

    b_i = E_τ [ (∇θi log pθ(τ))² R(τ) ] / E_τ [ (∇θi log pθ(τ))² ],

which is the baseline you should subtract from the gradient of the ith
parameter θ_i to result in the largest variance reduction. This expression is
just the expected return, but weighted by the magnitude of the gradient;
this again is 1–2 lines of code.

? Show that any function that only depends on the state x can be used as a baseline in the
policy gradient. This technique is known as reward shaping.

8.4 An alternative expression for the policy gradient

We will first define an important quantity that helps us think of RL
algorithms.

Definition 8.1 (Discounted state visitation frequency). Given a stochas-
tic controller uθ(· | x), the discounted state visitation frequency for a
discrete-time dynamical system is given by

    dθ(x) = Σ_{k=0}^∞ γ^k P(x_k = x | x_0, u_k ∼ uθ(· | x_k)).

The distribution dθ(x) is the probability of visiting a state x computed
over all trajectories of the system that start at the initial state x_0. If γ = 1,
this is the steady-state distribution of the Markov chain underlying the
Markov Decision Process where at each step the MDP chooses the control
u_k ∼ uθ(· | x_k). The fact that we have defined the discounted distribution
is a technicality; this version is the one that shows up in the policy gradient
expression. You will also notice that dθ(x) is not a normalized distribution.
The normalization constant is difficult to characterize both theoretically and
empirically and we will not worry about it here; RL algorithms do not
require it.

Q-function   Using the discounted state visitation frequency, the
policy gradient that we saw before can be written in terms of the
Q-function as follows:

    ∇J(θ) = E_{τ∼pθ} [R(τ) ∇ log pθ(τ)]
          = E_{x∼dθ} E_{u∼uθ(·|x)} [ q^θ(x, u) ∇θ log uθ(u | x) ].    (8.11)

The function q^θ(x, u) is similar to the cost-to-go that we have studied
in dynamic programming and is called the Q-function,

    q^θ(x, u) = E_{τ∼pθ(τ)} [R(τ) | x_0 = x, u_0 = u].    (8.12)

It is the infinite-horizon discounted cumulative reward (i.e., the return)
if the system starts at state x and takes the control u in the first step
and runs the controller uθ(· | x) for all steps thereafter. We make the
dependence of q^θ on the parameters θ of the controller explicit.

(The derivation of this expression is easy although tedious; you can find it in the
Appendix of the paper "Policy gradient methods for reinforcement learning with
function approximation" at https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf.)

Compare the above formula for the policy gradient with the one we
had before in (8.8):

    ∇̂J(θ) = E_{τ∼pθ(τ)} [R(τ) ∇ log pθ(τ)]
           = E_{τ∼pθ(τ)} [ ( Σ_{k=0}^T γ^k r(x_k, u_k) ) ( Σ_{k=0}^T ∇ log uθ(u_k | x_k) ) ].

5 It is important to notice that this is an expectation over trajectories;


6 whereas (8.11) is an expectation over states x sampled from the discounted
7 state visitation frequency. The control uk for both is sampled from the
8 stochastic controller at each time-step k. The most important distinction
9 is that (8.11) involves the expectation of the Q-function q θ weighted by
10 the gradient of the log-likelihood of picking each control action. There
11 are numerous hacky ways of deriving (8.11) from (8.8) but remember that
12 they are fundamentally different expressions of the same policy gradient.
This expression allows us to understand a number of properties of
reinforcement learning.

1. While the algorithm collects the data, states that are unlikely under
   the distribution dθ contribute little to (8.11). In other words, the
   policy gradient is insensitive to such states. The policy update does
   not take into account these unlikely states, which the system visits
   only infrequently under the controller uθ.

2. The opposite happens for states which are very likely. For two
   controls u_1, u_2 at the same state x, the policy increases the log-
   likelihood of taking the controls weighted by their values q^θ(x, u_1)
   and q^θ(x, u_2). This is sort of the "definition" of reinforcement
   learning. In the expression (8.8) the gradient was increasing the
   likelihood of trajectories with high returns; here it deals with states
   and controls individually.

8.4.1 Implementing the new expression

Suppose we have a stochastic controller that is a Gaussian,

    uθ(u | x) = (2πσ²)^{-p/2} exp( −‖u − θ⊤x‖² / (2σ²) ),

where θ ∈ R^{d×p} and u ∈ R^p; the variance σ can be chosen by the user.
We can easily compute log uθ(u | x) in (8.11). How should one compute
q^θ(x, u) in (8.12)? We can again estimate it using sample trajectories
from the system; each of these trajectories would have to start from the
state x and the control at the first step would be u, with the controller uθ
being used thereafter. Note that we have one such trajectory, namely the
remainder of the trajectory where we encountered (x, u) while sampling
trajectories for the policy gradient in (8.11). In practice, we do not sample
trajectories a second time; we simply take this trajectory, let us call it τ_{x,u},
and set

    q^θ(x, u) = Σ_{k=0}^T γ^k r(x_k, u_k)

for some large time-horizon T, where (x_0, u_0) = (x, u) and the summation
is evaluated for (x_k, u_k) that lie on the trajectory τ_{x,u}. Effectively, we are
evaluating (8.12) using one sample trajectory, a highly erroneous estimate
of q^θ.
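In practice this single-trajectory estimate of q^θ is just the discounted reward-to-go
computed at every timestep of a sampled trajectory; a small numpy sketch:

    import numpy as np

    def reward_to_go(rewards, gamma=0.99):
        # rewards: array of r(x_k, u_k) along one trajectory.
        # Returns q_hat[k] = sum_{j>=k} gamma^(j-k) * rewards[j], i.e., the
        # single-sample estimate of q^theta(x_k, u_k) described above.
        q_hat = np.zeros(len(rewards), dtype=float)
        running = 0.0
        for k in reversed(range(len(rewards))):
            running = rewards[k] + gamma * running
            q_hat[k] = running
        return q_hat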

8.5 Actor-Critic methods

We can of course do more sophisticated things to evaluate the Q-function
q^θ in our new expression of the policy gradient.

Actor-Critic methods fit a Q-function to the data collected from the
system using the current controller (policy evaluation step) and then
use this fitted Q-function in the expression of the policy gradient (8.11)
to update the controller. In this sense, Actor-Critic methods are
conceptually similar to policy iteration.

In order to understand how to fit the Q-function, first recall that it
should satisfy the Bellman equation. This means

    q^θ(x, u) = r(x, u) + γ E_{x'∼P(·|x,u), u'∼uθ(·|x')} [ q^θ(x', u') ].    (8.13)

We do not know a model for the system so we cannot evaluate the
expectation over x' ∼ P(· | x, u) like we used to in dynamic programming.

1 But we do have the ability to get trajectories τ i from the system.


2 Let’s say (xik , uik ) lie on τ i at time-step k. We can then estimate the
3 expectation over P(· | xik , uik ) using simply xik+1 and the expectation over
4 the controls using simply uik+1 to write

q θ (xik , uik ) ≈ r(xik , uik ) + γ q θ (xik+1 , uik+1 ) for all i ≤ n, k ≤ T.

5 This is a nice constraint on the Q-function. If this were a discrete-state,


6 discrete-control MDP, it is a set of linear equations for the q-values. These
7 constraints would be akin to our linear equations for evaluating a policy in
8 dynamic programming except that instead of using the dynamics model
9 (the transition matrix), we are using trajectories sampled from the system.

10 Parameterizing the Q-function using a neural network If we are deal-


11 ing with a continuous state/control-space, we can think of parameterizing
12 the q-function using parameters φ

qφθ (x, u) : X × U → R.

13 The parameterization is similar to the parameterization of the controller,


14 e.g., just like we would write a deterministic controller as

uθ (x) = θ⊤ x

we can think of a linear Q-function of the form

    q^θ_φ(x, u) = φ⊤ [x; u],    φ ∈ R^{m+d},

which is a linear function in the states and controls. You can also think of
using something like

    q^θ_φ(x, u) = [1, x⊤, u⊤] φ [1; x; u],    φ ∈ R^{(m+d+1)×(m+d+1)},

which is quadratic in the states and controls, or in general a deep network
with weights φ as the Q-function.

Fitting the Q-function   We can now "fit" the parameters of the Q-
function by solving the problem

    φ̂ = argmin_φ (1/(n(T+1))) Σ_{i=1}^n Σ_{k=0}^T ( q^θ_φ(x^i_k, u^i_k) − r(x^i_k, u^i_k) − γ q^θ_φ(x^i_{k+1}, u^i_{k+1}) )²,    (8.14)

which is nothing other than enforcing the Bellman equation in (8.13).
If the Q-function is linear in [x, u], this is a least squares problem; if it
is quadratic, the problem is a quadratic optimization problem which can
also be solved efficiently; in general we would solve this problem using
stochastic gradient descent if we are parameterizing the Q-function using a
deep network. Such a Q-function is called the "critic" because it evaluates
the controller uθ, or the "actor". This version of the policy gradient, where
one fits the parameters of both the controller and the Q-function, is called
an Actor-Critic method.

(We will be pedantic and always write the Q-function as q^θ_φ. The superscript θ denotes
that this is the Q-function corresponding to the θ-parameterized controller uθ. The
subscript denotes that the Q-function is parameterized by parameters φ.)

Actor-Critic Methods   We fit a deep network with weights θ to
parameterize a stochastic controller uθ(· | x) and another deep
network with weights φ to parameterize the Q-function of this
controller, q^θ_φ(x, u). Let the controller weights at the kth iteration be
θ(k) and the Q-function weights be φ^k.

1. Sample n trajectories, each of T timesteps, using the current
   controller uθ(k)(· | x).

2. Fit a Q-function q^{θ(k)}_{φ^{k+1}} to this data using (8.14), using stochastic
   gradient descent. While performing this fitting (although it is
   not mathematically sound), it is common to initialize the
   Q-function weights to their values from the previous iteration,
   φ^k.

3. Compute the policy gradient using the alternative expression
   in (8.11) and update the parameters of the policy to θ(k+1).
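A compressed sketch of one such iteration, under the same assumptions as the earlier
snippets (PyTorch networks policy and q_net, and a batch of transitions (x, u, r, x', u')
collected with the current controller; the names and batch layout are illustrative, not a
fixed API):

    import torch

    def critic_loss(q_net, batch, gamma=0.99):
        # Enforce the sampled Bellman equation (8.14) on transitions
        # (x, u, r, x', u') collected with the current controller.
        x, u, r, x_next, u_next = batch
        target = r + gamma * q_net(x_next, u_next).detach()
        return ((q_net(x, u) - target) ** 2).mean()

    def actor_loss(policy, q_net, batch):
        # Policy gradient in the form (8.11): increase the log-likelihood of
        # controls in proportion to their Q-values (critic held fixed).
        x, u, _, _, _ = batch
        q = q_net(x, u).detach()
        logp = policy(x).log_prob(u).sum(dim=-1)   # per-dimension Normal assumed
        return -(q * logp).mean()

    # one iteration (sketch):
    # critic_opt.zero_grad(); critic_loss(q_net, batch).backward(); critic_opt.step()
    # actor_opt.zero_grad();  actor_loss(policy, q_net, batch).backward(); actor_opt.step()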

8.5.1 Advantage function

The new expression for the policy gradient in (8.11) also has a large
variance; this should be no surprise, it is after all equal to the old
expression. We can however perform variance reduction on this using the
value function.

Our goal, as before, would be to construct a relevant baseline to subtract
from the Q-function. It turns out that any function that depends only upon
the state x is a valid baseline. This gives a powerful baseline for us to
use in policy gradients: we can use the value function as the baseline. The
value function for controls taken by the controller uθ(· | x) (notice that
this is not the optimal value function, it is simply the policy evaluation) is
given by

    v^θ(x) = E_{τ∼pθ(τ)} [R(τ) | x_0 = x]

where u_k ∼ uθ(· | x_k) at each timestep. We also know that the value
function is the expected value of the Q-function across different controls
sampled by the controller,

    v^θ(x) = E_{u∼uθ(·|x)} [ q^θ(x, u) ].    (8.15)

The value function again satisfies the dynamic programming principle/
Bellman equation

    v^θ(x) = E_{u∼uθ(·|x)} [ r(x, u) + γ E_{x'∼P(·|x,u)} [ v^θ(x') ] ].

We again parameterize the value function

    v^θ_ψ(x) : X → R

using parameters ψ and fit it to the data in the same way as (8.14) to get

    ψ̂ = argmin_ψ (1/(n(T+1))) Σ_{i=1}^n Σ_{k=0}^T ( v^θ_ψ(x^i_k) − r(x^i_k, u^i_k) − γ v^θ_ψ(x^i_{k+1}) )².    (8.16)

Using this baseline, we can modify the policy gradient to be

    ∇J(θ) = E_{x∼dθ} E_{u∼uθ(·|x)} [ ( q^θ_φ(x, u) − v^θ_ψ(x) ) ∇θ log uθ(u | x) ],    (8.17)

where each of the functions q^θ_φ and v^θ_ψ is itself fitted us-
ing (8.14) and (8.16) respectively. The difference

    a^θ_{φ,ψ}(x, u) = q^θ_φ(x, u) − v^θ_ψ(x)
                    ≈ q^θ_φ(x, u) − E_{u∼uθ(·|x)} [ q^θ_φ(x, u) ]    (8.18)

is called the advantage function. It measures how much better the
particular control u is for a state x as compared to the average return
of controls sampled from the controller at that state. The form (8.17)
is the most commonly implemented form in research papers whenever
they say "we use the policy gradient".

? The advantage function is very useful while doing theoretical work on RL
algorithms. But it is also extremely useful in practice. It imposes a constraint upon our
estimate q^θ_φ and the estimate v^θ_ψ. If we are not solving (8.14) and (8.16) to completion,
we may benefit by imposing this constraint on the advantage function. Can you think of a way?

4 8.6 Discussion
5 This brings to an end the discussion of policy gradients. They are, in
6 general, a complicated suite of algorithms to implement. You will see
7 some of this complexity when you implement the controller for something
8 as simple as the simple pendulum. The key challenges with implementing
9 policy gradients come from the following.

1. They need lots of data; each parameter update requires fresh data from
   the system. Typical problems may need a million trajectories; most
   robots would break before one gets this much data from them if one
   implements these algorithms naively.

1 2. The log-likelihood ratio trick has a high variance due to uθ (· | x)


2 being in the denominator of the expression, so we need to implement
3 complex variance reduction techniques such as actor-critic methods.

4 3. Fitting the Q-function and the value function is not easy, each
5 parameter update of the policy ideally requires you to solve the
6 entire problems (8.14) and (8.16). In practice, we only perform a
7 few steps of SGD to solve the two problems and reuse the solution
8 of k th iteration update as an initialization of the k + 1th update. This
9 is known as “warm start” in the optimization literature and reduces
10 the computational cost of fitting the Q/value-functions from scratch
11 each time.

4. The Q/value-functions fitted in iteration k may be poor estimates of
   the Q/value-functions at iteration k+1 for the new controller uθ(k+1)(· | x).
   If the controller parameters change quickly, i.e., θ(k+1) is very different
   from θ(k), then so are q^{θ(k+1)} and v^{θ(k+1)}. There is a very fine balance
   between training quickly and retaining the efficiency of warm start, and
   tuning this in practice is quite difficult. A large number of policy gradient
   algorithms like TRPO (https://arxiv.org/abs/1502.05477) and PPO
   (https://arxiv.org/abs/1707.06347) try to alleviate this with varying
   degrees of success.

5. The latter, PPO, is a good policy-gradient-based algorithm to try on a
   new problem. For instance, in a very impressive demonstration, it was
   used to build an RL agent to play Dota 2
   (https://openai.com/blog/openai-five). We will see better RL methods
   in the next chapter.
Chapter 9

Q-Learning

Reading

1. Sutton & Barto, Chapter 6, 11

2. Human-level control through deep reinforcement learning,
   https://www.nature.com/articles/nature14236

3. Deterministic Policy Gradient Algorithms,
   http://proceedings.mlr.press/v32/silver14.html

4. Addressing Function Approximation Error in Actor-Critic Methods,
   https://arxiv.org/abs/1802.09477

5. An Application of Reinforcement Learning to Aerobatic Helicopter Flight,
   https://papers.nips.cc/paper/3151-an-application-of-reinforcement-learning-to-aerobatic-helicopter-flight

In the previous chapter, we looked at what are called "on-policy"
methods; these are methods where the current controller uθ(k) is used
to draw fresh data from the dynamical system, which is then used to update
the parameters θ(k). The key inefficiency in on-policy methods is that this
data is thrown away in the next iteration. We need to draw a fresh set of
trajectories from the system for uθ(k+1). This lecture will discuss off-policy
methods, which are a way to reuse past data. These methods require much
less data than on-policy methods (in practice, about 10–100× less).

9.1 Tabular Q-Learning

Recall the value iteration algorithm for discrete (and finite) state and
control spaces; this is also called "tabular" Q-Learning in the RL literature
because we can store the Q-function q(x, u) as a large table, with the number
of rows being the number of states and the number of columns being the
number of controls, and each entry in this table being the value q(x, u).
Value iteration, when written using the Q-function at the kth iteration for
the tabular setting, looks like

    q^(k+1)(x, u) = Σ_{x'∈X} P(x' | x, u) [ r(x, u) + γ max_{u'} q^(k)(x', u') ]
                  = E_{x'∼P(·|x,u)} [ r(x, u) + γ max_{u'} q^(k)(x', u') ].

In the simplest possible instantiation of Q-learning, the expecta-


tion in the value iteration above (which we can only compute if we
know a model of the dynamics) is replaced by samples drawn from
the environment.

4 We will imagine the robot as using an arbitrary controller

ue (· | x)

5 that has a fairly large degree of randomness in how it picks actions. We


6 call such a controller an “exploratory controller”. Conceptually, its goal
7 is to lead the robot to diverse states in the state-space so that we get a
8 faithful estimate of the expectation in value iteration. We maintain the
9 value q (k) (x, u) for all states x ∈ X and controls u ∈ U and update these
10 values to get q (k+1) after each step of the robot.
From the results on the Bellman equation, we know that any Q-function
that satisfies the above equation is the optimal Q-function; we would
therefore like our Q-function to satisfy

    q*(x_k, u_k) ≈ r(x_k, u_k) + γ max_{u'} q*(x_{k+1}, u')

over samples (x_k, u_k, x_{k+1}) collected as the robot explores the environ-
ment.

Tabular Q-Learning   Let us imagine the robot travels for n trajec-
tories of T time-steps each. We can now solve for q* by minimizing
the objective

    min_q (1/(n(T+1))) Σ_{i=1}^n Σ_{k=0}^T ( q(x^i_k, u^i_k) − r(x^i_k, u^i_k) − γ max_{u'∈U} q(x^i_{k+1}, u') )²    (9.1)

on the data collected by the robot. The variables of optimization here
are all the values q(x, u) for x ∈ X and u ∈ U.

16 Notice a few important things about the above optimization problem.


17 First, the last term is a maximization over u′ ∈ U , it is maxu′ ∈U q(xik+1 , u′ )
18 and not q(xik+1 , uik+1 ). In practice, you should imagine a robot performing

1 Q-Learning in a grid-world setting where it seeks to find the optimal


2 trajectory to go from a source location to a target location. If at each
3 step, the robot has 4 controls to choose from, computing this last term
4 involves taking the maximum of 4 different values (4 columns in the
5 tabular Q-function).
6 Notice that for finite-horizon dynamic programming we initialized
7 the Q-function at the terminal time to a known value (the terminal cost).
8 Similarly, for infinite-horizon value iteration, we discussed how we can
9 converge to the optimal Q-function with any initialization. In the above
10 case, we do not impose any such constraint upon the Q-function, but there
11 is an implicit constraint. All values q(x, u) have to be consistent with
12 each other and ideally, the residual

    q(x^i_k, u^i_k) − r(x^i_k, u^i_k) − γ max_{u'∈U} q(x^i_{k+1}, u') = 0

for all trajectories i and all timesteps k ≤ T.

Solving tabular Q-Learning   How should we solve the optimization
problem in (9.1)? This is easy: every entry q(x, u) for x ∈ X and u ∈ U
is a variable of this objective and each (·)² term in the objective simply
represents a constraint that ties these different values of the Q-function
together. We can solve for all q(x, u) iteratively as

    q(x, u) ← q(x, u) − η ∇_{q(x,u)} ℓ(q)
            = (1 − η) q(x, u) + η ( r(x, u) + γ max_{u'} q(x', u') ),    (9.2)

where ℓ(q) is the entire objective (1/(n(T+1))) Σ_i Σ_k (···)² above and (x, u, x') ≡
(x^i_k, u^i_k, x^i_{k+1}) in the second equation. An important point to note here is
that although the robot collects a finite amount of data,

    D = { (x^i_k, u^i_k), k = 0, 1, . . . , T }_{i=1}^n,

we have an estimate for the value q(x, u) at all states x ∈ X. As an
intuition, tabular Q-learning looks at the returns obtained by the robot
after starting from a state x (the reward-to-come J(x)) and patches the
returns from nearby states x, x' using the constraints in the objective (9.1).

Terminal state   One must be very careful about the terminal state in such
implementations of Q-learning. Typically, most research papers imagine
that they are solving an infinite horizon problem but use simulators that
have an explicit terminal state, i.e., the simulator does not proceed to the
next timestep after the robot reaches the goal. A workaround for using
such simulators (essentially all simulators) is to modify (9.2) as

    q(x, u) = (1 − η) q(x, u) + η ( r(x, u) + γ (1 − 1{x' is terminal}) max_{u'} q(x', u') ).
Effectively, we are setting q(x', u) = 0 for all u ∈ U if x' is a terminal state
of the problem. This is a very important point to remember and Q-Learning
will never work if you forget to include the term 1{x' is terminal} in your
expression.
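A small numpy sketch of the update (9.2) with the terminal-state mask, assuming the states
and controls have been indexed by integers and transitions are given as (x, u, r, x', done)
tuples (the names are illustrative):

    import numpy as np

    def tabular_q_update(q, transitions, eta=0.1, gamma=0.99):
        # q: array of shape (num_states, num_controls).
        # transitions: iterable of (x, u, r, x_next, done) with integer x, u.
        for x, u, r, x_next, done in transitions:
            target = r + gamma * (0.0 if done else np.max(q[x_next]))
            q[x, u] = (1.0 - eta) * q[x, u] + eta * target
        return q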

What is the controller in tabular Q-Learning?   The controller in
tabular Q-Learning is easy to get after we solve (9.1). At test time, we use
a deterministic controller given by

    u*(x) = argmax_{u'} q*(x, u').

9.1.1 How to perform exploration in Q-Learning

The exploratory controller ue(· | x) used by the robot is critical to performing
Q-Learning well. If the exploratory controller does not explore much,
we do not get states from all parts of the state-space. This is quite bad,
because in this case the estimates of the Q-function at all states will be bad,
not just at the states that the robot did not visit. To make this intuitive,
imagine if we cordoned off some nodes in the graph for the backward
version of Dijkstra's algorithm and never used them to update the dist
variable. We would never get to the optimal cost-to-go for all states in this
case because there could be trajectories that go through these cordoned-
off states that lead to a smaller cost-to-go. So it is quite important to pick
the right exploratory controller.

It turns out that a random exploratory controller, e.g., a controller
ue(· | x) that picks controls uniformly randomly, is pretty good. We can
show that our tabular Q-Learning will converge to the optimal Q-function
q*(x, u) as the amount of data drawn from the random controller goes to
infinity, even if we initialize the table to arbitrary values. In other words,
if we are guaranteed that the robot visits each state in the finite MDP
infinitely often, it is a classical result that updates of the form (9.2) for
minimizing the objective in (9.1) converge to the optimal Q-function.

(This is again the power of dynamic programming at work. The Bellman equation
guarantees the convergence of value iteration provided we compute the expectation exactly.
But if the robot does give us lots of data from the environment, then Q-Learning also
inherits this property of convergence to the optimal Q-function from any initialization.)

Epsilon-greedy exploration   Instead of the robot using an arbitrary
controller ue(· | x) to gather data, we can use the current estimate of the
Q-function with some added randomness to ensure that the robot visits all
states in the state-space. This is a key idea in Q-Learning and is known as
"epsilon-greedy" exploration. We set

    ue(u | x) = argmax_u q(x, u)   with probability 1 − ϵ,
                uniform(U)         with probability ϵ,    (9.3)

for some user-chosen value of ϵ. Effectively, the robot repeats the controls
it took in the past with probability 1 − ϵ and uniformly samples from
the entire control space with probability ϵ. The former ensures that the
robot moves towards the parts of the state-space where states have a high
return-to-come (after all, that is what the Q-function q(x, u) indicates).
The latter ensures that even if the robot's estimate of the Q-function is bad,
it is still visiting every state in the state-space infinitely often.
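A tiny numpy sketch of (9.3) for a tabular Q-function:

    import numpy as np

    def epsilon_greedy(q, x, epsilon=0.1):
        # q: array of shape (num_states, num_controls); x: integer state index.
        # With probability epsilon pick a uniformly random control, otherwise
        # pick the greedy control argmax_u q(x, u).
        num_controls = q.shape[1]
        if np.random.rand() < epsilon:
            return np.random.randint(num_controls)
        return int(np.argmax(q[x]))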

A different perspective on Q-Learning   Conceptually, we can think of
tabular Q-learning as happening in two stages. In the first stage, the robot
gathers a large amount of data,

    D = { (x^i_k, u^i_k), k = 0, 1, . . . , T }_{i=1}^n,

using the exploratory controller ue(· | x); let us consider the case
when we are using an arbitrary exploratory controller, not epsilon-greedy
exploration. Using this data, the robot fits a model for the system, i.e., it
learns the underlying MDP

    P(x' | x, u);

this is very similar to the step in the Baum-Welch algorithm that we saw
for learning the Markov state transition matrix of the HMM in Chapter 2.
We simply take frequency counts to estimate this probability,

    P(x' | x, u) ≈ (1/N) Σ_i 1{x' was reached from x using control u},

where N is the number of times the robot took control u at state x.
Given this transition matrix, we can now perform value iteration on the
MDP to learn the Q-function,

    q^(k+1)(x, u) = E_{x'∼P(·|x,u)} [ r(x, u) + γ max_{u'} q^(k)(x', u') ].

16 The success of this two-stage approach depends upon how accurate our
17 estimate of P(x′ | x, u) is. This in turn depends on how much the robot
18 explored the domain and the size of the dataset it collected, both of these
19 need to be large. We can therefore think of Q-learning as interleaving
20 these two stages in a single algorithm, it learns the dynamics of the system
21 and the Q-function for that dynamics simultaneously. But the Q-Learning
22 algorithm does not really maintain a representation of the dynamics, i.e.,
23 at the end of running Q-Learning, we do not know what P(x′ | x, u) is.

9.2 Function approximation (Deep Q-Networks)

Tabular methods are really nice but they do not scale to large problems.
The grid-world in the homework problem on policy iteration had 100
states; a typical game of Tetris has about 10^60 states. For comparison, the
number of atoms in the known universe is about 10^80. The number of
different states in a typical Atari game is more than 10^300. These are all
problems with a discrete number of states and controls; for continuous
state/control-spaces, the number of distinct states/controls is infinite. So
it is essentially impossible to run the tabular Q-Learning method from
the previous section for most real-world problems. In this section, we
will look at a powerful set of algorithms that parameterize the Q-function
using a neural network to work around this problem.

We use the same idea from the previous chapter, that of parameterizing
the Q-function using a deep network. We will denote

    qφ(x, u) : X × U → R

as the Q-function, and our goal is to fit the deep network to obtain the
weights φ̂, instead of maintaining a very large table of size |X| × |U| for
the Q-function. Fitting the Q-function is quite similar to the tabular case:
given a dataset D = { (x^i_t, u^i_t), t = 0, 1, . . . , T }_{i=1}^n from the system, we want to
enforce

    qφ(x^i_t, u^i_t) = r(x^i_t, u^i_t) + γ max_{u'} qφ(x^i_{t+1}, u')

for all tuples (x^i_t, u^i_t, x^i_{t+1}) in the dataset. Just like the previous section,
we will solve

    φ̂ = argmin_φ (1/(n(T+1))) Σ_{i=1}^n Σ_{t=0}^T ( qφ(x^i_t, u^i_t) − [ r(x^i_t, u^i_t) + γ (1 − 1{x^i_{t+1} is terminal}) max_{u'} qφ(x^i_{t+1}, u') ] )²
      ≡ argmin_φ (1/(n(T+1))) Σ_{i=1}^n Σ_{t=0}^T ( qφ(x^i_t, u^i_t) − target(x^i_{t+1}; φ) )².    (9.4)

The last two terms in the expression above are together called the "target"
because the problem is very similar to least squares regression, except that
the targets also depend on the weights φ. This is what makes it challenging
to solve.

As discussed above, Q-Learning with function approximation is known
as "Fitted Q-Iteration". Remember the very important point that the robot
collects data using the exploratory controller ue(· | x) but the Q-function
that we fit is the optimal Q-function.
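A minimal PyTorch sketch of the loss in (9.4) for a discrete control space, assuming q_net
maps a batch of states to a (batch, num_controls) tensor of Q-values; the network name and
batch layout are assumptions. The sketch stops gradients through the target, anticipating
the "delayed target" discussion below.

    import torch

    def fitted_q_loss(q_net, batch, gamma=0.99):
        # batch: tensors x (N, d), u (N,) long, r (N,), x_next (N, d),
        #        done (N,) float in {0, 1}.
        x, u, r, x_next, done = batch
        q = q_net(x).gather(1, u.unsqueeze(1)).squeeze(1)    # q_phi(x_t, u_t)
        with torch.no_grad():
            target = r + gamma * (1.0 - done) * q_net(x_next).max(dim=1).values
        return torch.nn.functional.mse_loss(q, target)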

Fitted Q-Iteration with function approximation may not converge
to the optimal Q-function   It turns out that (9.4) has certain math-
ematical intricacies that prevent it from converging to the optimal
Q-function. We will give the intuitive reason here. In the tabular
Q-Learning setting, if we modify some entry q(x, u) for an x ∈ X
and u ∈ U, the other entries (which are tied together using the Bell-
man equation) are all modified. This is akin to you changing the dist
value of one node in Dijkstra's algorithm; the dist values of all other
nodes will have to change to satisfy the Bellman equation. This is
what (9.2) achieves if implemented with a decaying step-size η; see
http://users.isr.ist.utl.pt/~mtjspaan/readingGroup/ProofQlearning.pdf for
the proof. This does not hold for (9.4). Even if the objective in (9.4) is
zero on our collected dataset, i.e., the Q-function fits the data collected by the
robot perfectly, the Q-function may not be the optimal Q-function. An
intuitive way of understanding this problem is that even if the Bellman
error is zero on samples in the dataset, the optimization objective says
nothing about states that are not present in the dataset; the Bellman error
on them is completely dependent upon the smoothness properties of the
function expressed by the neural architecture. Contrast this comment with
the solution of the HJB equation in Chapter 6 where the value function
was quite non-smooth at some places. If our sampled dataset does not
contain those places, there is no way the neural network can know the
optimal form of the value function.

(The mathematical reason behind this is that the Bellman operator, i.e., the update to the
Q/value-function, is a contraction for the tabular setting; this is not the case for Fitted
Q-Iteration unless the function approximation has some technical conditions imposed upon
it.)

9.2.1 Embellishments to Q-Learning


12 We next discuss a few practical aspects of implementing Q-Learning.
13 Each of the following points is extremely important to understand how to
14 get RL to work on real-world problems, so you should internalize these.

Pick mini-batches from different trajectories in SGD   In practice,
we fit the Q-function using stochastic gradient descent. At each iteration
we sample a mini-batch of inputs (x^i_t, u^i_t, x^i_{t+1}) from different trajectories
i ∈ {1, . . . , n} and update the weights φ in the direction of the negative
gradient:

    φ^{k+1} = φ^k − η ∇φ ( qφ^k(x, u) − target(x'; φ^k) )².

The mini-batch is picked to have samples from different trajectories
because samples from the same trajectory are correlated to each other
(after all, the robot obtains the next tuple (x', u', x'') from the previous
tuple (x, u, x')).

Replay buffer   The dataset D is known as the replay buffer.

Off-policy learning   The replay buffer is typically not fixed during
training. Instead of drawing data from the exploratory controller ue, we
can think of the following algorithm. Initialize the Q-function weights to
φ^0 and the dataset to D = ∅. At the kth iteration,

• Draw a dataset Dk of n trajectories from the ϵ-greedy policy

      ue(u | x) = argmax_u q^k(x, u)   with probability 1 − ϵ,
                  uniform(U)           with probability ϵ.

• Add the new trajectories to the dataset,

      D ← D ∪ Dk.

• Update the weights to q^{k+1} using all past data D using (9.4).

Compare this algorithm to policy-gradient-based methods, which throw
away the data from the previous iteration. Indeed, when we want to
compute the gradient ∇θ E_{τ∼pθk}[R(τ)], we should sample trajectories
from the current weights θ^k; we cannot use trajectories from some old weights.
In contrast, in Q-Learning, we maintain a cumulative dataset D that
contains trajectories from all the past ϵ-greedy controllers and use it to
find new weights of the Q-function. We can do so because of the powerful
Bellman equation: Q-Iteration is learning the optimal value function and
no matter what dataset (9.4) is evaluated upon, if the error is zero, we are
guaranteed that the Q-function learned is the optimal one. Policy gradients do
not use the Bellman equation and that is why they are so inefficient. This
is also the reason Q-Learning with a replay buffer is called "off-policy"
learning: it learns the optimal controller even if the data that it uses
comes from some other non-optimal controller (the exploratory controller
or the ϵ-greedy controller).
18 Using off-policy learning is an old idea, the DQN paper which
19 demonstrated very impressive results on Atari games using RL brought it
20 back into prominence.

21 Setting a good value of ϵ for exploration is critical Towards the


22 beginning of training, we want a large value for ϵ to gather diverse data
23 from the environment. As training progresses, we want to reduce ϵ because
24 presumably we have a few good control trajectories that result in good
25 returns and can focus on searching the neighborhood of these trajectories.

26 Prioritized experience replay is an idea where instead of sampling


27 from the replay buffer D uniformly randomly when we fit the Q-function
28 in (9.4), we only sample data points (xit , uit ) which have a high Bellman
error

    qφ(x^i_t, u^i_t) − r(x^i_t, u^i_t) − γ (1 − 1{x^i_{t+1} is terminal}) max_{u'} qφ(x^i_{t+1}, u').

30 This is a reasonable idea but is not very useful in practice for two reasons.
31 First, if we use deep networks for parameterizing the Q-function, the
32 network can fit even very complex datasets so there is no reason to not
33 use the data points with low Bellman error in (9.4); the gradient using
34 them will be small anyway. Second, there are a lot of hyper-parameters
35 that determine prioritized sampling, e.g., the threshold beyond which
36 we consider the Bellman error to be high. These hyper-parameters are
37 quite difficult to use in practice and therefore it is a good idea to not use
38 prioritized experience replay at the beginning of development of your
39 method on a new problem.

(Figure: Huber loss for δ = 1 (green) compared to the squared error loss (blue).)

Using robust regression to fit the Q-function   There may be states in
the replay buffer with very high Bellman error, e.g., the kinks in the value
function for the mountain car obtained from HJB above, if we happen to
sample those. For instance, these are states where the controller "switches"
and is a discontinuous function of the state x. In these cases, instead of these
few states dominating the gradient for the entire dataset, we can use ideas
from robust regression to reduce their effect on the gradient. A popular way
to do so is to use a Huber loss in place of the quadratic loss in (9.4):

    huber_δ(a) = a²/2              for |a| ≤ δ,
                 δ (|a| − δ/2)     otherwise.    (9.5)

Delayed target   Notice that the target also depends upon the weights φ:

    target(x'; φ) := r(x, u) + γ (1 − 1{x' is terminal}) max_{u'} qφ(x', u').

This creates a very big problem when we fit the Q-function. Effectively,
both the covariate and the target in (9.4) depend upon the weights of the
Q-function. Minimizing the objective in (9.4) is akin to performing least
squares regression where the targets keep changing every time you solve
for the solution. This is the root cause of why Q-Learning is difficult to
use in practice. A popular hack to get around this problem is to use some
old weights to compute the target, i.e., use the loss

    (1/(n(T+1))) Σ_{i,t} ( qφ^k(x^i_t, u^i_t) − target(x^i_{t+1}; φ^{k'}) )²    (9.6)

in place of (9.4). Here k' is an iterate much older than k, say k' = k − 100.
This trick is called the "delayed target".

Exponential averaging to update the target   Notice that in order to
implement delayed targets as discussed above, we would have to save all the
weights φ^k, φ^{k−1}, . . . , φ^{k−100}, which can be cumbersome. We can
however do yet another clever hack and initialize two copies of the weights,
one for the actual Q-function, φ^k, and another for the target, let us call it
φ'^k. We set the target equal to the Q-function at initialization. The target
copy is updated at each iteration to be

    φ'^{k+1} = (1 − α) φ'^k + α φ^{k+1}    (9.7)

with some small value, say α = 0.05. The target's weights are therefore
an exponentially averaged version of the weights of the Q-function.
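In PyTorch this exponential (Polyak) averaging of (9.7) is a few lines; a sketch assuming
q_net and target_net are two copies of the same architecture:

    import torch

    @torch.no_grad()
    def update_target(q_net, target_net, alpha=0.05):
        # phi'_{k+1} = (1 - alpha) * phi'_k + alpha * phi_{k+1}, as in (9.7).
        for p_target, p in zip(target_net.parameters(), q_net.parameters()):
            p_target.mul_(1.0 - alpha).add_(alpha * p)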

28 Why are delayed targets essential for Q-Learning to work? There


29 are many explanations given why delayed targets are essential in practice
30 but the correct one is not really known yet.

1. For example, one reason could be that since qφ^k(x, u) for a given
   state typically increases as we train for more iterations of Q-Learning,
   the old weights inside a delayed target give an underestimate of the
   true target. This might lend some stability in situations when the
   Q-function's weights φ^k change too quickly when we fit (9.4) or we
   do not have enough data in the replay buffer yet.

2. Another reason one could hypothesize is related to concepts like
   self-distillation. For example, we may write a new objective for
   Q-Learning that looks like

       ( qφ^k(x^i_t, u^i_t) − target(x^i_{t+1}; φ^{k'}) )² + (1/(2λ)) ‖φ^k − φ^{k'}‖²_2,

   where the second term is known as a proximal term that prevents the
   weights φ^k from changing too much from their old values φ^{k'}. Proximal
   objectives are more stable versions of the standard quadratic
   objective in (9.4) and help in cases when one is solving Q-Learning
   using SGD updates.

Double Q-Learning   Even a delayed target may not be sufficient to get
Q-Learning to lead to good returns in practice. Focus on one state x. One
problem arises from the max operator in (9.4). Suppose that the Q-function
qφ^k corresponds to a particularly bad controller, say a controller that picks
a control

    argmax_u qφ^k(x, u)

that is very different from the optimal control

    argmax_u q*(x, u);

then even the delayed target qφ'^k may be a similarly poor controller. The
ideal target is of course the return-to-come, i.e., the value of the optimal
Q-function max_{u'} q*(x', u'), but we do not know it while fitting the Q-
function. The same problem also occurs if our Q-function (or its delayed
version, the target) is too optimistic about the values of certain control
inputs; it will consistently pick those controls in the max operator. One
hack to get around this problem is to pick the maximizing control input
using the non-delayed Q-function but use the value of the delayed target:

    target_DDQN(x^i_{t+1}; φ'^k) = r(x, u) + γ (1 − 1{x^i_{t+1} is terminal}) qφ'^k(x^i_{t+1}, u'),    (9.8)

where

    u' = argmax_u qφ^k(x^i_{t+1}, u)

is the control chosen by the (non-delayed) Q-function.
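A PyTorch sketch of the double-DQN target in (9.8), again assuming q_net and target_net
map states to (batch, num_controls) Q-values:

    import torch

    @torch.no_grad()
    def double_dqn_target(q_net, target_net, r, x_next, done, gamma=0.99):
        # Pick the maximizing control with the online network, but evaluate
        # it with the delayed target network, as in (9.8).
        u_star = q_net(x_next).argmax(dim=1, keepdim=True)          # u' chosen by q_phi
        q_target = target_net(x_next).gather(1, u_star).squeeze(1)  # evaluated by q_phi'
        return r + gamma * (1.0 - done) * q_target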

Training two Q-functions   We can also train two copies of the Q-function
simultaneously, each with its own delayed target, and mix-and-match their
targets. Let φ^{(1)k} and φ'^{(1)k} be one Q-function and target pair, and φ^{(2)k}
and φ'^{(2)k} be another pair. We update both of them using the following
objective:

    for φ^{(1)}:  ( q^{(1)}(x, u) − r(x, u) − γ (1 − 1{x' is terminal}) target_DDQN(x', φ'^{(2)k}) )²
    for φ^{(2)}:  ( q^{(2)}(x, u) − r(x, u) − γ (1 − 1{x' is terminal}) target_DDQN(x', φ'^{(1)k}) )²    (9.9)

Sometimes we also use only one target that is the minimum of the two
targets (this helps because it is a more pessimistic estimate of the true target):

    target(x') := min( target_DDQN(x', φ'^{(1)k}), target_DDQN(x', φ'^{(2)k}) ).

You will also see many papers train multiple Q-functions, many more than
two. In such cases, it is a good idea to pick the control for evaluation using
all the Q-functions:

    u*(x) := argmax_u Σ_k qφ^{(k)}(x, u),

rather than only one of them, as is often done in research papers.
7 rather than only one of them, as is often done in research papers.

8 A remark on the various tricks used to compute the target It may


9 seem that a lot of these tricks are about being pessimistic while computing
10 the target. This is our current understanding in RL and it is born out of
11 the following observation: typically in practice, you will observe that the
12 Q-function estimates can become very large. Even if the TD error is small,
13 the values qφ (x, u) can be arbitrarily large; see Figure 1 in Continuous
14 Doubly Constrained Batch Reinforcement Learning for an example in a Mathematically, the fundamental problem
15 slightly different setting. This occurs because we pick the control that in function-approximation-based RL is
16 maximizes the Q-value of a particular state x in (9.8). Effectively, if the actually clear: even if the Bellman operation
17 Q-value qφ (x′ , u) of a particular control u ∈ U is an over-estimate, the is a contraction for tabular RL, it need not be
18 target will keep selecting this control as the maximizing control, which a contraction when we are approximating the
19 drives up the value of the Q-function at qφ (x, u) as well. This problem is Q-function using a neural network. Therefore
20 a bit more drastic in the next section on continuous-valued controls. It is minimizing TD-error which works quite well
21 however unclear how to best address this issue and design mathematically for the tabular case need not work well in the
22 sound methods that do not use arbitrary heuristics such as “pessimism”. function-approximation case. There may exist
other, more robust, ways of computing the
Bellman fixed point
23 9.3 Q-Learning for continuous control spaces qφ (x, u) = r(x, u) + maxu′ γ qφ (x′ , u′ )
other than minimizing the the squared TD
24 All the methods we have looked at in this chapter are for discrete control
error but we do not have good candidates yet.
25 spaces, i.e., the set of controls that the robot can take is a finite set. In this
26 case we can easily compute the maximizing control of the Q-function.

u∗ (x) = argmax qφ (x, u).


u

27 Certainly a lot of real-world problems have continuous-valued controls


28 and we therefore need Q-Learning-based methods to handle this.

1 Deterministic policy gradient A natural way, although non-rigorous,


2 to think about this is to assume that we are given a Q-function qφ (x, u)
(we will leave the controller for which this is the Q-function vague for now) and a dataset D = { (x^i_t, u^i_t)_{t=0}^T }_{i=1}^n. We can find a deterministic feedback controller that takes controls that lead to good values as

θ* = argmax_θ 1/(n(T+1)) Σ_{i=1}^n Σ_{t=0}^T q_φ(x^i_t, u_θ(x^i_t)).    (9.10)

6 Effectively we are fitting a feedback controller that takes controls uθ∗ (x)
7 that are the maximizers of the Q-function. This is a natural analogue
8 of the argmax over controls for discrete/finite control spaces. Again we
9 should think of having a deep network that parametrizes the deterministic
10 controller and fitting its parameters θ using stochastic gradient descent
11 on (9.10)

θ^{k+1} = θ^k + η ∇_θ q_φ(x_ω, u_{θ^k}(x_ω))
        = θ^k + η (∇_u q_φ(x_ω, u)|_{u = u_{θ^k}(x_ω)}) (∇_θ u_{θ^k}(x_ω))
(9.11)

12 where ω is the index of the datum in the dataset D. The equality was
13 obtained by applying the chain rule. This result is called the “deterministic
14 policy gradient” and we should think of it as the limit of the policy gradient
15 for a stochastic controller as the stochasticity goes to zero. Also notice
16 that the term
∇u qφ (xω , u)
17 is the gradient of the output of the Q-function qφ : X × U 7→ R with
18 respect to its second input u. Such gradients can also be easily computed
19 using backpropagation in PyTorch. It is different than the gradient of the
20 output with respect to its weights

∇φ qφ (xω , u).
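A rough sketch of one gradient-ascent step of (9.10)-(9.11), assuming q_phi and u_theta are PyTorch modules and optimizer holds only the controller's parameters; autograd applies the chain rule in (9.11) for us:

    def dpg_step(q_phi, u_theta, x_batch, optimizer):
        # Ascend (1/B) sum_x q_phi(x, u_theta(x)) with respect to the controller's weights.
        u = u_theta(x_batch)                 # controls proposed by the deterministic controller
        loss = -q_phi(x_batch, u).mean()     # negative sign because optimizers minimize
        optimizer.zero_grad()
        loss.backward()                      # chain rule: (grad_u q) times (grad_theta u)
        optimizer.step()                     # only u_theta's weights are updated by this optimizer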

21 On-policy deterministic actor-critic Let us now construct an analogue


22 of the policy gradient method for the case of a deterministic controller. The
23 algorithm would proceed as follows. We initialize weights of a Q-function
24 φ0 and weights of the deterministic controller θ0 .

25 1. At the k th iteration, we collect a dataset from the robot using the


26 latest controller uθk . Let this dataset be Dk that consists of tuples
(x, u, x', u').

2. Fit a Q-function q^{θ^k} to this dataset by minimizing the temporal difference error

φ^{k+1} = argmin_φ Σ_{(x,u,x',u') ∈ D^k} ( q_φ(x, u) − r(x, u) − γ (1 − 1{x' is terminal}) q_{φ'}(x', u') )².    (9.12)
Notice an important difference in the expression above: instead of
31 using maxu in the target, we are using the control that the current

1 controller, namely uθk has taken. This is because we want to


2 evaluate the controller uθk and simply parameterize the Q-function
using weights φ^{k+1}. More precisely, we hope that we have

q_{φ^{k+1}}(x, u_{θ^k}(x)) ≈ max_u q^{θ^k}(x, u).

4 3. We can now update the controller using this Q-function:

θk+1 = θk + η∇θ qφk+1 (xω , uθk (xω )) (9.13)


This algorithm is called "on-policy SARSA" because at each iteration we draw fresh data D^k from the environment; this is the direct analogue, for deterministic controllers, of the actor-critic methods that we studied in the previous chapter.

SARSA is an old algorithm in RL that is the tabular version of what we did here. It stands for state-action-reward-state-action.

9 Off-policy deterministic actor-critic methods We can also run the


10 above algorithm using data from an exploratory controller. The only
difference is that we now do not throw away the data D^k from older
12 iterations
D = D1 ∪ · · · ∪ Dk
and therefore have to change (9.12) to be

φ^{k+1} = argmin_φ Σ_{(x,u,x',u') ∈ D} ( q_φ(x, u) − r(x, u) − γ (1 − 1{x' is terminal}) q_{φ'}(x', u_{θ^k}(x')) )².    (9.14)

Notice the difference from (9.12): the target is computed using u_{θ^k}(x') instead of the control u' recorded in the dataset.
14 Effectively, we are fitting the optimal Q-function using the data D but
15 since we can no longer take the maximum over controls directly, we plug
16 in the controller in the computation of the target. This is natural; we
17 think of the controller as the one that maximizes the Q-function when we
18 update (9.13). When used with deep networks, this is called the “deep
deterministic policy gradient" algorithm; it is popularly known by the name DDPG.
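A minimal sketch of the target in (9.14), assuming q_target is the delayed Q-network and u_theta is the current deterministic controller:

    import torch

    def ddpg_target(q_target, u_theta, r, x_next, done, gamma=0.99):
        # Off-policy target: plug the current controller into the delayed Q-function
        # instead of maximizing over controls.
        with torch.no_grad():
            u_next = u_theta(x_next)
            return r + gamma * (1.0 - done) * q_target(x_next, u_next).squeeze(-1)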
1 Chapter 10

2 Model-based
3 Reinforcement Learning

Reading
1. PILCO: A Model-Based and Data-Efficient Approach to Policy
Search, http://mlg.eng.cam.ac.uk/pub/pdf/DeiRas11.pdf

2. Embed to Control: A Locally Linear Latent Dynamics Model


for Control from Raw Images https://arxiv.org/abs/1506.07365

3. Deep Reinforcement Learning in a Handful of Trials using Prob-


abilistic Dynamics Models https://arxiv.org/abs/1805.12114

4 We have seen a large number of methods which use a known model


of the dynamical system to compute the control inputs; these include
6 value/policy iteration, Linear Quadratic Regulator (LQR) and Model
7 Predictive Control (MPC). We also saw a number of methods from the
8 reinforcement learning literature that can work “model-free”, i.e., having
9 access to some data from the environment in lieu of a model. On one hand,
model-based methods come with some obvious challenges: if we do not know the model of the system, the controller will not be optimal and, worse, it may even be unsafe; think of driving on black ice, a thin coat of ice which develops after repeated freezing and melting of snow on asphalt. On the other hand, model-free approaches are spectacularly inefficient: policy-gradient-based methods require several thousands of trajectories to train a controller, and even more efficient ones such as off-policy methods require prohibitive amounts of data; recall the example of an espresso bar in New York City that makes 50 shots a day, for which it takes more than a month to sample more than 1000 trajectories. This has limited the reach of model-
20 free RL methods primarily to simulation, although there are examples
21 where these policies were run (typically after training) in the real world


1 also; see another example at https://ai.googleblog.com/2020/05/agile-and-


2 intelligent-locomotion-via.html. Very rarely you will see RL methods
3 being used to train robots directly.
4 It makes sense to combine model-based and model-free methods if
5 we want to reduce the number of data required from the system to learn a
6 controller. Such methods are typically called model-based RL methods.

7 10.1 Learning a model of the dynamics


8 Imagine that we have a robot with dynamics

xk+1 = f (xk , uk )

and obtained some data from this robot using an exploratory controller u_e(· | x). Let us call this dataset D = { (x^i_t, u^i_t)_{t=0}^T }_{i=1}^n; it consists of n
11 trajectories each of length T timesteps. We can fit a deep network to learn
12 the dynamics. This involves parameterizing the unknown dynamics using
13 a deep network with weights w

f_w : X × U → X

14 and minimizing a regression error of the form


w* = argmin_w 1/(n(T+1)) Σ_{i=1}^n Σ_{t=0}^T ∥x^i_{t+1} − f_w(x^i_t, u^i_t)∥²_2.    (10.1)

If the residual ∥x^i_{t+1} − f_w(x^i_t, u^i_t)∥²_2 is small on average over the dataset,
16 then we know that given some new control u′ ̸= uit , we can, for instance,
17 estimate the new future state x′ = fw (x, u′ ). In principle, we can use this
18 model now to execute algorithms like iterated LQR to find an optimal
19 controller. We could also imagine using this as our own simulator for the
20 robot, i.e., instead of drawing new trajectories in model-free RL from the
21 actual robot, we use our learned model of the dynamics to obtain more
22 data.
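A minimal sketch of fitting the forward model in (10.1) with stochastic gradient descent; the two-layer architecture and the tensors states, controls, next_states are placeholders:

    import torch
    import torch.nn as nn

    def fit_dynamics(states, controls, next_states, hidden=256, epochs=100, lr=1e-3):
        # f_w : X x U -> X, a small fully-connected network on the concatenated (x, u).
        x_dim, u_dim = states.shape[-1], controls.shape[-1]
        f_w = nn.Sequential(nn.Linear(x_dim + u_dim, hidden), nn.ReLU(),
                            nn.Linear(hidden, x_dim))
        opt = torch.optim.Adam(f_w.parameters(), lr=lr)
        for _ in range(epochs):
            pred = f_w(torch.cat([states, controls], dim=-1))
            loss = ((pred - next_states) ** 2).sum(-1).mean()   # the residual in (10.1)
            opt.zero_grad(); loss.backward(); opt.step()
        return f_w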

23 An inverse model of the dynamics We can also learn what is called


24 the inverse model of the system that maps the current state xit and the next
25 state xit+1 to the control that takes the system from the former to the latter:

f^inv_w : X × X → U.

The regression error for one sample in this case would be ∥u^i_t − f^inv_w(x^i_t, x^i_{t+1})∥²_2.
27 This is often a more useful model to learn from the data, e.g., if we want to
28 use this model in a Rapidly Exploring Random Tree (RRT), we can sample
29 states in the configuration space of the robot and have the learned dynamics
30 guess the control between two states. Also see the paper on contact-
31 invariant optimization (https://homes.cs.washington.edu/ todorov/papers/-
32 MordatchSIGGRAPH12.pdf) and a video at https://www.youtube.com/watch?v=mhr_jtQrhVA
for an impressive demonstration of using an inverse model.

For a quick primer on planning using a model, see the notes at https://ocw.mit.edu/courses/aeronautics-and-astronautics/16-410-principles-of-autonomy-and-decision-making-fall-2010/lecture-notes/MIT16_410F10_lec15.pdf from Emilio Frazzoli (ETH/MIT).
Models can be wrong at parts of state-space where we have few
data This is really the biggest concern with using models. We have
seen in the chapter on deep learning that if we do not have data from
some part of the state-space, there are few guarantees of the model fw
or fwinv working well for those states. A planning algorithm does not
however know that the model is wrong for a given state. So the central
question in learning a model is “how to estimate the uncertainty of
the output of the model”, i.e.,

P(x_{k+1} ≠ f_w(x_k, u_k))

where x_{k+1} is the true next state of the system and f_w(x_k, u_k) is
our prediction using the model. If we have a good estimate of
such uncertainty, we can safely use the model only at parts of the
state-space where this uncertainty is small.

Sequentially querying the environment for data We can use our


ideas from DAgger and off-policy RL to improve our model iteratively
by collecting more data using the controller that is being learned.
Here is how it would work

1. Draw some data D from the system, fit a dynamics model


fw0 (x, u) using (10.1). Learn a feedback controller u0 (x) using
any method we know so far (LQR, MPC, RL-based methods)

2. Run the learned controller u0 (x) from the real system to collect
more data D1 and add it to the dataset

D ← D ∪ D1 .

This is a simple mechanism that ensures that we can collect more
data from the system. If the controller goes to parts of the state-space
that the model is incorrect at, we get samples from such regions and
iteratively improve both the learned dynamics model fwk (x, u) and
the controller uk (x) using this model.
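A rough sketch of this loop, where fit_model, plan_controller and rollout are hypothetical helpers (fit a dynamics model to transitions as in (10.1), plan with it, e.g., via iLQR/MPC, and run a controller on the real system, respectively):

    def model_based_loop(rollout, fit_model, plan_controller, n_iters=10):
        dataset = rollout(controller=None)                  # D: initial data from an exploratory controller
        for k in range(n_iters):
            f_w = fit_model(dataset)                        # refit the dynamics model on all data so far
            u_k = plan_controller(f_w)                      # learn/plan a controller using the model
            dataset = dataset + rollout(controller=u_k)     # D <- D union D_{k+1}
        return u_k, f_w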

2 10.2 Some model-based methods


3 10.2.1 Bagging multiple models of the dynamics
4 Let us look at bagging, which is a method to estimate the uncertainty of
5 a learning-based predictor. Bagging is short for bootstrap aggregation,

1 and can be explained using a simple experiment. Suppose we wanted to


2 estimate the average height µ of people in the world. We can measure the
3 height of N individuals and obtain one estimate of the mean µ. This is of
4 course unsatisfying because we know that our answer is unlikely to be the
5 mean of the entire population. Bootstrapping computes multiple estimates
of the mean µ_k over many subsets of the data and reports the answer as

µ := mean(µ_k) ± stddev(µ_k).

7 Each subset of the data is created by sampling the original data with
8 N samples with replacement. This is among the most influential ideas
in statistics; see "Bootstrap Methods: Another Look at the Jackknife"
10 https://projecteuclid.org/journals/annals-of-statistics/volume-7/issue-1/Bootstrap-
11 Methods-Another-Look-at-the-Jackknife/10.1214/aos/1176344552.full be-
12 cause it is a very simple and general procedure to obtain the uncertainty
13 of the estimate. Also see a very famous paper “Bagging predictors”
14 at https://link.springer.com/article/10.1007/BF00058655 that invented
15 random forests based on this idea.

16 Training an ensemble We are going to train multiple dynamics models


f_{w_k}(x, u) for k ∈ {1, . . . , M}, one each for bootstrapped versions of the training dataset {D_1, . . . , D_M}. Each subset D_k is built by sampling a
19 fraction, say 60% of the data uniformly randomly from our training dataset
20 D. In other words, the bootstrapped versions of the data are not disjoint
21 but their union is likely to be the entire training dataset. We are going to
22 use this ensemble as follows. For each pair (x, u), we run all models and
23 set
x̂' = 1/M Σ_{k=1}^M f_{w_k}(x, u);

i.e., the ensemble predicts the next state of the robot using the mean. The important benefit of using an ensemble is that we can also get an estimate of the error in these predictions, e.g., the standard deviation across the ensemble's predictions,

error in x̂' = ( 1/M Σ_{k=1}^M ∥f_{w_k}(x, u) − x̂'∥² )^{1/2}.

Different members of the ensemble are trained on different datasets and make different predictions as to what the next state could be. The mis-match between them is an indicator of the error in our dynamics model. This need not be an accurate estimate of the error (i.e., the difference between the predicted x̂' and the actual next state x' of the true dynamics) but is often a good proxy to use if we do bootstrapping.

Bagging is perhaps the most useful idea in machine learning (by far). It is always good to keep it in your mind. The winners of most high-profile machine learning competitions, e.g., the Netflix Prize (https://en.wikipedia.org/wiki/Netflix_Prize) or the ImageNet challenge, have been bagged classifiers created by fitting multiple architectures on the same dataset.

Typically, while fitting deep networks f_{w_k}(x, u) using SGD, most RL papers initialize the training process at different weights and do not perform any bootstrapping. The rationale that is usually given is that since the training process is non-convex, two models initialized at different locations train to two different predictors even if they both work on the same data. Although doing this leads to some notion of uncertainty, it is not an entirely correct one and performing bootstrapping will always give better estimates.
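A rough sketch of such a bootstrapped ensemble, reusing the hypothetical fit_dynamics helper sketched in Section 10.1:

    import torch

    def train_ensemble(states, controls, next_states, M=5, frac=0.6):
        # Train M dynamics models, each on a ~60% bootstrap subset sampled with replacement.
        models, n = [], states.shape[0]
        for _ in range(M):
            idx = torch.randint(0, n, (int(frac * n),))
            models.append(fit_dynamics(states[idx], controls[idx], next_states[idx]))
        return models

    def ensemble_predict(models, x, u):
        # Mean prediction and the disagreement (standard deviation) across the members.
        preds = torch.stack([f(torch.cat([x, u], dim=-1)) for f in models])
        return preds.mean(dim=0), preds.std(dim=0)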

How to use an ensemble for hallucinating new data The proce-


dure is very similar to what we saw above for querying the model.
The key difference is that we now have M models of the dynamics
and can mix-and-match their predictions as we simulate the trajectory.

1 PILCO A powerful Gaussian Process-based algorithm to incorporate


2 uncertainty in the predictions of a learned dynamics model is “PILCO: A
3 Model-Based and Data-Efficient Approach to Policy Search”
4 http://mlg.eng.cam.ac.uk/pub/pdf/DeiRas11.pdf. Instead of using boot-
5 strapping of an ensemble to estimate the uncertainty, this algorithm
6 explicitly models the uncertainty as

p(xk+1 | xk , uk ) = N (xk + E[∆k ], Var(∆k ))

7 where ∆k = xk+1 − xk in the training data, using a Gaussian Process.


8 See a preliminary but great tutorial on Gaussian Processes at https://distill.pub/2019/visual-
9 exploration-gaussian-processes. PILCO is a complicated algorithm to
10 implement but you can see the source code by the original authors at
11 https://mloss.org/software/view/508. You can also look at the paper ti-
12 tled “BayesRace: Learning to race autonomously using prior experience”
https://arxiv.org/abs/2005.04755, which uses model-based RL for autonomous racing.

14 10.2.2 Model-based RL in the latent space


1 Chapter 11

2 Offline Reinforcement
3 Learning

Reading
1. Offline Reinforcement Learning: Tutorial, Review, and Per-
spectives on Open Problems by Levine et al. (2020)

2. Continuous doubly constrained batch reinforcement learning


by Fakoor et al. (2021)

4 So far, we have imagined that we have access to a model of a robot, a


5 simulator for it, or access to the actual robot that allows us to obtain data
6 from the robot. But there are many problems in which we cannot get any
7 of these. For example, when Amazon sells merchandise to its customers,
8 it is quite difficult for them to model or simulate each customer, or even
a canonical one. It is possible to do rollouts using actual customer data but that would not be wise because an exploratory policy, by the time it learns, would lead to huge losses. This would also not be desirable for the customers. Amazon also needs reinforcement learning for this problem because there are not many other ways to explore what customers like and do not like. One could take another example from a very different domain. Suppose a hospital is trying to develop a new protocol to handle incoming patients. E.g., a patient comes into the ER and there is a fixed set of checks that are performed quickly on them, called "triage". The number and kind of checks performed directly affect patient care. If there are too many
19 checks, then the patient loses precious time before they are treated. If
20 there are too few, then there is a large bottleneck when these patients are
21 referred to doctors. It is not easy to model or simulate this problem. It is
22 possible to do rollouts to discover a policy but that would not be easy, or
23 wise.
24 These problems have a few commonalities.


1 • We do not have models or simulators for such systems and that


2 is why we need to use ideas from reinforcement learning to build
3 controllers for them;
• There are existing systems, i.e., there exists a controller that
5 is currently deployed. This controller may be very complex, e.g.,
6 in Amazon’s case it is the result of many scientists building this
7 system over a decade (and thereby even if one could look at it, there
8 is no way one would be able to model/simulate it). In the case of the
9 hospital, the current triage policy was likely created by the doctors
10 over experience and refined by the triage nurses by looking at actual
11 cases. Because the system evolved over a long time, a lot of this
12 knowledge is not accessible in a codified form.

Offline reinforcement learning are a suite of techniques that allow


us to learn the optimal value function from data that need not be
coming from an optimal policy—without drawing any new data from
the environment. Note that we do not just want to evaluate an existing
policy, we want to learn the optimal value function, or at the very
least improve upon the current policy.

13 There are many other problems of this kind: data is plentiful, just that we
14 cannot get more.
15 Technically speaking, offline learning is a very clean problem in that
we are close to the supervised learning setting (although we do not know the true targets). A meaningful theoretical analysis of typical reinforcement
18 learning algorithms is difficult because there are a lot of moving parts
19 in the problem definition: exploratory controllers, the fact that we are
20 adding correlated samples into our dataset as we draw more trajectories,
21 function approximation properties of the neural network that does not
22 allow Bellman iteration to remain a contraction etc. Some of these hurdles
23 are absent in the analysis of offline learning.

11.1 Why is offline reinforcement learning difficult?
Suppose that we wanted to use behavior cloning for this problem. After all, we could build a state vector x and do behavior cloning with a neural network to learn a policy u_θ(x) from the existing dataset D = { (x^i_t, u^i_t)_{t=0,...,T} }_{i=1}^n. In principle, we can also learn the value function corresponding to this policy, q^θ_φ(x, u), where u is the control input. Note that this is not the optimal value function. We will call this the behavior cloning solution to offline reinforcement learning.

A good intuition for offline reinforcement learning is given by this picture. In imitation learning, or behavior cloning, we simply want to learn a policy that mimics the expert (or recorded data). In offline reinforcement learning, we can "stitch together" multiple sub-optimal policies in different parts of the domain.
33 We know further that value iteration converges (for tabular MDPs)
34 from any initial condition. In lieu of the actual model of the Markov
35 Decision Process, we can imagine using the entire dataset D to calculate

Bellman updates for a value function parameterized by a neural network:

φ^{(k+1)} = argmin_φ Σ_{i,t} ( q^θ_φ(x^i_t, u^i_t) − r(x^i_t, u^i_t) − γ max_{u' ∈ U} q^θ_{φ^{(k)}}(x^i_{t+1}, u') )².

2 This approach is unlikely to work very well. Observe the following figure.

4 We do not really know whether the initial value function qφθ assigns large
5 returns to controls that are outside of the ones in the dataset. If it does, then
6 these controls will be chosen in the maximization step while calculating
7 the target. If there are states where the value function over different
8 controls looks like this picture, then their targets will cause the value at
all other states to grow unbounded during training. This is exactly what happens in practice; see, for example, the figure below.

We have discussed how Bellman updates


are a set of consistent constraints on the
optimal value function. This entails that if we
over-estimate the value at a given state, then
the estimated return to come (i.e., the value)
of all other states becomes incorrect.
[Figure: a value function that over-estimates the returns of controls that are not present in the dataset.]

12 In offline reinforcement learning, this phenomenon is often called the


13 “extrapolation error”. It arises because we do not know a natural way to
14 force the network to avoid predicting large values for controls that are not
15 a part of the training dataset. It is instructive to ask why off-policy or
16 on-policy reinforcement learning does not suffer from extrapolation error.
17 Both of these algorithms explicitly draw more data from the simulator. If
18 the value function were to over-estimate the value of certain control actions
19 at a state, then the controller would take those controls during exploration
20 and discover that the value was in fact an over-estimate. Methods that
21 draw more data have this natural self-correcting behavior that we cannot
22 get in offline learning.
23 There is a second problem associated with computing the maximum
24 in value iteration. For problems with continuous controls, we do not know
25 of computationally effective ways to compute the maximum maxu∈U .

1 Typical implementations of offline learning fit a controller that maximizes


2 the value function
θ = argmax_θ 1/(nT) Σ_{i,t} q^θ_φ(x^i_t, u_θ(x^i_t))

to make it easy to calculate this maximum. But this forces us to use another network (and it is not a given that the parameterized controller can calculate the correct maximum).

One can use a linear value function, e.g., q^θ_φ(x, u) = ⟨[1, x, u], φ [1, x, u]⟩, or any other set of basis functions. But this is not enough to resolve the issue in theory: Bellman updates are not a contraction when the TD error is projected onto a set of bases. In practice, this approach works reasonably well.
6 11.2 Regularized Bellman iteration
7 There are two broad class of techniques that are believed to give reasonable
8 results for offline learning. These are both quite new and ad hoc and
9 effective offline learning is essentially an open problem today. To wit,
10 current offline learning methods not only fail to learn optimal value
11 functions or policies from sub-optimal data, but they often do any better
12 than behavior cloning.

13 11.2.1 Changing the fixed point of the Bellman iteration


14 to be more conservative
15 Since extrapolation error is fundamentally caused by the value function
taking large values, it is a reasonable strategy to modify the Bellman updates
17 to regularize the value function in some way. A basic version would look
18 like a coupled system of updates

φ* = argmin_φ 1/(nT) Σ_{i,t} ( q^θ_φ(x^i_t, u^i_t) − r(x^i_t, u^i_t) − γ q^θ_{φ^{(k)}}(x^i_{t+1}, u_θ(x^i_{t+1})) )² + λ Ω(q^θ_φ)

θ* = argmax_θ 1/(nT) Σ_{i,t} q^θ_φ(x^i_t, u_θ(x^i_t)).
(11.1)

We will use the regularizer

Ω = 1/(nT) Σ_{i,t} ( q^θ_φ(x^i_t, u_θ(x^i_t)) − q^θ_φ(x^i_t, u^i_t) )²_+ .

20 The notation (·)+ denotes rectification, i.e., (x)+ = x if x > 0 and zero
otherwise. Notice that the second term in the objective for fitting the value
22 function forces the value of the control uθ (xit ) to be smaller than the
23 value of the control uit taken at the same state in the dataset. This is a
24 conservative choice. While it does not fix the issue of extrapolation error,
25 it forces the value network to predict smaller values and prevents it from
26 blowing up.

A second, similar, strategy looks like

Ω = 1/(nT) Σ_{i,t} ( max_{u' ∼ u_θ(x^i_t)} q^θ_φ(x^i_t, u') − q^θ_φ(x^i_t, u^i_t) )²_+

where the maximum over controls is computed using a large number of sampled controls u'. A yet another strategy looks like

Ω = 1/(nT) Σ_{i,t} log ∫_u exp( q^θ_φ(x^i_t, u) ) du.

These strategies have been found to be somewhat useful in the sense that they prevent the value function from taking large values. But it is also clear that the solution of these problems is not the optimal value function.

Practically speaking, it is reasonable to expect that even if we do not find the optimal value function, if we can obtain a better policy than the existing policy that is running on the system, then offline reinforcement learning is a viable approach.
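A minimal sketch of the first regularizer above, assuming q_phi(x, u) and u_theta(x) are PyTorch modules and (x_batch, u_batch) are state-control pairs from the dataset:

    import torch

    def conservative_penalty(q_phi, u_theta, x_batch, u_batch):
        # Penalize states where the policy's control is valued higher than the control
        # actually taken in the dataset: ( q(x, u_theta(x)) - q(x, u) )_+^2
        q_pi = q_phi(x_batch, u_theta(x_batch)).squeeze(-1)
        q_data = q_phi(x_batch, u_batch).squeeze(-1)
        return torch.clamp(q_pi - q_data, min=0.0).pow(2).mean()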
7 11.2.2 Estimating the uncertainty of the value function
8 It is reasonable to ask the question whether initializing the value function
1 Chapter 12

2 Meta-Learning

Reading
1. Learning To Learn: Introduction (1996),
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.21.3140

2. Prototypical Networks for Few-shot Learning


http://papers.nips.cc/paper/6996-prototypical-networks-
for-few-shot-learning

3. Model-Agnostic Meta-Learning for Fast Adaptation of Deep


Networks https://arxiv.org/abs/1703.03400

4. A Baseline for Few-Shot Image Classification


https://arxiv.org/abs/1909.02729

5. Meta-Q-Learning https://arxiv.org/abs/1910.00125

3 The human visual system is proof that we do not need lots of images
4 to learn to identify objects or lots of experiences to learn about concepts.
5 Consider the mushrooms shown in the image below.

7 The one on the left is called Muscaria and you’d be able to identify
8 the bright spots on this mushroom very easily after looking at one image.
9 The differences between an edible one in the center (Russala) and the one
10 on the right (Phalloides) may sometimes be subtle but a few samples from
11 each of them are enough for humans to learn this important distinction.


1 There are also more everyday examples of this phenomenon. You


2 touched a hot stove once as a child and have forever learnt not to do it.
3 You learnt to ride a bike as a child and only need a few minutes on a
4 completely new bike to be able to ride it these days. At the same time, you
5 could start learning to juggle today and will be able to juggle 3 objects
6 with a couple of days of practice.
7 The hallmark of human perception and control is the ability to gen-
8 eralize. This generalization comes in two forms. The first is the ability
9 to do a task better if you see more samples from the same task; this is
10 what machine learning calls generalization. The second is the ability to
11 mix-and-match concepts from previously seen tasks to do well on new
12 tasks; doing well means obtaining a lower error/higher reward as well
as learning the new task quickly with few samples. This second kind is the
14 subject of what is called “learning to learn” or meta-learning.

15 Standard machine learning ⇒ generalization across samples


16 Meta-Learning ⇒ generalization across tasks

What is a task? If we are going to formalize meta-learning, we


better define what a task is. This is harder than it sounds. Say we
are doing image classification: classifying cats vs. dogs could be
considered Task 1; Task 2 could be classifying apples vs. oranges. It
is reasonable to expect that learning low-level features such as texture,
colors, shapes etc. while learning Task 1 could help us to do well on
Task 2.

17 This is not always the case, two tasks can also fight each other. Say,
18 you design a system to classify ethnicities of people using two kinds
19 of features. Task 1 uses the shape of the nose to classify Caucasians
20 (long nose) vs. Africans (wide nose). Task 2 uses the kind of hair to
21 classify Caucasians (straight hair) vs. Africans (curly hair). An image of
22 a Caucasian person with curly hair clearly results in two tasks fighting
23 each other.
24 The difficulty in meta-learning begins with defining what a task is.
25 While understanding what a task is may seem cumbersome but doable
26 for image classification, it is even harder for robotics systems.


28 We can think of two different kinds of tasks.



1 1. The first, on the left, is picking up different objects like a soccer


2 ball, a cup of coffee, a bottle of water etc. using a robot manipulator.
3 We may wish to learn how to pick up a soccer ball quickly given
4 data about how to pick up the bottle.

5 2. The second kind of tasks is shown on the right. You can imagine
6 that after building/training a model for the robot in your lab you
7 want to install it in a factory. The factory robot might have 6 degrees-
8 of-freedom whereas the one in your lab had only 5; your policy
9 better adapt to this slightly different state-space. An autonomous
10 car typically has about 10 cameras and 5 LIDARs, any learning
11 system on the car better adapt itself to handle the situation when
12 one of these sensors breaks down. The gears on a robot manipulator
13 will degrade over the course of its life, our policies should adapt to
14 this degrading robot.

15 Almost all the current meta-learning/meta-reinforcement learning


16 literature focuses on developing methods to do the first set of tasks. The
second suite of tasks is however more important in practice. In the
18 remainder of this chapter, we will discuss two canonical algorithms to
19 tackle these two kinds of tasks.
? Meta-learning vs. multi-task learning: We have talked about adaptation as the way to handle new tasks in the previous remark. Consider the following situation: in standard machine learning, we know that the larger the size of the training data we collect, the better the performance on the test data; a large number of images help capture lots of variability in the data, e.g., dogs of different shapes, sizes and colors. You can imagine then that in order to do well on lots of different tasks, i.e., meta-learn, we should simply collect data from lots of different tasks. Can you think as to why mere multi-task learning may not work well for meta-learning?

12.1 Problem formulation for image classification

The image classification formulation thinks of each class/category as a "task". Consider a supervised learning problem with a dataset D = { (x^i, y^i) }_{i=1,...,N}. The labels y^i ∈ {1, . . . , C} for some large C and there are N/C samples in the dataset for each class. Think of this as a large dataset of cars, cats, dogs, airplanes etc., all objects that are very frequent in nature and for which we can get lots of images. This training set is called the meta-training set with C "base tasks".
31 to classify this data. Let us denote the parameters of this model by w. If
32 this model predicts the probability of the input x belonging to each of
33 these known classes we can think of maximizing the log-likelihood of the
data under the model (or minimizing the cross-entropy loss)

ŵ = argmax_w 1/N Σ_{i=1}^N log p_w(y^i | x^i).    (12.1)

35 This is the standard multi-class image classification setup. Since we like


36 to think of one task as one category, this is also the multi-task learning
37 setup. The model
pŵ (· | x)

1 after fitting on the training data will be good at classifying some new input
2 image x as to whether it belongs to one of the C training classes. Note that
3 we have written the model as providing the probability distribution pŵ (· |
4 x) as the output, one real-valued scalar per candidate class {1, . . . , C}.
? Say we are interested in classifying images from classes that are different than those in the training set. The model has only C outputs; effectively, the universe is partitioned into C categories as far as the model is concerned and it does not know about any other classes. How should one formalize the problem of meta-learning then?

12.1.1 Fine-tuning

Let us now consider the following setup. In addition to our original dataset of the base tasks, we are given a "few-shot dataset" that has c new classes and s labeled samples per class, a total of n = cs new samples

D' = { (x^i, y^i); y^i ∈ {C+1, . . . , C+c} }_{i=N+1,...,N+n}.

9 The words “few-shot” simply mean that s is small, in particular we are


10 given much fewer images per class than the meta-training dataset,

n/c = s ≪ N/C.
11 This models the situation where the model is forced to classify images
12 from rare classes, e.g., the three kinds of strawberries grown on a farm in
13 California after being trained on data of cars/cats/dogs/planes etc.
14 We would like to adapt the parameters ŵ using this labeled few-shot
15 data. Here is one solution, we simply train the model again on the new
16 data. This looks like solving another optimization problem of the form
w* = argmin_w 1/n Σ_{i=N+1}^{N+n} [ − log p_w(y^i | x^i) ] + (λ/2) ∥w − ŵ∥²_2.    (12.2)

17 The new parameters w can potentially do well on the new classes even if
18 the shot s is small because training is initialized using the parameters ŵ.
19 We write down this initialization using the second term

(λ/2) ∥w − ŵ∥²_2
which keeps the parameters being optimized w close to their initialization using a quadratic spring-force controlled by the parameter λ. We can expect the new model p_{w*} to perform well on the new classes if the initialization ŵ was good, i.e., if the new tasks and the base tasks were close to each other. This method is called fine-tuning; it is the easiest trick to implement to handle new classes.
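A minimal sketch of the fine-tuning objective in (12.2), assuming model is a pretrained PyTorch classifier whose head has already been extended to the c new classes, and (x_new, y_new) is the few-shot data:

    import copy
    import torch
    import torch.nn.functional as F

    def finetune(model, x_new, y_new, lam=1.0, steps=100, lr=1e-3):
        w_hat = copy.deepcopy(model)                      # frozen copy of the initialization w-hat
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(steps):
            loss = F.cross_entropy(model(x_new), y_new)
            # quadratic "spring force" that keeps w close to its initialization
            for p, p0 in zip(model.parameters(), w_hat.parameters()):
                loss = loss + 0.5 * lam * (p - p0.detach()).pow(2).sum()
            opt.zero_grad(); loss.backward(); opt.step()
        return model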
? Think of a multi-layer neural network ŵ that has C outputs. The new network should now produce c outputs; how should we modify this network?
27 The cross-entropy objective used in (12.1) to train the model pŵ simply
28 maximizes the log-likelihood of the labels given the data. It is reasonable
29 to think that since the base classes are not going to show up as the few-shot
30 classes, we should not be fitting to this objective.

The idea behind a prototypical loss is to train the model to be a


good discriminator among different classes.

1 Let us imagine the features of the model, e.g., the activations of the
2 last layer in the neural network,

z = φw (x)

3 for a particular image x. Note that the features z depend on the parameters
4 w. During standard cross-entropy training, there is a linear layer on top of
5 these features and our prediction probability for class y is

p_w(y | x) = e^{w_y^⊤ z} / Σ_{y'} e^{w_{y'}^⊤ z}

6 where wy ∈ Rdim(z) . This is the softmax operation and the vectors w are
7 the weights of the last layer of the network; when we wrote (12.1) we
8 implicitly folded those parameters into the notation w.
9 Prototypical networks train the model to be a discriminator as follows.
10 1. Each mini-batch during training time consists of a few-shot task
11 created out of the meta-training set by sub-sampling.

D_episode = D_support ∪ D_query
          = { (x^i, y^i); y^i ∈ {1, . . . , C} }_{i ∈ {1,...,Cs}} ∪ { (x^i, y^i); y^i ∈ {1, . . . , C} }_{i ∈ {1,...,Cq}}




12 with |Dsupport | = Cs and |Dquery | = Cq. This is called an “episode”


13 by researchers in this literature. Each episode comes with some
14 more data from the same classes called the “query-shot” in this
15 literature. The query-shot is akin to the data from the new classes
16 that the model is forced to predict during adaptation time. Let us
17 have q query-shot per class in each episode.

18 2. We know the labels of the N = Cs labeled data and can compute


19 the prototypes, which are simply centroids of the features,

µ_y = 1/s Σ_{(x^i, y^i) ∈ D_episode} 1{y^i = y} φ_w(x^i).

20 3. You can now impose a clustering loss to force the query samples to
21 be labeled correctly, i.e., maximize

p_{w,µ_y}(y | x) = e^{−∥φ_w(x) − µ_y∥_2} / Σ_{y'} e^{−∥φ_w(x) − µ_{y'}∥_2}

22 where y = y i and x = xi for each of the samples (xi , y i ) in the


23 query-set of the episode.

4. The objective maximized at each mini-batch is

1/(Cq) Σ_{(x^i, y^i) ∈ D_query} log p_{w,µ_{y^i}}(y^i | x^i).

Note that the gradient of the above expression flows into all the weights w, both through the features φ_w(x) of the query samples and through the prototypes µ_y (a code sketch is given after this list).

5 5. We can now use the trained model for classifying new classes by
6 simply feeding the new images through the network, computing the
7 prototypes using the few labeled data and computing the clustering
8 loss on the unlabeled query data at test time to see which prototype
9 the feature of a particular query datum is closest to.
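A rough sketch of steps 2-4 above, assuming phi_w is the feature network and the support/query tensors carry integer class labels in {0, ..., C-1}:

    import torch
    import torch.nn.functional as F

    def prototypical_loss(phi_w, x_support, y_support, x_query, y_query, n_classes):
        z_s, z_q = phi_w(x_support), phi_w(x_query)
        # prototypes: mean feature of the support samples of each class
        protos = torch.stack([z_s[y_support == c].mean(dim=0) for c in range(n_classes)])
        # negative distances act as the logits of the softmax over classes
        dists = torch.cdist(z_q, protos)          # (num_query, n_classes)
        return F.cross_entropy(-dists, y_query)   # minimizes -log p_{w,mu}(y | x)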

10 Discussion Prototypical loss falls into the general category of metric-


11 based approaches to few-shot learning. We make a few remarks next.

12 1. It is a very natural setting for learning representations of the data for


13 classification that can be transferred easily. If the model is going to
14 be used for new classes, it seems reasonable that the prototypes of
15 the new classes should be far away from each other and the zs of the
16 query samples should be clustered around their correct prototypes.

17 2. Prototypical networks perform well if you can estimate the proto-


18 types accurately. In practice, this requires that you have about 10
19 labeled data per new class.

20 3. We used the ℓ2 metric ∥·∥2 in the z-space to compute the affinities


21 of the query samples. This may not be a reasonable metric to use for
22 some problems, so a large number of approaches try to devise/learn
23 new metrics.

24 4. A key point of prototypical networks is that there is no gradient-based


25 learning going on upon the new categories; we simply compute
26 the prototypes and the affinities and use those to classify the new
27 samples.

28 12.1.3 Model-agnostic meta-learning (MAML)


29 We will next look at a simple algorithm for gradient-based adaptation of
30 the model on the new categories. The key idea is to update the model

1 using the same objective in (12.1) but avoid overfitting the model on the
2 meta-training data so that the model can be quickly adapted using the
3 few-shot data via gradient-updates.
4 Here we consider an episode Depisode = Dsupport and Dquery = ∅, i.e.,
there are no query shots. Let us define

ℓ(w; D_support) = 1/(Cs) Σ_{(x^i, y^i) ∈ D_support} log p_w(y^i | x^i);

6 this is the same objective as that in (12.1) so if we maximized the objective

ℓ(w; Dsupport )

7 we will perform standard cross-entropy training. At each mini-batch/episode,


8 the MAML algorithm instead maximizes the objective

ℓ_maml(w; D_support) = ℓ( w + α ∇ℓ(w; D_support); D_support ).    (12.3)

9 In other words, MAML uses a “look ahead” gradient: the gradient of


10 ℓ(w; Dsupport ) is not in the steepest ascent direction of ℓ(w; Dsupport )
11 but in the steepest ascent direction after one update of the parameters
12 w + α∇ℓ(w; Dsupport ).
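A minimal sketch of one MAML objective evaluation, assuming PyTorch 2.x (for torch.func.functional_call), params = dict(model.named_parameters()), and a loss_fn such as cross-entropy (a loss to be minimized, so the signs are flipped relative to the log-likelihood ℓ in the text):

    import torch
    from torch.func import functional_call

    def maml_loss(model, params, loss_fn, x_support, y_support, alpha=0.01):
        # inner step: one gradient step on the support loss
        inner = loss_fn(functional_call(model, params, (x_support,)), y_support)
        grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)
        adapted = {name: p - alpha * g for (name, p), g in zip(params.items(), grads)}
        # outer objective: the loss evaluated at the "looked-ahead" parameters
        return loss_fn(functional_call(model, adapted, (x_support,)), y_support)

Calling .backward() on the returned value differentiates through the inner update, which is where the Hessian term discussed below comes from.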

Adaptation on the few-shot data  Once we have a model trained using MAML,

ŵ = argmax_w ℓ_maml(w; D),

we can update it on new data simply by maximizing the standard cross-entropy objective again, i.e.,

w* = argmax_w 1/n Σ_{i=N+1}^{N+n} log p_w(y^i | x^i) − 1/(2λ) ∥w − ŵ∥²_2.    (12.4)

The adaptation phase is exactly the same as standard cross-entropy training.

? How does look-ahead in MAML help?

18 MAML as an approximation of a second order optimization method


19 MAML is not specific to few-shot learning. We can use the MAML
objective for any other standard supervised learning problem; is this going to help? Indeed it will: gradient descent/stochastic gradient descent are myopic algorithms because they update parameters only in the direction of the steepest gradient; you can potentially do better by computing the
lookahead gradient. The caveat is that it is computationally difficult to compute the lookahead gradient. Observe that

ℓ_maml(w) = ℓ(w + α ∇ℓ(w))
          ≈ ℓ(w) + α (∇ℓ(w))^⊤ ∇ℓ(w)
⇒ ∇ℓ_maml(w) ≈ ∇ℓ(w) + 2α ∇²ℓ(w) ∇ℓ(w).

So MAML is secretly a second-order optimization method: computing the


2 gradient of the MAML objective requires having access to the Hessian of
3 the objective ∇2 ℓ(w). For large models such as neural networks this is
4 very expensive to compute.
5 Remark 12.1. Let us consider a meta-training set with two mini-batches/episodes/tasks,
D = D_1 ∪ D_2. The MAML algorithm uses the gradient

∇ℓ_maml(w; D) = 1/2 Σ_{i=1}^2 ∇ℓ_maml(w; D_i)
              ≈ 1/2 Σ_{i=1}^2 [ ∇ℓ(w; D_i) + 2α ∇²ℓ(w; D_i) ∇ℓ(w; D_i) ].

Observe now that if there exist parameters w that have ∇ℓ(w; D_i) = 0 for all
8 the episodes Di then the MAML gradient is also zero. In other words, if
9 there exist parameters w that work well for all tasks then MAML may find
such parameters. However, in this case, the simple objective

ℓ_multi-task(w; D) = 1/2 Σ_{i=1}^2 ℓ(w; D_i)    (12.5)

that sums up the losses of all the mini-batches/episodes/tasks will also find these parameters. This objective, known as the multi-task learning objective, is much simpler than MAML's because it requires only the first-order gradient.

? What happens if the two tasks are different (as is likely to be the case), in which case there don't exist parameters that work well for all the tasks?

15 12.2 Problem formulation for meta-RL


16 One mathematical formulation of meta-RL is as follows. Let k denote a
17 task and there is an underlying (unknown) dynamics for this task given by

x^k_{t+1} = f^k(x^k_t, u^k_t, ξ_t)

18 We will assume that all the tasks have a shared state-space xkt ∈ X and a
19 shared control-space ukt ∈ U . The reward function of each task is different
20 rk (x, u) but we are maximizing the same infinite-horizon discounted
21 objective for each task. The q-function is then defined to be
"∞ #
k,θ k
X
t k
q (x, u) = E γ r (xt , ut ) | x0 = x, u0 = u, ut = uθk (xt ) .
ξ(·)
t=0

22 where uθk (xt ) is a deterministic controller for task k. Given all these
23 meta-training tasks, our objective is to learn a controller that can solve a
new task k ∉ {1, . . . , K} upon being presented a few trajectories from
25 the new task. Think of you learning to pick up different objects during
26 training time and then adapting to picking up a new object not in the
27 training set.

1 Let us consider the off-policy Q-learning setting and learn separate


2 controllers for all the tasks for now. As usual, we want the q-function
3 to satisfy the Bellman equation, i.e., if we are using parameters φk to
approximate the q-function, we want to find parameters φ_k such that

argmin_{φ_k} E_{(x,u,x') ∈ D^k} [ ( q^{k,θ_k}_{φ_k}(x, u) − r^k(x, u) − γ q^{k,θ_k}_{φ_k}(x', u_{θ_k}(x')) )² ]    (12.6)
5 where the dataset Dk is created using some exploratory policy for the task
k. The controllers u_{θ_k} are trained to behave like the greedy policy for the particular q-function q^{k,θ_k}_{φ_k}:

argmax_{θ_k} E_{(x,u) ∈ D^k} [ q^{k,θ_k}_{φ_k}(x, u_{θ_k}(x)) ].    (12.7)

8 The above development is standard off-policy Q-learning and we


9 have seen it in earlier lectures. The different controllers uθk do not
learn anything from each other; they are trained independently on their
11 own datasets. We can now construct a multi-task learning objective for
meta-RL; in this, we will learn a single q-function and a single controller
13 for all tasks. We modify (12.6) and (12.7) to simply work on all the
14 datasets together
argmin_φ E_{(x,u,x') ∈ D¹ ∪ D² ∪ ...} [ ( q^w_φ(x, u) − r^k(x, u) − γ q^w_φ(x', u_w(x')) )² ]
argmax_w E_{(x,u) ∈ D¹ ∪ D² ∪ ...} [ q^w_φ(x, u_w(x)) ]
(12.8)

This is the multi-task learning objective for RL. This is unlikely to work well because, depending upon the task, the controllers for the different tasks will conflict with each other; it is unlikely that there is a single set of parameters for the controller and the q-function that works well for all tasks.

? Imagine a planning task with multiple goals. The optimal trajectory that goes to one goal location will bifurcate from the optimal trajectory that goes to some other goal. We will never be able to learn a controller that goes to both goals using one neural network. How to fix this? You can use MAML certainly to under-fit the controller and the q-function to all the tasks and then adapt them using some data from the new task using gradient updates.
using some data from the new task using
20 12.2.1 A context variable gradient updates.

21 Reinforcement Learning offers a very interesting way to solve the few-


22 shot/meta-learning problem. We can append the state-space to include
23 a context variable that is a representation of the particular task. Let us
24 construct features for a trajectory using a set of basis functions

{ϕ1 (x, u, r), ϕ2 (x, u, r), . . . , ϕm (x, u, r)}

25 We will now construct a variable µk (τ ) for a trajectory τ0:t = (x0 , u0 , . . . , xt , ut , . . . )


26 from task k as
µ(τ_{0:t}) = Σ_{s=0}^t Σ_{i=1}^m γ^s α_i ϕ_i(x^k_s, u^k_s, r^k(x^k_s, u^k_s)).

1 We will call this variable a “context” because we can use it to guess which
2 task a particular trajectory is coming from. It is important to note that
3 the mixing coefficients αi are shared across all the tasks. We would
4 like to think of this feature vector µ(τ ) as a kind of indicator of whether a
5 trajectory τ belongs to the task k or not. We now learn a q-function and
6 controller that also depend on µ(τ )

q^θ_φ(x_t, u_t, µ(τ_{0:t})),
u_θ(x_t, µ(τ_{0:t})).
(12.9)

7 Including a context variable like µ(τ ) allows the q-function to detect the
8 particular task that it is being executed for using the past t time-steps of
the trajectory τ_{0:t}. This is similar to learning independent q-functions q^{k,θ_k}_{φ_k} and controllers u_{θ_k}, but there is parameter sharing going on in (12.9).
11 We will still solve the multi-task learning problem like (12.8) but also
12 optimize the parameters αi s that combine the basis functions.
argmin_{φ, α_i} Σ_{k=1}^K E_{(x,u,x') ∈ D^k} [ ( q^θ_φ(x, u, µ(τ)) − r^k(x, u) − γ q^θ_φ(x', u_θ(x', µ(τ)), µ(τ)) )² ]
argmax_{θ, α_i} Σ_{k=1}^K E_{(x,u) ∈ D^k} [ q^θ_φ(x, u_θ(x, µ(τ)), µ(τ)) ]
(12.10)
13 The parameters αi of the context join the q-functions and the controllers
14 of the different tasks together but also allow the controller the freedom to
15 take different controls depending on which task it is being trained for.
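A minimal sketch of computing the context variable µ(τ_{0:t}) from a recorded trajectory, where phi (which returns the m basis features of a transition) and the shared coefficient vector alpha are placeholders:

    import torch

    def context(trajectory, phi, alpha, gamma=0.99):
        # trajectory: list of (x_s, u_s, r_s) tuples for s = 0, ..., t
        mu = 0.0
        for s, (x, u, r) in enumerate(trajectory):
            mu = mu + (gamma ** s) * torch.dot(alpha, phi(x, u, r))
        return mu   # appended to the state before it is fed to q_phi and u_theta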

16 Adapting the meta-learnt controller to a new task Suppose we


trained on K tasks using the above setup and have the parameters θ̂, φ̂, {α̂_i}_{i=1,...,m} in our hands. How should we adapt to a new task? This is easy: we can run an exploration policy on the new task (the current policy u_θ will work just fine if the control space U is the same) to collect some data and update our off-policy Q-learning parameters θ̂, φ̂ on this data using (12.6) and (12.7), while keeping the results close to our meta-trained parameters using penalties like

1/(2λ) ∥θ − θ̂∥²_2   and   1/(2λ) ∥φ − φ̂∥²_2.

? Does adaptation always improve performance on the new task?
24 We don’t update the context parameters αi s during such adaptation.

25 12.2.2 Discussion
26 This brings an end to the chapter on meta-learning and Module 4. We
27 focused on adapting learning-based models for robotics to new tasks. This
28 adaptation can take the form of learning a reward (inverse RL), learning
29 the dynamics (model-based RL) or learning to adapt (meta-learning).
30 Adaptation to new data/tasks with few samples is a very pertinent problem
because we want learning-based methods to generalize to a variety of different

tasks beyond the ones they have been trained for. Such adaptation also comes
2 with certain caveats, adaptation may not always improve the performance
3 on new tasks; understanding when one can/cannot adapt forms the bulk of
4 the research on meta-learning.
1 Bibliography

2 Censi, A. (2016). A class of co-design problems with cyclic constraints and their solution. IEEE Robotics and
3 Automation Letters, 2(1):96–103.

4 Fakoor, R., Mueller, J. W., Asadi, K., Chaudhari, P., and Smola, A. J. (2021). Continuous doubly constrained
5 batch reinforcement learning. Advances in Neural Information Processing Systems, 34:11260–11273.

6 Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). Offline reinforcement learning: Tutorial, review, and
7 perspectives on open problems. arXiv preprint arXiv:2005.01643.

8 Turing, A. M. (2009). Computing machinery and intelligence. In Parsing the Turing Test, pages 23–65.
9 Springer.

