ESE 650
Spring 2023
Instructor
Pratik Chaudhari [email protected]
Teaching Assistants
Jianning Cui (cuijn)
Swati Gupta (gswati)
Chris Hsu (chsu8)
Gaurav Kuppa (gakuppa)
Alice Kate Li (alicekl)
Pankti Parekh (pankti81)
Aditya Singh (adiprs)
Haoxiang You (youhaox)
1 What is Robotics? 5
1.1 Perception-Learning-Control 6
1.2 Goals of this course 7
1.3 Some of my favorite robots 7
9 Q-Learning 173
9.1 Tabular Q-Learning 173
9.1.1 How to perform exploration in Q-Learning 176
9.2 Function approximation (Deep Q Networks) 177
9.2.1 Embellishments to Q-Learning 179
9.3 Q-Learning for continuous control spaces 183
12 Meta-Learning 196
12.1 Problem formulation for image classification 198
12.1.1 Fine-tuning 199
12.1.2 Prototypical networks 199
12.1.3 Model-agnostic meta-learning (MAML) 201
12.2 Problem formulation for meta-RL 203
12.2.1 A context variable 204
12.2.2 Discussion 205
Bibliography 207
Chapter 1
What is Robotics?
Reading
1. Computing machinery and intelligence, Turing (2009)
2. Thrun Chapter 1
3. Barfoot Chapter 1
The word robotics was first used by the Czech writer Karel Capek in a play named “Rossum’s Universal Robots”, where the owner of this company, Mr. Rossum, builds robots, i.e., agents who do forced labor, effectively an artificial man. The word was popularized by Isaac Asimov in one of his short stories named Liar!. This is about a robot named RB-34 which, through a manufacturing fault, happens to be able to read the minds of humans around it. Around 1942, Isaac Asimov started using the word robot in his writings. This is also when he introduced the Three Laws of Robotics as the theme for how robots would interact with others in his stories/books. These are as follows.
1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
2. A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
she confronts RB-34 later by pointing out that lying to people can end up hurting them, the robot experiences a logical conflict within its laws and becomes unresponsive.
This is, after all, science fiction but these laws give us insight into what robots are. Let’s see what modern roboticists have to say.
1.1 Perception-Learning-Control
Perception refers to the sensory mechanisms to gain information about the environment (eyes, ears, tactile input etc.). Action refers to your hands, legs, or motors/engines in machines that help you move on the basis of this information. Learning is the glue in between: it helps crunch information from your sensors quickly, compares it with past data, guesses what future data may look like, and computes actions that are likely to succeed. The three facets of intelligence are not sequential and robotics is not merely a feed-forward process. Your sensory inputs depend on the previous action you took.
(Footnote: feel free to come up with another definition.)
Other courses Some other courses at Penn that address various aspects of this picture above are
• Learning: CIS 520, CIS 521, CIS 522, CIS 620, CIS 700, ESE 545, ESE 546
• Control: ESE 650, MEAM 520, MEAM 620, ESE 500, ESE 505, ESE 619
These videos should give you an idea of what the everyday life of a roboticist looks like: Kiva’s robots, Waymo’s 360 experience, Boston Dynamics’ Spot, the JPL-MIT team at the DARPA Sub-T Challenge, Romeo and Juliet at Ferrari’s factory, Anki’s Vector, and the DARPA Humanoid Challenge.
Chapter 2
Introduction to State Estimation
Reading
1. Barfoot, Chapter 2.1-2.2
2. Thrun, Chapter 2
Ω = {HH, HT, TH, TT}.
molecules in the coin. After every experiment, in this case tossing the two coins once each, we obtain an outcome; an event is a subset A ⊂ Ω of the sample space, e.g.,
A = {HH} .
Probability theory is a mathematical framework that allows us to reason about phenomena or experiments whose outcome is uncertain. The probability of an event, P(A), is a function that maps each event A to a number between 0 and 1: the closer this number is to 1, the stronger our belief that the outcome of the experiment is going to be A.
You can use these axioms to say things like P(∅) = 0, P(Ac) = 1 − P(A), or that if A ⊆ B then P(A) ≤ P(B).
Effectively, the sample space has now shrunk from Ω to the event B. It would be silly to have a null sample-space, so let’s say that P(B) ≠ 0. We define conditional probability as
P(A | B) = P(A ∩ B) / P(B);   (2.1)
Bayes’ rule Imagine that instead of someone telling us that the conditioning event actually happened, we simply had a belief P(Ai) for each event Ai and wished to compute P(Ai | B). We have
P(Ai | B) = P(Ai ∩ B) / P(B)
          = P(Ai) P(B | Ai) / P(B)   (2.3)
          = P(Ai) P(B | Ai) / Σj P(Aj) P(B | Aj).
This is different from disjoint events. Disjoint events never co-occur, i.e., observing one tells us that the other one did not occur.
The way to fix this is to avoid defining the probability of a set in terms of the probability of elementary outcomes and work with more general sets. While we would ideally like to be able to specify the probability of every subset of Ω, it turns out that we cannot do so in a mathematically consistent way. The trick then is to work with a smaller object known as a σ-algebra, that is, the set of “nice” subsets of Ω.
• ∅ ∈ F
• If A ∈ F, then Ac ∈ F.
• If A1, A2, ... ∈ F, then ∪i Ai ∈ F.

P : F → [0, 1].
number of values. For instance, if X is the number of coin tosses until the first head, and if we assume that our tosses are independent with P(H) = p > 0, then we have

We also have the following relationship between the CDF and the PDF; the former is the integral of the latter:
P(−∞ < X ≤ x) = FX(x) = ∫_{−∞}^{x} fX(x′) dx′.
and denotes the center of gravity of the probability mass function. Roughly speaking, it is the average of a large number of repetitions of the same experiment. Expectation is linear, i.e.,
E[aX + b] = a E[X] + b.
Similarly, the variance is
Var(X) = E[(X − E[X])²] = Σx (x − E[X])² pX(x) = E[X²] − (E[X])².
fX|Y(x | y) = fX,Y(x, y) / fY(y)   if fY(y) > 0.
For any given y, the conditional PDF is a normalized section of the joint PDF, as shown below.
fY|X(y | x) = fX|Y(x | y) fY(y) / fX(x).   (2.5)
Similarly, we also have the law of total probability in the continuous form
fX(x) = ∫_{−∞}^{∞} fX|Y(x | y) fY(y) dy.
knowledge
P(open | Y) = P(Y | open) P(open) / P(Y).
Remember that the left hand side (diagnostic) is typically something that we desire to calculate. Let us put some numbers in this formula. Let P(Y | open) = 0.6 and P(Y | not open) = 0.3. We will imagine that the door is open or closed with equal probability: P(open) = P(not open) = 0.5. We then have
P(open | Y) = P(Y | open) P(open) / P(Y)
            = P(Y | open) P(open) / (P(Y | open) P(open) + P(Y | not open) P(not open))
            = (0.6 × 0.5) / (0.6 × 0.5 + 0.3 × 0.5) = 2/3.
Notice something very important: the original (prior) probability of the state of the door was 0.5. If we have a sensor that fires with higher likelihood if the door is open, i.e., if
P(Y | open) / P(Y | not open) > 1,
then the probability of the door being open after receiving an observation increases. If this likelihood ratio were less than 1, then observing a realization of Y would reduce our estimate of the probability of the door being open.
Combining evidence for Markov observations Say we updated the prior probability using our first observation Y1; let us take another observation Y2. How can we integrate this new observation? It is again an application of Bayes rule using two observations, or in general multiple observations Y1, ..., Yn. Let us imagine this time that X = open. (The denominator in Bayes rule, i.e., P(Y), is called the evidence in statistics.)
Let us make the very natural assumption that our observations from the sensor Y1, ..., Yn are independent given the state of the door X. This is known as the Markov assumption.
We now have
P(X | Y1, ..., Yn) = P(Yn | X) P(X | Y1, ..., Yn−1) / P(Yn | Y1, ..., Yn−1)
                   = η P(Yn | X) P(X | Y1, ..., Yn−1)
where
η−1 = P(Yn | Y1, ..., Yn−1)
is the denominator. We can now expand the diagnostic probability on the
Given these two functions, we can use the recursion to update multiple
observations. The same basic idea also holds if you have two quantities
to estimate, e.g., X1 = open door and X2 = color of the door. The
recursive application of Bayes rule lies at the heart of all state
estimation methods.
Let us again put some numbers into these formulae; imagine that the observation Y2 was taken using a different sensor which now has

Notice in this case that the probability that the door is open has reduced from P(open | Y1) = 2/3.
(T^k)ij = P(Xk = xj | X0 = xi).
The key property of a Markov chain is that the next state Xk+1 is independent of all the past states X1, ..., Xk−1 given the current state Xk:
Xk+1 ⊥⊥ X1, ..., Xk−1 | Xk.
This is known as the Markov property, and all systems where we can define a “state” which governs their evolution have this property. Markov chains form a very broad class of systems. For example, all of Newtonian physics fits this assumption. (? Does a deterministic dynamical system, e.g., a simple pendulum, also satisfy the Markov assumption? What is the transition matrix in this case?)
What is the state of the following systems?
This equation governs how the probabilities P(Xk = xi) change with time k. Let’s do the calculations for the Whack-The-Mole example. Say the mole was at hole x1 at the beginning. So the probability distribution of its presence
π(k) = [P(Xk = x1), P(Xk = x2), P(Xk = x3)]⊤
is such that
π(1) = [1, 0, 0]⊤.
We can now write the above formula as
The numbers P(Xk = xi) stop changing with time k. Under certain technical conditions, the distribution π(∞) is unique (a single communicating class for a Markov chain with a finite number of states). We can compute this invariant distribution by writing (we denote the transpose of the matrix T by the Matlab notation T′ instead of T⊤ for clarity)
π(∞) = T′ π(∞).
We can also compute the distribution π(∞) directly: the invariant distribution is the right-eigenvector of the matrix T′ corresponding to the eigenvalue 1. (? Do we always know that the transition matrix has an eigenvalue that is 1?)
Example 2.1. Consider a Markov chain on two states where the transition matrix is given by
T = [ 0.5  0.5
      0.4  0.6 ].
The invariant distribution is the solution of π(∞) = T′ π(∞), i.e.,
π(1) = 0.5 π(1) + 0.4 π(2),   π(2) = 0.5 π(1) + 0.6 π(2).
Note that the constraint for π being a probability distribution, i.e., π(1) + π(2) = 1, is automatically satisfied by the two equations. We can solve for π(1), π(2) to get
π(1) = 4/9,   π(2) = 5/9.
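The calculation above is easy to check numerically. Below is a minimal sketch, assuming numpy is available, that computes the invariant distribution of the transition matrix in Example 2.1 both as the eigenvector of T′ with eigenvalue 1 and by simply iterating π ← T′π.

import numpy as np

T = np.array([[0.5, 0.5],
              [0.4, 0.6]])   # transition matrix, rows sum to 1

# The invariant distribution satisfies pi = T' pi, i.e., it is the
# eigenvector of T' with eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(T.T)
idx = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()
print(pi)                    # approximately [4/9, 5/9]

# Alternatively, simply iterate pi <- T' pi from any initial distribution.
pi_iter = np.array([1.0, 0.0])
for _ in range(100):
    pi_iter = T.T @ pi_iter
print(pi_iter)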
Figure 2.1: A Hidden Markov Model with the underlying Markov chain; the observation at time k only depends upon the hidden state at that time instant. Ignore the notation Z1, ..., Zt; we will denote the observations by Yk.
The matrix M has non-negative entries; after all, each entry is a probability. Since each state has to result in some observation, we also have
Σj Mij = 1.
Tij = P(Xk+1 = xj | Xk = xi ).
P(Xk | Y1 , . . . , Yk ).
This is the case when we wish to make predictions about the state of the car at a time j > k given only observations until time k. If we knew the underlying Markov chain for the HMM and its transition matrix T, this would amount to running (2.7) forward using the output of the filtering problem as the initial distribution of the state. (? Why is this true?)
P(X1 , . . . , Xk | Y1 , . . . , Yk )
P(Y1 , . . . , Yk ).
As you may recall, this is the denominator that we need for the recursive application of Bayes rule. It is made difficult by the fact that we do not know the state trajectory X1, ..., Xk corresponding to these observations.
These problems are closely related to each other and we will next dig deeper into them. We will first discuss two building blocks, called the forward and backward algorithms, that together help solve all the above problems.
P(Y1, ..., Yk)
= Σ_{all (x1,...,xk)} P(Y1, ..., Yk | X1, ..., Xk) P(X1, ..., Xk)
= Σ_{all (x1,...,xk)} [ Π_{i=1}^{k} P(Yi = yi | Xi = xi) ] P(X1 = x1) Π_{i=2}^{k} P(Xi = xi | Xi−1 = xi−1)
= Σ_{all (x1,...,xk)} Mx1y1 Mx2y2 ... Mxkyk πx1 Tx1x2 ... Txk−1xk.
But this is a very large computation: for each possible trajectory (x1, ..., xk) that the states could have taken, we need to perform 2k multiplications. (? How many possible state trajectories are there? What is the total cost of computing the likelihood of observations?)
1. We can initialize
3. Finally, we have
P(Y1, ..., Yk) = Σx αk(x).
1. Initialize
βt (x) = 1 for all x.
This simply indicates that since we are at the end of the
trajectory, the future trajectory Yt+1 , . . . does not exist.
where ⊙ denotes the element-wise product and M·,yk+1 is the yk+1-th column of the matrix M. The update equation for the backward variables is
βk = T (βk+1 ⊙ M·,yk+1).
P(Xk = x | Y1 , . . . , Yk )
P(Xk = x | Y1 , . . . , Yk ) ̸= P(Xk = x | Yk )
2.4.4 Smoothing
Given observations till time t, we would like to compute
P(Xk = x | Y1, ..., Yt) = P(Xk = x, Y1, ..., Yt) / P(Y1, ..., Yt)
= P(Xk = x, Y1, ..., Yk, Yk+1, ..., Yt) / P(Y1, ..., Yt)
= P(Yk+1, ..., Yt | Xk = x, Y1, ..., Yk) P(Xk = x, Y1, ..., Yk) / P(Y1, ..., Yt)
= P(Yk+1, ..., Yt | Xk = x) P(Xk = x, Y1, ..., Yk) / P(Y1, ..., Yt)
= βk(x) αk(x) / P(Y1, ..., Yt).   (2.12)
Study the first step carefully: the numerator is not equal to αk(x) because the observations go all the way till time t. The final step uses both the Markov and the HMM properties: future observations Yk+1, ..., Yt depend only upon future states Xk+1, ..., Xt (HMM property), which are independent of the past observations and states given the current state Xk = x (Markov property). (? Both the filtering problem and the smoothing problem give us the probability of the state given observations. Discuss which one we should use in practice and why.)
Smoothing can therefore be implemented by running the forward algorithm to update αk for k = 1, ..., t and the backward algorithm to update βk for k = t, ..., 1.
To see an example of smoothing in action, see ORB-SLAM 2. What do you think is the state of the Markov chain in this video?
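The forward and backward recursions are short to implement. Below is a minimal sketch, assuming numpy; the matrices T, M, the initial distribution π, and the observation sequence are illustrative placeholders, not the example used in the text. It computes the unnormalized forward/backward variables and, from them, the filtering and smoothing distributions as in (2.12).

import numpy as np

T = np.array([[0.6, 0.3, 0.1],     # T[i, j] = P(X_{k+1} = x_j | X_k = x_i)
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
M = np.array([[0.7, 0.2, 0.1],     # M[i, j] = P(Y_k = y_j | X_k = x_i)
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])
pi = np.array([1.0, 0.0, 0.0])     # initial distribution of X_1
y = [0, 2, 2]                      # observation indices y_1, ..., y_t

t, n = len(y), T.shape[0]
alpha = np.zeros((t, n))           # alpha_k(x) = P(Y_1..Y_k, X_k = x)
beta = np.ones((t, n))             # beta_k(x)  = P(Y_{k+1}..Y_t | X_k = x)

alpha[0] = pi * M[:, y[0]]
for k in range(1, t):
    alpha[k] = (T.T @ alpha[k - 1]) * M[:, y[k]]
for k in range(t - 2, -1, -1):
    beta[k] = T @ (beta[k + 1] * M[:, y[k + 1]])

likelihood = alpha[-1].sum()                       # P(Y_1, ..., Y_t)
filtering = alpha / alpha.sum(axis=1, keepdims=True)
smoothing = alpha * beta / likelihood              # gamma_k(x), eq. (2.12)
print(filtering, smoothing, sep="\n")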
Y1 = 1, Y2 = 3, Y3 = 3.
and

Using these, we can now compute the filtering and the smoothing state distributions; let us denote them by πf and πs respectively.
πf1 = (1, 0, 0),   πf2 = (0.05, 0.2, 0.75),   πf3 = (0.045, 0.2487, 0.7063)
2.4.5 Prediction
We would like to compute the future probability of the state given observations up to some time.
Here is a typical scenario when you would need this estimate. Imagine that you are tracking the position of a car using images from your camera. You are using a deep network to detect the car in each image Yk and, since the neural network is quite slow, the car moves multiple time steps forward before you get the next observation. As you can appreciate, it would help us compute a more accurate estimate of the conditional probability of Xk = x if we propagated the position of the car in between successive observations using our Markov chain. This is easy to do.
1. We compute the filtering estimate πft = P(Xt = x | Y1, ..., Yt) using the forward algorithm.
2. Propagate the Markov chain forward for k − t time-steps using πft as the initial condition using
πi+1 = T′ πi.
(X̂1, ..., X̂t)
as the answer. This is however only the point-wise best estimate of the state. This sequence may not be the most likely trajectory of the Markov chain underlying our HMM. In the decoding problem, we are interested in computing the most likely state trajectory, not the point-wise most likely sequence of states. Let us take an example of Whack-the-mole again. We will use a slightly different Markov chain shown below.
(2, 3, 3, 2, 2, 2, 3, 2, 3)

The most likely state at each instant is marked in blue. The point-wise most likely sequence of states is
(1, 3, 3, 3, 3, 2, 3, 2, 3).
Observe that this is not even feasible for the Markov chain. The transition from x3 → x2 is not possible, so this answer is clearly wrong. Let us look at the smoothing estimates.
(1, 2, 2, 2, 2, 2, 3, 3, 3).
Because the smoothing estimate at time k also takes into account the observations from the future t > k, it effectively eliminates the impossible transition from x3 → x2. This is still not, however, the most likely trajectory.
We will exploit the Markov property again to calculate the most likely state trajectory recursively. Let us define the “decoding variables” as

the joint probability that the most likely trajectory ends up at state x at time k + 1 is the maximum among the joint probabilities that end up at any state x′ at time k, multiplied by the one-step state transition Tx′x and observation Mxyk+1 probabilities. We would like to iterate upon this identity to find the most likely path. The key idea is to maintain a pointer to the parent state parentk(x) of the most likely trajectory, i.e., the state from which you could have reached Xk = x given observations. Let us see how.
δ1(x) = πx Mxy1,   parent1(x) = null
for all states x. For all times k = 1, ..., t − 1 and for all states x, update
and we can now backtrack using our parent pointers to find the most likely trajectory that leads to this state.
The most likely path is the one that ends in 3 with joint probability 0.0432. This path is (1, 3, 3).
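Here is a sketch of Viterbi’s algorithm along the lines described above, assuming numpy; the matrices T, M, π and the observations below are made-up placeholders, so the output will not match the numbers of the whack-the-mole example.

import numpy as np

def viterbi(pi, T, M, y):
    """Most likely state trajectory given observation indices y."""
    t, n = len(y), len(pi)
    delta = np.zeros((t, n))             # delta_k(x): joint prob. of the best path ending at x
    parent = np.zeros((t, n), dtype=int) # back-pointers parent_k(x)

    delta[0] = pi * M[:, y[0]]
    for k in range(1, t):
        # scores[i, j] = delta_{k-1}(i) * T_{ij} * M_{j, y_k}
        scores = delta[k - 1][:, None] * T * M[:, y[k]][None, :]
        parent[k] = scores.argmax(axis=0)
        delta[k] = scores.max(axis=0)

    # backtrack from the best final state
    path = [int(delta[-1].argmax())]
    for k in range(t - 1, 0, -1):
        path.append(int(parent[k, path[-1]]))
    return path[::-1], delta[-1].max()

T = np.array([[0.1, 0.4, 0.5], [0.4, 0.0, 0.6], [0.0, 0.6, 0.4]])
M = np.array([[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6]])
pi = np.array([1.0, 0.0, 0.0])
print(viterbi(pi, T, M, [1, 2, 2]))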
Let us also compute Viterbi’s algorithm for a longer observation sequence.
Figure 2.2: A graph with costs assigned to every edge. Dijkstra’s algorithm finds
the shortest path in this graph between nodes A and B using dynamic programming.
most likely path. For example, we can write our joint probabilities as
Figure 2.3: A Trellis graph for a 3-state HMM for a sequence of three observations.
Disregard the subscript x0 .
P(Y1 , . . . , Yt ; λ)
(null, LA, LA, null, NY, null, NY, NY, NY, null, NY, NY, NY, NY, NY, null, null, LA, LA, NY).
γk(x) := P(Xk = x | Y1, ..., Yt),
the smoothing probability. We can also compute the most likely state trajectory he could have taken given our observations using decoding. Let us focus on the smoothing probabilities γk(x) as shown below.
The point-wise most likely sequence of states after doing so turns out to be
(LA, LA, LA, LA, NY, LA, NY, NY, NY, LA, NY, NY, NY, NY, NY, LA, LA, LA, LA, NY).
What is the denominator? It is simply the sum of the probabilities that the Markov chain was at state x at time 1, 2, ..., t − 1 given our observations, i.e.,
E[number of times the Markov chain was in state x] = Σ_{k=1}^{t−1} γk(x).
This gives us our new state transition matrix; you will see in the homework that it comes out to be
T′ = [ 0.47023  0.52976
       0.35260  0.64739 ].
This is a much better informed FBI than the one we had before beginning the problem, where the transition matrix was all 0.5s.
The new initial distribution What is the new initial distribution for the HMM? Recall that we are trying to compute the best HMM given the observations, so if the initial distribution was
π = P(X1),
the new one is
π′ = P(X1 | Y1, ..., Yt) = γ1(x).
You will see in your homework problem that this matrix comes out to be
M′ = [ 0.39024  0.20325   0.40650
       0.06779  0.706214  0.2259 ].
Notice how the observation probabilities for the unknown state y3 have gone down because the Markov chain does not have those states.
The ability to start with a rudimentary model of the HMM and update it using observations is quite revolutionary. Baum et al. proved the convergence of this iterative re-estimation procedure in the paper Baum, Leonard E., et al., “A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains,” The Annals of Mathematical Statistics 41.1 (1970): 164-171. Discuss the following questions:
• When do we stop in our iterated application of the Baum-Welch algorithm?
• Are we always guaranteed to find the same HMM irrespective of our initial HMM?
• If our initial HMM λ is the same, are we guaranteed to find the same HMM λ′ across two different iterations of the Baum-Welch algorithm?
• How many observations should we use to update the HMM?
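For concreteness, here is a compact, unscaled sketch of one Baum-Welch re-estimation step, assuming numpy. The pairwise smoothing probabilities ξk(i, j) used below are the standard quantities of the Baum-Welch algorithm even though they are not written out explicitly in this section; the matrices and observations in the usage are illustrative and will not reproduce the exact numbers above. For long sequences one would normalize α and β at every step to avoid numerical underflow.

import numpy as np

def baum_welch_step(pi, T, M, y):
    t, n = len(y), len(pi)
    alpha = np.zeros((t, n)); beta = np.ones((t, n))
    alpha[0] = pi * M[:, y[0]]
    for k in range(1, t):
        alpha[k] = (T.T @ alpha[k - 1]) * M[:, y[k]]
    for k in range(t - 2, -1, -1):
        beta[k] = T @ (beta[k + 1] * M[:, y[k + 1]])
    Z = alpha[-1].sum()                      # P(Y_1, ..., Y_t ; lambda)

    gamma = alpha * beta / Z                 # gamma_k(i) = P(X_k = i | Y_1..Y_t)
    # xi_k(i, j) = P(X_k = i, X_{k+1} = j | Y_1..Y_t)
    xi = np.zeros((t - 1, n, n))
    for k in range(t - 1):
        xi[k] = alpha[k][:, None] * T * (M[:, y[k + 1]] * beta[k + 1])[None, :] / Z

    # re-estimated HMM: expected transition / emission counts
    T_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    M_new = np.zeros_like(M)
    for j in range(M.shape[1]):
        M_new[:, j] = gamma[np.array(y) == j].sum(axis=0) / gamma.sum(axis=0)
    pi_new = gamma[0]
    return pi_new, T_new, M_new

pi = np.array([0.5, 0.5])
T = np.array([[0.5, 0.5], [0.5, 0.5]])       # an uninformed initial transition matrix
M = np.array([[0.4, 0.3, 0.3], [0.1, 0.6, 0.3]])
print(baum_welch_step(pi, T, M, [0, 1, 1, 2, 0]))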
Chapter 3
Reading
1. Barfoot, Chapter 3, 4 for Kalman filter
3.1 Background
Multi-variate random variables and linear algebra For d-dimensional random variables X, Y ∈ Rd we have
we will usually denote this by Σ ∈ Rd×d. Note that the covariance matrix is, by construction, symmetric and positive semi-definite. This means it can be factorized as
Σ = U Λ U⊤
where U ∈ Rd×d is an orthonormal matrix (i.e., U U⊤ = I) and Λ is a diagonal matrix with non-negative entries. The trace of a matrix is the sum of its diagonal entries. It is also equal to the sum of its eigenvalues, i.e.,
tr(Σ) = Σ_{i=1}^{d} Σii = Σ_{i=1}^{d} λi(Σ)
where λi(Σ) ≥ 0 is the i-th eigenvalue of the covariance matrix Σ. The trace is a measure of the uncertainty in the multi-variate random variable X; if X is a scalar and takes values in the reals then the covariance matrix is also, of course, a scalar Σ = σ².
A few more identities about the matrix trace that we will often use in this chapter are as follows.
• For matrices A, B we have
tr(AB) = tr(BA);
the two matrices need not be square themselves, only their product does.
• For A, B ∈ Rm×n,
tr(A⊤B) = tr(B⊤A) = Σ_{i=1}^{m} Σ_{j=1}^{n} Aij Bij.
which is simply expressing the fact that the probability density function integrates to 1.

with covariance
Cov(Z) = ΣZ = ΣX + ΣY + ΣXY + ΣYX
where
Rd×d ∋ ΣXY = E[(X − E[X])(Y − E[Y])⊤];

ΣZ = ΣX + ΣY.
X̂.
An estimator is any quantity that indicates our belief of what X is. The estimator is created on the basis of observations and we will therefore model it as a random variable. We would like the estimator to be unbiased, i.e.,
E[X̂] = X;
this expresses the concept that if we were to measure the state of the system many times, say using many sensors or multiple observations from the same sensor, the resultant estimator X̂ is correct on average. The error in our belief is
X̃ = X̂ − X.
The error is zero-mean, E[X̃] = 0, and its covariance ΣX̃ is called the covariance of the estimator.
Optimally combining two estimators Let us now imagine that we have two estimators X̂1 and X̂2 for the same true state X (conditionally independent observations from one true state). We will assume that the two estimators were created independently (say different sensors) and are therefore conditionally independent random variables given the true state X. Say both of them are unbiased but each of them has a certain covariance of the error,
ΣX̃1 and ΣX̃2.
We would like to combine the two to obtain a better estimate of what the state could be. Better can mean many different quantities depending upon the problem, but in general in this course we are interested in improving the error covariance. Our goal is then:
Given two estimators X̂1 and X̂2 of the true state X, combine them to obtain a new estimator.

k1 = σ2² / (σ1² + σ2²).
We set the derivative of Var(X̂) with respect to k1 to zero to get this. The final estimator is
X̂ = σ2²/(σ1² + σ2²) X̂1 + σ1²/(σ1² + σ2²) X̂2,   (3.2)
with variance
σX̃² = σ1² σ2² / (σ1² + σ2²).
Notice that since σ2²/(σ1² + σ2²) < 1, the variance of the new estimator is smaller than either of the original estimators. This is an important fact to remember: combining two estimators always results in a better estimator.
X̂ = K1 X̂1 + K2 X̂2
The covariance of X̂ is

Just like we minimized the variance in the scalar case, we will minimize the trace of this covariance matrix. We know that the original covariances Σ1 and Σ2 are symmetric. We will use the following identity for the partial derivative of a matrix product
∂/∂A tr(A B A⊤) = 2AB   (3.3)
for a symmetric matrix B. Minimizing tr(ΣX̃) with respect to K1 amounts to setting
∂/∂K1 tr(ΣX̃) = 0
which yields
0 = K1 Σ1 − (I − K1) Σ2
⇒ K1 = Σ2 (Σ1 + Σ2)−1 and K2 = Σ1 (Σ1 + Σ2)−1.
You should consider the similarities of this expression with the one for the scalar case in (3.2). The same broad comments hold, i.e., if one of the estimators has a very large variance, that estimator is weighted less in the combination.
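A quick numerical check of this combination rule, assuming numpy; the covariances below are arbitrary examples. The trace of the combined covariance is smaller than that of either estimator.

import numpy as np

S1 = np.array([[4.0, 0.0], [0.0, 1.0]])   # covariance of estimator X1_hat
S2 = np.array([[1.0, 0.0], [0.0, 4.0]])   # covariance of estimator X2_hat

K1 = S2 @ np.linalg.inv(S1 + S2)
K2 = S1 @ np.linalg.inv(S1 + S2)

x1_hat = np.array([1.0, 2.0])
x2_hat = np.array([1.5, 1.5])
x_hat = K1 @ x1_hat + K2 @ x2_hat

# covariance of the combined estimator
S = K1 @ S1 @ K1.T + K2 @ S2 @ K2.T
print(x_hat, np.trace(S), np.trace(S1), np.trace(S2))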
X̂ = K′ X̂′ + K Y.
We want this estimator to be unbiased,
E[X̂] = E[K′ X̂′ + K Y]
     = K′ X + K E[Y]
     = K′ X + K E[C X + ν]
     = K′ X + K C X
     = X,
to get that
I = K′ + K C
⇒ X̂ = (I − K C) X̂′ + K Y   (3.6)
     = X̂′ + K (Y − C X̂′).
This is a special form which you will do well to remember. The old estimator X̂′ gets an additive term K(Y − C X̂′). For reasons that will soon become clear, we call this term the
innovation = Y − C X̂′.
The covariance of X̂ is

Setting
0 = ∂/∂K tr(ΣX̃)
gives
0 = −2 (I − KC) ΣX̃′ C⊤ + 2 K Q
⇒ ΣX̃′ C⊤ = K (C ΣX̃′ C⊤ + Q)
⇒ K = ΣX̃′ C⊤ (C ΣX̃′ C⊤ + Q)−1.
The matrix K ∈ Rd×p is called the “Kalman gain” after Rudolf Kalman, who developed this method in the 1960s.
X̂ = X̂ ′ + K(Y − C X̂ ′ ).
ΣX̃^{-1} = ΣX̃′^{-1} + C⊤ Q−1 C,
K = ΣX̃ C⊤ Q−1,   (3.9)
X̂ = X̂′ + ΣX̃ C⊤ Q−1 (Y − C X̂′).
3.2.4 An example
Consider the scalar case when we have multiple measurements of some scalar quantity x ∈ R corrupted by noise,
yi = x + νi.

Cov(x̂k+1)^{-1} = Cov(x̂k)^{-1} + 1/σ²
⇒ Cov(x̂k+1) = σ²/(k + 1).
The updated mean, using (3.9) again, is
E[x̂k+1] = x̂k + Cov(x̂k+1) (1/σ²) (yk+1 − x̂k)
         = x̂k + (yk+1 − x̂k)/(k + 1).
You will notice that if the noise on the (k+1)-th observation is very small, even after k observations, the new estimate fixates on the latest observation:
σ → 0 ⇒ x̂k+1 → yk+1.
Similarly, if the latest observation is very noisy, the estimate does not change much:
σ → ∞ ⇒ x̂k+1 → x̂k.
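A tiny simulation of this recursive update, assuming numpy: averaging noisy measurements of a constant, the variance of the estimate shrinks like σ²/k.

import numpy as np

rng = np.random.default_rng(0)
x_true, sigma = 2.0, 0.5
x_hat, var = 0.0, np.inf                 # no information initially

for k in range(1, 101):
    y = x_true + sigma * rng.standard_normal()
    if np.isinf(var):
        x_hat, var = y, sigma**2         # first measurement
    else:
        var = 1.0 / (1.0 / var + 1.0 / sigma**2)
        x_hat = x_hat + (var / sigma**2) * (y - x_hat)

print(x_hat, var)   # x_hat close to 2.0, var close to sigma^2 / 100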
The true state X need not be static. We will next talk about models
for how the state of the world evolves using ideas in dynamical systems.
u1 → y1
u2 → y2
au1 + bu2 → ay1 + by2 .
m d²z(t)/dt² + c dz(t)/dt + k z(t) = u(t),
or m z̈ + c ż + k z = u. Define the states
z1(t) := z(t),
z2(t) := dz(t)/dt.
We can now rewrite the dynamics as
[ż1]   [  0      1  ] [z1]   [  0  ]
[ż2] = [ −k/m  −c/m ] [z2] + [ 1/m ] u.
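As a sketch, the state-space model above can be simulated with a simple forward-Euler discretization, assuming numpy; the values of m, c, k and the time step are arbitrary.

import numpy as np

m, c, k, dt = 1.0, 0.2, 1.0, 0.01
A = np.array([[0.0, 1.0], [-k / m, -c / m]])
B = np.array([0.0, 1.0 / m])

z = np.array([1.0, 0.0])         # initial position and velocity
traj = [z.copy()]
for t in range(1000):
    u = 0.0                      # no external force
    z = z + dt * (A @ z + B * u) # z_{t+1} = z_t + dt * (A z_t + B u_t)
    traj.append(z.copy())
print(traj[-1])                  # damped oscillation decaying toward zero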
The pair of equations (3.10) and (3.11) together are the so-called state-space model of an LTI system. The development for discrete-time systems is completely analogous; we will have

Such systems are called nonlinear systems. We will write them succinctly as
ẋ = f(x, u)
y = g(x, u).   (3.13)
The function f : X × U → X that maps the state-space and the input space to the state-space is called the dynamics of the system. Analogously, for discrete-time nonlinear systems we will have

Again, the discrete-time nonlinear dynamics has a different equation than the corresponding one in (3.13). (? Is the nonlinear spring-mass system time-invariant?)
2. We did not use the correct state-space X. You could write down the state of the car as given by (x, y, θ, ẋ, ẏ, θ̇) where x, y are the Euclidean co-ordinates of the car and θ is its orientation. This is not a good model for studying high-speed turns, which are affected by other quantities like wheel slip, the quality of the suspension etc. We may not even know the full state sometimes. This occurs when you are modeling how users interact with an online website like Amazon.com: you’d like to model the change in state of the user from “perusing stuff” to “looking at stuff to buy” to “buying it”, but there are certainly many other variables that affect the user’s behavior. As another example, consider the path that an airplane takes to go from Philadelphia to Los Angeles. This path is affected by the weather at all places along the route; it’d be cumbersome to incorporate the weather to find the shortest-time path for the airplane.
3. We did not use the correct control-space U for the controller. This is akin to the second point above. The gas pedal, which one may think of as the control input to a car, is only one out of the large number of variables that affect the running of the car’s engine.
write
xk+1 = f (xk , uk ) + ϵk (3.14)
where the “noise” ϵk is not under our control. The quantity ϵk is not arbitrary, however; we will assume that
yk = g(xk ) + νk . (3.16)
yk = Cxk + Duk + νk .
and stochastic dynamical systems with noisy observations are two different ways to think of the same concept, namely getting observations across time about the true state of a dynamic world.
In the former we have

HMMs are easy to use for certain kinds of problems, e.g., speech-to-text, or a robot wandering in a grid world (like the Bayes filter problem in HW 1). Dynamical systems are more useful for certain other kinds of problems, e.g., a Kuka manipulator where you can use Newton’s laws to simply write down the functions f, g. (You will agree that creating the state-transition matrix for the Bayes filter problem in HW 1 was really the hardest part of the problem. If the state-space were continuous and not a discrete cell-based world, you could have written the dynamics very easily in one line of code.)
Our goal is to compute the best estimate of the state after multiple observations,
P(xk | y1, ..., yk).
This is the same as the filtering problem that we solved for Hidden Markov Models. Just like we used the forward algorithm to compute the filtering estimate recursively, we are going to use our development of the Kalman gain to incorporate a new observation recursively.
The subscript in
x̂k+1|k
denotes that the quantity being talked about, i.e., x̂k+1|k, or others like μk+1|k, is of the (k + 1)-th timestep and was calculated on the basis of observations up to (and including) the k-th timestep. We will therefore devise recursive updates to obtain μk+1|k+1, Σk+1|k+1 using their old values μk|k, Σk|k. We will imagine that our initial estimate for the state x̂0|0 has a known distribution
P(xk+1 | y1, ..., yk).
We can also calculate the covariance of the estimate x̂k+1|k using our calculation in (3.1). (Observe that even if we knew the state dynamics precisely, i.e., if R = 0, we would still have a non-trivial propagation equation for Σk+1|k.)

3.5.3 Step 2: Incorporating the observation
After the dynamics propagation step, our estimate of the state is x̂k+1|k; this is the state of the system that we believe is true after k observations. We should now incorporate the latest observation yk+1 to update this estimate to get
P(xk+1 | y1, ..., yk, yk+1).
This is exactly the same problem that we saw in Section 3.2.3. Given the measurement
yk+1 = C xk+1 + νk+1,
we first compute the Kalman gain Kk+1 and the updated mean of the estimate as
Kk+1 = Σk+1|k C⊤ (C Σk+1|k C⊤ + Q)−1
μk+1|k+1 = μk+1|k + Kk+1 (yk+1 − C μk+1|k).   (3.20)
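The two steps can be put together in a few lines. Below is a minimal sketch of one Kalman filter iteration, assuming numpy; the matrices A, C, R, Q in the usage are placeholders for a generic linear system xk+1 = A xk + ϵk, yk = C xk + νk, and the covariance update Σk+1|k+1 = (I − Kk+1 C) Σk+1|k is the standard form.

import numpy as np

def kalman_step(mu, Sigma, y, A, C, R, Q):
    # Step 1: propagate the estimate through the dynamics
    mu_pred = A @ mu
    Sigma_pred = A @ Sigma @ A.T + R

    # Step 2: incorporate the observation, eq. (3.20)
    S = C @ Sigma_pred @ C.T + Q
    K = Sigma_pred @ C.T @ np.linalg.inv(S)
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_pred
    return mu_new, Sigma_new

# track a position-velocity state from noisy position measurements
A = np.array([[1.0, 0.1], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
R, Q = 0.01 * np.eye(2), np.array([[0.25]])
mu, Sigma = np.zeros(2), np.eye(2)
mu, Sigma = kalman_step(mu, Sigma, np.array([1.2]), A, C, R, Q)
print(mu, Sigma)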
3.5.4 Discussion
There are several important observations to make and remember about the Kalman Filter (KF).
x̂_{k|k}^{fancy filter} = some function(x̂0|0, y1, ..., yk)

x̂_{k|k}^{fancy filter} = x̂_{k|k}^{KF}.
• Assumptions that are implicit in the KF. We assumed that both the dynamics noise ϵk and the observation noise νk+1 are uncorrelated with the estimate x̂k+1|k computed prior to them (where did we use these assumptions?). This implicitly assumes that the dynamics and observation noise are “white”, i.e., uncorrelated in time:
E[ϵk ϵk′⊤] = 0 for all k ≠ k′,
E[νk νk′⊤] = 0 for all k ≠ k′.
We have a radar sensor that measures the distance of the plane r from the radar trans-receiver up to noise ν. We would like to measure its distance x and height h. If the plane travels with a constant velocity, we have
ẋ = v, and v̇ = 0,
and
r = √(x² + h²).
Since we do not really know how the plane might change its altitude, let’s assume that it maintains a constant altitude,
ḣ = 0.
The above equations are our model for how the state of the airplane evolves and could of course be wrong. As we discussed, we will model the discrepancy as noise.
[ẋ1]   [ 0 1 0 ] [x1]
[ẋ2] = [ 0 0 0 ] [x2] + ϵ;
[ẋ3]   [ 0 0 0 ] [x3]
r = √(x1² + x3²) + ν;
rlinearized = r(0, 0, 0) + ∂r/∂x1 |_{x1=0,x3=0} (x1 − 0) + ∂r/∂x3 |_{x1=0,x3=0} (x3 − 0)
            = 0 + [ 2x1 / (2√(x1² + x3²)) ]_{x1=0,x3=0} x1 + [ 2x3 / (2√(x1² + x3²)) ]_{x1=0,x3=0} x3
            = x1 + x3.
(? You can try to perform a similar linearization for a simple model of a car,
ẋ = cos θ, ẏ = sin θ, θ̇ = u,
where x, y, θ are the XY-coordinates and the angle of the steering wheel respectively. This model is known as a Dubins car.)
In other words, up to first order in x1, x3, the observations are linear and we can therefore run the KF for computing the state estimate after k observations.
3.6.1 Propagation of statistics through a nonlinear transformation
Given a Gaussian random variable Rd ∋ x ∼ N(μx, Σx), we saw how to compute the mean and covariance after an affine transformation y = Ax. Consider now a nonlinear transformation
Rp ∋ y = f(x),
which we approximate as
y = f(x) ≈ f(μx) + df/dx |_{x=μx} (x − μx)
  = J x + (f(μx) − J μx).
This gives

A simple example Consider
y = [y1] = f([x1, x2, x3]⊤) = [ x1² + x2 x3      ]
    [y2]                      [ sin x2 + cos x3 ].
We have
df/dx = ∇f(x) = [ 2x1   x3       x2      ]
                [ 0     cos x2   −sin x3 ].
The Jacobian at μx = [μx1, μx2, μx3]⊤ is
J = ∇f(x) |_{x=μx} = [ 2μx1   μx3       μx2      ]
                     [ 0      cos μx2   −sin μx3 ].
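A numerical sanity check of this linearized propagation for the example above, assuming numpy: the Jacobian-based covariance J Σx J⊤ is compared against a Monte Carlo estimate (the mean μx and covariance Σx used here are arbitrary small values).

import numpy as np

def f(x):
    return np.array([x[0]**2 + x[1] * x[2], np.sin(x[1]) + np.cos(x[2])])

def jacobian(x):
    return np.array([[2 * x[0], x[2], x[1]],
                     [0.0, np.cos(x[1]), -np.sin(x[2])]])

mu_x = np.array([1.0, 0.5, -0.5])
Sigma_x = 0.01 * np.eye(3)

J = jacobian(mu_x)
mu_y = f(mu_x)                       # linearized mean
Sigma_y = J @ Sigma_x @ J.T          # linearized covariance

samples = np.random.default_rng(0).multivariate_normal(mu_x, Sigma_x, 50000)
ys = np.array([f(s) for s in samples])
print(Sigma_y)
print(np.cov(ys.T))                  # close to Sigma_y for small Sigma_x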
xk+1 = f(xk, uk) + ϵ
yk = g(xk) + ν.

xk+1 = f(xk, uk) + ϵ
     ≈ f(μk|k, uk) + ∂f/∂x |_{x=μk|k} (xk − μk|k) + ϵk.
The mean and covariance of the EKF after the dynamics propagation step are therefore given by
μk+1|k = f(μk|k, uk)
Σk+1|k = A Σk|k A⊤ + R.   (3.26)
yk+1 = g(xk+1) + ν
     ≈ g(μk+1|k) + dg/dx |_{x=μk+1|k} (xk+1 − μk+1|k) + ν
where
C(μk+1|k) = ∂g/∂x |_{x=μk+1|k}.   (3.27)
Our fake observation is a nice linear function of the state xk+1 and we can therefore use the Kalman Filter equations to incorporate this fake observation:
μk+1|k+1 = μk+1|k + K (y′k+1 − C μk+1|k)
where K = Σk+1|k C⊤ (C Σk+1|k C⊤ + Q)−1, and
μk+1|k = f(μk|k, uk)
Σk+1|k = A Σk|k A⊤ + R.
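Here is a sketch of one EKF iteration, assuming numpy; f, g and their Jacobians df_dx, dg_dx are user-supplied placeholder functions, and the usage below is an illustrative toy system, not the airplane example.

import numpy as np

def ekf_step(mu, Sigma, u, y, f, df_dx, g, dg_dx, R, Q):
    # propagation, eq. (3.26): linearize the dynamics at the current mean
    A = df_dx(mu, u)
    mu_pred = f(mu, u)
    Sigma_pred = A @ Sigma @ A.T + R

    # update: linearize the observation at the predicted mean, eq. (3.27)
    C = dg_dx(mu_pred)
    S = C @ Sigma_pred @ C.T + Q
    K = Sigma_pred @ C.T @ np.linalg.inv(S)
    mu_new = mu_pred + K @ (y - g(mu_pred))
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_pred
    return mu_new, Sigma_new

f = lambda x, u: np.array([x[0] + 0.1 * x[1], x[1] + 0.1 * u])
df_dx = lambda x, u: np.array([[1.0, 0.1], [0.0, 1.0]])
g = lambda x: np.array([np.sqrt(x[0]**2 + 1.0)])            # nonlinear range-like observation
dg_dx = lambda x: np.array([[x[0] / np.sqrt(x[0]**2 + 1.0), 0.0]])
mu, Sigma = np.zeros(2), np.eye(2)
mu, Sigma = ekf_step(mu, Sigma, 1.0, np.array([1.4]), f, df_dx, g, dg_dx,
                     0.01 * np.eye(2), np.array([[0.1]]))
print(mu, Sigma)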
Discussion
1. The EKF dramatically expands the applicability of the Kalman Filter. It can be used for most real systems, even with very complex models f, g. It is very commonly used in robotics and can handle nonlinear observations from complex sensors such as a LiDAR and camera easily. For instance, sophisticated augmented/virtual reality systems like Google ARCore/Snapchat/iPhone etc. (https://www.youtube.com/watch?v=cape_Af9j7w) run an EKF to track the motion of the phone or of the objects in the image.
1. Sample a few points from the Gaussian N(μk|k, Σk|k) (red points in top right).
Σ = S S.
Using the eigen-decomposition
Σ = V D V−1 = V [ d11 ... 0
                  ...
                  0 ... dnn ] V−1,
we can take
S = V [ √d11 ... 0
        ...
        0 ... √dnn ] V−1.
Notice that
S S = (V D^{1/2} V−1)(V D^{1/2} V−1) = V D V−1 = Σ.
We can also define the matrix square root using the Cholesky decomposition Σ = L L⊤, which is numerically more stable than computing the square root using the above expression. Recall that the matrices S and Σ have the same eigenvectors. Typical applications of the Unscented Transform will use this method.
w(i) = 1/(2n).   (3.30)
We then transform each sigma point to get the transformed sigma points
y(i) = f(x(i)).
The mean and covariance of the transformed random variable y can now be computed as
μy = Σ_{i=1}^{2n} w(i) y(i)
Σy = Σ_{i=1}^{2n} w(i) (y(i) − μy)(y(i) − μy)⊤.   (3.31)
Example Say we have x = [r; θ] with μx = [1, π/2] and Σx = [ σr² 0; 0 σθ² ]. We would like to compute the probability distribution of y = f(x) = [r cos θ; r sin θ], which is a polar transformation. (? Compute the mean and covariance of y by linearizing the function f(x).) Since x is two-dimensional, we will have 4 sigma points with equal weights w(i) = 0.25. The square root in the sigma point expression is
√(n Σ) = [ √2 σr   0
           0       √2 σθ ].
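A sketch of the unscented transform for this polar example, assuming numpy; σr and σθ are arbitrary values, and the matrix square root is computed with a Cholesky factorization as suggested above.

import numpy as np

mu_x = np.array([1.0, np.pi / 2])
Sigma_x = np.diag([0.02**2, 0.3**2])          # [sigma_r^2, sigma_theta^2]
f = lambda x: np.array([x[0] * np.cos(x[1]), x[0] * np.sin(x[1])])

n = len(mu_x)
S = np.linalg.cholesky(n * Sigma_x)           # matrix square root of n * Sigma_x
# 2n sigma points: mu +/- columns of S, each with weight 1 / (2n)
sigma_points = np.concatenate([mu_x + S.T, mu_x - S.T])
w = np.full(2 * n, 1.0 / (2 * n))

y = np.array([f(p) for p in sigma_points])    # transformed sigma points
mu_y = w @ y
Sigma_y = sum(w[i] * np.outer(y[i] - mu_y, y[i] - mu_y) for i in range(2 * n))
print(mu_y, Sigma_y)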
Figure 3.3: Note that the true mean is being predicted very well by the UT and is
clearly a better estimate than the linearized mean.
μk+1|k := Σ_{i=1}^{2n} w(i) f(x(i), uk)
Σk+1|k := R + Σ_{i=1}^{2n} w(i) (f(x(i), uk) − μk+1|k)(f(x(i), uk) − μk+1|k)⊤   (3.32)
and covariances
Σyy := Q + Σ_{i=1}^{2n} w(i) (g(x(i)) − ŷ)(g(x(i)) − ŷ)⊤
Σxy := Σ_{i=1}^{2n} w(i) (x(i) − μk+1|k)(g(x(i)) − ŷ)⊤.   (3.34)
Step 2.2: Computing the Kalman gain Until now we have written the Kalman gain using the measurement matrix C. We will now discuss a more abstract formulation that gives the same expression.
Say we have a random variable x with known μx, Σx and get a new observation y. We saw how to incorporate this new observation to obtain a better estimator for x in Section 3.2.3. We will go through a similar analysis as before but in a slightly different fashion, one that does not involve the matrix C. Let
z = [x; y]
with μz = [μx; μy] and
Σz = [ Σxx  Σxy
       Σyx  Σyy ].
This is called the least squares problem, which you have seen before, perhaps in slightly different notation. You can solve this problem to see that the best gain K is given by
K* = Σxy Σyy−1.   (3.35)
The nice thing about the Kalman gain in (3.35) is that we can compute it now using the expressions for Σxy and Σyy in terms of the sigma points. This goes as follows:
K* = Σxy Σyy−1,
Σk+1|k+1 = Σk+1|k − K* Σyy K*⊤.
Summary of UKF
2. The UKF uses the UT and its sigma points for propagation
of uncertainty through the dynamics (3.32) and observation
nonlinearities (3.36).
P(xk | y1 , . . . , yk )
as a Gaussian.
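As a sketch, the UKF measurement update can be written directly in terms of the sigma-point covariances and the gain in (3.35), assuming numpy; g, Q and the predicted mean/covariance below are illustrative placeholders, and the covariance update Σ − K Σyy K⊤ follows the expression above.

import numpy as np

def ukf_update(mu_pred, Sigma_pred, y_obs, g, Q):
    n = len(mu_pred)
    S = np.linalg.cholesky(n * Sigma_pred)
    pts = np.concatenate([mu_pred + S.T, mu_pred - S.T])   # 2n sigma points
    w = np.full(2 * n, 1.0 / (2 * n))

    gy = np.array([g(p) for p in pts])                     # predicted observations
    y_hat = w @ gy
    Syy = Q + sum(w[i] * np.outer(gy[i] - y_hat, gy[i] - y_hat) for i in range(2 * n))
    Sxy = sum(w[i] * np.outer(pts[i] - mu_pred, gy[i] - y_hat) for i in range(2 * n))

    K = Sxy @ np.linalg.inv(Syy)                           # eq. (3.35)
    mu_new = mu_pred + K @ (y_obs - y_hat)
    Sigma_new = Sigma_pred - K @ Syy @ K.T
    return mu_new, Sigma_new

g = lambda x: np.array([np.sqrt(x[0]**2 + x[1]**2)])       # range observation
mu, Sigma = ukf_update(np.array([1.0, 1.0]), 0.1 * np.eye(2),
                       np.array([1.5]), g, np.array([[0.01]]))
print(mu, Sigma)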
Say all weights are equal to 1/n. Depending upon how we pick the samples x(i), we can get very different approximations.
Figure 3.4: Black lines denote particles x(i) , while red and blue curves denote the
approximations obtained using them. If there are a large number of particles in a
given region, the approximated probability density of that region is higher.
For i = 1, . . . , n,
x(i) ∼ q
p(x(i) )
w(i) = .
q(x(i) )
The original distribution p(x) is called the “target” and our chosen
distribution q(x) is called the “proposal”. If the number of particles
n is large, we can expect a better approximation of the target density
p(x).
Figure 3.5: An example run of a particle filter. The robot is shown by the green
dot in the top right. Observations from a laser sensor (blue rays) attached to the
robot measure its distance in a 360-degree field of view around it. Red dots are
particles, i.e., possible locations of the robot that we need in order to compute
the filtering density P(xk | y1 , . . . , yk ). You should think of this picture as being
similar to Problem 1 in Homework 1 where the robot was traveling on a grid. Just
like the filtering density in Problem 1 was essentially zero in some parts of the
domain, the particles, say in the bottom left, will have essentially zero weights in
a particle filter once we incorporate multiple observations from the robot in top
right. Instead of having to carry around these null particles with small weights,
the resampling step is used to remove them and sample more particles, say in
the top right, where we can benefit from a more accurate approximation of the
filtering density.
The resampling step takes particles {w(i), x(i)}_{i=1}^{n}, which approximate a probability density p(x), and returns a new set of particles x′(i) with equal weights w′(i) = 1/n that approximate the same probability density
p(x) = (1/n) Σ_{i=1}^{n} δ_{x′(i)}(x).
Consider the weights of particles w(i) arranged in a roulette wheel as shown above. We perform the following procedure: we start at some location, say θ = 0, and move along the wheel in random increments of the angle. After each random increment, we add the corresponding particle into our set {x′(i)}. Since particles with higher weights take up a larger angle in the circle, this procedure will often pick those particles and quickly move across particles with small weights without picking them too often. We perform this procedure n times for n particles. As an algorithm:
1. Initialize r to a uniform random number in [0, 1/n), and set c = w(1) and i = 1.
2. For each m = 1, ..., n, let u = r + (m − 1)/n. Increment i ← i + 1 and c ← c + w(i) while u > c, and set the new particle location x′(m) = x(i).
It is important to notice that the resampling procedure does not actually change the locations of particles. Particles with weights much lower than 1/n will be eliminated while particles with weights much higher than 1/n will be “cloned” into multiple particles, each of weight 1/n. (There are many other methods of resampling. We have discussed here something known as “low variance resampling”, which is easy to remember and code up. Fancier resampling methods also change the locations of the particles. The goal remains the same, namely to eliminate particles with low weights.)
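A sketch of the low-variance resampling procedure above, assuming numpy; the particles and weights are illustrative.

import numpy as np

def low_variance_resample(particles, weights, rng):
    n = len(weights)
    r = rng.uniform(0.0, 1.0 / n)        # single random offset
    c, i = weights[0], 0
    new_particles = np.empty_like(particles)
    for m in range(n):
        u = r + m / n                    # equally spaced pointers on the wheel
        while u > c:                     # advance to the particle covering u
            i += 1
            c += weights[i]
        new_particles[m] = particles[i]
    return new_particles                 # all new particles get weight 1/n

rng = np.random.default_rng(0)
particles = np.array([0.0, 1.0, 2.0, 3.0])
weights = np.array([0.1, 0.1, 0.7, 0.1])
print(low_variance_resample(particles, weights, rng))  # mostly copies of 2.0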
P(xk | y1, ..., yk) = (1/n) Σ_{i=1}^{n} δ_{x(i)k|k}(x),
all with equal weights w(i)k|k = 1/n.
Note that P(yk+1 | x(i)k+1|k) is a Gaussian and depends upon the Gaussian observation noise νk. The mean of this Gaussian is g(x(i)k+1|k) and its variance is equal to Q, i.e.,
P(yk+1 | x(i)k+1|k) = P(νk+1 ≡ yk+1 − g(x(i)k+1|k))
                   = 1/√((2π)^p det(Q)) exp(−νk+1⊤ Q−1 νk+1 / 2).
Normalize the weights w(i)k+1|k+1 to sum up to 1.
xk+1 = f(xk, uk) + ϵk

P(xk+1 = x | y1, ..., yk) = (1/n) Σ_{i=1}^{n} P(xk+1 = x | xk = x(i)k|k, u = uk),
so that
wk+1|k(x) = (1/n) Σ_{i=1}^{n} P(xk+1 = x | xk = x(i)k|k, u = uk).   (3.37)
Let us now think about what particles we should pick for xk+1|k. We have from (3.37) a function that lets us compute the correct weight for any particle we may choose to approximate xk+1|k.
Say we keep the particle locations unchanged, i.e., x(i)k+1|k = x(i)k|k.
We then have
P(xk+1 = x | y1, ..., yk) ≈ Σ_{i=1}^{n} wk+1|k(x(i)k|k) δ_{x(i)k|k}(x).   (3.38)
(Draw a picture of how this approximation looks.)
You will notice that keeping the particle locations unchanged may be a very poor approximation. After all, the probability density P(xk+1 | y1, ..., yk) is large not at the particles x(i)k|k (that were a good approximation of xk|k), but rather at the transformed locations of these particles,
f(x(i)k|k, uk).
Since we have particles x(i)k+1|k with weights w(i)k+1|k for the proposal distribution obtained from the propagation step, we would now like to update them to incorporate the latest observation yk+1. Let us imagine for a moment that the weights w(i)k+1|k are uniform. We would then set weights
w(x) = P(xk+1 = x | y1, ..., yk, yk+1) / P(xk+1 = x | y1, ..., yk)
     ∝ P(yk+1 | xk+1 = x) P(xk+1 = x | y1, ..., yk) / P(xk+1 = x | y1, ..., yk)   (by Bayes rule)
     = P(yk+1 | xk+1 = x)
for each particle x = x(i)k+1|k to get the approximated distribution as
P(xk+1 = x | y1, ..., yk+1) ≈ Σ_{i=1}^{n} P(yk+1 | x(i)k+1|k) w(i)k+1|k δ_{x(i)k+1|k}(x).   (3.41)
You will notice that the right hand side is not normalized and the distribution does not integrate to 1 (why? because we did not write the proportionality constant in the Bayes rule above). This is easily fixed by normalizing the coefficients P(yk+1 | x(i)k+1|k) w(i)k+1|k to sum to 1 as follows:
w(i)k+1|k+1 := P(yk+1 | x(i)k+1|k) w(i)k+1|k / Σj P(yk+1 | x(j)k+1|k) w(j)k+1|k.
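Putting the propagation, weighting and normalization together, here is a sketch of one particle filter iteration, assuming numpy; f, g and the noise levels are placeholders for the dynamics and observation models, and resampling (the earlier sketch) would be run when the weights degenerate.

import numpy as np

def particle_filter_step(particles, weights, u, y, f, g, R_std, Q_std, rng):
    # propagate each particle through the noisy dynamics
    particles = np.array([f(x, u) + R_std * rng.standard_normal(x.shape)
                          for x in particles])
    # re-weight with the observation likelihood P(y | x), a Gaussian with
    # mean g(x) and standard deviation Q_std
    lik = np.array([np.exp(-0.5 * np.sum((y - g(x))**2) / Q_std**2)
                    for x in particles])
    weights = weights * lik
    weights = weights / weights.sum()
    return particles, weights

rng = np.random.default_rng(0)
f = lambda x, u: x + 0.1 * u
g = lambda x: x
particles = rng.standard_normal((100, 1))
weights = np.full(100, 1.0 / 100)
particles, weights = particle_filter_step(particles, weights,
                                           np.array([1.0]), np.array([0.2]),
                                           f, g, 0.05, 0.1, rng)
print((weights[:, None] * particles).sum(axis=0))   # posterior mean estimate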
3.9 Discussion
This brings our study of filtering to a close. We have looked at some of the most important algorithms for a variety of dynamical systems, both linear and nonlinear. Although we focused on filtering in this chapter, all these algorithms have their corresponding “smoothing” variants; e.g., you can read about how a typical Kalman smoother is implemented at https://en.wikipedia.org/wiki/Kalman_filter#Fixed-lag_smoother. Filtering, and state estimation, is a very wide area of research even today and you will find variants of these algorithms in almost every device which senses the environment.
Chapter 4
Reading
1. LaValle Chapter 3.2 for rotation matrices, Chapter 4.1-4.2 for
quaternions
A = {(x, y) ∈ R² : x² + y² ≤ 1}.
This set A changes as the robot moves around, e.g., if the center of mass of the robot is translated by xt, yt ∈ R the set A changes to
A′ = {(x + xt, y + yt) : (x, y) ∈ A}.

As the above figure shows, there are two ways of thinking about this transformation. We can either think of the robot transforming while the co-ordinate frame of the world is fixed, or we can think of it as the robot remaining stationary and the co-ordinate frame undergoing a translation. The second style is useful if you want to imagine things from the robot’s perspective. But the first one feels much more natural and we will therefore exclusively use the first notion.
If the same robot were rotated counterclockwise by some angle θ ∈ [0, 2π], we would map
to get
[ x cos θ − y sin θ ]          [x]
[ x sin θ + y cos θ ] = R(θ) [y].
The transformed robot is thus given by
A′ = { R(θ) [x; y] : (x, y) ∈ A }.
R(0), R(−θ) ∈ G.
This implies that the square of the determinant of any element a ∈ O(d) is 1, i.e., det(a) = ±1. (? Check that any rotation matrix R belongs to an orthogonal group.)
The Special Orthogonal Group is a sub-group of the orthogonal group where the determinant of each element is +1. You can see that rotations form a special orthogonal group. We denote rotations of objects in R² as
4.1.1 3D transformations
Translations and rotations in 3D are conceptually similar to the two-dimensional case; however the details appear a bit more difficult because rotations in 3D are more complicated.

The angles (γ, β, α) (in order: roll, pitch, yaw) are called Euler angles. Imagine how the body frame of the robot changes as successive rotations are applied. If you were sitting in a car, a pure yaw would be similar to the car turning left; the Z-axis corresponding to this yaw would however only be pointing straight up perpendicular to the ground if you had not performed a roll/pitch before. If you had done so, the Z-axis of the body frame with respect to the world will be tilted.
Another important thing to note is that one single parameter determines all possible rotations about one axis, i.e., SO(2). But three Euler angles are used to parameterize general rotations in three dimensions. You can watch https://www.youtube.com/watch?v=3Zjf95Jw2UE to get more intuition about Euler angles.
we set
α = tan−1(r21 / r11),
β = tan−1(−r31 / √(r32² + r33²)).   (4.8)
which we can equivalently denote as a matrix-vector multiplication a × b = â b where
â = [  0   −a3   a2
      a3    0   −a1
     −a2   a1    0 ]   (4.10)
is a skew-symmetric matrix. The solution of the differential equation (4.9) at time t = θ is
r(θ) = exp(ω̂ θ) r(0)
where the matrix exponential of a matrix A is defined as
exp(A) = I + A + A²/2! + A³/3! + ... = Σ_{k=0}^{∞} A^k / k!.
This is an interesting observation: a rotation about a fixed axis ω by an angle θ can be represented by the matrix
R = exp(ω̂ θ).   (4.11)
You can check that this matrix is indeed a rotation by showing that R⊤R = I and that det(R) = +1. We can expand the matrix exponential and collect odd and even powers of ω̂ to get
R = I + sin θ ω̂ + (1 − cos θ) ω̂²,   (4.12)
which is the Rodrigues’ formula that relates the angle θ and the axis ω to the rotation matrix. We can also go in the opposite direction, i.e., given a matrix R calculate what angle θ and axis ω it corresponds to using
cos θ = (tr(R) − 1)/2,
ω̂ = (R − R⊤)/(2 sin θ).   (4.13)
Note that both the above formulae make sense only for θ ≠ 0.
Groups such as SO(2) and SO(3) are topological spaces (viewed as a subset of R^{n²}) and operations such as multiplication and inverses are continuous functions on these groups. These groups are also smooth manifolds (a manifold is a generalization of a curved surface) and that is why they are called Lie groups (after Sophus Lie). Associated to each Lie group is a Lie algebra, which is the tangent space of the manifold at the identity. The Lie algebra of SO(3) is denoted by so(3) and likewise we have so(2). In a sense, the Lie algebra achieves a “linearization” of the Lie group, and the exponential map un-linearizes it, i.e., it takes objects in the Lie algebra to objects in the Lie group:
exp : so(3) → SO(3).
What we have written in (4.11) is really just this map:
so(n) ∋ ω ≡ ω̂ θ,   SO(n) ∋ R = exp(ω) = exp(ω̂ θ).
Therefore, if an object whose frame had a rotation matrix R with respect to the origin were rotating with an angular velocity ω (remember that angular velocity is a vector whose magnitude is the rate of rotation and whose direction is the axis about which the object is rotating), then the rate of change of R would be given by
Ṙ = ω̂ R.
If we were to implement a Kalman filter whose state is the rotation matrix R, then this would be the dynamics equation, and one would typically have an observation for the velocity ω using a gyroscope.
4.2 Quaternions
We know two ways to think about rotations: we can either think in terms of the three Euler angles (γ, β, α), or we can consider a rotation matrix R ∈ R3×3. We also know ways to go back and forth between these two forms, with the caveat that solving for Euler angles using (4.8) may be degenerate in some cases. While rotation matrices are the most general representation of rotations, using them in computer code is cumbersome (it is, after all, a matrix of 9 elements). So while we can build an EKF where the state is a rotation matrix, it would be a bit more expensive to run. We can also implement the same filter using Euler angles but doing so will require special care due to the degeneracies.
(Quaternions were invented by the British mathematician William Rowan Hamilton while walking along a bridge with his wife. He was quite excited by this discovery and promptly graffitied the expression into the stone of the bridge.)
Figure 4.2: Any rotation in 3D can be represented using a unit vector ω and an angle θ ∈ R. Notice that there are two ways to encode the same rotation: the unit vector −ω and angle 2π − θ would give the same rotation. Mathematicians say this as quaternions being a double-cover of SO(3). (As you see in the adjoining figure, quaternions also have degeneracies but they are rather easy ones.)
q = (cos(θ/2), ω sin(θ/2)) .
q −1 := (cos(θ/2), −ω sin(θ/2)) .
The inverse quaternion is therefore the quaternion where all entries except the first have their signs flipped.
matrices
R(q1 q2) = R(q1) R(q2).

u0 = (1/2) √(r11 + r22 + r33 + 1)
if u0 ≠ 0:  u1 = (r32 − r23)/(4 u0),
            u2 = (r13 − r31)/(4 u0),
            u3 = (r21 − r12)/(4 u0);   (4.19)
if u0 = 0:  u1 = r12 r13 / √(r12² r13² + r12² r23² + r13² r23²),
            u2 = r12 r23 / √(r12² r13² + r12² r23² + r13² r23²),
            u3 = r13 r23 / √(r12² r13² + r12² r23² + r13² r23²).
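A small sketch relating the axis-angle, quaternion and rotation-matrix representations, assuming numpy: it builds R from Rodrigues’ formula (4.12) and the same rotation from the unit quaternion q = (cos(θ/2), ω sin(θ/2)) using the standard quaternion-to-rotation-matrix formula (which is not written out in this section), and checks that they agree.

import numpy as np

def hat(w):
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def rodrigues(w, theta):
    W = hat(w)                            # w must be a unit vector
    return np.eye(3) + np.sin(theta) * W + (1 - np.cos(theta)) * W @ W

def quat_to_rotation(q):
    # standard formula for a unit quaternion (q0 is the scalar part)
    q0, q1, q2, q3 = q
    return np.array([
        [1 - 2*(q2**2 + q3**2), 2*(q1*q2 - q0*q3),     2*(q1*q3 + q0*q2)],
        [2*(q1*q2 + q0*q3),     1 - 2*(q1**2 + q3**2), 2*(q2*q3 - q0*q1)],
        [2*(q1*q3 - q0*q2),     2*(q2*q3 + q0*q1),     1 - 2*(q1**2 + q2**2)]])

w = np.array([0.0, 0.0, 1.0])             # rotation about the Z-axis
theta = np.pi / 3
q = np.concatenate([[np.cos(theta / 2)], w * np.sin(theta / 2)])

R1, R2 = rodrigues(w, theta), quat_to_rotation(q)
print(np.allclose(R1, R2), np.allclose(R1.T @ R1, np.eye(3)))   # True True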
The main point to understand about the feature map is that we can hand over this map to another robot that comes to the same house. The robot compares images from its camera and if it finds one of the objects inside the map, it can get an estimate of its location/orientation in the room with respect to the known location of the object in the map. The map is just a set of “features” that help identify salient objects in the room (objects which can be easily detected in images and relatively uniquely determine the location inside the room). The second robot using this map to estimate its position/orientation in the room is called the localization problem. We already know how to solve the localization problem using filtering.
The first robot was solving a harder problem called Simultaneous Localization And Mapping (SLAM): namely, that of discovering the location of both itself and the objects in the house. This is a very important and challenging problem in robotics but we will not discuss it further. MEAM 620 digs deeper into it.
Grid maps We will first discuss two-dimensional grid maps; they look as follows.
Figure 4.3: A grid map (also called an occupancy grid) is a large gray-scale image,
each pixel represents a cell in the physical world. In this picture, cells that are
occupied are colored black and empty cells represent free space. A grid map is a
useful representation for a robot to localize in this house using observations from
its sensors and comparing those to the map.
To get a quick idea of what we want to do, you can watch the mapping being performed in https://www.youtube.com/watch?v=JJhEkIA1xSE. We are interested in learning such maps from the observations that a robot collects as it moves around the physical space. Let us make two simplifying assumptions.

This is neat: we can now model each cell as a binary random variable that indicates occupancy. Let the probability that the cell mi is occupied be p(mi).

there are moving people inside the room. We will see a clever hack where the Bayes rule helps automatically disregard such moving objects in this section.
This means that if the cells in the map are denoted by a vector m = (m1, ..., ), then the probability of the cells being occupied/not-occupied can be written as
p(m) = Πi p(mi).   (4.20)
y1:k = (y1 , y2 , . . . , yk );
The figure above shows a typical sonar sensor (the two “eyes”) on a low-cost robot. Data from the sensor is shown on the right; a sonar is a very low resolution sensor and has a wide field of view, say 15 degrees, i.e., it cannot differentiate between objects that are within 15 degrees of each other and registers them as the same point. Sophisticated sonar technology is used today in marine environments (submarines, fish finders, detecting mines etc.).
Radar works in much the same way as a sonar except that it uses pulses of radio waves and measures the phase difference between the transmitted and the received signal. This is a very versatile sensor (it was invented by the US army to track planes and missiles during World War II) but is typically noisy and requires sophisticated processing to be used for mainstream robotics. Autonomous cars, collision warning systems on human-driven cars, weather sensing, and certainly the military use the radar today. The following picture and the video https://www.youtube.com/watch?v=hwKUcu_7F9E will give you an appreciation of the kind of data that a radar records. Radar is a very long range sensor (typically 150 m) and works primarily to detect metallic objects.
Figure 4.4: Model for sonar data. (Top) A sonar gives one real-valued reading corresponding to the distance measured along the red axis. (Bottom) If we travel along the optical axis, the occupancy probability P(mi | yk = z, xk) can be modeled as a spike around the measured value z. It is very important to remember that range sensors such as sonar give us three kinds of information about this ray: (i) all parts of the environment up to ≈ z are unoccupied (otherwise we would not have recorded z), (ii) there is some object at z which resulted in the return, (iii) but we do not know anything about what is behind z. So incorporating a measurement yk from a sonar/radar/lidar involves not just updating the cell which corresponds to the return, but also updating the occupancy probabilities of every grid cell along the axis.
Figure 4.5: (Left) A typical occupancy grid created using a sonar sensor by updating the log-odds-ratio l(mi | x1:k, y1:k) for all cells i over multiple time-steps k. At the end of the map building process, if l(mi | x1:k, y1:k) > 0 for a particular cell, we set its occupancy to 1 and to zero otherwise, to get the maximum-likelihood estimate of the occupancy grid on the right.
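For concreteness, here is a sketch of a log-odds occupancy update along a single sensor ray, assuming numpy; the inverse sensor model probabilities (0.7 at the return, 0.3 for the cells in front of it) are illustrative values, not the ones of the formula referenced above as (4.24).

import numpy as np

def logodds(p):
    return np.log(p / (1 - p))

def update_ray(l, cells_before, cell_hit, prior=0.5):
    """Update log-odds l (one entry per grid cell) for a single range return."""
    l0 = logodds(prior)
    for i in cells_before:                       # cells traversed by the ray: likely free
        l[i] += logodds(0.3) - l0
    l[cell_hit] += logodds(0.7) - l0             # cell at the measured range: likely occupied
    return l

l = np.zeros(10)                                 # log-odds 0 means p = 0.5
for _ in range(5):                               # five identical measurements
    l = update_ray(l, cells_before=range(7), cell_hit=7)
occupancy = (l > 0).astype(int)                  # maximum-likelihood map as in Fig. 4.5
print(np.round(1 / (1 + np.exp(-l)), 2), occupancy)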
We could simply create cells in 3D space and our method for the occupancy grid would work, but this is no longer computationally cheap. For instance, if we want to build a map of Levine Hall (say a 100 m × 100 m area and a height of 25 m), a 3D grid map with a resolution of 5 cm × 5 cm × 5 cm would have about 2 billion cells (if we store a float in each cell this map will require about 8 GB of memory). It would be cumbersome to carry around so many cells and update their probabilities after each sensor reading (a Velodyne gives data at about 30 Hz). More importantly, observe that most of the volume inside Levine is free space (inside of offices, inner courtyard etc.) so we do not really need fine resolution in those regions.
node is a grid cell; notice how different cells in the octree have different resolutions. Occupancy probabilities of each leaf node are updated using the same formula as that of (4.24). A key point here is that octrees are designed for accurate sensors (LiDARs) where there is not much noise in the observations returned by the sensor (and thereby we do not refine unnecessary parts of the space).
Octrees are very efficient at storing large maps; I expect you can store the entire campus of Penn in about a gigabyte. Ray tracing (following all the cells mi in the tree along the axis of the sensor in Figure 4.4) is harder in this case but there are efficient algorithms devised for this purpose. An example OctoMap (an occupancy map created using an Octree) of a building on the campus of the University of Freiburg is shown below. (You can find LiDAR maps of the entire United States, taken from a plane, at https://www.usgs.gov/core-science-systems/ngp/3dep.)
1 constructed in the body frame and evolves as the robot moves around
2 (objects appear in the front of the robot and are spawned in the local map
3 and disappear from the map at the back as the robot moves forward).
You should think of the map (and especially the local map) as the
filtering estimate of the locations of various objects in the vicinity of
the robot computed on the basis of multiple observations received
from the robot’s sensors.
Figure 4.6: The output of perception modules for a typical autonomous vehicle (taken from https://www.youtube.com/watch?v=tiwVMrTLUWg). The global occupancy grid is shown in gray (see the sides of the road). The local map is not shown in this picture but you can imagine that it has occupied voxels at all places where there are vehicles (purple boxes) and other stationary objects such as traffic lights, nearby buildings etc. Typically, if we know that a particular voxel corresponds to a vehicle, we run an Extended Kalman Filter for that particular vehicle to estimate the voxels in the local map that it is likely to be in at the next time-instant. The local map is a highly dynamic data structure that is rich in information necessary for planning trajectories of the robot.
4.6 Discussion
Occupancy grids are a very popular approach to represent the environment given the poses of the robot as it travels in this environment. We can also use occupancy grids to localize the robot in a future run (which is usually the purpose of creating them). Each cell in an occupancy grid stores the posterior probability of the cell being occupied on the basis of multiple observations {y1, . . . , yk} from respective poses {x1, . . . , xk}. This is a very efficient representation of the 3D world around us with the one caveat that each cell is updated independently of the others. But since one gets a large amount of data from typical range sensors (a 64-beam Velodyne (https://velodynelidar.com/products/hdl-64e) returns about 2 million points/sec and cheaper versions of this sensor cost about $100), this caveat does not hurt us much in practice. You can watch this talk
Chapter 5
Dynamic Programming
Reading
1. (Thrun) Chapter 15
For this reason, the notation and the treatment in this chapter, and the following ones, will be a bit pedantic. We will see complicated notation and terminology for quantities, e.g., the value function, that you might see written very succinctly in other places. We will mostly follow the notation of Dimitri Bertsekas' book on "Reinforcement Learning and Optimal Control" (http://www.mit.edu/~dimitrib/RLbook.html). You will get used to the extra notation and it will become second nature once you become more familiar with the concepts.
Run-time cost and terminal cost  We will take a very general view of the above problem and formalize it as follows. Consider a cost function
q_k(x_k, u_k) \in \mathbb{R}
which gives a scalar real-valued output for every pair (x_k, u_k). This models the fact that you do not want to walk more than you need to get to school, i.e., we would like to minimize q_k. You also want to make sure the trajectory actually reaches the lecture venue; we write this down as another cost q_f(x_T). We want to pick control inputs (u_0, u_1, \ldots, u_{T-1}) such that
J(x_0; u_0, u_1, \ldots, u_{T-1}) = q_f(x_T) + \sum_{k=0}^{T-1} q_k(x_k, u_k) \qquad (5.2)
The graph has one source node x_0. Each node in the graph is some x_k and each edge depicts taking a certain control u_k. Depending on which control we pick, we move to some other node x_{k+1} given by the dynamics f(x_k, u_k). Note that this is not a transition like that of a Markov chain; everything is deterministic in this graph. On each edge we write down the cost q_k(x_k, u_k), where x_{k+1} = f_k(x_k, u_k), and "close" the graph with a dummy terminal node (sink), with the cost q_f(x_T) on every edge leading to this artificial terminal node.
Minimizing the cost in (5.3) is now the same as finding the shortest path in this graph from the source to the sink. The algorithm to do so is quite simple and is called Dijkstra's algorithm after Edsger Dijkstra who used it around 1956 as a test program for a new computer named ARMAC (http://www-set.win.tue.nl/UnsungHeroes/machines/armac.html).
1. Let Q be the set of nodes that are currently unvisited; all nodes in the graph are added to it at the beginning. S is an empty set. An array called dist maintains the distance of every node in the graph from the source node x_0. Initialize dist(x_0) = 0 and dist(x) = ∞ for all other nodes x. A sketch of the algorithm in code is given below.
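The remaining steps repeatedly pick the unvisited node with the smallest distance and relax its outgoing edges. A minimal sketch in Python, with a made-up toy graph, could look as follows (the priority-queue implementation is one of several possible choices):

```python
import heapq

def dijkstra(edges, source):
    """Shortest distance from `source` to every node.
    `edges[x]` is a list of (neighbor, cost) pairs, i.e., the controls
    available at node x and their run-time costs."""
    dist = {source: 0.0}
    visited = set()
    frontier = [(0.0, source)]
    while frontier:
        d, x = heapq.heappop(frontier)      # unvisited node closest to the source
        if x in visited:
            continue
        visited.add(x)
        for x_next, cost in edges.get(x, []):
            d_new = d + cost
            if d_new < dist.get(x_next, float("inf")):
                dist[x_next] = d_new
                heapq.heappush(frontier, (d_new, x_next))
    return dist

# toy graph: edges into the sink carry the terminal cost
edges = {"x0": [("a", 1.0), ("b", 2.5)], "a": [("sink", 1.0)], "b": [("sink", 0.2)]}
print(dijkstra(edges, "x0"))   # {'x0': 0.0, 'a': 1.0, 'b': 2.5, 'sink': 2.0}
```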
11 We can prove this as follows. Suppose that we find the optimal control
12 sequence (u∗0 , u∗1 , . . . , u∗T −1 ) for the problem in (5.3). Our system is
13 deterministic, so this control sequence results in a unique sequence of states
14 (x0 , x∗1 , . . . , x∗T ). Each successive state is given by x∗k+1 = fk (x∗k , u∗k )
15 with x∗0 = x0 . The principle of optimality, or the principle of dynamic
16 programming, states that if one starts from a state x∗k at time k and wishes
17 to minimize the “cost-to-go”
q_f(x_T) + q_k(x_k^*, u_k) + \sum_{i=k+1}^{T-1} q_i(x_i, u_i)
then the truncated sequence of controls (u_k^*, \ldots, u_{T-1}^*) is also optimal for this sub-problem. To see this, suppose that there exists some other optimal sequence of controls for the truncated problem, say (v_k^*, \ldots, v_{T-1}^*). If so, the solution of the original problem where one takes controls v_k^* from this new sequence for time-steps k, k+1, \ldots, T-1 would have a lower cost. Hence the original sequence of controls would not have been optimal.
for all x ∈ X.
6 After running the above algorithm we have the optimal cost-to-go J0∗ (x)
7 for each state x ∈ X, in particular, we have the cost-to-go for the initial
8 state J0∗ (x0 ). If we remember the minimizer u∗k in (5.4) while running the
9 algorithm, we also have the optimal sequence (u∗0 , u∗1 , . . . , u∗T −1 ). The
10 function J0∗ (x) (often shortened to simply J ∗ (x)) is the optimal cost-to-go
11 from the state x ∈ X.
12 Again, we really only wanted to calculate J0∗ (x0 ) but had to do all this
13 extra work of computing Jk∗ for all the states.
5.3.1 Q-factor
The quantity
Q_k(x, u) = q_k(x, u) + J_{k+1}^*(f_k(x, u)),
i.e., the cost of taking the control u at the state x and acting optimally thereafter, is called the Q-factor of the pair (x, u). Since the two functions are equivalent (J_k^*(x) = \min_u Q_k(x, u)), we will call both of them "value functions". The difference will be clear from context.
We will assume that we know the statistics of the noise ϵ_k at each time-step (say it is a Gaussian). Stochastic dynamical systems are very different from deterministic ones: given the same sequence of controls (u_0, \ldots, u_{T-1}), we may get different state trajectories (x_0, x_1, \ldots, x_T)
uk (x) : X 7→ U (5.8)
uk (·) ∈ U(X).
is called a control policy. This is an object that we will talk about often. It is important to remember that a control policy is a set of controllers (usually feedback controllers), one executed at each time-step of a dynamic programming problem.
U(X) ∋ uk (·) : X 7→ U.
Since our set of states and controls is finite, this involves finding a table of size |X| × |U| for each iteration. In (5.4), we only had to search over a set of values u_k ∈ U of size |U|. At the end of dynamic programming, we have a sequence of feedback controls (u_0^*(\cdot), u_1^*(\cdot), \ldots, u_{T-1}^*(\cdot)). Each feedback control u_k^*(x) tells us what control the robot should pick if it finds itself at a state x at time k.
3. If we know the dynamical system, not in its functional form x_{k+1} = f_k(x_k, u_k) + ϵ_k but rather as a transition matrix P(x_{k+1} | x_k, u_k) (like we had in Chapter 2), then the expression in (5.11) simply becomes
J_k^*(x) = \min_{u_k(\cdot) \in U(X)} \left\{ q_k(x, u_k(x)) + \mathbb{E}_{x' \sim \mathbb{P}(\cdot \mid x, u_k(x))}\left[ J_{k+1}^*(x') \right] \right\}. \qquad (5.12)
? Why should we only care about minimizing the average cost in the objective in (5.10)? Can you think of any other objective we may wish to use?
Thus the infinite-horizon cost of a policy is the limit of its finite-horizon costs as the horizon tends to infinity. Notice a few important differences when compared to (5.7).
28 We will study this equation in depth soon. But if we find the minimum at
29 u∗ (x) for this equation, then we can run the policy
11 Again, we really only wanted to compute the cost-to-go J ∗ (x0 ) from some
12 initial state x0 but computed the value function at all states x ∈ X.
3. The feedback control is the control at that state that minimizes the Q-factor,
π^* = (u^*, u^*, \ldots).
Notice how we can directly find the u' that has the smallest value of the Q-factor Q^{(N)}(x, u') and set it to be our feedback control.
5.4.3 An example
Let us consider a grid-world example. A robot would like to reach a goal region (marked in green) and we are interested in computing the cost-to-go from different parts of the domain. Gray cells are obstacles that the robot cannot enter. At each step the robot can move in four directions (north, east, west, south) with a small dynamics noise which keeps it at the original cell in spite of taking the control. The pictures below are for the case when the run-time cost is negative, i.e., the robot gets a certain reward q(x, u) for taking the control u at cell x. Dynamic programming (and value iteration) also works in this case; we simply replace all minimizations by maximizations in the equations.
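A minimal sketch of value iteration for such a grid-world is shown below. The grid size, obstacle locations, rewards and the noise level are made up for illustration; the structure of the update (maximize over controls, average over the dynamics noise) is the point.

```python
import numpy as np

H, W, gamma, noise = 5, 5, 0.9, 0.1
obstacles = {(1, 1), (2, 3)}                 # gray cells
goal = (4, 4)                                # green cell
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # N, S, W, E

def step(s, a):
    """Deterministic part of the dynamics: stay in place at obstacles/edges."""
    ns = (s[0] + a[0], s[1] + a[1])
    return s if (ns in obstacles or not (0 <= ns[0] < H and 0 <= ns[1] < W)) else ns

J = np.zeros((H, W))
for _ in range(100):                         # value iteration (maximizing rewards)
    J_new = np.zeros_like(J)
    for i in range(H):
        for j in range(W):
            s = (i, j)
            if s == goal or s in obstacles:
                continue                     # terminal/blocked cells keep value 0
            best = -np.inf
            for a in moves:
                ns = step(s, a)
                r = 1.0 if ns == goal else -0.04          # reward for this transition
                # with probability `noise` the robot stays put in spite of the control
                q = (1 - noise) * (r + gamma * J[ns]) + noise * (-0.04 + gamma * J[s])
                best = max(best, q)
            J_new[i, j] = best
    J = J_new
print(np.round(J, 2))
```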
1 easy but cumbersome) but you should commit them to memory and try to
2 understand them intuitively.
3 Value iteration converges. Given any initialization J (0) (x) for all
4 x ∈ X, the sequence of value iteration estimates J (i) (x) converges to the
5 optimal cost
\forall x \in X, \quad J^*(x) = \lim_{N \to \infty} J^{(N)}(x)
as
\gamma\, \mathbb{E}_{\epsilon}\left[ J^\pi(f(x, u(x)) + \epsilon) \right] = \gamma \sum_{x'} T_{x,x'}\, J^\pi(x') = \gamma\, T J^\pi
i.e., we construct a new control policy that finds the best control ũ(·) to execute in the first step and thereafter executes the old feedback control u(·),
π^{(1)} = (ũ(\cdot), u(\cdot), \ldots);
then the cost-to-go of the new policy π^{(1)} has to be better:
\forall x \in X, \quad J^{\pi^{(1)}}(x) \le J^{\pi}(x).
? Why? It is simply because (5.23) is at least an improvement upon the feedback control u(·). The cost-to-go cannot improve only if the old feedback control u(·) were optimal to begin with.
24 We don’t have to stop at one time-step, we can patch the old policy at the
25 first two time-steps to get
2 we similarly have
∀x ∈ X, J π̃ (x) ≤ J π (x).
The algorithm terminates when the controller does not change at any
state, i.e., when the following condition is satisfied
Just like value iteration converges to the optimal value function, it can be shown that policy iteration produces a sequence of improved policies that converges to the optimal policy.
9 5.5.1 An example
10 Let us go back to our example for value iteration. In this case, we will
11 visualize the controller u(k) (x) at each cell x as arrows pointing to some
12 other cell. The cells are colored by the value function for that particular
13 stationary policy.
The evaluated value function for the policy after 4 iterations is optimal; compare this to the example for value iteration.
Chapter 6
Linear Quadratic Regulator (LQR)
Reading
1. http://underactuated.csail.mit.edu/lqr.html, Lecture 3-4 at
https://ocw.mit.edu/courses/aeronautics-and-astronautics/16-
323-principles-of-optimal-control-spring-2008/lecture-notes
q(x, u) = \frac{1}{2} x^\top Q x + \frac{1}{2} u^\top R u \qquad (6.1)
where Q ∈ R^{d×d} and R ∈ R^{m×m} are symmetric, positive semi-definite matrices
Q = Q^\top \succeq 0, \quad R = R^\top \succeq 0.
Effectively, if Q were a diagonal matrix, a large diagonal entry Q_{ii} would model our desire that the system should not have a large value of the state x_i along its trajectories. We want these matrices to be positive semi-definite to prevent dynamic programming from picking a trajectory which drives the run-time cost down to negative infinity.
27 which represents how far away both the position and velocity are from zero
28 over all times k. The following figure shows the trajectory that achieves a
29 small value of J.
Figure 6.1: The trajectory of z(t) as a function of time t for a double integrator
z̈(t) = u where we have chosen a stabilizing (i.e., one that makes the system
asymptotically stable) controller u = −z(t) − ż(t). Notice how the trajectory
starts from some initial condition (in this case z(0) = 1 and ż(0) = 0) and moves
towards its equilibrium point z = ż = 0.
Figure 6.2: The trajectory of z(t) as a function of time t for a double integra-
tor z̈(t) = u where we have chosen a large stabilizing control at each time
u = −5z(t) − 5ż(t). Notice how quickly the state trajectory converges to the
equilibrium without much oscillation as compared to Figure 6.1 but how large the
control input is at certain times.
This is obviously undesirable for real systems where we may want the control input to be bounded between some reasonable values (a car cannot accelerate by more than a certain threshold). A natural way of enforcing this is to modify our desired cost of the trajectory to be
J = \sum_{k=0}^{T} \left( \|x_k\|^2 + \rho \|u_k\|^2 \right)
q_f(x) = \frac{1}{2} x^\top Q_f\, x. \qquad (6.2)
6 The dynamic programming problem is now formulated as follows.
14 We can now take the derivative of the right-hand side with respect to u to
15 get
0 = \frac{d\,\mathrm{RHS}}{du} = R u + B^\top Q_f (A x_{T-1} + B u) \qquad (6.4)
\Rightarrow\; u^*_{T-1} = -(R + B^\top Q_f B)^{-1} B^\top Q_f A\, x_{T-1} \equiv -K_{T-1}\, x_{T-1},
where
K_{T-1} = (R + B^\top Q_f B)^{-1} B^\top Q_f A.
\frac{d^2\,\mathrm{RHS}}{du^2} = R + B^\top Q_f B \succeq 0
3 so we know that u∗T −1 is a minimum of the convex quantity on the right-
4 hand side. Notice that the optimal control u∗T −1 is a linear function of the
5 state xT −1 . Let us now expand the cost-to-go JT −1 using this optimal
6 value (the subscript T − 1 on the curly bracket simply means that all
7 quantities are at time T − 1)
J^*_{T-1}(x_{T-1}) = \frac{1}{2} \left\{ x^\top Q x + u^{*\top} R u^* + (A x + B u^*)^\top Q_f (A x + B u^*) \right\}_{T-1}
= \frac{1}{2}\, x_{T-1}^\top \left\{ Q + K^\top R K + (A - BK)^\top Q_f (A - BK) \right\}_{T-1} x_{T-1}
\equiv \frac{1}{2}\, x_{T-1}^\top P_{T-1}\, x_{T-1}
8 where we set the stuff inside the curly brackets to the matrix P which is
9 also positive semi-definite. This is great, the cost-to-go is also a quadratic
10 function of the state xT −1 . Let us assume that this pattern holds for all
11 time steps and the cost-to-go of the optimal LQR trajectory starting from
12 a state x and proceeding forwards for T − k time-steps is
J_k^*(x) = \frac{1}{2} x^\top P_k\, x.
13 We can now repeat the same exercise to get a recursive formula for Pk in
14 terms of Pk+1 . This is the solution of dynamic programming for the LQR
15 problem and it looks as follows.
P_T = Q_f
K_k = \left( R + B^\top P_{k+1} B \right)^{-1} B^\top P_{k+1} A \qquad (6.5)
P_k = Q + K_k^\top R\, K_k + (A - B K_k)^\top P_{k+1} (A - B K_k),
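The recursion (6.5) is only a few lines of code. The sketch below runs it for a discretized double integrator; the discretization step and the cost matrices are illustrative choices.

```python
import numpy as np

def lqr_backward(A, B, Q, R, Qf, T):
    """Backward Riccati recursion (6.5); returns the gains K_0, ..., K_{T-1}."""
    P = Qf
    gains = []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + K.T @ R @ K + (A - B @ K).T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]              # gains[k] is K_k

dt = 0.1                            # discretized double integrator z'' = u
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q, R, Qf = np.eye(2), 5.0 * np.eye(1), np.eye(2)
K = lqr_backward(A, B, Q, R, Qf, T=100)

x = np.array([[1.0], [0.0]])        # z(0) = 1, zdot(0) = 0
for k in range(100):
    u = -K[k] @ x                   # u_k = -K_k x_k
    x = A @ x + B @ u
print(x.ravel())                    # close to the equilibrium (0, 0)
```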
Figure 6.3: The trajectory of z(t) as a function of time t for a double integrator z̈(t) = u where we have chosen a controller obtained from LQR with Q = I and R = 5. This gives a controller of about u = −0.45 z(t) − 1.05 ż(t). Notice how we still get stabilization but the control acts more gradually. Using different values of R, we can get many different behaviors. Another key aspect of LQR, as compared to Figure 6.1 where the control was chosen in an ad hoc fashion, is that it lets us prescribe the quality of state trajectories using high-level quantities like Q, R.
\{u(t) : t \in \mathbb{R}_+\}
and state trajectories
\{x(t) : t \in \mathbb{R}_+\}
for the dynamical system. Let us consider the case when we want to find control sequences that minimize the integral of the cost along the trajectory that stops at some fixed, finite time-horizon T:
q_f(x(T)) + \int_0^T q(x(t), u(t))\, dt.
This cost is again a function of the run-time cost and a terminal cost.
Since \{x(t)\}_{t \ge 0} and \{u(t)\}_{t \ge 0} are continuous curves and the cost is now a function of a continuous curve, mathematicians say that the cost is a "functional" of the state and control trajectory.
Let J^*(x, t) denote the cost incurred if the trajectory starts at state x at time t and goes forward for T − t time. This is very similar to the cost-to-go J_k^*(x) we had in discrete-time dynamic programming. Dynamic programming now gives
J^*(x(t), t) = \min_{u(s),\, t \le s \le T} \left\{ q_f(x(T)) + \int_t^T q(x(s), u(s))\, ds \right\}
= \min_{u(s),\, t \le s \le T} \left\{ q_f(x(T)) + \int_t^{t+\Delta t} q(x(s), u(s))\, ds + \int_{t+\Delta t}^T q(x(s), u(s))\, ds \right\}
= \min_{u(s),\, t \le s \le t+\Delta t} \left\{ J^*(x(t+\Delta t), t+\Delta t) + \int_t^{t+\Delta t} q(x(s), u(s))\, ds \right\}.
11 We now take the Taylor approximation of the term J ∗ (x(t + ∆t), t + ∆t)
1 as follows
J ∗ (x, T ) = qf (x).
∂t J ∗ (x, t) = 0
In this course, we will not solve the HJB equation. Rather, we are
interested in seeing how the HJB equation looks for continuous-time
linear dynamical systems (both deterministic and stochastic ones) and
LQR problems for such systems, as done in the following section.
Figure 6.4: A car whose position is given by z(t) would like to climb the hill to
its right and reach the top with minimal velocity. The car rolls on the hill without
friction. The run-time cost is zero everywhere inside the state-space. Terminal
cost is -1 for hitting the left boundary (z = −1) and −1 − ż/2 for reaching the
right boundary (z = 1). The car is a double integrator, i.e., z̈ = u, with only two controls (u = 4 and u = −4) and it cannot exceed a given velocity (in this case |ż| ≤ 4). This looks like a simple dynamic programming problem but it is quite
hard due to the constraint on the velocity. The car may need to make multiple
swing ups before it gains enough velocity (but not too much) to climb up the hill.
The state of the car is its position and velocity (z, ż) and we can solve a two-dimensional HJB equation to obtain the optimal cost-to-go from any state, as done by the authors Yuval Tassa and Tom Erez in "Least Squares Solutions of the HJB Equation With Neural Network Value-Function Approximators" (https://homes.cs.washington.edu/~todorov/courses/amath579/reading/NeuralNet.pdf).
In practice, while solving the HJB PDE, one discretizes the state-space at a given set of states and solves the HJB equation (6.7) on this grid using numerical methods (these authors used neural networks to solve it). The end result looks as follows.
Figure 6.5: The left-hand side picture shows the infinite-horizon cost-to-go J^*(z, ż) for the car-on-the-hill problem. Notice how the value function is non-smooth at various places. This is quite typical of difficult dynamic programming problems. The right-hand side picture shows the optimal trajectories of the car (z(t), ż(t)); gray areas indicate maximum control and white areas indicate minimum control. The black lines show a few optimal control sequences taken by the car starting from various states in the state-space. Notice how the optimal control trajectory can be quite different even if the car starts from nearby states (−0.5, 1) and (−0.4, 1.2). This is also quite typical of difficult dynamic programming problems.
ẋ = A x + B u; x(0) = x0 .
9 This is a very nice setup for using the HJB equation from the previous
10 section.
11 Let us use our intuition from the discrete-time LQR problem and say
12 that the optimal cost is quadratic in the states, namely,
J^*(x, t) = \frac{1}{2}\, x(t)^\top P(t)\, x(t);
notice that as usual the optimal cost-to-go is a function of the state x and the time t because it is the optimal cost of the continuous-time LQR problem if the system starts at a state x at time t and goes on until time T ≥ t. We will now check if this J^* satisfies the HJB equation (we don't write the arguments x(t), u(t) etc. to keep the notation clear)
-\partial_t J^*(x, t) = \min_{u \in U} \left\{ \frac{1}{2}\left( x^\top Q x + u^\top R u \right) + (A x + B u)^\top \partial_x J^*(x, t) \right\} \qquad (6.9)
5 from (6.7). The minimization is over the control input that we take at time
6 t. Also notice the partial derivatives
\partial_x J^*(x, t) = P(t)\, x, \qquad \partial_t J^*(x, t) = \frac{1}{2}\, x^\top \dot P(t)\, x.
It is convenient in this case to see that the minimization can be performed using basic calculus (just like the discrete-time LQR problem); we differentiate with respect to u and set the derivative to zero:
0 = \frac{d\,(\text{RHS of HJB})}{du}
\;\Rightarrow\; u^*(t) = -R^{-1} B^\top P(t)\, x(t) \equiv -K(t)\, x(t). \qquad (6.10)
10 where K(t) = R−1 B ⊤ P (t) is the Kalman gain. The controller is again
11 linear in the states x(t) and the expression for the gain is very simple in
12 this case, much simpler than discrete-time LQR. Since R ≻ 0, we also
13 know that u∗ (t) computed here is the global minimum. If we substitute
14 this value of u∗ (t) back into the HJB equation we have
\left\{ \cdot \right\}_{u^*(t)} = \frac{1}{2}\, x^\top \left( P A + A^\top P + Q - P B R^{-1} B^\top P \right) x.
In order to satisfy the HJB equation, we must have that the expression above is equal to −∂_t J^*(x, t). We therefore have what is called the Continuous-time Algebraic Riccati Equation (CARE) for the matrix P(t) ∈ R^{d×d},
-\dot P = P A + A^\top P + Q - P B R^{-1} B^\top P, \qquad P(T) = Q_f.
23 Notice that the ODE for the P (t) travels backwards in time.
24 Continuous-time LQR has particularly easy equations, as you can see
25 in (6.10) and (6.11) compared to those for discrete-time ((6.4) and (6.5)).
Special techniques have been invented for solving the Riccati equation. For the infinite-horizon problem the solution P is a constant, i.e., Ṗ = 0, which gives the algebraic equation
0 = P A + A^\top P + Q - P B R^{-1} B^\top P.
29 We will not do the proof (it is easy but tedious, you can try to show it
Figure 6.6: Comparison of the state trajectories of deterministic LQR and stochastic
LQR problem with Bϵ = [0.1, 0.1]. The left panel is the same as that in Figure 6.3.
The control input is the same in both cases but notice that the states in the plot
on the right need not converge to the equilibrium due to noise. The cost of the
trajectory will also be higher for the stochastic LQR case due to this. The total cost
is J ∗ (x0 ) = 32.5 for the deterministic case (32.24 for the quadratic state-cost and
0.26 for the control cost). The total cost J ∗ (x0 ) is much higher for the stochastic
case, it is 81.62 (81.36 for the quadratic state cost and 0.26 for the control cost).
1 by writing the HJB equation for the stochastic LQR problem). This is a
2 very surprising result because it says that even if the dynamical system
3 had noise, the optimal control we should pick is exactly the same as the
4 control we would have picked had the system been deterministic. It is a
5 special property of the LQR problem and not true for other dynamical
6 systems (nonlinear ones, or ones with non-Gaussian noise) or other costs.
We know that the control u^*(t) is the same as in the deterministic case. Is the cost-to-go J^*(x, t) also the same? If you think about this, the cost-to-go in the stochastic case has to be a bit larger than in the deterministic case because the noise ϵ(t) is always going to be non-zero when we run the system; the LQR cost J^*(x_0, 0) = \frac{1}{2} x_0^\top P(0) x_0 is, after all, only the cost of the deterministic problem. It turns out that the cost for the stochastic LQR case for an initial state x_0 is
J^*(x_0, 0) = \mathbb{E}_{\epsilon(t):\, t \in [0, T]}\left[ \frac{1}{2}\, x(T)^\top Q_f\, x(T) + \frac{1}{2} \int_0^T \ldots\, dt \right]
= \frac{1}{2}\, x_0^\top P(0)\, x_0 + \frac{1}{2} \int_0^T \mathrm{tr}\!\left( P(t)\, B_\epsilon B_\epsilon^\top \right) dt.
14 The first term is the same as that of the deterministic LQR problem. The
15 second term is the penalty we incur for having a stochastic dynamical
16 system. This is the minimal cost achievable for stochastic LQR but it is
17 not the same as that of the deterministic LQR.
13 This equation is very close to the Kalman filter equations you saw in
14 Chapter 3. In particular, notice the close similarity of the expression for
15 the Kalman gain K(t) with the Kalman gain of the LQR problem. You
16 can read more at https://en.wikipedia.org/wiki/Kalman_filter.
\frac{d}{dt} \Sigma(t)^{-1} = -\Sigma(t)^{-1}\, \dot\Sigma(t)\, \Sigma(t)^{-1}
to get
\dot S = C^\top (D D^\top)^{-1} C - A^\top S - S A - S B_w B_w^\top S \qquad (6.14)
−Ṗ = P A + A⊤ P + Q − P BR−1 B ⊤ P
17 look quite similar to this equation. In fact, they are identical and you can
18 substitute the following.
t=T −t
1 dynamical system
z := x − x0 , and v := u − u0 ,
We have emphasized the fact that the matrices A_{x_0,u_0}, B_{x_0,u_0} depend upon the reference state and control using the subscript. Given the above linear system, we can find a control sequence u^*(·) that minimizes the cost functional using the standard LQR formulation. Notice that even though we computed this control trajectory using the approximate linear system, it can certainly be executed on the nonlinear system, i.e., at run-time we will simply set u ≡ u^*(z).
23 The linearized dynamics in (6.15) is potentially going to be very
24 different from the nonlinear system. The two are close in the neighborhood
25 of x0 (and u0 ) but as the system evolves using our control input to
26 move further away from x0 , the linearized model no longer is a faithful
27 approximation of the nonlinear model. A reasonable way to fix matters
28 is to linearize about another point, say the state and control after t = 1
29 seconds, x1 , u1 to get a new system
30 and take the LQR-optimal control corresponding to this system for the
31 next second.
32 The above methodology is called “receding horizon control”. The
33 idea is that we compute the optimal control trajectory u∗ (·) using an
34 approximation of the original system and recompute this control every few
35 seconds when our approximation is unlikely to be accurate. This is a very
28 Step 1 Linearize the nonlinear system about the state trajectory x(k) (·)
29 and u(k) (·) using
where
A^{(k)}(t) = \frac{df}{dx}\bigg|_{x(t)=x^{(k)}(t),\, u(t)=u^{(k)}(t)}, \qquad B^{(k)}(t) = \frac{df}{du}\bigg|_{x(t)=x^{(k)}(t),\, u(t)=u^{(k)}(t)}
q_f(x(T)) \approx \text{constant} + z(T)^\top \frac{d q_f}{dx}\bigg|_{x(T)=x^{(k)}(T)} + z(T)^\top \frac{d^2 q_f}{dx^2}\bigg|_{x(T)=x^{(k)}(T)} z(T),
and similarly for the run-time cost,
q(x(t), u(t)) \approx \ldots + z(t)^\top \underbrace{\frac{d^2 q}{dx^2}\bigg|_{x(t)=x^{(k)}(t),\, u(t)=u^{(k)}(t)}}_{\equiv Q} z(t) + v(t)^\top \underbrace{\frac{d^2 q}{du^2}\bigg|_{x(t)=x^{(k)}(t),\, u(t)=u^{(k)}(t)}}_{\equiv R} v(t).
? How will you solve for the optimal controller for a linear dynamics with the cost
\int_0^T \left( q^\top x + \frac{1}{2} x^\top Q x \right) dt\, ?
4 This is an LQR problem with run-time cost that depends on time (like our
5 discrete-time LQR formulation, the continuous-time formulation simply
6 has Q, R to be functions of time t in the Riccati equation) and which also
7 has terms that are affine in the state and control in addition to the usual
8 quadratic cost terms.
Step 2  Solve the above linearized problem using the standard LQR formulation to get the new control trajectory u^{(k+1)}(·).
Step 3  Simulate the nonlinear system using the control u^{(k+1)}(·) to get the new state trajectory x^{(k+1)}(·). A sketch of this receding-horizon/iterative-linearization loop in code is given below.
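Here is a rough sketch of this idea in code: re-linearize a (made-up) nonlinear system about the current state and control, compute an LQR gain for the linearization, apply only the first control, and repeat. The pendulum dynamics, horizon, and cost matrices are illustrative assumptions; a full iLQR implementation would also expand the cost as above and iterate over whole trajectories.

```python
import numpy as np

def f(x, u, dt=0.05):
    """Made-up nonlinear system for illustration: a damped pendulum."""
    theta, omega = x
    return np.array([theta + dt * omega,
                     omega + dt * (-9.8 * np.sin(theta) - 0.1 * omega + u[0])])

def linearize(x, u, eps=1e-5):
    """Finite-difference Jacobians A = df/dx, B = df/du about (x, u)."""
    A = np.column_stack([(f(x + eps * e, u) - f(x, u)) / eps for e in np.eye(x.size)])
    B = np.column_stack([(f(x, u + eps * e) - f(x, u)) / eps for e in np.eye(u.size)])
    return A, B

def lqr_gain(A, B, Q, R, Qf, T=30):
    """Gain after T backward steps of (6.5), roughly the stationary gain."""
    P = Qf
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + K.T @ R @ K + (A - B @ K).T @ P @ (A - B @ K)
    return K

Q, R, Qf = np.eye(2), 0.1 * np.eye(1), np.eye(2)
x, u = np.array([0.5, 0.0]), np.zeros(1)
for t in range(200):                      # receding-horizon loop
    A, B = linearize(x, u)                # re-linearize about the current point
    u = -lqr_gain(A, B, Q, R, Qf) @ x     # apply only the first LQR control, then repeat
    x = f(x, u)
print(x)                                  # the state is regulated towards the origin
```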
13 Some important comments to remember about the iLQR algorithm.
14 1. There are many ways to pick the initial control trajectory u(0) (·), e.g.,
15 using a spline to get an arbitrary control sequence, using a spline
16 to interpolate the states to get a trajectory x(0) (·) and then back-
17 calculate the control trajectory, using the LQR solution based on the
18 linearization about the initial state, feedback linearization/differen-
19 tial flatness (https://en.wikipedia.org/wiki/Feedback_linearization)
20 etc.
Chapter 7
Imitation Learning
Reading
1. The DAGGER algorithm
(https://www.cs.cmu.edu/~sross1/publications/Ross-
AIStats11-NoRegret.pdf)
2. https://www.youtube.com/watch?v=TUBBIgtQL_k
J_\varphi(x, t) = \frac{1}{2}\, x(t)^\top \underbrace{(\text{some function of } A, B, Q, R)}_{\text{function of } \varphi}\, x(t)
for LQR. We know the stuff inside the brackets to be exactly P(t) but, if we did not, it could be written down as some generic function
uθ (·).
X → Y.
17 is called the “training set”. We use this data to identify patterns that help
18 make predictions on some future data.
3 This predictor certainly solves the task. It correctly works for all images
4 in the training set. Does it work for images outside the training set?
5 Our task in machine learning is to learn a predictor that works outside
6 the training set. The training set is only a source of information that Nature
7 gives us to find such a predictor.
f(x; w) = \mathrm{sign}(w^\top x) = \begin{cases} +1 & \text{if } w^\top x \ge 0 \\ -1 & \text{else.} \end{cases} \qquad (7.1)
We have used the sign function, denoted as sign, to get binary {−1, +1} outputs from our real-valued prediction w^⊤x. This is the famous perceptron model of Frank Rosenblatt. We want the predictions of the model to match those in the training data and devise an objective to fit/train the perceptron.
\ell_{\text{zero-one}}(w) := \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{y^i \ne f(x^i; w)\}}. \qquad (7.2)
where the function ℓ^i denotes the loss on the sample (x^i, y^i) and w ∈ R^p denotes the weights of the classifier. Solving this problem using SGD corresponds to iteratively updating the weights using
w^{(t+1)} = w^{(t)} - \eta\, \frac{d \ell^{\omega_t}(w)}{dw}\bigg|_{w = w^{(t)}},
i.e., we compute the gradient using one sample with index ω_t in the dataset. The index ω_t is chosen uniformly randomly from
\omega_t \in \{1, \ldots, n\}.
In practice, at each time-step t, we typically select a few (not just one) data indices ω_t from the training dataset and average the gradient \frac{d\ell^{\omega_t}(w)}{dw}\big|_{w=w^{(t)}} across them; this is known as a "mini-batch". The gradient of the loss ℓ^{ω_t}(w) with respect to w is denoted by ∇ℓ^{ω_t}(w^{(t)}); we write
\nabla_{w_1} \ell^{\omega_t}(w^{(t)}) = \frac{d \ell^{\omega_t}(w)}{d w_1}\bigg|_{w = w^{(t)}}
for the scalar-valued derivative of the objective ℓ^{ω_t}(w^{(t)}) with respect to the first weight w_1 ∈ R. We can therefore write SGD as
w^{(t+1)} = w^{(t)} - \eta\, \nabla \ell^{\omega_t}(w^{(t)}).
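A minimal sketch of mini-batch SGD on synthetic data is shown below. Since the zero-one loss is not differentiable, the sketch uses a logistic surrogate loss; the data, step-size and batch size are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
y = np.sign(X @ w_true)                         # labels in {-1, +1}

w = np.zeros(p)
eta, batch_size = 0.1, 8
for t in range(500):
    idx = rng.integers(0, n, size=batch_size)   # mini-batch of indices omega_t
    margins = y[idx] * (X[idx] @ w)
    # gradient of the logistic surrogate log(1 + exp(-y w^T x)), averaged over the batch
    grad = -((1.0 / (1.0 + np.exp(margins))) * y[idx]) @ X[idx] / batch_size
    w = w - eta * grad                          # SGD update
print(np.mean(np.sign(X @ w) != y))             # training zero-one loss
```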
f(x; v, S) = \mathrm{sign}\left( v^\top \sigma(S^\top x) \right)
Figure 7.1
10 second layer: it takes the features generated by the first layer, namely
11 σ(S1⊤ x), multiplies these features using its feature matrix S2⊤ and applies
12 a nonlinear function σ(·) to this result element-wise before passing it on
13 to the third layer.
Figure 7.2
v ∈ Rp×C
15 where C is the total number of classes in the data. Just like logistic
16 regression predicts the logits of the two classes, we would like to interpret
17 the vector ŷ as the log-probabilities of an input belonging to one of the
classes.
? What would the shape of w be if you were performing regression using a deep network?
19 Weights It is customary to not differentiate between the parameters of
20 different layers of a deep network and simply say weights when we want
21 to refer to all parameters. The set
w := {v, S1 , S2 , . . . , SL }
f (x, w) (7.6)
1 and fitting the deep network to a dataset involves the optimization problem
w^* = \mathop{\mathrm{argmin}}_{w}\; \frac{1}{n} \sum_{i=1}^{n} \ell(y^i, \hat y^i). \qquad (7.7)
\ell^i(w) := \ell(y^i, \hat y^i).
At each step, we record the state x^i_t ∈ R^d and the control u^i_t that the expert took at that state. We would like to learn a deterministic feedback control for the robot that is parametrized by parameters θ,
u_θ(x) : X \mapsto U \subset \mathbb{R}^m,
using the training data. The idea is that if u_θ(x^i_t) ≈ u^i_t for all i and all times t, then we can simply run our learned controller u_θ(x) on the robot instead of having the expert. A simple example is a baby deer learning to imitate how its mother runs.
u_θ(x) = v\, \sigma(S^\top x), \qquad θ := (v, S)
13 How to fit the controller? Given our chosen model for uθ (x), say a
14 two-layer neural network with weights θ, fitting the controller involves
15 finding the best value for the parameters θ such that uθ (xit ) ≈ uit for data
16 in our dataset. There are many ways to do this, e.g., we can solve the
17 following optimization problem
\hat θ = \mathop{\mathrm{argmin}}_{θ}\; \ell(θ) := \frac{1}{n} \sum_{i=1}^{n} \underbrace{\frac{1}{T+1} \sum_{t=0}^{T} \left\| u^i_t - u_θ(x^i_t) \right\|_2^2}_{\ell^i(θ)} \qquad (7.8)
The difficulty of solving the above problem depends upon how complex the model u_θ(x) is; for instance, if the model is linear, u_θ(x) = θ x, we can solve (7.8) using ordinary least squares. If the model is a neural network, one would have to use SGD to solve the optimization problem above. After fitting this model, we have a new controller
u_{\hat θ}(x) \in \mathbb{R}^m
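A minimal behavior-cloning sketch in PyTorch is given below. The state/control dimensions and the "expert" data are synthetic stand-ins for a real dataset D.

```python
import torch
import torch.nn as nn

# synthetic expert data standing in for {(x_t^i, u_t^i)}: d = 4 states, m = 2 controls
X = torch.randn(1024, 4)
U = torch.tanh(X @ torch.randn(4, 2))              # pretend expert controls

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(200):
    idx = torch.randint(0, X.shape[0], (128,))     # mini-batch
    loss = ((policy(X[idx]) - U[idx]) ** 2).mean() # squared error as in (7.8)
    opt.zero_grad()
    loss.backward()
    opt.step()

u_hat = policy(torch.randn(1, 4))                  # run the cloned controller at a new state
```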
may not work outside these parts. Of course, if we behavior clone the controls taken by a generic driver, they are unlikely to be competitive for racing, and vice-versa. It is very important to realize that this does not mean that BC does not generalize. Generalization in machine learning is a concept that suggests that the model should work well on data from the same distribution. What does the "distribution" of the expert mean? In this case, it simply refers to the distribution of the states that the expert's trajectories typically visit, e.g., a race driver typically drives at the limits of tire friction and throttle, which is different from a usual city-driver who would rather maximize the longevity of their tires and engine-life.
Discuss generalization performance in behavior cloning.
u \sim u_θ(\cdot \mid x) = \mathbb{P}(\cdot \mid x)
we have that
-\log u_θ(u \mid x) = \frac{\| \mu_θ(x) - u \|_2^2}{\sigma_θ^2(x)} + 2\, c\, p \log \sigma_θ(x),
where c is a constant.
This is a distance and not a metric, i.e., it is always non-negative and zero if and only if the two distributions are equal, but the KL-divergence is not symmetric (like a metric has to be). Also, the above formula is well-defined only if for all x where q(x) = 0, we also have p(x) = 0. Notice that it is not symmetric:
\mathrm{KL}(q\, \|\, p) = \sum_{x \in X} q(x) \log \frac{q(x)}{p(x)} \ne \mathrm{KL}(p\, \|\, q).
puθ∗ (x, u)
each time-step, over a horizon of length T time-steps, it can be O(T^2 ϵ) off from the cost-to-go of the expert as averaged over states that the learned controller visits. This is because once the robot makes a mistake and goes away from the expert's path in the state-space, future states of the robot and the expert can be very different.
Draw a picture of the amplifying errors of running behavior cloning in real-time.
DAgger: Let the dataset D(0) be the data collected from the
expert. Initialize uθ(0) = uθb to be the BC controller learned using
data D(0) . At iteration k
2. Use u(x) to collect a dataset D = (xit , uit )t=0,...,T i=1,...,n
with n trajectories.
1 the dataset, or even if it takes a slightly different control than the expert
2 midway through a trajectory.
To fix this, the robot collects more data at each iteration. It uses a combination of the expert and its own controller to collect such data. This allows collecting a dataset of the expert's controls in states that the robot visits and iteratively expands the dataset D^{(k)}.
In the beginning we may wish to stay close to the expert's data and use a large value of p; as the fitted controller u_{θ^{k+1}} becomes good, we can reduce the value of p and rely less on the expert. A sketch of this loop in code is given below.
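Here is a rough sketch of the DAgger loop on a made-up scalar system with a hand-coded expert; the mixing probability p, the least-squares policy class, and the dynamics are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = 1.0, 0.5                                   # toy scalar dynamics x' = Ax + Bu
expert = lambda x: -1.2 * x                       # stands in for the expert controller

def rollout(controller, T=30):
    x, traj = 2.0 * rng.standard_normal(), []
    for _ in range(T):
        u = controller(x)
        traj.append((x, expert(x)))               # states we visit, labels from the expert
        x = A * x + B * u + 0.05 * rng.standard_normal()
    return traj

def fit(data):                                    # least-squares fit of u = theta * x
    X = np.array([d[0] for d in data])
    U = np.array([d[1] for d in data])
    theta = (X @ U) / (X @ X)
    return lambda x, th=theta: th * x

D = rollout(expert)                               # D^(0): expert demonstrations
policy = fit(D)                                   # behavior-cloned controller
for k in range(5):                                # DAgger iterations
    p = 0.5 ** k                                  # rely less on the expert over time
    mixed = lambda x: expert(x) if rng.random() < p else policy(x)
    D += rollout(mixed)                           # states visited by the mixture, expert labels
    policy = fit(D)
```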
DAgger is an iterative algorithm which expands the controller to handle larger and larger parts of the state-space. Therefore, the cost-to-go of the controller learned via DAgger is O(T) off from the cost-to-go of the expert as averaged over states that the learned controller visits.
? What criterion can we use to stop these iterations? We can stop when the incremental dataset collected D^k is not that different from the cumulative dataset D; then we know that the new controllers are not that different. We can also stop when the parameters of our learned controller satisfy θ^{(k+1)} ≈ θ^{(k)}.
DAgger with expert annotations at each step  DAgger is a conceptual
framework where the expert is queried repeatedly for new control actions. This is obviously problematic because we need the expert on hand at each iteration. We can also cook up a slightly different version of DAgger where we start with the BC controller u_{θ^{(k)}} = u_{\hat θ} and, at each step, we run the controller on the real system and ask the expert to relabel the data after that run. The dataset D^{(k)} collected by the algorithm expands at each iteration and although the states x^i_t are those visited by our controller, their annotations are those given by the expert. This is a much more natural way of implementing DAgger.
Chapter 8
Reading
1. Sutton & Barto, Chapter 9–10, 13
12 Trajectory space Let us write out one trajectory of such a system a bit
13 more explicitly. We know that the probability of the next state xk+1 given
14 xk is p(xk+1 | xk , uk ). The probability of taking a control uk at state xk
15 is uθ (uk | xk ). We denote an infinite trajectory by
τ = x0 , u0 , x1 , u1 , . . . .
\hat θ = \mathop{\mathrm{argmax}}_{θ}\; J(θ)
where the step-size is η > 0 and ∇J(θ) is the gradient of the objective J(θ) with respect to the weights θ. Instead of computing the exact ∇J(θ), which we will do in the next section, let us simply compute the gradient using a finite-difference approximation: each entry of the gradient can be estimated by perturbing the weights, e.g., along random directions
\xi^i \sim N(0, \sigma^2 I),
differencing the resulting values of the objective J(θ), and then updating
θ^{(k+1)} = θ^{(k)} + \eta\, \hat\nabla J(θ^{(k)}).
First let us note that the distribution p_θ using which we compute the expectation also depends on the weights θ. This is why we cannot simply move the derivative ∇_θ inside the expectation. We instead write
\nabla \mathop{\mathbb{E}}_{\tau \sim p_\theta(\tau)}[R(\tau)] = \nabla \int R(\tau)\, p_\theta(\tau)\, d\tau = \int R(\tau)\, \nabla p_\theta(\tau)\, d\tau \quad \text{(move the gradient inside; the integral is over trajectories } \tau \text{ which do not depend on } \theta \text{ themselves)}
= \int R(\tau)\, p_\theta(\tau)\, \frac{\nabla p_\theta(\tau)}{p_\theta(\tau)}\, d\tau
= \int R(\tau)\, p_\theta(\tau)\, \nabla \log p_\theta(\tau)\, d\tau
= \mathop{\mathbb{E}}_{\tau \sim p_\theta(\tau)}\left[ R(\tau)\, \nabla \log p_\theta(\tau) \right]
\approx \frac{1}{n} \sum_{i=1}^{n} R(\tau^i)\, \nabla \log p_\theta(\tau^i). \qquad (8.8)
This is called the likelihood-ratio trick to compute the policy gradient. It simply multiplies and divides by the term p_θ(τ) and rewrites the term ∇p_θ/p_θ = ∇ log p_θ. It gives us a neat way to compute the gradient: we sample n trajectories τ^1, \ldots, τ^n from the system and average the return of each trajectory R(τ^i) weighted by the gradient of the log-likelihood of taking that trajectory, ∇ log p_θ(τ^i). The central point to remember here is
5 Variance of policy gradient The expression for the policy gradient may
6 seem like a sleight of hand. It is a clean expression to get the gradient of
7 the objective but also comes with a number of problems. Observe that
\nabla \mathop{\mathbb{E}}_{\tau \sim p_\theta(\tau)}[R(\tau)] = \mathop{\mathbb{E}}_{\tau \sim p_\theta(\tau)}\left[ R(\tau)\, \frac{\nabla p_\theta(\tau)}{p_\theta(\tau)} \right] \approx \frac{1}{n} \sum_{i=1}^{n} R(\tau^i)\, \frac{\nabla p_\theta(\tau^i)}{p_\theta(\tau^i)}.
8 If we sample trajectories τ i that are not very likely under the distribution
9 pθ (τ ), the denominator in some of the summands can be very small.
10 For trajectories that are likely, the denominator is large. The empirical
11 estimate of the expectation using n trajectories where some terms are
12 very small and some others very large, therefore has a large variance. So
13 one does need lots of trajectories from the system/simulator to compute a
14 reasonable approximation of the policy gradient.
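The estimator itself is only a few lines of code. Below is a minimal sketch of the likelihood-ratio estimate (8.8) for a hand-coded one-dimensional linear system with a Gaussian policy; all constants (noise level, step-size, number of trajectories) are illustrative and, as discussed above, the estimate is noisy unless many trajectories are used.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                 # Gaussian policy u ~ N(theta[0] + theta[1]*x, sigma^2)
sigma, eta, gamma = 0.5, 0.05, 0.99

def sample_trajectory(theta, T=20):
    x, logp_grad, R = 1.0, np.zeros(2), 0.0
    for k in range(T):
        mean = theta[0] + theta[1] * x
        u = mean + sigma * rng.standard_normal()
        # gradient of log N(u; mean, sigma^2) w.r.t. theta, summed over the trajectory
        logp_grad += (u - mean) / sigma**2 * np.array([1.0, x])
        R += gamma**k * (-(x**2 + 0.1 * u**2))      # an LQR-like reward
        x = 0.9 * x + 0.2 * u + 0.05 * rng.standard_normal()
    return R, logp_grad

for it in range(200):                               # gradient ascent on J(theta)
    samples = [sample_trajectory(theta) for _ in range(32)]      # n trajectories
    grad_J = np.mean([R * g for R, g in samples], axis=0)        # estimate (8.8)
    theta += eta * grad_J
print(theta)
```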
8 Baseline We will now use the concept of a control variate to reduce the
9 variance of the policy gradient. This is known as “building a baseline”.
10 The simplest baseline one can build is to subtract a constant value from
11 the return. Consider the PG given by
\hat\nabla J(θ) = \mathop{\mathbb{E}}_{\tau \sim p_\theta(\tau)}\left[ (R(\tau) - b)\, \nabla \log p_\theta(\tau) \right].
Observe that
\mathop{\mathbb{E}}_{\tau \sim p_\theta(\tau)}\left[ b\, \nabla \log p_\theta(\tau) \right] = \int d\tau\; b\, p_\theta(\tau)\, \nabla \log p_\theta(\tau) = \int d\tau\; b\, \nabla p_\theta(\tau) = b\, \nabla \int d\tau\; p_\theta(\tau) = b\, \nabla 1 = 0.
Set
\frac{d\, \delta \hat\nabla_{θ_i} J(θ)}{d b_i} = 0
in the above expression to get
b_i = \frac{\mathbb{E}_\tau\left[ \left( \nabla_{θ_i} \log p_\theta(\tau) \right)^2 R(\tau) \right]}{\mathbb{E}_\tau\left[ \left( \nabla_{θ_i} \log p_\theta(\tau) \right)^2 \right]},
which is the baseline you should subtract from the gradient of the i-th parameter θ_i to result in the largest variance reduction. This expression is just the expected return, weighted by the magnitude of the gradient; this again is 1–2 lines of code.
? Show that any function that only depends on the state x can be used as a baseline in the policy gradient. This technique is known as reward shaping.
3 Compare the above formula for the policy gradient with the one we
4 had before in (8.8)
\hat\nabla J(θ) = \mathop{\mathbb{E}}_{\tau \sim p_\theta(\tau)}\left[ R(\tau)\, \nabla \log p_\theta(\tau) \right]
= \mathop{\mathbb{E}}_{\tau \sim p_\theta(\tau)}\left[ \left( \sum_{k=0}^{T} \gamma^k\, r(x_k, u_k) \right) \left( \sum_{k=0}^{T} \nabla \log u_\theta(u_k \mid x_k) \right) \right].
15 1. While the algorithm collects the data, states that are unlikely under
16 the distribution dθ contribute little to (8.11). In other words, the
17 policy gradient is insensitive to such states. The policy update will
18 not consider these unlikely states that the system is prone to visit
19 infrequently using the controller uθ .
20 2. The opposite happens for states which are very likely. For two
21 controls u1 , u2 at the same state x, the policy increases the log-
22 likelihood of taking the controls weighted by their values q θ (x, u1 )
17 for some large time-horizon T where (x0 , u0 ) = (x, u) and the summation
18 is evaluated for (xk , uk ) that lie on the trajectory τx,u . Effectively, we are
19 evaluating (8.12) using one sample trajectory, a highly erroneous estimate
20 of q θ .
qφθ (x, u) : X × U → R.
uθ (x) = θ⊤ x
16 which is a linear function in the states and controls. You can also think of
17 using something like
q_\varphi^\theta(x, u) = \begin{bmatrix} 1 & x & u \end{bmatrix} \varphi \begin{bmatrix} 1 \\ x \\ u \end{bmatrix}, \qquad \varphi \in \mathbb{R}^{(m+d+1) \times (m+d+1)}.
v^\theta(x) = \mathop{\mathbb{E}}_{u \sim u_\theta(\cdot \mid x)}\left[ q^\theta(x, u) \right]. \qquad (8.15)
Bellman equation
v^\theta(x) = \mathop{\mathbb{E}}_{u \sim u_\theta(\cdot \mid x)}\left[ r(x, u) + \gamma \mathop{\mathbb{E}}_{x' \sim \mathbb{P}(\cdot \mid x, u)}\left[ v^\theta(x') \right] \right].
vψθ (x) : X → R
3 using parameters ψ and fit it to the data in the same way as (8.14) to get
\hat\psi = \mathop{\mathrm{argmin}}_{\psi}\; \frac{1}{n(T+1)} \sum_{i=1}^{n} \sum_{k=0}^{T} \left( v_\psi^\theta(x^i_k) - r(x^i_k, u^i_k) - \gamma\, v_\psi^\theta(x^i_{k+1}) \right)^2. \qquad (8.16)
4 8.6 Discussion
5 This brings to an end the discussion of policy gradients. They are, in
6 general, a complicated suite of algorithms to implement. You will see
7 some of this complexity when you implement the controller for something
8 as simple as the simple pendulum. The key challenges with implementing
9 policy gradients come from the following.
10 1. Need lots of data, each parameter update requires fresh data from
11 the systems. Typical problems may need a million trajectories, most
12 robots would break before one gets this much data from them if one
13 implements these algorithms naively.
3. Fitting the Q-function and the value function is not easy; each parameter update of the policy ideally requires you to solve the entire problems (8.14) and (8.16). In practice, we only perform a few steps of SGD to solve the two problems and reuse the solution of the k-th iteration as an initialization for the (k+1)-th update. This is known as a "warm start" in the optimization literature and reduces the computational cost of fitting the Q/value-functions from scratch each time.
Chapter 9
Q-Learning
Reading
1. Sutton & Barto, Chapter 6, 11
1 number of controls, with each entry in this table being the value q(x, u).
2 Value iteration when written using the Q-function at the k th iteration for
3 the tabular setting looks like
q^{(k+1)}(x, u) = \sum_{x' \in X} \mathbb{P}(x' \mid x, u) \left( r(x, u) + \gamma \max_{u'} q^{(k)}(x', u') \right)
= \mathop{\mathbb{E}}_{x' \sim \mathbb{P}(\cdot \mid x, u)}\left[ r(x, u) + \gamma \max_{u'} q^{(k)}(x', u') \right].
ue (· | x)
14 over samples (xk , uk , xk+1 ) collected as the robot explores the environ-
15 ment.
20 (xik , uik , xik+1 ) in the second equation. An important point to note here is
21 that although the robot collects a finite number of data
D = \left\{ \left( x^i_k, u^i_k \right)_{k=0,1,\ldots,T} \right\}_{i=1}^{n}
26 Terminal state One must be very careful about the terminal state in such
27 implementations of Q-learning. Typically, most research papers imagine
28 that they are solving an infinite horizon problem but use simulators that
29 have an explicit terminal state, i.e., the simulator does not proceed to the
30 next timestep after the robot reaches the goal. A workaround for using
31 such simulators (essentially all simulators) is to modify (9.2) as
q(x, u) = (1 - \eta)\, q(x, u) + \eta \left( r(x, u) + \gamma \left( 1 - \mathbf{1}_{\{x' \text{ is terminal}\}} \right) \max_{u'} q(x', u') \right).
for some user-chosen value of ϵ. Effectively, the robot repeats the controls it took in the past with probability 1 − ϵ and uniformly samples from the entire control space with probability ϵ. The former ensures that the robot moves towards the parts of the state-space where states have a high return-to-come (after all, that is what the Q-function q(x, u) indicates). The latter ensures that even if the robot's estimate of the Q-function is bad, it is still visiting every state in the state-space infinitely often.
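A minimal sketch of tabular Q-Learning with ϵ-greedy exploration on a made-up chain MDP is shown below; the environment, learning rate and ϵ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_controls = 6, 2               # a small chain; control 0 = left, 1 = right
goal, gamma, eta, eps = 5, 0.9, 0.5, 0.1
q = np.zeros((n_states, n_controls))

def step(x, u):
    x_next = min(x + 1, n_states - 1) if u == 1 else max(x - 1, 0)
    r = 1.0 if x_next == goal else 0.0
    return x_next, r, x_next == goal      # next state, reward, terminal flag

for episode in range(500):
    x = 0
    for t in range(50):
        # epsilon-greedy exploratory controller u_e(. | x)
        u = rng.integers(n_controls) if rng.random() < eps else int(np.argmax(q[x]))
        x_next, r, terminal = step(x, u)
        target = r + gamma * (0.0 if terminal else np.max(q[x_next]))
        q[x, u] = (1 - eta) * q[x, u] + eta * target     # tabular update
        if terminal:
            break
        x = x_next
print(np.argmax(q, axis=1))               # greedy controller: move right everywhere
```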
P(x′ | x, u);
10 this is very similar to the step in the Baum-Welch algorithm that we saw
11 for learning the Markov state transition matrix of the HMM in Chapter 2.
12 We simply take frequency counts to estimate this probability
\mathbb{P}(x' \mid x, u) \approx \frac{1}{N} \sum_i \mathbf{1}_{\{x' \text{ was reached from } x \text{ using control } u\}}
13 where N is the number of the times the robot took control u at state x.
14 Given this transition matrix, we can now perform value iteration on the
15 MDP to learn the Q-function
q^{(k+1)}(x, u) = \mathop{\mathbb{E}}_{x' \sim \mathbb{P}(\cdot \mid x, u)}\left[ r(x, u) + \gamma \max_{u'} q^{(k)}(x', u') \right].
16 The success of this two-stage approach depends upon how accurate our
17 estimate of P(x′ | x, u) is. This in turn depends on how much the robot
18 explored the domain and the size of the dataset it collected, both of these
need to be large. We can therefore think of Q-learning as interleaving these two stages in a single algorithm: it learns the dynamics of the system and the Q-function for that dynamics simultaneously. But the Q-Learning algorithm does not really maintain a representation of the dynamics, i.e., at the end of running Q-Learning, we do not know what P(x' | x, u) is.
q_\varphi(x, u) : X \times U \mapsto \mathbb{R}
as the Q-function and our goal is to fit the deep network to obtain the weights \hat\varphi, instead of maintaining a very large table of size |X| × |U| for the Q-function. Fitting the Q-function is quite similar to the tabular case: given a dataset D = \left\{ (x^i_t, u^i_t)_{t=0,1,\ldots,T} \right\}_{i=1}^{n} from the system, we want to enforce
q_\varphi(x^i_t, u^i_t) = r(x^i_t, u^i_t) + \gamma \max_{u'} q_\varphi(x^i_{t+1}, u')
11 for all tuples (xit , uit , xit+1 ) in the dataset. Just like the previous section,
12 we will solve
\hat\varphi = \mathop{\mathrm{argmin}}_{\varphi}\; \frac{1}{n(T+1)} \sum_{i=1}^{n} \sum_{t=1}^{T} \Big( q_\varphi(x^i_t, u^i_t) - \underbrace{\big( r(x^i_t, u^i_t) + \gamma \big( 1 - \mathbf{1}_{\{x^i_{t+1} \text{ is terminal}\}} \big) \max_{u'} q_\varphi(x^i_{t+1}, u') \big)}_{\text{target}(x^i_{t+1};\, \varphi)} \Big)^2
\equiv \mathop{\mathrm{argmin}}_{\varphi}\; \frac{1}{n(T+1)} \sum_{i=1}^{n} \sum_{t=1}^{T} \left( q_\varphi(x^i_t, u^i_t) - \text{target}(x^i_{t+1}; \varphi) \right)^2 \qquad (9.4)
13 The last two terms in this expression above are together called the “target”
14 because the problem is very similar to least squares regression, except that
15 the targets also depend on the weights φ. This is what makes it challenging
16 to solve.
As discussed above, Q-Learning with function approximation is known as "Fitted Q Iteration". Remember the very important point that the robot collects data using the exploratory controller u_e(· | x) but the Q-function that we fit is the optimal Q-function.
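A minimal PyTorch sketch of fitting the Q-function as in (9.4) is shown below. The replay data is synthetic, the network is tiny, and the target here is computed from the same network (the delayed target discussed below is the standard fix).

```python
import torch
import torch.nn as nn

d, m, gamma = 4, 3, 0.99                       # state dim, number of discrete controls
q_net = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, m))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# a fake replay buffer standing in for the dataset D collected by u_e
x = torch.randn(256, d)
u = torch.randint(0, m, (256,))
r = torch.randn(256)
x_next = torch.randn(256, d)
terminal = torch.zeros(256)

for it in range(100):
    with torch.no_grad():                      # r + gamma * (1 - terminal) * max_u' q(x', u')
        target = r + gamma * (1 - terminal) * q_net(x_next).max(dim=1).values
    q_xu = q_net(x).gather(1, u.unsqueeze(1)).squeeze(1)     # q_phi(x_t, u_t)
    loss = ((q_xu - target) ** 2).mean()                     # the objective in (9.4)
    opt.zero_grad()
    loss.backward()
    opt.step()
```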
D ← D ∪ Dk .
30 This is a reasonable idea but is not very useful in practice for two reasons.
31 First, if we use deep networks for parameterizing the Q-function, the
32 network can fit even very complex datasets so there is no reason to not
33 use the data points with low Bellman error in (9.4); the gradient using
34 them will be small anyway. Second, there are a lot of hyper-parameters
35 that determine prioritized sampling, e.g., the threshold beyond which
36 we consider the Bellman error to be high. These hyper-parameters are
37 quite difficult to use in practice and therefore it is a good idea to not use
38 prioritized experience replay at the beginning of development of your
39 method on a new problem.
Delayed target  Notice that the target also depends upon the weights φ. This creates a very big problem when we fit the Q-function: effectively, both the covariate and the target in (9.4) depend upon the weights of the Q-function. Minimizing the objective in (9.4) is akin to performing least-squares regression where the targets keep changing every time you solve for the solution. This is the root cause of why Q-Learning is difficult to use in practice. A popular hack to get around this problem is to use some old weights to compute the target, i.e., use the loss
\frac{1}{n(T+1)} \sum_{i,t} \left( q_{\varphi^k}(x^i_t, u^i_t) - \text{target}(x^i_{t+1}; \varphi'^{\,k}) \right)^2 \qquad (9.6)
where the delayed weights are updated as
\varphi'^{\,k+1} = (1 - \alpha)\, \varphi'^{\,k} + \alpha\, \varphi^k
with some small value, say α = 0.05. The target's weights are therefore an exponentially averaged version of the weights of the Q-function.
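Continuing the PyTorch sketch above, the delayed target can be maintained as a separate copy of the Q-network whose weights are exponentially averaged; the value α = 0.05 is the one mentioned in the text, and q_net refers to the network from the previous sketch.

```python
import copy
import torch

target_net = copy.deepcopy(q_net)        # q_{phi'}: a delayed copy of the Q-network

def update_target(q_net, target_net, alpha=0.05):
    """phi' <- (1 - alpha) * phi' + alpha * phi, an exponential average of the weights."""
    with torch.no_grad():
        for p_target, p in zip(target_net.parameters(), q_net.parameters()):
            p_target.mul_(1 - alpha).add_(alpha * p)
```

In the fitting loop, the target would then be computed with target_net(x_next) instead of q_net(x_next), and update_target would be called after every few gradient steps.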
31 1. For example, one reason could be that since qφk (x, u) for a given
32 state typically increases as we train for more iterations in Q-Learning,
where the second term is known as a proximal term that prevents the weights φ^k from changing too much from their old values φ'^{\,k}. Proximal objectives are more stable versions of the standard quadratic objective in (9.4) and help in cases when one is solving Q-Learning using SGD updates.
\mathop{\mathrm{argmax}}_{u}\; q^*(x, u)
19 then, even the delayed target qφk′ may be a similarly poor controller. The
20 ideal target is of course the return-to-come, or the value of the optimal
21 Q-function maxu′ q ∗ (x′ , u′ ), but we do not know it while fitting the Q-
22 function. The same problem also occurs if our Q-function (or its delayed
23 version, the target) is too optimistic about the values of certain control
24 inputs, it will consistently pick those controls in the max operator. One
25 hack to get around this problem is to pick the maximizing control input
26 using the non-delayed Q-function but use the value of the delayed target
\text{target}_{\text{DDQN}}(x^i_{t+1}; \varphi'^{\,k}) = r(x, u) + \gamma \left( 1 - \mathbf{1}_{\{x^i_{t+1} \text{ is terminal}\}} \right) q_{\varphi'^{\,k}}(x^i_{t+1}, u'), \qquad (9.8)
where
u' = \underbrace{\mathop{\mathrm{argmax}}_{u}\; q_{\varphi^k}(x^i_{t+1}, u)}_{\text{control chosen by the Q-function}}.
Training two Q-functions  We can also train two copies of the Q-function simultaneously, each with its own delayed target, and mix-and-match their targets. Let \varphi^{(1)k} and \varphi'^{(1)k} be one Q-function and target pair and \varphi^{(2)k} and \varphi'^{(2)k} be another pair. We update both of them using the following objective.
\text{For } \varphi^{(1)}: \quad \left( q^{(1)}(x, u) - r(x, u) - \gamma \left( 1 - \mathbf{1}_{\{x' \text{ is terminal}\}} \right) \text{target}_{\text{DDQN}}(x', \varphi'^{(2)k}) \right)^2
\text{For } \varphi^{(2)}: \quad \left( q^{(2)}(x, u) - r(x, u) - \gamma \left( 1 - \mathbf{1}_{\{x' \text{ is terminal}\}} \right) \text{target}_{\text{DDQN}}(x', \varphi'^{(1)k}) \right)^2 \qquad (9.9)
Sometimes we also use only one target that is the minimum of the two targets (this helps because it is a more pessimistic estimate of the true target):
\text{target}(x') := \min\left( \text{target}_{\text{DDQN}}(x', \varphi'^{(1)k}),\; \text{target}_{\text{DDQN}}(x', \varphi'^{(2)k}) \right).
4 You will also see many papers train multiple Q-functions, many more than
5 2. In such cases, it is a good idea to pick the control for evaluation using
6 all the Q-functions:
u^*(x) := \mathop{\mathrm{argmax}}_{u}\; \sum_k q_{\varphi^{(k)}}(x, u).
6 Effectively we are fitting a feedback controller that takes controls uθ∗ (x)
7 that are the maximizers of the Q-function. This is a natural analogue
8 of the argmax over controls for discrete/finite control spaces. Again we
9 should think of having a deep network that parametrizes the deterministic
10 controller and fitting its parameters θ using stochastic gradient descent
11 on (9.10)
12 where ω is the index of the datum in the dataset D. The equality was
13 obtained by applying the chain rule. This result is called the “deterministic
14 policy gradient” and we should think of it as the limit of the policy gradient
15 for a stochastic controller as the stochasticity goes to zero. Also notice
16 that the term
∇u qφ (xω , u)
17 is the gradient of the output of the Q-function qφ : X × U 7→ R with
18 respect to its second input u. Such gradients can also be easily computed
19 using backpropagation in PyTorch. It is different than the gradient of the
20 output with respect to its weights
∇φ qφ (xω , u).
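A minimal sketch of the resulting actor update in PyTorch is shown below: the critic q_φ is assumed to be already fitted as in the previous sections, and the dimensions and mini-batch are made up. The gradient of the critic's output with respect to its control input is obtained automatically by backpropagating through the concatenated input.

```python
import torch
import torch.nn as nn

d, m = 4, 2                                    # state and control dimensions (made up)
actor = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, m), nn.Tanh())
critic = nn.Sequential(nn.Linear(d + m, 64), nn.ReLU(), nn.Linear(64, 1))
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)

x = torch.randn(128, d)                        # a mini-batch of states from the dataset D
# maximize q_phi(x, u_theta(x)): gradients flow through the critic into the actor
loss = -critic(torch.cat([x, actor(x)], dim=1)).mean()
opt_actor.zero_grad()
loss.backward()                                # chain rule: grad_u q times grad_theta u_theta
opt_actor.step()
```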
Chapter 10
Model-based Reinforcement Learning
Reading
1. PILCO: A Model-Based and Data-Efficient Approach to Policy
Search, http://mlg.eng.cam.ac.uk/pub/pdf/DeiRas11.pdf
xk+1 = f (xk , uk )
and obtained some data from this robot using an exploratory controller u_e(· | x). Let us call this dataset D = \left\{ (x^i_t, u^i_t)_{t=0}^{T} \right\}_{i=1}^{n}; it consists of n trajectories each of length T timesteps. We can fit a deep network to learn the dynamics. This involves parameterizing the unknown dynamics using a deep network with weights w,
f_w : X \times U \mapsto X.
If the residual \| x^i_{t+1} - f_w(x^i_t, u^i_t) \|^2 is small on average over the dataset,
16 then we know that given some new control u′ ̸= uit , we can, for instance,
17 estimate the new future state x′ = fw (x, u′ ). In principle, we can use this
18 model now to execute algorithms like iterated LQR to find an optimal
19 controller. We could also imagine using this as our own simulator for the
20 robot, i.e., instead of drawing new trajectories in model-free RL from the
21 actual robot, we use our learned model of the dynamics to obtain more
22 data.
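A minimal sketch of fitting such a forward model by regression is shown below; the transitions are synthetic stand-ins for the dataset D and the network size is arbitrary.

```python
import torch
import torch.nn as nn

d, m = 3, 1
# synthetic transitions standing in for D = {(x_t^i, u_t^i, x_{t+1}^i)}
x = torch.randn(2048, d)
u = torch.randn(2048, m)
x_next = x + 0.1 * torch.cat([x[:, 1:], u], dim=1)      # pretend true dynamics

f_w = nn.Sequential(nn.Linear(d + m, 128), nn.ReLU(), nn.Linear(128, d))
opt = torch.optim.Adam(f_w.parameters(), lr=1e-3)
for it in range(500):
    pred = f_w(torch.cat([x, u], dim=1))
    loss = ((x_next - pred) ** 2).mean()   # average squared residual over the dataset
    opt.zero_grad()
    loss.backward()
    opt.step()
```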
f_w^{\mathrm{inv}} : X \times X \mapsto U.
The regression error for one sample in this case would be \| u^i_t - f_w^{\mathrm{inv}}(x^i_t, x^i_{t+1}) \|^2.
This is often a more useful model to learn from the data, e.g., if we want to use this model in a Rapidly-exploring Random Tree (RRT), we can sample states in the configuration space of the robot and have the learned dynamics guess the control between two states. Also see the paper on contact-invariant optimization (https://homes.cs.washington.edu/~todorov/papers/MordatchSIGGRAPH12.pdf) and a video at https://www.youtube.com/watch?v=mhr_jtQrhVA for an impressive demonstration of using an inverse model.
For a quick primer on planning using a model, see the notes at https://ocw.mit.edu/courses/aeronautics-and-astronautics/16-410-principles-of-autonomy-and-decision-making-fall-2010/lecture-notes/MIT16_410F10_lec15.pdf from Emilio Frazzoli (ETH/MIT).
Models can be wrong at parts of state-space where we have few
data This is really the biggest concern with using models. We have
seen in the chapter on deep learning that if we do not have data from
some part of the state-space, there are few guarantees of the model fw
or fwinv working well for those states. A planning algorithm does not
however know that the model is wrong for a given state. So the central
question in learning a model is “how to estimate the uncertainty of
the output of the model”, i.e.,
\mathbb{P}\left( x_{k+1} \ne f_w(x_k, u_k) \right)
where x_{k+1} is the true next state of the system and f_w(x_k, u_k) is
our prediction using the model. If we have a good estimate of
such uncertainty, we can safely use the model only at parts of the
state-space where this uncertainty is small.
2. Run the learned controller u0 (x) from the real system to collect
more data D1 and add it to the dataset
D ← D ∪ D1 .
This is a simple mechanism that ensures that we can collect more data from the system. If the controller goes to parts of the state-space where the model is incorrect, we get samples from such regions and iteratively improve both the learned dynamics model f_{w^k}(x, u) and the controller u^k(x) using this model.
µ := mean(µk ) + stddev(µk ).
Each subset of the data is created by sampling the original data with N samples with replacement. This is among the most influential ideas in statistics; see "Bootstrap Methods: Another Look at the Jackknife" (https://projecteuclid.org/journals/annals-of-statistics/volume-7/issue-1/Bootstrap-Methods-Another-Look-at-the-Jackknife/10.1214/aos/1176344552.full) because it is a very simple and general procedure to obtain the uncertainty of an estimate. Also see the very famous paper "Bagging predictors" (https://link.springer.com/article/10.1007/BF00058655), which introduced bagging, the idea underlying random forests.
i.e., the ensemble predicts the next state of the robot using the mean. The important benefit of using an ensemble is that we can also get an estimate of the error in these predictions, e.g., via the spread of the predictions across the ensemble members,
\text{error in } \hat x' = \left( \mathrm{var}_k\, f_{w_k}(x, u) \right)^{1/2}.
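A minimal sketch of such a bootstrapped ensemble is shown below; the data, the number of ensemble members, and the use of the standard deviation across members as the uncertainty estimate are illustrative assumptions.

```python
import torch
import torch.nn as nn

def fit_model(x, u, x_next, epochs=200):
    net = nn.Sequential(nn.Linear(x.shape[1] + u.shape[1], 64), nn.ReLU(),
                        nn.Linear(64, x.shape[1]))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(epochs):
        loss = ((x_next - net(torch.cat([x, u], dim=1))) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net

# bootstrap: each member sees the dataset resampled with replacement
x, u = torch.randn(1024, 3), torch.randn(1024, 1)
x_next = x + 0.1 * torch.tanh(u)                 # pretend true dynamics
ensemble = []
for k in range(5):
    idx = torch.randint(0, x.shape[0], (x.shape[0],))
    ensemble.append(fit_model(x[idx], u[idx], x_next[idx]))

query = torch.cat([torch.randn(1, 3), torch.zeros(1, 1)], dim=1)
with torch.no_grad():
    preds = torch.stack([net(query) for net in ensemble])
mean_pred = preds.mean(dim=0)      # the ensemble's prediction of x'
uncertainty = preds.std(dim=0)     # large where the members disagree, i.e., few data
```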
Chapter 11
Offline Reinforcement Learning
Reading
1. Offline Reinforcement Learning: Tutorial, Review, and Per-
spectives on Open Problems by Levine et al. (2020)
13 There are many other problems of this kind: data is plentiful, just that we
14 cannot get more.
Technically speaking, offline learning is a very clean problem in that we are close to the supervised learning setting (except that we do not know the true targets). A meaningful theoretical analysis of typical reinforcement learning algorithms is difficult because there are a lot of moving parts in the problem definition: exploratory controllers, the fact that we are adding correlated samples into our dataset as we draw more trajectories, function approximation properties of the neural network that do not allow the Bellman iteration to remain a contraction, etc. Some of these hurdles are absent in the analysis of offline learning.
\min_{\varphi}\; \sum_{i,t} \left( q^{\theta}_{\varphi^{(k+1)}}(x^i_t, u^i_t) - r(x^i_t, u^i_t) - \gamma \max_{u' \in U} q^{\theta}_{\varphi^{(k)}}(x^i_{t+1}, u') \right)^2.
2 This approach is unlikely to work very well. Observe the following figure.
4 We do not really know whether the initial value function qφθ assigns large
5 returns to controls that are outside of the ones in the dataset. If it does, then
6 these controls will be chosen in the maximization step while calculating
7 the target. If there are states where the value function over different
8 controls looks like this picture, then their targets will cause the value at
9 all other states to grow unbounded during training. This is exactly what
10 happens in practice. For example,
\varphi^* = \min_{\varphi}\; \frac{1}{nT} \sum_{i,t} \left( q^{\theta}_{\varphi^{(k+1)}}(x^i_t, u^i_t) - r(x^i_t, u^i_t) - \gamma\, q^{\theta}_{\varphi^{(k)}}(x^i_{t+1}, u_\theta(x^i_{t+1})) \right)^2 - \lambda\, \Omega(q_\varphi^\theta)
\theta^* = \min_{\theta}\; \frac{1}{nT} \sum_{i,t} q_\varphi^\theta(x^i_t, u_\theta(x^i_t)). \qquad (11.1)
We will use the regularizer

\Omega = \frac{1}{nT} \sum_{i,t} \Big( q^{\theta}_{\phi}\big(x^i_t, u_{\theta}(x^i_t)\big) - q^{\theta}_{\phi}(x^i_t, u^i_t) \Big)_+^2.
The notation (·)_+ denotes rectification, i.e., (x)_+ = x if x > 0 and zero
otherwise. Notice that the second term in the objective for fitting the value
function forces the value of the control u_θ(x^i_t) to be smaller than the
value of the control u^i_t taken at the same state in the dataset. This is a
conservative choice. While it does not fix the issue of extrapolation error,
it forces the value network to predict smaller values and prevents it from
blowing up.
A variant of this regularizer takes the maximum over controls sampled from the controller,

\Omega = \frac{1}{nT} \sum_{i,t} \Big( \max_{u' \sim u_{\theta}(x^i_t)} q^{\theta}_{\phi}(x^i_t, u') - q^{\theta}_{\phi}(x^i_t, u^i_t) \Big)_+^2.
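As an illustration of how such a conservative penalty can be combined with the value-fitting loss, here is a small sketch; the names q_net, q_target (a frozen copy of the q-network used for the target), and policy, as well as the training details, are assumptions made for illustration rather than the notes' prescribed implementation.

import torch
import torch.nn.functional as F

# Sketch of the conservative value-fitting loss: TD error on dataset controls
# plus lambda times the penalty Omega that discourages q(x, u_theta(x)) from
# exceeding q(x, u) for the control u actually taken in the dataset.

def conservative_q_loss(q_net, q_target, policy, x, u, r, x_next, gamma=0.99, lam=1.0):
    with torch.no_grad():
        target = r + gamma * q_target(x_next, policy(x_next))
    td_loss = F.mse_loss(q_net(x, u), target)

    gap = q_net(x, policy(x)) - q_net(x, u)      # q(x, u_theta(x)) - q(x, u)
    omega = (torch.relu(gap) ** 2).mean()        # rectified and squared, as in Omega

    return td_loss + lam * omega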
Chapter 12
Meta-Learning
Reading
1. Learning To Learn: Introduction (1996),
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.21.3140
5. Meta-Q-Learning https://arxiv.org/abs/1910.00125
The human visual system is proof that we do not need lots of images
to learn to identify objects, or lots of experiences to learn about concepts.
Consider the mushrooms shown in the image below.
The one on the left is called Muscaria and you'd be able to identify
the bright spots on this mushroom very easily after looking at one image.
The differences between the edible one in the center (Russula) and the one
on the right (Phalloides) may sometimes be subtle, but a few samples of
each are enough for humans to learn this important distinction.
This is not always the case; two tasks can also fight each other. Say
you design a system to classify ethnicities of people using two kinds
of features. Task 1 uses the shape of the nose to classify Caucasians
(long nose) vs. Africans (wide nose). Task 2 uses the kind of hair to
classify Caucasians (straight hair) vs. Africans (curly hair). An image of
a Caucasian person with curly hair clearly results in the two tasks fighting
each other.
The difficulty in meta-learning begins with defining what a task is.
While understanding what a task is may seem cumbersome but doable
for image classification, it is even harder for robotic systems.
2. The second kind of task is shown on the right. You can imagine
that after building/training a model for the robot in your lab you
want to install it in a factory. The factory robot might have 6 degrees-
of-freedom whereas the one in your lab had only 5; your policy
had better adapt to this slightly different state-space. An autonomous
car typically has about 10 cameras and 5 LIDARs; any learning
system on the car had better adapt itself to handle the situation when
one of these sensors breaks down. The gears on a robot manipulator
will degrade over the course of its life; our policies should adapt to
this degrading robot.
after fitting on the training data will be good at classifying some new input
image x as to whether it belongs to one of the C training classes. Note that
we have written the model as providing the probability distribution p_ŵ(· | x)
as the output, one real-valued scalar per candidate class {1, . . . , C}.
? Say we are interested in classifying images from classes that are different
from those in the training set. The model has only C outputs; effectively, the
universe is partitioned into C categories as far as the model is concerned and
it does not know about any other classes. How should one formalize the
problem of meta-learning then?

12.1.1 Fine-tuning

Let us now consider the following setup. In addition to our original dataset
of the base tasks, we are given a "few-shot dataset" that has c new classes
and s labeled samples per class, a total of n = cs new samples,

D' = \Big\{ (x^i, y^i) : y^i \in \{C+1, \ldots, C+c\} \Big\}_{i = N+1, \ldots, N+n}; \qquad \frac{n}{c} = s \ll \frac{N}{C}.
This models the situation where the model is forced to classify images
from rare classes, e.g., the three kinds of strawberries grown on a farm in
California, after being trained on data of cars/cats/dogs/planes etc.
We would like to adapt the parameters ŵ using this labeled few-shot
data. Here is one solution: we simply train the model again on the new
data. This looks like solving another optimization problem of the form

w^* = \mathop{\text{argmin}}_{w}\; -\frac{1}{n} \sum_{i=N+1}^{N+n} \log p_w(y^i \mid x^i) + \frac{\lambda}{2} \| w - \hat{w} \|_2^2.
(12.2)
The new parameters w can potentially do well on the new classes even if
the shot s is small because training is initialized using the parameters ŵ.
We write down this initialization using the second term

\frac{\lambda}{2} \| w - \hat{w} \|_2^2,

which keeps the parameters w being optimized close to their initialization
using a quadratic spring-force controlled by the parameter λ. We can
expect the new model p_{w^*} to perform well on the new classes if the
initialization ŵ was good, i.e., if the new tasks and the base tasks were
close to each other. This method is called fine-tuning; it is the easiest trick
to implement to handle new classes.
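A minimal sketch of this fine-tuning objective in PyTorch is given below; the model, data loader, and hyper-parameter values are assumptions made for illustration, not a prescribed setup from these notes.

import copy
import torch
import torch.nn.functional as F

# Sketch of fine-tuning with a proximal term as in (12.2): cross-entropy on the
# few-shot data plus (lambda/2) ||w - w_hat||^2, which keeps w close to the
# pretrained parameters w_hat. Assumes the classifier head already has outputs
# for the new classes.

def finetune(model, fewshot_loader, lam=0.1, lr=1e-3, epochs=10):
    w_hat = copy.deepcopy(model)                     # frozen copy of the pretrained weights
    for p in w_hat.parameters():
        p.requires_grad_(False)

    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in fewshot_loader:                  # the few-shot dataset D'
            loss = F.cross_entropy(model(x), y)      # -(1/n) sum_i log p_w(y^i | x^i)
            prox = sum(((p - p0) ** 2).sum()
                       for p, p0 in zip(model.parameters(), w_hat.parameters()))
            opt.zero_grad()
            (loss + 0.5 * lam * prox).backward()
            opt.step()
    return model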
? Think of a multi-layer neural network ŵ that has K outputs. The new
network should now produce m outputs; how should we modify this network?

12.1.2 Prototypical networks

The cross-entropy objective used in (12.1) to train the model p_ŵ simply
maximizes the log-likelihood of the labels given the data. It is reasonable
to think that since the base classes are not going to show up as the few-shot
classes, we should not be fitting to this objective.
Let us imagine the features of the model, e.g., the activations of the
last layer in the neural network,

z = \phi_w(x)

for a particular image x. Note that the features z depend on the parameters
w. During standard cross-entropy training, there is a linear layer on top of
these features and our prediction probability for class y is

p_w(y \mid x) = \frac{e^{w_y^\top z}}{\sum_{y'} e^{w_{y'}^\top z}}

where w_y ∈ R^{dim(z)}. This is the softmax operation and the vectors w_y are
the weights of the last layer of the network; when we wrote (12.1) we
implicitly folded those parameters into the notation w.
Prototypical networks train the model to be a discriminator as follows.
1. Each mini-batch during training time consists of a few-shot task
created out of the meta-training set by sub-sampling.
2. Compute a prototype for each class as the average feature of the samples
of that class in the episode,

\mu_y = \frac{1}{s} \sum_{(x^i, y^i) \in D^{\text{episode}}} \mathbf{1}_{\{y^i = y\}}\, \phi_w(x^i).
3. You can now impose a clustering loss to force the query samples to
be labeled correctly, i.e., maximize

p_{w, \mu_y}(y \mid x) = \frac{e^{-\| \phi_w(x) - \mu_y \|_2}}{\sum_{y'} e^{-\| \phi_w(x) - \mu_{y'} \|_2}}

averaged over the query samples,

\frac{1}{Cq} \sum_{(x^i, y^i) \in D^{\text{query}}} \log p_{w, \mu_{y^i}}(y^i \mid x^i).

4. Note that the gradient of the above expression flows both through the
prototypes µ_y, which play the role of the weights of the top layer, and
through the weights w of the lower layers.
5. We can now use the trained model for classifying new classes by
simply feeding the new images through the network, computing the
prototypes using the few labeled data, and computing the clustering
loss on the unlabeled query data at test time to see which prototype
the feature of a particular query datum is closest to.
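The following sketch shows the core of one episodic training step for prototypical networks; the embedding network, the episode format (labels 0, . . . , C−1 within the episode), and the use of the plain Euclidean distance are assumptions made for illustration.

import torch
import torch.nn.functional as F

# Sketch of the prototypical-network loss for a single few-shot episode.
# embed(x) computes the features phi_w(x); support/query tensors hold one episode.

def prototypical_loss(embed, support_x, support_y, query_x, query_y, num_classes):
    z_support = embed(support_x)                        # (C*s, d) support features
    z_query = embed(query_x)                            # (C*q, d) query features

    # Prototype of each class: mean feature of its support samples.
    prototypes = torch.stack([
        z_support[support_y == c].mean(dim=0) for c in range(num_classes)])

    # Negative distance to each prototype acts as the logit for that class.
    dists = torch.cdist(z_query, prototypes)            # (C*q, C) pairwise distances
    log_p = F.log_softmax(-dists, dim=1)

    # Maximize the log-likelihood of the correct prototype (minimize the NLL).
    return F.nll_loss(log_p, query_y)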
using the same objective in (12.1) but avoid overfitting the model on the
meta-training data, so that the model can be quickly adapted using the
few-shot data via gradient updates.
Here we consider an episode D^episode = D^support with D^query = ∅, i.e.,
there are no query shots. Let us define

\ell(w; D^{\text{support}}) = \frac{1}{Cs} \sum_{(x^i, y^i) \in D^{\text{support}}} \log p_w(y^i \mid x^i).

MAML adapts the parameters with one gradient step on each episode's support
set and evaluates the adapted parameters on the same episode, i.e., it maximizes

\ell^{\text{MAML}}(w) = \sum_i \ell\big( w + \alpha \nabla_w \ell(w; D_i);\; D_i \big),

where α is the step-size of the adaptation and D_i denotes the i-th episode.
Observe now that if there exist parameters w with ∇ℓ(w; D_i) = 0 for all
the episodes D_i, then the MAML gradient is also zero. In other words, if
there exist parameters w that work well for all tasks, then MAML may find
such parameters. However, in this case, the simple objective

\ell^{\text{multi-task}}(w; D) = \frac{1}{2} \sum_{i=1}^{2} \ell(w; D_i)
(12.5)

would also find such parameters.
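Here is a compact sketch of the MAML inner/outer loop for a tiny model whose parameters are explicit tensors; the architecture, step-sizes, and the sample_episode helper are assumptions, and it is written in terms of a loss to be minimized, so the adaptation is a descent step.

import torch
import torch.nn.functional as F

# Sketch of MAML with explicit parameter tensors, so the adapted parameters can
# be plugged directly into a second forward pass. sample_episode() is a
# hypothetical helper returning the support data (x, y) of one task.

def forward(params, x):
    w1, b1, w2, b2 = params
    return torch.relu(x @ w1 + b1) @ w2 + b2

def loss_fn(params, x, y):
    return F.cross_entropy(forward(params, x), y)

def maml_outer_step(params, sample_episode, num_tasks=4, inner_lr=1e-2, outer_lr=1e-3):
    outer_loss = 0.0
    for _ in range(num_tasks):
        x, y = sample_episode()
        # Inner adaptation: one gradient step on this episode's support set.
        grads = torch.autograd.grad(loss_fn(params, x, y), params, create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(params, grads)]
        # Outer objective: loss of the adapted parameters on the same episode
        # (no query shots, matching the setting described above).
        outer_loss = outer_loss + loss_fn(adapted, x, y)

    # create_graph=True lets second-order gradients flow through the inner step.
    outer_grads = torch.autograd.grad(outer_loss, params)
    return [(p.detach() - outer_lr * g).requires_grad_(True)
            for p, g in zip(params, outer_grads)]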
12.2 Problem formulation for meta-RL

We will assume that all the tasks have a shared state-space x^k_t ∈ X and a
shared control-space u^k_t ∈ U. The reward function r^k(x, u) of each task is
different, but we are maximizing the same infinite-horizon discounted
objective for each task. The q-function is then defined to be

q^{k, \theta_k}(x, u) = \mathop{E}_{\xi(\cdot)} \Big[ \sum_{t=0}^{\infty} \gamma^t\, r^k(x_t, u_t) \;\Big|\; x_0 = x,\; u_0 = u,\; u_t = u_{\theta_k}(x_t) \Big],

where u_{θ_k}(x_t) is a deterministic controller for task k. Given all these
meta-training tasks, our objective is to learn a controller that can solve a
new task k ∉ {1, . . . , K} upon being presented a few trajectories from
the new task. Think of yourself learning to pick up different objects during
training time and then adapting to pick up a new object that is not in the
training set.
\mathop{\text{argmax}}_{\theta_k}\; \mathop{E}_{(x,u) \in D^k} \Big[ q^{k, \theta_k}_{\phi_k}\big(x, u_{\theta_k}(x)\big) \Big].
(12.7)
We will call this variable a "context" because we can use it to guess which
task a particular trajectory is coming from. It is important to note that
the mixing coefficients α_i are shared across all the tasks. We would
like to think of this feature vector µ(τ) as a kind of indicator of whether a
trajectory τ belongs to the task k or not. We now learn a q-function and a
controller that also depend on µ(τ),

q^{\theta}_{\phi}\big(x, u, \mu(\tau)\big) \quad \text{and} \quad u_{\theta}\big(x, \mu(\tau)\big).
(12.9)

Including a context variable like µ(τ) allows the q-function to detect the
particular task that it is being executed for using the past t time-steps of
the trajectory τ_{0:t}. This is similar to learning independent q-functions
q^{k, θ_k}_{φ_k} and controllers u_{θ_k}, but there is parameter sharing going on in (12.9).
We will still solve the multi-task learning problem like (12.8) but also
optimize the parameters α_i that combine the basis functions.
\mathop{\text{argmin}}_{\phi, \alpha_i}\; \sum_{k=1}^{K} \mathop{E}_{(x,u,x') \in D^k} \Big[ \big( q^{\theta}_{\phi}(x, u, \mu(\tau)) - r^k(x, u) - \gamma\, q^{\theta}_{\phi}\big(x', u_{\theta}(x', \mu(\tau)), \mu(\tau)\big) \big)^2 \Big]

\mathop{\text{argmax}}_{\theta, \alpha_i}\; \sum_{k=1}^{K} \mathop{E}_{(x,u) \in D^k} \Big[ q^{\theta}_{\phi}\big(x, u_{\theta}(x, \mu(\tau)), \mu(\tau)\big) \Big].
(12.10)
The parameters α_i of the context join the q-functions and the controllers
of the different tasks together, but also allow the controller the freedom to
take different controls depending on which task it is being trained for.
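As a rough sketch of how such a context could be computed and fed to the controller (the choice of basis/feature functions of the trajectory, their number, and the network sizes are assumptions; the notes' precise definition of µ(τ) is not reproduced here):

import torch
import torch.nn as nn

# Sketch: a context mu(tau) formed as a weighted combination of feature
# functions of the past trajectory, with mixing coefficients alpha shared
# across all tasks; the controller (and q-function) take mu(tau) as an input.

class Context(nn.Module):
    def __init__(self, num_basis):
        super().__init__()
        # alpha_i: shared mixing coefficients, optimized jointly with q and the controller.
        self.alpha = nn.Parameter(torch.ones(num_basis) / num_basis)

    def forward(self, basis_features):
        # basis_features: (num_basis, feature_dim), e.g., statistics of (x_t, u_t, r_t)
        # computed over the past t steps of the trajectory tau_{0:t}.
        return (self.alpha[:, None] * basis_features).sum(dim=0)    # mu(tau)

class ContextConditionedController(nn.Module):
    def __init__(self, x_dim, ctx_dim, u_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + ctx_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, u_dim))

    def forward(self, x, mu):
        # Broadcast the single context vector across the batch of states.
        mu = mu.expand(x.shape[0], -1)
        return self.net(torch.cat([x, mu], dim=-1))                 # u_theta(x, mu(tau))

The q-function would be conditioned on µ(τ) in the same way, so that the losses in (12.10) can be optimized jointly over φ, θ, and the shared α_i.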
12.2.2 Discussion

This brings an end to the chapter on meta-learning and to Module 4. We
focused on adapting learning-based models for robotics to new tasks. This
adaptation can take the form of learning a reward (inverse RL), learning
the dynamics (model-based RL), or learning to adapt (meta-learning).
Adaptation to new data/tasks with few samples is a very pertinent problem
because we want learning-based methods to generalize to a variety of tasks
different from the ones they have been trained for. Such adaptation also comes
with certain caveats: adaptation may not always improve the performance
on new tasks, and understanding when one can and cannot adapt forms the bulk of
the research on meta-learning.
Bibliography

Censi, A. (2016). A class of co-design problems with cyclic constraints and their solution. IEEE Robotics and Automation Letters, 2(1):96–103.

Fakoor, R., Mueller, J. W., Asadi, K., Chaudhari, P., and Smola, A. J. (2021). Continuous doubly constrained batch reinforcement learning. Advances in Neural Information Processing Systems, 34:11260–11273.

Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.

Turing, A. M. (2009). Computing machinery and intelligence. In Parsing the Turing Test, pages 23–65. Springer.