
Dynamic Programming and

Optimal Control
Script
Prof. Raffaello D'Andrea
Lecture notes
Dieter Baldinger Thomas Mantel Daniel Rohrer
HS 2010
Contents

1 Introduction 3
  1.1 Class Objective 3
  1.2 Key Ingredients 3
  1.3 Open Loop versus Closed Loop Control 4
  1.4 Discrete State and Finite State Problem 5
  1.5 The Basic Problem 7

2 Dynamic Programming Algorithm 9
  2.1 Principle of Optimality 9
  2.2 The DPA 10
  2.3 Chess Match Strategy Revisited 11
  2.4 Converting non-standard problems 13
      2.4.1 Time Lags 13
      2.4.2 Correlated Disturbances 13
      2.4.3 Forecasts 14
  2.5 Deterministic, Finite State Systems 15
      2.5.1 Convert DP to Shortest Path Problem 15
      2.5.2 DP algorithm 16
      2.5.3 Forward DP algorithm 16
  2.6 Converting Shortest Path to DP 16
  2.7 Viterbi Algorithm 17
  2.8 Shortest Path Algorithms 18
      2.8.1 Label Correcting Methods 19
      2.8.2 A* Algorithm 21
  2.9 Multi-Objective Problems 21
      2.9.1 Extended Principle of Optimality 22
  2.10 Infinite Horizon Problems 23
  2.11 Stochastic, Shortest Path Problems 23
      2.11.1 Main Result 24
      2.11.2 Sketch of Proof 25
      2.11.3 Prove B 27
  2.12 Summary of previous lecture 27
  2.13 How do we solve Bellman's Equation? 28
      2.13.1 Method 1: Value Iteration (VI) 28
      2.13.2 Method 2: Policy Iteration (PI) 28
      2.13.3 Third Method: Linear Programming 31
      2.13.4 Analogies and Connections 32
  2.14 Discounted Problems 34

3 Continuous Time Optimal Control 36
  3.1 The Hamilton Jacobi Bellman (HJB) Equation 37
  3.2 Aside on Notation 40
      3.2.1 The Minimum Principle 42
  3.3 Extensions 45
      3.3.1 Fixed Terminal State 45
      3.3.2 Free initial state, with cost 46
  3.4 Linear Systems and Quadratic Costs 48
      3.4.1 Summary 52
  3.5 General Problem Formulation 52

Bibliography 56
Chapter 1
Introduction
1.1 Class Objective
The class objective is to make multiple decisions in stages to minimize a cost
that captures undesirable outcomes.
1.2 Key Ingredients
1. Underlying discrete time system:

   x_{k+1} = f_k(x_k, u_k, w_k),   k = 0, 1, ..., N-1

   k: discrete time index
   x_k: state
   u_k: control input, decision variable
   w_k: disturbance or noise, random parameters
   N: time horizon
   f_k: function, captures the system evolution

2. Additive cost function:

   g_N(x_N)  +  Σ_{k=0}^{N-1} g_k(x_k, u_k, w_k)
   (terminal cost)   (accumulated stage cost)

g_k is a given nonlinear function. The cost is a function of the control applied.

Because the w_k are random, we typically consider the expected cost:

   E_{w_k} [ g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, u_k, w_k) ]
Example 1: Inventory Control:

Keeping an item stocked in a warehouse. Too little, you run out (bad). Too much, cost of storage and misuse of capital (bad).

x_k: stock in the warehouse at the beginning of the k-th time period
u_k: stock ordered and immediately delivered at the beginning of the k-th time period
w_k: demand during the k-th period, with some given probability distribution

Dynamics:

   x_{k+1} = x_k + u_k - w_k

Excess demand is backlogged and corresponds to negative values of x_k.

Cost:

   E [ R(x_N) + Σ_{k=0}^{N-1} ( r(x_k) + c u_k ) ]

r(x_k): penalizes too much stock or negative stock
c u_k: cost of the items ordered
R(x_N): terminal cost from items at the end that can't be sold, or demand that can't be met

Objective: minimize the cost subject to u_k >= 0.
1.3 Open Loop versus Closed Loop Control
Open Loop: Come up with control inputs u_0, ..., u_{N-1} before k = 0. In open loop the objective is to calculate {u_0, ..., u_{N-1}}.

Closed Loop: Wait until time k to make the decision. Assumes x_k is measurable. Closed loop will always give performance at least as good, but is computationally much more expensive. In closed loop the objective is to calculate the optimal rule u_k = mu_k(x_k); pi = {mu_0, ..., mu_{N-1}} is a policy or control law.
Example 2:

   mu_k(x_k) = { s_k - x_k   if x_k < s_k
               { 0           otherwise

where s_k is some threshold.
1.4 Discrete State and Finite State Problem
When the state x_k takes on discrete values or is finite in size, it is often convenient to express the dynamics in terms of transition probabilities:

   P_{ij}(u, k) := Prob(x_{k+1} = j | x_k = i, u_k = u)

i: start state
j: possible future state
u: control input
k: time

This is equivalent to x_{k+1} = w_k, where w_k has the distribution

   Prob(w_k = j | x_k = i, u_k = u) := P_{ij}(u, k)
Example 3: Optimizing Chess Playing Strategies:

Two-game chess match with an opponent; the objective is to come up with a strategy that maximizes the chance of winning the match.

Each game can have two outcomes:
a) Win by one player: 1 point for the winner, 0 points for the loser.
b) Draw: 0.5 points for each player.

If the match is tied 1-1 at the end of 2 games, go into sudden death mode until someone wins a game.

Decision variable for the player, two playing styles:
1) Timid play: draw with probability p_d, lose with probability 1 - p_d.
2) Bold play: win with probability p_w, lose with probability 1 - p_w.

Assume p_d > p_w as a necessary condition for the problem to make sense.
Problem: What playing style should be chosen? Since it doesn't make sense to play timid if we are tied 1-1 at the end of 2 games, it is a 2-stage finite problem.
Transition Probability Graph: The graphs below show all possible outcomes.

[Figure 1.1: First Game — transition probability graphs from score 0-0 for timid play (draw ½-½ with probability p_d, loss 0-1 with probability 1 - p_d) and bold play (win 1-0 with probability p_w, loss 0-1 with probability 1 - p_w).]
[Figure 1.2: Second Game — transition probability graphs from the possible first-game scores (1-0, ½-½, 0-1) to the final scores (2-0, 3/2-1/2, 1-1, 1/2-3/2, 0-2) under (a) timid play and (b) bold play.]
Closed Loop Strategy: Play timid if and only if the player is ahead.

The probability of winning the match is:

   p_d p_w + p_w ( (1 - p_d) p_w + p_w (1 - p_w) )  =  p_w^2 (2 - p_w) + p_w (1 - p_w) p_d

For {p_w = 0.45, p_d = 0.9} and {p_w = 0.5, p_d = 1} the probabilities of winning are 0.54 and 0.625, respectively.
Open Loop Strategy Possibilities:
[Figure 1.3: Closed Loop Strategy — transition graph starting from 0-0: play bold in game one; after a win play timid, after a loss play bold; if tied 1-1, play bold in the tie-breaker. Terminal outcomes are WIN (3/2-1/2, 2-1) and LOSE (0-2, 1-2).]
1) Timid in the first 2 games: p_d^2 p_w
2) Bold in both: p_w^2 (3 - 2 p_w)
3) Bold in the first, timid in the second game: p_w p_d + p_w (1 - p_d) p_w
4) Timid in the first, bold in the second game: p_w p_d + p_w (1 - p_d) p_w

Clearly 1) is not the optimal OL strategy, because p_d^2 p_w <= p_d p_w <= p_d p_w + ...

The best open loop strategy yields:

   p_w^2 + p_w (1 - p_w) max(2 p_w, p_d)

The optimal OL strategy is 3) or 4) if p_d > 2 p_w, and 2) otherwise. It can be shown that if p_w <= 0.5, then the open loop probability of winning is <= 0.5.
1.5 The Basic Problem
Summarize the basic problem setup:

   x_{k+1} = f_k(x_k, u_k, w_k),   k = 0, 1, ..., N-1

   x_k ∈ S_k   state space
   u_k ∈ C_k   control space
   w_k ∈ D_k   disturbance space

u_k ∈ U(x_k) ⊆ C_k: the control is constrained not only as a function of time, but also of the current state.

w_k ~ P_R(· | x_k, u_k): the noise distribution can depend on the current state and the applied control.
Consider policies, or control laws,

   pi = {mu_0, mu_1, ..., mu_{N-1}}

where mu_k maps the state x_k into the control u_k = mu_k(x_k), such that mu_k(x_k) ∈ U(x_k) for all x_k ∈ S_k.

The set of all such pi is called the set of Admissible Policies, denoted Pi.

Given a policy pi, the expected cost of starting at state x_0 is

   J_pi(x_0) := E_{w_k} [ g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, mu_k(x_k), w_k) ]

Optimal Policy pi*:  J_{pi*}(x_0) <= J_pi(x_0) for all pi ∈ Pi
Optimal Cost:  J*(x_0) := J_{pi*}(x_0)
Chapter 2
Dynamic Programming
Algorithm
At the heart of the DP algorithm is the following very simple and intuitive
idea.
2.1 Principle of Optimality
Let pi* = {mu*_0, mu*_1, ..., mu*_{N-1}} be an optimal policy. Assume that in the process of using pi*, a state x_i occurs at time i. Consider the subproblem whereby at time i we are at state x_i and we want to minimize

   E_{w_k} [ g_N(x_N) + Σ_{k=i}^{N-1} g_k(x_k, mu_k(x_k), w_k) ].

Then the truncated policy {mu*_i, mu*_{i+1}, ..., mu*_{N-1}} is optimal for this subproblem.

The proof is simple: prove by contradiction. If the truncated policy were not optimal, you could find a different policy that would give a lower cost for the subproblem. Applying that policy to the original problem from time i onward would therefore give a lower cost, which contradicts that pi* was an optimal policy.
Example 4: Deterministic Scheduling Problem

We have 4 machines A, B, C, D that are used to make something. A must occur before B, and C before D. The solution is obtained by calculating the optimal cost for each node, beginning at the bottom of the tree. See figure 2.1.
[Figure 2.1: Scheduling tree of example 4, from the initial condition through the orderings A/C, AB/AC/CA/CD, ..., CDAB, with arc costs and the optimal cost-to-go for each node written above it (in circles).]
2.2 The DPA
For every initial state x_0, the optimal cost J*(x_0) is equal to J_0(x_0), given by the last step of the following recursive algorithm, which proceeds backwards in time from N-1 to 0:

Initialization:  J_N(x_N) = g_N(x_N) for all x_N ∈ S_N

Recursion:

   J_k(x_k) = min_{u_k ∈ U_k(x_k)} E_{w_k} [ g_k(x_k, u_k, w_k) + J_{k+1}( f_k(x_k, u_k, w_k) ) ]

where the expectation is taken with respect to P_R(· | x_k, u_k).

Furthermore, if u*_k = mu*_k(x_k) minimizes the recursion equation for each x_k and k, the policy pi* = {mu*_0, ..., mu*_{N-1}} is optimal.

Comments

- For each recursion step, we have to perform the optimization over all possible values x_k ∈ S_k, since we don't know a priori which states we will actually visit.
- This pointwise optimization is what gives us mu*_k.
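As an illustration, here is a minimal sketch (not from the lecture) of the backward DPA for a problem with finitely many states, controls, and disturbance values. The interface — functions `f`, `g`, `g_N` and a list of (disturbance, probability) pairs — is an assumption made for the sketch.

```python
def dp_algorithm(states, controls, disturbances, f, g, g_N, N):
    """Backward DP recursion for a finite basic problem (sketch).

    states, controls: lists of admissible states / controls
    disturbances: list of (w, prob) pairs for the stage disturbance
    f(k, x, u, w): system equation, returns the next state
    g(k, x, u, w): stage cost;  g_N(x): terminal cost
    Returns the cost-to-go tables J[k][x] and the policy mu[k][x].
    """
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:                         # initialization: J_N = g_N
        J[N][x] = g_N(x)
    for k in reversed(range(N)):             # recursion, k = N-1, ..., 0
        for x in states:
            best_cost, best_u = float("inf"), None
            for u in controls:
                # expected stage cost plus cost-to-go of the successor state
                cost = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                           for (w, p) in disturbances)
                if cost < best_cost:
                    best_cost, best_u = cost, u
            J[k][x], mu[k][x] = best_cost, best_u
    return J, mu
```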
Proof 1: (Read section 1.5 in [1] if you are mathematically inclined).
Denote pi^k := {mu_k, mu_{k+1}, ..., mu_{N-1}}.

Denote by

   J*_k(x_k) = min_{pi^k} E_{w_k,...,w_{N-1}} [ g_N(x_N) + Σ_{i=k}^{N-1} g_i(x_i, mu_i(x_i), w_i) ]

the optimal cost when starting at time k at state x_k. Finally, J*_N(x_N) = g_N(x_N).

We will show that J*_k = J_k generated by the DPA, which gives us the desired result when k = 0.

Induction: J*_N(x_N) = J_N(x_N), true for k = N.

Assume it is true for k+1: J*_{k+1}(x_{k+1}) = J_{k+1}(x_{k+1}) for all x_{k+1} ∈ S_{k+1}.

Then, since pi^k = {mu_k, pi^{k+1}}, we have

   J*_k(x_k) = min_{(mu_k, pi^{k+1})} E_{w_k,...,w_{N-1}} [ g_k(x_k, mu_k(x_k), w_k) + g_N(x_N) + Σ_{i=k+1}^{N-1} g_i(x_i, mu_i(x_i), w_i) ]

by the principle of optimality:

   = min_{mu_k} E_{w_k} [ g_k(x_k, mu_k(x_k), w_k) + min_{pi^{k+1}} E_{w_{k+1},...,w_{N-1}} [ g_N(x_N) + Σ_{i=k+1}^{N-1} g_i(x_i, mu_i(x_i), w_i) ] ]

by the definition of J*_{k+1} and the update equation:

   = min_{mu_k} E_{w_k} [ g_k(x_k, mu_k(x_k), w_k) + J*_{k+1}( f_k(x_k, mu_k(x_k), w_k) ) ]

by the induction hypothesis:

   = min_{mu_k} E_{w_k} [ g_k(x_k, mu_k(x_k), w_k) + J_{k+1}( f_k(x_k, mu_k(x_k), w_k) ) ]

   = min_{u_k ∈ U_k(x_k)} E_{w_k} [ g_k(x_k, u_k, w_k) + J_{k+1}( f_k(x_k, u_k, w_k) ) ]

   = J_k(x_k)

In other words: minimizing over a function (a policy) can be done pointwise, one state at a time.

J_k(x_k) is called the cost-to-go at state x_k.
J_k(·) is called the cost-to-go function.
2.3 Chess Match Strategy Revisited
Recall:

- Timid play: prob. of a tie = p_d, prob. of a loss = 1 - p_d
- Bold play: prob. of a win = p_w, prob. of a loss = 1 - p_w
- 2 game match, plus a tie breaker if necessary

Objective: Find the policy which maximizes the probability of winning the match. We will solve this with DP, replacing min by max. Assume p_d > p_w.

Define x_k = difference between our score and the opponent's score at the end of game k. Recall: 1 point for a win, 0 for a loss, 0.5 for a tie.

Define J_k(x_k) = probability of winning the match at time k if the state is x_k.

Start of the recursion:

   J_2(x_2) = { 1     if x_2 > 0
              { p_w   if x_2 = 0
              { 0     if x_2 < 0

Recursive equation:

   J_k(x_k) = max[ p_d J_{k+1}(x_k) + (1 - p_d) J_{k+1}(x_k - 1)   (timid),
                   p_w J_{k+1}(x_k + 1) + (1 - p_w) J_{k+1}(x_k - 1)   (bold) ]

Convince yourself that this is equivalent to the formal definition:

   J_k(x_k) = max_{u_k} E_{w_k} [ g_k(x_k, u_k, w_k) + J_{k+1}( f_k(x_k, u_k, w_k) ) ]

Note: There is only a terminal cost in this problem.

   J_1(x_1) = max[ p_d J_2(x_1) + (1 - p_d) J_2(x_1 - 1),  p_w J_2(x_1 + 1) + (1 - p_w) J_2(x_1 - 1) ]

If x_1 = 1:  max[ p_d + (1 - p_d) p_w  (timid),  p_w + (1 - p_w) p_w  (bold) ]

Which is bigger? Timid - Bold = (p_d - p_w)(1 - p_w) > 0, so timid is optimal, and J_1(1) = p_d + (1 - p_d) p_w.

If x_1 = 0:  max[ p_d p_w + (1 - p_d) · 0  (timid),  p_w + (1 - p_w) · 0  (bold) ]

The maximum is p_w, so J_1(0) = p_w and bold is the optimal strategy.

If x_1 = -1:  max[ 0, p_w^2 ], so J_1(-1) = p_w^2 and the optimal strategy is bold.

   J_0(0) = max[ p_d J_1(0) + (1 - p_d) J_1(-1),  p_w J_1(1) + (1 - p_w) J_1(-1) ]
          = max[ p_d p_w + (1 - p_d) p_w^2,  p_w (p_d + (1 - p_d) p_w) + (1 - p_w) p_w^2 ]
          = max[ p_d p_w + (1 - p_d) p_w^2,  p_d p_w + (1 - p_d) p_w^2 + (1 - p_w) p_w^2 ]

So J_0(0) = p_d p_w + (1 - p_d) p_w^2 + (1 - p_w) p_w^2, and the optimal strategy at 0-0 is bold.

Optimal Strategy: If ahead, play timid; otherwise play bold.
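A quick numeric check of the recursion above (a sketch, not part of the original notes): enumerate the states x ∈ {-1, 0, 1} after game one and propagate backwards.

```python
def chess_match(p_w, p_d):
    """Backward recursion for the two-game chess match; returns J_0(0) and the policies."""
    # J_2: probability of winning the match given the score difference after game 2
    J2 = lambda x: 1.0 if x > 0 else (p_w if x == 0 else 0.0)

    def step(J_next, x):
        timid = p_d * J_next(x) + (1 - p_d) * J_next(x - 1)
        bold = p_w * J_next(x + 1) + (1 - p_w) * J_next(x - 1)
        return max(timid, bold), ("timid" if timid >= bold else "bold")

    J1 = {x: step(J2, x) for x in (-1, 0, 1)}          # (value, action) after game one
    J0, action0 = step(lambda x: J1[x][0], 0)
    return J0, action0, {x: J1[x][1] for x in J1}

print(chess_match(0.45, 0.9))   # about 0.537; bold at 0-0, timid only when ahead
```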
2.4 Converting non-standard problems to the Basic Problem
2.4.1 Time Lags
Assume the update equation is of the following form:

   x_{k+1} = f_k(x_k, x_{k-1}, u_k, u_{k-1}, w_k)

Define y_k = x_{k-1} and s_k = u_{k-1}. Then

   ( x_{k+1} )   ( f_k(x_k, y_k, u_k, s_k, w_k) )
   ( y_{k+1} ) = ( x_k                          )   =: f~_k(x~_k, u_k, w_k)
   ( s_{k+1} )   ( u_k                          )

Let x~_k = (x_k, y_k, s_k), so that x~_{k+1} = f~_k(x~_k, u_k, w_k).

The control is u_k = mu_k(x_k, u_{k-1}, x_{k-1}).
This can be generalized to more than one time lag.
2.4.2 Correlated Disturbances
If the disturbances are not independent, but can be modeled as the output of a system driven by independent disturbances, we speak of colored noise.

Example 5:

   w_k = C_k y_{k+1}
   y_{k+1} = A_k y_k + xi_k

A_k, C_k are given, and {xi_k} is independent. As usual, x_{k+1} = f_k(x_k, u_k, w_k). Then

   ( x_{k+1} )   ( f_k( x_k, u_k, C_k (A_k y_k + xi_k) ) )
   ( y_{k+1} ) = ( A_k y_k + xi_k                        )

and u_k = mu_k(x_k, y_k),

which is now in the standard form. In general, y_k cannot be measured and must be estimated.
2.4.3 Forecasts
This covers the case when the state information includes knowledge of probability distributions. At the beginning of each period k, we receive information about the probability distribution of w_{k+1}. In particular, assume w_{k+1} could have one of the probability distributions {Q_1, Q_2, ..., Q_m}, with a priori probabilities p_1, ..., p_m. At time k, we receive the forecast i that Q_i is used to generate w_{k+1}. Model this as follows: y_{k+1} = xi_k, where xi_k is a random variable taking value i with probability p_i. In particular, w_k has probability distribution Q_{y_k}.

Then we have

   ( x_{k+1} )   ( f_k(x_k, u_k, w_k) )
   ( y_{k+1} ) = ( xi_k               )

The new state is x~_k = (x_k, y_k). Since y_k is known at time k, we have a Basic Problem formulation. The new disturbance w~_k = (w_k, xi_k) depends on the current state, which is allowed. The DPA takes on the following form:

   J_N(x_N, y_N) = g_N(x_N)

   J_k(x_k, y_k) = min_{u_k} E_{w_k} E_{xi_k} [ g_k(x_k, u_k, w_k) + J_{k+1}( f_k(x_k, u_k, w_k), xi_k ) | y_k ]

                 = min_{u_k} E_{w_k} [ g_k(x_k, u_k, w_k) + Σ_{i=1}^{m} p_i J_{k+1}( f_k(x_k, u_k, w_k), i ) | y_k ]

where the conditional expectation simply means that w_k has probability distribution Q_{y_k}: for y_k ∈ {1, ..., m}, the expectation over w_k is taken with respect to the distribution Q_{y_k}.
2.5 Deterministic, Finite State Systems
Recall the Basic Problem:

   x_{k+1} = f_k(x_k, u_k, w_k),   k = 0, ..., N-1
   g_k(x_k, u_k, w_k): cost at stage k.

Consider problems where
1. x_k ∈ S_k, with S_k a finite set,
2. there are no disturbances w_k.

We assume, without loss of generality, that there is only one way to go from state i ∈ S_k to j ∈ S_{k+1} (if there is more than one way, pick the one with lowest cost at stage k).
2.5.1 Convert DP to Shortest Path Problem
[Figure 2.2: General shortest path problem — an artificial start node S, the states of stages 1 through N, and an artificial terminal node T.]

a^k_{ij} = cost to go from state i ∈ S_k to state j ∈ S_{k+1} at time k. This is set to ∞ if there is no way to go from i ∈ S_k to j ∈ S_{k+1}.

a^N_{iT} = terminal cost of state i ∈ S_N.

In other words,

   a^k_{ij} = g_k(i, u^{ij}_k),  where j = f_k(i, u^{ij}_k)
   a^N_{iT} = g_N(i)
2.5.2 DP algorithm
   J_N(i) = a^N_{iT},   i ∈ S_N
   J_k(i) = min_{j ∈ S_{k+1}} [ a^k_{ij} + J_{k+1}(j) ],   i ∈ S_k,  k = 0, ..., N-1

This solves the shortest path problem.
2.5.3 Forward DP algorithm
By inspection, the problem is symmetric: the shortest path from S to T is the same as from T to S, motivating the following algorithm, where J~_k(j) is the optimal cost to arrive at state j:

   J~_N(j) = a^0_{Sj},   j ∈ S_1
   J~_k(j) = min_{i ∈ S_{N-k}} [ a^{N-k}_{ij} + J~_{k+1}(i) ],   j ∈ S_{N-k+1},  k = 1, ..., N
   J~_0(T) = min_{i ∈ S_N} [ a^N_{iT} + J~_1(i) ]

and J~_0(T) = J_0(S).
2.6 Converting Shortest Path to DP
[Figure 2.3: Another path problem, from a start node to an end node, in which cycles are allowed.]
As an example for a mental picture, one could imagine cities on a map.
Let {1, 2, ..., N, T} be the nodes of the graph, a_{ij} the cost to move from i to j, with a_{ij} = ∞ if there is no edge. Here i and j denote nodes, as opposed to the previous section where they denoted states.

Assume that all cycles have non-negative cost. This isn't an issue if all edges have cost >= 0.

Note that with the above assumption, an optimal path has length <= N (it need not visit any node more than once).

Set up the problem so that we require exactly N moves, where degenerate moves are allowed (a_{ii} = 0):

   J_k(i) = optimal cost of getting from i to T in N - k moves
   J_N(i) = a_{iT}   (can be infinite, of course)
   J_k(i) = min_j [ a_{ij} + J_{k+1}(j) ]

(the optimal N - k move path is a_{ij} plus an optimal N - k - 1 move path from j). Notice that degenerate moves are allowed (remove them in the end). Terminate the procedure early if J_k(i) = J_{k+1}(i) for all i.
2.7 Viterbi Algorithm
This is a powerful combination of DP and Bayes' rule for optimal estimation.

Given a Markov chain with state transition probabilities p_{ij}:

   p_{ij} = P(x_{k+1} = j | x_k = i),   1 <= i, j <= M

p(x_0) = initial probability distribution of the starting state.

We can only indirectly observe the state via measurements:

   r(z; i, j) = P(meas = z | x_k = i, x_{k+1} = j)   for all k

where r is the likelihood function.
Objective: Given measurements Z_N = {z_1, ..., z_N}, construct the estimate X^_N = {x^_0, ..., x^_N} that maximizes P_R(X_N | Z_N) over all X_N = {x_0, ..., x_N}: the most likely state sequence.

Recall P_R(X_N, Z_N) = P_R(X_N | Z_N) P_R(Z_N). For a given Z_N, maximizing P_R(X_N, Z_N) over X_N gives the same result as maximizing P_R(X_N | Z_N) over X_N.

   P_R(X_N, Z_N) = P_R(x_0, ..., x_N, z_1, ..., z_N)
                 = P_R(x_1, ..., x_N, z_1, ..., z_N | x_0) P_R(x_0)
                 = P_R(x_2, ..., x_N, z_2, ..., z_N | x_0, x_1, z_1) P_R(x_1, z_1 | x_0) P_R(x_0)
                 = P_R(x_2, ..., x_N, z_2, ..., z_N | x_0, x_1, z_1) P_R(z_1 | x_0, x_1) P_R(x_1 | x_0) P_R(x_0)
                 = P_R(x_2, ..., x_N, z_2, ..., z_N | x_0, x_1, z_1) r(z_1; x_0, x_1) p_{x_0, x_1} P_R(x_0)

One more step:

   P_R(x_2, ..., x_N, z_2, ..., z_N | x_0, x_1, z_1)
     = P_R(x_3, ..., x_N, z_3, ..., z_N | x_0, x_1, z_1, x_2, z_2) P_R(x_2, z_2 | x_0, x_1, z_1)
     = P_R(x_3, ..., x_N, z_3, ..., z_N | x_0, x_1, z_1, x_2, z_2) P_R(z_2 | x_0, x_1, z_1, x_2) P_R(x_2 | x_0, x_1, z_1)
     = P_R(x_3, ..., x_N, z_3, ..., z_N | x_0, x_1, z_1, x_2, z_2) r(z_2; x_1, x_2) p_{x_1, x_2}

Keep going, and one gets:

   P_R(X_N, Z_N) = P_R(x_0) Π_{k=1}^{N} p_{x_{k-1}, x_k} r(z_k; x_{k-1}, x_k)

Assume that all quantities are > 0 (if some are = 0, the algorithm can be modified). Since the above is strictly positive (by assumption), and the log function is monotonically increasing in its argument, maximizing P_R(X_N, Z_N) is equivalent to

   min_{X_N} [ -log(P_R(x_0)) - Σ_{k=1}^{N} log( p_{x_{k-1}, x_k} r(z_k; x_{k-1}, x_k) ) ]

which is a deterministic shortest path problem over the sequence of states.
Forward DP: At time k, we can calculate the cost to arrive at any state. We don't have to wait until the end to solve the problem.
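A compact sketch of this forward recursion in negative-log space (the interface — the prior, the matrix P of transition probabilities, and the likelihood function r — is an assumption made for illustration):

```python
import math

def viterbi(prior, P, r, Z):
    """Most likely state sequence for the model above (sketch, assumes all probabilities > 0).

    prior[i]: P(x_0 = i);  P[i][j]: transition probability p_ij
    r(z, i, j): measurement likelihood P(meas = z | x_k = i, x_{k+1} = j)
    Z: list of measurements z_1, ..., z_N
    """
    n = len(prior)
    D = [-math.log(prior[i]) for i in range(n)]      # cost to "arrive" at x_0 = i
    parents = []
    for z in Z:
        D_new, back = [], []
        for j in range(n):
            # cost of arriving at j: best predecessor i plus arc length -log(p_ij * r)
            costs = [D[i] - math.log(P[i][j] * r(z, i, j)) for i in range(n)]
            i_best = min(range(n), key=lambda i: costs[i])
            D_new.append(costs[i_best])
            back.append(i_best)
        D = D_new
        parents.append(back)
    # backtrack the shortest path from the best final state
    x = [min(range(n), key=lambda i: D[i])]
    for back in reversed(parents):
        x.append(back[x[-1]])
    return list(reversed(x))                         # estimated x_0, ..., x_N
```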
2.8 Shortest Path Algorithms
Look at alternatives to DP for problems that are finite and deterministic. Path length corresponds to cost.
2.8.1 Label Correcting Methods
Assume a_{ij} >= 0. Arc length = cost to go from node i to node j, >= 0.

[Figure 2.4: Diagram of the label correcting algorithm — remove a node i from the OPEN bin, and for each child j test whether d_i + a_{ij} < d_j and d_i + a_{ij} < d_T; if so, set d_j = d_i + a_{ij}.]

Let d_i be the length of the shortest path to i found so far.

Step 0: Place node S in the OPEN bin, set d_S = 0 and d_j = ∞ for all other j.

Step 1: Remove a node i from OPEN, and execute Step 2 for all children j of i.

Step 2: If d_i + a_{ij} < min(d_j, d_T), set d_j = d_i + a_{ij} and set i to be the parent of j. If j != T, place j in OPEN if it is not already there.

Step 3: If OPEN is empty, done. If not, go back to Step 1.
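A sketch of steps 0–3, using a last-in-first-out OPEN bin (the depth-first variant used in the example below). The graph representation — `children[i]` as a list of (j, a_ij) pairs — is an assumption for the sketch.

```python
def label_correcting(children, S, T):
    """Shortest path from S to T with non-negative arc lengths (sketch).

    children[i]: list of (j, a_ij) pairs reachable from node i.
    Returns (d_T, path) or (inf, None) if T is unreachable.
    """
    d = {S: 0.0}                         # d_i: best path length to i found so far
    parent = {}
    d_T = float("inf")
    OPEN = [S]
    while OPEN:
        i = OPEN.pop()                   # Step 1: LIFO removal (depth-first)
        for j, a_ij in children[i]:      # Step 2: examine all children of i
            if d[i] + a_ij < min(d.get(j, float("inf")), d_T):
                d[j], parent[j] = d[i] + a_ij, i
                if j == T:
                    d_T = d[j]           # improved best complete path
                elif j not in OPEN:
                    OPEN.append(j)
    if d_T == float("inf"):
        return d_T, None
    path, node = [T], T                  # Step 3 done: reconstruct path via parents
    while node != S:
        node = parent[node]
        path.append(node)
    return d_T, list(reversed(path))
```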
Example 6: Deterministic Scheduling Problem (revisited)
[Figure 2.5: Deterministic scheduling problem as a shortest path problem from I.C. to T, with arc costs; constraints: A before B, C before D.]

Iteration # | Remove | OPEN                   | d_T | OPTIMAL
0           |        | S(0)                   |     |
1           | S      | A(5), C(3)             |     |
2           | C      | A(5), CA(7), CD(9)     |     |
3           | CD     | A(5), CA(7), CDA(12)   |     |
4           | CDA    | A(5), CA(7)            | 14  | CDAB
5           | CA     | A(5), CAB(9), CAD(11)  | 14  | CDAB
6           | CAD    | A(5), CAB(9)           | 14  | CDAB
7           | CAB    | A(5)                   | 10  | CABD
8           | A      | AB(7), AC(8)           | 10  | CABD
9           | AC     | AB(7)                  | 10  | CABD
10          | AB     |                        | 10  | CABD
Done, optimal cost = 10, optimal path = CABD.
Different ways to remove items from OPEN give different, well known, algorithms:

Depth-First Search: Last in, first out. What we did in the example. Finds a feasible path quickly. Also good if you have limited memory.

Best-First Search: Remove the node with the best label. Dijkstra's method. The remove step is more expensive, but can give good performance.

Breadth-First Search: First in, first out. Bellman-Ford.
2.8.2 A* Algorithm

The workhorse for many AI applications, e.g. path planning.

Basic idea: Replace the test d_i + a_{ij} < d_T by d_i + a_{ij} + h_j < d_T, where h_j is a lower bound on the shortest distance from j to T. Indeed, if d_i + a_{ij} + h_j >= d_T, it is clear that a path going through j will not be optimal.
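The only change to the label-correcting sketch above is the admission test, which now also adds the lower bound h_j. A minimal sketch (assuming a dictionary `h` of lower bounds with h[T] = 0):

```python
def a_star_label_correcting(children, S, T, h):
    """Label correcting method with the A* test d_i + a_ij + h_j < d_T (sketch).

    h[j] must be a lower bound on the shortest distance from j to T, with h[T] == 0.
    """
    d = {S: 0.0}
    d_T = float("inf")
    OPEN = [S]
    while OPEN:
        i = OPEN.pop()
        for j, a_ij in children[i]:
            # admit j only if it can improve d_j AND can still beat the best complete path
            if d[i] + a_ij < d.get(j, float("inf")) and d[i] + a_ij + h[j] < d_T:
                d[j] = d[i] + a_ij
                if j == T:
                    d_T = d[j]
                elif j not in OPEN:
                    OPEN.append(j)
    return d_T
```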
2.9 Multi-Objective Problems
Example 7 (Motivation): We care about both time and fuel.

[Figure 2.6: Possible outcomes in the time-fuel plane — the non-inferior points form the lower-left frontier; all other points are inferior.]

A vector x = (x_1, x_2, ..., x_M) ∈ S is non-inferior if there is no other y ∈ S such that y_l <= x_l for l = 1, ..., M, with strict inequality for at least one of these l.

Given a problem with M cost functions f_1(x), ..., f_M(x), x ∈ X is a non-inferior solution if the vector (f_1(x), ..., f_M(x)) is a non-inferior vector of the set {(f_1(y), ..., f_M(y)) | y ∈ X}.

Reasonable goal: find all non-inferior solutions, then use another criterion to pick which one you actually want to use.
How this applies to deterministic, finite state DP (which is equivalent to a shortest path problem):

   x_{k+1} = f_k(x_k, u_k)   (dynamics)

   g^l_N(x_N) + Σ_{k=0}^{N-1} g^l_k(x_k, u_k),   l = 1, ..., M   (M cost functions)
2.9.1 Extended Principle of Optimality
If {u_k, ..., u_{N-1}} is a non-inferior control sequence for the tail subproblem that starts at x_k, then {u_{k+1}, ..., u_{N-1}} is also non-inferior for the tail subproblem that starts at f_k(x_k, u_k). Simple proof: by contradiction.

Algorithm: First define what we will do the recursion over:

F_k(x_k): the set of M-tuples (vectors of size M) of costs to go at x_k which are non-inferior.

F_N(x_N) = {(g^1_N(x_N), ..., g^M_N(x_N))}: only one element in the set for each x_N.

Given F_{k+1}(x_{k+1}) for all x_{k+1}, generate for each state x_k the set of vectors

   ( g^1_k(x_k, u_k) + c_1, ..., g^M_k(x_k, u_k) + c_M ),   with (c_1, ..., c_M) ∈ F_{k+1}( f_k(x_k, u_k) ).

These are all possible costs that are consistent with F_{k+1}(x_{k+1}). Then, to obtain F_k(x_k), simply extract all non-inferior elements.

[Figure 2.7: Possible cost-to-go sets F_{k+1}(·) reachable from x_k.]

When we calculate F_0(x_0), we will have all non-inferior solutions.
2.10 Infinite Horizon Problems
Consider the time (or iteration) invariant case:

   x_{k+1} = f(x_k, u_k, w_k),   x_k ∈ S,  u_k ∈ U,  w_k ~ P(· | x_k, u_k)

   J_pi(x_0) = E [ Σ_{k=0}^{N-1} g(x_k, mu_k(x_k), w_k) ],   no terminal cost.

Write down the DP algorithm:

   J_N(x_N) = 0
   J_k(x_k) = min_{u_k ∈ U} E_{w_k} [ g(x_k, u_k, w_k) + J_{k+1}( f(x_k, u_k, w_k) ) ]   for all k

Question: What happens as N → ∞? Does the problem become easier? Yes. Reason: we lose the notion of time. For a very large class of problems, we have the Bellman Equation:

   J*(x) = min_{u ∈ U} E_w [ g(x, u, w) + J*( f(x, u, w) ) ]   for all x ∈ S

The Bellman Equation involves solving for the optimal cost-to-go function J*(x) for all x ∈ S. u = mu(x) gives the optimal policy (mu(·) is obtained from the solution of the Bellman Equation: for every x there is a minimizing u).

- There are efficient methods for solving the Bellman Equation.
- There are technical conditions on when this can be done.
2.11 Stochastic, Shortest Path Problems
   x_{k+1} = w_k,   x_k ∈ S, a finite set
   P_R(w_k = j | x_k = i, u_k = u) = p_{ij}(u),   u_k ∈ U(x_k), a finite set
We have a finite number of states. The transition from one state to the next is dictated by p_{ij}(u): the probability that the next state is j given that the current state is i. u is the control input; we can control what these transition probabilities are, from a finite set of options u ∈ U(i). The problem data is time (or iteration) independent.

Cost: Given an initial state i and a policy pi = {mu_0, mu_1, ...},

   J_pi(i) = lim_{N→∞} E [ Σ_{k=0}^{N-1} g(x_k, mu_k(x_k)) | x_0 = i ].

Optimal cost from state i: J*(i).

A stationary policy pi = {mu, mu, ...} is simply referred to as mu; denote by J_mu(i) the resulting cost. mu is optimal if

   J_mu(i) = J*(i) = min_pi J_pi(i).

Assumptions:

- Existence of a cost-free termination state t:

     p_tt(u) = 1 for all u,   g(t, u) = 0 for all u.

  This is a sufficient condition to make the cost meaningful. Think of this as a destination state.

- There exists an integer m such that for all admissible policies pi

     rho_pi = max_{i=1,...,n} P_R(x_m ≠ t | x_0 = i, pi) < 1

This is a strong assumption, which is only required for the proofs.
2.11.1 Main Result
A) Given any initial conditions J_0(1), ..., J_0(n), the sequence

   J_{k+1}(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_{ij}(u) J_k(j) ]   for all i

converges to the optimal cost J*(i) for each i.
B) The optimal cost satisfies Bellman's Equation:

   J*(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_{ij}(u) J*(j) ]   for all i

which has a unique solution.
2.11.2 Sketch of Proof
A0) First prove that the cost is bounded.

Recall: there exists m such that for all policies pi,

   rho_pi := max_i P_R(x_m ≠ t | x_0 = i, pi) < 1.

Since all problem data is finite, rho := max_pi rho_pi < 1.

   P_R(x_{2m} ≠ t | x_0 = i, pi) = P_R(x_{2m} ≠ t | x_m ≠ t, x_0 = i, pi) · P_R(x_m ≠ t | x_0 = i, pi) <= rho^2

Generally, P_R(x_{km} ≠ t | x_0 = i, pi) <= rho^k.

Furthermore, the cost incurred between the periods km and (k+1)m - 1 is at most

   m rho^k max_{i,u} |g(i, u)| =: rho^k M,   where M := m max_{i,u} |g(i, u)|,

so

   |J_pi(i)| <= Σ_{k=0}^{∞} M rho^k = M / (1 - rho),   which is finite.

A1)

   J_pi(x_0) = lim_{N→∞} E [ Σ_{k=0}^{N-1} g(x_k, mu_k(x_k)) ]
             = E [ Σ_{k=0}^{mK-1} g(x_k, mu_k(x_k)) ] + lim_{N→∞} E [ Σ_{k=mK}^{N-1} g(x_k, mu_k(x_k)) ]

By the previous bound, we know that

   | lim_{N→∞} E [ Σ_{k=mK}^{N-1} g(x_k, mu_k(x_k)) ] | <= M rho^K / (1 - rho).

As expected, we can make the tail as small as we want.
A2) Recall that we can view J_0 as a terminal cost function, with J_0(i) given. Bound its expected value:

   | E( J_0(x_{mK}) ) | = | Σ_{i=1}^{n} P_R(x_{mK} = i | x_0, pi) J_0(i) |
                        <= ( Σ_{i=1}^{n} P_R(x_{mK} = i | x_0, pi) ) max_i |J_0(i)|
                        <= rho^K max_i |J_0(i)|

A3) Sandwich:

   E( J_0(x_{mK}) ) + E [ Σ_{k=0}^{mK-1} g(x_k, mu_k(x_k)) ]
     = E( J_0(x_{mK}) ) + J_pi(x_0) - lim_{N→∞} E [ Σ_{k=mK}^{N-1} g(x_k, mu_k(x_k)) ]

Recall that if a = b + c, then

   a <= b + |c|   (since c <= |c|)
   a >= b - |c|

It follows that

   -rho^K max_i |J_0(i)| - M rho^K / (1 - rho) + J_pi(x_0)
     <= E [ J_0(x_{mK}) + Σ_{k=0}^{mK-1} g(x_k, mu_k(x_k)) ]
     <= rho^K max_i |J_0(i)| + M rho^K / (1 - rho) + J_pi(x_0)

A4) We take the minimum over all policies; the middle term is exactly our DP recursion of part A after mK steps. Take limits, and get

   lim_{K→∞} J_{mK}(x_0) = J*(x_0).

Now we are almost done. Since

   |J_{mK+1}(x_0) - J_{mK}(x_0)| <= rho^K M,

we have

   lim_{k→∞} J_k(x_0) = J*(x_0).
Summary:
- A1: bound the tail, over all policies.
- A2: bound the contribution from the initial condition J_0(i), over all policies.
- A3: sandwich type of bounds; the middle term is the DP recursion.
- A4: optimize over all policies, take the limit.
2.11.3 Prove B
Prove that the optimal cost satisfies Bellman's equation. Recall the iteration

   J_{k+1}(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_{ij}(u) J_k(j) ].

In Part A we showed that J_k(·) → J*(·); just take limits on both sides. To prove uniqueness, just use a solution of the Bellman equation as the initial condition of the DP iteration: the iterates never change, yet they must converge to J*, so the solution equals J*.
2.12 Summary of previous lecture
Dynamics:

   x_{k+1} = w_k,   P_R{w_k = j | x_k = i, u_k = u} = p_{ij}(u),   u ∈ U(i) finite
   x_k ∈ S finite, S = {1, 2, ..., n, t}
   p_tt(u) = 1 for all u ∈ U(t)

Cost: Given pi = {mu_0, mu_1, ...},

   J_pi(i) = lim_{N→∞} E [ Σ_{k=0}^{N-1} g(x_k, mu_k(x_k)) | x_0 = i ],   i ∈ S
   g(t, u) = 0 for all u ∈ U(t)
   J*(i) = min_pi J_pi(i)   (optimal cost)

Note: J_pi(t) = 0, so J*(t) = 0.
Result:

A) Given any initial conditions J_0(1), ..., J_0(n), the sequence

   J_{k+1}(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_{ij}(u) J_k(j) ],   i ∈ S \ {t} = {1, ..., n}

converges to J*(i).

Note: there is a bit of a short-cut here; one can include the terminal state t, provided we pick J_0(t) = 0. This does not change the equations.

B)

   J*(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_{ij}(u) J*(j) ],   i ∈ S \ {t}

This is Bellman's Equation. It also gives the optimal policy, which is in fact stationary.
2.13 How do we solve Bellman's Equation?
2.13.1 Method 1: Value iteration (VI)
Use the DP recursion of result A:

   J_{k+1}(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_{ij}(u) J_k(j) ],   i ∈ S \ {t}

until it converges. J_0(i) can be set to a guess; if the guess is good, it will speed up convergence. How do we know that we are close to converging? Exploit problem structure to get bounds; see [1].
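A minimal value iteration sketch for the stochastic shortest path problem, with non-terminal states labeled 0..n-1 and the cost-free terminal state kept implicit. The data layout (`U`, `g`, `p`) is an assumption made for illustration.

```python
def value_iteration(n, U, g, p, tol=1e-9, max_iter=100_000):
    """Solve Bellman's equation by value iteration (sketch).

    U[i]: list of admissible controls at state i
    g[i][u]: stage cost
    p[i][u][j]: transition probability to non-terminal state j (row sums may be < 1;
                the missing mass is the probability of reaching the terminal state)
    """
    J = [0.0] * n                                # initial guess J_0
    for _ in range(max_iter):
        J_new = [min(g[i][u] + sum(p[i][u][j] * J[j] for j in range(n))
                     for u in U[i])
                 for i in range(n)]
        done = max(abs(a - b) for a, b in zip(J, J_new)) < tol
        J = J_new
        if done:
            break
    # recover a (stationary) policy from the converged cost-to-go
    policy = [min(U[i], key=lambda u: g[i][u] + sum(p[i][u][j] * J[j] for j in range(n)))
              for i in range(n)]
    return J, policy
```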
2.13.2 Method 2: Policy Iteration (PI)
Iterate over policies instead of values. We need the following result:

C) For any stationary policy mu, the costs J_mu(i) are the unique solutions of

   J(i) = g(i, mu(i)) + Σ_{j=1}^{n} p_{ij}(mu(i)) J(j),   i ∈ S \ {t}.

Furthermore, given any initial conditions J_0(i), the sequence

   J_{k+1}(i) = g(i, mu(i)) + Σ_{j=1}^{n} p_{ij}(mu(i)) J_k(j)

converges to J_mu(i) for each i.

Proof: trivial. Consider the problem where the only allowable control at state i is mu(i), and apply parts A and B. This is a special case of the general theorem.
Algorithm for PI: From now on, i ∈ S \ {t} = {1, 2, ..., n}.

Stage 1: Given mu^k (the stationary policy at iteration k, not the policy at time k), solve for J_{mu^k}(i) by solving

   J(i) = g(i, mu^k(i)) + Σ_{j=1}^{n} p_{ij}(mu^k(i)) J(j)   for all i

(n equations, n unknowns, Result C).

Stage 2: Improve the policy:

   mu^{k+1}(i) = arg min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_{ij}(u) J_{mu^k}(j) ]   for all i

Iterate; quit when J_{mu^{k+1}}(i) = J_{mu^k}(i) for all i.
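A sketch of these two stages, using numpy to solve the Stage 1 linear system. It uses the same assumed data layout as the value iteration sketch above, and assumes the initial policy is proper so that the system is solvable.

```python
import numpy as np

def policy_iteration(n, U, g, p):
    """Policy iteration for the stochastic shortest path problem (sketch)."""
    mu = [U[i][0] for i in range(n)]                 # some initial (assumed proper) policy
    while True:
        # Stage 1: policy evaluation, solve (I - P_mu) J = g_mu  (n equations, n unknowns)
        P_mu = np.array([[p[i][mu[i]][j] for j in range(n)] for i in range(n)])
        g_mu = np.array([g[i][mu[i]] for i in range(n)])
        J = np.linalg.solve(np.eye(n) - P_mu, g_mu)
        # Stage 2: policy improvement
        mu_new = [min(U[i], key=lambda u: g[i][u] + sum(p[i][u][j] * J[j] for j in range(n)))
                  for i in range(n)]
        if mu_new == mu:                             # no change: Bellman's equation holds
            return J, mu
        mu = mu_new
```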
Theorem: The above terminates after a finite number of steps, and converges to an optimal policy.

Proof (two steps):
1) We will first show that J_{mu^k}(i) >= J_{mu^{k+1}}(i) for all i, k.
2) We will show that what we converge to satisfies Bellman's Equation.

1) For fixed k, consider the following recursion in N:

   J_{N+1}(i) = g(i, mu^{k+1}(i)) + Σ_{j} p_{ij}(mu^{k+1}(i)) J_N(j),   with J_0(i) = J_{mu^k}(i).

By result C, J_N → J_{mu^{k+1}} as N → ∞.

   J_0(i) = g(i, mu^k(i)) + Σ_j p_{ij}(mu^k(i)) J_0(j)
         >= g(i, mu^{k+1}(i)) + Σ_j p_{ij}(mu^{k+1}(i)) J_0(j) = J_1(i)

(the inequality holds because mu^{k+1}(i) minimizes the right hand side in Stage 2). Next,

   J_1(i) >= g(i, mu^{k+1}(i)) + Σ_j p_{ij}(mu^{k+1}(i)) J_1(j) = J_2(i)

since J_1(i) <= J_0(i). Keep going, and get

   J_0(i) >= J_1(i) >= ... >= J_N(i) >= ...

Taking the limit,

   J_{mu^k}(i) >= J_{mu^{k+1}}(i)   for all i.

Since the number of stationary policies is finite, we will eventually have J_{mu^k}(i) = J_{mu^{k+1}}(i) for all i, for some finite k.

2) It follows from Stage 2 that, when converged,

   J_{mu^{k+1}}(i) = J_{mu^k}(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_j p_{ij}(u) J_{mu^k}(j) ],

but this is Bellman's Equation! So we have converged to an optimal policy.
Discussion — Complexity:

Stage 1: a linear system of equations of size n, complexity O(n^3).
Stage 2: n minimizations over p choices (p different values of u), complexity O(p n^2).

Put together: O(n^2 (n + p)) per iteration.

Worst case number of iterations: a search over all policies, p^n. But in practice it converges very quickly.

Why does Policy Iteration converge so quickly relative to Value Iteration?
Rewrite Value Iteration in two stages:

Stage 2:  mu^k(i) = arg min_{u ∈ U(i)} [ g(i, u) + Σ_j p_{ij}(u) J_k(j) ]

Stage 1:  J_{k+1}(i) = g(i, mu^k(i)) + Σ_j p_{ij}(mu^k(i)) J_k(j)

and iterate.
2.13.3 Third Method: Linear Programming
Recall Bellman's Equation:

   J*(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_{ij}(u) J*(j) ],   i = 1, ..., n

and Value Iteration:

   J_{k+1}(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_{ij}(u) J_k(j) ],   i = 1, ..., n

We showed that Value Iteration (V.I.) converges to the optimal cost-to-go J* for all initial guesses J_0.

Assume we start V.I. with any J_0 that satisfies

   J_0(i) <= min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_{ij}(u) J_0(j) ],   i = 1, ..., n.

It follows that J_1(i) >= J_0(i) for all i. Then

   J_1(i) <= min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_{ij}(u) J_1(j) ],   i = 1, ..., n,

so J_2(i) >= J_1(i) for all i. In general:

   J_{k+1}(i) >= J_k(i)   for all i, k,

and since J_k → J*,

   J_0(i) <= J*(i)   for all i.
Now let J solve the following problem:

   maximize  Σ_i J(i)
   subject to  J(i) <= g(i, u) + Σ_j p_{ij}(u) J(j)   for all i and all u ∈ U(i)

It is clear that any feasible J satisfies J(i) <= J*(i) for all i, by the previous analysis. Since J* satisfies the constraints, it follows that J* achieves the maximum.

This is a Linear Program!
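A sketch of this LP with scipy, rewritten in the standard form that `linprog` expects (minimize c^T J subject to A_ub J <= b_ub). The data layout is the same assumed one as in the earlier sketches.

```python
import numpy as np
from scipy.optimize import linprog

def bellman_lp(n, U, g, p):
    """Solve Bellman's equation as a linear program (sketch)."""
    c = -np.ones(n)                      # maximize sum_i J(i)  <=>  minimize -sum_i J(i)
    A_ub, b_ub = [], []
    for i in range(n):
        for u in U[i]:
            # constraint: J(i) - sum_j p_ij(u) J(j) <= g(i, u)
            row = [-p[i][u][j] for j in range(n)]
            row[i] += 1.0
            A_ub.append(row)
            b_ub.append(g[i][u])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n)
    return res.x                         # J*(1), ..., J*(n)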
2.13.4 Analogies and Connections
Say I want to solve

   J = G + P J,   J ∈ R^n, G ∈ R^n, P ∈ R^{n×n}.

The direct way to solve it:

   (I - P) J = G,   so   J = (I - P)^{-1} G.

This is exactly what we do in Stage 1 of policy iteration: solve for the cost associated with a specific policy.

Why is (I - P)^{-1} guaranteed to exist?

For a given policy, let P~ ∈ R^{(n+1)×(n+1)} be the probability matrix that captures our Markov chain:

   P~ = [ P   P_(·)t ]
        [ 0   1      ]

where p_{ij} is the probability that the next state is j given that the current state is i.

Facts:
- P~ is a right stochastic matrix: all rows sum up to 1, all elements are >= 0.
- Perron-Frobenius theorem: the eigenvalues of P~ have absolute value <= 1, and at least one equals 1.
- Assumption on the terminal state: P^N → 0 as N → ∞. We will eventually reach the termination state!
Therefore:
- the eigenvalues of P have absolute value < 1,
- (I - P)^{-1} exists.

Furthermore:

   (I - P)^{-1} = I + P + P^2 + ...

Proof:

   (I - P)(I + P + P^2 + ...) = I + (P + P^2 + ...) - (P + P^2 + ...) = I

Therefore one way to solve for J is as follows:

   J_1 = G + P J_0
   J_2 = G + P J_1 = G + P G + P^2 J_0
   ...
   J_N = (I + P + ... + P^{N-1}) G + P^N J_0

and J_N → (I - P)^{-1} G as N → ∞.

Analogy:
- Value Iteration: one step of this update.
- Policy Iteration: an infinite number of updates, i.e. solving the system of equations exactly.

What is truly remarkable is that various combinations of policy iteration and value iteration all converge to the solution of Bellman's Equation.
Recall value iteration:

   J_{k+1}(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_j p_{ij}(u) J_k(j) ],   i = 1, ..., n

In practice, you would implement it as follows:

   J~(i) ← min_{u ∈ U(i)} [ g(i, u) + Σ_j p_{ij}(u) J(j) ],   i = 1, ..., n
   J(i) ← J~(i),   i = 1, ..., n

You don't have to do this! You can also do

   J(i) ← min_{u ∈ U(i)} [ g(i, u) + Σ_j p_{ij}(u) J(j) ],   i = 1, ..., n

updating in place. This is a Gauss-Seidel update: a generic technique for solving iterative equations.

It gets even better: Asynchronous Policy Iteration allows
- any number of value updates in between policy updates,
- any number of states updated at each value update,
- any number of states updated at each policy update.

Under some mild assumptions, all of these converge to J*.
2.14 Discounted Problems
   J_pi(i) = lim_{N→∞} E [ Σ_{k=0}^{N-1} alpha^k g(x_k, mu_k(x_k)) | x_0 = i ],   alpha < 1,  i ∈ {1, ..., n}

No explicit termination state is required. No assumption on the transition probabilities is required.

Bellman's Equation for this problem:

   J*(i) = min_{u ∈ U(i)} [ g(i, u) + alpha Σ_{j=1}^{n} p_{ij}(u) J*(j) ]   for all i

How do we show this?

- Define an associated problem with states {1, 2, ..., n, t}. From state i ≠ t, when u is applied we incur the cost g(i, u); the next state is j with probability alpha p_{ij}(u), and t with probability 1 - alpha (since Σ_j p_{ij}(u) = 1).
- Clearly, since alpha < 1, we have a non-zero probability of making it to state t, so our assumption on reaching the termination state is satisfied.
- Suppose we use the same policy in the discounted problem as in the auxiliary problem.
Note that

   P_R( x_{k+1} = j | x_k = i, x_{k+1} ≠ t, u ) = alpha p_{ij}(u) / (alpha p_{i,1} + alpha p_{i,2} + ... + alpha p_{i,n}) = alpha p_{ij}(u) / alpha = p_{ij}(u)

So as long as we have not reached the termination state, the state evolution is governed by the same probabilities. The expected cost of the k-th stage of the associated problem is g(x_k, mu_k(x_k)) times the probability that t has not yet been reached, which is alpha^k; therefore we have alpha^k g(x_k, mu_k(x_k)).
Connections: for a given policy, we have

   P~ = [ alpha P    (1 - alpha) 1 ]
        [ 0          1             ]

(where 1 is the vector of all ones), and it is clear that (alpha P)^N = alpha^N P^N → 0.
Chapter 3
Continuous Time Optimal
Control
Consider the following system:

   ẋ(t) = f(x(t), u(t)),   0 <= t <= T,   x(0) = x_0,   no noise!

State x(t) ∈ R^n
Time t ∈ R, T is the terminal time
Control u(t) ∈ U ⊆ R^m, with U the control constraint set

Assume:
- f is continuously differentiable with respect to x (a less stringent requirement: Lipschitz),
- f is continuous with respect to u,
- u(t) is piecewise continuous.

See Appendix A in [1] for details.

Assume: existence and uniqueness of solutions.

Example 8:

   ẋ(t) = x(t)^{1/3},   x(0) = 0

Solutions: x(t) = 0 for all t, and x(t) = (2t/3)^{3/2}. Not unique!
Example 9:

   ẋ(t) = x(t)^2,   x(0) = 1

Solution: x(t) = 1/(1 - t), finite escape time, x(1) = ∞. The solution does not exist on an interval that includes 1, e.g. [0, 2].

Objective: Minimize

   h(x(T)) + ∫_0^T g(x(t), u(t)) dt

where g and h are continuously differentiable with respect to x, and g is continuous with respect to u. This is very similar to the discrete time problem (Σ → ∫, x_{k+1} → ẋ), except for the technical assumptions.
3.1 The Hamilton Jacobi Bellman (HJB) Equation
This is the continuous time analog of the DP algorithm. We derive it informally by discretizing the problem and taking limits. It is not a rigorous derivation, but it does capture the main ideas.

Divide the time horizon into N pieces, define delta = T/N, and let

   x_k := x(k delta),   u_k := u(k delta),   k = 0, 1, ..., N.

Approximate the differential equation by

   (x_{k+1} - x_k)/delta = f(x_k, u_k),   i.e.   x_{k+1} = x_k + f(x_k, u_k) delta.

Approximate the cost function by

   h(x_N) + Σ_{k=0}^{N-1} g(x_k, u_k) delta.

Define J*(t, x) = optimal cost-to-go at time t and state x for the continuous problem.
Define J~*(t, x) = discrete approximation of the optimal cost-to-go.

Apply the DP algorithm:

terminal condition:  J~*(N delta, x) = h(x)

recursion:  J~*(k delta, x) = min_{u ∈ U} [ g(x, u) delta + J~*((k+1) delta, x + f(x, u) delta) ],   k = 0, ..., N-1
Do a Taylor expansion of J~*, because delta → 0:

   J~*((k+1) delta, x + f(x, u) delta) = J~*(k delta, x) + (∂J~*(k delta, x)/∂t) delta + (∂J~*(k delta, x)/∂x)^T f(x, u) delta + o(delta)

with lim_{delta→0} o(delta)/delta = 0 (little-o notation: o(delta) collects the terms quadratic or higher in delta).

Substitute back into the DP recursion and divide by delta:

   0 = min_{u ∈ U} [ g(x, u) + ∂J~*(k delta, x)/∂t + (∂J~*(k delta, x)/∂x)^T f(x, u) + o(delta)/delta ]

Now let t = k delta and let delta → 0. Assuming J~* → J*, we have

   0 = min_{u ∈ U} [ g(x, u) + ∂J*(t, x)/∂t + (∂J*(t, x)/∂x)^T f(x, u) ]   for all x, t        (3.1)

   J*(T, x) = h(x)

The HJB equation (3.1):
- is a partial differential equation, very difficult to solve;
- the u = mu(t, x) that minimizes the right hand side of the HJB is an optimal policy.
Example 10: Consider the system

   ẋ(t) = u(t),   |u(t)| <= 1

The cost is (1/2) x^2(T), only a terminal cost.

Intuitive solution: u(t) = mu(t, x) = -sgn(x) = { -1 if x > 0;  0 if x = 0;  1 if x < 0 }.

What is the cost-to-go associated with this policy?

   V(t, x) = (1/2) ( max{0, |x| - (T - t)} )^2

Verify that this is indeed the cost-to-go associated with the policy outlined above.

[Figure 3.1: V(t, x) as a function of x for fixed t — zero on [-(T-t), T-t], quadratic outside.]
[Figure 3.2: ∂V(t, x)/∂x as a function of x for fixed t.]
[Figure 3.3: V(t, x) as a function of t for fixed x — case 1: |x| <= T, case 2: |x| > T.]

   ∂V(t, x)/∂x = sgn(x) max{0, |x| - (T - t)}

Does V(t, x) satisfy the HJB equation?

First check: does it satisfy the boundary condition? V(T, x) = (1/2) x^2 = h(x). Yes.

Second check:

   min_{|u|<=1} [ ∂V(t, x)/∂t + (∂V(t, x)/∂x) f(x, u) ] = min_{|u|<=1} (1 + sgn(x) u) max{0, |x| - (T - t)} = 0,

achieved by choosing u = -sgn(x).

So V(t, x) satisfies the HJB equation, and V(t, x) = J*(t, x). Furthermore u = -sgn(x) is an optimal solution. It is not unique!

Note: Verifying that V(t, x) satisfies HJB is not trivial, even for this simple example. Imagine solving for it!

Another issue: the cost (1/2) x^2(T) will give the same optimal policy as the cost |x(T)|, so different costs give the same optimal policy, but some costs are nicer to work with than others.
3.2 Aside on Notation
Let F(t, x) be a continuously differentiable function. Then:

1. ∂F(t, x)/∂t: partial derivative of F with respect to the first argument.

2. ∂F(t, x(t))/∂t = ∂F(t, x)/∂t evaluated at x = x(t): shorthand notation.

3. dF(t, x(t))/dt = ∂F(t, x(t))/∂t + (∂F(t, x(t))/∂x) ẋ(t): total derivative.

Example 11: For F(t, x) = t x:

   ∂F(t, x)/∂t = x,   ∂F(t, x(t))/∂t = x(t),   dF(t, x(t))/dt = x(t) + t ẋ(t)

Lemma 3.2.1: Let F(t, x, u) be a continuously differentiable function and let U be a convex set. Assume that mu*(t, x) := arg min_{u ∈ U} F(t, x, u) is continuously differentiable.
Then:

1)  ∂/∂t [ min_{u ∈ U} F(t, x, u) ] = ∂F(t, x, mu*(t, x))/∂t   for all t, x

2)  ∂/∂x [ min_{u ∈ U} F(t, x, u) ] = ∂F(t, x, mu*(t, x))/∂x   for all t, x
Example 12: Let F(t, x, u) = (1 + t) u^2 + u x + 1, t >= 0, and let U be the real line (no constraint on u). Then

   min_u F(t, x, u):  2(1 + t) u + x = 0,  so  mu*(t, x) = -x / (2(1 + t)),

   min_u F(t, x, u) = (1 + t) x^2 / (4(1 + t)^2) - x^2 / (2(1 + t)) + 1 = -x^2 / (4(1 + t)) + 1

1)  ∂/∂t [ min_u F(t, x, u) ] = x^2 / (4(1 + t)^2)
    ∂F(t, x, mu*(t, x))/∂t = u^2 |_{u = mu*(t, x)} = x^2 / (4(1 + t)^2)

2)  ∂/∂x [ min_u F(t, x, u) ] = -x / (2(1 + t))
    ∂F(t, x, mu*(t, x))/∂x = u |_{u = mu*(t, x)} = -x / (2(1 + t))
Proof 2 (Proof of Lemma 3.2.1 when u is unconstrained, U = R^m): Let

   G(t, x) = min_{u ∈ U} F(t, x, u) = F(t, x, mu*(t, x)).

Then

   ∂G(t, x)/∂t = ∂F(t, x, mu*(t, x))/∂t + (∂F(t, x, mu*(t, x))/∂u) (∂mu*(t, x)/∂t),

where the second term vanishes because mu*(t, x) minimizes F. The same can be done for ∂G(t, x)/∂x.
3.2.1 The Minimum Principle
The HJB equation gives us a lot of information: the optimal cost-to-go for all time and for all possible states; it also gives the optimal feedback law u = mu*(t, x). What if we only cared about the optimal control trajectory for a specific initial condition x(0) = x_0? Can we exploit the fact that we are asking for much less to simplify the mathematical conditions?

Starting point: HJB,

   0 = min_{u ∈ U} [ g(x, u) + ∂J*(t, x)/∂t + (∂J*(t, x)/∂x)^T f(x, u) ]   for all t, x
   J*(T, x) = h(x)   for all x

Let mu*(t, x) be the corresponding optimal strategy (feedback law), and let

   F(t, x, u) = g(x, u) + ∂J*(t, x)/∂t + (∂J*(t, x)/∂x)^T f(x, u).

So the HJB equation gives us G(t, x) = min_{u ∈ U} F(t, x, u) = 0.

Apply the Lemma:

1)  ∂G(t, x)/∂t = 0 = ∂²J*(t, x)/∂t² + ( ∂²J*(t, x)/∂x∂t )^T f(x, mu*(t, x))   for all t, x

2)  ∂G(t, x)/∂x = 0 = ∂g(x, mu*(t, x))/∂x + ∂²J*(t, x)/∂x∂t + ( ∂²J*(t, x)/∂x² ) f(x, mu*(t, x))
                      + ( ∂f(x, mu*(t, x))/∂x )^T ∂J*(t, x)/∂x   for all t, x

Consider a specific optimal trajectory:

   u*(t) = mu*(t, x*(t)),   ẋ*(t) = f(x*(t), u*(t)),   x*(0) = x_0.

Along this trajectory the above become total derivatives:

1)  0 = d/dt [ ∂J*(t, x*(t))/∂t ]

2)  0 = ∂g(x*(t), u*(t))/∂x + d/dt [ ∂J*(t, x*(t))/∂x ] + ( ∂f(x*(t), u*(t))/∂x )^T ∂J*(t, x*(t))/∂x
Define

   p(t) = ∂J*(t, x*(t))/∂x,   p_0(t) = ∂J*(t, x*(t))/∂t.

Then:

1)  ṗ_0(t) = 0, so p_0(t) = constant for 0 <= t <= T.

2)  ṗ(t) = -( ∂f(x*(t), u*(t))/∂x )^T p(t) - ∂g(x*(t), u*(t))/∂x,   0 <= t <= T,

and since ∂J*(T, x)/∂x = ∂h(x)/∂x, we get the boundary condition p(T) = ∂h(x*(T))/∂x.

Put all of this together. Define the Hamiltonian

   H(x, u, p) = g(x, u) + p^T f(x, u).

Let u*(t) be an optimal control trajectory and x*(t) the resulting state trajectory. Then

   ẋ*(t) = ∂H/∂p (x*(t), u*(t), p(t)),   x*(0) = x_0
   ṗ(t) = -∂H/∂x (x*(t), u*(t), p(t)),   p(T) = ∂h(x*(T))/∂x
   u*(t) = arg min_{u ∈ U} H(x*(t), u, p(t))
   H(x*(t), u*(t), p(t)) = constant for t ∈ [0, T]

(the constancy of H(·) comes from p_0(t) = constant).
Some remarks:
- This is a set of 2n ODEs with split boundary conditions. Not trivial to solve.
- It is a necessary condition, but not sufficient. There can be multiple solutions, and not all of them may be optimal.
- If f(x, u) is linear, U is convex, and h and g are convex, then the condition is necessary and sufficient.
Example 13 (Resource Allocation): Some robots are sent to Mars to build habitats for later exploration by humans.

x(t): number of reconfigurable robots, which can build either habitats or more robots.
x(0) is given: the number of robots that arrive on Mars.

   ẋ(t) = u(t) x(t),   x(0) = x_0
   ẏ(t) = (1 - u(t)) x(t),   y(0) = 0
   0 <= u(t) <= 1

Objective: given the terminal time T, find the control input u*(t) that maximizes y(T), the number of habitats built.

Note: y(T) = ∫_0^T (1 - u(t)) x(t) dt.

Solution:

   g(x, u) = (1 - u) x
   f(x, u) = u x
   H(x, u, p) = (1 - u) x + p u x

   ṗ(t) = -∂H(x*(t), u*(t), p(t))/∂x = -1 + u*(t) - p(t) u*(t)
   p(T) = 0   (since h(x) ≡ 0)

   u*(t) = arg max_{0 <= u <= 1} ( x*(t) + (p(t) x*(t) - x*(t)) u )

which gives

   u = 0 if p(t) < 1,   u = 1 if p(t) > 1.

Since p(T) = 0, for t close to T we will have u*(t) = 0 and therefore ṗ(t) = -1. Therefore at time t = T - 1 we have p(t) = 1, and that is where the switch occurs:

   ṗ(t) = -p(t),   0 <= t <= T - 1,   p(T - 1) = 1
   p(t) = exp(-t) exp(T - 1),   0 <= t <= T - 1

Conclusion:

   u*(t) = { 1   for 0 <= t <= T - 1
           { 0   for T - 1 <= t <= T
How to use this in practice:

1. If you can solve the HJB, you get a feedback law u = mu(x). Very convenient, just a controller: measure the state and apply the control input.
2. Solve for the optimal trajectory and use a feedback law (probably linear) to keep you on that trajectory.
3. Solve for the optimal trajectory online after measuring the state. Do this often.
[Figure 3.4: Different approaches to finding a solution — the HJB yields optimal solutions (hard to show rigorously, viscosity solutions); the Minimum Principle follows from HJB without too much difficulty (easy to show, in the book) and can also be derived rigorously via calculus of variations, but may yield non-optimal solutions (local minima).]
3.3 Extensions
(We drop the x*(t), J* notation for simplicity.)
3.3.1 Fixed Terminal State
Consider the case where x(T) is given. Clearly there is no need for a terminal cost.

Recall the co-state p(t) = ∂J(t, x(t))/∂x. Here p(T) = lim_{t→T} ∂J(t, x(t))/∂x, but we can't use h(·), the terminal cost, to constrain p(T). We don't need constraints on p, though:

   ẋ(t) = f(x(t), u(t)),   x(0) = x_0, x(T) = x_T      (2n ODEs)
   ṗ(t) = -∂H(x(t), u(t), p(t))/∂x                      (2n boundary conditions)
Example 14:

   ẋ(t) = u(t),   x(0) = 0, x(1) = 1
   g(x, u) = (1/2)(x^2 + u^2),   cost = (1/2) ∫_0^1 (x^2(t) + u^2(t)) dt

Hamiltonian: H(x, u, p) = (1/2)(x^2 + u^2) + p u.

We get

   ẋ(t) = u(t),   ṗ(t) = -x(t)
   u(t) = arg min_u [ (1/2)(x^2(t) + u^2) + p(t) u ],   therefore u(t) = -p(t)

So ẋ(t) = -p(t), ṗ(t) = -x(t), hence ẍ(t) = x(t), with general solution

   x(t) = A cosh(t) + B sinh(t).

x(0) = 0 gives A = 0; x(1) = 1 gives B = 1/sinh(1), so

   x(t) = sinh(t)/sinh(1) = (e^t - e^{-t}) / (e^1 - e^{-1}).

Exercise: show that the Hamiltonian is constant along this trajectory.
3.3.2 Free initial state, with cost
Here x(0) is not fixed, but we have an initial cost l(x(0)). One can show that the resulting condition is p(0) = -∂l(x(0))/∂x.

Example 15:

   ẋ(t) = u(t),   x(1) = 1, x(0) is free
   g(x, u) = (1/2)(x^2 + u^2),   l(x) = 0  (no initial cost, given)

Apply the Minimum Principle as before:

   ẋ(t) = u(t) = -p(t),   ṗ(t) = -x(t),   so   x(t) = A cosh(t) + B sinh(t)
   ẋ(0) = u(0) = -p(0) = 0   =>   B = 0

   x(t) = cosh(t)/cosh(1) = (e^t + e^{-t}) / (e^1 + e^{-1}),   x(0) ≈ 0.65
Free Terminal Time: Result: the Hamiltonian = 0 along the optimal trajectory. We gain an extra degree of freedom in choosing T, and we lose a degree of freedom because H ≡ 0.

Time Varying System and Cost: What happens if f = f(x, u, t), g = g(x, u, t)? Result: everything stays the same, except that the Hamiltonian is no longer constant along the trajectory. Hint: augment the state with time, using ṫ = 1 and ẋ = f(x, u, t).
Singular Problems: Motivate via an example.

Tracking problem: z(t) = 1 - t^2, 0 <= t <= 1. Minimize (1/2) ∫_0^1 (x(t) - z(t))^2 dt subject to |ẋ(t)| <= 1.

Apply the Minimum Principle:

   ẋ(t) = u(t),   |u(t)| <= 1,   x(0), x(1) are free
   g(x, u, t) = (1/2)(x - z(t))^2
   H(x, u, p, t) = (1/2)(x - z(t))^2 + p u

Co-state equation:

   ṗ(t) = -(x(t) - z(t)),   p(0) = 0, p(1) = 0

Optimal u:

   u(t) = arg min_{|u|<=1} H(x(t), u, p(t), t) = { -1  if p(t) > 0
                                                 { +1  if p(t) < 0
                                                 {  ?  if p(t) = 0

The problem is singular if the Hamiltonian is not a function of u over a non-trivial time interval.

Try the following: p(0) = 0 and p(t) = 0 for 0 <= t <= T, with T to be determined. Then

   ṗ(t) = 0  and  x(t) = z(t)   for 0 <= t <= T,
   ẋ(t) = ż(t) = -2t = u(t).

One guess: pick T = 1/2. This can't be the solution: for t > 1/2 we would have x(t) - z(t) > 0, hence ṗ < 0 and p(1) < 0, and we can't satisfy the boundary condition.

Explore this instead: switch before T = 1/2.

   x(t) = z(t)   for 0 <= t <= T < 1/2
   ẋ(t) = -1    for T < t <= 1
   x(t) = z(T) - (t - T) = 1 - T^2 - t + T   for T < t <= 1

   ṗ(t) = -(x(t) - z(t)) = -(1 - T^2 - t + T - 1 + t^2) = T^2 - T - t^2 + t   for T < t <= 1

   p(1) = ∫_T^1 (T^2 - T - t^2 + t) dt
        = T^2 - T^3 - T + T^2 - 1/3 + 1/2 + T^3/3 - T^2/2 = 0

This simplifies (multiply by 6) to

   0 = -4T^3 + 9T^2 - 6T + 1 = (T - 1)(T - 1)(1 - 4T)

so T = 1 or T = 1/4. T = 1/4 satisfies all the constraints and we are done! One can easily verify that p(t) > 0 for 1/4 < t < 1, giving u(t) = -1 as required.
3.4 Linear Systems and Quadratic Costs
Look at the infinite horizon, LTI (linear time invariant) system:

   x_{k+1} = A x_k + B u_k,   k = 0, 1, ...
   cost = Σ_{k=0}^{∞} ( x_k^T Q x_k + u_k^T R u_k ),   R > 0, Q >= 0, R = R^T, Q = Q^T

Informally, the cost-to-go is time invariant: it only depends on the state and not on when we get there:

   J(x) = min_u [ x^T Q x + u^T R u + J(A x + B u) ]
Conjecture that the optimal cost-to-go is quadratic in x: J(x) = x^T K x, where K = K^T, K >= 0. Then

   x^T K x = x^T Q x + x^T A^T K A x + min_u [ u^T R u + u^T B^T K B u + x^T A^T K B u + u^T B^T K A x ]

Since R > 0 and B^T K B >= 0, we have R + B^T K B > 0. Setting the derivative with respect to u to zero:

   2 (R + B^T K B) u + 2 B^T K A x = 0
   u = -(R + B^T K B)^{-1} B^T K A x

Substitute back in: all terms are of the form x^T ( · ) x. Therefore we must have

   K = Q + A^T K A + A^T K B (R + B^T K B)^{-1} (R + B^T K B) (R + B^T K B)^{-1} B^T K A - 2 A^T K B (R + B^T K B)^{-1} B^T K A

   K = A^T ( K - K B (R + B^T K B)^{-1} B^T K ) A + Q,   K >= 0

Summary
- Optimal cost-to-go: J(x) = x^T K x
- Optimal feedback strategy: u = F x, with F = -(R + B^T K B)^{-1} B^T K A

Questions
1. Can we always solve for K?
2. Is the closed loop system x_{k+1} = (A + B F) x_k stable?
Example 16:

   x_{k+1} = 2 x_k + 0·u_k,   cost = Σ_{k=0}^{∞} ( x_k^2 + u_k^2 )
   A = 2, B = 0, Q = 1, R = 1

Solve for K:

   K = 4K + 1,   -3K = 1,   K = -1/3

K does not satisfy the K >= 0 constraint. There is no solution to this problem: the cost is infinite. The problem with this example is that (A, B) is not stabilizable.

Stabilizable: one can find a matrix F such that A + BF is stable, i.e. rho(A + BF) < 1 (the eigenvalues of A + BF have magnitude < 1).
Example 17:

   x_{k+1} = 0.5 x_k + 0·u_k,   cost = Σ_{k=0}^{∞} ( x_k^2 + u_k^2 )
   A = 0.5, B = 0, Q = 1, R = 1

Solve for K:

   K = 0.25 K + 1,   K = 4/3

Cost-to-go = (4/3) x_k^2, and F = 0.
Example 18:

   x_{k+1} = 2 x_k + u_k,   cost = Σ_{k=0}^{∞} ( x_k^2 + u_k^2 )
   A = 2, B = 1, Q = 1, R = 1

(A, B) is stabilizable. Solve for K:

   K = 4 ( K - K^2/(1 + K) ) + 1
   0 = K^2 - 4K - 1
   K = (4 ± sqrt(20)) / 2 = 2 ± sqrt(5)

Pick K = 2 + sqrt(5) ≈ 4.236 and solve for F:

   F = -(1 + K)^{-1} 2K = -2K/(1 + K) ≈ -1.618

A + BF = 2 - 1.618 = 0.382 is stable. Our optimizing strategy stabilizes the system, as expected.
Example 19:

   x_{k+1} = 2 x_k + u_k,   cost = Σ_{k=0}^{∞} u_k^2
   A = 2, B = 1, Q = 0, R = 1

Solve for K:

   K = 4 ( K - K^2/(1 + K) ),   K ∈ {0, 3}

K = 0 is clearly the optimal thing to do, but it leads to an unstable system. K = 3, however, while not being optimal, leads to F = -(1 + 3)^{-1} · 3 · 2 = -1.5, A + BF = 0.5, which is stable.

Modify the cost to Σ_{k=0}^{∞} ( u_k^2 + epsilon x_k^2 ), epsilon > 0, epsilon << 1:

   A = 2, B = 1, Q = epsilon, R = 1

Solve for K:

   K(epsilon) = { 3 + (4/3) epsilon,  -epsilon/3 }   (to first order in epsilon)

The K = 3 solution is the limiting case as we put an arbitrarily small cost on the state, which would otherwise diverge.
Example 20:

   x_{k+1} = 0.5 x_k + u_k,   cost = Σ_{k=0}^{∞} u_k^2
   A = 0.5, B = 1, Q = 0, R = 1

Solve for K:  K ∈ {0, -0.75}.

Here K = 0 makes perfect sense: it gives the optimal strategy u = 0 and a stable closed loop system.

If Q = epsilon:

   K(epsilon) ≈ { (4/3) epsilon,  -0.75 - epsilon/3 }

so K = 0 is the well behaved solution in the limit epsilon → 0.
We need the concept of detectability. Let Q be decomposed as Q = C^T C (this can always be done).

We need a detectability assumption: (A, C) is detectable if there exists L such that A + LC is stable. Then C x_k → 0 implies x_k → 0, and C x_k → 0 if and only if x_k^T Q x_k → 0.
3.4.1 Summary
Given

   x_{k+1} = A x_k + B u_k,   k = 0, 1, ...
   cost = Σ_{k=0}^{∞} ( x_k^T Q x_k + u_k^T R u_k ),   Q >= 0, R > 0,

with (A, B) stabilizable and (A, C) detectable, where C is any matrix that satisfies C^T C = Q. Then:

1. There is a unique solution to the D.A.R.E. (discrete algebraic Riccati equation).
2. The optimal cost-to-go is J(x) = x^T K x.
3. The optimal feedback strategy is u = F x.
4. The closed loop system is stable.
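A quick numeric check of this summary (a sketch, not part of the notes): scipy's `solve_discrete_are` solves the DARE; the matrices below are just the scalar data of Example 18.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Data of Example 18: A = 2, B = 1, Q = 1, R = 1 (as 1x1 matrices).
A = np.array([[2.0]]); B = np.array([[1.0]])
Q = np.array([[1.0]]); R = np.array([[1.0]])

K = solve_discrete_are(A, B, Q, R)                   # stabilizing DARE solution
F = -np.linalg.solve(R + B.T @ K @ B, B.T @ K @ A)   # optimal feedback u = F x

print(K)                                 # approx [[4.236]] = 2 + sqrt(5)
print(F)                                 # approx [[-1.618]]
print(np.linalg.eigvals(A + B @ F))      # approx [0.382], inside the unit circle: stable
```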
3.5 General Problem Formulation
Finite horizon, time varying, with disturbances:

   x_{k+1} = A_k x_k + B_k u_k + w_k,   k = 0, ..., N-1
   E(w_k) = 0,   E(w_k w_k^T) finite

Cost:

   E [ x_N^T Q_N x_N + Σ_{k=0}^{N-1} ( x_k^T Q_k x_k + u_k^T R_k u_k ) ]

   Q_k = Q_k^T >= 0 (eigenvalues >= 0),   R_k = R_k^T > 0
Apply DP to solve the problem:

   J_N(x_N) = x_N^T Q_N x_N
   J_k(x_k) = min_{u_k} E [ x_k^T Q_k x_k + u_k^T R_k u_k + J_{k+1}(A_k x_k + B_k u_k + w_k) ]

Let's do the first step of this recursion; equivalently, take N = 1:

   J_0(x_0) = min_{u_0} E [ x_0^T Q_0 x_0 + u_0^T R_0 u_0 + (A_0 x_0 + B_0 u_0 + w_0)^T Q_1 (A_0 x_0 + B_0 u_0 + w_0) ]

Consider the last term:

   E [ (A_0 x_0 + B_0 u_0)^T Q_1 (A_0 x_0 + B_0 u_0) + 2 (A_0 x_0 + B_0 u_0)^T Q_1 w_0 + w_0^T Q_1 w_0 ]
     = (A_0 x_0 + B_0 u_0)^T Q_1 (A_0 x_0 + B_0 u_0) + E(w_0^T Q_1 w_0)

since E(w_0) = 0. So

   J_0(x_0) = min_{u_0} [ x_0^T Q_0 x_0 + u_0^T R_0 u_0 + (A_0 x_0 + B_0 u_0)^T Q_1 (A_0 x_0 + B_0 u_0) + E(w_0^T Q_1 w_0) ]

The strategy is the same as if there were no noise (although the noise does give a different cost). This is certainty equivalence (it works only for some problems, not all).
Solve for the minimizing u_0: differentiate and set to 0:

   2 R_0 u_0 + 2 B_0^T Q_1 B_0 u_0 + 2 B_0^T Q_1 A_0 x_0 = 0
   u_0 = -(R_0 + B_0^T Q_1 B_0)^{-1} B_0^T Q_1 A_0 x_0 =: F_0 x_0

The optimal feedback strategy is a linear function of the state.

Substitute back and solve for J_0(x_0):

   x_0^T Q_0 x_0 + u_0^T (R_0 + B_0^T Q_1 B_0) u_0 + x_0^T A_0^T Q_1 A_0 x_0 + 2 x_0^T A_0^T Q_1 B_0 u_0 + E(w_0^T Q_1 w_0)
     = x_0^T K_0 x_0 + E(w_0^T Q_1 w_0)

with

   K_0 = Q_0 + A_0^T ( K_1 - K_1 B_0 (R_0 + B_0^T K_1 B_0)^{-1} B_0^T K_1 ) A_0,   K_1 = Q_1.

The cost at k = 1 is quadratic in x_1; at k = 0 it is quadratic in x_0 plus a constant. We can extend this to any horizon, giving the Discrete Riccati Equation (DRE):

   K_k = Q_k + A_k^T ( K_{k+1} - K_{k+1} B_k (R_k + B_k^T K_{k+1} B_k)^{-1} B_k^T K_{k+1} ) A_k,   K_N = Q_N

Feedback law:

   u_k = F_k x_k,   F_k = -(R_k + B_k^T K_{k+1} B_k)^{-1} B_k^T K_{k+1} A_k
Cost:

   J_k(x_k) = x_k^T K_k x_k + Σ_{j=k}^{N-1} E( w_j^T K_{j+1} w_j )

No noise, time invariant, infinite horizon (N → ∞): we recover the previous results and the DARE. In fact, the above iterative method is one way to solve the DARE: iterate backwards until it converges ([1] has a proof of convergence; it is not trivial).

Time invariant, infinite horizon with noise: the cost goes to infinity. Approach: divide the cost by N and let N → ∞; the resulting average cost is E(w^T K w).
Example 21: System given:

   z̈(t) = u(t)

Objective: Apply a force to move a mass from any starting point to z = 0, ż = 0. Implement on a computer that can only update information once per second.

1. Discretize the problem. With u held constant over each one-second interval,

   ż(t) = ż(0) + u(0) t,                     0 <= t < 1
   z(t) = z(0) + ż(0) t + (1/2) u(0) t^2,    0 <= t < 1

Let x_1(k) = z(k), x_2(k) = ż(k). Then

   x(k+1) = A x(k) + B u(k),   k = 0, 1, ...
   A = [ 1  1 ]    B = [ 0.5 ]
       [ 0  1 ]        [ 1   ]

2. Cost = Σ_{k=0}^{∞} ( x_1^2(k) + u^2(k) ). Therefore

   Q = [ 1  0 ]    R = 1
       [ 0  0 ]

3. Is the system stabilizable? Can we make A + BF stable for some F?
   Yes: F = [ -1  -1.5 ] makes the eigenvalues of A + BF equal to 0.

4. Q can be decomposed as

   Q = [ 1 ] [ 1  0 ],   i.e.  C = [ 1  0 ],  Q = C^T C.
       [ 0 ]

   Is (A, C) detectable? Yes: L = [ -2  -1 ]^T makes the eigenvalues of A + LC equal to 0.

5. Solve the DARE. Use the MATLAB command dare:

   K = [ 2  1   ]
       [ 1  1.5 ]

   Optimal feedback matrix: F = [ -0.5  -1.0 ].

6. Physical interpretation: a spring and a damper. The spring has coefficient 0.5, the damper has coefficient 1.0.
Bibliography
[1] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume I. Athena Scientific, 3rd edition, 2005.