XCS221 Mod2 Slides
Search problems
Machine learning
Modeling
Inference
Learning
Tree search
Dynamic programming
no solution
Example: transportation
Idea: Backtracking search + stop when we find the first end state.
Idea:
• Modify DFS to stop at a maximum depth.
• Call DFS for maximum depths 1, 2, . . . .
DFS with maximum depth d asks: is there a solution using at most d actions?
Legend: b actions per state, solution size d
• Space: O(d) (saved!)
• Time: O(b^d) (same as BFS)
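A minimal sketch of iterative deepening, assuming a search-problem interface with startState(), isEnd(s), and succAndCost(s) returning (action, nextState, cost) triples (these names are illustrative, not from the slides):

def depth_limited_dfs(problem, s, depth, path):
    # Returns a list of actions reaching an end state within `depth` steps, else None.
    if problem.isEnd(s):
        return list(path)
    if depth == 0:
        return None
    for action, next_state, cost in problem.succAndCost(s):
        path.append(action)
        result = depth_limited_dfs(problem, next_state, depth - 1, path)
        if result is not None:
            return result
        path.pop()
    return None

def iterative_deepening(problem, max_depth=100):
    # Call DFS for maximum depths 1, 2, ...: O(d) space, O(b^d) time.
    for d in range(1, max_depth + 1):
        result = depth_limited_dfs(problem, problem.startState(), d, [])
        if result is not None:
            return result       # solution with the fewest actions
    return None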
Tree search
Dynamic programming
[Diagram: from state s, an action a with cost Cost(s, a) leads to state s′; FutureCost(s′) is the remaining cost from s′ to the end state]
Find the minimum cost path from city 1 to city n, only moving
forward. It costs cij to go from i to j.
[Figure: search tree over the cities 1–7, with exponentially many branches]
[Figure: the same cities 1–7 viewed as a graph of states]
def DynamicProgramming(s):
    # cache is a dict of already-computed answers; problem.succAndCost(s) is assumed
    # to return (action, nextState, cost) triples. Returns FutureCost(s).
    if s in cache: return cache[s]                    # already computed for s
    if problem.isEnd(s): return 0                     # no future cost at an end state
    cache[s] = min(cost + DynamicProgramming(nextState)        # for each action a in Actions(s)
                   for action, nextState, cost in problem.succAndCost(s))
    return cache[s]
Assumption: acyclicity
Find the minimum cost path from city 1 to city n, only moving
forward. It costs cij to go from i to j.
Constraint: Can’t visit three odd cities in a row.
[Figure: search graph over augmented states that pair the current city with bookkeeping about consecutive odd cities, e.g., (odd, 3), (odd, 4), (1,3), (2,4), . . . ; edges labeled with costs such as 3:c13 and 4:c14]
Tree search
Dynamic programming
[Diagram: sstart leads to state s (accumulated PastCost(s)), from which action a with cost Cost(s, a) leads to s′]
Assumption: non-negativity
UCS in action:
[Figure: small example graph with states A, B, C, D and its explored / frontier / unexplored regions]
Start state: A, end state: D
[whiteboard]
Minimum cost path:
A → B → C → D with cost 3
Theorem: correctness — when a state s is popped from the frontier and moved to the explored set, its priority equals PastCost(s), the minimum cost to s.
Proof:
[Figure: explored region containing sstart and s; a frontier state t; an unexplored state u]
DP: no cycles allowed, any action costs, time O(N)
Note: UCS potentially explores fewer states, but requires more overhead
to maintain the priority queue
depth-first search
breadth-first search
dynamic programming
Modeling
Inference
Learning
Modeling
Inference
Learning
Learning costs
A* search
Relaxation
search algorithm: Cost(s, a) → (a1, . . . , ak)
learning algorithm: (a1, . . . , ak) → Cost(s, a)
[Figure: two copies of the transportation graph over blocks 1–4 — one with edge costs walk:3, tram:2 and one with walk:1, tram:3]
• Suppose the walk cost is 3 and the tram cost is 2. Then, we would obviously predict the [walk, tram]
path, which has lower cost.
• But this is not our desired output, because we actually saw the person walk all the way from 1 to 4. How
can we update the action costs so that the minimum cost path is walking?
• Intuitively, we want the tram cost to be higher and the walk cost to be lower. Specifically, let's increase the
cost of every action on the predicted path and decrease the cost of every action on the true path. Now,
the predicted path coincides with the true observed path. Is this a good strategy in general?
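A minimal sketch of this update on per-action costs (a structured-perceptron-style step); predict_path, the cost dictionary w, and the step size eta are illustrative names, not from the slides:

def update_costs(w, true_path, predict_path, eta=1.0):
    # w: dict mapping action name -> current cost estimate.
    pred_path = predict_path(w)          # minimum-cost path under the current costs
    if pred_path == true_path:
        return w                         # prediction already matches the observation
    for a in pred_path:                  # predicted path looked too cheap:
        w[a] += eta                      #   increase the cost of its actions
    for a in true_path:                  # true path looked too expensive:
        w[a] -= eta                      #   decrease the cost of its actions
    return w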
Modeling costs (simplified)
Assume costs depend only on the action:
Cost(s, a) = w[a]
More generally, with features:
Cost(s, a) = w · φ(s, a)
Example:
a1: w · [1, 0, 1]    a2: w · [1, 2, 0]
y: s0 → s1 → s2
Path cost:
Cost(y) = 2 + 1 = 3
• Machine translation
Learning costs
A* search
Relaxation
A* in action:
[Diagram: sstart → s → send, with PastCost(s) on the first segment and FutureCost(s) on the second]
Intuition: add a penalty for how much action a takes us away from the
end state
Example:
[Figure: chain of states A, B, C, D, E from sstart to send, with heuristic values h(s) = 4, 3, 2, 1, 0 and edge labels 2, 2, 0, 0]
Cost′(C, B) = Cost(C, B) + h(B) − h(C) = 1 + (3 − 2) = 2
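A* can be run as ordinary UCS on these modified costs; a minimal sketch, assuming succAndCost(s) returns (action, nextState, cost) triples:

def with_modified_costs(succAndCost, h):
    # Cost'(s, a) = Cost(s, a) + h(Succ(s, a)) - h(s)
    def modified(s):
        return [(a, sp, cost + h(sp) - h(s)) for a, sp, cost in succAndCost(s)]
    return modified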
[Figure: counterexample graph in which the heuristic assigns h(C) = 1000]
Doesn't work because of negative modified edge costs!
• So far, we’ve just said that h(s) is just an approximation of FutureCost(s). But can it be any approximation?
• The answer is no, as the counterexample clearly shows. The modified edge costs would be 1 (A to B),
1002 (A to C), 5 (B to D), and -999 (C to D). UCS would go to B first and then to D, finding a cost 6
path rather than the optimal cost 3 path through C.
• If our heuristic is lying to us (bad approximation of future costs), then running A* (UCS on modified costs)
could lead to a suboptimal solution. Note that the reason this heuristic doesn’t work is the same reason
UCS doesn’t work when there are negative action costs.
Consistent heuristics
Definition: consistency
A heuristic h is consistent if
• Cost′(s, a) = Cost(s, a) + h(Succ(s, a)) − h(s) ≥ 0
• h(send ) = 0.
[Diagram: from s, taking action a (cost Cost(s, a)) to Succ(s, a) and then following h(Succ(s, a)) toward send should cost at least h(s)]
Proposition: correctness — if h is consistent, A* returns the minimum cost path.
• Key identity: for any path (s0, a1, s1, . . . , aL, sL),
Σ_{i=1}^{L} Cost′(s_{i−1}, a_i) = Σ_{i=1}^{L} Cost(s_{i−1}, a_i) + h(sL) − h(s0)
(modified path cost) = (original path cost) + (constant)
Definition: admissibility — a heuristic h is admissible if h(s) ≤ FutureCost(s) for every state s.
Learning costs
A* search
Relaxation
Theorem: efficiency of A*
A* explores all states s satisfying PastCost(s) ≤ PastCost(send) − h(s).
Remove constraints: Hard → Easy
Heuristic:
h(s) = ManhattanDistance(s, (2, 5))
e.g., h((1, 1)) = 5
Start state: 1
Walk action: from s to s + 1 (cost: 1)
Tram action: from s to 2s (cost: 2)
End state: n
Constraint: can’t have more tram actions than walk actions.
Start state: n
Walk action: from s to s − 1 (cost: 1)
Tram action: from s to s/2 (cost: 2)
End state: 1
bike
drive
Caltrain
Uber/Lyft
fly
Search problems
Machine learning
deterministic: state s, action a → state Succ(s, a)
with uncertainty: state s, action a → one of several states s′1, s′2, . . .
Policy evaluation
Value iteration
[Figure: dice game — two histograms of the probability of each total reward (4, 8, 12, 16, 20), one per policy]
quit (1/3): $4
• Succ(s, a) ⇒ T(s, a, s′)
• Cost(s, a) ⇒ Reward(s, a, s′)
s     a      s′     T(s, a, s′)
in    quit   end    1
in    stay   in     2/3
in    stay   end    1/3
Example: transportation
[semi-live solution]
Definition: policy — a policy π is a mapping from each state s ∈ States to an action a ∈ Actions(s).
Policy evaluation
Value iteration
Definition: utility — following a policy yields a random path; the utility of a path is the (discounted) sum of the rewards on the path.
Path                                                          Utility
[in; stay, 4, end]                                            4
[in; stay, 4, in; stay, 4, end]                               8
[in; stay, 4, in; stay, 4, in; stay, 4, end]                  12
[in; stay, 4, in; stay, 4, in; stay, 4, in; stay, 4, end]     16
...                                                           ...
[Demo: volcano crossing grid — one episode from (2,1) takes E, S, E, E and receives −50 on the last step; Value: 3.73, Utility: −36.79]
Definition: value (expected utility) of a policy
Vπ(s) = { 0               if IsEnd(s)
        { Qπ(s, π(s))     otherwise
Qπ(s, a) = Σ_{s′} T(s, a, s′) [Reward(s, a, s′) + γ Vπ(s′)]
[Table: policy evaluation iterates Vπ^(t)(s) for each state s over iterations t = 0, . . . , 9, converging toward (2.6, 3.1, 4, 3.1, 2.6)]
How many iterations (tPE)? Repeat until values don't change much:
max_{s ∈ States} |Vπ^(t)(s) − Vπ^(t−1)(s)| ≤ ε
Don't store Vπ^(t) for each iteration t; we only need the last two, Vπ^(t) and Vπ^(t−1).
MDP complexity
S states
A actions per state
S′ successors (the number of s′ with T(s, a, s′) > 0)
Time: O(tPE S S′)
Vπ^(t)(end) = 0

s         end     in
Vπ^(t)    0.00    12.00    (t = 100 iterations)
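A minimal policy evaluation sketch matching the recurrences above; the MDP interface (succProbReward(s, a) returning (s′, T(s, a, s′), Reward(s, a, s′)) triples, is_end, and a state list) is assumed for illustration:

def policy_evaluation(states, policy, succProbReward, is_end,
                      gamma=1.0, epsilon=1e-10):
    V = {s: 0.0 for s in states}                 # V_pi^(0)(s) = 0
    while True:
        newV = {}
        for s in states:
            if is_end(s):
                newV[s] = 0.0
            else:
                a = policy(s)
                newV[s] = sum(prob * (reward + gamma * V[sp])
                              for sp, prob, reward in succProbReward(s, a))
        if max(abs(newV[s] - V[s]) for s in states) < epsilon:
            return newV                          # values no longer change much
        V = newV

# Dice game: "stay" keeps you in with prob 2/3 or ends with prob 1/3, reward $4 either way.
V = policy_evaluation(states=['in', 'end'],
                      policy=lambda s: 'stay',
                      succProbReward=lambda s, a: [('in', 2/3, 4), ('end', 1/3, 4)],
                      is_end=lambda s: s == 'end')
# V['in'] is approximately 12, matching the table above.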
Policy evaluation
Value iteration
The optimal value Vopt (s) is the maximum value attained by any
policy.
[semi-live solution]
s           end     in
Vopt^(t)    0.00    12.00    (t = 100 iterations)
πopt(s)     -       stay
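A minimal value iteration sketch for the same assumed interface, reading off πopt from the Q-values at the end:

def value_iteration(states, actions, succProbReward, is_end,
                    gamma=1.0, epsilon=1e-10):
    V = {s: 0.0 for s in states}
    def Q(s, a):
        return sum(prob * (reward + gamma * V[sp])
                   for sp, prob, reward in succProbReward(s, a))
    while True:
        newV = {s: 0.0 if is_end(s) else max(Q(s, a) for a in actions(s))
                for s in states}
        done = max(abs(newV[s] - V[s]) for s in states) < epsilon
        V = newV
        if done:
            break
    pi_opt = {s: None if is_end(s) else max(actions(s), key=lambda a: Q(s, a))
              for s in states}
    return V, pi_opt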
Theorem: convergence
Suppose either
• discount γ < 1, or
• MDP graph is acyclic.
Then value iteration converges to the correct answer.
Example: non-convergence
• Let us state more formally the conditions under which the algorithms we have talked about will work. A sufficient condition is that
either the discount γ is strictly less than 1 or the MDP graph is acyclic.
• We can reinterpret the discount γ < 1 condition as introducing a new transition from each state to a special end state with probability (1 − γ),
multiplying all the other transition probabilities by γ, and setting the discount to 1. The interpretation is that with probability 1 − γ, the
MDP terminates at any state.
• In this view, we just need that a sampled path be finite with probability 1.
• We won’t prove this theorem, but will instead give a counterexample to show that things can go badly if we have a cyclic graph and γ = 1.
In the graph, whatever we initialize value iteration, value iteration will terminate immediately with the same value. In some sense, this isn’t
really the fault of value iteration, but it’s because all paths are of infinite length. In some sense, if you were to simulate from this MDP, you
would never terminate, so we would never find out what your utility was at the end.
Roadmap
Learning costs
A* search
Relaxation
Tightness:
Proof: exercise
quit (1/3): $4
s0 ; a1 , r1 , s1 ; a2 , r2 , s2 ; a3 , r3 , s3 ; . . . ; an , rn , sn
reinforcement learning!
Start A B
[Diagram: the agent sends an action a to the environment; the environment returns a reward r and a new state s′]
For t = 1, 2, 3, . . .
Choose action at = πact (st−1 ) (how?)
Receive reward rt and observe new state st
Update parameters (how?)
[Demo: volcano crossing with a random policy — the episode from (2,1) wanders (W, N, E, S, . . .) and eventually ends with utility 2]
• Recall the volcano crossing example from the previous lecture. Each square is a state. From each state,
you can take one of four actions to move to an adjacent state: north (N), east (E), south (S), or west
(W). If you try to move off the grid, you remain in the same state. The starting state is (2,1), and the
end states are the four marked with red or green rewards. Transitions from (s, a) lead where you expect
with probability 1-slipProb and to a random adjacent square with probability slipProb.
• If we solve the MDP using value iteration (by setting numIters to 10), we will find the best policy (which
is to head for the 20). Of course, we can’t solve the MDP if we don’t know the transitions or rewards.
• If you set numIters to zero, we start off with a random policy. Try pressing the Run button to generate
fresh episodes. How can we learn from this data and improve our policy?
Roadmap
Reinforcement learning
Bootstrapping methods
Summary
Data: s0 ; a1 , r1 , s1 ; a2 , r2 , s2 ; a3 , r3 , s3 ; . . . ; an , rn , sn
Transitions:
T̂(s, a, s′) = (# times (s, a, s′) occurs) / (# times (s, a) occurs)
Rewards:
\widehat{Reward}(s, a, s′) = r in (s, a, r, s′)
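A sketch of these estimates from data; episodes are assumed to be flat lists [s0, a1, r1, s1, a2, r2, s2, ...] as in the data line above:

from collections import defaultdict

def estimate_mdp(episodes):
    count_sas = defaultdict(int)     # number of times (s, a, s') occurs
    count_sa = defaultdict(int)      # number of times (s, a) occurs
    reward_hat = {}                  # observed r for each (s, a, s')
    for ep in episodes:
        s = ep[0]
        for i in range(1, len(ep), 3):
            a, r, sp = ep[i], ep[i + 1], ep[i + 2]
            count_sa[(s, a)] += 1
            count_sas[(s, a, sp)] += 1
            reward_hat[(s, a, sp)] = r
            s = sp
    T_hat = {(s, a, sp): n / count_sa[(s, a)]
             for (s, a, sp), n in count_sas.items()}
    return T_hat, reward_hat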
[Diagram: estimated dice-game MDP, with edges labeled by estimated quantities such as quit (3/7): $4 and unobserved transitions still marked ?: $?]
All that matters for prediction is (estimate of) Qopt (s, a).
u_t = r_t + γ · r_{t+1} + γ^2 · r_{t+2} + · · ·
Estimate: Q̂π(s, a) = average of u_t where s_{t−1} = s and a_t = a
[Diagram: dice-game MDP with unknown transition probabilities and rewards, edges marked ?: $?]
[whiteboard: u1 , u2 , u3 ]
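A model-free Monte Carlo sketch: estimate Q̂π(s, a) as a running average of the observed utilities u_t that follow (s, a), using the convex-combination form of the update (episodes use the same flat-list format as above):

from collections import defaultdict

def model_free_mc(episodes, gamma=1.0):
    Q = defaultdict(float)
    counts = defaultdict(int)
    for ep in episodes:
        steps = [(ep[i], ep[i + 1], ep[i + 2]) for i in range(1, len(ep), 3)]
        # u_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..., computed backwards
        u, utilities = 0.0, []
        for a, r, sp in reversed(steps):
            u = r + gamma * u
            utilities.append(u)
        utilities.reverse()
        s = ep[0]
        for (a, r, sp), u in zip(steps, utilities):
            counts[(s, a)] += 1
            eta = 1.0 / counts[(s, a)]
            Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * u   # running average
            s = sp
    return Q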
[Demo: model-free Monte Carlo on volcano crossing — estimated Q-values shown on the grid; the episode ends with utility 2]
Reinforcement learning
Bootstrapping methods
Summary
On each (s, a, r, s′, a′):
Q̂π(s, a) ← (1 − η) Q̂π(s, a) + η [ r + γ Q̂π(s′, a′) ]
(r is data; γ Q̂π(s′, a′) is an estimate)
u                              r + γ Q̂π(s′, a′)
based on one path              based on estimate
unbiased                       biased
large variance                 small variance
wait until end to update       can update immediately
Algorithm: Q-learning
On each (s, a, r, s′):
Q̂opt(s, a) ← (1 − η) Q̂opt(s, a) + η (r + γ V̂opt(s′))
(prediction: Q̂opt(s, a); target: r + γ V̂opt(s′))

SARSA (on each (s, a, r, s′, a′)):
Q̂π(s, a) ← (1 − η) Q̂π(s, a) + η (r + γ Q̂π(s′, a′))

Q-learning (on each (s, a, r, s′)):
Q̂opt(s, a) ← (1 − η) Q̂opt(s, a) + η (r + γ max_{a′ ∈ Actions(s′)} Q̂opt(s′, a′))
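A tabular Q-learning sketch of the last update; Q is a dict of estimates, actions(s′) lists the legal actions, and is_end marks end states (an illustrative interface):

def q_learning_update(Q, s, a, r, s_prime, actions, is_end,
                      eta=0.1, gamma=1.0):
    # V_opt(s') estimate: 0 at an end state, otherwise max over a' of Q(s', a')
    v_opt = 0.0 if is_end(s_prime) else max(Q.get((s_prime, ap), 0.0)
                                            for ap in actions(s_prime))
    target = r + gamma * v_opt
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * target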
[Demo: Q-learning on volcano crossing — estimated Q-values shown on the grid; a short episode ends with utility 2]
Reinforcement learning
Bootstrapping methods
Summary
[Demo: Q-learning with exploration on volcano crossing (rewards −50, 2, 100) — estimated Q-values after a few episodes]
[Demos: Q-learning on volcano crossing with different amounts of exploration — in one run the estimated Q-values approach 100 (the large reward), in another they stay near 2 (the small nearby reward)]
On each (s, a, r, s′):
w ← w − η [ Q̂opt(s, a; w) − (r + γ V̂opt(s′)) ] φ(s, a)
(prediction: Q̂opt(s, a; w); target: r + γ V̂opt(s′))
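A sketch of this update with Q̂opt(s, a; w) = w · φ(s, a) and V̂opt(s′) approximated by the max over a′ of w · φ(s′, a′); phi, actions, and is_end are assumed helpers:

import numpy as np

def q_learning_fa_update(w, phi, s, a, r, s_prime, actions, is_end,
                         eta=0.1, gamma=1.0):
    prediction = w @ phi(s, a)
    v_opt = 0.0 if is_end(s_prime) else max(w @ phi(s_prime, ap)
                                            for ap in actions(s_prime))
    target = r + gamma * v_opt
    w -= eta * (prediction - target) * phi(s, a)    # gradient step on w
    return w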
• Exploration/exploitation tradeoff
Reinforcement learning
Bootstrapping methods
Summary
stateless state
http://karpathy.github.io/2016/05/31/rl
Next time...
Adversarial games: against opponent (e.g., chess)
Search problems
Markov decision processes
Constraint satisfaction problems
Adversarial games
Bayesian networks
Machine learning
Example: game 1
[Figure: game tree — you choose A, B, or C, and the opponent then picks between two leaves with utilities (−50, 50), (1, 3), (−5, 15)]
Games, expectimax
Minimax, expectiminimax
Evaluation functions
Alpha-beta pruning
πagent(s) = A
πopp(s, a) = 1/2 for a ∈ Actions(s)
[Figure: game tree with leaf utilities −50, 50, 1, 3, −5, 15 — under these policies the opponent nodes have expected values 0, 2, 5 and the root has value 0]
Veval (sstart ) = 0
Veval(s) =
  Utility(s)                                            if IsEnd(s)
  Σ_{a ∈ Actions(s)} πagent(s, a) Veval(Succ(s, a))     if Player(s) = agent
  Σ_{a ∈ Actions(s)} πopp(s, a) Veval(Succ(s, a))       if Player(s) = opp
[Figure: expectimax on the game tree — the opponent nodes have expected values 0, 2, 5]
Vexptmax(sstart) = 5
Vexptmax(s) =
  Utility(s)                                            if IsEnd(s)
  max_{a ∈ Actions(s)} Vexptmax(Succ(s, a))             if Player(s) = agent
  Σ_{a ∈ Actions(s)} πopp(s, a) Vexptmax(Succ(s, a))    if Player(s) = opp
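A recursive expectimax sketch matching the recurrence above; game is an assumed interface with isEnd, utility, player, actions, and succ, and pi_opp(s, a) gives the opponent's action probabilities:

def expectimax(game, pi_opp, s):
    if game.isEnd(s):
        return game.utility(s)
    values = [(a, expectimax(game, pi_opp, game.succ(s, a)))
              for a in game.actions(s)]
    if game.player(s) == 'agent':
        return max(v for a, v in values)             # agent maximizes
    return sum(pi_opp(s, a) * v for a, v in values)  # opponent plays pi_opp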
Games, expectimax
Minimax, expectiminimax
Evaluation functions
Alpha-beta pruning
[Figure: minimax on the game tree — the opponent nodes have values −50, 1, −5]
Vminmax(sstart) = 1
Vminmax(s) =
  Utility(s)                                            if IsEnd(s)
  max_{a ∈ Actions(s)} Vminmax(Succ(s, a))              if Player(s) = agent
  min_{a ∈ Actions(s)} Vminmax(Succ(s, a))              if Player(s) = opp
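The corresponding minimax sketch (same assumed interface), where the opponent minimizes:

def minimax(game, s):
    if game.isEnd(s):
        return game.utility(s)
    values = [minimax(game, game.succ(s, a)) for a in game.actions(s)]
    return max(values) if game.player(s) == 'agent' else min(values)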
[Figures: the same game tree evaluated against the minimax opponent πmin and against the random opponent π7 (probability 1/2 on each action)]
                  πmin                               π7
πmax              V(πmax, πmin) = 1          ≤       V(πmax, π7) = 2
                  ≥
πexptmax(7)       V(πexptmax(7), πmin) = −5  ≤       V(πexptmax(7), π7) = 5
• Putting the three properties together, we obtain a chain of inequalities that allows us to relate all four
game values.
• We can also compute these values concretely for the running example.
A modified game
Example: game 2
[Figure: game 2 tree with leaf utilities −50, 50, 1, 3, −5, 15 and an added coin-flip player]
Example: expectiminimax
πcoin(s, a) = 1/2 for a ∈ {0, 1}
[Figure: expectiminimax tree — min-node values −50, −5, 1; coin-node values −27.5, −24.5, −2; root value −2]
Vexptminmax (sstart ) = −2
Vexptminmax(s) =
  Utility(s)                                                    if IsEnd(s)
  max_{a ∈ Actions(s)} Vexptminmax(Succ(s, a))                  if Player(s) = agent
  min_{a ∈ Actions(s)} Vexptminmax(Succ(s, a))                  if Player(s) = opp
  Σ_{a ∈ Actions(s)} πcoin(s, a) Vexptminmax(Succ(s, a))        if Player(s) = coin
Games, expectimax
Minimax, expectiminimax
Evaluation functions
Alpha-beta pruning
Example: chess
Games, expectimax
Minimax, expectiminimax
Evaluation functions
Alpha-beta pruning
[Figures: alpha-beta pruning examples — a subtree can be pruned once its value is known to be no better (e.g., ≤ 2) than an option already available to the maximizing player (e.g., 3)]
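A minimal alpha-beta pruning sketch on the same assumed interface: alpha is the best value the agent can already guarantee along the current path, beta the best (lowest) value the opponent can guarantee; a branch is cut as soon as alpha ≥ beta:

def alphabeta(game, s, alpha=float('-inf'), beta=float('inf')):
    if game.isEnd(s):
        return game.utility(s)
    if game.player(s) == 'agent':
        value = float('-inf')
        for a in game.actions(s):
            value = max(value, alphabeta(game, game.succ(s, a), alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:        # opponent would never allow this branch
                break
        return value
    value = float('inf')
    for a in game.actions(s):
        value = min(value, alphabeta(game, game.succ(s, a), alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:            # agent already has a better option elsewhere
            break
    return value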
Vminmax(s, d) =
  Utility(s)                                            if IsEnd(s)
  Eval(s)                                               if d = 0
  max_{a ∈ Actions(s)} Vminmax(Succ(s, a), d)           if Player(s) = agent
  min_{a ∈ Actions(s)} Vminmax(Succ(s, a), d − 1)       if Player(s) = opp
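A depth-limited sketch matching the recurrence above, where d counts remaining opponent moves and Eval(s) is used at the depth limit:

def minimax_limited(game, evaluate, s, d):
    if game.isEnd(s):
        return game.utility(s)
    if d == 0:
        return evaluate(s)                               # Eval(s)
    if game.player(s) == 'agent':
        return max(minimax_limited(game, evaluate, game.succ(s, a), d)
                   for a in game.actions(s))
    return min(minimax_limited(game, evaluate, game.succ(s, a), d - 1)
               for a in game.actions(s))                 # opponent move uses up depth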
Example: chess
Eval(s) = V (s; w)
TD learning
Simultaneous games
Non-zero-sum games
State-of-the-art
V (s; w) = w · φ(s)
Neural network:
V(s; w, v_{1:k}) = Σ_{j=1}^{k} w_j σ(v_j · φ(s))
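A small sketch of this one-hidden-layer evaluation function with the logistic σ; shapes are illustrative:

import numpy as np

def V(phi_s, w, v):
    # phi_s: features (d,); v: hidden weights (k, d); w: output weights (k,)
    return w @ (1.0 / (1.0 + np.exp(-(v @ phi_s))))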
Generate episode:
s0 ; a1 , r1 , s1 ; a2 , r2 , s2 ; a3 , r3 , s3 ; . . . ; an , rn , sn
Objective: (1/2) (prediction(w) − target)^2
Gradient: (prediction(w) − target) ∇w prediction(w)
Update: w ← w − η (prediction(w) − target) ∇w prediction(w)
Algorithm: TD learning
On each (s, a, r, s′):
w ← w − η [ V(s; w) − (r + γ V(s′; w)) ] ∇w V(s; w)
(prediction: V(s; w); target: r + γ V(s′; w))
V (s; w) = w · φ(s)
∇w V (s; w) = φ(s)
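A TD learning sketch with the linear value function above, so ∇w V(s; w) = φ(s); phi and is_end are assumed helpers:

import numpy as np

def td_update(w, phi, s, r, s_prime, is_end, eta=0.1, gamma=1.0):
    prediction = w @ phi(s)
    target = r + (0.0 if is_end(s_prime) else gamma * (w @ phi(s_prime)))
    w -= eta * (prediction - target) * phi(s)
    return w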
Example: TD learning
On each (s, a, r, s′):
w ← w − η [ V̂π(s; w) − (r + γ V̂π(s′; w)) ] ∇w V̂π(s; w)
(prediction: V̂π(s; w); target: r + γ V̂π(s′; w))
Algorithm: Q-learning
On each (s, a, r, s′):
w ← w − η [ Q̂opt(s, a; w) − (r + γ max_{a′ ∈ Actions(s′)} Q̂opt(s′, a′; w)) ] ∇w Q̂opt(s, a; w)
(prediction: Q̂opt(s, a; w); target: r + γ max_{a′ ∈ Actions(s′)} Q̂opt(s′, a′; w))
TD learning:
• Operate on V̂π(s; w)
• On-policy: the value is based on the exploration policy (usually based on V̂π)
• To use, need to know the rules of the game: Succ(s, a)
Up next:
Turn-based → Simultaneous
Zero-sum → Non-zero-sum
TD learning
Simultaneous games
Non-zero-sum games
State-of-the-art
[Figure: the turn-based game tree from before, with leaf utilities −50, 50, 1, 3, −5, 15]
Simultaneous games:
Players = {A, B}
Actions: possible actions
V (a, b): A’s utility if A chooses action a, B chooses b
(let V be payoff matrix)
Always 1: π = [1, 0]
Always 2: π = [0, 1]
Uniformly random: π = [1/2, 1/2]
[whiteboard: matrix]
• Given a game (payoff matrix) and the strategies for the two players, we can define the value of the game.
• For pure strategies, the value of the game by definition is just reading out the appropriate entry from the
payoff matrix.
• For mixed strategies, the value of the game (that is, the expected utility for player A) is obtained by summing
over the possible actions that the players choose: V(πA, πB) = Σ_{a ∈ Actions} Σ_{b ∈ Actions} πA(a) πB(b) V(a, b).
We can also write this expression concisely as a matrix-vector product: πA^⊤ V πB.
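A tiny sketch of this computation, using the payoff matrix assumed from the running example below (V = [[2, −3], [−3, 4]]):

import numpy as np

V = np.array([[2, -3],
              [-3, 4]])           # payoff matrix for player A (assumed from the example)
pi_A = np.array([0.5, 0.5])       # mixed strategy for A
pi_B = np.array([1.0, 0.0])       # pure strategy for B (always action 1)
value = pi_A @ V @ pi_B           # = -0.5, matching V(pi_A, pi_B) below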
How to optimize?
Game value:
V (πA , πB )
simultaneously
Payoff matrix V(a, b) for player A:
         B = 1    B = 2
A = 1      2       −3
A = 2     −3        4
Player A reveals: πA = [1/2, 1/2]
Value V(πA, πB) = πB(1) · (−1/2) + πB(2) · (+1/2)
Optimal strategy for player B is πB = [1, 0] (pure!)
Player A reveals: π = [p, 1 − p]
If B plays 1:  p · (2) + (1 − p) · (−3) = 5p − 3
If B plays 2:  p · (−3) + (1 − p) · (4) = −7p + 4
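If player A reveals π = [p, 1 − p] first, player B picks whichever of the two columns is smaller, so A maximizes the minimum of the two expressions. Setting them equal, 5p − 3 = −7p + 4 gives p = 7/12, and the resulting game value is 5 · (7/12) − 3 = −1/12.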
TD learning
Simultaneous games
Non-zero-sum games
State-of-the-art
Real life: ?
Nash equilibrium: A and B both play π = [7/12, 5/12].
TD learning
Simultaneous games
Non-zero-sum games
State-of-the-art
Closure:
• 2007: Checkers was solved in the minimax sense (optimal play leads to a draw), but that doesn't mean you can't win against imperfect play
• Alpha-beta search + 39 trillion endgame positions
Solution: learning