Logistics: CSE 473 Markov Decision Processes


10/12/2012

Logistics
• PS 2 due Thursday 10/18 (moved from Tuesday)
• PS 3 due Thursday 10/25

CSE 473 Markov Decision Processes

Dan Weld

Many slides from Chris Bishop, Mausam, Dan Klein,
Stuart Russell, Andrew Moore & Luke Zettlemoyer

MDPs
Markov Decision Processes
• Planning Under Uncertainty
• Mathematical Framework
• Bellman Equations
• Value Iteration
• Real‐Time Dynamic Programming
• Policy Iteration
• Reinforcement Learning
[Photo: Andrey Markov (1856‐1922)]

Planning Agent
[Figure: an agent receives Percepts from the Environment and sends Actions back, asking "What action next?". Dimensions shown: Static vs. Dynamic; Fully vs. Partially Observable; Deterministic vs. Stochastic; Perfect vs. Noisy percepts; Instantaneous vs. Durative actions]

Objective of an MDP
• Find a policy π: S → A
• which optimizes
  • minimizes discounted expected cost to reach a goal, or
  • maximizes undiscounted expected reward, or
  • maximizes expected (reward‐cost)
• given a ____ horizon
  • finite
  • infinite
  • indefinite

Review: Expectimax
• What if we don’t know what the result of an action will be? E.g.,
  • In solitaire, next card is unknown
  • In pacman, the ghosts act randomly
• Can do expectimax search
  • Max nodes as in minimax search
  • Chance nodes, like min nodes, except the outcome is uncertain: take average (expectation) of children
  • Calculate expected utilities
• Today, we formalize as a Markov Decision Process
  • Handle intermediate rewards & infinite plans
  • More efficient processing
[Figure: expectimax tree with a max node above chance nodes; leaf utilities 10, 4, 5, 7]
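
A minimal sketch of the expectimax recursion reviewed above: max nodes take the best child, chance nodes take the probability-weighted average. The tree shape is an assumption made for illustration (two equally likely outcomes under each of two chance nodes, using the leaf utilities 10, 4, 5, 7); the slide's actual tree layout is not recoverable from the text.

```python
def expectimax(node):
    """Evaluate a tree made of numeric leaves, ("max", [children]),
    and ("chance", [(probability, child), ...]) nodes."""
    if isinstance(node, (int, float)):
        return float(node)
    kind, children = node
    if kind == "max":
        return max(expectimax(child) for child in children)
    if kind == "chance":
        return sum(p * expectimax(child) for p, child in children)
    raise ValueError(f"unknown node kind: {kind}")

# Hypothetical tree: a max node choosing between two chance nodes.
tree = ("max", [
    ("chance", [(0.5, 10), (0.5, 4)]),   # expectation 7
    ("chance", [(0.5, 5), (0.5, 7)]),    # expectation 6
])
root_value = expectimax(tree)
```

Under these assumed probabilities the max node prefers the first chance node.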


Grid World
• Walls block the agent’s path
• Agent’s actions may go astray:
  • 80% of the time, North action takes the agent North (assuming no wall)
  • 10%: actually go West
  • 10%: actually go East
  • If there is a wall in the chosen direction, the agent stays put
• Small “living” reward each step
• Big rewards come at the end
• Goal: maximize sum of rewards

Markov Decision Processes
• An MDP is defined by:
  • A set of states s ∈ S
  • A set of actions a ∈ A
  • A transition function T(s,a,s’)
    • Prob that a from s leads to s’
    • i.e., P(s’ | s,a)
    • Also called “the model”
  • A reward function R(s, a, s’)
    • Sometimes just R(s) or R(s’)
  • A start state (or distribution)
  • Maybe a terminal state
• MDPs: non‐deterministic search
• Reinforcement learning: MDPs where we don’t know the transition or reward functions
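
A minimal sketch of the Grid World transition function T(s,a,s’) described above (80% intended direction, 10% each to the left and right of it, stay put when blocked). The 4x3 layout, the single wall cell, and all names here are assumptions for illustration; the slides do not give the grid's dimensions in text. Grid edges are treated like walls.

```python
# Hypothetical layout: 4 columns x 3 rows; states are (column, row) pairs.
WIDTH, HEIGHT = 4, 3
WALLS = {(1, 1)}
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
# Directions 90 degrees to either side of the intended action.
SIDEWAYS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

def move(state, direction):
    """Deterministic move; the agent stays put if a wall or the grid edge blocks it."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    inside = 0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT
    return nxt if inside and nxt not in WALLS else state

def transition(state, action):
    """Return {s': P(s' | state, action)} under the 80/10/10 noise model."""
    side_a, side_b = SIDEWAYS[action]
    dist = {}
    for direction, prob in ((action, 0.8), (side_a, 0.1), (side_b, 0.1)):
        s2 = move(state, direction)
        dist[s2] = dist.get(s2, 0.0) + prob
    return dist
```

For example, `transition((1, 0), "N")` puts probability 0.8 on staying at `(1, 0)` because the assumed wall at `(1, 1)` blocks the intended move.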

What is Markov about MDPs?
• Andrey Markov (1856‐1922)
• “Markov” generally means that
  • conditioned on the present state,
  • the future is independent of the past
• For Markov decision processes, “Markov” means:
  $P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}, \ldots, S_0 = s_0) = P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t)$

Solving MDPs
• In deterministic single-agent search problems, want an optimal plan, or sequence of actions, from start to a goal
• In an MDP, we want an optimal policy π*: S → A
  • A policy π gives an action for each state
  • An optimal policy maximizes expected utility if followed
  • Defines a reflex agent
[Figure: optimal policy when R(s, a, s’) = ‐0.03 for all non‐terminals s]

Example Optimal Policies
[Figure: four grid‐world panels showing the optimal policy for R(s) = ‐0.01, R(s) = ‐0.03, R(s) = ‐0.4, and R(s) = ‐2.0]


Example: High‐Low
• Three card types: 2, 3, 4
  • Infinite deck, twice as many 2’s
• Start with 3 showing
• After each card, you say “high” or “low”
• New card is flipped
  • If you’re right, you win the points shown on the new card
  • Ties are no‐ops (no reward)
  • If you’re wrong, game ends
• Differences from expectimax problems:
  • #1: get rewards as you go
  • #2: you might play forever!

High‐Low as an MDP
• States:
  • 2, 3, 4, done
• Actions:
  • High, Low
• Model: T(s, a, s’):
  • P(s’=4 | 4, Low) = 1/4
  • P(s’=3 | 4, Low) = 1/4
  • P(s’=2 | 4, Low) = 1/2
  • P(s’=done | 4, Low) = 0
  • P(s’=4 | 4, High) = 1/4
  • P(s’=3 | 4, High) = 0
  • P(s’=2 | 4, High) = 0
  • P(s’=done | 4, High) = 3/4
  • …
• Rewards: R(s, a, s’):
  • Number shown on s’ if s’ > s ∧ a = “high” …
  • 0 otherwise
• Start: 3
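
The High‐Low model can be written out as a small table-generating function. This is a sketch built only from the rules stated on the slides (an infinite deck with twice as many 2’s, ties are no-ops, a wrong guess ends the game); the function names are illustrative, and exact fractions are used so the results can be compared directly with the listed probabilities.

```python
from fractions import Fraction

# Infinite deck with twice as many 2's as 3's or 4's.
CARD_PROB = {2: Fraction(1, 2), 3: Fraction(1, 4), 4: Fraction(1, 4)}

def transition(s, a):
    """Return {s': P(s' | s, a)} for a showing card s and a in {"High", "Low"}."""
    dist = {2: Fraction(0), 3: Fraction(0), 4: Fraction(0), "done": Fraction(0)}
    for card, prob in CARD_PROB.items():
        correct = card > s if a == "High" else card < s
        if card == s or correct:
            dist[card] += prob       # tie (no-op) or correct guess: play continues
        else:
            dist["done"] += prob     # wrong guess: game ends
    return dist

def reward(s, a, s2):
    """Points on the new card for a correct guess; 0 for ties and wrong guesses."""
    if s2 == "done":
        return 0
    correct = s2 > s if a == "High" else s2 < s
    return s2 if correct else 0
```

Calling `transition(4, "Low")` and `transition(4, "High")` reproduces the eight probabilities listed for state 4.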

Search Tree: High‐Low
[Figure: High‐Low search tree. The root state branches on the actions Low and High into q-states; outgoing transitions are labeled T = 0.5, R = 2; T = 0.25, R = 3; T = 0, R = 4; T = 0.25, R = 0; each successor state again branches on High and Low]

MDP Search Trees
• Each MDP state gives an expectimax‐like search tree
  • s is a state
  • (s, a) is a q-state
  • (s,a,s’) is called a transition
  • T(s,a,s’) = P(s’|s,a)
  • R(s,a,s’)


Utilities of Sequences
• In order to formalize optimality of a policy, need to understand utilities of sequences of rewards
• Typically consider stationary preferences:
  $[r, r_0, r_1, r_2, \ldots] \succ [r, r_0', r_1', r_2', \ldots] \iff [r_0, r_1, r_2, \ldots] \succ [r_0', r_1', r_2', \ldots]$
• Theorem: only two ways to define stationary utilities
  • Additive utility: $U([r_0, r_1, r_2, \ldots]) = r_0 + r_1 + r_2 + \cdots$
  • Discounted utility: $U([r_0, r_1, r_2, \ldots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$

Infinite Utilities?!
• Problem: infinite state sequences have infinite rewards
• Solutions:
  • Finite horizon:
    • Terminate episodes after a fixed T steps (e.g. life)
    • Gives nonstationary policies (π depends on time left)
  • Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “done” for High‐Low)
  • Discounting: for 0 < γ < 1
    $U([r_0, \ldots, r_\infty]) = \sum_{t=0}^{\infty} \gamma^t r_t \le R_{\max} / (1 - \gamma)$
    • Smaller γ means smaller “horizon”: shorter term focus

Discounting
• Typically discount rewards by γ < 1 each time step
  • Sooner rewards have higher utility than later rewards
  • Also helps the algorithms converge

Recap: Defining MDPs
• Markov decision processes:
  • States S
  • Start state s0
  • Actions A
  • Transitions P(s’|s, a) (aka T(s,a,s’))
  • Rewards R(s,a,s’) (and discount γ)
• MDP quantities so far:
  • Policy, π = Function that chooses an action for each state
  • Utility (aka “return”) = sum of discounted rewards
[Figure: tree s → (s, a) → (s,a,s’) → s’]
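
As a small illustration of the utility (return) as a sum of discounted rewards, the sketch below computes r_0 + γ r_1 + γ² r_2 + … for a finite reward sequence. The function name and the sample reward sequences are made up for the example.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite sequence of rewards."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# The same reward is worth more when it arrives sooner.
early = discounted_return([1, 0, 0], gamma=0.9)   # reward at t = 0
late = discounted_return([0, 0, 1], gamma=0.9)    # reward at t = 2
three_ones = discounted_return([1, 1, 1], gamma=0.5)
```

Here `early` exceeds `late`, which is the "sooner rewards have higher utility" point from the Discounting slide.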

Optimal Utilities
• Define the value of a state s:
  V*(s) = expected utility starting in s and acting optimally
• Define the value of a q‐state (s,a):
  Q*(s,a) = expected utility starting in s, taking action a and thereafter acting optimally
• Define the optimal policy:
  π*(s) = optimal action from state s
[Figure: tree s → (s, a) → (s,a,s’) → s’]

Why Not Search Trees?
• Why not solve with expectimax?
• Problems:
  • This tree is usually infinite (why?)
  • Same states appear over and over (why?)
  • We would search once per state (why?)
• Idea: Value iteration
  • Compute optimal values for all states all at once using successive approximations
  • Will be a bottom‐up dynamic program similar in cost to memoization
  • Do all planning offline, no replanning needed!


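The Bellman Equations
• Definition of “optimal utility” leads to a simple one‐step look‐ahead relationship between optimal utility values:
[Photo: Richard Bellman (1920‐1984)]

Bellman Equations for MDPs
  $V^*(s) = \max_a Q^*(s,a)$
  $Q^*(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]$
  $V^*(s) = \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]$
[Figure: tree s → (s, a) → (s,a,s’) → s’]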

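Bellman Backup (MDP)
• Given an estimate of V* function (say Vn)
• Backup Vn function at state s
  • calculate a new estimate (Vn+1):
    $Q_{n+1}(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V_n(s') \right]$
    $V_{n+1}(s) = \max_{a \in Ap(s)} Q_{n+1}(s,a)$
• Qn+1(s,a): value/cost of the strategy:
  • execute action a in s, execute πn subsequently
  • πn = argmax_{a∈Ap(s)} Qn(s,a)

Bellman Backup (example)
[Figure: state s0 with actions a1, a2, a3 leading to successors s1 (V0 = 0), s2 (V0 = 1), s3 (V0 = 2)]
• Q1(s,a1) = 2 + γ·0 ≈ 2
• Q1(s,a2) = 5 + γ·0.9·1 + γ·0.1·2 ≈ 6.1
• Q1(s,a3) = 4.5 + γ·2 ≈ 6.5
• V1 = max over a of Q1(s,a) ≈ 6.5, so the greedy action is a3
(The approximate values correspond to γ close to 1.)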
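
The worked backup on this slide can be checked numerically. The sketch below assumes, as the garbled figure suggests, that a1 earns 2 and leads to s1 (V0 = 0), a2 earns 5 and leads to s2 (V0 = 1) with probability 0.9 and s3 (V0 = 2) with probability 0.1, and a3 earns 4.5 and leads to s3; it also takes γ = 1 to match the rounded numbers. Rewards are attached to actions only, a simplification of R(s,a,s’).

```python
def bellman_backup(action_models, V, gamma):
    """One Bellman backup at a single state.

    action_models: {action: (immediate_reward, [(probability, successor), ...])}
    Returns (new value, greedy action, dict of Q-values).
    """
    Q = {
        a: reward + gamma * sum(p * V[s2] for p, s2 in successors)
        for a, (reward, successors) in action_models.items()
    }
    greedy = max(Q, key=Q.get)
    return Q[greedy], greedy, Q

# Assumed reading of the slide's figure (see the paragraph above).
V0 = {"s1": 0.0, "s2": 1.0, "s3": 2.0}
actions_at_s0 = {
    "a1": (2.0, [(1.0, "s1")]),
    "a2": (5.0, [(0.9, "s2"), (0.1, "s3")]),
    "a3": (4.5, [(1.0, "s3")]),
}
V1, greedy_action, Q1 = bellman_backup(actions_at_s0, V0, gamma=1.0)
```

With these inputs the Q-values come out at roughly 2, 6.1, and 6.5, so the backup selects a3.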

Value iteration [Bellman’57]
• assign an arbitrary assignment of V0 to each state.
• repeat
  • for all states s
    • compute Vn+1(s) by Bellman backup at s (iteration n+1)
• until max_s |Vn+1(s) – Vn(s)| < ε (ε‐convergence; |Vn+1(s) – Vn(s)| is the residual at s)

Value Iteration
• Idea:
  • Start with V0*(s) = 0, which we know is right (why?)
  • Given Vi*, calculate the values for all states for depth i+1:
    $V^*_{i+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*_i(s') \right]$
  • This is called a value update or Bellman update
  • Repeat until convergence
• Theorem: will converge to unique optimal values
  • Basic idea: approximations get refined towards optimal values
  • Policy may converge long before values do
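
The value iteration loop above can be sketched in a few lines. The interface is an assumption made for this example: `T(s, a)` returns a dictionary of successor probabilities, `R(s, a, s2)` returns a reward, and `actions(s)` returns an empty list for terminal states. The two-state MDP at the bottom is a made-up test case, not one from the slides.

```python
def value_iteration(states, actions, T, R, gamma, epsilon=1e-6):
    """Repeat synchronous Bellman backups until the largest residual is below epsilon."""
    V = {s: 0.0 for s in states}                      # V0 = 0 everywhere
    while True:
        new_V = {}
        for s in states:
            q_values = [
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a).items())
                for a in actions(s)
            ]
            new_V[s] = max(q_values) if q_values else 0.0   # terminal states keep value 0
        residual = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if residual < epsilon:
            return V

# Hypothetical MDP: in "A", "stay" earns 1 and remains in "A"; "quit" earns 0 and ends.
def actions(s):
    return ["stay", "quit"] if s == "A" else []

def T(s, a):
    return {"A": 1.0} if a == "stay" else {"end": 1.0}

def R(s, a, s2):
    return 1.0 if a == "stay" else 0.0

V = value_iteration(["A", "end"], actions, T, R, gamma=0.5)
```

With γ = 0.5 the value of "A" approaches 1 / (1 − 0.5) = 2, the discounted sum of a reward of 1 received forever.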


Value Estimates
• Calculate estimates Vk*(s)
  • The optimal value considering only next k time steps (k rewards)
  • As k → ∞, Vk approaches the optimal value
• Why:
  • If discounting, distant rewards become negligible
  • If terminal states reachable from everywhere, fraction of episodes not ending becomes negligible
  • Otherwise, can get infinite expected utility and then this approach actually won’t work

Example: Bellman Updates
Example: γ=0.9, living reward=0, noise=0.2
[Figure: grid world with the values to be computed marked “?”]

Example: Value Iteration
[Figure: grid‐world value estimates V1 and V2]
• Information propagates outward from terminal states and eventually all states have correct value estimates

Practice: Computing Actions
• Which action should we choose from state s:
  • Given optimal values Q?
    $\pi^*(s) = \arg\max_a Q^*(s,a)$
  • Given optimal values V?
    $\pi^*(s) = \arg\max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]$
  • Lesson: actions are easier to select from Q’s!

Comments
• Decision‐theoretic Algorithm
• Dynamic Programming
• Fixed Point Computation
• Probabilistic version of Bellman‐Ford Algorithm
  • for shortest path computation
  • MDP1: Stochastic Shortest Path Problem
• Time Complexity
  • one iteration: O(|S|²|A|)
  • number of iterations: poly(|S|, |A|, 1/(1‐γ))
• Space Complexity: O(|S|)
• Factored MDPs = Planning under uncertainty
  • exponential space, exponential time
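
To illustrate why actions are easier to select from Q-values than from V-values, the sketch below extracts a greedy policy both ways. Selecting from Q is a plain argmax over a table; selecting from V needs the model (T and R) for a one-step look-ahead. The function names, the callable interface, and the tiny two-state MDP are assumptions for the example.

```python
def policy_from_q(Q):
    """Q: {state: {action: value}} -> greedy policy; no model required."""
    return {s: max(q_a, key=q_a.get) for s, q_a in Q.items() if q_a}

def policy_from_v(states, actions, T, R, V, gamma):
    """Greedy policy from state values; needs T and R for the look-ahead."""
    policy = {}
    for s in states:
        best_action, best_q = None, float("-inf")
        for a in actions(s):
            q = sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a).items())
            if q > best_q:
                best_action, best_q = a, q
        if best_action is not None:      # terminal states get no entry
            policy[s] = best_action
    return policy

# Hypothetical MDP: in "A", "stay" earns 1 and loops; "quit" earns 0 and ends.
def actions(s):
    return ["stay", "quit"] if s == "A" else []

def T(s, a):
    return {"A": 1.0} if a == "stay" else {"end": 1.0}

def R(s, a, s2):
    return 1.0 if a == "stay" else 0.0

pi_from_v = policy_from_v(["A", "end"], actions, T, R, {"A": 2.0, "end": 0.0}, gamma=0.5)
pi_from_q = policy_from_q({"A": {"stay": 2.0, "quit": 0.0}, "end": {}})
```

Both routes give the same policy here; only the V-based one had to consult the transition and reward functions.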


Convergence Properties
• Vn → V* in the limit as n → ∞
• ε‐convergence: Vn function is within ε of V*
• Optimality: current policy is within 2ε of optimal
• Monotonicity
  • V0 ≤p V* ⇒ Vn ≤p V* (Vn monotonic from below)
  • V0 ≥p V* ⇒ Vn ≥p V* (Vn monotonic from above)
  • otherwise Vn non‐monotonic

Convergence
• Define the max‐norm: $\|U\| = \max_s |U(s)|$
• Theorem: For any two approximations U^t and V^t
  $\|U^{t+1} - V^{t+1}\| \le \gamma \|U^t - V^t\|$
  • I.e. any distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true V* (aka U) and value iteration converges to a unique, stable, optimal solution
• Theorem:
  $\|U^{t+1} - U^t\| < \epsilon \Rightarrow \|U^{t+1} - U\| < 2\epsilon\gamma / (1 - \gamma)$
  • I.e. once the change in our approximation is small, it must also be close to correct

Value Iteration Complexity
• Problem size:
  • |A| actions and |S| states
• Each Iteration
  • Computation: O(|A|⋅|S|²)
  • Space: O(|S|)
• Num of iterations
  • Can be exponential in the discount factor γ

Asynchronous Value Iteration
• States may be backed up in any order
  • Instead of systematically, iteration by iteration
• Theorem:
  • As long as every state is backed up infinitely often…
  • Asynchronous value iteration converges to optimal

Asynchronous Value Iteration: Prioritized Sweeping
• Why back up a state if values of successors are the same?
• Prefer backing a state
  • whose successors had most change
• Priority Queue of (state, expected change in value)
• Backup in the order of priority
• After backing a state update priority queue
  • for all predecessors
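
A rough sketch of prioritized sweeping as outlined above: states sit in a priority queue keyed by how much a backup would change their value, and after each backup the predecessors of the updated state are re-queued. This is one simple variant written for illustration, not necessarily the exact algorithm the slides had in mind; stale queue entries are tolerated (an extra backup is harmless), and the interface (`T(s, a)` returning a probability dictionary, `actions(s)` empty for terminals) is an assumption.

```python
import heapq

def prioritized_sweeping(states, actions, T, R, gamma, epsilon=1e-6, max_backups=100000):
    V = {s: 0.0 for s in states}
    # predecessors[s2]: states with some action that reaches s2 with positive probability
    predecessors = {s: set() for s in states}
    for s in states:
        for a in actions(s):
            for s2, p in T(s, a).items():
                if p > 0:
                    predecessors[s2].add(s)

    def backed_up_value(s):
        q_values = [sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a).items())
                    for a in actions(s)]
        return max(q_values) if q_values else 0.0

    # heapq is a min-heap, so priorities are negated; the counter breaks ties.
    heap = [(-abs(backed_up_value(s) - V[s]), i, s) for i, s in enumerate(states)]
    heapq.heapify(heap)
    counter = len(states)
    for _tick in range(max_backups):
        if not heap:
            break
        neg_priority, _order, s = heapq.heappop(heap)
        if -neg_priority < epsilon:
            continue                      # expected change too small to bother
        V[s] = backed_up_value(s)
        for pred in predecessors[s]:
            change = abs(backed_up_value(pred) - V[pred])
            if change >= epsilon:
                heapq.heappush(heap, (-change, counter, pred))
                counter += 1
    return V

# Hypothetical MDP: in "A", "stay" earns 1 and loops; "quit" earns 0 and ends.
def actions(s):
    return ["stay", "quit"] if s == "A" else []

def T(s, a):
    return {"A": 1.0} if a == "stay" else {"end": 1.0}

def R(s, a, s2):
    return 1.0 if a == "stay" else 0.0

V = prioritized_sweeping(["A", "end"], actions, T, R, gamma=0.5)
```

In this tiny example the terminal state is never backed up, since its expected change is zero, while "A" is repeatedly re-queued as its own predecessor until its value settles near 2.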


Asynchronous Value Iteration: Real Time Dynamic Programming
[Barto, Bradtke, Singh’95]
• Trial: simulate greedy policy starting from start state; perform Bellman backup on visited states
• RTDP:
  • Repeat Trials until value function converges

Why?
• Why is the next slide saying min?

RTDP Trial
[Figure: from s0, compute Qn+1(s0,a) for actions a1, a2, a3 from the current values Vn of their successors; take the Min to get Vn+1(s0); follow the greedy action (agreedy = a2) and continue the trial toward the Goal]

Comments
• Properties
  • if all states are visited infinitely often then Vn → V*
• Advantages
  • Anytime: more probable states explored quickly
• Disadvantages
  • complete convergence can be slow!
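
A minimal sketch of RTDP trials in the cost-minimizing setting of the trial figure (hence the Min): each trial follows the greedy action from the start state, backs up every visited state, and samples the next state from the model. The chain problem, the zero heuristic, the trial counts, and all names are assumptions for illustration; this is plain RTDP without the labeling of Labeled RTDP.

```python
import random

def rtdp(start, goal, actions, T, cost, heuristic, trials=200, max_steps=100, seed=0):
    """T(s, a) -> {s': prob}; cost(s, a) -> float; returns values of visited states."""
    rng = random.Random(seed)
    V = {}   # unvisited states fall back to the heuristic

    def value(s):
        return 0.0 if s == goal else V.get(s, heuristic(s))

    def q(s, a):
        return cost(s, a) + sum(p * value(s2) for s2, p in T(s, a).items())

    for _trial in range(trials):
        s = start
        for _step in range(max_steps):
            if s == goal:
                break
            greedy = min(actions(s), key=lambda a: q(s, a))
            V[s] = q(s, greedy)                       # Bellman backup at the visited state
            successors = T(s, greedy)
            s = rng.choices(list(successors), weights=list(successors.values()))[0]
    return V

# Hypothetical chain 0 -> 1 -> 2 -> 3 (goal). "step" costs 1 and advances with
# probability 0.5; "jump" costs 5 and advances two states for sure.
GOAL = 3

def actions(s):
    return ["step", "jump"]

def T(s, a):
    if a == "step":
        return {s: 0.5, s + 1: 0.5}
    return {min(s + 2, GOAL): 1.0}

def cost(s, a):
    return 1.0 if a == "step" else 5.0

V = rtdp(0, GOAL, actions, T, cost, heuristic=lambda s: 0.0)
```

In this chain each "step" costs 2 in expectation, so the optimal expected costs are 6, 4, and 2 from states 0, 1, and 2; the zero heuristic underestimates them, which is the admissibility condition mentioned for Labeled RTDP.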

Labeled RTDP [Bonet&Geffner ICAPS03]
• Stochastic Shortest Path Problems
  • Policy w/ min expected cost to reach goal
• Initialize v0(s) with admissible heuristic
  • Underestimates remaining cost
• Theorem:
  • if the residual of Vk(s) < ε and the residual of Vk(s’) < ε for all successors s’ of s in the greedy graph
  • Then Vk is ε‐consistent and will remain so
• Labeling algorithm detects convergence
[Figure: greedy graph from s0 toward the Goal]


Changing the Search Space
• Value Iteration
  • Search in value space
  • Compute the resulting policy
• Policy Iteration
  • Search in policy space
  • Compute the resulting value

Utilities for Fixed Policies
• Another basic operation: compute the utility of a state s under a fixed (general non‐optimal) policy
• Define the utility of a state s, under a fixed policy π:
  Vπ(s) = expected total discounted rewards (return) starting in s and following π
• Recursive relation (one‐step look‐ahead / Bellman equation):
  $V^{\pi}(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^{\pi}(s') \right]$
[Figure: tree s → (s, π(s)) → (s, π(s), s’) → s’]

Policy Evaluation
• How do we calculate the V’s for a fixed policy?
• Idea one: modify Bellman updates
  $V^{\pi}_0(s) = 0$
  $V^{\pi}_{i+1}(s) \leftarrow \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^{\pi}_i(s') \right]$
• Idea two: it’s just a linear system, solve with Matlab (or whatever)

Policy Iteration
• Problem with value iteration:
  • Considering all actions each iteration is slow: takes |A| times longer than policy evaluation
  • But policy doesn’t change each iteration, time wasted
• Alternative to value iteration:
  • Step 1: Policy evaluation: calculate utilities for a fixed policy (not optimal utilities!) until convergence (fast)
  • Step 2: Policy improvement: update policy using one‐step lookahead with resulting converged (but not optimal!) utilities (slow but infrequent)
  • Repeat steps until policy converges
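
"Idea one" for policy evaluation, the Bellman update with the max over actions removed, can be sketched as below. The policy is assumed to be a dictionary from non-terminal states to actions, and the two-state MDP is a made-up test case; neither comes from the slides.

```python
def evaluate_policy(states, policy, T, R, gamma, epsilon=1e-8):
    """Iterate V(s) <- sum_s' T(s, pi(s), s') [R(s, pi(s), s') + gamma V(s')]."""
    V = {s: 0.0 for s in states}
    while True:
        new_V = {}
        for s in states:
            if s in policy:
                a = policy[s]            # only the policy's action, no max over actions
                new_V[s] = sum(p * (R(s, a, s2) + gamma * V[s2])
                               for s2, p in T(s, a).items())
            else:
                new_V[s] = 0.0           # terminal state
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < epsilon:
            return V

# Hypothetical MDP: in "A", "stay" earns 1 and loops; "quit" earns 0 and ends.
def T(s, a):
    return {"A": 1.0} if a == "stay" else {"end": 1.0}

def R(s, a, s2):
    return 1.0 if a == "stay" else 0.0

V_stay = evaluate_policy(["A", "end"], {"A": "stay"}, T, R, gamma=0.5)
V_quit = evaluate_policy(["A", "end"], {"A": "quit"}, T, R, gamma=0.5)
```

Each sweep costs O(|S|²) here rather than O(|A|⋅|S|²), which is the saving over a full Bellman update noted on the Policy Iteration slide.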

Policy Iteration
• Policy evaluation: with fixed current policy π, find values with simplified Bellman updates:
  $V^{\pi_k}_{i+1}(s) \leftarrow \sum_{s'} T(s, \pi_k(s), s') \left[ R(s, \pi_k(s), s') + \gamma V^{\pi_k}_i(s') \right]$
  • Iterate until values converge
• Policy improvement: with fixed utilities, find the best action according to one‐step look‐ahead
  $\pi_{k+1}(s) = \arg\max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^{\pi_k}(s') \right]$

Policy iteration [Howard’60]
• assign an arbitrary assignment of π0 to each state.
• repeat
  • Policy Evaluation: compute Vn+1, the evaluation of πn (costly: O(n³))
  • Policy Improvement: for all states s
    • compute πn+1(s): argmax_{a∈Ap(s)} Qn+1(s,a)
• until πn+1 = πn
• Modified Policy Iteration: approximate the evaluation by value iteration using the fixed policy
Advantage
• searching in a finite (policy) space as opposed to uncountably infinite (value) space ⇒ convergence faster.
• all other properties follow!
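
Putting the two steps together, the sketch below alternates policy evaluation and policy improvement until the policy stops changing. Evaluation here is iterative and in place rather than an exact linear solve, so strictly speaking this is closer to the modified variant; the interface and the two-state MDP are assumptions for the example.

```python
def policy_iteration(states, actions, T, R, gamma, epsilon=1e-8):
    def q(s, a, V):
        return sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a).items())

    # Arbitrary initial policy: the first listed action in every non-terminal state.
    policy = {s: actions(s)[0] for s in states if actions(s)}
    while True:
        # Policy evaluation: in-place sweeps with the policy frozen.
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s, a in policy.items():
                v = q(s, a, V)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < epsilon:
                break
        # Policy improvement: one-step look-ahead with the converged values.
        stable = True
        for s in policy:
            best = max(actions(s), key=lambda a: q(s, a, V))
            if q(s, best, V) > q(s, policy[s], V) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, V

# Hypothetical MDP: in "A", "quit" earns 0 and ends; "stay" earns 1 and loops.
def actions(s):
    return ["quit", "stay"] if s == "A" else []

def T(s, a):
    return {"A": 1.0} if a == "stay" else {"end": 1.0}

def R(s, a, s2):
    return 1.0 if a == "stay" else 0.0

policy, V = policy_iteration(["A", "end"], actions, T, R, gamma=0.5)
```

The run starts from the poor policy "quit", improves to "stay" after one evaluation, and stops after the second improvement pass finds nothing to change.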


Modified Policy iteration
• assign an arbitrary assignment of π0 to each state.
• repeat
  • Policy Evaluation: compute Vn+1, the approx. evaluation of πn
  • Policy Improvement: for all states s
    • compute πn+1(s): argmax_{a∈Ap(s)} Qn+1(s,a)
• until πn+1 = πn
Advantage
• probably the most competitive synchronous dynamic programming algorithm.

Policy Iteration Complexity
• Problem size:
  • |A| actions and |S| states
• Each Iteration
  • Computation: O(|S|³ + |A|⋅|S|²)
  • Space: O(|S|)
• Num of iterations
  • Unknown, but can be faster in practice
  • Convergence is guaranteed

Comparison
• In value iteration:
  • Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy)
• In policy iteration:
  • Several passes to update utilities with frozen policy
  • Occasional passes to update policies
• Hybrid approaches (asynchronous policy iteration):
  • Any sequences of partial updates to either policy entries or utilities will converge if every state is visited infinitely often

Recap: MDPs
• Markov decision processes:
  • States S
  • Actions A
  • Transitions P(s’|s,a) (or T(s,a,s’))
  • Rewards R(s,a,s’) (and discount γ)
  • Start state s0
• Quantities:
  • Returns = sum of discounted rewards
  • Values = expected future returns from a state (optimal, or for a fixed policy)
  • Q‐Values = expected future returns from a q‐state (optimal, or for a fixed policy)
[Figure: tree s → (s, a) → (s,a,s’) → s’]
