VIP Cheatsheet: States-based models

Afshine Amidi and Shervine Amidi

May 23, 2019


Search optimization

In this section, we assume that by accomplishing action a from state s, we deterministically
arrive in state Succ(s,a). The goal here is to determine a sequence of actions (a1, a2, a3, a4, ...)
that starts from an initial state and leads to an end state. In order to solve this kind of problem,
our objective will be to find the minimum cost path by using states-based models.


Tree search

This category of states-based algorithms explores all possible states and actions. It is quite
memory efficient, and is suitable for huge state spaces, but the runtime can become exponential
in the worst cases.

❒ Search problem – A search problem is defined with, among others, the successor Succ(s,a) of
state s after action a, the action cost Cost(s,a) of taking action a from state s, and whether an
end state was reached, IsEnd(s). The objective is to find a path that minimizes the cost.

❒ Backtracking search – Backtracking search is a naive recursive algorithm that tries all
possibilities to find the minimum cost path. Here, action costs can be either positive or negative.

❒ Breadth-first search (BFS) – Breadth-first search is a graph search algorithm that does a
level-by-level traversal. We can implement it iteratively with the help of a queue that stores at
each step future nodes to be visited. For this algorithm, we can assume action costs to be equal
to a constant c > 0.

❒ Depth-first search (DFS) – Depth-first search is a search algorithm that traverses a graph
by following each path as deep as it can. We can implement it recursively, or iteratively with
the help of a stack that stores at each step future nodes to be visited. For this algorithm, action
costs are assumed to be equal to 0.

❒ Iterative deepening – The iterative deepening trick is a modification of the depth-first
search algorithm so that it stops after reaching a certain depth, which guarantees optimality
when all action costs are equal. Here, we assume that action costs are equal to a constant c > 0.

❒ Tree search algorithms summary – The space and time complexities of these algorithms can
be compared by noting b the number of actions per state, d the solution depth, and D the
maximum depth of the search tree.
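To make the backtracking search above concrete, here is a minimal Python sketch. The callables
`actions`, `succ`, `cost` and `is_end` are hypothetical stand-ins for Actions(s), Succ(s,a),
Cost(s,a) and IsEnd(s), and the tiny example at the bottom is made up for illustration.

```python
# Minimal backtracking search: tries every action sequence and keeps the
# cheapest complete path found. Exponential time, memory linear in the depth.
def backtracking_search(start, actions, succ, cost, is_end):
    best = {"cost": float("inf"), "path": None}

    def recurse(state, path, path_cost):
        if is_end(state):
            if path_cost < best["cost"]:
                best["cost"], best["path"] = path_cost, list(path)
            return
        for a in actions(state):
            path.append(a)
            recurse(succ(state, a), path, path_cost + cost(state, a))
            path.pop()

    recurse(start, [], 0.0)
    return best["cost"], best["path"]

# Tiny example: walk from 0 to at least 3 using steps of +1 (cost 1) or +2 (cost 3).
if __name__ == "__main__":
    print(backtracking_search(
        0,
        actions=lambda s: ["+1", "+2"],
        succ=lambda s, a: s + (1 if a == "+1" else 2),
        cost=lambda s, a: 1.0 if a == "+1" else 3.0,
        is_end=lambda s: s >= 3,
    ))  # -> (3.0, ['+1', '+1', '+1'])
```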
Graph search

This category of states-based algorithms aims at constructing optimal paths, enabling exponential
savings. In this section, we will focus on dynamic programming and uniform cost search.

❒ Graph – A graph is comprised of a set of vertices V (also called nodes) as well as a set of
edges E (also called links).

❒ Dynamic programming – Dynamic programming (DP) is a backtracking search algorithm with
memoization (i.e. partial results are saved) that finds the minimum cost path from a state s to
an end state; it only works on acyclic graphs. For any given state s, the future cost satisfies:

    FutureCost(s) = 0                                                      if IsEnd(s)
    FutureCost(s) = min_{a ∈ Actions(s)} [Cost(s,a) + FutureCost(Succ(s,a))]   otherwise

Remark: the formula above gives the intuition of a top-to-bottom problem resolution; the same
values can equivalently be computed bottom-to-top, starting from the end states.

❒ Types of states – The terminology below is used for states in the context of uniform cost
search:

• Explored E: states for which the optimal path has already been found

• Frontier F: states seen, for which we are still figuring out how to get there with the cheapest
cost

• Unexplored U: states not seen yet

❒ Uniform cost search – Uniform cost search (UCS) is a search algorithm that aims at finding
the shortest path from a state sstart to an end state send. It explores states s in increasing order
of PastCost(s) and relies on the fact that all action costs are non-negative.

Remark: the complexity analyses of these search algorithms assume the number of possible actions
per state to be constant.
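Below is a rough sketch of uniform cost search, assuming a hypothetical `succ_and_cost(s)`
callable that yields (action, next_state, cost) triples with non-negative costs.

```python
import heapq
import itertools

def uniform_cost_search(start, succ_and_cost, is_end):
    """Sketch of UCS: pops states in increasing order of PastCost(s).
    Assumes all action costs are non-negative."""
    counter = itertools.count()               # tie-breaker so states are never compared
    frontier = [(0.0, next(counter), start, [])]
    explored = set()                          # states whose cheapest PastCost is settled
    while frontier:
        past_cost, _, state, path = heapq.heappop(frontier)
        if state in explored:
            continue
        explored.add(state)
        if is_end(state):
            return past_cost, path            # minimum cost path found
        for action, nxt, cost in succ_and_cost(state):
            if nxt not in explored:
                heapq.heappush(frontier,
                               (past_cost + cost, next(counter), nxt, path + [action]))
    return float("inf"), None                 # no end state reachable
```

Popping a state from the frontier moves it to the explored set, at which point its PastCost is
final; this mirrors the Explored/Frontier/Unexplored distinction above.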
Learning costs

Suppose we are not given the values of Cost(s,a); we want to estimate these quantities from a
training set of minimum-cost-path sequences of actions (a1, a2, ..., ak).

❒ Structured perceptron – The structured perceptron is an algorithm that aims at iteratively
learning the cost of each state-action pair. At each step, it:

• decreases the estimated cost of each state-action pair of the true minimizing path y given by
the training data,

• increases the estimated cost of each state-action pair of the current predicted path y' inferred
from the learned weights.

Remark: there are several versions of the algorithm, one of which simplifies the problem to only
learning the cost of each action a, and another which parametrizes Cost(s,a) with a feature
vector of learnable weights.
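As a rough sketch of one structured perceptron step under the simplest parametrization (one
learnable cost per state-action pair), the update below decreases the costs along the true path
and increases them along the predicted one. The names `costs`, `true_path`, `predicted_path`
and `eta` are hypothetical; the predicted path would come from running a shortest-path search
with the current cost estimates.

```python
def structured_perceptron_step(costs, true_path, predicted_path, eta=1.0):
    """One update step. `costs` maps (state, action) -> estimated cost;
    `true_path` and `predicted_path` are lists of (state, action) pairs."""
    for s, a in true_path:          # make the demonstrated path look cheaper
        costs[(s, a)] = costs.get((s, a), 0.0) - eta
    for s, a in predicted_path:     # make the currently predicted path look costlier
        costs[(s, a)] = costs.get((s, a), 0.0) + eta
    return costs
```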
A* search

❒ Heuristic function – A heuristic is a function h over states s, where each h(s) aims at
estimating FutureCost(s), the cost of the path from s to send.

❒ Algorithm – A* is a search algorithm that aims at finding the shortest path from a state s to
an end state send. It explores states s in increasing order of PastCost(s) + h(s). It is equivalent
to a uniform cost search with edge costs Cost'(s,a) given by:

    Cost'(s,a) = Cost(s,a) + h(Succ(s,a)) − h(s)

Remark: this algorithm can be seen as a biased version of UCS exploring states estimated to be
closer to the end state.

❒ Consistency – A heuristic h is said to be consistent if it satisfies the two following properties:

• For all states s and actions a,

    h(s) ≤ Cost(s,a) + h(Succ(s,a))

• The end state verifies the following:

    h(send) = 0

❒ Correctness – If h is consistent, then A* returns the minimum cost path.

❒ Admissibility – A heuristic h is said to be admissible if we have:

    h(s) ≤ FutureCost(s)

❒ Theorem – Let h(s) be a given heuristic. We have:

    h(s) consistent =⇒ h(s) admissible

❒ Efficiency – A* explores all states s satisfying the following equation:

    PastCost(s) ≤ PastCost(send) − h(s)

Remark: larger values of h(s) are better, since the equation above shows that they restrict the
set of states s that will be explored.
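A minimal A* sketch, assuming a consistent heuristic `h` and the same hypothetical
`succ_and_cost(s)` interface as in the UCS sketch above; the only change from UCS is that the
frontier is ordered by PastCost(s) + h(s).

```python
import heapq
import itertools

def a_star(start, succ_and_cost, is_end, h):
    """Sketch of A*: identical to uniform cost search, but states are popped in
    increasing order of PastCost(s) + h(s); `h` is assumed consistent."""
    counter = itertools.count()
    frontier = [(h(start), 0.0, next(counter), start, [])]
    explored = set()
    while frontier:
        _, past_cost, _, state, path = heapq.heappop(frontier)
        if state in explored:
            continue
        explored.add(state)
        if is_end(state):
            return past_cost, path
        for action, nxt, cost in succ_and_cost(state):
            if nxt not in explored:
                new_cost = past_cost + cost
                heapq.heappush(frontier,
                               (new_cost + h(nxt), new_cost, next(counter), nxt,
                                path + [action]))
    return float("inf"), None
```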
Relaxation

It is a framework for producing consistent heuristics. The idea is to find closed-form reduced
costs by removing constraints and to use them as heuristics.

❒ Relaxed search problem – The relaxation of a search problem P with costs Cost is noted
Prel with costs Costrel, and satisfies the identity:

    Costrel(s,a) ≤ Cost(s,a)

❒ Relaxed heuristic – Given a relaxed search problem Prel, we define the relaxed heuristic
h(s) = FutureCostrel(s) as the minimum cost of a path from s to an end state in the graph of
costs Costrel(s,a). We have:

    h(s) = FutureCostrel(s) =⇒ h(s) consistent

❒ Tradeoff when choosing heuristic – We have to balance two aspects in choosing a heuristic:

• Computational efficiency: removing more constraints makes h(s) = FutureCostrel(s) easier to
compute, ideally in closed form.

• Good enough approximation: the heuristic h(s) should be close to FutureCost(s), so we should
not remove too many constraints.

❒ Max heuristic – Let h1(s), h2(s) be two heuristics. We have the following property:

    h1(s), h2(s) consistent =⇒ h(s) = max{h1(s), h2(s)} consistent
Markov decision processes
In this section, we assume that performing action a from state s can lead to several states s′1, s′2, ...
in a probabilistic manner. In order to find our way between an initial state and an end state,
our objective will be to find the maximum value policy by using Markov decision processes that
help us cope with randomness and uncertainty.
❒ Markov decision process – A Markov decision process (MDP) is defined with:

• a starting state sstart

• possible actions Actions(s) from state s

• transition probabilities T(s,a,s′) from s to s′ with action a

• rewards Reward(s,a,s′) from s to s′ with action a

• whether an end state was reached, IsEnd(s)

• a discount factor 0 ≤ γ ≤ 1

❒ Policy – A policy π is a function that maps each state s to an action a.

❒ Utility – The utility of a path (s0, ..., sk) is the discounted sum of the rewards on that path.
In other words,

    u(s0, ..., sk) = Σ_{i=1..k} ri γ^(i−1) = r1 + γ r2 + γ^2 r3 + ... + γ^(k−1) rk

❒ Q-value – The Q-value of a policy π at state s with action a, noted Qπ(s,a), is the expected
utility from state s after taking action a and then following policy π.

❒ Value of a policy – The value of a policy π from state s, also noted Vπ(s), is the expected
utility obtained by following policy π from state s over random paths. It is defined as follows:

    Vπ(s) = Qπ(s,π(s))

Remark: Vπ(s) is equal to 0 if s is an end state.
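A quick numerical check of the discounted utility defined above; the rewards and discount factor
are made up for illustration.

```python
def utility(rewards, gamma):
    """Discounted utility u = sum_i r_i * gamma**(i-1), with rewards = [r_1, ..., r_k]."""
    return sum(r * gamma ** (i - 1) for i, r in enumerate(rewards, start=1))

print(utility([4, 4, 4], gamma=0.5))   # 4 + 2 + 1 = 7.0
```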
Applications
❒ Policy evaluation – Given a policy π, policy evaluation is an iterative algorithm that computes
Vπ. It is done as follows:

• Initialization: for all states s, we have

    Vπ^(0)(s) ← 0

• Iteration: for t from 1 to TPE, we have

    ∀s, Vπ^(t)(s) ← Σ_{s′∈States} T(s,π(s),s′) [Reward(s,π(s),s′) + γ Vπ^(t−1)(s′)]
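A minimal policy evaluation sketch, assuming a hypothetical MDP encoding where
`transitions[(s, a)]` lists `(next_state, probability, reward)` triples and `policy[s]` gives π(s);
end states are simply states with no outgoing transitions.

```python
def policy_evaluation(states, transitions, policy, gamma, num_iters=100):
    """Iteratively computes V_pi for the given policy."""
    V = {s: 0.0 for s in states}                       # V_pi^(0)(s) <- 0
    for _ in range(num_iters):
        new_V = {}
        for s in states:
            outcomes = transitions.get((s, policy.get(s)), [])
            # sum_s' T(s,pi(s),s') [Reward(s,pi(s),s') + gamma * V(s')]
            new_V[s] = sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)
        V = new_V
    return V
```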
❒ Optimal Q-value – The optimal Q-value Qopt(s,a) of state s with action a is defined to be
the maximum Q-value attained by any policy, starting from state s and taking action a. It is
computed as follows:

    Qopt(s,a) = Σ_{s′∈States} T(s,a,s′) [Reward(s,a,s′) + γ Vopt(s′)]
❒ Optimal value – The optimal value Vopt(s) of state s is defined as being the maximum value
attained by any policy. It is computed as follows:

    Vopt(s) = max_{a ∈ Actions(s)} Qopt(s,a)
❒ Value iteration – Value iteration is an iterative algorithm that computes the optimal value
Vopt. It is done as follows:

• Initialization: for all states s, we have

    Vopt^(0)(s) ← 0

• Iteration: for t from 1 to TVI, we have

    ∀s, Vopt^(t)(s) ← max_{a ∈ Actions(s)} Qopt^(t−1)(s,a)

with

    Qopt^(t−1)(s,a) = Σ_{s′∈States} T(s,a,s′) [Reward(s,a,s′) + γ Vopt^(t−1)(s′)]
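A minimal value iteration sketch under the same hypothetical MDP encoding as above
(`transitions[(s, a)]` lists `(next_state, probability, reward)` triples, `actions[s]` lists the
actions available in s, empty for end states); it also reads off a greedy policy from the final
values.

```python
def value_iteration(states, actions, transitions, gamma, num_iters=100):
    """Iteratively computes V_opt and a greedy policy."""
    V = {s: 0.0 for s in states}                       # V_opt^(0)(s) <- 0

    def q_opt(s, a):
        # Q_opt(s,a) = sum_s' T(s,a,s') [Reward(s,a,s') + gamma * V(s')]
        return sum(p * (r + gamma * V[s2]) for s2, p, r in transitions[(s, a)])

    for _ in range(num_iters):
        V = {s: max((q_opt(s, a) for a in actions[s]), default=0.0) for s in states}

    # Greedy policy read off from the final values
    pi = {s: max(actions[s], key=lambda a: q_opt(s, a)) for s in states if actions[s]}
    return V, pi
```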
Now, suppose that the transition probabilities T(s,a,s′) and the rewards Reward(s,a,s′) are
unknown; the following methods estimate the relevant quantities directly from experience.

❒ Model-based Monte Carlo – The model-based Monte Carlo method estimates T(s,a,s′) and
Reward(s,a,s′) from observed transitions (s,a,r,s′), with:

    T̂(s,a,s′) = (# times (s,a,s′) occurs) / (# times (s,a) occurs)

and Reward(s,a,s′) estimated by the reward r observed in the corresponding (s,a,r,s′). These
estimations will then be used to deduce Q-values, including Qπ and Qopt.

Remark: model-based Monte Carlo is said to be off-policy, because the estimation does not
depend on the exact policy.

❒ Model-free Monte Carlo – The model-free Monte Carlo method aims at directly estimating
Qπ, as follows:

    Q̂π(s,a) = average of ut where st−1 = s, at = a

where ut denotes the utility starting at step t of a given episode.

Remark: model-free Monte Carlo is said to be on-policy, because the estimated value is dependent
on the policy π used to generate the data.
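A rough sketch of the model-free Monte Carlo estimate, assuming episodes are recorded as lists
of `(state, action, reward)` triples generated by following π.

```python
from collections import defaultdict

def model_free_monte_carlo(episodes, gamma):
    """Estimates Q_pi(s,a) as the average of the utilities u_t observed after (s,a)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:
        rewards = [r for _, _, r in episode]
        for t, (s, a, _) in enumerate(episode):
            # u_t: discounted utility collected from step t onwards
            u_t = sum(r * gamma ** i for i, r in enumerate(rewards[t:]))
            totals[(s, a)] += u_t
            counts[(s, a)] += 1
    return {sa: totals[sa] / counts[sa] for sa in totals}
```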
❒ SARSA – State-action-reward-state-action (SARSA) is a bootstrapping method estimating
Qπ by using both raw data and estimates as part of the update rule. For each (s,a,r,s′,a′), we
have:

    Q̂π(s,a) ← (1 − η) Q̂π(s,a) + η [r + γ Q̂π(s′,a′)]

Remark: the SARSA estimate is updated on the fly, as opposed to the model-free Monte Carlo
one, where the estimate can only be updated at the end of the episode.

❒ Q-learning – Q-learning is an off-policy algorithm that produces an estimate for Qopt. On
each (s,a,r,s′,a′), we have:

    Q̂opt(s,a) ← (1 − η) Q̂opt(s,a) + η [r + γ max_{a′ ∈ Actions(s′)} Q̂opt(s′,a′)]
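A minimal sketch of one Q-learning update for a single observed `(s, a, r, s')` transition; `Q`
is a dictionary of estimates and `actions` is a hypothetical callable listing the actions
available in a state.

```python
def q_learning_update(Q, s, a, r, s_next, actions, gamma, eta):
    """One Q-learning step: Q(s,a) <- (1-eta) Q(s,a) + eta [r + gamma max_a' Q(s',a')]."""
    target = r + gamma * max((Q.get((s_next, a2), 0.0) for a2 in actions(s_next)),
                             default=0.0)        # max over a' of the current estimate
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * target
    return Q
```

The SARSA update would be identical except that the max over a′ is replaced by the estimate for
the action a′ actually taken next.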
Game playing

❒ Game tree – A game tree is a tree that describes the possibilities of a game. In particular,
each node is a decision point for a player and each root-to-leaf path is a possible outcome of the
game.

❒ Two-player zero-sum game – It is a game where each state is fully observed and such that
players take turns. It is defined with:

• a starting state sstart

• possible actions Actions(s) from state s

• successors Succ(s,a) of state s after action a

• whether an end state was reached, IsEnd(s)

• the agent's utility Utility(s) at end state s

• the player Player(s) who controls state s

Remark: we will assume that the utility of the agent has the opposite sign of the one of the
opponent.
❒ Types of policies – There are two types of policies:

• Deterministic policies, noted πp(s), which are actions that player p takes in state s.

• Stochastic policies, noted πp(s,a) ∈ [0,1], which are probabilities that player p takes action
a in state s.

❒ Expectimax – For a given state s, the expectimax value Vexptmax(s) is the maximum expected
utility of any agent policy when playing with respect to a fixed and known opponent policy πopp.
It is computed as follows:

    Vexptmax(s) = Utility(s)                                             if IsEnd(s)
    Vexptmax(s) = max_{a ∈ Actions(s)} Vexptmax(Succ(s,a))               if Player(s) = agent
    Vexptmax(s) = Σ_{a ∈ Actions(s)} πopp(s,a) Vexptmax(Succ(s,a))       if Player(s) = opp

❒ Minimax – The minimax value Vminimax(s) assumes a worst-case opponent, i.e. one playing to
minimize the agent's utility; it is obtained from the recursion above by replacing the expectation
over the opponent's actions with a minimum over them.

Remark: we can extract πmax and πmin from the minimax value Vminimax.

❒ Minimax properties – By noting V the value function, there are 3 properties around
minimax to have in mind:

• Property 1: if the agent were to change its policy to any πagent, then the agent would be
no better off.

    ∀πagent, V(πmax,πmin) ≥ V(πagent,πmin)

• Property 2: if the opponent changes its policy from πmin to πopp, then he will be no
better off.

    ∀πopp, V(πmax,πmin) ≤ V(πmax,πopp)

• Property 3: if the opponent is known to be not playing the adversarial policy, then the
minimax policy might not be optimal for the agent.
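A minimal sketch of the expectimax and minimax recursions above; `is_end`, `utility`, `player`,
`actions`, `succ` and `pi_opp` are hypothetical callables describing the game.

```python
def expectimax(s, is_end, utility, player, actions, succ, pi_opp):
    """Expectimax value of state s against a known stochastic opponent pi_opp(s, a)."""
    if is_end(s):
        return utility(s)
    values = [expectimax(succ(s, a), is_end, utility, player, actions, succ, pi_opp)
              for a in actions(s)]
    if player(s) == "agent":
        return max(values)                                     # agent maximizes
    return sum(pi_opp(s, a) * v for a, v in zip(actions(s), values))  # expectation over opponent

def minimax(s, is_end, utility, player, actions, succ):
    """Minimax value of state s: the opponent is assumed fully adversarial."""
    if is_end(s):
        return utility(s)
    values = [minimax(succ(s, a), is_end, utility, player, actions, succ)
              for a in actions(s)]
    return max(values) if player(s) == "agent" else min(values)
```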
Speeding up minimax

❒ TD learning – Temporal difference (TD) learning is used when we don't know the transitions
and rewards. The value is based on an exploration policy. To be able to use it, we need to know
the rules of the game, Succ(s,a). For each (s,a,r,s′), the update is done as follows:

    w ← w − η [V(s,w) − (r + γ V(s′,w))] ∇w V(s,w)

where V(s,w) is an evaluation function of the state parametrized by weights w.
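A minimal sketch of the TD update for a linear evaluation function V(s,w) = w · φ(s), for which
∇w V(s,w) = φ(s); `phi` is a hypothetical feature map returning a NumPy array.

```python
import numpy as np

def td_update(w, phi, s, r, s_next, gamma, eta):
    """One TD step: w <- w - eta [V(s,w) - (r + gamma V(s',w))] grad_w V(s,w)."""
    v = np.dot(w, phi(s))
    v_next = np.dot(w, phi(s_next))
    return w - eta * (v - (r + gamma * v_next)) * phi(s)
```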
Simultaneous games

Unlike turn-based games, simultaneous games have no ordering on the players' moves.

❒ Single-move simultaneous game – Let there be two players A and B, with given possible
actions. We note V(a,b) to be A's utility if A chooses action a and B chooses action b. V is
called the payoff matrix.

❒ Mixed strategy – A mixed strategy π is a probability distribution over actions:

    ∀a ∈ Actions, 0 ≤ π(a) ≤ 1

❒ Game evaluation – The value of the game V(πA,πB) when player A follows πA and player
B follows πB is such that:

    V(πA,πB) = Σ_{a,b} πA(a) πB(b) V(a,b)

❒ Minimax theorem – With πA, πB ranging over mixed strategies, for every simultaneous
two-player zero-sum game with a finite number of actions, we have:

    max_{πA} min_{πB} V(πA,πB) = min_{πB} max_{πA} V(πA,πB)


Non-zero-sum games

❒ Payoff matrix – We define Vp(πA,πB) to be the utility for player p.

❒ Nash equilibrium – A Nash equilibrium is a pair of policies (π*A, π*B) such that no player
has an incentive to change its strategy unilaterally.

Remark: in any finite-player game with a finite number of actions, there exists at least one Nash
equilibrium.
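To make the game evaluation formula above concrete, here is a minimal sketch computing
V(πA,πB) from a payoff matrix; the payoffs and mixed strategies are made up for illustration.

```python
def game_value(payoff, pi_A, pi_B):
    """V(pi_A, pi_B) = sum_{a,b} pi_A(a) pi_B(b) V(a,b), where payoff[(a, b)] is A's utility."""
    return sum(pi_A[a] * pi_B[b] * v for (a, b), v in payoff.items())

# Example: a small two-action payoff matrix with uniform mixed strategies.
payoff = {("1", "1"): 2, ("1", "2"): -3, ("2", "1"): -3, ("2", "2"): 4}
uniform = {"1": 0.5, "2": 0.5}
print(game_value(payoff, uniform, uniform))   # 0.25 * (2 - 3 - 3 + 4) = 0.0
```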