Monte Carlo tree search (MCTS) is a heuristic search algorithm that combines tree search and random sampling to find optimal decisions. MCTS iteratively builds a search tree through selection, expansion, simulation, and backpropagation steps within a computational budget. At each iteration, it selects a node for expansion, simulates a random playout from that node to a terminal state, and backs up the result to update node values. This guides the tree growth towards more promising areas of the search space. MCTS has been successfully applied to many complex games and planning problems.


MONTE CARLO TREE SEARCH

Nguyễn Ngọc Thảo
[email protected]
Outline
• An introduction to MCTS
• A complete walkthrough with an example
• A deeper insight into MCTS
Monte Carlo tree search
Game tree and search strategies
• A game tree represents a hierarchy of moves in a game.
• Each node of a game tree represents a particular state in a game.
• A move makes a transition from a node to one of its successors.

Game tree and search strategies
• Many AI problems can be cast as search problems, which
are solved by finding the best plan, model or function.
• A search algorithm finds the best path to win the game.

• Minimax search for two-player adversarial games
• Best-first search for single-player games
Minimax and Expectimax
• Minimax explores all the nodes available, which is infeasible for complex games in a finite amount of time.
• It does not handle imperfect-information or stochastic games either.
• Expectimax generalizes minimax to stochastic games, as sketched below.
• Pruning is harder because of chance nodes.
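A minimal sketch of the expectimax recursion in Python; the state interface used here (is_terminal, utility, is_chance, outcomes, children) is assumed for illustration and is not part of the lecture.

def expectimax(state, maximizing):
    # Hedged sketch: the min step of minimax is kept for the opponent,
    # and chance nodes are evaluated by their expected value.
    if state.is_terminal():
        return state.utility()
    if state.is_chance():
        # Chance node: weight each outcome by its probability.
        return sum(p * expectimax(s, maximizing) for s, p in state.outcomes())
    values = [expectimax(s, not maximizing) for s in state.children()]
    return max(values) if maximizing else min(values)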
Monte Carlo tree search (MCTS)
• MCTS is a heuristic search method that links the precision of
tree search with the generality of random sampling.
• It finds optimal decisions in a domain by taking random
samples in the decision space to grow the search tree.

• The basic MCTS process: a tree is built in an incremental and asymmetric manner.
MCTS: Fundamental idea
• MCTS progressively builds a partial game tree, guided by
the results of previous exploration on the same tree.

• The longer the tree grows, the more accurate the estimates become.
• The tree grows in an asymmetric manner, heading towards more promising moves.
• The tree is used to estimate the values of moves via random simulation → the policy is adjusted towards a best-first strategy.
MCTS: Steps in an iteration
• MCTS iteratively grows a search tree within some predefined computational budget (e.g., time, memory, or iterations).
• Node: a state of the domain.
• Link: an action leading to a subsequent state.
• Each node contains statistics describing at least a reward value and the number of visits (see the sketch below).
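A minimal sketch of that per-node bookkeeping, with class and field names of my own choosing (not from the lecture):

class Node:
    # Hedged sketch: the statistics kept for each tree node.
    def __init__(self, state, parent=None, action=None):
        self.state = state          # a state of the domain
        self.parent = parent
        self.action = action        # the link: action leading to this state
        self.children = []
        self.total_reward = 0.0     # accumulated simulation reward
        self.visits = 0             # number of times this node was visited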
MCTS: Selection step
1. Selection: recursively apply a child selection policy (the tree policy), from the root until the most urgent expandable node is reached.
• Starting at the root node 𝑡0 , we reach the node 𝑡𝑛 .
MCTS: Expansion step
2. Expansion: expand the tree by adding one (or more)
children, according to the available actions.

• An unvisited action 𝑎 from state 𝑠 is selected, leading to a state 𝑠′, and a new leaf node 𝑡𝑙 is added to the tree.
MCTS: Simulation step
3. Simulation: run a simulated play from the new node(s)
using the default policy to produce an outcome.

• A simulation is run from the node 𝑡𝑙 to produce a reward value ∆, which may be
• a discrete (win/draw/loss) result or a continuous reward value for simple domains, or
• a vector of reward values, one per agent, for more complex multiagent domains.
MCTS: Backpropagation step
4. Backpropagation: back up the simulation result through the
selected nodes to update their statistics.

• The reward value ∆ is backed up to update the nodes along the path.
• For each node, its visit count is incremented and its average reward (or Q-value) is updated according to ∆ (a sketch follows below).
• As soon as the computation budget is reached or the search is interrupted, it terminates.
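A minimal sketch of that backup, reusing the hypothetical Node fields introduced earlier (the average reward is recovered as total_reward / visits):

def backpropagate(node, delta):
    # Hedged sketch: walk from the newly evaluated node back to the root,
    # incrementing visit counts and accumulating the simulation reward.
    while node is not None:
        node.visits += 1
        node.total_reward += delta
        node = node.parent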
Monte Carlo tree search (MCTS)
• Tree policy: select or create a leaf node from the nodes already in the tree (selection and expansion).
• Default policy: play out the domain from a given non-terminal state to produce a value estimate (simulation).
• The search finally returns the action 𝑎 that leads to the best child of the root node 𝑣0 .
MCTS: Child selection
• An action 𝑎 of the root 𝑡0 is selected by one of the following mechanisms (sketched in code below).
• Max child: select the root child with the highest reward.
• Robust child: select the most visited root child.
• Max-Robust child: select the root child with both the highest visit count and the highest reward.
• If no such child exists, continue the search until an acceptable visit count is achieved.
• Secure child: select the child that maximizes a lower confidence bound.
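A minimal sketch of the max, robust, and secure criteria, using the hypothetical Node fields from earlier; the lower-confidence-bound form used for the secure child is one common choice, not necessarily the lecture's.

import math

def max_child(root):
    # Highest average reward.
    return max(root.children, key=lambda c: c.total_reward / c.visits)

def robust_child(root):
    # Highest visit count.
    return max(root.children, key=lambda c: c.visits)

def secure_child(root, bias=1.0):
    # Maximizes a lower confidence bound on the average reward.
    def lcb(c):
        return c.total_reward / c.visits - bias * math.sqrt(math.log(root.visits) / c.visits)
    return max(root.children, key=lcb)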
MCTS: Applications

• Real-time games and nondeterministic games
• Physics simulations
• Bus regulation problem
• Travelling Salesman Problem
A complete walkthrough with an example
Upper confidence bound value
• The upper confidence bound for a node is defined as

UCB1 = 𝑉𝑖 + 𝐶 · √(ln 𝑁 / 𝑛𝑖)

• 𝑉𝑖 : the value estimate of node 𝑖, which is the average reward of all nodes beneath this node
• 𝐶 : a tunable bias parameter (in this example, 𝐶 = 2)
• 𝑁 : the number of times the parent node has been visited
• 𝑛𝑖 : the number of times the current node 𝑖 has been visited
• UCB1 can serve as a tree policy.
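A minimal helper implementing this formula (the function name and argument order are my own), with a check against the walkthrough numbers that appear later:

import math

def ucb1(value_estimate, visits, parent_visits, c=2.0):
    # Hedged sketch of the UCB1 score above; unvisited nodes get infinity
    # so they are always tried first.
    if visits == 0:
        return float("inf")
    return value_estimate + c * math.sqrt(math.log(parent_visits) / visits)

# With C = 2, N = 2 and n_i = 1 (iteration 3 of the walkthrough):
# ucb1(20, 1, 2) ≈ 21.67 and ucb1(10, 1, 2) ≈ 11.67.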
Rollout
• Randomly pick an action at each step and simulate that action, receiving a reward when the game ends.

Loop:
    if Si is a terminal state:
        return Value(Si)
    Ai = random(available_actions(Si))
    Si = Simulate(Si, Ai)
Until a terminal state is reached.
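The same loop as runnable Python, assuming the domain supplies the helpers is_terminal, value, available_actions, and simulate (all hypothetical names):

import random

def rollout(state, is_terminal, value, available_actions, simulate):
    # Hedged sketch: play uniformly random actions until a terminal state,
    # then return that state's value as the simulation reward.
    while not is_terminal(state):
        action = random.choice(available_actions(state))
        state = simulate(state, action)
    return value(state)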
An example: Iteration 1
• Let’s start with an initial state 𝑆0 .
• The actions 𝑎1 and 𝑎2 lead to states 𝑆1 and 𝑆2 , each with a total score 𝑡 and a visit count 𝑛.
• Q: How do we choose between the two child nodes?
• A: Compute the UCB1 values for both child nodes and take whichever node maximizes the UCB1 value.
→ Simply take the first node when no node has been visited yet.
An example: Iteration 1
• The leaf node taken (𝑆1 ) has not been visited before.
→ Do a rollout all the way down to a terminal state.
• Let's say the value of this rollout is 20 (just an example).
An example: Iteration 1
• The value 20 is backed up all the way to the root.
• So now, 𝑡 = 20 and 𝑛 = 1 for nodes 𝑆1 and 𝑆0 .
• That's the end of the first iteration.
An example: Iteration 2

• The UCB1 values are 20 for 𝑆1 and infinity for 𝑆2 → visit 𝑆2 next.
• Roll out from 𝑆2 to get a value of 10.
• The value 10 is then backed up to the root → the total value at the root node is now 30.
An example: Iteration 3

UCB1(𝑆1) = 20 + 2·√(ln 2 / 1) ≈ 21.67        UCB1(𝑆2) = 10 + 2·√(ln 2 / 1) ≈ 11.67

• 𝑆1 has a higher UCB1 value → the expansion will be done here.
• We do a rollout from 𝑆3 and get a value of 0 at the leaf node.
An example: Iteration 4
• 𝑆1 has a higher UCB1 value:

UCB1(𝑆1) = 20 + 2·√(ln 3 / 2) ≈ 21.48
UCB1(𝑆2) = 10 + 2·√(ln 3 / 1) ≈ 12.10

• The UCB1 values are 0 for 𝑆3 and infinity for 𝑆4 → visit 𝑆4 next.
• A rollout is done from the new leaf node to get a value, which is then backpropagated.
A deeper insight into MCTS
Simulation / Playout
• Simulation (or playout) is a sequence of moves that starts at the current node and ends in a terminal node.
• It is an approximate evaluation of a tree node, computed by running a (more or less random) game starting at that node.
Simulation / Playout
• During the simulation, the rollout policy function consumes a
game state 𝒔𝒊 and produces the next move/action 𝒂𝒊 .
• The default rollout policy is uniformly random.
• In practice, it is designed to be fast so that many simulations can be played quickly.
• A simulation always results in an evaluation.
• For games this is a win, loss, or draw, but in general any value is a legitimate result of a simulation.
Simulation in AlphaGo and AlphaZero
• In AlphaGo, the evaluation of the leaf 𝑆𝐿 is defined as (sketched in code below)
𝑉(𝑆𝐿) = (1 − 𝛼)·𝑣0(𝑆𝐿) + 𝛼·𝑧𝐿
• 𝑧𝐿 : a standard rollout evaluation with a custom fast rollout policy, which is a shallow softmax neural network with handcrafted features.
• 𝑣0 : a value network that evaluates positions with a 13-layer CNN, trained on 30 million distinct positions extracted from self-play games.
• In AlphaZero, a 19-layer residual CNN rates the node directly:
𝑉(𝑆𝐿) = 𝑓0(𝑆𝐿)
• It outputs both a position evaluation and a move probability vector.
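The AlphaGo blend is a one-line convex combination; a sketch with value_network and fast_rollout as stand-ins for the two evaluators (not AlphaGo's real API):

def alphago_leaf_value(s_leaf, value_network, fast_rollout, alpha=0.5):
    # Hedged sketch of V(S_L) = (1 - alpha) * v0(S_L) + alpha * z_L;
    # alpha = 0.5 is only an illustrative default.
    return (1 - alpha) * value_network(s_leaf) + alpha * fast_rollout(s_leaf)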
Node expansion on the game tree
• A node is considered visited if a playout has been started in that node, i.e., it has been evaluated at least once.
• A node is fully expanded if all its children are visited.
Backpropagation
• Backpropagation carries simulation results up to the root.
• For every node on the path, certain statistics are updated.

• A node's statistics reflect the results of simulations started in all its descendants.
Statistics in a node 𝑣
• 𝑄(𝑣): the total simulation reward
• In its simplest form, it is the sum of the simulation results that passed through the node.
• 𝑁(𝑣): the total number of visits
• That is, a counter of how many times the node has been on the backpropagation path.
• 𝑄(𝑣) indicates how promising the node is, and 𝑁(𝑣) indicates how intensively it has been explored.
• These values are maintained for every visited node.
Exploration – Exploitation Dilemma

• Nodes with a high reward are good candidates to follow (exploitation), but those with a low number of visits may be interesting too, because they have not been explored well (exploration).
Upper Confidence Bound for trees
• Upper Confidence Bound for trees (UCT) lets us choose the next node to traverse through among the visited nodes.

UCT(𝑣𝑖 , 𝑣) = 𝑄(𝑣𝑖)/𝑁(𝑣𝑖) + 𝐶 · √(ln 𝑁(𝑣) / 𝑁(𝑣𝑖))

• 𝐶 is a tunable bias parameter (𝐶 = 1/√2 for rewards in [0, 1]).
• There is an essential balance between the first (exploitation) and second (exploration) terms.
• The exploration term ensures that each child has a nonzero probability of selection.
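Using the hypothetical Node fields from earlier, UCT-based child selection is a small function (a sketch, not the lecture's code); it assumes every child has been visited at least once, i.e., the node is fully expanded.

import math

def best_uct_child(node, c=1 / math.sqrt(2)):
    # Hedged sketch: pick the child maximizing exploitation + exploration.
    def uct(child):
        exploit = child.total_reward / child.visits
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore
    return max(node.children, key=uct)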
MCTS pseudo-code

def monte_carlo_tree_search(root):
    while resources_left(time, computational_power):
        leaf = traverse(root)  # leaf = an unvisited node
        simulation_result = rollout(leaf)
        backpropagate(leaf, simulation_result)
    return best_child(root)

def traverse(node):
    # Selection: follow the tree policy while the node is fully expanded.
    while fully_expanded(node):
        node = best_uct(node)
    # Expansion: pick an unvisited child, or return the node itself
    # in case no children are present / the node is terminal.
    return pick_unvisited(node.children) or node

def rollout(node):
    # Simulation: play the default policy until a terminal node is reached.
    while non_terminal(node):
        node = rollout_policy(node)
    return result(node)

def rollout_policy(node):
    # Default policy: uniformly random moves.
    return pick_random(node.children)

def backpropagate(node, result):
    # Backpropagation: update statistics along the path back to the root.
    node.stats = update_stats(node, result)
    if is_root(node):
        return
    backpropagate(node.parent, result)

def best_child(node):
    # Pick the child with the highest number of visits.
    return max(node.children, key=number_of_visits)
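To see all four steps end to end, here is a small self-contained sketch on a toy single-player game (start at 0, add 1 or 2 per move, reward 1 for landing exactly on 10); the game, the class, and every parameter value are my own illustration, not from the lecture.

import math
import random

TARGET = 10  # toy game: add 1 or 2 per move, win by reaching exactly 10

def is_terminal(state):
    return state >= TARGET

def reward(state):
    return 1.0 if state == TARGET else 0.0

def actions(state):
    return [1, 2]

def step(state, action):
    return state + action

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.untried = [] if is_terminal(state) else list(actions(state))
        self.total_reward, self.visits = 0.0, 0

def uct_select(node, c=math.sqrt(2)):
    # Tree policy: UCT over the children of a fully expanded node.
    return max(node.children,
               key=lambda ch: ch.total_reward / ch.visits
                              + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts(root_state, iterations=2000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while fully expanded and non-terminal.
        while not node.untried and node.children:
            node = uct_select(node)
        # 2. Expansion: add one child for an untried action.
        if node.untried:
            a = node.untried.pop(random.randrange(len(node.untried)))
            child = Node(step(node.state, a), parent=node, action=a)
            node.children.append(child)
            node = child
        # 3. Simulation: uniformly random rollout from the new node.
        state = node.state
        while not is_terminal(state):
            state = step(state, random.choice(actions(state)))
        delta = reward(state)
        # 4. Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += delta
            node = node.parent
    # Final move choice: the robust child (most visited root child).
    return max(root.children, key=lambda ch: ch.visits).action

if __name__ == "__main__":
    print(mcts(0))  # prints the first move chosen after the search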
List of references
• Browne, Cameron B., et al. "A survey of Monte Carlo tree search methods." IEEE Transactions on Computational Intelligence and AI in Games 4.1 (2012): 1–43.
• mctspy: a Python implementation of the Monte Carlo tree search algorithm (link)
• Introduction to Monte Carlo Tree Search: The Game-Changing Algorithm behind DeepMind's AlphaGo (link)
• Monte Carlo tree search: beginner's guide (link)
• Bruno Bouzy. Monte-Carlo Tree Search (MCTS) for Computer Go. Lecture notes for the AOA class, Université Paris Descartes (link)
