Session 04 Monte Carlo Tree Search

This document provides an overview of Monte Carlo tree search (MCTS) and how it was used in AlphaGo. MCTS builds a search tree through repeated selection, expansion, simulation, and backpropagation steps. It balances exploration of new paths against exploitation of promising paths using an approach like UCB1. AlphaGo applied MCTS to the game of Go by using neural networks to guide move selection and evaluate simulated game outcomes.


MSBA7003 Quantitative Analysis Methods

ZHANG, Wei
Assistant Professor
HKU Business School

04 Monte Carlo Tree Search

1
Agenda
• Dynamic Decision Making under Risk
• AlphaGo and Game AI
• Monte Carlo Tree Search
  • Concepts
  • Search Strategy
  • Simulation Strategy
  • Implementation
• Carpark Revenue Management
2
Dynamic Decision Making under Risk
• In the real business world, there are many complicated decision-making
processes, wherein early decisions affect future system states and
decisions, and decisions must be made in real time.
  • Airline ticket pricing
  • Hotel room pricing
  • Game AI
• When the decision tree is too big:
  • the problem cannot be solved in finite time;
  • some critical information is unavailable.

[Figure: a multi-period pricing decision tree. In Period 1 the seller picks a price ($1000, $999, …, $1, $0); the state of nature is the number of units sold (Sold 0, Sold 1, …, Sold 199, Sold 200); Period 2 then repeats the pricing decision ($1000, $999, …, $1, $0), and the branches multiply rapidly.]
3
AlphaGo and Game AI
• Normally, a game can be solved by dynamic programming (DP).
• DP fails if the game tree is too large or the random events follow
complicated or unknown probability distributions.

4
AlphaGo and Game AI
• AlphaGo defeated Lee Sedol and Ke Jie, two of the world’s top Go players.
• Monte Carlo Tree Search (MCTS) is at the heart of AlphaGo’s
algorithm.

5
AlphaGo and Game AI
• AlphaGo’s algorithm
• Step 1: historical data => predictive models 𝜋 and 𝜎
• Step 2: predictive model 𝜎 => preliminary reinforcement learning (RL) model 𝜌
• Step 3: preliminary RL model 𝜌 => self play data => value network
• Step 4: (model 𝜋 + model 𝜎 + value network) => Monte Carlo Tree Search

[Figure: pipeline from Predictive Model 𝜋 and Predictive Model 𝜎 through Reinforcement Learning, with model parameters flowing between the steps.]

6
Monte Carlo Tree Search
• The basic MCTS algorithm builds a search tree, node by node, according to
the outcomes of simulated playouts.
• The process can be broken down into four steps, which are repeated:

Selection → Expansion → Simulation/Evaluation → Backpropagation

[Figure: one MCTS iteration on a small tree whose nodes are labeled wins/visits. Selection descends from the root (1/4), Expansion adds a new 0/0 node, Simulation plays out to a Win, and Backpropagation updates the counts along the path, so the root becomes 2/5.]
7
Monte Carlo Tree Search
• Initialization: Create the root node (i.e., the level 1 decision node) and the child
nodes (i.e., the state-of-nature nodes) following each option.
• Selection: Search forward by selecting a state-of-nature node (or an option)
according to the recorded data and a given policy A. After that, select or create
a child node (or a state of nature) through simulation. Repeat until no more child
nodes are available as options.
• Expansion: If the newly created node is not a terminal node, create all the state-
of-nature nodes (i.e., all the options) as the child nodes for the current node.
• Simulation/Evaluation: Repeat the previous two steps (selection, now with a
different policy B, plus expansion) until a terminal node is reached and the
final payoff is received. Alternatively, directly evaluate the payoff of the
current node.
• Backpropagation: Update the summary statistics of each node along the path
according to the result. (A minimal sketch of the full loop follows below.)

8
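As a concrete reference for the steps above, here is a minimal Python sketch of the loop. This is not the course's implementation: `Node` and the callables `select`, `expand_children`, `simulate_payoff`, and `is_terminal` are illustrative assumptions that a problem-specific model would have to supply.

```python
import random

# A minimal MCTS skeleton. The selection policy (policy A), expansion rule,
# simulation policy (policy B), and terminal test are passed in as functions.

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.n = 0      # number of visits
        self.V = 0.0    # average payoff observed through this node

def mcts(root, iterations, select, expand_children, simulate_payoff, is_terminal):
    for _ in range(iterations):
        # Selection: walk down the tree with policy A while children exist.
        node = root
        while node.children:
            node = select(node)
        # Expansion: unless terminal, create the child nodes and pick one.
        if not is_terminal(node.state):
            node.children = expand_children(node)
            node = random.choice(node.children)
        # Simulation/Evaluation: play out to a terminal payoff with policy B.
        payoff = simulate_payoff(node.state)
        # Backpropagation: update the statistics along the path to the root.
        while node is not None:
            node.n += 1
            node.V += (payoff - node.V) / node.n  # incremental average
            node = node.parent
    # After training, recommend the root child with the best average value.
    return max(root.children, key=lambda c: c.V)
```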
Selection Strategy
• Random Search
• Focuses on exploration (i.e., looking in areas that have not been well
sampled yet) but ignores exploitation (i.e., looking in areas that
appear to be promising)
• The “optimal” option given by the tree in the end may not be truly
optimal

• Upper Confidence Bound (UCB1) Strategy


• Balances exploitation of the currently most promising option with
exploration of alternatives which may later turn out to be better
• Converges to optimal decisions given sufficient time

9
UCB1 Strategy
• The upper confidence bound for node $i$ (an option) is given by

$$B_i = V_i + 2\sqrt{\frac{\ln N_i}{n_i}}$$

• $V_i$ is the average total reward/value achieved by paths after node $i$ (going forward)
• $N_i$ is the number of times the parent node of $i$ has been visited
• $n_i$ is the number of times node $i$ has been visited

• At a decision node, select the child node that maximizes the UCB.

10
UCB1 Strategy
• If more than one child node has the same maximum UCB value, the tie is usually broken randomly.
• The first term drives exploitation and the second term drives exploration. When a child node is selected, its second term decreases (its $n_i$ grows), while the second term of the other child nodes increases (their shared $N_i$ grows).
• $n_i = 0$ yields a UCB value of infinity, so previously unvisited child nodes are assigned the largest value, ensuring that all child nodes are considered at least once before any child node is further expanded.
• The formula can be modified to $B_i = V_i + 2C\sqrt{\frac{\ln N_i}{n_i}}$, where $C$ can be adjusted to decrease or increase the amount of exploration performed. If the reward is beyond the range $[0,1]$, the value of $C$ should be chosen carefully to achieve the balance. (A small code sketch follows below.)

11
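A small Python sketch of this formula, written with the generalized constant C (an assumption; the course code may differ). With C = 1 it reproduces the UCB values in the node tables on slides 15 to 18:

```python
import math

def ucb1(V_i, n_i, N_i, C=1.0):
    """B_i = V_i + 2*C*sqrt(ln(N_i)/n_i); N_i is the parent's visit count."""
    if n_i == 0:
        return math.inf  # unvisited children are always tried first
    return V_i + 2 * C * math.sqrt(math.log(N_i) / n_i)

print(round(ucb1(V_i=1.0, n_i=1, N_i=2), 4))  # 2.6651, as in the slide 15 table
print(round(ucb1(V_i=0.0, n_i=1, N_i=3), 4))  # 2.0963, as in the slide 18 table
```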
AlphaGo’s Selection Strategy
• $Q_i = V_i$
• $u_i \propto \frac{P_i}{1 + n_i}$, where the prior probability $P_i$ is given by model 𝜎
• At a decision node, AlphaGo selects the move that maximizes $Q_i + u_i$ (see the sketch below).
• The opponent’s next move is also given by model 𝜎.

12
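A hedged sketch of this selection rule. The `children` list, its `Q`/`P`/`n` attributes, and the constant `c_puct` are illustrative assumptions rather than AlphaGo's actual code:

```python
def select_child(children, c_puct=1.0):
    # Maximize Q_i + u_i, with u_i proportional to P_i / (1 + n_i),
    # where P_i is the prior move probability from policy network sigma.
    return max(children, key=lambda ch: ch.Q + c_puct * ch.P / (1 + ch.n))
```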
Simulation/Evaluation Strategy
• Simulation/evaluation is initiated
  • when the search reaches a state-of-nature node;
  • when the search moves to a new decision node.
• The quality of the S/E strategy determines the quality of decisions and the
speed of convergence.
• If the state of nature depends on previous decisions, a model should be
built to capture the dependence of the state on previous decisions.
• Historical data may depend on historical decisions.
• E.g., historical sales depend on historical prices and inventory levels …

13
AlphaGo’s Simulation/Evaluation Strategy
• Step 1: Starting from the current decision node, finish the game using policy 𝜋.
• Step 2: Evaluate the chance of success for the simulated move using the value network.
• Step 3: Calculate the weighted average of the two results and assign it to the Q value of
the simulated move (see the formula below).

14
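For reference, the weighted average in Step 3 appears in the AlphaGo paper as a $\lambda$-mixture of the value-network estimate $v_\theta(s)$ and the rollout outcome $z$ (the paper used $\lambda = 0.5$):

$$V(s) = (1 - \lambda)\, v_\theta(s) + \lambda z$$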
Implementation: Building Search Tree
• The MCTS algorithm with UCB1 selection strategy
can be implemented with a table.

| Node Index | Parent Index | Child Index | Node Type | Meaning       | n | V   | UCB    |
|------------|--------------|-------------|-----------|---------------|---|-----|--------|
| 1          | 0            | {2,3,4}     | Decision  | The root node | 2 | 1/2 | -      |
| 2          | 1            | {…}         | State     | Option 1-1    | 1 | 1/1 | 2.6651 |
| 3          | 1            | {}          | State     | Option 1-2    | 0 | 0   | ∞      |
| 4          | 1            | {…}         | State     | Option 1-3    | 1 | 0   | 1.6651 |
| …          |              |             |           |               |   |     |        |
| K+1        | 3            | {}          | Decision  | State 3-1     | 0 | 0   | ∞      |
15
Implementation: Building Search Tree

| Node Index | Parent Index | Child Index | Node Type | Meaning         | n | V   | UCB    |
|------------|--------------|-------------|-----------|-----------------|---|-----|--------|
| 1          | 0            | {2,3,4}     | Decision  | The root node   | 2 | 1/2 | -      |
| 2          | 1            | {…}         | State     | Option 1(a)     | 1 | 1/1 | 2.6651 |
| 3          | 1            | {K+1}       | State     | Option 1(b)     | 0 | 0   | ∞      |
| 4          | 1            | {…}         | State     | Option 1(c)     | 1 | 0   | 1.6651 |
| …          |              |             |           |                 |   |     |        |
| K+1        | 3            | {}          | Decision  | State 3(a)      | 0 | 0   | ∞      |
| K+2        | K+1          | {}          | State     | Option (K+1)(a) | 0 | 0   | ∞      |
| K+3        | K+1          | {}          | State     | Option (K+1)(b) | 0 | 0   | ∞      |
16
Implementation: Building Search Tree
| Node Index | Parent Index | Child Index | Node Type | Meaning         | n | V   | UCB    |
|------------|--------------|-------------|-----------|-----------------|---|-----|--------|
| 1          | 0            | {2,3,4}     | Decision  | The root node   | 2 | 1/2 | -      |
| 2          | 1            | {…}         | State     | Option 1(a)     | 1 | 1/1 | 2.6651 |
| 3          | 1            | {K+1}       | State     | Option 1(b)     | 0 | 0   | ∞      |
| 4          | 1            | {…}         | State     | Option 1(c)     | 1 | 0   | 1.6651 |
| …          |              |             |           |                 |   |     |        |
| K+1        | 3            | {K+2,K+3}   | Decision  | State 3(a)      | 0 | 0   | ∞      |
| K+2        | K+1          | {}          | State     | Option (K+1)(a) | 0 | 0   | ∞      |
| K+3        | K+1          | {}          | State     | Option (K+1)(b) | 0 | 0   | ∞      |
| …          |              |             |           |                 |   |     |        |
| K’         | …            | {}          | Terminal  | Outcome         | 0 | 1   | ∞      |
17
Implementation: Building Search Tree
| Node Index | Parent Index | Child Index | Node Type | Meaning         | n | V   | UCB    |
|------------|--------------|-------------|-----------|-----------------|---|-----|--------|
| 1          | 0            | {2,3,4}     | Decision  | The root node   | 3 | 2/3 | -      |
| 2          | 1            | {…}         | State     | Option 1(a)     | 1 | 1/1 | 3.0963 |
| 3          | 1            | {K+1}       | State     | Option 1(b)     | 1 | 1/1 | 3.0963 |
| 4          | 1            | {…}         | State     | Option 1(c)     | 1 | 0   | 2.0963 |
| …          |              |             |           |                 |   |     |        |
| K+1        | 3            | {K+2,K+3}   | Decision  | State 3(a)      | 1 | 1   | 1      |
| K+2        | K+1          | {…}         | State     | Option (K+1)(a) | 1 | 1   | 1      |
| K+3        | K+1          | {}          | State     | Option (K+1)(b) | 0 | 0   | ∞      |
| …          |              |             |           |                 |   |     |        |
| K’         | …            | {}          | Terminal  | Outcome         | 1 | 1   | 1      |

(A Python sketch of this table bookkeeping follows below.)
18
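A minimal sketch of the table-based bookkeeping on slides 15 to 18, assuming the tree is a dict of records keyed by node index with fields mirroring the table columns; the field names and helper structure are assumptions, not the course code:

```python
import math

def backpropagate(tree, leaf_index, payoff):
    # Update n and the running-average V for every node from the leaf to the
    # root (parent index 0 marks the level above the root, as in the tables).
    i = leaf_index
    while i != 0:
        node = tree[i]
        node["n"] += 1
        node["V"] += (payoff - node["V"]) / node["n"]
        i = node["parent"]

def refresh_ucb(tree, parent_index):
    # Recompute UCB = V + 2*sqrt(ln N / n) for the children of one node.
    N = tree[parent_index]["n"]
    for j in tree[parent_index]["children"]:
        c = tree[j]
        c["UCB"] = (math.inf if c["n"] == 0
                    else c["V"] + 2 * math.sqrt(math.log(N) / c["n"]))
```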
Implementation: Optimization
• When doing optimization:
  • Select the option that maximizes the total reward/value V.
  • The state of nature realizes according to the real situation.
  • If the state does not exist in the current tree, perform the MCTS algorithm in
real time and then do the optimization (see the sketch below).
• Note: If the initial state is uncertain, you need to train multiple trees.
  • E.g., you are playing the white stones in the game of Go.
• The quality of the search tree can be measured by the V values of the
child nodes of the root.

19
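A short sketch of this optimization step, reusing the assumed node table from the previous sketch; `run_mcts` is a hypothetical trainer for the subtree rooted at the current node:

```python
def choose_option(tree, node_index, run_mcts, iterations=1000):
    # At decision time, exploit only: pick the child with the best V.
    children = tree[node_index]["children"]
    if not children:
        # State never simulated before: run MCTS in real time, then optimize.
        run_mcts(tree, root_index=node_index, iterations=iterations)
        children = tree[node_index]["children"]
    return max(children, key=lambda j: tree[j]["V"])
```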
Carpark Revenue Management
• Company Background
• SPS is one of the largest carpark operators in Hong Kong.
• Three types of customers
• Hourly parking
• Floating monthly parking
• Reserved monthly parking
• Problem to Solve
• 50 spaces, 20 floating monthly users, 10 reserved monthly users
• Maximize revenue from hourly parking without affecting monthly customers
• When to show the “FULL” sign?

20
Carpark Revenue Management
• Challenges:
• Arrivals and lengths of stay for hourly customers were not recorded when the
“FULL” sign was on
• Demand can be different for different days and different hours of a day
• Decisions made across a day are not independent. Because the length of stay is
random, accepting too many hourly users may affect monthly users several
hours later
• Solution:
• Re-generate the hourly parking arrival time intervals and the lengths of stay
using the uncensored data
• Use Monte Carlo Tree Search to decide whether to show the “FULL” sign for
every 15-minute interval

21
Carpark Revenue Management
[Screenshots of the regenerated data files Hourly_ReGen.csv and MonthlyFloating_ReGen.csv.]

22
Carpark Revenue Management

[Code screenshot. Surviving annotation: “A pointer to find the appropriate day in the data.”]

23

Carpark Revenue Management


[Code screenshot. Surviving annotations (a hedged reconstruction of this code follows below):]
• We “die” if a monthly user cannot find a parking lot.
• Simulation terminates at midnight.
• The first element in the list.
• Total number of hourly arrivals during the day.
……
24
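The code on this slide did not survive extraction, so the following is only a hedged reconstruction of the simulation step from the surviving annotations. The event format, the penalty for blocking a monthly user, and the function shape are all assumptions:

```python
def simulate_day(events, policy, capacity=50):
    """Simulate one day from a time-sorted list of (minute, kind, fee) events.
    Returns hourly revenue, or -inf ("we die") if a monthly user is blocked."""
    occupied, revenue = 0, 0.0
    for minute, kind, fee in events:
        if minute >= 24 * 60:
            break  # simulation terminates at midnight
        if kind == "monthly_arrival":
            if occupied >= capacity:
                return float("-inf")  # a monthly user cannot find a space
            occupied += 1
        elif kind == "hourly_arrival":
            # Admit only if space remains and the sign policy says "Available".
            if occupied < capacity and policy(minute, occupied) == "Available":
                occupied += 1
                revenue += fee
        elif kind == "departure":
            occupied -= 1
    return revenue
```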
Carpark Revenue Management
……

[Code screenshot. Surviving annotation: “Check whether the current state has been realized before. If yes, go to that node; otherwise, build a new decision node.” A hedged sketch follows below.]

25
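A hedged sketch of the node lookup described in the annotation above, again using the assumed node-table structure plus a hypothetical `state_index` dict mapping realized states to node indices:

```python
def lookup_or_create(tree, state_index, state, parent_index):
    # If the current state has been realized before, go to that node.
    if state in state_index:
        return state_index[state]
    # Otherwise, build a new decision node and register it.
    node_id = max(tree) + 1
    tree[node_id] = {"parent": parent_index, "children": [],
                     "type": "Decision", "n": 0, "V": 0.0, "UCB": float("inf")}
    tree[parent_index]["children"].append(node_id)
    state_index[state] = node_id
    return node_id
```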
Carpark Revenue Management

26
Carpark Revenue Management
The tree after 1000 days of simulation:

[Figure: the search tree after 1000 simulated days.]

We need a really large amount of simulation to explore deeper in the tree!

27
In-Class Exercise

• Consider this tree built by the MCTS algorithm for the carpark problem. It
is now 02:30:00 pm, and there are 5 monthly users and 19 hourly users in the
carpark. Over the next 15 minutes, 2 hourly users will arrive and 0 will
leave; in addition, 1 monthly floating user will return and 0 will leave.
What sign should be shown after 15 minutes?

[Figure: a two-level search tree. The root is the state (nM=5, nH=19) at
02:30:00 pm, with options Available (V=4.5) and Full (V=5.0). At 02:45:00 pm
the child states are (5,19), (6,20), (6,21), (5,19), (5,18), and (6,19), each
with its own Available (A) / Full (F) values: 4.5/5.2, 4.2/5.4, 4.0/5.4,
4.5/5.2, 5.0/4.8, and 4.9/4.7.]

• When doing optimization, if we visit a state that has never been simulated
before, we should:
  • A: Randomly pick an option
  • B: Run the MCTS algorithm in real time

28
