Session 04 Monte Carlo Tree Search
ZHANG, Wei
Assistant Professor
HKU Business School
Agenda
• Dynamic Decision Making under Risk
• AlphaGo and Game AI
• Monte Carlo Tree Search
• Concepts
• Search Strategy
• Simulation Strategy
• Implementation
• Carpark Revenue Management
Dynamic Decision Making under Risk
• In the real business world, there are many complicated decision-making
processes in which early decisions affect future system states and later
decisions, and decisions must be made in real time.
• Airline ticket pricing
• Hotel room pricing
• Game AI
[Figure: a dynamic pricing decision tree; in Period 1 the seller chooses a price ($1000, $999, …, $1, $0), and each choice branches into further decisions in later periods.]
AlphaGo and Game AI
• Normally, a game can be solved by dynamic programming (DP).
• DP fails if the game tree is too large or the random events follow
complicated or unknown probability distributions.
AlphaGo and Game AI
• AlphaGo defeated Lee Sedol and Ke Jie, two of the world’s top Go players.
• Monte Carlo Tree Search (MCTS) is at the heart of AlphaGo’s
algorithm.
AlphaGo and Game AI
• AlphaGo’s algorithm
• Step 1: historical data => predictive models 𝜋 and 𝜎
• Step 2: predictive model 𝜎 => preliminary reinforcement learning (RL) model 𝜌
• Step 3: preliminary RL model 𝜌 => self-play data => value network
• Step 4: (model 𝜋 + model 𝜎 + value network) => Monte Carlo Tree Search
Monte Carlo Tree Search
• The basic MCTS algorithm is to build a search tree, node by node,
according to the outcomes of simulated playouts.
• The process can be broken down into four steps.
[Figure: repeated snapshots of a small search tree across one MCTS iteration; each node is labeled with wins/visits, and after a winning playout the statistics along the selected path are updated (e.g., the root changes from 1/2 to 2/3).]
Monte Carlo Tree Search
• Initialization: Create the root node (i.e., the level 1 decision node) and the child
nodes (i.e., the state-of-nature nodes) following each option.
• Selection: Search forward by selecting an option (a state-of-nature node)
according to the recorded statistics and a given policy A. After that, select or
create a child node (a realized state of nature) through simulation. Repeat until
no existing child nodes are available to select.
• Expansion: If the newly created node is not a terminal node, create all the state-
of-nature nodes (i.e., all the options) as the child nodes for the current node.
• Simulation/Evaluation: Repeat the previous two steps (with a different selection
policy B & expansion) until a terminal node is reached and the final payoff is
received. Or, directly evaluate the payoff of the current node.
• Backpropagation: Update the summary statistics of each node along the path
according to the result. (A code sketch of the full loop follows below.)
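To make the four steps concrete, here is a minimal Python sketch of the loop. The toy problem (three binary decisions, payoff = fraction of 1s chosen) and all function and field names are illustrative assumptions, not the implementation used later in this session.

```python
import math
import random

# --- Toy problem (an assumed stand-in for a real game or pricing model) ---
def options(state):
    """Available options at a state; an empty list means a terminal node."""
    return [] if len(state) >= 3 else [0, 1]

def transition(state, option):
    """Deterministic toy transition: append the chosen option."""
    return state + (option,)

def rollout(state):
    """Simulation: finish the game with uniformly random choices."""
    while options(state):
        state = transition(state, random.choice(options(state)))
    return sum(state) / len(state)   # toy payoff in [0, 1]

# --- Generic MCTS over the toy problem ------------------------------------
class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = []
        self.untried = options(state)   # options not yet expanded
        self.n = 0                      # visit count
        self.V = 0.0                    # total reward through this node

def ucb1(node, C=1.0):
    if node.n == 0:
        return float('inf')
    return node.V / node.n + 2 * C * math.sqrt(math.log(node.parent.n) / node.n)

def mcts(root_state, iterations=1000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes
        while not node.untried and node.children:
            node = max(node.children, key=ucb1)
        # 2. Expansion: add one child for an untried option
        if node.untried:
            child = Node(transition(node.state, node.untried.pop()), parent=node)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout from the new node
        reward = rollout(node.state)
        # 4. Backpropagation: update statistics along the path to the root
        while node is not None:
            node.n += 1
            node.V += reward
            node = node.parent
    # One common final rule: recommend the child with the best average value
    return max(root.children, key=lambda c: c.V / c.n).state

print(mcts(()))   # with enough iterations, tends toward choosing option 1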
Selection Strategy
• Random Search
• Focuses on exploration (i.e., looking in areas that have not been well
sampled yet) but ignores exploitation (i.e., looking in areas that appear
promising)
• The “optimal” option given by the tree in the end may not be truly
optimal
UCB1 Strategy
• The upper confidence bound for node 𝑖 (an option) is given by
$B_i = V_i + 2\sqrt{\dfrac{\ln N_i}{n_i}}$
• 𝑉𝑖 is the average total reward/value achieved by paths after node 𝑖 (going forward)
• 𝑁𝑖 is the number of times the parent node of 𝑖 has been visited
• 𝑛𝑖 is the number of times node 𝑖 has been visited
• At a decision node, select the child node that maximizes the UCB.
UCB1 Strategy
• If more than one child node has the same maximum UCB value, the tie is usually broken randomly.
• The first term drives exploitation and the second term drives exploration. When a child node is
selected, its second term decreases (its $n_i$ grows), while the second term of its sibling nodes
increases (their shared $N_i$ grows).
• 𝑛𝑖 = 0 yields a UCB value of infinity, so previously unvisited child nodes are assigned the largest
value in order to ensure that all child nodes are considered at least once before any child node is
further expanded.
• The formula can be modified to $B_i = V_i + 2C\sqrt{\dfrac{\ln N_i}{n_i}}$, where $C$ can be adjusted to
lower or increase the amount of exploration performed. If the reward falls outside the range $[0,1]$,
the value of $C$ should be chosen carefully to maintain the balance. (A numeric check follows below.)
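As a quick check, a few lines of Python reproduce the UCB values that appear in the implementation tables later in this session:

```python
import math

def ucb1(V, N, n, C=1.0):
    """B = V + 2*C*sqrt(ln(N)/n); unvisited nodes (n = 0) get infinity."""
    return float('inf') if n == 0 else V + 2 * C * math.sqrt(math.log(N) / n)

print(round(ucb1(V=1, N=2, n=1), 4))   # 2.6651 (node 2 in the first table)
print(round(ucb1(V=0, N=2, n=1), 4))   # 1.6651 (node 4 in the first table)
print(round(ucb1(V=1, N=3, n=1), 4))   # 3.0963 (nodes 2 and 3 in the second table)
```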
AlphaGo’s Selection Strategy
• $Q_i = V_i$, i.e., the average value achieved by paths through node $i$
• $u_i \propto \dfrac{P_i}{1+n_i}$, where the prior probability $P_i$ is given by model $\sigma$
• At a decision node, select the child node that maximizes $Q_i + u_i$ (a sketch follows below).
• The opponent’s next move is also given by model $\sigma$
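A hedged sketch of this rule in Python; the per-child fields Q, P, n and the exploration constant c_puct are assumptions for illustration (AlphaGo’s published rule has additional scaling details):

```python
def alphago_select(children, c_puct=5.0):
    """Select the child maximizing Q + u, with u proportional to P / (1 + n).
    Q: average action value; P: prior from policy network sigma; n: visits.
    c_puct is an assumed exploration constant."""
    return max(children, key=lambda c: c['Q'] + c_puct * c['P'] / (1 + c['n']))
```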
Simulation/Evaluation Strategy
• Simulation/evaluation is initiated
• when the search reaches a state-of-nature node, or
• when the search arrives at a new decision node.
• The quality of the S/E strategy determines the quality of the decisions and
the speed of convergence.
• If the state of nature depends on previous decisions, a model should be
built to capture the dependence of the state on previous decisions.
• Historical data may depend on historical decisions.
• E.g., historical sales depend on historical prices and inventory levels … (see the sketch below)
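For instance, a simulation model might make the state of nature depend explicitly on the current decision. The sketch below assumes a hypothetical linear price-demand relationship purely for illustration:

```python
import random

def simulate_sales(price, inventory):
    """State of nature (sales) depends on the decision (price) and the
    current state (inventory); the linear demand model is an assumption."""
    rate = max(0.0, 10.0 - 0.05 * price)           # mean demand falls with price
    demand = sum(random.random() < rate / 20 for _ in range(20))  # ~Binomial draw
    return min(demand, inventory)                  # cannot sell more than stock

def rollout(inventory, periods, prices=(60, 80, 100)):
    """Finish the horizon with random prices, drawing state-dependent sales."""
    revenue = 0.0
    for _ in range(periods):
        price = random.choice(prices)
        sales = simulate_sales(price, inventory)
        revenue += price * sales
        inventory -= sales
    return revenue

print(rollout(inventory=30, periods=5))
```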
AlphaGo’s Simulation/Evaluation Strategy
• Step 1: starting from the current decision node, finish the game using policy 𝜋
• Step 2: evaluate the chance of success for the simulated move using the value network
• Step 3: calculate the weighted average of the two results and assign it as the Q value
of the simulated move (see the formula below)
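In the published AlphaGo algorithm (Silver et al., Nature 2016), this weighted average takes the form below, where $v_\theta(s_L)$ is the value-network evaluation of the leaf position, $z_L$ is the outcome of the rollout under policy $\pi$, and the paper sets the mixing parameter $\lambda = 0.5$:

```latex
V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda\, z_L
```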
Implementation: Building Search Tree
• The MCTS algorithm with the UCB1 selection strategy can be implemented
with a table, as sketched below.
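A minimal sketch of such a table in Python, with one dictionary row per node; the field names mirror the column headers in the tables below but are otherwise assumptions:

```python
import math

def new_node(index, parent, node_type, meaning):
    """One row of the search-tree table."""
    return {'index': index, 'parent': parent, 'children': [],
            'type': node_type,     # 'Decision', 'State', or 'Terminal'
            'meaning': meaning,
            'n': 0,                # number of visits
            'V': 0.0}              # reward recorded through this node

def ucb(tree, i, C=1.0):
    """UCB1 value of node i, looked up through the table."""
    node = tree[i]
    if node['n'] == 0:
        return float('inf')
    N = tree[node['parent']]['n']          # parent's visit count
    return node['V'] / node['n'] + 2 * C * math.sqrt(math.log(N) / node['n'])

# Building the first few rows of the table:
tree = {1: new_node(1, parent=0, node_type='Decision', meaning='the root node')}
for i, label in [(2, 'Option 1(a)'), (3, 'Option 1(b)'), (4, 'Option 1(c)')]:
    tree[i] = new_node(i, parent=1, node_type='State', meaning=label)
    tree[1]['children'].append(i)
```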
Implementation: Building Search Tree
Node Index | Parent Index | Child Index | Node Type | Meaning          | n | V   | UCB
1          | 0            | {2,3,4}     | Decision  | The root node    | 2 | 1/2 | -
2          | 1            | {…}         | State     | Option 1(a)      | 1 | 1/1 | 2.6651
3          | 1            | {K+1}       | State     | Option 1(b)      | 0 | 0   | ∞
4          | 1            | {…}         | State     | Option 1(c)      | 1 | 0   | 1.6651
…          |              |             |           |                  |   |     |
K+1        | 3            | {K+2,K+3}   | Decision  | State 3(a)       | 0 | 0   | ∞
K+2        | K+1          | {}          | State     | Option (K+1)(a)  | 0 | 0   | ∞
K+3        | K+1          | {}          | State     | Option (K+1)(b)  | 0 | 0   | ∞
…          |              |             |           |                  |   |     |
K′         | …            | {}          | Terminal  | Outcome          | 0 | 1   | ∞
Implementation: Building Search Tree
Node Index | Parent Index | Child Index | Node Type | Meaning          | n | V   | UCB
1          | 0            | {2,3,4}     | Decision  | The root node    | 3 | 2/3 | -
2          | 1            | {…}         | State     | Option 1(a)      | 1 | 1/1 | 3.0963
3          | 1            | {K+1}       | State     | Option 1(b)      | 1 | 1/1 | 3.0963
4          | 1            | {…}         | State     | Option 1(c)      | 1 | 0   | 2.0963
…          |              |             |           |                  |   |     |
K+1        | 3            | {K+2,K+3}   | Decision  | State 3(a)       | 1 | 1   | 1
K+2        | K+1          | {…}         | State     | Option (K+1)(a)  | 1 | 1   | 1
K+3        | K+1          | {}          | State     | Option (K+1)(b)  | 0 | 0   | ∞
…          |              |             |           |                  |   |     |
K′         | …            | {}          | Terminal  | Outcome          | 1 | 1   | 1
Implementation: Optimization
• When doing optimization:
• Select the option that maximizes the total reward/value V.
• The state of nature is realized according to the actual situation.
• If the state does not exist in the current tree, perform the MCTS algorithm in
real time and then do the optimization (see the sketch after this list).
• Note: If the initial state is uncertain, you need to train multiple trees.
• E.g., you are playing the white stones in the game of Go, so the initial state depends on Black’s first move.
• The quality of the search tree can be measured by the V values of the
child nodes of the root.
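Under the table representation sketched earlier, the optimization step might look like the following; `run_mcts` is a hypothetical handle to the tree-building routine, assumed to add the missing state:

```python
def best_option(tree, decision_index):
    """Final decision: pick the child (option) with the largest recorded V,
    as the slide prescribes (V as stored in the tree table)."""
    return max(tree[decision_index]['children'], key=lambda i: tree[i]['V'])

def realize_state(tree, option_index, observed_state, run_mcts):
    """Move to the realized state of nature. If it is not in the tree,
    perform MCTS in real time from this node, then look again
    (assumes run_mcts eventually creates the realized state)."""
    for child in tree[option_index]['children']:
        if tree[child]['meaning'] == observed_state:
            return child
    run_mcts(tree, from_index=option_index)   # hypothetical real-time search
    return realize_state(tree, option_index, observed_state, run_mcts)
```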
Carpark Revenue Management
• Company Background
• SPS is one of the largest carpark operators in Hong Kong.
• Three types of customers
• Hourly parking
• Floating monthly parking
• Reserved monthly parking
• Problem to Solve
• 50 spaces, 20 floating monthly users, 10 reserved monthly users
• Maximize revenue from hourly parking without affecting monthly customers
• When to show the “FULL” sign?
Carpark Revenue Management
• Challenges:
• Arrivals and lengths of stay for hourly customers were not recorded while the
“FULL” sign was on (i.e., the demand data are censored)
• Demand can be different for different days and different hours of a day
• Decisions made across a day are not independent. Because the length of stay is
random, accepting too many hourly users may affect monthly users several
hours later
• Solution:
• Re-generate the hourly parking interarrival times and lengths of stay to obtain
uncensored data
• Use Monte Carlo Tree Search to decide whether to show the “FULL” sign in
every 15-minute interval (a sketch of one interval follows below)
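To fix ideas, here is a hedged sketch of one 15-minute interval of the simulation; the arrival and departure counts would be drawn from the re-generated (uncensored) data, and every name and number below is an assumption for illustration:

```python
CAPACITY = 50   # total spaces in the case

def step_15min(n_monthly, n_hourly, show_full, hourly_arrivals, hourly_departures):
    """Advance the carpark by one 15-minute interval; return the new
    state plus the number of hourly users admitted in this interval."""
    n_hourly = max(n_hourly - hourly_departures, 0)
    admitted = 0
    if not show_full:
        free = max(CAPACITY - n_monthly - n_hourly, 0)
        admitted = min(hourly_arrivals, free)   # admit only while space remains
        n_hourly += admitted
    return n_monthly, n_hourly, admitted

# Example: 5 monthly and 19 hourly users, sign off, 2 arrivals, 0 departures
print(step_15min(5, 19, False, 2, 0))   # -> (5, 21, 2)
```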
Carpark Revenue Management
• Re-generated data files: Hourly_ReGen.csv and MonthlyFloating_ReGen.csv
[Figure: screenshots of the two data files]
Carpark Revenue Management
[Code screenshot: a pointer is used to find the appropriate day in the data.]
Carpark Revenue Management
[Code screenshot: check whether the current state has been realized before; if yes, go to that node, otherwise build a new decision node (a sketch follows below).]
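A minimal sketch of that lookup-or-create step, reusing the hypothetical `new_node` row format from the table sketch earlier:

```python
def find_or_create(tree, parent_index, state, next_index):
    """If this state of nature was realized before, go to its node;
    otherwise build a new decision node for it in the table.
    (new_node is the assumed row constructor defined earlier.)"""
    for child in tree[parent_index]['children']:
        if tree[child]['meaning'] == state:
            return child, next_index                  # seen before: reuse it
    tree[next_index] = new_node(next_index, parent_index, 'Decision', state)
    tree[parent_index]['children'].append(next_index)
    return next_index, next_index + 1                 # new node created
```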
Carpark Revenue Management
• The tree after 1000 days of simulation: [Figure: visualization of the resulting search tree]
• A very large number of simulations is needed to explore the deeper levels of the tree.
In-Class Exercise
• Consider this tree built by the MCTS algorithm for the carpark problem. It is
now 02:30:00 pm, and there are 5 monthly users and 19 hourly users in the
carpark. For the next 15 minutes, there will be 2 hourly users arriving and 0
leaving. In addition, there will be 1 monthly …
[Figure: the current node has state (nM=5, nH=19) at time 02:30:00 pm and two options, “Available” (V=4.5) and “Full” (V=5.0); at time 02:45:00 pm the child states are (5,19), (6,20), (6,21) under “Available” and (5,19), (5,18), (6,19) under “Full”.]