A Tutorial on Google AlphaGo

Bo An
18/3/2016

Slides based on:

• Silver, David, et al. “Mastering the game of Go with deep neural networks and tree search.” Nature 529.7587 (2016): 484-489.
• Shane Moon’s presentation
Games in AI

• Ideal test bed for AI research
  - Clear results
  - Clear motivation
  - Good challenge
• Success with search-based approaches
  - Chess (1997, Deep Blue) and others
• Not successful in the game of Go
  - Go is to Chess as Poetry is to Double-entry accounting
  - It goes to the core of artificial intelligence, which involves the study of learning and decision-making, strategic thinking, knowledge representation, pattern recognition and, perhaps most intriguingly, intuition

2 /27
The game of Go

• A 4,000-year-old board game from China
• Standard board size: 19×19
• Two players, Black and White, place stones in turns
• Stones cannot be moved, but can be captured and taken off the board
• The player with the larger territory wins

3 /27
AlphaGo vs European Champion (Fan Hui, 2-dan rank)

October 5–9, 2015

- Time limit: 1 hour
- AlphaGo wins (5:0)
4 /27
AlphaGo vs World Champion (Lee Sedol, 9-dan)

March 9–15, 2016

• Time limit: 2 hours
• Venue: Four Seasons Hotel, Seoul
• AlphaGo wins (4:1)

5 /27
Computer Go AI - Definition

[Figure: the Computer Go AI takes a state s and outputs an action a, leading to the next state s']

• s (state): e.g. the board can be represented in a matrix-like form
• a (action): the next stone to place
• Given s, pick the best a

6 /27
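The slide's "matrix-like form" can be made concrete with a short sketch. Below is a minimal illustration in NumPy, assuming a simple encoding (0 = empty, +1 = Black, -1 = White); AlphaGo's actual input uses many richer feature planes, so this is only an illustration of the state representation idea.

```python
import numpy as np

BOARD_SIZE = 19  # standard Go board

def empty_board():
    """A 19x19 matrix representing the state s: 0 = empty, +1 = Black, -1 = White."""
    return np.zeros((BOARD_SIZE, BOARD_SIZE), dtype=np.int8)

def play(state, row, col, player):
    """Return the next state s' after `player` (+1 or -1) places a stone at (row, col)."""
    assert state[row, col] == 0, "point already occupied"
    next_state = state.copy()
    next_state[row, col] = player
    return next_state

s = empty_board()
s = play(s, 3, 3, +1)    # Black plays
s = play(s, 15, 15, -1)  # White responds
```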
Computer Go AI – An Implementation Idea?

[Figure: a game tree expanded from d = 1 to d = max, where max D ~ 19×19 = 361; candidate next moves are annotated with example win counts of 13, 37,839 and 431,320]

• How about simulating all possible board positions?
• Process each simulation until the game ends, then report the win/lose result
  - e.g. it wins 13 times if the next stone gets placed here
• Choose the “next action / stone” that has the most win-counts in the full-scale simulation (a sketch of this idea follows below)
• Can also apply the Minimax search learned in CZ3005

7 /27
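As referenced above, here is a minimal Python sketch of that brute-force idea: for each legal next move, run random playouts to the end of the game, count wins, and pick the move with the most wins. The helpers `legal_moves(state)` and `random_playout(state, move)` are hypothetical callables supplied by the caller, not part of AlphaGo.

```python
from collections import Counter

def choose_move_by_simulation(state, legal_moves, random_playout, playouts_per_move=100):
    """Pick the next move with the most wins over full random playouts.

    `legal_moves(state)` lists candidate moves; `random_playout(state, move)` is a
    hypothetical helper that plays a random game to the end and returns +1 if the
    player to move wins, -1 otherwise.
    """
    wins = Counter()
    for move in legal_moves(state):
        wins[move] += 0  # make sure every candidate has an entry
        for _ in range(playouts_per_move):
            if random_playout(state, move) == +1:
                wins[move] += 1
    return max(wins, key=wins.get)
```

Even with fast playouts, doing this exhaustively over a 361-wide, hundreds-of-moves-deep tree is hopeless, which is exactly the search-space problem the next slides address.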
8 /27
AlphaGo: Key Ideas

• Objective: Reduce the search space without sacrificing quality
• Key Idea 1: Take advantage of top human players' data
  - Deep learning
• Key Idea 2: Self-play
  - Reinforcement learning
• Key Idea 3: Looking ahead
  - Monte Carlo tree search
We learned Minimax search with evaluation functions

9 /27
Reducing Search Space

1. Reducing “action candidates” (Breadth Reduction)
   If there is a model that can tell you that certain moves are not common / probable (e.g. among experts), remove those candidates from the search in advance.

2. Position evaluation ahead of time (Depth Reduction)
   Instead of simulating until the maximum depth, use a function V(s) that measures the “board evaluation of state s” (win? lose?).

[Figure: a game tree from d = 1 to d = maxD, with improbable branches pruned and leaves scored as V = 1, V = 2, V = 10]

10 /27
1. Reducing “action candidates”

Learning: P(next action | current state) = P(a | s)

11 /27
1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

[Figure: a prediction model maps each current state to the expert's next state: s1 → s2, s2 → s3, s3 → s4]

Data: online Go experts (5–9 dan)
- 160K games, 30M board positions

12 /27
1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

[Figure: a prediction model maps the current board s to the next action a, i.e. f: s → a]

There are 19×19 = 361 possible actions (with different probabilities).

13 /27
1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

[Figure: a deep learning prediction model (13-layer CNN) maps the current board s to a distribution p(a|s), i.e. g: s → p(a|s); the move played is argmax_a p(a|s)]

14 /27
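To make the 13-layer CNN idea concrete, here is a minimal PyTorch sketch of a convolutional policy network that outputs p(a|s) over the 361 board points. The single input plane and 64 filters are simplifying assumptions for illustration; the network in the paper uses many input feature planes and more filters.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Toy convolutional policy network: board in, p(a|s) over 361 points out."""
    def __init__(self, planes=1, filters=64):
        super().__init__()
        layers = [nn.Conv2d(planes, filters, 5, padding=2), nn.ReLU()]
        for _ in range(11):                      # 11 hidden 3x3 conv layers
            layers += [nn.Conv2d(filters, filters, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(filters, 1, 1)]     # 1x1 conv -> one logit per board point
        self.body = nn.Sequential(*layers)       # 13 conv layers in total

    def forward(self, s):                        # s: (batch, planes, 19, 19)
        logits = self.body(s).flatten(1)         # (batch, 361)
        return torch.softmax(logits, dim=1)      # p(a|s)

net = PolicyNet()
p = net(torch.zeros(1, 1, 19, 19))               # move probabilities for an empty board
best_move = p.argmax(dim=1)                      # argmax_a p(a|s)
```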
Convolutional Neural Network (CNN)
Go: abstraction is the key to win
CNN: abstraction is its forte

15 /27
1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)

[Figure: the Expert Moves Imitator Model (w/ CNN) maps the current board to the next action]

Training:

16 /27
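A hedged sketch of the supervised training step: maximize log p(a|s) for the move the expert actually played, which amounts to ordinary cross-entropy over the 361-way move classification. `expert_loader` is a hypothetical DataLoader yielding (board tensor, expert move index) pairs; `net` is the PolicyNet sketched above.

```python
import torch
import torch.nn.functional as F

def train_supervised(net, expert_loader, lr=0.003, epochs=1):
    """Fit the policy network to imitate expert moves by maximizing log p(a*|s)."""
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(epochs):
        for boards, expert_moves in expert_loader:   # expert_moves: indices in 0..360
            probs = net(boards)                      # p(a|s), shape (batch, 361)
            loss = F.nll_loss(torch.log(probs + 1e-12), expert_moves)  # cross-entropy
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```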
1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Improving by playing against itself:

[Figure: Expert Moves Imitator Model (w/ CNN) vs. Expert Moves Imitator Model (w/ CNN)]

Return: board positions, win/lose info

17 /27
1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

[Figure: the Expert Moves Imitator Model (w/ CNN) plays games against itself; each board position is labelled with the final outcome: win → z = +1, loss → z = -1]

Training:

18 /27
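A hedged sketch of the reinforcement-learning update: after a self-play game ends with outcome z (+1 for a win, -1 for a loss), adjust the weights by policy-gradient ascent so that the moves of won games become more likely and the moves of lost games less likely. `game_states` and `game_moves` are hypothetical tensors holding one finished game; `net` is the PolicyNet sketched earlier.

```python
import torch

def reinforce_update(net, optimizer, game_states, game_moves, z):
    """One policy-gradient step on a finished self-play game with outcome z (+1 / -1)."""
    probs = net(game_states)                           # p(a|s) for every position in the game
    chosen = probs.gather(1, game_moves.unsqueeze(1))  # p(a_t|s_t) for the moves actually played
    log_p = torch.log(chosen + 1e-12)
    loss = -(z * log_p).mean()                         # gradient ascent on z * log p(a|s)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```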
1. Reducing “action candidates”
(2) Improving through self-play (reinforcement learning)

Older models vs. newer models:

[Figure: the Expert Moves Imitator Model (ver 1.1) plays against progressively updated models (ver 1.3, 1.5, 1.7, 2.0, ... up to ver 1,000,000)]

The updated model uses the same topology as the expert-moves imitator model, just with updated parameters.

Return: board positions, win/lose info

The final model wins 80% of the time when playing against the first model.

19 /27
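One common way to organize the "older models vs. newer models" matches is to keep a pool of earlier checkpoints and sample an opponent for each self-play game. The sketch below is illustrative only; the pool size and snapshot policy are assumptions, not values from the paper.

```python
import copy
import random

class OpponentPool:
    """Keep frozen snapshots of earlier policy networks to play against."""
    def __init__(self, first_model, max_size=50):
        self.pool = [copy.deepcopy(first_model)]
        self.max_size = max_size

    def snapshot(self, model):
        """Freeze the current model as a future opponent, dropping the oldest if full."""
        self.pool.append(copy.deepcopy(model))
        if len(self.pool) > self.max_size:
            self.pool.pop(0)

    def sample_opponent(self):
        """Pick a random earlier version for the next self-play game."""
        return random.choice(self.pool)
```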
2. Board Evaluation

• Adds a regression layer to the model
• Predicts values between 0 and 1
  - Close to 1: a good board position
  - Close to 0: a bad board position

[Figure: the updated model (ver 1,000,000) feeds a value prediction model (regression) that maps a board position to a win/lose label (0 / 1)]

Training:
20 /27
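A minimal PyTorch sketch of such a board-evaluation ("value") network: a small convolutional trunk with a regression head squashed to the 0–1 range, trained against win/lose labels. Layer sizes here are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueNet(nn.Module):
    """Toy value network: board in, V(s) in (0, 1) out (close to 1 = good, 0 = bad)."""
    def __init__(self, planes=1, filters=64, board=19):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(planes, filters, 3, padding=1), nn.ReLU(),
            nn.Conv2d(filters, filters, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(filters * board * board, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),       # regression output in (0, 1)
        )

    def forward(self, s):                          # s: (batch, planes, 19, 19)
        return self.head(self.trunk(s)).squeeze(1)

def value_loss(v_pred, won):
    """Mean squared error against the game outcome (1.0 for a win, 0.0 for a loss)."""
    return F.mse_loss(v_pred, won)
```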
Reducing Search Space

1. Reducing “action candidates” (Breadth Reduction)
   → Policy Network

2. Board Evaluation (Depth Reduction)
   → Value Network

21 /27
Looking Ahead (Monte Carlo Tree Search)

22 /27
Looking Ahead (Monte Carlo Tree Search)

23 /27
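The two figure slides above illustrate the lookahead step. A hedged sketch of how the pieces fit together: a Monte Carlo tree search whose move selection is guided by the policy network's priors and whose leaf positions are scored by the value network. AlphaGo additionally mixes in fast rollout evaluations and handles the sign flip between the two players; both are omitted here for brevity. `policy(state)`, `value(state)` and `step(state, move)` are hypothetical callables wrapping the networks and the game rules.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a): prior from the policy network
        self.visits = 0           # N(s, a)
        self.value_sum = 0.0      # W(s, a)
        self.children = {}        # move -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=1.0):
    """Pick the child maximizing Q + U, where U favours high-prior, rarely-visited moves."""
    total = sum(child.visits for child in node.children.values())
    def score(item):
        _, child = item
        u = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visits)
        return child.q() + u
    return max(node.children.items(), key=score)

def mcts(root_state, policy, value, step, simulations=200):
    """Run a simplified policy- and value-guided tree search and return the chosen move."""
    root = Node(prior=1.0)
    root.children = {m: Node(p) for m, p in policy(root_state).items()}
    for _ in range(simulations):
        node, state, path = root, root_state, []
        while node.children:                              # selection
            move, node = select_child(node)
            state = step(state, move)
            path.append(node)
        node.children = {m: Node(p) for m, p in policy(state).items()}  # expansion
        leaf_value = value(state)                         # evaluation by the value network
        for visited in path:                              # backup along the visited path
            visited.visits += 1
            visited.value_sum += leaf_value
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]  # most-visited move
```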
Results

24 /27
AlphaGo

Lee Sedol (9-dan) vs AlphaGo: Energy Consumption

A very, very tough calculation ;)

25 /27
AlphaGo
Taking CPU / GPU resources to virtually infinity

26 /27
Discussions

• AlphaGo's weakness
  - Make the state complicated
  - ...
• What is the next step?
  - Poker, Mahjong

                              Chess           RoboCup
  Environment                 Static          Dynamic
  State change                Turn taking     Real time
  Information accessibility   Complete        Incomplete
  Sensor readings             Symbolic        Non-symbolic
  Control                     Central         Distributed

27 /27
