Report
Rasmus Haugaard
1 Introduction
As long as there have been games, there has been an interest in autonomous
players (AIs). One way to avoid handcrafting AIs could be through Evolution-
ary Algorithms (EA) where a population of agents compete and undergo some
combination of exploitation by natural selection and exploration by introducing
variation. In this work, multiple Ludo players will be evolved through EA and compared to both random and handcrafted players.
The provided Ludo game has been ported to Python by Haukur Kristinsson, https://github.com/haukri/python-ludo. I have continued the work to make it align with the C++ version and packaged it as a Python package for modularity, https://github.com/RasmusHaugaard/pyludo. The motivation behind the re-implementation in Python was faster development.
The agents in this work are thus also implemented in Python. No EA package is used. All methods are implemented using the Python standard library and numpy. The implementation will be available at https://github.com/RasmusHaugaard/pyludo-ai.
2 Methods
The methods developed in this work are based on scoring the possible actions by a relative value function with some parameters, where the function is defined by the state representation, and the parameters are found with EA. More specifically, an agent policy, π, is defined by a value function, V, depending on the agent chromosome, θ, and the agent action representation, A, which can depend on the current state and the next state:

π(s) = argmax_a V(θ, A(s, s'(s, a))),   (1)

where s is the current state, a is the action, and s' is the next state defined by a known model function of s and a. Note that the next state is the immediate next state, before the next player makes its move. Also note that the state representation is then fully defined by the action representations for the four actions. Since only Equation 1 is examined, the full state representation will not be referred to, and for simplicity, the action representation A(s, s'(s, a)) will simply be denoted A from now on.
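As an illustration, a minimal Python sketch of the policy in Equation 1 could look as follows. The helper functions action_repr and next_state are hypothetical stand-ins for A and the model function s'(s, a), which are not listed in this report.

import numpy as np

def make_policy(V, theta, action_repr, next_state):
    # Returns a policy pi(s) that scores every legal action with the value
    # function V(theta, A) and greedily picks the best one (Equation 1).
    def policy(state, legal_actions):
        scores = [V(theta, action_repr(state, next_state(state, a)))
                  for a in legal_actions]
        return legal_actions[int(np.argmax(scores))]
    return policy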
Simple The simple action representation, A_sim, has four binary components. 1: Whether a player token was moved from home onto the board, 2: whether a player token entered the safe end zone, 3: whether a player token entered goal, and 4: whether an opponent token was sent home.
The corresponding value function is a chromosome-weighted sum of the action representation components:

V_sim(θ, A_sim) = Σ_{i=1}^{4} θ_i A_sim,i   (2)
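In code, V_sim is simply a dot product between the chromosome and the binary action components. The weights below are illustrative values, not evolved ones.

import numpy as np

def V_sim(theta, A_sim):
    # Chromosome-weighted sum of the four binary action components (Equation 2).
    return float(np.dot(theta, A_sim))

# Illustrative example: moving a token out of home while sending an opponent home.
theta = np.array([0.4, 0.2, 0.3, 0.5])
A_sim = np.array([1, 0, 0, 1])
print(V_sim(theta, A_sim))  # 0.9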
In general, a value-based policy selects the action that maximizes the expected value of the next state,

π(s) = argmax_a Σ_{s'} P(s' | s, a) V'(s'),   (3)

where a is the action, s is the current state, s' is the next state, P denotes the environment model, i.e. the probability of ending up in state s' given the current state and action, and V' denotes the expected return given a certain state.
In the case of Ludo, since the immediate state transition is deterministic, Equation 3 boils down to

π(s) = argmax_a V'(s'(s, a)),   (4)

where s' is the known model function. Note that s' is not the next observable state like it would be in reinforcement learning, but rather the immediate state after an action is taken, before the next player makes its move. Also note that V' does not have to actually estimate the expected return, but should just be able to relatively score states.
The following two players only try to relatively score the next immediate states and thus do not look at the current state.
Advanced The advanced player assigns a value to each token for both the player and the opponents. Each token is given a score based on four properties. 1: Whether it is on the board (not home), 2: a normalized (from 0 to 1) distance it has traveled along the common path, 3: whether it is in the end zone, and 4: whether it is in the goal position. The value assigned to each property of the tokens is decided by the EA parameters, θ_1..θ_4, and is shared among the player and opponents. All token values are then weighted according to the probability, p, of no opponent being able to hit the token home before the player's next turn. p is approximated by looking at how many, N, opponent tokens could possibly hit the token, and then setting p̂ = (5/6)^N. The three opponents' summed token values are weighted per opponent by θ_5, θ_6, and θ_7 and subtracted from the summed player token values, which yields the final action score.
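A sketch of this scoring in Python is given below. The array shapes, and the assumption that the survival probability p̂ is applied to the opponent tokens as well, reflect one possible reading of the description above rather than the exact implementation.

import numpy as np

def advanced_score(theta, player_props, player_threats, opp_props, opp_threats):
    # player_props: (4, 4) array, one row per player token with the four
    # properties (on board, normalized distance, in end zone, in goal).
    # player_threats: (4,) number of opponent tokens that could hit each token.
    # opp_props: (3, 4, 4) and opp_threats: (3, 4) for the three opponents.
    w_prop = theta[:4]   # shared property weights, theta_1..theta_4
    w_opp = theta[4:7]   # per-opponent weights, theta_5..theta_7

    def summed_token_values(props, threats):
        values = props @ w_prop          # per-token property score
        p_hat = (5 / 6) ** threats       # probability of not being hit home
        return np.sum(values * p_hat)

    score = summed_token_values(player_props, player_threats)
    for k in range(3):                   # subtract weighted opponent values
        score -= w_opp[k] * summed_token_values(opp_props[k], opp_threats[k])
    return score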
Full Both the simple and especially the advanced player require a lot of domain knowledge. The assumptions that the embedded domain knowledge makes might actually inhibit player performance. In many problems other than Ludo, obtaining this domain knowledge might be difficult. It is interesting to examine how well an action representation with no domain knowledge is able to perform.
Both the simple and the advanced player only have between 4 and 7 genes, which could very likely be optimized manually with no algorithm.
The full player has no embedded domain knowledge, only looks at the immediate next states s', and since it looks at a full state representation, there are many parameters.
A_full is a 4×59 array, describing for each of the four players how many of its tokens are in each position, where the token positions have been mapped from -1, 0 .. 56, 99 to 1, 2 .. 59. An action score is then determined by feeding A_full through a neural network with parameters found by EA. The network has one hidden layer of 100 neurons and one output neuron. Tanh is used as the hidden layer activation to enable non-linearity in the learned value estimate function. With a bias neuron added to the input, there are a total of (4 · 59 + 1) · 100 + 100 · 1 = 23,800 parameters or genes.
With back-propagation in machine learning, it is desirable to have the neurons in the active region of the activation function to avoid diminishing gradients. In the case of EA there are no gradients, but it is still desirable to have the neurons near the active region, so that changes in the parameters affect the player policy. To be in the active region of tanh, the weights feeding a neural network layer would often be initialized with a standard deviation of √(1/N) [Goodfellow et al., 2016], where N is the number of input neurons to the layer. As described later in subsection 2.2, for some of the methods in this work, constant mutation strengths are used, and thus the genes should preferably be initialized in the same range for all representations. Instead of initializing the genes differently, the hidden layer outputs are multiplied by √(1/N) before the activation is applied, which is equivalent but allows the genes to be in the same dynamic range as for the other representations.
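A possible numpy implementation of this network is sketched below. The exact gene layout, i.e. how θ is split into the two weight matrices, is an assumption, as it is not specified in the text.

import numpy as np

def full_value(theta, A_full):
    # A_full: 4x59 array of token counts per player and mapped position.
    x = np.append(A_full.reshape(-1), 1.0)        # 4*59 inputs plus a bias neuron
    n_in, n_hidden = x.size, 100                  # 237 inputs, 100 hidden neurons
    W1 = theta[:n_in * n_hidden].reshape(n_hidden, n_in)
    W2 = theta[n_in * n_hidden:].reshape(1, n_hidden)
    h = np.tanh((W1 @ x) * np.sqrt(1.0 / n_in))   # scaling keeps tanh near its active region
    return (W2 @ h).item()

# (4*59 + 1)*100 + 100*1 = 23,800 genes in total
theta = np.random.randn(23800)
print(full_value(theta, np.zeros((4, 59))))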
A constant mutation strength also risks causing the population to settle on a local maximum. For this reason, two other mutation methods are examined,
namely the Real Adaptive One Step Mutation, where σ is stored inside the
agent chromosome during EA, so that the EA itself finds an appropriate σ.
Realizing that different genes might have different appropriate σ’s, the Real
Adaptive N Step Mutation is also examined, where a σ per gene is stored within
the chromosome. The adaptive methods only have one or two learning rates, which are set by the user and define how quickly σ can change. The learning rates are all set to 0.1 in this work.
The populations need to be initialized. All genes are initially sampled from
a standard normal distribution. All adaptive sigmas are initially sampled from
a standard lognormal distribution.
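A sketch of the N-step variant is shown below. The update follows the standard self-adaptive scheme described by Eiben and Smith (2015); whether this work uses exactly this formulation is an assumption, but both learning rates are set to 0.1 as stated above.

import numpy as np

def n_step_mutate(genes, sigmas, tau=0.1, tau_prime=0.1, rng=None):
    # Real Adaptive N Step Mutation: one sigma per gene is stored in the
    # chromosome; the sigmas are mutated log-normally before mutating the genes.
    rng = rng or np.random.default_rng()
    common = tau_prime * rng.standard_normal()                  # shared factor
    new_sigmas = sigmas * np.exp(common + tau * rng.standard_normal(sigmas.shape))
    new_genes = genes + new_sigmas * rng.standard_normal(genes.shape)
    return new_genes, new_sigmas

# Initialization as described above: standard normal genes, standard lognormal sigmas.
rng = np.random.default_rng()
genes, sigmas = rng.standard_normal(7), rng.lognormal(size=7)
genes, sigmas = n_step_mutate(genes, sigmas, rng=rng)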
From each final population, the best non-random player is chosen. The final agents represent the players and are used for the player comparisons in section 3.
3 Results
First, the results are considered from a statistical perspective; then the players, the evolutionary algorithm methods, and the results are discussed. This work covers many players and methods, so not all combinations and findings can be covered.
Table 1: Win rates in percent for players (rows) against opponents (columns). The win rates are determined by 2500-game tournaments with two instances of both the player and the opponent. The best player for a given opponent is marked in bold. *Results that are statistically insignificant at a significance level of 5%. The diagonal is kept as validation of the evaluation method.
Table 2: Win rates in percent for players (rows) against opponents (columns). The win rates are determined by having one instance of the player and three of the opponent in a tournament of 2500 games. *Results that are statistically insignificant at a significance level of 5%.
Statistical significance Considering the sampled win rate for a player, each game outcome, whether the player wins or not, can be thought of as being drawn from a binary random variable with µ = p, where p is the true player win rate. According to the Central Limit Theorem, the average of a sequence of independent random variables drawn from the same distribution is approximately normally distributed when the sequence size, n, is large. The estimated win rate is thus approximately normally distributed, p̂ ∼ N(p, σ²), σ² = p(1 − p)/n. The estimated variance is s² = p̂(1 − p̂)/n, and two one-sided tests can be done to test whether the player is significantly better or worse than the opponent. The null hypothesis is that p = 0.5 or p = 0.25 for Table 1 and Table 2, respectively. The normalized observed difference from the null hypothesis, z, is:
z = (p̂ − 0.5)/s  ∨  z = (p̂ − 0.25)/s
With a 5% significance level for each one-sided test, the z-thresholds are ±1.645. By solving for p̂, the results in Table 1 are statistically significant if p̂ ≤ 48.3% ∨ 51.7% ≤ p̂, and the results in Table 2 are statistically significant if p̂ ≤ 23.6% ∨ 26.5% ≤ p̂. The results that are statistically insignificant have been marked with a star for convenience.
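The test itself is only a few lines of Python. The observed 54.2% win rate used in the example below is the result quoted in the conclusion; the function is a sketch of the z-statistic defined above.

import numpy as np

def win_rate_z(p_hat, n, p_null):
    # Normalized difference between observed and null win rate, z = (p_hat - p_null) / s.
    s = np.sqrt(p_hat * (1 - p_hat) / n)   # estimated std deviation of p_hat
    return (p_hat - p_null) / s

# e.g. a 54.2% win rate over 2500 games against a 50% null hypothesis (Table 1 setting)
z = win_rate_z(0.542, 2500, 0.5)
print(z, abs(z) > 1.645)                   # significant if |z| exceeds 1.645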
Chosen player populations The agent chosen to represent the simple player is from generation 100 in Figure 1 with σ = 0.1. Generation 20 might look slightly better, but a z-test reveals the difference is not significant at a 5% significance level. The agent chosen to represent the advanced player is from generation 55 in Figure 2. The agent chosen to represent the full player is from the last generation of the population with size 160 in Figure 3a.
seems to sample the space evenly, but the end population has converged tightly,
which could hint that it would be more prone to settling on a local maximum for
harder problems. Blend recombination seems to accommodate the issues with
both diversity and even search.
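For reference, the two recombination operators compared here can be sketched as below, following their standard formulations (Eiben and Smith, 2015); the α values are assumptions, since the report does not state them.

import numpy as np

def whole_arithmetic(x, y, alpha=0.5):
    # Whole arithmetic recombination: every child gene is the same weighted
    # average of the two parent genes (alpha = 0.5 gives the plain mean).
    return alpha * x + (1 - alpha) * y

def blend(x, y, alpha=0.5, rng=None):
    # Blend crossover (BLX-alpha): each child gene is drawn uniformly from an
    # interval spanning the two parent genes, widened by alpha on each side.
    rng = rng or np.random.default_rng()
    lo, hi = np.minimum(x, y), np.maximum(x, y)
    d = hi - lo
    return rng.uniform(lo - alpha * d, hi + alpha * d)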
Figure 5 shows a recombination test on the full player. No apparent significant difference was seen between the recombination methods. Letting them run for longer and on bigger populations could potentially have shown differences.
Fig. 1: Evolving the simple player with tournament selection, population size of 20, 10 games per tournament, no recombination, and normal mutation with σ equal to 0.01, 0.1, and 1, respectively. Generation number along the first axis and win rate along the second axis. Population mean and std deviation are plotted.
Fig. 2: Evolution of the advanced player with tournament selection, population size of 80, 10 games per tournament, whole arithmetic recombination, and normal mutation with σ = 0.1. Generation number along the first axis and win rate along the second axis. Agent win rates are scatter plotted. Population mean is plotted.
5 Conclusion
Three players have been evolved using different evolutionary methods. One of them has no embedded domain knowledge but is actually better than the player with the most domain knowledge, winning 54.2% of 2500 Ludo games with two instances of each player. The player with the most domain knowledge wins 78.2% against random players, while the player with no domain knowledge wins 77.4%.
Fig. 3: Population mean win rates while evolving the full player with tournament
selection, 10 games per tournament, whole recombination, normal mutation with
σ = 0.1, and varying population sizes from 20 to 160. a) Generation number
along the first axis. b) Total number of games played along the first axis.
Fig. 5: Evolving the full player with tournament selection, population size of 20,
10 games per tournament, normal mutation with σ equal to 0.01, and different
recombination methods. Generation number along the first axis and win rate
along the second axis. Population mean is plotted.
6 Acknowledgements
Thanks to Haukur Kristinsson for the work on the Python version of Ludo and to Christian Quist Nielsen for letting me use his evolved players in this report.
References
Eiben, A. E., Smith, J. E.: Introduction to Evolutionary Computing. Springer, ISBN 978-3-662-44874-8 (2015)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, http://www.deeplearningbook.org (2016)