The State of Machine Learning and Artificial Intelligence
In Computer Games
John Murphy
[email protected]
June 12, 2009
Abstract
The use of machine learning in computer games is becoming an area of interest in both
academic circles and the mainstream games industry. This has led to experimental
combinations of machine learning and computer games, both as academic and commercial
projects. In this paper, a survey of some of these efforts is presented. The goal of the paper is
to give the reader an impression of the increased interest in and efforts toward integrating
artificial intelligence, machine learning techniques, and computer game development.
1 Introduction
Discussions of artificial intelligence in games often involve semantic arguments concerning the
definition of game AI. Members of the games industry tend to define game AI as any technique
that contributes to the perceived intelligence of an entity [1]. This can include finite state
machines and behavior trees that run predetermined scripts, which by most standards is not true
artificial intelligence. Game AI can also be defined as techniques that generate artifacts that
would normally be created by game developers. This includes the procedural generation of
cities, skeletons, and crowds in games like Grand Theft Auto IV and Assassin’s Creed.
Historically, the gap between what is generally accepted as AI and what the games
industry calls AI has been quite wide. A type of AI that has been especially lacking in game AI
is dynamic synthetic agents that can experience and learn from their environment. Historically,
machine learning (ML) has not been applied to computer games. However, this has begun to
change in recent years.
The academic community, many of whom are familiar with machine learning concepts,
has become more interested in computer games in recent years. Researchers have applied their
knowledge of ML by modifying or mimicking commercial games. Likewise, as games have
become more sophisticated and developers look for new ways to innovate, the industry has
begun to experiment with the use of machine learning algorithms in various areas of game
development. This paper describes in detail a few examples of these recent applications of
machine learning in computer games in an attempt to show the growing interest in and use of
machine learning in game development.
2 Player Modeling with Neural Networks in First Person Shooters
It has been shown that AI behaviors can be taught to First Person Shooter (FPS) agents by
training artificial neural networks via player modeling [2]. An example of this is the work of
Ben Geisler at Radical Entertainment. Neural networks are a good fit for this type of learning
problem. It can be difficult to determine explicitly why a FPS player takes a given action. Even
expert FPS players can’t completely describe the tactics that they use. The behavior of FPS
players can often only be explained by example, as the exact relationships between input and
output are not easy to determine. A neural network fits this situation well because it takes
examples of input/output pairs and adjusts its internal structure in a way that approximates the
implicit relationships present in the examples.
Geisler’s group used a modified version of the FPS game Soldier of Fortune 2. They
altered an agent such that several basic movement decisions were controlled by an artificial
neural network. They first decided which actions they wanted to have the network control,
which would be used as the outputs for the network. They decided to use four basic movement
actions: accelerate/decelerate, move forward/backward, face direction, and jump/don’t jump.
They then decided what features were to be used as input values to represent the environment
in which the agent was to act.
The game environment was broken down into four sectors around the agent. The
number of enemies in each sector was taken as an input. The health of the agent at a given time
step was also used as input. The game to be played was capture the flag, where a player or
agent is to take the opposing player or team’s flag and return it to their own base. Therefore,
the distance to the nearest goal and its location sector relative to the agent were also used as
inputs, as well as several other features of the agent’s relationship to the environment and other
agents.
The data to be used in training the network was collected by recording the actions of an
actual player. An expert player’s actions were recorded while playing a game of capture the
flag with thirteen other agents. Vector math was used to record the locations of all other agents
relative to the player. Ten thousand training examples were collected by observing one player
for 50 minutes. The basic back-propagation algorithm was used to train an ANN with ten
hidden nodes. In order to avoid overfitting, after every five epochs of training the error rate
was measured on a validation set held out from the data. If the validation error rate had not
improved relative to the error rate from the previous validation check, training was stopped
and the network weights from the previous validation step were used.
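As a rough illustration of this training scheme (a sketch, not Geisler's actual code), the following trains a single-hidden-layer network with ten hidden nodes by basic back-propagation and checks a held-out validation set every five epochs for early stopping. The feature layout, synthetic data, and all hyperparameters are illustrative assumptions.

# Minimal back-propagation with validation-based early stopping (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class SimpleANN:
    def __init__(self, n_in, n_hidden=10, n_out=4, lr=0.1):
        self.W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
        self.W2 = rng.normal(scale=0.1, size=(n_hidden, n_out))
        self.lr = lr

    def forward(self, X):
        self.h = sigmoid(X @ self.W1)      # hidden activations
        return sigmoid(self.h @ self.W2)   # output activations

    def train_epoch(self, X, Y):
        out = self.forward(X)
        # Back-propagate the squared error through both layers.
        d_out = (out - Y) * out * (1 - out)
        d_hid = (d_out @ self.W2.T) * self.h * (1 - self.h)
        self.W2 -= self.lr * self.h.T @ d_out / len(X)
        self.W1 -= self.lr * X.T @ d_hid / len(X)

    def error(self, X, Y):
        return np.mean((self.forward(X) - Y) ** 2)

# Synthetic stand-ins for the ~10,000 recorded player examples: a simplified
# 7-feature input (enemy counts in four sectors, health, goal distance, goal
# sector) and four movement decisions encoded as 0/1 targets (a simplification).
X = rng.random((10000, 7))
Y = (rng.random((10000, 4)) > 0.5).astype(float)
X_train, Y_train = X[:8000], Y[:8000]
X_val, Y_val = X[8000:], Y[8000:]

net = SimpleANN(n_in=7)
best_err, best_weights = np.inf, None
for epoch in range(1, 201):
    net.train_epoch(X_train, Y_train)
    if epoch % 5 == 0:                      # validate every five epochs
        err = net.error(X_val, Y_val)
        if err >= best_err:                 # validation error no longer improving
            net.W1, net.W2 = best_weights   # roll back to previous validation weights
            break
        best_err, best_weights = err, (net.W1.copy(), net.W2.copy())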
In order to further improve the error rates for the various actions, two ensemble
methods were used. An ensemble combines a collection of models so that, together, they
perform better than any single model. The two methods used in this project were boosting and
bagging. Boosting builds an ensemble by training each successive model to emphasize the
training instances that the previous models classified incorrectly. Bagging trains each model
on a different bootstrap sample of the training data and has the models vote; the classification
with the most votes is then used.
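A minimal sketch of the bagging side of this scheme is shown below. The train_model function is a placeholder standing in for whatever base learner is actually used (such as the network above); each model is trained on a bootstrap sample and predictions are made by majority vote.

# Bagging by bootstrap sampling and majority voting (illustrative sketch).
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def train_model(X, y):
    # Placeholder "model": simply remembers the majority class of its bootstrap sample.
    majority = Counter(y).most_common(1)[0][0]
    return lambda x: majority

def bagging_ensemble(X, y, n_models=10):
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample (with replacement)
        models.append(train_model(X[idx], y[idx]))
    return models

def vote(models, x):
    votes = Counter(m(x) for m in models)            # each model casts one vote
    return votes.most_common(1)[0][0]                # classification with most votes wins

# Example with synthetic data standing in for the recorded player examples.
X = rng.random((100, 7))
y = rng.integers(0, 4, size=100)
models = bagging_ensemble(X, y)
print(vote(models, X[0]))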
The agent trained by the expert player achieved accuracies around 95% for all actions
for which it was trained. With ensemble methods the agent was accurate enough for mistakes
to generally seem like human mistakes as opposed to obvious AI faults. However, the learned
behavior of the agent was still used within a static, finite state system at run time, so no
further training occurs while the game is being played. Game developers could nonetheless use similar
methods to train various agent skill levels, or even cooperative behaviors, during development
instead of writing explicit code for each type of agent.
3 ANNs and Genetic Programming of Virtual Pets
The game Creatures was released in 1996 [3]. It supplied a closed environment in which a
player interacted with anthropomorphic animals. The agents, called creatures, have artificial
neural networks for sensory-motor control and learning and artificial biochemistries for energy
metabolism and hormonal regulation of behavior. Both the ANN and biochemistry of a
creature are genetically encoded to allow for evolutionary adaptation through sexual
reproduction.
The player is able to interact with the creatures by stroking or slapping them in order to
give positive or negative reinforcement, respectively. The player can also manipulate objects in
the environment, such as moving a ball in front of a creature. Creatures have simulated senses
of sight, sound, and touch.
A creature’s ‘brain’ is a neural network sub-divided into ‘lobes.’ Synapses (connections)
are formed between nodes (neurons) within a lobe, and each lobe can form connections to up to
two other lobes. A given creature’s neural network brain initially consists of about 1,000
neurons, 9 lobes, and approximately 5,000 synapses. In addition to network weights being
dynamic, the pattern of connections between nodes within a lobe can change throughout the
life of a creature.
The decision layer of a creature’s neural network consists of 16 nodes that many other
nodes feed into. Each node represents one decision, with the set of possible decisions
depending upon the object being considered for interaction.
The genome for a creature is a string of bytes. Information such as network connections,
network node structure, and biochemistry is encoded in the byte string. This string forms a
single, haploid chromosome, which contains punctuation markers to indicate gene boundaries.
Crossover occurs during reproduction. Crossover errors can introduce omissions and
duplications of genes, and random mutations are also introduced. The chromosome, or genome,
of a creature is scanned at various points during its lifespan to allow for genes that are encoded
to be expressed at different parts of the creature’s life.
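The reproduction scheme can be sketched roughly as follows; the byte-string representation and the rates used here are assumptions for illustration, not the actual Creatures implementation.

# Crossover of byte-string genomes with omission/duplication errors and point mutation
# (illustrative sketch, not the Creatures source).
import random

random.seed(0)

def crossover(parent_a: bytes, parent_b: bytes, mutation_rate: float = 0.001) -> bytes:
    # Unaligned cut points mean the child can lose or duplicate whole stretches of
    # genes, mimicking the omission/duplication errors described above.
    cut_a = random.randrange(len(parent_a))
    cut_b = random.randrange(len(parent_b))
    child = bytearray(parent_a[:cut_a] + parent_b[cut_b:])
    # Random point mutations: occasionally replace a byte with a random value.
    for i in range(len(child)):
        if random.random() < mutation_rate:
            child[i] = random.randrange(256)
    return bytes(child)

# Example: two parent genomes produce a child that may differ in length.
mother = bytes(random.randrange(256) for _ in range(512))
father = bytes(random.randrange(256) for _ in range(512))
child = crossover(mother, father)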
The behavior of the agents in Creatures was dynamic and varied. The creatures
appeared to learn. Emergent social behavior was witnessed, such as cooperative playing with a
ball. However, the appearance of learning and social behavior may also have been due to the
observers anthropomorphizing the creatures.
4 Multi-Objective Neuroevolution of NPC Behavior
It has been shown that neural networks can be trained to be used as controllers for game agents,
or Non-Player Characters (NPCs). Most of the demonstrations of this ability have used only
one objective, which yields one behavior pattern. Furthermore, most learning methods only
allow for one objective. Reinforcement learning methods rely on scalar rewards and
evolutionary methods rely on scalar fitness functions. It becomes necessary to use multi-
objective methods when the target behavior is more complex or situational [4]. Information
about tradeoffs between various objectives must be included in the learning method.
Schrum and Miikkulainen specifically used Pareto-based multi-objective methods.
Pareto-based methods address problems in which one objective can only be optimized further
by worsening one or more other objectives. Rather than having one optimal solution, such problems can have
a potentially infinite number of solutions.
The group of agents trained in this simulation occupies a two-dimensional surface.
Their two goals are to earn points by colliding with a single player agent and to avoid taking
damage from the player. The player holds a “bat” in front of it that it can swing at the AI
agents in order to damage them and accrue points. If an agent makes contact with the player,
the player is knocked back and automatically turns to face the agent that struck him. The goal
is to minimize the points the player gains by hitting agents while maximizing the collective
points of all of the AI agents. Early experiments using individual scores as the measure of
success taught agents competitive behavior, which resulted in poor performance. Therefore, a
group score was used instead.
The evaluations of agent performance are noisy: a given agent can receive different scores
across evaluations because the actions of one agent affect how other agents are evaluated.
The problem is also difficult because objectives such as avoiding damage and attacking the
player are contradictory. Three criteria were used to measure the fitness of an agent’s behavior:
attack bonus given when the agent hits the player (or a slightly discounted attack bonus for
being within a close radius of the player when he is struck by another agent), the amount of
damage received by the agent, and the amount of time the agent stayed alive (an agent “dies” if
it is hit five times by the player’s bat within a trial).
Agents were trained using a neuroevolution learning algorithm. Neuroevolution
involves the production of populations of neural networks. Mutation operations can modify
existing connection weights within these networks. Mutations can also alter the topology of the
network by adding neurons and connections, including recurrent connections. Unlike some
neuroevolution algorithms, this experiment did not involve elements such as crossover and
speciation. New generations are produced by cloning each parent network and modifying those
clones via mutation. The best half of the combined parent/clone population is then selected as
the new parent generation.
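A rough sketch of this clone-mutate-select generational loop is given below. The network representation, mutation rates, and the scalar fitness used for ranking are illustrative simplifications (selection in the actual experiment is multi-objective, as described next).

# Clone, mutate, and keep the best half of parents plus clones (illustrative sketch).
import copy
import random

random.seed(0)

def mutate(network):
    child = copy.deepcopy(network)
    # Perturb existing connection weights.
    for conn in child["connections"]:
        if random.random() < 0.3:
            conn["weight"] += random.gauss(0.0, 0.5)
    # Occasionally add a new connection (possibly recurrent) between nodes.
    if random.random() < 0.1:
        child["connections"].append({
            "src": random.randrange(child["n_nodes"]),
            "dst": random.randrange(child["n_nodes"]),
            "weight": random.gauss(0.0, 1.0),
        })
    # Occasionally add a new neuron.
    if random.random() < 0.05:
        child["n_nodes"] += 1
    return child

def next_generation(parents, fitness):
    clones = [mutate(p) for p in parents]        # one mutated clone per parent
    combined = parents + clones
    combined.sort(key=fitness, reverse=True)     # rank (here by a scalar fitness)
    return combined[:len(parents)]               # keep the best half

# Example with a toy network representation and toy fitness function.
parents = [{"n_nodes": 5, "connections": [{"src": 0, "dst": 1, "weight": 0.5}]}
           for _ in range(10)]
parents = next_generation(parents, fitness=lambda net: sum(c["weight"] for c in net["connections"]))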
The multi-objective evolutionary algorithm NSGA-II (Non-dominated Sorting Genetic
Algorithm II) was used. The algorithm sorts the population into successive non-dominated
Pareto fronts in terms of the individuals’ objective scores, so that individuals dominated by
the fewest others are preferred during selection. Domination is defined as follows:
v = (v1,…,vn) dominates u = (u1,…,un) iff
1. ∀i ∈{1,…,n} : vi ≥ ui, and
2. ∃i ∈ {1,…,n} : vi > ui.
A vector v is said to be non-dominated if there do not exist any vectors in the population that
dominate it. The non-dominated vectors in the population are considered to be a Pareto front.
Once a Pareto front is determined, the included individuals are temporarily removed from the
population and successive non-dominated Pareto fronts are established. The order in which
individuals are assigned to Pareto fronts indicates their precedence when determining which
networks represent better objective tradeoffs.
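The dominance relation and the repeated extraction of non-dominated fronts can be sketched directly from the definition above; the objective vectors used in the example are made up for illustration.

# Pareto dominance check and non-dominated sorting into successive fronts (illustrative sketch).

def dominates(v, u):
    # v dominates u iff v is at least as good in every objective
    # and strictly better in at least one.
    return all(vi >= ui for vi, ui in zip(v, u)) and any(vi > ui for vi, ui in zip(v, u))

def pareto_fronts(population):
    remaining = list(population)
    fronts = []
    while remaining:
        # The current front is every vector not dominated by any other remaining vector.
        front = [v for v in remaining
                 if not any(dominates(u, v) for u in remaining if u is not v)]
        fronts.append(front)
        # Temporarily remove the front and repeat to find the next one.
        remaining = [v for v in remaining if v not in front]
    return fronts

# Example objective vectors: (attack bonus, negated damage taken, time alive),
# all treated as "higher is better".
scores = [(3, -1, 40), (2, -1, 40), (1, 0, 55), (3, -2, 30)]
print(pareto_fronts(scores))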
A simulation was run against a hard-coded player “bot” agent that used successively
more difficult strategies. The multi-objective algorithm was compared to an algorithm using a
scalar fitness score. The groups of NPC agents evolved effective strategies against the player
bot using the multi-objective algorithm more reliably and quickly than with the scalar fitness
algorithm. The NPCs, despite having no way of sensing the locations of their teammates,
developed teamwork strategies, such as baiting the player bot in a wide arc to allow other
agents to catch up to and attack the player bot from behind.
5 Trading Agents in Virtual Economies
Massively multiplayer online games (MMOGs) are not only venues for strategic and tactical
gameplay. The virtual goods and services in games like EVE Online and World of Warcraft are
exchanged at such high frequencies and in such large quantities as to create economies nearly
as large and complex as those that exist in reality [5]. Many players in these virtual worlds
choose to achieve their goals in the game through financial means as opposed to more standard
activities like battling other players.
The economy of EVE Online, a science-fiction massively multiplayer online role-playing
game (MMORPG) with 220,000 players, is actively regulated by a professional
economist who monitors inflation, deflation, commodity indices, and production levels within
the virtual world. Such monitoring can help keep the economy healthy. The game designers
can also control the economy by determining the scarcity of items, fees, and the ways in which
wealth can be generated, but the collective will of the players in the virtual markets is also a
major force influencing its status.
Another way that game designers can maintain the economy of a virtual world is
through autonomous trading agents that participate in the economy. However, if these agents
are not dynamic and fitted to the economy, they can be exploited by players. Reeder et al. used
data from the economy of EVE Online to train an autonomous trading agent using a
reinforcement learning algorithm.
The agent was put in control of a manufacturing unit that could produce one of six
commodities at any given time. The goal of the agent was to determine which commodity to
produce, given data that indicated the market value of the raw materials used in each possible
manufacturing process as well as the value of each manufacturing product that could be
produced.
A reinforcement learning algorithm was used to train the agent. Reinforcement learning
is a group of algorithms that attempt to use rewards and costs of actions in order to map states
(inputs) to actions (outputs). The trading policy that the agent learns is the optimal sequence of
transactions to buy or sell a particular item. The cost estimate of taking an action was
determined by the rule
p(s, a) ← α · p(s, a) + (1 − α) · (cim + p(s', a)),
where s is the initial state, a is the action taken, s' is the new state, p(s',a) is the expected cost
of an action in the new state, n is the number of times a has been tried in s, α is n / (n+1), and
cim is calculated by placing the bid associated with the action and simulating the market until
the next time step. Costs are updated according to this rule for each combination of time step, t,
and percentage of original volume to be bought that remains to be bought, v. Three weeks of
live data was used to train the agent, and one week of data was used for testing the accuracy of
the learned trading policy.
The performance of the trading strategy learned by the agent was measured in terms of
wealth accumulated. This performance was compared to that of various standard, static trading
strategies. The agent outperformed the other trading strategies for all combinations of market
variables, and closely mimicked the trading policy generally practiced by players in the game.
It was determined that deploying trading agents into the EVE Online world could help
maintain a healthy economy by providing services such as subtle price deflation and a reliable
source of goods during less active markets.
6 Virtual Pets and Artificial General Intelligence
A more recent project similar to that of Creatures is the Artificial General Intelligence (AGI)
project being worked on at Novamente LLC [6]. The project involves the creation of virtual pets.
Instead of using only an artificial neural network, different types of learning are layered in
order to increase the adaptability of the pet agents. Three types of learning are combined in
the learning algorithm applied to the agents being developed by Novamente: imitative learning,
in which a teacher acts out the behavior that it seeks to teach the student agent; reinforcement
learning, which has already been discussed; and corrective learning, in which the teacher
actively guides and corrects the student’s behavior as the student attempts to carry out a behavior.
An interface was designed to facilitate communication between the controller software
that powers the virtual pet agents and the virtual world of Second Life. This allows for input
data to be fed into the learning architecture and for output commands to be sent to the agent in
the virtual world.
The algorithm used in this AI framework is hillclimbing, which is a simple but fast
optimization technique. A more sophisticated probabilistic evolutionary learning algorithm
called MOSES is in the process of being implemented. The problem with the more
sophisticated algorithm is that it would take too long for a human operating an avatar in a
virtual world to train a pet agent using such an algorithm.
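A minimal sketch of hillclimbing over a candidate behavior is shown below, with an illustrative parameter-vector representation and a toy fitness function standing in for the teacher's feedback; none of this reflects Novamente's actual implementation.

# Hillclimbing over a parameter vector representing a candidate behavior (illustrative sketch).
import random

random.seed(0)

def hillclimb(fitness, n_params=8, iterations=1000, step=0.1):
    current = [random.uniform(-1, 1) for _ in range(n_params)]
    current_fit = fitness(current)
    for _ in range(iterations):
        # Propose a neighbour by nudging one randomly chosen parameter.
        candidate = list(current)
        i = random.randrange(n_params)
        candidate[i] += random.gauss(0.0, step)
        candidate_fit = fitness(candidate)
        if candidate_fit > current_fit:          # keep only strict improvements
            current, current_fit = candidate, candidate_fit
    return current, current_fit

# Example: maximize a toy fitness standing in for the teacher's reward signal.
best, score = hillclimb(lambda p: -sum(x * x for x in p))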
Any fairly sophisticated machine learning algorithm requires so many training instances that
it would not be realistic for a human to provide all of the training interactively. This is one
of the main reasons that most of the examples in this paper involve offline
learning. However, the use of a fitness estimation function that predicts the fitness of a
behavior before it is performed would help to solve this problem by eliminating potential
behaviors that have low fitness before they are actually performed. The knowledge obtained
from this fitness estimation process could be stored in a collective manner, such that many AI
agents could use the information.
These virtual pet agents have been able to learn novel behaviors. However, live
learning with a human teacher in a reasonable amount of time has not yet been accomplished. Also,
several Linux servers are necessary to run one AGI, so the only way to have many agents
learning at once is to have them share much of the same “brain.” Consumer hardware is not
currently capable of running the AGI software in addition to the high graphics settings that are
usually part of modern games. Virtual world environments currently do not have the richness
and variety to make such general AI agents worthwhile, but in the future virtual worlds will be
excellent venues for training and testing these agents.
7 Bayesian Skill Rating in Online Games
The popularity of competitive online gaming has created a need for accurate match-making
systems. Matching players of similar skill level results in more challenging, fun, and interesting
games. Making rankings available to players promotes competition and interest in the game,
and rankings can also be used to determine qualification for tournaments. Microsoft uses a
Bayesian learning algorithm in their Xbox Live matchmaking system, which they call TrueSkill
[7].
In their algorithm, a game includes n players {1,…,n} that make up k teams. The
outcome vector r = (r1,…,rk) is a list of the rankings of the teams, with a rank of 1 indicating
the winning team and with draws between multiple teams being allowed. The probability P(r|s, A)
of the game outcome r, given the skills s of the participating players and the team assignments
A = {A1,…,Ak}, is combined with a prior over skills using Bayes’ rule to obtain the posterior
probability distribution
p(s|r, A) = P(r|s, A) p(s) / P(r|A).
A factorizing Gaussian prior distribution, p(s), is assumed. Each player i is assumed to exhibit
a performance pi, centered around their skill si with fixed variance β2. The performance tj of
team j is modeled as the sum of the performances of its members. If no draws occur, the
probability of a game outcome r is modeled as
P(r | s, A) = P(t1 > t2 > ⋯ > tk),
where teams are indexed in ranking order; that is, the order of the team performances generates
the order in the game outcome.
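As a simplified illustration of this performance model (treating skills as point estimates rather than full posteriors, and ignoring draws), the probability that one team outranks another reduces to a single Gaussian cumulative distribution evaluation; the value of β below is an assumed placeholder.

# Probability that one team's summed performance exceeds another's (simplified sketch,
# not Microsoft's implementation).
import math

BETA = 4.0  # assumed performance-variance parameter

def win_probability(skills_team1, skills_team2, beta=BETA):
    # Team performance t_j ~ Normal(sum of member skills, len(team) * beta^2),
    # so t_1 - t_2 is Gaussian and P(t_1 > t_2) is a Normal CDF evaluation.
    mean_diff = sum(skills_team1) - sum(skills_team2)
    variance = (len(skills_team1) + len(skills_team2)) * beta ** 2
    return 0.5 * (1.0 + math.erf(mean_diff / math.sqrt(2.0 * variance)))

# Example: a 4 vs 4 match between two teams with known (point-estimate) skills.
print(win_probability([25, 27, 24, 26], [25, 25, 25, 25]))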
Skill estimates need to be reported after each game, so an online learning scheme called
Gaussian density filtering is used. If skills are allowed to vary over time, a Gaussian dynamics
factor is introduced which leads to an additive variance component in the subsequent prior. A
factor graph is used to represent the relationships between team performances, individual
performances, individual skills, and the differences between team performances. A factor
graph is a graph consisting of variable and factor nodes. The factor graph in this case
represents the joint distribution p(s,p,t|r,A), which is given by the product of all of the factor
nodes in the graph. The estimated skills of the individual players, p(si|r, A) are calculated from
the joint distribution by integrating out the individual and team performances,
Estimated individual skill levels from the match are then used to update the overall skill rating
for that player.
The performance of the TrueSkill algorithm was tested during the beta test of the Xbox
game Halo 2. The data set consisted of thousands of game outcomes for four different game
types: 8 players against each other, 4 players vs. 4 players, 1 player vs. 1 player, and 8 players
vs. 8 players. The algorithm’s ability to predict close games was compared to that of Elo, a
standard skill ranking algorithm developed for comparing skill levels of chess players.
Microsoft’s algorithm was shown to be significantly better at predicting the games that it
determined to be close. The ability of the two algorithms to create close matches was tested by
measuring their respective draw percentages: since draws indicate evenly matched players, the
more games that ended in a draw, the better the algorithm was at matchmaking. Again, the
TrueSkill algorithm outperformed Elo. TrueSkill is
currently being used for matchmaking on Xbox Live games. The system processes hundreds of
thousands of games per day, which makes it one of the largest applications of Bayesian
inference to date.
8 Conclusion
The projects surveyed in this paper illustrate the growing overlap between machine learning
research and computer game development. Neural networks trained from player data,
neuroevolution, reinforcement learning, and Bayesian inference have all been applied, in
academic prototypes and in commercial systems such as Creatures and TrueSkill, to problems
ranging from agent behavior and virtual ecologies to economic regulation and matchmaking.
Most of these applications still rely on offline training, but the range of examples suggests that
interest in integrating machine learning into game development will continue to grow.
References
[1] http://aigamedev.com/open/article/gdc09-slides-highlights/.
[2] B. Geisler, “Integrated machine learning for behavior modeling in video games,” In D. Fu,
S. Henke, and J. Orkin, editors, Proceedings of the AAAI-2004 Workshop on Challenges in
Game Artificial Intelligence, pages 54-62. AAAI Press, 2004.
[3] S. Grand and D. Cliff, “Creatures: Entertainment software agents with artificial life.”
Autonomous Agents and Multi-Agent Systems, vol. 1(1), 1998.
[4] J. Schrum and R. Miikkulainen, “Constructing complex npc behavior via multi-objective
neuroevolution,” in Proceedings of the Fourth Artificial Intelligence and Interactive
Digital Entertainment Conference (AIIDE), 2008.
[5] J. Reeder et al., “Intelligent trading agents for massively multi-player game economies,” in
Proceedings of the Fourth Artificial Intelligence and Interactive Digital Entertainment
Conference (AIIDE), 2008.
[6] B. Goertzel et al., “An integrative methodology for teaching embodied non-linguistic
agents, applied to virtual animals in Second Life,” in Proceedings of AGI-08, IOS Press, 2008.
[7] R. Herbrich, T. Minka, and T. Graepel, “TrueSkill: A Bayesian skill rating system,” in
Advances in Neural Information Processing Systems 19, MIT Press, 2007.