Neat Python
Reinforcement Learning
Jerin Paul Selvan
Dept. of Computer Engineering
Pune Institute of Computer Technology
Pune, India
[email protected]

Dr. P. S. Game
Dept. of Computer Engineering
Pune Institute of Computer Technology
Pune, India
[email protected]
Abstract—For over a decade now, robotics and the use of artificial agents have become a common thing. Testing the performance of new path-finding or search-space optimisation algorithms has also become a challenge, as they require a simulation or an environment to test them. The creation of artificial environments with artificial agents is one of the methods employed to test such algorithms. Games have also become an environment to test them. The performance of the algorithms can be compared by using artificial agents that behave according to the algorithm in the environment they are put in. A performance parameter can be how quickly the agent is able to differentiate between rewarding actions and hostile actions. This can be tested by placing the agent in an environment with different types of hurdles, where the goal of the agent is to reach the farthest by taking decisions on actions that avoid all the obstacles. The environment chosen is a game called "Flappy Bird". The goal of the game is to make the bird fly through a set of pipes of random heights. The bird must go in between these pipes and must not hit the top, the bottom, or the pipes themselves. The actions that the bird can take are either to flap its wings or drop down with gravity. The algorithms that are enforced on the artificial agents are NeuroEvolution of Augmenting Topologies (NEAT) and Reinforcement Learning. The NEAT algorithm takes an 'N' initial population of artificial agents. They follow genetic algorithms by considering an objective function, crossover, mutation, and augmenting topologies. Reinforcement learning, on the other hand, remembers the state, the action taken at that state, and the reward received for the action taken, using a single agent and a Deep Q-learning Network. The performance of the NEAT algorithm improves as the initial population of the artificial agents is increased.

Keywords—NeuroEvolution of Augmenting Topologies (NEAT), Artificial agent, Artificial environment, Game, Reinforcement Learning (RL)

I. INTRODUCTION

An intelligent agent is anything that can detect its surroundings, act independently to accomplish goals, and learn from experience or use knowledge to execute tasks better. The agent's surroundings are considered an environment in artificial intelligence. The agent uses actuators to send its output to the environment after receiving information from it through sensors [11]. There are several types of environments: Fully Observable vs Partially Observable, Deterministic vs Stochastic, Competitive vs Collaborative, Single-agent vs Multi-agent, Static vs Dynamic, Discrete vs Continuous, Episodic vs Sequential, and Known vs Unknown.

An approach to machine learning known as NEAT, or NeuroEvolution of Augmenting Topologies, functions similarly to evolution. In its most basic form [1], NEAT is a technique for creating networks that are capable of performing a certain activity, like balancing a pole or operating a robot. Significantly, NEAT networks can learn using a reward function as opposed to back-propagation. By executing actions and observing the outcomes of those actions, an agent learns how to behave in a given environment via reinforcement learning, a feedback-based machine learning technique. The agent receives positive feedback for each positive action and is penalised or given negative feedback for each negative action. In contrast to supervised learning, reinforcement learning uses feedback to autonomously train the agent without the use of labelled data. The agent can only learn from its experience because there is no labelled data. In situations like gaming, robotics, and the like, where decisions must be made sequentially and with a long-term objective, RL provides a solution. The agent engages with the environment and independently explores it. In reinforcement learning, an agent's main objective is to maximise the positive rewards it accumulates while improving its behaviour.
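To make the feedback loop described above concrete (observe a state, take an action, receive positive or negative feedback), a generic agent-environment interaction can be sketched as follows. This is an illustrative sketch only: the env and agent objects and their methods are placeholders, not part of the paper's implementation.

```python
def run_episode(env, agent):
    """Generic reinforcement-learning feedback loop: observe, act, receive feedback."""
    state = env.reset()                                 # initial observation
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(state)                       # agent chooses an action
        next_state, reward, done = env.step(action)     # environment responds with a reward
        agent.learn(state, action, reward, next_state)  # positive or negative feedback
        state = next_state
        total_reward += reward
    return total_reward
```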
II. LITERATURE SURVEY

Games have been used a lot to act as an environment to test algorithms. There is a lot of research [3] done to create an AI bot that can challenge a player in a multi-player or two-player game. Neuroevolution and Reinforcement Learning algorithms are some of the algorithms that are used to create AI bots or artificial agents. [1], [7] and [8] have implemented a configuration of an ANN called Neuroevolution. The algorithm does not depend on the actions taken by the agents as a whole. [3], [4], [5], [6] and [7] use a Reinforcement Learning algorithm with Deep Q-Learning to train the agents.

The performance of the Neuroevolution algorithm depends on the objective function, the initial population, the mutation rate, the weights and biases added to the network, the activation function used, and the overall topology of the network. The authors in [2] discuss how superior the Neuroevolution algorithm is to the traditional Reinforcement Learning algorithm with Deep Q-Learning. Neuroevolution has the upper hand when it comes to the time taken by the artificial agent to train itself. There are other parameters that need to be taken into consideration while using a Neural Network. The topology of the network plays a vital role in the performance. Two strategies were proposed by Evgenia Papavasileiou (2021) [2]: using fixed topologies in the neural networks and using augmented topologies. In the fixed-topology approach, the network topology is a single hidden layer of neurons, with each hidden neuron connected to every network input and every network output. Evolution searches the space of connection weights of this fully-connected topology by allowing high-performing networks to reproduce. The weight space is explored through the crossover of network weight vectors and through the mutation of single networks' weights. Thus, the goal of fixed-topology NE is to optimise the connection weights that determine the functionality of a network. The topology, or structure, of neural networks also affects their functionality, and modifying the network structure has been effective as part of supervised training.

There are two ways of making use of the environment. The authors in [3], [4], [6] and [7] use a DNN to extract features from the frames of the game, and these form the input to the agent. However, [1], [5] and [8] make use of the game itself and place the agent in it to perceive its surroundings. There are several combinations of Reinforcement Learning algorithms possible, like Deep Neural Networks (DNN), Long Short-Term Memory (LSTM), Deep Q-Network (DQN) and the like. However, depending on the type of obstacle and the type of game, their performance varies.

A Reinforcement Learning algorithm with a DNN and LSTM has been used in [3]. This algorithm addresses issues like a vast search space, dependencies between the actions taken by the agent, the state and the environment, inputs, and imperfect information. To reduce the complexity of the data generated by the perception of the agent, data skipping techniques are implemented. There is, however, a drawback with this algorithm: it takes a lot of time for the agent to train. For every discrete step taken by the agent, it receives a state that belongs to a set S and it sends an action from the set of actions A to the environment. The environment makes a transition from state S_t to S_{t+1}, and a gamma value in [0, 1] determines the preference for immediate reward over long-term reward. A self-play method is used by storing the parameters of the network to create a pool of past agents; this pool of past agents is used to sample opponents. This method allows RL to learn the Nash equilibrium strategy. Data skipping techniques, which refer to the process of dropping certain data during the training and evaluation process, were also proposed in this paper. The data skipping techniques proposed are "no-op" and "maintain move decision". The network is composed of an LSTM-based architecture, which has four heads with a shared state representation layer. An actor-critic off-policy learning algorithm was proposed.
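To make the role of the discount factor concrete, the sketch below computes the discounted return of a recorded episode from its per-step rewards. It is illustrative only and does not reproduce the actor-critic setup of [3]; the reward values in the example are borrowed from the scheme described later for [6].

```python
def discounted_returns(rewards, gamma=0.9):
    """Discounted return G_t = r_t + gamma * G_{t+1} for every step of an episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # immediate reward plus discounted future reward
        returns[t] = running
    return returns

# Three frames survived (+0.5 each) followed by a crash (-1000).
print(discounted_returns([0.5, 0.5, 0.5, -1000.0], gamma=0.9))
```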
Botong Liu (2020) [4] has used Reinforcement Learning with a DQN. The game was split into frames, and each game image was sequentially scaled, grayed, and adjusted for brightness. The Deep Q-Network algorithm was used to convert the game decision problem into a classification and recognition problem over multi-dimensional images and solve it using a CNN. Reinforcement learning works best for continuous decision-making problems. However, Deep Reinforcement Learning has a limitation of not converging, for which Neural Fitted Q-learning (NFQ) and DQN algorithms were used to overcome the issue. Since NFQ can work with numerical information only, the author suggests the use of DQN. By combining Q-learning with a CNN, the DQN can achieve self-learning. ReLU and maximum pooling layers are added to the CNN. Gradient descent (the Adam optimizer) was used to train the DQN parameters.
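The frame preprocessing described above (scaling, grayscaling, and brightness adjustment) can be sketched with OpenCV as follows. The 80 x 80 target size and the brightness parameters are illustrative assumptions for the example, not values reported in [4].

```python
import cv2
import numpy as np

def preprocess_frame(frame: np.ndarray, size=(80, 80), alpha=1.2, beta=10):
    """Scale, grayscale and brighten a raw RGB game frame before feeding it to a DQN."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)                     # grayscale
    resized = cv2.resize(gray, size)                                   # scale down
    brightened = cv2.convertScaleAbs(resized, alpha=alpha, beta=beta)  # brightness adjustment
    return brightened.astype(np.float32) / 255.0                       # normalise to [0, 1]
```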
Q-value function based algorithms are the focus of Aidar Shakerimov (2021) [5]. For the DQN algorithms, improvements in performance could be achieved by using a cumulative reward for training actions. To speed up training, an RNN-ReLU was used instead of an LSTM or GRU. An LSTM or GRU performs better than an RNN-ReLU but takes 7 times more time to train. Label smoothing was used to prevent vanishing gradients in the RNN-ReLU. However, the DQN is sensitive to seed randomization.

SARSA is a slight variation of the traditional Q-Learning algorithm. The authors in [6] use the SARSA and Q-Learning algorithms with modifications such as an ε-greedy policy, discretization, and backward updates. Some variants of Q-Learning were also implemented, such as a tabular approach, Q-value approximation using linear regression, and a neural network. In their implementation, [6] finds the SARSA algorithm to have outperformed Q-learning. The reward specification is a positive 5 for passing a pipe, a negative 1000 for hitting a pipe, and a positive 0.5 for surviving a frame. A feed-forward NN was used with a 3-neuron input layer, a 50-neuron hidden layer, a 20-neuron hidden layer, and a 2-neuron output layer (ReLU activation function). The CNN is used with a preprocessed input image obtained by removing the background, converting to grayscale, and resizing to 80 x 80; two CNN layers were used, one with sixteen 5 × 5 kernels with stride 2, and another with thirty-two 5 × 5 kernels with stride 2.
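The two network architectures described for [6] can be sketched in PyTorch as below. This is a reconstruction from the textual description only; details such as padding, the flattened feature size, and the output interpretation are assumptions.

```python
import torch.nn as nn

class FlappyQNet(nn.Module):
    """Feed-forward Q-network following the 3-50-20-2 layout described in [6]."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(3, 50), nn.ReLU(),   # 3 state features -> 50 hidden units
            nn.Linear(50, 20), nn.ReLU(),  # second hidden layer of 20 units
            nn.Linear(20, 2),              # Q-values for the two actions (flap / do nothing)
        )

    def forward(self, x):
        return self.layers(x)

class FlappyConvNet(nn.Module):
    """Two-layer CNN over 80 x 80 preprocessed frames, following the description in [6]."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.ReLU(),   # sixteen 5x5 kernels, stride 2
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),  # thirty-two 5x5 kernels, stride 2
        )
        # With no padding, an 80x80 input shrinks to 38x38 and then 17x17 feature maps.
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(32 * 17 * 17, 2))

    def forward(self, x):
        return self.head(self.conv(x))
```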
[7] proposes the use of specific feature selection and represents the state by the bird's velocity and the difference between the bird's position and the next lower pipe. This reduces the feature space and eliminates the need for deeper models. The agent is provided with rational, human-level inputs along with generic RL and a standard 3-layer NN with a genetic optimization algorithm. The reward for the agent is a positive 1 for every pipe crossed and a negative 100 if the agent dies. The neuroevolution has the following characteristics: the NN weights and the number of hidden-layer units undergo changes, the mutation rate is kept at 0.3, and the initial population size is 200. [8] proposes the use of two levels for the Flappy Bird game. The fitness function is calculated from the distance traveled by the agent and the current distance to the closest gap. The mutation rate is kept at 0.2, and there are 5 neurons in the hidden layer.

III. METHODOLOGY

The NEAT algorithm implementation is dependent on the objective function, crossover, mutation, and a population of agents. For a given position of the bird, say (x, y), there are
two actions that the agent can take: either the bird flaps its wings or it does not. The vertical and horizontal distances traveled by the agent are determined by the following equations.
d_{vertical} = v_{jump} \cdot t + \frac{1}{2} \cdot a \cdot t^{2}    (1)

d_{horizontal} = v_{floor} \cdot t    (2)

d_{floor} = v_{floor} \cdot t    (3)

d_{pipe} = v_{pipe} \cdot t    (4)
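A per-frame update implementing Eqs. (1)-(4) could look like the following sketch. The velocity and acceleration constants are illustrative placeholders; the paper does not report the exact values.

```python
# Illustrative constants in pixels per frame; not the paper's actual values.
V_JUMP = -9.0    # upward velocity applied when the bird flaps
A_GRAVITY = 0.5  # constant downward acceleration a
V_FLOOR = 4.0    # horizontal scroll speed of the floor
V_PIPE = 4.0     # horizontal scroll speed of the pipes

def displacements(t: float):
    """Displacements after time t, following Eqs. (1)-(4)."""
    d_vertical = V_JUMP * t + 0.5 * A_GRAVITY * t ** 2  # Eq. (1)
    d_horizontal = V_FLOOR * t                          # Eq. (2)
    d_floor = V_FLOOR * t                               # Eq. (3)
    d_pipe = V_PIPE * t                                 # Eq. (4)
    return d_vertical, d_horizontal, d_floor, d_pipe
```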
Eq. (1) determines the vertical displacement of the agent,
where a is the acceleration that is a constant [12]. As shown in
Fig. 2, the y coordinate of the agent, the distance between the top pipe and the agent (y - T') and the distance between the bottom pipe and the agent (T') are the inputs to the neural network. The gap between the top and the bottom pipe is fixed to 320 pixels, and the heights are randomly generated. The distance between subsequent pipes is also kept constant. With respect to the NEAT algorithm, the fitness of the agent is determined by the number of pipes that the agent is able to cross without collision. As soon as the agent collides with a pipe, hits the roof, or falls down to the ground, it is removed from the environment. The performance of the algorithm depends on the initial population that is taken into consideration. The activation function used is the hyperbolic tangent function. The mutation rate is kept at 0.03.

TABLE I
ENCODING OF A CHROMOSOME BEFORE CROSSOVER AND MUTATION

Weight  | 0.25 | 2.31 | 1.55 | 0.98 | 5.11 | 1.17 | 0.07
From    |    1 |    2 |    3 |    1 |    3 |    4 |    2
To      |    2 |    3 |    2 |    3 |    4 |    3 |    4
Enabled |    1 |    0 |    1 |    1 |    1 |    1 |    1

Fig. 3. Diagrammatic view of the encoded chromosome in Table I

The encoding of the chromosome is shown in Table I. The weight of the connection from a node in one layer to a node in the other layer, and whether the connection is dropped, are also part of the encoding. If a connection is to be dropped, it is marked as 0 in the 'Enabled' row, and the nodes are represented by the rows 'From' and 'To'. Table I shows the encoding of the network before mutation. After the mutation, or rather after topology augmentation, the encoding of the edges is shown in Table II. The resultant connections are shown in Fig. 4. The edges that are in red are the edges that were dropped, and the edges that are in green are the ones that have been added as a result of the mutation. The cross-over process happens between any two
randomly selected parents. The next population is determined
by the fitness of the individual agents.
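To make the encoding of Table I concrete, each connection gene can be stored as a record of weight, source node, destination node, and an enabled flag. The sketch below mirrors Table I and shows one possible topology-augmenting mutation that disables or re-enables connections and adds a new random edge; the exact mutation operators of the implementation are not reproduced here.

```python
import random
from dataclasses import dataclass

@dataclass
class ConnectionGene:
    weight: float
    src: int       # 'From' node
    dst: int       # 'To' node
    enabled: bool  # 'Enabled' row; a dropped connection is marked 0 (False)

# Chromosome from Table I, before crossover and mutation.
chromosome = [
    ConnectionGene(0.25, 1, 2, True), ConnectionGene(2.31, 2, 3, False),
    ConnectionGene(1.55, 3, 2, True), ConnectionGene(0.98, 1, 3, True),
    ConnectionGene(5.11, 3, 4, True), ConnectionGene(1.17, 4, 3, True),
    ConnectionGene(0.07, 2, 4, True),
]

def mutate(genes, rate=0.03):
    """Toggle existing connections with a small probability and add one new random edge."""
    for gene in genes:
        if random.random() < rate:
            gene.enabled = not gene.enabled  # drop (or re-enable) a connection
    nodes = sorted({g.src for g in genes} | {g.dst for g in genes})
    src, dst = random.sample(nodes, 2)
    genes.append(ConnectionGene(random.uniform(-1.0, 1.0), src, dst, True))  # augment topology
    return genes
```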
IV. RESULTS

The implementation of the algorithm requires no historic data or any dataset. The algorithm makes use of the sensory data perceived from the environment by the artificial agent as the program runs. The inputs to the algorithm are the y position of the agent, the vertical distance of the agent from the top pipe, and the vertical distance of the agent from the lower pipe. The output of the algorithm is the action that the agent is to take, i.e., jump or drop down owing to gravity.
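Assuming the implementation uses the neat-python library, as the project name suggests, the per-frame decision from these three inputs could be sketched as follows. The argument names and the 0.5 threshold are illustrative assumptions, not details taken from the paper.

```python
import neat  # neat-python

def should_flap(net: "neat.nn.FeedForwardNetwork",
                bird_y: float, top_pipe_y: float, bottom_pipe_y: float) -> bool:
    """Return True if the agent should flap, given the three sensory inputs."""
    # Inputs: y position, distance to the top pipe, distance to the bottom pipe.
    output = net.activate((bird_y, abs(bird_y - top_pipe_y), abs(bird_y - bottom_pipe_y)))
    # With a tanh activation the single output lies in [-1, 1]; flap above a threshold.
    return output[0] > 0.5
```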
The NEAT algorithm was implemented by taking different initial populations. Fig. 5, Fig. 6 and Fig. 7 show the average score and the scores reached in every generation, when the game is played by the agents over 50 generations.

Fig. 5. Gameplay when initial population is 80
Fig. 7. Gameplay when initial population is 120

The change in the average scores over the change in the initial population is separately shown in Fig. 8 for generations 30 to 50. The average score of the agent steadily increases as the initial population grows from 20 to 100. The maximum score is observed when the population is 160. The average fitness value of the population is higher when the initial population size is 100. This is shown in Fig. 9. The initial training phase is less than 5 generations. When the initial population has fewer agents, it takes more generations for the average score of the game to spike. This can be observed from Fig. 10. Table III shows the average score and the maximum score gained by the agent over 50 generations. A maximum score of 1025 is obtained when the initial population is 160 and the gameplay is run for 50 generations.

TABLE III
SCORES OVER CHANGE IN INITIAL POPULATION

Fig. 10. Speed of agents getting trained over initial population change

CONCLUSION AND FUTURE SCOPE

By using a 2D game, the performance of the algorithms can be determined very efficiently. Unlike simulation, the creation of an environment gives better control over the environment. Through various iterations of changing the initial population size, the average score gained by the agent has increased. The initial population of agents also affects the training speed. The more the agents, the quicker the training is done. The highest