RL & DL Notes
Unit 1
Reinforcement learning
Reinforcement Learning: An Overview
Reinforcement Learning (RL) is a branch of machine learning focused on making decisions to
maximize cumulative rewards in a given situation. Unlike supervised learning, which relies on a
training dataset with predefined answers, RL involves learning through experience. In RL, an
agent learns to achieve a goal in an uncertain, potentially complex environment by performing
actions and receiving feedback through rewards or penalties.
RL operates on the principle of learning optimal behavior through trial and error. The agent takes
actions within the environment, receives rewards or penalties, and adjusts its behavior to
maximize the cumulative reward. This learning process is characterized by the following
elements:
● Policy: A strategy used by the agent to determine the next action based on the current
state.
● Reward Function: A function that provides a scalar feedback signal based on the
state and action.
● Value Function: A function that estimates the expected cumulative reward from a
given state.
● Model of the Environment: A representation of the environment that helps in
planning by predicting future states and rewards.
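To make these four elements concrete, here is a tiny, purely illustrative sketch in Python; the two states, two actions, and all the numbers below are made up for illustration and do not correspond to any particular environment.

# Hypothetical 2-state, 2-action example to make the four elements concrete.
states = ["A", "B"]
actions = ["left", "right"]

# Policy: maps a state to a probability distribution over actions.
policy = {"A": {"left": 0.5, "right": 0.5},
          "B": {"left": 0.1, "right": 0.9}}

# Reward function: scalar feedback for a (state, action) pair.
def reward(state, action):
    return 1.0 if (state, action) == ("B", "right") else 0.0

# Value function: estimated cumulative reward obtainable from each state.
value = {"A": 0.4, "B": 1.0}

# Model of the environment: predicts the next state, which enables planning.
model = {("A", "left"): "A", ("A", "right"): "B",
         ("B", "left"): "A", ("B", "right"): "B"}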
Example: Navigating a Maze
The problem is as follows: we have an agent and a reward, with many hurdles in between. The agent is supposed to find the best possible path to reach the reward. The classic illustration shows a robot, a diamond, and fire: the goal of the robot is to get the reward, which is the diamond, while avoiding the hurdles, which are the fire. The robot learns by trying all the possible paths and then choosing the path that gives it the reward with the fewest hurdles. Each right step gives the robot a reward and each wrong step subtracts from the robot's reward. The total reward is calculated when it reaches the final reward, the diamond.
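A minimal sketch of how such an agent could learn by trial and error is tabular Q-learning on a small made-up grid; the 3x3 layout, reward values, and hyper-parameters below are illustrative assumptions rather than part of the original example.

import random

# 3x3 grid, row-major: 'D' = diamond (goal), 'F' = fire (hurdle), '.' = free cell.
grid = [".", ".", "F",
        ".", "F", ".",
        ".", ".", "D"]
actions = {"up": -3, "down": 3, "left": -1, "right": 1}

def step(state, action):
    nxt = state + actions[action]
    # Stay in place if the move leaves the grid or wraps around a row edge.
    if nxt < 0 or nxt > 8 or (action in ("left", "right") and nxt // 3 != state // 3):
        nxt = state
    if grid[nxt] == "D":
        return nxt, 10.0, True      # reaching the diamond ends the episode
    if grid[nxt] == "F":
        return nxt, -10.0, True     # stepping into fire ends the episode
    return nxt, -1.0, False         # small step penalty encourages short paths

Q = {(s, a): 0.0 for s in range(9) for a in actions}
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for episode in range(2000):
    s, done = 0, False              # every episode starts in the top-left cell
    while not done:
        a = (random.choice(list(actions)) if random.random() < epsilon
             else max(actions, key=lambda act: Q[(s, act)]))
        s2, r, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

print(max(actions, key=lambda act: Q[(0, act)]))  # greedy action learned for the start cell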
Other examples: a chess game or text summarization can be framed as reinforcement learning problems, whereas object recognition and spam detection are typical supervised learning tasks.
Types of Reinforcement:
1. Positive: Positive Reinforcement occurs when an event, triggered by a particular behavior, increases the strength and the frequency of the behavior. In other words, it has a positive effect on behavior.
Advantages of positive reinforcement are:
● Maximizes performance
● Sustains change for a long period of time
A disadvantage is that too much reinforcement can lead to an overload of states, which can diminish the results.
2. Negative: Negative Reinforcement is defined as the strengthening of behavior because a negative condition is stopped or avoided.
Advantages of negative reinforcement are:
● Increases the desired behavior
● Helps maintain a minimum standard of performance
A disadvantage is that it only provides enough motivation to meet the minimum behavior.
In a typical Reinforcement Learning (RL) problem, there is a learner and decision maker called the agent, and the surroundings it interacts with are called the environment. The environment, in return, provides rewards and a new state based on the actions of the agent. So, in reinforcement learning, we do not teach an agent how it should do something; instead, we present it with rewards, whether positive or negative, based on its actions. The root question for this section is how we formulate any problem in RL mathematically. This is where the Markov Decision Process (MDP) comes in.
Typical Reinforcement Learning cycle
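The cycle can be sketched with an OpenAI Gym style interface; note that the exact return values of reset() and step() differ between Gym/Gymnasium versions, so treat this as a sketch of the interaction loop rather than version-exact code.

import gym  # assumes a classic Gym release is installed

env = gym.make("CartPole-v1")
state = env.reset()                      # the environment hands the agent its first state
for t in range(500):
    action = env.action_space.sample()   # a random policy stands in for the agent here
    state, reward, done, info = env.step(action)  # the environment returns a reward and a new state
    if done:                             # terminal state reached: the episode ends
        state = env.reset()
env.close()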
Before we answer our root question, i.e. how we formulate RL problems mathematically (using MDPs), we need to build intuition about:
● Markov Property
● Bellman Equation
Agent: Software programs that make intelligent decisions; they are the learners in RL. These agents interact with the environment through actions and receive rewards based on their actions.
Environment: The demonstration of the problem to be solved. It can be a real-world or a simulated environment with which our agent interacts.
State: The position of the agent at a specific time step in the environment. Whenever the agent performs an action, the environment returns a reward and a new state that the agent has moved to.
Anything that the agent cannot change arbitrarily is considered to be part of the environment. In simple terms, actions can be any decisions we want the agent to learn, and a state can be anything that is useful in choosing actions. We do not assume that everything in the environment is unknown to the agent; for example, reward calculation is considered part of the environment even though the agent knows a bit about how its reward is calculated as a function of its actions and the states in which they are taken. This is because rewards cannot be arbitrarily changed by the agent. Sometimes the agent may be fully aware of its environment but still find it difficult to maximize the reward, just as we might know how to play a Rubik's cube yet still be unable to solve it. So, we can safely say that the agent-environment boundary represents the limit of the agent's control, not of its knowledge.
Transition Probability: The probability that the agent will move from one state to another is called
transition probability.
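The equation referred to below is not reproduced in these notes; in the notation used here (S[t] for the current state), the Markov property can be written as:

P[ S[t+1] | S[t] ] = P[ S[t+1] | S[1], S[2], ..., S[t] ]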
S[t] denotes the current state of the agent and S[t+1] denotes the next state. What this equation
means is that the transition from state S[t] to S[t+1] is entirely independent of the past. So, the
RHS of the Equation means the same as LHS if the system has a Markov Property. Intuitively
meaning that our current state already captures the information of the past states.
As we now know about transition probability we can define state Transition Probability as
follows :
For Markov State from S[t] to S[t+1] i.e. any other successor state , the state transition
probability is given by
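In standard notation (reconstructed here, since the original formula is not included):

P(s, s') = P[ S[t+1] = s' | S[t] = s ]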
We can formulate the State Transition probability into a State Transition probability matrix by :
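In standard form (reconstructed), with n states the matrix is:

        [ P(1,1)  P(1,2)  ...  P(1,n) ]
P  =    [ P(2,1)  P(2,2)  ...  P(2,n) ]
        [  ...     ...    ...    ...  ]
        [ P(n,1)  P(n,2)  ...  P(n,n) ]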
State Transition Probability Matrix
Each row in the matrix represents the probability of moving from our original (starting) state to any of the successor states; the entries in each row sum to 1.
A Markov Process is a memoryless random process, i.e. a sequence of random states S[1], S[2], ..., S[n] with the Markov Property. So, it's basically a sequence of states with the Markov Property. It can be defined using a set of states (S) and a transition probability matrix (P). The dynamics of the environment can be fully defined using the states (S) and the transition probability matrix (P).
The edges of the diagram denote transition probabilities. From this chain, let's take a sample. Suppose that we were sleeping; according to the probability distribution, there is a 0.6 chance that we will run, a 0.2 chance that we will sleep more, and a 0.2 chance that we will eat ice-cream. Similarly, we can think of other sequences that we can sample from this chain, for example (Sleep, Ice-cream, Sleep); we may get a different sample every time we run the chain. Hopefully, it is now clear why a Markov process is called a memoryless random process.
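A small sketch of sampling episodes from such a chain; only the probabilities leaving Sleep come from the example above, while the rows for Run and Ice-cream are made-up placeholders.

import random

# Transition probabilities. The "Sleep" row matches the example above;
# the "Run" and "Ice-cream" rows are illustrative assumptions.
P = {
    "Sleep":     {"Sleep": 0.2, "Run": 0.6, "Ice-cream": 0.2},
    "Run":       {"Sleep": 0.3, "Run": 0.5, "Ice-cream": 0.2},
    "Ice-cream": {"Sleep": 0.6, "Run": 0.2, "Ice-cream": 0.2},
}

def sample_chain(start, length):
    state, episode = start, [start]
    for _ in range(length - 1):
        state = random.choices(list(P[state]), weights=list(P[state].values()))[0]
        episode.append(state)
    return episode

print(sample_chain("Sleep", 4))  # e.g. ['Sleep', 'Run', 'Run', 'Ice-cream']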
Rewards are the numerical values that the agent receives on performing some action at some
state(s) in the environment. The numerical value can be positive or negative based on the actions of
the agent.
In Reinforcement Learning, we care about maximizing the cumulative reward (all the rewards the agent receives from the environment) instead of only the reward the agent receives from the current state (also called the immediate reward). This total sum of rewards the agent receives from the environment is called the return.
r[t+1] is the reward received by the agent at time step t[0] while performing an action (a) to move from one state to another. Similarly, r[t+2] is the reward received by the agent at time step t[1] by performing an action to move to another state. And r[T] is the reward received by the agent at the final time step T.
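In symbols (reconstructed in the notation used above), the return from time step t of an episodic task is:

G[t] = r[t+1] + r[t+2] + r[t+3] + ... + r[T]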
Episodic Tasks: These are tasks that have a terminal state (end state); we can say their episodes have finite length. For example, in a racing game we start the game (start the race) and play it until the game is over (the race ends!). This is called an episode. Once we restart the game, it starts again from an initial state as a new episode.
Continuous Tasks: These are tasks that have no end, i.e. they don't have any terminal state. These types of tasks never end. For example, learning how to code!
Now, it's easy to calculate the returns from episodic tasks as they eventually end, but what about continuous tasks, which go on and on forever? Their returns would sum up to infinity!
Discount Factor (ɤ): It determines how much importance is given to the immediate reward versus future rewards. This basically helps us avoid an infinite return in continuous tasks. It has a value between 0 and 1. A value close to 0 means more importance is given to the immediate reward, and a value close to 1 means more importance is given to future rewards. In practice, an agent with a discount factor of exactly 0 never learns anything beyond the immediate reward, while a discount factor of exactly 1 keeps counting future rewards, which may lead to an infinite return. Therefore, the discount factor is chosen strictly between 0 and 1.
So, we can define the return using the discount factor as follows (let's call this equation 1, as we will use it later when deriving the Bellman Equation):
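Equation 1 is not reproduced in these notes; in standard form the discounted return is:

G[t] = r[t+1] + ɤ·r[t+2] + ɤ²·r[t+3] + ... = Σ (k = 0 to ∞) ɤ^k · r[t+k+1]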
Let's understand it with an example. Suppose you live in a place facing water scarcity, and someone tells you that he will give you 100 litres of water (assume, please!) over the next 15 hours, with the amount delivered each hour scaled down by some parameter (ɤ). Let's look at two possibilities.
Possibility 1: ɤ is close to 1. This means we should wait till the 15th hour, because the hourly decrease is not very significant, so it's still worth going till the end. We are also interested in future rewards. So, if the discount factor is close to 1, we will make an effort to go to the end, as the future rewards are of significant importance.
Possibility 2: ɤ is close to 0. This means we are more interested in early rewards, as the rewards shrink significantly with every hour. We might not want to wait till the end (till the 15th hour), as it would be worthless. So, if the discount factor is close to zero, immediate rewards are more important than future ones.
Consider a chess game, where the goal is to defeat the opponent's king. If we give importance to immediate rewards, such as a reward whenever a pawn captures an opponent's piece, then the agent will learn to chase these sub-goals even if its own pieces get captured along the way. So, in this task future rewards are more important. In other tasks we might prefer immediate rewards, as in the water example we saw earlier.
Till now we have seen how a Markov chain defines the dynamics of an environment using a set of states (S) and a transition probability matrix (P). But we know that Reinforcement Learning is all about the goal of maximizing the reward. So, let's add rewards to our Markov chain. This gives us the Markov Reward Process.
Markov Reward Process: As the name suggests, an MRP is a Markov chain with values attached: a reward function gives us the immediate reward for the particular state our agent is in. As we will see later, the goal is to maximize these rewards from each state our agent is in; in simple terms, to maximize the cumulative reward. Formally, a Markov Reward Process consists of:
● S is a set of states,
● P is a state transition probability matrix,
● R is a reward function, and
● ɤ is a discount factor.
Now, let’s develop our intuition for Bellman Equation and Markov Decision Process.
Value Function determines how good it is for the agent to be in a particular state. Of course,
to determine how good it will be to be in a particular state it must depend on some actions that it
will take. This is where policy comes in. A policy defines what actions to perform in a particular
state s.
A policy is a simple function that defines a probability distribution over actions (a ∈ A) for each state (s ∈ S). If an agent follows a policy π at time t, then π(a|s) is the probability that the agent takes action (a) in state (s) at that time step. In Reinforcement Learning, the experience of the agent determines the change in policy. Mathematically, a policy is defined as follows:
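The original formula is not reproduced here; in standard notation it reads:

π(a|s) = P[ A[t] = a | S[t] = s ]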
Policy Function
Now, how do we find the value of a state? The value of state s when the agent follows a policy π, denoted vπ(s), is the expected return starting from s and following the policy π through the next states until we reach the terminal state. We can formulate this as follows (this function is also called the state-value function):
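In standard notation (reconstructed), the state-value function is:

vπ(s) = Eπ[ G[t] | S[t] = s ]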
This equation gives us the expected return starting from state (s) and going to successor states thereafter, with the policy π. One thing to note is that the return we get is stochastic, whereas the value of a state is not stochastic: it is the expectation of the returns from the start state s and thereafter to any other state. Also note that the value of the terminal state (if there is any) is zero.
Suppose our start state is Class 2, and we move to Class 3, then Pass, then Sleep; in short, Class 2 → Class 3 → Pass → Sleep.
Note: with a discount factor of 0.5, the computation is -2 + (-2 × 0.5) + 10 × 0.25 + 0 (not -2 × -2 × 0.5 + 10 × 0.25 + 0), so the return sampled from Class 2 for this episode is -0.5.
Bellman Equation helps us to find optimal policies and value functions. We know that our
policy changes with experience so we will have different value functions according to different
policies. The optimal value function is one that gives maximum value compared to all other
value functions.
The Bellman Equation states that the value function can be decomposed into two parts: the immediate reward R[t+1], and the discounted value of the successor state, ɤ·v(S[t+1]).
Suppose there is a robot in some state (s) and it then moves from this state to some other state (s'). The question is: how good was it for the robot to be in state (s)? Using the Bellman equation, we can say that it is the expectation of the reward it got on leaving state (s) plus the discounted value of the state (s') it landed in.
Backup Diagram
We want to know the value of state s. The value of state (s) is the reward we get upon leaving that state, plus the discounted value of the state we land in, multiplied by the transition probability of moving into it.
Value Calculation
Summing over all possible successor states, the value of s is the expected immediate reward plus the discounted value of each next state multiplied by the probability of moving into that state.
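Written out (reconstructed in the notation used above), the Bellman equation for a Markov Reward Process is:

v(s) = R(s) + ɤ · Σ over s' of P(s, s') · v(s')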
The running time complexity of solving this system of equations directly is O(n³) in the number of states. Therefore, this is clearly not a practical solution for larger MRPs (the same holds for MDPs). Later we will look at more efficient methods such as Dynamic Programming (value iteration and policy iteration), Monte Carlo methods, and Temporal Difference learning. We will also discuss the Bellman Equation in much more detail in a later section.
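The O(n³) cost comes from solving the linear system v = R + ɤPv directly, i.e. v = (I - ɤP)^(-1) R. A sketch with a made-up three-state MRP (the transition matrix and rewards are arbitrary illustrative numbers):

import numpy as np

gamma = 0.9
# Illustrative 3-state MRP: transition matrix P (rows sum to 1) and reward vector R.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.0, 0.8],
              [0.0, 0.0, 1.0]])   # the third state is absorbing (terminal-like)
R = np.array([-1.0, -2.0, 0.0])

# Bellman equation in matrix form: v = R + gamma * P v  =>  (I - gamma * P) v = R
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)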
Markov Decision Process: It is a Markov Reward Process with decisions. Everything is the same as in an MRP, but now we have actual agency: an agent that makes decisions or takes actions. Formally, an MDP consists of:
● S is a set of states,
● A is a set of actions,
● P is a state transition probability matrix,
● R is a reward function, and
● ɤ is a discount factor.
Reward Function
In an MDP, the reward also depends on the action the agent takes, and the agent takes actions following a policy π. In fact, in a Markov Decision Process (MDP) the policy is the mechanism used to take decisions, so now we have a mechanism that will choose which action to take.
Policies in an MDP depend only on the current state; they do not depend on the history. That is the Markov property at work.
We have already seen how good it is for the agent to be in a particular state (the state-value function). Now, let's see how good it is to take a particular action from state s while following a policy π (the action-value function).
This function specifies how good it is for the agent to take action (a) in a state (s) with a policy
π.
Basically, it tells us the value of performing a certain action(a) in a state(s) with a policy π.
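The formula itself is not reproduced above; in standard notation the action-value function is:

qπ(s, a) = Eπ[ G[t] | S[t] = s, A[t] = a ]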
Example of MDP
Now, we can see that there are no more probabilities. In fact, now our agent has choices to make
like after waking up, we can choose to watch Netflix or code and debug. Of course, the actions
of the agent are defined w.r.t some policy π and will get the reward accordingly.
Exploration and Exploitation are strategies for building effective learning algorithms that can adapt and perform optimally in different environments. This section focuses on exploitation: using what the agent already knows, as opposed to exploration, which tries out less familiar options.
Understanding Exploitation
Exploitation is a strategy of using the accumulated knowledge to make decisions that maximize the expected reward based on the present information. The focus of exploitation is on utilizing what is already known about the environment and achieving the best outcome using that knowledge; the agent keeps choosing tried and tested actions, avoiding the uncertainty associated with less familiar options.
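A common way to balance exploitation against exploration is an ε-greedy rule: exploit the best-known action most of the time and explore a random one with probability ε. The sketch below uses made-up Q-value estimates.

import random

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: dict mapping action -> estimated value."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore: try a random action
    return max(q_values, key=q_values.get)     # exploit: pick the best-known action

q = {"left": 0.2, "right": 0.7, "stay": 0.1}   # illustrative estimates
print(epsilon_greedy(q))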
Nowadays, Deep Reinforcement Learning (RL) is one of the hottest topics in the Data Science
community. The fast development of RL has resulted in the growing demand for easy to
understand and convenient to use RL tools.
In recent years, plenty of RL libraries have been developed. These libraries were designed to
have all the necessary tools to both implement and test Reinforcement Learning models.
Still, they differ quite a lot. That’s why it is important to pick a library that will be quick, reliable,
and relevant for your RL task.
In this article we will cover:
● Criteria for choosing a Deep Reinforcement Learning library,
● RL libraries: KerasRL, Pyqlearning, Tensorforce, RL_Coach, TFAgents, Stable Baselines, MushroomRL, RLlib, Dopamine, SpinningUp, garage, Acme, coax, and SURREAL.
Criteria
Each RL library in this article will be analyzed based on the following criteria:
1. Number of state-of-the-art (SOTA) RL algorithms implemented – the most important one
in my opinion
2. Official documentation, availability of simple tutorials and examples
3. Readable code that is easy to customize
4. Number of supported environments – a crucial decision factor for Reinforcement
Learning library
5. Logging and tracking tools support – for example, Neptune or TensorBoard
6. Vectorized environment (VE) feature – method to do multiprocess training. Using parallel
environments, your agent will experience way more situations than with one environment
7. Regular updates – RL develops quite rapidly and you want to use up-to-date
technologies
We will talk about the following libraries:
KerasRL
KerasRL is a Deep Reinforcement Learning Python library. It implements some state-of-the-art
RL algorithms, and seamlessly integrates with Deep Learning library Keras.
Moreover, KerasRL works with OpenAI Gym out of the box. This means you can evaluate and
play around with different algorithms quite easily.
To install KerasRL simply use a pip command:
pip install keras-rl
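A minimal training sketch in the spirit of the keras-rl examples is shown below; the module layout (rl.agents, rl.policy, rl.memory) follows the keras-rl documentation, but compatibility with recent Keras/TensorFlow/Gym releases varies, so treat it as a sketch rather than guaranteed-to-run code.

import gym
from tensorflow.keras.models import Sequential      # the keras-rl2 fork uses tf.keras; original keras-rl uses keras
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

env = gym.make("CartPole-v1")
nb_actions = env.action_space.n

# A small Q-network: observation in, one Q-value per action out.
model = Sequential([
    Flatten(input_shape=(1,) + env.observation_space.shape),
    Dense(16, activation="relu"),
    Dense(nb_actions, activation="linear"),
])

dqn = DQNAgent(model=model, nb_actions=nb_actions,
               memory=SequentialMemory(limit=50000, window_length=1),
               policy=EpsGreedyQPolicy(), nb_steps_warmup=100,
               target_model_update=1e-2)
dqn.compile(Adam(1e-3), metrics=["mae"])
dqn.fit(env, nb_steps=10000, visualize=False, verbose=1)
dqn.test(env, nb_episodes=5, visualize=False)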
Let’s see if KerasRL fits the criteria:
1. Number of SOTA RL algorithms implemented
As of today KerasRL has the following algorithms implemented:
● Deep Q-Learning (DQN) and its improvements (Double and Dueling)
● Deep Deterministic Policy Gradient (DDPG)
● Continuous DQN (CDQN or NAF)
● Cross-Entropy Method (CEM)
● Deep SARSA
As you may have noticed, KerasRL misses two important agents: Actor-Critic Methods and
Proximal Policy Optimization (PPO).
2. Official documentation, availability of tutorials and examples
The code is easy to read and it’s full of comments, which is quite useful. Still, the documentation
seems incomplete as it misses the explanation of parameters and tutorials. Also, practical
examples leave much to be desired.
3. Readable code that is easy to customize
Very easy. All you need to do is to create a new agent following the example and then add it to
rl.agents.
4. Number of supported environments
KerasRL was made to work only with OpenAI Gym. Therefore you need to modify the agent if
you want to use any other environment.
5. Logging and tracking tools support
Logging and tracking tools support is not implemented. Nevertheless, you can use Neptune to
track your experiments.
6. Vectorized environment feature
Includes a vectorized environment feature.
7. Regular updates
The library seems not to be maintained anymore as the last updates were more than a year
ago.
To sum up, KerasRL has a good set of implementations. Unfortunately, it misses valuable points
such as visualization tools, new architectures and updates. You should probably use another
library.
Pyqlearning
Pyqlearning is a Python library to implement RL. It focuses on Q-Learning and multi-agent Deep
Q-Network.
Pyqlearning provides components for designers, not for end user state-of-the-art black boxes.
Thus, this library is a tough one to use. You can use it to design the information search
algorithm, for example, GameAI or web crawlers.
To install Pyqlearning simply use a pip command:
pip install pyqlearning
Let’s see if Pyqlearning fits the criteria:
1. Number of SOTA RL algorithms implemented
As of today Pyqlearning has the following algorithms implemented:
● Deep Q-Learning (DQN) and its improvements (Epsilon Greedy and Boltzmann)
As you may have noticed, Pyqlearning has only one important agent. The library leaves much to
be desired.
2. Official documentation, availability of tutorials and examples
Pyqlearning has a couple of examples for various tasks and two tutorials featuring Maze Solving
and the pursuit-evasion game by Deep Q-Network. You may find them in the official
documentation. The documentation seems incomplete as it focuses on the math, and not the
library’s description and usage.
3. Readable code that is easy to customize
Pyqlearning is an open-source library. Source code can be found on Github. The code lacks
comments. It may be a complicated task to customize it. Still, the tutorials might help.
4. Number of supported environments
Since the library is agnostic, it’s relatively easy to add to any environment.
5. Logging and tracking tools support
The author uses a simple logging package in the tutorials. Pyqlearning does not support other
logging and tracking tools, for example, TensorBoard.
6. Vectorized environment feature
Pyqlearning does not support Vectorized environment feature.
7. Regular updates
The library is maintained. The last update was made two months ago. Still, the development
process seems to be a slow-going one.
To sum up, Pyqlearning leaves much to be desired. It is not a library that you will use commonly.
Thus, you should probably use something else.
Tensorforce
RL_Coach
TFAgents
TFAgents is a Python library designed to make implementing, deploying, and testing RL
algorithms easier. It has a modular structure and provides well-tested components that can be
easily modified and extended.
TFAgents is currently under active development, but even the current set of components makes
it the most promising RL library.
To install TFAgents simply use a pip command:
pip install tf-agents
Let’s see if TFAgents fits the criteria:
1. Number of SOTA RL algorithms implemented
As of today, TFAgents has the following set of algorithms implemented:
● Deep Q-Learning (DQN) and its improvements (Double)
● Deep Deterministic Policy Gradient (DDPG)
● TD3
● REINFORCE
● Proximal Policy Optimization (PPO)
● Soft Actor Critic (SAC)
Overall, TFAgents has a great set of algorithms implemented.
2. Official documentation, availability of tutorials and examples
TFAgents has a series of tutorials on each major component. Still, the official documentation seems incomplete; I would even say there is none. The tutorials and simple examples do their job, but the lack of well-written documentation is a major disadvantage.
3. Readable code that is easy to customize
The code is full of comments and the implementations are very clean. TFAgents seems to have
the best library code.
4. Number of supported environments
The library is agnostic. That is why it’s easy to plug it into any environment.
5. Logging and tracking tools support
Logging and tracking tools are supported.
6. Vectorized environment feature
Vectorized environment is supported.
7. Regular updates
As mentioned above, TFAgents is currently under active development. The last update was
made just a couple of days ago.
To sum up, TFAgents is a very promising library. It already has all necessary tools to start
working with it. I wonder what it will look like when the development is over.
Stable Baselines
MushroomRL
MushroomRL is a Python Reinforcement Learning library whose modularity allows you to use
well-known Python libraries for tensor computation and RL benchmarks.
It enables RL experiments providing classical RL algorithms and deep RL algorithms. The idea
behind MushroomRL consists of offering the majority of RL algorithms, providing a common
interface in order to run them without doing too much work.
To install MushroomRL simply use a pip command.
pip install mushroom_rl
Let’s see if MushroomRL fits the criteria:
1. Number of SOTA RL algorithms implemented
As of today, MushroomRL has the following set of algorithms implemented:
● Q-Learning
● SARSA
● FQI
● DQN
● DDPG
● SAC
● TD3
● TRPO
● PPO
Overall, MushroomRL has everything you need to work on RL tasks.
2. Official documentation, availability of tutorials and examples
The official documentation seems incomplete. It misses valuable tutorials, and simple examples
leave much to be desired.
3. Readable code that is easy to customize
The code lacks comments and parameter descriptions. It's really hard to customize. That said, MushroomRL never positioned itself as a library that is easy to customize.
4. Number of supported environments
MushroomRL supports the following environments:
● OpenAI Gym
● DeepMind Control Suite
● MuJoCo
For more information including installation and usage instructions please refer to official
documentation.
5. Logging and tracking tools support
MushroomRL supports various logging and tracking tools. I would recommend using
TensorBoard as the most popular one.
6. Vectorized environment feature
Vectorized environment feature is supported.
7. Regular updates
The library is maintained. The last updates were made just a few weeks ago.
To sum up, MushroomRL has a good set of algorithms implemented. Still, it misses tutorials and
examples which are crucial when you start to work with a new library.
RLlib
“RLlib is an open-source library for reinforcement learning that offers both high scalability and a
unified API for a variety of applications. RLlib natively supports TensorFlow, TensorFlow Eager,
and PyTorch, but most of its internals are framework agnostic.” ~ Website
1. Number of state-of-the-art (SOTA) RL algorithms implemented
RLlib implements them ALL! PPO? It’s there. A2C and A3C? Yep. DDPG, TD3, SAC? Of
course! DQN, Rainbow, APEX??? Yes, in many shapes and flavours! Evolution
Strategies, IMPALA, Dreamer, R2D2, APPO, AlphaZero, SlateQ, LinUCB, LinTS,
MADDPG, QMIX, … Stop it! I'm not sure if you're just making these acronyms up. Nonetheless,
yes, RLlib has them ALL. See the full list here.
2. Official documentation, availability of simple tutorials and examples
RLlib has comprehensive documentation with many examples. Its code is also well
commented.
3. Readable code that is easy to customize
It’s easiest to customize RLlib with callbacks. Although RLlib is open-sourced and you
can edit the code, it’s not a straightforward thing to do. RLlib codebase is quite
complicated because of its size and many layers of abstractions. Here is a guide that
should help you with that if you want to e.g. add a new algorithm.
4. Number of supported environments
RLlib works with several different types of environments, including OpenAI Gym,
user-defined, multi-agent, and also batched environments. Here you’ll find more.
5. Logging and tracking tools support
RLlib has extensive logging features. RLlib will print logs to the standard output
(command line). You can also access the logs (and manage jobs) in Ray Dashboard. In
this post, I described how to extend RLlib logging to send metrics to Neptune. It also
describes different logging techniques. I highly recommend reading it!
6. Vectorized environment (VE) feature
Yes, see here. Moreover, it’s possible to distribute the training among multiple compute
nodes e.g. on the cluster.
7. Regular updates
RLlib is maintained and actively developed.
From my experience, RLlib is a very powerful framework that covers many applications and at
the same time remains quite easy to use. That being said, because of the many layers of
abstractions, it’s really hard to extend with your code as it’s hard to find where you should even
put your code! That’s why I would recommend it for developers that look for training the models
for production and not for researchers that have to rapidly change algorithms and implement
new features.
Dopamine
“Dopamine is a research framework for fast prototyping of reinforcement learning algorithms. It
aims to fill the need for a small, easily grokked codebase in which users can freely experiment
with wild ideas (speculative research).” ~ GitHub
1. Number of state-of-the-art (SOTA) RL algorithms implemented
It focuses on supporting the state-of-the-art, single-GPU DQN, Rainbow, C51, and IQN
agents. Their Rainbow agent implements the three components identified as most
important by Hessel et al.:
● n-step Bellman updates (see e.g. Mnih et al., 2016)
● Prioritized experience replay (Schaul et al., 2015)
● Distributional reinforcement learning (C51; Bellemare et al., 2017)
2. Official documentation, availability of simple tutorials and examples
Concise documentation is available in the GitHub repo here. It’s not a very popular
framework, so it may lack tutorials. However, the authors provide colabs with many
examples of training and visualization.
3. Readable code that is easy to customize
The authors’ design principles are:
● Easy experimentation: Make it easy for new users to run benchmark
experiments.
● Flexible development: Make it easy for new users to try out research ideas.
● Compact and reliable: Provide implementations for a few, battle-tested
algorithms.
● Reproducible: Facilitate reproducibility in results. In particular, their setup follows
the recommendations given by Machado et al. (2018).
4. Number of supported environments
It’s mainly thought for the Atari 2600 game-playing. It supports OpenAI Gym.
5. Logging and tracking tools support
It supports TensorBoard logging and provides some other visualization tools, presented
in colabs, like recording video of an agent play and seaborn plotting.
6. Vectorized environment (VE) feature
No vectorized environments support.
7. Regular updates
Dopamine is maintained.
If you look for a customizable framework with well-tested DQN based algorithms, then this may
be your pick. Under the hood, it runs using TensorFlow or JAX.
SpinningUp
“While fantastic repos like garage, Baselines, and rllib make it easier for researchers who are
already in the field to make progress, they build algorithms into frameworks in ways that involve
many non-obvious choices and trade-offs, which makes them hard to learn from. […] The
algorithm implementations in the Spinning Up repo are designed to be:
● as simple as possible while still being reasonably good,
● and highly consistent with each other to expose fundamental similarities between
algorithms.
They are almost completely self-contained, with virtually no common code shared between
them (except for logging, saving, loading, and MPI utilities), so that an interested person can
study each algorithm separately without having to dig through an endless chain of
dependencies to see how something is done. The implementations are patterned so that they
come as close to pseudocode as possible, to minimize the gap between theory and code.” ~
Website
1. Number of state-of-the-art (SOTA) RL algorithms implemented
VPG, PPO, TRPO, DDPG, TD3, SAC
2. Official documentation, availability of simple tutorials and examples
Great documentation and education materials with multiple examples.
3. Readable code that is easy to customize
This code is highly readable. From my experience, it's the most readable framework you can find out there. Every algorithm is contained in its own two, well-commented files.
Because of it, it’s also as easy as it can be to modify it. On the other hand, it’s harder to
maintain for the same reason. If you add something to one algorithm you have to
manually add it to others too.
4. Number of supported environments
It supports the OpenAI Gym environments out of the box and relies on its API. So you
can extend it to use other environments that conform to this API.
5. Logging and tracking tools support
It has a light logger that prints metrics to the standard output (cmd) and saves them to a
file. I've written a post on how to add Neptune support to SpinningUp.
6. Vectorized environment (VE) feature
No vectorized environments support.
7. Regular updates
SpinningUp is maintained.
Although it was created as an educational resource, the code simplicity and state-of-the-art
results make it a perfect framework for fast prototyping your research ideas. I use it in my own
research and even implement new algorithms in it using the same code structure. Here you can
find a port of SpinningUp code to the TensorFlow v2 from me and my colleagues from
AwareLab.
garage
“garage is a toolkit for developing and evaluating reinforcement learning algorithms, and an
accompanying library of state-of-the-art implementations built using that toolkit. […] The most
important feature of garage is its comprehensive automated unit test and benchmarking suite,
which helps ensure that the algorithms and modules in garage maintain state-of-the-art
performance as the software changes.” ~ GitHub
1. Number of state-of-the-art (SOTA) RL algorithms implemented
All major RL algorithms (VPG, PPO, TRPO, DQN, DDPG, TD3, SAC, …), with their
multi-task versions (MT-PPO, MT-TRPO, MT-SAC), meta-RL algorithms (Task
embedding, MAML, PEARL, RL2, …), evolutional strategy algorithms (CEM, CMA-ES),
and behavioural cloning.
2. Official documentation, availability of simple tutorials and examples
Comprehensive documentation included with many examples and some tutorials of e.g.
how to add a new environment or implement a new algorithm.
3. Readable code that is easy to customize
It’s created as a flexible and structured tool for developing, experimenting and evaluating
algorithms. It provides a scaffold for adding new methods.
4. Number of supported environments
Garage supports a variety of external environment libraries for different RL training
purposes including OpenAI Gym, DeepMind DM Control, MetaWorld, and PyBullet. You
should be able to easily add your own environments.
5. Logging and tracking tools support
The garage logger supports many outputs including std. output (cmd), plain text files,
CSV files, and TensorBoard.
6. Vectorized environment (VE) feature
It supports vectorized environments and even allows one to distribute the training on the
cluster.
7. Regular updates
garage is maintained.
garage is similar to RLlib. It’s a big framework with distributed execution, supporting many
additional features like Docker, which is beyond simple training and monitoring. If such a tool is
something you need, i.e. in a production environment, then I would recommend comparing it
with RLlib and picking the one you like more.
Acme
“Acme is a library of reinforcement learning (RL) agents and agent building blocks. Acme strives
to expose simple, efficient, and readable agents, that serve both as reference implementations
of popular algorithms and as strong baselines, while still providing enough flexibility to do novel
research. The design of Acme also attempts to provide multiple points of entry to the RL
problem at differing levels of complexity.” ~ GitHub
1. Number of state-of-the-art (SOTA) RL algorithms implemented
It includes algorithms for continual control (DDPG, D4PG, MPO, Distributional MPO,
Multi-Objective MPO), discrete control (DQN, IMPALA, R2D2), learning from
demonstrations (DQfD, R2D3), planning and learning (AlphaZero) and behavioural
cloning.
2. Official documentation, availability of simple tutorials and examples
Documentation is rather sparse, but there are many examples and jupyter notebook
tutorials available in the repo.
3. Readable code that is easy to customize
Code is easy to read but requires one to learn its structure first. It is easy to customize
and add your own agents.
4. Number of supported environments
The Acme environment loop assumes an environment instance that implements the
DeepMind Environment API. So any environment from DeepMind will work flawlessly
(e.g. DM Control). It also provides a wrapper on the OpenAI Gym environments and the
OpenSpiel RL environment loop. If your environment implements the OpenAI or DeepMind API, then you shouldn't have problems with plugging it in.
5. Logging and tracking tools support
It includes a basic logger and supports printing to the standard output (cmd) and saving
to CSV files. I've written a post on how to add Neptune support to Acme.
6. Vectorized environment (VE) feature
No vectorized environments support.
7. Regular updates
Acme is maintained and actively developed.
Acme is simple like SpinningUp but a tier higher when it comes to the use of abstraction. It makes it
easier to maintain – code is more reusable – but on the other hand, harder to find the exact spot
in the implementation you should change when tinkering with the algorithm. It supports both
TensorFlow v2 and JAX, with the second being an interesting option as JAX gains traction
recently.
coax
“Coax is a modular Reinforcement Learning (RL) python package for solving OpenAI Gym
environments with JAX-based function approximators. […] The primary thing that sets coax
apart from other packages is that it is designed to align with the core RL concepts, not with the
high-level concept of an agent. This makes coax more modular and user-friendly for RL
researchers and practitioners.” ~ Website
1. Number of state-of-the-art (SOTA) RL algorithms implemented
It implements classical RL algorithms (SARSA, Q-Learning), value-based deep RL
algorithms (Soft Q-Learning, DQN, Prioritized Experience Replay DQN, Ape-X DQN),
and policy gradient methods (VPG, PPO, A2C, DDPG, TD3).
2. Official documentation, availability of simple tutorials and examples
Clear, if sometimes confusing, documentation with many code examples and algorithms
explanation. It also includes tutorials for running training on Pong, Cartpole, FrozenLake,
and Pendulum environments.
3. Readable code that is easy to customize
Other RL frameworks often hide structure that you (the RL practitioner) are interested in.
Coax makes the network architecture take the center stage, so you can define your own
forward-pass function. Moreover, the design of coax is agnostic of the details of your
training loop. You decide how and when you update your function approximators.
4. Number of supported environments
Coax mostly focuses on OpenAI Gym environments. However, you should be able to
extend it to other environments that implement this API.
5. Logging and tracking tools support
It utilizes the Python logging module.
6. Vectorized environment (VE) feature
No vectorized environments support.
7. Regular updates
coax is maintained.
I would recommend coax for education purposes. If you want to plug-n-play with nitty-gritty
details of RL algorithms, this is a good tool to do this. It’s also built around JAX, which may be a
plus in itself (because of hype around it).
SURREAL
“Our goal is to make Deep Reinforcement Learning accessible to everyone. We introduce
Surreal, an open-source, reproducible, and scalable distributed reinforcement learning
framework. Surreal provides a high-level abstraction for building distributed reinforcement
learning algorithms.” ~ Website
1. Number of state-of-the-art (SOTA) RL algorithms implemented
It focuses on the distributed deep RL algorithms. As for now, the authors implemented
their distributed variants of PPO and DDPG.
2. Official documentation, availability of simple tutorials and examples
It provides basic documentation in the repo of installing, running, and customizing the
algorithms. However, it lacks code examples and tutorials.
3. Readable code that is easy to customize
The code structure can frighten one away; it's not something for newcomers. That being said, the code includes docstrings and is readable.
4. Number of supported environments
It supports OpenAI Gym and DM Control environments, as well as Robotic Suite.
Robosuite is a standardized and accessible robot manipulation benchmark with the
MuJoCo physical engine.
5. Logging and tracking tools support
It includes specialized logging tools for the distributed environment that also allow you to
record videos of agents playing.
6. Vectorized environment (VE) feature
No vectorized environments support. However, it allows one to distribute the training on
the cluster.
7. Regular updates
It doesn’t seem to be maintained anymore.