Dissecting Reinforcement Learning-Part.6
Aug 14, 2017 • Massimiliano Patacchiola
The references for this post are Sutton and Barto's book (chapter 11,
case studies) and "Statistical Reinforcement Learning" by Masashi Sugiyama,
which contains a good description of some of the applications we are going
to encounter. In this post I want you to get your hands dirty! There is a
lot of code to run, parameters to change and graphs to plot. You should
learn by doing. Fork the Github repository if you have a Github account, or
download the latest zip archive from here if you haven’t.
Multi-Armed Bandit
An armed bandit is a fancy name for the slot machines of Las Vegas. They are
bandits because they steal your money! In the 1950s Mosteller and Bush were
studying the effect of reward on mice in a T-maze. In order to compare the
performance with humans they set up a two-armed bandit experiment. The
subjects could choose to pull the left or the right arm in order to receive a
reward. One of the two arms was more generous.
In this experiment the subject has to find a good balance between
exploration and exploitation. Let's suppose the subject plays a single round
and finds out that the left arm is more generous. How to proceed? Remember
that the machines are stochastic, and in a short sequence even the best one
may not return a prize for a while. Should the subject explore the
option that looks inferior or exploit the current best option? Formally we
can define this problem as a Markov decision process with a single state
(see the first post). There are N arms which can be pulled and each
one has a certain probability of returning a prize. We have a single state
and N possible actions (one action for each arm). At each round the agent
chooses one arm to pull and it receives a reward. The goal of the agent is
to maximise the reward. Over the years many solutions to the multi-armed
bandit problem have been proposed. In the following part of the post I will
show you some of these solutions, and I will show you empirically the
results obtained with each one.
I will consider the case where N = 3, meaning that we have 3 possible actions
(3 arms). I will call this example a three-armed testbed. A similar case has
been considered by Sutton and Barto in chapter 2.1 of their book; however,
they used a 10-armed bandit and a Gaussian distribution to model the reward
function. Here I will use a Bernoulli distribution, meaning that the rewards
are either 0 or 1. From the initial state s0 we can choose one of three arms
(A, B, C). The first arm (A) returns a positive reward of 1 with 30%
probability, and it returns 0 with 70% probability. The second arm (B)
returns a positive reward in 50% of the cases. The third arm (C) returns a
positive reward with 80% probability. The utility of each action is therefore
0.3, 0.5, and 0.8. The utility can be estimated at runtime using an action-utility
(or action-value) method. If the action a has been chosen k_a times, leading
to a series of rewards r_1, r_2, ..., r_{k_a}, then the utility of this
specific action can be estimated through:

Q(a) = \frac{r_1 + r_2 + \dots + r_{k_a}}{k_a}
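This sample average can be maintained at runtime with a couple of counters. Here is a minimal sketch (the array names are mine, not necessarily those used in the repository):

import numpy as np

# One entry per arm: total reward collected and number of pulls (hypothetical names)
reward_counter_array = np.zeros(3)
action_counter_array = np.zeros(3)

def update_counters(action, reward):
    """Update the counters after pulling arm 'action' and observing 'reward'."""
    reward_counter_array[action] += reward
    action_counter_array[action] += 1

def utility_estimate():
    """Return the running estimate Q(a) = total reward / pulls for each arm (zero if never pulled)."""
    return reward_counter_array / np.maximum(action_counter_array, 1)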
It is helpful to play with this example and try different strategies. Before
playing we need a way to measure exploration and exploitation. In the next
sections I will quantify exploitation using the average cumulated reward and
exploration using the Root Mean Square Error (RMSE) between the true utility
distribution and the average estimation. When both the RMSE and the average
cumulated reward are low, the agent is using an exploration-based strategy.
On the other hand, when the RMSE and the average cumulated reward are high,
the agent is using an exploitation-based strategy. To simplify our lives I
created a Python module called multi_armed_bandit.py which has a class
called MultiArmedBandit . The only parameter that must be passed to the
object is a list containing the probability ∈ [0, 1] of obtaining a positive
reward:
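For our three-armed testbed the object can be created in a single line. The keyword name reward_probability_list below is my assumption of how the constructor argument is called; check the class definition in the repository:

from multi_armed_bandit import MultiArmedBandit

# Probabilities of a positive reward for arms A, B and C (keyword name assumed)
my_bandit = MultiArmedBandit(reward_probability_list=[0.3, 0.5, 0.8])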
The step() method takes as input an action which represents the index of the
arm that must be pulled. For instance, calling my_bandit.step(action=0) will
pull the first arm, and calling my_bandit.step(action=2) will pull the third.
The step() method returns the reward obtained by pulling that arm, which can
be 1 or 0. The method does not return anything else. Remember that we are
not moving here: there is no point in returning the state at t+1 or a
variable which identifies a terminal state, because, as I said, the multi-
armed bandit has a single state. Now it is time to play! In the next sub-
sections I will show you some of the strategies that can be used in
the three-armed testbed.
Omniscient: the word omniscient derives from Medieval Latin and it means
all-knowing. An omniscient agent knows the utility distribution before
playing and it follows an optimal policy. Let's suppose that you work for
the company that produces the three-armed bandit. Your duty is to write the
firmware of the machine. Since you are the designer you know perfectly the
probability of a positive reward for each one of the three arms. It's time
for vacation and you decide to go to Las Vegas. You enter a casino and you
see right in front of you the particular machine you worked on. What are you
gonna do? Probably you will start pulling the third arm (C) like crazy until
your pockets are full of coins. You know that the best thing to do is to
focus on the third arm because it has an 80% probability of returning a
positive reward. Now let's suppose that the omniscient agent plays for 1000
rounds: what is the cumulated reward obtained in the end? If the third arm
has an 80% probability of returning a coin, we can say that after 1000
rounds the player will get approximately 800 coins. Keep in mind this value
because it is the upper bound for the comparison.
Random: the most intuitive strategy is a random strategy: just pull any arm
with the same probability. This is the strategy of a naive gambler. Let's
see what a random agent will obtain playing this way. We can create a random
agent in a few lines of code:
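The following is a minimal sketch of what such a script might look like, assuming the MultiArmedBandit interface and the constructor keyword described above:

import numpy as np
from multi_armed_bandit import MultiArmedBandit

my_bandit = MultiArmedBandit(reward_probability_list=[0.3, 0.5, 0.8])  # keyword assumed
cumulated_reward = 0
for step in range(1000):
    action = np.random.randint(low=0, high=3)  # pick one of the three arms at random
    reward = my_bandit.step(action)
    cumulated_reward += reward
print("Cumulated Reward: " + str(cumulated_reward))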
Running the script will pull the arms 1000 times, and the reward obtained
will be accumulated in the variable called cumulated_reward . I ran the
script several times (it takes just a few milliseconds) and I obtained
cumulated rewards of 527, 551, 533, 511, 538, 540. Here I want you to reason
about what we got. Why do all the cumulated rewards oscillate around a value
of 530? The random agent pulled the arms with (approximately) the same
probability, meaning that it pulled the first arm 1/3 of the times, the
second arm 1/3 of the times, and the third arm 1/3 of the times. The final
score can be approximated as follows: 300/3 + 500/3 + 800/3 = 533.3 . Remember
that the process is stochastic and we can have a small fluctuation every
time. To neutralise this fluctuation I introduced another loop of 2000
iterations, which repeats the script 2000 times.
The average value of the cumulated reward is 533.4, which is very close to
our estimate. At the same time the RMSE is extremely low (0.0006), meaning
that the random agent is heavily skewed towards exploration rather than
exploitation. The complete code is in the official repository and is called
random_agent_bandit.py .
Greedy: the agent that follows a greedy strategy pulls all the arms in
the first turn, then it selects the arm that returned the highest reward.
This strategy does not really encourage exploration, and this is not
surprising. We have already seen in the second post that a greedy strategy
should be part of a larger Generalised Policy Iteration (GPI) scheme in
order to converge. Only with constant updates of the utility function it is
possible to improve the policy. An agent that uses a greedy strategy can be
fooled by random fluctuations and can come to believe that the second arm is
the best one only because in a short series it returned more coins. Running
the script for 2000 episodes, each one having 1000 rounds, we get an average
cumulated reward of 733, which is significantly above the random score.
We said that the true utility distribution is [0.3, 0.5, 0.8] . The greedy
agent has an average utility distribution of [0.14, 0.27, 0.66] and an RMSE
of 0.18, meaning that it underestimates the utilities because of its blind
strategy, which does not encourage exploration. Here we can see the inverse
pattern with respect to the random agent: for the greedy player both
the average reward and the RMSE are high.
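As a sketch, the greedy choice is just an argmax over the running utility estimates introduced earlier (assuming every arm has already been pulled at least once during the first turn):

import numpy as np

def return_greedy_action(reward_counter_array, action_counter_array):
    """Return the index of the arm with the highest estimated utility."""
    utility_array = reward_counter_array / action_counter_array
    return np.argmax(utility_array)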
Epsilon-greedy: this strategy follows the greedy choice most of the time, but
with a small probability ϵ it pulls a random arm, which guarantees continuous
exploration. The average cumulated reward is 763, which is higher than that
of the random and the greedy agents. The random exploration helps the agent
to converge closely to the true utility distribution, leading to a low
RMSE (0.005).
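A minimal sketch of the selection rule behind this strategy (the function name is mine, not necessarily the one used in the repository):

import numpy as np

def return_epsilon_greedy_action(utility_array, epsilon=0.1):
    """With probability epsilon pull a random arm, otherwise pull the best one."""
    tot_actions = utility_array.shape[0]
    if np.random.uniform(0, 1) <= epsilon:
        return np.random.randint(low=0, high=tot_actions)
    return np.argmax(utility_array)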
Softmax-greedy: instead of always picking the arm with the highest estimated
utility, the agent samples the arm from the softmax distribution of the
utility estimates, so that arms with similar utilities are pulled with
similar probabilities. The strategy relies on a small helper function,
completed here (assuming Numpy is imported as np; the repository version may
differ slightly):

def softmax(x):
    """Compute softmax distribution of array x."""
    e_x = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e_x / e_x.sum()
The result of 767 is slightly higher than epsilon-greedy, but at the same
time the RMSE is also slightly higher. By increasing exploitation we
decreased exploration. As you can see, there is a delicate balance between
exploration and exploitation and it is not so straightforward to find the
right trade-off. The complete code is included in the repository and is
called softmax_agent.py .
Epsilon-decreasing: a refinement of epsilon-greedy in which ϵ starts high and
is decreased during training, so that the agent explores a lot at the
beginning and exploits more and more as the utility estimates improve. The
average cumulated reward is 777, which is higher than the scores obtained
with the previous strategies. At the same time the utility distribution is
close to the original, but the RMSE (0.008) is slightly higher compared to
the epsilon-greedy one (0.005). Once again we can notice how delicate the
balance between exploration and exploitation is.
Boltzmann sampling: another way to balance exploration and exploitation is
to sample the action from a Boltzmann (softmax with temperature) distribution
of the utility estimates:

P(a) = \frac{e^{Q(a)/\tau}}{\sum_{b=1}^{N} e^{Q(b)/\tau}}

where τ is a temperature parameter. For high values of τ all the actions have
nearly the same probability of being sampled, whereas in the limit of τ → 0
the action selection becomes greedy. We can easily implement Boltzmann
sampling in Python:
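A sketch of this function might look as follows (the repository version may differ slightly):

import numpy as np

def boltzmann(x, temperature):
    """Compute the Boltzmann distribution of array x at the given temperature."""
    exponent = (x - np.max(x)) / temperature  # subtract the max for numerical stability
    return np.exp(exponent) / np.sum(np.exp(exponent))

# Sampling an arm, e.g.: action = np.random.choice(3, p=boltzmann(utility_array, temperature))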
The function boltzmann() takes as input an array and a temperature, and
returns the Boltzmann distribution of that array. Once we have the Boltzmann
distribution we can use the Numpy method numpy.random.choice() to sample an
action. The complete script is called boltzmann_agent_bandit.py and is in the
official repository of the project. Running the script with τ decreased
linearly from 10 to 0.01 leads to the following result:
The strategy reached a score of 648 which is the lowest score obtained so
far, but the RMSE on the utility distribution is the lowest as well (0.002).
Changing the temperature decay we can see how the performance increases.
Starting from a value of 0.5 and decreasing to 0.0001 we get the following
results:
As you can see there is a significant increase in the cumulated reward, but
at the same time an increase in the RMSE. As usual, we have to find the
right balance. Let’s try now with an initial temperature of 0.1 and let’s
decrease it to 0.0001:
We obtained a score of 732 and an RMSE of 0.24, which are close to the
results of the greedy agent (reward=733, RMSE=0.18). This is not surprising
because, as I said previously, in the limit of τ → 0 the action selection
becomes greedy. Boltzmann sampling guarantees a wide exploration which
may be extremely useful in large state spaces; however it has some
drawbacks. It is generally easy to choose a value for ϵ in the epsilon-based
strategies, but we cannot say the same for τ : setting τ may require fine
hand tuning, which is not always possible. I suggest you run the script with
different temperature values to see the difference.
Thompson sampling: in this strategy each arm is modelled through a Bernoulli
distribution with an unknown parameter q, the probability of obtaining a
positive reward. Given s successes and f failures, a naive point estimate of
this probability is:

P(q) = \frac{s}{s + f}

However, a single number does not capture the uncertainty about q: what we
really want is the full distribution of q given the observed data. Which kind
of distribution is it? How can we find it? We can use Bayes' theorem. Using
this theorem we can obtain an optimal approximation of the posterior based on
the data collected in the previous rounds. Here I define s and f as the
number of successes and failures accumulated in the previous rounds, and
Bayes' theorem reads:

P(q|s,f) = \frac{P(s,f|q)\, P(q)}{P(s,f)}

From this equation it is clear that for finding the posterior we need the
term P(s, f|q) (likelihood) and the term P(q) (prior). Let's start from the
likelihood. As I said the Bernoulli distribution represents the outcome of a
single experiment. In order to represent the outcome of multiple independent
experiments we have to use the Binomial distribution. This distribution can
tell us what is the probability of having s successes in s + f trials. The
distribution is represented as follows:
P(s,f|q) = \binom{s+f}{s}\, q^{s}\, (1-q)^{f}
Great, we got the first missing term. Now we have to find the prior.
Fortunately the Bernoulli distribution has a conjugate prior which is the
Beta distribution:
P(q) = \frac{q^{\alpha-1}\,(1-q)^{\beta-1}}{B(\alpha,\beta)}
where α, β > 0 are parameters representing the success and failure rate, and
B is a normalization constant (Beta function) which ensures that the
probability integrates to 1. Now we have all the missing terms. Going back
to Bayes' theorem, we can plug the Binomial distribution in place of the
likelihood and the Beta distribution in place of the prior. After some
reductions we arrive at the following result:
P(q|s,f) = \frac{q^{s+\alpha-1}\,(1-q)^{f+\beta-1}}{B(s+\alpha,\, f+\beta)}
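The reduction mentioned above is worth spelling out. Multiplying likelihood and prior and dropping every factor that does not depend on q gives:

P(q|s,f) \propto P(s,f|q)\, P(q) \propto q^{s}(1-q)^{f}\; q^{\alpha-1}(1-q)^{\beta-1} = q^{s+\alpha-1}(1-q)^{f+\beta-1}

Normalising this expression over q ∈ [0, 1] yields exactly the Beta distribution above, with normalisation constant B(s+α, f+β).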
If you take a look at the result you will notice that our posterior is
another Beta distribution. That's a really clean solution to our problem: in
order to obtain the probability of a positive reward for a given arm, we
simply have to plug the parameters (α + s, β + f ) into a Beta distribution.
The main advantage of this approach is that the estimate of the posterior
improves as the number of successes and failures increases. For example,
let's say that we start with α = β = 1, meaning that we suppose a uniform
distribution for the prior. That's reasonable because we do not have any
previous knowledge about the arm. Let's suppose we pull the arm three times
and obtain two successes and one failure: the estimate of the Bernoulli
parameter for that arm is then given by Beta(α + 2, β + 1). This is the best
estimate we can make after three rounds, and as we keep playing the posterior
will get more and more accurate. Thompson sampling can also be used with
non-Bernoulli distributions: if the reward is modelled by a multinomial
distribution we can use the Dirichlet distribution as conjugate prior, and if
the reward is modelled by a Gaussian distribution we can use the Gaussian
itself as conjugate prior.
In python we can easily implement the Thompson agent for the three-armed
bandit testbed. It is necessary to keep a record of successes and failures
in two Numpy arrays. Those arrays are passed to the following function:
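A possible sketch of this function (the names are mine; the repository version may differ):

import numpy as np

def return_thompson_action(success_counter_array, failure_counter_array):
    """Sample one value per arm from its Beta posterior and return the best arm.

    A uniform Beta(1, 1) prior is assumed, i.e. alpha = beta = 1.
    """
    beta_sampling_array = np.random.beta(success_counter_array + 1,
                                         failure_counter_array + 1)
    return np.argmax(beta_sampling_array)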
Numpy implements the method numpy.random.beta() which takes as input the two
arrays (α + s, β + f ) and returns an array containing the values sampled from
the underlying Beta distributions. Once we have this array we only have to
take the action with the highest sampled value using np.argmax() , which
corresponds to the arm that currently looks the most promising. Running the
script thompson_agent_bandit.py we get the following results:
The average cumulated reward obtained is 791, which is the highest score
reached so far, and it is very close to the optimal strategy of the
omniscient player. At the same time the RMSE (0.05) on the utility
distribution is fairly low. Thompson sampling seems to be the perfect
strategy for balancing exploration and exploitation; however there are some
drawbacks. In our example the reward was modelled by a Bernoulli
distribution, whose conjugate Beta posterior is easy to compute, but this was
an oversimplification. It can be difficult to approximate the posterior
distribution when the underlying function is completely unknown; moreover,
evaluating the posterior requires an integration which may be computationally
expensive.
Comparing the results of the different strategies on a bar chart, we can see
the performance obtained at a glance. Thompson sampling seems to be the
best strategy in terms of score, but in practical terms it can be difficult
to apply. Softmax-greedy and epsilon-greedy are pretty similar, and choosing
one or the other depends on how much you want to encourage exploration. The
epsilon-decreasing strategy is most of the time a safe choice, since it
has been widely adopted and it has a clear dynamic in a wide variety of
cases. For instance, modern approaches (e.g. DQN, Double DQN, etc.) use
epsilon-based strategies. The homework for this section is to increase the
number of arms and run the algorithms to see which one performs better. To
change the number of arms you simply have to modify the very first line in
the main function:
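In the three-armed testbed that line presumably looks like this (the variable name is taken from the ten-armed example below):

reward_distribution = [0.3, 0.5, 0.8]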
To generate a 10-armed bandit you can modify the variable in this way:
reward_distribution = [0.7, 0.4, 0.6, 0.1, 0.8, 0.05, 0.2, 0.3, 0.9, 0.1]
Once you have a new distribution you can run the script and record the
performance of each strategy. Increasing the number of arms should make
exploration more important: a larger action space requires more exploration
in order to obtain a higher reward.
In this section we saw how exploration can affect the results in a simple
three-armed testbed. Multi-armed bandit problems are part of our daily life:
the doctor who has to choose the best treatment for a patient, the web
designer who has to find the best template for maximising the AdSense clicks,
or the entrepreneur who has to decide how to allocate a budget for maximising
the income. Now you know some strategies for dealing with these problems. In
the next section I will introduce the mountain car problem, and I will show
you how to use reinforcement learning to tackle it.
Mountain Car
The mountain car is a classic reinforcement learning problem. This problem
was first described by Andrew Moore in his PhD thesis and is defined as
follows: a mountain car is moving on a two-hills landscape. The engine of
the car does not have enough power to cross a steep climb. The driver has to
find a way to reach the top of the hill.
A good explanation of the problem is presented in chapter 4.5.2 of
Sugiyama's book, and I will follow the same mathematical convention here. The
state space is defined by the position x ∈ [-1.2, +0.5] (m) of the car along
a landscape whose profile is given by sin(3x), and by the velocity
ẋ ∈ [-1.5, +1.5] (m/s). There are three possible actions a = [-2.0, 0.0,
+2.0], which are the values of the force applied to the car (left, no-op,
right). The reward is a positive 1.0 only if the car reaches the goal. A
negative cost of living of -0.01 is applied at every time step. The mass of
the car is m = 0.2 kg, the gravity is g = 9.8 m/s², the friction coefficient
is k = 0.3 N, and the time step is Δt = 0.1 s. Given all these parameters,
the position and velocity of the car at t + 1 are updated using the following
equations:
x_{t+1} = x_t + \dot{x}_{t+1}\, \Delta t

\dot{x}_{t+1} = \dot{x}_t + \left( g\, m \cos(3 x_t) + \frac{a_t}{m} - k\, \dot{x}_t \right) \Delta t
The mountain car environment has been implemented in OpenAI Gym; however,
here I will build everything from scratch for pedagogical reasons. In the
repository you will find the file mountain_car.py which contains a class
called MountainCar . I built this class using only Numpy and matplotlib .
The class contains methods similar to the ones used in OpenAI Gym.
The main method is called step() and allows executing an action in the
environment. This method returns the state at t+1, the reward, and a value
called done which is True if the car reaches the goal. The method contains
the implementation of the equations of motion and uses the parameters
previously defined.
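To make the equations of motion concrete, here is a minimal sketch of how the update could be implemented (an illustration, not necessarily the exact code of the MountainCar class; the action is the force value -2.0, 0.0 or +2.0):

import numpy as np

def update_state(position, velocity, action, mass=0.2, friction=0.3, gravity=9.8, delta_t=0.1):
    """Apply the equations of motion for one time step."""
    velocity_t1 = velocity + (gravity * mass * np.cos(3.0 * position)
                              + (action / mass)
                              - (friction * velocity)) * delta_t
    position_t1 = position + (velocity_t1 * delta_t)
    # Keep the car inside the boundaries of the state space
    position_t1 = np.clip(position_t1, -1.2, 0.5)
    velocity_t1 = np.clip(velocity_t1, -1.5, 1.5)
    return position_t1, velocity_t1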
I added a useful method called render() which can save the episode
animation as a gif or a video (it requires imagemagick and avconv). This
method can be called every k episodes in order to save the animation and
check the improvements. For example, to save an mp4 video it is possible to
call the method with the following parameters:
my_car.render(file_path='./mountain_car.mp4', mode='mp4')
If you want an animated gif instead you can call the method in this way:
my_car.render(file_path='./mountain_car.gif', mode='gif')
Now let's use the class to build an agent which follows a random policy for
choosing the actions. Here I will use a time step of 0.1 seconds and a total
of 100 steps (which means a 10-second episode). The code is very compact, and
at this point of the series you can easily understand it without any
additional comment:
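A sketch of such a script, under the assumption that the MountainCar constructor accepts the physical parameters listed above and that step() takes the index of the action, as in the other environments of the repository:

import numpy as np
from mountain_car import MountainCar

my_car = MountainCar(mass=0.2, friction=0.3, delta_t=0.1)  # constructor keywords assumed
cumulated_reward = 0
print("Starting random agent...")
for step in range(100):
    action = np.random.randint(low=0, high=3)  # index of one of the three forces
    observation, reward, done = my_car.step(action)
    cumulated_reward += reward
    if done: break
print("Finished after: " + str(step+1) + " steps")
print("Cumulated Reward: " + str(cumulated_reward))
my_car.render(file_path='./mountain_car.gif', mode='gif')
print("Complete!")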
Observing the behaviour of the car in the animation generated by the script,
it is possible to see how difficult the task is. Using a purely random
policy the car remains at the bottom of the valley and never reaches the
goal. The optimal policy is to move to the left to accumulate momentum and
then push as hard as possible to the right.
How can we deal with this problem using a discrete approach? We said that
the state space is continuous, meaning that we have infinitely many values to
take into account. What we can do is divide the continuous state-action space
into chunks. This is called discretization. If the car moves in a continuous
space enclosed in the range [−1.2, 0.5], it is possible to create 10 bins to
represent the position: when the car is at -1.10 it is in the first bin,
when at -0.9 in the second, and so on.
In our case both position and velocity must be discretized, and for this
reason we need two arrays to store all the states. Here I call bins the
discrete containers (entries of the arrays) where both position and velocity
are stored. In Numpy it is easy to create these containers using the
numpy.linspace() function. The two arrays can be used to index a policy
matrix. In the script I defined the policy matrix as a square matrix of size
tot_bins , meaning that both velocity and position have the same number
of bins. However, it is possible to discretize velocity and position
differently, obtaining a rectangular matrix.
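A minimal sketch of this discretization, using numpy.digitize() to map a continuous value to its bin (the helper name is mine):

import numpy as np

tot_bins = 12
# tot_bins - 1 interior edges define tot_bins bins over each range
position_bins = np.linspace(-1.2, 0.5, num=tot_bins + 1)[1:-1]
velocity_bins = np.linspace(-1.5, 1.5, num=tot_bins + 1)[1:-1]

def discretize(position, velocity):
    """Return the (row, column) indices of the state in a tot_bins x tot_bins policy matrix."""
    return np.digitize(velocity, velocity_bins), np.digitize(position, position_bins)

The pair of indices can then be used to address the policy matrix, e.g. policy_matrix[velocity_index, position_index].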
Now it is time to use reinforcement learning to master the mountain car
problem. Here I will use the temporal differencing method called SARSA, which
I introduced in the third post. I suggest you try other methods as well, to
check the different performances you may obtain. Only a few changes are
required in order to run the code of the previous posts (a sketch of the
SARSA update on the discretized states is shown after the policy matrix
below). In this example I trained the policy for 10^5 episodes
( gamma =0.999, tot_bins =12), using a decayed epsilon (from 0.9 to 0.1)
which helped exploration in the first part of the training. The script
automatically saves a gif and the plots every 10^4 episodes. The following is
the graph of the cumulated reward per episode, where the light-red line is
the raw data and the dark line a moving average over 500 episodes. At the end
of the training the script prints a summary of the last episode and the
learned policy matrix:
Episode: 100001
Epsilon: 0.1
Episode steps: 67
Cumulated Reward: 0.34
Policy matrix:
O < O O O < > < > O O <
< < > < < > < < > > O >
O > < < < < < < > < O <
O < < < > > < < > > < O
O > < < > > > < > O > >
O > > < > O > < > > < <
< O > < > > < < > > > >
O > < < > > > > > > > >
< > > > > > > > > > > >
O > > > > > > > > > O >
< < > > > > O > > > > >
> O > > > > > > > < > <
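As mentioned above, here is a minimal sketch of the SARSA update applied to the discretized state (the variable names are illustrative; the full training loop in the repository differs):

import numpy as np

alpha = 0.001   # learning rate (illustrative value)
gamma = 0.999
tot_bins = 12
tot_actions = 3
# One utility value for each (velocity bin, position bin, action) triple
state_action_matrix = np.zeros((tot_bins, tot_bins, tot_actions))

def sarsa_update(state, action, reward, new_state, new_action):
    """SARSA temporal-difference update: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    q_sa = state_action_matrix[state[0], state[1], action]
    q_sa_next = state_action_matrix[new_state[0], new_state[1], new_action]
    td_error = reward + gamma * q_sa_next - q_sa
    state_action_matrix[state[0], state[1], action] = q_sa + alpha * td_error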
Inverted Pendulum
The inverted pendulum is another classical problem, which is considered a
benchmark in control theory. James Roberge was probably the first author to
present a solution to the problem in his bachelor thesis back in 1960. The
problem consists of a pole hinged on a cart which must be moved in order to
keep the pole in vertical position. The inverted pendulum is well described
in chapter 4.5.1 of Sugiyama’s book. Here I will use the same mathematical
notation. The state space consists of the angle ϕ ∈ [−π/2, π/2] (rad), which
is zero when the pole is perfectly vertical, and the angular velocity
ϕ̇ ∈ [−π, π] (rad/s). The action space is discrete and it consists of three
forces [-50, 0, 50] (Newton) which can be applied to the cart in order to
swing the pole up.
The system has several parameters which determine the dynamics: the mass
m = 2 kg of the pole, the mass M = 8 kg of the cart, the length d = 0.5 m of
the pole, and the time step Δt = 0.1 s. Given these parameters, the angle ϕ
and the angular velocity ϕ̇ at t + 1 are updated as follows:

\phi_{t+1} = \phi_t + \dot{\phi}_{t+1}\, \Delta t

\dot{\phi}_{t+1} = \dot{\phi}_t + \frac{g \sin(\phi_t) - \alpha\, m\, d\, (\dot{\phi}_t)^{2} \sin(2\phi_t)/2 + \alpha \cos(\phi_t)\, a_t}{4d/3 - \alpha\, m\, d \cos^{2}(\phi_t)}\, \Delta t

where a_t is the force applied at time t and α = 1/(m + M).
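A sketch of how this update could be implemented, with α = 1/(m + M) as above (an illustration, not necessarily the exact code of the repository's environment):

import numpy as np

def update_pendulum(angle, angular_velocity, action, m=2.0, M=8.0, d=0.5, gravity=9.8, delta_t=0.1):
    """One step of the inverted pendulum dynamics; 'action' is the force applied to the cart."""
    alpha = 1.0 / (m + M)
    numerator = (gravity * np.sin(angle)
                 - alpha * m * d * (angular_velocity ** 2) * np.sin(2 * angle) / 2.0
                 + alpha * np.cos(angle) * action)
    denominator = (4.0 / 3.0) * d - alpha * m * d * (np.cos(angle) ** 2)
    angular_velocity_t1 = angular_velocity + (numerator / denominator) * delta_t
    angle_t1 = angle + angular_velocity_t1 * delta_t
    return angle_t1, angular_velocity_t1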
We can test the performance of an agent which follows a random policy. The
code is called random_agent_inverted_pendulum.py and is available in the
repository. Using a random strategy on the pole-balancing environment leads
to unsatisfactory performance: the best I got running the script multiple
times was a very short episode of 1.5 seconds.
The optimal policy consists in compensating for the angle and speed
variations, keeping the pole as vertical as possible. As for the mountain
car, I will deal with this problem using discretization: both velocity and
angle are discretized in bins of equal size and the resulting arrays are used
as indices of a square policy matrix. As algorithm I will use first-visit
Monte Carlo for control, which was introduced in the second post of the
series. I trained the policy for 5 × 10^5 episodes ( gamma =0.999).
Drone landing
There are plenty of possible applications for reinforcement learning. One of
the most interesting is robot control. Reinforcement learning offers a wide
set of techniques for the implementation of complex policies. For example,
it has been applied to humanoid control and to helicopter acrobatic
maneuvers. For a recent survey I suggest you read the article of Kober et al.
(2013). In this example we are going to use reinforcement learning for the
control of an autonomous drone. In particular, we have to train the drone to
land on a ground platform.
The environment is implemented by a class called DroneLanding which can be
created in a single line:

my_drone = DroneLanding(world_size=11)

The only requirement is the size of the world in meters. The class
implements the usual methods step() , reset() and render() . The step method
takes an integer representing one of the six actions (forward, backward,
left, right, up, down) and returns the observation at t+1, represented by a
tuple (x,y,z) which identifies the position of the drone. As usual the
method also returns the reward and the boolean variable done , which is
True in case of a terminal state (the drone has landed). The method render()
is based on matplotlib and generates a gif or a video with the movement of
the drone in a three-dimensional graph. Let's start with a random agent; here
is the code:
import numpy as np
# The DroneLanding class is provided in the official repository.

my_drone = DroneLanding(world_size=11)
cumulated_reward = 0
print("Starting random agent...")
for step in range(50):
    action = np.random.randint(low=0, high=6)  # sample one of the six actions
    observation, reward, done = my_drone.step(action)
    print("Action: " + str(action))
    print("x-y-z: " + str(observation))
    print("")
    cumulated_reward += reward
    if done: break
print("Finished after: " + str(step+1) + " steps")
print("Cumulated Reward: " + str(cumulated_reward))
my_drone.render(file_path='./drone_landing.gif', mode='gif')
print("Complete!")
Running the script several times, you can get an idea of how difficult the
task is. It is very hard to reach the platform using a random strategy: in
a world of size 11 meters there is only a 0.07% probability of obtaining the
reward. Here you can see the gif generated for an episode of the random
agent.
The drone is represented by a red dot, the red surface represents the area
where landing leads to a negative reward, and the green square in the centre
is the platform. As you can see, the drone keeps moving in the same part of
the room and completes the episode without landing at all. Here I will tackle
the problem using Q-learning, a technique that was introduced in the
third post of the series. The code is pretty similar to the one used for the
gridworld and you can find it in the official repository in the file called
qlearning_drone_landing.py . The maximum cumulated reward for an episode (50
steps) is 1.0 (if the drone lands at the very first step), it is -1.5 if the
drone is so unlucky as to land outside the platform at the very last step,
and it is -0.5 if the drone keeps moving without landing at all (the sum of
the negative cost of living of -0.01 over the 50 steps). Running the script
for 5 × 10^5 episodes, using an epsilon-greedy strategy (epsilon=0.1), I
obtained a stable policy: with a world of 11 meters it was pretty easy.
Let's try now with a world of 21 meters. In the script
qlearning_drone_landing.py you simply have to set the parameter
world_size=21 . In this new environment the
probability of obtaining a reward goes down to 0.01%. I will not change any
other parameter, because in this way we can compare the performance of the
algorithm on this world with the performance on the previous one.
If you look at the plot you will notice two things. First, the reward grows
extremely slowly, reaching an average of 0.6. Second, the number of episodes
is much higher: I had to train the policy for 25 × 10^5 episodes, five times
as many as in the previous environment. It took 40 minutes on the same laptop
used for the previous experiment. Looking at the gif created at the end of
the training, we can see that eventually the policy is robust enough to
guarantee the landing on the platform.
At this point it should be clear why using a lookup table for storing the
state-action utilities is a limited approach: when the state space grows we
need to increase the size of the table. Starting from a world of size 11 and
6 total actions, we need a lookup table of size 11 × 11 × 11 = 1331 in order
to store all the states, and a table of size 11 × 11 × 11 × 6 = 7986 in order
to store all the state-action pairs. Roughly doubling the size of the world
to 21 makes the table approximately 7 times larger (21³ = 9261), and moving
to a world of size 31 we need a table which is about 22 times larger
(31³ = 29791). In the following image I summarized these observations: the
orange squares represent the size of the lookup table required to store the
state-action pairs, the darker the square the larger the table.
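The growth is easy to verify with a couple of lines of Python:

# Number of states and state-action pairs for a cubic world with 6 actions
for world_size in [11, 21, 31]:
    tot_states = world_size ** 3
    tot_pairs = tot_states * 6
    ratio = tot_states / float(11 ** 3)
    print(world_size, tot_states, tot_pairs, round(ratio, 1))
# Prints: 11 1331 7986 1.0 / 21 9261 55566 7.0 / 31 29791 178746 22.4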
The problem of exploring a large space gets worse and worse as the number of
dimensions grows. In our case we had only three dimensions to take into
account, but considering larger hyper-spaces makes everything more
complicated. This is a well-known problem called the curse of dimensionality,
a term coined by Richard Bellman (a guy you should know; go back to the first
post to recall who he is). In the next post we will see how to overcome this
problem using a function approximator. In the last section I want to
introduce some other problems, problems which are considered extremely hard
and cannot be solved so easily.
Hard problems
The problems that I described above are difficult but not extremely
difficult: in the end we managed to find good policies using a tabular
approach. Which kinds of problems are hard to solve using reinforcement
learning?
A humanoid puppet has many degrees of freedom, and coordinating all of them
is really hard. The state space is large and is represented by the velocity
and position of multiple joints that must be controlled in synchrony in
order to achieve an effective movement. The action space is the amount of
torque that can be applied to each joint. Joint position, joint velocity
and torque are continuous quantities. The reward function depends on the
task: for example, in a bipedal walker the reward could be the distance
covered by the puppet in a finite amount of time. Trying to obtain decent
results using a discretised approach is infeasible. Over the years different
techniques have been applied with more or less success, but despite recent
advancements humanoid control is still considered an open problem. If you
want to try, there is an implementation of a bipedal walker in OpenAI Gym. Is
there something harder than humanoid control? Probably yes: what about
videogames?
If you have played with Atari 2600 games you may have noticed that some of
them are really hard. How can an algorithm play those games? Well, we can
cheat: if the game can be reduced to a limited set of features, it is
possible to use model-based reinforcement learning to solve it. However, most
of the time the reward function and the transition matrix are unknown. In
these cases the only solution is to use the raw colour image as state space.
The state space represented by a raw image is extremely large, and there is
no point in using a lookup table for such a big space because most of the
states would remain unvisited. We should use an approximator which can
describe the state space with a reduced set of parameters. Soon I will show
you how deep reinforcement learning can use neural networks to master this
kind of problem.
Conclusions
Here I presented some classical reinforcement learning problems showing how
the techniques of the previous posts can be used to obtain stable policies.
However we always started from the assumption of a discretized state space
which was described by a lookup table or matrix. The main limitation of this
approach is that in many applications the state space is extremely large and
it is not possible to visit all the states. To solve this problem we can use
function approximation. In the next post I will introduce function
approximation and I will show you how a neural network can be used in order
to describe a large state space. The use of neural networks opens up new
horizons and it is the first step toward modern methods such as deep
reinforcement learning.
Index
1. [First Post] Markov Decision Process, Bellman Equation, Value iteration
and Policy Iteration algorithms.
2. [Second Post] Monte Carlo Intuition, Monte Carlo methods, Prediction and
Control, Generalised Policy Iteration, Q-function.
3. [Third Post] Temporal Differencing intuition, Animal Learning, TD(0),
TD(λ) and Eligibility Traces, SARSA, Q-learning.
4. [Fourth Post] Neurobiology behind Actor-Critic methods, computational
Actor-Critic methods, Actor-only and Critic-only methods.
5. [Fifth Post] Evolutionary Algorithms introduction, Genetic Algorithm in
Reinforcement Learning, Genetic Algorithms for policy selection.
6. [Sixth Post] Reinforcement learning applications, Multi-Armed Bandit,
Mountain Car, Inverted Pendulum, Drone landing, Hard problems.
Resources
The complete code for the Reinforcement Learning applications is
available on the dissecting-reinforcement-learning official repository on
GitHub.
References
Abbeel, P., Coates, A., Quigley, M., & Ng, A. Y. (2007). An application of
reinforcement learning to aerobatic helicopter flight. In Advances in neural
information processing systems (pp. 1-8).