Studies On Model Free Control of Dynamical Systems

Optimal Control

Professor: ***
Student: Federico Berto
Abstract
Modeling is well known to be one of the most difficult challenges when developing
control systems: partial observability and disturbances can make the system unstable,
especially in the case of complex and nonlinear systems. Model-free control is an
alternative that solves control problems without the need to design a complex model in
the first place: in particular, Reinforcement Learning has shown promising results in
controlling nonlinear systems even when the state variables cannot be measured directly,
such as when sensors are unavailable or broken. The goal of this report is to demonstrate
the ability of the Deep Q-Network algorithm to control a dynamical system, with the aim
of stabilizing an inverted pendulum using only raw pixels as state feedback: unlike most
controllers, this model-free formulation does not require any assumption or prior
knowledge about the dynamical system.
Introduction
Reinforcement Learning [9] is a branch of Machine Learning that has seen a surge in
interest in recent years because of its ability to handle complex tasks with minimal
human intervention and, in the case of model-free reinforcement learning, to learn a
representation of the state-action space without needing the actual state variables.
Reinforcement Learning (RL) has been successfully used for playing complex Atari games
[4], for mastering the game of Go [8] and even StarCraft II [10], showing important results
on high-dimensional state-action spaces.
Another interesting application is model-free control, which is still an open research
topic [6] given the complex nature of the continuous state: we explore this setup in the
report and show its promising capabilities in handling nonlinear systems without an
explicit model formulation, unlike most controllers.
Moreover, we show that, to the best of the author's knowledge, the performance of the
controller is better than that of other similar raw-pixel-based inverted pendulum setups,
thanks to some tweaks we will explore.
Figure 1: Agent-Environment interface for the inverted pendulum system. π∗(a|s) denotes
the optimal policy
• Value function: whereas the reward defines what is good in the short term, the value
function defines what is good in the long run; it is defined as the sum of the discounted
future rewards that can be accumulated from a particular state.
The goal of the control system (agent) is to maximize the cumulative discounted reward
obtained from the environment, defined as:
$$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \qquad (1)$$
where γ ∈ [0, 1] is a discount factor used for giving a lighter weight to far-away future
rewards.
For Deep Q-Networks, the policy π selecting the actions that lead to the highest reward
can be chosen in the ε-greedy way:
$$\pi(a|s) = \begin{cases} 1 - \varepsilon, & \text{if } a = \operatorname*{argmax}_{a \in A} q^*(s, a) \\ \varepsilon, & \text{a random action} \end{cases} \qquad (2)$$
where q∗(s, a) denotes the state-action value and ε is chosen depending on how much
exploration (randomly trying new actions to discover possibly better states and actions)
versus exploitation (choosing the best known action) we want the system to perform.
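As an illustration, the following is a minimal sketch of ε-greedy action selection in Python; the names q_values (the network's value estimates for the current state) and epsilon are hypothetical and not taken from the report's code.

import random
import torch

def select_action(q_values: torch.Tensor, epsilon: float) -> int:
    """Pick the greedy action with probability 1 - epsilon, a random one otherwise."""
    if random.random() > epsilon:
        # Exploitation: action with the highest estimated state-action value
        return int(q_values.argmax().item())
    # Exploration: uniformly random action
    return random.randrange(q_values.shape[-1])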
Observation Space

Num   Observation              Min      Max
0     Cart Position            -2.4     2.4
1     Cart Velocity            −∞       +∞
2     Pole Angle               -41.8°   41.8°
3     Pole Angular Velocity    −∞       +∞
Trajectories of the system   In order to optimize the system, we need to know which
trajectories lead to the best cumulative reward. In this setup, the agent is the entity
acting on the environment: given a state St ∈ S, it selects one or more actions At ∈ A to
interact with the environment. The agent then gets feedback from the environment itself:
it receives a scalar-valued reward Rt+1 ∈ R and moves to a new state St+1 ∈ S. In a finite
Markov Decision Process the sets of states, actions and rewards (S, A, R) contain a finite
number of elements; nonetheless, the problem can be extended to the continuous case. The
agent-environment feedback loop repeats over time and is updated at each time step: action
selection, state and reward signals are exchanged only at specific moments in time
(t = 0, 1, 2, 3, ...). The sequence of actions, states and rewards defines a trajectory in
time that can be described as:

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots \qquad (3)$$

This sequence can be used on the fly for optimizing the system, or stored in a buffer and
used later in randomized batches, allowing for better training stability.
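As an illustrative sketch (not the report's actual code), the loop below collects such a trajectory from a Gym environment and stores the transitions for later batched updates; the env and policy objects are assumed to exist and the pre-0.26 Gym API is assumed.

import gym

def collect_episode(env: gym.Env, policy, memory: list) -> float:
    """Run one episode, storing (S_t, A_t, R_{t+1}, S_{t+1}) transitions."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                           # select A_t from S_t
        next_state, reward, done, _ = env.step(action)   # receive R_{t+1} and S_{t+1}
        memory.append((state, action, reward, next_state))
        total_reward += reward
        state = next_state
    return total_reward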
Experimental Setup
We conduct the experiments with the inverted pendulum simulation from OpenAI Gym [1].
Also known as the cart-pole, the inverted pendulum is a common benchmark for testing
control systems: it is relatively simple and yet challenging because of the system's
nonlinearities. This system can be controlled with well-known controllers, such as the
Linear Quadratic Regulator for the linearized version or, e.g., Model Predictive Control
for the nonlinear version, as in the pendulum swing-up; our task, however, is more
challenging because of the absence of a model of the system.
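For reference, a minimal way to instantiate this benchmark with OpenAI Gym is sketched below (the environment name follows [1]; the printed representations may vary with the Gym version):

import gym

# CartPole-v0: two discrete actions (push the cart left or right), episodes capped at a score of 200
env = gym.make('CartPole-v0')
print(env.action_space)       # Discrete(2)
print(env.observation_space)  # 4-dimensional Box with the bounds listed in the table above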
The Algorithm: Deep Q-Learning   The Deep Q-Learning algorithm has shown state-of-the-art
performance in multiple frameworks, such as the Atari framework [4]. In order to
approximate the action-value function, we employ a Convolutional Neural Network
Q(s, a; θ) ≈ Q∗(s, a). For our purposes, θ denotes the set of weights of the network,
which we refer to as the Q-network. A Q-network can be trained by minimizing a sequence
of loss functions Li(θi) changing at each iteration i,

$$L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot)}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right]$$

where $y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a\right]$ is the target for iteration i and ρ(s, a) is a
probability distribution over sequences s and actions a. The parameters from the previous
iteration θi−1 are kept fixed when optimizing the loss function Li(θi). Note that the
targets depend on the network weights: in this respect they differ from the targets used
in supervised learning, which are fixed before learning begins.
Since we want to minimize the loss function to obtain a better fit of the network to the
real action-value function, we compute its gradient with respect to the weights:

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot);\, s' \sim \mathcal{E}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right) \nabla_{\theta_i} Q(s, a; \theta_i)\right]$$

The loss function is then minimized using either stochastic gradient descent or more
sophisticated methods like RMSprop. The behavior policy is ε-greedy, ensuring sufficient
exploration.
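A schematic sketch of one such update step in PyTorch is given below; policy_net, target_net, optimizer and the batch tensors are hypothetical placeholders, terminal-state masking is omitted, and the loss is written as a plain squared error for simplicity.

import torch
import torch.nn.functional as F

def optimize_step(policy_net, target_net, optimizer, batch, gamma=0.999):
    """One stochastic gradient step on the DQN loss for a mini-batch of transitions."""
    states, actions, rewards, next_states = batch
    # Q(s, a; theta_i) for the actions that were actually taken
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Target y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}), computed with frozen weights
    with torch.no_grad():
        next_q = target_net(next_states).max(1)[0]
    targets = rewards + gamma * next_q
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()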
What makes the algorithm work is a technique known as experience replay [7], by which we
store the agent's experiences (also referred to as transitions) at each time step,
et = (st, at, rt, st+1), in a data set D = (e1, ..., eN), pooled over many episodes into a
replay memory. In the inner loop of our algorithm we apply Q-Learning updates, or
mini-batch updates, to samples of experience, e ∼ D, picked at random from the pool of
stored samples. After performing experience replay, the agent executes an action according
to an ε-greedy policy. Deep Q-Learning with experience replay is written in pseudo-code in
Algorithm 1.
The following is the Python code for the experience replay, which is of paramount importance
for decoupling time-correlated transitions that would otherwise introduce a bias in the system:
import random
from collections import namedtuple

# A single transition e_t = (s_t, a_t, r_t, s_{t+1})
Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state'))

class ReplayMemory(object):
    """Replay Memory class for storing the system's trajectories"""
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0

    def push(self, *args):
        """Save a transition, overwriting the oldest one once capacity is reached"""
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = Transition(*args)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        """Sample a random mini-batch of stored transitions"""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
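A brief usage sketch with hypothetical variable names, using the capacity and batch size listed later in the hyper-parameter table:

memory = ReplayMemory(100000)                    # Maximum Memory Size
memory.push(state, action, reward, next_state)   # store the latest transition
if len(memory) >= 128:                           # Batch Size
    transitions = memory.sample(128)             # randomized mini-batch for the Q-network update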
Experience replay gives Deep Q-Learning several advantages. Firstly, since each step of
experience is potentially used in many weight updates, we obtain greater data efficiency.
Secondly, randomizing the samples improves efficiency and reduces the variance of the
updates, since learning directly from consecutive samples of experience is inefficient due
to the strong correlations between them. Thirdly, by learning off-policy we may be able to
avoid dangerous unwanted feedback loops. For example, if we trained on-policy and the
maximizing action were to move left, the training samples would be dominated by samples
from the left-hand side; if the maximizing action were instead to move right, the same
would happen with right-hand side samples. This behavior could lead to unwanted outcomes,
like getting stuck in a poor local minimum or even diverging catastrophically. Thus, by
using off-policy experience replay we are able to smooth out learning and avoid oscillation
or divergence of the parameters.
The exploration-exploitation trade-off is dealt with through a threshold εthreshold that
decays over time; a lower threshold results in a higher exploitation of the optimal actions
learned so far.
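The exact decay schedule is not reproduced here; a common choice consistent with the Initial ε, Final ε and ε-Decay values listed later in the hyper-parameter table is an exponential decay, sketched below as an assumption rather than the report's exact formula.

import math

EPS_START, EPS_END, EPS_DECAY = 0.9, 0.01, 5000   # values from the hyper-parameter table

def eps_threshold(step: int) -> float:
    """Exponentially anneal epsilon from EPS_START towards EPS_END over roughly EPS_DECAY steps."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-step / EPS_DECAY)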
Figure 2: Extracted screen from the inverted pendulum environment: the image is obtained
by cropping, downsampling and converting to grayscale
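A rough sketch of such a preprocessing step is shown below; the crop window, the target resolution of 135×60 pixels (from the hyper-parameter table) and the use of torchvision transforms are illustrative assumptions, not the report's exact pipeline.

import numpy as np
import torch
import torchvision.transforms as T

# Grayscale and downsample to the (height, width) fed to the network
preprocess = T.Compose([
    T.ToPILImage(),
    T.Grayscale(),
    T.Resize((60, 135)),
    T.ToTensor(),
])

def get_screen(env) -> torch.Tensor:
    """Render the environment, crop the frame and return a 1x1xHxW tensor."""
    screen = env.render(mode='rgb_array')          # H x W x 3 uint8 array
    h, w, _ = screen.shape
    screen = screen[int(h * 0.4):int(h * 0.8), :]  # crop away sky and floor (assumed window)
    return preprocess(np.ascontiguousarray(screen)).unsqueeze(0)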
Early Stopping   Another method we can use is early stopping: in order to prevent
overfitting, the agent is trained only up to the point at which it has learned the optimal
behavior; from that point onward it would start behaving worse and worse and eventually
restart learning from scratch. We can detect this optimal point by introducing a score
threshold: when the mean score of the last K episodes exceeds the threshold ξ, we stop
optimizing the neural networks. We have empirically set these two new hyper-parameters,
which can be tweaked, to K = 20 and ξ = 142.
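A minimal sketch of this stopping rule, with hypothetical variable names:

from collections import deque

K, XI = 20, 142                  # empirically chosen window and score threshold
recent_scores = deque(maxlen=K)  # scores of the last K episodes

def should_stop(episode_score: float) -> bool:
    """Stop optimizing once the mean of the last K episode scores exceeds xi."""
    recent_scores.append(episode_score)
    return len(recent_scores) == K and sum(recent_scores) / K > XI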
Reward Function Design   The reward function design is one of the most important aspects
of Reinforcement Learning and can make the difference between an agent that does not train
at all and one able to achieve super-human performance. Instead of the usual reward
function from OpenAI Gym, we can build a new one that gives better feedback on our agent's
decisions:
$$\begin{aligned} r_1 &= \frac{x_{max} - |x|}{x_{max}} - 0.8 \\ r_2 &= \frac{\theta_{max} - |\theta|}{\theta_{max}} - 0.5 \\ r_{tot} &= r_1 + r_2 \end{aligned} \qquad (8)$$
However, in this case the reward function has only a minor impact on the overall
performance, so we may as well use one that does not explicitly employ the state variables.
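For illustration, a direct sketch of equation (8) in Python, assuming access to the cart position x and pole angle θ and taking x_max and θ_max from the observation-space table (the report's exact bounds may differ):

import math

X_MAX = 2.4                     # cart position bound
THETA_MAX = math.radians(41.8)  # pole angle bound, in radians

def shaped_reward(x: float, theta: float) -> float:
    """Reward of equation (8): larger when the cart is centered and the pole is upright."""
    r1 = (X_MAX - abs(x)) / X_MAX - 0.8
    r2 = (THETA_MAX - abs(theta)) / THETA_MAX - 0.5
    return r1 + r2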
Final Model
Parameters

Layer                    Number of Filters/Neurons   Activation   Kernel Size   Stride
Convolutional Layer 1    64                          ReLU         5             2
Convolutional Layer 2    64                          ReLU         5             2
Convolutional Layer 3    32                          ReLU         5             2
Linear Output Layer      1792                        None         -             -
After improving the model with the aforementioned techniques, we obtain the Convolutional
Neural Network described in the table above. As regards the loss function and optimizer,
we use:
• Loss Function: the Huber loss, which for small errors reduces to the squared error

$$L(Q_{target}, Q_{predicted}) = \frac{1}{2}\left(Q_{target} - Q_{predicted}\right)^2 \qquad (9)$$

and grows only linearly when the error magnitude exceeds 1, making training less sensitive
to outlier Q-value estimates.
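In PyTorch this corresponds to the smooth L1 loss; a minimal sketch (the report's exact call is not shown):

import torch
import torch.nn as nn

huber = nn.SmoothL1Loss()   # Huber loss with unit threshold
# |error| = 2.5 > 1, so the linear branch applies: 2.5 - 0.5 = 2.0
loss = huber(torch.tensor([2.5]), torch.tensor([0.0]))
print(loss.item())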
The Q-network itself is implemented as follows:

class DQN(nn.Module):
    """Convolutional Neural Network predicting the state-action values"""
    def __init__(self, h, w, outputs):
        super(DQN, self).__init__()
        self.conv1 = nn.Conv2d(nn_inputs, HIDDEN_LAYER_1,
                               kernel_size=KERNEL_SIZE, stride=STRIDE)
        self.bn1 = nn.BatchNorm2d(HIDDEN_LAYER_1)
        self.conv2 = nn.Conv2d(HIDDEN_LAYER_1, HIDDEN_LAYER_2,
                               kernel_size=KERNEL_SIZE, stride=STRIDE)
        self.bn2 = nn.BatchNorm2d(HIDDEN_LAYER_2)
        self.conv3 = nn.Conv2d(HIDDEN_LAYER_2, HIDDEN_LAYER_3,
                               kernel_size=KERNEL_SIZE, stride=STRIDE)
        self.bn3 = nn.BatchNorm2d(HIDDEN_LAYER_3)

        # Output size of a convolution, used to size the fully connected head
        def conv2d_size_out(size, kernel_size=KERNEL_SIZE, stride=STRIDE):
            return (size - (kernel_size - 1) - 1) // stride + 1

        convw = conv2d_size_out(conv2d_size_out(conv2d_size_out(w)))
        convh = conv2d_size_out(conv2d_size_out(conv2d_size_out(h)))
        linear_input_size = convw * convh * 32
        self.dropout = nn.Dropout()  # Dropout module for further regularization
        self.head = nn.Linear(linear_input_size, outputs)

Figure 3: Architecture of the DQN with a Convolutional Neural Network for approximating
state-action values
The final architecture of the DQN is represented in Figure 3. The following table shows
the main hyper-parameters, based on DQN with Experience Replay and a Target Network, which
we used for the final model:
Hyper-Parameter            Value
Batch Size                 128
Batch of Frames Size       2
Discount Rate γ            0.999
Initial ε                  0.9
Final ε                    0.01
ε-Decay                    5000
Early Stop ξ               142
Image Height (pixels)      60
Image Width (pixels)       135
Maximum Memory Size        100000
Target Model Update τ      50
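As an illustration of the Target Model Update parameter τ, a common scheme (assumed here, not quoted from the report) copies the policy network's weights into the frozen target network every τ episodes:

TARGET_UPDATE = 50   # tau from the hyper-parameter table

def maybe_update_target(episode: int, policy_net, target_net) -> None:
    """Synchronize the target network with the trained policy network every tau episodes."""
    if episode % TARGET_UPDATE == 0:
        target_net.load_state_dict(policy_net.state_dict())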
Final Results
After several unsuccessful or partially successful attempts, using a tweaked CNN
architecture, tuned hyper-parameters, adjustments in image pre-processing, measures against
overfitting and a redesigned reward function, we finally come to the experimental results.
As shown in Figure 4, the inverted pendulum system is able to stabilize and reach the
maximum score of 200. To the best of the author's knowledge, this result had not been
obtained before in the CartPole-v0 environment with DQN and only raw visual input, mainly
because of the absence of early stopping, which prevents overfitting.
Figure 4: Final results: plot of the scores averaged over a pool of 6 training runs. The
light blue lines represent the confidence interval obtained as mean ± standard deviation.
Most of the runs reach the desired score in ∼ 800 episodes
Conclusion
The goal of this report was to design a controller for a situation in which the state
variables are not available, in particular by using only raw images to generate a control
policy. The test benchmark is the inverted pendulum, frequently used to test nonlinear
controllers, and the necessary state information was extracted automatically thanks to
Deep Q-Learning, a Deep Reinforcement Learning algorithm capable of controlling agents
with complex state-action spaces.
After implementing several tweaks to existing algorithms, we obtained a controller capable
of reaching the maximum reward score in the simulation by using Convolutional Neural
Networks to choose the actions leading to the optimal cumulative reward. The successful
stabilization of the inverted pendulum shows that Reinforcement Learning is indeed capable
of achieving good performance even on difficult tasks.
Possible future developments include the implementation of new frameworks such as Neural
Ordinary Differential Equations [2], a new class of Neural Networks which, thanks to their
inherent capability to deal with differential equations, are revolutionizing the area of
approximation and optimization of complex dynamical systems, possibly paving the way to a
better synergy between Optimal Control and these models.
References
[1] Greg Brockman et al. OpenAI Gym. arXiv:1606.01540. 2016. url: http://arxiv.org/abs/1606.01540.
[2] Ricky T. Q. Chen et al. "Neural Ordinary Differential Equations". In: Advances in Neural Information Processing Systems. Ed. by S. Bengio et al. Vol. 31. Curran Associates, Inc., 2018, pp. 6571–6583. url: https://proceedings.neurips.cc/paper/2018/file/69386f6bb1dfed68692a24c8686939b9-Paper.pdf.
[3] Lukasz Kaiser et al. "Model-Based Reinforcement Learning for Atari". In: CoRR abs/1903.00374 (2019). arXiv: 1903.00374. url: http://arxiv.org/abs/1903.00374.
[4] Volodymyr Mnih et al. "Playing Atari with Deep Reinforcement Learning". In: arXiv preprint arXiv:1312.5602 (2013).
[5] Lerrel Pinto et al. "Robust Adversarial Reinforcement Learning". In: CoRR abs/1703.02702 (2017). arXiv: 1703.02702. url: http://arxiv.org/abs/1703.02702.
[6] Warren B. Powell. "From Reinforcement Learning to Optimal Control: A Unified Framework for Sequential Decisions". In: CoRR abs/1912.03513 (2019). arXiv: 1912.03513. url: http://arxiv.org/abs/1912.03513.
[7] Tom Schaul et al. "Prioritized Experience Replay". In: arXiv preprint arXiv:1511.05952 (2015).
[8] David Silver et al. "Mastering the Game of Go with Deep Neural Networks and Tree Search". In: Nature 529.7587 (Jan. 2016), pp. 484–489. doi: 10.1038/nature16961.
[9] Richard S. Sutton, Andrew G. Barto, et al. Introduction to Reinforcement Learning. Vol. 2. 4. MIT Press Cambridge, 1998.
[10] Oriol Vinyals et al. "Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning". In: Nature 575 (Nov. 2019). doi: 10.1038/s41586-019-1724-z.