Studies On Model Free Control of Dynamical Systems

Optimal Control

Professor: ***
Student: Federico Berto
Abstract
Modeling is well known to be one of the most difficult challenges when developing
control systems: partial observability and disturbances can make the system unstable,
especially in the case of complex and nonlinear systems. Model-free control is an
alternative that solves control problems without the need to design a complex model in
the first place: in particular, Reinforcement Learning has shown promising results in
controlling nonlinear systems even when the state variables cannot be measured directly,
such as when sensors are unavailable or broken. The goal of this report is to demonstrate
the ability of the Deep Q-Network algorithm to control a dynamical system, with the aim
of stabilizing an inverted pendulum using only raw pixels as state feedback: unlike most
controllers, this model-free formulation does not require any assumption or prior
knowledge about the dynamical system.
Introduction
Reinforcement Learning [9] is a branch of Machine Learning that has seen a surge in
interest in recent years because of its ability to handle complex tasks with minimal
human intervention and, in the case of model-free reinforcement learning, to learn a
representation of the state-action space without needing the actual state variables.
Reinforcement Learning (RL) has been successfully used for playing complex Atari games
[4], for mastering the game of Go [8] and even StarCraft II [10], showing important results
on high-dimensional state-action spaces.
Another interesting application is model-free control, which is still an open research
topic [6] given the complex nature of the continuous state: we explore this setup in the
report and show its promising capabilities in handling nonlinear systems without an
explicit model formulation, unlike most controllers.
Moreover, we show that, to the best of the author's knowledge, the performance of the
controller is better than that of other similar raw-pixel-based inverted pendulum setups,
thanks to some tweaks we will explore.
Figure 1: Agent-Environment interface for the inverted pendulum system. π∗(a|s) denotes
the optimal policy
• Value function: whereas the reward defines what is good in the short term, the value
function defines what is good in the long run; it is defined as the sum of the discounted
future rewards that can be accumulated from a particular state.
The goal of the control system (agent) is to maximize the cumulative discounted reward
obtained from the environment, defined as:
$$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \qquad (1)$$
where γ ∈ [0, 1] is a discount factor used for giving a lighter weight to far-away future
rewards.
For Deep Q-Networks, the policy π selecting the actions that lead to the highest reward
can be chosen in the ε-greedy way:
$$\pi(a|s) = \begin{cases} 1 - \varepsilon, & \text{if } a = \operatorname*{argmax}_{a \in A} q^*(s, a) \\ \varepsilon, & \text{a random action} \end{cases} \qquad (2)$$
where q∗(s, a) denotes the state-action value and ε is chosen depending on how much
exploration (randomly trying new actions to discover possibly better states and actions)
versus exploitation (choosing the best known action) we want the system to perform.
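As an illustration, the following is a minimal sketch of ε-greedy action selection in Python; the names q_values (the network's value estimates for the current state) and epsilon are hypothetical and not taken from the report's code.

import random
import torch

def select_action(q_values: torch.Tensor, epsilon: float) -> int:
    """Pick the greedy action with probability 1 - epsilon, a random one otherwise."""
    if random.random() > epsilon:
        # Exploitation: action with the highest estimated state-action value
        return int(q_values.argmax().item())
    # Exploration: uniformly random action
    return random.randrange(q_values.shape[-1])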
Observation Space

Num   Observation              Min      Max
0     Cart Position            -2.4     2.4
1     Cart Velocity            −∞       +∞
2     Pole Angle               -41.8°   41.8°
3     Pole Angular Velocity    −∞       +∞
Trajectories of the system   In order to optimize the system, we need to know which
trajectories lead to the best cumulative reward. In this setup, the agent is the entity
acting on the environment: given a state St ∈ S, it selects one or more actions At ∈ A to
interact with the environment. The agent then gets feedback from the environment itself:
it receives a scalar-valued reward Rt+1 ∈ R and moves to a new state St+1 ∈ S. In a finite
Markov Decision Process the sets of states, actions and rewards (S, A, R) contain a finite
number of elements; nonetheless, the problem can be extended to the continuous case. The
agent-environment feedback loop repeats over time and is updated at each time step: action
selection, state and reward signals are exchanged only at specific moments in time
(t = 0, 1, 2, 3, ...). The sequence of actions, states and rewards defines a trajectory in
time that can be described as:

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots \qquad (3)$$

This sequence can be used on the fly for optimizing the system, or stored in a buffer and
used later in randomized batches, allowing for better training stability.
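As an illustrative sketch (not the report's actual code), the loop below collects such a trajectory from a Gym environment and stores the transitions for later batched updates; the env and policy objects are assumed to exist and the pre-0.26 Gym API is assumed.

import gym

def collect_episode(env: gym.Env, policy, memory: list) -> float:
    """Run one episode, storing (S_t, A_t, R_{t+1}, S_{t+1}) transitions."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                           # select A_t from S_t
        next_state, reward, done, _ = env.step(action)   # receive R_{t+1} and S_{t+1}
        memory.append((state, action, reward, next_state))
        total_reward += reward
        state = next_state
    return total_reward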
Experimental Setup
We conduct the experiments with the inverted pendulum simulation from OpenAI Gym [1].
Also known as the cart-pole, the inverted pendulum is a common benchmark for testing
control systems: it is relatively simple and yet challenging because of the system's
nonlinearities. This system can be controlled with well-known controllers, such as the
Linear Quadratic Regulator for the linearized version or, e.g., Model Predictive Control
for the nonlinear version, as in the pendulum swing-up; our task, however, is more
challenging because of the absence of a model of the system.
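For reference, a minimal way to instantiate this benchmark with OpenAI Gym is sketched below (the environment name follows [1]; the printed representations may vary with the Gym version):

import gym

# CartPole-v0: two discrete actions (push the cart left or right), episodes capped at a score of 200
env = gym.make('CartPole-v0')
print(env.action_space)       # Discrete(2)
print(env.observation_space)  # 4-dimensional Box with the bounds listed in the table above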
The Algorithm: Deep Q-Learning   The Deep Q-Learning algorithm has shown state-of-the-art
performance in multiple frameworks, such as the Atari framework [4]. In order to
approximate the action-value function, we employ a Convolutional Neural Network
Q(s, a; θ) ≈ Q∗(s, a). For our purposes, θ denotes the set of weights of the network,
which we refer to as the Q-network. A Q-network can be trained by minimizing a sequence
of loss functions Li(θi) changing at each iteration i,

$$L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot)}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right]$$

where $y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a\right]$ is the target for iteration i and ρ(s, a) is a
probability distribution over sequences s and actions a. The parameters from the previous
iteration θi−1 are kept fixed when optimizing the loss function Li(θi). Note that the
targets depend on the network weights: in this respect they differ from the targets used
in supervised learning, which are fixed before learning begins.
Since we want to minimize the loss function to obtain a better fit of the network to the
real action-value function, we compute its gradient with respect to the weights:

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot);\, s' \sim \mathcal{E}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right) \nabla_{\theta_i} Q(s, a; \theta_i)\right]$$

The loss function is then minimized using either stochastic gradient descent or more
sophisticated methods like RMSprop. The behavior policy is ε-greedy, ensuring sufficient
exploration.
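A schematic sketch of one such update step in PyTorch is given below; policy_net, target_net, optimizer and the batch tensors are hypothetical placeholders, terminal-state masking is omitted, and the loss is written as a plain squared error for simplicity.

import torch
import torch.nn.functional as F

def optimize_step(policy_net, target_net, optimizer, batch, gamma=0.999):
    """One stochastic gradient step on the DQN loss for a mini-batch of transitions."""
    states, actions, rewards, next_states = batch
    # Q(s, a; theta_i) for the actions that were actually taken
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Target y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}), computed with frozen weights
    with torch.no_grad():
        next_q = target_net(next_states).max(1)[0]
    targets = rewards + gamma * next_q
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()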
What makes the algorithm work is a technique known as experience replay [7], by which we
store the agent's experiences (also referred to as transitions) at each time step,
et = (st, at, rt, st+1), in a data set D = (e1, ..., eN), pooled over many episodes into a
replay memory. In the inner loop of our algorithm we apply Q-Learning updates, or
mini-batch updates, to samples of experience, e ∼ D, picked at random from the pool of
stored samples. After performing experience replay, the agent executes an action according
to an ε-greedy policy. Deep Q-Learning with experience replay is written in pseudo-code in
Algorithm 1.
The following is the Python code for the experience replay, which is of paramount importance
for decoupling time-correlated transitions that would otherwise introduce a bias in the system:
import random
from collections import namedtuple

# A single transition e_t = (s_t, a_t, r_t, s_{t+1})
Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state'))

class ReplayMemory(object):
    """Replay Memory class for storing the system's trajectories"""
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0

    def push(self, *args):
        """Save a transition, overwriting the oldest one once capacity is reached"""
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = Transition(*args)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        """Sample a random mini-batch of stored transitions"""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
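A brief usage sketch with hypothetical variable names, using the capacity and batch size listed later in the hyper-parameter table:

memory = ReplayMemory(100000)                    # Maximum Memory Size
memory.push(state, action, reward, next_state)   # store the latest transition
if len(memory) >= 128:                           # Batch Size
    transitions = memory.sample(128)             # randomized mini-batch for the Q-network update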
Experience replay gives Deep Q-Learning several advantages. Firstly, since each step of
experience is potentially used in many weight updates, we obtain greater data efficiency.
Secondly, randomizing the samples improves efficiency and reduces the variance of the
updates, since learning directly from consecutive samples of experience is inefficient due
to the strong correlations between them. Thirdly, by learning off-policy we may be able to
avoid dangerous unwanted feedback loops. For example, if we trained on-policy and the
maximizing action were to move left, the training samples would be dominated by samples
from the left-hand side; if the maximizing action were instead to move right, the same
would happen with right-hand side samples. This behavior could lead to unwanted outcomes,
like getting stuck in a poor local minimum or even diverging catastrophically. Thus, by
using off-policy experience replay we are able to smooth out learning and avoid oscillation
or divergence of the parameters.
The exploration-exploitation trade-off is dealt with through a threshold εthreshold that
decays over time; a lower threshold results in a higher exploitation of the optimal actions
learned so far.
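The exact decay schedule is not reproduced here; a common choice consistent with the Initial ε, Final ε and ε-Decay values listed later in the hyper-parameter table is an exponential decay, sketched below as an assumption rather than the report's exact formula.

import math

EPS_START, EPS_END, EPS_DECAY = 0.9, 0.01, 5000   # values from the hyper-parameter table

def eps_threshold(step: int) -> float:
    """Exponentially anneal epsilon from EPS_START towards EPS_END over roughly EPS_DECAY steps."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-step / EPS_DECAY)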
Figure 2: Extracted screen from the inverted pendulum environment: the image is obtained
by cropping, downsampling and converting to grayscale
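A rough sketch of such a preprocessing step is shown below; the crop window, the target resolution of 135×60 pixels (from the hyper-parameter table) and the use of torchvision transforms are illustrative assumptions, not the report's exact pipeline.

import numpy as np
import torch
import torchvision.transforms as T

# Grayscale and downsample to the (height, width) fed to the network
preprocess = T.Compose([
    T.ToPILImage(),
    T.Grayscale(),
    T.Resize((60, 135)),
    T.ToTensor(),
])

def get_screen(env) -> torch.Tensor:
    """Render the environment, crop the frame and return a 1x1xHxW tensor."""
    screen = env.render(mode='rgb_array')          # H x W x 3 uint8 array
    h, w, _ = screen.shape
    screen = screen[int(h * 0.4):int(h * 0.8), :]  # crop away sky and floor (assumed window)
    return preprocess(np.ascontiguousarray(screen)).unsqueeze(0)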
Early Stopping   Another method we can use is early stopping: in order to prevent
overfitting, the agent is trained only up to the point at which it has learned the optimal
behavior; from that point onward it would start behaving worse and worse and eventually
restart learning from scratch. We can detect this optimal point by introducing a score
threshold: when the mean score of the last K episodes exceeds the threshold ξ, we stop
optimizing the neural networks. We have empirically set these two new hyper-parameters,
which can be tweaked, to K = 20 and ξ = 142.
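A minimal sketch of this stopping rule, with hypothetical variable names:

from collections import deque

K, XI = 20, 142                  # empirically chosen window and score threshold
recent_scores = deque(maxlen=K)  # scores of the last K episodes

def should_stop(episode_score: float) -> bool:
    """Stop optimizing once the mean of the last K episode scores exceeds xi."""
    recent_scores.append(episode_score)
    return len(recent_scores) == K and sum(recent_scores) / K > XI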
Reward Function Design   The reward function design is one of the most important aspects
of Reinforcement Learning and can make the difference between an agent that does not train
at all and one able to achieve super-human performance. Instead of the usual reward
function from OpenAI Gym, we can build a new one that gives better feedback on our agent's
decisions:
$$\begin{aligned} r_1 &= \frac{x_{max} - |x|}{x_{max}} - 0.8 \\ r_2 &= \frac{\theta_{max} - |\theta|}{\theta_{max}} - 0.5 \\ r_{tot} &= r_1 + r_2 \end{aligned} \qquad (8)$$
However, in this case the reward function has only a minor impact on the overall
performance, so we may as well use one that does not explicitly employ the state variables.
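For illustration, a direct sketch of equation (8) in Python, assuming access to the cart position x and pole angle θ and taking x_max and θ_max from the observation-space table (the report's exact bounds may differ):

import math

X_MAX = 2.4                     # cart position bound
THETA_MAX = math.radians(41.8)  # pole angle bound, in radians

def shaped_reward(x: float, theta: float) -> float:
    """Reward of equation (8): larger when the cart is centered and the pole is upright."""
    r1 = (X_MAX - abs(x)) / X_MAX - 0.8
    r2 = (THETA_MAX - abs(theta)) / THETA_MAX - 0.5
    return r1 + r2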
Final Model
Parameters

Layer                    Number of Filters/Neurons   Activation   Kernel Size   Stride
Convolutional Layer 1    64                          ReLU         5             2
Convolutional Layer 2    64                          ReLU         5             2
Convolutional Layer 3    32                          ReLU         5             2
Linear Output Layer      1792                        None         -             -
After improving the model with the aforementioned techniques, we obtain the Convolutional
Neural Network described in the table above. As regards the loss function and optimizer,
we use:
• Loss Function: the Huber loss, which for small errors reduces to the squared error

$$L(Q_{target}, Q_{predicted}) = \frac{1}{2}\left(Q_{target} - Q_{predicted}\right)^2 \qquad (9)$$

and grows only linearly when the error magnitude exceeds 1, making training less sensitive
to outlier Q-value estimates.
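In PyTorch this corresponds to the smooth L1 loss; a minimal sketch (the report's exact call is not shown):

import torch
import torch.nn as nn

huber = nn.SmoothL1Loss()   # Huber loss with unit threshold
# |error| = 2.5 > 1, so the linear branch applies: 2.5 - 0.5 = 2.0
loss = huber(torch.tensor([2.5]), torch.tensor([0.0]))
print(loss.item())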
The Q-network itself is implemented as follows:

class DQN(nn.Module):
    """Convolutional Neural Network predicting the state-action values"""
    def __init__(self, h, w, outputs):
        super(DQN, self).__init__()
        self.conv1 = nn.Conv2d(nn_inputs, HIDDEN_LAYER_1,
                               kernel_size=KERNEL_SIZE, stride=STRIDE)
        self.bn1 = nn.BatchNorm2d(HIDDEN_LAYER_1)
        self.conv2 = nn.Conv2d(HIDDEN_LAYER_1, HIDDEN_LAYER_2,
                               kernel_size=KERNEL_SIZE, stride=STRIDE)
        self.bn2 = nn.BatchNorm2d(HIDDEN_LAYER_2)
        self.conv3 = nn.Conv2d(HIDDEN_LAYER_2, HIDDEN_LAYER_3,
                               kernel_size=KERNEL_SIZE, stride=STRIDE)
        self.bn3 = nn.BatchNorm2d(HIDDEN_LAYER_3)

        # Output size of a convolution, used to size the fully connected head
        def conv2d_size_out(size, kernel_size=KERNEL_SIZE, stride=STRIDE):
            return (size - (kernel_size - 1) - 1) // stride + 1

        convw = conv2d_size_out(conv2d_size_out(conv2d_size_out(w)))
        convh = conv2d_size_out(conv2d_size_out(conv2d_size_out(h)))
        linear_input_size = convw * convh * 32
        self.dropout = nn.Dropout()  # Dropout module for further regularization
        self.head = nn.Linear(linear_input_size, outputs)

Figure 3: Architecture of the DQN with a Convolutional Neural Network for approximating
state-action values
The final architecture of the DQN is represented in Figure 3. The following table shows
the main hyper-parameters, based on DQN with Experience Replay and a Target Network, which
we used for the final model:
Hyper-Parameter            Value
Batch Size                 128
Batch of Frames Size       2
Discount Rate γ            0.999
Initial ε                  0.9
Final ε                    0.01
ε-Decay                    5000
Early Stop ξ               142
Image Height (pixels)      60
Image Width (pixels)       135
Maximum Memory Size        100000
Target Model Update τ      50
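As an illustration of the Target Model Update parameter τ, a common scheme (assumed here, not quoted from the report) copies the policy network's weights into the frozen target network every τ episodes:

TARGET_UPDATE = 50   # tau from the hyper-parameter table

def maybe_update_target(episode: int, policy_net, target_net) -> None:
    """Synchronize the target network with the trained policy network every tau episodes."""
    if episode % TARGET_UPDATE == 0:
        target_net.load_state_dict(policy_net.state_dict())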
Final Results
After several unsuccessful or partially successful attempts, using a tweaked CNN
architecture, tuned hyper-parameters, adjustments in image pre-processing, measures against
overfitting and a redesigned reward function, we finally come to the experimental results.
As shown in Figure 4, the inverted pendulum system is able to stabilize and reach the
maximum score of 200. To the best of the author's knowledge, this result had not been
obtained before in the CartPole-v0 environment with DQN and only raw visual input, mainly
because of the absence of early stopping, which prevents overfitting.
Figure 4: Final results: plot of the scores averaged over a pool of 6 training runs. The
light blue lines represent the confidence interval obtained as mean ± standard deviation.
Most of the runs reach the desired score in ∼ 800 episodes
Conclusion
The goal of this report was to design a controller for a situation in which the state
variables are not available, in particular by using only raw images to generate a control
policy. The test benchmark is the inverted pendulum, frequently used to test nonlinear
controllers, and the necessary state information was extracted automatically thanks to
Deep Q-Learning, a Deep Reinforcement Learning algorithm capable of controlling agents
with complex state-action spaces.
After implementing several tweaks to existing algorithms, we obtained a controller capable
of reaching the maximum reward score in the simulation by using Convolutional Neural
Networks to choose the actions leading to the optimal cumulative reward. The successful
stabilization of the inverted pendulum shows that Reinforcement Learning is indeed capable
of achieving good performance even on difficult tasks.
Possible future developments include the implementation of new frameworks such as Neural
Ordinary Differential Equations [2], a new class of Neural Networks which, thanks to their
inherent capability to deal with differential equations, are revolutionizing the area of
approximation and optimization of complex dynamical systems, possibly paving the way to a
better synergy between Optimal Control and these models.
References
[1] Greg Brockman et al. OpenAI Gym. arXiv:1606.01540. 2016. url: http://arxiv.org/abs/1606.01540.
[2] Ricky T. Q. Chen et al. "Neural Ordinary Differential Equations". In: Advances in Neural Information Processing Systems. Ed. by S. Bengio et al. Vol. 31. Curran Associates, Inc., 2018, pp. 6571–6583. url: https://proceedings.neurips.cc/paper/2018/file/69386f6bb1dfed68692a24c8686939b9-Paper.pdf.
[3] Lukasz Kaiser et al. "Model-Based Reinforcement Learning for Atari". In: CoRR abs/1903.00374 (2019). arXiv: 1903.00374. url: http://arxiv.org/abs/1903.00374.
[4] Volodymyr Mnih et al. "Playing Atari with Deep Reinforcement Learning". In: arXiv preprint arXiv:1312.5602 (2013).
[5] Lerrel Pinto et al. "Robust Adversarial Reinforcement Learning". In: CoRR abs/1703.02702 (2017). arXiv: 1703.02702. url: http://arxiv.org/abs/1703.02702.
[6] Warren B. Powell. "From Reinforcement Learning to Optimal Control: A Unified Framework for Sequential Decisions". In: CoRR abs/1912.03513 (2019). arXiv: 1912.03513. url: http://arxiv.org/abs/1912.03513.
[7] Tom Schaul et al. "Prioritized Experience Replay". In: arXiv preprint arXiv:1511.05952 (2015).
[8] David Silver et al. "Mastering the Game of Go with Deep Neural Networks and Tree Search". In: Nature 529.7587 (Jan. 2016), pp. 484–489. doi: 10.1038/nature16961.
[9] Richard S. Sutton, Andrew G. Barto, et al. Introduction to Reinforcement Learning. Vol. 2. 4. MIT Press Cambridge, 1998.
[10] Oriol Vinyals et al. "Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning". In: Nature 575 (Nov. 2019). doi: 10.1038/s41586-019-1724-z.