Reinforcement Learning DDPG–PPO Agent-Based Control System
https://doi.org/10.1007/s13369-023-07934-2
Abstract
The rotary inverted pendulum (RIP) system is a nonlinear system used as a benchmark for testing control strategies. The RIP system has many applications in the balancing of robotic systems such as drones and humanoid robots. Controlling the RIP system is a complex task without concise knowledge of classical control engineering. This paper uses the reinforcement learning (RL) approach to control the RIP instead of classical controllers such as PID (proportional–integral–derivative) and LQR (linear–quadratic regulator). In this work, the deep deterministic policy gradient–proximal policy optimization (DDPG–PPO) agent is proposed and implemented to control the rotary inverted pendulum platform both in simulation and on hardware. A DDPG agent with 13 layers is trained for the swing-up action of the pendulum, and the mode selection process is trained and tested using the PPO agent. The rotary inverted pendulum is controlled using the proposed controller and compared with other RL agents such as soft actor critic–proximal policy optimization (SAC–PPO). Additionally, the proposed method is tested against a conventional proportional–integral–derivative (PID) controller, for different pendulum mass values, to validate its effectiveness. Finally, the proposed RL controller is implemented on the real-time RIP apparatus (Quanser Qube-Servo). Results show that the DDPG–PPO RL agent is more effective than the SAC–PPO agent during swing-up control.
Keywords Reinforcement learning · Deep deterministic policy gradient · Proximal policy optimization · Rotary inverted
pendulum · Simulink
The two-dimensional inverted pendulum was used in the self-balancing unicycle cart [3].
For controlling the inverted pendulum, three types of control system are available: linear, predictive, and self-learning. Linear approaches [4–6] are more useful for inverted pendulum systems with one or two degrees of freedom, which restricts their application in real-world scenarios, and their parameter values are also hard to tune. The predictive control approach [7] is expensive and complex to implement on real-time apparatus. To counter these shortcomings, the self-learning control approach has been proposed [8], which is more capable of handling inverted pendulums with more degrees of freedom.
Self-learning algorithms have made a vast contribution to the domain of robotics [8]. A deep RL algorithm was proposed for an inverted pendulum that rotates on a spherical joint with an industrial six-degree-of-freedom robot arm. RL control policies have been widely used to control sequential decision problems, such as locomotion of autonomous robots and walking of biped robots, by optimizing a cumulative reward signal [9]. Q-learning is one of the RL algorithms [10]. The Q-learning algorithm has been used for solving problems such as autonomous navigation in an unknown environment, the car parking problem, and walking of biped or quadruped robots. To overcome overestimated action values, Abed-alguni [11] proposed the double delayed Q-learning algorithm. The experimental results showed that the double delayed Q-learning algorithm converges to an optimal policy and performs better than delayed Q-learning. For adjusting the exploration–exploitation balance during the learning process, the Bat Q-learning algorithm was proposed in [12]. This reduces the overestimation of Q-values and leads to more accurate value estimates. Applications of the Bat Q-learning algorithm are still being explored; however, it has shown promising results in several domains such as path planning for mobile robots [13]. Various cooperative RL algorithms were compared and studied in [14]. The algorithms were compared using the taxi problem, and the authors also studied the effect of the frequency of Q-value sharing on the learning speed of independent learners that share their Q-values among each other. Van Hasselt et al. proposed the double deep Q-network (DDQN) algorithm to solve the overestimated-value problem [15]. Dai et al. implemented the DDQN algorithm in a real-time hardware-in-the-loop control system on an RIP hardware platform and compared the results with the conventional Q-learning algorithm [16]. The DDQN algorithm reduces the overestimated Q-value and minimizes the training episodes and time compared to the Q-learning algorithm. Behrens et al. proposed the SAC algorithm for smart magnetic micro-robots learning to swim in an unknown environment [17]. Yu et al. proposed the model-free SAC–PID controller for automatic control of a mobile robot in an unknown environment [18]. The validation of the proposed controller was done on both simulation and a real mobile robot platform for different complex paths. Saeed et al. implemented the deep deterministic policy gradient (DDPG) RL algorithm for robotic hand manipulation in a pick-and-place operation in an unknown environment [19], and the results were quite effective. Gao et al. proposed the DDPG RL algorithm for obstacle avoidance in a four-wheel mobile robot. The results show that the proposed model can find an obstacle-free navigation path even when the environment has multiple obstacles [20]. As per the above literature, many RL algorithms have been used for controlling nonlinear problems, but the results of the DDPG algorithm were more effective in terms of fewer training episodes, better controllability, and suitability for continuous state-action problems.
In this paper, we use an RL DDPG–PPO agent-based control system for a rotary inverted pendulum system and test and compare the robustness of the model in the MATLAB-Simulink environment during the swing-up and mode selection process. The main novelties and contributions of this paper are:

• A state-of-the-art comparison with the RL SAC–PPO algorithm [21].
• A modified reward function to stabilize the pendulum in the upright position.
• Real-time implementation on hardware.

The paper is organized as follows: the background related to RL is presented in Sect. 2. The mathematical modeling and Simulink-Simscape modeling are presented in Sect. 3. The RL controller with the DDPG–PPO agent is proposed and implemented to control the swing-up and mode selection of the rotary pendulum in Sect. 4. Finally, the results of the proposed RL agent (DDPG–PPO) are compared with the RL-based SAC–PPO agent [21]; the proposed RL DDPG algorithm is also implemented on the real-time rotary inverted pendulum hardware (Quanser Qube-Servo), and the results are presented and discussed in Sect. 5. The conclusions are drawn in Sect. 6.

2 Background

RL is a subbranch of machine learning (ML) in which the self-learning process is based on feedback and previous control actions. The modeling of RL is based on how human beings learn. Humans act on the present state of an unknown surrounding environment and obtain rewards accordingly. After some trials, the human becomes able to predict the future state based on the present state and desired actions. Humans learn what actions will
Fig. 2 CAD modelling: a front view, b side view, c top view, d isometric view, and e hardware
the RIP and compared results with classical control methods [23].
Although the RL DQN and DDQN algorithms are very powerful, they have their own challenges in solving continuous state problems. The DDPG algorithm is more suitable for solving continuous-type problems; it has the advantages of both the DQN and DPG algorithms. To the authors' knowledge, the RL DDPG algorithm has not been considered for the swing-up operation of the rotary inverted pendulum in the literature.

3 Modeling

The angle of the pendulum (α) ranges between [−π, π] radian. For the upright position of the pendulum, the angle α is π radian. The angle of the arm (θ) is constrained to [−π/2, π/2] radian. The angle θ is kept at zero when the arm is at the central position. α̇ and θ̇ are the angular velocities of the pendulum and the arm, respectively.

3.1 Mathematical Modeling

For obtaining the equations of motion of the RIP, the Lagrange method is used [24]. F_r and F_p are the forces acting on the rotary arm and the pendulum, as shown in Eqs. (1) and (2).
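As background for this derivation, the equations of motion follow from the standard Euler–Lagrange formulation. The sketch below states that general form only (with $\mathcal{L}$ the Lagrangian, $\tau$ the motor torque applied to the arm, and $D_r$, $D_p$ the viscous damping coefficients that appear in Eqs. (3) and (4)); it is a generic textbook statement under these assumptions, not a reproduction of the paper's Eqs. (1) and (2).

$$\frac{\mathrm{d}}{\mathrm{d}t}\!\left(\frac{\partial \mathcal{L}}{\partial \dot{\theta}}\right) - \frac{\partial \mathcal{L}}{\partial \theta} = \tau - D_r\,\dot{\theta}, \qquad \frac{\mathrm{d}}{\mathrm{d}t}\!\left(\frac{\partial \mathcal{L}}{\partial \dot{\alpha}}\right) - \frac{\partial \mathcal{L}}{\partial \alpha} = -D_p\,\dot{\alpha}, \qquad \mathcal{L} = T - V$$

where T and V are the kinetic and potential energies of the arm–pendulum system. Expanding these derivatives with the RIP kinematics yields the nonlinear equations of motion (3) and (4) below.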
The nonlinear equations of motion for the system are:

$$\left(M_p r^2 + 0.25\,M_p L^2\left(1-\cos^2(\alpha)\right) + J_r\right)\ddot{\theta} - 0.5\,M_p L r \cos(\alpha)\,\ddot{\alpha} + 0.5\,M_p L^2 \sin(\alpha)\cos(\alpha)\,\dot{\theta}\,\dot{\alpha} + 0.5\,M_p L r \sin(\alpha)\,\dot{\alpha}^2 = \tau - D_r\,\dot{\theta} \quad (3)$$

$$-0.5\,M_p L r \cos(\alpha)\,\ddot{\theta} + \left(J_p + 0.25\,M_p L^2\right)\ddot{\alpha} - 0.25\,M_p L^2 \sin(\alpha)\cos(\alpha)\,\dot{\theta}^2 - 0.5\,M_p L g \sin(\alpha) = -D_p\,\dot{\alpha} \quad (4)$$

where M_p is the mass of the pendulum, and J_p and J_r are the moments of inertia about the centers of mass of the pendulum and the rod, respectively.

3.2 Simulink Modeling

The Simulink Simscape toolbox and Eqs. (3) and (4) are used for modeling the rotary inverted pendulum, as shown in Fig. 3. The control voltage source block is connected to the DC motor block. A voltage in the range [−12, 12] V is supplied to the DC motor. The DC motor output (angular velocity and angular position) is connected to the rotational Multibody interface block, which acts as the interface between a Simscape Multibody joint and a Simscape mechanical rotational network. The output of this block is the mechanical rotational torque actuation and angular velocity sensing from the joint block. The torque is used to actuate revolute joint 1, which is the connector between the DC motor and the rotary arm. The rotary arm is connected to the pendulum via the transformation frame block (TF-4). TF-4 is a rigid transformation Simulink/Simscape/Multibody block. This block applies a time-invariant transformation between two frames. The transformation rotates and translates the pendulum frame with respect to the rotary arm frame. Connecting the frame ports in reverse causes the transformation itself to reverse. The frames remain fixed with respect to each other during simulation, moving only as a single unit. The output is the position (θ, α) and angular velocity (θ̇, α̇) of the arm rod and the pendulum. The configuration block takes care of the uniform gravity for the entire mechanism. The configuration block consists of three separate blocks: Solver configuration, World frame, and Mechanism configuration. The Solver configuration block specifies the solver parameters that the model needs before simulation can begin. In this simulation, we used the ode23t solver, which is a one-step solver. The World frame block represents the global reference frame in a model. This frame is inertial and at absolute rest. The World frame is the ultimate reference frame; directly or indirectly, all other frames are defined with respect to it. The Mechanism configuration block provides mechanical and simulation parameters to a mechanism. Parameters include gravity and a linearization delta for computing numerical partial derivatives during linearization. These parameters apply only to the target mechanism, i.e., the mechanism that the block connects to. In our case, this block is connected to
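As a cross-check of the Simscape model, Eqs. (3) and (4) can also be integrated directly in MATLAB. The sketch below is only illustrative: the parameter values (mass, lengths, inertias, damping) are placeholders rather than the identified Quanser Qube-Servo values, and the constant test torque stands in for the RL control signal.

```matlab
% Illustrative numerical integration of the RIP dynamics, Eqs. (3) and (4).
% Parameter values are placeholders, not the Qube-Servo identification.
Mp = 0.024; L = 0.129; r = 0.085; g = 9.81;      % mass [kg], lengths [m]
Jp = Mp*L^2/12; Jr = 5.7e-5;                     % inertias [kg m^2] (assumed)
Dp = 5e-5; Dr = 1.5e-4;                          % damping coefficients (assumed)
tau = 0.01;                                      % constant test torque [N m]

% State x = [theta; alpha; theta_dot; alpha_dot]
f = @(t, x) rip_dynamics(x, tau, Mp, L, r, g, Jp, Jr, Dp, Dr);
[t, x] = ode23t(f, [0 5], [0; 0.1; 0; 0]);       % small initial pendulum angle
plot(t, x(:,1:2)); legend('\theta', '\alpha');

function dx = rip_dynamics(x, tau, Mp, L, r, g, Jp, Jr, Dp, Dr)
    th = x(1); al = x(2); thd = x(3); ald = x(4);
    % Mass matrix and right-hand side of Eqs. (3)-(4): M*[thdd; aldd] = b
    M = [Mp*r^2 + 0.25*Mp*L^2*(1 - cos(al)^2) + Jr, -0.5*Mp*L*r*cos(al);
         -0.5*Mp*L*r*cos(al),                        Jp + 0.25*Mp*L^2];
    b = [tau - Dr*thd - 0.5*Mp*L^2*sin(al)*cos(al)*thd*ald - 0.5*Mp*L*r*sin(al)*ald^2;
         -Dp*ald + 0.25*Mp*L^2*sin(al)*cos(al)*thd^2 + 0.5*Mp*L*g*sin(al)];
    acc = M \ b;                                 % solve for [theta_ddot; alpha_ddot]
    dx = [thd; ald; acc(1); acc(2)];
end
```

Solving the two equations jointly for the accelerations keeps the sketch faithful to the coupled form of (3) and (4), and ode23t is used here simply because it is the solver named for the Simscape model.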
Fig. 5 Snippet of: a swing-up performance of rlDDPG Agent after training the model of rotary inverted pendulum, and b learning curve
Fig. 6 Snippet of: a mode selection performance of rlPPO Agent after training the model of rotary inverted pendulum, and b learning curve
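For readers who want to reproduce a comparable training setup to the one behind Figs. 5 and 6, the sketch below shows how a swing-up DDPG agent and a mode-selection PPO agent might be created and trained with MATLAB's Reinforcement Learning Toolbox against a Simulink environment. It uses default actor/critic networks and illustrative option values; the model name, agent block path, observation/action dimensions, and stopping criteria are assumptions, not the 13-layer networks or exact hyper-parameters used in the paper.

```matlab
% Illustrative setup of DDPG (swing-up) and PPO (mode-selection) agents with
% the Reinforcement Learning Toolbox. Model name, block path, and signal
% dimensions are assumptions for a generic RIP Simulink model.
mdl = 'rip_model';                                   % assumed Simulink model name
obsInfo = rlNumericSpec([5 1]);                      % assumed observation vector size
actInfo = rlNumericSpec([1 1], 'LowerLimit', -12, 'UpperLimit', 12);  % motor voltage [V]
env = rlSimulinkEnv(mdl, [mdl '/RL Agent'], obsInfo, actInfo);

ddpgAgent = rlDDPGAgent(obsInfo, actInfo);           % default actor/critic networks

trainOpts = rlTrainingOptions( ...
    'MaxEpisodes', 1000, ...
    'MaxStepsPerEpisode', 500, ...
    'StopTrainingCriteria', 'AverageReward', ...
    'StopTrainingValue', 7500);                      % illustrative stopping threshold
trainStats = train(ddpgAgent, env, trainOpts);       % swing-up training

% A discrete-action PPO agent for mode selection (swing-up vs. balance)
modeActInfo = rlFiniteSetSpec([1 2]);                % 1 = swing-up, 2 = balance
ppoAgent = rlPPOAgent(obsInfo, modeActInfo);
```

In this arrangement the continuous-action DDPG agent outputs the motor voltage, while the discrete-action PPO agent only switches between the swing-up and balance modes, mirroring the division of labour described for the proposed DDPG–PPO controller.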
neural networks used for the actor and critic. These computations are typically O(n²) or O(n³), where n is the number of neurons in the network. DDPG can be computationally expensive, particularly for large-scale problems with high-dimensional state and action spaces. However, utilizing small neural networks (n = 13) makes it a viable algorithm for the control of the rotary inverted pendulum system. The proposed DDPG–PPO algorithm also relies on the proximal policy optimization method. The computational complexity of PPO can be difficult to quantify precisely, as it depends on various factors such as the size of the neural networks, the number of iterations and episodes, and the size of the action and state spaces of the environment. However, as a rough estimate, the computational complexity of PPO is usually on the order of thousands to tens of thousands of iterations and episodes. According to Table 5, the proposed method requires 1675 episodes and about one hour to be trained.

5 Results and Discussion

This section presents the simulation and experimental results and their discussion. The DDPG–PPO agents are proposed to control the rotary inverted pendulum. The results of the Simulink model of the rotary inverted pendulum in the simulation environment, as discussed in Sect. 3.2, are shown in Fig. 8.
As mentioned before, the rotary inverted pendulum task is a continuous control task, in which the agent must maintain the pendulum in the upright position by applying an appropriate torque. DDPG is an algorithm that
Fig. 8 Rotary inverted pendulum Simulink results: a motor arm angle, b pendulum angle, c angular velocity of motor arm, and d angular velocity of pendulum
can handle continuous actions, and it may be more efficient at finding the optimal torque values needed to balance the pendulum. SAC, on the other hand, uses entropy regularization to encourage exploration, which may be less effective in a task where the optimal policy is well-defined. DDPG–PPO is a hybrid algorithm that combines the strengths of both DDPG and PPO: DDPG is known for its ability to handle continuous action spaces, while PPO is known for its stability and sample efficiency. SAC–PPO uses the soft actor–critic algorithm, which places a high emphasis on exploration. While this can be beneficial in some environments, it can also lead to suboptimal policies in other environments [26]. The rotary inverted pendulum is controlled using the proposed controller and compared with various RL agents such as SAC–PPO [21]. Additionally, the proposed method was tested against a conventional proportional–integral–derivative (PID) controller, for different pendulum mass values, to validate its effectiveness [27], as shown in Figs. 9 and 10. The simulation results in Figs. 9 and 10 indicate that, while some of the performance parameters have degraded for the nominal mass, the proposed controller outperforms the PID controller as the pendulum mass is increased. This suggests that the proposed controller is more robust to variations in pendulum mass than the PID controller, which exhibits degradation in performance with increasing mass. Detailed information on performance parameters such as the settling time of the pendulum angle (T_s), the maximum overshoot in the pendulum angle (γ_max), and the motor arm angle (θ_max) is given in Table 5.
Figure 11 demonstrates an important advantage for controlling systems that are subject to unpredictable disturbances.
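For context on the PID baseline used in the comparison of Figs. 9 and 10, a conventional discrete PID loop on the pendulum angle has roughly the structure sketched below. The gains, sample time, saturation limits, and the measurement/actuation helper functions are illustrative assumptions, not the tuned controller of [27].

```matlab
% Illustrative discrete PID balance loop on the pendulum angle error.
% Gains, sample time, and limits are assumptions, not the tuned values of [27].
Kp = 40; Ki = 2; Kd = 3;           % assumed PID gains
Ts = 0.002;                        % sample time [s]
alphaRef = pi;                     % upright pendulum angle [rad]
intErr = 0; prevErr = 0;

for k = 1:5000
    alphaMeas = readPendulumAngle();            % hypothetical measurement function
    err = alphaRef - alphaMeas;
    intErr = intErr + err*Ts;                   % integral of the error
    derErr = (err - prevErr)/Ts;                % backward-difference derivative
    u = Kp*err + Ki*intErr + Kd*derErr;         % PID control law
    u = min(max(u, -12), 12);                   % saturate to motor voltage range [V]
    writeMotorVoltage(u);                       % hypothetical actuation function
    prevErr = err;
end
```

Because such a loop is tuned around one operating point, its performance degrades as the pendulum mass changes, which is the behaviour visible in the PID traces of Figs. 9 and 10.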
Fig. 9 Simulation response of motor arm angle for pendulum mass (Mp) in kg: a 0.024 and b 0.130 (solid blue line: proposed method; dashed red line: PID controller)
Fig. 10 Simulation response of pendulum angle for pendulum mass (Mp) in kg: a 0.024 and b 0.130 (solid blue line: proposed method; dashed red line: PID controller)
In Fig. 11a, the results for the motor arm angle are displayed. Our DDPG–PPO agent was able to reach the center position (zero-radian angle) in approximately 2 s. This is faster than the SAC–PPO agent, which took 4 s to reach the center position. The DDPG–PPO agent's result also settles to the desired value (0 rad/s) faster than the SAC–PPO agent's results. According to Fig. 11b, the DDPG–PPO agent outperformed the SAC–PPO agent for the pendulum angle task by reaching the upright position (3.14-rad angle) earlier. The results for the input voltage are presented in Fig. 11c for both the DDPG–PPO and SAC–PPO agents. Both agents performed similarly in this task. However, in Fig. 11d, it can be observed that the DDPG–PPO agent's performance became stable after 1.75 s, while the SAC–PPO agent's performance was jerky and unstable. Furthermore, when using the DDPG–PPO agent, the pendulum was able to approach the upright position in the counterclockwise direction (positive swing-up reference angle), which was not observed with the SAC–PPO agent's results.
Figure 12 shows the experimental results of implementing the best episode after training the inverted rotary pendulum
Fig. 11 DDPG–PPO agent and SAC–PPO agent comparison results: a motor arm angle, b pendulum angle, c input voltage, and d swing-up reference angle
using DDPG–PPO agents during simulation in a MATLAB-Simulink environment. The hardware is connected with MATLAB-Simulink through the HIL and Quanser QUARC toolbox library. The QUARC HIL Initialize block is used for connecting the hardware with the MATLAB/Simulink environment. This block associates a name with a particular HIL board, which is a Quanser Q8-USB in our case. This board supports real-time configuration; when a board type is selected for the first time, parameters such as the number of I/O channels are initialized according to that board type. As depicted in Fig. 12, the proposed RL DDPG–PPO agents work effectively on the rotary inverted pendulum hardware.
The comparison results of the SAC–PPO agent and the DDPG–PPO agent are given in Table 6. The results of the proposed DDPG–PPO agent are validated against the SAC–PPO agent. The swing-up control is done by the DDPG agent. The training episodes for the DDPG agent are 152/1000, which is fewer than the SAC agent's training episodes, i.e., 178/1000. The DDPG agent takes 1 h to train the swing-up control operation, whereas the SAC agent takes 1 h 27 min. Also, the average reward for DDPG (7646.517) is lower than the SAC result, i.e., 7774.2451. The mode-selection agent is PPO for both swing-up agents. PPO with DDPG performs better than PPO with a SAC agent, because the DDPG agent has the advantages of both the DQN and DPG algorithms and therefore performs better for continuous state and action problems.

6 Conclusions

An RL algorithm to swing up and balance a rotary inverted pendulum in simulation and hardware environments was proposed in this paper. The overall control approach consists
of four parts: rotary inverted pendulum modeling, hardware interface, environment, and agent. Without a deep knowledge of conventional control theory, the RL deep deterministic policy gradient agent was proposed for the swing-up control action of the rotary inverted pendulum; 152 episodes were required to train the swing-up control action with the deep deterministic policy gradient (DDPG) agent. An RL proximal policy optimization (PPO) agent was used to train the mode selection operation. Finally, the effectiveness of the proposed RL-based agent was compared with a conventional PID controller and with different RL agents, such as the soft actor–critic (SAC) agent. Comparing the SAC–PPO agent and the DDPG–PPO agent, the training time and the number of training episodes of the DDPG–PPO agent were lower. Using the DDPG–PPO agent, the pendulum can swing up faster than with the SAC–PPO agent. The comparisons were done in the MATLAB-Simulink environment, and the proposed DDPG–PPO agent was also implemented on the real-time rotary inverted pendulum hardware. The DDPG–PPO algorithm is sensitive to the choice of hyper-parameters, i.e., learning rates and discount factors; choosing appropriate hyper-parameters can be a time-consuming process that requires trial and error and can be difficult to generalize across different tasks and environments. Also, DDPG and PPO are both model-free algorithms, which means that they rely on trial-and-error learning to optimize the policy, although this approach proved effective for the rotary inverted pendulum. For future work, the authors will implement the proposed algorithm on more complex environments, such as a double-link inverted pendulum and the walking mechanism of a legged robot, to validate the effectiveness of the algorithm.

Funding The authors received no financial support for the research, authorship, and/or publication of this article.

Declarations

Conflict of interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Younis, W.; Abdelati, M.: Design and implementation of an experimental segway model. In: AIP Conference Proceedings, pp. 350–354 (2009)
2. Singh, R.; Bera, T.K.: Walking mechanism of quadruped robot on a side ramp using PI controller. In: IEEE Proceedings of the 15th International Conference on Industrial and Information Systems (ICIIS 2020), pp. 105–111 (2020)
3. Aranda-Escolástica, E.; Guinaldo, M.; Santos, M.: Control of a chain pendulum: a fuzzy logic approach. Int. J. Comput. Intell. Syst. 9(2), 281–295 (2016)
4. Kajita, S., et al.: Biped walking stabilization based on linear inverted pendulum tracking. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4489–4496 (2010). https://doi.org/10.1109/IROS.2010.5651082
5. Valluru, V.K.; Singh, M.; Singh, M.: Application of linear quadratic methods to stabilize cart inverted pendulum systems. In: Proceedings of the 2nd IEEE International Conference on Power Electronics, Intelligent Control and Energy Systems (ICPEICES), pp. 1027–1031 (2018). https://doi.org/10.1109/ICPEICES.2018.8897316
6. Chawla, I.; Singla, A.: Real-time stabilization control of a rotary inverted pendulum using LQR-based sliding mode controller. Arab. J. Sci. Eng. 46(3), 2589–2596 (2021). https://doi.org/10.1007/s13369-020-05161-7
7. Bekkar, B.; Ferkous, K.: Design of online fuzzy tuning LQR controller applied to rotary single inverted pendulum: experimental validation. Arab. J. Sci. Eng., 1–16 (2022)
8. Mellatshahi, N.; Mozaffari, S.; Saif, M.; Alirezaee, S.: Inverted pendulum control with a robotic arm using deep reinforcement learning. In: IEEE International Symposium on Signals, Circuits and Systems (ISSCS), pp. 1–6 (2021)
9. Sutton, R.S.; Barto, A.G.: Introduction to Reinforcement Learning. MIT Press, Cambridge (1998)
10. Watkins, C.J.: Learning from delayed rewards. PhD thesis, University of Cambridge, England (1989)
11. Abed-alguni, B.H.; Ottom, M.A.: Double delayed Q-learning. Int. J. Artif. Intell. 6(2), 41–59 (2018)
12. Abed-alguni, B.H.: Bat Q-learning algorithm. Jordanian J. Comput. Inf. Technol. 3(1), 56–77 (2017)
13. Xin, G.; Shi, L.; Long, G.; Pan, W.; Li, Y.; Xu, J.: Mobile robot path planning with reformative bat algorithm. PLoS ONE, 1–12 (2022)
14. Abed-Alguni, B.H.; Paul, D.J.; Chalup, S.K.; Henskens, F.A.: A comparison study of cooperative Q-learning algorithms for independent learners. Int. J. Artif. Intell. 14(1), 71–93 (2016)
15. Van Hasselt, H.; Guez, A.; Silver, D.: Deep reinforcement learning with double Q-learning. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Palo Alto, AAAI Press, pp. 2094–2100 (2016)
16. Dai, Y.; Lee, K.; Lee, S.: A real-time HIL control system on rotary inverted pendulum hardware platform based on double deep Q-network. Meas. Control 54(3–4), 417–428 (2021)
17. Behrens, M.R.; Ruder, W.C.: Smart magnetic microrobots learn to swim with deep reinforcement learning. arXiv preprint arXiv:2201.05599 (2022)
18. Yu, X.; Fan, Y.; Xu, S.; Ou, L.: A self-adaptive SAC-PID control approach based on reinforcement learning for mobile robots. Int. J. Robust Nonlinear Control 10(2), 210–229 (2021)
19. Saeed, M.; Nagdi, M.; Rosman, B.; Ali, H.H.: Deep reinforcement learning for robotic hand manipulation. In: IEEE Proceedings of the International Conference on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE), pp. 1–5 (2021)
20. Gao, X.; Yan, L.; Wang, G.; Wang, T.; Du, N.; Gerada, C.: Toward obstacle avoidance for mobile robots using deep reinforcement learning algorithm. In: IEEE Proceedings of the 16th Conference on Industrial Electronics and Applications (ICIEA), pp. 2136–2139 (2021)
21. Train Reinforcement Learning Agents to Control Quanser QUBE™ Pendulum. MATLAB & Simulink, mathworks.com (2022)
22. Polzounov, K.; Redden, L.: Blue River Controls: a toolkit for reinforcement learning control systems on hardware. arXiv:2001.02254 (2020)
23. Kim, J.B.; Kwon, D.H.; Hong, Y.G.: Deep Q-network based rotary inverted pendulum system and its monitoring on the EdgeX platform. In: IEEE International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pp. 34–39 (2019)
24. Cazzolato, B.S.; Prime, Z.: On the dynamics of the Furuta pendulum. J. Control Sci. Eng., 1–8 (2011)
25. Koenig, S.; Simmons, R.G.: Complexity analysis of real-time reinforcement learning. In: Proceedings of the 11th National Conference on Artificial Intelligence (AAAI), pp. 99–105 (1993)
26. Larsen, T.N.; Teigen, H.Ø.; Laache, T.; Varagnolo, D.; Rasheed, A.: Comparing deep reinforcement learning algorithms' ability to safely navigate challenging waters. Front. Robot. AI, 1–19 (2021)
27. Kathpal, A.; Singla, A.: SimMechanics™ based modeling, simulation and real-time control of rotary inverted pendulum. In: IEEE Proceedings of the 11th International Conference on Intelligent Systems and Control (ISCO), pp. 166–172 (2017)

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.