Reinforcement Learning DDPG–PPO Agent-Based Control System
https://doi.org/10.1007/s13369-023-07934-2
Abstract
The rotary inverted pendulum (RIP) system is a nonlinear system used as a benchmark for testing control strategies. The RIP system has many applications in the balancing of robotic systems such as drones and humanoid robots. Controlling the RIP system is a complex task without concise knowledge of classical control engineering. This paper uses the reinforcement learning (RL) approach to control the RIP instead of classical controllers such as PID (proportional–integral–derivative) and LQR (linear–quadratic regulator). In this work, the deep deterministic policy gradient–proximal policy optimization (DDPG–PPO) agent is proposed and implemented to control the rotary inverted pendulum platform both in simulation and on hardware. A DDPG agent with 13 layers is trained for the swing-up action of the pendulum, and the mode selection process is trained and tested using the PPO agent. The rotary inverted pendulum is controlled using the proposed controller and compared with other RL agents such as soft actor critic–proximal policy optimization (SAC–PPO). Additionally, the proposed method is tested against a conventional proportional–integral–derivative (PID) controller, for different pendulum mass values, to validate its effectiveness. Finally, the proposed RL controller is implemented on the real-time RIP apparatus (Quanser Qube-Servo). Results show that the DDPG–PPO RL agent is more effective than the SAC–PPO agent during swing-up control.
Keywords Reinforcement learning · Deep deterministic policy gradient · Proximal policy optimization · Rotary inverted
pendulum · Simulink
The two-dimensional inverted pendulum was used in the self-balancing unicycle cart [3].
For controlling the inverted pendulum, three types of control system are available: linear, predictive, and self-learning. Linear approaches [4–6] are more useful for inverted pendulum systems with one or two degrees of freedom, which restricts their application in real-world scenarios, and their parameter values are also hard to tune. The predictive control approach [7] is expensive and complex to implement on real-time apparatus. To counter these shortcomings, the self-learning control approach has been proposed [8], which is more capable of handling inverted pendulums with more degrees of freedom.
Self-learning algorithms have made a vast contribution to the domain of robotics [8]. A deep RL algorithm was proposed for an inverted pendulum that rotates on a spherical joint with an industrial six-degree-of-freedom robot arm. RL control policies have been widely used to control sequential decision problems, such as locomotion of autonomous robots and walking of biped robots, by optimizing a cumulative reward signal [9]. Q-learning is one of the RL algorithms [10]. The Q-learning algorithm has been used for solving problems such as autonomous navigation in an unknown environment, the car parking problem, and walking of biped or quadruped robots. To overcome overestimated action values, Abed-alguni [11] proposed the double delayed Q-learning algorithm. The experimental results showed that the double delayed Q-learning algorithm converges to an optimal policy and performs better than delayed Q-learning. For adjusting the exploration–exploitation balance during the learning process, the Bat Q-learning algorithm was proposed in [12]. This reduces the overestimation of Q-values and leads to more accurate value estimates. Applications of the Bat Q-learning algorithm are still being explored; however, it has shown promising results in several domains such as path planning for mobile robots [13]. Various cooperative RL algorithms were compared and studied in [14]. The algorithms were compared using the taxi problem, and the authors also studied the effect of the frequency of Q-value sharing on the learning speed of independent learners that share their Q-values among each other. Van Hasselt et al. proposed the double deep Q-network (DDQN) algorithm to solve the overestimated-value problem [15]. Dai et al. implemented the DDQN algorithm in a real-time hardware-in-the-loop control system on an RIP hardware platform and compared the results with the conventional Q-learning algorithm [16]. The DDQN algorithm reduces the overestimated Q-value and minimizes the training episodes and time compared to the Q-learning algorithm. Behrens et al. proposed the SAC algorithm for smart magnetic micro-robots learning to swim in an unknown environment [17]. Yu et al. proposed the model-free SAC–PID controller for automatic control of a mobile robot in an unknown environment [18]. The validation of the proposed controller was done on both simulation and a real mobile robot platform for different complex paths. Saeed et al. implemented the deep deterministic policy gradient (DDPG) RL algorithm for robotic hand manipulation in a pick-and-place operation in an unknown environment [19], and the results were quite effective. Gao et al. proposed the DDPG RL algorithm for obstacle avoidance in a four-wheel mobile robot. The results show that the proposed model can find an obstacle-free navigation path even when the environment has multiple obstacles [20]. As per the above literature, many RL algorithms have been used for controlling nonlinear problems, but the results of the DDPG algorithm were more effective in terms of fewer training episodes, better controllability, and suitability for continuous state-action problems.
In this paper, we use an RL DDPG–PPO agent-based control system for a rotary inverted pendulum system and test and compare the robustness of the model in the MATLAB-Simulink environment during the swing-up and mode selection process. The main novelties and contributions of this paper are:

• A state-of-the-art comparison with the RL SAC–PPO algorithm [21].
• A modified reward function to stabilize the pendulum in the upright position.
• Real-time implementation on hardware.

The paper is organized as follows: the background related to RL is presented in Sect. 2. The mathematical modeling and Simulink-Simscape modeling are presented in Sect. 3. The RL controller with the DDPG–PPO agent is proposed and implemented to control the swing-up and mode selection of the rotary pendulum in Sect. 4. Finally, the results of the proposed RL agent (DDPG–PPO) are compared with the RL-based SAC–PPO agent [21]; the proposed RL DDPG algorithm is also implemented on the real-time rotary inverted pendulum hardware (Quanser Qube-Servo), and the results are presented and discussed in Sect. 5. The conclusions are drawn in Sect. 6.

2 Background

RL is a subbranch of machine learning (ML) in which the self-learning process is based on feedback and previous control actions. The modeling of RL is based on how human beings learn. Humans act on the present state of an unknown surrounding environment and obtain rewards accordingly. After some trials, the human becomes able to predict the future state based on the present state and desired actions. Humans learn what actions will
Fig. 2 CAD modelling: a front view, b side view, c top view, d isometric view, and e hardware
the RIP and compared results with classical control methods [23].
Although the RL DQN and DDQN algorithms are very powerful, they have their own challenges in solving continuous state problems. The DDPG algorithm is more suitable for solving continuous-type problems; it has the advantages of both the DQN and DPG algorithms. To the authors' knowledge, the RL DDPG algorithm has not been considered for the swing-up operation of the rotary inverted pendulum in the literature.

3 Modeling

The angle of the pendulum (α) ranges between [−π, π] radian. For the upright position of the pendulum, the angle α is π radian. The angle of the arm (θ) is constrained to [−π/2, π/2] radian. The angle θ is kept at zero when the arm is at the central position. α̇ and θ̇ are the angular velocities of the pendulum and the arm, respectively.

3.1 Mathematical Modeling

For obtaining the equations of motion of the RIP, the Lagrange method is used [24]. F_r and F_p are the forces acting on the rotary arm and the pendulum, as shown in Eqs. (1) and (2).
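As background for this derivation, the equations of motion follow from the standard Euler–Lagrange formulation. The sketch below states that general form only (with $\mathcal{L}$ the Lagrangian, $\tau$ the motor torque applied to the arm, and $D_r$, $D_p$ the viscous damping coefficients that appear in Eqs. (3) and (4)); it is a generic textbook statement under these assumptions, not a reproduction of the paper's Eqs. (1) and (2).

$$\frac{\mathrm{d}}{\mathrm{d}t}\!\left(\frac{\partial \mathcal{L}}{\partial \dot{\theta}}\right) - \frac{\partial \mathcal{L}}{\partial \theta} = \tau - D_r\,\dot{\theta}, \qquad \frac{\mathrm{d}}{\mathrm{d}t}\!\left(\frac{\partial \mathcal{L}}{\partial \dot{\alpha}}\right) - \frac{\partial \mathcal{L}}{\partial \alpha} = -D_p\,\dot{\alpha}, \qquad \mathcal{L} = T - V$$

where T and V are the kinetic and potential energies of the arm–pendulum system. Expanding these derivatives with the RIP kinematics yields the nonlinear equations of motion (3) and (4) below.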
The nonlinear equations of motion for the system are:

$$\left(M_p r^2 + 0.25\,M_p L^2\left(1-\cos^2(\alpha)\right) + J_r\right)\ddot{\theta} - 0.5\,M_p L r \cos(\alpha)\,\ddot{\alpha} + 0.5\,M_p L^2 \sin(\alpha)\cos(\alpha)\,\dot{\theta}\,\dot{\alpha} + 0.5\,M_p L r \sin(\alpha)\,\dot{\alpha}^2 = \tau - D_r\,\dot{\theta} \quad (3)$$

$$-0.5\,M_p L r \cos(\alpha)\,\ddot{\theta} + \left(J_p + 0.25\,M_p L^2\right)\ddot{\alpha} - 0.25\,M_p L^2 \sin(\alpha)\cos(\alpha)\,\dot{\theta}^2 - 0.5\,M_p L g \sin(\alpha) = -D_p\,\dot{\alpha} \quad (4)$$

where M_p is the mass of the pendulum, and J_p and J_r are the moments of inertia about the centers of mass of the pendulum and the rod, respectively.

3.2 Simulink Modeling

The Simulink Simscape toolbox and Eqs. (3) and (4) are used for modeling the rotary inverted pendulum, as shown in Fig. 3. The control voltage source block is connected to the DC motor block. A voltage in the range [−12, 12] V is supplied to the DC motor. The DC motor output (angular velocity and angular position) is connected to the rotational Multibody interface block, which acts as the interface between a Simscape Multibody joint and a Simscape mechanical rotational network. The output of this block is the mechanical rotational torque actuation and angular velocity sensing from the joint block. The torque is used to actuate revolute joint 1, which is the connector between the DC motor and the rotary arm. The rotary arm is connected to the pendulum via the transformation frame block (TF-4). TF-4 is a rigid transformation Simulink/Simscape/Multibody block. This block applies a time-invariant transformation between two frames. The transformation rotates and translates the pendulum frame with respect to the rotary arm frame. Connecting the frame ports in reverse causes the transformation itself to reverse. The frames remain fixed with respect to each other during simulation, moving only as a single unit. The output is the position (θ, α) and angular velocity (θ̇, α̇) of the arm rod and the pendulum. The configuration block takes care of the uniform gravity for the entire mechanism. The configuration block consists of three separate blocks: Solver configuration, World frame, and Mechanism configuration. The Solver configuration block specifies the solver parameters that the model needs before simulation can begin. In this simulation, we used the ode23t solver, which is a one-step solver. The World frame block represents the global reference frame in a model. This frame is inertial and at absolute rest. The World frame is the ultimate reference frame; directly or indirectly, all other frames are defined with respect to it. The Mechanism configuration block provides mechanical and simulation parameters to a mechanism. Parameters include gravity and a linearization delta for computing numerical partial derivatives during linearization. These parameters apply only to the target mechanism, i.e., the mechanism that the block connects to. In our case, this block is connected to
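As a cross-check of the Simscape model, Eqs. (3) and (4) can also be integrated directly in MATLAB. The sketch below is only illustrative: the parameter values (mass, lengths, inertias, damping) are placeholders rather than the identified Quanser Qube-Servo values, and the constant test torque stands in for the RL control signal.

```matlab
% Illustrative numerical integration of the RIP dynamics, Eqs. (3) and (4).
% Parameter values are placeholders, not the Qube-Servo identification.
Mp = 0.024; L = 0.129; r = 0.085; g = 9.81;      % mass [kg], lengths [m]
Jp = Mp*L^2/12; Jr = 5.7e-5;                     % inertias [kg m^2] (assumed)
Dp = 5e-5; Dr = 1.5e-4;                          % damping coefficients (assumed)
tau = 0.01;                                      % constant test torque [N m]

% State x = [theta; alpha; theta_dot; alpha_dot]
f = @(t, x) rip_dynamics(x, tau, Mp, L, r, g, Jp, Jr, Dp, Dr);
[t, x] = ode23t(f, [0 5], [0; 0.1; 0; 0]);       % small initial pendulum angle
plot(t, x(:,1:2)); legend('\theta', '\alpha');

function dx = rip_dynamics(x, tau, Mp, L, r, g, Jp, Jr, Dp, Dr)
    th = x(1); al = x(2); thd = x(3); ald = x(4);
    % Mass matrix and right-hand side of Eqs. (3)-(4): M*[thdd; aldd] = b
    M = [Mp*r^2 + 0.25*Mp*L^2*(1 - cos(al)^2) + Jr, -0.5*Mp*L*r*cos(al);
         -0.5*Mp*L*r*cos(al),                        Jp + 0.25*Mp*L^2];
    b = [tau - Dr*thd - 0.5*Mp*L^2*sin(al)*cos(al)*thd*ald - 0.5*Mp*L*r*sin(al)*ald^2;
         -Dp*ald + 0.25*Mp*L^2*sin(al)*cos(al)*thd^2 + 0.5*Mp*L*g*sin(al)];
    acc = M \ b;                                 % solve for [theta_ddot; alpha_ddot]
    dx = [thd; ald; acc(1); acc(2)];
end
```

Solving the two equations jointly for the accelerations keeps the sketch faithful to the coupled form of (3) and (4), and ode23t is used here simply because it is the solver named for the Simscape model.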
Fig. 5 Snippet of: a swing-up performance of rlDDPG Agent after training the model of rotary inverted pendulum, and b learning curve
Fig. 6 Snippet of: a mode selection performance of rlPPO Agent after training the model of rotary inverted pendulum, and b learning curve
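For readers who want to reproduce a comparable training setup to the one behind Figs. 5 and 6, the sketch below shows how a swing-up DDPG agent and a mode-selection PPO agent might be created and trained with MATLAB's Reinforcement Learning Toolbox against a Simulink environment. It uses default actor/critic networks and illustrative option values; the model name, agent block path, observation/action dimensions, and stopping criteria are assumptions, not the 13-layer networks or exact hyper-parameters used in the paper.

```matlab
% Illustrative setup of DDPG (swing-up) and PPO (mode-selection) agents with
% the Reinforcement Learning Toolbox. Model name, block path, and signal
% dimensions are assumptions for a generic RIP Simulink model.
mdl = 'rip_model';                                   % assumed Simulink model name
obsInfo = rlNumericSpec([5 1]);                      % assumed observation vector size
actInfo = rlNumericSpec([1 1], 'LowerLimit', -12, 'UpperLimit', 12);  % motor voltage [V]
env = rlSimulinkEnv(mdl, [mdl '/RL Agent'], obsInfo, actInfo);

ddpgAgent = rlDDPGAgent(obsInfo, actInfo);           % default actor/critic networks

trainOpts = rlTrainingOptions( ...
    'MaxEpisodes', 1000, ...
    'MaxStepsPerEpisode', 500, ...
    'StopTrainingCriteria', 'AverageReward', ...
    'StopTrainingValue', 7500);                      % illustrative stopping threshold
trainStats = train(ddpgAgent, env, trainOpts);       % swing-up training

% A discrete-action PPO agent for mode selection (swing-up vs. balance)
modeActInfo = rlFiniteSetSpec([1 2]);                % 1 = swing-up, 2 = balance
ppoAgent = rlPPOAgent(obsInfo, modeActInfo);
```

In this arrangement the continuous-action DDPG agent outputs the motor voltage, while the discrete-action PPO agent only switches between the swing-up and balance modes, mirroring the division of labour described for the proposed DDPG–PPO controller.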
neural networks used for the actor and critic. These computations are typically O(n²) or O(n³), where n is the number of neurons in the network. DDPG can be computationally expensive, particularly for large-scale problems with high-dimensional state and action spaces. However, utilizing small neural networks (n = 13) makes it a viable algorithm for the control of the rotary inverted pendulum system. The proposed DDPG–PPO algorithm also relies on the proximal policy optimization method. The computational complexity of PPO can be difficult to quantify precisely, as it depends on various factors such as the size of the neural networks, the number of iterations and episodes, and the size of the action and state spaces of the environment. However, as a rough estimate, the computational complexity of PPO is usually on the order of thousands to tens of thousands of iterations and episodes. According to Table 5, the proposed method requires 1675 episodes and about one hour to be trained.

5 Results and Discussion

This section presents the simulation and experimental results and their discussion. The DDPG–PPO agents are proposed to control the rotary inverted pendulum. The results of the Simulink model of the rotary inverted pendulum in the simulation environment, as discussed in Sect. 3.2, are shown in Fig. 8.
As mentioned before, the rotary inverted pendulum task is a continuous control task, in which the agent must maintain the pendulum in the upright position by applying an appropriate torque. DDPG is an algorithm that
Fig. 8 Rotary inverted pendulum Simulink results: a motor arm angle, b pendulum angle, c angular velocity of motor arm, and d angular velocity of pendulum
can handle continuous actions, and it may be more efficient at finding the optimal torque values needed to balance the pendulum. SAC, on the other hand, uses entropy regularization to encourage exploration, which may be less effective in a task where the optimal policy is well-defined. DDPG–PPO is a hybrid algorithm that combines the strengths of both DDPG and PPO: DDPG is known for its ability to handle continuous action spaces, while PPO is known for its stability and sample efficiency. SAC–PPO uses the soft actor–critic algorithm, which places a high emphasis on exploration. While this can be beneficial in some environments, it can also lead to suboptimal policies in other environments [26]. The rotary inverted pendulum is controlled using the proposed controller and compared with various RL agents such as SAC–PPO [21]. Additionally, the proposed method was tested against a conventional proportional–integral–derivative (PID) controller, for different pendulum mass values, to validate its effectiveness [27], as shown in Figs. 9 and 10. The simulation results in Figs. 9 and 10 indicate that, while some of the performance parameters have degraded for the nominal mass, the proposed controller outperforms the PID controller as the pendulum mass is increased. This suggests that the proposed controller is more robust to variations in pendulum mass than the PID controller, which exhibits degradation in performance with increasing mass. Detailed information on performance parameters such as the settling time of the pendulum angle (T_s), the maximum overshoot in the pendulum angle (γ_max), and the motor arm angle (θ_max) is given in Table 5.
Figure 11 demonstrates an important advantage for controlling systems that are subject to unpredictable disturbances.
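For context on the PID baseline used in the comparison of Figs. 9 and 10, a conventional discrete PID loop on the pendulum angle has roughly the structure sketched below. The gains, sample time, saturation limits, and the measurement/actuation helper functions are illustrative assumptions, not the tuned controller of [27].

```matlab
% Illustrative discrete PID balance loop on the pendulum angle error.
% Gains, sample time, and limits are assumptions, not the tuned values of [27].
Kp = 40; Ki = 2; Kd = 3;           % assumed PID gains
Ts = 0.002;                        % sample time [s]
alphaRef = pi;                     % upright pendulum angle [rad]
intErr = 0; prevErr = 0;

for k = 1:5000
    alphaMeas = readPendulumAngle();            % hypothetical measurement function
    err = alphaRef - alphaMeas;
    intErr = intErr + err*Ts;                   % integral of the error
    derErr = (err - prevErr)/Ts;                % backward-difference derivative
    u = Kp*err + Ki*intErr + Kd*derErr;         % PID control law
    u = min(max(u, -12), 12);                   % saturate to motor voltage range [V]
    writeMotorVoltage(u);                       % hypothetical actuation function
    prevErr = err;
end
```

Because such a loop is tuned around one operating point, its performance degrades as the pendulum mass changes, which is the behaviour visible in the PID traces of Figs. 9 and 10.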
Fig. 9 Simulation response of motor arm angle for pendulum mass (Mp) in kg: a 0.024 and b 0.130 (solid blue line: proposed method; dashed red line: PID controller)
Fig. 10 Simulation response of pendulum angle for pendulum mass (Mp) in kg: a 0.024 and b 0.130 (solid blue line: proposed method; dashed red line: PID controller)
In Fig. 11a, the results for the motor arm angle are displayed. Our DDPG–PPO agent was able to reach the center position (zero-radian angle) in approximately 2 s. This is faster than the SAC–PPO agent, which took 4 s to reach the center position. The DDPG–PPO agent's result also settles to the desired value (0 rad/s) faster than the SAC–PPO agent's results. According to Fig. 11b, the DDPG–PPO agent outperformed the SAC–PPO agent for the pendulum angle task by reaching the upright position (3.14-rad angle) earlier. The results for the input voltage are presented in Fig. 11c for both the DDPG–PPO and SAC–PPO agents. Both agents performed similarly in this task. However, in Fig. 11d, it can be observed that the DDPG–PPO agent's performance became stable after 1.75 s, while the SAC–PPO agent's performance was jerky and unstable. Furthermore, when using the DDPG–PPO agent, the pendulum was able to approach the upright position in the counterclockwise direction (positive swing-up reference angle), which was not observed with the SAC–PPO agent's results.
Figure 12 shows the experimental results of implementing the best episode after training the inverted rotary pendulum
Fig. 11 DDPG–PPO agent and SAC–PPO agent comparison results: a motor arm angle, b pendulum angle, c input voltage, and d swing-up reference angle
using DDPG–PPO agents during simulation in a MATLAB-Simulink environment. The hardware is connected with MATLAB-Simulink through the HIL and Quanser QUARC toolbox library. The QUARC HIL Initialize block is used for connecting the hardware with the MATLAB/Simulink environment. This block associates a name with a particular HIL board, which is a Quanser Q8-USB in our case. This board supports real-time configuration; when a board type is selected for the first time, parameters such as the number of I/O channels are initialized according to that board type. As depicted in Fig. 12, the proposed RL DDPG–PPO agents work effectively on the rotary inverted pendulum hardware.
The comparison results of the SAC–PPO agent and the DDPG–PPO agent are given in Table 6. The results of the proposed DDPG–PPO agent are validated against the SAC–PPO agent. The swing-up control is done by the DDPG agent. The training episodes for the DDPG agent are 152/1000, which is fewer than the SAC agent's training episodes, i.e., 178/1000. The DDPG agent takes 1 h to train the swing-up control operation, whereas the SAC agent takes 1 h 27 min. Also, the average reward for DDPG (7646.517) is lower than the SAC result, i.e., 7774.2451. The mode-selection agent is PPO for both swing-up agents. PPO with DDPG performs better than PPO with a SAC agent, because the DDPG agent has the advantages of both the DQN and DPG algorithms and therefore performs better for continuous state and action problems.

6 Conclusions

An RL algorithm to swing up and balance a rotary inverted pendulum in simulation and hardware environments was proposed in this paper. The overall control approach consists
of four parts: rotary inverted pendulum modeling, hardware interface, environment, and agent. Without a deep knowledge of conventional control theory, the RL deep deterministic policy gradient agent was proposed for the swing-up control action of the rotary inverted pendulum; 152 episodes were required to train the swing-up control action with the deep deterministic policy gradient (DDPG) agent. An RL proximal policy optimization (PPO) agent was used to train the mode selection operation. Finally, the effectiveness of the proposed RL-based agent was compared with a conventional PID controller and with different RL agents, such as the soft actor–critic (SAC) agent. Comparing the SAC–PPO agent and the DDPG–PPO agent, the training time and the number of training episodes of the DDPG–PPO agent were lower. Using the DDPG–PPO agent, the pendulum can swing up faster than with the SAC–PPO agent. The comparisons were done in the MATLAB-Simulink environment, and the proposed DDPG–PPO agent was also implemented on the real-time rotary inverted pendulum hardware. The DDPG–PPO algorithm is sensitive to the choice of hyper-parameters, i.e., learning rates and discount factors; choosing appropriate hyper-parameters can be a time-consuming process that requires trial and error and can be difficult to generalize across different tasks and environments. Also, DDPG and PPO are both model-free algorithms, which means that they rely on trial-and-error learning to optimize the policy, although this approach proved effective for the rotary inverted pendulum. For future work, the authors will implement the proposed algorithm on more complex environments, such as a double-link inverted pendulum and the walking mechanism of a legged robot, to validate the effectiveness of the algorithm.

Funding The authors received no financial support for the research, authorship, and/or publication of this article.

Declarations

Conflict of interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Younis, W.; Abdelati, M.: Design and implementation of an experimental segway model. In: AIP Conference Proceedings, pp. 350–354 (2009)
2. Singh, R.; Bera, T.K.: Walking mechanism of quadruped robot on a side ramp using PI controller. In: IEEE Proceedings of the 15th International Conference on Industrial and Information Systems (ICIIS 2020), pp. 105–111 (2020)
3. Aranda-Escolástica, E.; Guinaldo, M.; Santos, M.: Control of a chain pendulum: a fuzzy logic approach. Int. J. Comput. Intell. Syst. 9(2), 281–295 (2016)
4. Kajita, S., et al.: Biped walking stabilization based on linear inverted pendulum tracking. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4489–4496 (2010). https://doi.org/10.1109/IROS.2010.5651082
5. Valluru, V.K.; Singh, M.; Singh, M.: Application of linear quadratic methods to stabilize cart inverted pendulum systems. In: Proceedings of the 2nd IEEE International Conference on Power Electronics, Intelligent Control and Energy Systems (ICPEICES), pp. 1027–1031 (2018). https://doi.org/10.1109/ICPEICES.2018.8897316
6. Chawla, I.; Singla, A.: Real-time stabilization control of a rotary inverted pendulum using LQR-based sliding mode controller. Arab. J. Sci. Eng. 46(3), 2589–2596 (2021). https://doi.org/10.1007/s13369-020-05161-7
7. Bekkar, B.; Ferkous, K.: Design of online fuzzy tuning LQR controller applied to rotary single inverted pendulum: experimental validation. Arab. J. Sci. Eng., 1–16 (2022)
8. Mellatshahi, N.; Mozaffari, S.; Saif, M.; Alirezaee, S.: Inverted pendulum control with a robotic arm using deep reinforcement learning. In: IEEE International Symposium on Signals, Circuits and Systems (ISSCS), pp. 1–6 (2021)
9. Sutton, R.S.; Barto, A.G.: Introduction to Reinforcement Learning. MIT Press, Cambridge (1998)
10. Watkins, C.J.: Learning from delayed rewards. PhD thesis, University of Cambridge, England (1989)
11. Abed-alguni, B.H.; Ottom, M.A.: Double delayed Q-learning. Int. J. Artif. Intell. 6(2), 41–59 (2018)
12. Abed-alguni, B.H.: Bat Q-learning algorithm. Jordanian J. Comput. Inf. Technol. 3(1), 56–77 (2017)
13. Xin, G.; Shi, L.; Long, G.; Pan, W.; Li, Y.; Xu, J.: Mobile robot path planning with reformative bat algorithm. PLoS ONE, 1–12 (2022)
14. Abed-Alguni, B.H.; Paul, D.J.; Chalup, S.K.; Henskens, F.A.: A comparison study of cooperative Q-learning algorithms for independent learners. Int. J. Artif. Intell. 14(1), 71–93 (2016)
15. Van Hasselt, H.; Guez, A.; Silver, D.: Deep reinforcement learning with double Q-learning. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Palo Alto, AAAI Press, pp. 2094–2100 (2016)
16. Dai, Y.; Lee, K.; Lee, S.: A real-time HIL control system on rotary inverted pendulum hardware platform based on double deep Q-network. Meas. Control 54(3–4), 417–428 (2021)
17. Behrens, M.R.; Ruder, W.C.: Smart magnetic microrobots learn to swim with deep reinforcement learning. arXiv preprint arXiv:2201.05599 (2022)
18. Yu, X.; Fan, Y.; Xu, S.; Ou, L.: A self-adaptive SAC-PID control approach based on reinforcement learning for mobile robots. Int. J. Robust Nonlinear Control 10(2), 210–229 (2021)
19. Saeed, M.; Nagdi, M.; Rosman, B.; Ali, H.H.: Deep reinforcement learning for robotic hand manipulation. In: IEEE Proceedings of the International Conference on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE), pp. 1–5 (2021)
20. Gao, X.; Yan, L.; Wang, G.; Wang, T.; Du, N.; Gerada, C.: Toward obstacle avoidance for mobile robots using deep reinforcement learning algorithm. In: IEEE Proceedings of the 16th Conference on Industrial Electronics and Applications (ICIEA), pp. 2136–2139 (2021)
21. Train Reinforcement Learning Agents to Control Quanser QUBE™ Pendulum. MATLAB & Simulink, mathworks.com (2022)
22. Polzounov, K.; Redden, L.: Blue River Controls: a toolkit for reinforcement learning control systems on hardware. arXiv:2001.02254 (2020)
23. Kim, J.B.; Kwon, D.H.; Hong, Y.G.: Deep Q-network based rotary inverted pendulum system and its monitoring on the EdgeX platform. In: IEEE International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pp. 34–39 (2019)
24. Cazzolato, B.S.; Prime, Z.: On the dynamics of the Furuta pendulum. J. Control Sci. Eng., 1–8 (2011)
25. Koenig, S.; Simmons, R.G.: Complexity analysis of real-time reinforcement learning. In: Proceedings of the 11th National Conference on Artificial Intelligence (AAAI), pp. 99–105 (1993)
26. Larsen, T.N.; Teigen, H.Ø.; Laache, T.; Varagnolo, D.; Rasheed, A.: Comparing deep reinforcement learning algorithms' ability to safely navigate challenging waters. Front. Robot. AI, 1–19 (2021)
27. Kathpal, A.; Singla, A.: SimMechanics™ based modeling, simulation and real-time control of rotary inverted pendulum. In: IEEE Proceedings of the 11th International Conference on Intelligent Systems and Control (ISCO), pp. 166–172 (2017)

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.