Reinforcement Learning With PID Loop
This paper addresses a design scheme for a proportional-integral-derivative (PID) controller with a new adaptive updating rule based on a reinforcement learning (RL) approach for nonlinear systems. A new design scheme in which RL complements conventional PID control technology is presented. In the proposed scheme, a single radial basis function (RBF) network is used to calculate both the control policy function of the Actor and the value function of the Critic. In accordance with the PID controller structure, the inputs of the RBF network are the system error, the first-order difference of the output, and the second-order difference of the output, which together compose the system states. The temporal difference (TD) error in the proposed scheme involves the reinforcement signal and the current and previously stored values of the value function. The gradient descent method is adopted for a performance index based on the TD error, and the updating rules are then derived. The network weights and the kernel functions can therefore be calculated in an adaptive way. Finally, numerical simulations are conducted on nonlinear systems to illustrate the efficiency and robustness of the proposed scheme. © 2021 Institute of Electrical Engineers of Japan. Published by Wiley Periodicals LLC.
Keywords: reinforcement learning; PID control; Actor-Critic learning; RBF network; nonlinear system
1. Introduction

Proportional-integral-derivative (PID) control is one of the most common control schemes and has dominated the majority of industrial processes and mechanical systems, owing to its versatility, high reliability, and ease of operation [1]. PID controllers can be tuned manually by operators and control engineers based on empirical knowledge when the mathematical model of the controlled plant is unknown. Classical tuning methods, such as the Ziegler–Nichols method [2] and the Chien–Hrones–Reswick method [3], are applied to process control, and the resulting performance significantly outperforms manual tuning. However, those methods work well only for simple controlled plants; for complex systems with nonlinearity, the performance cannot be guaranteed in the presence of uncertainty and unknown dynamics. Therefore, adaptive PID control has received considerable attention in recent years in order to deal with varying systems.

Several adaptive PID control strategies have been proposed, including model-based adaptive PID control [4–6] and adaptive PID control based on neural networks [7,8]. It has been clarified that model-based adaptive PID control requires the assumption that the established model represents the true plant dynamics exactly [9]. However, modeling complex systems is time-consuming and lacks accuracy; hence, the PID parameters may not be adjusted properly. On the other hand, adaptive PID control based on neural networks adopts supervised learning to optimize the network parameters. There are therefore some limitations in the application of those methods: the teaching signal is hard to obtain, and it is difficult to predict values for unlabeled data. As a result, adaptive PID control based on more advanced machine learning technologies has been discussed alongside the rapid development of computer science.

Bishop [10] clarified that machine learning is divided into three classes of algorithms: supervised learning, unsupervised learning, and reinforcement learning (RL). RL differs significantly from both supervised and unsupervised learning. A definition of RL from Ref. [11] states that an RL agent has the goal of learning the best way to accomplish a task through repeated interactions with its environment. Alternatively, from the control perspective, RL refers to an agent (controller) that interacts with its environment (controlled system) and then modifies its actions (control signal) [12]. Combining RL technology with adaptive PID control has strong potential to impact process control applications, and it has been investigated in studies [13–15]. However, these studies adopted the same updating rule, compacted into one equation for all three parameters, which obscures the mechanism of each parameter update. Moreover, the PID parameter trajectories were not provided. In addition, some existing methods [16–18] employ a system model to predict future system states, which are then used to obtain the predictive value function for the next time step. Nevertheless, this is impractical for real problems, especially process control problems.

Based on the observations above, this paper considers a PID controller with a new adaptive updating rule based on RL technology for nonlinear systems. It has been investigated that the Actor-Critic structure is the most general and successful

Correspondence to: Zhe Guan, E-mail: [email protected]
* KOBELCO Construction Machinery, Dream-Driven Co-Creation Research Center, Hiroshima University, 1-4-1 Kagamiyama, Higashi-Hiroshima, 739-8527, Japan
** Academy of Science and Technology, Hiroshima University, 1-4-1 Kagamiyama, Higashi-Hiroshima, 739-8527, Japan
2.2. Controller structure

It is well recognized that when a PID controller is applied to process systems, the derivative kick sometimes has an impact on the performance of the closed-loop system. As a result, this paper introduces the following velocity-type PID controller:

u(t) = u(t − 1) + K_I(t)e(t) − K_P(t)Δy(t) − K_D(t)Δ²y(t)    (2)

that is,

Δu(t) = K(t)Θ(t)    (3)

where Θ(t) := [e(t), −Δy(t), −Δ²y(t)]ᵀ is composed of the system states, K(t) := [K_I(t), K_P(t), K_D(t)] collects the PID parameters, and Δ denotes the difference operator defined by Δ := 1 − z⁻¹.
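To make the control law of (2)–(3) concrete, the following is a minimal sketch of the velocity-type PID update; the state vector Θ(t) and gain vector K(t) follow the definitions above, while the function and variable names are illustrative only.

```python
import numpy as np

def pid_state(e, y, y1, y2):
    """Build Theta(t) = [e(t), -dy(t), -d2y(t)] from the current error and
    the last three outputs, as in Eq. (3)."""
    dy = y - y1                  # first-order difference of the output
    d2y = y - 2.0 * y1 + y2      # second-order difference of the output
    return np.array([e, -dy, -d2y])

def velocity_pid(u_prev, K, theta):
    """Velocity-type PID law of Eq. (2): u(t) = u(t-1) + K(t) Theta(t),
    with K ordered as [K_I, K_P, K_D]."""
    return u_prev + float(K @ theta)
```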
The temporal difference (TD) error is given by

δ_TD(t) = r(t) + γ V(Θ(t)) − V(Θ(t − 1))    (5)

with 0 < γ ≤ 1 a discount factor, which indicates the decay of the current value function in the absence of the reinforcement signal. The TD error reveals that the learning is based on the immediate reward, namely the reinforcement signal, and the value function. Note that the definitions of the reinforcement signal r(t) and the value function V(Θ(t)) will be given in the next subsection.

3.2. Actor-Critic learning based on RBF network

The RBF network has been used as a technique to identify parameters by performing function mappings. Its simple structure, parameter convergence, and adequate learning are recognized as merits of the RBF network and are discussed in Ref. [21]. As a consequence, the Actor-Critic scheme is implemented with an RBF network, and the network topology is shown in Fig. 2. It consists of a three-layer neural network.

[Fig. 2. RBF network topology with Actor-Critic structure]

The input layer consists of the available process measurements, from which the system states are constructed. On the basis of the RBF network topology, the system states are passed directly to the hidden layer, which is shared by the Actor and the Critic. The control signal u(t) and the value function are generated in a simple way, namely as the weighted sum of the function values associated with the units in the hidden layer [22]. The details of each layer are described as follows.

The input layer includes the vector Θ(t) ∈ R³, which is passed to the hidden layer and used to calculate the outputs of the hidden units.

In the hidden layer, the Gaussian function is selected as the kernel function of the hidden units of the RBF network; therefore, the output φ_j(t) is given as follows:

φ_j(t) = exp( −||Θ(t) − μ_j(t)||² / (2σ_j²(t)) ),  j = 1, 2, 3, . . . , h    (6)

where μ_j and σ_j are the center vector and width scalar of the j-th unit, respectively, and h is the number of hidden units. The center vector is defined as follows:

μ_j(t) := [μ_1j, μ_2j, μ_3j]ᵀ

The third layer is the output layer, where the outputs of the Actor and the Critic are involved. It should be noted that, as mentioned previously, the outputs are calculated in a simple and direct way. The PID parameters K(t) are therefore obtained as follows:

K_{P,I,D}(t) = Σ_{j=1}^{h} w_j^{P,I,D}(t) φ_j(t)    (7)

where w_j^m(t) denotes the weight between the j-th hidden unit and the output layer of the Actor, and m refers to the weight used to calculate the assigned parameter K_P(t), K_I(t), or K_D(t). The value function of the Critic part is obtained as follows:

V(t) = Σ_{j=1}^{h} v_j(t) φ_j(t)    (8)

where v_j(t) denotes the weight between the j-th hidden unit and the output layer of the Critic.
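As an illustration of Eqs. (6)–(8), the sketch below evaluates the shared hidden layer once and then forms the Actor and Critic outputs as weighted sums; the array names (centers, widths, W, v) are placeholders, not notation from the paper.

```python
import numpy as np

def rbf_hidden(theta, centers, widths):
    """Gaussian kernels of Eq. (6); centers has shape (h, 3), widths has shape (h,)."""
    dist2 = np.sum((theta - centers) ** 2, axis=1)   # ||Theta(t) - mu_j(t)||^2 per unit
    return np.exp(-dist2 / (2.0 * widths ** 2))

def actor_critic_outputs(phi, W, v):
    """Actor outputs of Eq. (7) and Critic output of Eq. (8).
    W has shape (h, 3) with columns [w^P, w^I, w^D]; v has shape (h,)."""
    K_P, K_I, K_D = phi @ W               # each gain is a weighted sum of phi_j(t)
    V = float(phi @ v)                    # value function V(t)
    return np.array([K_I, K_P, K_D]), V   # ordered to match Theta(t) = [e, -dy, -d2y]
```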
The reinforcement signal in this study is defined as

r(t) := (1/2){y_d(t) − y(t)}²    (9)

which indicates that the current response of the system is used, so that a system model is not required. The TD error then becomes

δ_TD(t) = (1/2){y_d(t) − y(t)}² + γ V(t) − V(t − 1)    (10)

As a result, the cost function in this study is denoted as follows:

J(t) = (1/2) δ_TD²(t)    (11)

Thus, the updating rule with respect to each output weight of the Actor is developed as

w_j^P(t + 1) = w_j^P(t) − α_w ∂J(t)/∂w_j^P(t)    (12)

where α_w is a learning rate, and

∂J(t)/∂w_j^P(t) = [∂J(t)/∂δ_TD(t)][∂δ_TD(t)/∂y(t)][∂y(t)/∂u(t)][∂u(t)/∂K_P(t)][∂K_P(t)/∂w_j^P(t)]
                = δ_TD(t) e(t) {y(t) − y(t − 1)} φ_j(t) ∂y(t)/∂u(t)    (13)

∂J(t)/∂w_j^I(t) = [∂J(t)/∂δ_TD(t)][∂δ_TD(t)/∂y(t)][∂y(t)/∂u(t)][∂u(t)/∂K_I(t)][∂K_I(t)/∂w_j^I(t)]
                = −δ_TD(t) e²(t) φ_j(t) ∂y(t)/∂u(t)    (14)

∂J(t)/∂w_j^D(t) = [∂J(t)/∂δ_TD(t)][∂δ_TD(t)/∂y(t)][∂y(t)/∂u(t)][∂u(t)/∂K_D(t)][∂K_D(t)/∂w_j^D(t)]
                = δ_TD(t) e(t) {y(t) − 2y(t − 1) + y(t − 2)} φ_j(t) ∂y(t)/∂u(t)    (15)

It should be noted that prior information about the system Jacobian ∂y(t)/∂u(t) is required to calculate the above equations. Here, the relation x = |x| sign(x) is considered; therefore, the system Jacobian is expressed by the following equation:

∂y(t)/∂u(t) = |∂y(t)/∂u(t)| sign(∂y(t)/∂u(t))    (16)

with sign(x) = 1 (x > 0), −1 (x < 0). Based on the above assumption, the sign of the system Jacobian can be obtained [23]. The updating rule for the output weights of the Critic is

v_j(t + 1) = v_j(t) − α_v ∂J(t)/∂v_j(t) = v_j(t) + α_v δ_TD(t) φ_j(t)    (17)

with a learning rate α_v.
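A minimal sketch of the gradient steps (12)–(17) is given below, with the Jacobian replaced by its sign as discussed around Eq. (16); the variable names mirror the earlier sketches and are illustrative.

```python
def update_actor_critic(W, v, phi, delta_td, e, y, y1, y2, alpha_w, alpha_v, jac_sign=1.0):
    """One gradient step on the Actor weights (Eqs. (12)-(15)) and the
    Critic weights (Eq. (17)), using sign(dy/du) in place of the Jacobian."""
    dy = y - y1
    d2y = y - 2.0 * y1 + y2
    # Gradients of J(t) with respect to w_j^P, w_j^I, w_j^D (Eqs. (13)-(15))
    grad_P = delta_td * e * dy * phi * jac_sign
    grad_I = -delta_td * (e ** 2) * phi * jac_sign
    grad_D = delta_td * e * d2y * phi * jac_sign
    W[:, 0] -= alpha_w * grad_P
    W[:, 1] -= alpha_w * grad_I
    W[:, 2] -= alpha_w * grad_D
    # Critic update of Eq. (17)
    v += alpha_v * delta_td * phi
    return W, v
```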
The centers and the widths of the hidden units in the hidden layer are updated in the following ways:

μ_ij(t + 1) = μ_ij(t) − α_μ ∂J(t)/∂μ_ij(t) = μ_ij(t) + α_μ δ_TD(t) v_j(t) φ_j(t) {Θ_i(t) − μ_ij(t)} / σ_j²(t)    (18)

where Θ_i(t) denotes the i-th element of Θ(t) ∈ R³, while

σ_j(t + 1) = σ_j(t) − α_σ ∂J(t)/∂σ_j(t) = σ_j(t) + α_σ δ_TD(t) v_j(t) φ_j(t) ||Θ(t) − μ_j(t)||² / σ_j³(t)    (19)

where α_μ and α_σ are the learning rates of the centers and widths, respectively.
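The corresponding kernel-parameter updates (18)–(19) can be sketched as follows, again using the placeholder arrays from the sketches above.

```python
import numpy as np

def update_kernels(centers, widths, theta, phi, v, delta_td, alpha_mu, alpha_sigma):
    """Adapt the Gaussian centers (Eq. (18)) and widths (Eq. (19)) of each hidden unit."""
    diff = theta - centers                    # Theta_i(t) - mu_ij(t), shape (h, 3)
    coef = delta_td * v * phi                 # delta_TD(t) v_j(t) phi_j(t), shape (h,)
    centers += alpha_mu * (coef / widths ** 2)[:, None] * diff
    widths += alpha_sigma * coef * np.sum(diff ** 2, axis=1) / widths ** 3
    return centers, widths
```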
3.3. Algorithm summary

Every design step of the proposed adaptive PID controller under the Actor-Critic structure based on the RBF network is presented in Algorithm 1.

4: Observe the system output y(t); the system error e(t) can then be obtained.
5: Compute the kernel function (6) in the hidden layer.
6: Calculate the output of the Actor, that is, the current PID parameters from (7), and the output of the Critic, the value function from (8).
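Putting the pieces together, one pass of the adaptive loop suggested by Algorithm 1 might look as follows, using the helper functions from the sketches above; the toy plant, initial values, learning rates, and step count are illustrative assumptions, not the paper's simulation settings.

```python
import numpy as np

h = 6                                    # number of hidden units (assumed)
centers = np.random.uniform(-1.0, 1.0, (h, 3))
widths = np.ones(h)
W = np.zeros((h, 3))                     # Actor weights: initial PID parameters are zero
v = np.zeros(h)                          # Critic weights
u = y = y1 = y2 = 0.0
V_prev = 0.0
gamma, alpha_w, alpha_v, alpha_mu, alpha_sigma = 0.9, 0.01, 0.01, 0.01, 0.01
yd = 1.0                                 # reference signal (assumed constant here)

for t in range(400):
    # Observe the output, form the error, the state vector, and the hidden outputs
    e = yd - y
    theta = pid_state(e, y, y1, y2)                      # Eq. (3)
    phi = rbf_hidden(theta, centers, widths)             # Eq. (6)
    K, V = actor_critic_outputs(phi, W, v)               # Eqs. (7)-(8)
    # TD error from the immediate reward and the stored value (Eqs. (9)-(10))
    r = 0.5 * e ** 2
    delta_td = r + gamma * V - V_prev
    # Adapt Actor/Critic weights and kernel parameters (Eqs. (12)-(19))
    W, v = update_actor_critic(W, v, phi, delta_td, e, y, y1, y2, alpha_w, alpha_v)
    centers, widths = update_kernels(centers, widths, theta, phi, v,
                                     delta_td, alpha_mu, alpha_sigma)
    V_prev = V
    # Apply the velocity-type PID law (Eq. (2)) and step a toy plant
    u = velocity_pid(u, K, theta)
    y2, y1 = y1, y
    y = 0.6 * y1 - 0.1 * y2 + 1.2 * u    # placeholder plant, for illustration only
```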
[Fig. 3. Control result obtained by the proposed scheme for case one]

[Fig. 5. Control result obtained by the conventional scheme for case one]

Figure 5 shows unsatisfactory performance in terms of tracking property based on the conventional scheme, since it cannot deal with the nonlinearity in the system.

where ξ(t) is white Gaussian noise with zero mean and variance of 0.01². Besides, the reference signal values are set as follows:

y_d(t) = 0.5 (0 ≤ t < 100), 1 (100 ≤ t < 200), 2 (200 ≤ t < 300), 1.5 (300 ≤ t < 400)    (23)
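For reference, the piecewise-constant reference signal of Eq. (23) could be generated as follows.

```python
def reference_case_two(t):
    """Piecewise-constant reference signal of Eq. (23)."""
    if t < 100:
        return 0.5
    elif t < 200:
        return 1.0
    elif t < 300:
        return 2.0
    else:
        return 1.5
```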
The user-specified learning rates are set to the same values as in case one.

The proposed scheme is employed, and the control results are shown in Fig. 6, where it is apparent that desirable control results can be obtained. The trajectories of the PID parameters for this case are depicted in Fig. 7, where the parameters are adjusted suitably according to the change of the reference signal. They reach constant values as time goes on.

[Fig. 6. Control result obtained by the proposed scheme for case two]

[Fig. 7. Trajectories of adaptive PID parameters for case two]

A comparative study for this case is conducted by employing a conventional PID tuning method, whose parameters are shown below:

K_P = 0.486, K_I = 0.227, K_D = 0.122

These PID parameters are calculated by the Chien–Hrones–Reswick method [3]. The result is shown in Fig. 8, where oscillation appears because of the nonlinearity.

[Fig. 8. Control result obtained by the conventional scheme]

As a result, the effectiveness and feasibility are finally confirmed through these two case studies.

5. Conclusions

This paper has studied a novel adaptive PID controller under the Actor-Critic structure based on an RBF network for nonlinear systems. A new adaptive updating rule was presented via the weight updates in the network. First, the conventional PID controller was combined with RL on the basis of the RBF network, and the parameters were adapted in an on-line manner. The mechanism of the proposed scheme did not require a system model to predict the system states. The TD error was defined by considering the current and previously stored values of the value function. Then, the hidden layer of the RBF network was shared by the Actor and the Critic, so storage space could be saved and the computation cost of the hidden-unit outputs was reduced. In addition, the initial PID parameters were set to zero, which means there is no need for prior knowledge of the controlled system.

Finally, numerical simulations were given to indicate the efficiency and feasibility of the proposed scheme for complex nonlinear systems. The PID parameters based on the new adaptive updating rule reached constant values. The proposed scheme will be employed in a real system to verify the effectiveness from the practical point of view in the near future.
Acknowledgment

This work was supported by the 'Hiroshima Manufacturing Digital Innovation Creation Program' with a Grant from the Cabinet Office, Government of Japan, and Hiroshima Prefecture.

References

(1) Åström KJ, Hägglund T. PID Controllers: Theory, Design and Tuning. 2nd ed. Research Triangle Park, NC: Instrument Society of America; 1995.
(2) Ziegler JG, Nichols NB. Optimum settings for automatic controllers. Transactions of the ASME 1942; 64:759–768.
(3) Chien KL, Hrones JA, Reswick JB. On the automatic control of generalized passive systems. Transactions of the ASME 1952; 74(2):175–185.
(4) Chang WD, Hwang RC, Hsieh JG. A multivariable on-line adaptive PID controller using auto-tuning neurons. Engineering Applications of Artificial Intelligence 2003; 16:57–63.
(5) Yamamoto T, Shah S. Design and experimental evaluation of multivariable self-tuning PID controller. IEE Proceedings - Control Theory and Applications 2004; 151(5):645–652.
(6) Yu DL, Chang TK, Yu DW. A stable self-learning PID control for multivariable time varying systems. Control Engineering Practice 2007; 15(12):1577–1587.
(7) Chen JH, Huang TC. Applying neural networks to on-line updated PID controllers for nonlinear process control. Journal of Process Control 2004; 14(2):211–230.
(8) Liao YT, Koiwai K, Yamamoto T. Design and implementation of a hierarchical-clustering CMAC PID controller. Asian Journal of Control 2019; 21(3):1077–1087.
(9) Hou ZS, Chi RH, Gao HJ. An overview of dynamic linearization based data-driven control and applications. IEEE Transactions on Industrial Electronics 2016; 64(5):4076–4090.
(10) Bishop CM. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc.: Secaucus, NJ; 2006.
(11) Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press; 2018.
(12) Lewis FL, Vrabie D. Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits and Systems Magazine 2009; 9(3):32–50.
(13) Wang XS, Cheng YH, Sun W. A proposal of adaptive PID controller based on reinforcement learning. Journal of China University of Mining and Technology 2007; 17(1):40–44.
(14) Howell MN, Best MC. On-line PID tuning for engine idle-speed control using continuous action reinforcement learning automata. Control Engineering Practice 2000; 8:147–154.
(15) Jin ZS, Li HC, Gao HM. An intelligent weld control strategy based on reinforcement learning approach. The International Journal of Advanced Manufacturing Technology 2019; 100:2163–2175.
(16) Prokhorov DV, Santiago RA, Wunsch DC II. Adaptive critic designs: A case study for neuro-control. Neural Networks 1995; 8:1367–1372.
(17) Morinelly JE, Ydstie BE. Dual MPC with reinforcement learning. IFAC-PapersOnLine 2016; 49:266–271.
(18) Shin J, Lee JH, Realff MJ. Operational planning and optimal sizing of microgrid considering multi-scale wind uncertainty. Applied Energy 2017; 195:616–633.
(19) Shin J, Badgwell TA, Liu KH, Lee JH. Reinforcement learning - overview of recent progress and implications for process control. Computers and Chemical Engineering 2019; 127:282–294.
(20) Barto AG, Sutton RS, Anderson C. Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics 1983; SMC-13:834–846.
(21) Elanayar SVT, Shin YC. Radial basis function neural network for approximation and estimation of nonlinear stochastic dynamic systems. IEEE Transactions on Neural Networks 1994; 5(4):584–603.
(22) Roger Jang JS, Sun CT. Functional equivalence between radial basis function networks and fuzzy inference systems. IEEE Transactions on Neural Networks 1993; 4(1):156–159.
(23) Omatu S, Marzuki K, Rubiyah Y. Neuro-Control and Its Applications. Springer-Verlag: London, U.K.; 1995.
(24) Narendra KS, Parthasarathy K. Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks 1990; 1(1):4–27.
(25) Zi-Qiang L. On identification of the controlled plants described by the Hammerstein system. IEEE Transactions on Automatic Control 1994; AC-39(2):569–573.

Zhe Guan (Member) received the M.S. and D.Eng. degrees in control system engineering from Hiroshima University, Japan, in 2015 and 2018, respectively. He is currently working in the KOBELCO Construction Machinery Dream-Driven Co-Creation Research Center, Hiroshima University, Japan. He worked at Micron Japan as a dry etch engineer in 2018. He was a visiting postdoc with the Department of Electrical and Computer Engineering, University of Alberta, from April to June 2019. His current research interests are in the areas of adaptive control, data-driven control, reinforcement learning, and their applications. He is a member of the Society of Instrument and Control Engineers in Japan (SICE).

Toru Yamamoto (Fellow) received the B.Eng. and M.Eng. degrees from Tokushima University, Japan, in 1984 and 1987, respectively, and the D.Eng. degree from Osaka University, Japan, in 1994. He is currently a Professor with the Department of System Cybernetics, Graduate School of Engineering, Hiroshima University, Japan, and a leader of the National Project on Regional Industry Innovation with support from the Cabinet Office, Government of Japan. He was a Visiting Researcher with the Department of Mathematical Engineering and Information Physics, University of Tokyo, Japan, in 1991, and an Overseas Research Fellow of the Japan Society for the Promotion of Science (JSPS) with the Department of Chemical and Materials Engineering, University of Alberta, for 6 months in 2006. His current research interests are in the areas of self-tuning and learning control, data-driven control, and their implementation for industrial systems. Dr. Yamamoto was the National Organizing Committee Chair of the 5th International Conference on Advanced Control of Industrial Processes (ADCONIP 2014) and General Chair of the SICE (Society of Instrument and Control Engineers in Japan) Annual Conference 2019, both held in Hiroshima. He is also a Fellow of SICE and the Japan Society of Mechanical Engineers (JSME).