Reinforcement Learning With PID Loop
This paper addresses a design scheme for a proportional-integral-derivative (PID) controller with a new adaptive updating rule based on a reinforcement learning (RL) approach for nonlinear systems. A new design scheme in which RL complements conventional PID control technology is presented. In the proposed scheme, a single radial basis function (RBF) network is used to calculate both the control policy function of the Actor and the value function of the Critic. In accordance with the PID controller structure, the inputs of the RBF network are the system error, the first-order difference of the output, and the second-order difference of the output, which together compose the system states. The temporal difference (TD) error in the proposed scheme involves the reinforcement signal and the current and previously stored values of the value function. The gradient descent method is adopted for a performance index based on the TD error, and the updating rules are then derived. The network weights and the kernel functions can therefore be calculated in an adaptive way. Finally, numerical simulations are conducted on nonlinear systems to illustrate the efficiency and robustness of the proposed scheme. © 2021 Institute of Electrical Engineers of Japan. Published by Wiley Periodicals LLC.
Keywords: reinforcement learning; PID control; Actor-Critic learning; RBF network; nonlinear system
1. Introduction

Proportional-integral-derivative (PID) control is one of the most common control schemes and has dominated the majority of industrial processes and mechanical systems, owing to its versatility, high reliability, and ease of operation [1]. PID controllers can be tuned manually by operators and control engineers based on empirical knowledge when the mathematical model of the controlled plant is unknown. Classical tuning methods, such as the Ziegler–Nichols method [2] and the Chien–Hrones–Reswick method [3], are applied to process control, and the resulting performance significantly outperforms manual tuning. However, those methods work well only for simple controlled plants; for complex systems with nonlinearity, the performance cannot be guaranteed in the presence of uncertainty and unknown dynamics. Therefore, adaptive PID control has received considerable attention in recent years in order to deal with varying systems.

Several adaptive PID control strategies have been proposed, including model-based adaptive PID control [4–6] and adaptive PID control based on neural networks [7,8]. It has been clarified that model-based adaptive PID control requires the assumption that the established model represents the true plant dynamics exactly [9]. However, modeling complex systems is time-consuming and lacks accuracy; hence, the PID parameters may not be adjusted properly. On the other hand, adaptive PID control based on neural networks adopts supervised learning to optimize the network parameters. There are therefore some limitations in the application of those methods: the teaching signal is hard to obtain, and it is difficult to predict values for unlabeled data. As a result, adaptive PID control based on more advanced machine learning technologies has been discussed alongside the rapid development of computer science.

Bishop [10] clarified that machine learning is divided into three classes of algorithms: supervised learning, unsupervised learning, and reinforcement learning (RL). RL differs significantly from both supervised and unsupervised learning. A definition of RL from Ref. [11] states that an RL agent has the goal of learning the best way to accomplish a task through repeated interactions with its environment. Alternatively, from the control perspective, RL refers to an agent (controller) that interacts with its environment (controlled system) and then modifies its actions (control signal) [12]. Combining RL technology with adaptive PID control has strong potential to impact process control applications, and it has been investigated in studies [13–15]. However, these studies adopted the same updating rule, compacted into one equation for all three parameters, which obscures the mechanism of each parameter update. Moreover, the PID parameter trajectories were not provided. In addition, some existing methods [16–18] employ a system model to predict future system states, which are then used to obtain the predictive value function for the next time step. Nevertheless, this is impractical for real problems, especially process control problems.

Based on the observations above, this paper considers a PID controller with a new adaptive updating rule based on RL technology for nonlinear systems. It has been investigated that the Actor-Critic structure is the most general and successful

Correspondence to: Zhe Guan, E-mail: [email protected]
* KOBELCO Construction Machinery, Dream-Driven Co-Creation Research Center, Hiroshima University, 1-4-1 Kagamiyama, Higashi-Hiroshima, 739-8527, Japan
** Academy of Science and Technology, Hiroshima University, 1-4-1 Kagamiyama, Higashi-Hiroshima, 739-8527, Japan
2.2. Controller structure

It is well recognized that when a PID controller is applied to process systems, the derivative kick sometimes has an impact on the performance of the closed-loop system. As a result, this paper introduces the following velocity-type PID controller:

u(t) = u(t − 1) + K_I(t)e(t) − K_P(t)Δy(t) − K_D(t)Δ²y(t)    (2)

that is,

Δu(t) = K(t)Θ(t)    (3)

where Θ(t) := [e(t), −Δy(t), −Δ²y(t)]ᵀ is composed of the system states, K(t) := [K_I(t), K_P(t), K_D(t)] collects the PID parameters, and Δ denotes the difference operator defined by Δ := 1 − z⁻¹.
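To make the control law of (2)–(3) concrete, the following is a minimal sketch of the velocity-type PID update; the state vector Θ(t) and gain vector K(t) follow the definitions above, while the function and variable names are illustrative only.

```python
import numpy as np

def pid_state(e, y, y1, y2):
    """Build Theta(t) = [e(t), -dy(t), -d2y(t)] from the current error and
    the last three outputs, as in Eq. (3)."""
    dy = y - y1                  # first-order difference of the output
    d2y = y - 2.0 * y1 + y2      # second-order difference of the output
    return np.array([e, -dy, -d2y])

def velocity_pid(u_prev, K, theta):
    """Velocity-type PID law of Eq. (2): u(t) = u(t-1) + K(t) Theta(t),
    with K ordered as [K_I, K_P, K_D]."""
    return u_prev + float(K @ theta)
```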
The temporal difference (TD) error is given by

δ_TD(t) = r(t) + γ V(Θ(t)) − V(Θ(t − 1))    (5)

with 0 < γ ≤ 1 a discount factor, which indicates the decay of the current value function in the absence of the reinforcement signal. The TD error reveals that the learning is based on the immediate reward, namely the reinforcement signal, and the value function. Note that the definitions of the reinforcement signal r(t) and the value function V(Θ(t)) will be given in the next subsection.

3.2. Actor-Critic learning based on RBF network

The RBF network has been used as a technique to identify parameters by performing function mappings. Its simple structure, parameter convergence, and adequate learning are recognized as merits of the RBF network and are discussed in Ref. [21]. As a consequence, the Actor-Critic scheme is implemented with an RBF network, and the network topology is shown in Fig. 2. It consists of a three-layer neural network.

[Fig. 2. RBF network topology with Actor-Critic structure]

The input layer consists of the available process measurements, from which the system states are constructed. On the basis of the RBF network topology, the system states are passed directly to the hidden layer, which is shared by the Actor and the Critic. The control signal u(t) and the value function are generated in a simple way, namely as the weighted sum of the function values associated with the units in the hidden layer [22]. The details of each layer are described as follows.

The input layer includes the vector Θ(t) ∈ R³, which is passed to the hidden layer and used to calculate the outputs of the hidden units.

In the hidden layer, the Gaussian function is selected as the kernel function of the hidden units of the RBF network; therefore, the output φ_j(t) is given as follows:

φ_j(t) = exp( −||Θ(t) − μ_j(t)||² / (2σ_j²(t)) ),  j = 1, 2, 3, . . . , h    (6)

where μ_j and σ_j are the center vector and width scalar of the j-th unit, respectively, and h is the number of hidden units. The center vector is defined as follows:

μ_j(t) := [μ_1j, μ_2j, μ_3j]ᵀ

The third layer is the output layer, where the outputs of the Actor and the Critic are involved. It should be noted that, as mentioned previously, the outputs are calculated in a simple and direct way. The PID parameters K(t) are therefore obtained as follows:

K_{P,I,D}(t) = Σ_{j=1}^{h} w_j^{P,I,D}(t) φ_j(t)    (7)

where w_j^m(t) denotes the weight between the j-th hidden unit and the output layer of the Actor, and m refers to the weight used to calculate the assigned parameter K_P(t), K_I(t), or K_D(t). The value function of the Critic part is obtained as follows:

V(t) = Σ_{j=1}^{h} v_j(t) φ_j(t)    (8)

where v_j(t) denotes the weight between the j-th hidden unit and the output layer of the Critic.
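As an illustration of Eqs. (6)–(8), the sketch below evaluates the shared hidden layer once and then forms the Actor and Critic outputs as weighted sums; the array names (centers, widths, W, v) are placeholders, not notation from the paper.

```python
import numpy as np

def rbf_hidden(theta, centers, widths):
    """Gaussian kernels of Eq. (6); centers has shape (h, 3), widths has shape (h,)."""
    dist2 = np.sum((theta - centers) ** 2, axis=1)   # ||Theta(t) - mu_j(t)||^2 per unit
    return np.exp(-dist2 / (2.0 * widths ** 2))

def actor_critic_outputs(phi, W, v):
    """Actor outputs of Eq. (7) and Critic output of Eq. (8).
    W has shape (h, 3) with columns [w^P, w^I, w^D]; v has shape (h,)."""
    K_P, K_I, K_D = phi @ W               # each gain is a weighted sum of phi_j(t)
    V = float(phi @ v)                    # value function V(t)
    return np.array([K_I, K_P, K_D]), V   # ordered to match Theta(t) = [e, -dy, -d2y]
```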
The reinforcement signal in this study is defined as

r(t) := (1/2){y_d(t) − y(t)}²    (9)

which indicates that the current response of the system is used, so that a system model is not required. The TD error then becomes

δ_TD(t) = (1/2){y_d(t) − y(t)}² + γ V(t) − V(t − 1)    (10)

As a result, the cost function in this study is denoted as follows:

J(t) = (1/2) δ_TD²(t)    (11)

Thus, the updating rule with respect to each output weight of the Actor is developed as

w_j^P(t + 1) = w_j^P(t) − α_w ∂J(t)/∂w_j^P(t)    (12)

where α_w is a learning rate, and

∂J(t)/∂w_j^P(t) = [∂J(t)/∂δ_TD(t)][∂δ_TD(t)/∂y(t)][∂y(t)/∂u(t)][∂u(t)/∂K_P(t)][∂K_P(t)/∂w_j^P(t)]
                = δ_TD(t) e(t) {y(t) − y(t − 1)} φ_j(t) ∂y(t)/∂u(t)    (13)

∂J(t)/∂w_j^I(t) = [∂J(t)/∂δ_TD(t)][∂δ_TD(t)/∂y(t)][∂y(t)/∂u(t)][∂u(t)/∂K_I(t)][∂K_I(t)/∂w_j^I(t)]
                = −δ_TD(t) e²(t) φ_j(t) ∂y(t)/∂u(t)    (14)

∂J(t)/∂w_j^D(t) = [∂J(t)/∂δ_TD(t)][∂δ_TD(t)/∂y(t)][∂y(t)/∂u(t)][∂u(t)/∂K_D(t)][∂K_D(t)/∂w_j^D(t)]
                = δ_TD(t) e(t) {y(t) − 2y(t − 1) + y(t − 2)} φ_j(t) ∂y(t)/∂u(t)    (15)

It should be noted that prior information about the system Jacobian ∂y(t)/∂u(t) is required to calculate the above equations. Here, the relation x = |x| sign(x) is considered; therefore, the system Jacobian is expressed by the following equation:

∂y(t)/∂u(t) = |∂y(t)/∂u(t)| sign(∂y(t)/∂u(t))    (16)

with sign(x) = 1 (x > 0), −1 (x < 0). Based on the above assumption, the sign of the system Jacobian can be obtained [23]. The updating rule for the output weights of the Critic is

v_j(t + 1) = v_j(t) − α_v ∂J(t)/∂v_j(t) = v_j(t) + α_v δ_TD(t) φ_j(t)    (17)

with a learning rate α_v.
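A minimal sketch of the gradient steps (12)–(17) is given below, with the Jacobian replaced by its sign as discussed around Eq. (16); the variable names mirror the earlier sketches and are illustrative.

```python
def update_actor_critic(W, v, phi, delta_td, e, y, y1, y2, alpha_w, alpha_v, jac_sign=1.0):
    """One gradient step on the Actor weights (Eqs. (12)-(15)) and the
    Critic weights (Eq. (17)), using sign(dy/du) in place of the Jacobian."""
    dy = y - y1
    d2y = y - 2.0 * y1 + y2
    # Gradients of J(t) with respect to w_j^P, w_j^I, w_j^D (Eqs. (13)-(15))
    grad_P = delta_td * e * dy * phi * jac_sign
    grad_I = -delta_td * (e ** 2) * phi * jac_sign
    grad_D = delta_td * e * d2y * phi * jac_sign
    W[:, 0] -= alpha_w * grad_P
    W[:, 1] -= alpha_w * grad_I
    W[:, 2] -= alpha_w * grad_D
    # Critic update of Eq. (17)
    v += alpha_v * delta_td * phi
    return W, v
```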
The centers and the widths of the hidden units in the hidden layer are updated in the following ways:

μ_ij(t + 1) = μ_ij(t) − α_μ ∂J(t)/∂μ_ij(t) = μ_ij(t) + α_μ δ_TD(t) v_j(t) φ_j(t) {Θ_i(t) − μ_ij(t)} / σ_j²(t)    (18)

where Θ_i(t) denotes the i-th element of Θ(t) ∈ R³, while

σ_j(t + 1) = σ_j(t) − α_σ ∂J(t)/∂σ_j(t) = σ_j(t) + α_σ δ_TD(t) v_j(t) φ_j(t) ||Θ(t) − μ_j(t)||² / σ_j³(t)    (19)

where α_μ and α_σ are the learning rates of the centers and widths, respectively.
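The corresponding kernel-parameter updates (18)–(19) can be sketched as follows, again using the placeholder arrays from the sketches above.

```python
import numpy as np

def update_kernels(centers, widths, theta, phi, v, delta_td, alpha_mu, alpha_sigma):
    """Adapt the Gaussian centers (Eq. (18)) and widths (Eq. (19)) of each hidden unit."""
    diff = theta - centers                    # Theta_i(t) - mu_ij(t), shape (h, 3)
    coef = delta_td * v * phi                 # delta_TD(t) v_j(t) phi_j(t), shape (h,)
    centers += alpha_mu * (coef / widths ** 2)[:, None] * diff
    widths += alpha_sigma * coef * np.sum(diff ** 2, axis=1) / widths ** 3
    return centers, widths
```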
3.3. Algorithm summary

Every design step of the proposed adaptive PID controller under the Actor-Critic structure based on the RBF network is presented in Algorithm 1.

4: Observe the system output y(t); the system error e(t) can then be obtained.
5: Compute the kernel function (6) in the hidden layer.
6: Calculate the output of the Actor, that is, the current PID parameters from (7), and the output of the Critic, the value function from (8).
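Putting the pieces together, one pass of the adaptive loop suggested by Algorithm 1 might look as follows, using the helper functions from the sketches above; the toy plant, initial values, learning rates, and step count are illustrative assumptions, not the paper's simulation settings.

```python
import numpy as np

h = 6                                    # number of hidden units (assumed)
centers = np.random.uniform(-1.0, 1.0, (h, 3))
widths = np.ones(h)
W = np.zeros((h, 3))                     # Actor weights: initial PID parameters are zero
v = np.zeros(h)                          # Critic weights
u = y = y1 = y2 = 0.0
V_prev = 0.0
gamma, alpha_w, alpha_v, alpha_mu, alpha_sigma = 0.9, 0.01, 0.01, 0.01, 0.01
yd = 1.0                                 # reference signal (assumed constant here)

for t in range(400):
    # Observe the output, form the error, the state vector, and the hidden outputs
    e = yd - y
    theta = pid_state(e, y, y1, y2)                      # Eq. (3)
    phi = rbf_hidden(theta, centers, widths)             # Eq. (6)
    K, V = actor_critic_outputs(phi, W, v)               # Eqs. (7)-(8)
    # TD error from the immediate reward and the stored value (Eqs. (9)-(10))
    r = 0.5 * e ** 2
    delta_td = r + gamma * V - V_prev
    # Adapt Actor/Critic weights and kernel parameters (Eqs. (12)-(19))
    W, v = update_actor_critic(W, v, phi, delta_td, e, y, y1, y2, alpha_w, alpha_v)
    centers, widths = update_kernels(centers, widths, theta, phi, v,
                                     delta_td, alpha_mu, alpha_sigma)
    V_prev = V
    # Apply the velocity-type PID law (Eq. (2)) and step a toy plant
    u = velocity_pid(u, K, theta)
    y2, y1 = y1, y
    y = 0.6 * y1 - 0.1 * y2 + 1.2 * u    # placeholder plant, for illustration only
```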
[Fig. 3. Control result obtained by the proposed scheme for case one]

[Fig. 5. Control result obtained by the conventional scheme for case one]

Figure 5 shows unsatisfactory performance in terms of tracking property based on the conventional scheme, since it cannot deal with the nonlinearity in the system.

where ξ(t) is white Gaussian noise with zero mean and variance of 0.01². Besides, the reference signal values are set as follows:

y_d(t) = 0.5 (0 ≤ t < 100), 1 (100 ≤ t < 200), 2 (200 ≤ t < 300), 1.5 (300 ≤ t < 400)    (23)
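For reference, the piecewise-constant reference signal of Eq. (23) could be generated as follows.

```python
def reference_case_two(t):
    """Piecewise-constant reference signal of Eq. (23)."""
    if t < 100:
        return 0.5
    elif t < 200:
        return 1.0
    elif t < 300:
        return 2.0
    else:
        return 1.5
```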
The user-specified learning rates are set to the same values as in case one.

The proposed scheme is employed, and the control results are shown in Fig. 6, where it is apparent that desirable control results can be obtained. The trajectories of the PID parameters for this case are depicted in Fig. 7, where the parameters are adjusted suitably according to the change of the reference signal. They reach constant values as time goes on.

[Fig. 6. Control result obtained by the proposed scheme for case two]

[Fig. 7. Trajectories of adaptive PID parameters for case two]

A comparative study for this case is conducted by employing a conventional PID tuning method, whose parameters are shown below:

K_P = 0.486, K_I = 0.227, K_D = 0.122

These PID parameters are calculated by the Chien–Hrones–Reswick method [3]. The result is shown in Fig. 8, where oscillation appears because of the nonlinearity.

[Fig. 8. Control result obtained by the conventional scheme]

As a result, the effectiveness and feasibility are finally confirmed through these two case studies.

5. Conclusions

This paper has studied a novel adaptive PID controller under the Actor-Critic structure based on an RBF network for nonlinear systems. A new adaptive updating rule was presented via the weight updates in the network. First, the conventional PID controller was combined with RL on the basis of the RBF network, and the parameters were adapted in an on-line manner. The mechanism of the proposed scheme did not require a system model to predict the system states. The TD error was defined by considering the current and previously stored values of the value function. Then, the hidden layer of the RBF network was shared by the Actor and the Critic, so storage space could be saved and the computation cost of the hidden-unit outputs was reduced. In addition, the initial PID parameters were set to zero, which means there is no need for prior knowledge of the controlled system.

Finally, numerical simulations were given to indicate the efficiency and feasibility of the proposed scheme for complex nonlinear systems. The PID parameters based on the new adaptive updating rule reached constant values. The proposed scheme will be employed in a real system to verify the effectiveness from the practical point of view in the near future.
Acknowledgment

This work was supported by the 'Hiroshima Manufacturing Digital Innovation Creation Program' with a Grant from the Cabinet Office, Government of Japan, and Hiroshima Prefecture.

References

(1) Åström KJ, Hägglund T. PID Controllers: Theory, Design and Tuning. 2nd ed. Research Triangle Park, NC: Instrument Society of America; 1995.
(2) Ziegler JG, Nichols NB. Optimum settings for automatic controllers. Transactions of the ASME 1942; 64:759–768.
(3) Chien KL, Hrones JA, Reswick JB. On the automatic control of generalized passive systems. Transactions of the ASME 1952; 74(2):175–185.
(4) Chang WD, Hwang RC, Hsieh JG. A multivariable on-line adaptive PID controller using auto-tuning neurons. Engineering Applications of Artificial Intelligence 2003; 16:57–63.
(5) Yamamoto T, Shah S. Design and experimental evaluation of multivariable self-tuning PID controller. IEE Proceedings - Control Theory and Applications 2004; 151(5):645–652.
(6) Yu DL, Chang TK, Yu DW. A stable self-learning PID control for multivariable time varying systems. Control Engineering Practice 2007; 15(12):1577–1587.
(7) Chen JH, Huang TC. Applying neural networks to on-line updated PID controllers for nonlinear process control. Journal of Process Control 2004; 14(2):211–230.
(8) Liao YT, Koiwai K, Yamamoto T. Design and implementation of a hierarchical-clustering CMAC PID controller. Asian Journal of Control 2019; 21(3):1077–1087.
(9) Hou ZS, Chi RH, Gao HJ. An overview of dynamic linearization based data-driven control and applications. IEEE Transactions on Industrial Electronics 2016; 64(5):4076–4090.
(10) Bishop CM. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc.: Secaucus, NJ; 2006.
(11) Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press; 2018.
(12) Lewis FL, Vrabie D. Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits and Systems Magazine 2009; 9(3):32–50.
(13) Wang XS, Cheng YH, Sun W. A proposal of adaptive PID controller based on reinforcement learning. Journal of China University of Mining and Technology 2007; 17(1):40–44.
(14) Howell MN, Best MC. On-line PID tuning for engine idle-speed control using continuous action reinforcement learning automata. Control Engineering Practice 2000; 8:147–154.
(15) Jin ZS, Li HC, Gao HM. An intelligent weld control strategy based on reinforcement learning approach. The International Journal of Advanced Manufacturing Technology 2019; 100:2163–2175.
(16) Prokhorov DV, Santiago RA, Wunsch DC II. Adaptive critic designs: A case study for neuro-control. Neural Networks 1995; 8:1367–1372.
(17) Morinelly JE, Ydstie BE. Dual MPC with reinforcement learning. IFAC-PapersOnLine 2016; 49:266–271.
(18) Shin J, Lee JH, Realff MJ. Operational planning and optimal sizing of microgrid considering multi-scale wind uncertainty. Applied Energy 2017; 195:616–633.
(19) Shin J, Badgwell TA, Liu KH, Lee JH. Reinforcement learning - overview of recent progress and implications for process control. Computers and Chemical Engineering 2019; 127:282–294.
(20) Barto AG, Sutton RS, Anderson C. Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics 1983; SMC-13:834–846.
(21) Elanayar SVT, Shin YC. Radial basis function neural network for approximation and estimation of nonlinear stochastic dynamic systems. IEEE Transactions on Neural Networks 1994; 5(4):584–603.
(22) Roger Jang JS, Sun CT. Functional equivalence between radial basis function networks and fuzzy inference systems. IEEE Transactions on Neural Networks 1993; 4(1):156–159.
(23) Omatu S, Marzuki K, Rubiyah Y. Neuro-Control and Its Applications. Springer-Verlag: London, U.K.; 1995.
(24) Narendra KS, Parthasarathy K. Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks 1990; 1(1):4–27.
(25) Zi-Qiang L. On identification of the controlled plants described by the Hammerstein system. IEEE Transactions on Automatic Control 1994; AC-39(2):569–573.

Zhe Guan (Member) received the M.S. and D.Eng. degrees in control system engineering from Hiroshima University, Japan, in 2015 and 2018, respectively. He is currently working in the KOBELCO Construction Machinery Dream-Driven Co-Creation Research Center, Hiroshima University, Japan. He worked at Micron Japan as a dry etch engineer in 2018. He was a visiting postdoc with the Department of Electrical and Computer Engineering, University of Alberta, from April to June 2019. His current research interests are in the areas of adaptive control, data-driven control, reinforcement learning, and their applications. He is a member of the Society of Instrument and Control Engineers in Japan (SICE).

Toru Yamamoto (Fellow) received the B.Eng. and M.Eng. degrees from Tokushima University, Japan, in 1984 and 1987, respectively, and the D.Eng. degree from Osaka University, Japan, in 1994. He is currently a Professor with the Department of System Cybernetics, Graduate School of Engineering, Hiroshima University, Japan, and a leader of the National Project on Regional Industry Innovation with support from the Cabinet Office, Government of Japan. He was a Visiting Researcher with the Department of Mathematical Engineering and Information Physics, University of Tokyo, Japan, in 1991, and an Overseas Research Fellow of the Japan Society for the Promotion of Science (JSPS) with the Department of Chemical and Materials Engineering, University of Alberta, for 6 months in 2006. His current research interests are in the areas of self-tuning and learning control, data-driven control, and their implementation for industrial systems. Dr. Yamamoto was the National Organizing Committee Chair of the 5th International Conference on Advanced Control of Industrial Processes (ADCONIP 2014) and General Chair of the SICE (Society of Instrument and Control Engineers in Japan) Annual Conference 2019, both held in Hiroshima. He is also a Fellow of SICE and the Japan Society of Mechanical Engineers (JSME).