
IEEE TRANSACTIONS ON POWER ELECTRONICS, VOL. 35, NO. 10, OCTOBER 2020

Letters
DC/DC Power Converter Control-Based Deep Machine Learning Techniques:
Real-Time Implementation
Mojtaba Hajihosseini, Milad Andalibi, Meysam Gheisarnejad, Hamed Farsizadeh, and Mohammad-Hassan Khooban, Senior Member, IEEE

Abstract—The recent advances in power plants and energy resources have extended the applications of buck-boost converters in the context of dc microgrids (MGs). However, the implementation of such interface systems in MG applications is seriously threatened by instability issues imposed by constant power loads (CPLs). The objective is, without the accurate modeling information of a dc MG system, to develop a new adaptive control methodology for voltage stabilization of the dc–dc converters feeding CPLs with low ripples. To achieve this goal, in this letter, the deep reinforcement learning (DRL) technique with the Actor–Critic architecture is incorporated into an ultralocal model (ULM) control scheme to address the destabilization effect of the CPLs under reference voltage variations. In the suggested control approach, the feedback controller gains of the ULM controller are considered as the adjustable controller coefficients, which are adaptively designed by the DRL technique through online learning of its neural networks (NNs). It is proved that the suggested scheme will ensure the rigorous stability of the power electronic system, for simultaneous effects of CPL and reference voltage changes, by adaptively adjusting the ULM controller gains. To appraise the merits and usefulness of the suggested adaptive methodology, some dSPACE MicroLabBox outcomes on a real-time testbed of the dc–dc converter feeding a CPL are presented.

Index Terms—Constant power load (CPL), dc–dc buck-boost converter, deep reinforcement learning (DRL), ultralocal model (ULM).

I. INTRODUCTION

RECENTLY, the usage of the dc microgrid (MG) has widened in numerous industrial applications due to its advantages over the ac MG [1]. Despite these advantages, dc MGs are faced with an instability problem of the dc–dc converters feeding constant power loads (CPLs) [2]–[4] that can lead to large oscillations in the voltage and frequency terms. The advances in hardware technologies with high computing power have facilitated the practical implementation of advanced control methodologies, e.g., the nonfragile controller [1], sliding mode controller (SMC) [5], and backstepping scheme [6], to ameliorate the control performance of dc converters feeding CPLs. By combining a nonlinear disturbance observer and the backstepping technique, a composite nonlinear controller is developed in [6] to mitigate the instability imposed by CPLs on dc MGs. In [7], a systematic and simple state feedback controller has been extended to stabilize dc MGs with multiple CPLs. To meet the stability and efficient performance requirements, the authors of [7] stated the nonlinear dc MG with some CPLs as a Takagi–Sugeno fuzzy model combined with a quadratic D-stability theory.

In the abovementioned works, the stabilization of the converters is satisfied in the presence of ideal CPLs; however, due to the inevitable uncertainties (e.g., unmodeled dynamics) in practical applications, the robust model-based strategies fail to effectively suppress the CPL's nonlinearity. Moreover, the need for accurate modeling to design the model-based control strategies limits their applicability to processes with high nonlinearities. These difficulties motivated researchers to develop control techniques based on input–output (I/O) measurements, referred to as data-driven strategies [8], which dispense with the modeling procedure and the unknown dynamics. The model-independent schemes are among the most popular data-driven techniques, also known as model-independent adjusting or intelligent controllers, such as the intelligent proportional-integral-derivative controller [9] and model-independent nonsingular terminal sliding-mode control (MINTSMC) [10]. Based on the ultralocal model (ULM) concept, the model-independent schemes adopt a quick observer (e.g., an extended state observer, a sliding mode (SM) observer, etc. [10], [11]) to estimate the unknown terms of the process model. To achieve the optimal performance of the intelligent controllers, evolutionary algorithms (e.g., the genetic algorithm) are often adopted to adjust the design coefficients of the intelligent controllers in a heuristic manner. However, the implementation of such approaches can guarantee the optimal performance of the system only for a specific cycle period, and they suffer from a lack of capability to learn from the observed process data and from restricted generalization capability.

Manuscript received January 3, 2020; revised February 9, 2020; accepted February 24, 2020. Date of publication March 2, 2020; date of current version June 23, 2020. (Corresponding author: Mohammad-Hassan Khooban.)
Mojtaba Hajihosseini and Milad Andalibi are with the School of Electrical and Computer Engineering, Shiraz University, Shiraz 71946-84471, Iran (e-mail: mojtaba. [email protected]; [email protected]).
Meysam Gheisarnejad is with the Department of Electrical Engineering, Islamic Azad University, Najafabad Branch, Isfahan 85141-43131, Iran (e-mail: [email protected]).
Hamed Farsizadeh is with the Department of Electrical and Electronics Engineering, Shiraz University of Technology, Shiraz 71946-84471, Iran (e-mail: [email protected]).
Mohammad-Hassan Khooban is with the DIGIT, Department of Engineering, Aarhus University, 8200 Aarhus, Denmark (e-mail: [email protected]).
Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TPEL.2020.2977765



The recent advances in the algorithmic procedure of reinforcement learning (RL), decreasing the variance in coefficient updates, have enabled the deployment of deep neural networks (NNs) in RL techniques, creating the context of deep RL (DRL) [12]. For complex environments with high-dimensional inputs, the deep deterministic policy gradient (DDPG) [13]–[15] is preferred, applying the control actions of an Actor network to the system. Although the DDPG can learn the complex specifications of continuous problems, the strong sensitivity of this algorithm to its hyperparameters makes it hard to adjust. With the aim of developing a DRL algorithm with high robustness in training, proximal policy optimization (PPO) [16] was introduced to solve the problem of learning rate selection, and it has made significant progress for policy search in the RL domain.

This letter explores the potential of the PPO algorithm to tune the ULM control scheme in stabilizing the voltage term of dc–dc buck-boost converters. As one of the major threats to the stability of converter systems, a time-varying CPL is applied to the dc–dc converter to verify the robustness of the suggested adaptive data-driven scheme in dealing with this serious stability issue. Consequently, the major contributions of this letter are mentioned in the following.

1) The CPLs impose a destabilizing nonlinear impact on dc power electronic converters through an inverse voltage term, which results in remarkable fluctuations in the voltage of the main bus or even its collapse. The stability threats of dc converters feeding CPLs are further intensified when the voltage reference is varied during the simulation. In this letter, the simultaneous impact of both the CPL and voltage reference variations on the dc–dc buck-boost converter is investigated to consider the worst condition for such systems from the stability perspective.
2) A new ULM control scheme based on an SM observer that can estimate the unknown dc–dc converter dynamics is designed, instead of developing the global mathematical model of the system.
3) The PPO algorithm with the Actor–Critic architecture has been adopted for online adjustment of the ULM control coefficients in an adaptive manner. The suggested PPO mechanism naturally incorporates the feedback controller gains of the ULM controller into the design goal and provides the ULM controller with online gain design by employing the learning ability of the Actor–Critic NNs.
4) The suggested scheme was experimentally tested and compared with a conventional state-of-the-art model-independent scheme on a laboratory prototype of the buck-boost converter.

II. MODEL OF THE BUCK-BOOST CONVERTER WITH CPL

The buck-boost converter feeding a time-varying CPL circuit topology is shown in Fig. 1. The average state-space model of the system under certain assumptions is given by [17]

$$L\,\frac{di}{dt} = -(1-u)\,v + u\,E \qquad (1)$$

$$C\,\frac{dv}{dt} = (1-u)\,i - \frac{P}{v} \qquad (2)$$

where i and v ∈ R > 0 denote the inductor current and the output voltage, respectively, P ∈ R > 0 denotes the CPL's power, E ∈ R > 0 denotes the input voltage, and u ∈ [0, 1] denotes the duty ratio of the control switch signal. The certain equilibrium of the dc/dc buck-boost converter feeding a CPL is given by [17]

$$\mathcal{E} := \left\{ (i, v) \in \mathbb{R}^2_{>0} \;\Big|\; i - P\left(\frac{1}{v} + \frac{1}{E}\right) = 0 \right\}. \qquad (3)$$

Fig. 1. Circuit representation of the dc–dc buck-boost converter with a CPL.

III. PPO-DESIGNED ULTRALOCAL MODEL CONTROLLER-BASED SM OBSERVER

For the control of dc–dc buck-boost converters feeding CPLs, the voltage stabilization performance of conventional methodologies like fuzzy logic, SMC, and the model predictive controller (MPC) is restricted from the following regulatory aspects.
1) Since the time intervals in the simulation of dc–dc buck-boost converters are in the microseconds, the computational time for designing the model-based schemes is too exhaustive to be solved in real time.
2) Due to the destabilization properties and nonlinearities imposed by the CPLs on the converters, the control approaches fail to ameliorate the settling time and overshoot terms simultaneously. This necessitates further efforts to mitigate the destructive effects of the CPL in an optimal manner and to ensure the stability requirements.
3) For a power electronic system, like a buck-boost converter, after a change in the operating condition, the control performance of deterministic techniques deteriorates due to their lack of adaptive ability.

Owing to the deficiencies of the existing control methodologies, a PPO-based ULM scheme is proposed in this letter, which is designed to address the aforesaid issues. First, a simple model-independent feedback controller-based ULM scheme (i.e., an intelligent proportional-integral (PI) controller) is established to reduce the dependency on the converter model and obtain the initial desired control specifications. Then, an SM observer is incorporated into the ULM scheme to estimate the unknown dynamics of the converter and feed its estimate back to the controller. Lastly, an adaptive controller coefficient tuning approach is developed for the optimal setting of the predesigned feedback controller, employing the learning ability of PPO.
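To make the averaged model (1)–(3) concrete, the following minimal Python sketch integrates the converter equations with a forward-Euler step under a fixed duty ratio. The component values and the duty ratio are illustrative assumptions, not the prototype parameters of Table II, and the rollout only illustrates the CPL-induced instability discussed above.

```python
# A minimal open-loop rollout of the averaged model (1)-(2); the component values
# below are illustrative placeholders, not the prototype parameters of Table II.
L_, C_ = 1.0e-3, 1.0e-3       # inductance [H], capacitance [F] (assumed)
E, P = 48.0, 150.0            # input voltage [V], CPL power [W] (Scenario I level)
u = 0.6                       # fixed duty ratio, giving v* = uE/(1 - u) = 72 V
dt = 1.0e-6                   # integration step [s]

def converter_step(i, v):
    """One forward-Euler step of (1)-(2) with a constant power load."""
    di = (-(1.0 - u) * v + u * E) / L_             # eq. (1)
    dv = ((1.0 - u) * i - P / max(v, 1e-3)) / C_   # eq. (2); guard keeps v away from zero
    return i + dt * di, v + dt * dv

# Start slightly off the equilibrium set (3), where i* = P (1/v* + 1/E).
i, v = P * (1.0 / 72.0 + 1.0 / E), 71.0
for _ in range(200_000):                           # 0.2 s of simulated time
    i, v = converter_step(i, v)

# With an ideal CPL and no resistive damping, the open-loop trajectory drifts away
# from the equilibrium, illustrating the destabilizing effect discussed in Section I.
print(i, v)
```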


Fig. 2. Structure of the feedback controller with SM observer.

A. Ultralocal Model Control Based on SM Observer

1) Design of Feedback Controller: Based on the knowledge of the I/O measurements of the dc–dc converter, some nonlinearities, uncertain parameters, and the mathematical complexity of the system can be replaced by the ULM scheme. A numerical model of the ULM mechanism, over a short lapse of time, can be described as [9], [11]

$$y^{(\lambda)}(t) = \Phi + \alpha\,u(t) \qquad (4)$$

where y^(λ)(t) denotes the λth-order derivative of the system output y(t). The total unknown phenomena are represented by Φ; α ∈ R denotes a nonphysical constant factor.

Establishing the PI-based (α-PI) controller as the feedback controller, the loop is closed with the control law

$$u(t) = -\frac{\hat{\Phi}}{\alpha} + \frac{y_d^{(\lambda)}(t) + k_{sp}\,e_1(t) + k_{si}\int e_1(t)\,dt}{\alpha} \qquad (5)$$

where k_sp and k_si are the feedback controller coefficients, which are manually adjusted; y_d^(λ)(t) is the desired trajectory of y(t); e_1(t) is the tracking error, which represents the difference between y_d(t) and y(t); and Φ̂ denotes a real-time estimate of Φ(t).

2) Design of SM Observer: The estimation of Φ is a crucial factor in canceling the influence of the disturbances and unmodeled dynamics encountered in practical implementation. For the estimation of Φ, an SM observer that enjoys great robustness against plant uncertainty is introduced into the ULM control scheme. The structure of the ULM controller based on the SM observer is depicted in Fig. 2.

According to (4), the term Φ can be calculated using the SM observer, given as [10]

$$\dot{\hat{y}}(t) = \sigma\,\mathrm{sgn}\left(y(t) - \hat{y}(t)\right) + \alpha\,u(t) \qquad (6)$$

where ŷ denotes the estimated value of y and σ denotes the design variable. The SM observer error is defined as e_2(t) = y(t) − ŷ(t). By subtracting (6) from (4), one obtains the following:

$$\dot{e}_2(t) = \Phi - \sigma\,\mathrm{sgn}\left(y(t) - \hat{y}(t)\right). \qquad (7)$$

Theorem I (stability analysis) [10]: After defining the SM manifold S(t) = e_2, the observer estimation error will converge to zero if the term σ is properly set.

Proof: As usual, the following term is considered as the Lyapunov function:

$$V = \frac{1}{2}\,S(t)^{T} S(t). \qquad (8)$$

The time derivative of (8) yields

$$\dot{V} = S(t)^{T}\dot{S}(t) = e_2^{T}(t)\,\dot{e}_2(t) = e_2^{T}(t)\left(\Phi - \sigma\,\mathrm{sgn}(e_2(t))\right) \le |e_2(t)|\left(|\Phi| - \sigma\right). \qquad (9)$$

Now, assume σ meets the condition |Φ| + η < σ, where η > 0; then one obtains V̇ ≤ −η|e_2(t)|, which ensures the observer is asymptotically stable. ∎
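A minimal sketch of the α-PI law (5) combined with the SM observer (6) is given below, taking λ = 1 for the converter voltage loop. The sampling period, α, σ, the fixed gains, and the low-pass extraction of Φ̂ from the switching term (a common way to recover the equivalent value, in the spirit of [10]) are assumptions of this sketch rather than values or details stated in the letter; in the proposed scheme the gains k_sp and k_si are instead tuned online by the PPO agent of Section III-C.

```python
import numpy as np

dt = 1.0e-5           # controller sampling period (assumed)
alpha = 50.0          # nonphysical scaling factor of the ULM (4) (assumed)
sigma = 500.0         # SM observer gain, to be chosen so that sigma > |Phi| + eta
ksp, ksi = 2.0, 40.0  # alpha-PI gains; fixed here, tuned online by the PPO agent in the letter
tau_f = 1.0e-3        # low-pass filter constant used to extract Phi_hat (an assumption)

class UlmAlphaPi:
    """alpha-PI controller (5) built on the ULM (4) with the SM observer (6), lambda = 1."""

    def __init__(self):
        self.y_hat = 0.0    # observer state y_hat
        self.phi_hat = 0.0  # filtered estimate of the unknown term Phi
        self.int_e1 = 0.0   # integral of the tracking error e_1

    def update(self, y, y_ref, y_ref_dot, u_prev):
        # SM observer (6): y_hat_dot = sigma * sgn(y - y_hat) + alpha * u
        s = sigma * np.sign(y - self.y_hat)
        self.y_hat += dt * (s + alpha * u_prev)
        # Phi_hat as the low-pass filtered switching term (equivalent-value assumption)
        self.phi_hat += dt / tau_f * (s - self.phi_hat)
        # alpha-PI control law (5)
        e1 = y_ref - y
        self.int_e1 += dt * e1
        u = (-self.phi_hat + y_ref_dot + ksp * e1 + ksi * self.int_e1) / alpha
        return float(np.clip(u, 0.0, 1.0))  # the duty ratio is bounded to [0, 1]
```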
B. The Proximal Policy Optimization Algorithm

In the RL framework, a task can be described by a Markov decision process (MDP) characterized by a quintuple {S, A, r, p, γ}, where S ⊆ R^n denotes the state space, A ⊆ R^m denotes the action space, r : S × A → R denotes the reward function, p : S × A × S → [0, 1] denotes the transition function, which expresses the probability of transferring to a new state s_{t+1}, emitting a reward r, under executing action a_t in the state s_t, and γ ∈ [0, 1] denotes the discount factor. With an initial state s_t, which is arbitrarily set, the RL aims to maximize the obtained rewards $\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]$.

PPO is an Actor–Critic, on-policy RL algorithm that follows an optimal policy π to act on an environment described as an MDP. Compared with the existing Actor–Critic-based algorithms [13], [14], in many cases the hyperparameters of PPO are more robust for a large number of tasks and it converges relatively quicker [16]. By considering the Kullback–Leibler divergence of the policy updates in the optimization process, PPO can guarantee the optimum convergence. The Monte Carlo (MC) method is used to approximate the samples of the policy loss function, and the loss function and its gradient are as follows:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[\sum_{t} R(s_t, a_t)\right] = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[R(\tau)\right] \qquad (10)$$

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, R(\tau)\right]. \qquad (11)$$

The main objective in policy gradient methods is to reduce the variance of the gradient estimations toward better policies, enabling consistent progress. The Actor–Critic architecture, by introducing a new definition of the value function, makes a significant impact in this approach:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi_{\theta}}\left[\sum_{t} R(s_t, a_t) \,\Big|\, s, a\right] \qquad (12)$$

$$V^{\pi}(s) = \mathbb{E}_{\pi_{\theta}}\left[\sum_{t} R(s_t, a_t) \,\Big|\, s\right] \qquad (13)$$

$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s). \qquad (14)$$

The advantage function A^π(s, a) measures how good an action is compared to the others available in that state. The value function V(s) measures how good it is to be in that state. The Critic network is trained separately to predict the value function by analyzing the cumulative received rewards.
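As a small illustration of (12)–(14), the sketch below computes discounted Monte Carlo returns for a single rollout and the sampled advantage as the difference between the return and a Critic's value estimate; the `value_fn` callable is a stand-in for the trained value NN and is an assumption of this sketch.

```python
from typing import Callable, List, Sequence

def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Monte Carlo return G_t = sum_k gamma^k r_{t+k}, the sampled counterpart of (12)-(13)."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def advantages(states: Sequence, rewards: Sequence[float],
               value_fn: Callable[[object], float], gamma: float) -> List[float]:
    """A(s_t, a_t) ~ G_t - V(s_t), cf. (14), with V supplied by the Critic."""
    return [g - value_fn(s) for s, g in zip(states, discounted_returns(rewards, gamma))]
```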


The PPO, as one of the most efficient Actor–Critic methods, aims to maximize the following objective function:

$$L(\theta) = \hat{\mathbb{E}}_{t}\left[\min\left(r_t(\theta)\hat{A}_t,\; \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right] \qquad (15)$$

where Â_t denotes the estimated advantage function, Ê_t denotes the empirical expectation, and r_t(θ) is the probability ratio defined as follows:

$$r_t(\theta) = \frac{\pi_{\theta}(a_t, s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t, s_t)}. \qquad (16)$$

Vanilla policy gradients require samples from the current policy, which cannot be reused for the modified policy after one more optimization step. PPO uses importance sampling to obtain the expectation of samples drawn from an old policy under the new policy. For this purpose, each sample can be used for several gradient ascent steps. When the new policy is refined, the old and new policies diverge, which increases the variance of the estimation, and the old policy is then updated to the new policy. To achieve this, a similar state transition function has to exist, which is ensured by clipping the probability ratio to the region [1 − ε, 1 + ε]. The algorithmic steps of the PPO scheme are depicted in Fig. 3. (For more details about the learning procedure of the PPO NNs, readers are referred to [18].)

Fig. 3. Flowchart of the PPO algorithm with Actor–Critic architecture.
C. PPO-Based Feedback Controller Coefficient Tuner

In this letter, the PPO algorithm, as an adaptive tuning mechanism, is adopted to adjust the feedback controller coefficients of the ULM control scheme by exploiting the online learning and model-independent properties of RL. According to the suggested strategy, the feedback controller (i.e., α-PI) coefficients are considered as the design control objective, and the PPO tuner adjusts these coefficients through online learning of the Actor–Critic NNs. The structure of the proposed adaptive ULM control scheme based on the PPO tuner is illustrated in Fig. 4.

Fig. 4. Self-adaptive ULM control scheme based on the PPO algorithm.

According to Fig. 4, by employing the Actor and Critic NNs, the PPO produces the regulatory commands [dk_sp(t), dk_si(t)] to tune the feedback controller coefficients. Since these control coefficients are usually nonzero, a feedback controller with the α-PI structure is designed as k_sp(t + 1) = k_sp(t) + dk_sp(t) and k_si(t + 1) = k_si(t) + dk_si(t).

Remark 1: It is noted that the feedback controller in this application is not necessarily an α-PI controller; that is, the suggested adaptive scheme can be extended to other types of adjustable controller coefficients (e.g., the MINTSMC controller).

The PPO agent aims to train the coefficients of the Actor and Critic NNs in such a way as to reduce the output voltage error e, i.e., the error between the desired voltage of the converter v_ref and its actual value v_o. The feedbacks in time step t from the buck-boost converter are selected as the output voltage v_o, the output-voltage error e, and its derivative ė, i.e., state = {v_o, e, de/dt}. To compensate for the output voltage, the reward signal r_t in the PPO algorithm is set as r_t = 1/|e_t|. Based on the immediate reward r_t, the PPO agent evaluates how good the undertaken actions [dk_sp(t), dk_si(t)] are. The action (regulatory signals) is applied to the environment (i.e., the converter) to eliminate the CPL disturbances that occur during the operation of the power electronic system, which causes the system to experience a transition to a new state and to release a feedback signal r_t. At each time step, the transition vector (s_t, s_{t+1}, a_t, r_t) is gathered, and each episode is finished when the terminal condition is reached, i.e., the maximum number of steps per episode is met. Then, the gathered information vector is adopted to train the RL agent by the PPO tuner, training the policy coefficients (the weights and biases of the Actor and Critic NNs). Thus, an updated regulatory signal is produced to adaptively mitigate the influence of the nonideal CPL.
to adaptively mitigate the influence of the nonideal CPL.
In this letter, the PPO algorithm, as an adaptive mechanism In this letter, two hidden layers (HLs) with 189 and 30 neurons
tuner, is adopted to adjust the feedback controller coefficients of are used in the Actor NN while the Critic NN is built with two
the ULM control scheme by extracting the advantages of the on- completely connected HLs with 30 neurons. The rectified linear
line learning and model-independent property of RL. According unit is adopted as a nonlinear mapping function for all HLs in
to the suggested strategy, the feedback controller (i.e., α − P I) the NNs. With the defined parameters of the PPO algorithm and
coefficients are considered as the design control objective and the NNs configured in Fig. 5, the rest of the parameters for the
the PPO tuner adjusts these coefficients through online learning algorithm configuration are furnished in Table I.


Fig. 5. Feed-forward NNs as the coupled Actor–Critic.

TABLE I. PARAMETERS OF THE PPO TUNER

Remark 2: The merits of the suggested control scheme are highlighted as follows.
1) Compared with the conventional ULM control schemes (e.g., the MINTSMC controller [10]), the suggested controller is more robust due to its online learning ability to deal with the CPL's instability.
2) Compared with the nonfragile, MPC, and backstepping controllers, which can guarantee the stability of the system only in a particular cycle period [1], [6], the suggested controller can guarantee the stability of the system in various periods during the experiment.
3) Compared with the DRL and DDPG algorithms [12]–[14], training of the PPO is more robust, with less sensitivity to its hyperparameters.

IV. EXPERIMENTAL RESULTS

The dc–dc buck-boost converter model explained in Fig. 1 is experimentally tested to appraise the applicability of the real-time implementation of the suggested adaptive controller and to verify its performance. Applying a CPL under input voltage variations to the buck-boost converter, the worst-case situation that can be imposed on such power electronic systems is investigated from the stability point of view. A comparative study is accomplished in two typical scenarios to validate the superior transient performance of the intelligent feedback controller-based PPO tuner with the SM observer in the stabilization of the output voltage over that of the MINTSMC scheme [10]. The configuration of the experimental testbed is illustrated in Fig. 6, and the parameters corresponding to the circuit components of this setup are presented in Table II.

Fig. 6. Photograph of the laboratory prototype adopted in the experiment.

TABLE II. SPECIFICATIONS OF THE DC–DC BUCK-BOOST CONVERTER

In the experimental setup of Fig. 6, the output voltage of the buck-boost converter is controlled by a dSPACE MicroLabBox with a DS1202 PowerPC dual-core 2 GHz processor board and a DS1302 I/O board. Besides, the real-time simulation executes on Windows 10 with a Core i7, 2.6 GHz, and 8 GB of RAM. The expected design process and fast design performance are realized by the real-time interface (RTI) based on the MicroLabBox. The MATLAB C code generator Simulink Coder (formerly Real-Time Workshop) is used for the automatic execution and seamless implementation of Simulink models on the real-time testbed. The model can be connected to the dSPACE I/O board by simply dragging the I/O module from the RTI block library onto the model and connecting it to the Simulink blocks. By clicking the suitable blocks, any setup or change of the parameters is achievable. To prepare the real-time simulation, the collaboration of Simulink Coder and the RTI produces, respectively, the model code and the blocks that implement the I/O capabilities of dSPACE in the Simulink models.

Scenario I: In the first scenario, the adaptive model-independent capability of the PPO-optimized ULM controller with respect to an ideal CPL and reference voltage variations is studied. The CPL's power is applied as a constant value of P = 150 W throughout the real-time simulation, while the reference voltage is set as 80 V for t ∈ [0, 0.3) s, 110 V for t ∈ [0.3, 0.7) s, and 30 V for t ∈ [0.7, 1] s. In the concerned scenario, the experimental outcomes of the CPL's power (blue line), bus voltage (red line), and CPL's current (green line) for the suggested controller and the MINTSMC scheme are depicted in Figs. 7 and 8, respectively. As shown in Figs. 7 and 8, in response to the constant CPL and the reference voltage variations, the output voltage of the suggested controller tracks its references during the experimental analysis, while the voltage responses of the MINTSMC scheme experience a small oscillation when the voltage reference varies suddenly, i.e., at t = 0.3 s and t = 0.7 s. Moreover, the suggested controller provides a quicker and better outcome in restoring the power and current of the CPL as compared with the MINTSMC scheme.
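For reference, the stepped reference voltage of Scenario I can be encoded as a simple piecewise function; this is only a restatement of the stated test profile, not an excerpt of the authors' real-time test scripts.

```python
def v_ref_scenario1(t: float) -> float:
    """Scenario I reference voltage: 80 V, 110 V, then 30 V; the CPL stays at P = 150 W."""
    if t < 0.3:
        return 80.0
    if t < 0.7:
        return 110.0
    return 30.0
```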
Scenario II: Here, a nonideal CPL is applied to the converter at a heavy load situation with the initial power of P = 150 W.


Fig. 7. Transient outcomes of the PPO-based ULM control scheme with the SM observer according to Scenario I.

Fig. 8. Transient outcomes of the MINTSMC scheme according to Scenario I.

Fig. 9. Transient outcomes of the PPO-based ULM control scheme with the SM observer according to Scenario II.

Fig. 10. Transient outcomes of the MINTSMC scheme according to Scenario II.

Fig. 11. Bar chart comparison of different performance indices.

At t = 0.3 s, the CPL's power is increased from its initial value to 250 W; at t = 0.7 s, the CPL's power is reduced from 250 to 75 W. The experimental outcomes of Figs. 9 and 10, including the CPL's power (blue line), bus voltage (red line), and CPL's current (green line), illustrate how the suggested controller and the MINTSMC scheme, respectively, stabilize the buck-boost converter feeding the nonideal CPL under the reference voltage variations. By comparing the results of Figs. 9 and 10, one can observe that the experimental outcomes of the suggested ULM control scheme (realized based on the PPO agent) experience less control degradation than those of the MINTSMC scheme; specifically, within the range [0.3 s, 0.7 s] of operation, where the highest CPL power is applied to the converter, the set points of the voltage term are precisely tracked while, simultaneously, the power and current of the CPL experience smaller deviations from their nominal values. Such an improvement in the converter stabilization performance is valuable in power electronic engineering, to protect the CPL to which the converter is connected from possibly large deviations of the system responses during a long transient.

For a quantitative analysis of the suggested controller, several error measurement criteria, including the integral absolute error, mean square error, and mean absolute error, corresponding to Scenario I and Scenario II, are compared using bar charts, as depicted in Fig. 11.
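The three indices compared in Fig. 11 are standard; for completeness, a small sketch of how they would be computed from a sampled voltage-error trace is given below (the sampled error array and sampling period are assumed, since the raw experimental traces are not published in the letter).

```python
import numpy as np

def error_indices(e: np.ndarray, dt: float) -> dict:
    """Integral absolute error, mean square error, and mean absolute error of an error trace."""
    return {
        "IAE": float(np.sum(np.abs(e)) * dt),  # Riemann-sum approximation of the integral of |e(t)|
        "MSE": float(np.mean(e ** 2)),
        "MAE": float(np.mean(np.abs(e))),
    }
```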

V. CONCLUSION

The robust stabilization problem of a class of power electronic systems exposed to dynamic loads has been studied in this letter. Particularly, by employing the adaptive capability of DRL, a novel adaptive model-independent ULM controller based on an SM observer has been developed to suppress the destructive effects of CPLs when the system is subjected to reference voltage changes.


This control strategy can achieve promising outcomes due to the following two reasons. With the aim of practical implementation, an SM observer is incorporated into the ULM control scheme, which ensures good compatibility with the unmodeled system dynamics, and the learning ability of the PPO agent keeps driving the feedback controller toward its optimal point, where the coefficients of the Actor and Critic NNs are trained under the CPL and the reference voltage changes.

The experimental outcomes of the prototype confirm an excellent transient behavior in the voltage responses of the dc–dc buck-boost converter with the use of the suggested adaptive methodology compared with the MINTSMC controller.

REFERENCES

[1] N. Vafamand, M. H. Khooban, T. Dragicevic, F. Blaabjerg, and J. Boudjadar, "Robust non-fragile fuzzy control of uncertain DC microgrids feeding constant power loads," IEEE Trans. Power Electron., vol. 34, no. 11, pp. 11300–11308, Nov. 2019.
[2] S. R. Huddy and J. D. Skufca, "Amplitude death solutions for stabilization of DC microgrids with instantaneous constant-power loads," IEEE Trans. Power Electron., vol. 28, no. 1, pp. 247–253, Jan. 2012.
[3] M. H. Khooban, M. Gheisarnejad, H. Farsizadeh, A. Masoudian, and J. Boudjadar, "A new intelligent hybrid control approach for DC/DC converters in zero-emission ferry ships," IEEE Trans. Power Electron., vol. 35, no. 6, pp. 5832–5841, Jun. 2020.
[4] H. Farsizadeh, M. Gheisarnejad, M. Mosayebi, M. Rafiei, and M. H. Khooban, "An intelligent and fast controller for DC/DC converter feeding CPL in a DC microgrid," IEEE Trans. Circuits Syst. II: Express Briefs, 2019.
[5] S. Singh, D. Fulwani, and V. Kumar, "Robust sliding-mode control of DC/DC boost converter feeding a constant power load," IET Power Electron., vol. 8, no. 7, pp. 1230–1237, Jul. 2015.
[6] Q. Xu, C. Zhang, C. Wen, and P. Wang, "A novel composite nonlinear controller for stabilization of constant power load in DC microgrid," IEEE Trans. Smart Grid, vol. 10, no. 1, pp. 752–761, Jan. 2019.
[7] M. M. Mardani, N. Vafamand, M. H. Khooban, T. Dragičević, and F. Blaabjerg, "Design of quadratic D-stable fuzzy controller for DC microgrids with multiple CPLs," IEEE Trans. Ind. Electron., vol. 66, no. 6, pp. 4805–4812, Jun. 2019.
[8] J. Sun, J. Yang, W. X. Zheng, and S. Li, "GPIO-based robust control of nonlinear uncertain systems under time-varying disturbance with application to DC–DC converter," IEEE Trans. Circuits Syst. II: Express Briefs, vol. 63, no. 11, pp. 1074–1078, Nov. 2016.
[9] H. Abouaïssa and S. Chouraqui, "On the control of robot manipulator: A model-free approach," J. Comput. Sci., vol. 31, pp. 6–16, 2019.
[10] K.-H. Zhao et al., "Robust model-free nonsingular terminal sliding mode control for PMSM demagnetization fault," IEEE Access, vol. 7, pp. 15737–15748, 2019.
[11] H. P. Wang, G. I. Y. Mustafa, and Y. Tian, "Model-free fractional-order sliding mode control for an active vehicle suspension system," Adv. Eng. Softw., vol. 115, pp. 452–461, 2018.
[12] L. Huang, X. Feng, C. Zhang, L. Qian, and Y. Wu, "Deep reinforcement learning-based joint task offloading and bandwidth allocation for multi-user mobile edge computing," Digit. Commun. Netw., vol. 5, pp. 10–17, 2019.
[13] C. Wang, J. Wang, Y. Shen, and X. Zhang, "Autonomous navigation of UAVs in large-scale complex environments: A deep reinforcement learning approach," IEEE Trans. Veh. Technol., vol. 68, no. 3, pp. 2124–2136, Mar. 2019.
[14] Y. Wang, J. Sun, H. He, and C. Sun, "Deterministic policy gradient with integral compensator for robust quadrotor control," IEEE Trans. Syst., Man, Cybern., Syst., 2019.
[15] M. Gheisarnejad, J. Boudjadar, and M.-H. Khooban, "A new adaptive type-II fuzzy-based deep reinforcement learning control: Fuel cell air-feed sensors control," IEEE Sens. J., vol. 19, no. 20, pp. 9081–9089, Oct. 2019.
[16] X. Wang, T. Li, and Y. Cheng, "Proximal parameter distribution optimization," IEEE Trans. Syst., Man, Cybern., Syst., 2019.
[17] W. He and R. Ortega, "Voltage regulation in buck–boost converters feeding an unknown constant power load: An adaptive passivity-based control," 2019, arXiv:1909.04438.
[18] Y. Zhang, Z. Deng, and Y. Gao, "Angle of arrival passive location algorithm based on proximal policy optimization," Electronics, vol. 8, p. 1558, 2019.

