Digital Object Identifier 10.1109/ACCESS.2021.3119745
ABSTRACT The high penetration level of renewable energy in large-scale power systems could adversely
affect power quality, such as voltage stability and harmonic pollution. This paper assesses the impacts of
Distribution Static Compensator (D-STATCOM), one of the Flexible AC Transmission System (FACTS)
devices, on power quality of 4.16kV-level distribution systems via transient and steady-state analysis.
Carrier-based Pulse Width Modulation (PWM) control in the D-STATCOM generates the d-q axis current references
via PID (Proportional-Integral-Differential) controllers to regulate the d-q axis currents and voltages. A new
control method, based on the Deep Deterministic Policy Gradient (DDPG) algorithm for reinforcement learning
(RL), is studied to create new d-q axis current references applied to the voltage control, which can
improve voltage stability and transient response and achieve fast convergence of current and voltage at the
D-STATCOM bus. The real-time simulations on an IEEE 13-bus system show that the proposed approach
can better control the D-STATCOM than the conventional control methods for enhancing voltage stability
and transient performance.
NOMENCLATURE
Abbreviations:
D-STATCOM   Distribution Static Compensator.
FACTS       Flexible AC Transmission System.
VSC         Voltage Source Converter.
PCC         Point of Common Coupling.
PLL         Phase Locked Loop.
PID         Proportional-Integral-Differential.
PWM         Pulse Width Modulation.
IGBT        Insulated Gate Bipolar Transistor.
NN          Neural Network.
RL          Reinforcement Learning.
DQN         Deep Q-Network.
DDPG        Deep Deterministic Policy Gradient.

Symbols:
Ts    RL sampling time.
σ     Gaussian action-space noise.
V     Voltage in real time.
i     Current in real time.
B     A batch sampled from the training dataset (replay buffer).
π     Policy; returns an action sampled from the actor network plus some noise for exploration.
θ     Critic (target) neural network; Q-value function critic Q(s, a) that maps a state and action pair to a scalar value representing the expected total long-term reward.
ϕ     Actor (target) neural network; deterministic policy actor ϕ(s) that maximizes the expected cumulative long-term reward.
τ     Target smooth factor.
voltage value, which is directed to the reinforcement learning state elements in the proposed approach. In section 2,
the D-STATCOM topology is illustrated, explaining how the voltage controller works. Reinforcement learning and the
DDPG algorithm are described in section 3. Section 4 demonstrates how the DDPG algorithm applied to the control
system is designed. As the variable reference rapidly converges to the desired real reference under the proposed
approach, the observed current and voltage also converge faster. In section 5, the performance analysis for the test
scenarios is illustrated after training the DDPG process; the scenarios include random voltage changes (±2%, which
corresponds to ±83.2 V) and load changes within ±30%. In this section, we show that the transient response is
improved under varying conditions of the power system. Variations of the main feeder's voltage and of the
D-STATCOM's set-reference voltage are applied under nonlinear load conditions to evaluate the model's performance
and strategy. This article is concluded in section 6.
II. TOPOLOGY OF D-STATCOM: CARRIER-BASED PWM CONTROL METHOD
The D-STATCOM mainly consists of an inverter connected to the network through a transformer and a capacitor C,
which provides the dc-link voltage. In detail, the D-STATCOM includes a detailed representation of power-electronic
IGBT converters, and it is used to regulate the voltage of the distribution network. The D-STATCOM regulates the
adjacent bus voltage by absorbing or generating reactive power. This reactive power transfer is done through the
leakage reactance of the coupling transformer by generating a secondary voltage in phase with the main feeder's
primary voltage, which is provided by a voltage-sourced PWM inverter [7]. In the D-STATCOM, the dc-link capacitor
operates as either an inductance or a capacitance under specific conditions. When the bus voltage is lower than
the reference voltage at the D-STATCOM, it acts as a capacitor to absorb reactive power into the grid. In contrast,
when the bus voltage is higher than the reference voltage at the D-STATCOM, it acts as an inductance to inject
positive reactive power into the grid [7], [28].

The D-STATCOM consists of several key components, as shown in Figure 1. The LC damped filters are connected at
the inverter output, and a capacitor acts as a DC source for the inverter. Once the sensor measures the load-side
voltages and currents, the supply side's current is calculated first, and then the compensating current of the
D-STATCOM is calculated [7]. By measuring the difference between the load current and the compensating current,
the PWM pulses for the IGBT inverter bridge, which are essential for the observed voltage to trace the reference
voltage, are then produced. These voltage-sourced PWM inverters consist of two IGBT bridges, so their control
performance becomes more comprehensive than that of a single bridge. The twin-inverter configuration produces fewer
harmonics than a single bridge, resulting in smaller filters and an improved dynamic response [29].

Both the voltage and current components obtained from the abc-to-dq transformation in the synchronous reference
frame, determined by sin ωt and cos ωt provided by the Phase Locked Loop (PLL), are then regulated with two separate
PID regulators with respect to the reference id and iq currents obtained earlier. As shown in Figure 1, four PID
controllers are involved in the two current regulation loops (inner and outer). The inner current regulation loop
consists of two PID controllers that control the d-axis and q-axis currents. The controllers' outputs are the Vd
and Vq voltages designated for the PWM inverter's pulse generator. The Vd and Vq voltages, which are obtained from
the integrated outputs of the current PI controllers, are converted into the phase voltages Vabc. The Iq reference
comes from the outer voltage regulation loop, and the Id reference comes from the DC-link voltage regulation
loop [29].

The other two PI controllers in the outer regulation loop decide both the Iq reference and the Id reference by
calculating the difference between the reference voltage and the actual observed voltage. To be specific, as shown
in Figure 1, the Id reference is determined by the PID controller whose input is the DC-link voltage error, and the
Iq reference is determined by the PID controller whose input is the error of the AC voltage on the network side.
This maintains the voltage regulation of the primary side equal to the reference value defined in the control
system. After Vq and Vd are determined, the IGBT gating pulses of the VSC are generated and controlled by a PWM
pulse generator [30]. The filters with a series inductance Lf of 800 µH are connected to the primary-side bridge
output, and the filter capacitance Cf of 100 µF in series with a resistance Rf of 10 Ω is connected to the
secondary low-voltage side of the coupling transformer (Δ/Y, 4.16/1.2 kV). The detailed technical specifications
of the D-STATCOM parameters are given in Table 1.

TABLE 1. Design parameters of D-STATCOM.
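To make the cascaded regulation concrete, the following sketch (in Python rather than the Simulink blocks used in
this work) walks through one control step of the outer voltage loops and inner current loops described above. The
gains, sampling time, and output limits are illustrative assumptions only, and the derivative action of the PID
regulators is omitted for brevity.

    # Minimal discrete-time sketch of the cascaded D-STATCOM regulation loops.
    # Gains, sampling time, and limits below are illustrative assumptions only.

    class PI:
        """Discrete PI regulator with output clamping (derivative term omitted)."""
        def __init__(self, kp, ki, ts, limit):
            self.kp, self.ki, self.ts, self.limit = kp, ki, ts, limit
            self.integral = 0.0

        def step(self, error):
            self.integral += self.ki * error * self.ts
            out = self.kp * error + self.integral
            return max(-self.limit, min(self.limit, out))

    Ts = 50e-6                                        # assumed control sampling time [s]
    pi_vdc = PI(kp=0.1, ki=20.0, ts=Ts, limit=1.0)    # outer loop: dc-link voltage -> Id reference
    pi_vac = PI(kp=2.0, ki=100.0, ts=Ts, limit=1.0)   # outer loop: ac voltage      -> Iq reference
    pi_id  = PI(kp=0.5, ki=50.0, ts=Ts, limit=1.5)    # inner loop: d-axis current  -> Vd
    pi_iq  = PI(kp=0.5, ki=50.0, ts=Ts, limit=1.5)    # inner loop: q-axis current  -> Vq

    def control_step(vdc_ref, vdc, vac_ref, vac, id_meas, iq_meas):
        # Outer loops produce the d-q current references from the voltage errors.
        id_ref = pi_vdc.step(vdc_ref - vdc)
        iq_ref = pi_vac.step(vac_ref - vac)
        # Inner loops track those references and output Vd, Vq for the PWM pulse generator.
        vd = pi_id.step(id_ref - id_meas)
        vq = pi_iq.step(iq_ref - iq_meas)
        return vd, vq, id_ref, iq_ref

The returned Vd and Vq would then be transformed back to phase quantities and passed to the PWM pulse generator,
mirroring the signal flow of Figure 1.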
III. DDPG ALGORITHM FOR CONTROLLER DESIGN
RL is a value-based algorithm, which learns by estimating the Q-value that the model will obtain in a given
circumstance [31]. When estimations of this value become somewhat possible, actions (i.e., a policy) are chosen
based on this value. In addition, Q-learning has an ε-greedy policy that estimates the value for all actions and
then selects the action that corresponds to the largest of those values. However, learning is not an easy task if
there are many behaviors in the continuous action space [32].

DDPG is a model-free off-policy algorithm for learning continuous actions. DDPG combines two ideas from the
Deterministic Policy Gradient (DPG) and the Deep Q-Network (DQN): it uses 'experience replay', which enables RL
agents to memorize past experience, and a 'frozen target network', and it can operate over continuous action
spaces [33]. In the case of DQN, learning instability is reduced by using the 'experience replay' and the 'frozen
target network.' Typical Q-learning obtains its data from the agent actually moving; thus, naturally, there is a
significant correlation between the data [34], [35]. Therefore, the experience replay method is used to reduce the
correlation between input data, significantly reducing the relationship among them. It also enables repetitive
learning of past experiences.

The mathematical and numerical approaches-based control system requires interpretation in the z-domain to determine
the PWM output. Changes in the operating status of the power system's elements, such as nodal voltage, are
represented by a sequence of real or complex numbers as a discrete time-domain signal. Rather than DQN, DDPG is
appropriate for real-time changes in a discrete-time domain because DQN updates the neural network using a total
reward per episode, while DDPG updates the reward at each step [36]. Since the current and voltage data in the
discrete time domain change in a continuous form, unlike discrete movements such as top-bottom-left-right, the RL
components can operate over continuous action spaces by using DDPG. To this end, DDPG has the advantages of DQN and
can be extended to a continuous action space using the actor & critic framework. The critic framework is used to
estimate the value via the Bellman equation. The actor framework is used to generate an action according to the
distribution of the action space by the chain rule [37], [38].

The DDPG algorithm uses two networks, named the actor and critic networks. The actor network proposes an action for
a given state, and the critic network predicts whether the action is good (positive reward) or bad (negative
reward) according to the given state and action. To determine the parameters of DDPG, we first update the parameter
of the critic network θ by backpropagating the critic loss. In every iteration, we update the actor model's
parameter ϕ by performing gradient ascent on the output of the critic model, as in Eq. (1):

    θ ← min_θ B^(-1) Σ_{i=1}^{N} ( y − Q_θ(s, a) )^2
    ∇_ϕ J(ϕ) = B^(-1) Σ ∇_a Q_θ(s, a)|_{a=π_ϕ(s)} ∇_ϕ π_ϕ(s)                                  (1)

where B is [s_t, a_t, r_t, s_{t+1}] from the replay buffer. After this, we calculate the parameters of ϕ and θ,
where the actor network ϕ and the critic network θ update their own weights in the same way NN parameters are
updated. From this, the target networks are updated by Polyak averaging as shown in Eq. (2), where θ_target is the
calculated critic target, θ'_target is the previously trained critic target, ϕ_target is the calculated actor
target, and ϕ'_target is the previously trained actor target. Thus, the target networks' parameter update depends
on the target smooth factor, which decides how much the current target network can affect the entire target
network.

    θ'_target ← τ θ_target + (1 − τ) θ'_target
    ϕ'_target ← τ ϕ_target + (1 − τ) ϕ'_target                                               (2)

In addition, learning by sampling from all accumulated experience is better than learning only from recent
experience. DDPG normalizes the differently scaled units by normalizing the observation, and it uses batch
normalization to put the samples into a single minibatch and normalize all the dimensions for better learning.
The schematic diagram of the DDPG algorithm is illustrated in Figure 2.
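As a concrete reading of Eqs. (1)-(2), the sketch below performs one DDPG update in PyTorch: a critic regression
step toward the bootstrapped target, a deterministic policy-gradient step for the actor, and the Polyak-averaged
target update. The network definitions, optimizers, discount factor, and smooth-factor value are assumptions for
illustration only, not the settings used in this paper.

    import torch
    import torch.nn.functional as F

    # Minimal sketch of one DDPG update, following Eqs. (1)-(2).
    # actor, critic, actor_target, critic_target are torch.nn.Module networks;
    # their architectures, the optimizers, gamma, and tau are assumed values.
    gamma, tau = 0.99, 0.005

    def ddpg_update(batch, actor, critic, actor_target, critic_target,
                    actor_opt, critic_opt):
        s, a, r, s_next = batch                      # tensors sampled from the replay buffer B

        # Critic: regress Q_theta(s, a) toward the bootstrapped target y (Eq. 1, first line).
        with torch.no_grad():
            y = r + gamma * critic_target(s_next, actor_target(s_next))
        critic_loss = F.mse_loss(critic(s, a), y)
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        # Actor: gradient ascent on Q_theta(s, pi_phi(s)) (Eq. 1, second line).
        actor_loss = -critic(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        # Target networks: Polyak averaging with smooth factor tau (Eq. 2).
        for net, target in ((critic, critic_target), (actor, actor_target)):
            for p, p_t in zip(net.parameters(), target.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)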
FIGURE 4. DDPG agent workflow: determining Iq, Id reference in D-STATCOM control system.
TABLE 2. Design parameters of DDPG algorithm.

changes [36]. In this paper, the observations are represented as s = [Vac_ref, Vac, Vac_dif, (K·Ts/(z−1))·Vac_dif,
Vdc_ref, Vdc, Vdc_dif, (K·Ts/(z−1))·Vdc_dif]. However, the magnitudes of Vac and Vdc are not at the same level, so
it is necessary to normalize the data to the 0-1 range. The idea behind the composition of the 's' (RL state)
components originates from the input of a PID controller, which creates a proportional and an integral term from
the error generated by the difference between the measurement and the reference.

As the RL agent learns through the reward received at each step, setting the reward vector criteria is crucial for
making proper policies. The RL reward function can be implemented to pursue the minimum steady-state error in the
control system. In addition, its reward is applied to compose the Q-value via the Bellman equation from the critic
network with the mentioned states and the two actions.

In this paper, in order to enhance voltage stability, the reward vector has to be formed by the degree of mismatch
between the line voltage observation and the real reference. However, designing a reward vector that only focuses
on voltage stability may not work properly and could risk divergence during the model's training process, since
there is a low correlation between nodal voltage stability and the RL action vectors (Id_ref and Iq_ref), as
observed in our experiment. In order to mitigate the low correlation, the following reward strategy is suggested:
1) Gather data from the PID controller's output (Id_ref and Iq_ref) and the nodal voltage over variable episodes
   in which the RL agent is not considered.
2) Train the RL agent with the reward vector formed by the differences between the action vectors and the outer
   PID controller's output.
3) Find the total reward vector tr1 when it has learned enough.
4) Set another reward strategy:
   – The agent earns the rewards tr1 evenly at each step.
   – If the nodal voltage stability is improved compared to the PID controllers, extra rewards are received as a
     return for voltage stability.
5) Re-train the RL agent from the pre-trained agent of step 2).

Making the RL agent learn the outer PID controller's output (Id_ref and Iq_ref) is important to determine feasible
RL agent actions in the control system. Therefore, it obtains a more considerable correlation compared to the
output (Vac_ref and Vdc_ref) that we originally desired.

    x = Iref_PID − i
    ΔIref = Iref2 − Iref1
    ΔVref = Vref2 − Vref1                                                                    (3)

In Eq. (3), ΔIref is the difference between the next steady-state value of the current, Iref2, and the previous
steady-state value of the current, Iref1, and ΔVref bears the same relationship for the voltage reference.
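A minimal sketch of how the observation vector and the staged reward described above could be assembled is given
below. The normalization bases, the gain K, the sampling time, and the weights of the imitation and voltage-
stability terms are assumptions made only for illustration; they are not the parameters of Table 2.

    import numpy as np

    # Sketch of the RL observation and the staged reward strategy described above.
    # K, Ts, the normalization bases, and the weights w1/w2 are illustrative assumptions.
    K = 1.0            # assumed gain of the K*Ts/(z-1) integrator term
    Ts = 1e-3          # assumed RL sampling time [s]

    class ObservationBuilder:
        """s = [Vac_ref, Vac, Vac_dif, int(Vac_dif), Vdc_ref, Vdc, Vdc_dif, int(Vdc_dif)],
        scaled toward the 0-1 range because Vac and Vdc magnitudes differ."""
        def __init__(self, vac_base=4160.0, vdc_base=3000.0):
            self.vac_base, self.vdc_base = vac_base, vdc_base
            self.int_vac = 0.0   # discrete integrator K*Ts/(z-1) applied to Vac_dif
            self.int_vdc = 0.0

        def build(self, vac_ref, vac, vdc_ref, vdc):
            vac_dif, vdc_dif = vac_ref - vac, vdc_ref - vdc
            self.int_vac += K * Ts * vac_dif
            self.int_vdc += K * Ts * vdc_dif
            s = np.array([vac_ref, vac, vac_dif, self.int_vac,
                          vdc_ref, vdc, vdc_dif, self.int_vdc])
            scale = np.array([self.vac_base] * 4 + [self.vdc_base] * 4)
            return s / scale     # rough per-unit style normalization

    def staged_reward(action, pid_ref, v_error, pretraining=True, w1=1.0, w2=10.0):
        # Stage 2: imitate the outer PID references so the actions stay feasible.
        imitation = -w1 * float(np.sum(np.abs(np.asarray(action) - np.asarray(pid_ref))))
        if pretraining:
            return imitation
        # Stages 4-5: evenly distributed base reward plus an extra term for voltage stability.
        return imitation - w2 * abs(v_error)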
V. PERFORMANCE ANALYSIS
A. PROGRAMMABLE VOLTAGE SOURCE & CONTINGENCY
As shown in Figure 6, the voltage magnitude of the voltage source at bus 650 varies, fluctuating between 0.985 and
1.02 p.u. Bus 650 is the slack bus; that is, the IEEE 13-bus distribution system is connected to the main grid
through bus 650. In the experiment, the D-STATCOM is installed in parallel with bus 632. Transformer and line
impedances exist between bus 632 and bus 650.

We designate the real reference of the D-STATCOM at 1 p.u., which is shown as the purple line in Figure 7. In this
case, the line-to-line RMS base voltage is set at 4.16 kV. The RL is implemented for 1,000 iterations, and the
action vectors of RL develop the transient response in the control system when the voltage magnitude of the
programmable voltage source varies.

FIGURE 7. Result comparison of voltage (p.u.) at bus 632.
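One way such a disturbance scenario can be scripted is sketched below; the event spacing and the uniform sampling
of the roughly ±2% source-voltage and ±30% load deviations are assumptions chosen only to mirror the ranges
reported in this paper, not the exact event schedule used in the experiments.

    import numpy as np

    # Illustrative generator for the test disturbances: slack-bus voltage steps of about
    # +/-2 percent (i.e. +/-83.2 V on the 4.16 kV base) and load steps within +/-30 percent.
    rng = np.random.default_rng(0)

    def make_scenario(n_events=5, t_start=0.2, dt=0.2):
        events = []
        for k in range(n_events):
            events.append({
                "time_s": t_start + k * dt,                     # assumed event spacing
                "v_source_pu": 1.0 + rng.uniform(-0.02, 0.02),  # programmable voltage source
                "load_scale": 1.0 + rng.uniform(-0.30, 0.30),   # 3-phase load switching
            })
        return events

    for event in make_scenario():
        print(event)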
FIGURE 10. Result comparison of voltage (p.u) at bus 632 under load switching (3-phase breaker).
FIGURE 13. DC-link voltage and d-axis current reference changes in D-STATCOM.
FIGURE 14. D-STATCOM current comparison (p.u) with current wave form.
TABLE 3. D-STATCOM voltage performance analysis – settling time.

Various distribution system conditions would lead to a decrease or an increase in the dc-link capacitor voltage.
For the sake of compensation, it is essential that the capacitor dc-link voltage remains as close to the reference
value as possible. During transient operation, it is possible to improve the performance of the capacitor dc-link
voltage by adding the RL action vector to the d-axis current reference. As our model's action vectors in this paper
consist of the d-q axis current references, the change in the new d-axis current reference is shown in Figure 13.
It converges at a certain level to control the DC-link voltage at 3,000 V. Using the DDPG agent's action vector,
the voltage of the capacitor is maintained constant at the reference value of 3,000 V, as shown in Figure 13.

The final result of the current waveform from the D-STATCOM is shown in Figure 14. In this scenario, both the
conventional method and the DDPG approach guarantee voltage stability through reactive power control, which
improves the current transient response. In control systems, the settling time, which requires the response to
reach and stay within a specified range of 2% to 5% of its final value, is a crucial criterion in determining
control performance.
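Because settling time is the metric summarized in Tables 3 and 4, a simple helper for extracting it from a
simulated voltage or current trace is sketched below. The 2% band follows the criterion stated above, while the
trace format and the choice of the last sample of the window as the final value are assumptions.

    import numpy as np

    def settling_time(t, y, t_event, band=0.02):
        """Time after t_event at which y enters and stays within +/-band of its final value.

        t, y    : 1-D arrays of simulation time [s] and the monitored signal (e.g., p.u. voltage)
        t_event : instant of the reference or load step [s]
        band    : tolerance band around the final value (2 percent here, per the criterion above)
        """
        t, y = np.asarray(t), np.asarray(y)
        mask = t >= t_event
        t_w, y_w = t[mask], y[mask]
        y_final = y_w[-1]                          # assumed steady-state value at window end
        tol = band * abs(y_final)
        outside = np.abs(y_w - y_final) > tol
        if not outside.any():
            return 0.0                             # already inside the band at the event
        last_out = int(np.nonzero(outside)[0][-1]) # last sample still outside the band
        if last_out + 1 >= len(t_w):
            return float("nan")                    # never settles within the simulated window
        return float(t_w[last_out + 1] - t_event)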
TABLE 4. D-STATCOM current performance analysis – settling time.

The current settling time results of the two methods are compared in Table 4. As the reference changes from 1 to
0.995 p.u. at 0.2 s, the DDPG approach takes 0.0313 seconds to reach the steady state. In comparison, the
conventional control system takes 0.0753 seconds to reach the steady state in the same setup. In chronological
order, the settling times for the DDPG approach and the conventional control system are 0.0302 seconds and
0.0722 seconds, respectively, for the event at 0.4 seconds, and 0.0301 seconds and 0.0732 seconds, respectively,
for the event at 0.6 seconds. As described in Table 4, the DDPG approach helps improve the performance and settling
time when a time-varying reference is applied.
VI. CONCLUSION
With the increasing complexity of power systems, especially the distribution systems, the role of FACTS devices
becomes critical for system stability. In this paper, a 4.16 kV distribution system with a D-STATCOM is simulated
by Simulink (in the discrete time domain) in real time. Instead of a PID controller, the proposed approach applies
RL to regulate the d-axis and q-axis current references and control the nodal voltage. Since the references
converge fast, the observed current and voltage converge faster than with the conventional PID model. In addition,
these references are key to the fast operation of the dc-link capacitor to inject reactive power into the grid.
The simulations confirm that operating the D-STATCOM with the proposed model could induce a more stable voltage
profile and a better transient response. Furthermore, they verify that the model is robust against changes in the
D-STATCOM's reference voltage.

The role of the D-STATCOM would only become more essential with the increasing complexity of the grid due to the
higher penetration level of renewable energy. Future work would include renewable energy sources, such as wind and
solar, in the proposed study. Moreover, the AC OPF problem using FACTS devices will be dealt with in subsequent
research.
REFERENCES
[1] R. Zamora and A. K. Srivastava, "Controls for microgrids with storage: Review, challenges, and research needs," Renew. Sustain. Energy Rev., vol. 14, no. 7, pp. 2009–2018, Sep. 2010.
[2] Y. Naderi, S. H. Hosseini, S. G. Zadeh, B. Mohammadi-Ivatloo, J. C. Vasquez, and J. M. Guerrero, "An overview of power quality enhancement techniques applied to distributed generation in electrical distribution networks," Renew. Sustain. Energy Rev., vol. 93, pp. 201–214, Oct. 2018.
[3] E. Jamil, S. Hameed, B. Jamil, and Qurratulain, "Power quality improvement of distribution system with photovoltaic and permanent magnet synchronous generator based renewable energy farm using static synchronous compensator," Sustain. Energy Technol. Assessments, vol. 35, pp. 98–116, Oct. 2019.
[4] B. Singh, P. Jayaprakash, and D. P. Kothari, "New control approach for capacitor supported DSTATCOM in three-phase four wire distribution system under non-ideal supply voltage conditions based on synchronous reference frame theory," Int. J. Elect. Power Energy Syst., vol. 33, no. 5, pp. 1109–1117, Sep. 2011.
[5] H. Yoon and Y. Cho, "Imbalance reduction of three-phase line current using reactive power injection of the distributed static series compensator," J. Elect. Eng. Technol., vol. 14, no. 3, pp. 1017–1025, Feb. 2019.
[6] B. S. Goud and B. L. Rao, "Power quality enhancement in grid-connected PV/wind/battery using UPQC: Atom search optimization," J. Elect. Eng. Technol., vol. 16, no. 2, pp. 821–835, Jan. 2021.
[7] C. Kumar and M. K. Mishra, "A voltage-controlled DSTATCOM for power-quality improvement," IEEE Trans. Power Del., vol. 29, no. 3, pp. 1499–1507, Jun. 2014.
[8] B. Pragathi, R. C. Poonia, B. Polaiah, and D. K. Nayak, "Evaluation and analysis of soft computing techniques for grid connected photo voltaic system to enhance power quality issues," J. Elect. Eng. Technol., vol. 16, pp. 1833–1840, Apr. 2021.
[9] S. Ramachandran and M. Ramasamy, "Solar photovoltaic interfaced quasi impedance source network based static compensator for voltage and frequency control in the wind energy system," J. Elect. Eng. Technol., vol. 16, no. 3, pp. 1253–1272, Feb. 2021.
[10] B. Blazic and I. Papic, "Improved D-StatCom control for operation with unbalanced currents and voltages," IEEE Trans. Power Del., vol. 21, no. 1, pp. 225–233, Jan. 2006.
[11] C. K. Sao, P. W. Lehn, M. R. Iravani, and J. A. Martinez, "A benchmark system for digital time-domain simulation of a pulse-width-modulated D-STATCOM," IEEE Trans. Power Del., vol. 17, no. 4, pp. 1113–1120, Oct. 2002.
[12] K. Sayahi, A. Kadri, F. Bacha, and H. Marzougul, "Implementation of a D-STATCOM control strategy based on direct power control method for grid connected wind turbine," Int. J. Elect. Power Energy Syst., vol. 121, Oct. 2020, Art. no. 106105.
[13] Y. M. Zhao, W. F. Xie, and X. W. Tu, "Performance-based parameter tuning method of model-driven PID control systems," ISA Trans., vol. 51, no. 3, pp. 393–399, May 2012.
[14] S. Lee, J. Kim, L. Baker, A. Long, N. Karavas, N. Menard, I. Galiana, and C. J. Walsh, "Autonomous multi-joint soft exosuit with augmentation-power-based control parameter tuning reduces energy cost of loaded walking," J. Neuroeng. Rehabil., vol. 15, no. 1, pp. 1–9, Dec. 2018.
[15] B. Tandon and R. Kaur, "Genetic algorithm based parameter tuning of PID controller for composition control system," Int. J. Eng. Sci. Technol., vol. 3, no. 8, pp. 6705–6711, Aug. 2011.
[16] S. R. Arya and B. Singh, "Neural network based conductance estimation control algorithm for shunt compensation," IEEE Trans. Ind. Informat., vol. 10, no. 1, pp. 569–577, Feb. 2014.
[17] L. L. Lai, "A two-ANN approach to frequency and harmonic evaluation," in Proc. 5th Int. Conf. Artif. Neural Netw., 1997, pp. 245–250.
[18] Y. Pan and J. Wang, "Model predictive control of unknown nonlinear dynamical systems based on recurrent neural networks," IEEE Trans. Ind. Electron., vol. 59, no. 8, pp. 3089–3101, Aug. 2012.
[19] M. T. Ahmad, N. Kumar, and B. Singh, "Generalised neural network-based control algorithm for DSTATCOM in distribution systems," IET Power Electron., vol. 10, no. 12, pp. 1529–1538, Oct. 2017.
[20] W. J. Shipman and L. C. Coetzee, "Reinforcement learning and deep neural networks for PI controller tuning," IFAC-PapersOnLine, vol. 52, no. 14, pp. 111–116, 2019.
[21] F. S. Melo, S. P. Meyn, and M. I. Ribeiro, "An analysis of reinforcement learning with function approximation," in Proc. 25th Int. Conf. Mach. Learn. (ICML), Helsinki, Finland, 2008, pp. 664–671.
[22] S. P. Singh, T. Jaakkola, and M. I. Jordan, "Reinforcement learning with soft state aggregation," in Proc. Adv. Neural Inf. Process. Syst., 1995, pp. 361–368.
[23] C. Szepesvári and W. D. Smart, "Interpolation-based Q-learning," in Proc. 21st Int. Conf. Mach. Learn. (ICML), 2004, pp. 791–798.
[24] S. Adam, L. Busoniu, and R. Babuska, "Experience replay for real-time reinforcement learning control," IEEE Trans. Syst., Man, Cybern. C (Appl. Rev.), vol. 42, no. 2, pp. 201–212, Mar. 2012.
[25] Z. Cao, Q. Xiao, R. Huang, and M. Zhou, "Robust neuro-optimal control of underactuated snake robots with experience replay," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 1, pp. 208–217, Jan. 2018.
[26] P. Zhu, W. Dai, J. Ma, Z. Zeng, and H. Lu, "Multi-robot flocking control based on deep reinforcement learning," IEEE Access, vol. 8, pp. 150397–150406, 2020.
[27] J. Duan, D. Shi, R. Diao, H. Li, Z. Wang, B. Zhang, and D. Bian, "Deep-reinforcement-learning-based autonomous voltage control for power grid operations," IEEE Trans. Power Syst., vol. 35, no. 1, pp. 814–817, Jan. 2020.
[28] J. Hussain, M. Hussain, S. Raza, and M. Siddique, "Power quality improvement of grid connected wind energy system using DSTATCOM-BESS," Int. J. Renew. Energy Res., vol. 9, no. 3, pp. 1388–1397, Sep. 2019.
[29] A. Banerji, S. K. Biswas, and B. Singh, "DSTATCOM control algorithms: A review," Int. J. Power Electron. Drive Syst. (IJPEDS), vol. 2, no. 3, pp. 285–296, Sep. 2012.
[30] S. Bansrlar and R. Nayak, "Modeling of adaptable voltage controller and its stability analysis in distributed generation system," Int. J. Current Eng. Technol., vol. 5, no. 3, pp. 1798–1801, Jun. 2015.
[31] L. C. Baird and A. W. Moore, "Gradient descent for general reinforcement learning," in Proc. Adv. Neural Inf. Process. Syst., 1999, pp. 968–974.
[32] Z. Yang, K. Merrick, L. Jin, and H. A. Abbass, "Hierarchical deep reinforcement learning for continuous action control," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 11, pp. 5174–5184, Nov. 2018.
[33] J. Li, T. Chai, F. L. Lewis, Z. Ding, and Y. Jiang, "Off-policy interleaved Q-learning: Optimal control for affine nonlinear discrete-time systems," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 5, pp. 1308–1320, May 2019.
[34] M. Ramicic and A. Bonarini, "Correlation minimizing replay memory in temporal-difference reinforcement learning," Neurocomputing, vol. 393, pp. 91–100, Jun. 2020.
[35] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, and J. Veness, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[36] J. H. Woo, L. Wu, J. B. Park, and J. H. Roh, "Real-time optimal power flow using twin delayed deep deterministic policy gradient algorithm," IEEE Access, vol. 8, pp. 213611–213618, 2020.
[37] S. Dankwa and W. Zheng, "Twin-delayed DDPG: A deep reinforcement learning technique to model a continuous movement of an intelligent robot agent," in Proc. 3rd Int. Conf. Vis., Image Signal Process., Aug. 2019, pp. 1–5.
[38] Z. Zhang, D. Zhang, and R. C. Qiu, "Deep reinforcement learning for power system applications: An overview," CSEE J. Power Energy Syst., vol. 6, no. 1, pp. 213–225, Mar. 2020.
[39] IEEE PES Distribution System Analysis Subcommittee's Distribution Test Feeder Working Group. Accessed: Jul. 2004. [Online]. Available: https://cmte.ieee.org/pes-testfeeders/resources/
[40] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning), 2nd ed. Cambridge, MA, USA: MIT Press, 2018.
[41] K. Eckle and J. S. Hieber, "A comparison of deep networks with ReLU activation function and linear spline-type methods," Neural Netw., vol. 110, pp. 232–242, Feb. 2019.
[42] Y. Lin, J. McPhee, and N. L. Azad, "Comparison of deep reinforcement learning and model predictive control for adaptive cruise control," IEEE Trans. Intell. Vehicles, vol. 6, no. 2, pp. 221–231, Jun. 2021.
[43] R. K. Varma and M. Siavashi, "PV-STATCOM: A new smart inverter for voltage control in distribution systems," IEEE Trans. Sustain. Energy, vol. 9, no. 4, pp. 1681–1691, Oct. 2018.

LEI WU (Senior Member, IEEE) received the B.S. degree in electrical engineering and the M.S. degree in systems engineering from Xi'an Jiaotong University, Xi'an, China, in 2001 and 2004, respectively, and the Ph.D. degree in electrical engineering from Illinois Institute of Technology (IIT), Chicago, IL, USA, in 2008. From 2008 to 2010, he was a Senior Research Associate with the Robert W. Galvin Center for Electricity Innovation, IIT. He was a summer Visiting Faculty with NYISO, in 2012. He was a Professor with the Electrical and Computer Engineering Department, Clarkson University, Potsdam, NY, USA, till 2018. He is currently a Professor with the Electrical and Computer Engineering Department, Stevens Institute of Technology, Hoboken, NJ, USA. His research interests include power systems operation and planning, energy economics, and community resilience microgrid.

SUNG MIN LEE received the B.S. degree in electrical engineering from Konkuk University, Seoul, South Korea, in 2019, where he is currently pursuing an Integrated Ph.D. degree under the supervision of Prof. Y. H. Cho. His current research interests include high-power converters and grid-connected systems.

JONG-BAE PARK (Member, IEEE) received the B.S., M.S., and Ph.D. degrees from Seoul National University, South Korea, in 1987, 1989, and 1998, respectively. From 1998 to 2001, he was with the Electrical and Electronics Department, Anyang University, South Korea, as an Assistant Professor. From 2006 to 2008, he was a resident Researcher with EPRI, USA. Since 2001, he has been with the Electrical Engineering Department, Konkuk University, Seoul, South Korea, as a Professor. His major research interests include power system operation, planning, economics, and markets.