0% found this document useful (0 votes)
57 views11 pages

Safe Deep Reinforcement Learning-Based Real-Time Operation Strategy in Unbalanced Distribution System

Uploaded by

zinapyung
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
57 views11 pages

Safe Deep Reinforcement Learning-Based Real-Time Operation Strategy in Unbalanced Distribution System

Uploaded by

zinapyung
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

IEEE TRANSACTIONS ON INDUSTRY APPLICATIONS, VOL. 60, NO.

6, NOVEMBER/DECEMBER 2024 8273

Safe Deep Reinforcement Learning-Based Real-Time


Operation Strategy in Unbalanced
Distribution System
Yeunggurl Yoon, Graduate Student Member, IEEE, Myungseok Yoon, Student Member, IEEE,
Xuehan Zhang, Member, IEEE, and Sungyun Choi , Senior Member, IEEE

Abstract—Unbalanced voltages are one of the voltage quality effectiveness of rotating machines [4], [5]. In detail, the motor
issues affecting customer devices in distribution systems. Con- is overheated by over 2% of phase voltage imbalance when the
ventional optimization methods are time-consuming to mitigate machine operates close to the full load [6].
unbalanced voltage in real time because these approaches must
solve each scenario after observation. Deep reinforcement learning Previous works have introduced several control approaches
(DRL) is effectively trained offline for real-time operations that to overcome the unbalanced voltage problem or operate dis-
overcome the time-consumption problem in practical implemen- tribution systems. Optimization-based approaches, as well as
tation. This paper proposes a safe deep reinforcement learning model-based and heuristic methods, are applied to the distri-
(SDRL) based distribution system operation method to mitigate bution system operation. Model-based optimization addresses a
unbalanced voltage for real-time operation and satisfy operational
constraints. The proposed SDRL method incorporates a learning simulation model or a mathematical formulation to design an
module (LM) and a constraint module (CM), controlling the energy approximated solving problem of a complex system [7], [8].
storage system (ESS) to improve voltage balancing. The proposed In [7], A bi-level volt/var mathematical optimization is applied
SDRL method is compared with the hybrid optimization (HO) to control the reactive output of photovoltaic (PV) inverters,
and typical DRL models regarding time consumption and voltage on-load tap changers, and capacitor banks to minimize active
unbalance mitigation. For this purpose, the models operate in
modified IEEE-13 node and IEEE-123 node test feeders. power losses in the distribution system. Moreover, in an un-
balanced distribution system operation study [8], the authors
Index Terms—Deep reinforcement learning, hybrid opti- proposed a distributed voltage control algorithm in distributed
mization, quadratic programming, safe deep reinforcement
learning, voltage unbalance factor.
energy resources (DERs) by controlling reactive power based
on model-based optimizations using Lagrange multiplier and
primal-dual gradients. Meanwhile, heuristic optimization is an
I. INTRODUCTION empirical approach to searching space for optimal solutions
HIS paper extends an advanced approach to mitigating rather than employing a mathematical model [9], [10]. The non-
T unbalanced voltage from previous work [1]. Unbalanced
voltage in the distribution system is a common power quality
dominated sorting genetic algorithm solves a multi-objective
optimization of PV reactive power to minimize active power
problem that can lead to various issues, such as increased losses, loads, and PV active power output [9]. A metaheuristic
losses, reduced efficiency, and operational limitations. Increas- approach is developed to determine the optimal capacity of the
ing single-phase distributed energy resources and loads has newly inserted capacitor to mitigate unbalanced voltage in the
exacerbated this problem [2], [3]. The unbalanced voltage in the distribution system [10].
distribution system could damage components and reduce the To improve the existing optimization method for the distri-
bution system operation, hybrid optimization (HO) approaches
Manuscript received 15 January 2024; revised 8 May 2024; accepted 15 are introduced. Two kinds of heuristic optimization, a salp
July 2024. Date of publication 20 August 2024; date of current version 15 swarm algorithm and particle swarm optimization (PSO), are
November 2024. Paper 2023-PSEC-1639.R1, 2023 IEEE Industry Applications hybridized to determine the optimal allocation of DERs [11].
Society Annual Meeting, Nashville, TN, USA, Oct. 29–Nov. 02, and approved
for publication in the IEEE TRANSACTIONS ON INDUSTRY APPLICATIONS by The hybridization of genetic algorithms with an equilibrium
the Power Systems Engineering Committee of the IEEE Industry Applications optimizer algorithm has been used for the optimal location and
Society [DOI: 10.1109/IAS54024.2023.10406901]. This work was supported by size of renewable energy sources (RESs) in distribution systems
the National Research Foundation of Korea (N R F) funded by Korea Govern-
ment (MSIT) under Grant 2022R1A2C2011522 and Grant RS-2023-00218377. [12].
(Corresponding author: Sungyun Choi.) However, the detailed mathematical design, including DER
Yeunggurl Yoon, Myungseok Yoon, and Sungyun Choi are with the School uncertainties in the unbalanced three-phase distribution system,
of Electrical Engineering, Korea University, Seoul 02841, South Korea (e-mail:
[email protected]; [email protected]; [email protected]). is a massive obstacle for real-time operation using model-based
Xuehan Zhang is with the College of Electrical Engineering and Automation, optimization. The main characteristics of the modern distribu-
Fuzhou University, Fuzhou 350116, China (e-mail: [email protected]). tion system, variability and uncertainty, are reflected approxi-
Color versions of one or more figures in this article are available at
https://doi.org/10.1109/TIA.2024.3446735. mately in the mathematical model that is imprecise in the prac-
Digital Object Identifier 10.1109/TIA.2024.3446735 tical distribution system [13]. Moreover, both model-based and
0093-9994 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Korea University. Downloaded on December 09,2024 at 11:13:24 UTC from IEEE Xplore. Restrictions apply.
8274 IEEE TRANSACTIONS ON INDUSTRY APPLICATIONS, VOL. 60, NO. 6, NOVEMBER/DECEMBER 2024

heuristic optimization methods consume massive computation


time to explore each scenario and require re-optimization for
new scenarios [13], [14].
Meanwhile, machine learning-based offline training has ad-
vantages for real-time operation to overcome the drawbacks of
the conventional optimization methods. Reinforcement learn-
ing (RL) is an emerging technique for using control strategies
called the Markov decision process (MDP) [15]. Moreover,
deep reinforcement learning (DRL) algorithms have an approx-
imated nonlinear function, the deep neural network (DNN),
which replaces the Q-table or policy in conventional RL models
Fig. 1. Framework of the proposed SDRL model.
[16]. In [17], a microgrid operation strategy, day-ahead optimal
energy dispatching for a microgrid, was researched based on a
deep deterministic policy gradient (DDPG), an advanced DRL
r The proposed SDRL model ensures safety and effectively
algorithm. The authors in [18] proposed a branching dueling
Q-network model with an advanced structure that combined mitigates the unbalanced voltage. Moreover, it offers in-
long short-term memory (LSTM) with conventional DRL to tuitive constraints and maintains complete constraint sat-
control an energy storage system (ESS) for a microgrid con- isfaction. On the other hand, the typical DRL model re-
sidering uncertainties. Similarly, a day-ahead optimal dispatch quires a meticulously designed reward function, resulting
operation was applied to a distribution system based on double in increased model complexity. In this respect, the SDRL
Q-learning in [19], and the authors introduced approximated model is versatile and can be used to implement various
power flow using DNN to calculate the power loss and bus operational control strategies. CMDP is designed to handle
voltages. unbalanced voltage for ESS operation constraints.
r The proposed method operates by local decentralized mea-
Typical RL and DRL models reflect constraint terms as a
large coefficient or constant in a reward function to coerce limit surements where the ESS connects to the distribution
conditions. However, the large coefficient ratio of constraint and system. The controller autonomously determines the op-
objective reward terms in the reward function is not reasonable timal value without requiring communication with other
because the effect of the objective reward is slighter than the nodes. Validated on standard testbeds—the IEEE 13-
constraint term. Accordingly, the authors in [20] proposed a node and IEEE-123 node test feeders—the SDRL model
constrained soft actor-critic algorithm for volt-var control, and demonstrates enhancements in the voltage unbalance factor
the method reflected the physical constraints to satisfy the (VUF) across the entire distribution system, leveraging
voltage operating range. A constrained policy optimization was minimal local data.
introduced in [21] to optimize the neural network in the DRL, The rest of the paper is organized as follows. Section II
and a constrained MDP (CMDP) was formulated for optimal presents an SDRL-based operation method for mitigating unbal-
operation of the distribution system to achieve voltage regulation anced voltage in the distribution system. The simulation setup for
and minimum energy operating costs. The safety-guided DDPG the environment of the IEEE 13-node and 123-node test feeder
method with LSTM is presented to operate a multi-energy smart and data preparation for load and PV profiles are introduced in
grid while minimizing economic, social, and computing factors Section III. Then, Section IV presents a case study to validate the
[22]. The physical-informed safety layer is handled to satisfy effectiveness of the proposed method in terms of the VUF and
physical constraints in the microgrid’s economic operation [23]. computation time. Finally, the paper is concluded in Section V.
This paper presents an advanced approach, a safe-DRL
(SDRL) model with an additional optimization layer to control II. PROPOSED OPERATION METHOD
constrained three-phase inverter operation limits for the effi- Essentially, SDRL is an advanced DRL model that satisfies
cient and stable operation of the unbalanced distribution sys- physical constraints and is designed with two modules: the learn-
tem. Moreover, while addressing the objective and constraints ing module (LM) and the constraint module (CM). The main
of the unbalanced distribution system, three-phase measure- algorithms used in the two modules are DDPG and quadratic
ments are considered to design the CMDP of the proposed programming (QP), respectively. The CMDP, a framework for
method. The main contributions of this paper are summarized as the SDRL-based operation, is designed to satisfy physical con-
follows:
r Compared with optimization-based models, the DRL- straints. The structure of the proposed method is displayed in
Fig. 1, and the details of the framework are introduced in the
based offline training model overcomes the serious, time- following subsections.
consuming problem. Moreover, the DRL-based model is
a data-driven controller that avoids the complex modeling
of the uncertainties of the distribution system components. A. Constrained Markov Decision Process (CMDP)
Case studies indicate that the proposed model produces op- The MDP, which describes the entire framework of the RL
timal control signals within seconds, proving it is suitable operations, is a tuple consisting of S, A, P, R, and γ, which refer to
for real-time operation. the states, actions, transition probability, rewards, and discount

Authorized licensed use limited to: Korea University. Downloaded on December 09,2024 at 11:13:24 UTC from IEEE Xplore. Restrictions apply.
YOON et al.: SDRL -BASED REAL-TIME OPERATION STRATEGY IN UNBALANCED DISTRIBUTION SYSTEM 8275

factor, respectively. In the proposed architecture, the CMDP active power in the ESS is constrained to be the same as a
is presented to satisfy physical constraints and is additionally scheduled ESS profile. This condition is implemented in an
designed as a framework from MDP. The CMDP is defined as equality constraint as follows:
follows:
a b c
PESS = PESS + PESS + PESS , (8)
CM DP ∼ (S, A, P, R, γ) , (1)
where PESS is the scheduled ESS profile in the day-ahead
schedule. Each phase output of ESS must have the same sign
s.t. Ceq (S, A) = 0, to satisfy practical ESS operation as follows:
C (S, A) ≤ 0, (2) n
PESS · PESS > 0 (a, b, c ∈ n) . (9)
Each component of the CMDP comprises a closed loop, as n
Each phase of the ESS power factor pfESS is constrained to
shown in Fig. 1, where the states, actions, and rewards are
consider the voltage, current, and apparent power limit of the
depicted with dotted arrowed lines. The transition probability is
inverter [25], as follows:
meaningless because the states are determined by the determin-
istic action of the DDPG and the three-phase load flow result of n
0.6 ≤ |pfESS | ≤ 1 (a, b, c ∈ n) . (10)
the distribution system. The discount factor is not considered and
fixed with one because each time step in the simulation is equally The above ESS output sign and power factor conditions (9)
valuable. The equality and inequality constraint conditions are and (10) are implemented in inequality constraints in the CMDP.
denoted as Ceq (·) and C(·), respectively, to satisfy the safety of
the physical condition in the distribution system. The remaining B. Learning Module (LM)
CMDP components (states, actions, rewards, and constraints)
are designed concretely in the following paragraphs. The states are normalized to train the DNN layer of the
The three-phase voltage at the point of the common coupling LM. Moreover, z-score normalization is usually applied to the
(PCC) node connected to an ESS is observed for magnitude DNN input layer to train the network parameters efficiently.
and angle. States are the observations or measurements in the The states’ statistical mean and standard deviation values are
distribution system used to calculate the reward and as inputs for calculated from the base case with a total scenario. The base
the critic and actor networks in the DDPG. States are defined as case is a simulation for a three-phase balanced controlled ESS.
follows: The normalized state is calculated as follows:
s − μS
S = {|Va | , |Vb | , |Vc | , δa , δb , δc } , (3) s̃ = , s ∈ S, (11)
σS
where |Vn | is the voltage magnitude, and δn is the voltage angle
for three-phase a, b, c ࢠ n. where s̃ is the normalized state, μS is the mean value of the state,
The actions are the control variables of the RL agent which is and σ S is the standard deviation of the state.
the ESS in the proposed distribution system operation strategy. The LM trains action from the states and rewards using the
Action set is composed of six components as follows: DDPG algorithm, an actor-critic-based algorithm containing
two DNNs named actor and critic networks. The actor network
n n
A = {PESS , pfESS }, (4) π(S; θa ) consists of weight parameters θa and is designed to
n
where PESS are the three-phase active power outputs of the determine action. Next, the critic network Q(S, A; θc ) is pa-
n
ESS, and pfESS are the three-phase power factors. rameterized with θc and estimates the return from the observed
Rewards are the revenue from each time step, and the sum- state. The structures of the DNNs are displayed in Fig. 2. The
mation of the reward is an episode return. An episode is the total agent of the DDPG algorithm selects an action based on the actor
experience of the agent from start to finish. The reward function network. The DNN-based approach enables the formulation of
to minimize unbalanced voltage is presented with the voltage an MDP with continuous states and actions.
unbalanced factor (VUF) as follows [24]: The networks have fully connected (FC) and activation func-
tion layers to train nonlinear relationships. The rectified linear
V2 unit (ReLU) layers are commonly employed to increase the
R = − V UF = − × 100, (5)
V1 nonlinearity of neural networks. In the actor network, the tanh
1  a  layer enforces output values between −1 and +1 for the ESS
V1 = VP CC + αVPb CC + α2 VPc CC , (6) output limitation.
3
1  a  The DDPG agent selects the final action using the Ornstein–
V2 = VP CC + α2 VPb CC + αVPc CC , (7) Uhlenbeck (OU) action noise model. The OU action noise model
3
applies to stimulate agent exploration by giving random noise
where R is the reward function of the SDRL model. V1 and V2 to the output of actor network as follows:
are the positive- and negative-sequence components of the ESS   √
PCC voltage, respectively, and α = e 3 πi . Nt+1 = Nt + cnc μN − Nt t + σtN ε t,
2
(12)
ESS operation limitations are designed for the constraint part N
of the CMDP as in (2). The summation of the three-phase σt−1 = σtN (1 − λ) , (13)

Authorized licensed use limited to: Korea University. Downloaded on December 09,2024 at 11:13:24 UTC from IEEE Xplore. Restrictions apply.
8276 IEEE TRANSACTIONS ON INDUSTRY APPLICATIONS, VOL. 60, NO. 6, NOVEMBER/DECEMBER 2024

from the time step of the corresponding state, i, to the last


time step of the episode. The deterministic policy gradient is
introduced to update the weight parameters of the actor network
as follows:
∇θa J =

1  
M
∇θa π (Si ; θa ) ∇A Q (Si , A; θc ) |A = π(Si ;θa ) ,
M i=1
(17)
θa ← θa + αa ∇θa J, (18)
where the gradient of expected return ∇θa J weighted by a
learning rate of actor network αc is applied to update actor
network parameters for the gradient ascent. Finally, the DDPG
Fig. 2. DNN structures in the proposed SDRL model: (a) actor network and agent updates target parameters θat and θct as follows:
(b) critic network.
θat ← τ θa + (1 − τ ) θat , (19)
θct ← τ θc + (1 − τ ) θct . (20)
where Nt is the noise value for time step t, cnc is the amount of
noise converging to the noise mean μN , and σtN is the standard The target parameters are updated with the target smooth
deviation of noise discounted by decay ratio λ for time step t. factor τ for every time step of the training progress. The overall
After observing the state and selecting action through explo- weight and target parameters update procedure iterates for each
rations, experiences are accumulated in a replay buffer, which time step and episode until training termination.
stores data for parameter update. In other words, the actor and
critic networks update weight parameters based on replay buffer C. Constraint Module (CM)
experiences. The first step in the parameter update process is The CM is connected to the LM and formulated by an op-
updating critic parameters to determine the exact return by timization approach to satisfy constraint terms in the CMDP.
minimizing a loss function: QP is an optimization problem formulation with a second-order
M objective function subject to linear equality and inequality con-
1 
L = (yi − Q (Si , Ai ; θc ))2 , (14) straints. In the proposed method, the QP is modified from the
2M i = 1 standard QP form to limit the output action of the LM. The
where M is the minibatch size, and i is an experience in the modified objective function is as follows:
replay buffer. The loss function is calculated for the minibatch min (u − u0 )2 , (21)
chosen from the replay buffer. Moreover, yi is the target of the  T
value function in the experience i as: u = P a  b c  a  b  c
ESS , PESS , PESS , pfESS , pfESS , pfESS ,
yi = (22)

Ri , if, t = T, a
u0 = PESS b
, PESS c
, PESS a
, pfESS b
, pfESS c
, pfESS
T
, (23)
Ri + γQt (Si,t+1 , π (Si,t+1 ; θat ) ; θct ) , otherwise.
(15) where u is the constrained estimated action from u0 , and u0 is the
original action from the LM. The CM selects the nearest action
where the superscript t of the weight parameter, θ, and the by minimizing the squared error that satisfies the constraints.
critic network, Q, represents the ‘target,’ temporarily assumed The constrained actions are subject to:
as an exact value or network. Meanwhile, a subscript t of state
represents the time step of the experience. The estimated return f + g · u ≤ 0, (24)
term of the target critic network is zero for the last time step, T,
u− ≤ u ≤ ū. (25)
of an episode. In the last step of updating the critic network, the
weigh parameters are updated using gradient descent of the loss where f is the constant value matrix from the observations, and
function as follows: g is a matrix multiplied by a constrained action. In other words,
θc ← θc − αc ∇θc L, (16) f is a zero-order term, and g is the first-order coefficient of the
actions. u− and ū are the lower and upper bounds of the actions,
where parameters are updated by a learning rate of the critic respectively. The constraints terms of the QP are formulated
network αc . according to (8)–(10) as follows:
Moreover, the actor network is trained for experienced
episodes to maximize expected return J, the sum of the rewards f = [−PESS , PESS , 0, 0, 0] , (26)

Authorized licensed use limited to: Korea University. Downloaded on December 09,2024 at 11:13:24 UTC from IEEE Xplore. Restrictions apply.
YOON et al.: SDRL -BASED REAL-TIME OPERATION STRATEGY IN UNBALANCED DISTRIBUTION SYSTEM 8277

Algorithm 1: Combined of LM and CM.


1: Initialize critic network Q(S, A; θ c ) with random
weight parameters θ c , and target critic network with
same value θ tc = θ c
2: Initialize actor network π(S; θ a ) with random weight
parameters θ a , and target actor network with same
value θ ta = θ a
3: for episode = 1 to final episode E do
4: Reset the state as S1 for the first time step
5: for t = 1 to T do
6: Select an action A = π(S; θ a ) + N t where Nt
is OU action noise according to (12) and (13)
7: execute CM to minimize objective function (21)
subject to (22) and (23)
8: return constrained action Aconst = u∗ , where u∗ Fig. 3. The normalized average actual (a) load and (b) PV hourly profiles for
a year are illustrated as a solid line with points, and the 99% confidence interval
is optimal solution of the CM of data is illustrated as the background color for each profile.
9: Save experience (St , At , Rt , St+1 ) in the experience
buffer, while Aconst is action signal to control ESS
10: Select M random samples for minibatch. parameters of the proposed SDRL model to mitigate unbalanced
11: Select critic network target value using (15) voltage.
12: Update critic network parameters to minimize the
loss function by (14)–(16) A. Data Preparation for Load and PV Profiles
13: Update actor network parameter using gradient
ascent by (17) and (18) The actual load profile [26] from South Korea’s national
14: Update target actor and critic network parameter data and actual PV data [27] from Yeongam-gun, Jeollanam-do,
by (19) and (20) South Korea, simulate scenarios hourly for a year. The raw
15: end for load data represent residential and industrial power consumption
16: end for behaviors, providing average monthly 24-hour profiles. The
number of load profiles for each load type is 288, which is 12
months multiplied by 24 hours. These profiles are normalized
⎡ ⎤ and are scaled by the maximum load of each bus for simulation
1 1 1 0 0 0
⎢ −1 purposes. The raw PV data are given hourly for a year and catego-
⎢ −1 −1 0 0 0⎥⎥ rized into sunny and cloudy days to replicate common scenarios
g= ⎢
⎢−PESS 0 0 0 0 0⎥⎥, (27)
⎣ 0 observed in practice. These data are preprocessed to average
−PESS 0 0 0 0⎦
monthly 24-hour profiles for each weather condition to match
0 0 −PESS 0 0 0
the number of scenarios as load profiles. Namely, the number of
a b c T PV scenarios for each weather condition is 288. The normalized
u− = kP 1 , kP 1 , kP 1 , kpf 1 , kpf 1 , kpf 1 , (28)
residential load profiles, along with two categories of industrial
a b c T profiles (data center and medical center), are prepared with the
ū = kP 2 , kP 2 , kP 2 , kpf 2 , kpf 2 , kpf 2 , (29)
  normalized sunny and cloudy day PV profiles, as depicted in
0, PESS ≥ 0, 1, u ≥ 0, Fig. 3. In the ESS scenario, state of charge (SOC) scenarios
kP 1 = k = (30)
−1, PESS < 0, P 2 0, u < 0, are randomly generated from 10% to 90%, assuming day-ahead
 n
 n optimization. Finally, these monthly 24-hour load, PV, and ESS
n 0.6, pfESS ≥ 0, n 1, pfESS ≥ 0,
kpf = n k pf = n scenarios are amalgamated to simulate the distribution system.
1 −1, pfESS < 0, 2 −0.6, pfESS < 0.
In total, 576 scenarios are generated, consisting of 288 scenarios
(31) for two weather conditions.
The parameters kP1 , kP2 , kpf n n According to Fig. 3, three load types show distinct load
1 , and kpf 2 of the lower and
upper bounds are constants determined by the sign of ESS output patterns. The data center consumes power with low variability
or power factor actions, where a, b, c ∈ n denotes three-phase during the day and for each month. Meanwhile, the medical
n
for kpf n center shows a pattern of high power consumption during the
1 and kpf 2 . The overall approach of the proposed method
is presented in Algorithm 1. daytime and low power consumption during the nighttime, and
monthly variability exists. Finally, the residential load peaks in
the evening when people get off work, and monthly variability
III. DATA AND SIMULATION SETUP exists.
The actual load and PV data are preprocessed and applied The scenarios are separated into training and test sets to
to the modified IEEE testbeds. The test feeders are modified validate the SDRL in untrained scenarios. Typically, test sets are
to simulate unbalanced voltage with DERs, introducing the selected for 10–20% of the total generated dataset. In our case,

Authorized licensed use limited to: Korea University. Downloaded on December 09,2024 at 11:13:24 UTC from IEEE Xplore. Restrictions apply.
8278 IEEE TRANSACTIONS ON INDUSTRY APPLICATIONS, VOL. 60, NO. 6, NOVEMBER/DECEMBER 2024

Fig. 6. Three-phase voltage measurements for each node in the modified IEEE
13-node test feeder for the basecase.

Fig. 4. Modified IEEE 13-node test feeder for validating models. Red circles
on nodes are three-phase nodes where VUF is calculated. TABLE I
LM HYPERPARAMETERS AND TRAIING OPTIONS

Fig. 5. Modified IEEE 123-node test feeder for validating models. Red circles
on nodes are three-phase nodes where VUF is calculated.

500 training scenarios and 76 test scenarios are prepared for


the case study. The scenarios are applied to the distribution
system testbed to serve load flow scenarios for the training
process.

B. Simulation Environment and Parameters Setup


ESS, the loads’ active power on a, b, and c-phase was multiplied
The simulation environment was designed based on the IEEE by 2, 5, and 2, respectively. The three-phase voltage of the
13-node and 123-node test feeder [28], as displayed in Figs. 4 substation node is revised from 1 to 1.03 pu while neglecting
and 5. Both test feeders are modified by adding a PV and an ESS the operation of voltage regulators. Similar to the execution on
by considering unbalanced voltage in the base case simulation the 13-node system, the loads’ active power on the b and c-phases
where no control model is conducted on the system. Moreover, was multiplied by 1.5 and 2, respectively. 200 kW, 300 kW, and
in both test feeders, the load type for each node is randomly 100 kW PVs with 20% utilization rate are installed in the node
selected among three load types, as presented in Fig. 3. In the 89, and 1MW ESS is installed in the node 83.
IEEE 13-node test feeder, the 300 kW and 100 kW PVs with The simulation results of the IEEE 13-node and 123-node test
20% utilization rate are installed in the node 680 a-phase and feeder on the base case regarding three-phase voltage are shown
b-phase, respectively. The 1 MW ESS is installed in the node in Figs. 6 and 7. The modified test feeders operate within the
675 in the 13-node test feeder. Several parameters in the systems proper voltage level while encountering unbalanced voltage by
are revised to induce higher unbalanced voltage. The voltage unbalanced integrated loads and DERs.
values of the substation acting as a main grid in the distribution The LM hyperparameters for the critic network, actor net-
system are revised as Va from 1.021 to 1.011 and Vb from 1.042 work, and agent training options, including OU action noise
to 1.032. Moreover, due to the additional integration of PV and method parameters, are shown in Table I.

Authorized licensed use limited to: Korea University. Downloaded on December 09,2024 at 11:13:24 UTC from IEEE Xplore. Restrictions apply.
YOON et al.: SDRL -BASED REAL-TIME OPERATION STRATEGY IN UNBALANCED DISTRIBUTION SYSTEM 8279

Fig. 7. Three-phase voltage measurements for each node in the modified IEEE 123-node test feeder for the basecase.

TABLE II
HYPERARAMETERS OF PSO
Algorithm 2: Hybrid Optimization – PSO with CM.
1: Initialize N particles’ xi and vi of each particle i
2: Initialize memory values of the optimization as
𝓟 = −∞ for 𝓟 = {p1 , p2 , . . . , pi } and
G = max(𝓟) = − ∞ and the number of iteration
k=0
3: repeat
4: for i = 1 to N do
5: Choose random subset S of N particles except i.
6: Find optimal values of the neighbors G(S) =
IV. CASE STUDY max(RDRL (S))
7: Update velocity (32)
This section compares the proposed SDRL model with HO
8: Update position (33)
and a typical DRL model regarding VUF mitigation perfor-
9: execute CM to revise position through (24)–(31)
mance and computation time in unbalanced distribution systems.
10: return constrained action xi = u∗
11: if R(xi ) > pi do
A. PSO-Based Hybrid Optimization 12: pi ← xi
A heuristic optimization algorithm, PSO, is introduced to 13: if R(xi ) > G do
compare with the proposed method. The heuristic optimization 14: G ← xi
methods are appropriate to compare with the RL method because 15: end if
they operate in a model-free environment as the RL agent did. 16: end if
The PSO operates instead of the proposed LM, and it connects 17: end for
with the CM to satisfy constraints. Two optimization algorithms, 18: k ← k+1
PSO and QP, operate dually, so this structure is called a HO. The 19: e = |Gk − Gk-1 |
hyperparameters of PSO are set as in Table II. The positions 20: until |e| < τ , t = TPSO , G > Gmax , or k = kmax
of particles and control variables are denoted as (3) and (4),
respectively; the fitness function is also designed as the reward
function of the SDRL as (5)–(7). Algorithm 2 accounts for the The PSO algorithm repeats the above two steps, and
implementation of the HO, and its several notations are listed in it is terminated when it meets one of four conditions as
Table II. follows:
The PSO algorithm progresses through three steps to deter-
r The difference in global fitness between iterations is lesser
mine optimal values. First, PSO calculates the finesses of each than optimization tolerance.
particle i and stores the best fitness of each particle itself to pi
r The computation time is over one hour; the PSO fails
and the best fitness of a random set S of particles to G. Next, to converge to the optimal value within the time step of
the velocity and position of each particle are updated through scenarios.
fitness values. The particle velocity vi updates as follows:
r The global fitness exceeds the objective limit, regarded as
    an optimal value.
vik+1 = vik + r1 pi − xki + r2 G − xki , (32) r The number of iterations of the optimization progress
exceeds ten.
where r1 is self-adjustment weight and r2 is social adjustment
weight. Also, the position of particle i updates with updated B. Non-Safe Deep Reinforcement Learning
velocity as
A typical DRL model is an unimpressive DRL model without
xik+1 = xki + vik+1 . (33) an optimization layer for constraints. In this sense, the model

Authorized licensed use limited to: Korea University. Downloaded on December 09,2024 at 11:13:24 UTC from IEEE Xplore. Restrictions apply.
8280 IEEE TRANSACTIONS ON INDUSTRY APPLICATIONS, VOL. 60, NO. 6, NOVEMBER/DECEMBER 2024

Fig. 8. Training progress of (a) SDRL and (b) DRL model to mitigate unbal-
anced voltage in the modified IEEE 13-node test feeder.
Fig. 9. Comparison of VUF in the modified IEEE 13-node test feeder on test
scenarios.

requires a more specific MDP to enforce physical constraints,


so the state, action, and reward functions are slightly modified
from the proposed SDRL model (3)–(5) as follows:

S DRL = {|Va | , |Vb | , |Vc | , δa , δb , δc , PESS } , (34)


 a 
ADRL = PESS b
, PESS a
, pfESS b
, pfESS c
, pfESS , (35)
c a b
PESS = PESS − PESS − PESS . (36)

The total ESS output PESS is additionally observed, and the c-


phase output is determined by the other phases’ output to ensure
Fig. 10. Training progresses of (a) SDRL and (b) DRL model to mitigate
the sum of the three-phase ESS output as (8). unbalanced voltage in the modified IEEE 123-node test feeder.
Moreover, the typical DRL model has penalty terms in the
reward function to satisfy physical or operational constraints.
TABLE III
RDRL = R + p, (37) PERFORMANCES OF MODELS IN MODIFIED IEEE 13-NODE TEST FEEDER


6
p = − μp ηi , (38)
i=1

where r is the reward term, and p is the penalty term of the


reward function RDRL of the non-safe DRL model. The reward
term R is the same as the SDRL model (5). The penalty term
μp induces the agent to avoid operational conditions (9) and (10).
The penalty coefficient is set for 100 and is multiplied by the experiences more episodes to acquire high rewards and avoid
number of dissatisfactions of action A. Six logical bool values penalties simultaneously.
η i are 1 for dissatisfying and 0 for satisfying. The structure of After training SDRL and DRL models, trained models operate
DNNs and almost all hyperparameters are set as an SDRL model, on the test scenarios, and the results are in Fig. 9 and Table III.
as in Fig. 2 and Table I, while the Initial OU noise variance is The VUF was only calculated on six nodes despite the 13-node
revised to 0.5 to improve training convergence. system because the other nodes are either single- or two-phase.
Fig. 9 compares the average VUF for the train and test set in
the base case, DRL, SDRL, and HO. The HO was simulated on
C. Voltage Unbalance Mitigation Results only test scenarios to compare with the other control models.
HO and non-safe DRL models are simulated in the IEEE The SDRL-based model afforded appropriate control on both
13-node and 123-node test feeders to compare the performance the test and train scenarios.
of the proposed SDRL method. DRL and SDRL experience The numerical results of the performance and computation
episodes, which are training progress that the model trains time are shown in Table III. The training computation time was
through whole training scenarios for each episode. In the mod- measured during the total training time, and the test computation
ified IEEE 13-node test feeder, the SDRL and DRL models are time was calculated by a mean value of elapsed time for 76 test
trained through 1000 and 1500 episodes, respectively, in Fig. 8. scenarios. Even though the SDRL model contains optimization
Fundamentally, DRL takes much lower reward values because stage CM, it takes 0.0093 seconds for each scenario while the
of penalty terms. The SDRL experiences gradually increasing DRL-only model takes 0.0081 seconds. The calculation time
through passing episodes, but DRL trains unstably with oscillat- of CM is short, so it does not affect the real-time operation. In
ing rewards because it takes penalties against constraints while other words, the additional layer, CM, is acceptable for safety
selecting random actions by the OU noise. Therefore, DRL real-time operation in the unbalanced distribution system. The

Authorized licensed use limited to: Korea University. Downloaded on December 09,2024 at 11:13:24 UTC from IEEE Xplore. Restrictions apply.
YOON et al.: SDRL -BASED REAL-TIME OPERATION STRATEGY IN UNBALANCED DISTRIBUTION SYSTEM 8281

Fig. 11. Comparison of VUF in the modified IEEE-123 node test feeder on test scenarios.

Fig. 12. Unbalanced voltage mitigation result using the SDRL in the modified IEEE-123 node test feeder simulation.

TABLE IV
PERFORMANCES OF MODELS IN MODIFIED IEEE 123-NODE TEST TEEDER

Fig. 13. Unbalanced voltage mitigation result using the SDRL in the modified According to Fig. 11 and Table IV, the SDRL model performs
IEEE-13 node test feeder simulation. better than the DRL model in terms of VUF mitigation capability.
Also, the DRL model avoids penalty perfectly, with 100% satis-
faction even in the 123-node system. Still, it concentrates weakly
HO model performs slightly better than the SDRL model, but it on training the function of VUF mitigation. The SDRL and
consumes massive amounts of time, which is unavailable for DRL-only models take 0.0376 and 0.0445 seconds on average
the real-time operation of the distribution system. The VUF for each test scenario, respectively. Even in the IEEE-123 node
improvement was the calculated average difference between the test feeder, the computation time is slightly different between
base case VUF and the VUF of each control model except for SDRL and DRL, but there is no impact on real-time operations
the nearby substation nodes 632 and 633, which are uncontrol- whether a constraint module is added or not. Moreover, the HO
lable for mitigating VUF. Meanwhile, the DRL model trains model consumed high computation time to acquire an appro-
effectively to work as a safe model, showing 100% constraint priate control signal for mitigating unbalanced voltage. The HO
satisfaction, yet it struggles to adeptly mitigate VUF due to sig- algorithm terminates the arithmetic operation after observing for
nificantly higher weights assigned to penalty terms than reward an hour in the 123-node case, which has a complex system to
terms. calculate.
Moreover, the SDRL and DRL model train to mitigate VUF The voltage mitigation performance is analyzed in the three-
in the IEEE 123-node test feeder, and the training progresses phase voltage measurements comparison of both the base case
are shown in Fig. 10. The SDRL model trains for 1000 episodes and the SDRL model-based operation as in Figs. 12 and 13, and
with slight oscillations but terminates on high reward. How- the figures displayed 123-node and 13-node cases, respectively.
ever, the DRL model experiences much-oscillated rewards as The figures compare the three-phase voltage in the base case
it progresses, causing unstable training. VUF mitigation results for dotted lines and operated by the SDRL model for solid
for each node are shown in Fig. 11, and the VUF of 67 nodes, lines. Both distribution system cases show that the SDRL model
which are three-phase nodes, are calculated. controls each phase of ESS; then, the voltage of each phase is

Authorized licensed use limited to: Korea University. Downloaded on December 09,2024 at 11:13:24 UTC from IEEE Xplore. Restrictions apply.
8282 IEEE TRANSACTIONS ON INDUSTRY APPLICATIONS, VOL. 60, NO. 6, NOVEMBER/DECEMBER 2024

controlled by manipulating the ESS, aiming to regulate each [2] O. Mrehel and A. A. Issa, “Voltage imbalance investigation in residential
phase’s voltage magnitudes to be one pu. LV distribution networks with rooftop PV system,” in 2022 IEEE 2nd
Int. Maghreb Meeting Conf. Sci. Techn. Autom. Control Comput. Eng.,
According to the proposed method results in both test feeders, Sabratha, Libya, Jul. 2022, pp. 655–662.
the SDRL operates local control for a single ESS in the unbal- [3] G. Gupta and W. Fritz, “Voltage unbalance for power systems and miti-
anced distribution system, and the model determines action by gation techniques a survey,” in Proc. IEEE 1st Int. Conf. Power Electron.,
Intell. Control Energy Syst., Delhi, India, Jul. 2016, pp. 1–4.
measured voltage regardless of the other component profiles. [4] A. Nakadomari et al., “Unbalanced voltage compensation with optimal
The proposed control is encouraged to be set to only some voltage controlled regulators and load ratio control transformer,” Energies,
ESSs in the entire distribution system. In some cases where vol. 14, no. 11, pp. 2997–3014, May 2021.
[5] J. Kennedy, M. Morcos, and A. Lo, “Cost allocation of voltage unbalance
too many ESSs set by the proposed method are located near in distribution networks,” in 2020 19th Int. Conf. Harmon. Qual. Power,
each other, inappropriate control interactions may be caused. Dubai, UAE, Aug. 2020, pp. 1–5.
However, most ESSs contribute to the economic operation of a [6] IEEE Recommended Practice for Electric Power Distribution for Indus-
trial Plants, IEEE Standard 141-1993, Apr. 29, 1994.
distribution system or a microgrid in practice. Therefore, a few [7] Y. Long and D. S. Kirschen, “Bi-level volt/VAR optimization in distribu-
ESSs controlled by the proposed SDRL model without control tion networks with smart PV inverters,” IEEE Trans. Power Syst., vol. 37,
conflict are reasonable for unbalanced voltage mitigation and no. 5, pp. 3604–3613, Sep. 2022.
[8] N. Patari, A. K. Srivastava, G. Qu, and N. Li, “Distributed voltage control
perform reasonably in the case studies. for three-phase unbalanced distribution systems with DERs and practi-
To extend this study, a slightly revised MDP based on the cal constraints,” IEEE Trans. Ind. Appl., vol. 57, no. 6, pp. 6622–6633,
proposed SDRL can develop an entirely coordinated ESS con- Nov./Dec. 2021.
[9] Y. Ai, “The optimization of reactive power for distribution network with
trol model. The coordinated approach demands observing the PV generation based on NSGA-III,” CPSS Trans. Power Electron. Appl.,
states of several nodes. Namely, the approach requires com- vol. 6, no. 3, pp. 193–200, Sep. 2021.
plete communication infrastructure in the distribution system. [10] T. A. P. Beneteli, L. P. Cota, and T. A. M. Euzébio, “Limiting current
and voltage unbalances in distribution systems: A metaheuristic-based
This concept has a different purpose and structure from the decision support system,” Int. J. Elect. Power Energy Syst., vol. 135,
proposed method, but it has the expandability of the distribu- Feb. 2022, Art. no. 107538.
tion system study. The coordinated SDRL-based comprehensive [11] A. Ahmed, M. F. Nadeem, A. T. Kiani, N. Ullah, M. A. Khan, and A.
Mosavi, “An improved hybrid approach for the simultaneous allocation
study, including economic and ecological operation for high of distributed generators and time varying loads in distribution systems,”
power quality distribution systems, can be developed in future Energy Rep., vol. 9, pp. 1549–1560, Dec. 2023.
work. Innovative communication and measurement infrastruc- [12] O. M. Bakry et al., “Improvement of distribution networks performance
using renewable energy sources based hybrid optimization techniques,”
tures should be assumed in the future comprehensive operation Ain Shams Eng. J., vol. 13, no. 6, Nov. 2022, Art. no. 101786.
methodology. [13] D. Qiu, Y. Wang, W. Hua, and G. Strbac, “Reinforcement learning for elec-
tric vehicle applications in power systems: A critical review,” Renewable
Sustain. Energy Rev., vol. 173, Mar. 2023, Art. no. 113052.
V. CONCLUSION [14] Y. Zhang, X. Wang, J. Wang, and Y. Zhang, “Deep reinforcement learning
based volt-VAR optimization in smart distribution systems,” IEEE Trans.
The paper proposes the SDRL-based model using local mea- Smart Grid, vol. 12, no. 1, pp. 361–371, Jan. 2021.
surements to mitigate the unbalanced voltage in the distribution [15] D. Ernst, M. Glavic, and L. Wehenkel, “Power systems stability control:
systems. The method comprises an LM and a CM, and CM is de- Reinforcement learning framework,” IEEE Trans. Power Syst., vol. 19,
no. 1, pp. 427–435, Feb. 2004.
veloped to constrain the action to satisfy the three-phase physical [16] J. Li, H. Liu, Y. Zhang, G. Su, and Z. Wang, “Artificial intelligence assistant
limitations of the ESS inverter using modified QP. The operating decision-making method for main & distribution power grid integration
structure is designed as CMDP to reflect the implementation of based on deep deterministic network,” in 2021 IEEE 4th Int. Elect. Energy
Conf., Wuhan, China, Aug. 2021, pp. 1–5.
both LM and CM. A modified IEEE 13-node and 123-node test [17] H. Bian, X. Tian, J. Zhang, and X. Han, “Deep reinforcement learn-
feeder are applied to validate the proposed method, and HO ing algorithm based on optimal energy dispatching for microgrid,” in
and non-safe DRL are also conducted in the environments to 2020 5th Asia Conf. Power Elect. Eng., Chengdu, China, Jul. 2020,
pp. 169–174.
compare performance. The HO consumes computation time for [18] H. Shuai, F. Li, H. Pulgar-Painemal, and Y. Xue, “Branching dueling Q-
thousands of seconds, which is unsuitable for real-time opera- network-based online scheduling of a microgrid with distributed energy
tion, even if it achieved slightly better performance to mitigate storage systems,” IEEE Trans. Smart Grid, vol. 12, no. 6, pp. 5479–5482,
Nov. 2021.
VUF than the SDRL. The DRL model is designed to satisfy [19] X. Li, X. Han, and M. Yang, “Day-ahead optimal dispatch strategy for ac-
constraints with additional penalty terms by revising the reward tive distribution network based on improved deep reinforcement learning,”
function of the SDRL. However, the DRL model succeeds in IEEE Access, vol. 10, pp. 9357–9370, 2022.
[20] H. Li and H. He, “Learning to operate distribution networks with safe
determining control variables within constraints, but it shows deep reinforcement learning,” IEEE Trans. Smart Grid, vol. 13, no. 3,
insufficient performance for mitigating unbalanced voltage. As pp. 1860–1872, May 2022.
a result, the SDRL model operation accomplishes real-time [21] W. Wang, N. Yu, Y. Gao, and J. Shi, “Safe off-policy deep reinforcement
learning algorithm for volt-VAR control in power distribution systems,”
operation in the unbalanced distribution system regarding VUF IEEE Trans. Smart Grid, vol. 11, no. 4, pp. 3008–3018, Jul. 2020.
mitigation and time consumption. [22] Y. Wang, D. Qiu, M. Sun, G. Strbac, and Z. Gao, “Secure energy
management of multi-energy microgrid: A physical-informed safe re-
inforcement learning approach,” Appl. Energy, vol. 335, Apr. 2023,
REFERENCES Art. no. 120759.
[23] D. Qiu, Z. Dong, X. Zhang, Y. Wang, and G. Strbac, “Safe reinforcement
[1] Y. Yoon, M. Yoon, and S. Choi, “Safe deep reinforcement learning-based learning for real-time automatic control in a smart energy-hub,” Appl.
real-time operation strategy in unbalanced distribution system,” in Proc. Energy, vol. 309, Mar. 2022, Art. no. 118403.
IEEE Ind. Appl. Soc. Annu. Meeting, Nashvillle, TN, USA, Oct. 2023, [24] M. E. El-Hawary et al., “Definitions of voltage unbalance,” IEEE Power
pp. 1–6. Eng. Rev., vol. 21, no. 5, pp. 49–51, May 2001.

Authorized licensed use limited to: Korea University. Downloaded on December 09,2024 at 11:13:24 UTC from IEEE Xplore. Restrictions apply.
YOON et al.: SDRL -BASED REAL-TIME OPERATION STRATEGY IN UNBALANCED DISTRIBUTION SYSTEM 8283

[25] ABB, “Energy storage modules (ESM) up to 4 MW output voltage Myungseok Yoon (Student Member, IEEE) received the B.S. degree in electrical
range of 120 volts to 40.5 kV,” Lake Mary, FL, USA, Apr. 2012. engineering in 2019 from Korea University, Seoul, South Korea, where he is
Accessed: Aug. 24, 2024. [Online]. Available: [Online]. Available: currently working toward the direct Ph.D. degree in electrical engineering. His
https://new.abb.com/docs/librariesprovider27/default-document- research interests include renewable energy and grid code in distribution systems
library/energy_storage_modules_brochure_rev_e.pdf?sfvrsn=2 and microgrids.
[26] KEPCO Management Research Institute (KEMRI), “2022 analysis
of power consumption behavior,” May 2023, Accessed: Jan. 09, Xuehan Zhang (Member, IEEE) received the B.S. degree in electrical engineer-
2024. [Online]. Available: https://home.kepco.co.kr/kepco/KR/ntcob/ ing and its automation, the M.S. degree in power system and its automation from
ntcobView.do?pageIndex=1&boardSeq=21062036&boardCd=BRD_ the South China University of Technology, Guangzhou, China, in 2015 and 2019,
000456&menuCd=FN311802&parnScrpSeq=0&searchCondition= respectively, and the Ph.D. degree in electrical energy from Korea University,
total&searchKeyword= Seoul, South Korea, in 2023. His research interests include the optimization
[27] Korea Public Data Portal, “Korea rural community corporation Yeong-am methods of distribution systems and microgrids.
solar power plant generation status,” Apr. 2023, Accessed: Jan. 09, 2024.
[Online]. Available: https://www.data.go.kr/data/15005796/fileData.do
[28] W. H. Kersting, “Radial distribution test feeders,” in Proc. IEEE Power Sungyun Choi (Senior Member, IEEE) received the B.E. degree in electrical
Eng. Soc. Winter Meeting Conf., Jan. 2001, vol. 2, pp. 908–912. engineering from Korea University, Seoul, South Korea, in 2002, and the M.S.
and Ph.D. degrees in electrical and computer engineering from the Georgia
Institute of Technology, Atlanta, GA, USA, in 2009 and 2013, respectively. From
2002 to 2005, he was a Network and System Engineer. From 2014 to 2018, he
was a Senior Researcher with Smart Power Grid Research Center, Korea Elec-
trotechnology Research Institute, Uiwang, South Korea. Since 2018, he has been
an Associate Professor of electrical engineering with Korea University, Seoul.
His research interests include power distribution, microgrids, power system state
estimation, sub-synchronous oscillations, and computational intelligence.

Yeunggurl Yoon (Graduate Student Member, IEEE) received the B.S. degree in
electrical engineering in 2021 from Korea University, Seoul, South Korea, where
he is currently working toward the direct Ph.D. degree in electrical engineering.
His research interests include distribution system planning and AI applications
on power systems for control and prediction.

Authorized licensed use limited to: Korea University. Downloaded on December 09,2024 at 11:13:24 UTC from IEEE Xplore. Restrictions apply.

You might also like