
Volt-VAR Control in Power Distribution Systems with Deep Reinforcement Learning


Wei Wang, Nanpeng Yu, Jie Shi, and Yuanqi Gao
Department of Electrical and Computer Engineering
University of California, Riverside
Riverside, CA
Email: {wwang031, nyu, jshi005, ygao024}@ucr.edu

Abstract—Volt-VAR control (VVC) plays an important role in enhancing energy efficiency, power quality, and reliability of electric power distribution systems by coordinating the operations of equipment such as voltage regulators, on-load tap changers, and capacitor banks. VVC not only keeps voltages in the distribution system within desirable ranges but also reduces system operation costs, which include network losses and equipment depreciation from wear and tear. In this paper, the deep reinforcement learning approach is taken to learn a VVC policy, which minimizes the total operation costs while satisfying the physical operation constraints. The VVC problem is formulated as a constrained Markov decision process and solved by two policy gradient methods, trust region policy optimization and constrained policy optimization. Numerical study results based on IEEE 4-bus and 13-bus distribution test feeders show that the policy gradient methods are capable of learning near-optimal solutions and determining control actions much faster than the optimization-based approaches.

Index Terms—Reinforcement learning, Volt-VAR control, constrained Markov decision process, policy gradient methods.

I. INTRODUCTION

As the penetration level of distributed energy resources (DERs) continues to rise in power distribution systems, it is increasingly difficult to keep the voltages along the feeders within the desired range. The voltage profile has a strong impact on the electricity service quality for end users. Both over-voltage and under-voltage conditions could reduce energy efficiency, cause equipment malfunction, and damage customers' electrical appliances. Equipped with remote control and monitoring devices, electric utilities have started adopting Volt-VAR control (VVC) to maintain voltages within the allowable range, manage power factor, and reduce operation costs. These control objectives can be achieved by coordinating the operations of various equipment such as voltage regulators, on-load tap changers, switchable capacitor banks, and smart inverters.

Although successful field demonstrations of VVC have been reported by many electric utilities, there are still many barriers to the widespread adoption of the technology. One of the most significant barriers is the lack of robust distribution network topology and parameter information, which is required by optimization-based VVC approaches. In particular, inaccurate information about distribution secondary systems [1]–[3] makes it difficult for VVC to ensure that customers' voltages will stay within the acceptable range. To overcome the drawbacks of optimization-based approaches, we develop a data-driven deep reinforcement learning based approach to solve the VVC problem.

The existing algorithms for VVC can be divided into two categories: the optimization-based approach and the reinforcement learning based approach. The optimization-based approach to the VVC problem has been studied extensively. The VVC problem is formulated as a deterministic optimization problem with different extensions [4]–[7]. A voltage-dependent load model is introduced in [4]. A continuously controllable reactive power source is considered in [5]. The interaction between the Volt-VAR optimizer and prosumers is incorporated in a game-theoretic model [6]. Considering the uncertainties of DERs, the VVC problem is formulated as a robust optimization problem in [8], [9]. Both papers propose a two-stage coordination scheme for VVC, which consists of less-frequent control of on-load tap changers and more-frequent control of smart inverters. Model predictive control (MPC) based VVC is studied in [10], [11] to reduce real power losses and voltage fluctuation [10] and to preserve the life of controllable equipment by penalizing the number of tap changes [11].

In the optimization-based approach, the VVC problem is typically formulated as a mixed-integer conic programming (MICP) or mixed-integer nonlinear programming problem. The computational complexity of the solution algorithms for these NP-hard problems increases exponentially with the distribution network size and the number of controllable devices. Thus, the optimization-based approach does not scale well for real-time application of VVC.

The reinforcement learning approach is capable of making control decisions online based on off-line trained models. In particular, Q-learning based algorithms have been developed for the VVC problem [12]–[14]. The tabular Q-learning method is adopted to solve the VVC problem in [12]. A tabular Q-learning method is proposed to solve the optimal reactive power dispatch problem in [13], where the global reward is obtained with a consensus-based global information discovery algorithm. In [14], separate Q-values of on-load tap changers are approximated sequentially by radial kernel functions. So far, all reinforcement learning based algorithms developed for the VVC problem are action-value methods [15], [16]: they learn the values of actions and then select actions based on the estimated action values.
In this paper, we adopt a different reinforcement learning approach called policy gradient methods [17]–[20] to solve the VVC problem. Policy gradient methods directly learn a parameterized control policy that can select actions without using a value function. Policy gradient methods have two advantages over action-value methods. First, the VVC policy may be a simpler function to approximate than the action-value function. Second, continuous policy parameterization yields stronger convergence guarantees for policy gradient methods than the $\epsilon$-greedy action selection of action-value methods [21]. Compared to the optimization-based approaches, our proposed algorithm has better scalability and does not require an accurate and complete physical model of the distribution network.

The existing reinforcement learning based VVC works allow controllers to freely explore any control actions during learning. However, certain control actions will lead to severe voltage violations in the distribution feeder. To enable safe exploration for controllers, we adopt the constrained policy optimization [20] algorithm, which statistically guarantees that every control policy encountered during learning satisfies the operational constraints in expectation.

The remainder of the paper is organized as follows. Section II presents the formulations of the VVC problem as an optimization problem and as a constrained Markov decision process (CMDP) problem. Section III describes how to leverage policy gradient methods to solve the VVC problem. Section IV shows the numerical results, which demonstrate the performance of our proposed reinforcement learning based VVC algorithms. Section V concludes the paper.

II. PROBLEM FORMULATION

In this section, we first formulate the VVC problem as an optimization problem and then as a CMDP problem.

A. Volt-VAR Control Formulated as an Optimization Problem

The VVC algorithm aims at minimizing the total system losses and equipment operation costs while satisfying voltage constraints. In this formulation, we assume the voltage regulators, on-load tap changers, and capacitor banks are the primary control knobs. Then, the VVC problem can be formulated as an optimization problem as follows [22]:

$$\min \;\; C_p \sum_{t=1}^{T} P_{loss}^{t} + \sum_{t=1}^{T}\sum_{j=1}^{N_r} C_r \,|Tap_j^r(t) - Tap_j^r(t-1)| + \sum_{t=1}^{T}\sum_{j=1}^{N_l} C_l \,|Tap_j^l(t) - Tap_j^l(t-1)| + \sum_{t=1}^{T}\sum_{j=1}^{N_c} C_c \,|Tap_j^c(t) - Tap_j^c(t-1)| \quad (1)$$

s.t.

$$f_{PB}(PG^t, QG^t, PD^t, QD^t, TAP^t, u^t, l^t) = 0, \;\; \forall t \quad (2)$$

$$f_{OL}(PF^t, QF^t, TAP^t, u^t, l^t) = 0, \;\; \forall t \quad (3)$$

$$(PF_{ij}^t)^2 + (QF_{ij}^t)^2 = l_{ij}^t u_i^t, \;\; \forall i, j \in \mathcal{N}, \; (i,j) \in \mathcal{E}, \; t \quad (4)$$

$$\underline{u} \leq u_i^t \leq \bar{u}, \;\; \forall i \in \mathcal{N}, \; t \quad (5)$$

The objective function (1) minimizes the total operation costs, which include the costs associated with line losses and the switching costs of voltage regulators, on-load tap changers, and capacitor banks. The switching cost is assumed to be proportional to the absolute number of tap changes between consecutive hours. $P_{loss}^t$ denotes the total real line losses at hour $t$. $C_p$, $C_r$, $C_l$, and $C_c$ are the cost coefficients for the real power loss and for the tap changes of voltage regulators, on-load tap changers, and capacitor banks, respectively. $N_r$, $N_l$, and $N_c$ are the total numbers of voltage regulators, on-load tap changers, and capacitor banks. $Tap_j^r(t)$, $Tap_j^l(t)$, and $Tap_j^c(t)$ denote the tap positions of the $j$-th voltage regulator, on-load tap changer, and capacitor bank at hour $t$. $T$ is the operation horizon of the VVC algorithm.

The formulation of the constraints leverages the DistFlow equations [23]. The decision variables of the DistFlow formulation are the vector $u^t$ of $u_i^t$ for all the nodes $\mathcal{N}$, the vector $l^t$ of $l_{ij}^t$ for all the lines $\mathcal{E}$, and the vector $TAP^t$ of tap positions for all the devices. $u_i^t$ denotes the square of the voltage magnitude of node $i$ at hour $t$. $l_{ij}^t$ denotes the square of the current magnitude of the line connecting nodes $i$ and $j$ at hour $t$. The set of power balance constraints in the DistFlow formulation is represented by (2), where $PG^t$, $QG^t$, $PD^t$, and $QD^t$ denote the vectors of nodal real and reactive power generations and demands at hour $t$. The constraints corresponding to Ohm's law are represented by (3), where $PF^t$ and $QF^t$ denote the vectors of real and reactive power flows at hour $t$. Equality constraint (4) is the only nonlinear constraint in the DistFlow formulation, and it can be relaxed as a second-order cone [23]. $PF_{ij}^t$ and $QF_{ij}^t$ are the real and reactive power flows on the line connecting nodes $i$ and $j$ at hour $t$. $\mathcal{E}$ and $\mathcal{N}$ denote the sets of edges and nodes in the distribution feeder. Equation (5) represents the nodal voltage constraints, where $\underline{u}$ and $\bar{u}$ are the lower and upper limits for the square of the voltage magnitude. The detailed formulations of the operating constraints can be found in [22], where binary variables are introduced to represent the tap positions. The optimization problem shown above is an MICP problem.

Finally, to account for generation and load uncertainties, the VVC problem can be formulated as an MPC [10]. The optimization problem shown above can then be solved on a rolling basis using the updated load and generation forecasts.
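To make the cost structure in (1) concrete, the short Python sketch below evaluates the total operation cost for a given schedule of losses and tap positions. The function name, array shapes, and the single shared switching-cost coefficient are illustrative assumptions (the paper allows distinct $C_r$, $C_l$, and $C_c$ per device type), not part of the paper's implementation.

```python
import numpy as np

def operation_cost(p_loss, taps, c_p=40.0, c_switch=0.1):
    """Total operation cost in the spirit of (1): loss cost plus switching cost.

    p_loss   : array of shape (T,), real power losses per hour (MWh).
    taps     : array of shape (T+1, N_dev), tap positions of all devices,
               including the initial positions at index 0.
    c_p      : cost of losses ($/MWh).
    c_switch : cost per tap change; a scalar, or an array of length N_dev
               to mimic distinct C_r, C_l, C_c.
    """
    loss_cost = c_p * np.sum(p_loss)
    # |Tap_j(t) - Tap_j(t-1)| summed over hours, kept per device
    tap_changes = np.sum(np.abs(np.diff(taps, axis=0)), axis=0)
    switch_cost = np.sum(c_switch * tap_changes)
    return loss_cost + switch_cost

# Example: 24-hour horizon, 3 devices (regulator, OLTC, capacitor bank)
rng = np.random.default_rng(0)
losses = rng.uniform(0.01, 0.05, size=24)           # hypothetical MWh per hour
tap_schedule = rng.integers(0, 11, size=(25, 3))    # hypothetical tap schedule
print(operation_cost(losses, tap_schedule))
```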
B. Volt-VAR Control Formulated as a Constrained Markov Decision Process

In the Markov decision process (MDP), the grid operator or controller is represented by an agent. This agent and the distribution grid interact at each of a sequence of discrete time steps $t = 0, 1, 2, \ldots$. At each time step $t$, the agent receives the system's state $s_t \in \mathcal{S}$ and selects a control action $a_t \in \mathcal{A}(s)$. One time step later, the agent receives a numerical reward $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$ and finds itself in a new state $s_{t+1}$. The probability of receiving a reward and observing a new state depends only on the preceding state and control action, i.e., $P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_0, a_0, \ldots, s_t, a_t)$.

In the context of VVC, the state is defined as $s = [P, Q, T, t]$, where $P$, $Q$, $T$, and $t$ denote the nodal real and reactive power injections, the current tap positions, and the time step. The action taken by a VVC agent is changing the tap positions of the controllable devices to $T'$. The size of the action space is $\prod_{i=1}^{N_s} n_i$, where $N_s = N_r + N_l + N_c$ is the number of controllable devices and $n_i$ denotes the number of tap positions of device $i$. The reward $R(s_t, a_t, s_{t+1})$ received by the controller for taking action $a_t$ at state $s_t$ and reaching state $s_{t+1}$ is defined as the negative of the system operational costs, which include the costs associated with real power losses and equipment operations:

$$R(s_t, a_t, s_{t+1}) = -\Big[ C_p P_{loss}^{t} + C_r \sum_{j=1}^{N_r} |Tap_j^r(t+1) - Tap_j^r(t)| + C_l \sum_{j=1}^{N_l} |Tap_j^l(t+1) - Tap_j^l(t)| + C_c \sum_{j=1}^{N_c} |Tap_j^c(t+1) - Tap_j^c(t)| \Big] \quad (6)$$

The goal of the agent is to find a control policy $\pi$ that maximizes the expected discounted return defined as:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi} [G(\tau)] \quad (7)$$

where the control policy $\pi$ is a mapping from the state space $\mathcal{S}$ to the action space $\mathcal{A}$ for a deterministic policy, and a mapping from states to probabilities of selecting each possible action for a stochastic policy. $\tau$ is a trajectory, i.e., a sequence of states and actions $\{s_0, a_0, s_1, a_1, \ldots, s_{T-1}, a_{T-1}, s_T\}$. $G(\tau)$ is the discounted return along a trajectory, $G(\tau) = \sum_{t=0}^{T} \gamma^t R(s_t, a_t, s_{t+1})$, where $\gamma \in (0, 1)$ is the discount factor.

Two important functions for policy $\pi$, the action-value function and the state-value function, are defined as follows [21]:

$$Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi} [G(\tau) \mid s_0 = s, a_0 = a] \quad (8)$$

$$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi} [G(\tau) \mid s_0 = s] \quad (9)$$

The action-value function $Q^{\pi}(s, a)$ represents the expected return starting from state $s$, taking action $a$, and following $\pi$ thereafter. The state-value function $V^{\pi}(s)$ represents the expected return starting from state $s$ and thereafter following policy $\pi$.

To enforce the voltage constraints, we augment the MDP with a set of cost functions $R_C(s_t, a_t, s_{t+1})$. For the VVC problem, the cost is defined as the number of voltage violations across all nodes, i.e.,

$$R_C(s_t, a_t, s_{t+1}) = \sum_{i=1}^{N} \big[ \mathbb{1}(|v_i^{t+1}| > \bar{v}) + \mathbb{1}(|v_i^{t+1}| < \underline{v}) \big] \quad (10)$$

where $\mathbb{1}(\cdot)$ is the indicator function; $v_i^{t+1}$ is the voltage of node $i$ at hour $t+1$; and $\bar{v}$ and $\underline{v}$ are the upper and lower limits for voltage magnitudes. Additional operating constraints, such as line flow limits, could be incorporated in a similar manner. Now the expected discounted return of policy $\pi$ with respect to the cost function can be defined as

$$J_C(\pi) = \mathbb{E}_{\tau \sim \pi} \Big[ \sum_{t=0}^{T} \gamma^t R_C(s_t, a_t, s_{t+1}) \Big] \quad (11)$$

The final CMDP formulation of the VVC problem is:

$$\max_{\pi} \; J(\pi) \quad (12)$$

$$\text{s.t.} \;\; J_C(\pi) \leq \bar{J} \quad (13)$$

where $\bar{J}$ is the limit for the expected discounted return of the cost function associated with the voltage constraints.
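As a concrete illustration of the reward (6) and the constraint cost (10), the sketch below computes both quantities from one simulated transition. The function names and the voltage limits (0.95/1.05 per unit) are assumptions made for illustration; they are not the paper's exact implementation.

```python
import numpy as np

V_MIN, V_MAX = 0.95, 1.05   # assumed per-unit voltage limits

def reward(p_loss, taps_prev, taps_next, c_p=40.0, c_switch=0.1):
    """Reward (6): negative of the loss cost plus tap-switching cost for one hour."""
    switching = np.sum(np.abs(np.asarray(taps_next) - np.asarray(taps_prev)))
    return -(c_p * p_loss + c_switch * switching)

def constraint_cost(voltages):
    """Constraint cost (10): number of nodes whose voltage leaves the band."""
    v = np.abs(np.asarray(voltages))
    return int(np.sum((v > V_MAX) | (v < V_MIN)))

# One hypothetical transition of the environment
r = reward(p_loss=0.03, taps_prev=[5, 5, 0], taps_next=[6, 5, 1])
c = constraint_cost([1.02, 0.94, 1.06, 1.00])
print(r, c)   # reward for the hour, and two voltage violations
```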
III. TECHNICAL METHODS

So far, all reinforcement learning algorithms adopted to solve the VVC problem have been action-value methods, which approximate the action-value functions through learning and then select actions based on the estimated action-value functions. In this paper, we consider policy gradient methods, which learn a parameterized control policy that directly selects actions without consulting a value function [21]. Typically, an approximate policy is parameterized according to the soft-max in action preferences, which makes approaching a deterministic policy easier and finding a stochastic policy feasible [21]. Neither of these goals can be achieved by the $\epsilon$-greedy action selection of the action-value methods. Another notable advantage of policy gradient methods over action-value methods is that the control policy function may be easier to approximate than the action-value function in many applications, including the VVC problem.

In this section, we first introduce the preliminaries of policy gradient methods. Then two state-of-the-art policy gradient methods based on trust region algorithms [18], [20] are adopted to solve the VVC problem. Finally, the design of the neural networks that approximate the policy and value functions in the two algorithms is discussed.

A. Preliminaries of Policy Gradient Methods

Policy gradient methods learn a parameterized control policy $\pi_\theta$ that maximizes the performance measure $\hat{J}(\pi_\theta)$ by updating the parameter $\theta$ iteratively as follows:

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta \hat{J}(\theta_k) \quad (14)$$

According to the policy gradient theorem [21], the gradient can be derived as

$$\nabla_\theta \hat{J}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \Big[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Psi_t \Big] \quad (15)$$

where $\Psi_t$ may take various forms, including the action-value function $Q^{\pi_\theta}(s, a)$ and the advantage function $A^{\pi_\theta}(s, a)$.

The advantage function, which quantifies the improvement obtained by taking action $a$ in state $s$ compared to selecting an action at random according to policy $\pi_\theta$ and following $\pi_\theta$ afterwards, is defined as

$$A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s) \quad (16)$$

Two policy gradient methods that use the advantage function, trust region policy optimization (TRPO) and constrained policy optimization (CPO), are presented in the following subsections. We discuss how to adopt them to solve the VVC problem formulated as an MDP and as a CMDP, respectively. The implementation details of these two algorithms can be found in [18], [20].
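The gradient estimator in (15) can be sketched in a few lines: given sampled transitions, it weights the log-probability gradients of the chosen actions by the corresponding advantage estimates. The PyTorch snippet below is a generic illustration under assumed tensor shapes, not the exact update used by TRPO or CPO (which additionally impose the trust-region constraint (19)).

```python
import torch

def policy_gradient_loss(logits, actions, advantages):
    """Surrogate loss whose gradient matches (15) with Psi_t set to the advantage.

    logits     : (batch, n_actions) raw policy network outputs.
    actions    : (batch,) indices of the actions taken in the batch.
    advantages : (batch,) advantage estimates A(s_t, a_t).
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Minimizing the negative surrogate performs the ascent step (14)-(15).
    return -(chosen * advantages).mean()

# Toy example with random data (242 = 11 x 11 x 2 tap combinations)
logits = torch.randn(8, 242, requires_grad=True)
actions = torch.randint(0, 242, (8,))
advantages = torch.randn(8)
loss = policy_gradient_loss(logits, actions, advantages)
loss.backward()   # gradients of the surrogate w.r.t. the logits
```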

B. Trust Region Policy Optimization

The TRPO algorithm, originally proposed in [18], provides a theoretical guarantee of monotonic improvement of the control policy at each policy iteration step. The design of the policy iteration procedure is based on the following lower bound [20] on the performance improvement of policy $\pi_{\theta'}$ over policy $\pi_\theta$:

$$J(\pi_{\theta'}) - J(\pi_\theta) \geq \frac{1}{1-\gamma} \mathbb{E}_{s \sim \eta^{\pi_\theta},\, a \sim \pi_{\theta'}} \Big[ A^{\pi_\theta}(s, a) - \frac{\gamma \xi^{\pi_{\theta'}}}{1-\gamma} \sqrt{2\, KL(\pi_{\theta'} \| \pi_\theta)[s]} \Big] \quad (17)$$

where $\xi^{\pi_{\theta'}} = \max_s |\mathbb{E}_{a \sim \pi_{\theta'}}[A^{\pi_\theta}(s, a)]|$, $KL(\pi_{\theta'} \| \pi_\theta)[s]$ is the KL-divergence between policies $\pi_{\theta'}$ and $\pi_\theta$ at state $s$, and $\eta^{\pi_\theta}$ is the discounted future state distribution, $\eta^{\pi_\theta}(s) = (1-\gamma) \sum_{t=0}^{T} \gamma^t P(s_t = s \mid \pi_\theta)$. $P(s_t = s \mid \pi_\theta)$ denotes the probability of state $s$ appearing at time $t$ under policy $\pi_\theta$.

Thus, we can update the policy parameters iteratively by maximizing the expected advantage within a small step size $\delta$:

$$\pi_{\theta_{k+1}} = \arg\max_{\pi_\theta} \; \mathbb{E}_{s \sim \eta^{\pi_{\theta_k}},\, a \sim \pi_\theta} [A^{\pi_{\theta_k}}(s, a)] \quad (18)$$

$$\text{s.t.} \;\; \mathbb{E}_{s \sim \eta^{\pi_{\theta_k}}} \big[ KL(\pi_\theta, \pi_{\theta_k})[s] \big] \leq \delta \quad (19)$$

If $\pi_{\theta_k}$ is a feasible solution, the maximum expected advantage is non-negative. With a small enough $\delta$, monotonic policy improvement is guaranteed according to (17). The optimization problem (18)–(19) can be solved by linearizing the objective function and quadratically approximating the KL-divergence around $\theta_k$.

The final iterative TRPO algorithm to solve the VVC problem is shown in Algorithm 1.

Algorithm 1 TRPO for VVC
1: Initialize the parameters of the policy and value function, $\theta_0$, $\phi_0$
2: for k = 0, 1, 2, ... do
3:   Generate sample trajectories $Tr_k = \{\tau\}$ with $\pi_{\theta_k}$ through power flow simulations
4:   Calculate the discounted return for the objective, $\hat{G}_t$, after each time step $t$ along the trajectories
5:   Estimate the advantage for the objective, $\hat{A}_t$, based on the value function $V_{\phi_k}$
6:   Obtain $\pi_{\theta_{k+1}}^{*}$ by solving (18) and (19)
7:   Update the parameters $\phi_k$ of the value function neural network with $\hat{G}_t$ as labels
8: end for

To adopt the TRPO algorithm for the VVC problem, the reward function is augmented with a penalty term associated with the voltage violations:

$$R'(s_t, a_t, s_{t+1}) = R(s_t, a_t, s_{t+1}) - C_V R_C(s_t, a_t, s_{t+1}) \quad (20)$$

where $C_V$ is the penalty factor for voltage violations.
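The trust-region constraint (19) bounds the average KL-divergence between the old and new policies over visited states. The PyTorch sketch below estimates that quantity from sampled states for a discrete tap-selection policy, using the KL(pi_old || pi_new) form typically adopted in TRPO implementations; the batch size, action-space size, and threshold value are illustrative assumptions.

```python
import torch

def mean_kl(old_logits, new_logits):
    """Average KL(pi_old || pi_new) over a batch of sampled states, as in (19)."""
    old_logp = torch.log_softmax(old_logits, dim=-1)
    new_logp = torch.log_softmax(new_logits, dim=-1)
    kl_per_state = (old_logp.exp() * (old_logp - new_logp)).sum(dim=-1)
    return kl_per_state.mean()

# Accept a candidate policy update only if it stays inside the trust region.
delta = 0.01                                   # assumed step-size bound
old_logits = torch.randn(64, 242)              # logits of pi_theta_k on 64 states
new_logits = old_logits + 0.05 * torch.randn(64, 242)
print(mean_kl(old_logits, new_logits) <= delta)
```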
C. Constrained Policy Optimization

To directly solve the VVC problem formulated as a CMDP, the CPO algorithm, which guarantees approximate constraint satisfaction, can be leveraged [20]. The theoretical guarantee of constraint satisfaction can be shown with the following upper bound [20] on the change in the constraint return of policy $\pi_{\theta'}$ compared to policy $\pi_\theta$:

$$J_C(\pi_{\theta'}) - J_C(\pi_\theta) \leq \frac{1}{1-\gamma} \mathbb{E}_{s \sim \eta^{\pi_\theta},\, a \sim \pi_{\theta'}} \Big[ A_C^{\pi_\theta}(s, a) + \frac{\gamma \xi_C^{\pi_{\theta'}}}{1-\gamma} \sqrt{2\, KL(\pi_{\theta'} \| \pi_\theta)[s]} \Big] \quad (21)$$

where $\xi_C^{\pi_{\theta'}} = \max_s |\mathbb{E}_{a \sim \pi_{\theta'}}[A_C^{\pi_\theta}(s, a)]|$ and $A_C^{\pi_\theta}(s, a)$ is the corresponding advantage function for the constraint. According to (21), the constraint at each updating step is specified as:

$$J_C(\pi_{\theta_k}) + \frac{1}{1-\gamma} \mathbb{E}_{s \sim \eta^{\pi_{\theta_k}},\, a \sim \pi_\theta} [A_C^{\pi_{\theta_k}}(s, a)] \leq \bar{J} \quad (22)$$

The policy update for the CMDP can be found by solving (18), (19), and (22). Therefore, with a small enough $\delta$, constraint satisfaction is almost guaranteed at step $k+1$ if we start from a feasible solution $\pi_{\theta_k}$, according to (21). The worst-case constraint violation at step $k+1$ is:

$$\bar{J} - J_C(\pi_{\theta_{k+1}}) \leq \frac{2\delta \gamma \xi^{\pi_{\theta_{k+1}}}}{(1-\gamma)^2} \quad (23)$$

Similarly, to solve the optimization problem, (22) should be linearized around $\theta_k$. At the beginning of the training process, a feasible solution can be recovered by solving the following problem subject to (19):

$$\min_{\pi_\theta} \; \mathbb{E}_{s \sim \eta^{\pi_{\theta_k}},\, a \sim \pi_\theta} [A_C^{\pi_{\theta_k}}(s, a)] \quad (24)$$

The final CPO algorithm to solve the VVC problem is shown in Algorithm 2.

Algorithm 2 CPO for VVC
1: Initialize the parameters of the policy and value functions, $\theta_0$, $\phi_0^1$, and $\phi_0^2$
2: for k = 0, 1, 2, ... do
3:   Generate sample trajectories $Tr_k = \{\tau\}$ with $\pi_{\theta_k}$ through power flow simulations
4:   Calculate the discounted returns $\hat{G}_t^1$, $\hat{G}_t^2$ for the objective function and the constraint after each time step $t$ along the trajectories
5:   Estimate the advantages for the objective, $\hat{A}_t^1$, and the constraint, $\hat{A}_t^2$, based on the value functions $V_{\phi_k^1}$ and $V_{\phi_k^2}$
6:   if the problem (18), (19), and (22) is feasible then
7:     Obtain the optimal solution $\pi_{\theta_{k+1}}^{*}$
8:   else
9:     Obtain the solution $\pi_{\theta_{k+1}}^{*}$ by solving (19) and (24)
10:  end if
11:  Update the parameters $\phi_k^1$ and $\phi_k^2$ of the value function neural networks with $\hat{G}_t^1$ and $\hat{G}_t^2$ as labels
12: end for
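The branch in steps 6–9 of Algorithm 2 can be paraphrased as a feasibility test on the linearized constraint (22): if even the most constraint-reducing step inside the trust region cannot keep the predicted constraint return below the limit, the algorithm falls back to the recovery update (24). The sketch below expresses that decision rule with scalar surrogates; the variable names and values are illustrative assumptions rather than the paper's exact implementation.

```python
def choose_update(jc_current, jc_limit, min_constraint_term):
    """Decide between the CPO objective step and the recovery step (24).

    jc_current          : estimate of J_C(pi_theta_k) for the current policy.
    jc_limit            : the limit J_bar in (13)/(22).
    min_constraint_term : smallest value of (1/(1-gamma)) * E[A_C] attainable
                          within the trust region (19); negative means the
                          constraint return can still be pushed down.
    """
    # (22) requires J_C(pi_k) + (1/(1-gamma)) * E[A_C] <= J_bar.
    if jc_current + min_constraint_term <= jc_limit:
        return "solve (18), (19), (22)"   # feasible: improve the reward objective
    return "solve (19), (24)"             # infeasible: recovery step reduces violations

print(choose_update(jc_current=0.8, jc_limit=0.5, min_constraint_term=-0.4))
```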
D. Value and Policy Networks

Both the objective function (18) and the expectation of the advantage function associated with the constraint in (22) can be calculated with only the state-value function and the policy function as follows:

$$\mathbb{E}_{s \sim \eta^{\pi_{\theta_k}},\, a \sim \pi_\theta} [A^{\pi_{\theta_k}}(s, a)] = \mathbb{E}_{s \sim \eta^{\pi_{\theta_k}},\, a \sim \pi_{\theta_k}} \Big[ \frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s, a) \Big] = \mathbb{E}_{s \sim \eta^{\pi_{\theta_k}},\, a \sim \pi_{\theta_k}} \Big[ \frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)} \big( R(s, a, s') + \gamma V^{\pi_{\theta_k}}(s') - V^{\pi_{\theta_k}}(s) \big) \Big] \quad (25)$$

Therefore, we only need to design neural networks to approximate the state-value function and the policy function. The state-value function $V_\phi$ corresponding to the augmented reward in Algorithm 1 is parameterized with $\phi$. The state-value functions corresponding to the reward, $V_{\phi^1}$, and the constraint, $V_{\phi^2}$, in Algorithm 2 are parameterized with $\phi^1$ and $\phi^2$. The inputs of all the value networks are the states, and the output is the expected discounted return.

The policy function $\pi_\theta$ is approximated by a neural network with parameter $\theta$. The structure of the policy network is shown in Fig. 1. The inputs are the states and the outputs are the probabilities of selecting the various actions, which represent the switch status of the devices. The size of the output layer is $\sum_{i=1}^{N_s} n_i$, where $N_s$ and $n_i$ are the number of devices and the number of tap positions of device $i$. The probability distribution $P_i$ of the actions for device $i$ is obtained from the subset of the output neurons with size $n_i$. A softmax activation function is applied to each subset of the output neurons corresponding to a device. The final probability distribution over the tap combinations across all devices is calculated as $P = \prod_{i=1}^{N_s} P_i$. Thus, in our proposed methods the network size only increases linearly with $N_s$.

Fig. 1. Structure of the policy network (input layer and hidden layers feeding one softmax output group per device, Device 1 through Device n).
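A possible realization of the factorized policy network described above is sketched below in PyTorch: a shared trunk produces one logit group per device, each group is passed through its own softmax, and the joint action probability is the product of the per-device probabilities. The 64/32 hidden-layer sizes and tanh activations follow Section IV-C, while the class and method names, state dimension, and example values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DeviceFactorizedPolicy(nn.Module):
    """Policy network with one softmax head per controllable device."""

    def __init__(self, state_dim, taps_per_device=(11, 11, 2)):
        super().__init__()
        self.taps_per_device = taps_per_device
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 32), nn.Tanh(),
            nn.Linear(32, sum(taps_per_device)),   # one logit per tap position
        )

    def forward(self, state):
        logits = self.trunk(state)
        # Split the output layer into per-device groups and apply softmax to each.
        groups = torch.split(logits, list(self.taps_per_device), dim=-1)
        return [torch.softmax(g, dim=-1) for g in groups]

    def joint_log_prob(self, state, action):
        """log P(a|s) of a joint action = sum of per-device log-probabilities."""
        probs = self.forward(state)
        return sum(torch.log(p.gather(-1, a.unsqueeze(-1)).squeeze(-1))
                   for p, a in zip(probs, action.unbind(-1)))

# Example: batch of 4 states, 3 devices with 11, 11, and 2 tap positions.
policy = DeviceFactorizedPolicy(state_dim=30)
states = torch.randn(4, 30)
actions = torch.stack([torch.randint(0, n, (4,)) for n in (11, 11, 2)], dim=-1)
print(policy.joint_log_prob(states, actions).shape)   # torch.Size([4])
```

With this factorization, the output layer grows with the sum of tap positions rather than their product, which is the linear scaling in the number of devices claimed in the text.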
IV. NUMERICAL STUDY

A. Simulation Setup

The numerical studies are conducted on the IEEE 4-bus and 13-bus distribution test feeders [24]. Real-world smart meter data from an electric utility is used as the nodal load in the simulation environment to generate power flow solutions. The length of the historical data is about six months. One week of data during the summer peak is used for the out-of-sample test and the rest is used for training. The length of the VVC optimization horizon, or an episode in reinforcement learning, is one week. The load time series data is scaled and allocated to each node according to the load profile of the standard test case. Each test feeder has three switching devices: a voltage regulator, an on-load tap changer, and a capacitor bank. Both the voltage regulator and the on-load tap changer have 11 tap positions with turns ratios between 0.95 and 1.05. The capacitor bank can be switched on and off remotely, so its number of 'tap positions' is treated as 2. The size of the action space for each test case is 11 × 11 × 2 = 242. In the 4-bus test feeder, the capacitor bank is placed at node 4. In the 13-bus test feeder, the capacitor bank is placed at node 675. The nominal capacity of the capacitor banks is 200 kW. Initially, the turns ratios of the voltage regulators and on-load tap changers are 1, and the capacitor banks are switched off. The electricity price $C_p$ is assumed to be $40/MWh. The switching costs of the devices, $C_r$, $C_l$, and $C_c$, are set at $0.1 per tap change.

B. Benchmarking Algorithms

The MPC-based optimization algorithm is chosen as the first benchmark. The control horizon is 24 hours. The ARIMA model [25] is used to forecast the load over the control horizon. The MICP problem formulated in Section II-A is solved on a rolling basis at each step of the MPC. MOSEK and GUROBI are used to solve the MICP problem. The second benchmark is set up by replacing the load forecast with the actual load data in the MPC framework. The last benchmark represents the baseline where all switching devices are kept at their initial positions.

C. Policy Gradient Methods

In the TRPO and CPO algorithms, both the value and policy neural networks have two hidden layers with 64 and 32 neurons, respectively. The tanh activation function is used in all the hidden layers. The linear and softmax activation functions are used for the output layers of the state-value and policy networks, respectively. In the TRPO algorithm, the reward function is augmented by a penalty cost for voltage constraint violations. The penalty coefficient $C_V$ is $1 per voltage violation per node. The terminal state is chosen as the last hour of a week for both algorithms.

D. Performance Comparison

The control performances of the CPO, TRPO, and MPC-based approaches are evaluated in this subsection. Both the CPO algorithm and the TRPO algorithm are trained for 500 iterations. Each training iteration consists of 298 episodic trajectories, which correspond to about 50,000 samples. Over the training episodes, we record the average discounted return (ADR), which includes the costs associated with the line losses, the tap changes, and the penalty for voltage violations. As shown in Fig. 2, the CPO algorithm starts to outperform the TRPO algorithm after about 200 training iterations for the 4-bus test case. For the 13-bus test case, the CPO algorithm always outperforms the TRPO algorithm. At the end of the training process, the improvements of the episodic returns for both algorithms become saturated.

Fig. 2. Training performance of the reinforcement learning algorithms (ADR in $ versus iteration number for the 4-bus and 13-bus test cases; CPO and TRPO curves).

The total operation cost (OC), the number of tap changes (# of TC), the number of voltage violations (# of VV), and the accumulated per unit voltage violation (AVV) over the test week are recorded in Table I for all the reinforcement learning algorithms and the benchmark algorithms. The operation cost includes the costs associated with the line losses and the tap changes. The accumulated per unit voltage violation is calculated as $\sum_{i=1}^{N} \sum_{t} \big[ \max(0, |v_i^t| - \bar{v}) + \max(0, \underline{v} - |v_i^t|) \big]$.
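For clarity, the AVV metric above can be computed directly from the simulated voltage trajectory, as in the short sketch below; the voltage limits and array shapes are assumptions for illustration.

```python
import numpy as np

def accumulated_voltage_violation(v, v_min=0.95, v_max=1.05):
    """AVV: sum over nodes and hours of the per-unit excursion outside the band.

    v : array of shape (n_hours, n_nodes) with per-unit voltage magnitudes.
    """
    v = np.abs(np.asarray(v))
    return float(np.sum(np.maximum(0.0, v - v_max) + np.maximum(0.0, v_min - v)))

# Example: two hours, three nodes, one small over-voltage excursion.
print(accumulated_voltage_violation([[1.00, 1.06, 0.97],
                                     [1.01, 1.02, 0.99]]))   # approx. 0.01
```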
TABLE I
PERFORMANCE COMPARISON OF VOLT-VAR CONTROL ALGORITHMS

4-bus test case:
  Algorithm        OC ($)   # of TC   # of VV   AVV (per unit)
  Baseline         150.13   0         91        2.748
  MPC (Actual)     111.44   18        0         0
  MPC (Forecast)   111.89   20        0         0
  CPO              115.01   9         5         0.044
  TRPO             120.05   3         16        0.286

13-bus test case:
  Algorithm        OC ($)   # of TC   # of VV   AVV (per unit)
  Baseline         77.88    0         268       2.673
  MPC (Actual)     58.05    6         0         0
  MPC (Forecast)   58.44    6         0         0
  CPO              58.92    6         0         0
  TRPO             61.29    3         2         0.004

Fig. 3. Comparison of voltage profiles on the 4-bus test feeder (per unit voltage at node 3 and node 4 versus hour for CPO, TRPO, and forecast-based MPC).

The MPC with the actual load represents the global optimal solution. As shown in Table I, the CPO algorithm is capable of achieving a near-optimal operational cost and is nearly constraint-satisfying. The CPO algorithm yields a lower operation cost compared to the TRPO algorithm. The per unit voltages at nodes 3 and 4 of the 4-bus test feeder are depicted in Fig. 3. It can be seen that the voltage solutions at node 3 of the MPC-based approach with forecasted load hit the upper bound a few times. This is common for optimization approaches, as the optimal solutions are likely to be boundary points. By following the CPO algorithm, the voltage profiles at node 4 nearly stay in bounds all the time, except for 5 minor violations. The CPO algorithm outperforms the TRPO algorithm by approximately satisfying the voltage constraints at all times.

The average and the maximum computation times of the MPC-based algorithms with different solvers and of the policy gradient methods to determine the tap positions at each hour are provided in Table II.
TABLE II
COMPUTATION TIME OF VOLT-VAR CONTROL ALGORITHMS

4-bus test case:
  Algorithm       Average Time (s)   Maximum Time (s)
  MPC (GUROBI)    10.43              90.28
  MPC (MOSEK)     346.80             3904.22
  TRPO/CPO        < 10^-3            < 10^-3

13-bus test case:
  Algorithm       Average Time (s)   Maximum Time (s)
  MPC (GUROBI)    4.69               8.57
  MPC (MOSEK)     53.83              328.98
  TRPO/CPO        < 10^-3            < 10^-3

Without parallel computing, the computation time of the MPC-based algorithm with MOSEK could exceed 1 hour in the worst case on an entry-level DELL desktop. On the other hand, once trained, the policy gradient methods have a much faster execution speed, which makes them suitable for online applications. Moreover, the MPC-based algorithms require an accurate and complete topology model and parameters of the distribution network, which are not often available.

V. CONCLUSION

In this paper, the Volt-VAR control problem is modeled as a CMDP and solved with policy gradient methods for the first time. The constrained policy optimization algorithm is adopted to enable safe exploration for the controller. Both the policy and state-value functions are approximated by neural networks. The structure of the policy network is tailored to achieve better scalability for the Volt-VAR control problem. The performance of the policy gradient methods and the benchmarking algorithms is validated with the IEEE 4-bus and 13-bus test feeders. The results show that the constrained policy optimization algorithm can achieve near-optimal solutions with negligible voltage violations. Compared to the conventional optimization-based approach, the proposed reinforcement learning algorithm is better suited for online VVC tasks where accurate and complete distribution network models are not available.

REFERENCES

[1] W. Wang, N. Yu, B. Foggo, J. Davis, and J. Li, "Phase identification in electric power distribution systems by clustering of smart meter data," in 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), 2016, pp. 259–265.
[2] B. Foggo and N. Yu, "A comprehensive evaluation of supervised machine learning for the phase identification problem," World Acad. Sci. Eng. Technol. Int. J. Comput. Syst. Eng., vol. 12, no. 6, 2018.
[3] W. Wang, N. Yu, and Z. Lu, "Advanced metering infrastructure data driven phase identification in smart grid," GREEN 2017 Forward, pp. 16–23, 2017.
[4] H. Ahmadi, J. R. Martí, and H. W. Dommel, "A framework for Volt-VAR optimization in distribution systems," IEEE Transactions on Smart Grid, vol. 6, no. 3, pp. 1473–1483, May 2015.
[5] P. Li, H. Ji, C. Wang, J. Zhao, G. Song, F. Ding, and J. Wu, "Coordinated control method of voltage and reactive power for active distribution networks based on soft open point," IEEE Transactions on Sustainable Energy, vol. 8, no. 4, pp. 1430–1442, Oct. 2017.
[6] M. H. K. Tushar and C. Assi, "Volt-VAR control through joint optimization of capacitor bank switching, renewable energy, and home appliances," IEEE Transactions on Smart Grid, vol. 9, no. 5, pp. 4077–4086, Sept. 2018.
[7] M. B. Liu, C. A. Canizares, and W. Huang, "Reactive power and voltage control in distribution systems with limited switching operations," IEEE Transactions on Power Systems, vol. 24, no. 2, pp. 889–899, May 2009.
[8] Y. Xu, Z. Y. Dong, R. Zhang, and D. J. Hill, "Multi-timescale coordinated voltage/VAR control of high renewable-penetrated distribution systems," IEEE Transactions on Power Systems, vol. 32, no. 6, pp. 4398–4408, Nov. 2017.
[9] W. Zheng, W. Wu, B. Zhang, and Y. Wang, "Robust reactive power optimisation and voltage control method for active distribution networks via dual time-scale coordination," IET Generation, Transmission & Distribution, vol. 11, no. 6, pp. 1461–1471, May 2017.
[10] Z. Wang, J. Wang, B. Chen, M. M. Begovic, and Y. He, "MPC-based voltage/VAR optimization for distribution circuits with distributed generators and exponential load models," IEEE Transactions on Smart Grid, vol. 5, no. 5, pp. 2412–2420, Sept. 2014.
[11] M. Falahi, K. Butler-Purry, and M. Ehsani, "Dynamic reactive power control of islanded microgrids," IEEE Transactions on Power Systems, vol. 28, no. 4, pp. 3649–3657, Nov. 2013.
[12] J. G. Vlachogiannis and N. D. Hatziargyriou, "Reinforcement learning for reactive power control," IEEE Transactions on Power Systems, vol. 19, no. 3, pp. 1317–1325, Aug. 2004.
[13] Y. Xu, W. Zhang, W. Liu, and F. Ferrese, "Multiagent-based reinforcement learning for optimal reactive power dispatch," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 6, pp. 1742–1751, Nov. 2012.
[14] H. Xu, A. D. Domínguez-García, and P. W. Sauer, "Optimal tap setting of voltage regulation transformers using batch reinforcement learning," arXiv, July 2018. [Online]. Available: https://arxiv.org/abs/1807.10997
[15] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529–533, Feb. 2015.
[16] H. v. Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in AAAI, Feb. 2016, pp. 2094–2100.
[17] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv, Sept. 2015. [Online]. Available: https://arxiv.org/abs/1509.02971
[18] J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel, "Trust region policy optimization," in ICML, vol. 37, 2015, pp. 1889–1897.
[19] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv, July 2017. [Online]. Available: https://arxiv.org/abs/1707.06347
[20] J. Achiam, D. Held, A. Tamar, and P. Abbeel, "Constrained policy optimization," in ICML, vol. 70, Aug. 2017, pp. 22–31.
[21] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed. Cambridge, MA, USA: MIT Press, 1998.
[22] F. U. Nazir, B. C. Pal, and R. A. Jabr, "A two-stage chance constrained Volt/VAR control scheme for active distribution networks with nodal power uncertainties," IEEE Transactions on Power Systems, vol. 34, no. 1, pp. 314–325, Jan. 2019.
[23] M. E. Baran and F. F. Wu, "Network reconfiguration in distribution systems for loss reduction and load balancing," IEEE Transactions on Power Delivery, vol. 4, no. 2, pp. 1401–1407, Apr. 1989.
[24] W. H. Kersting, "Radial distribution test feeders," in IEEE Power Engineering Society Winter Meeting, vol. 2, Jan. 2001, pp. 908–912.
[25] J. W. Taylor and P. E. McSharry, "Short-term load forecasting methods: An evaluation based on European data," IEEE Transactions on Power Systems, vol. 22, no. 4, pp. 2213–2219, Nov. 2007.
