Abstract—Volt-VAR control (VVC) plays an important role in enhancing energy efficiency, power quality, and reliability of electric power distribution systems by coordinating the operations of equipment such as voltage regulators, on-load tap changers, and capacitor banks. VVC not only keeps voltages in the distribution system within desirable ranges but also reduces system operation costs, which include network losses and equipment depreciation from wear and tear. In this paper, a deep reinforcement learning approach is taken to learn a VVC policy that minimizes the total operation costs while satisfying the physical operation constraints. The VVC problem is formulated as a constrained Markov decision process and solved by two policy gradient methods, trust region policy optimization and constrained policy optimization. Numerical study results based on the IEEE 4-bus and 13-bus distribution test feeders show that the policy gradient methods are capable of learning near-optimal solutions and determining control actions much faster than the optimization-based approaches.

Index Terms—Reinforcement learning, Volt-VAR control, constrained Markov decision process, policy gradient methods.

I. INTRODUCTION

As the penetration level of distributed energy resources (DERs) continues to rise in power distribution systems, it is increasingly difficult to keep the voltages along the feeders within the desired range. The voltage profile strongly affects the electricity service quality for end users. Both over-voltage and under-voltage conditions could reduce energy efficiency, cause equipment malfunction, and damage customers' electrical appliances. Equipped with remote control and monitoring devices, electric utilities have started adopting Volt-VAR control (VVC) to maintain voltages within the allowable range, manage power factor, and reduce operation costs. These control objectives can be achieved by coordinating the operations of various equipment such as voltage regulators, on-load tap changers, switchable capacitor banks, and smart inverters.

Although successful field demonstrations of VVC have been reported by many electric utilities, there are still many barriers to the widespread adoption of the technology. One of the most significant barriers is the lack of accurate distribution network topology and parameter information, which is required by optimization-based VVC approaches. In particular, inaccurate information about distribution secondary systems [1]–[3] makes it difficult for VVC to ensure that customers' voltages stay within the acceptable range. To overcome the drawbacks of optimization-based approaches, we develop a data-driven deep reinforcement learning based approach to solve the VVC problem.

The existing algorithms for VVC can be divided into two categories: the optimization-based approach and the reinforcement learning based approach. The optimization-based approach to the VVC problem has been well researched. The VVC problem is formulated as a deterministic optimization problem with different extensions [4]–[7]. A voltage-dependent load model is introduced in [4]. A continuously controllable reactive power source is considered in [5]. The interaction between the Volt-VAR optimizer and prosumers is captured with a game theory model in [6]. Considering the uncertainties of DERs, the VVC problem is formulated as a robust optimization problem in [8], [9]. Both papers propose a two-stage coordination scheme for VVC, which consists of less-frequent control of on-load tap changers and more-frequent control of smart inverters. Model predictive control (MPC) based VVC is studied in [10], [11] to reduce real power losses and voltage fluctuations [10] and to preserve the life of controllable equipment by penalizing the number of tap changes [11].

In the optimization-based approach, the VVC problem is typically formulated as a mixed-integer conic programming (MICP) or mixed-integer nonlinear programming problem. The computational complexity of the solution algorithms for these NP-hard problems increases exponentially with the distribution network size and the number of controllable devices. Thus, the optimization-based approach does not scale well for real-time application of VVC.

The reinforcement learning approach is capable of making control decisions online based on off-line trained models. In particular, Q-learning based algorithms have been developed for the VVC problem [12]–[14]. The tabular Q-learning method is adopted to solve the VVC problem in [12]. A tabular Q-learning method is also proposed for the optimal reactive power dispatch problem in [13], where the global reward is obtained with a consensus-based global information discovery algorithm. In [14], separate Q-values of on-load tap changers are approximated sequentially by radial kernel functions. So far, all reinforcement learning based algorithms developed for the VVC problem have been action-value methods [15], [16]: they learn the values of actions and then select actions based on the estimated action values.
In this paper, we adopt a different reinforcement learning approach, called policy gradient methods [17]–[20], to solve the VVC problem. Policy gradient methods directly learn a parameterized control policy that can select actions without using a value function. Policy gradient methods have two advantages over action-value methods. First, the VVC policy may be a simpler function to approximate than the action-value function. Second, continuous policy parameterization yields stronger convergence guarantees for policy gradient methods than the ε-greedy action selection used in action-value methods [21]. Compared to the optimization-based approaches, our proposed algorithm has better scalability and does not require an accurate and complete physical model of the distribution network.

The existing reinforcement learning based VVC works allow controllers to freely explore any control actions during learning. However, certain control actions will lead to severe voltage violations in the distribution feeder. To enable safe exploration for controllers, we adopt the constrained policy optimization [20] algorithm, which statistically guarantees that every control policy encountered during learning satisfies the operational constraints in expectation.

The remainder of the paper is organized as follows. Section II presents the formulations of the VVC problem as an optimization problem and as a constrained Markov decision process (CMDP) problem. Section III describes how to leverage policy gradient methods to solve the VVC problem. Section IV shows the numerical results, which demonstrate the performance of our proposed reinforcement learning based VVC algorithms. Section V concludes the paper.

II. PROBLEM FORMULATION

In this section, we first formulate the VVC problem as an optimization problem and then as a CMDP problem.

A. Volt-VAR Control Formulated as an Optimization Problem

The VVC algorithm aims at minimizing the total system losses and equipment operation costs while satisfying voltage constraints. In this formulation, we assume the voltage regulators, on-load tap changers, and capacitor banks are the primary control knobs. The VVC problem can then be formulated as the following optimization problem [22]:

\min \; C_p \sum_{t=1}^{T} P_{loss}^{t} + \sum_{t=1}^{T} \sum_{j=1}^{N_r} C_r \left| Tap_j^r(t) - Tap_j^r(t-1) \right| + \sum_{t=1}^{T} \sum_{j=1}^{N_l} C_l \left| Tap_j^l(t) - Tap_j^l(t-1) \right| + \sum_{t=1}^{T} \sum_{j=1}^{N_c} C_c \left| Tap_j^c(t) - Tap_j^c(t-1) \right|   (1)

s.t.

f_{PB}(PG^t, QG^t, PD^t, QD^t, TAP^t, u^t, l^t) = 0, \quad \forall t   (2)

f_{OL}(PF^t, QF^t, TAP^t, u^t, l^t) = 0, \quad \forall t   (3)

(PF_{ij}^t)^2 + (QF_{ij}^t)^2 = l_{ij}^t u_i^t, \quad \forall i, j \in N, (i, j) \in E, \forall t   (4)

\underline{u} \le u_i^t \le \overline{u}, \quad \forall i \in N, \forall t   (5)

The objective function (1) minimizes the total operation costs, which include the costs associated with line losses and the switching costs of voltage regulators, on-load tap changers, and capacitor banks. The switching cost is assumed to be proportional to the absolute number of tap changes between consecutive hours. P_loss^t denotes the total real line losses at hour t. C_p, C_r, C_l, and C_c are the cost coefficients for the real power loss and for the tap changes of voltage regulators, on-load tap changers, and capacitor banks, respectively. N_r, N_l, and N_c are the total numbers of voltage regulators, on-load tap changers, and capacitor banks. Tap_j^r(t), Tap_j^l(t), and Tap_j^c(t) denote the tap positions of the j-th voltage regulator, on-load tap changer, and capacitor bank at hour t. T is the operation horizon of the VVC algorithm.

The formulation of the constraints leverages the DistFlow equations [23]. The decision variables of the DistFlow formulation are the vector u^t of u_i^t for all the nodes N, the vector l^t of l_ij^t for all the lines E, and the vector TAP^t of tap positions for all the devices. u_i^t denotes the square of the voltage magnitude of node i at hour t; l_ij^t denotes the square of the current magnitude of the line connecting nodes i and j at hour t.

The set of power balance constraints in the DistFlow formulation is represented by (2), where PG^t, QG^t, PD^t, and QD^t denote the vectors of nodal real and reactive power generations and demands at hour t. The constraints corresponding to Ohm's law are represented by (3), where PF^t and QF^t denote the vectors of real and reactive power flows at hour t. Equality constraint (4) is the only nonlinear constraint in the DistFlow formulation, and it can be relaxed into a second-order cone [23]. PF_ij^t and QF_ij^t are the real and reactive power flows on the line connecting nodes i and j at hour t. E and N denote the sets of edges and nodes in the distribution feeder. Constraint (5) represents the nodal voltage constraints, where \underline{u} and \overline{u} are the lower and upper limits on the square of the voltage magnitude. The detailed formulations of the operating constraints can be found in [22], where binary variables are introduced to represent the tap positions. The optimization problem shown above is therefore an MICP problem.

Finally, to account for generation and load uncertainties, the VVC problem can be solved within an MPC framework [10]. The optimization problem shown above is then solved on a rolling basis using updated load and generation forecasts.
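To make the structure of the objective (1) concrete, the sketch below evaluates the total operation cost for a given schedule of line losses and tap positions. It is an illustration only: the array names and shapes are assumptions for this example, and a single switching cost coefficient is used for all three device types, which matches the numerical study later in the paper but is not required by (1).

```python
import numpy as np

def operation_cost(p_loss, taps, c_p=40.0, c_switch=0.1):
    """Evaluate objective (1) for a candidate schedule.

    p_loss   : array of shape (T,), total real line losses (MW) at each hour
    taps     : array of shape (T + 1, N_s), tap positions of all devices,
               with row 0 holding the initial positions
    c_p      : cost coefficient for real power losses ($/MWh)
    c_switch : cost per tap change ($), assumed equal for regulators,
               on-load tap changers, and capacitor banks in this sketch
    """
    loss_cost = c_p * np.sum(p_loss)
    switching_cost = c_switch * np.sum(np.abs(np.diff(taps, axis=0)))
    return loss_cost + switching_cost

# Example: a 24-hour horizon with three devices (regulator, OLTC, capacitor bank).
T, n_dev = 24, 3
rng = np.random.default_rng(0)
p_loss = rng.uniform(0.01, 0.05, size=T)                  # MW per hour
taps = np.vstack([np.zeros((1, n_dev), dtype=int),
                  rng.integers(-2, 3, size=(T, n_dev))])  # candidate tap schedule
print(operation_cost(p_loss, taps))
```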
B. Volt-VAR Control Formulated as a Constrained Markov Decision Process

In the Markov decision process (MDP), the grid operator or controller is represented by an agent. The agent and the distribution grid interact at each of a sequence of discrete time steps t = 0, 1, 2, .... At each time step t, the agent receives the system's state s_t ∈ S and selects a control action a_t ∈ A(s). One time step later, the agent receives a numerical reward R_{t+1} ∈ R ⊂ ℝ and finds itself in a new state s_{t+1}. The probability of receiving a reward and observing a new state depends only on the preceding state and control action, i.e., P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_0, a_0, ..., s_t, a_t).

In the context of the VVC, the state is defined as s = [P, Q, T, t], where P, Q, T, and t denote the nodal real and reactive power injections, the current tap positions, and the time step. The action taken by a VVC agent is to change the tap positions of the controllable devices to T'. The size of the action space is \prod_{i=1}^{N_s} n_i, where N_s = N_r + N_l + N_c is the number of controllable devices and n_i denotes the number of tap positions of device i. The reward R(s_t, a_t, s_{t+1}) received by the controller for taking action a_t at state s_t and reaching state s_{t+1} is defined as the negative of the system operational costs, which include the costs associated with real power losses and equipment operations:

R(s_t, a_t, s_{t+1}) = -\Big[ C_p P_{loss}^{t} + \sum_{j=1}^{N_r} C_r \left| Tap_j^r(t+1) - Tap_j^r(t) \right| + \sum_{j=1}^{N_l} C_l \left| Tap_j^l(t+1) - Tap_j^l(t) \right| + \sum_{j=1}^{N_c} C_c \left| Tap_j^c(t+1) - Tap_j^c(t) \right| \Big]   (6)
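Because the joint action space is the Cartesian product of the per-device tap ranges, a discrete control action can be represented either as one tap choice per device or as a single index over all \prod_i n_i combinations. The sketch below shows one possible mapping between the two representations for the device set used later in the numerical study (11 × 11 × 2 = 242 combinations); the flat encoding is an implementation choice for this illustration, not something prescribed by the formulation.

```python
import numpy as np

# Number of tap positions per controllable device: voltage regulator,
# on-load tap changer, and capacitor bank (treated as on/off).
N_TAPS = (11, 11, 2)

def decode_action(index, n_taps=N_TAPS):
    """Map a flat action index to a tuple of per-device tap positions."""
    return np.unravel_index(index, n_taps)

def encode_action(positions, n_taps=N_TAPS):
    """Map per-device tap positions back to a flat action index."""
    return np.ravel_multi_index(positions, n_taps)

assert encode_action(decode_action(241)) == 241
print(decode_action(137))  # e.g. (regulator tap, OLTC tap, capacitor state)
```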
The goal of an agent is to find a control policy π that maximizes the expected discounted return defined as:

J(\pi) = \mathrm{E}_{\tau \sim \pi} \left[ G(\tau) \right]   (7)

where the control policy π is a mapping from the state space S to the action space A for a deterministic policy, and a mapping from states to probabilities of selecting each possible action for a probabilistic policy. τ is a trajectory, i.e., a sequence of states and actions {s_0, a_0, s_1, a_1, ..., s_{T-1}, a_{T-1}, s_T}. G(τ) is the discounted return along a trajectory, G(\tau) = \sum_{t=0}^{T} \gamma^t R(s_t, a_t, s_{t+1}), where γ ∈ (0, 1) is the discount factor.

Two important functions for policy π, the action-value function and the state-value function, are defined as follows [21]:

Q^{\pi}(s, a) = \mathrm{E}_{\tau \sim \pi} \left[ G(\tau) \mid s_0 = s, a_0 = a \right]   (8)

V^{\pi}(s) = \mathrm{E}_{\tau \sim \pi} \left[ G(\tau) \mid s_0 = s \right]   (9)

The action-value function Q^π(s, a) represents the expected return starting from state s, taking action a, and following π thereafter. The state-value function V^π(s) represents the expected return starting from state s and thereafter following policy π.

To enforce the voltage constraints, we augment the MDP with a set of cost functions R_C(s_t, a_t, s_{t+1}). For the VVC problem, the cost is defined as the number of voltage violations across all nodes, i.e.,

R_C(s_t, a_t, s_{t+1}) = \sum_{i=1}^{N} \left[ \mathbb{1}(|v_i^{t+1}| > \overline{v}) + \mathbb{1}(|v_i^{t+1}| < \underline{v}) \right]   (10)

where 1(·) is the indicator function, v_i^{t+1} is the voltage of node i at hour t + 1, and \overline{v} and \underline{v} are the upper and lower limits for voltage magnitudes. Additional operating constraints such as line flow limits could be incorporated in a similar manner.

Now the expected discounted return of policy π with respect to the cost function can be defined as

J_C(\pi) = \mathrm{E}_{\tau \sim \pi} \Big[ \sum_{t=0}^{T} \gamma^t R_C(s_t, a_t, s_{t+1}) \Big]   (11)

The final CMDP formulation of the VVC problem is:

\max_{\pi} \; J(\pi)   (12)

s.t. \; J_C(\pi) \le \overline{J}   (13)

where \overline{J} is the limit on the expected discounted return of the cost function associated with the voltage constraints.
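Both J(π) in (7) and J_C(π) in (11) are expectations over trajectories, so in practice they are estimated by Monte Carlo averages of discounted sums over sampled episodes. The sketch below shows such an estimate; the trajectory container and the simulation that would produce it are placeholders assumed for this illustration.

```python
import numpy as np

def discounted_return(per_step_values, gamma=0.99):
    """sum_t gamma^t x_t for one trajectory, as in G(tau) or the inner sum of (11)."""
    per_step_values = np.asarray(per_step_values, dtype=float)
    discounts = gamma ** np.arange(len(per_step_values))
    return float(np.sum(discounts * per_step_values))

def estimate_objectives(trajectories, gamma=0.99):
    """Monte Carlo estimates of J(pi) and J_C(pi) from sampled trajectories.

    Each trajectory is assumed to be a dict with per-step 'rewards'
    (R in (6)) and 'costs' (R_C in (10)).
    """
    J = np.mean([discounted_return(tr["rewards"], gamma) for tr in trajectories])
    J_C = np.mean([discounted_return(tr["costs"], gamma) for tr in trajectories])
    return J, J_C

# Example with two toy trajectories.
trajs = [{"rewards": [-1.2, -0.8, -0.9], "costs": [0, 2, 0]},
         {"rewards": [-1.0, -1.1, -0.7], "costs": [1, 0, 0]}]
print(estimate_objectives(trajs))
```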
III. TECHNICAL METHODS

So far, all reinforcement learning algorithms adopted to solve the VVC problem have been action-value methods, which approximate the action-value functions through learning and then select actions based on the estimated action-value functions. In this paper, we consider policy gradient methods, which learn a parameterized control policy that directly selects actions without consulting a value function [21]. Typically, an approximate policy is parameterized according to the soft-max in action preferences, which makes approaching a deterministic policy easier and makes finding a stochastic policy feasible [21]. Neither of these goals can be achieved by the ε-greedy action selection of action-value methods. Another notable advantage of policy gradient methods over action-value methods is that the control policy function may be easier to approximate than the action-value function in many applications, such as the VVC problem.

In this section, we first introduce the preliminaries of policy gradient methods. Then, two state-of-the-art policy gradient methods based on trust region algorithms [18], [20] are adopted to solve the VVC problem. Finally, we discuss the design of the neural networks that approximate the policy and value functions in the two algorithms.

A. Preliminaries of Policy Gradient Methods

Policy gradient methods learn a parameterized control policy π_θ that maximizes the performance measure \hat{J}(θ) by updating the parameter θ iteratively as follows:

\theta_{k+1} = \theta_k + \alpha \nabla_{\theta} \hat{J}(\theta_k)   (14)

According to the policy gradient theorem [21], the gradient can be derived as

\nabla_{\theta} \hat{J}(\theta) = \mathrm{E}_{\tau \sim \pi_{\theta}} \Big[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \, \Psi_t \Big]   (15)

where Ψ_t may take various forms, including the action-value function Q^{π_θ}(s, a) and the advantage function A^{π_θ}(s, a).
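In practice, the update (14)–(15) is implemented by replacing the expectation with an average over sampled trajectories and letting automatic differentiation compute ∇_θ log π_θ. The PyTorch sketch below illustrates this plain, unconstrained gradient-ascent step for a small categorical policy; the network size, optimizer, and the way the advantage estimates standing in for Ψ_t are obtained are assumptions made only for this example. TRPO and CPO, introduced next, replace this step with trust-region-constrained updates.

```python
import torch
import torch.nn as nn

# A toy categorical policy pi_theta(a|s) over, e.g., 242 joint tap actions.
policy = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 242))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_gradient_step(states, actions, advantages):
    """One ascent step on (15) with Psi_t given as precomputed advantages."""
    logits = policy(states)                                   # (batch, n_actions)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    loss = -(log_probs * advantages).mean()                   # minimize -J  <=>  ascend J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with random data standing in for sampled trajectories.
states = torch.randn(32, 10)
actions = torch.randint(0, 242, (32,))
advantages = torch.randn(32)
print(policy_gradient_step(states, actions, advantages))
```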
The advantage function, which quantifies the improvement obtained by taking action a in state s compared to randomly selecting an action according to policy π_θ and following π_θ afterwards, is defined as

A^{\pi_{\theta}}(s, a) = Q^{\pi_{\theta}}(s, a) - V^{\pi_{\theta}}(s)   (16)

Two policy gradient methods that use the advantage function, trust region policy optimization (TRPO) and constrained policy optimization (CPO), are presented in the following subsections. We discuss how to adopt them to solve the VVC problem formulated as an MDP and as a CMDP, respectively. The implementation details of these two algorithms can be found in [18], [20].

Algorithm 1 TRPO for VVC
1: Initialize the parameters of the policy and the value function, θ_0, φ_0
2: for k = 0, 1, 2, ... do
3:   Generate sample trajectories Tr_k = {τ} with π_{θ_k} through power flow simulations
4:   Calculate the discounted return for the objective, Ĝ_t, after each time step t along the trajectories
5:   Estimate the advantage for the objective, Â_t, based on the value function V_{φ_k}
6:   Obtain π_{θ_{k+1}} by solving (18) and (19)
7:   Update the parameters φ_k of the value function neural network with Ĝ_t as labels
8: end for
\mathrm{E}_{s \sim \eta_{\theta_k},\, a \sim \pi_{\theta_k}} \Big[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_k}(a \mid s)} \big( R(s, a, s') + \gamma V^{\pi_{\theta_k}}(s') - V^{\pi_{\theta_k}}(s) \big) \Big]   (25)

Therefore, we only need to design neural networks to approximate the state-value function and the policy function. The state-value function V_φ corresponding to the augmented reward in Algorithm 1 is parameterized with φ. The state-value functions corresponding to the reward, V_{φ1}, and to the constraint, V_{φ2}, in Algorithm 2 are parameterized with φ1 and φ2, respectively. The inputs of all the value networks are states; the output is the expected discounted return.

[Fig. 1. Structure of the policy network: the state input feeds shared hidden layers, followed by one softmax output head per controllable device (Device 1, ..., Device n).]
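A possible PyTorch realization of the policy network sketched in Fig. 1 is given below, using the layer sizes and activations reported later in the numerical study (two tanh hidden layers with 64 and 32 neurons and a softmax output head per device). The class name, the state dimension, and the use of torch.distributions are implementation choices for this illustration rather than details fixed by the paper; the joint action is sampled by drawing one tap position per head.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class MultiHeadPolicy(nn.Module):
    """Shared hidden layers with one categorical (softmax) head per device."""

    def __init__(self, state_dim, taps_per_device=(11, 11, 2)):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 32), nn.Tanh(),
        )
        self.heads = nn.ModuleList([nn.Linear(32, n) for n in taps_per_device])

    def forward(self, state):
        h = self.body(state)
        # One categorical distribution over tap positions per device.
        return [Categorical(logits=head(h)) for head in self.heads]

# Example: sample a joint action (one tap position per device) for a batch of states.
policy = MultiHeadPolicy(state_dim=10)
dists = policy(torch.randn(4, 10))
action = torch.stack([d.sample() for d in dists], dim=-1)                 # shape (4, 3)
log_prob = sum(d.log_prob(a) for d, a in zip(dists, action.unbind(-1)))   # joint log pi(a|s)
print(action, log_prob)
```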
to each node according to the load profile of the standard test case. Each test feeder has three switching devices: a voltage regulator, an on-load tap changer, and a capacitor bank. Both the voltage regulator and the on-load tap changer have 11 tap positions with turns ratios between 0.95 and 1.05. The capacitor bank can be switched on and off remotely, so its number of 'tap positions' is treated as 2. The size of the action space for each test case is 11 × 11 × 2 = 242. In the 4-bus test feeder, the capacitor bank is placed at node 4. In the 13-bus test feeder, the capacitor bank is placed at node 675. The nominal capacity of the capacitor banks is 200 kW. Initially, the turns ratios of the voltage regulators and on-load tap changers are 1, while the capacitor banks are switched off. The electricity price C_p is assumed to be $40/MWh. The switching costs of the devices, C_r, C_l, and C_c, are set at $0.1 per tap change.

B. Benchmarking Algorithms

The MPC-based optimization algorithm is chosen as the first benchmark. The control horizon is 24 hours. An ARIMA [25] model is used to forecast the load over the control horizon. The MICP problem formulated in Section II-A is solved on a rolling basis at each step of the MPC. MOSEK and GUROBI are used to solve the MICP problem. The second benchmark is set up by replacing the load forecast with actual load data in the MPC framework. The last benchmark represents the baseline where all switching devices are kept at their initial positions.
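The rolling (receding-horizon) structure of the MPC benchmark can be summarized by the loop below. The callbacks forecast_load, solve_micp, measure_state, and apply_taps are placeholders assumed for this sketch: the actual benchmark uses an ARIMA forecaster and MOSEK/GUROBI on the MICP of Section II-A, which are not reproduced here.

```python
def run_mpc(simulation_hours, horizon, forecast_load, solve_micp, measure_state, apply_taps):
    """Receding-horizon loop: forecast, solve the horizon-long MICP,
    apply only the first hour's tap positions, then roll forward one hour."""
    for t in range(simulation_hours):
        state = measure_state(t)                         # current taps, voltages, loads
        load_forecast = forecast_load(t, horizon)        # e.g. ARIMA forecast for t .. t+horizon-1
        tap_schedule = solve_micp(state, load_forecast)  # horizon-long tap plan
        apply_taps(tap_schedule[0])                      # implement only the first step

# Toy stand-ins so the loop structure can be executed end to end.
run_mpc(
    simulation_hours=3,
    horizon=24,
    forecast_load=lambda t, h: [1.0] * h,
    solve_micp=lambda state, forecast: [[0, 0, 0]] * len(forecast),
    measure_state=lambda t: {"taps": [0, 0, 0]},
    apply_taps=lambda taps: print("hour taps:", taps),
)
```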
C. Policy Gradient Methods

In the TRPO and CPO algorithms, both the value and policy neural networks have two hidden layers with 64 and 32 neurons, respectively. The tanh activation function is used in all the hidden layers. The linear and softmax activation functions are used for the output layers of the state-value and policy networks, respectively. In the TRPO algorithm, the reward function is augmented with a penalty cost for voltage constraint violations. The penalty coefficient C_V is $1 per voltage violation per node. The terminal state is chosen as the last hour of a week for both algorithms.

D. Performance Comparison

The control performances of the CPO, TRPO, and MPC-based approaches are evaluated in this subsection. Both the CPO and TRPO algorithms are trained for 500 iterations. Each training iteration consists of 298 episodic trajectories, which correspond to about 50,000 samples. Over the training episodes, we record the average discounted return (ADR), which includes the costs associated with the line losses, tap changes, and the penalty for voltage violations. As shown in Fig. 2, the CPO algorithm starts to outperform the TRPO algorithm after about 200 training iterations for the 4-bus test case. For the 13-bus test case, the CPO algorithm always outperforms the TRPO algorithm. At the end of the training process, the improvements in episodic returns for both algorithms become saturated.

[Fig. 2: average discounted return, ADR ($), versus iteration number for CPO and TRPO on the 4-bus and 13-bus test cases.]

TABLE I
PERFORMANCE COMPARISON OF VOLT-VAR CONTROL ALGORITHMS

Test case           Algorithm        OC ($)   # of TC   # of VV   AVV (per unit)
4-bus test case     Baseline         150.13       0        91       2.748
                    MPC (Actual)     111.44      18         0       0
                    MPC (Forecast)   111.89      20         0       0
                    CPO              115.01       9         5       0.044
                    TRPO             120.05       3        16       0.286
13-bus test case    Baseline          77.88       0       268       2.673
                    MPC (Actual)      58.05       6         0       0
                    MPC (Forecast)    58.44       6         0       0
                    CPO               58.92       6         0       0
                    TRPO              61.29       3         2       0.004

[Fig. 3. Comparison of voltage profiles on the 4-bus test feeder: per-unit voltage at Node 3 (top) and Node 4 (bottom) versus hour, for CPO, TRPO, and forecast-based MPC.]

The MPC with actual load represents the global optimal solution. As shown in Table I, the CPO algorithm is capable of achieving a near-optimal operational cost and is nearly