Design and Implementation of An Environment For Learning To Run A Power Network (L2RPN)

Abstract— This report summarizes work performed as part of an internship at INRIA, in partial requirement for the completion of a master's degree in mathematics and informatics. The goal of the internship was to develop a software environment to simulate electricity transmission in a power grid and the actions performed by operators to maintain this grid in security. Our environment lends itself to automating the control of the power grid with reinforcement learning (RL [1]) agents, assisting human operators. It is amenable to organizing benchmarks, including a challenge in machine learning planned by INRIA and RTE for 2019. Our framework, built on top of open-source libraries, is available at https://github.com/MarvinLer/pypownet. In this report we present intermediary results and the usage of the environment in the context of a Reinforcement Learning (RL) game.

I. INTRODUCTION

This project addresses technical aspects related to the transmission of electricity in extra high voltage and high voltage power networks (63kV and above), such as those managed by the company RTE (Réseau de Transport d'Electricité), the French TSO (Transmission System Operator). Numerous improvements of the efficiency of energy infrastructure are anticipated in the next decade, from the deployment of smart grid technology in power distribution networks to more accurate consumption predictive tools. As we progress in deploying renewable energy, harnessing wind and solar power to transform it into electric power, we also expect to see growth in power demand for new applications such as electric vehicles. Electricity is known to be difficult to store at an industrial level. Hence supply and demand on the power grid must be balanced at all times to the extent possible. Failure to achieve this balance may result in network breakdown and subsequent power outages of various levels of gravity. Notably, shutting down and restarting power plants (particularly nuclear power plants) is very difficult and costly, since there is no easy way to switch generators on and off. Many consumers, including hospitals and people hospitalized at home as well as factories, critically suffer from power outages. Using machine learning (and in particular reinforcement learning) may allow us to better optimize the operation of the grid, eventually leading to reduced redundancy in transmission lines, better utilization of generators, and lower power prices. The goal of this project is to prepare a data science challenge to engage the scientific community in helping to solve this difficult problem.

RTE is a Transmission System Operator (TSO). One of the objectives of TSOs is to route electricity from productions to consumptions using a power grid, under the constraint of avoiding equipment failure. A typical failure we are interested in is an unplanned line outage resulting from overheating. Such incidents happen when lines are subject to power flows greater than a nominal threshold value. To avoid line failures (and possible subsequent cascading failures), operators (dispatchers) have a set of actions at their disposal: they can locally modify the line interconnections, switch power lines on or off, or change electricity production. The difficulty of the task arises from the complexity of the network architecture, also called grid topology, which frequently changes due to events such as hardware failures (e.g. due to weather conditions such as thunderstorms), planned maintenance or preventive actions. On top of that, rising renewable energies are less predictable than conventional production systems (e.g. nuclear plants), bringing more uncertainty to the production schemes. In this context, we are interested in developing tools that will assist dispatchers in keeping a power grid safe and facing the increasing complexity of their task. This work builds on top of work performed by Benjamin Donnot [3][4][5] as part of his PhD thesis and by Joao Araujo, an intern who performed preliminary work on the subject last summer.

Recent work in deep learning [6] has underlined the potential of deep neural networks in solving complex tasks ([7], [8]). For classification and regression tasks, they are usually trained using supervised learning, which necessitates a labeled dataset. In our case, a suitable dataset could be made of pairs of grid situations and dispatchers' curative actions, such that the models are trained by copying (and hopefully generalizing) the dispatchers' actions given a grid state (or a temporal chronic of grid snapshots). Unfortunately, we do not have access to such labeled data providing preventive or remedial actions of dispatchers for given crisis situations, for the very-high voltage grid¹.

¹ All of the actions of the dispatchers are recorded, but without proper annotations, so their motivations are not accurately documented. Besides, a lot of these actions are anticipative, which necessitates additional amounts of data including consumption predictions and planned productions.

This prompted us to investigate methods of reinforcement learning. Recent papers ([9], [10], [11]) managed to successfully apply reinforcement learning to high-dimensional temporal tasks. One specific aspect of our problem is that the power grid can be accurately simulated with a physical simulator implementing the laws of physics (ordinary differential equations) under some quite restrictive assumptions, for example that we are in a quasi-stationary regime. These hypotheses are quite common in the power system community. Therefore our problem lends itself well to reinforcement learning, because data can be generated using an Environment simulator (a power grid physical simulator). The hope is that a trained model would implement a policy (mapping states of the network to preventive or curative actions that maintain the network in security over time) that might be used to assist human dispatchers in making the right decision. In our project, we simplified the overall problem by limiting ourselves to toy examples of grids and subsets of actions to create a "serious game" simulating semi-realistic conditions of power grid control. This game lends itself to reinforcement learning solutions. The proposed framework will be used for a challenge implemented on the Codalab platform (http://competitions.codalab.org).

[...] there cannot be only one element connected to a pole, since the electricity would have no exit point. In this work, we will constrain the substations to have at most two buses, i.e. the elements of a substation can be grouped into a maximum of two groups, in which objects are directly connected. In other words, the elements of every substation can be interconnected into one or two groups.
Fig. 2: Example of representations of the inner configuration of a substation: (a) representation a, (b) representation b. The substation is the gray ellipse. Buses are depicted as the two pink filled circles. Because buses do not have a proper physical meaning, both figures represent the same configuration. For this configuration, production G and branch L1 are directly connected, and consumption C and branch L2 are directly connected.

[...] an overflowed state. In reality, a more restrictive approach is taken. TSOs often ensure that if a component of the grid were to fail (e.g. a branch, a plant or a switch), then the whole grid would remain in security, i.e. no branches are overflowed. The verification of this realistic criterion would necessitate significant additional computational resources, which is why we do not take it into account in our study. A branch is overflowed when its flowing current is above its thermal threshold. The more current in a power line, the more it heats, which causes a dilatation phenomenon of the line. The air between the line and the ground acts as insulation and might then not be sufficient to protect nearby passers-by from electric arcs. Apart from the security of passers-by, a melted power line needs to be replaced. It takes several [...]

[...] only lowers when some branches are switched off (compared to a fully operating grid), which means that the grid is a priori more prone to overflows.

Modifying the production outputs, often called re-dispatching, is the operation of changing the amount of energy produced by some productions. The action is called "re-dispatching" because one production is lowered by a given amount that is redistributed among other productions. If the amount is not counter-weighted, there might not be enough production to satisfy the demand. Re-dispatching is expensive, because this operation requires the modification of multiple generators, which are not the property of TSOs. As such, we do not take re-dispatching into account in this work.

Node splitting represents the majority of manual interventions done by dispatchers at RTE. It is the operation of changing the interconnection configuration of elements within a substation. By definition, a substation is at the intersection of at least two branches. In fact, branches can be connected to only a subgroup of the branches connected to a substation. The operation of node splitting consists in modifying the patterns of branch interconnections. The name refers to viewing the operation as defining subnodes (or buses) of a substation, with each branch connected to none or one subnode. In the following, we limit the number of subnodes to 2.

There are essentially two ways of operating a large-scale power grid:
• A preventive mode: avoiding future failures given estimations of production and consumption schemes
• A curative mode: resolving a failure given the current [...]
[...] set), that would propose multiple curative solutions given grid states, such that operators could take decisions rapidly by selecting an action among the candidate ones.

C. Load-flow computations

A load-flow computation is the operation of computing the values of the flows within an electric grid given the grid structure, a set of injections, and a set of parameters describing the productions, loads, branches and other elements. We make the assumption that a grid subject to injections will instantaneously converge to its steady state. As such, a load-flow computation is an optimization problem subject to equations and constraints, which involves dozens of variables for the elements describing a power grid, such as the resistance and reactance of the power lines.

Most high voltage grids are operated in AC mode (alternating current), as opposed to DC mode (direct current). However, AC load-flows being complex and slow to compute, they are sometimes approximated with a DC approximation. The DC modeling makes the following simplifying assumptions:
• Branches are lossless: the input and output powers of [...]

[...] implementations define various parameters about the grid elements, including the grid structure. Various versions exist. We are particularly interested in the case IEEE-118, which is a simplification of the Californian grid. Explicitly, it has 118 substations, 56 productions and 186 branches (without counting the lines between productions and substations).

E. Reinforcement learning

Reinforcement learning (RL, [1], [16]) is a domain of machine learning that differs from supervised and unsupervised learning. Indeed, there is no supervisor in reinforcement learning, but rather a reward signal, expressing in our case the degree of satisfaction of grid security constraints. An agent (in our case emulating a dispatcher) interacts with the Environment (in our case the game), implementing a policy determining which actions are performed given the state, towards maximizing rewards. RL algorithms train a policy, which is usually a parametric function of the system state and reward so far. The set of actions in our case is taken from permitted changes in grid topology. RL systems are often described using Markov Decision Processes (MDP). A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩ such that:
• S is a finite set of states
• A is a finite set of actions [...]
Optimal value functions describe the best achievable performance in the MDP, thus solving the problem. We can prove that there exists an optimal policy: its performance is greater than or equal to that of any other policy. Such an optimal policy is intrinsically optimal with regard to both the state-value and action-value functions.

One advantage of MDPs is their capacity for sequential decision making: temporal data are important in reinforcement learning (because of the delayed reward), so the data do not need to be independent and identically distributed.

The overall mechanism of reinforcement learning consists in training an Agent by interacting with an Environment. At each time step t, the Agent takes a set of actions a_t from an Action Space, based on the current state s_t of the Environment and the previous reward R_t. The Environment then computes the next state s_{t+1} as well as a reward R_{t+1} resulting from both s_t and a_t. See Fig. 3 for a visual representation of this vanilla mechanism.

[...] training manifold of grids with one and only one switched-off line). Their method relies on a novel architecture called Guided Dropout. It is influenced by the conventional Dropout [17] commonly used in deep learning. However, instead of randomly nulling some hidden units, they adopt a scheme that controls the active units based on the input. More precisely, they build a feed-forward neural network with a set of productions, a set of loads, and a binary vector of line service status (switched on or off) as input. The model outputs a set of flows that should be close to the ground-truth flows. The model is trained using back-propagation, with a regression loss such as Mean Squared Error. In this context, some hidden units are activated only when the associated lines are disconnected, i.e. the associated line connectivity status input is 0. Their approach has better training and testing performance than the baseline approach, which consists in the same network without Guided Dropout, i.e. with all of the conditionally activated units activated (which has more parameters, so more capacity).
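The conditional activation idea can be illustrated with a small sketch. This is our own simplification, not the exact Guided Dropout architecture of [12]: a single ReLU layer whose hidden units are split into an always-active block and small per-line blocks that are enabled only when the corresponding line is out of service (status 0). All sizes and values are arbitrary.

```python
import numpy as np

def guided_gating_layer(injections, line_status, W, b, units_per_line, n_base_units):
    """Toy feed-forward layer with input-conditioned unit gating.

    Hidden units are split into a block that is always active and one small
    block per power line that is only enabled when that line is switched off
    (status 0), mimicking the conditional-activation idea described above.
    """
    h = np.maximum(0.0, injections @ W + b)       # ReLU activations before gating
    mask = np.ones_like(h)
    for l in range(line_status.shape[0]):
        start = n_base_units + l * units_per_line
        # enable the block dedicated to line l only if the line is out of service
        mask[start:start + units_per_line] = 1.0 - line_status[l]
    return h * mask

# Toy usage: 5 injection features, 3 lines, 4 always-on units + 2 units per line.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
status = np.array([1.0, 0.0, 1.0])                # line 1 is disconnected
W = rng.normal(size=(5, 4 + 3 * 2))
b = np.zeros(4 + 3 * 2)
print(guided_gating_layer(x, status, W, b, units_per_line=2, n_base_units=4))
```

With this gating, the units associated with in-service lines output zero, so the effective sub-network depends on the grid topology encoded in the input, which is the property the baseline network lacks.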
With such a modelization, an action a*_{t+1} is taken at each timestep t with state s_t, such that it maximizes the expected Q function:

a*_{t+1} = argmax_{a ∈ A} Q(s_t, a)

2) AlphaGo: One of the most important breakthroughs in the field of Artificial Intelligence over the last decade is the success of an AI over the world champion of Go. Go is a zero-sum, one versus one board game, played on a board of size 19 by 19. At each unused location, a player can put one of their pieces. The goal of the game is to control the bigger area of the board. In the past, several models with subhuman performance have been created, mainly based on tree search algorithms and enhanced by trading approximation for tree depth exploration. The game of Go is significantly harder to exhaustively simulate than chess, because the Go board is larger. For small n, there are roughly (19 × 19)^n reachable configurations from the current board within n more moves, which quickly falls out of the scope of the computation capacity of current computers.

In [23], Silver et al. present the architecture of AlphaGo. Overall, they train 3 Convolutional Neural Networks: two policy networks and one value network. Both kinds of CNN take the current state of the board as input, in the form of an image (with extra features beyond the scope of this discussion). The first policy network consists in copying expert moves, from an aggregation of 30 million positions, and reached approximately 57% accuracy on a test set. In more detail, a CNN of 13 layers with ReLU activations is trained on recorded expert moves using supervised learning. The labeled dataset consists of games of Go played by humans, which are discretized into pairs of (board state, action taken). The output of the network is a grid of the same size as the board (flattened for convenience). Given a set of parameters θ, the network then outputs a probability distribution p_θ(a|s), where a corresponds to every board location and s is the current state of the board. A second, lighter CNN is trained on the same task. It is used to make rapid simulations (the authors claim < 2µs). Inference is done by maximizing the network's probability function over the possible actions:

a* = argmax_{a=(i,j) ∈ {1,...,19}^2} p_θ(a|s)

The next step of their approach consists in improving the previously learned policy by making it play against itself, using the outcome of these games as a training signal. More formally, the previous policy is trained using policy gradient learning, by making it play against previous versions of itself. Policy gradient methods are a type of reinforcement learning algorithm which consists in optimizing parametrized policies with respect to the long-term cumulative reward using gradient descent. At this point, their trained model beat the previously best-performing Go software, Pachi, in 85% of the games.

A value network was then trained, using the third CNN, to predict the likelihood of a win given the current game state. This is similar to the classical approach of value functions, except that it is learned in this case. They stitch the trained networks together using Monte-Carlo Tree Search. Without going into further details, AlphaGo uses a mixture of the output of the value network and the result of a simulation to compute the value of a state in the Monte-Carlo tree. One last trick consists in dividing the state value by the number of times a simulation has led to this state. By doing so, there is a trade-off between exploitation (using the trained policy) and exploration (visiting new positions). The latter trick encourages exploration, since it penalizes actions that were often chosen.

AlphaGo Zero: A more recent version, AlphaGo Zero [10], achieves even better performance, not only on the game of Go, but also on Chess and Shogi (a Japanese variant of chess). A major improvement of this version is that it does not rely on expert moves. This is an advantage because it reduces the dependence on training data (e.g. recorded games of high-elo players), and leverages the importance of a simulator for reinforcement learning (in their case, the respective board games). Specifically, there is no initialization on expert behavior data. The agent learns and improves by self-playing.

Apart from this improvement, the value-function neural network (the one modeling the probability of winning given a state) and the Q neural network (the one modeling the probability of an action given a state and the reward so far) are merged into a unique CNN architecture. Without going into details about the net architecture, it leverages batch normalization [24] after some layers' outputs, on top of residual connections [25] that improve the flow of gradients. The trained neural network is then incorporated into an MCTS algorithm to choose the investigated branches more consistently. The winner receives +1 at the end of a game, while the loser gets -1.

IV. GAME DESIGN

This section describes the game setting that we design, such that it lends itself to a reinforcement learning solution. Since any such game requires defining four components: State, Action, Reward, Information, we endeavor to define first the simulation Environment and its parameterization as a State space, then the Action space and finally the Reward resulting from an observation and an action. We make explicit the Information available to the Agent to determine the next action (the observable part of the State space, also called the Observation space).

A. Environment

The game is based on a simulation Environment that emulates a power grid based on IEEE-118 of Matpower [2]. It is implemented as a Partially Observable Infinite Markov Decision Process.
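Before the formal definition below, the intended interaction pattern is a standard Gym-style loop. The sketch uses a stand-in environment with made-up dynamics; the class and method names are ours and illustrative only, not pypownet's actual API.

```python
import numpy as np

class ToyGridEnv:
    """Stand-in environment with a Gym-like interface (illustrative only)."""
    def __init__(self, n_lines=4, seed=0):
        self.n_lines = n_lines
        self.rng = np.random.default_rng(seed)
        self.flows = None

    def reset(self):
        self.flows = self.rng.uniform(0.3, 0.9, size=self.n_lines)
        return self.flows.copy()                      # observation

    def step(self, action):
        # action: one entry per line in {-1, 0, 1} (switch off / do nothing / switch on)
        self.flows = np.clip(self.flows + 0.05 * self.rng.normal(size=self.n_lines), 0, 2)
        reward = -float(np.sum(self.flows ** 2))      # e.g. a negated line-usage term
        done = bool(np.any(self.flows >= 1.0))        # stop the epoch on an overflow
        return self.flows.copy(), reward, done

def do_nothing_policy(observation):
    return np.zeros(observation.shape[0], dtype=int)

env = ToyGridEnv()
obs = env.reset()
for t in range(10):
    action = do_nothing_policy(obs)
    obs, reward, done = env.step(action)
    print(f"t={t} reward={reward:.3f} done={done}")
    if done:
        break
```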
Formally, the environment is a tuple ⟨S, A, O, P, R, Z, γ⟩ such that:
• S is a continuous set of states
• A is a finite set of actions
• O is a continuous set of observations
• P is a state transition (probability) function
• R is a reward function
• Z is an observation function
• γ is a discount factor

The MDP is partially observable partly because the Agents do not have access to the state resulting directly from their actions, as will become clearer in what follows. It is infinite because the productions and consumptions take real values.

We assume discrete time updates (at intervals to be determined; typically 5 minutes, 1 hour or 1 day). Actions are performed at unit time intervals. Thus, given a state s_t, an action a_t of the Agent, and a new set of injections x_{t+1}, the variables are updated as follows by the simulation Environment:

s_{t+0.5} = P1(s_t, a_t)            (2)
r_{t+1} = R(s_{t+0.5}, a_t)         (3)
s_{t+1} = P2(s_{t+0.5}, x_{t+1})    (4)
o_{t+1} = Z(s_{t+1})                (5)

The state s_t includes a description of the grid topology (lines in service and line interconnections) and the status of the power flows in all lines. An action a_t may consist in a change in the grid topology. The reward calculation is based both on the state and the action (some states presenting more danger than others, and some actions being more costly than others).

For the purpose of clarity, we decompose calculations using a half-way time step t + 0.5 and two state transition functions P1 and P2. This is because the calculation of the reward r_{t+1} is based on the immediate consequences of the action taken by the agent, s_{t+0.5}, prior to the (slower) application of a change in injections x_{t+1}. In the simplest case, x_t can follow a defined schedule, but it could also be a random variable. Other factors may influence P2, such as incidental or planned changes in grid topology.

More precisely, in Equation (2), function P1 implements the laws of physics of power grid systems (it is actually deterministic in our setting). In practice, the game first applies the action onto the grid, then uses Matpower to compute the resulting flows. Equation (3) then computes the reward, depending on the last state and the action of the Agent. Next, in Equation (4), function P2 performs another load-flow computation, based on the last state and the next set of injections x_{t+1}. Finally, Equation (5) compiles the information that is made available to the agent.

The role of the Agent is to devise a strategy to make optimum actions through a policy function Π(o_t; θ), which may include parameters θ adjustable by training (i.e. by reinforcement learning). The game iterates over "LARSO" cycles:

Step L: Agent gets new observation o_t and reward r_t, and updates/Learns (the parameters of) its policy Π.
Step A: Agent performs Action a_t = Π(o_t).
Step R: Environment performs the first state update (before the injection change), s_{t+0.5} = P1(s_t, a_t), to compute the Reward r_{t+1} = R(s_{t+0.5}, a_t).
Step S: Environment applies the new injections x_{t+1} and re-computes the State: s_{t+1} = P2(s_{t+0.5}, x_{t+1}).
Step O: Environment reveals the Observation o_{t+1} of s_{t+1} (and the reward already computed).

In this setting, two Matpower callbacks are done at step R (with P1) and step S (with P2). The task can be parallelized for users interested in batch reinforcement learning.

B. Observation space

An observation o_t represents the state of the grid at time step t. Among all the variables and parameters that dictate the response of the grid to a set of injections, we keep only the changing variables of the system, detailed below. Other variables are hidden from the Agent, including the parameters of the elements constituting the grid.

An observation is a fixed-size structure made of the following elements:
• Active, reactive and voltage values of the productions
• Active, reactive and voltage values of the consumptions
• Active, reactive and voltage values of the lines: one 3-tuple for each substation of each line
• Relative thermal limits
• Lines interconnection patterns
• Lines service status

The active, reactive and voltage values of both the productions and consumptions are the injections of the power grid at a given timestep. Each of these values is stored as a list of fixed size throughout the game. For IEEE-118, the lists are of size 56 for productions, and of size 99 for consumptions.

The line power flow values are stored similarly. We keep two values per line: the in-flow and the out-flow. This is justified by the fact that there are losses within lines in the AC setting. For IEEE-118, there are 186 lines.

The relative thermal limit vector is the element-wise division of the list of line flowing currents by the list of line thermal limits. More precisely, given a set of flowing currents f_t = (f_{i,t})_i and a set of associated thermal limits (one per branch) T = (th_i)_i (thermal limits are fixed through time), the relative thermal limit (r_{i,t})_i is:

∀i, r_{i,t} = f_{i,t} / th_i

This modelization condenses both the values of the current and the values of the thermal limits for every line. Consequently, a line i is overflowed iff r_i ≥ 1.

The information about the grid topology is given using a topology list, noted τ, of fixed dimension. Each element i of the topology list represents the id of the chosen configuration for substation i, except that the ids are converted to one-hot vectors.
Given a substation with n elements, and our hypothesis of a maximum of two buses per substation, the following formula gives the number of (non-unique) possible topological configurations with all objects considered switched on (with C(n, k) the binomial coefficient):

Σ_{k=0; k≠1, k≠n−1}^{n} C(n, k) = 2^n − 2n

For identifiability reasons, since we consider configurations to be equal up to bus permutation, there are 2^{n−1} − n unique possible configurations for a substation of n elements (since we count each configuration twice in the previous formula). We use a one-hot encoding of such configurations, i.e. if a substation i has n_i configurations, we use a sub-vector of dimension n_i with a 1 in the j-th position if the j-th configuration is used and 0 everywhere else.

For IEEE-118, this approach leads to a topological list whose total size (the sum of its element sizes) is approximately 10000. However, we note that only one substation is responsible for three quarters of that size, because there are 14 objects connected to it. See Fig. 4 for a plot of the distribution of the total number of (unique) interconnection configurations per substation, with all elements non-disconnected. Consequently, in order to limit the size of the topological space, we reduce the number of available topologies for some of the bigger substations.

The number of possible topological configurations of a substation, where some elements can be disconnected, is

Σ_{l=0}^{n} Σ_{k=0; k≠1, k≠l−1}^{l} C(l, k) = 2 Σ_{l=0}^{n} (2^{l−1} − l) = O(2^n)

We propose to decouple the representation of the topology into two parts: a vector of line status and a one-hot vector representing the configuration as if all the lines were in service. The line service status vector is of size the number of lines in the grid and takes binary values: 0 represents a line out-of-service, 1 a line in-service. There are exactly 2^{n−1} − n + n = 2^{n−1} values to fully represent this modelization, instead of the O(2^n) obtained when considering configurations with out-of-service elements.

C. Action space

The game allows two types of actions: the disconnection and reconnection of lines, and the modification of the grid topology. For better integration within the Gym environment, those two types of action are stored within an Action 2-tuple, such that the players need to provide one structure at each time step. Besides, the game validates that the actions proposed by the player are well formed, as described below.

1) Changing the line service status: The line service status action is encoded as a vector (a¹_{i,t})_i of size the number of lines of the grid (186 for the IEEE-118) such that:

∀i, a¹_{i,t} = { 1 : switch line i on; −1 : put line i out-of-service; 0 : do nothing to line i }
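The counting argument above can be checked numerically. The sketch below (the helper name is ours) enumerates all assignments of n elements to at most two buses, forbids buses holding exactly one element, and de-duplicates up to bus permutation; for n ≥ 3 the enumeration matches 2^(n−1) − n.

```python
from itertools import combinations

def unique_two_bus_configurations(n_elements):
    """Count the assignments of n elements to at most two buses, forbidding
    buses that would hold exactly one element, and counting configurations
    only once up to a permutation of the two buses."""
    elements = range(n_elements)
    seen = set()
    for k in range(n_elements + 1):
        if k == 1 or k == n_elements - 1:
            continue                      # a bus with a single element is forbidden
        for bus_a in combinations(elements, k):
            bus_b = tuple(e for e in elements if e not in bus_a)
            # unordered pair of buses: swapping the two buses gives the same configuration
            seen.add(frozenset({frozenset(bus_a), frozenset(bus_b)}))
    return len(seen)

for n in range(3, 8):
    print(f"n={n}: enumerated {unique_two_bus_configurations(n)}, "
          f"formula 2^(n-1) - n = {2 ** (n - 1) - n}")
```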
[...] approach, which could prevent reinforcement learning models from learning good representations for the grid.

The verification step for the topological action asserts that the latter is of the expected size, and that its elements are either None or a one-hot vector of the expected shape.

D. Reward

The reward is designed as a sum of 4 subrewards, each intended to focus on one aspect of grid operation:
1) Line usage subreward
2) Cut load subreward
3) Action cost subreward
4) Distance to the reference grid subreward
In the following, we define and give insight about each subreward.

Line usage subreward: Ideally, dispatchers should avoid situations where a line is overflowed. Given a timestep t, we note (fa_{i,t})_i the set of current flows and (th_i)_i the set of thermal limits. We can use the following formula to count the number n_o of overflowed lines:

n_o = Σ_{i=1}^{N_lines} 1[ fa_{i,t} / th_i ≥ 1 ]

However, this modelization does not give any information on the usage of the lines, except for the overflowed ones. With such a formula, there is no explicit way to discriminate two situations where the numbers of overflowed lines are identical, since the reward would be identical. Ideally, we would like all the lines to use as little of their capacity as possible, rather than some lines exceeding their limits and others using close to nothing. Besides, another drawback of the formula appears when some lines have a ratio slightly below 1 and others slightly above 1. The first group would not be considered as overflowed while the second would increase the reward. In other words, this formula is highly sensitive to noise when the ratios are close to 1. Because of these flaws, we introduce a modified formula that we call the line usage reward:

Σ_{i=1}^{N_lines} ( fa_{i,t} / th_i )^2

Note that we use the square of the ratio for computing the line usage. This allows us to have non-negative ratios, and also to amplify the impact of overflowed lines and minimize the impact of secured lines. Challengers can modify the reward by using an absolute value instead of the square. The subreward is multiplied by -1 before being sent to the players, such that models will minimize line usage.

Cut load subreward: A major aspect of grid operation is to carry electricity such that every consumption has the expected active and reactive values. If a consumption is cut, this means that a group of people won't have access to electricity for a certain amount of time. We would like to avoid these situations at all costs. Consequently, by design, the game will stop the current playing epoch once a load has been cut, i.e. once there is not enough incoming electricity to satisfy the local demand. When such an event happens, the game will return a specific reward; it is up to the player to load the next epoch.

In real life, situations might happen where the grid has overflowed lines. In that case, those lines are dynamically disconnected to protect them. After such disconnections, the grid will naturally converge to an equilibrium, as a consequence of the topology and the laws of physics. When the equilibrium is reached, other lines can then be overflowed, since the whole grid has the same injections but a lower capacity. Recursively, this can create a cascading failure, where disconnected overflowed lines provoke new overflowed lines, which could eventually isolate a consumption. The game consequently has a cascading failure module that simulates cascading failures after an Agent has taken an action. If the cascading failure does not disconnect any consumption, the reward remains unchanged. On the other hand, if a load is disconnected, then the game will stop the epoch and return the corresponding cut load reward.

Action cost subreward: The cost of putting a line out-of-service or changing the topology (pattern of line interconnections) is integrated within the reward computation. It is motivated by real-life conditions, where those actions need to be performed manually by specialized teams and at specific locations. The costs of one line disconnection, one line reconnection, or one substation topological change are identical. The action cost reward sums the cost of those atomic operations for every action taken by the Agent. More precisely, the value of this subreward is the cost of one action, multiplied by the number of disconnections added to the number of reconnections and the number of topological changes.

Distance to the reference grid subreward: Another aspect of grid operation gravitates around the idea that dispatchers perform well with a given topological setting. We would like the Agents to ultimately change the topology of the grid in response to potential harms, such that the grid topology stays not far from a reference topology. This subreward computes the distance of a grid to a reference grid by summing the number of local topological changes needed to transform the former into the latter.

V. RESULTS

In order to demonstrate use cases of the proposed environment, we developed basic baselines relying on hand-crafted algorithms. We measured the performance of each model by running similar experiments. Besides, we applied models to resolve crisis situations, where a power grid suffers from a line overflow to be eliminated.
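As a bridge to the experiments, here is a minimal sketch combining the four subrewards described in the Reward subsection above. The line usage term follows the squared-ratio formula; the penalty values and signs for the other three terms are our own illustrative choices, since the text fixes only their qualitative behavior. The flow values reuse the toy case presented in the next subsection.

```python
import numpy as np

def line_usage_subreward(flows, thermal_limits):
    # negated sum of squared usage ratios, as in the line usage formula above
    ratios = np.asarray(flows) / np.asarray(thermal_limits)
    return -float(np.sum(ratios ** 2))

def action_cost_subreward(n_switch_offs, n_switch_ons, n_topology_changes, unit_cost=1.0):
    # identical cost per atomic operation, summed over the action (sign is our choice)
    return -unit_cost * (n_switch_offs + n_switch_ons + n_topology_changes)

def distance_to_reference_subreward(topology, reference_topology):
    # number of local topological changes between the current and reference grids
    return -float(np.sum(np.asarray(topology) != np.asarray(reference_topology)))

def cut_load_subreward(any_load_cut, penalty=100.0):
    # large penalty (illustrative value) when a consumption has been cut
    return -penalty if any_load_cut else 0.0

flows = np.array([82.86, 67.14, 45.13, 77.99, 72.01])   # MW, from the toy case below
limits = np.full(5, 100.0)                               # MW
total = (line_usage_subreward(flows, limits)
         + action_cost_subreward(0, 0, 0)
         + distance_to_reference_subreward([0, 0, 0, 0], [0, 0, 0, 0])
         + cut_load_subreward(False))
print(round(total, 3))   # about -2.468 for this do-nothing situation
```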
A. Baselines implementations

1) Do-nothing policy: The Agent does not take any action.

2) Random line-disconnection policy: The grid has one and only one disconnected branch (line put out-of-service) at each time step, chosen randomly by the Agent. Equivalently, the Agent chooses a random branch to disconnect and reconnects the previously disconnected line.

3) Random node-splitting policy: The Agent selects one substation at each time step, and randomly changes its local topological configuration. Note that topology changes are not reverted apart from the action of the Agent: they perpetuate in time until further changed.

4) Greedy line-disconnection policy: At each time step, the policy simulates every 1-line disconnection and applies the action that maximizes the reward. Formally, the Action Space A is of size 186 for the IEEE-118 and is made of every possible branch disconnection. This is equivalent to a Tree Search rooted at the current state s_t, with leaf i being the reward for the disconnection of branch i, and 1-line disconnections as actions. The policy chooses the optimal action

a*_t = argmax_{a_t ∈ A} R(s_{t+1}, a_t)

We are in the process of obtaining benchmarks on these baselines, using a same-context experiment.

B. A practical case: curing a crisis situation

We conduct an experiment treating a toy use case in the proposed framework. The grid used in this experiment, displayed in Fig. 5, is made of only 4 substations, 2 productions, 2 consumptions and 4 branches. By construction, we set every branch to a thermal limit threshold of 100MW.

Fig. 5: Illustration of case4gs. The purple circles are the 4 substations numbered from 1 to 4, the 2 productions are the green squares G1 and G4, and the 2 consumptions are the yellow ones C2 and C3. This is the reference grid for our experiment, and no elements are added throughout the game (only disconnections and changes of the substations' inner line interconnections).

With this example grid, we devised a toy experiment consisting in playing only two time steps of the game, with hand-selected injections detailed below, in the DC approximation (such that we can neglect reactive values). The Agent is the Do-Nothing Agent, which does not take any action throughout the whole game. The injections will create an overflow, and we show a curative action that induces a higher reward. Specifically, with the notations of Fig. 5, the injections are displayed in Fig. 6. Note that for each time step, the sum of productions equals the sum of consumptions. This is induced by the DC approximation, for which the lines are lossless.

              G1    G4    C2    C3
Time step 0   150   50    50    150
Time step 1   200   50    100   150

Fig. 6: Precomputed values of the injections to be loaded at each time step of the game.

Between time step 0 and time step 1, the consumption C2 rises by 50MW. Generator G1 is incremented by the same amount to produce enough electricity. A representation of the initial state of the game, i.e. the first state observed by the do-nothing Agent, is displayed in Fig. 7(a). Substations 2 (top right) and 4 (bottom left) have two nodes. This comes from the fact that they are made of four elements (substation 2 has three power lines and one consumption, substation 4 has three power lines and one generator). On the contrary, substations 1 (top left) and 3 (bottom right) only have one node, because there cannot be more than one group of directly connected elements such that there are at least two elements per group (because electricity needs to exit). For both two-node substations, the initial line interconnection configuration is to have all elements directly connected, i.e. on the same bus.

For step A, the do-nothing policy will not output any action at timestep 0. Formally, the Environment will apply the action of the player onto the grid, and discard the flows that are not pertinent anymore (because the flows are a function of the injections and the explicit grid topology). The subsequent grid, obtained by computing a load-flow using Matpower after taking the previous set of injections and performing no topological change, is the same as Fig. 7(a). At this point, there is no overflow, hence no cascading failure simulation, so the cut load subreward r_load cut = 0. With the hypothesis that the initial grid is the reference grid, the subreward of the distance to the reference grid r_distance to ref = 0. No action was performed, so the cost of the operation r_action cost = 0. Finally, we compute the subreward r_line usage of the line capacity usage:

r_line usage = − Σ_{i=1}^{5} (f_i / th_i)^2 = − (1/100^2) Σ_{i=1}^{5} f_i^2
             = − (82.86^2 + 67.14^2 + 45.13^2 + 77.99^2 + 72.01^2) / 100^2
r_line usage ≈ −2.468

because, by construction of this situation, all of the thermal limits are equal to 100 MW. Note that the flows of the lines connecting productions and consumptions to the rest of the system do not count in the reward computation, since their values do not depend on the Agent. The timestep computed reward [...]
(a) Steps L and A, time step t = 0: initial grid configuration (observation o0). The Do-Nothing Agent performs no learning and no action.

(a) Steps L and A, time step t = 1: grid configuration (observation o1) carried over from the previous time step (see Fig. 7). The Do-Nothing Agent performs no learning and no action.
(a) Step A: an Agent applies a node-splitting action on the top right substation, given the observation o1 of Fig. 7(c). Note that we do not display the flows here (they first need to be computed by step R below).

(b) Step R: state s1.5 obtained after computing the load-flow of the above state. Contrary to the do-nothing policy, there is no overflow here, thus no power outage (and also no cascading failure simulation). We would expect a higher reward than in the case of an outage (depending on the cut load subreward value).

Fig. 9: A candidate action that can avoid the crisis situation of an outage, given observation o1. The Agent applied a curative node-splitting action, which does not provoke any overflow once applied. In that case, the Agent does not lose the game, and the subsequent LARSO cycles can be performed.

[...] which underlines a global outage of the power system. The outage implies that at least one consumption has been cut. This is a game-over situation: the game will reinitialize the overall grid structure and load the remaining timesteps to be played. Before that, the Environment returns the value of the cut load subreward. We show in Fig. 9 a node-splitting action that can avoid the outage given the previous state s1. Specifically, we directly connect the top right consumption with the top left substation, and the bottom right substation with the top right one.

VI. CONCLUSION AND FUTURE WORK

In this work, we try to tackle the task of applying machine learning, and more specifically reinforcement learning, to the task of operating an extra high voltage power grid in safe conditions. To do so, we propose a novel game environment that is able to simulate a power grid through time given an Agent policy. For reinforcement learning integration, the game is modeled using a Partially Observable Infinite Markov Decision Process. At each time step, an Agent takes an action given an observation, which leads to a first inner state that is used to compute a reward (along with the chosen action). The game then loads the next set of injections, computes the resulting state and provides the Agent with a new observation. Our approach can be used with multiple levels of simplicity, notably through the type of algorithm used to compute load-flows. States are not directly visible to the Agents, but rather exported into observations. The reward computation involves multiple inner subrewards, which are designed to reflect some of the goals of TSO dispatchers. We demonstrated how the game processes two timesteps with a do-nothing Agent, and compared it with an Agent that performs one grid modification.

For future work, we plan to design game situations that are interesting for the player and include situations of crisis leading to eventual cascading effects, i.e. meta-stable states, that pose difficult problems to be solved. We need to reverse-engineer the problem in an adversarial way in some sense: one player creates difficult problems and the other tries to solve them. The game will come with a Graphical User Interface, which will allow users to manually play the game, and could be used to watch policies in action. We will also integrate new features into the game in order to make it closer to real-life conditions. Maintenance schemes and random branch hazards will be further integrated, along with random noise injection into the planned productions and demands. Once the user interface is operational, we plan to spend time optimizing the code such that computation can be fast enough to tackle the number of necessary learning steps. We will also establish more complex baselines, focusing on additional approaches for reinforcement learning, including Actor-Critic methods ([26], [27]) and Normalized Advantage Functions ([28]). Those baselines' performance will help us tune a set of suitable hyperparameters, such as subreward values. Besides, new grid parameters will be explored for the IEEE-118, including thermal limits, such that we can tune the game difficulty.

Acknowledgements: This work was supported by INRIA and ChaLearn. Special thanks to my collaborator Kimang Khun and my advisors Isabelle Guyon, Antoine Marot, Benjamin Donnot. I am grateful to Marc Schoenauer for welcoming me at the LRI lab in the TAU group, and to RTE's advisors including Patrick Panciatici for their guidance.

REFERENCES

[1] R. S. Sutton, A. G. Barto, "Reinforcement learning: An introduction", DOI: https://doi.org/10.1016/S1364-6613(99)01331-5
[2] R. D. Zimmerman, C. E. Murillo-Sanchez and R. J. Thomas, "MATPOWER: Steady-State Operations, Planning, and Analysis Tools for Power Systems Research and Education," in IEEE Transactions on Power Systems, vol. 26, no. 1, pp. 12-19, Feb. 2011.
[3] Benjamin Donnot, Isabelle Guyon, Marc Schoenauer, Patrick Panciatici, Antoine Marot, "Introducing machine learning for power system operation support", arXiv:1709.09527v1
[4] Benjamin Donnot, Isabelle Guyon, Marc Schoenauer, Patrick Panciatici, Antoine Marot, "Optimization of computational budget for power system risk assessment", arXiv:1805.01174v1
[5] Benjamin Donnot, Isabelle Guyon, Marc Schoenauer, Antoine Marot, Patrick Panciatici, "Fast Power system security analysis with Guided Dropout", supplemental material, 2017. <hal-01649938>
[6] Ian Goodfellow, Yoshua Bengio, Aaron Courville, ”Deep Learning
Book”, MIT Press, 2016
[7] M. Mahmud, M.S. Kaiser, A. Hussain, S. Vassanelli. “Applica-
tions of Deep Learning and Reinforcement Learning to Biolog-
ical Data,” IEEE Trans. Neural Netw. Learn. Syst., 2018, doi:
10.1109/TNNLS.2018.2790388.
[8] Travers Ching, et al., ”Opportunities and obstacles for deep learning
in biology and medicine”, DOI: 10.1098/rsif.2017.0387
[9] Kun Shao, Yuanheng Zhu, Dongbin Zhao, ” StarCraft Micromanage-
ment with Reinforcement Learning and Curriculum Transfer Learn-
ing”, arXiv:1804.00810v1
[10] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis
Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker,
Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan
Hui, Laurent Sifre, George van den Driessche, Thore Graepel, Demis
Hassabis, ”Mastering the game of Go without human knowledge”,
Nature volume 550, pages 354–359 (19 October 2017)
[11] Seyed Sajad Mousavi, Michael Schukat, Enda Howley, ”Traffic Light
Control Using Deep Policy-Gradient and Value-Function Based Rein-
forcement Learning”, arXiv:1704.08883v2
[12] Benjamin Donnot, Isabelle Guyon, Marc Schoenauer, Antoine Marot,
Patrick Panciatici, ” Fast Power system security analysis with Guided
Dropout”, arXiv:1801.09870v1
[13] P. Demetriou, M. Asprou, J. Quiros-Tortos and E. Kyriakides, ”Dy-
namic IEEE Test Systems for Transient Analysis,” in IEEE Systems
Journal, vol. 11, no. 4, pp. 2108-2117, Dec. 2017.
[14] Andrey Y. Lokhov, Marc Vuffray, Dmitry Shemetov, Deepjyoti Deka,
and Michael Chertkov, ”Online Learning of Power Transmission
Dynamics”, arXiv:1710.10021v1
[15] Salar Fattahi, Javad Lavaei, and Alper Atamturk, ” A Bound Strength-
ening Method for Optimal Transmission Switching in Power Systems”,
arXiv:1711.10428v1
[16] Le Pham Tuyen, Ngo Anh Vien, Abu Layek, TaeChoong Chung,
”Deep Hierarchical Reinforcement Learning Algorithm in Partially
Observable Markov Decision Processes”, arXiv:1805.04419v1
[17] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever,
Ruslan R. Salakhutdinov, ” Improving neural networks by preventing
co-adaptation of feature detectors”, arXiv:1207.0580v1
[18] Metaxiotis K, Kagiannas A, Askounis D, Psarras J. Artificial intelli-
gence in short term electric load forecasting: a state-of-the-art survey
for the researcher. Energy conversion and Management. 2003 Jun
1;44(9):1525-34.
[19] Hippert HS, Pedreira CE, Souza RC. Neural networks for short-
term load forecasting: A review and evaluation. IEEE Transactions
on power systems. 2001 Feb;16(1):44-55.
[20] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves,
Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller, ”Playing Atari
with Deep Reinforcement Learning”, arXiv:1312.5602v1
[21] Yann LeCun, Yoshua Bengio. ”Convolutional networks for images,
speech, and time-series”. In M. A. Arbib, editor, The Handbook of
Brain Theory and Neural Networks. MIT Press, 1995.
[22] Xu B, Wang N, Chen T, Li M. Empirical evaluation of rectified
activations in convolutional network. arXiv preprint arXiv:1505.00853.
2015 May 5.
[23] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Lau-
rent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis
Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman,
Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Tim-
othy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Grae-
pel, Demis Hassabis, ”Mastering the game of Go with deep neural
networks and tree search”, Nature volume 529, pages 484–489 (28
January 2016)
[24] Sergey Ioffe, Christian Szegedy, ” Batch Normalization: Accelerat-
ing Deep Network Training by Reducing Internal Covariate Shift”,
arXiv:1502.03167v3
[25] K. He, X. Zhang, S. Ren and J. Sun, ”Deep Residual Learning for
Image Recognition,” 2016 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778.
[26] Konda VR, Tsitsiklis JN. Actor-critic algorithms. In Advances in neural
information processing systems 2000 (pp. 1008-1014).
[27] Bhatnagar S, Sutton RS, Ghavamzadeh M, Lee M. Natural actor–critic
algorithms. Automatica. 2009 Nov 1;45(11):2471-82.
[28] Gu S, Lillicrap T, Sutskever I, Levine S. Continuous deep q-learning
with model-based acceleration. In International Conference on Machine
Learning 2016 Jun 11 (pp. 2829-2838).
APPENDIX I
ABOUT THE IEEE FORMAT FOR POWER GRIDS
The IEEE format makes it possible to represent a steady state of a power grid in a condensed manner. We use version 2 of this format, which differs from its first iteration by the variables it stores. The IEEE format is used to compute load-flows with the open-source software Matpower.
More precisely, the IEEE format is made of at least three matrices (+ one version value and one value indicating the base
MVA of the system):
• The bus matrix: stores values about the substations and the consumptions
• The generator matrix: parameters related to the generators
• The branch matrix: stores the values related to the flows of the system
Here, we list the parameters stored in each of those three matrices. For each one, the columns indicate the parameters, and the rows correspond to the objects.
Bus matrix
The bus matrix is a matrix of shape n × 13, which implies that there are 13 parameters for every bus. The parameters
are (i.e. the columns):
1) ID of the bus
2) Type of the bus: 1 for a PV bus, 2 for a PQ bus, 3 for the slack bus (or reference bus), and 4 for isolated bus (not
linked to any other element)
3) Real power demand
4) Reactive power demand
5) Shunt conductance (some substations have shunts)
6) Shunt susceptance
7) ID indicating the area of the bus (not used in our case)
8) Voltage magnitude
9) Voltage angle
10) Base voltage (total voltage is magnitude times base voltage)
11) ID indicating the zone of the bus (not used in our case)
12) Maximum voltage magnitude
13) Minimum voltage magnitude
Some of these parameters, such as the voltage magnitude, need to be specified in per-unit. This is the expression of some quantities as fractions of a defined base unit quantity (for the voltage magnitude, the base unit quantity is the base voltage parameter). We do not explicitly use the area or zone parameters, since we consider every element to be within the same grid.
One thing to note about the IEEE format is that there is no notion of substation. In fact, Matpower only uses the notion of bus. We use some tricks, which include artificially creating nodes with the same parameters, for actions such as node splitting.
Gen matrix
The matrix of generators has n rows, where n is the number of generators of the grid, and 10 parameters (columns), which are:
1) ID of the bus to which the generator is directly connected
2) Real power output
3) Reactive power output
4) Maximum reactive power output
5) Minimum reactive power output
6) Voltage magnitude setpoint
7) Base MVA of the generator
8) Status of the generator (0 out-of-service, >0 in service)
9) Maximum real power output
10) Minimum real power output
Branch matrix
The branch matrix has 11 parameters, plus 4 extra values per branch representing the flows. Branches are identified using a from bus and a to bus (by convention). The parameters of a branch are:
1) ID of from bus
2) ID of to bus
3) Resistance
4) Reactance
5) Susceptance
6) Long term rating
7) Short term rating
8) Emergency rating
9) Transformer phase shift angle
10) Branch status (1 in service, 0 out-of-service)
On top of that for steady-states, there are 4 extra columns:
1) P at origin
2) Q at origin
3) P at destination
4) Q at destination
Branches are represented using origin and destination values, since in AC mode there can be losses (a function of some parameters, including the branch resistance).
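As an illustration, a toy case could be laid out as three NumPy arrays following the column orders and type codes listed above. This is a sketch of the layout as described in this appendix, not a complete or validated Matpower case file; all numerical values are made up.

```python
import numpy as np

# Bus matrix: one row per bus, columns in the order enumerated above
# (id, type, Pd, Qd, Gs, Bs, area, Vm, Va, baseKV, zone, Vmax, Vmin).
bus = np.array([
    [1, 3, 0.0,  0.0,  0, 0, 1, 1.0, 0.0, 230, 1, 1.1, 0.9],   # slack bus (type 3)
    [2, 2, 50.0, 10.0, 0, 0, 1, 1.0, 0.0, 230, 1, 1.1, 0.9],   # load bus (PQ code per the list above)
])

# Generator matrix: one row per generator
# (bus id, Pg, Qg, Qmax, Qmin, Vg, baseMVA, status, Pmax, Pmin).
gen = np.array([
    [1, 50.0, 0.0, 30.0, -30.0, 1.0, 100, 1, 200.0, 0.0],
])

# Branch matrix: one row per branch
# (from bus, to bus, R, X, B, long rating, short rating, emergency rating, shift angle, status).
branch = np.array([
    [1, 2, 0.01, 0.1, 0.0, 100, 110, 120, 0.0, 1],
])

case = {"version": 2, "baseMVA": 100.0, "bus": bus, "gen": gen, "branch": branch}
print({k: (v.shape if isinstance(v, np.ndarray) else v) for k, v in case.items()})
```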
APPENDIX II
POWER FLOW EQUATIONS
This section is a rapid overview of the problem that power flow software needs to solve.
A. Model of the power grid
Let G be a grid with n nodes and m power lines.
The nodes of G are divided into two parts, namely the generator nodes, where at least one production unit (power plant, wind plant, etc.) participating in voltage control is connected⁵, and those called load nodes.
Node i and node j are connected by an element with complex impedance Z_{i,j}. If nothing connects the two, one can think of Z_{i,j} = ∞.
Often, it is more convenient to think of the admittance Y instead of the impedance Z. The admittance is nothing more than:

Y_{i,j} = 1 / Z_{i,j}
So if two nodes i and j are not connected, we have Y_{i,j} = 0. Ohm's law (also called Kirchhoff's voltage law) between node k and node j, in complex form, can be written as:

i_{k→j} = Y_{k,j} × (V_j − V_k)
There is another fundamental law in a power grid, Kirchhoff's current law. It states that, at a node k:

i_k = Σ_{j=1, j≠k}^{n_nodes} i_{k→j}

where i_k is the total complex current injected at node k, and i_{k→j} denotes the (complex) current flowing from node k to node j.
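These two laws translate directly into complex arithmetic. The sketch below builds the nodal admittances from given line impedances and evaluates the injected currents for assumed voltages; all numerical values are arbitrary illustrations.

```python
import numpy as np

n = 3
# Complex impedance of the line between each connected pair (k, j).
lines = {(0, 1): 0.01 + 0.10j, (1, 2): 0.02 + 0.15j}

# Admittance: Y[k, j] = 1 / Z[k, j] for connected pairs, 0 otherwise.
Y = np.zeros((n, n), dtype=complex)
for (k, j), z in lines.items():
    Y[k, j] = Y[j, k] = 1.0 / z

# Assumed complex node voltages (purely illustrative values).
V = np.array([1.00 + 0.00j, 0.98 - 0.02j, 0.97 - 0.03j])

# Ohm's law per line: i_{k->j} = Y[k, j] * (V[j] - V[k]).
def branch_current(k, j):
    return Y[k, j] * (V[j] - V[k])

# Kirchhoff's current law: the current injected at node k equals the sum
# of the currents leaving k towards its neighbours.
i_injected = np.array([sum(branch_current(k, j) for j in range(n) if j != k)
                       for k in range(n)])
print(i_injected)
```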
B. Equations to satisfy
A load-flow is a computation that takes as input:
• the real power for all load nodes PD
• the reactive power for all load nodes QD
• the real power for all generator nodes PG
• the voltage magnitude |V | for all generator nodes
• the voltage angle Θ for the slack bus
• the voltage magnitude |V | for the slack bus
With this information, a load-flow computes, for each load bus, the voltage angle Θ_l and magnitude |V|_l, and then derives the other quantities of interest, such as the active power flow, the reactive power flow, or the current flow on each power line of the system.
⁵ Actually, for the system to be properly specified, one node where there is a generator is called the slack bus.
The power flow equations are, for each node (slack node, production node or load node) i of the power grid:
0 = −P_i + Σ_{k=1}^{N} |V|_i |V|_k ( G_{i,k} cos(Θ_i − Θ_k) + B_{i,k} sin(Θ_i − Θ_k) )   for the real power

0 = Q_i + Σ_{k=1}^{N} |V|_i |V|_k ( G_{i,k} sin(Θ_i − Θ_k) − B_{i,k} cos(Θ_i − Θ_k) )    for the reactive power
where:
• P_i is the real power injected at this node
• G_{i,k} is the real part of the element of the bus admittance matrix, i.e. the real part of the admittance of the line connecting bus i to bus k (if any) or 0 (if not)
• B_{i,k} is the imaginary part of the element of the bus admittance matrix, i.e. the imaginary part of the admittance of the line connecting bus i to bus k (if any) or 0 (if not)
For the system to be fully determined by these sets of equations, the equations are not written for the slack bus, and only the real-power equation is written for the production nodes.
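For concreteness, here is a sketch of how these mismatch equations can be evaluated for a candidate operating point, given the real part G and imaginary part B of the bus admittance matrix. The numbers are toy values; an actual load-flow solver (e.g. Newton-Raphson) would iterate until these residuals vanish on the appropriate buses.

```python
import numpy as np

def power_mismatch(P, Q, Vm, theta, G, B):
    """Residuals of the real and reactive power-flow equations above."""
    n = len(Vm)
    dP = np.empty(n)
    dQ = np.empty(n)
    for i in range(n):
        s_p = sum(Vm[i] * Vm[k] * (G[i, k] * np.cos(theta[i] - theta[k])
                                   + B[i, k] * np.sin(theta[i] - theta[k]))
                  for k in range(n))
        s_q = sum(Vm[i] * Vm[k] * (G[i, k] * np.sin(theta[i] - theta[k])
                                   - B[i, k] * np.cos(theta[i] - theta[k]))
                  for k in range(n))
        dP[i] = -P[i] + s_p       # real-power residual
        dQ[i] = Q[i] + s_q        # reactive-power residual
    return dP, dQ

# Toy 2-bus system with arbitrary illustrative numbers.
G = np.array([[0.5, -0.5], [-0.5, 0.5]])
B = np.array([[-5.0, 5.0], [5.0, -5.0]])
P = np.array([0.3, -0.3])          # injected real power per node
Q = np.array([0.1, -0.1])          # injected reactive power per node
Vm = np.array([1.0, 0.98])
theta = np.array([0.0, -0.05])
print(power_mismatch(P, Q, Vm, theta, G, B))
```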
Once these quantities have been computed, one can compute the power flows on each element of the network. For example, for a given line connecting bus i to bus k with admittance Y, at the origin node i having conductance S_i and susceptance B_i:

P_{i→k} = |V_i| |V_k| Y sin(Θ_i − Θ_k) + |V_i|^2 S_i                  (6)
Q_{i→k} = − |V_i| |V_k| Y cos(Θ_i − Θ_k) + |V_i|^2 (Y − B_i)          (7)
I_{i→k} = sqrt( P_{i→k}^2 + Q_{i→k}^2 ) / |V_i|                        (8)
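Equations (6)-(8) transcribe directly into code; the numbers below are illustrative only and follow the notation of this paragraph.

```python
import numpy as np

def line_flows(Vi, Vk, theta_i, theta_k, Y, Si, Bi):
    """Active/reactive flows and current at the origin of line i->k, per eq. (6)-(8)."""
    P_ik = Vi * Vk * Y * np.sin(theta_i - theta_k) + Vi ** 2 * Si
    Q_ik = -Vi * Vk * Y * np.cos(theta_i - theta_k) + Vi ** 2 * (Y - Bi)
    I_ik = np.sqrt(P_ik ** 2 + Q_ik ** 2) / Vi
    return P_ik, Q_ik, I_ik

# Illustrative numbers only.
print(line_flows(Vi=1.0, Vk=0.98, theta_i=0.0, theta_k=-0.05, Y=10.0, Si=0.0, Bi=0.2))
```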
C. DC approximation
For more detailed information, the power flow derivations are given in DCPowerFlowEquations.pdf. This section is greatly inspired by "DC power flow in unit commitment models", chapter 3. In this section we will suppose that there are neither transformers nor phase shifters. These two objects can of course be taken into account in the DC approximation, as shown in the two documents above.
In this part, we present one of the most used models to approximate the load-flow equations. In return, some results of the AC model won't be accessible, for example the losses or the voltage magnitudes. Despite these drawbacks, the DC modeling has two main advantages: first of all, it always finds a solution to its equations, and, more importantly, it is much faster to compute.
The impact of each of these assumptions on the power-flow equations is discussed now:
1) R ≪ X. This assumption has a big impact on the equations. First, it implies that the losses are fully neglected.
And by definition, we have:

Y = G + jB

thus:

G = R / (R² + X²) → 0        as R → 0

and

B = −X / (R² + X²) → −1/X    as R → 0
2) Θ_i − Θ_k ≈ 0. This allows a linearization of the problem, as the trigonometric functions sin and cos are approximated by the identity and the constant 1, respectively (first-order approximation). The power flow equations then become:

0 = −P_i + Σ_{k=1}^{N} |V|_i |V|_k B_{i,k} (Θ_i − Θ_k)   for the real power

0 = Q_i + Σ_{k=1}^{N} |V|_i |V|_k B_{i,k}                for the reactive power

3) |V|_j ≈ |V|_nom. The last non-linearity in the previous equations arises from the factor |V|_i |V|_k. Assuming that |V|_j ≈ |V|_nom makes it disappear. This is also a very strong assumption, preventing us from getting the voltage magnitudes as a result of the DC approximation. This leads to:

|V|_i |V|_k ≈ |V|_nom × |V|_nom
So at the end, the equations are:

P_i = Σ_{k=1, k≠i}^{N} B_{i,k} (Θ_i − Θ_k)   for the real power       (9)

Q_i = − Σ_{k=1}^{N} B_{i,k} = 0              for the reactive power   (10)
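Under these assumptions, equation (9) is linear in the voltage angles, so a DC load-flow reduces to solving a linear system with the slack angle fixed at zero. A minimal sketch with made-up susceptances:

```python
import numpy as np

# Line susceptances B[i, k] (symmetric, 0 where there is no line), 3-node toy grid.
B = np.array([
    [0.0, 10.0, 10.0],
    [10.0, 0.0, 10.0],
    [10.0, 10.0, 0.0],
])
P = np.array([0.0, 1.0, -1.0])       # net injections; node 0 is the slack

# Nodal matrix of equation (9): P_i = sum_k B[i, k] * (theta_i - theta_k).
n = len(P)
A = np.diag(B.sum(axis=1)) - B

# Fix theta_slack = 0 and solve the remaining (n-1) x (n-1) system.
theta = np.zeros(n)
theta[1:] = np.linalg.solve(A[1:, 1:], P[1:])

# Resulting DC flow on each line: F[i, k] = B[i, k] * (theta_i - theta_k).
F = B * (theta[:, None] - theta[None, :])
print(np.round(theta, 4))
print(np.round(F, 4))
```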