Design and Implementation of An Environment For Learning To Run A Power Network (L2RPN)

Abstract— This report summarizes work performed as part of an internship at INRIA, in partial requirement for the completion of a master's degree in mathematics and informatics. The goal of the internship was to develop a software environment to simulate electricity transmission in a power grid and the actions performed by operators to maintain this grid in security. Our environment lends itself to automating the control of the power grid with reinforcement learning (RL [1]) agents, assisting human operators. It is amenable to organizing benchmarks, including a challenge in machine learning planned by INRIA and RTE for 2019. Our framework, built on top of open-source libraries, is available at https://github.com/MarvinLer/pypownet. In this report we present intermediary results and the usage of the environment in the context of a Reinforcement Learning (RL) game.

I. INTRODUCTION

This project addresses technical aspects related to the transmission of electricity in extra high voltage and high voltage power networks (63kV and above), such as those managed by the company RTE (Réseau de Transport d'Electricité), the French TSO (Transmission System Operator). Numerous improvements of the efficiency of energy infrastructure are anticipated in the next decade, from the deployment of smart grid technology in power distribution networks to more accurate consumption predictive tools. As we progress in deploying renewable energy, harnessing wind and solar power to transform it into electric power, we also expect to see growth in power demand for new applications such as electric vehicles. Electricity is known to be difficult to store at an industrial level. Hence supply and demand on the power grid must be balanced at all times to the extent possible. Failure to achieve this balance may result in network breakdown and subsequent power outages of various levels of gravity. Notably, shutting down and restarting power plants (particularly nuclear power plants) is very difficult and costly, since there is no easy way to switch generators on and off. Many consumers, including hospitals and people hospitalized at home as well as factories, critically suffer from power outages. Using machine learning (and in particular reinforcement learning) may allow us to better optimize the operation of the grid, eventually leading to reduced redundancy in transmission lines, better utilization of generators, and lower power prices. The goal of this project is to prepare a data science challenge to engage the scientific community in helping to solve this difficult problem.

RTE is a Transmission System Operator (TSO). One of the objectives of TSOs is to route electricity from productions to consumptions using a power grid, under the constraint of avoiding equipment failure. A typical failure we are interested in is an unplanned line outage resulting from overheating. Such incidents happen when lines are subject to power flows greater than a nominal threshold value. To avoid line failures (and possible subsequent cascading failures), operators (dispatchers) have a set of actions at their disposal: they can locally modify the line interconnections, switch power lines on or off, or change electricity production. The difficulty of the task arises from the complexity of the network architecture, also called grid topology, which frequently changes due to events such as hardware failures (e.g. due to weather conditions such as thunderstorms), planned maintenance or preventive actions. On top of that, rising renewable energies are less predictable than conventional production systems (e.g. nuclear plants), bringing more uncertainty to the production schemes. In this context, we are interested in developing tools that will assist dispatchers in keeping a power grid safe and facing the increasing complexity of their task. This work builds on top of work performed by Benjamin Donnot [3][4][5] as part of his PhD thesis and by Joao Araujo, an intern who performed preliminary work on the subject last summer.

Recent work in deep learning [6] has underlined the potential of deep neural networks in solving complex tasks ([7], [8]). For classification and regression tasks, they are usually trained using supervised learning, which necessitates a labeled dataset. In our case, a suitable dataset could be made of pairs of grid situations and dispatchers' curative actions, such that the models are trained by copying (and hopefully generalizing) the dispatchers' actions given a grid state (or a temporal chronic of grid snapshots). Unfortunately, we do not have access to such labeled data providing preventive or remedial actions of dispatchers for given crisis situations, for the very-high voltage grid¹.

¹ All of the actions of the dispatchers are recorded, but without proper annotations, so their motivations are not accurately documented. Besides, a lot of these actions are anticipative, which necessitates additional amounts of data including consumption predictions and planned productions.

This prompted us to investigate methods of reinforcement learning. Recent papers ([9], [10], [11]) managed to successfully apply reinforcement learning to high-dimensional temporal tasks. One specific aspect of our problem is that the power grid can be accurately simulated with a physical simulator implementing the laws of physics (ordinary differential equations) under some quite restrictive assumptions, for example that we are in a quasi-stationary regime. These hypotheses are quite common in the power system community. Therefore our problem lends itself well to reinforcement learning, because data can be generated using an Environment simulator (a power grid physical simulator). The hope is that a trained model would implement a policy (mapping states of the network to preventive or curative actions that maintain the network in security over time) that might be used to assist human dispatchers in making the right decision. In our project, we simplified the overall problem by limiting ourselves to toy examples of grids and subsets of actions to create a "serious game" simulating semi-realistic conditions of power grid control. This game lends itself to reinforcement learning solutions. The proposed framework will be used for a challenge implemented on the Codalab platform (http://competitions.codalab.org).

[...] there cannot be only one element connected to a pole, since the electricity would have no exit point. In this work, we will constrain the substations to have at most two buses, i.e. the elements of a substation can be grouped into a maximum of two groups, in which objects are directly connected. In other words, the elements of every substation can be interconnected into one or two groups.
Fig. 2: Example of representations of the inner configuration of a substation: (a) representation a, (b) representation b. The substation is the gray ellipse. Buses are depicted as the two pink filled circles. Because buses do not have a proper physical meaning, both figures represent the same configuration. For this configuration, production G and branch L1 are directly connected, and consumption C and branch L2 are directly connected.

[...] an overflowed state. In reality, a more restrictive approach is taken. TSOs often ensure that if a component of the grid were to fail (e.g. a branch, a plant or a switch), then the whole grid would remain in security, i.e. no branches are overflowed. The verification of this realistic criterion would necessitate significant additional computational resources, which is why we do not take it into account in our study. A branch is overflowed when its flowing current is above its thermal threshold. The more current in a power line, the more it heats, which causes a dilatation phenomenon of the line. The air between the line and the ground acts as insulation and might then not be sufficient to protect nearby passers-by from electric arcs. Apart from the security of passers-by, a melted power line needs to be replaced. It takes several [...]

[...] only lowers when some branches are switched off (compared to a fully operating grid), which means that the grid is a priori more prone to overflows.

Modifying the production outputs, often called re-dispatching, is the operation of changing the amount of energy produced by some productions. The action is called "re-dispatching" because one production is lowered by a given amount that is redistributed among other productions. If the amount is not counter-weighted, there might not be enough production to satisfy the demand. Re-dispatching is expensive, because this operation requires the modification of multiple generators, which are not the property of TSOs. As such, we do not take re-dispatching into account in this work.

Node splitting represents the majority of manual interventions done by dispatchers at RTE. It is the operation of changing the interconnection configuration of elements within a substation. By definition, a substation is at the intersection of at least two branches. In fact, branches can be connected to only a subgroup of the branches connected to a substation. The operation of node splitting consists in modifying the patterns of branch interconnections. The name refers to viewing the operation as defining subnodes (or buses) of a substation, with each branch connected to none or one subnode. In the following, we limit the number of subnodes to 2.

There are essentially two ways of operating a large-scale power grid:
• A preventive mode: avoiding future failures given estimations of production and consumption schemes
• A curative mode: resolving a failure given the current [...]
[...] set), that would propose multiple curative solutions given grid states, such that operators could take decisions rapidly by selecting an action among the candidate ones.

C. Load-flow computations

A load-flow computation is the operation of computing the values of the flows within an electric grid given the grid structure, a set of injections, and a set of parameters describing the productions, loads, branches and other elements. We make the assumption that a grid subject to injections will instantaneously converge to its steady state. As such, a load-flow computation is an optimization problem subject to equations and constraints, which involves dozens of variables for the elements describing a power grid, such as the resistance and reactance of the power lines.

Most high voltage grids are operated in AC mode (alternating current), as opposed to DC mode (direct current). However, AC load-flows being complex and slow to compute, they are sometimes approximated with a DC approximation. The DC modeling makes the following simplifying assumptions:
• Branches are lossless: the input and output powers of [...]

[...] implementations define various parameters about the grid elements, including the grid structure. Various versions exist. We are particularly interested in the case IEEE-118, which is a simplification of the Californian grid. Explicitly, it has 118 substations, 56 productions and 186 branches (without counting the lines between productions and substations).

E. Reinforcement learning

Reinforcement learning (RL, [1], [16]) is a domain of machine learning that differs from supervised and unsupervised learning. Indeed, there is no supervisor in reinforcement learning, but rather a reward signal, expressing in our case the degree of satisfaction of grid security constraints. An agent (in our case emulating a dispatcher) interacts with the Environment (in our case the game), implementing a policy determining which actions are performed given the state, towards maximizing rewards. RL algorithms train a policy, which is usually a parametric function of the system state and reward so far. The set of actions in our case is taken from permitted changes in grid topology. RL systems are often described using Markov Decision Processes (MDP). A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩ such that:
• S is a finite set of states
• A is a finite set of actions [...]
Optimal value functions describe the best achievable performance in the MDP, thus solving the problem. We can prove that there exists an optimal policy: its performance is greater than or equal to that of any other policy. Such an optimal policy is intrinsically optimal with regard to both the state-value and action-value functions.

One advantage of MDPs is their capacity for sequential decision making: temporal data are important in reinforcement learning (because of the delayed reward), so the data do not need to be independent and identically distributed.

The overall mechanism of reinforcement learning consists in training an Agent by interacting with an Environment. At each time step t, the Agent takes a set of actions a_t from an Action Space, based on the current state s_t of the Environment and the previous reward R_t. The Environment then computes the next state s_{t+1} as well as a reward R_{t+1} resulting from both s_t and a_t. See Fig. 3 for a visual representation of this vanilla mechanism.

[...] training manifold of grids with one and only one switched-off line). Their method relies on a novel architecture called Guided Dropout. It is influenced by the conventional Dropout [17] commonly used in deep learning. However, instead of randomly nulling some hidden units, they adopt a scheme that controls the active units based on the input. More precisely, they build a feed-forward neural network with a set of productions, a set of loads, and a binary vector of line service status (switched on or off) as input. The model outputs a set of flows that should be close to the ground-truth flows. The model is trained using back-propagation, with a regression loss such as Mean Squared Error. In this context, some hidden units are activated only when the associated lines are disconnected, i.e. the associated line connectivity status input is 0. Their approach has better training and testing performance than the baseline approach, which consists in the same network without Guided Dropout, i.e. with all of the conditionally activated units activated (which has more parameters, so more capacity).
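The conditional activation idea can be illustrated with a small sketch. This is our own simplification, not the exact Guided Dropout architecture of [12]: a single ReLU layer whose hidden units are split into an always-active block and small per-line blocks that are enabled only when the corresponding line is out of service (status 0). All sizes and values are arbitrary.

```python
import numpy as np

def guided_gating_layer(injections, line_status, W, b, units_per_line, n_base_units):
    """Toy feed-forward layer with input-conditioned unit gating.

    Hidden units are split into a block that is always active and one small
    block per power line that is only enabled when that line is switched off
    (status 0), mimicking the conditional-activation idea described above.
    """
    h = np.maximum(0.0, injections @ W + b)       # ReLU activations before gating
    mask = np.ones_like(h)
    for l in range(line_status.shape[0]):
        start = n_base_units + l * units_per_line
        # enable the block dedicated to line l only if the line is out of service
        mask[start:start + units_per_line] = 1.0 - line_status[l]
    return h * mask

# Toy usage: 5 injection features, 3 lines, 4 always-on units + 2 units per line.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
status = np.array([1.0, 0.0, 1.0])                # line 1 is disconnected
W = rng.normal(size=(5, 4 + 3 * 2))
b = np.zeros(4 + 3 * 2)
print(guided_gating_layer(x, status, W, b, units_per_line=2, n_base_units=4))
```

With this gating, the units associated with in-service lines output zero, so the effective sub-network depends on the grid topology encoded in the input, which is the property the baseline network lacks.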
With such a modelization, an action a*_{t+1} is taken at each timestep t with state s_t, such that it maximizes the expected Q function:

a*_{t+1} = argmax_{a ∈ A} Q(s_t, a)

2) AlphaGo: One of the most important breakthroughs in the field of Artificial Intelligence over the last decade is the success of an AI over the world champion of Go. Go is a zero-sum, one versus one board game, played on a board of size 19 by 19. At each unused location, a player can put one of their pieces. The goal of the game is to control the bigger area of the board. In the past, several models with subhuman performance have been created, mainly based on tree search algorithms and enhanced by trading approximation for tree depth exploration. The game of Go is significantly harder to exhaustively simulate than chess, because the Go board is larger. For small n, there are roughly (19 × 19)^n reachable configurations from the current board within n more moves, which quickly falls out of the scope of the computation capacity of current computers.

In [23], Silver et al. present the architecture of AlphaGo. Overall, they train 3 Convolutional Neural Networks: two policy networks and one value network. Both kinds of CNN take the current state of the board as input, in the form of an image (with extra features beyond the scope of this discussion). The first policy network consists in copying expert moves, from an aggregation of 30 million positions, and reached approximately 57% accuracy on a test set. In more detail, a CNN of 13 layers with ReLU activations is trained on recorded expert moves using supervised learning. The labeled dataset consists of games of Go played by humans, which are discretized into pairs of (board state, action taken). The output of the network is a grid of the same size as the board (flattened for convenience). Given a set of parameters θ, the network then outputs a probability distribution p_θ(a|s), where a corresponds to every board location and s is the current state of the board. A second, lighter CNN is trained on the same task. It is used to make rapid simulations (the authors claim < 2µs). Inference is done by maximizing the network's probability function over the possible actions:

a* = argmax_{a=(i,j) ∈ {1,...,19}^2} p_θ(a|s)

The next step of their approach consists in improving the previously learned policy by making it play against itself, using the outcome of these games as a training signal. More formally, the previous policy is trained using policy gradient learning, by making it play against previous versions of itself. Policy gradient methods are a type of reinforcement learning algorithm which consists in optimizing parametrized policies with respect to the long-term cumulative reward using gradient descent. At this point, their trained model beat the previously best-performing Go software, Pachi, in 85% of the games.

A value network was then trained, using the third CNN, to predict the likelihood of a win given the current game state. This is similar to the classical approach of value functions, except that it is learned in this case. They stitch the trained networks together using Monte-Carlo Tree Search. Without going into further details, AlphaGo uses a mixture of the output of the value network and the result of a simulation to compute the value of a state in the Monte-Carlo tree. One last trick consists in dividing the state value by the number of times a simulation has led to this state. By doing so, there is a trade-off between exploitation (using the trained policy) and exploration (visiting new positions). The latter trick encourages exploration, since it penalizes actions that were often chosen.

AlphaGo Zero: A more recent version, AlphaGo Zero [10], achieves even better performance, not only on the game of Go, but also on Chess and Shogi (a Japanese variant of chess). A major improvement of this version is that it does not rely on expert moves. This is an advantage because it reduces the dependence on training data (e.g. recorded games of high-elo players), and leverages the importance of a simulator for reinforcement learning (in their case, the respective board games). Specifically, there is no initialization on expert behavior data. The agent learns and improves by self-playing.

Apart from this improvement, the value-function neural network (the one modeling the probability of winning given a state) and the Q neural network (the one modeling the probability of an action given a state and the reward so far) are merged into a unique CNN architecture. Without going into details about the net architecture, it leverages batch normalization [24] after some layers' outputs, on top of residual connections [25] that improve the flow of gradients. The trained neural network is then incorporated into an MCTS algorithm to choose the investigated branches more consistently. The winner receives +1 at the end of a game, while the loser gets -1.

IV. GAME DESIGN

This section describes the game setting that we design, such that it lends itself to a reinforcement learning solution. Since any such game requires defining four components: State, Action, Reward, Information, we endeavor to define first the simulation Environment and its parameterization as a State space, then the Action space and finally the Reward resulting from an observation and an action. We make explicit the Information available to the Agent to determine the next action (the observable part of the State space, also called the Observation space).

A. Environment

The game is based on a simulation Environment that emulates a power grid based on IEEE-118 of Matpower [2]. It is implemented as a Partially Observable Infinite Markov Decision Process.
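Before the formal definition below, the intended interaction pattern is a standard Gym-style loop. The sketch uses a stand-in environment with made-up dynamics; the class and method names are ours and illustrative only, not pypownet's actual API.

```python
import numpy as np

class ToyGridEnv:
    """Stand-in environment with a Gym-like interface (illustrative only)."""
    def __init__(self, n_lines=4, seed=0):
        self.n_lines = n_lines
        self.rng = np.random.default_rng(seed)
        self.flows = None

    def reset(self):
        self.flows = self.rng.uniform(0.3, 0.9, size=self.n_lines)
        return self.flows.copy()                      # observation

    def step(self, action):
        # action: one entry per line in {-1, 0, 1} (switch off / do nothing / switch on)
        self.flows = np.clip(self.flows + 0.05 * self.rng.normal(size=self.n_lines), 0, 2)
        reward = -float(np.sum(self.flows ** 2))      # e.g. a negated line-usage term
        done = bool(np.any(self.flows >= 1.0))        # stop the epoch on an overflow
        return self.flows.copy(), reward, done

def do_nothing_policy(observation):
    return np.zeros(observation.shape[0], dtype=int)

env = ToyGridEnv()
obs = env.reset()
for t in range(10):
    action = do_nothing_policy(obs)
    obs, reward, done = env.step(action)
    print(f"t={t} reward={reward:.3f} done={done}")
    if done:
        break
```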
Formally, the environment is a tuple ⟨S, A, O, P, R, Z, γ⟩ such that:
• S is a continuous set of states
• A is a finite set of actions
• O is a continuous set of observations
• P is a state transition (probability) function
• R is a reward function
• Z is an observation function
• γ is a discount factor

The MDP is partially observable partly because the Agents do not have access to the state resulting directly from their actions, as will become clearer in what follows. It is infinite because the productions and consumptions take real values.

We assume discrete time updates (at intervals to be determined; typically 5 minutes, 1 hour or 1 day). Actions are performed at unit time intervals. Thus, given a state s_t, an action a_t of the Agent, and a new set of injections x_{t+1}, the variables are updated as follows by the simulation Environment:

s_{t+0.5} = P1(s_t, a_t)            (2)
r_{t+1} = R(s_{t+0.5}, a_t)         (3)
s_{t+1} = P2(s_{t+0.5}, x_{t+1})    (4)
o_{t+1} = Z(s_{t+1})                (5)

The state s_t includes a description of the grid topology (lines in service and line interconnections) and the status of the power flows in all lines. An action a_t may consist in a change in the grid topology. The reward calculation is based both on the state and the action (some states presenting more danger than others, and some actions being more costly than others).

For the purpose of clarity, we decompose calculations using a half-way time step t + 0.5 and two state transition functions P1 and P2. This is because the calculation of the reward r_{t+1} is based on the immediate consequences of the action taken by the agent, s_{t+0.5}, prior to the (slower) application of a change in injections x_{t+1}. In the simplest case, x_t can follow a defined schedule, but it could also be a random variable. Other factors may influence P2, such as incidental or planned changes in grid topology.

More precisely, in Equation (2), function P1 implements the laws of physics of power grid systems (it is actually deterministic in our setting). In practice, the game first applies the action onto the grid, then uses Matpower to compute the resulting flows. Equation (3) then computes the reward, depending on the last state and the action of the Agent. Next, in Equation (4), function P2 performs another load-flow computation, based on the last state and the next set of injections x_{t+1}. Finally, Equation (5) compiles the information that is made available to the agent.

The role of the Agent is to devise a strategy to make optimum actions through a policy function Π(o_t; θ), which may include parameters θ adjustable by training (i.e. by reinforcement learning). The game iterates over "LARSO" cycles:

Step L: Agent gets new observation o_t and reward r_t, and updates/Learns (the parameters of) its policy Π.
Step A: Agent performs Action a_t = Π(o_t).
Step R: Environment performs the first state update (before the injection change), s_{t+0.5} = P1(s_t, a_t), to compute the Reward r_{t+1} = R(s_{t+0.5}, a_t).
Step S: Environment applies the new injections x_{t+1} and re-computes the State: s_{t+1} = P2(s_{t+0.5}, x_{t+1}).
Step O: Environment reveals the Observation o_{t+1} of s_{t+1} (and the reward already computed).

In this setting, two Matpower callbacks are done at step R (with P1) and step S (with P2). The task can be parallelized for users interested in batch reinforcement learning.

B. Observation space

An observation o_t represents the state of the grid at time step t. Among all the variables and parameters that dictate the response of the grid to a set of injections, we keep only the changing variables of the system, detailed below. Other variables are hidden from the Agent, including the parameters of the elements constituting the grid.

An observation is a fixed-size structure made of the following elements:
• Active, reactive and voltage values of the productions
• Active, reactive and voltage values of the consumptions
• Active, reactive and voltage values of the lines: one 3-tuple for each substation of each line
• Relative thermal limits
• Lines interconnection patterns
• Lines service status

The active, reactive and voltage values of both the productions and consumptions are the injections of the power grid at a given timestep. Each of these values is stored as a list of fixed size throughout the game. For IEEE-118, the lists are of size 56 for productions, and of size 99 for consumptions.

The line power flow values are stored similarly. We keep two values per line: the in-flow and the out-flow. This is justified by the fact that there are losses within lines in the AC setting. For IEEE-118, there are 186 lines.

The relative thermal limit vector is the element-wise division of the list of line flowing currents by the list of line thermal limits. More precisely, given a set of flowing currents f_t = (f_{i,t})_i and a set of associated thermal limits (one per branch) T = (th_i)_i (thermal limits are fixed through time), the relative thermal limit (r_{i,t})_i is:

∀i, r_{i,t} = f_{i,t} / th_i

This modelization condenses both the values of the current and the values of the thermal limits for every line. Consequently, a line i is overflowed iff r_i ≥ 1.

The information about the grid topology is given using a topology list, noted τ, of fixed dimension. Each element i of the topology list represents the id of the chosen configuration for substation i, except that the ids are converted to one-hot vectors.
Given a substation with n elements, and our hypothesis of a maximum of two buses per substation, the following formula gives the number of (non-unique) possible topological configurations with all objects considered switched on (with C(n, k) the binomial coefficient):

Σ_{k=0; k≠1, k≠n−1}^{n} C(n, k) = 2^n − 2n

For identifiability reasons, since we consider configurations to be equal up to bus permutation, there are 2^{n−1} − n unique possible configurations for a substation of n elements (since we count each configuration twice in the previous formula). We use a one-hot encoding of such configurations, i.e. if a substation i has n_i configurations, we use a sub-vector of dimension n_i with a 1 in the j-th position if the j-th configuration is used and 0 everywhere else.

For IEEE-118, this approach leads to a topological list whose total size (the sum of its element sizes) is approximately 10000. However, we note that only one substation is responsible for three quarters of that size, because there are 14 objects connected to it. See Fig. 4 for a plot of the distribution of the total number of (unique) interconnection configurations per substation, with all elements non-disconnected. Consequently, in order to limit the size of the topological space, we reduce the number of available topologies for some of the bigger substations.

The number of possible topological configurations of a substation, where some elements can be disconnected, is

Σ_{l=0}^{n} Σ_{k=0; k≠1, k≠l−1}^{l} C(l, k) = 2 Σ_{l=0}^{n} (2^{l−1} − l) = O(2^n)

We propose to decouple the representation of the topology into two parts: a vector of line status and a one-hot vector representing the configuration as if all the lines were in service. The line service status vector is of size the number of lines in the grid and takes binary values: 0 represents a line out-of-service, 1 a line in-service. There are exactly 2^{n−1} − n + n = 2^{n−1} values to fully represent this modelization, instead of the O(2^n) obtained when considering configurations with out-of-service elements.

C. Action space

The game allows two types of actions: the disconnection and reconnection of lines, and the modification of the grid topology. For better integration within the Gym environment, those two types of action are stored within an Action 2-tuple, such that the players need to provide one structure at each time step. Besides, the game validates that the actions proposed by the player are well formed, as described below.

1) Changing the line service status: The line service status action is encoded as a vector (a¹_{i,t})_i of size the number of lines of the grid (186 for the IEEE-118) such that:

∀i, a¹_{i,t} = { 1 : switch line i on; −1 : put line i out-of-service; 0 : do nothing to line i }
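The counting argument above can be checked numerically. The sketch below (the helper name is ours) enumerates all assignments of n elements to at most two buses, forbids buses holding exactly one element, and de-duplicates up to bus permutation; for n ≥ 3 the enumeration matches 2^(n−1) − n.

```python
from itertools import combinations

def unique_two_bus_configurations(n_elements):
    """Count the assignments of n elements to at most two buses, forbidding
    buses that would hold exactly one element, and counting configurations
    only once up to a permutation of the two buses."""
    elements = range(n_elements)
    seen = set()
    for k in range(n_elements + 1):
        if k == 1 or k == n_elements - 1:
            continue                      # a bus with a single element is forbidden
        for bus_a in combinations(elements, k):
            bus_b = tuple(e for e in elements if e not in bus_a)
            # unordered pair of buses: swapping the two buses gives the same configuration
            seen.add(frozenset({frozenset(bus_a), frozenset(bus_b)}))
    return len(seen)

for n in range(3, 8):
    print(f"n={n}: enumerated {unique_two_bus_configurations(n)}, "
          f"formula 2^(n-1) - n = {2 ** (n - 1) - n}")
```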
[...] approach, which could prevent reinforcement learning models from learning good representations for the grid.

The verification step for the topological action asserts that the latter is of the expected size, and that its elements are either None or a one-hot vector of the expected shape.

D. Reward

The reward is designed as a sum of 4 subrewards, each intended to focus on one aspect of grid operation:
1) Line usage subreward
2) Cut load subreward
3) Action cost subreward
4) Distance to the reference grid subreward
In the following, we define and give insight about each subreward.

Line usage subreward: Ideally, dispatchers should avoid situations where a line is overflowed. Given a timestep t, we note (fa_{i,t})_i the set of current flows and (th_i)_i the set of thermal limits. We can use the following formula to count the number n_o of overflowed lines:

n_o = Σ_{i=1}^{N_lines} 1[ fa_{i,t} / th_i ≥ 1 ]

However, this modelization does not give any information on the usage of the lines, except for the overflowed ones. With such a formula, there is no explicit way to discriminate two situations where the numbers of overflowed lines are identical, since the reward would be identical. Ideally, we would like all the lines to use as little of their capacity as possible, rather than some lines exceeding their limits and others using close to nothing. Besides, another drawback of the formula appears when some lines have a ratio slightly below 1 and others slightly above 1. The first group would not be considered as overflowed while the second would increase the reward. In other words, this formula is highly sensitive to noise when the ratios are close to 1. Because of these flaws, we introduce a modified formula that we call the line usage reward:

Σ_{i=1}^{N_lines} ( fa_{i,t} / th_i )^2

Note that we use the square of the ratio for computing the line usage. This allows us to have non-negative ratios, and also to amplify the impact of overflowed lines and minimize the impact of secured lines. Challengers can modify the reward by using an absolute value instead of the square. The subreward is multiplied by -1 before being sent to the players, such that models will minimize line usage.

Cut load subreward: A major aspect of grid operation is to carry electricity such that every consumption has the expected active and reactive values. If a consumption is cut, this means that a group of people won't have access to electricity for a certain amount of time. We would like to avoid these situations at all costs. Consequently, by design, the game will stop the current playing epoch once a load has been cut, i.e. once there is not enough incoming electricity to satisfy the local demand. When such an event happens, the game will return a specific reward; it is up to the player to load the next epoch.

In real life, situations might happen where the grid has overflowed lines. In that case, those lines are dynamically disconnected to protect them. After such disconnections, the grid will naturally converge to an equilibrium, as a consequence of the topology and the laws of physics. When the equilibrium is reached, other lines can then be overflowed, since the whole grid has the same injections but a lower capacity. Recursively, this can create a cascading failure, where disconnected overflowed lines provoke new overflowed lines, which could eventually isolate a consumption. The game consequently has a cascading failure module that simulates cascading failures after an Agent has taken an action. If the cascading failure does not disconnect any consumption, the reward remains unchanged. On the other hand, if a load is disconnected, then the game will stop the epoch and return the corresponding cut load reward.

Action cost subreward: The cost of putting a line out-of-service or changing the topology (pattern of line interconnections) is integrated within the reward computation. It is motivated by real-life conditions, where those actions need to be performed manually by specialized teams and at specific locations. The costs of one line disconnection, one line reconnection, or one substation topological change are identical. The action cost reward sums the cost of those atomic operations for every action taken by the Agent. More precisely, the value of this subreward is the cost of one action, multiplied by the number of disconnections added to the number of reconnections and the number of topological changes.

Distance to the reference grid subreward: Another aspect of grid operation gravitates around the idea that dispatchers perform well with a given topological setting. We would like the Agents to ultimately change the topology of the grid in response to potential harms, such that the grid topology stays not far from a reference topology. This subreward computes the distance of a grid to a reference grid by summing the number of local topological changes needed to transform the former into the latter.

V. RESULTS

In order to demonstrate use cases of the proposed environment, we developed basic baselines relying on hand-crafted algorithms. We measured the performance of each model by running similar experiments. Besides, we applied models to resolve crisis situations, where a power grid suffers from a line overflow to be eliminated.
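As a bridge to the experiments, here is a minimal sketch combining the four subrewards described in the Reward subsection above. The line usage term follows the squared-ratio formula; the penalty values and signs for the other three terms are our own illustrative choices, since the text fixes only their qualitative behavior. The flow values reuse the toy case presented in the next subsection.

```python
import numpy as np

def line_usage_subreward(flows, thermal_limits):
    # negated sum of squared usage ratios, as in the line usage formula above
    ratios = np.asarray(flows) / np.asarray(thermal_limits)
    return -float(np.sum(ratios ** 2))

def action_cost_subreward(n_switch_offs, n_switch_ons, n_topology_changes, unit_cost=1.0):
    # identical cost per atomic operation, summed over the action (sign is our choice)
    return -unit_cost * (n_switch_offs + n_switch_ons + n_topology_changes)

def distance_to_reference_subreward(topology, reference_topology):
    # number of local topological changes between the current and reference grids
    return -float(np.sum(np.asarray(topology) != np.asarray(reference_topology)))

def cut_load_subreward(any_load_cut, penalty=100.0):
    # large penalty (illustrative value) when a consumption has been cut
    return -penalty if any_load_cut else 0.0

flows = np.array([82.86, 67.14, 45.13, 77.99, 72.01])   # MW, from the toy case below
limits = np.full(5, 100.0)                               # MW
total = (line_usage_subreward(flows, limits)
         + action_cost_subreward(0, 0, 0)
         + distance_to_reference_subreward([0, 0, 0, 0], [0, 0, 0, 0])
         + cut_load_subreward(False))
print(round(total, 3))   # about -2.468 for this do-nothing situation
```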
A. Baselines implementations

1) Do-nothing policy: The Agent does not take any action.

2) Random line-disconnection policy: The grid has one and only one disconnected branch (line put out-of-service) at each time step, chosen randomly by the Agent. Equivalently, the Agent chooses a random branch to disconnect and reconnects the previously disconnected line.

3) Random node-splitting policy: The Agent selects one substation at each time step, and randomly changes its local topological configuration. Note that topology changes are not reverted apart from the action of the Agent: they perpetuate in time until further changed.

4) Greedy line-disconnection policy: At each time step, the policy simulates every 1-line disconnection and applies the action that maximizes the reward. Formally, the Action Space A is of size 186 for the IEEE-118 and is made of every possible branch disconnection. This is equivalent to a Tree Search rooted at the current state s_t, with leaf i being the reward for the disconnection of branch i, and 1-line disconnections as actions. The policy chooses the optimal action

a*_t = argmax_{a_t ∈ A} R(s_{t+1}, a_t)

We are in the process of obtaining benchmarks on these baselines, using a same-context experiment.

B. A practical case: curing a crisis situation

We conduct an experiment treating a toy use case in the proposed framework. The grid used in this experiment, displayed in Fig. 5, is made of only 4 substations, 2 productions, 2 consumptions and 4 branches. By construction, we set every branch to a thermal limit threshold of 100MW.

Fig. 5: Illustration of case4gs. The purple circles are the 4 substations numbered from 1 to 4, the 2 productions are the green squares G1 and G4, and the 2 consumptions are the yellow ones C2 and C3. This is the reference grid for our experiment, and no elements are added throughout the game (only disconnections and changes of the substations' inner line interconnections).

With this example grid, we devised a toy experiment consisting in playing only two time steps of the game, with hand-selected injections detailed below, in the DC approximation (such that we can neglect reactive values). The Agent is the Do-Nothing Agent, which does not take any action throughout the whole game. The injections will create an overflow, and we show a curative action that induces a higher reward. Specifically, with the notations of Fig. 5, the injections are displayed in Fig. 6. Note that for each time step, the sum of productions equals the sum of consumptions. This is induced by the DC approximation, for which the lines are lossless.

              G1    G4    C2    C3
Time step 0   150   50    50    150
Time step 1   200   50    100   150

Fig. 6: Precomputed values of the injections to be loaded at each time step of the game.

Between time step 0 and time step 1, the consumption C2 rises by 50MW. Generator G1 is incremented by the same amount to produce enough electricity. A representation of the initial state of the game, i.e. the first state observed by the do-nothing Agent, is displayed in Fig. 7(a). Substations 2 (top right) and 4 (bottom left) have two nodes. This comes from the fact that they are made of four elements (substation 2 has three power lines and one consumption, substation 4 has three power lines and one generator). On the contrary, substations 1 (top left) and 3 (bottom right) only have one node, because there cannot be more than one group of directly connected elements such that there are at least two elements per group (because electricity needs to exit). For both two-node substations, the initial line interconnection configuration is to have all elements directly connected, i.e. on the same bus.

For step A, the do-nothing policy will not output any action at timestep 0. Formally, the Environment will apply the action of the player onto the grid, and discard the flows that are not pertinent anymore (because the flows are a function of the injections and the explicit grid topology). The subsequent grid, obtained by computing a load-flow using Matpower after taking the previous set of injections and performing no topological change, is the same as Fig. 7(a). At this point, there is no overflow, hence no cascading failure simulation, so the cut load subreward r_load cut = 0. With the hypothesis that the initial grid is the reference grid, the subreward of the distance to the reference grid r_distance to ref = 0. No action was performed, so the cost of the operation r_action cost = 0. Finally, we compute the subreward r_line usage of the line capacity usage:

r_line usage = − Σ_{i=1}^{5} (f_i / th_i)^2 = − (1/100^2) Σ_{i=1}^{5} f_i^2
             = − (82.86^2 + 67.14^2 + 45.13^2 + 77.99^2 + 72.01^2) / 100^2
r_line usage ≈ −2.468

because, by construction of this situation, all of the thermal limits are equal to 100 MW. Note that the flows of the lines connecting productions and consumptions to the rest of the system do not count in the reward computation, since their values do not depend on the Agent. The timestep computed reward [...]
(a) Steps L and A, time step t = 0: initial grid configuration (observation o0). The Do-Nothing Agent performs no learning and no action.

(a) Steps L and A, time step t = 1: grid configuration (observation o1) carried over from the previous time step (see Fig. 7). The Do-Nothing Agent performs no learning and no action.
(a) Step A: an Agent applies a node-splitting action on the top right substation, given the observation o1 of Fig. 7(c). Note that we do not display the flows here (they first need to be computed by step R below).

(b) Step R: state s1.5 obtained after computing the load-flow of the above state. Contrary to the do-nothing policy, there is no overflow here, thus no power outage (and also no cascading failure simulation). We would expect a higher reward than in the case of an outage (depending on the cut load subreward value).

Fig. 9: A candidate action that can avoid the crisis situation of an outage, given observation o1. The Agent applied a curative node-splitting action, which does not provoke any overflow once applied. In that case, the Agent does not lose the game, and the subsequent LARSO cycles can be performed.

[...] which underlines a global outage of the power system. The outage implies that at least one consumption has been cut. This is a game-over situation: the game will reinitialize the overall grid structure and load the remaining timesteps to be played. Before that, the Environment returns the value of the cut load subreward. We show in Fig. 9 a node-splitting action that can avoid the outage given the previous state s1. Specifically, we directly connect the top right consumption with the top left substation, and the bottom right substation with the top right one.

VI. CONCLUSION AND FUTURE WORK

In this work, we try to tackle the task of applying machine learning, and more specifically reinforcement learning, to the task of operating an extra high voltage power grid in safe conditions. To do so, we propose a novel game environment that is able to simulate a power grid through time given an Agent policy. For reinforcement learning integration, the game is modeled using a Partially Observable Infinite Markov Decision Process. At each time step, an Agent takes an action given an observation, which leads to a first inner state that is used to compute a reward (along with the chosen action). The game then loads the next set of injections, computes the resulting state and provides the Agent with a new observation. Our approach can be used with multiple levels of simplicity, notably through the type of algorithm used to compute load-flows. States are not directly visible to the Agents, but rather exported into observations. The reward computation involves multiple inner subrewards, which are designed to reflect some of the goals of TSO dispatchers. We demonstrated how the game processes two timesteps with a do-nothing Agent, and compared it with an Agent that performs one grid modification.

For future work, we plan to design game situations that are interesting for the player and include situations of crisis leading to eventual cascading effects, i.e. meta-stable states, that pose difficult problems to be solved. We need to reverse-engineer the problem in an adversarial way in some sense: one player creates difficult problems and the other tries to solve them. The game will come with a Graphical User Interface, which will allow users to manually play the game, and could be used to watch policies in action. We will also integrate new features into the game in order to make it closer to real-life conditions. Maintenance schemes and random branch hazards will be further integrated, along with random noise injection into the planned productions and demands. Once the user interface is operational, we plan to spend time optimizing the code such that computation can be fast enough to tackle the number of necessary learning steps. We will also establish more complex baselines, focusing on additional approaches for reinforcement learning, including Actor-Critic methods ([26], [27]) and Normalized Advantage Functions ([28]). Those baselines' performance will help us tune a set of suitable hyperparameters, such as subreward values. Besides, new grid parameters will be explored for the IEEE-118, including thermal limits, such that we can tune the game difficulty.

Acknowledgements: This work was supported by INRIA and ChaLearn. Special thanks to my collaborator Kimang Khun and my advisors Isabelle Guyon, Antoine Marot, Benjamin Donnot. I am grateful to Marc Schoenauer for welcoming me at the LRI lab in the TAU group, and to RTE's advisors including Patrick Panciatici for their guidance.

REFERENCES

[1] R. S. Sutton, A. G. Barto, "Reinforcement learning: An introduction", DOI: https://doi.org/10.1016/S1364-6613(99)01331-5
[2] R. D. Zimmerman, C. E. Murillo-Sanchez and R. J. Thomas, "MATPOWER: Steady-State Operations, Planning, and Analysis Tools for Power Systems Research and Education," in IEEE Transactions on Power Systems, vol. 26, no. 1, pp. 12-19, Feb. 2011.
[3] Benjamin Donnot, Isabelle Guyon, Marc Schoenauer, Patrick Panciatici, Antoine Marot, "Introducing machine learning for power system operation support", arXiv:1709.09527v1
[4] Benjamin Donnot, Isabelle Guyon, Marc Schoenauer, Patrick Panciatici, Antoine Marot, "Optimization of computational budget for power system risk assessment", arXiv:1805.01174v1
[5] Benjamin Donnot, Isabelle Guyon, Marc Schoenauer, Antoine Marot, Patrick Panciatici, "Fast Power system security analysis with Guided Dropout", supplemental material, 2017. <hal-01649938>
[6] Ian Goodfellow, Yoshua Bengio, Aaron Courville, ”Deep Learning
Book”, MIT Press, 2016
[7] M. Mahmud, M.S. Kaiser, A. Hussain, S. Vassanelli. “Applica-
tions of Deep Learning and Reinforcement Learning to Biolog-
ical Data,” IEEE Trans. Neural Netw. Learn. Syst., 2018, doi:
10.1109/TNNLS.2018.2790388.
[8] Travers Ching, et al., ”Opportunities and obstacles for deep learning
in biology and medicine”, DOI: 10.1098/rsif.2017.0387
[9] Kun Shao, Yuanheng Zhu, Dongbin Zhao, ” StarCraft Micromanage-
ment with Reinforcement Learning and Curriculum Transfer Learn-
ing”, arXiv:1804.00810v1
[10] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis
Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker,
Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan
Hui, Laurent Sifre, George van den Driessche, Thore Graepel, Demis
Hassabis, ”Mastering the game of Go without human knowledge”,
Nature volume 550, pages 354–359 (19 October 2017)
[11] Seyed Sajad Mousavi, Michael Schukat, Enda Howley, ”Traffic Light
Control Using Deep Policy-Gradient and Value-Function Based Rein-
forcement Learning”, arXiv:1704.08883v2
[12] Benjamin Donnot, Isabelle Guyon, Marc Schoenauer, Antoine Marot,
Patrick Panciatici, ” Fast Power system security analysis with Guided
Dropout”, arXiv:1801.09870v1
[13] P. Demetriou, M. Asprou, J. Quiros-Tortos and E. Kyriakides, ”Dy-
namic IEEE Test Systems for Transient Analysis,” in IEEE Systems
Journal, vol. 11, no. 4, pp. 2108-2117, Dec. 2017.
[14] Andrey Y. Lokhov, Marc Vuffray, Dmitry Shemetov, Deepjyoti Deka,
and Michael Chertkov, ”Online Learning of Power Transmission
Dynamics”, arXiv:1710.10021v1
[15] Salar Fattahi, Javad Lavaei, and Alper Atamturk, ” A Bound Strength-
ening Method for Optimal Transmission Switching in Power Systems”,
arXiv:1711.10428v1
[16] Le Pham Tuyen, Ngo Anh Vien, Abu Layek, TaeChoong Chung,
”Deep Hierarchical Reinforcement Learning Algorithm in Partially
Observable Markov Decision Processes”, arXiv:1805.04419v1
[17] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever,
Ruslan R. Salakhutdinov, ” Improving neural networks by preventing
co-adaptation of feature detectors”, arXiv:1207.0580v1
[18] Metaxiotis K, Kagiannas A, Askounis D, Psarras J. Artificial intelli-
gence in short term electric load forecasting: a state-of-the-art survey
for the researcher. Energy conversion and Management. 2003 Jun
1;44(9):1525-34.
[19] Hippert HS, Pedreira CE, Souza RC. Neural networks for short-
term load forecasting: A review and evaluation. IEEE Transactions
on power systems. 2001 Feb;16(1):44-55.
[20] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves,
Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller, ”Playing Atari
with Deep Reinforcement Learning”, arXiv:1312.5602v1
[21] Yann LeCun, Yoshua Bengio. ”Convolutional networks for images,
speech, and time-series”. In M. A. Arbib, editor, The Handbook of
Brain Theory and Neural Networks. MIT Press, 1995.
[22] Xu B, Wang N, Chen T, Li M. Empirical evaluation of rectified
activations in convolutional network. arXiv preprint arXiv:1505.00853.
2015 May 5.
[23] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Lau-
rent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis
Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman,
Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Tim-
othy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Grae-
pel, Demis Hassabis, ”Mastering the game of Go with deep neural
networks and tree search”, Nature volume 529, pages 484–489 (28
January 2016)
[24] Sergey Ioffe, Christian Szegedy, ” Batch Normalization: Accelerat-
ing Deep Network Training by Reducing Internal Covariate Shift”,
arXiv:1502.03167v3
[25] K. He, X. Zhang, S. Ren and J. Sun, ”Deep Residual Learning for
Image Recognition,” 2016 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778.
[26] Konda VR, Tsitsiklis JN. Actor-critic algorithms. In Advances in neural
information processing systems 2000 (pp. 1008-1014).
[27] Bhatnagar S, Sutton RS, Ghavamzadeh M, Lee M. Natural actor–critic
algorithms. Automatica. 2009 Nov 1;45(11):2471-82.
[28] Gu S, Lillicrap T, Sutskever I, Levine S. Continuous deep q-learning
with model-based acceleration. In International Conference on Machine
Learning 2016 Jun 11 (pp. 2829-2838).
APPENDIX I
ABOUT THE IEEE FORMAT FOR POWER GRIDS
The IEEE format makes it possible to represent a steady state of a power grid in a condensed manner. We use version 2 of this format, which differs from its first iteration by the variables it stores. The IEEE format is used to compute load-flows with the open-source software Matpower.
More precisely, the IEEE format is made of at least three matrices (+ one version value and one value indicating the base
MVA of the system):
• The bus matrix: stores values about the substations and the consumptions
• The generator matrix: parameters related to the generators
• The branch matrix: stores the values related to the flows of the system
Here, we list the parameters stored in each of those three matrices. For each one, the columns indicate the parameters, and the rows correspond to the objects.
Bus matrix
The bus matrix is a matrix of shape n × 13, which implies that there are 13 parameters for every bus. The parameters
are (i.e. the columns):
1) ID of the bus
2) Type of the bus: 1 for a PV bus, 2 for a PQ bus, 3 for the slack bus (or reference bus), and 4 for isolated bus (not
linked to any other element)
3) Real power demand
4) Reactive power demand
5) Shunt conductance (some substations have shunts)
6) Shunt susceptance
7) ID indicating the area of the bus (not used in our case)
8) Voltage magnitude
9) Voltage angle
10) Base voltage (total voltage is magnitude times base voltage)
11) ID indicating the zone of the bus (not used in our case)
12) Maximum voltage magnitude
13) Minimum voltage magnitude
Some of these parameters, such as the voltage magnitude, need to be specified in per-unit. This is the expression of some quantities as fractions of a defined base unit quantity (for the voltage magnitude, the base unit quantity is the base voltage parameter). We do not explicitly use the area or zone parameters, since we consider every element to be within the same grid.
One thing to note about the IEEE format is that there is no notion of substation. In fact, Matpower only uses the notion of bus. We use some tricks, which include artificially creating nodes with the same parameters, for actions such as node splitting.
Gen matrix
The matrix of generators has n rows, where n is the number of generators of the grid, and 10 parameters (columns), which are:
1) ID of the bus to which the generator is directly connected
2) Real power output
3) Reactive power output
4) Maximum reactive power output
5) Minimum reactive power output
6) Voltage magnitude setpoint
7) Base MVA of the generator
8) Status of the generator (0 out-of-service, >0 in service)
9) Maximum real power output
10) Minimum real power output
Branch matrix
The branch matrix has 11 parameters, plus 4 extra values per branch representing the flows. Branches are identified using a from bus and a to bus (by convention). The parameters of a branch are:
1) ID of from bus
2) ID of to bus
3) Resistance
4) Reactance
5) Susceptance
6) Long term rating
7) Short term rating
8) Emergency rating
9) Transformer phase shift angle
10) Branch status (1 in service, 0 out-of-service)
On top of that for steady-states, there are 4 extra columns:
1) P at origin
2) Q at origin
3) P at destination
4) Q at destination
Branches are represented using origin and destination values, since in AC mode there can be losses (a function of some parameters, including the branch resistance).
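As an illustration, a toy case could be laid out as three NumPy arrays following the column orders and type codes listed above. This is a sketch of the layout as described in this appendix, not a complete or validated Matpower case file; all numerical values are made up.

```python
import numpy as np

# Bus matrix: one row per bus, columns in the order enumerated above
# (id, type, Pd, Qd, Gs, Bs, area, Vm, Va, baseKV, zone, Vmax, Vmin).
bus = np.array([
    [1, 3, 0.0,  0.0,  0, 0, 1, 1.0, 0.0, 230, 1, 1.1, 0.9],   # slack bus (type 3)
    [2, 2, 50.0, 10.0, 0, 0, 1, 1.0, 0.0, 230, 1, 1.1, 0.9],   # load bus (PQ code per the list above)
])

# Generator matrix: one row per generator
# (bus id, Pg, Qg, Qmax, Qmin, Vg, baseMVA, status, Pmax, Pmin).
gen = np.array([
    [1, 50.0, 0.0, 30.0, -30.0, 1.0, 100, 1, 200.0, 0.0],
])

# Branch matrix: one row per branch
# (from bus, to bus, R, X, B, long rating, short rating, emergency rating, shift angle, status).
branch = np.array([
    [1, 2, 0.01, 0.1, 0.0, 100, 110, 120, 0.0, 1],
])

case = {"version": 2, "baseMVA": 100.0, "bus": bus, "gen": gen, "branch": branch}
print({k: (v.shape if isinstance(v, np.ndarray) else v) for k, v in case.items()})
```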
APPENDIX II
POWER FLOW EQUATIONS
This section is a rapid overview of the problem that power flow software needs to solve.
A. Model of the power grid
Let G be a grid with n nodes and m power lines.
The nodes of G are divided into two parts, namely the generator nodes, where at least one production unit (power plant, wind plant, etc.) participating in voltage control is connected⁵, and those called load nodes.
Node i and node j are connected by an element with complex impedance Z_{i,j}. If nothing connects the two, one can think of Z_{i,j} = ∞.
Often, it is more convenient to think of the admittance Y instead of the impedance Z. The admittance is nothing more than:

Y_{i,j} = 1 / Z_{i,j}
So if two nodes i and j are not connected, we have Y_{i,j} = 0. Ohm's law (also called Kirchhoff's voltage law) between node k and node j, in complex form, can be written as:

i_{k→j} = Y_{k,j} × (V_j − V_k)
There is another fundamental law in a power grid, Kirchhoff's current law. It states that, at a node k:

i_k = Σ_{j=1, j≠k}^{n_nodes} i_{k→j}

where i_k is the total complex current injected at node k, and i_{k→j} denotes the (complex) current flowing from node k to node j.
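These two laws translate directly into complex arithmetic. The sketch below builds the nodal admittances from given line impedances and evaluates the injected currents for assumed voltages; all numerical values are arbitrary illustrations.

```python
import numpy as np

n = 3
# Complex impedance of the line between each connected pair (k, j).
lines = {(0, 1): 0.01 + 0.10j, (1, 2): 0.02 + 0.15j}

# Admittance: Y[k, j] = 1 / Z[k, j] for connected pairs, 0 otherwise.
Y = np.zeros((n, n), dtype=complex)
for (k, j), z in lines.items():
    Y[k, j] = Y[j, k] = 1.0 / z

# Assumed complex node voltages (purely illustrative values).
V = np.array([1.00 + 0.00j, 0.98 - 0.02j, 0.97 - 0.03j])

# Ohm's law per line: i_{k->j} = Y[k, j] * (V[j] - V[k]).
def branch_current(k, j):
    return Y[k, j] * (V[j] - V[k])

# Kirchhoff's current law: the current injected at node k equals the sum
# of the currents leaving k towards its neighbours.
i_injected = np.array([sum(branch_current(k, j) for j in range(n) if j != k)
                       for k in range(n)])
print(i_injected)
```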
B. Equations to satisfy
A load-flow is a computation that takes as input:
• the real power for all load nodes PD
• the reactive power for all load nodes QD
• the real power for all generator nodes PG
• the voltage magnitude |V | for all generator nodes
• the voltage angle Θ for the slack bus
• the voltage magnitude |V | for the slack bus
With this information, a load-flow computes, for each load bus, the voltage angle Θ_l and magnitude |V|_l, and then derives the other quantities of interest, such as the active power flow, the reactive power flow, or the current flow on each power line of the system.
⁵ Actually, for the system to be properly specified, one node where there is a generator is called the slack bus.
The power flow equations are, for each node (slack node, production node or load node) i of the power grid:
0 = −P_i + Σ_{k=1}^{N} |V|_i |V|_k ( G_{i,k} cos(Θ_i − Θ_k) + B_{i,k} sin(Θ_i − Θ_k) )   for the real power

0 = Q_i + Σ_{k=1}^{N} |V|_i |V|_k ( G_{i,k} sin(Θ_i − Θ_k) − B_{i,k} cos(Θ_i − Θ_k) )    for the reactive power
where:
• P_i is the real power injected at this node
• G_{i,k} is the real part of the element of the bus admittance matrix, i.e. the real part of the admittance of the line connecting bus i to bus k (if any) or 0 (if not)
• B_{i,k} is the imaginary part of the element of the bus admittance matrix, i.e. the imaginary part of the admittance of the line connecting bus i to bus k (if any) or 0 (if not)
For the system to be fully determined by these sets of equations, the equations are not written for the slack bus, and only the real-power equation is written for the production nodes.
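For concreteness, here is a sketch of how these mismatch equations can be evaluated for a candidate operating point, given the real part G and imaginary part B of the bus admittance matrix. The numbers are toy values; an actual load-flow solver (e.g. Newton-Raphson) would iterate until these residuals vanish on the appropriate buses.

```python
import numpy as np

def power_mismatch(P, Q, Vm, theta, G, B):
    """Residuals of the real and reactive power-flow equations above."""
    n = len(Vm)
    dP = np.empty(n)
    dQ = np.empty(n)
    for i in range(n):
        s_p = sum(Vm[i] * Vm[k] * (G[i, k] * np.cos(theta[i] - theta[k])
                                   + B[i, k] * np.sin(theta[i] - theta[k]))
                  for k in range(n))
        s_q = sum(Vm[i] * Vm[k] * (G[i, k] * np.sin(theta[i] - theta[k])
                                   - B[i, k] * np.cos(theta[i] - theta[k]))
                  for k in range(n))
        dP[i] = -P[i] + s_p       # real-power residual
        dQ[i] = Q[i] + s_q        # reactive-power residual
    return dP, dQ

# Toy 2-bus system with arbitrary illustrative numbers.
G = np.array([[0.5, -0.5], [-0.5, 0.5]])
B = np.array([[-5.0, 5.0], [5.0, -5.0]])
P = np.array([0.3, -0.3])          # injected real power per node
Q = np.array([0.1, -0.1])          # injected reactive power per node
Vm = np.array([1.0, 0.98])
theta = np.array([0.0, -0.05])
print(power_mismatch(P, Q, Vm, theta, G, B))
```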
Once these quantities have been computed, one can compute the power flows on each element of the network. For example, for a given line connecting bus i to bus k with admittance Y, at the origin node i having conductance S_i and susceptance B_i:

P_{i→k} = |V_i| |V_k| Y sin(Θ_i − Θ_k) + |V_i|^2 S_i                  (6)
Q_{i→k} = − |V_i| |V_k| Y cos(Θ_i − Θ_k) + |V_i|^2 (Y − B_i)          (7)
I_{i→k} = sqrt( P_{i→k}^2 + Q_{i→k}^2 ) / |V_i|                        (8)
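Equations (6)-(8) transcribe directly into code; the numbers below are illustrative only and follow the notation of this paragraph.

```python
import numpy as np

def line_flows(Vi, Vk, theta_i, theta_k, Y, Si, Bi):
    """Active/reactive flows and current at the origin of line i->k, per eq. (6)-(8)."""
    P_ik = Vi * Vk * Y * np.sin(theta_i - theta_k) + Vi ** 2 * Si
    Q_ik = -Vi * Vk * Y * np.cos(theta_i - theta_k) + Vi ** 2 * (Y - Bi)
    I_ik = np.sqrt(P_ik ** 2 + Q_ik ** 2) / Vi
    return P_ik, Q_ik, I_ik

# Illustrative numbers only.
print(line_flows(Vi=1.0, Vk=0.98, theta_i=0.0, theta_k=-0.05, Y=10.0, Si=0.0, Bi=0.2))
```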
C. DC approximation
For more detailed information, the power flow derivations are given in DCPowerFlowEquations.pdf. This section is greatly inspired by "DC power flow in unit commitment models", chapter 3. In this section we will suppose that there are neither transformers nor phase shifters. These two objects can of course be taken into account in the DC approximation, as shown in the two documents above.
In this part, we present one of the most used models to approximate the load-flow equations. In return, some results of the AC model won't be accessible, for example the losses or the voltage magnitudes. Despite these drawbacks, the DC modeling has two main advantages: first of all, it always finds a solution to its equations, and, more importantly, it is much faster to compute.
The impact of each of these assumptions on the power-flow equations is discussed now:
1) R ≪ X. This assumption has a big impact on the equations. First, it implies that the losses are fully neglected.
And by definition, we have:

Y = G + jB

thus:

G = R / (R² + X²) → 0        as R → 0

and

B = −X / (R² + X²) → −1/X    as R → 0
2) Θ_i − Θ_k ≈ 0. This allows a linearization of the problem, as the trigonometric functions sin and cos are approximated by the identity and the constant 1, respectively (first-order approximation). The power flow equations then become:

0 = −P_i + Σ_{k=1}^{N} |V|_i |V|_k B_{i,k} (Θ_i − Θ_k)   for the real power

0 = Q_i + Σ_{k=1}^{N} |V|_i |V|_k B_{i,k}                for the reactive power

3) |V|_j ≈ |V|_nom. The last non-linearity in the previous equations arises from the factor |V|_i |V|_k. Assuming that |V|_j ≈ |V|_nom makes it disappear. This is also a very strong assumption, preventing us from getting the voltage magnitudes as a result of the DC approximation. This leads to:

|V|_i |V|_k ≈ |V|_nom × |V|_nom
So at the end, the equations are:

P_i = Σ_{k=1, k≠i}^{N} B_{i,k} (Θ_i − Θ_k)   for the real power       (9)

Q_i = − Σ_{k=1}^{N} B_{i,k} = 0              for the reactive power   (10)
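Under these assumptions, equation (9) is linear in the voltage angles, so a DC load-flow reduces to solving a linear system with the slack angle fixed at zero. A minimal sketch with made-up susceptances:

```python
import numpy as np

# Line susceptances B[i, k] (symmetric, 0 where there is no line), 3-node toy grid.
B = np.array([
    [0.0, 10.0, 10.0],
    [10.0, 0.0, 10.0],
    [10.0, 10.0, 0.0],
])
P = np.array([0.0, 1.0, -1.0])       # net injections; node 0 is the slack

# Nodal matrix of equation (9): P_i = sum_k B[i, k] * (theta_i - theta_k).
n = len(P)
A = np.diag(B.sum(axis=1)) - B

# Fix theta_slack = 0 and solve the remaining (n-1) x (n-1) system.
theta = np.zeros(n)
theta[1:] = np.linalg.solve(A[1:, 1:], P[1:])

# Resulting DC flow on each line: F[i, k] = B[i, k] * (theta_i - theta_k).
F = B * (theta[:, None] - theta[None, :])
print(np.round(theta, 4))
print(np.round(F, 4))
```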