Scalability and Sample Efficiency
Abstract—Data-driven state estimation (SE) is becoming increasingly important in modern power systems, as it allows for more efficient analysis of system behaviour using real-time measurement data. This paper thoroughly evaluates a phasor measurement unit-only state estimator based on graph neural networks (GNNs) applied over factor graphs. To assess the sample efficiency of the GNN model, we perform multiple training experiments on various training set sizes. Additionally, to evaluate the scalability of the GNN model, we conduct experiments on power systems of various sizes. Our results show that the GNN-based state estimator exhibits high accuracy and efficient use of data. Additionally, it demonstrated scalability in terms of both memory usage and inference time, making it a promising solution for data-driven SE in modern power systems.
Index Terms—State Estimation, Graph Neural Networks, Machine Learning, Power Systems, Real-Time Systems
This paper has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement number 856967.
I. INTRODUCTION
Motivation and literature review: The state estimation (SE) algorithm plays a crucial role in power systems by providing an accurate and up-to-date representation of the current state of the system, allowing for efficient and reliable operation. Its purpose is to estimate complex bus voltages using available measurements, power system parameters, and topology information [1]. In this sense, the SE can be seen as a problem of solving large, noisy, sparse, and generally nonlinear systems of equations. The measurement data used by the SE algorithm usually come from two sources: the supervisory control and data acquisition (SCADA) system and the wide area monitoring system (WAMS). The SCADA system provides low-resolution measurements that cannot capture system dynamics in real time, while the WAMS provides high-resolution data from phasor measurement units (PMUs) that enable real-time monitoring of the system. The SE problem that considers measurement data from both WAMS and SCADA systems is formulated in a nonlinear way and solved in a centralized manner using the Gauss-Newton method [1]. On the other hand, the SE problem that considers only PMU data provided by WAMS has a linear formulation, enabling faster, non-iterative solutions.
In this work, we focus on the SE considering only phasor measurements, described with a system of linear equations [2], which is becoming viable with the increasing deployment of PMUs. This formulation is usually solved using linear weighted least-squares (WLS), which involves matrix factorizations and can be numerically sensitive [3]. To address the numerical instability issues that often arise when using traditional SE solvers, researchers have turned to data-driven deep learning approaches [4], [5]. These approaches, when trained on relevant datasets, are able to provide solutions even when traditional methods fail. For example, in [4], a combination of feed-forward and recurrent neural networks was used to predict network voltages using historical measurement data. In the nonlinear SE formulation, the study [5] demonstrates the use of deep neural networks as fast, high-quality initializers of the Gauss-Newton method.
Both linear WLS and common deep learning SE methods at best approach quadratic computational complexity with respect to the power system size. To fully utilize the high sampling rates of PMUs, there is a motivation to develop SE algorithms with linear computational complexity. One way of achieving this could be using increasingly popular graph neural networks (GNNs) [6], [7]. GNNs have several advantages when used in power systems, such as permutation invariance, the ability to handle varying power system topologies, and requiring fewer trainable parameters and less storage space compared to conventional deep learning methods. One of the key benefits of GNNs is the ability to perform distributed inference using only local neighbourhood measurements, which can be efficiently implemented using the emerging 5G network
communication infrastructure and edge computing [8]. This allows for real-time and low-latency decision-making even in large-scale networks, as the computations are performed at the edge of the network, closer to the data source, reducing the amount of data that needs to be transmitted over the network. This feature is particularly useful for utilizing the high sampling rates of PMUs, as it can reduce communication delays in PMU measurement delivery that occur in centralized SE implementations.
GNNs are being applied in a variety of prediction tasks in the field of power systems, including fault location [9], stability assessment [10], and load forecasting [11]. GNNs have also been used for power flow problems, both in a supervised [12] and an unsupervised [13] manner. A hybrid nonlinear SE approach [14] combines model-based and data-based approaches, where voltages calculated using a GNN are used as a regularization term in the SE loss function.
Contributions: In our previous work [15], we proposed a data-driven linear PMU-only state estimator based on GNNs applied over factor graphs. The model demonstrated good approximation capabilities under normal operating conditions and performed well in unobservable and underdetermined scenarios. This work significantly extends our previous work in the following ways:
• We conduct an empirical analysis to investigate how the same GNN architecture could be used for power systems of various sizes. We assume that the local properties of the graphs in these systems are similar, leading to local neighbourhoods with similar structures which can be represented using the same embedding space size and the same number of GNN layers.
• To evaluate the sample efficiency of the GNN model, we run multiple training experiments on different sizes of training sets. Additionally, we assess the scalability of the model by training it on various power system sizes and evaluating its accuracy, training convergence properties, inference time, and memory requirements.
• As a side contribution, the proposed GNN model is tested in scenarios with high measurement variances, which we use to simulate phasor misalignments due to communication delays, and the results are compared with linear WLS solutions of SE.
[...] the bus [17]. The state variables are given as $x$ in rectangular coordinates, and therefore consist of real and imaginary components of bus voltages. The PMU measurements are transformed from the polar to the rectangular coordinate system, since the SE problem can then be formulated using a system of linear equations [15]. The solution to this sparse and noisy system can be found by solving the linear WLS problem:
$$H^{T} \Sigma^{-1} H x = H^{T} \Sigma^{-1} z, \qquad (1)$$
where the Jacobian matrix $H \in \mathbb{R}^{m \times 2n}$ is defined according to the partial first-order derivatives of the measurement functions, and $m$ is the total number of linear equations. The observation error covariance matrix is $\Sigma \in \mathbb{R}^{m \times m}$, while the vector $z \in \mathbb{R}^{m}$ contains measurement values in the rectangular coordinate system. The aim of the WLS-based SE is to minimize the weighted sum of squared residuals between the measurements and the corresponding values that are calculated using the measurement functions [1].
This approach has the disadvantage of requiring a transformation of measurement errors (magnitude and angle errors) from polar to rectangular coordinates, making them correlated, resulting in a non-diagonal covariance matrix $\Sigma$ and increased computational effort. To simplify the calculation, the non-diagonal elements of $\Sigma$ are often ignored, which can impact the accuracy of the SE [17]. We can use the classical theory of propagation of uncertainty to compute variances in rectangular coordinates from variances in polar coordinates [18]. The solution to (1) obtained by ignoring the non-diagonal elements of the covariance matrix $\Sigma$ to avoid its computationally demanding inversion is referred to as the approximative WLS SE solution.
In the rest of the paper, we explore whether using a GNN model trained with measurement values, variances, and covariances labelled with the exact solutions of (1) leads to greater accuracy compared to the approximative WLS SE, which ignores covariances. The GNN model, once trained, scales linearly with respect to the number of power system buses, allowing for lower computation time compared to both the approximate and exact solvers of (1).
III. METHODS
In this section, we introduce spatial GNNs at a high level and describe how they can be applied to the linear SE problem.
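As a reference point for the GNN-based estimator, the difference between the exact and the approximative solution of the linear WLS problem (1) can be illustrated on a deliberately tiny case. The sketch below is not the solver used in the paper. It assumes a single bus observed by two independent voltage phasor measurements, so that H reduces to stacked 2×2 identity blocks and Σ is block diagonal, and it uses first-order propagation of uncertainty to move from polar to rectangular coordinates. The function names (`polar_to_rect`, `wls_single_bus`) and all numerical values are hypothetical.

```python
import math

def polar_to_rect(mag, ang, var_mag, var_ang):
    """Return the rectangular phasor and its first-order 2x2 covariance."""
    c, s = math.cos(ang), math.sin(ang)
    z = [mag * c, mag * s]
    var_re = c * c * var_mag + (mag * s) ** 2 * var_ang
    var_im = s * s * var_mag + (mag * c) ** 2 * var_ang
    cov = s * c * (var_mag - mag * mag * var_ang)  # correlation created by the transform
    return z, [[var_re, cov], [cov, var_im]]

def inv2(a):
    """Inverse of a 2x2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]

def matvec2(a, v):
    """Product of a 2x2 matrix and a 2-vector."""
    return [a[0][0] * v[0] + a[0][1] * v[1], a[1][0] * v[0] + a[1][1] * v[1]]

def wls_single_bus(phasors, use_covariances=True):
    """Solve H^T Sigma^-1 H x = H^T Sigma^-1 z for one bus (assumption: H = stacked identities)."""
    gain = [[0.0, 0.0], [0.0, 0.0]]  # accumulates H^T Sigma^-1 H
    rhs = [0.0, 0.0]                 # accumulates H^T Sigma^-1 z
    for mag, ang, var_mag, var_ang in phasors:
        z, sigma = polar_to_rect(mag, ang, var_mag, var_ang)
        if not use_covariances:
            # Approximative WLS: drop the non-diagonal elements of Sigma.
            sigma = [[sigma[0][0], 0.0], [0.0, sigma[1][1]]]
        w = inv2(sigma)
        wz = matvec2(w, z)
        for r in range(2):
            rhs[r] += wz[r]
            for col in range(2):
                gain[r][col] += w[r][col]
    return matvec2(inv2(gain), rhs)

# Hypothetical measurements: (magnitude, angle, magnitude variance, angle variance).
phasors = [(1.0, 0.30, 1e-2, 1e-4), (1.02, 0.35, 1e-4, 1e-2)]
exact = wls_single_bus(phasors)
approx = wls_single_bus(phasors, use_covariances=False)
print("exact WLS:        ", exact)
print("approximative WLS:", approx)
```

With these inputs the two estimates differ noticeably in the real part, which is the kind of accuracy loss that ignoring covariances can cause. When all measurements agree, both variants return the measured phasor.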
Fig. 1: A GNN layer, which represents a single message passing iteration, includes multiple trainable functions (Message, Aggregate, Update), depicted as yellow rectangles. The number of first-order neighbours of the node $j$ is denoted as $n_j$.
The message function calculates the message $m_{i,j} \in \mathbb{R}^{u}$ between two node embeddings, the aggregation function combines the incoming messages in a specific way, resulting in an aggregated message $m_j \in \mathbb{R}^{u}$, and the update function calculates the update to each node’s embedding. The message passing process is repeated a fixed number of times, with the final node embeddings passed through additional neural network layers to generate predictions. GNNs are trained by optimizing their parameters using a variant of gradient descent, with the loss function being a measure of the distance between the ground-truth values and the predictions.
B. State Estimation using Graph Neural Networks
The proposed GNN model is designed to be applied over a graph with an SE factor graph topology [19], which consists of factor and variable nodes with edges between them. The variable nodes are used to create an $s$-dimensional embedding for the real and imaginary parts of the bus voltages, which are used to generate state variable predictions. Here, $s$ is a training hyperparameter tuned using trial and error. The factor nodes serve as inputs for measurement values, variances, and covariances. Factor nodes do not generate predictions, but they participate in the GNN message passing process to send input data to their neighbouring variable nodes. To improve the model’s representation of a node’s neighbourhood structure, we use binary index encoding as input features for variable nodes. This encoding allows the GNN to better capture relationships between nodes and reduces the number of input neurons and trainable parameters, as well as training and inference time, compared to the one-hot encoding used in [15]. The GNN model can be applied to various types and quantities of measurements on both power system buses and branches, and the addition or removal of measurements can be simulated by adding or removing factor nodes. In contrast, applying a GNN to the bus-branch power system model would require assigning a single input vector to each bus, which can cause problems such as having to fill elements with zeros when not all measurements are available and making the output sensitive to the order of measurements in the input vector.
Connecting the variable nodes in the 2-hop neighbourhood of the factor graph topology significantly improves the model’s prediction quality in unobservable scenarios [15]. This is because the graph remains connected even when simulating the removal of factor nodes (e.g., measurement loss), which allows messages to be propagated in the entire $K$-hop neighbourhood of the variable node. This allows for the physical connection between power system buses to be preserved when a factor node corresponding to a branch current measurement is removed.
The proposed GNN for a heterogeneous graph has two types of layers: one for factor nodes and one for variable nodes. These layers, denoted as $\mathrm{Layer}_f$ and $\mathrm{Layer}_v$, have their own sets of trainable parameters, which allow them to learn their message, aggregation, and update functions separately. Different sets of trainable parameters are used for variable-to-variable and factor-to-variable node messages. Both GNN layers use two-layer feed-forward neural networks as message functions, single-layer neural networks as update functions, and the attention mechanism [7] in the aggregation function. Then, a two-layer neural network Pred is applied to the final node embeddings $h^{K}$ of variable nodes only, to create state variable predictions. The loss function is the mean-squared error (MSE) between the predictions and the ground-truth values, calculated using variable nodes only. All trainable parameters are updated via gradient descent and backpropagation over a mini-batch of graphs. The high-level computational graph of the GNN architecture specialized for heterogeneous augmented factor graphs is depicted in Fig. 2.
Fig. 2: Proposed GNN architecture for heterogeneous augmented factor graphs. Variable nodes are represented by circles and factor nodes are represented by squares. The high-level computational graph begins with the loss function for a variable node, and the layers that aggregate into different types of nodes have distinct trainable parameters.
The proposed model uses an inference process that requires measurements from the $K$-hop neighbourhood of each node, allowing for computational and geographical distribution. Additionally, since the node degree in the SE factor graph is limited, the computational complexity of the inference process for a single node is constant. As a result, the overall GNN-based SE has a linear computational complexity, making it efficient and scalable for large networks.
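The message, aggregation, and update steps described in this section, together with the binary index encoding of variable nodes, can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the authors' implementation: it uses one shared, randomly initialised parameter set per layer instead of separate factor and variable layer parameters, bias-free dense layers, a simple softmax attention score, and a hypothetical five-node augmented factor graph with made-up input features. All function and variable names are hypothetical.

```python
import math
import random

def binary_index_encoding(index, num_nodes):
    """Encode a node index in about log2(num_nodes) bits instead of a one-hot vector."""
    width = max(1, (num_nodes - 1).bit_length())
    return [float((index >> bit) & 1) for bit in reversed(range(width))]

def dense(weights, x, relu=True):
    """Bias-free fully connected layer, optionally followed by ReLU."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [max(0.0, o) for o in out] if relu else out

def init_params(size, rng):
    """Random weights for one layer; a trained model would learn these."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"msg1": mat(size, 2 * size), "msg2": mat(size, size),
            "att": mat(1, 2 * size), "upd": mat(size, 2 * size)}

def gnn_layer(h, neighbours, params):
    """One message passing iteration: message, attention-weighted aggregation, update."""
    new_h = {}
    for j, nbrs in neighbours.items():
        messages, scores = [], []
        for i in nbrs:
            pair = h[i] + h[j]  # concatenate sender and receiver embeddings
            hidden = dense(params["msg1"], pair)  # two-layer message function
            messages.append(dense(params["msg2"], hidden, relu=False))
            scores.append(dense(params["att"], pair, relu=False)[0])
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # softmax numerators
        total = sum(weights)
        aggregated = [sum(w / total * m[d] for w, m in zip(weights, messages))
                      for d in range(len(h[j]))]
        new_h[j] = dense(params["upd"], h[j] + aggregated)  # single-layer update
    return new_h

SIZE = 4  # embedding size (the paper's experiments use 64)
rng = random.Random(0)

# Toy augmented factor graph: v0, v1 are variable nodes, f0..f2 are factor nodes.
# The v0-v1 edge is the added connection between 2-hop variable neighbours.
neighbours = {
    "v0": ["f0", "f1", "v1"],
    "v1": ["f1", "f2", "v0"],
    "f0": ["v0"],
    "f1": ["v0", "v1"],
    "f2": ["v1"],
}

def pad(features):
    """Zero-pad input features to the embedding size (a simplification)."""
    return (features + [0.0] * SIZE)[:SIZE]

h = {
    "v0": pad(binary_index_encoding(0, 2)),
    "v1": pad(binary_index_encoding(1, 2)),
    "f0": pad([1.02, 0.5]),  # hypothetical (measurement value, variance) inputs
    "f1": pad([0.31, 0.5]),
    "f2": pad([0.98, 0.5]),
}

layers = [init_params(SIZE, rng) for _ in range(2)]  # K = 2 iterations
for params in layers:
    h = gnn_layer(h, neighbours, params)
print({name: [round(v, 3) for v in emb] for name, emb in h.items()})
```

Because each node only reads the embeddings of its direct neighbours, the cost of one layer grows with the number of edges, which is consistent with the linear complexity argument above when node degrees are bounded. The softmax-weighted sum does not depend on the order of the neighbours, which is the permutation invariance mentioned earlier.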
IV. NUMERICAL RESULTS
In this section, we conduct numerical experiments to investigate the scalability and sample efficiency of the proposed GNN approach. By varying the power system and training set sizes, we are able to assess the model’s memory requirements, prediction speed, and accuracy and compare them to those of traditional SE approaches.
We use the IEEE 30-bus system, the IEEE 118-bus system, the IEEE 300-bus system, and the ACTIVSg 2000-bus system [20], with measurements placed so that measurement redundancy is maximal. For the purpose of sample efficiency analysis, we create training sets containing 10, 100, 1000, and 10000 samples for each of the mentioned power systems. Furthermore, we use validation and test sets comprising 100 samples. These datasets are generated by solving the power flow problem using randomly generated bus power injections and adding Gaussian noise to obtain the measurement values. All the data samples were labelled using the traditional SE solver. An instance of the GNN model is trained on each of these datasets.
In contrast to our previous work, we use higher variance values of $5 \times 10^{-1}$ to examine the performance of the GNN algorithm under conditions where input measurement phasors are unsynchronized due to communication delays [21]. While this is usually simulated by using variance values that increase over time, as an extreme scenario we fix the measurement variances to a high value.
In all the experiments, the node embedding size is set to 64, and the learning rate is $4 \times 10^{-4}$. The minibatch size is 32, and the number of GNN layers is 4. We use the ReLU activation function and a gradient clipping value of $5 \times 10^{-1}$. The optimizer is Adam, and we use mean batch normalization.
A. Properties of Power System Augmented Factor Graphs
For all four test power systems, we create augmented factor graphs using the methodology described in Section III-B. Fig. 3 illustrates how the properties of the augmented factor graphs, such as average node degree, average path length, and average clustering coefficient, along with the system’s maximal measurement redundancy, vary across different test power systems.
Fig. 3: Properties of augmented factor graphs (average degree, average path length, average clustering coefficient) along with the system’s measurement redundancy for different test power systems, labelled with their corresponding number of buses (30, 118, 300, 2000).
The average path length is a property that characterizes the global graph structure, and it tends to increase as the size of the system grows. However, as a design property of high-voltage networks, the other graph properties such as the average node degree, average clustering coefficient, as well as maximal measurement redundancy do not exhibit a clear trend of change with respect to the size of the power system. This suggests that the structures of local, $K$-hop neighbourhoods within the graph are similar across different power systems, and that they contain a similar factor-to-variable node ratio. Consequently, it is reasonable to use the same GNN architecture (most importantly, the number of GNN layers and the node embedding size) for all test power systems, regardless of their size. In this way, the proposed model achieves scalability, as it applies the same set of operations to the local, $K$-hop neighbourhoods of augmented factor graphs of varying sizes without having to adapt to each individual case.
B. Training Convergence Analysis
First, we analyse the training process for the IEEE 30-bus system with four different sizes of the training set. As mentioned in Section III-B, the training loss is a measure of the error between the predictions and the ground-truth values for data samples used in the training process. The validation loss, on the other hand, is a measure of the error between the predictions and the ground-truth values on a separate validation set. In this analysis, we used a validation set of 100 samples.
Fig. 4: Validation losses over training epochs for trainings on four different training set sizes ($n = 10$, $10^{2}$, $10^{3}$, $10^{4}$).
The training losses for all the training processes converged smoothly, so we do not plot them for the sake of clarity. Fig. 4 shows the validation losses for 150 epochs of training on four different training sets. For smaller training sets, the validation loss decreases initially but then begins to increase, which is a sign of overfitting. In these cases, a common practice in machine learning is to select the model with the lowest validation loss value. As will be shown in Section IV-C, the separate test set results for models created using small training sets are still satisfactory. As the number of samples in the training set
increases, the training process becomes more stable. This is because the model has more data to learn from and is therefore less prone to overfitting.
Next, in Table I, we present the training results for the other power systems and training sets of various sizes. The numbers in the table represent the number of epochs after which either the validation loss stopped changing or began to increase. Similarly to the experiments on the IEEE 30-bus system, the trainings on smaller training sets exhibited overfitting, while others converged smoothly. For the former, the number in the table indicates the epoch at which the validation loss reached its minimum and stopped improving. For the latter, the number in the table represents the epoch when there were five consecutive validation loss changes less than $10^{-5}$.
TABLE I: Epoch until validation loss minimum for various power systems and training set sizes.
Training set size    IEEE 118    IEEE 300    ACTIVSg 2000
10 samples           61          400         166
100 samples          38          84          200
1000 samples         24          82          49
10000 samples        12          30          15
Increasing the size of the training set generally results in a lower number of epochs until the validation loss reaches its minimum. However, the number of epochs until the validation loss reaches its minimum varies significantly between the different power systems. This could be due to differences in the complexity of the systems or the quality of the data used for training.
C. Accuracy Assessment
Fig. 5 reports the mean squared errors (MSEs) between the predictions and the ground-truth values on 100-sample test sets for all trained models and the approximate WLS SE. These results indicate that even the GNN models trained on small datasets outperform the approximate WLS SE, except for the models trained on the IEEE 30-bus system with 10 and 100 samples. These results suggest that the quality of the GNN model’s predictions and the generalization capabilities improve as the amount of training data increases, and the models with the best results (highlighted in bold) have significantly smaller MSEs compared to the approximate WLS SE. While we use randomly generated training sets in this analysis, using carefully selected training samples based on historical load consumption data could potentially lead to even better results with small datasets.
D. Inference Time and Memory Requirements
The plot in Fig. 6 shows the ratio of execution times between WLS SE and GNN SE inference as a function of the number of buses in the system. These times are measured on a test set of 100 samples. As expected, the difference in computational complexity between the GNN, with its linear complexity, and WLS, with more than quadratic complexity, becomes apparent as the number of buses increases. From the results, it can be observed that the GNN significantly outperforms WLS in terms of inference time on larger power systems.
The number of trainable parameters in the GNN model remains relatively constant as the number of power system buses increases. The number of input neurons for variable node binary index encoding does grow logarithmically with the number of variable nodes. However, this increase is relatively small compared to the total number of GNN parameters.¹ This indicates that the GNN approach is scalable and efficient, as the model’s complexity does not significantly increase with the size of the power system being analysed.
¹ In fact, all the GNN models we train have the same number of trainable parameters: 49921, which equates to 0.19 MB of memory.
V. CONCLUSIONS
In this study, we focused on thoroughly testing a GNN-based state estimation algorithm in scenarios with large variances, and examining its scalability and sample efficiency. The results showed that the proposed approach provides good results for large power systems, with lower prediction errors compared to the approximative SE. The GNN model used in this approach is also fast and maintains constant memory usage, regardless of the size of the power system. Additionally, the GNN was found to be an effective approximation method for WLS SE even with a relatively small number of training samples, particularly for larger power systems, indicating its sample efficiency. Given these characteristics, the approach is worthy of further consideration for real-world applications.
REFERENCES
[1] A. Monticelli, “Electric power system state estimation,” Proceedings of the IEEE, vol. 88, no. 2, pp. 262–282, 2000.
[2] M. Göl and A. Abur, “A fast decoupled state estimator for systems measured by PMUs,” IEEE Trans. Power Syst., vol. 30, no. 5, pp. 2766–2771, 2015.
[3] G. N. Korres and N. M. Manousakis, “State estimation and observability analysis for phasor measurement unit measured systems,” IET Gener. Transm. Dis., vol. 6, no. 9, pp. 902–913, September 2012.
[4] L. Zhang, G. Wang, and G. B. Giannakis, “Real-time power system state estimation and forecasting via deep unrolled neural networks,” IEEE Trans. Signal Process., vol. 67, no. 15, pp. 4069–4077, 2019.
[5] A. S. Zamzam, X. Fu, and N. D. Sidiropoulos, “Data-driven learning-based optimization for distribution system state estimation,” IEEE Trans. Power Syst., vol. 34, no. 6, pp. 4796–4805, 2019.
[6] W. L. Hamilton, “Graph representation learning,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 14, no. 3, pp. 1–159, 2020.
[7] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph Attention Networks,” in Proc. ICLR, 2018.
[8] O. Kundacina, M. Forcan, M. Cosovic, D. Raca, M. Dzaferagic, D. Miskovic, M. Maksimovic, and D. Vukobratovic, “Near real-time distributed state estimation via AI/ML-empowered 5G networks,” in Proc. SmartGridComm. IEEE, 2022, pp. 284–289.
[9] K. Chen, J. Hu, Y. Zhang, Z. Yu, and J. He, “Fault location in power distribution systems via deep graph convolutional networks,” IEEE J. Sel. Areas Commun., vol. 38, no. 1, pp. 119–131, 2020.
[10] R. Zhang, W. Yao, Z. Shi, L. Zeng, Y. Tang, and J. Wen, “A graph attention networks-based model to distinguish the transient rotor angle instability and short-term voltage instability in power systems,” Int. J. Electr. Power Energy Syst., vol. 137, p. 107783, 2022.
Fig. 5: Test set results for various power systems and training set sizes. Each panel compares the approximate SE (baseline) with the GNN SE as a function of the number of training samples (10, $10^{2}$, $10^{3}$, $10^{4}$): (a) IEEE 30, (b) IEEE 118, (c) IEEE 300, (d) ACTIVSg 2000.