Julia As A Portable High-Level Language For Numerical Solvers of Power Flow Equations On GPU Architectures
Michel Schanen∗, Daniel Adrian Maldonado, François Pacaud, Mihai Anitescu, Kibaek Kim, Youngdae Kim, Vishwas
Rao, Anirudh Subramanyam
Argonne National Laboratory, 9700 S. Cass Ave, Illinois, USA
Abstract
We present ExaPF.jl, a solver for power flow on GPUs, written entirely in Julia. It implements a highly parallel Newton-Raphson solver for nonlinear equations. We exploit Julia packages for kernel and array abstractions at the modeling level, and generate efficient code at runtime for both CPUs and NVIDIA GPUs using Julia's inherent metaprogramming capabilities. In the future, this infrastructure will allow us to leverage the same machinery for AMD and Intel GPUs by targeting the ROCm and oneAPI frameworks. The composable design of the Julia language allows us to apply automatic differentiation, relieving the user from providing derivatives. We also detail a GPU implementation of the iterative solver BiCGSTAB to efficiently solve the linear systems arising in the Newton-Raphson algorithm. We show how to improve the performance of the BiCGSTAB algorithm by using a block-Jacobi preconditioner tailored towards the batched matrix inversion capabilities of modern GPUs. The Newton-Raphson algorithm will eventually serve as the foundation of a reduced-space optimization method that will run entirely on the GPU.
Keywords: power flow; GPU; optimization; Julia; framework; automatic differentiation;
∗ Corresponding author. Email address: [email protected] (Michel Schanen)

... programming models are needed, since code transformation is done through language supported macros, moving ...

Listing 1: Code generation through macros in Julia:

ex = nothing
if target == "cuda"
    ex = quote
        @cuda $threads $blocks $expr
    end
end
if target == "cpu"
    ex = quote
        $expr
    end
end
return esc(ex)
end # closes the enclosing macro definition
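To make the macro-based code generation of Listing 1 concrete, the following is a minimal, self-contained sketch of such a target dispatch; the macro name @dispatch and the global constant target are our own illustrative choices, not part of ExaPF.jl:

# Minimal sketch of metaprogramming-based target dispatch (hypothetical
# names). The value of `target` decides at macro expansion time which
# launch code is generated; with "cpu" the expression is left untouched.
const target = "cpu"   # would be "cuda" on a machine with CUDA.jl set up

macro dispatch(threads, blocks, expr)
    ex = nothing
    if target == "cuda"
        ex = quote
            @cuda threads=$threads blocks=$blocks $expr
        end
    end
    if target == "cpu"
        ex = quote
            $expr
        end
    end
    return esc(ex)
end

incr!(y, x) = (y .+= x)      # any kernel-like function

y, x = zeros(4), ones(4)
@dispatch 256 1 incr!(y, x)  # expands to a plain function call on the CPU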
In this paper we develop an efficient power flow solver that targets general purpose architectures as well as GPUs. It uses algorithms from AutoDiff and iterative linear solvers, and combines them in a modular way while separating hardware and algorithm as much as possible. This implementation serves as a mini-app and as the foundation for our future work on an optimal power flow solver for GPUs.

In Section 2 we give an overview of Julia's portable abstractions for GPUs. Section 3 describes how we leveraged Julia's composable design to implement a power flow solver with a modeling layer, differential programming capabilities, and portable linear algebra, all targeting GPUs. The results in Section 5 give an overview of current GPU hardware using relevant power grid cases. We then conclude with a brief path forward in Section 6 for using this machinery in solving optimal power flow on GPU architectures.

2. GPU programming in Julia

The GPU API design of Julia follows a classical pattern of language and compiler design (see Figure 1). On the one end are abstractions connecting to the algorithms, and on the other end is the generation of hardware-specific code. The Julia programmer can either choose to write abstracted kernels or follow an array abstraction that connects to the abstract array type of Julia. The latter option has the advantage of requiring no code changes at all and opens up all the tools written for array types: vectors, matrices, linear algebra. It relies solely on the broadcasting operator '.' for generating kernels at the assignment level (see Listing 2).
Figure 1: Julia GPU API - The abstraction is either through the abstract type of Julia language arrays or through a kernel abstraction. The generation and compilation of GPU code happens through metaprogramming in Julia that generates calls to the C APIs, which initiates a just-in-time compilation of the object code at runtime.

Listing 2: Array abstraction in Julia:

V = Vector      # CPU vector
V = CuVector    # GPU CUDA vector
V = ROCVector   # GPU ROCm vector
V = oneArray    # GPU oneAPI vector

V = CuVector{Float64}        # GPU vector type
D = CuVector{Dual{Float64}}  # GPU dual vector type
da, db, dc = D(ones(n)), D(ones(n)), D(ones(n))  # GPU transfer

function incrmul(a::AbstractArray, b::AbstractArray, c::AbstractArray)
    c .+= a .* b
end

incrmul(da, db, dc)  # JIT instantiation
This allows linear algebra kernels to be written generically for both CPU and GPU. For example, Krylov.jl [5] implements various iterative solvers that use exactly the same code for CPUs and for GPUs. In addition, it allows the application of AutoDiff right through the iterative solver on a GPU by a simple type change in the code, similar to Listing 2.
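As an illustration of that type change, the following minimal sketch (our own example using ForwardDiff.Dual directly, not code from ExaPF.jl or Krylov.jl) runs the same broadcast kernel first on plain floats and then on dual numbers, which is exactly what lets AutoDiff propagate through an unmodified solver:

using ForwardDiff: Dual

# The same generic kernel as in Listing 2.
incrmul!(c, a, b) = (c .+= a .* b)

n = 4
a, b, c = ones(n), fill(2.0, n), zeros(n)
incrmul!(c, a, b)                    # plain floating-point evaluation

# Seed the inputs with dual numbers: only the element type changes,
# and derivatives are carried through the identical broadcast code.
da = [Dual(1.0, 1.0) for _ in 1:n]   # value 1.0, derivative seed 1.0
db = [Dual(2.0, 0.0) for _ in 1:n]
dc = [Dual(0.0, 0.0) for _ in 1:n]
incrmul!(dc, da, db)                 # each entry now carries d(a*b)/da = 2.0

On a GPU, the vectors would simply be of type CuVector{Dual{...}} instead of Vector, with no change to incrmul! itself.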
When algorithms leave the area of linear algebra and vector calculus, more complex kernels might be necessary, which leads us to the other method of abstraction. This is achieved through a hardware generalization that defines a common kernel abstraction API across the various GPU types. This bears more similarity to traditional portability layers like RAJA or Kokkos. However, it can be encapsulated in code together with the array abstraction described above, making switching from one abstraction to the other seamless. We will go into more detail when we cover the modeling of the power flow in Section 4.1.
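A minimal sketch of such a portable kernel, written against the KernelAbstractions.jl API (the kernel below is our own toy example, not ExaPF.jl's residual kernel, and assumes a recent version of the package):

using KernelAbstractions

# The same kernel definition runs on the CPU backend here, or on a GPU
# backend (e.g. CUDABackend()) when the arrays live on the device.
@kernel function incrmul_kernel!(c, @Const(a), @Const(b))
    i = @index(Global)
    c[i] += a[i] * b[i]
end

a, b, c = ones(16), fill(2.0, 16), zeros(16)
backend = KernelAbstractions.get_backend(c)   # CPU() for ordinary arrays
incrmul_kernel!(backend)(c, a, b; ndrange = length(c))
KernelAbstractions.synchronize(backend)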
The generation layer (see Figure 1) connects to the just-in-time interface of the respective vendor-provided GPU API (CUDA for NVIDIA, ROCm for AMD, and oneAPI for Intel). Without going into the details, CUDA is currently by far the most developed API for scientific computing. This is also true on the Julia side. A lot of companies and customers of Julia Computing, Inc. rely on NVIDIA GPU support in Julia. This makes the package CUDA.jl a Tier 1 supported package by Julia Computing, Inc., and NVIDIA a sponsor of the Julia language. ROCm and oneAPI support in Julia is currently in early development, with the targeted goal of supporting the same programming models implemented by GPUArrays.jl and KernelAbstractions.jl.
This document focuses on GPUs and thus on programming for single nodes. However, it should be noted that Julia has support for distributed parallelism through MPI and its own distributed abstraction. With MPI being the de facto standard for distributed scientific computing, MPI.jl provides a solid MPI API for Julia, relieving the user, through its multiple dispatch, from dealing with tedious low-level data structure types. Its support for serialization even allows the transparent communication of functions if the involved processes use the same Julia system image. MPI.jl provides support for CUDA-enabled MPI libraries, allowing direct communication between GPUs connected via NVLink. Nonetheless, it would also be worthwhile to investigate Julia's own distributed API, which allows for much leaner code by avoiding the Fortran and C focused MPI interface. In particular, we aim at applying distributed computing to solve multiperiod security constrained optimal power flow.
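As a small illustration of what this buys the user, the sketch below (our own example, assuming MPI.jl on top of a CUDA-aware MPI build) passes a device array directly to a collective, without staging through host memory:

using MPI
using CUDA

MPI.Init()
comm = MPI.COMM_WORLD

# With a CUDA-aware MPI library, a CuArray can be handed directly to
# MPI: the buffers are exchanged GPU-to-GPU (e.g. over NVLink).
buf = CUDA.fill(Float64(MPI.Comm_rank(comm)), 1024)
MPI.Allreduce!(buf, +, comm)   # in-place sum across all ranks

MPI.Finalize()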
3. Fast and Portable Power Flow Solver: ExaPF.jl
ExaPF.jl¹ leverages the GPU capabilities of Julia by implementing a power flow solver that runs fully on the GPU without any host-device transfer. We model the electrical network mathematically as an undirected graph $G = (B, E)$, where we denote by $B$ the set of buses and by $E$ the lines connecting the buses together. The topology of the network is usually given by a node admittance matrix $Y \in \mathbb{C}^{n_b \times n_b}$, with $n_b = |B|$ the number of buses in the network. The real and imaginary components of the matrix $Y$ are usually split apart, $Y = Y^{re} + j Y^{im}$, with $j = \sqrt{-1}$. For a bus $b \in B$, we denote by $V_b$ its voltage magnitude, $\theta_b$ its voltage angle, $p_b$ its active power, and $q_b$ its reactive power; these are the physical quantities that describe the steady state of the power network. Let $V = (V_b)_{b \in B}$ and $\theta = (\theta_b)_{b \in B}$.
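For illustration, a toy admittance matrix for a three-bus chain network and its split into real and imaginary parts could look as follows (our own example with an arbitrary line admittance, not data from ExaPF.jl):

using SparseArrays

y = 1.0 - 2.0im            # admittance of each of the two lines
Y = sparse([ y  -y  0
            -y  2y  -y
             0  -y  y ])   # node admittance matrix of a 3-bus chain
Yre = real.(Y)             # Y^re
Yim = imag.(Y)             # Y^im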
For each bus $b$, the network must satisfy the energy conservation laws defined by Kirchhoff's law: these are the power balance equations. Kirchhoff's law states that the sum of injected power $(p_k, q_k)$ into bus $k \in B$ must be equal to the sum of extracted power, establishing a set of power balance equations:

$$
p_k = V_k \sum_{l \in B} V_l \left( Y^{re}_{kl} \cos \theta_{kl} + Y^{im}_{kl} \sin \theta_{kl} \right), \qquad
q_k = V_k \sum_{l \in B} V_l \left( Y^{re}_{kl} \sin \theta_{kl} - Y^{im}_{kl} \cos \theta_{kl} \right), \tag{1}
$$
where $Y^{re}_{kl}$ and $Y^{im}_{kl}$ are the elements of the matrices $Y^{re}$ and $Y^{im}$, respectively, and $\theta_{kl} := \theta_k - \theta_l$. We write the power equations (1) in an abstract formalism by introducing a state variable $x = (V, \theta) \in \mathbb{R}^{2 n_b}$, a vector $p \in \mathbb{R}^{n_p}$ of $n_p$ parameters, and a residual functional $f : \mathbb{R}^{2 n_b} \to \mathbb{R}^{n_b}$ encoding the equations (1) in the compact form $f(x, p) = 0$.
In power system analysis, the Newton-Raphson algorithm [6] is a standard algorithm to solve the set of nonlinear equations $f(x, p) = 0$. Starting from an initial guess $x_0$, the algorithm proceeds with the following iterations until a convergence criterion is reached:

$$
x_{i+1} = x_i - \nabla f(x_i, p)^{-1} f(x_i, p), \qquad i = 1, \cdots
$$

Step by step, the algorithm writes out for each iteration $i$:

1. Evaluate $f_i = f(x_i, p)$
2. Compute $J_i = \nabla f(x_i, p)$
3. Solve $J_i \Delta x_i = -f_i$
4. Update $x_{i+1} = x_i + \Delta x_i$
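In code, the outer loop is only a few lines; the sketch below is our own condensed illustration of these four steps on a toy residual whose Jacobian is known analytically (ExaPF.jl's actual driver differs):

using LinearAlgebra

# Newton-Raphson on a toy residual f(x, p) = x.^2 .- p, whose Jacobian
# is the diagonal matrix 2x. In ExaPF.jl, f is the power flow residual,
# J_i comes from AutoDiff, and step 3 uses preconditioned BiCGSTAB.
f(x, p) = x .^ 2 .- p
J(x)    = Diagonal(2 .* x)

function newton_raphson(x, p; tol = 1e-8, maxiter = 20)
    for i in 1:maxiter
        fi = f(x, p)                 # step 1: evaluate f_i
        norm(fi) < tol && break      # convergence criterion
        dx = J(x) \ -fi              # steps 2-3: solve J_i Δx_i = -f_i
        x += dx                      # step 4: update x_{i+1} = x_i + Δx_i
    end
    return x
end

newton_raphson([1.0, 1.0], [2.0, 9.0])   # ≈ [√2, 3]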
In our endeavor to perform these computations fully on the GPU, we first come to the problem of solving the linear system of step 3, which involves the Jacobian matrix of the residual function. This matrix can be substantially large (≈ 100,000 entries) and in dense format cannot fit in a single GPU block. It is generally very sparse but unstructured, due to the topology of the power network. Because of the difficulties of solving large and sparse systems on GPUs with direct solvers, we employ an iterative solver. It has been shown [7] that iterative solvers like GMRES can offer good convergence but often require direct preconditioners such as incomplete LU (ILU), which can also be difficult to implement on a GPU architecture. In our work, we have found a combination of another iterative solver, BiCGSTAB, and a block-Jacobi preconditioner to offer good performance on GPU architectures.
With this in mind, all aforementioned steps are entirely implemented on the GPU. The algorithm proceeds as follows: step 1 is implemented using an abstract GPU kernel via KernelAbstractions.jl (see Listing 3). In step 2, the Jacobian $J_i$ is generated once per run using AutoDiff and evaluated directly on the GPU at each iteration. The AutoDiff package ForwardDiff.jl [8] is transparently applied to that kernel to generate $J_i$ in step 2.
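For step 2, a self-contained illustration of obtaining a Jacobian with ForwardDiff.jl (our own toy residual standing in for the power flow kernel; ExaPF.jl instead evaluates the dual numbers inside the GPU kernel itself):

using ForwardDiff, SparseArrays

# Toy residual standing in for the power flow residual f(x, p).
r(x) = [x[1]^2 - x[2], x[2]^2 - x[3], x[3]^2 - x[1]]

x  = [1.0, 2.0, 3.0]
J  = ForwardDiff.jacobian(r, x)  # Jacobian via forward-mode dual numbers
Js = sparse(J)                   # sparsity pattern is fixed across iterations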
In step 3, we note that the sparsity pattern of the Jacobian matrix $J_i$ does not change from one iteration to the next. Thus, we instantiate the block-Jacobi preconditioner at the first iteration by partitioning the initial Jacobian matrix $J_0$. Then, we update the preconditioner $P_i$ at each iteration with a three-step procedure (sketched below):

1. Extract the Jacobi blocks from the sparse CSR matrix $J_i$ and store them in dense format blocks.
2. Apply batch inversion (e.g. CUBLAS) on the dense blocks to get an approximation to $J_i^{-1}$.
3. Move the inverted blocks to the sparse CSR matrix $P_i$.
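The following is a minimal CPU sketch of that procedure on a SparseMatrixCSC (our own illustration; on the GPU, the per-block inversions of step 2 would be replaced by a single batched vendor call such as CUBLAS's matinvBatched):

using LinearAlgebra, SparseArrays

# Build a block-Jacobi preconditioner P ≈ J⁻¹ from equally sized
# diagonal blocks of a sparse Jacobian J.
function block_jacobi(J::SparseMatrixCSC, blocksize::Int)
    n = size(J, 1)
    P = spzeros(n, n)
    for first in 1:blocksize:n
        last  = min(first + blocksize - 1, n)
        rng   = first:last
        block = Matrix(J[rng, rng])  # step 1: extract block in dense format
        P[rng, rng] = inv(block)     # step 2: invert (batched on the GPU)
    end                              # step 3: scatter back into sparse P
    return P
end

J = sparse([4.0 1 0 0; 1 4 0 0; 0 0 4 1; 0 0 1 4])
P = block_jacobi(J, 2)               # exact inverse here: J is block diagonal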
The batch inversion has to be supported by the BLAS library provided by the GPU vendor. This is not a standard BLAS call, and we admit that its support is not guaranteed. However, its implementation can also be mapped to single dispatching BLAS kernels for matrix inversion.
We solve the linear system $J_i \Delta x_i = -f_i$ using the iterative algorithm BiCGSTAB. Our preconditioned BiCGSTAB is implemented straight from the original paper [9] using the GPUArrays.jl array abstraction. The same abstraction is used for the step update of $x$ in step 4. The reason why we have not implemented a matrix-free version of BiCGSTAB is that we have to apply the preconditioner $P_i$. As we will show in Section 5, neither the GPU memory nor the runtime of computing $J_i$ is a bottleneck in this application.
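For reference, a compact preconditioned BiCGSTAB in generic Julia array code (our own condensed transcription of the standard algorithm [9], not ExaPF.jl's implementation); it relies only on broadcasts and matrix-vector products, which is what allows the same code to run on CPU arrays and GPU arrays alike:

using LinearAlgebra

# Right-preconditioned BiCGSTAB for A x = b with preconditioner P ≈ A⁻¹,
# written with array operations only (no scalar indexing).
function bicgstab(A, b, P; tol = 1e-8, maxiter = 200)
    x  = zero(b)
    r  = copy(b)                 # r0 = b - A*x0 with x0 = 0
    r0 = copy(r)
    ρ_old = α = ω = one(eltype(b))
    v = zero(b); p = zero(b)
    for i in 1:maxiter
        ρ = dot(r0, r)
        β = (ρ / ρ_old) * (α / ω)
        p = r .+ β .* (p .- ω .* v)
        p̂ = P * p                # apply preconditioner
        v = A * p̂
        α = ρ / dot(r0, v)
        s = r .- α .* v
        ŝ = P * s
        t = A * ŝ
        ω = dot(t, s) / dot(t, t)
        x = x .+ α .* p̂ .+ ω .* ŝ
        r = s .- ω .* t
        norm(r) < tol && break
        ρ_old = ρ
    end
    return x
end

Combined with the block-Jacobi preconditioner sketched above, a call like bicgstab(J, -f, P) completes step 3.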
¹ https://github.com/exanauts/ExaPF.jl
Fragment of the power flow residual kernel (Listing 3):

coef_cos = v_m[fr]*v_m[to]*ybus_re_nzval[c]
coef_sin = v_m[fr]*v_m[to]*ybus_im_nzval[c]
cos_val = cos(aij)
sin_val = sin(aij)
F[i] += coef_cos*cos_val + coef_sin*sin_val
if i > npv
    F[npq + i] += coef_cos*sin_val - coef_sin*cos_val
end

Figure 2: Array abstraction using GPUArrays.jl

cost = sum(c2 .+ c3 .* pg .+ c4 .* pg.^2)

Launching the same kernel through each vendor API:

@cuda residual_kernel(F, ...)
@roc residual_kernel(F, ...)
@oneapi residual_kernel(F, ...)

Figure 4: Point-wise multiplication transformed to a dual type
[9] H. A. van der Vorst, Bi-CGSTAB: A fast and smoothly converg-
ing variant of Bi-CG for the solution of nonsymmetric linear
systems, SIAM Journal on Scientific and Statistical Computing
13 (2) (1992) 631–644.
[10] I. Dunning, J. Huchette, M. Lubin, JuMP: A modeling language for mathematical optimization, SIAM Review 59 (2) (2017) 295–320.
[11] A. Griewank, A. Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd Edition, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2008.
[12] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., TensorFlow: A system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[13] C. Rackauckas, M. Innes, Y. Ma, J. Bettencourt, L. White, V. Dixit, DiffEqFlux.jl - A Julia library for neural differential equations, CoRR abs/1902.02376. arXiv:1902.02376. URL https://arxiv.org/abs/1902.02376
[14] T. A. Davis, Algorithm 832: UMFPACK v4.3—an unsymmetric-pattern multifrontal method, ACM Trans. Math. Softw. 30 (2) (2004) 196–199.
[15] C. C. Paige, M. A. Saunders, LSQR: An algorithm for sparse linear equations and sparse least squares, ACM Transactions on Mathematical Software (TOMS) 8 (1) (1982) 43–71.
[16] K. Świrydowicz, S. Thomas, J. Maack, S. Peles, G. Kestor, J. Li, Linear solvers for power grid optimization problems: a review of GPU-accelerated solvers, Parallel Computing, submitted.
[17] M. Innes, A. Edelman, K. Fischer, C. Rackauckas, E. Saba, V. B. Shah, W. Tebbutt, Zygote: A differentiable programming system to bridge machine learning and scientific computing, arXiv preprint arXiv:1907.07587 (2019).
[18] M. Innes, Flux: Elegant machine learning with Julia, Journal of Open Source Software 3 (25) (2018) 602.