
Julia as a Portable High-Level Language for Numerical Solvers of Power Flow Equations on GPU Architectures

Michel Schanen*, Daniel Adrian Maldonado, François Pacaud, Mihai Anitescu, Kibaek Kim, Youngdae Kim, Vishwas Rao, Anirudh Subramanyam
Argonne National Laboratory, 9700 S. Cass Ave, Illinois, USA

*Corresponding author. Email address: [email protected] (Michel Schanen)
Preprint submitted to Parallel Computing, November 13, 2020

Abstract

We present ExaPF.jl, a solver for power flow on GPUs, written entirely in Julia. It implements a highly parallel Newton-Raphson solver for nonlinear equations. We exploit Julia packages for kernel and array abstractions at the modeling level, and generate efficient code at runtime for both CPUs and NVIDIA GPUs using Julia's inherent metaprogramming capabilities. In the future, this infrastructure will allow us to leverage this machinery for AMD and Intel GPUs by targeting the ROCm and oneAPI frameworks. The composable design of the Julia language allows us to apply automatic differentiation, which relieves the user from providing derivatives. We also detail a GPU implementation of the iterative solver BiCGSTAB to efficiently solve the linear systems arising in the Newton-Raphson algorithm. We show how to improve the performance of the BiCGSTAB algorithm by using a block-Jacobi preconditioner tailored towards the batched matrix inversion capabilities of modern GPUs. The Newton-Raphson algorithm will eventually serve as the foundation of a reduced-space optimization method that will run entirely on the GPU.

Keywords: power flow; GPU; optimization; Julia; framework; automatic differentiation

1. Introduction

Julia [1] is a new programming language that leverages modern programming design for scientific computing. It is both an interpreted and just-in-time compiled language that allows users to write high-performing code while having accessible syntax similar to Matlab. Its strong metaprogramming capabilities allow the developer to generate and transform code at runtime, making it highly flexible. The functional design of the language enables domain scientists to write composable, modular, and maintainable code.

All this is tied to the efficient compiler back-end LLVM [2], known to generate highly efficient machine code for C/C++ on a variety of architectures. With portability solved for general purpose computing platforms, the rise of GPU architectures in high-performance computing shows the limits of this solution. This specialized hardware comes in combination with a custom programming model. In scientific computing, where portability is key, this requires well-designed portability layers built through yet another layer of metaprogramming models (e.g., Kokkos [3], RAJA [4]), thus substantially increasing the complexity of a code base. With the metaprogramming capabilities of Julia no such programming models are needed, since code transformation is done through language-supported macros, moving the transformation of the original code from compile time to runtime. This leads to a better separation between the applied transformation logic and the original code. In Listing 1 we show a naive implementation of a dispatch macro that generates code either for the CPU or CUDA, depending on what the global variable target is set to. The expression expr contains the actual function call. The macro either generates the code that dispatches the function using CUDA or simply calls the function on the CPU.

An expression is stored as an abstract syntax tree and enables the programmer to directly access the Intermediate Representation (IR). This is a powerful tool that allows, for example, a language-intrinsic implementation of automatic differentiation (AutoDiff) and hardware-specific code. We show how these two entirely independent code transformations are easily combined in Julia, something that would otherwise be difficult to achieve in custom programming models.

Listing 1: Macro example in Julia

    macro dispatch(threads, blocks, expr)
        ex = nothing
        if target == "cuda"
            ex = quote
                @cuda $threads $blocks $expr
            end
        end
        if target == "cpu"
            ex = quote
                $expr
            end
        end
        return esc(ex)
    end
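As a usage illustration of Listing 1 (with a hypothetical kernel saxpy! and launch configuration; the expansions shown in the comments simply follow the two branches of the macro), the same call site can be retargeted by setting the global target before the enclosing code is evaluated:

    target = "cuda"
    @dispatch 256 1024 saxpy!(y, a, x)   # expands to: @cuda 256 1024 saxpy!(y, a, x)

    target = "cpu"
    @dispatch 256 1024 saxpy!(y, a, x)   # expands to: saxpy!(y, a, x)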
In this paper we develop an efficient power flow solver that targets general purpose architectures as well as GPUs. It uses algorithms from AutoDiff and iterative linear solvers and combines them in a modular way while separating hardware and algorithm as much as possible. This implementation serves as a mini-app and as the foundation for our future work on an optimal power flow solver for GPUs.

In Section 2 we give an overview of Julia's portable abstractions for GPUs. Section 3 describes how we leveraged Julia's composable design to implement a power flow solver with a modeling layer, differential programming capabilities, and a portable algebra, all targeting GPUs. The results in Section 5 give an overview on current GPU hardware using relevant power grid cases. We then conclude with a brief path forward in Section 6 for using this machinery in solving optimal power flow on GPU architectures.

2. GPU programming in Julia

The GPU API design of Julia follows a classical pattern of language design and compilers (see Figure 1). On the one end are abstractions connecting to the algorithms, and on the other end the generation of hardware-specific code. The Julia programmer can either choose to write abstracted kernels or follow an array abstraction that connects to the abstract array type of Julia. The latter option has the advantage of requiring no code changes at all and opens up all the tools written for array types: vectors, matrices, linear algebra. It relies solely on the broadcasting operator '.' for generating kernels at the assignment level (see Listing 2).

Figure 1: Julia GPU API - The abstraction is either through the abstract type of Julia language arrays or through a kernel abstraction. The generation and compilation of GPU code happens through metaprogramming in Julia that generates calls to the C APIs, which initiates a just-in-time compilation of the object code at runtime.

Listing 2: Array abstraction in Julia

    V = Vector              # CPU vector
    V = CuVector            # GPU CUDA vector
    V = ROCVector           # GPU ROCm vector
    V = oneArray            # GPU oneAPI vector
    V = CuVector{Float64}   # GPU vector
    D = V{Dual{Float64}}    # GPU dual vector
    da, db, dc = D(ones(Float64,n)) # GPU transfer

    function incrmul(a::AbstractArray,
                     b::AbstractArray,
                     c::AbstractArray)
        c .+= a .* b
    end
    incrmul(a,b,c)          # JIT instantiation

This allows linear algebra kernels to be written generically for both CPU and GPU. For example, Krylov.jl [5] implements various iterative solvers that use exactly the same code for CPUs and for GPUs. In addition, it allows the application of AutoDiff right through the iterative solver on a GPU by a simple type change in the code, similar to Listing 2.

When algorithms leave the area of linear algebra and vector calculus, more complex kernels might be necessary, which leads us to the other method of abstraction. This is achieved through a hardware generalization that defines a common kernel-abstracted API across various GPU types. This has more similarity to traditional portability layers like RAJA or Kokkos. However, it can be encapsulated in code using the array abstraction described above, making switching from one abstraction to the other seamless. We will go into more detail when we cover the modeling of the power flow in Section 4.1.

The generation layer (see Figure 1) connects to the just-in-time interface of the respective vendor-provided GPU API (CUDA for NVIDIA, ROCm for AMD, and oneAPI for Intel). Without going into the details, CUDA is currently by far the most developed API for scientific computing. This is also true on the Julia side. A lot of companies and customers of Julia Computing, Inc. rely on NVIDIA GPU support in Julia. This makes the package CUDA.jl a Tier 1 supported package by Julia Computing, Inc. and NVIDIA a sponsor of the Julia language. ROCm and oneAPI support in Julia is currently in early development, with a targeted goal of using the same programming models implemented by GPUArrays.jl and KernelAbstraction.jl.

This document focuses on GPUs and thus on programming for single nodes. However, it should be noted that Julia has support for distributed parallelism through MPI and its own distributed abstraction. With MPI being the de facto standard for distributed scientific computing, MPI.jl provides a solid MPI API for Julia, relieving the user through its multiple dispatch from dealing with tedious low-level data structure types. Its support for serialization even allows the transparent communication of functions if the involved processes use the same Julia system image. MPI.jl provides support for CUDA-enabled MPI libraries, allowing direct communication between GPUs connected via NVLink. Nonetheless, it would also be worthwhile to investigate Julia's own distributed API, which allows for much leaner code avoiding the Fortran and C focused MPI interface. In particular, we aim at applying distributed computing for solving multiperiod security constrained optimal power flow.
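Returning to the array abstraction of Listing 2, the same broadcast kernel can be instantiated with dual numbers so that derivatives propagate through it unchanged. The minimal sketch below runs on the CPU; under the array abstraction the identical code is assumed to apply to CuArray storage on the GPU:

    using ForwardDiff: Dual, value, partials

    n = 3
    a = [Dual(2.0, 1.0) for _ in 1:n]   # value 2.0, one seeded derivative direction
    b = [Dual(3.0, 0.0) for _ in 1:n]
    c = [Dual(0.0, 0.0) for _ in 1:n]

    c .+= a .* b                        # the same broadcast kernel as in Listing 2

    value.(c)                           # primal result: 6.0 in every entry
    [partials(x)[1] for x in c]         # derivative d(a .* b)/da: 3.0 in every entry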
3. Fast and Portable Power Flow Solver: ExaPF.jl

ExaPF.jl (https://github.com/exanauts/ExaPF.jl) leverages the GPU capabilities of Julia by implementing a power flow solver that runs fully on the GPU without any host-device transfer. We model the electrical network mathematically as an undirected graph G = (B, E), where we denote by B the set of buses and by E the lines connecting the buses together. The topology of the network is usually given by a node admittance matrix Y ∈ ℂ^(nb×nb), with nb = |B| the number of buses in the network. The real and imaginary components of the matrix Y are usually split apart as Y = Y^re + jY^im, with j = √−1. For a bus b ∈ B, we denote by V_b its voltage magnitude, θ_b its voltage angle, p_b its active power, and q_b its reactive power, which are the physical quantities that describe the steady state of the power network. Let V = (V_b)_{b∈B} and θ = (θ_b)_{b∈B}.

For each bus b, the network must satisfy the energy conservation laws defined by Kirchhoff's law: these are the power balance equations. Kirchhoff's law states that the sum of injected power (p_k, q_k) into bus k ∈ B must be equal to the sum of extracted power, establishing a set of power balance equations:

    p_k = V_k Σ_{l∈B} V_l (Y^re_{kl} cos θ_{kl} + Y^im_{kl} sin θ_{kl}),
                                                                            (1)
    q_k = V_k Σ_{l∈B} V_l (Y^re_{kl} sin θ_{kl} − Y^im_{kl} cos θ_{kl}),

where Y^re_{kl} and Y^im_{kl} are the elements of the matrices Y^re and Y^im, respectively, and θ_{kl} := θ_k − θ_l. We write the power equations (1) in an abstract formalism by introducing a state variable x = (V, θ) ∈ ℝ^(2nb), a vector p ∈ ℝ^(np) of np parameters, and a residual functional f : ℝ^(2nb) → ℝ^(nb) encoding the equations (1) in the compact form f(x, p) = 0.

In power system analysis, the Newton-Raphson algorithm [6] is a standard algorithm to solve the set of nonlinear equations f(x, p) = 0. Starting from an initial guess x_0, the algorithm proceeds with the following iterations until a convergence criterion is reached:

    x_{i+1} = x_i − ∇f(x_i, p)^(−1) f(x_i, p),   i = 1, ⋯

Step by step, the algorithm writes out for each iteration i:

    1. Evaluate f_i = f(x_i, p)
    2. Compute J_i = ∇f(x_i, p)
    3. Solve J_i Δx_i = −f_i
    4. Update x_{i+1} = x_i + Δx_i

In our endeavor to perform these computations fully leveraging the GPU, we first come to the problem of solving the linear system of step 3, which involves the Jacobian matrix of the residual function. This matrix can be substantially large (∼100,000 entries) and in dense format cannot fit in a single GPU block. It is generally very sparse but unstructured, due to the topology of the power network. Because of the difficulties of solving large and sparse systems on GPUs with direct solvers, we employ an iterative solver. It has been shown [7] that iterative solvers like GMRES can offer good convergence but often require direct preconditioners such as incomplete LU (ILU), which can also be difficult to implement on a GPU architecture. In our work, we have found a combination of another iterative solver, BiCGSTAB, and a block-Jacobi preconditioner to offer good performance on GPU architectures.

With this in mind, all aforementioned steps are entirely implemented on the GPU. The algorithm proceeds as follows: step 1 is implemented using an abstract GPU kernel via KernelAbstraction.jl (see Listing 3). In step 2 the Jacobian J_i is generated once per run using AutoDiff and evaluated directly on the GPU at each iteration. The AutoDiff package ForwardDiff.jl [8] is transparently applied to that kernel to generate J_i in step 2.

In step 3, we note that the sparsity pattern of the Jacobian matrix J_i does not change from one iteration to the next. Thus, we instantiate the block-Jacobi preconditioner at the first iteration by partitioning the initial Jacobian matrix J_0. Then, we update the preconditioner P_i at each iteration with a three-step procedure:

    1. Extract the Jacobi blocks from the sparse CSR matrix J_i and store them in dense format blocks.
    2. Apply batch inversion (e.g. CUBLAS) on the dense blocks to get an approximation to J_i^(−1).
    3. Move the inverted blocks to the sparse CSR matrix P_i.

The batch inversion has to be supported by the BLAS library provided by the GPU vendor. This is not a standard BLAS call and we admit that its support is not guaranteed. However, its implementation can also be mapped to single dispatching BLAS kernels for matrix inversion.

We solve the linear system J_i Δx_i = −f_i using the iterative algorithm BiCGSTAB. Our preconditioned BiCGSTAB is implemented straight from the original paper [9] using the GPUArrays.jl array abstraction. The same abstraction is used for the step update of x in step 4. The reason why we have not implemented a matrix-free version of BiCGSTAB is that we have to apply the preconditioner P_i. As we will show in Section 5, neither the GPU memory nor the runtime of computing J_i is a bottleneck in this application.
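To make steps 1-4 concrete, the sketch below outlines the Newton-Raphson loop; residual!, jacobian, and linsolve are assumed placeholder functions (not the ExaPF.jl API), with linsolve standing in for the preconditioned BiCGSTAB solve described above:

    using LinearAlgebra: norm

    function newton_raphson!(x, p; tol=1e-6, maxiter=20)
        F = similar(x)
        for i in 1:maxiter
            residual!(F, x, p)        # step 1: evaluate f(x, p), a GPU kernel in ExaPF.jl
            norm(F) < tol && break
            J = jacobian(x, p)        # step 2: Jacobian via AutoDiff
            Δx = linsolve(J, -F)      # step 3: solve J Δx = -f (preconditioned BiCGSTAB)
            x .+= Δx                  # step 4: update the state
        end
        return x
    end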
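The three-step preconditioner update can likewise be sketched with plain Julia on dense blocks. This is CPU notation with a contiguous partition for brevity; the paper partitions with METIS, and on the GPU the per-block inv call is replaced by a single batched CUBLAS inversion:

    using SparseArrays, LinearAlgebra

    function block_jacobi(J::SparseMatrixCSC, block_size::Int)
        n = size(J, 1)
        P = spzeros(eltype(J), n, n)
        for start in 1:block_size:n
            rng = start:min(start + block_size - 1, n)
            B = Matrix(J[rng, rng])    # step 1: extract a diagonal block in dense format
            P[rng, rng] = inv(B)       # step 2: invert the block (batched on the GPU)
        end                            # step 3: the inverted blocks land in the sparse P
        return P
    end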
Figure 2: Array abstraction using GPUArrays.jl

    cost = sum(c2 .+ c3 .* pg .+ c4 .* pg.^2)

Figure 3: Kernel abstraction using KernelAbstractions.jl

    @oneapi residual_kernel(F,...)
    @cuda residual_kernel(F,...)
    @roc residual_kernel(F,...)

This algorithm has three key algorithmic domains: the implementation of f (modeling), the computation of the Jacobian J (AutoDiff), and the linear solver. We will go over these three areas in the next section.

4. Portable and Composable Algorithm Design

In our portable design we distinguish between three essential ingredients: modeling, differentiation (AutoDiff), and linear algebra. Our goal is to have these three components abstracted in a portable way and to apply this design pattern to nonlinear equations in the future (such as nonlinear programming in optimization). We will emphasize the benefits of using a composable and differentiable language like Julia that allows these three components to be completely independent and not domain or application specific.

4.1. Modeling

The modeling in ExaPF.jl leverages the GPUArrays.jl and KernelAbstraction.jl abstractions presented in Section 2. In addition to this hardware/software abstraction, we have separated the mathematical and physical abstractions. The power flow kernel, which evaluates the residual function of the Newton-Raphson problem, is implemented using the physical quantities.

Listing 3: Kernel implementation using KernelAbstraction.jl

    @kernel function residual_kernel!(F, ...)
        i = @index(Global, Linear)
        fr = (i <= npv) ? pv[i] : pq[i-npv]
        F[i] -= pinj[fr]
        if i > npv
            F[i + npq] -= qinj[fr]
        end
        for c in colptr[fr]:colptr[fr+1]-1
            to = ybus_re_rowval[c]
            aij = v_a[fr] - v_a[to]
            coef_cos = v_m[fr]*v_m[to]*ybus_re_nzval[c]
            coef_sin = v_m[fr]*v_m[to]*ybus_im_nzval[c]
            cos_val = cos(aij)
            sin_val = sin(aij)
            F[i] += coef_cos*cos_val + coef_sin*sin_val
            if i > npv
                F[npq + i] += coef_cos*sin_val - coef_sin*cos_val
            end
        end
    end
1.5

Cost(ms) per direction


1

0.5
Figure 4: Point-wise multiplication transformed to a dual type

1. Currently, in JuMP.jl, variables are the first-class ob- 0


1 8 16 32 64 128 256 512
ject. For GPUs, models have to be written with vec-
tors (@vector) as a first class object to apply the Number of concurrent directions
broadcast operator as shown in Section 1 to seam- Figure 5: Optimal number of concurrent directions in a power flow
320 lessly generate SIMD kernels for the GPU. model evaluation with 9,241 buses on the PEGASE case (see Sec-
2. To allow the expression of sparsity, an adjacency tion 5). The number of actual Jacobian colors is 28. To compute
graph G is defined through the macro @graph passed the full Jacobian, a directional derivative component for each color
is required.
to a sparse equation macro @spequation. This even-
tually allows loops over sparse matrix entries in CSR
325 or CSC format in a kernel.
3. And last, since kernels are evaluated asynchronously,
we want to convey that notion to the model and ex-
plicitly give the user the ability to define synchroniza-
tion points via the @synchronize macro.
330 These amendments to the language are currently dis-
cussed in our community and have not yet been im-
plemented in our package. This feature would allow
the model to be easily defined outside of the package
ExaPF.jl. (a) Jacobian stored in CSR or (b) Compressed Jacobian in
CSC: A valid coloring has no dense storage: White color en-
color appearing twice in a row tries are unused space
335 4.2. Differentiation
Writing the differentiated model by hand is known to be Figure 6: Jacobian coloring on the IEEE-30 bus case compressing
a tedious process prone to errors and very hard to debug. the matrix from 53 × 53 to 15 × 53
Since this software is meant as a research tool we do not
want to burden the user with differentiating the model
340 and implementing J(x) and only requiring the primal/non360 restriction the hardware abstractions for the GPU have, is
differentiated function f . To this end, we designed the that the data structures have to be of binary types. This
package to make full use of AutoDiff [11] throughout all is true for all primitive types and types composed of prim-
architectures. AutoDiff is a technology that transforms itive types. However, it is not true for example if functions
code of a function implementation y = f (x), with x ∈ or references are part of a type. Fortunately, the AutoD-
345 Rn , y ∈ Rn algorithmically to compute the tangent model365 iff package ForwardDiff.jl generates a dual or derivative
calculating y = f (x) and ẏ = J(x) · ẋ, where ẏ and ẋ are type that has the isbits property. This allows us to apply
called the tangents, directional derivatives, or duals. Note, AutoDiff seamlessly for GPUs just as for CPUs. Each dual
that the Jacobian J is not generated directly. In order to type is composed of its value and the directional derivative
compute the full Jacobian, one has to call the tangent- values.
350 model n times over the n Cartesian basis vectors of Rn .370 In Figure 4 we show an example of a point-wise multipli-
This yields a computational cost of O(n · cost(f )), with cation of two vectors. The red color is the original vector
cost(f ) being the cost of the original function evaluation. that is then generated to a GPU kernel using the broad-
The composability of the Julia language has shown early cast multiplication. The dual type adds c directions to
on that code transformations like AutoDiff can be seam- each vector entry. Again, the broadcast operator is seam-
355 lessly integrated in a modular workflow. In stark contrast375 lessly applied to the now twice vectorized object. This is
to tools like Tensorflow [12], AutoDiff in Julia does not rely also reflected in our runtime results (see Figure 5), where
on a domain specific language (DSL)[13], since all state- we see a superlinear speedup by adding directions c with
ments in the modeling frameworks and hardware abstrac- an optimum reached around 32 and 64 directions for an
tions eventually end up as Julia code or Julia IR. The only entire power flow evaluation.
5
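The seeding of directional derivatives can be sketched directly with ForwardDiff.jl dual numbers; the toy residual f and the round-robin seeding pattern below are illustrative only, standing in for the coloring-based seeding used in the package:

    using ForwardDiff: Dual, value, partials

    f(x) = x .* x .+ 2.0 .* x           # toy residual built from broadcasts

    n, c = 4, 2                         # n unknowns, c concurrent directions (colors)
    seeds = [ntuple(k -> k == 1 + (i - 1) % c ? 1.0 : 0.0, c) for i in 1:n]
    xd = [Dual(Float64(i), seeds[i]...) for i in 1:n]   # value plus c seeded partials

    yd = f(xd)                          # one evaluation propagates all c directions
    Jc = [partials(yd[i])[k] for i in 1:n, k in 1:c]    # compressed (n × c) Jacobian
    value.(yd)                          # primal values come along for free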
However, our Jacobian has a dimension that is on the scale of O(number of buses), which would amount to a large number of directions c and exhaust the memory of the GPU. That is why we apply a technique called Jacobian coloring (see Figure 6). As the sparsity structure stays the same across the computation, we do this once in the setup stage on the CPU, because this algorithm is not amenable to the GPU. This allows us to compute the full Jacobian in one evaluation by computing several independent directions in parallel while exploiting the aforementioned SIMD operations. The output is a compressed Jacobian that is directly used to build the preconditioner described in the following section. Another kernel is used to extract the sparse Jacobian into CSR or CSC format for the iterative linear solver.

Figure 6: Jacobian coloring on the IEEE-30 bus case, compressing the matrix from 53 × 53 to 15 × 53. (a) Jacobian stored in CSR or CSC: a valid coloring has no color appearing twice in a row. (b) Compressed Jacobian in dense storage: white entries are unused space.

4.3. Linear Algebra

We have extensively explored the best iterative solver setup for our GPU implementation, consisting of choosing a preconditioner and an iterative linear solver algorithm. For the preconditioner, the natural choice for a GPU is to have independent tasks that can be executed by small kernels. We therefore picked a block-Jacobi preconditioner, usually used in distributed parallelism for PDEs. We partition the Jacobian matrix using METIS. The procedure splits the Jacobian into diagonal blocks with a minimal cut (see Figure 7). These blocks are built in single dense format directly from the compressed Jacobian. We then apply a batched BLAS call to invert these single dense blocks. This nonstandard batched BLAS call is present in CUBLAS; however, there is no guarantee this feature will be available in ROCm and oneAPI. A workaround would be to asynchronously dispatch single dense BLAS calls for each block. After this batch inversion is done, the inverted blocks are put into the preconditioner matrix P, stored in sparse CSR or CSC format, via another GPU kernel. Note that this technique of having a lot of single blocks to invert (up to 1,000) is not amenable to the CPU, thus a one-to-one comparison between CPU and GPU is not reasonable here. We acknowledge that for the CPU the user should continue using standard direct sparse solvers.

Figure 7: Linear solver implementation: preconditioned BiCGSTAB using batch inversion on diagonal blocks together with BiCGSTAB

Our Jacobian is an indefinite matrix, thus we compared two algorithms: GMRES and BiCGSTAB. BiCGSTAB led to a much better performance than GMRES for our problem with the block-Jacobi preconditioner (see Figure 8). Both linear solvers did not converge without the preconditioner. As reference implementations we used the iterative solvers of the Julia packages IterativeSolvers.jl and Krylov.jl. Neither of the two packages implements a BiCGSTAB algorithm that works with GPUArrays.jl. That is why we implemented our own BiCGSTAB based on the original paper [9]. For comparisons we added the following linear solvers to our package:

    • GMRES from Krylov.jl [5] (GPU/CPU) and IterativeSolvers.jl (CPU),
    • BiCGSTAB from IterativeSolvers.jl (CPU),
    • CUSOLVER (CPU/GPU hybrid),
    • the standard \ operator in Julia using the multifrontal solver UMFPACK [14],
    • and BiCGSTAB (CPU/GPU) [9].

Figure 8: Comparison of GMRES and BiCGSTAB on two cases from the Grid Optimization competition using reference implementations on the CPU

With all the components in place, we present our results in the next section.

5. Results

All experiments were conducted on our workstation at Argonne and a node of the Summit supercomputer at the Oak Ridge Leadership Computing Facility. Both systems are equipped with NVIDIA Quadro GV100 GPUs, with the two cards only differing in the amount of RAM (16GB vs 32GB). Our workstation uses the 32GB variant combined with a dual socket Intel Xeon Gold 6140 CPU @ 2.30GHz and 512 GB of RAM. Summit uses the PowerPC architecture, which is supported by Julia since version 1.3. On Summit we build Julia directly from source and use the CUDA libraries installed on the system. Since ExaPF.jl runs nearly entirely on the GPU, we observed that the runtime differences between Summit and our workstation are negligible for our results, which is attributed to the benchmarked parts running entirely on the GPU. To get an overview of the performance we use a variety of network data (see Table 1): the IEEE 300 bus case, the Pan European Grid Advanced Simulation and State Estimation (PEGASE) 9,241 bus case, and grids of various sizes (10,000 and 30,000 buses) from the Trial 3 dataset of the ARPA-E Grid Optimization (GO) competition. Note that the largest transmission network in the U.S. is the Eastern Interconnect with a size of around 70,000 buses.

    Case       Buses     Generators
    IEEE300    300       70
    PEGASE     9,241     1,445
    GO1        10,000    2,089
    GO2        30,000    3,526

Table 1: Overview of tested cases: case name, number of buses, and number of generators

The size of the Jacobian is proportional to the number of buses. From Section 3 we have the following three major computational steps: compute the Jacobian J through AutoDiff, compute the preconditioner P, and solve the linear system (P · J)\F. In addition, the structure of the network influences the number of Jacobian colors, which itself influences the performance of AutoDiff.

We compare the performance between using a direct linear solver on the CPU, a direct linear solver on the GPU, and our custom PCBiCGSTAB. The direct linear solver is implemented through the backslash operator \ in Julia, which eventually calls UMFPACK. This solution is roughly the same algorithm as in MATPOWER and achieves similar performance. For the direct solver on the GPU we chose the vendor-provided LSQR csrlsvqr [15] of the CUSOLVER library, which is a black box as the code is closed source. To achieve best performance (see Section 4.2), we keep the block-Jacobi block sizes at around 64 × 64. This leads to a large number of blocks (up to 902 for the GO2 case). It is impossible to invert that many blocks efficiently on the CPU, thus we do not run PCBiCGSTAB on the CPU. The CPU should clearly shine with the traditional sparse solver. The choice of the solvers is preceded by an extensive exploration of the linear solvers available on GPUs for our class of problems. This work is included in an article submitted to this same issue of Parallel Computing [16].

We distinguish between the setup stage and the solve stage. The setup stage runs on the CPU and includes reading the power system data, coloring the Jacobian, the block-Jacobi partitioning, compilation, and general object factory assembly. These steps have to be executed only once per case and could in theory be largely reused across runs. We plan on deploying such features in the future. The solving stage is the loop over the Newton-Raphson iterations elaborated in Section 3. This loop constitutes our benchmarked section and requires no allocations and no host-device transfer.

Figure 9 (total runtime in seconds on IEEE300, PEGASE, GO1, and GO2 for UMFPACK, PCBiCGSTAB, and CUSOLVER): Total runtime with various linear solvers

The total runtime results in Figure 9 show that the sparse solver UMFPACK on the CPU provides the fastest time to solution on all four cases. The CUSOLVER solver shows better results than PCBiCGSTAB on IEEE300 and PEGASE. However, on the cases GO1 and GO2 we see CUSOLVER faltering and PCBiCGSTAB being the second fastest solution. To understand this better we look at the performance details of PCBiCGSTAB in Table 2 and realize that the Newton-Raphson iteration counts are all around 4-6, with a tolerance for the Krylov solver set to ε = 10⁻⁶. However, BiCGSTAB struggles with both the IEEE300 and PEGASE cases, with an associated number of Krylov iterations being around a quarter of the matrix dimension. The cases GO1 and GO2 show a much better Krylov solver convergence. This is due to the condition of the matrix, which eventually is based on the structure and physics of the network. IEEE300 and PEGASE are artificially created networks, which are known to be harder problem instances. GO1 and GO2 could be based on more realistic instances; however, we have no confirmation of this being the case. Tuning the number of blocks and the tolerances could lead to better performance, but we wanted to have the same parameters for all four cases and a realistic use case scenario of our software, with a heuristic default choice of tolerances and number of Jacobi blocks in the preconditioner.

    Case       Colors    Dim.      Blocks    N-R    Krylov
    IEEE300    8         530       9         5      192
    PEGASE     28        17,036    267       6      4666
    GO1        14        19,068    298       4      1292
    GO2        20        57,721    902       4      2153

Table 2: Performance details: Jacobian colors, dimension of the n × n Jacobian matrix, number of Jacobi blocks, Newton-Raphson iterations, and Krylov iterations

Last, we want to take a look at the relative performance (see Figure 10) of AutoDiff, the block-Jacobi preconditioner, and the BiCGSTAB. AutoDiff is known to lead to performance overheads due to the additional complexity of the code transformation. Added to this, knowing that AutoDiff on the GPU is a bottleneck in machine learning applications, we were surprised that our careful implementation resulted in a stellar performance for the Jacobian computation on the GPU. AutoDiff takes below 10% of the runtime in all four cases. The preconditioner, which relies on the batch inversion of the blocks, comes in below 20% of the total runtime. This leaves us with the BiCGSTAB taking around 80% of the runtime. It should be noted that our BiCGSTAB implementation is written using the GPUArrays.jl abstraction in only 100 lines of code that resemble a general purpose MATLAB implementation.

Figure 10 (fraction of total runtime per case for IEEE300, PEGASE, GO1, and GO2): Fraction of runtime for the three most costly computations: BiCGSTAB, preconditioner (PC), and Jacobian computation through AutoDiff. The bulk of the time is spent in BiCGSTAB.

We expect further performance gains by

    • optimizing the BiCGSTAB implementation,
    • improving the preconditioner (e.g. additive Schwarz).

In summary, we are confident that our solution of using iterative solvers on the GPU for complex system problems is the right way forward. Different architectures require different solutions, and sparse direct solvers do not seem a good fit for GPUs. Moreover, we are confirming that Julia technologies, like AutoDiff, are efficiently portable to scientific applications on GPUs without relying on complex frameworks like Tensorflow or PyTorch.

6. Extension to Optimal Power Flow: Reduced Methods

Now that we have described everything needed to port the resolution of the power flow equations to the GPU, the attentive reader is right to wonder how to apply this to optimal power flow. For the sake of clarification, we propose hereafter an application to the resolution of optimal power flow, directly in the reduced space induced by the power flow equations f(x, p) = 0. We will not cover the numerical optimization details and focus on how our GPU implementation of the power flow allows us to solve optimal power flow entirely on the GPU.

6.1. Optimal power flow problem

We now assume a control subset of the parameter vector p, which we partition into a vector of controls u and a vector of fixed parameters p_f such that p = (u, p_f). In optimal power flow, the control variables u are usually the active generated power at the generators in the power grid. Then, the job of the optimizer is to find the optimal control u such that a cost functional c(x, u) is minimized while satisfying a set of inequality constraints h(x, u) ≤ 0. The optimization problem is given by

    min_{x,u}  c(x, u)
    s.t.       f(x, u) = 0                                      (2)
               h(x, u) ≤ 0,

(we have omitted the vector of fixed parameters p_f for the sake of conciseness).

6.2. Reduced-space algorithm

In (2), we are optimizing with respect to the state x and the control u. However, we know that the state depends on the control u via the implicit equation f(x, u) = 0. Rephrasing it, for each control u, we ought to find a state x = x(u) such that f(x(u), u) = 0. The solution x(u) is defined implicitly as a solution of the Newton-Raphson algorithm described in Section 3. Therefore, the reduced-space problem writes out simply as

    min_u  c(x(u), u)
    s.t.   h(x(u), u) ≤ 0.                                      (3)

The formulation (3) comes with two advantages.

    1. The dimension of the problem is reduced from n_x + n_u to n_u, allowing us to decrease the memory usage of the resolution algorithm.
    2. The equality equations f(x, u) = 0 are satisfied implicitly and do not appear in the reduced-space problem. Thus, the resulting nonlinear problem encompasses only inequality constraints.

The beauty of the reduced-space method is that the implicit function theorem (under some regularity conditions) allows us to derive the gradients of the functions f and h directly in the reduced space. For instance, if we consider the objective c, we introduce the Lagrangian functional associated to the equality constraints f(x, u) = 0 as

    L(x, u, λ) = c(x, u) + λ⊤ f(x, u).                          (4)

If x = x(u) satisfies the power flow equation, the Lagrangian is equal to L(x, u, λ) = c(x, u) and its value is independent from the adjoint λ. By the chain rule, the gradient of L(x(u), u, λ) with relation to u is then given by

    ∂L/∂u = ∂c/∂u + (∂c/∂x)(dx/du) + λ⊤ ( ∂f/∂u + (∂f/∂x)(dx/du) )
          = ∂c/∂u + λ⊤ ∂f/∂u + ( ∂c/∂x + λ⊤ ∂f/∂x ) (dx/du).    (5)

By choosing the Lagrangian multipliers or adjoints as a solution of the linear system

    (∂f/∂x)⊤ λ = − (∂c/∂x)⊤,                                    (6)

we compute the reduced gradient as

    ∇_u c(u) = ∂c/∂u + (∂f/∂u)⊤ λ.                              (7)

Fortunately, the gradient ∂f/∂x has already been computed in the solution of the power flow as the Jacobian J_i = ∇f(x), which we acquired using AutoDiff. Moreover, we can use the same linear solver PC-BiCGSTAB on the GPU to solve the linear system (6). AutoDiff may also be applied to compute ∇c; however, due to performance reasons, and it only being a quadratic function, we implemented the gradient by hand. In the future, we envision applying adjoint AutoDiff tools like Zygote.jl [17] to generate that implementation.

The selection of the step update u_{k+1} is subject to current research. However, all these steps are trivial to implement on the GPU using the same GPUArrays.jl abstraction. The optimal power flow implementation is only a thin additional layer on top of our implementation for the power flow.

7. Conclusion

We have described a portable design of a nonlinear equations solver in Julia, extensible to nonlinear programming. It serves both as a mini-app for optimal power flow and constitutes an engineering application in itself. It includes the basic issues that also arise in optimization: Newton-Raphson, linear solver, AutoDiff, and modeling.

ExaPF.jl runs on systems from a laptop to supercomputers like Summit without any code change. Julia completely relieves the user from the compilation step and the tedious building of portability frameworks conventionally used in scientific computing. With incoming support for Intel and AMD GPUs in KernelAbstraction.jl and GPUArrays.jl we will be able to target the upcoming supercomputers Aurora and Frontier. This seamless support for GPU accelerators is extensible to other accelerators like FPGAs and TPUs, as long as they provide accelerated vector operations, crucial for scientific codes.

Our implementation adheres to the principle of universal differential programming, allowing it to be readily encapsulated in a machine learning framework like Flux.jl [18]. As opposed to Tensorflow, no code has to be ported to a domain specific implementation, and it can be transparently included in Flux.jl models.

We expect that every type of accelerator comes with its strengths and weaknesses that will shape the algorithm that fits the architecture best. We have shown that CPUs are a natural fit for sparse algebra with their deep pipelines and large caches, whereas GPUs fit very well with the matrix-vector operations in iterative linear solvers. We also apply the block-Jacobi algorithm with a very large number of blocks. That is a domain in which it was rarely used on the CPU, but it seems to be promising on the GPU. With the advent of GPUs as the main source of performance on upcoming systems, we expect a new trend of algorithms used for nonlinear programming in complex system problems.

Acknowledgements

This research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy's Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation's exascale computing imperative.

References

[1] J. Bezanson, A. Edelman, S. Karpinski, V. B. Shah, Julia: A fresh approach to numerical computing, SIAM Review 59 (1) (2017) 65–98.
[2] C. Lattner, V. Adve, LLVM: A compilation framework for lifelong program analysis & transformation, in: International Symposium on Code Generation and Optimization (CGO 2004), IEEE, 2004, pp. 75–86.
[3] H. C. Edwards, C. R. Trott, D. Sunderland, Kokkos: Enabling manycore performance portability through polymorphic memory access patterns, Journal of Parallel and Distributed Computing 74 (12) (2014) 3202–3216.
[4] D. A. Beckingsale, J. Burmark, R. Hornung, H. Jones, W. Killian, A. J. Kunen, O. Pearce, P. Robinson, B. S. Ryujin, T. R. Scogland, RAJA: Portable performance for large-scale scientific applications, in: 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), 2019, pp. 71–81.
[5] A. Montoison, D. Orban, and contributors, Krylov.jl: A Julia basket of hand-picked Krylov methods, https://github.com/JuliaSmoothOptimizers/Krylov.jl (June 2020).
[6] T. J. Ypma, Historical development of the Newton–Raphson method, SIAM Review 37 (4) (1995) 531–551.
[7] A. Flueck, H.-D. Chiang, Solving the nonlinear power flow equations with an inexact Newton method using GMRES, IEEE Transactions on Power Systems 13 (2) (1998) 267–273.
[8] J. Revels, M. Lubin, T. Papamarkou, Forward-mode automatic differentiation in Julia, arXiv:1607.07892 [cs.MS]. URL https://arxiv.org/abs/1607.07892
[9] H. A. van der Vorst, Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems, SIAM Journal on Scientific and Statistical Computing 13 (2) (1992) 631–644.
[10] I. Dunning, J. Huchette, M. Lubin, JuMP: A modeling language for mathematical optimization, SIAM Review 59 (2) (2017) 295–320.
[11] A. Griewank, A. Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd Edition, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2008.
[12] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: A system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[13] C. Rackauckas, M. Innes, Y. Ma, J. Bettencourt, L. White, V. Dixit, DiffEqFlux.jl - A Julia library for neural differential equations, CoRR abs/1902.02376. arXiv:1902.02376. URL https://arxiv.org/abs/1902.02376
[14] T. A. Davis, Algorithm 832: UMFPACK v4.3, an unsymmetric-pattern multifrontal method, ACM Transactions on Mathematical Software 30 (2) (2004) 196–199.
[15] C. C. Paige, M. A. Saunders, LSQR: An algorithm for sparse linear equations and sparse least squares, ACM Transactions on Mathematical Software (TOMS) 8 (1) (1982) 43–71.
[16] K. Świrydowicz, S. Thomas, J. Maack, S. Peles, G. Kestor, J. Li, Linear solvers for power grid optimization problems: a review of GPU-accelerated solvers, Parallel Computing (submitted).
[17] M. Innes, A. Edelman, K. Fischer, C. Rackauckas, E. Saba, V. B. Shah, W. Tebbutt, Zygote: A differentiable programming system to bridge machine learning and scientific computing, arXiv preprint arXiv:1907.07587 (2019) 140.
[18] M. Innes, Flux: Elegant machine learning with Julia, Journal of Open Source Software 3 (25) (2018) 602.