Dynamic Generation of Python Bindings for HPC Kernels

Steven Zhu, Nader Al Awar, Mattan Erez, and Milos Gligoric
The University of Texas at Austin
{stevenzhu,nader.alawar,mattan.erez,gligoric}@utexas.edu

Abstract—Traditionally, high performance kernels (HPKs) have been written in statically typed languages, such as C/C++ and Fortran. A recent trend among scientists, prototyping applications in dynamic languages such as Python, created a gap between the applications and existing HPKs. Thus, scientists have to either reimplement necessary kernels or manually create a connection layer to leverage existing kernels. Either option requires substantial development effort and slows down progress in science. We present a technique, dubbed WayOut, which automatically generates the entire connection layer for HPKs invoked from Python and written in C/C++. WayOut performs a hybrid analysis: it statically analyzes header files to generate Python wrapper classes and functions, and dynamically generates bindings for those kernels. By leveraging the type information available at run-time, it generates only the necessary bindings. We evaluate WayOut by rewriting dozens of existing examples from C/C++ to Python and leveraging HPKs enabled by WayOut. Our experiments show the feasibility of our technique, as well as negligible overhead on HPK performance.

Index Terms—bindings, high performance kernels, dynamic program analysis, Python

I. INTRODUCTION

Traditionally, high-performance computing (HPC) applications are written in statically typed (and low-level) programming languages, such as C/C++ and Fortran [1]–[3]. These languages are the de facto standard in the HPC area due to the excellent performance of the resulting applications.

HPC applications spend most of their execution time in so-called high-performance kernels (HPKs), such as linear algebra operations and solvers [4]. Over the last several years, the number of HPKs has been steadily growing and existing HPKs are constantly optimized and updated to support new hardware platforms.

Recently, several frameworks were introduced to enable developers to write performance portable HPKs. Namely, a developer can write an HPK only once and the framework automatically enables the execution of that HPK on a variety of hardware platforms (e.g., Intel CPUs, Nvidia GPUs, and AMD GPUs). Some of the most notable frameworks that support performance portability include Kokkos [5], [6] and RAJA [7]. These frameworks enable the rapid development of new HPKs, although they are still based on C/C++.

Meanwhile, scientists are transitioning to dynamically typed languages, such as Python [8], Julia [9], or Lua, for writing their applications. In order to obtain good performance, scientists have to either: (a) implement HPKs in their language of choice (using high-performance libraries like Numba [10] or PyKokkos [11]), or (b) create bindings to existing HPKs implemented in C/C++ or one of the frameworks that supports performance portability (using libraries like pybind11 [12]).

In either case, substantial work is required [13], [14]. Maintenance of manually written bindings (as HPKs evolve) introduces additional challenges.

We present WayOut, a novel approach to automatically generating connection layers for existing (performance portable) HPKs to be used by Python applications. WayOut is the first approach that combines static and dynamic program analysis. Specifically, for a given header file, WayOut performs static analysis to create: (1) wrapper classes and functions, i.e., a Python API provided to scientists that reflects the given header file, and (2) header files with templated bindings that will be instantiated at run-time. When a Python application is executed and one of the wrapper functions is invoked, WayOut intercepts the call, instantiates and generates the bindings for the given types, and invokes an existing HPK. One of the key insights behind WayOut is that it postpones binding generation until it has the types needed (which are not available statically in Python). WayOut also caches generated bindings, so only the very first invocation of each function (with one set of type arguments) introduces some overhead; the cache is saved across application runs.

We designed WayOut to overcome the limitations of cppyy [15] and pyximport [16], which target the same task, but take very different approaches. Unfortunately, neither of the two mentioned approaches could be used to invoke existing HPKs from within a Python application. Cppyy depends on a powerful but immature tool chain, including PyPy [17], an alternative implementation of the Python interpreter, and Cling [18], an interactive C++ interpreter. On the other hand, pyximport does not support dynamic instantiation of templates and thus is unable to instantiate bindings if types are known only at run-time.

We overcome a set of critical challenges to realize WayOut, including: (1) the lack of function and method overloading in Python; (2) concurrent use of multiple template instantiations of the same class; (3) inferring types of returned objects; and (4) ambiguously typed template arguments.

We evaluate WayOut by automatically generating bindings for Kokkos Kernels [4], one of the most popular frameworks for HPKs, and Thrust [19], a powerful template library containing parallel algorithms. We rewrote a number of existing examples (that use Kokkos Kernels and Thrust) from C/C++

to Python. Our experiments show the feasibility of our technique, as well as its negligible overhead on HPK performance. In our experiments, we also show that WayOut does not impact the performance portability of HPKs: we were able to execute all the examples on both CPUs and GPUs.

This paper makes the following key contributions:
• Design of WayOut, a novel approach for automatically generating a connection layer for existing HPKs to be used in Python applications. WayOut uses a hybrid approach, a combination of static and dynamic program analysis, to instantiate the connection layer.
• Implementation of WayOut for Python. The design of WayOut is modular and others could use our processing of header files to support connection layers with other programming languages, e.g., Lua. Source code of WayOut is available at https://github.com/EngineeringSoftware/wayout.
• Evaluation of WayOut by rewriting a number of existing examples from C/C++ to Python and using existing HPKs from Kokkos Kernels and Thrust. We chose Kokkos because it is a popular performance portability framework and it currently has only a few manually written bindings; we chose Thrust to demonstrate the generality of our approach.

II. MOTIVATION

In this section, we provide some background on HPKs and binding generation, as well as motivation for WayOut.

A. HPKs

The usage of hand-optimized HPKs in scientific computing is extremely common. Typically, these kernels are written using high performance C/C++ frameworks that can exploit parallelism on multi-core processors, such as OpenMP [20] for CPUs and CUDA [21] for GPUs. More recently, frameworks such as Kokkos [5], [6] and RAJA [7] build abstractions on top of these device-specific frameworks to enable performance portability, i.e., code that is portable across devices while still achieving good performance. As such, these frameworks are a natural choice for writing high-performance kernels. Kokkos, for example, is used by numerous applications and packages for large-scale scientific computing, such as Trilinos [22], LAMMPS [23], Albany [24], Empire [25], and others.

Kokkos Kernels [4] is a collection of performance portable kernels written in Kokkos. It includes a large variety of math kernels and data structures commonly used in linear algebra and graph algorithms. One such example of a linear algebra kernel is the sparse matrix vector multiply kernel, or SpMV for short. The following code snippet shows how SpMV can be called in a Kokkos (C++) application, where A is a sparse matrix, alpha and beta are scalars, and x and y are vectors.

    KokkosSparse::spmv("N", alpha, A, x, beta, y);

B. Binding Generation

The target audience for these HPKs is largely composed of scientists [4], [22] who need them for simulations and experiments. However, these scientists typically do not have formal training in programming, so using C++, which is notorious for its poor error messages and complicated build systems, can be a huge deterrent. Instead, they prefer higher level languages with “batteries included” [8], such as Python.

Several attempts have been made to expose these libraries and kernels to other languages [11], [26], [27]. This requires the use of language bindings, which allow for interoperability between different languages. Figure 1 shows a high-level illustration of language bindings between Python and C++.

[Figure: on the Python side, User Code calls the Language Bindings; on the C++ side, the Language Bindings call the HPK; the result is returned along the same path.]
Fig. 1: An illustration of language bindings.

Numerous frameworks have been implemented to provide Python bindings to C++ code, such as Boost.Python [28], pybind11 [12], and SWIG [29]. The following code snippet shows what a call to SpMV could look like once it has been exposed to Python through one of the binding frameworks.

    spmv(char_ptr("N"), alpha, A, x, beta, y)

However, manually writing these bindings can be tedious and challenging. For example, the Python bindings for creating a Kokkos View [13], the main multi-dimensional data structure in Kokkos, are written in pybind11. Despite only binding a small part of Kokkos, the total lines of code for these bindings is over 900, as they make heavy use of C++ macros and compile-time template instantiation to generate all the different combinations of template arguments. For Kokkos Views, this includes different data types (int16_t, int32_t, double, etc.), dimensions (one through eight), memory layouts, memory spaces, and memory traits. Each combination of these arguments forms a single template instantiation. The following code snippet shows one such instantiation.

    Kokkos::View<int*, LayoutLeft, HostSpace>;

Besides being hard to write, compiling the bindings takes a large amount of time (around 6 hours on our machines for a commonly used subset of all combinations) due to the large number of template instantiations that need to happen. In addition to the time overhead, compilation occasionally runs out of memory due to the large number of template argument combinations, meaning that the process will not terminate successfully on some machines.

Prior work on automatic generation of Python bindings for C++ code [30], [31] extracts library APIs by parsing header files for class and function declarations. While this simplifies writing the bindings, it requires that the user manually adds code to instantiate all the needed template arguments since these frameworks employ static analysis. Also, this does not

solve the compilation issues for large numbers of template instantiations. Therefore, such an approach does not work well for templated libraries such as Kokkos Kernels and Thrust.

As a result, we propose generating these bindings dynamically, i.e., on demand at run-time such that only the necessary template instantiations are created. This allows types to be passed at run-time, removing the need for the user to manually add template instantiations. It also reduces the cost of compilation by compiling bindings only when needed. We show that this approach can achieve performance comparable to manually written bindings.

III. WAYOUT OVERVIEW

In this section, we show an example of a high performance kernel (HPK) from Kokkos Kernels, and then use this HPK to demonstrate the workflow of WayOut.

We encountered multiple challenges during the design and implementation of WayOut. We highlight these challenges ⭐ like so, and then outline our design choices and how we solved these challenges.

A. Example

Figure 2 shows the function signature of the SpMV HPK spmv (line 4) and the class declaration of CrsMatrix, the sparse matrix data structure it operates on (line 13). This kernel performs the operation y = beta * y + alpha * A * x.

     1  template <class AlphaType, class AMatrix,
     2            class XVector, class BetaType,
     3            class YVector>
     4  void spmv(const char mode[], const AlphaType &alpha,
     5            const AMatrix &A, const XVector &x,
     6            const BetaType &beta, const YVector &y);
     7  /* ... */
     8
     9  template <class ScalarType, class OrdinalType,
    10            class Device, class MemoryTraits = void,
    11            class SizeType = typename Kokkos::ViewTraits
    12            <OrdinalType *, Device, void, void>::size_type>
    13  class CrsMatrix;
    14  /* ... */
Fig. 2: An example of a kernel and data structure declaration from Kokkos Kernels.

The template parameters of spmv are used to set the types used in the kernel at compile-time: AlphaType, BetaType are the scalar types, XVector and YVector are the vector types, and AMatrix is the sparse matrix type, which can be set to CrsMatrix in this example. The template arguments for CrsMatrix are as follows: ScalarType is the type of entries contained in the matrix, OrdinalType is the type of the matrix index, Device specifies in which device's memory (e.g., GPU) the matrix is located, MemoryTraits specifies the Kokkos memory access trait to be used (Atomic, RandomAccess, etc.), and SizeType specifies the type of the row offset.

Figure 3 shows an example using spmv and CrsMatrix. To call the kernel, the user first defines mat_t to alias the instantiated CrsMatrix type (line 3) and instantiates the matrix and vectors (lines 7-10). The CrsMatrix constructor takes in as arguments the number of rows, columns, and elements, followed by Views containing the matrix entries, row offsets, and column indices. Views y and x represent the one-dimensional vectors, and their constructor specifies the size of the View. The View constructor is templated on the datatype and dimensionality (one-dimensional double in this case). Finally, the user can call the spmv kernel (as shown on line 12). The arguments passed to the call are a string specifying the operation mode (no transpose, transpose, or conjugate transpose), the scalar alpha, the matrix A, the vector x, the scalar beta, and the vector y. The latter is passed by reference and will hold the result of the operation upon return from the function.

     1  int main() {
     2    /* ... */
     3    using mat_t = KokkosSparse::CrsMatrix<
     4        double, int,
     5        Kokkos::DefaultExecutionSpace, void, int>;
     6
     7    mat_t A = mat_t(numRows, numCols, nnz, val, ptr, in);
     8
     9    View<double *> y(N);
    10    View<double *> x(N);
    11
    12    KokkosSparse::spmv("N", alpha, A, x, beta, y);
    13  }
Fig. 3: An example using a kernel and data structure from Kokkos Kernels.

B. Workflow

Figure 4 shows a high-level overview of WayOut; in this section we highlight the user workflow. There are two main steps to WayOut's workflow. First, the user provides the path to the header files or the include directory. WayOut then generates a Python API consisting of wrappers for the C++ API, which was declared in the passed header files. The user can then access the C++ API using the Python API exposed by the generated wrappers.

1) Header Files: The first step in using WayOut is passing in the header files containing the required class and function declarations that together constitute the API (step 1 in Figure 4). WayOut can then generate Python wrappers that mirror the C++ API (kernel.py in Figure 4).

2) Python Wrappers: Once the Python wrappers have been generated, they can be imported (step 2) and called (step 3) by the user. Calling a wrapper for the first time will generate the templated bindings which will then be compiled into a shared library (step 4). Figure 5 shows the SpMV example using the generated wrappers. Similar to the C++ version, we first alias the matrix type (line 3), and then define the matrix and vectors (lines 5-9). We call the CrsMatrix class method nnz, which returns the number of entries in the matrix, to demonstrate how a class method can be called (line 6). Finally, we call the spmv kernel (line 10).

WayOut generates wrappers for both function and class declarations, as well as wrappers for public fields and methods,

in the original C++ API.

Fig. 4: An overview of WayOut's workflow.

     1  if __name__ == "__main__":
     2      # assume constructor arguments are initialized
     3      mat_t = CrsMatrix(
     4          float, int, "Kokkos::DefaultExecutionSpace", None, int)
     5      A = mat_t(numRows, numCols, nnz, val, ptr, ind)
     6      print("num elem:", A.nnz())
     7
     8      y = View("double *")(N)
     9      x = View("double *")(N)
    10      spmv(char_ptr("N"), alpha, A, x, beta, y)
Fig. 5: Python WayOut example using spmv and CrsMatrix.

Functions. One Python wrapper function is generated for each C++ function. An issue that arises here is overloaded functions. ⭐ Python does not allow overloaded functions, i.e., redefining a function with a different number or different types of arguments. To account for this, WayOut instead generates a single wrapper function with a variable number of arguments for each unique function name. At run-time, if an overloaded function is used, the correct instance will be called based on the number and types of the arguments passed by the user.

Users can call templated functions normally because the template arguments can be deduced from the argument types at run-time in most cases. ⭐ In some cases, these types cannot be deduced, and so have to be explicitly specified by the user. For example, Figure 6 shows a code snippet taken from a Kokkos Kernels tutorial using CrsMatrix. Instead of calling the constructor directly, it calls the generate_structured_matrix2D kernel to initialize the matrix. In C++, this kernel is templated on the type of the matrix to be initialized. The two arguments are for stencil type and matrix structure. These arguments do not hint at what the type of the generated matrix should be, so the users must pass the template argument to WayOut; otherwise, C++ compilation fails. These arguments can be passed in via the keyword argument template_args.

     1  if __name__ == "__main__":
     2      # assume constructor arguments are initialized
     3      crsmat_t = CrsMatrix(
     4          float, int, "Kokkos::DefaultExecutionSpace", None, int)
     5      A = generate_structured_matrix2D(
     6          "FD", structure, template_args=[crsmat_t])
Fig. 6: Python WayOut example using template_args.

When the user calls a function, the arguments are passed to the underlying kernels by reference. However, there are cases where a kernel expects an argument as a pointer. To support this, WayOut provides a simple class named ptr which the user can use to wrap their object and indicate that the argument should be passed as a pointer. A similar issue occurs with string arguments: some functions require the standard C++ string whereas others accept character pointers. To support this, Python strings are cast to standard strings by default, and arguments that are character pointers use the char_ptr wrapper class. Line 10 in Figure 5 shows an example of this.

If a function returns a pointer, the default behavior is to treat it as a reference, i.e., assume that C++ retains ownership of the object. This means that when the resource is freed, Python would not attempt to garbage collect the object and assume the C++ run-time would do so. To override this behavior, the user can set the boolean keyword argument take_ownership so that Python is responsible for freeing memory.

Classes. One wrapper class is generated for each C++ class. The __init__ method (i.e., the constructor in Python) of the wrapper class is used to pass in template arguments, creating a type object that can also be used as a type alias (Figure 5 line 3). To create an instance of the class, the user calls the type object, passing in the constructor arguments to the __call__ method (line 5).

Wrapper classes can accept a variable number of templates to support optional template arguments. Additionally, if the template argument is a primitive data type (e.g., int, float), the corresponding Python data type can be used. If the template argument is a class type, it can be set to a type alias or it can be passed as a string. The latter is useful for referring to typedefs defined in the header files. For instance, in Kokkos the DefaultExecutionSpace type is simply a typedef that changes depending on compile-time flags, but we can still use it as a template argument in Python by passing it as a string to the class constructor (line 3). This can also be used to specify

pointer types (e.g., double*) for template arguments (line 8).

Once an object has been created, it can be used like any Python object. The wrapper class contains all class fields and methods present in the C++ version. Private fields and methods are not accessible. As WayOut supports inheritance, attributes from the parent class are accessible as well. Any object returned from a function call will be automatically wrapped using the correct wrapper class.

Figure 7 shows the generated Python wrappers for the SpMV example. The spmv wrapper is defined on line 1 and the CrsMatrix wrapper is defined on line 8; the contents of these wrappers are explained in the next section.

     1  def spmv(*args, template_args=None, take_ownership=False):
     2      mod, name = generate_func_binding("spmv", "KokkosSparse",
     3          args, includes, template_args, take_ownership)
     4      args = [get_handle(arg) for arg in args]
     5      res = getattr(mod, name)(*args)
     6      return cast_return(res)
     7
     8  class CrsMatrix:
     9      """Compressed sparse row implementation of a sparse matrix."""
    10      _namespace = "KokkosSparse"
    11      def __init__(self, *template_args, handle=None):
    12          self._handle = handle
    13          self._cpp_name = handle._cpp_type if handle else \
    14              register_class("CrsMatrix", self._namespace, template_args)
    15      def __call__(self, *args):
    16          if self._handle:
    17              if hasattr(self, '__cpp_call__'):
    18                  return self.__cpp_call__(*args)
    19              raise RuntimeError(
    20                  "Error: can't call constructor on instance!")
    21          mod, name = generate_constructor(self._cpp_name, args, includes)
    22          args = [get_handle(arg) for arg in args]
    23          inst = copy.copy(self)
    24          inst._handle = getattr(mod, name)(*args)
    25          return inst
    26      def nnz(self, *args, take_ownership=False):
    27          """//! The number of stored entries in the sparse matrix."""
    28          mod, name = generate_class_func_binding(self, "nnz", args,
    29              includes, take_ownership)
    30          args = [get_handle(arg) for arg in args]
    31          res = getattr(mod, name)(self._handle, *args)
    32          return cast_return(res)
    33      """/* ... */"""
Fig. 7: Python wrapper generated by WayOut for spmv and CrsMatrix.

IV. TECHNIQUES

In this section, we describe our binding generation approach, including both static and dynamic phases. In the static phase (Section IV-A), WayOut parses C++ header files to generate Python wrappers and templated bindings. In the dynamic phase (Section IV-B), WayOut intercepts calls to the Python wrappers. Then, it instantiates, compiles, and imports the templated bindings based on the types known only at run-time, completing the link between Python and C++. We then describe the casting mechanisms used to move arguments from Python to C++ and vice versa (Section IV-C). Next, we describe our techniques to support inheritance (Section IV-D) and operator overloading (Section IV-E). Finally, we discuss GPU support (Section IV-F) and integration with manually written bindings (Section IV-G).

There are two highlights to our approach: first, generating Python code in the form of wrapper classes and functions allows the user to easily use and potentially modify the generated bindings; second, the lazy approach to binding instantiation and compilation reduces the otherwise high computational cost of binding and compiling everything ahead of time. Once a binding has been compiled, it is cached on the file system for later use.

A. Static Generation

We use Clang [32] to parse the header files and pybind11 [12] as the bindings library. We chose pybind11 due to its popularity, flexibility, and ease of use. Writing bindings using pybind11 involves defining a Python module object which is used to register classes and functions so that they can be accessed from Python.

When WayOut is invoked by a user, it uses the Clang Python API to parse the header files and return the root node of the corresponding Abstract Syntax Tree (AST). WayOut can then extract the API from header files by traversing the AST recursively to discover classes and functions. One issue with this approach is that ⭐ Python does not allow function or method overloading, both of which are used heavily in HPKs, such as Kokkos Kernels, especially for constructors. To deal with this, WayOut first stores function names in a set so that only one wrapper function is generated, even if other overloaded instances exist. Inside the wrapper functions for overloaded functions, WayOut adds code that selects the appropriate overloaded instance at run-time based on the types of the arguments. These types are extracted from the arguments using the Python built-in function type().

WayOut then generates Python wrappers mirroring the original C++ API. Figure 7 partially shows the generated wrappers for the spmv function and CrsMatrix class, with the latter also containing wrapper methods for its corresponding C++ class methods.

     1  #include <pybind11/pybind11.h>
     2  #include <KokkosSparse_CrsMatrix.hpp>
     3  template <class T>
     4  void generate_class(pybind11::module &_mod,
     5      const char *name, const char *cpp_type) {
     6    pybind11::class_<T> _class(_mod, name);
     7    _class.def_property_readonly_static("_cpp_type",
     8      [cpp_type](const pybind11::object&) {
     9        return cpp_type;
    10      });
    11    _class.def_readwrite("graph", &T::graph);
    12    _class.def_readwrite("values", &T::values);
    13    /* ... */
    14  }
Fig. 8: Generated C++ templated header for the CrsMatrix.

In addition to Python wrappers, WayOut generates one C++ header file for each class encountered during AST traversal. Figure 8 shows the header generated for the CrsMatrix class. The header file contains a function templated on T, where T is the type to be registered via pybind11. The

function registers the type T with pybind11, as well as all 1 /*==================================================*/
the class fields. Since all instances of a templated class 2 /* generated binding code for registering CrsMatrix */
have the same members, the header file can be reused by 3 #include "CrsMatrix.hpp"
4 PYBIND11_MODULE(f_f8ee838d9c3174dc82a, k) {
different instantiations of the templated class at run-time e.g., 5 generate_class<KokkosSparse::CrsMatrix<
CrsMatrix<double,...> or CrsMatrix<int,...>. 6 double, int,
7 Kokkos::DefaultExecutionSpace, void, int>>(
B. Dynamic Generation 8 k, "f_f8ee838d9c3174dc82a",
9 "KokkosSparse::CrsMatrix<double,int,"
At run-time, the user imports and calls the generated Python 10 "Kokkos::DefaultExecutionSpace,void,int>");
wrappers (shown in Figure 7). Internally, the wrappers call 11 }
12
WAYO UT to instantiate the templated functions based on 13 /*==================================================*/
the types passed, generating a C++ source file that uses the 14 /* generated binding code for CrsMatrix constructor */
templated binding header files generated in the static phase. 15 auto func(pybind11::args args) {
16 auto a0 = args[0].cast<std::string>();
WAYO UT then compiles the C++ source into a shared object 17 auto a1 = args[1].cast<int>();
file (or simply DSO) that can be imported and used by the 18 /* ... */
wrapper. Later calls to the same wrappers will reuse the 19 return new KokkosSparse::CrsMatrix<double, int,
20 Kokkos::DefaultExecutionSpace, void, int>
existing DSO if the types are unchanged. 21 {a0, a1, a2, a3, a4, a5, a6};
1) Wrapper: In Figure 7, the spmv wrapper calls the 22 }
WAYO UT function generate_function_binding (line 2) 23
24 /*==================================================*/
to generate the function binding. This call captures infor- 25 /* generated binding code for nnz method of CrsMatrix */
mation such as function name ("spmv") and namespace 26 auto func(pybind11::args args) {
("KokkosSparse") which are needed to uniquely identify 27 auto &a0 = args[0].cast<
28 KokkosSparse::CrsMatrix<double, int,
the C++ function that needs to be bound. This is needed in 29 Kokkos::DefaultExecutionSpace, void, int> &>();
combination with the arguments and optionally the template 30 return a0.nnz();
arguments to generate a hash that uniquely identifies the bind- 31 }
32
ing instantiation. Similarly, the methods of CrsMatrix call 33 /*==================================================*/
WAYO UT to generate instantiated bindings. The generate 34 /* generated binding code for spmv */
functions check to see if a module matching the hash has been 35 auto func(pybind11::args args) {
36 auto a0 = args[0].cast<std::string>();
imported. If so, it simply returns the module object containing 37 auto a1 = args[1].cast<double>();
the function. If the module has not been imported, WAYO UT 38 /* ... */
attempts to import it from the file system. If the corresponding 39 return KokkosSparse::spmv(a0.c_str(), a1, a2, a3, a4, a5);
DSO does not exist, then WAYO UT generates the binding 40 }
instantiation source code for the function.
2) Binding Generation: There are two main types of bind- Fig. 9: Generated C++ binding instantiation code for the
ings. One is for registering classes so pybind11 knows how SpMV example.
to cast objects between Python and C++, while the other is for binding an instantiated templated function. For class registration, the binding source code first includes the class header (shown in Figure 8) generated during the static phase and uses it to register classes. For function bindings, WAYOUT generates intermediate C++ functions that cast arguments from Python types to the corresponding C++ types and internally call the API function.

Figure 9 shows examples for both types of bindings. During class registration, a Python module object is first created using the PYBIND11_MODULE macro (line 4). The first argument is the name of the kernel, which is set to the unique hash corresponding to that instantiation. The second argument is a handle to the module object that is used to register functions for that module. Then, the class is registered in pybind11 (line 5).

WAYOUT defines an intermediate function for each method (lines 15, 26, and 35) which accepts as input an argument of type pybind11::args containing a list of arguments. We use auto as the return type of the intermediate functions and rely on the compiler to deduce it from the argument types. Each intermediate function explicitly casts each argument to its corresponding C++ type (e.g., lines 16-17) and then calls the C++ API function. The first function calls the CrsMatrix constructor (line 19), the second function calls the nnz class method (line 30), and the third function calls the standalone spmv function (line 39).

The bindings are then compiled into object files. Intuitively, WAYOUT would then link the files containing all the instantiations into one single DSO file and import it. Whenever a new instantiation is generated and linked, WAYOUT would reload the DSO. However, this will not work because Python does not provide support for dynamically reloading DSOs unless their reference count reaches zero and they are garbage collected. Waiting for the garbage collector to run is unreliable and might not even happen before the application completes. Our solution is to generate a separate DSO for each template instantiation of every class and function. This has the added benefit of avoiding the extra linking overhead when new bindings are generated. It also allows WAYOUT to elegantly support overloading and templates by separating them into different modules and avoiding re-definition errors in Python, since each combination of arguments would correspond to a different module.
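The one-module-per-instantiation scheme can be illustrated with a short Python sketch. It is not WAYOUT's implementation: the names `module_name` and `ModuleCache`, the choice of SHA-256, and the `build` callback (standing in for generating, compiling, and importing one DSO) are assumptions made for illustration only.

```python
import hashlib


def module_name(function, cpp_types):
    """Derive a stable module name from a function and its C++ argument types."""
    key = function + "<" + ", ".join(cpp_types) + ">"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    # Prefix with a letter so the name is a valid Python identifier.
    return "m_" + digest


class ModuleCache:
    """Keep one module per instantiation; build each module at most once."""

    def __init__(self, build):
        self._build = build      # hypothetical: compiles and imports one DSO
        self._modules = {}

    def get(self, function, cpp_types):
        types = tuple(cpp_types)
        name = module_name(function, types)
        if name not in self._modules:
            self._modules[name] = self._build(name, function, types)
        return self._modules[name]


# Stand-in build step that only records what would have been compiled.
built = []


def fake_build(name, function, cpp_types):
    built.append(name)
    return {"name": name, "function": function, "types": cpp_types}


cache = ModuleCache(fake_build)
first = cache.get("spmv", ["double", "CrsMatrix<double>"])
again = cache.get("spmv", ["double", "CrsMatrix<double>"])
other = cache.get("spmv", ["float", "CrsMatrix<float>"])
```

Because an overload or a different template argument changes the key, it maps to a different module name, so a module that is already imported never has to be reloaded.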

The generated Python wrapper can then access and call functions registered in the module using the built-in getattr function (Figure 7, lines 5, 24, and 31).

C. Casting

When the user calls a bound function (such as spmv in Figure 5, line 10), WAYOUT casts the passed arguments from types that are valid in Python to types that are valid in C++. Once control returns to the Python side, the returned binding object is also cast to the correct wrapper class. WAYOUT uses three forms of casting: explicit, implicit, and autocasting.

1) Explicit Casting: As mentioned previously, intermediate functions accept as input a list of arguments (args). Explicit casting refers to calling the pybind11 cast method on elements of args to convert them into types that can be used in C++, storing them in local variables (Figure 9, lines 16-17). These variables can then be passed to the C++ function call. The type to be cast to is passed as a template argument. Since the binding instantiation is generated at run-time, these types are chosen based on the types of the passed arguments. This form of casting works fine if the argument is a primitive (e.g., int). However, if the argument type is one of the wrapper classes (e.g., CrsMatrix), an additional implicit cast may be required.

2) Implicit Casting: In heavily templated classes, it is common for objects with slightly different template instantiations to be semantically equivalent. For instance, the Kokkos View object has an execution space template argument, which can either be of type Device or MemorySpace, which are interchangeable. In the SpMV example, spmv can accept both
Kokkos::View<double *, HostSpace> and
Kokkos::View<double *, Device<OpenMP, HostSpace>>
for its View arguments, even if they are different types, because Kokkos internally implements implicit casting between the two.

In order for pybind11's cast to work on non-primitive types, WAYOUT must use the type that was obtained during class registration, as that is the type that pybind11 recognizes. Otherwise, cast throws an exception for an illegal cast.

In some cases, different parts of a C++ API depend on different template instantiations of the same class, even if they are semantically equivalent. This is a challenge for WAYOUT since it uses pybind11 to cast objects to the exact type needed by functions, which will result in an exception if there is any difference in types.

To solve this, WAYOUT caches information about the C++ type of a binding object by adding an extra _cpp_type field during class registration. This extra field is a string set to the fully qualified C++ type name. Therefore, during binding generation, WAYOUT can use this stored name to cast the argument to the appropriate type.

3) Autocasting: When an object is returned from a function, pybind11 does not cast it to one of WAYOUT's wrapper classes, so it cannot be used to access the fields and methods. Ideally, the functions would return objects of the same type as the generated wrapper class. WAYOUT therefore wraps these objects in the appropriate wrapper class so the class fields and methods can still be accessed normally (Figure 7, line 6). To do so, WAYOUT first checks if the returned object has the _cpp_type field. If not, then the returned object is a primitive and no casting is needed. Otherwise, WAYOUT initializes a wrapper object using the binding object as the handle.

Additional complications occur when the return type has not been registered with pybind11. For example, assume the user calls a function that returns a matrix type that has not been instantiated before. To solve this, we also generate dummy functions which return empty instances of the return type. When a module is imported, WAYOUT also calls the dummy function. If the class is not registered, a TypeError will be thrown by pybind11, which we catch and parse to extract the class that needs to be registered. Since this only needs to be done once when a module is imported, the overhead is minimal and guarantees that all return types are registered.

D. Inheritance

Inheritance is a commonly used feature in C++ to facilitate code reuse. While it is not used much in Kokkos Kernels, Thrust [19] extensively utilizes inheritance in its various structures. WAYOUT supports inheritance during the static phase, where the name of the parent can be extracted from the AST. Then we can naturally emulate the C++ inheritance relationship by having the Python wrapper class of a C++ child class inherit from the Python wrapper class of the parent.

E. Operator Overloading

Operator overloading in C++ is used to implement the built-in operators for custom datatypes, e.g., using the [ ] operator to access elements in a data structure. WAYOUT supports operator overloading by treating overloaded operators as class methods, with the caveat that the method name is mapped to the corresponding Python magic method name (e.g., operator[] to __setitem__ and __getitem__). Since WAYOUT already uses the __call__ magic method for invoking the constructor, we map the C++ call operator to a new __cpp_call__ method which is invoked when a class instance is called (e.g., Figure 7, line 17). WAYOUT currently supports the C++ addition, subtraction, bracket, call, and dereference operators, although support for others is planned.

F. GPU Support

As most HPKs support heterogeneous systems, it is important for WAYOUT to support GPUs as well. Code that runs on GPUs (e.g., CUDA or HIP) typically cannot be compiled using a regular C++ compiler such as g++. Instead, it needs to be compiled with a specific compiler (e.g., NVCC for CUDA). This is easy to do in WAYOUT, as the only modification needed is to switch to the right compiler. Additionally, since one of the main targets of our work is Kokkos, the kernel interface does not change when running with a GPU, so no further modifications to WAYOUT are needed.
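The autocasting check from Section C (test for the `_cpp_type` field, pass primitives through, otherwise wrap the binding object) can be sketched in Python as follows. Only the `_cpp_type` field comes from the text; the `Wrapper` class, the `WRAPPERS` registry, and the stand-in binding class are hypothetical names used for illustration, not WAYOUT's generated code.

```python
class Wrapper:
    """Hypothetical wrapper that forwards attribute access to a binding object."""

    def __init__(self, handle):
        self._handle = handle

    def __getattr__(self, name):
        # Called only when normal lookup fails, so _handle itself is found first.
        return getattr(self._handle, name)


# Hypothetical registry: fully qualified C++ type name -> Python wrapper class.
WRAPPERS = {}


def autocast(result):
    """Wrap a returned binding object; pass primitives through unchanged."""
    cpp_type = getattr(result, "_cpp_type", None)
    if cpp_type is None:
        return result                     # primitive: no casting needed
    return WRAPPERS[cpp_type](result)     # the binding object becomes the handle


class FakeCrsMatrixBinding:
    """Stand-in for an object returned by a pybind11 module."""

    _cpp_type = "CrsMatrix<double>"

    def nnz(self):
        return 3


WRAPPERS["CrsMatrix<double>"] = Wrapper

wrapped = autocast(FakeCrsMatrixBinding())
plain = autocast(42)
```

Keying the registry on the stored type string means the lookup does not depend on which module produced the object.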

G. Integration of Manually Written Bindings

There are instances where it is still beneficial to use manually written bindings for convenience reasons. For instance, the Kokkos View object is a general purpose n-dimensional data structure. It overloads the parentheses operator for reading and modifying data instead of the commonly used square brackets (e.g., int x = view(1);). This does not work well with pybind11 since the parentheses operator returns a reference to a primitive, which pybind11 handles by passing by value to Python, meaning that modification of the contents is not possible. However, Kokkos does have Python bindings (manually written) for Views [13]. These bindings leverage a pybind11 feature that allows the Python buffer protocol [33] to be implemented for the raw data buffer contained in Views, which allows the internal data to be accessed normally from Python. Since they are implemented using pybind11, these bindings can be used seamlessly with WAYOUT.

V. EVALUATION

We evaluate WAYOUT by answering the following four research questions:
RQ1. How effective is WAYOUT at generating bindings for Kokkos Kernels and CUDA Thrust?
RQ2. What is the run-time performance overhead of the bindings generated by WAYOUT?
RQ3. How does the run-time performance of the automatically generated bindings compare to handwritten bindings?
RQ4. What is the time needed to generate the bindings?

We ran all experiments on an Ubuntu 18.04 machine with a 6-core Intel Core i7-8700 3.20GHz CPU and 64GB of RAM, and an Nvidia GeForce 1080 GPU with 8GB of memory. We used Python 3.8.5, GCC 7.5, OpenMP 4.5, and CUDA 10.2. We used Kokkos 3.1.01, and Kokkos Kernels from the “develop” branch (commit 62985984). Finally, we used Thrust 1.12.0. All data presented are averaged over 3 runs and the Thrust subjects were run for 100 iterations.

A. Results

RQ1: How effective is WAYOUT at generating bindings for Kokkos Kernels and CUDA Thrust?

Using WAYOUT, we automatically generated bindings for all the kernels in the Kokkos Kernels framework. We verified that WAYOUT is able to run all 39 kernels present in the Kokkos Kernels wiki [34], as well as the sparse matrix container CrsMatrix and numerous other helper functions used for memory allocation and initialization.

We then ported existing C++ programs that use these kernels to Python. Specifically, we implemented 7 applications from the official Kokkos repository [35] in Python:
• CGSolve: Implements a conjugate gradient algorithm for solving systems of linear equations of the form Ax = b.
• CGSolve_SpILUKprecond: Similar to CGSolve, but uses preconditioning for faster convergence.
• GaussSeidel: Implements the Gauss-Seidel method for solving a system of linear equations.
• GraphColoring: Assigns colors to elements of a graph such that no neighboring nodes have the same color.
• InnerProduct: Calculates the inner product of the form $\langle y, A * x \rangle = y^T * A * x$.
• SpGEMM: Implements sparse matrix-matrix multiplication in two phases: symbolic followed by numeric, with a kernel for each phase.
• SpILUK: Implements sparse k-level incomplete LU factorization.

We also need Python bindings for Kokkos Views as they appear frequently in our test subjects and in Kokkos Kernels. In our subjects, we used both the manually written Python bindings and bindings automatically generated by WAYOUT. As mentioned before, Views use the C++ parentheses operator to modify data, meaning that they cannot be directly modified in Python using the automatically generated bindings, so we implement only four of our subjects using the latter.

To demonstrate the generality of our approach, we also generated bindings for kernels in the Thrust library. We ported 7 examples from the official Thrust repository [36] to Python: histogram, mode, saxpy, set_operations, sort, sparse, and sum.

In summary, WAYOUT successfully generated bindings to Kokkos Kernels and Thrust, which we were able to use to port workloads from C++ to Python.

RQ2: What is the run-time performance overhead of the bindings generated by WAYOUT?

Figure 10 shows plots of computation time (y-axis) vs. input data size (x-axis) for our subjects from Kokkos Kernels and Thrust. For WAYOUT, we show computation time after the bindings have been instantiated and compiled for all types that occur in each subject. We show binding generation time in RQ4. The time shown does not include time spent to initialize the subject, as most subjects initialize arrays in sequential loops, which dominates the running time for larger input sizes. Including that time would mean comparing Python to C++ rather than measuring the overhead of the generated bindings.

For most subjects, our Python implementation can achieve performance comparable to the original C++ implementation.

For the CGSolve subject, we observe overhead that scales with the size of the input data. This happens because the subject runs most of its computations in a loop that calls the kernel internally. It also computes a square root in Python using the math.sqrt() function. The number of iterations of this loop scales with the size of the input data, increasing the number of calls to math.sqrt(), which in turn increases the total time taken compared to the C++ implementation.

We also observe noticeable performance overhead for the set_operations subject (Figure 10k). This subject invokes various functions that each allocate a result vector and call a different set operation (e.g., merge, union). In C++, the result vector is allocated on the stack, while in Python, the object must be allocated on the heap. Both heap allocation and Python's garbage collector introduce substantial overhead.

Thus, these two outliers can be attributed to Python itself rather than WAYOUT. In summary, bindings generated by WAYOUT introduce minimal performance overhead.
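The measurement protocol described for RQ2 (time only the kernel, exclude initialization, average over 3 runs) can be sketched as below. This is not the authors' benchmark harness: `mean_kernel_time`, `setup`, and `kernel` are placeholder names, and a plain Python sum stands in for a bound kernel.

```python
import time


def mean_kernel_time(setup, kernel, runs=3):
    """Average the kernel's wall-clock time over several runs, excluding setup."""
    total = 0.0
    for _ in range(runs):
        data = setup()                       # initialization is not timed
        start = time.perf_counter()
        kernel(data)                         # only this call is timed
        total += time.perf_counter() - start
    return total / runs


calls = []


def setup():
    # Sequential initialization, deliberately kept outside the timed region.
    return list(range(1 << 10))


def kernel(data):
    calls.append(len(data))
    return sum(data)


elapsed = mean_kernel_time(setup, kernel, runs=3)
```

Keeping `setup` outside the timed region mirrors the reasoning in the text: timing the sequential initialization loops would compare Python with C++ rather than measure the bindings.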

[Figure 10 (plots not reproduced): fourteen panels, each plotting Time [s] (y-axis) against input Size (x-axis, powers of two) for WayOut and for the original implementation, under both OpenMP and CUDA. Panels (a) to (g) compare against Kokkos: (a) CGSolve, (b) CGSolve_SpILUKprecond, (c) GaussSeidel, (d) GraphColoring, (e) InnerProduct, (f) SpGEMM, (g) SpILUK. Panels (h) to (n) compare against Thrust: (h) histogram, (i) mode, (j) saxpy, (k) set_operations, (l) sort, (m) sparse, (n) sum.]

Fig. 10: Kernel time using WAYOUT-generated bindings vs. original Kokkos Kernels/Thrust implementation.

TABLE I: Performance of Generated versus Manually Written Bindings.

Subject                Size  OpenMP Time [s]           CUDA Time [s]
                             Manual  Generated  Ratio  Manual  Generated  Ratio
CGSolve_SpILUKprecond  2^20  99.14   102.49     1.03   31.46   30.51      0.97
GaussSeidel            2^24  43.09   43.33      1.01   18.14   19.76      1.09
InnerProduct           2^30  31.65   31.62      1.00   212.82  213.79     1.00
SpILUK                 2^21  3.18    3.12       0.98   9.66    9.71       1.01
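As a quick check of how the Ratio columns are defined (generated time divided by manual time), the snippet below recomputes them from the times reported in Table I. The variable names are illustrative.

```python
# (subject, manual OpenMP, generated OpenMP, manual CUDA, generated CUDA), in seconds
rows = [
    ("CGSolve_SpILUKprecond", 99.14, 102.49, 31.46, 30.51),
    ("GaussSeidel", 43.09, 43.33, 18.14, 19.76),
    ("InnerProduct", 31.65, 31.62, 212.82, 213.79),
    ("SpILUK", 3.18, 3.12, 9.66, 9.71),
]

ratios = {}
for subject, omp_manual, omp_generated, cuda_manual, cuda_generated in rows:
    # Ratio = generated time / manual time, rounded to two decimals as in the table.
    ratios[subject] = (
        f"{omp_generated / omp_manual:.2f}",
        f"{cuda_generated / cuda_manual:.2f}",
    )
    print(subject, *ratios[subject])
```

A ratio above 1.00 means the generated bindings were slower than the manual ones for that configuration, and a ratio below 1.00 means they were faster.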

TABLE II: Bindings Build Time (Kokkos Kernels subjects first, then Thrust subjects).

Subject         Kernels  Modules  Static Phase [s]  Dynamic Phase (g++) [s]  Dynamic Phase (NVCC) [s]
Kokkos Kernels
CGSolve         7        12       3.43              32.13                    82.93
CG SpILUK       23       35       5.94              96.17                    248.01
GaussSeidel     8        15       5.77              43.17                    111.63
GraphColoring   11       17       5.13              51.29                    130.21
InnerProduct    2        2        3.05              7.59                     26.28
SpGEMM          7        12       4.37              33.23                    85.83
SpILUK          18       28       5.31              76.49                    196.39
Thrust
histogram       13       34       4.21              98.84                    281.24
mode            10       28       4.15              81.86                    230.83
saxpy           6        17       3.52              49.41                    139.40
set_operations  11       17       3.68              49.50                    144.43
sort            5        12       3.61              34.79                    98.36
sparse          9        33       3.81              96.60                    273.38
sum             4        11       3.50              31.92                    90.87

RQ3: How does the run-time performance of the automatically generated bindings compare to manually written bindings?

We compare the manually written Python bindings provided in the Kokkos repository for the View class against the bindings generated by WAYOUT. Table I shows the performance of the generated bindings versus the handwritten ones with both OpenMP and CUDA. The first column shows the name of the subject. The second column shows the size of the input data. The rest of the table shows computation time for both the manually written and automatically generated bindings, as well as the ratio of generated time to manual time.

The results show that the performance of the bindings generated by WAYOUT matches that of the manually written bindings. This is expected as both sets of bindings use pybind11, and WAYOUT only generates an additional lightweight Python wrapper which has minimal performance overhead.

RQ4: What is the time needed to generate the bindings?

Table II shows the average time taken to automatically generate the bindings for each library. The columns show the name of the subject, the number of kernels used, the number of modules generated (i.e., DSOs that instantiate the classes and functions), the time taken during the static phase, and the time taken during the dynamic phase for g++ and NVCC respectively.

The results show that WAYOUT has acceptable execution time. The largest source of overhead in either phase is invoking the C++ compiler. The time taken during the static phase is mostly spent compiling the enums DSO file and does not vary greatly across subjects. The time taken during the dynamic phase varies depending on the number of modules generated and the compiler used. More kernel calls with different types result in more template instantiations, and therefore more modules generated. For example, the CGSolve_SpILUKprecond subject has a large dynamic phase execution time, as it calls 23 kernels and generates 35 modules, more than any other subject.

It is important to note that the execution time shown here is incurred only once, when the bindings are instantiated for the first time. Later calls of kernels with the same types, and even later runs of the same application, would not incur this overhead as the modules are cached on the filesystem.

WAYOUT is also considerably faster than the approach used in the Kokkos View bindings [13], which is a purely static approach that instantiates all combinations of types during compilation. On our machine, compiling those bindings takes over 6 hours, and on another machine the compilation runs out of memory.

VI. LIMITATIONS

C++ allows passing arguments and returning values by value, pointer, or reference. Python always passes primitives by value and objects by reference. As such, the Python API generated by WAYOUT will not always exactly match the functionality of the C++ API: primitives are always passed and returned by value, and objects are always passed by reference or pointer. WAYOUT allows passing pointers with ptr and character pointers with char_ptr.

Another limitation of WAYOUT is that the generated wrappers may not be very “Pythonic”. For example, while ptr and char_ptr are practical solutions to pointer arguments, such constructs will be unfamiliar to Python programmers. Additionally, the generated wrappers do not make use of certain Python features such as keyword arguments (i.e., **kwargs) and dynamic typing.

It would be possible to make the generated APIs more Pythonic by adding another layer of abstraction on top of the wrappers generated by WAYOUT. Currently, this would require additional effort from the user, although we plan to explore a way to automate this step in future work.
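One way such an extra layer could look is sketched below. It is an assumption about a possible design, not something WAYOUT provides: `pythonic` is a hypothetical decorator that maps keyword arguments onto the positional order a generated wrapper expects, and `generated_axpy` is a stand-in for a generated wrapper function.

```python
import functools


def pythonic(*param_names):
    """Let callers use keyword arguments with a positional-only wrapper."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            remaining = param_names[len(args):]
            unknown = set(kwargs) - set(remaining)
            if unknown:
                raise TypeError(f"unexpected arguments: {sorted(unknown)}")
            missing = [name for name in remaining if name not in kwargs]
            if missing:
                raise TypeError(f"missing arguments: {missing}")
            # Rebuild the positional order expected by the wrapped function.
            ordered = list(args) + [kwargs[name] for name in remaining]
            return func(*ordered)
        return wrapper
    return decorate


@pythonic("alpha", "x", "y")
def generated_axpy(*args):
    # Stand-in for a generated wrapper; here it just computes alpha * x + y.
    alpha, x, y = args
    return [alpha * xi + yi for xi, yi in zip(x, y)]


result = generated_axpy(2.0, y=[1.0, 1.0], x=[3.0, 4.0])
```

The layer leaves the underlying wrapper untouched, so it could be written by hand for a few frequently used kernels without changing how the bindings are generated.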

Some kernels in Thrust accept a function object as an argument in order for the user to define kernel behavior. WAYOUT does not support these kernels as this would require translating Python code to C++; an earlier work, PyKokkos [11], supports translation from Python to C++. However, since the goal of WAYOUT is to bind existing HPKs where the behavior is already defined, this is a minor limitation.

Finally, we focused primarily on Kokkos and Thrust in our evaluation. We chose Kokkos because it is a popular performance portability framework with a large number of kernels, and Thrust is a popular CUDA library.

VII. RELATED WORK

A. Binding Frameworks

Boost.Python [28], pybind11 [12], SWIG [29], and pyximport [16] are frameworks that allow binding C or C++ code so that it can be called from Python. Typically, these frameworks require that the user specify the C++ interface to be bound using some form of domain-specific language or configuration file. WAYOUT only asks the user for the header files containing class and function declarations, and automatically generates the bindings with no extra effort from the user.

B. Static Binding Generation

CFFI [37] is a Python library that can import C code using C-like declarations and generate the necessary bindings in a C file. However, it does not support C++ and requires the user to manually declare the interface. AutoWIG [30] provides a Python API to pass in header files and then generates bindings using Boost.Python. Additionally, the user has to provide a header file that contains all the needed template instantiations for templated classes and functions. Afterwards, the user must compile the generated bindings. Similarly, Binder [31] statically parses header files to obtain all classes and functions. As with AutoWIG, the desired template instantiations must be explicitly used or specified in the header files. In contrast to AutoWIG, it is meant to be used entirely through the command-line. WAYOUT is more flexible and more Pythonic through its dynamic analysis: templates are only instantiated at run-time through types passed to automatically generated Python wrapper classes. The user does not have to specify all the types that they want to use ahead of time.

C. Dynamic Binding Generation

Cppyy [15] dynamically generates bindings to C++ libraries. It uses Cling [18], a C++ interpreter based on Clang and LLVM, to generate C++ code that instantiates and calls classes and functions included in header files, and then binds that code to enable accessing it from Python. The definitions of those classes and functions are loaded at run-time by dynamically linking a shared object library. This presents a problem for libraries such as Kokkos Kernels, which currently can only be compiled to a static library. WAYOUT provides the flexibility of linking a static library during compilation, instead of exclusively requiring shared object libraries as cppyy does. Additionally, WAYOUT's use of pybind11 to interface between Python and C++ allows the user to manually write bindings for some classes to make them more Pythonic if desired. Furthermore, the dependence on Cling also limits supported libraries to features supported by Cling. For instance, it does not have support for thread-local storage symbol relocation, which is used in the shared object for Kokkos. Another example is CUDA support. Since WAYOUT invokes a compiler to compile shared objects, it has the flexibility of choosing NVCC rather than g++ as the compiler, whereas Cling support for CUDA is still experimental.

D. High Performance Python

PyKokkos [11] is a framework for writing performance portable kernels in Python. The user writes kernels in a small, statically typed subset of Python, which PyKokkos then translates to C++ (Kokkos) to obtain better performance. Numba [10] is a Python JIT compiler based on LLVM. Cython [16] adds C-like language extensions to Python to improve performance. WAYOUT is not meant for writing kernels. WAYOUT provides access to pre-existing, hand-tuned high-performance kernels.

NumPy [27] and SciPy [26] both contain data structures and kernels used in scientific computing. A significant part of both libraries is implemented in C and C++ and manually wrapped so it can be accessed from Python. WAYOUT automatically generates bindings to interoperate between Python and C++.

VIII. CONCLUSION

We present WAYOUT, a technique for automatically generating Python bindings for C++ code, specifically high-performance kernels. WAYOUT combines static and dynamic analysis in order to reconcile Python's dynamic nature with C++'s static typing, and is able to support heavily templated classes and functions. We implement WAYOUT by building Python and C++ code generators that produce a connection layer between the two languages. Our evaluation shows that WAYOUT can support the Kokkos Kernels framework and CUDA Thrust with minimal performance overhead. Additionally, WAYOUT can generate bindings at an acceptable performance cost, making it more feasible than manually written and statically generated bindings. We believe that WAYOUT enables faster development of scientific applications by connecting Python, a high-level language frequently used by scientists, to existing HPKs written in C++.

ACKNOWLEDGMENT

We thank George Biros, Martin Burtscher, Ian Henriksen, Jonathan R. Madsen, Arthur Peters, Keshav Pingali, Sivasankaran Rajamanickam, Christopher J. Rossbach, Karl W. Schulz, Christian Trott, and the anonymous reviewers for their feedback on this work. This work was partially supported by the US National Science Foundation under Grant Nos. CCF-1652517 and CCF-2107291, and the Department of Energy, National Nuclear Security Administration under Award Number DE-NA0003969.

REFERENCES

[1] C. R. Trott, “ExaMiniMD,” https://github.com/ECP-copa/ExaMiniMD, 2017.
[2] D. Lebrun-Grandié, A. Prokopenko, B. Turcksin, and S. R. Slattery, “ArborX: A performance portable geometric search library,” ACM Transactions on Mathematical Software, vol. 47, no. 1, pp. 1–15, 2020.
[3] S. Slattery, “Cabana,” https://github.com/ECP-copa/Cabana, 2018.
[4] S. Rajamanickam, S. Acer, L. Berger-Vergiat, V. Dang, N. Ellingwood, E. Harvey, B. Kelley, C. R. Trott, J. Wilke, and I. Yamazaki, “Kokkos kernels: Performance portable sparse/dense linear algebra and graph kernels,” https://arxiv.org/abs/2103.11991, 2021.
[5] H. C. Edwards, C. R. Trott, and D. Sunderland, “Kokkos: Enabling manycore performance portability through polymorphic memory access patterns,” Journal of Parallel and Distributed Computing, vol. 74, no. 12, pp. 3202–3216, 2014.
[6] C. Trott, L. Berger-Vergiat, D. Poliakoff, S. Rajamanickam, D. Lebrun-Grandie, J. Madsen, M. Gligoric, N. Al Awar, G. Shipman, and G. Womeldorff, “The Kokkos ecosystem: Comprehensive performance portability for high performance computing,” Computing in Science and Engineering.
[7] D. A. Beckingsale, J. Burmark, R. Hornung, H. Jones, W. Killian, A. J. Kunen, O. Pearce, P. Robinson, B. S. Ryujin, and T. R. Scogland, “RAJA: Portable performance for large-scale scientific applications,” in Workshop on Performance, Portability and Productivity in HPC, 2019, pp. 71–81.
[8] T. E. Oliphant, “Python for scientific computing,” Computing in Science and Engineering, vol. 9, no. 3, pp. 10–20, 2007.
[9] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, “Julia: A fresh approach to numerical computing,” SIAM Review, vol. 59, no. 1, pp. 65–98, 2017.
[10] S. K. Lam, A. Pitrou, and S. Seibert, “Numba: A LLVM-based Python JIT compiler,” in Workshop on the LLVM Compiler Infrastructure in HPC, 2015, pp. 1–6.
[11] N. Al Awar, S. Zhu, G. Biros, and M. Gligoric, “A performance portability framework for Python,” in International Conference on Supercomputing, 2021, pp. 467–478.
[12] “Pybind11 Documentation,” 2020, https://pybind11.readthedocs.io/en/stable/intro.html.
[13] J. R. Madsen, “kokkos-python,” https://github.com/kokkos/kokkos-python, 2020.
[14] E. Slaughter and A. Aiken, “Pygion: Flexible, scalable task-based parallelism with Python,” in Parallel Applications Workshop, Alternatives To MPI, 2019, pp. 58–72.
[15] W. T. Lavrijsen and A. Dutta, “High-performance Python-C++ bindings with PyPy and Cling,” in Workshop on Python for High-Performance and Scientific Computing (PyHPC), 2016, p. 2735.
[16] S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D. S. Seljebotn, and K. Smith, “Cython: The best of both worlds,” in Computing in Science and Engineering, 2011, pp. 31–39.
[17] “PyPy,” 2021, https://www.pypy.org/.
[18] V. Vassilev, P. Canal, A. Naumann, L. Moneta, and P. Russo, “Cling – the new interactive interpreter for ROOT 6,” in Journal of Physics: Conference Series, 2012, pp. 52–71.
[19] N. Bell and J. Hoberock, “Chapter 26 - Thrust: A productivity-oriented library for CUDA,” in GPU Computing Gems Jade Edition, 2012, pp. 359–371.
[20] “OpenMP,” 2020, https://www.openmp.org.
[21] “CUDA Zone,” 2020, https://developer.nvidia.com/cuda-zone.
[22] The Trilinos Project Team, The Trilinos Project Website.
[23] “LAMMPS molecular dynamics simulator,” https://lammps.sandia.gov/, 2020.
[24] “Albany multiphysics code,” http://snlcomputation.github.io/Albany/, 2020.
[25] M. T. Bettencourt and S. Shields, “EMPIRE: Sandia’s next generation plasma tool,” Sandia National Lab.(SNL-NM), Albuquerque, NM (United States), Tech. Rep., 2019.
[26] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors, “SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python,” Nature Methods, vol. 17, pp. 261–272, 2020.
[27] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Rio, M. Wiebe, P. Peterson, P. Gerard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant, “Array programming with NumPy,” Nature, vol. 585, no. 7825, pp. 357–362, 2020.
[28] D. Abrahams and R. W. Grosse-Kunstleve, “Building hybrid systems with Boost.Python,” The C/C++ Users Journal, vol. 21, 2003.
[29] D. Beazley, “Automated scientific software scripting with SWIG,” Future Generation Computer Systems, vol. 19, no. 5, pp. 599–609, 2003.
[30] P. Fernique and C. Pradal, “AutoWIG: Automatic generation of Python bindings for C++ libraries,” PeerJ Computer Science, vol. 4, 2018.
[31] RosettaCommons, “Binder,” https://github.com/RosettaCommons/binder, 2016.
[32] C. Lattner and V. Adve, “LLVM: A compilation framework for lifelong program analysis & transformation,” in International Symposium on Code Generation and Optimization, 2004, pp. 75–86.
[33] T. Oliphant and C. Banks, “Pep 3118 – revising the buffer protocol,” https://www.python.org/dev/peps/pep-3118/, 2006.
[34] S. Rajamanickam, “Kokkos kernels wiki,” https://github.com/kokkos/kokkos-kernels/wiki/APIReference, 2021.
[35] C. R. Trott, “Kokkos Tutorials,” https://github.com/kokkos/kokkos-tutorials, 2015.
[36] “Thrust,” https://github.com/NVIDIA/thrust, 2021.
[37] “CFFI documentation,” https://cffi.readthedocs.io/en/latest/, 2012.