A Data Structure Optimizing Compiler for tUPL

MASTER'S THESIS
Computer Science

Leiden Institute of Advanced Computer Science (LIACS)
Leiden University
Niels Bohrweg 1
2333 CA Leiden
The Netherlands

Supervisors:
Dr. K.F.D. Rietveld
Prof. dr. H.A.G. Wijshoff
Abstract
Contents

1 Introduction
  1.1 Related work
  1.2 Notation
2 Overview
3 libtupl
  3.1 Compilation process
  3.2 Transformation tree
  3.3 Transformations
  3.4 I/O generation through generators
    3.4.1 Transformations which modify data structures
    3.4.2 Initialization
  3.5 I/O generation through transformation graphs
    3.5.1 Transformations which modify data structures
    3.5.2 Initialization
    3.5.3 Generating imperative code for transformation graphs
  3.6 Code generation
  3.7 Sparse matrix-vector multiplication example
4 Tython
  4.1 Syntax & parsing
  4.2 Debug compilation
  4.3 Release compilation
5 libtupl extensions
  5.1 Hybrid algorithms
  5.2 Trivially parallelizing execution of loops
  5.3 Runtime I/O
  5.4 Dimensionality reduction
  5.5 Deriving a CSR implementation
  5.6 Deriving a Diagonal-CSR hybrid implementation
6 Experiments
  6.1 Experimental configurations
  6.2 Overview experiments
  6.3 Diagonal experiments
  6.4 Duplicated matrix experiments
7 Conclusions
  7.1 Summary
  7.2 Future work for the derivation of data structures
  7.3 Future work for the transformation framework & implementation
Bibliography
A Transformation passes
  A.1 EncapsulationPass
  A.2 AggregateReservoirPass
  A.3 LocalizationPass
  A.4 QueryForwardSubstitutionPass
  A.5 ReservoirMaterializationPass
  A.6 NStarMaterializationPass
  A.7 SharedSpaceMaterializationPass
  A.8 HorizontalIterationSpaceReductionPass
  A.9 DelocalizationPass
  A.10 StructureJaggedSplittingPass
  A.11 ConcretizationPass
Chapter 1
Introduction
explicit data structures.
Let us consider an example. Listing 1.1 shows a sparse matrix-vector multi-
plication specification in tUPL. This specification iterates over the tuples in the
NZ tuplespace (representing nonzeros) in any order and, for each nonzero, per-
forms the multiply-add operation atomically. Here, the NZ tuplespace contains
two-dimensional tuples with fields row and col, both integers.
forelem nz in NZ:
    C[nz.row] += A[nz.row, nz.col] * B[nz.col]
Listing 1.1: Sparse matrix-vector multiplication tUPL specification.
Note how we store the nonzero value in a shared space A rather than with
the tuples in the tuplespace. Although the alternative is allowed and equiva-
lent, the convention is to store only indices in the tuples in reservoirs.
Listing 1.2 shows another example: sorting. This specification continues to
swap adjacent elements until no element is out of order anymore. Generally,
whilelem loops terminate when the program is in some state to which we can
always return, no matter what sequence of (executable) tuples is executed.
For this sorting specification it holds that when no element is out of order, no
tuple is enabled at all, so detecting termination is trivial in this case.
whilelem adj in ADJS:
    if A[adj.left] > A[adj.right]:
        tmp = A[adj.left]
        A[adj.left] = A[adj.right]
        A[adj.right] = tmp
Listing 1.2: Sorting tUPL specification.
We have developed an optimizing compiler for tUPL, focusing on the auto-
matic generation of efficient data structures. This project consists of two parts:
libtupl and Tython. libtupl is a language-agnostic compiler that operates on a
tUPL AST, on which we perform optimization routines and from which we generate
output programs. libtupl only takes an initial AST and some other initialization structures
as input. Tython is a front-end we have developed for libtupl which extends
Python 3 with tUPL constructs. Tython interfaces with libtupl to then perform
the actual optimization routines.
In Section 2 we look at an overview of the entire compiler. Within Section 3
we will look at the heart of the compiler: libtupl. In particular, two strategies
for the generation of I/O routines will be discussed. Section 4 will describe
the Tython front-end we have developed in more detail. Section 5 enumerates
a few additional extensions to tUPL. These extensions enable the compiler to
derive more advanced implementations. In Section 6 we will demonstrate the
effectiveness of various derived advanced implementations using a number of
experiments on sparse matrix-dense matrix multiplication. Finally, Section 7
will conclude this thesis and propose some avenues for future work.
original forelem framework. In unreleased slides, Prof. dr. H.A.G. Wijshoff de-
scribed various additions to the forelem framework which formed tUPL, such
as the whilelem loop. In this thesis we extend the transformations to robustly
transform input and output data streams and also introduce a range of new
transformations.
Many algorithms have already been specified in tUPL. We have already
seen sparse matrix-vector multiplication and sorting in Listings 1.1 and 1.2 re-
spectively. Previously, specifications for maximum flow, finding strongly con-
nected components, triangular solve [13], LU factorization [12], K-means clus-
tering [8, 7], PageRank [15] and more have been constructed. For many of
these specifications full implementations have been derived, such as for Page-
Rank [15].
Within the experiments we will consider an extended version of tUPL which
takes parallelization and runtime I/O into account. This is then applied to
sparse matrix-dense matrix multiplication (SpMM) by deriving various sparse
matrix data structures for the algorithm, similar to the experiments in the dis-
sertation of Dr. K.F.D. Rietveld [10]. In this thesis, however, we perform exper-
iments using tUPL, which has evolved significantly since those experiments.
Additionally, we consider the automatic transformation of I/O routines so the
input and output is transformed to the generated data structures automatically.
Although a lot of research and library development has been performed to op-
timize the performance of SpMM in the case where all data is in-memory, such
as in [12, 14, 1, 16], only a limited amount of research has been done on the case in which
data is loaded from persistent memory, such as in [17]. In our experiments we
consider both cases: data can either be in-memory or has to be loaded from
persistent memory.
1.2 Notation
Within this thesis we typically use a Python-like style to describe code, so cer-
tain Python-like notations are used. In fact, the syntax is mostly that of Tython,
the front-end we have developed for tUPL. For example, a: Generator indicates
that the variable a is of type Generator. Similarly, λ() -> Generator
indicates that the function returns a Generator. We use the shorthand notation
{a: int, b: int} to describe the type of a named tuple in code, here a two-tuple
with fields a and b, both of type int. Outside of code we usually omit the types
of values and write named tuples as ⟨a, b⟩.
Sometimes the function assert is used to indicate something is always
truthy. This function also returns the object being asserted. Like in Python,
empty structures are not truthy.
The unary postfix ++ and -- operators from, for example, C++ are also
used in code for brevity. Outside of C++ code, the unary prefix ∗ operator
will unpack a structure. For example, with a = ⟨1, 2⟩, ⟨∗a, 3⟩ = ⟨1, 2, 3⟩, but
⟨a, 3⟩ = ⟨⟨1, 2⟩, 3⟩. Unpacking a Subscriptable implies unpacking it into a
Stream of key-value pairs (i.e. tuples of length two), as will be described in
Section 3.
Note that in previous work involving the forelem framework and tUPL a
different notation was used to specify algorithms. To ease into the new no-
tation Listings 1.3 and 1.4 can be compared. The specification using the old
notation in Listing 1.3 is equivalent to the specification with the new notation
in Listing 1.4.
forelem (r; r ∈ NZ.row)
    forelem (nz; nz ∈ NZ.row[r])
        C[nz.row] += A[nz.row, nz.col] * B[nz.col];
Listing 1.3: A tUPL specification using the old notation.
forelem r in NZ.row:
    forelem nz in NZ where nz.row == r:
        C[nz.row] += A[nz.row, nz.col] * B[nz.col]
Listing 1.4: A tUPL specification using the new notation (i.e. Tython syntax).
Chapter 2
Overview
Within this thesis we look at two components: libtupl and Tython. Tython, the
front-end, can parse Tython code and generate an AST from it. Tython then
converts this Tython AST to a language-agnostic tUPL AST that is part of libtupl. Initially, the
code represented by the tUPL AST is expressed as operations on tuplereser-
voirs and shared spaces. In order to import data into these tuplereservoirs and
shared spaces, a load I/O routine must be initialized. Similarly, for the reverse
export operation an unload routine is defined. For every transformation that is
performed by libtupl, both the algorithm AST as well as the load/unload rou-
tines are transformed. This ensures that for every step in the transformation
chain input data can be transformed to the libtupl-generated data structures
and vice versa.
Figure 2.1 visualizes this structure. Note that each Function contains the
AST while each IOGen contains information on how to load/unload the data
structures for that particular instantiation of the function. Each (parameteriz-
able) optimization Pass can modify the Function and IOGen objects, forming
a new node in the transformation tree. Tython (or another front-end) initializes
the root TransformationTreeNode. In the end, code generation is performed on
some final TransformationTreeNode. This will yield, for example, C++ code
that can import (load) data, run the algorithm and export (unload) data.
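As a rough illustration of this structure, the following Python sketch models a transformation tree node. It is a simplification for illustration only; apart from the names Function, IOGen, Pass and TransformationTreeNode taken from Figure 2.1, everything here is an assumption rather than the actual libtupl API.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Function:
    ast: object          # the (transformed) algorithm AST

@dataclass
class IOGen:
    description: str     # how to load or unload the current data structures

@dataclass
class TransformationTreeNode:
    func: Function       # the algorithm at this point in the transformation chain
    load: IOGen          # import routine matching the current data structures
    unload: IOGen        # export routine matching the current data structures
    children: List["TransformationTreeNode"] = field(default_factory=list)

    def apply(self, pass_: Callable[["TransformationTreeNode"], "TransformationTreeNode"]):
        # Applying a (parameterizable) Pass yields a new child node in the tree.
        child = pass_(self)
        self.children.append(child)
        return child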
The majority of the logic is part of the libtupl library, including a range of
passes that can be applied on the transformation tree nodes, as we will see in
Section 3.3. We primarily look at data structure transformations, not consider-
ing algorithmic transformations in great detail. Two different ways of handling
I/O data transformations have been developed, each with certain advantages
and disadvantages. Firstly, the I/O generation approach through combinations
of simple generators transforming the data will be described in detail in Sec-
tion 3.4. Secondly, the I/O generation approach by constructing transformation
graphs of the data is described in Section 3.5. The definition of the IOGen ob-
jects and the way these objects are modified during each transformation Pass
differs significantly between the two approaches as we will see in these two
sections.
In Section 4 we will take a more detailed look at how Tython parses input
code and interacts with libtupl to perform the compilation process.
[Figure 2.1: A CompilationInstance in libtupl holds a tree of TransformationTreeNodes, each containing func: Function, load: IOGen and unload: IOGen; applying a Pass(...) to a node yields a new child TransformationTreeNode. Tython parses the .tpy file and uses it to initialize the root node; code generation on a final node produces a .cpp file, which is compiled to a .so file.]
Chapter 3
libtupl
In this chapter we will describe libtupl, a library to compile and optimize tUPL
programs. libtupl does not include features to parse code; instead, it operates
on a more agnostic tUPL abstract syntax tree (AST), which allows any front-
end to use this library. The front-end we have additionally developed, Tython,
is discussed in Section 4.
data structure optimization together can thus lead to a wide range of differ-
ent data structures. Within the next few sections we will look at data struc-
ture transformations and two techniques to efficiently transform the input and
output routines to match the target data structures. The vast majority of the
data structure optimization process has been implemented as part of the Tython
project as the C++ libtupl library.
3.3 Transformations
A wide range of simple transformations have been implemented as passes to
automatically transform the algorithm. Within this section we will describe
these transformations and the effect they have on the algorithm. Within Sec-
tions 3.4 and 3.5 we will look at the effects these transformation passes have on
the input and output data streams, as these have to be transformed to be able
to store the actual data in the generated data structures.
Table 3.1 lists the implemented transformations. Some of these transforma-
tions are based on the forelem framework [11]. Appendix A can be referred to
for a more detailed description of each of these passes.
The transformations can transform the initially unmaterialized specification
(i.e. iteration order of the loops is undefined) to a materialized specification,
where we fix the order we iterate through reservoir tuples. In materialized spec-
ifications all SharedSpace and Reservoir symbols have been transformed
into Subscriptable symbols. Subscriptable symbols assign indices to each
element, but do not enforce how these elements are stored. The ReservoirMate-
rializationPass, NStarMaterializationPass and SharedSpaceMaterializationPass
passes can be used to materialize a specification. The materialized specification
becomes a concretized specification after the final ConcretizationPass. Here all
Subscriptable symbols have defined implementations, such as a multi-dimensional
array or a jagged linked list. In concretized specifications
loops are also limited to just simple for- and while-loops. Note that vari-
ous other transformations exist that operate on unmaterialized or materialized
specifications that solely change the data structure in some way, leading to dif-
ferent outputs.
EncapsulationPass (A.1, [11]): Transforms iterating over a tuplespace's possible values (forelem row in NZ.row) into iterating over a range (forelem row in [0, max(NZ.row)]) when possible.

AggregateReservoirPass (A.2): Transforms aggregations over a tuplespace's possible field values (forelem row in [0, max(NZ.row)]) into a scalar value (forelem row in [0, max_NZ_row]), delegating the computation of that scalar to load time.

LocalizationPass (A.3, [11]): Merges (localizes) the values of a shared space into a reservoir so that indexing the original shared space is no longer necessary.

QueryForwardSubstitutionPass (A.4): When a loop has an equals-query, like in forelem nz in NZ where nz.row == row, substitutes row for nz.row in the loop body.

ReservoirMaterializationPass (A.5, [11]): Materializes a reservoir, constructing a subscriptable to store the reservoir's tuple data in. In the case of forelem nz in NZ where nz.row == row the materialization leads to a two-dimensional Subscriptable PNZ with indices row (the query) and k (an offset). The loop is substituted by forelem k in N*, iterating over all offsets.

NStarMaterializationPass (A.6, [11]): Materializes an N* Reservoir. If the original forelem loop has a query, this can lead to, for example, a Subscriptable PNZ_len, in which the number of tuples matching each combination of query values is stored. PNZ then contains tuple data at indices [query_value, 0 ... PNZ_len[query_value] - 1].

SharedSpaceMaterializationPass (A.7, [11]): Materializes a shared space, converting it into a subscriptable.

HorizontalIterationSpaceReductionPass (A.8, [11]): Removes fields from a subscriptable containing tuples if those fields are never used anywhere, reducing the width of each element in the subscriptable.

DelocalizationPass (A.9): Duplicates a subscriptable containing tuples into two subscriptables. The specification is modified to access certain fields through the newly duplicated subscriptable. Usually followed up by a HorizontalIterationSpaceReductionPass to shrink both subscriptables so that they contain mutually exclusive fields.

StructureJaggedSplittingPass (A.10, [11]; regular structure splitting only): Splits a Subscriptable containing tuples into a jagged Subscriptable. For example, it can transform Z[x, y].a += Z[x, y].b * Z[x, y].c into Z[x]._a[y].a += Z[x]._b_c[y].b * Z[x]._b_c[y].c. Note how Z is now an array of structures containing two arrays each (i.e. a jagged structure).

ConcretizationPass (A.11, [11]): Concretizes each subscriptable to an actual data structure (usually an array of potentially multiple index dimensions). forelem loops are concretized to, for example, simple for-loops.

Table 3.1: All implemented transformations that have an effect on the algorithm specification. The references after each transformation name point to the appendix section (and literature) describing it in more detail.

To illustrate the application of a transformation, consider the following example
application of the LocalizationPass. Listing 3.2 shows a scenario on
which the LocalizationPass can be applied. We can decide to merge, for ex-
ample, the shared space A into tuplespace NZ. This will insert a new field
merged_value for each tuple nz inside of NZ with value nz.merged_value
= A[nz.row, nz.col]. In order to access the value A[nz.row, nz.col] we
can now simply read it from nz.merged_value instead, yielding the code in
Listing 3.3.
forelem row in [0, max(NZ.row)]:
    forelem nz in NZ where nz.row == row:
        C[nz.row] += A[nz.row, nz.col] * B[nz.col]
Listing 3.2: Example scenario on which the LocalizationPass can be applied.
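In sketch form, the localized code referred to as Listing 3.3 then reads as follows (assuming, as in the description above, that the reservoir keeps the name NZ and the merged field is called merged_value):

forelem row in [0, max(NZ.row)]:
    forelem nz in NZ where nz.row == row:
        C[nz.row] += nz.merged_value * B[nz.col]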
λ(input: Generator[Tuple[int, int], NTuple[row: int, col: int]]) -> Generator[Tuple[int, int], NTuple[col: int]]:
    while input:
        key, value = input.next()
        yield key, (value.col,)
Listing 3.4: A generator that reduces the width of the tuples in a subscriptable.
Note that using coroutines to abstract the I/O will result in the chain of gen-
erators being controlled from the bottom (i.e. the end of the chain). Each gener-
ator then invokes the upper generator to produce a new element (through the
next() call). The bottom generators here typically contain routines that con-
struct the final data structure from the input and load concrete data in there,
while the top generators typically read raw data elements from, for example,
the file system (or the other way around when unloading data from the gener-
ated data structures).
This can cause performance issues when more than one concretized data
structure reads from a single source: that source is then read multiple times.
For example, applying the ReservoirMaterializationPass and NStarMaterial-
izationPass on some reservoir A can lead to the materialized reservoir PA (i.e.
transforming the specification in Listing 3.5 into Listing 3.6), generated by the
generator in Listing 3.7, and additionally a lookup subscriptable PA_len, gen-
erated by the generator in Listing 3.8, containing the number of entries match-
ing each query. Both of these data structures are based on the contents in A,
which results in the input being read twice in order to construct the data in
both structures. See also Appendix B.3 for additional details about this case.
Similar issues occur when performing the DelocalizationPass, as described in
Appendix B.4.
forelem a in A.a:
    forelem tuple in A where tuple.a == a:  # A: Reservoir[a: int, b: int]
        ... tuple ...
Listing 3.5: Some tUPL specification with tuplespace A before materialization.

forelem a in A.a:
    forelem k in PA_len[a]:
        ... PA[a, k] ...
Listing 3.6: Some tUPL specification with subscriptable PA after
materialization.
        yield query, count
Listing 3.8: A generator materializing an N* reservoir into a _len subscriptable.
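The following Python sketch illustrates this problem with two bottom-driven generators in the spirit of Listings 3.7 and 3.8: each chain pulls independently from the same tuple source, so the input is traversed once per materialized structure. All names and the file format are illustrative assumptions.

from collections import defaultdict

def read_tuples(path):
    # Stream the (a, b) tuples of reservoir A from a text file, one pair per line.
    with open(path) as f:
        for line in f:
            a, b = line.split()
            yield int(a), int(b)

def materialize_pa(tuples):
    # Assign an offset k to every tuple within its query group a (builds PA).
    offsets = defaultdict(int)
    for a, b in tuples:
        k = offsets[a]
        offsets[a] += 1
        yield (a, k), (a, b)

def materialize_pa_len(tuples):
    # Count the number of tuples per query value a (builds PA_len).
    counts = defaultdict(int)
    for a, _ in tuples:
        counts[a] += 1
    yield from counts.items()

# Each chain pulls from its own read_tuples() generator: the file is read twice.
PA = dict(materialize_pa(read_tuples("a.txt")))
PA_len = dict(materialize_pa_len(read_tuples("a.txt")))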
3.4.2 Initialization
Initial generators must be defined that stream the input data. For tuplespaces
this is always a stream of tuples; for shared spaces the generator streams both
the key and the value of each element of that shared space. Callback functions
can be defined to export shared space data after running the algorithm, taking
the key and value of each element in the shared space as function parameters.
This callback function is invoked for each element in the shared space. A chain
of generators can transform the data back to the desired output format before
invoking the callback function for each data element.
For example, Listing 3.9 defines a coroutine generator in C++ which pro-
duces tuples ⟨row, col ⟩ from a given input file in text format, which could be
used to define where nonzero elements are stored in a sparse matrix. Genera-
tors like this are used to initialize the data structures from external sources.
struct tuple_row_col {
    uint64_t row;
    uint64_t col;
};

Stream<tuple_row_col> loader(std::ifstream text_stream) {
    tuple_row_col nonzero_location;
    while (text_stream >> nonzero_location.row && text_stream >> nonzero_location.col) {
        co_yield nonzero_location;
    }
}
Listing 3.9: Coroutine to load nonzero position data for tuplespace NZ.
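The unload direction works analogously, with a chain that ends in a callback instead of starting from a generator. A minimal Python sketch follows (the generated code is C++, and the names and output format here are assumptions):

def unload_shared_space(space, transform, callback):
    # Invoke the callback for every element of a shared space, after transforming
    # each (key, value) pair to the desired output format.
    for key, value in space.items():
        callback(*transform(key, value))

# Example: export C, a shared space stored as a dict keyed by 1D indices.
C = {(0,): 1.5, (1,): 0.0, (2,): 3.25}
unload_shared_space(C,
                    transform=lambda k, v: (k[0], v),
                    callback=lambda n, val: print(n, val))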
edges connect output sockets of one node to input sockets of others. A node
may have zero or more input sockets and zero or more output sockets. It is per-
mitted to connect multiple edges to a single output socket, in which case the
data is duplicated. Sockets can be tagged by any number of expressions, relating
the data passing through a socket to the data structures used in the code. The
exact way this I/O transformation graph is converted into imperative code is
flexible, but here we primarily use a top-down compilation, unlike the generator
approach, which is essentially limited to a bottom-up compilation. With this
method the control thus lies at the topmost data generators, such as a node where
raw data is read from a file. The data is then pushed to the other I/O nodes as it
is read.
Let us consider a simple example. Listing 3.10 shows the initial sparse
matrix-vector multiplication specification without additional algorithmic op-
timizations applied on it. We can materialize shared space A to subscriptable
PA using the SharedSpaceMaterializationPass and then concretize it to a flat 2D
array CPA using the ConcretizationPass.
forelem nz in NZ:
    C[nz.row] += A[nz.row, nz.col] * B[nz.col]
Listing 3.10: Initial sparse matrix-vector multiplication specification.
Figure 3.1 displays the simplified resulting input transformation graph, not
displaying transformations on data structures other than A¹. I/O graphs con-
sist of a few concepts. Primarily it consists of I/O nodes, which typically de-
fine some sort of operation on data. I/O nodes have any number of input and
output sockets (depending on the type of the I/O node). Output sockets can
connect to input sockets of other nodes as long as no cycle forms. Connections
between sockets are always annotated with the type of data sent. For example,
between the “I/O reader” and “Tuple to KeyValue” node we send named tu-
ples ⟨row, col, val ⟩, and from “Tuple to KeyValue” to “Concretize to FlatArray”
we send both a two-dimensional unnamed tuple (the index), denoted by [2D],
and named tuples ⟨row, col, val ⟩. Note that when we send these two items they
are always sent together: it can practically be seen as a two-tuple containing an-
other two-tuple and the named tuple. We usually annotate each input socket
with symbols (local to the I/O node) to which we assign incoming data. The
output socket is typically annotated with a rough description of the transfor-
mation the I/O node performs. Finally we allow sockets to be tagged by expres-
sions used in the algorithm, defining what data structures are represented by
data going through that socket. In most cases these expressions are just a single
symbol. For example, the data in A, PA and CPA is represented by the output of
the “Tuple to KeyValue” node.
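A minimal Python sketch of how such a graph could be represented follows (the actual libtupl classes are written in C++ and differ; all names below are assumptions):

from dataclasses import dataclass, field
from typing import List

@dataclass
class Socket:
    node: "IONode"
    tags: List[str] = field(default_factory=list)   # expressions such as "A", "PA", "CPA"

@dataclass
class Edge:
    source: Socket            # an output socket
    target: Socket            # an input socket of another node
    data_type: str            # e.g. "[2D], <row, col, val>"

@dataclass
class IONode:
    name: str                                        # e.g. "Tuple to KeyValue"
    inputs: List[Socket] = field(default_factory=list)
    outputs: List[Socket] = field(default_factory=list)

reader = IONode("I/O reader")
reader.outputs.append(Socket(reader))

to_kv = IONode("Tuple to KeyValue")
to_kv.inputs.append(Socket(to_kv))
to_kv.outputs.append(Socket(to_kv, tags=["A", "PA", "CPA"]))

edge = Edge(reader.outputs[0], to_kv.inputs[0], "<row, col, val>")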
Note that in the example input transformation graph of Figure 3.1 the appli-
cation of the SharedSpaceMaterializationPass to A did not transform the data,
but did tag the “Tuple to KeyValue” node to indicate PA is also represented by
the same output. The application of the ConcretizationPass did not transform
the data either. The I/O graph was initialized with just the “I/O reader” and
“Tuple to KeyValue” nodes; the “Concretize to FlatArray” node is produced by
the ConcretizationPass. This node will actually write the data to the generated
data structure.

[Figure 3.1: the (simplified) input transformation graph. An “I/O reader” node streams ⟨row, col, val⟩ tuples to a “Tuple to KeyValue” node (input variable t; output key [t.row, t.col], value t); its output socket carries [2D] keys with ⟨row, col, val⟩ values, is tagged with the expressions A, PA and CPA, and feeds a “Concretize to FlatArray” node.]

¹ Figure 3.1 is simplified. Concretization does not produce just a single I/O node, as we will see later in this section. The example illustrates the structure of I/O graphs well, though.
These I/O transformation graphs specify the data transformations at a higher
level than the coroutines in the previous section did, allowing us to perform
transformations on this transformation graph more easily. Unlike the genera-
tor approach where we generate imperative code immediately, we only store
high-level logical operations in the transformation tree describing what each
node does. Only in the end we generate imperative code for each I/O node
that actually performs the data transformations, unlike the approach in the
previous sections where the produced coroutines are practically directly exe-
cutable. It is, for example, possible to convert an I/O transformation graph
back to a range of connected generators as described in the previous section,
although in Section 3.5.1 we will see that a top-down compilation approach
avoids certain problems a bottom-up compilation yields.
Table 3.3 enumerates all implemented I/O nodes. They are described in
more detail in Appendix C.
ConstantStream (output: Tuple, NTuple; conversion: D.4): For a range of keys, outputs a constant tuple (see also Section 3.5.2).

DataStreamReader (output: NTuple; conversion: D.3): Imports data using an external data generator (see also Section 3.5.2).

DataStreamWriter (input: NTuple; conversion: D.10): Exports data using an external callback function (see also Section 3.5.2).

Aggregate (input: NTuple; output: Tuple, NTuple; conversion: D.5): Aggregates the value in a certain field of all input tuples depending on the aggregate function. Outputs a single singleton tuple.

KeyValue to Tuple (input: Tuple, any; output: NTuple; conversion: like D.1): Transforms input key-value pairs to a tuple depending on the configured expression.

Transform KeyValue (input: Tuple, any; output: Tuple, any; conversion: D.2): Transforms input key-value pairs depending on the configured expressions.

Tuple to KeyValue (input: NTuple; output: Tuple, any; conversion: like D.2): Transforms the input tuple to a key-value pair depending on the configured expressions.

Transform Tuple (input: NTuple; output: Tuple, any; conversion: D.1): Transforms the input tuple depending on the configured expression.

Count Tuples (input: NTuple; output: NTuple & Tuple, NTuple; conversion: D.6): Assigns a number to each input tuple, extending the tuple with a new field containing that number. Tuples that match the same query have distinct numbers. Also outputs key-value pairs containing the total number of tuples in each query group.

Write Value (input: Tuple, any; conversion: D.8): Writes data to a concretized Subscriptable.

Deconcretize (output: Tuple, any; conversion: D.9): Streams data out of a concretized Subscriptable.

Jag (input: Tuple, NTuple; output: Tuple, NTuple; conversion: D.7): Reduces the width of the input key by a certain offset, creating a jagged structure (see also Section C.6).

Merge (input: NTuple & Tuple, any; output: NTuple): Extends each tuple in the first input with a new field whose value is looked up by key in the second input depending on the configured query.

Table 3.3: All implemented I/O nodes. The conversion reference for each node points to the subsection of Appendix D where its conversion routine is described; see also Section 3.5.3.
HorizontalIterationSpaceReductionPass (C.1): Creates a “Transform KeyValue” node modifying the tuple to the reduced tuple.

AggregateReservoirPass (C.2): Creates an “Aggregate” node aggregating a certain field of the input tuples with the specified function.

LocalizationPass (C.3): Creates a “Merge” node with both the reservoir and the shared space as input. The node is parameterized by the query describing where the value to be merged is located in the shared space (see Appendix A.3).

ReservoirMaterializationPass + NStarMaterializationPass (C.4): Creates a “Count Tuples” node, assigning a unique index to each tuple within each query group, in the end forming PR from input reservoir R. Also outputs key-value pairs containing the number of tuples matching each query group for the PR_len subscriptable.

DelocalizationPass (C.5): When delocalizing PR, a duplicate PR_deloc is created. Whatever I/O node outputs the data of PR (i.e. is tagged with expression PR) is then also tagged with PR_deloc.

StructureJaggedSplittingPass (C.6): Creates a “Jag” node which converts the subscriptable to a jagged subscriptable. “Transform KeyValue” nodes are linked to it to perform the structure splitting.

MergeEliminationPass (C.7): This is a new pass that only has an effect on the I/O transformation graphs. It tries to eliminate “Merge” I/O nodes and replace them with a “Transform Tuple” I/O node, computing the desired output directly from some shared ascendant I/O node.

ConcretizationPass (C.8): Creates a “Write Value” node that actually writes resulting values to a concretized data structure.
structure after the ReservoirMaterializationPass and NStarMaterializationPass
have been applied on A (recall that ∗Z will yield a stream of the key-value pairs
in Z, i.e. the number of tuples that match each query). In this figure, “Any
I/O node” indicates that the node generating the data of A can be any node:
regardless of which node it is, the same new nodes are attached to its output
socket. Note how the “Count Tuples” node has two outputs: one represents the
data of PA' and one the data of PA_len (see also Section C.4). This node can
be compiled top-down, avoiding the need to read the input twice, instead
producing data for both connected I/O nodes at the same time.
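Rendered top-down, the “Count Tuples” node thus becomes a single pass over its input that pushes data to both consumers; a Python sketch of this behaviour (the consumer callbacks and all names are assumptions):

from collections import defaultdict

def count_tuples(tuples, emit_pa, emit_pa_len):
    # One pass over the input: extend each tuple with its offset k within its
    # query group (pushed towards PA'), then push the group sizes (towards PA_len).
    counts = defaultdict(int)
    for row, col in tuples:
        k = counts[row]
        counts[row] += 1
        emit_pa((row, k), (row, col))
    for row, n in counts.items():
        emit_pa_len(row, n)

PA, PA_len = {}, {}
count_tuples([(0, 3), (0, 7), (2, 1)],
             emit_pa=lambda key, value: PA.__setitem__(key, value),
             emit_pa_len=lambda row, n: PA_len.__setitem__(row, n))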
[Figure: the I/O nodes added when materializing A. The node producing A's ⟨row, col⟩ tuples feeds a “Count Tuples” node (outputs ⟨∗t, Z[t.row]++⟩ and ∗Z), which feeds a “Tuple to KeyValue” node with key [t.row, t.k] and value t.]
[Figure: the I/O nodes added by concretization. Any I/O node producing key-value pairs k, v feeds a “Jag” node (key: k[2:], value: v), which feeds a “Write Value” node (assert(k == []), v).]

3.5.2 Initialization

In order to initialize the I/O transformation graphs, initial graph nodes have
to be created. These nodes have to be tagged with the tuplespace or shared
space represented by a node. To provide input data, the I/O nodes “DataS-
treamReader” and “ConstantStream” can be used. The “DataStreamReader”
takes an (external) generator coroutine as input, which can be used to, for
example, read data from a file. The “ConstantStream” simply generates tu-
ples with constant values (v0 , v1 , . . .) on each shared space storage location
(0 . . . x0 , 0 . . . x1 , . . . , 0 . . . xn−1 ), where x, n and v are parametrizable. It is mostly
used to, for example, zero-initialize an output shared space.
The I/O node “DataStreamWriter” can be used to export any data after the
algorithm has been executed. It takes an (external) callback function that is
invoked for each tuple sent to this I/O node. This is mostly used to export the
data of a shared space. Note that in many cases it is easy to inline this callback.
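For concreteness, a Python sketch of the “ConstantStream” behaviour described above (parameter names are assumptions):

from itertools import product

def constant_stream(extents, value):
    # Yield (key, value) pairs for every storage location
    # (0..extents[0]-1, 0..extents[1]-1, ...), always with the same constant value.
    for key in product(*(range(x) for x in extents)):
        yield key, value

# Zero-initialize a one-dimensional shared space C with N = 4 locations.
C = dict(constant_stream((4,), 0))   # {(0,): 0, (1,): 0, (2,): 0, (3,): 0}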
Let us consider an example I/O initialization for the sparse matrix-vector
multiplication. We use the standard sparse matrix-vector multiplication speci-
fication as shown in Listing 3.11. The host language here is C++. Let us assume
that A is an N × N sparse matrix and we start indexing from 0. So B and C both
contain N elements. C is zero-initialized while B is initialized using doubles
from 1 to n. The data for shared space A and tuplespace NZ is read from a sim-
ple COO-like file storing triplets. We thus have to split off the value into shared
space A and only store the coordinates in tuplespace NZ.
forelem nz in NZ:
    C[nz.row] += A[nz.row, nz.col] * B[nz.col]
Listing 3.11: Sparse matrix-vector multiplication specification in tUPL.
struct tuple_row_col_val {
    uint64_t row;
    uint64_t col;
    double val;
};

Stream<tuple_row_col_val> coo_loader(std::ifstream coo_file) {
    tuple_row_col_val triplet;
    while (coo_file >> triplet.row && coo_file >> triplet.col && coo_file >> triplet.val) {
        co_yield triplet;
    }
}
Listing 3.12: Coroutine to load sparse matrix triplet data for A and NZ.
struct tuple_n {
    uint64_t n;
};

Stream<tuple_n> coo_loader(size_t count) {
    for (int n = 0; n < count; ++n) {
        co_yield {n};
    }
}
Listing 3.13: Coroutine to load vector index data for B.
We can use this to construct the input graph as shown in Figure 3.4. The
upper “DataStreamReader” streams data produced by the coroutine from List-
ing 3.12 and the lower one the data produced by the coroutine in Listing 3.13.
The unconnected sockets remain unconnected: transformations on the input
specification will result in automatically generated I/O nodes connecting to
these sockets.
Note that in this case we can also connect another “Tuple to KeyValue” to
the lower “DataStreamReader” to zero-initialize C, instead of using the “Con-
stantStream” utility. Figure 3.5 shows this alternative zero-initialization ap-
proach for C.
For the output graph we want to export the data in C to, for example, the
standard output. Before we can use the “DataStreamWriter” we have to specify
an (external) callback routine that is invoked for each tuple of data sent to this
writer. The input of this node will be a two-tuple containing the 1D location
and the value at that location in C, so this callback function needs to take this
as input. Listing 3.14 shows a possible callback function.
struct tuple_n_val {
    uint64_t n;
    double val;
};
[Figure 3.4: Initial input graph for tuplespace NZ and shared spaces A, B and C for the sparse matrix-vector multiplication specification. A “DataStreamReader” streaming ⟨row, col, val⟩ tuples feeds a “Transform Tuple” node (⟨t.row, t.col⟩, tagged NZ) and a “Tuple to KeyValue” node (key: [t.row, t.col], value: t.val, tagged A); a second “DataStreamReader” streaming ⟨n⟩ tuples feeds a “Tuple to KeyValue” node (key: [t.n], value: t.n + 1.0, tagged B); a “ConstantStream” node (key: [0 ≤ c0 < N], value: 0) is tagged C.]
Figure 3.6 shows a possible initial output graph, using the callback function
in Listing 3.14 for the “DataStreamWriter”. Note how the “KeyValue to Tuple”
node converts the input key and value of this node (representing each storage
location and value in C) into a two-tuple, without an explicit key, that can be used
as input to the “DataStreamWriter”.
[Figure 3.5: Alternative initial input graph (only displaying the initialization of B and C): a single “DataStreamReader” streaming ⟨n⟩ tuples feeds two “Tuple to KeyValue” nodes, one with key [t.n] and value t.n + 1.0 (tagged B) and one with key [t.n] and value 0 (tagged C).]
[Figure 3.6: Initial output graph for C: a “KeyValue to Tuple” node (⟨k[0], v⟩) converts each key-value pair of C into an ⟨n, val⟩ tuple, which is sent to a “DataStreamWriter” node.]
node conversion routines. After converting the two graphs the standard code
generators can be used to convert the language-agnostic tUPL code to a target
language.
Generally the implementation of these conversion routines is trivial. Each
conversion routine takes a number of symbols as input, in which the data of
the parent I/O nodes is stored. The routine then creates new symbols containing
the data that will be sent to the child I/O nodes. These nodes
are always compiled top-down. We start with the top (root) nodes that have no
formation at each node. This is unlike the previous generator approach, where
data would always need to be pulled from a parent generator. This tends to
produce highly linear code. We rely on lower level compilers to optimize this
code further by performing, for example, forward substitution and dead code
elimination.
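As an illustration of the kind of code this yields, a Python sketch of a top-down rendering of a small chain “DataStreamReader” -> “Tuple to KeyValue” -> “Write Value” (the loader and the concretized structure are assumptions):

def load(loader, CPA):
    for row, col, val in loader:   # “DataStreamReader”: push each raw tuple downwards
        key = (row, col)           # “Tuple to KeyValue”: key [t.row, t.col]
        value = val                #                      value t.val
        CPA[key] = value           # “Write Value”: store into the concretized structure

CPA = {}
load([(0, 0, 1.0), (1, 2, 4.0)], CPA)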
Details about how each I/O node is converted and examples can be found
in Appendix D. Table 3.3 enumerates all implemented I/O nodes. The last
column in this table refers to subsections in Appendix D containing details
about its conversion routine.
3.6 Code generation
Once the ConcretizationPass has been performed, a code generator can be used
to convert the output-agnostic transformation tree node containing the con-
cretized algorithm specification and load/unload objects to some target lan-
guage (see also Figure 2.1). Such code generators can be used to generate, for
example, C++ output. This output can be compiled to an executable or library
using, for example, clang.
After the various transformation passes have been applied the algorithm
will only consist of basic constructs, such as while and for statements; whilelem
and forelem statements have been eliminated. Additionally, all symbols refer-
enced are now of simple types: concretized subscriptables or primitive scalars.
This allows code generators to generate code in a straightforward way as typ-
ical target languages support similar constructs too. Implementations of the
low-level concretized subscriptables are part of the libtupl runtime, which al-
lows target languages to manipulate such concretized subscriptables through
a simple API.
When using the I/O approach with generators various coroutines can be
produced. While it is possible to flatten these coroutines to low-level impera-
tive code, many target languages support coroutines themselves too (and their
compilers flatten simple coroutines as well). We thus generally do not elimi-
nate coroutines, but let the lower level compilers optimize them away as they
please instead.
The I/O approach using transformation graphs will output load and un-
load functions with basic constructs once the graphs have been transformed
into imperative code as described in Section 3.5.3. It is thus also trivial to tran-
spile this to a target language.
        C[nz.row] += A[nz.row, nz.col] * B[nz.col]
Listing 3.16: Example initial sparse matrix-vector multiplication specification
after algorithmic optimization.
After algorithmic optimization, data structure optimization follows. In this
example we will work towards a jagged data structure (i.e. a data structure
whose elements are subscriptable data structures): an array indexed by row
at the top level. Then, for each row two arrays are stored: one in which the
column indices are stored and one in which the value is stored for each nonzero.
A key advantage of such a structure over a non-jagged 2D array is that we no
longer have to pad each row to the maximum number of nonzeros that occurs in
any row, potentially saving a significant amount of memory. Furthermore,
splitting the column and nonzero values into two separate arrays could allow
for better vectorization.
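A small Python sketch of the difference between a padded 2D layout and the jagged layout aimed for here (the data is illustrative):

# Padded 2D layout: every row is padded to the maximum number of nonzeros per row.
padded_col = [[1, 3, -1],        # row 0 has 2 nonzeros and 1 padding slot
              [2, -1, -1],       # row 1 has 1 nonzero and 2 padding slots
              [0, 2, 4]]         # row 2 has 3 nonzeros

# Jagged layout: per row one column array and one value array, each exactly as
# long as the number of nonzeros in that row, so no padding is needed.
jagged = [
    {"col": [1, 3],    "val": [0.5, 2.0]},
    {"col": [2],       "val": [1.0]},
    {"col": [0, 2, 4], "val": [3.0, 4.0, 5.0]},
]

row, k = 2, 1
value = jagged[row]["val"][k]    # value of nonzero k of row 2
column = jagged[row]["col"][k]   # its column index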
Let us first perform the EncapsulationPass on the code in Listing 3.16, followed
by the LocalizationPass, localizing shared space A into tuplereservoir NZ
on [row, col], forming NZ_merge_A. Then, let us perform the AggregateReservoirPass
on this NZ_merge_A, aggregating the values of field row with the function max.
At this point the specification will look like the one shown in Listing 3.17, while
the input transformation graph will be changed to Figure 3.7. The output trans-
formation graph does not change during these first few transformations.
forelem row in [0, aggr_NZ_merge_A_max_row]:
    forelem nz in NZ_merge_A where nz.row == row:
        C[nz.row] += nz.merged_val * B[nz.col]
Listing 3.17: Algorithm specification after performing three transformations.
Let us now continue with the materialization of the data structures. First we
perform the QueryForwardSubstitutionPass to avoid reading t.row from the
tuples t: this value is always equal to row in the inner loop. After that we con-
tinue with the ReservoirMaterializationPass of NZ_merge_A into PNZ_merge_A
and the NStarMaterializationPass to fully materialize the NZ_merge_A tuplespace
into PNZ_merge_A' and PNZ_merge_A_len. Finally we perform the SharedSpace-
MaterializationPass twice, once for B (into PB) and once for C (into PC). These
passes will transform the algorithm to the code shown in Listing 3.18. The in-
put transformation graph will become the one visualized in Figure 3.8. The
only change to the output transformation graph of Figure 3.6 is that we also
tag the “KeyValue to Tuple” node with PC. No additional I/O nodes are intro-
duced in this output transformation graph.
forelem row in [0, aggr_NZ_merge_A_max_row]:
    forelem k in [0, PNZ_merge_A_len[row]-1]:
        PC[row] += PNZ_merge_A'[row, k].merged_val * PB[PNZ_merge_A'[row, k].col]
Listing 3.18: Algorithm specification after performing five additional
transformations.
Before concretizing everything we first perform the HorizontalIterationSpac-
eReductionPass to remove the unused row and k tuple fields from the subscript-
able PNZ_merge_A' storing the tuples, which results in PNZ_merge_A''. Ad-
ditionally we perform a MergeEliminationPass to eliminate the “Merge” I/O
node in the input transformation graph.
[Figure 3.7: Input transformation graph corresponding with Listing 3.17 after performing three transformations. The triplet “DataStreamReader” (⟨row, col, val⟩) feeds a “Transform Tuple” node (⟨t.row, t.col⟩) and a “Tuple to KeyValue” node (key: [t.row, t.col], value: t.val); a “Merge” node combines both (⟨∗t, X[t.row, t.col]⟩) and feeds an “Aggregate” node (key: [], value: max(t.row), tagged aggr_NZ_merge_A_max_row). The second “DataStreamReader” (⟨n⟩) feeds a “Tuple to KeyValue” node (key: [t.n], value: t.n + 1.0), and a “ConstantStream” node (key: [0 ≤ c0 < N], value: 0) remains tagged C.]
[Figure 3.8: Input transformation graph corresponding with Listing 3.18 after performing six additional transformations. Compared to Figure 3.7, the output of the “Merge” node now also feeds a “Count Tuples” node (⟨∗t, Z[t.row]++⟩; second output ∗Z tagged PNZ_merge_A_len), whose extended tuple output feeds a “Tuple to KeyValue” node (key: [t.row, t.k], value: t, tagged PNZ_merge_A'). The lower “Tuple to KeyValue” node is now tagged B, PB and the “ConstantStream” node C, PC.]
Then we perform a StructureJagged-
SplittingPass on PNZ_merge_A'' to split the tuples ⟨col, merged_val ⟩ stored in
this subscriptable into the groups [⟨col ⟩, ⟨merged_val ⟩] at offset 1. To finish the
data structure transformation process we perform the ConcretizationPass to
concretize everything. This will yield the code in Listing 3.19 (note the con-
version to C-style for loop specification) and the input transformation graph
from Figure 3.9. The ConcretizationPass also changed the output transforma-
tion graph, as shown in Figure 3.10.
for row = 0, row <= aggr_NZ_merge_A_max_row, row += 1:
    for k = 0, k <= PNZ_merge_A_len[row]-1, k += 1:
        CPC[row] += CPNZ_merge_A'''[row].__merged_val[k].merged_val * CPB[CPNZ_merge_A'''[row].__col[k].col]
Listing 3.19: Algorithm specification after performing all transformations.
Finally the output code is generated using the C++ code generator. This
code generator will also invoke routines to translate the transformation I/O
graphs to tUPL code before converting those to C++ as described in Section 3.5.3.
For reference, the full C++ output code of this example is provided in Ap-
pendix E.
[Figure 3.9: Input transformation graph corresponding with Listing 3.19 after performing all transformations. The graph of Figure 3.8 is extended by the ConcretizationPass: chains of “Transform KeyValue”, “Jag” and “Write Value” nodes split the stored ⟨col, merged_val⟩ tuples into the jagged __col and __merged_val structures of CPNZ_merge_A''' and write the concretized data for PNZ_merge_A_len, aggr_NZ_merge_A_max_row, B/PB and C/PC/CPC.]
[Figure 3.10: Output transformation graph after concretization: a “Deconcretize” node (key: k, value: v) streams the concretized data of C, which a “KeyValue to Tuple” node (⟨k[0], v⟩) converts into ⟨n, val⟩ tuples for the “DataStreamWriter” node.]
Chapter 4
Tython
to generate parsers, AST definitions and semantic analyzers, existing compila-
tion functions in the CPython project that parse text into Python ASTs can also
be used to parse text into Tython ASTs.
In Tython we can specify sparse matrix-vector multiplication using, for
example, the code in Listing 4.1. In this code we define a Tython module
“MatVec” using the newly introduced tdef keyword. It consists of a tuplespace
(Reservoir) and three shared spaces (SharedSpace), defined on lines 2-5 us-
ing the ctxdef keyword. This keyword can be used to define shared spaces
and reservoirs for the entire specification. For example, Reservoir[{row:
int, col: int},] indicates the reservoir contains tuples of type {row: int
, col: int}, and SharedSpace[2, int] indicates the shared space has a
two-dimensional index and contains integers. Within this module we define
the “matvec” tUPL function on line 17, which operates on these spaces. Ad-
ditionally, “load” and “unload” functions have been defined on lines 7 and 14
respectively. These functions are only symbolically analyzed and used to ini-
tialize the I/O state (in the case of release compilation this initializes the I/O
transformation graphs, as described in Section 4.3, allowing libtupl to trans-
form the I/O transformation graphs as it applies transformations). On line 9
we initialize shared space A from the input stream Values. For each tuple v in
Values a value is inserted into A with index (v.row, v.col) and value v.val.
Initialization of the other structures is similar. On line 15 we define how a data
structure should be unloaded. Here, for each key-value pair k, v in C a tuple
(k[0], v) is sent to the output stream CVals.
 1 tdef MatVec:
 2     ctxdef NZ: Reservoir[{row: int, col: int},]
 3     ctxdef A: SharedSpace[2, int]
 4     ctxdef B: SharedSpace[1, int]
 5     ctxdef C: SharedSpace[1, int]
 6
 7     def load(Values: InStream[{row: int, col: int, val: int},],
 8              BVals: InStream[{i: int, val: int},]):
 9         BindSharedSpace(A, Values, lambda v: (v.row, v.col), lambda v: v.val)
10         BindSharedSpace(B, BVals, lambda v: (v.i,), lambda v: v.val)
11         BindReservoir(NZ, Values, lambda v: (v.row, v.col))
12         BindSharedSpace(C, BVals, lambda v: (v.i,), lambda v: 0)
13
14     def unload(CVals: OutStream[{i: int, v: int},]):
15         BindSharedSpaceOut(CVals, C, lambda k, v: (k._0, v))
16
17     def matvec():
18         forelem t in NZ:
19             C[t.row] += A[t.row, t.col] * B[t.col]
Listing 4.1: Sparse matrix-vector multiplication in Tython.
In order to actually invoke this tUPL code we can use standard Python code
as shown in Listing 4.2. Whenever loading data, any Python iterable can be
used as input, like a list (and is compatible with an InStream in Listing 4.1).
Any container to which a tuple key can be assigned, like a dict, can be used
to unload results into (such objects are compatible with an OutStream in List-
ing 4.1).
# like instantiating an instance of a class
matvec = MatVec()
# load data into the spaces
matvec.load(
    [(0, 0, 1), (5, 2, 4), (1, 3, 12.4)],
    [5, 6, 7, 8, 9, 10]
)
# run the algorithm
matvec.matvec()
# unload data to Python data structures
C = {}
matvec.unload(C)
print(C)
Listing 4.2: Invoking the sparse matrix-vector multiplication in Tython.
In Python, the ast module allows easy inspection and manipulation of the
AST of Python code within Python scripts [18]. Similarly to this, we have cre-
ated a new Python module tast that allows this for Tython code. Additionally,
some utility routines to parse a string to such a Tython AST are included in
this tast module. The Tython compiler will convert the Tython AST code to
a standard Python AST by iterating through all Tython AST nodes using this
tast module and creating equivalent Python AST nodes using the ast module
(which is built into Python). For Tython-specific syntax (tdef blocks), a spe-
cial compilation process takes over and replaces these Tython AST nodes with
Python AST nodes. This process is described in more detail in the next two
sections, as it differs between Tython's debug and release modes.
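For reference, this is the kind of manipulation the built-in ast module supports (plain Python; the Tython-specific tast module offers analogous functionality for Tython AST nodes):

import ast

tree = ast.parse("C[t.row] += A[t.row, t.col] * B[t.col]")

class RenameB(ast.NodeTransformer):
    # Example transformation: rename every reference to B into PB.
    def visit_Name(self, node):
        if node.id == "B":
            node.id = "PB"
        return node

tree = ast.fix_missing_locations(RenameB().visit(tree))
print(ast.unparse(tree))   # ast.unparse requires Python 3.9 or newer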
be defined. In the case of debug compilation, BindSharedSpace(A, Values, lambda v: (v.row, v.col), lambda v: v.val)
will insert, for each tuple t in Values, the value (lambda v: v.val)(t) at key
(lambda v: (v.row, v.col))(t) in dict A. BindSharedSpaceOut(CVals, C, lambda k, v: (k._0, v))
will, for each key k and value v in dict C, insert the key and value
(lambda k, v: (k._0, v))(k, v) into the CVals Python structure (typically
a dict).
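In sketch form, this debug-mode behaviour of the bind functions can be thought of as follows (a Python simplification, not the actual generated code):

def bind_shared_space(space, values, key_fn, value_fn):
    # Debug-mode BindSharedSpace: fill a dict-backed shared space from an iterable.
    for t in values:
        space[key_fn(t)] = value_fn(t)

def bind_shared_space_out(out, space, transform):
    # Debug-mode BindSharedSpaceOut: copy a dict-backed shared space to the output.
    for k, v in space.items():
        key, value = transform(k, v)
        out[key] = value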
Now the algorithm can be transpiled. The algorithm in Listing 4.1 will be
transpiled into a Python AST equivalent to the Python code shown in List-
ing 4.3. The forelem loop is simply converted into a for-loop and the sub-
scripts into the dicts are now explicit Python tuples.
class MatVec:
    ...
    def matvec(self):
        for t in self.NZ:
            self.C[(t.row,)] = self.C[(t.row,)] + self.A[(t.row, t.col)] * self.B[(t.col,)]
Listing 4.3: Transpiled sparse matrix-vector multiplication.
whilelem loops will pseudo-randomly select tuples to execute from the tu-
plespace until no more tuples are enabled, then terminate. For example, the
sorting specification in Listing 4.4 will compile to something roughly equiva-
lent to the Python code in Listing 4.5.
tdef Sort:
    ...
    def sort():
        whilelem adj in ADJS:
            if A[adj.left] > A[adj.right]:
                tmp = A[adj.left]
                A[adj.left] = A[adj.right]
                A[adj.right] = tmp
Listing 4.4: Stable sorting in tUPL.
class Sort:
    ...
    def sort(self):
        __all_tuples = set(itertools.product(self.ADJS, range(1)))
        __enabled = set(__all_tuples)
        while __enabled:
            adj, __exec_seq_idx = random.sample(__enabled, 1)[0]
            __enabled.remove((adj, __exec_seq_idx))
            if __exec_seq_idx == 0 and self.A[adj.left] > self.A[adj.right]:
                tmp = self.A[adj.left]
                self.A[adj.left] = self.A[adj.right]
                self.A[adj.right] = tmp
                __enabled = set(__all_tuples)
Listing 4.5: Transpiled sort.
Note how the whilelem loop has been transpiled to a while-loop that keeps
executing a random tuple as long as the __enabled set of (tuple, sequential
code block) pairs is nonempty. Whenever a tuple is selected it is removed
from the enabled set. However, whenever a tuple successfully executes, the
__enabled set is reset back to __all_tuples, causing all previously disabled
tuples to be attempted again.
The debug compiled whilelem loop does not check if the program ends
up in a state to which it is always possible to return. The program should ter-
minate in such a state, but when using debug compilation such specifications
may never terminate.
[Figure: the Tython compilation pipeline. A .tpy file is parsed (via tast.compile) into a Tython AST. Each tdef block is used to initialize a tUPL AST plus I/O graphs (via the bridge), which libtupl optimizes; the C++ code generator emits a C++ library that is compiled and linked/loaded (plus Python bindings etc.). Each tdef block in the Tython AST is rewritten (via the tast iterator and ast) into an Import AST node for that library, and the resulting Python AST is compiled (via marshal) into a .pyc file.]
mented in Python to allow rapid development, Python bindings are necessary
for libtupl to allow Tython to interface with libtupl. This bridge exposes vari-
ous libtupl classes to Python using Boost.Python [4]. It includes, for example,
functions to define static types and symbols, functions to create AST nodes for
tUPL code and functions to create I/O transformation graph nodes to allow
constructing I/O transformation graphs from Python. The bridge also enables
the various optimization transformations to be performed on the tUPL code
and I/O graphs. Additionally, it exposes code generators, allowing it to dump,
for example, C++ output. Some utilities are also exposed to, for example, de-
termine the static result type of a tUPL AST expression.
For each tdef code block a separate tUPL compilation is performed (each
tdef is a single CompilationInstance in Figure 2.1). First, the compiler will
initialize the root node of the transformation tree, starting by copying the defi-
nitions of the shared spaces and tuplespaces that are defined through ctxdef
to this node.
The load and unload function signatures are then analyzed statically. Each
parameter of the load and unload functions, which must be of type InStream
and OutStream respectively in Tython, is converted to a “DataStreamReader”
or “DataStreamWriter” I/O node respectively.
for each “DataStreamReader” that iterates the values of any Python data struc-
ture and converts each value to a C++ struct which the generator will then
yield1 . For each “DataStreamWriter” Tython will generate a callback function
that inserts each output tuple into a Python data structure.
Once these reader and writer I/O nodes have been initialized, tUPL will
analyze the contents of the load/unload functions. Invocations to “Bind...”
functions are then converted to I/O nodes. For example, BindSharedSpace(A
, Values, lambda v: (v.row, v.col), lambda v: v.val) will create a
“Tuple to KeyValue” node linking to the “DataStreamReader” loading the data
of Values, with the two lambda-functions respectively specifying the key and
value of each element in the shared space. BindReservoir behaves similarly to
BindSharedSpace, just using a “Transform Tuple” I/O node. ConstantInit
(C, 0, (1000,)) will lead to the initialization of a “ConstantStream” (root)
I/O node, which will stream key-value pairs for which the value is always
0. Tuples are produced for each of the one-dimensional keys ⟨k ⟩ where 0 ≤
k < 1000. This “ConstantStream” node is then tagged with the shared space
symbol expression C. BindSharedSpaceOut(CVals, C, lambda k, v: (k.
_0, v)) is practically the reverse of BindSharedSpace. It takes the key-value
data from C, converts it to a singular tuple using a “KeyValue to Tuple” I/O
node (which is tagged with the symbol expression C), then links its output to
the “DataStreamWriter” I/O node previously created for CVals. The load def-
inition from Listing 4.1 is converted to the input transformation graph in Fig-
ure 3.5. The output definition is converted to the output transformation graph
in Figure 3.6.
Finally, after initializing the I/O transformation graphs, the compiler will
translate the Tython algorithm definition to a tUPL AST. The Tython and tUPL
ASTs are quite similar, so this is a mostly straight-forward translation.
Now Tython will invoke the optimization routines on the fully initialized
root transformation tree node. Once a concretized algorithm has been gener-
ated, Tython can invoke the C++ code generator to generate the optimized C++
implementation. Tython will add various additional C++ functions, such as
the generator and callback routines that the “DataStreamReader” and “DataS-
treamWriter” need. Additionally, Tython generates Python C extension mod-
ule-specific code (such as function bindings and other necessary runtime
code) to allow compiling all generated code to a shared object that can be im-
ported as a Python C extension module. This is then compiled using any C++
compiler, such as clang++ or g++.
Tython substitutes the tdef Tython AST node with a Python AST node that
imports this generated module to allow invoking the optimized tUPL algo-
rithm straight from Python. Once the compilation is completed, the Tython
AST has thus been converted to a Python AST, which is then marshalled into a .pyc
file. Additionally, for each tdef block Tython has generated and compiled a
Python C extension that contains the optimized tUPL code, which is imported
from the .pyc file.
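To make this mechanism concrete, the following minimal Python sketch mimics the substitution step using the standard ast module (the module name tupl_block_0 and the file name are purely hypothetical):

# Sketch: the tdef node is replaced by a Python AST that imports the generated
# extension module; the resulting AST is compiled to an ordinary code object,
# which can then be marshalled into a .pyc file.
import ast

source = "import tupl_block_0  # generated C extension module (hypothetical)"
tree = ast.parse(source)                 # Python AST that replaces the tdef node
code = compile(tree, "<tdef>", "exec")   # ordinary Python code object
# Tython marshals code objects like this into a .pyc file; the standard library
# offers the same for plain source files via py_compile.compile().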
Chapter 5
libtupl extensions
3
4 forelem t in X where t.field0 != 0:
5 ...
Listing 5.3: Possible resulting code after constructing a hybrid algorithm.
Note how in the first forelem loop we iterate all nonzeros on the main di-
agonal. If a matrix has many tuples on the main diagonal, this hybrid variant
could be useful. With some additional transformations (such as the algorith-
mic transformation letting us iterate the upper loop row-by-row and localizing
A into the tuplereservoir) the upper forelem loop can be transformed to the
specification shown in Listing 5.5. Note how NZ is practically being split into
two separate parts during this process (NZ and PNZ_merged_A'), indirectly per-
forming reservoir splitting [7].
1 forelem i in [0, VECCOUNT-1]:
2     forelem row in [0, max(NZ.row)]:
3         C[i, row] += PNZ_merged_A'[row].merged_val * B[i, row]
4
5     forelem nz in NZ where nz.row != nz.col:
6         C[i, nz.row] += A[nz.row, nz.col] * B[i, nz.col]
Listing 5.5: Hybrid SpMM specification in tUPL after additional transformations.
As shown in Listing 5.5, we will in the end iterate the diagonal matrix val-
ues in a dense fashion, which often is much faster if the diagonal consists of
primarily nonzeros. Note how we can transform the upper and lower forelem
loops separately from one another: we may have localized A into PNZ in the
upper forelem loop (and constructed an extended tuplespace), while the bottom
forelem loop still uses the original tuplespace without having the values of
A merged into it.
compiler can, for example, apply synchronization techniques to ensure this.
However, certain loops can be trivially parallelized, for example when any
iteration of the loop only has read-read data dependencies with other loop
iterations. In Listing 5.6 we are about to concretize the specification. It is easy
to see that we can execute the outer loop in parallel without the need for
additional synchronization techniques, because writes to PC directly depend
on the loop iterator i. In the end, loop blocking can be applied to divide the
iterations among multiple processors [7].
1 @parallelize
2 forelem i in [0, VECCOUNT-1]:
3     forelem k in PNZ_len[]:
4         PC[i, PNZ'[k].row] += PA[PNZ'[k].row, PNZ'[k].col] * PB[i, PNZ'[k].col]
Listing 5.6: Trivially parallelizable SpMM specification in tUPL.
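As an illustration of this last step, the following Python sketch divides the trivially parallel outer loop into contiguous blocks and executes the blocks on separate processes (the data layout and all names are simplified assumptions, not generated code):

# Sketch: loop blocking over the trivially parallel i loop. Each block writes
# only the C entries belonging to its own i values, so no locking is needed.
from concurrent.futures import ProcessPoolExecutor

def spmm_block(i_block, nonzeros, B, n_rows):
    C = {i: [0.0] * n_rows for i in i_block}
    for i in i_block:
        for row, col, val in nonzeros:
            C[i][row] += val * B[i][col]
    return C

def parallel_spmm(nonzeros, B, n_rows, veccount, workers=4):
    # Split [0, veccount) into contiguous blocks, one per worker.
    size = (veccount + workers - 1) // workers
    blocks = [range(s, min(s + size, veccount)) for s in range(0, veccount, size)]
    C = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(spmm_block, blocks,
                             [nonzeros] * len(blocks), [B] * len(blocks),
                             [n_rows] * len(blocks)):
            C.update(part)
    return C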
[I/O transformation graph: a "View" node (key: [a], value: *(M + a * sizeof(double))) producing [1D] double key-value pairs, feeding a "Transform KeyValue" node (key: k, value: v + 1.0), followed by a "Jag" node (key: k[1:], value: v) and finally a "Write Value" node (assert(k == []), v).]
Rather than storing the tuples in a two-dimensional structure as in List-
ing 5.7, we can decide to place the values in each dimension one after another
in a one-dimensional structure (without padding tuples). This is illustrated in
Listing 5.8. A new one-dimensional subscriptable PNZ''_ptr then indicates
where the data begins and ends in the one-dimensional PNZ'' for each row.
1 forelem i in [0, VECCOUNT-1]:
2     forelem row in [0, max(PNZ.row)]:
3         forelem k in [PNZ''_ptr[row], PNZ''_ptr[row+1] - 1]:
4             PC[i, row] += PNZ''[k].val * PB[i, PNZ''[k].col]
Listing 5.8: SpMM specification in tUPL after performing dimensionality reduction.
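The layout of PNZ'' and PNZ''_ptr can be mimicked in plain Python as follows (a sketch assuming the tuples are already grouped per row; the names are illustrative only):

# Sketch: store the per-row tuple groups back to back in one array and record
# where each row starts in a pointer array, as in a CSR-style layout.
def build_flat(rows_of_tuples):
    flat, ptr = [], [0]
    for row_tuples in rows_of_tuples:
        flat.extend(row_tuples)     # tuples of one row, placed one after another
        ptr.append(len(flat))       # end of this row == start of the next row
    return flat, ptr

rows = [[(0, 2.0)], [(1, 3.0), (3, 4.0)], []]   # (col, val) tuples per row
flat, ptr = build_flat(rows)
for row in range(len(rows)):
    # The listing above iterates k in [ptr[row], ptr[row+1] - 1] inclusively.
    for k in range(ptr[row], ptr[row + 1]):
        col, val = flat[k]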
to a hybrid algorithm, as in Listing 5.4 (where we split only the center diagonal
band into a separate loop). Both loops can now be transformed independently
of one another. The CSR implementation is derived like previously for the
loop where nz.row != nz.col. It is here also decided that we only insert
tuples for which where nz.row != nz.col holds into the CSR data structure,
eliminating the need for the where nz.row != nz.col query.
The loop where nz.row == nz.col is transformed by first performing al-
gorithmic optimization such that the tuples are iterated row-by-row. Forward
substitution is performed inside of the loop, replacing all occurrences of nz.
col with nz.row. Then, the EncapsulationPass, AggregateReservoirPass, Query-
ForwardSubstitutionPass, ReservoirMaterializationPass and NStarMaterializa-
tionPass are performed. Because only one tuple exists in each row, the NZ reser-
voir is materialized to a 1D Subscriptable. Finally, the SharedSpaceMaterial-
izationPass, HorizontalIterationSpaceReductionPass and ConcretizationPass (to a
dense array for the values) are performed.
Listing 5.10 illustrates a possible resulting algorithm. Note that various
minor transformations can still be applied, such as loop interchange, splitting
or merging. Additionally, multiple diagonals can be transformed to such a
dense array.
1 for i in [0, VECCOUNT-1]:
2     for row in [0, max_NZ_row-1]:
3         CPC[i, row] += CPNZ''[row] * CPB[i, row]
4
5     for row in [0, max_NZ_row-1]:
6         for k in [CPNZ_zipped_A_ptr''[row], CPNZ_zipped_A_ptr''[row+1] - 1]:
7             CPC[i, row] += CPNZ_zipped_A''[k] * CPB[i, CPNZ_zipped_A''_deloc[k]]
Listing 5.10: SpMM implementation using a Diagonal-CSR hybrid data structure for the sparse matrix in tUPL.
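A plain Python sketch of the resulting two-pass structure is shown below (a dense diagonal array plus a CSR-like structure for the remaining nonzeros; the argument names and data layout are simplified assumptions):

# Sketch of the Diagonal-CSR hybrid SpMM: a dense pass over the stored main
# diagonal followed by a CSR pass over the remaining (off-diagonal) nonzeros.
def hybrid_spmm(diag, csr_ptr, csr_val, csr_col, B, veccount, n_rows):
    C = [[0.0] * n_rows for _ in range(veccount)]
    for i in range(veccount):
        # Pass 1: main diagonal, stored densely (explicit zeros included).
        for row in range(n_rows):
            C[i][row] += diag[row] * B[i][row]
        # Pass 2: off-diagonal nonzeros in a CSR-like layout.
        for row in range(n_rows):
            for k in range(csr_ptr[row], csr_ptr[row + 1]):
                C[i][row] += csr_val[k] * B[i][csr_col[k]]
    return C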
Chapter 6
Experiments
performance issues with MKL's CSC implementation, we will not consider
input that is sorted by column, then row, in our experiments.1
                                                     See also
Derived CSR (for column major dense matrices)        Section 5.5
Derived CSR (for row major dense matrices)           Section 5.5
Derived Hybrid (for column major dense matrices)     Section 5.6
Derived Hybrid (for row major dense matrices)        Section 5.6
For the Derived Hybrid variant it is generally desired to minimize the number
of explicit zeros in the stored diagonal bands. If a diagonal band has few
nonzeros, we elect to store its nonzeros in the secondary CSR data structure
instead.
Clang 7 is used to compile the C++ code with -O3 -march=native
-ffast-math. We rely on clang to perform low-level optimizations such as
vectorization. The code generator could insert a limited number of compiler
hints such as restrict and assume_aligned, which we can mechanically de-
rive from the property that subscriptables do not share memory and that sub-
scriptable data is always allocated at 32-byte boundaries. 64-bit doubles and
64-bit integers are used everywhere. We do not perform low-level (manual)
optimizations that require extra domain knowledge.
All experiments have been executed on the DAS-5 cluster at Leiden Uni-
versity [2]. Each node consists of two Intel Xeon E5-2630 v3 CPUs at 3.2 GHz.
Each processor has 8 cores. While we will consider multi-processor computa-
tion (via NUMA), we do not consider simultaneous multithreading or multi-
node computation. All matrix data is read from and written to a separate file
server, which is connected to the compute nodes via a FDR InfiniBand inter-
connect. MKL version 2019.0.117 is used.
In the result figures each bar shows the mean runtime of the algorithm in
a certain scenario. Each bar is also split into three parts with different shades.
The bottom (lightest) part shows the initialization time. This consists of opening
input files, creating and opening a new properly-sized output file2 (or allocat-
ing memory in the heap and initializing the input array), memory mapping the
files and converting the sparse matrix data to the desired format. The middle
part of each bar shows the actual SpMM runtime. The top (darkest) part shows
the cleanup time, which includes closing files and freeing allocated memory.
Note that error bars are also drawn for each bar, but these are sometimes too
small to be visible.
1 The CSC sparse matrix-dense matrix multiplication routine in MKL is more than an order of
magnitude slower than a naive CSC implementation. Additionally, as of this writing MKL only supports
column major dense matrices in this configuration.
2 This implicitly results in a zero-initialization of the output, but the implementations still do
Figure 6.1: The five sparse matrices used for the overview experiments: (a) Mallya/lhr34c, (b) Fluorem/HV15R, (c) SNAP/sx-stackoverflow, (d) Bourchtein/atmosmodl, (e) Rajat/rajat30.
two runs to reduce variability.
For the NUMA cases we use the local memory allocation policy. Due to the
parallel initialization of the input dense matrix this will result in the first half
of the data being on NUMA node 0 and the second half on NUMA node 1. The
output data is not initialized, so the algorithms will determine the actual mem-
ory binding at runtime instead (allocations are based on which thread triggers
the allocation through a page fault). If the input and output dense matrices
are memory mapped instead of heap-allocated the exact allocation behavior
depends on the OS kernel instead. Preliminary experiments have shown that
duplicating the shared sparse matrix data structures such that both NUMA
nodes have a local copy does not improve performance significantly.
It is generally expected for the sparse matrices with obvious diagonals to
have improved performance when using the Derived Hybrid algorithm com-
pared to the Derived CSR implementation. We expect the Derived CSR im-
plementation to perform slightly worse than the hand-optimized MKL CSR
implementation.
[Bar chart omitted: execution times on matrix sx-stackoverflow, without I/O for the dense matrices. Panels: column major and row major dense matrices; y-axis: wall clock time (s); x-axis: thread counts (8+8, 4+4, 8, 4, 2) and dense matrix column counts (256, 64, 16); series: MKL CSR, Derived CSR.]
Figure 6.2: Performance on the sx-stackoverflow matrix where the dense matrices are not memory mapped.
[Bar chart omitted: execution times on matrix sx-stackoverflow, with I/O for the dense matrices. Panels: column major and row major dense matrices; y-axis: wall clock time (s); x-axis: thread counts (8+8, 4+4, 8, 4, 2) and dense matrix column counts (256, 64, 16); series: MKL CSR, Derived CSR.]
Figure 6.3: Performance on the sx-stackoverflow matrix where the dense matrices are also memory mapped.
[Bar chart omitted: execution times on matrix atmosmodl, without I/O for the dense matrices. Panels: column major and row major dense matrices; y-axis: wall clock time (s); x-axis: thread counts (8+8, 4+4, 8, 4, 2) and dense matrix column counts (512, 128, 32); series: MKL CSR, Derived CSR, Derived Hybrid.]
Figure 6.4: Performance on the atmosmodl matrix where the dense matrices are not memory mapped.
[Bar chart omitted: execution times on matrix atmosmodl, with I/O for the dense matrices. Panels: column major and row major dense matrices; y-axis: wall clock time (s); x-axis: thread counts (8+8, 4+4, 8, 4, 2) and dense matrix column counts (512, 128, 32); series: MKL CSR, Derived CSR, Derived Hybrid.]
Figure 6.5: Performance on the atmosmodl matrix where the dense matrices are also memory mapped.
during the cleanup phase rather than during the algorithm's own runtime.
This is, for example, clearly visible in the 1024 columns, 8 or 4 threads scenar-
ios with column major dense matrices. Additionally, because not a lot of data
is in the diagonal data array, a lot of time is spent waiting on the input matrix
data to be loaded during the first pass: most operations on each dense matrix
element occur in the second pass instead (the CSR pass), during which all data
has already been loaded. MKL CSR and Derived CSR are mostly on par here,
though.
Performance for the HV15R matrix is similar to that of the rajat30 matrix, with
only about 11% of the nonzeros being in a diagonal band, although when the
dense matrices are memory mapped a significant performance improvement
can be seen for the Derived Hybrid variant in the column major cases
compared to Derived CSR, as shown in Figure 6.8, for example being twice as
fast in the 512 columns, 4 threads scenario. MKL CSR still outperforms Derived
Hybrid here, though, taking only 60% of the time Derived Hybrid does.
The row major cases have low performance variance, but Derived CSR outper-
forms MKL CSR in the majority of the cases, for example being 14% faster in
the 512 columns, 2 threads scenario. HV15R is a very large matrix with
many nonzeros per row, but many nonzeros are not located in dense diagonal
bands. Unlike rajat30, the HV15R matrix has 17 diagonal bands. As a result,
the first Derived Hybrid pass is not as I/O bottlenecked as this first pass is for
the rajat30 matrix.
[Bar chart omitted: execution times on matrix rajat30, without I/O for the dense matrices. Panels: column major and row major dense matrices; y-axis: wall clock time (s); x-axis: thread counts (8+8, 4+4, 8, 4, 2) and dense matrix column counts (1024, 256, 64); series: MKL CSR, Derived CSR, Derived Hybrid.]
Figure 6.6: Performance on the rajat30 matrix where the dense matrices are not memory mapped.
[Bar chart omitted: execution times on matrix rajat30, with I/O for the dense matrices. Panels: column major and row major dense matrices; y-axis: wall clock time (s); x-axis: thread counts (8+8, 4+4, 8, 4, 2) and dense matrix column counts (1024, 256, 64); series: MKL CSR, Derived CSR, Derived Hybrid.]
Figure 6.7: Performance on the rajat30 matrix where the dense matrices are also memory mapped.
[Bar chart omitted: execution times on matrix HV15R, with I/O for the dense matrices. Panels: column major and row major dense matrices; y-axis: wall clock time (s); x-axis: thread counts (8+8, 4+4, 8, 4, 2) and dense matrix column counts (512, 128, 32); series: MKL CSR, Derived CSR, Derived Hybrid.]
Figure 6.8: Performance on the HV15R matrix where the dense matrices are also memory mapped.
Figure 6.9: The five sparse matrices used for the diagonal experiments.
[Bar chart omitted: execution times on matrix Chebyshev4, without I/O for the dense matrices. Panels: column major and row major dense matrices; y-axis: wall clock time (s); x-axis: thread counts (8+8, 4+4, 8, 4, 2) and dense matrix column counts (16384, 4096, 1024); series: MKL CSR, Derived CSR, Derived Hybrid.]
Figure 6.10: Performance on the Chebyshev4 matrix where the dense matrices are not memory mapped.
additional nonzeros outside of a clear diagonal band. The kim2 matrix does
have 25 diagonal bands, though, whereas the atmosmodl matrix only has 7.
The Chebyshev4 matrix has obvious diagonal bands, but only close to 31%
of the nonzeros are located in these diagonal bands; all other elements are thus
stored in the secondary CSR format when using the Derived Hybrid format.
As shown in Figure 6.10, the performance of the Derived Hybrid format does
not come as close to the MKL CSR performance as it does for the atmosmodl
matrix in the column major scenarios with 16384 columns (roughly 40%–50%
slower than MKL CSR), but the Derived Hybrid implementation still signifi-
cantly outperforms the Derived CSR implementation in many scenarios, even
with such a small fraction of the nonzeros located in diagonal bands, such as
in the 16384 columns, 4 threads scenario, where it is 28% faster. Interestingly,
with fewer columns, such as in the 4096 columns scenarios, Derived Hybrid
often outperforms both Derived CSR and MKL CSR, being twice as fast in the
8 + 8 and 8 threads scenarios. In the row major scenarios MKL CSR is often
much faster: up to two times as fast in the 8 + 8 threads scenarios. This matrix
also has various dense rows at the top
of the matrix: a triple-hybrid format may improve performance further for this
matrix.
The raefsky3 matrix performs somewhat similarly to Chebyshev4, as shown in
Figure 6.11. For this matrix, more elements fit in the diagonal bands when us-
ing the Derived Hybrid format, close to 89%. However, most of these diagonal
bands are not dense, similar to rajat30. Although there still is a measurable per-
formance improvement when using the Derived Hybrid implementation in the
column major scenarios compared to Derived CSR, it may also be more
[Bar chart omitted: execution times on matrix raefsky3, without I/O for the dense matrices. Panels: column major and row major dense matrices; y-axis: wall clock time (s); x-axis: thread counts (8+8, 4+4, 8, 4, 2) and dense matrix column counts (32768, 8192, 2048); series: MKL CSR, Derived CSR, Derived Hybrid.]
Figure 6.11: Performance on the raefsky3 matrix where the dense matrices are not memory mapped.
beneficial to not store certain diagonal bands in the diagonal data array if many
explicit zeros are required, and instead store them in the secondary CSR format.
MKL CSR tends to outperform the derived implementations significantly in
most scenarios, although the performance differences are much smaller in the
row major scenarios.
For the matrix_9 matrix the Derived Hybrid format generally performs worse
than the Derived CSR format, as shown in Figure 6.12. Many diagonal bands
are not dense in this matrix when using the Derived Hybrid format, some only
consisting of about 1/3 nonzeros, leading to a lot of explicit zeros, causing this
reduced performance. Additionally, this matrix has a range of nonzeros in the
last few columns too, outside of any diagonal band.
[Bar chart omitted: execution times on matrix matrix_9, without I/O for the dense matrices. Panels: column major and row major dense matrices; y-axis: wall clock time (s); x-axis: thread counts (8+8, 4+4, 8, 4, 2) and dense matrix column counts (8192, 2048, 512); series: MKL CSR, Derived CSR, Derived Hybrid.]
Figure 6.12: Performance on the matrix_9 matrix where the dense matrices are not memory mapped.
[Bar chart omitted: execution times on matrix atmosmodl-3, without I/O for the dense matrices. Panels: column major and row major dense matrices; y-axis: wall clock time (s); x-axis: thread counts (8+8, 4+4, 8, 4, 2) and dense matrix column counts (256, 64, 16); series: MKL CSR, Derived CSR, Derived Hybrid [31 diags], Derived Hybrid [21 diags].]
Figure 6.14: Performance on the atmosmodl-3 matrix where the dense matrices are not memory mapped.
Results First, we will look at the results of atmosmodl-3. Figure 6.14 displays
the performance of the derived implementations on this matrix. We consider
three variants of the Derived Hybrid implementation: one which considers all
31 diagonals, one only considering the 21 densest diagonals and finally one con-
sidering the 11 very dense diagonals only. Variants with fewer diagonal bands
have nonzeros stored in the secondary CSR format. This allows us to analyze
the performance effect of including sparse diagonal bands in the dense diago-
nal storage format (introducing explicit zeros as a result). Note how, when we
use column major dense matrices, the Derived Hybrid format performs best
when we remove the boundary diagonal bands from the diagonal matrix, i.e.
reducing the number of stored diagonal bands in the diagonal format from 31
to 21: these outer bands are only 1/3 full, and removing them yields between
0.6–0.8 times the original runtime. For example, in the 256 columns, 8 + 8
threads scenario this nearly matches MKL CSR performance. For such sparse
bands it is thus better to store the data in the secondary CSR format. Removing
the second-outer band as well, which here is about 2/3 full, usually leads to a
slight reduction in performance. For the row major dense matrix cases it is
similar: considering 21 diagonals leads to the best performance in most cases,
for example up to two times as fast in the 16 columns, 8 + 8 threads scenario.
Derived CSR and
MKL CSR outperform the Derived Hybrid variants here, though. Derived CSR
is 3 times as fast as Derived Hybrid in the 16 columns, 8 threads scenario, for
example.
[Bar chart omitted: execution times on matrix atmosmodl-3, with I/O for the dense matrices. Panels: column major and row major dense matrices; y-axis: wall clock time (s); x-axis: thread counts (8+8, 4+4, 8, 4, 2) and dense matrix column counts (256, 64, 16); series: MKL CSR, Derived CSR, Derived Hybrid [31 diags], Derived Hybrid [21 diags], Derived Hybrid [11 diags].]
Figure 6.15: Performance on the atmosmodl-3 matrix where the dense matrices are also memory mapped.
[Bar chart omitted: execution times on matrix atmosmodl-2, without I/O for the dense matrices. Panels: column major and row major dense matrices; y-axis: wall clock time (s); x-axis: thread counts (8+8, 4+4, 8, 4, 2) and dense matrix column counts (256, 64, 16); series: MKL CSR, Derived CSR and Derived Hybrid variants (e.g. 19 diags).]
Figure 6.16: Performance on the atmosmodl-2 matrix where the dense matrices are not memory mapped.
tion instead of 31. Considering 19 bands on the atmosmodl-2 matrix also
often places three diagonal bands next to one another, whereas considering
31 bands on the atmosmodl-3 matrix often places five diagonal bands next
to one another. This may lead to improved vectorization, as at most 4 packed
doubles can be multiplied at once on Intel Xeon E5-2630 v3 processors.
In general there are various scenarios in which the Derived Hybrid imple-
mentation outperforms Derived CSR, even though the Derived Hybrid imple-
mentation has a longer initialization time and needs to perform two passes
over the dense matrix data. In some cases that overhead can cause Derived
CSR to outperform Derived Hybrid, though. This can also happen when the
diagonals stored in Derived Hybrid are not very dense: about half of the el-
ements in a diagonal band should be nonzero for it to be worth considering
the Derived Hybrid implementation when the dense matrix data is in-memory.
When dense matrices are loaded from files, additional explicit zeros do not nega-
tively affect performance as much. In fact, avoiding the second CSR pass en-
tirely in such cases can improve performance. Generally, the Derived Hybrid
format is much less effective when the dense matrices are row major compared
to column major dense matrices.
The derived implementations in some cases outperform MKL CSR. There
is no clear pattern in which cases this occurs, but it is interesting to note that
MKL does not always perform well, especially in NUMA scenarios. In many
cases the Derived Hybrid implementation comes closer to the MKL CSR per-
formance.
Chapter 7
Conclusions
In this chapter we conclude the project and suggest some future work.
7.1 Summary
In this thesis we have described libtupl: a library to optimize tUPL programs.
It can perform data structure optimization by performing various simple trans-
formations, constructing a transformation tree in the process.
libtupl also keeps track of transformations that have to be done on the algo-
rithm's input and output to make them compatible with the automatically gener-
ated data structures, generating load and unload routines automatically.
ways have been explored to keep track of such data transformations: one in
which each transformation is described through a coroutine generator and one
in which we construct a higher-level I/O transformation tree.
We have seen that the coroutine generator approach has the disadvantage
that, for example, the input may be read multiple times unnecessarily. This
is because each generated data structure will try to fill its data completely
on its own: if multiple data structures are based on the
same input, this leads to the input being read multiple times during the load
phase, once for each data structure. The coroutine generator approach does
have a lot of control over merging multiple input streams into one, though.
I/O transformation trees do not explicitly define how the tree is actually
converted to imperative code, but we generally convert them in a way such
that the input sources push data towards the generated data structures (the
coroutine generator approach always does this the other way around, pulling
data from input sources). This avoids problems where the input may be read
multiple times and has the additional advantage that I/O transformation trees
are much easier to transform further, yielding more optimized load and unload
routines.
Additionally, we have developed Tython, an extension of Python 3, to serve
as a front-end for libtupl. Tython code can contain tUPL code blocks in addi-
tion to standard Python code. These tUPL code blocks can be optimized by the
Tython compiler using libtupl. Tython outputs compiled Python files which
automatically invoke the optimized implementations, which can then be exe-
cuted using the standard CPython interpreter or imported from any Python or
Tython script. Any Python project can thus easily integrate tUPL code using
Tython.
Finally, to illustrate the effectiveness of the data structures that are auto-
matically derived using the presented framework, we have performed various
experiments on derived SpMM implementations using an extended version of
tUPL: also considering parallelization and runtime I/O. We have seen that De-
rived Hybrid implementations (hybrids of CSR and compressed diagonal stor-
age) of sparse matrix-dense matrix multiplication, derived from a simple input
specification, can, for various sparse matrix structures, significantly improve
performance over the simpler Derived CSR implementations, sometimes being
twice as fast. Such Derived Hybrid implementations can also be very competi-
tive with state-of-the-art hand-optimized implementations, in certain scenarios
being about 50% faster than MKL CSR. These differences are mostly noticeable
with column major dense input matrices; with row major dense input matrices
Derived Hybrid implementations are generally slower than Derived CSR and
MKL CSR. Performance differences tend to be more apparent in NUMA sce-
narios. Additionally, the Derived Hybrid format performs best when the stored
diagonal bands primarily consist of nonzeros: if more than about half of the
values in a band consist of zeros, it is better to store that band in the secondary
CSR format instead.
as fill-in would be valuable to research.
Another interesting case could be to derive efficient indexing data struc-
tures to perform certain data queries quickly. For example, we could specify
a query that finds all distinct persons that are at least 30 years old that own a
boat that has a value of at least 500 using the tUPL code in Listing 7.1. This is
similar to the work done by Dr. K.F.D. Rietveld in [10], where he transformed
SQL queries into specifications for the forelem framework.
1 forelem boat in Boats where boat.value >= 500:
2     forelem person in Persons where person.age >= 30 and boat.owner_id == person.id:
3         ExpensiveBoatOwners.add(person)  # ExpensiveBoatOwners is a tuplespace (set) too
Listing 7.1: Data query example in tUPL.
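As a plain Python sketch of the kind of indexing data structure that could be derived for such a query (not part of the framework; all names are illustrative), an index on the person id replaces the inner scan over Persons with a lookup:

# Sketch: a derived index (owner id -> person) avoids scanning Persons for
# every boat, analogous to the indexing data structures that could be derived.
persons = [{"id": 1, "age": 35}, {"id": 2, "age": 25}]
boats = [{"value": 700, "owner_id": 1}, {"value": 100, "owner_id": 2}]

by_id = {p["id"]: p for p in persons}            # derived indexing structure
expensive_boat_owners = set()
for boat in boats:
    if boat["value"] >= 500:
        person = by_id.get(boat["owner_id"])
        if person is not None and person["age"] >= 30:
            expensive_boat_owners.add(person["id"])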
compiled like a generator (the child node needs to request this generator to pro-
duce data) or traditionally (outputting data/invoking children as data is pushed to
the node).
Whenever the boundaries of data structures are not known in advance a lot
of reallocations may be required as new data is loaded, like when concretizing
into a dense array. These reallocations can be very expensive. Additional input
passes could be generated to try to determine data structure index boundaries
before allocating these structures if they are not known beforehand, which can
be much cheaper than reallocating the data every time to accommodate
larger sizes. If the input data is ordered in some way, data structure bound-
aries may be able to be determined much faster too. Note that more than one
additional pass may be required to avoid reallocations. For example, CPNZ
may be a (max (t.row) + 1) × max (CPNZ_len) dense array. Here the data struc-
ture CPNZ_len is a one-dimensional array with max (t.row) + 1 values. Using
an initial pass, only max (t.row) can be determined without reallocations: only
then do we know the dimensions of CPNZ_len. The second pass allows filling
CPNZ_len and computing max (CPNZ_len) as well without reallocations. A
third pass can then finally fill CPNZ without reallocations, now that the dimen-
sions of that two-dimensional dense array are known.
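A rough Python sketch of this three-pass idea for the CPNZ example above, assuming the tuple stream can be re-read from the start (all names are illustrative):

# Sketch: pass 1 determines max(t.row), pass 2 fills CPNZ_len (and thus
# max(CPNZ_len)), and only then is the dense CPNZ array allocated and filled,
# avoiding any reallocation.
def build_without_reallocations(read_tuples):
    # Pass 1: determine the row dimension.
    max_row = max(t[0] for t in read_tuples())
    # Pass 2: count tuples per row to size the second dimension.
    cpnz_len = [0] * (max_row + 1)
    for row, col, val in read_tuples():
        cpnz_len[row] += 1
    width = max(cpnz_len)
    # Pass 3: allocate the (max_row+1) x max(CPNZ_len) array and fill it.
    cpnz = [[None] * width for _ in range(max_row + 1)]
    offset = [0] * (max_row + 1)
    for row, col, val in read_tuples():
        cpnz[row][offset[row]] = (col, val)
        offset[row] += 1
    return cpnz, cpnz_len

tuples = [(0, 1, 2.0), (0, 3, 4.0), (2, 2, 1.0)]
cpnz, cpnz_len = build_without_reallocations(lambda: iter(tuples))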
In Section 5.3 we looked at runtime I/O and coined the “View” node. This
concept can be extended to support random access iterators instead of just a
pointer to raw memory. Such a random access iterator can be indexed directly
to obtain a random value, just like accessing a random value in raw memory
using the “View” node, but it can also be iterated through sequentially, like an
InStream can. This allows a compilation flow where runtime I/O is consid-
ered, i.e. where each data structure access is directly mapped to this random
access iterator, but also still allows generating optimized data structures.
An alternative approach for runtime I/O is embedding the streaming op-
erations (i.e. iterating through InStream or emitting data to OutStream) right
into the algorithm itself, also eliminating the need for a generated concretized
data structure. This would also allow, for example, reading data from a file
without requiring a full random access iterator (or a pointer to raw memory)
during the algorithm itself.
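A minimal sketch of such embedded streaming for a simple per-row reduction, where the input stream is consumed directly inside the algorithm and results are emitted through a callback, could look as follows in Python (purely illustrative, not generated code):

# Sketch: the InStream is iterated sequentially inside the algorithm itself and
# output tuples are emitted to the OutStream (modelled here as a callback), so
# no concretized input data structure is built first.
def row_sums_streaming(in_stream, out_stream):
    sums = {}
    for row, col, val in in_stream:
        sums[row] = sums.get(row, 0.0) + val
    for row, total in sorted(sums.items()):
        out_stream(row, total)

collected = []
row_sums_streaming(iter([(0, 1, 2.0), (0, 2, 3.0), (1, 0, 1.0)]),
                   lambda r, s: collected.append((r, s)))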
Bibliography
[12] K. F. D. Rietveld and H. A. G. Wijshoff. “Optimizing sparse matrix com-
putations through compiler-assisted programming.” In: Proceedings of the
ACM International Conference on Computing Frontiers. ACM. 2016, pp. 100–
109.
[13] K. F. D. Rietveld and H. A. G. Wijshoff. “Towards a new tuple-based
programming paradigm for expressing and optimizing irregular parallel
computations.” In: Proceedings of the 11th ACM Conference on Computing
Frontiers. ACM. 2014, p. 16.
[14] E. Saule, K. Kaya, and Ü. V. Çatalyürek. “Performance Evaluation of
Sparse Matrix Multiplication Kernels on Intel Xeon Phi.” In: International
Conference on Parallel Processing and Applied Mathematics. Springer. 2013,
pp. 559–570.
[15] B. van Strien, K. F. D. Rietveld, and H. A. G. Wijshoff. “Deriving highly
efficient implementations of parallel pagerank.” In: Parallel Processing
Workshops (ICPPW), 2017 46th International Conference on. IEEE. 2017, pp. 95–
102.
[16] R. Vuduc, J. W. Demmel, and K. A. Yelick. “OSKI: A library of automati-
cally tuned sparse matrix kernels.” In: Journal of Physics: Conference Series.
Vol. 16. 1. IOP Publishing. 2005, p. 521.
[17] D. Zheng et al. “Semi-external memory sparse matrix multiplication for
billion-node graphs.” In: IEEE Transactions on Parallel and Distributed Sys-
tems 28.5 (2017), pp. 1470–1483.
[18] ast — Abstract Syntax Trees — Python 3.7.2 documentation. https://docs.python.org/3/library/ast.html. Accessed: 2019-01-04.
Appendix A
Transformation passes
Various passes have been implemented that affect the algorithm's code spec-
ification. This appendix contains a more detailed description of these imple-
mented passes.
A.1 EncapsulationPass
If we iterate through a tuplespace's possible field values, like “forelem row
in NZ.row”, the EncapsulationPass can try to substitute this with a range
iterator, like “forelem row in [0, max(NZ.row)]”. The advantage of such
a range iterator is that it can in the end be concretized to a plain for-loop and
we do not have to materialize the sequence of possible field values (i.e. NZ.row)
at all if we also apply the AggregateReservoirPass, which we will see later. This
is only allowed if the iterator value (i.e. row) is solely used in equals conditions
of loops over the same tuplespace (i.e. NZ) on the same field (i.e. nz.row ==
row) or inside of loop bodies of such loops. This can, for example, transform
the code in Listing A.1 into the code in Listing A.2.
1 forelem row in NZ.row:
2     forelem nz in NZ where nz.row == row:
3         C[row] += A[row, nz.col] * B[nz.col]
Listing A.1: Example scenario on which the EncapsulationPass can be applied.
1 forelem row in NZ.row:
2     C[row] = 0
3     forelem nz in NZ where nz.row == row:
4         C[row] += A[row, nz.col] * B[nz.col]
Listing A.3: A situation in which the EncapsulationPass cannot be applied on the outer loop.
A.2 AggregateReservoirPass
The AggregateReservoirPass can be used to transform aggregations of tuplespace
fields into a single scalar value. This can allow us to, for example, transform
the code in Listing A.4 into Listing A.5.
1 forelem row in [0, max(NZ.row)]:
2     forelem nz in NZ where nz.row == row:
3         C[row] += A[row, nz.col] * B[nz.col]
Listing A.4: Example scenario on which the AggregateReservoirPass can be applied.
A.3 LocalizationPass
The LocalizationPass can copy the values of a shared space into a new field
of all tuples in a tuplespace (i.e. localize the values in the shared space). This
typically eliminates the shared space completely and the values of the shared
space are then retrieved from tuple fields directly: the shared space is prac-
tically merged into the tuplespace. The only requirement is that this shared
space is solely indexed using tuple fields from tuples in the tuplespace into
which we are going to merge the shared space. For example, if A is accessed
through A[nz.row, nz.col] only, with nz being tuples from NZ, then we can
merge safely. If A[row, nz.col] were used to access the shared space, sub-
stituting A[row, nz.col] with a merged value would not necessarily be safe.
Additionally, we typically do not merge shared spaces into the tuplespace if
the shared space is writable to avoid complications.
The LocalizationPass can be used to, for example, merge shared space A
into tuplespace NZ in Listing A.6. This will produce the code in Listing A.7. Tu-
plespace NZ now thus contains tuples ⟨row, col, merged_value⟩, where merged_value
= A[row, col] for each tuple. We call this [row, col] the query of the Lo-
calizationPass. In other words, for each tuple t in the tuplespace NZ we look up
the value in the shared space that is to-be-merged using the query. That value
thus is A[t.row, t.col] here.
1 forelem row in [0, max(NZ.row)]:
2     forelem nz in NZ where nz.row == row:
3         C[nz.row] += A[nz.row, nz.col] * B[nz.col]
Listing A.6: Example scenario on which the LocalizationPass can be applied.
Note that applying the LocalizationPass does not necessarily eliminate the
shared space entirely. For example, the specification in Listing A.8 accesses
shared space A in two locations, but with different address functions. We can,
in this example, only replace shared space accesses with the merged value in
the tuple if the address function output matches the merge query. After a
single application of the LocalizationPass with query [row, col], as shown
in Listing A.9, the tuple has been changed to ⟨row, col, merged_value⟩, where
merged_value = A[row, col].
1 forelem row in [0, max(NZ.row)]:
2     forelem nz in NZ where nz.row == row:
3         C[nz.row] += A[nz.row, nz.col] + A[nz.col, nz.row]
Listing A.8: A scenario in which a single application of the LocalizationPass does not eliminate the shared space.
1 forelem row in [0, max(NZ.row)]:
2     forelem nz in NZ where nz.row == row:
3         C[nz.row] += nz.merged_value + nz.merged_value_2
Listing A.10: Two applications of the LocalizationPass do eliminate the shared space entirely.
A.4 QueryForwardSubstitutionPass
The QueryForwardSubstitutionPass tries to perform forward substitution based
on the query (i.e. condition) of a loop. For example, in Listing A.11, we have
the query where nz.row == row on tuplespace NZ. Clearly, nz.row can be
substituted with just row in the inner loop body, which leads to Listing A.12.
1 forelem row in [0, max(NZ.row)]:
2     forelem nz in NZ where nz.row == row:
3         C[nz.row] += A[nz.row, nz.col] + A[nz.col, nz.row]
Listing A.11: A scenario where applying the QueryForwardSubstitutionPass is beneficial.
A.5 ReservoirMaterializationPass
The ReservoirMaterializationPass can transform a tuplespace into a materialized
tuplespace. In other words, we now assign storage indices to each tuple in the
tuplespace. We do not define an actual data structure in which the tuplespace
data is truly stored yet, though.
This entails creating a new subscriptable symbol in which the tuple data can
be looked up (virtually through the assigned storage indices). We will denote
such derived symbols by prefixing the name with a P. Their type is always a
Subscriptable. Looping structures and tuple accesses are then transformed
to use this subscriptable instead of the non-materialized tuplespace.
We will illustrate what this pass truly does using an example. Listing A.13
can be transformed into Listing A.14 using the ReservoirMaterializationPass,
materializing the NZ tuplespace. The subscriptable PNZ is created in which
data can be looked up. The equals query from the original forelem loop (i.e.
NZ where nz.row == row) is here moved into the first dimension of the PNZ
subscriptable. Additionally, the subscriptable gets an additional dimension in
which we pass the new iterator value k: an offset. Generally speaking, the
number of equals queries the loop has, plus one dimension for the offset value
k, determines the total number of index dimensions of the resulting subscriptable.
We also extend each tuple in the tuplespace with this new offset value k.
1 forelem row in [0, max(NZ.row)]:
2     forelem nz in NZ where nz.row == row:
3         C[row] += A[row, nz.col] * B[nz.col]
Listing A.13: Example scenario on which the ReservoirMaterializationPass can be applied.
If n tuples match the query NZ where nz.row == row, then each of those
n tuples has some (typically distinct) offset value k so that the tuple can be re-
trieved through PNZ[row, k]. The exact offset value k assigned to each tuple
depends on N*: a tuplespace that will iterate the offset values (potentially dif-
ferent offset values depending on the outer loop). What N* (and indirectly PNZ)
will look like is determined through the NStarMaterializationPass.
A.6 NStarMaterializationPass
The NStarMaterializationPass can be seen as a continuation of the Reservoir-
MaterializationPass. The ReservoirMaterializationPass produced an N* reser-
voir containing the offsets at which each tuple is stored. The NStarMaterializa-
tionPass materializes this reservoir and determines what this reservoir actually
looks like.
Materializing the N* reservoir from Listing A.14 could yield the code in
Listing A.15. Here we assign each tuple an offset value between 0 and PNZ_len
[row]-1 and the inner loop iterates a varying number of times depending on
row. PNZ_len[row] then contains an integer indicating the number of tuples
that originally matched the query where t.row == row for that row. This
technique of materializing the N* reservoir can always be applied. Note how
assigning the offset values of each tuple will define what PNZ will actually look
like, so PNZ also becomes PNZ' (they are practically the same in the code, but
we know more information about the PNZ' subscriptable).
1 forelem row in [0, max(NZ.row)]:
2     forelem k in [0, PNZ_len[row]-1]:
3         C[row] += A[row, PNZ'[row, k].col] * B[PNZ'[row, k].col]
Listing A.15: Possible result code after applying the NStarMaterializationPass.
A.7 SharedSpaceMaterializationPass
The SharedSpaceMaterializationPass can materialize a shared space. Similarly
to the ReservoirMaterializationPass we now generate a subscriptable symbol
for the shared space. The indices for this subscriptable symbol can directly be
derived from the address function used to access the shared space.
In the case of Listing A.17 we can apply the SharedSpaceMaterialization-
Pass three times to materialize all three shared spaces. This transforms the
specification into Listing A.18. Although this visually does not change much
because we use shorthand notations for shared space indexing, omitting the
explicit address function, the address function is now converted into code to
index the subscriptables directly: no address function exists anymore (i.e. the
type of A is a SharedSpace while that of PA is a Subscriptable).
1 forelem row in [0, max(NZ.row)]:
2     forelem k in [0, PNZ_len[row]-1]:
3         C[row] += A[row, PNZ[row, k].col] * B[PNZ[row, k].col]
Listing A.17: Example scenario on which the SharedSpaceMaterializationPass can be applied.
A.8 HorizontalIterationSpaceReductionPass
If subscriptables contain tuples with tuple fields which are not accessed any-
where, the HorizontalIterationSpaceReductionPass can eliminate that field from
the tuples, reducing the width of each tuple. For example, in Listing A.19, PNZ
is the direct result of reservoir materialization and contains tuples ⟨row, col, k⟩.
We only read the field col from tuples in this subscriptable, so the Horizontal-
IterationSpaceReductionPass will create a new subscriptable PNZ' in which we
only store tuples with a single field: ⟨col ⟩.
1 forelem row in [0, max(NZ.row)]:
2     forelem k in [0, PNZ_len[row]-1]:
3         PC[row] += PA[row, PNZ[row, k].col] * PB[PNZ[row, k].col]
Listing A.19: Example scenario on which the HorizontalIterationSpaceReductionPass can be applied.
A.9 DelocalizationPass
The DelocalizationPass splits accesses to a subscriptable symbol containing tu-
ples off into a duplicate of that subscriptable symbol for a certain set of tuple
fields. For example, Listing A.20 can be transformed into Listing A.21 using a
DelocalizationPass, splitting accesses on tuple fields {col } off into PNZ_deloc.
Typically the HorizontalIterationSpaceReductionPass is executed afterwards to
try to reduce the tuples contained in both subscriptables to smaller tuples so that
the tuple data is not duplicated.
1 forelem row in [0, max(NZ.row)]:
2     forelem k in [0, PNZ_len[row]-1]:
3         C[row] += PNZ[row, k].merged_value * B[PNZ[row, k].col]
Listing A.20: Example scenario on which the DelocalizationPass can be applied.
A.10 StructureJaggedSplittingPass
The StructureJaggedSplittingPass is similar to the DelocalizationPass, but has
more capabilities. It can transform expressions in the form of Subscriptable[
_, _, _].field into, for example, Subscriptable[_, _].fieldgroup[_].
field, where fieldgroup is an array of tuple types containing at least field,
but potentially additional tuple fields (i.e. a jagged structure). Aside from the
expression the pass is applied on, the offset at which we split the subscriptable
(in the previous example 2) and the previously described field groups are param-
eters of this pass. The offset must be between 0 and the number of dimensions
of the subscriptable (exclusive upper bound).
We will illustrate the exact behavior through an example. Listing A.22
shows a possible scenario. We apply the StructureJaggedSplittingPass on ex-
pression PNZ with offset 1 and groups [⟨row⟩, ⟨col, val ⟩]. This leads to the code
shown in Listing A.23. Note how we now have a 1D subscriptable PNZ' (in-
dexed with row), with at each location a tuple ⟨__row, __col_val ⟩. At each tuple
location __row we have another 1D subscriptable storing tuples ⟨row⟩, where
row is a scalar value. Similarly, at each __col_val a 1D subscriptable is stored
containing tuples ⟨col, val ⟩.
1 forelem row in [0, max(NZ.row)]:
2     forelem k in [0, PNZ_len[row]-1]:
3         PC[PNZ[row, k].row] += PNZ[row, k].val * PB[PNZ[row, k].col]
Listing A.22: Example scenario on which the StructureJaggedSplittingPass can be applied.
is now a symbol PNZ' containing a struct of two elements rather than two en-
tirely disjoint symbols, as shown in Listing A.24.
1 forelem row in [0, max(NZ.row)]:
2     forelem k in [0, PNZ_len[row]-1]:
3         PC[PNZ'.__row[row, k].row] += PNZ'.__col_val[row, k].val * PB[PNZ'.__col_val[row, k].col]
Listing A.24: Possible result after applying the StructureJaggedSplittingPass on PNZ with offset 0.
Note that it can be beneficial to apply this pass multiple times with dif-
ferent parameters to construct more advanced data structures. For example,
we can apply the pass on PNZ'.__col_val[_, _] with offset 1 with groups
[⟨row⟩, ⟨col ⟩] on the code in Listing A.24. The result is shown in Listing A.25.
1 forelem row in [0, max(NZ.row)]:
2     forelem k in [0, PNZ_len[row]-1]:
3         PC[PNZ''.__row[row, k].row] += PNZ''.__col_val[row].__val[k].val * PB[PNZ''.__col_val[row].__col[k].col]
Listing A.25: Possible result after applying the StructureJaggedSplittingPass on PNZ'.__col_val[_, _] with offset 1.
It is also reasonable to just use a single group containing all tuple fields
whenever applying the pass at a nonzero offset. For example, applying the
pass on Listing A.22 again with offset 1, but groups [⟨row, col, val ⟩] will yield
the code in Listing A.26. Here we change the 2D subscriptable to a 1D sub-
scriptable containing single (separate) 1D subscriptables at each location, with-
out performing any additional grouping. In the end this could be concretized
as an array of arrays (regular jagged array) in which the tuples are stored.
1 forelem row in [0, max(NZ.row)]:
2     forelem k in [0, PNZ_len[row]-1]:
3         PC[PNZ'[row].__row_col_val[k].row] += PNZ'[row].__row_col_val[k].val * PB[PNZ'[row].__row_col_val[k].col]
Listing A.26: Possible result after applying the StructureJaggedSplittingPass on PNZ with offset 1, but using only a single group.
A.11 ConcretizationPass
The ConcretizationPass will concretize all subscriptables. A concretized sub-
scriptable, whose name we typically prefix with a C, will have a defined con-
cretization, such as an array or linked list. Nested data structures, such as an
X-dimensional array containing linked lists, can be achieved through applying
the StructureJaggedSplittingPass before the ConcretizationPass. For example,
for the code in Listing A.26, PNZ' could be concretized as a linked list while
PNZ'[_].__row_col_val could be concretized as an array. libtupl currently
always concretizes everything as an array.
Additionally, the ConcretizationPass will convert all tUPL-based looping
structures to more traditional looping structures. For example, range-based
forelem loops can be converted to a for-loop.
Let us concretize the code from Listing A.24. This will yield the code in
Listing A.27. The exact data structures of CPC, CPNZ' and CPB are not visible
in this code, but tUPL concretizes these data structures to dense arrays. CPC
and CPB thus become 1D arrays of doubles. CPNZ' is then concretized to a
1D array containing a tuple of two 1D arrays at each location (i.e. CPNZ'[_].
__row and CPNZ'[_].__col_val are concretized to 1D arrays containing 1D
and 2D tuples respectively).
1 for row = 0, row <= max(NZ.row), row += 1:
2     for k = 0, k <= PNZ_len[row]-1, k += 1:
3         CPC[CPNZ'[row].__row[k].row] += CPNZ'[row].__col_val[k].val * CPB[CPNZ'[row].__col_val[k].col]
Listing A.27: Possible result after applying the ConcretizationPass.
Appendix B
B.1 HorizontalIterationSpaceReductionPass
Horizontal iteration space reduction simply reduces the width of tuples inside
of a subscriptable structure. The produced generators look like the example
in Listing B.1. As this transformation is performed on subscriptables, a key
and a value are passed in. This transformation does not modify the key at all, but
constructs a new, smaller NTuple for each input value.
1 λ(input: Generator[Tuple[int, int], NTuple[a: int, b: int, c: int]])
        -> Generator[Tuple[int, int], NTuple[a: int, b: int]]:
2     while input:
3         key, value = input.next()
4         yield key, (value.a, value.b)
Listing B.1: A generator adjusting the data in a subscriptable after horizontal iteration space reduction, removing the c field from the value tuple.
B.2 LocalizationPass
The LocalizationPass relates data from a shared space to tuples in a tuplespace.
Multiple variations for the LocalizationPass are possible, each relating the data
in different ways.
A naive approach is to simply write each shared space value to an interme-
diate data structure that allows looking up values by key, practically convert-
ing this stream of data to a temporarily concretized data structure. This always
works but can be highly inefficient, especially if the chosen utility data structure
is unsuitable for the keys of the shared space. Listing B.2 shows such a
possible implementation.
1 λ(tr: Generator[NTuple[a: int, b: int]], ss: Generator[Tuple[int, int], double])
        -> Generator[NTuple[a: int, b: int, merged_val: double]]:
2     # convert shared space to some intermediate lookup table
3     lut = {}
4     while ss:
5         key, value = ss.next()
6         lut[key] = value
7
8     # finally stream through the reservoir: appending the merged tuple value
9     while tr:
10        tuple = tr.next()
11        yield (*tuple, lut[tuple.a, tuple.b])
Listing B.2: A generator merging the data in a shared space into a tuplespace.
Often it is possible to merge both generators together more efficiently. As
this generator can control on its own when to advance each generator, it could
decide to advance both at the same time. This only works well when the
value to be merged from the shared space is at the same stream offset as the
stream offset of the tuple in the reservoir stream into which the value should
be merged. Listing B.3 shows such a generator.
1 λ(tr: Generator[NTuple[a: int, b: int]], ss: Generator[Tuple[int, int], double])
        -> Generator[NTuple[a: int, b: int, merged_val: double]]:
2     # stream through the reservoir: appending the merged tuple value
3     while tr:
4         tuple = tr.next()
5         key, value = ss.next()
6         assert(tuple.a == key[0] and tuple.b == key[1])  # either prevent this from happening using an additional analysis pass or fall back to another approach once this occurs
7         yield (*tuple, value)
Listing B.3: A generator merging the data in a shared space into a tuplespace without an intermediate lookup table.
each tuple in the tuplereservoir to convert it to a subscriptable, usually starting
from 0. Usually we also produce the associated _len subscriptable, indicating
how many tuples match a certain query. Let us assume that is the case in the
following example too.
For the materialized reservoir the generator shown in Listing B.4 generates
the key-value data for the subscriptable. The associated _len subscriptable is
generated in a highly similar fashion, as shown in Listing B.5. Note how the
data in the _len reservoir cannot be yielded until all tuples have been counted,
as the input order of tuples is generally undefined. Additionally note that we
need two separate generators for the two subscriptables. Both of them are
very similar: they both have to count the tuples and read the exact same in-
put stream. As a result, production of the _len subscriptable requires reading
the input stream twice as the input generator is practically duplicated. Using
the generator approach it is not possible to merge these two generators and
generate data for both data structures at once, as the control lies at the bottom.
1 λ(tr: Generator[NTuple[a: int, b: int]]) -> Generator[Tuple[int, int], NTuple[a: int, b: int]]:
2     counts = {}  # some sort of lookup table, default value of 0 if key does not yet exist
3     while tr:
4         tuple = tr.next()
5         query = (tuple.a,)
6         yield (*query, counts[query]), (*tuple, counts[query])
7         counts[query] += 1
Listing B.4: A generator materializing a tuplereservoir.
B.4 DelocalizationPass
Although the DelocalizationPass does not truly change data, it does basically
duplicate a subscriptable, aside from changing the algorithm in a sensible way.
The subscriptable being delocalized had a Generator generating key-value
pairs for this subscriptable. For the delocalized subscriptable this generator is
cloned and will in the end be executed twice to generate data for both subscript-
ables. A delocalization thus also leads to reading the input multiple times.
B.5 ConcretizationPass
The ConcretizationPass actually writes data to the desired data structures. For
example, the function in Listing B.6 may be produced to concretize subscript-
able PA into CPA.
1 λ(tr: Generator[Tuple[int, int], NTuple[a: int, b: int]]):
2     while tr:
3         key, value = tr.next()
4         CPA.ensure_writable(*key)
5         CPA[*key] = value
Listing B.6: A function concretizing a subscriptable.
Note how this function will consume a generator completely and write the
data to a concrete data structure. The implementation of the concretized data
structure is part of the libtupl runtime, which implements things like subscript-
ing a data structure and the ensure_writable method.
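For illustration, a growable dense structure with an ensure_writable-style method could look roughly as follows in Python (a sketch only; the actual libtupl runtime and its interface may differ):

# Sketch of a runtime data structure supporting ensure_writable: the backing
# 2D array is grown on demand so that the given key becomes addressable.
class DenseSubscriptable2D:
    def __init__(self):
        self.data = []                   # list of rows

    def ensure_writable(self, i, j):
        while len(self.data) <= i:       # grow the number of rows
            self.data.append([])
        row = self.data[i]
        while len(row) <= j:             # grow the row width
            row.append(0.0)

    def __setitem__(self, key, value):
        i, j = key
        self.data[i][j] = value

    def __getitem__(self, key):
        i, j = key
        return self.data[i][j]

cpa = DenseSubscriptable2D()
cpa.ensure_writable(2, 3)
cpa[2, 3] = 1.5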
For the output a generator is produced instead, enumerating all key-value
pairs in the concrete data structure, as shown in Listing B.7.
1 λ() -> Generator[Tuple[int, int], NTuple[a: int, b: int]]:
2     for dim0 in range(0, CPA.bounds<0>()):
3         for dim1 in range(0, CPA.bounds<1>()):
4             yield (dim0, dim1), CPA[dim0, dim1]
Listing B.7: A generator deconcretizing a concretized subscriptable. Here we assume the subscriptable is concretized as some dense data structure.
Appendix C
C.1 HorizontalIterationSpaceReductionPass
Whenever we perform horizontal iteration space reduction on a subscriptable,
we reduce the width of the value tuple and leave the key unchanged. Fig-
ure C.1 shows a “Transform KeyValue” node reducing the tuple's width. Such
an I/O node can arbitrarily transform a key and value based on the expressions
it has been parameterized with. Here, the output key is equal to the input
key and the output value becomes ⟨t.col ⟩ where t is the input value. Because
we performed the HorizontalIterationSpaceReductionPass on subscriptable A
(whose associated I/O node is not shown here, but could be located above the
“Transform KeyValue” I/O node in Figure C.1) we tag this node with A' as a
result.
[Figure C.1: a "Transform KeyValue" I/O node taking key-value input (k, v) and producing key: k, value: ⟨t.col⟩.]
C.2 AggregateReservoirPass
We can aggregate values from a reservoir of tuples using the AggregateReser-
voirPass. An “Aggregate” I/O node is then used to actually perform this ag-
gregation. Figure C.2 illustrates this. For “Aggregate” nodes only the function
that should be used for the aggregation (such as min, max, + and ×) and the tu-
ple field to be aggregated are parameterizable. Typically this node is compiled
to something that completely processes the input passed to it and, once all
input has been processed, outputs a scalar value. This pass also outputs a
zero-dimensional key (the output of this I/O node is always stored, for which
some sort of key is required later down the line; alternatively, a “Tuple to
KeyValue” node could also be used to create this zero-dimensional key).
Figure C.2: An “Aggregate” I/O node consuming a stream of ⟨row, col⟩ tuples t and producing key: [], value: max(t.row).
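As an illustration, the following sketch (plain Python; not part of libtupl) consumes a tuple stream, folds one field with a configurable function as in Figure C.2, and emits a single key-value pair with a zero-dimensional (empty) key once the stream is exhausted.

def aggregate(stream, field, func, initial):
    # Sketch of an "Aggregate" node: fold one field of every incoming
    # tuple and output a single value under a zero-dimensional key [].
    accumulator = initial
    for t in stream:
        accumulator = func(accumulator, t[field])
    yield (), accumulator   # empty key, scalar value

# max(t.row) over a small reservoir of <row, col> tuples:
reservoir = [{"row": 3, "col": 1}, {"row": 7, "col": 0}, {"row": 5, "col": 2}]
print(list(aggregate(reservoir, "row", max, 0)))   # [((), 7)]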
C.3 LocalizationPass
The LocalizationPass will use a “Merge” node to merge the value of a shared space into a new tuple field for all tuples sent to this node. Figure C.3 illustrates this. “Merge” nodes are parameterized by an expression that indicates which shared space value should be merged into each tuple, in this example [t.row, t.col] for each tuple t. Although the “Merge” node in Figure C.3 is visualized using an intermediate lookup table X, the implementation of a “Merge” node does not necessarily have to use such a lookup table. Sometimes “Merge” nodes can even be eliminated entirely using the MergeEliminationPass, as we will see later.
Figure C.3: A “Merge” I/O node with a tuple input t and a shared space input k, v → X[k] = v, producing ⟨∗t, X[t.row, t.col]⟩.
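The behavior of such a “Merge” node can be sketched as follows (plain Python; the lookup table X is made explicit here even though an actual implementation may avoid it, the appended field name "val" is a hypothetical choice, and the shared space is absorbed up front as a simplification): every key-value pair from the shared space input is recorded in X, and each incoming tuple is extended with the value X[t.row, t.col].

def merge(tuple_stream, keyvalue_stream):
    # Sketch of a "Merge" node: first absorb the shared space as a lookup
    # table X, then append X[t.row, t.col] to every incoming tuple.
    X = {}
    for k, v in keyvalue_stream:          # shared space input: X[k] = v
        X[k] = v
    for t in tuple_stream:                # tuple input
        yield {**t, "val": X[(t["row"], t["col"])]}

tuples = [{"row": 0, "col": 1}, {"row": 2, "col": 0}]
shared = [((0, 1), 3.5), ((2, 0), 1.25)]
print(list(merge(tuples, shared)))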
ple, [row] is the query. It outputs two things: first, a stream of tuples that mirrors the input stream, but with the generated offset value appended to each tuple. The “Tuple to KeyValue” node then transforms this into the desired key-value stream for the subscriptable PNZ'.
The second “Count Tuples” output is a key-value stream which indicates the number of tuples matching each unique query value. This output represents subscriptable PNZ_len.
Figure C.4: A “Count Tuples” I/O node consuming ⟨row, col⟩ tuples t and producing ⟨∗t, Z[t.row]++⟩ as well as the counts ∗Z, followed by a “Tuple to KeyValue” node parameterized with key: [t.row, t.k], value: t.
Note how this “Count Tuples” I/O node can compute the two outputs si-
multaneously (depending on the actual implementation, which we discuss in
Section D.6). This is unlike the generator I/O approach, for which two disjoint
generators are required, each computing one output at a time.
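A minimal sketch of such a “Count Tuples” node is shown below (plain Python; the callback names emit_tuple and emit_count are hypothetical). It assigns each tuple an offset k within its query value and, at the same time, maintains the per-query counts that form the second output.

from collections import defaultdict

def count_tuples(stream, emit_tuple, emit_count):
    # Sketch of a "Count Tuples" node with two outputs: the input tuples
    # extended with an offset k, and the number of tuples per query value.
    counts = defaultdict(int)
    for t in stream:
        query = (t["row"],)                    # [row] is the query
        emit_tuple({**t, "k": counts[query]})  # output 0: <*t, k>
        counts[query] += 1
    for query, n in counts.items():
        emit_count(query, n)                   # output 1: query -> count

out0, out1 = [], []
count_tuples([{"row": 0, "col": 2}, {"row": 0, "col": 5}, {"row": 1, "col": 1}],
             out0.append, lambda q, n: out1.append((q, n)))
print(out0)   # tuples with k = 0, 1, 0
print(out1)   # [((0,), 2), ((1,), 1)]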
C.5 DelocalizationPass
The DelocalizationPass on its own only duplicates a subscriptable symbol and tries to let different pieces of code use different data structures. Thus, no actual data transformations need to take place. A HorizontalIterationSpaceReductionPass is usually performed after a DelocalizationPass to create two simpler data structures, as visualized in Figure C.5.
Figure C.5: An I/O node tagged A, A_deloc ([2D], ⟨row, col⟩) feeding two “Transform KeyValue” nodes, one parameterized with key: k, value: ⟨t.row⟩ and one with key: k, value: ⟨t.col⟩.
C.6 StructureJaggedSplittingPass
Whenever we perform a StructureJaggedSplittingPass on a subscriptable we split the index of the subscriptable into two pieces. For example, jagging A[_, _, _] at offset 2 will yield A[_, _][_]. Additionally, tuple fields are regrouped. Together, this can transform A[_, _, _].a + A[_, _, _].b, where A contains tuples ⟨a, b⟩, into, for example, A[_, _].__a[_].a + A[_, _].__b[_].b (a and b split apart) or A[_, _].__a_b[_].a + A[_, _].__a_b[_].b (a and b grouped together). For the index splitting we introduce a “Jag” node and for the field grouping we use the previously introduced “Transform KeyValue” nodes.
Let us look at the simple example in Listing C.1. Applying the StructureJaggedSplittingPass with offset 1 and groups [⟨col, val⟩, ⟨row⟩] on this code will produce the code in Listing C.2. This pass also produces a single “Jag” I/O node and one or more “Transform KeyValue” I/O nodes, as shown in Figure C.6.
forelem row in [0, max(NZ.row)]:
    forelem k in [0, PNZ_len[row]-1]:
        PC[PNZ[row, k].row] += PNZ[row, k].val * PB[PNZ[row, k].col]
Listing C.1: Example scenario on which the StructureJaggedSplittingPass can be applied.
Figure C.6: An I/O node tagged PNZ ([2D], ⟨row, col, val⟩) feeding a “Jag” node (key: k[offset:], value: t) tagged PNZ'[_], whose [1D] output feeds two “Transform KeyValue” nodes, one parameterized with key: k, value: ⟨t.row⟩ and one with key: k, value: ⟨t.col, t.val⟩.
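The key manipulation performed by the “Jag” node in Figure C.6 can be illustrated as follows (plain Python; nested dictionaries stand in for the actual jagged data structures, and the storage argument is a simplification of the tagging mechanism): the key is split at the given offset, the outer part selects and, if necessary, allocates an inner structure, and only the remaining inner part of the key is forwarded together with the unchanged value.

def jag(stream, offset, storage):
    # Sketch of a "Jag" node: split each key at `offset`; the outer key
    # part selects an inner container in `storage`, the inner key part and
    # the value are passed on downstream.
    for key, value in stream:
        outer, inner = key[:offset], key[offset:]
        storage.setdefault(outer, {})     # "ensure writable" for the outer key
        yield outer, inner, value

# Splitting 2D keys at offset 1: PNZ[row, k] becomes PNZ'[row][k].
pairs = [((0, 0), {"row": 0, "col": 3, "val": 1.0}),
         ((0, 1), {"row": 0, "col": 7, "val": 2.0})]
PNZ_prime = {}
for outer, inner, value in jag(pairs, 1, PNZ_prime):
    PNZ_prime[outer][inner] = value
print(PNZ_prime)   # {(0,): {(0,): {...}, (1,): {...}}}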
C.7 MergeEliminationPass
The MergeEliminationPass is a pass that is specific to the transformation graph I/O generation approach. This pass does not transform the algorithm's specification in any way, but instead changes the I/O graph only. This pass tries to eliminate “Merge” nodes, typically produced by the LocalizationPass.
Let us illustrate the behavior of this pass through an example. Figure C.7 shows an I/O graph in which a “Merge” node exists. Note that the inputs of this “Merge” node are a “Transform Tuple” and a “Tuple to KeyValue” node, which have a shared I/O node (which outputs tuples of data). The MergeEliminationPass can realize that in this scenario we can avoid complicated merging logic by bypassing the “Transform Tuple” and “Tuple to KeyValue” nodes entirely and immediately generating the desired merged output through a single node directly linked to the shared source node. This will yield the I/O graph in Figure C.8. The existing I/O nodes that represent T and A have their output no longer connected to the previous “Merge” node, but may still have their outputs connected to other I/O nodes. If this is not the case, future dead code elimination will eliminate the effect of these I/O nodes entirely.
Caution must be taken when implementing this MergeEliminationPass, as it cannot always eliminate a “Merge” node, even when a shared I/O node exists. Every time a tuple is produced as input for the “Merge” node, the value at X[t.row, t.col] must be produced at the shared space input in Figure C.7. Static symbolic analysis can guarantee this.
Figure C.7: An I/O graph containing a “Merge” node. A shared I/O node producing ⟨row, col, val⟩ tuples feeds both a “Transform Tuple” node (producing ⟨t.row, t.col⟩) and a “Tuple to KeyValue” node (key: [t.row, t.col], value: t.val); the “Merge” node combines its tuple input t with the shared space input k, v → X[k] = v into ⟨∗t, X[t.row, t.col]⟩.
Here, whenever the shared I/O node produces value t0 = ⟨a, b, c⟩, the “Merge” node will receive t = ⟨a, b⟩ from the “Transform Tuple” node (which
just eliminates the latter value c). The “Tuple to KeyValue” node, in this case, produces value v = c for storage at location k = [a, b]. The “Merge” I/O node wants to merge, for the incoming t = ⟨a, b⟩, the value X[t.row, t.col] = X[a, b] = v into that tuple. Because this value in X[a, b] = v is produced at the same time ⟨a, b⟩ is streamed to the node (i.e. also produced when t0 is output by the shared I/O node), it is safe to eliminate the “Merge” node here.
For example, when the “Transform Tuple” node in Figure C.7 would transform the input tuple t to ⟨t.col, t.row⟩ (now also flipping row and col), the “Merge” node cannot be eliminated safely. When the shared I/O node would produce the value t0 = ⟨a, b, c⟩, the “Merge” node will now receive t = ⟨b, a⟩, k = [a, b] and v = c. The “Merge” node wants to merge X[t.row, t.col] = X[b, a] into the tuple, but the “Tuple to KeyValue” node will only produce the value v at the key k = [a, b], which is not [b, a]. Thus the MergeEliminationPass cannot eliminate the “Merge” node safely in this case.
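The safety condition can be phrased as a purely symbolic check: the lookup the “Merge” node performs on the tuple it receives from the “Transform Tuple” node must resolve, in terms of the shared tuple t0, to exactly the key under which the “Tuple to KeyValue” node stores its value. A minimal sketch of this check on the two examples above (plain Python, with the parameterizations represented as field-name mappings; not the actual analysis used by libtupl):

def can_eliminate_merge(transform_fields, keyvalue_key_fields, merge_lookup_fields):
    # Symbolic sketch: map each output field of "Transform Tuple" back to a
    # field of t0, then check that the Merge lookup, expressed in t0 fields,
    # equals the key the "Tuple to KeyValue" node stores under.
    produced = dict(transform_fields)          # e.g. {"row": "row", "col": "col"}
    lookup_in_t0 = [produced[f] for f in merge_lookup_fields]
    return lookup_in_t0 == list(keyvalue_key_fields)

# Safe case: Transform Tuple yields <t.row, t.col>, Merge looks up X[t.row, t.col],
# Tuple to KeyValue stores under key [t0.row, t0.col].
print(can_eliminate_merge({"row": "row", "col": "col"},
                          ["row", "col"], ["row", "col"]))    # True

# Unsafe case: Transform Tuple flips the fields to <t.col, t.row>.
print(can_eliminate_merge({"row": "col", "col": "row"},
                          ["row", "col"], ["row", "col"]))    # False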
C.8 ConcretizationPass
The ConcretizationPass also has to actually write data to the selected data structures. For this, we introduce a “Write Value” I/O node. This I/O node can only write scalar values to some previously fixed storage location, though, so we first use a “Jag” node to remove (or rather, fix) the key entirely. Note that, due to the parent/child relations between the I/O nodes, the concrete storage location this “Write Value” node should write to is passed down to it implicitly via the “Jag” node (all children of this “Jag” node have a limited scope in which subscriptable keys are locked down).
Let us consider an example in which we concretize PA, a two-dimensional subscriptable containing doubles. Figure C.9 shows the resulting I/O graph.
Figure C.8: The I/O graph after merge elimination. The shared I/O node producing ⟨row, col, val⟩ tuples still feeds the “Transform Tuple” node (⟨t.row, t.col⟩) and the “Tuple to KeyValue” node (key: [t.row, t.col], value: t.val), while a “Transform Tuple” node producing ⟨t.row, t.col, t.val⟩ directly from the shared node replaces the eliminated “Merge” node.
As the output socket of the “Jag” node has been tagged with CPA[_, _], the “Write Value” node will write the incoming values to this concretized subscriptable CPA.
Note that, for unloading data structures, we also need to “deconcretize” the data structures in the unload I/O transformation graph. For this we introduce a “Deconcretize” node, which reads all key-value pairs in a subscriptable and streams them out. Such a “Deconcretize” node is a root I/O node in the unload graph, as shown in Figure C.10.
Figure C.9: The I/O graph concretizing PA into CPA: an I/O node tagged PA, CPA ([2D], double) feeds a “Jag” node (key: k[2:], value: t) tagged CPA[_, _], whose [0D] output (assert(k == []), v) is consumed by a “Write Value” node.
Figure C.10: A “Deconcretize” root node (key: k, value: v) streaming out the [2D] double key-value pairs of PA, CPA in the unload graph.
Appendix D
I/O nodes need to be converted to imperative code during the code genera-
tion step. This appendix describes how various I/O nodes are converted to the
output-agnostic internal AST before the output-specific code generator tran-
spiles it to the desired target language.
output symbols are assigned to and the symbols are forwarded to all child
nodes.
D.3 DataStreamReader
The “DataStreamReader” node is always a root node of the I/O graph (i.e.
this node has no input sockets) and is parameterized by an external genera-
tor coroutine only. This node will create a loop to iterate the entire external
generator until it is exhausted and push the generated elements to the child
I/O nodes. Listing D.2 shows the code generated by this I/O node during this
conversion process.
... # other disjoint root node code
while input_stream:
    output_symbol = input_stream.next()
    ... # child nodes
... # other disjoint root node code
Listing D.2: Code generated by the “DataStreamReader” conversion routine.
D.4 ConstantStream
The “ConstantStream” node is, just like the “DataStreamReader”, always a root
node of the I/O graph. It will generate a constant value for a range of keys. List-
ing D.3 shows the code generated by this I/O node. N here is one less than the
number of desired key dimensions. limit describes the limits of each key for
each key dimension, such as (1000, ) for a one-dimensional “ConstantStream”
of 1000 constants.
... # other disjoint root node code
for dim0 in range(0, limit[0]):
    for dim1 in range(0, limit[1]):
        ...
        for dimN in range(0, limit[N]):
            output_symbol_key = [dim0, dim1, ..., dimN]
            output_symbol_value = constant_value
            ... # child nodes

... # other disjoint root node code
Listing D.3: Code generated by the “ConstantStream” conversion routine.
D.5 Aggregate
The “Aggregate” node will process all the tuples streamed to it (in fact, it only processes a single field of the tuples streamed to it) and finally produce a single scalar output. Some scalar aggregator is initialized to initial_value, depending on the aggregate function (often 0, but when multiplying this would be 1, for example). Then, for each tuple, the value in the to-aggregate field is taken and the aggregate function is applied to the current aggregator and the incoming value, updating the aggregator value. The final aggregator value is sent to all child I/O nodes once all tuples have been processed. Listing D.4 shows the code generated by this I/O node.
aggregator = initial_value
...
# data iterators...:
    ...
    input_symbol = ...  # assigned to by parent I/O node
    aggregator = aggregate_func(aggregator, input_symbol.target_field)
    ...
...
# after all data has been iterated
output_symbol_key = []
output_symbol_value = aggregator
...  # output socket 0 code
...
Listing D.4: Code generated by the “Aggregate” conversion routine.
In some cases it is possible to generate more efficient output code, for example when it is known in advance that the input queries are ordered. This would eliminate the need for a full-fledged lookup table, such as counts in Listing D.5. The code at output socket 1 could also be executed immediately once we move on to another query value, rather than forcing us to wait until all tuples have been processed.
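A sketch of this optimization is given below (plain Python; the callback names emit_tuple and emit_count are hypothetical, and the query is assumed to be [row] as before): when the tuples arrive grouped by their query value, a single running counter replaces the lookup table, and the count for a query is emitted as soon as the next query value starts.

def count_tuples_ordered(stream, emit_tuple, emit_count):
    # Sketch of a "Count Tuples" node specialized for input that is already
    # ordered (grouped) by its query value: no lookup table is needed, and
    # the per-query count is emitted as soon as the query value changes.
    current_query, k = None, 0
    for t in stream:
        query = (t["row"],)
        if query != current_query:
            if current_query is not None:
                emit_count(current_query, k)   # output socket 1, emitted early
            current_query, k = query, 0
        emit_tuple({**t, "k": k})              # output socket 0
        k += 1
    if current_query is not None:
        emit_count(current_query, k)           # flush the last group

out0, out1 = [], []
count_tuples_ordered(
    [{"row": 0, "col": 2}, {"row": 0, "col": 5}, {"row": 1, "col": 1}],
    out0.append, lambda q, n: out1.append((q, n)))
print(out1)   # [((0,), 2), ((1,), 1)]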
D.7 Jag
The “Jag” I/O node will generate code to ensure space is allocated to write at the index of the key part that has been split off. The exact implementation may vary depending on the underlying data structure. For example, let us assume a flat array is used and the “Jag” splits a 4D index ⟨a, b, c, d⟩ at offset 2 and is tagged with symbol expression CPNZ'''. The node will then ensure that CPNZ'''[a, b] is writable. It will also pass on to the child I/O node generation routines that ⟨a, b⟩ is the index that data should be written at (i.e. it further limits the scope of child I/O nodes).
The “Jag” node does not actually write data, but only (re)allocates it. It may initialize inner data structures, though, through a data-independent default constructor. The “Jag” node does not modify or use the value passed to the node, but does pass the value symbol reference on to child I/O nodes. Listing D.6 illustrates the code generated by this I/O node.
...
input_symbol_key = ...  # assigned to by parent I/O node
ensure_writable(CPNZ''', input_symbol_key.a, input_symbol_key.b)
output_symbol_key = (input_symbol_key.c, input_symbol_key.d)
...
Listing D.6: Code generated by the “Jag” conversion routine.
In the above example, another “Jag” I/O node can be generated as some
descendant of the previous “Jag” I/O node as part of the concretization. The
remaining two index values ⟨c, d⟩ will then also be split off and locked down.
This descendant “Jag” node will be tagged to generate data for CPNZ'''[_,
_].__a_b. The two placeholder slots here will be occupied by the referenced
indices from the ancestor “Jag” node: a and b. The code generated by this
second “Jag” node will thus roughly look like the code shown in Listing D.7.
...
input_symbol_key = ...  # assigned to by some ancestor I/O node
...
input_symbol2_key = ...  # assigned to by parent I/O node
ensure_writable(CPNZ'''[input_symbol_key.a, input_symbol_key.b].__a_b, input_symbol2_key.c, input_symbol2_key.d)
output_symbol2_key = ()
...
Listing D.7: Code generated by the second “Jag” conversion routine.
D.9 Deconcretize
The “Deconcretize” I/O node reads all the data out of a concretized data structure and pushes each key and value to child I/O nodes. Let us assume that CPA is being deconcretized and that CPA is concretized as some N-dimensional dense data structure. Listing D.9 then shows the code generated by this I/O node to deconcretize the data.
... # other disjoint root node code
for dim0 in range(0, CPA.bounds<0>()):
    for dim1 in range(0, CPA.bounds<1>()):
        ...
        for dimN in range(0, CPA.bounds<N>()):
            output_symbol_key = (dim0, dim1, ..., dimN)
            output_symbol_value = CPA[*output_symbol_key]
            ... # child nodes
... # other disjoint root node code
Listing D.9: Code generated by the “Deconcretize” conversion routine for CPA. Here we assume the subscriptable is concretized as some dense data structure.
D.10 DataStreamWriter
The “DataStreamWriter” I/O node simply consumes each tuple streamed to it
and invokes a callback function with (a reference to) this tuple as parameter.
Listing D.10 is the code that can be generated by this node.
...
input_symbol = ...  # assigned to by parent I/O node
callback_func(input_symbol)
...
Listing D.10: Code generated by the “DataStreamWriter” conversion routine.
Appendix E
Example C++ output for the sparse matrix-vector multiplication example com-
pilation process in Section 3.7.
struct _tuple__col_int___ {
    int _col;
};
struct _tuple__zipped_val_int___ {
    int _zipped_val;
};
struct _tuple____col_FlatArraySubscriptablem1mm_tuple__col_int___m______zipped_val_FlatArraySubscriptablem1mm_tuple__zipped_val_int___m___ {
    FlatArraySubscriptable<1, _tuple__col_int___> ___col;
    FlatArraySubscriptable<1, _tuple__zipped_val_int___> ___zipped_val;
};
struct _tuple__row_int___ {
    int _row;
};
struct _tuple__0_int___ {
    int _0;
};
struct _tuple__row_int____col_int____val_int___ {
    int _row;
    int _col;
    int _val;
};
struct _tuple__i_int____val_int___ {
    int _i;
    int _val;
};
struct _tuple_ {
};
struct _tuple__row_int____col_int____zipped_val_int____k_int___ {
    int _row;
    int _col;
    int _zipped_val;
    int _k;
};
struct _tuple__row_int____k_int___ {
    int _row;
    int _k;
};
struct _tuple__row_int____col_int____zipped_val_int___ {
    int _row;
    int _col;
    int _zipped_val;
};
struct _tuple__0_int____1_int___ {
    int _0;
    int _1;
};
struct _tuple__col_int____zipped_val_int___ {
    int _col;
    int _zipped_val;
};
struct _tuple__i_int____v_int___ {
    int _i;
    int _v;
};
class MatVec {

private:
    FlatArraySubscriptable<1, int> _PB;

private:
    FlatArraySubscriptable<1,
        _tuple____col_FlatArraySubscriptablem1mm_tuple__col_int___m______zipped_val_FlatArraySubscriptablem1mm_tuple__zipped_val_int___m___>
        _PNZ_zip_Ammm;

private:
    FlatArraySubscriptable<1, int> _PC;

private:
    FlatArraySubscriptable<0, int> _accum_NZ_zip_A_row_max;

private:
    FlatArraySubscriptable<1, int> _PNZ_zip_A_len;

public:
    void _matvec() {
        _tuple__row_int___ _rowval;
        _tuple__0_int___ _kv;

        for (_rowval._row = 0; _rowval._row <= _accum_NZ_zip_A_row_max.get(); ++_rowval._row) {
            if (true) {
                for (_kv._0 = 0; _kv._0 <= (_PNZ_zip_A_len.get(_rowval._row) - 1); ++_kv._0) {
                    if (true) {
                        _PC.get(_rowval._row) = (_PC.get(_rowval._row) +
                            (_PNZ_zip_Ammm.get(_rowval._row).___zipped_val.get(_kv._0)._zipped_val *
                             _PB.get(_PNZ_zip_Ammm.get(_rowval._row).___col.get(_kv._0)._col)));
                    }
                }
            }
        }
    }

public:
    void _load(InStream<_tuple__row_int____col_int____val_int___>& _Values,
               InStream<_tuple__i_int____val_int___>& _BVals) {
        _tuple__0_int___ _tmp_26;
        int _tmp_25;
        int _tmp_23;
        _tuple_ _tmp_6;
        int _tmp_35;
        int _tmp_27;
        int _itr0;
        _tuple_ _tmp_34;
        _tuple__0_int___ _csgen_key;
        _tuple__0_int___ _tmp_31;
        int _tmp_33;
        _tuple__row_int____col_int____zipped_val_int____k_int___ _tmp_10;
        _tuple__row_int____k_int___ _tmp_9;
        int _tmp_5;
        _tuple__0_int___ _tmp_24;
        int _tmp_17;
        _tuple__row_int____col_int____zipped_val_int___ _tmp_3;
        _tuple_ _tmp_29;
        int _tmp_2;
        int _tmp_32;
        _tuple__row_int____col_int____val_int___ _tmp_0;
        _tuple__0_int___ _tmp_14;
        FlatArraySubscriptable<1, int> _tmp_7;
        int _csgen_value;
        _tuple__i_int____val_int___ _tmp_30;
        _tuple_ _tmp_22;
        _tuple__0_int____1_int___ _tmp_1;
        _tuple__row_int____k_int___ _tmp_11;
        int _tmp_21;
        _tuple__col_int____zipped_val_int___ _tmp_12;
        _tuple_ _tmp_36;
        int _tmp_13;
        _tuple__0_int___ _tmp_15;
        _tuple_ _tmp_18;
        int _tmp_28;
        _tuple__0_int___ _tmp_19;
        _tuple__row_int____col_int____zipped_val_int____k_int___ _tmp_8;
        _tuple__col_int___ _tmp_16;
        _tuple__zipped_val_int___ _tmp_20;

        _tmp_5 = 0;
        while (_neof(_Values)) {
            _tmp_0 = _read(_Values);
            _tmp_1 = (_tuple__0_int____1_int___){_tmp_0._row, _tmp_0._col};
            _tmp_2 = _tmp_0._val;
            _tmp_3 = (_tuple__row_int____col_int____zipped_val_int___){_tmp_0._row, _tmp_0._col, _tmp_0._val};
            _tmp_5 = _max(_tmp_5, _tmp_3._row);
            _ensure_writable(_tmp_7, _tmp_3._row);
            _tmp_8 = (_tuple__row_int____col_int____zipped_val_int____k_int___){
                _tmp_3._row, _tmp_3._col, _tmp_3._zipped_val, _tmp_7.get(_tmp_3._row)};
            _tmp_7.get(_tmp_3._row) = (_tmp_7.get(_tmp_3._row) + 1);
            _tmp_9 = (_tuple__row_int____k_int___){_tmp_8._row, _tmp_8._k};
            _tmp_10 = _tmp_8;
            _tmp_11 = _tmp_9;
            _tmp_12 = (_tuple__col_int____zipped_val_int___){_tmp_10._col, _tmp_10._zipped_val};
            _tmp_13 = _tmp_11._row;
            _ensure_writable(_PNZ_zip_Ammm, _tmp_11._row);
            _tmp_14 = (_tuple__0_int___){_tmp_11._k};
            _tmp_15 = _tmp_14;
            _tmp_16 = (_tuple__col_int___){_tmp_12._col};
            _tmp_17 = _tmp_15._0;
            _ensure_writable(_PNZ_zip_Ammm.get(_tmp_13).___col, _tmp_15._0);
            _tmp_18 = (_tuple_){};
            _PNZ_zip_Ammm.get(_tmp_13).___col.get(_tmp_17) = _tmp_16;
            _tmp_19 = _tmp_14;
            _tmp_20 = (_tuple__zipped_val_int___){_tmp_12._zipped_val};
            _tmp_21 = _tmp_19._0;
            _ensure_writable(_PNZ_zip_Ammm.get(_tmp_13).___zipped_val, _tmp_19._0);
            _tmp_22 = (_tuple_){};
            _PNZ_zip_Ammm.get(_tmp_13).___zipped_val.get(_tmp_21) = _tmp_20;
        }
        while (_neof(_BVals)) {
            _tmp_30 = _read(_BVals);
            _tmp_31 = (_tuple__0_int___){_tmp_30._i};
            _tmp_32 = _tmp_30._val;
            _tmp_33 = _tmp_31._0;
            _ensure_writable(_PB, _tmp_31._0);
            _tmp_34 = (_tuple_){};
            _PB.get(_tmp_33) = _tmp_32;
        }
        for (_itr0 = 0; _itr0 <= 1137; ++_itr0) {
            _csgen_key = (_tuple__0_int___){_itr0};
            _csgen_value = 0;
            _tmp_35 = _csgen_key._0;
            _ensure_writable(_PC, _csgen_key._0);
            _tmp_36 = (_tuple_){};
            _PC.get(_tmp_35) = _csgen_value;
        }
        _ensure_writable(_accum_NZ_zip_A_row_max);
        _tmp_6 = (_tuple_){};
        _accum_NZ_zip_A_row_max.get() = _tmp_5;
        for (_tmp_23 = 0; _tmp_23 <= (_tmp_7.bound<0>() - 1); ++_tmp_23) {
            _tmp_24 = (_tuple__0_int___){_tmp_23};
            _tmp_25 = _tmp_7.get(_tmp_23);
            _tmp_26 = _tmp_24;
            _tmp_27 = _tmp_25;
            _tmp_28 = _tmp_26._0;
            _ensure_writable(_PNZ_zip_A_len, _tmp_26._0);
            _tmp_29 = (_tuple_){};
            _PNZ_zip_A_len.get(_tmp_28) = _tmp_27;
        }
    }

public:
    void _unload(OutStream<_tuple__i_int____v_int___>& _CVals) {
        int _tmp_37;
        _tuple__0_int___ _tmp_38;
        _tuple__i_int____v_int___ _tmp_40;
        int _tmp_39;

        for (_tmp_37 = 0; _tmp_37 <= (_PC.bound<0>() - 1); ++_tmp_37) {
            _tmp_38 = (_tuple__0_int___){_tmp_37};
            _tmp_39 = _PC.get(_tmp_37);
            _tmp_40 = (_tuple__i_int____v_int___){_tmp_38._0, _tmp_39};
            _write(_CVals, _tmp_40);
        }
    }
};