
PeachPy: A Python Framework for Developing High-Performance Assembly Kernels

Marat Dukhan
School of Computational Science & Engineering
Georgia Institute of Technology, Atlanta, GA, USA

PyHPC 2013, November 18, 2013, Denver, Colorado, USA

Abstract—We introduce PeachPy, a Python framework which aids the development of assembly kernels for high-performance computing. PeachPy automates several routine tasks in assembly programming, such as allocating registers and adapting functions to different calling conventions. By representing assembly instructions and registers as Python objects, PeachPy enables developers to use Python for assembly metaprogramming, and thus provides a modern alternative to traditional macro processors in assembly programming. The current version of PeachPy supports the x86-64 and ARM architectures.

I. INTRODUCTION

We consider the problem of how to enable productive assembly language programming. The use of assembly still plays an important role in developing performance-critical computational kernels in high-performance computing. For instance, recent studies have shown how the performance of many computations in dense linear algebra depends critically on a relatively small number of highly tuned implementations of microkernel code [1], for which high-level compilers produce only suboptimal implementations [2]. In cases like these, manual low-level programming may be the only option. Unfortunately, existing mainstream tools for assembly-level programming are still very primitive, being tedious and time-consuming to use compared to higher-level programming models and languages.

Our goal is to ease assembly programming. In particular, we wish to enable an assembly programmer to build high-performing code for a variety of operations (i.e., not just for linear algebra), data types, and processors, and to do so using a relatively small amount of assembly code, combined with easy-to-use metaprogramming facilities and nominal automation for routine tasks, provided they do not hurt performance. Toward this end, we are developing PeachPy, a new Python-based framework that aids the development of high-performance assembly kernels.

PeachPy joins a vast pool of tools that blend Python and code generation. Code generation is used by Python programs for varying reasons and use-cases. Some projects, such as PyPy [3] and Cython [4], use code generation to improve the performance of Python code itself. Cython achieves this goal by statically compiling Python sources to machine code (via a C compiler) and providing a syntax to specify the types of Python variables. PyPy instead relies on JIT-compilation and type inference. Another group of code generation tools is comprised of Python bindings to widely used general-purpose code-generation targets. This group includes the LLVM-Py [5], PyCUDA, and PyOpenCL [6] projects. In essence, these examples focus on accelerating Python through low-level code generation.

By contrast, we are interested primarily in the task of generating assembly code using Python. This style of Python-assisted assembly programming was pioneered by CorePy [7], the works of Malas et al. [8], Dongarra and Luszczek [9], and several other projects [10, 11]. CorePy [7] enabled a programmer to write an assembly program in Python and compile it from the Python interpreter. The authors suggested using CorePy to optimize performance-sensitive parts of Python applications. Malas et al. [8] used Python to autotune assembly for the PowerPC 450 processors powering Blue Gene/P supercomputers. Their assembly framework could simulate the processor pipeline and reschedule instructions to avoid data hazards and improve performance. The assembly code-generator of Dongarra and Luszczek [9] focuses on assembly programmer productivity, and is the most similar in spirit to PeachPy. The primary use-case for their code-generator is cross-compilation to ARM, but they also supported a debug mode, where the code-generator outputs equivalent C code.

The audience for PeachPy is an optimization expert writing multiple but similar kernels in assembly, as might happen when implementing the same operation in both double and single precision or for multiple architectures. PeachPy further aids the programmer by automating a number of essential assembly programming tasks; however, it does so only in instances when there will be no unintended losses in performance. To achieve these design goals, PeachPy has a unique set of features:

• Like prior efforts [7–9], PeachPy represents assembly instructions, registers, and other operands as first-class Python objects.
• PeachPy's syntax closely resembles traditional assembly language.¹ Most instructions can be converted to calls to equivalent PeachPy functions by just enclosing their operands in parentheses.
• PeachPy enriches assembly programming with some features of compilers for higher-level languages. PeachPy performs liveness analysis on assembly functions, does fully automatic register allocation,² adapts programs to different calling conventions, and coalesces equal constants in memory.
• PeachPy collects information about the instruction sets used in the program. This information enables the implementation of dynamic dispatching between several code versions based on the instruction extensions available on a given host processor.
• PeachPy aims to replace traditional assembly language programming models. Furthermore, PeachPy-generated code does not impose any requirements on runtime libraries and does not need Python to run. PeachPy supports a wide variety of instruction set extensions, including various versions of SSE, AVX and AVX2, FMA3 and FMA4 on x86, and NEON on ARM.

A. Contributions

PeachPy brings two improvements to the process of developing assembly kernels:

1) Advancing Python-based assembly metaprogramming (§ III): PeachPy represents assembly instructions, registers, and constants as first-class Python functions and objects. As such, PeachPy enables Python to serve as a modern alternative to traditional assembly macro programming. The use of Python-based metaprogramming lets an optimization expert generate compute kernels for different data types, operations, instruction sets, microarchitectures, and tuning parameters from a single source. While earlier works pioneered the use of Python for assembly metaprogramming [7–9], PeachPy goes further with new tools and support for a wider range of use-cases. For instance, PeachPy makes it easier to write software-pipelined code and to emulate newer instruction sets using older instructions. Another important PeachPy-enabled use-case is creating fat binaries with versions for multiple instruction sets or microarchitectures.

2) Automation of routine tasks in assembly (§ II): PeachPy fully automates some tasks in assembly programming, but only where such automation is not likely to impact performance. For example, PeachPy handles register allocation because on modern out-of-order processors the choice of physical registers to use for virtual registers does not affect performance. However, unlike compilers for high-level languages, if PeachPy finds that it cannot fit all virtual registers onto physical registers without spilling, it will not silently spill registers to local variables, which might degrade performance; rather, it will alert the programmer by generating a Python exception.

¹ Malas et al. [8] followed the same principle in their PowerPC assembler.
² Most previous research for assembly code generation used some kind of register management mechanism. However, it required the programmer to explicitly allocate and release registers. PeachPy allocates registers automatically based on liveness analysis.

B. PeachPy DSL

PeachPy contains two Python modules, peachpy.x64 and peachpy.arm, which implement assembly instructions and registers for the x86-64 and ARM architectures, respectively, as Python objects. These modules are intended to be used with the from peachpy.arch import * statement. When imported this way, PeachPy objects let the programmer write assembly in Python with syntax similar to traditional assembly, effectively implementing an assembly-like domain-specific language (DSL) on top of Python.

Assembly functions are represented by a Function class in the peachpy modules. The function constructor accepts four parameters that specify the assembler object (storage for functions), the function name, the tuple of function arguments, and the string name of a target microarchitecture. The latter restricts the set of instructions which can be used: an attempt to use an instruction unsupported on
the target microarchitecture will cause an exception. Once a function is created, it can be made active by the with statement. Instructions generated in the scope of the with statement are added to the active function. When the execution of the Python script leaves the scope of the with statement, PeachPy will run the post-processing analysis passes and generate the final assembly code.

Listing 1 Minimal PeachPy example

from peachpy.x64 import *

abi = peachpy.c.ABI('x64-sysv')
assembler = Assembler(abi)
x_argument = peachpy.c.Parameter("x",
    peachpy.c.Type("uint32_t"))
arguments = (x_argument,)
function_name = "f"
microarchitecture = "SandyBridge"

with Function(assembler, function_name,
              arguments, microarchitecture):
    MOV( eax, 0 )
    RETURN()
print assembler

II. AUTOMATION OF ROUTINE TASKS

By design, PeachPy only includes automation of assembly programming tasks that will not adversely affect the efficiency of the generated code. Any automation choices which might affect performance are left to the programmer. This design differs from the philosophy of higher-level programming models, where the compiler must correctly handle all situations, even if doing so might result in suboptimal performance. PeachPy opts for relatively less automation, in part because we expect it will be used for codes where the high-automation approach of high-level compilers does not deliver good performance.

A. Register allocation

While PeachPy allows referencing registers by their standard names (e.g., rcx or xmm0 on x86), it also provides a virtual register abstraction. The programmer creates a virtual register by calling the constructor of a register class without parameters. Virtual registers can be used as instruction operands in the same way as physical registers. After PeachPy receives all instructions that constitute an assembly function, it performs a liveness analysis for virtual registers and uses it to bind virtual registers to physical registers. If it cannot perform this binding without spilling, it will generate a Python exception to alert the programmer to the situation. The programmer may then rewrite the function to use fewer virtual registers or manually spill virtual registers to local variables.

B. Constant allocation

Code and data are physically stored in different sections of the executable, and in assembly programming they are defined in different parts of the source file. The code where an in-memory constant is used might be far from the line where the constant is defined. Thus, the programmer always has to keep in mind the names of all constants used in a function. PeachPy solves this problem by allowing constants to be defined exactly where they are used. When PeachPy finalizes an assembly function, it scans all instructions for such constants, coalesces equal constants (it can coalesce integer and floating-point constants with the same binary representation), and generates a constants section.

C. Adaptation to calling conventions

Both the x86-64 and ARM architectures can be used with several variants of function calling conventions. They might differ in how the parameters are passed to a function or which registers must be saved in the function's prolog and restored in the epilog. For assembly programmers, supporting multiple calling conventions requires having several versions of assembly code. While these versions are mostly similar, maintaining them separately can quickly become a significant burden.

To assist adaptation of a function to different calling conventions, PeachPy provides a pseudo-instruction, LOAD.PARAMETER, which loads a function parameter into a register. If the parameter is passed in a register and the destination operand of the LOAD.PARAMETER pseudo-instruction is a virtual register, PeachPy will bind this virtual register to the physical register where the parameter is passed, and the LOAD.PARAMETER instruction will become a no-op.

III. METAPROGRAMMING

One of the goals of the PeachPy project was to simplify writing multiple similar compute kernels in assembly. To achieve this goal, PeachPy leverages the flexibility of Python to replace the macro preprocessors of traditional assemblers.
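The liveness-based binding of virtual registers to physical registers described in § II-A can be sketched with a small self-contained toy (a hypothetical illustration, not PeachPy's actual implementation; the instruction encoding and all names here are invented): each instruction lists the virtual registers it touches, a virtual register is live from its first to its last use, and the allocator raises an exception instead of silently spilling.

```python
# Toy liveness-based register binding (hypothetical sketch, not PeachPy code).
# Each instruction is modeled as the list of virtual registers it uses.

class RegisterAllocationError(Exception):
    """Raised instead of silently spilling, mirroring PeachPy's behavior."""

def bind_registers(instructions, physical_regs):
    # Live range of each virtual register: (first use, last use),
    # measured in instruction indices.
    ranges = {}
    for i, used in enumerate(instructions):
        for v in used:
            first, _ = ranges.get(v, (i, i))
            ranges[v] = (first, i)
    binding = {}
    free = list(physical_regs)
    active = []  # (last_use, virtual, physical)
    for v, (start, end) in sorted(ranges.items(), key=lambda kv: kv[1][0]):
        # Release physical registers whose live range ended before `start`.
        for item in [a for a in active if a[0] < start]:
            active.remove(item)
            free.append(item[2])
        if not free:
            raise RegisterAllocationError(
                "cannot bind virtual register %r without spilling" % (v,))
        phys = free.pop(0)
        binding[v] = phys
        active.append((end, v, phys))
    return binding
```

With two physical registers, for example, a virtual register whose live range has expired frees its physical register for reuse by a later virtual register, while three simultaneously live virtual registers trigger the exception, analogous to PeachPy's refusal to spill behind the programmer's back.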
A. Custom Instructions

PeachPy users can define Python functions that can be used interchangeably with real assembly instructions, just as macros in traditional assembly can be used like instructions. Unlike macros, Python-based PeachPy functions that implement an instruction can use virtual registers to hold temporaries without making them part of the interface. Listing 2 shows how the SSE3 instruction HADDPD can be simulated using SSE2 instructions in PeachPy. The interface of the simulated HADDPD instruction exactly matches the interface of the real HADDPD, and it can be used transparently by compute kernels which need this instruction.

Listing 2 Simulation of HADDPD with x86 SSE2 instructions

def HADDPD(xmm_dst, xmm_src):
    xmm_tmp = SSERegister()
    MOVAPD( xmm_tmp, xmm_dst )
    UNPACKLPD( xmm_dst, xmm_src )
    UNPACKHPD( xmm_tmp, xmm_src )
    ADDPD( xmm_dst, xmm_tmp )

B. Parameterized Code Generation

With PeachPy, a programmer can use Python's control flow statements and expressions to parametrize the generation of assembly code. Then, an optimization expert might tune the parameters to maximize performance. Common examples of such parameters in HPC kernels are loop unrolling factors and prefetch distances. An example of parameterized code generation for an array summation kernel appears in Listing 3.

Listing 3 Array summation kernel parametrized by loop unroll factor and prefetch distance using x86 AVX instructions

def VECTOR_SUM(unroll_regs, prefetch_distance):
    ymm_acc = [AVXRegister()
               for _ in range(unroll_regs)]
    for i in range(unroll_regs):
        VXORPD( ymm_acc[i], ymm_acc[i] )
    LABEL( "next_batch" )
    PREFETCHNTA( [arrayPtr + prefetch_distance] )
    for i in range(unroll_regs):
        VADDPD( ymm_acc[i], [arrayPtr + i * 32] )
    ADD( arrayPtr, unroll_regs * 32 )
    SUB( arrayLen, unroll_regs * 8 )
    JNZ( "next_batch" )
    # Sums all ymm_acc to a scalar in ymm_acc[0]
    REDUCE.SUM( ymm_acc, peachpy.c.type("float") )
    VMOVSS( [sumPtr], ymm_acc[0].get_oword() )

C. Generalized Kernels

Since instructions in PeachPy are first-class Python objects, it is easy to parametrize the operation or data type of a compute kernel to enable generating multiple similar kernels from a single generalized kernel. Listing 4 gives an example of one such generalized kernel.

Listing 4 Generalized kernel which generates addition and subtraction for 32-bit and 64-bit integer arrays

def VECTOR_OP(operation, data_size):
    SIMD_OP = {('Add', 4): VPADDD,
               ('Add', 8): VPADDQ,
               ('Sub', 4): VPSUBD,
               ('Sub', 8): VPSUBQ}[(operation, data_size)]
    LABEL( "next_batch" )
    xmm_x = SSERegister()
    xmm_y = SSERegister()
    VMOVDQU( xmm_x, [xPtr] )
    VMOVDQU( xmm_y, [yPtr] )
    SIMD_OP( xmm_x, xmm_x, xmm_y )
    VMOVDQU( [zPtr], xmm_x )
    for ptr in [xPtr, yPtr, zPtr]:
        ADD( ptr, 16 )
    SUB( length, 16 / data_size )
    JNZ( "next_batch" )

D. ISA-Specific Code Generation

A PeachPy user must specify the target microarchitecture when creating a Function object. This information is provided back to PeachPy kernels via static methods of the Target class. Target.has_<isa-extension> methods indicate if the target microarchitecture supports various ISA extensions, such as SSE4.2, AVX, FMA4, or ARM NEON. HPC kernels may use this information to benefit from new instructions without rewriting the whole kernel for each ISA level. Listing 5 shows how to make a dot product kernel use AVX, FMA3, or FMA4 instructions, depending on the target microarchitecture.

E. Instruction Streams

By default, PeachPy adds each generated instruction to the active assembly function. However, it may also make sense to generate similar instruction streams and then combine them together. One example is optimizing a kernel for the ARM Cortex-A9 microarchitecture. Cortex-A9 processors can decode two instructions per
Listing 5 Dot product kernel which can use AVX, FMA3 or FMA4 instructions

def DOT_PRODUCT():
    if Target.has_fma4() or Target.has_fma3():
        MULADD = (VFMADDPS if Target.has_fma4()
                  else VFMADD231PS)
    else:
        def MULADD(ymm_x, ymm_a, ymm_b, ymm_c):
            ymm_t = AVXRegister()
            VMULPS( ymm_t, ymm_a, ymm_b )
            VADDPS( ymm_x, ymm_t, ymm_c )
    ymm_acc = AVXRegister()
    VXORPS( ymm_acc, ymm_acc )
    LABEL( "next_batch" )
    ymm_tmp = AVXRegister()
    VMOVAPS( ymm_tmp, [xPtr] )
    MULADD( ymm_acc, ymm_tmp, [yPtr], ymm_acc )
    ADD( xPtr, 32 )
    ADD( yPtr, 32 )
    SUB( length, 8 )
    JNZ( "next_batch" )
    REDUCE.SUM( [ymm_acc],
                peachpy.c.type("float") )
    VMOVSS( [resultPtr], ymm_acc.get_oword() )

cycle, but only one of them can be a SIMD instruction. Thus, performance may be improved by mixing scalar and vector operations. Bernstein and Schwabe [12] found that such instruction blending improves the performance of the Salsa20 stream cipher on the ARM Cortex-A8 microarchitecture, which has a similar limitation on instruction decoding. PeachPy instruction streams provide a mechanism to redirect generated instructions to a different sequence, and later merge sequences of instructions. When an InstructionStream object is used with the Python with statement, all instructions generated in the with scope are added to the instruction stream instead of the active function. Instructions stored in an InstructionStream can then be added one-by-one to the active function (or active InstructionStream object) by calling the issue method. Listing 6 outlines the use of an instruction stream object to mix scalar and vector variants of a kernel.

Listing 6 Use of instruction streams to interleave scalar and vector instructions

scalar_stream = InstructionStream()
with scalar_stream:
    # Scalar kernel
    ...

vector_stream = InstructionStream()
with vector_stream:
    # SIMD kernel
    ...

while scalar_stream or vector_stream:
    # Mix scalar and vector instructions
    scalar_stream.issue()
    vector_stream.issue()

F. Software Pipelining

Instruction streams can also assist in developing software-pipelined versions of compute kernels. Using instruction streams, the programmer can separate similar instructions from different unrolled loop iterations into different instruction streams. Listing 7 applies this technique to a kernel which adds a constant to a vector of integers.

Listing 7 Use of instruction streams to separate similar instructions for software pipelining

instruction_columns = [InstructionStream()
                       for _ in range(3)]
ymm_x = [AVXRegister() for _ in range(unroll_regs)]
for i in range(unroll_regs):
    with instruction_columns[0]:
        VMOVDQU( ymm_x[i], [xPtr + i * 32] )
    with instruction_columns[1]:
        VPADDD( ymm_x[i], ymm_y )
    with instruction_columns[2]:
        VMOVDQU( [zPtr + i * 32], ymm_x[i] )
with instruction_columns[0]:
    ADD( xPtr, unroll_regs * 32 )
with instruction_columns[2]:
    ADD( zPtr, unroll_regs * 32 )

After similar instructions are collected in instruction streams, it is often possible to shift these sequences relative to each other so that by the time any instruction is decoded its inputs are already computed, and the instruction can issue immediately. Figure 1 illustrates this principle. Making a software-pipelined version of a kernel typically involves duplicating the kernel code twice, skewing the instruction columns relative to each other, and looping on the middle part of the resulting sequence.

IV. PERFORMANCE STUDY

Although PeachPy improves the programmability of traditional assembly, PeachPy code is harder to develop and maintain than code in C or FORTRAN. As such, we might want to check that optimization at the assembly level can at least improve performance compared to using an optimizing C compiler. Previous research provides ambiguous results: Malas et al. [8] found that, on the PowerPC 450 processor, optimized assembly can deliver up to twice the performance of the best autotuned C
and Fortran code; by contrast, Satish et al. [13] suggest that code optimized with assembly or C intrinsics is just 10–40% faster than C code.

Fig. 1: Illustration of the software pipelining principle for the kernel in Listing 7. Instruction streams correspond to columns in this illustration. Instructions are executed in left-to-right, then top-to-bottom order.

In this section we describe a limited study of the performance of PeachPy-generated code versus code generated by C++ compilers from equivalent C++ intrinsics. For this study we used kernels for computing the vector logarithm and vector exponential functions³ from the Yeppp! library (www.yeppp.info) on an Intel Core i7-4770K (Haswell microarchitecture). A kernel takes a batch of 40 double precision elements, and computes log or exp on each element using only floating-point multiplication, addition, integer, and logical operations. These kernels were originally implemented in PeachPy. For this study, we wrote an equivalent C++ version by converting each assembly instruction generated by PeachPy into an equivalent C++ intrinsic call. In the C++ source code, the intrinsics are called in exactly the same order as the corresponding assembly instructions in the PeachPy code.

³ The vector logarithm and vector exponential functions compute log and exp on each element of an input vector and produce a vector of outputs.

Several properties make this code nearly ideal for a C++ compiler:

• The code is already vectorized using intrinsic functions, so it does not depend on the quality of the compiler's auto-vectorizer.
• There is only one final branch in a loop iteration, so a loop iteration forms a basic block. In the PeachPy output, each loop iteration for log contains 581 instructions and the same iteration for exp contains 400 instructions, which gives the C++ compiler a lot of freedom to schedule instructions.
• The initial order of instructions is close to optimal, as it exactly matches the manually optimized PeachPy code. The only part left to the compiler is register allocation. If the compiler could replicate the register allocation used in the PeachPy code, it would get exactly the same performance. But the compiler could also improve upon the PeachPy implementation by finding a better instruction schedule.

Figure 2 demonstrates the performance results. None of the three tested compilers could match the performance of the manually optimized assembly, although for the vector logarithm gcc's output is very close. This result suggests that developing HPC codes in assembly might be necessary to get optimal performance on modern processors. In such cases, developers can leverage PeachPy to make assembly programming easier.

Fig. 2: Performance evaluation of code generated by PeachPy and three C++ compilers from equivalent intrinsics.

V. CONCLUSION

In this paper, we introduced a Python framework for developing assembly compute kernels called PeachPy. PeachPy simplifies writing HPC kernels compared to traditional assembly, and introduces a significant degree of automation to the process. PeachPy allows developers to leverage the flexibility of Python to generate a large number of similar kernels from a single source file.

ACKNOWLEDGEMENTS

We thank Aron Ahmadia for his useful and insightful comments on this research and careful proofreading of the final draft of this paper. We thank Richard Vuduc for detailed suggestions of improvements for this paper.
This work was supported in part by grants to Prof. Richard Vuduc's research lab, The HPC Garage (www.hpcgarage.org), from the National Science Foundation (NSF) under NSF CAREER award number 0953100; and a grant from the Defense Advanced Research Projects Agency (DARPA) Computer Science Study Group program.

REFERENCES

[1] F. G. Van Zee, T. Smith, F. D. Igual, M. Smelyanskiy, X. Zhang, M. Kistler, V. Austel, J. Gunnels, T. M. Low, B. Marker, L. Killough, and R. A. van de Geijn, "Implementing level-3 BLAS with BLIS: Early experience," The University of Texas at Austin, Department of Computer Science, FLAME Working Note #69, Technical Report TR-13-03, Apr. 2013.
[2] K. Yotov, T. Roeder, K. Pingali, J. Gunnels, and F. Gustavson, "An experimental comparison of cache-oblivious and cache-conscious programs," in Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, 2007, pp. 93–104.
[3] A. Rigo and S. Pedroni, "PyPy's approach to virtual machine construction," in Companion to the 21st ACM SIGPLAN symposium on Object-oriented programming systems, languages, and applications, 2006, pp. 944–953.
[4] S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D. Seljebotn, and K. Smith, "Cython: The best of both worlds," Computing in Science & Engineering, vol. 13, no. 2, pp. 31–39, March–April 2011.
[5] llvmpy: Python bindings for LLVM. [Online]. Available: http://www.llvmpy.org
[6] A. Klöckner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, and A. Fasih, "PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation," Parallel Computing, vol. 38, no. 3, pp. 157–174, 2012.
[7] A. Friedley, C. Mueller, and A. Lumsdaine, "High-performance code generation using CorePy," in Proc. of the 8th Python in Science Conference, Pasadena, CA, USA, 2009, pp. 23–28.
[8] T. Malas, A. J. Ahmadia, J. Brown, J. A. Gunnels, and D. E. Keyes, "Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor," International Journal of High Performance Computing Applications, vol. 27, no. 2, pp. 193–209, 2013.
[9] J. Dongarra and P. Luszczek, "Anatomy of a globally recursive embedded LINPACK benchmark," in High Performance Extreme Computing (HPEC), 2012 IEEE Conference on, 2012, pp. 1–6.
[10] F. Boesch. (2008) pyasm, a python assembler. [Online]. Available: http://bitbucket.org/pyalot/pyasm
[11] G. Olson. (2006) PyASM users guide v. 0.3. /doc/usersGuide.txt. [Online]. Available: http://github.com/grant-olson/pyasm
[12] D. J. Bernstein and P. Schwabe, "NEON crypto," in Cryptographic Hardware and Embedded Systems – CHES 2012. Springer, 2012, pp. 320–339.
[13] N. Satish, C. Kim, J. Chhugani, H. Saito, R. Krishnaiyer, M. Smelyanskiy, M. Girkar, and P. Dubey, "Can traditional programming bridge the Ninja performance gap for parallel computing applications?" in Proceedings of the 39th International Symposium on Computer Architecture, 2012, pp. 440–451.
