Pyhpc2013 Submission 10
Abstract—We introduce PeachPy, a Python framework which aids the development of assembly kernels for high-performance computing. PeachPy automates several routine tasks in assembly programming, such as allocating registers and adapting functions to different calling conventions. By representing assembly instructions and registers as Python objects, PeachPy enables developers to use Python for assembly metaprogramming, and thus provides a modern alternative to traditional macro processors in assembly programming. The current version of PeachPy supports the x86-64 and ARM architectures.
I. INTRODUCTION

We consider the problem of how to enable productive assembly language programming. The use of assembly still plays an important role in developing performance-critical computational kernels in high-performance computing. For instance, recent studies have shown how the performance of many computations in dense linear algebra depends critically on a relatively small number of highly tuned implementations of microkernel code [1], for which high-level compilers produce only suboptimal implementations [2]. In cases like these, manual low-level programming may be the only option. Unfortunately, existing mainstream tools for assembly-level programming are still very primitive, being tedious and time-consuming to use compared to higher-level programming models and languages.

Our goal is to ease assembly programming. In particular, we wish to enable an assembly programmer to build high-performing code for a variety of operations (i.e., not just for linear algebra), data types, and processors, and to do so using a relatively small amount of assembly code, combined with easy-to-use metaprogramming facilities and nominal automation of routine tasks, provided they do not hurt performance. Toward this end, we are developing PeachPy, a new Python-based framework that aids the development of high-performance assembly kernels.

PeachPy joins a vast pool of tools that blend Python and code generation. Code generation is used by Python programs for varying reasons and use-cases. Some projects, such as PyPy [3] and Cython [4], use code generation to improve the performance of Python code itself. Cython achieves this goal by statically compiling Python sources to machine code (via a C compiler) and providing a syntax to specify the types of Python variables. PyPy instead relies on JIT compilation and type inference. Another group of code-generation tools comprises Python bindings to widely used general-purpose code-generation targets; this group includes the LLVM-Py [5], PyCUDA, and PyOpenCL [6] projects. In essence, these projects focus on accelerating Python through low-level code generation.

By contrast, we are interested primarily in the task of generating assembly code using Python. This style of Python-assisted assembly programming was pioneered by CorePy [7], the works of Malas et al. [8], Dongarra and Luszczek [9], and several other projects [10, 11]. CorePy [7] enabled a programmer to write an assembly program in Python and compile it from the Python interpreter. Its authors suggested using CorePy to optimize performance-sensitive parts of a Python application. Malas et al. [8] used Python to auto-tune assembly for the PowerPC 450 processors powering Blue Gene/P supercomputers. Their assembly framework could simulate the processor pipeline and reschedule instructions to avoid data hazards and improve performance. The assembly code generator of Dongarra and Luszczek [9] focuses on assembly programmer productivity, and is the most similar in spirit to PeachPy. The primary use-case for their code generator is cross-compilation to ARM, but they also supported a debug mode, where the code generator outputs equivalent C code.

The audience for PeachPy is an optimization expert writing multiple but similar kernels in assembly.
Listing 1 Minimal PeachPy example

from peachpy.x64 import *

abi = peachpy.c.ABI('x64-sysv')
assembler = Assembler(abi)
x_argument = peachpy.c.Parameter("x",
    peachpy.c.Type("uint32_t"))
arguments = (x_argument,)  # one-element tuple
function_name = "f"
microarchitecture = "SandyBridge"

with Function(assembler, function_name,
        arguments, microarchitecture):
    MOV( eax, 0 )
    RETURN()
print assembler  # prints the generated assembly

II. AUTOMATION OF ROUTINE TASKS

By design, PeachPy only includes automation of assembly programming tasks that will not adversely affect the efficiency of the generated code. Any automation choices which might affect performance are left to the programmer. This design differs from the philosophy of higher-level programming models, where the compiler must correctly handle all situations, even if doing so might result in suboptimal performance. PeachPy opts for relatively less automation, in part because we expect it will be used for codes where the high-automation approach of high-level compilers does not deliver good performance.
A. Register allocation

While PeachPy allows referencing registers by their standard names (e.g., rcx or xmm0 on x86), it also provides a virtual register abstraction. The programmer creates a virtual register by calling a register class constructor without parameters. Virtual registers can be used as instruction operands in the same way as physical registers. After PeachPy receives all instructions that constitute a function, it assigns each virtual register to a physical register and rewrites the corresponding instruction operands.
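As a minimal sketch (using the SSERegister class that appears in the listings below; the physical registers and instructions are ordinary SSE names):

# xmm_t is a virtual register: the constructor takes no parameters,
# and PeachPy assigns a physical register when the function is finalized.
xmm_t = SSERegister()
MOVAPS( xmm_t, xmm0 )  # virtual and physical registers mix freely
ADDPS( xmm_t, xmm1 )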
B. Constant allocation

Code and data are physically stored in different sections of the executable, and in assembly programming they are defined in different parts of the source file. The code where an in-memory constant is used might be far from the line where the constant is defined, so the programmer always has to keep in mind the names of all constants used in a function. PeachPy solves this problem by allowing constants to be defined exactly where they are used. When PeachPy finalizes an assembly function, it scans all instructions for such constants, coalesces equal constants (it can coalesce integer and floating-point constants with the same binary representation), and generates a constants section.
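For example, a constant could appear directly as an instruction operand at its point of use; the Constant constructor below is a hypothetical name, as this paper does not show the exact syntax:

# Hypothetical inline constant: defined at the use site and emitted
# into the constants section when the function is finalized.
VMULPS( ymm_x, ymm_x, Constant.float32x8(0.5) )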
C. Adaptation to calling conventions

Both the x86-64 and ARM architectures can be used with several variants of function calling conventions. These variants might differ in how parameters are passed to a function, or in which registers must be saved in the function's prolog and restored in the epilog. For assembly programmers, supporting multiple calling conventions requires having several versions of the assembly code. While these versions are mostly similar, maintaining them separately can quickly become a significant burden.

To assist in adapting a function to different calling conventions, PeachPy provides a pseudo-instruction, LOAD.PARAMETER, which loads a function parameter into a register. If the parameter is passed in a register and the destination operand of the LOAD.PARAMETER pseudo-instruction is a virtual register, PeachPy will bind this virtual register to the physical register in which the parameter is passed, and the LOAD.PARAMETER instruction will become a no-op.
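A minimal sketch of this mechanism, reusing x_argument from Listing 1 (the general-purpose register class name is an assumption):

# Load parameter "x" into a virtual register. If "x" arrives in a
# register under the active calling convention, PeachPy binds x_reg to
# that physical register and the pseudo-instruction becomes a no-op.
x_reg = GeneralPurposeRegister32()
LOAD.PARAMETER( x_reg, x_argument )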
III. METAPROGRAMMING

One of the goals of the PeachPy project was to simplify writing multiple similar compute kernels in assembly. To achieve this goal, PeachPy leverages the flexibility of Python to replace the macro preprocessors of traditional assemblers.
A. Custom Instructions

PeachPy users can define Python functions that can be used interchangeably with real assembly instructions, just as macros in traditional assembly can be used like instructions. Unlike macros, Python-based PeachPy functions that implement an instruction can use virtual registers to hold temporaries without making them part of the interface. Listing 2 shows how the SSE3 instruction HADDPD can be simulated using SSE2 instructions in PeachPy. The interface of the simulated HADDPD exactly matches the interface of the real HADDPD, so compute kernels which need this instruction can use it transparently.

Listing 2 Simulation of HADDPD with x86 SSE2 instructions

def HADDPD(xmm_dst, xmm_src):
    # Temporary virtual register, not part of the interface
    xmm_tmp = SSERegister()
    MOVAPD( xmm_tmp, xmm_dst )
    UNPCKLPD( xmm_dst, xmm_src )  # dst = [dst.lo, src.lo]
    UNPCKHPD( xmm_tmp, xmm_src )  # tmp = [dst.hi, src.hi]
    ADDPD( xmm_dst, xmm_tmp )     # dst = [dst.lo + dst.hi, src.lo + src.hi]
B. Parameterized Code Generation

With PeachPy, a programmer can use Python's control-flow statements and expressions to parametrize the generation of assembly code. An optimization expert might then tune the parameters to maximize performance. Common examples of such parameters in HPC kernels are loop unrolling factors and prefetch distances. An example of parameterized code generation for an array summation kernel appears in Listing 3.

Listing 3 Array summation kernel parametrized by loop unroll factor and prefetch distance, using x86 AVX instructions

def VECTOR_SUM(unroll_regs, prefetch_distance):
    ymm_acc = [AVXRegister()
               for _ in range(unroll_regs)]
    for i in range(unroll_regs):
        VXORPS( ymm_acc[i], ymm_acc[i] )
    LABEL( "next_batch" )
    PREFETCHNTA( [arrayPtr + prefetch_distance] )
    for i in range(unroll_regs):
        VADDPS( ymm_acc[i], [arrayPtr + i * 32] )
    ADD( arrayPtr, unroll_regs * 32 )
    SUB( arrayLen, unroll_regs * 8 )  # 8 floats per 32-byte register
    JNZ( "next_batch" )
    # Sum all ymm_acc registers to a scalar in ymm_acc[0]
    REDUCE.SUM( ymm_acc, peachpy.c.type("float") )
    VMOVSS( [sumPtr], ymm_acc[0].get_oword() )
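An optimization expert could then sweep the parameter space. The driver below is a hypothetical sketch that reuses the assembler, arguments, and microarchitecture objects from Listing 1; the parameter values are illustrative:

# Hypothetical tuning driver: emit one variant per parameter pair,
# to be benchmarked externally to select the fastest.
for unroll_regs in (1, 2, 4):
    for prefetch_distance in (64, 128, 256):
        name = "vector_sum_u%d_p%d" % (unroll_regs, prefetch_distance)
        with Function(assembler, name, arguments, microarchitecture):
            VECTOR_SUM(unroll_regs, prefetch_distance)
            RETURN()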
C. Generalized Kernels

Since instructions in PeachPy are first-class Python objects, it is easy to parametrize the operation or data type of a compute kernel, so that multiple similar kernels can be generated from a single generalized kernel. Listing 4 gives an example of one such generalized kernel.

Listing 4 Generalized kernel which generates addition and subtraction for 32-bit and 64-bit integer arrays

def VECTOR_OP(operation, data_size):
    # Select the SIMD instruction for this operation and element size
    SIMD_OP = {('Add', 4): VPADDD,
               ('Add', 8): VPADDQ,
               ('Sub', 4): VPSUBD,
               ('Sub', 8): VPSUBQ}[(operation, data_size)]
    LABEL( "next_batch" )
    xmm_x = SSERegister()
    xmm_y = SSERegister()
    VMOVDQU( xmm_x, [xPtr] )
    VMOVDQU( xmm_y, [yPtr] )
    SIMD_OP( xmm_x, xmm_x, xmm_y )
    VMOVDQU( [zPtr], xmm_x )
    for ptr in [xPtr, yPtr, zPtr]:
        ADD( ptr, 16 )
    SUB( length, 16 / data_size )
    JNZ( "next_batch" )
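All four concrete kernels can then be instantiated from this single generator; a hypothetical driver (with each call placed inside its own Function context, omitted here) might be:

# Instantiate every (operation, element size) combination,
# e.g., VPADDD for ('Add', 4) and VPSUBQ for ('Sub', 8).
for operation in ('Add', 'Sub'):
    for data_size in (4, 8):
        VECTOR_OP(operation, data_size)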
D. ISA-Specific Code Generation

A PeachPy user must specify the target microarchitecture when creating a Function object. This information is provided back to PeachPy kernels via static methods of the Target class. The Target.has_<isa-extension> methods indicate whether the target microarchitecture supports various ISA extensions, such as SSE4.2, AVX, FMA4, or ARM NEON. HPC kernels may use this information to benefit from new instructions without rewriting the whole kernel for each ISA level. Listing 5 shows how to make a dot product kernel use AVX, FMA3, or FMA4 instructions, depending on the target microarchitecture.

Listing 5 Dot product kernel which can use AVX, FMA3, or FMA4 instructions

def DOT_PRODUCT():
    if Target.has_fma4() or Target.has_fma3():
        MULADD = VFMADDPS if Target.has_fma4() \
                 else VFMADD231PS
    else:
        # Simulate fused multiply-add with separate AVX instructions
        def MULADD(ymm_x, ymm_a, ymm_b, ymm_c):
            ymm_t = AVXRegister()
            VMULPS( ymm_t, ymm_a, ymm_b )
            VADDPS( ymm_x, ymm_t, ymm_c )
    ymm_acc = AVXRegister()
    VXORPS( ymm_acc, ymm_acc )
    LABEL( "next_batch" )
    ymm_tmp = AVXRegister()
    VMOVAPS( ymm_tmp, [xPtr] )
    MULADD( ymm_acc, ymm_tmp, [yPtr], ymm_acc )
    ADD( xPtr, 32 )
    ADD( yPtr, 32 )
    SUB( length, 8 )
    JNZ( "next_batch" )
    REDUCE.SUM( [ymm_acc],
                peachpy.c.type("float") )
    VMOVSS( [resultPtr], ymm_acc.get_oword() )
E. Instruction Streams

By default, PeachPy adds each generated instruction to the active assembly function. However, it may also make sense to generate similar instruction streams separately and then combine them. One example is optimizing a kernel for the ARM Cortex-A9 microarchitecture. Cortex-A9 processors can decode two instructions per cycle, but only one of them can be a SIMD instruction, so performance may be improved by mixing scalar and vector operations. Bernstein and Schwabe [12] found that such instruction blending improves the performance of the Salsa20 stream cipher on the ARM Cortex-A8 microarchitecture, which has a similar limitation on instruction decoding. PeachPy instruction streams provide a mechanism to redirect generated instructions to a different sequence, and to later merge sequences of instructions. When an InstructionStream object is used with the Python with statement, all instructions generated in the with scope are added to the instruction stream instead of the active function. Instructions stored in an InstructionStream can then be added one-by-one to the active function (or to an active InstructionStream object) by calling the issue method. Listing 6 outlines the use of instruction stream objects to mix the scalar and vector variants of a kernel.

Listing 6 Use of instruction streams to interleave scalar and vector instructions

scalar_stream = InstructionStream()
with scalar_stream:
    # Scalar kernel
    ...

vector_stream = InstructionStream()
with vector_stream:
    # SIMD kernel
    ...

# Alternate until both streams are drained
while scalar_stream or vector_stream:
    # Mix scalar and vector instructions
    scalar_stream.issue()
    vector_stream.issue()
Instruction streams also help with software pipelining, which reorders a kernel so that by the time any instruction is decoded its inputs are already computed, and the instruction can issue immediately. Figure 1 illustrates this principle. Making a software-pipelined version of a kernel typically involves duplicating the kernel code twice, skewing the instruction columns relative to each other, and looping on the middle part of the resulting sequence. Listing 7 shows how instruction streams can separate similar instructions into columns for software pipelining.

Listing 7 Use of instruction streams to separate similar instructions for software pipelining

instruction_columns = [InstructionStream()
                       for _ in range(3)]
ymm_x = [AVXRegister() for _ in range(unroll_regs)]
for i in range(unroll_regs):
    with instruction_columns[0]:
        VMOVDQU( ymm_x[i], [xPtr + i * 32] )
    with instruction_columns[1]:
        # ymm_y is assumed to be initialized outside this fragment
        VPADDD( ymm_x[i], ymm_y )
    with instruction_columns[2]:
        VMOVDQU( [zPtr + i * 32], ymm_x[i] )
with instruction_columns[0]:
    ADD( xPtr, unroll_regs * 32 )
with instruction_columns[2]:
    ADD( zPtr, unroll_regs * 32 )