Using the general property that multidimensional DFTs are separable operations, the 2D DFT can be further decomposed as

  DFT_{n×m} = (DFT_n ⊗ I_m) (I_n ⊗ DFT_m),

where I_n ⊗ DFT_m is Stage 1 and DFT_n ⊗ I_m is Stage 2. The overall operation described mathematically above is depicted in Figure 3. The data is viewed as an n × m matrix stored in row-major order. The first stage of the 2D FFT applies 1D FFTs of size m along the rows, whereas the second stage applies 1D FFTs of size n along the columns. The above decomposition of the 2D FFT is called the pencil-pencil decomposition, where each pencil refers to a 1D FFT applied in one dimension. Recall that the 1D FFTs require strided access; therefore, for large 2D FFTs each of the pencils requires data to be accessed at a stride from memory.
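The loop structure implied by the pencil-pencil decomposition can be sketched in a few lines of C. The sketch is illustrative only; it assumes a hypothetical routine dft_1d(x, size, stride) that computes a single complex 1D DFT on double-precision data, where stride is the distance between consecutive pencil elements.

    #include <complex.h>

    /* Hypothetical 1D DFT on complex doubles; "stride" is the distance
       (in elements) between consecutive points of the pencil. */
    void dft_1d(double complex *x, int size, int stride);

    /* Pencil-pencil 2D FFT on an n x m matrix stored in row-major order. */
    void fft_2d(double complex *x, int n, int m)
    {
        /* Stage 1: I_n (x) DFT_m -- one contiguous pencil per row. */
        for (int r = 0; r < n; r++)
            dft_1d(x + r * m, m, 1);

        /* Stage 2: DFT_n (x) I_m -- one strided pencil per column,
           touching every m-th element. */
        for (int c = 0; c < m; c++)
            dft_1d(x + c, n, m);
    }

The second loop makes the strided-access problem explicit: every column pencil walks memory with stride m.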
The same approach can be applied to decompose the 3D FFT, where DFT_{k×n×m} represents the dense matrix:

  DFT_{k×n×m} = (DFT_k ⊗ I_{mn}) (I_k ⊗ DFT_n ⊗ I_m) (I_{kn} ⊗ DFT_m),

with the three factors representing Stage 3, Stage 2 and Stage 1, respectively. Data is viewed as a 3D cube of size k × n × m, stored in row-major order with the x-dimension, corresponding to the size m, laid out in the fastest memory dimension. The 3D FFT applies 1D FFTs in each of the three dimensions. The problem of accessing data at strides remains.

III. OVERLAPPING DATA MOVEMENT WITH COMPUTATION

The goal of our approach is to overlap data movement with computation so that data can be streamed from memory while computation is applied to data already resident in cache.

The ⊗ I_µ produces memory accesses in cacheline-sized packets of µ elements [20]. This modification offers benefits when implementing data movement with SIMD instructions. Moreover, it reduces false sharing when threads are applied for parallelism.

The data reshape technique can be extended to the 3D FFT implementation. However, the transposition in three dimensions becomes a rotation along the data cube's main diagonal. Similar to the 2D FFT, the rotation is meant to assist memory accesses in the subsequent stages. The rotation matrix K_m^{k,n} is defined as

  K_m^{k,n} = (L_m^{mk} ⊗ I_n) (I_k ⊗ L_m^{mn}).

It can be seen that the K matrix has two parts. The first part transposes the front xy-plane. The second part transposes the side xz-plane. The rotation is depicted in Figure 5. This formulation is an element-wise rotation. Therefore, for the same reasons discussed for the 2D case, we block the rotation to move entire cachelines. The adopted 3D FFT decomposition is represented as

  DFT_{k×n×m} = (K_{kµ}^{n,m/µ} ⊗ I_µ) (I_{nm/µ} ⊗ DFT_k ⊗ I_µ)   (Stage 3)
                (K_{nµ}^{m/µ,k} ⊗ I_µ) (I_{mk/µ} ⊗ DFT_n ⊗ I_µ)   (Stage 2)
                (K_{m/µ}^{k,n} ⊗ I_µ) (I_{kn} ⊗ DFT_m)            (Stage 1).
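As an illustration, the cacheline-blocked rotation K_{m/µ}^{k,n} ⊗ I_µ can be written as a copy loop that always moves µ contiguous elements at a time. The C sketch below is only meant to convey the access pattern; the packet size MU and the out-of-place formulation are assumptions for readability, not the generated SPIRAL code.

    #include <complex.h>
    #include <string.h>

    #define MU 4  /* assumed packet size: complex doubles per 64-byte cacheline */

    /* Blocked rotation (K_{m/MU}^{k,n} (x) I_MU): the k x n x m input cube
       (row-major, x fastest) is rotated to an (m/MU) x k x n cube of packets;
       each packet of MU consecutive x-elements stays contiguous. */
    void rotate_blocked(double complex *out, const double complex *in,
                        int k, int n, int m)
    {
        int mb = m / MU;  /* number of cacheline packets along x */
        for (int z = 0; z < k; z++)
            for (int y = 0; y < n; y++)
                for (int xb = 0; xb < mb; xb++)
                    memcpy(&out[((xb * k + z) * n + y) * MU],
                           &in[((z * n + y) * mb + xb) * MU],
                           MU * sizeof(double complex));
    }

Moving whole packets is what lets the write side use full cachelines instead of single elements.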
Fig. 5: 3D rotation applied on the data cube after each compute stage. The original data of size k × n × m is rotated to a cube of size m × k × n. The data in the xy-plane is moved to the data points in the yx-plane in the rotated data cube.

For the remainder of this section, we discuss overlapping computation and communication for the first stage. The same steps are applied to the other stages.

B. Data Movement From Memory vs. Cache

Recall that we want to overlap data movement from memory with computation on cached data. This means that it is necessary to separate main-memory data movement from cache data movement. Since all stages of a multidimensional FFT are similar, we will illustrate the separation of the two types of data movement using the first stage of the 3D FFT.

Working on cached data. The first step towards applying our approach is tiling the computation such that the data required for computation fits in the cache. This means that instead of applying all 1D FFTs at once, we apply a smaller batch by blocking the computation by b:

  (K_{m/µ}^{k,n} ⊗ I_µ) (I_{kn} ⊗ DFT_m) = (K_{m/µ}^{k,n} ⊗ I_µ) (I_{knm/b} ⊗ (I_{b/m} ⊗ DFT_m)),

where the innermost factor I_{b/m} ⊗ DFT_m is the Compute block. The batch size is determined by the size of the shared buffer. If b represents the size of the buffer, then b/m represents the total number of 1D pencils that can be applied, where m represents the size of the 1D FFT.

Each of the knm/b iterations applies a batch of 1D FFTs on the buffer of size b. We denote the I_{b/m} ⊗ DFT_m construct as the compute kernel. We further decorate the compute kernel with an overline to specify that the computation must be done in place, i.e. the input is overwritten with the computed result. The compute kernel is parallelized across the threads assigned for computation, or compute-threads.
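A sketch of this compute kernel in C, applied to the cached buffer t, might look as follows. The routine dft_m_inplace is a stand-in for the SPIRAL-generated, SIMD-vectorized 1D FFT of size m; the name is assumed here purely for illustration.

    #include <complex.h>

    /* Hypothetical in-place 1D FFT of size m on contiguous complex data. */
    void dft_m_inplace(double complex *x, int m);

    /* Compute kernel: overline(I_{b/m} (x) DFT_m) applied to the cached
       buffer t, which holds b complex elements, i.e. b/m contiguous pencils. */
    void compute_kernel(double complex *t, int b, int m)
    {
        for (int p = 0; p < b / m; p++)
            dft_m_inplace(t + p * m, m);
    }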
Identifying main memory access. Data needs to be read from memory and stored into the local buffer before computation can start. Once computation has completed, data needs to be stored back to main memory from the local buffer. The computation kernel determines the data movement. The SPL notation helps us capture the read/write operations. By reformulating the SPL formula, we obtain the following:

  (K_{m/µ}^{k,n} ⊗ I_µ) (I_{knm/b} ⊗ (I_{b/m} ⊗ DFT_m)) = I_{knm/b} ⊗ (W_{b,i} (I_{b/m} ⊗ DFT_m) R_{b,i}).

The three factors inside the tensor product correspond to the Store, Compute and Load operations, respectively. Let W_{b,i} and R_{b,i} denote the data movement matrices that write and read, respectively, blocks of size b to and from main memory in iteration i. The variable i takes values in 0, . . . , knm/b − 1.

We define two additional constructs, S_{n,b,i} and G_{n,b,i}, to help with the construction of the read and write matrices. The two constructs are rectangular matrices obtained by vertically/horizontally stacking all-zero matrices O_{u×v} of various sizes and the identity matrix I of size b:

  S_{n,b,i} = [ O_{ib×b} ; I_b ; O_{(n/b−i−1)b×b} ],   S_{n,b,i} ∈ R^{n×b}.

The G_{n,b,i} matrix is the transposed version of S_{n,b,i}. To better understand the meaning behind the two constructs, they can be viewed as sliding windows that read/write blocks of b elements from the input/output. If we consider I_n as a copy operation of n elements, then the two matrices are slices through the columns and rows of the identity matrix:

  I_n = [ S_{n,b,0}  S_{n,b,1}  . . .  S_{n,b,n/b−1} ] = [ G_{n,b,0} ; G_{n,b,1} ; . . . ; G_{n,b,n/b−1} ].

Combining the two constructs with the K matrix, we can define the data movement matrices W_{b,i} = (K_{m/µ}^{k,n} ⊗ I_µ) S_{knm,b,i} and R_{b,i} = G_{knm,b,i} I_{knm}.

We separate computation from data movement to and from off-chip memory. This creates three dependent tasks, as seen in Figure 6. The Load task moves b contiguous elements from the input to the cached buffer. The Compute task applies the b/m 1D FFTs of size m in place. The Store task copies data back to main memory from the cached buffer once computation has finished. The tasks are parallelized across the available threads.

Fig. 6: Separating data movement from computation for the first stage of the 3D FFT. Data is streamed into the local buffer. Computation applies batches of 1D FFTs in place. Data is rotated/transposed back to main memory.
C. Task Parallelization and Scheduling

The SPL notation also allows us to identify which components need to be parallelized. Recall that computation must be performed on data stored in the cached buffer. This implies that the parallelism must be applied to the I_{b/m} ⊗ DFT_m construct, which modifies the first child as follows:

  I_{knm/b} ⊗ (W_{b,i} (I_{b/m} ⊗ DFT_m) R_{b,i}) = I_{knm/b} ⊗ (W_{b,i} (I_{pc} ⊗ (I_{b/(m·pc)} ⊗ DFT_m)) R_{b,i}).

The left-hand side is parallelized over pd and pc; on the right-hand side, W_{b,i} and R_{b,i} are parallelized over the pd data-threads, while the middle factor is parallelized over the pc compute-threads. Given pc threads for computation, each thread applies its I_{b/(pc·m)} ⊗ DFT_m on its own disjoint data points.

Data movement is parallelized across pd threads. The matrix R_{b,i} copies contiguous blocks of size b from main memory to the cached buffer. Since the data is contiguous, it is streamed in. The matrix W_{b,i} writes blocks equal to the size of the cacheline from the cached buffer back to main memory. Since the data is rotated and thus written at strides, bandwidth utilization may drop. The write matrix is also parallelized across the pd threads.

We finally apply software pipelining [21] to the outermost construct, represented by I_{knm/b}. Software pipelining allows the skewing of the Load, Compute and Store tasks and permits the tasks to be executed in parallel. We group the Load and Store tasks into one task, with the observation that the store operation must precede the load operation. Table II shows the prologue, steady state and epilogue of the SPL construct. The prologue loads data into the shared buffer and signals the threads to start computation. The epilogue stores data back to main memory once computation has been completed. In steady state, data movement and computation are executed in parallel. While the compute threads apply computation on one half of the shared buffer, the data threads store and load data to and from main memory using the other half of the buffer. Given p, the total number of threads, this implies that p = pc + pd, where pc and pd represent the numbers of compute and data threads.

TABLE II: Applying software pipelining to the outer loop of the construct I_{knm/b} ⊗ (W_{b,i} (I_{b/m} ⊗ DFT_m) R_{b,i}).

  Iteration     | Store and Load with pd threads                    | Compute with pc threads                            | Phase
  i = 0         | t[i mod 2] = R_{b,i} x                            |                                                    | Prologue
  i = 1         | t[i mod 2] = R_{b,i} x                            | t[(i+1) mod 2] = (I_{b/m} ⊗ DFT_m) t[(i+1) mod 2]  |
  i = 2         | y = W_{b,i−2} t[i mod 2];  t[i mod 2] = R_{b,i} x | t[(i+1) mod 2] = (I_{b/m} ⊗ DFT_m) t[(i+1) mod 2]  | Steady
  . . .         | . . .                                             | . . .                                              | state
  i = knm/b − 1 | y = W_{b,i−2} t[i mod 2];  t[i mod 2] = R_{b,i} x | t[(i+1) mod 2] = (I_{b/m} ⊗ DFT_m) t[(i+1) mod 2]  |
  i = knm/b     | y = W_{b,i−2} t[i mod 2]                          | t[(i+1) mod 2] = (I_{b/m} ⊗ DFT_m) t[(i+1) mod 2]  | Epilogue
  i = knm/b + 1 | y = W_{b,i−2} t[i mod 2]                          |                                                    |
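The schedule in Table II maps onto a small double-buffered driver loop. The sketch below reuses the hypothetical load_task, compute_kernel and store_task routines from the earlier sketches and omits the thread synchronization; in the actual implementation the store-and-load half and the compute half run on disjoint thread groups separated by barriers.

    #include <complex.h>

    /* Hypothetical task routines introduced in the earlier sketches. */
    void load_task(double complex *t, const double complex *x, long b, long i);
    void compute_kernel(double complex *t, int b, int m);
    void store_task(double complex *y, const double complex *t, long b, long i,
                    long k, long n, long m);

    /* First stage of the 3D FFT with software-pipelined double buffering.
       t[0] and t[1] are the two halves of the cached buffer. */
    void stage1_pipelined(double complex *y, const double complex *x,
                          double complex *t[2],
                          long k, long n, long m, long b)
    {
        long iters = (k * n * m) / b;
        for (long i = 0; i <= iters + 1; i++) {
            /* Store the block computed two iterations ago (steady state / epilogue);
               the store precedes the load that reuses the same buffer half. */
            if (i >= 2)
                store_task(y, t[i % 2], b, i - 2, k, n, m);
            /* Load the next block (prologue / steady state). */
            if (i < iters)
                load_task(t[i % 2], x, b, i);
            /* Compute on the block loaded in the previous iteration. */
            if (i >= 1 && i <= iters)
                compute_kernel(t[(i + 1) % 2], (int)b, (int)m);
        }
    }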
D. Parallel Framework and Code Generation

The above mathematical descriptions are translated into code. We first construct a general C code template implementing the double-buffering mechanism. We parallelize the framework using OpenMP [22]; more precisely, we use a #pragma omp parallel region, since it gives better control over the threads. Based on the thread ID returned by omp_get_thread_num, we determine which threads do data movement and which do computation. Half the threads are used for data movement and half the threads are used for computation. We use kmp_affinity on the Intel architectures and sched_setaffinity on the AMD architectures to pin the threads to specific cores, and we explicitly use them within the C code. In addition, we use #pragma omp barrier to synchronize the threads.

The SPL notation used in this work to capture the computation and data movement is implemented within the SPIRAL system [15], which is a framework for automatically generating code for linear transforms. We use SPIRAL to automatically generate the computation and data movement using either AVX or SSE, depending on the underlying architecture we target. We parameterize the generated code by the thread ID and the socket ID, since each thread must do its own task. We copy the generated code into the template framework in order to obtain the full 2D and 3D FFT implementation.
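A stripped-down version of such a template is sketched below. It only illustrates the role split and the barrier-based hand-off between the two thread groups; the real template additionally pins threads with kmp_affinity/sched_setaffinity, distributes the copies and pencils across several threads per group, and calls the SPIRAL-generated kernels. The task routines are the hypothetical ones from the earlier sketches, and at least two OpenMP threads are assumed.

    #include <omp.h>
    #include <complex.h>

    void load_task(double complex *t, const double complex *x, long b, long i);
    void compute_kernel(double complex *t, int b, int m);
    void store_task(double complex *y, const double complex *t, long b, long i,
                    long k, long n, long m);

    /* Double-buffered first stage: for clarity only thread 0 moves data and
       thread 1 computes; all threads meet at the barrier every iteration. */
    void stage1_omp(double complex *y, const double complex *x,
                    double complex *t[2], long k, long n, long m, long b)
    {
        long iters = (k * n * m) / b;
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            for (long i = 0; i <= iters + 1; i++) {
                if (tid == 0) {                 /* data-thread: Store then Load */
                    if (i >= 2)    store_task(y, t[i % 2], b, i - 2, k, n, m);
                    if (i < iters) load_task(t[i % 2], x, b, i);
                } else if (tid == 1) {          /* compute-thread */
                    if (i >= 1 && i <= iters)
                        compute_kernel(t[(i + 1) % 2], (int)b, (int)m);
                }
                #pragma omp barrier             /* hand buffer halves over */
            }
        }
    }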
IV. MITIGATING INTERFERENCE

Threads/cores share resources such as the execution pipeline, the cache hierarchy, main memory and the links between multiple sockets. Threads/cores contend for these resources. Usually, interference on the shared resources causes slower execution. In this section, we focus on the main causes of interference and provide solutions to reduce the effects of possible conflicts and thus increase overall performance.

A. Single Socket Execution

Both Intel and AMD offer multiple threads that share the same floating point functional units and have private and/or shared caches. Pinning the threads to specific cores influences the overall performance of the application. We discuss the interference at the functional unit and cache level and show solutions to reduce contention.

Interference in the execution pipeline. Irrespective of vendor, threads share the floating point functional units. Therefore, in our approach we group one data-thread and one compute-thread. The threads are pinned together to the same core to share the functional units. Data-threads only load and store data. Compute-threads execute some load and store operations; however, they predominantly execute computation such as additions and multiplications. On most architectures, load/store instructions and arithmetic instructions are issued to different pipelines. Choosing threads with the same instruction mix is not recommended, since the threads will conflict for the same execution pipeline. We use NOP instructions interleaved within the data-threads' task to allow the compute-threads to issue their loads and make progress. Data-threads issue only load/store operations, thus they may fully occupy the load/store pipeline. Even though compute-threads have fewer load/store operations, they still require some for computation.
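One way to interleave NOPs in the data-thread copy loop is shown below. The inline-assembly form and the count of four NOPs per copied cacheline are illustrative assumptions, not values prescribed here; the right amount of throttling is a per-machine tuning parameter.

    #include <complex.h>
    #include <string.h>

    #define MU 4  /* assumed packet size: complex doubles per cacheline */

    /* Load task with NOPs interleaved after every cacheline-sized copy,
       leaving issue slots for the compute-thread sharing the core's
       load/store ports. */
    void load_task_throttled(double complex *t, const double complex *x,
                             long b, long i)
    {
        const double complex *src = x + i * b;
        for (long p = 0; p < b / MU; p++) {
            memcpy(t + p * MU, src + p * MU, MU * sizeof(double complex));
            __asm__ volatile("nop; nop; nop; nop");   /* back off briefly */
        }
    }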
Interference at the cache hierarchy. Pinning a data-thread and a compute-thread so that they share the functional units also makes the threads share some of the cache levels. For example, on the Intel architectures, the two threads/hyperthreads on a core share both the L1 and L2 cache. In addition to the different mixes of instructions, the two types of threads also have different memory access patterns. Data-threads stream through the data while reading and rotate the data on the write back. The FFT, however, accesses data at strides. The different access patterns cause cache evictions.

Non-temporal loads and stores. Recall that non-temporal loads and stores bypass the cache hierarchy to reduce cache pollution. However, not all loads and stores within the code must be non-temporal. In our approach, only the W_{b,i} and R_{b,i} matrices must utilize the non-temporal operations, since those are the only operations that move data to and from memory. The matrix R_{b,i} must read data non-temporally; however, it must store the data temporally in the shared buffer, because the compute-threads must apply the FFT on that data in the subsequent iteration. The matrix W_{b,i}, in contrast, can read and write non-temporally, since the computed data is not required until the next FFT stage. The write matrix non-temporally stores cacheline blocks at large strides in main memory.
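As an illustration, the µ-element packet copy inside the Store task can use streaming stores so that the written blocks do not displace the cached buffer. The sketch assumes AVX, complex doubles treated as plain doubles, a packet of one 64-byte cacheline, and a 32-byte-aligned destination; it shows only the store side, since the reads come from the cached buffer and hit in cache anyway.

    #include <immintrin.h>

    /* Copy one cacheline-sized packet (4 complex doubles = 8 doubles) from
       the cached buffer to its rotated destination in main memory using
       non-temporal stores, bypassing the cache hierarchy on the write path. */
    static inline void stream_packet(double *dst, const double *src)
    {
        __m256d lo = _mm256_loadu_pd(src);        /* regular load from cache   */
        __m256d hi = _mm256_loadu_pd(src + 4);
        _mm256_stream_pd(dst, lo);                /* non-temporal 32-byte store */
        _mm256_stream_pd(dst + 4, hi);
    }

An _mm_sfence() after the copy loop orders the streaming stores before the threads synchronize on the barrier.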
Cache aware FFT. Computation is made cacheline aware, since the FFT is computed at cacheline granularity. Recall that a 1D FFT accesses data at non-unit strides. Accessing data at large strides may cause data to be placed in the same cache set and thus evict other cachelines before the remaining data within a cacheline is fully consumed. We use SIMD instructions such as SSE and AVX to implement the computation. We follow the details from [18], where the FFT performs a data-format change of the complex data storage between compute stages. The format change swaps from complex interleaved, where the real and imaginary components are interleaved in memory, to a block interleaved format, where blocks of the real components and blocks of the imaginary components are consecutive in memory. This format change is meant to make computation more efficient. Separating the real and imaginary components and using AVX instructions allows computation to be done at cacheline granularity. Since the 2D and 3D FFTs have multiple stages, the format change is applied once in the first stage, the rest of the computation is done in the block interleaved format, and in the last stage the data is changed back to complex interleaved. This change of format is different from the one presented in [9], which is meant to improve communication between threads.
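The per-cacheline format change can be pictured as follows. The sketch converts one packet of four complex doubles from complex-interleaved to block-interleaved layout; the packet size of four and the plain-C formulation are illustrative assumptions, as the generated code performs the equivalent shuffle with AVX intrinsics.

    /* Convert one cacheline packet of 4 complex doubles from complex
       interleaved (re0, im0, re1, im1, ...) to block interleaved
       (re0..re3, im0..im3). */
    static void to_block_interleaved(double *dst, const double *src)
    {
        for (int j = 0; j < 4; j++) {
            dst[j]     = src[2 * j];      /* block of real components      */
            dst[4 + j] = src[2 * j + 1];  /* block of imaginary components */
        }
    }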
Cache aware buffer allocation. The cached buffer t shared by the data-threads and the compute-threads must reside within the cache hierarchy. More precisely, the buffer must be located in the last level cache (LLC), since it is shared between all the threads. We set the size of the buffer equal to half of the LLC, b = size_LLC / 2. The buffer cannot fully occupy the LLC, since computation also requires extra temporaries for storing partial results and constants such as the twiddle factors.
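A possible way to size and allocate the shared buffer is sketched below. The LLC size is assumed to be known (queried from the system or hard-coded per machine), the two pipeline blocks are assumed to hold b elements each so that together they fill half of the LLC, and the allocation is aligned to the cacheline so that µ-element packets never straddle lines; the exact split is a tuning assumption.

    #include <stdlib.h>
    #include <complex.h>

    /* Allocate the shared double buffer: two blocks of b complex elements,
       sized so that together they occupy half of the last-level cache and
       aligned to the 64-byte cacheline. Returns the underlying allocation. */
    static double complex *alloc_shared_buffer(size_t llc_bytes,
                                               double complex *t[2],
                                               size_t *b_elems)
    {
        *b_elems = (llc_bytes / 2) / (2 * sizeof(double complex));
        double complex *buf =
            aligned_alloc(64, 2 * (*b_elems) * sizeof(double complex));
        t[0] = buf;             /* first pipeline block  */
        t[1] = buf + *b_elems;  /* second pipeline block */
        return buf;
    }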
B. Dual Socket Execution

We extend our approach to two-socket systems, where each socket belongs to a Non-Uniform Memory Access (NUMA) domain. Each NUMA domain has private main memory. The sockets can access their own memory through high-bandwidth buses, and they can access the neighbor's main memory through data links such as Intel's QuickPath Interconnect (QPI) or AMD's HyperTransport (HT). Figure 7 shows the topology of a typical two-socket system. It is worth noticing that the bandwidth to main memory within a NUMA domain is higher than the bandwidth over the data links. The difference in bandwidth suggests that computation should be done on data stored locally and that transfers over the interconnects should be kept to a minimum.

Fig. 7: Two-socket system with two NUMA domains. Each NUMA domain consists of one CPU and local main memory. The two processors are connected via QPI (Intel) or HT (AMD).

We only extend the 3D FFT to two sockets. The expression

  I_{knm/b} ⊗ (W_{b,i} (I_{b/m} ⊗ DFT_m) R_{b,i}) = I_{sk} ⊗ (I_{knm/(b·sk)} ⊗ (W_{b,i,sk} (I_{b/m} ⊗ DFT_m) R_{b,i,sk})),

where the left-hand side is parallelized over sk, pc and pd and the inner construct on the right-hand side is parallelized over pc and pd, specifies the order in which we parallelize the first stage of the 3D FFT; the same steps are taken for the other two stages. We parallelize the construct first over the number of sockets sk and then over the number of data-threads pd and compute-threads pc. The parallelism over the sockets modifies the read and write matrices, since all data points are required for the computation of the 3D FFT. Thus, data needs to be exchanged across the data links.

Since communication over the QPI/HT links is costly, we apply a slab-pencil decomposition to the 3D FFT, where the first two stages communicate only within the NUMA domain. We split the data set in the z-dimension, where each socket receives a contiguous block of size k/sk × n × m. Each socket can compute a 2D FFT locally and transpose locally, without crossing the interconnect. In order to compute the 1D FFT in the z-dimension, data needs to be exchanged across the data links. Another communication over the interconnect is needed once the 1D pencil is fully computed, so that data is put in the correct order. Figure 8 depicts the data movement in the three compute stages of the 3D FFT. Table III presents the generalized versions of the write matrices W_{b,i}. All three matrices are parameterized by the number of sockets sk. By setting the number of sockets to sk = 1, the implementation defaults to the single-socket implementation. Reading data is done by each socket from its local memory and has the same representation as in the single-socket implementation.

Fig. 8: The three different data cube shapes after each 3D FFT stage. Data is distributed across the k-dimension to each socket. The first stage reads and writes the data locally, while the other two stages read data locally but write data across the sockets.

TABLE III: The SPL representations for the three write matrices (rotation matrices) applied after each compute stage.

  Matrix        | SPL representation
  W^1_{i,b,sk}  | (I_{sk} ⊗ K_{m/µ}^{n,k/sk} ⊗ I_µ) S_{knm,b,i}
  W^2_{i,b,sk}  | (L_{nm/µ}^{sknm/µ} ⊗ I_{kµ/sk}) (I_{sk} ⊗ K_{n}^{k/sk,m/µ} ⊗ I_µ) S_{knm,b,i}
  W^3_{i,b,sk}  | (L_{k}^{sk·k} ⊗ I_{mn/sk}) (I_{sk} ⊗ K_{k}^{m/µ,n/sk} ⊗ I_µ) S_{knm,b,i}
V. EXPERIMENTAL RESULTS

Experimental setup. We now evaluate the performance of our approach. We run experiments on Intel and AMD systems with one or two sockets. All systems provide multiple cores/threads that share a large last level cache. All Intel architectures have hyperthreading enabled. AMD does not offer hyperthread support. For the dual-socket systems we configure the QPI/HT protocol to use Home Snoop. The Home Snoop protocol improves bandwidth over the interconnect, since it reduces the cache traffic used for cache coherence.

1) One socket systems: Intel Haswell 4770K, Intel Kaby Lake 7700K and AMD FX-8350 - 8 threads, 8 MB L3 cache, 32/64/64 GB DRAM, bandwidth 20/40/12 GB/s
2) Two socket systems: Intel Haswell 2667v3, AMD 6276 Interlagos (Bluewaters) - 16 threads, 20/16 MB L3 cache, 256/64 GB DRAM, bandwidth 85/20 GB/s

We compare our implementation to MKL 2017.0 and FFTW 3.3.6 on the Intel architectures. On the AMD systems we compare our implementation only against FFTW 3.3.6. All libraries are compiled with OpenMP, AVX and SSE enabled. On the Intel architectures, we use MKL_DYNAMIC, MKL_NUM_THREADS and KMP_AFFINITY to control the number of threads and the placement of the threads within the MKL library. On the AMD architectures, we use OMP_NUM_THREADS and GOMP_AFFINITY to control the threads within the FFTW library. We compile all code with the ICC compiler version 2017.0 and the GCC compiler 4.8.5-20150623. All code is compiled with the -O3 flag. We do not compare the current implementation against the SPIRAL-generated parallel code presented in [20]; that previous parallel implementation targeted medium-size 1D FFTs and did not offer support for compute/communication overlap.

Performance metric. We report performance for our approach, MKL and FFTW as billions of floating point operations divided by the runtime in seconds. We use 5N log(N) as an estimate for the flop count, which over-estimates the number of operations. The resulting Pseudo-Gflop/s is proportional to inverse runtime and is an accepted performance metric for the FFT. We use the rdtsc time-stamp counter and the CPU frequency to compute the overall execution time.

We also compare the performance of all three parallel implementations against the performance achievable when streaming data in and out of cache at bandwidth speed. For the upper bound we consider infinite compute resources and do not take computation time into account. We use the STREAM benchmark [1] to determine the achievable bandwidth (GB/s) for each targeted architecture. We determine the performance when streaming the total amount of data as

  P_io = (5 · N · log(N) · bandwidth_STREAM) / (2 · N · nr_stages · sizeof(double)),

where nr_stages represents the number of compute stages in the FFT, N represents the size of the FFT, and sizeof(double) represents the size of a double precision floating point number in GB. The current implementation offers support for complex numbers, therefore the total size is multiplied by two.
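For reference, both metrics can be computed as in the short helper below; the STREAM bandwidth value and the routine names are placeholders for the measured inputs.

    #include <math.h>

    /* Pseudo-Gflop/s of a measured run: 5*N*log2(N) flops over the runtime. */
    double pseudo_gflops(double N, double seconds)
    {
        return 5.0 * N * log2(N) / seconds / 1e9;
    }

    /* Bandwidth-derived upper bound P_io: the runtime is assumed to be the
       time needed to stream 2*N complex doubles per compute stage at the
       measured STREAM bandwidth (in GB/s). */
    double p_io_gflops(double N, double nr_stages, double stream_gb_per_s)
    {
        double gb_moved = 2.0 * N * nr_stages * sizeof(double) / 1e9;
        double seconds  = gb_moved / stream_gb_per_s;
        return 5.0 * N * log2(N) / seconds / 1e9;
    }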
2D FFT. We first present results for the 2D FFT implementation. Figure 9 shows the results of the three implementations compared to the achievable peak when streaming data at 40 GB/s. Our double-buffering approach achieves on average 74% of the achievable peak, whereas the MKL and FFTW implementations achieve on average 50% of peak. Two aspects are worth noticing. First, for small sizes bandwidth utilization is less than 80%, since the number of iterations iter = mn/b in each compute stage is small. For example, iter = 4 when b = 131,072, m = 512 and n = 1,024. Second, as the size of the 2D FFT increases, bandwidth utilization drops. Recall that after each compute stage data needs to be transposed, similarly to the rotation described in Figure 6. The transposition is applied on a panel of size b/m × m, where m is the size of the 1D FFT and b is the size of the shared buffer. As m increases, b/m decreases, therefore TLB misses cannot be amortized. We leave as future work other methods of separating data movement from computation for the cases where the size of the 1D FFT is equal to or greater than the size of the shared buffer.

Fig. 9: The plot shows performance of the 2D FFT on the Intel Kaby Lake 7700K. We compare against the achievable performance when data is streamed at bandwidth speed. Our approach achieves on average 75% of peak. The labels on the bars show unnormalized performance.

3D FFT. We show results for the 3D FFT on multiple architectures and compare them against the achievable peak. We evaluate large 3D FFTs that do not fit in the cache. Figure 1 shows the results on the Intel Kaby Lake 7700K. Our implementation achieves 80% to 90% of the practical peak, whereas MKL and FFTW achieve at most 47%. Our approach uses the bandwidth and the cache hierarchy more efficiently and therefore outperforms MKL and FFTW by almost 3x. It is important to state that the 3D FFT does not experience the problems displayed by the 2D FFT. First, the number of iterations in each compute stage is higher. For example,

We further present scaling results keeping the problem size fixed and increasing the number of sockets from one to two. The bottom two plots in Figure 11 show the results on the two-socket Intel and AMD architectures. For the Intel architecture, given the data movement in Figure 8, our approach improves performance on average by 1.7x when increasing the number of sockets. Communication over the QPI link and conflicts between the two types of threads limit the overall improvement. On the AMD system the HT link runs at a bandwidth similar to that of the bus to main memory, therefore the slowdown caused by the interconnect is smaller. We do not report comparison results against FFTW for the AMD two-socket system, since the FFTW library misbehaves on the Bluewaters system and produces incorrect results.