Debunking the 100X GPU vs. CPU Myth:
An Evaluation of Throughput Computing on CPU and GPU

ABSTRACT
Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect of emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today's multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such a large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5X on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.

Categories and Subject Descriptors
C.1.4 [Processor Architecture]: Parallel architectures; C.4 [Performance of Systems]: Design studies; D.3.4 [Software]: Processors—Optimization

General Terms
Design, Measurement, Performance

Keywords
CPU architecture, GPU architecture, Performance analysis, Performance measurement, Software optimization, Throughput Computing

ISCA'10, June 19–23, 2010, Saint-Malo, France.
Copyright 2010 ACM 978-1-4503-0053-7/10/06 ...$10.00.

1. INTRODUCTION
The past decade has seen a huge increase in digital content as more documents are being created in digital form than ever before. Moreover, the web has become the medium of choice for storing and delivering information such as stock market data, personal records, and news. Soon, the amount of digital data will exceed exabytes (10^18 bytes) [31]. The massive amount of data makes storing, cataloging, processing, and retrieving information challenging. A new class of applications has emerged across different domains such as database, games, video, and finance that can process this huge amount of data to distill and deliver appropriate content to users. A distinguishing feature of these applications is that they have plenty of data-level parallelism and the data can be processed independently and in any order on different processing elements for a similar set of operations such as filtering, aggregating, ranking, etc. This feature together with a processing deadline defines throughput computing applications. Going forward, as digital data continues to grow rapidly, throughput computing applications are essential in delivering appropriate content to users in a reasonable duration of time.
Two major computing platforms are deemed suitable for this new class of applications. The first one is the general-purpose CPU (central processing unit) that is capable of running many types of applications and has recently provided multiple cores to process data in parallel. The second one is the GPU (graphics processing unit) that is designed for graphics processing with many small processing elements. The massive processing capability of the GPU allures some programmers to start exploring general-purpose computing with the GPU. This gives rise to the GPGPU field [3, 33].
Fundamentally, CPUs and GPUs are built based on very different philosophies. CPUs are designed for a wide variety of applications and to provide fast response times to a single task. Architectural advances such as branch prediction, out-of-order execution, and super-scalar execution (in addition to frequency scaling) have been responsible for performance improvement. However, these advances come at the price of increasing complexity/area and power consumption. As a result, mainstream CPUs today can pack only a small number of processing cores on the same die to stay within the power and thermal envelopes. GPUs on the other hand are built specifically for rendering and other graphics applications that have a large degree of data parallelism (each pixel on the screen can be processed independently). Graphics applications are also latency tolerant (the processing of each pixel can be delayed as long as frames are processed at interactive rates). As a result, GPUs can trade off single-thread performance for increased parallel processing. For instance,
GPUs can switch from processing one pixel to another when long-latency events such as memory accesses are encountered, and can switch back to the former pixel at a later time. This approach works well when there is ample data-level parallelism. The speedup of an application on GPUs is ultimately limited by the percentage of the scalar section (in accordance with Amdahl's law).
One interesting question is the relative suitability of CPUs or GPUs for throughput computing workloads. CPUs have been the main workhorse for traditional workloads and would be expected to do well for throughput computing workloads. There is little doubt that today's CPUs would provide the best single-thread performance for throughput computing workloads. However, the limited number of cores in today's CPUs limits how many pieces of data can be processed simultaneously. On the other hand, GPUs provide many parallel processing units which are ideal for throughput computing. However, the design for the graphics pipeline lacks some critical processing capabilities (e.g., large caches) for general-purpose workloads, which may result in lower architecture efficiency on throughput computing workloads.
This paper attempts to correlate throughput computing characteristics with architectural features on today's CPUs and GPUs and provides insights into why certain throughput computing kernels perform better on CPUs and others work better on GPUs. We use a set of kernels and applications that have been identified by previous studies [6, 10, 13, 44] as important components of throughput computing workloads. We highlight the importance of platform-specific software optimizations, and recommend an application-driven design methodology that identifies essential hardware architecture features based on application characteristics.
This paper makes the following contributions:
• We reexamine a number of claims [9, 19, 21, 32, 42, 45, 47, 53] that GPUs perform 10X to 1000X better than CPUs on a number of throughput kernels/applications. After tuning the code for BOTH CPU and GPU, we find the GPU only performs 2.5X better than the CPU. This puts CPU and GPU roughly in the same performance ballpark for throughput computing.
• We provide a systematic characterization of throughput computing kernels regarding the types of parallelism available, the compute and bandwidth requirements, the access pattern and the synchronization needs. We identify the important software optimization techniques for efficient utilization of CPU and GPU platforms.
• We analyze the performance difference between CPU and GPU and identify the key architecture features that benefit throughput computing workloads.
This paper is organized as follows: Section 2 discusses the throughput computing workloads used for this study. Section 3 describes the two main compute platforms – CPUs and GPUs. Section 4 discusses the performance of our throughput computing workloads on today's compute platforms. Section 5 provides a platform-specific optimization guide and recommends a set of essential architecture features. Section 6 discusses related work and Section 7 concludes our findings.

2. THE WORKLOAD: THROUGHPUT COMPUTING KERNELS
We analyzed the core computation and memory characteristics of recently proposed benchmark suites [6, 10, 13, 44] and formulated the set of throughput computing kernels that capture these characteristics. These kernels have a large amount of data-level parallelism, which makes them a natural fit for modern multi-core architectures. Table 1 summarizes the workload characterization. We classify these kernels based on (1) their compute and memory requirements, (2) the regularity of memory accesses, which determines the ease of exploiting data-level parallelism (SIMD), and (3) the granularity of tasks, which determines the impact of synchronization. These characteristics provide insights into the architectural features that are required to achieve good performance.
1. SGEMM (both dense and sparse) is an important kernel that is an integral part of many linear algebra numerical algorithms, such as linear solvers. SGEMM is characterized by regular access patterns and therefore maps to SIMD architectures in a straightforward manner. Threading is also simple, as matrices can be broken into sub-blocks of equal size which can be operated on independently by multiple threads. SGEMM performs O(n^3) compute, where n is the matrix dimension, and has O(n^2) data accesses. The ratio of compute to data accesses is O(n), which makes SGEMM a compute-bound application when properly blocked.
2. MC or Monte Carlo randomly samples a complex function, with an unknown or highly complex analytical representation, and averages the results. We use an example of Monte Carlo from computational finance for pricing options [34]. It simulates a random path of an underlying stock over time and calculates a payoff from the option at the end of the time step. It repeats this step many times to collect a large number of samples which are then averaged to obtain the option price. Monte Carlo algorithms are generally compute-bound with regular access patterns, which makes them a very good fit for SIMD architectures.
3. Conv or convolution is a common image filtering operation used for effects such as blur, emboss and sharpen. Its arithmetic computations are simple multiply-add operations and its memory accesses are regular within a small neighborhood. Each pixel is calculated independently, thus providing ample parallelism at both SIMD and thread level. Though its compute-to-memory characteristic varies depending on the filter size, in practice it usually exhibits a high compute-to-memory ratio. Its sliding-window-style access pattern gives rise to a memory alignment issue in SIMD computations. Also, multi-dimensional convolutions incur non-sequential data accesses, which require good cache blocking for high performance.
4. FFT or Fast Fourier Transform is one of the most important building blocks for signal processing applications. It converts signals from the time domain to the frequency domain, and vice versa. FFT is an improved algorithm to implement the Discrete Fourier Transform (DFT). DFT requires O(n^2) operations and FFT improves it to O(n log n). FFT algorithms have been studied exhaustively [26]. Though various optimizations have been developed for each usage model/hardware platform, their basic behavior is similar. It is composed of log n stages of the butterfly computation followed by a bit-reverse permutation. Arithmetic computations are simple floating-point multiply-adds, but data access patterns are non-trivial pseudo-all-to-all communication, which makes parallelization and SIMDification difficult. Therefore, many studies [7] have focused on the challenges of implementing FFT well on multi-core wide-SIMD architectures.
5. SAXPY or Scalar Alpha X Plus Y is one of the functions in the Basic Linear Algebra Subprograms (BLAS) package and is a combination of scalar multiplication and vector addition. It has a regular access pattern and maps well to SIMD. The use of TLP requires only a simple partitioning of the vector. For long vectors that do not fit into the on-die storage, SAXPY is bandwidth bound. For very short vectors, SAXPY spends a large portion of its time performing a horizontal reduction operation.
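As a concrete illustration of why SAXPY maps trivially to SIMD and TLP yet remains bandwidth bound for long vectors, the sketch below (our own illustration, not the code measured in this paper) uses 4-wide SSE intrinsics and an OpenMP loop to split the vector across cores. Every element is touched exactly once, so the 2 flops per element are dwarfed by the 12 bytes of memory traffic per element.

```cpp
#include <immintrin.h>  // SSE intrinsics (4-wide single precision)

// y[i] = alpha * x[i] + y[i]  -- the SAXPY kernel from BLAS.
// Assumes x and y are 16-byte aligned and n is a multiple of 4,
// purely to keep the sketch short.
void saxpy(float alpha, const float* x, float* y, int n) {
    const __m128 va = _mm_set1_ps(alpha);        // broadcast alpha to all 4 lanes
    #pragma omp parallel for                      // TLP: partition the vector across cores
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_load_ps(x + i);           // regular, contiguous loads map directly to SIMD
        __m128 vy = _mm_load_ps(y + i);
        vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);  // 2 flops per element...
        _mm_store_ps(y + i, vy);                  // ...for 12 bytes of traffic per element
    }
}
```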
Kernel | Application | SIMD | TLP | Characteristics
SGEMM (SGEMM) [48] | Linear algebra | Regular | Across 2D tiles | Compute bound after tiling
Monte Carlo (MC) [34, 9] | Computational Finance | Regular | Across paths | Compute bound
Convolution (Conv) [16, 19] | Image Analysis | Regular | Across pixels | Compute bound; BW bound for small filters
FFT (FFT) [17, 21] | Signal Processing | Regular | Across smaller FFTs | Compute/BW bound depending on size
SAXPY (SAXPY) [46] | Dot Product | Regular | Across vector | BW bound for large vectors
LBM (LBM) [32, 45] | Time Migration | Regular | Across cells | BW bound
Constraint Solver (Solv) [14] | Rigid body physics | Gather/Scatter | Across constraints | Synchronization bound
SpMV (SpMV) [50, 8, 47] | Sparse Solver | Gather | Across non-zeros | BW bound for typical large matrices
GJK (GJK) [38] | Collision Detection | Gather/Scatter | Across objects | Compute bound
Sort (Sort) [15, 39, 40] | Database | Gather/Scatter | Across elements | Compute bound
Ray Casting (RC) [43] | Volume Rendering | Gather | Across rays | 4-8MB first-level working set; over 500MB last-level working set
Search (Search) [27] | Database | Gather/Scatter | Across queries | Compute bound for small trees; BW bound at bottom of tree for large trees
Histogram (Hist) [53] | Image Analysis | Requires conflict detection | Across pixels | Reduction/synchronization bound
Bilateral (Bilat) [52] | Image Analysis | Regular | Across pixels | Compute bound
Table 1: Throughput computing kernel characteristics. The referenced papers contain the best previously reported performance numbers on CPU/GPU platforms. Our optimized performance numbers are at least on par with or better than those numbers.
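To make the "compute bound after tiling" classification in Table 1 concrete, the following plain C++ sketch (our own illustration, not the tuned MKL or GTX280 SGEMM evaluated later) shows the cache-blocking structure only: with three b x b tiles resident in cache, each element brought on-die is reused b times, which is the O(n) compute-to-data-access ratio noted for SGEMM above. Production SGEMM additionally adds SIMD, register blocking, and multithreading over tiles.

```cpp
#include <algorithm>
#include <cstddef>

// C += A * B for n x n row-major matrices, blocked into TILE x TILE sub-blocks.
// TILE is chosen so that three float tiles fit in cache (3 * TILE^2 * 4 bytes).
constexpr std::size_t TILE = 64;

void sgemm_blocked(const float* A, const float* B, float* C, std::size_t n) {
    for (std::size_t i0 = 0; i0 < n; i0 += TILE)
        for (std::size_t k0 = 0; k0 < n; k0 += TILE)
            for (std::size_t j0 = 0; j0 < n; j0 += TILE)
                // Multiply one pair of cached tiles into the cached C tile.
                for (std::size_t i = i0; i < std::min(i0 + TILE, n); ++i)
                    for (std::size_t k = k0; k < std::min(k0 + TILE, n); ++k) {
                        const float a = A[i * n + k];
                        for (std::size_t j = j0; j < std::min(j0 + TILE, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Each doubling of the tile size doubles the arithmetic performed per byte of off-chip traffic, which is why a properly blocked SGEMM is limited by flops rather than bandwidth.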
6. LBM or Lattice Boltzmann Method is a computational fluid dynamics method. LBM uses the discrete Boltzmann equation to simulate the flow of a Newtonian fluid instead of solving the Navier-Stokes equations. In each time step, for a D3Q19 lattice, LBM traverses the entire 3D fluid lattice and for each cell computes new distribution function values from the cell's 19 neighbors (including itself). Within each time step, the lattice can be traversed in any order as values from the neighbors are computed from the previous time step. This aspect makes LBM suitable for both TLP and DLP. LBM has O(n) compute and requires O(n) data, where n is the number of cells. The working set consists of the data of the cell and its 19 neighbors. The reuse of these values is substantially less than in convolution. Large caches do not improve the performance significantly. The lack of reuse also means that the compute-to-bandwidth ratio is low; LBM is usually bandwidth bound.
7. Solv or constraint solver is a key part of game physics simulators. During the execution of the physical simulation pipeline, a collision detection phase computes pairs of colliding bodies, which are then used as inputs to a constraint solving phase. The constraint solver operates on these pairs and computes the separating contact forces, which keep the bodies from inter-penetrating one another. The constraints are typically divided into batches of independent constraints [14]. SIMD and TLP are both exploited among independent constraints. Exploiting SIMD parallelism is however challenging due to the presence of gather/scatter operations required to gather/scatter object data (position, velocity) for different objects. Ideally, the constraint solver should be bandwidth bound, because it iterates over all constraints in a given iteration and the number of constraints for realistic large-scale destruction scenes exceeds the capacity of today's caches. However, practical implementations suffer from synchronization costs across sets of independent constraints, which limits performance on current architectures.
8. SpMV or sparse matrix vector multiplication is at the heart of many iterative solvers. There are several storage formats for sparse matrices, compressed row storage being the most common. Computation in this format is characterized by regular access patterns over non-zero elements and irregular access patterns over the vector, based on the column index. When the matrix is large and does not fit into on-die storage, a well-optimized kernel is usually bandwidth bound.
9. GJK is a commonly used algorithm for collision detection and resolution of convex objects in physically-based animations/simulations in virtual environments. A large fraction of the run-time is spent in computing the support map, i.e., the furthest vertex of the object along a given direction. The scalar implementation of GJK is compute bound on current CPUs and GPUs and can exploit DLP to further speed up the run-time. The underlying SIMD is exploited by executing multiple instances of the kernel on different pairs of objects. This requires gathering the object data (vertices/edges) of multiple objects into a SIMD register to facilitate fast support map execution. Hence the run-time is dependent on support for an efficient gather instruction by the underlying hardware. There also exist techniques [38] that can compute the support map by memoizing the object into a lookup table and performing lookups into these tables at run-time. Although still requiring gathers, this lookup can be performed using the texture mapping units available on GPUs to achieve further speedups.
10. Sort or radix sort is a multi-pass sorting algorithm used in many areas including databases. Each pass sorts one digit of the input at a time, from least to most significant. Each pass involves data rearrangement in the form of memory scatters. On CPUs, the best implementation foregoes the use of SIMD and implements a scatter-oriented rearrangement within cache. On GPUs, where SIMD use is important, the algorithm is rewritten using a 1-bit sort primitive, called split [39]. The split-based code, however, has more scalar operations than the buffer code (since it works on a single bit at a time). The overall efficiency of SIMD use relative to optimized scalar code is therefore not high even for split code. The number of bits considered per pass of radix sort depends on the size of the local storage. Increasing cache sizes will thus improve performance (each doubling of the cache size will increase the number of bits per pass by one). Overall, radix sort has O(n) bandwidth and compute requirements (where n is the number of elements to be sorted), but is usually compute bound due to the inefficiency of SIMD use.
11. RC or Ray Casting is an important visual application, used to visualize 3D datasets, such as CT data used in medical imaging. High quality algorithms, known as ray casting, cast rays through the volume, performing compositing of each voxel into a corresponding pixel, based on voxel opacity and color. Tracing multiple rays using SIMD is challenging, because rays can access non-contiguous memory locations, resulting in incoherent and irregular memory accesses. Some ray casting implementations perform a decent amount of computation, for example, gradient shading. The first-level working set due to adjacent rays accessing the same volume data is reasonably small. However, the last-level working set can be as large as the volume itself, which is several gigabytes of data.
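The split primitive mentioned for Sort (kernel 10) is easiest to see in scalar form: a 1-bit split is a stable partition driven by a prefix sum over the selected bit, which is what makes it amenable to SIMD and GPU execution at the cost of extra operations per pass. The sketch below is our own scalar illustration; the GPU version in [39] performs the scan in parallel across SIMD lanes and threads.

```cpp
#include <cstdint>
#include <vector>

// One pass of radix sort using the 1-bit "split" primitive:
// stably move keys whose selected bit is 0 before keys whose bit is 1.
// A full 32-bit sort repeats this for bits 0..31, or handles several
// bits per pass when enough local storage is available for the buckets.
void split_pass(std::vector<std::uint32_t>& keys, int bit) {
    const std::size_t n = keys.size();
    std::vector<std::uint32_t> out(n);

    // Count the 0-keys; 1-keys are appended after all 0-keys.
    std::size_t zeros = 0;
    for (std::size_t i = 0; i < n; ++i)
        zeros += ((keys[i] >> bit) & 1u) == 0u;

    std::size_t next_zero = 0, next_one = zeros;
    for (std::size_t i = 0; i < n; ++i) {
        if (((keys[i] >> bit) & 1u) == 0u)
            out[next_zero++] = keys[i];   // scatter: the irregular step SIMD must emulate
        else
            out[next_one++] = keys[i];
    }
    keys.swap(out);
}
```

Because each pass handles a single bit, sorting 32-bit keys takes 32 such passes unless several bits are processed together, which is exactly why the amount of usable on-die storage determines the number of passes and, ultimately, whether the CPU or the GPU implementation wins.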
12. Search or in-memory tree-structured index search is a commonly used operation in various fields of computer science, especially databases. For CPUs, the performance depends on whether the trees can fit in cache or not. For small trees (tree sizes smaller than the last-level cache (LLC)), the search operation is compute bound, and can exploit the underlying SIMD to achieve speedups. However, for large trees (tree sizes larger than the LLC), the last few levels of the tree do not fit in the LLC, and hence the run-time for search is bound by the available memory bandwidth. As far as GPUs are concerned, the available high bandwidth exceeds the required bandwidth even for large trees, and the run-time is compute bound. The run-time of search is proportional to the tree depth on the GTX280.
13. Hist or histogram computation is an important image processing algorithm which hashes and aggregates pixels from a continuous stream of data into a smaller number of bins. While the address computation is SIMD friendly, SIMDification of the aggregation requires hardware support for conflict detection, which is currently not available on modern architectures. The access pattern is irregular and hence SIMD is hard to exploit. Generally, multi-threading of histogram requires atomic operation support. However, there are several parallel implementations of histogram which use privatization. Typically, private histograms can be made to fit into the available on-die storage. However, the overhead of reducing the private histograms is high, which becomes a major bottleneck for highly parallel architectures.
14. Bilat or bilateral filter is a common non-linear filter used in image processing for edge-preserving smoothing operations. The core computation is a combination of a spatial and an intensity filter. The neighboring pixel values and positions are used to compute new pixel values. It has high computational requirements and the performance should scale linearly with increased flops. Typical image sizes are large; TLP/DLP can be exploited by dividing the pixels among threads and SIMD units. Furthermore, the bilateral filter involves transcendental operations like computing exponents, which can significantly benefit from fast math units.

3. TODAY'S HIGH PERFORMANCE COMPUTE PLATFORMS
In this section, we describe two popular high-performance compute platforms of today: (1) a CPU-based platform with an Intel Core i7-960 processor; and (2) a GPU-based platform with an Nvidia GTX280 graphics processor.

3.1 Architectural Details
First, we discuss the architectural details of the two architectures, and analyze the total compute, bandwidth and other architectural features available to facilitate throughput computing applications.

3.1.1 Intel Core i7 CPU
The Intel Core i7-960 CPU is the latest multi-threaded multi-core Intel-Architecture processor. It offers four cores on the same die running at a frequency of 3.2GHz. The Core i7 processor cores feature an out-of-order super-scalar microarchitecture, with newly added 2-way hyper-threading. In addition to scalar units, it also has 4-wide SIMD units that support a wide range of SIMD instructions [24]. Each core has separate 32KB L1 caches for instructions and data, and a 256KB unified L2 cache. All four cores share an 8MB L3 data cache. The Core i7 processor also features an on-die memory controller that connects to three channels of DDR memory. Table 2 provides the peak single-precision and double-precision FLOPS for both scalar and SSE units, and also the peak bandwidth available per die.

3.1.2 Nvidia GTX280 GPU
The Nvidia GTX280 is composed of an array of multiprocessors (a.k.a. SMs). Each SM has 8 scalar processing units running in lockstep, each at 1.3 GHz (we view the 8 scalar units as SIMD lanes, hence 8-element-wide SIMD for GTX280). The hardware SIMD structure is exposed to programmers through thread warps. To hide memory latency, GTX280 provides hardware multi-threading support that allows hundreds of thread contexts to be active simultaneously. To alleviate memory bandwidth pressure, the card includes various on-chip memories, such as a multi-ported software-controlled 16KB memory (referred to as the local shared buffer) and small non-coherent read-only caches. The GTX280 also has special functional units like the texture sampling unit, and math units for fast transcendental operations. Table 2 depicts the peak FLOPS and the peak bandwidth of the GTX280 (the peak single-precision SIMD FLOPS for GTX280 is 311.1 and increases to 933.1 by including the fused multiply-add and a multiply operation which can be executed in the SFU pipeline).

3.2 Implications for Throughput Computing Applications
We now describe how the salient hardware features of the two architectures differ from each other, and their implications for throughput computing applications.
Processing Element Difference: The CPU core is designed to work well for a wide range of applications, including single-threaded applications. To improve single-thread performance, the CPU core employs an out-of-order super-scalar architecture to exploit instruction-level parallelism. Each CPU core supports scalar and SIMD operations, with multiple issue ports allowing more than one operation to be issued per cycle. It also has a sophisticated branch predictor to reduce the impact of branch misprediction on performance. Therefore, the size and complexity of the CPU core limits the number of cores that can be integrated on the same die.
In comparison, the GPU processing element, or SM, trades off fast single-thread performance and clock speed for high throughput. Each SM is relatively simple. It consists of a single fetch unit and eight scalar units. Each instruction is fetched and executed in parallel on all eight scalar units over four cycles for 32 data elements (a.k.a. a warp). This keeps the area of each SM relatively small, and therefore more SMs can be packed per die, as compared to the number of CPU cores.
Cache size/Multi-threading: CPUs provide caches and hardware prefetchers to help programmers manage data implicitly. The caches are transparent to the programmer, and capture the most frequently used data. If the working set of the application can fit into the on-die caches, the compute units are used more effectively. As a result, there has been a trend of increasing cache sizes in recent years. Hardware prefetchers provide additional help to reduce memory latency for streaming applications. Software prefetch instructions are also supported to potentially reduce the latency incurred by irregular memory accesses. In contrast, GPUs provide a large number of light-weight threads to hide memory latency. Each SM can support up to 32 concurrent warps. Since all the threads within a warp execute the same instruction, the warps are switched out upon issuing memory requests. To capture repeated access patterns to the same data, GTX280 provides a few local storages (shared buffer, constant cache and texture cache). The size of the local shared buffer is just 16KB, much smaller than the cache sizes on CPUs.
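The software prefetch instructions mentioned above can be issued explicitly for irregular, index-driven access patterns where the hardware prefetchers give little help. The sketch below is our own illustration (the look-ahead distance of 16 is an arbitrary placeholder to be tuned per platform): it hides part of the latency of loading through an index array, the same kind of access pattern SpMV exhibits over the source vector.

```cpp
#include <xmmintrin.h>  // _mm_prefetch
#include <cstddef>

// Sum values selected through an index array (an SpMV-like access pattern).
// The loads through idx[] are irregular, so we prefetch the element that will
// be needed AHEAD iterations from now while computing the current one.
float indexed_sum(const float* val, const int* idx, std::size_t n) {
    constexpr std::size_t AHEAD = 16;  // look-ahead distance; tune per platform
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + AHEAD < n)
            _mm_prefetch(reinterpret_cast<const char*>(&val[idx[i + AHEAD]]),
                         _MM_HINT_T0);  // request the future cache line early
        sum += val[idx[i]];
    }
    return sum;
}
```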
Processor | Num. PEs | Frequency (GHz) | Num. Transistors | BW (GB/sec) | SP SIMD width | DP SIMD width | Peak SP Scalar FLOPS (GFLOPS) | Peak SP SIMD FLOPS (GFLOPS) | Peak DP SIMD FLOPS (GFLOPS)
Core i7-960 | 4 | 3.2 | 0.7B | 32 | 4 | 2 | 25.6 | 102.4 | 51.2
GTX280 | 30 | 1.3 | 1.4B | 141 | 8 | 1 | 116.6 | 311.1/933.1 | 77.8
Table 2: Core i7 and GTX280 specifications. BW: local DRAM bandwidth, SP: single-precision floating point, DP: double-precision floating point.
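The peak numbers in Table 2 can be reproduced, to within rounding, from core count, clock, SIMD width, and flops issued per cycle. The per-cycle flop counts below are our own bookkeeping for this sanity check: two flops per cycle per lane on Core i7 (one multiply plus one add), one flop per GTX280 lane per cycle, tripling when the fused multiply-add plus the SFU multiply mentioned in Section 3.1.2 are counted; the small GTX280 discrepancy comes from its exact shader clock being slightly below 1.3 GHz.

\[
\begin{aligned}
\text{Core i7-960 SP SIMD: } & 4~\text{cores} \times 3.2~\text{GHz} \times 4~\text{lanes} \times 2~\text{flops/cycle} = 102.4~\text{GFLOPS},\\
\text{Core i7-960 DP SIMD: } & 4 \times 3.2 \times 2 \times 2 = 51.2~\text{GFLOPS},\\
\text{Core i7-960 SP scalar: } & 4 \times 3.2 \times 2 = 25.6~\text{GFLOPS},\\
\text{GTX280 SP SIMD: } & 30~\text{SMs} \times 8~\text{lanes} \times 1.3~\text{GHz} \times 1~\text{flop/cycle} \approx 311.1~\text{GFLOPS}\\
& (\times 3~\text{flops/cycle with MAD + SFU multiply} \approx 933.1~\text{GFLOPS}).
\end{aligned}
\]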
Bandwidth Difference: Core i7 provides a peak external memory bandwidth of 32 GB/sec, while GTX280 provides a bandwidth of around 141 GB/sec. Although the ratio of peak bandwidth is pretty large (∼4.7X), the ratio of bytes per flop is comparatively smaller (∼1.6X) for applications not utilizing the fused multiply-add in the SFU.
Other Differences: CPUs provide fast synchronization operations, something that is not efficiently implemented on GPUs. CPUs also provide efficient in-register cross-lane SIMD operations, like general shuffle and swizzle instructions. On the other hand, such operations are emulated on GPUs by storing the data into the shared buffer and loading it with the appropriate shuffle pattern. This incurs large overheads for some throughput computing applications. In contrast, GPUs provide support for gather/scatter instructions from memory, something that is not efficiently implemented on CPUs. Gather/scatter operations are important to increase SIMD utilization for applications requiring access to non-contiguous regions of memory to be operated upon in a SIMD fashion. Furthermore, the availability of special function units like the texture sampling unit and math units for fast transcendentals helps speed up throughput computing applications that spend a substantial amount of time in these operations.

4. PERFORMANCE EVALUATIONS ON CORE I7 AND GTX280
This section evaluates the performance of the throughput computing kernels on the Core i7-960 and GTX280 processors and analyzes the measured results.

4.1 Methodology
We measured the performance of our kernels on (1) a 3.2GHz Core i7-960 processor running the SUSE Enterprise Server 11 operating system with 6GB of PC1333 DDR3 memory on an Intel DX58SO motherboard, and (2) a 1.3GHz GTX280 processor (an eVGA GeForce GTX280 card with 1GB GDDR3 memory) in the same Core i7 system with Nvidia driver version 19.180 and the CUDA 2.3 toolkit.
Since we are interested in comparing the CPU and the GPU architectures at the chip level to see if any specific architecture features are responsible for the performance difference, we did not include the data transfer time in the GPU measurements. We assume the throughput computing kernels are executed in the middle of other computations that create data in GPU memory before the kernel execution and use data generated by the kernel in GPU memory. For applications that do not meet our assumption, transfer time can significantly degrade performance, as reported by Datta in [16]. The GPU results presented here are therefore an upper bound of what will be seen in actual applications for these algorithms.
For both CPU and GPU performance measurements, we have optimized most of the kernels individually for each platform. For some of the kernels, we have used the best available implementation that already existed. Specifically, evaluations of SGEMM, SpMV, FFT and MC on GTX280 have been done using code from [1, 8, 2, 34], respectively. For the evaluations of SGEMM, SpMV and FFT on Core i7, we used Intel MKL 10.0. Table 3 shows the performance of the throughput computing kernels on the Core i7 and GTX280 processors, with the appropriate performance metric shown for each kernel. To the best of our knowledge, our performance numbers are at least on par with and often better than the best published data. We typically find that the highest performance is achieved when multiple threads are used per core. For Core i7, the best performance comes from running 8 threads on 4 cores. For GTX280, while the maximum number of warps that can be executed on one GPU SM is 32, a judicious choice is required to balance the benefit of multithreading with the increased pressure on registers and on-chip memory resources. Kernels are often run with 4 to 8 warps per core for best GPU performance.

Figure 1: Comparison between Core i7 and GTX280 performance.

4.2 Performance Comparison
Figure 1 shows the relative performance between the GTX280 and Core i7 processors when the data transfer time for GTX280 is not considered. Our data shows that the GTX280 has only a 2.5X performance advantage over Core i7 on average across the 14 kernels tested. Only GJK achieves a greater than 10X performance gap, due to the use of the texture sampler. Sort and Solv actually perform better on Core i7. Our results are far lower than previous claims such as the 50X difference in pricing European options using the Monte Carlo method [9], the 114X difference in LBM [45], the 40X difference in FFT [21], the 50X difference in sparse matrix vector multiplication [47], and the 40X difference in histogram computation [53].
There are many factors that contributed to the big difference between previously reported results and ours. One factor is which CPU and GPU are used in the comparison. Comparing a high-performance GPU to a mobile CPU is not an optimal comparison, as their considerations for operating power, thermal envelope and reliability are totally different. Another factor is how much optimization is performed on the CPU and GPU. Many studies compare optimized GPU code to unoptimized CPU code, resulting in large differences. Other studies which perform careful optimizations for both CPU and GPU, such as [27, 39, 40, 43, 49], report much lower speedups, similar to ours. Section 5.1 discusses the necessary software optimizations for improving performance on both CPU and GPU platforms.
Kernel | Metric | Core i7-960 | GTX280
SGEMM | GFLOPS | 94 | 364
MC | billion paths/sec | 0.8 | 1.4
Conv | million pixels/sec | 1250 | 3500
FFT | GFLOPS | 71.4 | 213
SAXPY | GB/sec | 16.8 | 88.8
LBM | million lookups/sec | 85 | 426
Solv | FPS | 103 | 52
SpMV | GFLOPS | 4.9 | 9.1
GJK | FPS | 67 | 1020
Sort | million elements/sec | 250 | 198
RC | FPS | 5 | 8.1
Search | million queries/sec | 50 | 90
Hist | million pixels/sec | 1517 | 2583
Bilat | million pixels/sec | 83 | 475
Table 3: Raw performance measured on the two platforms.
4.3 Performance Analysis
In this section, we analyze the performance results and identify the architectural features that contribute to the performance of each of our kernels. We begin by first identifying kernels that are purely bounded by one of the two fundamental processor resources - bandwidth and compute. We then identify the role of other architectural features such as hardware support for irregular memory accesses, fast synchronization, and hardware for performing fixed function computations (such as texture and transcendental math operations) in speeding up the remaining kernels.

4.3.1 Bandwidth
Every kernel requires some amount of external memory bandwidth to bring data into the processor. The impact of external memory bandwidth on kernel performance depends on two factors: (1) whether the kernel has enough computation to fully utilize the memory accesses; and (2) whether the kernel has a working set that fits in the on-die storage (either caches or buffers). Two of our kernels, SAXPY and LBM, have large working sets that require global memory accesses without much compute on the loaded data - they are purely bandwidth bound. These kernels will benefit from increased bandwidth resources. The performance ratios for these two kernels between GTX280 and Core i7 are 5.3X and 5X respectively. These results are in line with the ratio between the two processors' peak memory bandwidth (which is 4.7X).
SpMV also has a large working set and very little compute. However, the performance ratio for this kernel between GTX280 and Core i7 is 1.9X, which is about 2.3X lower than the ratio of peak bandwidth between the two processors. This is due to the fact that the GTX280 implementation of SpMV keeps both the vector and the column index data structures in GDDR since they do not fit in the small on-chip shared buffer. However, in the Core i7 implementation, the vectors always fit in cache and the column index fits in cache for about half the matrices. On average, the GPU bandwidth requirement for SpMV is about 2.5X the CPU requirement. As a result, although the GTX280 has 4.7X more bandwidth than Core i7, the performance ratio is only 1.9X.
Our other kernels either have a high compute-to-bandwidth ratio or working sets that fit completely or partially in the on-die storage, thereby reducing the impact of memory bandwidth on performance. These categories of kernels will be described in later sections.

4.3.2 Compute Flops
The computational flops available on a processor depend on single-thread performance, as well as TLP due to the presence of multiple cores, or DLP due to wide vector (SIMD) units. While most applications (except the bandwidth-bound kernels) can benefit from improved single-thread performance and thread-level parallelism by exploiting additional cores, not all of them can exploit SIMD well. We identify SGEMM, MC, Conv, FFT and Bilat as being able to exploit all available flops on both CPU and GPU architectures. Figure 1 shows that SGEMM, Conv and FFT have GTX280-to-Core i7 performance ratios in the 2.8-4X range. This is close to the 3-6X single-precision (SP) flop ratio of the GTX280 to Core i7 architectures (see Table 2), depending on whether kernels can utilize fused multiply-adds or not.
The reason for not achieving the peak compute ratio is that GPUs do not achieve peak efficiency in the presence of shared buffer accesses. Volkov et al. [48] show that GPUs obtain only about 66% of the peak flops even for SGEMM (known to be compute bound). Our results match their achieved performance ratios. MC uses double-precision arithmetic, and hence has a performance ratio of 1.8X, close to the 1.5X double-precision (DP) flop ratio. Bilat utilizes fast transcendental operations on GPUs (described later), and has a GTX280 to Core i7 performance ratio better than 5X. The algorithm used for Sort critically depends on the SIMD width of the processor. A typical radix sort implementation involves reordering data, with many scalar operations for buffer management and data scatters. However, scalar code is inefficient on GPUs, and hence the best GPU sort code uses a SIMD-friendly split primitive. This has many more operations than the scalar code, and is consequently 1.25X slower on the GTX280 than on Core i7.
Seven of our fourteen kernels have been identified as bounded by compute or bandwidth resources. We now describe the other architectural features that have a performance impact on the other seven kernels.

4.3.3 Cache
As mentioned in Section 4.3.1, on-die storage can alleviate external memory bandwidth pressure if all or part of the kernel's working set can fit in such storage. When the working set fits in cache, most kernels are compute bound and the performance will scale with increasing compute. The five kernels that we identify as compute bound have working sets that can be tuned to fit in any reasonably sized cache without significant loss of performance. Consequently, they only rely on the presence of some kind of on-chip storage and are compute bound on both CPUs and GPUs.
There are kernels whose working set cannot be easily tuned to any given cache size without loss of performance. One example is radix sort, which requires a working set that increases with the number of bits considered per pass of the sort. The number of passes over the data, and hence the overall runtime, decreases as we increase cache size. On GPUs with a small local buffer of 16 KB shared among many threads, we can only sort 4 bits in one pass - requiring 8 passes to sort 32-bit data. On Core i7, we can fit the working set of 8 bits in the L2 cache; this only requires 4 passes - a 2X speedup. This contributes to Sort on Core i7 being 1.25X faster than GTX280. Another interesting example is index tree search (Search). Here, the size of the input search tree determines the working set. For small trees that fit in cache, search on CPUs is compute bound, and in fact is 2X faster than GPU search. For larger trees, search on CPUs is bandwidth bound, and becomes 1.8X slower than GPUs. GPU search, in contrast, is always compute bound due to ISA inefficiencies (i.e., the unavailability of cross-lane SIMD operations).
Another important working set characteristic that determines kernel performance is whether the working set scales with the number of threads or is shared by all threads. Kernels like SGEMM, MC, Conv, FFT, Sort, RC, and Hist have working sets that scale with the number of threads. These kernels require larger working sets for GPUs (with more threads) than CPUs. This may not have any performance impact if the kernel can be tiled to fit into a cache of an
arbitrary size (e.g., SGEMM and FFT). However, tiling can only be done to an extent for RC. Consequently, RC becomes bandwidth bound on GPUs, which have a very small amount of on-die storage, but is not bandwidth bound on CPUs (instead being affected by gathers/scatters, described later), and the performance ratio of GTX280 to Core i7 is only 1.6X, far less than the bandwidth and compute ratios.

4.3.4 Gather/Scatter
Kernels that are not bandwidth bound can benefit from increasing DLP. However, the use of SIMD execution units places restrictions on kernel implementations, particularly in the layout of the data. Operands and results of SIMD operations are typically required to be grouped together sequentially in memory. To achieve the best performance, they should be placed into an address-aligned structure (for example, for 4-wide single-precision SIMD, the best performance will be when the data is 16-byte aligned). If the data does not meet these layout restrictions, programmers must convert the data layout of kernels. This generally involves gather/scatter operations, where operands are gathered from multiple locations and packed together into a tight grouping, and results are scattered from a tight grouping to multiple locations. Performing gather/scatter in software can be expensive: for 4-wide SIMD on Core i7, a compiler-generated gather sequence takes 20 instructions, and even a hand-optimized assembly sequence still takes 13 instructions. Thus, efficient hardware support for gather/scatter operations is very important.
A number of our kernels rely on the availability of gather/scatter operations. For example, GJK spends a large fraction of its run-time in computing the support map. This requires gathering the object data (vertices/edges) of multiple objects into a SIMD register to facilitate fast support map execution. Another example is RC, which requires gathering volume data across the rays. Frequent irregular memory accesses result in a large number of gather operations. Up to 10% of the dynamic instructions are gather requests.
On Core i7, there is no hardware gather/scatter support. Consequently, GJK and RC do not utilize SIMD efficiently. For example, RC sees very incremental benefit from SSE, between 0.8X and 1.2X, due to the large overhead of software gather. GJK also sees minimal benefits from SSE. On GTX280, support for gather/scatter is offered for accesses to the local buffer and GDDR memory. The local shared buffer supports simultaneous gather/scatter accesses to multiple banks. The GDDR memory controller coalesces requests to the same line to reduce the number of gather/scatter accesses. This improved gather/scatter support leads to an improvement of GJK performance on the GTX280 over the Core i7. However, gather/scatter support only has a small impact (of 1.2X) on RC performance because the accesses are widely spread out to memory, requiring multiple GDDR accesses even with coalescing support – it therefore becomes limited by GPU memory bandwidth. Consequently, the ratio of GTX280 to Core i7 performance for RC is only 1.6X, slightly better than the scalar flop ratio of 1.5X.

4.3.5 Reduction and Synchronization
Throughput computing kernels achieve high performance through thread-level (multiple cores and threads) and/or data-level (wide vector) parallelism. Reduction and synchronization are two operations that do not scale with increasing thread count and data-level parallelism. Various optimization techniques have been proposed to reduce the need for reduction and to avoid synchronization. However, the synchronization overhead is still dominant in some kernels such as Hist and Solv, and will become an even bigger performance bottleneck as the number of cores/threads and the SIMD width increase.
The performance of Hist is mainly limited by atomic updates. Although Core i7 supports a hardware lock increment instruction, 28% of the total run-time is still spent on atomic updates. Atomic update support on the GTX280 is also very limited. Consequently, a privatization approach where each thread generates a local histogram was implemented for both CPU and GPU. However, this implementation does not scale with increased core count because the reduction overhead increases with the number of cores. Also, the lack of cross-SIMD-lane operations like reduction on GPUs leads to large instruction overhead on GTX280. Thus, Hist is only 1.8X faster on GTX280 than on Core i7, much lower than the compute and bandwidth ratios (∼5X). As was mentioned in Section 2, mapping Hist to SIMD requires support for conflict detection, which is not currently available on modern architectures. Our analysis of ideal conflict detection hardware, capable of detecting an arbitrary number of conflicting indices within the same SIMD vector, improves histogram computation by up to 3X [29].
In Solv, a batch of independent constraints is executed simultaneously by multiple cores/threads, followed by a barrier before executing the next batch. Since resolving a constraint requires only a small amount of computation (on the order of several hundred instructions), the task granularity is small. As a result, the execution time is dominated by the barrier overhead. On Core i7, barriers are implemented using atomic instructions. While it is possible to implement barrier operations entirely on the GPU [48], this implementation does not guarantee that previous accesses to all levels of the memory hierarchy have completed. CPUs provide a memory consistency model with the help of a cache coherence protocol. Because cache coherence is not available on today's GPUs, assuring memory consistency between two batches of constraints requires launching the second batch from the CPU host, which incurs additional overhead. As a result, the barrier execution time on GTX280 is an order of magnitude slower than on Core i7, resulting in an overall 1.9X slowdown in performance for GTX280 when compared to Core i7 for the constraint solver.

4.3.6 Fixed Function
Bilat consists of transcendental operations like computing exponential and power functions. However, for image processing purposes, high-accuracy versions of these functions are not necessary. Current CPUs use algebraic expressions to evaluate such expressions up to the required accuracy, while modern GPUs provide hardware to speed up the computation. On Core i7, a large portion of the run-time (around 66%) is spent in transcendental computation. On GTX280, due to the presence of fast transcendental hardware, Bilat achieves a 5.7X performance ratio compared to Core i7 (much more than the peak compute ratio of around 3X). Speeding up transcendentals on Core i7 (for example, as on GTX280) would improve Bilat performance by around 2X, and the resultant GPU-to-CPU performance ratio would be around 3X, which is closer to the peak compute ratio. MC is another kernel that would benefit from fast transcendentals on CPUs.
Modern GPUs also provide other fixed function units like the texture sampling unit, which is a major component of rendering algorithms. However, by reducing the linear-time support-map computation to constant-time texture lookups, the GJK collision detection algorithm can also exploit the fast texture lookup capability of GPUs, resulting in an overall 14.9X speedup on GTX280 over Core i7.
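The privatization approach described for Hist, and its reduction bottleneck, can be summarized in a few lines of threaded C++ (our own sketch using OpenMP, not the measured implementation): each thread fills a private copy of the histogram without any atomics, and the copies are then merged, a step whose cost grows with both the number of bins and the number of threads.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
#include <omp.h>

// Histogram with per-thread privatization: no atomics in the hot loop,
// but the final reduction over the private copies is the part that does
// not scale as core count grows.
std::vector<std::uint64_t> histogram(const std::uint8_t* pixels, std::size_t n,
                                     std::size_t bins) {
    const int num_threads = omp_get_max_threads();
    std::vector<std::uint64_t> priv(static_cast<std::size_t>(num_threads) * bins, 0);

    #pragma omp parallel
    {
        std::uint64_t* my =
            priv.data() + static_cast<std::size_t>(omp_get_thread_num()) * bins;
        #pragma omp for
        for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i)
            ++my[pixels[i] % bins];            // private update: no synchronization
    }

    std::vector<std::uint64_t> hist(bins, 0);  // reduction: cost grows with threads x bins
    for (int t = 0; t < num_threads; ++t)
        for (std::size_t b = 0; b < bins; ++b)
            hist[b] += priv[static_cast<std::size_t>(t) * bins + b];
    return hist;
}
```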
5. DISCUSSION
Platform-specific software optimization is critical to fully utilize the compute/bandwidth resources of both CPUs and GPUs. We first discuss these software optimization techniques and then derive a number of key hardware architecture features which play a major role in improving the performance of throughput computing workloads.

5.1 Platform Optimization Guide
Traditionally, CPU programmers have heavily relied on increasing clock frequencies to improve performance and have not optimized their applications to fully extract TLP and DLP. However, CPUs are evolving to incorporate more cores with wider SIMD units, and it is critical for applications to be parallelized to exploit TLP and DLP. In the absence of such optimizations, CPU implementations are sub-optimal in performance and can be orders of magnitude off their attainable performance. For example, the previously reported LBM number on GPUs claims a 114X speedup over CPUs [45]. However, we found that with careful multithreading, reorganization of memory access patterns, and SIMD optimizations, the performance on both CPUs and GPUs is limited by memory bandwidth and the gap is reduced to only 5X. We now highlight the key platform-specific optimization techniques we learned from optimizing the throughput computing kernels.
CPU optimization: First, most of our kernels scale linearly with the number of cores. Thus multithreading provides a 3-4X performance improvement on Core i7. Second, CPUs heavily rely on caches to hide memory latency. Moreover, memory bandwidth on CPUs is low compared to GPUs. Blocking is one technique which reduces LLC misses on CPUs. Programmers must be aware of the underlying cache hierarchy or use auto-tuning techniques to obtain the best performing kernels [18, 35]. Many of our kernels, SGEMM, FFT, SpMV, Sort, Search, and RC, use cache blocking. Sort, for best performance, requires the number of bits per pass to be tuned so that its working set fits in cache. RC blocks the volume to increase 3D locality between rays in a bundle. We observe that cache blocking improves the performance of Sort and Search by 3-5X. Third, we found that reordering data to prevent irregular memory accesses is critical for SIMD utilization on CPUs. The main reason is that CPUs do not have gather/scatter support. Search performs explicit SIMD blocking to make memory accesses regular. Solv performs a reordering of the constraints to improve memory access patterns. Other kernels, such as LBM and RC, convert some of their data structures from array-of-structures to structure-of-arrays format to completely eliminate gather operations. For example, the performance of LBM improves by 1.5X from this optimization.
GPU optimization: For GPUs, we found that global inter-thread synchronization is very costly, because it involves a kernel termination and new kernel call overhead from the host. Hist minimizes global synchronization by privatizing histograms. Solv also performs constraint reordering to minimize conflicts among neighboring constraints, which are global synchronization points. Another important optimization for GPUs was the use of the local shared buffer. Most of our kernels use the shared buffer to reduce bandwidth consumption. Additionally, our GPU sort uses the fact that the buffer memory is multi-banked to enable efficient gathers/scatters of data.

5.2 Hardware Recommendations
In this section we capitalize on the learning from Section 4.3 to derive a number of key processor features which play a major role in improving the performance of throughput computing applications.
High compute flops and memory bandwidth: High compute flops can be achieved in two ways - by increasing core count or increasing SIMD width. While increasing core count provides higher performance benefits, it also incurs high area overhead. Increasing SIMD width also provides higher performance, and is more area-efficient. Our observation is confirmed by the trend of increasing SIMD width in computing platforms such as Intel architecture processors with the AVX extension [23], Larrabee [41] and next-generation Nvidia GT GPUs [30]. Increasing SIMD width will reach a point of diminishing returns. As discussed in Section 4.3.4, irregular memory accesses can significantly decrease SIMD efficiency. The cost to fix this would offset any area benefit offered by increased SIMD width. Consequently, future throughput computing processors should strike the right balance between SIMD and MIMD execution.
With the growth of compute flops, high memory bandwidth is critical to achieve scalable performance. Current GPUs leverage high-end memory technology (e.g., graphics DDR or GDDR) to support high compute throughput. This solution limits the memory capacity available in a GPU platform to an amount much smaller than the capacities deployed in CPU-based servers today. Furthermore, increasing memory bandwidth to match the compute has pin-count and power limitations. Instead, one should explore emerging memory technologies such as 3D-stacking [12] or cache compression [5].
Large cache: As shown in Section 4.3.3, caches provide significant benefit for throughput computing applications. An example proof of our viewpoint is that GTX280 has limited on-die memories and around 40% of our benchmarks lose the opportunity to benefit from increasing compute flops. The size of the on-die storage should match the working set of target workloads for maximum efficiency. Some workloads have a working set that only depends on the dataset and does not change with increasing core count or thread count. For today's datasets, 8MB of on-die storage is sufficient to eliminate 90% of all accesses to external memory. As the data footprint is likely to increase tomorrow, larger on-die storage is necessary for these workloads to work well. Other workloads have working sets that scale with the number of processing threads. For these workloads, one should consider their per-thread on-die storage size requirement. For today's datasets, we found most per-thread working sets range from a few KB to as large as 256KB. The per-thread working sets are unlikely to change in the future as they are already set to scale with increased thread count.
Gather/Scatter: 42% of our benchmarks can exploit SIMD better with efficient gather/scatter support. Our simulation-based analysis projects a 3X performance benefit for SpMV and RC with idealized gather operations. An idealized gather operation can simultaneously gather all elements into a SIMD register in the same amount of time it takes to load one cache line. This may require significant hardware and be impractical to build, as it may require a large number of cache ports. Therefore, this represents the upper bound of the gather/scatter hardware potential. Cheaper alternatives exist. One alternative is to use multi-banking - an approach taken by GTX280. On GTX280, the local shared memory allows 16 simultaneous accesses to 16 banks in a single cycle, as long as there are no bank conflicts. However, this data structure is explicitly managed by the programmer. Another alternative is to take advantage of cache line locality when gathering - i.e., to extract all elements required by the same gather from a single load of the required cache line. This approach requires shuffle logic to reorder the data within a cache line before writing it into the target register. Shuffle logic is already available in general-purpose CPUs for permutation operations within SIMD registers. Our analysis shows that many throughput computing kernels have large amounts of cache line locality. For example, Solv accesses on average 3.6 cache lines within each 8-wide gather request. RC accesses on average 5 cache lines within each 16-wide gather request. Lastly, future throughput computing processors should provide improved ease-of-programming support for gather/scatter operations.
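The array-of-structures to structure-of-arrays conversion used for LBM and RC in Section 5.1, and the cache-line-locality argument above, come down to the same layout issue. The sketch below (our own illustration with hypothetical field names) shows why the SoA form turns a strided gather into unit-stride SIMD loads:

```cpp
#include <immintrin.h>
#include <vector>

// Array-of-structures: four consecutive x values are strided by sizeof(Cell),
// so vectorizing over them requires a gather (or scalar loads plus shuffles).
struct Cell { float x, y, z, w; };               // hypothetical per-cell state

// Structure-of-arrays: each field is contiguous, so a 4-wide SIMD load
// reads four x values directly, with no gather needed.
struct Cells {
    std::vector<float> x, y, z, w;
};

void scale_x_aos(std::vector<Cell>& cells, float s) {
    for (auto& c : cells) c.x *= s;              // strided access: gather-like for SIMD
}

void scale_x_soa(Cells& cells, float s) {
    const __m128 vs = _mm_set1_ps(s);
    std::size_t i = 0;
    for (; i + 4 <= cells.x.size(); i += 4) {
        __m128 vx = _mm_loadu_ps(&cells.x[i]);   // unit-stride load, SIMD-friendly
        _mm_storeu_ps(&cells.x[i], _mm_mul_ps(vx, vs));
    }
    for (; i < cells.x.size(); ++i) cells.x[i] *= s;  // scalar tail
}
```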
5 cache lines within each 16-wide gather request. Lastly, future 50]. However, these efforts concentrate on (1) overcoming paral-
throughput computing processors should provide improved easy- lel scalability bottlenecks, and (2) demonstrating multi-core perfor-
of-programming support for gather/scatter operations. mance over a single-core of the same type.
Efficient synchronization and cache coherence: Core i7 allows instructions like increment and compare&exchange to carry an atomic lock prefix. GTX280 also has support for atomic operations, but only through device memory. In both CPUs and GPUs, the current solutions are slow and, more importantly, do not scale well with respect to core count and SIMD width. Therefore, it is critical to provide efficient synchronization solutions in the future.

Two types of synchronization are common in throughput computing kernels: reductions and barriers. First, reductions should provide atomicity between multiple threads and multiple SIMD lanes. For example, Hist loses up to 60% of SIMD efficiency because it cannot handle inter-SIMD-lane atomicity well. We recommend hardware support for atomic vector read-modify-write operations [29], which enable conflict detection between SIMD lanes as well as atomic memory accesses across multiple threads, achieving a 54% performance improvement on four cores with 4-wide SIMD. Second, faster barriers and coherent caches become more important as core count increases and task size gets smaller. For example, in Solv, the average task size is only about 1000 cycles, while a barrier takes several hundred cycles on CPUs and several microseconds on GPUs. We recommend hardware support for fast barriers to amortize small task sizes, and cache coherence to guarantee memory consistency between barrier invocations. In addition, we believe that hardware-accelerated task queues would improve synchronization performance even further (by 68% to 109% [28]).
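To ground both primitives, here is a minimal CUDA histogram sketch (ours; it is not the Hist kernel from the paper). Atomic adds supply the per-bin atomicity across threads, and __syncthreads() is the barrier separating the phases; the bin count and kernel name are assumptions for the example.

```cuda
// Illustrative sketch: privatized histogram using shared-memory atomics
// (supported on the GTX280, compute capability 1.3) and barriers.
#define BINS 256

__global__ void histogram256(const unsigned char* data, int n,
                             unsigned int* hist)
{
    __shared__ unsigned int local[BINS];   // block-private histogram

    // Phase 1: zero the private bins.
    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        local[b] = 0;
    __syncthreads();                       // barrier: bins are initialized

    // Phase 2: each thread walks a strided slice of the input; conflicting
    // updates to the same bin are serialized by the atomic operation.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();                       // barrier: all updates are visible

    // Phase 3: merge into the global histogram via device-memory atomics.
    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        atomicAdd(&hist[b], local[b]);
}
```

Conflicting lanes still serialize inside the atomics, which is precisely the inter-SIMD-lane atomicity gap that the proposed atomic vector read-modify-write operations [29] are meant to close.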
Fixed function units: As shown in Section 4.3.6, Bilat can be sped up by 2X using fast transcendental operations. Texture sampling units significantly improve the performance of GJK. In fact, a large class of image processing kernels (e.g., video encoding/decoding) can also exploit such fixed function units to accelerate specialized operations at very low area/power cost. Likewise, Core i7 introduced a special-purpose CRC instruction to accelerate CRC computations, and the upcoming 32nm version will add encryption/decryption instructions that accelerate key kernels by 10X [36]. Future CPUs and GPUs will continue this trend of adding key primitives for developers to use in accelerating the algorithms of interest.
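As a GPU-side illustration of what fast transcendental operations buy (a sketch of ours with a bilateral-filter-style weight as the stand-in computation; the function names and parameters are assumptions), the CUDA intrinsic __expf maps the exponential onto the hardware fast-approximation path, trading a few ULPs of accuracy for much higher throughput than the precise library expf; compiling with -use_fast_math applies the same substitution globally.

```cuda
// Illustrative sketch: precise vs. fast-intrinsic transcendental for a
// bilateral-filter-like weight w = exp(-d2/(2*sd^2)) * exp(-r2/(2*sr^2)).
__device__ float weight_precise(float d2, float r2,
                                float inv2sd2, float inv2sr2)
{
    return expf(-d2 * inv2sd2) * expf(-r2 * inv2sr2);     // precise library call
}

__device__ float weight_fast(float d2, float r2,
                             float inv2sd2, float inv2sr2)
{
    return __expf(-d2 * inv2sd2) * __expf(-r2 * inv2sr2); // hardware fast path
}
```

On the CPU side, the CRC instruction mentioned above is exposed to software in the same spirit, through the SSE4.2 CRC32 intrinsics documented in [24].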
6. RELATED WORK

Throughput computing applications have been identified as one of the most important classes of future applications [6, 10, 13, 44]. Chen et al. [13] describe a diverse set of emerging applications, called RMS (Recognition, Mining, and Synthesis) workloads, and demonstrate that their core kernel functions appear in applications across many different domains. The PARSEC benchmark suite discusses emerging workloads and their characteristics for CMPs [10]. The Berkeley View report identifies 13 kernels with which to design and evaluate throughput computing models [6]. UIUC's Parboil benchmark suite is tailored to capture the strengths of GPUs [44]. In this paper, we share the vision that throughput computing will significantly impact future computing paradigms, and analyze the performance of a representative subset of kernels from these workload suites on common high-performance architectures.

Multi-core processors are a major architectural trend in today's general-purpose CPUs. Various aspects of multi-core architectures, such as the cache/memory hierarchy [11], on-chip interconnect [4], and power management [37], have been studied. Many parallel kernels have also been ported and optimized for multi-core systems, some of which are similar to the kernels discussed in this paper [15, 17, 50]. However, these efforts concentrate on (1) overcoming parallel scalability bottlenecks and (2) demonstrating multi-core performance over a single core of the same type.

General-purpose computation on graphics hardware (GPGPU) has been an active topic in the graphics community. Extensive work has recently been published on GPGPU computation; it is summarized well in [3, 33]. A number of studies [8, 9, 19, 20, 21, 25, 27, 34, 40, 43, 53] discuss throughput computing kernels similar to those in this paper. However, their focus is on mapping non-graphics applications to GPUs in terms of algorithms and programming models.

Analytical models of CPUs [51] and GPUs [22] have also been proposed. They provide a structural understanding of throughput computing performance on CPUs and GPUs. However, (1) each discusses either CPUs or GPUs only, and (2) their models are highly simplified. Further, they aim to validate their models against real silicon rather than to provide an in-depth performance comparison between CPUs and GPUs.

This paper provides an architectural analysis of CPUs and GPUs. Instead of simply showing a performance comparison, we study how architectural features such as core complexity, cache/buffer design, and fixed function units impact throughput computing workloads. Further, we provide our recommendations on which architectural features would improve future throughput computing architectures. To the best of our knowledge, this is the first paper that evaluates CPUs and GPUs from the perspective of architecture design. In addition, this paper presents a fair comparison between performance on CPUs and GPUs and dispels the myth that GPUs are 100x-1000x faster than CPUs for throughput computing kernels.
7. CONCLUSION

In this paper, we analyzed the performance of an important set of throughput computing kernels on the Intel Core i7-960 and the Nvidia GTX280. We show that CPUs and GPUs are much closer in performance (2.5X) than the previously reported orders-of-magnitude difference. We believe many factors contributed to the reported large gap in performance, such as which CPU and GPU are used and what optimizations are applied to the code. The CPU optimizations that contributed to performance improvements are multithreading, cache blocking, and reorganization of memory accesses for SIMDification; the key GPU optimizations are minimizing global synchronization and using local shared buffers. Our analysis of the optimized code on the current CPU and GPU platforms led us to identify the key hardware architecture features for future throughput computing machines: high compute and bandwidth, large caches, gather/scatter support, efficient synchronization, and fixed function units. We plan to perform a power efficiency study of CPUs and GPUs in the future.
8. REFERENCES

[1] CUDA BLAS Library. http://developer.download.nvidia.com/compute/cuda/2_1/toolkit/docs/CUBLAS_Library_2.1.pdf, 2008.
[2] CUDA CUFFT Library. http://developer.download.nvidia.com/compute/cuda/2_1/toolkit/docs/CUFFT_Library_2.1.pdf, 2008.
[3] General-purpose computation on graphics hardware. http://gpgpu.org/, 2009.
[4] D. Abts, N. D. Enright Jerger, J. Kim, D. Gibson, and M. H. Lipasti. Achieving predictable performance through better memory controller placement in many-core CMPs. In ISCA '09: Proceedings of the 36th Annual International Symposium on Computer Architecture, 2009.
[5] A. R. Alameldeen. Using compression to improve chip multiprocessor performance. PhD thesis, Madison, WI, USA, 2006. Adviser: David A. Wood.
[6] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, 2006.
[7] D. H. Bailey. A high-performance FFT algorithm for vector supercomputers (abstract). In Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing, page 114, Philadelphia, PA, USA, 1989. Society for Industrial and Applied Mathematics.
[8] N. Bell and M. Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In SC '09: Proceedings of the 2009 ACM/IEEE Conference on Supercomputing, 2009.
[9] C. Bennemann, M. Beinker, D. Egloff, and M. Gauckler. Teraflops for games and derivatives pricing. http://quantcatalyst.com/download.php?file=DerivativesPricing.pdf.
[10] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC benchmark suite: characterization and architectural implications. In PACT '08: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 72–81, New York, NY, USA, 2008. ACM.
[11] S. Biswas, D. Franklin, A. Savage, R. Dixon, T. Sherwood, and F. T. Chong. Multi-execution: multicore caching for data-similar executions. SIGARCH Comput. Archit. News, 37(3):164–173, 2009.
[12] B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. H. Loh, D. McCaule, P. Morrow, D. W. Nelson, D. Pantuso, P. Reed, J. Rupley, S. Shankar, J. Shen, and C. Webb. Die stacking (3D) microarchitecture. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 469–479, Washington, DC, USA, 2006. IEEE Computer Society.
[13] Y.-K. Chen, J. Chhugani, P. Dubey, C. J. Hughes, D. Kim, S. Kumar, V. W. Lee, A. D. Nguyen, and M. Smelyanskiy. Convergence of recognition, mining, and synthesis workloads and its implications. Proceedings of the IEEE, 96(5):790–807, 2008.
[14] Y.-K. Chen, J. Chhugani, C. J. Hughes, D. Kim, S. Kumar, V. W. Lee, A. Lin, A. D. Nguyen, E. Sifakis, and M. Smelyanskiy. High-performance physical simulations on next-generation architecture with many cores. Intel Technology Journal, 11, 2007.
[15] J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K. Chen, A. Baransi, S. Kumar, and P. Dubey. Efficient implementation of sorting on multi-core SIMD CPU architecture. PVLDB, 1(2):1313–1324, 2008.
[16] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1–12, Piscataway, NJ, USA, 2008. IEEE Press.
[17] F. Franchetti, M. Püschel, Y. Voronenko, S. Chellappa, and J. M. F. Moura. Discrete Fourier transform on multicore. IEEE Signal Processing Magazine, special issue on "Signal Processing on Platforms with Multiple Cores", 26(6):90–102, 2009.
[18] M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, 2005.
[19] L. Genovese. Graphic processing units: A possible answer to HPC. In 4th ABINIT Developer Workshop, 2009.
[20] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: high performance graphics co-processor sorting for large database management. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 325–336, New York, NY, USA, 2006. ACM.
[21] N. K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manferdelli. High performance discrete Fourier transforms on graphics processors. In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1–12, Piscataway, NJ, USA, 2008. IEEE Press.
[22] S. Hong and H. Kim. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. SIGARCH Comput. Archit. News, 37(3):152–163, 2009.
[23] Intel. Advanced Vector Extensions Programming Reference.
[24] Intel. SSE4 Programming Reference. 2007.
[25] C. Jiang and M. Snir. Automatic tuning matrix multiplication performance on graphics hardware. In PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, pages 185–196, Washington, DC, USA, 2005. IEEE Computer Society.
[26] J. R. Johnson, R. W. Johnson, D. Rodriguez, and R. Tolimieri. A methodology for designing, modifying, and implementing Fourier transform algorithms on various architectures. Circuits Syst. Signal Process., 9(4):449–500, 1990.
[27] C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. Nguyen, T. Kaldewey, V. Lee, S. Brandt, and P. Dubey. FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs. In ACM SIGMOD, 2010.
[28] S. Kumar, C. J. Hughes, and A. Nguyen. Carbon: architectural support for fine-grained parallelism on chip multiprocessors. In ISCA '07: Proceedings of the 34th Annual International Symposium on Computer Architecture, pages 162–173, New York, NY, USA, 2007. ACM.
[29] S. Kumar, D. Kim, M. Smelyanskiy, Y.-K. Chen, J. Chhugani, C. J. Hughes, C. Kim, V. W. Lee, and A. D. Nguyen. Atomic vector operations on chip multiprocessors. In ISCA '08: Proceedings of the 35th International Symposium on Computer Architecture, pages 441–452, Washington, DC, USA, 2008. IEEE Computer Society.
[30] N. Leischner, V. Osipov, and P. Sanders. Fermi Architecture White Paper, 2009.
[31] P. Lyman and H. R. Varian. How much information? http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/, 2003.
[32] NVIDIA. NVIDIA CUDA Zone. http://www.nvidia.com/object/cuda_home.html, 2009.
[33] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26(1):80–113, March 2007.
[34] V. Podlozhnyuk and M. Harris. Monte Carlo Option Pricing. http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/MonteCarlo/doc/MonteCarlo.pdf.
[35] M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", 93(2):232–275, 2005.
[36] R. Ramanathan. Extending the world's most popular processor architecture. Intel Whitepaper.
[37] K. K. Rangan, G.-Y. Wei, and D. Brooks. Thread motion: fine-grained power management for multi-core systems. SIGARCH Comput. Archit. News, 37(3):302–313, 2009.
[38] R. Sathe and A. Lake. Rigid body collision detection on the GPU. In SIGGRAPH '06: ACM SIGGRAPH 2006 Research Posters, page 49, New York, NY, USA, 2006. ACM.
[39] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.
[40] N. Satish, C. Kim, J. Chhugani, A. Nguyen, V. Lee, D. Kim, and P. Dubey. Fast Sort on CPUs and GPUs: A Case for Bandwidth Oblivious SIMD Sort. In ACM SIGMOD, 2010.
[41] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: a many-core x86 architecture for visual computing. ACM Trans. Graph., 27(3):1–15, August 2008.
[42] M. Silberstein, A. Schuster, D. Geiger, A. Patney, and J. D. Owens. Efficient computation of sum-products on GPUs through software-managed cache. In Proceedings of the 22nd ACM International Conference on Supercomputing, pages 309–318, June 2008.
[43] M. Smelyanskiy, D. Holmes, J. Chhugani, A. Larson, D. Carmean, D. Hanson, P. Dubey, K. Augustine, D. Kim, A. Kyker, V. W. Lee, A. D. Nguyen, L. Seiler, and R. A. Robb. Mapping high-fidelity volume rendering for medical imaging to CPU, GPU and many-core architectures. IEEE Trans. Vis. Comput. Graph., 15(6):1563–1570, 2009.
[44] The IMPACT Research Group, UIUC. Parboil benchmark suite. http://impact.crhc.illinois.edu/parboil.php.
[45] J. Tolke and M. Krafczyk. TeraFLOP computing on a desktop PC with GPUs for 3D CFD. International Journal of Computational Fluid Dynamics, 22:443–456, 2008.
[46] N. Univ. of Illinois. Technical reference: Base operating system and extensions, volume 2, 2009.
[47] F. Vazquez, E. M. Garzon, J. A. Martinez, and J. J. Fernandez. The sparse matrix vector product on GPUs. Technical report, University of Almeria, June 2009.
[48] V. Volkov and J. Demmel. LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs. Technical Report UCB/EECS-2008-49, EECS Department, University of California, Berkeley, May 2008.
[49] V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1–11, Piscataway, NJ, USA, 2008. IEEE Press.
[50] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pages 1–12, New York, NY, USA, 2007. ACM.
[51] S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM, 52(4):65–76, 2009.
[52] W. Xu and K. Mueller. A performance-driven study of regularization methods for GPU-accelerated iterative CT. In Workshop on High Performance Image Reconstruction (HPIR), 2009.
[53] Z. Yang, Y. Zhu, and Y. Pu. Parallel Image Processing Based on CUDA. In International Conference on Computer Science and Software Engineering, volume 3, pages 198–201, 2008.