CPU And/or GPU: Revisiting The GPU vs. CPU Myth

March 2013
Abstract
Parallel computing using accelerators has gained widespread research attention in the past few years.
In particular, using GPUs for general purpose computing has brought forth several success stories with
respect to time taken, cost, power, and other metrics. However, accelerator based computing has signifi-
cantly relegated the role of CPUs in computation. As CPUs evolve and also offer matching computational
resources, it is important to also include CPUs in the computation. We call this the hybrid computing
model. Indeed, most computer systems of the present age offer a degree of heterogeneity and therefore
such a model is quite natural.
We reevaluate the claim of a recent paper by Lee et al. (ISCA 2010). We argue that the right question
arising out of Lee et al. (ISCA 2010) is how to use a CPU+GPU platform efficiently, rather than
whether one should use a CPU or a GPU exclusively. To this end, we experiment with a set of 13 diverse
workloads spanning databases, image processing, sparse matrix kernels, and graphs. We experiment
with two different hybrid platforms: one consisting of a 6-core Intel i7-980X CPU and an NVidia Tesla
T10 GPU, and another consisting of an Intel E7400 dual core CPU with an NVidia GT520 GPU. On both
both platforms, we show that hybrid solutions offer a significant advantage over CPU-alone or GPU-alone
solutions, and that our solutions are on average 90% resource efficient.
Our work therefore suggests that hybrid computing can offer tremendous advantages not only on
research-scale platforms but also on commodity-scale systems, bringing significant performance gains and
resource efficiency to a large user community.
1 Introduction
Parallel computing using accelerator based platforms has gained widespread research attention in recent
years. Accelerator based general purpose computing, however, has relegated CPUs to second-class
citizens: a CPU sends data and program to an accelerator and collects the results of the computation from
the accelerator. As CPUs evolve and narrow the performance gap with accelerators on several challenge
problems, it is imperative that parity is restored by also bringing CPUs into the computation process.
Hybrid computing seeks to simultaneously use all the computational resources on a given tightly coupled
platform. We envisage a multicore CPU plus accelerators such as GPUs, see also Figure 1, as one such
realization. The CPU in Figure 1 on the left with six cores is connected to a many-core GPU on the right.
In our work, we use two variants of the model shown in Figure 1 by choosing different CPUs and GPUs.
The case for hybrid computing on such a platform can be made naturally. Computers come with a CPU,
at least a dual-core at present, and are expected to contain tens of cores in the near future. Graphics
processing units are traditionally used for graphics operations, and most computers come equipped with a
graphics card that presently offers several GFLOPS of computing power.
graphics card that presently has several GFLOPS of computing power. Moreover, commodity production
of GPUs has significantly lowered their prices. Hence, an application using both a multicore CPU and an
accelerator such as a GPU can benefit from faster processing speed, better power and resource utilization,
Figure 1: A tightly coupled hybrid platform.
Figure 2: A view of hybrid multicore computing. Figure (a) shows the conventional accelerator based com-
puting where the CPU typically stays idle. Figure (b) shows the hybrid computing model with computation
overlapping between the CPU and the GPU.
and the like. Figure 2 illustrates the benefits of such a hybrid computing model. As can be noticed in Figure
2(b), hybrid computing calls for complete utilization of the available computing resources.
Further, it is believed that GPUs are not well suited for computations that offer little SIMD parallelism,
and have highly irregular memory access patterns. For various reasons, CPUs do not suffer greatly on
such computations. Thus, hybrid computing opens the possibility of novel solution designs that take the
heterogeneity into account. Hence, we posit that hybrid computing has the scope to bring the benefits of
high performance computing to desktop and commodity users as well.
We distinguish our model of hybrid computing from other existing models as follows. We
consider computational resources that are tightly coupled. Supercomputers such as the Tianhe-1A that use a
combination of CPUs and GPUs do not fall in our category. Some of the issues at that scale overlap with
those that arise in hybrid computing, but that setting also raises unique concerns such as the interconnection
network, its cross-section bandwidth, latency, and the like.
Hybrid computing solutions on platforms using a combination of CPUs and GPUs are being studied for
various problem classes such as dense linear algebra kernels [5, 45], maximum flows in networks [23], list
ranking [48], and the like. Most of these works, however, have a few limitations. In some cases, for example
[23, 48, 44], computation is done on only either the CPU or the GPU at any given time. Such a scenario
keeps one of the computing resources idle and is hence wasteful. Secondly, most earlier works consider only
a narrow set of workloads; we instead explore a diverse set of workloads so as to highlight the advantages
and limitations of hybrid multicore computing. Further, in a departure from most earlier works, we study
hybrid computing also on low-end CPU and GPU models and examine the viability of hybrid computing
on such platforms.
The aim of this work is to explore the applicability of hybrid multicore computing to a class of appli-
cations. To this end, we select a collection of 13 different workloads ranging from databases to graphs,
and image analysis, and present hybrid algorithmic solutions for these workloads. Some of the results that
we show in this paper were reported recently [8, 9, 18, 15, 33] or are under submission [10]. We identify two
different algorithm engineering approaches that we have used across our 13 workloads. Further, these ap-
proaches can be used to classify most of the recent works on hybrid computing, including [45, 20]. The two
approaches are described briefly in the following:
• Work Sharing: In a work sharing approach, the problem is decomposed into two or more parts,
with each part running on a different device in the hybrid computing platform (see the sketch after
this list). In this approach, the
work shares have to be chosen appropriately so as to balance the load on the CPU and the GPU. In this
approach, the actual algorithms used on the different units could differ and, in some cases, may
also differ from the best possible algorithm on each device. For example, see our
hybrid algorithm for graph connected components and sparse matrix-vector multiplication described
in Section 4.
• Task Parallel: A task parallel approach views the computation as a collection of inter-
dependent tasks, and tasks are mapped onto the available units. In this setting, one has to
arrive at the best possible mapping that minimizes the overall span and also minimizes the idle time
of the devices. As can be seen, the time taken by a hybrid solution using task parallelism corresponds
to the longest path in the task graph, with nodes and edges labelled by the task completion time and
the communication time respectively.
A slight modification to the task parallelism approach is that of pipelined parallelism. A pipelined
parallelism approach views the computation as a clever combination of the units to set up a pipeline
that has different functionalities at each stage of the pipeline.
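To make the work sharing approach concrete, the following minimal sketch (our illustration, not code from our implementations; the split fraction alpha and the per-element operation are placeholders) overlaps a CUDA kernel on the GPU share with OpenMP threads on the CPU share:

#include <cuda_runtime.h>
#include <omp.h>

// Placeholder per-element work; stands in for the actual workload kernel.
__global__ void process_gpu(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

// Work sharing: the first alpha*n elements go to the GPU, the rest to the CPU.
// alpha is tuned so that both devices finish at nearly the same time.
void hybrid_process(const float *h_in, float *h_out, int n, float alpha) {
    int n_gpu = (int)(alpha * n);
    float *d_in, *d_out;
    cudaMalloc(&d_in, n_gpu * sizeof(float));
    cudaMalloc(&d_out, n_gpu * sizeof(float));
    cudaMemcpy(d_in, h_in, n_gpu * sizeof(float), cudaMemcpyHostToDevice);
    process_gpu<<<(n_gpu + 255) / 256, 256>>>(d_in, d_out, n_gpu); // asynchronous launch

    #pragma omp parallel for // the CPU share runs while the GPU kernel executes
    for (int i = n_gpu; i < n; i++)
        h_out[i] = h_in[i] * h_in[i];

    cudaMemcpy(h_out, d_out, n_gpu * sizeof(float), cudaMemcpyDeviceToHost); // syncs
    cudaFree(d_in);
    cudaFree(d_out);
}

The idle time of such a solution is governed by how well alpha balances the two devices, which is exactly the load balancing question raised above.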
Our main contributions are as follows:

• We develop new hybrid solutions for seven of the 13 workloads studied in this paper: sorting, his-
togram, spmv, bilateral filtering, convolution, ray casting, and the Lattice Boltzmann Method.
• We experiment with a hybrid system consisting of a six core Intel i7 980X (Westmere) CPU and an
NVidia Tesla T10 GPU. On this hybrid platform, called Hybrid-High, we show that hybrid computing
has an average of 29% performance improvement over the best possible GPU-alone implementation.
Our solutions also exhibit 90% resource efficiency on average.
• To promote the idea that hybrid computing has benefits on more widely used platforms, we also exper-
iment with a hybrid platform that has an Intel Core 2 Duo E7400 (Allendale) with an NVidia GT520
GPU. We feel that such configurations still have the potential to make supercomputing affordable
and accessible to everyone. On such a platform, called Hybrid-Low, we show that hybrid computing
results in an average of 37% performance improvement compared to the best possible GPU-alone
implementation. Our solutions also exhibit 90% resource efficiency on average.
• We analyze the above results and offer insights on the limits and applicability of hybrid computing in
the present architectural space.
The title of our paper is motivated by a recent related paper [46] that argued about the relative strengths
of GPUs and multicore CPUs on a set of throughput oriented workloads. We establish through this paper that
accelerator based computing should leverage the combined strengths of all devices in a computing platform
via hybrid computing. A majority of our workloads overlap with those considered in [46].
2 Hybrid Computing Platforms
In this section, we briefly describe the two hybrid CPU+GPU computing platforms that we use in our study.
3 Workloads
In this paper, we experiment with the following workloads. Table 1 summarizes the characteristics of the
various workloads considered.
Sorting: Sorting is one of the fundamental operations in information processing that has many appli-
cations. Recent results on efficient implementations of sorting are reported in [34, 3], to name a few. This
workload is an important case study due to its large number of applications. For the purposes of this
paper, we focus on comparison-based sorting techniques and leave non-comparison based sorting techniques
such as radix sort out of scope.
Histogram: One of the important operations in image processing is to compute the histogram of the
intensity values of the pixels. Computing the histogram of a dataset in parallel typically requires the use of
atomic operations. Characteristics of this workload, such as the use of atomic operations and its memory
bound nature, make it a good case study for hybrid computing.
Table 1: Summary of the characteristics of the workloads considered.

Workload (Short name)                        | Application Area               | Nature    | Characteristics                     | Solution Methodology
Sorting (sort)                               | Semi-numerical, databases      | regular   | compute bound                       | work sharing + task parallel
Histogram (hist)                             | Image processing               | irregular | atomics, memory bound               | work sharing
Sparse matrix-vector multiplication (spmv)   | Sparse Linear Algebra          | irregular | memory bound                        | work sharing
Sparse matrix-matrix multiplication (spgemm) | Sparse Linear Algebra          | irregular | device storage, memory bound        | work sharing
Ray casting (RC)                             | Image processing               | irregular | compute bound                       | work sharing
Bilateral filtering (Bilat)                  | Image processing               | regular   | compute bound                       | work sharing + task parallel
Convolution (Conv)                           | Image processing               | regular   | compute bound                       | work sharing
Monte-Carlo (MC)                             | Physics, computational finance | regular   | compute bound, pseudorandom numbers | task parallel
List Ranking (LR)                            | Graphs and trees               | irregular | memory bound                        | task parallel
Connected Components (CC)                    | Graph algorithms               | irregular | memory bound                        | work sharing
Lattice Boltzmann Method (LBM)               | Computational Fluid Dynamics   | irregular | memory bound                        | task parallel
Image Dithering (Dither)                     | Image processing               | irregular | causal dependencies                 | work sharing
Bundle adjustment (Bundle)                   | Image processing               | irregular | memory bound                        | task parallel
Sparse Matrix-Vector Multiplication (spmv): Efficient operations involving sparse matrices are
essential to achieve high performance across numerical applications such as climate modeling, molecular
dynamics, and the like. In most of the above applications, spmv computation is the main bottleneck. Hence,
efforts to speed up this computation on modern architectures have attracted significant research attention in
recent years [49, 16]. Further, the computation involved in spmv is highly irregular in nature due to the
sparsity of the matrix.
Sparse matrix-matrix multiplication (spgemm): Another operation that is important with respect to
sparse matrices is that of multiplying two sparse matrices. This workload has found applications spanning
several areas such as graph algorithms [26], numerical analysis including computational fluid dynamics [37,
21, 16], and is also included as one of the seven dwarfs in parallel computing in the Berkeley report [4]. A
recent work that reports efficient sparse matrix multiplication on modern architectures is [13].
It is generally accepted that the difficulties of this workload include its irregular nature of computation, the
difficulty in predicting the size of the output, and the concomitant memory management problems.
Both spmv and spgemm are important linear algebra kernels with significant applications, and hence
their choice is justified.
Ray casting: This is a fundamental problem in image analysis and computer graphics. Recent applica-
tions to medical image analysis are also reported [42]. As all rays perform their computations independently,
the problem maps naturally onto parallel architectures. Tracing multiple rays in an SIMD fashion
is challenging, because rays access non-contiguous memory locations, resulting in incoherent and irregular
memory accesses. The choice of this workload is justified because of the range of applications of ray casting
to visual computing.
Bilateral filter: The bilateral filter is an edge-preserving, noise-reducing filter used in image processing.
It is a non-linear filter in which the intensity value at any pixel is a weighted sum of the intensities in its
neighbourhood. The filter involves transcendental operations such as computing exponentials, which can be
computationally very expensive. Hence, it is a compute bound problem with regular memory access.
Convolution: Convolution is a common operation used in image processing for effects such as blur,
emboss and sharpen. Given the image signal and the filter, the output at each pixel is equal to the weighted
sum of its neighbours. Since each pixel can be computed independently by a thread, there is ample par-
allelism available. The computation grows with the size of the filter and exhibits a high compute-to-memory
ratio.
The bilateral filtering and convolution workloads are commonly used filters in image processing appli-
cations, justifying their inclusion in our workloads.
Monte-carlo: Monte Carlo methods are used in several areas of science to simulate complex processes,
to validate simpler processes, and to evaluate data. In Monte Carlo (MC) methods, a stochastic model is
constructed in which the expected value of a certain random variable is equal to the physical quantity to
be determined. The expected value of this random variable is then determined by the average of many
independent samples of the random variable. This workload exhibits a regular memory access
pattern and is typically compute bound. The choice of this workload is justified by the wide body of
applications using Monte Carlo methods across varied domains such as computational finance, physics, and
engineering.
List Ranking: The importance of list ranking to parallel computing was identified by Wyllie as
early as 1979 in his Ph.D. thesis [50]. The list ranking problem is to find the distance of every node from
one end of the given linked list. The workload is memory bound due to the highly irregular nature of the
computation involved. The workload is chosen as a case study as list ranking is often a primitive in several
graph and tree based computations.
Connected Components: Finding the connected components of a given undirected graph is a
fundamental graph problem with several applications. Ideas used in parallel algorithms for connected
components find immediate application to other important graph algorithms such as minimum spanning
trees and the like. Hence, this workload is important and offers good scope as a case study.
Lattice Boltzmann Method: The Lattice Boltzmann Method (LBM) refers to a class of applications from
computational fluid dynamics (CFD) that are used in fluid simulations. It is a numerical method that solves
the Navier–Stokes equations via the discrete Boltzmann equation. In this work, we study the D3Q19 lattice
model, where over a three dimensional cubic lattice each cell computes its new function values based on
its 19 neighbors [22]. The LBM operation is highly data parallel. This workload is considered for its
applications to computational fluid dynamics.
Image Dithering: In Floyd-Steinberg Dithering (FSD) (see [18]), we approximate a higher color res-
olution image using a limited color palette by diffusing the errors of the thresholding operation to the
neighboring pixels according to a weighted matrix. The problem is thus inherently sequential and poses an
enormous challenge for a parallel implementation, let alone a hybrid implementation. Dithering has various
applications such as printing, display on low-end LCD or mobile devices, visual cryptography, image
compression, and the like. Further, this workload is significant because of its atypical nature amongst
image processing applications: it does not offer embarrassing parallelism.
Bundle Adjustment: Bundle adjustment refers to the optimal adjustment of the bundles of rays that
leave each 3D feature point and converge onto each camera center, with respect to both camera positions
and point coordinates. Bundle adjustment is carried out using the Levenberg-Marquardt (LM) algorithm
[32, 38] because of its effective damping strategy, which lets it converge quickly from a wide range of
initial guesses. Bundle adjustment is often the slowest and the most computationally resource intensive
step in the Structure-from-Motion pipeline, consuming about half of the total computation time.
4 Implementation Details
In this section, we describe the implementation details of the various workloads described in Section 3. For
the workloads sort, Bilat, Conv, spmv, hist, RC, and LBM, we developed hybrid implementations
for the purposes of this paper. In the remaining cases, we use implementations developed in recent existing
works that are known to be the best possible. The workloads LR, CC, Dither, MC, spgemm, and Bundle
fall under this category. For more details on these implementations, we refer the reader to the technical
reports available at [28].
4.1 Sorting
Our implementation of sorting is a comparison based sorting algorithm based on the techniques of sample
sort reported in [34]. Sample sort involves placing the elements into various bins according to a number of
splitters. For sorting, we apply the basic principle of work partitioning in a hierarchical fashion. We first
compute the histogram of the data in a hybrid manner. Using the histogram results, we perform the binning
process. As the histogram provides a good estimate of the distribution of the data, the binning process
incurs much less overhead. However, in the initial iteration of the kernel, the individual bins that
are created are large and cannot be directly used for sorting. In order to sort each of the bins optimally,
we reduce the size of each bin down to a threshold at which groups of 32 elements can be
compared by a single warp. Each of these warps in effect implements quicksort on its 32 elements. We
recursively run the binning process to reduce the bin sizes to the chosen threshold. The CPU, on the other
hand, is not as compute limited as the GPU, so we can leave the bin sizes on the
CPU at a higher threshold than on the GPU. We also notice that there is a clear trade-off between the number
of recursive calls to split the elements into bins and the time taken to sort the bins independently.
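For illustration, a 32-element bin can be sorted entirely in registers by one warp using a bitonic network over warp shuffles, as sketched below. This is a stand-in on our part (it requires a GPU with warp shuffle support, which the Tesla T10 predates; shared memory would be used there), not necessarily the warp-level sort used in our implementation or in [34]:

// Hedged sketch: each warp sorts one 32-element bin in ascending order
// using a bitonic network realized with register shuffles.
__global__ void warp_sort_bins(int *data, int nbins) {
    int lane = threadIdx.x & 31;
    int bin = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;
    if (bin >= nbins) return;
    int v = data[bin * 32 + lane];
    for (int k = 2; k <= 32; k <<= 1) {         // bitonic merge stages
        for (int j = k >> 1; j > 0; j >>= 1) {  // compare-exchange distance j
            int partner = __shfl_xor_sync(0xffffffffu, v, j);
            bool ascending = ((lane & k) == 0);
            bool keep_min = (((lane & j) == 0) == ascending);
            v = keep_min ? min(v, partner) : max(v, partner);
        }
    }
    data[bin * 32 + lane] = v;
}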
4.2 Histogram
The histogram operation in a parallel setting requires the proper use of atomic increment operations to ensure
consistent and reliable results. We use the work sharing approach, dividing the data set into two parts,
one for the GPU and one for the CPU. We then compute the histogram on both devices in an
overlapped fashion. This step is followed by a simple addition of the results from the two devices. On the
GPU, using shared memory is critical in order to reduce the global memory latency. The atomic increments
are performed by a single warp working on data staged in shared memory. The histogram
is in general a bandwidth bound problem, and hence proper use of the memory channels is essential. On the
CPU, the shared local cache (L1) is used to improve the performance. The resulting partial histograms are
then added bin-by-bin to give the final histogram over the entire input data.
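The GPU side of this scheme can be sketched as follows (a minimal illustration assuming 256 bins and 8-bit intensities, not our exact kernel): each block accumulates a private histogram in shared memory and merges it into the global result, so the expensive atomics mostly hit shared memory.

#define NBINS 256  // assumed 8-bit intensity values

__global__ void hist_gpu(const unsigned char *pixels, int n, unsigned int *global_hist) {
    __shared__ unsigned int local[NBINS];
    for (int b = threadIdx.x; b < NBINS; b += blockDim.x)
        local[b] = 0;                              // clear the block-private histogram
    __syncthreads();

    int stride = gridDim.x * blockDim.x;           // grid-stride loop over the GPU share
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&local[pixels[i]], 1u);          // shared-memory atomic increment
    __syncthreads();

    for (int b = threadIdx.x; b < NBINS; b += blockDim.x)
        atomicAdd(&global_hist[b], local[b]);      // one global atomic per block per bin
}

The CPU computes its partial histogram over its own share in parallel, and the bin-by-bin addition mentioned above combines the two partial results.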
4.3 spmv
In the spmv workload, some of the challenges faced by modern architectures include the overhead of the
auxiliary data structures, irregular memory access patterns due to the sparsity of the matrix, load balancing,
and the like. Several recent works [11] have therefore focussed on optimizing the spmv computation on
most modern architectures.
In our hybrid implementation, we use a novel work sharing based solution, summarized as follows. In
spmv, we notice that the computation involving one row is independent of the computation involving
other rows. This suggests that one should attempt a work sharing based solution. Instead of splitting the
computation according to some threshold, we use the following novel work sharing approach. Our approach
is guided by the fact that spmv is typically used over multiple iterations, so one can rely on preprocessing
techniques that aim to improve the performance of spmv.
Notice that GPUs are good at exploiting massive data parallelism with regular memory access pat-
terns. Therefore, we assign the computation corresponding to the dense rows to the GPU and the computation
corresponding to the sparse rows to the CPU. The exact definition of sparsity is estimated via experimen-
tation. In this direction, we first sort the rows of the matrix according to the number of nonzeros and
rearrange the matrix in increasing order of the number of nonzeros. We rearrange the x
vector accordingly and then assign the computation corresponding to the dense rows to the GPU and the
sparse rows to the CPU. The entire x vector is kept at both the CPU and the GPU. This illustrates that, when
using the work sharing approach, one can divide the computation according to which computation is more
suitable for each of the architectures in the hybrid platform.
On the CPU, we use the Intel MKL [1] library routines and on the GPU we use the CUSP library
routines. These are known to offer the best possible results on each platform.
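The preprocessing step just described can be sketched as follows (host-side and illustrative; the CSR layout is an assumption, and the crossover between sparse and dense rows is tuned experimentally as stated above):

#include <algorithm>
#include <numeric>
#include <vector>

// Order the rows of a CSR matrix by increasing nonzero count. A prefix of the
// permutation (the sparse rows) is assigned to the CPU; the dense suffix goes
// to the GPU. The x vector is replicated on both devices.
std::vector<int> order_rows_by_nnz(const std::vector<int> &row_ptr) {
    int nrows = (int)row_ptr.size() - 1;
    std::vector<int> perm(nrows);
    std::iota(perm.begin(), perm.end(), 0);
    std::sort(perm.begin(), perm.end(), [&](int a, int b) {
        return (row_ptr[a + 1] - row_ptr[a]) < (row_ptr[b + 1] - row_ptr[b]);
    });
    return perm;
}

Since spmv is typically iterated many times with the same matrix, the one-time cost of this reordering is amortized across the iterations.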
4.4 spgemm
For the spgemm workload, one can see that computations on various rows of the input matrices are inde-
pendent of each other. So, we use a work sharing model in our hybrid implementation. We use the Intel
MKL library [1] for computations on the CPU, and use a row-row method based implementation developed
by us recently in [33] for the GPU computations. The main implementation difficulty experienced in this
workload is to arrive at the appropriate work shares. The work share would be dictated primarily by the
volume of the output. Since estimating the volume of output is as hard as actually multiplying the matrices,
one has to rely on heuristics to arrive at the work share. In our implementation, we use the runtime of a CPU
alone implementation and a GPU alone implementation to obtain the work share.
On the CPU, we use the Intel MKL [1] library routines and on the GPU we use the Row-Row method
of matrix multiplication. The row-row method works as follows [33]. In $C_{m \times n} = A_{m \times p} \cdot B_{p \times n}$, the $i$th
row of $C$, denoted $C(i,:)$, is computed as $\sum_{j \in A(i,:)} A(i,j) \cdot B(j,:)$. This formulation works best on GPUs
for sparse matrices as only those elements that contribute to the output are accessed. For more details of the
Row-Row method on the GPU, we refer the reader to [33].
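To fix the formulation, the following sequential host-side sketch computes the row-row product on CSR inputs (our illustration of the arithmetic only; the GPU implementation of [33] parallelizes this across rows and manages output storage very differently):

#include <map>
#include <vector>

// Row-row spgemm: C(i,:) = sum over nonzeros A(i,j) of A(i,j) * B(j,:).
// (Ap, Aj, Ax) and (Bp, Bj, Bx) are CSR arrays; C is built row by row.
void spgemm_row_row(int m,
                    const std::vector<int> &Ap, const std::vector<int> &Aj,
                    const std::vector<double> &Ax,
                    const std::vector<int> &Bp, const std::vector<int> &Bj,
                    const std::vector<double> &Bx,
                    std::vector<std::map<int, double> > &C) {
    C.assign(m, std::map<int, double>());
    for (int i = 0; i < m; i++)                      // every row of A
        for (int t = Ap[i]; t < Ap[i + 1]; t++) {    // every nonzero A(i,j)
            int j = Aj[t];
            double a = Ax[t];
            for (int s = Bp[j]; s < Bp[j + 1]; s++)  // scale B(j,:) and accumulate
                C[i][Bj[s]] += a * Bx[s];
        }
}

Only the rows of B that correspond to nonzeros of A are ever touched, which is the property that makes the formulation attractive for sparse inputs.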
4.5 Ray Casting

Tracing multiple rays in an SIMD fashion is challenging because rays access non-contiguous memory locations, resulting in incoherent and irregular memory accesses.
We notice that there are two main steps in ray casting. The first step is to find the first triangle that is
intersected by each ray. This is then used to find the first tetrahedron intersected. The second step involves
tracing the ray from the first hit point and traversing the ray through the entire mesh to keep accumulating
the intensity values from the interpolation function. The computation for this ray finishes once the ray leaves
the mesh.
In our hybrid implementation, we use a work sharing based solution since the computation for
each ray can be performed independently. However, the nature and amount of computation in the above two
steps differ significantly. For this reason, we proceed as follows. We ensure that the computation for every
ray finishes the first step before the second step starts for any ray. The work share is also varied across
the two steps to reflect their differing nature of computation. The work
share for each step is obtained empirically by studying the time taken by the CPU and the GPU individually
on each of the two steps. We notice that the optimal work shares across the two steps vary significantly
depending on the platform.
4.7 Monte Carlo
Monte Carlo applications typically involve several iterations, each using pseudorandom numbers to estimate
the expected value of a random variable of interest. We chose the application of photon migration from [12].
In the hybrid solution, we use a hybrid pseudorandom number generator that we developed in [8]. In photon
migration, several photons are launched with their position and direction initialized either to zeros (for a
pencil beam initialized at the origin) or to random numbers. At every step a photon takes, a fraction
of its weight is absorbed, and then the photon packet is scattered. The new direction and weight of the
photon are updated. After several such steps, if the remaining weight of a photon is below a certain threshold,
the photon is terminated.
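A single photon's walk can be sketched as the following kernel (a simplified illustration using CURAND as a stand-in for the hybrid generator of [8]; the albedo, the weight threshold, and the omitted position/direction updates are placeholders):

#include <curand_kernel.h>

// One thread simulates one photon: absorb a fraction of the weight at each
// step, scatter the remainder, and terminate once the weight drops below wmin.
__global__ void photon_migration(float *absorbed, int nphotons,
                                 float albedo, float wmin, unsigned long long seed) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nphotons) return;
    curandState st;
    curand_init(seed, p, 0, &st);
    float w = 1.0f;                                  // photon launched with unit weight
    while (w > wmin) {
        absorbed[p] += (1.0f - albedo) * w;          // fraction of the weight absorbed
        w *= albedo;                                 // remainder survives the step
        float step = -logf(curand_uniform(&st));     // sample the next free path length
        (void)step;                                  // position/direction update omitted
    }
}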
4.9 LBM
The LBM workload is highly parallelizable since each particle can be handled by its own thread of com-
putation. The Lattice Boltzmann model simulates the propagation and collision processes of a large
number of particles. We perform the simulation over cubic lattices. A standard notation in LBM is
the DnQm scheme, where the parameter n stands for the dimensions of the cubic lattice and m stands for the
number of "speeds" studied. In this work, we study the D3Q19 lattice model, where over a 3-dimensional
cubic lattice each particle computes its new function values based on its present values.
The computations of the various functions are independent of each other, so in our hybrid implementation
we use a task parallel solution approach. Given the relative speeds of the CPU and the GPU, we choose
to compute four functions on the CPU and the remaining 15 functions on the GPU, as sketched below. Each
GPU thread is assigned the computation with respect to one particle. This can be seen to improve the
memory coalescing effects on the GPU.
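The split can be sketched as follows (illustrative; the function-major layout, the placeholder update, and the choice of which four functions go to the CPU are assumptions on our part):

#include <cuda_runtime.h>
#include <omp.h>

// D3Q19 task split: distribution functions 0..3 are updated on the CPU,
// functions 4..18 on the GPU, within the same time step. f is stored
// function-major, f[q * ncells + cell], so GPU accesses coalesce.
__global__ void lbm_update_gpu(float *f, int ncells, int q_begin, int q_end) {
    int cell = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per particle
    if (cell >= ncells) return;
    for (int q = q_begin; q < q_end; q++)
        f[q * ncells + cell] *= 0.99f;                 // placeholder collide/stream update
}

void lbm_step(float *h_f, float *d_f, int ncells) {
    lbm_update_gpu<<<(ncells + 255) / 256, 256>>>(d_f, ncells, 4, 19); // 15 on the GPU
    #pragma omp parallel for                                           // 4 on the CPU
    for (int cell = 0; cell < ncells; cell++)
        for (int q = 0; q < 4; q++)
            h_f[q * ncells + cell] *= 0.99f;           // same placeholder update
    cudaDeviceSynchronize();                           // join before exchanging results
}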
The GPU architecture is suited to a large number of lightweight threads. We notice that, as we use both the
CPU and the GPU in a work sharing model, we need to transfer at most three floating point numbers from
the CPU to the GPU. We were therefore able to arrive at an efficient hybrid solution. More details of this
implementation appear in [18].
For the bundle adjustment workload, we decompose the LM algorithm into multiple steps, each of which
is performed using a kernel on the GPU or a function on the CPU. Our implementation efficiently schedules
the steps on CPU and GPU to minimize the overall computation time. The concerted work of the CPU and
the GPU is critical to the overall performance gain. The implementation that we use here appears in [15].
Table 2: Summary of results of our implementations on the Hybrid-High and the Hybrid-Low platforms.
The phrase "uar" in the second row refers to a dataset that contains items drawn uniformly at random, as
appropriate for the workload. A citation in the second row indicates that we have used the datasets from
the work cited. The performance gain indicated is according to the following metric: time for sort and
hist, GFLOPS for spmv, time for spgemm, frames per second for RC, time for LBM, Mpixels/sec
for Bilat and Conv, and time for MC, LR, CC, Dither, and Bundle. For the Bundle workload, the
idle time on the Hybrid-Low platform is not available.
the minimum time required by a pure GPU or a pure CPU solution. Secondly, by nature, hybrid solutions
should minimize the amount of time any of the devices is idle. We define the idle time of a hybrid solution as
the total time any device in the hybrid platform is not used in the computation. This could be due to waiting
for results from the other device, or not being allotted any part of the computation, or not being allotted a
large enough part of the computation. A low idle time indicates better resource efficiency.
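In symbols, with $T_{\mathrm{CPU}}$, $T_{\mathrm{GPU}}$, and $T_{\mathrm{hyb}}$ denoting the wall-clock times of the CPU-alone, GPU-alone, and hybrid solutions, and $T_{\mathrm{busy}}(d)$ the time device $d$ spends computing, the two metrics can be written as follows (our formalization of the definitions above):

\[
\text{gain} = \frac{\min(T_{\mathrm{CPU}}, T_{\mathrm{GPU}}) - T_{\mathrm{hyb}}}{\min(T_{\mathrm{CPU}}, T_{\mathrm{GPU}})} \times 100\%,
\qquad
\text{idle} = \sum_{d \in \{\mathrm{CPU},\, \mathrm{GPU}\}} \bigl( T_{\mathrm{hyb}} - T_{\mathrm{busy}}(d) \bigr).
\]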
5.2 Results
Table 2 summarizes the results of our hybrid solutions. The second row of Table 2 specifies the dataset used
in our study. The entries in the third and the fourth row are the percentage improvement of hybrid solutions
using the Hybrid-High and the Hybrid-Low platforms respectively. The fifth and the sixth row of the Table
shows the idle time of our hybrid solutions on both the platforms considered. The performance gain and the
idle times are for the largest input sizes for all workloads except spmv and spgemm. For these two workloads,
we use the average measurement over all the instances in the dataset considered [49]. The values reported
in Table 2 and Figure 3[a]–[l] are the average over multiple runs. The results of Table 2 indicate that our
hybrid solutions offer an average of 30% improvement on the Hybrid-High platform, and an average of 34%
on the Hybrid-Low platform. Remarkably, the Hybrid-Low platform, whose configuration is likely to match
commonly used desktop configurations, also offers good incentives for hybrid computing.
Figure 3[a]–[l] show the performance of our hybrid implementations on various inputs from the datasets
mentioned in the second row of Table 2. The plots in Figure 3 show that our hybrid implementations scale
well over increasing input sizes. In most cases, our maximum input size is limited only by the available
memory on the GPU in the hybrid platform.
On most workloads, our results on the Hybrid-High and Hybrid-Low platforms suggest that hybrid
computing has scope and advantage. Our workloads also have applications in common settings such as
graphics and image manipulation, data processing, and the like. Some of these operations are invoked
internally by regular users of computers such as gamers. The input sizes that we used in evaluating the
Hybrid-Low platform are also close to the typical usage in most cases.
[Figure 3 consists of twelve panels, (a) through (l), plotting the percentage improvement of the Hybrid-High and Hybrid-Low solutions over pure GPU solutions against input size: number of elements (in millions), image size, number of pixels, 2D filter size, number of photons (in 100,000s), number of nodes (in millions), and, for spmv and spgemm, the individual matrices of the dataset from [49] (Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, Webbase, LP).]
Figure 3: The plots show the performance improvement (in percentage) of hybrid solutions over a pure GPU
solution for the workloads considered over various input sizes.
5.3.2 Idle Time
Table 2 also shows the idle time of our workloads on both platforms. For workloads that use a work
sharing approach, it can be observed that the idle time is quite small. This is due to the fact that, at
the right threshold of work distribution, the CPU and the GPU take near identical times.
For workloads using a task parallel solution approach, such as LBM and Bundle, it is possible that the
computation time is not matched between the CPU and the GPU. In the case of LBM and Bundle, further
fine-tuning of the task assignment is also not possible. In the case of the Bundle workload, there
is no equivalent pure-GPU code, as the hybrid code is a direct extension of the available CPU code. Some
tasks are not amenable to further sub-division, which means that computation on those tasks would always
result in an imbalance between the CPU and the GPU runtimes. In such cases, the idle time tends to be high.
5.4 Discussion
In this section, we highlight some of the lessons learnt during our study of hybrid computing.
These can offer some insights into how future heterogeneous architectures at the commodity scale and also
at the higher end can be designed.
Figure 4: Figure showing the CPU and GPU overlapped computation in the Conv hybrid solution on a
3600×3600 image and a 15×15 filter.
fractional independent set (FIS) as two tasks. However, dependent tasks executing on different devices im-
plies that the results of one task have to be necessarily communicated to the other task. The communication
time has to be taken into account when mapping the tasks to the devices.
5.4.4 Identifying and Mapping Work Units in the Task Parallel Approach
Some of our hybrid solutions use the technique of task parallelism. In this technique, we identify work
units, or tasks, and their inter-dependence in terms of their precedences. These tasks are then mapped onto
the best possible device according to the architectural suitability. We discuss two issues in this context that
affect the performance of hybrid solutions.
Firstly, it is not easy in general to identify the right tasks, as computation is often traditionally understood
in a sequential, step-by-step manner. Even in parallel computing, the intention in general is to speed up each
step of the computation using the available processors. Only recently have other methodologies for parallel
computing, such as domain specific languages [19, 25], been gaining attention. While these languages
alleviate the job of writing efficient parallel programs, they can still be constrained by a traditional step-by-
step approach to problem solving.
Identifying the tasks and their dependencies requires a careful reinterpretation of the computation in-
volved. For instance, in the Bilat workload, we noticed that GPUs are not amenable to computing tran-
scendental functions. These were therefore executed on the CPU. Further, we noticed that very few
transcendental function evaluations are actually required for a given image (these depend on the maximum
difference between pixel intensities). Therefore, we precompute these values on the CPU and transfer them
to the GPU, as sketched below. While we may precompute more values than needed for an actual input,
the benefits of this model stem from the fact that recomputing transcendental functions is rather
expensive on any architecture.
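A minimal sketch of this precomputation, assuming 8-bit intensities and a Gaussian range kernel (our illustration; the kernel function and table size are assumptions), is:

#include <cmath>
#include <cuda_runtime.h>

// Precompute the range-kernel exponentials on the CPU and upload them to the
// GPU. For 8-bit images the intensity difference is at most 255, so at most
// 256 values exp(-d^2 / (2 * sigma_r^2)) can ever be needed.
void upload_range_table(float sigma_r, float *d_table /* 256 floats on the GPU */) {
    float h_table[256];
    for (int d = 0; d < 256; d++)   // may precompute more values than a given image uses
        h_table[d] = expf(-(float)(d * d) / (2.0f * sigma_r * sigma_r));
    cudaMemcpy(d_table, h_table, sizeof(h_table), cudaMemcpyHostToDevice);
}

The GPU kernel then replaces each exponential evaluation by a table lookup indexed by the absolute intensity difference.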
Figure 5: Figure showing the assignment of tasks during the LR hybrid solution on a list of size 128 M
elements.
The LR workload offers similar insights. Our implementation of list ranking in a hybrid setting [9]
has a preprocessing phase that requires a large quantity of random numbers. These random numbers can be
generated on the CPU and transferred to the GPU. In our implementation, we generate the random numbers
on the CPU, and the GPU consumes the random numbers thus supplied. This is seen to save a lot of processing
time in the hybrid setting. Figure 5 shows the task assignment used in our hybrid implementation of LR.
Secondly, it is not easy to identify the right task for the right processor. At present, our arguments are
based on intuitive reasoning backed by experimental evidence. In the future, we would like to study formal
mechanisms to arrive at an appropriate and near-optimal task mapping. In fact, arriving at an optimal as-
signment can easily be seen to be an NP-complete problem, and hence one should consider near-optimal
assignments.
6 Related Work
There has been considerable interest in GPU computing in recent years. Some of the notable works include
scan [40], spmv [11], sorting [34], and the like. Other modern architectures that have been studied recently
include the IBM Cell and the multi-core machines. Bader et al. [7] have studied list ranking on the Cell
architecture and show that, by running multiple threads on each SPU, list ranking using the Helman-JáJá
algorithm can be done efficiently. Other notable works on the Cell architecture include [47, 49]. Williams
et al. [49] have studied the spmv kernel on various multi-core architectures including those from Intel, Sun,
and AMD. Since most of the above cited works do not involve hybrid computing, we do not intend to cite
all such works in this paper and refer the reader to other natural sources.
A recent work that motivated this paper is that of Lee et al. [46], who argue that
GPU computing can offer on average only a 3x performance advantage over a multicore CPU on a range
of 14 workloads deemed important for throughput oriented applications. Some of our workloads overlap
with theirs [46]. Their paper also generated a great deal of debate on the applicability and limitations of
GPU computing. Our view, however, is that it is not a question of whether GPUs can outperform CPUs or
vice-versa, but rather what can be achieved when GPUs and CPUs join forces on a viable hybrid computing
platform. Further, for the workloads that are also included in [46], we provide our own GPU and CPU
implementations. In workloads such as Bilat, we use novel ideas such as precomputing the transcendentals
on the GPU for a pure GPU implementation, which improves the performance beyond what is reported in [46].
Hybrid computing is gaining popularity across application areas such as dense linear algebra kernels
[5, 45, 20], maximum flows [23], graph BFS [26], and the like. The aim of this paper, however, is to evaluate
the promise and the potential of hybrid computing by considering a rich set of diverse workloads. Further,
in some of these works (cf. [26, 23, 48]), while both the CPU and the GPU are used in the computation, one
of the devices is idle while the other is performing computation. In contrast, we seek solutions where
both devices are simultaneously involved in the computation.
There have been recent works that propose benchmark suites for GPU computing. Popular amongst
them are Rodinia [14] and SHOC [17]. Some of our workloads, such as sorting and spgemm, are part of
the SHOC Level 1 benchmark suite. Subsets of the workloads considered in our paper appear in other
benchmarking efforts related to parallel computing. The Berkeley report [4] lists dwarfs as computational
patterns that have wide application. Workloads such as sort, hist, spmv, and spgemm are covered by
the Berkeley dwarfs. This serves to illustrate the wide acceptance of our chosen workloads.
7 Conclusions
In this paper, we have evaluated the case for hybrid computing by considering workloads from diverse
application areas on two different hybrid platforms and analyzing their suitability for hybrid computing.
Our study opens the way for evaluating other challenges with respect to hybrid computing, such as power
efficiency, benchmark suites, and performance models for hybrid computing (see [29, 27]).
References
[1] “Intel math kernel library,” http://software.intel.com/en-us/articles/intel-mkl/.
[3] A. Davidson, D. Tarjan, M. Garland, and J. D. Owens, "Efficient Parallel Merge Sort for Fixed
and Variable Length Keys," in Proc. InPar, May 2012.
[5] M. Baboulin, J. Dongarra, and S. Tomov, “Some Issues in Dense Linear Algebra for Multicore and
Special Purpose Architectures,” UT-CS-08-200, University of Tennessee, Tech. Rep., 2008.
[6] D. Bader and K. Madduri, "Gtgraph: A suite of synthetic graph generators," http://wwwstatic.cc.gatech.edu/~kamesh.
[7] D. A. Bader, V. Agarwal, and K. Madduri, “On the Design and Analysis of Irregular Algorithms on
the Cell Processor: A Case Study of List Ranking,” in Proc. of IEEE IPDPS, 2007, pp. 1–10.
[8] D. S. Banerjee, A. Bahl, and K. Kothapalli, “On Demand Fast Parallel Pseudo Random Number Gen-
erator with Applications,” in Proc. LSPP, 2012.
[9] D. S. Banerjee and K. Kothapalli, “Hybrid multicore algorithms for list ranking and graph connected
components,” in Proc. HiPC, 2011.
[10] D. S. Banerjee, K. Kothapalli, P. Sakurikar, and P. J. Narayanan, “Hybrid histogram and sorting with
applications,” Under submission, Available at http://cstar.iiit.ac.in/∼kkishore/hybrid/histandsort.pdf,
2012.
[11] N. Bell and M. Garland, “Implementing sparse matrix-vector multiplication on throughput-oriented
processors,” in Proc. SC, 2009.
[12] D. Boas, J. Culver, J. Stott, and A. Dunn, “Three dimensional monte carlo code for photon migration
through complex heterogeneous media including the adult human head,” Opt. Express, vol. 10, pp.
159–170, Feb 2002.
[13] A. Buluc and J. R. Gilbert, “Challenges and advances in parallel sparse matrix-matrix multiplication,”
in Proc. ICPP, 2008, pp. 503–510.
[14] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A benchmark
suite for heterogeneous computing,” in Proc. IISWC, 2009, pp. 44 –54.
[15] S. Choudhary, S. Gupta, and P. J. Narayanan, “Practical time bundle adjustment for 3d reconstruction
on the gpu,” in Proc. of ECCV Workshop on CVGPU, 2011.
[16] J. K. Cullum and R. A. Willoughby, Lanczos Algorithms for large symmetric eigenvalue computations.
Birkhäuser Boston, 1985.
[18] A. Deshpande, I. Misra, and P. J. Narayanan, “Hybrid implementation of error diffusion dithering,” in
HiPC, 2011, pp. 1–10.
[19] Z. DeVito, N. Joubert, F. Palacios, S. Oakley, M. Medina, M. Barrientos, E. Elsen, F. Ham, A. Aiken,
K. Duraisamy, E. Darve, J. Alonso, and P. Hanrahan, “Liszt: A domain specific language for building
portable mesh-based pde solvers,” in Proc. SC, 2011, pp. 1 –12.
[20] E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief, and S. Tomov,
"LU Factorization for Accelerator-based Systems," in Proc. IEEE/ACS AICCSA, 2011.
[22] J. Habich, T. Zeiser, G. Hager, and G. Wellein, “Performance analysis and optimization strategies for a
D3Q19 lattice Boltzmann kernel on nVIDIA GPUs using CUDA,” Advances in Engineering Software,
vol. 42, no. 5, pp. 266–272, 2011.
[23] Z. He and B. Hong, “Dynamically tuned push-relabel algorithm for the maximum flow problem on
cpu-gpu-hybrid platforms,” in Proc. IPDPS, 2010.
[24] D. R. Helman and J. JàJà, “Designing Practical Efficient Algorithms for Symmetric Multiprocessors,”
in Proc. ALENEX, 1999, pp. 37–56.
[25] S. Hong, H. Chafi, E. Sedlar, and K. Olukotun, “Green-marl: a dsl for easy and efficient graph analy-
sis,” SIGARCH Comput. Archit. News, vol. 40, no. 1, pp. 349–362, 2012.
[26] S. Hong, T. Oguntebi, and K. Olukotun, “Efficient Parallel Graph Exploration on Multi-Core CPU and
GPU,” in Proc. PACT, 2011, pp. 78–88.
[27] S. Hong and H. Kim, “An analytical model for a gpu architecture with memory-level and thread-level
parallelism awareness,” in Proc. ISCA, 2009, pp. 152–163.
[29] K. Kothapalli, R. Mukherjee, S. Rehman, S. Patidar, P. J. Narayanan, and K. Srinathan, “A performance
prediction model for the cuda gpgpu platform,” in Proc. HiPC, 2009.
[30] S. Lad, K. K. Singh, K. Kothapalli, and P. Narayanan, “Hybrid multi-core algorithms for regular image
filtering applications,” Under submission, 2012.
[31] C. Ledergerber, G. Guennebaud, M. Meyer, M. Bacher, and H. Pfister, “Volume mls ray casting,” IEEE
T. Vis. Comp. Gr., vol. 14, no. 6, pp. 1372 –1379, 2008.
[33] K. Matam, S. K. Bharadwaj, and K. Kothapalli, "Sparse matrix matrix multiplication on hybrid
CPU+GPU platforms," in Proc. HiPC, 2012 (to appear).
[34] N. Leischner, V. Osipov, and P. Sanders, "GPU Sample Sort," in Proc. IPDPS, April 2010.
[35] NVidia Corporation, “Cuda: Compute unified device architecture programming guide,” Technical re-
port, NVidia, 2007.
[37] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes, The Art of
Scientific Computing, 2nd ed. Cambridge University Press, 1992.
[39] S. Ribeiro, A. Maximo, C. Bentes, A. Oliveira, and R. Farias, “Memory-aware and efficient ray-
casting algorithm,” in Proceedings of the XX Brazilian Symposium on Computer Graphics and Image
Processing, 2007, pp. 147–154.
[40] S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens, “Scan primitives for GPU computing,” in Proc.
ACM GH, 2007.
[41] Y. Shiloach and U. Vishkin, “An O(log n) parallel connectivity algorithm.” J. Algorithms, pp. 57–67.
[44] M. Tang, J. yi Zhao, R. Tong, and D. Manocha, “GPU accelerated Convex Hull Computation,” in SMI
’12, 2012.
[45] S. Tomov, J. Dongarra, and M. Baboulin, "Towards dense linear algebra for hybrid GPU accelerated
manycore systems," Parallel Computing, vol. 12, pp. 10–16, Dec. 2009.
[47] O. Villa, D. Scarpazza, F. Petrini, and J. Peinador, “Challenges in mapping graph exploration algo-
rithms on advanced multi-core processors,” in Proc. IPDPS, 2007.
[48] Z. Wei and J. JaJa, “Optimization of linked list prefix computations on multithreaded gpus using cuda,”
in Proc. IPDPS, 2010.
[49] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel, “Optimization of sparse matrix-
vector multiplication on emerging multicore platforms,” in Proc. SC, 2007.
[50] J. C. Wyllie, “The complexity of parallel computations,” Ph.D. dissertation, Cornell University, Ithaca,
NY, 1979.