CPU And/or GPU: Revisiting The GPU vs. CPU Myth

March 2013
Abstract
Parallel computing using accelerators has gained widespread research attention in the past few years.
In particular, using GPUs for general purpose computing has brought forth several success stories with
respect to time taken, cost, power, and other metrics. However, accelerator based computing has signifi-
cantly relegated the role of CPUs in computation. As CPUs evolve and also offer matching computational
resources, it is important to also include CPUs in the computation. We call this the hybrid computing
model. Indeed, most computer systems of the present age offer a degree of heterogeneity and therefore
such a model is quite natural.
We reevaluate the claim of a recent paper by Lee et al. (ISCA 2010). We argue that the right question
arising out of Lee et al. (ISCA 2010) is how to use a CPU+GPU platform efficiently, rather than
whether one should use a CPU or a GPU exclusively. To this end, we experiment with a set of 13 diverse
workloads spanning databases, image processing, sparse matrix kernels, and graphs. We experiment
with two different hybrid platforms: one consisting of a 6-core Intel i7-980X CPU and an NVidia Tesla
T10 GPU, and another consisting of an Intel E7400 dual core CPU with an NVidia GT520 GPU. On both
both platforms, we show that hybrid solutions offer a significant advantage over CPU-alone or GPU-alone
solutions, and that our solutions are on average 90% resource efficient.
Our work therefore suggests that hybrid computing can offer tremendous advantages not only on
research-scale platforms but also on commodity-scale systems, bringing significant performance gains and
resource efficiency to a large user community.
1 Introduction
Parallel computing using accelerator based platforms has gained widespread research attention in recent
years. Accelerator based general purpose computing, however, has relegated CPUs to second-class
citizens: a CPU sends data and program to an accelerator and collects the results of the computation from
the accelerator. As CPUs evolve and narrow the performance gap with accelerators on several challenge
problems, it is imperative that parity is restored by also bringing CPUs into the computation process.
Hybrid computing seeks to simultaneously use all the computational resources on a given tightly coupled
platform. We envisage a multicore CPU plus accelerators such as GPUs, see also Figure 1, as one such
realization. The CPU in Figure 1 on the left with six cores is connected to a many-core GPU on the right.
In our work, we use two variants of the model shown in Figure 1 by choosing different CPUs and GPUs.
The case for hybrid computing on such a platform can be made naturally. Computers come with a CPU,
at least a dual-core at present, and are expected to contain tens of cores in the near future. Graphics
processing units are traditionally used for graphics operations, and most computers come equipped with a
graphics card that presently offers several GFLOPS of computing power.
graphics card that presently has several GFLOPS of computing power. Moreover, commodity production
of GPUs has significantly lowered their prices. Hence, an application using both a multicore CPU and an
accelerator such as a GPU can benefit from faster processing speed, better power and resource utilization,
Figure 1: A tightly coupled hybrid platform.
Figure 2: A view of hybrid multicore computing. Figure (a) shows the conventional accelerator based com-
puting where the CPU typically stays idle. Figure (b) shows the hybrid computing model with computation
overlapping between the CPU and the GPU.
and the like. Figure 2 illustrates the benefits of such a hybrid computing model. As can be noticed in Figure
2(b), hybrid computing calls for complete utilization of the available computing resources.
Further, it is believed that GPUs are not well suited for computations that offer little SIMD parallelism,
and have highly irregular memory access patterns. For various reasons, CPUs do not suffer greatly on
such computations. Thus, hybrid computing opens the possibility of novel solution designs that take the
heterogeneity into account. Hence, we posit that hybrid computing has the scope to bring the benefits of
high performance computing to desktop and commodity users as well.
We distinguish our model of hybrid computing from other existing models as follows. We
consider computational resources that are tightly coupled. Supercomputers such as the Tianhe-1A that use a
combination of CPUs and GPUs do not fall in our category. Some of the issues at that scale overlap with
those that arise in hybrid computing, but that setting also raises unique concerns such as the interconnection
network, its cross-section bandwidth, latency, and the like.
Hybrid computing solutions on platforms using a combination of CPUs and GPUs are being studied for
various problem classes such as dense linear algebra kernels [5, 45], maximum flows in networks [23], list
ranking [48], and the like. Most of these works, however, have a few limitations. In some cases, for example
[23, 48, 44], computation is done on only either the CPU or the GPU at any given time. Such a scenario
keeps one of the computing resources idle and is hence wasteful. Secondly, most earlier works consider only
a narrow set of workloads; we instead explore a diverse set of workloads so as to highlight the advantages
and limitations of hybrid multicore computing. Further, in a departure from most earlier works, we study
hybrid computing also on low-end CPU and GPU models and examine the viability of hybrid computing
on such platforms.
The aim of this work is to explore the applicability of hybrid multicore computing to a class of appli-
cations. To this end, we select a collection of 13 different workloads ranging from databases to graphs,
and image analysis, and present hybrid algorithmic solutions for these workloads. Some of the results that
we show in this paper were reported recently [8, 9, 18, 15, 33] or are under submission [10]. We identify two
different algorithm engineering approaches that we have used across our 13 workloads. Further, these ap-
proaches can be used to classify most of the recent works on hybrid computing, including [45, 20]. The two
approaches are described briefly in the following:
• Work Sharing: In a work sharing approach, the problem is decomposed into two or more parts,
with each part running on a different device in the hybrid computing platform (see the sketch after
this list). In this approach, the
work shares have to be chosen appropriately so as to balance the load on the CPU and the GPU. In this
approach, the actual algorithms used on the different units could differ and, in some cases, may
also differ from the best possible algorithm on each device. For example, see our
hybrid algorithm for graph connected components and sparse matrix-vector multiplication described
in Section 4.
• Task Parallel: A task parallel approach views the computation as a collection of inter-
dependent tasks, and tasks are mapped onto the available units. In this setting, one has to
arrive at the best possible mapping that minimizes the overall span and also minimizes the idle time
of the devices. As can be seen, the time taken by a hybrid solution using task parallelism corresponds
to the longest path in the task graph, with nodes and edges labelled by the task completion time and
the communication time respectively.
A slight modification to the task parallelism approach is that of pipelined parallelism. A pipelined
parallelism approach views the computation as a clever combination of the units to set up a pipeline
that has different functionalities at each stage of the pipeline.
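To make the work sharing approach concrete, the following minimal sketch (our illustration, not code from our implementations; the split fraction alpha and the per-element operation are placeholders) overlaps a CUDA kernel on the GPU share with OpenMP threads on the CPU share:

#include <cuda_runtime.h>
#include <omp.h>

// Placeholder per-element work; stands in for the actual workload kernel.
__global__ void process_gpu(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

// Work sharing: the first alpha*n elements go to the GPU, the rest to the CPU.
// alpha is tuned so that both devices finish at nearly the same time.
void hybrid_process(const float *h_in, float *h_out, int n, float alpha) {
    int n_gpu = (int)(alpha * n);
    float *d_in, *d_out;
    cudaMalloc(&d_in, n_gpu * sizeof(float));
    cudaMalloc(&d_out, n_gpu * sizeof(float));
    cudaMemcpy(d_in, h_in, n_gpu * sizeof(float), cudaMemcpyHostToDevice);
    process_gpu<<<(n_gpu + 255) / 256, 256>>>(d_in, d_out, n_gpu); // asynchronous launch

    #pragma omp parallel for // the CPU share runs while the GPU kernel executes
    for (int i = n_gpu; i < n; i++)
        h_out[i] = h_in[i] * h_in[i];

    cudaMemcpy(h_out, d_out, n_gpu * sizeof(float), cudaMemcpyDeviceToHost); // syncs
    cudaFree(d_in);
    cudaFree(d_out);
}

The idle time of such a solution is governed by how well alpha balances the two devices, which is exactly the load balancing question raised above.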
Our main contributions are as follows:

• We develop new hybrid solutions for seven of the 13 workloads studied in this paper: sorting, his-
togram, spmv, bilateral filtering, convolution, ray casting, and the Lattice Boltzmann Method.
• We experiment with a hybrid system consisting of a six core Intel i7 980X (Westmere) CPU and an
NVidia Tesla T10 GPU. On this hybrid platform, called Hybrid-High, we show that hybrid computing
has an average of 29% performance improvement over the best possible GPU-alone implementation.
Our solutions also exhibit 90% resource efficiency on average.
• To promote the idea that hybrid computing has benefits on more widely used platforms, we also exper-
iment with a hybrid platform that has an Intel Core 2 Duo E7400 (Allendale) with an NVidia GT520
GPU. We feel that such configurations still have the potential to make supercomputing affordable
and accessible to everyone. On such a platform, called Hybrid-Low, we show that hybrid computing
results in an average of 37% performance improvement compared to the best possible GPU-alone
implementation. Our solutions also exhibit 90% resource efficiency on average.
• We analyze the above results and offer insights on the limits and applicability of hybrid computing in
the present architectural space.
The title of our paper is motivated by a recent related paper [46] that argued about the relative strengths
of GPUs and multicore CPUs on a set of throughput oriented workloads. We establish through this paper that
accelerator based computing should leverage the combined strengths of all devices in a computing platform
via hybrid computing. A majority of our workloads overlap with those considered in [46].
2 Hybrid Computing Platforms
In this section, we briefly describe the two hybrid CPU+GPU computing platforms that we use in our study.
3 Workloads
In this paper, we experiment with the following workloads. Table 1 summarizes the characteristics of the
various workloads considered.
Sorting: Sorting is one of the fundamental operations in information processing that has many appli-
cations. Recent results on efficient implementations of sorting are reported in [34, 3], to name a few. This
workload is an important case study due to its large number of applications. For the purposes of this
paper, we focus on comparison-based sorting techniques and leave non-comparison based sorting techniques
such as radix sort out of scope.
Histogram: One of the important operations in image processing is to compute the histogram of the
intensity values of the pixels. Computing the histogram of a dataset in parallel typically requires the use of
atomic operations. Characteristics of this workload, such as the use of atomic operations and its memory
bound nature, make it a good case study for hybrid computing.
Table 1: Summary of the characteristics of the workloads considered.

Workload (Short name)                        | Application Area               | Nature    | Characteristics                     | Solution Methodology
Sorting (sort)                               | Semi-numerical, databases      | regular   | compute bound                       | work sharing + task parallel
Histogram (hist)                             | Image processing               | irregular | atomics, memory bound               | work sharing
Sparse matrix-vector multiplication (spmv)   | Sparse Linear Algebra          | irregular | memory bound                        | work sharing
Sparse matrix-matrix multiplication (spgemm) | Sparse Linear Algebra          | irregular | device storage, memory bound        | work sharing
Ray casting (RC)                             | Image processing               | irregular | compute bound                       | work sharing
Bilateral filtering (Bilat)                  | Image processing               | regular   | compute bound                       | work sharing + task parallel
Convolution (Conv)                           | Image processing               | regular   | compute bound                       | work sharing
Monte-Carlo (MC)                             | Physics, computational finance | regular   | compute bound, pseudorandom numbers | task parallel
List Ranking (LR)                            | Graphs and trees               | irregular | memory bound                        | task parallel
Connected Components (CC)                    | Graph algorithms               | irregular | memory bound                        | work sharing
Lattice Boltzmann Method (LBM)               | Computational Fluid Dynamics   | irregular | memory bound                        | task parallel
Image Dithering (Dither)                     | Image processing               | irregular | causal dependencies                 | work sharing
Bundle adjustment (Bundle)                   | Image processing               | irregular | memory bound                        | task parallel
Sparse Matrix-Vector Multiplication (spmv): Efficient operations involving sparse matrices are
essential to achieve high performance across numerical applications such as climate modeling, molecular
dynamics, and the like. In most of the above applications, spmv computation is the main bottleneck. Hence,
efforts to speed up this computation on modern architectures have attracted significant research attention in
recent years [49, 16]. Further, the computation involved in spmv is highly irregular in nature due to the
sparsity of the matrix.
Sparse matrix-matrix multiplication (spgemm): Another operation that is important with respect to
sparse matrices is that of multiplying two sparse matrices. This workload has found applications spanning
several areas such as graph algorithms [26], numerical analysis including computational fluid dynamics [37,
21, 16], and is also included as one of the seven dwarfs in parallel computing in the Berkeley report [4]. A
recent work that reports efficient sparse matrix multiplication on modern architectures is [13].
It is generally accepted that the difficulties of this workload include its irregular nature of computation, the
difficulty in predicting the size of the output, and the concomitant memory management problems.
Both spmv and spgemm are important linear algebra kernels with significant applications, and hence
their choice is justified.
Ray casting: This is a fundamental problem in image analysis and computer graphics. Recent applica-
tions to medical image analysis are also reported [42]. As all rays perform their computations independently,
the problem maps naturally onto parallel architectures. Tracing multiple rays in an SIMD fashion
is challenging, because rays access non-contiguous memory locations, resulting in incoherent and irregular
memory accesses. The choice of this workload is justified because of the range of applications of ray casting
to visual computing.
Bilateral filter: The bilateral filter is an edge-preserving, noise-reducing filter used in image processing.
It is a non-linear filter in which the intensity value at any pixel is a weighted sum of the intensities in its
neighbourhood. The filter involves transcendental operations such as computing exponentials, which can be
computationally very expensive. Hence, it is a compute bound problem with regular memory access.
Convolution: Convolution is a common operation used in image processing for effects such as blur,
emboss and sharpen. Given the image signal and the filter, the output at each pixel is equal to the weighted
sum of its neighbours. Since each pixel can be computed independently by a thread, there is ample par-
allelism available. The computation grows with the size of the filter and exhibits a high compute-to-memory
ratio.
The bilateral filtering and convolution workloads are commonly used filters in image processing appli-
cations, justifying their inclusion in our workloads.
Monte-carlo: Monte Carlo methods are used in several areas of science to simulate complex processes,
to validate simpler processes, and to evaluate data. In Monte Carlo (MC) methods, a stochastic model is
constructed in which the expected value of a certain random variable is equal to the physical quantity to
be determined. The expected value of this random variable is then determined by the average of many
independent samples of the random variable. This workload exhibits a regular memory access
pattern and is typically compute bound. The choice of this workload is justified by the wide body of
applications using Monte Carlo methods across varied domains such as computational finance, physics, and
engineering.
List Ranking: The importance of list ranking to parallel computing was identified by Wyllie as
early as 1979 in his Ph.D. thesis [50]. The list ranking problem is to find the distance of every node from
one end of the given linked list. The workload is memory bound due to the highly irregular nature of the
computation involved. The workload is chosen as a case study as list ranking is often a primitive in several
graph and tree based computations.
Connected Components: Finding the connected components of a given undirected graph is a
fundamental graph problem with several applications. Ideas used in parallel algorithms for connected
components find immediate application to other important graph algorithms such as minimum spanning
trees and the like. Hence, this workload is important and offers good scope as a case study.
Lattice Boltzmann Method: The Lattice Boltzmann Method (LBM) refers to a class of applications from
computational fluid dynamics (CFD) that are used in fluid simulations. It is a numerical method that solves
the Navier–Stokes equations via the discrete Boltzmann equation. In this work, we study the D3Q19 lattice
model, where over a three dimensional cubic lattice each cell computes its new function values based on
its 19 neighbors [22]. The LBM operation is highly data parallel. This workload is considered for its
applications to computational fluid dynamics.
Image Dithering: In Floyd-Steinberg Dithering (FSD) (see [18]), we approximate a higher color res-
olution image using a limited color palette by diffusing the errors of the thresholding operation to the
neighboring pixels according to a weighted matrix. The problem is thus inherently sequential and poses an
enormous challenge for a parallel implementation, let alone a hybrid implementation. Dithering has various
applications such as printing, display on low-end LCD or mobile devices, visual cryptography, image
compression, and the like. Further, this workload is significant because of its atypical nature amongst
image processing applications: it does not offer embarrassing parallelism.
Bundle Adjustment: Bundle adjustment refers to the optimal adjustment of the bundles of rays that
leave each 3D feature point and converge onto each camera center, with respect to both camera positions
and point coordinates. Bundle adjustment is carried out using the Levenberg-Marquardt (LM) algorithm
[32, 38] because of its effective damping strategy, which lets it converge quickly from a wide range of
initial guesses. Bundle adjustment is often the slowest and the most computationally resource intensive
step in the Structure-from-Motion pipeline, consuming about half of the total computation time.
4 Implementation Details
In this section, we describe the implementation details of the various workloads described in Section 3. For
the workloads sort, Bilat, Conv, spmv, hist, RC, and LBM, we developed hybrid implementations
for the purposes of this paper. In the remaining cases, we use implementations developed in recent existing
works that are known to be the best possible. The workloads LR, CC, Dither, MC, spgemm, and Bundle
fall under this category. For more details on these implementations, we refer the reader to the technical
reports available at [28].
4.1 Sorting
Our implementation of sorting is a comparison based sorting algorithm based on the techniques of sample
sort reported in [34]. Sample sort involves placing the elements into various bins according to a number of
splitters. For sorting, we apply the basic principle of work partitioning in a hierarchical fashion. We first
compute the histogram of the data in a hybrid manner. Using the histogram results, we perform the binning
process. As the histogram provides a good estimate of the distribution of the data, the binning process
incurs much less overhead. However, in the initial iteration of the kernel, the individual bins that
are created are large and cannot be directly used for sorting. In order to sort each of the bins optimally,
we reduce the size of each bin down to a threshold at which groups of 32 elements can be
compared by a single warp. Each of these warps in effect implements quicksort on its 32 elements. We
recursively run the binning process to reduce the bin sizes to the chosen threshold. The CPU, on the other
hand, is not as compute limited as the GPU, so we can leave the bin sizes on the
CPU at a higher threshold than on the GPU. We also notice that there is a clear trade-off between the number
of recursive calls to split the elements into bins and the time taken to sort the bins independently.
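For illustration, a 32-element bin can be sorted entirely in registers by one warp using a bitonic network over warp shuffles, as sketched below. This is a stand-in on our part (it requires a GPU with warp shuffle support, which the Tesla T10 predates; shared memory would be used there), not necessarily the warp-level sort used in our implementation or in [34]:

// Hedged sketch: each warp sorts one 32-element bin in ascending order
// using a bitonic network realized with register shuffles.
__global__ void warp_sort_bins(int *data, int nbins) {
    int lane = threadIdx.x & 31;
    int bin = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;
    if (bin >= nbins) return;
    int v = data[bin * 32 + lane];
    for (int k = 2; k <= 32; k <<= 1) {         // bitonic merge stages
        for (int j = k >> 1; j > 0; j >>= 1) {  // compare-exchange distance j
            int partner = __shfl_xor_sync(0xffffffffu, v, j);
            bool ascending = ((lane & k) == 0);
            bool keep_min = (((lane & j) == 0) == ascending);
            v = keep_min ? min(v, partner) : max(v, partner);
        }
    }
    data[bin * 32 + lane] = v;
}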
4.2 Histogram
The histogram operation in a parallel setting requires the proper use of atomic increment operations to ensure
consistent and reliable results. We use the work sharing approach, dividing the data set into two parts,
one for the GPU and one for the CPU. We then compute the histogram on both devices in an
overlapped fashion. This step is followed by a simple addition of the results from the two devices. On the
GPU, using shared memory is critical in order to reduce the global memory latency. The atomic increments
are performed by a single warp working on data staged in shared memory. The histogram
is in general a bandwidth bound problem, and hence proper use of the memory channels is essential. On the
CPU, the shared local cache (L1) is used to improve the performance. The resulting partial histograms are
then added bin-by-bin to give the final histogram over the entire input data.
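The GPU side of this scheme can be sketched as follows (a minimal illustration assuming 256 bins and 8-bit intensities, not our exact kernel): each block accumulates a private histogram in shared memory and merges it into the global result, so the expensive atomics mostly hit shared memory.

#define NBINS 256  // assumed 8-bit intensity values

__global__ void hist_gpu(const unsigned char *pixels, int n, unsigned int *global_hist) {
    __shared__ unsigned int local[NBINS];
    for (int b = threadIdx.x; b < NBINS; b += blockDim.x)
        local[b] = 0;                              // clear the block-private histogram
    __syncthreads();

    int stride = gridDim.x * blockDim.x;           // grid-stride loop over the GPU share
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&local[pixels[i]], 1u);          // shared-memory atomic increment
    __syncthreads();

    for (int b = threadIdx.x; b < NBINS; b += blockDim.x)
        atomicAdd(&global_hist[b], local[b]);      // one global atomic per block per bin
}

The CPU computes its partial histogram over its own share in parallel, and the bin-by-bin addition mentioned above combines the two partial results.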
4.3 spmv
In the spmv workload, some of the challenges faced by modern architectures include the overhead of the
auxiliary data structures, irregular memory access patterns due to the sparsity of the matrix, load balancing,
and the like. Several recent works [11] have therefore focussed on optimizing the spmv computation on
most modern architectures.
In our hybrid implementation, we use a novel work sharing based solution, summarized as follows. In
spmv, we notice that the computation involving one row is independent of the computation involving
other rows. This suggests that one should attempt a work sharing based solution. Instead of splitting the
computation according to some threshold, we use the following novel work sharing approach. Our approach
is guided by the fact that spmv is typically used over multiple iterations, so one can rely on preprocessing
techniques that aim to improve the performance of spmv.
Notice that GPUs are good at exploiting massive data parallelism with regular memory access pat-
terns. Therefore, we assign the computation corresponding to the dense rows to the GPU and the computation
corresponding to the sparse rows to the CPU. The exact definition of sparsity is estimated via experimen-
tation. In this direction, we first sort the rows of the matrix according to the number of nonzeros and
rearrange the matrix in increasing order of the number of nonzeros. We rearrange the x
vector accordingly and then assign the computation corresponding to the dense rows to the GPU and the
sparse rows to the CPU. The entire x vector is kept at both the CPU and the GPU. This illustrates that, when
using the work sharing approach, one can divide the computation according to which computation is more
suitable for each of the architectures in the hybrid platform.
On the CPU, we use the Intel MKL [1] library routines and on the GPU we use the CUSP library
routines. These are known to offer the best possible results on each platform.
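The preprocessing step just described can be sketched as follows (host-side and illustrative; the CSR layout is an assumption, and the crossover between sparse and dense rows is tuned experimentally as stated above):

#include <algorithm>
#include <numeric>
#include <vector>

// Order the rows of a CSR matrix by increasing nonzero count. A prefix of the
// permutation (the sparse rows) is assigned to the CPU; the dense suffix goes
// to the GPU. The x vector is replicated on both devices.
std::vector<int> order_rows_by_nnz(const std::vector<int> &row_ptr) {
    int nrows = (int)row_ptr.size() - 1;
    std::vector<int> perm(nrows);
    std::iota(perm.begin(), perm.end(), 0);
    std::sort(perm.begin(), perm.end(), [&](int a, int b) {
        return (row_ptr[a + 1] - row_ptr[a]) < (row_ptr[b + 1] - row_ptr[b]);
    });
    return perm;
}

Since spmv is typically iterated many times with the same matrix, the one-time cost of this reordering is amortized across the iterations.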
4.4 spgemm
For the spgemm workload, one can see that computations on various rows of the input matrices are inde-
pendent of each other. So, we use a work sharing model in our hybrid implementation. We use the Intel
MKL library [1] for computations on the CPU, and use a row-row method based implementation developed
by us recently in [33] for the GPU computations. The main implementation difficulty experienced in this
workload is to arrive at the appropriate work shares. The work share would be dictated primarily by the
volume of the output. Since estimating the volume of output is as hard as actually multiplying the matrices,
one has to rely on heuristics to arrive at the work share. In our implementation, we use the runtime of a CPU
alone implementation and a GPU alone implementation to obtain the work share.
On the CPU, we use the Intel MKL [1] library routines and on the GPU we use the Row-Row method
of matrix multiplication. The row-row method works as follows [33]. In $C_{m \times n} = A_{m \times p} \cdot B_{p \times n}$, the $i$th
row of $C$, denoted $C(i,:)$, is computed as $\sum_{j \in A(i,:)} A(i,j) \cdot B(j,:)$. This formulation works best on GPUs
for sparse matrices as only those elements that contribute to the output are accessed. For more details of the
Row-Row method on the GPU, we refer the reader to [33].
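To fix the formulation, the following sequential host-side sketch computes the row-row product on CSR inputs (our illustration of the arithmetic only; the GPU implementation of [33] parallelizes this across rows and manages output storage very differently):

#include <map>
#include <vector>

// Row-row spgemm: C(i,:) = sum over nonzeros A(i,j) of A(i,j) * B(j,:).
// (Ap, Aj, Ax) and (Bp, Bj, Bx) are CSR arrays; C is built row by row.
void spgemm_row_row(int m,
                    const std::vector<int> &Ap, const std::vector<int> &Aj,
                    const std::vector<double> &Ax,
                    const std::vector<int> &Bp, const std::vector<int> &Bj,
                    const std::vector<double> &Bx,
                    std::vector<std::map<int, double> > &C) {
    C.assign(m, std::map<int, double>());
    for (int i = 0; i < m; i++)                      // every row of A
        for (int t = Ap[i]; t < Ap[i + 1]; t++) {    // every nonzero A(i,j)
            int j = Aj[t];
            double a = Ax[t];
            for (int s = Bp[j]; s < Bp[j + 1]; s++)  // scale B(j,:) and accumulate
                C[i][Bj[s]] += a * Bx[s];
        }
}

Only the rows of B that correspond to nonzeros of A are ever touched, which is the property that makes the formulation attractive for sparse inputs.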
4.5 Ray Casting

Tracing multiple rays in an SIMD fashion is challenging because rays access non-contiguous memory locations, resulting in incoherent and irregular memory accesses.
We notice that there are two main steps in ray casting. The first step is to find the first triangle that is
intersected by each ray. This is then used to find the first tetrahedron intersected. The second step involves
tracing the ray from the first hit point and traversing the ray through the entire mesh to keep accumulating
the intensity values from the interpolation function. The computation for this ray finishes once the ray leaves
the mesh.
In our hybrid implementation, we use a work sharing based solution since the computation for
each ray can be performed independently. However, the nature and amount of computation in the above two
steps differ significantly. For this reason, we proceed as follows. We ensure that the computation for every
ray finishes the first step before the second step starts for any ray. The work share is also varied across
the two steps to reflect their differing nature of computation. The work
share for each step is obtained empirically by studying the time taken by the CPU and the GPU individually
on each of the two steps. We notice that the optimal work shares across the two steps vary significantly
depending on the platform.
4.7 Monte Carlo
Monte Carlo applications typically involve several iterations, each using pseudorandom numbers to estimate
the expected value of a random variable of interest. We chose the application of photon migration from [12].
In the hybrid solution, we use a hybrid pseudorandom number generator that we developed in [8]. In photon
migration, several photons are launched with their position and direction initialized either to zeros (for a
pencil beam initialized at the origin) or to random numbers. At every step a photon takes, a fraction
of its weight is absorbed, and then the photon packet is scattered. The new direction and weight of the
photon are updated. After several such steps, if the remaining weight of a photon is below a certain threshold,
the photon is terminated.
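A single photon's walk can be sketched as the following kernel (a simplified illustration using CURAND as a stand-in for the hybrid generator of [8]; the albedo, the weight threshold, and the omitted position/direction updates are placeholders):

#include <curand_kernel.h>

// One thread simulates one photon: absorb a fraction of the weight at each
// step, scatter the remainder, and terminate once the weight drops below wmin.
__global__ void photon_migration(float *absorbed, int nphotons,
                                 float albedo, float wmin, unsigned long long seed) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nphotons) return;
    curandState st;
    curand_init(seed, p, 0, &st);
    float w = 1.0f;                                  // photon launched with unit weight
    while (w > wmin) {
        absorbed[p] += (1.0f - albedo) * w;          // fraction of the weight absorbed
        w *= albedo;                                 // remainder survives the step
        float step = -logf(curand_uniform(&st));     // sample the next free path length
        (void)step;                                  // position/direction update omitted
    }
}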
4.9 LBM
The LBM workload is highly parallelizable since each particle can be handled by its own thread of com-
putation. The Lattice Boltzmann model simulates the propagation and collision processes of a large
number of particles. We perform the simulation over cubic lattices. A standard notation in LBM is
the DnQm scheme, where the parameter n stands for the dimensions of the cubic lattice and m stands for the
number of "speeds" studied. In this work, we study the D3Q19 lattice model, where over a 3-dimensional
cubic lattice each particle computes its new function values based on its present values.
The computations of the various functions are independent of each other, so in our hybrid implementation
we use a task parallel solution approach. Given the relative speeds of the CPU and the GPU, we choose
to compute four functions on the CPU and the remaining 15 functions on the GPU, as sketched below. Each
GPU thread is assigned the computation with respect to one particle. This can be seen to improve the
memory coalescing effects on the GPU.
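The split can be sketched as follows (illustrative; the function-major layout, the placeholder update, and the choice of which four functions go to the CPU are assumptions on our part):

#include <cuda_runtime.h>
#include <omp.h>

// D3Q19 task split: distribution functions 0..3 are updated on the CPU,
// functions 4..18 on the GPU, within the same time step. f is stored
// function-major, f[q * ncells + cell], so GPU accesses coalesce.
__global__ void lbm_update_gpu(float *f, int ncells, int q_begin, int q_end) {
    int cell = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per particle
    if (cell >= ncells) return;
    for (int q = q_begin; q < q_end; q++)
        f[q * ncells + cell] *= 0.99f;                 // placeholder collide/stream update
}

void lbm_step(float *h_f, float *d_f, int ncells) {
    lbm_update_gpu<<<(ncells + 255) / 256, 256>>>(d_f, ncells, 4, 19); // 15 on the GPU
    #pragma omp parallel for                                           // 4 on the CPU
    for (int cell = 0; cell < ncells; cell++)
        for (int q = 0; q < 4; q++)
            h_f[q * ncells + cell] *= 0.99f;           // same placeholder update
    cudaDeviceSynchronize();                           // join before exchanging results
}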
The GPU architecture is suited to a large number of lightweight threads. We notice that, as we use both the
CPU and the GPU in a work sharing model, we need to transfer at most three floating point numbers from
the CPU to the GPU. We were therefore able to arrive at an efficient hybrid solution. More details of this
implementation appear in [18].
For the bundle adjustment workload, we decompose the LM algorithm into multiple steps, each of which
is performed using a kernel on the GPU or a function on the CPU. Our implementation efficiently schedules
the steps on CPU and GPU to minimize the overall computation time. The concerted work of the CPU and
the GPU is critical to the overall performance gain. The implementation that we use here appears in [15].
Table 2: Summary of results of our implementations on the Hybrid-High and the Hybrid-Low platforms.
The phrase "uar" in the second row refers to a dataset that contains items drawn uniformly at random, as
appropriate for the workload. A citation in the second row indicates that we have used the datasets from
the work cited. The performance gain indicated is according to the following metric: time for sort and
hist, GFLOPS for spmv, time for spgemm, frames per second for RC, time for LBM, Mpixels/sec
for Bilat and Conv, and time for MC, LR, CC, Dither, and Bundle. For the Bundle workload, the
idle time on the Hybrid-Low platform is not available.
the minimum time required by a pure GPU or a pure CPU solution. Secondly, by nature, hybrid solutions
should minimize the amount of time any of the devices is idle. We define the idle time of a hybrid solution as
the total time any device in the hybrid platform is not used in the computation. This could be due to waiting
for results from the other device, or not being allotted any part of the computation, or not being allotted a
large enough part of the computation. A low idle time indicates better resource efficiency.
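In symbols, with $T_{\mathrm{CPU}}$, $T_{\mathrm{GPU}}$, and $T_{\mathrm{hyb}}$ denoting the wall-clock times of the CPU-alone, GPU-alone, and hybrid solutions, and $T_{\mathrm{busy}}(d)$ the time device $d$ spends computing, the two metrics can be written as follows (our formalization of the definitions above):

\[
\text{gain} = \frac{\min(T_{\mathrm{CPU}}, T_{\mathrm{GPU}}) - T_{\mathrm{hyb}}}{\min(T_{\mathrm{CPU}}, T_{\mathrm{GPU}})} \times 100\%,
\qquad
\text{idle} = \sum_{d \in \{\mathrm{CPU},\, \mathrm{GPU}\}} \bigl( T_{\mathrm{hyb}} - T_{\mathrm{busy}}(d) \bigr).
\]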
5.2 Results
Table 2 summarizes the results of our hybrid solutions. The second row of Table 2 specifies the dataset used
in our study. The entries in the third and the fourth row are the percentage improvement of hybrid solutions
using the Hybrid-High and the Hybrid-Low platforms respectively. The fifth and the sixth row of the Table
shows the idle time of our hybrid solutions on both the platforms considered. The performance gain and the
idle times are for the largest input sizes for all workloads except spmv and spgemm. For these two workloads,
we use the average measurement over all the instances in the dataset considered [49]. The values reported
in Table 2 and Figure 3[a]–[l] are the average over multiple runs. The results of Table 2 indicate that our
hybrid solutions offer an average of 30% improvement on the Hybrid-High platform, and an average of 34%
on the Hybrid-Low platform. Remarkably, the Hybrid-Low platform, whose configuration is likely to match
commonly used desktop configurations, also offers good incentives for hybrid computing.
Figure 3[a]–[l] show the performance of our hybrid implementations on various inputs from the datasets
mentioned in the second row of Table 2. The plots in Figure 3 show that our hybrid implementations scale
well over increasing input sizes. In most cases, our maximum input size is limited only by the available
memory on the GPU in the hybrid platform.
On most workloads, our results on the Hybrid-High and Hybrid-Low platforms suggest that hybrid
computing has scope and advantage. Our workloads also have applications in common settings such as
graphics and image manipulation, data processing, and the like. Some of these operations are invoked
internally by regular users of computers such as gamers. The input sizes that we used in evaluating the
Hybrid-Low platform are also close to the typical usage in most cases.
[Figure 3 consists of twelve panels, (a) through (l), plotting the percentage improvement of the Hybrid-High and Hybrid-Low solutions over pure GPU solutions against input size: number of elements (in millions), image size, number of pixels, 2D filter size, number of photons (in 100,000s), number of nodes (in millions), and, for spmv and spgemm, the individual matrices of the dataset from [49] (Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, Webbase, LP).]
Figure 3: The plots show the performance improvement (in percentage) of hybrid solutions over a pure GPU
solution for the workloads considered over various input sizes.
5.3.2 Idle Time
Table 2 also shows the idle time of our workloads on both platforms. For workloads that use a work
sharing approach, it can be observed that the idle time is quite small. This is due to the fact that, at
the right threshold of work distribution, the CPU and the GPU take near identical times.
For workloads using a task parallel solution approach, such as LBM and Bundle, it is possible that the
computation time is not matched between the CPU and the GPU. In the case of LBM and Bundle, further
fine-tuning of the task assignment is also not possible. In the case of the Bundle workload, there
is no equivalent pure-GPU code, as the hybrid code is a direct extension of the available CPU code. Some
tasks are not amenable to further sub-division, which means that computation on those tasks would always
result in an imbalance between the CPU and the GPU runtimes. In such cases, the idle time tends to be high.
5.4 Discussion
In this section, we highlight some of the lessons learnt during our study of hybrid computing.
These can offer some insights into how future heterogeneous architectures at the commodity scale and also
at the higher end can be designed.
Figure 4: Figure showing the CPU and GPU overlapped computation in the Conv hybrid solution on a
3600×3600 image and a 15×15 filter.
fractional independent set (FIS) as two tasks. However, dependent tasks executing on different devices im-
plies that the results of one task have to be necessarily communicated to the other task. The communication
time has to be taken into account when mapping the tasks to the devices.
5.4.4 Identifying and Mapping Work Units in the Task Parallel Approach
Some of our hybrid solutions use the technique of task parallelism. In this technique, we identify work
units, or tasks, and their inter-dependence in terms of their precedences. These tasks are then mapped onto
the best possible device according to the architectural suitability. We discuss two issues in this context that
affect the performance of hybrid solutions.
Firstly, it is not easy in general to identify the right tasks, as computation is often traditionally understood
in a sequential, step-by-step manner. Even in parallel computing, the intention in general is to speed up each
step of the computation using the available processors. Only recently have other methodologies for parallel
computing, such as domain specific languages [19, 25], been gaining attention. While these languages
alleviate the job of writing efficient parallel programs, they can still be constrained by a traditional step-by-
step approach to problem solving.
Identifying the tasks and their dependencies requires a careful reinterpretation of the computation in-
volved. For instance, in the Bilat workload, we noticed that GPUs are not amenable to computing tran-
scendental functions. These were therefore executed on the CPU. Further, we noticed that very few
transcendental function evaluations are actually required for a given image (these depend on the maximum
difference between pixel intensities). Therefore, we precompute these values on the CPU and transfer them
to the GPU, as sketched below. While we may precompute more values than needed for an actual input,
the benefits of this model stem from the fact that recomputing transcendental functions is rather
expensive on any architecture.
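A minimal sketch of this precomputation, assuming 8-bit intensities and a Gaussian range kernel (our illustration; the kernel function and table size are assumptions), is:

#include <cmath>
#include <cuda_runtime.h>

// Precompute the range-kernel exponentials on the CPU and upload them to the
// GPU. For 8-bit images the intensity difference is at most 255, so at most
// 256 values exp(-d^2 / (2 * sigma_r^2)) can ever be needed.
void upload_range_table(float sigma_r, float *d_table /* 256 floats on the GPU */) {
    float h_table[256];
    for (int d = 0; d < 256; d++)   // may precompute more values than a given image uses
        h_table[d] = expf(-(float)(d * d) / (2.0f * sigma_r * sigma_r));
    cudaMemcpy(d_table, h_table, sizeof(h_table), cudaMemcpyHostToDevice);
}

The GPU kernel then replaces each exponential evaluation by a table lookup indexed by the absolute intensity difference.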
Figure 5: Figure showing the assignment of tasks during the LR hybrid solution on a list of size 128 M
elements.
The LR workload offers similar insights. Our implementation of list ranking in a hybrid setting [9]
has a preprocessing phase that requires a large quantity of random numbers. These random numbers can be
generated on the CPU and transferred to the GPU. In our implementation, we generate the random numbers
on the CPU, and the GPU consumes the random numbers thus supplied. This is seen to save a lot of processing
time in the hybrid setting. Figure 5 shows the task assignment used in our hybrid implementation of LR.
Secondly, it is not easy to identify the right task for the right processor. At present, our arguments are
based on intuitive reasoning backed by experimental evidence. In the future, we would like to study formal
mechanisms to arrive at an appropriate and near-optimal task mapping. In fact, arriving at an optimal as-
signment can easily be seen to be an NP-complete problem, and hence one should consider near-optimal
assignments.
6 Related Work
There has been considerable interest in GPU computing in recent years. Some of the notable works include
scan [40], spmv [11], sorting [34], and the like. Other modern architectures that have been studied recently
include the IBM Cell and the multi-core machines. Bader et al. [7] have studied list ranking on the Cell
architecture and show that, by running multiple threads on each SPU, list ranking using the Helman-JáJá
algorithm can be done efficiently. Other notable works on the Cell architecture include [47, 49]. Williams
et al. [49] have studied the spmv kernel on various multi-core architectures including those from Intel, Sun,
and AMD. Since most of the above cited works do not involve hybrid computing, we do not intend to cite
all such works in this paper and refer the reader to other natural sources.
A recent work that motivated this paper is that of Lee et al. [46], who argue that
GPU computing can offer on average only a 3x performance advantage over a multicore CPU on a range
of 14 workloads deemed important for throughput oriented applications. Some of our workloads overlap
with theirs [46]. Their paper also generated a great deal of debate on the applicability and limitations of
GPU computing. Our view, however, is that it is not a question of whether GPUs can outperform CPUs or
vice-versa, but rather what can be achieved when GPUs and CPUs join forces on a viable hybrid computing
platform. Further, for the workloads that are also included in [46], we provide our own GPU and CPU
implementations. In workloads such as Bilat, we use novel ideas such as precomputing the transcendentals
on the GPU for a pure GPU implementation, which improves the performance beyond what is reported in [46].
Hybrid computing is gaining popularity across application areas such as dense linear algebra kernels
[5, 45, 20], maximum flows [23], graph BFS [26], and the like. The aim of this paper, however, is to evaluate
the promise and the potential of hybrid computing by considering a rich set of diverse workloads. Further,
in some of these works (cf. [26, 23, 48]), while both the CPU and the GPU are used in the computation, one
of the devices is idle while the other is performing computation. In contrast, we seek solutions where
both devices are simultaneously involved in the computation.
There have been recent works that propose benchmark suites for GPU computing. Popular amongst
them are Rodinia [14] and SHOC [17]. Some of our workloads, such as sorting and spgemm, are part of
the SHOC Level 1 benchmark suite. Subsets of the workloads considered in our paper appear in other
benchmarking efforts related to parallel computing. The Berkeley report [4] lists dwarfs as computational
patterns that have wide application. Workloads such as sort, hist, spmv, and spgemm are covered by
the Berkeley dwarfs. This serves to illustrate the wide acceptance of our chosen workloads.
7 Conclusions
In this paper, we have evaluated the case for hybrid computing by considering workloads from diverse
application areas on two different hybrid platforms and analyzing their suitability for hybrid computing.
Our study opens the way for evaluating other challenges with respect to hybrid computing, such as power
efficiency, benchmark suites, and performance models for hybrid computing (see [29, 27]).
References
[1] “Intel math kernel library,” http://software.intel.com/en-us/articles/intel-mkl/.
[3] A. Davidson, D. Tarjan, M. Garland, and J. D. Owens, "Efficient Parallel Merge Sort for Fixed
and Variable Length Keys," in Proc. InPar, May 2012.
[5] M. Baboulin, J. Dongarra, and S. Tomov, “Some Issues in Dense Linear Algebra for Multicore and
Special Purpose Architectures,” UT-CS-08-200, University of Tennessee, Tech. Rep., 2008.
[6] D. Bader and K. Madduri, "Gtgraph: A suite of synthetic graph generators," http://wwwstatic.cc.gatech.edu/~kamesh.
[7] D. A. Bader, V. Agarwal, and K. Madduri, “On the Design and Analysis of Irregular Algorithms on
the Cell Processor: A Case Study of List Ranking,” in Proc. of IEEE IPDPS, 2007, pp. 1–10.
[8] D. S. Banerjee, A. Bahl, and K. Kothapalli, “On Demand Fast Parallel Pseudo Random Number Gen-
erator with Applications,” in Proc. LSPP, 2012.
[9] D. S. Banerjee and K. Kothapalli, “Hybrid multicore algorithms for list ranking and graph connected
components,” in Proc. HiPC, 2011.
[10] D. S. Banerjee, K. Kothapalli, P. Sakurikar, and P. J. Narayanan, “Hybrid histogram and sorting with
applications,” Under submission, Available at http://cstar.iiit.ac.in/∼kkishore/hybrid/histandsort.pdf,
2012.
[11] N. Bell and M. Garland, “Implementing sparse matrix-vector multiplication on throughput-oriented
processors,” in Proc. SC, 2009.
[12] D. Boas, J. Culver, J. Stott, and A. Dunn, “Three dimensional monte carlo code for photon migration
through complex heterogeneous media including the adult human head,” Opt. Express, vol. 10, pp.
159–170, Feb 2002.
[13] A. Buluc and J. R. Gilbert, “Challenges and advances in parallel sparse matrix-matrix multiplication,”
in Proc. ICPP, 2008, pp. 503–510.
[14] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A benchmark
suite for heterogeneous computing,” in Proc. IISWC, 2009, pp. 44 –54.
[15] S. Choudhary, S. Gupta, and P. J. Narayanan, “Practical time bundle adjustment for 3d reconstruction
on the gpu,” in Proc. of ECCV Workshop on CVGPU, 2011.
[16] J. K. Cullum and R. A. Willoughby, Lanczos Algorithms for large symmetric eigenvalue computations.
Birkhäuser Boston, 1985.
[18] A. Deshpande, I. Misra, and P. J. Narayanan, “Hybrid implementation of error diffusion dithering,” in
HiPC, 2011, pp. 1–10.
[19] Z. DeVito, N. Joubert, F. Palacios, S. Oakley, M. Medina, M. Barrientos, E. Elsen, F. Ham, A. Aiken,
K. Duraisamy, E. Darve, J. Alonso, and P. Hanrahan, “Liszt: A domain specific language for building
portable mesh-based pde solvers,” in Proc. SC, 2011, pp. 1 –12.
[20] E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief, and S. Tomov,
"LU Factorization for Accelerator-based Systems," in Proc. IEEE/ACS AICCSA, 2011.
[22] J. Habich, T. Zeiser, G. Hager, and G. Wellein, “Performance analysis and optimization strategies for a
D3Q19 lattice Boltzmann kernel on nVIDIA GPUs using CUDA,” Advances in Engineering Software,
vol. 42, no. 5, pp. 266–272, 2011.
[23] Z. He and B. Hong, “Dynamically tuned push-relabel algorithm for the maximum flow problem on
cpu-gpu-hybrid platforms,” in Proc. IPDPS, 2010.
[24] D. R. Helman and J. JàJà, “Designing Practical Efficient Algorithms for Symmetric Multiprocessors,”
in Proc. ALENEX, 1999, pp. 37–56.
[25] S. Hong, H. Chafi, E. Sedlar, and K. Olukotun, “Green-marl: a dsl for easy and efficient graph analy-
sis,” SIGARCH Comput. Archit. News, vol. 40, no. 1, pp. 349–362, 2012.
[26] S. Hong, T. Oguntebi, and K. Olukotun, “Efficient Parallel Graph Exploration on Multi-Core CPU and
GPU,” in Proc. PACT, 2011, pp. 78–88.
[27] S. Hong and H. Kim, “An analytical model for a gpu architecture with memory-level and thread-level
parallelism awareness,” in Proc. ISCA, 2009, pp. 152–163.
[29] K. Kothapalli, R. Mukherjee, S. Rehman, S. Patidar, P. J. Narayanan, and K. Srinathan, “A performance
prediction model for the cuda gpgpu platform,” in Proc. HiPC, 2009.
[30] S. Lad, K. K. Singh, K. Kothapalli, and P. Narayanan, “Hybrid multi-core algorithms for regular image
filtering applications,” Under submission, 2012.
[31] C. Ledergerber, G. Guennebaud, M. Meyer, M. Bacher, and H. Pfister, “Volume mls ray casting,” IEEE
T. Vis. Comp. Gr., vol. 14, no. 6, pp. 1372 –1379, 2008.
[33] K. Matam, S. K. Bharadwaj, and K. Kothapalli, "Sparse matrix matrix multiplication on hybrid
CPU+GPU platforms," in Proc. HiPC, 2012 (to appear).
[34] N. Leischner, V. Osipov, and P. Sanders, "GPU Sample Sort," in Proc. IPDPS, April 2010.
[35] NVidia Corporation, “Cuda: Compute unified device architecture programming guide,” Technical re-
port, NVidia, 2007.
[37] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes, The Art of
Scientific Computing, 2nd ed. Cambridge University Press, 1992.
[39] S. Ribeiro, A. Maximo, C. Bentes, A. Oliveira, and R. Farias, “Memory-aware and efficient ray-
casting algorithm,” in Proceedings of the XX Brazilian Symposium on Computer Graphics and Image
Processing, 2007, pp. 147–154.
[40] S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens, “Scan primitives for GPU computing,” in Proc.
ACM GH, 2007.
[41] Y. Shiloach and U. Vishkin, “An O(log n) parallel connectivity algorithm.” J. Algorithms, pp. 57–67.
[44] M. Tang, J. yi Zhao, R. Tong, and D. Manocha, “GPU accelerated Convex Hull Computation,” in SMI
’12, 2012.
[45] S. Tomov, J. Dongarra, and M. Baboulin, "Towards dense linear algebra for hybrid GPU accelerated
manycore systems," Parallel Computing, vol. 12, pp. 10–16, Dec. 2009.
[47] O. Villa, D. Scarpazza, F. Petrini, and J. Peinador, “Challenges in mapping graph exploration algo-
rithms on advanced multi-core processors,” in Proc. IPDPS, 2007.
[48] Z. Wei and J. JaJa, “Optimization of linked list prefix computations on multithreaded gpus using cuda,”
in Proc. IPDPS, 2010.
[49] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel, “Optimization of sparse matrix-
vector multiplication on emerging multicore platforms,” in Proc. SC, 2007.
[50] J. C. Wyllie, “The complexity of parallel computations,” Ph.D. dissertation, Cornell University, Ithaca,
NY, 1979.