Considerations When Evaluating Microprocessor Platforms
Michael Anderson, Bryan Catanzaro, Jike Chong, Ekaterina Gonina, Kurt Keutzer,
Chao-Yue Lai, Mark Murphy, David Sheffield, Bor-Yiing Su, Narayanan Sundaram

Electrical Engineering and Computer Sciences, University of California, Berkeley

Abstract

Motivated by recent papers comparing CPU and GPU performance, this paper explores the questions: Why do we compare microprocessors, and by what means should we compare them? We distinguish two distinct perspectives from which to make comparisons: application developers and computer architecture researchers. We survey the distinct concerns of these groups, identifying essential information each group expects when interpreting comparisons. We believe the needs of both groups should be addressed separately, as the goals of application developers are quite different from those of computer architects. Reproducibility of results is widely acknowledged as the foundation of scientific investigation. Accordingly, it is imperative that platform comparisons supply enough detail for others to reproduce and contextualize results. As parallel processing continues to increase in importance, and parallel microprocessor architectures continue to proliferate, the importance of conducting and publishing reproducible microprocessor platform comparisons will also increase. We seek to add our voice to the discussion about how these comparisons should be conducted.

1 Introduction

Several recent papers, including [Lee et al., 2010] and [Vuduc et al., 2010], examine widespread claims of 10- to 100-fold performance improvements for applications running on Graphics Processing Units (GPUs) compared to applications running on traditional Central Processing Units (CPUs). These papers note that when comparing well-optimized CPU and GPU implementations of throughput-oriented workloads, for which GPUs are designed, GPUs hold a much more modest performance advantage, on the order of 2.5×. The 10-100× performance improvements quoted for GPU processing therefore arise from comparing against sequential, poorly optimized CPU code. Accordingly, [Lee et al., 2010] suggests that future studies comparing CPUs and GPUs should compare against thread- and SIMD-parallelized CPU code in order to avoid misleading comparisons.
We agree that GPU-CPU comparisons, and particularly speedup numbers, are often taken out of context and used inappropriately. We also agree that many published comparisons are not sufficiently clear about the meaning of the comparisons they provide, which is incompatible with basic scientific methodology. The issues raised by these recent papers give the community an opportunity to revisit the goals of, and methodologies for, performing comparisons.

We wish to add our voice to this discussion. We begin by examining why comparisons between processor architectures are complicated by the rise of non-traditional parallel architectures. We then investigate the motivations behind conducting such comparisons. We find two distinct points of view which hold particular importance. The first is the viewpoint of the application developer, who is focused on advancing the state of the art in a particular application domain, given implementation constraints. The second is the viewpoint of the computer architect, who is primarily concerned with assessing the strengths and weaknesses of various architectural features. Since these researchers have different objectives, it is natural that their perspectives and methodologies should differ. It is not feasible for every cross-platform comparison to conduct experiments that satisfy all audiences. However, it is feasible, and should be expected, that every comparison include or reference the information needed to reproduce and contextualize it.

2 Revisiting Comparisons

Cross-platform comparisons depend critically on a host of details. Everything from the data structures used in a benchmark to the specifics of silicon implementation can strongly affect the results of a comparison. This has always been the case, but we now have an additional hurdle to deal with: the complexities of parallel programming. Concerns about the algorithms and datasets used in comparisons have historically led to the creation of well-defined benchmark suites. For parallel CPU benchmarking, examples include PARSEC [Bienia et al., 2008], SPEComp [Aslot et al., 2001], and SPLASH-2 [Singh et al., 1995]. These benchmark suites have very well defined inputs, outputs, and implementations, which makes it easier to normalize the software side of a cross-platform comparison.

However, today we are faced with a diverse array of hardware platforms which we may wish to compare, such as the Cell Broadband Engine [Chen et al., 2007], NVIDIA GPUs [Lindholm et al., 2008] and AMD GPUs [Owens et al., 2008], Intel's Many Integrated Core Architecture [Seiler et al., 2008], as well as multi-core CPUs from a host of vendors. The diversity among parallel architectures, especially in SIMD width and memory subsystem configuration, is exposed in low-level programming models. Accordingly, different hardware platforms require different source code to perform the same computation efficiently. This makes it much harder to define benchmark suites which can be efficiently targeted across this diverse array of hardware architectures. Recent efforts such as OpenCL [Khronos Group, 2010] are a step towards source code portability. However, the widely divergent characteristics of various parallel architectures require architectural consideration even at the algorithmic level, if efficiency is a primary concern.
And efficiency is always a strong concern when evaluating parallel architectures: without the drive for increased performance, the software complexity required to utilize parallel architectures is an unjustified expense. Since efficient parallelized source code is not generally portable, it is difficult to form a benchmark suite comprised of a body of code which can be reused across parallel architectures. Consequently, it is likely that future benchmark suites will include functional specifications for the computations being tested, but allow for a wide variety of implementations. This further complicates reader comprehension of cross-platform comparisons, since optimized reference implementations may not be widely available.

In addition to the complexities of performing cross-platform comparisons, it is also important to differentiate between the concerns of application developers and those of architecture researchers. We examine these concerns in the following sections.

3 Concerns of Application Developers

Application developers work to advance the state of the art in their application domain, under a set of implementation constraints. These constraints may include developer time and skill, the freedom to rethink algorithms (or not), deployment costs, including hardware budgets in dollars, joules, and watts, and the need to interoperate with large legacy codebases, to name a few. Accordingly, when application developers are presented with new technologies, such as SIMD instruction set extensions, on-chip multiprocessing, or programmable GPUs, they evaluate these technologies for their potential to provide qualitative advancements in a particular domain, given their particular set of implementation constraints. For example, medical imaging researchers strive to improve the kind of imaging which can be practically performed in a clinical setting. Increased computing capabilities make more advanced algorithms for image reconstruction feasible; such algorithms interest medical imaging researchers because they provide new capabilities, such as quicker image acquisition or greater resolution.

To application developers, platform characteristics such as performance, price, and power consumption are useful in a qualitative sense, to answer the question: does a particular computing platform enable an advance in the application domain, given my implementation constraints? Consequently, when application researchers publish comparisons between technologies, they naturally focus on application capabilities, and do not normally carry out architectural comparisons with careful performance characterizations. Application developers seeking to demonstrate new application capabilities naturally focus their energy on computing platforms that best showcase their application, and compare against the established code bases which form the state of the art in their domain.

As we noted earlier, there are many potential computing platforms beyond the Intel CPUs and NVIDIA GPUs considered in [Lee et al., 2010], ranging from AMD CPUs and GPUs, to Sun Niagara 2, IBM Power 7, Cell, and BlueGene, as well as FPGA implementations, among others. Although each computing platform has unique performance characteristics, fully examining the implementation space to make architectural comparisons is outside the scope of application developers' concerns. There are too many potential choices to consider.
In particular, documenting precisely how far one architecture lags behind another is of little value to application developers, since such experiments generally do not improve the state of the art in their domain of expertise. Consequently, when application developers report 10-100× speedup numbers from using a GPU, these speedup numbers should not be interpreted as architectural comparisons claiming that GPUs are 100× faster than CPUs. Instead, they illustrate a return on investment: starting from a legacy, usually sequential, code base that performed at some level x, the application developer was able to take the application to some new level y, with an overall gain of y/x. The reported speedup number merely quantifies the improvement yielded by this effort; it says nothing about the architectural merits of the platforms used to implement the computations being compared.

If application developers had no implementation constraints, it would be feasible to expect them to make detailed comparisons in order to answer the question: which architecture is optimal for my problem, and by how much? However, as we have explained, these comparisons provide the application developer with little value, because the information discovered in such an exercise is only incidental to advancing the state of the art in an application domain. Stated differently, to application developers, 10-100× speedups from using GPUs are not mythical; in fact, they are commonplace and realistic: they come from porting well-established and commonly used sequential applications to GPUs, often after a clean-sheet redesign of the computations in the application and significant algorithmic reworking. The fact that application developers could also have achieved large performance improvements by rewriting their legacy code to target parallel CPU implementations is an orthogonal issue, one that justifiably falls outside the scope of their concerns. Although it is unrealistic to expect application developers to do the implementation work necessary for complete architectural comparisons, we can and should expect them to contextualize their results: such 10-100× speedups should never be claimed as architectural comparisons between CPUs and GPUs, but only as an advancement over the previous implementation to which the comparison is being made. If the previous implementation was important to an application domain, a 10-100× speedup will have a significant impact on those who use it, and so the speedup provides value to application developers, particularly if it can be shown to improve application accuracy or capabilities.

4 Concerns of Architecture Researchers

Computer architects face a different set of problems and concerns than application developers. While application developers can focus on an application, even changing an algorithm to fit a particular processor architecture, most computer architects must make engineering tradeoffs to optimize their designs for a wide variety of applications which may run on their devices, and must treat the workloads they use to test their architectures as fixed. For example, architects working on x86 microarchitectures focus on extracting performance from the x86 ISA, scaling across a broad variety of workloads and design constraints, from Mobile Internet Devices all the way to the datacenter. Instead of focusing on a particular application domain, computer architects must design architectures which perform efficiently across a variety of application domains.
Consequently, computer architects often evaluate architectures based on benchmark suites which sample a broad range of application domains, where each benchmark is simple enough to be well characterized and optimized for the architectures being compared. However, conducting these comparisons is challenging, since it is difficult to normalize away implementation artifacts. For example, when comparing implementations of two different architectures, it is difficult to control for the fact that the implementations may have been carried out on different semiconductor processes, with different design budgets, targeting different customer price points, dissipating different amounts of power, and so on. Although it is difficult to normalize these factors away, it is important for comparisons to present salient information about the implementations being compared, so that the reader may contextualize the results of the comparison.

More specifically, when computer architects make comparisons between architectural features, there are a number of elements which must be carefully considered. The first is the specifics of silicon implementation, which includes the style of logic implementation, such as standard-cell versus full-custom design styles. These details may lead to an order of magnitude performance difference, irrespective of the inherent value of a microarchitectural feature [Chinnery and Keutzer, 2002]. The silicon process used for implementation and its particular characteristics, as well as die size and power consumption figures, may also materially impact the comparison being made. Comparisons made for computer architects should include information about these details, so that the audience can contextualize the results. It would have been useful, for example, if [Lee et al., 2010] had devoted some space to these details in their comparison; as it stands, the paper does not mention that it compares a 65nm GPU released to market in June 2008 with a 45nm CPU released in October 2009, a 16-month gap which represents almost a full generation of Moore's law. Other important details such as die size and power consumption are also omitted, which makes it difficult for architecture researchers to understand the comparison being presented.

Additionally, in order to reproduce results from a comparison, the benchmarks and datasets used to create the comparison must be clearly defined. For example, consider Sparse Matrix Vector Multiplication (SpMV), a well studied kernel on both CPUs and GPUs, e.g., [Williams et al., 2009] and [Bell and Garland, 2009], which serves as one of the 14 benchmarks used in [Lee et al., 2010] to compare a CPU and a GPU. It is well known that SpMV performance varies widely, even when multiplying a given matrix, depending on the data structure used to represent the matrix. For example, [Bell and Garland, 2009] found a 35-fold performance improvement when using a DIA representation compared to the Compressed Sparse Row (CSR) format on single-precision three-point Laplacian matrices, when running both computations on the same GPU. Even among CSR formats, there are multiple implementations, each of which has certain advantages for particular datasets. Concretely, [Bell and Garland, 2009] found that a vector CSR implementation was faster on the whole than a scalar CSR implementation, up to 18 times faster in the extreme, while running on the same hardware. However, [Bell and Garland, 2009] also found certain matrices for which the scalar CSR implementation was 6 times faster than the vector CSR implementation, again running on the same hardware. The choice of sparse matrix representation, SpMV implementation, and sparse matrix dataset used for benchmarking therefore critically impacts observed performance.
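To make this format sensitivity concrete, the following small sketch (ours, not code from the cited papers) stores the same three-point Laplacian matrix in DIA and CSR form using SciPy, multiplies each by the same vector, and spells out a simple "scalar" CSR kernel that walks one row at a time. The numerical result is identical in every case, but the data structures, and therefore the memory access patterns that dominate SpMV performance on real hardware, differ substantially.

```python
# Illustrative sketch only: the same sparse matrix in two storage formats.
# (This is not the benchmark code used in the papers discussed above.)
import numpy as np
import scipy.sparse as sp

n = 10_000
# Three-point Laplacian (tridiagonal): the structured case where the
# DIA (diagonal) format shines.
A_dia = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1],
                 shape=(n, n), format="dia")
A_csr = A_dia.tocsr()              # identical matrix, CSR layout
x = np.random.rand(n)

y_dia = A_dia @ x                  # SpMV over densely stored diagonals
y_csr = A_csr @ x                  # SpMV over indptr/indices/data arrays

def spmv_csr_scalar(A, x):
    """Scalar CSR kernel: one row at a time, with indirect loads of x."""
    y = np.zeros(A.shape[0])
    for row in range(A.shape[0]):
        for k in range(A.indptr[row], A.indptr[row + 1]):
            y[row] += A.data[k] * x[A.indices[k]]
    return y

assert np.allclose(y_dia, y_csr)
assert np.allclose(y_csr, spmv_csr_scalar(A_csr, x))
```

A vector CSR kernel would instead assign a group of SIMD lanes to each row; whether that helps or hurts depends on how many nonzeros each row contains, which is exactly why the dataset and implementation must be reported alongside any performance number.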
In fact, given an absolute performance target, it is possible to construct a dataset which achieves that target, as long as the desired target is feasible with respect to the extremes in SpMV performance observed on a target platform. Since those extremes are widely separated, like the 35-fold range observed in [Bell and Garland, 2009], absolute performance numbers are not interpretable without knowing more about the datasets and implementations used to generate them.

Unfortunately, [Lee et al., 2010] does not detail the datasets or implementations used to perform their comparison. Besides Sparse Matrix Vector Multiplication, other benchmarks used for comparison, such as ray-casting, collision detection, and solvers for collision resolution, also exhibit strong workload- and implementation-dependent performance characteristics. Without access to the details of the benchmark suite which underlies a comparison, readers are not able to interpret, contextualize, or reproduce performance results, since so much of the observed level of performance depends on those details. We encourage those conducting cross-platform comparisons to make the details of their work available.

5 Example Comparisons for Application Research

To give some positive examples of cross-platform comparisons for application researchers, we cite two examples.

[You et al., 2009] discussed algorithmic issues which arise in the parallelization of Automatic Speech Recognition (ASR). ASR takes audio waveforms of human speech and infers the most likely word sequence intended by the speaker. Implementing ASR on parallel platforms presents two challenges to the application developer: efficient SIMD utilization and efficient core-level synchronization. The authors explored four different algorithmic variations of the computation on two parallel platforms, an Intel Core i7 CPU and an NVIDIA GTX280 GPU. The sequential baseline was implemented on a single core of a Core i7 quad-core processor. The authors noted that the platform characteristics of the CPU yielded different algorithmic tradeoffs than the GPU, meaning that the most efficient algorithm for parallel ASR was different for the two platforms. Overall, compared to the sequential baseline implementation, the authors achieved a 3.4× speedup on the quad-core Intel Core i7 and a 10.5× speedup on the GTX280. As this example illustrates, application developers use their freedom to change algorithms and data structures to suit a particular platform. Therefore, a benchmark suite with fixed algorithms and data structures does not provide comparisons which are useful to application developers.

[Feng and Zeng, 2010] is another example which illustrates the concerns of application researchers when performing comparisons. The authors proposed parallelizing chip-level power grid analysis on modern GPUs. The power grid analysis problem requires computing the voltage at all nodes of a circuit as a function of time.
It can be solved by formulating the problem as a linear system Gx = b, with G representing the resistive network connecting the nodes and b representing the sources driving it. Traditionally, these systems have been solved by direct methods such as LU or Cholesky decomposition. However, the memory access patterns of direct methods are not very efficient. As a result, [Feng and Zeng, 2010] proposed a multigrid preconditioned conjugate gradient (MGPCG) method to solve the linear system. The conjugate gradient method is highly data-parallel, and according to the experimental results, the authors achieved a 25× speedup on the GPU using the MGPCG method against a serial implementation using Cholesky decomposition.

The authors of [Feng and Zeng, 2010] are primarily concerned with advancing power grid analysis by introducing a faster solver. Their paper details several different algorithms and their implications for parallelism. To get better data access patterns, they decided not to use traditional direct solvers, but instead to use an iterative method better suited to the architecture of the processor they targeted. The authors detail significant algorithmic exploration, including a pure CG solver, a preconditioned CG solver, a multigrid solver, and finally the multigrid preconditioned CG solver. The 25× speedup comes from exploring the algorithm space and parallelizing the algorithm. Although they compare their parallelized GPU iterative solver with a serial CPU direct solver, the comparison is still valid and of value to those using the serial direct solver. The experimental results are not meant as an architectural comparison of CPUs versus GPUs. Instead, they show how application researchers advance the state of the art by exploring the algorithm space and introducing parallelism.
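As a minimal sketch of the direct-versus-iterative distinction (a toy example of ours using SciPy's stock solvers, not the MGPCG solver of [Feng and Zeng, 2010]), the following code builds a small symmetric positive definite system resembling a resistive mesh and solves Gx = b both with a sparse direct factorization and with conjugate gradient, whose inner loop is dominated by SpMV and vector operations that map naturally onto data-parallel hardware.

```python
# Toy sketch: direct vs. iterative solution of G x = b on a small grid.
# (Illustration only; not the solver or data from the cited paper.)
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu, cg

n = 64                                     # grid of n x n nodes
T = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n))
G = sp.kronsum(T, T).tocsc()               # 2-D 5-point stencil: SPD, mesh-like
b = np.random.rand(n * n)                  # stand-in for the source terms

# Direct method: factor once, then triangular solves (a Cholesky solver
# would additionally exploit symmetry; splu is what SciPy ships).
x_direct = splu(G).solve(b)

# Iterative method: conjugate gradient, built almost entirely from
# SpMV and vector updates.
x_cg, info = cg(G, b, maxiter=5000)

rel_diff = np.linalg.norm(x_cg - x_direct) / np.linalg.norm(x_direct)
print(info, rel_diff)                      # info == 0 means CG converged
```

The point of the sketch is only that the two methods do very different kinds of work to reach the same answer; which one is preferable depends on the target platform, which is precisely the algorithmic freedom that application developers exercise.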
6 Example Comparisons for Architecture Research

We also cite two examples of useful cross-platform comparisons which contain details useful for architecture researchers.

[Kozyrakis and Patterson, 2002] evaluates the authors' prototype vector processor with the EEMBC benchmarks and compares the results to five existing processor architectures. The authors note that the processors were implemented by different design teams, using different process technologies, at different clock frequencies, and with differing power consumption. The results for the embedded processors are normalized to clock frequency, and code size comparisons are also made. The authors performed a multi-architectural comparison and provided in-depth analysis of the embedded applications running on the six different platforms. The EEMBC benchmark suite is well known and characterized, and results are publicly available, making it possible for readers to reproduce the experiments, given the availability of the appropriate hardware platforms.

[Williams et al., 2009] evaluates Sparse Matrix Vector Multiplication across multicore processors from Intel, AMD, and Sun, as well as the Cell Broadband Engine. Although not specifically targeted at architecture researchers, this comparison provides details of the hardware platforms being compared, describes the different optimizations applied to achieve high performance on each of the architectures under consideration, and analyzes the impact of various microarchitectural features of each platform with respect to the optimizations employed. Additionally, the authors describe in detail the benchmark datasets and data structures they employ, and justify why the datasets constitute a good sampling of the space of sparse matrices. These details allow others to reproduce their work and build on it.

7 Conclusion

As parallel architectures continue to diverge, cross-platform comparisons will be increasingly important. Conducting such comparisons is inherently difficult: there is a wide spectrum of implementation concerns, ranging from silicon implementation to algorithmic choices, all of which critically influence the results of such comparisons. Additionally, these comparisons are conducted for widely different purposes, based on the needs and motivations of the researchers conducting them.

We have outlined two important yet divergent viewpoints for conducting comparisons: the viewpoint of application developers and the viewpoint of architecture researchers. Application developers focus on algorithmic innovation, co-designing algorithms for a particular domain to perform efficiently on parallel hardware, given a set of implementation constraints. Computer architects attempt to extract maximal performance from the architectures being compared on a particular benchmark suite, in order to compare architectural features. Although it is not realistic to expect all comparisons to include all the information necessary to satisfy all potential audiences, it is crucial that researchers conducting cross-platform comparisons choose a particular audience and attempt to provide enough information for that audience. Additionally, the results of such a comparison should be reproducible.

Papers like [Lee et al., 2010] and [Vuduc et al., 2010] raise important points; we agree that many papers which compare CPUs and GPUs, find a 100× speedup by using a GPU compared to a sequential CPU implementation, and then claim that the GPU architecture is better suited to that computation, are taking their results far out of context. Without performing the work necessary to understand a workload at an architectural level across platforms, such claims are unsupportable. However, the narrower claim, that a GPU implementation was 100× faster than the legacy sequential implementation, is still valid and potentially of great interest to application developers using the legacy sequential implementation.

If a comparison is being made for architecture researchers, information needs to be presented about the silicon implementation of the processors being compared. Additionally, the benchmark suite being used needs to be fully documented, so that the audience can interpret and reproduce the results. We recognize that space in publications is highly constrained, which presents an obstacle to communicating the complex details of cross-platform comparisons. We suggest the use of expanded technical reports or websites with freely available datasets and source code to communicate these details, since without them, the comparisons are impossible to interpret.

As parallel computing continues to mature, we believe that cross-platform comparisons will play an important role in helping both architecture researchers and application developers understand the landscape of parallel computing. Although conducting these comparisons can be complicated and at times contentious, the discussion they engender is essential to advancing our collective understanding of parallel architectures and applications.
8 Acknowledgements

Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding from U.C. Discovery (Award #DIG07-10227). Additional support comes from Par Lab affiliates National Instruments, NEC, Nokia, NVIDIA, and Samsung.

References

[Aslot et al., 2001] Aslot, V., Domeika, M. J., Eigenmann, R., Gaertner, G., Jones, W. B., and Parady, B. (2001). SPEComp: A new benchmark suite for measuring parallel computer performance. In Proceedings of the International Workshop on OpenMP Applications and Tools: OpenMP Shared Memory Parallel Programming, pages 1-10.

[Bell and Garland, 2009] Bell, N. and Garland, M. (2009). Implementing sparse matrix-vector multiplication on throughput-oriented processors. In SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 1-11, New York, NY, USA. ACM.

[Bienia et al., 2008] Bienia, C., Kumar, S., Singh, J. P., and Li, K. (2008). The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques.

[Chen et al., 2007] Chen, T., Raghavan, R., Dale, J. N., and Iwata, E. (2007). Cell Broadband Engine architecture and its first implementation: A performance view. IBM Journal of Research and Development, 51(5):559-572.

[Chinnery and Keutzer, 2002] Chinnery, D. and Keutzer, K. (2002). Closing the Gap Between ASIC & Custom. Kluwer Academic Publishers.

[Feng and Zeng, 2010] Feng, Z. and Zeng, Z. (2010). Parallel multigrid preconditioning on graphics processing units (GPUs) for robust power grid analysis. In Proceedings of the 47th Design Automation Conference, DAC '10, pages 661-666, New York, NY, USA. ACM.

[Khronos Group, 2010] Khronos Group (2010). OpenCL. http://www.khronos.org/opencl.

[Kozyrakis and Patterson, 2002] Kozyrakis, C. and Patterson, D. (2002). Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 35, pages 283-293, Los Alamitos, CA, USA. IEEE Computer Society Press.

[Lee et al., 2010] Lee, V. W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen, A. D., Satish, N., Smelyanskiy, M., Chennupaty, S., Hammarlund, P., Singhal, R., and Dubey, P. (2010). Debunking the 100X GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU. In ISCA '10, pages 451-460.

[Lindholm et al., 2008] Lindholm, E., Nickolls, J., Oberman, S., and Montrym, J. (2008). NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39-55.

[Owens et al., 2008] Owens, J., Houston, M., Luebke, D., Green, S., Stone, J., and Phillips, J. (2008). GPU computing. Proceedings of the IEEE, 96(5):879-899.

[Seiler et al., 2008] Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., and Hanrahan, P. (2008). Larrabee: A many-core x86 architecture for visual computing. ACM Transactions on Graphics, 27:18:1-18:15.

[Singh et al., 1995] Singh, J. P., Gupta, A., Ohara, M., Torrie, E., and Woo, S. C. (1995). The SPLASH-2 programs: Characterization and methodological considerations. In ISCA '95, pages 24-36.

[Vuduc et al., 2010] Vuduc, R., Chandramowlishwaran, A., Choi, J., Guney, M., and Shringarpure, A. (2010). On the limits of GPU acceleration. In Proceedings of the USENIX Workshop on Hot Topics in Parallelism (HotPar), Berkeley, CA, USA.

[Williams et al., 2009] Williams, S. W., Oliker, L., Shalf, J., Yelick, K., and Demmel, J. (2009). Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing, pages 178-194.

[You et al., 2009] You, K., Chong, J., Yi, Y., Gonina, E., Hughes, C., Chen, Y.-K., Sung, W., and Keutzer, K. (2009). Parallel scalability in speech recognition. IEEE Signal Processing Magazine, 26(6):124-135.