Week 5 - The Impact of Multi-Core Computing On Computational Optimization
Abstract
In this paper we confront the challenges that emerging multi-core architectures are placing on the optimization community. We first identify the unique challenges of multi-core computing and then present the performance characteristics of the architecture. We then present a motivating example, the distributed Bellman-Ford shortest path algorithm, to demonstrate the impact that multi-core architectures have on computation, showing that traditional approaches to parallelization do not scale. We subsequently modify the algorithm to take full advantage of multi-core systems and demonstrate its scalability with numerical results.
Keywords
multi-core computation, Bellman-Ford shortest path, distributed computation, algorithm performance instrumentation
1. Introduction
Multi-core computing will change the way in which we perceive computation and the way in which we build computational optimization algorithms. Over the past decades, Moore's Law [12] has afforded the optimization community the luxury of free performance increases. We are now at a crossroads where the previous increases in serial performance are being replaced by future increases in parallelism [7, 8]. Fortunately, the academic community has recognized this need for parallel computation, as demonstrated in the forward-thinking report from Berkeley [1] about upcoming challenges.
As the number of cores increases, managing access to memory, at its various levels, becomes the key to developing high-performance and scalable implementations. The work of Drepper [6] provides a detailed analysis of these issues and their impact on programmers. In this paper we apply this work at the algorithmic level, providing a different perspective that we believe will be valuable for researchers implementing computational optimization algorithms. To attack these challenges the computing community is also developing new metrics for computation on multi-core processors. These models help researchers understand what is required to achieve maximum performance and scalability on specific multi-core architectures. The roofline model of Williams et al. [17] gives an easy way to visualize how code will perform on modern CPUs, and it can give a reasonable expectation of performance without resorting to low-level and difficult instrumentation. It is this simple approach we hope to duplicate here through the instrumentation of performance factors. We define a performance factor as an observable relationship between a property of an algorithm and its performance.
The main goal of this work was not to produce the fastest Bellman-Ford shortest path implementation, but to find fast and scalable ways to achieve parallelism on multi-core platforms that are easy for algorithm designers to implement. The sample data used for the implementation was taken from the dataset of the “9th DIMACS Implementation Challenge - Shortest Paths” [5].
[Figure: (a) Computational time (vtime) vs. nodes. (b) Change in the average number of iterations per node (iters/nodes) per step. Points are shown for 1, 2, 4, and 8 cores and for problems of 264346, 435666, 1070376, 2758119, and 6262104 nodes.]
By instrumenting and modeling the performance of the multi-core algorithm, a deeper understanding of the problem can be obtained.
Overall performance is a complex interaction of the two and requires an understanding of both. Figure 2 illustrates this by the fact that the number of iterations per block drops at the same time as the block size exceeds the size of the cache available to each core. Here, the increase in cycles per node (shown in Figure 2(a)) is dominated by the transition from blocks running entirely in the local core cache to blocks no longer fitting in these caches. Combined, these factors form a complex view of the total running time, as shown in Figure 2(b).
The effect of block size on performance can be broken down into the following three major areas: algorithm operation, middleware operation (macro), and CPU-core operation (micro). In the remainder of the paper we discuss these three areas, identifying a number of contributing factors.
[Figure 2: (a) Cycles to process a node vs. block size for varying number of cores (1, 2, 4, 8); the y-axis is the average number of cycles per node. (b) Total running time vs. block size.]
Performance Factor 2 (parallel computation). The amount of parallel computation performed increases with the total
number of node iterations.
Since we are not developing a model of how the algorithm changes with block size (although that would be interesting), we simply take the number of node iterations, block iterations, and steps required for each problem and block size as given, and split the algorithm into serial (global communication) and parallel phases.
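To make this split concrete, the sketch below shows one possible shape of such a step loop in C with OpenMP. It is a minimal illustration written for this discussion, not the implementation measured in this paper: the edge-list layout, the treatment of a block as a contiguous range of node indices, and the snapshot-based global communication phase are simplifying assumptions introduced here.

#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Simple edge-list graph; a "block" here is just a contiguous range of
 * node indices.  This layout is a stand-in for whatever data structures
 * the actual implementation uses. */
typedef struct { int from, to; double weight; } edge_t;

/* One sweep over a block [lo, hi): relax every edge whose head lies in the
 * block.  In-block sources read the freshest labels from dist[]; sources in
 * other blocks read the snapshot published during the serial phase, so
 * concurrently running blocks never race on the same data. */
static bool relax_block(double *dist, const double *snap,
                        const edge_t *edges, size_t n_edges, int lo, int hi)
{
    bool changed = false;
    for (size_t e = 0; e < n_edges; e++) {
        int u = edges[e].from, v = edges[e].to;
        if (v < lo || v >= hi)
            continue;
        double du = (u >= lo && u < hi) ? dist[u] : snap[u];
        if (du + edges[e].weight < dist[v]) {
            dist[v] = du + edges[e].weight;
            changed = true;
        }
    }
    return changed;
}

/* Step loop: a serial global communication phase followed by parallel block
 * iterations, each run to local convergence.  dist[] must be initialized by
 * the caller (0 for the source node, a large value everywhere else). */
void bellman_ford_blocked(double *dist, const edge_t *edges, size_t n_edges,
                          int n_nodes, int block_size)
{
    double *snap = malloc((size_t)n_nodes * sizeof *snap);
    bool changed = true;
    while (changed) {                       /* one "step" per outer pass */
        changed = false;
        /* Serial phase: publish the current labels so that every block sees
         * a consistent view of all other blocks during this step. */
        memcpy(snap, dist, (size_t)n_nodes * sizeof *snap);

        /* Parallel phase: blocks are scheduled across the available threads
         * (the pragma takes effect when compiled with OpenMP enabled). */
        #pragma omp parallel for schedule(dynamic) reduction(||:changed)
        for (int lo = 0; lo < n_nodes; lo += block_size) {
            int hi = lo + block_size < n_nodes ? lo + block_size : n_nodes;
            /* First sweep: loads the block's nodes into cache. */
            bool moved = relax_block(dist, snap, edges, n_edges, lo, hi);
            changed = changed || moved;
            /* Subsequent sweeps: iterate the block, now largely in cache,
             * until it converges locally. */
            while (relax_block(dist, snap, edges, n_edges, lo, hi))
                ;
        }
    }
    free(snap);
}

In this form the block size is the tuning knob examined throughout the paper: it determines how much work each parallel task performs between serial phases and how large a cache footprint each block has.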
Performance Factor 3 (overhead). The amount of overhead increases with the number of node iterations.
Performance Factor 4 (parallel overhead). The amount of overhead increases with the number of threads.
For the overhead factor there is little difference between threads (1.5 cycles per thread) in the in-cache case, so we take the average overhead value per node iteration across all cases, which is 10 cycles per node iteration. For the uncached case the parallel overhead factor dominates, and we use linear regression to obtain threads × 33 − 30 cycles per node iteration (with an R² of 0.41). Due to limited space and the importance of the other factors we do not discuss the macro performance factors further.
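For example, under this fit an uncached run on eight threads incurs roughly 8 × 33 − 30 = 234 cycles of overhead per node iteration, compared with the flat 10 cycles per node iteration in the cached case.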
Performance Factor 5 (serial sequential load). The node evaluation time for global communication is constant for
each problem.
After the global communication phase of the algorithm, blocks are scheduled, in parallel, to be iterated. Because the nodes of a block will not be in cache (having been pushed out by the global communication or by other block iterations), the first computation and communication sweeps show the impact of loading the nodes from DRAM into the cache. If the cache is large enough to hold the entire block, the next computation and communication sweeps do not have to reload the node data from DRAM, as it is already in cache. Because of this we have loading and iteration of both communication and computation, giving a total of four algorithmic block contexts.
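As a naming aid, the four contexts can be written out explicitly; the identifiers below simply mirror the panel labels used in Figure 3 and are not taken from any code in the paper.

/* The four algorithmic block contexts: {load, iterate} x {computation,
 * communication}, mirroring the panel labels of Figure 3. */
enum block_context {
    CP_LOAD_COMP,  /* first computation sweep after a block is scheduled      */
    CP_ITER_COMP,  /* subsequent computation sweeps, running largely in cache */
    CP_LOAD_COMM,  /* first communication sweep of a block                    */
    CP_ITER_COMM   /* subsequent communication sweeps                         */
};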
The split between load and iteration is also important since some blocks do not change with a label update and thus will not have additional iterations. With no additional iterations there is no chance to amortize the initial cost of loading the data from DRAM into the cache. Because of this there are two distinct performance profiles: one for blocks that are loaded and iterated, and one for blocks that are only loaded.
1 The process of examining the distribution of events gave a surprising amount of insightful information about the algorithm's performance and
[Figure 3 panels: node evaluation time in cycles vs. block size (abs) for the four algorithmic contexts cploadcomp, cpitercomp, cploadcomm, and cpitercomm, shown for 1, 2, 4, and 8 cores.]
Figure 3: Node evaluation time in cycles vs. block size for different algorithmic contexts
Because of the similarity between communication and computation we define only three performance factors and apply them to the four algorithmic contexts. We define them as follows:
Performance Factor 6 (cached block load). The evaluation time for nodes first accessed in a block with a footprint that fits in cache increases with the number of simultaneous threads and is constant with block size.
Performance Factor 7 (cached block iteration). The evaluation time for nodes previously accessed in a block with a footprint that fits in cache is the same as or better than that of the first access and is constant.
Performance Factor 8 (uncached block access). The evaluation time for nodes in a block with a footprint that does
not fit in cache increases with the number of simultaneous threads.
The order of evaluation for Figure 3 is a-c-b-d, with b-d repeating until local convergence. Therefore the first computation is considered under the cached block load factor. Due to the similarity between computation and communication (we evaluate the same number of nodes) we consider them under the same cached block iteration factor. Finally, when the block size is such that the block footprint does not fit in cache, we consider all algorithmic contexts as uncached block access. Visually it is easy to determine when the block footprint exceeds the cache size; this occurs at around 10^4 nodes per block for the problems presented in the paper. For lower thread counts this threshold is actually a bit larger, since the L3 cache on the AMD Opteron processor is shared among the four cores on a die. Since unused cores do not use the shared L3 cache, the remaining cores have a larger share of it.
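One way to encode this cache-fit test is sketched below. The per-node footprint and the cache size are illustrative parameters to be filled in with measured values; they are not figures reported in this paper.

#include <stdbool.h>
#include <stddef.h>

/* Returns true if a block's working set fits within the share of the shared
 * L3 cache available to each active core on a die.  Unused cores do not
 * consume the shared L3, so fewer active cores per die means a larger share
 * per core.  bytes_per_node and l3_bytes are placeholders, not measurements
 * from this paper. */
static bool block_fits_in_cache(size_t block_nodes, size_t bytes_per_node,
                                size_t l3_bytes, int active_cores_per_die)
{
    if (active_cores_per_die < 1)
        active_cores_per_die = 1;
    return block_nodes * bytes_per_node <= l3_bytes / (size_t)active_cores_per_die;
}

A test of this form is what decides, in the model of the next section, whether the cached or the uncached cycle counts apply to a given run configuration.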
4. Modeling Performance
Now that we have models for how the algorithm, middleware, and processor operate, we can combine them to model performance. Table 1 summarizes which performance factors were used in which algorithmic context. The table also shows the number of cycles per node evaluation for each context (cached and uncached). In cases where the number of threads is a factor we show the numbers for one and eight cores. It also shows the number of times that an algorithmic context/performance factor is utilized (count) during a run (as a reminder, steps are the number of global communication phases).
Using this information, along with the algorithmic performance factors (parallel and serial computation), we construct the following model to predict the wall-clock running time of the algorithm:
runtime = global_communication × steps × nodes
        + ( first_computation(cached, threads) × (steps × nodes)
          + first_communication(cached, threads) × (steps × nodes)
          + subsequent_computation(cached, threads) × (iterations − steps × nodes)
          + subsequent_communication(cached, threads) × (iterations − steps × nodes)
          + overhead(cached, threads) × iterations
          ) × threads⁻¹ × cpu.                                                  (2)
The conversion between cycles and runtime is cpu, the number of cycles per microsecond at which the processor runs (2300 cycles per microsecond for the hardware used in this paper). The performance factors use the cycle times shown in Table 1 and depend on the runtime configuration (blocks and threads). If a block fits in cache the cached access numbers are used; otherwise the uncached numbers are used. It should be noted that only global communication is run in serial and that the remaining algorithmic contexts are run in parallel.
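The following C sketch transcribes Equation 2. The per-context cycle counts are placeholders to be taken from Table 1 (only the overhead fit is given explicitly in the text), and the cycles-to-time conversion is applied here as a division by the clock rate, in cycles per microsecond, to both the serial and the parallel parts; this is one reading of the equation rather than the code actually used in the experiments.

/* Per-node cycle costs for the algorithmic contexts, as summarized in
 * Table 1 of the paper.  The concrete values are placeholders. */
typedef struct {
    double global_comm;  /* serial global communication, cycles per node */
    double first_comp;   /* first computation sweep (block load)         */
    double first_comm;   /* first communication sweep (block load)       */
    double subseq_comp;  /* subsequent computation sweeps                */
    double subseq_comm;  /* subsequent communication sweeps              */
} context_cycles_t;

/* Overhead per node iteration (Performance Factors 3 and 4): a flat
 * 10 cycles in the cached case, threads * 33 - 30 in the uncached case. */
static double overhead_cycles(int cached, int threads)
{
    return cached ? 10.0 : threads * 33.0 - 30.0;
}

/* Equation 2: predicted wall-clock running time in microseconds.
 * steps      = number of global communication phases,
 * nodes      = number of nodes in the problem,
 * iterations = total node iterations over the run,
 * cpu        = clock rate in cycles per microsecond (2300 here). */
static double predict_runtime(const context_cycles_t *c, int cached,
                              double steps, double nodes, double iterations,
                              int threads, double cpu)
{
    double first  = steps * nodes;              /* first sweeps (block loads) */
    double subseq = iterations - steps * nodes; /* in-cache re-sweeps         */

    double serial_cycles = c->global_comm * steps * nodes;
    double parallel_cycles =
        c->first_comp  * first +
        c->first_comm  * first +
        c->subseq_comp * subseq +
        c->subseq_comm * subseq +
        overhead_cycles(cached, threads) * iterations;

    /* The serial phase runs on one core; the parallel phases divide their
     * cycles across the threads.  Divide by cpu to convert to microseconds. */
    return (serial_cycles + parallel_cycles / threads) / cpu;
}

Feeding the measured steps and iterations for each block size through a function of this shape is what produces the predictions compared against observed wall-clock times below.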
The number of steps and iterations that a block size produces was taken directly from the experiments² and fed into the model to obtain a prediction of the runtime using Equation 2. The model performs well, with a mean absolute percentage error (MAPE) of 16%. MAPE was selected due to the large range of values that the model predicts. A regression of the relationship between observed and predicted values indicates a close match, with an R² value of 0.98 and a slope close to 1.
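For reference, the error metric used above is the standard mean absolute percentage error; a minimal helper (not taken from the paper) looks like this:

#include <math.h>
#include <stddef.h>

/* Mean absolute percentage error between observed and predicted runtimes,
 * returned as a percentage. */
static double mape(const double *observed, const double *predicted, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += fabs((observed[i] - predicted[i]) / observed[i]);
    return 100.0 * sum / n;
}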
² A future extension to the work done in this paper would be to develop a model of how block size impacts the number of steps and iterations.
References
[1] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer,
David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick.
The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183,
EECS Department, University of California, Berkeley, Dec 2006. URL http://www.eecs.berkeley.edu/
Pubs/TechRpts/2006/EECS-2006-183.html.
[2] Hans J. Boehm. Threads cannot be implemented as a library. In PLDI ’05: Proceedings of the 2005 ACM
SIGPLAN conference on Programming language design and implementation, pages 261–268, New York, NY,
USA, 2005. ACM. ISBN 1-59593-056-6. URL http://doi.acm.org/10.1145/1065010.1065042.
[3] Lei Chai, Qi Gao, and Dhabaleswar K. Panda. Understanding the impact of multi-core architecture in cluster
computing: A case study with Intel dual-core system. In CCGRID ’07: Proceedings of the Seventh IEEE
International Symposium on Cluster Computing and the Grid, pages 471–478, Washington, DC, USA, 2007.
IEEE Computer Society. ISBN 0-7695-2833-3. URL http://dx.doi.org/10.1109/CCGRID.2007.119.
[4] Matthew Curtis-Maury, Xiaoning Ding, Christos D. Antonopoulos, and Dimitrios S. Nikolopoulos. An evalu-
ation of OpenMP on current and emerging multithreaded/multicore processors. In International Workshop on
OpenMP (IWOMP 2005), 2005. URL http://people.cs.vt.edu/~mfcurt/papers/iwomp05.pdf.
[5] Camil Demetrescu, Andrew Goldberg, and David Johnson, editors. 9th DIMACS Implementation Challenge,
DIMACS Center, Rutgers University, Piscataway, NJ, 2006. http://www.dis.uniroma1.it/~challenge9/.
[6] Ulrich Drepper. What every programmer should know about memory. Red Hat Inc., November 2007. http://people.redhat.com/drepper/cpumemory.pdf.
[7] Michael J. Flynn and Patrick Hung. Microprocessor design issues: Thoughts on the road ahead. IEEE Micro,
25(3):16–31, 2005. ISSN 0272-1732. doi: http://dx.doi.org/10.1109/MM.2005.56. URL http://dx.doi.org/
10.1109/MM.2005.56.
[8] Steve Furber. The future of computer technology and its implications for the computer industry. The Computer
Journal, 2008. doi: 10.1093/comjnl/bxn022. URL http://comjnl.oxfordjournals.org/cgi/content/
abstract/bxn022v1.
[9] Guang R. Gao, Mitsuhisa Sato, and Eduard Ayguadé. Special issue on OpenMP. International Journal of Parallel Programming, 36(3), June 2008.
[10] Kevin R. Hutson, Terri L. Schlosser, and Douglas R. Shier. On the distributed Bellman-Ford algorithm and the looping problem. INFORMS Journal on Computing, 19(4):542–551, 2007.
[11] Nick Maclaren. Why POSIX threads are unsuitable for C++. White paper, February 2006. URL http://www.opengroup.org/platform/single_unix_specification/doc.tpl?gdid=10087.
[12] Gordon E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8):114–117, 1965.
[13] Victor Pankratius, Christoph Schaefer, Ali Jannesari, and Walter F. Tichy. Software engineering for multicore systems: an experience report. In IWMSE ’08: Proceedings of the 1st international workshop on Multicore software engineering, pages 53–60, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-031-9. URL http://doi.acm.org/10.1145/1370082.1370096.
[14] James Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O’Reilly
Media, Inc., 2007.
[15] Christian Terboven, Dieter an Mey, and Samuel Sarholz. OpenMP on multicore architectures. In IWOMP ’07: Proceedings of the 3rd international workshop on OpenMP, pages 54–64, Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 978-3-540-69302-4. doi: http://dx.doi.org/10.1007/978-3-540-69303-1_5.
[16] D. Terpstra, H. Jagode, H. You, and J. Dongarra. Collecting performance data with PAPI-C. In Proceedings of the 3rd Parallel Tools Workshop. Springer-Verlag, 2010.
[17] Samuel Webb Williams, Andrew Waterman, and David A. Patterson. Roofline: An insightful visual performance
model for floating-point programs and multicore architectures. Communications of the ACM, April 2009.