
Proceedings of the 2012 Industrial and Systems Engineering Research Conference

G. Lim and J.W. Herrmann, eds.

The Impact of Multi-Core Computing on Computational Optimization
Timothy Middelkoop
Department of Industrial and Systems Engineering, University of Florida
PO Box 116595, Gainesville FL 32611-6595, U.S.A.

Abstract

In this paper we confront the challenges that emerging multi-core architectures are placing on the optimization community. We first identify the unique challenges of multi-core computing followed by a presentation of the performance characteristics of the architecture. We then present a motivating example, the distributed Bellman-Ford shortest path algorithm, to demonstrate the impact that multi-core architectures have on computation showing that traditional approaches to parallelization do not scale. We subsequently modify the algorithm to take full advantage of multi-core systems and demonstrate scalability with some numerical results.

Keywords
multi-core computation, Bellman-Ford shortest path, distributed computation, algorithm performance instrumentation

1. Introduction
Multi-core computing will change the way in which we perceive computation and the way in which we build computational optimization algorithms. Over the past decades, Moore's Law [12] has afforded the optimization community the luxury of free performance increases. We are now at a crossroads where the previous increases in serial performance are being replaced by future increases in parallelism [7, 8]. Fortunately, the academic community has recognized this need for parallel computation, as demonstrated in the forward-thinking report from Berkeley [1] about upcoming challenges.
As the number of cores increases, managing access to memory, at its various levels, becomes the key to developing high-performance and scalable implementations. The work of Drepper [6] provides a detailed analysis of these issues and their impact on programmers. In this paper we apply this work at the algorithmic level, providing a different perspective that we believe will be valuable for researchers implementing computational optimization algorithms. To attack these challenges the computing community is also developing new metrics for computation on multi-core processors. These models help researchers understand what is required to achieve maximum performance and scalability on specific multi-core architectures. The roofline model of Williams et al. [17] gives an easy way to visualize how code will perform on modern CPUs. It can give a reasonable expectation of performance without resorting to low-level and difficult instrumentation. It is this simple approach we hope to duplicate here through the instrumentation of performance factors. We define a performance factor as an observable relationship between a property of an algorithm and its performance.

The main goal of this work was not to produce the fastest Bellman-Ford shortest path implementation, but to find fast and scalable ways to achieve parallelism on multi-core platforms that are easy for algorithm designers to implement. The sample data used for the implementation was taken from the dataset of the "9th DIMACS Implementation Challenge - Shortest Paths" [5].

1.1 Multi-Core Systems


To understand why communication has a dramatic impact on performance in multi-core algorithms we must understand the basic architecture of multi-core systems. Modern systems are made up of a hierarchy of successive levels of cache and communication buses. At the top are independent processor cores similar to those used in the single-processor systems of the past. Each processor core has a local (exclusive) L1 cache, which is the fastest (it runs at near processor speed) and is on the order of 64 KiB. As we progress down towards main memory, caches become slower, grow larger, and are shared by more cores. For the system used in this paper (a dual-socket 2.6 GHz AMD quad-core Opteron) there is 512 KiB of L2 cache for each core (some systems share the L2 cache between cores) and 3072 KiB of L3 cache shared between the 4 cores on a single die. The system has two sockets, each supporting an integrated memory controller. This means that if a core must access DRAM located on the other socket, the request must be communicated over an inter-socket bus. To ensure that all the cores see the same information stored in DRAM, the caches must coordinate access through a cache coherence protocol. Near-simultaneous access to memory (often for locking) is expensive because this protocol must guarantee coherency and must communicate over slower buses and potentially coordinate between many cache systems. In this paper we do not focus on performance in this area (coordination), as scientific computing should spend most of its time in computation, not coordination.

1.2 Performance of Multi-Core Systems


The previous section indicates that there is the potential for dramatic performance differences in the execution of algorithms based on their communication properties. The following question arises: what impact do the different levels of cache have on performance, and by extension, what is the "best" that we can do? In the past, simply looking at the performance of the processor (through the use of micro-benchmarks) would tell most of the story.

Unfortunately, traditional processor micro-benchmarks do not highlight the performance issues in multi-core processors. Although micro-benchmarks are often misleading, they do reveal important aspects of multi-core computing. Therefore, a suite of benchmarks was developed and run to investigate this point. The first set of benchmarks measures raw memory bandwidth at the various cache levels. In the single-core case (2.6 GHz AMD quad-core Opteron) the benchmark (optimized, GNU C++ compiler) indicates that the processor can sum a sequential array of long (64-bit) integers that fits in the L1 or L2 cache at a rate of around 13 GiB/s. Since this computation has an operational intensity of 1/8 flops/byte it is bandwidth-limited. This result is similar to the expected performance predicted by the roofline model [17] for this processor and application. When the array no longer fits in the L3 cache, performance drops to 3.5 GiB/s, which approaches the theoretical bandwidth of the DRAM. When increasing to 8 simultaneous threads, the DRAM performance drops to 1.2 GiB/s per core, which demonstrates the significance of the shared access to main memory.
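As a rough illustration of this kind of measurement, the following is a minimal, single-threaded sketch of a sequential-sum bandwidth probe; the array size, timing method, and output format are illustrative choices rather than the exact benchmark harness used in the study.

```cpp
// Illustrative only: minimal sequential-sum bandwidth probe (compile with g++ -O2).
#include <sys/time.h>
#include <cstdint>
#include <cstdio>
#include <vector>

static double now_usec() {
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main() {
    // 4M 64-bit integers = 32 MiB: larger than L3, so this measures DRAM bandwidth.
    // Shrink n so the array fits in L1/L2 to measure the cache levels instead.
    const std::size_t n = std::size_t(1) << 22;
    std::vector<std::int64_t> a(n, 1);

    double t0 = now_usec();
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];                       // one add per 8 bytes read (~1/8 ops/byte)
    double t1 = now_usec();

    double gib = double(n * sizeof(std::int64_t)) / (1024.0 * 1024.0 * 1024.0);
    std::printf("sum=%lld  bandwidth=%.2f GiB/s\n",
                static_cast<long long>(sum), gib * 1e6 / (t1 - t0));
    return 0;
}
```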
Since most optimization applications do not process long streams of data with a low operational intensity, it is important to determine a lower bound using randomly accessed data. A second set of benchmarks was run with a similar setup but using unoptimized code with the addition of a variable stride length (the distance to the next element in RAM) and hand-unrolled loops. The benchmark has an average overhead of 2.6 cycles per iteration, ultimately limiting peak bandwidth. For the stride=1 case, performance was on par with the previous DRAM-sized benchmarks, reinforcing the fact that the benchmark is memory bound. When the stride is set to 32, the stride limit of the hardware prefetcher is exceeded and performance is similar to random access, since the prefetcher can no longer speculatively retrieve the next element. In the single-core case bandwidth drops to 550 MiB/s, and to 150 MiB/s in the 8-core case, a dramatic difference from the L1 cache bandwidth.
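The strided variant can be sketched in the same way; the array size and the two stride values below are again illustrative, and the bandwidth reported counts only the bytes actually summed.

```cpp
// Illustrative only: strided-access bandwidth sketch. A stride past the hardware
// prefetcher's limit behaves much like random access.
#include <sys/time.h>
#include <cstdint>
#include <cstdio>
#include <vector>

static double now_usec() {
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

// Touch one 64-bit element every `stride` positions; returns MiB/s of data summed.
static double strided_bandwidth(const std::vector<std::int64_t>& a, std::size_t stride) {
    std::int64_t sum = 0;
    std::size_t n = a.size();
    double t0 = now_usec();
    for (std::size_t i = 0; i < n; i += stride)
        sum += a[i];
    double t1 = now_usec();
    static volatile std::int64_t sink;     // keep the loop from being optimized away
    sink = sum;
    std::size_t touches = (n + stride - 1) / stride;
    return double(touches * sizeof(std::int64_t)) / (1024.0 * 1024.0) * 1e6 / (t1 - t0);
}

int main() {
    std::vector<std::int64_t> a(std::size_t(1) << 24, 1);   // 128 MiB, DRAM resident
    std::printf("stride  1: %8.0f MiB/s\n", strided_bandwidth(a, 1));
    std::printf("stride 32: %8.0f MiB/s\n", strided_bandwidth(a, 32));
    return 0;
}
```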
These benchmarks indicate two things: first, that as memory access moves off the core and into L3/DRAM, performance is impacted by other cores; second, that cache and access patterns can have up to a two-orders-of-magnitude impact on performance. When looking at the benchmarks this is obvious; however, algorithm design must account for these two factors and the middleware must easily support this style of computation. Looking forward, the impact of multiple cores is relatively low at this time (a factor of three for 8 cores in two sockets) but it will continue to degrade as the number of cores accessing shared resources increases.

1.3 Parallel Programming Tools


Developing an algorithm that can be implemented in parallel requires that it conform to the techniques and technologies that the underlying implementation allows. It is unfortunate, according to the literature [2, 4, 11, 13, 15] and in the author's experience, that the popular technologies for building parallel programs (POSIX Threads, Open Multi-Processing (OpenMP) [9], and the Message Passing Interface (MPI)) are not ideal solutions for scientific computing on multi-core systems. Even though MPI can be highly scalable when used in large clusters (HPC), it was not considered for this study as it performs poorly on multi-core systems, since it was not designed for shared cache and memory systems [3]. In general, the problem is that most of the parallel solutions are based on an outdated model of uniform concurrent memory access for multiple processors in a system, which is no longer the case in multi-core systems.
The Threading Building Blocks (TBB) open source project by Intel [14] was used for the implementation presented in this paper. TBB is a template library that manages concurrency by providing design patterns, and the underlying middleware infrastructure, that facilitate and encourage the user to utilize parallel-safe data access methods. The issues of locking are, for the most part, hidden from the user. In this paper we use the middleware to manage all access to shared data by having the middleware indicate which areas are currently available for access to an independently running function (task). Any transfer of data between areas is done either by a function that has access to both areas, or by a single serial global communication phase performed outside the parallel computation phase.
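As a rough sketch of this style of use (not the paper's actual implementation), the following toy program iterates a set of blocks in parallel with tbb::parallel_for; the Block type, its relax() sweep, and the label values are hypothetical stand-ins, and the point is only that each task owns a disjoint range of blocks so that no explicit locking is needed.

```cpp
// Illustrative only: toy TBB parallel phase (compile with g++ -std=c++11 -O2 -ltbb).
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <cstdio>
#include <vector>

struct Block {
    std::vector<double> label;             // labels owned exclusively by this block
    bool relax() {                         // one local sweep; true if any label changed
        bool changed = false;
        for (std::size_t i = 1; i < label.size(); ++i)
            if (label[i - 1] + 1.0 < label[i]) {
                label[i] = label[i - 1] + 1.0;
                changed = true;
            }
        return changed;
    }
};

int main() {
    std::vector<Block> blocks(64);
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        blocks[b].label.assign(1024, 1e9); // "infinite" labels ...
        blocks[b].label[0] = 0.0;          // ... except a zero source label per block
    }

    // Parallel phase: TBB hands each task a disjoint range of blocks; a block is
    // touched only by the task that owns it, keeping shared-memory access safe.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, blocks.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t b = r.begin(); b != r.end(); ++b)
                while (blocks[b].relax()) { /* iterate to local convergence */ }
        });

    // Any exchange of labels between blocks would happen here, in a serial
    // global communication phase outside the parallel region.
    std::printf("block 0, last label: %g\n", blocks[0].label.back());
    return 0;
}
```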

1.4 Performance Instrumentation


Performance optimization on multi-core architectures is difficult due to the interaction between CPU caches and limited shared access to main memory. Optimizing performance requires instrumentation and careful tuning. As we will see later, there is a complex relationship between problem decomposition and performance. Problem-based metrics, such as the number of iterations performed on a node, are not captured by traditional profiling software. In many cases profiling will just confirm that a majority of the computational time is spent in the analysis code. This paper demonstrates the value of tools that combine domain-level metrics with processor profiling information. To collect this data, probes (source-level calls to the profiling library) were developed that collect domain information along with processor performance counter data (for example, cache misses). The probes measure computational time using the high-resolution performance counters available on modern processors via the PAPI library [16]. These probes make it easy to mark portions of the code to be instrumented and to have the results collected, along with contextual information (block number, iteration, etc.), in a database without worrying about the underlying details.
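A minimal probe in this spirit might look as follows; the choice of event (L2 cache misses), the context fields, and printing rather than writing to a database are assumptions of this sketch, not the paper's exact probe interface.

```cpp
// Illustrative only: minimal PAPI-based probe sketch (compile with -lpapi).
#include <papi.h>
#include <cstdio>

struct Probe {
    int eventset;
    long long start_cyc;

    void begin() {
        eventset = PAPI_NULL;
        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_L2_TCM);   // total L2 cache misses
        PAPI_start(eventset);
        start_cyc = PAPI_get_real_cyc();         // high-resolution cycle counter
    }

    void end(int block, int step) {
        long long misses = 0;
        long long cycles = PAPI_get_real_cyc() - start_cyc;
        PAPI_stop(eventset, &misses);
        // The real probes store this row, with its context, in a database.
        std::printf("block=%d step=%d cycles=%lld l2_misses=%lld\n",
                    block, step, cycles, misses);
        PAPI_cleanup_eventset(eventset);
        PAPI_destroy_eventset(&eventset);
    }
};

int main() {
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;

    Probe p;
    p.begin();
    volatile double x = 0.0;                     // stand-in for the instrumented region
    for (int i = 0; i < 1000000; ++i)
        x = x + i * 0.5;
    p.end(/*block=*/0, /*step=*/0);
    return 0;
}
```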

2. Multi-Core Distributed Bellman-Ford Shortest Path


The multi-core distributed Bellman-Ford shortest path algorithm is based on the synchronous distributed Bellman-Ford shortest path algorithm (DBF) as presented in [10]. Recall that the original Bellman-Ford algorithm is a label-updating algorithm. The DBF stores local copies of incoming node labels (instead of storing them in a central location) and iterates over all the nodes until there are no new updates in an iteration, at which time all the local labels are updated in a communication phase. The DBF continues until there are no label updates (computation phase) after a communication phase. This property is important for the multi-core version, as this information can now reside exclusively in a single processor core. In addition, information is replicated for the nodes, reducing the need for the expensive locking operations otherwise needed to ensure the validity of the information.
The key idea for the multi-core DBF (MDBF) is to split the network into blocks. Blocks are processed in parallel as sub-problems; however, nodes outside the block are not updated during the block communication phase. This has three very important properties. The first is that the size of a block can be set to fit in a core's cache, making the processing of the block exclusive to a core. The second is that nodes do not need locked access to the local store of labels (the algorithm pushes label updates), which would add overhead. The third is that data is not shared between cores, which, as we saw earlier, can have a dramatic impact on performance.
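The following toy, serial sketch illustrates this block structure on a small hand-made graph; the graph, the partitioning of nodes into blocks by index, and all sizes are invented for the example, and the per-block loop is what would be run in parallel in the actual implementation.

```cpp
// Illustrative only: toy, serial sketch of the block-decomposed Bellman-Ford idea.
// Each block keeps a private copy of every label, iterates its own nodes to local
// convergence, and copies are reconciled only in a serial communication phase.
#include <cstdio>
#include <vector>

struct Arc { int from, to; double cost; };

int main() {
    const int n = 8, block_size = 4, source = 0;
    const double INF = 1e18;
    std::vector<Arc> arcs = {{0,1,1},{1,2,1},{2,3,1},{3,4,1},{4,5,1},{5,6,1},{6,7,1},{0,7,10}};

    int num_blocks = (n + block_size - 1) / block_size;
    // Private label copy per block: no locks are needed while a block iterates.
    std::vector<std::vector<double>> local(num_blocks, std::vector<double>(n, INF));
    std::vector<double> global_labels(n, INF);
    global_labels[source] = 0.0;

    bool updated = true;
    int steps = 0;
    while (updated) {
        ++steps;
        for (auto& copy : local) copy = global_labels;        // refresh local copies for this step

        // Parallel phase (run serially here): each block relaxes only the nodes it owns,
        // iterating until none of its labels change.
        for (int b = 0; b < num_blocks; ++b) {
            bool changed = true;
            while (changed) {
                changed = false;
                for (const Arc& a : arcs) {
                    if (a.to / block_size != b) continue;      // block b owns node a.to
                    double candidate = local[b][a.from] + a.cost;
                    if (candidate < local[b][a.to]) { local[b][a.to] = candidate; changed = true; }
                }
            }
        }

        // Serial global communication phase: merge owned labels and detect updates.
        updated = false;
        for (int b = 0; b < num_blocks; ++b)
            for (int v = b * block_size; v < n && v < (b + 1) * block_size; ++v)
                if (local[b][v] < global_labels[v]) { global_labels[v] = local[b][v]; updated = true; }
    }

    std::printf("steps=%d  shortest distance to node 7 = %g\n", steps, global_labels[7]);
    return 0;
}
```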
The key decision is in setting the block size. The smaller the block size, the higher in the cache hierarchy it will fit, increasing the speed of subsequent iterations. However, as we will see later, for smaller block sizes the number of iterations processed during a block is also smaller. This translates into a larger number of times that the blocks must be processed. Interestingly, this does not necessarily translate into an increase in the total number of times that a node is processed. For small block sizes, the total number of node evaluations actually increases as block size increases; this continues up to a point before it decreases again. This indicates a complex relationship between the domain, the problem instance, and the algorithm parameters.

3. Performance and Scalability


There are three main variables that can be controlled when running the MDBF algorithm: problem size, the number of cores used, and block size. Figure 1(a) shows a log-log plot of the time in microseconds (vtime) vs. the problem size (nodes). We can clearly see that as the problem size grows, so does the overall computational time. What we also see is that the number of cores and the block size have a dramatic, order-of-magnitude effect on performance. Problem size and the number of cores affect computational time in a predictable manner; however, the effect of block size is a complex interaction of a number of factors and will be the focus of the remainder of the paper.
[Figure 1: Impact of average block size (abs) on the MDBF. (a) Computational time (vtime, µsec) vs. problem size (nodes), for 1, 2, 4, and 8 cores. (b) Change in the average number of iterations per node per step (iters/nodes) vs. average block size (abs), for several problem sizes.]

Although the optimal block size could be found using an auto-tuner (search), we strongly believe that by understanding and modeling the performance of the multi-core algorithm a deeper understanding of the problem can be obtained. Overall performance is a complex interaction of the two and requires an understanding of both. Figure 2 illustrates this by the fact that the number of iterations per block drops at the same time as the block size exceeds the size of cache available to each core. Here, the increase in cycles per node (shown in Figure 2(a)) is dominated by the transition from blocks running entirely in the local core cache to blocks no longer fitting in these caches. Combined, these factors form a complex view of the total running time, as shown in Figure 2(b).
The effect of block size on performance can be broken down into three major areas: algorithm operation, middleware operation (macro), and CPU-core operation (micro). In the remainder of the paper we discuss these three areas, identifying a number of contributing factors.

3.1 Algorithm Performance Factors


Altering the block size changes the convergence properties of the algorithm, since each block converges locally before the global communication phase. Experimentally, reducing the block size increases the number of global communication phases (steps) that the algorithm must make. Since global communication is both expensive (as we will see later) and serial, it reduces the amount of parallel computation time available. What is not so clear is the complex interaction between the block size and the total number of times that a node must be evaluated during local convergence, as shown in Figure 1(b). Our current hypothesis is that as the block size increases, the amount of recalculation in each block increases (recall that blocks iterate locally) due to the delay in the propagation of label values to other blocks. At some point this trend reverses, as the block size is probably sufficient to support a large amount of local computation. This behavior in and of itself is an interesting phenomenon that warrants further study.

Although the effect of block size on the number of steps and iterations performed is not well understood, the number of steps and iterations performed has a direct impact on performance. Because these are easy to measure, we base the algorithmic performance factors on them instead of on block size directly, and only consider the impact on computation. We define computation as the number of instructions (not time) that must be performed doing algorithmic computations. We assume that the algorithmic factors are not significantly influenced by macro or micro performance factors. We have experimentally verified this by tracking the number of instructions (not cycles) per node evaluation performed during the algorithm portion of the program. The following two performance factors formalize this relationship:
Performance Factor 1 (serial computation). The amount of serial computation performed increases with the number
of steps.
[Figure 2: Impact of block size. (a) Average number of cycles to process a node vs. block size, for varying numbers of cores. (b) Total running time (µsec) vs. block size.]

Performance Factor 2 (parallel computation). The amount of parallel computation performed increases with the total
number of node iterations.

Since we are not developing a model of how the algorithm changes based on block size (although interesting) we will
simply use the number of node and block iterations and steps required for each problem and block size as a given and
split the algorithm into serial (global communication) and parallel phases.

3.2 Macro Performance Factors


There are three main sources of overhead (non-value-added time introduced by the TBB middleware): scheduler overhead, which is responsible for assigning tasks; locking overhead, which accounts for the time spent ensuring that parallel accesses to memory happen as expected; and idle time, which is time that a core remains idle (or in a waiting state) because of a lack of work. Due to the complexities of task scheduling, locking, and waiting, and the fact that all macro performance factors are non-value-added, we consider them all as overhead.

We find the overhead by measuring the elapsed time between the beginning and end of the program (wall clock time) using the gettimeofday function. The time spent loading the program and storing the instrumentation data is ignored. We compute overhead as the amount of time that the processors are not working on the algorithm (compute or communicate) as follows:
overhead = (wtime − global) × threads − (compute + communicate). (1)
It should be noted that processor idle time during the serial phase of the algorithm is not considered overhead, but processor idle time during the parallel phase of the algorithm is considered part of the overhead. Cores go idle (or into a busy-wait loop) when there is no available work. This happens with greater impact as the block size increases, and it becomes a serious problem when the block size grows to the point where there are very few blocks compared to the number of available cores. Due to the complexities of the middleware system and the goals of the paper, a sophisticated model of the middleware was not developed. We do, however, characterize the middleware by the following two macro performance factors:

Performance Factor 3 (overhead). The amount of overhead increases with the number of node iterations.
Performance Factor 4 (parallel overhead). The amount of overhead increases with the number of threads.

For the overhead factor there is little difference between threads (1.5 cycles per thread) for the in-cache case, so we take the average overhead value per node iteration over all cases, which is 10 cycles per node iteration. For the uncached case the parallel overhead factor is prevalent, and we use linear regression to obtain 33 × threads − 30 cycles per node iteration (with an R-squared of 0.41). Due to limited space and the importance of the other factors we will not discuss the macro performance factors further.

3.3 Micro Performance Factors


The motivation behind this paper is to develop tools and techniques to facilitate the development of multi-core scalable algorithms. To maximize performance in this environment, algorithms must maximally reuse cache. This section details the performance factors involved with reusing cache, mainly the instruction execution time as it relates to accessing memory at its various levels.

Classically, single-processor systems had deterministic processor instruction times; that is, an instruction would take a fixed number of processor cycles to execute. However, modern multi-core processors are sophisticated systems that, arguably, act more like software than hardware. If one takes this view then it is easy to understand why modeling CPU performance is a difficult task. To make this task more manageable we measure the number of cycles a set of instructions takes in a particular algorithmic context. The performance instrumentation presented in this paper is what allows for the collection of performance data based on algorithmic context. With this information we are able to build more accurate models of processor performance.
The micro performance factors were developed by a combination of understanding memory/processor performance [6], building micro-benchmarks to simulate the operation of various usage patterns of the algorithm, and studying high-resolution performance data from the operating algorithm. The high-resolution data was used to look at the variability and distribution of the sample events, with the assumption that events from the same algorithmic context should show similar processor cycle times. The distribution of each factor was examined individually to ensure that it met expectations. These expectations were formed by estimating the instruction count and what types of memory access patterns were expected, with their associated latencies and throughput. Latency and throughput information was estimated using the optimization guides from Intel and AMD and from our own micro-benchmark efforts.¹ We do not detail them in this paper as they are extensive and highly dependent on the algorithm implementation and processor architecture.
We consider five micro performance factors that are all based on CPU performance (the average number of cycles it
takes to process a node in a block). Each performance factor represents a different algorithmic context. In all but one
case we can see the dramatic impact that cache has on performance. As the block size increases past the point where
the working set (block iteration) fits in the cache we see a dramatic decrease in CPU performance (increase in the
number of cycles to evaluate a node). The only case in which we do not see this behavior is global communication,
which has no iterative component and is not decomposed into blocks.
The global communication phase of the algorithm is run serially and consists mostly of sequential access to the entire problem data set. Since the problem does not fit in any of the caches, the execution time is dominated by DRAM access times. The amount of time to process a node is based on the number of arcs at the node and the proximity of the target node in memory. Relative to the other factors, this factor is essentially constant (for multiple threads, an average of 146 cycles per node). We now formally define this performance factor as follows:

Performance Factor 5 (serial sequential load). The node evaluation time for global communication is constant for
each problem.

After the global communication phase of the algorithm, blocks are scheduled, in parallel, to be iterated. Because the nodes of a block will not be in cache (having been pushed out by the global communication or other block iterations), the first computation sweep and communication sweep will show the impact of loading the nodes from DRAM into the cache. If the cache size is sufficient to host the entire block, then the next computation and communication sweeps will not have to reload the node data from DRAM, as it will be in the cache. Because of this we have loading and iteration contexts for both communication and computation, giving a total of four algorithmic block contexts.
The split between load and iteration is also important since some blocks do not change with a label update and thus will not have additional iterations. Since there are no additional iterations, there will be no chance to amortize the initial cost of loading the data from DRAM into the cache. Because of this there are two distinct performance profiles depending on whether or not iteration is required. By splitting them we now have an iteration context and a loading context with homogeneous performance characteristics. The four contexts can be seen in Figure 3. In all cases we can see the dramatic effect of cache utilization for smaller block sizes.

¹ The process of examining the distribution of events gave a surprising amount of insightful information about the algorithm's performance and how to define the performance factors.

[Figure 3: Node evaluation time in cycles vs. block size (abs) for different algorithmic contexts, for 1, 2, 4, and 8 cores. (a) First block computation. (b) Subsequent block computation. (c) First block communication. (d) Subsequent block communication.]

Because of the similarity between communication and computation, we define only three performance factors and apply them to four algorithmic contexts. We now define them as follows:
Performance Factor 6 (cached block load). The evaluation time for nodes first accessed in a block with a footprint that fits in cache increases with the number of simultaneous threads and is constant with block size.
Performance Factor 7 (cached block iteration). The evaluation time for nodes previously accessed in a block with a footprint that fits in cache is the same as or better than the first access and is constant.
Performance Factor 8 (uncached block access). The evaluation time for nodes in a block with a footprint that does
not fit in cache increases with the number of simultaneous threads.

The order of evaluation for Figure 3 is a-c-b-d, with b-d repeating until local convergence. Therefore the first computation is considered under the cached block load factor. Due to the similarity between computation and communication (we evaluate the same number of nodes), we consider them under the same cached block iteration factor. Finally, when the block size is such that the block footprint does not fit in cache, we consider all algorithmic contexts as uncached block access. Visually it is easy to determine when the block footprint exceeds the cache size; this occurs at around 10^4 nodes in a block for the problems presented in the paper. For lower thread counts this is actually a bit larger, since the L3 cache on the AMD Opteron processor is shared between the four cores on the die. Since unused cores will not utilize the shared L3 cache, the remaining cores will have a larger share of the cache.

4. Modeling Performance
Now that we have models for how the algorithm, middleware, and processor operate, we can combine them to model performance. Table 1 summarizes which performance factors were used in which algorithmic context. The table also shows the number of cycles per node evaluation for each context (cached and uncached). In cases where the number of threads is a factor we show the numbers for one and eight cores. It also shows the number of times that an algorithmic context/performance factor is utilized (count) during a run (as a reminder, steps are the number of global communication phases).

Algorithmic Context        | Cached                        | Uncached                   | Count
global communication       | serial sequential load (145)  |                            | steps × nodes
first computation          | cached load (102, 224)        | uncached access (98, 330)  | steps × nodes
first communication        | cached iteration (71)         | uncached access (98, 381)  | steps × nodes
subsequent computation     | cached iteration (35)         | uncached access (75, 251)  | iterations − steps × nodes
subsequent communication   | cached iteration (49)         | uncached access (96, 297)  | iterations − steps × nodes
-                          | overhead (10)                 | overhead (3, 234)          | iterations

Table 1: Micro performance factors used in each algorithmic context (mean cycles per node evaluation; where the number of threads is a factor, values are given for one and eight cores)

Using this information, along with the algorithmic performance factors (parallel and serial computation), we construct
the following model to predict the wall clock running time of the algorithm:
runtime = [ global_communication × (steps × nodes) +                                            (2)
          ( first_computation(cached, threads) × (steps × nodes) +
            first_communication(cached, threads) × (steps × nodes) +
            subsequent_computation(cached, threads) × (iterations − steps × nodes) +
            subsequent_communication(cached, threads) × (iterations − steps × nodes) +
            overhead(cached, threads) × iterations
          ) × threads⁻¹ ] × cpu⁻¹.
The conversion between cycles and runtime is cpu, the number of cycles per microsecond at which the processor runs (for the hardware in this paper it is 2300 cycles per microsecond). The performance factors use the cycle times shown in Table 1 and are dependent on the runtime configuration (blocks and threads). If a block will fit in cache then the cached access numbers are used; otherwise the uncached numbers are used. It should be noted that only global communication is run in serial and that the remaining algorithmic contexts are run in parallel.
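As a sketch of how the model is evaluated, the following plugs placeholder run statistics into Equation 2 using the cached, single-thread cycle counts from Table 1; the values of nodes, steps, iterations, and threads are invented inputs, and cycles are converted to microseconds by dividing by the cpu constant.

```cpp
// Illustrative only: evaluating the Equation (2) runtime model with placeholder inputs.
#include <cstdio>

int main() {
    // Run statistics (placeholders; a real prediction would take these from an actual run).
    double nodes = 264346, steps = 20, iterations = 2.0e7, threads = 1;

    // Cached, single-thread cycles per node evaluation (Table 1).
    double global_comm = 145, first_comp = 102, first_comm = 71;
    double subseq_comp = 35, subseq_comm = 49, overhead = 10;
    double cpu = 2300;   // cycles per microsecond (the paper's conversion constant)

    double serial_cycles = global_comm * steps * nodes;
    double parallel_cycles =
          first_comp  * steps * nodes
        + first_comm  * steps * nodes
        + subseq_comp * (iterations - steps * nodes)
        + subseq_comm * (iterations - steps * nodes)
        + overhead    * iterations;

    // Serial cycles are paid in full; parallel cycles are shared across threads.
    double runtime_usec = (serial_cycles + parallel_cycles / threads) / cpu;
    std::printf("predicted runtime: %.0f usec\n", runtime_usec);
    return 0;
}
```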
The number of steps and iterations that a block size produces was taken directly from the experiments² and fed into the model to get a prediction of runtime using Equation 2. The model performs well, with a mean absolute percentage error (MAPE) of 16%. MAPE was selected due to the wide range of magnitudes in the predictions that the model makes. A regression of the relationship between observed and predicted values indicates a close match, with an R-squared value of 0.98 and a slope close to 1.

² A future extension to the work done in this paper would be to develop a model of how block size impacts the number of steps and iterations required before convergence.


These results show that by utilizing performance factors in algorithmic contexts we can both understand and predict the performance of multi-core algorithms. Most importantly, they give an understanding of how to maximize performance on multi-core systems, which is to decompose problems into independent blocks that fit into cache and to maximally reuse the data once it is loaded.

5. Summary and Conclusions


The results presented here show the importance of how problems are decomposed for parallel execution and the complex interaction between architecture and algorithm performance. By understanding how an algorithm works and by providing a means to measure processor performance in different algorithmic contexts, we can develop performance models of algorithms on multi-core systems. The MDBF implementation tells us not only that cache has a major impact on multi-core computation but also that it is possible to develop algorithms that can effectively take advantage of local high-speed cache. Initial indications are that there is also great potential in optimizing the size and structure of the blocks scheduled for parallel execution to decrease the number of steps and iterations required for convergence and to reduce scheduler idle time.

References
[1] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006. URL http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html.
[2] Hans J. Boehm. Threads cannot be implemented as a library. In PLDI '05: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 261–268, New York, NY, USA, 2005. ACM. ISBN 1-59593-056-6. URL http://doi.acm.org/10.1145/1065010.1065042.
[3] Lei Chai, Qi Gao, and Dhabaleswar K. Panda. Understanding the impact of multi-core architecture in cluster computing: A case study with Intel dual-core system. In CCGRID '07: Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid, pages 471–478, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 0-7695-2833-3. URL http://dx.doi.org/10.1109/CCGRID.2007.119.
[4] Matthew Curtis-Maury, Xiaoning Ding, Christos D. Antonopoulos, and Dimitrios S. Nikolopoulos. An evaluation of OpenMP on current and emerging multithreaded/multicore processors. In International Workshop on OpenMP (IWOMP 2005), 2005. URL http://people.cs.vt.edu/~mfcurt/papers/iwomp05.pdf.
[5] Camil Demetrescu, Andrew Goldberg, and David Johnson, editors. 9th DIMACS Implementation Challenge, DIMACS Center, Rutgers University, Piscataway, NJ, 2006. http://www.dis.uniroma1.it/~challenge9/.
[6] Ulrich Drepper. What every programmer should know about memory. Red Hat Inc., November 2007. http://people.redhat.com/drepper/cpumemory.pdf.
[7] Michael J. Flynn and Patrick Hung. Microprocessor design issues: Thoughts on the road ahead. IEEE Micro, 25(3):16–31, 2005. ISSN 0272-1732. doi: 10.1109/MM.2005.56. URL http://dx.doi.org/10.1109/MM.2005.56.
[8] Steve Furber. The future of computer technology and its implications for the computer industry. The Computer Journal, 2008. doi: 10.1093/comjnl/bxn022. URL http://comjnl.oxfordjournals.org/cgi/content/abstract/bxn022v1.
[9] Guang R. Gao, Mitsuhisa Sato, and Eduard Ayguadé. Special issue on OpenMP. International Journal of Parallel Programming, 36(3), June 2008.
[10] Kevin R. Hutson, Terri L. Schlosser, and Douglas R. Shier. On the distributed Bellman-Ford algorithm and the looping problem. INFORMS Journal on Computing, 19(4):542–551, 2007.
[11] Nick Maclaren. Why POSIX threads are unsuitable for C++. White paper, February 2006. URL http://www.opengroup.org/platform/single_unix_specification/doc.tpl?gdid=10087.
[12] Gordon E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8):114–117, 1965.
[13] Victor Pankratius, Christoph Schaefer, Ali Jannesari, and Walter F. Tichy. Software engineering for multicore systems: an experience report. In IWMSE '08: Proceedings of the 1st International Workshop on Multicore Software Engineering, pages 53–60, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-031-9. URL http://doi.acm.org/10.1145/1370082.1370096.
[14] James Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly Media, Inc., 2007.
[15] Christian Terboven, Dieter an Mey, and Samuel Sarholz. OpenMP on multicore architectures. In IWOMP '07: Proceedings of the 3rd International Workshop on OpenMP, pages 54–64, Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 978-3-540-69302-4. doi: http://dx.doi.org/10.1007/978-3-540-69303-1_5.
[16] D. Terpstra, H. Jagode, H. You, and J. Dongarra. Collecting performance data with PAPI-C. In Proceedings of the 3rd Parallel Tools Workshop. Springer Verlag, 2010.
[17] Samuel Webb Williams, Andrew Waterman, and David A. Patterson. Roofline: An insightful visual performance model for floating-point programs and multicore architectures. Communications of the ACM, April 2009.