
Proceedings of the 2012 Industrial and Systems Engineering Research Conference

G. Lim and J.W. Herrmann, eds.

The Impact of Multi-Core Computing on Computational Optimization
Timothy Middelkoop
Department of Industrial and Systems Engineering, University of Florida
PO Box 116595, Gainesville FL 32611-6595, U.S.A.

Abstract

In this paper we confront the challenges that emerging multi-core architectures are placing on the optimization community. We first identify the unique challenges of multi-core computing followed by a presentation of the performance characteristics of the architecture. We then present a motivating example, the distributed Bellman-Ford shortest path algorithm, to demonstrate the impact that multi-core architectures have on computation showing that traditional approaches to parallelization do not scale. We subsequently modify the algorithm to take full advantage of multi-core systems and demonstrate scalability with some numerical results.

Keywords
multi-core computation, Bellman-Ford shortest path, distributed computation, algorithm performance instrumentation

1. Introduction
Multi-core computing will change the way in which we perceive computation and the way in which we build computational optimization algorithms. Over the past decades, Moore's Law [12] has afforded the optimization community the luxury of free performance increases. We are now at a crossroads where the previous increases in serial performance are being replaced by future increases in parallelism [7, 8]. Fortunately, the academic community has recognized this need for parallel computation, as demonstrated in the forward-thinking report from Berkeley [1] about upcoming challenges.
As the number of cores increases, managing access to memory, at its various levels, becomes the key to developing high-performance and scalable implementations. The work of Drepper [6] provides a detailed analysis of these issues and their impact on programmers. In this paper we apply this work at the algorithmic level, providing a different perspective that we believe will be valuable for researchers implementing computational optimization algorithms. To attack these challenges the computing community is also developing new metrics for computation on multi-core processors. These models help researchers understand what is required to achieve maximum performance and scalability on specific multi-core architectures. The roofline model of Williams et al. [17] gives an easy way to visualize how code will perform on modern CPUs. It can give a reasonable expectation of performance without resorting to low-level and difficult instrumentation. It is this simple approach we hope to duplicate here through the instrumentation of performance factors. We define a performance factor as an observable relationship between a property of an algorithm and its performance.

The main goal of this work was not to produce the fastest Bellman-Ford shortest path implementation, but to find fast and scalable ways to achieve parallelism on multi-core platforms that are easy for algorithm designers to implement. The sample data used for the implementation was taken from the dataset of the "9th DIMACS Implementation Challenge - Shortest Paths" [5].

1.1 Multi-Core Systems


To understand why communication has a dramatic impact on performance in multi-core algorithms we must understand the basic architecture of multi-core systems. Modern systems are made up of a hierarchy of successive levels of cache and communication buses. At the top are independent processor cores similar to those used in the single-processor systems of the past. Each processor core has a local (exclusive) L1 cache, which is the fastest (it runs at near processor speed) and is on the order of 64 KiB. As we progress down towards main memory, caches become slower, grow larger, and are shared by more cores. For the system used in this paper (a dual-socket 2.6 GHz AMD quad-core Opteron) there is 512 KiB of L2 cache for each core (some systems share the L2 cache between cores) and 3072 KiB of L3 cache shared between the 4 cores on a single die. The system has two sockets, each supporting an integrated memory controller. This means that if a core must access DRAM located on the other socket, the request must be communicated over an inter-socket bus. To ensure that all the cores see the same information stored in DRAM, the caches must coordinate access through a cache coherence protocol. Near-simultaneous access to memory (often for locking) is expensive because this protocol must guarantee coherency and must communicate over slower buses and potentially coordinate between many cache systems. In this paper we do not focus on performance in this area (coordination), as scientific computing should spend most of its time in computation, not coordination.

1.2 Performance of Multi-Core Systems


The previous section indicates that there is the potential for dramatic performance differences in the execution of algorithms based on their communication properties. The following question arises: what impact do the different levels of cache have on performance, and by extension, what is the "best" that we can do? In the past, simply looking at the performance of the processor (through the use of micro-benchmarks) would tell most of the story.

Unfortunately, traditional processor micro-benchmarks do not highlight the performance issues in multi-core processors. Although micro-benchmarks are often misleading, they do reveal important aspects of multi-core computing. Therefore, a suite of benchmarks was developed and run to investigate this point. The first set of benchmarks measures raw memory bandwidth at the various cache levels. In the single-core case (2.6 GHz AMD quad-core Opteron) the benchmark (optimized, GNU C++ compiler) indicates that the processor can sum a sequential array of long (64-bit) integers that fits in the L1 or L2 cache at a rate of around 13 GiB/s. Since this computation has an operational intensity of 1/8 flops/byte it is bandwidth-limited. This result is similar to the expected performance predicted by the roofline model [17] for this processor and application. When the array no longer fits in the L3 cache, performance drops to 3.5 GiB/s, which approaches the theoretical bandwidth of the DRAM. When increasing to 8 simultaneous threads, the DRAM performance drops to 1.2 GiB/s per core, which demonstrates the significance of the shared access to main memory.
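As a rough illustration of this kind of measurement, the following is a minimal, single-threaded sketch of a sequential-sum bandwidth probe; the array size, timing method, and output format are illustrative choices rather than the exact benchmark harness used in the study.

```cpp
// Illustrative only: minimal sequential-sum bandwidth probe (compile with g++ -O2).
#include <sys/time.h>
#include <cstdint>
#include <cstdio>
#include <vector>

static double now_usec() {
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main() {
    // 4M 64-bit integers = 32 MiB: larger than L3, so this measures DRAM bandwidth.
    // Shrink n so the array fits in L1/L2 to measure the cache levels instead.
    const std::size_t n = std::size_t(1) << 22;
    std::vector<std::int64_t> a(n, 1);

    double t0 = now_usec();
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];                       // one add per 8 bytes read (~1/8 ops/byte)
    double t1 = now_usec();

    double gib = double(n * sizeof(std::int64_t)) / (1024.0 * 1024.0 * 1024.0);
    std::printf("sum=%lld  bandwidth=%.2f GiB/s\n",
                static_cast<long long>(sum), gib * 1e6 / (t1 - t0));
    return 0;
}
```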
Since most optimization applications do not process long streams of data with a low operational intensity, it is important to determine a lower bound using randomly accessed data. A second set of benchmarks was run with a similar setup but using unoptimized code with the addition of a variable stride length (the distance to the next element in RAM) and hand-unrolled loops. The benchmark has an average overhead of 2.6 cycles per iteration, ultimately limiting peak bandwidth. For the stride=1 case, performance was on par with the previous DRAM-sized benchmarks, reinforcing the fact that the benchmark is memory bound. When the stride is set to 32, the stride limit of the hardware prefetcher is exceeded and performance is similar to random access, since the prefetcher can no longer speculatively retrieve the next element. In the single-core case bandwidth drops to 550 MiB/s, and to 150 MiB/s in the 8-core case, a dramatic difference from the L1 cache bandwidth.
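The strided variant can be sketched in the same way; the array size and the two stride values below are again illustrative, and the bandwidth reported counts only the bytes actually summed.

```cpp
// Illustrative only: strided-access bandwidth sketch. A stride past the hardware
// prefetcher's limit behaves much like random access.
#include <sys/time.h>
#include <cstdint>
#include <cstdio>
#include <vector>

static double now_usec() {
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

// Touch one 64-bit element every `stride` positions; returns MiB/s of data summed.
static double strided_bandwidth(const std::vector<std::int64_t>& a, std::size_t stride) {
    std::int64_t sum = 0;
    std::size_t n = a.size();
    double t0 = now_usec();
    for (std::size_t i = 0; i < n; i += stride)
        sum += a[i];
    double t1 = now_usec();
    static volatile std::int64_t sink;     // keep the loop from being optimized away
    sink = sum;
    std::size_t touches = (n + stride - 1) / stride;
    return double(touches * sizeof(std::int64_t)) / (1024.0 * 1024.0) * 1e6 / (t1 - t0);
}

int main() {
    std::vector<std::int64_t> a(std::size_t(1) << 24, 1);   // 128 MiB, DRAM resident
    std::printf("stride  1: %8.0f MiB/s\n", strided_bandwidth(a, 1));
    std::printf("stride 32: %8.0f MiB/s\n", strided_bandwidth(a, 32));
    return 0;
}
```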
These benchmarks indicate two things: first, that as memory access moves off the core and into L3/DRAM, performance is impacted by other cores; second, that cache and access patterns can have up to a two-orders-of-magnitude impact on performance. When looking at the benchmarks this is obvious; however, algorithm design must account for these two factors and the middleware must easily support this style of computation. Looking forward, the impact of multiple cores is relatively low at this time (a factor of three for 8 cores in two sockets) but it will continue to degrade as the number of cores accessing shared resources increases.

1.3 Parallel Programming Tools


Developing an algorithm that can be implemented in parallel requires that it conform to the techniques and technologies that the underlying implementation allows. It is unfortunate, according to the literature [2, 4, 11, 13, 15] and in the author's experience, that the popular technologies for building parallel programs (POSIX Threads, Open Multi-Processing (OpenMP) [9], and the Message Passing Interface (MPI)) are not ideal solutions for scientific computing on multi-core systems. Even though MPI can be highly scalable when used in large clusters (HPC), it was not considered for this study as it performs poorly on multi-core systems, since it was not designed for shared cache and memory systems [3]. In general, the problem is that most of the parallel solutions are based on an outdated model of uniform concurrent memory access for multiple processors in a system, which is no longer the case in multi-core systems.
The Threading Building Blocks (TBB) open source project by Intel [14] was used for the implementation presented in this paper. TBB is a template library that manages concurrency by providing design patterns, and the underlying middleware infrastructure, that facilitate and encourage the user to utilize parallel-safe data access methods. The issues of locking are, for the most part, hidden from the user. In this paper we use the middleware to manage all access to shared data by having the middleware indicate which areas are currently available for access to an independently running function (task). Any transfer of data between areas is done either by a function that has access to both areas, or by a single serial global communication phase performed outside the parallel computation phase.
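As a rough sketch of this style of use (not the paper's actual implementation), the following toy program iterates a set of blocks in parallel with tbb::parallel_for; the Block type, its relax() sweep, and the label values are hypothetical stand-ins, and the point is only that each task owns a disjoint range of blocks so that no explicit locking is needed.

```cpp
// Illustrative only: toy TBB parallel phase (compile with g++ -std=c++11 -O2 -ltbb).
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <cstdio>
#include <vector>

struct Block {
    std::vector<double> label;             // labels owned exclusively by this block
    bool relax() {                         // one local sweep; true if any label changed
        bool changed = false;
        for (std::size_t i = 1; i < label.size(); ++i)
            if (label[i - 1] + 1.0 < label[i]) {
                label[i] = label[i - 1] + 1.0;
                changed = true;
            }
        return changed;
    }
};

int main() {
    std::vector<Block> blocks(64);
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        blocks[b].label.assign(1024, 1e9); // "infinite" labels ...
        blocks[b].label[0] = 0.0;          // ... except a zero source label per block
    }

    // Parallel phase: TBB hands each task a disjoint range of blocks; a block is
    // touched only by the task that owns it, keeping shared-memory access safe.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, blocks.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t b = r.begin(); b != r.end(); ++b)
                while (blocks[b].relax()) { /* iterate to local convergence */ }
        });

    // Any exchange of labels between blocks would happen here, in a serial
    // global communication phase outside the parallel region.
    std::printf("block 0, last label: %g\n", blocks[0].label.back());
    return 0;
}
```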

1.4 Performance Instrumentation


Performance optimization on multi-core architectures is difficult due to the interaction between CPU caches and limited shared access to main memory. Optimizing performance requires instrumentation and careful tuning. As we will see later, there is a complex relationship between problem decomposition and performance. Problem-based metrics, such as the number of iterations performed on a node, are not captured by traditional profiling software. In many cases profiling will just confirm that a majority of the computational time is spent in the analysis code. This paper demonstrates the value of tools that combine domain-level metrics with processor profiling information. To collect this data, probes (source-level calls to the profiling library) were developed that collect domain information along with processor performance counter data (for example, cache misses). The probes measure computational time using the high-resolution performance counters available on modern processors via the PAPI library [16]. These probes make it easy to mark portions of the code to be instrumented and to have the results collected, along with contextual information (block number, iteration, etc.), in a database without worrying about the underlying details.
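A minimal probe in this spirit might look as follows; the choice of event (L2 cache misses), the context fields, and printing rather than writing to a database are assumptions of this sketch, not the paper's exact probe interface.

```cpp
// Illustrative only: minimal PAPI-based probe sketch (compile with -lpapi).
#include <papi.h>
#include <cstdio>

struct Probe {
    int eventset;
    long long start_cyc;

    void begin() {
        eventset = PAPI_NULL;
        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_L2_TCM);   // total L2 cache misses
        PAPI_start(eventset);
        start_cyc = PAPI_get_real_cyc();         // high-resolution cycle counter
    }

    void end(int block, int step) {
        long long misses = 0;
        long long cycles = PAPI_get_real_cyc() - start_cyc;
        PAPI_stop(eventset, &misses);
        // The real probes store this row, with its context, in a database.
        std::printf("block=%d step=%d cycles=%lld l2_misses=%lld\n",
                    block, step, cycles, misses);
        PAPI_cleanup_eventset(eventset);
        PAPI_destroy_eventset(&eventset);
    }
};

int main() {
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;

    Probe p;
    p.begin();
    volatile double x = 0.0;                     // stand-in for the instrumented region
    for (int i = 0; i < 1000000; ++i)
        x = x + i * 0.5;
    p.end(/*block=*/0, /*step=*/0);
    return 0;
}
```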

2. Multi-Core Distributed Bellman-Ford Shortest Path


The multi-core distributed Bellman-Ford shortest path algorithm is based on the synchronous distributed Bellman-Ford shortest path algorithm (DBF) as presented in [10]. Recall that the original Bellman-Ford algorithm is a label-updating algorithm. The DBF stores local copies of incoming node labels (instead of storing them in a central location) and iterates over all the nodes until there are no new updates in an iteration, at which time all the local labels are updated in a communication phase. The DBF continues until there are no label updates (computation phase) after a communication phase. This property is important for the multi-core version, as this information can now reside exclusively in a single processor core. In addition, information is replicated for the nodes, reducing the need for the expensive locking operations otherwise needed to ensure the validity of the information.
The key idea for the multi-core DBF (MDBF) is to split the network into blocks. Blocks are processed in parallel as sub-problems; however, nodes outside the block are not updated during the block communication phase. This has three very important properties. The first is that the size of a block can be set to fit in a core's cache, making the processing of the block exclusive to a core. The second is that nodes do not need locked access to the local store of labels (the algorithm pushes label updates), which would add overhead. The third is that data is not shared between cores, which, as we saw earlier, can have a dramatic impact on performance.
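The following toy, serial sketch illustrates this block structure on a small hand-made graph; the graph, the partitioning of nodes into blocks by index, and all sizes are invented for the example, and the per-block loop is what would be run in parallel in the actual implementation.

```cpp
// Illustrative only: toy, serial sketch of the block-decomposed Bellman-Ford idea.
// Each block keeps a private copy of every label, iterates its own nodes to local
// convergence, and copies are reconciled only in a serial communication phase.
#include <cstdio>
#include <vector>

struct Arc { int from, to; double cost; };

int main() {
    const int n = 8, block_size = 4, source = 0;
    const double INF = 1e18;
    std::vector<Arc> arcs = {{0,1,1},{1,2,1},{2,3,1},{3,4,1},{4,5,1},{5,6,1},{6,7,1},{0,7,10}};

    int num_blocks = (n + block_size - 1) / block_size;
    // Private label copy per block: no locks are needed while a block iterates.
    std::vector<std::vector<double>> local(num_blocks, std::vector<double>(n, INF));
    std::vector<double> global_labels(n, INF);
    global_labels[source] = 0.0;

    bool updated = true;
    int steps = 0;
    while (updated) {
        ++steps;
        for (auto& copy : local) copy = global_labels;        // refresh local copies for this step

        // Parallel phase (run serially here): each block relaxes only the nodes it owns,
        // iterating until none of its labels change.
        for (int b = 0; b < num_blocks; ++b) {
            bool changed = true;
            while (changed) {
                changed = false;
                for (const Arc& a : arcs) {
                    if (a.to / block_size != b) continue;      // block b owns node a.to
                    double candidate = local[b][a.from] + a.cost;
                    if (candidate < local[b][a.to]) { local[b][a.to] = candidate; changed = true; }
                }
            }
        }

        // Serial global communication phase: merge owned labels and detect updates.
        updated = false;
        for (int b = 0; b < num_blocks; ++b)
            for (int v = b * block_size; v < n && v < (b + 1) * block_size; ++v)
                if (local[b][v] < global_labels[v]) { global_labels[v] = local[b][v]; updated = true; }
    }

    std::printf("steps=%d  shortest distance to node 7 = %g\n", steps, global_labels[7]);
    return 0;
}
```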
The key decision is in setting the block size. The smaller the block size, the higher in the cache hierarchy it will fit, increasing the speed of subsequent iterations. However, as we will see later, for smaller block sizes the number of iterations processed during a block is also smaller. This translates into a larger number of times that the blocks must be processed. Interestingly, this does not necessarily translate into an increase in the total number of times that a node is processed. For small block sizes, the total number of node evaluations actually increases as block size increases; this continues up to a point before it decreases again. This indicates a complex relationship between the domain, the problem instance, and the algorithm parameters.

3. Performance and Scalability


There are three main variables that can be controlled when running the MDBF algorithm: problem size, the number of cores used, and block size. Figure 1(a) shows a log-log plot of the time in microseconds (vtime) vs. the problem size (nodes). We can clearly see that as the problem size grows, so does the overall computational time. What we also see is that the number of cores and the block size have a dramatic, order-of-magnitude effect on performance. Problem size and the number of cores affect computational time in a predictable manner; however, the effect of block size is a complex interaction of a number of factors and will be the focus of the remainder of the paper.
[Figure 1: Impact of average block size (abs) on the MDBF. (a) Computational time (vtime, µsec) vs. problem size (nodes), for 1, 2, 4, and 8 cores. (b) Change in the average number of iterations per node per step (iters/nodes) vs. average block size (abs), for several problem sizes.]

Although the optimal block size could be found using an auto-tuner (search), we strongly believe that by understanding and modeling the performance of the multi-core algorithm a deeper understanding of the problem can be obtained. Overall performance is a complex interaction of the two and requires an understanding of both. Figure 2 illustrates this by the fact that the number of iterations per block drops at the same time as the block size exceeds the size of cache available to each core. Here, the increase in cycles per node (shown in Figure 2(a)) is dominated by the transition from blocks running entirely in the local core cache to blocks no longer fitting in these caches. Combined, these factors form a complex view of the total running time, as shown in Figure 2(b).
The effect of block size on performance can be broken down into three major areas: algorithm operation, middleware operation (macro), and CPU-core operation (micro). In the remainder of the paper we discuss these three areas, identifying a number of contributing factors.

3.1 Algorithm Performance Factors


Altering the block size changes the convergence properties of the algorithm, since each block converges locally before the global communication phase. Experimentally, reducing the block size increases the number of global communication phases (steps) that the algorithm must make. Since global communication is both expensive (as we will see later) and serial, it reduces the amount of parallel computation time available. What is not so clear is the complex interaction between the block size and the total number of times that a node must be evaluated during local convergence, as shown in Figure 1(b). Our current hypothesis is that as the block size increases, the amount of recalculation in each block increases (recall that blocks iterate locally) due to the delay in the propagation of label values to other blocks. At some point this trend reverses, as the block size is probably sufficient to support a large amount of local computation. This behavior in and of itself is an interesting phenomenon that warrants further study.

Although the effect of block size on the number of steps and iterations performed is not well understood, the number of steps and iterations performed has a direct impact on performance. Because these are easy to measure, we base the algorithmic performance factors on them instead of on block size directly, and only consider the impact on computation. We define computation as the number of instructions (not time) that must be performed doing algorithmic computations. We assume that the algorithmic factors are not significantly influenced by macro or micro performance factors. We have experimentally verified this by tracking the number of instructions (not cycles) per node evaluation performed during the algorithm portion of the program. The following two performance factors formalize this relationship:
Performance Factor 1 (serial computation). The amount of serial computation performed increases with the number
of steps.
[Figure 2: Impact of block size. (a) Average number of cycles to process a node vs. block size, for varying numbers of cores. (b) Total running time (µsec) vs. block size.]

Performance Factor 2 (parallel computation). The amount of parallel computation performed increases with the total
number of node iterations.

Since we are not developing a model of how the algorithm changes based on block size (although interesting) we will
simply use the number of node and block iterations and steps required for each problem and block size as a given and
split the algorithm into serial (global communication) and parallel phases.

3.2 Macro Performance Factors


There are three main sources of overhead (non-value-added time introduced by the TBB middleware): scheduler overhead, which is responsible for assigning tasks; locking overhead, which accounts for the time spent ensuring that parallel accesses to memory happen as expected; and idle time, which is time that a core remains idle (or in a waiting state) because of a lack of work. Due to the complexities of task scheduling, locking, and waiting, and the fact that all macro performance factors are non-value-added, we consider them all as overhead.

We find the overhead by measuring the elapsed time between the beginning and end of the program (wall clock time) using the gettimeofday function. The time spent loading the program and storing the instrumentation data is ignored. We compute overhead as the amount of time that the processors are not working on the algorithm (compute or communicate) as follows:
overhead = (wtime − global) × threads − (compute + communicate). (1)
It should be noted that processor idle time during the serial phase of the algorithm is not considered overhead, but processor idle time during the parallel phase of the algorithm is considered part of the overhead. Cores go idle (or into a busy-wait loop) when there is no available work. This happens with greater impact as the block size increases, and it becomes a serious problem when the block size grows to the point where there are very few blocks compared to the number of available cores. Due to the complexities of the middleware system and the goals of the paper, a sophisticated model of the middleware was not developed. We do, however, characterize the middleware by the following two macro performance factors:

Performance Factor 3 (overhead). The amount of overhead increases with the number of node iterations.
Performance Factor 4 (parallel overhead). The amount of overhead increases with the number of threads.

For the overhead factor there is little difference between threads (1.5 cycles per thread) for the in-cache case, so we take the average overhead value per node iteration over all cases, which is 10 cycles per node iteration. For the uncached case the parallel overhead factor is prevalent, and we use linear regression to obtain 33 × threads − 30 cycles per node iteration (with an R-squared of 0.41). Due to limited space and the importance of the other factors we will not discuss the macro performance factors further.

3.3 Micro Performance Factors


The motivation behind this paper is to develop tools and techniques to facilitate the development of multi-core scalable algorithms. To maximize performance in this environment, algorithms must maximally reuse cache. This section details the performance factors involved with reusing cache, mainly the instruction execution time as it relates to accessing memory at its various levels.

Classically, single-processor systems had deterministic processor instruction times; that is, an instruction would take a fixed number of processor cycles to execute. However, modern multi-core processors are sophisticated systems that, arguably, act more like software than hardware. If one takes this view then it is easy to understand why modeling CPU performance is a difficult task. To make this task more manageable we measure the number of cycles a set of instructions takes in a particular algorithmic context. The performance instrumentation presented in this paper is what allows for the collection of performance data based on algorithmic context. With this information we are able to build more accurate models of processor performance.
The micro performance factors were developed by a combination of understanding memory/processor performance [6], building micro-benchmarks to simulate the operation of various usage patterns of the algorithm, and studying high-resolution performance data from the operating algorithm. The high-resolution data was used to look at the variability and distribution of the sample events, with the assumption that events from the same algorithmic context should show similar processor cycle times. The distribution of each factor was examined individually to ensure that it met expectations. These expectations were formed by estimating the instruction count and what types of memory access patterns were expected, with their associated latencies and throughput. Latency and throughput information was estimated using the optimization guides from Intel and AMD and from our own micro-benchmark efforts.¹ We do not detail them in this paper as they are extensive and highly dependent on the algorithm implementation and processor architecture.
We consider five micro performance factors that are all based on CPU performance (the average number of cycles it
takes to process a node in a block). Each performance factor represents a different algorithmic context. In all but one
case we can see the dramatic impact that cache has on performance. As the block size increases past the point where
the working set (block iteration) fits in the cache we see a dramatic decrease in CPU performance (increase in the
number of cycles to evaluate a node). The only case in which we do not see this behavior is global communication,
which has no iterative component and is not decomposed into blocks.
The global communication phase of the algorithm is run serially and consists mostly of sequential access to the entire problem data set. Since the problem does not fit in any of the caches, the execution time is dominated by DRAM access times. The amount of time to process a node is based on the number of arcs at the node and the proximity of the target node in memory. Relative to the other factors, this factor is essentially constant (for multiple threads, an average of 146 cycles per node). We now formally define this performance factor as follows:

Performance Factor 5 (serial sequential load). The node evaluation time for global communication is constant for
each problem.

After the global communication phase of the algorithm, blocks are scheduled, in parallel, to be iterated. Because the nodes of a block will not be in cache (having been pushed out by the global communication or other block iterations), the first computation sweep and communication sweep will show the impact of loading the nodes from DRAM into the cache. If the cache size is sufficient to host the entire block, then the next computation and communication sweeps will not have to reload the node data from DRAM, as it will be in the cache. Because of this we have loading and iteration contexts for both communication and computation, giving a total of four algorithmic block contexts.
The split between load and iteration is also important since some blocks do not change with a label update and thus will not have additional iterations. Since there are no additional iterations, there will be no chance to amortize the initial cost of loading the data from DRAM into the cache. Because of this there are two distinct performance profiles depending on whether or not iteration is required. By splitting them we now have an iteration context and a loading context with homogeneous performance characteristics. The four contexts can be seen in Figure 3. In all cases we can see the dramatic effect of cache utilization for smaller block sizes.

¹ The process of examining the distribution of events gave a surprising amount of insightful information about the algorithm's performance and how to define the performance factors.

[Figure 3: Node evaluation time in cycles vs. block size (abs) for different algorithmic contexts, for 1, 2, 4, and 8 cores. (a) First block computation. (b) Subsequent block computation. (c) First block communication. (d) Subsequent block communication.]

Because of the similarity between communication and computation, we define only three performance factors and apply them to four algorithmic contexts. We now define them as follows:
Performance Factor 6 (cached block load). The evaluation time for nodes first accessed in a block with a footprint that fits in cache increases with the number of simultaneous threads and is constant with block size.
Performance Factor 7 (cached block iteration). The evaluation time for nodes previously accessed in a block with a footprint that fits in cache is the same as or better than the first access and is constant.
Performance Factor 8 (uncached block access). The evaluation time for nodes in a block with a footprint that does
not fit in cache increases with the number of simultaneous threads.

The order of evaluation for Figure 3 is a-c-b-d, with b-d repeating until local convergence. Therefore the first computation is considered under the cached block load factor. Due to the similarity between computation and communication (we evaluate the same number of nodes), we consider them under the same cached block iteration factor. Finally, when the block size is such that the block footprint does not fit in cache, we consider all algorithmic contexts as uncached block access. Visually it is easy to determine when the block footprint exceeds the cache size; this occurs at around 10^4 nodes in a block for the problems presented in the paper. For lower thread counts this is actually a bit larger, since the L3 cache on the AMD Opteron processor is shared between the four cores on the die. Since unused cores will not utilize the shared L3 cache, the remaining cores will have a larger share of the cache.

4. Modeling Performance
Now that we have models for how the algorithm, middleware, and processor operate, we can combine them to model performance. Table 1 summarizes which performance factors were used in which algorithmic context. The table also shows the number of cycles per node evaluation for each context (cached and uncached). In cases where the number of threads is a factor we show the numbers for one and eight cores. It also shows the number of times that an algorithmic context/performance factor is utilized (count) during a run (as a reminder, steps are the number of global communication phases).

Algorithmic Context        | Cached                        | Uncached                   | Count
global communication       | serial sequential load (145)  |                            | steps × nodes
first computation          | cached load (102, 224)        | uncached access (98, 330)  | steps × nodes
first communication        | cached iteration (71)         | uncached access (98, 381)  | steps × nodes
subsequent computation     | cached iteration (35)         | uncached access (75, 251)  | iterations − steps × nodes
subsequent communication   | cached iteration (49)         | uncached access (96, 297)  | iterations − steps × nodes
-                          | overhead (10)                 | overhead (3, 234)          | iterations

Table 1: Micro performance factors used in each algorithmic context (mean cycles per node evaluation; where the number of threads is a factor, values are given for one and eight cores)

Using this information, along with the algorithmic performance factors (parallel and serial computation), we construct
the following model to predict the wall clock running time of the algorithm:
runtime = [ global_communication × (steps × nodes) +                                            (2)
          ( first_computation(cached, threads) × (steps × nodes) +
            first_communication(cached, threads) × (steps × nodes) +
            subsequent_computation(cached, threads) × (iterations − steps × nodes) +
            subsequent_communication(cached, threads) × (iterations − steps × nodes) +
            overhead(cached, threads) × iterations
          ) × threads⁻¹ ] × cpu⁻¹.
The conversion between cycles and runtime is cpu, the number of cycles per microsecond at which the processor runs (for the hardware in this paper it is 2300 cycles per microsecond). The performance factors use the cycle times shown in Table 1 and are dependent on the runtime configuration (blocks and threads). If a block will fit in cache then the cached access numbers are used; otherwise the uncached numbers are used. It should be noted that only global communication is run in serial and that the remaining algorithmic contexts are run in parallel.
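As a sketch of how the model is evaluated, the following plugs placeholder run statistics into Equation 2 using the cached, single-thread cycle counts from Table 1; the values of nodes, steps, iterations, and threads are invented inputs, and cycles are converted to microseconds by dividing by the cpu constant.

```cpp
// Illustrative only: evaluating the Equation (2) runtime model with placeholder inputs.
#include <cstdio>

int main() {
    // Run statistics (placeholders; a real prediction would take these from an actual run).
    double nodes = 264346, steps = 20, iterations = 2.0e7, threads = 1;

    // Cached, single-thread cycles per node evaluation (Table 1).
    double global_comm = 145, first_comp = 102, first_comm = 71;
    double subseq_comp = 35, subseq_comm = 49, overhead = 10;
    double cpu = 2300;   // cycles per microsecond (the paper's conversion constant)

    double serial_cycles = global_comm * steps * nodes;
    double parallel_cycles =
          first_comp  * steps * nodes
        + first_comm  * steps * nodes
        + subseq_comp * (iterations - steps * nodes)
        + subseq_comm * (iterations - steps * nodes)
        + overhead    * iterations;

    // Serial cycles are paid in full; parallel cycles are shared across threads.
    double runtime_usec = (serial_cycles + parallel_cycles / threads) / cpu;
    std::printf("predicted runtime: %.0f usec\n", runtime_usec);
    return 0;
}
```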
The number of steps and iterations that a block size produces was taken directly from the experiments² and fed into the model to get a prediction of runtime using Equation 2. The model performs well, with a mean absolute percentage error (MAPE) of 16%. MAPE was selected due to the wide range of magnitudes in the predictions that the model makes. A regression of the relationship between observed and predicted values indicates a close match, with an R-squared value of 0.98 and a slope close to 1.

² A future extension to the work done in this paper would be to develop a model of how block size impacts the number of steps and iterations required before convergence.


These results show that by utilizing performance factors in algorithmic contexts we can both understand and predict the performance of multi-core algorithms. Most importantly, they give an understanding of how to maximize performance on multi-core systems, which is to decompose problems into independent blocks that fit into cache and to maximally reuse the data once it is loaded.

5. Summary and Conclusions


The results presented here show the importance of how problems are decomposed for parallel execution and the complex interaction between architecture and algorithm performance. By understanding how an algorithm works and by providing a means to measure processor performance in different algorithmic contexts, we can develop performance models of algorithms on multi-core systems. The MDBF implementation tells us not only that cache has a major impact on multi-core computation but also that it is possible to develop algorithms that can effectively take advantage of local high-speed cache. Initial indications are that there is also great potential in optimizing the size and structure of the blocks scheduled for parallel execution to decrease the number of steps and iterations required for convergence and to reduce scheduler idle time.

References
[1] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006. URL http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html.
[2] Hans J. Boehm. Threads cannot be implemented as a library. In PLDI '05: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 261–268, New York, NY, USA, 2005. ACM. ISBN 1-59593-056-6. URL http://doi.acm.org/10.1145/1065010.1065042.
[3] Lei Chai, Qi Gao, and Dhabaleswar K. Panda. Understanding the impact of multi-core architecture in cluster computing: A case study with Intel dual-core system. In CCGRID '07: Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid, pages 471–478, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 0-7695-2833-3. URL http://dx.doi.org/10.1109/CCGRID.2007.119.
[4] Matthew Curtis-Maury, Xiaoning Ding, Christos D. Antonopoulos, and Dimitrios S. Nikolopoulos. An evaluation of OpenMP on current and emerging multithreaded/multicore processors. In International Workshop on OpenMP (IWOMP 2005), 2005. URL http://people.cs.vt.edu/~mfcurt/papers/iwomp05.pdf.
[5] Camil Demetrescu, Andrew Goldberg, and David Johnson, editors. 9th DIMACS Implementation Challenge, DIMACS Center, Rutgers University, Piscataway, NJ, 2006. http://www.dis.uniroma1.it/~challenge9/.
[6] Ulrich Drepper. What every programmer should know about memory. Red Hat Inc., November 2007. http://people.redhat.com/drepper/cpumemory.pdf.
[7] Michael J. Flynn and Patrick Hung. Microprocessor design issues: Thoughts on the road ahead. IEEE Micro, 25(3):16–31, 2005. ISSN 0272-1732. doi: 10.1109/MM.2005.56. URL http://dx.doi.org/10.1109/MM.2005.56.
[8] Steve Furber. The future of computer technology and its implications for the computer industry. The Computer Journal, 2008. doi: 10.1093/comjnl/bxn022. URL http://comjnl.oxfordjournals.org/cgi/content/abstract/bxn022v1.
[9] Guang R. Gao, Mitsuhisa Sato, and Eduard Ayguadé. Special issue on OpenMP. International Journal of Parallel Programming, 36(3), June 2008.
[10] Kevin R. Hutson, Terri L. Schlosser, and Douglas R. Shier. On the distributed Bellman-Ford algorithm and the looping problem. INFORMS Journal on Computing, 19(4):542–551, 2007.
[11] Nick Maclaren. Why POSIX threads are unsuitable for C++. White paper, February 2006. URL http://www.opengroup.org/platform/single_unix_specification/doc.tpl?gdid=10087.
[12] Gordon E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8):114–117, 1965.
[13] Victor Pankratius, Christoph Schaefer, Ali Jannesari, and Walter F. Tichy. Software engineering for multicore systems: an experience report. In IWMSE '08: Proceedings of the 1st International Workshop on Multicore Software Engineering, pages 53–60, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-031-9. URL http://doi.acm.org/10.1145/1370082.1370096.
[14] James Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly Media, Inc., 2007.
[15] Christian Terboven, Dieter an Mey, and Samuel Sarholz. OpenMP on multicore architectures. In IWOMP '07: Proceedings of the 3rd International Workshop on OpenMP, pages 54–64, Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 978-3-540-69302-4. doi: http://dx.doi.org/10.1007/978-3-540-69303-1_5.
[16] D. Terpstra, H. Jagode, H. You, and J. Dongarra. Collecting performance data with PAPI-C. In Proceedings of the 3rd Parallel Tools Workshop. Springer Verlag, 2010.
[17] Samuel Webb Williams, Andrew Waterman, and David A. Patterson. Roofline: An insightful visual performance model for floating-point programs and multicore architectures. Communications of the ACM, April 2009.