Effective Performance Measurement and Analysis of

Multithreaded Applications
Nathan R. Tallent John M. Mellor-Crummey
Rice University
{tallent,johnmc}@rice.edu

Abstract

Understanding why the performance of a multithreaded program does not improve linearly with the number of cores in a shared-memory node populated with one or more multicore processors is a problem of growing practical importance. This paper makes three contributions to performance analysis of multithreaded programs. First, we describe how to measure and attribute parallel idleness, namely, where threads are stalled and unable to work. This technique applies broadly to programming models ranging from explicit threading (e.g., Pthreads) to higher-level models such as Cilk and OpenMP. Second, we describe how to measure and attribute parallel overhead—when a thread is performing miscellaneous work other than executing the user's computation. By employing a combination of compiler support and post-mortem analysis, we incur no measurement cost beyond normal profiling to glean this information. Using idleness and overhead metrics enables one to pinpoint areas of an application where concurrency should be increased (to reduce idleness), decreased (to reduce overhead), or where the present parallelization is hopeless (where idleness and overhead are both high). Third, we describe how to measure and attribute arbitrary performance metrics for high-level multithreaded programming models, such as Cilk. This requires bridging the gap between the expression of logical concurrency in programs and its realization at run time as it is adaptively partitioned and scheduled onto a pool of threads. We have prototyped these ideas in the context of Rice University's HPCTOOLKIT performance tools. We describe our approach, implementation, and experiences applying this approach to measure and attribute work, idleness, and overhead in executions of Cilk programs.

Categories and Subject Descriptors C.4 [Performance of Systems]: Measurement techniques, Performance attributes.

General Terms Performance, Measurement, Algorithms.

Keywords Performance Analysis, Call Path Profiling, Multithreaded Programming Models, HPCTOOLKIT.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PPoPP'09, February 14–18, 2009, Raleigh, North Carolina, USA.
Copyright © 2009 ACM 978-1-60558-397-6/09/02. . . $5.00

1. Introduction

Over the last several years, power dissipation has become a substantial problem for microprocessor architectures as clock frequencies have increased [19]. As a result, the microprocessor industry has shifted its focus from increasing clock frequencies to delivering increasing numbers of processor cores. For software to benefit from increases in core counts as new generations of microprocessors emerge, it must exploit threaded parallelism. As a result, there is an urgent need for programming models and tools to support development of efficient multithreaded programs. In this paper, we address the challenge of creating tools for measuring, attributing, and analyzing the performance of multithreaded programs.

Performance tools typically report how resources, such as time, are consumed rather than wasted. For parallel programs, it is typically most important to know where time is wasted as a result of an ineffective parallelization. To enable an average developer to quickly assess the quality of the parallelization in a multithreaded application, tools should pinpoint program regions where the parallelization is inefficient and quantify their impact on performance. Two aspects of a parallelization in particular are important for efficiency: whether there is adequate parallelism in the program to keep all of the processor cores busy, and whether the parallelism is sufficiently coarse-grain that the cost of managing the parallelism does not become significant with respect to the cost of the parallel work. In this paper, we describe novel techniques for assessing both of these aspects of parallel efficiency.

For performance tools to be useful, they must apply to the multithreaded programming models of choice. Over the last decade, high-level programming models such as OpenMP [22] and Cilk [10] were developed to simplify the development of multithreaded programs. These programming models raise the level of abstraction of parallel programming by partitioning the problem into two parts: the programmer is responsible for expressing the logical concurrency in a program, and a run-time system is responsible for partitioning and mapping parallel work efficiently onto a pool of threads for execution. Without appropriate support for tools, the nature of this run-time mapping of work to threads is obscure and renders ineffective tools that measure and attribute performance directly to threads in the run-time system.

In our work, we focus on using call path profiling [12] to attribute costs in a program execution to the calling contexts in which they are incurred. For modular programs, it is important to attribute the costs incurred by each procedure to the different contexts in which the procedure is called. The need for context is obvious if one considers that string manipulation routines might be called from many distinct places in a program. Of particular interest is providing this capability for high-level multithreaded programming models such as Cilk. However, for high-level multithreaded parallel programming models, using call path profiling to associate costs with the context in which they are incurred is not as simple as it sounds. At each sample event, a call path profiler must attribute the cost represented by the sample (e.g., time) to the current execution context, which consists of the stack of procedure frames active

when the event occurred. For programs written in a programming model such as Cilk, which uses a work-stealing run-time system to partition and map work onto a thread pool, the stack of native procedure frames active within a thread represents only a suffix of the calling context. Cilk's work-stealing run-time system causes calling contexts to become separated in space and time as procedure frames migrate between threads as work is stolen. As a result, a standard call path profile of a Cilk program during execution will show fragments of call paths mapped to each of the threads in the run-time system's thread pool. Since frames can be stolen, even the mapping between an individual procedure frame and a thread may not be one to one. As a result, a standard call path profile of a Cilk program will yield a result that is at best cumbersome and at worst incomprehensible. For effective performance analysis of multithreaded programming models with sophisticated run-time systems, it is important to bridge the gap between the abstractions of the user's program and their realization at run time.

This paper makes the following contributions for understanding the performance of multithreaded parallel programs:

• A technique for measuring and attributing parallel idleness—when threads are idling or blocked and unable to perform useful work. This technique applies broadly to programming models ranging from explicit threading (e.g., Pthreads [7]) to higher-level models such as Cilk, OpenMP and Threading Building Blocks [23]. The technique relies on minor modifications to the run-time systems of multithreaded programming models.

• A technique for measuring and attributing parallel overhead—when a thread is performing miscellaneous work other than executing the user's computation. This technique could be applied both to library-based programming models such as Pthreads and Threading Building Blocks, and to compiler-based programming models such as Cilk and OpenMP. By employing a combination of compiler support and post-mortem analysis, we incur no measurement cost beyond normal profiling to glean this information.

• The definition of, and a method for efficiently collecting, logical call path profiles—a generalization of call path profiles that enables one to measure and correlate execution behavior at different levels of abstraction. We develop this approach here to relate the execution of a multithreaded program by a work-stealing run-time system back to its source-level representation.

We believe these complementary techniques are necessary for effective performance measurement and analysis of high-level multithreaded programming models. Logical call path profiles are the key for mapping measurements of work, idleness and overhead back to the source-level abstractions in high-level multithreaded parallel programming models. Our idleness and overhead metrics enable one to pinpoint areas of an application where concurrency should be increased (to reduce idleness), decreased (to reduce overhead), or where the present parallelization is hopeless (where idleness and overhead are both high). To show the utility of these techniques, we describe their implementation for Cilk and show how they bridge the gap between the execution complexity of a Cilk program and the relative simplicity of the Cilk programming model. Our tool attributes work, idleness, and overhead to Cilk source code lines in their full logical user-level calling context.

This paper is organized as follows. First, Section 2 describes parallel idleness and overhead. Then, Section 3 defines logical call path profiles while Section 4 shows how to obtain them using logical stack unwinding. Section 5 describes the application of these ideas to Cilk. Finally, Section 6 discusses related work and Section 7 concludes.

2. Pinpointing parallel bottlenecks

We describe two novel measurement and analysis techniques that enable an average developer to quickly determine whether a multithreaded application is effectively parallelized. If the application is not effectively parallelized, our techniques direct one's attention to areas of the program that need improvement.

2.1 Quantifying insufficient parallelism

To quantify insufficient parallelism, we describe how to efficiently and directly measure parallel idleness, i.e., when threads are idle and unable to perform useful work. Our measurements of idleness are based on sampling of a time-based counter such as the wall clock or a hardware cycle counter. Measurement overhead is low and controllable by adjusting the sampling frequency. When a sample event occurs, a signal handler collects the context for the sample and associates the sample count with its context.¹ Collecting parallel idleness on a node with n processor cores requires minor adjustments to traditional time-based sampling. The first adjustment is to extend the run-time system to always maintain nw and n̄w, the number of working and idle processor cores, respectively. This can be done by maintaining a node-wide counter representing nw. When a core acquires a unit of useful work (e.g., acquiring a procedure activation using work stealing or plucking a unit of work from a task queue), it atomically increments nw. Similarly, when a core finishes a unit of work, it atomically decrements nw to indicate that it is no longer actively working. In this scheme, n̄w = n − nw.

Consider a run-time system that has one worker thread per core. On a sample, each thread receives an asynchronous signal, resulting in a per-thread sample event. If a sample event occurs in a thread that is not working, we ignore it. When a sample event occurs in a thread that is actively working, the thread attributes one sample to a work metric for the sample context. It then obtains nw and n̄w and attributes a fractional sample n̄w/nw to an idleness metric for the sample context. Even though the thread itself is not idle, it is critical to understand what work it is performing when other threads are idle. Our strategy charges the thread its proportional responsibility for not keeping the idle processors busy at that moment at that point in the program.

For example, if three threads are active on a quad-core processor, whenever a sample event for the cycle counter interrupts a working thread, the working thread will record one sample of work in its work metric, and 1/3 sample of idleness in its idleness metric. The 1/3 sample of idleness represents its share of the responsibility for the core that is sitting idle.

After measurement is completed, idleness can be computed for each program context. Since samples are accumulated during measurement, the idleness value for a given thread and context is Σᵢ n̄wᵢ/nwᵢ over all samples i for that context. It is often useful to express this idleness metric as a percentage of the total idleness for the program. Total idleness may be computed post-mortem by summing the idleness metric over all threads and contexts in the program. The idleness value may be converted to a time unit by multiplying by the sample period. One can also divide the idleness for each context by the application's total effort—the sum of work and idleness everywhere across all threads—to understand the fraction of total effort that was wasted in each context.

A variant of this strategy applies to situations where the number of threads nT may not equal the number of cores (such as with Pthreads programs or OpenMP's nested parallelism). If nT > n, then nw may not exceed n; if nT < n, then nw cannot exceed nT.

¹ As mentioned in Section 1, we attribute costs to their full calling context using call path profiling. In this section, we use the term context rather than calling context since idleness can be measured with or without full calling context.

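The post-mortem reductions of Section 2.1 (percentage of total idleness, conversion to time, and fraction of total effort) are simple arithmetic over the accumulated samples. A sketch follows; the per-context sample counts and the 1 ms sample period are made-up inputs for illustration.

```python
SAMPLE_PERIOD_SEC = 0.001  # assumed time between samples

# (thread, context) -> accumulated samples (illustrative values)
work = {(0, "solve"): 900.0, (1, "solve"): 850.0, (1, "setup"): 50.0}
idleness = {(0, "solve"): 100.0, (1, "solve"): 120.0, (1, "setup"): 30.0}

total_idleness = sum(idleness.values())
total_effort = sum(work.values()) + total_idleness  # work + idleness everywhere

for key, idle in sorted(idleness.items()):
    pct_of_idleness = 100.0 * idle / total_idleness  # share of all idleness
    idle_seconds = idle * SAMPLE_PERIOD_SEC          # convert samples to time
    wasted_fraction = idle / total_effort            # share of total effort wasted
    print(key, round(pct_of_idleness, 1), idle_seconds, round(wasted_fraction, 3))
```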
parallel     parallel
idleness     overhead     interpretation
low          low          effectively parallel; focus on serial performance
low          high         coarsen concurrency granularity
high         low          refine concurrency granularity
high         high         switch parallelization strategies

Table 1. Using parallel idleness and overhead to determine if the given application and input are effectively parallel on n cores.

2.2 Quantifying parallelization overhead

Parallel overhead occurs when a thread is performing miscellaneous work other than executing the user's computation. Sources of parallel overhead include costs such as those for synchronization or dynamically managing the distribution of work.

For library-based programming models such as Pthreads, identifying parallel overhead is easy: any time spent in a routine in the Pthreads library can be labeled as parallel overhead. For language-based parallel programming models, one must rely on compiler support to identify inlined sources of parallel overhead. A compiler for a multithreaded programming model, such as OpenMP or Cilk, can tag statements in its generated code to indicate which are associated with parallelization overhead. In Section 5.2, we describe how we mark sources of parallel overhead for Cilk. In a post-mortem analysis, we recover compiler-recorded information about overhead statements, identify instructions associated with overhead statements and run-time library routines, and attribute any samples of work (as defined in Section 2.1) associated with them to parallelization overhead. The tags therefore partition the application code into instructions corresponding to either useful work or overhead (distinct from idleness).

A benefit of this scheme is that tags are only meta-information: they can be inserted and overhead can be associated with them using post-mortem analysis without affecting run-time performance in any way. In addition, the tags may be refined to partition sources of overhead into multiple types. For example, it may be useful to distinguish between synchronization overhead and all other overhead. Such a refinement would provide more detailed information to users or analysis tools.

In particular, tags do not have any associated instrumentation. While the mapping between instructions and tags consumes space, it need not induce any run-time cost. For example, the mapping can be located within a section of a compiled binary that is not loaded into memory at run time, or maintained in a separate file.

The tags we propose could take several forms, but one particularly convenient one is to associate overhead instructions with special unique procedure names within the line map. For example, synchronization code could be tagged with the special procedure or file name parallel-overhead:sync.

2.3 Analyzing efficiency

In a parallel program, we must consider two kinds of efficiency: parallel efficiency across multiple processor cores and efficiency on individual processor cores. With information about parallel idleness and overhead attributed hierarchically over loops,² procedures, and the calling contexts of a program, we can directly assess parallel efficiency and provide guidance for how to improve it (see Table 1). If a region of the program (e.g., a parallel loop) is attributed with high idleness and low overhead, the granularity of the parallelism could profitably be reduced to enhance parallel efficiency. If the overhead is high and the idleness low, the granularity of the parallelism should be increased to reduce overhead. If the overhead is high and there is still insufficient parallelism, the parallelism is inefficient and no granularity adjustment will help; keeping the idle processors busy requires a different parallelization. For instance, one might use a combination of data and functional parallelism rather than one alone.

² Because we collect performance metrics using statistical sampling of hardware performance counters, which associates counts directly with instructions, and use binary analysis to associate instructions with higher-level program structures such as loops, we can directly compute and attribute metrics at the level of individual loops.

One can assess the efficiency of work and identify rate-limiting factors on individual processor cores by using metrics derived from hardware performance counter measurements. Many different factors can limit an application's performance, such as instruction mix, memory bandwidth, memory latency, and pipeline stalls. For each of these factors, information from hardware performance counters can be used to compute derived metrics that quantify the extent to which the factor is a rate limiter. Consider how to assess whether memory bandwidth is a rate limiter. During an execution, one can sample hardware counter events for total cycles and memory bus transactions. By multiplying the sampling period by the sample count for each instruction, one can obtain an estimate of how many bus transactions are associated with each instruction. By multiplying the number of bus transactions by the transaction granularity (e.g., the line size for the lowest-level cache), one can compute the amount of data transferred by each instruction. By dividing the amount of data transferred by instructions within a scope (e.g., a loop) by the total number of cycles spent in that scope, one can compute the memory bandwidth consumed in that scope. By comparing that with a model of peak bandwidth achievable on the architecture, one can determine whether a loop is bandwidth bound or not. Attributing metrics to static scopes such as loops and dynamic contexts such as call paths to support such analysis of multithreaded programs is the topic of the next section.

3. Logical call path profiles

To enable effective performance analysis of higher-level programming languages it is necessary to bridge the gap between the user's abstractions and their implementation. A key aspect of this is recovering user-level calling contexts. As mentioned previously, when Cilk programs execute, user-level calling contexts are separated in space and time by work stealing. Mapping measurements during execution back to a source program requires reassembling user-level contexts, which have been fragmented during execution. The next two sections extend the notion of call path profiling by defining logical call paths and describing how to generally and efficiently obtain logical call path profiles using a logical calling context tree. Logical call path profiling applies to both parallel and serial applications. In Section 5, we describe how this technique forms an essential building block for measurement and analysis of multithreaded Cilk program executions by a work-stealing run-time system.

3.1 Logical call paths

A sampling-based call path profiler obtains a call path by unwinding the call stack at a sample point to obtain a list of active procedure instances, or frames. Such a call path may not correspond directly to a user-level calling context. We introduce the notion of logical call paths to bridge this gap. We obtain logical call paths by logically unwinding the call stack. To support a precise discussion of this concept, we introduce and define the following terminology.

A bichord is a pair ⟨Pi, Li⟩ consisting of a p-chord Pi and an l-chord Li, where each p-chord (or l-chord) is a sequence of p-notes (l-notes), e.g.:

⟨Pi, Li⟩ = ⟨(pi,1, . . . , pi,m1), (li,1, . . . , li,m2)⟩

A note represents a frame; a chord, a grouping of frames; and a bichord, the association of a group of physical stack frames (Pi) with a group of logical stack frames (Li). Logical frames correspond to a user-level calling context; physical frames correspond to an implementation-level realization of that view. The p-notes (pi,1, . . . , pi,m1) that form p-chord Pi represent the bichord's physical call path fragment, while the l-notes form the logical call path fragment. We say that the length |Pi| of p-chord Pi is the number of p-notes contained therein, i.e., m1 in the above example; similarly, |Li| = m2.

A logical call path is a sequence of bichords

⟨⟨P1, L1⟩, ⟨P2, L2⟩, . . . , ⟨Pn, Ln⟩⟩

where ⟨P1, L1⟩ is the program's entry point and where ⟨Pn, Ln⟩ represents the innermost set of frames. It is natural to speak of the p-chord projection for the logical call path as

⟨P1, . . . , Pn⟩

and the p-note projection as

⟨(p1,1, . . . , p1,m1), . . . , (pn,1, . . . , pn,mn)⟩

where p1,1 represents the physical program entry point and the projection represents the physical call path from the entry point to the sample point. Logical projections are analogous.

To provide intuition for a discussion of bichord forms, it is useful to consider a concrete representation. We represent a p-note projection as a list of instruction pointers, one for each procedure frame active at the time a sample event occurs. The first instruction pointer of the unwind (pn,mn) is the program counter location at which the sample event occurred. The rest of the list contains the return address for each of the active procedure frames. Similarly, each l-note in a logical call path contains an opaque logical instruction pointer that represents the logical context.

Defining a logical call path to consist of a sequence of bichords formed of notes enables us to preserve interesting relationships between the physical and logical call path. To formalize these relationships, we first observe that a logical call path's p-note projection should always have a non-zero length because the physical stack is never empty. Moreover, intuitively, every l-chord must be associated with at least one p-note. This implies that no bichord should have a zero-length p-chord. Equivalently, we observe that a p-note projection should not have 'gaps', i.e., a machine cannot return to a 'virtual' logical frame — an l-note without an associated p-note — and then return back to a physical frame. From this starting point, we consider the possible relationships, or associations, between the lengths of a bichord's p-chord and l-chord. Given bichord Bi = ⟨Pi, Li⟩, there are several possible associations between |Pi| and |Li| that we describe with a member from the set {0, 1, M} × {0, 1, M}, where M (a mnemonic for multi or many) represents any natural number m ≥ 2. We are interested in the following four categories accounting for five of the possible association types:

1. 1 ↔ 1. One p-note directly corresponds to one l-note—the typical case for C or Fortran code where a physical procedure frame corresponds to a logical procedure frame.

2. 1 ↔ 0 and M ↔ 0. A p-chord corresponds to an empty l-chord. This situation typically arises when run-time support code is executed. For example, a sample event that interrupts the run-time system's scheduler may find several physical frames that correspond to no logical procedure frame.

3. M ↔ 1. This association often describes the run-time system implementing a high-level user routine. For example, a Python interpreter may require a chain of procedure calls (several p-notes) to implement a user-level call to sort a list.

4. 1 ↔ M. At first sight, this association may seem esoteric. However, it has important applications. It directly corresponds to using Cilk's scheduling loop as a proxy for walking the cactus stack of parent procedures that are stored in the heap and have no physical presence on the stack. As another example, a Java compiler could form one physical procedure from a 'hot' chain of user-level procedures.

Three observations are apropos. First, as previously discussed, associations 0 ↔ {0, 1, M} are excluded, meaning that the length of a p-chord is always non-zero. Second, and in contrast, association (2) implies that it is possible to have a zero-length l-chord. The final omitted association, M ↔ M, can always be represented as some combination of categories (1–4) above.

We now concisely define a logical call path as a sequence of bichords ⟨⟨P1, L1⟩, ⟨P2, L2⟩, . . . , ⟨Pn, Ln⟩⟩ where n ≥ 1 and ∀i[|Pi| ≥ 1], but where it is possible that |Li| = 0 for any i.

3.2 Representing logical call path profiles

At run time, we wish to efficiently obtain and represent a logical call path profile, i.e., a collection of logical call paths annotated with sample counts, with the time dimension removed. Our approach is to form a logical calling context tree—an extension of a calling context tree (CCT) [2]—that associates metric counts with logical call paths.

3.2.1 Weighted logical calling context trees

We first define a very simple logical CCT. Given a logical unwind

⟨⟨Pn, Ln⟩, ⟨Pn−1, Ln−1⟩, . . . , ⟨P1, L1⟩⟩

where ⟨Pn, Ln⟩ is a sample point, the straightforward extension of a CCT ensures that the path

⟨⟨P1, L1⟩, ⟨P2, L2⟩, . . . , ⟨Pn, Ln⟩⟩

exists within the tree, where ⟨P1, L1⟩ is the root of the tree and where ⟨Pn, Ln⟩ is a leaf node. Metrics such as sample counts are associated with each leaf node (sample point); in this example, metrics at ⟨Pn, Ln⟩ are incremented.

We define the physical projection of a logical CCT to be the CCT formed by taking the p-chord projection of each call path in the logical CCT. The logical projection of a logical CCT is defined analogously.

3.2.2 Efficiently representing logical calling context trees

While this logical CCT representation is simple, treating bichords as atomic units can result in considerable space inefficiency. To reduce memory effects, we wish to share notes without losing any information represented in the logical CCT. The Appendix describes when sharing is possible and develops a more efficient and practical implementation.

4. Obtaining logical call path profiles

Given the definition of a logical call path and the representation of a call path profile using a logical calling context tree, we now turn our attention to obtaining a logical call path profile. To provide low, controllable measurement overhead, we use statistical sampling and form the logical calling context tree by collecting and inserting logical call paths on demand for each sample. 'Physical' call path profilers use stack unwinding to collect the call path. Since the physical calling context alone is insufficient for obtaining the logical call path, we develop the more general notion of logical stack unwinding to collect the logical call path.
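The on-demand insertion of sampled logical call paths into a weighted logical CCT (Section 3.2.1) is a root-to-leaf walk that increments a metric at the leaf. A minimal sketch, with bichords modeled as hashable tuples and illustrative contents:

```python
class CCTNode:
    """One logical calling context; children are keyed by whole bichords."""
    def __init__(self):
        self.children = {}  # bichord -> CCTNode
        self.samples = 0    # metric: sample count at this context

def insert_path(root, logical_call_path):
    """Walk root->leaf along the bichords, creating nodes on demand,
    and increment the sample count at the leaf (the sample point)."""
    node = root
    for bichord in logical_call_path:  # outermost frame first
        node = node.children.setdefault(bichord, CCTNode())
    node.samples += 1
    return node

root = CCTNode()
main = ((0x400000,), ("main",))            # (p_chord, l_chord) tuples
solve = ((0x400800, 0x400810), ("solve",))
insert_path(root, [main, solve])
insert_path(root, [main, solve])           # same context: path is shared
insert_path(root, [main])                  # sample attributed to main itself
leaf = root.children[main].children[solve]
print(leaf.samples, root.children[main].samples)  # 2 1
```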

4.1 Logical stack unwinding

Consider a contrived example where a Python driver calls a Java routine that calls a Cilk solver. Though unusual, this example shows that each bichord in a logical call path could potentially derive from a different run-time system. Because run-time systems use the system stack in their implementation, this suggests that the actual process of logical unwinding should be controlled by the physical stack. This is natural because although the physical call stack may represent the composition of calls from many different languages, it conforms to a known ABI. In addition, using a physical unwind naturally corresponds to our requirement that a p-note projection not have 'gaps', i.e., there is at least one representative stack frame for each l-chord in the logical unwind. However, since a physical stack unwinder alone cannot determine the association of the bichord, the length of the p-chord, or the content of the l-chord, some additional information must be available to construct the bichord. This information can be obtained using a language-specific plug-in, or agent, to assist a 'physical' stack unwinder. Each agent would understand its corresponding language implementation well enough to determine the particulars of reconstructing an l-chord given the start of a p-chord. It is important to emphasize a p-chord's start because assistance from the agent will in general be necessary to determine the p-chord's length, e.g., 1 vs. M.

There must be some way of selecting which agent to use at any point in the logical unwind. In the example above, one must know when to use the Cilk, Java and Python agents, respectively, to obtain the relevant bichords. Observe that at any point in the execution, the return address instruction pointer located in the stack frame should map to at most one run-time system and therefore one agent. Consequently, the frame's return address serves as a proxy for the specific agent that should be consulted to assist formation of the bichord. During a program's execution, the mapping of code segments within the address space (the load map) can typically be determined by interrogating the operating system.

4.2 Thread creation contexts

Often it is useful to know the context in which a thread was created. The creation context of a thread is defined as the calling context at the time the thread was created. For example, consider a solver using fork-join parallelism where a pool of Pthreads is created using several calls to pthread_create. It is desirable to capture the calling context of the pthread_create so that the Pthread can be rooted within the context of the solver. The thread creation context may be captured and maintained as an extension to the thread's physical stack.

4.3 An API for logical unwinding

Algorithm 1: logical-backtrace: performs a logical unwind
  let c be the unwind cursor, initialized with the machine context and language-specific logical unwind agents
  while step-bichord(&c) ≠ END_UNWIND do
    let a be the bichord's association (from c)
    while step-pnote(&c) ≠ END_CHORD do
      Record p-note (instruction pointer from c)
    while step-lnote(&c) ≠ END_CHORD do
      Record l-note (logical instruction pointer from c)
    Form bichord from a and the lists of p-notes and l-notes

The logical unwinding API is divided into a two-level hierarchy corresponding to the division between bichords and notes. In particular, the top level addresses finding the bichords within a logical unwind while the other level targets finding the notes of a chord. An outline of the backtrace routine is shown in Algorithm 1. Each level adopts semantics similar to libunwind [20]. This means that to find each bichord in the logical unwind ⟨⟨Pn, Ln⟩, ⟨Pn−1, Ln−1⟩, . . . , ⟨P1, L1⟩⟩,³ n successive calls to step-bichord are required, along with an additional call that returns a special value to indicate the unwind is completed. The advantage of these semantics is that they help ensure agents do not have to perform contextual look-ahead. For example, to examine all l-notes within the l-chord (li,1, . . . , li,m), m + 1 calls are issued to step-lnote. This means that the agent need not know that li,1 is the last l-note in the l-chord unwind until the m + 1st call to step-lnote. This fact is particularly useful for an agent to a multithreaded run-time system because thread-specific state need not be maintained within the agent. Rather, all state for the unwind can be maintained by a fixed-sized thread-specific cursor allocated by the logical unwinder.

As discussed previously, logical unwinding is driven by a stack unwind. On each call to step-bichord, the library determines if a valid physical stack frame exists. If so, it extracts the return address instruction pointer and determines if it maps to any agent. If it does, that particular agent is used to complete the discovery of the bichord. Otherwise, the 'identity' agent is used to create a 1 ↔ 1 bichord representing native code.

Observe that the asymmetry between p-chords and l-chords plays a critical role in the unwind process. For a p-chord Pi of length mi, the mi + 1th call to step-pnote both completes enumeration of Pi's p-notes and discovers the next p-chord. For example, consider a section of the physical projection representing p-chords Pi and Pi+1:

(. . . , pi,mi)(pi+1,1, . . .)
While iterating over the p-notes in p-chord Pi , we first issue mi
We have designed and implemented a general API for obtaining
calls to step-pnote. On the mi + 1th call, the agent discovers that
logical unwinds given language specific agents. Technically, there
there are no more p-notes in Pi , but only because it has found p-
are two sub-APIs, one for collecting logical unwinds (using agents)
note pi+1,1 , the beginning of p-chord Pi+1 . This means that the
and one describing the interface to which language specific agents
p-note portion of the cursor is pointing to the beginning of Pi+1
must conform and the assumptions they may make.
before the cursor has stepped to Pi+1 . This ‘peeking’ behavior
The API for logical unwinding is designed to place as much
is important because we must know the initial portion of Pi+1 in
burden as possible on the non-agent library routines so that agent
order to know which agent to assign the responsibility of the next
implementation is as easy as possible. For example, an agent is
bichord. In contrast, step-lnote need not ‘peek’ ahead in to the
not required to perform any look-ahead to determine the length of
next l-chord. Indeed, it should not because the next l-chord may
an l-chord. Although this information could be used by the logical
be handled by a different agent and may have length 0.
unwinder (Algorithm 1) for allocating storage, we determined that
it was more desirable to complicate the code for the unwinder
than to complicate each agent’s implementation. Consequently, 5. Measurement and analysis of Cilk executions
the logical unwinder ensures that enough buffer space is always To demonstrate the power of using our parallel idleness and over-
available to store a bichord. As another example, the agent interface head metrics in combination with logical call path profiling, we de-
sub-API promises a small amount of functionality to ease agent veloped an implementation of a profiler for Cilk-5 [10] (currently
implementation, such as a means to inspect the address space and a
safe memory allocator (malloc may not be safe). 3A logical unwind is simply the reverse of a logical call path.

233
at version 5.4.6). We chose Cilk for several reasons. First, Cilk lets parallel programmers focus on specifying logical concurrency, while its run-time system handles the details of executing that logical concurrency efficiently. The power of Cilk's abstraction of logical concurrency is something that will be critical if programmers are to routinely write scalable multithreaded applications. (Indeed, Cilk is being developed into a commercial product.) Second, Cilk pioneered a sophisticated work-stealing scheduler that is provably efficient assuming the availability of sufficient concurrency. Third, the Cilk compiler and run-time implementations are freely available.

[Figure 1 diagram: app. source is compiled and linked into an optimized binary; hpcrun profiles the call stack during execution; hpcstruct performs program structure analysis on the binary; hpcprof correlates the profile with source into a database; hpcviewer presents the results.]

Figure 1. HPCTOOLKIT workflow.

Our profiler implementation is part of HPCTOOLKIT [1, 24], a performance toolkit whose workflow is shown in Figure 1. hpcrun (top, middle), a sampling-based call path profiler, measures the performance of fully-optimized executables. hpcstruct analyzes application binaries to recover program structure such as procedures and loop nests. hpcprof (bottom right) interprets call path profiles and correlates them with program structure, generating databases for interactive exploration with hpcviewer.

To support measurement and analysis of work, idleness, and parallel overhead in executions of multithreaded Cilk programs, we extended hpcrun to collect logical call path profiles for Cilk. In the following sections, we describe our approach, along with minor supporting modifications to the Cilk run-time system. After measurements are complete, we use logical calling contexts to correlate our measurements of work, idleness, and parallel overhead with the Cilk source program and interactively explore the performance data using hpcviewer.

5.1 Parallel work and idleness

To support measurement of our idleness metric, we modified the Cilk scheduler to classify threads as working or non-working and to maintain the number of working and idle threads (nw and nw̄, respectively). These modifications were straightforward. Each worker thread executes a scheduling loop that acquires work (through a steal, if necessary) and then performs that work. Since the work is executed via a method call, the scheduling loop is 'exited' to perform the work and then re-entered as the worker thread waits to acquire more work. To identify a thread as actively working or idle, we set a thread-specific state variable just before the thread exits or enters the scheduling loop, respectively. At the same time, a global counter representing the number of working threads is atomically incremented or decremented as each thread exits and enters the scheduling loop, respectively. When a sample event interrupts a worker thread, one of two things happens: if the worker is idle, the sample event is ignored; if the worker is active, the logical call path for the work being executed is collected, one sample is attributed to the work metric associated with this logical call path, and a fractional sample nw̄/nw of idleness is added to the idleness metric associated with this logical call path.

  cilk int fib(int n)
  {
    if (n < 2)
      return (n);
    else {
      int x, y;
      x = spawn fib(n - 1);
      ...

Figure 2. Fragment of a Cilk program (Fibonacci numbers).

5.2 Parallel overhead

To attribute parallel overhead to logical calling contexts, we use several mechanisms (described below) to identify all overhead inserted by the Cilk compiler into a Cilk application binary. At run-time, samples associated with parallel overhead will be attributed as work to the logical calling context in which they arise. After an execution of a Cilk program completes, in a post-mortem analysis phase we partition sample counts of the work into useful work and parallel overhead based on compile-time information.

Our strategy for identifying the parallel overhead within a Cilk application binary relies on the hpcstruct binary analysis tool for recovering program structure from a binary. hpcstruct analyzes an application binary to recover a mapping between object code and program structure. In particular, hpcstruct recovers the structure of procedures, including a procedure's loop nests, and identifies code that has been inlined therein. Thus, hpcstruct will naturally identify overhead-related code in a procedure if that code appears to have been inlined. We accomplish this by using #line compiler directives to simulate inlining.

Given this overall strategy, we used two different methods to ease the implementation effort. The Cilk compiler compiles Cilk source code to C and then uses a vendor C compiler to generate an executable. It turns out that nearly all parallel overhead inserted into the intermediate C code by the Cilk compiler is encapsulated either by a call to a method or a macro.⁴ Consequently, it is possible to identify essentially all overhead by 1) tagging about 45 Cilk run-time library routines with #line directives, and 2) inserting appropriate #line directives surrounding the appropriate macro references before the generated C code is fed to the vendor compiler.⁵ Given this fact, and given our unfamiliarity with the Cilk compiler's source code, we determined that instead of modifying the compiler it would be easier to 1) appropriately tag the Cilk run-time routines and 2) write a Cilk post-processor that inserted the appropriate tags in the intermediate C file. To preserve the ability to recover sensible structure for a routine and use a debugger with the resulting executable, our post-processor preserves the line numbers of the original source file. A sanitized example of an original Cilk routine and its corresponding post-processed C code is shown in Figures 2 and 3. (Note that the 'odd' formatting in the post-processed C, such as the declaration on the first line, is critical for aligning the line numbers of the generated code with the source.)

⁴ Parallel overhead that derives neither from a method nor a macro call is either continuation control flow, a declaration, or trivial.

⁵ When a macro is expanded by the C preprocessor, no indication of its originating source file is typically recorded. In contrast, if a function call is inlined, a C compiler will effectively generate the appropriate #line directives.
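The two-stage accounting described in Sections 5.1 and 5.2 can be sketched in a few lines of C. This is an illustrative sketch, not the hpcrun source: the structure and function names below are invented, and the post-mortem partition is reduced to a per-sample test against the synthetic "hpctoolkit:parallel-overhead" file tag that the #line directives introduce.

```c
#include <assert.h>
#include <string.h>

/* Per-context sample counts: work and idleness are accumulated at
 * sample time; overhead is separated out post mortem. */
struct ctx_metrics { double work; double idleness; double overhead; };

/* Sample-time attribution (Section 5.1 scheme): a sample in an idle
 * worker is dropped; a sample in a working thread adds 1 to the
 * context's work metric plus a fractional n_idle/n_working to its
 * idleness metric, charging idle time to the contexts that were
 * executing while other workers starved. */
static void on_sample(struct ctx_metrics *ctx, int n_working, int n_idle,
                      int worker_is_idle) {
    if (worker_is_idle || n_working == 0)
        return;                           /* idle workers' samples are ignored */
    ctx->work     += 1.0;
    ctx->idleness += (double)n_idle / (double)n_working;
}

/* Post-mortem partition (Section 5.2 scheme): reclassify a work sample
 * as parallel overhead when its attributed source file is the overhead
 * tag inserted via #line directives. */
static void partition_sample(struct ctx_metrics *ctx, const char *src_file) {
    if (strcmp(src_file, "hpctoolkit:parallel-overhead") == 0) {
        ctx->work     -= 1.0;
        ctx->overhead += 1.0;
    }
}
```

The key design point mirrored here is that overhead needs no run-time test at all: every active sample is counted as work, and the work/overhead split is decided later purely from the compile-time file tags.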
int fib(WorkerState* ws, int n) { struct frame* fr;
#line 28 "hpctoolkit:parallel-overhead"
  CILK2C_INIT_FRAME(fr, ...);
  CILK2C_START_THREAD_FAST();
#line 28 "fib.cilk"

  if (n < 2) { int t = n;
#line 31 "hpctoolkit:parallel-overhead"
    CILK2C_BEFORE_RETURN_FAST();
#line 31 "fib.cilk"
    return t;}
  else {
    int x; int y;
    { fr->header.entry=1; fr->scope0.n = n;
#line 34 "hpctoolkit:parallel-overhead"
      CILK2C_BEFORE_SPAWN_FAST();
      CILK2C_PUSH_FRAME(fr);
#line 34 "fib.cilk"
      x = fib(ws, n-1);
#line 34 "hpctoolkit:parallel-overhead"
      CILK2C_XPOP_FRAME_RESULT(fr, 0, x);
      CILK2C_AFTER_SPAWN_FAST();
#line 34 "fib.cilk"
    }
    ...

Figure 3. Post-processed C fragment from the Cilk compiler (corresponding to Figure 2). Parallel overhead is demarcated with #line directives.

5.3 Cilk call path profiles

To attribute parallel idleness and overhead to logical calling contexts, we modified hpcrun to collect logical call path profiles for Cilk. In particular, we implemented the logical unwind API (described in Section 4.3), developed a Cilk-specific agent, and modified hpcprof's profile interpretation and source code correlation to normalize the results. The design of the Cilk agent illustrates several important points. Although discussing this agent necessarily involves details about the Cilk implementation, it is important to note that the API remains language independent. For example, we are working on agents for other models such as OpenMP.

To understand the Cilk agent, it is necessary to review some high-level details about the Cilk-5 implementation. For each source Cilk routine, the Cilk compiler generates two clones, a 'fast' and a 'slow' version. The fast clone is very similar to the corresponding C procedure, and is executed in the common case. Importantly, whenever a procedure is spawned, the fast version is executed. The slow clone is executed only when parallel semantics are necessary, such as when a procedure is stolen.

Each worker thread maintains a deque (stored in the heap) of ready procedure instances, which together form a 'Cactus stack', i.e., a tree where the root corresponds to the bottom (outermost frame) of the stack. Local work is pushed and popped from the tail of the deque (top or inner frames) while thieves steal from the head (bottom or outer frames). Execution proceeds on the thread's stack even though a 'shadow' continuation is maintained on the deque. Whenever a thief steals a procedure's continuation, it resumes it using the slow version of that procedure. Since frames may only be stolen from the deque's head (bottom of the Cactus stack), this implies that the descendants of a fast procedure may only be fast procedures themselves.

We may infer the following invariants about the frames on a worker's stack (in top-down order):

A. There may be i frames corresponding to Cilk run-time routines (e.g., creation of continuation information) or user-level C routines. Cilk run-time routines correspond to a bichord with association 1 ↔ 0 (since they are not part of the logical call path), while user-level C routines correspond to an association of 1 ↔ 1.

B. There may be j frames corresponding to Cilk fast frames. Since the fast clone of a Cilk routine directly corresponds to a physical frame and a logical frame, the pair corresponds to a bichord with association 1 ↔ 1.

C. There is always at least one frame corresponding to the Cilk scheduler.

These segments may not be interchanged.

The exact interpretation of segment C depends upon whether there are additional ancestor frames in the Cactus stack. That is, when a worker steals any procedure other than 'main', that procedure's logical context is represented as a chain of ancestor frames within the Cactus stack. In this case, the scheduler frame has association 1 ↔ M. Otherwise, if the innermost frame in segment B corresponds to 'main', which has no logical calling context, the scheduler frame has association 1 ↔ 0.

Figure 4. A Calling Context (top-down) view of Cholesky.

5.4 Case study

To demonstrate the power of attributing work, parallel idleness and parallel overhead to logical call path profiles, we apply our method to analyze the performance of a Cilk program for Cholesky decomposition. We used the example Cholesky program in the Cilk 5.4.6 source distribution. We ran a problem size of 3000 × 3000 (30,000 non-zeros) on an SMP with dual quad-core AMD Opterons (2360 SE, 2.5 GHz) and 4 GB main memory. We profiled the execution using hpcrun, which gathers separate data for each thread, and processed the results using hpcprof.

Figure 4 presents one view of the aggregated results displayed by hpcviewer. The view has three main components. The navigation pane (lower left sub-pane) shows a top-down view of the
calling context tree, partially expanded. One can see several user-
level procedure instances along the call paths. (Physical procedure
instances are not shown.) The selected line in the navigation pane
and the source pane (top sub-pane) shows the procedure cholesky.
Each entry in the navigation pane is associated with metric values in
the metric pane to the right. Sibling entries are sorted with respect
to the selected metric column (in this case ‘work (all/I)’). Observe
at the bottom of the navigation pane a loop, located within the con-
text of cilk_main. The loop is detected by hpcstruct’s program
structure analysis; the navigation pane actually contains a fusion of the dynamic logical calling contexts and hpcstruct's static context information.
The metric columns in Figure 4 show summed values over the
eight worker threads for work (in cycles), parallel idleness and
parallel overhead (yielding the ‘all’ qualifier in their names). Both
idleness and overhead are shown as percentages of total effort,
where effort is the sum of work, idleness and overhead. In the
idleness and overhead columns, the values in scientific notation
represent the aforementioned percentages; the values shown as
percentages to their right give an entry’s proportion of the total
idleness or overhead, respectively. The metrics are inclusive (hence
the ‘I’ qualifier) in the sense that they represent values for the
associated procedure instance in addition to all of its callees. Thus,
the metric name ‘work (all/I)’ means inclusive work summed over
all threads.
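The effort arithmetic behind these columns can be sketched directly: effort is the sum of work, idleness and overhead, and the idleness and overhead columns report each metric as a fraction of effort. The sketch below is illustrative only; the sample values in it are invented and are not the Cholesky measurements.

```c
#include <assert.h>

/* Illustrative derivation of the 'percentage of total effort' values
 * shown in the metric pane: effort = work + idleness + overhead, and
 * each wasted-time metric is reported relative to effort. */
struct effort_pct { double idleness; double overhead; };

static struct effort_pct pct_of_effort(double work, double idleness,
                                       double overhead) {
    double effort = work + idleness + overhead;
    struct effort_pct p = { 100.0 * idleness / effort,
                            100.0 * overhead / effort };
    return p;
}
```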
Because Cilk-5 emphasizes recursive decompositions of algorithms — parallelism is exposed through asynchronous procedure calls — call chains can become quite long. Nevertheless, expanding the calling context tree to the first call to cholesky and noting the metrics on the right is very informative. Figure 4 shows that about 47.2% of the total work of the program is spent in the top-level call to cholesky; the top-level call to mul_and_subT (which verifies the factorization) is a close second at about 46.0%. We can also quickly see that about 12.5% and 65.9% of the total parallel idleness and overhead, respectively, occur in cholesky. However, because this idleness and overhead are small with respect to effort (about 1.62% and 0.189%, respectively), it is clear that the parallelization of cholesky is very effective for this execution. In contrast, the parallelization of the entire program (for which we can use cilk_main as a proxy) is less effective, with overhead essentially remaining the same, but idleness accounting for about 11.6% of total effort.

To pinpoint exactly where inefficiency occurs using the idleness and overhead metrics, we turn to the 'Callers' or bottom-up view in Figure 5. If the top-down view looks 'down' the call chain, the bottom-up view looks 'up' to a procedure's callers. Thus, at the first level, the bottom-up view lists all the procedures in the program, rank-ordered according to the selected metric—in this case, relative idleness, the most troubling inefficiency. Note that in contrast to Figure 4, these metric values are 'exclusive' (signified with an 'E') in the sense that they do not include values for a procedure's callees. The top two routines in the rank-ordered list are versions of the C library routine free and together account for about 34.3% of the program's idleness. When the callers for these routines are expanded, it is evident that they are both called by free_matrix, a non-Cilk, i.e., serial, helper routine that deallocates the matrix for the Cholesky driver. Continuing down the list reveals that every routine shown in the screen shot except mul_and_subT is a serial helper. Since each of these serial routines except block_schur_full is related to initialization or finalization, it is immediately evident that to reduce parallel idleness either the size of the matrix must be increased or the initialization and finalization routines must be parallelized. The significance of this conclusion is that without having any prior knowledge of the source code, our techniques have enabled us to quickly make strong and precise statements about the parallel efficiency of this program. Although it is not surprising that serial code is responsible for idleness, the fact that we can immediately quantify and pinpoint its impact on parallel efficiency shows the effectiveness of our methods.

Figure 5. A Callers (bottom-up) view of Cholesky.

6. Related work

Our parallel idleness metric is similar to Quartz's [3] notion of 'normalized time' to highlight code with poor concurrency. Normalized time is computed by attributing 1/nw (using the notation from Section 2.1) to the relevant section of code on each sample of a working thread, inflating compute times in areas of poor parallelization. While our idleness metric is similar in that it also highlights code sections with poor concurrency, it is different in that it is a direct measure of parallel idleness: nw̄/nw. This quantitative/qualitative distinction is important because Quartz's qualitative metric can be ambiguous. Consider a program that executes with n threads (on n cores) in two phases named X and Y, where each phase executes for an equal amount of time, t. During phase X, procedure x executes serially; during phase Y, n instances of procedure y execute without any loss to overhead. Unintuitively, the normalized times ‖Tx‖ and ‖Ty‖ for procedures x and y are identical (t/1 and nt/n, respectively) even though n − 1 threads are idle for the whole duration of phase X. In contrast, our idleness metric would yield values of Ix = (n − 1)t and Iy = 0. Although Quartz eliminates this ambiguity by using n counters for each procedure, assigning t to counter x1 and 0 to counters x2 . . . xn, this solution requires a comparison between n counters to convey the same thing as Ix. Additionally, we attribute idleness to full logical calling contexts, even in the presence of a work-stealing run time.

The idea of computing parallel overhead is not new. For example, 'cycle accounting' is a powerful methodology for partitioning stall cycles during the execution of serial code [9, 17]. To predict parallel performance, Crovella and LeBlanc describe a 'lost cycles analysis' [8] that separates parallel overhead from pure computation. They further divide parallel overhead into sub-categories
useful for differentiating between different performance problems. However, they lament that "[m]easuring lost cycles directly for the entire environment space is still impractical." Our method directly measures parallel overhead without any run-time cost.

Several tools for obtaining call path profiles have been developed, but they collect only physical call path profile projections [4, 11, 13, 18, 21] or logical (user-level) call path profile projections, such as for Java [5, 25, 26]. In parallel but independent work, Itzkowitz et al. describe an OpenMP API that enables a statistical call path profiler to correlate user-level call paths with run-time metrics about whether a thread is working or waiting [16]. Our work is more general in the sense that we define logical call path profiles, explain how they can be efficiently represented, and describe a general API for obtaining them. Although the two idleness metrics are similar, we additionally collect and attribute a parallel overhead metric without any run-time cost.

It is interesting to compare our performance analysis of Cilk to Cilk's own performance metrics. Cilk computes two metrics that directly correspond to the theoretical model that underlies Cilk's provably-efficient scheduler. The first is total work, or the time for a serial execution of the program with a given input. The second is critical path, or a prediction of the execution time on an infinite number of processors. The significant advantages of Cilk's metrics are that they are 'platform independent' and provide a theoretical upper bound on the scalability of a program with a given input. However, they share two important disadvantages. First, Cilk's metrics are computed using extremely costly instrumentation — which itself disturbs the application's performance characteristics. Second, these metrics do not aid the programmer in pinpointing where in the source code inefficiency arises. In contrast, our method immediately pinpoints parallel inefficiency in user-level source code. Moreover, paired with hardware performance counter information, our method can help distinguish between different types of architectural bottlenecks in different regions of code.

Critical path is a classic metric for understanding parallel programs. While Cilk computes the critical path's lower bound for a program and given input, it is also possible to determine the actual critical path for an execution. Intel's VTune [15] computes the actual critical path for an execution, though at the native thread level. The classic problem with critical path information is that after expending much effort to reduce its cost, a completely different critical path may emerge, slightly less costly than the original. Therefore, it is much more useful to know how much 'slackness' exists in the critical path. Intel's Thread Profiler [6, 14] not only computes the critical path but also classifies its segments by concurrency level and thread interaction. Given a segment where nT threads execute on n cores (n > 1), the tool classifies that segment's concurrency level as either serial (nT = 1), under-subscribed (1 < nT < n), fully parallel (nT = n), or oversubscribed (nT > n). These categories are then qualified by three interaction effects: cruise, impact and blocking. Cruise time is time during which a thread does not delay the next thread on the critical path, while impact time is the opposite. If a thread on the critical path waits for some external event, it accumulates blocking time. Thus, performance tuners should focus on areas of serial or under-subscribed impact time rather than fully parallel cruise time. The disadvantages of Thread Profiler are that it uses costly instrumentation, reports information at the native (Win32) thread level, and does not provide contextual information.

An interesting observation about our idleness and overhead metrics is that, in the context of Cilk, they approximate a quantitative measure of critical path slackness, tied to full calling context. To see this, note that a Cilk worker thread is idle only if it is waiting for another worker thread to 1) make asynchronous calls or 2) release a lock. Therefore, if a thread's idleness is high in a certain context, then that context was on one of the 'interesting' critical paths. One deficiency of our profile data is that it does not distinguish between idleness (or overhead) that is the result of a few calls to a long-running function as opposed to many calls to a fast one. However, given the properties of the Cilk scheduler, we can compute metrics similar to Thread Profiler's but for a fraction of the overhead.

7. Conclusions

Because of the growing need to develop applications for multicore architectures, effective tools for quantifying and pinpointing performance bottlenecks in multithreaded applications are absolutely essential. This will be increasingly true as less skilled application developers are forced to write parallel programs to benefit from increasing core counts in emerging processors.

We have shown that attributing work, parallel idleness and parallel overhead to logical calling contexts enables one to quickly obtain unique insight into the run-time performance of Cilk programs. In particular, we demonstrated the power of our method by using it to pinpoint and quantify serialization in a Cilk execution. A strength of our approach is that our performance metrics are completely intuitive and can be mapped back to the user's programming abstractions, even though the run-time realization of these abstractions is significantly different. While we described a prototype tool for measurement and analysis of multithreaded programs written in Cilk, our underlying techniques for computing parallel idleness, parallel overhead, and obtaining logical call path profiles are more general and can be applied directly to other multithreaded programming models such as OpenMP and Threading Building Blocks.

Our work shows that it is possible to construct effective and efficient performance tools for multithreaded programs. The run-time cost of our profiling can be dialed down arbitrarily low by reducing the sampling frequency. We have also shown that it is possible to collect implementation-level measurements and project detailed metrics to a much higher level of abstraction without compromising their accuracy or utility.

Acknowledgments

Development of the HPCTOOLKIT performance tools is supported by the Department of Energy's Office of Science under cooperative agreements DE-FC02-07ER25800 and DE-FC02-06ER25762. HPCTOOLKIT would not exist without the contributions of the other project members: Laksono Adhianto, Michael Fagan, Mark Krentel, and Gabriel Marin. Project alumni include Robert Fowler and Nathan Froyd.

References

[1] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. HPCToolkit: Tools for performance analysis of optimized parallel programs. Technical Report TR08-06, Rice University, 2008.

[2] G. Ammons, T. Ball, and J. R. Larus. Exploiting hardware performance counters with flow and context sensitive profiling. In SIGPLAN Conference on Programming Language Design and Implementation, pages 85–96, New York, NY, USA, 1997. ACM Press.

[3] T. E. Anderson and E. D. Lazowska. Quartz: a tool for tuning parallel program performance. SIGMETRICS Perform. Eval. Rev., 18(1):115–125, 1990.

[4] Apple Computer. Shark. http://developer.apple.com/tools/sharkoptimize.html.

[5] W. Binder. Portable and accurate sampling profiling for Java. Softw. Pract. Exper., 36(6):615–650, 2006.
[6] C. P. Breshears. Using Intel Thread Profiler for Win32 threads: Philosophy and theory. http://software.intel.com/en-us/articles/using-intel-thread-profiler-for-win32-threads-philosophy-and-theory, August 2007.

[7] D. R. Butenhof. Programming with POSIX threads. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1997.

[8] M. E. Crovella and T. J. LeBlanc. Parallel performance using lost cycles analysis. In Supercomputing '94: Proceedings of the 1994 conference on Supercomputing, pages 600–609, Los Alamitos, CA, USA, 1994. IEEE Computer Society Press.

[9] S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith. A performance counter architecture for computing accurate CPI components. SIGPLAN Not., 41(11):175–184, 2006.

[10] M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation, pages 212–223, Montreal, Quebec, Canada, June 1998. Proceedings published ACM SIGPLAN Notices, Vol. 33, No. 5, May, 1998.

[11] N. Froyd, J. Mellor-Crummey, and R. Fowler. Low-overhead call path profiling of unmodified, optimized code. In ICS '05: Proceedings of the 19th annual International Conference on Supercomputing, pages 81–90, New York, NY, USA, 2005. ACM Press.

[12] R. J. Hall. Call path profiling. In ICSE '92: Proceedings of the 14th international conference on Software engineering, pages 296–306, New York, NY, USA, 1992. ACM Press.

[13] Intel Corporation. Intel performance tuning utility. Linked from http://whatif.intel.com/.

[14] Intel Corporation. Intel thread profiler. http://www.intel.com/software/products/tpwin.

[15] Intel Corporation. Intel VTune performance analyzers. http://www.intel.com/software/products/vtune/.

[16] M. Itzkowitz, O. Mazurov, N. Copty, and Y. Lin. An OpenMP runtime API for profiling. http://www.compunity.org/futures/omp-api.html.

[17] D. Levinthal. Execution-based cycle accounting on Intel Core 2 Duo processors. http://www.devx.com/go-parallel/Link/33315.

[18] J. Levon et al. OProfile. http://oprofile.sourceforge.net/.

[19] M. Monchiero, R. Canal, and A. Gonzalez. Power/performance/thermal design-space exploration for multicore architectures. IEEE Transactions on Parallel and Distributed Systems, 19(5):666–681, May

... language design and implementation, pages 263–271, New York, NY, USA, 2006. ACM.

Appendix: Efficiently representing logical CCTs

Recall that Section 3.2.1 defined a logical calling context tree (L-CCT) as a tree of bichords. Accordingly, two distinct call paths in the tree may be partially shared if and only if they share a common prefix of bichords. (All paths share a common root.) One issue that arises during a straightforward implementation of L-CCTs is that common notes between multiple bichords are unnecessarily duplicated. We illustrate this problem with an example.

Suppose over the course of several samples, we obtain several logical unwinds of the forms below (where inner frames are on the left and a sample point, if relevant, is marked with a prime):

. . . ⟨(pi,a), (li,1)⟩, . . .   (1)
⟨(p′i,b, pi,a), (li,1)⟩, . . .   (2)
⟨(p′i,c, pi,b, pi,a), (li,1)⟩, . . .   (3)
. . . , ⟨(pi,c, pi,b, pi,a), (li,1)⟩, . . .   (4)
⟨(p′i,a), (li,1)⟩, . . .   (5)
. . . , ⟨(pi,e, pi,f, pi,a), (li,1)⟩, . . .   (6)
. . . , ⟨(pi,c), (li,1)⟩, ⟨(pi,b), (li,1)⟩, ⟨(pi,a), (li,1)⟩, . . .   (7)
. . . ⟨(pi,a), (lj,1)⟩, . . .   (8)
. . . ⟨(pi,a), (li,2, li,1)⟩, . . .   (9)

Unwinds (1)–(6), with bichords of association M ↔ 1 and 1 ↔ 1, could represent an interpreter implementing a high-level logical operation, signified by l-note li,1. Although none of these bichords are equal, all share li,1; and all but (5) share pi,a. However, an L-CCT treats each bichord as an atomic unit, thereby requiring that any common notes be duplicated when the corresponding call paths are inserted into the L-CCT. (Even the bichords in Unwinds (3) and (4) must be distinct because the former contains a sample and should therefore be a leaf node.) In general, the M-portion of these bichords may be long and the frequent sample rate identifies most, if not all, of the unique prefixes. An analogous situation occurs in our Cilk profiler, where the root bichord of (almost) all call paths has association 1 ↔ M. As a result, several seemingly unnecessary
2008. p-notes exist with the L-CCT. For compact representation of an L-
CCT, it is desirable to know when it is both possible and profitable
[20] D. Mosberger-Tang. libunwind. http://www.nongnu.org/
to share the notes of two bichords.
libunwind/.
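To make the duplication concrete, the following Python sketch (illustrative only — the data encoding and helper names are invented, not HPCToolkit code) inserts Unwinds (1)–(4) into a CCT that keys nodes on whole bichords and counts how many stored copies of the common p-note result:

```python
# Illustrative sketch (not HPCToolkit code): a bichord is a
# (p_chord, l_chord) pair of tuples; the CCT treats each bichord
# as one atomic key, so common notes are stored repeatedly.
from collections import Counter

# Unwinds (1)-(4): every bichord shares l-note "l_i1"; the p-chords
# share the suffix/prefix note "p_ia". A trailing ' marks a sample point.
unwinds = [
    [(("p_ia",), ("l_i1",))],                   # (1)
    [(("p_ib'", "p_ia"), ("l_i1",))],           # (2)
    [(("p_ic'", "p_ib", "p_ia"), ("l_i1",))],   # (3)
    [(("p_ic", "p_ib", "p_ia"), ("l_i1",))],    # (4)
]

def insert_atomic(tree, path):
    """Insert a call path into a CCT whose nodes are whole bichords."""
    for bichord in path:
        tree = tree.setdefault(bichord, {})

tree = {}
for u in unwinds:
    insert_atomic(tree, u)

def count_p_notes(tree, counts):
    """Count how often each p-note is stored across all tree nodes."""
    for (p_chord, _l_chord), child in tree.items():
        counts.update(p_chord)
        count_p_notes(child, counts)

counts = Counter()
count_p_notes(tree, counts)
print(counts["p_ia"])  # prints 4: one copy per atomic bichord
```

All four bichords are distinct keys, so the single shared p-note is stored four times — the duplication the note-sharing scheme below is designed to avoid.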
Terminology

Observe that some associations are naturally related. For example, 1 ↔ 0 is the natural ‘base case’ of M ↔ 0. Similarly, 1 ↔ 1 is the natural ‘base case’ of both 1 ↔ M and M ↔ 1. We therefore define the following association classes:

• A ↔ 0 = {1 ↔ 0, M ↔ 0}
• A ↔ 1 = {1 ↔ 1, M ↔ 1}
• 1 ↔ A = {1 ↔ 1, 1 ↔ M}

Let the functions ip and lip return the physical and logical instruction pointers given a p-note or l-note, respectively. The functions assoc and assoc-class return the association and association class of a bichord, respectively. For convenience, we also define assoc-class= to test whether two bichords have identical association classes.
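These definitions translate directly into code. The Python sketch below is our own reading, not a prescribed representation: it encodes each association's class membership as a set, treating 1 ↔ 1 as belonging to both classes for which it is the base case, and implements assoc-class= as a non-empty intersection of those sets:

```python
# Illustrative sketch: the five bichord associations and their
# association classes (names and encoding are invented).
ASSOC_1_0, ASSOC_M_0, ASSOC_1_1, ASSOC_M_1, ASSOC_1_M = range(5)

# 1<->1 is the base case of both A<->1 and 1<->A, so it is a member
# of two classes; every other association belongs to exactly one.
ASSOC_CLASSES = {
    ASSOC_1_0: {"A<->0"},
    ASSOC_M_0: {"A<->0"},
    ASSOC_1_1: {"A<->1", "1<->A"},
    ASSOC_M_1: {"A<->1"},
    ASSOC_1_M: {"1<->A"},
}

def assoc_class_eq(a1, a2):
    """assoc-class=: do two associations share an association class?"""
    return bool(ASSOC_CLASSES[a1] & ASSOC_CLASSES[a2])
```

Under this encoding, 1 ↔ 1 is class-compatible with both M ↔ 1 and 1 ↔ M, while M ↔ 1 and 1 ↔ M are not compatible with each other.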
Sharing within bichords

We first consider the limits of sharing within bichords. Sharing between any two bichords may either be full or partial. If two paths
partially share a bichord, they may still be able to partially share another bichord (cf. Unwinds (4) and (7)). However, partially sharing either bichord requires that the paths diverge in some fashion (otherwise they would be equal). Additional sharing would require that paths merge again, turning the tree into a graph and creating ambiguous calling contexts. Therefore, two bichords may be partially shared only if they are both roots of their respective call paths or their respective call-path predecessors are fully shared. After partial sharing, paths must diverge.

The next task is to define precisely when partial sharing may occur between two bichords Bx = ⟨Px, Lx⟩ and By = ⟨Py, Ly⟩. We divide the analysis into two cases.

Case 1. Px = Py or Lx = Ly. Without loss of generality, assume the latter.

• assoc-class=(Bx, By): Compare Unwinds (1)–(6). Although these bichords represent at least three fully distinct contexts and two different associations, they have identical association classes. Each p-chord (except (5)) has a common prefix beginning with p-note pi,a. In general, several other types of non-prefix sharing are possible (e.g., suffixes). However, prefix sharing naturally corresponds to tree structure, whereas non-prefix sharing effectively requires that a path diverge, skip one or more p-notes, and then re-merge.

We therefore formulate the prefix condition for partially sharing two bichords Bx and By:

((Px ⊏ Py) ∨ (Py ⊏ Px)) and Lx = Ly
Px = Py and ((Lx ⊏ Ly) ∨ (Ly ⊏ Lx))   (by symmetry)

where = and ⊏ (‘strict prefix’) are defined with respect to the sequence of notes that form a chord.

The one remaining issue is that Bx and By may have different associations; prefix sharing is not effective if associations must be duplicated. However, because we know the bichords’ association classes are identical, we know that if their associations differ, one association must be the ‘base case’ of the other. For example, Unwinds (1) and (2) have associations 1 ↔ 1 and M ↔ 1, respectively. We show below how to implement an implicit ‘base-case flag’ that preserves this information.

It turns out that the prefix condition can be relaxed slightly. Consider Unwinds (2) and (3), which may share p-note pi,a by the above condition. Observe that p′i,b represents a sample point while pi,b represents a call site. Although in general ip(p′i,b) ≠ ip(pi,b), a sample can be taken at a call site (technically, a return address), meaning that it is possible that ip(p′i,b) = ip(pi,b). We show below how to implement an implicit ‘sample-point flag’ that enables us to extend the prefix condition to allow sharing in this case. The flag indicates that the note both is and is not a sample point.

• assoc-class≠(Bx, By): An enumeration of the possibilities for By for each of the five possible associations for Bx shows that this case is impossible (by the assumption Lx = Ly).

Case 2. Px ≠ Py and Lx ≠ Ly.

• assoc-class=(Bx, By): Note that neither association may be in association class A ↔ 0; otherwise Lx = Ly. We now consider the two other association classes and focus, without loss of generality, on A ↔ 1. There are three cases. First, both bichords may have association 1 ↔ 1. Second, one bichord has association 1 ↔ 1 and the other M ↔ 1. Third, both bichords have association M ↔ 1. In the first case, no sharing is possible (since neither chord is equal). In the second and third cases, prefix sharing among p-notes may be possible. However, l-notes must be duplicated to maintain distinct logical calling contexts (cf. Unwinds (2) and (8)). Therefore, partial sharing is not profitable.

• assoc-class≠(Bx, By): Since the association classes are fully distinct, partial sharing is not possible without duplicating association information (cf. Unwinds (2) and (9)).

Implementation

We now translate the above conclusions into a practical implementation of the L-CCT.

We maintain the two-level distinction between bichords and notes implicitly. A bichord is represented by a list of X-structures. Each X contains an association (assoc) and a physical and logical instruction pointer (ip and lip, respectively). Given a bichord ⟨Px, Lx⟩, we need n Xs, X1, . . . , Xn, where n = max(|Px|, |Lx|) and where X1 represents the outermost portion of the bichord. Let the function note-id return the index of an X-structure within a bichord: note-id(Xj) = j.⁶ Note that ip(Xj) = NIL if |Px| < j ≤ n; similarly for lip(Xk).

Given this representation, a logical call path is simply a list of X-structures X1, . . . , Xn. A bichord begins at every Xi where note-id(Xi) = 1. An L-CCT is a tree of X-structures. Each X in the L-CCT may have a vector of metric values. A non-zero metric count naturally implements the ‘sample-point flag’ mentioned above. To implement the ‘base-case flag’, we simply ensure that when a 1 ↔ 1 bichord shares the root of, say, an M ↔ 1 bichord, the root X has association 1 ↔ 1. Thus, the bichords in Unwinds (1) and (2) would be represented as two Xs, . . . X1, X2 . . ., where assoc(X1) = 1 ↔ 1 and assoc(X2) = M ↔ 1; where X2 has a non-zero metric value; and where X1 is an interior node.

The final item is to describe an efficient way to insert a logical call path into the L-CCT that corresponds to the full and partial sharing of bichords described above. To ensure the L-CCT is rooted, we prefix a synthetic root node to the beginning of every call path, implying that every call path has a length of at least two. Inserting a path into the L-CCT therefore reduces to the following problem: given the call-path fragment m′ → n′ (as X-structures) and a node m in the L-CCT such that m′ = m, is there an n such that n is a child of m and sharable?(n, n′) holds? If the answer is yes, n may be shared and insertion proceeds to the children of n and n′. Otherwise, a new path for n′ is spliced into the tree.

To define sharable?, we first consider a physical calling context tree whose X-structures contain only a physical instruction pointer (ip). In this case we simply have:

sharable?(n, n′) : ip=(n, n′)

To extend this definition to an L-CCT, we observe that both ips and lips should be equal if bichords are equal or if one is a prefix of the other. To properly compute a prefix, bichords must be demarcated and aligned, which we can ensure by also testing note-id(). Consulting note-id() also forces path divergence after partial sharing. Finally, we need to ensure that sharing is permitted only when at least one of Px = Py and Lx = Ly holds. We can check this by additionally examining assoc-class. This results in the following simple test:

sharable?(n, n′) : ip=(n, n′) ∧ lip=(n, n′) ∧ assoc-class=(n, n′) ∧ note-id=(n, n′)

⁶ In the implementation, assoc and note-id may be combined into one bit-field, since the former needs only 3 bits; we use 8 and pre-compute association classes.
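Putting the pieces together, the X-structure representation, the sharable? test, and path insertion can be sketched in Python (a simplified model with invented field names, not the actual profiler implementation; a per-node metric count stands in for the ‘sample-point flag’, and pre-computed association classes are compared directly):

```python
# Illustrative model of an L-CCT built from X-structures.
from dataclasses import dataclass, field

@dataclass
class X:
    """One X-structure: one slot of a bichord in the L-CCT."""
    ip: object          # physical instruction pointer (None ~ NIL)
    lip: object         # logical instruction pointer (None ~ NIL)
    assoc_class: str    # pre-computed association class
    note_id: int        # 1-based index of this X within its bichord
    children: list = field(default_factory=list)
    metric: int = 0     # non-zero metric ~ the 'sample-point flag'

def sharable(n, n2):
    """sharable?(n, n'): ip=, lip=, assoc-class=, and note-id= all hold."""
    return (n.ip == n2.ip and n.lip == n2.lip and
            n.assoc_class == n2.assoc_class and
            n.note_id == n2.note_id)

def insert_path(root, path):
    """Insert a logical call path (a list of Xs, outermost first):
    share existing children while sharable? holds, splice otherwise."""
    node = root
    for x in path:
        shared = next((c for c in node.children if sharable(c, x)), None)
        if shared is None:
            # Divergence: splice a fresh node for the rest of the path.
            shared = X(x.ip, x.lip, x.assoc_class, x.note_id)
            node.children.append(shared)
        node = shared
    node.metric += 1  # record the sample at the innermost frame
    return node
```

Two paths that agree on an outer X (same ip, lip, association class, and note-id) share that node; they diverge at the first X where the test fails, which is exactly the prefix-style sharing described above.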