Introduction to Parallel Computing (CMSC416 / CMSC616)

Performance Modeling, Analysis, and Tools

Abhinav Bhatele, Alan Sussman
Weak versus strong scaling

• Strong scaling: Fixed total problem size as we run on more processes

• Sorting n numbers on 1 process, 2 processes, 4 processes, …

• Problem size per process decreases with increase in number of processes

• Weak scaling: Fixed problem size per process but increasing total problem size as
we run on more processes
• Sorting n numbers on 1 process

• 2n numbers on 2 processes

• 4n numbers on 4 processes

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 2

Amdahl’s law

• Speedup is limited by the serial portion of the code

• Often referred to as the serial “bottleneck”

• Let's say only a fraction f of the code can be parallelized on p processes

Speedup = 1 / ((1 − f) + f/p)
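
A minimal sketch of the formula above, to show how quickly the serial portion caps the speedup. The function name amdahl_speedup and the value f = 0.9 are assumptions chosen only for illustration.

```python
def amdahl_speedup(f: float, p: int) -> float:
    """Speedup predicted by Amdahl's law for parallel fraction f on p processes."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / ((1.0 - f) + f / p)


if __name__ == "__main__":
    f = 0.9  # assumed parallel fraction, for illustration only
    for p in (1, 2, 4, 16, 256, 4096):
        print(f"p = {p:5d}  speedup = {amdahl_speedup(f, p):6.2f}")
    # As p grows, the speedup approaches 1 / (1 - f).
    print(f"limit as p grows = {1.0 / (1.0 - f):.2f}")
```

With f = 0.9 the speedup can never exceed 10, no matter how many processes are used.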


Performance analysis

• Parallel performance of a program might not be what the developer expects

• How do we find performance bottlenecks?

• Performance analysis is the process of studying the performance of a code

• Identify why performance might be slow

• Serial performance

• Serial bottlenecks when running in parallel

• Communication overheads

Different performance analysis methods

• Analytical techniques: use algebraic formulae

• In terms of data size (n), number of processes (p)

• Time complexity analysis: big O notation

• Scalability analysis: Isoefficiency

• More detailed modeling of various operations such as communication

• Analytical models: LogP, alpha-beta model

• Empirical performance analysis using profiling tools

Parallel prefix sum

2 8 3 5 7 4 1 6

2 10 11 8 12 11 5 7

2 10 13 18 23 19 17 18

2 10 13 18 25 29 30 36
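
The four rows above are the input followed by the state after each of the log2(8) = 3 phases. A minimal serial simulation of that doubling scheme is sketched below; the function name parallel_prefix_sum is an assumption, and each list position stands in for one process.

```python
def parallel_prefix_sum(values):
    """Simulate the log2(n)-phase inclusive prefix sum, one value per process."""
    current = list(values)
    n = len(current)
    phases = [list(current)]  # phases[0] is the input row
    distance = 1
    while distance < n:
        # Every position reads from the previous phase, as if all processes
        # exchanged values at the same time.
        previous = current
        current = [
            previous[i] + previous[i - distance] if i >= distance else previous[i]
            for i in range(n)
        ]
        phases.append(list(current))
        distance *= 2
    return current, phases


if __name__ == "__main__":
    result, phases = parallel_prefix_sum([2, 8, 3, 5, 7, 4, 1, 6])
    for row in phases:
        print(" ".join(f"{x:2d}" for x in row))
```

In phase k, the process at position i adds the value held by position i − 2^(k−1), so the distance doubles each phase.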

Parallel prefix sum for n >> p

• Assign n/p elements (block) to each process

• Perform prefix sum on these blocks on each process locally

• Number of calculations per process: n/p

• Then do the parallel algorithm using the computed partial prefix sums

• Number of phases: log(p)

• Total number of calculations per process: log(p) × n/p

• Communication per process (one message containing one key/number): 1 × log(p)
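
The steps above can be sketched as a serial simulation in which each block plays the role of one process. This is a sketch under stated assumptions, not the course's reference implementation: the name block_prefix_sum is hypothetical, p is assumed to divide n evenly, and the "message" in each phase is the last element of a block.

```python
from itertools import accumulate


def block_prefix_sum(values, p):
    """Inclusive prefix sum of values using p simulated processes (n >> p)."""
    n = len(values)
    if p < 1 or n == 0 or n % p != 0:
        raise ValueError("this sketch assumes a non-empty input and that p divides n")
    size = n // p
    # Local step: each process scans its own block (about n/p additions).
    blocks = [list(accumulate(values[r * size:(r + 1) * size])) for r in range(p)]
    distance = 1
    phases = 0
    while distance < p:
        # The one number each process sends in this phase: its running total.
        sent = [block[-1] for block in blocks]
        for r in range(distance, p):
            received = sent[r - distance]
            # Each receiving process updates all n/p of its elements.
            blocks[r] = [x + received for x in blocks[r]]
        distance *= 2
        phases += 1
    return [x for block in blocks for x in block], phases


if __name__ == "__main__":
    result, phases = block_prefix_sum([2, 8, 3, 5, 7, 4, 1, 6], p=4)
    print(result, "phases:", phases)
```

Each process receives at most one number per phase, and the number of phases grows as log(p), which matches the counts listed above.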

Modeling communication: LogP model

• Used for modeling communication on the inter-node network

L: latency or delay

o: overhead (processor busy in communication)

g: gap (required between successive sends/receives); g is the inverse of bandwidth (1/g = bandwidth)

P: number of processors / processes

[Figure: P processor/memory modules attached to an interconnection network with limited capacity (L/g to or from a processor), annotated with o (overhead), g (gap), and L (latency).]

Figure 2. The LogP model describes an abstract machine configuration in terms of four performance parameters: L, the latency experienced in each communication event; o, the overhead experienced by the sending and receiving processors for each communication event; g, the gap between successive sends or successive receives by a processor; and P, the number of processor/memory modules.

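
A rough sketch of how the LogP parameters combine for point-to-point messages, under the usual reading of the model: one small message costs a send overhead, the network latency, and a receive overhead, and successive sends from one processor are spaced by at least the gap. The function names and the parameter values in the usage example are assumptions for illustration.

```python
def logp_message_time(L, o):
    """End-to-end time for one small message: send overhead + latency + receive overhead."""
    return o + L + o


def logp_stream_time(k, L, o, g):
    """Time until the last of k back-to-back messages has been received.

    Successive sends from one processor are spaced by at least max(g, o).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return (k - 1) * max(g, o) + logp_message_time(L, o)


if __name__ == "__main__":
    L, o, g = 6, 2, 4  # hypothetical values in arbitrary time units
    print("one message:", logp_message_time(L, o))
    print("three messages:", logp_stream_time(3, L, o, g))
```
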
alpha + n * beta model
• Another model for communication

Tcomm = α + n × β

α: latency

n: size of message

1/β: bandwidth
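
A minimal sketch of the model above. The values of α and β are assumptions chosen for illustration (1 microsecond latency, 10 GB/s bandwidth), not measurements of any real network.

```python
def comm_time(n_bytes, alpha, beta):
    """Predicted time to send one message of n_bytes: alpha + n * beta."""
    return alpha + n_bytes * beta


def crossover_size(alpha, beta):
    """Message size at which the latency term equals the bandwidth term."""
    return alpha / beta


if __name__ == "__main__":
    alpha = 1e-6          # assumed latency: 1 microsecond
    beta = 1.0 / 10e9     # assumed bandwidth of 10 GB/s, so beta = 1e-10 s/byte
    for n in (8, 1024, 1024**2, 1024**3):
        print(f"{n:>12d} bytes: {comm_time(n, alpha, beta):.3e} s")
    print(f"latency and bandwidth terms are equal at {crossover_size(alpha, beta):.0f} bytes")
```

Messages much smaller than the crossover size are dominated by α, and messages much larger are dominated by n × β.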


Isoefficiency

• Relationship between problem size and number of processes to maintain a certain level of efficiency

• At what rate should we increase problem size with respect to number of processes to keep efficiency constant (iso-efficiency)

Speedup and efficiency

• Speedup: Ratio of execution time on one process to that on p processes

Speedup = t1 / tp

• Efficiency: Speedup per process

Efficiency = t1 / (tp × p)

Efficiency in terms of overhead
• Total time spent in all processes = (useful) computation + overhead (extra
computation + communication + idle time + other overheads)

p × tp = t1 + to

Efficiency = t1 / (tp × p) = t1 / (t1 + to) = 1 / (1 + to / t1)

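
A small sketch that computes speedup, efficiency, and total overhead from measured run times, and checks that the two expressions for efficiency agree. The timings are hypothetical and the function names are assumptions.

```python
def speedup(t1, tp):
    """Ratio of the one-process time to the p-process time."""
    return t1 / tp


def efficiency(t1, tp, p):
    """Speedup per process."""
    return t1 / (tp * p)


def total_overhead(t1, tp, p):
    """Overhead t_o, from p * t_p = t_1 + t_o."""
    return p * tp - t1


if __name__ == "__main__":
    t1 = 120.0                             # hypothetical one-process time (s)
    timings = {2: 63.0, 4: 34.0, 8: 20.0}  # hypothetical t_p for each p
    for p, tp in timings.items():
        to = total_overhead(t1, tp, p)
        e = efficiency(t1, tp, p)
        # t1 / (p * tp) and 1 / (1 + to / t1) are the same quantity.
        assert abs(e - 1.0 / (1.0 + to / t1)) < 1e-12
        print(f"p={p}: speedup={speedup(t1, tp):.2f} efficiency={e:.2f} overhead={to:.1f} s")
```
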
Isoefficiency function

Efficiency = 1 / (1 + to / t1)

• Efficiency is constant if to / t1 is constant (K)

to = K × t1
Isoefficiency analysis

[Figure: a √n × √n grid of n elements, split among p processes either into strips of size √n × √n/p (1D) or into blocks of size √n/√p × √n/√p (2D).]

We only consider communication for to

• 1D decomposition:

• Computation: √n × √n/p = n/p

• Communication: 2 × √n

• to / t1 = (2 × √n) / (n/p) = 2 × p / √n

• 2D decomposition:

• Computation: √n/√p × √n/√p = n/p

• Communication: 4 × √n/√p

• to / t1 = (4 × √n/√p) / (n/p) = 4 × √p / √n

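
A sketch of what these ratios imply, taking to / t1 = 2 × p / √n for the 1D decomposition and 4 × √p / √n for the 2D decomposition. Solving each for n at a fixed ratio K gives the problem size needed to hold efficiency constant. The function names and the value of K are assumptions for illustration.

```python
from math import sqrt


def overhead_ratio_1d(n, p):
    """t_o / t_1 for the 1D decomposition: 2 * p / sqrt(n)."""
    return 2 * p / sqrt(n)


def overhead_ratio_2d(n, p):
    """t_o / t_1 for the 2D decomposition: 4 * sqrt(p) / sqrt(n)."""
    return 4 * sqrt(p) / sqrt(n)


def n_for_constant_ratio_1d(p, K):
    """Solve 2 * p / sqrt(n) = K for n, giving n = (2 * p / K) ** 2."""
    return (2 * p / K) ** 2


def n_for_constant_ratio_2d(p, K):
    """Solve 4 * sqrt(p) / sqrt(n) = K for n, giving n = 16 * p / K ** 2."""
    return 16 * p / K ** 2


if __name__ == "__main__":
    K = 0.25  # assumed target ratio; efficiency = 1 / (1 + K) = 0.8
    for p in (4, 16, 64, 256):
        print(f"p={p:4d}  n (1D) = {n_for_constant_ratio_1d(p, K):12.0f}"
              f"  n (2D) = {n_for_constant_ratio_2d(p, K):10.0f}")
```

Under these formulas the 1D decomposition needs n to grow like p² while the 2D decomposition needs n to grow only like p, so the 2D decomposition scales better.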

Announcements

• Assignment 2 is due next Wednesday on October 9 11:59 pm ET

• Grades for assignment 1 have been posted on gradescope and ELMS

• Quiz 1 has been posted, due on Oct 2 at noon ET


Empirical performance analysis

• Two parts to doing empirical performance analysis

• measurement: gather/collect performance data from a program execution

• analysis/visualization: analyze the measurements to identify performance issues

• Simplest tool: adding timers in the code manually and using print statements

Using timers
double start, end;
double phase1, phase2, phase3;

/* MPI_Wtime() returns the wall-clock time in seconds. */
start = MPI_Wtime();
/* ... phase 1 code ... */
end = MPI_Wtime();
phase1 = end - start;

start = MPI_Wtime();
/* ... phase 2 code ... */
end = MPI_Wtime();
phase2 = end - start;

start = MPI_Wtime();
/* ... phase 3 code ... */
end = MPI_Wtime();
phase3 = end - start;

Example output:
Phase 1 took 2.45 s
Phase 2 took 11.79 s
Phase 3 took 4.37 s
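
The same manual-timer pattern can be sketched in Python. This is an illustration only: time.perf_counter stands in for MPI_Wtime (in an mpi4py program, MPI.Wtime() plays the same role), the helper name timed is hypothetical, and the two "phases" are placeholder workloads.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


if __name__ == "__main__":
    timings = {}
    with timed("Phase 1", timings):
        sum(i * i for i in range(100_000))      # placeholder for real work
    with timed("Phase 2", timings):
        sorted(range(100_000), reverse=True)    # placeholder for real work
    for label, seconds in timings.items():
        print(f"{label} took {seconds:.4f} s")
```

Wrapping the timer in a context manager keeps the start/end bookkeeping out of the code being measured.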


Performance tools

• Tracing tools
• Capture entire execution trace, typically via instrumentation

• Profiling tools

• Provide aggregated information

• Typically use statistical sampling

• Many tools can do both

Metrics recorded

• Counts of function invocations

• Time spent in each function/code region

• Number of bytes sent (in case of MPI messages)

• Hardware counters such as floating point operations, cache misses, etc.

• To fix performance problems, we need to connect metrics to source code

Tracing tools
• Record all the events in the program with enter/leave timestamps

• Events: user functions, MPI and other library routines, etc.

Timeline visualization of a 2-process execution trace


Examples of tracing tools

• VampirTrace

• Score-P

• TAU

• Projections

• HPCToolkit


Profiling tools
• Ignore the specific times at which events occurred

• Provide aggregate information about time spent in different functions/code regions

• Examples:

• gprof, perf

• mpiP

• HPCToolkit, Caliper

• Python tools: cProfile, pyinstrument, scalene

[Figure: gprof data in hpctView]

Calling contexts, trees, and graphs

• Calling context or call path: Sequence of function invocations leading to the current sample (statement in code)

• Calling context tree (CCT): dynamic prefix tree of all call paths in an execution

• Call graph: obtained by merging nodes in a CCT with the same name into a single node but keeping caller-callee relationships as edges

[Figure: example graph with nodes main, physics, solvers, mpi, hypre, and psm2 (Figure 3 of the Hatchet paper).]

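
The three definitions above can be sketched with plain dictionaries: call paths are inserted into a prefix tree to form the CCT, and same-named CCT nodes are then merged to form the call graph. The function names and the example call paths are assumptions for illustration (the node names are borrowed from the figure, but the edges are made up).

```python
from collections import defaultdict


def build_cct(call_paths):
    """Build a calling context tree as nested dicts: {name: {child: {...}}}."""
    root = {}
    for path in call_paths:
        node = root
        for name in path:
            node = node.setdefault(name, {})
    return root


def count_nodes(cct):
    """Number of nodes in the CCT (one per distinct calling context)."""
    return sum(1 + count_nodes(children) for children in cct.values())


def cct_to_call_graph(cct):
    """Merge same-named CCT nodes; return {caller: set of callees}."""
    edges = defaultdict(set)

    def visit(name, children):
        edges[name]  # touch the key so leaf functions also appear as nodes
        for child, grandchildren in children.items():
            edges[name].add(child)
            visit(child, grandchildren)

    for name, children in cct.items():
        visit(name, children)
    return dict(edges)


if __name__ == "__main__":
    paths = [  # hypothetical sampled call paths
        ("main", "physics", "mpi", "psm2"),
        ("main", "solvers", "hypre", "mpi", "psm2"),
        ("main", "solvers", "mpi", "psm2"),
    ]
    cct = build_cct(paths)
    graph = cct_to_call_graph(cct)
    print("CCT nodes:", count_nodes(cct))
    print("call graph nodes:", len(graph))
    for caller, callees in graph.items():
        print(f"  {caller} -> {sorted(callees)}")
```

Here mpi appears three times in the CCT (once per calling context) but only once in the call graph.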
Calling context trees, call graphs, …

[Figure: calling context tree (CCT), nodes listed row by row as drawn]
foo
bar qux waldo
baz grault quux fred garply
corge plugh xyzzy
bar grault garply thud
baz grault baz garply

[Figure: call graph, nodes listed row by row as drawn]
foo
qux waldo
quux fred
corge plugh xyzzy
bar thud
grault baz garply

Contextual information: File, Line number, Function name, Callpath, Load module, Process ID, Thread ID

Performance metrics: Time, Flops, Cache misses


Output of profiling tools

• Flat profile: Listing of all invoked functions with counts and execution times

• Call graph profile: unique node per function

• Calling context tree: unique node per calling context

[Figures: excerpts from the gprof paper describing the flat profile and the call graph profile; an example call graph and calling context tree from "Hatchet: Pruning the Overgrowth in Parallel Profiles"; a FlameGraph.]

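
A minimal sketch of how a flat profile relates to per-context data: the time recorded for each calling context is summed by function name, and the functions are listed in decreasing order of time. The function name flat_profile and the timings are assumptions for illustration, and invocation counts are left out for brevity.

```python
from collections import defaultdict


def flat_profile(exclusive_time_by_path):
    """Sum exclusive time per function name across all calling contexts.

    exclusive_time_by_path maps a call path (tuple of function names) to the
    time spent in the last function on that path, excluding its callees.
    """
    totals = defaultdict(float)
    for path, seconds in exclusive_time_by_path.items():
        totals[path[-1]] += seconds
    # Decreasing order of execution time, as in a typical flat profile listing.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    samples = {  # hypothetical exclusive times in seconds
        ("main",): 1.0,
        ("main", "solve"): 2.0,
        ("main", "solve", "mpi"): 3.0,
        ("main", "physics"): 0.5,
        ("main", "physics", "mpi"): 1.5,
    }
    for name, seconds in flat_profile(samples):
        print(f"{name:10s} {seconds:6.2f} s")
```

The two mpi contexts collapse into a single line in the flat profile, which is exactly the information a calling context tree keeps separate.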
Hatchet: performance analysis tool

• Hatchet enables programmatic analysis of parallel profiles

• Leverages pandas, which supports multi-dimensional tabular datasets

• Creates a structured index to enable indexing pandas dataframes by nodes in a graph

• A set of operators to filter, prune and/or aggregate structured data

https://hatchet.readthedocs.io/en/latest/

Pandas and dataframes
• Pandas is an open-source Python library for data analysis

• Dataframe: two-dimensional tabular data structure

• Supports many operations borrowed from SQL databases

• MultiIndex enables working with high-dimensional data in a 2D data structure

[Figure: a dataframe with its index, columns, and rows labeled.]


Main data structure in hatchet: a GraphFrame
• Consists of a structured index graph object and a pandas dataframe

• Graph stores caller-callee relationships

• Dataframe stores all numerical and categorical data for each node in the graph

• In case of multiple processes/threads, there is a row per node per process per thread

[Figure 3 of the Hatchet paper: the graphframe consists of a graph object and a dataframe object.]
Dataframe operation: filter


(Excerpt from the Hatchet paper, Section 4.1, Dataframe Operations)

filter: Filter takes a user-supplied function and applies that to all rows in the dataframe. The resulting series or dataframe is used to filter the dataframe to only return rows that are true. The returned graphframe preserves the original graph provided as input to the filter operation. Figure 4 shows a dataframe before and after a filter operation. In this case, the applied function returns all rows where time is greater than 10.0.

    gf = GraphFrame( ... )
    filtered_gf = gf.filter(lambda x: x["time"] > 10.0)

Figure 4: The dataframe before (left) and after (right) a filter operation on the time column.

Graph operation: squash

(Excerpt from the Hatchet paper, Section 4.2, Graph Operations)

squash: The squash operation is typically performed by the user after a filter operation on the dataframe. As shown in Figure 5, squash removes nodes from the graph that were previously removed from the dataframe due to a filter operation. When one or more nodes on a path are removed from the graph, the nearest alive ancestor is connected by an edge to the nearest alive child on the path. All call paths in the graph are re-wired in this manner. After a squash operation, the graph and dataframe become consistent again. Additionally, a squash operation will make the values in all columns containing inclusive metrics inaccurate, since the parent-child relationships have changed. Hence, the squash operation also calls update_inclusive_columns to make all inclusive columns in the dataframe accurate again.

    filtered_gf = gf.filter(lambda x: x["time"] > 10.0)
    squashed_gf = filtered_gf.squash()

Figure 5: The graph before (left) and after (right) a squash operation on the graphframe.

update_inclusive_columns: When a graph is rewired (i.e., the parent-child connections are modified), all the columns in the dataframe that store inclusive values of some metric become inaccurate. This function performs a post-order traversal of the graph to update all inclusive metrics stored in the dataframe for each node.

[Figure: example graph with nodes main, physics, solvers, mpi, hypre, and psm2.]

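
The rewiring described above can be sketched on a small tree in plain Python. This illustrates the idea only and is not Hatchet's implementation: the Node class, the squash and update_inclusive functions, and the example times are all hypothetical, and the inclusive time is recomputed here simply as a node's own time plus the inclusive time of its surviving children.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    time: float                                   # exclusive time of this node
    children: list = field(default_factory=list)
    inclusive: float = 0.0                        # set by update_inclusive


def squash(node, keep):
    """Return the surviving subtrees that take the place of `node`.

    A node that fails `keep` is dropped, and its surviving descendants are
    handed up to the nearest surviving ancestor.
    """
    survivors = []
    for child in node.children:
        survivors.extend(squash(child, keep))
    if keep(node):
        return [Node(node.name, node.time, survivors)]
    return survivors


def update_inclusive(node):
    """Post-order traversal: inclusive = own time + children's inclusive time."""
    node.inclusive = node.time + sum(update_inclusive(c) for c in node.children)
    return node.inclusive


if __name__ == "__main__":
    tree = Node("main", 30.0, [
        Node("physics", 12.0, [Node("mpi", 3.0, [Node("psm2", 15.0)])]),
        Node("solvers", 2.0, [
            Node("hypre", 20.0),
            Node("mpi", 4.0, [Node("psm2", 11.0)]),
        ]),
    ])
    (root,) = squash(tree, lambda n: n.time > 10.0)  # same predicate as the filter
    update_inclusive(root)
    print(root.name, [c.name for c in root.children], root.inclusive)
```

In this example solvers and both mpi nodes fail the predicate, so hypre and the second psm2 are re-attached directly to main, and the first psm2 is re-attached to physics.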
parent-child connections are modi�ed),graphframe. all the columns in the from the dataframe due to a �lter operation. When one or more
operations that lead to changes in the graph structure return a new ancestor is connected by an edge to the nearest alive child on the
dataframe that store inclusive values of some metric become inaccu- nodes on a path are removed from the graph, the nearest alive
graphframe. path. All call paths in the graph are re-wired in this manner. After
4.1 Dataframe
rate. This function performs a post-order Operations
traversal of the graph to ancestor is connected by an edge to the nearest alive child on the
a squash operation, the graph and dataframe become consistent
update all inclusive metrics stored in the dataframe for each node. path. All call paths in the graph are re-wired in this manner. After
ovided by
ulated in
hough all
Graph operation: squash
4.1 Dataframe Operations
Filter takes a user-supplied
�lter: Filter takes a user-supplied function
rows in the dataframe. The resulting series
function and applies
4.2 Graph Operations �lter the dataframe to only return rows that
�lter: that to all
and applies
a squash
or
that the
operation,
dataframe is
to all
used
again. Additionally, a squash operation will make the values in all
graph and dataframe become consistent
to
columns containing inclusive metrics inaccurate, since the parent-
again. Additionally, a squash operation will make the values in all
are true. The returned
child relationships have changed. Hence, the squash operation also
columns containing inclusive metrics inaccurate, since the parent-
rows in the dataframe. The resulting series or dataframe is used to calls update_inclusive_columns to make all inclusive columns
me, some graphframe preserves the original graphchildprovided as inputhave
relationships to thechanged. Hence, the squash operation also
�lter The squash
the dataframe
squash: to operation
only returnisrows
typically performed
that are true. The byreturned
the user in the dataframe accurate again.
nd others �lter operation. Figure 4 shows a dataframe before and after a �lter
calls update_inclusive_columns to make all inclusive columns
after a �lterpreserves
graphframe operation theonoriginal
the dataframe. As shown
graph provided in Figure
as input to the5,
following operation. In this case, the applied function returns
in the all rows
dataframe whereagain.
accurate
squash
�lter removesFigure
operation. nodes 4from
showstheagraph that were
dataframe previously
before and afterremoved
a �lter
le, so any time is greater than 10.0.
from the dataframe due to a �lter operation.
operation. In this case, the applied function When
returns all one
rowsorwhere
more
urn a new
nodes
time on a path
is greater thanare removed from the graph, the nearest alive
10.0.
ancestor is connected by an edge to the nearest alive child on the
path. All call paths in the graph are re-wired in this manner. After
a squash operation, the graph and dataframe become consistent
again. Additionally, a squash operation will make the values in all
that to all
columns containing inclusive metrics inaccurate, since the parent-
is used to
child relationships have changed. Hence, the squash operation also
returned
calls update_inclusive_columns to make all inclusive columns
put to the
in the dataframe accurate again.
ter a �lter
ws where
gf = GraphFrame ( ... )
1 1 filtered_gf = gf . filter ( lambda x : x [ time ] > 10.0)
2 filtered_gf = gf . filter ( lambda x : x [ time ] > 10.0) 2 squashed_gf = filtered_gf . squash ()
gf = GraphFrame ( ... ) 1 filtered_gf
main = gf . filter ( lambda x : x [ time ] > 10.0)
filter
1
2 filtered_gf main
= gf . filter ( lambda x: x[ time ] > 10.0) 2 squashed_gf = filtered_gf . squash ()

Figure 4: The dataframe before (left) and after (right) a �lter

physics solvers
Figure 5: The graph before (left) and after (right) a squash
operation on the time column. operation on the graphframe.
Figure 4: The dataframe before (left) and after (right) a �lter Figure 5: The graph before (left) and after (right) a squash
physics solvers

operation on the time column. operation

mpi hypre
on the
mpi
graphframe.
mpi hypre mpi

psm2 psm2
psm2 psm2

1 filtered_gf = gf . filter ( lambda x: x[ time ] > Abhinav

Figure 4: The dataframe before (left) and after (right) a �lter

operation on the time column. operation

mpi hypre
on the
mpi
graphframe.
mpi hypre mpi

psm2 psm2
psm2 psm2

1 filtered_gf = gf . filter ( lambda x: x[ time ] > Abhinav

10.0) Bhatele, Alan Sussman (CMSC416 / CMSC616) 30
update_inclusive_columns: When a graph is rewired (i.e., the parent-child connections are modified), all the columns in the dataframe that store inclusive values of some metric become inaccurate. This function performs a post-order traversal of the graph to update all inclusive metrics stored in the dataframe for each node.
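The post-order update described above can be sketched on a toy call tree. This is only an illustration, not Hatchet's implementation: the tree, the exclusive times, and the helper name `update_inclusive` are all made up for the example.

```python
# Toy call tree (hypothetical): node -> list of children.
children = {
    "main": ["physics", "solvers"],
    "physics": ["mpi"],
    "solvers": [],
    "mpi": [],
}
# Exclusive time spent in each node itself (made-up numbers).
exclusive = {"main": 1.0, "physics": 4.0, "solvers": 3.0, "mpi": 2.0}


def update_inclusive(root, children, exclusive):
    """Return the inclusive time per node using a post-order traversal."""
    inclusive = {}

    def visit(node):
        # Children are finished first, so their totals are ready for the parent.
        total = exclusive[node]
        for child in children[node]:
            total += visit(child)
        inclusive[node] = total
        return total

    visit(root)
    return inclusive


inclusive = update_inclusive("main", children, exclusive)
print(inclusive)
```

Because every child is visited before its parent is finalized, each inclusive value is computed exactly once, which is why a post-order traversal is a natural fit after the graph has been rewired.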

Graph operation: squash

4 THE HATCHET API
We now describe some of the important operators provided by the Hatchet API allowing structured data to be manipulated in different ways: filtered, aggregated, pruned, etc. Even though all of the operations are performed on the graphframe, some only modify the dataframe, some only modify the graph, and others modify both. They are categorized accordingly in the following sections. Note that we consider a graph to be immutable, so any operations that lead to changes in the graph structure return a new graphframe.

4.1 Dataframe Operations
filter: Filter takes a user-supplied function and applies that to all rows in the dataframe. The resulting series or dataframe is used to filter the dataframe to only return rows that are true. The returned graphframe preserves the original graph provided as input to the filter operation. Figure 4 shows a dataframe before and after a filter operation. In this case, the applied function returns all rows where time is greater than 10.0.

4.2 Graph Operations
squash: The squash operation is typically performed by the user following a filter operation on the dataframe. As shown in Figure 5, squash removes nodes from the graph that were previously removed from the dataframe due to a filter operation. When one or more nodes on a path are removed from the graph, the nearest alive ancestor is connected by an edge to the nearest alive child on the path. All call paths in the graph are re-wired in this manner. After a squash operation, the graph and dataframe become consistent again. Additionally, a squash operation will make the values in all columns containing inclusive metrics inaccurate, since the parent-child relationships have changed. Hence, the squash operation also calls update_inclusive_columns to make all inclusive columns in the dataframe accurate again.
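To make the filter-then-squash behavior concrete, here is a minimal sketch on a toy call tree. It is not Hatchet's code: the node times and the helper `squash` are hypothetical, and the sketch assumes the root node survives the filter.

```python
# Toy call tree (hypothetical): node -> children, plus a time per node.
children = {
    "main": ["physics", "solvers"],
    "physics": ["mpi"],
    "mpi": ["psm2"],
    "solvers": ["hypre"],
    "hypre": [],
    "psm2": [],
}
time = {"main": 50.0, "physics": 5.0, "mpi": 2.0, "psm2": 20.0,
        "solvers": 30.0, "hypre": 1.0}

# filter: keep only the rows (nodes) whose time exceeds 10.0.
alive = {node for node, t in time.items() if t > 10.0}


def squash(root, children, alive):
    """Rewire the tree so each alive node hangs off its nearest alive ancestor."""
    new_children = {node: [] for node in alive}

    def visit(node, nearest_alive_ancestor):
        if node in alive:
            if nearest_alive_ancestor is not None:
                new_children[nearest_alive_ancestor].append(node)
            nearest_alive_ancestor = node
        # Removed nodes are skipped, but their subtrees are still searched.
        for child in children[node]:
            visit(child, nearest_alive_ancestor)

    visit(root, None)
    return new_children


squashed = squash("main", children, alive)
print(squashed)
```

In this toy case physics and mpi are filtered out, so psm2 is reattached directly under main, which mirrors the "nearest alive ancestor to nearest alive child" rule in the text.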
Figure 4: The dataframe before (left) and after (right) a �lter Figure 5: The graph before (left) and after (right) a squash
Graphframe operation: subtract

[Figure: one call graph (nodes main, physics, solvers, mpi, hypre, psm2) minus a second call graph equals the resulting call graph. Credit: Bhatele et al.]

https://hatchet.readthedocs.io
gf1 = GraphFrame( ... )
gf2 = GraphFrame( ... )

gf2 -= gf1

Figure 6: Subtraction operation on two graphframes (resulting graph at the bottom).

Visualizing small graphs

[Figure: visualizations of a small example call graph (nodes foo, bar, baz, qux, quux, corge, grault, garply, waldo, fred, plugh, xyzzy, thud), including a FlameGraph]
two graphs are not identical, union (described above) is applied first to create a unified graph before performing the sum. The add operation returns a new resulting graphframe or modifies one of the graphframes in place in the case of the addition assignment (a += b).

subtract: The subtract operation is similar to the add operation in that it requires the two graphs to be identical. Once the graphs are structurally equivalent, the subtract operation computes the difference between the two dataframes column-wise. The subtract operation returns a new resulting graphframe or modifies one of the graphframes in place in the case of the subtraction assignment (a -= b). Figure 6 shows the subtraction of one graphframe from another and the graph for the resulting graphframe.
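The column-wise difference can be sketched with plain dictionaries keyed by call path. This is an illustration only: the two runs and the helper `subtract` are hypothetical, and treating a node that is missing from one run as 0.0 is an assumption of this sketch that stands in for the union step.

```python
def subtract(metrics_a, metrics_b):
    """Column-wise difference a - b, keyed by call path.

    Nodes missing from one side are treated as 0.0, mimicking a union of the
    two graphs before the subtraction (an assumption of this sketch).
    """
    result = {}
    for path in sorted(set(metrics_a) | set(metrics_b)):
        row_a = metrics_a.get(path, {})
        row_b = metrics_b.get(path, {})
        columns = sorted(set(row_a) | set(row_b))
        result[path] = {c: row_a.get(c, 0.0) - row_b.get(c, 0.0) for c in columns}
    return result


# Hypothetical per-node metrics from two runs of the same program.
run1 = {("main",): {"time": 10.0}, ("main", "solvers"): {"time": 6.0}}
run2 = {("main",): {"time": 12.0}, ("main", "solvers"): {"time": 9.0},
        ("main", "mpi"): {"time": 1.5}}

diff = subtract(run2, run1)
print(diff)
```

Positive entries in `diff` mark call paths where the second run spent more time than the first, which is the usual starting point when comparing two executions.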
4.4 Visualizing Output
Hatchet provides its own visualization as well as support for two other visualizations of the structured data stored in the graph object. The native visualization in Hatchet is a string that can be printed to the terminal to display the graph. Hatchet can also output the graph in the DOT format or a folded stack used by flame graph [8]. The dot utility in Graphviz produces a hierarchical drawing of directed graphs, particularly useful for showing the direction of the edges. Flame graphs are useful for quickly identifying the performance bottleneck, that is the box with the largest width. The y-axis of the flame graph represents the call stack depth. Figure 7 shows the same Hatchet graph presented in the three supported visualizations: terminal, DOT, and flame graph. For particularly large

Figure 7: Visualization outputs supported in Hatchet include terminal output (left), DOT (right), and flame graph (bottom).

5 PERFORMANCE
It is vital that performance analysis tools have low overheads and that they enable quick analysis of performance datasets without the
Starter code for reading data

import sys

import hatchet as ht

if __name__ == '__main__':
    file_name = sys.argv[1]
    # Replace this with another reader depending on data source
    gf = ht.GraphFrame.from_caliper(file_name)

    print(gf.tree())
    print(gf.dataframe)

Example 1: Generating a flat profile
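A flat profile collapses the call tree by function name and sums the time over every place a function appears. The sketch below shows the idea with the standard library only; the records and the helper `flat_profile` are hypothetical and are not taken from the slide. With a pandas dataframe, the same aggregation would typically be a groupby on the function-name column followed by a sort.

```python
from collections import defaultdict

# Hypothetical (call path, exclusive time) records from a profile.
records = [
    (("main",), 1.0),
    (("main", "physics"), 4.0),
    (("main", "physics", "mpi"), 2.5),
    (("main", "solvers"), 3.0),
    (("main", "solvers", "mpi"), 2.0),
]


def flat_profile(records):
    """Sum the time per function name, ignoring where it was called from."""
    totals = defaultdict(float)
    for path, t in records:
        totals[path[-1]] += t  # the last entry of the path is the function itself
    # Most expensive functions first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


profile = flat_profile(records)
for name, t in profile:
    print(name, t)
```

Here mpi appears under two different callers, so its two entries are merged into a single row of the flat profile.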


Example 2: Comparing two executions


Example 3: Speedup and efficiency

datasets = glob.glob(list_of_tortuga_profiles)
gfs = hatchet.GraphFrame.construct_from(datasets)

df = hatchet.Chopper.speedup_efficiency(gfs, strong=True, efficiency=True)
df = df.loc[df[1024] < 0.7]
df.T.loc[:, :].plot.bar()

Figure 13: Demonstration of scalability analysis by using multiple executions. We plot efficiency of the four least efficient nodes discovered in Tortuga strong scaling executions (64, 128, 256, 512, 1024 processes). The 64-process execution is used as the baseline. The vertical labels on each bar correspond to the absolute time spent in the functions.

can automatically unify the profile outputs and calculate efficiency. It also enables easy plotting of the results via Python libraries.

After getting these efficiency results, we decided to focus on the writeSingleField function because it is one of the functions that has significantly decreasing efficiency. We further annotated this function to identify the code block that causes this scalability issue. We identified the MPI_File_write_all function as a cause of this problem. It is a collective and blocking function that uses all the processes in the program to write to a file. We decide to optimize the code by replacing the MPI_File_write_all call with the non-

query = ["*", {"name": "MPI_File_write_all"}]
filtered_test = graphframe.filter(query)
print(filtered_test.tree())

Figure 12: Call paths of the problematic portions of the program be
in writeSingleField reduced from 7.033 to 2.088.

8 CONCLUSION
In this study, we proposed Chopper, a Python-based API for performance analysis, which provides programmatic analysis capabilities that simplify the performance analysis of single and multiple executions of parallel applications. We decided to build it on top of Hatchet to leverage Hatchet's programmatic interface and visualization capabilities. We designed Chopper as an easy-to-use and high-level API to avoid having a steep learning curve for new users. In this paper, we used several case studies to demonstrate how Chopper enables performing common but laborious analysis tasks by writing only a few lines of Python code. Specifically, we presented how Chopper simplifies analysis tasks for single and multiple executions such as detecting load imbalance, finding hot paths, identifying scaling bottlenecks, finding correlation between metrics and CCT nodes, and causes of performance variation. We also demonstrated some useful functionalities such as reading multiple profile data sets at once and unifying multiple GraphFrames. We identified potential performance problems in the Tortuga and Quicksilver applications. Additionally, we identified the causes of performance variability in AMG and Laghos runs. The effective capabilities that Chopper provides make performance analysis tasks easier to perform with less programming effort.

In the future, we plan to improve correlation analysis by adding predictive modeling capabilities to facilitate performance analysis. To further simplify the analyses and reduce the effort required, we plan to support customizable plotting capabilities. We also plan to develop performance analysis techniques for GPU programs.
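For the strong scaling analysis in Example 3, speedup and efficiency relative to the smallest run can be computed as in the sketch below. The run times are invented for illustration, the helper `strong_scaling` is hypothetical, and this is the textbook definition rather than a description of how Chopper computes its numbers.

```python
# Hypothetical run times (seconds) of one function at each process count.
times = {64: 100.0, 128: 55.0, 256: 32.0, 512: 20.0, 1024: 16.0}


def strong_scaling(times):
    """Speedup and efficiency relative to the smallest process count."""
    base_p = min(times)
    base_t = times[base_p]
    rows = {}
    for p in sorted(times):
        speedup = base_t / times[p]   # how much faster than the baseline
        ideal = p / base_p            # perfect speedup for this process count
        rows[p] = {"speedup": speedup, "efficiency": speedup / ideal}
    return rows


table = strong_scaling(times)
for p, row in table.items():
    print(p, round(row["speedup"], 2), round(row["efficiency"], 2))
```

With these made-up numbers the efficiency at 1024 processes falls well below the 0.7 threshold used in the listing, so this function would be among those selected for plotting.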
Example 4: Load imbalance

Simplifying the Analysis of Parallel Profiles Using Chopper (Paper Type: Regular)

graphframe = hatchet.GraphFrame.from_hpctoolkit(qs_profile_128)

graphframe_imbalance = graphframe.load_imbalance(verbose=True)
# sort the top 50 nodes that have the highest mean value by imbalance
df_imb = graphframe_imbalance.dataframe.head(50).sort_values("time.imbalance", ascending=False)
print(df_imb.head(4))  # Dataframe Output (a)

(a) Quicksilver Load Imbalance DataFrame Output
(b) Load Imbalance Histogram of MacroscopicCrossSection.cc:22

Figure 9: Demonstration of load imbalance analysis and the results of the case study. The most imbalance is caused by MacroscopicCrossSection:22. Chopper's load_imbalance function provides detailed statistics about the imbalance (a) that can be easily plotted by using Python libraries (b). We use Quicksilver execution on 128 processes.
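One common way to quantify load imbalance for a call-tree node is the ratio of the maximum to the mean time across processes. The sketch below assumes that definition; the node names, the per-process times, and the helper `imbalance_ratio` are hypothetical and are not taken from the Quicksilver case study.

```python
from statistics import mean

# Hypothetical time per process (4 processes) for two call-tree nodes.
per_process_time = {
    "kernel_a": [4.0, 4.0, 4.0, 12.0],
    "kernel_b": [10.0, 10.0, 10.0, 10.0],
}


def imbalance_ratio(per_process_time):
    """Return (node, max / mean) pairs, most imbalanced first."""
    result = {}
    for node, values in per_process_time.items():
        result[node] = max(values) / mean(values)
    return sorted(result.items(), key=lambda item: item[1], reverse=True)


imbalance = imbalance_ratio(per_process_time)
for node, ratio in imbalance:
    print(node, ratio)
```

A ratio of 1.0 means every process spent the same time in the node; larger values mean the slowest process is holding the others back.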

case study demonstrates that users can easily examine the correlation between different performance metrics and investigate outliers or potential issues by performing analyses at CCT node level. Chopper also enables users to easily plot the results via Python libraries.

7.2 Comparing Multiple Executions
More advanced analysis tasks, such as studying scalability and variability, require analyzing multiple executions of the same program with different parameters. In this case, the user needs to analyze

the communication libraries (such as libpsm2.s and libmpi.so), which is expected due to network congestion mentioned in the previous paper [citation removed for double-blind review]

The Chopper API enables the analysis of multiple executions using a single function call and presents the results in an easy-to-plot format. This is a tedious and fraught task without programmatic analysis capabilities as it requires comparing performance nodes from multiple runs simultaneously. Furthermore, to the best of our knowledge, this is the first study that uses CCT data to identify performance variability.
