05 Cmsc416 Perf Analysis
• Weak scaling: Fixed problem size per process but increasing total problem size as
we run on more processes
• Sorting n numbers on 1 process
• 2n numbers on 2 processes
• 4n numbers on 4 processes
$$\text{Speedup} = \frac{1}{(1 - f) + f/p}$$
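The speedup formula above (Amdahl's law) can be evaluated numerically. This is a minimal sketch that assumes f is the fraction of the work that can be parallelized and p is the number of processes; the function name and the sample value f = 0.9 are illustrative choices, not part of the slides.

```python
def amdahl_speedup(f: float, p: int) -> float:
    """Speedup predicted by Amdahl's law for parallel fraction f on p processes."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if p < 1:
        raise ValueError("p must be >= 1")
    return 1.0 / ((1.0 - f) + f / p)


if __name__ == "__main__":
    # Assumed example: 90% of the work is parallelizable.
    for p in (1, 2, 4, 8, 16, 1024):
        print(f"p = {p:5d}  speedup = {amdahl_speedup(0.9, p):.2f}")
```

With f = 0.9 the speedup can never exceed 1 / (1 − f) = 10, however many processes are used, because the serial part does not shrink.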
• Communication overheads
2 8 3 5 7 4 1 6
2 10 11 8 12 11 5 7
2 10 13 18 23 19 17 18
2 10 13 18 25 29 30 36
• Then do the parallel algorithm using the computed partial prefix sums
• Number of phases:
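The rows of numbers above can be reproduced with a small serial simulation of a distance-doubling prefix sum. This is a sketch under the assumption that each phase adds the value `distance` positions to the left, with the distance doubling every phase; the function name is hypothetical.

```python
def parallel_prefix_sum(values):
    """Simulate a distance-doubling prefix sum; return the result and every phase."""
    x = list(values)
    phases = [list(x)]
    distance = 1
    while distance < len(x):
        # In a parallel run, every element i >= distance would perform this
        # addition at the same time, reading values from the previous phase.
        previous = list(x)
        for i in range(distance, len(x)):
            x[i] = previous[i] + previous[i - distance]
        phases.append(list(x))
        distance *= 2
    return x, phases


if __name__ == "__main__":
    result, phases = parallel_prefix_sum([2, 8, 3, 5, 7, 4, 1, 6])
    for row in phases:
        print(" ".join(str(v) for v in row))
    print("phases:", len(phases) - 1)
```

For the 8-element input the loop runs with distances 1, 2, and 4, so this sketch needs 3 phases.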
LogP model
L: latency or delay
o: overhead (processor busy in communication)
g: gap (required between successive sends/receives); g is the inverse of bandwidth, 1/g = bandwidth
P: number of processors / processes
[Figure: P processor/memory (P, M) modules attached to an interconnection network of limited capacity (L/g to or from a processor), annotated with o (overhead), g (gap), and L (latency)]
Figure 2. The LogP model describes an abstract machine configuration in terms of four performance parameters: L, the latency experienced in each communication event; o, the overhead experienced by the sending and receiving processors for each communication event; g, the gap between successive sends or successive receives by a processor; and P, the number of processor/memory modules.
Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 8
alpha + n * beta model
• Another model for communication
Tcomm = α + n × β
α: latency
n: size of message
1/β: bandwidth
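The α + n × β model can be turned into a small calculator. The sketch below uses made-up parameter values (1 microsecond latency, 10 GB/s bandwidth) purely for illustration; real values depend on the network and must be measured.

```python
def comm_time(n_bytes: float, alpha: float, beta: float) -> float:
    """Predicted time to send one message of n_bytes: alpha + n * beta."""
    return alpha + n_bytes * beta


# Hypothetical parameters, not measurements of any real machine.
ALPHA = 1e-6   # latency in seconds
BETA = 1e-10   # seconds per byte, i.e. bandwidth 1/BETA = 10 GB/s

if __name__ == "__main__":
    for n in (8, 1024, 1024**2, 64 * 1024**2):
        t = comm_time(n, ALPHA, BETA)
        print(f"{n:>10d} bytes: {t * 1e6:10.2f} us")
```

Small messages are dominated by α, while large messages are dominated by n × β, which is why aggregating many small messages into fewer large ones often pays off.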
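• At what rate should we increase problem size with respect to number of processes to keep efficiency constant (iso-efficiency)

$$\text{Speedup} = \frac{t_1}{t_p}$$

$$\text{Efficiency} = \frac{t_1}{t_p \times p}$$

$$p \times t_p = t_1 + t_o$$

$$\text{Efficiency} = \frac{t_1}{t_p \times p} = \frac{t_1}{t_1 + t_o} = \frac{1}{1 + \frac{t_o}{t_1}}$$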
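The definitions of speedup, efficiency, and total overhead can be checked against each other with measured times. This is a minimal sketch; the function names and the sample timings (100 s on one process, 30 s on four) are invented for illustration.

```python
def speedup(t1: float, tp: float) -> float:
    """Speedup = t1 / tp."""
    return t1 / tp


def efficiency(t1: float, tp: float, p: int) -> float:
    """Efficiency = t1 / (tp * p)."""
    return t1 / (tp * p)


def total_overhead(t1: float, tp: float, p: int) -> float:
    """Overhead t_o defined by p * tp = t1 + t_o."""
    return p * tp - t1


def efficiency_from_overhead(t1: float, to: float) -> float:
    """Efficiency = 1 / (1 + t_o / t1)."""
    return 1.0 / (1.0 + to / t1)


if __name__ == "__main__":
    t1, tp, p = 100.0, 30.0, 4   # hypothetical timings in seconds
    to = total_overhead(t1, tp, p)
    print("speedup:", speedup(t1, tp))
    print("efficiency:", efficiency(t1, tp, p))
    print("overhead:", to)
    print("efficiency from overhead:", efficiency_from_overhead(t1, to))
```

Both efficiency computations agree, which confirms that the last form is only an algebraic rewrite of the first.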
Isoefficiency function
$$\text{Efficiency} = \frac{1}{1 + \frac{t_o}{t_1}}$$

$$t_o = K \times t_1$$
Isoefficiency analysis
[Figure: a grid of side $\sqrt{n}$, split into strips of width $\frac{\sqrt{n}}{p}$ (1D decomposition) and into blocks of side $\frac{\sqrt{n}}{\sqrt{p}}$ (2D decomposition)]
• 1D decomposition:
• Computation: $\sqrt{n} \times \frac{\sqrt{n}}{p} = \frac{n}{p}$
• Communication: $2 \times \sqrt{n}$
• 2D decomposition:
• Computation: $\frac{\sqrt{n}}{\sqrt{p}} \times \frac{\sqrt{n}}{\sqrt{p}} = \frac{n}{p}$
• Communication: $4 \times \frac{\sqrt{n}}{\sqrt{p}}$

$$\frac{t_o}{t_1} = \frac{4 \times \frac{\sqrt{n}}{\sqrt{p}}}{\frac{n}{p}} = \frac{4 \times \sqrt{p}}{\sqrt{n}}$$
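The two decompositions can be compared numerically. The sketch below assumes the problem is a √n × √n grid (n points in total) on p processes, with per-process computation n/p and communication 2√n for the 1D case and 4√n/√p for the 2D case; that reading of the expressions above, and all function names, are assumptions of this example.

```python
import math


def overhead_ratio_1d(n: float, p: float) -> float:
    """t_o / t_1 for the 1D decomposition: (2 * sqrt(n)) / (n / p)."""
    return (2.0 * math.sqrt(n)) / (n / p)


def overhead_ratio_2d(n: float, p: float) -> float:
    """t_o / t_1 for the 2D decomposition: (4 * sqrt(n) / sqrt(p)) / (n / p)."""
    return (4.0 * math.sqrt(n) / math.sqrt(p)) / (n / p)


def problem_size_1d(p: float, k: float) -> float:
    """Problem size n that keeps the 1D ratio equal to k: n = (2p / k)^2."""
    return (2.0 * p / k) ** 2


def problem_size_2d(p: float, k: float) -> float:
    """Problem size n that keeps the 2D ratio equal to k: n = 16p / k^2."""
    return 16.0 * p / k**2


if __name__ == "__main__":
    k = 0.1  # assumed target value of t_o / t_1
    for p in (4, 16, 64, 256):
        print(f"p = {p:4d}  n_1d = {problem_size_1d(p, k):12.0f}"
              f"  n_2d = {problem_size_2d(p, k):10.0f}")
```

Under these assumptions the 1D decomposition needs n to grow as p² to hold efficiency constant, while the 2D decomposition only needs n to grow linearly in p, so the 2D decomposition scales better.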
• Simplest tool: adding timers in the code manually and using print statements
double start, end, phase1, phase2, phase3;

start = MPI_Wtime();
/* ... phase 1 code ... */
end = MPI_Wtime();
phase1 = end - start;

start = MPI_Wtime();
/* ... phase 2 code ... */
end = MPI_Wtime();
phase2 = end - start;

start = MPI_Wtime();
/* ... phase 3 code ... */
end = MPI_Wtime();
phase3 = end - start;

printf("Phase 1 took %.2f s\n", phase1);  /* example output: Phase 1 took 2.45 s */
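The same manual-timer pattern can be packaged so that each phase is timed with one line. This is a sketch in Python using only the standard library; `phase_timer` and `phase_times` are hypothetical names, and in an MPI program written with mpi4py, `MPI.Wtime()` would take the place of `time.perf_counter()`.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock time per phase name, in seconds.
phase_times = {}


@contextmanager
def phase_timer(name):
    """Time the enclosed block and add the elapsed time to phase_times[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        phase_times[name] = phase_times.get(name, 0.0) + elapsed


if __name__ == "__main__":
    with phase_timer("phase1"):
        sum(i * i for i in range(200_000))   # stand-in for phase 1 work
    with phase_timer("phase2"):
        sorted(range(200_000, 0, -1))        # stand-in for phase 2 work
    for name, seconds in phase_times.items():
        print(f"{name} took {seconds:.4f} s")
```

Accumulating into a dictionary means a phase that runs inside a loop is reported as one total, which matches how the printed per-phase summary is usually read.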
• Tracing tools
• Capture entire execution trace, typically via instrumentation
• VampirTrace
• Score-P
• TAU
• Projections
• HPCToolkit
• Examples:
• gprof, perf
• mpiP
• HPCToolkit, Caliper
gprof data in hpctView
Calling contexts, trees, and graphs
• Calling context tree (CCT): dynamic prefix tree of all call paths in an execution
• Call graph: obtained by merging nodes in a CCT with the same name
[Figure: example call tree with nodes physics, solvers, mpi, hypre, and psm2]
Calling context trees, call graphs, …
[Figure: a calling context tree (CCT) and the corresponding call graph for the functions foo, bar, qux, waldo, baz, grault, quux, fred, garply, corge, plugh, and xyzzy]
• File
• Line number
• Function name
• Callpath
• Load module
• Process ID
• Thread ID
• Time
• Flops
• Cache misses
• Flat profile: Listing of all invoked functions with counts and execution times
• Calling context tree: unique node per calling context
[Figure: excerpt from the gprof paper describing the flat profile and the call graph profile. In its example, routine EXAMPLE is called ten times, four times by CALLER1 and six times by CALLER2, so 40% of EXAMPLE's time is propagated to CALLER1 and 60% to CALLER2.]
[Figure: the example program rooted at foo shown as a calling context tree and as a FlameGraph; screenshot header: "Hatchet: Pruning the Overgrowth in Parallel Profiles"]
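The difference between a calling context tree, a call graph, and a flat profile can be made concrete with a few call paths. The sketch below is a toy model in plain Python; the sample paths, the times, and all function names are invented for illustration and do not come from any profiler.

```python
from collections import defaultdict

# Hypothetical samples: (call path, time in seconds spent in the last frame).
samples = [
    (("foo", "bar", "baz"), 2.0),
    (("foo", "bar", "grault"), 1.0),
    (("foo", "qux", "grault"), 3.0),
    (("foo", "waldo", "baz"), 4.0),
]


def build_cct(samples):
    """CCT as a dict: one node per distinct call path prefix -> exclusive time."""
    cct = defaultdict(float)
    for path, seconds in samples:
        for depth in range(1, len(path) + 1):
            cct[path[:depth]] += 0.0      # make sure every prefix has a node
        cct[path] += seconds
    return dict(cct)


def build_call_graph(cct):
    """Merge CCT nodes that share a function name; keep caller -> callee edges."""
    node_time = defaultdict(float)
    edges = set()
    for path, seconds in cct.items():
        node_time[path[-1]] += seconds
        if len(path) > 1:
            edges.add((path[-2], path[-1]))
    return dict(node_time), edges


def flat_profile(node_time):
    """Functions sorted by decreasing exclusive time."""
    return sorted(node_time.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    cct = build_cct(samples)
    node_time, edges = build_call_graph(cct)
    print("CCT nodes:", len(cct))
    print("call graph nodes:", len(node_time), "edges:", len(edges))
    print("flat profile:", flat_profile(node_time))
```

In the CCT, `baz` and `grault` each appear twice because they are reached through different callers; the call graph keeps one node per name, and the flat profile drops the caller information entirely.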
Hatchet: performance analysis tool
https://hatchet.readthedocs.io/en/latest/
4.2 Graph Operations

filter: filter takes a user-supplied function and applies that to all rows in the dataframe. The resulting series or dataframe is used to filter the dataframe to only return rows that are true. The returned graphframe preserves the original graph provided as input to the filter operation. Figure 4 shows a dataframe before and after a filter operation. In this case, the applied function returns all rows where time is greater than 10.0.

squash: The squash operation is typically performed by the user after a filter operation on the dataframe. As shown in Figure 5, squash removes nodes from the graph that were previously removed from the dataframe due to a filter operation. When one or more nodes on a path are removed from the graph, the nearest alive ancestor is connected by an edge to the nearest alive child on the path. All call paths in the graph are re-wired in this manner. After a squash operation, the graph and dataframe become consistent again. Additionally, a squash operation will make the values in all columns containing inclusive metrics inaccurate, since the parent-child relationships have changed. Hence, the squash operation also calls update_inclusive_columns to make all inclusive columns in the dataframe accurate again.

Figure 4: The dataframe before (left) and after (right) a filter
Figure 5: The graph before (left) and after (right) a squash
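The filter-then-squash sequence described above can be mimicked on a toy tree. The sketch below is plain Python and does not use Hatchet; the tree, the times, and the function names are hypothetical, and recomputing inclusive time from the exclusive times of the surviving nodes is an assumption about what the final update step does.

```python
# Hypothetical tree: node -> parent (None marks the root), plus exclusive times.
parent = {"main": None, "solvers": "main", "mpi": "solvers",
          "psm2": "mpi", "hypre": "solvers"}
time_exc = {"main": 12.0, "solvers": 40.0, "mpi": 3.0,
            "psm2": 25.0, "hypre": 15.0}


def filter_nodes(metric, predicate):
    """Return the set of nodes whose metric value satisfies the predicate."""
    return {node for node, value in metric.items() if predicate(value)}


def squash(parent, alive):
    """Rewire the tree: each alive node attaches to its nearest alive ancestor."""
    new_parent = {}
    for node in parent:
        if node not in alive:
            continue
        ancestor = parent[node]
        while ancestor is not None and ancestor not in alive:
            ancestor = parent[ancestor]
        new_parent[node] = ancestor
    return new_parent


def inclusive_times(parent, exclusive):
    """Inclusive time = own exclusive time plus that of all descendants."""
    inclusive = {node: exclusive[node] for node in parent}
    for node in parent:
        ancestor = parent[node]
        while ancestor is not None:
            inclusive[ancestor] += exclusive[node]
            ancestor = parent[ancestor]
    return inclusive


if __name__ == "__main__":
    alive = filter_nodes(time_exc, lambda t: t > 10.0)
    squashed = squash(parent, alive)
    print("alive:", sorted(alive))
    print("squashed parents:", squashed)
    print("inclusive:", inclusive_times(squashed, time_exc))
```

Here `mpi` fails the `time > 10.0` test, so after the squash `psm2` hangs directly under `solvers`, its nearest surviving ancestor.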
Graphframe operation: subtract
Bhatele et al., https://hatchet.readthedocs.io

gf1 = GraphFrame( ... )
gf2 = GraphFrame( ... )

gf2 -= gf1

Figure 6: Subtraction operation on two graphframes (resulting graph at the bottom).
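What a subtraction such as `gf2 -= gf1` computes can be sketched without Hatchet: a metric-by-metric difference over call paths. The example below is a simplified stand-in; the dictionary layout, the sample data, and the choice to treat a call path missing from one profile as zero are assumptions of this sketch rather than a description of Hatchet's implementation.

```python
def subtract(profile_a, profile_b):
    """Return profile_a - profile_b, metric by metric, over the union of call paths."""
    result = {}
    for path in sorted(set(profile_a) | set(profile_b)):
        metrics_a = profile_a.get(path, {})
        metrics_b = profile_b.get(path, {})
        result[path] = {
            # Assumption: a metric missing on one side counts as 0.0.
            metric: metrics_a.get(metric, 0.0) - metrics_b.get(metric, 0.0)
            for metric in sorted(set(metrics_a) | set(metrics_b))
        }
    return result


# Hypothetical profiles from two runs: call path -> {metric name: value}.
run1 = {("main",): {"time": 10.0}, ("main", "solve"): {"time": 30.0}}
run2 = {("main",): {"time": 11.0}, ("main", "solve"): {"time": 45.0},
        ("main", "io"): {"time": 4.0}}

if __name__ == "__main__":
    for path, metrics in subtract(run2, run1).items():
        print(" -> ".join(path), metrics)
```

A positive difference marks a call path that got slower in the second run, which is the typical use of subtraction when comparing two executions.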
If the two graphs are not identical, union (described above) is applied first to create a unified graph before performing the sum. The add operation returns a new resulting graphframe or modifies one of the graphframes in place in the case of the addition assignment (a += b).

subtract: The subtract operation is similar to the add operation in that it requires the two graphs to be identical. Once the graphs are structurally equivalent, the subtract operation computes the difference between the two dataframes column-wise. The subtract operation returns a new resulting graphframe or modifies one of the graphframes in place in the case of the subtraction assignment (a -= b). Figure 6 shows the subtraction of one graphframe from another and the graph for the resulting graphframe.

Visualizing small graphs

The native visualization in Hatchet is a string that can be printed to the terminal to display the graph. Hatchet can also output the graph in the DOT format or a folded stack used by flame graph [8]. The dot utility in Graphviz produces a hierarchical drawing of directed graphs, particularly useful for showing the direction of the edges. Flame graphs are useful for quickly identifying the performance bottleneck, that is the box with the largest width. The y-axis of the flame graph represents the call stack depth. Figure 7 shows the same Hatchet graph presented in the three supported visualizations: terminal, DOT, and flame graph.

Figure 7: Visualization outputs supported in Hatchet include terminal output (left), DOT (right), and flame graph (bottom).
Starter code for reading data
import sys

import hatchet as ht

if __name__ == '__main__':
    file_name = sys.argv[1]
    # Replace this with another reader depending on data source
    gf = ht.GraphFrame.from_caliper(file_name)
    print(gf.tree())
    print(gf.dataframe)
Example 4: Load imbalance
import hatchet

graphframe = hatchet.GraphFrame.from_hpctoolkit('qs_profile_128')
graphframe_imbalance = graphframe.load_imbalance(verbose=True)
# sort the top 50 nodes that have the highest mean value by imbalance
df_imb = graphframe_imbalance.dataframe.head(50).sort_values('time.imbalance', ascending=False)
print(df_imb.head(4))  # Dataframe Output (a)

Figure 9: Demonstration of load imbalance analysis and the results of the case study. The most imbalance is caused by MacroscopicCrossSection:22. Chopper’s load_imbalance function provides detailed statistics about the imbalance (a) that can be easily plotted by using Python libraries (b). We use Quicksilver execution on 128 processes.
(a) Quicksilver Load Imbalance DataFrame Output
(b) Load Imbalance Histogram of MacroscopicCrossSection.cc:22
Source: "Simplifying the Analysis of Parallel Profiles Using Chopper"
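A load imbalance metric of the kind shown in this example can be computed by hand from per-process times. The sketch below uses one common definition, maximum time divided by mean time across processes; whether Chopper's load_imbalance uses exactly this definition is an assumption here, and the node names and timings are made up.

```python
from statistics import mean


def load_imbalance(times_per_node):
    """Map each node to max/mean of its per-process times (1.0 = balanced)."""
    result = {}
    for node, times in times_per_node.items():
        average = mean(times)
        result[node] = max(times) / average if average > 0 else float("nan")
    return result


# Hypothetical per-process times (seconds) for two call tree nodes on 4 processes.
times_per_node = {
    "MacroscopicCrossSection": [1.0, 1.0, 1.0, 5.0],
    "main": [2.0, 2.0, 2.0, 2.0],
}

if __name__ == "__main__":
    imbalance = load_imbalance(times_per_node)
    for node, value in sorted(imbalance.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{node}: {value:.2f}")
```

A value well above 1.0 means one process spends much longer in that node than the average process, so the others wait at the next synchronization point.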