
Profiling - II

Lecture 24
April 15, 2024
Gprof
"gprof: A Call Graph Execution Profiler", by S. Graham, P. Kessler, M.
McKusick; Proceedings of the SIGPLAN '82 Symposium on Compiler
Construction, SIGPLAN Notices, Vol. 17, No 6, pp. 120-126, June 1982.

• Compile and link with the -g -pg flags
• Run the executable once to produce gmon.out
• gprof ./exe gmon.out > gprof.out
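A minimal end-to-end sketch (assuming gcc and a hypothetical source file prog.c):

  gcc -g -pg -o exe prog.c           # instrument for gprof
  ./exe                              # run; writes gmon.out in the current directory
  gprof ./exe gmon.out > gprof.out   # flat profile + call graph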
Sections in the mpiP report

• 31:@--- MPI Time (seconds) ---------------------------------------------------


• 52:@--- Callsites: 43 --------------------------------------------------------
• 99:@--- Aggregate Time (top twenty, descending, milliseconds) --------
• 123:@--- Aggregate Sent Message Size (top twenty, descending, bytes) ----------
• 146:@--- Callsite Time statistics (all, milliseconds): 688 --------------------
• 923:@--- Callsite Message Sent statistics (all, sent bytes) -------------------
• 1268:@--- End of Report --------------------------------------------------------
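For reference, a sketch of how such a report is typically generated (library path, link
flags, and the exact report file name depend on the local mpiP installation; $MPIP_ROOT
is a placeholder):

  # relink the application against mpiP ...
  mpicc -g -o exe prog.c -L$MPIP_ROOT/lib -lmpiP -lunwind -lm
  # ... or preload the shared library at run time without relinking
  LD_PRELOAD=$MPIP_ROOT/lib/libmpiP.so mpirun -np 16 ./exe
  # the report (something like exe.16.<pid>.1.mpiP) appears in the working directory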
Aggregate Time – Strong Scaling Comparison (mg)

(In the mpiP Aggregate Time tables: Site = callsite ID, Time = aggregate time in ms,
App% = share of total application time, MPI% = share of total MPI time,
COV = coefficient of variation across processes.)

Processes = 4
@--- Aggregate Time (top twenty, descending, milliseconds)
Call      Site   Time   App%   MPI%    COV
Bcast     12     22.5   0.49   19.71   0.66
Send      9      21.1   0.46   18.47   0.10
Send      20     14.1   0.31   12.39   0.02
Send      1      13.4   0.29   11.74   0.32

Processes = 16
@--- Aggregate Time (top twenty, descending, milliseconds)
Call      Site   Time   App%   MPI%    COV
Barrier   7      149    2.24   21.84   0.01
Send      9      140    2.10   20.48   0.81
Send      21     123    1.84   17.94   0.87
Wait      26     58.8   0.88   8.60    0.09
Aggregate Time – Strong Scaling Comparison (ft)

Processes = 4
@--- Aggregate Time (top twenty, descending, milliseconds)
Call        Site   Time   App%   MPI%    COV
Alltoall    9      443    5.69   84.89   0.03
Bcast       7      43.3   0.56   8.29    0.00
Reduce      4      32.4   0.42   6.21    0.02
Barrier     5      1.57   0.02   0.30    1.16

Processes = 16
@--- Aggregate Time (top twenty, descending, milliseconds)
Call        Site   Time       App%    MPI%    COV
Alltoall    9      1.73e+03   16.22   91.43   0.05
Reduce      4      76.4       0.72    4.03    0.84
Comm_split  10     44.3       0.41    2.34    0.92
Bcast       7      24.8       0.23    1.31    0.48
Aggregate Time – Data Scaling (cg on 16 processes)

Class = A (small problem)
@--- Aggregate Time (top twenty, descending, milliseconds)
Call    Site   Time   App%    MPI%    COV
Bcast   3      477    16.58   39.24   0.01
Wait    21     176    6.11    14.46   0.74
Send    10     162    5.64    13.34   0.85
Wait    6      89.6   3.11    7.37    0.13

Class = C (large problem)
@--- Aggregate Time (top twenty, descending, milliseconds)
Call    Site   Time       App%   MPI%    COV
Wait    21     1.03e+04   3.07   31.48   0.79
Send    15     8.84e+03   2.65   27.13   0.19
Send    12     8.44e+03   2.53   25.91   0.79
Wait    11     728        0.22   2.23    1.49
MPI Time vs. App Time (Class = A, NPROCS = 16)

Top entry of the Aggregate Time section for each benchmark:

Benchmark   Call        Site   Time       App%    MPI%    COV
cg          Bcast       3      477        16.58   39.24   0.01
ep          Allreduce   5      246        1.31    77.45   0.60
ft          Alltoall    9      1.73e+03   16.22   91.43   0.05
lu          Recv        3      3.78e+03   5.30    27.16   0.76
mg          Barrier     7      149        2.24    21.84   0.01
sp          Waitall     41     1.89e+03   2.13    20.52   0.17
IPM Profiles

[Figures: IPM profile plots. Configuration difference? IMB Reduce (NPROCS = 4, two
configurations); Communication Matrix (IMB Reduce, NPROCS = 4); IMB Reduce (NPROCS = 8,
1 host); IMB Reduce (NPROCS = 16, 2 hosts); IMB Gather (NPROCS = 32, 4 hosts)]
Darshan Internals

• Intercepts MPI-IO routines using the PMPI interface
• Data recorded on each process at run time, then merged and stored during MPI_Finalize
• MPI_Wtime() collects timing information
• In-memory file record
  • Array of counters for I/O calls
  • Frequency count of common access sizes
• Memory overhead
  • File record limited to 2 MB per process
  • Aggregate statistics beyond the limit
• Dynamic linking at run time
  • LD_PRELOAD enables overriding
• Static linking at compile time
  • Wrapper functions inserted via the --wrap linker option
• Time overhead *
  • MPI_Wtime() call: 165 ns
  • Function wrapping: 14 ns

* “24/7 Characterization of Petascale I/O Workloads”
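To make the interception mechanism concrete, here is a minimal sketch of a PMPI-style
wrapper (not Darshan's actual code; the counters are illustrative) that a layer inserted
via LD_PRELOAD or --wrap could use to time and count MPI_File_write:

  #include <mpi.h>

  /* Illustrative per-process record: one counter and accumulated time. */
  static long   write_calls = 0;
  static double write_time  = 0.0;

  /* Same signature as MPI_File_write; the real work is delegated to the
   * PMPI_ entry point, so the application source is unmodified. */
  int MPI_File_write(MPI_File fh, const void *buf, int count,
                     MPI_Datatype datatype, MPI_Status *status)
  {
      double t0 = MPI_Wtime();
      int rc = PMPI_File_write(fh, buf, count, datatype, status);
      write_time  += MPI_Wtime() - t0;  /* timing via MPI_Wtime(), as noted above */
      write_calls += 1;                 /* counted locally; merged at MPI_Finalize */
      return rc;
  }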


Darshan I/O Profiler
• cd io
• export DARSHAN_LOGPATH=darshan-logs
• mpiicc -o indepIO indepIO.c
• export LD_PRELOAD=../lib/libdarshan.so (path to libdarshan.so)
• qsub subindepIO.c
• mkdir $DARSHAN_LOGPATH/2024/4/15
• ls -t $DARSHAN_LOGPATH/2024/4/15 [look for the .darshan log file]
• ./darshan-parser <logfile> > parsed
• grep POSIX_F_FASTEST_RANK_TIME parsed
• grep MPIIO_F_FASTEST_RANK_TIME parsed
• grep MPIIO_F_SLOWEST_RANK_TIME parsed
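The indepIO.c source is not reproduced in the slides; a minimal stand-in that exercises
independent MPI-IO writes (so Darshan's MPIIO_ and POSIX_ counters have something to
record) might look like the following, with the file name and sizes purely illustrative:

  #include <mpi.h>
  #include <stdlib.h>

  #define COUNT (1 << 20)   /* 1M integers per process (illustrative size) */

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      int *buf = malloc(COUNT * sizeof(int));
      for (int i = 0; i < COUNT; i++) buf[i] = rank;

      /* Independent I/O: each rank writes its own block at its own offset. */
      MPI_File fh;
      MPI_File_open(MPI_COMM_WORLD, "indep.out",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
      MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(int);
      MPI_File_write_at(fh, offset, buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);
      MPI_File_close(&fh);

      free(buf);
      MPI_Finalize();   /* Darshan merges and writes its log here */
      return 0;
  }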
Revision Q1

Consider an MPI_Bcast of 10 KB of data (root = 0) on the 2D mesh of 4 nodes.
There are 8 processes placed on the 4 nodes: ranks 0 and 1 on node 1, ranks 2 and 3 on
node 2, and so on.
The bandwidth of every link is 1 Gbps. Assume hop = 0 between processes within a node,
and an XY routing policy (messages first traverse the x-dimension, then the y-dimension).
Total time = 4 ms.
Analyze and discuss the effective bandwidth, maximum number of hops, and link contention
for this Bcast.
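As a starting point for the analysis (taking effective bandwidth to mean the broadcast
payload divided by the measured total time, and 10 KB = 10,000 bytes):

  effective bandwidth = 10 KB / 4 ms = 2.5 MB/s = 20 Mbit/s
  (roughly 2% of the 1 Gbps link bandwidth)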
Revision Q2
Compare and contrast the recursive doubling algorithm for MPI_Reduce on
8 processes for the following node allocations:
(a) Ranks 0 – 3 are on csews1, ranks 4 – 7 are on csews2
(b) Even ranks are on csews1, odd ranks are on csews2
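For reference, a minimal sketch of the communication pattern in question: a binomial-tree
reduce to rank 0 in which the partner distance doubles each step. The function and
variable names are illustrative, and the process count is assumed to be a power of two.

  #include <mpi.h>

  /* Binomial-tree ("recursive doubling") reduce of one int to rank 0.
   * Assumes nprocs is a power of two. */
  int reduce_to_root(int val, int rank, int nprocs)
  {
      for (int mask = 1; mask < nprocs; mask <<= 1) {  /* distance doubles: 1, 2, 4, ... */
          if (rank & mask) {
              /* Send the partial result to the partner 'mask' ranks below, then drop out. */
              MPI_Send(&val, 1, MPI_INT, rank - mask, 0, MPI_COMM_WORLD);
              break;
          }
          if (rank + mask < nprocs) {
              int other;
              MPI_Recv(&other, 1, MPI_INT, rank + mask, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              val += other;                            /* combine partial results */
          }
      }
      return val;   /* meaningful only on rank 0 */
  }

Tracing which (rank, rank + mask) pairs fall on the same host in allocations (a) and (b)
shows, for each of the log2(8) = 3 steps, how many messages cross the csews1/csews2
boundary.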
Revision Q3: 3D domain decomposition
17 //initialize
18 for (int i=0; i<N; i++)
19   for (int j=0; j<N; j++)
20     for (int k=0; k<N; k++)
21       data[i][j][k] = (rank+1) * (i+j+k);

22 int xStart=________________, yStart=________________, zStart=________________;

23 int xEnd=__________________, yEnd=__________________, zEnd=__________________;
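One possible way to fill in the blanks, assuming the intended decomposition is a block
split of the N x N x N domain over a px x py x pz Cartesian process grid whose dimensions
divide N; px, py, pz and coords[] are assumptions, not given on the slide:

  /* Assumed setup: MPI_Cart_coords has filled coords[3] for this rank on a
   * px x py x pz grid. Each rank owns an (N/px) x (N/py) x (N/pz) block;
   * the End indices are exclusive. */
  int xStart = coords[0] * (N / px),
      yStart = coords[1] * (N / py),
      zStart = coords[2] * (N / pz);

  int xEnd = xStart + (N / px),
      yEnd = yStart + (N / py),
      zEnd = zStart + (N / pz);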
