
Profiling - II

Lecture 24
April 15, 2024
Gprof
"gprof: A Call Graph Execution Profiler", by S. Graham, P. Kessler, M.
McKusick; Proceedings of the SIGPLAN '82 Symposium on Compiler
Construction, SIGPLAN Notices, Vol. 17, No 6, pp. 120-126, June 1982.

• Compile and link with the -g -pg flags
• Run the executable once to produce gmon.out
• gprof ./exe gmon.out > gprof.out
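A minimal end-to-end sketch (assuming gcc and a hypothetical source file prog.c):

  gcc -g -pg -o exe prog.c           # instrument for gprof
  ./exe                              # run; writes gmon.out in the current directory
  gprof ./exe gmon.out > gprof.out   # flat profile + call graph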
Sections in the mpiP report

• 31:@--- MPI Time (seconds) ---------------------------------------------------


• 52:@--- Callsites: 43 --------------------------------------------------------
• 99:@--- Aggregate Time (top twenty, descending, milliseconds) --------
• 123:@--- Aggregate Sent Message Size (top twenty, descending, bytes) ----------
• 146:@--- Callsite Time statistics (all, milliseconds): 688 --------------------
• 923:@--- Callsite Message Sent statistics (all, sent bytes) -------------------
• 1268:@--- End of Report --------------------------------------------------------
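For reference, a sketch of how such a report is typically generated (library path, link
flags, and the exact report file name depend on the local mpiP installation; $MPIP_ROOT
is a placeholder):

  # relink the application against mpiP ...
  mpicc -g -o exe prog.c -L$MPIP_ROOT/lib -lmpiP -lunwind -lm
  # ... or preload the shared library at run time without relinking
  LD_PRELOAD=$MPIP_ROOT/lib/libmpiP.so mpirun -np 16 ./exe
  # the report (something like exe.16.<pid>.1.mpiP) appears in the working directory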
Aggregate Time – Strong Scaling Comparison (mg)

(In the mpiP Aggregate Time tables: Site = callsite ID, Time = aggregate time in ms,
App% = share of total application time, MPI% = share of total MPI time,
COV = coefficient of variation across processes.)

Processes = 4
@--- Aggregate Time (top twenty, descending, milliseconds)
Call      Site   Time   App%   MPI%    COV
Bcast     12     22.5   0.49   19.71   0.66
Send      9      21.1   0.46   18.47   0.10
Send      20     14.1   0.31   12.39   0.02
Send      1      13.4   0.29   11.74   0.32

Processes = 16
@--- Aggregate Time (top twenty, descending, milliseconds)
Call      Site   Time   App%   MPI%    COV
Barrier   7      149    2.24   21.84   0.01
Send      9      140    2.10   20.48   0.81
Send      21     123    1.84   17.94   0.87
Wait      26     58.8   0.88   8.60    0.09
Aggregate Time – Strong Scaling Comparison (ft)

Processes = 4
@--- Aggregate Time (top twenty, descending, milliseconds)
Call        Site   Time   App%   MPI%    COV
Alltoall    9      443    5.69   84.89   0.03
Bcast       7      43.3   0.56   8.29    0.00
Reduce      4      32.4   0.42   6.21    0.02
Barrier     5      1.57   0.02   0.30    1.16

Processes = 16
@--- Aggregate Time (top twenty, descending, milliseconds)
Call        Site   Time       App%    MPI%    COV
Alltoall    9      1.73e+03   16.22   91.43   0.05
Reduce      4      76.4       0.72    4.03    0.84
Comm_split  10     44.3       0.41    2.34    0.92
Bcast       7      24.8       0.23    1.31    0.48
Aggregate Time – Data Scaling (cg on 16 processes)

Class = A (small problem)
@--- Aggregate Time (top twenty, descending, milliseconds)
Call    Site   Time   App%    MPI%    COV
Bcast   3      477    16.58   39.24   0.01
Wait    21     176    6.11    14.46   0.74
Send    10     162    5.64    13.34   0.85
Wait    6      89.6   3.11    7.37    0.13

Class = C (large problem)
@--- Aggregate Time (top twenty, descending, milliseconds)
Call    Site   Time       App%   MPI%    COV
Wait    21     1.03e+04   3.07   31.48   0.79
Send    15     8.84e+03   2.65   27.13   0.19
Send    12     8.44e+03   2.53   25.91   0.79
Wait    11     728        0.22   2.23    1.49
MPI Time vs. App Time (Class = A, NPROCS = 16)

Top entry of the Aggregate Time section for each benchmark:

Benchmark   Call        Site   Time       App%    MPI%    COV
cg          Bcast       3      477        16.58   39.24   0.01
ep          Allreduce   5      246        1.31    77.45   0.60
ft          Alltoall    9      1.73e+03   16.22   91.43   0.05
lu          Recv        3      3.78e+03   5.30    27.16   0.76
mg          Barrier     7      149        2.24    21.84   0.01
sp          Waitall     41     1.89e+03   2.13    20.52   0.17
IPM Profiles

[Figures: IPM profile plots. Configuration difference? IMB Reduce (NPROCS = 4, two
configurations); Communication Matrix (IMB Reduce, NPROCS = 4); IMB Reduce (NPROCS = 8,
1 host); IMB Reduce (NPROCS = 16, 2 hosts); IMB Gather (NPROCS = 32, 4 hosts)]
Darshan Internals

• Intercepts MPI-IO routines using the PMPI interface
• Data recorded on each process at run time, then merged and stored during MPI_Finalize
• MPI_Wtime() collects timing information
• In-memory file record
  • Array of counters for I/O calls
  • Frequency count of common access sizes
• Memory overhead
  • File record limited to 2 MB per process
  • Aggregate statistics beyond the limit
• Dynamic linking at run time
  • LD_PRELOAD enables overriding
• Static linking at compile time
  • Wrapper functions inserted via the --wrap linker option
• Time overhead *
  • MPI_Wtime() call: 165 ns
  • Function wrapping: 14 ns

* “24/7 Characterization of Petascale I/O Workloads”
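To make the interception mechanism concrete, here is a minimal sketch of a PMPI-style
wrapper (not Darshan's actual code; the counters are illustrative) that a layer inserted
via LD_PRELOAD or --wrap could use to time and count MPI_File_write:

  #include <mpi.h>

  /* Illustrative per-process record: one counter and accumulated time. */
  static long   write_calls = 0;
  static double write_time  = 0.0;

  /* Same signature as MPI_File_write; the real work is delegated to the
   * PMPI_ entry point, so the application source is unmodified. */
  int MPI_File_write(MPI_File fh, const void *buf, int count,
                     MPI_Datatype datatype, MPI_Status *status)
  {
      double t0 = MPI_Wtime();
      int rc = PMPI_File_write(fh, buf, count, datatype, status);
      write_time  += MPI_Wtime() - t0;  /* timing via MPI_Wtime(), as noted above */
      write_calls += 1;                 /* counted locally; merged at MPI_Finalize */
      return rc;
  }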


Darshan I/O Profiler
• cd io
• export DARSHAN_LOGPATH=darshan-logs
• mpiicc -o indepIO indepIO.c
• export LD_PRELOAD=../lib/libdarshan.so (path to libdarshan.so)
• qsub subindepIO.c
• mkdir $DARSHAN_LOGPATH/2024/4/15
• ls -t $DARSHAN_LOGPATH/2024/4/15 [look for the .darshan log file]
• ./darshan-parser <logfile> > parsed
• grep POSIX_F_FASTEST_RANK_TIME parsed
• grep MPIIO_F_FASTEST_RANK_TIME parsed
• grep MPIIO_F_SLOWEST_RANK_TIME parsed
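The indepIO.c source is not reproduced in the slides; a minimal stand-in that exercises
independent MPI-IO writes (so Darshan's MPIIO_ and POSIX_ counters have something to
record) might look like the following, with the file name and sizes purely illustrative:

  #include <mpi.h>
  #include <stdlib.h>

  #define COUNT (1 << 20)   /* 1M integers per process (illustrative size) */

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      int *buf = malloc(COUNT * sizeof(int));
      for (int i = 0; i < COUNT; i++) buf[i] = rank;

      /* Independent I/O: each rank writes its own block at its own offset. */
      MPI_File fh;
      MPI_File_open(MPI_COMM_WORLD, "indep.out",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
      MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(int);
      MPI_File_write_at(fh, offset, buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);
      MPI_File_close(&fh);

      free(buf);
      MPI_Finalize();   /* Darshan merges and writes its log here */
      return 0;
  }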
Revision Q1

Consider an MPI_Bcast of 10 KB of data (root = 0) on the 2D mesh of 4 nodes.
There are 8 processes placed on the 4 nodes: ranks 0 and 1 on node 1, ranks 2 and 3 on
node 2, and so on.
The bandwidth of every link is 1 Gbps. Assume hop = 0 between processes within a node,
and an XY routing policy (messages first traverse the x-dimension, then the y-dimension).
Total time = 4 ms.
Analyze and discuss the effective bandwidth, maximum number of hops, and link contention
for this Bcast.
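As a starting point for the analysis (taking effective bandwidth to mean the broadcast
payload divided by the measured total time, and 10 KB = 10,000 bytes):

  effective bandwidth = 10 KB / 4 ms = 2.5 MB/s = 20 Mbit/s
  (roughly 2% of the 1 Gbps link bandwidth)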
Revision Q2
Compare and contrast the recursive doubling algorithm for MPI_Reduce on
8 processes for the following node allocations:
(a) Ranks 0 – 3 are on csews1, ranks 4 – 7 are on csews2
(b) Even ranks are on csews1, odd ranks are on csews2
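For reference, a minimal sketch of the communication pattern in question: a binomial-tree
reduce to rank 0 in which the partner distance doubles each step. The function and
variable names are illustrative, and the process count is assumed to be a power of two.

  #include <mpi.h>

  /* Binomial-tree ("recursive doubling") reduce of one int to rank 0.
   * Assumes nprocs is a power of two. */
  int reduce_to_root(int val, int rank, int nprocs)
  {
      for (int mask = 1; mask < nprocs; mask <<= 1) {  /* distance doubles: 1, 2, 4, ... */
          if (rank & mask) {
              /* Send the partial result to the partner 'mask' ranks below, then drop out. */
              MPI_Send(&val, 1, MPI_INT, rank - mask, 0, MPI_COMM_WORLD);
              break;
          }
          if (rank + mask < nprocs) {
              int other;
              MPI_Recv(&other, 1, MPI_INT, rank + mask, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              val += other;                            /* combine partial results */
          }
      }
      return val;   /* meaningful only on rank 0 */
  }

Tracing which (rank, rank + mask) pairs fall on the same host in allocations (a) and (b)
shows, for each of the log2(8) = 3 steps, how many messages cross the csews1/csews2
boundary.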
Revision Q3: 3D domain decomposition
17 //initialize
18 for (int i=0; i<N; i++)
19   for (int j=0; j<N; j++)
20     for (int k=0; k<N; k++)
21       data[i][j][k] = (rank+1) * (i+j+k);

22 int xStart=________________, yStart=________________, zStart=________________;

23 int xEnd=__________________, yEnd=__________________, zEnd=__________________;
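One possible way to fill in the blanks, assuming the intended decomposition is a block
split of the N x N x N domain over a px x py x pz Cartesian process grid whose dimensions
divide N; px, py, pz and coords[] are assumptions, not given on the slide:

  /* Assumed setup: MPI_Cart_coords has filled coords[3] for this rank on a
   * px x py x pz grid. Each rank owns an (N/px) x (N/py) x (N/pz) block;
   * the End indices are exclusive. */
  int xStart = coords[0] * (N / px),
      yStart = coords[1] * (N / py),
      zStart = coords[2] * (N / pz);

  int xEnd = xStart + (N / px),
      yEnd = yStart + (N / py),
      zEnd = zStart + (N / pz);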
