5.3 Statistical Data Tables

Like other successful tools, PPW displays high-level statistical performance information in the form of two tables that show statistical data for all regions of a program. The Profile Table reports flat profile information, while the Tree Table (Figure 7) separately reports the performance data for the same code region executed in individual callpaths. Selecting an entry in these tables highlights the corresponding line in the source code viewer below the tables, allowing the user to relate an entry directly to the source code responsible for it.

5.4 Experiment Set Comparison

PPW supports the loading of multiple performance data sets at once, a feature that proves useful when working with multiple related experimental runs. Also included in PPW is the Experiment Set Comparison chart (Figure 4), which the user can use to quickly compare the performance of the loaded data sets. For example, when the user loads the data sets for the same program running on different system sizes, he or she can quickly determine the scalability of the program and identify its sequential and parallel portions. Another use is to load data sets for different revisions of the program running on the same system size and compare the effect of the revisions on performance.
Figure 5. PPW Data Transfers visualization showing the communication volume between processing nodes for put operations

Figure 6. PPW Array Distribution visualization showing the physical layout of a 7x8 array with a blocking factor of 3 on a system with 8 nodes
5.5 Communication Pattern Visualization

In many situations, especially in cluster computing, significant performance degradation is associated with poorly arranged inter-node communications. The Data Transfers visualization (Figure 5) provided by PPW is helpful in identifying communication-related bottlenecks such as communication hot spots. While this visualization is not unique to PPW, ours is able to show communications from both implicit and explicit one-sided data transfer operations in the original program. By default, this visualization depicts the inter-node communication volume for all data transfer operations in the program, but users are able to view the information for a subset of data transfers (e.g., puts only, gets with payload size of 8-16 kB, etc.). Threads that initiate an inordinate amount of communication will have their corresponding blocks in the grid stand out in red. Similarly, threads that have affinity to data for which many transfers occur will have their column in the grid stand out.
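To make the distinction between implicit and explicit transfers concrete, the fragment below is a minimal UPC sketch (the array name, blocking factor, and sizes are illustrative only and are not taken from any benchmark); both operations would be attributed to the enclosing source lines in this view.

    #include <upc.h>

    #define BLOCK 4
    /* One block of 4 ints per thread, distributed round-robin by blocks. */
    shared [BLOCK] int data[BLOCK * THREADS];

    int main(void)
    {
        int local[BLOCK];

        data[MYTHREAD * BLOCK] = MYTHREAD;   /* local write, no communication */
        upc_barrier;

        /* Implicit one-sided transfer: an ordinary assignment to an element
           with affinity to the neighboring thread becomes a put at runtime. */
        data[((MYTHREAD + 1) % THREADS) * BLOCK + 1] = MYTHREAD;
        upc_barrier;

        /* Explicit one-sided transfer: a bulk library call that gets thread
           0's entire block into private memory and is counted as a get. */
        upc_memget(local, &data[0], BLOCK * sizeof(int));

        upc_barrier;
        return 0;
    }

Filtering the view to puts only, as described above, would keep the traffic generated by the assignment and drop the traffic from the upc_memget call.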
5.6 Array Distribution Visualization

A novel visualization provided by PPW is the Array Distribution, which graphically depicts the physical layout of shared objects in the application on the target system (Figure 6). Such a visualization helps users verify that they have distributed the data as desired, a particularly important consideration when using the PGAS abstraction. We are also currently investigating the possibility of incorporating communication pattern information with this visualization to give more insight into how specific shared objects are being accessed. This extension would allow PPW to provide more specific details as to how data transfers behaved during program execution.
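As a concrete illustration of the layout information this view conveys, the sketch below reproduces the distribution from Figure 6 in UPC source form; the element type and identifiers are ours, while the dimensions and blocking factor follow the figure caption, and a fixed thread count of 8 is assumed.

    #include <stdio.h>
    #include <upc.h>

    /* A 7x8 shared array with a blocking factor of 3: consecutive chunks of
       3 elements are dealt round-robin to threads 0, 1, 2, ... (a shared
       array whose dimensions do not mention THREADS requires compiling for
       a fixed thread count, here 8). */
    shared [3] double A[7][8];

    int main(void)
    {
        if (MYTHREAD == 0) {
            /* Print the thread each element has affinity to, which is the
               same ownership map the Array Distribution view draws. */
            for (int i = 0; i < 7; i++) {
                for (int j = 0; j < 8; j++)
                    printf("%2d ", (int)upc_threadof(&A[i][j]));
                printf("\n");
            }
        }
        upc_barrier;
        return 0;
    }

Checking this ownership map against the programmer's intent is exactly the verification step the Array Distribution visualization is meant to automate.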
6 Case Study

In this section, we present a small case study conducted using PPW version 1.0 [17]. We ran George Washington University's UPC implementation [18] of the FT benchmark (which implements an FFT algorithm) from the NAS benchmark suite 2.4 using Berkeley UPC 2.4, which includes a fully functional GASP implementation. Tracing performance data were collected for the class B setting executed on a 32-node Opteron 2.0 GHz cluster with a Quadrics QsNetII interconnect. Initially no change was made to the FT source code.

From the Tree Table for the FT benchmark (Figure 7), it was immediately obvious that the fft function call (3rd row) constitutes the bulk of the execution time (9s out of 10s of total execution time). Further examination of performance data for events within the fft function revealed that upc_barriers (represented as upc_notify and upc_wait) in transpose2_global (6th row) appeared to be a potential bottleneck. We came to this conclusion from the fact that the actual average execution time ...

Figure 7. PPW Tree Table visualization for the original FT shown with code yielding part of the performance degradation

Figure 8. Annotated Jumpshot view of the original version of FT showing the serialized nature of upc_memget
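For readers unfamiliar with the access pattern behind Figure 8, the sketch below shows the general shape of a barrier-delimited upc_memget exchange loop of the kind whose serialization the Jumpshot view exposes. It is a hypothetical illustration, not the actual FT source; the identifiers and chunk size are ours, and the reordering mentioned in the comment is one common remedy for this pattern, not necessarily the change examined in this study.

    #include <stdlib.h>
    #include <upc.h>

    #define CHUNK 1024                      /* illustrative chunk size */
    shared [CHUNK] double src[CHUNK * THREADS];

    /* Gather one chunk from every thread into private memory. Because every
       thread issues its gets in the same order (chunk 0 first, then chunk 1,
       and so on), all THREADS initial gets target the same affinity thread
       and tend to serialize; starting each thread at a different offset,
       e.g. peer = (MYTHREAD + t) % THREADS, spreads the requests out. */
    void gather_all(double *dst)
    {
        upc_barrier;
        for (int t = 0; t < THREADS; t++)
            upc_memget(&dst[t * CHUNK], &src[t * CHUNK],
                       CHUNK * sizeof(double));
        upc_barrier;
    }

    int main(void)
    {
        double *dst = malloc((size_t)CHUNK * THREADS * sizeof(double));
        for (int k = 0; k < CHUNK; k++)
            src[MYTHREAD * CHUNK + k] = MYTHREAD;   /* fill local block */
        gather_all(dst);
        free(dst);
        return 0;
    }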
[4] B. P. Miller, M. D. Callaghan, J. M. Cargille, J. K. Hollingsworth, R. B. Irvin, K. L. Karavanic, K. Kunchithapadam, and T. Newhall. "The Paradyn parallel performance measurement tool". IEEE Computer, 28(11):37–46, November 1995.

[5] S. S. Shende and A. D. Malony. "The TAU Parallel Performance System". International Journal of High-Performance Computing Applications (HPCA), 20(2):297–311, May 2006.

[6] The UPC Consortium. "UPC Language Specifications, v1.2", May 2005. http://www.gwu.edu/~upc/docs/upc_specs_1.2.pdf

[15] Berkeley UPC project website. http://upc.lbl.gov/

[16] Vampir tool website. http://www.vampir-ng.de/

[17] PPW project website. http://ppw.hcs.ufl.edu/

[18] GWU UPC NAS 2.4 benchmarks. http://www.gwu.edu/~upc/download.html