
2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

Unveiling the Interplay Between Global Link Arrangements and Network Management Algorithms on Dragonfly Networks
Fulya Kaplan∗, Ozan Tuncer∗, Vitus J. Leung†, Scott K. Hemmert†, and Ayse K. Coskun∗
∗Boston University, Boston, MA – {fkaplan3, otuncer, acoskun}@bu.edu
†Sandia National Laboratories, Albuquerque, NM – {vjleung, kshemme}@sandia.gov

Abstract—Network messaging delay historically constitutes a large portion of the wall-clock time for High Performance Computing (HPC) applications, as these applications run on many nodes and involve intensive communication among their tasks. The dragonfly network topology has emerged as a promising solution for building exascale HPC systems owing to its low network diameter and large bisection bandwidth. A dragonfly includes local links that form groups and global links that connect these groups via high-bandwidth optical links. Many aspects of the dragonfly network design are yet to be explored, such as the performance impact of the connectivity of the global links, i.e., global link arrangements, the bandwidth of the local and global links, or the job allocation algorithm.

This paper first introduces a packet-level simulation framework to model the performance of HPC applications in detail. The proposed framework is able to simulate known MPI (Message Passing Interface) routines as well as applications with custom-defined communication patterns for a given job placement algorithm and network topology. Using this simulation framework, we investigate the coupling between global link bandwidth and arrangements, communication pattern and intensity, job allocation and task mapping algorithms, and routing mechanisms in dragonfly topologies. We demonstrate that by choosing the right combination of system settings and workload allocation algorithms, communication overhead can be decreased by up to 44%. We also show that the circulant arrangement provides up to 15% higher bisection bandwidth compared to the other arrangements; but for realistic workloads, the performance impact of link arrangements is less than 3%.

I. INTRODUCTION

In HPC systems, network communication efficiency and delay play important roles in determining performance and scalability [1]. When scaling up to tens of thousands of nodes, traditional topologies such as toroidal meshes increase the network energy consumption up to 50% of the overall system energy [2] and introduce large communication overhead, as messages need to travel tens of network hops on average [3].

Recent technological advances have enabled new topology designs that overcome these energy and performance limitations. First, owing to increased-radix routers [4], a larger number of ports can be connected to a router, allowing lower-diameter networks that reduce the number of hops a packet needs to travel. A reduced number of network hops translates into shorter messaging delays as well as lower energy consumption. Second, the availability of cost- and energy-efficient optical switches [5] enables higher link bandwidth compared to electrical links. Optical links also provide a longer physical distance traveled per network hop.

Fig. 1: A dragonfly group with all-to-all local connections. Boxes are routers, circles are nodes, solid lines are electrical local links, and dashed lines are optical global links.

The dragonfly network topology [6] exploits these technology advances to achieve high bisection bandwidth¹ and high scalability. A dragonfly network has a two-level hierarchy, where the elements in each level are closely connected, resulting in a low network diameter. Variations of the dragonfly topology are currently used in Cray XC [7], Cray Cascade [8], and IBM PERCS [9].

The dragonfly topology's two-level hierarchy is composed of local links forming groups and global links connecting these groups via optical links. Within a group, the routers are connected in an all-to-all or a flattened-butterfly fashion using electrical links. Overall, each router has ports connecting it to (i) the compute nodes, (ii) the other routers in the group, and (iii) the other groups in the network. Figure 1 illustrates a single group consisting of 4 routers, where each router is connected to 2 nodes, 3 other routers within the group, and 2 routers in other groups [6].

The existing literature on dragonflies focuses on analyzing the impact of routing [10], [11], [12] and job allocation algorithms [13], [14], [15] on performance. One aspect of dragonfly network topologies that has not yet been extensively studied is the global link arrangement, which defines the connectivity of each router in a group to the other groups. Recent work proposes three specific link arrangements to connect groups to each other: absolute, relative, and circulant-based [16]. Hastings et al. provide a theoretical study on these arrangements and analyze how the system bisection bandwidth changes with respect to the global and local link bandwidths [17]. Existing work, however, does not investigate the coupling of these arrangements with routing mechanisms, job allocation algorithms, or communication patterns.

This paper studies the impact of global link arrangements on network performance in tandem with routing mechanisms, job placement algorithms, and application communication patterns. To enable such analysis, we design a packet-level simulation framework that unifies network design parameters with communication patterns and accurately models the wall-clock time of HPC applications. Our specific contributions are as follows:

• We introduce a packet-level simulation framework based on the Structural Simulation Toolkit (SST) [18]. Our framework estimates the performance of HPC applications with consideration of the underlying network topology by closing the gap between job allocation/task mapping and detailed network simulation. This framework allows simulating the combined impact of allocation/mapping decisions as well as network and application properties on the performance of HPC applications.

• Using our framework, we evaluate the performance of global link arrangements considering a set of job allocation and routing mechanisms for various communication patterns, and demonstrate that the circulant global link arrangement provides up to 15% lower communication overhead compared to the other arrangements when the network is highly loaded, owing to its higher bisection bandwidth. We also show that the impact of global arrangements on application performance can be limited for common MPI patterns.

• We show that task mapping can substantially impact application running time, and the level of this impact is also affected by the routing mechanism and the network load. In our experiments with selected applications, we show that task mapping affects application running time by up to 11%.

• We demonstrate that the combined impact of the job allocation and routing mechanism highly depends on the bandwidth ratio of the global links over the local links. For high bandwidth ratios, our analysis shows up to 44% difference in communication overhead between the best and worst performing {routing, allocation} pairs.

The rest of the paper starts by providing background on dragonfly networks. Section III presents the details of our proposed simulation framework together with our target HPC machines and workload assumptions. In Section IV, we provide the results of our performance analysis regarding different aspects of dragonfly networks. Section V describes the prior work on dragonflies, and Section VI concludes the paper.

¹Bisection bandwidth is the minimum bandwidth between two equally-sized parts of the system.

978-1-5090-6611-7/17 $31.00 © 2017 IEEE. DOI 10.1109/CCGRID.2017.93

Fig. 2: Dragonfly networks adopting different global link arrangements: (a) absolute, (b) relative, (c) circulant-based. Each configuration uses g = 9, a = 4, h = 2, and all-to-all local connections. The boxes represent the routers.

II. BACKGROUND ON DRAGONFLY NETWORKS

The dragonfly topology [6] is a two-level hierarchical direct network based on high-radix routers. We use the parameters presented in Table I to describe a dragonfly topology.

c   Number of cores per node
p   Number of nodes connected to a router
a   Number of routers in a group
g   Number of groups
h   Number of optical links on a router
TABLE I: Notation for the dragonfly parameters.

In the first hierarchical level, a routers constitute a group and are connected by local electrical links, typically in an all-to-all or a flattened-butterfly network topology. Each router is connected to p compute nodes and has h optical links that form an inter-group network. The routers in a group collectively act as a virtual router with a · p connections to compute nodes and a · h connections to other groups. The second hierarchical level consists of g of these virtual routers, typically connected with an all-to-all topology.

The rest of this section describes the link arrangements, routing mechanisms, and job placement strategies we use in our study of global link arrangements.

A. Link Arrangements

There are two link groups in a dragonfly network: global and local. Global links refer to the optical inter-group cables, whereas local links connect routers within a single group.

1) Global: We use three different global link arrangements and comply with the terminology used by Hastings et al. [17].

Absolute arrangement: In the absolute arrangement, the first available port in group 0 is connected to the first available port in group 1. Then, the next available port in group 0 is connected to the first available port in group 2. This continues until group 0 is linked to all other groups. We apply the same procedure to the remaining groups in order. As a result, port i of group j is connected to group i if i < j, and to group i + 1 otherwise. Figure 2(a) depicts an absolute arrangement.

Relative arrangement: All groups have identical relative connections in this arrangement. As shown in Figure 2(b), in each group, port 0 is connected to the next group, port 1 is connected to the second next group, and so on. In other words, port i of group j is connected to group (i + j + 1) mod g, where g is the total number of groups.

Circulant-based arrangement: With the circulant-based arrangement, in each group, port 0 is connected to the next group, and port 1 is connected to the previous group. Port 2 is connected to the second next group, port 3 to the second previous group, and so on. In other words, port i of group j is connected to group (i/2 + j + 1) mod g if i is even, and to group (−i/2 + j − 1) mod g if i is odd, where i/2 denotes integer division. This arrangement assumes that each router has an even number of optical links; it is depicted in Figure 2(c).

2) Local: In dragonfly architectures, local links are typically arranged as a flattened butterfly (e.g., Cray Cascade [8]) or as an all-to-all network (e.g., IBM PERCS [9]). In this paper, we assume all-to-all local link arrangements with a uniform local link bandwidth to isolate the impact of global link arrangements. We define the ratio of global link bandwidth to local link bandwidth as α.

B. Routing

A routing strategy defines the paths of message traversal through the network. Routing has been shown to play an important role in system performance on dragonflies [10], [11]. While a shortest-path routing strategy reduces the message latency in a dragonfly with low network utilization, it can lead to hot-spots on the global links when two groups intensively communicate with each other over a few links. In this work, we use two static routing strategies to study the impact of routing on different global link arrangements: minimal and Valiant.

1) Minimal: Minimal routing can be described as follows. Let us define the source and destination groups as GS and GD, and the source and destination routers as RS and RD, respectively. If GS ≠ GD, then select an intermediate router, Ra, which is inside GS and has a direct link to GD. Next, use the direct link from Ra to Rb (which is the router in GD). If Rb ≠ RD, use the shortest path to RD inside GD. The longest communication distance with this mechanism is 3 router-to-router hops, as we use all-to-all local connections.

2) Valiant [19]: Valiant routing aims to spread the traffic across the network to avoid hot-spots. If GS ≠ GD, this mechanism sends the message first to a randomly-selected intermediate group, then to RD, using minimal routing. If GS = GD, a random intermediate router is selected within the group. All messages travel at most 5 hops in Valiant routing, as we use all-to-all local connections.

C. Allocation

Allocation refers to the placement of incoming jobs onto the available machine nodes. Studies have shown that allocation has a significant impact on application performance on dragonflies, but there is no consensus in the community on which allocation algorithm maximizes the performance of dragonflies [13], [14], [15], [20]. In order to study the impact of global link arrangements in the presence of a comprehensive set of workload management algorithms, we consider three allocation techniques that have been shown to have fundamentally different characteristics [13]: cluster, spread, and random.

1) Cluster: Cluster first fills the available nodes in a single dragonfly group in order. Once there is no available node left, it continues with the next group.

2) Spread: This allocation strategy aims to spread a given job uniformly among groups by filling routers in a round-robin manner. It first uses all nodes in router 0 of group 0, then continues with the nodes in router 0 of group 1, and so on. Once all router 0s are occupied in all groups, it continues with router 1.

3) Random: This strategy simply selects random nodes from the set of all available nodes in the system.

D. Task Mapping

Task mapping refers to the mapping of the MPI ranks of a single job onto the compute cores located in the nodes selected by the allocation algorithm. The aim of task mapping is to place closely-communicating MPI ranks next to each other to reduce the communication overhead and network load. As HPC infrastructures typically do not have access to the communication pattern of incoming jobs, task mapping is performed by the HPC application once the job starts executing.

Commonly known task mapping strategies include graph-based approaches and linear mapping, which assigns the MPI ranks to the cores in a linear order. In this work, we would like to minimize the performance impact of task mapping so as to isolate the effect of global link arrangements. Thus, for each application, we select the task mapping algorithm that best suits the communication pattern. The task mapping algorithms we use are as follows:

1) Random: This task mapper randomly places an application's MPI ranks onto the allocated nodes. We use the random task mapper to average out the impact of the messaging order in applications with uniform communication (e.g., the all-to-all and bisection patterns; refer to Section III-C for further details).
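The three port-to-group rules above can be written down directly from the formulas in Section II-A. The sketch below is our own illustrative code (function names are ours, and this is not part of the SST framework); it assumes ports are numbered 0 through g − 2 within each group:

```python
def absolute_target(i, j):
    """Group reached by global port i of group j under the absolute arrangement."""
    return i if i < j else i + 1

def relative_target(i, j, g):
    """Relative arrangement: port i of group j reaches group (i + j + 1) mod g."""
    return (i + j + 1) % g

def circulant_target(i, j, g):
    """Circulant-based arrangement: even ports step forward, odd ports backward."""
    if i % 2 == 0:
        return (i // 2 + j + 1) % g
    return (-(i // 2) + j - 1) % g
```

For instance, with g = 9 as in Figure 2, port 0 of group 0 reaches group 1 under all three rules, while port 1 of group 0 reaches group 2 under absolute and relative but wraps back to group 8 (the "previous" group) under circulant.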

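To make the cluster and spread policies of Section II-C concrete, the following sketch generates the order in which each policy visits nodes. This is our own illustrative code, not SST's implementation, and it assumes a group-major node numbering (node id (grp·a + r)·p + n for node n of router r in group grp):

```python
def cluster_order(g, a, p):
    """Cluster: fill every node of group 0, then group 1, and so on.
    With group-major numbering this is simply ascending node ids."""
    return list(range(g * a * p))

def spread_order(g, a, p):
    """Spread: all p nodes of router 0 in each group (round-robin over
    groups), then router 1 in each group, and so on."""
    order = []
    for r in range(a):            # router index within a group
        for grp in range(g):      # visit the groups round-robin
            for n in range(p):    # nodes attached to this router
                order.append((grp * a + r) * p + n)
    return order
```

For a toy machine with g = 2 groups, a = 2 routers per group, and p = 1 node per router, cluster visits nodes [0, 1, 2, 3] while spread visits [0, 2, 1, 3], i.e., it alternates between groups before touching a second router.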
2) Recursive Graph Bisection [21]: This algorithm recursively splits the application's communication graph and the network topology graph into equal halves using minimum weighted edge-cuts. At the end of the recursion, the remaining MPI ranks are placed in the remaining compute cores. We use this task mapper for the 3D stencil pattern, as it has been shown to perform better than linear task mapping [21].

III. EXPERIMENTAL METHODOLOGY

Network links in real HPC machines are statically connected, and thus it is not easy to experiment with different link connections without having an entire HPC system specifically allocated for this purpose. Hence, we use packet-level simulations (instead of experiments on real machines) in order to compare network performance for different design assumptions.

Our proposed simulation framework enables users to evaluate the combined performance impact of various design parameters such as global link arrangements, allocation/mapping algorithms, and application communication patterns. To implement our framework, we extend the Structural Simulation Toolkit (SST), which has been developed by Sandia National Laboratories to assist in the design, evaluation, and optimization of HPC architectures and applications [18]. We add to SST the ability to perform packet-level simulations for custom dragonfly topologies along with dragonfly-specific routing and job allocation algorithms. Using these added features, we simulate workloads with different communication patterns to evaluate the impact of global link arrangements under a variety of scenarios. The rest of this section explains our simulation framework, the HPC machines we study, and the workloads we use in detail.

Fig. 3: Proposed simulation framework developed in SST, which integrates scheduling and network elements.

A. Simulation Framework

The SST simulator has been widely used by researchers in both academic and industrial institutions. The accuracy of SST has been validated in publications and by hardware vendors [22], [23], [24]. SST incorporates individual elements to model specific aspects of a modern data center in detail, such as a scheduler and job allocator, a network simulator, and a message passing simulator.

In the original version of SST, the scheduler element is a standalone module and is not connected to the detailed network simulator. It estimates the wall-clock time of applications through an average hop-distance-based model using empirical data. The standalone scheduler module with the hop-distance-based performance model is good for estimating the relative performance improvement of new allocation/mapping strategies, but it is not sufficient for evaluating the more complex behavior that depends on network link bandwidths, message sizes, and routing strategies. In addition, the existing detailed network simulator is unaware of the job allocation/task mapping algorithms, and it requires the user to manually define the nodes allocated for each job.

We propose a unified simulation framework that closes the gap between the scheduler and network simulation elements to provide a holistic and accurate evaluation of HPC data center performance. Our simulation framework is illustrated in Figure 3. We assume a job trace where jobs are defined by their arrival time, required number of processors, and the application communication pattern. We define the application communication patterns using application phase files. A phase file may include a single communication pattern (e.g., all-to-all), or a successive set of communication patterns representing the phases of an application. The scheduler element schedules the jobs (i.e., decides when to start running each job), allocates nodes for the jobs, and maps individual tasks of a job onto the allocated nodes, depending on the selected scheduling, allocation, and mapping policies. The scheduler element includes advanced allocation/mapping algorithms that are applicable to various network topologies such as 2D/3D mesh and torus. In addition, we implement the dragonfly topology along with dragonfly-specific job allocation algorithms (i.e., cluster, spread, random) in the scheduler element.

After the jobs are allocated and tasks are mapped onto the nodes, the ember and merlin elements simulate the network timing for sending/receiving packets from one end point in the network to another. Ember models the MPI routines used in current HPC applications, such as boundary exchange (i.e., stencil), all-to-all, and all-reduce. These MPI routines are named in SST as ember motifs. Using the motifs, ember implements the message traffic between the tasks of an application. We also add functionality to ember that allows the user to simulate custom-defined communication patterns. The merlin element works in cooperation with ember and models the behavior of routers, network interface cards (NICs), and the network routing algorithms. Merlin can capture the transient network behavior resulting from congestion, stalls, and routing in a cycle-accurate manner. We implement the dragonfly network model, which accounts for the global link arrangements, in merlin. Apart from the link arrangements described in Section II-A, we provide the user the flexibility to define custom link arrangements. We can also define bandwidths separately for (i) the links to the hosts, (ii) the local links within the group, and (iii) the global links across groups.

Communication between the scheduler and the merlin & ember elements is provided through a file interface in SST. The scheduler dumps the job allocation/task mapping information into a file. A Python script then converts this information into the input format that ember can use.

The scalability of the SST framework to simulate exascale systems has been demonstrated in recent work [25]. In that work, Groves et al. analyze the impact of the number of global links and link bandwidths on the performance and power consumption of dragonfly networks without considering allocation, task mapping, or global link arrangements. They model a dragonfly machine with 96 groups, 48 routers per group, and 24 nodes per router, which corresponds to a total of 110,592 nodes. In this work, we experiment with machines with smaller intra-group networks in order to focus on the impact of global link arrangements.

B. Target HPC Machines

We study global link arrangements in 3 different target dragonfly machines. For all of our machines, we use h = 2 optical links per router. We change the parameters listed in Table I (specifically, a, p, c, g) to experiment with different machine sizes. Our small machine uses the same dragonfly parameters used in the theoretical study of global link arrangements by Hastings et al. [17]: a = 4, p = 2, c = 2, and g = 9, corresponding to a total of 72 nodes and 144 cores. For our medium-size machine, we use a = 8, p = 2, c = 2, g = 17, corresponding to a total of 272 nodes and 544 cores. As for the large machine, we use a = 8, p = 4, c = 4, g = 17, which adds up to a total of 544 nodes and 2,176 cores.

We set the bandwidth of each link that is connected to the hosts and other routers within a group (i.e., local links) to 1 GB/s. We sweep the global link bandwidth from 0.5 GB/s to 4 GB/s in 0.5 GB/s steps (i.e., α from 0.5 to 4) to analyze the impact of global link bandwidths. In order to isolate the effect of crossbar bandwidth on the simulation results, we set the crossbar bandwidth to the sum of the maximum link bandwidths connected to a router.

C. Workloads

We focus on three communication patterns: all-to-all, bisection, and 3D stencil. We select the bisection pattern as it represents the bisection bandwidth of the network. All-to-all and 3D stencil patterns are commonly observed in real HPC applications [26].

All-to-all: This is a commonly known pattern where each task communicates with every one of the other tasks in an application. Fast Fourier Transform (FFT) is an example application with uniform all-to-all communication [27]. During our simulations, we observed that task mapping has an impact on performance even for the all-to-all communication pattern due to the scheduling order of the messages. To average out the impact of task mapping, we run all-to-all workloads 15 times with random task mapping and select the case with the median running time for our analysis.

Bisection: We use this communication pattern to account for the bisection bandwidth of the dragonfly network and to compare our results against the theoretical analysis from prior work by Hastings et al. [17]. In this pattern, the tasks are divided into two equal-sized groups (group #1 and #2), where a task from group #1 communicates with every task in group #2 and none of the tasks in group #1. In order to achieve the minimum bisection bandwidth between groups #1 and #2, we use the network cuts provided by Hastings et al. for the small machine size. These network cuts are also called minimum cuts; thus, the resulting application running time is a good representation of the bisection bandwidth. Once we decide which group of tasks occupies which nodes based on the minimum cuts, we run the bisection workload 15 times with random task mapping using those nodes. We select the case with the maximum running time for this analysis, as the bisection bandwidth represents the worst case.

For other machine sizes for which the minimum cuts are not readily available, we use the cut that takes the first half of the nodes in order, starting from node #0. Similarly, we apply random task mapping using these nodes and select the case with the maximum running time.

3D Stencil: This messaging pattern is observed in a large portion of HPC applications and is explicitly supported by MPI [28]. Examples of real-life applications with 3D stencil communication include multi-dimensional shock-wave analysis [29] and molecular dynamics [30]. In this pattern, application tasks exchange messages with their six nearest Cartesian neighbors (i.e., in the x, y, and z directions). In SST, we use the Halo3D motif in ember, which models the 3D-stencil MPI routine. As the running time of stencil jobs significantly depends on task mapping [21], we use recursive graph bisection to efficiently place application tasks.

For each of these communication patterns, we specify parameters such as the message size, the number of iterations, and the compute time per iteration. In order to focus on the communication overhead, we set the compute time to zero. We set the number of iterations for each application such that we allow enough time for the network to reach steady state. In our experiments, we use 3 different message sizes: 100KB, 1000KB, and 4000KB.

We assume single-job and multi-job cases, where all jobs collectively occupy 100% of the dragonfly machine. High system utilization is a common case in HPC and also makes it easier to observe how different conditions (communication patterns, message size, etc.) lead to network congestion. In the multi-job case, we assume two jobs with the same communication pattern are running simultaneously, with each job occupying 50% of the machine. We focus on a single run of the jobs instead of job traces, which means that all jobs arrive and get allocated at the same time.
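The node and core counts of the three target machines follow directly from the Table I parameters: a machine has g · a · p nodes and g · a · p · c cores. A quick arithmetic check (our own illustrative code, names ours):

```python
def machine_size(a, p, c, g):
    """Total (nodes, cores) of a dragonfly machine, per the Table I notation."""
    nodes = g * a * p
    return nodes, nodes * c

# The three target machines from Section III-B:
assert machine_size(a=4, p=2, c=2, g=9)  == (72, 144)    # small
assert machine_size(a=8, p=2, c=2, g=17) == (272, 544)   # medium
assert machine_size(a=8, p=4, c=4, g=17) == (544, 2176)  # large
```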

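Each workload above is ultimately a set of (source, destination) task pairs. As one example, the 3D stencil's six-neighbor exchange can be sketched as follows; this is our own illustrative code with an assumed x-fastest task numbering, not SST's Halo3D motif, and it omits the wraparound some stencil codes use:

```python
def stencil_neighbors(task, nx, ny, nz):
    """Six nearest Cartesian neighbors of a task in an nx x ny x nz grid,
    with tasks numbered x-fastest: task = x + nx * (y + ny * z)."""
    x, y, z = task % nx, (task // nx) % ny, task // (nx * ny)
    neighbors = []
    for dx, dy, dz in [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                       (0, -1, 0), (0, 0, 1), (0, 0, -1)]:
        X, Y, Z = x + dx, y + dy, z + dz
        if 0 <= X < nx and 0 <= Y < ny and 0 <= Z < nz:  # drop out-of-grid pairs
            neighbors.append(X + nx * (Y + ny * Z))
    return neighbors
```

Under these assumptions, a corner task such as task 0 in a 2 × 2 × 2 grid exchanges with only three neighbors, whereas an interior task exchanges with all six, which is why the stencil pattern injects fewer messages than all-to-all on the same node count.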
Fig. 4: Running time of the bisection communication pattern Fig. 5: Running time of the bisection communication pattern
using different global link arrangements with minimal routing. using different global link arrangements with Valiant routing.
All results are normalized with respect to absolute arrangement All results are normalized with respect to absolute arrangement
at α = 0.5. at α = 0.5.

IV. A NALYSIS ON G LOBAL L INK A RRANGEMENTS


This section presents our results on the impact of global link
arrangements along with our findings on the coupling between
link arrangements, communication patterns, routing, and job
placement. We start our analysis using a single job that utilizes
the entire machine to avoid the impacts of job allocation and
inter-job interference, and continue with using multiple jobs.
A. Single Job Analysis
Here, we analyze the impact of global link arrangements on
the communication overhead when the entire HPC machine is
allocated to a single job. First, we validate the theoretical study Fig. 6: Running time of the all-to-all communication pattern
of Hastings et al. [17] on how bisection bandwidth changes using different global link arrangements with minimal routing.
with link arrangements, and then focus on communication All results are normalized with respect to absolute arrangement
patterns that are commonly seen in HPC applications. at α = 0.5.
1) Bisection Bandwidth: In the bisection communication pat-
tern, the first half of the dragonfly communicates with the
The impact of global link arrangements on running time
second half. When used with minimal routing, the running
decreases slightly when we use Valiant routing. As shown in
time of this pattern is a good representation of system’s
Figure 5, circulant-based arrangement leads to 3-7% shorter
bisection bandwidth where the network hot-spots are also
running time in comparison to absolute arrangement. The trend
taken into account.
for bisection bandwidth is similar for other message sizes.
Hastings et al. [17] have theoretically shown that the bi-
The reason why circulant arrangement performs better lies
section bandwidth of the absolute global link arrangement
behind how balanced the bisection bandwidth cuts are across
falls behind the other two global arrangements as the ratio
the groups. A cut is a balanced cut if equal number of routers
of global to local link bandwidth, α, is increased above 1.25.
lie on each side of the cut in each group. For groups with a
Between α = 1.25 and α = 4, circulant-based arrangement
multiple of four routers, with circulant arrangement at α = 4,
provides the highest bisection bandwidth; and at α = 4,
the minimum bisection cut is completely balanced across all
both circulant-based and relative arrangements have the same
the groups (i.e., 2 routers per group). Moreover, the circulant
bisection bandwidth.
arrangement achieves a balanced cut for a lower α than that
Our results demonstrate a similar pattern as shown in
of the relative arrangement, while absolute arrangement never
Figure 4 for 1000KB message size. We report all results in
reaches a balanced cut.
terms of normalized running time. Running time with circulant
arrangement is 5-15% shorter than the other two arrangements 2) Realistic Workloads: The above results use bisection pat-
for α > 1.5. Circulant arrangement provides better perfor- tern, which is not typical in HPC workloads. This section
mance with increasing α values. For all α values, performance presents our results with all-to-all and stencil communication
difference between absolute and relative arrangements is very patterns.
small within 4%. Unlike bisection pattern, global link arrangements do not

330
Fig. 8: Running time of all-to-all workload with different {link
Fig. 7: Running time of the stencil communication pattern arrangement, routing algorithm} pairs and α = 4. For each
using different global link arrangements with minimal routing message size, the results are normalized with respect to the
for 4000KB message size. All results are normalized with {absolute-minimal} pair of the corresponding message size.
respect to absolute arrangement at α = 0.5.

have a significant impact on running time for all-to-all and


stencil patterns. Figure 6 shows that the running time differ-
ence due to link arrangements is below 1% for any given α
for all-to-all communication.
In Figure 7, we show the same trend for the stencil pattern. As the stencil pattern injects fewer messages into the network compared to the bisection and all-to-all patterns, we present results with 4000KB message size, which achieves a similar network load. The performance difference among the global link arrangements is less than 3% for all α values.

The independence of the all-to-all pattern's running time from the global link arrangements also holds for different message sizes and routing algorithms, as shown in Figure 8. However, we observe that the routing mechanism has a significant impact on running time. For small message sizes such as 100KB, Valiant routing effectively spreads network traffic and avoids bottlenecks; thus, for 100KB message size, Valiant reduces running time by 15%. On the other hand, Valiant routing increases the average traffic as it does not use the minimum path. As a result, for larger message sizes, Valiant routing introduces new network bottlenecks and results in up to 5% longer running time compared to minimal routing.

Fig. 9: Maximum running time variation of the all-to-all communication pattern under random task mapping with different {link arrangement, routing algorithm} pairs and α = 0.5. Variation shown is relative to the median value.

Another interesting observation of our study concerns the impact of task mapping. As mentioned in Section III-C, our results for the all-to-all pattern belong to the case with the median running time among 15 simulations with random task mapping. In Figure 9, we present the variation in running time among these 15 simulations as a percentage of the median running time for α = 0.5. While the variation is between 3-11% with minimal routing, depending on the specific task mapping used by the random task mapper, it is below 2.5% with Valiant routing. As Valiant routing has inherent randomization in its intermediate router/group selection, it minimizes the impact of task mapping by randomly spreading traffic. For α = 4, our results show that the variation drops below 5% with minimal routing due to the increased network bandwidth, whereas it stays below 2% with Valiant routing.

3) Medium and Large Machines: For the medium machine with 17 groups, we repeat our analysis of the global link arrangements with the bisection pattern. Figure 10 shows similar trends for bisection bandwidth, where we observe 2-8% shorter running time with the circulant arrangement in comparison to the absolute arrangement. For the medium machine size, the relative arrangement starts to perform closer to the circulant arrangement than to the absolute arrangement, indicating that the performance difference among the global link arrangements is not specific to the small machine size.

For the large machine with 17 groups and 2,176 cores, we carry out simulations with the all-to-all and bisection patterns to analyze the impact of increasing the number of tasks per local router on the resulting application running time. The results are consistent with the small and medium-size machines: the performance of the all-to-all pattern shows negligible difference across the global link arrangements (less than 2%). Similarly, for the bisection pattern, we observe that the circulant arrangement provides 3-10% shorter running time compared to the absolute arrangement.

B. Multiple Job Analysis

We next explore the dependency of performance on various network parameters when two jobs are running together. When considering multiple jobs, an important aspect comes into the picture, namely, the job allocation algorithm. We focus on three main job allocation strategies: cluster, spread, and
Fig. 10: Running time of the bisection communication pattern using different global link arrangements with minimal routing for 100KB message size on the large system. All results are normalized with respect to the absolute arrangement at α = 0.5.

random allocation (see Section II-C). For this analysis, we use the small machine and the stencil communication pattern, and consider two jobs with the stencil pattern running together (a common HPC scenario, e.g., when users submit the same stencil job with different inputs). We report the running time of the slower job.

In the previous results in Section IV-A, we have shown that the circulant arrangement provides slightly shorter running time for the stencil application. Thus, in our next analysis, we focus on the circulant arrangement. Figure 11 compares the running time for each {routing, allocation} algorithm pair over different α values. The results are normalized to the {minimal, cluster} case at the corresponding α value. An interesting observation is that how well a {routing, allocation} pair performs is a function of the bandwidth ratio α. At α = 0.5, Valiant and minimal routing behave rather similarly, with {minimal, random} being the best choice. As α increases, the pairs involving the Valiant algorithm start to perform poorly. The reason is that the global links become less congested with increasing α. When α is large, local link congestion becomes the bottleneck, and Valiant routing amplifies this effect by increasing the overall network traffic. This results in an approximately 44% performance difference between the best and worst pairs ({minimal, spread} and {valiant, random}, respectively) at α = 4.

Fig. 11: Running time of the stencil communication pattern using the circulant arrangement with different {routing, allocation} algorithm pairs. Results are normalized with respect to the {minimal, cluster} pair at the corresponding α value.

C. Summary of Our Key Findings

Our performance analysis of dragonfly networks, covering various aspects, has led to the following key findings:
• Detailed network simulation closely follows the theory regarding the impact of global link arrangements on bisection bandwidth. We show that the circulant arrangement provides up to 15% shorter running time compared to the absolute arrangement for the bisection communication pattern with minimal routing.
• In contrast to the bisection communication pattern, global link arrangements do not have a significant impact on performance (i.e., <3%) for realistic patterns such as all-to-all and 3D stencil.
• We show that task mapping can significantly impact the application running time even for the all-to-all pattern due to the scheduling order of the messages. We observe up to 11% performance variation for all-to-all jobs due to task mapping.
• For realistic workloads, the choice of the {routing, allocation} algorithm pair has a larger impact on performance than the link arrangement, and the best performing pair depends on α. As α increases, traffic on the local links becomes a bottleneck; because Valiant routing creates more traffic on all links, it results in up to 44% longer running times in comparison to minimal routing at α = 4.

V. RELATED WORK

As dragonfly network topologies [6] gain popularity due to their high bisection bandwidth and low diameter, researchers have investigated how to improve the performance of dragonflies. Most of these studies use simulators to experiment easily with different dragonfly settings with clear visibility into the system. One group of existing studies utilizes high-level simulators, where the performance evaluation metrics are based on link usage [31], [32]. Low-level simulators exist, but they either do not model different global link arrangements [33] or focus on network packet delays without considering job placement algorithms [34]. In our work, we propose a unified simulation framework that is able to simulate the combination of such factors.

Based on these simulators, various techniques have been proposed to improve dragonfly performance. Prisacari et al. [10] introduce dragonfly-specific hierarchical all-to-all exchange patterns that reduce all-to-all communication time by up to 45%. Fuentes et al. [11] focus on traffic patterns that portray real use cases and study the network unfairness caused by existing routing mechanisms with the relative global link arrangement. Yébenes et al. [12] introduce a packet queuing scheme to reduce message stalls with minimal-path routing. These works specifically target improving routing-related issues in dragonflies.

Researchers have identified a strong coupling between routing and job allocation in dragonflies. Bhatele et al. [13] conduct a study on the IBM PERCS architecture [9] and conclude
that default MPI rank-ordered allocation leads to significant network congestion when used together with minimal routing. They claim that (1) indirect routing obviates the need for intelligent job allocation and (2) random allocation gives the best performance when direct routing is used. Following up on this work, Chakaravarthy et al. [14] show that transpose communication patterns [35] benefit from direct routing, and Prisacari et al. demonstrate that random allocation with minimal routing is consistently outperformed by Cartesian job placement with indirect routing for stencil communication patterns [15]. These works focus on allocation and routing strategies but do not consider different global link arrangements.

Several studies investigate the global links in dragonflies. Bhatele et al. [32] investigate the impact of changing the number of router links on performance. Groves et al. [25] explore the effect of the number of global links and link bandwidths on the performance and power consumption of dragonfly networks, considering the absolute arrangement only. Camarero et al. [16] introduce the three global link arrangements we use in this paper; however, they do not provide any performance comparison among them and instead focus on the impact of the routing mechanisms. Hastings et al. conduct a theoretical analysis of global link arrangements and show that the commonly used absolute link arrangement leads to a smaller bisection bandwidth when the ratio of global to local link bandwidths is larger than 1.25 [17]. Wen et al. propose Flexfly, a reconfigurable network architecture for the global links using low-radix optical switches [36]. Flexfly modifies inter-group connections based on the observed network traffic to mitigate the need for indirect routing. However, it requires knowledge of application traffic patterns, which is not easy to obtain because the application type is typically unknown to HPC systems and traffic patterns may depend on application input.

To the best of our knowledge, our work is the first to experimentally evaluate the impact of different global link arrangements on performance in tandem with link bandwidths, communication patterns, job placement algorithms, and routing mechanisms.

VI. CONCLUSION

In this paper, we present a thorough analysis of previously unexplored aspects of dragonfly networks. For this purpose, we first propose a simulation framework that is able to evaluate the combined impact of global link arrangements, link bandwidths, and job allocation and routing algorithms on the application running time. We then compare the performance of several known global link arrangements and show that the circulant arrangement provides up to 15% reduction in communication time for the bisection communication pattern. We demonstrate that for common MPI communication patterns, the impact of global link arrangements is of less significance. On the other hand, for the same MPI patterns, we find that the choice of job allocation and routing algorithms is highly important, leading to up to 44% difference in communication overhead, and that the best choice depends on the bandwidth ratio between the global and local links. Finally, we show that task mapping in dragonflies can result in up to 11% variation in the application running time, even for the all-to-all communication pattern, due to the scheduling order of the messages.

ACKNOWLEDGMENT

This work has been partially funded by Sandia National Laboratories. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

REFERENCES

[1] J. Dongarra et al., "The international exascale software project roadmap," Int. J. High Perform. Comput. Appl., vol. 25, no. 1, pp. 3–60, Feb. 2011.
[2] D. Abts, M. R. Marty, P. M. Wells, P. Klausler, and H. Liu, "Energy proportional datacenter networks," in Proceedings of the 37th Annual International Symposium on Computer Architecture, ser. ISCA '10, 2010, pp. 338–347.
[3] M. Besta and T. Hoefler, "Slim Fly: A cost effective low-diameter network topology," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '14, 2014, pp. 348–359.
[4] J. Kim, W. J. Dally, B. Towles, and A. K. Gupta, "Microarchitecture of a high-radix router," in Proceedings of the 32nd Annual International Symposium on Computer Architecture, ser. ISCA '05, 2005, pp. 420–431.
[5] P. Dong, X. Liu, S. Chandrasekhar, L. L. Buhl, R. Aroca, and Y. K. Chen, "Monolithic silicon photonic integrated circuits for compact 100+ Gb/s coherent optical receivers and transmitters," IEEE Journal of Selected Topics in Quantum Electronics, vol. 20, no. 4, pp. 150–157, July 2014.
[6] J. Kim, W. J. Dally, S. Scott, and D. Abts, "Technology-driven, highly-scalable dragonfly topology," in 35th International Symposium on Computer Architecture, ser. ISCA '08, June 2008, pp. 77–88.
[7] B. Alverson, E. Froese, L. Kaplan, and D. Roweth, "Cray XC series network," Cray, Inc., White paper, 2012.
[8] G. Faanes, A. Bataineh, D. Roweth, T. Court, E. Froese, B. Alverson, T. Johnson, J. Kopnick, M. Higgins, and J. Reinhard, "Cray Cascade: A scalable HPC system based on a dragonfly network," in International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Nov 2012, pp. 1–9.
[9] B. Arimilli, R. Arimilli, V. Chung, S. Clark, W. Denzel, B. Drerup, T. Hoefler, J. Joyner, J. Lewis, J. Li, N. Ni, and R. Rajamony, "The PERCS high-performance interconnect," in 2010 18th IEEE Symposium on High Performance Interconnects, Aug 2010.
[10] B. Prisacari, G. Rodriguez, and C. Minkenberg, "Generalized hierarchical all-to-all exchange patterns," in 2013 IEEE 27th International Symposium on Parallel & Distributed Processing (IPDPS), May 2013, pp. 537–547.
[11] P. Fuentes, E. Vallejo, C. Camarero, R. Beivide, and M. Valero, "Throughput unfairness in dragonfly networks under realistic traffic patterns," in 2015 IEEE International Conference on Cluster Computing, Sept 2015, pp. 801–808.
[12] P. Yébenes, J. Escudero-Sahuquillo, P. J. García, and F. J. Quiles, "Straightforward solutions to reduce HoL blocking in different dragonfly fully-connected interconnection patterns," The Journal of Supercomputing, pp. 1–23, 2016.
[13] A. Bhatele, W. D. Gropp, N. Jain, and L. V. Kale, "Avoiding hot-spots on two-level direct networks," in 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Nov 2011, pp. 1–11.
[14] V. T. Chakaravarthy, M. Kedia, Y. Sabharwal, N. P. K. Katta, R. Rajamony, and A. Ramanan, "Mapping strategies for the PERCS architecture," in 2012 19th International Conference on High Performance Computing (HiPC), Dec 2012, pp. 1–10.
[15] B. Prisacari, G. Rodriguez, P. Heidelberger, D. Chen, C. Minkenberg, and T. Hoefler, "Efficient task placement and routing of nearest neighbor exchanges in dragonfly networks," in Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC '14, 2014, pp. 129–140.
[16] C. Camarero, E. Vallejo, and R. Beivide, "Topological characterization of Hamming and dragonfly networks and its implications on routing," ACM Trans. Archit. Code Optim., vol. 11, no. 4, pp. 39:1–39:25, Dec. 2014.
[17] E. Hastings, D. Rincon-Cruz, M. Spehlmann, S. Meyers, A. Xu, D. P. Bunde, and V. J. Leung, "Comparing global link arrangements for dragonfly networks," in 2015 IEEE International Conference on Cluster Computing, Sept 2015, pp. 361–370.
[18] A. Rodrigues, E. Cooper-Balis, K. Bergman, K. Ferreira, D. Bunde, and K. S. Hemmert, "Improvements to the structural simulation toolkit," in Proceedings of the 5th International ICST Conference on Simulation Tools and Techniques, ser. SIMUTOOLS '12, 2012, pp. 190–195.
[19] L. G. Valiant and G. J. Brebner, "Universal schemes for parallel communication," in Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, ser. STOC '81, 1981, pp. 263–277.
[20] N. Jain, A. Bhatele, X. Ni, N. J. Wright, and L. V. Kale, "Maximizing throughput on a dragonfly network," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '14. Piscataway, NJ, USA: IEEE Press, 2014, pp. 336–347. [Online]. Available: http://dx.doi.org/10.1109/SC.2014.33
[21] T. Hoefler and M. Snir, "Generic topology mapping strategies for large-scale parallel architectures," in Proceedings of the International Conference on Supercomputing, ser. ICS '11, 2011, pp. 75–84.
[22] A. F. Rodrigues, K. S. Hemmert, B. W. Barrett, C. Kersey, R. Oldfield, M. Weston, R. Risen, J. Cook, P. Rosenfeld, E. CooperBalls, and B. Jacob, "The structural simulation toolkit," SIGMETRICS Perform. Eval. Rev., vol. 38, no. 4, pp. 37–42, Mar. 2011. [Online]. Available: http://doi.acm.org/10.1145/1964218.1964225
[23] M.-y. Hsieh, A. Rodrigues, R. Riesen, K. Thompson, and W. Song, "A framework for architecture-level power, area, and thermal simulation and its application to network-on-chip design exploration," SIGMETRICS Perform. Eval. Rev., vol. 38, no. 4, pp. 63–68, Mar. 2011. [Online]. Available: http://doi.acm.org/10.1145/1964218.1964229
[24] K. D. Underwood, M. Levenhagen, and A. Rodrigues, "Simulating Red Storm: Challenges and successes in building a system simulation," in 2007 IEEE International Parallel and Distributed Processing Symposium, March 2007, pp. 1–10.
[25] T. Groves, R. E. Grant, S. Hemmert, S. Hammond, M. Levenhagen, and D. C. Arnold, "(SAI) stalled, active and idle: Characterizing power and performance of large-scale dragonfly networks," in 2016 IEEE International Conference on Cluster Computing (CLUSTER), Sept 2016, pp. 50–59.
[26] K. Antypas, "NERSC-6 workload analysis and benchmark selection process," Lawrence Berkeley National Laboratory, 2008.
[27] J. Meng, E. Llamosí, F. Kaplan, C. Zhang, J. Sheng, M. Herbordt, G. Schirner, and A. K. Coskun, "Communication and cooling aware job allocation in data centers for communication-intensive workloads," J. Parallel Distrib. Comput., vol. 96, no. C, pp. 181–193, Oct. 2016. [Online]. Available: http://dx.doi.org/10.1016/j.jpdc.2016.05.016
[28] W. Gropp, T. Hoefler, R. Thakur, and E. Lusk, Using Advanced MPI: Modern Features of the Message-Passing Interface. MIT Press, Nov. 2014.
[29] E. S. Hertel, Jr., R. L. Bell, M. G. Elrick, A. V. Farnsworth, G. I. Kerley, J. M. McGlaun, S. V. Petney, S. A. Silling, P. A. Taylor, and L. Yarrington, "CTH: A software family for multi-dimensional shock physics analysis," in Proceedings of the 19th International Symposium on Shock Waves, 1993, pp. 377–382.
[30] S. Plimpton, "Fast parallel algorithms for short-range molecular dynamics," J. Comput. Phys., vol. 117, no. 1, pp. 1–19, Mar. 1995. [Online]. Available: lammps.sandia.gov
[31] G. Zheng, T. Wilmarth, P. Jagadishprasad, and L. V. Kalé, "Simulation-based performance prediction for large parallel machines," International Journal of Parallel Programming, vol. 33, no. 2, pp. 183–207, 2005.
[32] A. Bhatele, N. Jain, Y. Livnat, V. Pascucci, and P.-T. Bremer, "Evaluating system parameters on a dragonfly using simulation and visualization," Tech. Rep., July 2015.
[33] M. Garcia, P. Fuentes, M. Odriozola, E. Vallejo, and R. Beivide, FOGSim interconnection network simulator. [Online]. Available: http://fuentesp.github.io/fogsim/
[34] P. Yébenes, J. Escudero-Sahuquillo, P. J. García, and F. J. Quiles, "Towards modeling interconnection networks of exascale systems with OMNeT++," in 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, Feb 2013, pp. 203–207.
[35] P. Luszczek, J. J. Dongarra, D. Koester, R. Rabenseifner, B. Lucas, J. Kepner, J. McCalpin, D. Bailey, and D. Takahashi, "Introduction to the HPC Challenge benchmark suite," Tech. Rep., 2005.
[36] K. Wen, P. Samadi, S. Rumley, C. P. Chen, Y. Shen, M. Bahadori, J. Wilke, and K. Bergman, "Flexfly: Enabling a reconfigurable dragonfly through silicon photonics," in International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Nov 2016.