Unveiling The Interplay Between Global Link Arrangements and Network Management Algorithms On Dragonfly Networks
Fig. 2: Dragonfly networks adopting different global link arrangements: a) absolute, b) relative, c) circulant-based arrangement.
Each configuration uses g = 9, a = 4, h = 2, and all-to-all local connections. The boxes represent the routers.
of these arrangements with routing mechanisms, job allocation algorithms, or communication patterns.

This paper studies the impact of global link arrangements on network performance in tandem with routing mechanisms, job placement algorithms, and application communication patterns. To enable such analysis, we design a packet-level simulation framework that unifies network design parameters with communication patterns and accurately models the wall-clock time of HPC applications. Our specific contributions are as follows:

• We introduce a packet-level simulation framework based on the Structural Simulation Toolkit (SST) [18]. Our framework estimates the performance of HPC applications with consideration of the underlying network topology by closing the gap between job allocation/task mapping and detailed network simulation. This framework allows simulating the combined impact of allocation/mapping decisions as well as network and application properties on the performance of HPC applications.

• Using our framework, we evaluate the performance of global link arrangements considering a set of job allocation and routing mechanisms for various communication patterns, and demonstrate that the circulant global link arrangement provides up to 15% lower communication overhead compared to the other arrangements when the network is highly loaded, owing to its higher bisection bandwidth. We also show that the impact of global arrangements on application performance can be limited for common MPI patterns.

• We show that task mapping can substantially impact application running time, and that the level of this impact is also affected by the routing mechanism and the network load. In our experiments with selected applications, we show that task mapping affects application running time by up to 11%.

• We demonstrate that the combined impact of the job allocation and routing mechanism highly depends on the bandwidth ratio of the global links over the local links. For high bandwidth ratios, our analysis shows up to 44% difference in communication overhead between the best and worst performing {routing, allocation} pairs.

The rest of the paper starts by providing background on dragonfly networks. Section III presents the details of our proposed simulation framework together with our target HPC machines and workload assumptions. In Section IV, we provide the results of our performance analysis regarding different aspects of dragonfly networks. Section V describes the prior work on dragonflies and Section VI concludes the paper.

II. BACKGROUND ON DRAGONFLY NETWORKS

The dragonfly topology [6] is a two-level hierarchical direct network based on high-radix routers. We use the parameters presented in Table I to describe a dragonfly topology.

c  Number of cores per node
p  Number of nodes connected to a router
a  Number of routers in a group
g  Number of groups
h  Number of optical links on a router
TABLE I: Notation for the dragonfly parameters.

In the first hierarchical level, a routers constitute a group and are connected by local electrical links, typically with an all-to-all or a flattened-butterfly network topology. Each router is connected to p compute nodes and has h optical links that form an inter-group network. The routers in a group collectively act as a virtual router with a · p connections to compute nodes and a · h connections to other groups. The second hierarchical level consists of g of these virtual routers, typically connected with an all-to-all topology.

The rest of this section describes the link arrangements, routing mechanisms, and job placement strategies we use in our study of global link arrangements.

A. Link Arrangements

There are two link groups in a dragonfly network: global and local. Global links refer to the optical inter-group cables, whereas local links connect routers within a single group.

1) Global: We use three different global link arrangements and comply with the terminology used by Hastings et al. [17].

Absolute arrangement: In the absolute arrangement, the first available port in group 0 is connected to the first available port in group 1. Then, the next available port in group 0 is connected to the first available port in group 2. This continues until group 0 is linked to all other groups. We apply the same procedure to the remaining groups in order. As a result, port i of group j is connected to group i if i < j, and to group i + 1 otherwise. Figure 2(a) depicts an absolute arrangement.

Relative arrangement: All groups have identical relative connections in this arrangement. As shown in Figure 2(b), in each group, port 0 is connected to the next group, port 1 is connected to the second next group, and so on. In other words, port i of group j is connected to group (i + j + 1) mod g, where g is the total number of groups.

Circulant-based arrangement: With the circulant-based arrangement, in each group, port 0 is connected to the next group, and port 1 is connected to the previous group. Port 2 is connected to the second next group, and port 3 is connected to the second previous group, and so on. In other words, port i of group j is connected to group (⌊i/2⌋ + j + 1) mod g if i is even, and to group (−⌊i/2⌋ + j − 1) mod g if i is odd. This arrangement assumes that each router has an even number of optical links, and it is depicted in Figure 2(c).

2) Local: In dragonfly architectures, local links are typically arranged as a flattened butterfly (e.g., Cray Cascade [8]) or as an all-to-all network (e.g., IBM PERCS [9]). In this paper, we assume all-to-all local link arrangements with a uniform local link bandwidth to isolate the impact of global link arrangements. We define the ratio of global link bandwidth to local link bandwidth as α.

B. Routing

A routing strategy defines the paths of message traversal through the network. Routing has been shown to play an important role in the system performance of dragonflies [10], [11]. While a shortest-path routing strategy reduces the message latency in a dragonfly with low network utilization, it can lead to hot-spots on the global links when two groups intensively communicate with each other over a few links. In this work, we use two static routing strategies to study the impact of routing on different global link arrangements: minimal and Valiant.

1) Minimal: Minimal routing can be described as follows. Let us define the source and destination groups as GS and GD, and the source and destination routers as RS and RD, respectively. If GS ≠ GD, then select an intermediate router, Ra, which is inside GS and has a direct link to GD. Next, use the direct link from Ra to Rb (which is the router in GD). If Rb ≠ RD, use the shortest path to RD inside GD. The longest communication distance with this mechanism is 3 router-to-router hops, as we use all-to-all local connections.

2) Valiant [19]: Valiant routing aims to spread the traffic across the network to avoid hot-spots. If GS ≠ GD, this mechanism sends the message first to a randomly-selected intermediate group, then to RD, using minimal routing. If GS = GD, a random intermediate router is selected within the group. All messages travel at most 5 hops in Valiant routing, as we use all-to-all local connections.

C. Allocation

Allocation refers to the placement of incoming jobs onto the available machine nodes. Studies have shown that allocation has a significant impact on application performance on dragonflies, but there is no consensus in the community on which allocation algorithm maximizes the performance of dragonflies [13], [14], [15], [20]. In order to study the impact of global link arrangements in the presence of a comprehensive set of workload management algorithms, we consider three allocation techniques that are shown to have fundamentally different characteristics [13]: cluster, spread, and random.

1) Cluster: Cluster first fills the available nodes in a single dragonfly group in order. Once there is no available node left, it continues with the next group.

2) Spread: This allocation strategy aims to spread a given job uniformly among groups by filling routers in a round-robin manner. It first uses all nodes in router 0 of group 0, then continues with the nodes in router 0 of group 1, and so on. Once all router 0s are occupied in all groups, it continues with router 1.

3) Random: This strategy simply selects random nodes from the set of all available nodes in the system.

D. Task Mapping

Task mapping refers to the mapping of the MPI ranks of a single job onto the compute cores located in the nodes selected by the allocation algorithm. The aim of task mapping is to place closely-communicating MPI ranks next to each other to reduce the communication overhead and network load. As HPC infrastructures typically do not have access to the communication pattern of incoming jobs, task mapping is performed by the HPC application once the job starts executing.

Commonly known task mapping strategies include graph-based approaches and linear mapping, which assigns the MPI ranks to the cores in a linear order. In this work, we would like to minimize the performance impact of task mapping so as to isolate the effect of global link arrangements. Thus, for each application, we select the task mapping algorithm that best suits the communication pattern. The task mapping algorithms we use are as follows:

1) Random: This task mapper randomly places the application's MPI ranks onto the allocated nodes. We use the random task mapper to average out the impact of the messaging order in applications with uniform communication (e.g., the all-to-all and bisection patterns; refer to Section III-C for further details).
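The three arrangement formulas in Section II-A can be written as small functions mapping port i of group j to a target group. This is an illustrative sketch (not the simulator's implementation), using the Fig. 2 configuration with g = 9 groups and a · h = 8 global ports per group:

```python
def absolute(i, j, g):
    # Port i of group j connects to group i if i < j, else to group i + 1.
    return i if i < j else i + 1

def relative(i, j, g):
    # Port i of group j connects to group (i + j + 1) mod g.
    return (i + j + 1) % g

def circulant(i, j, g):
    # Even port i reaches the (i/2 + 1)-th next group; odd port i reaches
    # the (floor(i/2) + 1)-th previous group.
    if i % 2 == 0:
        return (i // 2 + j + 1) % g
    return (-(i // 2) + j - 1) % g

g = 9        # groups, as in Fig. 2 (a = 4, h = 2)
ports = 8    # a * h = g - 1 global ports per group
for arrange in (absolute, relative, circulant):
    for j in range(g):
        # Each group's ports must reach every other group exactly once.
        assert {arrange(i, j, g) for i in range(ports)} == set(range(g)) - {j}
```

Running the loop confirms that, for this balanced configuration, all three arrangements realize the all-to-all second level; they differ only in which port leads to which group.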
327
2) Recursive Graph Bisection [21]: This algorithm recursively splits the application's communication graph and the network topology graph into equal halves using minimum weighted edge-cuts. At the end of the recursion, the remaining MPI ranks are placed in the remaining compute cores. We use this task mapper for the 3D stencil pattern, as it has been shown to perform better than linear task mapping [21].
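The three allocation strategies in Section II-C reduce to different orderings over node IDs. The (group, router, slot) encoding below is an assumed illustration, not the scheduler's data structure:

```python
import random

def cluster_order(g, a, p):
    # Fill every node of group 0, then group 1, and so on.
    return [(grp, r, s) for grp in range(g) for r in range(a) for s in range(p)]

def spread_order(g, a, p):
    # Round-robin over groups: router 0 of each group first, then router 1, ...
    return [(grp, r, s) for r in range(a) for grp in range(g) for s in range(p)]

def random_order(g, a, p, seed=0):
    # Pick nodes uniformly at random from all available nodes.
    nodes = cluster_order(g, a, p)
    random.Random(seed).shuffle(nodes)
    return nodes

# Small machine: g = 9 groups, a = 4 routers/group, p = 2 nodes/router.
assert cluster_order(9, 4, 2)[:4] == [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
assert spread_order(9, 4, 2)[:4] == [(0, 0, 0), (0, 0, 1), (1, 0, 0), (1, 0, 1)]
```

In this sketch, a job of k nodes would take the first k free entries of the chosen ordering.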
model, which accounts for the global link arrangements, in merlin. Apart from the link arrangements described in Section II-A, we provide the user the flexibility to define custom link arrangements. We can also define bandwidths separately for (i) the links to the hosts, (ii) the local links within the group, and (iii) the global links across groups.

Communication between the scheduler and the merlin & ember elements is provided through a file interface in SST. The scheduler dumps the job allocation/task mapping information into a file. A Python script then converts this information into the input format that ember can use.

The scalability of the SST framework to simulate exascale systems has been demonstrated in recent work [25]. In their work, Groves et al. analyze the impact of the number of global links and link bandwidths on the performance and power consumption of dragonfly networks without considering allocation, task mapping, or global link arrangements. They model a dragonfly machine with 96 groups, 48 routers per group, and 24 nodes per router, which corresponds to a total of 110,592 nodes. In this work, we experiment with machines with smaller intra-group networks in order to focus on the impact of global link arrangements.

B. Target HPC Machines

We study global link arrangements on 3 different target dragonfly machines. For all of our machines, we use h = 2 optical links per router. We change the parameters listed in Table I (specifically, a, p, c, g) to experiment with different machine sizes. Our small machine uses the same dragonfly parameters used in the theoretical study of global link arrangements by Hastings et al. [17]: a = 4, p = 2, c = 2, and g = 9, corresponding to a total of 72 nodes and 144 cores. For our medium-size machine, we use a = 8, p = 2, c = 2, g = 17, corresponding to a total of 272 nodes and 544 cores. As for the large machine, we use a = 8, p = 4, c = 4, g = 17, which adds up to a total of 544 nodes and 2,176 cores.

We set the bandwidth of each link that is connected to the hosts and to other routers within a group (i.e., local links) to 1 GB/s. We sweep α (the global-to-local link bandwidth ratio) from 0.5 to 4 in steps of 0.5, i.e., global link bandwidths from 0.5 GB/s to 4 GB/s, to analyze the impact of global link bandwidths. In order to isolate the effect of crossbar bandwidth on the simulation results, we set the crossbar bandwidth to the sum of the maximum link bandwidths connected to a router.

C. Workloads

We focus on three communication patterns: all-to-all, bisection, and 3D stencil. We select the bisection pattern as it represents the bisection bandwidth of the network. All-to-all and 3D stencil patterns are commonly observed in real HPC applications [26].

All-to-all: This is a commonly known pattern where each task communicates with every one of the other tasks in an application. Fast Fourier Transform (FFT) is an example application with uniform all-to-all communication [27]. During our simulations, we observed that task mapping has an impact on the performance even for the all-to-all communication pattern due to the scheduling order of the messages. To average out the impact of task mapping, we run all-to-all workloads 15 times with random task mapping and select the case with the median running time for our analysis.

Bisection: We use this communication pattern to account for the bisection bandwidth of the dragonfly network and to compare our results against the theoretical analysis from prior work by Hastings et al. [17]. In this pattern, the tasks are divided into two equal-sized groups (group #1 and #2), where a task from group #1 communicates with every task in group #2 and with none of the tasks in group #1. In order to achieve the minimum bisection bandwidth between groups #1 and #2, we use the network cuts provided by Hastings et al. for the small machine size. These network cuts are also called minimum cuts; thus, the resulting application running time is a good representation of the bisection bandwidth. Once we decide which group of tasks occupies which nodes based on the minimum cuts, we run the bisection workload 15 times with random task mapping using those nodes. We select the case with the maximum running time for this analysis, as the bisection bandwidth represents the worst case.

For other machine sizes for which the minimum cuts are not readily available, we use the cut that takes the first half of the nodes in order, starting from node #0. Similarly, we apply random task mapping using these nodes and select the case with the maximum running time.

3D Stencil: This messaging pattern is observed in a large portion of HPC applications and is explicitly supported by MPI [28]. Examples of real-life applications with 3D stencil communication include multi-dimensional shock-wave analysis [29] and molecular dynamics [30]. In this pattern, application tasks exchange messages with their six nearest Cartesian neighbors (i.e., in the x, y, z directions). In SST, we use the Halo3D motif in ember, which models the 3D-stencil MPI routine. As the running time of stencil jobs significantly depends on task mapping [21], we use recursive graph bisection to efficiently place application tasks.

For each of these communication patterns, we specify parameters such as the message size, the number of iterations, and the compute time per iteration. In order to focus on the communication overhead, we set the compute time to zero. We set the number of iterations for each application such that we allow enough time for the network to reach steady state. In our experiments, we use 3 different message sizes: 100KB, 1000KB, and 4000KB.

We assume single-job and multi-job cases, where all jobs collectively occupy 100% of the dragonfly machine. High system utilization is a common case in HPC and also makes it easier to observe how different conditions (communication patterns, message size, etc.) lead to network congestion. In the multi-job case, we assume two jobs with the same communication pattern are running simultaneously, with each job occupying 50% of the machine. We focus on a single run of the jobs instead of job traces, which means that all jobs arrive and get allocated at the same time.
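The machine sizes in Section III-B follow directly from the Table I parameters; a quick arithmetic check:

```python
def machine_size(a, p, c, g):
    """Total node and core counts for a dragonfly machine (h is fixed at 2 here)."""
    nodes = a * p * g   # a routers/group * p nodes/router * g groups
    cores = nodes * c   # c cores per node
    return nodes, cores

assert machine_size(a=4, p=2, c=2, g=9) == (72, 144)     # small
assert machine_size(a=8, p=2, c=2, g=17) == (272, 544)   # medium
assert machine_size(a=8, p=4, c=4, g=17) == (544, 2176)  # large
```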
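The all-to-all and bisection patterns in Section III-C are fully determined by the task count. The sketch below generates the communicating pairs; it is a simplification of the corresponding ember motifs and uses the first-half/second-half cut rather than a minimum cut:

```python
def all_to_all_pairs(n):
    # Every task sends to every other task.
    return [(i, j) for i in range(n) for j in range(n) if i != j]

def bisection_pairs(n):
    # Tasks split into two equal halves; each task in half #1 talks to
    # every task in half #2 and to none in its own half.
    half = n // 2
    return [(i, j) for i in range(half) for j in range(half, n)]

n = 4 * 2 * 2 * 9   # small machine: a*p*c*g = 144 cores, one task per core
assert len(all_to_all_pairs(n)) == n * (n - 1)
assert len(bisection_pairs(n)) == (n // 2) ** 2
```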
Fig. 4: Running time of the bisection communication pattern using different global link arrangements with minimal routing. All results are normalized with respect to the absolute arrangement at α = 0.5.

Fig. 5: Running time of the bisection communication pattern using different global link arrangements with Valiant routing. All results are normalized with respect to the absolute arrangement at α = 0.5.
Fig. 7: Running time of the stencil communication pattern using different global link arrangements with minimal routing for 4000KB message size. All results are normalized with respect to the absolute arrangement at α = 0.5.

Fig. 8: Running time of the all-to-all workload with different {link arrangement, routing algorithm} pairs and α = 4. For each message size, the results are normalized with respect to the {absolute, minimal} pair of the corresponding message size.
Fig. 10: Running time of the bisection communication pattern using different global link arrangements with minimal routing for 100KB message size on the large system. All results are normalized with respect to the absolute arrangement at α = 0.5.

random allocation (see Section II-C). For this analysis, we use the small machine and the stencil communication pattern, and consider two jobs with the stencil pattern running together (e.g., a common HPC scenario is when users submit the same stencil job with different inputs). We report the running time of the slower job.

In the previous results in Section IV-A, we have shown that the circulant arrangement provides a slightly shorter running time for the stencil application. Thus, in our next analysis, we focus on the circulant arrangement. Figure 11 compares the running time for each {routing, allocation} algorithm pair over different α values. The results are normalized to the {minimal, cluster} case at their corresponding α value. An interesting observation is that how well a {routing, allocation} pair performs is a function of the bandwidth ratio α. At α = 0.5, Valiant and minimal routing behave rather similarly, with {minimal, random} being the best choice. As α increases, the pairs involving the Valiant algorithm start to perform poorly. The reason is that global links become less congested with increasing α. When α is large, local link congestion becomes the bottleneck, and Valiant routing amplifies this effect by increasing the overall network traffic. This results in around 44% performance difference between the best and worst pairs ({minimal, spread} and {valiant, random}, respectively) at α = 4.

C. Summary of Our Key Findings

Our performance analysis on dragonfly networks considering various aspects has led to the following key findings:

• Detailed network simulation closely follows the theory regarding the impact of global link arrangements on bisection bandwidth. We show that the circulant arrangement provides up to 15% shorter running time compared to the absolute arrangement for the bisection communication pattern with minimal routing.

• In contrast to the bisection communication pattern, global link arrangements do not have a significant impact on performance (i.e., <3%) for realistic patterns such as all-to-all and 3D stencil.

• We show that task mapping can significantly impact the application running time even for the all-to-all pattern due to the scheduling order of the messages. We observe up to 11% performance variation for all-to-all jobs due to task mapping.

• For realistic workloads, the choice of the {routing, allocation} algorithm pair has a larger impact on performance than the choice of link arrangement. The selection of the best performing pair depends on α. As α increases, traffic on local links becomes a bottleneck. As Valiant routing creates more traffic on all links, it results in up to 44% longer running times in comparison to minimal routing at α = 4.

V. RELATED WORK

As dragonfly network topologies [6] gain popularity due to their high bisection bandwidth and low diameter, researchers have investigated how to improve the performance of dragonflies. Most of these studies use simulators to easily experiment with different dragonfly settings with clear visibility into the system. A group of existing studies utilizes high-level simulators, where the performance evaluation metrics are based on link usage [31], [32]. Low-level simulators exist, but they either do not model different global link arrangements [33] or focus on network packet delays without considering job placement algorithms [34]. In our work, we propose a unified simulation framework that is able to simulate the combination of such factors.
Based on these simulators, various techniques have been proposed to improve dragonfly performance. Prisacari et al. [10] introduce dragonfly-specific hierarchical all-to-all exchange patterns that reduce all-to-all communication time by up to 45%. Fuentes et al. [11] focus on traffic patterns that portray real use cases and study the network unfairness caused by existing routing mechanisms with the relative global link arrangement. Yébenes et al. [12] introduce a packet queuing scheme to reduce message stalls with minimal-path routing. These works specifically target improving routing-related issues in dragonflies.

Fig. 11: Running time of the stencil communication pattern using the circulant arrangement with different {routing, allocation} algorithm pairs. Results are normalized with respect to the {minimal, cluster} pair at their corresponding α value.

Researchers have identified a strong coupling between routing and job allocation in dragonflies. Bhatele et al. [13] conduct a study on an IBM PERCS architecture [9] and conclude
that default MPI rank-ordered allocation leads to significant network congestion when used together with minimal routing. They claim that (1) indirect routing obviates the need for intelligent job allocation and (2) random allocation gives the best performance when direct routing is used. Following up on this work, Chakaravarthy et al. [14] show that transpose communication patterns [35] benefit from direct routing, and Prisacari et al. demonstrate that random allocation with minimal routing is consistently outperformed by Cartesian job placement with indirect routing for stencil communication patterns [15]. These works focus on allocation and routing strategies but do not consider different global link arrangements.

Several studies investigate global links in dragonflies. Bhatele et al. [32] investigate the impact of changing the number of router links on performance. Groves et al. [25] explore the effect of the number of global links and link bandwidths on the performance and power consumption of dragonfly networks, considering the absolute arrangement only. Camarero et al. [16] introduce the three global link arrangements we use in this paper; however, they do not provide any performance comparison between them and instead focus on the impact of the routing mechanisms. Hastings et al. conduct a theoretical analysis on global link arrangements and show that the commonly-used absolute link arrangement leads to a smaller bisection bandwidth when the ratio of global/local link bandwidths is larger than 1.25 [17]. Wen et al. propose Flexfly, a re-configurable network architecture for the global links using low-radix optical switches [36]. Flexfly modifies inter-group connections based on the observed network traffic to mitigate the need for indirect routing. However, it requires knowledge of application traffic patterns, which is not easy to extract because the application type is typically unknown to HPC systems and traffic patterns may depend on application input.

To the best of our knowledge, our work is the first to experimentally evaluate the impact of different global link arrangements on performance in tandem with link bandwidths, communication patterns, job placement algorithms, and routing mechanisms.

VI. CONCLUSION

In this paper, we present a thorough analysis of the unexplored aspects of dragonfly networks. For this purpose, we first propose a simulation framework that is able to evaluate the combined impact of global link arrangements, link bandwidths, job allocation and routing algorithms on the application running time. We then compare the performance of several known global link arrangements and show that the circulant arrangement provides up to 15% reduction in communication time for the bisection communication pattern. We demonstrate that for common MPI communication patterns, the impact of global link arrangements is of less significance. On the other hand, for the same MPI patterns, we find that the choice of job allocation and routing algorithm is highly important, leading to up to 44% difference in communication overhead, and that the best choice depends on the bandwidth ratio between global and local links. Finally, we show that task mapping in a dragonfly can result in up to 11% variation in the application running time even for the all-to-all communication pattern due to the scheduling order of the messages.

ACKNOWLEDGMENT

This work has been partially funded by Sandia National Laboratories. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

REFERENCES

[1] J. Dongarra et al., "The international exascale software project roadmap," Int. J. High Perform. Comput. Appl., vol. 25, no. 1, pp. 3–60, Feb. 2011.
[2] D. Abts, M. R. Marty, P. M. Wells, P. Klausler, and H. Liu, "Energy proportional datacenter networks," in Proceedings of the 37th Annual International Symposium on Computer Architecture, ser. ISCA '10, 2010, pp. 338–347.
[3] M. Besta and T. Hoefler, "Slim fly: A cost effective low-diameter network topology," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '14, 2014, pp. 348–359.
[4] J. Kim, W. J. Dally, B. Towles, and A. K. Gupta, "Microarchitecture of a high-radix router," in Proceedings of the 32nd Annual International Symposium on Computer Architecture, ser. ISCA '05, 2005, pp. 420–431.
[5] P. Dong, X. Liu, S. Chandrasekhar, L. L. Buhl, R. Aroca, and Y. K. Chen, "Monolithic silicon photonic integrated circuits for compact 100+ Gb/s coherent optical receivers and transmitters," IEEE Journal of Selected Topics in Quantum Electronics, vol. 20, no. 4, pp. 150–157, July 2014.
[6] J. Kim, W. J. Dally, S. Scott, and D. Abts, "Technology-driven, highly-scalable dragonfly topology," in 35th International Symposium on Computer Architecture, ser. ISCA '08, June 2008, pp. 77–88.
[7] B. Alverson, E. Froese, L. Kaplan, and D. Roweth, "Cray XC series network," Cray, Inc., White paper, Tech. Rep., 2012.
[8] G. Faanes, A. Bataineh, D. Roweth, T. Court, E. Froese, B. Alverson, T. Johnson, J. Kopnick, M. Higgins, and J. Reinhard, "Cray cascade: A scalable HPC system based on a dragonfly network," in International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Nov 2012, pp. 1–9.
[9] B. Arimilli, R. Arimilli, V. Chung, S. Clark, W. Denzel, B. Drerup, T. Hoefler, J. Joyner, J. Lewis, J. Li, N. Ni, and R. Rajamony, "The PERCS high-performance interconnect," in 2010 18th IEEE Symposium on High Performance Interconnects, Aug 2010.
[10] B. Prisacari, G. Rodriguez, and C. Minkenberg, "Generalized hierarchical all-to-all exchange patterns," in 2013 IEEE 27th International Symposium on Parallel & Distributed Processing (IPDPS), May 2013, pp. 537–547.
[11] P. Fuentes, E. Vallejo, C. Camarero, R. Beivide, and M. Valero, "Throughput unfairness in dragonfly networks under realistic traffic patterns," in 2015 IEEE International Conference on Cluster Computing, Sept 2015, pp. 801–808.
[12] P. Yébenes, J. Escudero-Sahuquillo, P. J. García, and F. J. Quiles, "Straightforward solutions to reduce HoL blocking in different dragonfly fully-connected interconnection patterns," The Journal of Supercomputing, pp. 1–23, 2016.
[13] A. Bhatele, W. D. Gropp, N. Jain, and L. V. Kale, "Avoiding hot-spots on two-level direct networks," in 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Nov 2011, pp. 1–11.
[14] V. T. Chakaravarthy, M. Kedia, Y. Sabharwal, N. P. K. Katta, R. Rajamony, and A. Ramanan, "Mapping strategies for the PERCS architecture," in 2012 19th International Conference on High Performance Computing (HiPC), Dec 2012, pp. 1–10.
[15] B. Prisacari, G. Rodriguez, P. Heidelberger, D. Chen, C. Minkenberg, and T. Hoefler, "Efficient task placement and routing of nearest neighbor exchanges in dragonfly networks," in Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, ser. HPDC '14, 2014, pp. 129–140.
[16] C. Camarero, E. Vallejo, and R. Beivide, "Topological characterization of hamming and dragonfly networks and its implications on routing," ACM Trans. Archit. Code Optim., vol. 11, no. 4, pp. 39:1–39:25, Dec. 2014.
[17] E. Hastings, D. Rincon-Cruz, M. Spehlmann, S. Meyers, A. Xu, D. P. Bunde, and V. J. Leung, "Comparing global link arrangements for dragonfly networks," in 2015 IEEE International Conference on Cluster Computing, Sept 2015, pp. 361–370.
[18] A. Rodrigues, E. Cooper-Balis, K. Bergman, K. Ferreira, D. Bunde, and K. S. Hemmert, "Improvements to the structural simulation toolkit," in Proceedings of the 5th International ICST Conference on Simulation Tools and Techniques, ser. SIMUTOOLS '12, 2012, pp. 190–195.
[19] L. G. Valiant and G. J. Brebner, "Universal schemes for parallel communication," in Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, ser. STOC '81, 1981, pp. 263–277.
[20] N. Jain, A. Bhatele, X. Ni, N. J. Wright, and L. V. Kale, "Maximizing throughput on a dragonfly network," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '14, 2014, pp. 336–347.
[21] T. Hoefler and M. Snir, "Generic topology mapping strategies for large-scale parallel architectures," in Proceedings of the International Conference on Supercomputing, ser. ICS '11, 2011, pp. 75–84.
[22] A. F. Rodrigues, K. S. Hemmert, B. W. Barrett, C. Kersey, R. Oldfield, M. Weston, R. Risen, J. Cook, P. Rosenfeld, E. CooperBalls, and B. Jacob, "The structural simulation toolkit," SIGMETRICS Perform. Eval. Rev., vol. 38, no. 4, pp. 37–42, Mar. 2011.
[23] M.-y. Hsieh, A. Rodrigues, R. Riesen, K. Thompson, and W. Song, "A framework for architecture-level power, area, and thermal simulation and its application to network-on-chip design exploration," SIGMETRICS Perform. Eval. Rev., vol. 38, no. 4, pp. 63–68, Mar. 2011.
[24] K. D. Underwood, M. Levenhagen, and A. Rodrigues, "Simulating red storm: Challenges and successes in building a system simulation," in 2007 IEEE International Parallel and Distributed Processing Symposium, March 2007, pp. 1–10.
[25] T. Groves, R. E. Grant, S. Hemmer, S. Hammond, M. Levenhagen, and D. C. Arnold, "(SAI) stalled, active and idle: Characterizing power and performance of large-scale dragonfly networks," in 2016 IEEE International Conference on Cluster Computing (CLUSTER), Sept 2016, pp. 50–59.
[26] K. Antypas, "NERSC-6 workload analysis and benchmark selection process," Lawrence Berkeley National Laboratory, 2008.
[27] J. Meng, E. Llamosí, F. Kaplan, C. Zhang, J. Sheng, M. Herbordt, G. Schirner, and A. K. Coskun, "Communication and cooling aware job allocation in data centers for communication-intensive workloads," J. Parallel Distrib. Comput., vol. 96, no. C, pp. 181–193, Oct. 2016.
[28] W. Gropp, T. Hoefler, R. Thakur, and E. Lusk, Using Advanced MPI: Modern Features of the Message-Passing Interface. MIT Press, Nov. 2014.
[29] E. S. Hertel, Jr., R. L. Bell, M. G. Elrick, A. V. Farnsworth, G. I. Kerley, J. M. McGlaun, S. V. Petney, S. A. Silling, P. A. Taylor, and L. Yarrington, "CTH: A software family for multi-dimensional shock physics analysis," in Proceedings of the 19th International Symposium on Shock Waves, 1993, pp. 377–382.
[30] S. Plimpton, "Fast parallel algorithms for short-range molecular dynamics," J. Comput. Phys., vol. 117, no. 1, pp. 1–19, Mar. 1995.
[31] G. Zheng, T. Wilmarth, P. Jagadishprasad, and L. V. Kalé, "Simulation-based performance prediction for large parallel machines," International Journal of Parallel Programming, vol. 33, no. 2, pp. 183–207, 2005.
[32] A. Bhatele, N. Jain, Y. Livnat, V. Pascucci, and P.-T. Bremer, "Evaluating system parameters on a dragonfly using simulation and visualization," Tech. Rep., July 2015.
[33] M. Garcia, P. Fuentes, M. Odriozola, E. Vallejo, and R. Beivide, "FOGSim interconnection network simulator." [Online]. Available: http://fuentesp.github.io/fogsim/
[34] P. Yebenes, J. Escudero-Sahuquillo, P. J. Garcia, and F. J. Quiles, "Towards modeling interconnection networks of exascale systems with OMNeT++," in 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, Feb 2013, pp. 203–207.
[35] P. Luszczek, J. J. Dongarra, D. Koester, R. Rabenseifner, B. Lucas, J. Kepner, J. McCalpin, D. Bailey, and D. Takahashi, "Introduction to the HPC challenge benchmark suite," Tech. Rep., 2005.
[36] K. Wen, P. Samadi, S. Rumley, C. P. Chen, Y. Shen, M. Bahadori, J. Wilke, and K. Bergman, "Flexfly: Enabling a reconfigurable dragonfly through silicon photonics," in International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Nov 2016.