Improving NESUS 2014
NESUS IC1305
Network for Sustainable Ultrascale Computing First NESUS Workshop. • October 2014 • Vol. I, No. 1
Abstract
Collective operations, a key issue in the global efficiency of HPC applications, are optimized in current MPI libraries by choosing at runtime
among a set of algorithms, based on platform-dependent, previously established parameters such as the message size or the number of processes.
However, with progressively more cores per node, the cost of a collective algorithm must be mainly attributed to the process-to-processor mapping,
because of its decisive influence on the network traffic. The hierarchical design of collective algorithms aims to minimize the data movement
through the slowest communication channels of the multi-core cluster. Nevertheless, the hierarchical implementation of some collectives becomes
inefficient, or even impracticable, due to the definition of the operation itself. This paper proposes a new approach that departs from a frequently found
regular mapping, either sequential or round-robin. While the mapping is kept, the rank assignment of the processes is temporarily changed prior
to the execution of the collective algorithm. The new assignment adapts the communication pattern to the hierarchy of communication
channels. We explore this technique for the Ring algorithm as used in the well-known MPI_Allreduce collective, and discuss the
performance results obtained. Extensions to other algorithms and collective operations are proposed.
Keywords MPI Collectives, Parallel Algorithms, Message Passing Interface, Multi-core Clusters
I. Introduction

MPI [1] collective functions involve a group of processes communicating by message passing in an isolated context, known as a communicator. Each process of a communicator is identified by its rank, an integer ranging from 0 to P − 1, where P is the size of the communicator. The optimisation of collectives is a key issue in HPC applications. A collective operation can be executed by different algorithms, each suitable for a given network technology, communicator size, message size, etc. For example, in the MPICH library [2], the implementation of MPI_Allreduce uses two algorithms for medium and large messages when the number of processes is a power of two, namely Recursive Doubling and Ring. The switch from the first to the second algorithm is done at execution time, based on platform-dependent, previously established message size and process number thresholds.

Current parallel systems are composed of multi-core nodes connected by a high performance network. The communication cost between two MPI ranks depends on their location, being lower if they share memory, and higher if they are in different nodes. Therefore the performance of an application depends on the assignment of the ranks to the processors of the cluster (the mapping). In general, two types of mapping cover the needs of most applications: sequential and round-robin. In the sequential mapping, ranks bind to processors so that a domain (e.g. a socket or a node) is completed before moving to the next domain. In round-robin, ranks are bound to domains by rotating over the existing domains.

The mapping affects the performance of the underlying algorithms of collective operations. Interestingly, a given mapping may favour one algorithm and, at the same time, be harmful to another, even if both are used in the implementation of the same collective. For example, in the implementation of the allreduce operation in MPICH, referred to above, the Recursive Doubling algorithm shows a better performance when the mapping is round-robin, while the Ring algorithm runs faster under the sequential mapping.

One approach to the issue of collectives performance is to build algorithms that are aware of the different capacities of the available communication channels, such as shared memory and the network. These algorithms, known as hierarchical, rely on minimizing the communications through the slower channels, but for some collectives, such as allgather, the implementation is not as effective as expected, or is even impracticable, and hence it is not provided in well-known MPI libraries such as Open MPI [3].

This paper describes a new approach to the optimization of collectives in multi-core clusters. The goal is to obtain the best possible communication throughput. For instance, in the Ring algorithm, the communication takes place between consecutive ranks. If consecutive ranks are mapped to different nodes, all the communications progress through the network. Instead, scheduling consecutive ranks to processes placed in the same multi-core node favours the much more efficient shared memory communication. Our method is based on a temporary reassignment of ranks, which modifies neither the algorithm nor the physical mapping. Instead, it is carried out by means of a transformation function prior to the execution of the algorithm. The function is simple and efficient, and converts a sequential mapping to round-robin, and vice versa, only during the execution of the algorithm.

This paper focuses on the Ring algorithm in the context of the allreduce operation. Besides, the methodology described is directly applicable to other algorithms used in the implementation of MPI
Improving the Performance of the MPI Allreduce Collective Operation through Rank Renaming
collectives. The platform considered is characterized by P, the number of processors (or processes involved in the operation), and M, the number of nodes in the cluster. Q = P/M is the number of processors per node. Two channels with different performance are considered in the system: shared memory and the network. The study is conducted under two different mappings, sequential and round-robin, under the assumption of a homogeneous distribution of the processes over the nodes of the system. A hierarchical implementation of the algorithm is examined as well. The attained cost reduction depends on the number of nodes and the number of processes per node. On the experimental platform used, even with a small number of processes and nodes, the improvement reaches up to 2× for long messages.

With respect to the structure of this article, following this introduction, section II reviews proposals for the optimization of collective operations on a broad range of platforms. Section III studies the allreduce Ring algorithm in multi-core clusters based on the incoming mapping, as well as a hierarchical allreduce implementation. The section also exposes our proposal to improve the performance of the algorithm, and section IV outlines extensions to cases not covered in this paper. Section V shows the obtained performance figures, and section VI concludes the paper.

II. Related Work

MPI collectives performance is a key issue in high performance computing applications, and significant work has been invested in their design and optimization. Collectives in the MPI standard can be implemented from a set of available algorithms. For instance, MPI_Allreduce can be implemented using the Recursive Doubling algorithm, which improves the latency for small messages when P is a power of two, because it is optimal with regard to the number of stages; the Ring algorithm, however, performs better for larger messages. Both algorithms are also used in the implementation of MPI_Allgather, for which, in addition, other proposed algorithms improve the performance when requirements related to message size, process number, or hardware and network technologies are met. The Bruck algorithm [4] is more efficient for very short messages, even though it needs additional temporary memory. The Neighbour Exchange algorithm in [5] requires half the stages of the Ring algorithm when the number of processes is even, and it exploits the piggy-backing feature of the TCP/IP protocols, as does the Dissemination algorithm proposed in [6], based on pairwise exchange of messages between processes. Also related to the improvement of performance by exploiting network capabilities, Mamidala et al. [7] evaluate the RDMA capacity for allowing concurrent direct memory access by processes either in the same or in different nodes of a multi-core cluster. Ma et al. [8] discuss intra-node direct copy communication between processes through shared memory by using the capacities of the operating system, and in [9] evaluate its impact on collective operations. Kielmann et al. [10] focus on the optimization of collective communications for clustered wide area systems.

The use of several algorithms in the same collective, based on system-dependent, previously established thresholds for message size and number of processes, is shown by Thakur et al. [11] for a monoprocessor cluster of workstations. This approach has been adopted by the MPICH library, and it is available in the Open MPI library through its Modular Component Architecture [12]. Vadhiyar et al. [13] evaluate such performance improvement through series of experiments previously executed on a specific platform.

Multi-core clusters introduce a new actor on the scene. Performance becomes dependent on the effective use of the different communication channels. Hierarchical algorithms are specifically built to minimize the use of slower communication channels, and usually execute in several stages [14]. The process group splits into subgroups, with a local root per subgroup. Processes in a subgroup communicate through the faster communication channel, usually shared memory; hence, a subgroup is assigned to a node in the system. The application of this kind of algorithm to several implementations of the MPI standard and hardware platforms is extensively evaluated in [15], [16] and [17]. Based on analytical communication models, Karonis et al. [18] demonstrated the advantages of a multilevel topology-aware implementation of algorithms with respect to optimal flat algorithms. Sack and Gropp [19] show that an algorithm that is suboptimal in terms of inter-domain communications may produce less congestion than an optimal algorithm, and therefore achieve a faster execution.

The former approaches adapt algorithms to the underlying communication capabilities. An inverse approach is to improve performance through the calculation of the best layout of the processes over the processors of the cluster. Kravtsov et al. [20] define and propose an efficient solution to the topology-aware co-allocation problem, and Jeannot et al. propose the TreeMatch algorithm in [21], applied to multi-core clusters. The challenge is to optimally map the graph that defines the communication needs of an application onto the graph of the available resources. The solution can be applied to MPI collective operations, provided that they are built as a set of point-to-point transmissions [22]. Algorithms to automatically build the optimal distance-aware collective communication topology, based on the distance information between processes, are proposed in [23]. The results are applied to the Binomial Tree broadcast and Ring allgather collectives.

III. MPI_Allreduce Ring Algorithm

In the MPI_Allreduce collective operation every process contributes a buffer of size m bytes and gets in the output buffer the result of applying a specified operation to the buffers of all P processes.

The Ring algorithm implementation of the allreduce collective first copies data from the input buffer to the output buffer. Next, it operates on the output buffer in two phases: computation and distribution. The algorithm does not preserve the order of operations; as a consequence, it cannot be used with non-commutative operations.

The computation phase is done in P − 1 stages. The data buffer is divided up into segments of size m/P. In each stage k, starting from k = 0, a process p sends its segment p − k to process p + 1, and next receives in a temporary buffer a segment from process p − 1, which it operates with the local segment p − k − 1, with wraparounds. The operated segment in each process will be sent in the next stage. After P − 1 stages, each process p has a fully operated segment at position p + 1 of the output buffer.

The distribution phase performs an allgather to distribute these segments between processes, also using a Ring algorithm. It operates in P − 1 stages. All processes contribute an m/P bytes segment at offset p + 1 and receive P segments ordered by rank.
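The two phases described above can be illustrated with a short simulation. The sketch below is plain Python, not MPI code: each "process" is an entry in a list, the reduction operation is addition, each segment holds a single integer (so a buffer has exactly P elements), and the function name is ours. Sends are modelled by snapshotting the values at the start of each stage, as every process sends before it combines the received segment.

```python
# Simulation of the allreduce Ring algorithm (computation phase followed
# by the allgather-style distribution phase). Illustrative only: plain
# Python, addition as the reduction, one element per segment.

def ring_allreduce(buffers):
    """buffers[p] is the input buffer of process p; each has P elements,
    one per segment. Returns the output buffers after the allreduce."""
    P = len(buffers)
    # Every process first copies its input to its output buffer.
    out = [list(b) for b in buffers]

    # Computation phase: P - 1 reduce-scatter stages. In stage k,
    # process p sends segment (p - k) mod P to p + 1, then combines the
    # segment received from p - 1 into segment (p - k - 1) mod P.
    for k in range(P - 1):
        sent = [out[p][(p - k) % P] for p in range(P)]   # stage snapshot
        for p in range(P):
            out[p][(p - 1 - k) % P] += sent[(p - 1) % P]
    # Process p now holds the fully reduced segment (p + 1) mod P.

    # Distribution phase: P - 1 ring allgather stages. In stage k,
    # process p forwards segment (p + 1 - k) mod P to p + 1.
    for k in range(P - 1):
        idx = [(p + 1 - k) % P for p in range(P)]
        sent = [out[p][idx[p]] for p in range(P)]
        for p in range(P):
            out[p][idx[(p - 1) % P]] = sent[(p - 1) % P]
    return out
```

After the final stage every process holds the same fully reduced buffer, matching the description of the computation and distribution phases above.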
Juan-Antonio Rico-Gallego, Juan-Carlos Diaz-Martin
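The hierarchical scheme discussed in sections II and III can be sketched in three stages. The code below is our paraphrase of the standard construction, not the paper's implementation, simulated in plain Python with addition as the reduction and a sequential mapping assumed; the function name is illustrative.

```python
# Sketch of a hierarchical allreduce on M nodes with Q processes per
# node (illustration of the generic construction, not the paper's code).

def hierarchical_allreduce(buffers, M, Q):
    """buffers[r] is the input buffer of rank r under a sequential
    mapping, i.e. ranks n*Q .. n*Q+Q-1 live on node n."""
    size = len(buffers[0])
    # Stage 1: intra-node reduction to a local root (shared memory).
    partial = [[sum(buffers[n * Q + q][i] for q in range(Q))
                for i in range(size)] for n in range(M)]
    # Stage 2: inter-node allreduce among the M local roots (network).
    total = [sum(partial[n][i] for n in range(M)) for i in range(size)]
    # Stage 3: intra-node broadcast of the result (shared memory).
    return [list(total) for _ in range(M * Q)]
```

Only stage 2 crosses the network, which is why the hierarchical design minimizes traffic through the slowest channel regardless of the mapping.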
Figure 2: Allreduce Ring algorithm computation phase, stage k = 0, with round-robin mapping, P = 6 and M = 2. The total buffer size is divided up into P segments. Renaming the processes through the transformation functions changes the mapping from round-robin to sequential before starting.

[Plot: bandwidth in MBytes/s versus Message Size (KBytes).]
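The transformation functions mentioned in the caption of Figure 2 are not reproduced in this excerpt, but a natural choice is the transpose permutation sketched below, under the paper's assumptions of M nodes and Q = P/M processes per node. This is an illustration, with function names of our own: it renames the ranks of a round-robin deployment so that consecutive virtual ranks fall on the same node, which is what the Ring algorithm needs.

```python
# Sketch of a rank-renaming transformation for a round-robin deployment
# (M nodes, Q = P/M processes per node). Names are illustrative, not
# the paper's actual code.

def node_of_round_robin(r, M):
    """Node hosting rank r under the round-robin mapping."""
    return r % M

def node_of_sequential(r, Q):
    """Node hosting rank r under the sequential mapping."""
    return r // Q

def to_virtual(r, M, Q):
    """Round-robin physical rank -> sequential-looking virtual rank."""
    return (r % M) * Q + r // M

def to_physical(v, M, Q):
    """Inverse transformation, used to address the real peer ranks."""
    return (v % Q) * M + v // Q
```

With this renaming, virtual ranks 0 .. Q − 1 all live on node 0, virtual ranks Q .. 2Q − 1 on node 1, and so on, so Q − 1 out of every Q ring transfers stay inside a node and progress through shared memory.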
Figure 4: Bandwidth of the allreduce Ring algorithm with sequential and round-robin mappings, and the Ring* algorithm, when executed with M = 4 nodes and Q = 8 processes per node, for a total of P = 32 processes.

Figure 5: Bandwidth of the allreduce Ring algorithm with sequential and round-robin mappings, and the Ring* algorithm, when executed with M = 8 nodes and Q = 8 processes per node, for a total of P = 64 processes.

[Both plots: bandwidth in MBytes/s versus message size in KBytes; curves Ring SEQ, Ring RR, Hierarch and Ring*.]
the difference between sequential and Ring∗ remains constant, and denotes a minimal overhead. With respect to the hierarchical case, the difference is constant for that small number of nodes.

It can be observed in Figure 3 that the change of mapping of the Ring∗ algorithm shows an improvement with respect to the round-robin mapping even in a minimal configuration with only M = 2 nodes.

VI. Conclusions

The performance of MPI collective algorithms in multi-core clusters highly depends on the deployment of the processes on the processors of the system. These algorithms usually establish a communication pattern between ranks that uses the communication resources effectively under specific regular mappings, while other mappings significantly worsen their performance. The hierarchical design pursues the optimal use of the available communication channels of the system, regardless of the process mapping, but it is only efficient for a limited subset of collective operations.

This paper proposes a more generic approach, whose goal is to adapt the mapping of processes to the communication pattern of the collective algorithm at run-time, in order to reduce network traffic and contention. Such a switch does not require process migration, but only a renaming of the process ranks prior to the execution of the original algorithm.

The performance improvement of the MPI_Allreduce collective is evaluated when built upon the Ring algorithm, which performs better when processes are mapped sequentially. The figures show that the process renaming adds a low overhead to the cost of the original algorithm. Results are also compared to the hierarchical implementation of the collective.

Our approach can be applied to other algorithms commonly used in MPI collective operations, such as Recursive Doubling, Neighbour Exchange, Dissemination or Binomial Tree, with different incoming mapping requirements, covering a broad range of communication patterns. In addition, the paper discusses extensions to cover non-regular mappings of processes and other collective operations.

Acknowledgment

The work presented in this paper has been partially supported by the EU under the COST programme Action IC1305, 'Network for Sustainable Ultrascale Computing (NESUS)', and by the computing facilities of Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT), funded by the European Regional Development Fund (ERDF). CETA-CIEMAT belongs to CIEMAT and the Government of Spain.

References

[1] MPI Forum. MPI: A Message-Passing Interface Standard, Version 3.0, September 2012.

[2] MPICH. MPICH: High Performance Portable MPI Implementation, 2013.

[3] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users' Group Meeting, pages 97–104, Budapest, Hungary, September 2004.

[4] J. Bruck, Ching-Tien Ho, S. Kipnis, E. Upfal, and D. Weathersby. Efficient algorithms for all-to-all communications in multiport message-passing systems. Parallel and Distributed Systems, IEEE Transactions on, 8(11):1143–1156, 1997.

[5] Jing Chen, Linbo Zhang, Yunquan Zhang, and Wei Yuan. Performance evaluation of allgather algorithms on terascale Linux cluster with fast Ethernet. In Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region, HPCASIA '05, pages 437–, Washington, DC, USA, 2005. IEEE Computer Society.
Figure 6: Allreduce relative mean bandwidths (calculated for the whole range of messages) of the Ring* algorithm with respect to the Ring with round-robin mapping, of Ring with sequential mapping with respect to Ring*, and of Ring* with respect to the hierarchical implementation. All the tests are executed with M = 8 nodes and Q = 8 processes per node, for a total of P = 64 processes.

[Plot: ratios Ring∗/Ring RR, Ring SEQ/Ring∗ and Ring SEQ/Hierarch versus process/node number (2, 4, 8).]

[6] Gregory D. Benson, Cho-wai Chu, Qing Huang, and Sadik G. Caglar. A comparison of MPICH allgather algorithms on switched networks. In Proceedings of the 10th EuroPVM/MPI 2003 Conference, pages 335–343. Springer, 2003.

[7] Amith R. Mamidala, Abhinav Vishnu, and Dhabaleswar K. Panda. Efficient shared memory and RDMA based design for MPI_Allgather over InfiniBand. In Proceedings of the 13th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, EuroPVM/MPI'06, pages 66–75, Berlin, Heidelberg, 2006. Springer-Verlag.

[8] Teng Ma, G. Bosilca, A. Bouteiller, and J.J. Dongarra. HierKNEM: An adaptive framework for kernel-assisted and topology-aware collective communications on many-core clusters. In Parallel and Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 970–982, 2012.

[9] Teng Ma, G. Bosilca, A. Bouteiller, B. Goglin, J.M. Squyres, and J.J. Dongarra. Kernel assisted collective intra-node MPI communication among multi-core and many-core CPUs. In Parallel Processing (ICPP), 2011 International Conference on, pages 532–541, September 2011.

[10] Thilo Kielmann, Rutger F. H. Hofman, Henri E. Bal, Aske Plaat, and Raoul A. F. Bhoedjang. MagPIe: MPI's collective communication operations for clustered wide area systems. SIGPLAN Not., 34(8):131–140, May 1999.

[11] Rajeev Thakur, Rolf Rabenseifner, and William Gropp. Optimization of collective communication operations in MPICH. International Journal of High Performance Computing Applications, 19(1):49–66, 2005.

[12] Jeffrey M. Squyres and Andrew Lumsdaine. The component architecture of Open MPI: Enabling third-party collective algorithms. In Vladimir Getov and Thilo Kielmann, editors, Proceedings, 18th ACM International Conference on Supercomputing, Workshop on Component Models and Systems for Grid Applications, pages 167–185, St. Malo, France, July 2004. Springer.

[13] S.S. Vadhiyar, G.E. Fagg, and J. Dongarra. Automatically tuned collective communications. In Supercomputing, ACM/IEEE 2000 Conference, pages 3–3, November 2000.

[14] Meng-Shiou Wu, R.A. Kendall, and K. Wright. Optimizing collective communications on SMP clusters. In Parallel Processing, 2005. ICPP 2005. International Conference on, pages 399–407, 2005.

[15] Hao Zhu, David Goodell, William Gropp, and Rajeev Thakur. Hierarchical collectives in MPICH2. In Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 325–326, Berlin, Heidelberg, 2009. Springer-Verlag.

[16] A.R. Mamidala, R. Kumar, D. De, and D.K. Panda. MPI collectives on modern multicore clusters: Performance optimizations and communication characteristics. In Cluster Computing and the Grid, 2008. CCGRID '08. 8th IEEE International Symposium on, pages 130–137, 2008.

[17] Richard L. Graham and Galen Shipman. MPI support for multi-core architectures: Optimized shared memory collectives. In Proceedings of the 15th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 130–140, Berlin, Heidelberg, 2008. Springer-Verlag.

[18] Nicholas T. Karonis, Bronis R. de Supinski, Ian T. Foster, William Gropp, and Ewing L. Lusk. A multilevel approach to topology-aware collective operations in computational grids. CoRR, cs.DC/0206038, 2002.

[19] Paul Sack and William Gropp. Faster topology-aware collective algorithms through non-minimal communication. SIGPLAN Not., 47(8):45–54, February 2012.

[20] Valentin Kravtsov, Martin Swain, Uri Dubin, Werner Dubitzky, and Assaf Schuster. A fast and efficient algorithm for topology-aware coallocation. In Proceedings of the 8th International Conference on Computational Science, Part I, ICCS '08, pages 274–283, Berlin, Heidelberg, 2008. Springer-Verlag.

[21] E. Jeannot, G. Mercier, and F. Tessier. Process placement in multicore clusters: Algorithmic issues and practical techniques. Parallel and Distributed Systems, IEEE Transactions on, 25(4):993–1002, April 2014.

[22] Jin Zhang, Jidong Zhai, Wenguang Chen, and Weimin Zheng. Process mapping for MPI collective communications. In Proceedings of the 15th International Euro-Par Conference on Parallel Processing, Euro-Par '09, pages 81–92, Berlin, Heidelberg, 2009. Springer-Verlag.

[23] Teng Ma, T. Herault, G. Bosilca, and J.J. Dongarra. Process distance-aware adaptive MPI collective communications. In Cluster Computing (CLUSTER), 2011 IEEE International Conference on, pages 196–204, 2011.