Improving NESUS 2014
NESUS IC1305
Network for Sustainable Ultrascale Computing First NESUS Workshop. • October 2014 • Vol. I, No. 1
Abstract
Collective operations, a key issue in the global efficiency of HPC applications, are optimized in current MPI libraries by choosing at runtime
among a set of algorithms, based on platform-dependent, previously established parameters such as the message size or the number of processes.
However, with progressively more cores per node, the cost of a collective algorithm must be mainly attributed to the process-to-processor mapping,
because of its decisive influence on the network traffic. The hierarchical design of collective algorithms aims to minimize the data movement
through the slowest communication channels of the multi-core cluster. Nevertheless, the hierarchical implementation of some collectives becomes
inefficient, or even impracticable, due to the definition of the operation itself. This paper proposes a new approach that departs from a frequently found
regular mapping, either sequential or round-robin. While the mapping is kept, the rank assignment of the processes is temporarily changed prior
to the execution of the collective algorithm. The new assignment adapts the communication pattern to the hierarchy of communication
channels. We explore this technique for the Ring algorithm as used in the well-known MPI_Allreduce collective, and discuss the
performance results obtained. Extensions to other algorithms and collective operations are proposed.
Keywords MPI Collectives, Parallel Algorithms, Message Passing Interface, Multi-core Clusters
I. Introduction

MPI [1] collective functions involve a group of processes communicating by message passing in an isolated context, known as a communicator. Each process of a communicator is identified by its rank, an integer ranging from 0 to P − 1, where P is the size of the communicator. The optimisation of collectives is a key issue in HPC applications. A collective operation can be executed by different algorithms, each suitable for a given network technology, communicator size, message size, etc. For example, in the MPICH library [2], the implementation of MPI_Allreduce uses two algorithms for medium and large messages when the number of processes is a power of two, namely Recursive Doubling and Ring. The switch from the first to the second algorithm is done at execution time, based on platform-dependent, previously established message size and process number thresholds.

Current parallel systems are composed of multi-core nodes connected by a high performance network. The communication cost between two MPI ranks depends on their location, being lower if they share memory, and higher if they are in different nodes. Therefore the performance of an application depends on the assignment of the ranks to the processors of the cluster (the mapping). In general, two types of mapping cover the needs of most applications: sequential and round-robin. In the sequential mapping, ranks bind to processors so that a domain (e.g. a socket or a node) is completed before moving to the next domain. In round-robin, ranks are bound to domains by rotating over the existing domains.

The mapping affects the performance of the underlying algorithms of collective operations. Interestingly, a given mapping may favour one algorithm and, at the same time, be harmful to another, even if both are used in the implementation of the same collective. For example, in the implementation of the allreduce operation in MPICH, referred to above, the Recursive Doubling algorithm shows a better performance when the mapping is round-robin, while the Ring algorithm runs faster under the sequential mapping.

One approach to the issue of collectives performance is to build algorithms that are aware of the different capacities of the available communication channels, such as shared memory and the network. These algorithms, known as hierarchical, rely on minimizing the communications through the slower channels, but for some collectives, such as allgather, the implementation is not as effective as expected, or is even impracticable, and hence it is not provided in well-known MPI libraries such as Open MPI [3].

This paper describes a new approach to the optimization of collectives in multi-core clusters. The goal is to obtain the best possible communication throughput. For instance, in the Ring algorithm, the communication takes place between consecutive ranks. If consecutive ranks are mapped to different nodes, all the communications progress through the network. Instead, scheduling consecutive ranks to processes placed in the same multi-core node favours the much more efficient shared memory communication. Our method is based on a temporary reassignment of ranks, which modifies neither the algorithm nor the physical mapping. Instead, it is carried out by means of a transformation function prior to the execution of the algorithm. The function is simple and efficient, and converts a sequential mapping to round-robin, and vice versa, only during the execution of the algorithm.

This paper focuses on the Ring algorithm in the context of the allreduce operation. Besides, the methodology described is directly applicable to other algorithms used in the implementation of MPI
Improving the Performance of the MPI Allreduce Collective Operation through Rank Renaming
collectives. The platform considered is characterized by P, the number of processors (or processes involved in the operation), and M, the number of nodes in the cluster. Q = P/M is the number of processors per node. Two channels with different performance are considered in the system: shared memory and the network. The study is conducted under two different mappings, sequential and round-robin, under the assumption of a homogeneous distribution of the processes over the nodes of the system. A hierarchical implementation of the algorithm is examined as well. The attained cost reduction depends on the number of nodes and the number of processes per node. On the experimental platform used, even with a small number of processes and nodes, the improvement reaches up to 2× for long messages.

With respect to the structure of this article, following this introduction, section II reviews proposals for the optimization of collective operations on a broad range of platforms. Section III studies the allreduce Ring algorithm in multi-core clusters based on the incoming mapping, as well as a hierarchical allreduce implementation. The section also exposes our proposal to improve the performance of the algorithm, and section IV outlines extensions to cases not covered in this paper. Section V shows the obtained performance figures, and section VI concludes the paper.

II. Related Work

MPI collectives performance is a key issue in high performance computing applications, and significant work has been invested in their design and optimization. Collectives in the MPI standard can be implemented from a set of available algorithms. For instance, MPI_Allreduce can be implemented using the Recursive Doubling algorithm, which improves the latency for small messages when P is a power of two, because it is optimal with regard to the number of stages; the Ring algorithm, however, performs better for larger messages. Both algorithms are also used in the implementation of MPI_Allgather, for which, in addition, other proposed algorithms improve the performance when requirements related to message size, process number, or hardware and network technologies are met. The Bruck algorithm [4] is more efficient for very short messages, even though it needs additional temporary memory. The Neighbour Exchange algorithm in [5] requires half the stages of the Ring algorithm when the number of processes is even, and it exploits the piggy-backing feature of the TCP/IP protocols, as does the Dissemination algorithm proposed in [6], based on pairwise exchange of messages between processes. Also related to the improvement of performance by exploiting network capabilities, Mamidala et al. [7] evaluate the RDMA capacity for allowing concurrent direct memory access by processes either in the same or in different nodes of a multi-core cluster. Ma et al. [8] discuss intra-node direct copy communication between processes through shared memory by using the capacities of the operating system, and in [9] evaluate its impact on collective operations. Kielmann et al. [10] focus on the optimization of collective communications for clustered wide area systems.

The use of several algorithms in the same collective, based on system-dependent, previously established thresholds for message size and number of processes, is shown by Thakur et al. [11] for a monoprocessor cluster of workstations. This approach has been adopted by the MPICH library, and it is available in the Open MPI library through its Modular Component Architecture [12]. Vadhiyar et al. [13] evaluate such performance improvement through series of experiments previously executed on a specific platform.

Multi-core clusters introduce a new actor on the scene. Performance becomes dependent on the effective use of the different communication channels. Hierarchical algorithms are specifically built to minimize the use of slower communication channels, and usually execute in several stages [14]. The process group splits into subgroups, with a local root per subgroup. Processes in a subgroup communicate through the faster communication channel, usually shared memory; hence, a subgroup is assigned to a node in the system. The application of this kind of algorithm to several implementations of the MPI standard and hardware platforms is extensively evaluated in [15], [16] and [17]. Based on analytical communication models, Karonis et al. [18] demonstrated the advantages of a multilevel topology-aware implementation of algorithms with respect to optimal flat algorithms. Sack and Gropp [19] show that an algorithm that is suboptimal in terms of inter-domain communications may produce less congestion than an optimal algorithm, and therefore achieve a faster execution.

The former approaches adapt algorithms to the underlying communication capabilities. An inverse approach is to improve performance through the calculation of the best layout of the processes over the processors of the cluster. Kravtsov et al. [20] define and propose an efficient solution to the topology-aware co-allocation problem, and Jeannot et al. propose the TreeMatch algorithm in [21], applied to multi-core clusters. The challenge is to optimally map the graph that defines the communication needs of an application onto the graph of the available resources. The solution can be applied to MPI collective operations, provided that they are built as a set of point-to-point transmissions [22]. Algorithms to automatically build the optimal distance-aware collective communication topology, based on the distance information between processes, are proposed in [23]. The results are applied to the Binomial Tree broadcast and Ring allgather collectives.

III. MPI_Allreduce Ring Algorithm

In the MPI_Allreduce collective operation every process contributes a buffer of size m bytes and gets in the output buffer the result of applying a specified operation to the buffers of all P processes.

The Ring algorithm implementation of the allreduce collective first copies data from the input buffer to the output buffer. Next, it operates on the output buffer in two phases: computation and distribution. The algorithm does not preserve the order of operations; as a consequence, it cannot be used with non-commutative operations.

The computation phase is done in P − 1 stages. The data buffer is divided up into segments of size m/P. In each stage k, starting from k = 0, a process p sends its segment p − k to process p + 1, and next receives in a temporary buffer a segment from process p − 1, which it operates with the local segment p − k − 1, with wraparounds. The operated segment in each process will be sent in the next stage. After P − 1 stages, each process p has a fully operated segment at position p + 1 of the output buffer.

The distribution phase performs an allgather to distribute these segments between processes, also using a Ring algorithm. It operates in P − 1 stages. All processes contribute an m/P bytes segment at offset p + 1 and receive P segments ordered by rank.
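The two phases described above can be illustrated with a short simulation. The sketch below is plain Python, not MPI code: each "process" is an entry in a list, the reduction operation is addition, each segment holds a single integer (so a buffer has exactly P elements), and the function name is ours. Sends are modelled by snapshotting the values at the start of each stage, as every process sends before it combines the received segment.

```python
# Simulation of the allreduce Ring algorithm (computation phase followed
# by the allgather-style distribution phase). Illustrative only: plain
# Python, addition as the reduction, one element per segment.

def ring_allreduce(buffers):
    """buffers[p] is the input buffer of process p; each has P elements,
    one per segment. Returns the output buffers after the allreduce."""
    P = len(buffers)
    # Every process first copies its input to its output buffer.
    out = [list(b) for b in buffers]

    # Computation phase: P - 1 reduce-scatter stages. In stage k,
    # process p sends segment (p - k) mod P to p + 1, then combines the
    # segment received from p - 1 into segment (p - k - 1) mod P.
    for k in range(P - 1):
        sent = [out[p][(p - k) % P] for p in range(P)]   # stage snapshot
        for p in range(P):
            out[p][(p - 1 - k) % P] += sent[(p - 1) % P]
    # Process p now holds the fully reduced segment (p + 1) mod P.

    # Distribution phase: P - 1 ring allgather stages. In stage k,
    # process p forwards segment (p + 1 - k) mod P to p + 1.
    for k in range(P - 1):
        idx = [(p + 1 - k) % P for p in range(P)]
        sent = [out[p][idx[p]] for p in range(P)]
        for p in range(P):
            out[p][idx[(p - 1) % P]] = sent[(p - 1) % P]
    return out
```

After the final stage every process holds the same fully reduced buffer, matching the description of the computation and distribution phases above.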
Juan-Antonio Rico-Gallego, Juan-Carlos Diaz-Martin
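The hierarchical scheme discussed in sections II and III can be sketched in three stages. The code below is our paraphrase of the standard construction, not the paper's implementation, simulated in plain Python with addition as the reduction and a sequential mapping assumed; the function name is illustrative.

```python
# Sketch of a hierarchical allreduce on M nodes with Q processes per
# node (illustration of the generic construction, not the paper's code).

def hierarchical_allreduce(buffers, M, Q):
    """buffers[r] is the input buffer of rank r under a sequential
    mapping, i.e. ranks n*Q .. n*Q+Q-1 live on node n."""
    size = len(buffers[0])
    # Stage 1: intra-node reduction to a local root (shared memory).
    partial = [[sum(buffers[n * Q + q][i] for q in range(Q))
                for i in range(size)] for n in range(M)]
    # Stage 2: inter-node allreduce among the M local roots (network).
    total = [sum(partial[n][i] for n in range(M)) for i in range(size)]
    # Stage 3: intra-node broadcast of the result (shared memory).
    return [list(total) for _ in range(M * Q)]
```

Only stage 2 crosses the network, which is why the hierarchical design minimizes traffic through the slowest channel regardless of the mapping.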
Figure 2: Allreduce Ring algorithm computation phase, stage k = 0, with round-robin mapping, P = 6 and M = 2. The total buffer size is divided up into P segments. Renaming the processes through the transformation functions changes the mapping from round-robin to sequential before starting.

[Plot: bandwidth in MBytes/s versus Message Size (KBytes).]
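The transformation functions mentioned in the caption of Figure 2 are not reproduced in this excerpt, but a natural choice is the transpose permutation sketched below, under the paper's assumptions of M nodes and Q = P/M processes per node. This is an illustration, with function names of our own: it renames the ranks of a round-robin deployment so that consecutive virtual ranks fall on the same node, which is what the Ring algorithm needs.

```python
# Sketch of a rank-renaming transformation for a round-robin deployment
# (M nodes, Q = P/M processes per node). Names are illustrative, not
# the paper's actual code.

def node_of_round_robin(r, M):
    """Node hosting rank r under the round-robin mapping."""
    return r % M

def node_of_sequential(r, Q):
    """Node hosting rank r under the sequential mapping."""
    return r // Q

def to_virtual(r, M, Q):
    """Round-robin physical rank -> sequential-looking virtual rank."""
    return (r % M) * Q + r // M

def to_physical(v, M, Q):
    """Inverse transformation, used to address the real peer ranks."""
    return (v % Q) * M + v // Q
```

With this renaming, virtual ranks 0 .. Q − 1 all live on node 0, virtual ranks Q .. 2Q − 1 on node 1, and so on, so Q − 1 out of every Q ring transfers stay inside a node and progress through shared memory.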
Figure 4: Bandwidth of the allreduce Ring algorithm with sequential and round-robin mappings, and the Ring* algorithm, when executed with M = 4 nodes and Q = 8 processes per node, for a total of P = 32 processes.

Figure 5: Bandwidth of the allreduce Ring algorithm with sequential and round-robin mappings, and the Ring* algorithm, when executed with M = 8 nodes and Q = 8 processes per node, for a total of P = 64 processes.

[Both plots: bandwidth in MBytes/s versus message size in KBytes; curves Ring SEQ, Ring RR, Hierarch and Ring*.]
the difference between sequential and Ring∗ remains constant, and denotes a minimal overhead. With respect to the hierarchical case, the difference is constant for that small number of nodes.

It can be observed in Figure 3 that the change of mapping of the Ring∗ algorithm shows an improvement with respect to the round-robin mapping even in a minimal configuration with only M = 2 nodes.

VI. Conclusions

The performance of MPI collective algorithms in multi-core clusters highly depends on the deployment of the processes on the processors of the system. These algorithms usually establish a communication pattern between ranks that uses the communication resources effectively under specific regular mappings, while other mappings significantly worsen their performance. The hierarchical design pursues the optimal use of the available communication channels of the system, regardless of the process mapping, but it is only efficient for a limited subset of collective operations.

This paper proposes a more generic approach, whose goal is to adapt the mapping of processes to the communication pattern of the collective algorithm at run-time, in order to reduce network traffic and contention. Such a switch does not require process migration, but only a renaming of the process ranks prior to the execution of the original algorithm.

The performance improvement of the MPI_Allreduce collective is evaluated when built upon the Ring algorithm, which performs better when processes are mapped sequentially. The figures show that the process renaming adds a low overhead to the cost of the original algorithm. Results are also compared to the hierarchical implementation of the collective.

Our approach can be applied to other algorithms commonly used in MPI collective operations, such as Recursive Doubling, Neighbour Exchange, Dissemination or Binomial Tree, with different incoming mapping requirements, covering a broad range of communication patterns. In addition, the paper discusses extensions to cover non-regular mappings of processes and other collective operations.

Acknowledgment

The work presented in this paper has been partially supported by the EU under the COST programme Action IC1305, 'Network for Sustainable Ultrascale Computing (NESUS)', and by the computing facilities of Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT), funded by the European Regional Development Fund (ERDF). CETA-CIEMAT belongs to CIEMAT and the Government of Spain.

References

[1] MPI Forum. MPI: A Message-Passing Interface Standard, Version 3.0, September 2012.

[2] MPICH. MPICH: High Performance Portable MPI Implementation, 2013.

[3] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users' Group Meeting, pages 97–104, Budapest, Hungary, September 2004.

[4] J. Bruck, Ching-Tien Ho, S. Kipnis, E. Upfal, and D. Weathersby. Efficient algorithms for all-to-all communications in multiport message-passing systems. Parallel and Distributed Systems, IEEE Transactions on, 8(11):1143–1156, 1997.

[5] Jing Chen, Linbo Zhang, Yunquan Zhang, and Wei Yuan. Performance evaluation of allgather algorithms on terascale Linux cluster with fast Ethernet. In Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region, HPCASIA '05, pages 437–, Washington, DC, USA, 2005. IEEE Computer Society.
Figure 6: Allreduce relative mean bandwidths (calculated for the whole range of messages) of the Ring* algorithm with respect to the Ring with round-robin mapping, of Ring with sequential mapping with respect to Ring*, and of Ring* with respect to the hierarchical implementation. All the tests are executed with M = 8 nodes and Q = 8 processes per node, for a total of P = 64 processes.

[Plot: ratios Ring∗/Ring RR, Ring SEQ/Ring∗ and Ring SEQ/Hierarch versus process/node number (2, 4, 8).]

[6] Gregory D. Benson, Cho-wai Chu, Qing Huang, and Sadik G. Caglar. A comparison of MPICH allgather algorithms on switched networks. In Proceedings of the 10th EuroPVM/MPI 2003 Conference, pages 335–343. Springer, 2003.

[7] Amith R. Mamidala, Abhinav Vishnu, and Dhabaleswar K. Panda. Efficient shared memory and RDMA based design for MPI_Allgather over InfiniBand. In Proceedings of the 13th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, EuroPVM/MPI'06, pages 66–75, Berlin, Heidelberg, 2006. Springer-Verlag.

[8] Teng Ma, G. Bosilca, A. Bouteiller, and J.J. Dongarra. HierKNEM: An adaptive framework for kernel-assisted and topology-aware collective communications on many-core clusters. In Parallel and Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 970–982, 2012.

[9] Teng Ma, G. Bosilca, A. Bouteiller, B. Goglin, J.M. Squyres, and J.J. Dongarra. Kernel assisted collective intra-node MPI communication among multi-core and many-core CPUs. In Parallel Processing (ICPP), 2011 International Conference on, pages 532–541, September 2011.

[10] Thilo Kielmann, Rutger F. H. Hofman, Henri E. Bal, Aske Plaat, and Raoul A. F. Bhoedjang. MagPIe: MPI's collective communication operations for clustered wide area systems. SIGPLAN Not., 34(8):131–140, May 1999.

[11] Rajeev Thakur, Rolf Rabenseifner, and William Gropp. Optimization of collective communication operations in MPICH. International Journal of High Performance Computing Applications, 19(1):49–66, 2005.

[12] Jeffrey M. Squyres and Andrew Lumsdaine. The component architecture of Open MPI: Enabling third-party collective algorithms. In Vladimir Getov and Thilo Kielmann, editors, Proceedings, 18th ACM International Conference on Supercomputing, Workshop on Component Models and Systems for Grid Applications, pages 167–185, St. Malo, France, July 2004. Springer.

[13] S.S. Vadhiyar, G.E. Fagg, and J. Dongarra. Automatically tuned collective communications. In Supercomputing, ACM/IEEE 2000 Conference, pages 3–3, November 2000.

[14] Meng-Shiou Wu, R.A. Kendall, and K. Wright. Optimizing collective communications on SMP clusters. In Parallel Processing, 2005. ICPP 2005. International Conference on, pages 399–407, 2005.

[15] Hao Zhu, David Goodell, William Gropp, and Rajeev Thakur. Hierarchical collectives in MPICH2. In Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 325–326, Berlin, Heidelberg, 2009. Springer-Verlag.

[16] A.R. Mamidala, R. Kumar, D. De, and D.K. Panda. MPI collectives on modern multicore clusters: Performance optimizations and communication characteristics. In Cluster Computing and the Grid, 2008. CCGRID '08. 8th IEEE International Symposium on, pages 130–137, 2008.

[17] Richard L. Graham and Galen Shipman. MPI support for multi-core architectures: Optimized shared memory collectives. In Proceedings of the 15th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 130–140, Berlin, Heidelberg, 2008. Springer-Verlag.

[18] Nicholas T. Karonis, Bronis R. de Supinski, Ian T. Foster, William Gropp, and Ewing L. Lusk. A multilevel approach to topology-aware collective operations in computational grids. CoRR, cs.DC/0206038, 2002.

[19] Paul Sack and William Gropp. Faster topology-aware collective algorithms through non-minimal communication. SIGPLAN Not., 47(8):45–54, February 2012.

[20] Valentin Kravtsov, Martin Swain, Uri Dubin, Werner Dubitzky, and Assaf Schuster. A fast and efficient algorithm for topology-aware coallocation. In Proceedings of the 8th International Conference on Computational Science, Part I, ICCS '08, pages 274–283, Berlin, Heidelberg, 2008. Springer-Verlag.

[21] E. Jeannot, G. Mercier, and F. Tessier. Process placement in multicore clusters: Algorithmic issues and practical techniques. Parallel and Distributed Systems, IEEE Transactions on, 25(4):993–1002, April 2014.

[22] Jin Zhang, Jidong Zhai, Wenguang Chen, and Weimin Zheng. Process mapping for MPI collective communications. In Proceedings of the 15th International Euro-Par Conference on Parallel Processing, Euro-Par '09, pages 81–92, Berlin, Heidelberg, 2009. Springer-Verlag.

[23] Teng Ma, T. Herault, G. Bosilca, and J.J. Dongarra. Process distance-aware adaptive MPI collective communications. In Cluster Computing (CLUSTER), 2011 IEEE International Conference on, pages 196–204, 2011.