
Future Generation Computer Systems 159 (2024) 188–203


Adding topology and memory awareness in data aggregation algorithms


François Tessier a,∗, Venkatram Vishwanath b, Emmanuel Jeannot c

a Inria, University of Rennes, CNRS, IRISA, Rennes, France
b Argonne National Laboratory, Lemont, IL, USA
c Inria, LaBRI, Univ. Bordeaux, CNRS, Bordeaux-INP, Bordeaux, France

ARTICLE INFO

MSC: 68W15, 68W10

Keywords: Data movement; I/O; Data aggregation; Deep memory and storage hierarchy; Architecture-aware placement

ABSTRACT

With the growing gap between computing power and the ability of large-scale systems to ingest data, I/O is becoming the bottleneck for many scientific applications. Improving read and write performance thus becomes decisive, and requires consideration of the complexity of architectures. In this paper, we introduce TAPIOCA, an architecture-aware data aggregation library. TAPIOCA offers an optimized implementation of the two-phase I/O scheme for collective I/O operations, taking advantage of the many levels of memory and storage that populate modern HPC systems, and leveraging network topology. We show that TAPIOCA can significantly improve the I/O bandwidth of synthetic benchmarks and I/O kernels of scientific applications running on leading supercomputers. For example, on HACC-IO, a cosmology code, TAPIOCA improves data writing by a factor of 13 on nearly a third of the target supercomputer.

1. Introduction

In the domain of large-scale simulations, driven by the demand for reliability and precision, the generation of tera- or petabytes of data has become increasingly prevalent. However, a growing disparity between compute power and I/O performance on supercomputers has emerged. Over the past decade, the ratio of I/O bandwidth to computing power for the first three systems on the Top500¹ list has decreased by a factor of ten, as illustrated in Fig. 1. In that context, efficiently moving data between the applications and the storage system within high-performance computing (HPC) machines is crucial.

On the application side, managing I/O is complicated by the diverse data structures employed. For instance, particle-based applications often require writing multiple variables in multidimensional arrays distributed among processing entities, while adaptive mesh refinement (AMR) applications must handle varying I/O sizes depending on input parameters. The popularity of deep learning algorithms has introduced new workloads demanding vast quantities of input data. Additionally, complex workflows like in-situ visualization and analysis further exacerbate this complexity. Consequently, optimizing data movement is of paramount importance for the foreseeable future for scaling science.

From a hardware perspective, there has been a growing disparity between the amount of data that needs to be transferred and the memory or storage capabilities in terms of both capacity and performance. To address this issue, hardware vendors have introduced intermediate tiers of memory and storage, which must be utilized effectively to alleviate the I/O bottleneck. However, these memory hierarchy levels come with unique characteristics and sometimes require a dedicated software stack, making efficient use challenging. In addition, the process of moving data necessitates traversing intricate network topologies that must be considered, such as an all-to-all, 5D-torus or dragonfly.

In this landscape, harnessing these architecture and application characteristics is key to making optimized decisions. Among these techniques, data aggregation plays a central role in mitigating data movement bottlenecks. It involves aggregating data at various points in the architecture to optimize expensive data access operations. In collective I/O operations, for instance, data aggregation accumulates contiguous data chunks in memory before writing them to the storage system. This approach is called ''two-phase I/O''. However, the current implementations of the two-phase I/O scheme suffer from several limitations, especially with regard to the complexity of modern architectures. A reevaluation of this algorithm that fully leverages the potential of new technologies such as RDMA (Remote Direct Memory Access) and asynchronous operations can greatly improve I/O performance. Furthermore, an approach that is agnostic to the network topology and the memory is necessary to effectively handle the deepening complexity of memory and topology hierarchies.

In this paper, we introduce TAPIOCA, an I/O library designed to perform architecture-aware data aggregation at scale. TAPIOCA targets applications using collective I/O operations and can be extended to intricate workflows such as in-situ or in-transit analysis that may require temporarily persistent data. With an abstraction layer for the network interconnect and deep memory hierarchy, this library can execute data aggregation on any memory or storage system in current and forthcoming large-scale systems, offering seamless portability across various supercomputers. To determine the most suitable location for data aggregation, we also provide a detailed cost model minimizing data movement. To validate our approach, we demonstrate how TAPIOCA outperforms traditional I/O calls on a synthetic benchmark and the I/O kernels of two real applications. We run our experiments on two leadership-class supercomputers and a visualization cluster at Argonne National Laboratory, USA, all featuring characteristics we are seeing on emerging exascale architectures.

∗ Corresponding author. E-mail address: [email protected] (F. Tessier).
1 https://www.top500.org/

https://doi.org/10.1016/j.future.2024.05.016
Received 15 December 2023; Received in revised form 7 May 2024; Accepted 14 May 2024; Available online 18 May 2024
0167-739X/© 2024 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.

Fig. 1. Ratio of I/O bandwidth (GBps) to computing power (TFlops) of the top 3 supercomputers of the Top500 over the past 10 years.

2. Context and motivation

2.1. Large-scale simulations

Large-scale simulations can be categorized into various groups, and among them, certain applications are heavily reliant on I/O operations, resulting in substantial time spent accessing the storage system. There are several factors contributing to this behavior. For instance, some applications involve extensive reading of input data that must be processed during the simulation. Conversely, other applications generate significant amounts of data that require subsequent processing after generation. Additionally, certain applications frequently access the file system for checkpointing purposes. It is worth noting that these categories are not mutually exclusive, and an application can fall into multiple categories. Therefore, optimizing the I/O access of such applications holds paramount significance, as it directly impacts the overall execution time.

2.2. Accessing data at scale

In recent years, the ratio between compute and I/O performance of supercomputers has been constantly degrading. Nowadays, in many applications, I/O is becoming a bottleneck, requiring improved data movement. In order to reduce this gap, effort has been carried out at the hardware level, especially for the topology of the machine. Indeed, network topologies, despite being more complex, tend to reduce the distance between the data and the storage. Many supercomputers feature I/O nodes embedded within racks to serve as a proxy to the parallel file-system. This architecture helps to avoid I/O interference by decoupling the compute network and the I/O network. On the IBM BG/Q, for example, a 5D-torus network offers a limited number of hops between compute nodes and storage while providing different routes to distribute the load [1]. In addition, a partitioning of the nodes in blocks of 512 nodes linked to four I/O nodes reduces as much as possible the impact of I/O interference between jobs and ensures good performance reproducibility. On Cray XC40, a dragonfly network topology is deployed. Thus, the minimal distance from one node to another is at most three hops (although the routing strategy can transmit packets through more links). On this platform, a subset of nodes, called LNET routers, plays the role of a proxy to the storage system. Another network is then dedicated to sending data to disk.

A complementary approach consists in using on-node memory and storage to reduce the I/O pressure of each application on the parallel file-system. This allows several optimizations. For example, SSD-based burst buffers (such as the ones used on the Cray Cori infrastructure [2]) are intermediate nodes between compute nodes and the storage system that supply a smaller storage capacity but a higher I/O bandwidth. These storage tiers are designed to absorb bursts and accelerate the I/O phase of applications. While writing data out for future analysis is costly, another technique involves storing data in memory for in situ analysis. Although bandwidth-efficient, this approach is limited by the amount of memory available.

An ad-hoc file system [3,4] is an application-level file system, deployed at launch time, that intercepts I/O accesses to store data locally, on the nodes, hiding and abstracting the local memory or burst buffers present in the machine. With such a system, the question of whether to stage data out at some point (when the SSDs are full) or at the end of the application remains a key problem.

At the same time, parallel file systems have been improved to support an increasing I/O load in terms of both throughput and available storage capacity. This software stack is accompanied by strong algorithms to balance the I/O load.

Despite these upgrades, however, room remains for improvement in parallel I/O and more generally in data movements. In particular, an abstraction layer is necessary for taking network topologies, local memory and disks into account. The goal is to use, in a simple way, the hardware features of the supercomputer at their full potential to optimize I/O operations.

2.3. MPI I/O and the two-phase I/O scheme

MPI [5] is widely utilized for the development of large-scale distributed-memory applications on high-performance clusters. Within MPI, the MPI I/O component plays a critical role in facilitating input and output operations. One significant aspect of MPI I/O is the collective I/O mechanism, which enables efficient reading and writing of data at scale. In collective I/O, all MPI tasks involved in the communication invoke the I/O routine in a synchronized manner, allowing the MPI runtime system to optimize data movement based on various application parameters, including data size, memory layout, and storage arrangement.

The two-phase I/O algorithm, present in MPI I/O implementations like ROMIO [6], is a well-established and efficient optimization technique. It involves selecting a subset of processes to aggregate contiguous data segments (aggregation phase) before writing them to the storage system (I/O phase). The primary objective of this approach is to minimize latency and enhance parallel file-system accesses by aggregating data in a manner that aligns with the layout of the parallel file-system. Fig. 2 provides an illustrative example of this technique, featuring four processes with two selected as aggregators. By minimizing network contention around the storage system and maximizing the I/O bandwidth through the writing of large contiguous data chunks, substantial performance improvements are achieved. However, the current implementation of this approach exhibits several limitations. Firstly, despite offering improved I/O performance compared to direct access, it often falls short of achieving the peak I/O bandwidth. Secondly, there is an observed inefficiency in the placement policy for aggregators, despite the potential impact on performance through smart mapping. Lastly, existing implementations fail to leverage the data model, data layout, and memory and system hierarchy effectively.
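For illustration, a minimal collective write with MPI I/O looks as follows (a generic example, not taken from the paper; the file name and sizes are arbitrary). Each rank contributes a contiguous block at its own offset, and the MPI runtime is free to apply the two-phase scheme underneath the collective call.

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Each rank owns a contiguous chunk of 25,000 integers (~100 kB).
    const int count = 25000;
    std::vector<int> chunk(count, rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Collective call: all ranks participate, which lets the MPI runtime
    // aggregate contiguous pieces on a subset of processes (two-phase I/O)
    // before issuing large writes to the parallel file system.
    MPI_Offset offset = static_cast<MPI_Offset>(rank) * count * sizeof(int);
    MPI_File_write_at_all(fh, offset, chunk.data(), count, MPI_INT,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}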


Fig. 2. Example of the two-phase I/O mechanism.

This work focuses on addressing these limitations within the context of the two-phase I/O scheme. Specifically, we present TAPIOCA, an I/O library built on top of MPI I/O, designed to optimize the two-phase I/O scheme for large-scale supercomputers with a keen awareness of system topology. TAPIOCA encompasses three primary directions: an efficient implementation of the two-phase I/O algorithm, an enhanced aggregator placement strategy that accounts for system characteristics, and a versatile interface to query system topology information.

3. Related work

Parallel I/O [7] is an active research topic, primarily developed in the context of intensive parallel I/O. While I/O tuning is a crucial step to increase I/O bandwidth, improvements at various layers of the I/O software stack are necessary. Parallel file systems like GPFS [8] and Lustre are widely used [9], and parallel I/O libraries such as MPI I/O and its ROMIO [6] implementation, part of the MPI-2 [5] standard, are common for performing reads and writes. Collective I/O techniques, like Chaarawi et al.'s evaluation of various write algorithms [10], have been deployed to enhance performance. The two-phase I/O algorithm [11], which aggregates data on a subset of processes before writing it to the storage system, is a de facto collective I/O approach. Various efforts have been made to optimize this algorithm [12–17], but they usually lack awareness of the available tiers of memory and storage, limiting their ability to leverage these resources effectively. Other research has investigated multithreading to overlap aggregation and I/O phases using double buffering [15,18]. However, the optimal number of aggregators and buffer size in collective I/O remains an open question. In general, other I/O libraries offer aggregation techniques, but these are generally not very advanced or scalable [19].

Data movement optimizations based on data aggregation have also been explored beyond low-level I/O libraries. At an ephemeral file-system level, for instance, UnifyFS has implemented aggregation as a unique namespace on node-local storage resources [20]. Other work, for example in the field of checkpointing [21,22], has developed aggregation techniques to accelerate write phases, notably via asynchronous operations. However, these approaches are limited to a specific tool or framework and are architecture-agnostic, whereas TAPIOCA proposes to take advantage of MPI, which is widely used in the community, while also leveraging the machine's topology. From a workflow point of view, some authors have investigated using available SSDs to overcome DRAM shortages in specific workflows [23], while others have focused on finely describing workflow data movements [24,25]. However, these techniques often require users to possess in-depth knowledge of their applications.

In contrast, some research has been conducted from a runtime perspective. For example, authors have proposed transparently moving data from applications to storage systems through an intermediate fast storage layer [26]. Another approach explored the use of fast storage layers (e.g., burst buffers) as a distributed file system [4], and a driver for MPI-IO was developed to take advantage of network-attached memory tiers [27]. However, these approaches are tailored to specific architectures and memory tiers, limiting portability.

To ensure code portability and accommodate the emerging exascale machines with new memory and storage tiers, an architecture abstraction is essential. Hwloc [28] offers a common hardware abstraction, although it provides qualitative information and does not account for the interconnect network. At a higher level, SharP [29] provides an abstraction layer for allocating memory on any available tier, but it is dependent on the data model to handle. Our approach stands out by adopting a data aggregation method that considers the underlying architecture through memory and network interconnect abstractions. TAPIOCA can perform aggregation on any available memory and storage tier, offering a model that minimizes the cost of data movement, while being independent of the application's data model.

Our current approach differs from existing solutions by combining an optimized buffering system with an architecture-aware quantitative aggregator mapping strategy. It targets various systems, such as IBM BG/Q and Cray XC40, along with both GPFS and Lustre. Furthermore, it is extensible to accommodate new storage tiers and takes into account the application's I/O pattern.

4. Our approach

In this paper, we introduce TAPIOCA (standing for Topology-Aware Parallel I/O: Collective Algorithm), an MPI-based library for collective I/O operations using an optimized architecture-aware two-phase I/O algorithm. By relying on an architecture abstraction layer, TAPIOCA enables the placement of aggregators taking into account the network topology and the available memory and storage spaces. Our library also optimizes data aggregation by taking the data layout into account through a description of the I/O phases in the application's code. Finally, we focused on an efficient implementation leveraging one-sided communication (Remote Memory Access) and multi-buffering.

In the rest of this section, we present these different aspects of TAPIOCA. We begin by detailing our hardware abstraction layer, then we introduce our architecture-aware cost model for data aggregation. We conclude this section by presenting our aggregation algorithms for both read and write collective operations. For the remainder of this paper, we will use the term buffer to refer to the memory space dedicated to aggregation on the ''aggregator'' processes, and target to designate the destination of the data (typically, a parallel file system).

4.1. Architecture abstraction

A key feature of our approach is to achieve code and performance portability across a broad variety of architectures, including emerging and future network interconnects and tiers of memory and storage. To do so, we have developed two abstraction layers with which our library interacts for both efficient aggregator placement and management of reads and writes on different levels of memory and storage. Fig. 3 depicts how those components fit into TAPIOCA while Listings 1 and 2 show some of the API functions of those two abstractions.

Fig. 3. High-level view of TAPIOCA and the two abstraction layers.

Our memory abstraction (Listing 1) allows us to allocate and free buffers on any kind of memory or storage. The memRead() and memWrite() functions are in charge of data movements from/to an allocated buffer. As some operations are either asynchronous or need a process involved to be completed, a memFlush() function has been implemented to ensure that all the operations initiated on the buffer are finished. Functions giving vendor or experimentally measured performance values for the memory tiers are also available. The memPersistency() function returns the persistency capability of a memory tier. The cost model we describe in 4.2 queries those values.

Listing 1: Function prototypes for memory/storage data movements

buff_t* memAlloc (mem_t mem, int buffSize, bool masterRank, char* fileName, MPI_Comm comm);
void    memFree (buff_t* buff);
int     memWrite (buff_t* buff, void* srcBuffer, int srcSize, int offset, int destRank);
int     memRead (buff_t* buff, void* srcBuffer, int srcSize, int offset, int srcRank);
void    memFlush (buff_t* buff);
int     memLatency (mem_t mem);
int     memBandwidth (mem_t mem);
int     memCapacity (mem_t mem);
int     memPersistency (mem_t mem);
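As a rough usage sketch (not taken from the TAPIOCA sources), an aggregation step through this API could look as follows; the mem_t value HBM and the exact semantics of masterRank and fileName are inferred from the prototypes above and should be read as assumptions.

// Hypothetical usage sketch of the memory abstraction in Listing 1.
// The HBM tier identifier and argument semantics are assumptions.
void aggregate_on_hbm(void* localChunk, int chunkSize, int myOffset,
                      int aggregatorRank, MPI_Comm comm) {
    const int buffSize = 16 * 1024 * 1024;   // 16 MB aggregation buffer

    // Allocate the aggregation buffer on the high-bandwidth memory tier.
    buff_t* buff = memAlloc(HBM, buffSize, /*masterRank=*/true,
                            /*fileName=*/nullptr, comm);

    // Every producer pushes its chunk into the aggregator's buffer.
    memWrite(buff, localChunk, chunkSize, myOffset, aggregatorRank);

    // Wait until all (possibly asynchronous/RMA) operations have completed.
    memFlush(buff);

    memFree(buff);
}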


Technically, this memory abstraction internally calls the appropriate functions according to the type of memory managed. For instance, if data is aggregated on the high-bandwidth memory (HBM), the memkind² library will be used for allocation and deallocation. Depending on the scope of the memory bank, the memory management technique may vary. An on-node SSD, for example, is locally accessible with regular I/O calls (POSIX, MPI, ...) but has to be exposed to remote nodes in case of aggregation from multiple compute nodes. In this case, we implemented this feature by mapping a file on the SSD into the main memory through a mmap system call, then by exposing this buffer to remote nodes with an MPI window (RMA).
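The sketch below illustrates that mechanism in isolation rather than TAPIOCA's actual implementation: a file created on a node-local SSD (placeholder path) is mapped into memory and exposed as an MPI window so that remote ranks can target it with RMA operations. Error handling is omitted.

#include <mpi.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Map a file living on a node-local SSD and expose it through an MPI window.
// "/local/ssd/aggr.tmp" is a placeholder path for the on-node SSD.
void* expose_ssd_buffer(MPI_Comm comm, size_t buffSize, MPI_Win* win) {
    int fd = open("/local/ssd/aggr.tmp", O_RDWR | O_CREAT, 0600);
    ftruncate(fd, buffSize);                       // reserve space on the SSD

    // The mapping gives a DRAM-addressable view of the file.
    void* buf = mmap(nullptr, buffSize, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd);                                     // mapping stays valid

    // Expose the mapped region for one-sided communication (RMA).
    MPI_Win_create(buf, buffSize, 1, MPI_INFO_NULL, comm, win);
    return buf;
}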
The network abstraction provides the relative location of compute nodes as well as performance information. In order to tackle various topologies and make our approach work on a diverse set of supercomputers, we developed a generic C++ interface to implement our data aggregation method for use on any system. Listing 2 presents the main function prototypes to implement to take advantage of a topology-aware aggregator placement. Some of these values can be computed dynamically during the execution, while others, depending on the platform, need a one-time preliminary run of vendor tools to gather topology information. For example, on BG/Q+GPFS a hardware-specific MPI extension (MPIX library [30]) offers a set of functions providing information such as the distance to the I/O node (MPIX_IO_distance), while on a Cray XC40 machine associated with a Lustre filesystem, more work is needed to gather the I/O node placement. Overall, the effort required to support a new architecture is quite low and is independent of the application.

Listing 2: Function prototypes for network interconnect

int networkBandwidth (int level);
int networkLatency ();
int networkDistanceToIONode (int rank, int IONode);
int networkDistanceBetweenRanks (int srcRank, int destRank);

4.2. Architecture-aware aggregators placement

The second main contribution of this work on data aggregation concerns the aggregator placement policy. The various implementations of the MPI standard propose a set of aggregator mapping strategies for the two-phase I/O scheme. For example, in MPICH [31] a strategy consists in selecting the bridge node (i.e. the node directly linked to the I/O node) as a first aggregator and the other aggregators following a rank order. This strategy takes into account neither the distance between the compute nodes and the storage system nor the amount of data exchanged. Moreover, the process mapping may severely impact the performance by selecting aggregators on neighboring nodes, inevitably creating contention. We propose in TAPIOCA a topology-aware approach for aggregator placement. Also, while existing methods select a subset of nodes to gather chunks of data, we consider a set of available memory banks on nodes and choose among those tiers the ones fulfilling the persistency and performance requirements. For instance, if the number of aggregators is equal to the number of nodes (i.e. one aggregator per node), we can locally aggregate data on the fastest available memory tier. In case we have more than one node sending data to an aggregator, the I/O bandwidth and the latency will probably be bounded by the performance of the network interconnect. Thus, a memory tier with enough capacity is sufficient. Another criterion we include in our model concerns data persistency. A workflow including in-situ analysis, for example, may need temporarily persistent local data.

Therefore, our strategy involves considering the topology of the target system and the memory/storage requirements in an objective function in order to determine a near-optimal aggregator placement minimizing data movements. For the rest of this paper, we call ''partition'' a subset of nodes hosting processes sharing a contiguous piece of data in file. The number of aggregators defines the partition size, each partition electing one aggregator among the processes.

Given, for each partition:

• V_M: the set of heterogeneous memory banks present in the partition and fulfilling the persistency requirements;
• A ∈ V_M: a memory tier able to aggregate data, chosen among the available memory banks;
• Cap_A: the capacity of a memory tier A;
• T: the target memory, usually a file system;
• N_buff: the number of aggregation buffers;
• S_buff: the aggregation buffer size;
• ω(u, v): the amount of data to move from one memory bank u to another v, with u, v ∈ V_M;
• d(u, v): the distance between memory banks u and v (hops or bus), with u, v ∈ V_M;
• l: the latency, such that l = max(l_network, l_v) with v ∈ V_M;
• B_{u→v}: the bandwidth from memory bank u to v, with u, v ∈ V_M, such that B_{u→v} = min(B_network, B_u, B_v).

4.2.1. Memory requirements

First, the selected aggregator has to fulfill a memory capacity condition. The memory bank chosen to aggregate data has to have a capacity greater than or equal to the size needed for the aggregation buffers. We consider two cases: with and without a need for persistency. If the aggregated data needs to be persistent in memory, the memory capacity has to be at least the sum of the data produced for an aggregator:

Cap_A ≥ Σ_{u ∈ V_M, u ≠ A} ω(u, A)

2 http://memkind.github.io/memkind/.


However, if persistency is not necessary, the memory capacity must be able to contain the number of buffers required for aggregation. More formally, the memory capacity has to be such that:

Cap_A ≥ N_buff × S_buff

Once this prerequisite has been met, we obtain a subset V_m ⊆ V_M containing the aggregator candidates from the set of memory banks. The next step consists of selecting the most appropriate memory tier providing the best I/O bandwidth among the candidates.

4.2.2. Objective function

Fig. 4. Objective function minimizing the communication costs to and from an aggregator.

To do so, we define two costs C_1 and C_2 as depicted in Fig. 4. C_1 corresponds to the cost of aggregating data onto the aggregator. To compute this cost, we sum up the cost of each data producer i sending an amount of data ω(i, A) to a memory bank A used for aggregation. This cost takes into account the slowest bandwidth involved as well as the worst latency.

C_1 = Σ_{i ∈ V_M, i ≠ A} ( l × d(i, A) + ω(i, A) / B_{i→A} )

C_2 is the cost of sending the aggregated data to the destination (typically, the storage system).

C_2 = l × d(A, T) + ω(A, T) / B_{A→T}

Every node is in charge of computing the cost, for each of its local memory banks, of being an aggregator. Let us take as an example a node hosting three different types of memory complying with the persistency and capacity requirements mentioned previously. Three pairs of {C_1, C_2} will be computed, one for each tier.

To determine the near-optimal location for data aggregation, we find the minimal value of the sum of these two costs among the elements of V_m. More formally, our objective function is:

ArchAware(A) = min (C_1 + C_2)

A call to MPI_Allreduce across a partition with the MPI_MINLOC parameter enables our algorithm to choose as an aggregator the process with the minimal cost. Hence, for each partition an aggregator is elected.

4.2.3. Toy example

Fig. 5. Toy example of four processes collectively writing data on a Lustre file system through a data aggregation process.

Fig. 5 illustrates our model with four processes that need to collectively write data on a parallel file system (PFS). We consider that each process is located on a different node. Two memory banks within a node are separated by one hop while the distance between nodes is noted on the links (white circles). Each node hosts two types of memory in addition to the main memory (DRAM): a high-bandwidth memory (HBM) and an HDD-based non-volatile memory (NVR). The source of the data is the DRAM (blue boxes) while the destination is a Lustre file system (green box). There is no need for intermediate persistency. Processes P0, P1, P2 and P3 respectively produce 10 MB, 50 MB, 20 MB and 5 MB. Based on vendor values, we set in Table 1 the latency, bandwidth, capacity and level of persistency of the available tiers of memory and of the interconnect network for this toy example.

Table 1
Memory and network capabilities based on vendor information.

Value             HBM   DRAM   NVR            Network
Latency (ms)      10    20     100            30
Bandwidth (GBps)  180   90     0.15           12.5
Capacity (GB)     16    192    128            N/A
Persistency       No    No     Job lifetime   N/A

Table 2 shows, for each process, the cost of aggregating data on its locally available tiers of memory. Our model shows that the most advantageous location for aggregation is the high-bandwidth memory available on the node hosting process P1. We can notice that the difference between aggregation on HBM and DRAM is negligible. We observed this result with real experiments on a supercomputer equipped with those types of memory. Likewise, this behavior has also been observed in a related work [32]. Finally, Fig. 6 depicts the decision taken by TAPIOCA for the aggregator selection.

Table 2
For each process, according to the amount of data produced (ω) and the network and memory information, sum of the aggregation cost Cost_A and the I/O cost Cost_T.

P#   ω(i, A)   HBM     DRAM    NVR
0    10        0.593   0.603   2.350
1    50        0.470   0.480   2.020
2    20        0.742   0.752   2.710
3    5         0.503   0.513   2.120

It has to be noted that the aggregation memory can also be defined by the user through an environment variable (TAPIOCA_AGGRTIER). When using this method, the environment variable can be set to any memory tier implemented in our memory abstraction. The aggregator location is then computed according to the topology information only.
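To make the selection step concrete, the following sketch (illustrative code, not TAPIOCA's) evaluates C_1 + C_2 for a candidate tier and then elects the minimum-cost process of a partition with MPI_Allreduce and MPI_MINLOC, as described in Section 4.2.2. Variable names and units are placeholders.

#include <mpi.h>

// Cost of aggregating on one local tier, then shipping to the target
// (Section 4.2.2): C1 sums producer->aggregator transfers, C2 is the
// aggregator->target transfer. Bandwidths are assumed to already be the
// minimum of the link and tier bandwidths.
double tier_cost(int nProducers, const double* dataGB, const int* dist,
                 double latency, double bw, int distToTarget,
                 double targetBw, double aggrGB) {
    double c1 = 0.0;
    for (int i = 0; i < nProducers; ++i)
        c1 += latency * dist[i] + dataGB[i] / bw;
    double c2 = latency * distToTarget + aggrGB / targetBw;
    return c1 + c2;
}

// Elect, inside a partition communicator, the process whose best local tier
// has the minimal cost, using MPI_MINLOC as in the placement step above.
int elect_aggregator(MPI_Comm partition, double myBestCost) {
    int rank;
    MPI_Comm_rank(partition, &rank);
    struct { double cost; int rank; } in = { myBestCost, rank }, out;
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, partition);
    return out.rank;   // every process learns the elected aggregator
}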


Fig. 6. Decision taken by TAPIOCA for selecting the most appropriate aggregator given the initial state described in Fig. 5.

4.3. Advanced data aggregation algorithms

In this section we show how to use TAPIOCA in applications to perform collective I/O operations. Then we detail our architecture-aware implementation of read and write calls taking advantage of advanced techniques such as one-sided communication and multi-buffering.

4.3.1. Data pattern awareness

Compared with the MPI standard, our approach requires the description of the upcoming I/O operations before performing read or write calls. We extract from this information the data model (multi-dimensional arrays) and the data layout (array of structures, structure of arrays). The identification of these data patterns is the key to better scheduling I/O and to reducing the idle time for all the MPI tasks. As an example, Algorithm 1 describes the collective MPI I/O calls needed for a set of MPI processes writing three arrays in a file, each one describing a dimension of coordinates in (x, y, z), following an array of structures data layout. Each call to MPI_File_write_at_all is a collective operation independent of the next calls.

Algorithm 1: Collective MPI I/O writes.
1  n ← 5;
2  x[n], y[n], z[n];
3  offset ← rank × 3 × n;
5
6  MPI_File_write_at_all (f, offset, x, n, type, status);
7  offset ← offset + n;
9
10 MPI_File_write_at_all (f, offset, y, n, type, status);
11 offset ← offset + n;
13
14 MPI_File_write_at_all (f, offset, z, n, type, status);

With TAPIOCA, application developers have to describe the upcoming writes. This description contains nothing more than what is already known and requires less than a dozen lines of code. Algorithm 2 is the TAPIOCA version of Algorithm 1. Since we have three variables to write, we declare arrays of size 3 describing the number of elements, the size of the data type, and the offset in file (for loop starting line 6). Then, TAPIOCA is initialized with this information. This phase enables our library to schedule the aggregation phase in order to completely fill an aggregator buffer before flushing it to the target. Fig. 7 gives another perspective of what happens when performing this write phase with MPI I/O and TAPIOCA. In our example, MPI I/O has to flush three almost empty buffers in file while TAPIOCA can aggregate all the data. Moreover, TAPIOCA also takes advantage of buffer pipelining to further optimize the aggregation and I/O phases.

Algorithm 2: Collective TAPIOCA writes.
1  n ← 5;
2  x[n], y[n], z[n];
3  offset ← rank × 3 × n;
5
6  for i ← 0, i < 3, i ← i + 1 do
7    count[i] ← n;
8    type[i] ← sizeof (type);
9    ofst[i] ← offset + i × n;
11
12 TAPIOCA_Init (count, type, ofst, 3);
14
15 TAPIOCA_Write (f, offset, x, n, type, status);
16 offset ← offset + n;
18
19 TAPIOCA_Write (f, offset, y, n, type, status);
20 offset ← offset + n;
22
23 TAPIOCA_Write (f, offset, z, n, type, status);

Fig. 7. Calling three collective writes for an array of structures data layout with MPI I/O and TAPIOCA.

4.3.2. Buffers pipelining

In order to optimize both the aggregation phase and the I/O phase, each aggregator manages at least two buffers. Therefore, while data is aggregated into a buffer, another one can be flushed into the target. In our implementation, as the aggregation phase is performed with RMA operations (one-sided communication), no synchronization is needed between the processes sending data to the aggregators and the aggregators themselves. Moreover, the aggregators perform non-blocking independent writes to the target (usually a storage system), making themselves available for other operations. In this way the aggregators are able to flush a full buffer while receiving data into another one. This loop is performed as many times as necessary to process the data. The buffers used by the aggregators to stage data are allocated as a multiple of the target file-system block size to avoid lock penalties during the I/O phase. As depicted in Fig. 8, a series of experiments in which we ran a simple benchmark from 2048 BG/Q nodes on a GPFS file-system (each process writes the same chunk size to different offsets of a single shared file) helped motivate this choice, although this behavior is known.

Each instance of a buffer being filled and then flushed is called a round. A global round is equivalent to a round performed by the same buffer on all the aggregators.
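A minimal sketch of this double buffering on the aggregator side is given below (an illustration of the pattern, not the library's code): the flush of a full buffer is issued as a non-blocking file write, and a buffer is only reused once its pending request has completed.

#include <mpi.h>

// Double-buffered flushing on an aggregator: buffer 'cur' is being filled by
// RMA puts while the other buffer drains to the file asynchronously.
void flush_rounds(MPI_File fh, char* buf[2], MPI_Offset buffSize,
                  int totalRounds, MPI_Offset baseOffset) {
    MPI_Request req[2] = { MPI_REQUEST_NULL, MPI_REQUEST_NULL };

    for (int round = 0; round < totalRounds; ++round) {
        int cur = round % 2;

        // Make sure the previous use of this buffer has reached the target
        // before overwriting it with the next round of aggregated data.
        MPI_Wait(&req[cur], MPI_STATUS_IGNORE);

        // ... the aggregation phase fills buf[cur] via one-sided puts here ...

        // Non-blocking, independent write: the aggregator keeps serving RMA
        // traffic for the next round while this buffer is written to storage.
        MPI_File_iwrite_at(fh, baseOffset + (MPI_Offset)round * buffSize,
                           buf[cur], (int)buffSize, MPI_BYTE, &req[cur]);
    }
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}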


Fig. 8. Benchmark measuring the impact of the file-system block size for write operations.

4.3.3. TAPIOCA's collective write and read algorithms

Using the contributions of the previous sections, we present here our write and read algorithms implemented in TAPIOCA and executed by the aggregation processes selected thanks to our cost model presented in Section 4.2.

Algorithm 3: TAPIOCA Write Algorithm
1  GlobalRound ← 0;
2  TotalRound ← ComputeNumberOfRounds (datasize);
4
5  Function TAPIOCA_Write (f, offset, data, size, type, status)
6    round ← GetRound();
7    aggr ← GetAggregatorRank();
8    chunkSize ← GetRoundSize(round);
9    buffCount ← GetBuffCount();
10   buffId ← globalRound % buffCount;
12
13   while round ≠ globalRound do
14     Fence ();
15     if I am an aggregator then
16       iFlush_Buffer (buffId);
17     globalRound ← globalRound + 1;
18     buffId ← globalRound % buffCount;
20
21   RMA_Put (data, chunkSize, offset, aggr, buffId);
23
24   if chunkSize = size then
25     while globalRound ≠ TotalRounds do
26       Fence ();
27       if I am an aggregator then
28         iFlush_Buffer (buffId);
29       globalRound ← globalRound + 1;
30       buffId ← globalRound % buffCount;
31   else
32     TAPIOCA_Write (f, offset + chunkSize, data + roundSize, size − chunkSize, type, status);

Algorithm 3 details the write method implemented in our library. For each call to TAPIOCA_Write, we retrieve information computed during the initialization phase such as the number of aggregation buffers, the round number, the target aggregator, the amount of data to write during this round and the aggregator buffer to put data in (lines 6 to 10). Then, the while loop starting from line 13 blocks the processes whose current round is different from the global round in a fence (a barrier in the context of MPI one-sided communication). Only the processes with the matching round can lift the barrier. If a process passing this fence is an aggregator, it flushes the appropriate buffer into the file (I/O phase). Line 21 just puts the data into the target buffer by way of a one-sided operation (aggregation phase). If the process has written all its data, it enters a portion of code similar to the one starting from line 13. Else, we recursively call this TAPIOCA_Write function again while updating the function parameters.
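In plain MPI one-sided terms, one aggregation round follows the fence/put structure sketched below (an illustration of the mechanism, not the library's code); the window win is assumed to expose the aggregator's current buffer.

#include <mpi.h>

// One aggregation round: a fence opens the epoch, non-aggregators push their
// chunk into the aggregator's exposed buffer, and a second fence closes the
// epoch so the aggregator can safely flush the buffer afterwards.
void aggregation_round(MPI_Win win, int aggrRank, bool iAmAggregator,
                       const char* chunk, int chunkSize,
                       MPI_Aint offsetInBuffer) {
    MPI_Win_fence(0, win);                 // start of the access epoch

    if (!iAmAggregator) {
        // One-sided put: no matching receive is posted on the aggregator.
        MPI_Put(chunk, chunkSize, MPI_BYTE,
                aggrRank, offsetInBuffer, chunkSize, MPI_BYTE, win);
    }

    MPI_Win_fence(0, win);                 // end of the epoch: puts are visible

    if (iAmAggregator) {
        // The buffer now holds contiguous data and can be flushed to the
        // target, e.g. with a non-blocking file write (see Section 4.3.2).
    }
}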
We present in Algorithm 4 our read procedure. We can mainly distinguish four blocks in this algorithm. From line 15 to 21, we perform a first synchronization of the processes involved in the read operation. During this phase, the processes chosen to act as aggregators read from the input file a chunk of data whose size is the size of an aggregation buffer (I/O phase). From line 24 to 31, this data is distributed from the aggregators to the other processes. The processes passing this conditional block carry out an RMA operation to get data from the appropriate aggregation buffer (aggregation phase, line 34). The last block, from line 37 to the end, is quite similar to the second block. The processes whose data has been fully retrieved get stuck in a waiting loop, while the others recursively call the read function.

Algorithm 4: TAPIOCA Read Algorithm
1  GlobalRound ← 0;
2  ReadRound ← 0;
3  TotalRound ← ComputeNumberOfRounds (datasize);
5
6  Function TAPIOCA_Read (f, offset, data, size, type, status)
7    round ← GetRound();
8    aggr ← GetAggregatorRank();
9    chunkSize ← GetRoundSize(round);
10   buffCount ← GetBuffCount();
11   buffId ← globalRound % buffCount;
12   readId ← readRound % buffCount;
14
15   if firstRead then
16     if I am an aggregator then
17       Pull_Buffer (readId);
18       readRound ← readRound + 1;
19       readId ← readRound % buffCount;
21     Fence ();
23
24   while round ≠ globalRound do
25     if I am an aggregator AND readRound < TotalRounds then
26       Pull_Buffer (readId);
27       readRound ← readRound + 1;
28       readId ← readRound % buffCount;
29     Fence ();
30     globalRound ← globalRound + 1;
31     buffId ← globalRound % buffCount;
33
34   RMA_Get (data, chunkSize, offset, aggr, buffId);
36
37   if chunkSize = size then
38     while globalRound ≠ TotalRounds do
39       if I am an aggregator AND readRound < TotalRounds then
40         Pull_Buffer (readId);
41         readRound ← readRound + 1;
42         readId ← readRound % buffCount;
43       Fence ();
44       globalRound ← globalRound + 1;
45       buffId ← globalRound % buffCount;
46   else
47     TAPIOCA_Read (f, offset + chunkSize, data + roundSize, size − chunkSize, type, status);

5. Evaluation

To validate our approach, we ran a large series of experiments on Mira and Theta, two leadership-class supercomputers at Argonne National Laboratory, which were decommissioned in 2019 and 2023, respectively. We also used Cooley, a mid-scale visualization cluster, to highlight the portability of our method. TAPIOCA was assessed on I/O benchmarks and on two I/O kernels of large-scale applications: a cosmological simulation and a computational fluid dynamics (CFD) code.

In this section, we first describe the three testbeds we carried out our experiments on. Then, we demonstrate in 5.2 the impact of user-defined parameters on collective I/O operations. This step calibrates TAPIOCA and MPI I/O for a fair comparison. Finally, starting from Section 5.3 we present a comparative study of TAPIOCA and MPI I/O on diverse use-cases.

Table 3 summarizes the experimental setup used to evaluate our architecture-aware data aggregation technique.

5.1. Testbeds

5.1.1. Mira

Mira is a 10 PetaFLOPS IBM BG/Q supercomputer that was ranked in the top ten of the Top500 for years, until June 2017 (see Fig. 9). Mira contains 48K nodes interconnected with a 5D-torus high-speed network providing a theoretical bandwidth of 1.8 GBps per link. Each node hosts 16 hyperthreaded PowerPC A2 cores (1600 MHz) and 16 GB of main memory. Following the BG/Q architecture rules, Mira splits the nodes into Psets. A Pset is a subset of 128 nodes sharing the same I/O node. Two compute nodes of a Pset offer a 1.8 GBps link to the I/O node. These nodes are called the bridge nodes. GPFS [8] manages the 27 PB of storage. In terms of software, we compiled the test applications and our library with the IBM XL compiler, v12.1, and used the default MPI installation on Mira based on MPICH2 v1.5 (MPI-2 standard).

Fig. 9. IBM BG/Q architecture.

5.1.2. Theta


Theta is an 11.7 PetaFLOPS Cray XC40 supercomputer. This architecture (see Fig. 10) consists of more than 3600 nodes and 864 Aries routers interconnected with a dragonfly network. The routers are distributed in groups of 96, internally interconnected with 14 GBps electrical links, while 12.5 GBps optical links connect groups together. Each router hosts four Intel KNL 7250 nodes. A KNL node offers 68 1.60 GHz cores, 192 GB of main memory, a 128 GB SSD, and 16 GB of MCDRAM. The MCDRAM, also called high-bandwidth memory (HBM), can be used as an additional cache or as a high-speed allocatable memory (up to 400 GBps). On this platform, we compiled the test applications and TAPIOCA with the Cray wrapper invoking the Intel compiler (v17.0) optimized for this architecture. We used the default Cray MPI implementation based on MPICH and implementing the MPI-3 standard.

Fig. 10. Cray XC40 architecture.

The storage system on Theta provides 9.2 PB of usable space managed by the Lustre file system [9,33]. Fig. 11 shows a simple example of Lustre on this supercomputer. Disks are hosted on OSTs (object storage targets) and accessible through OSSs (object storage servers). Theta has 56 OST and OSS nodes (ratio 1:1). From an application point of view, each OSS is accessible through 7 LNET nodes allocated among the compute nodes. Unfortunately, the vendor does not provide a way to know how the data is distributed on the LNET nodes. This explains why aggregator placement on this platform does not take the I/O phase into account.

Fig. 11. Storage system on Theta managed by Lustre.

5.1.3. Cooley

Cooley is a Haswell-based analysis and visualization cluster featuring 126 Intel Haswell E5-2620 nodes, each with 12 cores, 384 GB of memory and a local hard-disk drive (HDD). The 27 PB of shared storage are managed with a GPFS file-system. The interconnect is a 56 Gbps FDR Infiniband CLOS network. We limited our experiments on this cluster to evaluating the portability of our library and abstraction layer.

Table 3
Experimental setup.

HPC systems       Cray XC40, IBM BG/Q, Haswell-based cluster
Comparison        TAPIOCA, MPI-IO
Workloads         IOR; synthetic I/O benchmarks; I/O kernel of a cosmological application (HACC); I/O kernel of a direct numerical simulation (S3D)
Operations        Write and read with various subfiling techniques
Memory, Storage   DDR: main memory; HBM: high-bandwidth memory; NVR: NVRAM, either an on-node SSD or HDD; PFS: parallel file-system (Lustre or GPFS)

5.2. Calibration of collective I/O operations with user-defined parameters


To achieve good performance on large-scale supercomputers with collective I/O operations, users often have to tune their environment to take advantage of certain optimizations. We listed the most common parameters that may have an impact on I/O performance and compared, on Mira and Theta, a baseline I/O bandwidth with the default parameters against an optimized run with I/O tuning. This first study allows us to present a fair comparison between TAPIOCA and MPI I/O in the rest of this paper.

To evaluate I/O performance, we ran IOR, a popular I/O benchmark [34]. We varied the data size read and written per process from 200 kB to 4 MB. All the I/O calls were MPI I/O collective operations. A run was repeated 20 times, and the mean and the standard deviation were calculated. It has to be noted that we used a recommended subfiling technique on Mira (one file per Pset) for our experiments on this architecture, while MPI processes managed a single shared file on Theta.

On Mira (Fig. 12), runs with the default parameters gave up to 7.3 GBps for read and around 2 GBps for write, with a large variability. To increase this I/O bandwidth, we mainly set environment variables optimizing collective calls and reducing lock contention by sharing file locks. We note that the number of aggregators and the aggregator buffer size set to their default values (i.e., 16 aggregators per Pset and 16 MB) offered the best performance. These settings were able to increase the read bandwidth by 13% in the best case, and the optimized write bandwidth was three times higher than the baseline case at 4 MB.

Fig. 12. I/O bandwidth achieved with the IOR benchmark on 512 Mira nodes, 16 ranks per node, with and without user-defined optimizations.

Fig. 13 depicts the same experiment on Theta. On this platform, IOR with the default parameters revealed a read bandwidth of approximately 800 MBps while up to 36 GBps were reached with optimized parameters. The write bandwidth was increased from nearly 200 MBps to 10 GBps in the best case. The gap was substantial between these two scenarios. Indeed, by default on Theta's Lustre file-system, the number of OSTs (disks) is set to 1 and the stripe size (size of the chunks of data distributed among the OSTs) to 1 MB. Using 48 OSTs and a stripe size of 8 MB highly increased the I/O bandwidth. As on Mira, two locking modes are available. Lock sharing set for collective operations reduced the lock contention and took part in the performance improvement. Another parameter is the number of aggregators per OST in MPI I/O. Our experiments showed that two aggregators per OST per set of 512 compute nodes (if 48 OSTs are used) give a good increase. We also identified a routing algorithm (IN_ORDER) that provided better I/O bandwidth.

Fig. 13. I/O bandwidth achieved with the IOR benchmark on 512 nodes on Theta, 16 ranks per node, with and without user-defined optimizations. Log scale on y-axis.

This preliminary study also allowed us to highlight a decisive correlation between the aggregator buffer size set in TAPIOCA and the stripe size of the Lustre file system. Table 4 shows the average I/O bandwidth achieved on 512 nodes and 16 ranks per node with various aggregator buffer sizes and stripe sizes on a simple use-case: each MPI process writes 1 MB of data into a single shared file. Specifically, we set the buffer size in TAPIOCA to 4 MB, 8 MB, and 16 MB. For each case, we changed the stripe size in such a way that we could maintain a certain ratio. We observed that a 1:1 ratio—that is, an aggregator buffer size equal to the stripe size—gives the best performance.

Table 4
Ratio ''Aggregator buffer size : Stripe size''.

Ratio           1:8    1:4    1:2    1:1    2:1    4:1
I/O Bw (GBps)   0.36   0.64   0.91   1.57   1.08   1.14

For the rest of our experiments, and to ensure a fair comparison between TAPIOCA and MPI-IO, we configured each environment with the optimal parameters determined in this section.

5.3. 1D-array

We first ran a series of experiments with a micro-benchmark called 1D-Array. In this code, every MPI process writes a contiguous piece of data in one or multiple shared files (subfiling) during a collective call. We used this benchmark to provide an initial assessment of TAPIOCA's full range of capabilities, namely architecture-aware aggregator placement, I/O scheduling, and the means to use any type of memory and storage level for aggregation. As Theta is our most recent architecture featuring multiple memory and storage tiers, we focused on this platform for this first analysis. We compared TAPIOCA with the MPI I/O implementation installed on Theta while varying the data distribution among the processes, the number of nodes involved in the files read and written, and the aggregation memory tiers.

This micro-benchmark allocates one buffer per process filled with random values and collectively writes/reads it to/from the storage system. We tried out three different configurations for the buffer size: every process allocates the same buffer, or a random buffer size is chosen, or the buffer sizes follow a normal distribution. To have a fair comparison, the data distributions were preserved between experiments with MPI-IO and TAPIOCA.

Fig. 14 shows experiments on 128 Cray XC40 nodes while writing and reading data to a single shared file on the Lustre file system. We selected 48 aggregators (DRAM) for both MPI-IO and TAPIOCA. We carried out three use-cases: the first one with an array of 25K integers per process (100 kB), the second one with a random distribution of the data among the processes (a value between 0 kB and 100 kB) and the last one with a normal distribution among the processes. Our approach outperformed MPI-IO on the three types of distributions. However, the performance gap was particularly significant with a random and a normal distribution, seeing as the write bandwidth was respectively approximately 6 and 29 times higher, while we read data 3 times faster.

Fig. 14. I/O bandwidth achieved with 1D-array from 128 Cray XC40 nodes while varying the data distribution. Data read/written into a single shared file on Lustre.

Performing I/O operations on a single shared file is known to often provide poor performance. Subfiling is usually preferred. Fig. 15 presents the results we obtained on the same platform while performing subfiling, from one file per node (1:1) to one file per 8 nodes (1:8). In such a use-case, one aggregator was selected per group of nodes writing or reading the same file. Data aggregation was performed on DRAM while the destination of the data was the Lustre file system. Unlike MPI-IO, TAPIOCA allows setting the local SSD as a shared destination tier. The storage space is mapped into a memory space exposed to one-sided communication. We also ran experiments showing this feature. It has to be noted that the file created on each local SSD was temporary (allocation lifetime). We can conclude from these results that one file per node is the configuration offering the best I/O bandwidth for MPI-IO and our library. We also observe that setting the SSD as a destination provides better performance. However, this must be moderated by the fact that the volume of data read and written is small and that a cache effect undoubtedly comes into play. The ''1:1'' case was also evaluated with a random data distribution, as shown in Table 5. Again, the best I/O performance was achieved with TAPIOCA except in the read case from the Lustre file system. We are still investigating the poor read bandwidth obtained in some of our experiments.

Fig. 15. I/O bandwidth achieved with 1D-array from 128 Cray XC40 nodes while varying the number of nodes per file. The node-local SSD was also considered as a target.

Table 5
MPI-IO vs. TAPIOCA, one file per node on Lustre and SSD (TAPIOCA only) with random data distribution.

I/O operation     MPI-IO (Lustre)   TAPIOCA (Lustre)   TAPIOCA (SSD)
Read Bw (GBps)    0.99              0.80               4.47
Write Bw (GBps)   2.46              5.89               4.32

Last, Table 6 gives the read and write I/O bandwidth achieved on the Lustre file system when performing data aggregation on the three tiers of memory available on the Cray system. In order to highlight the differences, we increased the data size per process to 1 MB. We first observed that the difference in performance was not significant between aggregation on DRAM and HBM. This experiment corroborates the cost model evaluation presented in Section 4.2. We can also notice the overhead due to the file mapping in memory (mmap) when aggregating data on the local SSD.

Table 6
Reading and writing one file per node on Lustre with TAPIOCA while aggregating on the three tiers of memory and storage available on nodes. 1 MB read/written per process.

I/O operation     DRAM    HBM     SSD
Read Bw (GBps)    8.96    8.24    7.80
Write Bw (GBps)   19.15   19.36   10.70

5.4. HACC-IO


HACC-IO is the I/O kernel of HACC (Hardware Accelerated Cosmology Code). This large-scale cosmological application requires the massive compute power of supercomputers to simulate the mass evolution of the universe with particle-mesh techniques. In terms of I/O, every process of a HACC simulation manages a number of particles. Each particle is defined by nine variables—XX, YY, ZZ, VX, VY, VZ, phi, pid, and mask—corresponding to the coordinates, the velocity vector, and relevant physics properties. The size of a particle is 38 bytes. A useful base value of 25,000 particles requires approximately 1 MB. In the following, we first present our results on Mira with 1024 and 4096 nodes and 16 ranks per node (resp. 16K and 64K processes) while only writing data, the reading phase providing similar results for both methods. Then we show the results on Theta with 1024 and 2048 nodes and 16 ranks per node (resp. 16K and 32K processes) for both read and write.
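For concreteness, one particle composition consistent with the 38 bytes mentioned above (seven 4-byte floats, an 8-byte identifier and a 2-byte mask — the exact types are an assumption) and the two layouts compared in this section can be sketched as follows.

#include <cstdint>
#include <vector>

// Array of structures (AoS): the nine variables of one particle are stored
// contiguously, so a write emits interleaved records.
struct Particle {              // 38 bytes of particle data (the in-memory
    float    xx, yy, zz;       // struct may be padded by the compiler)
    float    vx, vy, vz;       // velocity vector
    float    phi;              // potential
    int64_t  pid;              // particle identifier
    uint16_t mask;             // physics properties mask
};
using ParticlesAoS = std::vector<Particle>;

// Structure of arrays (SoA): each variable is kept in its own contiguous
// array, so a collective write emits nine large contiguous blocks.
struct ParticlesSoA {
    std::vector<float>    xx, yy, zz, vx, vy, vz, phi;
    std::vector<int64_t>  pid;
    std::vector<uint16_t> mask;
};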

5.4.1. Mira

Fig. 16 shows the results on 1024 Mira nodes, with 16 ranks per node and one file per Pset as output. We compared our approach to MPI-IO on this platform with two data layouts: array of structures (AoS) and structure of arrays (SoA). For these experiments, we varied the data size per rank from 5K to 100K particles. We first note from the results that subfiling is an efficient technique to improve I/O performance on the BG/Q, since up to 90% of the peak I/O bandwidth was achieved by our topology-aware strategy. We also note that we outperformed the default implementation even on large messages.

Fig. 16. Write bandwidth achieved with HACC-IO on Mira by writing one file per Pset from 1024 nodes (16 ranks/node). TAPIOCA: 16 aggregators per Pset, 16 MB aggregator buffer size.

Fig. 17 presents experiments with the same configuration as the previous one, except that we ran it on 4096 Mira nodes. The behavior was similar, with the peak write bandwidth almost reached with TAPIOCA (the peak is estimated at 89.6 GBps on this node count). As with the experiments on 1024 nodes, the gap with MPI-IO decreased as the data size increased. In any case, the I/O performance was substantially improved for both AoS and SoA layouts.

Fig. 17. Write bandwidth achieved with HACC-IO on Mira by writing one file per Pset from 4096 nodes (16 ranks/node). TAPIOCA: 16 aggregators per Pset, 16 MB aggregator buffer size.

5.4.2. Theta

Our experiments on Theta showed a good I/O performance gain as well. Fig. 18 depicts the read and write bandwidth achieved on 1024 nodes of the Cray XC40 supercomputer while sharing a single file as output and varying the data size per process. This result highlights the performance improvement TAPIOCA can achieve on a standard workflow, from the application to a parallel file system. Data aggregation was performed on the DRAM in this set of experiments. On both read and write, TAPIOCA outperformed MPI-IO by factors of 5.4 and 13.8, respectively, with a 1 MB data size per process.

Fig. 18. Read and write bandwidth achieved with HACC-IO from 1024 Cray XC40 nodes while writing into a single shared file on the Lustre file-system.

As demonstrated in Sections 5.3 and 5.4.1, subfiling is a key method to improve I/O bandwidth and reduce the proportion of the wall time spent in I/O. As shown in Fig. 19, writing one file per node on the parallel file system improved the performance up to 40 times with a large amount of data per process. In that case, MPI-IO and TAPIOCA offered I/O performance within the same confidence interval. As mentioned previously, whatever the subfiling granularity chosen, TAPIOCA is able to use the local SSD as a file destination (as well as an aggregation layer). Therefore, we included the results when writing and reading data to/from this storage layer. In this case, the I/O bandwidth was boosted by a factor of 4 to 9 when writing and 6 to 8 when reading, compared to the parallel file system.
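To give a concrete picture of this subfiling setup, the sketch below shows one common way of building the per-file process groups for a 1:N node-to-file mapping with standard MPI calls. The group size and the rank-to-node computation are assumptions of the example, and this is an illustration of the scheme rather than TAPIOCA's actual interface.

    #include <mpi.h>
    #include <cstdio>

    // Build one communicator per subfile (e.g. one file per 8 nodes).
    // Ranks sharing a communicator write the same subfile; one of them
    // can then be chosen as the aggregator for that group.
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        const int ranks_per_node = 16;  // assumed: 16 ranks per node
        const int nodes_per_file = 8;   // assumed: 1:8 subfiling granularity

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int node_id = rank / ranks_per_node;     // node hosting this rank
        int file_id = node_id / nodes_per_file;  // subfile targeted by this node

        MPI_Comm file_comm;
        MPI_Comm_split(MPI_COMM_WORLD, file_id, rank, &file_comm);

        char filename[64];
        std::snprintf(filename, sizeof(filename), "output.%04d.dat", file_id);
        // Collective (two-phase) I/O is then performed on file_comm, targeting
        // either the parallel file system or a node-local SSD.

        MPI_Comm_free(&file_comm);
        MPI_Finalize();
        return 0;
    }

Varying nodes_per_file reproduces the 1:1 to 1:8 granularities evaluated above.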
To extend the analysis of this use-case, we ran a weak scaling study of the previous experiment, as depicted in Fig. 20. Here, every process managed 1 MB of data. The aggregation was performed on the DRAM of each aggregator and the target for output data was set to the Lustre parallel file system and the on-node SSD. This last method revealed a very strong scalability, as the I/O performance attained increased by more or less 50% every time we doubled the number of compute nodes.

Fig. 19. Read and write bandwidth achieved with HACC-IO from 1024 Cray XC40 nodes while writing one file per node on the Lustre file-system and on the local SSD (TAPIOCA only). Log-scale on y-axis.

Fig. 20. HACC-IO, one file per Cray XC40 node on Lustre and local SSD. 1 MB per process, varying the number of nodes.
Eventually, thanks to the memory abstraction we have proposed (see Section 4.1), we carried out experiments with data aggregation on the high-bandwidth memory available on the compute nodes. For the purposes of this experiment, we chose the case with the best performance so far, i.e. one file written to the local SSD per node. This choice was motivated by the fact that in other configurations (writing to the Lustre file system, for example), our model shows that performance is limited by the network, regardless of the aggregation layer used. Thereby, Fig. 21 compares an execution with aggregation on DRAM and on HBM while writing and reading one file per node on the node-local SSD. Even so, the performance gap between the two memory banks and the SSDs remains wide. The two-phase I/O operation is still bounded by the SSD's bandwidth, as expected, showing no difference between data aggregation on the DRAM and on the HBM.

Fig. 21. HACC-IO on 1024 Cray XC40 nodes, one file per node on local SSD. Comparison of the aggregation on DDR and on HBM.

Fig. 22. Write/Read workflow using TAPIOCA and SSDs as both an aggregation buffer (write) and a target (read).

Workflow. HACC is a very large-scale simulation code generating data that can be analyzed or visualized, in real time if possible. We propose here to place this application in a workflow, as described in Fig. 22, that can be seamlessly implemented with TAPIOCA. The workflow might be either a single application performing write (simulation) and read (analysis) operations consecutively, like an in-situ analysis with co-located processes, or two different applications running during the same allocation, as the data is persistent on the SSD for the allocation lifetime. As described in Section 4, TAPIOCA allows using the on-node local SSD as an aggregation layer. This is done by mapping a file created for the occasion on the SSD into the DRAM of the node. An MPI window then exposes this buffer to local and remote nodes.
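The sketch below illustrates this mechanism with plain POSIX and MPI calls: a file created on the node-local SSD is mapped into memory and exposed through an MPI window, so that local and remote ranks can deposit aggregated data into it with one-sided operations. The mount point, buffer size and absence of error handling are simplifications of this illustration; it is not the actual TAPIOCA implementation.

    #include <mpi.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    // Expose a file on the node-local SSD as an aggregation buffer.
    void *create_ssd_window(MPI_Comm comm, size_t buf_size, MPI_Win *win) {
        const char *path = "/local/ssd/aggr_buffer.dat";  // hypothetical mount point
        int fd = open(path, O_CREAT | O_RDWR, 0600);
        ftruncate(fd, (off_t)buf_size);                    // reserve space in the file

        // Map the SSD-backed file into the DRAM address space of the node.
        void *buf = mmap(nullptr, buf_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        close(fd);

        // Publish the mapped buffer through an MPI window: aggregation can now
        // target it with MPI_Put/MPI_Get from local and remote ranks.
        MPI_Win_create(buf, (MPI_Aint)buf_size, 1, MPI_INFO_NULL, comm, win);
        return buf;
    }

In such a scheme, an msync/munmap at the end of the aggregation phase flushes the buffer to the SSD file, which the analysis step can then read back directly.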
Table 7 shows the best I/O bandwidth achieved for write and read as well as the best time to solution for the whole workflow. The performance variation is based on the MPI-IO case and the TAPIOCA case using the SSD. The first result row is given for information purposes. We can see that the overhead due to the mmap system call is widely counterbalanced by the performance attained with the read operation. The total time to solution is reduced by 26.82%.

Table 7
Max. write and read bandwidth (GBps) and total I/O time achieved with and without aggregation on SSD.

  Agg. tier      Write     Read      I/O time
  TAPIOCA DDR    47.50     38.92     693.88 ms
  MPI-IO DDR     32.95     37.74     843.73 ms
  TAPIOCA SSD    26.88     227.22    617.46 ms
  Variation      −36.10%   +446.94%  −26.82%

5.4.3. Cooley

To assess the portability of our architecture-aware data aggregation algorithm, we ran experiments with HACC-IO on Cooley, a 64-node Haswell-based visualization cluster. To take advantage of the features we proposed in our data aggregation library on another platform, there is no need to modify the application; only the compilation process and an implementation of the memory and network abstraction are necessary.
The testbed we targeted is not designed for intensive I/O. In addition, the on-node disks are hard disk drives with poor performance. However, this machine is suitable for workflows combining simulation and visualization, as presented in Fig. 22. Beyond the I/O performance, these experiments are more a proof of concept.

We show in Table 8 the results obtained with the workflow described in Fig. 22. To control the impact of GPFS caching, we interleaved random I/O with the HACC-IO write and read runs. We can notice that the overhead caused by local aggregation on HDD is very low. Again, the read bandwidth is significantly increased while the overall I/O time is reduced by more than 12% on this cluster.

Table 8
Max. write and read bandwidth (GBps) and total I/O time achieved with and without aggregation on local HDD.

  Agg. tier      Write     Read      I/O time
  TAPIOCA DDR    6.60      38.80     123.41 ms
  MPI-IO DDR     6.02      17.46     155.40 ms
  TAPIOCA HDD    5.97      35.86     135.86 ms
  Variation      −0.83%    +105.38%  −12.57%

5.5. S3D-IO

S3D [35] is a state-of-the-art direct numerical simulation (DNS) code written in Fortran and MPI, in the field of computational fluid dynamics (CFD). S3D focuses on turbulence-chemistry interactions in combustion. The DNS approach aims to address small domain problems to calibrate physical models for macro-scale CFD simulations. S3D is based on a 3D domain decomposition distributed across the MPI processes. In terms of I/O, a new single shared file is collectively written every n timesteps. The state of each element of the studied domain is stored following an array-of-structures data layout. The output file is used both as a checkpoint in case of failure and for data analysis. S3D-IO is a version of the S3D production code whose physics modules have been removed; the memory arrangement as well as the I/O routines have been kept, though.
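As a point of reference, such a checkpoint phase is commonly expressed as a collective write to a single shared file. The sketch below shows the shape of that operation with plain MPI-IO (in C++ for consistency with the other examples; the file name, datatype and contiguous-block assumption are illustrative), i.e. the kind of baseline against which TAPIOCA is compared.

    #include <mpi.h>
    #include <vector>

    // Collective write of a per-process block into a single shared file,
    // in the spirit of an S3D-style checkpoint. Names and sizes are illustrative.
    void write_checkpoint(MPI_Comm comm, const std::vector<double> &local_state,
                          const char *filename) {
        int rank;
        MPI_Comm_rank(comm, &rank);

        // Assume each rank owns a contiguous block placed at rank * block_size.
        MPI_Offset block_size = (MPI_Offset)(local_state.size() * sizeof(double));
        MPI_Offset offset     = (MPI_Offset)rank * block_size;

        MPI_File fh;
        MPI_File_open(comm, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        // Collective call: the MPI-IO layer (e.g. ROMIO's two-phase scheme)
        // aggregates the blocks before writing to the parallel file system.
        MPI_File_write_at_all(fh, offset, local_state.data(),
                              (int)local_state.size(), MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }

With a 3D domain decomposition, the per-process data generally maps to non-contiguous regions of the shared file, which is precisely where the aggregation strategy and the placement of the aggregators matter.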
We implemented a module in S3D-IO using TAPIOCA for managing I/O operations. For these experiments, we let our architecture-aware algorithm described in Section 4.2 automatically decide the most appropriate tiers of memory for data aggregation among the compute nodes.

We first present in Table 9 a typical use-case of S3D, with 134 and 537 million grid points respectively distributed on 256 and 1024 nodes of the Cray XC40 system (16 ranks per node). We set the number of aggregators to 96 on 256 nodes and 384 on 1024 nodes for both MPI-IO and TAPIOCA. For this use-case, our aggregator placement algorithm selected the HBM as an aggregation layer for all the 96 aggregating nodes. We can see that on the two problem sizes, TAPIOCA significantly outperforms MPI-IO. When running on 1024 nodes, the I/O bandwidth is multiplied by 3.

Table 9
Maximum write bandwidth (GBps) achieved with aggregation performed on HBM using the TAPIOCA library.

              256 nodes   1024 nodes
  Points      134M        537M
  Size        160 GB      640 GB
  MPI-IO      3.02 GBps   4.42 GBps
  TAPIOCA     4.86 GBps   13.75 GBps
  Variation   +60.93%     +210.91%

In order to emphasize the adaptability of our approach, we ran another series of experiments on 256 nodes with 134 million grid points while artificially decreasing the capacity of the high-bandwidth memory, then of the DRAM, to 32 MB. At the same time, we set the number of aggregation buffers to 3 and their size to 16 MB (48 MB in total, above the memory capacity). The goal was to show the behavior of TAPIOCA in case the fastest tier of memory available does not have enough space for aggregated data. Table 10 presents the results. The capacity requirement described in Section 4.2 not being fulfilled, the second and then the third fastest memory tier are selected. In the third scenario, data is aggregated on the node-local SSD, which offers poor I/O bandwidth compared to HBM or DRAM; however, the application can still be carried out.

Table 10
Maximum write bandwidth (GBps) while artificially reducing the memory capacity of the HBM and then of the DRAM. For each run, the memory tier selected for aggregation by TAPIOCA is marked with an asterisk.

  Run   HBM         DDR         NVR         Bandwidth   Std dev.
  1     16 GB *     192 GB      128 GB      4.86 GBps   0.39 GBps
  2     ↓ 32 MB     192 GB *    128 GB      4.90 GBps   0.43 GBps
  3     ↓ 32 MB     ↓ 32 MB     128 GB *    2.98 GBps   0.15 GBps
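The fallback behavior observed in Table 10 boils down to a capacity check over the memory tiers ranked by performance; the sketch below is a schematic rendering of that selection logic, with illustrative tier names and sizes, and is not TAPIOCA's actual code.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Memory and storage tiers ordered from fastest to slowest for aggregation.
    struct Tier {
        std::string name;      // e.g. "HBM", "DRAM", "NVR" (node-local SSD)
        uint64_t    capacity;  // bytes available for aggregation buffers
    };

    // Pick the fastest tier able to hold all aggregation buffers.
    // With 3 buffers of 16 MB (48 MB total), an HBM capped at 32 MB is skipped,
    // then a DRAM capped at 32 MB is skipped too, and the SSD is selected.
    const Tier *select_aggregation_tier(const std::vector<Tier> &tiers,
                                        uint64_t buffer_count,
                                        uint64_t buffer_size) {
        const uint64_t required = buffer_count * buffer_size;
        for (const Tier &t : tiers)
            if (t.capacity >= required)  // capacity requirement of Section 4.2
                return &t;
        return nullptr;                  // no tier can hold the buffers
    }

In the full placement model, this capacity constraint is only one input alongside the data-movement costs discussed in Section 4.2; the sketch isolates the fallback behavior exercised in Table 10.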

6. Discussion

In this section, we discuss several challenges, including those faced while pursuing this research. These highlight the need for better co-design between hardware and software stacks, as well as the need for domain-driven research for I/O data management.

6.1. Impact of network interference

While carrying out experiments with our I/O library, we observed a certain variability in the I/O bandwidth measurements. This instability was due to I/O interference from other concurrently running jobs. On Mira, a set of I/O nodes is isolated only as part of a 512-node allocation.

To emphasize this behavior, we ran controlled benchmark tests using one Pset (128 compute nodes, two bridge nodes and one I/O node). Our tests were run to highlight the impact of I/O interference. In one case, we ran a single I/O-intensive HACC-IO job on 64 of the 128 nodes, while leaving the other 64 nodes idle. This case eliminated interference on the bridge and I/O nodes. In the other case, we ran the same I/O-intensive job on 64 of the 128 nodes, while simultaneously running jobs of varying I/O intensity on the other 64 nodes. Node allocation was distributed such that each 64-node job used 32 nodes per bridge node. This configuration corresponded to the default distribution on BG/Q. Fig. 23 depicts a 5D torus flattened on 2 dimensions and the aforementioned job partitioning.

Fig. 23. Job partitioning on a Pset on Mira to demonstrate the impact of I/O interference on performance.

Table 11 shows the mean I/O bandwidth achieved with HACC-IO with and without interference. A single I/O-intensive HACC-IO job running on 64 nodes sharing two bridge nodes can reach more than 60% of the peak I/O bandwidth. However, the performance is decreased by 13% when a concurrent job is running on the same Pset. We can also notice a rise in variability (standard deviation) of 37.5%. This result demonstrates the need for a good understanding of the underlying topology and better ways to leverage this knowledge by conducting more research in the domain of topology-aware resource allocation or I/O contention management (I/O scheduling or I/O priority). On BG/Q, for instance, we have learnt that the minimal unit to consider for a node allocation is a block of four Psets (512 nodes) to reduce as much as possible the impact of I/O interference and ensure good reproducibility.

Table 11
Mean I/O bandwidth achieved with HACC-IO (2 MB per rank) through our I/O library with and without interference. Concurrent jobs have variable I/O intensity (0.2 MB to 4 MB per rank).

                     HACC-IO                   Other
                     Average     Std dev.      Average     Std dev.
  No-interference    2.20 GBps   0.10 GBps     N/A         N/A
  Interference       1.92 GBps   0.16 GBps     1.15 GBps   0.35 GBps
6.2. Architecture-aware limitations

This work has highlighted some of the limitations of the "architecture-aware" approach. For example, on Theta, the lack of information on the placement of LNET nodes (the nodes through which I/O transits to the Lustre file system) makes it challenging to take advantage of the topology for the I/O phase. Similarly, the packet routing algorithms on the network may not be known or may be "adaptive", as is the case on the Cray machine, in comparison to systems with static routing such as the BG/Q. To make these approaches fully effective, we need to deal with the uncertainties in the routing algorithms and require capabilities for system introspection.

Given the increasing adoption of heterogeneous memory and storage on HPC systems, at the node level, rack level and system level, architecture-aware methods such as the one implemented in TAPIOCA will be needed to fully realize the performance potential. Similarly, the emergence of disaggregation technologies such as CXL (Compute Express Link) [36] should make architecture-aware placement even more important.

More generally, to address such issues, we advocate for better co-design between the hardware and software stack to provide feedback from the underlying architecture with accurate tools and libraries.

6.3. Potential applications

The two-phase I/O algorithm is just one use-case among others to illustrate our contribution of coupling data between computation and storage. Our approaches are widely applicable to data coupling between various stages of a scientific workflow and will enable the efficient movement of data between the stages. This is a critical component for science workflows combining simulations, analysis, AI, among others. As we are witnessing these workflows being executed on heterogeneous systems with diverse memory, storage, compute and networking characteristics, approaches such as TAPIOCA can provide a holistic data-movement acceleration.

7. Conclusion

In this paper, we have introduced TAPIOCA, a data aggregation algorithm designed to leverage supercomputer architectures effectively. Specifically, we have demonstrated how an architectural abstraction, coupled with an aggregator placement model, can alleviate the I/O bottleneck across present and future large-scale systems. Our library has exhibited significant performance improvements across typical I/O workloads and more intricate workflows that express diverse I/O requirements. Our assessment on benchmarks and on two real applications, narrowed down to their I/O phases, has demonstrated our ability to outperform MPI-IO while offering extended flexibility. We conducted experiments with up to 16K processes on three systems at Argonne National Laboratory, including Theta, an 11.69 PetaFLOPS Cray XC40 system, and Mira, a 10 PetaFLOPS IBM BG/Q supercomputer. In particular, on an example of a typical "simulation + analysis" workflow configuration, we have demonstrated an execution time saving of the order of 26%, while transparently taking advantage of local storage resources.

In future endeavors, we aim to strengthen this approach, beginning with an in-depth exploration of the influence of input parameters on the cost of data movement. Analyzing the data access pattern, for instance, could allow a better characterization of applications and deliver performance improvements. Another future direction concerns the number of aggregators. As mentioned in Section 5, the way we determine the appropriate number of aggregators is empirical or based on the system's default value. Therefore, we plan to implement a contention model determining the number of aggregators given the bandwidth degradation due to concurrent accesses on the aggregators, and the number of streams required to achieve high I/O performance on the parallel file system. In the longer term, we will extend this result from the "n to 1" paradigm, i.e. a fixed set of processes sending data to a single aggregator, to "n to m" models where processes can supply data to several aggregators. A contention model will be all the more decisive in this case. Finally, from an evaluation perspective, we will study how TAPIOCA performs on AI workloads. The read-intensive access pattern of these applications is a relevant use case for our library.

More than just optimizing TAPIOCA, we also plan to delve into multi-level data aggregation. While our library currently allows the identification or explicit specification of a single aggregation layer, adopting a multi-level approach could undoubtedly benefit various workloads, particularly in scenarios such as checkpointing. In that context, we also plan to extend our model beyond the two-phase I/O algorithm. In hybrid HPC/Cloud systems, for example, data aggregation as a preamble to data movements between geo-distributed infrastructures is a key element for which our placement model can provide a solution.

CRediT authorship contribution statement

François Tessier: Writing – original draft, Software, Methodology, Investigation, Conceptualization. Venkatram Vishwanath: Writing – review & editing, Methodology, Conceptualization. Emmanuel Jeannot: Writing – review & editing, Methodology, Conceptualization.

Declaration of competing interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: François Tessier reports that travel was provided by the Joint-Laboratory for Extreme Scale Computing. If there are other authors, they declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Emmanuel Jeannot and Venkatram Vishwanath declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

This research has been funded in part by the NCSA-Inria-ANL-BSC-JSC-Riken-UTK Joint-Laboratory on Extreme Scale Computing (JLESC). This research used resources of the Argonne Leadership Computing Facility, a U.S. Department of Energy (DOE) Office of Science user facility at Argonne National Laboratory and is based on research supported by the U.S. DOE Office of Science-Advanced Scientific Computing Research Program, under Contract No. DE-AC02-06CH11357.

References

[1] D. Chen, N.A. Eisley, P. Heidelberger, R.M. Senger, Y. Sugawara, S. Kumar, V. Salapura, D.L. Satterfield, B. Steinmacher-Burow, J.J. Parker, The IBM Blue Gene/Q interconnection network and message unit, in: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, ACM, New York, NY, USA, 2011, pp. 26:1–26:10, http://dx.doi.org/10.1145/2063384.2063419, URL http://doi.acm.org/10.1145/2063384.2063419.
[2] W. Bhimji, D. Bard, M. Romanus, D. Paul, A. Ovsyannikov, B. Friesen, M. Bryson, J. Correa, G.K. Lockwood, V. Tsulaia, S. Byna, S. Farrell, D. Gursoy, C. Daley, V. Beckner, B. Van Straalen, D. Trebotich, C. Tull, G.H. Weber, N.J. Wright, K. Antypas, Prabhat, Accelerating science with the NERSC burst buffer early user program, in: CUG2016 Proceedings, 2016, URL https://escholarship.org/uc/item/9wv6k14t.
[3] M.-A. Vef, N. Moti, T. Süß, M. Tacke, T. Tocci, R. Nou, A. Miranda, T. Cortes, A. Brinkmann, GekkoFS—A temporary burst buffer file system for HPC applications, J. Comput. Sci. Tech. 35 (1) (2020) 72–91, http://dx.doi.org/10.1007/s11390-020-9797-6, URL https://jcst.ict.ac.cn/en/article/doi/10.1007/s11390-020-9797-6.
[4] T. Wang, K. Mohror, A. Moody, K. Sato, W. Yu, An ephemeral burst-buffer file system for scientific applications, in: SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, 2016, pp. 807–818, http://dx.doi.org/10.1109/SC.2016.68.
[5] MPI Forum, MPI-2: Extensions to the Message-Passing Interface, July 1997, http://www.mpi-forum.org/docs/docs.html.
[6] R. Thakur, W. Gropp, E. Lusk, A case for using MPI's derived datatypes to improve I/O performance, in: Proceedings of SC98: High Performance Networking and Computing, ACM Press, 1998, URL http://www.mcs.anl.gov/~thakur/dtype/.
[7] H. Luu, M. Winslett, W. Gropp, R. Ross, P. Carns, K. Harms, M. Prabhat, S. Byna, Y. Yao, A multiplatform study of I/O behavior on petascale supercomputers, in: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC '15, ACM, New York, NY, USA, 2015, pp. 33–44, http://dx.doi.org/10.1145/2749246.2749269, URL http://doi.acm.org/10.1145/2749246.2749269.
[8] F. Schmuck, R. Haskin, GPFS: A shared-disk file system for large computing clusters, in: Proceedings of the 1st USENIX Conference on File and Storage Technologies, FAST '02, USENIX Association, Berkeley, CA, USA, 2002, URL http://dl.acm.org/citation.cfm?id=1083323.1083349.
[9] Lustre filesystem website, http://lustre.org/.
[10] M. Chaarawi, S. Chandok, E. Gabriel, Performance evaluation of collective write algorithms in MPI I/O, in: G. Allen, J. Nabrzyski, E. Seidel, G.D. van Albada, J. Dongarra, P.M.A. Sloot (Eds.), Computational Science – ICCS 2009: 9th International Conference, Baton Rouge, LA, USA, May 25-27, 2009, Proceedings, Part I, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 185–194.
[11] J.M. del Rosario, R. Bordawekar, A. Choudhary, Improved parallel I/O via a two-phase run-time access strategy, SIGARCH Comput. Archit. News 21 (5) (1993) 31–38, http://dx.doi.org/10.1145/165660.165667, URL http://doi.acm.org/10.1145/165660.165667.
[12] R. Thakur, W. Gropp, E. Lusk, Optimizing noncontiguous accesses in MPI I/O, Parallel Comput. 28 (1) (2002) 83–105, http://dx.doi.org/10.1016/S0167-8191(01)00129-6.
[13] R. Thakur, W. Gropp, E. Lusk, Data sieving and collective I/O in ROMIO, in: Proceedings of the 7th Symposium on the Frontiers of Massively Parallel Computation, FRONTIERS '99, IEEE Computer Society, Washington, DC, USA, 1999, p. 182, URL http://dl.acm.org/citation.cfm?id=795668.796733.
[14] R. Thakur, W. Gropp, E. Lusk, On implementing MPI-IO portably and with high performance, in: Proceedings of the Sixth Workshop on I/O in Parallel and Distributed Systems, IOPADS '99, ACM, New York, NY, USA, 1999, pp. 23–32, http://dx.doi.org/10.1145/301816.301826, URL http://doi.acm.org/10.1145/301816.301826.
[15] Y. Tsujita, H. Muguruma, K. Yoshinaga, A. Hori, M. Namiki, Y. Ishikawa, Improving collective I/O performance using pipelined two-phase I/O, in: Proceedings of the 2012 Symposium on High Performance Computing, HPC '12, Society for Computer Simulation International, San Diego, CA, USA, 2012, pp. 7:1–7:8, URL http://dl.acm.org/citation.cfm?id=2338816.2338823.
[16] F. Tessier, V. Vishwanath, E. Jeannot, TAPIOCA: An I/O library for optimized topology-aware data aggregation on large-scale supercomputers, in: 2017 IEEE International Conference on Cluster Computing, CLUSTER, 2017, pp. 70–80, http://dx.doi.org/10.1109/CLUSTER.2017.80.
[17] P. Malakar, V. Vishwanath, Hierarchical read–write optimizations for scientific applications with multi-variable structured datasets, Int. J. Parallel Program. 45 (1) (2017) 94–108, http://dx.doi.org/10.1007/s10766-015-0388-z.
[18] Y. Tsujita, K. Yoshinaga, A. Hori, M. Sato, M. Namiki, Y. Ishikawa, Multithreaded two-phase I/O: Improving collective MPI-IO performance on a Lustre file system, in: 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 2014, pp. 232–235, http://dx.doi.org/10.1109/PDP.2014.46.
[19] J.F. Lofstead, S. Klasky, K. Schwan, N. Podhorszki, C. Jin, Flexible IO and integration for scientific codes through the adaptable IO system (ADIOS), in: Proceedings of the 6th International Workshop on Challenges of Large Applications in Distributed Environments, CLADE '08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 15–24, http://dx.doi.org/10.1145/1383529.1383533.
[20] M.J. Brim, A.T. Moody, S.-H. Lim, R. Miller, S. Boehm, C. Stanavige, K.M. Mohror, S. Oral, UnifyFS: A user-level shared file system for unified access to distributed local storage, in: 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS, 2023, pp. 290–300, http://dx.doi.org/10.1109/IPDPS54959.2023.00037.
[21] M. Gossman, B. Nicolae, J. Calhoun, Modeling multi-threaded aggregated I/O for asynchronous checkpointing on HPC systems, in: ISPDC 2023: The 22nd International Symposium on Parallel and Distributed Computing, IEEE, Bucharest, Romania, 2023, pp. 101–105, http://dx.doi.org/10.1109/ISPDC59212.2023.00021, URL https://hal.science/hal-04343661.
[22] B. Nicolae, A. Moody, E. Gonsiorowski, K. Mohror, F. Cappello, VeloC: Towards high performance adaptive asynchronous checkpointing at large scale, in: 2019 IEEE International Parallel and Distributed Processing Symposium, IPDPS, 2019, pp. 911–920, http://dx.doi.org/10.1109/IPDPS.2019.00099.
[23] T. Jin, F. Zhang, Q. Sun, H. Bui, M. Romanus, N. Podhorszki, S. Klasky, H. Kolla, J. Chen, R. Hager, C.S. Chang, M. Parashar, Exploring data staging across deep memory hierarchies for coupled data intensive simulation workflows, in: 2015 IEEE International Parallel and Distributed Processing Symposium, 2015, pp. 1033–1042, http://dx.doi.org/10.1109/IPDPS.2015.50.
[24] M. Dreher, T. Peterka, Decaf: Decoupled dataflows for in situ high-performance workflows, 2017, http://dx.doi.org/10.2172/1372113, URL http://www.osti.gov/scitech/servlets/purl/1372113.
[25] M. Dreher, K. Sasikumar, S. Sankaranarayanan, T. Peterka, Manala: A flexible flow control library for asynchronous task communication, in: 2017 IEEE International Conference on Cluster Computing, CLUSTER, 2017, pp. 509–519, http://dx.doi.org/10.1109/CLUSTER.2017.31.
[26] B. Dong, S. Byna, K. Wu, Prabhat, H. Johansen, J.N. Johnson, N. Keen, Data elevator: Low-contention data movement in hierarchical storage system, in: 2016 IEEE 23rd International Conference on High Performance Computing, HiPC, 2016, pp. 152–161, http://dx.doi.org/10.1109/HiPC.2016.026.
[27] J. Kunkel, E. Betke, An MPI-IO in-memory driver for non-volatile pooled memory of the Kove XPD, in: J.M. Kunkel, R. Yokota, M. Taufer, J. Shalf (Eds.), High Performance Computing, Springer International Publishing, Cham, 2017, pp. 679–690.
[28] F. Broquedis, J. Clet-Ortega, S. Moreaud, N. Furmento, B. Goglin, G. Mercier, S. Thibault, R. Namyst, Hwloc: a generic framework for managing hardware affinities in HPC applications, in: Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP2010, IEEE Computer Society Press, Pisa, Italia, 2010, URL http://hal.inria.fr/inria-00429889.
[29] M.G. Venkata, F. Aderholdt, Z. Parchman, SharP: Towards programming extreme-scale systems with hierarchical heterogeneous memory, in: 2017 46th International Conference on Parallel Processing Workshops, ICPPW, 2017, pp. 145–154, http://dx.doi.org/10.1109/ICPPW.2017.32.
[30] M. Gilge, et al., IBM System Blue Gene Solution - Blue Gene/Q Application Development, IBM Redbooks, 2014.
[31] W. Gropp, MPICH2: A new start for MPI implementations, in: Proceedings of the 9th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, Springer-Verlag, London, UK, 2002, p. 7, URL http://dl.acm.org/citation.cfm?id=648139.749473.
[32] J. Liu, Q. Koziol, H. Tang, F. Tessier, W. Bhimji, B. Cook, B. Austin, S. Byna, B. Thakur, G. Lockwood, et al., Understanding the IO performance gap between Cori KNL and Haswell, in: Cray User Group Meeting, 2017.
[33] P. Schwan, Lustre: Building a file system for 1,000-node clusters, in: Proceedings of the Linux Symposium, 2003, p. 9.
[34] IOR: Parallel filesystem I/O benchmark, https://github.com/LLNL/ior.

[35] E.R. Hawkes, R. Sankaran, J.C. Sutherland, J.H. Chen, Direct numerical simulation of turbulent combustion: fundamental insights towards predictive models, J. Phys. Conf. Ser. 16 (1) (2005) 65, URL http://stacks.iop.org/1742-6596/16/i=1/a=009.
[36] D.D. Sharma, Compute Express Link®: An open industry-standard interconnect enabling heterogeneous data-centric computing, in: 2022 IEEE Symposium on High-Performance Interconnects, HOTI, 2022, pp. 5–12, http://dx.doi.org/10.1109/HOTI55740.2022.00017.

François Tessier has been a researcher at Inria Rennes since 2020. He works in the KerData team on I/O optimization and storage infrastructure modeling for large-scale systems. In 2015, he received a Ph.D. from the University of Bordeaux on affinity-aware process placement. Afterwards, he joined Argonne National Laboratory and then the Swiss National Supercomputing Center. During these five years, he explored data aggregation techniques for I/O improvement and dynamic provisioning of storage systems for complex workflows. Since 2023, he has been actively involved in NumPEx, which aims to prepare the software stack for France's first Exascale machine.

Venkatram Vishwanath is a computer scientist at Argonne National Laboratory. He is the Data Science Group Lead at the Argonne Leadership Computing Facility (ALCF). His current focus is on algorithms, system software, and workflows to facilitate data-centric applications on supercomputing systems. His interests include AI for Science applications, supercomputing architectures, parallel algorithms and runtimes, and collaborative workspaces. He has received best paper awards at venues including HPDC and LDAV, and won the 2022 ACM Gordon Bell prize for HPC innovations for COVID-19 research.

Emmanuel Jeannot is a senior research scientist at Inria. From 2000 to 2009, he worked in Nancy (Loria laboratory, then Inria). In 2006, he was a visiting researcher at the University of Tennessee, ICL laboratory. Since 2009, Emmanuel Jeannot has been conducting his research at INRIA Bordeaux, where he leads the TADaaM team, and at the LaBRI laboratory of the University of Bordeaux. His primary research interests encompass the vast domain of parallel and high-performance computing, including runtime systems, process placement, scheduling for heterogeneous environments, I/O and storage, algorithms and models for parallel machines, and programming models.
