Adding topology and memory awareness in data aggregation algorithms

MSC: 68W15, 68W10

Keywords: Data movement; I/O; Data aggregation; Deep memory and storage hierarchy; Architecture-aware placement

Abstract: With the growing gap between computing power and the ability of large-scale systems to ingest data, I/O is becoming the bottleneck for many scientific applications. Improving read and write performance thus becomes decisive, and requires consideration of the complexity of architectures. In this paper, we introduce TAPIOCA, an architecture-aware data aggregation library. TAPIOCA offers an optimized implementation of the two-phase I/O scheme for collective I/O operations, taking advantage of the many levels of memory and storage that populate modern HPC systems, and leveraging network topology. We show that TAPIOCA can significantly improve the I/O bandwidth of synthetic benchmarks and I/O kernels of scientific applications running on leading supercomputers. For example, on HACC-IO, a cosmology code, TAPIOCA improves data writing by a factor of 13 on nearly a third of the target supercomputer.
1. Introduction

In the domain of large-scale simulations, driven by the demand for reliability and precision, the generation of tera- or petabytes of data has become increasingly prevalent. However, a growing disparity between compute power and I/O performance on supercomputers has emerged. Over the past decade, the ratio of I/O bandwidth to computing power for the first three systems on the Top500 list (https://www.top500.org/) has decreased by a factor of ten, as illustrated in Fig. 1. In that context, efficiently moving data between the applications and the storage system within high-performance computing (HPC) machines is crucial.

On the application side, managing I/O is complicated by the diverse data structures employed. For instance, particle-based applications often require writing multiple variables in multidimensional arrays distributed among processing entities, while adaptive mesh refinement (AMR) applications must handle varying I/O sizes depending on input parameters. The popularity of deep learning algorithms has introduced new workloads demanding vast quantities of input data. Additionally, complex workflows like in-situ visualization and analysis further exacerbate this complexity. Consequently, optimizing data movement is of paramount importance for the foreseeable future for scaling science.

From a hardware perspective, there has been a growing disparity between the amount of data that needs to be transferred and the memory or storage capabilities in terms of both capacity and performance. To address this issue, hardware vendors have introduced intermediate tiers of memory and storage, which must be utilized effectively to alleviate the I/O bottleneck. However, these memory hierarchy levels come with unique characteristics and sometimes require a dedicated software stack, making efficient use challenging. In addition, the process of moving data necessitates traversing intricate network topologies that must be considered, such as an all-to-all, 5D-torus or dragonfly.

In this landscape, harnessing these architecture and application characteristics is key to making optimized decisions. Among these techniques, data aggregation plays a central role in mitigating data movement bottlenecks. It involves aggregating data at various points in the architecture to optimize expensive data access operations. In collective I/O operations, for instance, data aggregation accumulates contiguous data chunks in memory before writing them to the storage system. This approach is called ‘‘two-phase I/O’’. However, the current implementations of the two-phase I/O scheme suffer from several limitations, especially with regard to the complexity of modern architectures. A reevaluation of this algorithm that fully leverages the potential of new technologies such as RDMA (Remote Direct Memory Access) and asynchronous operations can greatly improve I/O performance. Furthermore, an approach that is agnostic to the network topology and the memory is necessary to effectively handle the deepening complexity of memory and topology hierarchies.

In this paper, we introduce TAPIOCA, an I/O library designed to perform architecture-aware data aggregation at scale. TAPIOCA targets applications using collective I/O operations and can be extended to intricate workflows such as in-situ or in-transit analysis.
Large-scale simulations can be categorized into various groups, and among them, certain applications are heavily reliant on I/O operations, resulting in substantial time spent accessing the storage system. There are several factors contributing to this behavior. For instance, some applications involve extensive reading of input data that must be processed during the simulation. Conversely, other applications generate significant amounts of data that require subsequent processing after generation. Additionally, certain applications frequently access the file system for checkpointing purposes. It is worth noting that these categories are not mutually exclusive, and an application can fall into multiple categories. Therefore, optimizing the I/O access of such applications holds paramount significance, as it directly impacts the overall execution time.

2.2. Accessing data at scale

In recent years, the ratio between compute and I/O performance of supercomputers has been constantly degrading. Nowadays, I/O is becoming a bottleneck in many applications, which calls for improved data movement. In order to reduce this gap, efforts have been made at the hardware level, especially for the topology of the machine. Indeed, network topologies, despite being more complex, tend to reduce the distance between the data and the storage. Many supercomputers feature I/O nodes that are embedded within racks to serve as a proxy to the parallel file-system. This architecture helps to avoid I/O interference by decoupling the compute network and the I/O network. On the IBM BG/Q, for example, a 5D-torus network offers a limited number of hops between compute nodes and storage while providing different routes to distribute the load [1]. In addition, the partitioning of nodes into blocks of 512 nodes linked to four I/O nodes reduces as much as possible the impact of I/O interference between jobs.

2.3. MPI I/O and the two-phase I/O scheme

MPI [5] is widely utilized for the development of large-scale distributed-memory applications on high-performance clusters. Within MPI, the MPI I/O component plays a critical role in facilitating input and output operations. One significant aspect of MPI I/O is the collective I/O mechanism, which enables efficient reading and writing of data at scale. In collective I/O, all MPI tasks involved in the communication invoke the I/O routine in a synchronized manner, allowing the MPI runtime system to optimize data movement based on various application parameters, including data size, memory layout, and storage arrangement.

The two-phase I/O algorithm, present in MPI I/O implementations like ROMIO [6], is a well-established and efficient optimization technique. It involves selecting a subset of processes to aggregate contiguous data segments (aggregation phase) before writing them to the storage system (I/O phase). The primary objective of this approach is to minimize latency and enhance parallel file-system accesses by aggregating data in a manner that aligns with the layout of the parallel file-system. Fig. 2 provides an illustrative example of this technique, featuring four processes with two selected as aggregators. By minimizing network contention around the storage system and maximizing the I/O bandwidth through the writing of large contiguous data chunks, substantial performance improvements are achieved. However, the current implementation of this approach exhibits several limitations. Firstly, despite offering improved I/O performance compared to direct access, it often falls short of achieving the peak I/O bandwidth. Secondly, there is an observed inefficiency in the placement policy for aggregators, despite the potential impact on performance through smart mapping. Lastly, existing implementations fail to leverage the data model, data layout, and memory and system hierarchy effectively.
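To make the collective path described above concrete, the following minimal C sketch shows a two-phase collective write driven through MPI I/O. The file name, data volume, and hint values are illustrative only; cb_nodes and cb_buffer_size are hints recognized by ROMIO-based implementations to control the number of aggregators and the size of their buffers.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;                         /* 1 Mi doubles per process (illustrative) */
        double *buf = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) buf[i] = (double)rank;

        /* Hints steering the two-phase scheme: number of aggregators
           and size of each aggregation buffer. */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "cb_nodes", "16");
        MPI_Info_set(info, "cb_buffer_size", "16777216");   /* 16 MB */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* Every rank writes its contiguous block; the runtime gathers the
           pieces on the aggregators (aggregation phase) before issuing
           large writes to the file system (I/O phase). */
        MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        free(buf);
        MPI_Finalize();
        return 0;
    }

Note that such hints only express a preference: an MPI implementation is free to ignore them.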
However, if persistency is not necessary, the memory capacity must be able to contain the number of buffers required for aggregation. More formally, the memory capacity has to be such that:

    Cap_A ≥ N_buff × S_buff

Once this prerequisite has been met, we obtain a subset V_m ⊆ V_M containing the aggregator candidates from the set of memory banks. The next step consists of selecting the most appropriate memory tier, i.e., the candidate providing the best I/O bandwidth.

4.2.2. Objective function

Fig. 4. Objective function minimizing the communication costs to and from an aggregator.

To do so, we define two costs C_1 and C_2, as depicted in Fig. 4. C_1 corresponds to the cost of aggregating data onto the aggregator. To compute this cost, we sum up the cost of each data producer i sending an amount of data ω(i, A) to a memory bank A used for aggregation. This cost takes into account the slowest bandwidth involved as well as the worst latency.

    C_1 = \sum_{i \in V_M, i \neq A} \left( l \times d(i, A) + \frac{\omega(i, A)}{B_{i \rightarrow A}} \right)

C_2 is the cost of sending the aggregated data to the destination T (typically, the storage system).

    C_2 = l \times d(A, T) + \frac{\omega(A, T)}{B_{A \rightarrow T}}

Every node is in charge of computing the cost, for each of its local memory banks, of being an aggregator. Let us take as an example a node hosting three different types of memory complying with the persistency and capacity requirements mentioned previously. Three pairs of {C_1, C_2} will be computed, one for each tier. To determine the near-optimal location for data aggregation, we find the minimal value of the sum of these two costs among the elements of V_m. More formally, our objective function is:

    ArchAware(A) = \min_{A \in V_m} \left( C_1 + C_2 \right)

A call to MPI_Allreduce across a partition with the MPI_MINLOC parameter enables our algorithm to choose as an aggregator the process with the minimal cost. Hence, for each partition, an aggregator is elected.

4.2.3. Toy example

Fig. 5. Toy example of four processes collectively writing data on a Lustre file system through a data aggregation process.

Fig. 5 illustrates our model with four processes that need to collectively write data on a parallel file system (PFS). We consider that each process is located on a different node. Two memory banks within a node are separated by one hop, while the distance between nodes is noted on the links (white circles). Each node hosts two types of memory in addition to the main memory (DRAM): a high-bandwidth memory (HBM) and an HDD-based non-volatile memory (NVR). The source of the data is the DRAM (blue boxes) while the destination is a Lustre file system (green box). There is no need for intermediate persistency. Processes P0, P1, P2 and P3 respectively produce 10 MB, 50 MB, 20 MB and 5 MB. Based on vendor values, we set in Table 1 the latency, bandwidth, capacity and level of persistency of the available tiers of memory and of the interconnect network for this toy example.

Table 1. Memory and network capabilities based on vendor information.

    Value#              HBM    DRAM   NVR            Network
    Latency (ms)        10     20     100            30
    Bandwidth (GBps)    180    90     0.15           12.5
    Capacity (GB)       16     192    128            N/A
    Persistency         No     No     Job lifetime   N/A

Table 2 shows, for each process, the cost of aggregating data on its locally available tiers of memory. Our model shows that the most advantageous location for aggregation is the high-bandwidth memory available on the node hosting process P1. We can notice that the difference between aggregation on HBM and DRAM is negligible. We observed this result in real experiments on a supercomputer equipped with those types of memory. Likewise, this behavior has also been observed in a related work [32]. Finally, Fig. 6 depicts the decision taken by TAPIOCA for the aggregator selection.

Table 2. For each process, the sum of the aggregation cost Cost_A and the I/O cost Cost_T on each local memory tier, according to the amount of data produced (ω) and the network and memory information.

    P#   ω(i, A)   HBM     DRAM    NVR
    0    10        0.593   0.603   2.350
    1    50        0.470   0.480   2.020
    2    20        0.742   0.752   2.710
    3    5         0.503   0.513   2.120

It has to be noted that the aggregation memory can also be defined by the user through an environment variable (TAPIOCA_AGGRTIER). When using this method, the environment variable can be set to any memory tier implemented in our memory abstraction. The aggregator location is then computed according to the topology information only.
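As an illustration of the election step of Section 4.2.2, the sketch below lets every process evaluate C_1 + C_2 for each of its local tiers and then elects, within a partition, the rank holding the minimal cost through MPI_Allreduce with MPI_MINLOC. It is a simplification, not TAPIOCA's internal code: the tier_t fields, the single-producer cost evaluation, and the volumes passed in are placeholders.

    #include <mpi.h>
    #include <float.h>

    /* Hypothetical per-tier parameters: latency, hop counts and bandwidths. */
    typedef struct { double lat, dist_in, dist_out, bw_in, bw_out; } tier_t;

    /* C_1 + C_2 for one candidate tier, following the cost model of Section 4.2.2
       (a single producer is shown; the full model sums C_1 over all producers). */
    static double tier_cost(tier_t t, double vol_in, double vol_out)
    {
        double c1 = t.lat * t.dist_in  + vol_in  / t.bw_in;   /* aggregation cost */
        double c2 = t.lat * t.dist_out + vol_out / t.bw_out;  /* I/O cost         */
        return c1 + c2;
    }

    /* Elect, within the communicator 'part' (one partition), the rank whose best
       local memory tier minimizes C_1 + C_2. Returns the elected rank. */
    int elect_aggregator(MPI_Comm part, tier_t *tiers, int ntiers,
                         double vol_in, double vol_out, int *best_tier)
    {
        int rank;
        MPI_Comm_rank(part, &rank);

        struct { double cost; int rank; } local = { DBL_MAX, rank }, global;

        for (int t = 0; t < ntiers; t++) {
            double c = tier_cost(tiers[t], vol_in, vol_out);
            if (c < local.cost) { local.cost = c; *best_tier = t; }
        }

        /* MPI_MINLOC keeps the minimal cost together with the rank owning it. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MINLOC, part);
        return global.rank;
    }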
Fig. 6. Decision taken by TAPIOCA for selecting the most appropriate aggregator given the initial state described in Fig. 5.

Fig. 7. Calling three collective writes for an array-of-structures data layout with MPI I/O and TAPIOCA.

Compared with the MPI standard, our approach requires the description of the upcoming I/O operations before performing read or write calls. We extract from this information the data model (multidimensional arrays) and the data layout (array of structures, structure of arrays). The identification of these data patterns is the key to better scheduling I/O and to reducing the idle time of all the MPI tasks. As an example, Algorithm 1 describes the collective MPI I/O calls needed for a set of MPI processes writing three arrays in a file, each one describing a dimension of coordinates in (x, y, z), following an array-of-structures data layout. Each call to MPI_File_write_at_all is a collective operation independent of the next calls.

Algorithm 1: Collective MPI I/O writes.
     1  n ← 5;
     2  x[n], y[n], z[n];
     3  offset ← rank × 3 × n;
     6  MPI_File_write_at_all(f, offset, x, n, type, status);
     7  offset ← offset + n;
    10  MPI_File_write_at_all(f, offset, y, n, type, status);
    11  offset ← offset + n;
    14  MPI_File_write_at_all(f, offset, z, n, type, status);
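For reference, the following is a plain C rendering of Algorithm 1; the file name and the per-rank element count are illustrative. Each of the three calls is an independent collective operation, which is what prevents the MPI runtime from filling its aggregation buffers across variables.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 5;                 /* elements per variable, as in Algorithm 1 */
        double x[5], y[5], z[5];
        for (int i = 0; i < n; i++) x[i] = y[i] = z[i] = (double)rank;

        MPI_File f;
        MPI_File_open(MPI_COMM_WORLD, "coords.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &f);

        /* Array-of-structures layout in the file: each rank owns a 3*n slot. */
        MPI_Offset offset = (MPI_Offset)rank * 3 * n * sizeof(double);

        /* Three independent collective writes, one per dimension. */
        MPI_File_write_at_all(f, offset, x, n, MPI_DOUBLE, MPI_STATUS_IGNORE);
        offset += n * sizeof(double);
        MPI_File_write_at_all(f, offset, y, n, MPI_DOUBLE, MPI_STATUS_IGNORE);
        offset += n * sizeof(double);
        MPI_File_write_at_all(f, offset, z, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&f);
        MPI_Finalize();
        return 0;
    }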
With TAPIOCA, application developers have to describe the upcoming writes. This description contains nothing more than what is already known and requires less than a dozen lines of code. Algorithm 2 is the TAPIOCA version of Algorithm 1. Since we have three variables to write, we declare arrays of size 3 describing the number of elements, the size of the data type, and the offset in the file (for loop starting at line 6). Then, TAPIOCA is initialized with this information. This phase enables our library to schedule the aggregation phase in order to completely fill an aggregator buffer before flushing it to the target. Fig. 7 gives another perspective of what happens when performing this write phase with MPI I/O and TAPIOCA. In our example, MPI I/O has to flush three almost empty buffers to the file, while TAPIOCA can aggregate all the data into a single buffer before flushing it to the target.

Algorithm 2: Collective TAPIOCA writes.
     6  for i ← 0, i < 3, i ← i + 1 do
     7      count[i] ← n;
     8      type[i] ← sizeof(type);
     9      ofst[i] ← offset + i × n;
    12  TAPIOCA_Init(count, type, ofst, 3);
    15  TAPIOCA_Write(f, offset, x, n, type, status);
    16  offset ← offset + n;
    19  TAPIOCA_Write(f, offset, y, n, type, status);
    20  offset ← offset + n;
    23  TAPIOCA_Write(f, offset, z, n, type, status);

4.3.2. Buffers pipelining

In order to optimize both the aggregation phase and the I/O phase, each aggregator manages at least two buffers. Therefore, while data is aggregated into one buffer, another one can be flushed to the target. In our implementation, as the aggregation phase is performed with RMA operations (one-sided communication), no synchronization is needed between the processes sending data to the aggregators and the aggregators themselves. Moreover, the aggregators perform non-blocking independent writes to the target (usually a storage system), making themselves available for other operations. In this way, the aggregators are able to flush a full buffer while receiving data into another one. This loop is performed as many times as necessary to process the data. The buffers used by the aggregators to stage data are allocated as a multiple of the target file-system block size to avoid lock penalties during the I/O phase. As depicted in Fig. 8, a series of experiments in which we ran a simple benchmark from 2048 BG/Q nodes on a GPFS file-system (each process writes the same chunk size to different offsets of a single shared file) helped motivate this choice, although this behavior is known. Each instance of a buffer filling and being flushed is called a round. A global round is equivalent to a round performed by the same buffer on all the aggregators.
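A minimal sketch of this pipelining idea is given below. It is a simplification, not TAPIOCA's implementation: the buffer size, the fake fill with memset, and the write offsets are placeholders, and only the aggregator side is shown. While one buffer drains through a non-blocking write, the other one is reused for the next round.

    #include <mpi.h>
    #include <string.h>

    #define BUFF_SIZE (16 * 1024 * 1024)   /* ideally a multiple of the file-system block size */

    /* Drain 'rounds' buffers to 'fh', overlapping the flush of one buffer
       with the filling of the other. */
    void pipelined_flush(MPI_File fh, int rounds, MPI_Offset base)
    {
        static char buff[2][BUFF_SIZE];
        MPI_Request req[2] = { MPI_REQUEST_NULL, MPI_REQUEST_NULL };

        for (int r = 0; r < rounds; r++) {
            int id = r % 2;

            /* Make sure the previous flush of this buffer has completed
               before reusing it for the next round. */
            MPI_Wait(&req[id], MPI_STATUS_IGNORE);

            memset(buff[id], r, BUFF_SIZE);    /* stand-in for the aggregation phase */

            /* Non-blocking independent write: the aggregator stays available
               to receive data into the other buffer while this one drains. */
            MPI_File_iwrite_at(fh, base + (MPI_Offset)r * BUFF_SIZE,
                               buff[id], BUFF_SIZE, MPI_BYTE, &req[id]);
        }
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }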
Using the contributions of the previous sections, we present here the write and read algorithms implemented in TAPIOCA and executed by the aggregation processes selected thanks to our cost model presented in Section 4.2.

Algorithm 3 details the write method implemented in our library. For each call to TAPIOCA_Write, we retrieve information computed during the initialization phase, such as the number of aggregation buffers, the round number, the target aggregator, the amount of data to write during this round, and the aggregator buffer to put data in (lines 6 to 10). Then, the while loop starting at line 13 blocks, in a fence (a barrier in the context of MPI one-sided communication), the processes whose current round is different from the global round. Only the processes with the matching round can lift the barrier. If a process passing this fence is an aggregator, it flushes the appropriate buffer into the file (I/O phase). Line 21 simply puts the data into the target buffer by way of a one-sided operation (aggregation phase). If the process has written all its data, it enters a portion of code similar to the one starting at line 13. Otherwise, we recursively call this TAPIOCA_Write function again while updating the function parameters.

Algorithm 3: TAPIOCA write procedure (excerpt).
    21  RMA_Put(data, chunkSize, offset, aggr, buffId);
    24  if chunkSize = size then
    25      while globalRound ≠ TotalRounds do
    26          Fence();
    27          if I am an aggregator then
    28              iFlush_Buffer(buffId);
    29          globalRound ← globalRound + 1;
    30          buffId ← globalRound % buffCount;
    31  else
    32      TAPIOCA_Write(f, offset + chunkSize, data + roundSize, size − chunkSize, type, status);

We present in Algorithm 4 our read procedure. We can mainly distinguish four blocks in this algorithm. From lines 15 to 21, we perform a first synchronization of the processes involved in the read operation. During this phase, the processes chosen to act as aggregators read from the input file a chunk of data whose size is the size of an aggregation buffer (I/O phase). From lines 24 to 31, this data is distributed from the aggregators to the other processes. The processes passing this conditional block carry out an RMA operation to get data from the appropriate aggregation buffer (aggregation phase, line 34). The last block, from line 37 to the end, is quite similar to the second block. The processes whose data has been fully retrieved get stuck in a waiting loop, while the others recursively call the read function.
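The fence-and-put pattern that these algorithms rely on can be illustrated with the sketch below. It is a simplified stand-in for the RMA_Put and Fence steps, not TAPIOCA's code: the window is created for a single round, the buffer size and offsets are placeholders, and the aggregator flushes with a blocking write instead of the non-blocking one used in the library.

    #include <mpi.h>
    #include <stdlib.h>

    #define BUFF_SIZE (16 * 1024 * 1024)

    /* One aggregation round: non-aggregators put their chunk into the
       aggregator's buffer; the fence closes the RMA epoch; the aggregator
       then flushes the filled buffer to the file. */
    void aggregation_round(MPI_Comm comm, int aggr, int is_aggr,
                           char *chunk, int chunk_size, MPI_Aint dest_off,
                           MPI_File fh, MPI_Offset file_off)
    {
        char *buff = NULL;
        MPI_Win win;

        /* Only the aggregator exposes memory; the others attach a zero-size window. */
        if (is_aggr) buff = malloc(BUFF_SIZE);
        MPI_Win_create(buff, is_aggr ? BUFF_SIZE : 0, 1, MPI_INFO_NULL, comm, &win);

        MPI_Win_fence(0, win);                          /* open the epoch    */
        if (!is_aggr)
            MPI_Put(chunk, chunk_size, MPI_BYTE,        /* aggregation phase */
                    aggr, dest_off, chunk_size, MPI_BYTE, win);
        MPI_Win_fence(0, win);                          /* close the epoch   */

        if (is_aggr)                                    /* I/O phase         */
            MPI_File_write_at(fh, file_off, buff, BUFF_SIZE, MPI_BYTE,
                              MPI_STATUS_IGNORE);

        MPI_Win_free(&win);
        if (is_aggr) free(buff);
    }

In the library itself, the window would be created once at initialization and reused across rounds; recreating it per call, as done here, is only meant to keep the example self-contained.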
5. Evaluation

To validate our approach, we ran a large series of experiments on Mira and Theta, two leadership-class supercomputers at Argonne National Laboratory which have been decommissioned in 2019 and 2023, respectively. We also used Cooley, a mid-scale visualization cluster, to highlight the portability of our method. TAPIOCA was assessed on I/O benchmarks and on two I/O kernels of large-scale applications: a cosmological simulation and a computational fluid dynamics (CFD) code.

In this section, we first describe the three testbeds we carried out our experiments on. Then, we demonstrate in Section 5.2 the impact of user-defined parameters on collective I/O operations. This step calibrates TAPIOCA and MPI I/O for a fair comparison. Finally, starting from Section 5.3, we present a comparative study of TAPIOCA and MPI I/O on diverse use-cases.

Table 3 summarizes the experimental setup used to evaluate our architecture-aware data aggregation technique.

5.1. Testbeds

5.1.1. Mira

Mira is a 10 PetaFLOPS IBM BG/Q supercomputer that was ranked in the top ten of the Top500 list for years, until June 2017 (see Fig. 9). Mira contains 48K nodes interconnected with a 5D-torus high-speed network providing a theoretical bandwidth of 1.8 GBps per link. Each node hosts 16 hyperthreaded PowerPC A2 cores (1600 MHz) and 16 GB of main memory. Following the BG/Q architecture rules, Mira splits the nodes into Psets. A Pset is a subset of 128 nodes sharing the same I/O node. Two compute nodes of a Pset offer a 1.8 GBps link to the I/O node. These nodes are called the bridge nodes. GPFS [8] manages the 27 PB of storage. In terms of software, we compiled the test applications and our library with the IBM XL compiler, v12.1, and used the default MPI installation on Mira based on MPICH2 v1.5 (MPI-2 standard).

5.1.2. Theta

Theta is an 11.7 PetaFLOPS Cray XC40 supercomputer. This architecture (see Fig. 10) consists of more than 3600 nodes and 864 Aries routers interconnected with a dragonfly network. The routers are distributed in groups of 96 internally interconnected with 14 GBps electrical links, while 12.5 GBps optical links connect groups together. Each router hosts four Intel KNL 7250 nodes. A KNL node offers 68 cores at 1.60 GHz, 192 GB of main memory, a 128 GB SSD, and 16 GB of MCDRAM, also called high-bandwidth memory (HBM).
Fig. 11. Storage system on Theta managed by Lustre.

Fig. 12. I/O bandwidth achieved with the IOR benchmark on 512 Mira nodes, 16 ranks per node, with and without user-defined optimizations.
Fig. 14. I/O bandwidth achieved with 1D-array from 128 Cray XC40 nodes while varying the data distribution. Data read/written into a single shared file on Lustre.

Fig. 15. I/O bandwidth achieved with 1D-array from 128 Cray XC40 nodes while varying the number of nodes per file. The node-local SSD was also considered as a target.

For the rest of our experiments, and to ensure a fair comparison between TAPIOCA and MPI-IO, we configured each environment with the optimal parameters determined in this section.

5.3. 1D-array

We first ran a series of experiments with a micro-benchmark called 1D-Array. In this code, every MPI process writes a contiguous piece of data in one or multiple shared files (subfiling) during a collective call. We used this benchmark to provide an initial assessment of TAPIOCA's full range of capabilities, namely architecture-aware aggregator placement, I/O scheduling, and the means to use any type of memory and storage level for aggregation. As Theta is our most recent architecture featuring multiple memory and storage tiers, we focused on this platform for this first analysis. We compared TAPIOCA with the MPI I/O implementation installed on Theta while varying the data distribution among the processes, the number of nodes involved in the files read and written, and the aggregation memory tiers.

This micro-benchmark allocates one buffer per process filled with random values and collectively writes/reads it to/from the storage system. We tried out three different configurations for the buffer size: every process allocates the same buffer size, a random buffer size is chosen, or the buffer sizes follow a normal distribution. To have a fair comparison, the data distributions were preserved between experiments with MPI-IO and TAPIOCA.

Fig. 14 shows experiments on 128 Cray XC40 nodes while writing and reading data to a single shared file on the Lustre file system. We selected 48 aggregators (DRAM) for both MPI-IO and TAPIOCA. We carried out three use-cases: the first one with an array of 25K integers per process (100 kB), the second one with a random distribution of the data among the processes (a value between 0 kB and 100 kB), and the last one with a normal distribution among the processes. Our approach outperformed MPI-IO on the three types of distributions. However, the performance gap was particularly significant with a random and a normal distribution, as the write bandwidth was respectively approximately 6 and 29 times higher, while we read data 3 times faster.

Performing I/O operations on a single shared file is known to often provide poor performance. Subfiling is usually preferred. Fig. 15 presents the results we obtained on the same platform while performing subfiling, from one file per node (1:1) to one file per 8 nodes (1:8). In such a use-case, one aggregator was selected per group of nodes writing or reading the same file. Data aggregation was performed on DRAM while the destination of the data was the Lustre file system. Unlike MPI-IO, TAPIOCA allows setting the local SSD as a shared destination tier. The storage space is mapped into a memory space exposed to one-sided communication. We also ran experiments showing this feature. It has to be noted that the file created on each local SSD was temporary (allocation lifetime). We can conclude from these results that one file per node is the configuration offering the best I/O bandwidth for MPI-IO and our library. We also observe that setting the SSD as a destination provides better performance. However, this must be moderated by the fact that the volume of data read and written is small and that a cache effect undoubtedly comes into play. The ‘‘1:1’’ case was also evaluated with a random data distribution, as shown in Table 5. Again, the best I/O performance was achieved with TAPIOCA, except in the read case from the Lustre file system. We are still investigating the poor read bandwidth obtained in some of our experiments.

Table 5. MPI-IO vs. TAPIOCA, one file per node on Lustre and SSD (TAPIOCA only) with random data distribution.

    I/O operation      MPI-IO   TAPIOCA (Lustre)   TAPIOCA (SSD)
    Read Bw (GBps)     0.99     0.80               4.47
    Write Bw (GBps)    2.46     5.89               4.32

Last, Table 6 gives the read and write I/O bandwidth achieved on the Lustre file system when performing data aggregation on the three tiers of memory available on the Cray system. In order to highlight the differences, we increased the data size per process to 1 MB. We first observed that the difference in performance was not significant between aggregation on DRAM and HBM. This experiment corroborates the cost model evaluation presented in Section 4.2. We can also notice the overhead due to the file mapping in memory (mmap) when aggregating data on the local SSD.

Table 6. Reading and writing one file per node on Lustre with TAPIOCA while aggregating on the three tiers of memory and storage available on the nodes. 1 MB read/written per process.

    I/O operation      DRAM    HBM     SSD
    Read Bw (GBps)     8.96    8.24    7.80
    Write Bw (GBps)    19.15   19.36   10.70
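The subfiling configurations evaluated here boil down to splitting the processes into groups and letting each group write its own file collectively. The sketch below is a generic illustration of that pattern (the group size, file naming scheme, and data volume are made up, and the mapping of ranks to nodes is ignored), not the code used in our experiments.

    #include <mpi.h>
    #include <stdio.h>

    /* Subfiling sketch: one shared file per group of 'ranks_per_file' ranks.
       Each group performs its own collective write, independent of the others. */
    void subfiled_write(const double *buf, int count, int ranks_per_file)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int group = rank / ranks_per_file;          /* index of the target file */
        MPI_Comm sub;
        MPI_Comm_split(MPI_COMM_WORLD, group, rank, &sub);

        int sub_rank;
        MPI_Comm_rank(sub, &sub_rank);

        char fname[64];
        snprintf(fname, sizeof(fname), "data.%04d", group);

        MPI_File fh;
        MPI_File_open(sub, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        MPI_Offset offset = (MPI_Offset)sub_rank * count * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Comm_free(&sub);
    }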
5.4. HACC-IO

HACC-IO is the I/O kernel of HACC (Hardware Accelerated Cosmology Code), a large-scale cosmological application.
Fig. 16. Write bandwidth achieved with HACC-IO on Mira by writing one file per Pset from 1024 nodes (16 ranks/node). TAPIOCA: 16 aggregators per Pset, 16 MB aggregator buffer size.

Fig. 17. Write bandwidth achieved with HACC-IO on Mira by writing one file per Pset from 4096 nodes (16 ranks/node). TAPIOCA: 16 aggregators per Pset, 16 MB aggregator buffer size.

5.4.1. Mira

Fig. 16 shows the results on 1024 Mira nodes, with 16 ranks per node and one file per Pset as output. We compared our approach to MPI I/O on this platform with two data layouts: array of structures (AoS) and structure of arrays (SoA). For these experiments, we varied the data size per rank from 5K to 100K particles. We first note from the results that subfiling is an efficient technique to improve I/O performance on the BG/Q, since up to 90% of the peak I/O bandwidth was achieved by our topology-aware strategy. We also note that we outperformed the default implementation even on large messages.

Fig. 17 presents experiments with the same configuration as the previous one, except that we ran it on 4096 Mira nodes. The behavior was similar, with the peak write bandwidth almost reached with TAPIOCA (the peak is estimated at 89.6 GBps on this node count). As with the experiments on 1024 nodes, the gap with MPI I/O decreased as the data size increased. In any case, the I/O performance was substantially improved for both AoS and SoA layouts.
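The two layouts compared in these experiments can be summarized by the following C declarations. This is a generic illustration with made-up field names and sizes, not HACC's actual particle structure.

    #define N 1000   /* particles per rank, illustrative */

    /* Array of structures (AoS): the variables of one particle are contiguous,
       so the variables end up interleaved in the file. */
    struct particle { double x, y, z; };
    struct particle aos[N];

    /* Structure of arrays (SoA): each variable is a contiguous array,
       so each variable can be written as one large contiguous block. */
    struct {
        double x[N];
        double y[N];
        double z[N];
    } soa;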
5.4.2. Theta

Our experiments on Theta showed a good I/O performance gain as well. Fig. 18 depicts the read and write bandwidth achieved on 1024 nodes of the Cray XC40 supercomputer while sharing a single file as output and varying the data size per process. This result highlights the performance improvement TAPIOCA can achieve on a standard workflow, from the application to a parallel file system. Data aggregation was performed on the DRAM in this set of experiments. On both read and write, TAPIOCA outperformed MPI-IO, respectively by a factor of 5.4 and 13.8, with a 1 MB data size per process.

Fig. 18. Read and write bandwidth achieved with HACC-IO from 1024 Cray XC40 nodes while writing into a single shared file on the Lustre file-system.

As demonstrated in Sections 5.3 and 5.4.1, subfiling is a key method to improve I/O bandwidth and reduce the proportion of the wall time spent in I/O. As shown in Fig. 19, writing one file per node on the parallel file system improved the performance up to 40 times with a large amount of data per process. In this case, MPI-IO and TAPIOCA offered I/O performance in the same confidence interval. As mentioned previously, whatever the subfiling granularity chosen, TAPIOCA is able to use the local SSD as a file destination (as well as an aggregation layer). Therefore, we included the results when writing and reading data to/from this storage layer. In this case, the I/O bandwidth was boosted in the range of 4 to 9 times when writing data and in the range of 6 to 8 times when reading, compared to the parallel file system.

To extend the analysis of this use-case, we ran a weak scaling study of the previous experiment, as depicted in Fig. 20. Here, every process managed 1 MB of data. The aggregation was performed on the DRAM of each aggregator and the target for output data was set to the Lustre parallel file system and the on-node SSD.
Fig. 19. Read and write bandwidth achieved with HACC-IO from 1024 Cray XC40 nodes while writing one file per node on the Lustre file-system and on the local SSD (TAPIOCA only). Log-scale on y-axis.

Fig. 21. HACC-IO on 1024 Cray XC40 nodes, one file per node on local SSD. Comparison of the aggregation on DDR and on HBM.

Fig. 22. Write/Read workflow using TAPIOCA and SSDs as both an aggregation buffer (write) and a target (read).
Table 7. Maximum write and read bandwidth (GBps) and total I/O time achieved with and without aggregation on SSD.

    Agg. tier        Write     Read       I/O time
    TAPIOCA DDR      47.50     38.92      693.88 ms
    MPI-IO DDR       32.95     37.74      843.73 ms
    TAPIOCA SSD      26.88     227.22     617.46 ms
    Variation        −36.10%   +446.94%   −26.82%

Table 8. Maximum write and read bandwidth (GBps) and total I/O time achieved with and without aggregation on local HDD.

    Agg. tier        Write     Read       I/O time
    TAPIOCA DDR      6.60      38.80      123.41 ms
    MPI-IO DDR       6.02      17.46      155.40 ms
    TAPIOCA HDD      5.97      35.86      135.86 ms
    Variation        −0.83%    +105.38%   −12.57%

The testbed we targeted is not designed for intensive I/O. In addition, the on-node disks are hard disk drives with poor performance. However, this machine is suitable for workflows combining simulation and visualization, as presented in Fig. 22. Beyond the I/O performance, these experiments are more a proof of concept.

We show in Table 8 the results obtained with the workflow described in Fig. 22. To control the impact of GPFS caching, we interleaved random I/O with the HACC-IO write and read runs. We can notice that the overhead caused by local aggregation on HDD is very low. Again, the read bandwidth is significantly increased while the overall I/O time is reduced by more than 12% on this cluster.

5.5. S3D-IO

S3D [35] is a state-of-the-art direct numerical simulation (DNS) code written in Fortran and MPI, in the field of computational fluid dynamics (CFD). S3D focuses on turbulence-chemistry interactions in combustion. The DNS approach aims to address small domain problems to calibrate physical models for macro-scale CFD simulations. S3D is based on a 3D domain decomposition distributed across the MPI processes. In terms of I/O, a new single shared file is collectively written every n timesteps. The state of each element of the studied domain is stored following an array-of-structures data layout. The output file is used both as a checkpoint in case of failure and for data analysis. S3D-IO is a version of the S3D production code whose physics modules have been removed. The memory arrangement as well as the I/O routines have been kept, though.

We implemented a module in S3D-IO using TAPIOCA for managing I/O operations. For these experiments, we let our architecture-aware algorithm described in Section 4.2 automatically decide the most appropriate tiers of memory for data aggregation among the compute nodes.

We first present in Table 9 a typical use-case of S3D with 134 and 537 million grid points respectively distributed on 256 and 1024 nodes of the Cray XC40 system (16 ranks per node). We set the number of aggregators to 96 on 256 nodes and 384 on 1024 nodes for both MPI-IO and TAPIOCA. For this use-case, our aggregator placement algorithm selected the HBM as an aggregation layer for all the 96 aggregating nodes. We can see that on the two problem sizes, TAPIOCA significantly outperforms MPI-IO. When running on 1024 nodes, the I/O bandwidth is multiplied by 3.

Table 9. Maximum write bandwidth (GBps) achieved with aggregation performed on HBM using the TAPIOCA library.

                 256 nodes (134M points, 160 GB)   1024 nodes (537M points, 640 GB)
    MPI-IO       3.02 GBps                          4.42 GBps
    TAPIOCA      4.86 GBps                          13.75 GBps
    Variation    +60.93%                            +210.91%

In order to emphasize the adaptability of our approach, we ran another series of experiments on 256 nodes with 134 million grid points while artificially decreasing the capacity of the high-bandwidth memory, then the DRAM, to 32 MB. At the same time, we set the number of aggregation buffers to 3 and their size to 16 MB (so 48 MB in total, above the memory capacity). The goal was to show the behavior of TAPIOCA in case the fastest tier of memory available does not have enough space for the aggregated data. Table 10 presents the results. The capacity requirement described in Section 4.2 not being fulfilled, the second and then the third fastest memory tier are selected. In the third scenario, data is aggregated on the node-local SSD, offering poor I/O bandwidth compared to HBM or DRAM. However, the application can still be carried out.

Table 10. Maximum write bandwidth (GBps) while artificially reducing the memory capacity of the HBM and then the DRAM. For each run, the memory tier selected for aggregation by TAPIOCA is shown in square brackets.

    Run   HBM       DDR        NVR        Bandwidth   Std dev.
    1     [16 GB]   192 GB     128 GB     4.86 GBps   0.39 GBps
    2     ↓ 32 MB   [192 GB]   128 GB     4.90 GBps   0.43 GBps
    3     ↓ 32 MB   ↓ 32 MB    [128 GB]   2.98 GBps   0.15 GBps
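The fallback exercised by these runs follows the capacity prerequisite of Section 4.2, Cap_A ≥ N_buff × S_buff. A minimal sketch of this selection logic is shown below; the tier_t fields and the tier ordering are placeholders, not TAPIOCA's data structures.

    #include <stddef.h>

    /* Hypothetical description of a memory/storage tier. */
    typedef struct {
        const char *name;        /* e.g. "HBM", "DRAM", "NVR"       */
        size_t      capacity;    /* bytes available for aggregation */
        int         persistent;  /* survives the job or not         */
    } tier_t;

    /* Return the first tier (assumed sorted from fastest to slowest) that
       satisfies Cap_A >= N_buff * S_buff and the persistency requirement. */
    const tier_t *select_tier(const tier_t *tiers, int ntiers,
                              size_t n_buff, size_t s_buff, int need_persistency)
    {
        for (int i = 0; i < ntiers; i++) {
            if (need_persistency && !tiers[i].persistent)
                continue;
            if (tiers[i].capacity >= n_buff * s_buff)
                return &tiers[i];   /* e.g. DRAM, then NVR, when HBM is too small */
        }
        return NULL;                /* no tier can host the aggregation buffers */
    }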
6. Discussion

In this section, we discuss several challenges, including those faced while pursuing this research. These highlight the need for better co-design between hardware and software stacks, as well as the need for domain-driven research for I/O data management.

6.1. Impact of network interference

While carrying out experiments with our I/O library, we observed a certain variability in the I/O bandwidth measurements. This instability was due to I/O interference from other concurrently running jobs. On Mira, a set of I/O nodes is isolated only as part of a 512-node allocation.

To emphasize this behavior, we ran controlled benchmark tests using one Pset (128 compute nodes, two bridge nodes and one I/O node). Our tests were run to highlight the impact of I/O interference. In one case, we ran a single I/O-intensive HACC-IO job on 64 of the 128 nodes, while leaving the other 64 nodes idle. This case eliminated interference on the bridge and I/O nodes. In the other case, we ran the same I/O-intensive job on 64 of the 128 nodes, while simultaneously running jobs of varying I/O intensity on the other 64 nodes. Node allocation was distributed such that each 64-node job used 32 nodes per bridge node. This configuration corresponded to the default distribution on BG/Q. Fig. 23 depicts a 5D torus flattened on 2 dimensions and the aforementioned job partitioning.

Table 11 shows the mean I/O bandwidth achieved with HACC-IO with and without interference. A single I/O-intensive HACC-IO job running on 64 nodes sharing two bridge nodes can reach more than 60% of the peak I/O bandwidth. However, the performance is decreased by 13% when a concurrent job is running on the same Pset. We can also notice a rise in variability (standard deviation) of 37.5%. This result demonstrates the need for a good understanding of the underlying topology and better ways to leverage this knowledge by conducting more research in the domain of topology-aware resource allocation or I/O contention management (I/O scheduling or I/O priority). On BG/Q, for instance, we have learnt that the minimal unit to consider for a node allocation is a block of four Psets (512 nodes) to reduce as much as possible the impact of I/O interference and ensure good reproducibility.
7. Conclusion
Data availability

Data will be made available on request.

Acknowledgments

This research has been funded in part by the NCSA-Inria-ANL-BSC-JSC-Riken-UTK Joint-Laboratory on Extreme Scale Computing (JLESC). This research used resources of the Argonne Leadership Computing Facility, a U.S. Department of Energy (DOE) Office of Science user facility at Argonne National Laboratory and is based on research supported by the U.S. DOE Office of Science-Advanced Scientific Computing Research Program, under Contract No. DE-AC02-06CH11357.

References

[1] D. Chen, N.A. Eisley, P. Heidelberger, R.M. Senger, Y. Sugawara, S. Kumar, V. Salapura, D.L. Satterfield, B. Steinmacher-Burow, J.J. Parker, The IBM Blue Gene/Q interconnection network and message unit, in: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, ACM, New York, NY, USA, 2011, pp. 26:1–26:10, http://dx.doi.org/10.1145/2063384.2063419.
[2] W. Bhimji, D. Bard, M. Romanus, D. Paul, A. Ovsyannikov, B. Friesen, M. Bryson, J. Correa, G.K. Lockwood, V. Tsulaia, S. Byna, S. Farrell, D. Gursoy, C. Daley, V. Beckner, B. Van Straalen, D. Trebotich, C. Tull, G.H. Weber, N.J. Wright, K. Antypas, Prabhat, Accelerating science with the NERSC burst buffer early user program, in: CUG2016 Proceedings, 2016, URL https://escholarship.org/uc/item/9wv6k14t.
[3] M.-A. Vef, N. Moti, T. Süß, M. Tacke, T. Tocci, R. Nou, A. Miranda, T. Cortes, A. Brinkmann, GekkoFS—A temporary burst buffer file system for HPC applications, J. Comput. Sci. Tech. 35 (1) (2020) 72–91, http://dx.doi.org/10.1007/s11390-020-9797-6.
[4] T. Wang, K. Mohror, A. Moody, K. Sato, W. Yu, An ephemeral burst-buffer file system for scientific applications, in: SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, 2016, pp. 807–818, http://dx.doi.org/10.1109/SC.2016.68.
[5] MPI Forum, MPI-2: Extensions to the Message-Passing Interface, July 1997, http://www.mpi-forum.org/docs/docs.html.
[6] R. Thakur, W. Gropp, E. Lusk, A case for using MPI's derived datatypes to improve I/O performance, in: Proceedings of SC98: High Performance Networking and Computing, ACM Press, 1998, URL http://www.mcs.anl.gov/~thakur/dtype/.
[7] H. Luu, M. Winslett, W. Gropp, R. Ross, P. Carns, K. Harms, M. Prabhat, S. Byna, Y. Yao, A multiplatform study of I/O behavior on petascale supercomputers, in: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC '15, ACM, New York, NY, USA, 2015, pp. 33–44, http://dx.doi.org/10.1145/2749246.2749269.
[8] F. Schmuck, R. Haskin, GPFS: A shared-disk file system for large computing clusters, in: Proceedings of the 1st USENIX Conference on File and Storage Technologies, FAST '02, USENIX Association, Berkeley, CA, USA, 2002, URL http://dl.acm.org/citation.cfm?id=1083323.1083349.
[9] Lustre filesystem website, http://lustre.org/.
[10] M. Chaarawi, S. Chandok, E. Gabriel, Performance evaluation of collective write algorithms in MPI I/O, in: G. Allen, J. Nabrzyski, E. Seidel, G.D. van Albada, J. Dongarra, P.M.A. Sloot (Eds.), Computational Science – ICCS 2009: 9th International Conference, Baton Rouge, LA, USA, May 25-27, 2009, Proceedings, Part I, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 185–194.
[11] J.M. del Rosario, R. Bordawekar, A. Choudhary, Improved parallel I/O via a two-phase run-time access strategy, SIGARCH Comput. Archit. News 21 (5) (1993) 31–38, http://dx.doi.org/10.1145/165660.165667.
[12] R. Thakur, W. Gropp, E. Lusk, Optimizing noncontiguous accesses in MPI I/O, Parallel Comput. 28 (1) (2002) 83–105, http://dx.doi.org/10.1016/S0167-8191(01)00129-6.
[13] R. Thakur, W. Gropp, E. Lusk, Data sieving and collective I/O in ROMIO, in: Proceedings of the 7th Symposium on the Frontiers of Massively Parallel Computation, FRONTIERS '99, IEEE Computer Society, Washington, DC, USA, 1999, p. 182, URL http://dl.acm.org/citation.cfm?id=795668.796733.
[14] R. Thakur, W. Gropp, E. Lusk, On implementing MPI-IO portably and with high performance, in: Proceedings of the Sixth Workshop on I/O in Parallel and Distributed Systems, IOPADS '99, ACM, New York, NY, USA, 1999, pp. 23–32, http://dx.doi.org/10.1145/301816.301826.
[15] Y. Tsujita, H. Muguruma, K. Yoshinaga, A. Hori, M. Namiki, Y. Ishikawa, Improving collective I/O performance using pipelined two-phase I/O, in: Proceedings of the 2012 Symposium on High Performance Computing, HPC '12, Society for Computer Simulation International, San Diego, CA, USA, 2012, pp. 7:1–7:8, URL http://dl.acm.org/citation.cfm?id=2338816.2338823.
[16] F. Tessier, V. Vishwanath, E. Jeannot, TAPIOCA: An I/O library for optimized topology-aware data aggregation on large-scale supercomputers, in: 2017 IEEE International Conference on Cluster Computing, CLUSTER, 2017, pp. 70–80, http://dx.doi.org/10.1109/CLUSTER.2017.80.
[17] P. Malakar, V. Vishwanath, Hierarchical read–write optimizations for scientific applications with multi-variable structured datasets, Int. J. Parallel Program. 45 (1) (2017) 94–108, http://dx.doi.org/10.1007/s10766-015-0388-z.
[18] Y. Tsujita, K. Yoshinaga, A. Hori, M. Sato, M. Namiki, Y. Ishikawa, Multithreaded two-phase I/O: Improving collective MPI-IO performance on a Lustre file system, in: 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 2014, pp. 232–235, http://dx.doi.org/10.1109/PDP.2014.46.
[19] J.F. Lofstead, S. Klasky, K. Schwan, N. Podhorszki, C. Jin, Flexible IO and integration for scientific codes through the adaptable IO system (ADIOS), in: Proceedings of the 6th International Workshop on Challenges of Large Applications in Distributed Environments, CLADE '08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 15–24, http://dx.doi.org/10.1145/1383529.1383533.
[20] M.J. Brim, A.T. Moody, S.-H. Lim, R. Miller, S. Boehm, C. Stanavige, K.M. Mohror, S. Oral, UnifyFS: A user-level shared file system for unified access to distributed local storage, in: 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS, 2023, pp. 290–300, http://dx.doi.org/10.1109/IPDPS54959.2023.00037.
[21] M. Gossman, B. Nicolae, J. Calhoun, Modeling multi-threaded aggregated I/O for asynchronous checkpointing on HPC systems, in: ISPDC 2023: The 22nd International Symposium on Parallel and Distributed Computing, IEEE, Bucharest, Romania, 2023, pp. 101–105, http://dx.doi.org/10.1109/ISPDC59212.2023.00021, URL https://hal.science/hal-04343661.
[22] B. Nicolae, A. Moody, E. Gonsiorowski, K. Mohror, F. Cappello, VeloC: Towards high performance adaptive asynchronous checkpointing at large scale, in: 2019 IEEE International Parallel and Distributed Processing Symposium, IPDPS, 2019, pp. 911–920, http://dx.doi.org/10.1109/IPDPS.2019.00099.
[23] T. Jin, F. Zhang, Q. Sun, H. Bui, M. Romanus, N. Podhorszki, S. Klasky, H. Kolla, J. Chen, R. Hager, C.S. Chang, M. Parashar, Exploring data staging across deep memory hierarchies for coupled data intensive simulation workflows, in: 2015 IEEE International Parallel and Distributed Processing Symposium, 2015, pp. 1033–1042, http://dx.doi.org/10.1109/IPDPS.2015.50.
[24] M. Dreher, T. Peterka, Decaf: Decoupled dataflows for in situ high-performance workflows, 2017, http://dx.doi.org/10.2172/1372113.
[25] M. Dreher, K. Sasikumar, S. Sankaranarayanan, T. Peterka, Manala: A flexible flow control library for asynchronous task communication, in: 2017 IEEE International Conference on Cluster Computing, CLUSTER, 2017, pp. 509–519, http://dx.doi.org/10.1109/CLUSTER.2017.31.
[26] B. Dong, S. Byna, K. Wu, Prabhat, H. Johansen, J.N. Johnson, N. Keen, Data elevator: Low-contention data movement in hierarchical storage system, in: 2016 IEEE 23rd International Conference on High Performance Computing, HiPC, 2016, pp. 152–161, http://dx.doi.org/10.1109/HiPC.2016.026.
[27] J. Kunkel, E. Betke, An MPI-IO in-memory driver for non-volatile pooled memory of the Kove XPD, in: J.M. Kunkel, R. Yokota, M. Taufer, J. Shalf (Eds.), High Performance Computing, Springer International Publishing, Cham, 2017, pp. 679–690.
[28] F. Broquedis, J. Clet-Ortega, S. Moreaud, N. Furmento, B. Goglin, G. Mercier, S. Thibault, R. Namyst, hwloc: A generic framework for managing hardware affinities in HPC applications, in: Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP2010, IEEE Computer Society Press, Pisa, Italy, 2010, URL http://hal.inria.fr/inria-00429889.
[29] M.G. Venkata, F. Aderholdt, Z. Parchman, SharP: Towards programming extreme-scale systems with hierarchical heterogeneous memory, in: 2017 46th International Conference on Parallel Processing Workshops, ICPPW, 2017, pp. 145–154, http://dx.doi.org/10.1109/ICPPW.2017.32.
[30] M. Gilge, et al., IBM System Blue Gene Solution - Blue Gene/Q Application Development, IBM Redbooks, 2014.
[31] W. Gropp, MPICH2: A new start for MPI implementations, in: Proceedings of the 9th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, Springer-Verlag, London, UK, 2002, p. 7, URL http://dl.acm.org/citation.cfm?id=648139.749473.
[32] J. Liu, Q. Koziol, H. Tang, F. Tessier, W. Bhimji, B. Cook, B. Austin, S. Byna, B. Thakur, G. Lockwood, et al., Understanding the IO performance gap between Cori KNL and Haswell, in: Cray User Group Meeting, 2017.
[33] P. Schwan, Lustre: Building a file system for 1,000-node clusters, in: Proceedings of the Linux Symposium, 2003, p. 9.
[34] IOR: Parallel filesystem I/O benchmark, https://github.com/LLNL/ior.
[35] E.R. Hawkes, R. Sankaran, J.C. Sutherland, J.H. Chen, Direct numerical simulation of turbulent combustion: fundamental insights towards predictive models, J. Phys. Conf. Ser. 16 (1) (2005) 65, URL http://stacks.iop.org/1742-6596/16/i=1/a=009.
[36] D.D. Sharma, Compute Express Link®: An open industry-standard interconnect enabling heterogeneous data-centric computing, in: 2022 IEEE Symposium on High-Performance Interconnects, HOTI, 2022, pp. 5–12, http://dx.doi.org/10.1109/HOTI55740.2022.00017.
François Tessier has been a researcher at Inria Rennes since 2020. He works in the KerData team on I/O optimization and storage infrastructure modeling for large-scale systems. In 2015, he received a Ph.D. from the University of Bordeaux on affinity-aware process placement. Afterwards, he joined Argonne National Laboratory and then the Swiss National Supercomputing Center. During these five years, he explored data aggregation techniques for I/O improvement and dynamic provisioning of storage systems for complex workflows. Since 2023, he has been actively involved in NumPEx, which aims to prepare the software stack for France's first Exascale machine.

Venkatram Vishwanath is a computer scientist at Argonne National Laboratory. He is the Data Science Group Lead at the Argonne Leadership Computing Facility (ALCF). His current focus is on algorithms, system software, and workflows to facilitate data-centric applications on supercomputing systems. His interests include AI for Science applications, supercomputing architectures, parallel algorithms and runtimes, and collaborative workspaces. He has received best paper awards at venues including HPDC and LDAV, and won the 2022 ACM Gordon Bell prize for HPC innovations for COVID-19 research.

Emmanuel Jeannot is a senior research scientist at Inria. From 2000 to 2009, he worked in Nancy (Loria laboratory, then Inria). In 2006, he was a visiting researcher at the University of Tennessee, ICL laboratory. Since 2009, Emmanuel Jeannot has been conducting his research at Inria Bordeaux, where he leads the TADaaM team, and at the LaBRI laboratory of the University of Bordeaux. His primary research interests encompass the vast domain of parallel and high-performance computing, including runtime systems, process placement, scheduling for heterogeneous environments, I/O and storage, algorithms and models for parallel machines, and programming models.