
Article
Report number arXiv:2410.14239
Title Parallel Writing of Nested Data in Columnar Formats
Author(s) Hahnfeld, Jonas (CERN ; Frankfurt U.) ; Blomer, Jakob (CERN) ; Kollegger, Thorsten (Frankfurt U.)
Publication 2024-08-26
Imprint 2024-10-18
Number of pages 14
In: Lect. Notes Comput. Sci. 14802 (2024) 18-31
In: 30th European Conference on Parallel and Distributed Processing (Euro-Par 2024), Madrid, Spain, 26 - 30 Aug 2024, pp.18-31
DOI 10.1007/978-3-031-69766-1_2 (publication)
Subject category cs.DC ; Computing and Computers
Abstract High Energy Physics (HEP) experiments, for example at the Large Hadron Collider (LHC) at CERN, store data at exabyte scale in sets of files. They use a binary columnar data format provided by the ROOT framework, which also transparently compresses the data. In this format, cells are not necessarily atomic; they may contain nested collections of variable size. Because row and block sizes are not known upfront, efficient parallel writing is challenging to implement. In particular, the data cannot be organized in a regular grid for which indices and offsets can be precomputed for independent writing. In this paper, we propose a scalable approach to efficient multithreaded writing of nested data in columnar format into a single file. Our approach removes the bottleneck of a single writer while staying fully compatible with the compressed, columnar, variably row-sized data representation. We discuss our design choices and the implementation of scalable parallel writing for ROOT's RNTuple format. An evaluation of our approach on a synthetic benchmark shows perfect scalability, limited only by storage bandwidth. Finally, we evaluate the benefits for a real-world application, dataset skimming.
Copyright/License preprint: (License: CC BY 4.0)
publication: © 2024-2025 The Author(s) (exclusive license to Springer Nature Switzerland AG)



Corresponding record in: Inspire
Record created 2024-12-10, last modified 2025-01-20

