
Article
Report number arXiv:2106.04284
Title LLAMA: The Low-Level Abstraction For Memory Access
Author(s) Gruber, Bernhard Manfred (CERN ; Dresden, Tech. U. ; CASUS) ; Amadio, Guilherme (CERN) ; Blomer, Jakob (CERN) ; Matthes, Alexander (HZDR, Dresden) ; Widera, René (HZDR, Dresden) ; Bussmann, Michael (CASUS)
Publication 2023-01-01
Imprint 2021-06-08
Number of pages 39
Note 39 pages, 10 figures, 11 listings
In: Softw Pract Exper. 2022 (2022) 1-27
DOI 10.1002/spe.3077 (publication)
Subject category Computing and Computers ; cs.PF
Abstract The performance gap between CPU and memory widens continuously. Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is ideally decoupled from the rest of a program. This can be accomplished via a zero-runtime-overhead abstraction layer, underneath which memory layouts can be freely exchanged. We present the Low-Level Abstraction of Memory Access (LLAMA), a C++ library that provides such a data structure abstraction layer with example implementations for multidimensional arrays of nested, structured data. LLAMA provides fully C++ compliant methods for defining and switching custom memory layouts for user-defined data types. The library is extensible with third-party allocators. Providing two close-to-life examples, we show that the LLAMA-generated AoS (Array of Structs) and SoA (Struct of Arrays) layouts produce identical code with the same performance characteristics as manually written data structures. Integrations into the SPEC CPU® lbm benchmark and the particle-in-cell simulation PIConGPU demonstrate LLAMA's abilities in real-world applications. LLAMA's layout-aware copy routines can significantly speed up transfer and reshuffling of data between layouts compared with naive element-wise copying. LLAMA provides a novel tool for the development of high-performance C++ applications in a heterogeneous environment.
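The idea of exchanging the memory layout underneath unchanged program code can be illustrated with a minimal C++ sketch. It does not use LLAMA's API: the names ParticleStore, AoSLayout, SoALayout and total_mass, as well as the three-field record, are assumptions made up for this illustration. The layout is a compile-time policy, so the kernel is written once and instantiated for either layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical three-field particle record (illustration only, not LLAMA's API).
enum Field : std::size_t { X = 0, Y = 1, Mass = 2, FieldCount = 3 };

// Array of Structs: all fields of one record are adjacent in memory.
struct AoSLayout {
    static std::size_t index(std::size_t record, std::size_t field, std::size_t /*count*/) {
        return record * FieldCount + field;
    }
};

// Struct of Arrays: all values of one field are adjacent in memory.
struct SoALayout {
    static std::size_t index(std::size_t record, std::size_t field, std::size_t count) {
        return field * count + record;
    }
};

// Container whose memory layout is chosen by a compile-time policy.
template <typename Layout>
class ParticleStore {
public:
    explicit ParticleStore(std::size_t count)
        : count_(count), data_(count * FieldCount, 0.0f) {}

    // Access one field of one record; the layout decides where it lives.
    float& at(std::size_t record, Field field) {
        assert(record < count_);
        return data_[Layout::index(record, field, count_)];
    }

    std::size_t size() const { return count_; }
    const std::vector<float>& raw() const { return data_; }

private:
    std::size_t count_;
    std::vector<float> data_;
};

// Layout-agnostic kernel: the same source works for any store type.
template <typename Store>
float total_mass(Store& store) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < store.size(); ++i) {
        sum += store.at(i, Mass);
    }
    return sum;
}
```

Switching from ParticleStore<AoSLayout> to ParticleStore<SoALayout> changes only the type at the point of declaration; total_mass stays untouched. Because the index computation is resolved at compile time, a compiler can typically inline it, which is the kind of zero-runtime-overhead abstraction the abstract refers to.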
Copyright/License publication: (License: CC-BY-4.0)
preprint: (License: CC BY 4.0)



Corresponding record in: Inspire
 Record created 2021-06-12, last modified 2024-12-05


Figures:
Throughput comparison of several copy implementations between different mappings. The particle record dimension consists of 7 floats. The event record dimension consists of the first 100 int32s, int64s, floats, bytes and bools as they occur in an internal event dataset from the CMS detector at CERN (cf. the example on LLAMA's GitHub repository for the full definition). The (p) versions are parallel. The (r) versions read contiguously, whereas the (w) versions write contiguously. For the field-wise naive copy and std::copy, the throughput depends strongly on the field types of the record dimension. aosoa_copy outperforms both of them in most cases, single- and multithreaded.
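The distinction between reading contiguously and writing contiguously can be made concrete with a small C++ sketch of an element-wise copy from an AoS buffer to an SoA buffer. This is not LLAMA's aosoa_copy and makes no claim about how it works; the function names and the flat float buffers are assumptions for illustration. Only the 7-float record comes from the caption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The particle record in the caption has 7 floats.
constexpr std::size_t kFields = 7;

// AoS index: record-major, fields of one record are adjacent.
inline std::size_t aos_index(std::size_t record, std::size_t field) {
    return record * kFields + field;
}

// SoA index: field-major, values of one field are adjacent.
inline std::size_t soa_index(std::size_t record, std::size_t field, std::size_t count) {
    return field * count + record;
}

// Walks the AoS source in memory order; writes into the SoA target are strided.
inline void copy_aos_to_soa_read_contiguous(const std::vector<float>& aos,
                                            std::vector<float>& soa) {
    assert(aos.size() == soa.size() && aos.size() % kFields == 0);
    const std::size_t count = aos.size() / kFields;
    for (std::size_t r = 0; r < count; ++r) {
        for (std::size_t f = 0; f < kFields; ++f) {
            soa[soa_index(r, f, count)] = aos[aos_index(r, f)];
        }
    }
}

// Walks the SoA target in memory order; reads from the AoS source are strided.
inline void copy_aos_to_soa_write_contiguous(const std::vector<float>& aos,
                                             std::vector<float>& soa) {
    assert(aos.size() == soa.size() && aos.size() % kFields == 0);
    const std::size_t count = aos.size() / kFields;
    for (std::size_t f = 0; f < kFields; ++f) {
        for (std::size_t r = 0; r < count; ++r) {
            soa[soa_index(r, f, count)] = aos[aos_index(r, f)];
        }
    }
}
```

Both functions produce the same result and differ only in loop order. Which order is faster depends on the hardware and on the record size, so it has to be measured rather than assumed.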
Conceptual overview of LLAMA.
Overview of the C++ library components of LLAMA.
Move with 256Mi particles
Concept of LLAMA mappings. A mapping defines how a record and array dimension index tuple is translated into a blob number and offset.
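A mapping in this sense can be sketched as a function from an array index and a field index to a blob number and a byte offset. The C++ below is a hedged illustration, not LLAMA's interface: BlobLocation, map_aos, map_soa_multi_blob and the three field sizes are hypothetical, the array dimension is reduced to one index, and alignment padding is ignored.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical record with three fields of 4, 8 and 1 bytes
// (for example a float, a double and a bool).
constexpr std::array<std::size_t, 3> field_sizes{4, 8, 1};

// Result of a mapping: which blob (memory buffer) and where inside it.
struct BlobLocation {
    std::size_t blob;    // blob number
    std::size_t offset;  // byte offset inside that blob
};

// Packed AoS in a single blob: records lie back to back, fields in
// declaration order. Assumption: no alignment padding between fields.
inline BlobLocation map_aos(std::size_t array_index, std::size_t field) {
    assert(field < field_sizes.size());
    std::size_t record_size = 0;
    std::size_t field_offset = 0;
    for (std::size_t f = 0; f < field_sizes.size(); ++f) {
        if (f < field) {
            field_offset += field_sizes[f];  // bytes of the preceding fields
        }
        record_size += field_sizes[f];
    }
    return {0, array_index * record_size + field_offset};
}

// SoA with one blob per field: the field selects the blob, the array
// index selects the element inside it.
inline BlobLocation map_soa_multi_blob(std::size_t array_index, std::size_t field) {
    assert(field < field_sizes.size());
    return {field, array_index * field_sizes[field]};
}
```

For array index 2 and field 1, map_aos yields blob 0 at byte offset 30 (two 13-byte records plus the 4-byte first field), while map_soa_multi_blob yields blob 1 at byte offset 16. Code that goes through such a function never needs to know which of the two is in use.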