LLAMA: The Low Level Abstraction For Memory Access - CERN Document Server

Article
Report number	arXiv:2106.04284
Title	LLAMA: The Low Level Abstraction For Memory Access
Author(s)	Gruber, Bernhard Manfred (CERN ; Dresden, Tech. U. ; CASUS) ; Amadio, Guilherme (CERN) ; Blomer, Jakob (CERN) ; Matthes, Alexander (HZDR, Dresden) ; Widera, René (HZDR, Dresden) ; Bussmann, Michael (CASUS)
Publication	2023-01-01
Imprint	2021-06-08
Number of pages	39
Note	39 pages, 10 figures, 11 listings
In:	Softw Pract Exper. 2022 (2022) 1-27
DOI	10.1002/spe.3077 (publication)
Subject category	Computing and Computers ; cs.PF
Abstract	The performance gap between CPU and memory widens continuously. Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is ideally decoupled from the rest of a program. This can be accomplished via a zero-runtime-overhead abstraction layer, underneath which memory layouts can be freely exchanged. We present the Low-Level Abstraction of Memory Access (LLAMA), a C++ library that provides such a data structure abstraction layer with example implementations for multidimensional arrays of nested, structured data. LLAMA provides fully C++ compliant methods for defining and switching custom memory layouts for user-defined data types. The library is extensible with third-party allocators. Providing two close-to-life examples, we show that the LLAMA-generated AoS (Array of Structs) and SoA (Struct of Arrays) layouts produce identical code with the same performance characteristics as manually written data structures. Integrations into the SPEC CPU® lbm benchmark and the particle-in-cell simulation PIConGPU demonstrate LLAMA's abilities in real-world applications. LLAMA's layout-aware copy routines can significantly speed up transfer and reshuffling of data between layouts compared with naive element-wise copying. LLAMA provides a novel tool for the development of high-performance C++ applications in a heterogeneous environment.
Copyright/License	publication: (License: CC-BY-4.0) preprint: (License: CC BY 4.0)

Corresponding record in: Inspire

Record created 2021-06-12, last modified 2024-12-05

00008, 00009, 00010, 00011 Throughput comparison of several copy implementations between different mappings. The particle record dimension consists of 7 floats. The event record dimension consists of the first 100 int32s, int64s, floats, bytes and bools as they occur in an internal event dataset from the CMS detector at CERN (cf. the example on LLAMA's GitHub repository for the full definition). The (p) versions are parallel. The (r) versions read contiguously, whereas the (w) versions write contiguously. For the field-wise naive and std::copy, the throughput depends a lot on the field types of the record dimension. The aosoa_copy outperforms both of them in most cases, single- and multithreaded.

00000 Conceptual overview of LLAMA.

00001 Overview of the C++ library components of LLAMA.

00007 : Caption not extracted

00004 : Move with 256Mi particles

00005 : Caption not extracted

00002 Concept of LLAMA mappings. A mapping defines how a record and array dimension index tuple is translated into a blob number and offset.

00006 : Move with 256Mi particles

00003 : Caption not extracted
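
The mapping concept in the caption of figure 00002 (an array index plus a record field is translated into a blob number and an offset) can be illustrated with a small stand-alone C++ sketch. It does not use LLAMA's actual API: every name below (BlobOffset, AoSMapping, SoAMapping, View, fieldWiseCopy) is hypothetical and chosen for this illustration only. The record is assumed to hold 7 floats, as in the particle record of the copy benchmark captions, and the array dimension is assumed to be one-dimensional.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical names throughout; this is an illustration, not LLAMA's API.

// Where a value lives: which blob (allocation) and at which byte offset.
struct BlobOffset {
    std::size_t blob;
    std::size_t offset;
};

// Assumed record dimension: 7 floats per record.
constexpr std::size_t fieldCount = 7;

// Array of Structs: a single blob, records stored back to back.
struct AoSMapping {
    std::size_t count; // number of records (1D array dimension)
    static constexpr std::size_t blobCount = 1;
    std::size_t blobSize(std::size_t) const { return count * fieldCount * sizeof(float); }
    BlobOffset operator()(std::size_t i, std::size_t field) const {
        return {0, (i * fieldCount + field) * sizeof(float)};
    }
};

// Struct of Arrays: one blob per field, each a contiguous array of floats.
struct SoAMapping {
    std::size_t count;
    static constexpr std::size_t blobCount = fieldCount;
    std::size_t blobSize(std::size_t) const { return count * sizeof(float); }
    BlobOffset operator()(std::size_t i, std::size_t field) const {
        return {field, i * sizeof(float)};
    }
};

// A view pairs a mapping with the blobs that the mapping addresses.
template <typename Mapping>
struct View {
    Mapping mapping;
    std::vector<std::vector<unsigned char>> blobs;

    explicit View(Mapping m) : mapping(m) {
        for (std::size_t b = 0; b < Mapping::blobCount; ++b)
            blobs.emplace_back(mapping.blobSize(b)); // zero-initialized bytes
    }

    float get(std::size_t i, std::size_t field) const {
        const BlobOffset bo = mapping(i, field);
        float v;
        std::memcpy(&v, blobs[bo.blob].data() + bo.offset, sizeof v);
        return v;
    }

    void set(std::size_t i, std::size_t field, float v) {
        const BlobOffset bo = mapping(i, field);
        std::memcpy(blobs[bo.blob].data() + bo.offset, &v, sizeof v);
    }
};

// Naive field-wise copy: valid for any pair of mappings, but it moves one
// value at a time and takes no advantage of either layout.
template <typename SrcMapping, typename DstMapping>
void fieldWiseCopy(const View<SrcMapping>& src, View<DstMapping>& dst) {
    for (std::size_t i = 0; i < src.mapping.count; ++i)
        for (std::size_t f = 0; f < fieldCount; ++f)
            dst.set(i, f, src.get(i, f));
}
```

In this sketch the code that reads and writes values only ever calls get and set, so exchanging AoSMapping for SoAMapping changes the memory layout without touching the calling code. The fieldWiseCopy function corresponds in spirit to the naive element-wise baseline mentioned in the abstract; a layout-aware routine would instead copy larger contiguous chunks where the two mappings allow it.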
