

# The Compact Muon Solenoid Experiment

# **Conference Report**

Mailing address: CMS CERN, CH-1211 GENEVA 23, Switzerland



16 March 2023

# Implementation of generic SoA data structure in the CMS software

Eric Cano for the CMS Collaboration

#### **Abstract**

GPU applications require a structure of arrays (SoA) layout for the data to achieve good memory access performance. During the development of the CMS Pixel reconstruction for GPUs, the Patatrack developers crafted various techniques to optimise the data placement in memory and its access inside GPU kernels. The work presented here gathers, automates and extends those patterns, and offers a simplified and consistent programming interface. The work automates the creation of SoA structures, fulfilling technical requirements like cache line alignment, while optionally providing range checking as well as alignment and cache hints to the compiler. Protection of read-only products of the CMS software framework (CMSSW) is also ensured with constant versions of the SoA. A compact description of the SoA is provided to minimize the size of data passed to GPU kernels. Finally, the user interface is designed to be as simple as possible, providing AoS-like semantics that allow compact and readable notation in the code. The results of porting CMSSW to SoA will be presented, along with performance measurements.

Presented at ACAT2022, 21st International Workshop on Advanced Computing and Analysis Techniques in Physics Research

# Implementation of generic SoA data structures in the CMS software

#### Eric Cano and Andrea Bocci

CERN, European Organization for Nuclear Research, Meyrin, Switzerland

E-mail: eric.cano@cern.ch


## 1. Data layout in GPUs: Array of Structures vs Structure of Arrays

In comparison with CPUs, graphics processing units (GPUs) provide vast amounts of computing power by trading scheduling silicon real estate for arithmetic and logic unit (ALU) space, allowing many computations (on the order of thousands) to be executed in parallel. This trade-off is achieved with multiple ALUs executing the same instruction at the same time on their respective threads, in lockstep.

The width of this parallel execution, and its naming, varies from manufacturer to manufacturer: 8, 16 or 32 threads per wave for Intel, 32 or 64 threads per wavefront for AMD and 32 threads per warp for NVIDIA. As on CPUs, where the various cores share the memory subsystem, the GPU cores executing the lockstepped threads share the same memory controller and cache, and easily overload them when accessing data scattered over many cache lines. The spread of memory accesses should therefore be minimized by laying out the data adequately in memory. In many cases, memory access is the limiting factor for performance.

In a common scenario where each thread processes an instance of a structure, the usual strategy consists of reorganizing classic arrays of structures (AoS) into structures of arrays (SoA), where corresponding elements of successive structures are stored contiguously in cache-aligned columns, as illustrated in figure 1.

While the physical layout in memory is optimized for parallel processing, the per-thread logic remains that of an AoS, as shown in figure 2.





Figure 1. AoS vs SoA access patterns

```
struct AoS {
  static const size_t SIZE = 54;
  struct Element {
    double x, y, z;
    uint32_t id;
    Eigen::Matrix<double, 3, 6> m;
  };
  Element elements[SIZE];
  double r;
};
AoS aos;
const Eigen::Matrix<double, 3, 6> matrix{
  {1, 2, 3, 4, 5, 6},
  {2, 4, 6, 8, 10, 12},
  {3, 6, 9, 12, 15, 18}};
for (uint32_t i = 0; i < AoS::SIZE; i++) {
  if (i == 0)
    aos.r = 1.0;
  aos.elements[i] =
    { 0., 0., 0., i, matrix * i};
}
```

Figure 2. AoS C++ code
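
For comparison, a hand-written SoA counterpart of the AoS in figure 2 could look like the minimal sketch below. It is not the CMSSW implementation: the names `HandWrittenSoA` and `fill` are hypothetical, the Eigen member is left out to keep the example free of dependencies, and `std::vector` stands in for columns carved out of a single aligned buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical hand-written SoA equivalent of the AoS in figure 2
// (the Eigen member is omitted to keep the sketch dependency-free).
struct HandWrittenSoA {
  explicit HandWrittenSoA(std::size_t size) : x(size), y(size), z(size), id(size) {}

  std::size_t size() const { return x.size(); }

  // One contiguous column per former struct member.
  std::vector<double> x, y, z;
  std::vector<std::uint32_t> id;
  // Scalar: a single value for the whole structure.
  double r = 0.;
};

// Same per-element logic as in figure 2, but element i of a column is
// adjacent in memory to element i + 1 of the same column.
inline void fill(HandWrittenSoA& soa) {
  soa.r = 1.0;
  for (std::size_t i = 0; i < soa.size(); ++i) {
    soa.x[i] = soa.y[i] = soa.z[i] = 0.;
    soa.id[i] = static_cast<std::uint32_t>(i);
  }
}
```

Writing such a structure by hand for every data type is what led to the ad-hoc variants described in the next section.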

## 2. Pre-existing implementations of SoA in CMSSW

Prior to the work described here, SoAs were already in place in the CMSSW code in multiple, ad-hoc implementations. Some had sizes defined at compile time, while others were sized at run time; some needed multiple memory allocations for the same SoA instead of one, and consequently required multiple memory transfers from device to host. The primitives hinting the cache type to the compiler were also inserted directly in the client code.

## 3. Generic SoA and managing class hierarchy

The generic SoA described here automates the definition and implementation of runtime-sized SoAs, and automatically generates a hierarchy of classes that handle different aspects of the SoA. Layouts divide a memory buffer into runtime-sized columns, while Views provide the interface to the data. The latter are the lightweight structures passed to kernels; they are limited to one pointer per column and some sizes. Buffers can be allocated in host memory, pinned host memory or device memory. The Layout is agnostic to the memory type and subdivides any kind of Buffer in the same way.

Data transfers from host to device and vice versa are implemented as full Buffer copies. On top of this hierarchy of classes stand the PortableCollections: they handle the allocation of buffers of the proper size and the initialization of the Layout on top of them. The host flavor of PortableCollection also manages the serialization and deserialization of data between memory and ROOT files.

## 4. Technical implementation

The generic SoA implementation targets a notation as concise and readable as possible, keeping the AoS syntax style, as it better represents the problem. Unlike Python, C++ is statically typed: code generation has to happen before compile time. Generic SoAs are therefore implemented as macros leveraging the Boost::PP [1] package. An example of SoA layout declaration is shown in figure 3, and the access syntax in figure 4. Notably, the AoS style is preserved, including for list initialization, with the exception of the use of operator() to access the logical members of the structure. The logical rows of the SoA are accessed with operator[], and can be stored in auto variables to allow concise notation, as illustrated in figure 4.

```
namespace portabletest {
  // this typedef is needed because commas
  // confuse macros
  using Matrix = Eigen::Matrix<double, 3, 6>;
  // SoA layout with x, y, z, id, m fields
  GENERATE_SOA_LAYOUT(TestSoALayout,
    // columns: one value per element
    SOA_COLUMN(double, x),
    SOA_COLUMN(double, y),
    SOA_COLUMN(double, z),
    SOA_COLUMN(int32_t, id),
    // scalars: one value for the
    // whole structure
    SOA_SCALAR(double, r),
    // Eigen columns
    SOA_EIGEN_COLUMN(Matrix, m))
  using TestSoA = TestSoALayout<>;
}  // namespace portabletest
```

Figure 3. SoA declaration
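
To illustrate how a single field list can be expanded by the preprocessor into storage, constructor and accessors, here is a simplified sketch based on plain X-macros. This is an assumption made for illustration only: the CMSSW macros rely on Boost.PP and generate Layouts and Views rather than this toy class, and `DEMO_FIELDS`, `DemoSoA` and the helper macros are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical field list: each entry is (type, name).
#define DEMO_FIELDS(X) \
  X(double, x)         \
  X(double, y)         \
  X(int, id)

class DemoSoA {
public:
  // The list is expanded once to size every column ...
  explicit DemoSoA(std::size_t n)
      : size_(n)
#define DEMO_INIT(TYPE, NAME) , NAME##_(n)
        DEMO_FIELDS(DEMO_INIT)
#undef DEMO_INIT
  {
  }

  // ... once to generate the element accessors ...
#define DEMO_ACCESSOR(TYPE, NAME)                  \
  TYPE& NAME(std::size_t i) { return NAME##_[i]; } \
  const TYPE& NAME(std::size_t i) const { return NAME##_[i]; }
  DEMO_FIELDS(DEMO_ACCESSOR)
#undef DEMO_ACCESSOR

  std::size_t size() const { return size_; }

private:
  std::size_t size_;
  // ... and once to declare the column storage.
#define DEMO_MEMBER(TYPE, NAME) std::vector<TYPE> NAME##_;
  DEMO_FIELDS(DEMO_MEMBER)
#undef DEMO_MEMBER
};
```

Adding a field to `DEMO_FIELDS` updates the storage, the constructor and the accessors together, which is the property the generated SoA classes rely on.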

```
static __global__ void testAlgoKernel(
    portabletest::TestSoA::View view,
    int32_t size) {
  const int32_t thread = blockIdx.x *
    blockDim.x + threadIdx.x;
  const int32_t stride = blockDim.x *
    gridDim.x;
  const portabletest::Matrix
    matrix{{1, 2, 3, 4, 5, 6},
           {2, 4, 6, 8, 10, 12},
           {3, 6, 9, 12, 15, 18}};
  if (thread == 0) {
    view.r() = 1.;
  }
  for (auto i = thread; i < size;
       i += stride) {
    view[i] = {0., 0., 0., i, matrix * i};
    // Alternate ways to access the rows,
    // member by member
    auto vi = view[i];
    vi.x() = vi.y() = view[i].z() = 0.;
  }
}
```

Figure 4. SoA use in a CUDA kernel

The generic SoA supports three types of elements: numeric columns, scalars and Eigen [2] columns. The numeric columns are targeted at numeric types, but could functionally accommodate other classes, with potential performance side effects. Those columns contain as many elements as the SoA does. The scalars hold a single element per SoA and are not available via row access. The last type, Eigen columns, can hold vectors and matrices, with one numeric column per vector or matrix entry, as illustrated in figure 5.



Figure 5. Buffer with SoA Layout and View
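
The "one numeric column per matrix entry" arrangement can be sketched without Eigen as follows. This is a minimal illustration under stated assumptions: `MatrixColumn`, `at` and `index` are hypothetical names, the column-major ordering of the entries is assumed, and `std::vector` stands in for a slice of the Buffer.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical strided storage holding one Rows x Cols matrix per element:
// every matrix entry gets its own sub-column of `stride` values.
template <std::size_t Rows, std::size_t Cols>
class MatrixColumn {
public:
  // `stride` is the padded sub-column length, at least the number of elements.
  explicit MatrixColumn(std::size_t stride) : stride_(stride), data_(Rows * Cols * stride, 0.) {}

  // Position of entry (row, col) of the matrix belonging to element i.
  // Column-major ordering of the entries is an assumption of this sketch.
  std::size_t index(std::size_t i, std::size_t row, std::size_t col) const {
    return (col * Rows + row) * stride_ + i;
  }

  double& at(std::size_t i, std::size_t row, std::size_t col) { return data_[index(i, row, col)]; }

private:
  std::size_t stride_;
  std::vector<double> data_;
};
```

In this sketch the same entry of two consecutive elements is one value apart, while two consecutive entries of the same element are `stride` values apart, so threads working on neighbouring elements read neighbouring addresses.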

#### 4.1. Layout class: memory management

The necessary Buffer size can be computed by a static helper function of the Layout class. The layout of the columns inside the buffer is computed by the constructor of the Layout class: it divides the Buffer by computing the column addresses, adding padding at the end of columns where necessary to ensure cache line alignment. Additionally, the stride length and total size are computed for Eigen columns, to be used by Eigen itself and by serialization, respectively.
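
The padding computation described above can be sketched as follows. The 128-byte alignment, the column set and the names `kAlignment`, `alignUp`, `DemoLayout` and `computeLayout` are assumptions made for illustration, not the CMSSW implementation.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed alignment in bytes; the real value depends on the target hardware.
constexpr std::size_t kAlignment = 128;

// Round a byte count up to the next multiple of the alignment.
constexpr std::size_t alignUp(std::size_t bytes, std::size_t alignment = kAlignment) {
  return (bytes + alignment - 1) / alignment * alignment;
}

// Byte offsets of each column inside a single buffer (hypothetical column set).
struct DemoLayout {
  std::size_t xOffset;
  std::size_t yOffset;
  std::size_t idOffset;
  std::size_t rOffset;
  std::size_t totalBytes;
};

// Place the columns back to back, padding the end of each one so that the
// next column starts on an aligned offset (the buffer itself must be aligned).
inline DemoLayout computeLayout(std::size_t elements) {
  DemoLayout layout{};
  std::size_t offset = 0;
  layout.xOffset = offset;
  offset += alignUp(elements * sizeof(double));
  layout.yOffset = offset;
  offset += alignUp(elements * sizeof(double));
  layout.idOffset = offset;
  offset += alignUp(elements * sizeof(std::int32_t));
  layout.rOffset = offset;  // scalar: one value for the whole structure
  offset += alignUp(sizeof(double));
  layout.totalBytes = offset;
  return layout;
}
```

With 10 elements, each column fits in a single 128-byte line, so the columns start at byte offsets 0, 128 and 256, the scalar at 384, and the buffer is 512 bytes long.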

#### 4.2. View class: data access

The View class is designed to contain the minimal set of variables needed for data access, and the corresponding functions. This minimal memory footprint ensures efficient kernel launches. The View contains the size of the columns, one pointer to each column or scalar, plus the stride of the Eigen columns, avoiding any size computation on the kernel side.

The View, the interface to the data, provides a logical row accessor in the form of operator[]. The row object provides accessors to the column components for the selected row. operator[] optionally range checks indices. The accessors to scalar components are direct members of the View. Likewise, extra column accessors provide pointers to each column component.
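
A minimal sketch of such a view and its row proxy is given below. It is not the generated CMSSW class: `DemoView`, its two columns and its single scalar are hypothetical, and range checking is left out for brevity.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical lightweight view: column pointers plus a size, cheap to pass by value.
class DemoView {
public:
  // Proxy for one logical row; it holds no data of its own.
  class Row {
  public:
    Row(const DemoView& view, std::size_t i) : view_(view), i_(i) {}
    double& x() const { return view_.x_[i_]; }
    std::int32_t& id() const { return view_.id_[i_]; }

  private:
    const DemoView& view_;
    std::size_t i_;
  };

  DemoView(double* x, std::int32_t* id, double* r, std::size_t size)
      : x_(x), id_(id), r_(r), size_(size) {}

  // Logical row accessor.
  Row operator[](std::size_t i) const { return Row(*this, i); }

  // Scalar accessor: a direct member of the view.
  double& r() const { return *r_; }

  // Column accessor: pointer to the start of the column.
  double* x() const { return x_; }

  std::size_t size() const { return size_; }

private:
  double* x_;
  std::int32_t* id_;
  double* r_;
  std::size_t size_;
};
```

Client code can then write `view[i].x() = 0.;`, which mirrors the AoS notation while touching only column `x`.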

A const variant of the class is available, ensuring that products consumed by CMSSW modules remain immutable. This class is distinct, completely forbidding write access even via const casting.
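
The idea of a distinct read-only class can be sketched as follows. `DemoConstView` is a hypothetical name and the single column is an assumption made for brevity.

```cpp
#include <cstddef>

// Hypothetical read-only view: a separate class, not a `const`-qualified mutable view.
class DemoConstView {
public:
  DemoConstView(const double* x, std::size_t size) : x_(x), size_(size) {}

  // Only a by-value read accessor exists: there is no mutable overload
  // that a const_cast on the view object could expose.
  double x(std::size_t i) const { return x_[i]; }

  std::size_t size() const { return size_; }

private:
  const double* x_;
  std::size_t size_;
};
```

Because the class has no accessor returning a mutable reference or pointer, an expression such as `constView.x(0) = 1.;` does not compile, whatever the constness of the view object itself.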

Another constructor variant allows per-column initialization, bypassing buffer splitting and providing the SoA interface over existing memory without allocating a buffer, for example during porting.

The View classes can be defined with any set of components from multiple Layouts and Views. Nevertheless, the most common use of the View is the trivial one.

The constructor of the View can optionally validate column alignment. All non-Eigen accessors provide optional compiler cache hinting, ensuring use of the non-coherent cache on NVIDIA GPUs and similar optimizations in other environments.

#### 4.3. Cache hinting, range checking and other tunable behaviors

Optional behaviors of Layouts and Views are selected at compilation time as template parameters. The defaults are usually the right choice and the most common setup. The parameters and their defaults are shown in figure 6.

Figure 6. SoA template parameters and defaults
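
Selecting behaviors through template parameters can be sketched as below. The parameter names, the `RangeChecking` enumeration, the `CheckedColumn` class and the default values are assumptions made for illustration and do not reproduce the actual parameters of figure 6.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical compile-time option; name and values are assumptions.
enum class RangeChecking : bool { disabled = false, enabled = true };

template <std::size_t Alignment = 128, RangeChecking Check = RangeChecking::disabled>
class CheckedColumn {
public:
  static constexpr std::size_t alignment = Alignment;

  CheckedColumn(double* data, std::size_t size) : data_(data), size_(size) {}

  double& operator[](std::size_t i) const {
    // `Check` is a compile-time constant: when checking is disabled the
    // condition is always false and the compiler can drop the test.
    if (Check == RangeChecking::enabled && i >= size_)
      throw std::out_of_range("CheckedColumn index out of range");
    return data_[i];
  }

private:
  double* data_;
  std::size_t size_;
};
```

A debugging build could instantiate `CheckedColumn<128, RangeChecking::enabled>` while production code keeps `CheckedColumn<>`, with no change to the code using the column.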

## 5. Portable collections: buffer management

PortableCollections are templates parametrized on Layouts. They manage the allocation of Buffers and the creation of Layouts and Views. As CMS is currently moving from CUDA to Alpaka [3, 4, 5], PortableCollections exist in multiple flavors: for both CUDA and Alpaka, with host-side and device-side variants in each case. All versions handle the allocation of the corresponding buffers. The PortableCollection then provides access to the buffers for data transfers between host and device and to the views for data access, and, in the host case, can be serialized to and from ROOT files [6].

The ROOT data files generated from the PortableCollections can be readily used in bare ROOT, but the default memory layout used by ROOT when reading them back is not appropriate for use with GPUs. Proper ROOT serialization is achieved with automatically generated functions that ensure proper memory allocation and data placement at read time.

```
using TestDeviceCollection = cms::cuda::PortableDeviceCollection<portabletest::TestSoA>;
TestDeviceCollection deviceProduct(size_, ctx.stream());
testAlgoKernel(deviceProduct.view(), deviceProduct->metadata().size());
cudatest::TestHostCollection hostProduct{size_, ctx.stream()};
cms::cuda::copyAsync(hostProduct.buffer(), deviceProduct.const_buffer(),
    deviceProduct.bufferSize(), ctx.stream());
```

Figure 7. Instantiation and device-to-host copy of a portable collection

The streamer of the PortableCollection allocates the memory and delegates the copying of the data from the ROOT file into the memory columns to the Layout streamer.

Figure 8. Layout and streamer declaration for ROOT streaming

## 6. Conclusion, status and further developments

So far, SiPixelDigis and SiPixelClusters have been ported to SoA. Systematic use of a generic SoA reduced the number of memory allocations and simplified the code. The previously scattered SoA knowledge is now consolidated in a single package and its use automated. Some manual XML descriptions of the data structures are still necessary; automation of this step is under development.

Multi-layout data collections are also in the works. They will allow keeping sets of related data of different sizes in the same product, like tracks and hits cross-referenced by index. Sub-buffer, column-level access is also being investigated to optimize some use cases.

## 7. References

- [1] The Boost Library Preprocessor Subset for C/C++. https://www.boost.org/doc/libs/1_67_0/libs/preprocessor/doc/index.html.
- [2] Gaël Guennebaud, Benoît Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2010.
- [3] Benjamin Worpitz. Investigating performance portability of a highly scalable particle-in-cell simulation code on various multi-core architectures, Sep 2015.
- [4] Erik Zenker, Benjamin Worpitz, René Widera, Axel Huebl, Guido Juckeland, Andreas Knüpfer, Wolfgang E. Nagel, and Michael Bussmann. Alpaka - an abstraction library for parallel kernel acceleration. IEEE Computer Society, May 2016.
- [5] A. Matthes, R. Widera, E. Zenker, B. Worpitz, A. Huebl, and M. Bussmann. Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the alpaka library. Jun 2017.
- [6] ROOT Data analysis framework. https://root.cern.ch//.