Leveraging the Run 3 experience for the evolution of the ATLAS software-based readout towards HL-LHC

On behalf of the ATLAS TDAQ Collaboration

Serguei Kolos



University of California, Irvine



The HL-LHC objective is to increase the integrated luminosity by a factor of 10 beyond the LHC's design value.

CHEP 2024 Conference. Image source: https://home.cern/resources/image/accelerators/high-luminosity-lhc-gallery

## The Evolution of ATLAS

- Most of the legacy systems will be completely replaced with new ones
- Some systems have been partially upgraded during LS2

[Figure: ATLAS upgrades for HL-LHC. Labels: New Tracking system (all-Si), New Small Wheel, New Muon detector, New Timing detector (High-Granularity), New Luminosity detector, Liquid Argon Calorimeter Trigger, New Trigger and DAQ, **Readout System**. Timeline 2021-2033: LS2, Run 3, Long Shutdown 3 (LS3), HL-LHC Run 4.]

#### The ATLAS Readout System Evolution



## Front-End Link eXchange





- Custom PCIe card installed in commodity servers
- Receive data from the FE electronics via optical fibers:
  - Up to 48 input links per card
  - A physical link can be split into multiple virtual channels, called E-Links
- FELIX SW routes data to a high-speed commercial network using the RoCE\* protocol

\*Remote Direct Memory Access over Converged Ethernet

#### SoftWare based Read-Out Driver



- SW processes running on commodity PCs
  - Aggregate data packets from FELIX E-Links into larger fragments
  - Buffer aggregated fragments until they are requested by the HLT
  - Release the fragments after they have been consumed by the HLT
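
The aggregate, buffer and release cycle listed above can be illustrated with a minimal single-threaded sketch. All names here (`FragmentBuffer`, `EventId`, `addPacket`, `request`, `release`) are hypothetical and chosen for illustration only; they are not the SW ROD API, and a production version would need thread safety and bounded memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical types for illustration; not the actual SW ROD interfaces.
using EventId = std::uint32_t;
using FragmentData = std::vector<std::uint8_t>;

class FragmentBuffer {
public:
    // Aggregate: append one E-Link packet to the fragment of the given event.
    void addPacket(EventId id, const std::vector<std::uint8_t>& packet) {
        FragmentData& f = fragments_[id];
        f.insert(f.end(), packet.begin(), packet.end());
    }

    // Buffer: serve an HLT request with a copy of the fragment, if present.
    std::optional<FragmentData> request(EventId id) const {
        const auto it = fragments_.find(id);
        if (it == fragments_.end()) return std::nullopt;
        return it->second;
    }

    // Release: drop the fragment once the HLT has consumed it.
    // Returns true if the fragment was buffered.
    bool release(EventId id) { return fragments_.erase(id) > 0; }

    std::size_t size() const { return fragments_.size(); }

private:
    std::unordered_map<EventId, FragmentData> fragments_;
};

// Example: two packets of event 7 are aggregated, requested, then released.
inline std::size_t exampleFragmentSize() {
    FragmentBuffer buf;
    buf.addPacket(7, {0x01, 0x02});
    buf.addPacket(7, {0x03});
    const auto frag = buf.request(7);
    assert(frag.has_value());
    const std::size_t n = frag->size();
    buf.release(7);
    return n;
}
```
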

#### Data Rate and Throughput Challenges for HL-LHC



#### SW ROD Data Handling



- For some detectors, FE electronics aggregate data from individual channels to larger fragments
- For others, FE electronics send data for every channel separately:
  - Aggregation must be done by SW ROD
- The final Event Building is done before passing data to the High-Level Trigger

## The Fragment Building Challenge

- Data packets from O(100) E-Links must be aggregated into one fragment
  - Total packet rate O(100) MHz
- A single 3 GHz CPU core would offer O(10) cycles per data packet
- Requires extensive and efficient use of multithreading
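
The cycle budget quoted above follows from a simple division. As a worked estimate, taking 100 MHz as a representative value for the O(100) MHz packet rate (an assumption made only for this illustration):

$$
\frac{3\times10^{9}\ \text{cycles/s}}{1\times10^{8}\ \text{packets/s}} = 30\ \text{cycles per packet}
$$

This is of order 10, which leaves room for only a handful of instructions per packet on a single core.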



#### The SW ROD Fragment Builder Algorithm



- A fully built fragment consists of **N** memory blocks
  - **N** is the number of Data receiving threads
- The number of allocated memory blocks is proportional to **N**
- Aggregation performance scales well with the number of Data Receiving threads
- The first implementation performed much worse than expected
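
One way to picture a fragment made of **N** blocks is sketched below, assuming that each data receiving thread owns exactly one block and that an atomic counter signals completion. The class name `MultiBlockFragment` and its methods are hypothetical; the actual SW ROD algorithm is not reproduced here.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout for illustration; not the actual SW ROD classes.
// Each of the N data receiving threads writes only to its own block, so the
// payload needs no lock; only the completion counter is shared.
class MultiBlockFragment {
public:
    explicit MultiBlockFragment(std::size_t nThreads)
        : blocks_(nThreads), pending_(nThreads) {}

    // Called by receiving thread `threadIdx` for each packet it receives.
    void append(std::size_t threadIdx, const std::uint8_t* data, std::size_t len) {
        assert(threadIdx < blocks_.size());
        blocks_[threadIdx].insert(blocks_[threadIdx].end(), data, data + len);
    }

    // Called once by each thread when its block is complete.
    // Returns true only for the call that completes the whole fragment.
    bool finishBlock() {
        return pending_.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }

    // Total payload size summed over all N blocks.
    std::size_t totalSize() const {
        std::size_t n = 0;
        for (const auto& b : blocks_) n += b.size();
        return n;
    }

private:
    std::vector<std::vector<std::uint8_t>> blocks_;  // N blocks, one per thread
    std::atomic<std::size_t> pending_;               // blocks not yet finished
};
```

In this sketch the thread that receives `true` from `finishBlock()` would be the one to hand the fragment over for buffering.
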

# What could Affect the Fragment Builder Performance?



#### **SW/HW** environment

Network latency is O(1) µs

Worst case OS scheduler latency is O(1) ms

Can only be measured and accepted



#### **Code quality**

CPU cache-friendly

Efficient multi-threading

Scalable to many CPU cores

Under control of the SW developers



#### **OS** services

Memory management

Must be taken under control

#### Memory Allocation Time is the Primary Issue



These and the other measurements were done on a standard Run 3 SW ROD server (specification is given in the Backup section)

#### Boost Memory Pool vs Malloc



#### The Boost Memory Pool Issue



- Memory pool grows dynamically requesting large memory blocks from the OS
- Each time a new block is allocated, Boost Memory Pool organizes the individual data chunks into a singly linked list
  - The time is proportional to the number of chunks in the block
- This caused data loss when the accumulated allocation time exceeded the maximum HLT latency

#### The New Memory Pool

- A custom Memory Pool implementation has been produced to address the issue:
  - Data chunks are added to the list of free chunks only when they are freed (returned to the Memory Pool)
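- Single open source header file

|                              | Boost Memory Pool (boost 1.82) | Custom MemoryPool |
|------------------------------|--------------------------------|-------------------|
| Average allocation time (ns) | 100                            | 80                |
| Maximum allocation time (ns) | $10^{8}$                       | $10^{4}$          |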
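
The idea of adding chunks to the free list only when they are returned can be sketched as follows. This is an illustrative, single-threaded sketch under stated assumptions, not the actual custom Memory Pool: the name `LazyPool` and its interface are hypothetical. New blocks are handed out by bump allocation, so growing the pool does no per-chunk work.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Illustrative sketch only; not the actual ATLAS Memory Pool code.
// Not thread-safe: a real pool would need synchronization or per-thread pools.
class LazyPool {
public:
    LazyPool(std::size_t chunkSize, std::size_t chunksPerBlock)
        : chunkSize_(roundUp(chunkSize)), chunksPerBlock_(chunksPerBlock) {
        assert(chunksPerBlock > 0);
    }

    void* allocate() {
        if (freeList_ != nullptr) {  // reuse a previously freed chunk
            void* p = freeList_;
            freeList_ = *static_cast<void**>(p);
            return p;
        }
        if (blocks_.empty() || used_ == chunksPerBlock_) {
            // Grow by one block. `new unsigned char[n]` leaves the memory
            // uninitialized, so no chunk of the new block is touched here.
            blocks_.push_back(std::unique_ptr<unsigned char[]>(
                new unsigned char[chunkSize_ * chunksPerBlock_]));
            used_ = 0;
        }
        return blocks_.back().get() + chunkSize_ * used_++;  // bump allocation
    }

    // The only place where the free list grows.
    void deallocate(void* p) {
        *static_cast<void**>(p) = freeList_;
        freeList_ = p;
    }

    std::size_t blockCount() const { return blocks_.size(); }

private:
    // Chunks must be able to hold, and be aligned for, a free-list pointer.
    static std::size_t roundUp(std::size_t n) {
        const std::size_t a = sizeof(void*);
        return n < a ? a : (n + a - 1) / a * a;
    }

    std::size_t chunkSize_;
    std::size_t chunksPerBlock_;
    std::size_t used_ = 0;       // chunks handed out from the newest block
    void* freeList_ = nullptr;   // singly linked list of freed chunks
    std::vector<std::unique_ptr<unsigned char[]>> blocks_;
};
```

With this layout the cost of growing the pool does not depend on the number of chunks per block, at the price of one extra branch in `allocate()`.
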



#### Virtual Memory Management



#### The Virtual Memory Mapping Overhead

- Custom Memory Pool with 4KB memory chunks:
  - The worst case
- One byte was written to the beginning of each block during allocation

|                   | Cold Memory | Warm Memory |
|-------------------|-------------|-------------|
| Average time (ns) | 700         | 80          |
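
A measurement of this kind can be approximated with the sketch below, which writes one byte at the start of every 4 KB chunk of a freshly allocated region and then repeats the pass over the same, already touched memory. This is a hypothetical micro-benchmark, not the code used for the numbers above; the function names are illustrative, and whether the first pass really hits unmapped pages depends on the allocator and the OS.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>

// Write one byte at the start of every chunk; return the average ns per chunk.
inline double touchChunks(unsigned char* base, std::size_t nChunks,
                          std::size_t chunkSize) {
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    for (std::size_t i = 0; i < nChunks; ++i) {
        // volatile keeps the compiler from optimizing the store away
        *static_cast<volatile unsigned char*>(base + i * chunkSize) = 1;
    }
    const auto t1 = clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / nChunks;
}

struct TouchTimes {
    double coldNs;  // first pass: pages may still need to be mapped
    double warmNs;  // second pass: pages are already mapped
};

inline TouchTimes measureColdVsWarm(std::size_t nChunks,
                                    std::size_t chunkSize = 4096) {
    assert(nChunks > 0 && chunkSize > 0);
    auto* mem = static_cast<unsigned char*>(std::malloc(nChunks * chunkSize));
    if (mem == nullptr) std::abort();
    TouchTimes t{};
    t.coldNs = touchChunks(mem, nChunks, chunkSize);
    t.warmNs = touchChunks(mem, nChunks, chunkSize);
    std::free(mem);
    return t;
}
```

Absolute values from such a sketch will differ from machine to machine and include timer overhead, so only the ratio between the two passes is meaningful.
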



#### How to Mitigate this Issue?



## Putting it all Together



- The measurements were done on the Run 4 SW ROD candidate computers:
  - Specifications in the second slide of the Backup section
- Affinity was set to optimize CPU cache usage:
  - Cores 0-15 for Intel
  - Cores 0-3 for AMD (use the same L3 Cache instance)
- AMD is faster for up to 24 E-Links:
  - Better cache performance
- Intel is faster for more than 24 E-Links:
  - Higher CPU frequency

#### Summary

- The HL-LHC Run 4 Readout system of the ATLAS experiment will operate at a 1 MHz input data rate
  - This poses extremely challenging performance requirements for the new SW-based Readout Application
- A very efficient multi-threading algorithm has been developed to utilize the full power of modern CPUs:
  - Memory management was the main issue that affected performance and maximum latency
  - Using a custom Memory Pool implementation, the issues have been successfully addressed



#### Run 3 SW ROD Computer

| Component | Specification                                                    |
|-----------|------------------------------------------------------------------|
| CPU       | 2 x Intel(R) Xeon(R) Gold 5218 2300 MHz, 2 x 16 physical cores   |
| L1d cache | 1 MiB (32 instances)                                             |
| L1i cache | 1 MiB (32 instances)                                             |
| L2 cache  | 32 MiB (32 instances)                                            |
| L3 cache  | 44 MiB (2 instances)                                             |
| RAM       | 96 GB, 12 x 8GB DDR4 Samsung 2666 MT/s                           |
| Network   | Mellanox MT27800 Connect-X5 100 Gb/s                             |

#### Run 4 Data Handler Candidate Computers

| Component | Intel candidate                                                  | AMD candidate                                     |
|-----------|------------------------------------------------------------------|---------------------------------------------------|
| CPU       | 2 x Intel(R) Xeon(R) Gold 6444Y 4000 MHz, 2 x 16 physical cores  | AMD EPYC 9354P 3800 MHz, 32 physical cores        |
| L1d cache | 1.5 MiB (32 instances)                                           | 1 MiB (32 instances)                              |
| L1i cache | 1 MiB (32 instances)                                             | 1 MiB (32 instances)                              |
| L2 cache  | 64 MiB (32 instances)                                            | 32 MiB (32 instances)                             |
| L3 cache  | 90 MiB (2 instances)                                             | 256 MiB (8 instances)                             |
| RAM       | 128 GB, 8 x 16GB DDR5 Micron Technology 4800 MT/s                | 64 GB, 4 x 16GB DDR5 Micron Technology 4800 MT/s  |
| Network   | Mellanox MT2910 Connect-X7 400 Gb/s                              | Mellanox MT2910 Connect-X7 400 Gb/s               |