

# FELIX: first operational experience with the new ATLAS readout system and perspectives for HL-LHC

**Joaquin Hoya** on behalf of the ATLAS TDAQ Collaboration




Argonne National Laboratory is a U.S. Department of Energy laboratory managed by UChicago Argonne, LLC

CHEP2023 - 11/05/23


# **ATLAS Trigger and Data Acquisition**

#### **ATLAS upgrades for the LHC Run 3**

- The Large Hadron Collider (LHC) collides proton bunches at a 40 MHz rate.
- ATLAS detects the collision products and selects (triggers on) physics events of interest. The **Run 3** expected avg. event data rate for permanent storage is ~3 kHz.
- New detector and trigger systems were installed for Run 3 to improve background rejection.

New in Run 3:

- Muon System
  - New Small Wheels (NSW)
  - Inner Barrel RPCs (BIS7/8)
- Calorimeters
  - Liquid Argon (LAr) digital readout
- Trigger and DAQ
  - L1Calorimeter Trigger (L1Calo)
  - FELIX & Software Readout Driver (SWROD)

Timeline:

- Run 2 (2015-2018): √s = 13 TeV, max pileup 60
- Run 3 (2022-2025): √s = 13-14 TeV, max pileup 60
- Run 4+ (HL-LHC, 2029 onwards): √s = 14 TeV, max pileup 140-200

Trigger chain: 40 MHz collisions → Level-1 (L1) accept signal (max latency 2.5 µs, max rate 100 kHz) → DAQ and High-Level Trigger (HLT) → permanent storage (3 kHz).

Pileup = number of interactions per LHC bunch crossing.
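As a quick worked example, the rates quoted on this slide imply the following rejection factors at each trigger stage. This is only arithmetic on the quoted numbers; the variable and function names are illustrative.

```python
# Rates quoted on the slide, in Hz.
BUNCH_CROSSING_RATE_HZ = 40e6   # LHC bunch-crossing rate
L1_ACCEPT_RATE_HZ = 100e3       # maximum Level-1 accept rate
STORAGE_RATE_HZ = 3e3           # expected average rate to permanent storage


def rejection_factor(rate_in_hz: float, rate_out_hz: float) -> float:
    """Return the number of input events per accepted event."""
    return rate_in_hz / rate_out_hz


l1_factor = rejection_factor(BUNCH_CROSSING_RATE_HZ, L1_ACCEPT_RATE_HZ)
hlt_factor = rejection_factor(L1_ACCEPT_RATE_HZ, STORAGE_RATE_HZ)
total_factor = rejection_factor(BUNCH_CROSSING_RATE_HZ, STORAGE_RATE_HZ)

print(f"L1 keeps about 1 in {l1_factor:.0f} bunch crossings")
print(f"HLT keeps about 1 in {hlt_factor:.0f} L1-accepted events")
print(f"Overall: about 1 in {total_factor:.0f}")
```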

# FELIX and ATLAS TDAQ in Run 3 (2022-2025)

#### FELIX: <u>Front-End LInk eX</u>change (<u>https://atlas-project-felix.web.cern.ch</u>)

#### Run 3:

- Same as Run 2 for most sub-detectors.
- Legacy ROD and ROS architecture is being replaced with **FELIX** and **SW ROD**. It includes NSW, LAr, L1Calo and BIS78.

**FELIX** is a **data router** that works as an interface between on-detector systems and commodity computing.

- The data being routed includes readout, configuration, trigger, clock distribution, monitoring.
- **FELIX** system consists of commodity servers with PCIe cards. Used for **data routing** only.
- SWROD is in charge of data processing, aggregation, and monitoring. Hosted by commodity computers.



The introduction of FELIX brings down the number of custom components in the system, reducing design and maintenance efforts. Commercial off-the-shelf (**COTS**) hardware is used earlier in the readout chain.

- **GBT**: synchronous serial protocol at 4.8 Gb/s
- **FM**: 8b/10b RX link at 9.6 Gb/s (FULL Mode)



# **FELIX Hardware**

#### The FLX-712 FELIX card

- FPGA Xilinx Kintex UltraScale XCKU115, 16-lane PCIe Gen3.
- 8 MiniPODs to support up to 48 bidirectional optical links (most commonly: 4 MiniPODs/24 links).
- Interface to Timing, Trigger and Control (TTC) systems. BUSY output.
- Flash memory to store firmware.



~300 boards produced, for ATLAS, ProtoDUNE, ATLAS tracker upgrade, and others.

# **FELIX Firmware**

Two main flavours:

- **FULL**: interface to other FPGA-based systems
  - Up to 24 channels per FLX-712, 9.6 Gb/s each
- **GBT**: interface to GBTX
  - GBTX is a radiation-hard ASIC [1]
  - On-detector data stream aggregator
  - Supports 24 x 4.8 Gb/s bi-directional GBT links
  - Each GBT link carries multiple data streams (E-links) of configurable bandwidth



# **FELIX Software**

#### **Readout application**

felix-star transfers data between the FLX-712 card and network peers

- Interrupt driven central event loop architecture.
- Asynchronous non-blocking architecture.
- Single thread, two processes per card.
- Two data transfer approaches: zero-copy, data coalescence.
- Custom network library based on libfabric [1].
- Uses Remote Direct Memory Access (**RDMA**) technology for **low overhead transfers**.

- Run 3 software architecture scalable for Run 4.

[1] <u>https://ofiwg.github.io/libfabric/</u>

\* Nvidia/Mellanox ConnectX-5 [image by storagereview.com]
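To illustrate the idea behind data coalescence (as opposed to sending each message on its own), here is a minimal sketch that packs many small messages into fewer, larger buffers. The 4-byte length prefix, the function names and the batch limit are assumptions made for the example; they are not the felix-star wire format or API.

```python
from typing import Iterable, Iterator, List


def coalesce(messages: Iterable[bytes], max_batch_bytes: int) -> Iterator[bytes]:
    """Pack consecutive messages into batches of at most max_batch_bytes.

    Each message gets a 4-byte little-endian length prefix (an assumed
    framing) so the receiver can split the batch again. A message larger
    than the limit is emitted in a batch of its own.
    """
    batch = bytearray()
    for msg in messages:
        framed = len(msg).to_bytes(4, "little") + msg
        if batch and len(batch) + len(framed) > max_batch_bytes:
            yield bytes(batch)
            batch = bytearray()
        batch += framed
    if batch:
        yield bytes(batch)


def split(batch: bytes) -> List[bytes]:
    """Recover the individual messages from one coalesced batch."""
    view = memoryview(batch)
    out: List[bytes] = []
    offset = 0
    while offset < len(view):
        size = int.from_bytes(view[offset:offset + 4], "little")
        offset += 4
        out.append(bytes(view[offset:offset + size]))
        offset += size
    return out


# Ten 40-byte messages (the NSW average size) with a 256-byte batch limit.
messages = [bytes([i]) * 40 for i in range(10)]
batches = list(coalesce(messages, max_batch_bytes=256))
print(f"{len(messages)} messages sent as {len(batches)} network buffers")
```

Coalescing trades a copy into the batch buffer for fewer network operations, which tends to pay off for small messages; a zero-copy path avoids the copy and is the more natural fit for large messages.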

**felix-star** runs as a daemon on FELIX servers

- Each FELIX server hosts up to two FLX-712 cards

FELIX server:

- Intel Xeon E5-1660 v4 @ 3.2GHz.
- 32 GB DDR4 2667 MHz memory.
- Mellanox ConnectX 25/100 GbE.



1. Send completed
2. Data received
3. Buffer available for sending

#### System events:

1. Timer events (timerfd)
2. Signals (eventfd)
3. Any file descriptor event
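The single-threaded, event-driven pattern described on this slide can be sketched with Python's standard `selectors` module. This is only an illustration of the pattern, not felix-star code: a `socketpair` stands in for a device or network file descriptor, and the timer is emulated with the select timeout rather than a real timerfd. All names are hypothetical.

```python
import selectors
import socket
import time
from typing import List, Optional, Tuple


def run_event_loop(duration_s: float = 0.2, tick_s: float = 0.05) -> List[Tuple[str, Optional[bytes]]]:
    """Run a single-threaded loop handling fd readiness and periodic ticks."""
    sel = selectors.DefaultSelector()
    rx, tx = socket.socketpair()      # stand-in for a real data source
    rx.setblocking(False)             # handlers must never block the loop
    log: List[Tuple[str, Optional[bytes]]] = []

    def on_readable(sock: socket.socket) -> None:
        log.append(("data", sock.recv(4096)))

    # The callback is stored as the key's data and dispatched by the loop.
    sel.register(rx, selectors.EVENT_READ, on_readable)
    tx.send(b"hello")                 # make one "data received" event pending

    deadline = time.monotonic() + duration_s
    next_tick = time.monotonic() + tick_s
    while True:
        now = time.monotonic()
        if now >= deadline:
            break
        # Sleep until the next fd event, the next tick, or the deadline.
        timeout = max(0.0, min(next_tick, deadline) - now)
        for key, _events in sel.select(timeout):
            key.data(key.fileobj)
        if time.monotonic() >= next_tick:
            log.append(("tick", None))  # emulated timer event
            next_tick += tick_s

    sel.unregister(rx)
    rx.close()
    tx.close()
    sel.close()
    return log
```

Calling `run_event_loop()` returns the list of handled events: one `("data", b"hello")` entry followed by a few `("tick", None)` entries. Because every handler runs to completion on the same thread, no locking is needed between them.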




# **FELIX Performance in ATLAS**

64 FELIX PCs, 105 FLX-712 cards installed in ATLAS in 2022.

- Application control and monitoring based on Supervisor [1]
  - Automatically starts and restarts felix-star applications, and can be monitored and controlled via a web interface.
- Monitoring integrated in the ATLAS infrastructure:
  - Operational monitoring [2] with Grafana [3] dashboards.
  - Integration into the ATLAS-wide ErrorReporting System (ERS) [4].

[1] <u>http://supervisord.org</u>

[2] doi: 10.1051/epjconf/202024501020

[3] <u>https://go2.grafana.com</u>

[4] doi: 10.1088/1742-6596/608/1/012004







# **FELIX Performance: LAr**

#### LDPB (LAr Digital Processing Blade) -- FELIX in FULL mode

Throughput ~40 Gb/s



# **FELIX Performance: L1Calo**

#### Level-1 Calorimeter trigger -- FELIX in FULL mode

Throughput ~8 Gb/s




- Throughput ~8 Gb/s in a high-rate run.
- Stable performance at ~100 kHz.
- L1Calo uses a feature called "streams":
  - 16 links per FLX-712 using up to 9 streams, each carrying data at 100 kHz.
- Avg. message size: 3 kB
- L1Calo uses buffered mode.



# **FELIX Performance: NSW**

#### New Small Wheel -- FELIX in GBT mode

- The NSW has the largest number of E-links: ~200 per FELIX card.
- Each E-link provides data at ~100 kHz.
- Avg. message size: 40 B



One major software challenge during the last year was late packet arrival:

- All messages were delivered, but with a latency of up to 100 ms (which could exceed the SWROD time window).
- The leading cause was CPU saturation, reaching 100% at high rate (>80 kHz).
- Performance optimizations deployed since earlier this year now deliver messages on time.
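A rough estimate shows why CPU saturation is plausible at these rates. The sketch below takes the NSW figures quoted above (~200 E-links per card, ~100 kHz each) and the two readout processes per card from the readout application slide, and assumes, purely for illustration, that the load is split evenly between the two processes.

```python
ELINKS_PER_CARD = 200         # "~200 per FELIX card"
RATE_PER_ELINK_HZ = 100e3     # "~100 kHz" per E-link
PROCESSES_PER_CARD = 2        # "two processes per card"

# Total number of messages one card has to route each second.
messages_per_s = ELINKS_PER_CARD * RATE_PER_ELINK_HZ

# Average time available per message in each process, assuming an even split.
budget_ns = PROCESSES_PER_CARD / messages_per_s * 1e9

print(f"Message rate per card: {messages_per_s / 1e6:.0f} MHz")
print(f"Time budget per message and process: {budget_ns:.0f} ns")
```

A budget on the order of 100 ns per message leaves room for only a few hundred CPU cycles on a ~3 GHz core, so per-message overheads dominate at high trigger rates.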





HL-LHC $t\bar{t}$ event in the ATLAS ITk at $\langle\mu\rangle = 200$



# ATLAS DAQ in Run 4

#### 2029+

**Run 4 conditions** 

- **1 MHz** L1 trigger rate $\rightarrow$ **×10 Run 3**
- Up to 200 avg. interactions per bunch-crossing $\rightarrow$ ×3 Run 3
- 4.6 TB/s data throughput $\rightarrow$ ×20-30 Run 3
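The Run 4 figures above can be combined into an average event size and an implied Run 3 throughput. This is a back-of-the-envelope sketch that assumes decimal units (1 TB = 10^12 bytes) and takes the ×20-30 factor at face value; the names are illustrative.

```python
RUN4_L1_RATE_HZ = 1e6                  # 1 MHz L1 trigger rate
RUN3_L1_RATE_HZ = 100e3                # Run 3 maximum L1 rate
RUN4_THROUGHPUT_BYTES_PER_S = 4.6e12   # 4.6 TB/s, decimal units assumed

# Average data volume read out per L1-accepted event in Run 4.
avg_event_mb = RUN4_THROUGHPUT_BYTES_PER_S / RUN4_L1_RATE_HZ / 1e6

# Trigger-rate increase with respect to Run 3.
rate_ratio = RUN4_L1_RATE_HZ / RUN3_L1_RATE_HZ

# Run 3 throughput implied by the quoted x20-30 factor, in GB/s.
run3_low_gb_s = RUN4_THROUGHPUT_BYTES_PER_S / 30 / 1e9
run3_high_gb_s = RUN4_THROUGHPUT_BYTES_PER_S / 20 / 1e9

print(f"Average Run 4 event size: {avg_event_mb:.1f} MB")
print(f"L1 rate increase: x{rate_ratio:.0f}")
print(f"Implied Run 3 throughput: {run3_low_gb_s:.0f}-{run3_high_gb_s:.0f} GB/s")
```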

#### **FELIX requirements**

- Readout of all sub-detectors
- ~14000 optical links with bandwidth up to **25 Gb/s**
- Support for new detector-specific functionalities
  - e.g. continuous "trickle" reconfiguration of new tracker front-end electronics (pixel, strips)

Data handler - evolution of SWROD, under development.



# **Future FELIX cards**

#### Prototypes, firmware and software upgrades

A new FELIX card is necessary to support

- increased maximum link bandwidth (10  $\rightarrow$  25 Gb/s).
- new timing/trigger interface (will receive data at 9.6 Gb/s).

#### Prototypes

- FLX-181 and FLX-182 prototypes.
- Xilinx XC(V)M1802 FPGA, up to 24 links at 25 Gb/s.
- New FPGAs, PCIe generation 4 or later, new optical transceivers (FireFly<sup>™</sup>).

#### Firmware Upgrades to support

- Additional data encoding types.
- Higher link and PCIe interface speed.
- Larger buffers in computer memory.

#### Software Upgrades

- Same architecture as in Run 3 but different deployment scheme:
  - Run 3: only 2 readout applications per card.
  - Run 4: up to 8 readout applications.
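One way to picture the change in deployment scheme is as a partition of a card's links among readout applications. The sketch below is hypothetical: the contiguous assignment and the function name are assumptions, and the actual mapping used by FELIX is not described in these slides.

```python
from typing import List


def partition_links(n_links: int, n_apps: int) -> List[List[int]]:
    """Split link indices 0..n_links-1 into n_apps contiguous groups.

    Group sizes differ by at most one; earlier groups take the remainder.
    """
    base, extra = divmod(n_links, n_apps)
    groups: List[List[int]] = []
    start = 0
    for app in range(n_apps):
        size = base + (1 if app < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


# 24 links per card, served by 2 applications (Run 3) or up to 8 (Run 4).
for n_apps in (2, 8):
    sizes = [len(group) for group in partition_links(24, n_apps)]
    print(f"{n_apps} applications -> links per application: {sizes}")
```

With the same single-threaded architecture, running more applications per card spreads the links over more CPU cores.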







# Integration with new systems

New all-silicon inner tracker

- Increased acceptance up to  $|\eta| < 4$  and pile-up rejection.
- Comparable/better tracking performance at much higher pile-up conditions (~200).

FELIX is being used in the ITk Pixel and Strips production and testing.

**ITk Strips**: FELIX Strips firmware functional. Configuration and readout via FELIX.



**ITk Pixel**: FELIX Pixel firmware successfully tested.



New flavours in addition to GBT and FULL mode:

- **lpGBT** (evolution of GBT)
- PIXEL & STRIP (custom lpGBT)
- Interlaken (64b/67b encoding)
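The line encodings mentioned in this talk differ in how much of the raw line rate is left for payload. The sketch below compares 8b/10b (used by FULL mode at 9.6 Gb/s) with 64b/67b. It counts only the line-coding overhead, not any protocol framing, and the 25 Gb/s rate used for the 64b/67b case is a hypothetical example rather than a stated Interlaken link speed.

```python
from fractions import Fraction


def payload_rate_gbps(line_rate_gbps: Fraction, payload_bits: int, coded_bits: int) -> Fraction:
    """Payload bandwidth left after an N-bit/M-bit line code."""
    return line_rate_gbps * Fraction(payload_bits, coded_bits)


EFF_8B10B = Fraction(8, 10)     # 8 payload bits per 10 line bits
EFF_64B67B = Fraction(64, 67)   # 64 payload bits per 67 line bits

# FULL mode: 9.6 Gb/s line rate with 8b/10b encoding.
full_mode_payload = payload_rate_gbps(Fraction("9.6"), 8, 10)

# Hypothetical 25 Gb/s link with 64b/67b encoding.
example_64b67b_payload = payload_rate_gbps(Fraction(25), 64, 67)

print(f"8b/10b efficiency:  {float(EFF_8B10B):.1%}")
print(f"64b/67b efficiency: {float(EFF_64B67B):.1%}")
print(f"FULL mode payload:  {float(full_mode_payload):.2f} Gb/s")
print(f"25 Gb/s at 64b/67b: {float(example_64b67b_payload):.2f} Gb/s")
```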



# Summary

- FELIX is a data acquisition component for the ATLAS experiment to interface detector electronics and commodity computing.
- Run 3:
  - FELIX was used instead of the legacy readout architecture for new sub-detector systems, reducing the amount of custom hardware in the data taking path.
  - FELIX firmware and the software are mature and used for data taking.
  - Good performance for all the new systems (**NSW**, **LAr** and **L1Calo**).
- Run 4:
  - FELIX will read out all sub-detectors.
  - Hardware prototypes under development.
  - Firmware under development. Early builds successfully tested using Run 3 hardware.
  - Run 3 software architecture scalable for Run 4.
  - FELIX is already part of the early production and testing of some of the new Run 4 detectors.



### **Backup Slides**






### **FELIX Performance**





### FELIX: Front-End Link eXchange

- FELIX is a **data router** that works as an interface between on-detector systems and commodity computing.
- ATLAS-wide effort to harmonize detector readout systems.
- Designed to cope with the expected higher data volumes and event processing complexity.
- The data being routed includes readout, configuration, trigger, clock distribution, monitoring, BUSY and TTC signals.
- The firmware is modular and flexible, with a routing module between the custom serial links and PCIe interface.
- The software includes drivers, low-level tools, test software and routing software.
- First-generation FELIX cards are in use during Run 3.



\*TTC refers to the Timing, Trigger and Control systems



# **FELIX Firmware**

Two main flavours:

- **FULL**: to interface the FELIX to other FPGA-based systems
  - Up to 24 channels per FLX-712, 9.6 Gb/s each
- **GBT**: to interface to GBTX
  - GBTX is a radiation-hard ASIC developed at CERN [1]
  - Used as on-detector data stream aggregator
  - GBT firmware supports 24 x 4.8 Gb/s bi-directional GBT links
  - Each GBT link carries multiple data streams (e-links) of configurable bandwidth

| Mode | Message size | Rate per link | (e)links per card | Total message rate per card | Total data rate per card | Use case |
|------|--------------|---------------|-------------------|-----------------------------|--------------------------|----------|
| FULL | 4800 bytes   | 100 kHz       | 12                | 1.2 MHz                     | 46 Gbps                  | LAr      |
| GBT  | 40 bytes     | 100 kHz       | 192               | 19.2 MHz                    | 7.5 Gbps                 | NSW      |

#### **ATLAS** benchmarks

[1] doi: 10.5170/CERN-2009-006.342
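The per-card totals in the benchmark table follow from the message size, rate per link and number of links. The sketch below recomputes them, counting message payload only; the function name is illustrative.

```python
from typing import Tuple


def card_rates(message_bytes: int, rate_per_link_hz: float, links: int) -> Tuple[float, float]:
    """Return (message rate in Hz, payload data rate in Gb/s) for one card."""
    message_rate_hz = rate_per_link_hz * links
    data_rate_gbps = message_rate_hz * message_bytes * 8 / 1e9
    return message_rate_hz, data_rate_gbps


full_rate_hz, full_gbps = card_rates(4800, 100e3, 12)    # FULL mode row (LAr)
gbt_rate_hz, gbt_gbps = card_rates(40, 100e3, 192)       # GBT mode row (NSW)

print(f"FULL: {full_rate_hz / 1e6:.1f} MHz, {full_gbps:.1f} Gb/s payload")
print(f"GBT:  {gbt_rate_hz / 1e6:.1f} MHz, {gbt_gbps:.1f} Gb/s payload")
```

The FULL mode result reproduces the 46 Gbps in the table. For GBT mode the payload-only estimate (about 6.1 Gb/s) is lower than the quoted 7.5 Gbps; the slides do not state what accounts for the difference.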



# **FELIX Software**

felix-star architecture





# **FELIX Software**

#### **Readout application**

- server publishes links/e-links, clients subscribe
- two data transfer approaches: zero-copy, data coalescence



- User-friendly API hides the complexity of the network library from client applications

#### **API** functions

- subscribe(elink\_number)
- unsubscribe(elink\_number)
- send\_data(elink\_number)

#### **API callback hooks**

- on\_message\_received(elink\_number)
- on\_connection\_established(elink\_number)
- on\_disconnection(elink\_number)
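To show how a client might use a publish/subscribe API of this shape, here is a small in-memory mock. It is not the real FELIX client library: the class name, the `payload` arguments and the `publish_from_server` helper are assumptions added so that the example runs on its own; only the function and callback names come from the slide.

```python
from typing import Callable, List, Optional, Set, Tuple


class MockFelixClient:
    """In-memory stand-in that mimics the subscribe/callback pattern."""

    def __init__(
        self,
        on_message_received: Callable[[int, bytes], None],
        on_connection_established: Optional[Callable[[int], None]] = None,
        on_disconnection: Optional[Callable[[int], None]] = None,
    ) -> None:
        self._on_message = on_message_received
        self._on_connect = on_connection_established or (lambda elink: None)
        self._on_disconnect = on_disconnection or (lambda elink: None)
        self._subscribed: Set[int] = set()
        self.sent: List[Tuple[int, bytes]] = []   # records send_data calls

    def subscribe(self, elink_number: int) -> None:
        if elink_number not in self._subscribed:
            self._subscribed.add(elink_number)
            self._on_connect(elink_number)

    def unsubscribe(self, elink_number: int) -> None:
        if elink_number in self._subscribed:
            self._subscribed.remove(elink_number)
            self._on_disconnect(elink_number)

    def send_data(self, elink_number: int, payload: bytes) -> None:
        # A real client would hand this to the network library.
        self.sent.append((elink_number, payload))

    def publish_from_server(self, elink_number: int, payload: bytes) -> None:
        """Hypothetical test helper: simulate the server publishing data."""
        if elink_number in self._subscribed:
            self._on_message(elink_number, payload)


received: List[Tuple[int, bytes]] = []
events: List[Tuple[str, int]] = []
client = MockFelixClient(
    on_message_received=lambda elink, data: received.append((elink, data)),
    on_connection_established=lambda elink: events.append(("up", elink)),
    on_disconnection=lambda elink: events.append(("down", elink)),
)
client.subscribe(7)
client.publish_from_server(7, b"fragment")   # delivered: subscribed to 7
client.publish_from_server(8, b"ignored")    # dropped: not subscribed to 8
client.unsubscribe(7)
client.publish_from_server(7, b"late")       # dropped: already unsubscribed
client.send_data(7, b"config")
```

The client code only deals with E-link numbers and callbacks; connection handling and buffer management stay behind the API, which is the point made on the slide.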



# **FELIX users in ATLAS Run 3**

#### Muon Spectrometer [GBT mode]

- New Small Wheels (NSW)
  - sTGC (Small-strip Thin Gap Chamber)
  - MicroMegas (Micro Mesh Gaseous Structure)
- BIS78 (Barrel Inner Small MDT sector 7/8)

#### L1 calorimeter trigger [FULL mode]

- gFEX (Global Feature Extractor)
- jFEX (Jet Feature Extractor)
- TREX (Tile Rear Extension)
- ROD, Hub for eFEX (Electron Feature Extractor)

#### Liquid Argon Calorimeter [48-ch GBT / FULL mode]

- LTDB (LAr Trigger Digitizer Board, custom GBT)
- LDPB (LAr Digital Processing Blade, FULL)

#### Tile Calorimeter test system [FULL mode]



# **GBT** fragment building algorithm in felix-star software



### Performance on FELIX testbed at CERN



