



# The ATLAS Event Builder

W.Vandelli\*
CERN – Physics Department/ATD

On behalf of ATLAS TDAQ DataFlow

<sup>\*</sup> This research project has been supported by a Marie Curie Early Stage Research Training Fellowship of the European Community's Sixth Framework Programme under contract number (MRTN-CT-2006-035606)



### **ATLAS TDAQ DataFlow**



- H.P. Beck<sup>1</sup>, M. Abolins<sup>2</sup>, A. Battaglia<sup>1</sup>, R. Blair<sup>3</sup>, A. Bogaerts<sup>4</sup>, M. Bosman<sup>5</sup>,
- M. Ciobotaru<sup>6</sup>, R. Cranfield<sup>7</sup>, G. Crone<sup>8</sup>, J. Dawson<sup>3</sup>, R. Dobinson<sup>4</sup>, M. Dobson<sup>4</sup>,
- A. Dos Anjos<sup>9</sup>, G. Drake<sup>3</sup>, Y. Ermoline<sup>2</sup>, R. Ferrari<sup>10</sup>, M.L. Ferrer<sup>11</sup>, D. Francis<sup>4</sup>,
- S. Gadomski<sup>1</sup>, S. Gameiro<sup>4</sup>, B. Gorini<sup>4</sup>, B. Green<sup>12</sup>, W. Haberichter<sup>3</sup>, C. Häberli<sup>1</sup>,
- R. Hauser<sup>2</sup>, C. Hinkelbein<sup>13</sup>, R. Hughes-Jones<sup>14</sup>, M. Joos<sup>4</sup>, G. Kieft<sup>15</sup>, K. Kordas<sup>1</sup>,
- A. Kugel<sup>13</sup>, L. Leahu<sup>16</sup>, G. Lehmann<sup>4</sup>, B. Martin<sup>4</sup>, L. Mapelli<sup>4</sup>, C. Meessen<sup>17</sup>, C. Meirosu<sup>15</sup>,
- A. Misiejuk<sup>12</sup>, G. Mornacchi<sup>4</sup>, M. Müller<sup>13</sup>, Y. Nagasaka<sup>18</sup>, A. Negri<sup>6</sup>, E. Pasqualucci<sup>19,20</sup>,
- T. Pauly<sup>4</sup>, J. Petersen<sup>4</sup>, B. Pope<sup>2</sup>, J. Schlereth<sup>3</sup>, R. Spiwoks<sup>4</sup>, S. Stancu<sup>6</sup>, J. Strong<sup>12†</sup>,
- S. Sushkov<sup>5</sup>, T. Szymocha<sup>21</sup>, L. Tremblet<sup>4</sup>, G. Unel<sup>4,6</sup>, W. Vandelli<sup>4</sup>, J. Vermeulen<sup>15</sup>,
- P. Werner<sup>4</sup>, S. Wheeler-Ellis<sup>6</sup>, F. Wickens<sup>8</sup>, W. Wiedenmann<sup>9</sup>, M. Yu<sup>13</sup>, Y. Yasu<sup>22</sup>,
- J. Zhang<sup>3</sup>, H. Zobernig<sup>9</sup>

† deceased

- 1. Universität Bern, Switzerland
- 2. Michigan State University, Ann Arbor, MI
- 3. Argonne National Laboratory
- 4. CERN, Geneva, Switzerland
- 5. Inst. de Fisica de Altas Energias (IFAE), Universidad Autonoma de Barcelona, Spain
- 6. University of California, Irvine, CA, US
- 7. University College, London, UK
- 8. CCLRC Rutherford Appleton Laboratory, Chilton, Didcot, Oxon OX11 0QX, UK
- 9. Univ. of Wisconsin, Madison, WI, US
- 10. INFN Sezione di Pavia, Italy
- 11. Laboratori Nazionali di Frascati, Italy
- 12. Physics Department, Royal Holloway College, University of London, UK

- 13. Universität Mannheim, Germany
- 14. University of Manchester, UK
- 15. NIKHEF, Amsterdam, The Netherlands
- 16. National Institute for Physics and Nuclear Engineering "Horia Hulubei", NIPNE-HH, Bucharest, Romania
- 17. CPPM Marseille, France
- 18. Hiroshima Institute of Technology, Japan
- 19. Universita di Roma "La Sapienza", Rome, Italy
- 20. INFN Roma, Rome, Italy
- 21. Henryk Niewodniczanski Inst. Nucl. Physics, Cracow, Poland
- 22. High Energy Accelerator Research Organization (KEK), Tsukuba, Japan



## **ATLAS** Experiment







### ATLAS TDAQ





Event data **pushed** @ ≤ 100 kHz, 1600 fragments of ~ 1 kByte each



# Read-Out System



#### SDX<sub>1</sub>

- 150 PCs housing custom PCI boards
  - 1600 optical readout links (ROL)
- ROSes hold the data till the LVL2 decision
  - serve data to LVL2 and Event Builder



Event data **pushed** @ ≤ 100 kHz, 1600 fragments of ~ 1 kByte each



# LVL2 System







## **ATLAS Event Builder Commitment**





[Figure: DataFlow components in SDX<sub>1</sub>, hosted on 1U rack mountable PCs: Event Builder (~100 SFIs), Event Filter (~1900 nodes), Network]



**Event Builder:** full events @ ~ 3 kHz

Event data **pushed** @ ≤ 100 kHz, 1600 fragments of ~ 1 kByte each ~1.5 MB event size
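The quoted rates can be cross-checked with a few lines of arithmetic. The sketch below uses only the slide figures (100 kHz, 1600 fragments of about 1 kByte, about 3 kHz building rate, about 1.5 MB events, about 100 SFIs); all of them are approximate, and the variable names are illustrative. Note that 1600 fragments of 1 kByte give 1.6 MB, in line with the rounded ~1.5 MB event size.

```python
# Back-of-the-envelope check of the rates quoted on the slide.
# All inputs are the approximate slide values, not measurements.
KB, MB, GB = 1e3, 1e6, 1e9

lvl1_rate_hz = 100e3        # upper limit of the push rate into the ROSes
n_fragments = 1600          # one fragment per readout link
fragment_size = 1 * KB      # approximate fragment size
eb_rate_hz = 3e3            # approximate event building rate
event_size = 1.5 * MB       # approximate full event size
n_sfis = 100                # approximate size of the final SFI farm

ros_input = lvl1_rate_hz * n_fragments * fragment_size   # bytes/s pushed into the ROSes
eb_throughput = eb_rate_hz * event_size                  # bytes/s through the Event Builder
per_sfi = eb_throughput / n_sfis                         # bytes/s each SFI must sustain

print(f"ROS input:      {ros_input / GB:.0f} GB/s")
print(f"Event Builder:  {eb_throughput / GB:.1f} GB/s")
print(f"Per SFI:        {per_sfi / MB:.0f} MB/s")
```

Under these assumptions each SFI has to sustain a few tens of MB/s, well below the capacity of a single GE link.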



### **Event Builder Protocol**



#### DFM is the "orchestra leader" of the EB system

- receives trigger via the network
- assigns events to the SFIs, load balancing the farm
- handles End-Of-Event
- sends clear messages to the ROSes
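The DFM role described above can be illustrated with a toy model. This is a minimal sketch, not the actual DFM code: the class name `ToyDFM`, its methods, and the "least outstanding events" balancing policy are all assumptions made for illustration; the slides only state that the DFM load-balances the farm.

```python
class ToyDFM:
    """Toy model of the DFM bookkeeping (hypothetical, for illustration only)."""

    def __init__(self, sfi_names):
        # Number of events currently being built by each SFI.
        self.outstanding = {name: 0 for name in sfi_names}
        self.assignment = {}    # event id -> SFI building it
        self.clear_batch = []   # event ids waiting to be cleared in the ROSes

    def on_trigger(self, event_id):
        # Assumed policy: pick the SFI with the fewest events in flight.
        sfi = min(self.outstanding, key=self.outstanding.get)
        self.outstanding[sfi] += 1
        self.assignment[event_id] = sfi
        return sfi

    def on_end_of_event(self, event_id):
        # The SFI reports a fully built event: free its slot, queue a clear.
        sfi = self.assignment.pop(event_id)
        self.outstanding[sfi] -= 1
        self.clear_batch.append(event_id)

    def flush_clears(self):
        # Return the ids that would go into one clear message to the ROSes.
        batch, self.clear_batch = self.clear_batch, []
        return batch


dfm = ToyDFM(["sfi-1", "sfi-2"])
first = dfm.on_trigger(1)
second = dfm.on_trigger(2)
dfm.on_end_of_event(2)
third = dfm.on_trigger(3)
print(first, second, third, dfm.flush_clears())
```

Grouping several event ids into one clear message is also an assumption of the sketch; it reflects the fact that clears are sent to all ROSes at once.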

#### SFI is the event builder application

- asks all the ROSes for data fragments
- builds a complete event
- serves the Event Filter

#### SFI features

- Traffic shaping: limited number of outstanding requests
- Re-asks for missing fragments
- Monitoring server
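The pull protocol with traffic shaping and re-asks can be sketched as a small synchronous loop. This is a simplified illustration under stated assumptions: `build_event`, `make_flaky_fetch`, and the parameters `max_outstanding` and `max_attempts` are hypothetical names, and the real SFI handles requests asynchronously over the network rather than in batches.

```python
def build_event(ros_ids, fetch, max_outstanding=4, max_attempts=5):
    """Collect one fragment per ROS with a bounded request window and re-asks."""
    ros_ids = list(ros_ids)
    pending = list(ros_ids)
    attempts = {ros: 0 for ros in ros_ids}
    fragments = {}
    while pending:
        # Traffic shaping: never more than max_outstanding requests at once.
        window, pending = pending[:max_outstanding], pending[max_outstanding:]
        for ros in window:
            attempts[ros] += 1
            reply = fetch(ros)          # None models a lost request or reply
            if reply is not None:
                fragments[ros] = reply
            elif attempts[ros] < max_attempts:
                pending.append(ros)     # re-ask for the missing fragment later
            else:
                raise RuntimeError(f"ROS {ros} did not answer")
    # The complete event: fragments in ROS order.
    return [fragments[ros] for ros in ros_ids]


def make_flaky_fetch(drop_first_request_from):
    """Return a fetch function that loses the first request to the given ROSes."""
    seen = set()

    def fetch(ros):
        if ros in drop_first_request_from and ros not in seen:
            seen.add(ros)
            return None
        return f"fragment-{ros}"

    return fetch


event = build_event(range(6), make_flaky_fetch({1, 4}), max_outstanding=2)
print(event)
```

In the example, the requests to ROS 1 and ROS 4 are lost once each and the event is still completed after the re-asks.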



#### **Network Protocols used**

- UDP/IP for data requests and data replies
- UDP/IP multicast for the DFM clear messages
- TCP/IP for data flow commands
- Possibility to use TCP/IP everywhere



## Presently Installed Hardware





- SFI/DFM: dual AMD Opteron 252 2.6 GHz
- HLT: dual Intel Xeon E5320 quad-core 1.86GHz

Central & local file server, online service and monitoring nodes not shown



# **Event Builder Scaling Properties**



- Test capabilities extended by running the SFI application on HLT nodes
- Perfect scaling up to 59 SFIs
  - 1.5 MB event size
  - 6.5GB/s @ 4.5kHz
- SFI application can exploit nearly the full GE link
  - 114MB/s @ 76 Hz (SFI)
  - 112MB/s @ 75 Hz (HLT)
- Multi-core HLT nodes have good SFI performance
  - Double SFI approach
- Very promising result, BUT



10% performance decrease when sending data to Event Filter

No LVL2 load on system (ROSes & network) during the test
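The scaling figures can be related to each other with simple arithmetic. The sketch below assumes ideal linear scaling of the per-SFI throughput and uses the slide values (59 SFIs, 114 MB/s per SFI, 1.5 MB events, 10% decrease with Event Filter output); the function names are illustrative. Ideal scaling gives about 6.7 GB/s, comparable to the quoted 6.5 GB/s at 4.5 kHz.

```python
MB, GB = 1e6, 1e9


def aggregate_throughput(n_sfis, per_sfi_bytes_per_s):
    """Total throughput under the assumption of ideal linear scaling."""
    return n_sfis * per_sfi_bytes_per_s


def building_rate_hz(throughput_bytes_per_s, event_size_bytes):
    """Event rate corresponding to a given throughput and event size."""
    return throughput_bytes_per_s / event_size_bytes


event_size = 1.5 * MB
per_sfi = 114 * MB                                     # measured on a dedicated SFI

total = aggregate_throughput(59, per_sfi)              # ideal aggregate, bytes/s
per_sfi_rate = building_rate_hz(per_sfi, event_size)   # Hz per SFI
total_rate = building_rate_hz(total, event_size)       # Hz for 59 SFIs
with_ef_output = 0.9 * total                           # after the quoted 10% decrease
required = 3e3 * event_size                            # ~3 kHz of 1.5 MB events

print(f"per SFI: {per_sfi_rate:.0f} Hz, farm: {total_rate / 1e3:.1f} kHz, "
      f"{total / GB:.1f} GB/s ideal, {with_ef_output / GB:.1f} GB/s with EF output, "
      f"{required / GB:.1f} GB/s required")
```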





## Double SFI Approach







- Exploit the availability of multi-core processors by running multiple SFI applications on the same node
  - extended network capability needed
- HLT node with quad-NIC board
  - dual-CPU quad-core 1.86GHz
  - 2 bonded interfaces toward ROSes
  - 2 bonded interfaces toward the EF
- Throughput with 32 ROSes, 1.1 MB event
  - Reference SFI 95 MB/s
  - Double SFI 2 x 80 MB/s = 160 MB/s



- Good performance, limited by the (output) bonding efficiency
  - Input only:  $2 \times 114 \text{ MB/s} = 228 \text{ MB/s}$
  - Foreseen working point <70 MB/s per SFI
  - Far from the CPU limit: dual-CPU dual-core
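The Double SFI numbers can be put side by side as follows. This is a plain arithmetic sketch on the slide values (95 MB/s reference, 2 x 80 MB/s double SFI, 2 x 114 MB/s input only, 70 MB/s foreseen working point); the variable names are illustrative.

```python
# Throughputs in MB/s, as quoted on the slide.
reference_sfi = 95.0        # single SFI on the node, 32 ROSes, 1.1 MB events
double_sfi_each = 80.0      # each of the two SFIs on the same node
input_only_each = 114.0     # each SFI when not sending data to the EF
working_point = 70.0        # foreseen upper bound per SFI

double_total = 2 * double_sfi_each
# Relative loss per application when two SFIs share one node.
degradation = 1 - double_sfi_each / reference_sfi
# Fraction of the input-only throughput retained with output enabled.
fraction_of_input_only = double_total / (2 * input_only_each)
# Margin of each SFI over the foreseen working point.
margin = double_sfi_each / working_point

print(f"total {double_total:.0f} MB/s, degradation {degradation:.1%}, "
      f"{fraction_of_input_only:.0%} of input-only, margin x{margin:.2f}")
```

The per-application loss of roughly 15% still leaves each SFI above the foreseen working point.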



## Including LVL2 load in the system



- Because of the limited computing resources, real LVL2 algorithms cannot be used
- Simulate the LVL2 load on network and ROSes with LVL2 processing units (L2PU) running a dummy algorithm, configurable in
  - RoI size
  - #RoIs
- Custom topology with EB driven by an independent, non-requesting LVL2 farm
  - decouple EB and LVL2
  - independently tune corresponding working points



Try to reach the highest L2 request rate on the ROSes, at a given RoI size, varying the number of RoIs per event

Measure the effects on the EB and on the ROSes



### LVL2 and Event Builder



- Not able to completely decouple Event Building and LVL2
- Observed decrease in the LVL2 rate is due to a decrease in the LVL1 rate
- System limits not exposed
  - Limited by the presently installed hardware
  - Correlation effect in the RoIB system


The system can always sustain the Event Builder in the explored space





#### Closer look at the ROSes







#### Closer look at the ROSes







### Conclusions



- 1/3 of the ATLAS Event Builder nodes are installed and tested
- ATLAS Event Builder is based on a pull protocol
  - Data Flow Manager (DFM) receives triggers and load-balances the building farm
  - Event Builder application (SFI) requests data from the ROSes, handles packet losses and traffic shaping, serves complete events to the Event Filter and to monitoring applications
- Extended Event Building tests exploiting HLT nodes
  - we exceed the required bandwidth with 2/3 of the building nodes
  - 10% degradation expected sending data to the Event Filter
- Successfully tested a multi-core machine running two SFI applications (Double SFI approach)
  - 15% degradation running two applications on a single node instead of two
  - still able to provide more than needed throughput per application
- Initial test of Event Builder performance including the LVL2 traffic did not expose major system limits
  - not yet able to reach the final load on the ReadOut System and the Event Builder mostly because of the limited available hardware
  - system to be extended in Spring 2008





# Backup Slides



# Data Acquisition Strategy



#### Based on three trigger levels

- LVL1: hardware trigger
- LVL2: 500 1U PC farm
  - Reconstruction within Region of Interest (RoI) defined by LVL1
- EF: 1900 1U PC farm
  - Complete event reconstruction

High Level Trigger (HLT): LVL2 and EF



# ReadOut System (ROS)



- 153 ROS PCs installed
  - 40 used for these tests
- 4U, 19" rack mountable PC
- Motherboard: Supermicro X6DHE-XB
- CPU: One 3.4 GHz Xeon
  - Hyper-threading not used
  - uni-processor kernel
- RAM: 512 MB
- Network:
  - 2 GE ports onboard
    - 1 used for control network
  - 4 GE ports on PCI-Express card
    - 1 used for LVL2 data
    - 1 used for event building
- Redundant power supply
- Network booted (no local hard disk)
- Remote management via IPMI





#### **SFIs**



#### 32 SFI PCs installed

- Final system ~100 SFIs
- 29 SFIs used in these tests
- 1U, 19" rack mountable PC
- Motherboard: Supermicro H8DSR-i
- CPU: AMD Opteron 252 2.6 GHz
  - SMP kernel
- RAM: 2 GB
- Network:
  - 2 GE ports onboard
    - 1 used for control network
    - 1 used for data-in
  - 1 GE port on PCI-Express card used for data-out
  - 1 dedicated IPMI port
- Cold-swappable power supply
- Network booted
- Local hard disk to store event data: only used for commissioning
- Remote management via IPMI





### **DFMs**



- 12 DFM PCs installed
  - Final system needs 1 DFM
  - 12 DFMs
    - run up to 12 TDAQ partitions in parallel
    - useful during commissioning
- Same PC as for SFI
- Network:
  - 2 GE ports onboard
    - 1 used for control network
    - 1 used for data network
  - 1 dedicated IPMI port
- Cold-swappable power supply
- Network booted
- Local hard disk (not used)
- Remote management via IPMI





### HLT



#### 124 HLT PCs installed

- 1U, 19" rack mountable PC
- CPU: Intel Xeon E5320 quad-core 1.86 GHz
  - SMP kernel
- RAM: 8 GB
- Network:
  - 2 GE ports onboard
    - 1 used for control network
    - 1 used for data-in
  - 1 dedicated IPMI port
- Network booted
- Local hard disk
- Remote management via IPMI





## Networking





- Force10 E1200
  - 6 blades x 4 optical 10GE ports
  - 2 blades x 48 copper GE ports
  - Up to 14 blades: 1260 GE ports total, 672 GE ports @ line speed
- Data network
  - Event builder traffic
  - LVL2 traffic



- Force10 E600
  - Up to 7 blades: 630 GE ports total, 336 GE ports @ line speed
- Data network
  - To Event Filter



- Force10 E600
  - Up to 7 blades: 630 GE ports total, 336 GE ports @ line speed
- Control network
  - Run Control
  - Databases
  - Monitoring samplers



### HLT nodes as SFIs







## Traffic shaping



Traffic shaping is achieved by limiting the number of outstanding requests per SFI

- For big event sizes and a large number of outstanding requests, the aggregated bandwidth drops
  - → packet loss and subsequent re-ask of data fragments
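One way to picture why large windows and big replies lead to packet loss is a worst-case buffer model. The sketch below is an assumption, not a description of the installed switches: it supposes that all outstanding replies reach the SFI's switch port at the same time, and both the 10 kByte reply size and the 256 kByte port buffer are hypothetical values chosen only for illustration.

```python
def lost_bytes(outstanding_requests, reply_size_bytes, port_buffer_bytes):
    """Worst-case bytes dropped if all replies hit the port buffer at once."""
    burst = outstanding_requests * reply_size_bytes
    return max(0, burst - port_buffer_bytes)


def largest_safe_window(reply_size_bytes, port_buffer_bytes):
    """Largest number of outstanding requests that cannot overflow the buffer."""
    return port_buffer_bytes // reply_size_bytes


# Hypothetical numbers, chosen only to illustrate the trend.
reply_size = 10_000       # bytes per ROS reply (assumed)
port_buffer = 256_000     # bytes of buffering on the SFI port (assumed)

for window in (10, 25, 40):
    dropped = lost_bytes(window, reply_size, port_buffer)
    print(f"{window:3d} outstanding requests -> up to {dropped} bytes lost")

print("largest safe window:", largest_safe_window(reply_size, port_buffer))
```

In this model a larger event size (bigger replies) or a larger window both push the burst over the buffer, and every dropped reply has to be re-asked, which lowers the aggregated bandwidth.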





# LVL2 request pattern



