


Use of FPGA embedded processors for fast cluster reconstruction in the NA62 liquid krypton electromagnetic calorimeter

2014 JINST 9 C01010 (http://iopscience.iop.org/1748-0221/9/01/C01010)


PUBLISHED BY IOP PUBLISHING FOR SISSA MEDIALAB



RECEIVED: November 15, 2013 ACCEPTED: December 9, 2013 PUBLISHED: January 7, 2014

TOPICAL WORKSHOP ON ELECTRONICS FOR PARTICLE PHYSICS 2013, 23–27 SEPTEMBER 2013, PERUGIA, ITALY

# Use of FPGA embedded processors for fast cluster reconstruction in the NA62 liquid krypton electromagnetic calorimeter

D. Badoni,<sup>e</sup> M. Bizzarri,<sup>b</sup> V. Bonaiuto,<sup>c</sup> B. Checcucci,<sup>d</sup> N. De Simone,<sup>c,1</sup> L. Federici,<sup>c</sup> A. Fucci,<sup>e</sup> G. Paoluzzi,<sup>e</sup> A. Papi,<sup>d</sup> M. Piccini,<sup>d</sup> A. Salamon,<sup>e</sup> G. Salina,<sup>e</sup> E. Santovetti,<sup>a</sup> F. Sargeni<sup>c</sup> and S. Venditti<sup>f</sup>

<sup>a</sup>University of Rome "Tor Vergata" - Department of Physics, Rome, Italy
<sup>b</sup>University of Perugia - Department of Physics, Perugia, Italy
<sup>c</sup>University of Rome "Tor Vergata" - Department of Electronic Engineering, Rome, Italy
<sup>d</sup>INFN - Sezione di Perugia, Perugia, Italy
<sup>e</sup>INFN - Sezione di Rome "Tor Vergata", Rome, Italy
<sup>f</sup>CERN, Genève, Switzerland

*E-mail:* nico.desimone@cern.ch

ABSTRACT: The goal of the NA62 experiment at the CERN SPS is the measurement of the branching ratio of the very rare kaon decay $K^+ \rightarrow \pi^+ \nu \bar{\nu}$ with a 10% accuracy by collecting 100 events in two years of data taking. An efficient photon veto system is needed to reject the $K^+ \rightarrow \pi^+ \pi^0$ background, and a liquid krypton electromagnetic calorimeter will be used for this purpose in the 1-10 mrad angular region. The L0 trigger system for the calorimeter consists of a peak reconstruction algorithm implemented on FPGAs by using a mixed parallel architecture based on soft-core Altera NIOS II embedded processors together with custom VHDL modules. This solution allows an efficient and flexible reconstruction of the energy-deposition peak. In total, the system will comprise 36 TEL62 boards, 108 mezzanine cards and 215 high-performance FPGAs. We describe the design, the current status and the results of the first performance tests.

KEYWORDS: Trigger concepts and systems (hardware and software); Digital electronic circuits; Calorimeters

<sup>1</sup>Corresponding author.

## Contents

- 1 Introduction
- 2 Trigger and data acquisition system
- 3 The Liquid Krypton electromagnetic calorimeter
- 4 The Liquid Krypton Level 0 trigger
  - 4.1 Trigger algorithm
  - 4.2 Trigger processor implementation
  - 4.3 Embedded processors for trigger logic
- 5 Performance tests
  - 5.1 Discussion
- 6 Conclusion

## 1 Introduction

NA62 [1] is an experiment at the CERN SPS that aims at a precise measurement of the branching ratio of the rare kaon decay $K^+ \rightarrow \pi^+ \nu \bar{\nu}$. This decay channel offers a clean theoretical environment for precise Standard Model (SM) predictions and is therefore also a probe for new physics. The current SM prediction is $BR(K^+ \rightarrow \pi^+ \nu \bar{\nu}) = (7.81 \pm 0.75 \pm 0.29) \cdot 10^{-11}$ [3], while the present experimental result is $BR(K^+ \rightarrow \pi^+ \nu \bar{\nu}) = (1.73 \pm 1.10) \cdot 10^{-10}$ [2]. NA62 is designed to improve on this measurement, achieving 10% accuracy by collecting about 100 events in two years of data taking.

The NA62 detector [4] (figure 1), currently being installed at the SPS North Area High Intensity Facility, is composed of a differential Cherenkov counter (CEDAR), a beam tracker (GTK), a charged-particle detector (CHANTI), a straw-chamber magnetic spectrometer, a photon veto system composed of different detectors covering the various angular decay regions, a RICH, a charged-particle hodoscope (CHOD) and a muon detector (MUV).

## 2 Trigger and data acquisition system

The CERN SPS $400\,\mathrm{GeV}/c$ primary beam will provide $3 \times 10^{12}$ protons per spill (4.8 s burst duration with a period of 16.8 s) impinging on a beryllium target. The selected $75\,\mathrm{GeV}/c$ secondary hadron beam will result in an instantaneous kaon rate of about 50 MHz. In order to extract the few interesting decays from such an intense flux, a complex, high-performance three-level trigger and data acquisition system was designed [5].



Figure 1. Schematic view of the NA62 detector.

The Level 0 (L0) trigger algorithm is based on a few sub-detectors (the charged hodoscope, the muon detector, the liquid krypton electromagnetic calorimeter and possibly the large-angle vetoes) and is performed by dedicated custom hardware modules, with a maximum output rate of 1 MHz and a maximum latency of 1 ms.

The data from each sub-detector, except the Liquid Krypton (LKr) calorimeter, are sent to a farm of PCs where the Level 1 (L1) and Level 2 (L2) software triggers are performed. L1 algorithms are run on the data of individual detectors. A positive L1 decision triggers the readout of the calorimeter data (which are kept in memories up to then) and, subsequently, L2 algorithms are executed on the complete event. The L1 trigger has a maximum output rate of 100 kHz and 1 s of total latency, while the L2 trigger has an output rate of the order of 15 kHz with a maximum total latency equal to the basic data-taking time unit, the period of the SPS beam-delivery cycle.

## 3 The Liquid Krypton electromagnetic calorimeter

In order to suppress the background from the $K^+ \rightarrow \pi^+ \pi^0$ decay, an efficient photon veto system is foreseen. The NA48 electromagnetic calorimeter [6] is used in the 1-10 mrad angular region. This calorimeter is a quasi-homogeneous ionization device that uses liquid krypton as the active medium and is characterized by excellent time and energy resolution.

The Liquid Krypton (LKr) calorimeter will be read out by the new Calorimeter REAdout Modules [7] (CREAMs), which will provide 40 MHz 14-bit sampling for all 13248 calorimeter readout channels, data buffering, optional zero suppression and programmable trigger sums for the L0 LKr calorimeter trigger processor.

## 4 The Liquid Krypton Level 0 trigger

The L0 LKr electromagnetic calorimeter trigger (figure 2) identifies electromagnetic clusters in the calorimeter and prepares a time-ordered list of reconstructed clusters together with the arrival time, position, and energy measurements of each cluster. Information on reconstructed clusters is used to veto decays with more than one cluster in the LKr calorimeter.



Figure 2. Segmentation of LKr electromagnetic calorimeter L0 trigger.



Figure 3. LKr electromagnetic calorimeter trigger segmentation.

The trigger processor also provides a coarse-grained readout of the LKr calorimeter that can be used in software triggers and off-line as a cross-check for the CREAM high-granularity readout.

### 4.1 Trigger algorithm

The trigger algorithm is based on energy deposits in tiles of 16 calorimeter cells, which are available from the CREAM readout boards. The electromagnetic cluster search is executed in two steps with two one-dimensional (1D) algorithms (figure 3).

The calorimeter is divided into slices parallel to the vertical axis. In the first step, peaks in space and time are searched for independently in each slice with a 1D algorithm. In the second step, different peaks which are close in time and space are merged and assigned to the same electromagnetic cluster.
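The second step can be illustrated with a minimal sketch. It is not the firmware logic: the `Peak` record, the greedy grouping strategy and the merge windows `max_ds`, `max_dy` and `max_dt` are assumptions introduced only for this example, and the peaks found by the first step are taken as given.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Peak:
    slice_idx: int  # index of the vertical slice where step 1 found the peak
    y: int          # tile position along the slice
    t: int          # time-sample index of the maximum
    energy: int     # tile energy at the maximum

def merge_peaks(peaks, max_ds=1, max_dy=1, max_dt=1):
    """Step 2: greedily group peaks that are close in space and time.

    The three windows are illustrative assumptions, not values from the text.
    """
    clusters = []
    for p in sorted(peaks, key=lambda q: (q.t, q.slice_idx, q.y)):
        for cluster in clusters:
            if any(abs(p.slice_idx - q.slice_idx) <= max_ds
                   and abs(p.y - q.y) <= max_dy
                   and abs(p.t - q.t) <= max_dt
                   for q in cluster):
                cluster.append(p)
                break
        else:
            clusters.append([p])  # no neighbour found: start a new cluster
    return clusters

def cluster_energy(cluster):
    """Sum of the peak energies assigned to one cluster."""
    return sum(p.energy for p in cluster)

# Three peaks in adjacent slices at the same height and time, plus one far in y.
peaks = [Peak(0, 1, 2, 40), Peak(1, 1, 2, 100), Peak(2, 1, 2, 40), Peak(2, 3, 2, 50)]
clusters = merge_peaks(peaks)
print([(len(c), cluster_energy(c)) for c in clusters])
```

In this toy input the first three peaks end up in one cluster and the fourth, two tiles away, forms its own. A single greedy pass does not re-join two clusters that a later peak happens to bridge; a real implementation would need a rule for that case.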

### 4.2 Trigger processor implementation

The main parameters driving the design of the processor are the expected high instantaneous hit rate (30 MHz), the required single cluster time resolution (1.5 ns) and a maximum allowed latency of 100  $\mu$ s from detector hit generation to trigger primitives output to the L0 trigger processor.

The processor is a three-layer parallel system, composed of Front-End and Concentrator boards, both based on the 9U TEL62 cards [8, 9] equipped with custom dedicated mezzanines (figure 4).

Figure 4. LKr trigger processor block diagram. 28 Front-End boards and 8 Concentrator boards are foreseen in the system.

The LKr L0 trigger continuously receives from the LKr readout modules (CREAMs) 864 trigger sums, each one corresponding to a tile of $4 \times 4$ calorimeter cells. Data transmission from the CREAM main digitizer boards to the trigger processor is performed over standard Ethernet cables (Cat. 6, length up to 15 m) with an effective data rate per lane of up to 720 Mbps (640 Mbps payload). Data are transmitted using standard embedded-clock serdes chips (DS92LV16) and received by the input mezzanine TELDES (TEL62 DESerializer) (figure 5). Each TELDES receives 16 Ethernet links, each providing a trigger sum.
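The quoted link rates follow from simple arithmetic, sketched below. The 16-bit word size and the 40 MHz rate come from the text; the overhead of 2 framing bits per word is an assumption about the embedded-clock serialization, chosen because it reproduces the quoted 720 Mbps line rate.

```python
SAMPLE_RATE_MHZ = 40   # trigger sums arrive at 40 MHz
WORD_BITS = 16         # payload bits per trigger sum
FRAMING_BITS = 2       # assumed embedded-clock overhead per word

def payload_mbps():
    """Useful data rate of one link, in Mbps."""
    return WORD_BITS * SAMPLE_RATE_MHZ

def line_mbps():
    """Raw line rate of one link including the assumed framing bits, in Mbps."""
    return (WORD_BITS + FRAMING_BITS) * SAMPLE_RATE_MHZ

LINKS_PER_TELDES = 16  # each TELDES receives 16 links

print(payload_mbps(), line_mbps())               # 640 720
print(LINKS_PER_TELDES * payload_mbps() / 1000)  # payload per TELDES in Gbps
```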

The processor input stage is composed of 28 Front-End boards. Each Front-End board receives 32 trigger sums as 16-bit values at 40 MHz from two TELDES mezzanines (figure 5). Each board performs a peak search in space and computes the time, position and energy of each detected peak. In order to extract timing information at the ns level, a parabolic interpolation in time around the sample maximum and a digital constant fraction discrimination are performed after the peak search algorithms. Information on reconstructed peaks is transferred from the Front-End boards to the Concentrator boards on low-latency, high-bandwidth dedicated trigger links. Raw data received from the readout modules are also stored in latency memories, to be read out upon request.

The Concentrator board receives trigger data from up to 8 Front-End boards and combines peaks detected by different Front-End boards into a single cluster. Overlap between neighbouring Concentrators is foreseen to guarantee that each cluster will be fully contained in at least one Concentrator board, with proper logic to avoid double counting. The reconstructed clusters are also stored in latency memories, to be read out upon request. Eight Concentrator boards equipped with 24 custom mezzanines are foreseen in the whole system.

High-speed, low-latency trigger data transmission from the Front-End to the Concentrator boards is performed by dedicated mezzanines (Trigger and Readout TX mezzanines and Trigger RX mezzanines, see figures 4 and 6).
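A parabolic interpolation around the sample maximum can be sketched as follows. This is the textbook three-point formula, offered as an illustration under the assumption of equally spaced samples; it is not taken from the firmware, and the function name and return convention are hypothetical. The 25 ns sample period corresponds to the 40 MHz sampling.

```python
SAMPLE_PERIOD_NS = 25.0  # one sample every 25 ns at 40 MHz

def parabolic_peak(samples, i):
    """Fit a parabola through samples[i-1], samples[i], samples[i+1].

    Returns (peak time in ns, interpolated amplitude). The index i is
    assumed to be the position of the sample maximum.
    """
    a, b, c = samples[i - 1], samples[i], samples[i + 1]
    denom = a - 2 * b + c
    if denom == 0:
        # Flat or perfectly linear triplet: no curvature to interpolate.
        return i * SAMPLE_PERIOD_NS, float(b)
    delta = 0.5 * (a - c) / denom          # vertex offset in sample units
    t_peak = (i + delta) * SAMPLE_PERIOD_NS
    amplitude = b - 0.25 * (a - c) * delta
    return t_peak, amplitude

# Samples drawn from y = 100 - 4 (x - 2.25)^2, so the true maximum is at x = 2.25.
samples = [80.0, 93.75, 99.75, 97.75, 87.75]
t_peak, amplitude = parabolic_peak(samples, 2)
print(t_peak, amplitude)
```

For this input the fit recovers the vertex a quarter of a sample after the maximum sample, which shows how sub-sample timing can be obtained from 25 ns samples.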

The Trigger and Readout TX mezzanines transmit up to 4.8 Gbps (48 bits at 100 MHz) over halogen-free individually shielded twisted pairs using the DS90CR485 serializer. The Trigger RX mezzanines receive and deserialize data using the DS90CR486 deserializer.

Readout data is transmitted over two standard gigabit Ethernet cables using an Altera IP MAC core together with an external PHY.



**Figure 5**. The TEL62 Deserializer Board. Each Ethernet connector receives two calorimeter readout channels. Sixteen equalizer and sixteen deserializer chips are also visible.



Figure 6. Trigger and Readout TX (left) and Trigger RX (right) mezzanine card prototypes.

### 4.3 Embedded processors for trigger logic

Highly selective L0 triggers traditionally require a careful implementation in dedicated high-speed logic. An FPGA-based design is a common choice that allows some degree of flexibility, but it remains far from the quick development, test and update possibilities of the software world. Additionally, development effort is often concentrated where timing performance is not crucial.

The L0 trigger of the NA62 LKr calorimeter is implemented with a combination of custom logic on Altera Stratix III FPGAs tightly coupled with NIOS II embedded processors [10]. The NIOS II we used is the "fast" version, aimed at high-performance applications. It allows operation above 250 MHz (240 MHz is used in this work) with a performance of over 300 MIPS, and it is optimized for performance-critical applications as well as for applications with large amounts of data.

Software written in standard C implements part of the peak-reconstruction algorithm. This makes it possible to tune the partitioning between software and hardware, trading off execution time, development time and validation time. Higher performance is also easily achievable by using a multiprocessor architecture.

The entire architecture fits well inside the Altera Stratix III FPGA used (EP3SL110) (see figure 7). The code running on the NIOS II processor has been optimized in order to reduce the size of the processor on-chip instruction RAM (e.g. it can fit in M9K memory blocks instead of using the scarcer M144K blocks).

| Flow Summary              |                                           |
|---------------------------|-------------------------------------------|
| Flow Status               | Successful - Mon Oct 21 12:43:47 2013     |
| Quartus II 32-bit Version | 12.0 Build 178 05/31/2012 SJ Full Version |
| Revision Name             | pp_fpga                                   |
| Top-level Entity Name     | pp_fpga                                   |
| Family                    | Stratix III                               |
| Device                    | EP3SL110F1152C4                           |
| Timing Models             | Final                                     |
| Logic utilization         | 32 %                                      |
| Combinational ALUTs       | 20,252 / 86,000 ( 24 % )                  |
| Memory ALUTs              | 32 / 43,000 ( < 1 % )                     |
| Dedicated logic registers | 17,020 / 86,000 ( 20 % )                  |
| Total registers           | 17020                                     |
| Total pins                | 581 / 744 ( 78 % )                        |
| Total virtual pins        | 0                                         |
| Total block memory bits   | 982,170 / 4,303,872 ( 23 % )              |
| DSP block 18-bit elements | 16 / 288 ( 6 % )                          |
| Total PLLs                | 1 / 8 ( 13 % )                            |
| Total DLLs                | 0 / 4 ( 0 % )                             |

**Figure 7**. Altera Quartus flow summary report for the test system with 4 NIOS II cores on a Stratix III Altera FPGA (EP3SL110).



Figure 8. Scheme of the performance test.



**Figure 9**. The peak reconstruction algorithm. Left: the peak-finder logic, implemented in VHDL, is a pipelined stage performing peak recognition with the criteria *peak in time, peak in space* and *over threshold*. Right: NIOS II software-based parabolic fit and fine estimation of the peak rising time.

## 5 Performance tests

We present the results of the first tests aimed at verifying that the designed architecture meets the performance requirements for the L0 trigger processing of the NA62 LKr calorimeter.

**Figure 10**. In red, the normalized distribution of the NIOS II processing time for the peak-fitting algorithm with a sample of 1000 events. The other colors show how I/O and different mathematical operations contribute to the total processing time. Vertical lines indicate worst cases.

The architecture of the test, performed on a single Pre-Processing FPGA, is shown in figure 8. It has been designed to test the stand-alone trigger processor by providing dummy calorimeter data from an internal memory. The Experiment Control System (ECS) of NA62 is a standardized system to access firmware registers, FIFOs and RAMs from a PC platform, implemented through the PCI interface of the Credit-Card PC on board the TEL62. In order to control the tests and access the results, the ECS has been connected to the memory holding the input data, to the configuration registers and to the performance counters. The mux in figure 8 forwards the input data, coming either from the test memory or from the TELDES, to the processing firmware, making it possible to switch between real data and dummy data at any moment. The data are 8 channels of 16-bit ADC values at 40 MHz. The pipelined peak-finder logic (VHDL) performs, for each of the 8 input channels, a peak recognition based on the criteria *peak in time, peak in space* and *over threshold*, as shown in the left part of figure 9. Peaks are identified on each tile with the two vertical<sup>1</sup> neighbors and on 4 consecutive time slices. The peak is therefore fully described by 240 bits.
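The three peak-finder criteria can be expressed in software as a rough sketch. The exact comparison rules of the VHDL are not given in the text, so several choices here are assumptions: the candidate maximum is placed at the second of the 4 time slices, ties are broken by using a strict comparison on one side only, and tiles beyond the edge of the 8-channel block are treated as empty.

```python
def is_peak(window, threshold):
    """Apply the three criteria to a 3-tile x 4-slice window.

    window[k][j]: tile k in (above, centre, below), time slice j in 0..3.
    The candidate maximum is assumed to sit at slice 1 of the centre tile.
    """
    above, centre, below = window
    e = centre[1]
    peak_in_time = centre[0] < e and e >= centre[2]
    peak_in_space = above[1] < e and e >= below[1]
    over_threshold = e > threshold
    return peak_in_time and peak_in_space and over_threshold

def find_peaks(channels, threshold):
    """channels[ch][t]: vertically adjacent tiles; returns (ch, t) of each peak."""
    hits = []
    n_ch, n_t = len(channels), len(channels[0])
    empty = [0] * n_t  # assumed content of tiles outside this block
    for ch in range(n_ch):
        above = channels[ch - 1] if ch > 0 else empty
        below = channels[ch + 1] if ch < n_ch - 1 else empty
        for t in range(1, n_t - 2):
            # Four consecutive time slices: t-1, t, t+1, t+2.
            window = [row[t - 1:t + 3] for row in (above, channels[ch], below)]
            if is_peak(window, threshold):
                hits.append((ch, t))
    return hits

# Eight channels, eight time slices, one shower centred on channel 3 at t = 3.
channels = [[0] * 8 for _ in range(8)]
channels[3][2:6] = [30, 100, 60, 20]
channels[2][3] = 40
channels[4][3] = 50
print(find_peaks(channels, threshold=20))
```

Only the central tile at its maximum sample passes all three criteria; the two neighbouring tiles fail the *peak in space* test and the other samples of the pulse fail the *peak in time* test.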

Peak data enter a load-balancing logic block that delivers the data to four NIOS II cores, which perform a parabolic fit and a fine estimation of the rising time of the peak (see the right part of figure 9). In this first test we chose to implement a simple round-robin scheduling algorithm: the first peak goes to the first NIOS II, the second peak to the second NIOS II and so on, going back to the first NIOS II for the fifth peak.
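The round-robin assignment can be written in a few lines. This is a software illustration of the scheduling rule only; the per-core queues are a hypothetical stand-in for the hardware interface between the load-balancing block and the NIOS II cores.

```python
from itertools import cycle

def round_robin(peaks, n_cores=4):
    """Assign peaks to cores in strict rotation: 0, 1, ..., n_cores-1, 0, ..."""
    queues = [[] for _ in range(n_cores)]
    for core, peak in zip(cycle(range(n_cores)), peaks):
        queues[core].append(peak)
    return queues

# Six peaks labelled 1..6: the fifth goes back to the first core.
print(round_robin([1, 2, 3, 4, 5, 6]))  # [[1, 5], [2, 6], [3], [4]]
```

Round-robin needs no knowledge of how busy each core is, which keeps the dispatch logic simple; it balances the number of peaks per core rather than the actual processing time.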

The performance was measured with counters that record the processing time of the NIOS II cores. As shown in the normalized distributions in figure 10, multiple measurements were performed with different programs running on the NIOS II: the complete peak reconstruction or, in addition to the I/O operations, different and increasingly complex mathematical operations. The results agree with expectations, such as a higher cost for division compared to other operations. The algorithm has therefore been designed to minimize its computational cost: e.g. the fine-time reconstruction of the peak rising time is calculated by constructing a linear approximation between the two data samples on the rising edge of the peak and by finding the time at which it crosses a threshold level (a fraction of the peak value). The width of the distributions corresponds to a variation in the algorithm latency, to be attributed to the bit-banging technique used, in the current implementation, to interface control signals between the NIOS II and the external on-chip logic. The distributions show no tails, so the maximum (worst-case) processing time for this test can be determined as 139 clock cycles. Considering the 240 MHz NIOS II system frequency, this is equivalent to a 1.9 MHz processing rate per core, hence a total of 7.6 MHz with 4 cores processing in parallel.

<sup>1</sup>The remaining horizontal dimension is handled on the concentrator boards, not included in this test.
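The linear threshold-crossing estimate described above can be sketched as follows. The function name, the default fraction of 0.5 and the floating-point arithmetic are assumptions for illustration; the software running on the NIOS II may well use a different fraction and integer arithmetic.

```python
def crossing_time(samples, i_peak, fraction=0.5, period_ns=25.0):
    """Time at which the rising edge crosses fraction * peak value.

    Walks back from the peak to the pair of samples that brackets the
    threshold, then interpolates linearly between them. Returns None if
    the rising edge is not contained in the samples.
    """
    level = fraction * samples[i_peak]
    i = i_peak
    while i > 0 and samples[i - 1] >= level:
        i -= 1
    if i == 0:
        return None
    lo, hi = samples[i - 1], samples[i]  # lo < level <= hi
    return (i - 1 + (level - lo) / (hi - lo)) * period_ns

samples = [0, 10, 40, 100, 60]
print(crossing_time(samples, 3, fraction=0.25))  # 37.5
print(crossing_time(samples, 3, fraction=0.5))
```

The estimate needs a single division per peak, which is consistent with the stated aim of keeping costly operations to a minimum.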

### 5.1 Discussion

Performance results must be conservatively compared with the rate of incoming peaks from the calorimeter, that is, the output rate of the peak-finder logic. We therefore considered the maximum instantaneous hit rate on the LKr calorimeter of 30 MHz and assumed that each hit produces a wide cluster of 256 calorimeter cells<sup>2</sup>. By using simulations to estimate the spatial non-uniformity of the peak rate, we estimated, for the tiles read by a Pre-Processing FPGA, a worst-case scenario of 4.2 MHz peak rate in the calorimeter center. This is significantly smaller than the measured performance of 7.6 MHz. The proposed architecture for the LKr L0 trigger can therefore sustain the estimated worst-case incoming peak rate.
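As a rough check of the margin, using only the rates quoted above and assuming that round-robin scheduling spreads the peaks evenly over the four cores (queueing effects from bursts of peaks are ignored in this estimate):

$$
\frac{7.6\,\mathrm{MHz}}{4.2\,\mathrm{MHz}} \approx 1.8, \qquad \frac{4.2\,\mathrm{MHz}}{4} = 1.05\,\mathrm{MHz\ per\ core} < 1.9\,\mathrm{MHz\ per\ core}.
$$

Under the same assumption, three cores alone would provide $3 \times 1.9\,\mathrm{MHz} = 5.7\,\mathrm{MHz}$, still above the 4.2 MHz worst case.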

## 6 Conclusion

A fast parallel architecture, based on a mixture of VHDL design and NIOS II processors, has been designed for cluster reconstruction and counting in the LKr electromagnetic calorimeter of the NA62 experiment. The test results presented here show that the L0 trigger system fully meets the timing and bandwidth requirements of the experiment. More extensive tests to stress the system capabilities are under way and will include communication between different TEL62 boards. The system will be commissioned in the last part of 2014, ready for data taking between the end of 2014 and the beginning of 2015.

## References

- [1] NA62 collaboration, *Proposal to Measure the Rare Decay* $K^+ \rightarrow \pi^+ \nu \bar{\nu}$ *at the CERN SPS*, CERN-SPSC-2005-013.
- [2] E949 collaboration, A. Artamonov et al., *New measurement of the* $K^+ \rightarrow \pi^+ \nu \bar{\nu}$ *branching ratio*, *Phys. Rev. Lett.* **101** (2008) 191802 [arXiv:0808.2459].
- [3] J. Brod, M. Gorbahn and E. Stamou, *Two-Loop Electroweak Corrections for the* $K \rightarrow \pi \nu \bar{\nu}$ *Decays*, *Phys. Rev.* **D 83** (2011) 034030 [arXiv:1009.0947].
- [4] NA62 collaboration, *NA62 Technical Design*, NA62-10-07 (2010).
- [5] M. Sozzi, *A concept for the NA62 Trigger and Data Acquisition*, NA62-07-03 (2007).
- [6] NA48 collaboration, V. Fanti et al., *The Beam and detector for the NA48 neutral kaon CP-violations experiment at CERN*, *Nucl. Instrum. Meth.* **A 574** (2007) 433.
- [7] S. Venditti et al., *The new NA62 LKr readout: first tests and future perspectives*, in *Topical Workshop on Electronics for Particle Physics 2013*, 23–27 September 2013, Perugia, Italy.
- [8] E. Pedreschi et al., *Firmware approach for TEL62 trigger and data acquisition board*, in *Topical Workshop on Electronics for Particle Physics 2012*, 17–21 September 2012, Oxford, U.K.
- [9] B. Angelucci et al., *The FPGA based Trigger and Data Acquisition system for the CERN NA62 experiment*, in *Topical Workshop on Electronics for Particle Physics 2013*, 23–27 September 2013, Perugia, Italy.
- [10] http://www.altera.com.

 $<sup>^{2}</sup>$ A realistic estimation would consider that only about 20% of the hit are due to photons and produce clusters and that their dimension is expected to be smaller than 256 calorimeter cells.