Next generation associative memory devices for the FTK tracking processor of the ATLAS experiment

Matteo Beretta

INFN Frascati National Laboratories

TWEPP 2013 23-27 September, Perugia


### 1 ATLAS FTK: architecture and working principle

### 2 AMchip history

### 3 Evolution from AMchip04 towards AMchip05

### 4 Future plans

### 5 Conclusions


# ATLAS FTK architecture<sup>a</sup>

<sup>a</sup>Design of a Hardware Track Finder (Fast Tracker) for the ATLAS Trigger

To be presented by Guido VOLPI on 25 Sep 2013 from 15:15 to 15:40




# AMchip for the ATLAS FTK: working principle



- 1 flip-flop (FF) per layer stores the layer match
- All patterns are compared in parallel with incoming data (HIT)
- Fast pattern matching and flexible input
- The AM readout is based on a modified Fischer Tree<sup>1</sup>

<sup>1</sup> P. Fischer, NIM A 461 (2001) 499-504
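
The working principle listed above can be sketched in software. This is a minimal behavioural model, not the chip logic: the class name `AssociativeMemory`, the choice of eight layers (suggested by the eight hit buses and the thresholds quoted later in the talk) and the example pattern values are assumptions made for illustration. Each pattern keeps one flag per layer, every incoming hit is compared against all patterns, and a pattern fires when enough layer flags are set.

```python
from typing import List, Sequence

N_LAYERS = 8  # assumption: eight detector layers per pattern


class AssociativeMemory:
    """Behavioural sketch of an associative memory (hypothetical model)."""

    def __init__(self, patterns: Sequence[Sequence[int]], threshold: int = N_LAYERS):
        self.patterns = [tuple(p) for p in patterns]
        self.threshold = threshold
        self.reset()

    def reset(self) -> None:
        # One flip-flop per layer for every stored pattern.
        self.ff = [[False] * N_LAYERS for _ in self.patterns]

    def load_hit(self, layer: int, hit: int) -> None:
        # In hardware all patterns compare at once; here we loop over them.
        for flags, pattern in zip(self.ff, self.patterns):
            if pattern[layer] == hit:
                flags[layer] = True

    def matched_patterns(self) -> List[int]:
        # A pattern matches when at least `threshold` layer flags are set.
        return [i for i, flags in enumerate(self.ff) if sum(flags) >= self.threshold]


# Usage: three stored patterns, hits arriving in arbitrary layer order.
am = AssociativeMemory(
    [(1, 2, 3, 4, 5, 6, 7, 8), (1, 2, 3, 4, 5, 6, 7, 9), (9, 9, 9, 9, 9, 9, 9, 9)],
    threshold=7,
)
for layer, hit in [(7, 8), (0, 1), (3, 4), (1, 2), (2, 3), (6, 7), (5, 6), (4, 5)]:
    am.load_hit(layer, hit)
result = am.matched_patterns()
```

Because the flags are set independently per layer, the order in which hits arrive does not matter, which is one way to read the "flexible input" point above.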

## AMchip evolution





- 1990s: full-custom VLSI chip, 0.7 μm (INFN-Pisa), 128 patterns, 6 × 12-bit words each (F. Morsani et al., The AMchip: a Full-custom MOS VLSI Associative memory for Pattern Recognition, IEEE Trans. on Nucl. Sci., vol. 39, pp. 795-797, 1992).
- 1998: FPGA for the same AMchip (P. Giannetti et al., A Programmable Associative Memory for Track Finding, Nucl. Instr. and Meth., vol. A413/2-3, pp. 367-373, 1998).
- 1999: G. Magazzù, first standard cell project presented at LHCC.
- 2006: standard cell UMC 0.18 μm, 5000 patterns/AMchip for the CDF SVT upgrade, 6M patterns in total (L. Sartori, A. Annovi et al., A VLSI Processor for Fast Track Finding Based on Content Addressable Memories, IEEE TNS, Vol 53, Issue 4, Part 2, Aug. 2006).
- 2012: AMchip04, 8k patterns in 14 mm<sup>2</sup>, TSMC 65 nm LP technology. Power/pattern/MHz 40 times lower. Pattern density ×12. First variable resolution implementation (F. Alberti et al 2013 JINST 8 C01040, doi:10.1088/1748-0221/8/01/C01040).
- 2013: AMchip MiniAsic and AMchip05, a further step towards the final AMchip version. Serialized input and output buses at 2 Gbps, further power reduction. BGA 23 × 23 package.
- 2014: AMchip06, final version of the AMchip for the ATLAS experiment.

# New AMchip package



- Up to now everything was done using the PQ208 package, in which the 14 mm<sup>2</sup> AMchip04 fits. Only a few pins are dedicated to VDD and GND.
- AMchip06 will be larger than 180 mm<sup>2</sup> and needs a larger package: BGA 23 × 23 pins. Many pins go to the different VDD domains and ground, so few pins remain for signals; instead of parallel buses we have to use serialized input and output.

## AMchip04 prototype characterization

The power consumption of the AMchip04 with 8k patterns is:

|                                   | Measured @ 100MHz | extrapolated to 128K |
|-----------------------------------|-------------------|----------------------|
| Baseline, leakage (mA)            | 7                 | 112                  |
| clock distribution (mA)           | 30                | 480                  |
| std, not bitline propagation (mA) | 6                 | 96                   |
| bitline propagation (mA)          | 82                | 1312                 |
| AM cells (mA)                     | 70                | 1120                 |
| Total Core (mA)                   | 195               | 3120                 |
| Voltage (V)                       | 1.2               | 1.2                  |
| Total Core (W)                    | 0.234             | 3.744                |
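
The extrapolated column can be reproduced with simple arithmetic. The sketch below assumes, as the table values imply, that every current contribution scales linearly with the number of patterns (a factor 128k / 8k = 16); the dictionary keys and function names are illustrative only.

```python
# Measured AMchip04 core currents at 100 MHz with 8k patterns, in mA (from the table).
MEASURED_MA = {
    "baseline, leakage": 7,
    "clock distribution": 30,
    "std, not bitline propagation": 6,
    "bitline propagation": 82,
    "AM cells": 70,
}
SCALE = 128 // 8       # 8k -> 128k patterns, assumed linear scaling
CORE_MILLIVOLT = 1200  # 1.2 V core supply


def extrapolate(measured_ma: dict, scale: int) -> dict:
    """Scale each current contribution by the pattern-count ratio."""
    return {name: value * scale for name, value in measured_ma.items()}


def core_power_mw(currents_ma: dict, millivolt: int) -> float:
    """Total core power in mW: sum of currents (mA) times voltage (V)."""
    return sum(currents_ma.values()) * millivolt / 1000


extrapolated_ma = extrapolate(MEASURED_MA, SCALE)
measured_mw = core_power_mw(MEASURED_MA, CORE_MILLIVOLT)          # 0.234 W
extrapolated_mw = core_power_mw(extrapolated_ma, CORE_MILLIVOLT)  # 3.744 W
```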



#### Meets original goal of 4W/128k patterns.

Including the power consumption of the many FPGAs, we expect > 5 kW/crate  $\Rightarrow$  additional AMchip R&D is needed to reduce the total power consumption.


# AMchip04 CAM layer architecture

Each AMchip04 CAM layer is composed of 4 NAND type cells (9 transistors each) and 14 NOR type cells (10 transistors each). To save power we have combined two different match line driving schemes<sup>2</sup>:

- Current race scheme
- Selective precharge scheme



Each layer stores a word: 12 bits + 3 ternary values (0, 1, X)

<sup>2</sup> K. Pagiamtzis and A. Sheikholeslami, IEEE JSSC 41 (2006) 712-727
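
The selective precharge idea can be illustrated with a purely functional sketch: a few bits are compared first, and the rest of the word is evaluated only if that first stage matches, so most non-matching words never activate the larger second stage. The mapping used here (the 4 NAND cells form the first stage on the most significant bits, the 14 NOR cells form the second stage) and the function name are assumptions for illustration; the code models the logic only, not the current race circuit or the ternary cells.

```python
WIDTH = 18             # 4 NAND + 14 NOR cells per layer
FIRST_STAGE_BITS = 4   # assumption: the NAND cells form the first stage


def selective_precharge_match(stored: int, incoming: int):
    """Return (matched, second_stage_evaluated) for one stored word."""
    shift = WIDTH - FIRST_STAGE_BITS
    # First stage: compare only the top bits.
    if (stored >> shift) != (incoming >> shift):
        return False, False  # second stage never activated
    # Second stage: compare the remaining bits.
    low_mask = (1 << shift) - 1
    return (stored & low_mask) == (incoming & low_mask), True


word = (0b1010 << 14) | 5
full_match = selective_precharge_match(word, (0b1010 << 14) | 5)
early_reject = selective_precharge_match(word, (0b1011 << 14) | 5)
late_reject = selective_precharge_match(word, (0b1010 << 14) | 6)
```

With uniformly random data and a 4-bit first stage, only about one stored word in 16 would reach the second stage in this model.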

- Reduce the full-custom core power supply voltage from 1.2V to 0.8V
- Reduce the CAM layer matchline capacitance
- Reduce the bitline capacitance (length)
- Reduce the bitline swing voltage from 1.2V to 0.8V
- Try to reduce the std. logic supply voltage from 1.2V to 1.0V.

What we want to avoid is reducing the clock frequency, which must remain 100MHz.
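
The reason these knobs matter can be seen from the usual first-order CMOS dynamic power model, P ∝ C · V² · f. The sketch below applies that textbook model to the voltage changes listed above; it is not the authors' simulation, and the function name and the ratio parameters are illustrative assumptions.

```python
def dynamic_power_ratio(v_new: float, v_old: float,
                        c_ratio: float = 1.0, f_ratio: float = 1.0) -> float:
    """First-order dynamic power ratio: (C_new/C_old) * (V_new/V_old)^2 * (f_new/f_old)."""
    return c_ratio * (v_new / v_old) ** 2 * f_ratio


# Full-custom core: 1.2 V -> 0.8 V at unchanged capacitance and frequency.
core_ratio = dynamic_power_ratio(0.8, 1.2)       # about 0.44
# Standard-cell logic: 1.2 V -> 1.0 V.
std_logic_ratio = dynamic_power_ratio(1.0, 1.2)  # about 0.69
```

In this simple model, keeping the clock at 100 MHz means all of the saving has to come from the voltage and capacitance terms.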

# AMchip05 CAM layer architecture

Each AMchip05 CAM layer is composed of 6 NAND type cells (9 transistors each) and 12 NOR type cells (9 transistors each). This layer is supplied at 0.8V. To maintain the speed at 100MHz we have introduced a second current generator.

- Current race scheme
- Selective precharge scheme



With respect to the previous version there is an increase of about 12% in the area.


# XOR+RAM

XOR+RAM is an alternative to the AMchip04 cells, with a more digital approach: XOR + RAM.

Features:

- Concatenation of adjacent layers for 4 bus / 32 bit mode
- All bits are configurable as ternary logic
- 2% area reduction
- Simplified logic interface (timing)

This architecture is currently implemented in a miniasic that is under test.
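
At the logic level, an XOR-based comparison with per-bit ternary configuration can be sketched as below. The stored word and a "care" mask stand for the RAM content, the XOR flags differing bits, and bits whose mask is 0 behave as don't care (X). The function name and the value/mask encoding are assumptions for illustration, not the cell's actual storage format.

```python
def xor_match(stored: int, care_mask: int, incoming: int) -> bool:
    """True when incoming equals stored on every bit where care_mask is 1."""
    return ((stored ^ incoming) & care_mask) == 0


exact = xor_match(0b1010, 0b1111, 0b1010)       # all bits compared, equal
mismatch = xor_match(0b1010, 0b1111, 0b1011)    # last bit differs
dont_care = xor_match(0b1010, 0b1110, 0b1011)   # last bit set to X
```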









Fig. 7. Layout of the XOR-based CAM cell in 65 nm CMOS technology


# XOR+RAM and LV AMcell power consumption

Power consumption at 0.8V.

|                                   | Extrapolated to 128K    |
|-----------------------------------|-------------------------|
| Baseline, leakage (mA)            | 112                     |
| clock distribution (mA)           | 720                     |
| std, not bitline propagation (mA) | 144                     |
| bitline propagation (mA)          | 1278                    |
| AM cells (mA)                     | 377                     |
| Total Core (mA)                   | 2631                    |
| Voltage (V)                       | 0.8                     |
| Total Core (W)                    | 2.1                     |

#### LV AM cell

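|                                   | Extrapolated to 128K |
|-----------------------------------|----------------------|
| Baseline, leakage (mA)            | 112                  |
| clock distribution (mA)           | 720                  |
| std, not bitline propagation (mA) | 144                  |
| bitline propagation (mA)          | 1022                 |
| AM cells (mA)                     | 524                  |
| Total Core (mA)                   | 2523                 |
| Voltage (V)                       | 0.8                  |
| Total Core (W)                    | 2.02                 |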

The values reported in the tables are extrapolated from simulations. This architecture seems to be very promising and is currently under test in a miniasic.

From 60% to 70% power saving with respect to AMchip04.

A more conservative power consumption estimate at 1.0V is 2.8W (-25% w.r.t. the AMchip04 extrapolation).
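
The totals in the two tables can be cross-checked against the AMchip04 extrapolation (3120 mA at 1.2 V, i.e. 3744 mW). The sketch below assumes that the first table refers to the XOR+RAM cell, as the slide title suggests, and only reproduces the row sums and the ratio of total core power; the slides do not spell out which quantities enter the 60% to 70% figure, so that number is not recomputed here. For the LV AM cell the rows sum to 2522 mA, one less than the 2523 mA printed, presumably a rounding effect.

```python
AMCHIP04_CORE_MW = 3744  # 3120 mA at 1.2 V, extrapolated to 128k patterns

# Extrapolated currents in mA at 0.8 V: leakage, clock, std logic, bitlines, AM cells.
XOR_RAM_MA = [112, 720, 144, 1278, 377]   # assumption: first table is XOR+RAM
LV_AM_CELL_MA = [112, 720, 144, 1022, 524]


def core_power_mw(currents_ma, millivolt=800):
    """Total core power in mW from a list of currents in mA."""
    return sum(currents_ma) * millivolt / 1000


xor_ram_mw = core_power_mw(XOR_RAM_MA)        # about 2105 mW
lv_cell_mw = core_power_mw(LV_AM_CELL_MA)     # about 2018 mW
xor_ram_fraction = xor_ram_mw / AMCHIP04_CORE_MW
lv_cell_fraction = lv_cell_mw / AMCHIP04_CORE_MW
conservative_saving = 1 - 2800 / AMCHIP04_CORE_MW  # 2.8 W estimate at 1.0 V
```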

# SERDES IO requirements

To simplify the board routing we have replaced the parallel input and output data buses with high-speed serializers and deserializers.

The main features required for the SERDES are:

- data rate at least 2Gbps
- separate serializer and deserializer macro
- 32bit input/output bus
- driver and receiver circuits compatible with LVDS standard
- 8b/10b encode/decode capabilities
- comma detection and word alignment
- BIST capabilities for fast debugging
- Low power

We have bought a SERDES core from Silicon Creations. To test this core we have designed a miniasic with 5 DES, 1 SER, their control logic and our AM memory core with only a few banks. This chip is currently under test.
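
The data-rate requirement can be related to the word rate with a short calculation. 8b/10b coding sends 10 line bits for every 8 payload bits, so a 2 Gbps link carries 1.6 Gbps of payload. The word widths used below (16 bits, the AMchip05 SERDES I/O width quoted elsewhere in this talk, and 32 bits, the macro bus width listed above) are taken from the slides; how they map onto actual link traffic is an assumption of this sketch.

```python
LINE_RATE_BPS = 2_000_000_000  # 2 Gbps serial line rate


def payload_rate_bps(line_rate_bps: int) -> int:
    """Payload bit rate after 8b/10b coding (8 data bits per 10 line bits)."""
    return line_rate_bps * 8 // 10


def words_per_second(line_rate_bps: int, word_bits: int) -> int:
    """Number of payload words of the given width carried per second."""
    return payload_rate_bps(line_rate_bps) // word_bits


payload = payload_rate_bps(LINE_RATE_BPS)        # 1.6 Gbps
rate_16 = words_per_second(LINE_RATE_BPS, 16)    # 100 M words/s
rate_32 = words_per_second(LINE_RATE_BPS, 32)    # 50 M words/s
```

Under these assumptions one 16-bit word per link per clock cycle lines up with the 100 MHz clock mentioned earlier.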






# Miniasic tests: first results

Tests performed up to now:

- JTAG chip programming: OK
- SERDES initialization and PLL locking: OK
- JTAG pattern writing: OK
- pattern matching test: OK
- SERDES BIST: OK
- DES 8b/10b decode, comma detection and word alignment: OK
- SER data encode and transmission: OK
- BERT: In progress
- Full serial link characterization: to be done
- Power consumption measurement: to be done

The miniasic with SERDES works fine.



Figure: Miniasic TX Eye diagram at 2Gbps


# AMchip05 logic and simulation

- The AMchip05 MPW VHDL is a bigger version of the MiniASIC (submitted in March) plus a few improvements and fixes
- New features with respect to AMchip04:
  - SERDES I/O @ 16 bits with 2 don't-care (DC) bits (AMchip04 was 15 bits with 3 DC; internally it is always 18 bits with configurable DC)
  - Two pattern inputs, one pattern output (merge of pattern streams)
  - 1-layer match threshold (other thresholds: never, 8, 7, 6, always)
  - Double width mode (4 buses, 32 bits)
  - Optional continuous readout mode (AMchip04 was event based only)
- The VHDL is in very good shape (last fixes)
  - Partially rewritten to be very modular: AMchip06 will be straightforward.
- Simulation code rewritten in SystemC
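
The word widths quoted above are consistent with a simple bookkeeping rule, sketched here under the assumption that each don't-care-capable bit occupies two internal bits while an ordinary bit occupies one. The function name and the `cells_per_dc` parameter are illustrative; the slides only state the external widths and the 18-bit internal width.

```python
def internal_bits(io_bits: int, dc_bits: int, cells_per_dc: int = 2) -> int:
    """Internal width: plain bits use one cell, DC-capable bits use cells_per_dc."""
    return (io_bits - dc_bits) + cells_per_dc * dc_bits


amchip04_bits = internal_bits(io_bits=15, dc_bits=3)  # 12 + 2*3
amchip05_bits = internal_bits(io_bits=16, dc_bits=2)  # 14 + 2*2
```

Both cases give 18, which also equals the number of cells per CAM layer quoted earlier (4 NAND + 14 NOR for AMchip04, 6 NAND + 12 NOR for AMchip05).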


# Modular FloorPlan AMCHIP 05 MPW



- LVDS @ 2 Gbps: 11 SERDES (2 pattern in, 1 pattern out, 8 hit buses)
- LVDS @ 100 MHz: CLK
- Single-ended control signals: JTAG, Init, Dtest, Holds


# Future plans

- Finish the AMchip Miniasic characterization
- Finalize the AMchip05 floorplan and layout
- LV CAM cell characterization
- XOR+RAM characterization
- Begin the AMchip06 floorplan

### Future evolutions in 2.5D





AMchip04 has been designed to be horizontally symmetric.

- In/out buses for pattern output pipeline can change direction
- Buses are swapped internally to maintain consistency

Symmetry helps in designing and routing mezzanines for 2D chips, but also enables vertical stacking:

Figure: 2.5D arrangement (diagram labels: A B A B A)


### Future evolutions in 3D




# Conclusions

AMchip04 and AMchip05 represent a major improvement in the AMchip family

- Mixed full-custom / standard cell design
  - CAM blocks as full-custom hard blocks, optimized for low power consumption and area efficiency
  - Control logic in standard cells (easy to develop and debug)
- Introduction of SERDES in AMchip05 instead of parallel buses
- Power consumption reduction using a Low Voltage memory core
- Designed with future evolution in mind (vertical stacking, full 3D design)
- First prototype AMchip06 will be available in early 2014
- Design and production of the final version of AMchip06 in 2014