
MTIA: First Generation Silicon Targeting Meta’s Recommendation

Systems
Amin Firoozshahian Joe Shajrawi Jordan Fix
Joel Coburn Kevin Quinn Hangchen Yu
Roman Levenstein Nagesh Sreedhara Richard Li
Rakesh Nattoji Pankaj Kansal Kaustubh Gondkar
Ashwin Kamath Willie Wei Jack Montgomery
Olivia Wu Dheepak Jayaraman Mike Tsai
Gurdeepak Grewal Linda Cheng Saritha Dwarakapuram
Harish Aepala Pritam Chopda Sanjay Desai
Bhasker Jakka Eric Wang Nili Avidan
Bob Dreyer Ajay Bikumandla Poorvaja Ramani
Adam Hutchin Arun Karthik Sengottuvel Karthik Narayanan
Utku Diril† Krishna Thottempudi Ajit Mathews
Krishnakumar Nair Ashwin Narasimha Sethu Gopal
Ehsan K. Ardestani Brian Dodds Maxim Naumov
Martin Schatz Cao Gao Vijay Rao
Yuchen Hao Jiyuan Zhang Krishna Noru
Rakesh Komuravelli Mohammad Al-Sanabani Harikrishna Reddy
Kunming Ho Ana Zehtabioskui Prahlad Venkatapuram
Sameer Abu Asal Alexis Bjorlin
Meta Platforms Inc.
Menlo Park, CA, USA
ABSTRACT
Meta has traditionally relied on using CPU-based servers for running inference workloads, specifically Deep Learning Recommendation Models (DLRM), but the increasing compute and memory requirements of these models have pushed the company towards using specialized solutions such as GPUs or other hardware accelerators. This paper describes the company's effort in constructing its first silicon specifically designed for recommendation systems: it describes the accelerator architecture and platform design and the software stack for enabling and optimizing PyTorch-based models, and provides an initial performance evaluation. With our emerging software stack, we have made significant progress towards reaching the same or higher efficiency as the GPU: we averaged 0.9x perf/W across various DLRMs, and benchmarks show operators such as GEMMs reaching 2x perf/W. Finally, the paper describes the lessons we learned during this journey, which can improve the performance and programmability of future generations of the architecture.

CCS CONCEPTS
• Computer systems organization~Architectures~Other architectures~Neural networks

KEYWORDS
Accelerators, Machine Learning, Inference, Recommendation Systems, Performance, Programmability

ACM Reference format:
Amin Firoozshahian, Joel Coburn, Roman Levenstein, Rakesh Nattoji, Ashwin Kamath, Olivia Wu, Gurdeepak Grewal, Harish Aepala, Bhasker Jakka, Bob Dreyer, Adam Hutchin, Utku Diril, Krishnakumar Nair, Ehsan K. Ardestani, Martin Schatz, Yuchen Hao, Rakesh Komuravelli, Kunming Ho, Sameer Abu Asal, Joe Shajrawi, Kevin Quinn, Nagesh Sreedhara, Pankaj Kansal, Willie Wei, Dheepak Jayaraman, Linda Cheng, Pritam Chopda, Eric Wang, Ajay Bikumandla, Arun Karthik Sengottuvel, Krishna Thottempudi, Ashwin Narasimha, Brian Dodds, Cao Gao, Jiyuan Zhang, Mohammad Al-Sanabani, Ana Zehtabioskui, Jordan Fix, Hangchen Yu, Richard Li, Kaustubh Gondkar, Jack Montgomery, Mike Tsai, Saritha Dwarakapuram, Sanjay Desai, Nili Avidan, Poorvaja Ramani, Karthik Narayanan, Ajit Mathews, Sethu Gopal, Maxim Naumov, Vijay Rao, Krishna Noru, Harikrishna Reddy, Prahlad Venkatapuram and Alexis Bjorlin. 2023. MTIA: First Generation Silicon Targeting Meta's Recommendation Systems. In Proceedings of the 2023 International Symposium on Computer Architecture (ISCA '23), June 17-21, 2023, Orlando, FL, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3579371.3589348

† Rivos Inc.; work done while at Meta Platforms Inc.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ISCA '23, June 17-21, 2023, Orlando, FL, USA.
© 2023 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0095-8/23/06...$15.00.
DOI: https://doi.org/10.1145/3579371.3589348

1 Introduction
Machine learning (ML) workloads have become ubiquitous in
online activities. In recent years, these models have seen
substantial growth in size and complexity, which has contributed
towards their increased prediction accuracy and effectiveness.
However, at the same time, this growth has presented significant
challenges for the hardware platforms that are used for training
and inference of these models at very large scales. Total Cost of
Ownership (TCO) is one of the major constraining factors in
launching models to production in the datacenter, and power is a significant component of TCO for these platforms. Therefore, performance-per-TCO (and performance-per-watt) has become an important metric for any hardware platform targeting these workloads.
Deep Learning Recommendation Models (DLRM) [16] have emerged as one of the most dominant workloads in Meta's datacenters [17][18]. These models combine traditional multilayer perceptron (MLP) operations (referred to as fully connected, or FC, at times), which are compute intensive, with embedding tables that transform sparse features into a dense representation. These tables contain wide vectors that are indexed randomly and are reduced to a single vector that is then combined with data coming from other layers to produce the final results [16]. While embedding table operations have rather light compute requirements, their memory footprint and bandwidth requirements are rather demanding due to the nature of the data access pattern and the size of the tables.
Figure 1 shows the historical and estimated future growth in both complexity and memory footprint of the inference workloads related to recommendation models in Meta's production datacenters. The dashed line shows the estimated growth in the model's compute requirement while the solid lines demonstrate the increase in the memory footprint. The gray solid line captures the footprint of the device memory used to store embedding tables, which is an important component of these models. The level of growth in both compute and memory requirements is certainly an issue that needs to be addressed, especially considering how these workloads are typically run in the datacenter.

Figure 1: Scaling trends for inference models

Figure 2 shows the estimated number of servers that are deployed for serving inference workloads within the datacenter over the past couple of years. The light solid line shows the number of CPU-based servers, the dashed line shows the number of servers equipped with the first-generation inference accelerator, Intel NNPI [10], and the dark solid line shows the number of GPU-based servers [12]. While the initial demand for increased capacity was temporarily met using the NNPI accelerator, the requirements of the inference models quickly outpaced the NNPI capabilities and provided motivation for using GPUs. This brought the additional advantage of leveraging the existing ecosystem already used for training. Therefore, as can be observed, the increased demand in model complexity is served increasingly with GPUs as accelerators.
While recent generations of GPUs provide a lot of memory bandwidth and compute power, they are not designed with inference in mind, and therefore the efficiency of processing real inference workloads is low. Developers use a myriad of software techniques, such as operator fusion, shape specialization, graph transformations and kernel optimizations, to raise the efficiency of GPUs. But despite these efforts, there is still an efficiency gap which makes it challenging and expensive to deploy models in practice.

2 Motivation
Traditionally, CPUs have been used as the primary vehicle to serve inference workloads in Meta's production datacenters, but they are not cost effective in keeping up with the demands of the most recent workloads. To that end, hardware acceleration has been considered an attractive solution that can address power and performance issues and provide a more efficient way of serving inference requests, while at the same time providing enough headroom in compute performance for running future models.

Figure 2: Growth in server demand for inference workloads

Given the experience deploying NNPI and GPUs as accelerators, it was clear that there is room for a more optimized solution for important inference workloads. This optimal solution is based on an in-house accelerator which is architected from the ground up to address the requirements of demanding inference workloads, specifically focused on meeting the performance requirements of DLRM systems. However, while focusing on DLRM workloads (given their ongoing variation and evolution, and the fact that the architecture is effectively constructed for forthcoming generations of these workloads), it was also clear that in addition to performance, the architecture should also provide enough generality and programmability to support future versions of these workloads and potentially other types of neural network models.
While creating a custom silicon solution opens the door for ample innovation and specialization towards the target workloads, creating an accelerator architecture for mass deployment in the datacenter is a monumental task. The focus and strategy when architecting the accelerator has therefore been on adopting and reusing suitable pieces of technology, as well as tools and environments, from vendors and the open-source community. This not only improves the time to market, but it also leverages the support and enhancements that come from the community and vendors and reduces the amount of resources required for building, enabling, and deploying such platforms.
The rest of this paper explains the undertaking of architecting MTIA, Meta's first accelerator chip targeting inference workloads, and the learnings that came with it. The next section details the accelerator's architecture and its various provisioned features and components. Section 4 goes over mapping an example operator to this architecture, demonstrating how various provisioned features are utilized to run the operator efficiently. Section 5 provides an overview of the accelerator's software stack and section 6 describes our evaluation methodology and results. Finally, section 7 discusses a few important lessons learned during this development cycle.

3 Accelerator Architecture
Figure 3 shows the high-level architecture of the accelerator, which is organized as an array of processing elements (PEs) connected on a grid. The grid is connected to a set of on-chip memory blocks and off-chip memory controllers through crossbars on each side. There is a separate control subsystem with dedicated processors and peripherals to run the system's control software. The host interface unit, which contains a PCIe interface, associated DMA engines, and a secure boot processor, also sits alongside this control subsystem.
Figure 4 shows the internal organization of the PE. A PE consists of two RISC-V processor cores and associated peripherals (on the left), as well as several fixed function units specialized in performing specific computations or data movements (on the right). In addition, each PE has 128KB of local storage. A local interconnect establishes the connectivity between processors, their peripherals and custom hardware blocks.

Figure 3: High-level architecture of the accelerator (an 8×8 grid of PEs connected through crossbars to on-chip memory slices and sixteen LPDDR5 controllers, with a control subsystem (CCP) and a host interface alongside)

3.1 Fixed Function Units
Each PE has a total of five fixed function blocks and a Command Processor which orchestrates and coordinates execution of operations on these fixed function blocks. Functional units form a coarse-grained pipeline within the PE, where data can be passed from one unit to the next to perform successive operations. Each functional unit can also access the data directly within the PE's local memory, perform the necessary operations, and write the result back, without passing the data to other functional units.
3.1.1 Memory Layout Unit (MLU)
This block performs operations related to copying and changing the layout of data in the local memory. It can operate on tensors with 4/8/16/32-bit data types. Operations like transpose, concatenation, or reshape are performed using this block. The output data can be sent to the next block directly to be operated on immediately or can be stored in the PE's memory. For example, the MLU can transpose a matrix and provide the output directly to the DPE block for a matrix multiplication operation, or it can format the data properly as part of a depth-wise convolution operation and send it to the DPE to perform the actual computation.
3.1.2 Dot-Product Engine (DPE)
This block performs a set of dot-product operations on two input tensors. The first tensor is read and stored within the DPE first, then the second tensor is streamed through the block and a dot product operation is performed with all the rows of the first tensor. The DPE can perform 1024 INT8 multiplications (32×32) or 512 FP16/BF16 multiplications (32×16) per cycle. Operations are fully pipelined; performing multiplication of two maximum size matrices takes 32 clock cycles. In case of INT8 multiplication, the resulting output is stored in INT32 format, while in the case of BF16 or FP16 multiplications, the result is stored in FP32 format. The result is always sent to the next functional unit in the pipeline for storage and accumulation.
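As a quick sanity check on these figures, the following back-of-the-envelope sketch in Python uses the 800 MHz nominal clock from Table I and assumes the 8×8 PE count suggested by Figure 3, so the chip-level number should be read as approximate:

    macs_per_cycle = 1024                        # INT8: a 32×32 operand tile is held in the DPE, 32 elements streamed per cycle
    block_macs = 32 * 32 * 32                    # one maximum-size 32×32×32 block product
    print(block_macs // macs_per_cycle)          # 32 cycles per block, matching the text
    freq_hz = 800e6                              # nominal clock from Table I
    pe_tops = 2 * macs_per_cycle * freq_hz / 1e12   # counting multiply + accumulate as two ops
    print(pe_tops, 64 * pe_tops)                 # ~1.6 INT8 TOPS per PE, ~105 across an assumed 8×8 grid (Table I lists 102.4)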
Figure 4: PE's internal organization (scalar and vector RISC-V cores with a debug subsystem, machine timer and PLIC on one side; the Command Processor with local memory and registers, and the MLU, DPE, RE, SE and Fabric Interface blocks on the other, connected by the PE interconnect to the NoC)

3.1.3 Reduction Engine (RE)
The reduction engine hosts the storage elements that keep track of the results of the matrix multiplication operations and accumulates them over multiple operations. There are four separate storage banks that can be independently used to store and accumulate the results coming from the DPE. The RE can load an initial bias into these accumulators and can also send their contents to neighbor PEs over a dedicated reduction network (discussed later in this section). Upon receiving results over the reduction network, the RE accumulates the received values on top of the values in one of the local storage banks. It can then send the result to the next neighbor, to the SE, or store it in the PE's local memory directly.
3.1.4 SIMD Engine (SE)
This block performs operations like quantization/de-quantization and nonlinear functions. Internally the block contains a set of lookup tables and floating-point arithmetic units to calculate linear or cubic approximations of nonlinear functions such as exponentials, sigmoid, tanh, etc. The approximation accepts INT8 or FP16 data types as inputs, producing an INT8 or FP32 result at the output. The unit can receive its inputs directly from the RE block or read them from the local memory. In addition, this block is also capable of using its floating-point ALUs to perform a set of predefined elementwise operations, such as addition, multiplication, accumulation, etc.
3.1.5 Fabric Interface (FI)
This block acts as the gateway in and out of the PE. It connects to and communicates over the accelerator's on-chip network. It formulates and sends memory access requests to on-chip and off-chip memories, as well as system registers, and receives back the data or write completions. It implements a set of DMA-like operations that transfer data in and out of the PE's local memory. It also receives and transmits cache misses and un-cached accesses from the processor cores and allows other entities (other PEs or the control subsystem) to access the PE's internal resources.
3.1.6 Command Processor (CP)
In addition to hosting the PE's local memory and registers, the CP block acts as the central processing unit that orchestrates execution of various operations on the fixed function blocks concurrently. It receives instructions from the two processor cores in the PE, performs dependency checking, scheduling, and tracking for those instructions, and dispatches them to the fixed function units for execution. It contains two separate schedulers (one for each processor core), a set of command queues, as well as arbitration logic for accessing the local memory and register resources.
The hardware provides a set of basic atomic primitives to allow synchronization between the cores (within the PE or across multiple PEs). These primitives are enacted by the processors, allow atomic updates to predefined registers, and can stall the processor until certain conditions are satisfied externally (e.g., a counter reaches a certain value). At a higher level, these mechanisms are used for efficient implementation of software constructs such as locks, ticketing locks, mutexes and barriers. The logic that performs the atomic operations, as well as the relevant registers, resides within the Command Processor and is tightly integrated with the processor cores through custom interfaces.
3.2 Processor Cores
Each PE contains two RISC-V cores that run the application's code and issue commands to the CP for offloading various computations to the fixed function units. The cores are single-issue, in-order cores with a five-stage pipeline (AX25-V100, from Andes Technology), and are heavily customized to suit the functionalities needed. The set of customizations includes custom interfaces, custom registers, custom instructions, and custom exceptions. Custom interfaces connect the cores to the CP to issue commands to fixed function units and move data back and forth between the cores and local memory. Custom registers store the command information that is sent to the CP upon issuing commands. Custom instructions are added to start the desired operation on each of the fixed function units. And finally, custom exceptions ensure correctness of each command issued to the CP and raise an exception in case of illegal values in the command.
One of the processor cores is equipped with the RISC-V vector extension, which adds extra flexibility to the PE and allows implementing operations that do not map well to the existing fixed function units. The vector processing unit contains 32 vector

registers, each 64B wide, and has the same width for all vector functional units. It implements version 0.8.1 of the RISC-V vector extension [23].

3.3 Local Memory (LS)
Each PE has a total of 128KB of local memory to be used by processors and functional units. The CP implements an arbitration scheme for memory banks and coordinates accesses from cores and fixed function units. Local memories are mapped to the system's address space and can be accessed by cores via regular load/store instructions.
There is an abstraction layer introduced on top of the local memories to simplify usage and dependency checking between operations that use them. This can be considered a further extension of the concept of the buffet [1][2]. Each PE can define circular buffers (CBs) that are mapped to the existing local memory. Each CB is designated with an ID and has a pair of registers that specify its size (depth) and starting address in the local memory. In addition, each CB also implements a set of read and write pointers to implement a hardware FIFO.
In a CB, read operations always read the data starting from the read pointer and write operations always write data starting from the write pointer. Like buffets, read and write operations carry an offset which allows them to access a location other than the current head or tail of the buffer (Figure 5). Fixed function units use the CB IDs as their input/output operands; for example, a matrix multiplication operation uses two CBs as its input operands. Before allowing an operation to start, the Command Processor checks the availability of the data in the input CBs and space in the output CB. It allows the operation to start only if the necessary element and space checks pass. Therefore, an operation is guaranteed to have the necessary resources to complete and will not stall the functional unit in the middle of its execution.
The Command Processor also uses the CB IDs to enforce dependency checks and interlocks between different custom instructions. It ensures that operations that access and modify a particular CB are always executed in program order, while operations that operate on different CBs or different regions of the same CB can execute in parallel. This significantly simplifies the dependency checks as opposed to using absolute local memory addresses for enforcing such interlocks.
CBs also simplify realization of the producer-consumer execution model between different operations. These operations can be initiated by different cores or different fixed function units. For example, a program can issue a series of DMA operations to the hardware (which move the data from an external memory into a CB), following them up with a set of custom compute operations (e.g., MATMUL) that use that data, without requiring an explicit synchronization between the two. The MATMUL instruction is automatically stalled by the Command Processor until enough data is brought into the CBs by prior DMA operations, and is started immediately afterwards, relieving the program from explicitly checking the availability of the data.

Figure 5: Reading from a circular buffer (a read of size 3 with offset 1 and stride 2, relative to the read pointer, over the 32B elements available between the read and write pointers)

While some instructions like DMA operations automatically adjust the read and write pointers (as they move the data in and out of the CBs, and hence produce or consume elements), other custom instructions do not move the pointers. This allows data inside the CB to be reused multiple times by different operations before it is explicitly marked as consumed. Hardware provides additional custom instructions that can adjust both read and write pointers in each CB, allowing explicit marking of the data elements as produced or consumed, when necessary.
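The semantics described above can be summarized with a small behavioral model. This is a host-side Python sketch for illustration only; the real CBs are hardware FIFOs managed by the Command Processor, and the method names below are not hardware operations:

    class CircularBuffer:
        def __init__(self, depth):
            self.depth, self.data = depth, [None] * depth
            self.rd, self.wr, self.count = 0, 0, 0      # read/write pointers and current occupancy

        def produce(self, elems):                       # DMA-like write: advances the write pointer
            assert self.count + len(elems) <= self.depth, "not enough space"
            for e in elems:
                self.data[self.wr] = e
                self.wr = (self.wr + 1) % self.depth
            self.count += len(elems)

        def read(self, offset, size):                   # compute-style read: offset from the read pointer, no pointer update
            assert self.count >= offset + size, "not enough elements"
            return [self.data[(self.rd + offset + i) % self.depth] for i in range(size)]

        def pop(self, size):                            # explicit consume (POP): advances the read pointer
            assert self.count >= size
            self.rd = (self.rd + size) % self.depth
            self.count -= size

An operation such as MML is only dispatched once the element check on its input CBs and the space check on its output CB are guaranteed to succeed, which corresponds to the checks the Command Processor performs before starting an operation.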
3.4 Memory Subsystem and Interconnect
In addition to the local memory within the PEs, the accelerator also has 128MB of on-chip SRAM, organized as slices around the grid. This on-chip memory can be used as addressable scratchpad memory, or as a common, shared, memory-side cache. There are four LPDDR5 controllers on each side of the grid, providing a total of 176 GB/s (theoretical) off-chip bandwidth. The accelerator can support a total of 128GB of off-chip memory capacity. Memory addresses are distributed across these controllers, and among the on-chip SRAM slices. When the on-chip SRAM is configured as cache, every four cache slices are associated with a single memory controller and cache its addresses.
The on-chip network that connects all the PEs and memories together is based on the AXI interconnect with special enhancements. The interconnect consists of two networks for carrying memory and register accesses separately. The memory access network is equipped with a multicast feature which allows coalescing of requests from multiple PEs into one (if they are made to the same set of addresses). A single request is then sent to the memory blocks to retrieve the data and return it to all requesting PEs. Multicast is only supported for PEs that are located along the same row or column in the grid, however, and cannot be used for an arbitrary group of PEs.
In addition to the main AXI-based interconnect, PEs are also connected to each other via a specialized network, called the reduction network. This is a unidirectional network that travels only from north to south and from west to east. It carries partial sums from the accumulators in the RE block of one PE to another. Using this network, PEs can expediently accumulate the results of their computation without having to save and restore them in memory. The last PE in the row or column can then store the final result in memory, after all partial values are accumulated.

3.5 Parallelism and Data Reuse
Parallelism, locality, and data reuse play a significant role in efficient utilization of limited hardware resources in any deep learning accelerator. The MTIA architecture has provisioned a set of features to allow multiple degrees of parallelism and maximal exploitation of temporal and spatial data reuse in neural network models and operators, as discussed below.
Parallelism: The architecture provides support for multiple levels of parallelism and overlapping of various operations. Data level parallelism (DLP) is exploited by usage of wide vectors in fixed function units as well as the vector processors. Multiple PEs can also operate on the same task in a data parallel manner. Instruction level parallelism is exploited in the Command Processor, by allowing multiple outstanding operations to be handled by different fixed function blocks simultaneously. Memory level parallelism (MLP) is achieved by allowing many outstanding requests to on-chip and off-chip memories from each PE. And finally, thread level parallelism (TLP) can be achieved by utilizing multiple PEs (or groups of PEs) to run parallel threads, as well as by having two independent threads within each PE. Threads within the PE can cooperate in performing a given task, with one thread orchestrating the data movement and the other orchestrating the computation.
Caching: There are multiple levels of caching in various blocks of the hardware to improve locality and reduce memory bandwidth consumption. This includes instruction and data caches in the processor cores, the large on-chip last level cache, and caching for input operands in the DPE block. The caching at the DPE level allows the engine to hold data from both operand A and operand B and save accesses to local memory upon a hit.
Circular buffers / local memories: Circular buffers provide the storage for holding input operands while the PE performs the computations. Flexibility in adjusting pointers, as well as offsetting into any location within a circular buffer, allows the program to access each line of data multiple times before deciding to mark it as consumed.
Specialized reduction: Having a dedicated reduction network not only offloads a large part of the data transfer from the system's main on-chip network, but also provides a way for grouping PEs together and using their local memories in an aggregate form. This in turn allows storing a larger portion of the input operands in the PEs and reducing the bandwidth requirement for loading them from off-chip memory. In addition, the DPE block utilizes reduction trees (spatial sum) to calculate the output of a multiplication operation [1][3][4], which is known to be more energy efficient [5].
Multicasting: As mentioned earlier, the system's NoC allows coalescing requests from multiple PEs when they access the same set of addresses in memory. This reduces memory bandwidth and increases the energy efficiency of data movement by allowing the data to be shared while reading it from memory only once and delivering it to all requesters [1][6][7][8].
Figure 6 shows the die plot with the grid of PEs, surrounded by on-chip SRAMs and off-chip DDR controllers, while Table I lists a summary of the chip features and parameters.

Table I - Summary of MTIA features and parameters
Technology:         TSMC 7nm
Frequency:          800MHz nominal (1.1 GHz max)
Instances:          1.12B gates, 65M flops
Dimensions:         19.34 × 19.1 mm (373 mm2)
Package:            43 × 43, ~2800 pins
TDP:                25 W
Voltage:            Dual rail: 0.67V (logic), 0.75V (memories)
Host Connectivity:  8× PCIe Gen4 (16 GB/s)
GEMM TOPS (MAC):    102.4 (INT8), 51.2 (FP16)
SIMD TOPS:          Vector: 0.8 (FP32) / 1.6 (FP16) / 3.2 (INT8); SE: 1.6 (FP16) / 3.2 (INT8)
Memory Bandwidth:   Local memory: 400GB/s per PE; On-chip SRAM: 800GB/s; Off-chip DRAM: 176 GB/s
Memory Capacity:    Local memory: 128KB per PE; On-chip SRAM: 128MB; Off-chip LPDDR5: 64GB (16 channels)
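The bandwidth and compute numbers in Table I also explain why the architecture leans so heavily on data reuse; a rough arithmetic-intensity estimate in Python, using the peak figures above:

    int8_ops_per_s = 102.4e12                    # peak GEMM throughput from Table I
    dram_bw, sram_bw = 176e9, 800e9              # off-chip and on-chip bytes/s from Table I
    print(int8_ops_per_s / dram_bw)              # ~582 INT8 ops needed per byte fetched from DRAM to stay compute bound
    print(int8_ops_per_s / sram_bw)              # ~128 INT8 ops per byte fetched from on-chip SRAM
    # Hence the multicast NoC, the reduction network, and the emphasis on keeping operands resident on chip.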
blocks of the hardware to improve locality and reduce memory
bandwidth consumption. This includes instruction and data caches
in the processor cores, large on-chip last level cache, and caching
for input operands in the DPE block. The caching at the DPE level 4 Mapping an FC Layer
allows the engine to hold data from both operand A and operand To demonstrate how all the above-mentioned features work
B and save access to local memory upon hit. together, let’s consider an FC operator that performs a matrix
Circular buffers / local memories: Circular buffers provide multiplication operation in the form of CT = A×BT and see how it
the storage for holding input operands while the PE performs the maps to a sub-grid of PEs. The reason for performing the
computations. Flexibility in adjusting pointers as well as operations in a transposed manner is to keep k as the inner
offsetting into any location within a circular buffer allows the dimension for both tensors, to increase the efficiency of memory
program to access each line of data multiple times, before accesses. Matrix A is assumed to be m×k and matrix B is assumed
deciding to mark it as consumed. to be k×n (hence BT will be n×k), producing output C which will
Specialized reduction: Having a dedicated reduction network be an m×n matrix (or CT being an n×m matrix). Inputs are
not only offloads a large part of data transfer from the system’s assumed to have row major memory layout. When the inner
main on-chip network, but also provides a way for grouping PEs dimension (k) is not a multiple of 32B, the outer dimension (m or
together and using their local memories in an aggregate form. n) stride is aligned to 32B boundaries for efficient data movement.
This in turn allows storing a larger portion of input operands in For simplicity, we will assume that all elements are of INT8 data
the PEs and reducing the bandwidth requirement for loading them type.
from off-chip memory. In addition, the DPE block utilizes
reduction trees (spatial sum) to calculate the output of a
multiplication operation [1][3][4], which is known to be more
energy efficient [5].
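To make the transposed formulation concrete, the following NumPy sketch (shapes taken from the Figure 7 example) shows that storing B as BT keeps k contiguous for both operands, and that the n×m result is simply the transpose of the usual m×n product:

    import numpy as np
    m, k, n = 512, 1024, 256
    A  = np.random.randint(-128, 128, (m, k), dtype=np.int8)   # row-major: k is the inner, contiguous dimension
    BT = np.random.randint(-128, 128, (n, k), dtype=np.int8)   # B stored transposed, so k is contiguous here too
    C  = A.astype(np.int32) @ BT.astype(np.int32).T             # reference m×n result, accumulated in INT32
    CT = BT.astype(np.int32) @ A.astype(np.int32).T             # n×m result produced by the transposed formulation
    assert np.array_equal(CT, C.T)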

Figure 6: MTIA die plot

Figure 7: Mapping an FC operator to a sub-grid (the m, n and k ranges of A and BT assigned to each PE of a 4×4 sub-grid)
As mentioned earlier, the DPE works on blocks of 32(m)×32(k)×32(n) inputs, generating 32(n)×32(m) partial results accumulated in the RE. This operation takes 32 clock cycles. In order to feed the DPE's pipeline, 32(m)×32(k) blocks of matrix A and 32(n)×32(k) blocks of matrix BT must be brought from external memory into the PE's local memory in 32 cycles, requiring 64B/cycle of bandwidth. To alleviate this bandwidth pressure, the four accumulators in the RE block are used to accumulate 2×2 blocks of partial results, holding a total of 64(n)×64(m) elements of the output matrix. By using the accumulators in this manner, we use every 32×32 input block twice, hence reducing the external bandwidth requirement to 32B/cycle.
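The 2x saving follows directly from counting bytes per cycle; a small arithmetic check in Python:

    tile_bytes = 32 * 32                              # one 32×32 INT8 tile of A or BT
    naive = 2 * tile_bytes / 32                       # one A tile + one B tile every 32 cycles -> 64 B/cycle
    blocked = (2 + 2) * tile_bytes / (4 * 32)         # a 64×32 slice of A and of BT feed four block products over 128 cycles -> 32 B/cycle
    print(naive, blocked)                             # 64.0 32.0: each tile is fetched once but used in two block products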
Tensor dimensions m, n and k are distributed in multiples of 64, 64 and 32 across the PE grid, respectively. Each PE hence works on a different sub-block of the larger result matrix in a data parallel fashion. The reduction dimension (k) is distributed over multiple PEs along the row (or column). This facilitates the usage of the reduction network to accumulate partial results after multiplication is completed. PEs pass the calculated partial results to each other to accumulate and pass to the next PE. When two or more PEs along a given row or column use the same block of input data from either input matrix, the multicast feature of the on-chip network is used to coalesce the requests from multiple PEs and send a single request to the memory, further reducing memory bandwidth requirements.
Figure 7 shows an example of distributing an FC operator with dimensions of 512(m), 1024(k) and 256(n) on a 4×4 PE sub-grid. The reduction dimension (k) is distributed across two PEs along the same row and dimension m is distributed across four rows. PEs in columns 0 and 2, and PEs in columns 1 and 3, participate in row multicast-read of matrix A. Similarly, all PEs in each column participate in column multicast-read of matrix BT.
Within the PE, the operation is divided between the two cores in a producer-consumer manner. Figure 8 shows the pseudocode corresponding to each of the cores in the PE. Core0 issues a set of DMA operations that move data from main memory into CB_A and CB_B, used to store matrices A and B locally. In a parallel thread, Core1 issues a set of matrix multiply (MML) instructions that read data from CB_A and CB_B and store the results in an accumulator register. As can be observed, each block of data is used twice to produce a partial result in each of the accumulator registers. If the operation is the last iteration, the data is marked as consumed in the CB by issuing a POP instruction; otherwise the corresponding CB offsets are incremented to move to the next block of data in the next iteration. At the end, the reduction operation (REDUCE) is called to accumulate all partial sums across PEs. The last PE in the reduction chain sends the data back to main memory using the DMA operation.
The two cores in the PE must synchronize at the start of the operation as only one of them performs the necessary initialization tasks (e.g., setting up the CBs to use). But afterwards, there is no explicit, per-iteration synchronization; the producer-consumer synchronization is taken care of by the hardware: if the consumer (the MML operation) attempts to use a CB that does not have enough data, hardware stalls the operation until the producer (DMA operation) places enough data within the CB, at which point it allows the matrix multiplication to proceed. This asynchronicity decouples the producer and consumer threads and allows the producer to move ahead and bring in more data for later iterations.
#-------------------------------- Core0 --------------------------------
work = GetWorkForMyPE(...)
INIT CB_A, CB_B and CB_C                            # Setup circular buffers
multicast_A, multicast_B = JoinMulticastGroup(...)
Sync(...)                                           # Synchronize with others
read_B = true
for m in range(work.m.begin, work.m.end, 64):       # For every row of "A"...
    read_A = true
    for n in range(work.n.begin, work.n.end, 64):   # ...read entire "B"
        for k in range(work.k.begin, work.k.end, 32):
            if read_A:
                DMA GetAddr(A, (m, k)), size=(64,32), CB_A, multicast_A
            if read_B:
                DMA GetAddr(B, (n, k)), size=(64,32), CB_B, multicast_B
        read_A = false
    read_B = false

#-------------------------------- Core1 --------------------------------
work = GetWorkForMyPE(...)
Sync(...)                                           # Synchronize with others
for m in range(work.m.begin, work.m.end, 64):       # For every two chunks of "A"
    cb_offset_B = 0
    for n in range(work.n.begin, work.n.end, 64):   # Multiply two chunks of "B"
        cb_offset_A = 0
        INIT RE acc with 0                          # Initialize accumulators
        for k in range(work.k.begin, work.k.end, 32):
            MML acc=0, size=(32,32,32), CB_B, CB_A, cb_offset_B,       cb_offset_A
            MML acc=1, size=(32,32,32), CB_B, CB_A, cb_offset_B,       cb_offset_A+32*32
            MML acc=2, size=(32,32,32), CB_B, CB_A, cb_offset_B+32*32, cb_offset_A
            MML acc=3, size=(32,32,32), CB_B, CB_A, cb_offset_B+32*32, cb_offset_A+32*32
            if ((m + 64) >= work.m.end):            # If last iteration...
                POP CB_B, size=2*32*32              # ...mark "B" data as consumed
            else:                                   # Otherwise...
                cb_offset_B += 2*32*32              # ...proceed to the next chunk
            if ((n + 64) >= work.n.end):            # If last iteration...
                POP CB_A, size=2*32*32              # ...mark "A" data as consumed
            else:                                   # Otherwise...
                cb_offset_A += 2*32*32              # ...proceed to the next chunk
        REDUCE destination = neighbor PE or CB_C, size=(64,64)   # Send to next PE
        if IsLastPEInReduction(...):                # If last PE in sequence
            DMA PutAddr(C, (n, m)), size=(64,64), CB_C            # Write result to memory
converts it into LLVM IR [21][22]. It is responsible for graph
Figure 8: Pseudocode for the FC operator running in the PE

5 Software Stack
The software stack for MTIA is designed with two main goals in mind: be efficient for production, meaning achieve higher perf/TCO than other best-in-class solutions, and at the same time, be simple and straightforward to use, even simpler than available alternatives. The software stack for MTIA is designed and built around PyTorch to benefit from its capabilities and to achieve a seamless integration with other components of the ML infrastructure available in a production environment. The rest of this section provides an overview of each component of the software stack as shown in Figure 9.
ML serving platform: At the top of the software stack, we have production-specific ML model serving platforms (the Application Layer as illustrated in Figure 9). These serving platforms operate on top of PyTorch and are mostly hardware agnostic, supporting execution on heterogeneous hardware systems including CPUs, GPUs, and accelerators like MTIA.
PyTorch Runtime: A PyTorch Runtime integration for MTIA was developed which provides the necessary functionality and features, including MTIA Tensors, a host-side memory allocator, and CUDA-like streaming APIs for scheduling the desired operators to execute on the device. The runtime supports different modes of model execution, including eager mode, as well as full graph compilation and execution to maximize performance. It also supports running models split into partitions spanning multiple cards, providing the necessary synchronization and communication channels between them.
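As a purely illustrative sketch of this flow from the host side (the "mtia" device string and backend name below are hypothetical placeholders that mirror the CUDA idiom; they are not the actual MTIA runtime API):

    import torch

    device = torch.device("mtia")                                   # hypothetical device string for the backend
    model = torch.nn.Sequential(torch.nn.Linear(512, 128),
                                torch.nn.ReLU()).eval().to(device)  # small stand-in for a production model
    x = torch.randn(64, 512, device=device)

    out_eager = model(x)                                            # eager mode: operators dispatched one by one
    compiled = torch.compile(model, backend="mtia")                 # hypothetical backend name for full-graph mode
    out_graph = compiled(x)                                         # whole graph compiled and executed on the device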

Figure 9: MTIA's software stack (serving platforms on top of PyTorch; the FX-based model compiler and KNYFE DSL compiler; the MTIA operator and kernel libraries; the PyTorch accelerator runtime, memory allocator and streaming API on the host; and the driver, firmware interface and device firmware underneath)

Compilers: The next important component in the software stack is a set of compilers which consists of multiple parts:
• A PyTorch FX-based ML model compiler which applies several transformations and model-level optimizations to the PyTorch graph represented as FX IR [19][20], and gradually converts it into LLVM IR [21][22] (see the sketch after this list). It is responsible for graph optimizations which take advantage of the PE grid and MTIA's memory subsystem. It implements a tensor placement scheme that takes a best-effort approach to keep producer-consumer data in on-chip memory. It can also split a model into sub-graphs intended to run across multiple cards and even across sub-grids within the same chip.
• A DSL-based compiler (codename KNYFE) for ML kernel development, which takes a short high-level description of an ML kernel and produces low-level optimized C++ code. It uses low-level hardware-specific APIs to implement the ML operator and is used extensively for developing many of the ML kernels used in MTIA.
• An LLVM-based compiler toolchain which converts LLVM IR into an executable for the device. LLVM is used primarily due to the RISC-V support it provides and is responsible for the lowest level of optimizations like register allocation, inlining and code generation. Most major optimizations like tiling or scheduling of the work and data among PEs are performed by the higher-level compilers mentioned earlier.
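FX IR itself is standard PyTorch machinery, so the input consumed by the first compiler stage can be pictured with a generic (non-MTIA-specific) capture like this:

    import torch
    import torch.fx

    class TinyMLP(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = torch.nn.Linear(256, 128)
            self.fc2 = torch.nn.Linear(128, 1)
        def forward(self, x):
            return torch.sigmoid(self.fc2(torch.relu(self.fc1(x))))

    gm = torch.fx.symbolic_trace(TinyMLP())   # GraphModule holding the model as an FX graph (FX IR)
    gm.graph.print_tabular()                  # call_module / call_function nodes that graph passes rewrite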
Library of ML kernels: Another important component is the library of kernels and ML operators that are used to construct the ML models executing on the device. Many of these kernels are developed using the DSL compiler mentioned earlier, but some of the most performance-demanding kernels, e.g., fully connected (FC) layers and embedding bag (EB) layers, are developed by experts directly in low-level C++ using exposed intrinsics to ensure they can achieve the highest levels of performance possible on the hardware.
Host driver and firmware interface: MTIA platform software enables the host to access the accelerator device. It manages the device lifecycle and resources, and it helps initiate and track runtime operations on the device. This part of the stack is broadly split into two parts: the host software and the device firmware. The host software consists of the Linux device driver, a device access library for providing a uniform device interface, and a streaming API to interface with PyTorch, as well as software tools and utilities for managing and monitoring the device.
Device firmware: The device firmware includes a ROM-based pre-boot firmware, secure boot firmware running on its own processor, the Control Core Processor firmware running on the control subsystem performing runtime and management operations, and finally the PE monitor that runs on the PEs in the compute grid, which schedules and monitors workloads running on the PEs. The main control firmware is based on the Zephyr Real Time OS [9].

6 Results
We evaluate the performance of MTIA by comparing it against a baseline accelerator (NNPI) [10] and against more recently deployed GPUs. It should be noted that we report results collected with an under-development software stack, as we believe this reflects the end-to-end performance and is representative of a production environment. However, this stack is not currently as optimized as the GPU's software stack. Consequently, there are cases where the GPU is more efficient, but we are hoping to close this gap over time and have the MTIA software stack deliver the full gains of the architecture across all of the DLRM workload space. We evaluate both operator-based benchmarks as well as full DLRM models varying in complexity, size, and accuracy. Since these accelerators are all based on different hardware platforms, we first compare their system-level hardware specifications (Table II). These platforms are the following: a Yosemite V2 server with six NNPI accelerator cards [11], a Zion4S server with eight Nvidia A100 GPUs [12], and a Yosemite V3 server [13] with twelve MTIA accelerator cards.

Table II - Inference hardware platforms
Metric                               Yosemite V2 (6 NNPI)   Zion4S (8 GPU)   Yosemite V3 (12 MTIA)
Power: System                        298 W                  4500 W           780 W
Power: Card                          13.5 W                 330 W            35 W
Power: Card share of system          27.2 %                 58.7 %           53.8 %
Compute: INT8 (TOPS)                 50 × 6                 624 × 8          104 × 12
Compute: FP16 (TFLOPS)               6.25 × 6               312 × 8          52 × 12
Memory: Type (device)                LPDDR                  HBM              LPDDR
Memory: Size (device)                16 GB × 6              40 GB × 8        32 GB × 12
Memory: BW (device)                  50 GB/s × 6            1.5 TB/s × 8     150 GB/s × 12
Memory: Size (host)                  64 GB                  1.5 TB           96 GB
Memory: BW (host)                    50 GB/s                400 GB/s         76 GB/s
Comms: Dev.-to-Dev.                  PCIe                   NVLink           PCIe
Comms: P2P BW (card)                 3.2 GB/s               80 GB/s          12.8 GB/s
Comms: NIC BW                        50 Gbps                400 Gbps         100 Gbps

While we can compare the absolute performance of MTIA versus NNPI and GPUs, each device has different capabilities in terms of compute throughput, memory bandwidth, and memory capacity. They also operate under different power budgets. Therefore, in our study we report perf/W (as a proxy for perf/TCO, given the sensitive nature of TCO), because power is an important factor in provisioning for deployment in the datacenter. We use the total platform power divided by the number of accelerator cards to determine the power provisioned for each accelerator, as opposed to using the maximum TDP for the card.
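Applying this methodology to the peak numbers in Table II gives a rough sense of scale; a small Python calculation (these are peak-TOPS-per-provisioned-watt figures, not the measured perf/W reported later):

    platforms = {                              # cards per system, system power (W), peak INT8 TOPS per card
        "NNPI (Yosemite V2)": (6, 298, 50),
        "A100 (Zion4S)":      (8, 4500, 624),
        "MTIA (Yosemite V3)": (12, 780, 104),
    }
    for name, (cards, system_w, tops) in platforms.items():
        provisioned_w = system_w / cards
        print(f"{name}: {provisioned_w:.0f} W provisioned/card, {tops / provisioned_w:.2f} peak INT8 TOPS/W")
    # Roughly 1.0, 1.1 and 1.6 peak TOPS per provisioned watt, respectively.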
6.1 Benchmark Performance
We first evaluate the performance of several important operators and kernels that push the limits of the architecture and are representative of the main components in production DLRMs. Table III shows the latency breakdown of a request in a representative DLRM with batch sizes of 64 and 256. The model has approximately 750 layers, with nearly 550 consisting of EB operators. For a batch size of 64, FC dominates the execution time followed by EB, while for batch size 256, EB dominates FC slightly and the two together account for 62% of the execution time. It should be noted that with larger input shapes, the kernels are able to better amortize the setup costs and reuse the data more, hence achieving higher utilization of the fixed-function units in the hardware.

Table III - Operator breakdown, medium complexity DLRM
Operator                Batch size 64   Batch size 256
FC (Fully Connected)    42.10 %         32.4 %
EB (Embedding Bag)      31.19 %         30.0 %
Concat                  2.86 %          11.5 %
Transpose               8.47 %          5.9 %
Quantize                1.55 %          5.3 %
Dequantize              2.94 %          3.3 %
BatchMatMul             3.30 %          1.7 %
Others                  7.59 %          11.0 %

Based on the breakdown, we use a set of benchmarks to assess the efficiency of MTIA's hardware. While not full-fledged workloads, these benchmarks allow exercising various shapes and sizes for important operators (including corner cases) and shed light on potential deficiencies that might exist in the hardware. GemmBench [14] is used to evaluate dense computation; it creates a model composed of a chain of FC layers. In our benchmark runs we focus on both FP16 and INT8 (quantized) data, which requires additional quantize and dequantize layers. TBEBench [15] is used to evaluate sparse computation, and allows us to configure the batch size, number of tables, number of rows per table, embedding dimension, and pooling factor of TBE operators. BatchGEMMBench [24], ConcatBench [26], and
TransposeBench [26] are used to efficiently cover other significant operators typically seen in recommendation models. We also evaluate several elementwise kernels including quantize, dequantize, and tanh.
Dense computation: We evaluate both INT8 and FP16 Fully Connected (FC) layers (Figure 10 and Figure 11). When accuracy is sufficient, INT8 quantization unlocks a potential 2x improvement in FC throughput. For the set of shapes we evaluate, the trend lines roughly track for MTIA and the GPU across INT8 and FP16, indicating that the software implementations are well optimized across a range of arithmetic intensities. In many cases, MTIA achieves 2x or greater performance per watt, and it is particularly effective for low batch sizes, which helps when serving requests under stringent latency requirements. For large batch sizes, the GPU is able to achieve higher utilization with the increased amount of work, so the perf/W gains of MTIA are lower. Note that MTIA is most efficient when tensors can be streamed directly from SRAM, which means that graph optimizations and managing data locality are very important for good performance at the model level.

Figure 10: INT8 FC performance
Figure 11: FP16 FC performance

Sparse computation: While a typical recommendation model might include hundreds of EmbeddingBag (EB) operators, they can be merged together into one or more TableBatchedEmbedding (TBE) operators to amortize kernel launch overhead and increase the work that can be parallelized across the device. Figure 12 shows the performance (in GB/s/W) for the TBE benchmark running on MTIA and the GPU for a set of representative operator shapes. Note that we report performance in terms of GB/s here because this benchmark is mostly memory bound, and measuring bandwidth as opposed to lookups/sec provides better insight into hardware utilization. Here we utilize the cache configuration of the on-chip SRAM to take advantage of locality across and within batches. In these examples, all table entries use 8-bit quantization and the triplets shown in the graph describe the operator's pooling factor, number of rows in the table, and the embedding dimension (elements per row). MTIA achieves between 0.6x and 1.5x the perf/W of the GPU with the current kernel implementation.
Given the evolving nature of the software stack, we observe that there is significant headroom for improvement: MTIA is reaching just 10-20% of its memory bandwidth whereas the GPU is achieving about 60% of its HBM bandwidth. To ensure that there are no deficiencies in the hardware, we used hand-written kernels developed for RTL validation, and could observe performance levels as high as 500 GB/s (more than 60% of roofline) or 6 GB/s/W given sufficient locality in the SRAM. We hope to close this gap by improving the software pipelining and instruction scheduling of the TBE kernels.

Figure 12: TBE performance
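To make the (pooling factor, rows, embedding dimension) triplets concrete, the sketch below runs a single plain PyTorch EmbeddingBag with an assumed example shape; a TBE operator fuses many such tables into one kernel, and TBEBench sweeps exactly these parameters:

    import torch

    rows, dim, pooling, batch = 100_000, 128, 38, 64          # assumed example values, not a shape from Figure 12
    table = torch.nn.EmbeddingBag(rows, dim, mode="sum")      # one EB operator; TBE batches many tables together

    indices = torch.randint(0, rows, (batch * pooling,))      # `pooling` random row ids per sample
    offsets = torch.arange(0, batch * pooling, pooling)       # where each sample's id list starts
    pooled = table(indices, offsets)                          # (batch, dim): each sample reduces `pooling` rows
    # With 8-bit quantized entries, each sample touches roughly pooling × dim bytes per table,
    # which is why the benchmark is reported in GB/s rather than lookups/s.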

Other operators: While FC and TBE tend to dominate execution time, we found that other operators can be just as important, especially given how much effort is spent optimizing the former. We evaluated BatchMatMul, Concat, Transpose, and several elementwise kernels for M=256, K=128, N=32, with tensor data placed in SRAM and DRAM (Figure 13). These operators tend to be memory bound, which is exemplified by BatchMatMul and Tanh, which reach more than 90% and 80% of the SRAM bandwidth, respectively. When data is placed in DRAM, the efficiency drops to around 40% on average, because it is more difficult to hide the additional memory latency. We believe implementation of data placement optimizations, operator fusion, and minimizing data fetches from DRAM could potentially mitigate this issue.

Figure 13: Performance of other operators

6.2 Model Performance
We examine the performance of five different representative DLRMs, described in Table IV, which range from low to high complexity. MTIA can run the same recommendation models that run on NNPI and the GPU. With the current level of maturity of the software stack, MTIA achieves near perf/W parity with the GPU and exceeds the perf/W of NNPI, while roofline modeling indicates there is significant room for improvement as the software stack matures further.

Table IV - DLRM models used for evaluation
DLRM Model                  Size (GB)   Complexity (GFLOPS/batch)
Low Complexity 1 (LC1)      53.2        0.032
Low Complexity 2 (LC2)      4.5         0.014
Medium Complexity 1 (MC1)   120         0.140
Medium Complexity 2 (MC2)   200         0.220
High Complexity (HC)        725         0.450

Figure 14 shows the performance (in TFLOPS/s/W) across the above-mentioned set of DLRMs. Compared to NNPI, MTIA achieves 1.6x higher efficiency, while compared to the GPU it reaches 0.9x efficiency. There are two important factors to consider in these results: the model characteristics and the level of software optimization in the implementations. For low complexity models, MTIA has a significant advantage over the GPU because these models are dominated by FC layers with smaller input shapes and MTIA handles these quite efficiently; e.g., LC2 shows nearly a 3x improvement. For medium complexity models, MTIA still sees an efficiency gain over the GPU, but it is lower because FCs are less dominant, and the GPU software stack provides more efficient implementations of other operators (with TBE and aggressive operator fusion). For high complexity models, we see that the GPU software stack is better optimized for large shapes, and MTIA needs similar optimizations in order to achieve the same or higher levels of efficiency. These initial results give us insight into areas of the software stack that we should consider focusing on in the future (e.g., large FCs, TBE optimizations, operator fusion, etc.), as well as provide important learnings for the next-generation architecture, which we discuss next.

Figure 14: Performance of DLRMs

7 Discussion
Building silicon is always a difficult, lengthy, and time-consuming process, especially when done for the first time. For MTIA, the resulting silicon needed to achieve high performance, handle a wide range of recommendation models, and provide a level of programmability that would allow rapid deployment of models in production. This section highlights our important observations and reflections regarding architectural choices, and how they impacted the software stack, performance, and developer efficiency. These lessons also act as guidance for improving and enhancing future generations of the architecture.
Dual-Core PEs: The choice of having two separate processor cores within the PE and allowing both to control the fixed function units provided a great degree of parallelism and flexibility at the thread level, allowing decoupling of compute from data transfer. While this decoupling simplified the programming and alleviated performance issues when a particular operator is instruction bound (by providing twice the overall instruction throughput), using both cores correctly and efficiently in software took some effort. Details such as synchronization between the two cores for initialization and cleanup before execution of a job were difficult to get right the first time, but afterwards were leveraged in all workloads through proper integration in the software stack.
General-Purpose Compute: The addition of general-purpose compute in the form of RISC-V vector support proved to be a judicious choice: there were operators which were developed or gained importance after the architecture definition phase, and hence the architecture did not include any offload support for them. Operators like LayerNorm and BatchedReduceAdd were
straightforward to implement with vectors, and these advance and there are not enough outstanding requests to hide the
implementations proved superior to versions using scalar cores latency.
and fixed function units. Cache Coherence: While the system implements a shared
Automated Code Generation: Some of the architectural memory paradigm, there is no hardware support for cache
choices made regarding how the fixed function units are coherency. In this shared memory system, inter-PE coherency is
integrated and operated in the PE have made the automatic code not required as different PEs operate on their dedicated part of the
generation by compiler difficult. Processor cores must assemble dataset in a data parallel manner. However, intra-PE coherency
and issue explicit commands to operate any of the fixed-function sometimes causes correctness issues: If the same set of memory
blocks. While this is done through addition of custom instructions addresses are touched by the two processor cores, or by the fixed-
and registers to the processors, it still requires assembling and function units and a core, or addresses are reused by the same core
passing many parameters to each offload engine to specify the across different operators, the cached copies of these addresses
details of the operation. Controlling a heterogenous set of fixed- must be explicitly flushed from caches, otherwise the stale data
function units from within the program and balancing the data could be used.
flow between them is a challenging problem for the compiler. Architecture Hierarchy: Recommendation models for the
Achieving desired levels of utilization on the fixed-function accelerator vary greatly in size and complexity in the layers and
blocks across various input shapes and sizes is also difficult. operators they employ. While large layers when mapped on the
While our DSL-based KNYFE compiler makes it easier to write PE grid can extract desired utilization level from available
kernels and handles many of these issues automatically, it requires hardware resources and amortize the overhead of job creation and
learning a new DSL. dispatch, the smaller layers or lower batch sizes have to resort to
Circular Buffers: The addition of the circular buffer abstraction greatly simplified dependency checking between custom operations that work on the same region of memory, as circular buffer IDs are used as the units of dependency checks (similar to register IDs in the processor cores). Circular buffers also simplified the implementation of the producer-consumer relationship between fixed-function blocks and processors, as the hardware holds off an operation until enough data (or space) is available in the circular buffer, without any need for explicit synchronization at the software level. The flexible addressing mechanism also allows arbitrary access to any location within a circular buffer, which simplifies data reuse, since different operations can access different segments within the circular buffer multiple times. However, this requires software to explicitly manage the space within the buffer and decide when the data should be marked as consumed, which can create difficult-to-debug correctness issues if not done properly.
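A sketch of how a kernel might use this abstraction is shown below. The cb_* intrinsics are hypothetical stand-ins for the hardware-managed circular buffer interface; the point is that the arrival of data and the availability of space are enforced by hardware, while deciding when a segment may be marked as consumed remains the kernel's responsibility.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical intrinsics standing in for the hardware-managed circular
// buffer interface (illustrative names only).
extern "C" {
void* cb_acquire(uint32_t cb_id, size_t bytes);  // blocks until data is available
void  cb_release(uint32_t cb_id, size_t bytes);  // marks data as consumed
void* cb_reserve(uint32_t cb_id, size_t bytes);  // blocks until space is free
void  cb_commit(uint32_t cb_id, size_t bytes);   // publishes produced data
}

// A consumer kernel that reads the same input segment twice before retiring
// it. The hardware stalls cb_acquire() until the producer has committed the
// data, so no explicit synchronization is needed; but releasing too early, or
// forgetting to release, is exactly the class of bug described above.
void consume_tiles(uint32_t in_cb, uint32_t out_cb, size_t tile_bytes, int tiles) {
  for (int t = 0; t < tiles; ++t) {
    float* in  = static_cast<float*>(cb_acquire(in_cb, tile_bytes));
    float* out = static_cast<float*>(cb_reserve(out_cb, tile_bytes));
    // First pass over 'in': e.g., compute a per-tile statistic.
    // Second pass over 'in': produce 'out' using that statistic.
    out[0] = in[0];  // placeholder for the real two-pass computation
    cb_commit(out_cb, tile_bytes);
    cb_release(in_cb, tile_bytes);  // only now may the producer overwrite it
  }
}
```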
Memory Latency: Both the PE and on-chip SRAM memories turned out to have longer-than-typical access latencies. Having many clients accessing the PE memory complicated the arbitration scheme and added latency cycles. For fixed-function blocks this latency gets rolled into the operation's latency, but when the processors use the local memory, software must resort to techniques such as unrolling and software pipelining to hide latencies, which increases register pressure. This becomes exacerbated when developing kernels on the vector processor, as it limits the amount of register grouping that can be performed and hence increases the dynamic instruction count.
Placement of the on-chip SRAMs around the perimeter, while evenly distributing requests over multiple slices, created a large degree of non-uniformity in memory access latencies. Since requests are only completed after the last piece of data arrives, the overall latency is impacted when making larger-than-minimum requests. While in most cases this latency can be hidden by prefetching data into the PE's memory, in cases such as EmbeddingBag operators with small pooling groups the latency gets exposed because the memory access pattern is not known in advance and there are not enough outstanding requests to hide the latency.
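The sketch below illustrates the kind of double-buffering a kernel must employ to hide these latencies; dma_copy_async and dma_wait are hypothetical placeholders for the actual data-movement mechanism. The technique works, but it doubles the local-memory footprint, and in unrolled vector loops the extra in-flight state shows up directly as register pressure.

```cpp
#include <cstddef>

// Hypothetical asynchronous copy/prefetch intrinsics (illustrative names).
extern "C" {
void dma_copy_async(void* dst, const void* src, size_t bytes, int tag);
void dma_wait(int tag);
}

// Double-buffered processing: while tile t is computed out of local memory,
// tile t+1 is already in flight, hiding the long access latency.
void process_tiles(const float* global_in, float* local_buf[2],
                   size_t tile_elems, int tiles) {
  if (tiles <= 0) return;
  dma_copy_async(local_buf[0], global_in, tile_elems * sizeof(float), /*tag=*/0);
  for (int t = 0; t < tiles; ++t) {
    const int cur = t & 1;
    const int nxt = (t + 1) & 1;
    if (t + 1 < tiles)
      dma_copy_async(local_buf[nxt], global_in + (t + 1) * tile_elems,
                     tile_elems * sizeof(float), /*tag=*/nxt);
    dma_wait(/*tag=*/cur);  // data for the current tile has landed
    // ... compute on local_buf[cur] ...
  }
}
```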
Cache Coherence: While the system implements a shared memory paradigm, there is no hardware support for cache coherency. Inter-PE coherency is not required, as different PEs operate on their dedicated part of the dataset in a data-parallel manner. However, intra-PE coherency sometimes causes correctness issues: If the same set of memory addresses is touched by the two processor cores, or by the fixed-function units and a core, or addresses are reused by the same core across different operators, the cached copies of these addresses must be explicitly flushed from the caches, otherwise stale data could be used.
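The sketch below shows the discipline this imposes on software; dcache_flush and dcache_invalidate are hypothetical names for the cache-maintenance operations involved. Every handoff of a buffer between agents, and every reuse of an address range across operators, must be bracketed with explicit cache maintenance.

```cpp
#include <cstddef>

// Hypothetical cache-maintenance intrinsics (illustrative names).
extern "C" {
void dcache_flush(const void* addr, size_t bytes);       // write back dirty lines
void dcache_invalidate(const void* addr, size_t bytes);  // discard cached copies
}

// Producer side: a core hands a buffer to a fixed-function unit or to the
// other core. Its cached writes must be made visible in memory first.
void handoff_buffer(const void* buf, size_t bytes) {
  dcache_flush(buf, bytes);
  // ... program the fixed-function unit / other core to read 'buf' ...
}

// Consumer side: an address range reused across operators may still have
// stale lines cached from a previous operator; drop them before reading.
void reuse_buffer(const void* buf, size_t bytes) {
  dcache_invalidate(buf, bytes);
  // ... safe to read the freshly written contents of 'buf' ...
}
```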
Architecture Hierarchy: Recommendation models targeted by the accelerator vary greatly in size and in the complexity of the layers and operators they employ. Large layers, when mapped onto the PE grid, can extract the desired level of utilization from the available hardware resources and amortize the overhead of job creation and dispatch, but smaller layers or lower batch sizes have to resort to techniques such as exploiting sub-graph parallelism and fusion to reach the same level of utilization. Even though there is plenty of room at the software level to perform such optimizations or reduce the overhead of deploying jobs, we believe some provisioning at the architecture level would have made addressing this problem easier. For smaller jobs, the grid must be divided into smaller sub-grids so that each can handle a smaller job, and the task of setting up and tearing down these sub-grids falls to the system's firmware. Having another level of hierarchy in the architecture itself, for example clusters of PEs, might have made this problem easier to solve, as it would provide natural units of isolation and management compared to a monolithic grid of PEs.
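As a simplified illustration of the firmware's task, the sketch below carves a monolithic grid into fixed-size sub-grids for a batch of small jobs; the function and types are illustrative and not the MTIA firmware interface. An architectural cluster level would provide these boundaries, and their isolation and management, without this setup and teardown cost.

```cpp
#include <vector>

// Illustrative partitioning of a grid_rows x grid_cols PE grid into
// equally sized sub-grids, one per small job.
struct SubGrid { int row0, col0, rows, cols; };

std::vector<SubGrid> partition_grid(int grid_rows, int grid_cols, int jobs,
                                    int sub_rows, int sub_cols) {
  std::vector<SubGrid> out;
  for (int r = 0; r + sub_rows <= grid_rows && (int)out.size() < jobs;
       r += sub_rows) {
    for (int c = 0; c + sub_cols <= grid_cols && (int)out.size() < jobs;
         c += sub_cols) {
      out.push_back({r, c, sub_rows, sub_cols});
    }
  }
  return out;  // each sub-grid must then be configured and later torn down
}
```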
In the first generation of MTIA, we built an architecture that can deliver significant levels of efficiency for DLRM workloads compared to GPUs and other accelerators, and we expect this efficiency to keep improving as the software stack matures. The experience of writing kernels, building a compiler, and optimizing models for this architecture has given us great insight into which features are the most impactful. We hope to integrate and leverage the lessons learned on both the hardware and software sides of the project in future generations of the architecture.
ACKNOWLEDGMENTS
This project has been the culmination of dedicated and enthusiastic work of many talented teams and individuals. While it is impossible to mention every team and every person here, the authors would like to specially thank the following teams: Infra Silicon, AI Systems Hardware-Software Co-Design, Compiler Backend, Firmware, Emulation, Release to Production (RTP), Sourcing and Operations Engineering (SOE), and Hardware Platforms. We also would like to express our sincere gratitude to Misha Smelyanskiy, Olof Johansson, Kumar Sundararajan, K. Rajesh Jagannath, Xiao He, Jongsoo Park, Changkyu Kim, Mahesh Maddury, Brian Ko, Kaushal Gandhi, Tom Ulrich, Pritesh Modi, Manish Modi, Krishanth Skandakumaran, Teja Kala, Bhargav Alluri, Hao Jin, Adam Bauserman, Sameer Shripad, Meghana Reddy Swamyreddygari, Sharat Kumar, Dick Tam, Prasad Addagarla, Di Wu, Puneet Anand, Sanjay Kumar, James Hegeman, Nadav Rotem, Fangran Xu, and Shuqing Zhao. We would also like to thank all our vendors for their collaboration and the support that they provided during the course of the project. This work would not have been possible without their close engagement.