MTIA: First Generation Silicon Targeting Meta's Recommendation Systems
Amin Firoozshahian Joe Shajrawi Jordan Fix
Joel Coburn Kevin Quinn Hangchen Yu
Roman Levenstein Nagesh Sreedhara Richard Li
Rakesh Nattoji Pankaj Kansal Kaustubh Gondkar
Ashwin Kamath Willie Wei Jack Montgomery
Olivia Wu Dheepak Jayaraman Mike Tsai
Gurdeepak Grewal Linda Cheng Saritha Dwarakapuram
Harish Aepala Pritam Chopda Sanjay Desai
Bhasker Jakka Eric Wang Nili Avidan
Bob Dreyer Ajay Bikumandla Poorvaja Ramani
Adam Hutchin Arun Karthik Sengottuvel Karthik Narayanan
Utku Diril† Krishna Thottempudi Ajit Mathews
Krishnakumar Nair Ashwin Narasimha Sethu Gopal
Ehsan K. Ardestani Brian Dodds Maxim Naumov
Martin Schatz Cao Gao Vijay Rao
Yuchen Hao Jiyuan Zhang Krishna Noru
Rakesh Komuravelli Mohammad Al-Sanabani Harikrishna Reddy
Kunming Ho Ana Zehtabioskui Prahlad Venkatapuram
Sameer Abu Asal Alexis Bjorlin
Meta Platforms Inc.
Menlo Park, CA, USA
ABSTRACT
Meta has traditionally relied on CPU-based servers for running inference workloads, specifically Deep Learning Recommendation Models (DLRM), but the increasing compute and memory requirements of these models have pushed the company towards specialized solutions such as GPUs or other hardware accelerators. This paper describes the company's effort in constructing its first silicon specifically designed for recommendation systems. It describes the accelerator architecture and platform design and the software stack for enabling and optimizing PyTorch-based models, and it provides an initial performance evaluation. With our emerging software stack, we have made significant progress towards reaching the same or higher efficiency as the GPU: we averaged 0.9x perf/W across various DLRMs, and benchmarks show operators such as GEMMs reaching 2x perf/W. Finally, the paper describes the lessons we learned during this journey, which can improve the performance and programmability of future generations of the architecture.

CCS CONCEPTS
• Computer systems organization ~ Architectures ~ Other architectures ~ Neural networks

KEYWORDS
Accelerators, Machine Learning, Inference, Recommendation Systems, Performance, Programmability

ACM Reference format:
Amin Firoozshahian, Joel Coburn, Roman Levenstein, Rakesh Nattoji, Ashwin Kamath, Olivia Wu, Gurdeepak Grewal, Harish Aepala, Bhasker Jakka, Bob Dreyer, Adam Hutchin, Utku Diril, Krishnakumar Nair, Ehsan K. Ardestani, Martin Schatz, Yuchen Hao, Rakesh Komuravelli, Kunming Ho, Sameer Abu Asal, Joe Shajrawi, Kevin Quinn, Nagesh Sreedhara, Pankaj Kansal, Willie Wei, Dheepak Jayaraman, Linda Cheng, Pritam Chopda, Eric Wang, Ajay Bikumandla, Arun Karthik Sengottuvel, Krishna Thottempudi, Ashwin Narasimha, Brian Dodds, Cao Gao, Jiyuan Zhang, Mohammad Al-Sanabani, Ana Zehtabioskui, Jordan Fix, Hangchen Yu, Richard Li, Kaustubh Gondkar, Jack Montgomery, Mike Tsai, Saritha Dwarakapuram, Sanjay Desai, Nili Avidan, Poorvaja Ramani, Karthik Narayanan, Ajit Mathews, Sethu Gopal, Maxim Naumov, Vijay Rao, Krishna Noru, Harikrishna Reddy, Prahlad Venkatapuram and Alexis Bjorlin. 2023. MTIA: First Generation Silicon Targeting Meta's Recommendation Systems. In Proceedings of the 2023 International Symposium on Computer Architecture (ISCA '23), June 17–21, 2023, Orlando, FL, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3579371.3589348

† Rivos Inc.; work done while at Meta Platforms Inc.
1 Introduction
Machine learning (ML) workloads have become ubiquitous in online activities. In recent years, these models have seen substantial growth in size and complexity, which has contributed towards their increased prediction accuracy and effectiveness. However, at the same time, this growth has presented significant challenges for the hardware platforms that are used for training and inference of these models at very large scales. Total Cost of Ownership (TCO) is one of the major constraining factors in launching models to production in the datacenter, and power is a significant component of TCO for these platforms. Therefore, performance-per-TCO (and performance-per-watt) has become an important metric for any hardware platform targeting these workloads.

Deep Learning Recommendation Models (DLRM) [16] have emerged as one of the most dominant workloads in Meta's datacenters [17][18]. These models combine traditional multilayer perceptron (MLP) operations (referred to as fully connected, or FC, at times), which are compute intensive, with embedding tables that transform sparse features into a dense representation. These tables contain wide vectors that are indexed randomly and reduced to a single vector, which is then combined with data coming from other layers to produce the final results [16]. While embedding table operations have rather light compute requirements, their memory footprint and bandwidth requirements are demanding due to the nature of the data access pattern and the size of the tables.
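To make this structure concrete, the sketch below shows a minimal DLRM-style model in PyTorch with one small MLP, a handful of pooled embedding tables, and a simple concatenation-based combination step; the layer sizes, table count, and interaction scheme are illustrative assumptions for this sketch, not the production configuration.

import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    """Minimal DLRM-style sketch: dense MLP plus pooled embedding lookups."""
    def __init__(self, num_dense=13, num_tables=4, rows=1000, dim=64):
        super().__init__()
        # Compute-intensive FC (MLP) layers for dense features.
        self.bottom_mlp = nn.Sequential(
            nn.Linear(num_dense, 128), nn.ReLU(), nn.Linear(128, dim), nn.ReLU())
        # Memory-intensive embedding tables for sparse features; each lookup
        # pools (sums) a bag of randomly indexed rows into one dense vector.
        self.tables = nn.ModuleList(
            [nn.EmbeddingBag(rows, dim, mode="sum") for _ in range(num_tables)])
        # Top MLP combines pooled embeddings with the dense representation.
        self.top_mlp = nn.Sequential(
            nn.Linear(dim * (num_tables + 1), 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, dense, sparse_indices, offsets):
        x = self.bottom_mlp(dense)
        pooled = [t(sparse_indices[i], offsets[i]) for i, t in enumerate(self.tables)]
        return self.top_mlp(torch.cat([x] + pooled, dim=1))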
Figure 1 shows the historical and estimated future growth in both the complexity and the memory footprint of the inference workloads related to recommendation models in Meta's production datacenters. The dashed line shows the estimated growth in the models' compute requirements, while the solid lines show the increase in memory footprint. The gray solid line captures the footprint of the device memory used to store embedding tables, which is an important component of these models. The level of growth in both compute and memory requirements is certainly an issue that needs to be addressed, especially considering how these workloads are typically run in the datacenter.

Figure 1: Scaling trends for inference models

Figure 2 shows the estimated number of servers deployed for serving inference workloads within the datacenter over the past couple of years. The light solid line shows the number of CPU-based servers, the dashed line shows the number of servers equipped with the first-generation inference accelerator, Intel NNPI [10], and the dark solid line shows the number of GPU-based servers [12]. While the initial demand for increased capacity was temporarily met using the NNPI accelerator, the requirements of the inference models quickly outpaced the NNPI's capabilities and provided motivation for using GPUs. This brought the additional advantage of leveraging the ecosystem already used for training. Therefore, as can be observed, the increased demand in model complexity is increasingly served with GPUs as accelerators.

Figure 2: Growth in server demand for inference workloads

While recent generations of GPUs provide a lot of memory bandwidth and compute power, they are not designed with inference in mind, and therefore the efficiency of processing real inference workloads is low. Developers use a myriad of software techniques, such as operator fusion, shape specialization, graph transformations, and kernel optimizations, to raise the efficiency of GPUs. But despite these efforts, there is still an efficiency gap which makes it challenging and expensive to deploy models in practice.

2 Motivation
Traditionally, CPUs have been used as the primary vehicle to serve inference workloads in Meta's production datacenters, but they are not cost effective in keeping up with the demands of the most recent workloads. To that extent, hardware acceleration has been considered an attractive solution that can address power and performance issues and provide a more efficient way of serving inference requests, while at the same time providing enough headroom in compute performance for running future models.
Given the experience of deploying NNPI and GPUs as accelerators, it was clear that there is room for a more optimized solution for important inference workloads. This solution is based on an in-house accelerator that is architected from the ground up to address the requirements of demanding inference workloads, specifically focused on meeting the performance requirements of DLRMs. However, while focusing on DLRM workloads (given their ongoing variation and evolution, and the fact that the architecture is effectively constructed for forthcoming generations of these workloads), it was also clear that, in addition to performance, the architecture should provide enough generality and programmability to support future versions of these workloads and potentially other types of neural network models.

While creating a custom silicon solution opens the door for ample innovation and specialization towards the target workloads, creating an accelerator architecture for mass deployment in the datacenter is a monumental task. The focus and strategy when architecting the accelerator has therefore been on adopting and reusing suitable pieces of technology, as well as tools and environments, from vendors and the open-source community. This not only improves time to market, but it also leverages the support and enhancements that come from the community and vendors, and reduces the amount of resources required for building, enabling, and deploying such platforms.

The rest of this paper explains the undertaking of architecting MTIA, Meta's first accelerator chip targeting inference workloads, and the learnings that came with it. The next section details the accelerator's architecture and its various provisioned features and components. Section 4 goes over mapping an example operator to this architecture, demonstrating how various provisioned features are utilized to run the operator efficiently. Section 5 provides an overview of the accelerator's software stack, and Section 6 describes our evaluation methodology and results. Finally, Section 7 discusses a few important lessons learned during this development cycle.

3 Accelerator Architecture
Figure 3 shows the high-level architecture of the accelerator, which is organized as an array of processing elements (PEs) connected on a grid. The grid is connected to a set of on-chip memory blocks and off-chip memory controllers through crossbars on each side. There is a separate control subsystem with dedicated processors and peripherals that runs the system's control software. The host interface unit, which contains a PCIe interface, associated DMA engines, and a secure boot processor, also sits alongside this control subsystem.

Figure 3: High-level architecture of the accelerator

Figure 4 shows the internal organization of the PE. A PE consists of two RISC-V processor cores and associated peripherals (on the left), as well as several fixed function units specialized in performing specific computations or data movements (on the right). In addition, each PE has 128KB of local storage. A local interconnect establishes the connectivity between the processors, their peripherals, and the custom hardware blocks.

3.1 Fixed Function Units
Each PE has a total of five fixed function blocks and a Command Processor which orchestrates and coordinates the execution of operations on these fixed function blocks. Functional units form a coarse-grained pipeline within the PE, where data can be passed from one unit to the next to perform successive operations. Each functional unit can also access data directly within the PE's local memory, perform the necessary operations, and write the result back, without passing the data to other functional units.

3.1.1 Memory Layout Unit (MLU)
This block performs operations related to copying and changing the layout of data in the local memory. It can operate on tensors with 4/8/16/32-bit data types. Operations like transpose, concatenation, or reshape are performed using this block. The output data can be sent to the next block directly to be operated on immediately, or it can be stored in the PE's memory. For example, the MLU can transpose a matrix and provide the output directly to the DPE block for a matrix multiplication operation, or it can format the data properly as part of a depth-wise convolution operation and send it to the DPE to perform the actual computation.
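As a rough illustration of this coarse-grained, command-driven pipeline, the toy model below records an MLU transpose feeding a DPE matrix multiply in program order. The command names, buffer IDs, and the CommandProcessorModel API are illustrative stand-ins invented for this sketch; the real hardware interface is not specified at this level of detail in the paper.

class CommandProcessorModel:
    """Toy stand-in for the PE's Command Processor: records issued commands."""
    def __init__(self):
        self.queue = []
    def enqueue(self, op, **operands):
        # The real CP checks operand readiness and dispatches to the units;
        # here we only record the program order of the commands.
        self.queue.append((op, operands))

def transpose_then_matmul(cp):
    # MLU: transpose a weight tile held in local memory (buffer names are made up).
    cp.enqueue("MLU_TRANSPOSE", src="buf_B", dst="buf_BT")
    # DPE: multiply the resident activation tile by the transposed weight tile;
    # in hardware this command would not start until the transpose output is ready.
    cp.enqueue("DPE_MATMUL", src_a="buf_A", src_b="buf_BT")

cp = CommandProcessorModel()
transpose_then_matmul(cp)
print(cp.queue)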
3.1.2 Dot-Product Engine (DPE)
This block performs a set of dot-product operations on two input tensors. The first tensor is read and stored within the DPE first; the second tensor is then streamed through the block and a dot-product operation is performed with all the rows of the first tensor. The DPE can perform 1024 INT8 multiplications (32×32) or 512 FP16/BF16 multiplications (32×16) per cycle. Operations are fully pipelined; performing a multiplication of two maximum-size matrices takes 32 clock cycles. In the case of INT8 multiplication, the resulting output is stored in INT32 format, while in the case of BF16 or FP16 multiplication, the result is stored in FP32 format. The result is always sent to the next functional unit in the pipeline for storage and accumulation.
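As a quick consistency check on these figures (our own arithmetic, assuming the maximum-size INT8 operands are 32×32 matrices):

\[
32 \times 32 \times 32 = 32{,}768\ \text{MACs}, \qquad
\frac{32{,}768\ \text{MACs}}{1024\ \text{MACs/cycle}} = 32\ \text{cycles},
\]

which matches the stated 32-cycle latency for a fully pipelined maximum-size multiplication.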
Figure 4: PE's internal organization

3.1.3 Reduction Engine (RE)
The reduction engine hosts the storage elements that keep track of the results of matrix multiplication operations and accumulate them over multiple operations. There are four separate storage banks that can be used independently to store and accumulate the results coming from the DPE. The RE can load an initial bias into these accumulators and can also send their contents to neighboring PEs over a dedicated reduction network (discussed later in this section). Upon receiving results over the reduction network, the RE accumulates the received values on top of the values in one of the local storage banks. It can then send the result to the next neighbor, to the SE, or store it in the PE's local memory directly.

3.1.4 SIMD Engine (SE)
This block performs operations like quantization/de-quantization and nonlinear functions. Internally, the block contains a set of lookup tables and floating-point arithmetic units to calculate linear or cubic approximations of nonlinear functions such as exponential, sigmoid, tanh, etc. The approximation accepts INT8 or FP16 data types as inputs and produces an INT8 or FP32 result at the output. The unit can receive its inputs directly from the RE block or read them from the local memory. In addition, this block is also capable of using its floating-point ALUs to perform a set of predefined elementwise operations, such as addition, multiplication, accumulation, etc.

3.1.5 Fabric Interface (FI)
This block acts as the gateway in and out of the PE. It connects to and communicates over the accelerator's on-chip network. It formulates and sends memory access requests to on-chip and off-chip memories, as well as system registers, and receives back the data or write completions. It implements a set of DMA-like operations that transfer data in and out of the PE's local memory. It also receives and transmits cache misses and uncached accesses from the processor cores, and it allows other entities (other PEs or the control subsystem) to access the PE's internal resources.

3.1.6 Command Processor (CP)
In addition to hosting the PE's local memory and registers, the CP block acts as the central processing unit that orchestrates the execution of various operations on the fixed function blocks concurrently. It receives instructions from the two processor cores in the PE, performs dependency checking, scheduling, and tracking for those instructions, and dispatches them to the fixed function units for execution. It contains two separate schedulers (one for each processor core), a set of command queues, and arbitration logic for accessing the local memory and register resources.

The hardware provides a set of basic atomic primitives to allow synchronization between the cores (within the PE or across multiple PEs). These primitives are enacted by the processors, allow atomic updates to predefined registers, and can stall the processor until certain conditions are satisfied externally (e.g., a counter reaches a certain value). At a higher level, these mechanisms are used for efficient implementation of software constructs such as locks, ticketing locks, mutexes, and barriers. The logic that performs the atomic operations, as well as the relevant registers, resides within the Command Processor and is tightly integrated with the processor cores through custom interfaces.
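To illustrate how such primitives compose into higher-level constructs, the sketch below models a barrier built from an atomic fetch-and-add on a shared counter plus a wait-until-value primitive. The method names and the software emulation of the stall are illustrative assumptions; on MTIA the counter would live in a predefined CP register and the wait would stall the core in hardware.

import threading

class BarrierModel:
    """Toy model of a barrier built on atomic add + wait-until-condition."""
    def __init__(self, num_participants):
        self.n = num_participants
        self.counter = 0
        self.cond = threading.Condition()

    def atomic_fetch_add(self, value=1):
        # Stand-in for the hardware's atomic update of a predefined register.
        with self.cond:
            old = self.counter
            self.counter += value
            self.cond.notify_all()
            return old

    def wait_until(self, target):
        # Stand-in for stalling the core until the counter reaches a value.
        with self.cond:
            while self.counter < target:
                self.cond.wait()

    def arrive_and_wait(self):
        self.atomic_fetch_add(1)   # announce arrival
        self.wait_until(self.n)    # block until all participants have arrived

# Example: four "PE threads" synchronizing at a barrier.
barrier = BarrierModel(4)
threads = [threading.Thread(target=barrier.arrive_and_wait) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()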
3.2 Processor Cores
Each PE contains two RISC-V cores that run the application's code and issue commands to the CP for offloading various computations to the fixed function units. The cores are single-issue, in-order cores with a five-stage pipeline (AX25-V100, from Andes Technology), and they are heavily customized to suit the functionality needed. The set of customizations includes custom interfaces, custom registers, custom instructions, and custom exceptions. Custom interfaces connect the cores to the CP to issue commands to the fixed function units and move data back and forth between the cores and local memory. Custom registers store the command information that is sent to the CP upon issuing commands. Custom instructions are added to start the desired operation on each of the fixed function units. Finally, custom exceptions ensure the correctness of each command issued to the CP and raise an exception in case of illegal values in the command.

One of the processor cores is equipped with the RISC-V vector extension, which adds extra flexibility to the PE and allows implementing operations that do not map well to the existing fixed function units. The vector processing unit contains 32 vector registers, each 64B wide, and has the same width for all vector functional units. It implements version 0.8.1 of the RISC-V vector extension [23].

3.3 Local Memory
Each PE's 128KB of local memory is shared between the processors and functional units. The CP implements an arbitration scheme for the memory banks and coordinates accesses from the cores and fixed function units. Local memories are mapped into the system's address space and can be accessed by the cores via regular load/store instructions.
There is an abstraction layer introduced on top of the local memories to simplify usage and dependency checking between the operations that use them; it can be considered a further extension of the buffet concept [1][2]. Each PE can define circular buffers (CBs) that are mapped onto the existing local memory. Each CB is designated with an ID and has a pair of registers that specify its size (depth) and starting address in the local memory. In addition, each CB implements a set of read and write pointers to realize a hardware FIFO.

In a CB, read operations always read data starting from the read pointer, and write operations always write data starting from the write pointer. Like buffets, read and write operations carry an offset which allows them to access a location other than the current head or tail of the buffer (Figure 5). Fixed function units use CB IDs as their input/output operands; for example, a matrix multiplication operation uses two CBs as its input operands. Before allowing an operation to start, the Command Processor checks the availability of data in the input CBs and of space in the output CB. It allows the operation to start only if the necessary element and space checks pass. Therefore, an operation is guaranteed to have the necessary resources to complete and will not stall the functional unit in the middle of its execution.

Figure 5: Reading from a Circular Buffer

The Command Processor also uses the CB IDs to enforce dependency checks and interlocks between different custom instructions. It ensures that operations that access and modify a particular CB are always executed in program order, while operations that operate on different CBs, or on different regions of the same CB, can execute in parallel. This significantly simplifies the dependency checks compared to using absolute local memory addresses for enforcing such interlocks.

CBs also simplify realization of the producer-consumer execution model between different operations, which can be initiated by different cores or different fixed function units. For example, a program can issue a series of DMA operations to the hardware (which move data from an external memory into a CB), following them up with a set of custom compute operations (e.g., MATMUL) that use that data, without requiring explicit synchronization between the two. The MATMUL instruction is automatically stalled by the Command Processor until enough data has been brought into the CBs by the prior DMA operations, and it starts immediately afterwards, relieving the program from explicitly checking the availability of the data.

While some instructions, like DMA operations, automatically adjust the read and write pointers (as they move data in and out of the CBs, and hence produce or consume elements), other custom instructions do not move the pointers. This allows data inside a CB to be reused multiple times by different operations before it is explicitly marked as consumed. The hardware provides additional custom instructions that can adjust both read and write pointers in each CB, allowing explicit marking of data elements as produced or consumed when necessary.
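The following is a small software model of the circular-buffer semantics described above (read/write pointers, offset-based access without consuming, and explicit produce/consume). It is a simplified illustration written for this description, not the hardware's actual programming interface.

class CircularBufferModel:
    """Toy model of an MTIA-style circular buffer (CB) in local memory."""
    def __init__(self, depth):
        self.storage = [None] * depth
        self.depth = depth
        self.read_ptr = 0    # index of the oldest unconsumed element
        self.write_ptr = 0   # index where the next element is produced
        self.count = 0       # number of produced, not-yet-consumed elements

    def available(self):
        return self.count

    def space(self):
        return self.depth - self.count

    def peek(self, offset=0):
        """Read relative to the read pointer without consuming (buffet-style)."""
        assert offset < self.count, "element not yet produced"
        return self.storage[(self.read_ptr + offset) % self.depth]

    def produce(self, values):
        """E.g., a DMA writing data in: advances the write pointer."""
        assert len(values) <= self.space(), "not enough space in the CB"
        for v in values:
            self.storage[self.write_ptr] = v
            self.write_ptr = (self.write_ptr + 1) % self.depth
            self.count += 1

    def consume(self, n):
        """Explicitly mark n elements as consumed: advances the read pointer."""
        assert n <= self.count, "cannot consume more than is available"
        self.read_ptr = (self.read_ptr + n) % self.depth
        self.count -= n

# A compute command would only be dispatched once enough elements are available:
cb = CircularBufferModel(depth=8)
cb.produce([10, 11, 12])   # DMA brings three elements in
x = cb.peek(offset=1)      # reuse data without consuming it
cb.consume(2)              # later, explicitly mark two elements consumed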
3.4 Memory Subsystem and Interconnect
In addition to the local memory within the PEs, the accelerator also has 128MB of on-chip SRAM, organized as slices around the grid. This on-chip memory can be used as an addressable scratchpad or as a common, shared, memory-side cache. There are four LPDDR5 controllers on each side of the grid, providing a total of 176 GB/s of (theoretical) off-chip bandwidth. The accelerator can support a total of 128GB of off-chip memory capacity. Memory addresses are distributed across these controllers and among the on-chip SRAM slices. When the on-chip SRAM is configured as a cache, each group of four cache slices is associated with a single memory controller and caches its addresses.

The on-chip network that connects all the PEs and memories together is based on the AXI interconnect with special enhancements. The interconnect consists of two networks that carry memory and register accesses separately. The memory access network is equipped with a multicast feature which allows coalescing requests from multiple PEs into one (if they are made to the same set of addresses). A single request is then sent to the memory blocks to retrieve the data and return it to all requesting PEs. Multicast is only supported for PEs located along the same row or column of the grid, however, and cannot be used for an arbitrary group of PEs.

In addition to the main AXI-based interconnect, PEs are also connected to each other via a specialized network, called the reduction network. This is a unidirectional network that travels only from north to south and from west to east. It carries partial sums from the accumulators in the RE block of one PE to another. Using this network, PEs can expediently accumulate the results of their computation without having to save and restore them in memory. The last PE in the row or column can then store the final result in memory, after all partial values are accumulated.
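As a rough sketch of how the reduction network is used (our own illustration, with made-up tile shapes): each PE in a row holds a slice of the reduction (k) dimension, computes a partial result locally, adds in the partial sum arriving from its neighbor, and forwards the total; only the last PE writes the result to memory.

import numpy as np

def row_reduction_model(a_tiles, b_tiles):
    """Toy model of accumulating partial sums across a row of PEs.

    a_tiles[i] and b_tiles[i] are the k-slices of A and B assigned to PE i.
    Each PE adds its local partial product to the running sum received from
    its neighbor, mimicking the RE-to-RE reduction network.
    """
    running = None
    for a_slice, b_slice in zip(a_tiles, b_tiles):
        partial = a_slice @ b_slice              # local work in PE i
        running = partial if running is None else running + partial
    return running                               # last PE in the row stores this

# Example: a 64x256 by 256x64 product split across 4 PEs along k.
k_splits = 4
A = np.random.randn(64, 256).astype(np.float32)
B = np.random.randn(256, 64).astype(np.float32)
a_tiles = np.split(A, k_splits, axis=1)
b_tiles = np.split(B, k_splits, axis=0)
assert np.allclose(row_reduction_model(a_tiles, b_tiles), A @ B, atol=1e-3)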
3.5 Parallelism and Data Reuse
Parallelism, locality, and data reuse play a significant role in the efficient utilization of limited hardware resources in any deep learning accelerator. The MTIA architecture provides a set of features to allow multiple degrees of parallelism and maximal exploitation of temporal and spatial data reuse in neural network models and operators, as discussed below.

Parallelism: The architecture provides support for multiple levels of parallelism and overlapping of various operations. Data-level parallelism (DLP) is exploited through the use of wide vectors in the fixed function units as well as the vector processors. Multiple PEs can also operate on the same task in a data-parallel manner. Instruction-level parallelism is exploited in the Command Processor by allowing multiple outstanding operations to be handled by different fixed function blocks simultaneously. Memory-level parallelism (MLP) is achieved by allowing many outstanding requests to on-chip and off-chip memories from each PE. Finally, thread-level parallelism (TLP) can be achieved by utilizing multiple PEs (or groups of PEs) to run parallel threads, as well as by having two independent threads within each PE. Threads within the PE can cooperate in performing a given task, with one thread orchestrating the data movement and the other orchestrating the computation.

Caching: There are multiple levels of caching in various blocks of the hardware to improve locality and reduce memory bandwidth consumption. This includes instruction and data caches in the processor cores, the large on-chip last-level cache, and caching of input operands in the DPE block. The caching at the DPE level allows the engine to hold data from both operand A and operand B and save accesses to local memory upon a hit.

Circular buffers / local memories: Circular buffers provide the storage for holding input operands while the PE performs the computations. Flexibility in adjusting pointers, as well as offsetting into any location within a circular buffer, allows the program to access each line of data multiple times before deciding to mark it as consumed.

Specialized reduction: Having a dedicated reduction network not only offloads a large part of the data transfer from the system's main on-chip network, but also provides a way of grouping PEs together and using their local memories in an aggregate form. This in turn allows storing a larger portion of the input operands in the PEs and reduces the bandwidth requirement for loading them from off-chip memory. In addition, the DPE block utilizes reduction trees (spatial sum) to calculate the output of a multiplication operation [1][3][4], which is known to be more energy efficient [5].

Multicasting: As mentioned earlier, the system's NoC allows coalescing requests from multiple PEs when they access the same set of addresses in memory. This reduces memory bandwidth and increases the energy efficiency of data movement by allowing the data to be shared while reading it from memory only once and delivering it to all requesters [1][6][7][8].

Figure 6 shows the die plot with the grid of PEs, surrounded by on-chip SRAMs and off-chip DDR controllers, while Table I lists a summary of the chip's features and parameters.

Table I - Summary of MTIA features and parameters

Parameter | Value
Technology | TSMC 7nm
Frequency | 800 MHz nominal (1.1 GHz max)
Instances | 1.12B gates, 65M flops
Dimensions | 19.34 × 19.1 mm (373 mm²)
Package | 43 × 43, ~2800 pins
TDP | 25 W
Voltage | Dual rail: 0.67V (logic), 0.75V (memories)
Host Connectivity | 8× PCIe Gen4 (16 GB/s)
GEMM TOPS (MAC) | 102.4 (INT8); 51.2 (FP16)
SIMD TOPS | Vector: 0.8 (FP32) / 1.6 (FP16) / 3.2 (INT8); SE: 1.6 (FP16) / 3.2 (INT8)
Memory Bandwidth | Local memory: 400 GB/s per PE; On-chip SRAM: 800 GB/s; Off-chip DRAM: 176 GB/s
Memory Capacity | Local memory: 128 KB per PE; On-chip SRAM: 128 MB; Off-chip LPDDR5: 64 GB (16 channels)

4 Mapping an FC Layer
To demonstrate how all the above-mentioned features work together, let us consider an FC operator that performs a matrix multiplication of the form C^T = A×B^T and see how it maps to a sub-grid of PEs. The reason for performing the operation in a transposed manner is to keep k as the inner dimension for both tensors, to increase the efficiency of memory accesses. Matrix A is assumed to be m×k and matrix B is assumed to be k×n (hence B^T is n×k), producing an output C which is an m×n matrix (C^T being n×m). Inputs are assumed to have a row-major memory layout. When the inner dimension (k) is not a multiple of 32B, the stride of the outer dimension (m or n) is aligned to 32B boundaries for efficient data movement. For simplicity, we will assume that all elements are of the INT8 data type.
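To make the data layout concrete, the NumPy sketch below reproduces the transposed formulation (every output element of C^T is a dot product along k, so k is the contiguous inner dimension of both row-major operands) and a simple split of the output across a hypothetical 2×2 sub-grid of PEs. The tile sizes and grid shape are illustrative assumptions, not the actual mapping parameters.

import numpy as np

def fc_transposed(A, BT):
    """Compute C^T for C = A @ B, given A (m x k) and B^T (n x k), both row-major."""
    return BT @ A.T          # (n x k) @ (k x m) -> C^T with shape (n x m)

m, k, n = 128, 256, 64
A = np.random.randint(-128, 127, size=(m, k), dtype=np.int8)
BT = np.random.randint(-128, 127, size=(n, k), dtype=np.int8)

# Reference: accumulate in INT32, as the DPE does for INT8 inputs.
A32, BT32 = A.astype(np.int32), BT.astype(np.int32)
CT = fc_transposed(A32, BT32)
assert CT.shape == (n, m)
assert np.array_equal(CT, (A32 @ BT32.T).T)

# Illustrative 2x2 PE sub-grid: rows of A (output rows of C) are split across
# grid rows, rows of B^T (output columns of C) across grid columns; partial
# sums along k would be combined over the reduction network (not modeled here).
for pe_row, a_tile in enumerate(np.split(A32, 2, axis=0)):
    for pe_col, bt_tile in enumerate(np.split(BT32, 2, axis=0)):
        ct_tile = fc_transposed(a_tile, bt_tile)   # tile of C^T on PE (pe_row, pe_col)
        assert ct_tile.shape == (n // 2, m // 2)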
The software stack is broadly split into two parts: the host software and the device firmware. The host software consists of the Linux device driver, a device access library that provides a uniform device interface, and a streaming API to interface with PyTorch, as well as software tools and utilities for managing and monitoring the device.

Device firmware: The device firmware includes ROM-based pre-boot firmware, secure boot firmware running on its own processor, the Control Core Processor firmware running on the control subsystem and performing runtime and management operations, and finally the PE monitor that runs on the PEs in the compute grid and schedules and monitors the workloads running on the PEs. The main control firmware is based on the Zephyr real-time OS [9].

6 Results
We evaluate the performance of MTIA by comparing it against a baseline accelerator (NNPI) [10] and against more recently deployed GPUs. It should be noted that we report results collected with an under-development software stack, as we believe this reflects the end-to-end performance and is representative of a production environment. However, this stack is not currently as optimized as the GPU's software stack. Consequently, there are cases where the GPU is more efficient, but we are hoping to close this gap over time and have the MTIA software stack deliver the full gains of the architecture across the entire DLRM workload space. We evaluate both operator-based benchmarks as well as full DLRM models varying in complexity, size, and accuracy. Since these accelerators are all based on different hardware platforms, we first compare their system-level hardware specifications (Table II). These platforms are the following: a Yosemite V2 server with six NNPI accelerator cards [11], a Zion4S server with eight Nvidia A100 GPUs [12], and a Yosemite V3 server [13] with twelve MTIA accelerator cards.

Table II - Inference hardware platforms

Metric | Yosemite V2 (6 NNPI) | Zion4S (8 GPU) | Yosemite V3 (12 MTIA)
Power: System | 298 W | 4500 W | 780 W
Power: Card | 13.5 W | 330 W | 35 W
Power: Percentage | 27.2 % | 58.7 % | 53.8 %
Compute: INT8 (TOPS) | 50 × 6 | 624 × 8 | 104 × 12
Compute: FP16 (TFLOPS) | 6.25 × 6 | 312 × 8 | 52 × 12
Memory: Type (device) | LPDDR | HBM | LPDDR
Memory: Size (device) | 16 GB × 6 | 40 GB × 8 | 32 GB × 12
Memory: BW (device) | 50 GB/s × 6 | 1.5 TB/s × 8 | 150 GB/s × 12
Memory: Size (host) | 64 GB | 1.5 TB | 96 GB
Memory: BW (host) | 50 GB/s | 400 GB/s | 76 GB/s
Comms.: Dev.-to-Dev. | PCIe | NVLink | PCIe
Comms.: P2P BW (card) | 3.2 GB/s | 80 GB/s | 12.8 GB/s
Comms.: NIC BW | 50 Gbps | 400 Gbps | 100 Gbps

While we can compare the absolute performance of MTIA versus NNPI and GPUs, each device has different capabilities in terms of compute throughput, memory bandwidth, and memory capacity. They also operate under different power budgets. Therefore, in our study we report perf/W (as a proxy for perf/TCO, given the sensitive nature of TCO), because power is an important factor in provisioning for deployment in the datacenter. We use the total platform power divided by the number of accelerator cards to determine the power provisioned for each accelerator, as opposed to using the maximum TDP of the card.
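A worked example of this provisioning methodology, using the Yosemite V3 / MTIA column of Table II (our own arithmetic; the reading of the Percentage row as card power over provisioned power is inferred from the numbers):

\[
P_{\text{provisioned}} = \frac{780\ \text{W}}{12\ \text{cards}} = 65\ \text{W/card},
\qquad \frac{35\ \text{W}}{65\ \text{W}} \approx 53.8\%.
\]

The same calculation reproduces the 27.2% and 58.7% entries for the NNPI and GPU platforms; per the methodology above, the provisioned per-card figure (not the card's own power rating) is the denominator used for perf/W.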
6.1 Benchmark Performance
We first evaluate the performance of several important operators and kernels that push the limits of the architecture and are representative of the main components in production DLRMs. Table III shows the latency breakdown of a request in a representative DLRM with batch sizes of 64 and 256. The model has approximately 750 layers, with nearly 550 consisting of EB operators. For a batch size of 64, FC dominates the execution time followed by EB, while for a batch size of 256, EB slightly dominates FC and the two together account for 62% of the execution time. It should be noted that with larger input shapes, the kernels are able to better amortize the setup costs and reuse the data more, hence achieving higher utilization of the fixed-function units in the hardware.

Table III - Operator breakdown, medium complexity DLRM

Operator | Batch size 64 | Batch size 256
FC (Fully Connected) | 42.10 % | 32.4 %
EB (Embedding Bag) | 31.19 % | 30.0 %
Concat | 2.86 % | 11.5 %
Transpose | 8.47 % | 5.9 %
Quantize | 1.55 % | 5.3 %
Dequantize | 2.94 % | 3.3 %
BatchMatMul | 3.30 % | 1.7 %
Others | 7.59 % | 11.0 %

Based on this breakdown, we use a set of benchmarks to assess the efficiency of MTIA's hardware. While not full-fledged workloads, these benchmarks allow exercising various shapes and sizes for the important operators (including corner cases) and shed light on potential deficiencies that might exist in the hardware. GemmBench [14] is used to evaluate dense computation; it creates a model composed of a chain of FC layers. In our benchmark runs we focus on both FP16 and INT8 (quantized) data, the latter of which requires additional quantize and dequantize layers. TBEBench [15] is used to evaluate sparse computation, and it allows us to configure the batch size, number of tables, number of rows per table, embedding dimension, and pooling factor of TBE operators. BatchGEMMBench [24], ConcatBench [25], and TransposeBench [26] are used to cover the other significant operators typically seen in recommendation models. We also evaluate several elementwise kernels, including quantize, dequantize, and tanh.

Dense computation: We evaluate both INT8 and FP16 Fully Connected (FC) layers (Figure 10 and Figure 11). When accuracy is sufficient, INT8 quantization unlocks a potential 2x improvement in FC throughput. For the set of shapes we evaluate, the trend lines roughly track for MTIA and the GPU across INT8 and FP16, indicating that the software implementations are well optimized across a range of arithmetic intensities. In many cases, MTIA achieves 2x or greater performance per Watt, and it is particularly effective for low batch sizes, which helps when serving requests under stringent latency requirements. For large batch sizes, the GPU is able to achieve higher utilization with the increased amount of work, so the perf/W gains of MTIA are lower. Note that MTIA is most efficient when tensors can be streamed directly from SRAM, which means that graph optimizations and managing data locality are very important for good performance at the model level.

Sparse computation: While a typical recommendation model might include hundreds of EmbeddingBag (EB) operators, they can be merged together into one or more TableBatchedEmbedding (TBE) operators to amortize kernel launch overhead and increase the work that can be parallelized across the device. Figure 12 shows the performance (in GB/s/W) of the TBE benchmark running on MTIA and the GPU for a set of representative operator shapes. Note that we report performance in terms of GB/s here because this benchmark is mostly memory bound, and measuring bandwidth as opposed to lookups/sec provides better insight into hardware utilization. Here we utilize the cache configuration of the on-chip SRAM to take advantage of locality across and within batches. In these examples, all table entries use 8-bit quantization, and the triplets shown in the graph describe the operator's pooling factor, number of rows in the table, and embedding dimension (elements per row). MTIA achieves between 0.6x and 1.5x the perf/W of the GPU with the current kernel implementation.

Given the evolving nature of the software stack, we observe that there is significant headroom for improvement: MTIA is reaching just 10-20% of its memory bandwidth, whereas the GPU is achieving about 60% of its HBM bandwidth. To ensure that there are no deficiencies in the hardware, we used hand-written kernels developed for RTL validation, and we could observe performance levels as high as 500 GB/s (more than 60% of the roofline), or 6 GB/s/W, given sufficient locality in the SRAM. We hope to close this gap by improving the software pipelining and instruction scheduling of the TBE kernels.
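As an illustration of why the TBE benchmark is reported in bandwidth terms, the sketch below estimates the embedding-table bytes one operator invocation must move for a hypothetical shape; the batch size, triplet values, and kernel time are made-up numbers chosen only to show the calculation, not measurements from Figure 12.

def tbe_bytes_moved(batch_size, pooling_factor, embedding_dim, bytes_per_element=1):
    """Bytes of embedding-table data read per TBE operator invocation.

    Each sample gathers `pooling_factor` rows of `embedding_dim` elements;
    with 8-bit quantized tables, each element is one byte. The number of rows
    in the table affects footprint and locality, not the bytes per lookup.
    """
    return batch_size * pooling_factor * embedding_dim * bytes_per_element

# Hypothetical shape in the (pooling factor, rows, embedding dim) style used above.
pooling, rows, dim = 64, 5_000_000, 128
batch = 256
bytes_read = tbe_bytes_moved(batch, pooling, dim)   # ~2 MiB per operator call
kernel_time_s = 50e-6                               # assumed kernel time
achieved_gbps = bytes_read / kernel_time_s / 1e9    # ~42 GB/s for these numbers
print(f"{bytes_read / 2**20:.1f} MiB moved -> {achieved_gbps:.1f} GB/s")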
operator fusion, and minimizing data fetch from DRAM could potentially mitigate this issue.

and MTIA needs similar optimizations in order to achieve the same or higher levels of efficiency. These initial results give us insight into areas of the software stack that we should consider focusing on in the future (e.g., large FCs, TBE optimizations, operator fusion, etc.), as well as provide important learnings for the next-generation architecture, which we discuss next.
ACKNOWLEDGMENTS
Shripad, Meghana Reddy Swamyreddygari, Sharat Kumar, Dick Tam, Prasad Addagarla, Di Wu, Puneet Anand, Sanjay Kumar, James Hegeman, Nadav Rotem, Fangran Xu, and Shuqing Zhao. We would also like to thank all our vendors for their collaboration and the support that they provided during the course of the project. This work would not have been possible without their close engagement.

REFERENCES
[1] V. Sze, Y. Chen, T. Yang, and J.S. Emer, Efficient Processing of Deep Neural Networks, Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2020.
[2] M. Pellauer, Y. S. Shao, J. Clemons, N. Crago, K. Hegde, R. Venkatesan, S. W. Keckler, C. W. Fletcher, and J. Emer, "Buffets: An efficient and composable storage idiom for explicit decoupled data orchestration," in Proceedings of Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019.
[3] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in Proceedings of the International Symposium on Computer Architecture (ISCA), 2016.
[4] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits (JSSC), 51(1), 2017.
[5] Y. Harata, Y. Nakamura, H. Nagase, M. Takigawa, and N. Takagi, "A high-speed multiplier using a redundant binary adder tree," IEEE Journal of Solid-State Circuits (JSSC), 22(1):28–34, 1987.
[6] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices," IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS), 2019.
[7] T. Krishna, H. Kwon, A. Parashar, M. Pellauer, and A. Samajdar, Data Orchestration in Deep Learning Accelerators, Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2020.
[8] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-L. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, "In-datacenter performance analysis of a tensor processing unit," in Proceedings of the International Symposium on Computer Architecture (ISCA), June 2017.
[9] https://zephyrproject.org
[10] M. Anderson, B. Chen, S. Chen, S. Deng, J. Fix, M. Gschwind, A. Kalaiah, C. Kim, J. Lee, J. Liang, H. Liu, Y. Lu, J. Montgomery, A. Moorthy, S. Nadathur, S. Naghshineh, A. Nayak, J. Park, C. Petersen, M. Schatz, N. Sundaram, B. Tang, P. Tang, A. Yang, J. Yu, H. Yuen, Y. Zhang, A. Anbudurai, V. Balan, H. Bojja, J. Boyd, M. Breitbach, C. Caldato, A. Calvo, G. Catron, S. Chandwani, P. Christeas, B. Cottel, B. Coutinho, A. Dalli, A. Dhanotia, O. Duncan, R. Dzhabarov, S. Elmir, C. Fu, W. Fu, M. Fulthorp, A. Gangidi, N. Gibson, S. Gordon, B. Padilla Hernandez, D. Ho, Y. Huang, O. Johansson, S. Juluri, S. Kanaujia, M. Kesarkar, J. Killinger, B. Kim, R. Kulkarni, M. Lele, Huayu Li, Huamin Li, Y. Li, C. Liu, J. Liu, B. Maher, C. Mallipedi, S. Mangla, K.K. Matam, J. Mehta, S. Mehta, C. Mitchell, B. Muthiah, N. Nagarkatte, A. Narasimha, B. Nguyen, T. Ortiz, S. Padmanabha, D. Pan, A. Poojary, Y. Qi, O. Raginel, D. Rajagopal, T. Rice, C. Ross, N. Rotem, S. Russ, K. Shah, B. Shan, H. Shen, P. Shetty, K. Skandakumaran, K. Srinivasan, R. Sumbaly, M. Tauberg, M. Tzur, S. Verma, H. Wang, M. Wang, B. Wei, A. Xia, C. Xu, M. Yang, K. Zhang, R. Zhang, M. Zhao, W. Zhao, R. Zhu, A. Mathews, L. Qiao, M. Smelyanskiy, B. Jia, V. Rao, "First-Generation Inference Accelerator Deployment at Facebook," in Arxiv, 2021. [Online]. Available: https://arxiv.org/abs/2107.04140, unpublished.
[11] J. Ehlen, J. Clow, B. Wei, D. Chong, "Facebook Multi-node Server Platform: Yosemite V2 Design Specification," Open Compute Project, https://www.opencompute.org/documents/facebook-multi-node-server-platform-yosemite-v2-design-specification
[12] D. Mudigere, Y. Hao, J. Huang, Z. Jia, A. Tulloch, S. Sridharan, X. Liu, M. Ozdal, J. Nie, J. Park, L. Luo, J. Yang, L. Gao, D. Ivchenko, A. Basant, Y. Hu, J. Yang, E. K. Ardestani, X. Wang, R. Komuravelli, C.H. Chu, S. Yilmaz, H. Li, J. Qian, Z. Feng, Y. Ma, J. Yang, E. Wen, H. Li, L. Yang, C. Sun, W. Zhao, D. Melts, K. Dhulipala, K.R. Kishore, T. Graf, A. Eisenman, K. K. Matam, A. Gangidi, G. J. Chen, M. Krishnan, A. Nayak, K. Nair, B. Muthiah, M. Khorashadi, P. Bhattacharya, P. Lapukhov, M. Naumov, A. Mathews, L. Qiao, M. Smelyanskiy, B. Jia, V. Rao, "Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models," in Proceedings of the International Symposium on Computer Architecture (ISCA), June 2022.
[13] M. Haken, J. Clow, Y. Li, B. Wei, D. Chong, T. Ky, "Yosemite V3: Facebook Multi-node Server Platform Design Specification," Open Compute Project, https://www.opencompute.org/documents/ocp-yosemite-v3-platform-design-specification-1v16-pdf
[14] GemmBench. [Online]. Available: https://github.com/pytorch/glow/blob/master/tests/benchmark/GemmBench.cpp
[15] TableBatchedEmbeddingBagBench (TBEBench). [Online]. Available: https://github.com/pytorch/glow/blob/master/tests/benchmark/TBEBench.cpp
[16] M. Naumov, D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A.G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, M. Smelyanskiy, "Deep Learning Recommendation Model for Personalization and Recommendation Systems," in Arxiv, 2021. [Online]. Available: https://arxiv.org/abs/1906.00091, unpublished.
[17] U. Gupta, C. Wu, X. Wang, M. Naumov, B. Reagen, D. Brooks, B. Cottel, K. Hazelwood, M. Hempstead, B. Jia, H.S. Lee, A. Malevich, D. Mudigere, M. Smelyanskiy, L. Xiong, X. Zhang, "The Architectural Implications of Facebook's DNN-based Personalized Recommendation," in Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), February 2020.
[18] J. Park, M. Naumov, P. Basu, S. Deng, A. Kalaiah, D. Khudia, J. Law, P. Malani, A. Malevich, S. Nadathur, J. Pino, M. Schatz, A. Sidorov, V. Sivakumar, A. Tulloch, X. Wang, Y. Wu, H. Yuen, U. Diril, D. Dzhulgakov, K. Hazelwood, B. Jia, Y. Jia, L. Qiao, V. Rao, N. Rotem, S. Yoo, M. Smelyanskiy, "Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications," in Arxiv, 2018. [Online]. Available: https://arxiv.org/abs/1811.09886
[19] J. K. Reed, Z. DeVito, H. He, A. Ussery, J. Ansel, "Torch.fx: Practical Program Capture and Transformation of Deep Learning in Python," in Arxiv, 2022. [Online]. Available: https://arxiv.org/abs/2112.08429
[20] https://pytorch.org/docs/stable/fx.html
[21] C. Lattner, V. Adve, "LLVM: a compilation framework for lifelong program analysis & transformation," in Proceedings of the International Symposium on Code Generation and Optimization, 2004.
[22] https://llvm.org/docs/LangRef.html
[23] https://github.com/riscv/riscv-v-spec
[24] BatchGemmBench. [Online]. Available: https://github.com/pytorch/glow/blob/master/tests/benchmark/BatchGemmBench.cpp
[25] ConcatBench. [Online]. Available: https://github.com/pytorch/glow/blob/master/tests/benchmark/ConcatBench.cpp
[26] TransposeBench. [Online]. Available: https://github.com/pytorch/glow/blob/master/tests/benchmark/TransposeBench.cpp