
Using Trio – Juniper Networks’ Programmable Chipset – for

Emerging In-Network Applications

Mingran Yang† , Alex Baban∗ , Valery Kugel∗ , Jeff Libby∗ , Scott Mackie∗ ,

Swamy Sadashivaiah Renu Kananda∗ , Chang-Hong Wu∗ , Manya Ghobadi†


† Massachusetts Institute of Technology ∗ Juniper Networks

ABSTRACT
This paper describes Trio, a programmable chipset used in Juniper Networks' MX-series routers and switches. Trio's architecture is based on a multi-threaded programmable packet processing engine and a hierarchy of high-capacity memory systems, making it fundamentally different from pipeline-based architectures. Trio gracefully handles non-homogeneous packet processing rates for a wide range of networking use cases and protocols, making it an ideal platform for emerging in-network applications. We begin by describing the Trio chipset's fundamental building blocks, including its multi-threaded Packet Forwarding and Packet Processing Engines. We then discuss Trio's programming language, called Microcode. To showcase Trio's flexible Microcode-based programming environment, we describe two use cases. First, we demonstrate Trio's ability to perform in-network aggregation for distributed machine learning. Second, we propose and design an in-network straggler mitigation technique using Trio's timer threads. We prototype both use cases on a testbed using three real DNN models (ResNet50, DenseNet161, and VGG11) to demonstrate Trio's ability to mitigate stragglers while performing in-network aggregation. Our evaluations show that when stragglers occur in the cluster, Trio outperforms today's pipeline-based solutions by up to 1.8×.

CCS CONCEPTS
• Networks → Routers; Programmable networks; In-network processing;

KEYWORDS
Network hardware design, Programmable dataplanes, Network support for machine learning

ACM Reference Format:
Mingran Yang, Alex Baban, Valery Kugel, Jeff Libby, Scott Mackie, Swamy Sadashivaiah Renu Kananda, Chang-Hong Wu, Manya Ghobadi. 2022. Using Trio – Juniper Networks' Programmable Chipset – for Emerging In-Network Applications. In ACM SIGCOMM 2022 Conference (SIGCOMM '22), August 22–26, 2022, Amsterdam, Netherlands. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3544216.3544262

This work is licensed under a Creative Commons Attribution NonCommercial-ShareAlike International License.
SIGCOMM '22, August 22–26, 2022, Amsterdam, Netherlands
© 2022 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9420-8/22/08.
https://doi.org/10.1145/3544216.3544262

1 INTRODUCTION
Data-intensive applications are the foundation of today's online services. With the gradual slowdown of Moore's law, hardware accelerators are struggling to meet the performance demands of emerging cloud applications, such as machine learning, databases, storage, and data analytics. Further advances are significantly limited by the amount of computation and memory that can fit in a single server, driving the need for efficient distributed systems for data-intensive applications.

The availability of programmable switches, such as Intel's Tofino [2, 20, 22], has created opportunities to design new packet-processing protocols and compilers [17, 20, 24, 44, 45, 58, 69, 71]. Tofino switches have also paved the way to using in-network computing [23, 60, 74] to accelerate applications such as caching [43], database query processing [50, 73], machine learning training [36, 48, 55, 63, 77], inference [76], and consensus protocols [27, 28, 52]. The key idea of in-network computing is to leverage the switches' unique vantage point to perform part of the computation directly inside the network, thereby reducing latency and improving performance. Although programmable switches have been crucial enablers of this new paradigm, the Protocol Independent Switch Architecture (PISA) [2, 20, 22, 58] is often a poor fit for emerging in-network applications, thus limiting further growth and precluding the widespread adoption of in-network computing applications [35, 37, 67].

This paper presents Trio's programmable architecture for in-network computing. Trio is Juniper Networks' programmable chipset with a multi-billion dollar pre-existing customer base. It has been deployed in hundreds of thousands of routers and switches worldwide in the core, edge, and datacenter environments. The Trio chipset has been used in production devices for over a decade.

Trio is built on a set of customized processor cores, with an instruction set optimized for networking applications. As a result, the chipset has the performance of a traditional ASIC, while enjoying the flexibility of a fully programmable processor by allowing the installation of new features via software. Trio's flexible architecture enables it to support features and protocols developed long after the chipset is released. Trio processor cores have access to a high-performance large memory system to store data and state related to system configuration and packets. The memory system is central to the scalability of emerging applications with large memory footprints.

Figure 1: High-level comparison of a Trio-based router/switch and a PISA-based switch.

Figure 2: Trio's Packet Forwarding Engine (PFE) architecture. Each PFE has hundreds of multi-threaded Packet Processing Engines (PPEs).

Trio's architecture is fundamentally different from that of Tofino. Trio has a non-pipelined architecture, so different packets do not
necessarily flow through the same physical paths on the chip. Incoming packets in Trio are processed independently using thousands of parallel threads (details in §2). These threads use a run-to-completion model [12, 70], in which a thread will execute as many instructions as are needed to complete the processing for the packet it is currently working on. Trio has dedicated logic to ensure packets of the same flow are delivered in order, but packets of different flows can be processed out of order, allowing it to efficiently handle a mix of concurrent applications.

Consequently, Trio can gracefully handle different packet processing rates: it can support lower than line-rate for applications that require rich per-packet processing while maintaining line-rate for applications with simple per-packet processing needs. In contrast, PISA-based switches force all packets to traverse the same set of pipeline stages, independent of the application. P4 programs [19] have an all-or-nothing fate, wherein programmability is sacrificed for line-rate packet processing since PISA-based switches are unable to support flexible packet processing rates.

In this paper, we first describe the fundamental building blocks of the Trio chipset, including details of its packet processing engines and the surrounding memory systems (§2). Next, we describe Trio's programming language, called Microcode (§3). We then use in-network aggregation for machine learning training as the first use case to explain Trio's flexible Microcode design (§4). We introduce in-network straggler mitigation as a second use case to demonstrate Trio's unique ability to launch efficient timer-based threads (§5). We demonstrate that implementing straggler mitigation in Trio is straightforward, while, to the best of our knowledge, enabling efficient straggler mitigation inside PISA-based devices is challenging, if not impossible.

We implement both use cases on a testbed with a Juniper MX480 device [10], one 64×100 Gbps Tofino switch, and six ASUS ESC4000A-E10 servers, each with one A100 Nvidia GPU [11] and one 100 Gbps Mellanox ConnectX5 NIC. We train three DNN models (ResNet50 [41], DenseNet161 [42], and VGG11 [68]) to demonstrate Trio's ability to mitigate stragglers while performing in-network aggregation. Our evaluations show that when stragglers occur in the cluster, Trio outperforms SwitchML [63], the state-of-the-art in-network aggregation platform, by up to 1.8× (§6).

Juniper Networks continues to evolve the Trio chipset for higher bandwidth, lower power, and additional functionalities for existing and emerging applications, while also developing software infrastructures to support more use cases. We invite the networking community to identify novel use cases that will leverage Trio's programmable architecture.

2 TRIO'S ARCHITECTURE
Since its introduction in 2009, the Trio chipset has gone through six generations [16] with various performance points and architectures. This section provides a detailed overview of Trio's recent architecture. First, we give a high-level overview of packet forwarding and processing in a Trio-based router¹ (§2.1). We then turn to the details of Trio's packet processing engines (§2.2). Finally, we explain Trio's various memory types and read-modify-write operations (§2.3).

¹In this paper, we use the term router and switch interchangeably. Historically, Juniper Networks' devices are called routers.

2.1 Trio-based Router Architecture
Figure 1 illustrates the high-level differences between a Trio-based router (or switch) and a PISA-based switch. There are two important components of every Trio-based device: (i) Packet Forwarding Engines and (ii) Packet Processing Engines, described below.

Packet Forwarding Engine (PFE). PFEs are the central processing elements of Trio's forwarding plane and are used to systematically move packets in and out of the device. A Trio-based device consists of one or more PFEs. Depending on the generation, each Trio chipset supports a different packet processing bandwidth. Trio's first-generation PFEs supported 40 Gbps of network bandwidth with multiple chips. Today, Trio's sixth-generation PFE supports 1.6 Tbps in a single chip. A small router may have only a single PFE, while larger routers have multiple PFEs connected by an interconnection fabric, as shown in Figure 1(a). By providing any-to-any connection between the PFEs, the interconnection fabric expands the bandwidth of a device much farther than a single chip could support. Each PFE handles packets in both the ingress and egress directions. Packets arrive at the system through an ingress PFE and exit through an egress PFE.

Packet Processing Engine (PPE). Each PFE has hundreds of multi-threaded Packet Processing Engines (PPEs), as shown in Figure 2. Each PPE supports tens of threads working on different packets at the same time. Unlike Tofino's architecture, where pipelines cannot access each other's registers, PPE threads within one PFE can share state efficiently via shared memory. Section 2.2 explains the PPE's thread-based design in more detail.

Parallel packet processing. PFE's hardware logic automatically divides each incoming packet into head and tail parts (analogous to PISA's header and payload). The packet head is the first
part of a packet and is usually large enough to hold all of the packet headers needed to process the packet (the size of the packet head is different for each generation of the Trio devices but is typically around 200 bytes). The tail consists of the remaining bytes of the packet (if any). When a new packet arrives, a hardware module inside the PFE, called the Dispatch module, sends the packet head to a PPE for processing based on availability, and the PPE spawns a new thread for this packet head. Packet tails are held in the PFE's Packet Buffer in the Memory and Queueing Subsystem to avoid storing a large number of bytes in the PPE threads. By default, each thread works on a single packet. Many PPE threads work in parallel to provide the required processing bandwidth.
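For illustration, the head/tail split can be modeled in plain C as follows. This is a minimal sketch, not Trio's hardware behavior: the 200-byte head size is only the typical value mentioned above, and store_tail() is a stand-in we invented for the Packet Buffer in the Memory and Queueing Subsystem.

#include <stdint.h>
#include <string.h>

#define HEAD_BYTES 200                 /* typical; varies by Trio generation */

struct pkt_head {
    uint16_t len;                      /* valid bytes in data[]              */
    uint8_t  data[HEAD_BYTES];         /* lands in a PPE thread's local mem  */
};

struct pkt_tail {
    uint32_t handle;                   /* where the Packet Buffer stored it  */
    uint32_t len;                      /* 0 if the packet fit in the head    */
};

static uint32_t next_handle = 1;
static uint32_t store_tail(const uint8_t *p, uint32_t n)
{                                      /* stand-in for the Packet Buffer     */
    (void)p; (void)n;
    return next_handle++;
}

void dispatch(const uint8_t *pkt, uint32_t pkt_len,
              struct pkt_head *head, struct pkt_tail *tail)
{
    uint32_t head_len = pkt_len < HEAD_BYTES ? pkt_len : HEAD_BYTES;
    head->len = (uint16_t)head_len;
    memcpy(head->data, pkt, head_len);
    tail->len = pkt_len - head_len;
    tail->handle = tail->len ? store_tail(pkt + head_len, tail->len) : 0;
    /* at this point a PPE thread would be spawned to process 'head' */
}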
Reorder Engine. When packet processing is completed, the modified packet head is sent to a Reorder Engine. The Reorder Engine holds the updated packet head until all earlier arriving packets in the same flow have been processed to ensure in-order delivery. The Reorder Engine then sends the modified packet heads to the Memory and Queueing Subsystem to be enqueued for transmission.

2.2 Packet Processing Engine
Trio's PPEs provide capabilities that are difficult or impossible to achieve with fixed processing pipelines or existing specialized processing units. Each PPE is a VLIW (Very Long Instruction Word) multi-threaded Microcode engine core. Each micro-instruction controls multiple ALUs, operand and result selection, and complex multi-way branching. The complexity of the work needed to execute a micro-instruction means each instruction takes multiple clock cycles. Because each PFE typically serves many packets at the same time, one PPE does not require high single-thread performance. Each thread in Trio has only one datapath instruction at a time. Trio does not dispatch an instruction on the same thread as the previous instruction into the PPE pipeline until the latter exits the pipeline. Hence, there is no need to pass data between instructions on the same thread because subsequent instructions do not depend on the results from the previous ones before the data writebacks are completed.

PPE threads. A PPE thread is usually started when a packet head arrives at the PPE and destroyed when the processing is complete for that packet at this PPE. The thread destruction is automatically handled by the hardware logic in the chip, although the programmer has control as to when to give up the execution of a thread. Threads can also start in response to certain internal events, including statistical collection and timers (more details in §5). External events have the ability to spawn the execution of new threads through similar mechanisms. Together, the PPEs in the ingress and egress PFEs handle all functions needed to process a packet (e.g., packet parsing, route lookup, packet rewriting).

Per-thread local storage. Each PPE has two main forms of internal storage. First, each thread has a dedicated pool of local memory (1.25 KBytes). The local memory can be accessed on any byte boundary, using either pointer registers or an address contained in the micro-instruction. Before a PPE thread is initiated, the packet head is loaded into the local memory of that thread. When an outgoing packet is being sent, the modified packet head is unloaded from the thread's local memory. The use of pointer registers allows efficient access to packet headers, as well as to other types of data structures. Second, each thread has 32 64-bit general-purpose registers that are private to it. The local storage (memory and registers) holds information specific to the packet being processed. Shared state across packets is held in the Shared Memory System accessible to all PPEs.

ALU types. There are two ALU types: (i) Condition ALUs and (ii) Move ALUs. Condition ALUs are used for arithmetic or logical operations to produce 32-bit data results and/or for comparative operations to produce 1-bit condition results. Move ALUs produce 32-bit results that can be written into either a register or local memory. The results from the Condition ALUs can be used as inputs to the Move ALUs. This ALU organization allows the resources of each instruction to be flexibly allocated between sequencing control (described next) and generation of logical/arithmetic results to be stored in registers/memory. Importantly, each ALU operand and each Move ALU result can be a bit-field of arbitrary length (up to 32 bits) and an arbitrary bit offset. This has two main benefits. First, it improves the efficiency of accessing fields of varying sizes in a packet header. Second, it improves the utilization of memory and register capacity, allowing each piece of data to use only the bits it needs. Trio has ALUs in both the PPEs and the Shared Memory System. The former are used for operations on registers and local memory, while the latter are used for operations on data stored in the Shared Memory System. Operations on a packet tail are also supported by moving sections of the packet tail to the local memory of the PPE thread.
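As a software analogy for these arbitrary bit-field operands, the C helper below extracts a field of any length (up to 32 bits) at any bit offset from a header buffer — the kind of access a single ALU operand encodes directly in hardware. The helper and its name are ours, for illustration only.

#include <stdint.h>
#include <stdio.h>

/* Extract 'len' bits (1..32) starting 'off' bits into 'buf', MSB-first. */
static uint32_t extract_bits(const uint8_t *buf, unsigned off, unsigned len)
{
    uint32_t v = 0;
    for (unsigned i = 0; i < len; i++) {
        unsigned b = off + i;
        v = (v << 1) | ((buf[b >> 3] >> (7 - (b & 7))) & 1u);
    }
    return v;
}

int main(void)
{
    /* A 20-byte IPv4 header: version=4, ihl=5, ttl=64. */
    const uint8_t ip[20] = { 0x45, 0x00, 0x00, 0x54, 0x12, 0x34, 0x40, 0x00,
                             0x40, 0x01 };
    printf("ver=%u ihl=%u ttl=%u\n",
           extract_bits(ip, 0, 4),    /* 4-bit field at bit offset 0 */
           extract_bits(ip, 4, 4),    /* 4-bit field at bit offset 4 */
           extract_bits(ip, 64, 8));  /* 8-bit TTL at bit offset 64  */
    return 0;
}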
Sequencing logic. The condition result(s) from one or more Condition ALUs can be used by a sequencing logic unit to select the next micro-instruction to be executed. Each micro-instruction includes the address of a target block of one to eight micro-instructions. Any or all of the condition results can be ignored, and the combination of the condition results used is highly flexible. Much of the work done in packet processing involves complex conditional branches in the code, especially during parsing. Trio's ability to perform a complex multi-way branching in a single instruction is well-matched to the needs of packet processing applications. The PPE supports a call-return mechanism to subroutines, which can be nested up to eight levels deep.

Efficient hash calculation. Efficient load-balancing is an important requirement of all routers/switches. In a Trio-based system, a Microcode program is responsible for specifying which packet fields are included in the hash calculation. This allows complete flexibility as to which packet fields contribute to the load balancing decision, including the ability to select fields from packet headers whose protocols have not yet been invented. The hash function in Trio is a high-quality hash function implemented using dedicated logic. As a result, the hash function implementation is more efficient than a comparable hash function implemented in software. The combination of programmable field selection and hardwired hash function gives PPEs an unprecedented balance of flexibility and efficiency.

Flexible programming. There is no fixed limit on the number or type of headers that can be processed by a PPE. Hence, a PPE can easily create new headers or consume/remove existing headers in packets using Trio's Microcode program (§3). As new protocols are developed, the Trio packet processing architecture can adapt by enhancing the software that runs on the PPEs. PPEs can
also create or consume packets to accomplish tasks, such as keep-alive functions, at a much higher rate than can be supported by a control plane CPU because of PPEs' multi-threaded architecture. Importantly, processing cycles are fungible between applications, enabling graceful handling of the packet processing requirements of different applications. As a result, Trio-based systems can provide lower packet rates for applications with richer packet processing and higher packet rates for those with simpler packet processing, or a mix of the two.

2.3 Shared Memory System
Recent Trio chipsets support several GBytes of memory in each PFE. This section gives an overview of Trio's Shared Memory System.

Advantages of shared memory. For switches and routers, some data structures, such as counters and policers, need to be modified at a high rate. To support efficient access to these data structures by hundreds of PPE threads, Trio's Shared Memory System serves as the place for all threads to access and modify the data. All data accesses (read, write, and read-modify-write) to the Shared Memory System are processed by the read-modify-write engines, located close to the Shared Memory System. When multiple threads access the same memory location at around the same time, there is no need to move data from one thread to another. Instead, data modification happens inside the read-modify-write engines. This allows high-speed data updates near the memory and nicely meets the needs of packet processing applications. In contrast, the cache-line-based coherency model used by conventional processors requires data to be moved to the thread during access; this creates longer delays when multiple threads try to modify the same memory location. Although this model can support more complex and general operations on the data, it performs poorly for data structures that can be accessed by hundreds of threads.

Memory types. The Trio memory system is optimized to provide a high access rate for relatively small (8 bytes) requests. To achieve the required combination of bandwidth, latency, and capacity, the memory system uses two types of memory, shown in Figure 3: (i) a high-bandwidth on-chip memory, with approximately 70 ns access latency from the PPE; and (ii) a large high-bandwidth DRAM-based off-chip memory, with approximately 300 ns to 400 ns access latency from the PPE. The on-chip memory is implemented by a heavily multi-banked SRAM and is typically used for frequently-accessed data structures. The off-chip memory has a multi-megabyte on-chip cache which is similar to the on-chip SRAM and is heavily multi-banked to provide high throughput. The size of the On-chip SRAM and the Off-chip DRAM cache are software configurable (typically 2-8 MBytes and 8-24 MBytes, respectively). The Off-chip DRAM is several GBytes. The on- and off-chip memories are architecturally equivalent and exist in different ranges of a single unified address space. They only differ in capacity, latency, and available bandwidth. This allows data structures to be placed in the type of memory that best matches their capacity and bandwidth requirements.

Figure 3: Trio's Shared Memory System.

Memory transactions. The memory system supports read and write operations of varying sizes, from 8 bytes up to 64 bytes (in 8-byte increments). Trio can support full memory system bandwidth with 8-byte accesses. In addition, a rich variety of read-modify-write operations are supported, including Packet/Byte Counters, Policers, Logical Fetch-and-Ops (And/Or/Xor/Clear), Fetch-and-Swap, Masked Write, and 32-bit add. The read-modify-write operations are enabled by read-modify-write engines, as specified below.

Read-modify-write engines. Packet processing requires extremely high-rate read-modify-write operations. Processing a single packet may involve updates to multiple counters, operations on one or more policers, and other operations as needed by the application. A naive approach to handle read-modify-write operations is to give one thread ownership of a memory location while the operation is carried out. But this approach cannot meet the high efficiency requirements of packet processing. In contrast, Trio offloads the read-modify-write operations to its memory system, where a range of memory locations is handled by a single read-modify-write engine. If multiple requests to the same memory location arrive at around the same time, the engine processes the requests in sequence, guaranteeing consistency of the updates. There is no need to issue explicit coherence commands to a location in memory when mixing read, write, and read-modify-write operations. Each read-modify-write engine processes memory requests at a rate of 8 bytes per clock cycle. Hence, a single read-modify-write engine for the entire Shared Memory System cannot provide the memory bandwidth needed to process packets at a sufficiently high rate. To address this challenge, Trio supports several banks of SRAM and off-chip cache with their own read-modify-write engines, enabling the read-modify-write processing bandwidth to scale with the raw memory bandwidth.
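The semantics can be approximated in software with per-location atomic updates: requests are applied where the data lives, and no caller ever holds ownership of the location. The C11 sketch below is only an analogy (the bank count and all names are our assumptions); in Trio the updates are performed by hardware engines adjacent to each memory bank.

#include <stdatomic.h>
#include <stdint.h>

#define NBANKS   16                     /* assumed, for illustration      */
#define PER_BANK 1024

struct pb_counter {                     /* a 16-byte Packet/Byte Counter  */
    _Atomic uint64_t packets;
    _Atomic uint64_t bytes;
};

static struct pb_counter bank[NBANKS][PER_BANK];

/* Model of an offloaded read-modify-write: the update is routed to the
 * bank that owns the address and applied there; concurrent requests to
 * the same counter serialize inside the "engine", not in the threads.  */
void counter_update(uint32_t addr, uint32_t pkt_len)
{
    struct pb_counter *c = &bank[addr % NBANKS][(addr / NBANKS) % PER_BANK];
    atomic_fetch_add_explicit(&c->packets, 1, memory_order_relaxed);
    atomic_fetch_add_explicit(&c->bytes, pkt_len, memory_order_relaxed);
}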
Crossbar and shared memory performance. Trio's Crossbar is designed to support all read-modify-write engines, such that the Crossbar itself will never limit the memory performance. If the load offered to a given read-modify-write engine exceeds the 8-bytes-per-cycle throughput, there will be backpressure through the Crossbar. Juniper Networks increased the number of read-modify-write engines in each generation of Trio chips so that the memory bandwidth increases with the packet processing bandwidth.

3 TRIO'S PROGRAMMING ENVIRONMENT
This section gives an overview of Trio's programming environment. Section 3.1 describes Trio's programming language and the
toolchain for programming Trio-based devices. Section 3.2 provides a packet filtering example programmed in Trio Microcode.

3.1 Trio's Programming Language and Toolchain
The programming language for Trio-based devices is a C-like language called Microcode. The programmer implements all packet processing operations in Microcode, including packet parsing, route lookup, packet rewriting, and in-network computations (if any). Figure 4 shows the tools needed to program new applications on Trio. To program a new application on Trio, the programmer uses the Microcode language to write new applications and adds the new Microcode program to the existing codebase. Then the programmer uses Trio's compiler to generate the software image and configures the target device.

Figure 4: Programmer uses a C-like language called Microcode to program new applications and configure the target Trio routers.

Expression syntax. Microcode supports C-style expressions. The supported variable types include scalar (label, bool, and integers in various sizes) and compound (struct and union). Microcode also supports pointers and arrays, conditions, function calls and gotos, and switch statements.

Instruction boundary. A Microcode program has multiple instructions. A single Microcode instruction can perform limited operations, and the programmer needs to explicitly specify the instruction boundaries. Typically, a single Microcode instruction can perform four register or two local memory reads, and two register or two local memory writes.

Variable storage classes. When defining a new variable in Microcode, the programmer needs to specify the location to store the variable. There are three types of variable storage classes: memory (PPE's local memory and registers), bus (indicating that the variable serves as input to the ALUs), and virtual (indicating constant values). Access to data stored in the Shared Memory System, such as forwarding tables, is achieved via the external transactions specified below.

External transactions. PPEs can issue external transactions (XTXNs) to other modules, such as the Shared Memory System, Hash lookup/insert/delete, high-performance Filters, and counter/policer blocks, over the Crossbar. These XTXNs can be either synchronous or asynchronous. In synchronous XTXNs, the PPE thread is suspended until the XTXN reply is received; in asynchronous XTXNs, the PPE thread continues running normally. PPEs can also fetch data from the packet tails through XTXNs. In this case, packet tails are sent from the Memory and Queueing Subsystem, pass through the Crossbar, and then arrive at the local memory of the PPE. An XTXN consists of a request by a PPE to a target and a reply sent by the target back to the PPE. The format of the XTXN depends on the target block. For instance, read requests sent to the Shared Memory System take a memory address as the parameter, and the data is returned in the XTXN response register.

Compiler. To compile Microcode programs, the programmer uses a tool called Trio Compiler (TC). TC maps the source code for an instruction to the various resources the instruction can control, including mapping variables to their underlying storage and assigning instructions to Microcode memory inside PPEs. TC has characteristics of both compilers and assemblers. On the compiler side, TC supports the translation of high-level C-style expressions into hardware instructions. On the assembler side, TC source code must contain instruction delineation, whereby the programmer marks the beginning and end of blocks of code representing a single instruction. If the code designated to one instruction does not fit, TC fails the compilation because it cannot implement the requested actions across multiple instructions. TC does not have a separate compilation and linking phase. It requires the complete source code instead of individual modules to generate the binary. This binary contains data to initialize PPE resources such as Microcode memory and local memory. It also defines required symbols, such as the address in local memory where the packet header starts. This binary file serves as part of the Junos² software image used by Trio's ASIC driver for device initialization.

²Juniper Networks' Junos is the operating system that powers Trio-based devices.

vMX Virtual Router. Juniper Networks is making a concerted effort to enable third-party access to programming Trio-based devices. As a first step to enable third-party access to Trio's functionalities, Juniper Networks developed the vMX Virtual Router [5]. vMX is a virtualized Universal Routing Platform and consists of a virtual control plane (VCP) and a virtual forwarding plane (VFP). The VCP is powered by the Junos operating system, and the VFP runs the Microcode engine optimized for x86 environments. vMX is available as licensed software for deployment on x86-based servers and cloud services, such as Amazon Web Services.

Advanced Forwarding Interface. In Trio, packet forwarding is a sequence of operations executed by a PFE. Each operation can be represented by a node on a graph of potential packet forwarding operations. The PFE executes a series of operations for an individual packet based on its type/fields. Juniper Networks' Advanced Forwarding Interface (AFI) [3] provides partial programmability by allowing third-party developers to control and manage a section of this forwarding path graph via a small virtual container called a sandbox. The sandbox enables developers to add, remove and change the order of operations for specific packets.

3.2 Microcode Program Example
We illustrate the usage of Trio Microcode by showing an example of a filtering application whose function is to forward all incoming IP packets with no optional headers and drop all non-IP packets and IP packets with options.

Microcode program workflow. Figure 5 shows the Microcode program workflow. Each incoming packet is processed by one PPE
Figure 5: Microcode program workflow of the filtering application. Labels in brackets are the corresponding instruction names.

thread. The thread first looks at the packet's Ethernet header. If EtherType is equal to 0x0800, the packet is an IP packet, and the next step is to process the IP header; otherwise, it is a non-IP packet. In this case, the thread drops the packet and increments the Packet/Byte Counter for non-IP packets. For an IP packet, the thread further examines whether it has any IP option fields. The thread forwards all non-option IP packets, but drops IP packets with options and increments the Packet/Byte Counter for IP-option packets. After completing all required operations, the thread exits.

Packet header formats. Programmers need to define packet header structures in the Microcode program. The format of the packet header definition is similar to that of P4 [19], where each header is defined by an ordered list of field names with the corresponding field widths. Here, we show the definition of the standard Ethernet header as an example.

struct ether_t {
    dmac  : 48;
    smac  : 48;
    etype : 16;
};

Ethernet header processing. This instruction decides whether the incoming packet is an IP packet by looking at the EtherType field in the Ethernet header. If the incoming packet is an IP packet, the program continues to instruction process_ip; otherwise, it goes to instruction count_dropped. In this case, we set the intermediate register ir0 to 0 to indicate the current packet is a non-IP packet, and count_dropped will use ir0 to calculate the starting address of the corresponding Packet/Byte Counter for non-IP packets.

process_ether:
begin
    ir0 = 0;
    if (ether_ptr->etype == 0x0800) {
        goto process_ip;
    }
    goto count_dropped;
end

IP header processing. This instruction looks at the Version and Internet Header Length (IHL) fields of the IP header. A Version value equal to 4 and IHL value equal to 5 indicate the current packet is a non-option IP packet. For non-option IP packets, the program continues to instruction forward_packet; otherwise, it goes to instruction count_dropped. In this case, we set the intermediate register ir0 to 1 to indicate the current packet is an IP-options packet, and count_dropped will use ir0 to calculate the starting address of the corresponding Packet/Byte Counter for IP-options packets.

process_ip:
begin
    const ipv4_t *ipv4_addr = ether_ptr + sizeof(ether_t);
    ir0 = 1;
    if (ipv4_addr->ver == 4 && ipv4_addr->ihl == 5) {
        goto forward_packet;
    }
    goto count_dropped;
end

Figure 6: Packet/Byte Counter layout in our example. Pointer indicates the starting address of each counter.

Dropped packet counting. This instruction increments the Packet/Byte Counter for dropped packets. Packet/Byte Counter is a special counter stored in the Shared Memory System. Each Packet/Byte Counter is 16 bytes, and consists of two portions: the packet counter portion, which calculates the number of packets; and the byte counter portion, which calculates the number of bytes. Figure 6 shows the Packet/Byte Counter layout and the corresponding starting addresses in our example. Instruction count_dropped first calculates the address of the Packet/Byte Counter to be incremented based on DROP_CNT_BASE and ir0. If ir0 is equal to 0, then the current packet is a non-IP packet, and the corresponding counter address is DROP_CNT_BASE; otherwise the current packet is an IP-options packet, and the corresponding counter address is DROP_CNT_BASE+2. This instruction then issues an external transaction (XTXN) called CounterIncPhys, which is specific for incrementing Packet/Byte Counter and takes two parameters: the counter address and the packet length. This XTXN increments the packet counter portion and the byte counter portion in Packet/Byte Counter separately: the packet counter portion is incremented by 1, and the byte counter portion is incremented by pkt_len.

count_dropped:
begin
    const : addr = DROP_CNT_BASE + ir0 * 2;
    CounterIncPhys(addr, r_work.pkt_len);
    goto drop_packet;
end
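To make the address arithmetic concrete, the C model below mirrors the layout in Figure 6 under one stated assumption: counter addresses are in 8-byte units, so each 16-byte Packet/Byte Counter occupies two units and the IP-options counter sits at DROP_CNT_BASE+2. The function is a stand-in for the CounterIncPhys XTXN, not its real implementation, and the base address is illustrative.

#include <stdint.h>
#include <stdio.h>

#define DROP_CNT_BASE 0x100u            /* assumed base, in 8-byte units */

struct pb_counter { uint64_t packets, bytes; };   /* 16 bytes            */

static struct pb_counter drop_cnt[2];   /* [0] non-IP, [1] IP options    */

static void counter_inc_phys(uint32_t addr, uint32_t pkt_len)
{
    struct pb_counter *c = &drop_cnt[(addr - DROP_CNT_BASE) / 2];
    c->packets += 1;                    /* packet portion: +1            */
    c->bytes   += pkt_len;              /* byte portion: +pkt_len        */
}

int main(void)
{
    uint32_t ir0 = 1;                   /* as set by process_ip          */
    counter_inc_phys(DROP_CNT_BASE + ir0 * 2, 64);
    printf("ip-options dropped: %llu pkts, %llu bytes\n",
           (unsigned long long)drop_cnt[1].packets,
           (unsigned long long)drop_cnt[1].bytes);
    return 0;
}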
Packet forwarding and dropping. For completeness, we show brief definitions of instructions forward_packet and drop_packet as referenced by prior instructions. In our implementation, both packet forwarding and packet dropping are completed by multiple instructions.

forward_packet:
    // code to forward the packet
    // based on the destination address

drop_packet:
    // code to drop the packet
Figure 7: Trio-ML packet format.

4 USE CASE 1: IN-NETWORK AGGREGATION IN TRIO
The previous sections described Trio's flexible packet processing architecture, the memory system, and the programming environment. In this section, we discuss in-network aggregation for distributed ML training as a concrete use case.

Data-parallel training. One of the common approaches to distributed Machine Learning (ML) training is data-parallel training. In this approach, the neural network is replicated across N workers (or replicas), with each worker processing a small subset of the training data (mini-batch) to compute its local model gradients. At every iteration, workers must synchronize their model parameters by exchanging and aggregating their gradients to ensure convergence [39]. This step is called allreduce, and it is most commonly implemented using a parameter server [53] or ring-allreduce [9, 65].

Overview of in-network aggregation. The allreduce step puts significant pressure on the network fabric because the entire set of model gradients must be exchanged many times throughout the training process [39, 40, 65]. Recent work proposed in-network aggregation to improve the performance of distributed ML training workloads [18, 36, 47, 48, 54, 63, 77]. By aggregating gradients inside network switches, rather than at end-hosts, in-network aggregation accelerates training jobs with heavy communication overheads.

Trio-based in-network aggregation. We now describe Trio-ML, our Microcode implementation to perform in-network aggregation inside Trio-based devices. Section 5 extends Trio-ML to handle straggling workers and describes the challenges faced by Tofino-based in-network aggregation solutions [48, 63] in the presence of stragglers.

Trio-ML packet format. Figure 7 illustrates Trio-ML's aggregation packet format. Following previous proposals [48, 63], we use UDP packets to carry the gradients. Packets are addressed to the router with a pre-defined destination port (e.g., 12000). After the UDP header, we define a Trio-ML header that describes the block of gradients carried in each packet. A block is a subset of DNN model gradients that fits in one packet. The gradients are 32-bit integers converted from floating-point using the scaling approach proposed by ATP [48].

struct trio_ml_hdr_t {   // 12 bytes
    job_id   : 8;        // aggregation job id
    block_id : 32;       // aggregation block id
    age_op   : 4;        // if the block has aged out
    final    : 1;        // if the block is final block
    degraded : 1;        // aggregation is partial
             : 2;        // unused for byte alignment
    src_id   : 8;        // source id of the packet
    src_cnt  : 8;        // number of sources contributing
    gen_id   : 16;       // generation id
             : 4;        // room to expand grad_cnt
    grad_cnt : 12;       // number of gradients
};

Figure 8: Trio-ML packet header structure.

Trio-ML header structure. Figure 8 shows the Trio-ML header structure, as defined in our Microcode program. job_id and block_id uniquely identify the block of gradients for each training job. All servers participating in the same job_id send blocks of gradients using the same sequence of block_ids. src_id identifies the sender of each packet, enabling the aggregator to keep track of which servers have contributed their data to the block and to recognize retransmissions by the servers. Generation number gen_id is used to distinguish blocks in consecutive iterations of the model aggregation. The header structure has three additional variables (age_op, degraded, and src_cnt) that are most meaningful in the context of stragglers (details in §5).

Job records. We use a hash table (with key = job_id, block_id) to keep track of ongoing aggregations in the device. Figure 17 in Appendix A.1 shows the structure definition of job records. Job records are created at job configuration time and persist until the job is complete. They contain the current number of active blocks being aggregated for each job (block_curr_cnt). They also control memory sharing across jobs by capping the maximum number of concurrent aggregation blocks (block_cnt_max) and the maximum number of gradients per block (block_grad_max) for a given job. In addition, job records hold the parameters required for generating and forwarding the response, as well as the block expiration timeout. Finally, job records contain bitmasks indicating which sources (workers) are participating in the job.

Block records. Trio-ML creates a block record when it receives a packet with a new block_id from a server. Figure 18 in Appendix A.1 shows the structure definition of block records. The Microcode program removes the block record when the block aggregation is complete and the block's result has been generated. Block records hold the block's aggregation state, including the count and bitmask of sources that must still deliver their packets, pointers to the aggregation buffer and the parent job record, and the block's start time and expiration interval. Figure 9 illustrates the Trio-ML aggregation Microcode program's data structure operations when multiple aggregation jobs are present concurrently, and each job directs multiple blocks to aggregate in parallel.
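The full record definitions appear in Figures 17 and 18 of the paper's appendix, which are not reproduced here. The C sketch below is inferred from the prose only: block_curr_cnt, block_cnt_max, block_grad_max, and out_nh_addr are names taken from the text, while every other field name and width is our assumption.

#include <stdint.h>

struct job_record {                     /* created at job configuration  */
    uint8_t  job_id;
    uint16_t block_curr_cnt;            /* active blocks for this job    */
    uint16_t block_cnt_max;             /* cap on concurrent blocks      */
    uint16_t block_grad_max;            /* cap on gradients per block    */
    uint64_t src_bitmask;               /* participating sources         */
    uint32_t out_nh_addr;               /* nexthop for the Result packet */
    uint32_t expire_interval;           /* block expiration timeout      */
};

struct block_record {                   /* created on first packet seen  */
    uint32_t block_id;
    uint16_t gen_id;
    uint16_t src_remaining;             /* sources still expected        */
    uint64_t src_pending_mask;          /* bitmask of sources expected   */
    uint32_t agg_buf;                   /* aggregation buffer in DMEM    */
    uint32_t job_rec;                   /* pointer to parent job record  */
    uint64_t start_time;                /* with expire_interval: aging   */
};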
Window-based streaming aggregation. Following prior work [48, 63], we assume servers do not send the entire model to the aggregator at once. Instead, they stream the gradients using a window parameter. Each server has a parameter called window that controls the number of outstanding gradients waiting to be aggregated. Section 6 evaluates the impact of window size on performance.
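A minimal model of the sender side follows, with placeholder I/O routines (send_block and await_result are not Trio-ML's actual host API): at most window blocks are in flight, and each returned result frees a slot for the next block.

#include <stdint.h>

extern void     send_block(uint32_t block_id);   /* placeholder */
extern uint32_t await_result(void);              /* placeholder */

void stream_gradients(uint32_t num_blocks, uint32_t window)
{
    uint32_t next = 0, outstanding = 0;
    while (next < num_blocks || outstanding > 0) {
        while (next < num_blocks && outstanding < window) {
            send_block(next++);
            outstanding++;
        }
        (void)await_result();           /* blocks until a result returns */
        outstanding--;
    }
}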
Trio-ML aggregation Microcode program workflow. Figure 10 shows Trio-ML's aggregation Microcode program workflow. Each aggregation packet is processed by one thread. Each thread starts by extracting job_id and block_id from the packet and using them to look up the block aggregation record. If the block record does not exist (i.e., this is the first packet with a certain block_id), the thread proceeds to create the block record (provided that a job record with the specified job_id exists). The block record points to the aggregation buffer in the Shared Memory System (DMEM), where gradients are summed up. In Trio, packets consist of a head, which holds the first 192 bytes of the packet, and a tail, which holds the rest of the packet. Packet head data is readily available in the thread's Local Memory (LMEM), whereas tail data resides in the Memory and Queueing Subsystem and must be read into LMEM before it can be used. Hence, aggregation proceeds in two phases. Phase one aggregates gradients from the packet head, while phase two is structured as a loop that aggregates gradients from the packet tail in 64-byte chunks (16 32-bit gradients). When all gradients in a packet have been aggregated, the block context is checked for completeness. If all sources have been accounted for, the block context is complete, and the result generation phase starts.
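The two phases can be sketched in C as follows. read_tail_chunk() stands in for the XTXN that copies tail data into LMEM; the buffer layout and names are ours, and the tail is assumed to be a whole number of 64-byte chunks.

#include <stdint.h>

#define CHUNK_BYTES 64
#define CHUNK_GRADS (CHUNK_BYTES / 4)   /* 16 32-bit gradients           */

extern void read_tail_chunk(uint32_t off, int32_t out[CHUNK_GRADS]);

void aggregate_packet(int32_t *agg_buf,              /* DMEM buffer      */
                      const int32_t *head_grads, int n_head,
                      uint32_t tail_bytes)
{
    /* Phase one: gradients already resident in the head (LMEM). */
    for (int i = 0; i < n_head; i++)
        agg_buf[i] += head_grads[i];

    /* Phase two: pull the tail into LMEM, 64 bytes at a time. */
    int32_t chunk[CHUNK_GRADS];
    for (uint32_t off = 0; off < tail_bytes; off += CHUNK_BYTES) {
        read_tail_chunk(off, chunk);
        for (int i = 0; i < CHUNK_GRADS; i++)
            agg_buf[n_head + off / 4 + i] += chunk[i];
    }
}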
Figure 9: Data structure operations in Trio-ML aggregation Microcode program.

Figure 10: Trio-ML aggregation Microcode program workflow.
Result packet. Every block produces one aggregation Result packet. The Result packet is a new packet, whose IP/UDP and Trio-ML headers are reconstructed from the block and job records, and whose data (gradients) come from the aggregation buffer. The new packet must have a head and tail, just as incoming packets do. Accordingly, the tail is constructed in a loop: each iteration pulls a 256-byte chunk from the DMEM aggregation buffer into LMEM and then writes it out to the new tail in the Packet Buffer (PMEM). Finally, the Result packet is passed to the standard forwarding code using the nexthop address from the job record (out_nh_addr field).

Hierarchical aggregation. Trio-ML uses single-level aggregation when all ML sources are connected to interfaces hosted on the same PFE. In this case, the destination IP address of the Result packet is the address of the multicast group, which spans all sources participating in the job. Server membership in the multicast group is achieved by allowing each server to issue an IGMP registration or by including server interfaces in a Static Multicast configuration on the router. Standard IP forwarding then takes care of delivering the Result packet to all servers. Trio-ML uses hierarchical aggregation when ML sources span multiple PFEs. In hierarchical aggregation, one of the PFEs is configured as a top-level aggregator to which the other (first-level aggregator) PFEs feed their results. With hierarchical aggregation inside a multi-PFE chassis, first-level PFEs send their packets to the designated top-level PFE directly, without relying on IP forwarding. The top-level PFE sees lower-level PFEs as individual sources and aggregates them in the same way as a single-level aggregator does. Hierarchical aggregation can be extended to work across multiple devices by setting the destination IP of the Result packet to the IP address of the next-level aggregator and relying on IP forwarding to unicast the packet. The top-level aggregator will, of course, multicast the final result back to the servers. A desirable property of hierarchical aggregation is that the amount of data is reduced as the aggregated gradients move up the hierarchy, in a manner opposite to multicast replication. Note that when hierarchical aggregation is being set up, all configurations are done via the control-plane, and no Microcode changes are needed.

5 USE CASE 2: IN-NETWORK STRAGGLER MITIGATION
As a first use case, the previous section walked through Trio-ML's Microcode program, explaining how it implements in-network aggregation for ML training jobs. This section describes in-network straggler mitigation as a second important use case.

The straggler problem. In shared clusters hosting several jobs, different servers often experience uncorrelated performance jitter due to congestion, load imbalance, resource contention, garbage collection, background OS activities, or storage delays [13, 26, 33, 34, 38, 51, 56, 79]. As a result, servers that are collectively working on the same job must wait for the slowest one, hence decreasing the overall application performance. This problem is known as the straggler problem and has been studied extensively for several
distributed applications, including MapReduce [14, 15, 31, 75, 78], Spark [57, 79], database queries [59], key-value stores [46], and machine learning [25, 30, 38, 61, 72]. Google cites the straggler problem as one of the main causes of poor performance in its data processing jobs [46], and production traces from Facebook and Microsoft indicate that jobs can be slowed down by a factor of eight due to stragglers [14].

The case for in-network straggler mitigation. Existing systems address the straggler problem in a number of ways, including cloning [13, 14], speculative execution [15, 31, 79], and rapid reassignment [38]. Cloning approaches are prohibitively expensive for large jobs, such as ML training. Speculatively executing duplicate work and using rapid reassignment approaches require servers to coordinate updates via message passing, which delays detection and mitigation. In addition, today's straggler mitigation techniques are all server-based, thereby potentially creating a circular dependency wherein stragglers may need to be mitigated using other servers that could be straggling themselves. We argue that decoupling straggler mitigation from servers is a more robust approach. Hence, we propose in-network straggler mitigation as a more suitable solution for latency-sensitive applications. In-network straggler mitigation leverages the network devices' vantage point to keep track of active workers and promptly react to stragglers. Importantly, in-network straggler mitigation avoids the need for extra communication time across servers to detect straggling workers in distributed systems.

Trio to the rescue. We now explain our approach to implementing in-network straggler mitigation in Trio-based devices. To the best of our knowledge, it is challenging, if not impossible, to realize efficient in-network straggler mitigation in PISA-based devices, mainly because performing timer-based operations (such as sending a notification packet to servers when the timer for checking straggler events expires) in P4 requires coordination with the switch control plane. In comparison, Trio has the ability to perform timer-based processing and can spawn multiple threads in PPEs. We use our Trio-ML application (§4) as a running example to explain the use of Trio's timer threads for straggler detection and mitigation.

Straggler detection with timer threads. Trio's architecture contains tens of high-resolution timers, which can be used to launch Microcode threads that execute periodically. To detect straggling sources in our Trio-ML application, we leverage these threads to trigger periodic scanning of the aggregation hash table. Another advantage of Trio is that its hash hardware supports a per-record 'Recently Referenced' (REF) flag. REF flags are set when a record is created and whenever it is referenced by a lookup. To detect stragglers, we program Trio's timer threads to periodically visit all hash records to check and clear their REF flags. The timer threads determine whether the records have aged out by checking each record's REF flag prior to clearing it. If the flag is not set, it means the record has not been accessed for at least the duration of the timer interval. We use this feature to set a timeout interval to detect straggling sources.

Multi-thread scanning of large hash tables. One of the key challenges of in-network straggler detection is that routers and switches inside the network may have to absorb a large amount of data from non-straggling servers while waiting for the timeout to expire. For instance, in our Trio-ML application, the window parameter allows servers to send up to 65,535 aggregation packets (each 4 KBytes in size). Trio-ML aggregates these packets as soon as they arrive, but it needs to keep the block records of partially aggregated results in its hash table. Our timer threads need to be able to quickly scan this large hash table to identify aged blocks. We find it is challenging for a single thread to scan a large hash table to locate expired records. To address this challenge, we deploy N periodic threads operating in parallel. This is achieved by initiating the threads such that the interarrival interval between back-to-back threads is 1/N of the desired timeout interval. Every triggered thread scans 1/N of the aggregation table, thus reducing the amount of processing required by each individual thread by a factor of N. Trio's timer resolution allows hundreds of threads to be deployed in this manner. In this scenario, no PPE is reserved specifically for running timer threads, and every timer thread can be spawned in any of the PPEs based on availability.
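In C terms, each of the N timer threads runs a body like the sketch below. The names, table size, and the atomic check-and-clear are our modeling choices, not Trio's interfaces: thread k starts 1/N of the timeout after thread k-1 and covers its own 1/N slice, so every record is still visited once per timeout interval.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 65536                /* assumed                       */
#define N_THREADS  100                  /* N = 100 in our evaluation     */

struct hash_record {
    _Atomic bool ref;                   /* set on create and on lookup   */
    bool in_use;
};

extern struct hash_record table[TABLE_SIZE];
extern void expire_block(uint32_t slot);  /* partial-result path (stub)  */

void timer_thread_body(int k)           /* k in [0, N_THREADS)           */
{
    uint32_t lo = (uint32_t)((uint64_t)TABLE_SIZE * k / N_THREADS);
    uint32_t hi = (uint32_t)((uint64_t)TABLE_SIZE * (k + 1) / N_THREADS);
    for (uint32_t s = lo; s < hi; s++) {
        if (!table[s].in_use)
            continue;
        /* Check-and-clear REF: if it was already clear, nothing has
         * touched the block for a full interval -- it has aged out.    */
        if (!atomic_exchange(&table[s].ref, false))
            expire_block(s);
    }
}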
Straggler mitigation. Once a straggler is identified, various techniques may be used to mitigate its impact on the application's performance. The complexity of these techniques depends on the application. Following prior work [36], Trio-ML gives up on the straggling source(s) and sends a partial aggregation result to all the workers, including the straggler(s), along with the number of sources participating in the partial result. We use the age_op field in the Trio-ML packet header structure (Figure 8) to indicate whether a block has aged out due to stragglers. If it has, the degraded field is set on the Result packet to inform the servers that the aggregation result has been calculated using only a partial set of workers, and the src_cnt field informs the senders how many non-straggling sources contributed to the aggregation result. Servers that receive partial aggregation results divide the returned aggregated gradient values by the number of aggregated sources extracted from the Trio-ML header.
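The server-side correction amounts to a one-line division. The sketch below assumes ATP-style fixed scaling with an illustrative SCALE constant; what matters is that the divisor is src_cnt from the Trio-ML header, which on a degraded result counts only the non-straggling sources.

#include <stdint.h>

#define SCALE 1e8f                      /* illustrative unscale factor   */

void apply_result(const int32_t *grads, int n, uint8_t src_cnt, float *out)
{
    for (int i = 0; i < n; i++)
        out[i] = ((float)grads[i] / SCALE) / (float)src_cnt;
}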
Advanced straggler mitigation. Although the straggler mitigation approach we use in Trio-ML is simple, the techniques described in this paper are generic and can be used to implement more complex straggler mitigation approaches for other latency-sensitive applications. For instance, service providers can use Trio's timer threads to identify whether a worker is a temporary straggler (slows down temporarily) or a permanent one (is out of service for a very long time), and notify all other workers accordingly. To do so, the Microcode program needs to implement two types of timer threads. One type happens more frequently; it detects straggler events, similar to the timer threads for our ML use case. Another type happens less frequently; this type detects the per-server straggler event count, analyzes whether these are temporary or permanent stragglers, and sends notification to all other workers.

6 EVALUATIONS
This section evaluates the performance of our in-network aggregation and straggler mitigation use cases. First, we explain our testbed setup and methodology (§6.1). Next, we demonstrate Trio-ML's time-to-accuracy and iteration time speedups in the presence of stragglers, as well as the efficiency of timer threads in Trio (§6.2). Finally, we benchmark the latency and throughput of the Trio-ML Microcode program without stragglers (§6.3).
Figure 11: Our 100 Gbps testbed with a Juniper MX480 router, a Tofino switch, and six GPU servers.

DNN Model     Size     Batch size/GPU   Dataset
ResNet50      98 MB    64               ImageNet
VGG11         507 MB   128              ImageNet
DenseNet161   109 MB   64               ImageNet

Table 1: DNN models used in our experiments.

6.1 Methodology and Setup
Testbed. Our testbed includes six ASUS ESC4000A-E10 servers, one 64×100 Gbps Tofino switch, and one Juniper Networks' MX480 router [10]. Each server is equipped with one A100 Nvidia GPU [11] (40 GBytes of HBM2 memory) and one 100 Gbps Mellanox ConnectX5 NIC. The Juniper router is populated with two MPC10E-15C-MRATE line cards [4]. Each line card hosts 12 × 100 Gbps ports distributed across three PFEs based on Trio's 5th generation chipset. Figure 11(a) shows a photo of our testbed. To demonstrate the power of hierarchical aggregation in Trio, we connect three servers to PFE1 and another three servers to PFE2, as shown in Figure 11(b). All PFEs are internally connected; hence, we enable Trio-ML's hierarchical aggregation by configuring PFE4 as the top-level aggregator. The dotted lines in Figure 11(b) illustrate the internal path of our hierarchical aggregation setup. For Tofino experiments, we connect all six servers to a single pipeline because SwitchML's open-source code does not yet support hierarchical aggregation across pipelines. Note that connecting all servers to a single pipeline of the Tofino switch guarantees the best performance of SwitchML. If servers are connected to multiple pipelines, recirculation is required and will result in performance degradation.

DNN Workloads. We evaluate three real-world DNN models: ResNet50 [41], DenseNet161 [42], and VGG11 [68]. Table 1 summarizes the models and batch sizes used in our experiments. Following prior work [7, 63, 66], we select batch sizes that achieve the best possible time-to-accuracy. We train all three models with the ImageNet dataset [32].

SwitchML setup. For our baseline, we use SwitchML's open-source code [64]. SwitchML provides RDMA-based and DPDK-based implementations. However, its RDMA-based implementation is not yet integrated with training frameworks. Hence, we use SwitchML's DPDK-based implementation integrated with the PyTorch [49] training framework. In addition, SwitchML provides two packet size designs: (i) SwitchML-64, where each packet carries 64 gradients and uses a single pipeline on the Tofino switch to perform aggregation, and (ii) SwitchML-256, where each packet carries 256 gradients requiring all four pipelines to perform the aggregation at line rate. SwitchML-256 performs better than SwitchML-64; therefore, in our evaluations, we use SwitchML-256 with pool size 512, even though it consumes the resources of all four pipelines on our Tofino switch. Finally, SwitchML end-hosts retransmit their gradients after 1 ms to tolerate packet loss. But this feature creates spurious retransmissions during straggling periods, reducing SwitchML's performance. Therefore, we disable this feature in our experiments.

Trio-ML setup. To make an apples-to-apples comparison with SwitchML, we configure Trio-ML servers to use DPDK integrated with PyTorch. Unless otherwise stated, we configure each server to send 1024 gradients per packet and stream the gradients using a window size of 4096 packets. Section 6.3 evaluates the impact of varying the number of gradients and the window size on performance. For straggler detection, we launch N = 100 timer threads on Trio, each with a 10 ms timeout period, unless otherwise stated.

Straggler generation pattern. To evaluate the impact of in-network straggler mitigation, we synthetically generate transient worker slowdown by inserting sleep commands into the servers during training. Following prior work [38], we use the "Slow Worker Pattern" to inject stragglers by selecting three possible delay points in each iteration and allowing one of the servers to decide to slow down at each point with a given probability p (straggling probability). If a worker decides to straggle at a particular delay point, it will be slowed for a period uniformly randomly chosen between 0.5 and 2× of the typical iteration time, where typical iteration time refers to the average iteration time of each model when there are no stragglers in the system.
Ideal setup. To compare Trio-ML against an ideal environment in which no stragglers exist, we use PyTorch with NCCL [8] and an RDMA backend, without adding stragglers.

6.2 Distributed ML Training Speedups

Time-to-accuracy improvements. The ultimate goal of in-network aggregation is to reduce the time-to-accuracy of distributed ML training workloads, even in the presence of stragglers. Figure 12 shows the top-5 validation accuracy results for the three DNN training jobs when the straggling probability p is 16%. This probability emulates an environment with a moderate rate of stragglers. Note that the probability of a straggler event occurring in a single iteration increases with the number of workers participating in a job [56]. Therefore, it is entirely possible that, at a large scale, every training iteration observes at least one straggling worker, even when it is a different straggler in each iteration due to uncorrelated transient effects [38]. As shown in Figure 12(a), when training ResNet50, Trio-ML reaches the 90% target validation accuracy 1.56× faster than SwitchML. Trio-ML recovers from straggler delays via its partial aggregation strategy, whereas SwitchML servers need to wait for the straggling server. Similarly, Figures 12(b) and (c) show that Trio-ML outperforms SwitchML by 1.56× and 1.60× when training DenseNet161 and VGG11, respectively.
Figure 12: Time-to-accuracy improvements for three DNN models when straggling probability is p = 16%.

Figure 13: Impact of straggling workers on training iteration time for three DNN models. Trio-ML is able to maintain the training iteration time close to the Ideal case.
Training iteration time. To evaluate the impact of the straggling probability on training performance, we sweep through different probabilities and measure the corresponding training iteration times. Figure 13 shows the average iteration time for the first 100 iterations when training the three DNN models with Trio-ML, SwitchML, and the Ideal setup. The figure shows that as the straggling probability increases, SwitchML's iteration time increases because the switch needs to wait for the straggling worker(s) before generating the aggregation result. Increasing SwitchML's pool size does not help, as its aggregation logic requires all participating workers to contribute before making progress. In contrast, Trio-ML mitigates the effect of stragglers using its in-network straggler mitigation technique and is able to maintain the training iteration time close to the Ideal case. With straggling probability p = 16%, Trio-ML speeds up the average iteration time by 1.72× for ResNet50, 1.75× for DenseNet161, and 1.8× for VGG11, compared to SwitchML.

In-network timer threads' efficiency. Trio's timer threads periodically scan the aggregation records to identify which servers are straggling beyond the specified timeout interval. To evaluate the efficiency of this process, we vary the timeout interval on Trio and measure how long it takes for the non-straggling servers to receive partial aggregation results. For each timeout interval, we send 20 back-to-back packets and report the time between sending one aggregation packet and receiving the corresponding result packet on each server. Figure 14 shows that Trio-ML servers are able to recover from stragglers within 2× the timeout interval.
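As a rough illustration of this mechanism, the following C-style sketch shows what one scan pass might look like over the block records defined in Appendix A.1; the trimmed struct, now_ms(), and emit_partial_result() are our own illustrative stand-ins for the actual Microcode and its primitives, not shipping code.

#include <stdint.h>

/* Trimmed view of the block record from Appendix A.1, keeping only
 * the fields a timeout scan needs. now_ms(), emit_partial_result(),
 * and the array layout are illustrative. */
typedef struct {
    uint64_t block_start_time; /* start time of the current block */
    uint8_t  block_exp;        /* block timeout interval (ms) */
    uint8_t  rcvd_cnt;         /* number of ML sources received so far */
} block_view_t;

extern uint64_t now_ms(void);
extern void emit_partial_result(block_view_t *blk);

void timer_scan(block_view_t *blocks, int n_blocks) {
    for (int i = 0; i < n_blocks; i++) {
        block_view_t *blk = &blocks[i];
        /* A block that has waited past its timeout with some, but not
         * all, sources received is treated as straggling: ship the
         * partial aggregation result to the non-straggling servers. */
        if (blk->rcvd_cnt > 0 &&
            now_ms() - blk->block_start_time >= blk->block_exp)
            emit_partial_result(blk);
    }
}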
6.3 Trio-ML Microcode Program Performance

The previous section established Trio-ML's performance in the presence of stragglers. This section benchmarks the performance of our in-network aggregation Microcode program without stragglers (i.e., p = 0). In these benchmarks, we use four servers connected to the same PFE.

Microcode program analysis. The Trio-ML Microcode program is quite compact, using ≈60 instructions. It uses a single thread per packet, and most of the cycles are spent in a loop that reads gradients from the packet tail into the thread's local memory and adds them into the aggregation buffer. This loop's efficiency is ≈1.2 run-time instructions per gradient, and it is executed for every packet from every source. Another loop copies aggregated gradients into the Packet Buffer when building the aggregation result packet; it uses less processing time because it is executed once per block, i.e., once for the whole set of sources. Trio's read-modify-write engines perform the summation of gradients entering the aggregation buffer. Trio-ML uses 12 such engines, and each add operation takes two cycles. With a 1 GHz clock speed, the current Trio generation supports 6 billion operations per second per PFE.
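The two loops described above can be pictured with the following C-level sketch; in the actual Microcode, the additions are issued to Trio's read-modify-write engines rather than computed by the PPE thread itself, and all names here are illustrative (including the int32 gradient type).

#include <stdint.h>

/* C-level sketch of the two Trio-ML hot loops described above.
 * Per the text: 12 read-modify-write engines per PFE, one 2-cycle add
 * each, at 1 GHz clock speed: 12 * (1e9 / 2) = 6e9 adds/s per PFE. */
void aggregate_packet(const int32_t *pkt_tail, int32_t *aggr_buf, int n) {
    /* Executed for every packet from every source
     * (~1.2 run-time instructions per gradient). */
    for (int i = 0; i < n; i++)
        aggr_buf[i] += pkt_tail[i];   /* issued as read-modify-write */
}

void build_result(const int32_t *aggr_buf, int32_t *packet_buffer, int n) {
    /* Executed once per block, i.e., once for the whole set of sources. */
    for (int i = 0; i < n; i++)
        packet_buffer[i] = aggr_buf[i];
}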
Aggregation latency. We quantify the PFE aggregation latency by instrumenting the Trio-ML Microcode program to keep track of the amount of time each aggregation packet spends in Trio. To compute a faithful estimate of a single thread's aggregation latency, we force each server to send only one aggregation packet at a time by setting the window parameter to one. Each experiment consists of sending 10,000 back-to-back packets, and we repeat the experiments 20 times. The left y-axis of Figure 15 reports the average aggregation latency as the number of gradients per packet is increased. With 64 gradients per packet, the aggregation latency is 30 𝜇s. Larger packets incur a larger aggregation latency, but this increase is not always linear. For instance, increasing the packet size by a factor of 16 (1024 on the x-axis) causes the aggregation latency to increase to 200 𝜇s (a factor of 6.6 increase). This result suggests Trio is more efficient with larger packets. The right y-axis of Figure 15 confirms this observation by plotting the aggregation rate (the ratio of the number of gradients over the aggregation latency). As shown, the aggregation rate starts to plateau between 512 and 1024 gradients per packet. Next, we evaluate the impact of increasing the window size for these two packet sizes.
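For intuition, the aggregation rate on the right y-axis of Figure 15 follows directly from the latency numbers quoted above (a back-of-the-envelope check using only those two data points):

    rate = (gradients per packet) / (aggregation latency)
    64 / 30 𝜇s ≈ 2.1 million gradients per second
    1024 / 200 𝜇s ≈ 5.1 million gradients per second

In other words, a 16× larger packet costs only 6.6× more latency, improving the per-packet rate by roughly 2.4×, consistent with the plateau between 512 and 1024 gradients per packet.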
Figure 14: In-network timer threads' efficiency.

Figure 15: Per-PFE aggregation latency and rate.

Figure 16: Impact of window size on aggregation latency and throughput.

Impact of aggregation window size. Increasing the window parameter enables the PPE threads to work on multiple aggregations in parallel. To evaluate the impact of window size on Trio-ML's aggregation latency and throughput, we configure the servers to send packets with 512 or 1024 gradients and varying window sizes. We refer to these two cases as Trio-ML-512 and Trio-ML-1024. Figure 16 shows the interplay between aggregation latency and throughput as the window size increases. Figure 16(a) shows that increasing the window size causes the aggregation latency to increase because the Microcode program needs to handle more simultaneous aggregation packets. Figure 16(b) shows that increasing the window size improves the aggregation throughput because it pipelines packet arrivals into the router. The best window size is the one that maximizes throughput while minimizing latency. We find that a window size of 4096 achieves a good balance between latency and throughput.

7 DISCUSSION AND FUTURE USE CASES

Trio for in-network telemetry. Most network operators require telemetry, or insight into the traffic in their networks, for capacity planning, service-level agreement monitoring, security mitigation, and other purposes. Current networking devices usually rely on packet sampling, using either internal processors embedded in the devices or external monitoring devices for further processing. Because of the high rate of traffic through the devices and the limited amount of processing and bandwidth available for monitoring, only a small percentage of packets (one in tens of thousands or less) is selected to be monitored, and the decision to sample packets is often blind, based on a simple time interval [62]. Trio's packet processing flexibility and availability of operational resources make it suitable for in-network telemetry. For instance, service providers can leverage Trio's large memory to keep track of incoming packets and maintain sufficient information for telemetry. Moreover, Trio's timer threads are suitable for periodic monitoring and anomaly analysis. To provide more intelligent telemetry for network operators, machine learning-based classification techniques may be performed on each packet, based on the packet fields already extracted by Trio for routing purposes. Finally, the data structures can be stored more efficiently, thereby reducing the transmission bandwidth and processing cycles of external monitoring devices.
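As a purely illustrative example (not shipping Trio code) of the kind of stateful tracking this would involve, a per-flow record in Trio's memory might be updated on every packet and periodically exported by a timer thread; every name in the following C sketch is hypothetical.

#include <stdint.h>

/* Hypothetical per-flow telemetry record and update path. Counter
 * updates of this form map naturally onto Trio's read-modify-write
 * operations, and a timer thread can walk the table periodically to
 * export and reset records for off-box analysis. */
typedef struct {
    uint64_t pkts;
    uint64_t bytes;
    uint64_t last_seen_ms;
} flow_stats_t;

extern flow_stats_t *lookup_flow(uint32_t flow_hash);  /* hypothetical */

void count_packet(uint32_t flow_hash, uint32_t pkt_len, uint64_t now_ms) {
    flow_stats_t *fs = lookup_flow(flow_hash);
    fs->pkts++;
    fs->bytes += pkt_len;
    fs->last_seen_ms = now_ms;
}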
Trio for in-network security. To mitigate DDoS attacks, the MX systems based on Trio support a feature to identify and drop malicious packets, capitalizing on the chipset's high performance and flexible packet filter mechanism. Trio also acts as a fast forwarding path based on security flows on the SRX security platforms [1]. Trio is capable of performing additional complex in-network security processing on incoming packets, either by aggregating features or by performing inference over ML models installed by service providers, to identify and mitigate anomalies in traffic. Unlike off-device solutions, Trio's programmable architecture performs anomaly detection on the network datapath, enabling low-latency threat mitigation.

Packet loss in Trio-ML. Transient traffic spikes may occur in a datacenter running a variety of diverse applications, and this, in turn, may lead to aggregation packets being lost. A practical in-network aggregation system needs a level of resiliency that allows long-running jobs to survive such hiccups. SwitchML [63] suggests how such resiliency can be achieved. The Trio-ML implementation has provisions to support this solution, although it is not part of the current code, and we leave it to future work.

Future open sourcing plans. We are considering several future open-source ideas. First, we plan to add comprehensive support for P4 programming for Trio. Juniper engineering has made an initial effort to achieve this goal [6], but recent changes and enhancements to the P4 core specification should allow greater flexibility and more features to be exposed via the P4 interface. Second, we plan to create a domain-specific language that makes the full scope of the Trio chipset's forwarding-path features available to third-party developers. Juniper is exploring development in this area and welcomes feedback from the community.

8 RELATED WORK

In-network computing using programmable switches. Several prior papers proposed in-network computing by leveraging some form of programmability inside the network. These approaches fall into two categories: (1) computation at line rate using PISA-based architectures [18, 48, 63] and (2) computation at sub-line rates using on-chip FPGAs [21]. Our in-network ML aggregation use case is closely related to Sharp [18], SwitchML [63], ATP [48], PANAMA [36], and Flare [29]. Sharp [18] is Mellanox's proprietary design geared towards dedicated ML training clusters; it assumes network bandwidth can be exclusively reserved. In contrast, we consider networks where links are shared across multiple users and applications. SwitchML [63] and ATP [48] use commercially available Tofino switches to perform gradient aggregation. Although Tofino switches can perform line-rate packet processing, their pipelined architecture has more limited programmability, making in-network straggler mitigation extremely challenging. We use SwitchML as a baseline comparison to Trio-ML. SwitchML serves as an apples-to-apples comparison for our use case, making it a more appropriate baseline than ATP. More specifically, ATP's performance improvements stem from both in-network aggregation and an additional parameter server, while SwitchML and Trio-ML are more similar, as both approaches use only the switch/router for aggregation. PANAMA's [36] in-network aggregation hardware can support flexible packet processing, but it is based on FPGAs acting as bumps-in-the-wire, making it impractical for large-scale deployments. This paper, however, aims to use Trio's programmable architecture to design new stateful in-network applications from the ground up. Several key features of Trio enable these new applications. First, Trio's large memory and fast access to packet tail data enable efficient in-network computation. Second, Trio's Shared Memory System provides several GBytes of storage; this is sufficient for data storage even in the presence of straggling workers, or when multiple applications are running simultaneously. Finally, Trio places no limit on the number of instructions executed for a single packet, enabling the Microcode program to launch the computation instructions required by large packets.

Straggler mitigation. There is a plethora of prior work on understanding and mitigating the impact of stragglers in distributed systems [13-15, 25, 26, 30, 31, 33, 34, 38, 46, 51, 56, 57, 59, 61, 72, 75, 78, 79]. In particular, Harlap et al. proposed FlexRR to mitigate the impact of stragglers on distributed learning jobs [38]. FlexRR requires peer-to-peer communication among workers to detect slowed workers and perform work re-assignment. In contrast, we consider mitigating stragglers inside the network, without any message passing across workers and without requiring a parameter server. Tandon et al. [72] and Raviv et al. [61] proposed coding-theory frameworks for mitigating stragglers in distributed learning by duplicating the training data across workers; Trio-ML, however, does not require data duplication.

Alternative switch architectures. The research community has been working on alternative switch architectures to address some of the limitations of PISA-based architectures, such as the lack of shared memory and shallow pipeline depths. The most competitive example is dRMT (Disaggregated Programmable Switching) [24]. The dRMT switch architecture implements a centralized, shared memory pool that all match-action stages can access. Instead of executing the match-action stages in a pipeline, dRMT aggregates these stages in a cluster and executes them in a round-robin order. A control logic unit schedules the stages so as to maximize the cluster's throughput while respecting program dependencies. However, the centralized memory pool is gated by a mux that connects stages to memory, and only one stage can access memory in a given clock cycle. This can slow down program execution when an application requires memory access in multiple stages. In Trio, multiple threads can send memory access requests to the same memory location at around the same time, and Trio's read-modify-write engine processes the requests in sequence, guaranteeing consistency of the updates. In addition, dRMT's memory accesses through the crossbar are scheduled at compile time, which reduces the flexibility of incrementally updating and recompiling the application code. The complexity of the crossbar scheduling algorithm can limit the architecture's ability to scale to higher numbers of match-action processors. In contrast, Trio's crossbar is scheduled in real time, thus providing efficient access to memory. This dynamic scheduling mechanism enables Trio to scale from 16 PPEs in the first generation to 160 PPEs in the sixth generation, and it will continue to scale higher in the future. Furthermore, in dRMT, the packet parser and deparser are located outside the match-action processors. Any parsing of the inner headers of packets that relies on the lookup results (e.g., MPLS-encapsulated packets) has to be recirculated back to the parser for processing. In contrast, Trio's PPEs are fully programmable processors, able to handle packet parsing/deparsing as well as the rest of the packet lookup and processing, in a run-to-completion manner. Trio's multi-threaded PPEs also allow packets to be processed by different Microcode programs depending on their processing requirements.

9 CONCLUSION

This paper describes Juniper Networks' programmable chipset, Trio, and its use in emerging data-intensive in-network applications. Trio has been in production for over a decade and has built a large customer base with billions of dollars in market share. We describe Trio's multi-threaded and programmable packet forwarding and packet processing engines. We then use in-network aggregation for distributed machine learning training and in-network straggler mitigation as two use cases to illustrate Trio's Microcode and programming environment. Our evaluations show that Trio outperforms today's pipeline-based solutions by up to 1.8×. This work does not raise any ethical issues.

10 ACKNOWLEDGMENTS

We would like to thank our shepherd Gábor Rétvári and the anonymous reviewers for their valuable feedback. We also acknowledge Juniper Networks for providing resources for this research. In particular, we thank Pradeep Sindhu and the Trio development team for creating the chipsets. We are grateful to Raj Yavatkar and Alex Mallery for valuable discussions of the paper. This work was supported by the Air Force AI Accelerator, ARPA-E ENLITENED PINE, DARPA FastNICs, NSF grants CNS-2008624, SHF-2107244, ASCENT-2023468, CAREER-2144766, and a Sloan fellowship. This research was partially sponsored by the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. The nature of this research is not military-related and does not have direct military implications.
REFERENCES
[1] [n. d.]. Accelerating the Next Generation of Juniper Connected Security with Trio. ([n. d.]). https://blogs.juniper.net/en-us/security/accelerating-the-next-generation-of-juniper-connected-security-with-trio.
[2] [n. d.]. Barefoot Tofino. ([n. d.]). https://www.barefootnetworks.com/products/brief-tofino/.
[3] [n. d.]. Juniper Networks Advanced Forwarding Interface (AFI). ([n. d.]). https://github.com/Juniper/AFI.
[4] [n. d.]. Juniper Networks' MX Series Universal Routing Platform Interface Module Reference. ([n. d.]). https://www.juniper.net/documentation/us/en/hardware/mx-module-reference/topics/concept/mpc10e-15c-mrate.html.
[5] [n. d.]. Juniper Networks vMX Series Universal Routing Platform. ([n. d.]). https://www.juniper.net/us/en/products/routers/mx-series/vmx-virtual-router-software.html.
[6] [n. d.]. Juniper P4 Agent. ([n. d.]). https://github.com/Juniper/JP4Agent.
[7] [n. d.]. MLPerf: A broad ML benchmark suite. ([n. d.]). https://mlperf.org/.
[8] [n. d.]. NVIDIA Collective Communication Library (NCCL). ([n. d.]). https://developer.nvidia.com/nccl.
[9] 2017. Meet Horovod: Uber's Open Source Distributed Deep Learning Framework for TensorFlow. (2017). https://eng.uber.com/horovod.
[10] 2022. Juniper Networks' MX480 Universal Routing Platform. (2022). https://www.juniper.net/us/en/products/routers/mx-series/mx480-universal-routing-platform.html.
[11] 2022. NVIDIA A100 Tensor Core GPU. (2022). https://www.nvidia.com/en-us/data-center/a100/.
[12] J. R. Allen, B. M. Bass, C. Basso, R. H. Boivie, J. L. Calvignac, G. T. Davis, L. Frelechoux, M. Heddes, A. Herkersdorf, A. Kind, J. F. Logan, M. Peyravian, M. A. Rinaldi, R. K. Sabhikhi, M. S. Siegel, and M. Waldvogel. 2003. IBM PowerNP network processor: Hardware, software, and applications. IBM Journal of Research and Development 47, 2.3 (2003), 177-193. https://doi.org/10.1147/rd.472.0177
[13] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. 2012. Why let resources idle? Aggressive Cloning of Jobs with Dolly. In USENIX HotCloud. https://www.microsoft.com/en-us/research/publication/let-resources-idle-aggressive-cloning-jobs-dolly/
[14] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. 2013. Effective Straggler Mitigation: Attack of the Clones. In 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13). USENIX Association, Lombard, IL, 185-198. https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/ananthanarayanan
[15] Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. 2010. Reining in the Outliers in Map-Reduce Clusters using Mantri. In 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI 10). USENIX Association, Vancouver, BC. https://www.usenix.org/conference/osdi10/reining-outliers-map-reduce-clusters-using-mantri
[16] Sally Bament. 2022. Juniper Introduces New Trio 6-based MX Portfolio. (2022). https://blogs.juniper.net/en-us/service-provider-transformation/juniper-introduces-new-trio-6-based-mx-portfolio.
[17] Ran Ben Basat, Sivaramakrishnan Ramanathan, Yuliang Li, Gianni Antichi, Minlan Yu, and Michael Mitzenmacher. 2020. PINT: Probabilistic In-band Network Telemetry. In ACM SIGCOMM.
[18] Gil Bloch. 2019. Accelerating Distributed Deep Learning with In-Network Computing Technology. (Aug. 2019). https://conferences.sigcomm.org/events/apnet2019/slides/Industrial_1_3.pdf
[19] Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, and David Walker. 2014. P4: Programming Protocol-Independent Packet Processors. ACM SIGCOMM Computer Communication Review (CCR) (2014).
[20] Pat Bosshart, Glen Gibb, Hun-Seok Kim, George Varghese, Nick McKeown, Martin Izzard, Fernando Mujica, and Mark Horowitz. 2013. Forwarding metamorphosis: Fast programmable match-action processing in hardware for SDN. ACM SIGCOMM Computer Communication Review 43, 4 (2013), 99-110.
[21] Pietro Bressana, Noa Zilberman, Dejan Vucinic, and Robert Soulé. 2020. Trading Latency for Compute in the Network. In ACM NAI.
[22] Broadcom. [n. d.]. BCM56870 Series. ([n. d.]). https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56870-series.
[23] A. Caulfield, P. Costa, and M. Ghobadi. 2018. Beyond SmartNICs: Towards a Fully Programmable Cloud: Invited Paper. In IEEE HPRS.
[24] Sharad Chole, Andy Fingerhut, Sha Ma, Anirudh Sivaraman, Shay Vargaftik, Alon Berger, Gal Mendelson, Mohammad Alizadeh, Shang-Tse Chuang, Isaac Keslassy, Ariel Orda, and Tom Edsall. 2017. dRMT: Disaggregated Programmable Switching. In ACM SIGCOMM.
[25] James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R. Ganger, Garth Gibson, Kimberly Keeton, and Eric Xing. 2013. Solving the Straggler Problem with Bounded Staleness. In 14th Workshop on Hot Topics in Operating Systems (HotOS XIV). USENIX Association, Santa Ana Pueblo, NM. https://www.usenix.org/conference/hotos13/session/cipar
[26] Eli Cortez, Anand Bonde, Alexandre Muzio, Mark Russinovich, Marcus Fontoura, and Ricardo Bianchini. 2017. Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP '17). ACM, New York, NY, USA, 153-167. https://doi.org/10.1145/3132747.3132772
[27] H. T. Dang, P. Bressana, H. Wang, K. S. Lee, N. Zilberman, H. Weatherspoon, M. Canini, F. Pedone, and R. Soule. 2020. P4xos: Consensus as a Network Service. IEEE/ACM Transactions on Networking 28, 4 (2020).
[28] Huynh Tu Dang, Marco Canini, Fernando Pedone, and Robert Soulé. 2016. Paxos made switch-y. ACM SIGCOMM Computer Communication Review 46, 2 (2016), 18-24.
[29] Daniele De Sensi, Salvatore Di Girolamo, Saleh Ashkboos, Shigang Li, and Torsten Hoefler. 2021. Flare: Flexible in-Network Allreduce. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). Association for Computing Machinery, New York, NY, USA, Article 35, 16 pages. https://doi.org/10.1145/3458817.3476178
[30] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc Le, and Andrew Ng. 2012. Large Scale Distributed Deep Networks. In Advances in Neural Information Processing Systems, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), Vol. 25. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2012/file/6aca97005c68f1206823815f66102863-Paper.pdf
[31] Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In OSDI'04: Sixth Symposium on Operating System Design and Implementation. San Francisco, CA, 137-150.
[32] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
[33] Celestine Dünner, Thomas Parnell, Dimitrios Sarigiannis, Nikolas Ioannou, Andreea Anghel, Gummadi Ravi, Madhusudanan Kandasamy, and Haralampos Pozidis. 2018. Snap ML: A Hierarchical Framework for Machine Learning. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 252-262. http://papers.nips.cc/paper/7309-snap-ml-a-hierarchical-framework-for-machine-learning.pdf
[34] Farshid Farhat, Diman Zad Tootaghaj, Yuxiong He, Anand Sivasubramaniam, Mahmut Kandemir, and Chita R. Das. 2018. Stochastic Modeling and Optimization of Stragglers. IEEE Transactions on Cloud Computing 6, 4 (Oct 2018), 1164-1177. https://doi.org/10.1109/tcc.2016.2552516
[35] Yong Feng, Zhikang Chen, Haoyu Song, Wenquan Xu, Jiahao Li, Zijian Zhang, Tong Yun, Ying Wan, and Bin Liu. 2022. Enabling In-situ Programmability in Network Data Plane: From Architecture to Language. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). 635-649.
[36] N. Gebara, P. Costa, and M. Ghobadi. 2021. PANAMA: In-network Aggregation for Shared Machine Learning Clusters. In Proc. Conference on Machine Learning and Systems (MLSys). 1-16.
[37] Nadeen Gebara, Alberto Lerner, Mingran Yang, Minlan Yu, Paolo Costa, and Manya Ghobadi. 2020. Challenging the Stateless Quo of Programmable Switches. In ACM Workshop on Hot Topics in Networks (HotNets). ACM. https://www.microsoft.com/en-us/research/publication/challenging-the-stateless-quo-of-programmable-switches/
[38] Aaron Harlap, Henggang Cui, Wei Dai, Jinliang Wei, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, and Eric P. Xing. 2016. Addressing the Straggler Problem for Iterative Convergent Parallel ML. In Proceedings of the Seventh ACM Symposium on Cloud Computing (SoCC '16). ACM, New York, NY, USA, 98-111. https://doi.org/10.1145/2987550.2987554
[39] Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, and Phillip B. Gibbons. 2018. PipeDream: Fast and Efficient Pipeline Parallel DNN Training. CoRR abs/1806.03377 (2018).
[40] Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, and Roy H. Campbell. 2018. Communication Scheduling as a First-Class Citizen in Distributed Machine Learning Systems. CoRR abs/1803.03288 (2018). arXiv:1803.03288 http://arxiv.org/abs/1803.03288
[41] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770-778. https://doi.org/10.1109/CVPR.2016.90
[42] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '17). IEEE Computer Society, Honolulu, HI, 2261-2269. https://arxiv.org/abs/1608.06993v5
[43] Xin Jin, Xiaozhou Li, Haoyu Zhang, Robert Soulé, Jeongkeun Lee, Nate Foster, Changhoon Kim, and Ion Stoica. 2017. NetCache: Balancing Key-Value Stores with Fast In-Network Caching. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP '17).
[44] Daehyeok Kim, Zaoxing Liu, Yibo Zhu, Changhoon Kim, Jeongkeun Lee, Vyas Sekar, and Srinivasan Seshan. 2020. TEA: Enabling State-Intensive Network Functions on Programmable Switches. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '20). Association for Computing Machinery, New York, NY, USA, 90-106. https://doi.org/10.1145/3387514.3405855
[45] Daehyeok Kim, Yibo Zhu, Changhoon Kim, Jeongkeun Lee, and Srinivasan Seshan. 2018. Generic External Memory for Switch Data Planes. In ACM HotNets.
[46] Eugene Kirpichov and Malo Denielou. 2016. No shard left behind: dynamic work rebalancing in Google Cloud Dataflow. (May 2016). https://cloud.google.com/blog/products/gcp/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow
[47] Benjamin Klenk, Nan Jiang, G. Thorson, and L. Dennison. 2020. An In-Network Architecture for Accelerating Shared-Memory Multiprocessor Collectives. 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) (2020), 996-1009.
[48] ChonLam Lao, Yanfang Le, Kshiteej Mahajan, Yixi Chen, Wenfei Wu, Aditya Akella, and Michael Swift. 2021. ATP: In-network Aggregation for Multi-tenant Learning. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21). USENIX Association, 741-761.
[49] Adam Lerer, Ledell Wu, Jiajun Shen, Timothée Lacroix, Luca Wehrstedt, Abhijit Bose, and Alexander Peysakhovich. 2019. PyTorch-BigGraph: A Large-scale Graph Embedding System. CoRR abs/1903.12287 (2019). arXiv:1903.12287 http://arxiv.org/abs/1903.12287
[50] Alberto Lerner, Rana Hussein, and Philippe Cudré-Mauroux. 2019. The Case for Network Accelerated Query Processing. In Proceedings of the Innovative Data Systems Research Conference (CIDR '19).
[51] Bojie Li, Zhenyuan Ruan, Wencong Xiao, Yuanwei Lu, Yongqiang Xiong, Andrew Putnam, Enhong Chen, and Lintao Zhang. 2017. KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP '17). ACM, New York, NY, USA, 137-152. https://doi.org/10.1145/3132747.3132756
[52] Jialin Li, Ellis Michael, and Dan R. K. Ports. 2017. Eris: Coordination-Free Consistent Transactions Using In-Network Concurrency Control. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP '17).
[53] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). USENIX Association, Broomfield, CO, 583-598. https://www.usenix.org/conference/osdi14/technical-sessions/presentation/li_mu
[54] Youjie Li, Iou-Jen Liu, Yifan Yuan, Deming Chen, Alexander Schwing, and Jian Huang. 2019. Accelerating Distributed Reinforcement learning with In-Switch Computing. In 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA). 279-291.
[55] Rui Miao, Hongyi Zeng, Changhoon Kim, Jeongkeun Lee, and Minlan Yu. 2017. SilkRoad: Making Stateful Layer-4 Load Balancing Fast and Cheap Using Switching ASICs. In Proceedings of the 2017 ACM SIGCOMM Conference (SIGCOMM '17).
[56] Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A Timely Dataflow System. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). Association for Computing Machinery, New York, NY, USA, 439-455. https://doi.org/10.1145/2517349.2522738
[57] Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. 2013. Sparrow: Distributed, Low Latency Scheduling. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). Association for Computing Machinery, New York, NY, USA, 69-84. https://doi.org/10.1145/2517349.2522716
[58] P4.org Architecture Working Group. [n. d.]. P416 Portable Switch Architecture (PSA). ([n. d.]). https://p4.org/p4-spec/docs/PSA.html.
[59] Matthew Perron, Raul Castro Fernandez, David DeWitt, and Samuel Madden. 2020. Starling: A Scalable Query Engine on Cloud Functions. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20). Association for Computing Machinery, New York, NY, USA, 131-141. https://doi.org/10.1145/3318464.3380609
[60] Dan R. K. Ports and Jacob Nelson. 2019. When Should The Network Be The Computer?. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS '19).
[61] Netanel Raviv, Itzhak Tamo, Rashish Tandon, and Alexandros G. Dimakis. 2020. Gradient Coding From Cyclic MDS Codes and Expander Graphs. IEEE Transactions on Information Theory 66, 12 (2020), 7475-7489. https://doi.org/10.1109/TIT.2020.3029396
[62] Arjun Roy, Hongyi Zeng, Jasmeet Bagga, George Porter, and Alex C. Snoeren. 2015. Inside the Social Network's (Datacenter) Network. In SIGCOMM.
[63] Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan Ports, and Peter Richtarik. 2021. Scaling Distributed Machine Learning with In-Network Aggregation. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21). USENIX Association, 785-808. https://www.usenix.org/conference/nsdi21/presentation/sapio
[64] Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan Ports, and Peter Richtarik. 2022. SwitchML open source code. (2022). https://github.com/p4lang/p4app-switchML.
[65] Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. (2018). arXiv:cs.LG/1802.05799
[66] Christopher J. Shallue, Jaehoon Lee, Joseph M. Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. 2018. Measuring the Effects of Data Parallelism on Neural Network Training. CoRR abs/1811.03600 (2018). arXiv:1811.03600 http://arxiv.org/abs/1811.03600
[67] Vishal Shrivastav. 2022. Stateful Multi-Pipelined Programmable Switches. In Proceedings of the 2022 ACM SIGCOMM Conference (SIGCOMM '22).
[68] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. (2015). arXiv:cs.CV/1409.1556
[69] Anirudh Sivaraman, Alvin Cheung, Mihai Budiu, Changhoon Kim, Mohammad Alizadeh, Hari Balakrishnan, George Varghese, Nick McKeown, and Steve Licking. 2016. Packet Transactions: High-Level Programming for Line-Rate Switches. In ACM SIGCOMM.
[70] Erich Strohmaier, Jack J. Dongarra, Hans W. Meuer, and Horst D. Simon. 1999. The Marketplace of High-Performance Computing. Parallel Comput. 25, 13-14 (Dec 1999), 1517-1544. https://doi.org/10.1016/S0167-8191(99)00067-8
[71] Tushar Swamy, Alexander Rucker, Muhammad Shahbaz, and Kunle Olukotun. 2022. Taurus: A Data Plane Architecture for Per-Packet ML. ASPLOS (2022).
[72] Rashish Tandon, Qi Lei, Alexandros G. Dimakis, and Nikos Karampatziakis. 2017. Gradient Coding: Avoiding Stragglers in Distributed Learning. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. PMLR, International Convention Centre, Sydney, Australia, 3368-3376. http://proceedings.mlr.press/v70/tandon17a.html
[73] Muhammad Tirmazi, Ran Ben Basat, Jiaqi Gao, and Minlan Yu. 2020. Cheetah: Accelerating Database Queries with Switch Pruning. In Proceedings of the 2020 ACM SIGMOD Conference (SIGMOD '20).
[74] Yuta Tokusashi, Huynh Tu Dang, Fernando Pedone, Robert Soulé, and Noa Zilberman. 2019. The Case For In-Network Computing On Demand. In EuroSys.
[75] Da Wang, Gauri Joshi, and Gregory W. Wornell. 2019. Efficient Straggler Replication in Large-Scale Parallel Computing. ACM Trans. Model. Perform. Eval. Comput. Syst. 4, 2, Article 7 (April 2019), 23 pages. https://doi.org/10.1145/3310336
[76] Zhaoqi Xiong and Noa Zilberman. 2019. Do Switches Dream of Machine Learning? Toward In-Network Classification. In Proceedings of the 18th ACM Workshop on Hot Topics in Networks (HotNets '19).
[77] Yifan Yuan, Omar Alama, Jiawei Fei, Jacob Nelson, Dan R. K. Ports, Amedeo Sapio, Marco Canini, and Nam Sung Kim. 2022. Unlocking the Power of Inline Floating-Point Operations on Programmable Switches. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). 683-700.
[78] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. 2010. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. In Proceedings of the 5th European Conference on Computer Systems (EuroSys '10). Association for Computing Machinery, New York, NY, USA, 265-278. https://doi.org/10.1145/1755913.1755940
[79] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. 2008. Improving MapReduce Performance in Heterogeneous Environments. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI '08). USENIX Association, Berkeley, CA, USA, 29-42.
A APPENDICES
Appendices are supporting material that has not been peer-reviewed.

A.1 Trio-ML Record Structure
Figure 17 shows the structure definition of the job records. A field without a field name represents unused bits for byte alignment. The information contained in the other fields is as follows:

block_curr_cnt: current number of active blocks
block_cnt_max: maximum number of concurrent blocks
block_grad_max: maximum number of gradients per block
block_exp: block timeout interval in milliseconds
block_total_cnt: job's cumulative block count
out_src_addr: result packet source IP
out_dst_addr: result packet destination IP
out_nh_addr: pointer to egress forward chain
src_cnt: number of ML sources in the job
src_mask_0: bitmask field for the job's sources
src_mask_1: additional bitmask field for the job's sources
src_mask_2: additional bitmask field for the job's sources
src_mask_3: additional bitmask field for the job's sources

struct trio_ml_job_ctx_t { // 58 bytes
    block_curr_cnt  : 16;
    block_cnt_max   : 12;
    block_grad_max  : 12;
    block_exp       : 8;
    block_total_cnt : 32;
    out_src_addr    : 32;
    out_dst_addr    : 32;
    out_nh_addr     : 32;
                    : 24;
    src_cnt         : 8;
    src_mask_0      : 64;
    src_mask_1      : 64;
    src_mask_2      : 64;
    src_mask_3      : 64;
};

Figure 17: Trio-ML job record structure.

Figure 18 shows the structure definition of the block records. As above, a field without a field name represents unused bits for byte alignment. The information contained in the other fields is as follows:

block_exp: block timeout interval in milliseconds
block_age: age of the current block
block_start_time: start time of the current block
job_ctx_paddr: pointer to the job record
aggr_paddr: pointer to the aggregation buffer
grad_cnt: number of gradients in the block
rcvd_cnt: number of received ML sources
rcvd_mask_0: bitmask field for received sources
rcvd_mask_1: additional bitmask field for received sources
rcvd_mask_2: additional bitmask field for received sources
rcvd_mask_3: additional bitmask field for received sources

struct trio_ml_block_ctx_t { // 58 bytes
    block_exp        : 8;
    block_age        : 8;
    block_start_time : 64;
    job_ctx_paddr    : 32;
    aggr_paddr       : 32;
                     : 20;
    grad_cnt         : 12;
                     : 24;
    rcvd_cnt         : 8;
    rcvd_mask_0      : 64;
    rcvd_mask_1      : 64;
    rcvd_mask_2      : 64;
    rcvd_mask_3      : 64;
};

Figure 18: Trio-ML block record structure.
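To connect these fields to the datapath behavior described in Section 6, the following hypothetical C sketch shows how an arriving gradient packet from source src might update a block record; the trimmed struct and helper names are ours, and the real Microcode performs such updates through Trio's read-modify-write engines.

#include <stdint.h>

/* Hypothetical sketch of a block-record update for a gradient packet
 * from source `src`. Field names follow the structures above; the
 * struct is trimmed and all helper names are illustrative. */
typedef struct {
    uint64_t rcvd_mask_0; /* bitmask of sources received */
    uint8_t  rcvd_cnt;    /* number of received ML sources */
} block_ctx_view_t;

extern void send_result_packet(block_ctx_view_t *blk);

void on_gradient_packet(block_ctx_view_t *blk, unsigned src, uint8_t src_cnt) {
    uint64_t bit = 1ULL << src;           /* src < 64; higher-numbered
                                             sources use rcvd_mask_1..3 */
    if (!(blk->rcvd_mask_0 & bit)) {      /* first packet from this source */
        blk->rcvd_mask_0 |= bit;
        blk->rcvd_cnt++;
    }
    if (blk->rcvd_cnt == src_cnt)         /* src_cnt lives in the job record */
        send_result_packet(blk);          /* block complete: emit result */
}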