0% found this document useful (0 votes)

16 views19 pages

SiP-ML - High-Bandwidth Optical Network Interconnects For Machine

Uploaded by

橄榄白

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

16 views19 pages

SiP-ML - High-Bandwidth Optical Network Interconnects For Machine

Uploaded by

橄榄白

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 19

SiP-ML: High-Bandwidth Optical Network Interconnects for Machine

Learning Training
Mehrdad Khani1 , Manya Ghobadi1 , Mohammad Alizadeh1 , Ziyi Zhu2 , Madeleine Glick2 , Keren Bergman2 ,
Amin Vahdat3 , Benjamin Klenk4 , Eiman Ebrahimi4
1 Massachusetts Institute of Technology 2 Columbia University 3 Google 4 NVIDIA

ABSTRACT tasks can still take days and even weeks [2–4]. Solutions such as
This paper proposes optical network interconnects as a key enabler NVIDIA DGX [5] enable distributed training on a small number
for building high-bandwidth ML training clusters with strong scal- of GPUs (e.g., 8–16) connected with a high-speed electrical switch
ing properties. Our design, called SiP-ML, accelerates the training with Tbps bandwidth, but large-scale ML clusters must resort to
time of popular DNN models using silicon photonics links capable connecting GPU servers over much slower infiniband fabrics [6,
of providing multiple terabits-per-second of bandwidth per GPU. 7]. We argue that future distributed ML training workloads are
SiP-ML partitions the training job across GPUs with hybrid data likely to require several Tbps of bandwidth per device at large
and model parallelism while ensuring the communication pattern scales, creating a pressing need for entirely new ways to build
can be supported efficiently on the network interconnect. We de- interconnects for distributed ML systems.
velop task partitioning and device placement methods that take the With Silicon Photonic (SiP) technology [8–18], it is now possi-
degree and reconfiguration latency of optical interconnects into ble to build I/O interfaces integrated with an electronic chip with
account. Simulations using real DNN models show that, compared Tbps bandwidth [8, 19]. These optical I/O chiplets can be directly
to the state-of-the-art electrical networks, our approach improves integrated into a CPU/GPU/FPGA/ASIC package [20], providing
training time by 1.3–9.1×. significantly higher bandwidth density than today’s technologies.
This paper proposes an end-to-end optical solution, called SiP-
CCS CONCEPTS ML, for strong scaling of ML workloads by leveraging SiP chiplets.
SiP-ML exploits the predictability of ML training traffic patterns to
• Networks → Network architectures; Network design and
find a parallelization strategy that meets the limitations of the opti-
planning algorithms;
cal topology at hand. Specifically, we explore two all-optical archi-
KEYWORDS tectures: (i) SiP-OCS, an Optical Circuit Switch (OCS) design based
on commercially available switches; and (ii) SiP-Ring, a switch-
Optical networks, Distributed Machine Learning, Silicon photonics, less ring design enabled by reconfigurable Micro-ring resonators
Reconfigurable networks (MRRs) [21] embedded in SiP interfaces [22, 23]. Each of these archi-
ACM Reference Format: tectures inherits one of the constraints of optical circuit-switched
Mehrdad Khani, Manya Ghobadi, Mohammad Alizadeh, Ziyi Zhu, Madeleine interconnects to an extreme. Optical Circuit Switches are too slow
Glick, Keren Bergman, Amin Vahdat, Benjamin Klenk, Eiman Ebrahimi. to reconfigure (e.g., 10 ms [24–26]) for ML models with a few mil-
2021. SiP-ML: High-Bandwidth Optical Network Interconnects for Machine liseconds of iteration time, while the ring topology can only support
Learning Training. In ACM SIGCOMM 2021 Conference (SIGCOMM ’21),
communication between nearby GPUs. We show that SiP-ML’s par-
August 23–27, 2021, Virtual Event, USA. ACM, New York, NY, USA, 19 pages.
https://doi.org/10.1145/3452296.3472900
allelization algorithm can produce traffic patterns suited to both
these constraints by taking the degree limitation of all-optical circuit-
1 INTRODUCTION switched interconnects as an input parameter.
To evaluate SiP-ML, we develop a detailed simulator for dis-
The ever-growing demand for more accurate machine learning (ML)
tributed neural network training. Our simulation results show the
models has resulted in a steady increase in the dataset and model
following: (1) for representative Natural Language Processing and
sizes of deep neural networks (DNNs). Since 2012, the amount of
Computer Vision DNN models, SiP-ML speeds up the total training
compute used in the largest AI training jobs has been increasing
time by a factor of 1.3–9.1× compared to today’s electrical network
exponentially with a 3.4-month doubling time [1], 50× faster than
fabrics; (2) although SiP-Ring’s switchless design constrains the
the pace of Moore’s Law.
physical topology to a ring, it performs similarly to SiP-OCS be-
The computation requirements of large ML models has been
cause of the fast reconfigurability offered by the MRRs; (3) a SiP-ML
partly met by the rapid development of ML hardware accelerators
interconnect with per-GPU bandwidth B performs as well as or
and specialized software stacks. Although hardware accelerators
better than an ideal, full-bisection electrical switch with per-GPU
have provided a significant amount of speed-up, today’s training
bandwidth B/2; (4) when per-GPU bandwidth is high (e.g., order of
Terabits-per-second), hybrid parallelism strategies outperform data
parallelism by up to 2× in terms of time-to-accuracy.
This work is licensed under a Creative Commons Attribution International 4.0 License.
This work does not raise any ethical issues.

SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA

© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-8383-7/21/08.
https://doi.org/10.1145/3452296.3472900

657
SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA M. Khani et al.

Transformer ResNet50 Ideal added to the training job [43]. As a result, the entire system is able
to process a larger global batch while keeping the iteration time of

Time-to-Acc. (normalized)
Thrput (normalized)

0.1
100 each worker the same. It is widely thought that training with large
batches reduces the time-to-accuracy because large batches can
0.01 produce better model updates, allowing the training to converge
10
with fewer total iterations [44, 45]. However, increasing the global
0.001 batch size in DNN training does not always translate to improving
1
the number of iterations for all models [46, 47]. As an example,
10 100 1,000 10 100 1,000
Fig. 1 compares the throughput and time-to-accuracy of two DNN
Number of GPUs Number of GPUs
models: Transformer [48] and ResNet-50 [49]. The numbers are
(a) Throughput (b) Time-to-Accuracy obtained from Nvidia’s benchmark results [50]. As shown in Fig. 1a,
increasing the number of GPUs increases the batch size and thus
Figure 1: Weak scaling in today’s training systems. improves the throughput (images/sec) of both models. However,
the time-to-accuracy does not scale at the same rate and starts to
2 BACKGROUND AND MOTIVATION plateau at large scales, as shown in Fig. 1b. As we show in our eval-
uations, reducing the time-to-accuracy at 1000-GPU scale requires
This section describes the key concepts of designing scalable ML significantly higher bandwidth than today’s clusters (§4).
training interconnects. First, we discuss various parallelization Approach 2: Strong Scaling. Instead of reducing the number of it-
strategies for distributed training (§2.1). Then, we describe weak erations, a more effective scaling approach is to reduce the iteration
and strong scaling and identify their network bandwidth require- time as the number of workers increases. This approach is called
ments (§2.2). Finally, we introduce Silicon Photonics as a promising strong scaling [43]. In contrast to weak scaling where the system
technology to build high-bandwidth ML training interconnects operates on a larger global batch size as the system scales, strong
(§2.3). scaling parallelizes the computation for a fixed batch size either by
reducing the local batch size per worker or by partitioning the com-
2.1 Parallelization Strategies putation task across workers. However, achieving strong scaling
Data Parallelism (DP). A popular parallelization strategy is data is challenging, because reducing the iteration time leads to more
parallelism where a batch of training data is distributed across frequent model updates and, hence, requires the I/O bandwidth to
multiple workers. Each worker has an identical copy of the DNN scale with the number of workers [47]. Furthermore, since each
model but trains on a subset of the training batch, called a local worker must perform small granular computations, strong scaling
batch, in parallel. In DP training, workers need to communicate can be sensitive to network latency and small inefficiencies in the
their model weight updates after each iteration. This step can be compute/network software stack.
performed using various techniques such as broadcasting [27], pa- Bandwidth Requirements of Weak and Strong Scaling. Today,
rameter servers [28], ring-allreduce [29–31], and tree-reduce [32]. the technique most commonly used to scale a distributed training
Model Parallelism (MP). In this approach, the DNN model is par- job is weak scaling using the DP strategy. This approach is popular
titioned across different workers [33, 34]. The batch is copied to all because as more workers are added to the job: (i) the computation
MP workers, and different parts of the DNN model are computed on time of each worker remains constant (since the local batch is
different workers, resulting in faster iteration times. Model paral- constant); and (ii) the size of data transfers at each iteration remains
lelism is an active area of research, with various proposals for model constant (because it depends on the DNN model).1 In contrast, in
partitioning [35–38]. Recent work has shown significant gains can strong scaling approaches, the bandwidth requirement increases
be obtained with model parallelism; however, the degree of model (often super linearly) as the system is scaled, since (i) strong scaling
parallelism has been limited to a few tens of workers [39–42]. leads to reduced computation time per worker and shorter training
Hybrid Parallelism. We consider a hybrid of the above paralleliza- iterations, and (ii) the amount of data exchanged at each iteration
tion strategies. Our proposed interconnects and task partitioning stays the same or even grows with scale.2 In today’s systems, the
algorithms are designed specifically to support a hybrid of DP and degree of MP is limited to 8 or 16 workers within one DGX box [51]
MP.Further, we do not make any assumptions about a specific com- with Tbps communication bandwidth per GPU [42, 52–54].
munication pattern, such as ring-allreduce or all-to-all. Our goal is
to support a variety of communication patterns using smart task 2.3 Silicon Photonics for ML Training
partitioning and GPU placement algorithms (details in §3). A straightforward approach to meet the high-bandwidth require-
ment of large-scale training workloads is to augment the bandwidth
2.2 Weak and Strong Scaling of ML Jobs of existing electrical switches. However, recent trends in SERDES/-
To identify the bandwidth requirements of ML systems, we first packet switching technology suggest that we will hit a wall in
describe two fundamental scaling paradigms. 1 The amount of data transferred in DP in each iteration depends on the all-reduce
Approach 1: Weak Scaling. The first approach is to scale the algorithm. With a ring-reduce implementation, each worker exchanges 2×M, where M
throughput of data processing (number of processed data sam- is the DNN model size. Note that as the number of workers increase, the bandwidth
ples/sec) as the number of workers increases. The principal tech- per worker remains constant but the total required bandwidth grows.
2 The amount of data transferred in MP in each iteration depends on the model parti-
nique for throughput scaling is to keep the local batch size per tioning strategy but often increases significantly with scale, particularly when a kernel
worker fixed and grow the global batch size as more workers are is split on anything other than the batch dimension.

658
SiP-ML: Optical Network Interconnects for Machine Learning SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA

{
counter clock-wise ring

{
All-optical topologies OCS OCS OCS
Gbps domain Tbps domain

clock-wise ring

Tbps Domain
OCS
TeraPHY TeraPHY TeraPHY TeraPHY TeraPHY TeraPHY TeraPHY TeraPHY
GPU GPU GPU GPU GPU GPU GPU GPU

{ Node1 Node2 Noden-1 Noden Node1 Node2 Noden-1 Noden

Server
Node (a) SiP-OCS topology (b) SiP-Ring topology
GPU GPU GPU GPU SiP
ports TeraPHY Figure 3: Two topologies we consider for SiP-ML.
GPU GPU GPU GPU GPU

(a) Today’s ML clusters (b) SiP-ML cluster

domains: (i) a Gbps bandwidth domain that interconnects thou-
sands of servers using conventional network fabrics and off-the-
Figure 2: Comparing today’s ML cluster with SiP-ML. shelf NICs; (ii) an all-to-all Tbps bandwidth domain that tightly
connects a handful of GPUs inside a server or a DGX. In contrast, a
SiP-ML cluster consists of disaggregated GPUs, each equipped with
capacity with standard electrical packet switching [55–58]. For in-
Tbps SiP interfaces, interconnected by an all-optical network. An
stance, realizing an electrical packet switch with 100 ports each
example of a SiP interface is the TeraPHY optical I/O technology de-
with 10 Tbps is extremely challenging. This is because the traffic
veloped by Ayar Labs [64], capable of carrying 2 Tbps bandwidth (80
manager ASIC in the switch needs to process packets at 1000 Tbps
wavelengths each carrying 25 Gbps [65]). A GPU can be equipped
speed, but today’s ASICs can only process packets at 12.8 Tbps
with several of these interfaces. To put the choice of topology into
speed. To get to 1000 Tbps switching, we need to build a “Clos”
perspective, we first introduce two fundamental factors affecting
of switching ASICs inside each electrical switch [59]. This is a
all optical circuit-switched interconnects.
challenging undertaking.
Degree. Unlike packet-switched networks, optical interconnects
At the same time, substantial progress is being made with Sil-
are circuit-based. Hence, at any point in time, each node has a
icon Photonics chiplets to bring optical interconnects very close
limited number of active circuits, thereby limiting the number of
(essentially on die) to the ASICs. Recent advances in SiP fabrication
nodes it can communicate with directly. We refer to this as the
processes have created an opportunity to build chiplets with optical
node degree. A topology with degree D means each node can
I/O ports that can transmit data at far higher rates than electrical
simultaneously maintain, at most, D circuits. Depending on the
conductors [9–17, 60–62]. With SiP interfaces, however, it is possi-
traffic pattern, these circuits can be established with one to D
ble to build I/O interfaces integrated with electronics at 10 Tbps/mm
other nodes. Topologies with higher degree are suited for traffic
bandwidth (BW) density [14, 19, 20, 60, 63, 64]. Such integration
patterns with high fan-out, but they also tend to have a larger
enables building next-generation computer architectures that are
cabling footprint.
fundamentally impossible with today’s technologies.
Reconfiguration Latency. The reconfiguration latency puts a
In this paper, we propose all-optical interconnects as an attrac-
lower bound on how long the circuits should be kept to achieve a
tive solution to build the next generation of ML systems. We argue
high duty cycle [66]. For a topology with reconfiguration latency r ,
that ML workloads present a unique opportunity to build special-
the circuit hold time should be longer than, for instance, 10×r to
ized circuit-based interconnects. While conventional datacenter
achieve a 90% duty cycle.
workloads have unpredictable behavior, with short flows dominat-
There are various optical topologies that realize SiP-ML’s vision.
ing the traffic, ML workloads are predictable, periodic, and consist
At one end of the spectrum are switch-based interconnects, such
of mostly large transfers. Importantly, the parallelization algorithm
as MEMS-based Optical Circuit Switch interconnects [24, 25, 55,
determines the circuit schedules, and the entire training repeats the
67, 68] and Rotor-based interconnects [66, 69]. On the other end lie
same communication pattern at every iteration. This unique char-
switch-free topologies such as ring [26, 70, 71], circulant graphs [72],
acteristic simplifies the control-plane logic with which datacenter
torus [73, 74], hypercube [75] and dragonfly interconnects [76–78].
optical designs have grappled for years.
In this paper, we consider two topologies at opposite ends of the
spectrum, as shown in Fig. 3. SiP-OCS is the first natural topology
3 SIP-ML DESIGN choice because OCSs are commercially available today [79]. How-
In this section, we introduce degree and reconfiguration latency as ever, their reconfiguration latency is ≈10 ms, making them suitable
fundamental factors affecting all optical circuit-based interconnects for circuits that last through the entire training. Fig. 3a illustrates
(§3.1). We then discuss our parallelization algorithm, explaining our SiP-OCS topology. SiP-OCS consists of Q optical switches, each
how it takes these factors into account to produce a suitable paral- with N ports (the same as the number of GPUs), where each GPU
lelization strategy for a given topology (§3.2). Finally, we discuss is connected to every OCS in a flat topology. Hence, in SiP-OCS,
SiP-ML’s control plane and wavelength allocation (§3.3). the degree D is equal to the number of switches (Q).
As an alternate, extreme design point, we also investigate the
3.1 Degree and Reconfiguration Latency possibility of removing the switching elements entirely and evalu-
Fig. 2 illustrates the differences between today’s ML training clus- ate the performance of a minimalistic, switch-free topology called
ters and SiP-ML. The state-of-the-art clusters have two bandwidth SiP-Ring. In contrast to SiP-OCS, SiP-Ring reconfigures wavelengths

659
SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA M. Khani et al.

within each port to achieve logically rich topologies. Reconfigura- To minimize per-op run-time, it is desirable to split ops into
tion is done using Micro-ring resonators (MRRs) [21] embedded smaller pieces of computation. There are many ways to split an
in SiP ports [22, 23]. MRRs act as spectral filters to select and for- op; for example, a 2D convolution can be split across height, width,
ward wavelengths, and they enable the reuse of wavelengths across and channel dimensions [38]. However, in splitting ops, we must
non-overlapping segments of the ring (Fig. 13a in the appendix take care not to compromise GPU utilization. GPUs (and other ML
illustrates an example). Our experiments show MRRs can switch accelerators) internally distribute an op over a massive number of
between different wavelengths within 25µs (§4.4). We discuss the cores. If we split an op too finely, it will not have enough compute
SiP-Ring design in more detail in Appendix A.1. intensity to utilize the cores effectively, and, therefore, we will
achieve no reduction in run-time from splitting. As a result, we
choose a minimum quantum of computation time, τ , and split ops
3.2 Degree-Aware Parallelization Strategy
to sub-ops of a size near τ . We also cap the maximum number of
A DNN can be viewed as a directed acyclic graph (DAG) of oper- partitions for each op at k (the MP degree), as there is no point in
ations (ops). To parallelize a DNN training job, we need to decide splitting beyond the maximum number of available parallel workers.
which GPU is responsible for running each op (or a part of each op). The result is a balanced computation graph whose vertices are the
As a simple example, to train a model with global batch size b using sub-ops, as shown in Fig. 4(b) for our running example.
DP on N GPUs, we break each op into N parallel sub-ops, each The right choice of the split dimension depends on the type of
operating on a local batch of size b/N (this is referred to as splitting the op and can impact the communication pattern between the
on the sample dimension [38]), and we map one sub-op to each sub-ops. For example, in the case of a 2D convolution on an image
GPU. In general, MP follows similar steps: first partition each op with multiple output channels, if we divide the op across the height
into parallel ops, then place the sub-ops. However, the partitioning and width dimensions of the input, none of the sub-ops needs to
and placement decisions are not as straightforward as in DP. know the entire input image. However, if we split the op across the
Our parallelization algorithm takes the following as input: (i) output channel dimension, every sub-op needs a copy of the input
a DNN computation graph, G in = (V , E), where V is the set of image, leading to a broadcast communication pattern with high
operations (nodes) and E is the set of data dependencies (edges) overhead. We select the most efficient dimension for each op. Since
between the operations; (ii) the global batch size denoted by b; we always split ops uniformly, sub-ops tend to communicate the
(iii) a parameter k denoting the number of GPUs to partition the same amount of data with their descendants (the edges between the
model using MP; (iv) a parameter l denoting the number of GPUs to sub-ops at each stage in Fig. 4(b) carry roughly the same amount
partition the data using DP; and (v) the physical degree constraint of traffic).
of the optical network topology, denoted by D. Our algorithm finds (ii) Placement. Next, we assign a GPU device to each op in the
a hybrid MP-DP strategy with k-way model parallelism and l-way balanced graph. Our placement aims to minimize the total run-
data parallelism for N = k ×l GPUs, such that the training iteration time while respecting the communication degree constraint D
time is minimized while satisfying the degree constraint (i.e., each required by the optical interconnect. Each GPU has two types of
GPU communicates with no more than D other GPUs). We assume communications: (i) it must communicate with some of the GPUs in
all GPUs are identical. its MP group (depending on the op placement); (ii) given the hybrid
The core of the algorithm determines an MP placement of the DP-MP strategy, there are l MP groups that need to synchronize
DNN computation on k GPUs. Specifically, we begin by splitting the their parameters through DP. Hence, each GPU must communicate
GPUs into l groups, with k GPUs per group, and we divide the global with its counterparts in the other l MP groups to perform an all-
batch equally between the groups (i.e., each group is responsible reduce operation to synchronize the model parameters across the
for a local batch of training data of size b/l). Then, we compute an DP partitions. We use the ring-allreduce [29, 30] algorithm for
MP placement across k devices. We replicate the same placement this step. This requires a ring communication pattern between
in each group to produce the final hybrid MP-DP strategy. Fig. 4 corresponding GPUs in the MP groups, which requires each GPU
illustrates the key steps in our parallelization algorithm across 8 to send data to one GPU in another group. Therefore a GPU can
GPUs, with k = 4-way MP, l = 2-way DP, and degree constraint communicate with, at most, ∆ = D - 1 other GPUs within its own
D=3. We use this as a running example in the remainder of this MP group to meet the overall degree constraint.
section. We now present a heuristic algorithm for placing ops within
(i) Partitioning. DNN training involves sequential stages of com- an MP group to minimize run-time with a constraint ∆ on the
putation, as dictated by the data dependencies in the computation degree of communication. While this problem can be written as
graph. For example, the graph in Fig. 4(a) has 4 sequential ops, an Integer Linear Problem (ILP), it is prohibitive to solve this ILP
shown as rectangles of different colors. The size of each rectangle given the scale of the balanced computation graph (e.g., over 20K
represents the computation time of the op. The key to minimizing sub-ops for the Transformer DNN model). Algorithm 1 provides
training time is to balance the computation load across devices the pseudocode.
at every stage of computation to maximize parallelism. Note that The key strategy in our algorithm is to map GPU devices into a
balancing per-stage computation is not the same as balancing the metric space and transform the degree constraint into a distance
total load on each device. Sequentially-dependent ops cannot run constraint in that space. We select an arbitrary ordering of GPU
in parallel, hence placing them on the same device has no impact devices and place ops to maintain a maximum communication
on run-time compared to placing them on different devices, even distance of ∆; i.e., devices i and j are allowed to communicate only if
though it increases the total load on the device. (i −j) mod k ≤ ∆. This constraint leads to a sparse diagonal traffic

660
SiP-ML: Optical Network Interconnects for Machine Learning SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA

GPU7 C2 D1
C0 GPU 3 C2 D1 GPU6 C1 D0
GPU5 A1 C0 C3
A0 C1 D0 GPU 2 C1 D0 A0 B0
GPU4
A B C D B0 GPU 1 A1 C0 C3 GPU3 C2 D1
A1 C2 D1 GPU2 C1 D0
GPU 0 A0 B0 GPU1 A1 C0 C3
C3 GPU0 A0 B0
Time Time
(a) Compute graph (b) Balanced compute graph (c) MP placement (d) Final DP-MP placement

Figure 4: An example of SiP-ML’s parallelization strategy with k = 4, l = 2, D=3, and ∆ = 2.

Algorithm 1 that meets a maximal set of range constraints. We then reallocate

Task Placement with a Communication Degree Constraint the remaining parents that violate the constraint into the nearest
1: Input: Balanced compute graph g_in, computation quantum τ , degree device that meets the distance constraint with x. As this may create
constraint ∆, local_batchsize b /l , mp_degree k distance violations between parents and grandparents of x, we
2: Output: A task graph g_out with placed ops continue this backward process until all previously placed ops meet
3: for op in g_in.topological_sort( ) do the distance constraint with their parents. We then restart a forward
4: for sub_op in par_ops_map[op] do pass from the first located op and verify the distance constraints
5: far_id←farthest sub_op’s predecessor device id between the placed ops and their children. If any violations have
6: near_id←nearest sub_op’s predecessor device id
occurred due to reallocation, we relocate the child op. This forward-
7: range_lo←near_id
backward procedure is repeated until all ops are placed. We leave
8: range_hi←far_id + ∆
9: sub_op.device←get_earliest_avail(avail_times, the convergence proof to future work.
range_lo, range_hi, sub_op.mem_size) Fig. 4(c) shows the MP placement for our running example, with
10: cand_start←latest end time of predecessors ∆ = 2. Notice two properties of this placement: (i) each GPU com-
11: start←max(cand_start, avail_time[sub_op.device]) municates with, at most, ∆ = 2 other GPUs, as required, and (ii)
12: end←start + sub_op.duration the sub-ops of each op are balanced well across the 4 GPUs. In fact,
13: avail_time[sub_op.device]←end the only op that is not perfectly balanced is C, but the 4 sub-ops
14: end for of this op cannot be placed on all 4 GPUs without violating the
15: end for communication degree constraint, because whichever GPU op B
16: g_out ← add_network_ops(g_out)
resides on would then need to communicate with the other 3 GPUs.
Putting it all together. Fig. 4(d) shows the final hybrid MP-DP
placement for our example. As mentioned earlier, it is created simply
matrix with zeros outside a ∆ distance from the main diagonal, by replicating the MP placement in the l = 2 GPU groups. As for
satisfying the communication degree constraint. the communication pattern, each GPU communicates with, at most,
The algorithm begins with a topological sort of the balanced ∆ = 2 other GPUs in its MP group and one more GPU for the
computation graph (shown in Fig. 4(b) for our example), such that ring topology required for the DP all-reduce step. For example, in
each sub-op appears in the sorted list after its dependencies. It Fig. 4(d), GPU 1 must communicate with GPUs 2 and 3 for MP and
places the sub-ops in this sorted order, guaranteeing that when a GPU 5 for DP. Our parallelization algorithm takes the degree of
sub-op is placed, all of its dependencies have already been placed. MP and DP (k and l) as input, but it is trivial to optimize over these
For each sub-op, the algorithm first computes a set of placement parameters to find the combination that minimizes training time
candidates. These are the devices where the sub-op can be placed for a given number of GPUs, as discussed in Appendix 4.2.
without violating the distance constraint mentioned above. We
compute the intersection of these ranges for all parents of x to
determine its placement candidates. Then, we select the earliest
3.3 Circuit Scheduling
available device among these candidates to place x, and we schedule Given that our SiP-OCS topology reconfigures its circuits only
the op on that device as soon as its dependencies have completed. If once at the beginning of the training job, its control plane logic is
there is a tie at this step, we select the GPU with the smallest index simple. In this case, the main task is to compute the total traffic
so that we can minimize the distance between communicating matrix resulting from the parallelization algorithm and then assign
GPUs.3 Notice that since we place the sub-ops in order of their circuits to each pair of GPUs that must communicate, such that
dependencies, keeping track of when each op can be scheduled on the maximum transfer time is minimized. We determine the circuit
each device is straightforward. If the intersection of the feasible assignment with a simple ILP run once for each training job (details
ranges for all parents of the sub-op x is empty, i.e., the maximum in §A.2).
distance between the parents is longer than ∆ − 1, we relocate the The control plane for the SiP-Ring topology is more challeng-
parent nodes into a smaller device range so that the placement ing, as circuits can be reconfigured during training. Hence, our
of x becomes feasible. For this purpose, we place x on the GPU controller needs to estimate the traffic and reschedule the circuits
periodically. Therefore, every GPU’s host needs to read its GPU
3 This property helps enable wavelength reuse in the ring topology (§A.1). transfer buffer counters through PCIe and communicate them to a

661
SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA M. Khani et al.

central controller. Using NVIDIA’s nvml API, we poll the NVLink using MRRs and our parallelization algorithm’s ability to adapt its
counters on a Tesla V100 GPU at a 300-microsecond granularity. strategy to the topology (e.g., ensuring most communication occurs
However, this API is designed for management purposes and is between nearby nodes on the ring).
not optimized for latency. We believe obtaining the counters at a (iii) A SiP-ML interconnect with per-GPU bandwidth B performs
sub-100-microsecond scale should be feasible with further engineer- as well as or better than an ideal, full-bisection electrical switch
ing. Our experiments confirm that the observed traffic matrix over with per-GPU bandwidth B/2. For instance, given 1024 GPUs and
the past 100µs is a good estimate of the communication demands B = 8 Tbps, SiP-ML’s dynamic topology provides at least 4 Tbps
over the next 100 µs. Using the traffic matrix, we can solve an ILP of bandwidth, on average, between each pair of GPUs that need to
(see §A.1) for optimal wavelength scheduling on the ring topol- communicate.
ogy. However, solving an ILP is too slow for short-timescale circuit (iv) When per-GPU bandwidth is high (e.g., order of terabits-per-
scheduling. Therefore, we propose a fast, approximate wavelength second), hybrid parallelism strategies outperform data parallelism
scheduling algorithm that solves a minimum-cost flow routing by up to 2× in terms of time-to-accuracy.
problem to schedule wavelengths. Appendix A.1 describes this al-
gorithm in detail. Note that while we currently propose to measure 4.1 Methodology & Setup
the traffic matrix for dynamic circuit establishment, exploiting the
To evaluate SiP-ML, we implement a detailed simulator, called
predictability of training workloads is a natural step which we leave
Rostam, to model several baseline network architectures connect-
for future work.
ing up to thousands of GPUs. Our simulator is ≈10K lines of code
Supporting Multiple Jobs. We anticipate a SiP-ML cluster will
in C++ and is available online at https://github.com/MLNetwork/
typically be used to run multiple jobs at the same time. Each job will
rostam.git. We discuss the details of our simulator in §4.2. In our
run on a subset of GPUs, dedicated to that job. Supporting multiple
evaluations, we set the quantum of computation for balancing the
jobs with SiP-OCS requires no changes to our design, except that
computation graphs, τ , to 10 µs (§3.2).
we allocate a subset of available GPUs when a job arrives and
Comparisons. We consider the following network architectures:
correspondingly set the total number of GPUs in our placement
algorithm. When a job completes, we release its GPUs and optical • Elect-Flat: an ideal electrical switch that scales to any number
circuits. SiP-Ring follows a similar logic, but we ideally prefer to of GPUs, N , for any per-GPU bandwidth of B; i.e., each GPU can
allocate each job to a contiguous block of neighboring GPUs on simultaneously communicate with N − 1 other GPUs with a total
the ring. Fragmentation of the ring space, as jobs arrive and depart, bandwidth of B in both send and receive directions. This baseline
could make this difficult to achieve at all times. One solution is to has zero reconfiguration delay. For any pair of (B, N ), no network
use a standard OCS to assign GPU interfaces to arbitrary locations can communicate faster than this baseline. In practice, it can be
on the ring. approximated with full-bisection bandwidth topologies such as fat-
Scalability Considerations. While our current version of SiP- tree for relatively small values of B (e.g., 100–400 Gbps), or with a
OCS assumes each OCS has enough ports to connect to every GPU small N (e.g., tens of nodes) with large B. Note that no electrical
in a flat topology, a more realistic setting is to use hierarchical network would be able to perform better than this flat electrical
Clos [80] or flat designs such as BCube [81] to scale SiP-OCS. Our baseline, as it provides full-bisection bandwidth.
SiP-Ring topology can be scaled using Theia [72] and SlimFly [82] • Elect-Cluster: a hierarchical electric network fabric represen-
to build hierarchical rings. Another way to scale SiP-Ring is to tative of today’s ML clusters interconnecting GPUs. Each server
consider 2D rings, where we have K horizontal rings, with N GPUs hosts eight GPUs, connected with an internal high-speed electrical
on each ring. We then connect every K GPUs from K different switch providing per-GPU bandwidth of B, typically in the order
horizontal rings on a single vertical ring. Hence, there will be K + N of terabits-per-second. The servers are connected with a slower
rings in total, connecting N K GPUs. Each GPU has direct access electrical fabric providing 400 Gbps bandwidth per server (unless
to one vertical and one horizontal ring and must divide its SiP otherwise stated). In practice, servers can be thought of as DGX [5]
interfaces between the two. Depending on the vertical bandwidth boxes with an internal NVSwitch [83] interconnect, communicating
requirement of the interconnect, this ratio can be adjusted. over a standard datacenter network fabric (e.g., fat-tree).
• SiP-Ring: a ring-based interconnect for SiP-ML, as described in
§3.1. Each GPU has W distinct wavelengths that it can dynamically
4 EVALUATION allocate to communicate with its 16 closest neighbors on the ring
In this section, we quantify the performance of SiP-ML by compar- (in both directions). We assume each wavelength carries 25 Gbps of
ing it to other network interconnects. Our results show: bandwidth, providing a maximum bandwidth of B = W ×25Gbps for
(i) For three representative DNN models (Transformer, ResNet, each GPU. Unlike SiP-OCS, this topology is rapidly reconfigurable,
and Megatron), SiP-ML speeds up training time by a factor of 1.3– with a reconfiguration latency of 25 µs (§4.4). We estimate the traffic
9.1× compared to hierarchical electrical network fabrics represen- every 100 µs as described in §3.3 unless stated otherwise.
tative of today’s ML clusters. This is because SiP-ML eliminates • SiP-OCS: an optical circuit switch interconnect for SiP-ML, as
bandwidth bottlenecks and enables hybrid DP/MP parallelization described in §3.1 with Q OCS switches, each with N ports (the
strategies that cannot be supported efficiently by today’s fabrics. same as the number of the GPUs). Each GPU has Q optical links
(ii) Although SiP-Ring’s switchless design constrains connectiv- (each with a bandwidth of B/Q), one to each OCS. Each GPU can
ity, it performs similarly to SiP-OCS. SiP-Ring’s limited connectivity communicate with, at most, D=Q other GPUs at the same time. To
is compensated by its ability to rapidly reschedule wavelengths study the impact of D, we vary the number of OCS switches in the

662
SiP-ML: Optical Network Interconnects for Machine Learning SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA

Elect-Cluster 200 Gbps Elect-Cluster 400 Gbps Elect-Flat (DP) Elect-Flat SiP-OCS SiP-Ring

10
Time-to-Acc. (mins)

9
8
7 101

6 104

4 100
103
27 28 29 210 211 212 213 27 28 29 210 211 212 213 27 28 29 210 211 212 213
BW per GPU (Gbps) BW per GPU (Gbps) BW per GPU (Gbps)

(a) ResNet50 (b) Transformer (c) Megatron

Figure 5: Impact of bandwidth B on the total training time (Time-to-Accuracy) for N=1024 GPUs. DP is not feasible for Megatron
because of its huge memory footprint.

interconnect, using a default value of 16. Since OCS reconfiguration Profiling. We first need to profile the average GPU and CPU com-
delay is too long compared to the typical training iteration time of pute time, peak memory size, and input/output data sizes of each
our DNN models (< 20ms), we compute the best one-shot circuit operation in the model in addition to its data dependencies. Each
schedule for each workload, as described in §3.3. To evaluate the compute operation typically has one or more input/output arrays
potential benefits of optical switches with fast reconfiguration [55, of data, “tensors”. Profiling the operations over different input/out-
71], we also evaluate the impact of lowering the reconfiguration put tensor shapes helps predict the speed ups of partitioning each
latency and allowing multiple reconfigurations within each training operation in different input/output tensor dimensions. We start
iteration.4 profiling over a fair range of batch sizes, typically starting with 1
Training workloads. We consider ResNet, Transformer, and Mega- sample/iteration and continuing until we run out of GPU memory.
tron, three representative DNN models widely used in computer The profiling step is independent from the simulator and can use
vision and natural language processing applications. ResNet [84] is any convenient profiling tool. Moreover, profiling along other than
an image classification model with 25 million parameters. Trans- the samples dimension (e.g., height and width in the 2D convolu-
former refers to a Universal Transformer with 350 million parame- tion) helps improve the simulation’s accuracy. In absence of the
ters. Megatron[52] is a variant of the GPT model [85] with 18 billion profiling data in any dimension, we assume a linear dependency
parameters. between the total number of splits and each split’s compute time in
We focus on time-to-accuracy as our primary metric. We de- that dimension. Depending on the dimension of the split, Rostam
termine the time-to-accuracy by multiplying the time for a single adds the required new data dependencies in the placement stage.
training iteration (obtained via our simulator) by the number of In addition to the operations profile, we need to know the required
training iterations required to reach the target accuracy. We use number of iterations to achieve a certain level of model accuracy
numbers reported in prior work for the required training iterations as a function of the global batch. This profile depends on the DNN
for these models at a given batch size. For ResNet and Transformer, model and the training dataset [46]. Rostam can combine the latter
Shallue et al. [86] report the number of training iterations across a two profiles in the placement stage to come up with the best hybrid
range of batch sizes. Hence for these models, we optimize over batch parallelization strategy. In this paper, we profile all models on an
size to find the lowest possible time-to-accuracy in each network NVIDIA Tesla V100 GPU with 32 GB of memory.
configuration. For Megatron, we use batch size 1024 and 240,000 Placement. Our approach to explore the space of hybrid paral-
training iterations, following [50, 87]. Note that we report the total lelism techniques takes as input: (1) the number of GPUs, (2) the
pre-training time for Megatron, which requires significantly more bandwidth available per GPU, (3) the graph profile for the DNN
training iterations than a typical fine-tuning task. But the relative model as described above, and (4) the curve providing the required
improvements we report would hold for fine-tuning the model since number of training iterations as a function of the (global) batch
we are directly decreasing the iteration time. size. We search through all possible hybrid parallelizations over a
ResNet and Transformer fit in a typical GPU’s memory. Hence range of global batch size configurations and use the placement
the main reason to parallelize them is to speed up training. Mega- algorithm (e.g., Algorithm 1 (§3)) for device placement. We then
tron, cannot fit on one GPU and therefore cannot be trained with estimate each configuration’s run-time based on the graph profile
only DP; MP is required to split it across multiple GPU memories. and the bottleneck bandwidth. To estimate the effect of the network,
we also compute the latency for each data transfer (edge) in the
graph profile according to the bottleneck bandwidth. We finally
4.2 Simulator
select the fastest of all these parallelization strategies.
The overall flow of an end-to-end simulation in Rostam is as fol- Two points are worth noting about this procedure. First, one of
lows. the strategies that our task parallalization considers is the conven-
tional DP. However, as our results show (see §4.3), in many cases,
4 In
the extreme, eliminating reconfiguration latency entirely would make SiP-OCS DP is not the best strategy for large-scale training. Second, the time
equivalent to the ideal Elect-Flat architecture.

663
SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA M. Khani et al.

Analysis. We first consider the Elect-Flat architecture. Recall that

Elect-Flat has ideal performance. At every value of B, it provides
Optimal Hybrid Degree

1, 020 1, 020
256 256
64 DP degree 64 each GPU with its full interface bandwidth regardless of the commu-
16 MP degree 16 nication pattern. Thus Elect-Flat’s training time serves as a lower
4 4 bound for any other network. Fig. 5 shows that increasing B on
1 1
100 1, 000 10, 000 100 1, 000 10, 000
Elect-Flat improves training time for all models, but the improve-
Bandwidth per GPU (Gbps) Bandwidth per GPU (Gbps)
ment is much larger for Transformer and Megatron than ResNet50.
ResNet50 is less sensitive to network bandwidth for two reasons.
(a) ResNet50 (b) Transformer
First, it is a smaller model than the others and therefore requires less
Figure 6: Optimal hybrid trade-off between the degree of MP bandwidth for all-reduce operations. Second, ResNet50 trains effec-
and DP at different per-node bandwidths for 1024 GPUs. tively with large batch sizes (via weak scaling), further reducing its
bandwidth requirements [86, 90–92].
computed for a configuration in this procedure is only an estimate;
Comparing DP with the best strategy found using Algorithm 1
in our actual simulations, a GPU’s bandwidth can vary over time
on Elect-Flat is also instructive. Consider Transformer: when B is
(e.g., due to circuit reconfiguration). Therefore, our simulator re-
less than 1 Tbps, our placement cannot beat DP. But as B increases
quires a runtime stage to track the effect of dynamic decisions on
to 8 Tbps, SiP-ML’s hybrid strategy outperforms DP by ≈50%.
ops scheduling more precisely.
Now let us turn to the Elect-Cluster architectures. For all three
Runtime. Our runtime simulator relies on three main compo-
models, the training time plateaus as we increase B, with Elect-
nents: GPUs, an interconnect, and an executive session. The session
Cluster (400 Gbps) outperforming Elect-Cluster (200 Gbps). Recall
launches the operations onto the GPUs as soon as their dependen-
that here, B is the local bandwidth between the GPUs within each
cies are met in the DNN graph. The interconnect can be electrical
server. The results show that scaling this local bandwidth can im-
or optical. Our current implementation includes SiP-Ring, SiP-OCS,
prove training time to an extent (by enabling some model paral-
electrical, and full-mesh interconnects.
lelism), but the slow server-to-server network eventually becomes
Rostam models a latency for each op launched onto the GPU and
a bottleneck and prevents further speedups.
a minimum completion time for ops that run on the GPU. Hence,
Compared to Elect-Cluster architectures, SiP-OCS and SiP-Ring
there is a lower-bound on how quickly we can run a compute graph
achieve 1.3–9.1× faster training time as we scale B. The benefits
that depends on its critical path length. We set the launch latency
are smallest for ResNet50 (which does not require very high com-
and the minimum completion to 1 microsecond in our experiments.
munication bandwidth) and most significant for Megatron. SiP-ML
Moreover, Rostam overlaps the communication and computation
architectures are less efficient than the ideal Elect-Flat (which can-
whenever possible.
not be realized in practice for large values of B and N ): to achieve the
same training time, SiP-ML architectures require up to 2× higher
4.3 Results bandwidth per GPU (B) (e.g., Transformer), with a smaller gap in
Fig. 5 compares the time-to-accuracy of our three DNN models many cases (e.g., Megatron). This difference reflects the constraints
with 1024 GPUs on different network architectures. We vary the imposed by optical circuit switching. Specifically, in our evalua-
bandwidth per GPU, B, between 128—8192 Gbps, and compare tions, we set the degree constraint for both SiP-OCS and SiP-Ring
Elect-Flat, Elect-Cluster with two values of inter-server bandwidth at D=16. SiP-OCS requires a one-shot reconfiguration, while SiP-
(200 Gbps or 400 Gbps), SiP-OCS, and SiP-Ring. For each value Ring imposes a traffic locality requirement on the communication
of B and each network architecture, we use Algorithm 1 (§3.2) to pattern. Despite these constraints, SiP-ML performs quite well, as
search for the best parallelization strategy, as described in §4.2. our placement algorithm adapts the parallelization strategy to suit
To compare the different architectures on an equal footing, we the degree requirement.
run Algorithm 1 for electrical networks by removing the degree SiP-OCS and SiP-Ring perform similarly overall. Each architec-
constraint. We then compare our results to the state-of-the-art ture has pluses and minuses. Unlike SiP-OCS, SiP-Ring has fast
results reported in MLPerf [88] and find that they are comparable reconfiguration, but it makes communication between more distant
or better (§A.3). For reference, we also show data parallel (DP) GPUs on the ring less efficient. Our results show that the impacts of
training on Elect-Flat (except for Megatron which cannot use basic these factors on overall performance effectively cancel each other
DP). out.
We also experiment with FlexFlow [38] as a state-of-the-art Parallelization strategies. Fig. 6 plots the degrees of DP and MP
placement algorithm. FlexFlow’s network model does not support for each value of B in SiP-OCS. The figure shows that as the per-
the degree constraints required by our optical interconnects. For node bandwidth increases on the x-axis, the optimal strategy uses
electrical interconnects, we run the FlexFlow code [89] for our more model parallelism to decrease the total training time. This
workloads, but the strategies it finds are very similar to DP. We is consistent with current practice: when the network is slow, DP
believe there are two reasons for this. First, the scales we consider is more efficient but on a fast network, combining MP and DP
(e.g., 1000 GPUs) are much larger than those in FlexFlow, making improves training time. For instance, the Transformer model shown
the search space for its Metropolis algorithm significantly larger. in Fig. 6b starts with 1024-way DP and 1-way MP, but at 10 Tbps
Second, FlexFlow’s implementation only searches for partitioning bandwidth per-GPU, the best training time is achieved with 16-way
strategies across the batch dimension (although the approach in [38] MP and 64-way DP.
is general).

664
SiP-ML: Optical Network Interconnects for Machine Learning SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA

15
2 Tbps, Reconfig. 4 Tbps, Reconfig.

Time-to-Acc. (mins)
2 Tbps, one-shot 4 Tbps, one-shot
10

0
101 102 103
Figure 7: Traffic matrices generated by SiP-ML for the Trans- Reconfiguration Delay ( µ sec)
former model on 1024 GPUs (displaying only the first 32 Figure 9: Impact of OCS reconfiguration delay on time-to-
GPUs for brevity). accuracy of Transformer in SiP-OCS for two per-GPU band-
2 Tbps
widths. The critical reconfiguration delay when choosing be-
Time-to-Acc. (mins)

15 4 Tbps tween reconfigurable and one-shot allocation depends on

8 Tbps the bandwidth.
10
Elect-Flat Elect-Cluster 400Gbps
5 SiP-OCS SiP-Ring
102

Time-to-Acc. (mins)
102
10 12 14 16
Num. of OCSs
101
Figure 8: Impact of number of OCSs in SiP-OCS on time-to- 101
accuracy of a hybrid training of Transformer with one-shot
configuration. The lines correspond to different per-GPU 100
bandwidth (B). Dashed horizontal lines of the same color 25 26 27 28 29 210 211 212 23 24 25 26 27 28 29 210
show performance achieved by Elect-Flat at the same band- Number of GPUs Number of GPUs
width. (a) ResNet50 (b) Transformer
Communication patterns. To better understand the communi- Figure 10: Overall performance of SiP-ML’s OCS and Ring
cation patterns produced by Algorithm 1, Fig. 7 shows the traffic topologies at different scales.
matrices for the Transformer model with MP degree k = 4, 8, 16,
corresponding to 2 Tbps, 6 Tbps, and 10 Tbps per-GPU bandwidth, OCS) in SiP-OCS can improve performance in two ways: (i) we
respectively. These traffic matrices have two main components: can increase the maximum permissible communication degree; or
(i) a set of identical k × k blocks, corresponding to the traffic be- (ii) for the same communication degree, we can allow a more fine-
tween the nodes in each MP group (brighter colors represent larger grained allocation of circuits (with less bandwidth per circuit). The
values); (ii) an off-diagonal component, corresponding to the DP latter enables SiP-ML to align circuit bandwidth to traffic demands
ring-all-reduce traffic used by each GPU to synchronize its param- more closely, resulting in less wasted bandwidth. Fig. 8 shows the
eters with its peers in other MP groups (holding the same part of time-to-accuracy vs. number of OCSs for a one-shot circuit con-
the model). Within the k × k blocks, the entries near the diagonal figuration of the Transformer model. Performance improves with
are larger (brighter), indicating the GPUs communicate more with more OCSs, but benefits are marginal beyond 12 OCSs. Also, un-
their immediate neighbors. This property helps when mapping the surprisingly, a larger bandwidth per GPU (B) reduces sensitivity
communication to SiP-Ring. The off-diagonal entries (DP traffic) to the number of OCSs; it has more headroom, thus masking the
are smaller than the largest entries for the MP traffic, but they are inefficiencies caused by fewer OCSs.
still significant. This is the downside of current hierarchical electri- Fig. 9 shows how future OCSs with faster reconfiguration time
cal fabrics, as shown in Fig. 5, the low server-to-server bandwidth could improve the total training time of a Transformer model. For
becomes a chokepoint. a reconfiguration delay of d, we use the traffic matrix of the past 5d
The traffic matrices also show how SiP-ML meets the degree seconds to reconfigure the circuit allocations. We maintain circuits
constraint. For example, in SiP-OCS, each GPU establishes circuits for 5d to amortize the reconfiguration delay overhead. As expected,
with members of its MP group and is also part of a ring with its reducing the reconfiguration delay always helps. However, note
peers in other MP groups. The resulting topology is effectively that for d > 300µs, a one-shot allocation outperforms a dynamic
the union of l = N /k identical direct-connect topologies and k reconfiguration. Once again, higher bandwidth per GPU masks
rings. The number of circuits to each destination is chosen based inefficiencies, and one-shot allocation performs as well as rapid
on the traffic intensity towards that destination, although finding dynamic reconfiguration.
the optimal circuit allocation is more subtle and requires solving Impact of scale. Fig. 10 compares the training time of Resnet50
an ILP (§3.3). and Transformer on different network architectures across different
Impact of number of OCSs and reconfiguration latency. In- scales, with B = 8 Tbps of bandwidth per GPU. As in Fig. 5, we
creasing the number of OCSs (or the total number of ports on each see that SiP-OCS and SiP-Ring are close to the performance of

665
SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA M. Khani et al.

Fabric Latency
#GPUs Rx Buffer
Resp. Local data

1 µsec 3 µsec 10 µsec 30 µsec 100 µsec Classifier MEM Compute Fiber I/O

λ1, λ2, λ3
Cache Core

ta MRRs 10mm
Controller

Req.
Create e da

MUX
mot
request Re 10mm
32 1× 0.99× 0.83× 0.73× 0.64× Sync.
Bias control board Micro Ring
Stratix V FPGA board Tx Buffer
128 2.11× 2.10× 1.52× 1.36× 1.29× Zoomed in Bias Voltages

Fiber ring
512 4.27× 4.04× 3.03× 2.49× 2.03× Wavelength
SiP Chip SiP
switch MRR1 MRR2 MRR3 Allocation
Table 1: Impact of interconnect latency on the scaling effi-
ciency. Training speed-ups are normalized by the speed-up λ2 /λ3 λ1 /λ3 λ1 /λ2 Traffic Matrix
Prediction

at 32 GPUs with 1µsec latency. GPU1 GPU2 GPU3

λ1 λ2 λ3

the ideal Elect-Flat at all scales, with SiP-Ring occasionally slightly

worse. With Elect-Cluster, the training time improves up to a certain (a) Photo (b) Logical schematic
scale, and then the benefits taper off as the low server-to-server
bandwidth becomes a bottleneck. Once again, ResNet scales quite Figure 11: SiP-ML’s testbed.
well with Elect-Cluster, in line with current practice [31]. But larger
models and those less amenable to large-batch training, such as to cause a state change in the MRRs, depending on the scheduling
Transformer, can benefit significantly from SiP-ML’s high per-GPU decision. We use commodity SFP+ transceivers connected to the
bandwidth at moderate-to-large system scales. high-speed serial transceiver port on the FPGA board to achieve
Impact of network latency. Network fabric latency can play an the conversion between electrical and optical domains. Our three
important role in scaling ML workloads at multi Tbps network input wavelengths are λ 1 =1546.92 nm, λ 2 =1554.94 nm, and λ 3 =
speeds. Table 1 shows the impact of different minimum interconnect 1556.55 nm. Our SiP optical chip consists of six MRRs (we use three
latencies on training performance. The results show the training of them as shown in Fig. 11b) to select and forward any of the wave-
speedup relative to an Elect-Flat network with 32 GPUs with B = lengths to the target emulated GPUs. To evaluate our prototype, we
10 Tbps, and 1 µs fabric latency. Latencies above ∼10µs degrade implement 2D convolutional computation workloads in Verilog to
performance. This suggests another potential advantage of optical perform data fetching, computing, and storing between emulated
networks over electrical switching fabrics, the latter can suffer GPU nodes. A GPU node can get access to the other GPU node’s
from variable latency due to the presence of buffers. To compare to memory and perform read/write operations, similar to how real
the best-case performance of the baselines, our simulations do not GPUs communicate today.
model buffering within electrical fabrics, as this depends on factors Example: programming the MRRs. We set the first configura-
such as the details of the transport protocols [93, 94]. tion such that GPU1 is connected to GPU2 ; this means MRR1 is
SiP-Ring reconfiguration delay. While Tbps SiP-enabled chiplets tuned to select and forward λ 2 to GPU1 , while MRR2 is tuned to
are just about to hit the market [8, 63, 95], their reconfiguration select and forward λ 1 to GPU2 . For simplicity of the configura-
latency has not been evaluated. To evaluate the reconfiguration tion logic, MRR3 is always tuned to λ 1 but is effectively in idle
latency of SiP-ML’s ring topology, we build a small-scale testbed mode, as the optical power of λ 1 has been dropped through MRR2 .
(details in §4.4). Our testbed includes a thermo-optic SiP chip which To change the state to Configuration2 where GPU1 is connected
has six micro-ring resonators (MRRs). To hit 10 Tbps bandwidth to GPU3 , MRR1 should be tuned to select and forward λ 3 , while
we must package 400 MRRs (each modulating light at 25 Gbps). As MRR2 should be detuned from λ 1 for the optical power of λ 1 to pass
a result, our testbed only supports 10 Gbps bandwidth. Rather than through MRR3 to GPU3 . Note that in this configuration, MRR3 , can
bandwidth, we focus on validating reconfigurability. Our measure- remain tuned to λ 1 .
ments show a reconfiguration delay of 25 µs (Fig. 12b and Fig. 12c Testbed limitations. Our use of commodity FPGAs and transceivers
in §4.4). is driven by pragmatic concerns. It allows us to implement work-
loads without needing separate modulation logic at the transmitter
4.4 Testbed or demodulation logic at the receiver. Packets are forwarded to the
To benchmark the switching time and throughput of a SiP-based SFP+ transceiver which modulates the light for us. However, this
architecture, we build a small-scale testbed. method has limitations as well. Implementing convolutional neural
Testbed setup. Fig. 11a shows a photograph of our experimental networks in an FPGA, rather than a GPU as would be the case in
testbed. We built a three-node prototype of SiP-ML using FPGA the actual system, introduces complex Verilog logic with overhead
development boards (to emulate GPUs), and a thermo-optic SiP chip on (de)serializing the remote memory access commands.
which has six micro-ring resonators (MRRs). Each MRR is tuned to To validate the feasibility of our optical design, we answer the
select one wavelength by receiving the appropriate bias signal from following four key questions. (i) What is the impact of using MRRs
the bias control board. We use Stratix V FPGAs to emulate the GPU to select/bypass wavelengths on throughput? (ii) How fast can we
training workflow, as no commercial GPU chip supports optical reconfigure the MRRs to dynamically tune to appropriate wave-
interfaces. Our FPGAs have 50 Mb embedded memory and 1152 MB lengths? (iii) What is the end-to-end switching time? (iv) What is
DDR3 memory. The FPGAs are programmed and configured as indi- the impact of our scheduling algorithm on throughput?
vidual compute nodes with their own local memory. The controller MRRs as select/bypass interfaces. We first examine the selec-
logic is implemented using one of the FPGAs. A digital-to-analog t/bypass functions of our MRR-based interfaces. A transceiver chan-
converter (DAC) provides the necessary bias signals to the SiP chip nel is instantiated on the FPGA and a SFP+ optical transceiver at

666
SiP-ML: Optical Network Interconnects for Machine Learning SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA

Normalized Throughput
M RR1 M RR2 Conf ig.1 Conf ig.2
4
20μs 8.4μs

Rx Signal (Volts)
1 Loopback 0.3 1

Frequency
Bypass 3
0.2
CDF

Select 2 0.8
0.5
0.1 GP U3 → GP U2
1 0.6 GP U2 → GP U3
0 0
9 9.1 9.2 9.3 0 10 15 20 25 0 200 400 600 800 1,000
0 50 100 150 200
Throughput (Gbps) Reconfirguration Time (µs) Slot length (µs)
Time (µs)

(a) Micro-ring select/bypass through- (b) Micro-ring reconfiguration time (c) End-to-end reconfiguration time (d) End-to-end throughput
put
Figure 12: Testbed benchmarks.

1546.92 nm is used to perform the throughput measurements for Putting it all together. We also measure the achieved throughput
select, bypass and loopback cases. As shown in Fig. 12a, the through- while changing the scheduling slot length between the two config-
put measurement of the select mode (the MRR tuned at 1546.92 nm) urations. We conduct five different case studies with slot lengths of
is the curve in black while the result for bypassing the MRR is in 64, 128, 256, 512 and 1000 µs and measure the ideal throughput. The
blue. The red curve is the baseline measurement where the optical curve in blue in Fig. 12d indicates the switching state from GPU3
transmitter is connected directly to the receiver channel without to GPU2 lasting the duration set by the experiment; the curve in
coupling the optical signal in/out the SiP chip. Our measurements red indicates the switching from GPU2 to GPU3 . As the plot shows,
show in all three cases, the throughput is 9 to 9.3 Gbps confirming the link can achieve above 90% of the ideal throughput, when the
the feasibility of the idea of using MRRs as select/bypass interfaces. scheduling slot length is 220 µs. This is because the end-to-end
MRR reconfiguration time. To measure the reconfiguration time reconfiguration takes only about 20 µs; hence, having a scheduling
of our MRRs, we place InGaAs PIN photodetectors after MRR1 and slot 10 times larger will result in near optimal throughput.
MRR2 in Fig. 11b and change the bias voltage from Config1 to
Config2 , where MRR1 and MRR2 are tuned into and out of reso- 5 DISCUSSION
nance with λ 1 . We switch light between the two photodetectors by Power budget and scalability. Optical power loss is a key mea-
applying different bias signals to the SiP chip every 125 µs. The pho- sure for any optical system. To estimate the D of our SiP-Ring
todetectors convert the received photocurrent into voltage. We use topology, we measure the loss of light in our testbed. Our experi-
an oscilloscope to measure real time light intensity and can there- ments indicate that the loss per MRR is negligible (0.125–0.025 dB
fore measure the reconfiguration speed. Fig. 12b shows the receive per MRR). However, coupling the light in and out of each node
signal at the photodetectors. In one case, the signal reaches stable creates 0.5 dB loss because each I/O interface has an input and
state in approximately 20 µs, and in another case, it takes only 8.4 µs. output coupler with loss. Overall, the total loss incurred by passing
This is because tuning the MRR into the chosen wavelength is faster through each node on SiP-Ring is 0.625–0.525 dB. Hence, assuming
than tuning out of that wavelength due to our use of the thermal a 10 dB power budget based on transmit power and receiver sen-
tuning effect. We conservatively, consider 25 µs as the switching sitivity [96], SiP-Ring can send light to 16 back-to-back neighbors
time in our simulations. This experiment micro-benchmarks the without requiring amplification. At first blush, it appears infeasible
micro-ring reconfiguration time; additional time might be required to scale SiP-Ring, as building a cluster with more than 16 nodes
for transceivers to start decoding bits. This additional time is not needs amplifiers which add non-linear noise to the system. How-
fundamental, and next we show how we measured the end-to-end ever, SiP-Ring can capture path length limitations in its placement
reconfiguration time between FPGAs. algorithm. For instance, the path length in our evaluations is lim-
End-to-end reconfiguration time. The end-to-end reconfigura- ited to 16 nodes (Appendix A.1). This is because the placement
tion time includes the MRRs’ reconfiguration time, the transceivers’ algorithm is able to place GPUs locally close to each other such
locking time, and the handshaking time between newly connected that every GPU only interacts with, at most, a GPU that is 15 nodes
nodes. The distribution of end-to-end switching time between away (i.e., the node degree is 16). As a result, SiP-Ring’s design can
Config1 and Config2 is shown in Fig. 12c. We perform 300 mea- take path length into account to scale to large numbers of nodes.
surements to obtain the distribution, showing that the average Cost of SiP-ML. The entire field of silicon photonics is based on
switching time to Config1 is 13 µs and Config2 is 15 µs. Indeed, it the concept that the fundamental way to reduce the cost of photonic
is reasonable that the fastest end-to-end reconfiguration time may devices is to leverage the high volume manufacturing capabilities
be less than the micro-ring reconfiguration time, as the receiver of the silicon electronics industry. As a result, it is impossible to
at the FPGA receives enough optical power to start the synchro- provide an accurate cost estimation for SiP-ML. Prior work has
nization process before stabilization of the light output power. As built TeraPHY SiP interfaces with size 8.86 mm × 5.5 mm [20, Slide
described above, the micro-ring reconfiguration times for tuning 41]. This area contains optical transmit, receive, and MRRs. The
and detuning are not equal, leading to two distinct distributions. cost of manufacturing this SiP interface is $44,082 for a volume of
The additional variations in the distribution of the reconfiguration 20 chips ($4,408/chip) based on 2020 Europractice pricelist [97].5
time are a consequence of the time required for the transceiver to
lock onto the new signal and carry out the handshaking protocol. 5 Europractice is an EC initiative that provides the industry and academia with a
platform to develop smart integrated systems, ranging from advanced prototype
design to volume production. The cost is listed as AC80,000 on page 10 under imec
Si-Photonics iSiPP50G; the volume is listed as 20 samples on page 6 under iSiPP50G.

667
SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA M. Khani et al.

Hence, assuming the cost will drop by a factor of 10 at mass produc- two main reasons for the lack of adoption of all-optical datacen-
tion, our current cost estimation for each SiP interface in SiP-ML is ters so far. In contrast, this paper builds an all-optical interconnect
≈$440. We further estimate the cost of on-chip electrical circuitry with a simple and practical task placement algorithm primarily
(drivers, MRR’s tuning control logic, and CMOS transimpedance used to accelerate ML workloads. Our ring topology (SiP-Ring) is
amplification) to be ≈$300. This estimate is based on Europractice inspired by Quartz [70], Mordia [71], and Megaswitch [26]. They
pricelist for a 10 mm2 chip area [14, 19, 98, 99].6 Another approach all use a fiber ring to interconnect the datacenter topology, but
to observe the potential cost effectiveness of SiP solutions is to they do not leverage MRRs. Moreover, Mordia realizes a microsec-
look at it from the standpoint of pluggable transceivers and active ond switching circuit switch, but it does not reuse wavelengths,
copper cables. Today’s SiP-based pluggable optics at 100 Gbps cost and this significantly reduces its bandwidth efficiency compared
roughly $1/Gbps (SiP PSM4 and CWDM4). In comparison, a non to SiP-Ring. As a result, Mordia’s number of ports is limited by
SiP-based SR-4 pluggable transceiver is around $3/Gbps (multimode the number of wavelengths. Jellyfish [137], Rotornet [66], and
and VCSEL based). Similarly, a 400 Gbps SR8 is $3/Gbps, while a Opera [69] take advantage of the unpredictability of datacenter
SiP based 400 Gbps DR4 and FR4 is projected to be $1/Gbps. We workloads and use expander-based topologies to improve the com-
note that there is a large distinction between the cost of commodity pletion time of short and long flows. Random permutations are
DWDM transponders used in wide-area networks and SiP-ML’s not ideal for ML workloads, as a training workload is a periodic
SiP interfaces. In particular, DWDM transponders are designed repetition of thousands of iterations. Shoal [135], Larry [138], XFab-
to operate at long distances; this imposes strict challenges on the ric [139], and Sirius [55] have proposed reconfigurable datacenter
laser, manufacturing, forward-error correction, photodiode sen- interconnects with nanosecond switching fabric. We believe these
sitivity, modulation scheme, and light coupling. In contrast, SiP proposals have the potential to change the game in datacenter en-
interfaces are designed for short distances and do not require coher- vironments, but they are not commercially available yet and they
ent detection; hence, they can take advantage of the development do not support Tbps bandwidth between communicating nodes.
and commercialization of photonics components for short distance Moreover, our results show µs reconfiguration latency is close to
datacenters. optimal for ML; a control plane with nanosecond response time
might be needed for a general purpose datacenter traffic, but it is
6 RELATED WORK an overkill for distributed ML training. Finally, there is rich body of
Our work builds on two lines of related work. research on silicon photonics [17, 140–142], embedding silicon pho-
Software/hardware systems for distributed ML. Many soft- tonics switches in High Performance Computing clusters [143] and
ware platforms and techniques have focused on enabling large- energy-efficient datacenters [144]. By focusing on ML, our work
scale distributed machine learning in recent years [100–105]. In takes an application-level perspective to build an interconnect with
particular several papers focus on enabling large-scale data par- SiP components.
allel training [45, 100–104, 106]. Relevant to this paper, several
aim to reduce communication overhead using techniques such
as compression [107–110], asynchronous updates [28, 111–114], 7 CONCLUSION
partially-exchanged gradients [115], and smart parameter propa- In this paper, we propose optical network interconnects for dis-
gation [2, 45, 116–119]. In addition, a variety of algorithmic ap- tributed ML training clusters capable of providing multiple terabits-
proaches have been developed to accelerate communication among per-second of bandwidth per GPU. Our results show that the pre-
devices customized for the underlying network [120], or to improve dictability of ML workloads makes them a great fit for optical inter-
model parallel training using smart task device placement [121, 122], connects. We develop a new task partitioning and placement algo-
and more efficient pipelining strategies [4, 123]. There is also a rithm that exploits the degree requirement of optical networks to
significant body of work on new electrical hardware designs to ac- find a parallelization strategy suitable for a given network topology.
celerate machine learning computations [118, 124–129]. The work We show this approach can mitigate and in fact largely overcome
proposed here is orthogonal to the above mentioned techniques, as concerns such as limited communication degree and reconfigura-
they can still be applied to further improve both data and model bility of optical circuit-switched networks. Simulations using three
parallel training. Our work differs in that we investigate the sys- real DNN models show that, compared to today’s electrical network
tem requirements of using SiP as a new underlying technology to fabrics with limited server-to-server bandwidth, SiP-ML improves
interconnect hundreds of GPUs in an all-optical architecture. the training time by 1.3–9.1× at scale.
Datacenter Interconnects. The broad vision for this paper is to
use all-optical interconnects for future distributed ML systems.
Optical interconnects have a long and rich history in the data- 8 ACKNOWLEDGMENTS
center research community [24–26, 55, 66, 70, 71, 130–135]. Prior We would like to thank our shepherd Hitesh Ballani and anony-
work shows the benefits of reconfigurable topologies in datacenter mous reviewers for their feedback. This work was partly supported
networks by adding optical links to the electrical topology [24, by AEPA-E ENLITENED PINE, DARPA FastNICs, DARPA PIPES,
66, 71, 133, 136] or by creating all-optical datacenter intercon- a Cisco Research Center Award, NSF ASCENT-2023468, NSF CNS-
nects [26, 55, 70, 131, 132]. The unpredictability of legacy datacenter 2008624, NSF CNS-1751009, NSF CNS-2006827, NSF CNS-1563826
workloads and the complexity of managing hybrid topologies are as well as by a SystemsThatLearn@CSAIL Ignite Grant and a Ma-
chineLearningApplications@CSAIL Award.
6 Page 6 under GLOBALFOUNDRIES 22 nm FDSOI lists A
C14,000/mm2 for 50 samples.

668
SiP-ML: Optical Network Interconnects for Machine Learning SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA

REFERENCES microring resonators. Laser & Photonics Reviews, 6(1):47–73, 2012. https://
[1] AI and Compute. https://openai . com/blog/ai-and-compute/. onlinelibrary . wiley . com/doi/abs/10 . 1002/lpor . 201100017.
[2] Minsik Cho, Ulrich Finkler, David Kung, and Hillery Hunter. Blueconnect: [23] Q. Cheng, M. Bahadori, Y. Hung, Y. Huang, N. Abrams, and K. Bergman. Scalable
Decomposing all-reduce for deep learning on heterogeneous network hierarchy. microring-based silicon clos switch fabric with switch-and-select stages. IEEE
In SysML Conference, 2019. Journal of Selected Topics in Quantum Electronics, 25(5):1–11, Sep. 2019.
[3] Siddharth Das. CNN Architectures, 2017. [24] Nathan Farrington, George Porter, Sivasankar Radhakrishnan, Hamid Hajab-
[4] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. dolali Bazzaz, Vikram Subramanya, Yeshaiahu Fainman, George Papen, and
Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. Pipedream: Amin Vahdat. Helios: A hybrid electrical/optical switch architecture for modular
Generalized pipeline parallelism for dnn training. In Proceedings of the 27th data centers. SIGCOMM’10, pages 339–350.
ACM Symposium on Operating Systems Principles, SOSP âĂŹ19, page 1âĂŞ15, [25] Guohui Wang, David G. Andersen, Michael Kaminsky, Konstantina Papagian-
New York, NY, USA, 2019. Association for Computing Machinery. naki, T.S. Eugene Ng, Michael Kozuch, and Michael Ryan. c-Through: Part-time
[5] NVIDIA DGX A100. https://www . nvidia . com/en-us/data-center/dgx-a100/. optics in data centers. SIGCOMM’10, pages 327–338.
[6] NVIDIA Selene Cluster. https://blogs . nvidia . com/blog/2020/12/18/nvidia- [26] Li Chen, Kai Chen, Zhonghua Zhu, Minlan Yu, George Porter, Chunming Qiao,
selene-busy/. and Shan Zhong. Enabling wide-spread communications on optical fabric with
[7] S S Vazhkudai, B R de Supinski, A S Bland, A Geist, J Sexton, J Kahle, C J Zimmer, megaswitch. In 14th USENIX Symposium on Networked Systems Design and Im-
S Atchley, S H Oral, D E Maxwell, V G Vergara Larrea, A Bertsch, R Goldstone, plementation (NSDI 17), pages 577–593, Boston, MA, 2017. USENIX Association.
W Joubert, C Chambreau, D Appelhans, R Blackmore, B Casses, G Chochia, [27] Pengtao Xie, Jin Kyu Kim, Yi Zhou, Qirong Ho, Abhimanu Kumar, Yaoliang
G Davison, M A Ezell, E Gonsiorowski, L Grinberg, B Hanson, B Hartner, I Karlin, Yu, and Eric Xing. Lighter-communication distributed machine learning via
M L Leininger, D Leverman, C Marroquin, A Moody, M Ohmacht, R Panka- sufficient factor broadcasting. In Proceedings of the Thirty-Second Conference on
jakshan, F Pizzano, J H Rogers, B Rosenburg, D Schmidt, M Shankar, F Wang, Uncertainty in Artificial Intelligence, pages 795–804, Arlington, Virginia, USA,
P Watson, B Walkup, L D Weems, and J Yin. The design, deployment, and 2016. AUAI Press.
evaluation of the coral pre-exascale systems. 7 2018. [28] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja
[8] Valerie Coffey. DARPA PIPES Program demonstrates 2 Tbit/s optical intercon- Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed
nects at the chip level, July 2020. https://www . laserfocusworld . com/fiber- machine learning with the parameter server. OSDI’14, pages 583–598. USENIX
optics/article/14176186/darpa-pipes-program-demonstrates-2-tbits-optical- Association, 2014.
interconnects-at-the-chip-level. [29] Rajeev Thakur, Rolf Rabenseifner, and William Gropp. Optimization of collec-
[9] Mark Wade. Optical i/o chiplets eliminate bottlenecks to unleash innovation, tive communication operations in mpich. Int. J. High Perform. Comput. Appl.,
2020. https://ayarlabs . com/ayar-labs-solving-critical-computing-challenges- 19(1):49–66, February 2005.
through-optical-i-o/. [30] Baidu, 2017. https://github . com/baidu-research/baidu-allreduce.
[10] Yutaka Urino, Takahiro Nakamura, and Yasuhiko Arakawa. Silicon Optical [31] Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu
Interposers for High-Density Optical Interconnects, pages 1–39. Springer Berlin Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, Tiegang Chen,
Heidelberg, Berlin, Heidelberg, 2016. Guangxiao Hu, Shaohuai Shi, and Xiaowen Chu. Highly scalable deep learning
[11] D. Kim, K. Y. Au, H. Y. L. X. Luo, Y. L. Ye, S. Bhattacharya, and G. Q. Lo. 2.5d silicon training system with mixed-precision: Training imagenet in four minutes. CoRR,
optical interposer for 400 gbps electronic-photonic integrated circuit platform abs/1807.11205, 2018.
packaging. In 2017 IEEE 19th Electronics Packaging Technology Conference (EPTC), [32] J. R. Quinlan. Induction of decision trees. Mach. Learn., 1(1):81–106, March
pages 1–4, Dec 2017. 1986.
[12] E. R. H. Fuchs, R. E. Kirchain, and S. Liu. The future of silicon photonics: Not [33] Seunghak Lee, Jin Kyu Kim, Xun Zheng, Qirong Ho, Garth A Gibson, and Eric P
so fast? insights from 100g ethernet lan transceivers. Journal of Lightwave Xing. On model parallelization and scheduling strategies for distributed machine
Technology, 29(15):2319–2326, Aug 2011. learning. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q.
[13] David Thomson, Aaron Zilkie, John E Bowers, Tin Komljenovic, Graham T Reed, Weinberger, editors, Advances in Neural Information Processing Systems 27, pages
Laurent Vivien, Delphine Marris-Morini, Eric Cassan, Leopold Virot, Jean-Marc 2834–2842. Curran Associates, Inc., 2014.
Fedeli, Jean-Michel Hartmann, Jens H Schmid, Dan-Xia Xu, Frederic Boeuf, [34] Zhihao Jia, Sina Lin, Charles R. Qi, and Alex Aiken. Exploring hidden dimensions
Peter O’Brien, Goran Z Mashanovich, and M Nedeljkovic. Roadmap on silicon in accelerating convolutional neural networks. volume 80 of Proceedings of
photonics. Journal of Optics, 18(7):073003, 2016. Machine Learning Research, pages 2274–2283, StockholmsmÃďssan, Stockholm
[14] M. Wade, M. Davenport, M. De Cea Falco, P. Bhargava, J. Fini, D. Van Orden, Sweden, 10–15 Jul 2018. PMLR.
R. Meade, E. Yeung, R. Ram, M. Popovic, V. Stojanovic, and C. Sun. A bandwidth- [35] Tal BenNun and Torsten Hoefler. Demystifying parallel and distributed deep
dense, low power electronic-photonic platform and architecture for multi-tbps learning: An in-depth concurrency analysis. CoRR, abs/1802.09941, 2018.
optical i/o. pages 1–3, Sep. 2018. [36] L. Song, F. Chen, Y. Zhuo, X. Qian, H. Li, and Y. Chen. Accpar: Tensor partition-
[15] N. Ophir, C. Mineo, D. Mountain, and K. Bergman. Silicon photonic microring ing for heterogeneous deep learning accelerators. In 2020 IEEE International
links for high-bandwidth-density, low-power chip i/o. IEEE Micro, 33(1):54–67, Symposium on High Performance Computer Architecture (HPCA), pages 342–355,
Jan 2013. 2020.
[16] G.T. Reed and A.P. Knights. Silicon Photonics: An Introduction. Wiley, 2004. [37] Nikoli Dryden, Naoya Maruyama, Tim Moon, Tom Benson, Marc Snir, and
[17] Qixiang Cheng, Meisam Bahadori, Madeleine Glick, Sebastien Rumley, and Brian Van Essen. Channel and filter parallelism for large-scale cnn training.
Keren Bergman. Recent advances in optical technologies for data centers: a In Proceedings of the International Conference for High Performance Computing,
review. Optica, 5(11):1354–1370, Nov 2018. Networking, Storage and Analysis, SC’19, New York, NY, USA, 2019. Association
[18] Madeleine Glick, Lionel C. Kimmerling, and Robert C. Pfahl. A roadmap for for Computing Machinery.
integrated photonics. Opt. Photon. News, 29(3):36–41, Mar 2018. [38] Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond data and model parallelism
[19] Amir H. Atabaki, Sajjad Moazeni, Fabio Pavanello, Hayk Gevorgyan, Jelena for deep neural networks. SysML, 2019.
Notaros, Luca Alloatti, Mark T. Wade, Chen Sun, Seth A. Kruger, Huaiyu Meng, [39] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao,
Kenaish Al Qubaisi, Imbert Wang, Bohan Zhang, Anatol Khilo, Christopher V. Marc aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, and
Baiocco, Milovs A. Popovic, Vladimir M. Stojanovic, and Rajeev J. Ram. Integrat- Andrew Y. Ng. Large scale distributed deep networks. In F. Pereira, C. J. C.
ing photonics with silicon nanoelectronics for the next generation of systems Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information
on a chip. Nature, 556(7701):349–354, 2018. Processing Systems 25, pages 1223–1231. Curran Associates, Inc., 2012.
[20] Mark Wade, Erik Anderson, Shahab Ardalan, Pavan Bhargava, Sidney Buch- [40] Amir Gholami, Ariful Azad, Kurt Keutzer, and Aydin Buluç. Integrated model
binder, Michael Davenport, John Fini, Anatoly Khilo, Chandru Ramamurthy and data parallelism in training neural networks. CoRR, abs/1712.04432, 2017.
Roy Meade, Michael Rust, Vladimir Stojanovic Forrest Sedgwick, Derek Van [41] Ravichandra Addanki, Shaileshh Bojja Venkatakrishnan, Shreyan Gupta, Hongzi
Orden, Chong Zhang Edward Wang, Chen Sun, Sergey Shumarayev, Conor Mao, and Mohammad Alizadeh. Learning generalizable device placement al-
O’Keeffe, Tim T. Hoang, David Kehlet, Ravi V. Mahajan, Allen Chan, and Tina gorithms for distributed machine learning. In Advances in Neural Information
Tran. TeraPHY: A Chiplet Technology for Low-Power, High-Bandwidth Optical Processing Systems 32, pages 3983–3993. Curran Associates, Inc., 2019.
I/O. HotChips, pages i–xlviii, August 2019. https://www . hotchips . org/hc31/ [42] Shar Narasimhan. NVIDIA Clocks World’s Fastest BERT Training Time and
HC312 . 9A yarLabs2 0190820H CF INAL . pdf. Largest Transformer Based Model, Paving Path For Advanced Conversational
[21] Valentina Donzella, Ahmed Sherwali, Jonas Flueckiger, Samantha M. Grist, AI, Aug. 2019. https://devblogs . nvidia . com/training-bert-with-gpus/.
Sahba Talebi Fard, and Lukas Chrostowski. Design and fabrication of soi micro- [43] Nikoli Dryden, Naoya Maruyama, Tom Benson, Tim Moon, Marc Snir, and
ring resonators based on sub-wavelength grating waveguides. Opt. Express, Brian Van Essen. Improving strong-scaling of cnn training by exploiting finer-
23(4):4791–4803, Feb 2015. grained parallelism, 2019.
[22] W. Bogaerts, P. De Heyn, T. Van Vaerenbergh, K. De Vos, S. Kumar Selvaraja, [44] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better:
T. Claes, P. Dumon, P. Bienstman, D. Van Thourhout, and R. Baets. Silicon Closing the generalization gap in large batch training of neural networks. In

669
SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA M. Khani et al.

Proceedings of the 31st International Conference on Neural Information Processing [71] George Porter, Richard Strong, Nathan Farrington, Alex Forencich, Pang Chen-
Systems, NIPS’17, pages 1729–1739, Red Hook, NY, USA, 2017. Curran Associates Sun, Tajana Rosing, Yeshaiahu Fainman, George Papen, and Amin Vahdat.
Inc. Integrating microsecond circuit switching into the data center. SIGCOMM’13,
[45] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, pages 447–458.
Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large [72] meg walraed sullivan, Jitu Padhye, and Dave Maltz. Theia: Simple and cheap
minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017. networking for ultra-dense data centers. In HotNets-XIII Proceedings of the 13th
[46] Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, ACM Workshop on Hot Topics in Networks. ACM, October 2014.
Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on [73] Paolo Costa, Austin Donnelly, Greg O’Shea, and Antony Rowstron. Camcubeos:
neural network training. Journal of Machine Learning Research, 20(112):1–49, A key-based network stack for 3d torus cluster topologies. In Proceedings of
2019. the 22nd International Symposium on High-Performance Parallel and Distributed
[47] Yosuke Oyama, Naoya Maruyama, Nikoli Dryden, Erin McCarthy, Peter Har- Computing, HPDC ’13, pages 73–84, New York, NY, USA, 2013. Association for
rington, Jan Balewski, Satoshi Matsuoka, Peter Nugent, and Brian Van Essen. Computing Machinery.
The case for strong scaling in deep learning: Training large 3d cnns with hybrid [74] Hussam Abu-Libdeh, Paolo Costa, Antony Rowstron, Greg O’Shea, and Austin
parallelism. IEEE Transactions on Parallel and Distributed Systems, 2020. Donnelly. Symbiotic routing in future data centers. In Proceedings of the ACM
[48] MLPerf v0.6: NVIDIA Implementation of Attention Mechanisms for Translation, SIGCOMM 2010 Conference, SIGCOMM ’10, page 51?62, New York, NY, USA,
Aug. 2019. https://github . com/mlperf/trainingr esultsv 0 . 6/tree/master/NVIDIA/ 2010. Association for Computing Machinery.
benchmarks/transformer/implementations/pytorch. [75] J. M. Kumar and L. M. Patnaik. Extended hypercube: a hierarchical intercon-
[49] ResNet v1.5 for TensorFlow, 2020. nection network of hypercubes. IEEE Transactions on Parallel and Distributed
[50] NVIDIA Data Center Deep Learning Product Performance. https:// Systems, 3(1):45–57, 1992.
developer . nvidia . com/deep-learning-performance-training-inference. [76] John Kim, Wiliam J. Dally, Steve Scott, and Dennis Abts. Technology-
[51] Nvidia DGX-2. https://www . nvidia . com/content/dam/en-zz/Solutions/Data- driven, highly-scalable dragonfly topology. SIGARCH Comput. Archit. News,
Center/dgx-2/dgx-2-print-datasheet-738070-nvidia-a4-web-uk . pdf. 36(3):77âĂŞ88, June 2008.
[52] MegatronLM: Training Billion+ Parameter Language Models Using GPU Model [77] Min Yee Teh, Jeremiah J. Wilke, Keren Bergman, and Sébastien Rumley. Design
Parallelism, Jul. 2019. https://nv-adlr . github . io/MegatronLM. space exploration of the dragonfly topology. In Julian M. Kunkel, Rio Yokota,
[53] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Michela Taufer, and John Shalf, editors, High Performance Computing, pages
Memory optimizations toward training trillion parameter models, 2019. https: 57–74, Cham, 2017. Springer International Publishing.
//www . deepspeed . ai/. [78] J. Kim, W. J. Dally, S. Scott, and D. Abts. Technology-driven, highly-scalable
[54] Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Matthew Denton, and dragonfly topology. In 2008 International Symposium on Computer Architecture,
Tushar Krishna. Efficient communication acceleration for next-gen scale-up pages 77–88, 2008.
deep learning training platforms, 2020. [79] Calient Optical Circuit Switch. https://www . calient . net/products/edge640-
[55] Hitesh Ballani, Paolo Costa, Raphael Behrendt, Daniel Cletheroe, Istvan Haller, optical-circuit-switch/.
Krzysztof Jozwik, Fotini Karinou, Sophie Lange, Kai Shi, Benn Thomsen, and [80] Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. A scalable,
Hugh Williams. Sirius: A Flat Datacenter Network with Nanosecond Optical commodity data center network architecture. SIGCOMM Comput. Commun.
Switching. SIGCOMM’20, Aug. 2020. Rev., 38(4):63–74, August 2008.
[56] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger. Dark [81] Chuanxiong Guo, Guohan Lu, Dan Li, Haitao Wu, Xuan Zhang, Yunfeng Shi,
silicon and the end of multicore scaling. In 2011 38th Annual International Chen Tian, Yongguang Zhang, and Songwu Lu. Bcube: A high performance,
Symposium on Computer Architecture (ISCA), pages 365–376, June 2011. server-centric network architecture for modular data centers. In Proceedings of
[57] R. Colwell. The chip design game at the end of moore’s law. In 2013 IEEE Hot the ACM SIGCOMM 2009 Conference on Data Communication, SIGCOMM ’09,
Chips 25 Symposium (HCS), pages 1–16, Aug 2013. page 63?74, New York, NY, USA, 2009. Association for Computing Machinery.
[58] H. J. S. Dorren, E. H. M. Wittebol, R. de Kluijver, G. Guelbenzu de Villota, P. Duan, [82] M. Besta and T. Hoefler. Slim fly: A cost effective low-diameter network topol-
and O. Raz. Challenges for optically enabled high-radix switches for data center ogy. In SC ’14: Proceedings of the International Conference for High Performance
networks. Journal of Lightwave Technology, 33(5):1117–1125, March 2015. Computing, Networking, Storage and Analysis, pages 348–359, Nov 2014.
[59] Alexis BjÃűrlin and Manish Mehta. Broadcom discusses its co-packaged optics [83] Alexander Ishii, Denis Foley, Eric Anderson, Bill Dally, Glenn Dearth Larry
plans. http://www . gazettabyte . com/home/2021/4/27/broadcom-discusses-its- Dennison, Mark Hummel, and John Schafer. NVIDIA’s NVLink-Switching Chip
co-packaged-optics-plans . html, 2021. [Online; last accessed 25-June-2021]. and Scale-Up GPU-Compute Server. HotChips, 2018. https://www . hotchips . org/
[60] Steven Leibson. Ayar labs and Intel demo FPGA with optical transceivers hc30/2conf/2 . 01N vidiaN VswitchH otChips2018D GX2NVSF inal . pdf.
in DARPA PIPES project: 2 Tbps now, >100 Tbps is the goal, Mar. [84] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition.
2020. https://blogs . intel . com/psg/ayar-labs-and-intel-demo-fpga-with-optical- In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
transceivers-in-darpa-pipes-project-2-tbps-now-100-tbps-is-the-goal/. pages 770–778, June 2016.
[61] Pipes researchers demonstrate optical interconnects to improve performance of [85] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving
digital microelectronics, Mar. 2020. https://www . darpa . mil/news-events/2020- Language Understanding by Generative Pre-Training.
03-25. [86] Christopher J. Shallue, Jaehoon Lee, Joseph M. Antognini, Jascha Sohl-Dickstein,
[62] Tiffany Trader. Ayar Labs to Demo Photonics Chiplet in FPGA Package at Hot Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on
Chips, Aug. 2019. https://www . hpcwire . com/2019/08/19/ayar-labs-to-demo- neural network training. CoRR, abs/1811.03600, 2018.
photonics-chiplet-in-fpga-package-at-hot-chips/. [87] Raul Puri. Megatron: a large, powerful transformer, Aug. 2019. https://
[63] F. Douglis, S. Robertson, E. Van den Berg, J. Micallef, M. Pucci, A. Aiken, M. Hat- github . com/NVIDIA/Megatron-LM.
tink, M. Seok, and K. Bergman. Fleet—fast lanes for expedited execution at 10 [88] MLPerf: A broad ML benchmark suite. https://mlperf . org/.
terabits: Program overview. IEEE Internet Computing, (01):1–1, apr 5555. [89] FlexFlow Github. https://github . com/flexflow/FlexFlow . git.
[64] Ayar Labs TeraPHY Silicon Chip. https://ayarlabs . com/products/. [90] Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd:
[65] Demonstration of Ayar Labs’ Optical I/O Multi-Chip Package and Single-Die Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325,
Package solutions, Aug. 2020. https://vimeo . com/449164007. 2017.
[66] William M. Mellette, Rob McGuinness, Arjun Roy, Alex Forencich, George Papen, [91] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. Ima-
Alex C. Snoeren, and George Porter. Rotornet: A scalable, low-complexity, genet training in minutes. In Proceedings of the 47th International Conference on
optical datacenter network. SIGCOMM ’17, pages 267–280, 2017. Parallel Processing, pages 1–10, 2018.
[67] Tae Joon Seok, Niels Quack, Sangyoon Han, Richard S. Muller, and Ming C. Wu. [92] Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou,
Large-scale broadband digital silicon photonic switches with vertical adiabatic Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, et al. Highly scalable
couplers. Optica, 3(1):64–70, Jan 2016. deep learning training system with mixed-precision: Training imagenet in four
[68] Kyungmok Kwon, Tae Joon Seok, Johannes Henriksson, Jianheng Luo, Lane minutes. arXiv preprint arXiv:1807.11205, 2018.
Ochikubo, John Jacobs, Richard S Muller, and Ming C Wu. 128× 128 silicon [93] Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye,
photonic mems switch with scalable row/column addressing. In CLEO: Science Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. Data
and Innovations, pages SF1A–4. Optical Society of America, 2018. Center TCP (DCTCP). In Proceedings of the ACM SIGCOMM 2010 Conference,
[69] William M. Mellette, Rajdeep Das, Yibo Guo, Rob McGuinness, Alex C. Snoeren, SIGCOMM ’10, pages 63–74, New York, NY, USA, 2010. ACM.
and George Porter. Expanding across time to deliver bandwidth efficiency and [94] Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang,
low latency. NSDI’20, 2020. Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Alizadeh, et al. Hpcc: high
[70] Yunpeng James Liu, Peter Xiang Gao, Bernard Wong, and Srinivasan Keshav. precision congestion control. In Proceedings of the ACM Special Interest Group
Quartz: A new design element for low-latency dcns. SIGCOMM’14, pages 283– on Data Communication, pages 44–58. 2019.
294.

670
SiP-ML: Optical Network Interconnects for Machine Learning SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA

[95] Roy Meade, Shahab Ardalan, Michael Davenport, John Fini, Chen Sun, Mark clusters. CoRR, abs/1511.00175, 2015.
Wade, Alexandra Wright-Gladstein, and Chong Zhang. Teraphy: A high-density [118] Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian
electronic-photonic chiplet for optical i/o from a multi-chip module. In Optical Caulfield, Todd Massengil, Ming Liu, Daniel Lo, Shlomi Alkalay, and Michael
Fiber Communication Conference (OFC) 2019, page M4D.7. Optical Society of Haselman. Accelerating persistent neural networks at datacenter scale. In Hot
America, 2019. Chips, volume 29, 2017.
[96] Alvaro moscoso martir, Juliana MÃĳller, Johannes Hauck, Nicolas Chimot, Rony [119] Anand Jayarajan, Jinliang Wei, Garth Gibson, Alexandra Fedorova, and Gen-
Setter, Avner Badihi, Daniel Rasmussen, Alexandre Garreau, Mads Nielsen, nady Pekhimenko. Priority-based parameter propagation for distributed DNN
Elmira Islamova, Sebastian Romero-GarcÃŋa, Bin Shen, Anna Sandomirsky, training. CoRR, abs/1905.03960, 2019.
Sylvie Rockman, Chao Li, Saeed Sharif Azadeh, Guo-Qiang Lo, Elad Mentovich, [120] Alexander Sergeev and Mike Del Balso. Horovod: fast and easy distributed deep
Florian Merget, and Jeremy Witzens. Silicon photonics wdm transceiver with learning in tensorflow. CoRR, abs/1802.05799, 2018.
soa and semiconductor mode-locked laser. Scientific Reports, 7, 05 2016. [121] Azalia Mirhoseini, Hieu Pham, Quoc V Le, Benoit Steiner, Rasmus Larsen,
[97] 2020 General Europractice Pricelist, Jan. 2020. https://europractice-ic . com/wp- Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, and Jeff Dean.
content/uploads/2020/01/General-MPW-EUROPRACTICE-200123-v3 . pdf. Device placement optimization with reinforcement learning. In Proceedings of
[98] D. Kim, K. Y. Au, H. Y. L. X. Luo, Y. L. Ye, S. Bhattacharya, and G. Q. Lo. 2.5d silicon the 34th International Conference on Machine Learning-Volume 70, pages 2430–
optical interposer for 400 gbps electronic-photonic integrated circuit platform 2439. JMLR. org, 2017.
packaging. In 2017 IEEE 19th Electronics Packaging Technology Conference (EPTC), [122] Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song,
pages 1–4, Dec 2017. Zenglin Xu, and Tim Kraska. Superneurons: dynamic gpu memory management
[99] Chen Sun, Mark T. Wade, Yunsup Lee, Jason S. Orcutt, Luca Alloatti, Michael S. for training deep neural networks. In ACM SIGPLAN Notices, volume 53, pages
Georgas, Andrew S. Waterman, Jeffrey M. Shainline, Rimas R. Avizienis, Sen Lin, 41–53. ACM, 2018.
Benjamin R. Moss, Rajesh Kumar, Fabio Pavanello, Amir H. Atabaki, Henry M. [123] Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam,
Cook, Albert J. Ou, Jonathan C. Leu, Yu-Hsin Chen, Krste Asanović, Rajeev J. Quoc V. Le, and Zhifeng Chen. Gpipe: Efficient training of giant neural networks
Ram, MilošA. Popović, and Vladimir M. Stojanović. Single-chip microprocessor using pipeline parallelism. NeurIPS, 2019.
that communicates directly using light. Nature, 528(7583):534–538, 2015. [124] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An
[100] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, energy-efficient reconfigurable accelerator for deep convolutional neural net-
Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed works. IEEE Journal of Solid-State Circuits, 52(1):127–138, 2017.
deep networks. In Advances in neural information processing systems, pages [125] Yichen Shen, Nicholas C Harris, Scott Skirlo, Mihika Prabhu, Tom Baehr-Jones,
1223–1231, 2012. Michael Hochberg, Xin Sun, Shijie Zhao, Hugo Larochelle, Dirk Englund, et al.
[101] Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Ng An- Deep learning with coherent nanophotonic circuits. Nature Photonics, 11(7):441,
drew. Deep learning with cots hpc systems. In International conference on 2017.
machine learning, pages 1337–1345, 2013. [126] Mahdi Nazm Bojnordi and Engin Ipek. Memristive boltzmann machine: A
[102] Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. hardware accelerator for combinatorial optimization and deep learning. In
Project adam: Building an efficient and scalable deep learning training system. 2016 IEEE International Symposium on High Performance Computer Architecture
In OSDI’14, pages 571–582, 2014. (HPCA), pages 1–13. IEEE, 2016.
[103] Juncheng Gu, Mosharaf Chowdhury, Kang G Shin, Yibo Zhu, Myeongjae Jeon, [127] Chao Wang, Lei Gong, Qi Yu, Xi Li, Yuan Xie, and Xuehai Zhou. Dlau: A scalable
Junjie Qian, Hongqiang Liu, and Chuanxiong Guo. Tiresias: A {GPU } cluster deep learning accelerator unit on fpga. IEEE Transactions on Computer-Aided
manager for distributed deep learning. In NSDI’19, pages 485–500, 2019. Design of Integrated Circuits and Systems, 36(3):513–517, 2017.
[104] Peng Sun, Wansen Feng, Ruobing Han, Shengen Yan, and Yonggang Wen. Op- [128] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal,
timizing network performance for distributed dnn training on gpu clusters: Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-
Imagenet/alexnet training in 1.5 minutes. arXiv preprint arXiv:1902.06855, 2019. datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE
[105] Adam Lerer, Ledell Wu, Jiajun Shen, Timothée Lacroix, Luca Wehrstedt, Abhijit 44th Annual International Symposium on Computer Architecture (ISCA), pages
Bose, and Alexander Peysakhovich. Pytorch-biggraph: A large-scale graph 1–12. IEEE, 2017.
embedding system. CoRR, abs/1903.12287, 2019. [129] Stephen W Keckler, William J Dally, Brucek Khailany, Michael Garland, and
[106] Luo Mai, Chuntao Hong, and Paolo Costa. Optimizing network performance in David Glasco. Gpus and the future of parallel computing. IEEE Micro, 31(5):7–17,
distributed machine learning. In 7th USENIX Workshop on Hot Topics in Cloud 2011.
Computing (HotCloud 15), Santa Clara, CA, 2015. USENIX Association. [130] Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson
[107] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep gradient Huang, and Amin Vahdat. Hedera: Dynamic flow scheduling for data center
compression: Reducing the communication bandwidth for distributed training. networks. In Proceedings of the 7th USENIX Conference on Networked Systems
arXiv preprint arXiv:1712.01887, 2017. Design and Implementation, NSDI’10, pages 19–19, Berkeley, CA, USA, 2010.
[108] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. USENIX Association.
QSGD: Communication-efficient SGD via randomized quantization and encod- [131] Navid Hamedazimi, Zafar Qazi, Himanshu Gupta, Vyas Sekar, Samir R. Das,
ing. volume 3, pages 1710 – 1721, 2018. Jon P. Longtin, Himanshu Shah, and Ashish Tanwer. Firefly: A reconfigurable
[109] Hyeontaek Lim, David G Andersen, and Michael Kaminsky. 3lc: Lightweight wireless data center fabric using free-space optics. SIGCOMM’14, pages 319–330.
and effective traffic compression for distributed machine learning. arXiv preprint [132] M. Ghobadi, R. Mahajan, A. Phanishayee, N. Devanur, J. Kulkarni, G. Ranade,
arXiv:1802.07389, 2018. P. Blanche, H. Rastegarfar, M. Glick, and D. Kilper. Projector: Agile reconfig-
[110] Peng Jiang and Gagan Agrawal. A linear speedup analysis of distributed deep urable data center interconnect. SIGCOMM ’16, pages 216–229, 2016.
learning with sparse and quantized communication. In S. Bengio, H. Wallach, [133] He Liu, Matthew K. Mukerjee, Conglong Li, Nicolas Feltman, George Papen,
H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Stefan Savage, Srinivasan Seshan, Geoffrey M. Voelker, David G. Andersen,
Neural Information Processing Systems 31, pages 2525–2536. Curran Associates, Michael Kaminsky, George Porter, and Alex C. Snoeren. Scheduling techniques
Inc., 2018. for hybrid circuit/packet networks. In CoNEXT, pages 41:1–41:13. ACM, 2015.
[111] Alex Krizhevsky. One weird trick for parallelizing convolutional neural net- [134] Ankit Singla, Atul Singh, and Yan Chen. OSA: An optical switching architecture
works. arXiv preprint arXiv:1404.5997, 2014. for data center networks with unprecedented flexibility. In Presented as part of
[112] Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic the 9th USENIX Symposium on Networked Systems Design and Implementation
averaging sgd. In Advances in Neural Information Processing Systems, pages (NSDI 12), pages 239–252, San Jose, CA, 2012. USENIX.
685–693, 2015. [135] Vishal Shrivastav, Asaf Valadarsky, Hitesh Ballani, Paolo Costa, Ki Suh Lee,
[113] Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization. Han Wang, Rachit Agarwal, and Hakim Weatherspoon. Shoal: A network
In Advances in Neural Information Processing Systems, pages 873–881, 2011. architecture for disaggregated racks. In 16th USENIX Symposium on Networked
[114] Feng Niu, Benjamin Recht, Christopher Re, and Stephen J. Wright. Hogwild!: A Systems Design and Implementation (NSDI’19). USENIX, February 2019.
lock-free approach to parallelizing stochastic gradient descent. In Proceedings [136] He Liu, Feng Lu, Alex Forencich, Rishi Kapoor, Malveeka Tewari, Geoffrey M.
of the 24th International Conference on Neural Information Processing Systems, Voelker, George Papen, Alex C. Snoeren, and George Porter. Circuit switching
NIPS’11, pages 693–701, 2011. under the radar with REACToR. NSDI’14, pages 1–15.
[115] Pijika Watcharapichat, Victoria Lopez Morales, Raul Castro Fernandez, and Peter [137] Ankit Singla, Chi-Yao Hong, Lucian Popa, and P. Brighten Godfrey. Jellyfish:
Pietzuch. Ako: Decentralised deep learning with partial gradient exchange. Networking data centers randomly. In Proceedings of the 9th USENIX Confer-
SoCC ’16, 2016. ence on Networked Systems Design and Implementation, NSDI’12, pages 17–17,
[116] Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, and Roy H. Campbell. Communi- Berkeley, CA, USA, 2012. USENIX Association.
cation scheduling as a first-class citizen in distributed machine learning systems. [138] Andromachi Chatzieleftheriou, Sergey Legtchenko, Hugh Williams, and Antony
CoRR, abs/1803.03288, 2018. Rowstron. Larry: Practical network reconfigurability in the data center. In 15th
[117] Forrest N. Iandola, Khalid Ashraf, Matthew W. Moskewicz, and Kurt Keutzer. USENIX Symposium on Networked Systems Design and Implementation (NSDI 18),
Firecaffe: near-linear acceleration of deep neural network training on compute pages 141–156, Renton, WA, April 2018. USENIX Association.

671
SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA M. Khani et al.

[139] Sergey Legtchenko, Nicholas Chen, Daniel Cletheroe, Antony Rowstron, Hugh
Williams, and Xiaohan Zhao. Xfabric: A reconfigurable in-rack network for rack-
scale computers. In 13th USENIX Symposium on Networked Systems Design and
Implementation (NSDI 16), pages 15–29, Santa Clara, CA, March 2016. USENIX
Association.
[140] Sebastien Rumley, Meisam Bahadori, Robert Polster, Simon D. Hammond,
David M. Calhoun, Ke Wen, Arun Rodrigues, and Keren Bergman. Optical
interconnects for extreme scale computing systems. Parallel Computing, 64:65 –
80, 2017. High-End Computing for Next-Generation Scientific Discovery.
[141] Nicolas Sherwood-Droz, Howard Wang, Long Chen, Benjamin G. Lee, Aleksandr
Biberman, Keren Bergman, and Michal Lipson. Optical 4×4 hitless silicon router
for optical networks-on-chip (noc). Opt. Express, 16(20):15915–15922, Sep 2008.
[142] Qixiang Cheng, Sebastien Rumley, Meisam Bahadori, and Keren Bergman. Pho-
tonic switching in high performance datacenters. Opt. Express, 26(12):16022–
16043, Jun 2018.
[143] G. Michelogiannakis, Y. Shen, X. Meng M. Y. Teh, B. Aivazi, T. Groves, J. Shalf,
M. Glick, M. Ghobadi, L. Dennison, and K. Bergman. Bandwidth steering for
hpc using silicon nanophotonics. ACM/IEEE Supercomputing Conference (SC),
10 2019.
[144] Keren Bergman, John Shalf, George Michelogiannakis, Sebastien Rumley, Larry
Dennison, and Monia Ghobadi. Pine: An energy efficient flexibly interconnected
photonic data center architecture for extreme scalability. In 2018 IEEE Optical
Interconnects Conference (OI), OI ’18, 2018.
[145] Ravindra K Ahuja, Thomas L Magnanti, and James B Orlin. Network flows.
1988.
[146] Robert E Tarjan. Dynamic trees as search trees via euler tours, applied to the
network simplex algorithm. Mathematical Programming, 78(2):169–177, 1997.
[147] James B Orlin, Serge A Plotkin, and Éva Tardos. Polynomial dual network
simplex algorithms. Mathematical programming, 60(1-3):255–276, 1993.
[148] Prabhakar Raghavan and Clark D. Thompson. Randomized rounding: A tech-
nique for provably good algorithms and algorithmic proofs. Technical Report
UCB/CSD-85-242, EECS Department, University of California, Berkeley, May
1985.

672
SiP-ML: Optical Network Interconnects for Machine Learning SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA

A APPENDIX Step 2: Compute min-cost flow. Having constructed the graph

Appendices are supporting material that has not been peer-reviewed. G, we solve the following flow routing problem:
X fe
maximize ,
A.1 SiP-Ring TM (2)
e
e ∈E:T M e >0
One of the core properties of SiP-ML-Ring is the ability to dynami- where for an edge e = (u, v), fe is the flow on the edge, and
cally place bandwidth around the static topology to maximize the T Me = T Muv is the traffic demand on that edge. The constraints
throughput between communicating nodes for model parallel jobs. (not shown for brevity) are the standard flow conservation con-
Note that for ring-allreduce data parallel jobs, there is no need to straints. The intuition for the above objective is that we wish to
reschedule the bandwidth once a physical ring is established by maximize throughput but preferentially allocate a larger flow (more
the patch panel. However, we find that model parallel jobs benefit wavelengths) to GPU-to-GPU paths with smaller demand. The rea-
from bandwidth rescheduling. The optimal bandwidth allocation son for favoring smaller demands is to complete them quickly,
maximizes the throughput while ensuring no two paths sharing reducing the number of nodes with which each node must commu-
the same fiber are assigned the same wavelength. More formally, nicate. This keeps the unsatisfied traffic pattern sparse over time,
the bandwidth allocation problem corresponds to the following allowing the remaining traffic to be handled efficiently in future
optimization problem. Let T Mi j denote the predicted GPU-to-GPU wavelength reconfiguration events.
traffic matrix, and W denote the total number of wavelengths (a.k.a The objective in Eq. (2) can be equivalently be written as a min-
available bandwidth). We can represent a wavelength allocation as cost flow routing problem [145] by defining the weight of edge e
a 3-dimensional binary matrix, Λ, where Λi jk is 1 if GPU i sends as w e = −1/T Me if T Me > 0, and w e = 0 if e is a dummy edge.
data to GPU j using λk and is zero otherwise. There are several The problem is then to minimize e w e fe . Min-cost flow routing
P
possible objectives. A natural one is to minimize the maximum com- can be solved using the network simplex algorithm [145–147]. The
pletion time for any GPU-to-GPU transfer, where the completion procedure for constructing the graph and defining the flow routing
TM
time is P Λi j . This can be expressed as an Integer Linear Program problem is slightly more complicated when the cut chosen for
k i jk
(ILP) by maximizing the minimum inverse of the completion time, adding the source and sink nodes includes more than one edge. In
as follows: this case, we need additional constraints to ensure consistency of
flows between the cut edges.
In the more general case of cuts with higher degrees, suppose we
Λi jk /T Mi j
P
maximizeΛ∈ {0,1} N ×N ×W min k
i j:T M i j >0 would like to inject the flows at the segment between Node3 and
s.t. Node4 . The problem remains basically the same, but we need to add
(1)
(1) Λ (i mod N )jk ∀l, k the following three constraints in addition to the flow conservation
P
(j ≤i ≤N +l )∩(l <j ) ≤ 1
(2)
P
Λ ≤ ∀j, k constraints: (1) X = X ′ , (2) Y = Y ′ , and (3) X + Y = 1. We can
Pi i jk
1
(3) j Λi jk ≤ 1 ∀i, k simply add these constraints to our simplex problem as well.

The constraints are (1) ensure fiber segments do not contain over- Sink
Node
lapping wavelengths (ring constraint), and (2) ensure each GPU can X
2
Y’
GB Node3 Node3 Y Src
use each wavelength for communication with, at most, one other 3
X’
Node
R R 1 1 2
GPU (node constraint). 3 3 3
Note that the size of the ILP solution space, Λ ∈ {0, 1} N ×N ×W , Node2 Node4 Node2 Node4

1
grows with the number of nodes in the network, rendering it in-
1
tractable at larger scale. Therefore, instead of solving the ILP, we RG B
Node1 Node1

present a more practical algorithm that turns this discrete optimiza-

tion problem into a min-cost flow routing problem which can be (a) Wavelength allocation (b) Flow allocation

solved efficiently.
Step 1: Communication graph construction. We construct a
directed communication graph, G = (V , E), where V is the set of
nodes and for every T Muv > 0, there is a directed edge e = (u, v). Figure 13: Wavelength allocation and its equivalent flow
After including edges for the entire T M in G, we check whether routing translation for multiedge cut.
every adjacent node-pair on the topology is connected in G. If not,
we add a “dummy” edge between them to E. The direction of all Step 3: Remove and repeat. The solution obtained by solving
edges in G is the same as that of wave propagation on the fiber.We the above min-cost flow problem may result in some GPU-to-GPU
then add dummy sink and source nodes by cutting the edges in G demands completing very quickly. However, since reconfiguration
along an arbitrary topology segment. For simplicity, let us assume incurs delay (e.g., 25 µs in our prototype), we cannot reconfigure
for now that this process cuts only one edge of the graph.We add wavelengths too quickly without hurting efficiency (more on this
two terminal nodes on the two ends of the cut edge to be the source below). Therefore we should plan the wavelength allocation based
and sink. The source node injects a unit-sized flow into the ring on a time horizon rather than looking only at the instantaneous
and the sink node receives it. traffic demands. To this end, we iteratively solve the min-cost flow

673
SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA M. Khani et al.

SiP-ML (∆ = ∞) SiP-ML (∆ = 16)

problem in Equation (2), serving theT M with step-size of ∆ based on
MLPerf v0.7
the flows obtained after each iteration, and repeating this procedure 35
until there is no unserved demand left in the T M. We compute the

Scaling Efficiency (%)

100

mean of the flow allocations over all iterations as the final flow 30 75
allocation.
25 50
Step 4: Mapping flows to bandwidth. Finally, we scale the flows
from the previous step by W and map them to integer numbers 20 25
using a technique called randomized rounding [148]. This produces
15 0
the final compute and bandwidth allocation. An important con- 27 28 29 210 211 212 213 27 28 29 210 211 212 213
sideration in SiP-ML’s design is how frequently to reschedule the BW per GPU (Gbps) BW per GPU (Gbps)
bandwidth allocations. By rescheduling frequently, we can better
(a) ResNet50 (b) Transformer
tailor the bandwidth allocation to meet the traffic demands. But
rescheduling too quickly is undesirable, because each reconfigu-
ration incurs a delay during which no traffic can be sent. In our Figure 14: Comparing the scaling efficiency of our place-
experiments, we found setting the rescheduling period to 100 µs ment algorithm at different bandwidths to state-of-the-art
(4× the reconfiguration delay) provides the best performance. expert designed placements in MLPerf benchmark for 1024
GPUs.

A.2 SiP-OCS ILP

Similar to §A.1, we assumeT Mi j denotes the estimated traffic matrix
between GPUs i and j. We have N GPUs and Q OCSs each with N
ports. There is B/Q bandwidth available between each GPU and
OCS. Let P ∈ {0, 1} N ×N ×Q stand for the permutation configuration
of all OCSs with Pi jk = 1 if there is a circuit between GPUs i and j
on OCS k. Therefore, thetotal available
bandwidth between GPUs
PQ
i and j would be: (B/Q ) k=1 Pi jk . Our circuit scheduling goal
can be expressed as an Integer Linear Program (ILP) by maximizing
the minimum inverse of the completion time, as follows:

k Pi jk /T Mi j
P
maximizeP ∈ {0,1} N ×N ×Q min
i j:T M i j >0
s.t. (3)
(1) P ∀j, k
P
≤
Pi i jk
1
(2) P
j i jk ≤ 1 ∀i, k
Figure 15: System level diagram of GPU nodes with scalable
SiP select/bypass interface. The incoming 64 wavelengths
where constraints (1) and (2) would enforce the OCS configurations
are separated into four groups with 16 wavelengths each for
to be in the form of a permutation for each OCS; i.e., each GPU can
select/bypass.
establish a circuit with only one other GPU on each OCS. For com-
mercial OCSs that have orders of magnitude higher reconfiguration
delay than MRRs, we only use one-shot configuration. For such
configurations, our experiments show ILPs can be solved reason-
ably fast enough for thousands of nodes. Note that with one-time
scheduling, this optimization happens only once at the beginning
of training each new workload.
A.4 Optical Simulations
Fig. 15 demonstrates our approach to achieve SiP interfaced GPU
nodes at large scale. Every WDM input of 64 wavelengths from the
A.3 Scaling Efficiency of the Placement previous GPU node is first de-interleaved into 4 groups with 16
In Fig. 14, we compare the scaling efficiency of SiP-ML’s placement wavelengths each. We use cascaded SiP micro-ring filters to perform
algorithm on 1024 GPUs to the efficiency achieved in the most wavelength selective add/drop or to pass wavelength(s) through the
recent version of the MLPerf training benchmark [88]. We highlight node based on the requirement of global scheduler. To overcome
the following takeaways: 1) workloads like ResNet50 are too small to the spectral power variability caused by the multi-staged optical
be efficiently scaled to 1000s of GPUs; 2) our placement generalizes components, we add optical amplifiers, optical (de)multiplexers and
to electrical topologies without degree constraint; 3) placement variable optical attenuators (VOAs) to equalize the optical power for
with optical degree constraints respects the compute efficiency in each wavelength at the output of the GPU node. An interleaver then
addition to interconnect constraints; 4) overall, SiP-ML achieves combines all 4 groups and forwards the new WDM signal to the
up to 4.3× better scaling efficiency than today’s expert-designed next GPU node. We simulate our SiP add/drop interface using the
parallelization strategies for clusters in MLPerf benchmark. American Institute for Manufacturing Integrated Photonics (AIM

674
SiP-ML: Optical Network Interconnects for Machine Learning SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA

Photonics) process design kit (PDK) in OptSim software. The add/- with multimode interference (MMI) couplers. In the simulation,
drop filters are from the AIM PDK and the (de)interleavers are built we achieve an equalized optical spectrum at the output of a GPU
with cascaded 2-stage MZI. The optical multiplexer/demultiplexers node for two cases: 1) 64 bypass wavelengths; 2) 64 wavelengths
are designed using ideal OptSim models with a bandwidth of 0.5nm. with 32 wavelengths being dropped and added, while the other 32
The multiplexer/demultiplexer function can also be implemented wavelengths bypassing the node.

675

An Overview On Application of Machine Learning Techniques in Optical Networks
No ratings yet
An Overview On Application of Machine Learning Techniques in Optical Networks
26 pages
Alberto Scandurra Auth., Ian OConnor, Gabriela Nicolescu Eds. Integrated Optical Interconnect Architectures For Embedded Systems1
No ratings yet
Alberto Scandurra Auth., Ian OConnor, Gabriela Nicolescu Eds. Integrated Optical Interconnect Architectures For Embedded Systems1
276 pages
Photonics 09 00030
No ratings yet
Photonics 09 00030
38 pages
Porte
No ratings yet
Porte
34 pages
Hpca 2022
No ratings yet
Hpca 2022
15 pages
Bigbus Jetc
No ratings yet
Bigbus Jetc
30 pages
A Review of Optical Neural Networks
No ratings yet
A Review of Optical Neural Networks
11 pages
Energy Harvesting Optical Modulators
No ratings yet
Energy Harvesting Optical Modulators
9 pages
A Review of Optical Neural Networks
No ratings yet
A Review of Optical Neural Networks
11 pages
FPGA-based Implementation of Two-Step Schedulers For Modular Optical Interconnection Networks
No ratings yet
FPGA-based Implementation of Two-Step Schedulers For Modular Optical Interconnection Networks
10 pages
Research Progress in Optical Neural Networks: Theory, Applications and Developments
No ratings yet
Research Progress in Optical Neural Networks: Theory, Applications and Developments
39 pages
A Survey On Optical Communication
No ratings yet
A Survey On Optical Communication
29 pages
Ultra-Broadband All-Optical Nonlinear Activation Function Enabled by MoTe2/ Optical Waveguide Integrated Devices
No ratings yet
Ultra-Broadband All-Optical Nonlinear Activation Function Enabled by MoTe2/ Optical Waveguide Integrated Devices
12 pages
1 en 24 Chapter
No ratings yet
1 en 24 Chapter
13 pages
Introducing Neuromorphic Spiking
No ratings yet
Introducing Neuromorphic Spiking
24 pages
10.1038@s41586 020 2973 6
No ratings yet
10.1038@s41586 020 2973 6
9 pages
Deep Learning With Coherent Nanophotonic Circuits: Articles
No ratings yet
Deep Learning With Coherent Nanophotonic Circuits: Articles
7 pages
High Bandwidth Density Sillicon Photosonic Resonators
No ratings yet
High Bandwidth Density Sillicon Photosonic Resonators
19 pages
High-Bandwidth Density Silicon Photonic Resonators For Energy-Efficient Optical Interconnects
No ratings yet
High-Bandwidth Density Silicon Photonic Resonators For Energy-Efficient Optical Interconnects
19 pages
Proof: Optics and Laser Technology
No ratings yet
Proof: Optics and Laser Technology
11 pages
Photonic Machine Learning With On-Chip Diffractive Optics
No ratings yet
Photonic Machine Learning With On-Chip Diffractive Optics
10 pages
An All-Optical General-Purpose CPU and Optical Computer Architecture
No ratings yet
An All-Optical General-Purpose CPU and Optical Computer Architecture
15 pages
Multi-Dimensional Optical Neural Network
No ratings yet
Multi-Dimensional Optical Neural Network
6 pages
300-Gbps Optical Interconnection Using Neural-Network Based Silicon Microring Modulator
No ratings yet
300-Gbps Optical Interconnection Using Neural-Network Based Silicon Microring Modulator
9 pages
Silicon Photonic Neural Applications and Prospects: Keywords
No ratings yet
Silicon Photonic Neural Applications and Prospects: Keywords
10 pages
Training Material Lenovo Certified Data Center Technical Sales Professional DCP-315C
60% (5)
Training Material Lenovo Certified Data Center Technical Sales Professional DCP-315C
136 pages
Photonics 11 00613
No ratings yet
Photonics 11 00613
23 pages
Machine Learning For Optical Fiber Communication Systems: An Introduction and Overview
No ratings yet
Machine Learning For Optical Fiber Communication Systems: An Introduction and Overview
23 pages
s43593 022 00027 X Compressed
No ratings yet
s43593 022 00027 X Compressed
25 pages
An Overview of Efficient Interconnection Networks For Deep Neural Network Accelerators
No ratings yet
An Overview of Efficient Interconnection Networks For Deep Neural Network Accelerators
15 pages
Optics For Ai Ofc 2022
No ratings yet
Optics For Ai Ofc 2022
3 pages
Accelerating Neural Networks For Large Language Models and Graph Processing With Silicon Photonics
No ratings yet
Accelerating Neural Networks For Large Language Models and Graph Processing With Silicon Photonics
6 pages
Principles of Neuromorphic Photonics PDF
No ratings yet
Principles of Neuromorphic Photonics PDF
28 pages
Ultrafast Coherent Dynamics of Microring Modulators
No ratings yet
Ultrafast Coherent Dynamics of Microring Modulators
34 pages
On Fiber Hotnets 2023
No ratings yet
On Fiber Hotnets 2023
9 pages
Photonic Neuromorphic Computing, Architectures - Final
No ratings yet
Photonic Neuromorphic Computing, Architectures - Final
3 pages
"SiP Architecture For Accelerating Collective Communication in Distributed Deep Learning
No ratings yet
"SiP Architecture For Accelerating Collective Communication in Distributed Deep Learning
3 pages
Nsdi21 SwitchML
No ratings yet
Nsdi21 SwitchML
25 pages
Photonics 12 00458
No ratings yet
Photonics 12 00458
6 pages
Performance Task in Ict 10
No ratings yet
Performance Task in Ict 10
31 pages
InP Photonic Integration
No ratings yet
InP Photonic Integration
1 page
Accessory Boards
No ratings yet
Accessory Boards
6 pages
Computer Architecture - CSE4001 "Operating Principles of The Computer Architecture"
No ratings yet
Computer Architecture - CSE4001 "Operating Principles of The Computer Architecture"
60 pages
Gebbwolf3d v2.1
100% (1)
Gebbwolf3d v2.1
320 pages
5 Introduction To Huawei AI Platforms v3.5
No ratings yet
5 Introduction To Huawei AI Platforms v3.5
113 pages
Octane Render User Manual
50% (2)
Octane Render User Manual
151 pages
Unit - 1: Cloud Architecture and Model
No ratings yet
Unit - 1: Cloud Architecture and Model
9 pages
Broadband Packet
No ratings yet
Broadband Packet
3 pages
Developing A Videogame
No ratings yet
Developing A Videogame
3 pages
MPC HC Guide
No ratings yet
MPC HC Guide
4 pages
FS16 Saratech 04 PerformanceTuning
No ratings yet
FS16 Saratech 04 PerformanceTuning
38 pages
Brand and Model Number Features Cost: Type Your Answers Here
100% (1)
Brand and Model Number Features Cost: Type Your Answers Here
2 pages
Introduction To Information Systems-Course Material
No ratings yet
Introduction To Information Systems-Course Material
90 pages
2024 Global Trends in AI - WEKA
No ratings yet
2024 Global Trends in AI - WEKA
49 pages
Expansion Slot
No ratings yet
Expansion Slot
38 pages
Choosing Gear For Your Smaart Measurement System
No ratings yet
Choosing Gear For Your Smaart Measurement System
17 pages
Datorteknik, Eitf70, Per Andersson
No ratings yet
Datorteknik, Eitf70, Per Andersson
29 pages
GPGPU Sim Tutorial
No ratings yet
GPGPU Sim Tutorial
28 pages
Minor Project Report
No ratings yet
Minor Project Report
60 pages
Nvidia Rtx4080服务器版本 Spec
No ratings yet
Nvidia Rtx4080服务器版本 Spec
5 pages
OPT: Open Pre-Trained Transformer Language Models
No ratings yet
OPT: Open Pre-Trained Transformer Language Models
30 pages
Honey, I Shrunk The LLM! A Beginner's Guide To Quantization - The Register
No ratings yet
Honey, I Shrunk The LLM! A Beginner's Guide To Quantization - The Register
11 pages
TipidPC Magazine Issue #2
No ratings yet
TipidPC Magazine Issue #2
108 pages
Sheet 3 Microprocessor
No ratings yet
Sheet 3 Microprocessor
6 pages
Mipsology Aws f1
No ratings yet
Mipsology Aws f1
10 pages
S4421 Gpu Computing With Matlab
No ratings yet
S4421 Gpu Computing With Matlab
27 pages
Raider GE66 12UHS-236
No ratings yet
Raider GE66 12UHS-236
1 page
BullSequana SA21Ga Product Brief
No ratings yet
BullSequana SA21Ga Product Brief
1 page
Teslapersonalsupercomputer 160201192005
No ratings yet
Teslapersonalsupercomputer 160201192005
16 pages
ATI FireGL X3-256 Specs - TechPowerUp GPU Database
No ratings yet
ATI FireGL X3-256 Specs - TechPowerUp GPU Database
3 pages
AI and Deep Learning for Networks
From Everand
AI and Deep Learning for Networks
Gopee Mukhopadhyay
No ratings yet
Study Guide for the Cisco 300-440 ENCC Designing and Implementing Cloud Connectivity Exam.
From Everand
Study Guide for the Cisco 300-440 ENCC Designing and Implementing Cloud Connectivity Exam.
Anand Vemula
No ratings yet
Study Guide Designing Cisco Data Centre Infrastructure (300-610) Exam
From Everand
Study Guide Designing Cisco Data Centre Infrastructure (300-610) Exam
Anand Vemula
No ratings yet
Graphcore Poplar Programming and Optimization: The Complete Guide for Developers and Engineers
From Everand
Graphcore Poplar Programming and Optimization: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
DeepSparse for Efficient CPU Inference: The Complete Guide for Developers and Engineers
From Everand
DeepSparse for Efficient CPU Inference: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
A SECURE DATA AGGREGATION TECHNIQUE IN WIRELESS SENSOR NETWORK
From Everand
A SECURE DATA AGGREGATION TECHNIQUE IN WIRELESS SENSOR NETWORK
Dr Chaitra HV
No ratings yet
Cerebras GPT: Wafer-Scale Architectures for Large Language Models
From Everand
Cerebras GPT: Wafer-Scale Architectures for Large Language Models
William Smith
No ratings yet
Efficient Numerical Computing with Intel MKL: Definitive Reference for Developers and Engineers
From Everand
Efficient Numerical Computing with Intel MKL: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Mobile Neural Network Framework in Practice: The Complete Guide for Developers and Engineers
From Everand
Mobile Neural Network Framework in Practice: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
IGNOU MCS 231 Mobile Computing Previous Year Solved Papers
From Everand
IGNOU MCS 231 Mobile Computing Previous Year Solved Papers
Manish Soni
No ratings yet
Parallel Programming with MPI: Definitive Reference for Developers and Engineers
From Everand
Parallel Programming with MPI: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Cilk Programming and Algorithms: Definitive Reference for Developers and Engineers
From Everand
Cilk Programming and Algorithms: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
OpenCL Programming and Architecture: Definitive Reference for Developers and Engineers
From Everand
OpenCL Programming and Architecture: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Practical High Performance Computing: Definitive Reference for Developers and Engineers
From Everand
Practical High Performance Computing: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Contiki Operating System for Embedded IoT: Definitive Reference for Developers and Engineers
From Everand
Contiki Operating System for Embedded IoT: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Comprehensive Guide to Mbed Development: Definitive Reference for Developers and Engineers
From Everand
Comprehensive Guide to Mbed Development: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Cilium: Architecture, Networking, and Security in Kubernetes: Definitive Reference for Developers and Engineers
From Everand
Cilium: Architecture, Networking, and Security in Kubernetes: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Principles of Multiple Spanning Tree Protocol: Definitive Reference for Developers and Engineers
From Everand
Principles of Multiple Spanning Tree Protocol: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Principles of Data Forwarding Technologies: Definitive Reference for Developers and Engineers
From Everand
Principles of Data Forwarding Technologies: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
NB-IoT Systems and Protocols: Definitive Reference for Developers and Engineers
From Everand
NB-IoT Systems and Protocols: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet

SiP-ML - High-Bandwidth Optical Network Interconnects For Machine

Uploaded by

SiP-ML - High-Bandwidth Optical Network Interconnects For Machine

Uploaded by

SiP-ML: High-Bandwidth Optical Network Interconnects for Machine

SIGCOMM ’21, August 23–27, 2021, Virtual Event, USA

{ Node1 Node2 Noden-1 Noden Node1 Node2 Noden-1 Noden

(a) Today’s ML clusters (b) SiP-ML cluster

Figure 4: An example of SiP-ML’s parallelization strategy with k = 4, l = 2, D=3, and ∆ = 2.

Algorithm 1 that meets a maximal set of range constraints. We then reallocate

(a) ResNet50 (b) Transformer (c) Megatron

Analysis. We first consider the Elect-Flat architecture. Recall that

15 4 Tbps tween reconfigurable and one-shot allocation depends on

at 32 GPUs with 1µsec latency. GPU1 GPU2 GPU3

the ideal Elect-Flat at all scales, with SiP-Ring occasionally slightly

A APPENDIX Step 2: Compute min-cost flow. Having constructed the graph

present a more practical algorithm that turns this discrete optimiza-

SiP-ML (∆ = ∞) SiP-ML (∆ = 16)

Scaling Efficiency (%)

A.2 SiP-OCS ILP

You might also like