
Super-Compute System Scaling for ML Training

Bill Chang, Rajiv Kurian, Doug Williams, Eric Quinnell


Path to General Autonomy

Model Architecture
Vision, Path Planning, Auto-Labeling
New Model Architectures
Parameter Sizes Increasing Exponentially

Training Data
Video Training Data With 4D Labels
Ground Truth Generation

Training Infrastructure
Training and Evaluation Pipeline
Accelerated ML Training System

Flexible System Architecture
Software at Scale

Typical System
- Compute and Memory I/O coupled in a fixed ratio

Optimized ML Training System
- ML requirements are evolving

Disaggregated System Architecture
- Compute and Memory I/O in a flexible ratio
- Optimized compute
Technology-Enabled Scaling

System-On-Wafer Technology
- 25 D1 Compute Dies + 40 I/O Dies
- Compute and I/O Dies Optimize Efficiency and Reach
- Heterogeneous RDL Optimized for High-Density and High-Power Layout

Maximize Performance and Yield


- Known Good Die and Fault Tolerant Designs
- Each Tile Assembled With Fully Functional Dies
- Harvesting and Fully Configurable Routing for Yield
Training Tile

Unit of Scale
- Large Compute With Optimized I/O
- Fully Integrated System Module (Power/Cooling)

Uniform High-Bandwidth
- 10 TB/s on-tile bisection bandwidth
- 36 TB/s off-tile aggregate bandwidth

9 PFLOPS BF16/CFP8
11 GB High-Speed ECC SRAM
36 TB/s Aggregate I/O BW
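A quick consistency check of the headline figures, under the assumption (mine, not stated on this slide) that the 36 TB/s aggregate is spread evenly over the four tile edges at the 9 TB/s tile-to-tile link rate shown later in the deck:

```python
# Rough consistency check of the Training Tile bandwidth figures.
# Assumption: the off-tile aggregate covers 4 tile edges at the 9 TB/s
# tile-to-tile link rate shown on the "Flexible Building Block" slide.
EDGES_PER_TILE = 4
LINK_BW_TBPS = 9

aggregate_tbps = EDGES_PER_TILE * LINK_BW_TBPS
print(f"Off-tile aggregate bandwidth: {aggregate_tbps} TB/s")  # 36 TB/s
```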
Flexible Building Block

[Diagram: a 2 x 3 array of Training Tiles connected by 9 TB/s tile-to-tile links]

Scale With Multiple Tiles

No Additional Power/Cooling Design Needed


Disaggregated Memory

V1 Dojo Interface Processor

32GB High-Bandwidth Memory


- 800 GB/s Total Memory Bandwidth

900 GB/s TTP Interface


- Tesla Transport Protocol (TTP) - Full custom protocol
- Provides full DRAM bandwidth to Training Tile

50 GB/s TTP over Ethernet (TTPoE)


- Enables extending communication over standard Ethernet
- Native hardware support

32 GB/s Gen4 PCIe Interface


Dojo Interface Processor - PCIe Topology

160 GB Total DRAM per Tile Edge
- Shared memory for training tiles

5 DIP Cards Provide Max Bandwidth
- 4.5 TB/s aggregate bandwidth to DRAM over TTP

80 Lanes PCIe Gen4 Interface
- Provide standard connectivity to hosts

[Diagram: a PCIe host connected to 5 DIP cards (each with HBM) along one edge of a Training Tile]
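A minimal sketch of how the per-edge totals follow from the per-DIP figures quoted on the previous slide (5 DIP cards per tile edge, each with 32 GB of HBM and a 900 GB/s TTP link):

```python
# Per-edge DRAM capacity and bandwidth derived from the per-DIP figures above.
DIPS_PER_EDGE = 5
HBM_PER_DIP_GB = 32          # 32 GB HBM per Dojo Interface Processor
TTP_BW_PER_DIP_GBPS = 900    # 900 GB/s TTP interface per DIP

dram_per_edge_gb = DIPS_PER_EDGE * HBM_PER_DIP_GB                   # 160 GB
ttp_bw_per_edge_tbps = DIPS_PER_EDGE * TTP_BW_PER_DIP_GBPS / 1000   # 4.5 TB/s
print(f"{dram_per_edge_gb} GB DRAM and {ttp_bw_per_edge_tbps} TB/s per tile edge")
```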
Scalable Communication

Tesla Transport Protocol

[Chart: bandwidth vs. latency at each level of the interconnect - node links within a D1 die, TTP across the Tile, the DIP, and TTPoE]
Dojo Interface Processor - Z-Plane Topology

TTPoE - Point-to-Point over Ethernet
- Provides high-radix connectivity in the Z-plane TTP network
- Enables "shortcuts" across the network through an Ethernet switch, cutting a ~30-hop path across the compute plane down to ~4 hops
- Manages latency for sync and control across the compute plane
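A toy sketch of why the Z-plane shortcuts matter: on a plain 2-D mesh a packet pays one hop per step of Manhattan distance, while hopping up to a DIP, across the Ethernet switch, and back down bounds the path at a handful of hops regardless of distance. The mesh size and shortcut hop costs below are illustrative assumptions, not Dojo's actual parameters.

```python
# Toy hop-count comparison: mesh routing across the compute plane vs. a
# TTPoE "shortcut" through the Ethernet switch. All constants are illustrative.

def mesh_hops(src, dst):
    """Hops on a 2-D mesh: Manhattan distance between die coordinates."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def shortcut_hops(to_dip=1, through_switch=2, from_dip=1):
    """Assumed cost: hop up to the nearest DIP, cross the switch, re-enter the mesh."""
    return to_dip + through_switch + from_dip

src, dst = (0, 0), (14, 16)                    # far corners of an illustrative plane
print("mesh:", mesh_hops(src, dst), "hops")    # 30 hops
print("shortcut:", shortcut_hops(), "hops")    # 4 hops
```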
Dojo Network Interface Card

Remote DMA over TTPoE
- DMA to/from any TTP endpoint (compute SRAM, DRAM)
- Leverage switched Ethernet networks
- Enables remote compute for pre/post-processing

[Diagram: a host with CPU, DRAM, and DNIC attached over TTPoE]
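A hedged sketch of what the host-side view of remote DMA over TTPoE could look like. `DmaDescriptor`, `dnic_submit`, and the field names are hypothetical, invented purely for illustration; they are not the actual Dojo driver API.

```python
# Hypothetical host-side view of remote DMA over TTPoE.
# None of these names come from the actual Dojo software stack.
from dataclasses import dataclass

@dataclass
class DmaDescriptor:
    dst_endpoint: int   # any TTP endpoint: compute SRAM or DIP DRAM
    dst_offset: int
    src_host_addr: int
    length: int

def enqueue_writes(dnic_submit, dst_endpoint, host_buf_addr, total_len, chunk=1 << 20):
    """Split a host buffer into chunked descriptors and hand them to the DNIC."""
    for off in range(0, total_len, chunk):
        dnic_submit(DmaDescriptor(
            dst_endpoint=dst_endpoint,
            dst_offset=off,
            src_host_addr=host_buf_addr + off,
            length=min(chunk, total_len - off),
        ))
```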
Remote DMA Topology

[Diagram: multiple hosts (CPU + DRAM + DNIC) connected through an Ethernet switch to the DIP cards and HBM on each Training Tile edge]

Scale-Out for CPU/Memory-Bound Pre-Processing Workloads
V1 Dojo Training Matrix

[Diagram: rows of Training Tiles linked at 9 TB/s, with 5x DIP cards (HBM, 4.5 TB/s) on each tile edge and host CPUs with DRAM and DNICs attached through an Ethernet switch]

1 EFLOP BF16/CFP8
1.3 TB High-Speed ECC SRAM
13 TB High-BW DRAM
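A back-of-envelope cross-check of the quoted totals against the earlier per-tile figures (9 PFLOPS and 11 GB of SRAM per Training Tile). The implied tile count is my own arithmetic, not a number stated in the deck.

```python
# Back-of-envelope check: how many 9 PFLOPS / 11 GB tiles the quoted system
# totals imply. The tile counts below are inferred, not stated in the deck.
SYSTEM_FLOPS_EF = 1.0      # 1 EFLOP BF16/CFP8
SYSTEM_SRAM_TB = 1.3       # 1.3 TB high-speed ECC SRAM
TILE_FLOPS_PF = 9          # per Training Tile
TILE_SRAM_GB = 11          # per Training Tile

print(SYSTEM_FLOPS_EF * 1000 / TILE_FLOPS_PF)   # ~111 tiles implied by compute
print(SYSTEM_SRAM_TB * 1000 / TILE_SRAM_GB)     # ~118 tiles implied by SRAM
```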
Disaggregated Scalable System
- Compute: Training Tile
- Memory I/O: Interface Processor, Network Interface
Software at Scale
Model Execution

- Workloads operate almost entirely out of SRAM
- Unlike typical accelerators, all forms of parallelism may cross die boundaries thanks to the high TTP bandwidth
- Single copy of parameters - replicated just in time
- High utilization

[Diagram: Training Tiles linked at 9 TB/s with 5x DIP cards (HBM, 4.5 TB/s) on each edge]
Model Execution
DIPs (x5 per tile edge): P0 [C K1 R S] on one edge, P1 [K1 K2 R S] on the other; Tiles: empty

Parameters Are Distributed Across the DIPs


Model Execution
DIPs (x5 per tile edge): P0 [C K1 R S], P1 [K1 K2 R S]
Tiles 1-2: [C/2 K1 R S] halves of P0; Tiles 3-4: [K1/2 K2 R S] halves of P1

Parameters Are Sharded Across the Tiles at Load Time


Once per training run
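A minimal NumPy sketch of the load-time sharding shown above: P0 is split in half along its input-channel dimension C across one pair of tiles, and P1 along K1 across the other pair. The sizes are illustrative; only the split pattern follows the slide's [C K1 R S] / [K1 K2 R S] convention.

```python
import numpy as np

# Illustrative sizes for the [C K1 R S] and [K1 K2 R S] weight tensors.
C, K1, K2, R, S = 64, 128, 256, 3, 3
P0 = np.random.randn(C, K1, R, S).astype(np.float32)
P1 = np.random.randn(K1, K2, R, S).astype(np.float32)

# Once per training run: shard each parameter across a pair of tiles.
tile0, tile1 = np.split(P0, 2, axis=0)   # [C/2, K1, R, S] each
tile2, tile3 = np.split(P1, 2, axis=0)   # [K1/2, K2, R, S] each
print(tile0.shape, tile2.shape)          # (32, 128, 3, 3) (64, 256, 3, 3)
```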
Model Execution
DIPs (x5 per tile edge): P0 [C K1 R S] + I0 [N/2 C H W] on one edge, P1 [K1 K2 R S] + I0 [N/2 C H W] on the other
Tiles: load-time parameter shards, as above

Inputs Sharded Across the DIPs in the Batch Dimension


Model Execution
DIPs (x5 per tile edge): P0 [C K1 R S] + I0 [N/2 C H W], P1 [K1 K2 R S] + I0 [N/2 C H W]
Tiles: one [N/4 C H W] input shard each

Inputs Are Also Sharded (by Batch) Across the Tiles
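The same idea applied to the inputs in the two steps above: the global batch N is halved across the two DIP groups and then quartered across the tiles, always along the batch dimension. A small NumPy sketch with illustrative sizes:

```python
import numpy as np

# Illustrative [N C H W] input batch.
N, C, H, W = 32, 64, 128, 128
I0 = np.random.randn(N, C, H, W).astype(np.float32)

# N/2 per DIP group, then N/4 per tile, all along the batch dimension.
dip_top, dip_bottom = np.split(I0, 2, axis=0)   # [N/2, C, H, W] each
tile_batches = np.split(I0, 4, axis=0)          # four [N/4, C, H, W] shards
print(dip_top.shape, tile_batches[0].shape)     # (16, 64, 128, 128) (8, 64, 128, 128)
```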


Model Execution
DIPs (x5 per tile edge): P0 [C K1 R S] + I0 [N/2 C H W], P1 [K1 K2 R S] + I0 [N/2 C H W]
Tiles: the [C/2 K1 R S] shards are exchanged between tiles until every tile holds the full [C K1 R S]

Parameters Are Replicated Across the Tiles Just in Time

A single copy of the parameters exists in the entire system - the high bandwidth is used to replicate parameters just in time
Model Execution
DIPs (x5 per tile edge): P0 [C K1 R S] + I0 [N/2 C H W], P1 [K1 K2 R S] + I0 [N/2 C H W]
Tiles: each tile applies the full [C K1 R S] weights to its [N/4 C H W] input shard, producing [N/4 K1 H W]

The First Layer Is Run in a Data Parallel Manner
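A sketch of the last few steps combined: each tile gathers the full [C K1 R S] weights from the per-tile shards just in time, applies them to its own [N/4 C H W] batch shard, and then drops the replicated copy. `conv2d` is a stand-in for the actual tile kernel; only the sharding/replication pattern is the point here.

```python
import numpy as np

def conv2d(x, w):
    """Stand-in for the tile's conv kernel: returns an output of the right
    shape ([n, k_out, H, W]) without performing the real convolution."""
    n, _, h, wd = x.shape
    return np.zeros((n, w.shape[1], h, wd), dtype=x.dtype)

# Load-time state: a tile pair holds the two halves of P0 split along C.
C, K1, R, S = 64, 128, 3, 3
shards = np.split(np.random.randn(C, K1, R, S).astype(np.float32), 2, axis=0)

# Just-in-time replication: assemble the full weight on a tile right before use.
P0_full = np.concatenate(shards, axis=0)                        # [C, K1, R, S]

# Data-parallel layer 1: full weights applied to this tile's N/4 batch shard.
x_tile = np.random.randn(8, C, 128, 128).astype(np.float32)     # [N/4, C, H, W]
y_tile = conv2d(x_tile, P0_full)                                 # [N/4, K1, H, W]

# Discard the replicated copy to keep the SRAM footprint minimal.
del P0_full
print(y_tile.shape)   # (8, 128, 128, 128)
```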


Model Execution
DIPs (x5 per tile edge): P0 [C K1 R S] + I0 [N/2 C H W], P1 [K1 K2 R S] + I0 [N/2 C H W]
Tiles: one [K1/2 K2 R S] shard of P1 on each tile

Parameters For the Next Layer Are Replicated Concurrently


One copy per two tiles; the next layer is better executed in a model-parallel manner
Model Execution
DIPs (x5 per tile edge): P0 [C K1 R S] + I0 [N/2 C H W], P1 [K1 K2 R S] + I0 [N/2 C H W]
Tiles: only the load-time [C/2 K1 R S] shards are retained; just-in-time replicas and consumed inputs are dropped

Discard Replicated Parameters and Input for Minimal SRAM Footprint


Model Execution
DIPs (x5 per tile edge): P0 [C K1 R S] + I0 [N/2 C H W], P1 [K1 K2 R S] + I0 [N/2 C H W]
Tiles: each tile receives a [N/4 K1/2 H W] channel slice of the [N/4 K1 H W] activation

Replicate Input Activation for the Next Layer - Split Across Channels
Only 1 N/4 batch shown
Model Execution
DIPs (x5 per tile edge): P0 [C K1 R S] + I0 [N/2 C H W], P1 [K1 K2 R S] + I0 [N/2 C H W]
Tiles: each tile applies its [K1/2 K2 R S] weight shard to its [N/4 K1/2 H W] slice, producing a [N/4 K2 H W] partial sum

Compute Partial Sum for Each N/4 Batch on Each Tile


Only 1 N/4 batch shown
Model Execution
DIPs (x5 per tile edge): P0 [C K1 R S] + I0 [N/2 C H W], P1 [K1 K2 R S] + I0 [N/2 C H W]
Tiles: the [N/4 K2 H W] partial sums are reduced across the tile pair into one [N/4 K2 H W] result

Reduce Partial Sum for Each N/4 Batch Across Tiles


Small packet sizes, fine-grained synchronization, and a low-latency network make pipelined partial sums work
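A sketch of the model-parallel step above: the [N/4 K1 H W] activation is split across channels, each tile applies its [K1/2 K2 R S] weight shard to its slice, and the [N/4 K2 H W] partial sums are reduced across the tile pair. As before, `conv2d` is a stand-in for the real kernel, and the reduction here is a plain sum rather than the pipelined, fine-grained exchange the hardware performs.

```python
import numpy as np

def conv2d(x, w):
    """Stand-in returning a [n, k_out, H, W] result (real kernel omitted)."""
    n, _, h, wd = x.shape
    return np.zeros((n, w.shape[1], h, wd), dtype=x.dtype)

K1, K2, R, S = 128, 256, 3, 3
act = np.random.randn(8, K1, 128, 128).astype(np.float32)        # [N/4, K1, H, W]
w_shards = np.split(np.random.randn(K1, K2, R, S).astype(np.float32), 2, axis=0)
act_slices = np.split(act, 2, axis=1)                             # [N/4, K1/2, H, W] each

# Each tile computes a partial sum from its channel slice and weight shard...
partials = [conv2d(a, w) for a, w in zip(act_slices, w_shards)]   # [N/4, K2, H, W]

# ...then the partials are reduced across the tile pair.
y = np.sum(partials, axis=0)                                      # [N/4, K2, H, W]
print(y.shape)   # (8, 256, 128, 128)
```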
Model Execution
DIPs (x5 per tile edge): P0 [C K1 R S] + I0 [N/2 C H W], P1 [K1 K2 R S] + I0 [N/2 C H W]
Tiles: every tile ends up with a [N/4 K2 H W] output for its batch shard

Same Computation Runs on Every Other N/4 Batch


A combination of data and model parallelism
End-To-End Training Workflow

Data Loading: File Loading, Decode, Augmentation, Ground Truth Generation
Compute (Training)
Post Processing: Output Compression, File Write
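A minimal sketch (my own framing, not the Dojo software stack) of why the data-loading stage is pipelined with training: a background thread keeps a bounded queue of prepared batches so the compute stage does not wait on file loading, decode, augmentation, or ground-truth generation.

```python
import queue
import threading

def load_batches(out_q, num_batches):
    """Data-loading stage: file loading, decode, augmentation, ground truth."""
    for i in range(num_batches):
        out_q.put(f"batch-{i}")        # stands in for a prepared training batch
    out_q.put(None)                    # sentinel: no more data

def train(in_q):
    """Compute (training) stage, overlapped with loading via the queue."""
    while (batch := in_q.get()) is not None:
        pass                           # run the training step on `batch` here

q = queue.Queue(maxsize=4)             # bounded: loading runs ahead, but not unboundedly
threading.Thread(target=load_batches, args=(q, 16), daemon=True).start()
train(q)
```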
Video-Based Training

Data Loading

Flexible compute required for:
- Augmentation
- Image rectification
- Ground truth generation

Multi-camera, multi-frame models:
- Require decoding GOP_SIZE/2 frames for the first per-camera frame and 1 decode for every frame after (see the worked example below)
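A worked example of the decode cost quoted above: for a clip of F sampled frames per camera, roughly GOP_SIZE/2 decodes are paid for the first frame and one decode for each frame after. The GOP length, clip length, and camera count below are illustrative assumptions.

```python
# Average video-decode cost per sampled frame, per camera, using the rule of
# thumb above (GOP_SIZE/2 decodes for the first frame, 1 for each frame after).
GOP_SIZE = 30          # illustrative group-of-pictures length
FRAMES_PER_CLIP = 8    # illustrative frames sampled per camera
CAMERAS = 8            # illustrative multi-camera setup

decodes_per_camera = GOP_SIZE / 2 + (FRAMES_PER_CLIP - 1)
total_decodes = decodes_per_camera * CAMERAS
print(total_decodes / (FRAMES_PER_CLIP * CAMERAS))   # ~2.75 decodes per used frame
```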
Data Loading Needs of Different Models

[Chart: Decode, PCIe, Storage BW, and CPU core requirements of Model 1 and Model 2, as a percentage of a single host's capacity (0-100%)]
Data Loading Needs of Different Models

[Chart: Decode, PCIe, Storage BW, and CPU core requirements of Model 1, Model 2, and Model 3, as a percentage of a single host's capacity (0-700%)]
Disaggregated Data Loading Tier

[Diagram: data-loading hosts (CPU + DRAM + DNIC) behind an Ethernet switch feed the DIP cards on each tile edge; a global batch is split into shards (Batch 1A-1D), one per data-loading host]
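A sketch of how a global batch might be split across the data-loading hosts, as in the Batch 1A-1D figure: each host prepares its shard and pushes it toward DIP memory. `dma_to_dip` is a hypothetical placeholder for the remote DMA path described earlier.

```python
# Each data-loading host prepares one shard (1A..1D) of the global batch.
HOSTS = ["host-A", "host-B", "host-C", "host-D"]

def shard_indices(global_batch_size, host_rank, num_hosts=len(HOSTS)):
    """Sample indices owned by one loading host for this batch."""
    per_host = global_batch_size // num_hosts
    start = host_rank * per_host
    return range(start, start + per_host)

def dma_to_dip(prepared_shard):
    """Hypothetical stand-in for the remote DMA into DIP DRAM over TTPoE."""
    pass

for rank, name in enumerate(HOSTS):
    idx = shard_indices(global_batch_size=32, host_rank=rank)
    dma_to_dip([f"sample-{i}" for i in idx])   # decode/augment would happen here
```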


Disaggregated Resources

[Diagram: ML Compute, Memory, and IO resources partitioned differently for Model 1, Model 2, and Model 3]

Resources Can Be Partitioned per Job


Dojo Supercomputer for ML Training

- New integration enables high-bandwidth, high-performance ML compute
- Uniform high bandwidth enables full exploitation of parallelism by software
- Vertically integrated I/O addresses all workload bottlenecks, including data loading
