Dojo System v25
Model Architecture
Vision, Path Planning, Auto-Labeling
New Model Architectures
Parameter Sizes Increasing Exponentially
Training Data
Video Training Data With 4D Labels
Ground Truth Generation
Training Infrastructure
Training and Evaluation Pipeline
Accelerated ML Training System
Software at Scale
Typical System
Compute
Fixed Ratio
Memory I/O
Optimized ML Training System
Compute
ML Requirements Evolving
Memory I/O
Disaggregated System Architecture
Compute
Flexible Ratio
Memory I/O
Optimized Compute
Compute
Memory I/O
Technology-Enabled Scaling
System-On-Wafer Technology
- 25 D1 Compute Dies + 40 I/O Dies
- Compute and I/O Dies Optimize Efficiency and Reach
- Heterogeneous RDL Optimized for High-Density and High-Power Layout
Unit of Scale
- Large Compute With Optimized I/O
- Fully Integrated System Module (Power/Cooling)
Uniform High-Bandwidth
- 10 TB/s on-tile bisection bandwidth
- 36 TB/s off-tile aggregate bandwidth
9 PFLOPS BF16/CFP8
11 GB High-Speed ECC SRAM
36 TB/s Aggregate I/O BW
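The tile figures follow from the die count. As a rough cross-check, a minimal sketch assuming Tesla's published per-die D1 numbers (362 TFLOPS BF16/CFP8 and 440 MB SRAM per die), which are not stated on this slide:

```python
# Rough sanity check of the tile-level numbers from per-die figures.
# Assumption: the per-die specs below come from Tesla's published D1
# figures and do not appear on this slide.

D1_DIES_PER_TILE = 25          # "25 D1 Compute Dies" per tile
D1_TFLOPS_BF16 = 362           # assumed per-die BF16/CFP8 throughput
D1_SRAM_MB = 440               # assumed per-die SRAM
EDGES_PER_TILE = 4             # links on all four tile edges
EDGE_BW_TBPS = 9               # 9 TB/s per tile edge (from the diagram)

tile_pflops = D1_DIES_PER_TILE * D1_TFLOPS_BF16 / 1000   # ~9 PFLOPS
tile_sram_gb = D1_DIES_PER_TILE * D1_SRAM_MB / 1000      # ~11 GB
tile_io_tbps = EDGES_PER_TILE * EDGE_BW_TBPS             # 36 TB/s aggregate

print(f"{tile_pflops:.2f} PFLOPS, {tile_sram_gb:.0f} GB SRAM, "
      f"{tile_io_tbps} TB/s aggregate I/O")
```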
Flexible Building Block
[Diagram: tile with 9 TB/s links on each side]
Compute
Memory I/O
V1 Dojo Interface Processor
Compute
Memory I/O
Tesla Transport Protocol
[Chart: bandwidth vs. latency at each level of the hierarchy - D1 node, TTP tile, DIP, TTPoE]
Dojo Interface Processor - Z-Plane Topology
Remote DMA over TTPoE
- DMA to/from any TTP endpoint (compute SRAM, DRAM)
- Leverage switched Ethernet networks
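A minimal sketch of the idea only, not the real wire format: the EtherType, opcode values, field widths, and endpoint numbering below are hypothetical placeholders used to show a DMA descriptor (endpoint, address, length) carried in an ordinary Ethernet frame that any switched network can forward.

```python
import struct

# Hypothetical TTPoE-style frame: the actual protocol layout is not given
# here, so every field below is a placeholder illustrating the concept of
# "remote DMA request inside an Ethernet frame".
ETHERTYPE_TTPOE = 0x88B5            # placeholder: an experimental EtherType
OP_DMA_READ, OP_DMA_WRITE = 1, 2

def build_ttpoe_dma_frame(dst_mac, src_mac, opcode, endpoint_id, addr,
                          length, payload=b""):
    """Pack a (hypothetical) remote-DMA request into a raw Ethernet frame."""
    eth_hdr = dst_mac + src_mac + struct.pack("!H", ETHERTYPE_TTPOE)
    # opcode (1B), endpoint id (2B: e.g. a compute-SRAM bank or DIP DRAM
    # region in this sketch), 64-bit target address, 32-bit transfer length
    dma_hdr = struct.pack("!BHQI", opcode, endpoint_id, addr, length)
    return eth_hdr + dma_hdr + payload

frame = build_ttpoe_dma_frame(
    dst_mac=bytes.fromhex("02aabbccddee"),
    src_mac=bytes.fromhex("021122334455"),
    opcode=OP_DMA_READ,
    endpoint_id=0x0012,      # hypothetical "DIP 2 DRAM region"
    addr=0x1000_0000,
    length=64 * 1024,
)
print(len(frame), "byte request frame")
```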
[Diagram: hosts (CPU, DRAM, DNIC) connected through an Ethernet switch to the DIPs, each carrying HBM, attached to the training tiles]
1 EFLOP BF16/CFP8
1.3 TB High-Speed ECC SRAM
13 TB High-BW DRAM
Disaggregated Scalable System
Compute
Memory I/O
Software at Scale
Model Execution
[Animation frames: mapping a layer across tiles and DIPs (x5 per tile). Layer parameters P0 [C K1 R S] and P1 [K1 K2 R S] are staged in the DIPs and split across tiles along their channel dimensions (shards such as [C/2 K1 R S] and [K1/2 K2 R S]); input activations I0 [N/2 C H W] are split across the batch into [N/4 C H W] shards per tile; each tile produces its slice of the output activations ([N/4 K1 H W], then [N/4 K2 H W]). Replicate input activation for the next layer - split across channels. Only 1 N/4 batch shown.]
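A minimal NumPy sketch of the sharding pattern in these frames, with made-up sizes and a 1x1 kernel so the convolution reduces to a matmul: weights are split across input channels, activations across the batch, and the per-tile partial outputs are reduced and concatenated back into the full result.

```python
import numpy as np

# Illustrative shapes only; real layer sizes, kernel sizes, and tile counts differ.
N, C, H, W, K1 = 8, 16, 4, 4, 32
TILES_BATCH, TILES_CH = 2, 2         # 2-way batch split x 2-way channel split

I0 = np.random.randn(N, C, H, W).astype(np.float32)   # input activations
P0 = np.random.randn(K1, C).astype(np.float32)        # weights: K1 out, C in
                                                       # (slide notation [C K1 R S], R=S=1)

def conv1x1(x, w):
    # [n, c, h, w] x [k, c] -> [n, k, h, w]
    return np.einsum("nchw,kc->nkhw", x, w)

full = conv1x1(I0, P0)                                 # reference [N, K1, H, W]

# Batch split: each tile group handles N/2 samples.  Channel split: within a
# group, each tile holds C/2 input channels of both weights and activations,
# so its output is a partial sum that must be reduced across the split.
out = np.zeros_like(full)
for b, x_b in enumerate(np.split(I0, TILES_BATCH, axis=0)):
    partials = [
        conv1x1(x_c, w_c)                              # [N/2, K1, H, W] partial
        for x_c, w_c in zip(np.split(x_b, TILES_CH, axis=1),
                            np.split(P0, TILES_CH, axis=1))
    ]
    out[b * N // TILES_BATCH:(b + 1) * N // TILES_BATCH] = sum(partials)

print(np.allclose(out, full, atol=1e-4))               # True: shards recompose
```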
Data Loading → Compute (Training) → Post Processing
- Data Loading: File Loading, Decode, Augmentation
- Post Processing: Output, Compression, File Write
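A minimal sketch of these stages in Python; the decode, augmentation, model, and output steps are placeholders standing in for the real video decode, label-generating networks, and output format, which the slide does not specify.

```python
import gzip, json, pathlib

# Placeholder stages matching the pipeline names above, not the actual code.

def load_files(paths):                          # File Loading
    for p in paths:
        yield pathlib.Path(p).read_bytes()

def decode(blob):                               # Decode (stand-in for video decode)
    return json.loads(blob)

def augment(sample):                            # Augmentation (placeholder transform)
    sample["augmented"] = True
    return sample

def run_model(sample):                          # Compute (the label-generating network)
    return {"labels": [], "clip": sample}

def post_process(result, out_path):             # Post Processing: compress + file write
    out_path.write_bytes(gzip.compress(json.dumps(result).encode()))
    return out_path

def ground_truth_pipeline(paths, out_dir):
    out_dir = pathlib.Path(out_dir)
    for i, blob in enumerate(load_files(paths)):
        result = run_model(augment(decode(blob)))
        yield post_process(result, out_dir / f"label_{i:06d}.json.gz")
```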
Ground Truth Generation
Video-Based Training
Data Loading
[Chart: utilization of Decode, PCIE, Storage BW, and CPU Cores for Model 1 and Model 2, 0-100% scale]
Data Loading Needs of Different Models
[Chart: the same resources for Model 1, Model 2, and Model 3, with requirements well above 100%, axis extending to 700%]
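A back-of-the-envelope sketch, with all numbers made up, of the comparison behind these charts: when per-sample decode cost and clip size are high enough, the loading work a model needs exceeds what one training host's CPUs and storage links supply, which motivates the disaggregated data loading tier on the next slide.

```python
# Made-up numbers: does one training host's loading capacity keep up with
# the rate at which the accelerators consume samples?

samples_per_sec_needed = 4000        # throughput the accelerators can sustain
decode_ms_per_sample = 40            # CPU video-decode cost per sample
bytes_per_sample = 2 * 1024**2       # compressed clip size read from storage

host_cpu_cores = 64
host_storage_gibps = 6.0             # GiB/s the host can read from storage

cores_needed = samples_per_sec_needed * decode_ms_per_sample / 1000
storage_needed_gibps = samples_per_sec_needed * bytes_per_sample / 1024**3

print(f"decode: {cores_needed / host_cpu_cores:.0%} of host CPU")              # ~250%
print(f"storage: {storage_needed_gibps / host_storage_gibps:.0%} of host BW")  # ~130%
# Ratios well above 100% mean a single host cannot feed the tiles, so the
# loading work has to be spread across a separately sized tier of hosts.
```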
Disaggregated Data Loading Tier
[Diagram: batches (Batch 1B, Batch 1C) streamed through the switch to the DIP groups (x5) and their tiles, with HBM on the DIPs]
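An in-process analogy for the tier, assuming nothing about the real software stack: threads and a queue stand in for loader hosts and the network path to the DIPs, and the point is only that the number of loaders is sized to the model's data needs, independently of the amount of compute.

```python
import queue, threading, time

# Loader "hosts" (threads) are provisioned independently of the compute
# consumer; the queue stands in for the Ethernet/TTPoE path to the DIPs.

NUM_LOADERS = 3                       # sized to loading needs, not to compute
batches = queue.Queue(maxsize=8)

def loader(worker_id):
    for i in range(4):
        time.sleep(0.01)              # pretend file loading + decode + augmentation
        batches.put(f"batch {worker_id}-{i}")
    batches.put(None)                 # this loader is done

def trainer():
    done = 0
    while done < NUM_LOADERS:
        item = batches.get()
        if item is None:
            done += 1
        # else: pretend a training step consumes the batch

threads = [threading.Thread(target=loader, args=(w,)) for w in range(NUM_LOADERS)]
consumer = threading.Thread(target=trainer)
for t in threads + [consumer]:
    t.start()
for t in threads + [consumer]:
    t.join()
print("all batches consumed")
```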
ML Compute
Memory
IO
… of parallelism by software