Specializing Processors for ML
02.19.2023
Luca Benini — ISSCC 2023 Tutorial T12: Specializing Processors for ML
© 2023 IEEE International Solid-State Circuits Conference
Outline
[Figure sources: ReutherHPEC22 (AI accelerator survey), Researchdive22 (ML market forecast)]
TinyML challenge
AI capabilities within the power envelope of an MCU: 10 mW peak (1 mW average)
TinyML Workloads – DNNs (and More)
Low-Power MCUs
TinyML target: 1 GOP/inference @ 10 fps within 10 mW
1 TOPS/W = 1 pJ/OP
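A quick check that these two targets are the same number (assuming 1 GOP per inference and sustained 10 fps):

$$ 1\,\text{GOP/inference} \times 10\,\text{fps} = 10\,\text{GOPS}; \quad \frac{10\,\text{GOPS}}{10\,\text{mW}} = 1\,\text{TOPS/W} = 1\,\text{pJ/OP} $$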
[Figure: low-power MCUs plotted by performance (CoreMark/MHz, x-axis 2–5.5) against efficiency (y-axis 0–0.4), with the Cortex-M4 and Cortex-M7 marked]
[Figure: 4-stage in-order RISC-V pipeline — IF (PC generation, instruction align and decompress), ID (decode, register-file read, operand forwarding), EX (ALU), WB (register-file write) — with jump and branch handling]

V1: Baseline RISC (not good for ML)
V2: Extensions for data processing — data motion (e.g. auto-increment) and data processing (e.g. MAC)
V3: Domain-specific data processing — narrow bitwidth, HW support for special arithmetic
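As a concrete reference point, the kind of kernel these extension levels target is a fixed-point multiply-accumulate loop; a minimal C sketch (plain C, no extension-specific intrinsics):

    #include <stdint.h>

    // On the V1 baseline every iteration needs separate loads, a multiply, an add,
    // pointer/index updates and a branch; the V2/V3 extensions fold most of these
    // into hardware (auto-increment loads, MAC, packed sub-word arithmetic).
    int32_t dot_q7(const int8_t *a, const int8_t *b, int n) {
        int32_t acc = 0;
        for (int i = 0; i < n; i++)
            acc += (int32_t)a[i] * (int32_t)b[i];   // one MAC per element
        return acc;
    }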
A Way Out: Processor Specialization
3-cycle ALU op, 4-cycle MEM op; the only IPC losses come from load-use and branch stalls [Gautschi et al., TVLSI 2017]
V3: Domain-specific data processing — narrow bitwidth, HW support for special arithmetic
ISA extension cost: core complexity grows from 25 kGE to 40 kGE (1.6x); the extension is energy efficient only if it cuts execution time below roughly 0.6x Texec
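A one-line check of that break-even point, assuming power scales roughly with the 1.6x complexity increase and energy is power times time:

$$ E_{\text{ext}} < E_{\text{base}} \;\Leftarrow\; 1.6\,P_{\text{base}}\,T_{\text{ext}} < P_{\text{base}}\,T_{\text{exec}} \;\Leftrightarrow\; T_{\text{ext}} < T_{\text{exec}}/1.6 \approx 0.62\,T_{\text{exec}} $$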
RISC-V Instruction Set Architecture
Started by UC Berkeley in 2010
The ISA is the contract between SW (applications, OS) and HW
Partitioned into a user spec and a privileged spec, plus external debug
Standard governed by the RISC-V Foundation — necessary for continuity
Compressed instructions (C)
MCU focus
Post-increment addressing:
The register file requires an additional write port (to write back the updated address)
An additional read port is required if the offset is stored in a register
Cons: the extra register-file read port is needed only for pre/post-increment with a register offset
Area cost: processor core area increases by ~5%
Performance: speedup of up to 2x
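A small C sketch (plain C; the mapping to post-increment loads/stores such as p.lbu rd, 1(rs!) is left to the compiler) of the access pattern this extension removes overhead from:

    #include <stdint.h>

    // Scale a Q7 vector: every iteration does a load, a store and two pointer
    // updates; with post-increment addressing the pointer updates fold into the
    // memory operations, removing two addi instructions per element.
    void scale_q7(const int8_t *src, int8_t *dst, int n, int8_t gain) {
        while (n--) {
            int16_t v = (int16_t)(*src++) * gain;   // load with post-increment
            *dst++ = (int8_t)(v >> 7);              // store with post-increment
        }
    }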
Luca Benini T12: Specializing Processors for ML 25 of 68
© 2023 IEEE International Solid-State Circuits Conference
Packed-SIMD
Remember: DNN inference is OK with low-bitwidth operands
Packed-SIMD extensions
A well-known idea from classic DSPs
Widely used (ARM, x86)
SIMD in 32-bit machines: vectors are either 4 x 8-bit elements or 2 x 16-bit elements
8-bit elements are less common (e.g. not available in the ARM Cortex-M4)
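To make the semantics concrete, here is a plain-C model (an illustrative sketch, not the hardware implementation) of what a packed 8-bit sum-of-dot-product such as pv.sdotsp.b computes on 32-bit operands:

    #include <stdint.h>

    // acc += sum over the four signed 8-bit lanes of a[i] * b[i]
    // (the hardware performs all four MACs in a single instruction).
    static inline int32_t sdotp_b(uint32_t a, uint32_t b, int32_t acc) {
        for (int lane = 0; lane < 4; lane++) {
            int8_t ea = (int8_t)(a >> (8 * lane));  // extract signed lane
            int8_t eb = (int8_t)(b >> (8 * lane));
            acc += (int32_t)ea * (int32_t)eb;
        }
        return acc;
    }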
[Figure: dot-product unit datapath — partial product generator feeding a 35:2 compressor]
Scalar inner loop (HW loop, post-increment byte loads, MAC):
  lp.setup x1,a4,stop1
  p.lbu a0,1(a3!)
  p.lbu a1,32(a2!)
  stop1: p.mac a5,a0,a1

Packed-SIMD inner loop (iterates #COL/4 times):
  lp.setup x1,a6,stop1
  p.lw a1,4(t1!)               // load 4 bytes with post-increment
  p.lw a5,4(t3!)
  stop1: pv.sdotsp.b a7,a1,a5  // 4 MACs per instruction
Runtime for three different applications [P. D. Schiavone et al., PATMOS 2017]
[Figure: runtime (lower is better) of three applications — including a scheduler — on RV32EC, RV32IMC and RV32IMCXpulp cores; reported runtimes range from 784 μs to 31 s, and the ISA extensions have a much larger effect on some applications (speedups up to 53.4x) than on others (roughly 1.3x–6.1x)]
Optimized inner-loop instruction mix: 7 sum-of-dot-product, 4 move, 1 shuffle, 3 lw/sw, ~5 control instructions

RV32IMC baseline inner loop (4 pointer updates, 4 byte loads, 4 MACs, 1 branch per iteration):
  addi a0,a0,1          // advance the four operand pointers
  addi t1,t1,1
  addi t3,t3,1
  addi t4,t4,1
  lbu a7,-1(a0)         // load one 8-bit operand from each stream
  lbu a6,-1(t4)
  lbu a5,-1(t3)
  lbu t5,-1(t1)
  mul s1,a7,a6          // four multiply-accumulates into s0, t0, t2, t6
  mul a7,a7,a5
  add s0,s0,s1
  mul a6,a6,t5
  add t0,t0,a7
  mul a5,a5,t5
  add t2,t2,a6
  add t6,t6,a5
  bne s5,a0,1c000bc     // software loop branch
Achieving 100% dotp Unit Utilization
8-bit convolution: RV32IMC (N iterations, baseline above) vs. RV32IMCXpulp (N/4 iterations)

RV32IMCXpulp inner loop — HW loop (lp.setup), loads with post-increment, 8-bit SIMD sdotp:
  lp.setup
    p.lw w1, 4(a0!)
    p.lw w2, 4(a1!)
    p.lw x1, 4(a2!)
    p.lw x2, 4(a3!)
    pv.sdotsp.b s1, w1, x1
    pv.sdotsp.b s2, w1, x2
    pv.sdotsp.b s3, w2, x1
    pv.sdotsp.b s4, w2, x2
  end
9x fewer instructions than RV32IMC — can we also remove the explicit loads?

Yes: fused dotp + ld (8-bit sdotp + LD), streaming one operand through special-purpose registers (the NN-RF), initialized outside the loop:
  init NN-RF (outside of the loop)
  lp.setup
    pv.nnsdotup.h s0, x1, 9
    pv.nnsdotsp.b s1, w2, 0
    pv.nnsdotsp.b s2, w4, 2
    pv.nnsdotsp.b s3, w3, 4
    pv.nnsdotsp.b s4, x1, 14
  end
Encoding: pv.nnsdot{up,usp,sp}.{h,b,n,c} rD, rs1, Imm
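For reference, a plain-C sketch of the 2x2-blocked kernel the Xpulp listing above implements — two weight streams times two activation streams, four running accumulators (the function and variable names here are illustrative, not from the tutorial code):

    #include <stdint.h>

    // acc[0..3] accumulate w1*x1, w1*x2, w2*x1, w2*x2; each iteration of the Xpulp
    // loop covers 4 elements of this loop with 4 pv.sdotsp.b instructions.
    static void dotp_2x2_block(const int8_t *w1, const int8_t *w2,
                               const int8_t *x1, const int8_t *x2,
                               int n, int32_t acc[4]) {
        for (int i = 0; i < n; i++) {
            acc[0] += (int32_t)w1[i] * x1[i];
            acc[1] += (int32_t)w1[i] * x2[i];
            acc[2] += (int32_t)w2[i] * x1[i];
            acc[3] += (int32_t)w2[i] * x2[i];
        }
    }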
Not only RISC-V: ARM v8.1-M
Next: Cortex-M85
M85 vs. M55: 20% more "AI throughput"
M85 vs. M7: 30% more "GP throughput"
[wikichip]
[Figure: mixed sub-byte operand precisions (1b–8b) across network layers, with reported trade-off points of 0.8% / 4x, 2.9% / 8x and 4.4% / 7x, and an 8-bit vector datapath producing packed results from mixed-precision inputs]
How to encode all these instructions?
Goal: HW support for mixed-precision SIMD instructions
Challenge: an enormous number of instructions would have to be encoded in the ISA
Solution: status-based execution
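A minimal C model of the idea (an illustrative sketch, not the actual RTL or ISA encoding): a single dot-product "opcode" reads the operand formats from a status register — a CSR in hardware — instead of encoding every width combination as a separate instruction:

    #include <stdint.h>

    typedef struct { uint8_t bits_a, bits_b; } simd_status_t;   // element widths, e.g. 2, 4, 8

    static int32_t get_lane(uint32_t word, int idx, uint8_t bits) {
        int32_t v = (word >> (idx * bits)) & ((1u << bits) - 1);
        return (v & (1 << (bits - 1))) ? v - (1 << bits) : v;   // sign-extend
    }

    // One generic opcode: its behavior depends on the status register, not on the
    // instruction encoding (simplified: the narrower operand only uses its low lanes).
    static int32_t mixed_sdotp(uint32_t a, uint32_t b, int32_t acc, simd_status_t st) {
        uint8_t wide = st.bits_a > st.bits_b ? st.bits_a : st.bits_b;
        for (int i = 0; i < 32 / wide; i++)
            acc += get_lane(a, i, st.bits_a) * get_lane(b, i, st.bits_b);
        return acc;
    }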
Low-Latency Shared TCDM
[Figure: PULP cluster — RISC-V cores with 32-bit ports sharing a multi-banked Tightly Coupled Data Memory (banking factor BF = 2) through a low-latency logarithmic interconnect]
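With word-level interleaving (an assumption about the mapping, consistent with the banking-factor idea), the address-to-bank function is trivial, which keeps the logarithmic interconnect cheap and banking conflicts rare:

    #include <stdint.h>

    #define N_CORES 8
    #define BF      2                    // banking factor from the slide
    #define N_BANKS (N_CORES * BF)       // e.g. 16 banks for an 8-core cluster

    // Consecutive 32-bit words map to consecutive banks, so cores streaming
    // through an array hit different banks on each access.
    static inline unsigned tcdm_bank(uint32_t addr)   { return (addr >> 2) % N_BANKS; }
    static inline uint32_t tcdm_offset(uint32_t addr) { return (addr >> 2) / N_BANKS; }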
DMA + Fast Synchro [Glaser et al., TPDS 2020]
[Figure: the cluster adds a DMA engine and a HW synchronizer on the logarithmic interconnect; the DMA moves data between L2 memory and the multi-banked TCDM (Mem0–Mem3) shared by the RISC-V cores]
The DMA reduces miss latency: data is moved in tiles between L2 and the TCDM instead of being fetched on demand, while the HW synchronizer provides fast barriers for the cores.
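The pattern this enables is double buffering; a hedged sketch follows, where dma_memcpy_async, dma_wait, cluster_barrier and compute_tile are hypothetical names standing in for the runtime's DMA and synchronization primitives:

    #include <stdint.h>

    extern int  dma_memcpy_async(void *dst, const void *src, uint32_t bytes); // returns transfer id
    extern void dma_wait(int id);        // block until transfer id completes
    extern void cluster_barrier(void);   // HW-synchronizer barrier across the cores
    extern void compute_tile(int8_t *buf, uint32_t bytes);

    // While the cores compute on one TCDM buffer, the DMA prefetches the next tile from L2.
    void process_tiles(const int8_t *l2_src, uint32_t n_tiles, uint32_t tile_bytes,
                       int8_t *tcdm_buf0, int8_t *tcdm_buf1) {
        int8_t *buf[2] = { tcdm_buf0, tcdm_buf1 };
        int id = dma_memcpy_async(buf[0], l2_src, tile_bytes);          // prefetch tile 0

        for (uint32_t t = 0; t < n_tiles; t++) {
            dma_wait(id);                                               // tile t is now in TCDM
            if (t + 1 < n_tiles)                                        // prefetch tile t+1
                id = dma_memcpy_async(buf[(t + 1) & 1],
                                      l2_src + (t + 1) * tile_bytes, tile_bytes);
            compute_tile(buf[t & 1], tile_bytes);                       // all cores work on tile t
            cluster_barrier();                                          // fast HW sync before buffer reuse
        }
    }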
8-Core Cluster + ISA Extensions (22 nm FDX) [Garofalo et al., OJSSCS 2022]
[Figure: energy efficiency (TOPS/W, log scale from 0.001 to 1) for 8-bit, 4-bit and 2-bit convolutions on an STM32L4 (Cortex-M4), an STM32H7 (Cortex-M7), and the 8-core PULP cluster with RI5CY and with XpulpNN + m&l extensions, each at 0.65 V and 0.8 V; the relative factors annotated on the chart include 1.6x, 6x, 7.4x, 146x, 294x, 356x, 401x, 1230x and up to 1600x]
Conclusion
Processor cores are key for ML workloads:
They ensure flexibility
They complement acceleration engines
They are a low-cost solution when extreme performance is not needed
Specialization of MCU-class cores:
Complexity-aware specialization for energy efficiency
All main ISAs are evolving in this direction
Support for aggressive quantization and mixed precision is needed
Need to deal with the "opcode explosion"
From SIMD to vector ISAs, also for MCU-class cores
Parallel ultra-low power to boost performance at high efficiency