0% found this document useful (0 votes)

11 views

Specilizing Processors For ML

The document discusses specializing processors for machine learning by outlining motivations for machine learning on microcontroller units, limitations of classical instruction set architectures for machine learning, instruction set architecture extensions for machine learning, microarchitectures of machine learning specialized cores, and power, performance, and area and energy analysis.

Uploaded by

sairaghubabu

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

11 views

Specilizing Processors For ML

Uploaded by

sairaghubabu

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 171

ISSCC 2023 Tutorials

Specializing Processors for ML

Luca Benini ([email protected])

ISSCC Tutorial 12

02.19.2023
Luca Benini T12: Specializing Processors for ML 1 of 68
© 2023 IEEE International Solid-State Circuits Conference
Outline
Outline

 Motivations for ML on MCU processors (IoT, TinyML)

 Classical instruction set architectures (ISAs) limitations for ML
 ISA Extensions for ML
 Micro-architecture of ML-specialized cores
 PPA (power performance area) and Energy analyisis
 Advanced Micro-architectural optimizations
 Supporting aggressive Quantization
 From single to multiple cores: parallel ultra-low power

Luca Benini T12: Specializing Processors for ML 2 of 68

[ReutherHPEC22]

Luca Benini T12: Specializing Processors for ML 3 of 68

© 2023 IEEE International Solid-State Circuits Conference

Edge ML Market

[Researchdive22]

Luca Benini T12: Specializing Processors for ML 4 of 68

[Researchdive22]

Luca Benini T12: Specializing Processors for ML 4 of 68

[Researchdive22]

TinyML challenge
AI capabilities in the power envelope of an MCU: 10-mW peak (1mW avg)
Luca Benini T12: Specializing Processors for ML 4 of 68
© 2023 IEEE International Solid-State Circuits Conference
TinyML Workloads – DNNs (and More)

“Model zoo” fast evolution  need programmable solutions

Luca Benini T12: Specializing Processors for ML 5 of 68
5
© 2023 IEEE International Solid-State Circuits Conference
TinyML Workloads – DNNs (and More)
[YuTMLR22 - arXiv:2205.01917]

COCO: 91.1%, 1B-param, many GOPS

70%, “Tiny” DNNs 5M-param

“Model zoo” fast evolution  need programmable solutions

Luca Benini T12: Specializing Processors for ML 5 of 68
5
© 2023 IEEE International Solid-State Circuits Conference
TinyML Workloads – DNNs (and More)
[YuTMLR22 - arXiv:2205.01917]

COCO: 91.1%, 1B-param, many GOPS

70%, “Tiny” DNNs 5M-param

“Model zoo” fast evolution  need programmable solutions

Luca Benini T12: Specializing Processors for ML 5 of 68
5
© 2023 IEEE International Solid-State Circuits Conference
TinyML Workloads – DNNs (and More)
[YuTMLR22 - arXiv:2205.01917]

COCO: 91.1%, 1B-param, many GOPS

High OP/B ratio

Massive Parallelism
MAC-dominated
Low precision OK

70%, “Tiny” DNNs 5M-param

“Model zoo” fast evolution  need programmable solutions

Luca Benini T12: Specializing Processors for ML 5 of 68
5
© 2023 IEEE International Solid-State Circuits Conference
TinyML Workloads – DNNs (and More)
[YuTMLR22 - arXiv:2205.01917]

COCO: 91.1%, 1B-param, many GOPS

High OP/B ratio

Massive Parallelism
MAC-dominated
Low precision OK
[Li20 - arXiv:2011.12289 ]

70%, “Tiny” DNNs 5M-param

“Model zoo” fast evolution  need programmable solutions

Luca Benini T12: Specializing Processors for ML 5 of 68
5
© 2023 IEEE International Solid-State Circuits Conference
ML on MCUs?
[Courtesy of J Pineda, private comm.]
High performance MCUs

Low-Power MCUs

1TOPS/W=1pJ/OP 
TinyML (1GOPs/Inf)
@10fps in 10mW

Luca Benini T12: Specializing Processors for ML 6 of 68

© 2023 IEEE International Solid-State Circuits Conference
The Challenge: Energy efficiency@GOPS
ARM Cortex-M MCUs: M0+, M4, M7 (40LP, typ, 1.1V)*
0.7
M0+
0.6
Energy-efficient MCUs

Efficiency (coremark/uW 0.5

0.4

0.3
M4
0.2

0.1
M7
0
2 2.5 3 3.5 4 4.5 5 5.5
Performance (coremark/MHz)

High performance MCUs

*data from ARM’s web

Luca Benini T12: Specializing Processors for ML 7 of 68

7
© 2023 IEEE International Solid-State Circuits Conference
The Challenge: Energy efficiency@GOPS
ARM Cortex-M MCUs: M0+, M4, M7 (40LP, typ, 1.1V)*
0.7
M0+
0.6
Energy-efficient MCUs

Efficiency (coremark/uW 0.5

0.4

0.3
M4
0.2

0.1
M7
0
2 2.5 3 3.5 4 4.5 5 5.5
Performance (coremark/MHz)

High performance MCUs

*data from ARM’s web

Luca Benini T12: Specializing Processors for ML 7 of 68

7
© 2023 IEEE International Solid-State Circuits Conference
High-Performance vs. Energy-Efficient
Instr Memory [Azizi et al. ISCA10]
Instr Data Data Memory
Instr Address Data Address Data

Rd A
I G E
I RF U
PC Align and D X Wr
P F
gen Decompress / /
C /I
E W RF
D Decode
operands X E B
fwd X
Jumps

Branches

“Classical” core performance scaling trajectory

 Faster CLK  deeper pipeline  IPC drops
 Recover IPC  superscalar  ILP bottleneck (dependencies)
 Mitigate ILP bottlenecks  OOO  huge power, area cost!
Luca Benini T12: Specializing Processors for ML 8 of 68
© 2023 IEEE International Solid-State Circuits Conference
A way Out: Processor Specialization
3-cycle ALU-OP, 4-cyle MEM-OP only IPC loss: LD-use, Branch
Instr Memory [Gautschi et al. TVLSI 2017]

Instr Data Data Memory

Instr Address Data Address Data

Rd A
I G E
I RF U
PC Align and D X Wr
P F
gen Decompress / /
C /I
E W RF
D Decode
operands X E B
fwd X
Jumps
Branches

Luca Benini T12: Specializing Processors for ML 9 of 68

9
© 2023 IEEE International Solid-State Circuits Conference
A way Out: Processor Specialization
3-cycle ALU-OP, 4-cyle MEM-OP only IPC loss: LD-use, Branch
Instr Memory [Gautschi et al. TVLSI 2017]

Instr Data Data Memory

Instr Address Data Address Data

Rd A
I G E
I RF U
PC Align and D X Wr
P F
gen Decompress / /
C /I
E W RF
D Decode
operands X E B
fwd X
Jumps
Branches
V1 Baseline RISC (not good for ML) RISC V1

Luca Benini T12: Specializing Processors for ML 9 of 68

Instr Data Data Memory

Instr Address Data Address Data

Luca Benini T12: Specializing Processors for ML 9 of 68

Instr Data Data Memory

Instr Address Data Address Data

Rd A
I G E
I RF U
PC Align and D X Wr
P F
gen Decompress / /
C /I
E W RF
D Decode
operands X E B
fwd X
Jumps
Branches
V1 Baseline RISC (not good for ML) RISC V1
Extensions for Data Processing
V2 Data motion (e.g. auto-increment) V2
Data processing (e.g. MAC) V3
Domain specific data processing
V3 Narrow bitwidth
HW support for special arithmetic
Luca Benini T12: Specializing Processors for ML 9 of 68
9
© 2023 IEEE International Solid-State Circuits Conference
A way Out: Processor Specialization
3-cycle ALU-OP, 4-cyle MEM-OP only IPC loss: LD-use, Branch
Instr Memory [Gautschi et al. TVLSI 2017]

Instr Data Data Memory

Instr Address Data Address Data
M/D
Rd A
I G E
I RF U
PC Align and D X Wr
P F
gen Decompress / / ALU
C /I
E W RF 70% RF+DP
D Decode
operands X E B
fwd X
Jumps RF
Branches
V1 Baseline RISC (not good for ML) RISC V1
Extensions for Data Processing
V2 Data motion (e.g. auto-increment) V2
Data processing (e.g. MAC) V3
Domain specific data processing
V3 Narrow bitwidth
HW support for special arithmetic
Luca Benini T12: Specializing Processors for ML 9 of 68
9
© 2023 IEEE International Solid-State Circuits Conference
A way Out: Processor Specialization
3-cycle ALU-OP, 4-cyle MEM-OP only IPC loss: LD-use, Branch
Instr Memory [Gautschi et al. TVLSI 2017]

Instr Data Data Memory

V3
Domain specific data processing ISA extension cost 25 kGE  40 kGE
Narrow bitwidth
HW support for special arithmetic
(1.6x), energy efficient if 0.6Texec
Luca Benini T12: Specializing Processors for ML 9 of 68
9
© 2023 IEEE International Solid-State Circuits Conference
A way Out: Processor Specialization
3-cycle ALU-OP, 4-cyle MEM-OP only IPC loss: LD-use, Branch
Instr Memory [Gautschi et al. TVLSI 2017]

Instr Data Data Memory

Instr Address Data Address Data
M/D
RdRF
Rd A
I G E
I RF U
PC Align and D X Wr
P F
gen Decompress / / ALU
C /I
E W RF 70% RF+DP
D Decode
operands X E B
fwd XEX
Jumps RF
Branches
V1 Baseline RISC (not good for ML) RISC V1
Extensions for Data Processing
V2 Data motion (e.g. auto-increment) V2
Data processing (e.g. MAC) V3

V3
Domain specific data processing ISA extension cost 25 kGE  40 kGE
Narrow bitwidth
HW support for special arithmetic
(1.6x), energy efficient if 0.6Texec
Luca Benini T12: Specializing Processors for ML 9 of 68
9
© 2023 IEEE International Solid-State Circuits Conference
RISC-V Instruction Set Architecture
SW
 Started by UC-Berkeley in 2010 Applications
 Contract between SW and HW
 Partitioned into user and privileged spec OS
 External Debug
 Standard governed by the RISC-V

Debug
ISA
foundation
 Necessary for the continuity User Privileged

 Defines 32, 64 and 128 bit ISA

 No implementation, just the ISA HW
 Different implementations (open&close source)
 Open ISA ≠ Open Hardware
Luca Benini T12: Specializing Processors for ML 10 of 68
© 2023 IEEE International Solid-State Circuits Conference
23
RISC-V (RV) Foundation Members

A modern, open, free ISA, extensible by construction

Endorsed and Supported by 1000+ Companies

Luca Benini T12: Specializing Processors for ML 11 of 68

A modern, open, free ISA, extensible by construction

Endorsed and Supported by 1000+ Companies

Luca Benini T12: Specializing Processors for ML 11 of 68

© 2023 IEEE International Solid-State Circuits Conference
25
RISC-V ISA Baseline and Extensions
I Integer instructions (frozen)

E Reduced number of registers

 Kept very simple and extendable
 Wide range of applications from IoT to HPC
M Multiplication and Division (frozen)
 RV + word-width + extensions
 RV32IMC: 32bit, integer, multiplication,
A Atomic instructions (frozen)
compressed

F Single-Precision Floating-Point (frozen)  User specification:

 Separated into extensions, only I is mandatory
Double-Precision Floating-Point
D (frozen)  Privileged Specification (WIP):
 Governs OS functionality: Exceptions, Interrupts
C Compressed Instructions (frozen)
 Virtual Addressing
X Non Standard Extensions  Privilege Levels

Luca Benini T12: Specializing Processors for ML 12 of 68

Luca Benini T12: Specializing Processors for ML 13 of 68

Basic Instructions (I)

Luca Benini T12: Specializing Processors for ML 13 of 68

Basic Instructions (I) Multiply/Divide (M)

Luca Benini T12: Specializing Processors for ML 13 of 68

Basic Instructions (I) Multiply/Divide (M)

Compressed
Instructions (C)

Luca Benini T12: Specializing Processors for ML 13 of 68

Privilege Multiply/Divide (M)

Mode
Basic Instructions (I)

Atomic Extensions (A)

Compressed
Instructions (C)

Luca Benini T12: Specializing Processors for ML 13 of 68

Privilege Multiply/Divide (M)

Mode
Basic Instructions (I)

Atomic Extensions (A)

Floating Point Extensions

Compressed
Instructions (C)

Luca Benini T12: Specializing Processors for ML 13 of 68

 There are 32 registers, each 32 / 64 / 128 bits long

 Named x0 to x31
 x0 is hard wired to zero
 There is a standard ‘E’ extension that uses only 16 registers (RV32E)

Luca Benini T12: Specializing Processors for ML 14 of 68

 There are 32 registers, each 32 / 64 / 128 bits long

Luca Benini T12: Specializing Processors for ML 14 of 68

 There are 32 registers, each 32 / 64 / 128 bits long

 Named x0 to x31
 x0 is hard wired to zero
 There is a standard ‘E’ extension that uses only 16 registers (RV32E)
 In addition one program counter (PC)
 Byte based addressing, program counter increments by 4/8/16
 For floating point operation 32 additional FP registers

Luca Benini T12: Specializing Processors for ML 14 of 68

 There are 32 registers, each 32 / 64 / 128 bits long

Luca Benini T12: Specializing Processors for ML 14 of 68

© 2023 IEEE International Solid-State Circuits Conference
36
RV Instructions four basic types
 R register to register operations
 I operations with immediate/constant values
 S / SB operations with two source registers
 U / UJ operations with large immediate/constant value

Luca Benini T12: Specializing Processors for ML 15 of 68

© 2023 IEEE International Solid-State Circuits Conference
37
RV Instructions four basic types
 R register to register operations
 I operations with immediate/constant values
 S / SB operations with two source registers
 U / UJ operations with large immediate/constant value

Luca Benini T12: Specializing Processors for ML 15 of 68

© 2023 IEEE International Solid-State Circuits Conference
38
RV Instructions four basic types
 R register to register operations
 I operations with immediate/constant values
 S / SB operations with two source registers
 U / UJ operations with large immediate/constant value

Luca Benini T12: Specializing Processors for ML 15 of 68

© 2023 IEEE International Solid-State Circuits Conference
39
RV Instructions four basic types
 R register to register operations
 I operations with immediate/constant values
 S / SB operations with two source registers
 U / UJ operations with large immediate/constant value

Luca Benini T12: Specializing Processors for ML 15 of 68

© 2023 IEEE International Solid-State Circuits Conference
40
RV Instructions four basic types
 R register to register operations
 I operations with immediate/constant values
 S / SB operations with two source registers
 U / UJ operations with large immediate/constant value

Luca Benini T12: Specializing Processors for ML 15 of 68

 All operations are on internal registers

 Can not manipulate data in memory directly
 Load instructions to copy from memory to registers
 R-type or I-type instructions to operate on them
 Store instructions to copy from registers back to
memory
 Branch and Jump instructions
 1/3 ALU utilization if operands are from/to memory
(LD, ALU, ST)

Luca Benini T12: Specializing Processors for ML 16 of 68

 Reserved opcodes for standard extensions

 Rest of opcodes free for custom implementations
 Standard extensions will be frozen/not change in the future

Extensibility is integral to the RISC-V ISA design

Luca Benini T12: Specializing Processors for ML 17 of 68
© 2023 IEEE International Solid-State Circuits Conference
43
Encoding of instructions, main groups

 Reserved opcodes for standard extensions

 Rest of opcodes free for custom implementations
 Standard extensions will be frozen/not change in the future

Extensibility is integral to the RISC-V ISA design

Luca Benini T12: Specializing Processors for ML 17 of 68
© 2023 IEEE International Solid-State Circuits Conference
44
Encoding of instructions, main groups

 Reserved opcodes for standard extensions

 Rest of opcodes free for custom implementations
 Standard extensions will be frozen/not change in the future

Extensibility is integral to the RISC-V ISA design

Luca Benini T12: Specializing Processors for ML 17 of 68
© 2023 IEEE International Solid-State Circuits Conference
45
RV Instruction Length is Encoded
 LSB of the instruction tells how long the instruction is
 Supports instructions of 16, 32, 48, 64, 80, 96, … , 320 bit
 Allows RISC-V to have Compressed instructions

MCU focus

Luca Benini T12: Specializing Processors for ML 18 of 68

© 2023 IEEE International Solid-State Circuits Conference
46
How to get efficiency: ISA extensions
addi a0,a0,1 Only one MACs every 4 cycles!
addi t1,t1,1
addi t3,t3,1
addi t4,t4,1
lbu a7,-1(a0) 1. Post modified LD/ST
lbu a6,-1(t4) 2. MAC
lbu a5,-1(t3) 3. HW loop
lbu t5,-1(t1) 4. Packed-SIMD operations with dot product
mul s1,a7,a6 5. Shuffle operations for vectors
mul a7,a7,a5
6. Mac-load
add s0,s0,s1
mul a6,a6,t5
add t0,t0,a7
mul a5,a5,t5
add t2,t2,a6
add t6,t6,a5
bne s5,a0,1c000bc
[M. Gautschi et al. TVLSI2017]

Luca Benini T12: Specializing Processors for ML 19 of 68

 Motivations for ML on MCU processors (IoT, TinyML)

Luca Benini T12: Specializing Processors for ML 20 of 68

 Offset can be stored in:

 Register
 Immediate

Luca Benini T12: Specializing Processors for ML 21 of 68

 Offset can be stored in:

 Register
 Immediate ⇒save 2 additional instructions to update
the read addresses of the operands!

Luca Benini T12: Specializing Processors for ML 21 of 68

© 2023 IEEE International Solid-State Circuits Conference
Post increment LD/ST
 Automatic address update c = 0;
 Update base register with for(i=0;i<100;i++)
computed address after the c = c + a[i]*b[i];
memory access
⇒ Save instructions to update
address register Original RISCV
 Post-increment: addi x4, x0, 64
Lstart :
 Base address serves as memory lb x2, 0(x10)
address lb x3, 0(x12)
addi x10, x10, 1
addi x12, x12, 1
 Offset can be stored in: …..
 Register bne x2,x3, Lstart
 Immediate ⇒save 2 additional instructions to update
the read addresses of the operands!

Luca Benini T12: Specializing Processors for ML 21 of 68

© 2023 IEEE International Solid-State Circuits Conference
Post increment LD/ST
 Automatic address update c = 0;
 Update base register with for(i=0;i<100;i++)
computed address after the c = c + a[i]*b[i];
memory access
⇒ Save instructions to update
address register Original RISCV Auto-incr load/store
 Post-increment: addi x4, x0, 64 addi x4, x0, 64
Lstart : Lstart :
 Base address serves as memory lb x2, 0(x10) lb x2, 0(x10!)
address lb x3, 0(x12) lb x3, 0(x12!)
addi x10, x10, 1 …..
addi x12, x12, 1
 Offset can be stored in: …..
bne x2,x3, Lstart

 Register bne x2,x3, Lstart

 Immediate ⇒save 2 additional instructions to update
the read addresses of the operands!

Luca Benini T12: Specializing Processors for ML 21 of 68

 Register file
requires additional
write port
 Register file
requires additional
read port if offset
is stored in
register

 Processor core area increases by 8-12 %

 Ports can be used for other instructions
Luca Benini T12: Specializing Processors for ML 22 of 68
© 2023 IEEE International Solid-State Circuits Conference
Multiply Accumulate
int acc=0, coeff[N], inp[N];
for(int i=0; i<N; i++)
Acc A B acc += coeff[i] * inp[i];
x
MAC Unit +

Luca Benini T12: Specializing Processors for ML 23 of 68

© 2023 IEEE International Solid-State Circuits Conference
Multiply Accumulate
 Accumulation on 32 bit data p.mac int acc=0, coeff[N], inp[N];
for(int i=0; i<N; i++)
Acc A B acc += coeff[i] * inp[i];
x
MAC Unit +
 Directly on the register file
 Pro:
 Faster access to mac accumulation
 Single cycle mult/mac
 Cons:
 Additional read port on the register file
 used for pre/post increment with
register

Luca Benini T12: Specializing Processors for ML 23 of 68

© 2023 IEEE International Solid-State Circuits Conference
Multiply Accumulate
 Accumulation on 32 bit data p.mac int acc=0, coeff[N], inp[N];
for(int i=0; i<N; i++)
Acc A B acc += coeff[i] * inp[i];
x acc =__builtin_pulp_mac ( inp[i],
MAC Unit + coeff[i],
acc);
 Directly on the register file
 Pro:
 Faster access to mac accumulation
 Single cycle mult/mac
 Cons:
 Additional read port on the register file
 used for pre/post increment with
register

Luca Benini T12: Specializing Processors for ML 23 of 68

© 2023 IEEE International Solid-State Circuits Conference
Multiply Accumulate
 Accumulation on 32 bit data p.mac int acc=0, coeff[N], inp[N];
for(int i=0; i<N; i++)
Acc A B acc += coeff[i] * inp[i];
x acc =__builtin_pulp_mac ( inp[i],
MAC Unit + coeff[i],
acc);
 Directly on the register file
Intrinsics: special functions that map directly to
 Pro: inlined DSP instructions.
 Faster access to mac accumulation However, the compiler can already place the p.mac
 Single cycle mult/mac instruction into the above code!

 Cons:
 Additional read port on the register file
 used for pre/post increment with
register

Luca Benini T12: Specializing Processors for ML 23 of 68

Original RISC-V HW Loop Ext

//initialize counter // init accumulator
mv x4, 100 mv x5, 0
// init accumulator //set number iterations, start and
mv x5, 0 end of the loop
Lstart: lp.setupi 100, Lend
//decrement counter //load elements from mem
addi x4, x4, -1 lw x8, 0(x9)
//load elements from mem lw x10, 0(x11)
lw x8, 0(x9) //update memory pointers
lw x10, 0(x11) add x9, x9, 4
//update memory pointers add x11, x11, 4
add x9, x9, 4 //mac
add x11, x11, 4 mul x8, x8, x10
//mac Lend: add x5, x5, x8
mul x8, x8, x10
add x5, x5, x8 No counter and
bne x4, x0, Lstart
branch overhead!
Luca Benini T12: Specializing Processors for ML 24 of 68
© 2023 IEEE International Solid-State Circuits Conference
59
Hardware loops
c = 0;
 Hardware loop setup with: for(i=0;i<100;i++)
 3 separate instructions c = c + a[i]*b[i];
lp.start, lp.end, lp.count, lp.counti
⇒ No restriction on start/end address Original RISC-V HW Loop Ext
//initialize counter // init accumulator
mv x4, 100 mv x5, 0
// init accumulator //set number iterations, start and
mv x5, 0 end of the loop
Lstart: lp.setupi 100, Lend
//decrement counter //load elements from mem
addi x4, x4, -1 lw x8, 0(x9)
//load elements from mem lw x10, 0(x11)
lw x8, 0(x9) //update memory pointers
lw x10, 0(x11) add x9, x9, 4
//update memory pointers add x11, x11, 4
add x9, x9, 4 //mac
add x11, x11, 4 mul x8, x8, x10
//mac Lend: add x5, x5, x8
mul x8, x8, x10
add x5, x5, x8 No counter and
bne x4, x0, Lstart
branch overhead!
Luca Benini T12: Specializing Processors for ML 24 of 68
© 2023 IEEE International Solid-State Circuits Conference
60
Hardware loops
c = 0;
 Hardware loop setup with: for(i=0;i<100;i++)
 3 separate instructions c = c + a[i]*b[i];
lp.start, lp.end, lp.count, lp.counti
⇒ No restriction on start/end address Original RISC-V HW Loop Ext
//initialize counter // init accumulator
mv x4, 100 mv x5, 0
 Fast setup instructions // init accumulator //set number iterations, start and
mv x5, 0 end of the loop
lp.setup, lp.setupi Lstart: lp.setupi 100, Lend
//decrement counter //load elements from mem
⇒ Start address= PC + 4 addi x4, x4, -1 lw x8, 0(x9)
⇒ End address= start address + //load elements from mem lw x10, 0(x11)
lw x8, 0(x9) //update memory pointers
offset lw x10, 0(x11) add x9, x9, 4
//update memory pointers add x11, x11, 4
⇒ Counter from immediate/register add x9, x9, 4 //mac
add x11, x11, 4 mul x8, x8, x10
//mac Lend: add x5, x5, x8
mul x8, x8, x10
add x5, x5, x8 No counter and
bne x4, x0, Lstart
branch overhead!
Luca Benini T12: Specializing Processors for ML 24 of 68
© 2023 IEEE International Solid-State Circuits Conference
61
Hardware Loop Implementation

 Hardware loops or Zero Overhead Loops to remove

branch overhead in for loops
 After configuration with start, end, count variables
comparison and branch not required
 Smaller loop benefit more
 Loop needs to be set up beforehand and is fully defined
by:
 Start address
 End address  Two sets registers implemented to support
 Counter nested loops

 Area cost:
 Processor core area increases by 5%
 Performance:
 Speedup can be up to factor 2!
Luca Benini T12: Specializing Processors for ML 25 of 68
© 2023 IEEE International Solid-State Circuits Conference
Packed-SIMD
Remember: DNN inference is OK with low-bitwidth operands
 Packed-SIMD extensions
 Well known idea from classic DSP
 Widely used (ARM, x86)
 SIMD in 32bit machines
 Vectors are either 4 8bits-elements or 2 16bits-elements
 8 elements less common (e.g. not available in ARM cortex M4)

Computation add, sub, shift, avg, abs, dot

product
Compare min, max, compare
Manipulate extract, pack, shuffle

Luca Benini T12: Specializing Processors for ML 26 of 68

© 2023 IEEE International Solid-State Circuits Conference
63
Packed-SIMD ALU Instructions
 Same Register-file
 The instruction encodes how to interpret the content of the register
rs1 0x03 0x02 0x01 0x00

rs2 0x0D 0x0C 0x0B 0x0A

add rD, rs1, rs2 rD = 0x03020100 +

0x0D0C0B0A
add.h rD, rs1, rs2 rD[0] = 0x0100 + 0x0B0A
rD[1] = 0x0302 + 0x0D0C
add.b rD, rs1, rs2 rD[0] = 0x00 + 0x0A
rD[1] = 0x01 + 0x0B
rD[2] = 0x02 + 0x0C
rD[3] = 0x03 + 0x0D

Luca Benini T12: Specializing Processors for ML 27 of 68

 Advanced ALU for SIMD extensions

 Optimized datapath to reduce
resources
 Multiple-adders for round
 Adder followed by shifter for fixed
point normalization
 Clip unit uses one adder as
comparator and the main
comparator

Luca Benini T12: Specializing Processors for ML 28 of 68

© 2023 IEEE International Solid-State Circuits Conference
Expanding SIMD Dot Product
 Dot Product: (half word example)
C[31:0] = A[31:16]*B[31:16] + A[15:0]*B[15:0] + C[31:0]

32 bit 32 bit 32 bit

 2 multiplications, 1 addition, 1 accumulation in 1 cycle (2x for bytes)

Partial Product
Generator
35:2 compressor

Luca Benini T12: Specializing Processors for ML 29 of 68

 16x16b with sign selection for

short multiplications [with round
and normalization]. 5 cycles FSM
for higher 64-bits (mulh*
instructions)

 32x32b single cycle MAC/MUL

unit

 16x16b short parallel dot

product

 8x8b byte parallel dot product

 clock gating to reduce switching

activity between the scalar and
SIMD multipliers

Luca Benini T12: Specializing Processors for ML 30 of 68

© 2023 IEEE International Solid-State Circuits Conference
Compiler Builtins for Extension
Computation add, sub, shift, avg, abs, dot
SIMD Instructions product
of the Xpulp ISA Compare min, max, compare
extension
Manipulate extract, pack, shuffle

Luca Benini T12: Specializing Processors for ML 31 of 68

 Dot-product without accumulation between unsigned char vectors (v4u):

S = __builtin_dotup4(A, B); // S = A[0]*B[0] + A[1]*B[1] + A[2]*B[2] + A[3]*B[3], A is v4u, B is v4u

Luca Benini T12: Specializing Processors for ML 31 of 68

 Dot-product without accumulation between unsigned char vectors (v4u):

Luca Benini T12: Specializing Processors for ML 31 of 68

 Dot-product without accumulation between unsigned char vectors (v4u):

Luca Benini T12: Specializing Processors for ML 31 of 68

 Dot-product without accumulation between unsigned char vectors (v4u):

Luca Benini T12: Specializing Processors for ML 31 of 68

 Dot-product without accumulation between unsigned char vectors (v4u):

S = __builtin_dotup4(A, B); // S = A[0]*B[0] + A[1]*B[1] + A[2]*B[2] + A[3]*B[3], A is v4u, B is v4u
 Dot-product without accumulation between signed char vectors (v4s):
S = __builtin_dotsp4(A, B); // S = A[0]*B[0] + A[1]*B[1] + A[2]*B[2] + A[3]*B[3] , A is v4s, B is v4s
 Also with mixed signs:
S = __builtin_dotusp4(A, B); // S = A[0]*B[0] + A[1]*B[1] + A[2]*B[2] + A[3]*B[3] , A is v4u, B is v4s
 Similar builtins without accumulation for short vectors:
S = __builtin_dotup2(A, B); S = __builtin_dotsp2(A, B); S = __builtin_dotusp2(A, B);
 All of these are also available with accumulation (over accumulator S):
S = __builtin_sdotup4(A, B, S); S = __builtin_sdotsp4(A, B, S); S = __builtin_sdotusp4(A, B, S);
S = __builtin_sdotup2(A, B, S); S = __builtin_sdotsp2(A, B, S); S = __builtin_sdotusp2(A, B, S);

Luca Benini T12: Specializing Processors for ML 31 of 68

© 2023 IEEE International Solid-State Circuits Conference
ISA Extensions at Work
 The innermost loop has 4x less iterations
 4 bytes per matrix are loaded as a 32b word
 Dot product with accumulation performs in 1 cycle 4 macs

… … //iterate #COL/4
lp.setup x1,a4,stop1 lp.setup x1,a6,stop1
p.lbu a0,1(a3!) p.lw a1,4(t1!) //load 4-bytes with post inc
p.lbu a1,32(a2!) p.lw a5,4(t3!)
stop1: p.mac a5,a0,a1 stop1: pv.sdotsp.b a7,a1,a5 //4 mac
… …

Luca Benini T12: Specializing Processors for ML 32 of 68

© 2023 IEEE International Solid-State Circuits Conference
74
Extensions at work
for (i = 0; i < 100; i++)
d[i] = a[i] + b[i];
Baseline Auto-incr load/store HW Loop Packed-SIMD
mv x5, 0 mv x5, 0 lp.setupi 100, Lend lp.setupi 25, Lend
mv x4, 100 mv x4, 100 lb x2, 0(x10!) lw x2, 0(x10!)
Lstart: Lstart: lb x3, 0(x11!) lw x3, 0(x11!)
lb x2, 0(x10) lb x2, 0(x10!) add x2, x3, x2 pv.add.b x2, x3, x2
lb x3, 0(x11) lb x3, 0(x11!) Lend: sb x2, 0(x12!) Lend: sw x2, 0(x12!)
addi x10,x10, 1 addi x4, x4, -1
addi x11,x11, 1 add x2, x3, x2
add x2, x3, x2 sb x2, 0(x12!)
sb x2, 0(x12) bne x4, x5, Lstart
addi x4, x4, -1
addi x12,x12, 1
bne x4, x5, Lstart

11 cycles/output 8 cycles/output 5 cycles/output 1,25 cycles/output

Luca Benini T12: Specializing Processors for ML 33 of 68
© 2023 IEEE International Solid-State Circuits Conference
75
Runtime for three different applications
[P. D. Schiavone et al. PATMO17]

RV32IMCXpulp
RV32IMC
RV32EC
x53.4 x3.5
x1 x1

x1.3

x6.1

Scheduler
Luca Benini T12: Specializing Processors for ML 34 of 68
© 2023 IEEE International Solid-State Circuits Conference
76
Runtime for three different applications
[P. D. Schiavone et al. PATMO17]

RV32IMCXpulp
RV32IMC
RV32EC
x53.4 x3.5
x1 x1
Better

x1.3

x6.1

EEMBC Coremark Scheduler

Luca Benini T12: Specializing Processors for ML 34 of 68
© 2023 IEEE International Solid-State Circuits Conference
77
Runtime for three different applications
[P. D. Schiavone et al. PATMO17]

RV32IMCXpulp
Extensions have more effect RV32IMC
RV32EC
x53.4 x3.5
x1 x1
Better

x1.3

x6.1

2D Convolution EEMBC Coremark Scheduler

Luca Benini T12: Specializing Processors for ML 34 of 68
© 2023 IEEE International Solid-State Circuits Conference
78
Different cores for diff. Area budget

x2.2
Better

x3.5

RV32IMCXpulp RV32IMC RV32EC

Luca Benini T12: Specializing Processors for ML 35 of 68
© 2023 IEEE International Solid-State Circuits Conference
79
Different cores for diff. Power budgets

x2.4
x2.7
Better

RV32IMCXpulp RV32IMC RV32EC

Luca Benini T12: Specializing Processors for ML 36 of 68
© 2023 IEEE International Solid-State Circuits Conference
80
Energy: 2D-Conv @55MHz, 0.8V

RV32IMCXpulp
RV32IMC
RV32EC

41.6 ms

4.78 ms

784 μs

Luca Benini T12: Specializing Processors for ML 37 of 68

© 2023 IEEE International Solid-State Circuits Conference
81
Energy: 2D-Conv @55MHz, 0.8V
More frequent events/ processing
RV32IMCXpulp
RV32IMC
RV32EC

41.6 ms

Better
4.78 ms

784 μs

Luca Benini T12: Specializing Processors for ML 37 of 68

© 2023 IEEE International Solid-State Circuits Conference
82
Energy: 2D-Conv @55MHz, 0.8V
More frequent events/ processing
RV32IMCXpulp
RV32IMC
RV32EC

41.6 ms

Better
4.78 ms

784 μs

Luca Benini T12: Specializing Processors for ML 37 of 68

© 2023 IEEE International Solid-State Circuits Conference
83
Energy: 2D-Conv @55MHz, 0.8V
More frequent events/ processing
RV32IMCXpulp
RV32IMC
RV32EC

41.6 ms

Better
4.78 ms

649 ms
784 μs

Luca Benini T12: Specializing Processors for ML 37 of 68

© 2023 IEEE International Solid-State Circuits Conference
84
Energy: 2D-Conv @55MHz, 0.8V
More frequent events/ processing
RV32IMCXpulp
RV32IMC
RV32EC

41.6 ms

Better
4.78 ms

649 ms 31 s
784 μs

Luca Benini T12: Specializing Processors for ML 37 of 68

 Motivations for ML on MCU processors (IoT, TinyML)

Luca Benini T12: Specializing Processors for ML 38 of 68

© 2023 IEEE International Solid-State Circuits Conference
Advanced SIMD: Shuffle Instruction
 To use the SIMD unit the
elements have to be aligned in
the register file rD
 Shuffle allows to recombine
bytes into 1 register: Mask bits
rA
 pv.shuffle2.b rD, rA, rB

rD{3} = (rB[26]==0) ? rA:rD {rB[25:24]} rB

rD{2} = (rB[18]==0) ? rA:rD {rB[17:16]}
rD{1} = (rB[10]==0) ? rA:rD {rB[ 9: 8]}
rD{0} = (rB[ 2]==0) ? rA:rD {rB[ 1: 0]}

 With rX{i} = rX[(i+1)8-1:i8] rD =

Luca Benini T12: Specializing Processors for ML 39 of 68

© 2023 IEEE International Solid-State Circuits Conference
87
Shuffle for Direct SIMD Convolution
Convolution in
registers
5x5 convolutional filter

7 Sum-of-dot-product
4 move
1 shuffle
3 lw/sw
~ 5 control instructions

Luca Benini T12: Specializing Processors for ML 40 of 68

© 2023 IEEE International Solid-State Circuits Conference
88
Shuffle for Direct SIMD Convolution
Convolution in
registers
5x5 convolutional filter

7 Sum-of-dot-product
4 move
1 shuffle
3 lw/sw
~ 5 control instructions

Significant benefit in reuse of registers and less LD/ST

Luca Benini T12: Specializing Processors for ML 40 of 68
© 2023 IEEE International Solid-State Circuits Conference
89
GEMM-based Convolution
8-bit Convolution example
CMSIS-NN based Matrix Multiplication Layout: 2x2
x1 x2
S1 S2
w1
w2 × = S3 S4

Luca Benini T12: Specializing Processors for ML 41 of 68

© 2023 IEEE International Solid-State Circuits Conference
90
GEMM-based Convolution
8-bit Convolution example
CMSIS-NN based Matrix Multiplication Layout: 2x2
x1 x2
S1 S2
w1
w2 × = S3 S4

RegisterFile 2x2: 43%

of the RI5CY core: utilization
32 general purpose
registers

Luca Benini T12: Specializing Processors for ML 41 of 68

© 2023 IEEE International Solid-State Circuits Conference
91
GEMM-based Convolution
8-bit Convolution example
CMSIS-NN based Matrix Multiplication Layout: 2x2 PULP-NN Matrix Multiplication Layout: 4x2
x1 x2 x1 x2
S1 S2
S1 S2 w1
w1
× = w2 S3 S4
w2 S3 S4
w3 × = S5 S6
w4 S7 S8

RegisterFile 2x2: 43% 4x2: 69%

of the RI5CY core: utilization utilization
32 general purpose
registers

Luca Benini T12: Specializing Processors for ML 41 of 68

© 2023 IEEE International Solid-State Circuits Conference
92
GEMM-based Convolution
8-bit Convolution example
CMSIS-NN based Matrix Multiplication Layout: 2x2 PULP-NN Matrix Multiplication Layout: 4x2
x1 x2 x1 x2
S1 S2
S1 S2 w1
w1
× = w2 S3 S4
w2 S3 S4
w3 × = S5 S6
w4 S7 S8

RegisterFile 2x2: 43% 4x2: 69% More Data Reuse

of the RI5CY core: utilization utilization &
32 general purpose Higher utilization of the RF
registers
Peak Performance (8 cores)
2x2 :12.8 MAC/cyc
4x2 :15.5 MAC/cyc

Luca Benini T12: Specializing Processors for ML 41 of 68

© 2023 IEEE International Solid-State Circuits Conference
93
GEMM-based Convolution
8-bit Convolution example
CMSIS-NN based Matrix Multiplication Layout: 2x2 PULP-NN Matrix Multiplication Layout: 4x2
x1 x2 x1 x2
S1 S2
S1 S2 w1
w1
× = w2 S3 S4
w2 S3 S4
w3 × = S5 S6
w4 S7 S8

RegisterFile 2x2: 43% 4x2: 69% More Data Reuse

of the RI5CY core: utilization utilization &
32 general purpose Higher utilization of the RF
registers
Peak Performance (8 cores)
2x2 :12.8 MAC/cyc
4x2 :15.5 MAC/cyc

Never underestimate the importance of registers! How to get more?

Luca Benini T12: Specializing Processors for ML 41 of 68
© 2023 IEEE International Solid-State Circuits Conference
94
Achieving 100% dotp Unit Utilization
8-bit Convolution RV32IMC

addi a0,a0,1
addi t1,t1,1
addi t3,t3,1
addi t4,t4,1
lbu a7,-1(a0)
lbu a6,-1(t4)
lbu a5,-1(t3)
lbu t5,-1(t1)
mul s1,a7,a6
mul a7,a7,a5
add s0,s0,s1
mul a6,a6,t5
add t0,t0,a7
mul a5,a5,t5
add t2,t2,a6
add t6,t6,a5
bne s5,a0,1c000bc

Luca Benini T12: Specializing Processors for ML 42 of 68

42
© 2023 IEEE International Solid-State Circuits Conference
Achieving 100% dotp Unit Utilization
8-bit Convolution RV32IMC RV32IMCXpulp
lp.setup
addi a0,a0,1 addi a0,a0,1
HW Loop
addi t1,t1,1 addi t1,t1,1
addi t3,t3,1 addi t3,t3,1
addi t4,t4,1 addi t4,t4,1
lbu a7,-1(a0) lbu a7,-1(a0)
lbu a6,-1(t4) lbu a6,-1(t4)
lbu a5,-1(t3) lbu a5,-1(t3)
lbu t5,-1(t1) lbu t5,-1(t1)
mul s1,a7,a6 mul s1,a7,a6
mul a7,a7,a5 mul a7,a7,a5
add s0,s0,s1 add s0,s0,s1
mul a6,a6,t5 mul a6,a6,t5
add t0,t0,a7 add t0,t0,a7
mul a5,a5,t5 mul a5,a5,t5
add t2,t2,a6 add t2,t2,a6
add t6,t6,a5 add t6,t6,a5
bne s5,a0,1c000bc end

Luca Benini T12: Specializing Processors for ML 42 of 68

42
© 2023 IEEE International Solid-State Circuits Conference
Achieving 100% dotp Unit Utilization
8-bit Convolution RV32IMC RV32IMCXpulp
lp.setup
lp.setup
addi a0,a0,1 addi a0,a0,1
HW Loop p.lw w1, 4(a0!)
addi t1,t1,1 addi t1,t1,1
p.lw w2, 4(a1!)
LD/ST with post addi t3,t3,1 addi t3,t3,1
p.lw x1, 4(a2!)
increment addi t4,t4,1 addi t4,t4,1
p.lw x2, 4(a3!)
lbu a7,-1(a0) lbu s1,a7,a6
a7,-1(a0)
mul
lbu a6,-1(t4) lbu a7,a7,a5
a6,-1(t4)
mul
lbu a5,-1(t3) lbu a5,-1(t3)
add s0,s0,s1
lbu t5,-1(t1) lbu a6,a6,t5
t5,-1(t1)
mul
mul s1,a7,a6 mul t0,t0,a7
s1,a7,a6
add
mul a7,a7,a5 mul a5,a5,t5
a7,a7,a5
mul
add s0,s0,s1 add t2,t2,a6
s0,s0,s1
add
mul a6,a6,t5 mul t6,t6,a5
a6,a6,t5
add
add t0,t0,a7
end t0,t0,a7
add
mul a5,a5,t5 mul a5,a5,t5
add t2,t2,a6 add t2,t2,a6
add t6,t6,a5 add t6,t6,a5
bne s5,a0,1c000bc end

Luca Benini T12: Specializing Processors for ML 42 of 68

42
© 2023 IEEE International Solid-State Circuits Conference
Achieving 100% dotp Unit Utilization
8-bit Convolution RV32IMC RV32IMCXpulp
N N/
lp.setup
lp.setup
lp.setup
addi a0,a0,1 addi w1,
p.lw a0,a0,1
HW Loop p.lw w1, 4(a0!)
4(a0!)
addi t1,t1,1 4 addi
p.lw t1,t1,1
w2, 4(a1!)
p.lw w2, 4(a1!)
LD/ST with post addi t3,t3,1 addi t3,t3,1
p.lw
p.lw x1,
x1, 4(a2!)
4(a2!)
increment addi t4,t4,1 addi
p.lw t4,t4,1
p.lw x2,
x2, 4(a3!)
4(a3!)
lbu a7,-1(a0) lbu a7,-1(a0)
pv.sdotsp.b
mul s1,a7,a6s1, w1, x1
lbu a6,-1(t4) lbu a7,a7,a5
a6,-1(t4)
pv.sdotsp.b s2, w1, x2
mul
8-bit SIMD sdotp lbu a5,-1(t3) lbu a5,-1(t3)
pv.sdotsp.b
add s0,s0,s1 s3, w2, x1
lbu t5,-1(t1) lbu a6,a6,t5
t5,-1(t1)s4,
pv.sdotsp.b w2, x2
mul
mul s1,a7,a6 mul t0,t0,a7
s1,a7,a6
end
add
mul a7,a7,a5 mul a5,a5,t5
a7,a7,a5
mul
add s0,s0,s1 add t2,t2,a6
s0,s0,s1
add
mul a6,a6,t5 mul t6,t6,a5
a6,a6,t5
add
add t0,t0,a7
end t0,t0,a7
add
mul a5,a5,t5 mul a5,a5,t5
add t2,t2,a6 add t2,t2,a6
add t6,t6,a5 add t6,t6,a5
bne s5,a0,1c000bc end

Luca Benini T12: Specializing Processors for ML 42 of 68

9x less
instructions
than RV32IMC
Luca Benini T12: Specializing Processors for ML 42 of 68
42
© 2023 IEEE International Solid-State Circuits Conference
Achieving 100% dotp Unit Utilization
8-bit Convolution RV32IMC RV32IMCXpulp
N N/
lp.setup
lp.setup
lp.setup can we remove?
addi a0,a0,1 addi w1,
p.lw a0,a0,1
HW Loop p.lw w1, 4(a0!)
4(a0!)
addi t1,t1,1 4 addi
p.lw t1,t1,1
w2, 4(a1!)
p.lw w2, 4(a1!)
LD/ST with post addi t3,t3,1 addi t3,t3,1
p.lw
p.lw x1,
x1, 4(a2!)
4(a2!)
increment addi t4,t4,1 addi
p.lw t4,t4,1
p.lw x2,
x2, 4(a3!)
4(a3!)
lbu a7,-1(a0) lbu a7,-1(a0)
pv.sdotsp.b
mul s1,a7,a6s1, w1, x1
lbu a6,-1(t4) lbu a7,a7,a5
a6,-1(t4)
pv.sdotsp.b s2, w1, x2
mul
8-bit SIMD sdotp lbu a5,-1(t3) lbu a5,-1(t3)
pv.sdotsp.b
add s0,s0,s1 s3, w2, x1
lbu t5,-1(t1) lbu a6,a6,t5
t5,-1(t1)s4,
pv.sdotsp.b w2, x2
mul
mul s1,a7,a6 mul t0,t0,a7
s1,a7,a6
end
add
mul a7,a7,a5 mul a5,a5,t5
a7,a7,a5
mul
add s0,s0,s1 add t2,t2,a6
s0,s0,s1
add
mul a6,a6,t5 mul t6,t6,a5
a6,a6,t5
add
add t0,t0,a7
end t0,t0,a7
add
mul a5,a5,t5 mul a5,a5,t5
add t2,t2,a6 add t2,t2,a6
add t6,t6,a5 add t6,t6,a5
bne s5,a0,1c000bc end

9x less
instructions
than RV32IMC
Luca Benini T12: Specializing Processors for ML 42 of 68
42
© 2023 IEEE International Solid-State Circuits Conference
Achieving 100% dotp Unit Utilization
8-bit Convolution RV32IMC RV32IMCXpulp
Yes! dotp+ld
N N/
lp.setup
lp.setup
lp.setup can we remove?
addi a0,a0,1 addi w1,
p.lw a0,a0,1
HW Loop p.lw w1, 4(a0!)
4(a0!) Init NN-RF (outside of the loop)
addi t1,t1,1 4 addi
p.lw t1,t1,1
w2, 4(a1!)
p.lw w2, 4(a1!) lp.setup
LD/ST with post addi t3,t3,1 addi t3,t3,1
addi t4,t4,1
p.lw
p.lw x1,
x1, 4(a2!)
4(a2!) N/4 pv.nnsdotup.h s0,ax1,9
increment addi
p.lw t4,t4,1
p.lw x2,
x2, 4(a3!)
4(a3!)
lbu a7,-1(a0) pv.nnsdotsp.b s1, aw2, 0
lbu a7,-1(a0)
pv.sdotsp.b
mul s1,a7,a6s1, w1, x1
lbu a6,-1(t4) pv.nnsdotsp.b s2, aw4, 2
lbu a7,a7,a5
mul a6,-1(t4)
pv.sdotsp.b s2, w1, x2
lbu a5,-1(t3) pv.nnsdotsp.b s3, aw3, 4
8-bit SIMD sdotp lbu a5,-1(t3)
pv.sdotsp.b
add s0,s0,s1 s3, w2, x1
lbu t5,-1(t1) pv.nnsdotsp.b s4, ax1, 14
lbu a6,a6,t5
mul t5,-1(t1)s4,
pv.sdotsp.b w2, x2
mul s1,a7,a6 end
8-bit sdotp + LD mul t0,t0,a7
end
add s1,a7,a6
mul a7,a7,a5 mul a5,a5,t5
a7,a7,a5
mul
add s0,s0,s1 add t2,t2,a6
s0,s0,s1 pv.nnsdot{up,usp,sp}.{h,b,n,c} rD, rs1, Imm
add
mul a6,a6,t5 mul t6,t6,a5
a6,a6,t5
add
add t0,t0,a7
end t0,t0,a7
add
mul a5,a5,t5 mul a5,a5,t5
add t2,t2,a6 add t2,t2,a6
add t6,t6,a5 add t6,t6,a5
bne s5,a0,1c000bc end

9x less 14.5x less instructions

instructions at an extra 3% area cost
than RV32IMC (~600GEs)
Luca Benini T12: Specializing Processors for ML 42 of 68
42
© 2023 IEEE International Solid-State Circuits Conference
Hardware for dotp+ld

[A. Garofalo et al., TEC21]

Luca Benini T12: Specializing Processors for ML 43 of 68

43
© 2023 IEEE International Solid-State Circuits Conference
Hardware for dotp+ld
NN RF: 6 32-bit regs (weights and input activations)

[A. Garofalo et al., TEC21]

Luca Benini T12: Specializing Processors for ML 43 of 68

43
© 2023 IEEE International Solid-State Circuits Conference
Hardware for dotp+ld
NN RF: 6 32-bit regs (weights and input activations)

[A. Garofalo et al., TEC21]

Luca Benini T12: Specializing Processors for ML 43 of 68

43
© 2023 IEEE International Solid-State Circuits Conference
Hardware for dotp+ld
NN RF: 6 32-bit regs (weights and input activations)

[A. Garofalo et al., TEC21]

Luca Benini T12: Specializing Processors for ML 43 of 68

43
© 2023 IEEE International Solid-State Circuits Conference
Hardware for dotp+ld
NN RF: 6 32-bit regs (weights and input activations)

[A. Garofalo et al., TEC21]

Special-purpose registers
Luca Benini T12: Specializing Processors for ML 43 of 68
43
© 2023 IEEE International Solid-State Circuits Conference
Not only RISC-V: ARM v8.1-M

 New embedded vector ISA Helium (MVE)

 Uses 8 128-bit vector registers (reuses the 32 FP registers)
 ISA enhancements for loops, branches (Low Overhead
Branch Extension)
 Instructions for half-precision floating-point support

Luca Benini T12: Specializing Processors for ML 44 of 68

© 2023 IEEE International Solid-State Circuits Conference
ARM MVE Vectors
 SIMD capability for Cortex-M CPUs: 128-bit registers used to hold e.g. 16
separate 8-bit values (vector)
 An instruction operates on each value independently (with predication)
 The position of an element in a vector is called lane
 Extension of ARM Thumb
 Vectors of elements of the same data type: Int/FP
 Integers: signed/unsigned 8-, 16-, 32-bit, fixed-point saturating (Q7, Q15, Q31)
 Floating-point elements may be single (32-bit) or half precision (16-bit).

Luca Benini T12: Specializing Processors for ML 45 of 68

© 2023 IEEE International Solid-State Circuits Conference
ARM MVE Vector Execution Model
 MVE permits instruction execution to be interleaved
 Multiple instructions may overlap in the pipeline execute stage.
 E.g. a Vector Load (VLDR) instruction which reads multiple words into a vector register may
execute at the same time as a Vector Multiply (VMUL) instruction which uses that data
 It is up to the CPU hardware designer to decide how many “beats” are executed on each
clock cycle (eg. 32-bit datapath vs. 64-bit datapath)

 Complicates exception handling

 Beat D of the VLDR happens after beat A of the VMLA has completed. If memory for
beat D triggers a fault, the processor stores which beats have already been executed
 After exception handling, if the program returns to this location, the hardware already
knows which beats should not be re-executed
Luca Benini T12: Specializing Processors for ML 46 of 68
© 2023 IEEE International Solid-State Circuits Conference
ARM MVE Memory access

 LD/ST vector instructions

(contiguous), and more
 Vector Scatter-Gather Load/Store
instructions
 Vector Deinterleaving/Interleaving
Load/Store
 Contiguous vector accesses can be
signiﬁcantly faster than scatter-
gather
 A gather-load operation must perform
a separate access for each distinct
element. So, a gather-load of 32-bit
wide data may result in four memory
accesses and there could be 16
separate accesses for a load of 8-bit
data.

Luca Benini T12: Specializing Processors for ML 47 of 68

© 2023 IEEE International Solid-State Circuits Conference
ARM MVE vs. Arm-v7 SIMD, NEON
 Helium vs. Cortex-M (Arm-v7) SIMD (e.g. Cortex-M33)

 Helium vs. Neon (A processors)

Luca Benini T12: Specializing Processors for ML 48 of 68

© 2023 IEEE International Solid-State Circuits Conference
First ARM MVE Core: Cortex-M55
 Over 150 new instructions (>130 vector instructions)
 Can decode 2 instr/cycle, 4-stage in-order scalar pipeline (15% faster
than M4): 4.2 CoreMark/MHz (25%>M4, 20%<M7)
 Can be coupled with Ethos U55 ML accelerator

Next: M85
M85>M55
20% more “AI throughput”
M85>M7
30% more “GP throughput”

[wikichip]

Luca Benini T12: Specializing Processors for ML 49 of 68

 Performance relative to Cortex-M4

 Major improvements for Q7, FP16 (new datatypes in HW)
 ML benchmark (KWS): MFCC, DNN (2 conv, 3 FC layers),
8-bit (w, act), 80-500KB, accuracy 90%-95%

Luca Benini T12: Specializing Processors for ML 50 of 68

© 2023 IEEE International Solid-State Circuits Conference
Next: Sub-byte precision
Sub-byte Quantization Results
Quantizazion of a
MobilenetV1_224_1.0 [A. Capotondi et al TCIRCSYS II 20]

2b 5b
3b
4b 8b

Mixed-precision approach key to meet

the memory constraints of tiny devices

QNNs are a natural target for execution low-cost edge platforms

Luca Benini T12: Specializing Processors for ML 51 of 68

© 2023 IEEE International Solid-State Circuits Conference
Next: Sub-byte precision
Sub-byte Quantization Results
Quantizazion of a
MobilenetV1_224_1.0 [A. Capotondi et al TCIRCSYS II 20]

2b 5b
3b
4b 8b

Mixed-precision approach key to meet

the memory constraints of tiny devices

QNNs are a natural target for execution low-cost edge platforms

Luca Benini T12: Specializing Processors for ML 51 of 68

© 2023 IEEE International Solid-State Circuits Conference
Next: Sub-byte precision
Sub-byte Quantization Results
Quantizazion of a
MobilenetV1_224_1.0 [A. Capotondi et al TCIRCSYS II 20]

0.8% 4x
4.4% 7x
2b 5b
2.9% 8x
3b
4b 8b

Mixed-precision approach key to meet

the memory constraints of tiny devices

QNNs are a natural target for execution low-cost edge platforms

Luca Benini T12: Specializing Processors for ML 51 of 68

© 2023 IEEE International Solid-State Circuits Conference
Sub-byte operands manipulation
32-bit data load with post increment (one cycle)

8-bit Vector

To MAC units Huge overhead!

Luca Benini T12: Specializing Processors for ML 52 of 68

© 2023 IEEE International Solid-State Circuits Conference
Mixed Precision SIMD Processor
 Can support all variants: [Ottavi et al. ISVLSI20]

 16x16, 16x8, 16x4, 16x2 Operand A Operand B

 8x8, 8x4, 8x2
 4x4, 4x2
 2x2 SIGN EXTENSION

 Avoids Pack/unpack Overheads

 Maximized performance (SIMD)
 Maximizes RF use (Data Locality) DOT-PRODUCT MODULE

Result

Luca Benini T12: Specializing Processors for ML 53 of 68

© 2023 IEEE International Solid-State Circuits Conference
119
Mixed Precision SIMD Processor
 Can support all variants: [Ottavi et al. ISVLSI20]

 16x16, 16x8, 16x4, 16x2 Operand A Operand B

 8x8, 8x4, 8x2
 4x4, 4x2
 2x2 SIGN EXTENSION

 Avoids Pack/unpack Overheads

 Maximized performance (SIMD)
 Maximizes RF use (Data Locality) DOT-PRODUCT MODULE

Result

Luca Benini T12: Specializing Processors for ML 53 of 68

© 2023 IEEE International Solid-State Circuits Conference
120
Mixed Precision SIMD Processor
 Can support all variants: [Ottavi et al. ISVLSI20]

 16x16, 16x8, 16x4, 16x2 Operand A Operand B

 8x8, 8x4, 8x2
 4x4, 4x2
 2x2 SIGN EXTENSION

 Avoids Pack/unpack Overheads

 Maximized performance (SIMD)
 Maximizes RF use (Data Locality) DOT-PRODUCT MODULE

Result
How to encode all these instructions?

Luca Benini T12: Specializing Processors for ML 53 of 68

Luca Benini T12: Specializing Processors for ML 54 of 68

© 2023 IEEE International Solid-State Circuits Conference
122
Mixed-Precision Core: ISA Explosion!
dotp variants
add variants
sub variants
avg variants
shift variants
max variants
min variants
abs variants
…

> 500 instructions

Luca Benini T12: Specializing Processors for ML 54 of 68

© 2023 IEEE International Solid-State Circuits Conference
123
Virtual SIMD Instructions
 Encode operation as a virtual
SIMD in the ISA (e.g. pv.dotsp.v pv.sdotsp .v
sdotsp.v) pv.dotsp.sc .v pv.sdotsp.sc .v
pv.dotsp.sci .v pv.sdotsp.sci .v
 Format specified at runtime pv.dotup .v pv.sdotup .v
by a Control Register (e.g. pv.dotup.sc .v pv.sdotup.sc .v
4x4) pv.dotup.sci .v pv.sdotup.sci .v
pv.dotusp .v pv.sdotusp .v
 18018 Instructions needed pv.dotusp.sc .v pv.sdotusp.sc .v
for SIMD DOTP pv.dotusp.sci .v pv.sdotusp.sci .v
 Potential to avoid code
replication for different
formats SCALAR MAC3
DECODER INSTR 2 MULT/ALU SCALAR
 Tiny Overhead on QNN for VIRTUAL SIMD INSTR SDOTP.v
Switching format SIMD MIX8x4 MULT/ALU
CSR FORMAT SIMD
 Format switch not frequent in SDOTP.M8x4
DNN, e.g. every layer.

Luca Benini T12: Specializing Processors for ML 55 of 68

Luca Benini T12: Specializing Processors for ML 56 of 68

 Goal
 HW support for
mixed-precision SIMD
instructions;
 Challenge
 Enormous number of
instructions to be
encoded in the ISA;
 Solution
 Status-based
execution.

Luca Benini T12: Specializing Processors for ML 56 of 68

© 2023 IEEE International Solid-State Circuits Conference
Extended Dot-Product Unit
OpC OpA OpB Multi-Precision
(32b scalar) (32b SIMD vector) (32b SIMD vector) Integer Dotp-Unit
SLICER AND ROUTER

32b 32b

Output Result Mux

Dotp Result(32b scalar)
Luca Benini T12: Specializing Processors for ML 57 of 68
© 2023 IEEE International Solid-State Circuits Conference
Extended Dot-Product Unit
OpC OpA OpB Multi-Precision
(32b scalar) (32b SIMD vector) (32b SIMD vector) Integer Dotp-Unit
SLICER AND ROUTER

32b 32b 32b 32b

Output Result Mux

32b 32b 32b 32b

Output Result Mux

Dotp Result(32b scalar)
Luca Benini T12: Specializing Processors for ML 57 of 68
© 2023 IEEE International Solid-State Circuits Conference
Outline
Outline

 Motivations for ML on MCU processors (IoT, TinyML)

 Classical instruction set architectures (ISAs) limitations for ML
 ISA Extensions for ML
 Micro-architecture of ML-specialized cores
 PPA (power performance area) and Energy analyisis
 Advanced Micro-architectural optimizations
 Supporting ggressive Quantization
 From single to multiple cores: parallel ultra-low power

Luca Benini T12: Specializing Processors for ML 58 of 68

© 2023 IEEE International Solid-State Circuits Conference
ML & Parallel, Near-threshold
Optimum [Rossi et al. IEEE Micro17]
point

Luca Benini T12: Specializing Processors for ML 59 of 68

59
© 2023 IEEE International Solid-State Circuits Conference
ML & Parallel, Near-threshold
Optimum [Rossi et al. IEEE Micro17]
point
 As VDD decreases,
operating speed
decreases
 However efficiency
increases more
work done per Joule
 Until leakage effects
start to dominate

Luca Benini T12: Specializing Processors for ML 59 of 68

59
© 2023 IEEE International Solid-State Circuits Conference
ML & Parallel, Near-threshold
Optimum [Rossi et al. IEEE Micro17]
point
Better to have N× PEs  As VDD decreases,
running at optimum Energy operating speed
than 1 PE running fast at decreases
low Energy efficiency
 However efficiency
increases more
work done per Joule
 Until leakage effects
start to dominate

Luca Benini T12: Specializing Processors for ML 59 of 68

ML is massively parallel and scales well

(P/S  with NN size)
Luca Benini T12: Specializing Processors for ML 58 of 68
59
© 2023 IEEE International Solid-State Circuits Conference
Multiple RV Cores (1-16)

RISC-V
core

Luca Benini T12: Specializing Processors for ML 60 of 68

RISC-V RISC-V RISC-V RISC-V

core core core core

Luca Benini T12: Specializing Processors for ML 60 of 68

RISC-V RISC-V RISC-V RISC-V

core core core core

CLUSTER
Luca Benini T12: Specializing Processors for ML 60 of 68
60
© 2023 IEEE International Solid-State Circuits Conference
Low-Latency Shared TCDM
Tightly Coupled Data Memory BF=2

Mem4 Mem5 Mem6 Mem7

Mem0 Mem1 Mem2 Mem3

Logarithmic Interconnect

RISC-V RISC-V RISC-V RISC-V

core core core core

CLUSTER
Luca Benini T12: Specializing Processors for ML 61 of 68
61
© 2023 IEEE International Solid-State Circuits Conference
Low-Latency Shared TCDM
Tightly Coupled Data Memory BF=2

Mem4 Mem5 Mem6 Mem7

Mem0 Mem1 Mem2 Mem3

Logarithmic Interconnect

RISC-V RISC-V RISC-V RISC-V

core core core core

CLUSTER
Luca Benini T12: Specializing Processors for ML 61 of 68
61
© 2023 IEEE International Solid-State Circuits Conference
Low-Latency Shared TCDM
Tightly Coupled Data Memory BF=2

Mem4 Mem5 Mem6 Mem7

Mem0 Mem1 Mem2 Mem3

Logarithmic Interconnect

RISC-V RISC-V RISC-V RISC-V

core core core core

CLUSTER
Luca Benini T12: Specializing Processors for ML 61 of 68
61
© 2023 IEEE International Solid-State Circuits Conference
DMA + Fast synchro
Tightly Coupled Data Memory BF=2