PULP: A Parallel Ultra Low Power platform
for next generation IoT Applications
Davide Rossi1
Francesco Conti1, Andrea Marongiu1,2, Antonio Pullini2, Igor Loi1, Michael Gautschi2,
Giuseppe Tagliavini1, Alessandro Capotondi1, Philippe Flatresse3, Luca Benini1,2
1DEI-UNIBO, 2IIS-ETHZ, 3STMicroelectronics
How efficient do we need to be?
10^12 op/J = 1 pJ/op = 1 GOPS/mW [RuchIBM11]
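The three figures above are the same target expressed in different units:

\[
10^{12}\ \mathrm{op/J}
 \;=\; 1\ \mathrm{op/pJ}
 \;=\; \frac{10^{9}\ \mathrm{op/s}}{10^{-3}\ \mathrm{W}}
 \;=\; 1\ \mathrm{GOPS/mW}
\]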
System View
Sense → Analyze and Classify → Transmit
Sense: MEMS IMU, MEMS microphone, ULP imager, EMG/ECG/EIT front-ends; low-rate (periodic) data
Analyze and classify: controller (e.g. Cortex-M, ~25 MOPS) with L2 memory and IOs; application demand spans 1 to 12000 MOPS within a 1-10 mW budget
Transmit: short-range, high-BW link for data; long-range, low-BW link for SW updates and commands; 100 µW - 2 mW
Battery + harvesting powered, a few mW power envelope; Idle: ~1 µW, Active: ~50 mW
Near-Sensor Processing
Application (kernel)                 Input bandwidth  Computational demand  Output bandwidth  Compression factor
Image (tracking) [Lagorce2014]       80 Kbps          1.34 GOPS             0.16 Kbps         500x
Voice/Sound (speech) [VoiceControl]  256 Kbps         100 MOPS              0.02 Kbps         12800x
Inertial (Kalman) [Nilsson2014]      2.4 Kbps         7.7 MOPS              0.02 Kbps         120x
Biometrics (SVM) [Benatti2014]       16 Kbps          150 MOPS              0.08 Kbps         200x
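The compression factor in the last column is simply the ratio of input to output bandwidth; for the tracking row, for example:

\[
\mathrm{compression} \;=\; \frac{\mathrm{input\ BW}}{\mathrm{output\ BW}}
 \;=\; \frac{80\ \mathrm{Kbps}}{0.16\ \mathrm{Kbps}} \;=\; 500\times
\]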
Extremely compact output (single index, alarm, signature)
The computational power of ULP controllers is not enough
These are parallel workloads
PULP:
pJ/op Parallel ULP computing
pJ/op is traditionally the target of ASIC + Controllers
Scalable: to many-core + heterogeneity
Best-in-class LP silicon technology
Programmable: OpenMP, OpenCL, OpenVX
Open: software & HW (processor & hardware IPs, compiler infrastructure, virtualization layer, programming model)
From ULP computing to parallel + heterogeneous ULP computing
1 mW - 10 mW active power
Near-Threshold Multiprocessing
Minimum Energy Operation
[Plot: energy/cycle (nJ) vs. logic Vcc / memory Vcc (V), 32nm CMOS, 25°C; total, dynamic and leakage energy per cycle. The minimum-energy point near threshold gives ~4.7x lower energy/cycle than nominal voltage.] [VivekDeDATE13]
Near-Threshold Computing (NTC):
1. Don't waste energy pushing devices into strong inversion
2. Recover performance with parallel execution
3. Aggressively manage idle power (switching, leakage)
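A first-order model (a textbook approximation, not taken from the plot) of why a minimum-energy point appears near threshold: dynamic energy per cycle falls quadratically with VDD, while the cycle time, and hence the leakage energy integrated over a cycle, grows steeply as VDD approaches Vth, so their sum is U-shaped.

\[
E_{\mathrm{cycle}}(V_{DD}) \;=\;
 \underbrace{C_{\mathrm{eff}}\,V_{DD}^{2}}_{\text{dynamic}}
 \;+\;
 \underbrace{I_{\mathrm{leak}}(V_{DD})\;V_{DD}\;T_{\mathrm{cycle}}(V_{DD})}_{\text{leakage}}
\]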
The best Processor
Single-issue, in-order cores are the most energy efficient [AziziISCA10]
Put more than one core, plus shared memory, to fill the cluster area
Building PULP
SIMD + MIMD + sequential execution on a cluster of 2..16 cores (PE0 .. PEN-1)
Private per-core instruction caches (I$0 .. I$N-1) feeding 4-stage, in-order OpenRISC cores
Single-cycle, shared, multi-banked L1 data memory (TCDM, banks MB0 .. MBM-1) behind a low-latency interconnect, with a demux towards peripherals and external memory
GPU-like shared memory: low-overhead data sharing
Tightly coupled DMA for double buffering (see the sketch below)
Near-threshold but parallel: maximum energy efficiency when active, plus strong power management for (partial) idleness
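As a rough illustration of the double-buffering pattern the cluster is built around, the sketch below overlaps the DMA fetch of the next tile with computation on the current one. dma_job_t, dma_copy_async(), dma_wait() and compute_tile() are hypothetical placeholders, not the actual PULP runtime API.

#include <stddef.h>

/* Hypothetical DMA primitives standing in for the PULP runtime API. */
typedef int dma_job_t;
dma_job_t dma_copy_async(void *dst, const void *src, size_t bytes);
void      dma_wait(dma_job_t job);
void      compute_tile(int *tile);      /* user kernel, works out of L1 TCDM */

#define TILE_WORDS 256
static int l1_buf[2][TILE_WORDS];       /* two tiles resident in L1 TCDM */

void process_tiles(const int *l2_src, int *l2_dst, int n_tiles)
{
    int cur = 0;
    dma_job_t in = dma_copy_async(l1_buf[cur], l2_src, sizeof l1_buf[0]);

    for (int t = 0; t < n_tiles; t++) {
        int nxt = cur ^ 1;
        dma_job_t in_next = 0;

        /* Overlap: start fetching tile t+1 while tile t is processed. */
        if (t + 1 < n_tiles)
            in_next = dma_copy_async(l1_buf[nxt],
                                     l2_src + (size_t)(t + 1) * TILE_WORDS,
                                     sizeof l1_buf[0]);

        dma_wait(in);                    /* tile t has landed in L1       */
        compute_tile(l1_buf[cur]);       /* compute entirely out of L1    */
        dma_wait(dma_copy_async(l2_dst + (size_t)t * TILE_WORDS,
                                l1_buf[cur], sizeof l1_buf[0]));

        in  = in_next;
        cur = nxt;
    }
}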
OR1ON: Extended OpenRISC Core
4-stage, in-order OpenRISC pipeline, IPC ~ 1
DSP extensions:
  Hardware loops: eliminate branching overhead
  LD/ST with post-increment: enhanced vector indexing
  Small vector support (SIMD): 2x 16-bit or 4x 8-bit operations
  Unaligned memory accesses: to better exploit SIMD
Up to 5x performance improvement and 3x reduction of energy!
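As an illustration, a kernel like the 16-bit dot product below is what these extensions target: on the extended core the toolchain can map it onto a zero-overhead hardware loop, post-incrementing loads and 2x 16-bit SIMD multiply-accumulates, while on a plain core it compiles to ordinary scalar code. The code itself is plain C; the 5x/3x figures above are the slide's, not this example's.

#include <stdint.h>

/* 16-bit dot product: the kind of inner loop the DSP extensions target. */
int32_t dot16(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)               /* candidate hardware loop     */
        acc += (int32_t)a[i] * (int32_t)b[i]; /* candidate 2x16-bit SIMD MAC */
    return acc;
}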
Silicon Implementation
Technology For ULP
Low VDDMIN
Low cost
High ION/IOFF at low VDD
Knobs for variability management
Knobs for power management
UTBB FD-SOI provides good features for ULP design:
Good behavior at low voltage
Body bias for power and variability management
Body biasing with
UTBB FD-SOI technology
RVT transistors (conventional well) and LVT transistors (flip well) each have a body-bias window limited to about Vdd/2 + 300 mV
RVT: regular threshold voltage; LVT: low threshold voltage
FBB: forward body bias; RBB: reverse body bias
Poly biasing allows trading performance vs. leakage at design time
RVT transistors: low leakage + flexible power management (FBB + RBB)
Near Threshold + Body Biasing
Combined
FBB vs. frequency: +2.5x @ 0.5V; RBB vs. leakage: -10x @ 0.5V (RVT transistors)
State retentive (no need for dedicated state-retentive registers and memories)
Ultra-fast transitions (tens of ns, depending on the n-well area to bias)
Low area overhead for isolation (3 µm spacing for deep n-well isolation)
Thin grids for voltage distribution (small transient current for well polarization)
Simple circuits for on-chip VBB generation (e.g. charge pump)
But even with aggressive RBB, leakage is not zero!
Body Biasing for
Variability Management
Process variation: frequency spread from 25 MHz down to 7 MHz (3σ) @ 0.5V
Temperature and thermal inversion: ~100x leakage variation @ 0.5V between -40°C and 120°C (RVT transistors)
FBB/RBB: frequency and leakage compensation with body-bias values from ±0.2V up to the full -1.8V to +0.5V range
ULP memory implementation:
latch-based SCM
Standard 6T SRAMs: high VDDMIN, a bottleneck for energy efficiency
Near-threshold SRAMs (8T): lower VDDMIN, but 25%-50% area/timing overhead, high active energy, low technology portability
Standard cell memories (SCM): wide supply voltage range, 2x-4x lower read/write energy (256x32 6T SRAM vs. SCM), easy technology portability; controlled P&R mitigates the area overhead
Architectural Technology Awareness
Exploiting body biasing
The cluster is partitioned into separate clock-gating and body-bias regions
Body bias multiplexers (BBMUXes) control the well voltages of each region
Each region can be active (FBB) or idle (deep RBB: low leakage!)
State-Retentive + Low Leakage + Fast transitions
Power Management:
Hardware Synchronization
Core shut-down sequence:
1) Disable instruction fetch
2) Wait for outstanding transactions
3) Clock gating
4) Reverse body biasing
Private, per-core port into the HW synchronizer (PE0..PE3): single-cycle latency, no contention
Barrier cost reduced by 15x
GOALS:
Reduce parallelization overhead
Accelerate common OpenMP and OpenCL patterns, e.g. task creation (see the sketch below)
Automatically manage shut-down of idle cores
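A minimal sketch of the kind of barrier-heavy, fine-grained OpenMP pattern the hardware synchronizer is meant to make cheap. This is standard OpenMP; mapping the implicit barriers onto the HW barrier, and gating cores that arrive early, is assumed to be done by the runtime.

#include <omp.h>

/* Each relaxation step ends in the implicit barrier of the omp-for,
 * which the runtime can implement on the single-cycle HW barrier
 * instead of a software spin-wait.                                  */
void relax(float *a, float *b, int n, int steps)
{
    #pragma omp parallel num_threads(4)
    for (int s = 0; s < steps; s++) {
        #pragma omp for                  /* implicit barrier at loop end */
        for (int i = 1; i < n - 1; i++)
            b[i] = 0.5f * (a[i - 1] + a[i + 1]);

        #pragma omp single               /* one thread swaps the buffers */
        { float *t = a; a = b; b = t; }  /* implicit barrier here too    */
    }
}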
Power Management:
External Events
Programming sequence (see the sketch below):
1) Set the event mask
2) Program the transfer
3) Trigger the transfer
4) Shut down the cores
48 maskable events: general purpose, DMA, timers, peripherals (SPI, I2C, GPIO, ...)
The event unit sits next to the HW synchronizer and the cores (PE0..PE3), and is reachable from the peripheral interconnect (SPI, L2 memory)
GOAL: automatically manage shut-down of cores during data transfers
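A minimal sketch of this programming sequence from a core's point of view. eu_set_mask(), eu_wait_event(), dma_ext_to_l2(), dma_start() and the EVT_DMA_DONE bit are hypothetical names standing in for the real event-unit and DMA interface.

#include <stdint.h>

/* Hypothetical event-unit / DMA driver calls, for illustration only. */
void eu_set_mask(uint32_t mask);
void eu_wait_event(void);                 /* clock-gates the core until an
                                             unmasked event arrives        */
void dma_ext_to_l2(void *l2_dst, uint32_t ext_addr, uint32_t bytes);
void dma_start(void);

#define EVT_DMA_DONE (1u << 0)            /* illustrative event bit */

void fetch_sensor_frame(void *l2_dst, uint32_t spi_addr, uint32_t bytes)
{
    eu_set_mask(EVT_DMA_DONE);               /* 1) set the event mask         */
    dma_ext_to_l2(l2_dst, spi_addr, bytes);  /* 2) program the transfer       */
    dma_start();                             /* 3) trigger it                 */
    eu_wait_event();                         /* 4) core sleeps until DMA done */
}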
Heterogeneous Memory Architecture
+ Management
Shared I$ (banks I$B0 .. I$Bk) to recover the SCM area overhead; private L0 buffers reduce the pressure on the shared I$
Heterogeneous L1 TCDM: interleaved SCM banks (SCM0 .. SCMM-1) plus private SRAM banks (SRAM0 .. SRAMM-1)
Reconfigurable pipeline stages to cope with SRAM degradation at low VDD
MMU (logical-to-physical address map): 1) interleaved/private addresses (see the sketch below), 2) shutdown of SRAM banks
SCM on the I$ and part of the TCDM to widen the operating range
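As an illustration of the interleaved mapping, the helpers below spread consecutive 32-bit words across TCDM banks, so cores streaming through an array mostly hit different banks. The bank count and bit layout are made-up examples, not the actual PULP memory map.

#include <stdint.h>

#define NUM_BANKS 8                        /* illustrative, not the real value */

/* Word-level interleaving: consecutive words land in consecutive banks. */
static inline uint32_t bank_of(uint32_t byte_addr)
{
    return (byte_addr >> 2) % NUM_BANKS;   /* which bank holds this word   */
}

static inline uint32_t word_in_bank(uint32_t byte_addr)
{
    return (byte_addr >> 2) / NUM_BANKS;   /* word offset within that bank */
}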
The PULP Family
PULPv1
CHIP FEATURES (tester chip)
Technology:      28nm FD-SOI (RVT)
Chip area:       3 mm2
# Cores:         4x OpenRISC
I$:              4x 1 kB (private)
TCDM:            16 kB
L2:              16 kB
BB regions:      yes
VDD range:       0.45V - 1.2V
VBB range:       -1.8V - +0.9V
Perf. range:     1 MOPS - 1.9 GOPS
Power range:     100 µW - 127 mW*
Peak efficiency: 60 GOPS/W @ 0.5V*
*Does not include IOs
Measured Results
Maximum energy efficiency: 60 GOPS/W @ 0.5V + 0.5V FBB
1.8 GOPS peak; ~10 mW @ 100 MHz, 0.75V
Low leakage (< 2%)
Wide operating range
Peak GOPS/W competitive with best-in-class near-threshold (16-bit) ULP microcontrollers, plus more than 100x peak GOPS!
PULPv2
= PULPv1 + 2 DVFS regions (SoC + CLUSTER) + Event Unit + Peripherals
PULPv3
= PULPv2 + Extended cores + HW Synch + Shared Cache + HWCE + Shared IOs
PULPs Summary
Feature               PULPv1                     PULPv2                     PULPv3
# of cores            4                          4                          4
L2 memory             16 kB                      64 kB                      128 kB
TCDM                  16 kB SRAM                 32 kB SRAM + 8 kB SCM      32 kB SRAM + 16 kB SCM
Reconf. pipe stages   no                         yes                        yes
I$                    4 kB SRAM, private         4 kB SCM, private          4 kB SCM, shared
Body bias regions     yes                        yes                        yes
DVFS                  no                         yes                        yes
I/O connectivity      JTAG                       full                       full, multiplexed
Extended processor    no                         no                         yes
Event unit            no                         yes                        yes + HW synchro
Debug unit            no                         no                         yes
Status                silicon proven             post tape-out              pre tape-out
Technology            FD-SOI 28nm, conv. well    FD-SOI 28nm, flip well     FD-SOI 28nm, conv. well
Voltage range         0.45V - 1.2V               0.3V - 1.2V                0.5V - 0.7V
BB range              -1.8V - +0.9V              0.0V - +1.8V               -1.8V - +0.9V
Max freq.             475 MHz                    1 GHz                      200 MHz
Max perf.*            1.9 GOPS                   4 GOPS                     1.8 GOPS
Peak en. eff.*        60 GOPS/W                  135 GOPS/W                 385 GOPS/W
*equivalent 32-bit RISC operations
Breaking the GOPS/mW wall
Recovering more silicon efficiency [GOPS/W]: general-purpose computing (CPU), throughput computing (GPGPU) and ULP parallel computing are SW and mixed solutions around the 1 GOPS/mW wall; hard-wired HW IPs sit more than 100x beyond it, the accelerator gap
Closing the accelerator efficiency gap with agile customization
Fractal Heterogeneity
Fixed-function accelerators have limited reuse: how to limit their proliferation?
Learn to Accelerate
Brain-inspired systems (deep convolutional networks) are high performers in many tasks over many domains
Image recognition [RussakovskyIMAGENET2014]: CNN 93.4% accuracy vs. human 85% (untrained) / 94.9% (trained)
Speech recognition [HannunARXIV2014]
Flexible acceleration: learned CNN weights are the program
PULP CNN Performance
Average performance and energy efficiency on a 32x16 CNN frame
Performance: 8 GOPS (61x); Energy efficiency: 6500 GOPS/W (47x)
PULPv3 architecture; corner: tt28, 25°C, VDD = 0.5V, FBB = 0.5V
Thanks for your attention!!!
www-micrel.deis.unibo.it/pulp-project
References
[RuchIBM11] P. Ruch et al., "Toward five-dimensional scaling: How density improves efficiency in future computers," IBM Journal of Research and Development, vol. 55, no. 5, pp. 1-13, 2011.
[AziziISCA10] O. Azizi et al., "Energy-Performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis," Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA 2010), pp. 26-36, June 19-23, 2010.
[Nilsson2014] J.-O. Nilsson et al., "Foot-mounted inertial navigation made easy," 2014 International Conference on Indoor Positioning and Indoor Navigation, 27-30 October 2014.
[Benatti2014] S. Benatti et al., "EMG-based hand gesture recognition with flexible analog front end," IEEE Biomedical Circuits and Systems Conference (BioCAS), pp. 57-60, Oct. 2014.
[Lagorce2014] Lagorce et al., "Asynchronous Event-Based Multikernel Algorithm for High-Speed Visual Features Tracking," IEEE Trans. Neural Netw. Learn. Syst., Sep. 2014.
[VoiceControl] TrulyHandsfree Voice Control, available: http://www.sensory.com/wpcontent/uploads/80-0342-A.pdf
[VivekDeDATE13] V. De, "Near-Threshold Voltage design in nanoscale CMOS," Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013.
[DoganICSDPTMO2011] A. Y. Dogan et al., "Power/performance exploration of single-core and multi-core processor approaches for biomedical signal processing," Integrated Circuit and System Design, Power and Timing Modeling, Optimization, and Simulation, pp. 102-111, 2011.
[RussakovskyIMAGENET2014] O. Russakovsky et al., "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision, 2014.
[HannunARXIV2014] A. Hannun et al., "Deep Speech: Scaling up end-to-end speech recognition," arXiv, 2014.
How Big is the IoT?
How much energy to process (1 op. per Byte) one BB?
Microcontrollers Landscape
*not exhaustive
Parallel NTC
[Plot: core power (mW) vs. workload (MOPS); sub-Vth operation covers low workloads, near-Vth operation covers high workloads. [DoganICSDPTMO2011]]

Target workload   1-core energy efficiency   4-core energy efficiency   Ratio
[MOPS]            (ideal) [MOPS/mW]          (ideal) [MOPS/mW]
100               43                         55                         1.3x
200               33                         50                         1.5x
400               18                         43                         2.4x
*Measured on our first prototype
Parallel NTC + Race to Halt
[Power vs. time for a low, duty-cycled workload: single-core @ max frequency (e.g. 200 MHz) vs. multi-core @ max frequency. Core power and system power are drawn over the active period; the multi-core run ideally spends the same core energy as the single-core one, but finishes its active period sooner and enters deep sleep, saving the system energy of the remaining time.]
Going faster allows integrating the system power over a shorter period
The main constraint here is the power envelope
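A first-order sketch of this argument, with illustrative symbols rather than figures from the slide: for a fixed workload W and N cores with ideal speed-up, the compute energy stays roughly constant while the system power is integrated over an active period that shrinks as 1/N, and the rest of the period is spent in deep sleep.

\[
E \;\approx\;
 P_{\mathrm{cores}}\,t_{\mathrm{active}}
 \;+\; P_{\mathrm{sys}}\,t_{\mathrm{active}}
 \;+\; P_{\mathrm{sleep}}\,(T - t_{\mathrm{active}}),
\qquad
t_{\mathrm{active}} \;\approx\; \frac{W}{N\cdot \mathrm{perf}_{\mathrm{core}}}
\]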
Back to SRAMs
Voltage limit of SRAMs: performance rapidly degrades at low voltage (2x @ 0.5V), and SRAM VDDMIN is higher than that of logic (and SCM)